awesome-web-archiving
github.com/iipc/awesome-web-archiving
An Awesome List for getting started with web archiving.
What's inside
Tools & Software
- ArchiveBox [Acquisition]
A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly
- archivenow [Acquisition]
A
- ArchiveSpark [Analysis]
An Apache Spark framework for web archives (and other data) that enables easy data processing, extraction, and derivation.
- Archives Research Compute Hub [Analysis]
Web application for distributed compute analysis of Archive-It web archive collections.
- Archives Unleashed Notebooks [Analysis]
Notebooks for working with web archives using the Archives Unleashed Toolkit and the derivatives it generates.
- Archives Unleashed Toolkit [Analysis]
Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark.
Web Archiving Service Providers
- Archive-It [Hosted, Closed Source]
From the Internet Archive.
- Arkiwera [Hosted, Closed Source]
- Browsertrix [Self-hostable, Open Source]
From
- Conifer [Self-hostable, Open Source]
From
- Hanzo [Hosted, Closed Source]
- MirrorWeb [Hosted, Closed Source]
Resources for Web Publishers
- Archive Ready
- Definition of Web Archivability
This describes the ease with which web content can be preserved. (
Community Resources
- Archivers Slack [Slack]
- Archives Unleashed Slack [Slack]
- Awesome Memento [Other Awesome Lists]
- @commoncrawl [Twitter]
Official Common Crawl Foundation handle.
- Common Crawl [Mailing Lists]
- Common Crawl Foundation [Discord]
Training/Documentation
- Archives Unleashed Toolkit documentation [For Researchers using Web Archives]
- A Whirlwind Tour of Common Crawl's Datasets as a Python notebook [Training Materials]
- A Whirlwind Tour of Common Crawl's Datasets using Java [Training Materials]
- A Whirlwind Tour of Common Crawl's Datasets using Python [Training Materials]
- Continuing Education to Advance Web Archiving (CEDWARC) [Training Materials]
- GLAM Workbench: Web Archives [For Researchers using Web Archives]
See also
Public Data
- Common Crawl CDX API
- Common Crawl files
WARCs, CDX files, Parquet URL index, Parquet host index, etc.
- End of Term Archive
WARCs, CDX files, Parquet URL index
- Internet Archive Wayback
- UK Government Web Archive
Wayback
- Webrecorder US GovArchive
high-fidelity replay
Showing a sample of 185 resources. The full list is on GitHub.