An Awesome List for getting started with web archiving
Updated Nov 6, 2024
Wayback Machine API interface & a command-line tool
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, with or without a head
Parse and create Web ARChive (WARC) files with Node.js
A list of things related to software, literature, and other content for 🕣 Memento
A dockerized, queued, high-fidelity web archiver based on Squidwarc
Various Jupyter notebooks about Common Crawl data
Quick Cache and Archive search buttons
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
A social media open post web archiving tool
Digital Preservation of HTTP in documentary heritage.
Awesome list dedicated to digital and data preservation tools, sources, services and so on.
Decentralized web archiving
Seeder - Czech webarchive curating tool and public site
JavaScript for fighting link rot and content drift using link decoration and web archives.
🗄 File-Based Reference Filing System.
A tool for detecting viruses and NSFW material in WARC files
Parser for WARC (aka WebArchive) files