Difference between revisions of "The WARC Ecosystem"

From Archiveteam
(Add a summary of what WARC is)
(→‎Tools: Add CDXJ Indexer)
 
(18 intermediate revisions by 7 users not shown)
Line 2: Line 2:


WARC is a file format for accurately storing Web traffic.
WARC is a file format for accurately storing Web traffic.
== Viewing WARCs ==
If you just want to view Archiveteam WARCs, then you should be able to load up a WARC viewer such as [https://replayweb.page ReplayWeb.page] with the WARC file.
There is an exception: if the WARC file ends in .warc.zst, you will need to decompress it with zstd first. If it says "Dictionary mismatch" or a similar error message, try [https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat this Python script].
If you need help, contact us in the project channel, or if no such channel exists, {{IRC|archiveteam-bs}}.


== Information ==
== Information ==
* [[wikipedia:Web_ARChive]]
* [[wikipedia:Web_ARChive]]
* {{URL|https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817}} - Contains examples of WARC records
* {{URL|1=https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817}} - Contains examples of WARC records
* {{URL|http://bibnum.bnf.fr/WARC/|The WARC File Format (ISO 28500) - Information, Maintenance, Drafts}}
* {{URL|http://bibnum.bnf.fr/WARC/|The WARC File Format (ISO 28500) - Information, Maintenance, Drafts}}
* {{URL|http://archive-access.sourceforge.net/warc/}} - WARC ISO docs
* {{URL|http://archive-access.sourceforge.net/warc/}} - WARC ISO docs
Line 27: Line 35:
! Author count
! Author count
! Description
! Description
! Recommended
|-
|-
| [https://www.gnu.org/software/wget/ wget v1.14+]
| [https://www.gnu.org/software/wget/ wget v1.14+]
Line 35: Line 44:
| A non-interactive network downloader. wget also generates duplicate record ids in warc files.  
| A non-interactive network downloader. wget also generates duplicate record ids in warc files.  
More information about flags can be found on the [[Wget with WARC output]] page.
More information about flags can be found on the [[Wget with WARC output]] page.
| style="background-color: #ff9999" | No. Since version 1.20, wget writes WARCs with angle brackets around URIs. The WARC/1.0 grammar in the specification technically requires these brackets, but the examples given there contradict this. No other software is known to do this, and many WARC readers are unable to handle the brackets.
The unofficial Windows builds at https://eternallybored.org/misc/wget/ have bugs in at least the WARC-writing part that appears to cause them to truncate non-ASCII data. They are best avoided entirely. Consider using the Windows Subsystem for Linux (WSL) instead.
|-
| [https://github.com/ArchiveTeam/wget-lua wget-at]
| GPL v3+ || C, Lua
| See wget
| ?
| 1
| wget with various additions that make it suitable for ArchiveTeam use. Lua hooks for controlling many aspects of the crawl. Used for [[DPoS]] projects.
| style="background-color: #99ff99" | Yes
|-
|-
| InternetArchive's [https://github.com/internetarchive/warc warc python library]
| InternetArchive's [https://github.com/internetarchive/warc warc python library]
| GPL v2 || Python 2
| GPL v2 || Python 2
| looks to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
| [https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py looks to have a test suite]
| README with examples online at https://warc.readthedocs.io/en/latest/
| [https://warc.readthedocs.io/en/latest/ README with examples]
| 3 commiters on github
| 3 commiters on github
| library to work with WARC files
| library to work with WARC files
| style="background-color: #ff9999" | No. Obsolete as Python 2 is EOL.
|-
|-
| [https://github.com/odie5533/WarcMiddleware WarcMiddleware]
| [https://github.com/odie5533/WarcMiddleware WarcMiddleware]
Line 49: Line 70:
| 1 author
| 1 author
| Mirrors websites and saves the results to a WARC file
| Mirrors websites and saves the results to a WARC file
| style="background-color: #ff9999" | No. Does not correctly preserve the exact traffic as sent by the server.
|-
|-
| [https://github.com/odie5533/WarcProxy WarcProxy]
| [https://github.com/odie5533/WarcProxy WarcProxy]
Line 56: Line 78:
| 1 author
| 1 author
| a simple HTTP proxy that saves all HTTP traffic to a file
| a simple HTTP proxy that saves all HTTP traffic to a file
| ?
|-
|-
| [https://github.com/odie5533/WarcMITMProxy WarcMITMProxy]  
| [https://github.com/odie5533/WarcMITMProxy WarcMITMProxy]  
Line 64: Line 87:
| 1 author
| 1 author
| HTTPS proxy that saves traffic to a WARC file
| HTTPS proxy that saves traffic to a WARC file
| ?
|-
|-
| [https://github.com/internetarchive/warctools warc-tools]  
| [https://github.com/internetarchive/warctools warc-tools]  
| MIT License
| MIT License
| Python 2.6
| Python 2.7+/3.5+
| NO TEST SUITE
| NO TEST SUITE
| README
| README
Line 74: Line 98:


The previous versions can be found at https://code.google.com/p/warc-tools/ and https://bitbucket.org/hanzo/warc-tools
The previous versions can be found at https://code.google.com/p/warc-tools/ and https://bitbucket.org/hanzo/warc-tools
| ?
|-
|-
| [https://github.com/alard/warc-proxy WARC viewer]  
| [https://github.com/alard/warc-proxy WARC viewer]  
Line 82: Line 107:
| 1 author
| 1 author
| WARC viewer for browsing the contents of a WARC file.
| WARC viewer for browsing the contents of a WARC file.
| ?
|-
|-
| [https://github.com/alard/megawarc Megawarc]  
| [https://github.com/alard/megawarc Megawarc]  
Line 92: Line 118:


Checks if WARC files can be un-gzipped before adding them to the megawarc. Does not check anything else.
Checks if WARC files can be un-gzipped before adding them to the megawarc. Does not check anything else.
| ?
|-
|-
| [https://github.com/alard/warctozip-service warc to zip]  
| [https://github.com/alard/warctozip-service warc to zip]  
Line 100: Line 127:
| 1 author
| 1 author
| An HTTP-based warc-to-zip converter
| An HTTP-based warc-to-zip converter
| ?
|-
|-
| [https://github.com/chfoo/warcat warcat]  
| [https://github.com/chfoo/warcat warcat]  
Line 115: Line 143:


  https://github.com/internetarchive/ia-hadoop-tools  
  https://github.com/internetarchive/ia-hadoop-tools  
| ?
|-
|-
| [https://github.com/ArchiveTeam/archiveteam-megawarc-factory Archive Team megawarc factory]  
| [https://github.com/ArchiveTeam/archiveteam-megawarc-factory Archive Team megawarc factory]  
Line 124: Line 153:
| Generates 50gb warc files from existing warc files
| Generates 50gb warc files from existing warc files
Uploads to archive.org
Uploads to archive.org
| ?
|-
|-
| [https://github.com/rajbot/CDX-Writer CDX Writer]  
| [https://github.com/rajbot/CDX-Writer CDX Writer]  
| no license information
| AGPL v3
| Python
| Python
| Has a test suite
| Has a test suite
Line 132: Line 162:
| 1 author
| 1 author
| Create CDX index files from WARC files.
| Create CDX index files from WARC files.
| ?
|-
| [https://github.com/webrecorder/cdxj-indexer CDXJ Indexer]
| Apache v2.0
| Python 3
| Has a test suite
| None
| 1 core author, 3 contributors
| Create CDX and CDXJ index files from ARC and WARC files.
| ?
|-
|-
| [https://webarchive.jira.com/wiki/spaces/Heritrix/overview Heritrix]  
| [https://github.com/internetarchive/heritrix3 Heritrix]  
| Apache v2.0
| Apache v2.0
| Java
| Java
Line 140: Line 180:
| many authors
| many authors
| Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
| Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
| ?
|-
|-
| [https://github.com/openplaces/heritrix-cassandra Heritrix-Cassandra]  
| [https://github.com/openplaces/heritrix-cassandra Heritrix-Cassandra]  
| ? || ? || ? || ? || ?
| LGPL v2.1 || ? || ? || ? || ?
| A library for writing Heritrix 3 output directly to Cassandra as records.
| A library for writing Heritrix 3 output directly to Cassandra as records.
| ?
|-
|-
| [https://landsbokasafn.github.io/DeDuplicator/ DeDuplicator (Heritrix add-on)]
| [https://landsbokasafn.github.io/DeDuplicator/ DeDuplicator (Heritrix add-on)]
| GPL v2.1  
| LGPL v2.1  
| Java  
| Java  
| Very few tests  
| Very few tests  
Line 152: Line 194:
| 1 author
| 1 author
| The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
| The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
| ?
|-
|-
| [https://github.com/gwu-libraries/python-heritrix python-heritrix]  
| [https://github.com/gwu-libraries/python-heritrix python-heritrix]  
| ? || ? || ? || ? || ?
| ? || ? || ? || ? || ?
| A simple wrapper around the Heritrix 3.x API. Developed in April 2012 against Heritrix 3.1.0 at GWU Libraries in Washington, DC, USA.
| A simple wrapper around the Heritrix 3.x API. Developed in April 2012 against Heritrix 3.1.0 at GWU Libraries in Washington, DC, USA.
| ?
|-
|-
| [https://warcreate.com/ WARCreate (Chrome/Chromium extension)]
| [https://warcreate.com/ WARCreate (Chrome/Chromium extension)]
Line 164: Line 208:
| 1 author
| 1 author
| WARCreate is a Google Chrome extension that allows a user to create a WARC file from any browseable webpage. [https://github.com/machawk1/warcreate code repo]
| WARCreate is a Google Chrome extension that allows a user to create a WARC file from any browseable webpage. [https://github.com/machawk1/warcreate code repo]
| ?
|-
|-
| [https://sbforge.org/display/JWAT/JWAT Java Web Archive Toolkit]  
| [https://sbforge.org/display/JWAT/JWAT Java Web Archive Toolkit]  
Line 174: Line 219:


[https://bitbucket.org/nclarkekb/jwat/overview code repo]
[https://bitbucket.org/nclarkekb/jwat/overview code repo]
| ?
|-
|-
| [https://machawk1.github.io/wail/ Web Archiving Integration Layer (WAIL)]  
| [https://machawk1.github.io/wail/ Web Archiving Integration Layer (WAIL)]  
Line 185: Line 231:


[https://github.com/machawk1/wail code repo]
[https://github.com/machawk1/wail code repo]
| ?
|-
|-
| [https://github.com/odie5533/pylibwarc/ pylibwarc]  
| [https://github.com/odie5533/pylibwarc pylibwarc]  
| ISC License
| ISC License
| Python
| Python
Line 194: Line 241:
|CDX support
|CDX support
Another independent WARC library for Python.
Another independent WARC library for Python.
| ?
|-
|-
| [https://github.com/ArchiveTeam/wpull Wpull]  
| [https://github.com/ArchiveTeam/wpull Wpull]  
| GPL version 3
| GPL v3
| Python 3
| Python 3
| many unit tests (Travis CI registered), simple experimental fuzzer
| many unit tests (Travis CI registered), simple experimental fuzzer
Line 203: Line 251:
| Wget-compatible web downloader.
| Wget-compatible web downloader.
Beta quality. Lua/Python scripting. PhantomJS (experimental). Used by [[ArchiveBot]].
Beta quality. Lua/Python scripting. PhantomJS (experimental). Used by [[ArchiveBot]].
| style="background-color: #ffff99" | wpull 2.0.x has bugs that make it hard to use properly directly. ArchiveBot and grab-site integration is not affected by that.
|-
|-
| [https://github.com/ArchiveTeam/grab-site grab-site]  
| [https://github.com/ArchiveTeam/grab-site grab-site]  
Line 211: Line 260:
| 1 core author
| 1 core author
| wpull launcher with the dashboard and ignore patterns from ArchiveBot
| wpull launcher with the dashboard and ignore patterns from ArchiveBot
| style="background-color: #99ff99" | Yes.
|-
|-
| [https://github.com/ikreymer/pywb pywb]
| [https://github.com/webrecorder/pywb pywb]
| GPL version 3
| GPL v3
| Python 2
| Python 2.7+/3.4+
| yes
| yes
| README and wiki
| README and wiki
| 1 core author
| 2 core authors
| A full-fledged Python reimplementation of Wayback Machine web archive replay capabilities. Also provides a live rewriting proxy.
| A full-fledged Python reimplementation of Wayback Machine web archive replay capabilities. Also provides a live rewriting proxy.
|-  
| style="background-color: #ffff99" | Acceptable for regular use although some data gets mangled; see warcio
|-
| [https://github.com/helgeho/ArchiveSpark ArchiveSpark]
| [https://github.com/helgeho/ArchiveSpark ArchiveSpark]
| MIT License
| MIT License
Line 227: Line 278:
| 2 authors
| 2 authors
| Apache Spark framework that facilitates access to Web Archives
| Apache Spark framework that facilitates access to Web Archives
| ?
|-
|-
| [https://github.com/webrecorder/webrecorder-player Webrecorder Player]
| [https://archiveweb.page ArchiveWeb.page]
| Apache License 2.0
| AGPL-3.0
| JavaScript
| Javascript
| No
| [https://archiveweb.page/guide website]
| 5 core contributors
| Chrome extension for capturing WARC and WACZ files through interactive browsing.
| style="background-color: #ff9999" | No. Uses the Chrome Debugging Protocol<ref>{{URL|https://github.com/webrecorder/archiveweb.page/blob/5431064ead4c8245b5b58cbe9233664e525302d9/README.md#architecture}}</ref>, which cannot correctly capture headers and transfer encoding.
|-
| [https://replayweb.page ReplayWeb.page]
| AGPL-3.0
| Javascript
| No
| [https://replayweb.page/docs website]
| 5 core contributors
| Browser-based viewer for WARC, WACZ, HAR, and CDX files. Can be embedded into other sites.
| ?
| ?
| ?
| ?
| Desktop app for viewing high-fidelity web archives (WARC, HAR and ARC) on a local machine, no internet connection required. Particularly useful for social media, dynamic content. Supports OSX, Windows and Linux (experimental). Related to https://webrecorder.io/
|-
|-
| [https://github.com/webrecorder/warcio warcio]
| [https://github.com/webrecorder/warcio warcio]
| Apache 2.0
| Apache 2.0
| Python 2.7+/3.3+
| Python 2.7+/3.4+
| yes
| yes
| README
| README
| 7 contributors
| 14 contributors
| WARC writer library
| WARC writer library
| style="background-color: #ff9999" | Writing WARCs: No. Has long-standing bugs regarding correct preservation of data as sent by the server.<ref>{{URL|https://github.com/webrecorder/warcio/issues/128}}</ref><ref>{{URL|https://github.com/webrecorder/warcio/issues/129}}</ref>
Reading WARCs: Acceptable although [https://github.com/webrecorder/warcio/issues/128 this issue from above] also affects reading.
|-
|-
| [https://github.com/internetarchive/warcprox warcprox]
| [https://github.com/internetarchive/warcprox warcprox]
| GPL v2+ || Python 3.4+
| GPL v2+ || Python 3.8+
| yes
| yes
| README
| README
| 1 core author, 11 contributors
| 1 core author, 14 contributors
| MITM proxy for capturing to WARC. See also [https://github.com/internetarchive/brozzler brozzler], a crawler based on headless Chromium and warcprox.
| MITM proxy for capturing to WARC. See also [https://github.com/internetarchive/brozzler brozzler], a crawler based on headless Chromium and warcprox.
| style="background-color: #ffff99" | Yes. Has not been audited independently but is assumed to work correctly.
|-
| [https://gitea.arpa.li/JustAnotherArchivist/qwarc qwarc]
| GPL v3+ || Python 3.7+
| No
| No
| 1
| Flexible framework for rapid archival with little overhead, using parallel connections and minimal response processing. All retrieval logic has to be implemented by the user in Python.
| style="background-color: #ffff99" | Lack of documentation makes it hard to use. Not packaged. Versions up to and including 0.2.5 were based on warcio and thus shouldn't be used.
|-
| [https://archivebox.io/ ArchiveBox]
| MIT || Python 3.7+
| Yes
| GitHub wiki
| 1
| Self-hosted internet archival system that produces a variety of formats, including WARC.
| style="background-color: #ff9999" | No. Uses wget for the WARC mode and therefore inherits the angle brackets issue from it.
|-
| [https://github.com/webrecorder/warcio.js warcio.js]
| MIT License
| TypeScript
| Yes
| README
| 7 committers
| JS Streaming WARC IO optimized for Browser and Node
| ?
|-
| [https://github.com/nlnwa/warchaeology warchaeology]
| Apache-2.0 license
| Go
| ?
| [https://nlnwa.github.io/warchaeology/ website]
| 4 committers
| Command line tool for digging into WARC files
| ?
|-
| [https://github.com/N0taN3rd/node-warc node-warc]
| MIT License
| JavaScript
| Yes
| [https://n0tan3rd.github.io/node-warc/ website]
| 5 committers
| Parse And Create Web ARChive (WARC) files with node.js
| ?
|-
| [https://github.com/commoncrawl/nutch nutch] (Common Crawl fork)
| Apache 2.0 license
| Java
| Yes
| ?
| ?
| Fork of Apache Nutch web crawler with WARC writing support
| ?
|-
|-
! Name
! Name
Line 258: Line 375:
! Author count
! Author count
! Description
! Description
! Recommended
|}
|}


== Deprecated ==
== Deprecated ==
* https://github.com/internetarchive/archive-commons - split into 2 new repos: ia-web-commons & ia-hadoop-tools
 
* https://github.com/ikreymer/pywb-webrecorder
{|class="wikitable"
* https://code.google.com/p/warc-tools/
! Name
* https://github.com/lintool/warcbase
! License
* [https://github.com/ikreymer/webarchiveplayer WebArchivePlayer]
! Language
! Testing
! Documentation
! Author count
! Description
! Comment
|-
| [https://github.com/internetarchive/archive-commons archive-commons]
| License
| Language
| Testing
| Documentation
| ?
| ?
| split into 2 new repos: ia-web-commons & ia-hadoop-tools
|-
| [https://github.com/ikreymer/pywb-webrecorder pywb-webrecorder]
| MIT License
| Python 2.7
| No
| README
| ?
| ?
| ?
|-
| [https://code.google.com/p/warc-tools/ warc-tools]
| Apache License 2.0
| ?
| ?
| ?
| ?
| ?
| ?
|-
| [https://github.com/lintool/warcbase Warcbase]
| Apache License 2.0
| Java
| ?
| ?
| ?
| Warcbase is an open-source platform for managing analyzing web archives.
| ?
|-
| [https://github.com/ikreymer/webarchiveplayer WebArchivePlayer]
| GPL v3
| Python 2.7
| No
| ?
| ?
| WebArchivePlayer is a new desktop tool which provides a simple point-and-click wrapper for viewing any web archive file (in WARC and ARC format).
| Obsolete and replaced by Webrecorder Player.
|-
| [https://github.com/webrecorder/webrecorder-player Webrecorder Player]
| Apache License 2.0
| JavaScript
| ?
| ?
| ?
| Desktop app for viewing high-fidelity web archives (WARC, HAR and ARC) on a local machine, no internet connection required. Particularly useful for social media, dynamic content. Supports OSX, Windows and Linux (experimental). Related to https://webrecorder.io/
| Obsolete and replaced by replayweb.page.
|-
! Name
! License
! Language
! Testing
! Documentation
! Author count
! Description
! Comment
|}


== The WARC format ==
== The WARC format ==

Latest revision as of 01:00, 24 March 2024

Everything about the WARC format and the tools that support it.

WARC is a file format for accurately storing Web traffic.

Viewing WARCs

If you just want to view Archiveteam WARCs, you should be able to load the WARC file into a WARC viewer such as ReplayWeb.page.

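There is an exception: if the WARC file name ends in .warc.zst, you will need to decompress it with zstd first. If zstd reports "Dictionary mismatch" or a similar error message, try the zstdwarccat Python script (https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat).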

If you need help, contact us in the project channel, or if no such channel exists, #archiveteam-bs (on hackint).

Information

Tools

{| class="wikitable"
! Name !! License !! Language !! Testing !! Documentation !! Author count !! Description !! Recommended
|-
| wget v1.14+ || GPL v3+ || C || Has a test suite but does not test any warc functionality || Man pages, website, blog posts all over the net || 2+ according to the changelog
| A non-interactive network downloader. wget also generates duplicate record ids in warc files.
More information about flags can be found on the Wget with WARC output page.
| No. Since version 1.20, wget writes WARCs with angle brackets around URIs. The WARC/1.0 grammar in the specification technically requires these brackets, but the examples given there contradict this. No other software is known to do this, and many WARC readers are unable to handle the brackets.
The unofficial Windows builds at https://eternallybored.org/misc/wget/ have bugs in at least the WARC-writing part that appear to cause them to truncate non-ASCII data. They are best avoided entirely. Consider using the Windows Subsystem for Linux (WSL) instead.
|-
| wget-at || GPL v3+ || C, Lua || See wget || ? || 1
| wget with various additions that make it suitable for ArchiveTeam use. Lua hooks for controlling many aspects of the crawl. Used for DPoS projects.
| Yes
|-
| InternetArchive's warc python library || GPL v2 || Python 2 || looks to have a test suite || README with examples || 3 committers on GitHub
| library to work with WARC files
| No. Obsolete as Python 2 is EOL.
|-
| WarcMiddleware || ISC || Python || Not enough tests || README + Scrapy docs || 1 author
| Mirrors websites and saves the results to a WARC file
| No. Does not correctly preserve the exact traffic as sent by the server.
|-
| WarcProxy || ISC || Python || NO TEST SUITE || README || 1 author
| a simple HTTP proxy that saves all HTTP traffic to a file
| ?
|-
| WarcMITMProxy || ISC || Python || NO TEST SUITE || README || 1 author
| HTTPS proxy that saves traffic to a WARC file
| ?
|-
| warc-tools || MIT License || Python 2.7+/3.5+ || NO TEST SUITE || README || 4 committers
| warc validator, dump, search, index, convert arc to warc
The previous versions can be found at https://code.google.com/p/warc-tools/ and https://bitbucket.org/hanzo/warc-tools
| ?
|-
| WARC viewer || no license information || Python || NO TEST SUITE || README || 1 author
| WARC viewer for browsing the contents of a WARC file.
| ?
|-
| Megawarc || no license information || Python || NO TEST SUITE || README || 1 author
| Merge many small warcs into a large one
Checks if WARC files can be un-gzipped before adding them to the megawarc. Does not check anything else.
| ?
|-
| warc to zip || no license information || Python || NO TEST SUITE || README || 1 author
| An HTTP-based warc-to-zip converter
| ?
|-
| warcat || GPL v3 || Python 3 || yes || README || 1 author
| warcat concat, extract, list, pass, split, verify warc files
Install: pip-3 install warcat
Run: python3 -m warcat verify mysite.warc.gz

https://github.com/internetarchive/ia-web-commons
https://github.com/internetarchive/ia-hadoop-tools
| ?
|-
| Archive Team megawarc factory || no license information || Bash shell scripting || NO TEST SUITE || README || 1 author
| Generates 50 GB warc files from existing warc files
Uploads to archive.org
| ?
|-
| CDX Writer || AGPL v3 || Python || Has a test suite || README || 1 author
| Create CDX index files from WARC files.
| ?
|-
| CDXJ Indexer || Apache v2.0 || Python 3 || Has a test suite || None || 1 core author, 3 contributors
| Create CDX and CDXJ index files from ARC and WARC files.
| ?
|-
| Heritrix || Apache v2.0 || Java || Has a test suite || javadoc, website || many authors
| Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
| ?
|-
| Heritrix-Cassandra || LGPL v2.1 || ? || ? || ? || ?
| A library for writing Heritrix 3 output directly to Cassandra as records.
| ?
|-
| DeDuplicator (Heritrix add-on) || LGPL v2.1 || Java || Very few tests || Getting Started page. || 1 author
| The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
| ?
|-
| python-heritrix || ? || ? || ? || ? || ?
| A simple wrapper around the Heritrix 3.x API. Developed in April 2012 against Heritrix 3.1.0 at GWU Libraries in Washington, DC, USA.
| ?
|-
| WARCreate (Chrome/Chromium extension) || MIT || JavaScript || ??? || none || 1 author
| WARCreate is a Google Chrome extension that allows a user to create a WARC file from any browseable webpage. [https://github.com/machawk1/warcreate code repo]
| ?
|-
| Java Web Archive Toolkit || Apache 2.0 || Java || Partial Test Suite (check coverage profile) || Online || 1 author
| jwattools arc2warc, cdx, compress, decompress, extract, interval, pathindex, test, unpack
[https://bitbucket.org/nclarkekb/jwat/overview code repo]
| ?
|-
| Web Archiving Integration Layer (WAIL) || MIT || Python || ??? || Online || 1 author
| Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
Tools included and accessible through the GUI are Heritrix 3.2.0 and OpenWayback 2.4.0.

[https://github.com/machawk1/wail code repo]
| ?
|-
| pylibwarc || ISC License || Python || ? || ? || 1 author
| CDX support
Another independent WARC library for Python.
| ?
|-
| Wpull || GPL v3 || Python 3 || many unit tests (Travis CI registered), simple experimental fuzzer || a quick start README, brief usage overview, good docstrings coverage || 1 core author
| Wget-compatible web downloader.
Beta quality. Lua/Python scripting. PhantomJS (experimental). Used by ArchiveBot.
| wpull 2.0.x has bugs that make it hard to use properly directly. ArchiveBot and grab-site integration is not affected by that.
|-
| grab-site || MIT || Python 3 || no || README || 1 core author
| wpull launcher with the dashboard and ignore patterns from ArchiveBot
| Yes.
|-
| pywb || GPL v3 || Python 2.7+/3.4+ || yes || README and wiki || 2 core authors
| A full-fledged Python reimplementation of Wayback Machine web archive replay capabilities. Also provides a live rewriting proxy.
| Acceptable for regular use although some data gets mangled; see warcio
|-
| ArchiveSpark || MIT License || Scala || ? || ? || 2 authors
| Apache Spark framework that facilitates access to Web Archives
| ?
|-
| ArchiveWeb.page || AGPL-3.0 || JavaScript || No || website || 5 core contributors
| Chrome extension for capturing WARC and WACZ files through interactive browsing.
| No. Uses the Chrome Debugging Protocol (see https://github.com/webrecorder/archiveweb.page/blob/5431064ead4c8245b5b58cbe9233664e525302d9/README.md#architecture), which cannot correctly capture headers and transfer encoding.
|-
| ReplayWeb.page || AGPL-3.0 || JavaScript || No || website || 5 core contributors
| Browser-based viewer for WARC, WACZ, HAR, and CDX files. Can be embedded into other sites.
| ?
|-
| warcio || Apache 2.0 || Python 2.7+/3.4+ || yes || README || 14 contributors
| WARC writer library
| Writing WARCs: No. Has long-standing bugs regarding correct preservation of data as sent by the server (see https://github.com/webrecorder/warcio/issues/128 and https://github.com/webrecorder/warcio/issues/129).
Reading WARCs: Acceptable, although issue 128 above also affects reading.
|-
| warcprox || GPL v2+ || Python 3.8+ || yes || README || 1 core author, 14 contributors
| MITM proxy for capturing to WARC. See also brozzler, a crawler based on headless Chromium and warcprox.
| Yes. Has not been audited independently but is assumed to work correctly.
|-
| qwarc || GPL v3+ || Python 3.7+ || No || No || 1
| Flexible framework for rapid archival with little overhead, using parallel connections and minimal response processing. All retrieval logic has to be implemented by the user in Python.
| Lack of documentation makes it hard to use. Not packaged. Versions up to and including 0.2.5 were based on warcio and thus shouldn't be used.
|-
| ArchiveBox || MIT || Python 3.7+ || Yes || GitHub wiki || 1
| Self-hosted internet archival system that produces a variety of formats, including WARC.
| No. Uses wget for the WARC mode and therefore inherits the angle brackets issue from it.
|-
| warcio.js || MIT License || TypeScript || Yes || README || 7 committers
| JS Streaming WARC IO optimized for Browser and Node
| ?
|-
| warchaeology || Apache-2.0 license || Go || ? || website || 4 committers
| Command line tool for digging into WARC files
| ?
|-
| node-warc || MIT License || JavaScript || Yes || website || 5 committers
| Parse And Create Web ARChive (WARC) files with node.js
| ?
|-
| nutch (Common Crawl fork) || Apache 2.0 license || Java || Yes || ? || ?
| Fork of Apache Nutch web crawler with WARC writing support
| ?
|-
! Name !! License !! Language !! Testing !! Documentation !! Author count !! Description !! Recommended
|}

Deprecated

{| class="wikitable"
! Name !! License !! Language !! Testing !! Documentation !! Author count !! Description !! Comment
|-
| archive-commons || ? || ? || ? || ? || ? || ?
| split into 2 new repos: ia-web-commons & ia-hadoop-tools
|-
| pywb-webrecorder || MIT License || Python 2.7 || No || README || ? || ?
| ?
|-
| warc-tools || Apache License 2.0 || ? || ? || ? || ? || ?
| ?
|-
| Warcbase || Apache License 2.0 || Java || ? || ? || ?
| Warcbase is an open-source platform for managing and analyzing web archives.
| ?
|-
| WebArchivePlayer || GPL v3 || Python 2.7 || No || ? || ?
| WebArchivePlayer is a new desktop tool which provides a simple point-and-click wrapper for viewing any web archive file (in WARC and ARC format).
| Obsolete and replaced by Webrecorder Player.
|-
| Webrecorder Player || Apache License 2.0 || JavaScript || ? || ? || ?
| Desktop app for viewing high-fidelity web archives (WARC, HAR and ARC) on a local machine, no internet connection required. Particularly useful for social media, dynamic content. Supports OSX, Windows and Linux (experimental). Related to https://webrecorder.io/
| Obsolete and replaced by replayweb.page.
|-
! Name !! License !! Language !! Testing !! Documentation !! Author count !! Description !! Comment
|}

The WARC format

A .warc file is usually a group of one or more WARC records. The first record usually describes the records to follow.

Compression is optional. If used, each record is compressed separately with gzip and the compressed records are concatenated; this works because a gzip file may contain multiple "members". Compressed WARCs end in .warc.gz. According to the guidelines, WARC files should top out at 1 GB.
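A minimal sketch of what "multiple members" means in practice, using only the Python standard library. The function name iter_gzip_members and the two sample byte strings are illustrative assumptions, not part of any WARC tool; real files would be read from disk and may be too large to hold in memory like this.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member found in `data`."""
    while data:
        # wbits=31 selects the gzip container; one decompressobj handles one member.
        decompressor = zlib.decompressobj(wbits=31)
        member = decompressor.decompress(data) + decompressor.flush()
        yield member
        # Whatever follows the end of this member belongs to the next one.
        data = decompressor.unused_data


# Two independently compressed chunks, concatenated like per-record-compressed WARC data.
blob = gzip.compress(b"record one\r\n\r\n") + gzip.compress(b"record two\r\n\r\n")
members = list(iter_gzip_members(blob))
```

Because every record is its own member, a reader that knows a record's byte offset (for example from a CDX index) can start decompressing at that offset without touching the rest of the file.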

WARC record

  • header
  • content block
  • two newlines (CRLF CRLF)
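The three parts listed above can be sketched as a small serializer. This is an illustration under assumptions, not a replacement for a real WARC writer: build_record is a hypothetical helper, the field values are made up for the example, and CRLF line endings are used as in the WARC specification.

```python
import uuid


def build_record(fields, block: bytes, version: str = "WARC/1.0") -> bytes:
    """Serialize one record: header lines, blank line, content block, two newlines."""
    lines = [version]
    lines += [f"{name}: {value}" for name, value in fields]
    # Content-Length counts the octets of the content block only.
    lines.append(f"Content-Length: {len(block)}")
    header = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    return header + block + b"\r\n\r\n"


# Hypothetical example values.
record = build_record(
    [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<{uuid.uuid4().urn}>"),
        ("WARC-Date", "2013-04-02T16:12:40Z"),
        ("Content-Type", "text/plain"),
    ],
    b"hello",
)
```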

WARC record header

The beginning of a WARC record: a first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line. The WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and [RFC2822] headers, with one major exception: field values may contain UTF-8 [RFC3629].


Example of a 'request' record header:

 WARC/1.0
 WARC-Type: request
 WARC-Target-URI: http://xbox.gamespy.com/
 Content-Type: application/http;msgtype=request
 WARC-Date: 2013-04-02T16:12:40Z
 WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
 WARC-IP-Address: 213.248.112.146
 WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
 WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
 Content-Length: 150

WARC named fields

  • A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
  • Named fields may appear in any order.
  • Field values may contain any UTF-8 character.
  • The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
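As a rough illustration of these rules, the sketch below parses a header into its version line and a list of (name, value) pairs, joining indented continuation lines onto the previous field. parse_header is a hypothetical helper; it assumes CRLF line endings and does not decode RFC 2047 encoded-words, which a complete reader would also have to handle.

```python
def parse_header(raw: bytes):
    """Return (version_line, [(name, value), ...]) for one WARC record header."""
    lines = raw.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = []
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header
        if line[0] in " \t" and fields:
            # Indented line: continuation of the previous field's value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, _, value = line.partition(":")
        fields.append((name.strip(), value.strip()))
    return version, fields


# The 'request' record header shown earlier on this page.
SAMPLE = "\r\n".join([
    "WARC/1.0",
    "WARC-Type: request",
    "WARC-Target-URI: http://xbox.gamespy.com/",
    "Content-Type: application/http;msgtype=request",
    "WARC-Date: 2013-04-02T16:12:40Z",
    "WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>",
    "WARC-IP-Address: 213.248.112.146",
    "WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>",
    "WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4",
    "Content-Length: 150",
]).encode("utf-8") + b"\r\n\r\n"

version, fields = parse_header(SAMPLE)
```

A list of pairs is used rather than a dict because some fields, such as WARC-Concurrent-To, may be repeated.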

Defined field names

WARC-Type
required, can be one of 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', or 'continuation'
WARC-Record-ID
required, unique ID, as a URI
WARC-Date
required
Content-Length
required
Content-Type
MIME type of the content block
WARC-Concurrent-To
repeatable, WARC-Record-IDs associated with this one
WARC-Block-Digest
optional, hash of the full content block
WARC-Payload-Digest
optional, hash of just the payload
WARC-IP-Address
IP address of the server the content was retrieved from
WARC-Refers-To
previous WARC-Record-ID this relates to
WARC-Target-URI
the URL asked for
WARC-Truncated
the reason why only part of the content was stored
WARC-Warcinfo-ID
WARC-Record-ID of the associated high-level metadata record
WARC-Filename
warcinfo only, the expected name of the file containing this record
WARC-Profile
revisit only, the way revisiting was handled, as a URI
WARC-Identified-Payload-Type
an independently verified MIME type of the payload (i.e. not just what it claims to be)
WARC-Segment-Origin-ID
continuation only
WARC-Segment-Number
this record's position in a sequence of segmented records
WARC-Segment-Total-Length
continuation only
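The four required fields and the list of record types lend themselves to a quick sanity check. The sketch below is only that: check_fields is a hypothetical helper, it compares field names case-insensitively (an assumption about how readers should treat names), and it does not validate dates, URIs, or digests.

```python
REQUIRED_FIELDS = ("WARC-Type", "WARC-Record-ID", "WARC-Date", "Content-Length")
KNOWN_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}


def check_fields(fields: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    present = {name.lower(): value for name, value in fields.items()}
    problems = []
    for name in REQUIRED_FIELDS:
        if name.lower() not in present:
            problems.append(f"missing required field {name}")
    record_type = present.get("warc-type")
    if record_type is not None and record_type not in KNOWN_TYPES:
        problems.append(f"WARC-Type {record_type!r} is not one of the listed types")
    length = present.get("content-length")
    if length is not None and not (length.isascii() and length.isdigit()):
        problems.append("Content-Length is not a non-negative integer")
    return problems


ok = check_fields({
    "WARC-Type": "request",
    "WARC-Record-ID": "<urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>",
    "WARC-Date": "2013-04-02T16:12:40Z",
    "Content-Length": "150",
})
bad = check_fields({"WARC-Type": "snapshot", "Content-Length": "lots"})
```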

WARC content block

Part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record.
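Since the content block is an opaque run of octets, a reader cannot find its end by searching for blank lines; it has to use Content-Length. The sketch below shows that idea on an uncompressed stream. read_record is a hypothetical helper, the demo records are abbreviated (they omit WARC-Record-ID and WARC-Date), and CRLF line endings are assumed.

```python
import io


def read_record(stream):
    """Read one record from a binary stream; return (header_lines, block) or None at EOF."""
    header = []
    while True:
        line = stream.readline()
        if line in (b"", b"\r\n"):
            break  # end of file, or the blank line that ends the header
        header.append(line.rstrip(b"\r\n").decode("utf-8"))
    if not header:
        return None
    length = None
    for line in header[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            length = int(value)
    if length is None:
        raise ValueError("record header has no Content-Length")
    block = stream.read(length)  # exactly Content-Length octets, whatever they contain
    stream.read(4)               # skip the two newlines that close the record
    return header, block


# Abbreviated demo records; the first block deliberately contains a blank line.
demo = io.BytesIO(
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 6\r\n\r\na\r\n\r\nb\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 0\r\n\r\n\r\n\r\n"
)
first = read_record(demo)
second = read_record(demo)
third = read_record(demo)
```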

ArchiveBot job output

ArchiveBot produces three types of files:

.meta.warc.gz
The log of the job, listing all the files requested and downloaded, as well as any errors.
.json
Some brief metadata about the job.
-0000.warc.gz, -0001.warc.gz, ...
The actual requests and responses, in full.

CDX File Format

Example of generating a list of URLs in a MegaWARC:

curl -sL 'https://archive.org/download/archiveteam_zapd_20131016071259/zapd_20131016071259.megawarc.warc.os.cdx.gz' \
| gunzip -c | cut -f3 -d' '
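For readers who prefer Python to a shell pipeline, the field extraction done by cut can be sketched as below. third_field is a hypothetical helper and the two sample lines are invented for illustration; they are not taken from the real index, and only their space-separated layout matters here.

```python
def third_field(lines):
    """Yield the third space-separated field of each line, like `cut -f3 -d' '`."""
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) >= 3:  # lines with fewer fields are skipped in this sketch
            yield parts[2]


# Hypothetical sample lines in a space-separated CDX-like layout.
sample = [
    "com,example)/ 20131016071259 http://example.com/ text/html 200\n",
    "com,example)/about 20131016071300 http://example.com/about text/html 200\n",
]
urls = list(third_field(sample))
```

For a downloaded .cdx.gz file, the lines could come from gzip.open(path, "rt") in the standard library, which yields decompressed text lines one at a time.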

Example of getting a list of all the URLs in the Wayback Machine with a given prefix:

curl 'https://web.archive.org/cdx/search/cdx?fl=statuscode,timestamp,original&collapse=urlkey&matchType=prefix&url=http://www.conchord.org'
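The same query URL can be assembled programmatically, which helps when the prefix or the field list varies. This sketch only builds the URL and does not contact the server; cdx_query_url is a hypothetical helper, and the parameter names are exactly those in the curl example above.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(prefix, fields=("statuscode", "timestamp", "original")):
    """Build a Wayback Machine CDX query URL for all captures under `prefix`."""
    params = {
        "fl": ",".join(fields),   # which columns to return
        "collapse": "urlkey",     # collapse rows that share the same URL key
        "matchType": "prefix",    # treat `url` as a prefix rather than an exact URL
        "url": prefix,
    }
    # Keep ',', ':' and '/' unescaped so the result reads like the curl example.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=",:/")


url = cdx_query_url("http://www.conchord.org")
```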