The WARC Ecosystem

From Archiveteam
Revision as of 08:02, 19 July 2015 by JesseW (talk | contribs) (convert list of tools to a table, for ease of reading and adding)

Everything about the WARC format and the tools that support it.

Information

* https://en.wikipedia.org/wiki/Web_ARChive
* https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817 - Contains examples of WARC records
* ISO 28500 - The WARC File Format
* http://archive-access.sourceforge.net/warc/ - WARC ISO docs
* http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
* http://www.netpreserve.org/resources/warc-implementation-guidelines-v1 http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
* http://commoncrawl.org/navigating-the-warc-file-format/

Tools

| Name | License | Programming language | Test suite | Documentation | # of authors | Description |
| wget v1.14+ | GPL v3+ | C | Has a test suite, but it does not test any WARC functionality | Man pages, website, blog posts all over the net | 2+ according to the changelog | A non-interactive network downloader. wget also generates duplicate record IDs in WARC files. More information about flags can be found on the Wget with WARC output page. |
| InternetArchive's warc Python library | GPL v2 | Python | Looks to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py | A readme with examples, online at http://warc.readthedocs.org/en/latest/ | 3 committers on GitHub | Library to work with WARC files |
| WarcMiddleware | ISC | Python | Not enough tests | A readme file + Scrapy docs | 1 author | Mirrors websites and saves the results to a WARC file |
| WarcProxy | ISC | Python | NO TEST SUITE | A readme file | 1 author | A simple HTTP proxy that saves all HTTP traffic to a file |
| WarcMITMProxy | ISC | Python | NO TEST SUITE | A readme file | 1 author | HTTPS proxy that saves traffic to a WARC file |
| warc-tools | MIT License | Python 2.6 | NO TEST SUITE | A readme file | 4 committers | WARC validator, dump, search, index, convert ARC to WARC. The previous versions can be found at https://code.google.com/p/warc-tools/ and http://code.hanzoarchives.com/warc-tools . old: http://code.hanzoarchives.com/warc-tools/src/6e1d36297688/hanzo/warcextract.py new (untested): http://code.hanzoarchives.com/warc-tools/src/fd3b49a7ee22fe4eee0d51dc841af40d4b9d2e1e/warcunpack_ia.py?at=default |
| WARC viewer | no license information | Python | NO TEST SUITE | A readme file | 1 author | WARC viewer for browsing the contents of a WARC file. |
| Megawarc | no license information | Python | NO TEST SUITE | A readme file | 1 author | Merges many small WARCs into a large one. Checks if WARC files can be un-gzipped before adding them to the megawarc. Does not check anything else. |
| warc to zip | no license information | Python | NO TEST SUITE | A readme file | 1 author | An HTTP-based warc-to-zip converter |
| warcat | GPL v3 | Python 3 | yes | A readme file | 1 author | warcat: concat, extract, list, pass, split, verify WARC files. Install: pip-3 install warcat. Run: python3 -m warcat verify mysite.warc.gz |
| https://github.com/internetarchive/ia-web-commons | | | | | | |
| https://github.com/internetarchive/ia-hadoop-tools | | | | | | |
| Archive Team megawarc factory | no license information | Bash shell scripting | NO TEST SUITE | A readme file | 1 author | Generates 50 GB WARC files from existing WARC files. Uploads to archive.org. |
| CDX Writer | no license information | Python | Has a test suite | A readme file | 1 author | Creates CDX index files from WARC files. |
| Heritrix | Apache v2.0 | Java | Has a test suite | javadoc, website | many authors | Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. |
| Heritrix-Cassandra | ? | ? | ? | ? | ? | A library for writing Heritrix 3 output directly to Cassandra as records. |
| DeDuplicator (Heritrix add-on) | ? | ? | ? | ? | ? | The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls. |
| python-heritrix | ? | ? | ? | ? | ? | A simple wrapper around the Heritrix 3.x API. Developed in April 2012 against Heritrix 3.1.0 at GWU Libraries in Washington, DC, USA. |
| Chrome/Chromium plugin WARCreate | GPL v3 | JavaScript | ??? | none | 1 author | WARCreate is a Google Chrome extension that allows a user to create a Web ARChive (WARC) file from any browseable webpage. code repo |
| Java Web Archive Toolkit | Apache 2.0 | Java | Partial test suite (check coverage profile) | Online | 1 author | jwattools: arc2warc, cdx, compress, decompress, extract, interval, pathindex, test, unpack. code repo |
| WAIL | CC-BY-SA | Python, JS | ??? | Online | 1 | Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools, intended to be used as an easy way for anyone to preserve and replay web pages. Tools included and accessible through the GUI are Heritrix 3.1.2, Wayback 1.7, and warc-proxy. Support packages include Apache Tomcat, phantomjs and pyinstaller. code repo |
| pylibwarc | ISC License | Python | ? | ? | 1 author | CDX support. Written by odie5533, who frequents #archiveteam, as another independent WARC library for Python. |
| Wpull | GPL version 3 | Python 3 | many unit tests (Travis CI registered), simple experimental fuzzer | a quick start readme, brief usage overview, good docstring coverage | 1 core author | Wget-compatible web downloader. Beta quality. Lua/Python scripting. PhantomJS (experimental). Used by ArchiveBot. |
| pywb | GPL version 3 | Python 2 | yes | readme and wiki | 1 core author | A full-fledged Python reimplementation of Wayback Machine web archive replay capabilities. Also provides a live rewriting proxy. |
| pywb-webrecorder | MIT | Python 2 | no | readme | 1 core author | An experimental/demo integration of pywb + warcprox to allow live recording to WARC. Allows instant replay of recorded content from WARC. |
| webarchiveplayer | GPL version 3 | Python 2 | not yet, though most testable functionality is in pywb | readme | 1 core author | Point-and-click wrapper for Windows and OS X for browsing WARC files. Shows a basic file open dialog to select one or more WARCs, then starts a server and opens a browser. Also determines HTML pages within a WARC. Built on top of pywb. In beta at the moment (early 2015). |
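The wget entry above mentions duplicate record IDs. A rough way to check a WARC for them is to walk the records and count each WARC-Record-ID. The sketch below uses only the Python standard library and assumes an uncompressed stream (or one opened through gzip.open), CRLF line endings, and a Content-Length field on every record. The function names are made up for illustration and are not taken from any tool in the table.

```python
import io
from collections import Counter


def iter_record_ids(stream):
    """Yield the WARC-Record-ID of each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        fields = {}
        for line in iter(stream.readline, b""):
            if not line.strip():
                break  # a blank line ends the header
            if line[:1] in b" \t":
                continue  # folded continuation line, not needed here
            name, _, value = line.partition(b":")
            fields[name.strip().lower()] = value.strip()
        # Skip the content block so its bytes are never mistaken for headers.
        stream.read(int(fields[b"content-length"]))
        yield fields.get(b"warc-record-id")


def duplicate_record_ids(stream):
    """Return a sorted list of record IDs that occur more than once."""
    counts = Counter(iter_record_ids(stream))
    return sorted(rid for rid, n in counts.items() if rid is not None and n > 1)
```

For a compressed file, something like `duplicate_record_ids(gzip.open("mysite.warc.gz", "rb"))` should work, since the gzip module reads multi-member files transparently.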

Deprecated

* https://code.google.com/p/warc-tools/ - Old, discontinued shit
* https://github.com/internetarchive/archive-commons - split into 2 new repos: ia-web-commons & ia-hadoop-tools

The WARC format

* A .warc file is usually a group of one or more WARC records.
* The first record usually describes the records to follow.
* Compression is optional.
* When compression is used, each record is compressed separately via gzip. A gzip file supports multiple "members", so the compressed records can simply be concatenated.
* Compressed WARCs end in .warc.gz.
* According to the guidelines, WARC files should top out at 1 GB.
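A minimal sketch of the per-record compression described above, using only Python's standard library. Each record becomes its own gzip member, and the members are concatenated. The function names are hypothetical, and the records in the usage example are placeholders rather than valid WARC records.

```python
import gzip
import zlib


def gzip_per_record(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


def split_members(data):
    """Yield the decompressed bytes of each gzip member, one at a time."""
    while data:
        # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(31)
        yield decompressor.decompress(data) + decompressor.flush()
        # Whatever follows the end of this member is the next member.
        data = decompressor.unused_data
```

Because every member is self-contained, a reader that knows a record's byte offset can seek there and decompress just that record. Reading the whole file still works as usual: `gzip.decompress(gzip_per_record([b"one", b"two"]))` gives back the records joined together.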


WARC record 

* header
* content block
* two newlines (CRLF CRLF)
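The three parts can be put together by hand. The sketch below assumes CRLF line endings and the WARC/1.0 version line, and fills in the WARC-Type, WARC-Record-ID, WARC-Date and Content-Length fields. The helper name `build_record` is made up for this example, and no digests or other optional fields are generated.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(warc_type, block, extra_fields=()):
    """Return one WARC record as bytes: header, content block, two CRLFs."""
    fields = [
        ("WARC-Type", warc_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        *extra_fields,
        # Content-Length counts the octets of the content block only.
        ("Content-Length", str(len(block))),
    ]
    header = b"WARC/1.0" + CRLF + b"".join(
        ("%s: %s" % (name, value)).encode("utf-8") + CRLF
        for name, value in fields
    )
    # A blank line ends the header; two CRLFs end the record.
    return header + CRLF + block + CRLF + CRLF
```

For example, `build_record("resource", b"hello", [("WARC-Target-URI", "http://example.com/")])` produces a record whose last bytes are `Content-Length: 5`, a blank line, `hello`, and the two closing CRLFs.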

WARC record header 

The beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line. The WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and [RFC2822] headers, with one major exception, allowing UTF-8 [RFC3629].


Example of a 'request' record header:

 WARC/1.0
 WARC-Type: request
 WARC-Target-URI: http://xbox.gamespy.com/
 Content-Type: application/http;msgtype=request
 WARC-Date: 2013-04-02T16:12:40Z
 WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
 WARC-IP-Address: 213.248.112.146
 WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
 WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
 Content-Length: 150

WARC named fields

* A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
* Named fields may appear in any order.
* Field values may contain any UTF-8 character.
* The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
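A small sketch of how the rules above might be applied when reading a header. It takes the already-decoded header lines that follow the version line and returns (name, value) pairs, joining indented continuation lines onto the previous value. The function name and the X-Long-Field name in the usage example are hypothetical, and RFC2047 encoded-words are not decoded here.

```python
def parse_named_fields(lines):
    """Parse header lines (after the version line) into (name, value) pairs."""
    fields = []
    for line in lines:
        if not line.strip():
            break  # a blank line ends the header
        if line[:1] in (" ", "\t") and fields:
            # Indented line: continuation of the previous field's value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
        else:
            # Split on the first colon only, since values may contain colons.
            name, _, value = line.partition(":")
            fields.append((name.strip(), value.strip()))
    return fields
```

A list of pairs is used rather than a dict so that the original order and any repeated field names are kept. Usage, with values borrowed from the request example:

```python
fields = parse_named_fields([
    "WARC-Type: request",
    "WARC-Target-URI: http://xbox.gamespy.com/",
    "X-Long-Field: first part",
    "  second part",
    "Content-Length: 150",
])
# fields[2] is ("X-Long-Field", "first part second part")
```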

WARC content block 

Part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record.
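The request example above carries a WARC-Block-Digest field. Its value looks like a SHA-1 hash of the content block, written in Base32 with a "sha1:" prefix. Under that assumption (inferred from the example value, not confirmed by this page), a digest in the same shape could be computed as follows. The function name is made up for illustration.

```python
import base64
import hashlib


def block_digest(block):
    """Return a 'sha1:' + Base32 digest string for a content block (bytes)."""
    digest = hashlib.sha1(block).digest()
    # SHA-1 is 20 bytes, which Base32-encodes to 32 characters, no padding.
    return "sha1:" + base64.b32encode(digest).decode("ascii")


# A zero-length content block is allowed ("zero or more octets"):
# block_digest(b"") -> "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"
```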


CDX File Format

* http://archive.org/web/researcher/cdx_legend.php