The WARC Ecosystem

From Archiveteam

Everything about the WARC format and the tools that support it.




Each tool entry below lists the following, in order, where known:

 1 license
 2 programming language
 3 test suite
 4 has documentation
 5 number of authors
 6 description

wget v1.14+

 * GPL v3+
 * C
 * Has a test suite, but it does not test any WARC functionality
 * Man pages, website, blog posts all over the net
 * 2+ according to the changelog
 * A non-interactive network downloader. Note that wget also generates duplicate record IDs in WARC files.

More information about flags can be found on the Wget with WARC output page.

warc python library

 * GPL v2
 * Python
 * Appears to have a test suite
 * A readme with examples is available online
 * 3 committers on GitHub
 * Library for working with WARC files


 * BSD
 * python
 * A readme file.
 * 1 author
 * a simple HTTP proxy that saves all HTTP traffic to a file


 * MIT License
 * python 2.6
 * A readme file
 * 4 committers
 * WARC validator, dump, search, and index tools; also converts ARC to WARC

WARC viewer

 * no license information
 * python
 * A readme file
 * 1 author
 * WARC viewer for browsing the contents of a WARC file.
 - needs a Firefox add-on installed to work


megawarc

 * no license information
 * python
 * A readme file
 * 1 author
 * Merge many small warcs into a large one

Checks if WARC files can be un-gzipped before adding them to the megawarc. Does not check anything else.
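A check of this kind can be sketched in a few lines of Python. This is not the megawarc code itself, only a minimal illustration of "can this file be un-gzipped?"; the function name `can_be_ungzipped` and the chunk size are assumptions made for the example.

```python
import gzip
import zlib


def can_be_ungzipped(path, chunk_size=1024 * 1024):
    """Return True if the whole file at `path` decompresses without error.

    Hypothetical helper: it only proves the gzip stream is intact and says
    nothing about whether the contents are valid WARC records.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Read and discard the data in chunks so large files
            # do not have to fit in memory.
            while stream.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers "not a gzip file", EOFError covers truncation,
        # zlib.error covers corrupted compressed data.
        return False
```

A file that fails this check would be left out of the merged archive; a file that passes may still contain malformed records.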

warc to zip

 * no license information
 * python
 * A readme file
 * 1 author
 * An HTTP-based warc-to-zip converter


 * GPL v3
 * Python 3
 * Has a test suite
 * A readme file.
 * 1 author
 * Web ARChive (WARC) Archiving Tool

Archive Team megawarc factory

 * no license information
 * Bash shell scripting
 * A readme file.
 * 1 author
 * Generates 50 GB WARC files from existing WARC files

Uploads to

CDX Writer

 * no license information
 * python
 * Has a test suite
 * A readme file.
 * 1 author
 * Create CDX index files from WARC files.


Heritrix

 * Apache v2.0
 * java
 * Has a test suite
 * javadoc, website
 * many authors
 * Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Heritrix-Cassandra: A library for writing Heritrix 3 output directly to Cassandra as records.

DeDuplicator (Heritrix add-on): An add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.

Chrome/Chromium plugin WARCreate

 * no license information
 * javascript
 * Test suite: unknown
 * Documentation: none
 * 1 author
 * WARCreate is a Google Chrome extension that allows a user to create a Web ARChive (WARC) file from any browseable webpage.


The WARC format

  • A .warc file is a concatenation of one or more WARC records.
  • The first record usually describes the records that follow (typically a 'warcinfo' record).
  • Compression is optional.
  • When compression is used, each record is compressed separately with gzip and the results are concatenated. This works because a gzip file supports multiple "members".
  • Compressed WARC files end in .warc.gz.
  • According to the guidelines, WARC files should top out at 1 GB.
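The per-record compression described above can be illustrated with Python's standard library. This is a minimal sketch, not taken from any of the tools listed on this page; `gzip_records` and `iter_members` are hypothetical names, and the records are placeholder byte strings rather than real WARC records.

```python
import gzip
import io
import zlib


def gzip_records(records):
    """Compress each record as its own gzip member and concatenate them."""
    out = io.BytesIO()
    for record in records:
        out.write(gzip.compress(record))
    return out.getvalue()


def iter_members(data):
    """Yield the decompressed bytes of each gzip member in `data`."""
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        yield decomp.decompress(data) + decomp.flush()
        # Whatever follows the end of this member belongs to the next one.
        data = decomp.unused_data
```

Because every record is a separate member, a reader that knows a record's byte offset can seek to it and decompress only that record, while a plain `gzip.decompress` of the whole file still returns all records back to back.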

WARC record

  • header
  • content block
  • two newlines (CRLF CRLF)
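The three parts listed above can be assembled as follows. This is a sketch under stated assumptions: the function name `build_record` is hypothetical, the version line `WARC/1.0` and the particular set of fields written are choices made for the example, and no validation is done.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(warc_type, block, extra_fields=None):
    """Serialize one uncompressed record: header, content block, two CRLFs."""
    fields = [
        ("WARC-Type", warc_type),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("Content-Length", str(len(block))),
    ]
    fields.extend(extra_fields or [])

    # Assumed version line; the header starts with it.
    header = b"WARC/1.0" + CRLF
    for name, value in fields:
        header += ("%s: %s" % (name, value)).encode("utf-8") + CRLF

    # A blank line ends the header, then comes the block,
    # then the two newlines that close the record.
    return header + CRLF + block + CRLF + CRLF
```

For example, `build_record("resource", b"hello")` yields bytes that begin with the version line and end with `hello` followed by two CRLFs.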

WARC record header

The beginning of a WARC record, consisting of a first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line. The WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and [RFC2822] headers, with one major exception: UTF-8 [RFC3629] is allowed.

Example of the named fields of a 'request' record header (the leading version line is not shown):

 WARC-Type: request
 Content-Type: application/http;msgtype=request
 WARC-Date: 2013-04-02T16:12:40Z
 WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
 WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
 Content-Length: 150

WARC named fields

  • A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
  • Named fields may appear in any order.
  • Field values may contain any UTF-8 character.
  • The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
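The rules above (name, colon, value, continuation on indented lines) can be sketched as a small parser. `parse_header` is a hypothetical name, the input is assumed to be a complete raw header using CRLF line endings, and RFC 2047 encoded-words are deliberately left undecoded to keep the example short.

```python
def parse_header(raw):
    """Split a raw record header into (version_line, [(name, value), ...])."""
    lines = raw.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = []
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header
        if line[0] in " \t" and fields:
            # An indented line continues the value of the previous field.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
        else:
            # Split on the first colon only; values such as
            # <urn:uuid:...> contain colons of their own.
            name, _, value = line.partition(":")
            fields.append((name.strip(), value.strip()))
    return version, fields
```

The result is kept as a list of pairs rather than a dictionary so that the original field order survives, even though the format does not attach meaning to it.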

WARC content block

Part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record.
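Since the block is just a run of octets, a reader finds its end from the Content-Length field in the header rather than by scanning for a delimiter. The sketch below reads one uncompressed record at a time from a binary stream; `read_record` is a hypothetical name, and continuation lines in the header are ignored for brevity.

```python
import io


def read_record(stream):
    """Read one uncompressed record; return (fields, block) or None at end of file."""
    version = stream.readline()
    if not version:
        return None  # nothing left to read

    fields = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # blank line: the header is complete
        name, _, value = line.decode("utf-8").partition(":")
        fields[name.strip().lower()] = value.strip()

    # The block is exactly Content-Length octets, possibly zero.
    block = stream.read(int(fields["content-length"]))
    stream.read(4)  # skip the two CRLFs that close the record
    return fields, block
```

Calling `read_record` repeatedly on the same stream walks through the file record by record until it returns `None`.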

CDX File Format