Wget with WARC output

From Archiveteam
Revision as of 20:28, 25 June 2020 by Stuartyeates (talk | contribs) (add details of limitations)

From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers. With Wget, that's difficult. With a few tricks you can keep the response headers, but there is no option to save the request headers. You also lose every response that doesn't produce an HTML page: Wget doesn't save redirects and 404 responses.

Since version 1.14[1], Wget supports writing to a WARC (Web ARChive) file, just like Heritrix and other archiving tools. With the WARC format, both the request and the response headers get saved. It also provides a clean way to store redirects and 404 responses.

There is an additional advantage: if Wget writes these headers to a WARC file, it is no longer necessary to use the --save-headers option to save them at the top of each downloaded file. There is no need to remove these headers afterwards to produce a clean copy: the mirror produced by Wget is usable without post-processing.

Wget's WARC file support (as at 1.19.5 / RHEL8) is relatively incomplete and immature compared to specialist archiving systems. In particular: (a) only HTTP(S) requests and replies are stored, not auxiliary content such as DNS queries or the PKI exchanges used to negotiate HTTPS connections, and (b) Wget overwrites existing WARC files (but not index files) if you're not very careful.
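One way to guard against the overwrite problem is to check for existing output before starting Wget. The following is a minimal sketch, not a Wget feature: the helper name warc_prefix_free is made up, and it assumes the WARC files are written to the current directory and named after the --warc-file prefix.

```shell
# Hypothetical helper, not part of Wget: succeeds only if no WARC file
# based on the given prefix exists in the current directory.
warc_prefix_free() {
    prefix=$1
    for f in "$prefix".warc "$prefix".warc.gz \
             "$prefix"-*.warc "$prefix"-*.warc.gz; do
        # An unmatched glob stays a literal pattern, so test for a real file.
        if [ -e "$f" ]; then
            echo "refusing to overwrite existing WARC file: $f" >&2
            return 1
        fi
    done
    return 0
}

# Usage (performs a network request, so it is left commented out):
# warc_prefix_free at && wget "http://www.archiveteam.org/" --warc-file="at"
```

Because the check and the download are separate steps, this only protects against accidental reruns, not against two crawls started at the same moment.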

Usage

To download a file and save the request and response data to a WARC file, run this:

wget "http://www.archiveteam.org/" --warc-file="at"

This will download the file to index.html, but it will also create a gzipped WARC file. Released Wget versions name it at.warc.gz; numbered files such as at-00000.warc.gz appear when --warc-max-size is set. The WARC file contains the request and response headers (of the initial redirect and of the wiki homepage) and the HTML data.

If you want to have an uncompressed WARC file, use the --no-warc-compression option:

wget "http://www.archiveteam.org/" --warc-file="at" --no-warc-compression

Saving one file is nice, but the --warc-file option becomes even more powerful if you combine it with Wget's --mirror option (you may want to try this with a smaller site than the AT wiki):

wget "http://www.archiveteam.org/" --mirror --warc-file="at"

If you uncompress the resulting WARC file (at.warc.gz, or at-00000.warc.gz when --warc-max-size is set) and look at it, you'll see that it contains WARC records for every request and response: it is a complete copy of the mirrored site, while at the same time Wget also created the normal mirror of the site.
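For a quick look at what ended up in the archive without reading the whole file, you can count the records by type. This is a rough sketch, assuming gzip, tr, awk and sort are available; the function name warc_record_types is made up for this example. It matches lines starting with "WARC-Type:" anywhere in the stream, so it is a sanity check rather than a real WARC parser.

```shell
# Hypothetical helper: print "<count> <type>" for each WARC-Type value
# found in a gzipped WARC file, sorted by type name.
# Steps: decompress, strip the CR of the CRLF line endings, count the
# WARC-Type header lines per value, then sort the summary.
warc_record_types() {
    gzip -dc "$1" |
        tr -d '\r' |
        LC_ALL=C awk '/^WARC-Type: / { n[$2]++ }
                      END { for (t in n) print n[t], t }' |
        sort -k2
}

# Usage:
# warc_record_types FILE.warc.gz
```

A mirror run would typically show request, response, warcinfo and similar record types in the summary.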

If you have a list of URLs that you want to write into an archive, but also want to store the original responses for further processing, replace --mirror with --input-file=urls.txt --force-directories:

wget --no-verbose --input-file=urls.txt --force-directories --tries=3 --warc-file="at"

Options

--warc-file=FILENAME enables the WARC export. The WARC file names are based on FILENAME: FILENAME.warc.gz when no size limit is set, or FILENAME-00000.warc.gz, FILENAME-00001.warc.gz et cetera when --warc-max-size is used.

--warc-max-size=NUMBER defines the maximum size of the WARC files. The default is no limit ("inf"). If you download a large site, the recommended limit is 1 GB; set the option to 1G to enable it. Note that this is a soft limit: files can get slightly larger than this, depending on the files you download.

--warc-header=STRING adds STRING as a custom header to the warcinfo record, e.g. "operator: Archive Team". This option can be used multiple times.
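Put together, a mirror of a large site might combine these options as in the sketch below. The wrapper name warc_mirror and its three arguments are made up for illustration; only the wget options themselves come from the descriptions in this section.

```shell
# Hypothetical wrapper, not part of Wget: mirror URL into WARC files
# named after PREFIX, split at roughly 1 GB, with an operator header
# added to the warcinfo record.
warc_mirror() {
    url=$1
    prefix=$2
    operator=$3
    wget --mirror \
         --warc-file="$prefix" \
         --warc-max-size=1G \
         --warc-header="operator: $operator" \
         "$url"
}

# Usage (starts a real crawl, so it is commented out here):
# warc_mirror "http://www.archiveteam.org/" at "Archive Team"
```

Quoting the --warc-header value matters because the header contains a space.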

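--warc-cdx writes a CDX index file. In released Wget versions this option takes no argument, and the index is named after the --warc-file prefix (FILENAME.cdx). The CDX file will contain a list of the records and their locations in the WARC files.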

--warc-dedup=FILENAME can be used to reduce the size of WARC files generated by a recrawl. FILENAME should point to a CDX file, generated with --warc-cdx in a previous run. For each file it downloads, Wget will check the CDX file to see if the response is listed there. If the exact file already exists, a "revisit" record with a reference to the previous record will be added to the WARC file, instead of a duplicate "response" record. Duplicate records are detected by comparing the SHA-1 digest of the payload of the response.
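A recrawl with deduplication therefore takes two runs: one that writes a CDX index, and a later one that reads it. The sketch below wraps both cases in one function. It assumes the behaviour of released Wget versions, where --warc-cdx takes no argument and names the index after the --warc-file prefix; the function name warc_recrawl and the prefixes at-run1 and at-run2 are hypothetical.

```shell
# Hypothetical wrapper, not part of Wget: crawl URL into NEW_PREFIX and
# write a fresh CDX index. If OLD_CDX names an existing CDX file from a
# previous run, unchanged responses are stored as "revisit" records.
warc_recrawl() {
    url=$1
    new_prefix=$2
    old_cdx=${3:-}
    if [ -n "$old_cdx" ] && [ -f "$old_cdx" ]; then
        wget --mirror --warc-file="$new_prefix" --warc-cdx \
             --warc-dedup="$old_cdx" "$url"
    else
        wget --mirror --warc-file="$new_prefix" --warc-cdx "$url"
    fi
}

# Usage (network access, so commented out):
# warc_recrawl "http://www.archiveteam.org/" at-run1               # first crawl
# warc_recrawl "http://www.archiveteam.org/" at-run2 at-run1.cdx   # recrawl
```

Using a new prefix for each run keeps the earlier WARC files intact, which matters because the revisit records refer back to them.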

--no-warc-compression will write uncompressed WARC files. Compression is enabled by default. It is better to use the built-in compression than to compress the WARC files afterwards. The built-in compression will compress each record as an individual GZIP block, which allows other utilities to extract single records from the file.
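The per-record compression works because a gzip file may consist of several independently compressed members placed back to back. The sketch below illustrates that property with plain gzip and two placeholder strings instead of real WARC records; it does not involve Wget, and the file name demo.gz is arbitrary.

```shell
# Demo of concatenated gzip members, using placeholder text rather than
# real WARC records. Each "record" is compressed on its own and appended.
demo=$(mktemp -d)
printf 'record one\n' | gzip >  "$demo/demo.gz"
printf 'record two\n' | gzip >> "$demo/demo.gz"

# Decompressing the whole file yields both records in order.
gzip -dc "$demo/demo.gz"

# The first member is a complete gzip stream by itself: cut the file at
# the size of a separately compressed first record and decompress that.
first_len=$(( $(printf 'record one\n' | gzip | wc -c) ))
head -c "$first_len" "$demo/demo.gz" | gzip -dc
```

Compressing a finished WARC file as a whole would instead produce a single member, so a reader would have to decompress from the start to reach any given record.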

--no-warc-digests disables the SHA-1 digests. By default, SHA-1 digests will be calculated for the whole response block and the response payload. If you really need to, you can disable that.

--no-warc-keep-log can be set if you don't want the Wget log in the WARC file. By default, Wget will add the log file as a separate record to the WARC file.

--warc-tempdir=DIRECTORY sets the temporary directory used by the WARC writer. The system tempdir will be used by default.

WARC file format

The WARC file format is an ISO standard. The official specification, ISO 28500:2009, is not available for free. However, the WARC 1.0 final draft and the latest WARC 1.1 draft are free, and are supposed to be technically equivalent to the official standard.

The WARC usage task force has published WARC implementation guidelines with additional recommendations.