Wget with WARC output

From Archiveteam
Revision as of 14:18, 16 January 2017 by Megalanya0

From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers. With Wget, that's difficult. With a few tricks you can keep the response headers, but there is no option to save the request headers. You also lose the response headers that don't produce an HTML page: Wget doesn't save redirects and 404 responses.

Since version 1.14[1], Wget supports writing to a WARC file (Web ARChive file format), just like Heritrix and other archiving tools. With the WARC format, both the request and the response headers get saved. It also provides a clean way to store redirects and 404 responses.
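To make the idea of saving both request and response headers concrete, here is a minimal Python sketch that reads records from an uncompressed WARC stream. The sample data is hand-built and simplified (real records carry more named fields, such as WARC-Record-ID and WARC-Date), and the helper names read_warc_record and make_record are made up for this illustration.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record; return (headers, block) or None at EOF."""
    version = stream.readline()
    if not version:
        return None
    if not version.startswith(b"WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the WARC header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    block = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # each record is followed by two CRLF pairs
    return headers, block


def make_record(warc_type, uri, block):
    """Build a simplified WARC record (illustration only, not a complete record)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(block)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + block + b"\r\n\r\n"


# A request and the redirect it received, stored as two separate records.
sample = make_record(
    "request",
    "http://example.com/old",
    b"GET /old HTTP/1.1\r\nHost: example.com\r\n\r\n",
) + make_record(
    "response",
    "http://example.com/old",
    b"HTTP/1.1 301 Moved Permanently\r\nLocation: /new\r\n\r\n",
)

stream = io.BytesIO(sample)
records = []
while (record := read_warc_record(stream)) is not None:
    records.append(record)

for headers, block in records:
    first_line = block.split(b"\r\n", 1)[0].decode("ascii")
    print(headers["WARC-Type"], "|", first_line)
```

The redirect is kept as an ordinary record with its status line and headers intact, even though it would never appear as a file in a plain mirror.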

There is an additional advantage: if Wget writes these headers to a WARC file, it is no longer necessary to use the --save-headers option to save them at the top of each downloaded file. There is no need to remove these headers afterwards to produce a clean copy: the mirror produced by Wget is usable without post-processing.

Options

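--warc-file=FILENAME enables the WARC export. The names of the WARC files are based on FILENAME. When a size limit is set with --warc-max-size, the files are numbered: FILENAME-00000.warc.gz, FILENAME-00001.warc.gz and so on. Without a limit, a single FILENAME.warc.gz is written.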

--warc-max-size=NUMBER defines the maximum size of the WARC files. The default is no limit ("inf"). If you download a large site, the recommended limit is 1GB: set the option to 1G to enable it. Note that this is a soft limit: files can get slightly larger than this, depending on the files you download.

--warc-header=STRING adds STRING as a custom header to the warcinfo record, e.g. "operator: Archive Team". This option can be used multiple times.

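--warc-cdx writes a CDX index file. In released versions of Wget the option takes no argument, and the index is written to FILENAME.cdx, where FILENAME is the name given to --warc-file. The CDX file will contain a list of the records and their locations in the WARC files.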

--warc-dedup=FILENAME can be used to reduce the size of WARC files generated by a recrawl. FILENAME should point to a CDX file, generated with --warc-cdx in a previous run. For each file it downloads, Wget will check the CDX file to see if the response is listed there. If the exact file already exists, a "revisit" record with a reference to the previous record will be added to the WARC file, instead of a duplicate "response" record. Duplicate records are detected by comparing the SHA-1 digest of the payload of the response.
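The digest comparison described for --warc-dedup can be sketched in a few lines of Python. This is an illustration of the idea only, not Wget's code: the function names are invented, the set of known digests stands in for the CDX file, and the real lookup may apply further checks (for example on the URL). The "sha1:" prefix with Base32 encoding follows the form commonly seen in WARC digest fields.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return the SHA-1 of the payload as a labelled Base32 string."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def choose_record_type(payload: bytes, known_digests: set) -> str:
    """Return "revisit" for a payload seen before, otherwise "response"."""
    digest = payload_digest(payload)
    if digest in known_digests:
        return "revisit"
    known_digests.add(digest)
    return "response"


# Hypothetical digests, as if loaded from the CDX file of an earlier crawl.
previous_crawl = {payload_digest(b"<html>unchanged page</html>")}

print(choose_record_type(b"<html>unchanged page</html>", previous_crawl))
print(choose_record_type(b"<html>new page</html>", previous_crawl))
```

Because only the payload is hashed, a page whose body is unchanged still counts as a duplicate even if its response headers (such as Date) differ between crawls.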

--no-warc-compression will write uncompressed WARC files. Compression is enabled by default. It is better to use the built-in compression than to compress the WARC files afterwards. The built-in compression will compress each record as an individual GZIP block, which allows other utilities to extract single records from the file.
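The benefit of compressing each record as its own GZIP block can be shown with Python's standard library. The following is a rough sketch with made-up record contents. It relies only on the fact that concatenated gzip members form a valid gzip stream, and that a decompressor started at a member boundary stops at the end of that member.

```python
import gzip
import zlib

records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]

# Compress each record as its own gzip member and remember where it starts.
archive = b""
offsets = []
for record in records:
    offsets.append(len(archive))
    archive += gzip.compress(record)


def read_member(data: bytes, offset: int) -> bytes:
    """Decompress exactly one gzip member starting at the given offset."""
    decompressor = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
    return decompressor.decompress(data[offset:])


# Extract the second record without touching the first one.
print(read_member(archive, offsets[1]))

# The file as a whole is still readable as an ordinary gzip stream.
print(gzip.decompress(archive) == b"".join(records))
```

A file that is compressed as a whole after the crawl loses this property: reaching a record in the middle then means decompressing everything before it. The offsets play the role that the record locations in a CDX index play for real WARC files.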

--no-warc-digests disables the SHA-1 digests. By default, SHA-1 digests will be calculated for the whole response block and the response payload. If you really need to, you can disable that.

--no-warc-keep-log can be set if you don't want the Wget log in the WARC file. By default, Wget will add the log file as a separate record to the WARC file.

--warc-tempdir=DIRECTORY sets the temporary directory used by the WARC writer. The system tempdir will be used by default.
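As an illustration of how the options above combine, the following Python sketch assembles a Wget command line for a recursive crawl. The function build_wget_warc_command, the URL and the file names are hypothetical. The function only builds and prints the argument list, so it can be inspected before anything is downloaded. The --warc-cdx option is passed here as a flag without an argument, which is how the GNU Wget manual documents it.

```python
import shlex


def build_wget_warc_command(url, warc_name, max_size=None, headers=(),
                            write_cdx=False, dedup_cdx=None):
    """Return an argument list for a recursive Wget crawl with WARC output."""
    command = ["wget", "--mirror", f"--warc-file={warc_name}"]
    if max_size:
        command.append(f"--warc-max-size={max_size}")
    for header in headers:
        # --warc-header may be given several times, once per header.
        command.append(f"--warc-header={header}")
    if write_cdx:
        command.append("--warc-cdx")
    if dedup_cdx:
        # Path to a CDX file written by an earlier crawl.
        command.append(f"--warc-dedup={dedup_cdx}")
    command.append(url)
    return command


command = build_wget_warc_command(
    "http://example.com/",
    "example.com-2017",
    max_size="1G",
    headers=["operator: Archive Team"],
    write_cdx=True,
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=False) to start the crawl. For a later recrawl, pass the CDX file from the first run as dedup_cdx.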

WARC file format

The WARC file format is an ISO standard, ISO 28500:2009. The official specification is not available for free. However, the final draft is free, and is supposed to be technically equivalent to the official standard.

The WARC usage task force has published WARC implementation guidelines with additional recommendations.