Wget

From Archiveteam
Revision as of 16:27, 17 January 2017 by Jscott (talk | contribs) (Reverted edits by Megalanya1 (talk) to last revision by Jscott)

GNU Wget is a free utility for non-interactive download of files from the Web. Using Wget, it is possible to grab a large chunk of data, or mirror an entire website, including its (public) folder structure, using a single command. In the tool belt of the renegade archivist, Wget tends to get an awful lot of use. (Note: Some people prefer to use cURL. If it can back up data, it's useful.)

This guide will not attempt to explain all possible uses of Wget; rather, this is intended to be a concise introduction to Wget, specifically geared towards using it to archive data such as podcasts, PDF documents, or entire websites. Dealing with issues such as user agent checks and robots.txt restrictions will be covered as well.

Installation

Installing Wget should be relatively straightforward on most platforms. If you need more information, check out the article Wget installation.

Mirroring a website

When you run something like this:

wget http://icanhascheezburger.com/

...Wget will just grab the first page it hits, usually something like index.html. If you give it the -m flag:

wget -m http://icanhascheezburger.com/

...then Wget will happily slurp down anything within reach of its greedy claws, putting files in a complete directory structure. Go make a sandwich or something.

You'll probably want to pair -m with -c (which tells Wget to continue partially-complete downloads) and -b (which tells Wget to fork to the background, logging to wget-log).

If you want to grab everything in a specific directory - say, the SICP directory on the mitpress web site - use the -np flag:

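wget -mbc -np http://mitpress.mit.edu/sicp/

This will tell Wget to not go up the directory tree, only downwards. Keep the trailing slash: Wget relies on it to know that sicp is a directory rather than a file, and -np only works as intended when it does.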
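
A minimal sketch of the same idea as a reusable shell function. The name mirror_dir is made up for this example, and the function simply adds a trailing slash when one is missing before handing the URL to Wget:

```shell
# Hypothetical helper: mirror one directory of a site without climbing to its parent.
mirror_dir() {
    url=$1
    # --no-parent uses the trailing slash to tell a directory from a file,
    # so add one if it is missing.
    case $url in
        */) ;;
        *)  url="$url/" ;;
    esac
    # -m mirror, -c continue partial downloads, -np never ascend to the parent
    wget -m -c -np "$url"
}
```

Call it as mirror_dir http://mitpress.mit.edu/sicp and add -b to the wget line if you want it to run in the background.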

User-agents and robots.txt

By default, Wget strictly follows a website's robots.txt directives. In certain situations this will lead to Wget not grabbing anything at all, if for example the robots.txt doesn't allow Wget to access the site.

To avoid this, first try using the --user-agent option:

wget -mbc --user-agent="" http://website.com/

This instructs Wget to not send any user agent string at all. Another option for this is:

wget -mbc -e robots=off http://website.com/

...which tells Wget to ignore robots.txt directives altogether.

You can append --wait 1 to add a delay of one second between requests, to lighten the server load and avoid being blocked, which might happen in certain cases if you make too many requests within too short a time.
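
The options in this section can be combined. The following is only a sketch: the function name polite_mirror and the default user agent string are placeholders invented for this example, and --random-wait is added so the delay between requests varies around the one-second --wait value.

```shell
# Hypothetical helper: mirror a site while ignoring robots.txt,
# sending a chosen user agent and pausing between requests.
polite_mirror() {
    url=$1
    # Placeholder user agent; substitute whatever string the site accepts.
    agent=${2:-"Mozilla/5.0 (compatible; archive-grab)"}
    wget -m -c -e robots=off --user-agent="$agent" --wait 1 --random-wait "$url"
}
```

For example, polite_mirror http://website.com/ uses the placeholder agent, and a second argument overrides it.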

Compression

Wget doesn't use compression by default! This can make a big difference when you're downloading easily compressible data, like human-language HTML text, but doesn't help at all when downloading material that is already compressed, like JPEG or PNG files. To enable compression, use:

wget --header="accept-encoding: gzip"

This will produce a file (if the remote server supports gzip compression) that uses the .html extension, but is actually gzip-encoded, which can be confusing.
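
If you end up with such files, something like the following sketch can sort them out afterwards. The function name fix_gzipped_page is invented for this example; it decompresses a file in place only when the file really is gzip data and leaves ordinary HTML alone.

```shell
# Hypothetical helper: decompress a saved page in place if it is really gzip data.
fix_gzipped_page() {
    file=$1
    # gzip -t exits non-zero for anything that is not valid gzip data,
    # so uncompressed HTML is left untouched.
    if gzip -t < "$file" 2>/dev/null; then
        gzip -dc < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    fi
}
```

To run it over a whole mirror, you could pair it with find and a loop over the .html files.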

Any vaguely modern server can sustain thousands of simultaneous text downloads, with video or large images being the big ticket items. But sites using outdated hardware, or run by habitual whiners, will complain when a site scrape uses 200 megabytes of transfer when it could have used 100.

Creating WARC with wget

If you wish to create a WARC file (which includes an entire mirror of a site), you will want something like this:

 export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
 export SAVE_HOST="example.com"
 export WARC_NAME="example.com-panicgrab-20130611"
 wget \
 -e robots=off --mirror --page-requisites \
 --waitretry 5 --timeout 60 --tries 5 --wait 1 \
 --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" \
 -U "$USER_AGENT" "$SAVE_HOST"

You can find out more about Wget with WARC output.

You can even wrap this in a shell function:

# Usage: quick-warc example.com
function quick-warc {
        # Skip hosts that have already been grabbed
        if [ -f "$1.warc.gz" ]
        then
                echo "$1.warc.gz already exists"
        else
                wget --warc-file="$1" --warc-cdx --mirror --page-requisites --no-check-certificate --restrict-file-names=windows \
                -e robots=off --waitretry 5 --timeout 60 --tries 5 --wait 1 \
                -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27" \
                "http://$1/"
        fi
}
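
Once a grab finishes, it is worth a quick sanity check of the resulting .warc.gz. The sketch below is an assumption about what such a check might look like, not part of Wget: the function name warc_response_count is made up, and the count is only rough, since it matches any line that begins with the response header text.

```shell
# Hypothetical helper: check a .warc.gz and print a rough count of response records.
warc_response_count() {
    warc=$1
    # A truncated or corrupted file usually fails this integrity check.
    gzip -t < "$warc" || return 1
    # -a treats the decompressed stream as text, -c prints the number of matches
    gzip -dc < "$warc" | grep -a -c '^WARC-Type: response'
}
```

For example, warc_response_count example.com-panicgrab-20130611.warc.gz prints a single number, and a result of 0 suggests the grab fetched nothing useful.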


Forum Grab

src/wget --save-cookies team17-cookies.txt --post-data 'vb_login_username=USERNAMEGOESHERE&vb_login_password=PASSWORDGOESHERE&securitytoken=guest&cookieuser=1&do=login' "http://forum.team17.com/login.php?do=login"
src/wget --load-cookies team17-cookies.txt -e robots=off --wait 0.25 --mirror --warc-file="at-team17-forum" "http://forum.team17.com/"

Wordpress Grab

wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path>

Lua Scripting

If you need fine-grained control over Wget's behavior while it downloads, use a version of Wget with Lua hooks.

Tricks and Traps

  • A standard methodology to prevent scraping of websites is to block access via user agent string. Wget is a good web citizen and identifies itself. Renegade archivists are not good web citizens in this sense. The --user-agent option will allow you to act like something else.
  • Some websites are actually aggregates of multiple machines and subdomains, working together. (For example, a site called dyingwebsite.com will have additional machines like download.dyingwebsite.com or mp3.dyingwebsite.com.) To account for this, add the following options, using the site's own domain: -H -Ddyingwebsite.com
  • If you do not want Wget to keep the original files while making a WARC, use Wget with Lua hooks and the --output-document and --truncate-output options. Together, these options treat the output document as a temporary file. When making a WARC file, always use them together; otherwise the output file keeps growing and performance suffers.
  • Wget mistakes certain UTF-8 characters in the original filenames for control characters and happily escapes them, turning the filenames into garbage. If your system supports UTF-8 filenames (probably), you can turn the escaping off by using the --restrict-file-names=nocontrol option. Fortunately, the contents of the .warc files should be unaffected by the escaping.
    • Accidentally bitten by this "feature" already? Try this C program that recursively unescapes the filenames.
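
Two of the tricks above, spanning subdomains and keeping UTF-8 filenames intact, can be sketched as one function. The name mirror_with_subdomains is invented for this example, and it assumes the site is reachable over plain http at its bare domain:

```shell
# Hypothetical helper: mirror a site together with its subdomains.
mirror_with_subdomains() {
    domain=$1
    # -H lets Wget leave the starting host; -D limits that to hosts under $domain.
    # nocontrol stops Wget from escaping UTF-8 characters in file names.
    wget -m -c -H -D"$domain" --restrict-file-names=nocontrol "http://$domain/"
}
```

Running mirror_with_subdomains dyingwebsite.com would also follow links to hosts such as download.dyingwebsite.com.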

Parallel downloading

http://keramida.wordpress.com/2010/01/19/parallel-downloads-with-python-and-gnu-wget/
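
The article linked above uses Python. If you would rather stay in the shell, a rough alternative is to let xargs run several Wget processes at once. This is a different technique from the linked one, and the function name fetch_parallel and the default of four jobs are assumptions made for this sketch. It expects a text file with one URL per line.

```shell
# Hypothetical helper: download every URL listed in a file, several at a time.
fetch_parallel() {
    list=$1
    jobs=${2:-4}
    # -n 1 starts one wget per URL; -P runs up to $jobs of them at once.
    # -c continues partial downloads, -nv keeps the output short.
    xargs -n 1 -P "$jobs" wget -c -nv < "$list"
}
```

For example, fetch_parallel urls.txt 8 runs up to eight downloads at a time. Keep the job count low on small sites, since parallel requests multiply the load on the server.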

Essays and Reading on the Use of WGET