Wget
GNU Wget is a free utility for non-interactive download of files from the Web. Using wget, it is possible to grab a large chunk of data, or to mirror an entire website with its complete directory tree, with a single command. In the tool belt of the renegade archivist, Wget tends to get an awful lot of use. (Note: some people prefer to use cURL.)
This guide will not attempt to explain all possible uses of wget; rather, it is intended as a concise intro to using wget, specifically geared towards using the tool to archive data such as podcasts, PDFs, or entire websites. Issues such as using wget to circumvent user-agent checks or robots.txt restrictions will be outlined as well.
Mirroring a website
When you run something like this:
wget http://icanhascheezburger.com/
...wget will just grab the first page it hits, usually something like index.html. If you give it the -m flag:
wget -m http://icanhascheezburger.com/
...then wget will happily slurp down anything within reach of its greedy claws, putting files in a complete directory structure. Go make a sandwich or something.
You'll probably want to pair -m with -c (which tells wget to continue partially complete downloads) and -b (which tells wget to fork to the background, logging to wget-log).
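For example, one way these flags might be combined (the URL is only a placeholder; substitute the site you actually want to mirror):

wget -mcb http://example.com/

Because -b detaches wget from the terminal, you can watch progress with tail -f wget-log.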
If you want to grab everything in a specific directory - say, the SICP directory on the MIT Press website - use the -np (no-parent) flag:
wget -mbc -np http://mitpress.mit.edu/sicp
This tells wget not to go up the directory tree, only downwards.
Tricks and Traps
- A standard method of preventing websites from being scraped is to block access based on the user-agent string. Wget is a good web citizen and identifies itself. Renegade archivists are not good web citizens in this sense. The --user-agent option lets you identify as something else (see the example after this list).
- Some websites are actually aggregates of multiple machines and subdomains working together. (For example, a site called dyingwebsite.com may have additional machines like download.dyingwebsite.com or mp3.dyingwebsite.com.) To account for this, add the following options: -H -Ddomain.com. The -H flag allows wget to span hosts, and -D restricts that spanning to the listed domain and its subdomains. See the example below.
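Putting both tricks together, a mirroring command might look something like this (the user-agent string is just a placeholder - any browser-like string will do - and dyingwebsite.com stands in for the real site):

wget -mbc -H -Ddyingwebsite.com --user-agent="Mozilla/5.0 (compatible; renegade-archivist)" http://dyingwebsite.com/

This keeps the mirror, background, and continue flags from above, lets wget follow links onto the site's other subdomains while staying within dyingwebsite.com, and presents a non-wget user-agent to get past naive user-agent blocks.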
Essays and Reading on the Use of Wget
- Mastering WGET by Gina Trapani
- Using wget or curl to download web sites for archival by Phil Sung