Wget (https://wiki.archiveteam.org/index.php?title=Wget&diff=313), revision by Joseph, 2009-01-15T20:33:54Z

[http://www.gnu.org/software/wget/ GNU Wget] is a free utility for non-interactive download of files from the Web. Using wget, it is possible to grab a large chunk of data, or to mirror an entire website with its complete directory tree, with a single command. In the tool belt of the renegade archivist, Wget tends to get an awful lot of use. (Note: some people prefer to use [http://curl.haxx.se/ cURL].)

This guide will not attempt to explain all possible uses of wget; rather, it is intended as a concise introduction to wget, specifically geared towards using the tool to archive data such as podcasts, PDFs, or entire websites. Issues such as using wget to circumvent user-agent checks or robots.txt restrictions are outlined as well.

== Mirroring a website ==

When you run something like this:
<pre>
wget http://icanhascheezburger.com/
</pre>
...wget will just grab the first page it hits, usually something like index.html. If you give it the -m flag:
<pre>
wget -m http://icanhascheezburger.com/
</pre>
...then wget will happily slurp down anything within reach of its greedy claws, putting files in a complete directory structure. Go make a sandwich or something.

You'll probably want to pair -m with -c (which tells wget to continue partially-complete downloads) and -b (which tells wget to fork to the background, logging to wget-log).
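
For example, a typical long-running mirror job (the URL here is just a stand-in) might be kicked off like this, with output going to wget-log in the current directory:
<pre>
wget -m -c -b http://example.com/
</pre>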

If you want to grab everything in a specific directory - say, the SICP directory on the MIT Press website - use the -np flag:
<pre>
wget -mbc -np http://mitpress.mit.edu/sicp
</pre>

This tells wget not to go up the directory tree, only downwards.
== User-agents and robots.txt ==

By default, wget plays nicely with a website's robots.txt. This can lead to situations where wget won't grab anything, because the robots.txt disallows it.

To get around this, first try the --user-agent option:
<pre>
wget -mbc --user-agent="" http://website.com/
</pre>
This instructs wget to not send any user-agent string at all. Another option is:
<pre>
wget -mbc -erobots=off http://website.com/
</pre>
...which tells wget to ignore robots.txt directives altogether.

== Tricks and Traps ==

* A standard way to prevent scraping of a website is to block access based on the user-agent string. Wget is a good web citizen and identifies itself; renegade archivists are not good web citizens in this sense. The '''--user-agent''' option lets you pose as something else (see the example after this list).
* Some websites are actually aggregates of multiple machines and subdomains working together. (For example, a site called ''dyingwebsite.com'' may have additional machines like ''download.dyingwebsite.com'' or ''mp3.dyingwebsite.com''.) To account for this, add the following options: '''-H -Ddyingwebsite.com''' (span hosts, but stay within the listed domains).
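
A rough sketch putting both of these together - note that the user-agent string and the domain are made-up examples, not values from any particular site:
<pre>
wget -mbc -H -Ddyingwebsite.com --user-agent="Mozilla/5.0 (X11; Linux x86_64)" http://dyingwebsite.com/
</pre>
The -D list is what keeps -H from wandering off onto unrelated sites that the mirrored pages happen to link to.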

== Essays and Reading on the Use of Wget ==

* [http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php Mastering WGET] by Gina Trapani
* [http://psung.blogspot.com/2008/06/using-wget-or-curl-to-download-web.html Using wget or curl to download web sites for archival] by Phil Sung
* [http://linux.about.com/od/commands/l/blcmdl1_wget.htm about.com Wget] list of commands

User:Joseph (https://wiki.archiveteam.org/index.php?title=User:Joseph&diff=283), new page by Joseph, 2009-01-13T00:12:15Z

'''OCD-rich individuals who want to download things'''

Ficlets (https://wiki.archiveteam.org/index.php?title=Ficlets&diff=282), revision by Joseph, 2009-01-13T00:10:12Z

From the site: "Tiny fictional snippets that tell a short story. Each author adds to the larger narrative." It will [http://www.peopleconnectionblog.com/2008/12/02/ficlets-will-be-shut-down-permanently/ shut down] on January 15, 2009.

== Backup Tools ==

* [[Wget]] (a minimal example command is sketched below)
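
A minimal sketch of the sort of command the [[Wget]] page describes for a full-site grab - treat the exact flags as a starting point, not a prescription:
<pre>
wget -mbc -erobots=off --user-agent="" http://ficlets.com/
</pre>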

'''Also of note:'''

[http://www.peopleconnectionblog.com/2008/12/02/ficlets-will-be-shut-down-permanently/#c15939294 Comment from the creator of Ficlets (Dec. 3, 2008)]: ''I've already written an exporter and have all the stories (the ones not marked "mature" anyway). I have pretty much all of the author bios too. Since I was smart enough to insist that AOL license all the content under Creative Commons, I'll be launching a "ficlets graveyard" on 1/16 so at least the stories that people worked so hard on will live on.'' [http://lawver.net/ This guy might be worth contacting]

== Vital Signs ==

Soon to be dead.

== Who's Working On It? ==

Lore is grabbing it.

Joseph

Talk:Main Page (https://wiki.archiveteam.org/index.php?title=Talk:Main_Page&diff=274), revision by Joseph, 2009-01-12T22:21:35Z

Maybe we should tell each other what sites we have archived on our boxes? I just started on http://ficlets.com/. Joseph 12/01/09 5:20pm