Difference between revisions of "Working with ARCHIVE.ORG"

Revision as of 17:30, 19 May 2013

The Internet Archive has enormous resources at its disposal, and shares many parallel goals with Archive Team. There are some advantages to making as much of Archive Team's data saves available to ARCHIVE.ORG, especially as regards the Wayback Machine, a mechanism to be able to browse older web materials going back years. Where possible, Archive Team should try to work with ARCHIVE.ORG, and get the love going.

Headers and Logs

The most striking difference between Archive Team and ARCHIVE.ORG is that while Archive Team traditionally treats header and logging information as a nice extra if you have the time, ARCHIVE.ORG absolutely needs it to import a grab into the Wayback Machine. With WGET, the best way to do this is to keep the log files and to pass --save-headers, which writes the HTTP response headers at the top of each saved file, ahead of the content.
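As a rough sketch of what that could look like in a script, the helper below saves one URL with its headers and writes WGET's log next to it. The function name grab_with_headers, the ".log" suffix and the WGET override variable are assumptions made for illustration, not part of any existing Archive Team script.

```shell
#!/bin/sh
# Hypothetical helper: fetch one URL, keeping the HTTP response headers
# in the saved file and Wget's own log in a file next to it.
# Set WGET=echo for a dry run that only prints the arguments.
grab_with_headers() {
    url=$1
    out=$2
    # --save-headers : write the response headers before the content
    # -o FILE        : send Wget's log messages to FILE
    # -O FILE        : write the downloaded document to FILE
    "${WGET:-wget}" --save-headers -o "$out.log" -O "$out" "$url"
}
```

For example, grab_with_headers "http://www.archiveteam.org/" index.raw.html would leave index.raw.html (headers plus content) and index.raw.html.log side by side.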

(More research needs to be done to ensure these options are in scripts.)

Unfortunately, files saved this way cannot simply be dropped into a new location and served, because each one begins with the HTTP headers. Either keep two copies (one with headers, one without), or write a script that strips the headers out of the files for later use.

Stripping the Headers

As an example, let's download this page:

wget -O headers.raw.html --save-headers "http://www.archiveteam.org/index.php?title=Working_with_ARCHIVE.ORG"

If we want to strip the headers, then we run:

perl -ne 'unless ($out) { $out = 1 if /^\r?\n$/; next } print' headers.raw.html > noheaders.raw.html

The Archive.org "Archiveteam" Collection

Archive.org currently has an Archive Team Collection, which consists of site-grabs, older archives, and various rips/mass downloads from websites and Archive Team activities. Right now, that collection is primarily administered by Jason Scott — sending him e-mail at jason@textfiles.com with something you think that collection should have is probably the best way to go about things.

@@ Line 2: / Line 2: @@
-The Internet Archive has enormous resources at its disposal, and shares many parallel goals as Archive Team. There's some advantages to making as much of Archive Team's data saves available to ARCHIVE.ORG, especially as regards the [http://wayback.archive.org/web/ Wayback Machine], a mechanism to be able to browse older web materials going back years. Where possible, Archive Team should try to work with ARCHIVE.ORG, and get the love going.
+The [[Internet Archive]] has enormous resources at its disposal, and shares many parallel goals with Archive Team. There are some advantages to making as much of Archive Team's data saves available to ARCHIVE.ORG, especially as regards the [http://wayback.archive.org/web/ Wayback Machine], a mechanism to be able to browse older web materials going back years. Where possible, Archive Team should try to work with ARCHIVE.ORG, and get the love going.
 __TOC__
 == Headers and Logs ==
@@ Line 21: / Line 21: @@
 == The Archive.org "Archiveteam" Collection ==
-Archive.org currently has an [http://www.archive.org/details/archiveteam Archive Team Collection], which consists of site-grabs, older archives, and various rips/mass downloads from websites and Archive Team activities. Right now, that collection is primarily administered by Jason Scott - sending him e-mail at jason@textfiles.com with something you think that collection should have is probably the best way to go about things.
+Archive.org currently has an [http://www.archive.org/details/archiveteam Archive Team Collection], which consists of site-grabs, older archives, and various rips/mass downloads from websites and Archive Team activities. Right now, that collection is primarily administered by Jason Scott &mdash; sending him e-mail at jason@textfiles.com with something you think that collection should have is probably the best way to go about things.
