Site exploration

This page contains some tips and tricks for exploring soon-to-be-dead websites, to find URLs to feed into the Archive Team crawlers.

Open Directory Project data

The Open Directory Project offers machine-readable downloads of its data at rdf.dmoz.org. You want the content.rdf.u8.gz file from there.

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz

Quick-and-dirty shell parsing for the not-too-fussy:

grep '<link r:resource=.*dyingsite\.com' content.rdf.u8 | sed 's/.*<link r:resource="\([^"]*\)".*/\1/' | sort -u > odp-sitelist.txt
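
If the shell one-liner is too brittle for your tastes, the same extraction is easy to do in Python. A minimal sketch, assuming content.rdf.u8 sits in the current directory and using dyingsite.com as a placeholder domain:

 #!/usr/bin/env python
 # Sketch: pull <link r:resource="..."> URLs for one domain out of the ODP dump.
 # Assumes content.rdf.u8 has been downloaded and gunzipped as above;
 # "dyingsite.com" is a placeholder.
 import re
 
 LINK_RE = re.compile(r'<link r:resource="([^"]*)"')
 
 seen = set()
 with open("content.rdf.u8", encoding="utf-8", errors="replace") as dump:
     for line in dump:
         for url in LINK_RE.findall(line):
             if "dyingsite.com" in url and url not in seen:
                 seen.add(url)
                 print(url)

Redirect its output the same way, e.g. python odp_scrape.py > odp-sitelist.txt (odp_scrape.py being whatever you name the script).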

MediaWiki wikis

MediaWiki wikis, especially the very large ones operated by the Wikimedia Foundation, often link to a large number of important sites hosted with a service, and their LinkSearch feature will return those external links in bulk.

mwlinkscrape.py is a tool by an Archive Team patriot which extracts a machine-readable list of matching external links from a number of wikis (it actually uses the text of the List of major MediaWiki wikis with the LinkSearch extension page on this wiki to get a list of wikis to scrape).

./mwlinkscrape.py "*.dyingsite.com" > mw-sitelist.txt
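
If you only care about one particular wiki, you can also query its LinkSearch data directly through the standard MediaWiki API (the exturlusage module), which is roughly what mwlinkscrape.py does for every wiki on its list. A rough sketch, assuming a reasonably recent MediaWiki; the wiki URL and the *.dyingsite.com pattern are placeholders:

 #!/usr/bin/env python
 # Sketch: ask one MediaWiki's exturlusage module (the API behind
 # Special:LinkSearch) for external links matching a host pattern.
 import json
 import urllib.parse
 import urllib.request
 
 API = "https://en.wikipedia.org/w/api.php"  # placeholder wiki
 
 params = {
     "action": "query",
     "list": "exturlusage",
     "euquery": "*.dyingsite.com",  # placeholder pattern
     "eulimit": "500",
     "format": "json",
 }
 
 while True:
     url = API + "?" + urllib.parse.urlencode(params)
     # A descriptive User-Agent keeps the bigger wiki farms happy.
     req = urllib.request.Request(url, headers={"User-Agent": "site-exploration-sketch"})
     with urllib.request.urlopen(req) as resp:
         data = json.loads(resp.read().decode("utf-8"))
     for hit in data.get("query", {}).get("exturlusage", []):
         print(hit["url"])
     if "continue" not in data:
         break
     params.update(data["continue"])  # carry the continuation token forward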

Bing API

Microsoft, bless their Redmondish hearts, have an API for fetching Bing search engine results, which has a free tier of 5000 queries per month (this will cover you for about 250 sets of 1000 results). However, it only returns the first 1000 results for any query, so you can't just search "site:dyingsite.com" and get all the things on a site. You'll need to get a bit creative with the search terms.

Grab the bingscrape.py Python script (look for "BING_API_KEY" and replace it with your "Primary Account Key"), and then:

python bingscrape.py "site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "about me site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "gallery site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "in memoriam site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "diary site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "bob site:dyingsite.com" >> bing-sitelist.txt

And so on.
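
If typing these out gets old, wrap the permutations in a small script. A sketch, assuming bingscrape.py is in the current directory and prints one URL per line; the extra search terms are only examples:

 #!/usr/bin/env python
 # Sketch: run bingscrape.py over a pile of query variations and collect
 # the deduplicated results into bing-sitelist.txt.
 import subprocess
 
 SITE = "site:dyingsite.com"
 TERMS = ["", "about me", "gallery", "in memoriam", "diary", "bob"]
 
 urls = set()
 for term in TERMS:
     query = (term + " " + SITE).strip()
     out = subprocess.check_output(["python", "bingscrape.py", query])
     urls.update(out.decode("utf-8", "replace").splitlines())
 
 with open("bing-sitelist.txt", "w") as listing:
     for url in sorted(urls):
         print(url, file=listing)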

Common Crawl Index

The Common Crawl index is a very big (21 gigabytes compressed) list of URLs in the Common Crawl corpus. Grepping this list may well reveal plenty of URLs to archive. The list is in an odd format, with the hostname reversed and the protocol tacked on the end, along the lines of com.dyingsite.www/subdirectory/subsubdirectory:http, so you'll need to do some filtering of the results.

grep '^com\.dyingsite[/\.]' zfqwbPRW.txt > commoncrawl-sitelist.txt
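
The reversed-host entries can be turned back into ordinary URLs with a little string surgery. A rough sketch based only on the com.dyingsite.www/path:http shape described above (real index lines may carry extra fields, so adjust as needed); cc_urls.py is whatever you call it:

 #!/usr/bin/env python
 # Sketch: convert Common Crawl index entries such as
 # com.dyingsite.www/subdirectory:http back into http://www.dyingsite.com/subdirectory
 import sys
 
 for line in sys.stdin:
     line = line.strip()
     if not line:
         continue
     # Split off the trailing ":scheme" marker, defaulting to http.
     entry, _, scheme = line.rpartition(":")
     if not entry:
         entry, scheme = line, "http"
     # Separate the reversed hostname from the path and flip it back around.
     host_rev, slash, path = entry.partition("/")
     host = ".".join(reversed(host_rev.split(".")))
     print("%s://%s%s%s" % (scheme, host, slash, path))

grep '^com\.dyingsite[/\.]' zfqwbPRW.txt | python cc_urls.py > commoncrawl-sitelist.txt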