Posterous
URL: http://posterous.com
Status: Closing
Archiving status: In progress...
Archiving type: Unknown
IRC channel: #preposterus (on hackint)
Posterous is a blogging platform started in May 2008. It was acquired by Twitter on March 12, 2012 and will shut down on April 30, 2013 (announcement).
Site List Grab
We are currently assembling a list of Posterous sites that need grabbing. Development is seat-of-the-pants-y right now, and the following instructions will get your IP banned fairly quickly. Join us in #preposterus on efnet for state-of-the-art chitchat.
Instructions
Download the latest script from the git repository
Claim a number range in the table below
Run 100 smegs concurrently. The following example will run the 1-2 million range:
for chunk in $(seq 100 199); do ./smeg $chunk & done
Running this with the Python variant at high scale WILL cause database lock collisions.
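If you claim a different range, the chunk numbers can be derived the same way. The following is a hedged sketch, not part of the official instructions; it relies on an assumption inferred from the claims table below, namely that chunk N covers blog IDs N*10,000 through N*10,000+9,999:
# Sketch: derive chunk numbers from a claimed ID range and launch one smeg per chunk.
# Assumption (inferred from the table below): chunk N covers IDs N*10000..N*10000+9999.
RANGE_START=4000000   # hypothetical claim of the 4,000,000 - 4,999,999 range
RANGE_END=4999999
for chunk in $(seq $((RANGE_START / 10000)) $((RANGE_END / 10000))); do
    ./smeg $chunk &
done
wait   # block until all background smegs have finished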
To see hostnames as they're found:
tail -q -n 0 -f *.hostnames
No output means you're IP banned.
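Once a range finishes, the per-chunk results need to be gathered for upload. The commands below are a sketch rather than official instructions; they assume each chunk writes a plain *.hostnames file in the working directory and that the combined file is named after the start of the range, as in the Uploaded Hostnames column below:
# Count unique hostnames found so far (roughly the figure reported in the Status column).
cat *.hostnames | sort -u | wc -l
# Combine and compress the results for upload, e.g. for the 5,000,000 range:
cat *.hostnames | gzip > 5000000.hostnames.gz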
Range Claim
Range | Chunk(s) | User | Status | Uploaded Hostnames |
---|---|---|---|---|
1 - 999,999 | 1-99 | closure | Done (742846) | archived |
1,000,000 - 1,999,999 | 100-199 | db48x / closure | Done (994303) | archived |
2,000,000 - 2,009,999 | 200 | aggroskater | Done (8907) | 2000000.hostnames.gz archived |
2,010,000 - 2,019,999 | 201 | aggroskater | Done (8094) | 2010000.hostnames.gz archived |
2,020,000 - 2,999,999 | 202-299 | dcmorton | Downloading (202-225 done) | |
3,000,000 - 3,999,999 | 300-399 | closure | Done (928023) | archived |
4,000,000 - 4,999,999 | 400-499 | chazchaz101 | Downloading | |
5,000,000 - 5,999,999 | 500-599 | Smiley / Soult | Done (984360) | 5000000.hostnames.gz, 5000000.sqlite.gz archived |
6,000,000 - 6,999,999 | 600-699 | dcmorton | Downloading (600-630 done) | |
7,000,000 - 7,999,999 | 700-799 | balrog / S[h]O[r]T | Done | hostnames.tgz archived |
8,000,000 - 8,999,999 | 800-899 | beardicus/Soult | Done (984258) | 8000000.hostnames.gz, 8000000.sqlite.gz archived |
9,000,000 - 9,999,999 | 900-999 | GLaDOS | Done | 9000000.hostnames.tar.gz |
10,000,000 - 10,019,999 | 1000-1001 | gui77 | Done | 10000000-10019999.sqlite.gz archived |
10,020,000 - 10,609,999 | 1002-1060 | S[h]O[r]T | Done | hostnames.tgz archived |
10,610,000 - 10,709,999 | 1061-1070 | siliconvalleypark | Done | 10610000-10709999-posterous.sqlite.gz archived |
10,710,000 - 11,009,999 | 1071-1100 | S[h]O[r]T | Done | hostnames.tgz archived |
Archiving a single blog
We are developing a command to archive a single blog, including all images and assets.
USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27" wget "https://$hostname" --warc-file=$hostname.warc \ --mirror --no-check-certificate --span-hosts \ --domains=s3.amazonaws.com,files.posterous.com,getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com \ -U "$USER_AGENT" -nv -e robots=off --page-requisites \ --timeout 60 --tries 20 --waitretry 5 \ --warc-header "operator: Archive Team" \ --warc-header "posterous-hostname: $hostname"
We use https because it allows for HTTP pipelining, which may help prevent being banned.
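To run the command above against many blogs, a small wrapper loop can be used. This is a minimal sketch, assuming hostnames.txt (a hypothetical file name) contains one Posterous hostname per line, for example taken from the uploaded *.hostnames files:
# Sketch: run the archiving command above once for each hostname in hostnames.txt.
USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
while read -r hostname; do
    wget "https://$hostname" --warc-file="$hostname.warc" \
        --mirror --no-check-certificate --span-hosts \
        --domains=s3.amazonaws.com,files.posterous.com,getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com \
        -U "$USER_AGENT" -nv -e robots=off --page-requisites \
        --timeout 60 --tries 20 --waitretry 5 \
        --warc-header "operator: Archive Team" \
        --warc-header "posterous-hostname: $hostname"
done < hostnames.txt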