Difference between revisions of "Posterous"

From Archiveteam

Revision as of 00:55, 19 February 2013

Posterous
URL http://posterous.com
Status Closing
Archiving status In progress...
Archiving type Unknown
IRC channel #preposterus (on hackint)

Posterous is a blogging platform started in May 2008. It was acquired by Twitter on March 12, 2012 and, according to its announcement, will shut down on April 30, 2013.

Site List Grab

We are currently assembling a list of Posterous sites that need grabbing. Development is seat-of-the-pants-y right now, and the following instructions will get your IP banned fairly quickly. Join us in #preposterus on efnet for state-of-the-art chitchat.

Instructions

Download the latest script: git

Claim a number range in the table below

Run 100 smegs concurrently. The following example will run the 1-2 million range:

   for chunk in $(seq 100 199); do ./smeg $chunk & done

Running this with the python variant at a high scale WILL cause database lock collisions.
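The chunk numbers passed to smeg can be derived from an ID range. This is a minimal sketch, assuming each chunk covers 10,000 IDs, which is inferred from most rows of the claim table (for example 2,000,000 - 2,009,999 is chunk 200). The function name chunks_for_range is hypothetical and not part of the smeg script.

```shell
# Print the first and last chunk number covering an inclusive ID range.
# Assumption: one chunk = 10,000 IDs (inferred from the claim table,
# not confirmed by the smeg script itself).
chunks_for_range() {
  echo "$(( $1 / 10000 )) $(( $2 / 10000 ))"
}
```

For instance, `for chunk in $(seq $(chunks_for_range 2020000 2999999)); do ./smeg $chunk & done` would cover chunks 202 through 299.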

To see hostnames as they're found:

    tail -q -n 0 -f *.hostnames

No output means you're IP banned.
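To check progress without watching the tail output, the hostnames found so far can be counted. This is a rough sketch, assuming each *.hostnames file holds one hostname per line; the function name count_hostnames is hypothetical.

```shell
# Count hostnames collected so far across all *.hostnames files in a
# directory (defaults to the current directory).
# Assumption: one hostname per line in each file.
count_hostnames() {
  dir="${1:-.}"
  # If no files match, cat fails quietly and the count is 0.
  cat "$dir"/*.hostnames 2>/dev/null | wc -l | tr -d ' '
}
```

Running it twice a few minutes apart and getting the same number suggests the same thing as silent tail output.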

Range Claim

Range Chunk(s) User Status Uploaded Hostnames
1 - 999,999 1-99 closure Done (742846) archived
1,000,000 - 1,999,999 100-199 db48x / closure Done (994303) archived
2,000,000 - 2,009,999 200 aggroskater Done (8907) 2000000.hostnames.gz archived
2,010,000 - 2,019,999 201 aggroskater Done (8094) 2010000.hostnames.gz archived
2,020,000 - 2,999,999 202-299 dcmorton Downloading
3,000,000 - 3,999,999 300-399 closure Done (928023) archived
4,000,000 - 4,999,999 400-499 chazchaz101 Downloading
5,000,000 - 5,999,999 500-599 Smiley / Soult Done (984360) 5000000.hostnames.gz, 5000000.sqlite.gz archived
6,000,000 - 6,999,999 600-699 dcmorton Downloading
7,000,000 - 7,999,999 700-799 balrog / (Your name here!) Partial (39462) archived
7,905,000 - 7,909,999 790 yipdw Done
7,915,000 - 7,919,999 791 yipdw Done
7,925,000 - 7,929,999 792 yipdw Done
7,935,000 - 7,939,999 793 yipdw Downloading
8,000,000 - 8,999,999 800-899 beardicus/Soult Done (984258) 8000000.hostnames.gz, 8000000.sqlite.gz archived
9,000,000 - 9,999,999 900-999 GLaDOS Downloading
10,000,000 - 10,019,999 1000-1001 gui77 Partial 10000000.hostnames.gz
10,020,000 - 10,069,999 1002-1006 S[h]O[r]T Downloading
10,070,000 - 10,209,999 1007-1020 flaushy Downloading
10,210,000 - 10,309,999 1021-1030 S[h]O[r]T Downloading
10,310,000 - 10,409,999 1031-1040 S[h]O[r]T Downloading
10,410,000 - 10,509,999 1041-1050 S[h]O[r]T Downloading
10,510,000 - 10,609,999 1051-1060 S[h]O[r]T Downloading
10,610,000 - 10,709,999 1061-1070 siliconvalleypark Downloading
10,710,000 - 11,009,999 1071-1100 S[h]O[r]T Downloading

Archiving a single blog

We are developing a command to archive a single blog, including all images and assets. The current version:

 USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
 wget "https://$hostname" --warc-file=$hostname.warc \
   --mirror --no-check-certificate --span-hosts \
   --domains=s3.amazonaws.com,files.posterous.com,getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com \
   -U "$USER_AGENT" -nv -e robots=off --page-requisites \
   --timeout 60 --tries 20 --waitretry 5 \
   --warc-header "operator: Archive Team" \
   --warc-header "posterous-hostname: $hostname" 

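The command uses https in the hope that sending many requests over one reused connection helps avoid being banned. Note that wget reuses connections via keep-alive; it does not do true HTTP pipelining.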
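The command can be wrapped in a function so it is easier to run per blog. This is only a sketch: the function names posterous_domains and archive_blog are hypothetical, the flags are copied from the command in this section, and the domain list is rebuilt in a loop instead of being typed out.

```shell
USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"

# Build the --domains value: s3.amazonaws.com, files, getfile, and
# getfile0 through getfile10 on posterous.com.
posterous_domains() {
  list="s3.amazonaws.com,files.posterous.com,getfile.posterous.com"
  i=0
  while [ "$i" -le 10 ]; do
    list="$list,getfile$i.posterous.com"
    i=$((i + 1))
  done
  printf '%s\n' "$list"
}

# Archive one blog into a WARC, using the same wget flags as the command
# in this section. Takes the blog hostname as its only argument.
archive_blog() {
  hostname="$1"
  wget "https://$hostname" --warc-file="$hostname.warc" \
    --mirror --no-check-certificate --span-hosts \
    --domains="$(posterous_domains)" \
    -U "$USER_AGENT" -nv -e robots=off --page-requisites \
    --timeout 60 --tries 20 --waitretry 5 \
    --warc-header "operator: Archive Team" \
    --warc-header "posterous-hostname: $hostname"
}
```

Usage would look like `archive_blog example.posterous.com`, where the hostname is a placeholder taken from one of the *.hostnames files.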