Posterous
URL: http://posterous.com
Status: Closing
Archiving status: In progress...
Archiving type: Unknown
IRC channel: #preposterus (on hackint)
Posterous is a blogging platform started in May 2008. It was acquired by Twitter on March 12, 2012 and will shut down on April 30, 2013 (announcement).
Site List Grab
We are currently assembling a list of Posterous sites that need grabbing. Development is seat-of-the-pants-y right now, and the following instructions will get your IP banned fairly quickly. Join us in #preposterus on efnet for state-of-the-art chitchat.
Instructions
Download the latest script from the git repository
Claim a number range in the table below
Run 100 smegs concurrently. The following example will run the 1-2 million range:
for chunk in $(seq 100 199); do ./smeg $chunk & done
Running this with the Python variant at high scale WILL cause database lock collisions.
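If you claim a different range, the chunk numbers can be derived the same way. The following is a hedged sketch, not part of the official instructions; it relies on an assumption inferred from the claims table below, namely that chunk N covers blog IDs N*10,000 through N*10,000+9,999:
# Sketch: derive chunk numbers from a claimed ID range and launch one smeg per chunk.
# Assumption (inferred from the table below): chunk N covers IDs N*10000..N*10000+9999.
RANGE_START=4000000   # hypothetical claim of the 4,000,000 - 4,999,999 range
RANGE_END=4999999
for chunk in $(seq $((RANGE_START / 10000)) $((RANGE_END / 10000))); do
    ./smeg $chunk &
done
wait   # block until all background smegs have finished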
To see hostnames as they're found:
tail -q -n 0 -f *.hostnames
No output means you're IP banned.
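Once a range finishes, the per-chunk results need to be gathered for upload. The commands below are a sketch rather than official instructions; they assume each chunk writes a plain *.hostnames file in the working directory and that the combined file is named after the start of the range, as in the Uploaded Hostnames column below:
# Count unique hostnames found so far (roughly the figure reported in the Status column).
cat *.hostnames | sort -u | wc -l
# Combine and compress the results for upload, e.g. for the 5,000,000 range:
cat *.hostnames | gzip > 5000000.hostnames.gz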
Range Claim
Range | Chunk(s) | User | Status | Uploaded Hostnames |
---|---|---|---|---|
1 - 999,999 | 1-99 | closure | Done (742846) | archived |
1,000,000 - 1,999,999 | 100-199 | db48x / closure | Done (994303) | archived |
2,000,000 - 2,009,999 | 200 | aggroskater | Done (8907) | 2000000.hostnames.gz archived |
2,010,000 - 2,019,999 | 201 | aggroskater | Done (8094) | 2010000.hostnames.gz archived |
2,020,000 - 2,999,999 | 202-299 | dcmorton | Downloading (202-225 done) | |
3,000,000 - 3,999,999 | 300-399 | closure | Done (928023) | archived |
4,000,000 - 4,999,999 | 400-499 | chazchaz101 | Downloading | |
5,000,000 - 5,999,999 | 500-599 | Smiley / Soult | Done (984360) | 5000000.hostnames.gz, 5000000.sqlite.gz archived |
6,000,000 - 6,999,999 | 600-699 | dcmorton | Downloading (600-630 done) | |
7,000,000 - 7,999,999 | 700-799 | balrog / S[h]O[r]T | Done | hostnames.tgz archived |
8,000,000 - 8,999,999 | 800-899 | beardicus/Soult | Done (984258) | 8000000.hostnames.gz, 8000000.sqlite.gz archived |
9,000,000 - 9,999,999 | 900-999 | GLaDOS | Done | 9000000.hostnames.tar.gz |
10,000,000 - 10,019,999 | 1000-1001 | gui77 | Done | 10000000-10019999.sqlite.gz archived |
10,020,000 - 10,609,999 | 1002-1060 | S[h]O[r]T | Done | hostnames.tgz archived |
10,610,000 - 10,709,999 | 1061-1070 | siliconvalleypark | Done | 10610000-10709999-posterous.sqlite.gz archived |
10,710,000 - 11,009,999 | 1071-1100 | S[h]O[r]T | Done | hostnames.tgz archived |
Archiving a single blog
We are developing a command to archive a single blog, including all images and assets.
USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27" wget "https://$hostname" --warc-file=$hostname.warc \ --mirror --no-check-certificate --span-hosts \ --domains=s3.amazonaws.com,files.posterous.com,getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com \ -U "$USER_AGENT" -nv -e robots=off --page-requisites \ --timeout 60 --tries 20 --waitretry 5 \ --warc-header "operator: Archive Team" \ --warc-header "posterous-hostname: $hostname"
We use https because it allows for HTTP pipelining, which may help prevent being banned.
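To run the command above against many blogs, a small wrapper loop can be used. This is a minimal sketch, assuming hostnames.txt (a hypothetical file name) contains one Posterous hostname per line, for example taken from the uploaded *.hostnames files:
# Sketch: run the archiving command above once for each hostname in hostnames.txt.
USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
while read -r hostname; do
    wget "https://$hostname" --warc-file="$hostname.warc" \
        --mirror --no-check-certificate --span-hosts \
        --domains=s3.amazonaws.com,files.posterous.com,getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com \
        -U "$USER_AGENT" -nv -e robots=off --page-requisites \
        --timeout 60 --tries 20 --waitretry 5 \
        --warc-header "operator: Archive Team" \
        --warc-header "posterous-hostname: $hostname"
done < hostnames.txt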