Posterous

URL: http://posterous.com
Status: Closing
Archiving status: In progress...
Archiving type: Unknown
IRC channel: #preposterus (on hackint)

Posterous is a blogging platform started in May 2008. It was acquired by Twitter on March 12, 2012, and will shut down on April 30, 2013 (announcement: http://blog.posterous.com/thanks-from-posterous).

Seesaw script

Download:

https://gist.github.com/Gelob/16aacab95d2d59887d86

Follow the instructions in the gist to install seesaw and edit the script to set your IP address (a setup sketch follows below).

Running too many instances concurrently will get you banned at :50 past the hour.
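
A minimal setup sketch, assuming the seesaw kit installs from PyPI and the gist serves a single script via its /raw URL; the local file name posterous-seesaw.py is only illustrative:

 # Install Archive Team's seesaw kit (assumes Python and pip are already available).
 pip install seesaw
 # Fetch the script from the gist above; the saved file name is an assumption.
 wget "https://gist.github.com/Gelob/16aacab95d2d59887d86/raw" -O posterous-seesaw.py
 # Edit the script to set your IP address as instructed, then run it.
 python posterous-seesaw.py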

Site List Grab

We have assembled a list of Posterous sites that need grabbing. Total found: 9,898,986

http://archive.org/details/2013-02-22-posterous-hostname-list

Tools: git
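
A hedged sketch for fetching the list from the archive.org item above; the file name inside the item is an assumption, so check the item's file listing first:

 # List the files in the archive.org item (the /download/ index shows its contents).
 curl -s "https://archive.org/download/2013-02-22-posterous-hostname-list/"
 # Download the hostname list; the file name below is assumed, not confirmed.
 wget "https://archive.org/download/2013-02-22-posterous-hostname-list/posterous-hostnames.txt.gz"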

Archiving a single blog

We are developing a command to archive a single blog, including all images and assets.

 # Identify as a desktop browser so Posterous serves the normal pages.
 USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
 # $hostname is the blog's hostname; wget appends .warc.gz to the --warc-file prefix.
 wget "https://$hostname" --warc-file="$hostname" \
   --mirror --no-check-certificate --span-hosts \
   --domains="$hostname,s3.amazonaws.com,files.posterous.com,getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com" \
   -U "$USER_AGENT" -nv -e robots=off --page-requisites \
   --timeout 60 --tries 20 --waitretry 5 \
   --warc-header "operator: Archive Team" \
   --warc-header "posterous-hostname: $hostname"

We use https because it allows for HTTP pipelining, which may help prevent being banned.
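
As a usage sketch (the loop, grab-posterous.sh, and hostnames.txt are assumptions, not part of the command above), the grab can be run over the hostname list one blog at a time:

 # Hedged sketch: grab-posterous.sh is assumed to hold the wget command above and to
 # read $hostname from the environment; hostnames.txt holds one hostname per line.
 while read -r hostname; do
   export hostname
   bash grab-posterous.sh
 done < hostnames.txt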