Difference between revisions of "Posterous"
Revision as of 19:56, 1 April 2013
Archiving status: In progress...
IRC channel: (on hackint)
Frequently Asked Questions
It's going down! How can I help?
Glad you're interested! First and foremost, consider running our prepared Virtual Machine. Please see Posterous#Warrior down below.
What do you guys need? A huge fat pipe, a.k.a. bandwidth?
Needed/wanted: interested volunteers in general, and IP addresses. A lot of bandwidth isn't needed per se; you don't need a fat monster pipe/Internet tube to help out.
Can I donate some cash instead?
Not really, not to the ArchiveTeam specifically. If you feel like you could let go of a few buckeroos, consider donating to the Internet Archive. They're awesome and do awesome things, just like us! (Yes, you're included in "us" - you're here, reading already!)
So, the Internet Archive, that's not you?
No. We're ArchiveTeam, a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage. We archive online platforms that are in danger of disappearing. The Internet Archive (archive.org) is an official non-profit which receives funding and has more than 200 employees. They scan books and host digital archives like the Wayback Machine. After we finish archiving all of Posterous, the Internet Archive would be a good partner to host the project.
Why aren't we fetching more/bashing the shit out of Posterous to get done already?!
Easy, tiger! We all love bringing a web service to its knees, but we want to get as much as possible out of Posterous.
We've currently rate limited the project, and we continue to adjust the limits and try out tactics. Unfortunately, we have managed to bring Posterous to its knees a good few times already.
The problem is that Posterous is not designed for the load it's currently getting, especially from us. Posterous is built so that the front-ends hit a cache of content before hitting the back-end. OK, but why isn't that helping? We're going through *all* of their accounts and posts, which ruins the cache, so our requests end up on the back-end, and the back-end can't take that request rate at all. If we go too fast, requests return bad data or no data. Please keep this in mind.
The warrior tells me to ask to run the Posterous project?
Yes, yes - it does indeed. If you've read this, feel free to click on it and go on. This message/warning/notice was introduced earlier in the archiving project, when we got banned a lot. Other Warrior projects haven't been this bitchy about banning, which is why the notice is still there for this one.
Will I get banned?
If you help out by running the ArchiveTeam Warrior, it's very unlikely that your IP will get banned. Our objective is to get as much of Posterous as possible. We have therefore taken, and continue to take, measures to rate limit, check for errors, retry and back off when appropriate, so that we get as much as possible. There's also some magic over at Posterous's end; we won't go into details here though.
If, however, you are starting your own "Rape the Posterous Silly"-project with your own code, or are running too many concurrent jobs (for example with the stand-alone code mentioned below, the seesaw script for advanced users), then yes: it's very likely you'll get banned.
How do I know if I got banned?
How long am I banned, if banned?
Good question! No good answer! Next!
At the beginning of the crawl, individual IPs were banned for days - if we're not mistaken, a week or so. After experimenting with... overloading... Posterous from different IPs, the ban times have shortened.
The answer is: Hours to weeks. It's unclear.
OK, I'm running the Warrior - I'm getting 502/5XX errors!!
That's not a question.
Posterous will gag out 50X's occasionally - we've taken measures to back off for a period of time and retry a certain number of times. It's alright.
This does not mean your IP has been banned.
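For the curious, the general back-off-and-retry idea looks roughly like the sketch below. This is not the Warrior's actual code: the function name `retry_with_backoff`, the number of tries and the delays are all made up for illustration, and the real project may use different limits and logic.

```shell
# Hypothetical helper, NOT the Warrior's real logic.
# Usage: retry_with_backoff MAX_TRIES INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have been made,
# sleeping between attempts and doubling the delay each time.
# Note: uses plain (global) shell variables so it also works in POSIX sh.
retry_with_backoff() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0                      # command succeeded, stop retrying
        fi
        if [ "$attempt" -ge "$max_tries" ]; then
            return 1                      # out of attempts, give up
        fi
        sleep "$delay"                    # back off before the next attempt
        delay=$((delay * 2))              # wait twice as long next time
        attempt=$((attempt + 1))
    done
}

# Trivial demonstration with a command that always succeeds.
retry_with_backoff 3 0 true && echo "fetched"
```

In real use the command would be whatever fetches a single item, and the initial delay would be well above zero so that a struggling server gets time to recover.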
Uh, so.. looks like there's plenty of spam on Posterous?
Yep, but we don't care. Grab it all. It's not our thing to decide what gets saved and what doesn't – especially if we have the chance to save it all.
Maybe it'll be useful for a spam researcher in the future. Maybe not.
We have tried to prioritize "real" users, since we can identify some accounts as spam/banned from how Posterous has marked them; however, this doesn't mean we shouldn't archive them (see above).
Where can I see the project status?
You can see the status at http://tracker.archiveteam.org/posterous/ – which is the dashboard for this project.
Cool! So you're almost done with this?
Sadly, no! Not all hostnames are tracked on the dashboard, because of certain limitations in the current tracker/dashboard. We've unloaded a lot of the users/items. In total, we believe there are about 10 million hostnames/sites/users.
The tracker/status dashboard is barfing! Or giving 502 Bad Gateways. What's up?
The tracker/dashboard is a bit fragile, so please don't link to it too widely. It's not optimized for maximum page loads. It is functional, however, and the source code is freely available on [GitHub] - feel free to look into that, and if you see anything that can be improved, submit a pull request.
Our tracker admins will of course kick it back to life if it's acting up. Please join our IRC Channel for status updates regarding the tracker and such.
My userstats seem to have been reset on the dashboard, what gives?
The user details are cached for a set period of time, and we've had the caching act up a few times. Please rest assured that every submitted work item DOES get counted and it gets in. If you see your username getting submissions and then the total resetting, feel free to poke us in the IRC Channel (anyone with a @). We'll kick the cache in the butt, and your stats will show like they should. This shouldn't happen all that often though.
How do I know if my posterous favorite blogobongobloggo will be fetched?
There's no super nice way, but if you go to Posterous#Site List Grab below, you can grab the hostname list that we've spidered forth and check by opening it and searching for your username/hostname. Or you could use 'grep' on it, you know - like a man.
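A minimal sketch of the grep approach, assuming the list is a plain text file with one hostname per line (check the actual file before relying on that). The helper name `is_listed` and the stand-in list contents are made up for illustration.

```shell
# Hypothetical helper: succeeds if the exact hostname is a full line in the list file.
# Usage: is_listed HOSTNAME LIST_FILE
is_listed() {
    # -F: treat the pattern as a fixed string (dots are not wildcards)
    # -x: match whole lines only, so "foo" does not match "foobar"
    # -q: print nothing, just report the result via the exit status
    grep -Fxq -- "$1" "$2"
}

# Tiny stand-in list so the example runs on its own; the real list is far larger.
list_file=$(mktemp)
printf '%s\n' 'example-one.posterous.com' 'blogobongobloggo.posterous.com' > "$list_file"

if is_listed 'blogobongobloggo.posterous.com' "$list_file"; then
    echo "listed"
else
    echo "not listed"
fi
```

Matching whole lines avoids false positives from hostnames that merely contain yours as a substring.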
Can I opt out? I don't want to be saved!
Tough luck, it's already public - that's why we're grabbing it. Besides, don't be embarrassed! We all learn through history - let the history be.
This is cool and all, but where the fuck is the data going?
We'll make sure this data stays public after it's been downloaded. We'll make sure that the awesome duders and duduetters at [Internet Archive] get a copy for sure. We're grabbing all the Posterous sites in an Internet Archive-friendly file format called WARC (Web ARChive), so they should be able to put this into the Wayback Machine - if they'd like to.
So, my Warrior doesn't get networking with VirtualBox on Ubuntu, what gives?
You should do the following:
VBoxManage modifyvm "archiveteam-warrior-2" --natdnshostresolver1 on
VBoxManage modifyvm "archiveteam-warrior-2" --natdnsproxy1 on
Thanks and shout-outs go to hdevalenc
How do I even get the Warrior up and running on Debian?
sudo apt-get install virtualbox-ose
wget http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova
tar xf archiveteam-warrior-v2-20121008.ova
VBoxManage import archiveteam-warrior-v2-20121008.ovf
screen VBoxHeadless --vnc --startvm archiveteam-warrior-2
Hit Ctrl+A, D to exit screen and leave the VM running. From a non-headless box:
ssh -L8001:localhost:8001 email@example.com
Point a browser to http://localhost:8001
How to help
You can help by installing and running the ArchiveTeam Warrior and selecting the "posterous" project. The Warrior is a virtual machine you can run seamlessly in VirtualBox to help out.
Seesaw script (for advanced users)
Follow the instructions below to install seesaw, and edit the script to set your IP address.
For wget: run ./get-wget-lua.sh
If you are on a box with more than one public IP address, you can place an IP address after --bind-address= on line 175. Example: "--bind-address=192.168.1.1",
# install prerequisites
sudo apt-get install -y build-essential lua5.1 liblua5.1-0-dev python python-setuptools python-dev git-core openssl libssl-dev python-pip rsync gcc make git
# grab the posterous scripts
git clone http://github.com/ArchiveTeam/posterous-grab.git
cd posterous-grab
# grab and install the seesaw kit (for communicating with the tracker)
git clone http://github.com/ArchiveTeam/seesaw-kit
cd seesaw-kit
sudo pip install -r requirements.txt
sudo pip install seesaw
cd ../
# download and compile wget-lua
chmod +x get-wget-lua.sh && ./get-wget-lua.sh
# run the pipeline and start downloading users. Use --help to see additional parameters
# once started progress can be viewed from a browser on port 8001
seesaw-kit/run-pipeline --concurrent 1 --address <your_ip_address> pipeline.py <your_username>
Once running, progress can be viewed from the web interface, much like the warrior. The default port is 8001 but it can be changed with the --port parameter. The pipeline can be shut down either through the web interface or running
Site List Grab
We have assembled a list of Posterous sites that need grabbing. Total found: 9898986
We found about 9.9 million possible Posterous accounts. After filtering out the banned/spam accounts we have 6,677,720 left.
Posterous closes on April 30th, 2013. We have 29 days left and 2,600,000 accounts downloaded.
60 sec * 60 min * 24 hours = 86,400 seconds a day
(6,677,720 - 2,600,000) / 86,400 = 47.2 days at 1 account a second.
47.2 days (at 1 fetch a second) / 29 days left = 1.63, which rounds up to 2 accounts per second actually needed.
Now, taking into account that not all accounts are the same size, and the outages we have had so far, the safe number would be 3x the above answer. So we need to download 6 full accounts per second to be sure of getting all of Posterous before it shuts down. This also assumes that we will not have to re-download any accounts at the end.
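The estimate above can be re-derived with a few lines of shell, which is handy when the numbers change. This is only a sketch: the function name `required_rate` is made up, and the inputs are the figures quoted above (accounts left after filtering, accounts already downloaded, days left, and the 3x safety factor).

```shell
# Hypothetical helper reproducing the back-of-the-envelope estimate.
# Usage: required_rate TOTAL_ACCOUNTS DOWNLOADED DAYS_LEFT SAFETY_FACTOR
# Prints: raw accounts/second, that rate rounded up, and the rounded rate times the safety factor.
required_rate() {
    awk -v total="$1" -v done_count="$2" -v days="$3" -v safety="$4" 'BEGIN {
        remaining = total - done_count          # accounts still to fetch
        base = remaining / (days * 86400)       # 86,400 seconds in a day
        rounded = (base == int(base)) ? base : int(base) + 1   # round up
        printf "%.2f %d %d\n", base, rounded, rounded * safety
    }'
}

required_rate 6677720 2600000 29 3
# prints: 1.63 2 6
```

As the days left shrink or the downloaded count grows, re-running this with fresh numbers gives the updated target rate.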