Lulu Poetry
Lulu Poetry | |
A screen shot of the Lulu Poetry home page | |
URL | http://www.poetry.com |
Status | Offline May 4, 2011 |
Archiving status | Saved! |
Archiving type | Unknown |
IRC channel | #archiveteam-bs (on hackint) |
Lulu Poetry, also known as Poetry.com, announced on April 13, 2011 that it would close less than a month later, on May 4, deleting all 14 million poems. Archive Team members gathered to work out how to help and aimed their LOICs at it. (By the way, I actually mean their crawlers, not DDoS cannons.)
News
May 4: As of midnight EST on May 4, the site appears unreachable (even from unblocked IPs). Looks like "available until May 4" was not inclusive. R.I.P. the work of millions. Now on to Google Cache!
May 2: We're getting IP-blocked all over. But one thing that still seems to work is using proxies from a list on Wikipedia (the port 8080 ones) and faking wget's user agent.
May 2: It looks like the battle has begun. They seem to have started blocking either our IPs or our wget user-agent strings. Current strategies include getting more IPs through proxies and donning our googlebot costumes.
May 1: For everyone who left wget running last night, we noticed that the site would go down periodically, serving pages announcing "site maintenance" instead of the poem page that wget was looking for. So we have to find those files, delete them, then re-download them. See Tools for more info.
Coordination
Note: this is going really slowly right now, so maybe just claim 100,000 or so at a time.
Who is handling which chunks of urls? | ||||
IRC name | starting number | ending number | Progress | notes |
---|---|---|---|---|
closure | 0 | 999,999 | complete | |
closure | 1,000,000 | 14,715,000 | All IP space banned | random sampling (got 100 thousand aka ~ 0.7%) |
jag | 1,000,000 | 2,000,000 | in progress | |
notakp | 2,000,000 | 3,000,000 | Uploaded | feel free to take upper 2M(2.5M-2.99M) |
no2pencil | 3,000,000 | 3,999,999 | in progress | |
[free] | 4,000,000 | 4,399,999 | still free | If you're taking this, you should grab what I've done in this block and skip the ids from this list |
mel | 4,400,000 | 4,499,999 | IP banned | 80k done, now banned |
Qwerty01 | 4,500,000 | 4,699,999 | IP-banned | about 3,500 done in the 4,5xx,xxx range |
warthurton | 4,700,000 | 4,799,999 | in progress | |
greyjjd | 4,800,000 | 4,804,999 | IP banned | 2763 of the 5k. |
DFJustin | 4,805,000 | 4,899,999 | stalled | got 4,746 but seems to be down now |
beardicus | 4,900,000 | 4,999,999 | in progress | |
underscor | ??? | ??? | in progress? | active last night but never said their range |
BlueMax | ??? | ??? | in progress? | active last night but never said their range |
Coderjoe | 5,000,000 | 5,099,999 | in progress | |
[free] | 5,100,000 | 5,675,999 | available | |
Coderjoe | 5,676,000 | 5,695,999 | in progress | |
[free] | 5,696,000 | 6,351,999 | available | |
Coderjoe | 6,352,000 | 6,418,999 | in progress | |
[free] | 6,419,000 | 8,999,999 | available | |
alard | 9,000,000 | 9,999,999 | IP-banned | have 149,438 list of ids I've done, feel free to do the rest |
nuintari | 9,000,000 | 9,099,999 | IP Blocked (yes, all of them) | here is what I did get |
perfinion | 9,100,000 | 9,199,999 | ||
nuintari | 9,200,000 | 9,299,999 | IP Blocked | here is what I did get |
nuintari | 9,300,000 | 9,399,999 | IP Blocked | here is what I did get |
[free] | 9,400,000 | 9,999,999 | ||
Teaspoon | 10,000,000 | 10,999,999 | in progress | 16% |
DoubleJ | 11,000,000 | 11,099,999 | complete | Suspicious number of 404s starting evening of the 3rd |
flashmanbahadur | 11,100,000 | 11,199,999 | in progress | |
jch | 12,000,000 | 12,999,999 | site offline, incomplete | get my shit here |
jaybird11 | 13,000,000 | 13,009,999 | completed | http://www.bluegrasspals.com/13000000.tar.bz2 has these, plus others scattered throughout the 13 million block. |
emijrp | 14,000,000 | 14,099,999 | in progress | splitting these 100k urls into 10 chunks of 10k urls each; this works better (it does not collapse the server) |
zappy | 14,200,000 | 14,206,470 | some 404s | here |
oli | 11,200,000 | 11,999,999 | IP(s) blocked | here's what I got |
yipdw | 14,300,000 | 14,399,999 | in progress | got ~1,000 so far; downloading on hold |
ersi | 14,400,000 | 14,715,000 | in progress | Currently haxing on the first 1000 of this range |
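The chunking that emijrp describes in the table (one claimed block cut into smaller runs) could be sketched roughly as follows. This is only an illustration: `chunk_range` is a hypothetical helper name, not part of any Archive Team tool, and the chunk size is whatever you choose.

```shell
#!/bin/sh
# Split an inclusive id range into fixed-size chunks and print one
# "start end" pair per line. chunk_range is a hypothetical helper.
chunk_range() {
  start=$1
  end=$2
  size=$3
  while [ "$start" -le "$end" ]; do
    stop=$((start + size - 1))
    # Clamp the last chunk to the end of the claimed range.
    if [ "$stop" -gt "$end" ]; then
      stop=$end
    fi
    echo "$start $stop"
    start=$((stop + 1))
  done
}

# Example: a 100k block cut into 10k pieces (prints ten "start end" lines).
chunk_range 14000000 14099999 10000
```

Each printed pair can then be handed to a separate download run, so one stalled or banned run only costs a small chunk.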
Miscellaneous
Thoughts from IRC
(8:16:42 PM) Qwerty01: warthurt: i think the first thing that would help after you've set up a proxy and changed user agent is to pace yourself
(8:16:50 PM) Qwerty01: to not show up on their radar as much
(8:17:15 PM) Qwerty01: you can set a wait time (--wait=3)
(8:17:40 PM) Qwerty01: maybe if you can set it up, run through a couple proxies at once, slowly on each one
(8:17:54 PM) Qwerty01: so that you still get a good rate but there's no single IP that stands out as hitting their server a lot
(8:18:48 PM) Qwerty01: in fact there's a host of wget options that can basically make you indistinguishable from a normal browser: --limit-rate=100k --wait=3 --random-wait
(8:20:52 PM) DoubleJ: Qwerty01: Yep, that's my strategy: A different proxy for each screen session.
(8:18:45 PM) no2penci1: proxy=`head -${n} ${file} | tail -1`
(8:19:00 PM) no2penci1: I stuffed a bunch of proxies into a text file, & then just read one line of that file
(8:19:04 PM) no2penci1: looping on n
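The proxy rotation discussed above might look something like this dry-run sketch. It downloads nothing and only prints which proxy each poem URL would go through. The file name `proxies.txt` (one host:port per line, no blank lines) and the helper names `nth_line` and `assign_proxies` are assumptions for illustration; the URL pattern is the one used in the correction script under Tools.

```shell
#!/bin/sh
# Dry-run sketch: spread poem ids round-robin over a proxy list.
# Nothing is fetched; it only prints "proxy url" pairs.
# The helper names and the proxy file layout are assumptions.

# Print line n of a file (same idea as the head/tail one-liner above).
nth_line() {
  sed -n "${2}p" "$1"
}

# Print one "proxy url" pair per id in the inclusive range start..end.
assign_proxies() {
  file=$1
  id=$2
  end=$3
  count=$(grep -c '' "$file")   # total number of lines in the proxy list
  if [ "$count" -eq 0 ]; then
    echo "no proxies in $file" >&2
    return 1
  fi
  while [ "$id" -le "$end" ]; do
    n=$((id % count + 1))       # cycles through lines 1..count
    echo "$(nth_line "$file" "$n") http://www.poetry.com/poems/archiveteam/$id/"
    id=$((id + 1))
  done
}

# Usage (assumes proxies.txt exists):
#   assign_proxies proxies.txt 4500000 4500009
```

The URLs assigned to one proxy could then be saved to a list and passed to a single wget run with `-i`, together with `-e use_proxy=yes -e http_proxy=...` and the pacing options Qwerty01 mentions (`--wait=3 --random-wait --limit-rate=100k`). Note that `--wait` only spaces out retrievals within one wget invocation, so it has no effect when each wget call fetches a single URL.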
Tools
Site Maintenance
When the site is under "site maintenance," it gives wget a page that says "site maintenance" instead of the poem page. This is worse than an outright failure: wget saves a complete html file with the wrong content. The site does this through a 302 redirect to http://unavailable.poetry.com.
The updated wget commands above should avoid this problem. If you ran old wget commands, you need to find and remove the bad files.
You can find these files using this command: find [yourdirname] -type f -print0 | xargs --null grep -l "performing site maintenance"
Or to nuke all such files: grep "currently performing site maintenance" -r . | cut -d: -f1 | xargs rm -v (then just re-run your wgets with -nc to re-download what was missed).
For detecting server maintenance issues, no2pencil created the following correction script:
flist=`grep "currently performing site maintenance" *.html | cut -d: -f1`
x=0
for file in ${flist}; do
  if [ -f ${file} ]; then
    echo correcting ${file}
    html=`echo ${file} | cut -c5-11`
    wget -E http://www.poetry.com/poems/archiveteam/${html}/ -O poem${html}.html 2>/dev/null
    echo done...
    x=`expr ${x} + 1`
  fi
done
if [ ${x} -eq 0 ]; then
  echo Directory clean
else
  echo ${x} files corrected
fi
Google Cache
- http://www.google.com/search?q=site%3Apoetry.com+intitle%3Aby+inurl%3Apoems+-inurl%3Atag
- click Cache
- ???
- PROFIT
Exit Strategy
We haven't yet decided what we'll keep out of all the html in those files. After all, we really just want the poems. This would save tons of space, too. Already, underscor has created a script that extracts the poems and metadata from the site. It could be re-purposed to extract the same from our downloaded files:
http://pastebin.com/Pst2aDS7
Closing announcement
From http://www.poetry.com/
Attention to Our Lulu Poetry Community
Lulu Poetry Closing Its Doors May 4, 2011
Dear Poets,
On May 4, 2011, Lulu Poetry will be closing its doors. Please be sure to copy and paste your poems onto your computer and connect with any fellow poets offsite, as we will be unable to save any customer information or poetry as of this date.
Over the past two years, we have been proud to provide a community where poetry writers can come together to share their remarkable works, learn from each other, and truly benefit themselves and anyone else interested in the craft. It has been a privilege to witness the creativity and effort that has sprung forth from this strong community of over 7 million poets and we have been thrilled to award over $35,000 in prizes in this time.
At Lulu, it makes us happy to see people do what they love and we’d still like to help you publish your poetry at Lulu.com. You can login using your Lulu Poetry username and password and start creating your own poetry book right away – absolutely free.
Thank you for your support and contribution to Lulu Poetry’s over 14 million poems. We look forward to your continued success on Lulu, where we’re committed to empowering authors to sell more books and reach more readers more easily than ever before.
Best,
Lulu Poetry