Lulu Poetry

From Archiveteam
Revision as of 15:48, 17 January 2017 by Megalanya1
URL http://www.poetry.com
Status Offline May 4, 2011
Archiving status Saved!
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)

Lulu Poetry, or Poetry.com, announced on April 13, 2011, that it would close less than a month later, on May 4, deleting all 14 million poems. Archive Team members gathered to work out how to help and aimed their LOICs at it. (By the way, I actually mean their crawlers, not DDoS cannons.)

News
May 4: As of midnight EST on May 4, the site appears unreachable (even from unblocked IPs). Looks like "available until May 4" was not inclusive. R.I.P. the work of millions. Now on to Google Cache!
May 2: We're getting IP-blocked all over. What still seems to work is using proxies from a list on Wikipedia (the port 8080 ones) and faking wget's user agent.
May 2: It looks like the battle has begun. They seem to have started blocking either our IPs or our wget user-agent strings. Current strategies include getting more IPs through proxies and donning our googlebot costumes.
May 1: For everyone who left wget running last night: we noticed that the site would go down periodically, serving pages announcing "site maintenance" instead of the poem page that wget was looking for. So we have to find those files, delete them, and re-download them. See Tools for more info.

Howto

  1. Claim a range of numbers below.
  2. Generate a hotlist of urls for wget to download by running this, editing in your start and end number: perl -le 'print "http://www.poetry.com/poems/archiveteam/$_/" for 1000000..2000000' > hotlist
  3. Split the hotlist into sublists: split hotlist
    • By default, split cuts the list into 1,000-line chunks. If you chose a range with 1M items, better to use split -l10000 hotlist, which gives about 100 sublists.
  4. Run wget on each sublist, with logging, and timeout and "we're down" page avoidance: wget -T 8 --max-redirect=0 -o logfile.log -nv -nc -x -i xaa
  5. To avoid getting too many files in one directory, which some filesystems will choke on, we recommend moving into a new subdirectory before running each wget on a sublist.
  6. For the daring, here's how to run all wgets on all the sublists in parallel, in subdirs, with logging, and avoidance of timeouts and the "site maintenance" problem: for x in ???; do mkdir $x.dir; cd $x.dir; wget -T 8 --max-redirect=0 -o $x.log -nv -nc -x -i ../$x & cd ..; done
  7. Once wget finishes, run it again! The -nc will make it download any files it missed the first time. Repeat until the logs don't show failures.
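Steps 2 and 3 above can be wrapped in a small helper so the range only has to be typed once. This is a minimal sketch, not something used in the original effort: the function name make_hotlist and the default chunk size are assumptions, and awk stands in for the perl one-liner.

```shell
#!/bin/sh
# make_hotlist START END [LINES_PER_CHUNK]
# Hypothetical helper: writes one poem URL per id to ./hotlist,
# then splits it into sublists (xaa, xab, ...) in the current directory.
make_hotlist() {
    start=$1
    end=$2
    chunk=${3:-10000}   # assumed default: 10,000 URLs per sublist

    # Print the URL for every id in the claimed range, inclusive.
    awk -v s="$start" -v e="$end" 'BEGIN {
        for (i = s; i <= e; i++)
            printf "http://www.poetry.com/poems/archiveteam/%d/\n", i
    }' > hotlist

    # Cut the hotlist into sublists of $chunk lines each.
    split -l "$chunk" hotlist
}
```

For example, make_hotlist 4000000 4099999 would write 100,000 URLs to hotlist and split them into ten sublists, each of which can then be fed to wget with -i as in step 4.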
wget Options Translation (or see Manual)
short long version meaning
-E --adjust-extension adds ".html" to files that are html but didn't originally end in .html
-k --convert-links change links in html files to point to the local versions of the resources
-T --timeout= if it gets hung for this long (in seconds), it'll retry instead of sitting waiting
-o --output-file use the following filename as a log file instead of printing to screen
-nv --no-verbose don't write every little thing to the log file
-nc --no-clobber if a file is already present on disk, skip it instead of re-downloading it
-x --force-directories force it to create a hierarchy of directories mirroring the hierarchy in the url structure
-i --input-file use the following filename as a source of urls to download
-U --user-agent Give the following as your user agent string instead of ‘Wget/1.12’. Pretty much required at this point to keep from being blocked. One string you can use to look like a web browser:
-U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4'

Here's a useful page with some common user agents.

Coordination

Note: this is going really slow right now so maybe just claim 100,000 or so at a time.

Who is handling which chunks of urls?
IRC name starting number ending number Progress notes
closure 0 999,999 complete
closure 1,000,000 14,715,000 All IP space banned random sampling (got 100 thousand, aka ~0.7%)
jag 1,000,000 2,000,000 in progress
notakp 2,000,000 3,000,000 Uploaded feel free to take upper 2M(2.5M-2.99M)
no2pencil 3,000,000 3,999,999 in progress
[free] 4,000,000 4,399,999 still free If you're taking this, you should grab what I've done in this block and skip the ids from this list
mel 4,400,000 4,499,999 IP banned 80k done, now banned
Qwerty01 4,500,000 4,699,999 IP-banned about 3,500 done in the 4,5xx,xxx range
warthurton 4,700,000 4,799,999 in progress
greyjjd 4,800,000 4,804,999 IP banned 2763 of the 5k.
DFJustin 4,805,000 4,899,999 stalled got 4,746 but seems to be down now
beardicus 4,900,000 4,999,999 in progress
underscor ??? ??? in progress? active last night but never said their range
BlueMax ??? ??? in progress? active last night but never said their range
Coderjoe 5,000,000 5,099,999 in progress
[free] 5,100,000 5,675,999 available
Coderjoe 5,676,000 5,695,999 in progress
[free] 5,696,000 6,351,999 available
Coderjoe 6,352,000 6,418,999 in progress
[free] 6,419,000 8,999,999 available
alard 9,000,000 9,999,999 IP-banned have 149,438 list of ids I've done, feel free to do the rest
nuintari 9,000,000 9,099,999 IP Blocked (yes, all of them) here is what I did get
perfinion 9,100,000 9,199,999
nuintari 9,200,000 9,299,999 IP Blocked here is what I did get
nuintari 9,300,000 9,399,999 IP Blocked here is what I did get
[free] 9,400,000 9,999,999
Teaspoon 10,000,000 10,999,999 in progress 16%
DoubleJ 11,000,000 11,099,999 complete Suspicious number of 404s starting evening of the 3rd
flashmanbahadur 11,100,000 11,199,999 in progress
jch 12,000,000 12,999,999 site offline, incomplete get my shit here
jaybird11 13,000,000 13,009,999 completed http://www.bluegrasspals.com/13000000.tar.bz2 has these, plus others scattered throughout the 13 million block.
emijrp 14,000,000 14,099,999 in progress splitting these 100k urls into 10 chunks of 10k urls each; it is better (does not collapse the server)
zappy 14,200,000 14,206,470 some 404s here
oli 11,200,000 11,999,999 IP(s) blocked here's what I got
yipdw 14,300,000 14,399,999 in progress got ~1,000 so far; downloading on hold
ersi 14,400,000 14,715,000 in progress Currently haxing on the first 1000 of this range

Miscellaneous

Thoughts from IRC

(8:16:42 PM) Qwerty01: warthurt: i think the first thing that would help after you've set up a proxy and changed user agent is to pace yourself
(8:16:50 PM) Qwerty01: to not show up on their radar as much
(8:17:15 PM) Qwerty01: you can set a wait time (--wait=3)
(8:17:40 PM) Qwerty01: maybe if you can set it up, run through a couple proxies at once, slowly on each one
(8:17:54 PM) Qwerty01: so that you still get a good rate but there's no single IP that stands out as hitting their server a lot
(8:18:48 PM) Qwerty01: in fact there's a host of wget options that can basically make you indistinguishable from a normal browser: --limit-rate=100k --wait=3 --random-wait
(8:20:52 PM) DoubleJ: Qwerty01: Yep, that's my strategy: A different proxy for each screen session.

(8:18:45 PM) no2penci1: proxy=`head -${n} ${file} | tail -1`
(8:19:00 PM) no2penci1: I stuffed a bunch of proxies into a text file, & then just read one line of that file
(8:19:04 PM) no2penci1: looping on n
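The ideas in the chat above (one proxy per job, picked by line number from a text file, with slow and randomized requests) could be combined roughly as follows. This is only a sketch under stated assumptions: the function names nth_proxy and fetch_sublist and the file name proxies.txt are hypothetical, the proxy list is assumed to hold one host:port per line, and wget is assumed to pick up the proxy from the http_proxy environment variable.

```shell
#!/bin/sh
# Browser-like user agent string taken from the options table above.
USER_AGENT='Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4'

# nth_proxy FILE N: print line N of a proxy list (one host:port per line).
nth_proxy() {
    sed -n "${2}p" "$1"
}

# fetch_sublist PROXY SUBLIST: run one paced wget over SUBLIST through PROXY.
# --wait/--random-wait/--limit-rate are the pacing options suggested in the chat.
fetch_sublist() {
    http_proxy="http://$1" wget -T 8 --max-redirect=0 \
        --wait=3 --random-wait --limit-rate=100k \
        -U "$USER_AGENT" -o "$2.log" -nv -nc -x -i "$2"
}

# Usage sketch (proxies.txt is an assumed file name):
#   n=1
#   for x in x??; do
#       fetch_sublist "$(nth_proxy proxies.txt "$n")" "$x" &
#       n=$((n + 1))
#   done
```

Giving each sublist its own proxy keeps the total rate up while no single IP stands out, which is the point Qwerty01 and DoubleJ make above.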

Tools

Site Maintenance

When the site is under "site maintenance," instead of the poem page, it gives wget a page that says "site maintenance." So worse than a complete failure, it gives a complete html file that's incorrect. This is done through a 302 redirect to http://unavailable.poetry.com.

The updated wget commands above should avoid this problem. If you ran old wget commands, you need to find and remove the bad files.

You can find these files using this command: find [yourdirname] -type f -print0 | xargs --null grep -l "performing site maintenance"

Or to nuke all such files: grep "currently performing site maintenance" -r . | cut -d: -f1 | xargs rm -v (then just re-run your wgets with -nc to re-download what was missed).

For detecting server maintenance issues, no2pencil created the following correction script:

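# List every .html file in this directory that is really a "site maintenance" page.
flist=$(grep -l "currently performing site maintenance" *.html)

x=0
for file in ${flist}
do
  if [ -f "${file}" ]
  then
    echo "correcting ${file}"
    # Assumes files are named poemNNNNNNN.html: characters 5-11 are the poem id.
    html=$(echo "${file}" | cut -c5-11)
    wget -E "http://www.poetry.com/poems/archiveteam/${html}/" -O "poem${html}.html" 2>/dev/null
    echo done...
    x=$((x + 1))
  fi
done

# Report how many files were re-fetched.
if [ "${x}" -eq 0 ]
then
  echo Directory clean
else
  echo "${x} files corrected"
fi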
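The script above assumes flat poemNNNNNNN.html file names. For downloads made with -x, a similar check could instead turn each bad file back into its URL, producing a retry list. This is a sketch only: the function name maintenance_urls is hypothetical, and it assumes wget's -x layout of www.poetry.com/poems/archiveteam/ID/index.html under the given directory.

```shell
#!/bin/sh
# maintenance_urls DIR
# Hypothetical helper: prints the original URL of every file under DIR
# that contains the "site maintenance" notice instead of a poem.
# Assumes the -x layout: DIR/www.poetry.com/poems/archiveteam/ID/index.html
maintenance_urls() {
    grep -rl "currently performing site maintenance" "$1" |
        sed 's|^.*\(www\.poetry\.com/.*/\)index\.html$|http://\1|'
}

# Usage sketch:
#   maintenance_urls . > retry-hotlist
#   (delete the bad files as described above, then)
#   wget -T 8 --max-redirect=0 -o retry.log -nv -nc -x -i retry-hotlist
```

The bad files still have to be removed before re-running wget, since -nc skips anything already on disk.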

Google Cache

  1. http://www.google.com/search?q=site%3Apoetry.com+intitle%3Aby+inurl%3Apoems+-inurl%3Atag
  2. click Cache
  3. ???
  4. PROFIT

Exit Strategy

We haven't yet decided what we'll keep out of all the html in those files. After all, we really just want the poems. This would save tons of space, too. Already, underscor has created a script that extracts the poems and metadata from the site. It could be re-purposed to extract the same from our downloaded files:
http://pastebin.com/Pst2aDS7

Closing announcement

From http://www.poetry.com/

Attention to Our Lulu Poetry Community

Lulu Poetry Closing Its Doors May 4, 2011

Dear Poets,

On May 4, 2011, Lulu Poetry will be closing its doors. Please be sure to copy and paste your poems onto your computer and connect with any fellow poets offsite, as we will be unable to save any customer information or poetry as of this date.

Over the past two years, we have been proud to provide a community where poetry writers can come together to share their remarkable works, learn from each other, and truly benefit themselves and anyone else interested in the craft. It has been a privilege to witness the creativity and effort that has sprung forth from this strong community of over 7 million poets and we have been thrilled to award over $35,000 in prizes in this time.

At Lulu, it makes us happy to see people do what they love and we’d still like to help you publish your poetry at Lulu.com. You can login using your Lulu Poetry username and password and start creating your own poetry book right away – absolutely free.

Thank you for your support and contribution to Lulu Poetry’s over 14 million poems. We look forward to your continued success on Lulu, where we’re committed to empowering authors to sell more books and reach more readers more easily than ever before.

Best,
Lulu Poetry

