Difference between revisions of "Lulu Poetry"

From Archiveteam

Revision as of 23:08, 2 May 2011

Lulu Poetry
Lulu Poetry.gif
URL http://www.poetry.com
Status Closing
Archiving status In progress...
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)

Lulu Poetry, also known as Poetry.com, announced on April 13, 2011 that it would close less than a month later, on May 4, deleting all 14 million poems. Archive Team members assembled to figure out how to help and aimed their LOICs at it. (By the way, I actually mean their crawlers, not DDoS cannons.)

News
May 1: For everyone who left wget running last night: we noticed that the site went down periodically, serving pages announcing "site maintenance" instead of the poem page that wget was looking for. So we have to find those files, delete them, then re-download them. See Tools for more info.
May 2: It looks like the battle has begun. They seem to have started blocking either our IPs or our wget user-agent strings. Current strategies include getting more IPs through proxies and donning our Googlebot costumes.

Site Structure

The urls appear to be flexible and sequential:
(12:13:09 AM) closure: http://www.poetry.com/poems/archiveteam-bitches/3535201/ , heh, look at that, you can just put in any number you like I think
(12:15:16 AM) closure: http://www.poetry.com/user/allofthem/7936443/ same for the users
There are apparently over 14 million poems. As of last night the numbers went up to http://www.poetry.com/user/whatever/14712220, though interspersed are urls without poems (author deletions?).
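
Since only the number seems to matter, a URL for any id can be built without knowing the real title or username. A minimal sketch, assuming the server really does ignore the slug segment as the chat log suggests; the function names poem_url and user_url and the placeholder slug are ours, for illustration only:

```shell
#!/bin/sh
# Build poem and user URLs from a bare numeric id.
# Assumption: the server ignores the slug segment, so any placeholder works.
SLUG="archiveteam"

poem_url() {
  printf 'http://www.poetry.com/poems/%s/%s/\n' "$SLUG" "$1"
}

user_url() {
  printf 'http://www.poetry.com/user/%s/%s/\n' "$SLUG" "$1"
}

# The two ids quoted in the chat log above.
poem_url 3535201
user_url 7936443
```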

Howto

  1. Claim a range of numbers below.
  2. Generate a hotlist of urls for wget to download by running this, editing in your start and end number: perl -le 'print "http://www.poetry.com/poems/archiveteam/$_/" for 1000000..2000000' > hotlist
  3. Split the hotlist into sublists: split hotlist
    • By default split cuts the list into 1000-line chunks named xaa, xab, and so on, so a 100,000-url hotlist becomes 100 sublists. If you chose a range with 1M urls, better use split -l 10000 hotlist to keep it to about 100 sublists.
  4. Run wget on each sublist, with logging, a timeout, and "we're down" page avoidance: wget -T 8 --max-redirect=0 -o logfile.log -nv -nc -x -i xaa
  5. To avoid getting too many files in one directory, which some filesystems will choke on, we recommend moving into a new subdirectory before running wget on each sublist.
  6. For the daring, here's how to run wget on all the sublists in parallel, in subdirs, with logging, and avoidance of timeouts and the "site maintenance" problem: for x in ???; do mkdir $x.dir; cd $x.dir; wget -T 8 --max-redirect=0 -o $x.log -nv -nc -x -i ../$x & cd ..; done
  7. Once wget finishes, run it again! The -nc will make it download any files it missed the first time. Repeat until the logs don't show failures.
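
To see what is still missing before the next pass of step 7, the sublist can be compared against what is on disk. A rough sketch, assuming wget -x stored each http://HOST/PATH/ url as HOST/PATH/index.html under the download directory; the function name missing_urls and the xaa.retry filename are made up for this example:

```shell
#!/bin/sh
# Print every URL from a sublist whose page is not yet on disk.
# Usage: missing_urls SUBLIST DOWNLOAD_DIR
# Assumption: wget -x saved http://HOST/PATH/ as DOWNLOAD_DIR/HOST/PATH/index.html,
# and every URL in the sublist ends with a slash.
missing_urls() {
  sublist=$1
  dir=$2
  while IFS= read -r url; do
    # Strip the scheme to get the relative path wget -x would have created.
    path=${url#http://}
    # -s is true only for files that exist and are not empty.
    if [ ! -s "$dir/${path}index.html" ]; then
      printf '%s\n' "$url"
    fi
  done < "$sublist"
}

# Example: write the URLs still missing from sublist xaa to xaa.retry.
# missing_urls xaa xaa.dir > xaa.retry
```

An empty result means every url in that sublist has a non-empty file, though it says nothing about whether the file is a real poem page or a maintenance page (see Tools).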
wget Options Translation (or see Manual)
short | long version | meaning
-E | --adjust-extension | adds ".html" to files that are html but didn't originally end in .html
-k | --convert-links | change links in html files to point to the local versions of the resources
-T | --timeout= | if it gets hung for this long (in seconds), it'll retry instead of sitting waiting
-o | --output-file | use the following filename as a log file instead of printing to screen
-nv | --no-verbose | don't write every little thing to the log file
-nc | --no-clobber | if a file is already present on disk, skip it instead of re-downloading it
-x | --force-directories | force it to create a hierarchy of directories mirroring the hierarchy in the url structure
-i | --input-file | use the following filename as a source of urls to download

Coordination

Note: this is going really slowly right now, so maybe just claim 100,000 or so at a time.

Who is handling which chunks of urls?
IRC name | starting number | ending number | Progress | notes
closure | 0 | 100,000 | complete | 61 MB compressed
closure | 100,000 | 200,000 | complete | 53 MB compressed
closure | 200,000 | 999,999 | in progress |
jag | 1,000,000 | 2,000,000 | in progress |
notakp | 2,000,000 | 3,000,000 | in progress |
no2pencil | 3,000,000 | 3,999,999 | in progress |
d8uv | 4,000,000 | 4,499,999 | in progress |
Qwerty01 | 4,500,000 | 4,699,999 | IP-banned | about 3,500 done in the 4,5xx,xxx range
underscor | ??? | ??? | in progress? | active last night but never said their range
BlueMax | ??? | ??? | in progress? | active last night but never said their range
alard | 9,000,000 | 9,999,999 | IP-banned | have 149,438 list of ids I've done (https://rapidshare.com/files/460321272/ids-9m-done.txt.gz), feel free to reclaim
Teaspoon | 10,000,000 | 10,999,999 | in progress |
DoubleJ | 11,000,000 | 11,099,999 | IP-banned | someone can pick up at 11,100,000
flashmanbahadur | 11,100,000 | 11,199,999 | in progress |
oli | 12,000,000 | 12,999,999 | in progress |
jaybird11 | 13,000,000 | 13,999,999 | in progress |
emijrp | 14,000,000 | 14,099,999 | in progress | running these 100k urls as 10 chunks of 10k urls each; that works better (doesn't collapse the server)
warthurton | 4,700,000 | 4,799,999 | in progress |
[yournamehere] | 4,800,000 | 8,999,999 | still free! | claim some today!
[yournamehere] | 11,200,000 | 11,999,999 | still free! | claim some today!
[seriouslyeditme] | 14,100,000 | 14,715,000 | still free! | claim some today!
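
Working a claim in smaller pieces, as emijrp's note describes, is simple range arithmetic. A small sketch under that reading; the function name subranges is hypothetical, and the numbers are written without commas so the shell can do the math:

```shell
#!/bin/sh
# Print "start end" pairs that cut a claimed range into fixed-size pieces.
# Usage: subranges START END SIZE   (END is inclusive)
subranges() {
  start=$1
  end=$2
  size=$3
  while [ "$start" -le "$end" ]; do
    stop=$((start + size - 1))
    # Clamp the last piece so it never runs past the claimed range.
    if [ "$stop" -gt "$end" ]; then
      stop=$end
    fi
    printf '%s %s\n' "$start" "$stop"
    start=$((stop + 1))
  done
}

# Example: a 100,000-id claim cut into ten 10,000-id pieces.
subranges 14000000 14099999 10000
```

Each printed pair can be dropped into the start and end of the perl one-liner from the Howto to produce one small hotlist at a time.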

Tools

Site Maintenance

When the site is under "site maintenance," instead of the poem page, it gives wget a page that says "site maintenance." So worse than a complete failure, it gives a complete html file that's incorrect. This is done through a 302 redirect to http://unavailable.poetry.com.

The updated wget commands above should avoid this problem. If you ran old wget commands, you need to find and remove the bad files.

You can find these files using this command: find [yourdirname] -type f -print0 | xargs --null grep -l "performing site maintenance"

Or to nuke all such files: grep -rl "currently performing site maintenance" . | xargs rm -v (the -l makes grep print each matching filename once; then just re-run your wgets with -nc to re-download what was missed).
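
A gentler variant of the commands above lists the urls behind the bad files instead of deleting anything, so you can see what needs re-fetching first. This is only a sketch: it assumes the files sit where wget -x put them (HOST/PATH/index.html), and the function name maintenance_urls is ours:

```shell
#!/bin/sh
# Find maintenance placeholders under a directory and print the URL each one
# should have been. Nothing is deleted.
# Usage: maintenance_urls DOWNLOAD_DIR   (no trailing slash on DOWNLOAD_DIR)
# Assumption: files are laid out as DOWNLOAD_DIR/HOST/PATH/index.html.
maintenance_urls() {
  dir=$1
  grep -rl "currently performing site maintenance" "$dir" |
  while IFS= read -r file; do
    # Drop the download directory prefix, then the trailing index.html.
    rel=${file#"$dir"/}
    printf 'http://%s\n' "${rel%index.html}"
  done
}

# Example: save the list of URLs that need another download.
# maintenance_urls xaa.dir > xaa.maintenance
```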

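For fixing pages that were hit by site maintenance, no2pencil created the following correction script:

#!/bin/sh
# Re-download pages that were saved as the "site maintenance" placeholder.
# Assumes files in the current directory are named poemNNNNNNN.html, so the
# 7-digit poem id sits at characters 5-11 of the filename.

# -l lists each matching file once, even if the phrase appears several times.
flist=$(grep -l "currently performing site maintenance" *.html)

x=0
for file in ${flist}
do
  if [ -f "${file}" ]
  then
    echo "correcting ${file}"
    # Extract the poem id from the filename.
    id=$(echo "${file}" | cut -c5-11)
    wget -E "http://www.poetry.com/poems/archiveteam/${id}/" -O "poem${id}.html" 2>/dev/null
    echo done...
    x=$((x + 1))
  fi
done

if [ "${x}" -eq 0 ]
then
  echo Directory clean
else
  echo "${x} files corrected"
fi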

Exit Strategy

We haven't yet decided what we'll keep out of all the html in those files. After all, we really just want the poems. This would save tons of space, too. Already, underscor has created a script that extracts the poems and metadata from the site. It could be re-purposed to extract the same from our downloaded files:
http://pastebin.com/Pst2aDS7

Closing announcement

From http://www.poetry.com/

Attention to Our Lulu Poetry Community

Lulu Poetry Closing Its Doors May 4, 2011

Dear Poets,

On May 4, 2011, Lulu Poetry will be closing its doors. Please be sure to copy and paste your poems onto your computer and connect with any fellow poets offsite, as we will be unable to save any customer information or poetry as of this date.

Over the past two years, we have been proud to provide a community where poetry writers can come together to share their remarkable works, learn from each other, and truly benefit themselves and anyone else interested in the craft. It has been a privilege to witness the creativity and effort that has sprung forth from this strong community of over 7 million poets and we have been thrilled to award over $35,000 in prizes in this time.

At Lulu, it makes us happy to see people do what they love and we’d still like to help you publish your poetry at Lulu.com. You can login using your Lulu Poetry username and password and start creating your own poetry book right away – absolutely free.

Thank you for your support and contribution to Lulu Poetry’s over 14 million poems. We look forward to your continued success on Lulu, where we’re committed to empowering authors to sell more books and reach more readers more easily than ever before.

Best,
Lulu Poetry