Difference between revisions of "IRC Quotes"

Revision as of 00:49, 8 April 2011

What's this, then?

Auguste, BlueMax and Dr-Spangle are currently scraping IRC quote databases (e.g. Bash.org). If you can help out or suggest other quote databases to scrape, please join them in #bashup.

Project Hosting

Auguste is currently hosting scrapes here. Everybody is encouraged to help mirror.

Helping Out

Scraping doesn't take much work, because the QDBs (quote databases) are all more or less the same. You only need to write one script, then make a few changes to adapt it to each other QDB you want to scrape. The scraping itself should take under 10 minutes.

If you do want to help with the scraping, please follow the existing scrape format:

Each quote has its own file
Each file is named 'n.txt', where 'n' is the quote's ID number
All quotes should be compressed into an archive
The archive name should identify the original location and date of scraping (e.g. 'QuoteIRC.com Quote Collection 2011-04-04.7z', or 'DOMAIN.TLD Quote Collection YYYY-MM-DD.EXT')
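
As a rough illustration of the format above, here is a minimal Python sketch. The function names are hypothetical, and it writes a .zip archive only because Python's standard library has no 7z writer; any archive type fits the 'DOMAIN.TLD Quote Collection YYYY-MM-DD.EXT' pattern.

```python
import datetime
import zipfile
from pathlib import Path


def archive_name(domain, scrape_date, ext="zip"):
    """Build 'DOMAIN.TLD Quote Collection YYYY-MM-DD.EXT'."""
    return f"{domain} Quote Collection {scrape_date.isoformat()}.{ext}"


def save_quotes(quotes, out_dir):
    """Write each quote to its own 'n.txt' file, where n is the quote ID."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for quote_id, text in quotes.items():
        (out_dir / f"{quote_id}.txt").write_text(text, encoding="utf-8")


def pack_quotes(out_dir, domain, scrape_date):
    """Compress every quote file into one archive next to out_dir."""
    out_dir = Path(out_dir)
    archive_path = out_dir.parent / archive_name(domain, scrape_date)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for quote_file in sorted(out_dir.glob("*.txt")):
            # Store only the file name so the archive has no directory prefix.
            archive.write(quote_file, arcname=quote_file.name)
    return archive_path
```

Calling save_quotes with a dictionary of quote IDs and texts, then pack_quotes with the site's domain and the scrape date, would produce a single archive ready to upload or mirror.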

Tips

Scrape from the browse page (e.g. http://bash.org/?browse). This way you can scrape 10-50 quotes per page request, rather than cycling through thousands of individual quote pages.
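
The browse-page approach might look something like the Python sketch below. The HTML pattern is an assumption about how a bash.org-style browse page is marked up (a header paragraph containing "#<id>" followed by a body paragraph with class "qt"); check the real page source and adjust the pattern for each QDB. The function names are hypothetical.

```python
import html
import re
import urllib.request

# ASSUMED markup: <p class="quote">...<b>#123</b>...</p><p class="qt">text</p>
# Verify against the actual page source before relying on this pattern.
QUOTE_PATTERN = re.compile(
    r'<p class="quote">.*?<b>#(\d+)</b>.*?</p>\s*<p class="qt">(.*?)</p>',
    re.DOTALL,
)


def fetch_page(url):
    """Download one browse page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_browse_page(page_html):
    """Return {quote_id: quote_text} for every quote found on one page."""
    quotes = {}
    for quote_id, body in QUOTE_PATTERN.findall(page_html):
        text = re.sub(r"<br\s*/?>\s*", "\n", body)  # HTML line breaks to newlines
        text = re.sub(r"<[^>]+>", "", text)         # drop any remaining tags
        quotes[int(quote_id)] = html.unescape(text).strip()
    return quotes
```

Something like parse_browse_page(fetch_page("http://bash.org/?browse")) would then return a whole page of quotes from a single request, keyed by ID and ready to be written out as 'n.txt' files.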

Project Status

Database	Has been scraped	Scraper	Notes
Bash.org	Yes	Dr-Spangle	The quote database that pretty much created all others.
DeadDyingDamned.com/QDB/	No		The unofficial ArchiveTeam QDB. I'll have the server automatically save these somewhere. --Auguste 13:36, 7 April 2011 (UTC)
I-Rox.com	Yes	Auguste
Mandaliet.com/furcqdb/	Yes	Auguste	The Furcadia quote database
QDB.MIT.edu	Yes	Auguste	The MIT quote database
QDB.us	Yes	Auguste
QuoteIRC.com	Yes	Auguste
Quotes.BurntElectrons.org	Yes	Auguste	The IRC.Mozilla.org quote database
WarpDrive.se	Yes	Auguste	Quotes are in Swedish
WQDB.org	Yes	Auguste	The Worms quote database
xkcdb.com	Yes	Auguste	The xkcd quote database
german-bash.org	~22000 of 330000	Darkstar	German version of bash.org
ibash.de	No		Another German quote database
