Collecting items randomly

From Archiveteam

An ArchiveTeam member comes across a lot of websites to be saved, with various structures. Understanding how items are accessible is a crucial point in creating the item list, which can then easily be scraped with well-developed tools.

Imagine the following situation: A website assigns a long, unique identifier to each item, impossible to discover by brute force. The site doesn't provide an index or sitemap, of course. You don't even know the number of items. (There are sites that work this way.)

An obvious approach is discovery through Google, Bing, Common Crawl, or whatever else is available. But wait! The site lets you request a random item. You keep clicking the button, and you keep getting different items. Say you can make this request automatically, as you have the link.

Simple! Request that URL x times, where x is a big number.

But how long should we try? Obviously, if the items are presented truly at random, some items soon appear twice; you don't even need to go far to experience that. The longer you run the discovery script, the more often already-seen items appear. How long is it worth trying to get a new random item?

(TL;DR: skip to the Conclusion.)

In combinatorics, the model representing this situation is sampling with replacement. The most often discussed question about it is "what is the probability of picking...". Our question is different: after how many picks does the number of distinct items picked reach the total number of items, n? Or, after k picks, what percentage of all the items have we seen?

Not knowing n, we can't really answer this question. We can only watch the rate at which new, not-yet-seen items turn up, and see it decrease.
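One way to watch that tendency in practice is to record, for each batch of requests, how many of the returned items were new. The sketch below is only an illustration: it assumes a hypothetical callable `fetch_random_item` that performs one request to the site's random-item URL and returns the item's identifier. That name, the function name and the batch size are made up for this example and do not come from any existing tool.

```python
import random


def collect_with_new_item_rate(fetch_random_item, tries, batch=100):
    """Request `tries` random items and record how many were new in each batch.

    `fetch_random_item` is a hypothetical callable returning one item ID per call.
    Returns the set of distinct IDs and a list of (tries_so_far, new_in_batch).
    """
    seen = set()
    history = []
    new_in_batch = 0
    for i in range(1, tries + 1):
        item = fetch_random_item()
        if item not in seen:
            seen.add(item)
            new_in_batch += 1
        # Close the batch every `batch` requests, and once more at the very end.
        if i % batch == 0 or i == tries:
            history.append((i, new_in_batch))
            new_in_batch = 0
    return seen, history


if __name__ == "__main__":
    # Stand-in for a real site: 1000 items served uniformly at random.
    rng = random.Random(0)
    seen, history = collect_with_new_item_rate(
        lambda: rng.randrange(1000), 3000, batch=500
    )
    for tries_so_far, new_items in history:
        print(tries_so_far, new_items)
```

The number of new items per batch should shrink as the run goes on; when it approaches zero, further requests bring little.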

Or, we can run simulations, where we do know what n is and what percentage of the items we have seen, and try to find some constant relationship between the only two numbers we would know in a real discovery: the number of picks (or tries) and the number of distinct (found) items.

Let's do this for, say, 100 items. We keep picking, and after each try we note how many items we have found. For brevity, the following table contains the state after every tenth try.

Progress collecting 100 items randomly
Tries Items found Tries/found
0 0 1.00
10 10 1.00
20 17 1.18
30 25 1.20
40 31 1.29
50 37 1.35
60 45 1.33
70 51 1.37
80 58 1.38
90 60 1.50
100 64 1.56
110 68 1.62
120 71 1.69
130 74 1.76
140 75 1.87
150 78 1.92
160 81 1.98
170 84 2.02
180 84 2.14
190 85 2.24
200 88 2.27
210 90 2.33
220 90 2.44
230 92 2.50
240 93 2.58
250 93 2.69
260 93 2.80
270 93 2.90
280 95 2.95
290 95 3.05
300 96 3.13
310 97 3.20
320 97 3.30
330 97 3.40
340 97 3.51
350 98 3.57
360 98 3.67
370 99 3.74
380 99 3.84
390 99 3.94
400 99 4.04
410 99 4.14
420 100 4.20
430 100 4.30
440 100 4.40
450 100 4.50
460 100 4.60
470 100 4.70
480 100 4.80
490 100 4.90
500 100 5.00

It is easy to see that if we pick 100 times, we get 64 distinct items, and even after 200 picks, we only have 88% of the items. After 300 picks, we miss only 4%; after 400 picks, we have almost everything; and after 500 picks, we have it all. (As this is only one experiment, these are just approximate numbers.)
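An experiment of this kind is easy to repeat. The following is a minimal sketch, not the script used to produce the table above (that one is not shown here); the function name `simulate` and its parameters are chosen for illustration. It assumes every item is equally likely on each pick.

```python
import random


def simulate(n_items, max_tries, step, seed=None):
    """Pick uniformly at random with replacement; report progress every `step` tries.

    Returns a list of (tries, distinct_found, tries / distinct_found) rows.
    """
    rng = random.Random(seed)
    seen = set()
    rows = []
    for tries in range(1, max_tries + 1):
        seen.add(rng.randrange(n_items))  # item IDs are simply 0 .. n_items - 1
        if tries % step == 0:
            rows.append((tries, len(seen), tries / len(seen)))
    return rows


if __name__ == "__main__":
    # Same setup as the table above: 100 items, state after every tenth try.
    for tries, found, ratio in simulate(100, 500, 10, seed=1):
        print(f"{tries:4d} {found:4d} {ratio:5.2f}")
```

Different seeds give slightly different rows, which is why the numbers in any single run are only approximate.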

So, if we knew how many items there are, requesting a random item four times that many times would give us almost all of them. But what if we don't know the total number of items?

Look at the third column. When we reach 100%, the ratio of tries to found items is ~4.2. This ratio is a number we can always calculate, and its value at a given stage of the discovery appears to be independent of the total number of items.

Don't ask me to prove this mathematically. Let me present another simulation instead, with a number that is neither so small nor so round: say, we have 3811 items.

Progress collecting 3811 items randomly
Tries Items found Tries/found Tries/total Found %
0 0 1.00 0.00 0
400 382 1.05 0.10 10
800 729 1.10 0.21 19
1200 1030 1.17 0.31 27
1600 1309 1.22 0.42 34
2000 1556 1.29 0.52 41
2400 1783 1.35 0.63 47
2800 1991 1.41 0.73 52
3200 2168 1.48 0.84 57
3600 2323 1.55 0.94 61
4000 2484 1.61 1.05 65
4400 2608 1.69 1.15 68
4800 2727 1.76 1.26 72
5200 2835 1.83 1.36 74
5600 2915 1.92 1.47 76
6000 3001 2.00 1.57 79
6400 3073 2.08 1.68 81
6800 3145 2.16 1.78 83
7200 3201 2.25 1.89 84
7600 3263 2.33 1.99 86
8000 3321 2.41 2.10 87
8400 3378 2.49 2.20 89
8800 3427 2.57 2.31 90
9200 3456 2.66 2.41 91
9600 3487 2.75 2.52 91
10000 3519 2.84 2.62 92
10400 3560 2.92 2.73 93
10800 3577 3.02 2.83 94
11200 3600 3.11 2.94 94
11600 3620 3.20 3.04 95
12000 3639 3.30 3.15 95
12400 3661 3.39 3.25 96
12800 3679 3.48 3.36 97
13200 3695 3.57 3.46 97
13600 3705 3.67 3.57 97
14000 3723 3.76 3.67 98
14400 3736 3.85 3.78 98
14800 3744 3.95 3.88 98
15200 3751 4.05 3.99 98
15600 3760 4.15 4.09 99
16000 3769 4.25 4.20 99
16400 3775 4.34 4.30 99
16800 3782 4.44 4.41 99
17200 3785 4.54 4.51 99
17600 3788 4.65 4.62 99
18000 3793 4.75 4.72 100
18400 3794 4.85 4.83 100
18800 3796 4.95 4.93 100
19200 3798 5.06 5.04 100
19600 3799 5.16 5.14 100
20000 3801 5.26 5.25 100

Let's first check the percentage. (In the first table, that was simply the number of found items, because the total number of items was 100.) Now it's the fifth column. We have 3811 items. After that many tries, you can see, ~63% of the items are found. Twice as many tries gives us 86%, three times as many ~95%, four times 98%, five times ~100%. The percentages are similar to those in the first simulation.

Now, go on to the tries/found ratio (third column), which is the most interesting. When we reach 100%, it is ~4.7. In the first case it was 4.2. But let's look at some other milestones: when we reach 68%, this ratio is 1.62 and 1.69 in the first and second case, respectively; at 84%, 2.14 and 2.25. You see, this is quite constant. Of course, there are larger differences in the last few percent, and even a rounded 100% doesn't mean you've got each and every item. But, after x tries, you can count how many distinct items you have, and make a guess at how much of the total corpus you've discovered.
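The near-constant ratio has a standard explanation in terms of expected values, given here as a supplement to the simulations rather than as something the experiments themselves establish. It assumes each of the $N$ items is equally likely on every request. The symbols are introduced only for this sketch: $T$ is the number of tries, $D_T$ the number of distinct items seen after $T$ tries, and $t = T/N$.

```latex
\begin{align*}
\mathbb{E}[D_T] &= N\left(1 - \left(1 - \frac{1}{N}\right)^{T}\right)
                 \approx N\left(1 - e^{-T/N}\right) \\[4pt]
\frac{\mathbb{E}[D_T]}{N} &\approx 1 - e^{-t}
  \qquad\text{(fraction found)} \\[4pt]
\frac{T}{\mathbb{E}[D_T]} &\approx \frac{t}{1 - e^{-t}}
  \qquad\text{(tries/found ratio)}
\end{align*}
```

Both the fraction found and the tries/found ratio depend only on $t$, not on $N$ itself, so a given ratio corresponds to roughly the same percentage whatever the size of the site. For instance, $t = 1$ gives about 63% found at a ratio of about 1.58, and $t = 2$ gives about 86% at a ratio of about 2.31, close to what both tables show.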

Conclusion

So, if you need to do such a discovery, make n tries, then count the distinct items, k, and calculate n/k. (Note that n here is the number of tries, not the total number of items.) Then find the ratio nearest to it in the third column of one of the preceding tables, and you get an approximation of what percentage of all items you've discovered.

We have also learned that if we want to discover almost all the items, we need to push that random button at least three times as often as there are items, but it isn't worth making more than five times that many tries. Even twice the number of items still gives a fair result.
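These rules of thumb can be cross-checked against the usual expected-value approximation for sampling with replacement, found/total = 1 - exp(-tries/total). That formula is not derived on this page and assumes every item is equally likely on each request; the function name below is hypothetical. A minimal sketch:

```python
import math


def tries_for_fraction(total_items, fraction):
    """Approximate number of random requests needed to have seen `fraction` of the items.

    Assumption: found/total is approximately 1 - exp(-tries/total), i.e. every
    item is equally likely on each request.
    """
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return math.ceil(-total_items * math.log(1 - fraction))


if __name__ == "__main__":
    # Requests needed for several coverage targets on a 3811-item site.
    for fraction in (0.63, 0.86, 0.95, 0.98, 0.99):
        print(fraction, tries_for_fraction(3811, fraction))
```

Under this approximation, 95% coverage takes about 3 times the number of items and 99% about 4.6 times, which agrees with the three-to-five range above.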

An example: a run of 94,836 successful queries on kepkezelo.com gave 40,418 distinct items. The n/k ratio is 2.35. According to the second table, that means ~86% of the items have been discovered. Thus, the total number of items is around 47,000, and we can expect that after another run of the same size we'd have ~98% of them.
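Instead of looking the ratio up in a table, the same estimate can be computed. The sketch below inverts the standard expected-value approximation tries/found = t / (1 - exp(-t)), with t = tries/total, by bisection. The approximation is an assumption that is not proven on this page and holds only if items are served uniformly at random; the function names are hypothetical.

```python
import math


def _ratio_at(t):
    """Expected tries/found ratio when tries = t * total (uniform sampling assumed)."""
    return t / -math.expm1(-t)


def estimate_coverage(tries, distinct):
    """Estimate (fraction_found, total_items) from tries and distinct items seen.

    Solves tries/distinct = t / (1 - exp(-t)) for t by bisection.
    """
    if distinct <= 0 or tries < distinct:
        raise ValueError("need tries >= distinct > 0")
    ratio = tries / distinct
    if ratio == 1.0:
        raise ValueError("no repeats seen yet; the total cannot be estimated")
    lo, hi = 1e-9, 1.0
    while _ratio_at(hi) < ratio:
        hi *= 2  # widen the bracket until it contains the solution
    for _ in range(100):
        mid = (lo + hi) / 2
        if _ratio_at(mid) < ratio:
            lo = mid
        else:
            hi = mid
    t = (lo + hi) / 2
    fraction = -math.expm1(-t)  # 1 - exp(-t)
    return fraction, distinct / fraction


if __name__ == "__main__":
    fraction, total = estimate_coverage(94_836, 40_418)
    print(f"{fraction:.0%} found, about {total:,.0f} items in total")
```

For the kepkezelo.com run this comes out at roughly 87% found and a total of roughly 46,000 items, close to the ~86% and ~47,000 read off the table.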

