Difference between revisions of "URLTeam"
Revision as of 22:57, 19 December 2010
Too many people using TinyURL and similar services
Twitter is a great example of what's wrong with trusting an online service with something of value. Check out some 'tweets':
- Hah, I'm a Zombie! http://tinyurl.com/8gnnb7 Ahh, the fun we all have with each other. about 1 hour ago from web
- Health privacy is dead. Here's why: http://ff.im/GMpx about 14 hours ago from FriendFeed
- Hmm, friendfeed released a new "import Twitter" feature today. It is taking a LONG time on my account. I wonder why.... http://ff.im/GM5W about 14 hours ago from FriendFeed
If these TinyURL services go away, there's not much content here. See Link Rot.
So, the project: scrape TinyURL and similar services.
STATUS (as of mid-April, 2009):
- tinyurl.com: 1M urls ripped
- ff.im: 1M urls ripped
- bit.ly: just started mid-April, 2009
- is.gd: over 70M urls ripped (by User:Chronomex) as of 2010-Aug-16
- NOTE: ripping is going slowly so I don't get banned and/or overwhelm the service. ff.im banned me for 24 hours once for ripping too quickly. Also, I'm ripping random URLs, not sequential.
- This looks like it would be a good task for distributed computing. Majestic-12 is a project whose main bottleneck is bandwidth, and they are doing quite well. You'd just need to give people a block of URLs to check, and have them report back the results.
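The two notes above (random rather than sequential order, and handing out blocks of URLs to volunteers) could be sketched roughly as follows. This is only an illustration: the base-36 alphabet, the fixed code length, and the function names `index_to_code`, `make_blocks` and `codes_in_block` are assumptions, not part of any existing Archive Team tool.

```python
import random
import string

# Assumed alphabet: digits plus lowercase letters (base-36).
ALPHABET = string.digits + string.ascii_lowercase


def index_to_code(index, length, alphabet=ALPHABET):
    """Convert an integer into a fixed-length code, left-padded with the first symbol."""
    base = len(alphabet)
    chars = []
    for _ in range(length):
        index, rem = divmod(index, base)
        chars.append(alphabet[rem])
    return "".join(reversed(chars))


def make_blocks(length, block_size, alphabet=ALPHABET):
    """Split the keyspace of fixed-length codes into (start, stop) index ranges."""
    total = len(alphabet) ** length
    return [(start, min(start + block_size, total))
            for start in range(0, total, block_size)]


def codes_in_block(block, length, seed=None, alphabet=ALPHABET):
    """Return every code of one block, shuffled so requests are not sequential."""
    start, stop = block
    indices = list(range(start, stop))
    random.Random(seed).shuffle(indices)
    return [index_to_code(i, length, alphabet) for i in indices]
```

A coordinator would hand one `(start, stop)` block to each volunteer, who checks the shuffled codes at a polite rate and reports the results back.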
It's actually not as hard as it sounds: we don't need to scrape any web pages or parse any HTML. The services just send a Location: header when queried for a hash, so we ask the service for the hash and parse the response headers for the redirect URL:
(18) firstname.lastname@example.org Wed 11:10am [~] % curl -LLIs http://tinyurl.com/6dvm2t | grep Location
Location: http://www.readwriteweb.com/archives/too_many_people_use_tinyurl.php
(19) email@example.com Wed 11:10am [~] % curl -LLIs http://ff.im/GMpx | grep Location
Location: http://friendfeed.com/e/08954685-00fe-4e55-b28f-4b99f83bfb0d/Health-privacy-is-dead-Here-s-why/
Walk through all possible hashes, check for errors, and we're good to go.
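The same lookup can be done from a script. The sketch below uses only Python's standard library; the function names `extract_location` and `resolve` are made up for this example. Unlike the curl command, `resolve` sends a single HEAD request and does not follow the redirect, which is an assumption about what a scraper would want.

```python
import http.client
from urllib.parse import urlsplit


def extract_location(raw_headers):
    """Return the value of the first Location header in a raw header block, or None."""
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None


def resolve(short_url, timeout=10):
    """Send one HEAD request and return (status, location) without following the redirect."""
    parts = urlsplit(short_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    # Some services put the code in the query string (e.g. "/?wyok"), so keep it.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

A call such as `resolve("http://tinyurl.com/6dvm2t")` would return the status code and the redirect target if the service still answers. A status outside the 3xx range, or a missing Location, would be recorded as an error for that hash.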
The Monkeyshines algorithmic scraper has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to avoid re-requesting urls, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls, or have it randomly try either base-36 or base-62 URLs. With it, I've gathered about 6M valid URLs pulled from twitter messages so far.
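To illustrate the two ideas mentioned here (a read-through cache and random base-36 or base-62 codes), here is a minimal sketch. It is not Monkeyshines code; the class `ReadThroughCache`, the function `random_code` and the in-memory dict standing in for a shared store are all assumptions made for the example.

```python
import random
import string

BASE36 = string.digits + string.ascii_lowercase
BASE62 = BASE36 + string.ascii_uppercase


def random_code(length, alphabet=BASE62, rng=random):
    """Pick a random code of the given length from the alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


class ReadThroughCache:
    """Look a URL up in the store first and only call fetch() on a miss."""

    def __init__(self, fetch, store=None):
        self.fetch = fetch  # function: url -> result (e.g. the redirect target)
        # A plain dict stands in for a store shared between machines.
        self.store = {} if store is None else store

    def get(self, url):
        if url not in self.store:
            self.store[url] = self.fetch(url)
        return self.store[url]
```

For several scrapers to share one cache, the dict would have to be replaced by something all machines can reach, such as a networked key-value store.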
Or just ask!
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.
Try sending an email to the website owner:
Hello!

I'm working with Jason Scott of textfiles.org and other members of the Archive Team. Since the recent scare involving http://tr.im/'s announced (and then retracted) imminent demise, we've been working to archive all the links from URL shorteners around the Internet.

If I'm not mistaken, you operate urlx.org. Would you be so kind as to share with us a copy of your URL database? We'll do our best to preserve this data forever in a useful way.

We are already very far along in scraping links from tr.im, but it's faster (and friendlier!) to contact site owners asking for a copy of their data than it is to scrape.

We've got a domain registered, urlte.am, and all links will be available for redirect in the format: http://urlx.org.urlte.am/av3

If you could help us, that would be excellent!

Thank you,
URL shortening services:
Todo: copy more from the list at .
List last updated 2009-08-14.
- 4url.cc - Completely ripped by User:chronomex as of 2009-08-14
- adjix.com - case-insensitive, incremental
- ad.vu - mirror of adjix.com
- budurl.com - Appears nonincremental
- cli.gs - Appears nonincremental
- cort.as - http://cortas.elpais.com/
- decenturl.com - Not at all easy to scrape.
- doiop.com - Appears nonincremental
- dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041
- easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3
- easyurl.net - Appears nonincremental: http://easyurl.net/afd2f
- ff.im - is this really a shortener?
- imfy.us - requires a recaptcha to get to the linked site.
- is.gd - Being ripped by User:chronomex
- jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388
- litturl.com - Random, 3-chars. Being ripped by User:chronomex
- memurl.com - Pronounceable. Broken.
- metamark.net / xrl.us - ? http://xrl.us/bfabog
- minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh
- myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5
- notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/
- nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.
- ow.ly - I can't get it to work.
- plexp.com - Parked.
- pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc
- poprl.com - Not resolving
- redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok
- s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab
- shorterlink.com - Parked.
- shortlinks.co.uk - Not resolving
- short.to - Probably sequential/loweralpha: http://short.to/msmp
- shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok
- shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp
- ur1.ca - Database is downloadable from website directly.
- url.0daymeme.com - completely ripped by User:chronomex as of 2009-08-14.
- url9.com - Sequential, alphanumeric. Leading 0s are significant.
- urlx.org - Owner has agreed to share his database
- xrl.us - see metamark.net
- xs.md - completely ripped by User:Chronomex as of 2009-08-15.
- goo.gl - Google
- fb.me - Facebook
- amzn.to - Amazon
- binged.it - Bing (bonus points for being longer than bing.com)
- y.ahoo.it - Yahoo
- youtu.be - YouTube
- t.co? - Twitter
- post.ly - Posterous
- wp.me - Wordpress.com
- flic.kr - Flickr
- lnkd.in - LinkedIn
- su.pr - StumbleUpon
- go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)
- nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)
- tcrn.ch - Techcrunch
- ff.im - FriendFeed - bought by facebook
- digg.com - discontinued
Dead or Broken Shorteners
- chod.sk - Appears nonincremental, not resolving
- gonext.org - not resolving
- ix.it - Not resolving
- jijr.com - Doesn't appear to be a shortener, now parked
- kissa.be - "Kissa.be url shortener service is shutdown"
- kurl.us - Parked.
- miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."
- minurl.org - Presently in ERROR 404
- muhlink.com - Not resolving
- myurl.us - cpanel frontend
Distribution of tinyurl's
Tinyurl.com urls are supposed to be un-ordered now, but there's enough prehistory that you should concentrate on ones with the initial character 5-9 or a-d. Distribution of 6-character tinyurl.com urls by initial character (from 20M tinyurls extracted from twitter):
  count  cumulative  initial
      1           1  #
     63          64  0
    125         189  1
 282371      282560  2
 330545      613105  3
 386626      999731  4
1585765     2585496  5
1676929     4262425  6
1009816     5272241  7
1007035     6279276  8
1009790     7289066  9
1509965     8799031  a
1712227    10511258  b
4986046    15497304  c
3592027    19089331  d
    331    19089662  e
    473    19090135  f
    514    19090649  g
    399    19091048  h
    353    19091401  i
    363    19091764  j
  14146    19105910  k
  33050    19138960  l
  33517    19172477  m
  33273    19205750  n
 194311    19400061  o
 194817    19594878  p
 194563    19789441  q
  85263    19874704  r
    896    19875600  s
    780    19876380  t
    167    19876547  u
    224    19876771  v
    484    19877255  w
     12    19877267  x
  92827    19970094  y
    126    19970220  z
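A count like the one above (number of urls per initial character, plus a running total) could be computed along these lines. This is a sketch under assumptions: the input is taken to be a list of full short URLs, and the function name `initial_distribution` is invented for the example. It is not the script that produced the figures above.

```python
from collections import Counter
from urllib.parse import urlsplit


def initial_distribution(urls, code_length=6):
    """Return (count, cumulative, initial) rows for codes of the given length."""
    counts = Counter()
    for url in urls:
        code = urlsplit(url).path.lstrip("/")
        if len(code) == code_length:
            counts[code[0]] += 1
    rows = []
    running = 0
    # Plain string sort puts '#' first, then digits, then lowercase letters.
    for char in sorted(counts):
        running += counts[char]
        rows.append((counts[char], running, char))
    return rows
```

Each row gives the count for one initial character, the cumulative total so far, and the character itself.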
NOTE: http://301works.com/ is supposedly also archiving all of the url-shorteners, but you wouldn't know it from their web page.