Difference between revisions of "URLTeam"

Revision as of 22:57, 19 December 2010

Too many people using TinyURL and similar services

Twitter is a great example of what's wrong with trusting an online service with something of value. Check out some 'tweets':

Hah, I'm a Zombie! http://tinyurl.com/8gnnb7 Ahh, the fun we all have with each other. about 1 hour ago from web
Health privacy is dead. Here's why: http://ff.im/GMpx about 14 hours ago from FriendFeed
Hmm, friendfeed released a new "import Twitter" feature today. It is taking a LONG time on my account. I wonder why.... http://ff.im/GM5W about 14 hours ago from FriendFeed

If these TinyURL services go away, there's not much content here. See Link Rot.

So, the project: scrape TinyURL and similar services.

STATUS (as of mid-April, 2009):
* tinyurl.com: 1M urls ripped
* ff.im: 1M urls ripped
* bit.ly: just started mid-April, 2009
* is.gd: over 70M urls ripped (by User:Chronomex) as of 2010-Aug-16

NOTE: I'm ripping slowly so that I don't get banned or overwhelm the service. ff.im banned me for 24 hours once for ripping too quickly. Also, I'm ripping random URLs, not sequential ones.
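The pacing described in the note could look roughly like the Python sketch below. It is only an illustration: the names `random_codes` and `crawl`, the base-36 alphabet, and the 2-second default delay are all assumptions, and `fetch` stands for whatever function actually queries the service.

```python
import random
import string
import time

# Assumed alphabet (base-36). Each service uses its own set of characters.
ALPHABET = string.digits + string.ascii_lowercase


def random_codes(length, count, rng=random):
    """Yield `count` random short codes of the given length (repeats are possible)."""
    for _ in range(count):
        yield "".join(rng.choice(ALPHABET) for _ in range(length))


def crawl(codes, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(code) for each code, pausing `delay` seconds between requests."""
    results = {}
    for i, code in enumerate(codes):
        if i:
            sleep(delay)  # no pause before the very first request
        results[code] = fetch(code)
    return results
```

Passing `sleep` as a parameter makes it easy to try the loop without actually waiting, and `delay` should be raised if a service starts refusing requests.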

This looks like it would be a good task for distributed computing. Majestic-12 is a project whose main bottleneck is bandwidth, and they are doing quite well. You'd just need to give people a block of URLs to check, and have them report back the results.
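One way to hand out blocks of URLs is to treat every short code as a number and split the number range into chunks. The Python sketch below assumes fixed-length, base-36 codes; `encode`, `blocks`, and `codes_in_block` are hypothetical names, and how results get reported back is left out.

```python
import string

# Assumed alphabet (base-36). Each service uses its own set of characters.
ALPHABET = string.digits + string.ascii_lowercase


def encode(number, length):
    """Write `number` as a fixed-width code in ALPHABET, zero-padded on the left."""
    digits = []
    for _ in range(length):
        number, remainder = divmod(number, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def blocks(length, block_size):
    """Yield inclusive (first, last) integer ranges covering every code of `length`."""
    total = len(ALPHABET) ** length
    for start in range(0, total, block_size):
        yield start, min(start + block_size, total) - 1


def codes_in_block(first, last, length):
    """Expand one block into the short codes a volunteer should check."""
    return [encode(n, length) for n in range(first, last + 1)]
```

A coordinator would only need to send each volunteer a pair of integers, which keeps the work assignment small even when a block holds millions of codes.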

HOWTO

It's actually not as hard as it sounds: we don't need to scrape any web pages or parse any HTML. The services just send a Location: header when queried for a hash, so we ask the service for the hash and read the redirect URL out of the response headers:

(18) swebb@swebb.cluster Wed 11:10am  [~] % curl -LLIs http://tinyurl.com/6dvm2t | grep Location 
Location: http://www.readwriteweb.com/archives/too_many_people_use_tinyurl.php
(19) swebb@swebb.cluster Wed 11:10am  [~] % curl -LLIs http://ff.im/GMpx | grep Location
Location: http://friendfeed.com/e/08954685-00fe-4e55-b28f-4b99f83bfb0d/Health-privacy-is-dead-Here-s-why/

Walk through all possible hashes, check for errors, and we're good to go.
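For anyone who would rather script this than call curl, here is a minimal Python sketch of the same lookup using only the standard library. It sends a HEAD request, does not follow the redirect, and returns the Location header. The function names `resolve` and `location_from_headers` are made up for this example, and error handling (timeouts, bans, odd status codes) is deliberately left thin.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def location_from_headers(status, headers):
    """Return the redirect target for a redirect response, else None.

    `headers` is a list of (name, value) pairs, as returned by
    http.client.HTTPResponse.getheaders().
    """
    if status not in REDIRECT_STATUSES:
        return None
    for name, value in headers:
        if name.lower() == "location":
            return value
    return None


def resolve(short_url, timeout=10):
    """Ask the shortener where short_url points, without following the redirect."""
    parts = urlsplit(short_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:  # some services put the hash in the query string
        path += "?" + parts.query
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return location_from_headers(response.status, response.getheaders())
    finally:
        conn.close()
```

A return value of None means the service did not answer with a redirect, which covers unused hashes as well as services that redirect via JavaScript or an interstitial page.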

Monkeyshines

The Monkeyshines algorithmic scraper has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs, or have it randomly try either base-36 or base-62 URLs. With it, I've gathered about 6M valid URLs pulled from Twitter messages so far.
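To illustrate what a read-through cache means here, a small Python sketch follows. This is not Monkeyshines code: the class name `ReadThroughCache` is invented, and the in-memory dict is a stand-in for whatever shared store several machines would actually use.

```python
class ReadThroughCache:
    """Look a short URL up in the store first; call the fetcher only on a miss."""

    def __init__(self, fetch, store=None):
        self.fetch = fetch  # function that really queries the shortener
        self.store = {} if store is None else store
        self.misses = 0

    def get(self, short_url):
        if short_url in self.store:
            return self.store[short_url]
        self.misses += 1
        target = self.fetch(short_url)
        # None results are stored too, so dead hashes are not requested again.
        self.store[short_url] = target
        return target
```

Because every lookup goes through `get`, a URL that has already been resolved never costs a second request to the service.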

Or just ask!

Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.

Try sending an email to the website owner:

Hello!

I'm working with Jason Scott of textfiles.com and other members of the
Archive Team.

Since the recent scare involving http://tr.im/'s announced (and then
retracted) imminent demise, we've been working to archive all the
links from URL shorteners around the Internet.

If I'm not mistaken, you operate urlx.org.  Would you be so kind as to
share with us a copy of your URL database?  We'll do our best to
preserve this data forever in a useful way.

We are already very far along in scraping links from tr.im, but it's
faster (and friendlier!) to contact site owners asking for a copy of
their data than it is to scrape.

We've got a domain registered, urlte.am, and all links will be
available for redirect in the format:

http://urlx.org.urlte.am/av3

If you could help us, that would be excellent!

Thank you,
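The redirect format quoted in the email (http://urlx.org.urlte.am/av3) is simple enough to express as a small Python function. This is only a sketch of the naming scheme as described in the email; the function name `archive_url` is hypothetical.

```python
from urllib.parse import urlsplit


def archive_url(short_url, archive_domain="urlte.am"):
    """Map http://urlx.org/av3 to http://urlx.org.urlte.am/av3."""
    parts = urlsplit(short_url)
    path = parts.path or "/"
    if parts.query:  # keep query-style hashes intact
        path += "?" + parts.query
    return f"http://{parts.netloc}.{archive_domain}{path}"
```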

URL shortening services:

Todo: copy more from the list at [1].

List last updated 2009-08-14.

1link.in
~~4url.cc - Completely ripped by User:chronomex as of 2009-08-14~~
6url.com
adjix.com - case-insensitive, incremental
ad.vu - mirror of adjix.com
biglnk.com
bit.ly
budurl.com - Appears nonincremental
canurl.com
cli.gs - Appears nonincremental
cort.as - http://cortas.elpais.com/
decenturl.com - Not at all easy to scrape.
dlvr.it
doiop.com - Appears nonincremental
dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041
easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3
easyurl.net - Appears nonincremental: http://easyurl.net/afd2f
ff.im - is this really a shortener?
go2cut.com
ilix.in
imfy.us - requires a recaptcha to get to the linked site.
is.gd - Being ripped by User:chronomex
jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388
~~litturl.com - Random, 3-chars. Being ripped by User:chronomex~~
lnkurl.com
memurl.com - Pronounceable. Broken.
metamark.net / xrl.us - ? http://xrl.us/bfabog
minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh
myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5
notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/
nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.
ow.ly - I can't get it to work.
plexp.com - Parked.
pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc
poprl.com - Not resolving
qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf
redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok
rod.gs
s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab
shorterlink.com - Parked.
shortlinks.co.uk - Not resolving
short.to - Probably sequential/loweralpha: http://short.to/msmp
shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok
shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp
shrinkurl.us
shrt.st
shurl.net
simurl.com
shorl.com
smarturl.eu
snipr.com
snipurl.com
snurl.com
sn.vc
starturl.com
surl.co.uk
tighturl.com
timesurl.at
tiny123.com
tiny.cc
tinylink.com
tinyurl.com
tobtr.com
traceurl.com
tr.im
tweetburner.com
twitpwr.com
twitthis.com
twurl.nl
u.mavrev.com
ur1.ca - Database is downloadable from website directly.
~~url.0daymeme.com - completely ripped by User:chronomex as of 2009-08-14.~~
url9.com - Sequential, alphanumeric. Leading 0s are significant.
urlborg.com
urlbrief.com
urlcover.com
urlcut.com
urlhawk.com
url-press.com
urlsmash.com
urltea.com
urlvi.be
urlx.org - Owner has agreed to share his database
vimeo.com
wlink.us
xaddr.com
xil.in
xrl.us - see metamark.net
xym.kr
x.se
~~xs.md - completely ripped by User:Chronomex as of 2009-08-15.~~
yatuc.com
yep.it
yweb.com
zi.ma
w3t.org

"Official" shorteners

goo.gl - Google
fb.me - Facebook
amzn.to - Amazon
binged.it - Bing (bonus points for being longer than bing.com)
y.ahoo.it - Yahoo
youtu.be - YouTube
t.co? - Twitter
post.ly - Posterous
wp.me - WordPress.com
flic.kr - Flickr
lnkd.in - LinkedIn
su.pr - StumbleUpon
go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)
nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)
tcrn.ch - TechCrunch
ff.im - FriendFeed - bought by Facebook
digg.com - discontinued - [2]

Dead or Broken Shorteners

chod.sk - Appears nonincremental, not resolving
gonext.org - not resolving
ix.it - Not resolving
jijr.com - Doesn't appear to be a shortener, now parked
kissa.be - "Kissa.be url shortener service is shutdown"
kurl.us - Parked.
miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."
minurl.org - Currently returns a 404 error
muhlink.com - Not resolving
myurl.us - cpanel frontend

Distribution of tinyurl's

Tinyurl.com URLs are supposed to be unordered now, but there's enough prehistory that you should concentrate on ones whose initial character is 5-9 or a-d. Below is the distribution of 6-character tinyurl.com URLs by first character (from 20M tinyurls extracted from Twitter). Columns: count, cumulative count, first character.

        1	         1	#
       63	        64	0
      125	       189	1
   282371	    282560	2
   330545	    613105	3
   386626	    999731	4
  1585765	   2585496	5
  1676929	   4262425	6
  1009816	   5272241	7
  1007035	   6279276	8
  1009790	   7289066	9
  1509965	   8799031	a
  1712227	  10511258	b
  4986046	  15497304	c
  3592027	  19089331	d
      331	  19089662	e
      473	  19090135	f
      514	  19090649	g
      399	  19091048	h
      353	  19091401	i
      363	  19091764	j
    14146	  19105910	k
    33050	  19138960	l
    33517	  19172477	m
    33273	  19205750	n
   194311	  19400061	o
   194817	  19594878	p
   194563	  19789441	q
    85263	  19874704	r
      896	  19875600	s
      780	  19876380	t
      167	  19876547	u
      224	  19876771	v
      484	  19877255	w
       12	  19877267	x
    92827	  19970094	y
      126	  19970220	z
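The table above can be checked, and the advice about 5-9 and a-d quantified, with a few lines of Python. The counts are copied from the first column of the table; the helper names `cumulative` and `share` are made up for this sketch.

```python
from itertools import accumulate

# First character -> number of 6-character tinyurl.com codes (from the table).
COUNTS = {
    "#": 1, "0": 63, "1": 125, "2": 282371, "3": 330545, "4": 386626,
    "5": 1585765, "6": 1676929, "7": 1009816, "8": 1007035, "9": 1009790,
    "a": 1509965, "b": 1712227, "c": 4986046, "d": 3592027, "e": 331,
    "f": 473, "g": 514, "h": 399, "i": 353, "j": 363, "k": 14146,
    "l": 33050, "m": 33517, "n": 33273, "o": 194311, "p": 194817,
    "q": 194563, "r": 85263, "s": 896, "t": 780, "u": 167, "v": 224,
    "w": 484, "x": 12, "y": 92827, "z": 126,
}


def cumulative(counts):
    """Return first character -> running total, in table order."""
    return dict(zip(counts, accumulate(counts.values())))


def share(counts, first_chars):
    """Fraction of all codes whose first character is in `first_chars`."""
    return sum(counts[c] for c in first_chars) / sum(counts.values())
```

With these numbers, `share(COUNTS, "56789abcd")` comes out at about 0.906, so roughly nine in ten of the sampled codes start with one of the recommended characters.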

NOTE: http://301works.com/ is supposedly also archiving all of the url-shorteners, but you wouldn't know it from their web page.

