{{Infobox project
| title = Urlteam
| image = Urlteam logo.png
| description = url shortening was a fucking awful idea
| URL = http://urlte.am
| project_status = {{online}}
| archiving_status = {{in progress}}
}}


Services like '''TinyURL''' are a ticking timebomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archiveteam is here to help with their Urlteam subcommittee.

Twitter is a great example of what's wrong with trusting an online service with something of value. Check out some 'tweets':

* Hah, I'm a Zombie! http://tinyurl.com/8gnnb7 Ahh, the fun we all have with each other. about 1 hour ago from web
* Health privacy is dead. Here's why: http://ff.im/GMpx about 14 hours ago from FriendFeed
* Hmm, friendfeed released a new "import Twitter" feature today. It is taking a LONG time on my account. I wonder why.... http://ff.im/GM5W about 14 hours ago from FriendFeed

If these shortener services go away, there's not much content left in those tweets. See [http://en.wikipedia.org/wiki/Link_rot Link Rot].

== Who did this? ==
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]

* [[User:Scumola]] started this wiki page
* [[User:Chronomex]] started the Urlteam scraping effort
* [[User:Soult]] helps with scraping
 
So, the project: scrape the TinyURL (and similar) services.
STATUS (as of mid-April, 2009):
* tinyurl.com: 1M urls ripped
* ff.im: 1M urls ripped
* bit.ly: just started mid-April, 2009
* is.gd: over 70M urls ripped (by [[User:Chronomex]]) as of 2010-Aug-16
* NOTE: ripping is going slowly so I don't get banned and/or overwhelm the service.  ff.im banned me for 24 hours once for ripping too quickly.  Also, I'm ripping random URLs, not sequential.
 
* This looks like it would be a good task for distributed computing.  [http://www.majestic12.co.uk/ Majestic-12] is a project whose main bottleneck is bandwidth, and they are doing quite well.  You'd just need to give people a block of URLs to check, and have them report back the results.
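The "random, throttled" approach from the note above is simple to sketch. This illustrative Python fragment is not any of the actual rippers; the base-36 alphabet and the one-second delay are assumptions you would tune per service:

```python
import random
import string
import time

ALPHABET = string.digits + string.ascii_lowercase  # assuming base-36 codes

def random_codes(length, delay=1.0):
    """Yield codes in random order (not sequential), pausing between
    them so the service isn't hammered and we're less likely to be banned."""
    while True:
        yield "".join(random.choice(ALPHABET) for _ in range(length))
        time.sleep(delay)
```

Random probing also doubles as sampling: the fraction of probed codes that resolve gives a rough estimate of how full the code space is.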
 
== HOWTO ==
 
It's actually not as hard as it sounds, because we don't need to scrape any web pages or parse any HTML. The services just send a <code>Location:</code> header when queried for a hash, so we ask the service for the hash and parse the headers for the redirect url:
 
 $ curl -Is http://tinyurl.com/6dvm2t | grep Location
 Location: http://www.readwriteweb.com/archives/too_many_people_use_tinyurl.php
 $ curl -Is http://ff.im/GMpx | grep Location
 Location: http://friendfeed.com/e/08954685-00fe-4e55-b28f-4b99f83bfb0d/Health-privacy-is-dead-Here-s-why/
 
Walk through all possible hashes, check for errors, and we're good to go.
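In Python, that walk looks roughly like this. It's a sketch only (the real scrapers are listed under Tools); the base-36 alphabet and the 301/302 check are assumptions about how a given service behaves:

```python
import http.client
import itertools
import string

ALPHABET = string.digits + string.ascii_lowercase  # assuming base-36 hashes

def hashes(length):
    """Walk through every possible hash of a given length, in order."""
    for chars in itertools.product(ALPHABET, repeat=length):
        yield "".join(chars)

def resolve(service, code):
    """HEAD-request one hash; return the Location header, or None if unused."""
    conn = http.client.HTTPConnection(service, timeout=10)
    conn.request("HEAD", "/" + code)
    resp = conn.getresponse()
    conn.close()
    return resp.getheader("Location") if resp.status in (301, 302) else None
```

For example, <code>resolve("tinyurl.com", "6dvm2t")</code> should give back the readwriteweb.com url from the curl session above.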
 
 


== Tools ==
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]
* [[User:Soult]] did the same in Ruby
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting urls, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from twitter messages so far.
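The read-through cache is the interesting trick there, and it is easy to sketch. This Python fragment is illustrative only (Monkeyshines itself is Ruby, and its on-disk format is different); the table name and schema here are made up:

```python
import sqlite3

class UrlCache:
    """Read-through cache: only resolve a code we have never seen before."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls (code TEXT PRIMARY KEY, target TEXT)")

    def lookup(self, code, fetch):
        row = self.db.execute(
            "SELECT target FROM urls WHERE code = ?", (code,)).fetchone()
        if row is not None:
            return row[0]          # cache hit: no HTTP request made
        target = fetch(code)       # cache miss: hit the service exactly once
        self.db.execute("INSERT INTO urls VALUES (?, ?)", (code, target))
        self.db.commit()
        return target
```

Any callable that turns a code into a target url can be plugged in as <code>fetch</code>, so the same cache can front scrapers for different services.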


=== Or just ask! ===
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.


Try sending an email to the website owner:

  Hello!
  
  I'm working with Jason Scott of textfiles.org and other members of the
  Archive Team.
  
  Since the recent scare involving http://tr.im/'s announced (and then
  retracted) imminent demise, we've been working to archive all the
  links from URL shorteners around the Internet.
  
  If I'm not mistaken, you operate urlx.org.  Would you be so kind as to
  share with us a copy of your URL database?  We'll do our best to
  preserve this data forever in a useful way.
  
  We are already very far along in scraping links from tr.im, but it's
  faster (and friendlier!) to contact site owners asking for a copy of
  their data than it is to scrape.
  
  We've got a domain registered, urlte.am, and all links will be
  available for redirect in the format:
  
  http://urlx.org.urlte.am/av3
  
  If you could help us, that would be excellent!
  
  Thank you,


== URL shorteners ==
 
=== New table ===
The new table includes shorteners we have already started to scrape.
{| class="sortable wikitable" style="width: auto; text-align: center"
! Name
! Number of shorturls
! Scraping done by
! Status
! Comments
|-
| [http://tinyurl.com TinyURL]
| 1,000,000,000
| [[User:Soult]]
| 5-letter codes done, on halt due to being banned (2010-12-20)
| non-sequential, bans IP for requesting too many non-existing shorturls
|-
| [http://bit.ly bit.ly]
| 4,000,000,000
| [[User:Soult]]
| about 1/4
| non-sequential
|-
| [http://is.gd is.gd]
| 287,151,326
| [[User:Chronomex]]
| about 1/3 (2010-10-31)
| sequential
|-
| [http://ff.im ff.im]
| ?
| [[User:Chronomex]]
|
| only used by FriendFeed, no interface to shorten new URLs
|-
| [http://4url.cc/ 4url.cc]
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done (2009-08-14)
| sequential
|-
| litturl.com
| 33695<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done
| dead (2010-11-18)
|-
| xs.md
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done
| dead (2010-11-18)
|-
| url.0daymeme.com
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done
| dead (2010-11-18)
|-
| [http://tr.im tr.im]
| ?
| [[User:Soult]]
| 5-letter codes finished, 6-letter codes in progress
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast
|- class="sortbottom"
! Name
! Number of shorturls
! Scraping done by
! Status
! Comments
|}


=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===
List last updated 2009-08-14.
* 1link.in
* 6url.com
* adjix.com - case-insensitive, incremental
* ad.vu - mirror of adjix.com
* biglnk.com
* budurl.com - Appears nonincremental
* canurl.com
* cli.gs - Appears nonincremental
* cort.as - http://cortas.elpais.com/
* decenturl.com - Not at all easy to scrape.
* dlvr.it
* doiop.com - Appears nonincremental
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3
* easyurl.net - Appears nonincremental: http://easyurl.net/afd2f
* go2cut.com
* ilix.in
* imfy.us - requires a recaptcha to get to the linked site.
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388
* lnkurl.com
* memurl.com - Pronounceable.  Broken.
* metamark.net / xrl.us - ? http://xrl.us/bfabog
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.
* ow.ly - I can't get it to work.
* plexp.com - Parked.
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc
* poprl.com - Not resolving
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok
* rod.gs
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab
* shorterlink.com - Parked.
* shortlinks.co.uk - Not resolving
* short.to - Probably sequential/loweralpha: http://short.to/msmp
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp
* shrinkurl.us
* shrt.st
* shurl.net
* simurl.com
* shorl.com
* smarturl.eu
* snipr.com
* snipurl.com
* snurl.com
* sn.vc
* starturl.com
* surl.co.uk
* tighturl.com
* timesurl.at
* tiny123.com
* tiny.cc
* tinylink.com
* tobtr.com
* traceurl.com
* tr.im
* tweetburner.com
* twitpwr.com
* twitthis.com
* twurl.nl
* u.mavrev.com
* ur1.ca - Database is downloadable from website directly.
* url9.com - Sequential, alphanumeric.  Leading 0s are significant.
* urlborg.com
* urlbrief.com
* urlcover.com
* urlcut.com
* urlhawk.com
* url-press.com
* urlsmash.com
* urltea.com
* urlvi.be
* urlx.org - Owner has agreed to share his database
* vimeo.com
* wlink.us
* xaddr.com
* xil.in
* xrl.us - see metamark.net
* xym.kr
* x.se
* yatuc.com
* yep.it
* yweb.com
* zi.ma
* w3t.org


==== "Official" shorteners ====
 
* goo.gl - Google
* fb.me - Facebook
* amzn.to - Amazon
* binged.it - Bing (bonus points for being longer than bing.com)
* y.ahoo.it - Yahoo
* youtu.be - YouTube
* t.co? - Twitter
* post.ly - Posterous
* wp.me - Wordpress.com
* flic.kr - Flickr
* lnkd.in - LinkedIn
* su.pr - StumbleUpon
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)
* tcrn.ch - Techcrunch
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]


==== Dead or Broken Shorteners ====
 
* chod.sk - Appears nonincremental, not resolving
* gonext.org - not resolving
* ix.it - Not resolving
* jijr.com - Doesn't appear to be a shortener, now parked
* kissa.be - "Kissa.be url shortener service is shutdown"
* kurl.us - Parked.
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."
* minurl.org - Presently in ERROR 404
* muhlink.com - Not resolving
* myurl.us - cpanel frontend


== References ==
<references />

== Distribution of tinyurls ==

Tinyurl.com urls are supposed to be un-ordered now, but there's enough prehistory that you should concentrate on ones with the initial digit 5-9 or a-d. Distribution of 6-character tinyurl.com urls (from 20M tinyurls extracted from twitter):
 
     count  cumulative  first char
         1           1  #
        63          64  0
       125         189  1
    282371      282560  2
    330545      613105  3
    386626      999731  4
   1585765     2585496  5
   1676929     4262425  6
   1009816     5272241  7
   1007035     6279276  8
   1009790     7289066  9
   1509965     8799031  a
   1712227    10511258  b
   4986046    15497304  c
   3592027    19089331  d
       331    19089662  e
       473    19090135  f
       514    19090649  g
       399    19091048  h
       353    19091401  i
       363    19091764  j
     14146    19105910  k
     33050    19138960  l
     33517    19172477  m
     33273    19205750  n
    194311    19400061  o
    194817    19594878  p
    194563    19789441  q
     85263    19874704  r
       896    19875600  s
       780    19876380  t
       167    19876547  u
       224    19876771  v
       484    19877255  w
        12    19877267  x
     92827    19970094  y
       126    19970220  z
 


NOTE: http://301works.org/ is supposedly also archiving all of the url-shorteners, but you wouldn't know it from their web page.

== Weblinks ==
* [http://urlte.am urlte.am]
* [http://301works.org 301works.org]


[[Category: URL Shortening]]
