https://wiki.archiveteam.org/api.php?action=feedcontributions&user=Jeroenz0r&feedformat=atomArchiveteam - User contributions [en]2024-03-29T06:20:37ZUser contributionsMediaWiki 1.37.1https://wiki.archiveteam.org/index.php?title=AnyHub&diff=6712AnyHub2011-11-21T19:43:26Z<p>Jeroenz0r: /* What is AnyHub.net? */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{offline}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam <br />
}}<br />
== What is AnyHub.net? == <br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, and you can upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
[[File:Anyhub.netFAQ_2011-11-15_7-30-44.png|thumb|The original FAQ]]<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
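Because the IDs appear to ascend, a worker can enumerate a region of candidate URLs and fetch each one. This is a sketch only: the real grab logic lives in the anyhub-grab scripts, and both the numeric ID scheme and the <code>/file/</code> path below are hypothetical assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: enumerate a region of candidate download URLs.
# The numeric ID scheme and the /file/ path are hypothetical assumptions;
# the actual grab logic is in the anyhub-grab scripts.
start=1000
end=1004
for id in $(seq "$start" "$end"); do
  printf 'http://www.anyhub.net/file/%s\n' "$id"
done
```

In practice each worker would claim a region from the tracker so that ranges are not downloaded twice.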
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
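The nickname restriction can be checked before launching the client. This helper is hypothetical (it is not part of anyhub-grab); it only tests the allowed character set mentioned above.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of anyhub-grab: verify a nickname
# contains only the allowed characters (A-Z, a-z, 0-9, "-" and "_").
nick="my-nick_01"
if printf '%s' "$nick" | grep -Eq '^[A-Za-z0-9_-]+$'; then
  result="ok"
else
  result="invalid"
fi
echo "$result"
```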
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They had great stats at http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
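As a sanity check on these figures, the implied average file size can be computed from the totals above. This is arithmetic only; no AnyHub endpoint is contacted.

```shell
#!/bin/sh
# Average file size implied by 1122585 files totalling 2.81 TiB.
files=1122585
avg=$(awk -v files="$files" 'BEGIN {
  total = 2.81 * 1024^4                  # TiB -> bytes
  printf "%.2f", total / files / 1024^2  # bytes -> MiB
}')
echo "average file size ~ ${avg} MiB"
```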
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6680AnyHub2011-11-19T11:05:49Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{offline}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, and you can upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They had great stats at http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6677AnyHub2011-11-18T16:44:48Z<p>Jeroenz0r: /* What is AnyHub.net? */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{closing}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, and you can upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6675AnyHub2011-11-18T11:57:32Z<p>Jeroenz0r: /* Tools */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{closing}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, yet you can still upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6636AnyHub2011-11-15T06:54:09Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, yet you can still upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Info/stats about AnyHub ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=File:Anyhub.netFAQ_2011-11-15_7-30-44.png&diff=6635File:Anyhub.netFAQ 2011-11-15 7-30-44.png2011-11-15T06:42:29Z<p>Jeroenz0r: Screenshot of the FAQ on anyhub.net</p>
<hr />
<div>Screenshot of the FAQ on anyhub.net</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Splinder&diff=6634Splinder2011-11-15T06:37:20Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = Splinder<br />
| image = Us.splinder.com_2011-11-15_7-34-30.png<br />
| URL = {{url|1=http://www.splinder.com/}}<br />
{{url|1=http://www.us.splinder.com/}}<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
}}<br />
Splinder.com has been the main blog hosting company in Italy for a while (see [[Wikipedia:it:Splinder]]). It was founded in 2001 and hosts about half a million blogs and over 55 million pages.<br />
Since 8 November 2011, a warning on the home page has said that no new PRO accounts have been created since 1 June. The company has confirmed that the website will close on the 24th.[http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/comment/65653358#cid-65653358]<br />
<br />
== How to help archiving ==<br />
<br />
There is a distributed download script that gets usernames from a tracker and downloads the data.<br />
<br />
Make sure you are on Linux and that you have curl, git, and a recent version of Bash. Your system must also be able to compile wget.<br />
<br />
# Get the code: <code>git clone https://github.com/ArchiveTeam/splinder-grab</code><br />
# Get and compile the latest version of wget-warc: <code>./get-wget-warc.sh</code><br />
# Think of a nickname for yourself (preferably use your IRC name).<br />
# Run the download script with <code>./dld-client.sh "<YOURNICK>"</code><br />
# To stop the script gracefully, run <code>touch STOP</code> in the script's working directory. It will finish the current task and stop.<br />
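The STOP-file convention in the last step can be sketched as follows. This is assumed behavior for illustration; the real check is inside dld-client.sh.

```shell
#!/bin/sh
# Sketch of the STOP-file convention (assumed behavior of dld-client.sh):
# between tasks the client looks for a file named STOP in its working
# directory and exits gracefully if the file exists.
workdir=$(mktemp -d)
touch "$workdir/STOP"
if [ -f "$workdir/STOP" ]; then
  status="stopping after current task"
else
  status="continuing"
fi
rm -rf "$workdir"
echo "$status"
```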
<br />
===Notes===<br />
<br />
* Compiling wget-warc will require dev packages for the various libraries that it needs. Most questions have been about gnutls; install the <code>gnutls-devel</code> or <code>gnutls-dev</code> package with your favorite package manager.<br />
* Downloading one user's data can take between 10 seconds and a few hours.<br />
* The data for one user is equally varied, from a few kB to several GB.<br />
* The downloaded data will be saved in the <code>./data/</code> subdirectory.<br />
* Download speeds from me.com are not that high. You can run multiple clients to speed things up.<br />
<br />
== Status ==<br />
<br />
There is a [http://splinder.heroku.com real-time dashboard] where you can check the progress.<br />
<br />
==External links==<br />
*http://www.splinder.com/<br />
*http://www.us.splinder.com/<br />
<br />
==Site structure==<br />
<br />
The users are identified by their usernames. Fortunately, the site provides a list of all users. Usernames are not case-sensitive, but there is a case preference.<br />
<br />
==Example URLs==<br />
User profile: <code><nowiki>http://www.splinder.com/profile/<<username>></nowiki></code><br />
<br />
<pre><br />
Example profile:<br />
http://www.splinder.com/profile/difficilifoglie<br />
<br />
View count on profile page:<br />
http://www.splinder.com/ajax.php?type=counter&op=profile&profile=Romanticdreamer<br />
<br />
Example of friends list paging: (160 per page, starting at 0)<br />
http://www.splinder.com/profile/difficilifoglie/friends<br />
http://www.splinder.com/profile/difficilifoglie/friends/160<br />
<br />
Inverse friends (probably also paged):<br />
http://www.splinder.com/profile/difficilifoglie/friendof<br />
<br />
Link to blog: (note: not always the same as the username)<br />
http://difficilifoglie.splinder.com/<br />
http://learnonline.splinder.com/<br />
<br />
Photo:<br />
http://www.splinder.com/profile/difficilifoglie/photo<br />
http://www.splinder.com/mediablog/wondermum/media/24544805<br />
<br />
Video:<br />
http://www.splinder.com/profile/wondermum/video<br />
http://www.splinder.com/mediablog/wondermum/media/25737390<br />
<br />
Audio:<br />
Not a separate user feed, but only accessible via mediablog<br />
http://www.splinder.com/mediablog/learnonline/media/25727030<br />
<br />
Mediablog: combination of the audio + video + photo lists<br />
http://www.splinder.com/mediablog/learnonline<br />
(16 per page, starting at 0)<br />
http://www.splinder.com/mediablog/learnonline/16<br />
<br />
Mediablog has PowerPoint, Word files:<br />
http://www.splinder.com/mediablog/learnonline/media/25641346<br />
http://www.splinder.com/mediablog/learnonline/media/25546305<br />
http://www.splinder.com/mediablog/learnonline/media/21901634<br />
http://www.splinder.com/mediablog/learnonline/media/24875290<br />
<br />
User avatar: grab url from profile page<br />
<br />
Photo file: grab url from photo page and remove _medium to get original picture<br />
http://files.splinder.com/d5e492233631af39212268593afca02d_square.jpg<br />
http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg<br />
http://files.splinder.com/d5e492233631af39212268593afca02d.jpg<br />
older photos do not have this structure, different ids for each size:<br />
http://www.splinder.com/mediablog/babboramo/media/17359043<br />
http://files.splinder.com/13b615ccbd75354ee4e0d973da66c2b2.jpeg<br />
http://files.splinder.com/770d7b9ecac27083d9204af327ebe743.jpeg<br />
<br />
PowerPoint, Word files: grab url from media page<br />
http://files.splinder.com/46dbf3d5a0b12e490f81ddb8444b4fad.ppt<br />
http://files.splinder.com/ab3ce16c850ac530351d9df0937152c7.pdf<br />
<br />
Video items: grab url from media page<br />
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_square.jpg<br />
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_thumbnail.jpg<br />
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv<br />
note: square, thumbnail, small is not always available, check flashvars for vidpath, imgpath<br />
http://www.splinder.com/mediablog/babboramo/media/13131052<br />
http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv<br />
http://files.splinder.com/f56060b7fef139f03b72e06ca9fcba55.jpeg<br />
<br />
Audio items: grab url from media page, flashvars<br />
sometimes there is a _thumbnail, remove that to get a better quality<br />
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3<br />
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3<br />
<br />
Comments on blog posts:<br />
http://www.splinder.com/myblog/comment/list/25742358<br />
on some, but not on all blogs, those comments are also included in the blog page<br />
http://dal15al25.splinder.com/post/25740180<br />
http://soluzioni.splinder.com/post/2802227/blog-pager-su-piu-righe<br />
http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/<br />
http://civati.splinder.com/post/25742977<br />
pagination: see media comments<br />
<br />
Comments on media items:<br />
http://www.splinder.com/media/comment/list/21254470<br />
http://www.splinder.com/media/comment/list/21254470?from=50<br />
(50 per page, starting at 0)<br />
number of comments is on the media page<br />
http://www.splinder.com/mediablog/danspo/media/21254470<br />
<br />
<br />
Blog urls:<br />
the blogs have content from their own subdomain, but also from<br />
files.splinder.com<br />
www.splinder.com/misc/ (topbar css, gif)<br />
www.splinder.com/includes/ (js)<br />
www.splinder.com/modules/service_links/ (images)<br />
syndication.splinder.com<br />
<br />
links to www.splinder.com that should NOT be followed:<br />
/myblog/<br />
/users/<br />
/media/<br />
/node/<br />
/profile/<br />
/mediablog/<br />
/community/<br />
/user/<br />
/night/<br />
/home/<br />
/mysearch/<br />
/online/<br />
/trackback/<br />
<br />
</pre><br />
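Two of the URL patterns documented above can be mechanized. Both helpers are illustrative sketches built from the patterns on this page; they are not part of splinder-grab.

```shell
#!/bin/sh
# Illustrative sketches of two URL patterns documented above;
# these helpers are not part of splinder-grab.
user="difficilifoglie"

# 1. Friends-list paging: 160 entries per page; the first page has no offset.
pages="http://www.splinder.com/profile/$user/friends"
for offset in 160 320; do
  pages="$pages
http://www.splinder.com/profile/$user/friends/$offset"
done
printf '%s\n' "$pages"

# 2. Photos: strip "_medium" from a thumbnail URL to get the original file.
medium="http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg"
original=$(printf '%s' "$medium" | sed 's/_medium//')
printf '%s\n' "$original"
```

Note that older photos use different IDs for each size, so the `_medium` trick only applies to the newer URL structure.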
<br />
wget-warc --mirror --page-requisites --span-hosts --domains=learnonline.splinder.com,files.splinder.com,www.splinder.com,syndication.splinder.com --exclude-directories="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe" -nv -o wget.log "http://learnonline.splinder.com/"</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=File:Us.splinder.com_2011-11-15_7-34-30.png&diff=6633File:Us.splinder.com 2011-11-15 7-34-30.png2011-11-15T06:36:08Z<p>Jeroenz0r: Screenshot of us.splinder.com. Made on 2011-11-15, notice about closing is visible.</p>
<hr />
<div>Screenshot of us.splinder.com. Made on 2011-11-15, notice about closing is visible.</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6632AnyHub2011-11-15T06:33:21Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = AnyHubTeam<br />
}}<br />
== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"<br><br />
Stats here: http://anyhub.heroku.com/<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=File:Anyhub.net_2011-11-15_7-30-3.png&diff=6631File:Anyhub.net 2011-11-15 7-30-3.png2011-11-15T06:32:43Z<p>Jeroenz0r: This is a screenshot of AnyHub's mainpage. It was taken 2011-11-15, so the banner which warns you about deletion is visible.</p>
<hr />
<div>This is a screenshot of AnyHub's mainpage. It was taken 2011-11-15, so the banner which warns you about deletion is visible.</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=User:Jeroenz0r&diff=6624User:Jeroenz0r2011-11-14T17:56:03Z<p>Jeroenz0r: /* Current project: */</p>
<hr />
<div>The drive behind these projects is what I like the most!<br />
<br />
==Current project:==<br />
* [[Urlteam]]<br />
* [[AnyHub]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Archiveteam:IRC&diff=6612Archiveteam:IRC2011-11-13T15:33:17Z<p>Jeroenz0r: </p>
<hr />
<div>'''IRC''' (Internet Relay Chat) is an internet protocol that allows multiple users to connect to a server and chat. Each IRC "server" can be connected to by a person, then someone joins a "channel" with the particular subject they are interested in.<br />
<br />
The ArchiveTeam uses IRC as its one-stop shop for coordinating official and unofficial AT projects.<br />
<br />
Generally, you can log the channels you are in using your client. But if you want a 24/7 bot logging your channel, you can use a script like [http://toolserver.org/~bryan/TsLogBot/TsLogBot.py this] (change the server and channel variables).<br />
<br />
== ArchiveTeam on IRC ==<br />
<br />
Below is a list of the IRC channels the ArchiveTeam uses to coordinate all its projects, in no particular order. All these channels are on the [http://efnet.org EFNet] network.<br />
<br />
{| border="1" align="center" style="text-align:center;" cellpadding="6"<br />
|Channel name||Channel hashtag||Channel description||Status<br />
|-<br />
|colspan="4"|<b>In use channels</b><br />
|-<br />
|Archive Team<br />
|[irc://irc.efnet.org/archiveteam #archiveteam]<br />
The main ArchiveTeam channel, mainly used for news, announcements and early project planning.<br />
|N/A<br />
|-<br />
|AT Chat<br />
|[irc://irc.efnet.org/atchat #atchat]<br />
|Off-topic discussion for things not directly related to ArchiveTeam and its projects.<br />
|N/A<br />
|-<br />
|ArchiveMeme<br />
|[irc://irc.efnet.org/archivememe #archivememe]<br />
|An unofficial fan channel started by BlueMax. http://memegenerator.net/ArchiveTeam<br />
|N/A<br />
|-<br />
|colspan="4"|<b>Currently active projects</b><br />
|-<br />
|BashUp<br />
|[irc://irc.efnet.org/bashup #bashup]<br />
|The ArchiveTeam [[IRC Quotes|Quote Backup Project]], dedicated to backing up quote databases (such as Bash.org) and similar websites (similar to FMyLife or MyLifeIsAverage).<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Forever Alone<br />
|[irc://irc.efnet.org/foreveralone #foreveralone]<br />
|The Friendster backup project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Magically Delicious<br />
|[irc://irc.efnet.org/magicallydelicious #magicallydelicious ]<br />
|Delicious backup project<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Google Grape||[irc://irc.efnet.org/googlegrape #googlegrape]<br />
|Main channel for coordinating the [[Google Video Warroom|Google Video project]].<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|BOINC Google<br />
|[irc://irc.efnet.org/boincgoogle #boincgoogle]<br />
|Sub-channel for the [[Google Video Warroom|Google Video project]], for details about running the distributed-download software<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Lulu Poetry<br />
|[irc://irc.efnet.org/lulupoetry #lulupoetry]<br />
|Channel for the brief but intense [[Poetry.com]] archiving project.<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Archive Strikes Back<br />
|[irc://irc.efnet.org/archivestrikesback #archivestrikesback ]<br />
|Channel for [[Forums.starwars.com]] archive project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|AnyHub<br />
|[irc://irc.efnet.org/AnyHubTeam #AnyHubTeam]<br />
|The [[AnyHub]] team.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|colspan="4"|<b>Currently idle or complete projects</b><br />
|-<br />
|FlickrFckr<br />
|[irc://irc.efnet.org/flickrfckr #flickrfckr]<br />
|The [[FlickrFckr|Flickr backup project]] of the Archive Team. Not needed just yet, but it's a Yahoo owned service, so we're always prepped.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Archive Commandos<br />
|[irc://irc.efnet.org/archivecommandos #archivecommandos]<br />
|http://archiveteam.org/index.php?title=Commandos<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|WikiTeam<br />
|[irc://irc.efnet.org/wikiteam #wikiteam]<br />
|The [[WikiTeam|Wiki backup project]]. Any wiki can be backed up here.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ProdigalSon<br />
|[irc://irc.efnet.org/prodigalson #prodigalson]<br />
|The [[Pages|backup project for pages.prodigy.net]].<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ArchiveBox<br />
|[irc://irc.efnet.org/archivebox #archivebox]<br />
|The project started by jch to provide a virtual machine that can download ArchiveTeam projects with predetermined scripts and tools.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|GetMUD<br />
|[irc://irc.efnet.org/getmud #getmud]<br />
|The multi-user-dungeon backup project of the Archive Team. Currently no progress as of yet.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|URLTeam<br />
|[irc://irc.efnet.org/urlteam #urlteam]<br />
|The [[URLTeam|URL shortener backup project]] of the ArchiveTeam. To quote: "URL shortening = fucking bad idea"<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Space Invaders<br />
|[irc://irc.efnet.org/spaceinvaders #spaceinvaders]<br />
|The [[Talk:Windows Live Spaces|Windows Live Spaces backup project]].<br />
|<font color=#ff0000>Idle</font><br />
|}<br />
<br />
== IRC Logs ==<br />
[[User:Auguste|Auguste]] is currently hosting logs of most, if not all, of the above channels at [http://archivebox.dyndns.org/cargohold/irclogs/ http://archivebox.dyndns.org/cargohold/irclogs/]. These are logged with a dedicated Weechat client. If some logs are missing, check the '1970' directory - the server has no internal clock, so it does the time warp whenever openntpd fails.<br />
<br />
[[User:Scumola|Scumola]]/swebb is also hosting chatlogs of some channels at [http://badcheese.com/~steve/atlogs/ http://badcheese.com/~steve/atlogs/]. Though only logs from the past week or so are listed, older chatlogs can still be accessed by changing the URL.<br />
<br />
== Unofficial ArchiveTeam QDB ==<br />
ArchiveTeamsters are encouraged to visit and contribute to the unofficial [http://www.deaddyingdamned.com/qdb/ ArchiveTeam quote database].<br />
<br />
[[Category:Archive Team]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Archiveteam:IRC&diff=6611Archiveteam:IRC2011-11-13T15:31:56Z<p>Jeroenz0r: /* ArchiveTeam on IRC */ Added AnyHub</p>
<hr />
<div>'''IRC''' (Internet Relay Chat) is an internet protocol that allows multiple users to connect to a server and chat. Each IRC "server" can be connected to by a person, then someone joins a "channel" with the particular subject they are interested in.<br />
<br />
The ArchiveTeam uses IRC as its one-stop shop for coordinating official and unofficial AT projects.<br />
<br />
Generally, you can log the channels you are in using your client. But if you want a 24/7 bot logging your channel, you can use a script like [http://toolserver.org/~bryan/TsLogBot/TsLogBot.py this] (change the server and channel variables).<br />
<br />
== ArchiveTeam on IRC ==<br />
<br />
Below is a list of the IRC channels the ArchiveTeam uses to coordinate all its projects, in no particular order. All these channels are on the [http://efnet.org EFNet] network.<br />
<br />
{| border="1" align="center" style="text-align:center;" cellpadding="6"<br />
|Channel name||Channel hashtag||Channel description||Status<br />
|-<br />
|colspan="4"|<b>In use channels</b><br />
|-<br />
|Archive Team<br />
|[irc://irc.efnet.org/archiveteam #archiveteam]<br />
The main ArchiveTeam channel, mainly used for news, announcements and early project planning.<br />
|N/A<br />
|-<br />
|AT Chat<br />
|[irc://irc.efnet.org/atchat #atchat]<br />
|Off-topic discussion for things not directly related to ArchiveTeam and its projects.<br />
|N/A<br />
|-<br />
|ArchiveMeme<br />
|[irc://irc.efnet.org/archivememe #archivememe]<br />
|An unofficial fan channel started by BlueMax. http://memegenerator.net/ArchiveTeam<br />
|N/A<br />
|-<br />
|colspan="4"|<b>Currently active projects</b><br />
|-<br />
|BashUp<br />
|[irc://irc.efnet.org/bashup #bashup]<br />
|The ArchiveTeam [[IRC Quotes|Quote Backup Project]], dedicated to backing up quote databases (such as Bash.org) and similar websites (such as FMyLife or MyLifeIsAverage).<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Forever Alone<br />
|[irc://irc.efnet.org/foreveralone #foreveralone]<br />
|The Friendster backup project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Magically Delicious<br />
|[irc://irc.efnet.org/magicallydelicious #magicallydelicious]<br />
|The Delicious backup project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Google Grape||[irc://irc.efnet.org/googlegrape #googlegrape]<br />
|Main channel for coordinating the [[Google Video Warroom|Google Video project]].<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|BOINC Google<br />
|[irc://irc.efnet.org/boincgoogle #boincgoogle]<br />
|Sub-channel for the [[Google Video Warroom|Google Video project]], for details about running the distributed-download software.<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Lulu Poetry<br />
|[irc://irc.efnet.org/lulupoetry #lulupoetry]<br />
|Channel for the brief but intense [[Poetry.com]] archiving project.<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Archive Strikes Back<br />
|[irc://irc.efnet.org/archivestrikesback #archivestrikesback]<br />
|Channel for the [[Forums.starwars.com]] archive project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|colspan="4"|<b>Currently idle or complete projects</b><br />
|-<br />
|FlickrFckr<br />
|[irc://irc.efnet.org/flickrfckr #flickrfckr]<br />
|The [[FlickrFckr|Flickr backup project]] of the Archive Team. Not needed just yet, but it's a Yahoo-owned service, so we're always prepped.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Archive Commandos<br />
|[irc://irc.efnet.org/archivecommandos #archivecommandos]<br />
|http://archiveteam.org/index.php?title=Commandos<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|WikiTeam<br />
|[irc://irc.efnet.org/wikiteam #wikiteam]<br />
|The [[WikiTeam|Wiki backup project]]. Any wiki can be backed up here.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ProdigalSon<br />
|[irc://irc.efnet.org/prodigalson #prodigalson]<br />
|The [[Pages|backup project for pages.prodigy.net]].<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ArchiveBox<br />
|[irc://irc.efnet.org/archivebox #archivebox]<br />
|The project started by jch to provide a virtual machine that can download ArchiveTeam projects with predetermined scripts and tools.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|GetMUD<br />
|[irc://irc.efnet.org/getmud #getmud]<br />
|The multi-user-dungeon backup project of the Archive Team. No progress as of yet.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|URLTeam<br />
|[irc://irc.efnet.org/urlteam #urlteam]<br />
|The [[URLTeam|URL shortener backup project]] of the ArchiveTeam. To quote: "URL shortening = fucking bad idea"<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Space Invaders<br />
|[irc://irc.efnet.org/spaceinvaders #spaceinvaders]<br />
|The [[Talk:Windows Live Spaces|Windows Live Spaces backup project]].<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|AnyHub<br />
|[irc://irc.efnet.org/AnyHubTeam #AnyHubTeam]<br />
|The [[AnyHub]] team.<br />
|<font color=#0000ff>Active</font><br />
|}<br />
<br />
== IRC Logs ==<br />
[[User:Auguste|Auguste]] is currently hosting logs of most, if not all, of the above channels at [http://archivebox.dyndns.org/cargohold/irclogs/ http://archivebox.dyndns.org/cargohold/irclogs/]. These are logged with a dedicated WeeChat client. If some logs are missing, check the '1970' directory - the server has no internal clock, so it does the time warp whenever openntpd fails.<br />
<br />
[[User:Scumola|Scumola]]/swebb is also hosting chatlogs of some channels at [http://badcheese.com/~steve/atlogs/ http://badcheese.com/~steve/atlogs/]. Though only logs from the past week or so are listed, older chatlogs can still be accessed by changing the URL.<br />
<br />
== Unofficial ArchiveTeam QDB ==<br />
ArchiveTeamsters are encouraged to visit and contribute to the unofficial [http://www.deaddyingdamned.com/qdb/ ArchiveTeam quote database].<br />
<br />
[[Category:Archive Team]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6610AnyHub2011-11-13T15:31:47Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = AnyHub.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = archiveteam<br />
}}<br />
== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
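The ascending-filename observation above can be sketched as a small shell helper that enumerates candidate IDs over an alphabet. Note that the alphabet, its ordering, and the helper name are assumptions for illustration, not necessarily what anyhub-grab actually does:

```shell
#!/bin/sh
# Enumerate a small range of candidate file IDs, assuming an
# ascending base-62-style alphabet (digits, then upper case, then
# lower case). The real AnyHub alphabet/order is an assumption.
ALPHABET="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

next_ids() {
  prefix="$1"   # e.g. "4AA"
  count="$2"    # how many one-character suffixes to print
  i=0
  while [ "$i" -lt "$count" ]; do
    c=$(printf '%s' "$ALPHABET" | cut -c $((i + 1)))
    printf '%s%s\n' "$prefix" "$c"
    i=$((i + 1))
  done
}

next_ids "4AA" 3
```

A real grab would feed IDs like these to wget; the dld-client.sh script above handles that (and range assignment) for you.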
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
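The figures above could be pulled from the JSON endpoint with a couple of shell helpers. The "files" field name in the parser is an assumption about the feed's layout, so treat this as a sketch:

```shell
#!/bin/sh
# Sketch: fetch AnyHub's recent-stats JSON and pull out the file
# count. The endpoint is the one listed above; the "files" field
# name is an assumption about the JSON layout.
fetch_stats() {
  curl -s http://www.anyhub.net/stats/recent
}

# Crude, dependency-free extraction of a numeric "files" field
# (a real script would use a JSON parser such as jq).
file_count() {
  sed -n 's/.*"files"[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# Usage: fetch_stats | file_count
```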
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"<br><br />
Stats here: http://anyhub.heroku.com/<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6609AnyHub2011-11-13T15:28:40Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = AnyHub.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = archiveteam<br />
}}<br />
== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"<br><br />
Stats here: http://anyhub.heroku.com/</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6605AnyHub2011-11-13T15:10:34Z<p>Jeroenz0r: </p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6604AnyHub2011-11-13T15:09:33Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use '''./dld-client.sh ''{nickname}'''''</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6602AnyHub2011-11-13T14:40:53Z<p>Jeroenz0r: /* Ranges */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4AA_-4AG_<br />
|Busy - 14:31 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6601AnyHub2011-11-13T14:38:29Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4AA_-4AG_<br />
|Busy - 14:31 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6600AnyHub2011-11-13T14:32:04Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4A**<br />
|Busy - 14:31 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6599AnyHub2011-11-13T14:30:18Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4AA*<br />
|Busy - 14:29 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6598AnyHub2011-11-13T13:57:21Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status?'''<br />
|-<br />
|Jeroenz0r<br />
|coming<br />
|coming<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6597AnyHub2011-11-13T13:56:58Z<p>Jeroenz0r: </p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" align="center" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status?'''<br />
|-<br />
|Jeroenz0r<br />
|coming<br />
|coming<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6596AnyHub2011-11-13T13:54:15Z<p>Jeroenz0r: /* Ranges */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6595AnyHub2011-11-13T13:53:55Z<p>Jeroenz0r: Stub</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Talk:Windows_Live_Spaces&diff=3398Talk:Windows Live Spaces2011-03-23T14:50:53Z<p>Jeroenz0r: /* Phase 2: Downloading Hotlists */</p>
<hr />
<div>== Current Status ==<br />
'''March 22:''' Swicher got clarification about the shutdown date - Microsoft is apparently closing Windows Live Spaces in batches, and should be complete by March 30, 2011<ref>[http://windowsteamblog.com/windows_live/b/windowslive/archive/2011/03/02/over-one-million-new-blogs-on-wordpress-com-but-time-is-running-out.aspx#comments Over one million new blogs on WordPress.com, but time is running out] - See bottom-two comments</ref>. This doesn't leave us with much time.<br />
<br />
'''March 20:''' D-Day has been and gone, and as of March 22, 2011, Windows Live Spaces is still running. There are still millions of Spaces that have not been downloaded or migrated. Microsoft could shut down WLS at any moment and delete all the data, so we need as many people as possible to help download them all. As of March 2, 2011, only 1,000,000 Spaces had been migrated to WordPress,<ref>[http://windowsteamblog.com/windows_live/b/windowslive/archive/2011/03/02/over-one-million-new-blogs-on-wordpress-com-but-time-is-running-out.aspx Over one million new blogs on WordPress.com, but time is running out]</ref> so we have a lot of catching up to do.<br />
<br />
[[User:Swicher|Swicher]] is currently downloading [[Spaces of Windows Live Spaces pending for download|several thousand Spaces]] using HTTrack. These Spaces are duplicated as the first few hotlists, to be sure we do get them.<br />
<br />
== Phase 1: CID Scraping ==<br />
[[User:NovaKing|NovaKing]] is currently scraping Bing for more profiles. At the rate he's been going, he should have tens of thousands ready soon, which will be split up into hotlists and allocated to volunteers for downloading.<br />
<br />
== Phase 2: Downloading Hotlists ==<br />
This is a list of available hotlists and their status. They are generally split into chunks of 1,000 Spaces.<br />
<br />
If you would like to take ownership of one, speak to Auguste on IRC. Volunteers, please update this table as soon as you are finished, or let Auguste know if you are unable to complete it.<br />
<br />
{| border="1" width="100%"<br />
!Filename<br />
!Owner<br />
!Size (GB) (compressed size)<br />
!Status<br />
!Status notes<br />
|-<br />
|[http://pastebin.com/FMJh3vAa wls 0001-1000.txt]<br />
|ersi<br />
|13.7~ GB (1.1GB bzip2)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/xrXfPbL4 wls 1001-2000.txt]<br />
|ersi<br />
|20 GB (1.6GB bzip2)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/KAVYAW3c wls 2001-2202.txt]<br />
|Dr-Spangle<br />
|<br />
|Complete<br />
|Awaiting upload.<br />
|-<br />
|[http://pastebin.com/pygEEHBr wls 2203-3000.txt]<br />
|ersi<br />
|13 GB (945.3MB bzip2)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/LS8nvgdN wls 3001-4000.txt]<br />
|Jeroenz0r, joeyh<br />
|6.36GB (1.36GB gz)/(1.53GB Deflate)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/640Txn1g wls 4001-5000.txt]<br />
|amnesia<br />
|8.0G (1.7G zipped)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/1ciEWJB1 wls 5001-6000.txt]<br />
|Underscor<br />
|<br />
|In progress<br />
|<br />
|-<br />
|[http://pastebin.com/zi4D58iQ wls 6001-7000.txt]<br />
|amnesia<br />
|336M (74M zipped)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/6pYaicYF wls 7001-8000.txt]<br />
|None<br />
|<br />
|Unassigned<br />
|<br />
|}<br />
<br />
=== Instructions ===<br />
You have three options:<br />
* [http://pastebin.com/Lr0Xn0Wm SpaceInvader2.pl]<br />
** Downloads the list of Spaces, one Space at a time, using Wget. You probably want this one.<br />
** Usage: <code>SpaceInvader2.pl "HOTLIST"</code><br />
* [http://pastebin.com/W6dhEwV2 SpaceInvaderTurbo.pl]<br />
** Spawns multiple instances of Wget to download everything at once. If you have a hotlist of 1,000 Spaces, this means 1,000 instances of Wget, all downloading simultaneously. This may be unfriendly to both your CPU and Microsoft's systems, but it will cut a 7-day job down to a few hours. Use it at your own risk.<br />
** Usage: <code>SpaceInvaderTurbo.pl "HOTLIST"</code><br />
* [http://pastebin.com/pqskd0Xu spaceinvader.sh]<br />
** Same idea, different implementation. Will run up to 50 wget instances and won't be that hard on your machine.<br />
** Actually does save some images.<br />
** Usage: <code>spaceinvader.sh "HOTLIST"</code><br />
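All three scripts boil down to the same loop: read a hotlist, fetch each Space with wget. A minimal sequential version might be sketched like this; the one-URL-per-line hotlist format matches the scripts above, but the specific wget flags are this sketch's choice, not necessarily what SpaceInvader2.pl passes:

```shell
#!/bin/sh
# Minimal sequential hotlist downloader in the spirit of
# SpaceInvader2.pl: read one Space URL per line and mirror it
# with wget. The exact wget options are this sketch's choice.
download_hotlist() {
  hotlist="$1"
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    # --mirror recurses with timestamping; --no-parent keeps
    # wget from wandering above the Space's root.
    wget --mirror --no-parent --adjust-extension "$url"
  done < "$hotlist"
}

# Usage: download_hotlist "wls 0001-1000.txt"
```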
<br />
Due to insufficient time and planning, these scripts don't download any off-site dependencies - most of what you download will be HTML/text. The upside is that it compresses nicely.<br />
<br />
These scripts will just spit out files in the working directory, so you probably want to place them in ~/wls or something before executing them.<br />
<br />
Once you have finished downloading a hotlist, please update your details in the above table and compress all the Spaces into a single archive, along with a copy of your hotlist. 7-Zip on maximum compression should be able to get them down to ~10% of their original size.<br />
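As a concrete example of that packing step, assuming the downloaded Spaces sit in a single directory (names here are illustrative; the wiki's 7-Zip suggestion corresponds to running 7z with maximum compression, and tar+gzip is shown as a portable stand-in):

```shell
#!/bin/sh
# Pack a finished hotlist and its downloaded Spaces into a single
# archive for upload, as the instructions above suggest. Directory
# and file names are illustrative. The wiki recommends 7-Zip at
# maximum compression (7z a -mx=9 ...); tar+gzip is used here as
# a portable stand-in.
pack_hotlist() {
  spaces_dir="$1"   # directory of downloaded Spaces
  hotlist="$2"      # the hotlist text file itself
  archive="$3"      # output name, e.g. wls_0001-1000.tar.gz
  tar -czf "$archive" "$spaces_dir" "$hotlist"
}

# Usage: pack_hotlist spaces "wls 0001-1000.txt" wls_0001-1000.tar.gz
```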
<br />
After compressing your hotlist, you can upload it to [[User:Underscor|Underscor]]'s FTP for temporary storage. Get the FTP details from him or Auguste. We still need to find some permanent storage to move everything to.<br />
<br />
== Phase 3: Storage ==<br />
TBA.<br />
<br />
== Other Tools ==<br />
Though Perl/Wget is the recommended method for archiving Spaces, there are a couple of other tools available.<br />
<br />
=== HTTrack (graphic version)===<br />
I will explain the procedure for downloading one or more Spaces using the HTTrack graphical version (called WinHTTrack on Windows and WebHTTrack on Linux).<br />
<br />
I assume that the reader is familiar with the use of WinHTTrack (or WebHTTrack), so I'll just explain what you need to configure (in the program's options panel) to download a Space from Windows Live Spaces. If you do not know how to use this program, you can check [http://www.kitamuracomputers.com/tidelog/?p=615 this tutorial] (in English) or [http://www.manueldelafuente.com/2009/10/httrack-posible-solucion-la.html this one] (in Spanish).<br />
<br />
The following lines must be added in the "Scan Rules" section:<br />
+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar<br />
+*.7z<br />
+*.pdf +*.doc +*.mid +*.3gp +*.djvu +*.amr +*.mp4 +*.ogg +*.ogv +*.ogm<br />
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm +*.wav +*.vob +*.qt +*.vid +*.ac3 +*.wma +*.wmv<br />
+*.zip +*.tar +*.tgz +*.gz +*.rar +*.z<br />
+*.arj +*.dar +*.lzh +*.lz +*.lza +*.arc<br />
+*.gif +*.jpg +*.png +*.tif +*.bmp<br />
-*.entry#comment<br />
+*.profile.live.com/Lists/*<br />
+*.byfiles.storage.live.com/*<br />
+*.photos.live.com<br />
+*.spaces.live.com<br />
<br />
* Lines 1 to 7 indicate which file types are downloaded from a Space, if the program finds any (these lines can be modified to suit the user).<br />
* Line 8 is needed because the program otherwise tries to capture the comments of every post on a Windows Live Spaces blog, which generates errors (in addition to wasting time while exploring a site).<br />
* Lines 9 and 12 are used to capture the Spaces on the "friends" list of the user whose Space is being captured (these lines are optional).<br />
* Lines 10 and 11 capture the files and photos that the user may have uploaded.<br />
<br />
I'm not sure the data on *.photos.live.com will continue to exist after Windows Live Spaces is shut down, so I took the opportunity to save any photos there as well. If you don't want to save photos, that line is optional.<br />
<br />
Then add in the field Browser "Identity" (from the section Browser ID) the following User Agent:<br />
<pre>Googlebot/2.1 (+ http://www.googlebot.com/bot.html)</pre><br />
And finally in the section "Spider" select the option "no robots.txt rules".<br />
<br />
Note that if you download thousands of Spaces in a single project, you should also disable the "Create Log files" option in the "Log files, Index, Cache" section; otherwise, the logs can take up tens of GB of hard disk space.<br />
<br />
=== LSSaver ===<br />
''Some of the descriptions in this section were taken from http://www.softsea.com/review/LSSaver.html''<br />
<br />
LSSaver is freeware for Windows that saves a Windows Live Spaces blog to your local disk. It saves useful information such as the blog title, content and comments, and can also save the pictures included in the blog.<br />
<br />
LSSaver is very simple to use; its operation is as follows:<br />
* First, enter a Windows Live Spaces username.<br />
* Then, click the "Get" button to retrieve all blog entries. This operation may take several minutes, depending on the number of entries in the blog and the speed of your connection. As each blog entry is retrieved, its title appears in the tree on the left side of the window. Wait until all titles have been retrieved; you can then browse the titles by folding/unfolding the tree and check those you want to save. Once a blog entry is checked, its content appears on the right side of the window; check all the entries you want to save and wait until all of them have appeared.<br />
* To save the selected entries, simply click the Save button. A file selection window will open; choose where the files should be saved, enter a file name and click Save. After a while, all the selected entries are saved as a single HTML file, which you can open with a browser.<br />
<br />
The program works as it should, but some details differentiate it from an ordinary website downloader:<br />
*As explained above, when the program saves a blog, all the articles (and comments) are crammed into a single HTML file (which can become a problem if the blog has a lot of content).<br />
*Images are stored under names like 000001, 000002, etc., which prevents finding the original on the Internet (this refers to images on external sites linked from a blog) or recognizing the file format.<br />
<br />
== Useful links ==<br />
*In English:<br />
**[http://ezinearticles.com/?Windows-Live-Spaces-Officially-Closed Windows Live Spaces Officially Closed]<br />
**[http://techie-buzz.com/tech-news/windows-live-spaces-wordpress-migration.html Windows Live Spaces To Shut Down, Move 30 Million Users To WordPress.Com]<br />
**[http://www.liveside.net/2011/02/21/windows-live-spaces-to-close-march-16th-remember Windows Live Spaces to close March 16th, remember?]<br />
**[http://www.darrenstraight.com/blog/2011/03/13/your-windows-live-space-will-close-on-16-march-3-days-left Your Windows Live Space will close on 16 March – 3 days left]<br />
*In Spanish:<br />
**[http://www.danisaur.es/2010/09/30/microsoft-cierra-windows-live-spaces/ Microsoft cierra Windows Live Spaces]<br />
**[http://grupogeek.com/2010/10/01/microsoft-cierra-windows-live-spaces-y-transfiere-a-sus-usuarios-a-wordpress/ Microsoft cierra Windows Live Spaces y transfiere a sus usuarios a WordPress]<br />
**[http://tecnokadosh.abbaproducciones.cl/2010/10/1612 Windows Live Spaces se cierra]<br />
**[http://solucionok.blogspot.com/2010/10/windows-live-spaces-llega-su-fin-y.html Solucion OK: Windows Live Spaces llega a su fin y continúa con WordPress.com]<br />
**[http://mynetx.es/5275/recordatorio-windows-live-spaces-cerrara-pronto Recordatorio: Windows Live Spaces cerrará pronto]<br />
**[http://pastehtml.com/view/1dhf1ez.html Email de Windows Live que le llega a cada usuario con un Space activo]<br />
<br />
== References ==<br />
<references/></div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Talk:Urlteam&diff=3306Talk:Urlteam2011-03-20T17:42:23Z<p>Jeroenz0r: moved Talk:Urlteam to Talk:URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>#REDIRECT [[Talk:URLTeam]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Talk:URLTeam&diff=3305Talk:URLTeam2011-03-20T17:42:23Z<p>Jeroenz0r: moved Talk:Urlteam to Talk:URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>== Regarding archiving ==<br />
<br />
Just randomly requesting TinyURLs like you propose will get you banned since you are making many requests for non-existent TinyURLs. We do allow bots to crawl TinyURLs, but only if they are crawling TinyURLs that exist which they pulled from whatever source they are crawling.<br />
<br />
Kevin "Gilby" Gilbertson<br />
<br />
TinyURL, Founder<br />
<br />
http://tinyurl.com<br />
<br />
== A Problem Easily Solved ==<br />
<br />
Just provide for us an excel spreadsheet in the form of:<br />
<br />
tinyurl ID | full URL<br />
<br />
And scraping won't be necessary. Up for it?<br />
<br />
--[[User:Jscott|Jscott]] 20:25, 4 December 2010 (UTC)<br />
<br />
:I e-mailed the TinyURL owner and he [http://i55.tinypic.com/j5bia9.jpg replied] with that.<br />
:<br />
:[[User:Zachera|Zachera]] 00:06, 11 December 2010 (UTC)</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Urlteam&diff=3304Urlteam2011-03-20T17:42:22Z<p>Jeroenz0r: moved Urlteam to URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>#REDIRECT [[URLTeam]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3303URLTeam2011-03-20T17:42:22Z<p>Jeroenz0r: moved Urlteam to URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
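This redirect step is exactly what a scraper records: request the short URL, read the HTTP Location header, and store the pair. A minimal sketch of that idea (hypothetical code, not one of the project's actual tools; the injected <code>fetch</code> callback is an assumption so the logic can be shown without network access):<br />

```python
from urllib.parse import urljoin

def record_redirect(short_url, fetch):
    """Resolve one short URL and return (short_url, long_url), or None.

    `fetch` performs a single HTTP request WITHOUT following redirects
    and returns (status_code, location_header). It is injected here
    (a hypothetical interface) so the logic stays network-free.
    """
    status, location = fetch(short_url)
    if status in (301, 302, 303, 307) and location:
        # Some services send relative Location headers; resolve them
        # against the short URL before storing the pair.
        return short_url, urljoin(short_url, location)
    return None  # non-existent code, or the service did not redirect

# Example with a stand-in fetcher:
fake_fetch = lambda url: (301, "http://example.com/long/page")
pair = record_redirect("http://tinyurl.com/abcdef", fake_fetch)
# pair == ("http://tinyurl.com/abcdef", "http://example.com/long/page")
```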
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 URLs. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
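The base-36/base-62 probing mentioned above amounts to guessing codes at random from the service's code alphabet. A sketch of that idea (illustrative only; the names here are my assumptions, not Monkeyshines' actual API):<br />

```python
import random
import string

BASE36 = string.digits + string.ascii_lowercase  # 0-9, a-z
BASE62 = BASE36 + string.ascii_uppercase         # 0-9, a-z, A-Z

def random_code(length, alphabet=BASE62):
    """Pick one random short-URL code to probe."""
    return "".join(random.choice(alphabet) for _ in range(length))

# A scraper would then request http://<service>/<code> and keep the
# target URL if the code exists. Most random codes are misses, which
# is why some services (e.g. TinyURL) ban heavy random probing.
code = random_code(6, BASE36)
```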
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
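The redirect format promised in the letter follows a simple naming scheme: the shortener's hostname becomes a subdomain of urlte.am, and the code path is kept. A hypothetical helper illustrating that scheme (not part of any actual URLTeam tool):<br />

```python
from urllib.parse import urlparse

def urlteam_mirror(short_url):
    """Rewrite a short URL into the urlte.am redirect form shown above,
    e.g. http://urlx.org/av3 -> http://urlx.org.urlte.am/av3."""
    parts = urlparse(short_url)
    # Hostname becomes a subdomain of urlte.am; the short code is preserved.
    return "http://{0}.urlte.am{1}".format(parts.netloc, parts.path)
```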
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
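Status entries like "00-zz, 000-zzz" in the table above describe incremental scraping: for a sequential shortener, simply enumerate every possible code of each length in order. A sketch of that enumeration (illustrative, not the actual scraper code):<br />

```python
import itertools
import string

# Case-insensitive services (e.g. adjix.com) use digits + lowercase only.
ALPHABET = string.digits + string.ascii_lowercase

def codes(length, alphabet=ALPHABET):
    """Yield every code of the given length in order: '00', '01', ..., 'zz'."""
    for combo in itertools.product(alphabet, repeat=length):
        yield "".join(combo)

# "00-zz" means all 36**2 == 1296 two-character codes; "000-zzz" adds
# the 36**3 three-character codes, and so on.
```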
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shortlinks.co.uk - Working again.<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us - Always reports that the URL is malformed<br />
* shrt.st - Appears incremental: http://shrt.st/vpz<br />
* simurl.com - Doesn't appear guessable: http://simurl.com/panpes<br />
* shorl.com - Doesn't appear guessable: http://shorl.com/tisikestibahu<br />
* smarturl.eu / joturl.com / zip.sm - Doesn't appear guessable, HTML redirect.<br />
* snipr.com - Appears incremental: http://snipr.com/27nvst http://snipr.com/27nvtt<br />
* snipurl.com - See above ^<br />
* snurl.com - See above ^^<br />
* surl.co.uk - Many shortening options.<br />
* tighturl.com - Appears incremental: http://tighturl.com/30xu http://tighturl.com/30xv<br />
* tiny.cc - Appears non-incremental<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com / twurl.nl - Appears incremental<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
* 1link.in - Website dead<br />
* canurl.com - Website dead<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* go2cut.com - Website dead<br />
* lnkurl.com - Website dead<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
<br />
==== Hueg list ====<br />
[http://code.google.com/p/shortenurl/wiki/URLShorteningServices]<br />
<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3282URLTeam2011-03-19T20:52:07Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p><br />
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 URLs. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, the data is still pending, but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
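For the incremental services in the table (adjix.com, rod.gs, biglnk.com), the whole code space can simply be walked in order, which is what ranges like ''00-zz'' and ''000-ZZZ'' refer to. A minimal sketch of such an enumerator, assuming a lowercase base-36 alphabet as used for adjix.com (the function name is hypothetical):

```python
from itertools import product
import string

def code_range(length, alphabet=string.digits + string.ascii_lowercase):
    """Yield every code of the given length in lexicographic order,
    e.g. length=2 yields 00, 01, ..., 0z, 10, ..., zz."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)
```

A scraper would request each generated code against the service and record the redirect target; case-sensitive services like biglnk.com would pass a base-62 alphabet instead.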
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a reCAPTCHA to get to the linked site, and Avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H / http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shortlinks.co.uk - Working again.<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us - Always reports that the URL is malformed<br />
* shrt.st - Appears incremental: http://shrt.st/vpz<br />
* simurl.com - Doesn't appear guessable: http://simurl.com/panpes<br />
* shorl.com - Doesn't appear guessable: http://shorl.com/tisikestibahu<br />
* smarturl.eu / joturl.com / zip.sm - Doesn't appear guessable, HTML redirect.<br />
* snipr.com - Appears incremental: http://snipr.com/27nvst http://snipr.com/27nvtt<br />
* snipurl.com - See above ^<br />
* snurl.com - See snipr.com above ^^<br />
* surl.co.uk - Many shortening options.<br />
* tighturl.com - Appears incremental: http://tighturl.com/30xu http://tighturl.com/30xv<br />
* tiny.cc - Appears non-incremental<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com / twurl.nl - Appears incremental<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
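Resolving entries like nutshellurl.com above means following a chain of 301s (the short URL 301s to a redirector script, which 301s again to the destination). The chain-walking logic can be sketched with a pluggable `fetch` callback so it works with any HTTP client; `fetch` here is a hypothetical helper that returns a response's Location header, or None for a non-redirect:

```python
def follow_redirects(url, fetch, max_hops=5):
    """Walk a chain of redirects until a final, non-redirecting URL.

    `fetch(url)` must return the Location header of a redirect response,
    or None when the response is not a redirect.
    """
    for _ in range(max_hops):
        nxt = fetch(url)
        if nxt is None:
            return url
        url = nxt
    raise RuntimeError("redirect loop or chain longer than %d hops" % max_hops)
```

With a real HTTP client, `fetch` would issue a HEAD request with automatic redirect-following disabled and read the `Location` header itself.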
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
==== Hueg list ====<br />
[http://code.google.com/p/shortenurl/wiki/URLShorteningServices]<br />
<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking timebomb. If they go away, get hacked or sell out millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archiveteam is here to help with their Urlteam subcommitee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting urls, it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls, or have it randomly try either base-36 or base-62 URLs. With it, [[User:Mrflip]] gathered about 6M valid URLs pulled from twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
==== Hueg list ====<br />
[http://code.google.com/p/shortenurl/wiki/URLShorteningServices]<br />
<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3280URLTeam2011-03-19T20:20:22Z<p>Jeroenz0r: /* "Official" shorteners */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking timebomb. If they go away, get hacked or sell out millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archiveteam is here to help with their Urlteam subcommitee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting urls, it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls, or have it randomly try either base-36 or base-62 URLs. With it, [[User:Mrflip]] gathered about 6M valid URLs pulled from twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3279URLTeam2011-03-19T20:12:12Z<p>Jeroenz0r: /* Tools */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in Ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try either base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
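The random-probing strategy described above (trying base-36 or base-62 codes while a cache suppresses repeats) can be sketched as follows. This is an illustrative sketch, not Monkeyshines code; the cache here is a plain in-memory set:<br />

```python
import random
import string

BASE36 = string.digits + string.ascii_lowercase  # 0-9, a-z
BASE62 = string.digits + string.ascii_letters    # 0-9, a-z, A-Z

def random_code(length, alphabet=BASE62):
    """Pick a random short-URL code to probe."""
    return "".join(random.choice(alphabet) for _ in range(length))

def probe(seen, length=5, alphabet=BASE36):
    """Read-through cache: only hand out codes we have not tried yet."""
    while True:
        code = random_code(length, alphabet)
        if code not in seen:
            seen.add(code)
            return code

seen = set()
codes = [probe(seen, length=4) for _ in range(10)]
```

In a real scraper the set would be a persistent lookup cache shared between machines, as the description above notes.<br />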
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
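The incremental ranges in the table above (e.g. adjix.com's 00-zz and 000-zzz) can be enumerated exhaustively. A minimal sketch, not taken from any of the scrapers listed; pick the alphabet to match whether the service is case-sensitive:<br />

```python
from itertools import product
import string

def code_range(length, case_sensitive=False):
    """Yield every code of the given length in lexicographic order."""
    alphabet = string.digits + string.ascii_lowercase
    if case_sensitive:
        alphabet = string.digits + string.ascii_uppercase + string.ascii_lowercase
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

# The case-insensitive "00-zz" range covers 36**2 codes.
two_char = list(code_range(2))
```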
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=TinyBack&diff=3278TinyBack2011-03-19T18:52:15Z<p>Jeroenz0r: </p>
<hr />
<div>'''TinyBack''' is a link-shortener scraper written in Ruby. The pack includes a few tools: it can scrape, remove URLs that point to a not-found page, sort in alphabetical order, and remove duplicates. It was developed by [[User:Soult]] for the [[urlteam]] project.<br />
<br />
== Scraping ==<br />
TinyBack can chop large ranges into smaller ones and request them efficiently in a random order. It has good logging functionality, making error analysis after a crash easy.<br />
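The range-chopping idea can be sketched like this; it is an illustration under stated assumptions, not TinyBack's actual Ruby code, and the chunk size is arbitrary:<br />

```python
import random

def chunks(start, stop, size):
    """Split a large numeric code range into smaller work units."""
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]

def shuffled_work(start, stop, size, seed=None):
    """Return the work units in random order, so requests are
    spread across the keyspace instead of hammering it sequentially."""
    units = chunks(start, stop, size)
    random.Random(seed).shuffle(units)
    return units

units = shuffled_work(0, 1000, 100, seed=1)
```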
<br />
{{Stub Category|article=[[urlteam]]|newstub=urlteam|category=Urlteam}}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Template:Stub&diff=3277Template:Stub2011-03-19T18:50:37Z<p>Jeroenz0r: Created page with '{{asbox | image = | pix = | subject = | article = article | qualifier = | category = stubs | tempsort = no | name = Template:Stub }}'</p>
<hr />
<div>{{asbox<br />
| image = <br />
| pix = <br />
| subject = <br />
| article = article<br />
| qualifier = <br />
| category = stubs<br />
| tempsort = no<br />
| name = Template:Stub<br />
}}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=TinyBack&diff=3276TinyBack2011-03-19T18:45:57Z<p>Jeroenz0r: </p>
<hr />
<div>'''TinyBack''' is a link-shortener scraper written in Ruby. The pack includes a few tools: it can scrape, remove URLs that point to a not-found page, sort in alphabetical order, and remove duplicates. It was developed by [[User:Soult]] for the [[urlteam]] project.<br />
<br />
== Scraping ==<br />
TinyBack can chop large ranges into smaller ones and request them efficiently in a random order. It has good logging functionality, making error analysis after a crash easy.<br />
<br />
{{Stub article=[[urlteam]]}}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=TinyBack&diff=3275TinyBack2011-03-19T18:42:07Z<p>Jeroenz0r: Created page with ''''TinyBack''' is a link-shortner scraper written in Ruby. This pack has a few tools included, it can scrape, remove URLs that link to a not found page, sort on alphabetic order …'</p>
<hr />
<div>'''TinyBack''' is a link-shortener scraper written in Ruby. The pack includes a few tools: it can scrape, remove URLs that point to a not-found page, sort in alphabetical order, and remove duplicates. It was developed by [[User:Soult]] for the [[urlteam]] project.<br />
<br />
== Scraping ==<br />
TinyBack can chop large ranges into smaller ones and request them efficiently in a random order. It has good logging functionality, making error analysis after a crash easy.<br />
<br />
WP:IDEALSTUB</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3274URLTeam2011-03-19T18:30:54Z<p>Jeroenz0r: /* Tools */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in Ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try either base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=User:Jeroenz0r&diff=3273User:Jeroenz0r2011-03-19T18:26:04Z<p>Jeroenz0r: </p>
<hr />
<div>==Current project:==<br />
* [[Urlteam]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2978URLTeam2011-03-11T16:11:43Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try either base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H / http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect; doesn't appear guessable, probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
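Several of the hosts above (nutshellurl.com, for example) don't 301 straight to the destination but bounce through an intermediate redirector, so a scraper has to follow the whole chain and guard against loops. A minimal sketch of that logic, with a made-up redirect table standing in for real HTTP responses:

```python
def resolve(url, redirects, max_hops=10):
    """Follow a chain of redirects (given as a dict) to the final URL.

    Raises if the chain loops or exceeds max_hops, as a real scraper
    should when a shortener misbehaves.
    """
    seen = set()
    for _ in range(max_hops):
        if url not in redirects:
            return url  # no further redirect: destination reached
        if url in seen:
            raise ValueError("redirect loop at " + url)
        seen.add(url)
        url = redirects[url]
    raise ValueError("too many redirects")

# Hypothetical double-301 chain, like the nutshellurl.com behaviour noted above.
chain = {
    "http://nutshellurl.com/abc": "http://nutshellurl.com/redirect?id=abc",
    "http://nutshellurl.com/redirect?id=abc": "http://example.com/long-page",
}
```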
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2977URLTeam2011-03-11T15:54:03Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, the Archiveteam is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to scrape URL shortening services efficiently -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
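The random base-36/base-62 probing with a read-through cache that Monkeyshines uses can be sketched roughly like this (the alphabets and code-length choices here are illustrative assumptions, not Monkeyshines' actual configuration):

```python
import random
import string

# Assumed code alphabets; which one a given shortener uses varies.
BASE36 = string.digits + string.ascii_lowercase
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def random_codes(n, length, alphabet=BASE62, seen=None, rng=random):
    """Generate n distinct random codes, skipping any already tried.

    `seen` acts as the read-through cache: codes in it are never
    emitted again, so multiple batches never re-request a URL.
    """
    seen = set() if seen is None else seen
    out = []
    while len(out) < n:
        code = "".join(rng.choice(alphabet) for _ in range(length))
        if code not in seen:
            seen.add(code)
            out.append(code)
    return out
```

Sharing the same `seen` set (or a networked equivalent) between scraper processes is what lets several machines probe the same service without duplicating work.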
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after an IP ban (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite the announced plan to shut down completely at the end of 2010 (2011-02-15); whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
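As the comments above note, TinyURL bans IPs that request too many non-existing shorturls and rod.gs can't keep up with the request volume, so any scraper needs throttling. A minimal sketch of a fixed-interval throttle (real scrapers would also back off on errors or bans):

```python
import time

class Throttle:
    """At most one request per `interval` seconds.

    Clock and sleep functions are injectable so the behaviour can be
    tested without real waiting.
    """
    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        """Block until it is safe to issue the next request."""
        now = self.clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now
```

A scraper loop would call `throttle.wait()` before each HTTP request; the interval per service is a tuning choice.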
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2436URLTeam2011-02-13T19:19:38Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, the Archiveteam is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to scrape URL shortening services efficiently -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after an IP ban (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999. Currently doing 9999-j000<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=User:Jeroenz0r&diff=2434User:Jeroenz0r2011-02-12T16:19:13Z<p>Jeroenz0r: </p>
<hr />
<div>==Current project:==<br />
* [[TinyURL]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2433URLTeam2011-02-12T14:09:56Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, the Archiveteam is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to scrape URL shortening services efficiently -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after an IP ban (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
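Several entries in the table are marked sequential or incremental, which is what makes brute enumeration practical: count through the code space in order and record where each code redirects. A minimal sketch of that approach in Python — the base-62 alphabet, code length, and `base_url` are illustrative assumptions, not any particular service's real parameters:

```python
import itertools
import string
import urllib.error
import urllib.request

# 0-9, a-z, A-Z: the base-62 space that case-sensitive shorteners draw from.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def codes(length):
    """Yield every short code of a given length, in sequential order."""
    for combo in itertools.product(ALPHABET, repeat=length):
        yield "".join(combo)

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None  # keep the 301 itself; its Location header is the payload

def resolve(base_url, code):
    """HEAD one short code and return where it redirects, or None for a miss."""
    opener = urllib.request.build_opener(_NoRedirect())
    try:
        resp = opener.open(
            urllib.request.Request(base_url + code, method="HEAD"), timeout=10
        )
    except urllib.error.HTTPError as err:
        resp = err  # a 3xx or a 404 both land here; the headers tell us which
    return resp.headers.get("Location")
```

For a case-insensitive service such as adjix.com, shrinking `ALPHABET` to digits plus lowercase (base-36) covers the whole space with far fewer requests.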
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
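Notes above such as nutshellurl.com's double 301 and 6url.com's HTML redirect mean a single shortlink can take several hops before the real destination. This sketch records each hop; `get_location` is a hypothetical callable standing in for whatever HTTP layer returns the Location header (or None at the final stop):

```python
import urllib.parse

def follow_chain(url, get_location, max_hops=5):
    """Walk a redirect chain hop by hop, returning every URL visited.

    get_location maps a URL to its Location header, or None when the
    URL no longer redirects (i.e. we have reached the destination).
    """
    hops = [url]
    for _ in range(max_hops):
        location = get_location(hops[-1])
        if location is None:
            break
        # Location headers may be relative; resolve against the current hop.
        hops.append(urllib.parse.urljoin(hops[-1], location))
    return hops
```

Archiving the whole chain, not just the last stop, preserves the intermediate redirector URLs too.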
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2432URLTeam2011-02-12T14:09:53Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services convert long URLs into short ones on their own domain; when a visitor opens the short URL, their browser is redirected to the long one.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] helps with scraping<br />
* [[User:Jeroenz0r]] helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. A read-through cache prevents re-requesting URLs and lets multiple scrapers run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
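The read-through cache mentioned for Monkeyshines is a simple but important pattern when several scrapers share one code space. A sketch of the idea, using SQLite purely as stand-in shared storage (Monkeyshines' actual backend differs):

```python
import sqlite3

class UrlCache:
    """Read-through cache: fetch a code only if no scraper has stored it yet.

    SQLite is illustrative storage here; the point is the lookup-then-fetch
    pattern, not the backend.
    """

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls (code TEXT PRIMARY KEY, target TEXT)"
        )

    def lookup(self, code, fetch):
        row = self.db.execute(
            "SELECT target FROM urls WHERE code = ?", (code,)
        ).fetchone()
        if row:                  # cache hit: another scraper already resolved it
            return row[0]
        target = fetch(code)     # cache miss: hit the shortener exactly once
        self.db.execute(
            "INSERT OR REPLACE INTO urls (code, target) VALUES (?, ?)",
            (code, target),
        )
        self.db.commit()
        return target
```

With the cache shared between machines, re-running an interrupted crawl costs nothing for codes already resolved.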
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2431URLTeam2011-02-12T13:26:17Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services convert long URLs into short ones on their own domain; when a visitor opens the short URL, their browser is redirected to the long one.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] helps with scraping<br />
* [[User:Jeroenz0r]] helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. A read-through cache prevents re-requesting URLs and lets multiple scrapers run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2430URLTeam2011-02-12T13:20:57Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services convert long URLs into short ones on their own domain; when a visitor opens the short URL, their browser is redirected to the long one.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] helps with scraping<br />
* [[User:Jeroenz0r]] helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. A read-through cache prevents re-requesting URLs and lets multiple scrapers run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new URLs can be created; website says it will shut down at the end of 2010; often breaks completely when crawled too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Small work like 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
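For the shorteners marked sequential or incremental in the table above, the keyspace can simply be enumerated in order rather than probed at random. A minimal Python sketch, patterned on the adjix.com 00-zz / 000-zzz ranges (alphabet and helper names are illustrative):<br />

```python
from itertools import product

# Case-insensitive alphabet, as for adjix.com; a case-sensitive service
# (e.g. rod.gs) would add A-Z as well.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def codes(length, alphabet=ALPHABET):
    """Yield every short-code of one length in order: 00, 01, ..., zz."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

def code_ranges(lengths, alphabet=ALPHABET):
    """Chain several ranges: lengths (2, 3) covers 00-zz and 000-zzz."""
    for n in lengths:
        yield from codes(n, alphabet)
```

Each generated code is then appended to the service's base URL and requested; non-existing codes typically return a 404.<br />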
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2429URLTeam2011-02-12T13:19:27Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the short URL is visited by a consumer, and their web browser is redirected to the long URL.<br />
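Archiving such a mapping means requesting the short URL and capturing the redirect's Location header without following it. A minimal Python sketch (not any of the scrapers listed below; function names are illustrative):<br />

```python
import http.client
from urllib.parse import urlsplit

def redirect_target(status, location):
    """The long URL, if the response was a redirect; otherwise None."""
    return location if status in (301, 302, 303, 307) and location else None

def resolve_short_url(short_url):
    """Fetch a short URL and record where it redirects, without following."""
    parts = urlsplit(short_url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # HEAD keeps traffic small; some services only answer GET properly.
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return redirect_target(resp.status, resp.getheader("Location"))
    finally:
        conn.close()
```

Note that some shorteners (e.g. 6url.com or qurlyq.com below) redirect via HTML or JavaScript rather than an HTTP 301, and need the response body parsed instead.<br />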
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, Archive Team is here to help with its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to avoid re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
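The random-probing approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the Monkeyshines code; the `ReadThroughCache` and `resolver` names are hypothetical:<br />

```python
import random
import string

# Alphabets for the two code styles mentioned above.
BASE36 = string.digits + string.ascii_lowercase
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def random_code(alphabet, length):
    """Pick a random short-code of the given length."""
    return "".join(random.choice(alphabet) for _ in range(length))

class ReadThroughCache:
    """Resolve each code at most once; repeated lookups hit the cache."""
    def __init__(self, resolver):
        self.resolver = resolver   # e.g. a function doing the HTTP lookup
        self.seen = {}             # code -> long URL (or None if unresolved)

    def lookup(self, code):
        if code not in self.seen:  # only hit the service once per code
            self.seen[code] = self.resolver(code)
        return self.seen[code]
```

Sharing `seen` through an external store (rather than a dict) is what lets several scraper machines use one lookup cache.<br />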
<br />
=== Or just ask! ===<br />
Here's a template that has worked at least once; the data is still pending, but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new URLs can be created; website says it will shut down at the end of 2010; often breaks completely when crawled too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Small work like 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
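For the shorteners marked sequential or incremental in the table above, the keyspace can simply be enumerated in order rather than probed at random. A minimal Python sketch, patterned on the adjix.com 00-zz / 000-zzz ranges (alphabet and helper names are illustrative):<br />

```python
from itertools import product

# Case-insensitive alphabet, as for adjix.com; a case-sensitive service
# (e.g. rod.gs) would add A-Z as well.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def codes(length, alphabet=ALPHABET):
    """Yield every short-code of one length in order: 00, 01, ..., zz."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

def code_ranges(lengths, alphabet=ALPHABET):
    """Chain several ranges: lengths (2, 3) covers 00-zz and 000-zzz."""
    for n in lengths:
        yield from codes(n, alphabet)
```

Each generated code is then appended to the service's base URL and requested; non-existing codes typically return a 404.<br />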
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* rod.gs<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0r