{{Infobox project
| title = Urlteam
| image = Urlteam logo.png
| description = url shortening was a fucking awful idea
| URL = http://urlte.am
| project_status = {{online}}
| archiving_status = {{in progress}}
}}


Services like '''TinyURL''' are a ticking timebomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archiveteam is here to help with their Urlteam subcommittee.

Twitter is a great example of what's wrong with trusting an online service with something of value. Check out some 'tweets':

* Hah, I'm a Zombie! http://tinyurl.com/8gnnb7 Ahh, the fun we all have with each other. about 1 hour ago from web
* Health privacy is dead. Here's why: http://ff.im/GMpx about 14 hours ago from FriendFeed
* Hmm, friendfeed released a new "import Twitter" feature today. It is taking a LONG time on my account. I wonder why.... http://ff.im/GM5W about 14 hours ago from FriendFeed

If these shortener services go away, there's not much content left in those tweets. See [http://en.wikipedia.org/wiki/Link_rot Link Rot].

== Who did this? ==
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]

* [[User:Scumola]] started this wiki page
* [[User:Chronomex]] started the Urlteam scraping effort
* [[User:Soult]] helps with scraping
 
So, the project: scrape the TinyURL (and similar) services.
STATUS (as of mid-April, 2009):
* tinyurl.com: 1M urls ripped
* ff.im: 1M urls ripped
* bit.ly: just started mid-April, 2009
* is.gd: over 70M urls ripped (by [[User:Chronomex]]) as of 2010-Aug-16
* NOTE: ripping is going slowly so I don't get banned and/or overwhelm the service.  ff.im banned me for 24 hours once for ripping too quickly.  Also, I'm ripping random URLs, not sequential.
 
* This looks like it would be a good task for distributed computing.  [http://www.majestic12.co.uk/ Majestic-12] is a project whose main bottleneck is bandwidth, and they are doing quite well.  You'd just need to give people a block of URLs to check, and have them report back the results.
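The "random, throttled" approach from the note above is simple to sketch. This illustrative Python fragment is not any of the actual rippers; the base-36 alphabet and the one-second delay are assumptions you would tune per service:

```python
import random
import string
import time

ALPHABET = string.digits + string.ascii_lowercase  # assuming base-36 codes

def random_codes(length, delay=1.0):
    """Yield codes in random order (not sequential), pausing between
    them so the service isn't hammered and we're less likely to be banned."""
    while True:
        yield "".join(random.choice(ALPHABET) for _ in range(length))
        time.sleep(delay)
```

Random probing also doubles as sampling: the fraction of probed codes that resolve gives a rough estimate of how full the code space is.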
 
== HOWTO ==
 
It's actually not as hard as it sounds, because we don't need to scrape any web pages or parse any HTML. The services just send a <code>Location:</code> header when queried for a hash, so we ask the service for the hash and parse the headers for the redirect url:
 
 $ curl -Is http://tinyurl.com/6dvm2t | grep Location
 Location: http://www.readwriteweb.com/archives/too_many_people_use_tinyurl.php
 $ curl -Is http://ff.im/GMpx | grep Location
 Location: http://friendfeed.com/e/08954685-00fe-4e55-b28f-4b99f83bfb0d/Health-privacy-is-dead-Here-s-why/
 
Walk through all possible hashes, check for errors, and we're good to go.
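In Python, that walk looks roughly like this. It's a sketch only (the real scrapers are listed under Tools); the base-36 alphabet and the 301/302 check are assumptions about how a given service behaves:

```python
import http.client
import itertools
import string

ALPHABET = string.digits + string.ascii_lowercase  # assuming base-36 hashes

def hashes(length):
    """Walk through every possible hash of a given length, in order."""
    for chars in itertools.product(ALPHABET, repeat=length):
        yield "".join(chars)

def resolve(service, code):
    """HEAD-request one hash; return the Location header, or None if unused."""
    conn = http.client.HTTPConnection(service, timeout=10)
    conn.request("HEAD", "/" + code)
    resp = conn.getresponse()
    conn.close()
    return resp.getheader("Location") if resp.status in (301, 302) else None
```

For example, <code>resolve("tinyurl.com", "6dvm2t")</code> should give back the readwriteweb.com url from the curl session above.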
 
 


== Tools ==
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]
* [[User:Soult]] did the same in Ruby
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting urls, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from twitter messages so far.
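The read-through cache is the interesting trick there, and it is easy to sketch. This Python fragment is illustrative only (Monkeyshines itself is Ruby, and its on-disk format is different); the table name and schema here are made up:

```python
import sqlite3

class UrlCache:
    """Read-through cache: only resolve a code we have never seen before."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls (code TEXT PRIMARY KEY, target TEXT)")

    def lookup(self, code, fetch):
        row = self.db.execute(
            "SELECT target FROM urls WHERE code = ?", (code,)).fetchone()
        if row is not None:
            return row[0]          # cache hit: no HTTP request made
        target = fetch(code)       # cache miss: hit the service exactly once
        self.db.execute("INSERT INTO urls VALUES (?, ?)", (code, target))
        self.db.commit()
        return target
```

Any callable that turns a code into a target url can be plugged in as <code>fetch</code>, so the same cache can front scrapers for different services.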


=== Or just ask! ===
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.


Try sending an email to the website owner:

  Hello!
  
  I'm working with Jason Scott of textfiles.org and other members of the
  Archive Team.
  
  Since the recent scare involving http://tr.im/'s announced (and then
  retracted) imminent demise, we've been working to archive all the
  links from URL shorteners around the Internet.
  
  If I'm not mistaken, you operate urlx.org.  Would you be so kind as to
  share with us a copy of your URL database?  We'll do our best to
  preserve this data forever in a useful way.
  
  We are already very far along in scraping links from tr.im, but it's
  faster (and friendlier!) to contact site owners asking for a copy of
  their data than it is to scrape.
  
  We've got a domain registered, urlte.am, and all links will be
  available for redirect in the format:
  
  http://urlx.org.urlte.am/av3
  
  If you could help us, that would be excellent!
  
  Thank you,


== URL shorteners ==
 
=== New table ===
The new table includes shorteners we have already started to scrape.
{| class="sortable wikitable" style="width: auto; text-align: center"
! Name
! Number of shorturls
! Scraping done by
! Status
! Comments
|-
| [http://tinyurl.com TinyURL]
| 1,000,000,000
| [[User:Soult]]
| 5-letter codes done, on halt due to being banned (2010-12-20)
| non-sequential, bans IP for requesting too many non-existing shorturls
|-
| [http://bit.ly bit.ly]
| 4,000,000,000
| [[User:Soult]]
| about 1/4
| non-sequential
|-
| [http://is.gd is.gd]
| 287,151,326
| [[User:Chronomex]]
| about 1/3 (2010-10-31)
| sequential
|-
| [http://ff.im ff.im]
| ?
| [[User:Chronomex]]
|
| only used by FriendFeed, no interface to shorten new URLs
|-
| [http://4url.cc/ 4url.cc]
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done (2009-08-14)
| sequential
|-
| litturl.com
| 33695<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done
| dead (2010-11-18)
|-
| xs.md
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done
| dead (2010-11-18)
|-
| url.0daymeme.com
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref>
| [[User:Chronomex]]
| done
| dead (2010-11-18)
|-
| [http://tr.im tr.im]
| ?
| [[User:Soult]]
| 5-letter codes finished, 6-letter codes in progress
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast
|- class="sortbottom"
! Name
! Number of shorturls
! Scraping done by
! Status
! Comments
|}


=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===
List last updated 2009-08-14.
* 1link.in
* 6url.com
* adjix.com - case-insensitive, incremental
* ad.vu - mirror of adjix.com
* biglnk.com
* budurl.com - Appears nonincremental
* canurl.com
* cli.gs - Appears nonincremental
* cort.as - http://cortas.elpais.com/
* decenturl.com - Not at all easy to scrape.
* dlvr.it
* doiop.com - Appears nonincremental
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3
* easyurl.net - Appears nonincremental: http://easyurl.net/afd2f
* go2cut.com
* ilix.in
* imfy.us - requires a recaptcha to get to the linked site.
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388
* lnkurl.com
* memurl.com - Pronounceable.  Broken.
* metamark.net / xrl.us - ? http://xrl.us/bfabog
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.
* ow.ly - I can't get it to work.
* plexp.com - Parked.
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc
* poprl.com - Not resolving
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok
* rod.gs
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab
* shorterlink.com - Parked.
* shortlinks.co.uk - Not resolving
* short.to - Probably sequential/loweralpha: http://short.to/msmp
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp
* shrinkurl.us
* shrt.st
* shurl.net
* simurl.com
* shorl.com
* smarturl.eu
* snipr.com
* snipurl.com
* snurl.com
* sn.vc
* starturl.com
* surl.co.uk
* tighturl.com
* timesurl.at
* tiny123.com
* tiny.cc
* tinylink.com
* tobtr.com
* traceurl.com
* tr.im
* tweetburner.com
* twitpwr.com
* twitthis.com
* twurl.nl
* u.mavrev.com
* ur1.ca - Database is downloadable from website directly.
* url9.com - Sequential, alphanumeric.  Leading 0s are significant.
* urlborg.com
* urlbrief.com
* urlcover.com
* urlcut.com
* urlhawk.com
* url-press.com
* urlsmash.com
* urltea.com
* urlvi.be
* urlx.org - Owner has agreed to share his database
* vimeo.com
* wlink.us
* xaddr.com
* xil.in
* xrl.us - see metamark.net
* xym.kr
* x.se
* yatuc.com
* yep.it
* yweb.com
* zi.ma
* w3t.org


==== "Official" shorteners ====
 
* goo.gl - Google
* fb.me - Facebook
* amzn.to - Amazon
* binged.it - Bing (bonus points for being longer than bing.com)
* y.ahoo.it - Yahoo
* youtu.be - YouTube
* t.co? - Twitter
* post.ly - Posterous
* wp.me - Wordpress.com
* flic.kr - Flickr
* lnkd.in - LinkedIn
* su.pr - StumbleUpon
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)
* tcrn.ch - Techcrunch
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]


==== Dead or Broken Shorteners ====
 
* chod.sk - Appears nonincremental, not resolving
* gonext.org - not resolving
* ix.it - Not resolving
* jijr.com - Doesn't appear to be a shortener, now parked
* kissa.be - "Kissa.be url shortener service is shutdown"
* kurl.us - Parked.
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."
* minurl.org - Presently in ERROR 404
* muhlink.com - Not resolving
* myurl.us - cpanel frontend


== References ==
<references />

== Distribution of tinyurls ==

Tinyurl.com urls are supposed to be un-ordered now, but there's enough prehistory that you should concentrate on ones with the initial digit 5-9 or a-d. Distribution of 6-character tinyurl.com urls (from 20M tinyurls extracted from twitter):
 
     count  cumulative  first char
         1           1  #
        63          64  0
       125         189  1
    282371      282560  2
    330545      613105  3
    386626      999731  4
   1585765     2585496  5
   1676929     4262425  6
   1009816     5272241  7
   1007035     6279276  8
   1009790     7289066  9
   1509965     8799031  a
   1712227    10511258  b
   4986046    15497304  c
   3592027    19089331  d
       331    19089662  e
       473    19090135  f
       514    19090649  g
       399    19091048  h
       353    19091401  i
       363    19091764  j
     14146    19105910  k
     33050    19138960  l
     33517    19172477  m
     33273    19205750  n
    194311    19400061  o
    194817    19594878  p
    194563    19789441  q
     85263    19874704  r
       896    19875600  s
       780    19876380  t
       167    19876547  u
       224    19876771  v
       484    19877255  w
        12    19877267  x
     92827    19970094  y
       126    19970220  z
 


NOTE: http://301works.org/ is supposedly also archiving all of the url-shorteners, but you wouldn't know it from their web page.

== Weblinks ==
* [http://urlte.am urlte.am]
* [http://301works.org 301works.org]


[[Category: URL Shortening]]
