https://wiki.archiveteam.org/api.php?action=feedcontributions&user=Jeroenz0r&feedformat=atomArchiveteam - User contributions [en]2024-03-29T06:20:37ZUser contributionsMediaWiki 1.37.1https://wiki.archiveteam.org/index.php?title=AnyHub&diff=6712AnyHub2011-11-21T19:43:26Z<p>Jeroenz0r: /* What is AnyHub.net? */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{offline}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam <br />
}}<br />
== What is AnyHub.net? == <br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, and you can upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
[[File:Anyhub.netFAQ_2011-11-15_7-30-44.png|thumb|The original FAQ]]<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
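Because the IDs appear to ascend, a worker can enumerate a region of candidate URLs and fetch each one. This is a sketch only: the real grab logic lives in the anyhub-grab scripts, and both the numeric ID scheme and the <code>/file/</code> path below are hypothetical assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: enumerate a region of candidate download URLs.
# The numeric ID scheme and the /file/ path are hypothetical assumptions;
# the actual grab logic is in the anyhub-grab scripts.
start=1000
end=1004
for id in $(seq "$start" "$end"); do
  printf 'http://www.anyhub.net/file/%s\n' "$id"
done
```

In practice each worker would claim a region from the tracker so that ranges are not downloaded twice.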
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
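The nickname restriction can be checked before launching the client. This helper is hypothetical (it is not part of anyhub-grab); it only tests the allowed character set mentioned above.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of anyhub-grab: verify a nickname
# contains only the allowed characters (A-Z, a-z, 0-9, "-" and "_").
nick="my-nick_01"
if printf '%s' "$nick" | grep -Eq '^[A-Za-z0-9_-]+$'; then
  result="ok"
else
  result="invalid"
fi
echo "$result"
```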
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They had great stats at http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
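As a sanity check on these figures, the implied average file size can be computed from the totals above. This is arithmetic only; no AnyHub endpoint is contacted.

```shell
#!/bin/sh
# Average file size implied by 1122585 files totalling 2.81 TiB.
files=1122585
avg=$(awk -v files="$files" 'BEGIN {
  total = 2.81 * 1024^4                  # TiB -> bytes
  printf "%.2f", total / files / 1024^2  # bytes -> MiB
}')
echo "average file size ~ ${avg} MiB"
```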
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6680AnyHub2011-11-19T11:05:49Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{offline}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, and you can upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They had great stats at http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6677AnyHub2011-11-18T16:44:48Z<p>Jeroenz0r: /* What is AnyHub.net? */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{closing}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, and you can upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6675AnyHub2011-11-18T11:57:32Z<p>Jeroenz0r: /* Tools */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{closing}}<br />
| archiving_status = {{saved}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, yet you can still upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br><br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Uploading your data ==<br />
<br />
To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the <code>./upload-finished.sh</code> script to upload your data. For example, run this in your script directory: <code>./upload-finished.sh batcave.textfiles.com::YOURNICK/anyhub/</code><br />
<br />
== Info/stats about AnyHub ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 18 November, 2011: '''1122585''' files ('''2.81''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6636AnyHub2011-11-15T06:54:09Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net/<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = AnyHubTeam<br />
}}<br />
== What is AnyHub.net? ==<br />
AnyHub is a fast, free and simple file host that anyone can use. Signup is not required, yet you can still upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br><br />
--Edit from their [http://archiveteam.org/images/3/3b/Anyhub.netFAQ_2011-11-15_7-30-44.png FAQ].<br />
<br />
== AnyHub's death ==<br />
The official banner said ''AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.''<br><br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can define a region and start downloading!<br />
Github page: https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "'''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh'''"<br><br />
<br />
== How does the tool work? ==<br />
The dld-client is one of the easier download-tools.<br><br />
Just start a terminal/screen with "'''./dld-client.sh ''{your_nickname}'''''" (nickname needs to be A-Z, a-z, 0-9, - and _)<br><br />
The download stats/dashboard is here: http://anyhub.heroku.com/<br />
<br />
== Info/stats about AnyHub ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=File:Anyhub.netFAQ_2011-11-15_7-30-44.png&diff=6635File:Anyhub.netFAQ 2011-11-15 7-30-44.png2011-11-15T06:42:29Z<p>Jeroenz0r: Screenshot of the FAQ on anyhub.net</p>
<hr />
<div>Screenshot of the FAQ on anyhub.net</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Splinder&diff=6634Splinder2011-11-15T06:37:20Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = Splinder<br />
| image = Us.splinder.com_2011-11-15_7-34-30.png<br />
| URL = {{url|1=http://www.splinder.com/}}<br />
{{url|1=http://www.us.splinder.com/}}<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
}}<br />
Splinder.com has been the main blog hosting company in Italy for a while (see [[Wikipedia:it:Splinder]]). It was founded in 2001 and hosts about half a million blogs and over 55 million pages.<br />
Since 8 November 2011, a warning on the home page has said that no new PRO accounts have been created since 1 June. The company has confirmed that the website will close on the 24th.[http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/comment/65653358#cid-65653358]<br />
<br />
== How to help archiving ==<br />
<br />
There is a distributed download script that gets usernames from a tracker and downloads the data.<br />
<br />
Make sure you are on Linux and that you have curl, git, and a recent version of Bash. Your system must also be able to compile wget.<br />
<br />
# Get the code: <code>git clone https://github.com/ArchiveTeam/splinder-grab</code><br />
# Get and compile the latest version of wget-warc: <code>./get-wget-warc.sh</code><br />
# Think of a nickname for yourself (preferably use your IRC name).<br />
# Run the download script with <code>./dld-client.sh "<YOURNICK>"</code><br />
# To stop the script gracefully, run <code>touch STOP</code> in the script's working directory. It will finish the current task and stop.<br />
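The STOP-file convention in the last step can be sketched as follows. This is assumed behavior for illustration; the real check is inside dld-client.sh.

```shell
#!/bin/sh
# Sketch of the STOP-file convention (assumed behavior of dld-client.sh):
# between tasks the client looks for a file named STOP in its working
# directory and exits gracefully if the file exists.
workdir=$(mktemp -d)
touch "$workdir/STOP"
if [ -f "$workdir/STOP" ]; then
  status="stopping after current task"
else
  status="continuing"
fi
rm -rf "$workdir"
echo "$status"
```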
<br />
===Notes===<br />
<br />
* Compiling wget-warc will require dev packages for the various libraries that it needs. Most questions have been about gnutls; install the <code>gnutls-devel</code> or <code>gnutls-dev</code> package with your favorite package manager.<br />
* Downloading one user's data can take between 10 seconds and a few hours.<br />
* The data for one user is equally varied, from a few kB to several GB.<br />
* The downloaded data will be saved in the <code>./data/</code> subdirectory.<br />
* Download speeds from me.com are not that high. You can run multiple clients to speed things up.<br />
<br />
== Status ==<br />
<br />
There is a [http://splinder.heroku.com real-time dashboard] where you can check the progress.<br />
<br />
==External links==<br />
*http://www.splinder.com/<br />
*http://www.us.splinder.com/<br />
<br />
==Site structure==<br />
<br />
The users are identified by their usernames. Fortunately, the site provides a list of all users. Usernames are not case-sensitive, but there is a case preference.<br />
<br />
==Example URLs==<br />
User profile: <code><nowiki>http://www.splinder.com/profile/<<username>></nowiki></code><br />
<br />
<pre><br />
Example profile:<br />
http://www.splinder.com/profile/difficilifoglie<br />
<br />
View count on profile page:<br />
http://www.splinder.com/ajax.php?type=counter&op=profile&profile=Romanticdreamer<br />
<br />
Example of friends list paging: (160 per page, starting at 0)<br />
http://www.splinder.com/profile/difficilifoglie/friends<br />
http://www.splinder.com/profile/difficilifoglie/friends/160<br />
<br />
Inverse friends (probably also paged):<br />
http://www.splinder.com/profile/difficilifoglie/friendof<br />
<br />
Link to blog: (note: not always the same as the username)<br />
http://difficilifoglie.splinder.com/<br />
http://learnonline.splinder.com/<br />
<br />
Photo:<br />
http://www.splinder.com/profile/difficilifoglie/photo<br />
http://www.splinder.com/mediablog/wondermum/media/24544805<br />
<br />
Video:<br />
http://www.splinder.com/profile/wondermum/video<br />
http://www.splinder.com/mediablog/wondermum/media/25737390<br />
<br />
Audio:<br />
Not a separate user feed, but only accessible via mediablog<br />
http://www.splinder.com/mediablog/learnonline/media/25727030<br />
<br />
Mediablog: combination of the audio + video + photo lists<br />
http://www.splinder.com/mediablog/learnonline<br />
(16 per page, starting at 0)<br />
http://www.splinder.com/mediablog/learnonline/16<br />
<br />
Mediablog has PowerPoint, Word files:<br />
http://www.splinder.com/mediablog/learnonline/media/25641346<br />
http://www.splinder.com/mediablog/learnonline/media/25546305<br />
http://www.splinder.com/mediablog/learnonline/media/21901634<br />
http://www.splinder.com/mediablog/learnonline/media/24875290<br />
<br />
User avatar: grab url from profile page<br />
<br />
Photo file: grab url from photo page and remove _medium to get original picture<br />
http://files.splinder.com/d5e492233631af39212268593afca02d_square.jpg<br />
http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg<br />
http://files.splinder.com/d5e492233631af39212268593afca02d.jpg<br />
older photos do not have this structure, different ids for each size:<br />
http://www.splinder.com/mediablog/babboramo/media/17359043<br />
http://files.splinder.com/13b615ccbd75354ee4e0d973da66c2b2.jpeg<br />
http://files.splinder.com/770d7b9ecac27083d9204af327ebe743.jpeg<br />
<br />
PowerPoint, Word files: grab url from media page<br />
http://files.splinder.com/46dbf3d5a0b12e490f81ddb8444b4fad.ppt<br />
http://files.splinder.com/ab3ce16c850ac530351d9df0937152c7.pdf<br />
<br />
Video items: grab url from media page<br />
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_square.jpg<br />
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_thumbnail.jpg<br />
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv<br />
note: square, thumbnail, small is not always available, check flashvars for vidpath, imgpath<br />
http://www.splinder.com/mediablog/babboramo/media/13131052<br />
http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv<br />
http://files.splinder.com/f56060b7fef139f03b72e06ca9fcba55.jpeg<br />
<br />
Audio items: grab url from media page, flashvars<br />
sometimes there is a _thumbnail, remove that to get a better quality<br />
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3<br />
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3<br />
<br />
Comments on blog posts:<br />
http://www.splinder.com/myblog/comment/list/25742358<br />
on some, but not on all blogs, those comments are also included in the blog page<br />
http://dal15al25.splinder.com/post/25740180<br />
http://soluzioni.splinder.com/post/2802227/blog-pager-su-piu-righe<br />
http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/<br />
http://civati.splinder.com/post/25742977<br />
pagination: see media comments<br />
<br />
Comments on media items:<br />
http://www.splinder.com/media/comment/list/21254470<br />
http://www.splinder.com/media/comment/list/21254470?from=50<br />
(50 per page, starting at 0)<br />
number of comments is on the media page<br />
http://www.splinder.com/mediablog/danspo/media/21254470<br />
<br />
<br />
Blog urls:<br />
the blogs have content from their own subdomain, but also from<br />
files.splinder.com<br />
www.splinder.com/misc/ (topbar css, gif)<br />
www.splinder.com/includes/ (js)<br />
www.splinder.com/modules/service_links/ (images)<br />
syndication.splinder.com<br />
<br />
links to www.splinder.com that should NOT be followed:<br />
/myblog/<br />
/users/<br />
/media/<br />
/node/<br />
/profile/<br />
/mediablog/<br />
/community/<br />
/user/<br />
/night/<br />
/home/<br />
/mysearch/<br />
/online/<br />
/trackback/<br />
<br />
</pre><br />
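Two of the URL patterns documented above can be mechanized. Both helpers are illustrative sketches built from the patterns on this page; they are not part of splinder-grab.

```shell
#!/bin/sh
# Illustrative sketches of two URL patterns documented above;
# these helpers are not part of splinder-grab.
user="difficilifoglie"

# 1. Friends-list paging: 160 entries per page; the first page has no offset.
pages="http://www.splinder.com/profile/$user/friends"
for offset in 160 320; do
  pages="$pages
http://www.splinder.com/profile/$user/friends/$offset"
done
printf '%s\n' "$pages"

# 2. Photos: strip "_medium" from a thumbnail URL to get the original file.
medium="http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg"
original=$(printf '%s' "$medium" | sed 's/_medium//')
printf '%s\n' "$original"
```

Note that older photos use different IDs for each size, so the `_medium` trick only applies to the newer URL structure.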
<br />
wget-warc --mirror --page-requisites --span-hosts --domains=learnonline.splinder.com,files.splinder.com,www.splinder.com,syndication.splinder.com --exclude-directories="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe" -nv -o wget.log "http://learnonline.splinder.com/"</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=File:Us.splinder.com_2011-11-15_7-34-30.png&diff=6633File:Us.splinder.com 2011-11-15 7-34-30.png2011-11-15T06:36:08Z<p>Jeroenz0r: Screenshot of us.splinder.com. Made on 2011-11-15, notice about closing is visible.</p>
<hr />
<div>Screenshot of us.splinder.com. Made on 2011-11-15, notice about closing is visible.</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6632AnyHub2011-11-15T06:33:21Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = Anyhub.net_2011-11-15_7-30-3.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = AnyHubTeam<br />
}}<br />
== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The json data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"<br><br />
Stats here: http://anyhub.heroku.com/<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=File:Anyhub.net_2011-11-15_7-30-3.png&diff=6631File:Anyhub.net 2011-11-15 7-30-3.png2011-11-15T06:32:43Z<p>Jeroenz0r: This is a screenshot of AnyHub's mainpage. It was taken 2011-11-15, so the banner which warns you about deletion is visible.</p>
<hr />
<div>This is a screenshot of AnyHub's mainpage. It was taken 2011-11-15, so the banner which warns you about deletion is visible.</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=User:Jeroenz0r&diff=6624User:Jeroenz0r2011-11-14T17:56:03Z<p>Jeroenz0r: /* Current project: */</p>
<hr />
<div>The drive behind these projects is what I like the most!<br />
<br />
==Current project:==<br />
* [[Urlteam]]<br />
* [[AnyHub]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Archiveteam:IRC&diff=6612Archiveteam:IRC2011-11-13T15:33:17Z<p>Jeroenz0r: </p>
<hr />
<div>'''IRC''' (Internet Relay Chat) is an internet protocol that allows multiple users to connect to a server and chat. Each IRC "server" can be connected to by a person, then someone joins a "channel" with the particular subject they are interested in.<br />
<br />
The ArchiveTeam uses IRC as its one-stop shop for coordinating official and unofficial AT projects.<br />
<br />
Generally, you can log the channels you are in using your client. But if you want a 24/7 bot logging your channel, you can use a script like [http://toolserver.org/~bryan/TsLogBot/TsLogBot.py this] (change the server and channel variables).<br />
<br />
== ArchiveTeam on IRC ==<br />
<br />
Below is a list of the IRC channels the ArchiveTeam uses to coordinate all its projects, in no particular order. All these channels are on the [http://efnet.org EFNet] network.<br />
<br />
{| border="1" align="center" style="text-align:center;" cellpadding="6"<br />
|Channel name||Channel hashtag||Channel description||Status<br />
|-<br />
|colspan="4"|<b>In use channels</b><br />
|-<br />
|Archive Team<br />
|[irc://irc.efnet.org/archiveteam #archiveteam]<br />
The main ArchiveTeam channel, mainly used for news, announcements and early project planning.<br />
|N/A<br />
|-<br />
|AT Chat<br />
|[irc://irc.efnet.org/atchat #atchat]<br />
|Off-topic discussion for things not directly related to ArchiveTeam and its projects.<br />
|N/A<br />
|-<br />
|ArchiveMeme<br />
|[irc://irc.efnet.org/archivememe #archivememe]<br />
|An unofficial fan channel started by BlueMax. http://memegenerator.net/ArchiveTeam<br />
|N/A<br />
|-<br />
|colspan="4"|<b>Currently active projects</b><br />
|-<br />
|BashUp<br />
|[irc://irc.efnet.org/bashup #bashup]<br />
|The ArchiveTeam [[IRC Quotes|Quote Backup Project]], dedicated to backing up quote databases (such as Bash.org) and similar websites (similar to FMyLife or MyLifeIsAverage).<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Forever Alone<br />
|[irc://irc.efnet.org/foreveralone #foreveralone]<br />
|The Friendster backup project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Magically Delicious<br />
|[irc://irc.efnet.org/magicallydelicious #magicallydelicious ]<br />
|Delicious backup project<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Google Grape||[irc://irc.efnet.org/googlegrape #googlegrape]<br />
|Main channel for coordinating the [[Google Video Warroom|Google Video project]].<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|BOINC Google<br />
|[irc://irc.efnet.org/boincgoogle #boincgoogle]<br />
|Sub-channel for the [[Google Video Warroom|Google Video project]], for details about running the distributed-download software<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Lulu Poetry<br />
|[irc://irc.efnet.org/lulupoetry #lulupoetry]<br />
|Channel for the brief but intense [[Poetry.com]] archiving project.<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Archive Strikes Back<br />
|[irc://irc.efnet.org/archivestrikesback #archivestrikesback ]<br />
|Channel for [[Forums.starwars.com]] archive project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|AnyHub<br />
|[irc://irc.efnet.org/AnyHubTeam #AnyHubTeam]<br />
|The [[AnyHub]] team.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|colspan="4"|<b>Currently idle or complete projects</b><br />
|-<br />
|FlickrFckr<br />
|[irc://irc.efnet.org/flickrfckr #flickrfckr]<br />
|The [[FlickrFckr|Flickr backup project]] of the Archive Team. Not needed just yet, but it's a Yahoo owned service, so we're always prepped.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Archive Commandos<br />
|[irc://irc.efnet.org/archivecommandos #archivecommandos]<br />
|http://archiveteam.org/index.php?title=Commandos<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|WikiTeam<br />
|[irc://irc.efnet.org/wikiteam #wikiteam]<br />
|The [[WikiTeam|Wiki backup project]]. Any wiki can be backed up here.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ProdigalSon<br />
|[irc://irc.efnet.org/prodigalson #prodigalson]<br />
|The [[Pages|backup project for pages.prodigy.net]].<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ArchiveBox<br />
|[irc://irc.efnet.org/archivebox #archivebox]<br />
|The project started by jch to provide a virtual machine that can download ArchiveTeam projects with predetermined scripts and tools.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|GetMUD<br />
|[irc://irc.efnet.org/getmud #getmud]<br />
|The multi-user-dungeon backup project of the Archive Team. Currently no progress as of yet.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|URLTeam<br />
|[irc://irc.efnet.org/urlteam #urlteam]<br />
|The [[URLTeam|URL shortener backup project]] of the ArchiveTeam. To quote: "URL shortening = fucking bad idea"<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Space Invaders<br />
|[irc://irc.efnet.org/spaceinvaders #spaceinvaders]<br />
|The [[Talk:Windows Live Spaces|Windows Live Spaces backup project]].<br />
|<font color=#ff0000>Idle</font><br />
|}<br />
<br />
== IRC Logs ==<br />
[[User:Auguste|Auguste]] is currently hosting logs of most, if not all, of the above channels at [http://archivebox.dyndns.org/cargohold/irclogs/ http://archivebox.dyndns.org/cargohold/irclogs/]. These are logged with a dedicated Weechat client. If some logs are missing, check the '1970' directory - the server has no internal clock, so it does the time warp whenever openntpd fails.<br />
<br />
[[User:Scumola|Scumola]]/swebb is also hosting chatlogs of some channels at [http://badcheese.com/~steve/atlogs/ http://badcheese.com/~steve/atlogs/]. Though only logs from the past week or so are listed, older chatlogs can still be accessed by changing the URL.<br />
<br />
== Unofficial ArchiveTeam QDB ==<br />
ArchiveTeamsters are encouraged to visit and contribute to the unofficial [http://www.deaddyingdamned.com/qdb/ ArchiveTeam quote database].<br />
<br />
[[Category:Archive Team]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Archiveteam:IRC&diff=6611Archiveteam:IRC2011-11-13T15:31:56Z<p>Jeroenz0r: /* ArchiveTeam on IRC */ Added AnyHub</p>
<hr />
<div>'''IRC''' (Internet Relay Chat) is an internet protocol that allows multiple users to connect to a server and chat. Each IRC "server" can be connected to by a person, then someone joins a "channel" with the particular subject they are interested in.<br />
<br />
The ArchiveTeam uses IRC as its one-stop shop for coordinating official and unofficial AT projects.<br />
<br />
Generally, you can log the channels you are in using your client. But if you want a 24/7 bot logging your channel, you can use a script like [http://toolserver.org/~bryan/TsLogBot/TsLogBot.py this] (change the server and channel variables).<br />
<br />
== ArchiveTeam on IRC ==<br />
<br />
Below is a list of the IRC channels the ArchiveTeam uses to coordinate all its projects, in no particular order. All these channels are on the [http://efnet.org EFNet] network.<br />
<br />
{| border="1" align="center" style="text-align:center;" cellpadding="6"<br />
|Channel name||Channel hashtag||Channel description||Status<br />
|-<br />
|colspan="4"|<b>In use channels</b><br />
|-<br />
|Archive Team<br />
|[irc://irc.efnet.org/archiveteam #archiveteam]<br />
The main ArchiveTeam channel, mainly used for news, announcements and early project planning.<br />
|N/A<br />
|-<br />
|AT Chat<br />
|[irc://irc.efnet.org/atchat #atchat]<br />
|Off-topic discussion for things not directly related to ArchiveTeam and its projects.<br />
|N/A<br />
|-<br />
|ArchiveMeme<br />
|[irc://irc.efnet.org/archivememe #archivememe]<br />
|An unofficial fan channel started by BlueMax. http://memegenerator.net/ArchiveTeam<br />
|N/A<br />
|-<br />
|colspan="4"|<b>Currently active projects</b><br />
|-<br />
|BashUp<br />
|[irc://irc.efnet.org/bashup #bashup]<br />
|The ArchiveTeam [[IRC Quotes|Quote Backup Project]], dedicated to backing up quote databases (such as Bash.org) and similar websites (such as FMyLife or MyLifeIsAverage).<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Forever Alone<br />
|[irc://irc.efnet.org/foreveralone #foreveralone]<br />
|The Friendster backup project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Magically Delicious<br />
|[irc://irc.efnet.org/magicallydelicious #magicallydelicious]<br />
|The Delicious backup project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|Google Grape||[irc://irc.efnet.org/googlegrape #googlegrape]<br />
|Main channel for coordinating the [[Google Video Warroom|Google Video project]].<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|BOINC Google<br />
|[irc://irc.efnet.org/boincgoogle #boincgoogle]<br />
|Sub-channel for the [[Google Video Warroom|Google Video project]], for details about running the distributed-download software.<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Lulu Poetry<br />
|[irc://irc.efnet.org/lulupoetry #lulupoetry]<br />
|Channel for the brief but intense [[Poetry.com]] archiving project.<br />
|<font color=#C0C000>Semi-Active</font><br />
|-<br />
|Archive Strikes Back<br />
|[irc://irc.efnet.org/archivestrikesback #archivestrikesback]<br />
|Channel for the [[Forums.starwars.com]] archive project.<br />
|<font color=#0000ff>Active</font><br />
|-<br />
|colspan="4"|<b>Currently idle or complete projects</b><br />
|-<br />
|FlickrFckr<br />
|[irc://irc.efnet.org/flickrfckr #flickrfckr]<br />
|The [[FlickrFckr|Flickr backup project]] of the Archive Team. Not needed just yet, but it's a Yahoo-owned service, so we're always prepped.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Archive Commandos<br />
|[irc://irc.efnet.org/archivecommandos #archivecommandos]<br />
|http://archiveteam.org/index.php?title=Commandos<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|WikiTeam<br />
|[irc://irc.efnet.org/wikiteam #wikiteam]<br />
|The [[WikiTeam|Wiki backup project]]. Any wiki can be backed up here.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ProdigalSon<br />
|[irc://irc.efnet.org/prodigalson #prodigalson]<br />
|The [[Pages|backup project for pages.prodigy.net]].<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|ArchiveBox<br />
|[irc://irc.efnet.org/archivebox #archivebox]<br />
|The project started by jch to provide a virtual machine that can download ArchiveTeam projects with predetermined scripts and tools.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|GetMUD<br />
|[irc://irc.efnet.org/getmud #getmud]<br />
|The multi-user-dungeon backup project of the Archive Team. No progress as of yet.<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|URLTeam<br />
|[irc://irc.efnet.org/urlteam #urlteam]<br />
|The [[URLTeam|URL shortener backup project]] of the ArchiveTeam. To quote: "URL shortening = fucking bad idea"<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|Space Invaders<br />
|[irc://irc.efnet.org/spaceinvaders #spaceinvaders]<br />
|The [[Talk:Windows Live Spaces|Windows Live Spaces backup project]].<br />
|<font color=#ff0000>Idle</font><br />
|-<br />
|AnyHub<br />
|[irc://irc.efnet.org/AnyHubTeam #AnyHubTeam]<br />
|The [[AnyHub]] team.<br />
|<font color=#0000ff>Active</font><br />
|}<br />
<br />
== IRC Logs ==<br />
[[User:Auguste|Auguste]] is currently hosting logs of most, if not all, of the above channels at [http://archivebox.dyndns.org/cargohold/irclogs/ http://archivebox.dyndns.org/cargohold/irclogs/]. These are logged with a dedicated WeeChat client. If some logs are missing, check the '1970' directory - the server has no internal clock, so it does the time warp whenever openntpd fails.<br />
<br />
[[User:Scumola|Scumola]]/swebb is also hosting chatlogs of some channels at [http://badcheese.com/~steve/atlogs/ http://badcheese.com/~steve/atlogs/]. Though only logs from the past week or so are listed, older chatlogs can still be accessed by changing the URL.<br />
<br />
== Unofficial ArchiveTeam QDB ==<br />
ArchiveTeamsters are encouraged to visit and contribute to the unofficial [http://www.deaddyingdamned.com/qdb/ ArchiveTeam quote database].<br />
<br />
[[Category:Archive Team]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6610AnyHub2011-11-13T15:31:47Z<p>Jeroenz0r: </p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = AnyHub.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = archiveteam<br />
}}<br />
== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
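The ascending-filename observation above can be sketched as a small shell helper that enumerates candidate IDs over an alphabet. Note that the alphabet, its ordering, and the helper name are assumptions for illustration, not necessarily what anyhub-grab actually does:

```shell
#!/bin/sh
# Enumerate a small range of candidate file IDs, assuming an
# ascending base-62-style alphabet (digits, then upper case, then
# lower case). The real AnyHub alphabet/order is an assumption.
ALPHABET="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

next_ids() {
  prefix="$1"   # e.g. "4AA"
  count="$2"    # how many one-character suffixes to print
  i=0
  while [ "$i" -lt "$count" ]; do
    c=$(printf '%s' "$ALPHABET" | cut -c $((i + 1)))
    printf '%s%s\n' "$prefix" "$c"
    i=$((i + 1))
  done
}

next_ids "4AA" 3
```

A real grab would feed IDs like these to wget; the dld-client.sh script above handles that (and range assignment) for you.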
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
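The figures above could be pulled from the JSON endpoint with a couple of shell helpers. The "files" field name in the parser is an assumption about the feed's layout, so treat this as a sketch:

```shell
#!/bin/sh
# Sketch: fetch AnyHub's recent-stats JSON and pull out the file
# count. The endpoint is the one listed above; the "files" field
# name is an assumption about the JSON layout.
fetch_stats() {
  curl -s http://www.anyhub.net/stats/recent
}

# Crude, dependency-free extraction of a numeric "files" field
# (a real script would use a JSON parser such as jq).
file_count() {
  sed -n 's/.*"files"[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# Usage: fetch_stats | file_count
```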
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"<br><br />
Stats here: http://anyhub.heroku.com/<br />
<br />
== IRC/Chat ==<br />
See here: [[IRC]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6609AnyHub2011-11-13T15:28:40Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>{{Infobox project<br />
| title = AnyHub<br />
| image = AnyHub.png<br />
| description = File hosting website<br />
| URL = http://www.anyhub.net<br />
| project_status = {{closing}}<br />
| archiving_status = {{inprogress}}<br />
| irc = archiveteam<br />
}}<br />
== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"<br><br />
Stats here: http://anyhub.heroku.com/</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6605AnyHub2011-11-13T15:10:34Z<p>Jeroenz0r: </p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then start your downloader! "'''./dld-client.sh ''{your_nickname}'''''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use "'''./dld-client.sh ''{your_nickname}'''''"</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6604AnyHub2011-11-13T15:09:33Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
fuck this, we use '''./dld-client.sh ''{nickname}'''''</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6602AnyHub2011-11-13T14:40:53Z<p>Jeroenz0r: /* Ranges */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Info/stats ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
The JSON data: http://www.anyhub.net/stats/recent<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4AA_-4AG_<br />
|Busy - 14:31 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6601AnyHub2011-11-13T14:38:29Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4AA_-4AG_<br />
|Busy - 14:31 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6600AnyHub2011-11-13T14:32:04Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4A**<br />
|Busy - 14:31 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6599AnyHub2011-11-13T14:30:18Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status? (Standard Time +0000 UTC)'''<br />
|-<br />
|Jeroenz0r<br />
|4AA*<br />
|Busy - 14:29 13 November 2011<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6598AnyHub2011-11-13T13:57:21Z<p>Jeroenz0r: /* Who will do what? */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status?'''<br />
|-<br />
|Jeroenz0r<br />
|coming<br />
|coming<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6597AnyHub2011-11-13T13:56:58Z<p>Jeroenz0r: </p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)<br />
<br />
== Who will do what? ==<br />
{| border="1" align="center" style="text-align:center;"<br />
|'''Who?'''<br />
|'''What?'''<br />
|'''Status?'''<br />
|-<br />
|Jeroenz0r<br />
|coming<br />
|coming<br />
|-<br />
|etc<br />
|Ice cream<br />
|etc<br />
|}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6596AnyHub2011-11-13T13:54:15Z<p>Jeroenz0r: /* Ranges */</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br><br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=AnyHub&diff=6595AnyHub2011-11-13T13:53:55Z<p>Jeroenz0r: Stub</p>
<hr />
<div>== AnyHub.net FAQ ==<br />
AnyHub is a fast, free and simple file host that anyone can use. (signup not required)<br><br />
You may upload files of up to 10 GiB at a time.<br><br />
Files uploaded will generally be kept forever, unless they are in violation of our Terms of Service.<br><br><br />
AnyHub is developed and run by [http://charliesomerville.com/ Charlie Somerville], a student from Melbourne, Australia. The awesome design was created by [http://z-dev.org/ Matt Anderson], a talented graphics designer from Ohio.<br />
<br />
== AnyHub's death ==<br />
AnyHub will be shutting down as of '''Friday, 18th of November'''. Please download any important data immediately, as it will be unavailable past that date.<br><br />
Well, this is where ArchiveTeam kicks in.<br />
<br />
== Tools ==<br />
The filenames assigned to uploads seem to be ascending, so we can just start downloading!<br />
https://github.com/ArchiveTeam/anyhub-grab<br><br />
To download all tools: "''git clone git://github.com/ArchiveTeam/anyhub-grab.git ; cd anyhub-grab ; ./get-wget-warc.sh''"<br><br />
And then download a piece! "''./dld-range.sh {range}''"<br />
<br />
== Ranges ==<br />
They have great stats! http://www.anyhub.net/stats<br />
As of 13 November, 2011: '''1114459''' files ('''2.78''' TiB)</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Talk:Windows_Live_Spaces&diff=3398Talk:Windows Live Spaces2011-03-23T14:50:53Z<p>Jeroenz0r: /* Phase 2: Downloading Hotlists */</p>
<hr />
<div>== Current Status ==<br />
'''March 22:''' Swicher got clarification about the shutdown date - Microsoft is apparently closing Windows Live Spaces in batches, and should be complete by March 30, 2011<ref>[http://windowsteamblog.com/windows_live/b/windowslive/archive/2011/03/02/over-one-million-new-blogs-on-wordpress-com-but-time-is-running-out.aspx#comments Over one million new blogs on WordPress.com, but time is running out] - See bottom-two comments</ref>. This doesn't leave us with much time.<br />
<br />
'''March 20:''' D-Day has been and gone, and as of March 22, 2011, Windows Live Spaces is still running. There are still millions of Spaces that have not been downloaded or migrated. Microsoft could shut down WLS at any moment and delete all the data, so we need as many people as possible to help download them all. As of March 2, 2011, only 1,000,000 Spaces had been migrated to WordPress,<ref>[http://windowsteamblog.com/windows_live/b/windowslive/archive/2011/03/02/over-one-million-new-blogs-on-wordpress-com-but-time-is-running-out.aspx Over one million new blogs on WordPress.com, but time is running out]</ref> so we have a lot of catching up to do.<br />
<br />
[[User:Swicher|Swicher]] is currently downloading [[Spaces of Windows Live Spaces pending for download|several thousand Spaces]] using HTTrack. These Spaces are duplicated as the first few hotlists, to be sure we do get them.<br />
<br />
== Phase 1: CID Scraping ==<br />
[[User:NovaKing|NovaKing]] is currently scraping Bing for more profiles. At the rate he's been going, he should have tens of thousands ready soon, which will be split up into hotlists and allocated to volunteers for downloading.<br />
<br />
== Phase 2: Downloading Hotlists ==<br />
This is a list of available hotlists and their status. They are generally split into chunks of 1,000 Spaces.<br />
<br />
If you would like to take ownership of one, speak to Auguste on IRC. Volunteers, please update this table as soon as you are finished, or let Auguste know if you are unable to complete it.<br />
<br />
{| border="1" width="100%"<br />
!Filename<br />
!Owner<br />
!Size (GB) (compressed size)<br />
!Status<br />
!Status notes<br />
|-<br />
|[http://pastebin.com/FMJh3vAa wls 0001-1000.txt]<br />
|ersi<br />
|13.7~ GB (1.1GB bzip2)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/xrXfPbL4 wls 1001-2000.txt]<br />
|ersi<br />
|20 GB (1.6GB bzip2)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/KAVYAW3c wls 2001-2202.txt]<br />
|Dr-Spangle<br />
|<br />
|Complete<br />
|Awaiting upload.<br />
|-<br />
|[http://pastebin.com/pygEEHBr wls 2203-3000.txt]<br />
|ersi<br />
|13 GB (945.3MB bzip2)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/LS8nvgdN wls 3001-4000.txt]<br />
|Jeroenz0r, joeyh<br />
|6.36GB (1.36GB gz)/(1.53GB Deflate)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/640Txn1g wls 4001-5000.txt]<br />
|amnesia<br />
|8.0G (1.7G zipped)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/1ciEWJB1 wls 5001-6000.txt]<br />
|Underscor<br />
|<br />
|In progress<br />
|<br />
|-<br />
|[http://pastebin.com/zi4D58iQ wls 6001-7000.txt]<br />
|amnesia<br />
|336M (74M zipped)<br />
|Complete<br />
|Uploaded, awaiting verification<br />
|-<br />
|[http://pastebin.com/6pYaicYF wls 7001-8000.txt]<br />
|None<br />
|<br />
|Unassigned<br />
|<br />
|}<br />
<br />
=== Instructions ===<br />
You have three options:<br />
* [http://pastebin.com/Lr0Xn0Wm SpaceInvader2.pl]<br />
** Downloads the list of Spaces, one Space at a time, using Wget. You probably want this one.<br />
** Usage: <code>SpaceInvader2.pl "HOTLIST"</code><br />
* [http://pastebin.com/W6dhEwV2 SpaceInvaderTurbo.pl]<br />
** Spawns multiple instances of Wget to download everything at once. If you have a hotlist of 1,000 Spaces, this means 1,000 instances of Wget, all downloading simultaneously. This may be unfriendly to both your CPU and Microsoft's systems, but it will cut a 7-day job down to a few hours. Use it at your own risk.<br />
** Usage: <code>SpaceInvaderTurbo.pl "HOTLIST"</code><br />
* [http://pastebin.com/pqskd0Xu spaceinvader.sh]<br />
** Same idea, different implementation. Will run up to 50 wget instances and won't be that hard on your machine.<br />
** Actually does save some images.<br />
** Usage: <code>spaceinvader.sh "HOTLIST"</code><br />
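All three scripts boil down to the same loop: read a hotlist, fetch each Space with wget. A minimal sequential version might be sketched like this; the one-URL-per-line hotlist format matches the scripts above, but the specific wget flags are this sketch's choice, not necessarily what SpaceInvader2.pl passes:

```shell
#!/bin/sh
# Minimal sequential hotlist downloader in the spirit of
# SpaceInvader2.pl: read one Space URL per line and mirror it
# with wget. The exact wget options are this sketch's choice.
download_hotlist() {
  hotlist="$1"
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    # --mirror recurses with timestamping; --no-parent keeps
    # wget from wandering above the Space's root.
    wget --mirror --no-parent --adjust-extension "$url"
  done < "$hotlist"
}

# Usage: download_hotlist "wls 0001-1000.txt"
```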
<br />
Due to insufficient time and planning, these scripts don't download any off-site dependencies - most of what you download will be HTML/text. The upside is that it compresses nicely.<br />
<br />
These scripts will just spit out files in the working directory, so you probably want to place them in ~/wls or something before executing them.<br />
<br />
Once you have finished downloading a hotlist, please update your details in the above table and compress all the Spaces into a single archive, along with a copy of your hotlist. 7-Zip on maximum compression should be able to get them down to ~10% of their original size.<br />
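As a concrete example of that packing step, assuming the downloaded Spaces sit in a single directory (names here are illustrative; the wiki's 7-Zip suggestion corresponds to running 7z with maximum compression, and tar+gzip is shown as a portable stand-in):

```shell
#!/bin/sh
# Pack a finished hotlist and its downloaded Spaces into a single
# archive for upload, as the instructions above suggest. Directory
# and file names are illustrative. The wiki recommends 7-Zip at
# maximum compression (7z a -mx=9 ...); tar+gzip is used here as
# a portable stand-in.
pack_hotlist() {
  spaces_dir="$1"   # directory of downloaded Spaces
  hotlist="$2"      # the hotlist text file itself
  archive="$3"      # output name, e.g. wls_0001-1000.tar.gz
  tar -czf "$archive" "$spaces_dir" "$hotlist"
}

# Usage: pack_hotlist spaces "wls 0001-1000.txt" wls_0001-1000.tar.gz
```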
<br />
After compressing your hotlist, you can upload it to [[User:Underscor|Underscor]]'s FTP for temporary storage. Get the FTP details from him or Auguste. We still need to find some permanent storage to move everything to.<br />
<br />
== Phase 3: Storage ==<br />
TBA.<br />
<br />
== Other Tools ==<br />
Though Perl/Wget is the recommended method for archiving Spaces, there are a couple of other tools available.<br />
<br />
=== HTTrack (graphic version)===<br />
I will explain the procedure for downloading one or more Spaces using the HTTrack graphical version (called WinHTTrack on Windows and WebHTTrack on Linux).<br />
<br />
I assume that the reader is familiar with the use of WinHTTrack (or WebHTTrack), so I'll just explain what you need to configure (in the program's options panel) to download a Space from Windows Live Spaces. If you do not know how to use this program, you can check [http://www.kitamuracomputers.com/tidelog/?p=615 this tutorial] (in English) or [http://www.manueldelafuente.com/2009/10/httrack-posible-solucion-la.html this one] (in Spanish).<br />
<br />
The following lines must be added in the "Scan Rules" section:<br />
+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar<br />
+*.7z<br />
+*.pdf +*.doc +*.mid +*.3gp +*.djvu +*.amr +*.mp4 +*.ogg +*.ogv +*.ogm<br />
+*.mov +*.mpg +*.mpeg +*.avi +*.asf +*.mp3 +*.mp2 +*.rm +*.wav +*.vob +*.qt +*.vid +*.ac3 +*.wma +*.wmv<br />
+*.zip +*.tar +*.tgz +*.gz +*.rar +*.z<br />
+*.arj +*.dar +*.lzh +*.lz +*.lza +*.arc<br />
+*.gif +*.jpg +*.png +*.tif +*.bmp<br />
-*.entry#comment<br />
+*.profile.live.com/Lists/*<br />
+*.byfiles.storage.live.com/*<br />
+*.photos.live.com<br />
+*.spaces.live.com<br />
<br />
* Lines 1 to 7 indicate which file types are downloaded from a Space, if the program finds any (these lines can be modified to suit the user).<br />
* Line 8 is needed because the program otherwise tries to capture the comments of every post on a Windows Live Spaces blog, which generates errors (in addition to wasting time while exploring a site).<br />
* Lines 9 and 12 are used to capture the Spaces on the "friends" list of the user whose Space is being captured (these lines are optional).<br />
* Lines 10 and 11 capture the files and photos that the user may have uploaded.<br />
<br />
I'm not sure the data on *.photos.live.com will continue to exist after Windows Live Spaces is shut down, so I took the opportunity to save any photos there as well. If you don't want to save photos, that line is optional.<br />
<br />
Then add in the field Browser "Identity" (from the section Browser ID) the following User Agent:<br />
<pre>Googlebot/2.1 (+ http://www.googlebot.com/bot.html)</pre><br />
And finally in the section "Spider" select the option "no robots.txt rules".<br />
<br />
Note that if you download thousands of Spaces in a single project, you should also disable the "Create Log files" option in the "Log files, Index, Cache" section; otherwise, the logs can take up tens of GB of hard disk space.<br />
<br />
=== LSSaver ===<br />
''Some of the descriptions in this section were taken from http://www.softsea.com/review/LSSaver.html''<br />
<br />
LSSaver is freeware for Windows that saves a Windows Live Spaces blog to your local disk. It saves useful information such as the blog title, content and comments, and can also save the pictures included in the blog.<br />
<br />
LSSaver is very simple to use; its operation is as follows:<br />
* First, enter a Windows Live Spaces username.<br />
* Then, click the "Get" button to retrieve all blog entries. This operation may take several minutes, depending on the number of entries in the blog and the speed of your connection. As each blog entry is retrieved, its title appears in the tree on the left side of the window. Wait until all titles have been retrieved; you can then browse the titles by folding/unfolding the tree and check those you want to save. Once a blog entry is checked, its content appears on the right side of the window; check all the entries you want to save and wait until all of them have appeared.<br />
* To save the selected entries, simply click the Save button. A file selection window will open; choose where the files should be saved, enter a file name and click Save. After a while, all the selected entries are saved as a single HTML file, which you can open with a browser.<br />
<br />
The program works as it should, but some details differentiate it from an ordinary website downloader:<br />
*As explained above, when the program saves a blog, all the articles (and comments) are crammed into a single HTML file (which can become a problem if the blog has a lot of content).<br />
*Images are stored under names like 000001, 000002, etc., which prevents finding the original on the Internet (this refers to images on external sites linked from a blog) or recognizing the file format.<br />
<br />
== Useful links ==<br />
*In English:<br />
**[http://ezinearticles.com/?Windows-Live-Spaces-Officially-Closed Windows Live Spaces Officially Closed]<br />
**[http://techie-buzz.com/tech-news/windows-live-spaces-wordpress-migration.html Windows Live Spaces To Shut Down, Move 30 Million Users To WordPress.Com]<br />
**[http://www.liveside.net/2011/02/21/windows-live-spaces-to-close-march-16th-remember Windows Live Spaces to close March 16th, remember?]<br />
**[http://www.darrenstraight.com/blog/2011/03/13/your-windows-live-space-will-close-on-16-march-3-days-left Your Windows Live Space will close on 16 March – 3 days left]<br />
*In Spanish:<br />
**[http://www.danisaur.es/2010/09/30/microsoft-cierra-windows-live-spaces/ Microsoft cierra Windows Live Spaces]<br />
**[http://grupogeek.com/2010/10/01/microsoft-cierra-windows-live-spaces-y-transfiere-a-sus-usuarios-a-wordpress/ Microsoft cierra Windows Live Spaces y transfiere a sus usuarios a WordPress]<br />
**[http://tecnokadosh.abbaproducciones.cl/2010/10/1612 Windows Live Spaces se cierra]<br />
**[http://solucionok.blogspot.com/2010/10/windows-live-spaces-llega-su-fin-y.html Solucion OK: Windows Live Spaces llega a su fin y continúa con WordPress.com]<br />
**[http://mynetx.es/5275/recordatorio-windows-live-spaces-cerrara-pronto Recordatorio: Windows Live Spaces cerrará pronto]<br />
**[http://pastehtml.com/view/1dhf1ez.html Email de Windows Live que le llega a cada usuario con un Space activo]<br />
<br />
== References ==<br />
<references/></div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Talk:Urlteam&diff=3306Talk:Urlteam2011-03-20T17:42:23Z<p>Jeroenz0r: moved Talk:Urlteam to Talk:URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>#REDIRECT [[Talk:URLTeam]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Talk:URLTeam&diff=3305Talk:URLTeam2011-03-20T17:42:23Z<p>Jeroenz0r: moved Talk:Urlteam to Talk:URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>== Regarding archiving ==<br />
<br />
Just randomly requesting TinyURLs like you propose will get you banned since you are making many requests for non-existent TinyURLs. We do allow bots to crawl TinyURLs, but only if they are crawling TinyURLs that exist which they pulled from whatever source they are crawling.<br />
<br />
Kevin "Gilby" Gilbertson<br />
<br />
TinyURL, Founder<br />
<br />
http://tinyurl.com<br />
<br />
== A Problem Easily Solved ==<br />
<br />
Just provide for us an excel spreadsheet in the form of:<br />
<br />
tinyurl ID | full URL<br />
<br />
And scraping won't be necessary. Up for it?<br />
<br />
--[[User:Jscott|Jscott]] 20:25, 4 December 2010 (UTC)<br />
<br />
:I e-mailed the TinyURL owner and he [http://i55.tinypic.com/j5bia9.jpg replied] with that.<br />
:<br />
:[[User:Zachera|Zachera]] 00:06, 11 December 2010 (UTC)</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Urlteam&diff=3304Urlteam2011-03-20T17:42:22Z<p>Jeroenz0r: moved Urlteam to URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>#REDIRECT [[URLTeam]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3303URLTeam2011-03-20T17:42:22Z<p>Jeroenz0r: moved Urlteam to URLTeam:&#32;Capitalization is important. Lets use URLTeam</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
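This redirect step is exactly what a scraper records: request the short URL, read the HTTP Location header, and store the pair. A minimal sketch of that idea (hypothetical code, not one of the project's actual tools; the injected <code>fetch</code> callback is an assumption so the logic can be shown without network access):<br />

```python
from urllib.parse import urljoin

def record_redirect(short_url, fetch):
    """Resolve one short URL and return (short_url, long_url), or None.

    `fetch` performs a single HTTP request WITHOUT following redirects
    and returns (status_code, location_header). It is injected here
    (a hypothetical interface) so the logic stays network-free.
    """
    status, location = fetch(short_url)
    if status in (301, 302, 303, 307) and location:
        # Some services send relative Location headers; resolve them
        # against the short URL before storing the pair.
        return short_url, urljoin(short_url, location)
    return None  # non-existent code, or the service did not redirect

# Example with a stand-in fetcher:
fake_fetch = lambda url: (301, "http://example.com/long/page")
pair = record_redirect("http://tinyurl.com/abcdef", fake_fetch)
# pair == ("http://tinyurl.com/abcdef", "http://example.com/long/page")
```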
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 URLs. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
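The base-36/base-62 probing mentioned above amounts to guessing codes at random from the service's code alphabet. A sketch of that idea (illustrative only; the names here are my assumptions, not Monkeyshines' actual API):<br />

```python
import random
import string

BASE36 = string.digits + string.ascii_lowercase  # 0-9, a-z
BASE62 = BASE36 + string.ascii_uppercase         # 0-9, a-z, A-Z

def random_code(length, alphabet=BASE62):
    """Pick one random short-URL code to probe."""
    return "".join(random.choice(alphabet) for _ in range(length))

# A scraper would then request http://<service>/<code> and keep the
# target URL if the code exists. Most random codes are misses, which
# is why some services (e.g. TinyURL) ban heavy random probing.
code = random_code(6, BASE36)
```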
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
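The redirect format promised in the letter follows a simple naming scheme: the shortener's hostname becomes a subdomain of urlte.am, and the code path is kept. A hypothetical helper illustrating that scheme (not part of any actual URLTeam tool):<br />

```python
from urllib.parse import urlparse

def urlteam_mirror(short_url):
    """Rewrite a short URL into the urlte.am redirect form shown above,
    e.g. http://urlx.org/av3 -> http://urlx.org.urlte.am/av3."""
    parts = urlparse(short_url)
    # Hostname becomes a subdomain of urlte.am; the short code is preserved.
    return "http://{0}.urlte.am{1}".format(parts.netloc, parts.path)
```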
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
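Status entries like "00-zz, 000-zzz" in the table above describe incremental scraping: for a sequential shortener, simply enumerate every possible code of each length in order. A sketch of that enumeration (illustrative, not the actual scraper code):<br />

```python
import itertools
import string

# Case-insensitive services (e.g. adjix.com) use digits + lowercase only.
ALPHABET = string.digits + string.ascii_lowercase

def codes(length, alphabet=ALPHABET):
    """Yield every code of the given length in order: '00', '01', ..., 'zz'."""
    for combo in itertools.product(alphabet, repeat=length):
        yield "".join(combo)

# "00-zz" means all 36**2 == 1296 two-character codes; "000-zzz" adds
# the 36**3 three-character codes, and so on.
```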
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shortlinks.co.uk - Working again.<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us - Always reports that the URL is malformed<br />
* shrt.st - Appears incremental: http://shrt.st/vpz<br />
* simurl.com - Doesn't appear guessable: http://simurl.com/panpes<br />
* shorl.com - Doesn't appear guessable: http://shorl.com/tisikestibahu<br />
* smarturl.eu / joturl.com / zip.sm - Doesn't appear guessable, HTML redirect.<br />
* snipr.com - Appears incremental: http://snipr.com/27nvst http://snipr.com/27nvtt<br />
* snipurl.com - See above ^<br />
* snurl.com - See above ^^<br />
* surl.co.uk - Many shortening options.<br />
* tighturl.com - Appears incremental: http://tighturl.com/30xu http://tighturl.com/30xv<br />
* tiny.cc - Appears non-incremental<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com / twurl.nl - Appears incremental<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
* 1link.in - Website dead<br />
* canurl.com - Website dead<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* go2cut.com - Website dead<br />
* lnkurl.com - Website dead<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
<br />
==== Hueg list ====<br />
[http://code.google.com/p/shortenurl/wiki/URLShorteningServices]<br />
<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3282URLTeam2011-03-19T20:52:07Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p><br />
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 URLs. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, the data is still pending, but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
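For the incremental services in the table (adjix.com, rod.gs, biglnk.com), the whole code space can simply be walked in order, which is what ranges like ''00-zz'' and ''000-ZZZ'' refer to. A minimal sketch of such an enumerator, assuming a lowercase base-36 alphabet as used for adjix.com (the function name is hypothetical):

```python
from itertools import product
import string

def code_range(length, alphabet=string.digits + string.ascii_lowercase):
    """Yield every code of the given length in lexicographic order,
    e.g. length=2 yields 00, 01, ..., 0z, 10, ..., zz."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)
```

A scraper would request each generated code against the service and record the redirect target; case-sensitive services like biglnk.com would pass a base-62 alphabet instead.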
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a reCAPTCHA to get to the linked site, and Avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H / http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shortlinks.co.uk - Working again.<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us - Always reports that the URL is malformed<br />
* shrt.st - Appears incremental: http://shrt.st/vpz<br />
* simurl.com - Doesn't appear guessable: http://simurl.com/panpes<br />
* shorl.com - Doesn't appear guessable: http://shorl.com/tisikestibahu<br />
* smarturl.eu / joturl.com / zip.sm - Doesn't appear guessable, HTML redirect.<br />
* snipr.com - Appears incremental: http://snipr.com/27nvst http://snipr.com/27nvtt<br />
* snipurl.com - See above ^<br />
* snurl.com - See snipr.com above ^^<br />
* surl.co.uk - Many shortening options.<br />
* tighturl.com - Appears incremental: http://tighturl.com/30xu http://tighturl.com/30xv<br />
* tiny.cc - Appears non-incremental<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com / twurl.nl - Appears incremental<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
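Resolving entries like nutshellurl.com above means following a chain of 301s (the short URL 301s to a redirector script, which 301s again to the destination). The chain-walking logic can be sketched with a pluggable `fetch` callback so it works with any HTTP client; `fetch` here is a hypothetical helper that returns a response's Location header, or None for a non-redirect:

```python
def follow_redirects(url, fetch, max_hops=5):
    """Walk a chain of redirects until a final, non-redirecting URL.

    `fetch(url)` must return the Location header of a redirect response,
    or None when the response is not a redirect.
    """
    for _ in range(max_hops):
        nxt = fetch(url)
        if nxt is None:
            return url
        url = nxt
    raise RuntimeError("redirect loop or chain longer than %d hops" % max_hops)
```

With a real HTTP client, `fetch` would issue a HEAD request with automatic redirect-following disabled and read the `Location` header itself.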
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
==== Hueg list ====<br />
[http://code.google.com/p/shortenurl/wiki/URLShorteningServices]<br />
<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking timebomb. If they go away, get hacked or sell out millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archiveteam is here to help with their Urlteam subcommitee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting urls, it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls, or have it randomly try either base-36 or base-62 URLs. With it, [[User:Mrflip]] gathered about 6M valid URLs pulled from twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
==== Hueg list ====<br />
[http://code.google.com/p/shortenurl/wiki/URLShorteningServices]<br />
<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3280URLTeam2011-03-19T20:20:22Z<p>Jeroenz0r: /* "Official" shorteners */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking timebomb. If they go away, get hacked or sell out millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archiveteam is here to help with their Urlteam subcommitee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting urls, it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls, or have it randomly try either base-36 or base-62 URLs. With it, [[User:Mrflip]] gathered about 6M valid URLs pulled from twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Not a shortener anymore.<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3279URLTeam2011-03-19T20:12:12Z<p>Jeroenz0r: /* Tools */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in Ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try either base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
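The random-probing strategy described above (trying base-36 or base-62 codes while a cache suppresses repeats) can be sketched as follows. This is an illustrative sketch, not Monkeyshines code; the cache here is a plain in-memory set:<br />

```python
import random
import string

BASE36 = string.digits + string.ascii_lowercase  # 0-9, a-z
BASE62 = string.digits + string.ascii_letters    # 0-9, a-z, A-Z

def random_code(length, alphabet=BASE62):
    """Pick a random short-URL code to probe."""
    return "".join(random.choice(alphabet) for _ in range(length))

def probe(seen, length=5, alphabet=BASE36):
    """Read-through cache: only hand out codes we have not tried yet."""
    while True:
        code = random_code(length, alphabet)
        if code not in seen:
            seen.add(code)
            return code

seen = set()
codes = [probe(seen, length=4) for _ in range(10)]
```

In a real scraper the set would be a persistent lookup cache shared between machines, as the description above notes.<br />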
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
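The incremental ranges in the table above (e.g. adjix.com's 00-zz and 000-zzz) can be enumerated exhaustively. A minimal sketch, not taken from any of the scrapers listed; pick the alphabet to match whether the service is case-sensitive:<br />

```python
from itertools import product
import string

def code_range(length, case_sensitive=False):
    """Yield every code of the given length in lexicographic order."""
    alphabet = string.digits + string.ascii_lowercase
    if case_sensitive:
        alphabet = string.digits + string.ascii_uppercase + string.ascii_lowercase
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

# The case-insensitive "00-zz" range covers 36**2 codes.
two_char = list(code_range(2))
```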
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=TinyBack&diff=3278TinyBack2011-03-19T18:52:15Z<p>Jeroenz0r: </p>
<hr />
<div>'''TinyBack''' is a link-shortener scraper written in Ruby. The pack includes a few tools: it can scrape, remove URLs that point to a not-found page, sort in alphabetical order, and remove duplicates. It was developed by [[User:Soult]] for the [[urlteam]] project.<br />
<br />
== Scraping ==<br />
TinyBack can chop large ranges into smaller ones and request them efficiently in a random order. It has good logging functionality, making error analysis after a crash easy.<br />
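The range-chopping idea can be sketched like this; it is an illustration under stated assumptions, not TinyBack's actual Ruby code, and the chunk size is arbitrary:<br />

```python
import random

def chunks(start, stop, size):
    """Split a large numeric code range into smaller work units."""
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]

def shuffled_work(start, stop, size, seed=None):
    """Return the work units in random order, so requests are
    spread across the keyspace instead of hammering it sequentially."""
    units = chunks(start, stop, size)
    random.Random(seed).shuffle(units)
    return units

units = shuffled_work(0, 1000, 100, seed=1)
```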
<br />
{{Stub Category|article=[[urlteam]]|newstub=urlteam|category=Urlteam}}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=Template:Stub&diff=3277Template:Stub2011-03-19T18:50:37Z<p>Jeroenz0r: Created page with '{{asbox | image = | pix = | subject = | article = article | qualifier = | category = stubs | tempsort = no | name = Template:Stub }}'</p>
<hr />
<div>{{asbox<br />
| image = <br />
| pix = <br />
| subject = <br />
| article = article<br />
| qualifier = <br />
| category = stubs<br />
| tempsort = no<br />
| name = Template:Stub<br />
}}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=TinyBack&diff=3276TinyBack2011-03-19T18:45:57Z<p>Jeroenz0r: </p>
<hr />
<div>'''TinyBack''' is a link-shortener scraper written in Ruby. The pack includes a few tools: it can scrape, remove URLs that point to a not-found page, sort in alphabetical order, and remove duplicates. It was developed by [[User:Soult]] for the [[urlteam]] project.<br />
<br />
== Scraping ==<br />
TinyBack can chop large ranges into smaller ones and request them efficiently in a random order. It has good logging functionality, making error analysis after a crash easy.<br />
<br />
{{Stub article=[[urlteam]]}}</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=TinyBack&diff=3275TinyBack2011-03-19T18:42:07Z<p>Jeroenz0r: Created page with ''''TinyBack''' is a link-shortner scraper written in Ruby. This pack has a few tools included, it can scrape, remove URLs that link to a not found page, sort on alphabetic order …'</p>
<hr />
<div>'''TinyBack''' is a link-shortener scraper written in Ruby. The pack includes a few tools: it can scrape, remove URLs that point to a not-found page, sort in alphabetical order, and remove duplicates. It was developed by [[User:Soult]] for the [[urlteam]] project.<br />
<br />
== Scraping ==<br />
TinyBack can chop large ranges into smaller ones and request them efficiently in a random order. It has good logging functionality, making error analysis after a crash easy.<br />
<br />
WP:IDEALSTUB</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=3274URLTeam2011-03-19T18:30:54Z<p>Jeroenz0r: /* Tools */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[TinyBack]] (written in Ruby by [[User:Soult]])<br />
* [[User:Chronomex]] wrote his own Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try either base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=User:Jeroenz0r&diff=3273User:Jeroenz0r2011-03-19T18:26:04Z<p>Jeroenz0r: </p>
<hr />
<div>==Current project:==<br />
* [[Urlteam]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2978URLTeam2011-03-11T16:11:43Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, the Archive Team is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try either base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com - Website dead<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com - Website dead<br />
* ilix.in - HTML redirect<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H / http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect; doesn't appear guessable, probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
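Several of the hosts above (nutshellurl.com, for example) don't 301 straight to the destination but bounce through an intermediate redirector, so a scraper has to follow the whole chain and guard against loops. A minimal sketch of that logic, with a made-up redirect table standing in for real HTTP responses:

```python
def resolve(url, redirects, max_hops=10):
    """Follow a chain of redirects (given as a dict) to the final URL.

    Raises if the chain loops or exceeds max_hops, as a real scraper
    should when a shortener misbehaves.
    """
    seen = set()
    for _ in range(max_hops):
        if url not in redirects:
            return url  # no further redirect: destination reached
        if url in seen:
            raise ValueError("redirect loop at " + url)
        seen.add(url)
        url = redirects[url]
    raise ValueError("too many redirects")

# Hypothetical double-301 chain, like the nutshellurl.com behaviour noted above.
chain = {
    "http://nutshellurl.com/abc": "http://nutshellurl.com/redirect?id=abc",
    "http://nutshellurl.com/redirect?id=abc": "http://example.com/long-page",
}
```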
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2977URLTeam2011-03-11T15:54:03Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, the Archiveteam is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to scrape URL shortening services efficiently -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
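The random base-36/base-62 probing with a read-through cache that Monkeyshines uses can be sketched roughly like this (the alphabets and code-length choices here are illustrative assumptions, not Monkeyshines' actual configuration):

```python
import random
import string

# Assumed code alphabets; which one a given shortener uses varies.
BASE36 = string.digits + string.ascii_lowercase
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def random_codes(n, length, alphabet=BASE62, seen=None, rng=random):
    """Generate n distinct random codes, skipping any already tried.

    `seen` acts as the read-through cache: codes in it are never
    emitted again, so multiple batches never re-request a URL.
    """
    seen = set() if seen is None else seen
    out = []
    while len(out) < n:
        code = "".join(rng.choice(alphabet) for _ in range(length))
        if code not in seen:
            seen.add(code)
            out.append(code)
    return out
```

Sharing the same `seen` set (or a networked equivalent) between scraper processes is what lets several machines probe the same service without duplicating work.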
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after an IP ban (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/3 (2011-02-15)<br />
| non-sequential<br />
|-<br />
| [http://goo.gl goo.gl]<br />
| ??<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 354,527,352<br />
| [[User:Chronomex]]/[[User:Soult]]<br />
| probably got about 95% before switch to non-sequential<br />
| now non-sequential, new software version added crappy rate limiting<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| resolving still works despite the announced plan to shut down completely at the end of 2010 (2011-02-15); whoever owns that thing is a major pain in the ass<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
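As the comments above note, TinyURL bans IPs that request too many non-existing shorturls and rod.gs can't keep up with the request volume, so any scraper needs throttling. A minimal sketch of a fixed-interval throttle (real scrapers would also back off on errors or bans):

```python
import time

class Throttle:
    """At most one request per `interval` seconds.

    Clock and sleep functions are injectable so the behaviour can be
    tested without real waiting.
    """
    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        """Block until it is safe to issue the next request."""
        now = self.clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now
```

A scraper loop would call `throttle.wait()` before each HTTP request; the interval per service is a tuning choice.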
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2436URLTeam2011-02-13T19:19:38Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, the Archiveteam is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to scrape URL shortening services efficiently -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after an IP ban (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999. Currently doing 9999-j000<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - Techcrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=User:Jeroenz0r&diff=2434User:Jeroenz0r2011-02-12T16:19:13Z<p>Jeroenz0r: </p>
<hr />
<div>==Current project:==<br />
* [[TinyURL]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2433URLTeam2011-02-12T14:09:56Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, the Archiveteam is here to help with its Urlteam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to scrape URL shortening services efficiently -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after an IP ban (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
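Several entries in the table are marked sequential or incremental, which is what makes brute enumeration practical: count through the code space in order and record where each code redirects. A minimal sketch of that approach in Python — the base-62 alphabet, code length, and `base_url` are illustrative assumptions, not any particular service's real parameters:

```python
import itertools
import string
import urllib.error
import urllib.request

# 0-9, a-z, A-Z: the base-62 space that case-sensitive shorteners draw from.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def codes(length):
    """Yield every short code of a given length, in sequential order."""
    for combo in itertools.product(ALPHABET, repeat=length):
        yield "".join(combo)

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None  # keep the 301 itself; its Location header is the payload

def resolve(base_url, code):
    """HEAD one short code and return where it redirects, or None for a miss."""
    opener = urllib.request.build_opener(_NoRedirect())
    try:
        resp = opener.open(
            urllib.request.Request(base_url + code, method="HEAD"), timeout=10
        )
    except urllib.error.HTTPError as err:
        resp = err  # a 3xx or a 404 both land here; the headers tell us which
    return resp.headers.get("Location")
```

For a case-insensitive service such as adjix.com, shrinking `ALPHABET` to digits plus lowercase (base-36) covers the whole space with far fewer requests.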
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
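Notes above such as nutshellurl.com's double 301 and 6url.com's HTML redirect mean a single shortlink can take several hops before the real destination. This sketch records each hop; `get_location` is a hypothetical callable standing in for whatever HTTP layer returns the Location header (or None at the final stop):

```python
import urllib.parse

def follow_chain(url, get_location, max_hops=5):
    """Walk a redirect chain hop by hop, returning every URL visited.

    get_location maps a URL to its Location header, or None when the
    URL no longer redirects (i.e. we have reached the destination).
    """
    hops = [url]
    for _ in range(max_hops):
        location = get_location(hops[-1])
        if location is None:
            break
        # Location headers may be relative; resolve against the current hop.
        hops.append(urllib.parse.urljoin(hops[-1], location))
    return hops
```

Archiving the whole chain, not just the last stop, preserves the intermediate redirector URLs too.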
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2432URLTeam2011-02-12T14:09:53Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services convert long URLs into short ones on their own domain; when a visitor opens the short URL, their browser is redirected to the long one.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] helps with scraping<br />
* [[User:Jeroenz0r]] helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. A read-through cache prevents re-requesting URLs and lets multiple scrapers run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
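The read-through cache mentioned for Monkeyshines is a simple but important pattern when several scrapers share one code space. A sketch of the idea, using SQLite purely as stand-in shared storage (Monkeyshines' actual backend differs):

```python
import sqlite3

class UrlCache:
    """Read-through cache: fetch a code only if no scraper has stored it yet.

    SQLite is illustrative storage here; the point is the lookup-then-fetch
    pattern, not the backend.
    """

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls (code TEXT PRIMARY KEY, target TEXT)"
        )

    def lookup(self, code, fetch):
        row = self.db.execute(
            "SELECT target FROM urls WHERE code = ?", (code,)
        ).fetchone()
        if row:                  # cache hit: another scraper already resolved it
            return row[0]
        target = fetch(code)     # cache miss: hit the shortener exactly once
        self.db.execute(
            "INSERT OR REPLACE INTO urls (code, target) VALUES (?, ?)",
            (code, target),
        )
        self.db.commit()
        return target
```

With the cache shared between machines, re-running an interrupted crawl costs nothing for codes already resolved.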
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2431URLTeam2011-02-12T13:26:17Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services convert long URLs into short ones on their own domain; when a visitor opens the short URL, their browser is redirected to the long one.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] helps with scraping<br />
* [[User:Jeroenz0r]] helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. A read-through cache prevents re-requesting URLs and lets multiple scrapers run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new urls can be created, website says it will shut down at the end of 2010, often breaks completely when crawling too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2430URLTeam2011-02-12T13:20:57Z<p>Jeroenz0r: /* Old listhttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services convert long URLs into short ones on their own domain; when a visitor opens the short URL, their browser is redirected to the long one.<br />
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far it has not released a single backup file for download. As always, Archive Team is here to help through its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] helps with scraping<br />
* [[User:Jeroenz0r]] helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services (see the examples/shorturls directory). It scales to tens of millions of saved URLs. A read-through cache prevents re-requesting URLs and lets multiple scrapers run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
<br />
=== Or just ask! ===<br />
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on halt due to being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new URLs can be created; website says it will shut down at the end of 2010; often breaks completely when crawled too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Small work like 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
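For the shorteners marked sequential or incremental in the table above, the keyspace can simply be enumerated in order rather than probed at random. A minimal Python sketch, patterned on the adjix.com 00-zz / 000-zzz ranges (alphabet and helper names are illustrative):<br />

```python
from itertools import product

# Case-insensitive alphabet, as for adjix.com; a case-sensitive service
# (e.g. rod.gs) would add A-Z as well.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def codes(length, alphabet=ALPHABET):
    """Yield every short-code of one length in order: 00, 01, ..., zz."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

def code_ranges(lengths, alphabet=ALPHABET):
    """Chain several ranges: lengths (2, 3) covers 00-zz and 000-zzz."""
    for n in lengths:
        yield from codes(n, alphabet)
```

Each generated code is then appended to the service's base URL and requested; non-existing codes typically return a 404.<br />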
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0rhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=2429URLTeam2011-02-12T13:19:27Z<p>Jeroenz0r: /* New table */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
}}<br />
<br />
'''TinyURL''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the short URL is visited by a consumer, and their web browser is redirected to the long URL.<br />
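Archiving such a mapping means requesting the short URL and capturing the redirect's Location header without following it. A minimal Python sketch (not any of the scrapers listed below; function names are illustrative):<br />

```python
import http.client
from urllib.parse import urlsplit

def redirect_target(status, location):
    """The long URL, if the response was a redirect; otherwise None."""
    return location if status in (301, 302, 303, 307) and location else None

def resolve_short_url(short_url):
    """Fetch a short URL and record where it redirects, without following."""
    parts = urlsplit(short_url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # HEAD keeps traffic small; some services only answer GET properly.
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return redirect_target(resp.status, resp.getheader("Location"))
    finally:
        conn.close()
```

Note that some shorteners (e.g. 6url.com or qurlyq.com below) redirect via HTML or JavaScript rather than an HTTP 301, and need the response body parsed instead.<br />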
<br />
Services like '''TinyURL''' are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://301works.org 301Works.org] claims to archive shorturls, but so far they have not released a single backup file for download. As always, Archive Team is here to help with its URLTeam subcommittee.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
<br />
== Tools ==<br />
* [[User:Chronomex]] wrote his own efficient Perl-based scraper: [http://github.com/chronomex/urlteam]<br />
* [[User:Soult]] did the same in Ruby <br />
* The [http://github.com/mrflip/monkeyshines Monkeyshines algorithmic scraper] has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales to tens of millions of saved URLs. It uses a read-through cache to avoid re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 codes. With it, [[User:Mrflip]] has gathered about 6M valid URLs pulled from Twitter messages so far.<br />
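The random-probing approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the Monkeyshines code; the `ReadThroughCache` and `resolver` names are hypothetical:<br />

```python
import random
import string

# Alphabets for the two code styles mentioned above.
BASE36 = string.digits + string.ascii_lowercase
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def random_code(alphabet, length):
    """Pick a random short-code of the given length."""
    return "".join(random.choice(alphabet) for _ in range(length))

class ReadThroughCache:
    """Resolve each code at most once; repeated lookups hit the cache."""
    def __init__(self, resolver):
        self.resolver = resolver   # e.g. a function doing the HTTP lookup
        self.seen = {}             # code -> long URL (or None if unresolved)

    def lookup(self, code):
        if code not in self.seen:  # only hit the service once per code
            self.seen[code] = self.resolver(code)
        return self.seen[code]
```

Sharing `seen` through an external store (rather than a dict) is what lets several scraper machines use one lookup cache.<br />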
<br />
=== Or just ask! ===<br />
Here's a template that has worked at least once; the data is still pending, but the site owner is gung-ho.<br />
<br />
Try sending an email to the website owner:<br />
<br />
Hello!<br />
<br />
I'm working with Jason Scott of textfiles.org and other members of the<br />
Archive Team.<br />
<br />
Since the recent scare involving http://tr.im/'s announced (and then<br />
retracted) imminent demise, we've been working to archive all the<br />
links from URL shorteners around the Internet.<br />
<br />
If I'm not mistaken, you operate urlx.org. Would you be so kind as to<br />
share with us a copy of your URL database? We'll do our best to<br />
preserve this data forever in a useful way.<br />
<br />
We are already very far along in scraping links from tr.im, but it's<br />
faster (and friendlier!) to contact site owners asking for a copy of<br />
their data than it is to scrape.<br />
<br />
We've got a domain registered, urlte.am, and all links will be<br />
available for redirect in the format:<br />
<br />
http://urlx.org.urlte.am/av3<br />
<br />
If you could help us, that would be excellent!<br />
<br />
Thank you,<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com TinyURL]<br />
| 1,000,000,000<br />
| [[User:Soult]]<br />
| 5-letter codes done, on hold after being banned (2010-12-20)<br />
| non-sequential, bans IP for requesting too many non-existing shorturls<br />
|-<br />
| [http://bit.ly bit.ly]<br />
| 4,000,000,000<br />
| [[User:Soult]]<br />
| about 1/4<br />
| non-sequential<br />
|-<br />
| [http://is.gd is.gd]<br />
| 287,151,326<br />
| [[User:Chronomex]]<br />
| about 1/3 (2010-10-31)<br />
| sequential<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1365 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done (2009-08-14)<br />
| sequential<br />
|-<br />
| litturl.com<br />
| 33695<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 17619 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 18780 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| ?<br />
| [[User:Soult]]<br />
| 5-letter codes finished, 6-letter codes in progress<br />
| no new URLs can be created; website says it will shut down at the end of 2010; often breaks completely when crawled too fast<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz and 0000-9999<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Small work like 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
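For the shorteners marked sequential or incremental in the table above, the keyspace can simply be enumerated in order rather than probed at random. A minimal Python sketch, patterned on the adjix.com 00-zz / 000-zzz ranges (alphabet and helper names are illustrative):<br />

```python
from itertools import product

# Case-insensitive alphabet, as for adjix.com; a case-sensitive service
# (e.g. rod.gs) would add A-Z as well.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def codes(length, alphabet=ALPHABET):
    """Yield every short-code of one length in order: 00, 01, ..., zz."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

def code_ranges(lengths, alphabet=ALPHABET):
    """Chain several ranges: lengths (2, 3) covers 00-zz and 000-zzz."""
    for n in lengths:
        yield from codes(n, alphabet)
```

Each generated code is then appended to the service's base URL and requested; non-existing codes typically return a 404.<br />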
<br />
=== Old list<ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref> ===<br />
List last updated 2009-08-14.<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect<br />
* ad.vu - mirror of adjix.com<br />
* biglnk.com<br />
* budurl.com - Appears non-incremental<br />
* canurl.com<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* dwarfurl.com - Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* go2cut.com<br />
* ilix.in<br />
* imfy.us - requires a recaptcha to get to the linked site.<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* lnkurl.com<br />
* memurl.com - Pronounceable. Broken.<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh<br />
* myurl.in - Doesn't appear guessable, is probably bruteforceable: http://myurl.in/lT7z5<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* ow.ly - I can't get it to work.<br />
* plexp.com - Parked.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* poprl.com - Not resolving<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* rod.gs<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab<br />
* shorterlink.com - Parked.<br />
* shortlinks.co.uk - Not resolving<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp<br />
* shrinkurl.us<br />
* shrt.st<br />
* shurl.net<br />
* simurl.com<br />
* shorl.com<br />
* smarturl.eu<br />
* snipr.com<br />
* snipurl.com<br />
* snurl.com<br />
* sn.vc<br />
* starturl.com<br />
* surl.co.uk<br />
* tighturl.com<br />
* timesurl.at<br />
* tiny123.com<br />
* tiny.cc<br />
* tinylink.com<br />
* tobtr.com<br />
* traceurl.com<br />
* tr.im<br />
* tweetburner.com<br />
* twitpwr.com<br />
* twitthis.com<br />
* twurl.nl<br />
* u.mavrev.com<br />
* ur1.ca - Database is downloadable from website directly.<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant.<br />
* urlborg.com<br />
* urlbrief.com<br />
* urlcover.com<br />
* urlcut.com<br />
* urlhawk.com<br />
* url-press.com<br />
* urlsmash.com<br />
* urltea.com<br />
* urlvi.be<br />
* urlx.org - Owner has agreed to share his database<br />
* vimeo.com<br />
* wlink.us<br />
* xaddr.com<br />
* xil.in<br />
* xrl.us - see metamark.net<br />
* xym.kr<br />
* x.se<br />
* yatuc.com<br />
* yep.it<br />
* yweb.com<br />
* zi.ma<br />
* w3t.org<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* amzn.to - Amazon<br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - WordPress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own)<br />
* tcrn.ch - TechCrunch<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
<br />
==== Dead or Broken Shorteners ====<br />
* chod.sk - Appears non-incremental, not resolving<br />
* gonext.org - not resolving<br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Jeroenz0r