https://wiki.archiveteam.org/api.php?action=feedcontributions&user=Asparagirl&feedformat=atomArchiveteam - User contributions [en]2024-03-29T05:16:55ZUser contributionsMediaWiki 1.37.1https://wiki.archiveteam.org/index.php?title=ArchiveBot&diff=30845ArchiveBot2018-08-06T22:44:58Z<p>Asparagirl: </p>
<hr />
<div>[[File:Librarianmotoko.jpg|200px|right|thumb|Imagine Motoko Kusanagi as an archivist.]]<br />
<br />
'''ArchiveBot''' is an [[IRC]] bot designed to automate the archival of smaller websites (e.g. up to a few hundred thousand URLs). You give it a URL to start at, and it grabs all content under that URL, [[Wget_with_WARC_output|records it in a WARC]] file, and then uploads that WARC to ArchiveTeam servers for eventual injection into the [https://archive.org/search.php?query=collection%3Aarchivebot&sort=-publicdate Internet Archive]'s Wayback Machine (or other archive sites).<br />
<br />
== Details ==<br />
<br />
To use ArchiveBot, drop by the IRC channel [http://chat.efnet.org:9090/?nick=&channels=%23archivebot&Login=Login '''#archivebot'''] on EFNet. To interact with ArchiveBot, you [http://archivebot.readthedocs.org/en/latest/commands.html issue '''commands'''] by typing them into the channel. Note that you will need channel operator (<code>@</code>) or voice (<code>+</code>) permissions in order to issue archiving jobs; please ask for assistance or leave a message describing the website you want to archive.<br />
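For example, once you have voice or ops, jobs are typically started and managed with short bang-commands typed into the channel. The commands below are drawn from the command documentation linked above; the URLs and the job identifier are placeholders:<br />
 !archive http://example.com/<br />
 !archiveonly http://example.com/somepage<br />
 !ig abcdef123 forums<br />
 !abort abcdef123<br />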
<br />
The [http://dashboard.at.ninjawedding.org/3 '''dashboard'''] publicly shows the sites being downloaded currently. The [http://archivebot.at.ninjawedding.org:4567/pipelines pipeline monitor station] shows the status of deployed instances of crawlers. The [http://archive.fart.website/archivebot/viewer/ viewer] assists in browsing and searching archives.<br />
<br />
You can also follow [https://twitter.com/archivebot @ArchiveBot] on [[Twitter]]!<ref>Formerly known as [https://twitter.com/atarchivebot @ATArchiveBot]</ref> although its tweets may slightly lag behind the current status of the bot.<br />
<br />
== Components ==<br />
<br />
IRC interface<br />
:The bot listens for commands in the IRC channel and then reports status back on the IRC channel. You can ask it to archive a whole website or a single webpage, check whether a URL has been saved, change the delay time between requests, or add ignore rules to avoid crawling certain web cruft. This IRC interface is collaborative, meaning anyone with permission can adjust the parameters of jobs. Note that the bot isn't a chat bot, so it will ignore you if it doesn't understand a command.<br />
<br />
Dashboard<br />
:The [http://dashboard.at.ninjawedding.org/3 '''ArchiveBot dashboard'''] is a web-based front-end displaying the URLs being downloaded by the various web crawls. Each URL line in the dashboard is categorized by its HTTP status code as a success, warning, or error; warnings and errors are highlighted in yellow or red. The dashboard also provides RSS feeds.<br />
<br />
Backend<br />
:The backend contains the database of all jobs and several maintenance tasks such as trimming logs and posting Tweets on Twitter. The backend is the centralized portion of ArchiveBot.<br />
<br />
Crawler<br />
:The crawler downloads and spiders the website into WARC files. The crawler is the distributed portion of ArchiveBot: volunteers run pipeline nodes connected to the backend, and the backend tells the nodes what jobs to run. Once a crawl job has finished, the pipeline reports back to the backend and uploads the WARC files to the staging server. The whole process on a node is handled by a supervisor script called a pipeline.<br />
<br />
Staging server<br />
:The staging server, known as [[FOS|FOS (Fortress of Solitude)]], is the place where all the WARC files are temporarily uploaded. Once the current batch has been approved, the files will be uploaded to the Internet Archive for consumption by the Wayback Machine.<br />
<br />
ArchiveBot's source code can be found at https://github.com/ArchiveTeam/ArchiveBot. [[Dev|Contributions welcomed]]! Any issues or feature requests may be filed at [https://github.com/ArchiveTeam/ArchiveBot/issues the issue tracker].<br />
<br />
== People ==<br />
<br />
The main server that controls the IRC bot, pipeline manager backend, and web dashboard is operated by [[User:yipdw|yipdw]], although a few other ArchiveTeam members were given SSH access in late 2017. The staging server, [[FOS|FOS (Fortress of Solitude)]], where the data sits for final checks before being moved over to the Internet Archive servers, is operated by [[User:jscott|SketchCow]]. The pipelines are operated by various volunteers around the world. Each pipeline typically runs two or three web crawl jobs at any given time.<br />
<br />
== Volunteer to run a Pipeline ==<br />
As of November 2017, ArchiveBot has again started accepting applications from volunteers who want to set up new pipelines. You'll need to have a machine with:<br />
<br />
* lots of disk space (40 GB minimum / 200 GB recommended / 500 GB atypical)<br />
* 512 MB RAM (2 GB recommended, 2 GB swap recommended)<br />
* 10 Mbps upload/download speeds (100 Mbps recommended)<br />
* long-term availability (2 months minimum)<br />
* always-on unrestricted internet access (absolutely no firewall/proxies/censorship/ISP-injected-ads/DNS-redirection/free-cafe-wifi)<br />
<br />
Suggestion: the $40/month DigitalOcean droplets (4 GB memory / 2 CPUs / 60 GB disk) running Ubuntu work pretty well.<br />
<br />
If you have a suitable server available and would like to volunteer, please review the [https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL.pipeline Pipeline Install] instructions. Then contact ArchiveTeam members [[User:Asparagirl|Asparagirl]], [[User:astrid|astrid]], [[User:JAA|JAA]], [[User:yipdw|yipdw]], or other ArchiveTeam members hanging out in #archivebot, and we can hook you up, adding your machine to the list of approved pipelines, so that it will start processing incoming ArchiveBot jobs.<br />
<br />
=== Caveats ===<br />
As of August 2018, there are a few things you need to be aware of when operating an ArchiveBot pipeline:<br />
<br />
* Please give access to the pipeline for maintenance work when you're away (e.g. holidays, busy IRL) to someone who's around frequently. This is to avoid situations where jobs or pipelines are stuck for weeks or months without anyone being able to intervene.<br />
* Jobs that crash with an error need to be killed manually using <code>kill -9</code>.<br />
* The log files of jobs that are aborted or crash are not uploaded to the Internet Archive. Please keep the temporary <code>tmp-wpull-*.log.gz</code> files in the pipeline directory, rename them so the filename follows the same format as the JSON file (with extension <code>.log.gz</code> instead of <code>.json</code>), and upload them to FOS manually.<br />
** You can find the job ID for these files in the second line.<br />
** Finding the correct filename can be a bit tricky. You can use the viewer or the [https://github.com/JustAnotherArchivist/archivebot-archives archivebot-archives] repository. Keep in mind that the timestamp in the filename should approximately match the one at the beginning of the log file, though there is usually a difference between the two of at least a few seconds (the log file timestamps being later than the filename timestamp).<br />
** Be careful with the filename if there were multiple jobs for the same URL (i.e. the same job ID).<br />
** Here is a public gist on GitHub explaining step by step how to find the proper log file for your crashed or killed job, how to properly rename it, and how to rsync it up to FOS: [https://gist.github.com/Asparagirl/155bd3c8ee4b8ad5ed737e45bcad1a5a]<br />
** Contact [[User:JustAnotherArchivist]] if you need help with this.<br />
* Due to a bug somewhere deep in the network stack, connections get stuck from time to time. This causes jobs to slow down or halt entirely.<br />
** As a workaround, you can use the [https://github.com/JustAnotherArchivist/kill-wpull-connections kill-wpull-connections] script; it requires pgrep, lsof, and gdb.<br />
** In very rare cases, you may need to use [http://killcx.sourceforge.net/ killcx] to close the connections.<br />
* Also due to a bug suspected to be in the network stack, wpull processes sometimes use a lot of RAM (and CPU). If a process uses more than 300 MB continuously, that's likely the case. kill-wpull-connections seems to "fix" this issue, though it takes a while (minutes, rarely even an hour or more) from running the script until the usage actually drops down.<br />
* Make sure that you don't have any <code>search</code> or <code>domain</code> line in <code>/etc/resolv.conf</code>. We've grabbed a number of copies of the websites of OVH and Online.net as a result of such lines and broken <code>http://www/</code> links... (cf. [https://github.com/ArchiveTeam/ArchiveBot/issues/318 this issue on GitHub])<br />
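The manual log-recovery steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names are ours, and the canonical job name (the <code>.json</code> filename without its extension) still has to be looked up via the viewer or the archivebot-archives repository as described above.<br />

```python
# Hypothetical sketch of the manual log-recovery step described above.
# Assumes a crashed job left "tmp-wpull-*.log.gz" in the pipeline
# directory; the canonical name must be found separately.
import gzip
import shutil
from pathlib import Path

def job_ident(log_path):
    """Read the job ID from the second line of a gzipped wpull log."""
    with gzip.open(log_path, "rt", errors="replace") as f:
        next(f)                      # skip the first line
        return f.readline().strip()  # the job ID is on the second line

def rename_log(log_path, canonical_name):
    """Rename tmp-wpull-*.log.gz to <canonical_name>.log.gz, i.e. the
    same format as the job's .json file but with the .log.gz extension."""
    log_path = Path(log_path)
    target = log_path.with_name(canonical_name + ".log.gz")
    shutil.move(str(log_path), str(target))
    return target
```

After renaming, the file would still be uploaded to FOS manually (e.g. with rsync, per the gist linked above).<br />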
<br />
== Installation ==<br />
<br />
Installing the ArchiveBot can be difficult. The [https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL.pipeline Pipeline Install] instructions are online, but are tricky.<br />
<br />
But there is a [https://github.com/ArchiveTeam/ArchiveBot/blob/master/.travis.yml .travis.yml automated install script] for [https://travis-ci.org/ArchiveTeam/ArchiveBot Travis CI] that is designed to test the ArchiveBot. <br />
<br />
Since it's good enough for testing... it's good enough for installation, right? There must be a way to convert it into an installer script.<br />
<br />
== Disclaimers ==<br />
<br />
# Everything is provided on a best-effort basis; nothing is guaranteed to work. (We're volunteers, not a support team.)<br />
# We can decide to stop a job or ban a user if a job is deemed unnecessary. (We don't want to run up operator bandwidth bills and waste Internet Archive donations on costs.)<br />
# We're not Internet Archive. (We do what we want.)<br />
# We're not the Wayback Machine. Specifically, we are not <code>ia_archiver</code> or <code>archive.org_bot</code>. (We don't run crawlers on behalf of other crawlers.)<br />
<br />
Occasionally, we have had to ban blocks of IP addresses from the channel. If you think a ban does not apply to you but you cannot join the #archivebot channel, please join the main #archiveteam channel instead.<br />
<br />
== Bad Behavior ==<br />
<br />
If you are a website operator and you notice ArchiveBot misbehaving, please contact us on #archivebot or #archiveteam on EFnet (see top of page for links).<br />
<br />
ArchiveBot understands [[robots.txt]] (please read the article) but does not obey its directives. However, it does use robots.txt to discover more links, such as sitemaps.<br />
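For illustration, mining robots.txt for sitemap links while ignoring its Disallow directives can be as simple as the following sketch (the function name is ours, and this is not ArchiveBot's actual code):<br />

```python
# Hedged sketch: extract "Sitemap:" URLs from a robots.txt body while
# deliberately ignoring Allow/Disallow directives, in the spirit of the
# behavior described above.
def sitemaps_from_robots(text):
    """Return the list of Sitemap URLs declared in a robots.txt body."""
    urls = []
    for line in text.splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls
```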
<br />
Also, please remember that '''we are not the [[Internet Archive|Internet Archive]]'''.<br />
<br />
== More ==<br />
<br />
Like ArchiveBot? Check out our [[Main_Page|homepage]] and other [[projects]]!<br />
<br />
== Notes ==<br />
<br />
<references/><br />
<br />
{{navigation_box}}</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=Twitter&diff=30252Twitter2018-01-03T23:39:27Z<p>Asparagirl: </p>
<hr />
<div>{{Infobox project<br />
| title = Twitter<br />
| image = Twitter_account_timeline.png<br />
| description = <br />
| URL = http://twitter.com<br />
| project_status = {{online}}<br />
| archiving_status = {{nosavedyet}}<br />
}}<br />
'''Twitter''' is a microblogging service. With each "entry" being 140 characters or less, the ease with which you can track the tiniest details of your life is amazing. The site has become very popular as a result.<br />
<br />
The site is becoming so popular, in fact, that many people are deserting or cutting back on their weblogs to just use the Twitter service for what their weblogging used to fulfill; and with that comes rampant centralization, and with ''that'', greater risk. Back up your tweets!<br />
<br />
== Archives ==<br />
There are currently a few archives (but only partially):<br />
* [http://www.archive.org/details/twitter_cikm_2010 Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape]: almost 10 million tweets<br />
* [http://www.archive.org/details/2011-05-calufa-twitter-sql The May 2011 Calufa Twitter Scrape]: 90+ million tweets from more than 6 million users<br />
* https://archive.org/search.php?query=twitterstream<br />
<br />
The Twitter search API seemingly returns only the latest 7 days' worth of tweets.<br />
<br />
== Backup Tools ==<br />
<br />
* Twitter enables you to [https://twitter.com/settings/account request an archive of all of your tweets from the main settings page], which includes every tweet of yours (thereby bypassing the normal 3,200-tweet API limit). The archive is then emailed to the address linked with the account.<br />
* [https://www.tweetscan.com/data.php Tweetscan Data] downloads your Twitter archive from 12/2007 onward in CSV format (requires Twitter account login/password)<br />
<br />
* [https://github.com/sferik/t t by sferik] is a command-line interface for Twitter that uses the API via an application you create on your account. Not only does it allow easy CSV/JSON export of your own data, but it also allows you to scrape others' tweets. API limits apply, but this tool is '''very''' powerful.<br />
<br />
Twitter automatically resizes uploaded images. To get an image in its original resolution, append <code>:orig</code> after the URL, e.g.:<br />
https://pbs.twimg.com/media/CBAoaU1UwAIUPIc.jpg:orig<br />
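A trivial helper capturing this rule (the function name is hypothetical, not part of any Twitter tool):<br />

```python
# Append ":orig" to a pbs.twimg.com media URL to request the image at
# its original resolution, as described above.
def orig_size(url):
    """Return the full-resolution variant of a Twitter image URL."""
    return url if url.endswith(":orig") else url + ":orig"
```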
<br />
When using [[ArchiveBot]], the following arguments are helpful:<br />
--phantomjs --ignore-sets twitter<br />
It is also '''important to add a trailing slash to the URL''', so it gets each tweet individually, rather than only trying to download the whole timeline.<br />
<br />
* [https://github.com/sixohsix/twitter The Python Twitter API by sixohsix] has some pretty easy to use scripts for archiving Twitter accounts to a TXT file for people who aren't as technically inclined. It can only save the last 3K or so tweets due to inbuilt Twitter limits, though. (Note: the "-o" flag is pretty much required to archive accounts.)<br />
<br />
* '''[https://github.com/DocNow/twarc/ twarc]'''<br />
<br />
* [https://gist.github.com/Asparagirl/e3ee274e4df49230875c880255819d95 Here's a Gist with a step-by-step guide] to getting a long list of a user's tweet status URL's, using a Python program called Tweep.<br />
<br />
=== Scraping ===<br />
<br />
See [[Site exploration#Twitter|Site exploration]] for details.<br />
<br />
== Vital Signs == <br />
<br />
Very stable, probably not going anywhere too soon without warning.<br />
<br />
== Library of Congress ==<br />
<br />
The U.S. Library of Congress announced in April 2010, via its official Twitter account that it will be acquiring the entire archive of Twitter messages back through March 2006.[http://www.readwriteweb.com/archives/twitters_entire_archive_headed_to_the_library_of_c.php] As of 2016-02-23, this archive is still not available, and when/if it does become accessible it will likely be restricted to researchers, rather than the general public.[http://www.politico.com/story/2015/07/library-of-congress-twitter-archive-119698.html] In January 2017, it was announced that the Library of Congress will no longer archive all tweets, just ones from major news stories.<br />
<br />
== External links ==<br />
* http://twitter.com<br />
<br />
{{Navigation box}}<br />
<br />
[[Category:Microblogging services]]</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=Twitter&diff=30251Twitter2018-01-03T23:38:42Z<p>Asparagirl: Add info about Tweep and info about Library of Congress ending tweet archiving</p>
<hr />
<div>{{Infobox project<br />
| title = Twitter<br />
| image = Twitter_account_timeline.png<br />
| description = <br />
| URL = http://twitter.com<br />
| project_status = {{online}}<br />
| archiving_status = {{nosavedyet}}<br />
}}<br />
'''Twitter''' is a microblogging service. With each "entry" being 140 characters or less, the ease with which you can track the tiniest details of your life is amazing. The site has become very popular as a result.<br />
<br />
The site is becoming so popular, in fact, that many people are deserting or cutting back on their weblogs to just use the Twitter service for what their weblogging used to fulfill; and with that comes rampant centralization, and with ''that'', greater risk. Back up your tweets!<br />
<br />
== Archives ==<br />
There are currently a few archives (but only partially):<br />
* [http://www.archive.org/details/twitter_cikm_2010 Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape]: almost 10 millon tweets<br />
* [http://www.archive.org/details/2011-05-calufa-twitter-sql The May 2011 Calufa Twitter Scrape]: 90+ million tweets from more than 6 million users<br />
* https://archive.org/search.php?query=twitterstream<br />
<br />
The Twitter search API seemingly returns only the latest 7 days' worth of tweets.<br />
<br />
== Backup Tools ==<br />
<br />
* Twitter enables you to [https://twitter.com/settings/account request an archive of all of your tweets from the main settings page], which includes every tweet of yours (therefore bypassing the normal 3200 tweet API limit). The archive is then emailed to the address linked with the account.<br />
* [https://www.tweetscan.com/data.php Tweetscan Data] downloads your Twitter archive from 12/2007 onward in CSV format (requires Twitter account login/password)<br />
<br />
* [https://github.com/sferik/t t by sferik] is a command-line interface for Twitter using the API via an application you create on your account. Not only does it allow easy CSV/JSON export of your own data, but it also allows you to scrape others' tweets. API limits apply, but this tool is <b>very</b> powerful.<br />
<br />
Twitter automatically resizes uploaded images. To get an image in its original resolution, append :orig to the URL, e.g.:<br />
https://pbs.twimg.com/media/CBAoaU1UwAIUPIc.jpg:orig<br />
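This URL rewrite is mechanical enough to script. A minimal sketch follows; the function name and the list of size modifiers are illustrative assumptions, not part of any Twitter API:

```python
def original_resolution(url: str) -> str:
    """Append the :orig modifier to a pbs.twimg.com media URL.

    If the URL already carries a size modifier (e.g. :large), replace it.
    The set of modifiers is an assumption based on commonly seen values.
    """
    sizes = (":orig", ":large", ":medium", ":small", ":thumb")
    if url.endswith(sizes):
        url = url.rsplit(":", 1)[0]  # strip the existing modifier
    return url + ":orig"

print(original_resolution("https://pbs.twimg.com/media/CBAoaU1UwAIUPIc.jpg"))
# https://pbs.twimg.com/media/CBAoaU1UwAIUPIc.jpg:orig
```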
<br />
When using [[ArchiveBot]], the following arguments are helpful:<br />
--phantomjs --ignore-sets twitter<br />
It is also '''important to add a trailing slash to the URL''', so it gets each tweet individually, rather than only trying to download the whole timeline.<br />
<br />
* [https://github.com/sixohsix/twitter The Python Twitter API by sixohsix] has some pretty easy-to-use scripts for archiving Twitter accounts to a TXT file for people who aren't as technically inclined. It can only save the last 3K or so tweets due to built-in Twitter limits, though. (Note: the "-o" flag is pretty much required to archive accounts.)<br />
<br />
* '''[https://github.com/DocNow/twarc/ twarc]'''<br />
<br />
* Here's a Gist with a step-by-step guide to getting a long list of a user's tweet status URLs, using a Python program called Tweep: [https://gist.github.com/Asparagirl/e3ee274e4df49230875c880255819d95]<br />
<br />
=== Scraping ===<br />
<br />
See [[Site exploration#Twitter|Site exploration]] for details.<br />
<br />
== Vital Signs == <br />
<br />
Very stable, probably not going anywhere too soon without warning.<br />
<br />
== Library of Congress ==<br />
<br />
The U.S. Library of Congress announced in April 2010, via its official Twitter account, that it would be acquiring the entire archive of Twitter messages back to March 2006.[http://www.readwriteweb.com/archives/twitters_entire_archive_headed_to_the_library_of_c.php] As of 2016-02-23, this archive is still not available, and when/if it does become accessible, it will likely be restricted to researchers rather than the general public.[http://www.politico.com/story/2015/07/library-of-congress-twitter-archive-119698.html] In December 2017, the Library of Congress announced that it would no longer archive all tweets, only selected ones such as those connected to major news stories.<br />
<br />
== External links ==<br />
* http://twitter.com<br />
<br />
{{Navigation box}}<br />
<br />
[[Category:Microblogging services]]</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/nominations&diff=30246INTERNETARCHIVE.BAK/nominations2018-01-02T23:53:17Z<p>Asparagirl: added suggestion</p>
<hr />
<div>== Nominations for IABAK Collections to Save ==<br />
<br />
As the project is taking off, we've started to "wing it" with regard to what to save; to get ahead of that, this will be a wikified list of potential collections to use for future shards. Please link to the collection, and describe why it might be of use. <br />
<br />
To get ahead of one of the likely questions, the reasons a collection might NOT be added yet include:<br />
<br />
* Too many items in it, causing it to spray against dozens of shards<br />
* Too massive (for now), causing a shard to take up a huge amount of space, leaving us stuck forever<br />
* The collection is actually a mirror of another collection elsewhere.<br />
<br />
<br />
== Some Potential Shard Additions ==<br />
<br />
* https://archive.org/details/reclaimtherecords - hundreds of TB of primary source genealogy files and vital records indices, most available nowhere else<br />
* https://archive.org/details/bibliothequesaintegenevieve - rare incunabula -- 327k files, too many for 1 shard<br />
* https://archive.org/details/archiveteam_ancestry - family history -- this is 650 files (3 TB)<br />
* https://archive.org/details/archiveteam-fortunecity - not backed up via torrents like the GeoCities grab, not as huge as some of the other AT projects (2.7 TB)<br />
<br />
== Accepted nominations, in progress ==<br />
<br />
* https://archive.org/details/archiveteam-fire - many great archives of websites -- 313k files, too many for 1 shard. Created SHARD12, which contains all the items from 2011 through 2015, approximately 30% of the total files.<br />
* https://archive.org/details/archivebot - WARCs saved by the ArchiveBot. This is a small number of very large files; looks like it's going to be split into ~30 shards.<br />
<br />
== Accepted nominations, in shards now ==<br />
<br />
* https://archive.org/details/prelinger<br />
* https://archive.org/details/Bali - entire literature of Bali<br />
* https://archive.org/details/jcbmexicoincunables - rare incunabula<br />
* https://archive.org/details/cdbbsarchive - historical software<br />
* https://archive.org/details/prelingerhomemovies - more prelinger<br />
* https://archive.org/details/prelinger_library - more prelinger<br />
* https://archive.org/details/starr - rare old Asian books<br />
* https://archive.org/details/archiveteam-googlegroups - about 1TB of webpages and files from Google Group mailing lists<br />
* https://archive.org/details/googlegroups-part2 - related to archiveteam-googlegroups, ~200-300GB</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=User:Asparagirl&diff=23298User:Asparagirl2015-06-18T04:41:24Z<p>Asparagirl: </p>
<hr />
<div>Hi there, I'm Brooke Schreier Ganz, a web geek and mom living in the Bay Area (formerly in Los Angeles). I'm @Asparagirl on Twitter.</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=Current_Projects&diff=19014Current Projects2014-06-05T19:29:56Z<p>Asparagirl: /* Upcoming projects */</p>
<hr />
<div>__NOTOC__<br />
<br />
== Archive Team recruiting ==<br />
* [[Dev|Want to code for Archive Team? Here's a starting point.]]<br />
<br />
== Warrior based projects ==<br />
* [[URLTeam]]: URL shorteners were a fucking awful idea. IRC channel '''#urlteam'''. ''(Currently broken, coders wanted!)''<br />
* [[Justin.tv]]: Deleting all archived videos on June 8, 2014. IRC channel '''#justouttv'''.<br />
<br />
Help us: '''[[warrior|☞ Download and run your warrior ☜]]'''.<br />
What's on: [http://tracker.archiveteam.org/ online tracker].<br />
<br />
== Manual projects ==<br />
* [[FTP]]: Download all the FTP sites!<br />
* [[WikiTeam]]: permanent effort, [http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server everyone can help] (you choose the size of your downloads).<br />
* [[Puu.sh]]: Expiring inactive files after 1 month; now in continuous mode. IRC Channel '''#pushharder'''.<br />
<br />
== Upcoming projects ==<br />
* Four genealogy-themed websites (and their many sub-domains and message boards) that are being sunsetted by [[Ancestry.com]]<br />
* Saving Verizon customer pages, [http://www.verizon.com/support/residential/internet/fiosinternet/general+support/essentials+and+extras/questionsone/85372.htm shutting down] on September 2014.<br />
* [[MLKSHK]]: Shutting down September 1, 2014. '''IRC Channel #totheyard'''.<br />
* [[Helium]]: 1 million articles to be deleted on December 15, 2014.<br />
<br />
== Recently finished ==<br />
<!-- put projects here that are still in the tracker but not yet deleted so it won't confuse people --><br />
* [[Canv.as]]: Saving the images before they go offline. IRC Channel '''#canvas'''.<br />
* [[Mochi Media]]: Goodbye Flash games. Shanda-acquiree forced to shut down on March 31, 2014. IRC Channel '''#mochibaibai'''.<br />
* [[Dogster|Catster & Dogster]]: Won't be putting communities to sleep on March 3, 2014, but we got a copy anyway. IRC Channel '''#rawdogster'''.<br />
* [[Viddler]]: Won't be deleting personal and non-free account videos permanently. IRC Channel '''#fiddler'''.<br />
* [[My Opera]]: It's all over after 2014-03-03. IRC Channel '''#fatlady'''.<br />
<br />
=== Hiatus / Missed the Mark ===<br />
* [[Bebo]]: Trashed by [[AOL]] and Criterion Capital Partners. Saving the remains. IRC Channel '''#cockandballs'''.<br />
* Saving [[BerliOS]].<br />
* Bolt is imploding, and announced [http://boltagain.ning.com/ the death of their domain and a month left to live.]<br />
* [[Slidecast]] has announced it is going read only at the end of February 2014 and Slidecasts will become Slideshares on April 30, 2014.</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=Ancestry.com&diff=19013Ancestry.com2014-06-05T19:28:47Z<p>Asparagirl: </p>
<hr />
<div> Asparagir> "Ancestry.com announced this morning at 10:00 MT that it is retiring several of its websites. The websites are"<br />
Asparagir> MyFamily.com MyCanvas.com Genealogy.com Mundia.com<br />
<br />
'''Note:''' we are saving four websites (and their many sub-domains) that are *owned* by Ancestry.com and which are being sunsetted. Ancestry.com itself doesn't look like it's going anywhere right now.<br />
<br />
http://www.ancestryinsider.org/2014/06/ancestrycom-announces-retirement-of.html<br />
<br />
http://www.ancestry.com/cs/faq/genealogy-faq<br />
<br />
[[Category:In progress]]</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=ArchiveTeam_Warrior&diff=9255ArchiveTeam Warrior2013-02-03T23:23:22Z<p>Asparagirl: /* Projects */ -- fixed link to the latest torrent for URLTeam</p>
<hr />
<div>[[Image:Archive_team.png|200px|right]]<br />
The ArchiveTeam Warrior is a virtual archiving appliance. You can run it to help with the ArchiveTeam archiving efforts. It will download sites and upload them to our archive.<br />
<br />
It is easy to get started. Download the appliance and run it with your favorite virtualization tool (VirtualBox, VMware etc.).<br />
<br />
Instructions for VirtualBox:<br />
<ol><br />
<li>Install and run [https://www.virtualbox.org/ VirtualBox].</li><br />
<li>Download the [http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova archiveteam-warrior-v2-20121008.ova] file.</li><br />
<li>In VirtualBox, click File > Import Appliance and open the file.</li><br />
<li>Start the appliance - it will automatically update and eventually show a message telling you to visit your browser.</li><br />
<li>Go to [http://127.0.0.1:8001 http://127.0.0.1:8001] and check the settings page.</li><br />
<li>You can choose a username which will show on the leaderboards, and also the number of concurrent tasks the warrior will attempt to run.</li><br />
<li>Sometimes there will be multiple projects on offer - you can choose to follow any single project or the project that the ArchiveTeam is currently working on.</li><br />
<li>You MUST choose a project the first time you start the Warrior, else it won't do anything.</li><br />
</ol><br />
<br />
<br />
----<br />
<br />
<br />
http://archive.org/details/archiveteam-warrior<br />
<br />
== Testing pre-production code ==<br />
<br />
(Don't do this unless you really need or want to.) If you are developing a warrior script, you can test it by switching your warrior from the <code>production</code> branch to the <code>master</code> branch.<br />
<br />
<ol><br />
<li>Start the warrior.</li><br />
<li>Press Alt+F2 and log in with username <code>root</code> and password <code>archiveteam</code>.</li><br />
<li><code>cd /home/warrior/warrior-code</code></li><br />
<li><code>sudo -u warrior git checkout master</code></li><br />
<li><code>reboot</code></li><br />
</ol><br />
<br />
By the same route you can return your warrior to the <code>production</code> branch.<br />
<br />
== Projects ==<br />
<br />
{| class="wikitable"<br />
! Project !! Status !! Began !! Finished !! Result !! Archive Location<br />
|-<br />
| MobileMe || '''Archive Posted''' || April 3, 2012 || Aug 8, 2012 || Success || <br />
[http://archive.org/details/archiveteam-mobileme-hero archive] [http://archive.org/details/archiveteam-mobileme-index index] [http://archive.org/download/archiveteam-mobileme-index/mobileme-20120817.html user lookup]<br />
|-<br />
| Fortune City || '''Archive Posted''' || April 4, 2012 || April 11, 2012 || Partial Success || [http://archive.org/details/archiveteam-fortunecity archive] [http://archive.org/download/test-memac-index-test/fortunecity.html user lookup]<br />
|-<br />
| Tabblo || '''Archive Posted''' || May 23, 2012 || May 26, 2012 || Success || [http://archive.org/details/tabblo-archive archive] [http://archive.org/download/test-memac-index-test/tabblo.html user lookup]<br />
|-<br />
| PicPlz || '''Archive Posted''' || June 3, 2012 || June 15, 2012 || || [http://archive.org/details/archiveteam-picplz archive] [http://archive.org/details/archiveteam-picplz-index index] [http://archive.org/download/archiveteam-picplz-index/picplz-20120823.html user lookup]<br />
|-<br />
| Tumblr (test project) || '''Archive Posted''' || August 9, 2012 || August 19, 2012 || || [http://archive.org/details/archiveteam-tumblr-test] [http://archive.org/details/archiveteam-tumblr-test-warc]<br />
|-<br />
| Cinch.FM || '''Archive Posted''' || August 20, 2012 || August 22, 2012 || Success || [http://archive.org/details/archiveteam-cinch]<br />
|-<br />
| City Of Heroes || '''Archive Posted''' || September 3, 2012 || December 1, 2012 || Success || [http://archive.org/details/archiveteam-city-of-heroes-www www] [http://archive.org/details/archiveteam-city-of-heroes-main forums] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-1 1] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-2 2] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-3 3] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-4 4] [http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5 5]<br />
|-<br />
| Webshots || '''Archive Posted''' || October 4, 2012 || November 18, 2012 || || [http://archive.org/download/webshots-freeze-frame-index/index.html index]<br />
|-<br />
| BT Internet || '''Archive Posted''' || October 10, 2012 || November 2, 2012 || Success || [http://archive.org/details/archiveteam-btinternet archive]<br />
|-<br />
| Daily Booth || '''Archive Posted''' || November 19, 2012 || December 29, 2012 || || [http://archive.org/details/archiveteam_dailybooth] [http://archive.org/download/dailybooth-freeze-frame-index/index.html index]<br />
|-<br />
| Github || '''Archive Posted''' || December 13, 2012 || December 17, 2012 || Success || [http://archive.org/details/github-downloads-2012-12 archive] [http://archive.org/details/archiveteam-github-repository-index-201212 index]<br />
|-<br />
| Yahoo Blogs (Vietnamese) || Downloads Finished || January 8, 2013 || January 19, 2013 || ||<br />
|-<br />
| URLTeam || Active || || || || [http://urlte.am/releases/2013-01-02/urlteam.torrent latest]<br />
|-<br />
| Punchfork || Active || January 11, 2013 || || ||<br />
|-<br />
| weblog.nl || Active || January 19, 2013 || || ||<br />
|-<br />
| Xanga || In Development || January 22, 2013 || || ||<br />
|}<br />
<br />
=== Status ===<br />
:; In Development : a future project<br />
:; Active : start up a Warrior and join the fun; this one is in progress right now<br />
:; Downloads Finished : we've finished downloading the data<br />
:; Archived : the collected data has been properly archived<br />
:; Archive Posted : the archive is available for download<br />
<br />
=== Result ===<br />
:; Success : downloaded all of the data and posted the archive publicly<br />
:; Qualified Success : either we couldn't get all of the data, or the archive can't be made public<br />
:; Failure : the site closed before we could download anything</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=User:Asparagirl&diff=9253User:Asparagirl2013-02-03T07:30:39Z<p>Asparagirl: </p>
<hr />
<div>Hi there, I'm Brooke Schreier Ganz, a web geek and mom living in Los Angeles. I'm @Asparagirl on Twitter.</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=User:Asparagirl&diff=9252User:Asparagirl2013-02-03T07:30:18Z<p>Asparagirl: </p>
<hr />
<div>Hi there, I'm Brooke Schreier Ganz, a web geek and mom living in Los Angeles. I'm on Twitter at @Asparagirl.</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=User:Asparagirl&diff=9251User:Asparagirl2013-02-03T07:29:52Z<p>Asparagirl: Created page with "Hi there, I'm Brooke Schreier Ganz, a web geek and mom living in Los Angeles."</p>
<hr />
<div>Hi there, I'm Brooke Schreier Ganz, a web geek and mom living in Los Angeles.</div>Asparagirlhttps://wiki.archiveteam.org/index.php?title=URLTeam&diff=9250URLTeam2013-02-03T07:09:02Z<p>Asparagirl: /* Dead or Broken */</p>
<hr />
<div>{{Infobox project<br />
| title = Urlteam<br />
| image = Urlteam-logo.png<br />
| description = url shortening was a fucking awful idea<br />
| URL = http://urlte.am<br />
| project_status = {{online}}<br />
| archiving_status = {{in progress}}<br />
| source = https://github.com/ArchiveTeam/urlteam-stuff<br />
| tracker = http://tracker.tinyarchive.org/<br />
| irc = urlteam<br />
}}<br />
<br />
'''TinyURL''', '''bit.ly''' and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.<br />
<br />
Such services are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see [http://en.wikipedia.org/wiki/Link_rot Wikipedia: Link Rot]). [http://www.archive.org/details/301works Archive.org]/301Works is acting as an escrow for URL shortener databases, but it relies on URL shorteners to actually hand over their databases. Even 301Works founding member ''bit.ly'' does not actually share its database, and most other big shorteners don't share theirs either.<br />
<br />
== Who did this? ==<br />
You can join us in our IRC channel: [irc://irc.efnet.org/urlteam #urlteam] on [http://www.efnet.org/ EFNet]<br />
* [[User:Scumola]] started this wiki page<br />
* [[User:Chronomex]] started the Urlteam scraping effort<br />
* [[User:Soult]] Helps with scraping<br />
* [[User:Jeroenz0r]] Helps with scraping (and stalking Soult)<br />
* ... many ArchiveTeam people who run the scrapers<br />
<br />
== 301Work cooperation ==<br />
[[Image:301works logo.jpg|thumb]]<br />
The fine folks at archive.org have provided us with upload permissions to the 301Works archive: [http://www.archive.org/details/301utm http://www.archive.org/details/301utm]. They unfortunately do not want to make the files downloadable, but the same data is in our torrents too, just in a different format (we use tab-delimited, xz-compressed files, while 301works uses comma-delimited uncompressed files).<br />
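The format difference described above amounts to swapping the compression and the delimiter. A minimal sketch of the conversion, assuming a simple two-column shortcode/longurl layout (real dumps may carry different columns):

```python
import csv
import io
import lzma

def tsv_xz_to_csv(data: bytes) -> str:
    """Convert an xz-compressed, tab-delimited dump (the urlteam torrent
    style) into uncompressed comma-delimited text (the 301Works style)."""
    text = lzma.decompress(data).decode("utf-8")
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        writer.writerow(row)
    return out.getvalue()

# Illustrative sample row: a shortcode and its long URL.
sample = lzma.compress(b"abc\thttp://example.com/\n")
print(tsv_xz_to_csv(sample))
```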
<br />
== Tools ==<br />
* [https://github.com/chronomex/urlteam fetcher.pl]: Perl-based scraper by [[User:Chronomex]]<br />
* [https://github.com/soult/tinyback TinyBack]: Python 2.x-based, distributed scraper (also works with the [[Warrior]])<br />
<br />
=== TinyBack ===<br />
The easiest way to help with scraping is to run the Warrior and select the ''URLTeam'' project. You can also run TinyBack outside the warrior, though Python 2.6 or newer is required:<br />
<br />
git clone https://github.com/soult/tinyback<br />
cd tinyback<br />
# Use ./run.py --help for more information on command-line options<br />
./run.py --tracker=http://tracker.tinyarchive.org/v1/ --num-threads=3 --sleep=180<br />
<br />
== URL shorteners ==<br />
=== New table ===<br />
The new table includes shorteners we have already started to scrape.<br />
{| class="sortable wikitable" style="width: auto; text-align: center"<br />
! Name<br />
! Est. number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|-<br />
| [http://tinyurl.com/ Tinyurl.com]<br />
| 1,000,000,000<br />
| [[Warrior]]<br />
| scraping: sequential, <= 6 characters<br />
| new shorturls: non-sequential, 7 characters<br />
|-<br />
| [http://bit.ly/ Bit.ly]<br />
| 4,000,000,000<br />
| [[Warrior]]<br />
| scraping: non-sequential, 6 characters<br />
| new shorturls: non-sequential, 6 characters<br />
|-<br />
| [http://goo.gl Goo.gl]<br />
| ?<br />
| [[User:Scumola]]<br />
| started (2011-03-04)<br />
| goo.gl throttles pulls<br />
|-<br />
| [http://is.gd is.gd]<br />
| 810,264,745 (2013-01-30)<br />
| [[Warrior]]<br />
| scraping: sequential, <= 5 characters<br />
| new shorturls: non-sequential, 6 characters<br />
|-<br />
| [http://ff.im ff.im]<br />
| ?<br />
| [[User:Chronomex]]<br />
|<br />
| only used by FriendFeed, no interface to shorten new URLs<br />
|-<br />
| [http://4url.cc/ 4url.cc]<br />
| 1279 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
|<br />
| dead (2011-02-15)<br />
|-<br />
| litturl.com<br />
| 17096 (2010-04-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
|<br />
| dead (2010-11-18)<br />
|-<br />
| xs.md<br />
| 3084 (2009-08-15)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| url.0daymeme.com<br />
| 14867 (2009-08-14)<ref>http://github.com/chronomex/urlteam</ref><br />
| [[User:Chronomex]]<br />
| done<br />
| dead (2010-11-18)<br />
|-<br />
| [http://tr.im tr.im]<br />
| 1990425<br />
| [[User:Soult]]<br />
| got what we could<br />
| dead (2011-12-31)<br />
|-<br />
| adjix.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Already done: 00-zz, 000-zzz, 0000-izzz.<br />
| case-insensitive, incremental<br />
|-<br />
| rod.gs<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 00-ZZ, 000-2Qc<br />
| case-sensitive, incremental, server can't keep up with all the requests.<br />
|-<br />
| biglnk.com<br />
| ?<br />
| [[User:Jeroenz0r]]<br />
| Done: 0-Z, 00-ZZ, 000-ZZZ<br />
| case-sensitive, incremental<br />
|-<br />
| go.to<br />
| 60000<br />
| [[User:Asiekierka]]<br />
| Done: ~45000 (go.to network links only: [http://64pixels.org/goto_dump.zip goto_dump.zip])<br />
| no codes, only names, google-fu only gives the first 1000 results for each, thankfully most domains have less<br />
|- class="sortbottom"<br />
! Name<br />
! Number of shorturls<br />
! Scraping done by<br />
! Status<br />
! Comments<br />
|}<br />
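For the shorteners marked "sequential" in the table above, scraping means enumerating every possible code in order and following each redirect. A minimal sketch of such an enumerator, assuming a base-36 alphabet (an assumption for illustration; many services are case-sensitive base-62, or use longer codes):

```python
from itertools import count, islice, product
import string

# Alphabet is an assumption: many shorteners use 0-9a-z codes.
ALPHABET = string.digits + string.ascii_lowercase

def short_codes():
    """Yield candidate shortcodes in sequential order:
    0, 1, ..., z, 00, 01, ..., zz, 000, ..."""
    for length in count(1):
        for combo in product(ALPHABET, repeat=length):
            yield "".join(combo)

print(list(islice(short_codes(), 5)))  # ['0', '1', '2', '3', '4']
```

A real scraper would request each code against the shortener, record the Location header of the redirect, and throttle itself to whatever rate the service tolerates.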
<br />
=== Alive ===<br />
<br />
Last verified 2012-12-29. Original list last updated 2009-08-14 <ref>http://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html</ref>.<br />
<br />
* awe.sm<br />
* budurl.com - Appears non-incremental<br />
* cli.gs - Appears non-incremental<br />
* decenturl.com - Not at all easy to scrape.<br />
* dlvr.it<br />
* doiop.com - Appears non-incremental<br />
* easyurl.net - Appears non-incremental: http://easyurl.net/afd2f<br />
* ilix.in - HTML redirect<br />
* jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388<br />
* metamark.net / xrl.us - ? http://xrl.us/bfabog<br />
* myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect<br />
* notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/<br />
* nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.<br />
* pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc<br />
* redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok<br />
* sharedby.co - See vsb.li. Double redirects via USERNAME.sharedby.co/share/XXXXXX<br />
* shorl.com - Doesn't appear guessable: http://shorl.com/tisikestibahu<br />
* shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok<br />
* shrinkurl.us - Always reports that the URL is malformed<br />
* shrt.st - Appears incremental: http://shrt.st/vpz<br />
* simurl.com - Doesn't appear guessable: http://simurl.com/panpes<br />
* smarturl.eu / joturl.com / zip.sm - Doesn't appear guessable, HTML redirect.<br />
* snipr.com / snipurl.com / snurl.com - Appears incremental: http://snipr.com/27nvst http://snipr.com/27nvtt<br />
* surl.co.uk - Many shortening options.<br />
* tighturl.com - Appears incremental: http://tighturl.com/30xu http://tighturl.com/30xv<br />
* tiny.cc - Appears non-incremental<br />
* tweetburner.com / twurl.nl - Appears incremental<br />
* twitthis.com<br />
* u.mavrev.com - Not accepting new urls.<br />
* ur1.ca - Database is downloadable from website directly.<br />
* urlcut.com<br />
* vimeo.com<br />
* vsb.li / links.visibli.com/links/ - The latter uses truncated md5 hex string.<br />
* xrl.us - see metamark.net<br />
* x.se - Cannot resolve, but www.x.se works.<br />
* yatuc.com - Not accepting new urls.<br />
* yep.it<br />
<br />
==== "Official" shorteners ====<br />
* goo.gl - Google<br />
* fb.me - Facebook<br />
* y.ahoo.it - Yahoo<br />
* youtu.be - YouTube<br />
* t.co? - Twitter<br />
* post.ly - Posterous<br />
* wp.me - Wordpress.com<br />
* flic.kr - Flickr<br />
* lnkd.in - LinkedIn<br />
* su.pr - StumbleUpon<br />
* go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)<br />
<br />
===== bit.ly aliases =====<br />
<br />
* amzn.to - Amazon <br />
* binged.it - Bing (bonus points for being longer than bing.com)<br />
* 1.usa.gov - USA Government<br />
* tcrn.ch - Techcrunch<br />
<br />
=== Dead or Broken ===<br />
<br />
* 1link.in - Website dead<br />
* 6url.com - HTML redirect, Error 500<br />
* ad.vu - mirror of adjix.com, application not found<br />
* canurl.com - Website dead<br />
* chod.sk - Appears non-incremental, not resolving<br />
* digg.com - discontinued - [http://about.digg.com/blog/update-diggs-short-url-service]<br />
* dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041<br />
* easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3<br />
* go2cut.com - Website dead<br />
* gonext.org - not resolving<br />
* imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts. DNS fails to resolve. <br />
* ix.it - Not resolving<br />
* jijr.com - Doesn't appear to be a shortener, now parked<br />
* jump.to - dead as of February 1, 2013<br />
* kissa.be - "Kissa.be url shortener service is shutdown"<br />
* kurl.us - Parked.<br />
* lnkurl.com - Website dead<br />
* memurl.com - Pronounceable. Broken.<br />
* miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We're working on it...")<br />
* minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead<br />
* minurl.org - Presently in ERROR 404<br />
* muhlink.com - Not resolving<br />
* myurl.us - cpanel frontend<br />
* nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Taken by squatters<br />
* qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf. Domain parked.<br />
* s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab . Domain parked.<br />
* shortlinks.co.uk - Working again. Maybe not.<br />
* short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp<br />
* shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp . Domain parked.<br />
* traceurl.com - DNS fails to resolve.<br />
* tr.im - "Be back soon!"<br />
* twitpwr.com - Domain parked.<br />
* u.nu - "The shortest URLs. period." Website dead since at least 1st of october 2010 (http://web.archive.org/web/20100104023208/http://u.nu/)<br />
* url9.com - Sequential, alphanumeric. Leading 0s are significant. "The site is working correctly."<br />
* urlborg.com - 404 Not Found.<br />
* urlcover.com - Domain parked.<br />
* urlhawk.com - Domain parked.<br />
* url-press.com - Suspended by web host.<br />
* urlsmash.com - DNS not resolving.<br />
* urltea.com - Dreamhost's coming soon page.<br />
* urlvi.be - Domain parked.<br />
* urlx.org - Owner has agreed to share his database<br />
* w3t.org - 403 Forbidden.<br />
* wlink.us - Domain parked.<br />
* xaddr.com - Domain parked.<br />
* xil.in - Under construction.<br />
* xym.kr - Gibberish (?) Korean text blog.<br />
* yweb.com - Suspicious iframe with long url and fake loading gif image.<br />
* zi.ma - DNS not resolving.<br />
<br />
==== Discontinued ====<br />
<br />
* urlbrief.com - co-operates with 301Works.org<br />
<br />
=== Hueg list ===<br />
[http://code.google.com/p/shortenurl/wiki/URLShorteningServices]<br />
<br />
== References ==<br />
<references /><br />
<br />
== Weblinks ==<br />
* [http://urlte.am urlte.am]<br />
* [http://301works.org 301works.org]<br />
<br />
[[Category: URL Shortening]]</div>Asparagirl