From Archiveteam
Revision as of 05:30, 22 March 2024 by JustAnotherArchivist (talk | contribs) (Fix unsigned edit)

Regarding archiving

Just randomly requesting TinyURLs like you propose will get you banned since you are making many requests for non-existent TinyURLs. We do allow bots to crawl TinyURLs, but only if they are crawling TinyURLs that exist which they pulled from whatever source they are crawling.

Kevin "Gilby" Gilbertson

TinyURL, Founder

A Problem Easily Solved

Just provide us an Excel spreadsheet in the form of:

tinyurl ID | full URL

And scraping won't be necessary. Up for it?

--Jscott 20:25, 4 December 2010 (UTC)

I e-mailed the TinyURL owner and he replied with that.
Zachera 00:06, 11 December 2010 (UTC)

Another URL shortener

I ran into another URL shortener. Here's their API: Jodi.a.schneider 17:05, 3 September 2011 (UTC)

Could you archive (, please? Thank you!

To clarify: is the address of the web page; is the prefix of the generated shortened URLs. -- Pne 12:08, 9 September 2011 (UTC)

Distributed Scraping

You could make a browser extension that records the long url for each short url that a person's browser visits, for certain known shorteners. This might be particularly helpful for uncooperative shorteners, since they wouldn't know the difference.

Then it would be a matter of encouraging many people to install the browser extension.
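As a rough sketch of the idea (in Python rather than as an actual browser extension, and with a hypothetical KNOWN_SHORTENERS allowlist), the extension's core logic would just be recording the final URL the browser lands on for each known short URL it visits:

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of shorteners the extension would watch for
KNOWN_SHORTENERS = {"tinyurl.com", "bit.ly", "goo.gl"}

def is_short_url(url):
    """True if the URL's host belongs to a known shortener."""
    return urlsplit(url).hostname in KNOWN_SHORTENERS

def record_visit(mapping, visited_url, final_url):
    """Record short URL -> long URL when the browser follows a redirect."""
    if is_short_url(visited_url) and not is_short_url(final_url):
        mapping[visited_url] = final_url
    return mapping

# Example: the browser visited a short link and ended up on the long page
seen = {}
record_visit(seen, "https://tinyurl.com/2tx", "https://example.com/a-long-page")
```

A real extension would hook the browser's navigation/redirect events to get the visited and final URLs, then ship the collected mapping somewhere central; the snippet above only shows the recording step.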

URLTeam gone?

There was no update in December 2011 as announced, and currently shows a Dreamhost page "this member has not set up their site yet". Is this team dead? -- Pne 07:56, 29 January 2012 (UTC)

Luckily no. We assembled the new torrent on the very last day of December 2011. Unfortunately, since we were already updating the homepage, we then decided that we should finally move it away from Dreamhost. Due to some miscommunication between User:Jscott (current domain owner), (domain registry) and Dreamhost (old webhost) we ended up with the domain not working, and because ArchiveTeam is very busy at the moment with Splinder and MobileMe, it is taking some time to fix things. Rest assured that all data is safe, the December 2011 torrent is well seeded and waiting for downloaders, and we are already working on a new release. In fact, our Github repository already contains the updated website. --Soult 17:49, 29 January 2012 (UTC)

There's a site that's not quite a URL shortener, but is essentially providing the same service for different reasons. It's at Should anyone look into archiving it? Thanks! appears to be a service similar to URLteam, but without the, you know, archiving part. They have a big database of shortened URLs, and you can query their database without actually following the short URL. If they'd be willing to share it, it would be a useful addition. JesseW 01:15, 2 November 2015 (EST)

Searching script

set -e
# Written by Jesse Weinstein <>
# Released to the public domain as of Nov 21, 2015
# Usage: <script> SEARCH_STRING [WORKING_DIR]
SEARCH_FOR="$1"                                # string to grep for (assumed to come from the first argument)
SEARCH_NAME="${SEARCH_FOR//[^A-Za-z0-9]/_}"    # filesystem-safe name for the working directory (assumed)
WORKING_DIR=${2:-$( mktemp -d -p /mnt/bigdisk/nuc_files/urlteam_searches ${SEARCH_NAME}.XXXXXX )}

# Search the old-style dumps (plain .txt.xz files listed in xz_files.txt)
touch $WORKING_DIR/old_files_done.txt
# Split only on newlines so filenames containing spaces survive the for loop
IFS=$(echo -en "\n\b")
for filename in $(cat xz_files.txt); do
    if fgrep -q -x "$filename" $WORKING_DIR/old_files_done.txt; then
        echo "Skipping $filename"
    else
        xz -v -c -d "$filename" | fgrep "$SEARCH_FOR" | tee -a $WORKING_DIR/results.txt
        echo "$filename" >> $WORKING_DIR/old_files_done.txt
    fi
done

# Search the new-style dumps (.txt.xz files inside urlteam_*/*.zip)
touch $WORKING_DIR/new_files_done.txt
for zipfile in urlteam_*/*.zip; do
    if fgrep -q -x "$zipfile" $WORKING_DIR/new_files_done.txt; then
        echo "Skipping $zipfile"
    else
        unzip -p "$zipfile" '*.txt.xz' | xz -v -d | fgrep "$SEARCH_FOR" | tee -a $WORKING_DIR/results.txt
        echo "$zipfile" >> $WORKING_DIR/new_files_done.txt
    fi
done

# Result lines appear to be pipe-delimited ("shortcode|long URL"); keep the unique long URLs
cut -d '|' -f 2 $WORKING_DIR/results.txt | sort -u > $WORKING_DIR/result_urls.txt
(cd $WORKING_DIR ; wc -l * )

In case it is useful. JesseW 16:38, 21 November 2015 (EST)

Updated to handle filenames with spaces in old dump. JesseW 17:05, 21 November 2015 (EST)

Self-archival of linked URLs

Perhaps, in addition to self-shortening, we should also encourage websites to do self-archival. is something every serious website should use, probably. --Nemo 17:45, 12 June 2016 (EDT)

Python snippet to find incremental dumps

# "iaapi" is presumably the internetarchive Python library
import internetarchive as iaapi

print('\n'.join(sorted(
    '' + x['identifier'] + '/' + x['identifier'] + '_archive.torrent'
    for x in iaapi.search_items(
        'urlteam terroroftinytown -collection:test_collection AND addeddate:[2016-05-17 TO 2017]'))))

Just so I can find it again easily. JesseW (talk) 01:27, 28 July 2016 (EDT)

Shouldn't be considered a URL shortener? is a popular alternative to the Wayback Machine. You can see the difference between the two URL schemes:

  •<original url>

While is not nominally a URL shortener, when people refer to webpages by their archived version, they are incidentally also referring to the webpage by a shortened URL. It is possible to access archived webpages through a link that preserves the canonical URL, but that link is always a click away, as an option in the "Share" button. Users of overwhelmingly use the short version.

If shuts down suddenly, we would of course lose millions of archived pages, but at the same time we would lose every clue as to what webpage millions of links refer to. This is the same concern about nominal URL shorteners that motivates URLTeam to act. Censuro (talk) 01:41, 20 March 2024 (UTC)