Talk:URLTeam

Regarding archiving

Just randomly requesting TinyURLs like you propose will get you banned since you are making many requests for non-existent TinyURLs. We do allow bots to crawl TinyURLs, but only if they are crawling TinyURLs that exist which they pulled from whatever source they are crawling.

Kevin "Gilby" Gilbertson

TinyURL, Founder

http://tinyurl.com
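
For reference, resolving a TinyURL that has already been found on some page (rather than guessing IDs) only requires requesting it without following the redirect and recording the Location header. Below is a minimal sketch in Python using only the standard library; the NoRedirect and resolve names are illustrative and not part of any URLTeam tooling.

import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # refuse to follow, so the Location header stays visible to us

opener = urllib.request.build_opener(NoRedirect)

def resolve(short_url):
    """Return the long URL a short URL redirects to, or None."""
    try:
        opener.open(short_url, timeout=30)
    except urllib.error.HTTPError as e:
        # With redirects disabled, a 301/302 surfaces here; a 404 (nonexistent
        # code) has no Location header, so this returns None for it.
        return e.headers.get("Location")
    return None  # plain 200 with no redirect: nothing to record

print(resolve("http://tinyurl.com/example"))  # hypothetical short code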

A Problem Easily Solved

Just provide us with an Excel spreadsheet in the form of:

tinyurl ID | full URL

And scraping won't be necessary. Up for it?

--Jscott 20:25, 4 December 2010 (UTC)

I e-mailed the TinyURL owner and he replied with the above.
Zachera 00:06, 11 December 2010 (UTC)

Another URL shortener

I ran into another URL shortener: http://ln-s.net/home/ Here's their API: http://ln-s.net/home/apidoc.jsp Jodi.a.schneider 17:05, 3 September 2011 (UTC)

xrl.us

Could you archive xrl.us (metamark.net), please? Thank you!

To clarify: metamark.net is the address of the web page; xrl.us is the prefix of the generated shortened URLs. -- Pne 12:08, 9 September 2011 (UTC)

Distributed Scraping

You could make a browser extension that records the long URL for each short URL a person's browser visits, for certain known shorteners. This might be particularly helpful for uncooperative shorteners, since they wouldn't know the difference.

Then it would be a matter of encouraging many people to install the browser extension.
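
Purely as an illustration of the collecting side of such a scheme (nothing like this exists in URLTeam's tooling), the extension would only need to POST a JSON pair like {"short": ..., "long": ...} to a small endpoint. A sketch using the Python standard library, with hypothetical names throughout:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOWN_SHORTENERS = ("http://tinyurl.com/", "http://bit.ly/")  # example prefixes only

class MappingCollector(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            pair = json.loads(self.rfile.read(length).decode("utf-8"))
            short, long_ = pair["short"], pair["long"]
        except (ValueError, KeyError, TypeError):
            self.send_response(400)
            self.end_headers()
            return
        if isinstance(short, str) and short.startswith(KNOWN_SHORTENERS):
            # Pipe-separated, expanded URL second, like the dump files searched below.
            with open("mappings.txt", "a") as f:
                f.write("%s|%s\n" % (short, long_))
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), MappingCollector).serve_forever()

Deduplication, rate limiting and authentication are left out; the point is just that the browser side needs nothing more than an HTTP POST.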

URLTeam gone?

There was no update in December 2011 as announced, and http://urlte.am/ currently shows a Dreamhost page "this member has not set up their site yet". Is this team dead? -- Pne 07:56, 29 January 2012 (UTC)

Luckily, no. We assembled the new torrent on the very last day of December 2011. Unfortunately, we then decided that, since we were already updating the homepage, we should finally move it away from Dreamhost. Due to some miscommunication between User:Jscott (current domain owner), dot.am (domain registry), and Dreamhost (old webhost), we ended up with the domain not working, and because ArchiveTeam is very busy at the moment with Splinder and MobileMe, fixing things is taking some time. Rest assured that all data is safe, the December 2011 torrent is well seeded and waiting for downloaders, and we are already working on a new release. In fact, our GitHub repository already contains the updated website. --Soult 17:49, 29 January 2012 (UTC)

www.donotlink.com

There's a site that's not quite a URL shortener, but is essentially providing the same service for different reasons. It's at http://www.donotlink.com Should someone look into archiving it? Thanks!

TrueURL.net

http://www.trueurl.net appears to be a service similar to URLteam, but without the, you know, archiving part. They have a big database of shortened URLs, and you can query their database without actually following the short URL. If they'd be willing to share it, it would be a useful addition. JesseW 01:15, 2 November 2015 (EST)

Searching script

#!/bin/bash
# Search the URLTeam dumps for a string and collect the matching lines
# (and the expanded URLs they point to) into a working directory.
# Usage: $0 SEARCH_STRING [WORKING_DIR]
set -e
# Written by Jesse Weinstein <jesse@wefu.org>
# Released to the public domain as of Nov 21, 2015
SEARCH_FOR="$1"
SEARCH_NAME=${SEARCH_FOR//[.\/]/_}
# Default to a fresh temporary directory named after the search term.
WORKING_DIR=${2:-$( mktemp -d -p /mnt/bigdisk/nuc_files/urlteam_searches "${SEARCH_NAME}.XXXXXX" )}
# Local copies of the July 2013 torrent release and the newer incremental dumps.
OLD_DUMP=/mnt/bigdisk/transmission_files/downloads/URLTeamTorrentRelease2013July/
NEW_DUMPS=/mnt/bigdisk/transmission_files/downloads/

echo "$WORKING_DIR"

# Pass 1: the old dump, a set of .xz files listed in xz_files.txt.
# Finished files are recorded so an interrupted run can be resumed.
touch "$WORKING_DIR/old_files_done.txt"
SAVEIFS="$IFS"
IFS=$(echo -en "\n\b")   # split only on newlines, so filenames with spaces survive
(cd "$OLD_DUMP"
 for filename in $(cat xz_files.txt); do
    if fgrep -q -x "$filename" "$WORKING_DIR/old_files_done.txt"; then
       echo "Skipping $filename"
    else
       xz -v -c -d "$filename" | fgrep "$SEARCH_FOR" | tee -a "$WORKING_DIR/results.txt"
       echo "$filename" >> "$WORKING_DIR/old_files_done.txt"
    fi
 done)
IFS="$SAVEIFS"

# Pass 2: the newer releases, zip files containing .txt.xz members.
touch "$WORKING_DIR/new_files_done.txt"

(cd "$NEW_DUMPS"
 for zipfile in urlteam_*/*.zip; do
    if fgrep -q -x "$zipfile" "$WORKING_DIR/new_files_done.txt"; then
       echo "Skipping $zipfile"
    else
       unzip -p "$zipfile" '*.txt.xz' | xz -v -d | fgrep "$SEARCH_FOR" | tee -a "$WORKING_DIR/results.txt"
       echo "$zipfile" >> "$WORKING_DIR/new_files_done.txt"
    fi
 done)

# The dump lines are pipe-separated with the expanded URL in the second field;
# extract those, deduplicate, and report line counts for everything collected.
cut -d '|' -f 2 "$WORKING_DIR/results.txt" | sort -u > "$WORKING_DIR/result_urls.txt"
(cd "$WORKING_DIR" ; wc -l * )

In case it is useful. JesseW 16:38, 21 November 2015 (EST)

Updated to handle filenames with spaces in the old dump. JesseW 17:05, 21 November 2015 (EST)

Self-archival of linked URLs

Perhaps, in addition to self-shortening, we should also encourage websites to do self-archival. http://amberlink.org/ is something every serious website should use, probably. --Nemo 17:45, 12 June 2016 (EDT)

Python snippet to find incremental dumps

import internetarchive as iaapi  # assuming "iaapi" is the internetarchive Python package
print('\n'.join(sorted(['https://archive.org/download/'+x['identifier']+'/'+x['identifier']+'_archive.torrent' for x in iaapi.search_items('urlteam terroroftinytown -collection:test_collection AND addeddate:[2016-05-17 TO 2017]')])))

Just so I can find it again easily. JesseW (talk) 01:27, 28 July 2016 (EDT)