Talk:URLTeam
Regarding archiving
Just randomly requesting TinyURLs like you propose will get you banned since you are making many requests for non-existent TinyURLs. We do allow bots to crawl TinyURLs, but only if they are crawling TinyURLs that exist which they pulled from whatever source they are crawling.
Kevin "Gilby" Gilbertson
TinyURL, Founder
A Problem Easily Solved
Just provide us with an Excel spreadsheet in the form of:
tinyurl ID | full URL
And scraping won't be necessary. Up for it?
--Jscott 20:25, 4 December 2010 (UTC)
Another URL shortener
I ran into another URL shortener: http://ln-s.net/home/ Here's their API: http://ln-s.net/home/apidoc.jsp Jodi.a.schneider 17:05, 3 September 2011 (UTC)
xrl.us
Could you archive xrl.us (metamark.net), please? Thank you!
To clarify: metamark.net is the address of the web page; xrl.us is the prefix of the generated shortened URLs. -- Pne 12:08, 9 September 2011 (UTC)
Distributed Scraping
You could make a browser extension that, for certain known shorteners, records the long URL for each short URL that a person's browser visits. This might be particularly helpful for uncooperative shorteners, since they wouldn't know the difference.
Then it would be a matter of encouraging many people to install the browser extension.
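A minimal sketch of the recording step such an extension would perform, written in Python for brevity (a real extension would run as JavaScript inside the browser). The shortener list, the function name record_redirect, and the in-memory dictionary are all hypothetical; this is not an existing URLTeam tool.

```python
from urllib.parse import urlsplit

# Hypothetical list of shorteners to watch; a real list would be much longer.
KNOWN_SHORTENERS = {"tinyurl.com", "xrl.us", "ln-s.net"}


def record_redirect(short_url, long_url, store):
    """Save short_url -> long_url in store if short_url is on a known shortener.

    Returns True when the pair was recorded, False otherwise.
    """
    # hostname is already lowercased by urlsplit; drop a leading "www." too.
    host = urlsplit(short_url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    if host not in KNOWN_SHORTENERS:
        return False
    store[short_url] = long_url
    return True


if __name__ == "__main__":
    seen = {}
    record_redirect("http://tinyurl.com/abc123", "http://example.com/page", seen)
    record_redirect("http://example.org/x", "http://example.com/other", seen)
    print(seen)
```

Only visits to listed shorteners are stored, so ordinary browsing would not be recorded.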
URLTeam gone?
There was no update in December 2011 as announced, and http://urlte.am/ currently shows a Dreamhost page "this member has not set up their site yet". Is this team dead? -- Pne 07:56, 29 January 2012 (UTC)
- Luckily no. We assembled the new torrent on the very last day of December 2011. Unfortunately, since we were already updating the homepage, we then decided to finally move it away from Dreamhost. Due to some miscommunication between User:Jscott (current domain owner), dot.am (domain registry) and Dreamhost (old webhost), the domain stopped working, and because ArchiveTeam is very busy at the moment with Splinder and MobileMe, it is taking some time to fix. Rest assured that all data is safe, the December 2011 torrent is well seeded and waiting for downloaders, and we are already working on a new release. In fact, our GitHub repository already contains the updated website. --Soult 17:49, 29 January 2012 (UTC)
www.donotlink.com
There's a site that's not quite a URL shortener, but is essentially providing the same service for different reasons. It's at http://www.donotlink.com Should anyone look into archiving it? Thanks!
TrueURL.net
http://www.trueurl.net appears to be a service similar to URLTeam, but without the, you know, archiving part. They have a big database of shortened URLs, and you can query it without actually following the short URL. If they'd be willing to share it, it would be a useful addition. JesseW 01:15, 2 November 2015 (EST)
Searching script
#!/bin/bash
set -e
# Written by Jesse Weinstein <jesse@wefu.org>
# Released to the public domain as of Nov 21, 2015
SEARCH_FOR="$1"
SEARCH_NAME=${SEARCH_FOR//[.\/]/_}
# Use the directory given as $2, or create a fresh one named after the search.
WORKING_DIR=${2:-$( mktemp -d -p /mnt/bigdisk/nuc_files/urlteam_searches "${SEARCH_NAME}.XXXXXX" )}
OLD_DUMP=/mnt/bigdisk/transmission_files/downloads/URLTeamTorrentRelease2013July/
NEW_DUMPS=/mnt/bigdisk/transmission_files/downloads/
echo "$WORKING_DIR"
touch "$WORKING_DIR/old_files_done.txt"
SAVEIFS="$IFS"
IFS=$(echo -en "\n\b")
# Search the old dump, skipping files already listed in old_files_done.txt.
(cd "$OLD_DUMP"
    for filename in $(cat xz_files.txt); do
        if grep -F -q -x "$filename" "$WORKING_DIR/old_files_done.txt"; then
            echo "Skipping $filename"
        else
            xz -v -c -d "$filename" | grep -F "$SEARCH_FOR" | tee -a "$WORKING_DIR/results.txt"
            echo "$filename" >> "$WORKING_DIR/old_files_done.txt"
        fi
    done)
IFS="$SAVEIFS"
touch "$WORKING_DIR/new_files_done.txt"
# Search the newer dumps: each zip holds .txt.xz files, streamed without unpacking to disk.
(cd "$NEW_DUMPS"
    for zipfile in urlteam_*/*.zip; do
        if grep -F -q -x "$zipfile" "$WORKING_DIR/new_files_done.txt"; then
            echo "Skipping $zipfile"
        else
            unzip -p "$zipfile" '*.txt.xz' | xz -v -d | grep -F "$SEARCH_FOR" | tee -a "$WORKING_DIR/results.txt"
            echo "$zipfile" >> "$WORKING_DIR/new_files_done.txt"
        fi
    done)
# Keep only field 2 (the URL) of each '|'-separated result line, deduplicated.
cut -d '|' -f 2 "$WORKING_DIR/results.txt" | sort -u > "$WORKING_DIR/result_urls.txt"
(cd "$WORKING_DIR" ; wc -l * )
In case it is useful. JesseW 16:38, 21 November 2015 (EST)
- Updated to handle filenames with spaces in old dump. JesseW 17:05, 21 November 2015 (EST)
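For anyone without a Unix shell handy, the core step of the script above (decompress each .xz file and keep the lines containing the search string) can be sketched in Python. This is an illustrative equivalent, not part of the original script. The function name search_xz is made up, and it assumes each dump line has the "shortcode|url" shape that the cut step implies.

```python
import lzma


def search_xz(path, needle):
    """Yield (shortcode, url) for each line of an .xz dump containing needle.

    Assumes 'shortcode|url' lines, as implied by the cut step in the script.
    """
    # errors="replace" keeps the scan going past bytes that are not valid UTF-8.
    with lzma.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            line = line.rstrip("\n")
            if needle in line:
                # Split on the first '|' only, so the rest of the line stays whole.
                shortcode, _, url = line.partition("|")
                yield shortcode, url
```

One difference from the shell version: cut -f 2 would truncate a URL that itself contains a "|", whereas this sketch keeps everything after the first separator.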
Self-archival of linked URLs
Perhaps, in addition to self-shortening, we should also encourage websites to do self-archival. http://amberlink.org/ is something every serious website should use, probably. --Nemo 17:45, 12 June 2016 (EDT)
Python snippet to find incremental dumps
import internetarchive as iaapi  # assumption: "iaapi" refers to the internetarchive library
print('\n'.join(sorted('https://archive.org/download/' + x['identifier'] + '/' + x['identifier'] + '_archive.torrent' for x in iaapi.search_items('urlteam terroroftinytown -collection:test_collection AND addeddate:[2016-05-17 TO 2017]'))))
Just so I can find it again easily. JesseW (talk) 01:27, 28 July 2016 (EDT)
Shouldn't archive.today be considered a URL shortener?
Archive.today is a popular alternative to the Wayback Machine. Compare the two URL schemes:
- archive.today/<XXXXX>
- web.archive.org/web/20091125231500/<original url>
While archive.today is not nominally a URL shortener, people who refer to webpages by their archived version are incidentally also referring to them by a shortened URL. It is possible to reach an archive.today page through a link that preserves the canonical URL, but that link is always a click away, as an option under the "Share" button. Users of archive.today overwhelmingly use the short version.
If archive.today shuts down suddenly, we would of course lose millions of archived pages, but we would also lose every clue as to which webpage each of millions of archive.today links refers to. This is the same concern about nominal URL shorteners that motivates URLTeam to act. Censuro (talk) 01:41, 20 March 2024 (UTC)
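To make the difference between the two schemes concrete, here is a small Python sketch. The function name original_from_archive_link is hypothetical, and the pattern only covers the 14-digit timestamp form of Wayback URLs shown above; it is an illustration, not a complete parser.

```python
import re

# Wayback Machine links of the form shown above: a 14-digit timestamp
# followed by the original URL.
WAYBACK_RE = re.compile(r"^(?:https?://)?web\.archive\.org/web/(\d{14})/(.+)$")


def original_from_archive_link(link):
    """Return the original URL if the link itself contains it, else None."""
    match = WAYBACK_RE.match(link)
    if match:
        return match.group(2)
    # An archive.today short link carries only an opaque ID, so the target
    # cannot be recovered from the link alone.
    return None
```

A Wayback link still says where it pointed even if the archive disappears; an archive.today short link returns None here, which is exactly the information that would be lost.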
- ...Anyone? Censuro (talk) 12:10, 1 August 2024 (UTC)
- Hi Censuro,
- In my opinion, your idea makes sense, and if URLTeam has spare capacity, this is a project worth adding.
- Please note that "core" people don't necessarily read this wiki, so they might not notice your suggestion.
- Make sure to raise your idea in the #urlteam IRC channel. I'm not involved in the URLTeam project, so I don't know exactly who can add a project to the scraper or how, but calling attention to it on IRC is definitely the first step. After your idea is approved, you could perhaps help the devs by gathering the kinds of details that are already listed for other projects (URL patterns, response codes etc.).
- bzc6p (talk) 06:47, 4 August 2024 (UTC)
- Thank you. I defaulted to the wiki because I thought everyone would agree it's a more structured and permanent format. (one can also use RSS to effortlessly keep up with a watchlist) I'll for sure go to the IRC and ask, to get some feedback. Censuro (talk) 11:32, 4 August 2024 (UTC)
- The wiki is mainly used for documentation, so you are more than encouraged to record useful information here, as well as the status and results of projects, as seen on existing projects' pages. It's just that getting things started, as well as day-to-day operational communication, happens on IRC. bzc6p (talk) 12:07, 10 August 2024 (UTC)