Google Reader

URL: http://www.google.com/reader/
Status: Online!
Archiving status: In progress...
Archiving type: Unknown
Project source: https://github.com/ArchiveTeam/greader-grab
Project tracker: here
IRC channel: #donereading (on hackint)

Shutdown notification

On March 13, 2013, Google announced on the Official Google Reader Blog that it would "spring clean" Google Reader:

we will soon retire Google Reader (the actual date is July 1, 2013)

Backing up your own data

Backing up the historical feed data

Google Reader acts as a cache for RSS/Atom feed content, keeping deleted posts and deleted blogs accessible (if you can recreate the RSS/Atom feed URL). After the Reader shutdown, this data might still be available via the Feeds API, but we'd like to grab most of this data before July 1 through the much more straightforward /reader/ API.
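
If you want to poke at that API directly, here is a minimal sketch, assuming the unauthenticated stream-contents endpoint and its n (items per page), r=o (oldest first), and c (continuation token) parameters; treat those details as assumptions and use greader-grab for any real archiving.

import json
import urllib
import urllib2

def fetch_feed_history(feed_url):
    # The feed URL is percent-encoded into the path; safe='' also
    # encodes the slashes.
    base = ('http://www.google.com/reader/api/0/stream/contents/feed/'
            + urllib.quote(feed_url, safe=''))
    continuation = None
    while True:
        query = {'n': 1000, 'r': 'o'}  # 1000 items per page, oldest first
        if continuation:
            query['c'] = continuation  # resume where the last page ended
        page = json.load(urllib2.urlopen(base + '?' + urllib.urlencode(query)))
        for item in page.get('items', []):
            yield item
        continuation = page.get('continuation')
        if not continuation:
            break

for item in fetch_feed_history('http://example.blogspot.com/feeds/posts/default'):
    print item.get('title')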

How you can help

Upload your feed URLs

We need to discover as many feed URLs as possible. Not all of them can be discovered through crawling, so please upload your OPML files. (Though if you have any private or password-protected feeds, please strip them out first; see the sketch below.)

Upload OPML files and lists of URLs to:

http://allyourfeed.ludios.org:8080/
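
If you'd rather scrub private feeds automatically before uploading, a sketch like the following works on a standard OPML export; the file names and the looks_private heuristic are placeholders, so adjust them to whatever actually marks your private feeds.

import xml.etree.ElementTree as ET

def strip_private_feeds(in_path, out_path, is_private):
    tree = ET.parse(in_path)
    # Feeds are <outline xmlUrl="..."> elements, possibly nested inside
    # folder outlines, so walk every element and drop matches in place.
    for parent in tree.iter():
        for outline in list(parent):
            url = outline.get('xmlUrl')
            if url is not None and is_private(url):
                parent.remove(outline)
    tree.write(out_path, encoding='utf-8', xml_declaration=True)

# Placeholder heuristic: inline credentials or LAN-only hosts.
def looks_private(url):
    return '@' in url or 'localhost' in url or '192.168.' in url

strip_private_feeds('subscriptions.xml', 'subscriptions-public.xml',
                    looks_private)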

Run the grab on your Linux machine

This project is not in the Warrior yet, so follow the install steps at https://github.com/ArchiveTeam/greader-grab

(Up to ~5GB of your disk space will be used; items are immediately uploaded elsewhere.)

Crawl websites to discover blogs and usernames

We need to discover millions of blog/username URLs on popular blogging platforms, which we'll turn into feed URLs (a sketch of that mapping follows below).

Join #donereading and #archiveteam on efnet if you'd like to help with this.
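
To illustrate the blog-URL-to-feed-URL step, here is a hypothetical mapping using a few well-known per-platform feed locations; this is illustrative, not the project's actual pattern list.

import urlparse

# Illustrative patterns only; %s is the blog name / username.
FEED_PATTERNS = {
    'blogspot.com':  ['http://%s.blogspot.com/feeds/posts/default'],
    'wordpress.com': ['http://%s.wordpress.com/feed/'],
    'tumblr.com':    ['http://%s.tumblr.com/rss'],
}

def candidate_feeds(page_url):
    host = urlparse.urlsplit(page_url).hostname or ''
    for platform, patterns in FEED_PATTERNS.items():
        if host.endswith('.' + platform):
            name = host[:-len(platform) - 1]  # the blog/user subdomain
            for pattern in patterns:
                yield pattern % name

for feed in candidate_feeds('http://example.blogspot.com/2013/06/a-post.html'):
    print feed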

Tools for URL discovery

# Fetch the Common Crawl URL index client, install its boto dependency,
# and look up everything under a reversed-domain prefix:
git clone https://github.com/trivio/common_crawl_index
cd common_crawl_index
pip install --user boto
PYTHONPATH=. python bin/index_lookup_remote 'com.blogspot'

You can copy and edit bin/index_lookup_remote to print just the information you need. The index stores each URL in reversed-domain form (e.g. com.blogspot.example/some/path:http), so these snippets split off the schema, unreverse the domain, and reassemble a normal URL:

# Print the entire URL:
rest, schema = url.rsplit(':', 1)
domain, path = rest.split('/', 1)
print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path

# Print just the (unreversed) hostname:
print '.'.join(url.split('/', 1)[0].split('.')[::-1])

# Print just the first two URL path segments:
rest, schema = url.rsplit(':', 1)
domain, path = rest.split('/', 1)
print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + '/'.join(path.split('/', 2)[0:2])

# Print just the first URL path segment:
rest, schema = url.rsplit(':', 1)
domain, path = rest.split('/', 1)
print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path.split('/', 1)[0]

Pipe the output through uniq | bzip2 > sitename-list.bz2, check the result with bzless, and upload it to our OPML collector (the URL above).

Add to the above list of blog platforms

See:

Crawl Google Reader itself for feeds

Reader's directory search takes a keyword and pages through results ten at a time via the start parameter:

https://www.google.com/reader/directory/search?q=keyword-here

https://www.google.com/reader/directory/search?q=keyword-here&start=10
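
A rough sketch of walking those result pages: it assumes the markup embeds feeds as feed/http... links and that the pages are fetchable without a login, so verify both against a real response before relying on it.

import re
import urllib
import urllib2

def directory_search(keyword, pages=5):
    seen = set()
    for start in range(0, pages * 10, 10):  # results come ten per page
        url = ('https://www.google.com/reader/directory/search?'
               + urllib.urlencode({'q': keyword, 'start': start}))
        page = urllib2.urlopen(url).read()
        # Assumed pattern for how feed URLs appear in the markup.
        for match in re.findall(r'feed/(http[^"\'&]+)', page):
            feed = urllib.unquote(match)
            if feed not in seen:
                seen.add(feed)
                print feed

directory_search('archiving')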

External links

WARCs are landing at http://archive.org/details/archiveteam_greader