Google Reader

Quick Info

URL: http://www.google.com/reader/
Status: Offline
Archiving status: In progress...
Archiving type: Unknown
Project source: greader-grab, greader-directory-grab, greader-stats-grab
Project tracker: greader-grab, greader-grab :80, greader-directory-grab, greader-directory-grab :80, greader-stats-grab, greader-stats-grab :80
IRC channel: #donereading (on hackint)

Shutdown notification

On March 13, 2013, Google announced on the Official Google Reader Blog that it would "spring clean" Google Reader:

We have just announced on the Official Google Blog that we will soon retire Google Reader (the actual date is July 1, 2013). We know Reader has a devoted following who will be very sad to see it go. We’re sad too.
There are two simple reasons for this: usage of Google Reader has declined, and as a company we’re pouring all of our energy into fewer products. We think that kind of focus will make for a better user experience.
To ensure a smooth transition, we’re providing a three-month sunset period so you have sufficient time to find an alternative feed-reading solution. If you want to retain your Reader data, including subscriptions, you can do so through Google Takeout.
Thank you again for using Reader as your RSS platform.

Reader and the Reader API were turned off soon after midnight Pacific time on July 2.

Archives

All WARCs will land at http://archive.org/details/archiveteam_greader over the next few days. The total size will be about 8800 GB (feed data + directory + stats).

Note that we don't yet have a convenient tool for reading a specific feed out of the uploaded megawarcs. However, the uploaded .cdx files allow seeking directly to a feed URL within a megawarc.
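In the meantime, the .cdx byte offsets can be used to pull out a single record by hand. Below is a minimal sketch, assuming the usual 11-field CDX layout in which the last three fields are compressed record length, byte offset, and megawarc filename (check the header line of the .cdx you downloaded), and assuming the megawarcs use one gzip member per record; the feed URL and filenames here are hypothetical.

import gzip
import io

def read_record(megawarc_path, offset, length):
    # Seek to the record's gzip member and decompress just that slice.
    with open(megawarc_path, 'rb') as f:
        f.seek(offset)
        compressed = f.read(length)
    return gzip.GzipFile(fileobj=io.BytesIO(compressed)).read()

feed_url = 'http://example.com/feed.atom'     # hypothetical feed URL
with open('archiveteam_greader.cdx') as cdx:  # hypothetical local .cdx name
    for line in cdx:
        fields = line.split()
        if len(fields) >= 11 and fields[2] == feed_url:
            length, offset, warc = int(fields[8]), int(fields[9]), fields[10]
            print(read_record(warc, offset, length)[:1000])
            break

What comes back is the raw WARC record: the WARC headers plus the HTTP response containing the feed content.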

Backing up your own data

Backing up historical feed data

Google Reader acts as a cache for RSS/Atom feed content, keeping deleted posts and deleted blogs readable (if you can recreate the RSS/Atom feed URL). After the Reader shutdown, only a small portion (100 posts per blog) will remain available via the Feeds API, so it is imperative that we grab everything through the /reader/ API before July 1.
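For reference, the request pattern looks roughly like the sketch below (Python 2, like the snippets further down). The endpoint /reader/api/0/stream/contents/feed/<feed URL>, with n as the batch size and c as the continuation token, is the commonly documented unofficial Reader API and is an assumption here; the project's real grab logic lives in greader-grab, and the feed URL is hypothetical.

import json
import urllib
import urllib2

def fetch_history(feed_url, batch=1000):
    # Page through a feed's full history, following continuation tokens.
    items = []
    continuation = None
    base = ('http://www.google.com/reader/api/0/stream/contents/feed/'
            + urllib.quote(feed_url, safe=''))
    while True:
        params = {'n': batch}
        if continuation:
            params['c'] = continuation
        data = json.load(urllib2.urlopen(base + '?' + urllib.urlencode(params)))
        items.extend(data.get('items', []))
        continuation = data.get('continuation')
        if not continuation:
            return items

print(len(fetch_history('http://example.com/feed.atom')))  # hypothetical feed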

How you can help

Upload your feed URLs

We need to discover as many feed URLs as possible. Not all of them can be discovered through crawling, so please upload your OPML files. (Though if you have any private or passworded feeds, please strip them out first; see the sketch below the upload link.)

Upload OPML files and lists of URLs to:

http://allyourfeed.ludios.org:8080/
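If you want to check what you are about to upload, here is a minimal sketch (Python 2) that pulls feed URLs out of an OPML export and drops anything matching markers you consider private; the filename and the PRIVATE_MARKERS list are placeholders to edit.

import xml.etree.ElementTree as ET

PRIVATE_MARKERS = ['password', 'private', 'intranet.example.com']  # edit me

def public_feeds(opml_path):
    # OPML subscriptions are <outline> elements with an xmlUrl attribute.
    tree = ET.parse(opml_path)
    for outline in tree.iter('outline'):
        feed = outline.get('xmlUrl')
        if feed and not any(m in feed.lower() for m in PRIVATE_MARKERS):
            yield feed

for url in public_feeds('google-reader-subscriptions.xml'):  # placeholder name
    print(url)

The resulting plain list of URLs can be uploaded instead of the raw OPML file.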

Run the grab on your Linux machine

This project is not in the Warrior yet, so follow the install steps in each of these repositories:

https://github.com/ArchiveTeam/greader-grab (grabs the actual text content of feeds)

https://github.com/ArchiveTeam/greader-directory-grab (searches for feeds using Reader's Feed Directory)

https://github.com/ArchiveTeam/greader-stats-grab (grabs subscriber counts and other data)

(Up to ~5GB of your disk space will be used; items are immediately uploaded elsewhere.)

Crawl websites to discover blogs and usernames

We need to discover millions of blog/username URLs on popular blogging platforms (which we'll turn into feed URLs).

Join #donereading and #archiveteam on efnet if you'd like to help with this.

The counts listed below are underestimates; please ask on IRC for updated counts.

See https://github.com/ludios/greader-item-maker/blob/master/url_filter.py for additional sites not listed here.
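To give an idea of how a discovered blog or username URL becomes a feed URL, here is a hedged sketch (Python 2). The per-platform feed paths are common defaults and are assumptions on our part; the mapping the project actually uses is in greader-item-maker, linked above.

import urlparse

# Common default feed locations per platform (assumed, not exhaustive).
FEED_PATHS = {
    'blogspot.com': '/feeds/posts/default',
    'wordpress.com': '/feed/',
    'tumblr.com': '/rss',
    'livejournal.com': '/data/rss',
}

def to_feed_url(blog_url):
    host = urlparse.urlparse(blog_url).netloc.lower()
    for platform, path in FEED_PATHS.items():
        if host == platform or host.endswith('.' + platform):
            return 'http://' + host + path
    return None  # unknown platform; needs a per-site rule

print(to_feed_url('http://example.blogspot.com/'))  # hypothetical blog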

Tools for URL discovery

git clone https://github.com/trivio/common_crawl_index
cd common_crawl_index
pip install --user boto
PYTHONPATH=. python bin/index_lookup_remote 'com.blogspot'

You can copy and edit bin/index_lookup_remote to print just the necessary information:

# In the index, URLs are stored with the domain reversed and the scheme at the
# end (e.g. 'com.blogspot.example/2013/01/post.html:http'); the snippets below
# turn that back into something usable, inside the script's result loop.

# Print entire URL:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path

# Print just the subdomain:
	print '.'.join(url.split('/', 1)[0].split('.')[::-1])

# Print just the first two URL /path segments:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + '/'.join(path.split('/', 2)[0:2])

# Print just the first URL /path segment:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + '/'.join(path.split('/', 1)[0:1])

Pipe the output through uniq | bzip2 > sitename-list.bz2, check the result with bzless, and upload it to our OPML collector linked above.

Add to the above list of blog platforms

See: