Difference between revisions of "Google Reader"

From Archiveteam
(mention possible 404-rewriting)
(→‎Archiving: explain a little more how to use CDXes; link my master CDX index which combines them all for easier download & searching)
(109 intermediate revisions by 8 users not shown)
Line 2: Line 2:
| title = Google Reader
| title = Google Reader
| URL = {{url|1=http://www.google.com/reader/}}
| URL = {{url|1=http://www.google.com/reader/}}
| project_status = {{online}}
| image = greader_screenshot_en.gif
| archiving_status = {{-}}
| project_status = {{closed}}
| source = https://github.com/ArchiveTeam/greader-grab
| archiving_status = {{saved}}
| tracker = N/A
| source = [https://github.com/ArchiveTeam/greader-grab greader-grab]<br>
[https://github.com/ArchiveTeam/greader-directory-grab greader-directory-grab]<br>
[https://github.com/ArchiveTeam/greader-stats-grab greader-stats-grab]
| tracker = [http://tracker-alt.dyn.ludios.net:9292/greader/ greader-grab]<br>
[http://greader.archivingyoursh.it/greader/ greader-grab] :80<br>
[http://tracker-alt.dyn.ludios.net:9292/greader-directory/ greader-directory-grab]<br>
[http://greader.archivingyoursh.it/greader-directory/ greader-directory-grab] :80<br>
[http://tracker-alt.dyn.ludios.net:9292/greader-stats/ greader-stats-grab]<br>
[http://greader.archivingyoursh.it/greader-stats/ greader-stats-grab] :80
| irc = donereading
| irc = donereading
}}
}}
= Shutdown notification =
'''Google Reader''' was an RSS feed reader, launched by [[Google]] in 2005 and killed off in 2013.
== Shutdown notification ==
On March 13, 2013, Google announced on the [http://googlereader.blogspot.com/2013/03/powering-down-google-reader.html Official Google Reader Blog] that it would "spring clean" Google Reader:
On March 13, 2013, Google announced on the [http://googlereader.blogspot.com/2013/03/powering-down-google-reader.html Official Google Reader Blog] that it would "spring clean" Google Reader:
<blockquote>we will soon retire Google Reader (the actual date is July 1, 2013)</blockquote>
:'''Powering Down Google Reader'''
:''3/13/2013 04:06:00 PM''
:''Posted by Alan Green, Software Engineer''
:We have just announced on the [http://googleblog.blogspot.com/2013/03/a-second-spring-of-cleaning.html Official Google Blog] that we will soon retire Google Reader (the actual date is July 1, 2013). We know Reader has a devoted following who will be very sad to see it go. We’re sad too.
:There are two simple reasons for this: usage of Google Reader has declined, and as a company we’re pouring all of our energy into fewer products. We think that kind of focus will make for a better user experience.
:To ensure a smooth transition, we’re providing a three-month sunset period so you have sufficient time to find an alternative feed-reading solution. If you want to retain your Reader data, including subscriptions, you can do so through [https://www.google.com/takeout/?pli=1#custom:reader Google Takeout].
:Thank you again for using Reader as your RSS platform.
Reader and the Reader API were turned off shortly after midnight Pacific time on July 2, 2013.


= Backing up your own data =
== Post-shutdown message ==
* Main page - [http://www.google.com/reader/ google.com/reader/]
:'''Thank you for stopping by.'''
* Export via [https://www.google.com/takeout/ Google Takeout]
:Google Reader has been [http://googleblog.blogspot.com.au/2013/03/a-second-spring-of-cleaning.html discontinued]. We want to thank all our loyal fans. We understand you may not agree with this decision, but we hope you'll come to love [http://alternativeto.net/software/google-reader/ these alternatives] as much as you loved Reader.
** Contains subscriptions and starred items, but not tags
:Sincerely,
** Can be imported into [http://theoldreader.com/ The Old Reader]
:The Google Reader team
* API: https://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
:'''Frequently-asked questions'''
:# '''What will happen to my Google Reader data?'''<br />All Google Reader subscription data (eg. lists of people that you follow, items you have starred, notes you have created, etc.) will be systematically deleted from Google servers. You can download a copy of your Google Reader data via [https://www.google.com/takeout/#custom:reader Google Takeout] until 12PM PST July 15, 2013.
:# '''Will there be any way to retrieve my subscription data from Google in the future?'''<br />No -- all subscription data will be permanently, and irrevocably deleted. Google will not be able to recover any Google Reader subscription data for any user after July 15, 2013.
:# '''Why was Google Reader discontinued?'''<br />Please refer to our [http://googleblog.blogspot.com.au/2013/03/a-second-spring-of-cleaning.html blog post] for more information.


= Backing up the historical feed data =
== Archiving ==
Archive Team's [[User:Ivan|Ivan]] launched a heroic effort to retrieve historical feed data from the Google Reader API. Details can be found in the [[Google Reader/War room| war room]].


Google Reader acts as a cache for RSS/Atom feed content, keeping deleted posts and deleted blogs accessible (if you can recreate the RSS/Atom feed URL). After the Reader shutdown, this data might still be available<ref>https://groups.google.com/forum/?fromgroups=#!topic/Google-AJAX-Search-API/OaGf0eP57js</ref> via the Feeds API, but we'd like to grab most of this data before July 1 through the much more straightforward <tt>/reader/</tt> API.
All WARCs have been uploaded to http://archive.org/details/archiveteam_greader. The total size is about 8800 GB (feed data + directory + stats).


== Your help is needed ==
We don't yet have a convenient tool to read a specific feed in the uploaded megawarcs.  The [https://archive.org/details/archiveteam-googlereader201306-indexes.cdx master CDX index] (see the [https://archive.org/web/researcher/cdx_file_format.php CDX file format for interpreting each entry]) gives metadata about which megawarc an archived file/feed/URL is in, and also the file's byte range, which allows seeking directly to it in the megawarc.


=== Give us your feed URLs ===
== Backing up your own data ==
 
* Main page - [http://www.google.com/reader/ google.com/reader/]
We need to discover as many feed URLs as possible.  Not all of them can be discovered through crawling, so we need your OPML files.  (Though if you have any private or passworded feeds, please strip them out.)
* Export some data via [https://www.google.com/takeout/ Google Takeout] until July 15. Your ZIP file will contain subscriptions and starred items, but not tags. This can be imported into any of the [http://getgini.com/google-reader-alternatives Google Reader alternatives].
 
* Before it shut down, you could export *everything* with [http://readerisdead.com/ Reader is Dead]
<big><b>Upload OPML files and lists of URLs to:
 
http://allyourfeed.ludios.org:8080/
</b></big>
 
=== Install the ArchiveTeam Warrior, or run the pipeline on your Linux machine ===
 
Install the ArchiveTeam Warrior and have it run ArchiveTeam's Choice:
 
http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior
 
Google Reader will (probably) soon become the primary job.
 
If you cannot use the Warrior, follow the instructions on https://github.com/ArchiveTeam/greader-grab
 
=== Crawl websites to discover blogs and usernames ===
 
We need to discover millions of blog/username URLs on popular blogging platforms (which we'll turn into feed URLs).
 
Join #donereading and #archiveteam on efnet if you'd like to help with this.
 
* *.tumblr.com [12,065,345 discovered through IA and commoncrawl]
** http<font></font>://USERNAME.tumblr<font></font>.com/rss
* *.livejournal.com [211,146 discovered through commoncrawl]
** http://USERNAME.livejournal.com/data/rss
** http://USERNAME.livejournal.com/data/atom
** http://USERNAME.livejournal.com/data/rss/
** http://USERNAME.livejournal.com/data/atom/
** http://www.livejournal.com/users/USERNAME/data/atom/ (older feed location for users)
** http://www.livejournal.com/users/USERNAME/data/rss/ (older feed location for users)
** http://www.livejournal.com/users/USERNAME/data/atom (older feed location for users)
** http://www.livejournal.com/users/USERNAME/data/rss (older feed location for users)
** http://community.livejournal.com/COMMUNITY/data/rss (older feed location for communities)
** http://community.livejournal.com/COMMUNITY/data/atom (older feed location for communities)
** http://www.livejournal.com/community/COMMUNITY/data/rss (older feed location for communities)
** http://www.livejournal.com/community/COMMUNITY/data/atom (older feed location for communities)
* *.wordpress.com [1,319,787 discovered through commoncrawl]
** http://BLOGNAME.wordpress.com/feed/
** https://BLOGNAME.wordpress.com/feed/
** http://BLOGNAME.wordpress.com/feed
** https://BLOGNAME.wordpress.com/feed
** http://BLOGNAME.wordpress.com/comments/feed/
** https://BLOGNAME.wordpress.com/comments/feed
* *.blogspot.com [4,179,274 discovered through commoncrawl]
** http://BLOGNAME.blogspot.com/feeds/posts/default
** http://BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
** http://BLOGNAME.blogspot.com/atom.xml (older feed)
** http://BLOGNAME.blogspot.com/rss.xml (older feed)
** http://www.BLOGNAME.blogspot.com/feeds/posts/default
** http://www.BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
** http://www.BLOGNAME.blogspot.com/atom.xml (older feed)
** http://www.BLOGNAME.blogspot.com/rss.xml (older feed)
* http://feeds.feedburner.com/* [455,213 discovered through commoncrawl]
* *.posterous.com [9,901,701 discovered through spidering and commoncrawl]
** http://USERNAME.posterous.com/rss.xml
** https://USERNAME.posterous.com/rss.xml
* http://groups.google.com/group/* [13,966 discovered through commoncrawl]
** http://groups.google.com/group/GROUPNAME/feed/rss_v2_0_msgs.xml
** https://groups.google.com/group/GROUPNAME/feed/rss_v2_0_msgs.xml
** http://groups.google.com/group/GROUPNAME/feed/atom_v1_0_msgs.xml
** https://groups.google.com/group/GROUPNAME/feed/atom_v1_0_msgs.xml
* http://groups.yahoo.com/group/*/ [48,352 discovered through commoncrawl]
** http://rss.groups.yahoo.com/group/GROUPNAME/rss
** http://groups.yahoo.com/group/GROUPNAME/messages?rss=1 (older feed)
* *.typepad.com [102,384 discovered through commoncrawl]
* http://www.formspring.me/profile/USERNAME.rss
* *.exblog.jp [114,359 discovered through commoncrawl]
** http://BLOGNAME.exblog.jp/index.xml
** http://BLOGNAME.exblog.jp/atom.xml
** http://rss.exblog.jp/rss/exblog/BLOGNAME/index.xml
** http://rss.exblog.jp/rss/exblog/BLOGNAME/atom.xml
* http://blog.livedoor.jp/*
** http://blog.livedoor.jp/BLOGNAME/index.rdf
** http://blog.livedoor.jp/BLOGNAME/atom.xml
* http://*.xanga.com/ (previously http://<font></font>www.xanga.com/* )
** http://<font></font>USERNAME.xanga.com/rss
** http://<font></font>USERNAME.xanga.com/rss/
** http://<font></font>www.xanga.com/rss.aspx?user=USERNAME
** http://<font></font>www.xanga.com/USERNAME/rss
* twitter.com/* [206,125 discovered through commoncrawl]
** http://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
** https://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
** http://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
** https://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
** http://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
** https://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
** http://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
** https://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
**
** http://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
** https://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
** http://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME
** https://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME
**
** http://search.twitter.com/search.rss?q=* (check for feeds Reader already has cached)
** https://search.twitter.com/search.rss?q=* (as above)
** http://search.twitter.com/search.atom?q=* (as above)
** https://search.twitter.com/search.atom?q=* (as above)
* facebook.com/*
** Has feeds for Pages; see http://ahrengot.com/tutorials/facebook-rss-feed/
* plus.google.com/*
** http://rss2lj.net/g+/USER-ID
* *.dreamwidth.org
* *.blog.com
* 4chan.org
** Image Boards: http://boards.4chan.org/BOARD/index.rss (RSS)
** Image Boards: https://boards.4chan.org/BOARD/index.rss (RSS)
** Text Boards: http://dis.4chan.org/atom/BOARD (Atom)
** Text Boards: https://dis.4chan.org/atom/BOARD (Atom)
* *.vox.com
** http://USERNAME.vox.com/library/posts/atom.xml
** http://USERNAME.vox.com/library/posts/atom-full.xml
** http://USERNAME.vox.com/library/posts/rss.xml
** http://USERNAME.vox.com/library/posts/rss-full.xml
** http://USERNAME.vox.com/library/photos/rss.xml (probably skip)
* *.jux.com
** http://USERNAME.jux.com/quarks.rss
** https://USERNAME.jux.com/quarks.rss
* *.at.webry.info
* craigslist.org
* Reddit feeds
** http://www.reddit.com/user/USERNAME/.rss
** https://pay.reddit.com/user/USERNAME/.rss
** http://www.reddit.com/user/USERNAME/comments/.rss
** https://pay.reddit.com/user/USERNAME/comments/.rss
** http://www.reddit.com/user/USERNAME/submitted/.rss
** https://pay.reddit.com/user/USERNAME/submitted/.rss
** http://www.reddit.com/r/SUBREDDIT/.rss
** https://pay.reddit.com/r/SUBREDDIT/.rss
** http://www.reddit.com/r/SUBREDDIT/top/.rss
** https://pay.reddit.com/r/SUBREDDIT/top/.rss
** http://www.reddit.com/r/SUBREDDIT/controversial/.rss
** https://pay.reddit.com/r/SUBREDDIT/controversial/.rss
** http://www.reddit.com/r/SUBREDDIT/new/.rss
** https://pay.reddit.com/r/SUBREDDIT/new/.rss
* http://blog.myspace.com/*
* Windows Live Spaces feeds
** http://*.spaces.live.com/feed.rss
** http://*.spaces.live.com/blog/feed.rss
** http://*.spaces.live.com/photos/feed.rss
* Old Hacker News feeds
** http://rss.searchyc.com/user/USERNAME
** http://rss.searchyc.com/user/USERNAME?only=comments
** http://rss.searchyc.com/user/USERNAME?sort=by_date
* Less Wrong feeds
** http://lesswrong.com/user/USERNAME/overview/.rss
** http://lesswrong.com/user/USERNAME/submitted/.rss
** http://lesswrong.com/user/USERNAME/comments/.rss
* http://www.quora.com/TOPIC/rss [101,265 discovered]
* "shared items" feeds created by Reader users
** http://www.google.com/reader/public/atom/user/*/state/com.google/broadcast
*** e.g. http://www.google.com/reader/public/atom/user/06575532310267031409/state/com.google/broadcast
** Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/user/06575532310267031409/state/com.google/broadcast?r=n&n=1000
* "generated feeds" created while the feature was available
** http://www.google.com/reader/public/atom/webfeed/*
*** e.g. http://www.google.com/reader/public/atom/webfeed/11571763057935010098
** Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/webfeed/11571763057935010098?r=n&n=1000
* del.icio.us feeds
** Users: http://del.icio.us/rss/USERNAME
** Tags: http://del.icio.us/rss/tag/TAGNAME
** Popular: http://del.icio.us/rss/popular
** Popular tags: http://del.icio.us/rss/popular/TAGNAME
* http://youtube.com/user/*
** http://www.youtube.com/rss/user/USERNAME/videos.rss (old feed)
** http://gdata.youtube.com/feeds/api/users/USERNAME/uploads
** https://gdata.youtube.com/feeds/api/users/USERNAME/uploads
** http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?max-results=50
** http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?alt=rss&max-results=50
** http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&client=ytapi-youtube-profile
** http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&orderby=published&client=ytapi-youtube-profile
** http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&client=ytapi-youtube-rss-redirect&v=2&orderby=updated (redirect from old feed)
* ... and many more (please add them above!)
** http://taimoorsultan.com/list-of-25-blogging-platforms/
** http://john.do/blogging-platforms/
** http://mashable.com/2007/08/06/free-blog-hosts/
** Many non-US blogging platforms
** Feeds from dead sites: http://www.archiveteam.org/index.php?title=Deathwatch#Dead_as_a_Doornail
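
The patterns above all amount to substituting a discovered username or blog name into a fixed template. A minimal Python sketch of that expansion follows, using a few templates copied from the list. The <tt>FEED_TEMPLATES</tt> table, the <tt>{name}</tt> placeholder (standing in for USERNAME/BLOGNAME) and the helper names are illustrative assumptions, not part of any ArchiveTeam tool.

```python
# Hypothetical helper: expand discovered names into candidate feed URLs.
# Templates are copied from the list above; "{name}" replaces the
# USERNAME/BLOGNAME placeholder used there.
FEED_TEMPLATES = {
    "tumblr": ["http://{name}.tumblr.com/rss"],
    "livejournal": [
        "http://{name}.livejournal.com/data/rss",
        "http://{name}.livejournal.com/data/atom",
    ],
    "wordpress": [
        "http://{name}.wordpress.com/feed/",
        "http://{name}.wordpress.com/comments/feed/",
    ],
    "blogspot": [
        "http://{name}.blogspot.com/feeds/posts/default",
        "http://{name}.blogspot.com/feeds/posts/default?alt=rss",
    ],
}


def feed_urls(platform, name):
    """Return every candidate feed URL for one name on one platform."""
    return [template.format(name=name) for template in FEED_TEMPLATES[platform]]


def expand(platform, names):
    """Yield candidate feed URLs for many names, skipping blank entries."""
    for name in names:
        name = name.strip()
        if name:
            yield from feed_urls(platform, name)


for url in expand("livejournal", ["exampleuser"]):
    print(url)
```

Keeping every variant (trailing slash, https, older locations) as a separate template matters because Reader cached feeds under the exact URL that was subscribed to, so each variant is a distinct lookup.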
 
==== Tools for URL discovery ====
 
* Custom crawls with wget, HTTrack, Python code, etc
* https://commoncrawl.org/analysis-of-the-ncsu-library-urls-in-the-common-crawl-index/
<pre>
git clone https://github.com/trivio/common_crawl_index
cd common_crawl_index
pip install --user boto
PYTHONPATH=. python bin/index_lookup_remote 'com.blogspot'
</pre>
* site:domain.com or site:domain.com/page/ searches using Google, Bing, startpage
* http://dnshistory.org/subdomains/1/domain.com
 
=== Crawl Google Reader itself for feeds ===
 
https://www.google.com/reader/directory/search?q=keyword-here
 
https://www.google.com/reader/directory/search?q=keyword-here&start=10
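
The two URLs above suggest that results are paged through a <tt>start</tt> parameter. A small Python sketch that builds the sequence of search URLs for a keyword is shown here; the page size of 10 is an assumption inferred from <tt>start=10</tt>, and <tt>directory_search_urls</tt> is a hypothetical helper name. It only constructs URL strings and does not fetch anything.

```python
from urllib.parse import urlencode

DIRECTORY_SEARCH = "https://www.google.com/reader/directory/search"


def directory_search_urls(keyword, pages, page_size=10):
    """Build directory search URLs for the first `pages` result pages.

    page_size=10 is an assumption based on the `start=10` example above.
    """
    urls = []
    for page in range(pages):
        params = {"q": keyword}
        if page > 0:
            # The first page carries no start parameter, as in the example.
            params["start"] = page * page_size
        urls.append(DIRECTORY_SEARCH + "?" + urlencode(params))
    return urls


for url in directory_search_urls("keyword-here", 3):
    print(url)
```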
 
=== Add gzip support to wget-lua ===
 
It would be quite helpful to have a [https://github.com/alard/wget-lua wget-lua] that supports gzip content encoding (vanilla wget doesn't support it either).  This would speed up downloads and save a lot of bandwidth.
 
There have already been some attempts at making wget support gzip:
 
https://github.com/kravietz/wget-gzip (Windows-only; needs to work on Linux)
 
https://github.com/ptolts/wget-with-gzip-compression (based on a wget from 2003?)
 
=== Make greader-grab not save the embedded styles and image on 404 pages ===
 
We get a ton of 404s from Reader's feed API, e.g. https://www.google.com/reader/api/0/stream/contents/feed/https%3A%2F%2Faws.amazon.com%2Frss%2404-this-please?r=n&n=100 and these 404 pages are bloating our WARCs.  If [https://github.com/ArchiveTeam/greader-grab greader-grab] used hanzo's [http://code.hanzoarchives.com/warc-tools warc-tools] to rewrite the .warc.gz (replacing the 404 responses) before uploading, we would save a ton of space.
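
For reference, the feed API URLs mentioned above embed the feed's own URL, percent-encoded, as a path component. A minimal Python sketch of building such a URL is given below. <tt>reader_api_url</tt> is a hypothetical helper, the <tt>r=n</tt> parameter is copied from the examples on this page without interpretation, and reading <tt>n</tt> as the number of items requested is an assumption.

```python
from urllib.parse import quote

API_PREFIX = "https://www.google.com/reader/api/0/stream/contents/feed/"


def reader_api_url(feed_url, count=100):
    """Build a Reader feed API URL for one feed.

    safe="" makes quote() encode ":" and "/" too, so the whole feed URL
    becomes a single path segment, as in the examples on this page.
    `count` fills the n= parameter (assumed to be an item count).
    """
    return API_PREFIX + quote(feed_url, safe="") + "?r=n&n=%d" % count


print(reader_api_url("http://example.com/feed"))
```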


<references/>
<references/>
[[Category:Google]]
[[Category:Google]]
{{Navigation box}}
{{Navigation box}}

Revision as of 16:11, 30 December 2015

Google Reader
Greader screenshot en.gif
URL http://www.google.com/reader/
Status Offline
Archiving status Saved!
Archiving type Unknown
Project source greader-grab, greader-directory-grab, greader-stats-grab
Project tracker greader-grab, greader-grab :80, greader-directory-grab, greader-directory-grab :80, greader-stats-grab, greader-stats-grab :80

IRC channel #donereading (on hackint)

Google Reader was an RSS feed reader, launched by Google in 2005 and killed off in 2013.

Shutdown notification

On March 13, 2013, Google announced on the Official Google Reader Blog that it would "spring clean" Google Reader:

Powering Down Google Reader
3/13/2013 04:06:00 PM
Posted by Alan Green, Software Engineer
We have just announced on the Official Google Blog that we will soon retire Google Reader (the actual date is July 1, 2013). We know Reader has a devoted following who will be very sad to see it go. We’re sad too.
There are two simple reasons for this: usage of Google Reader has declined, and as a company we’re pouring all of our energy into fewer products. We think that kind of focus will make for a better user experience.
To ensure a smooth transition, we’re providing a three-month sunset period so you have sufficient time to find an alternative feed-reading solution. If you want to retain your Reader data, including subscriptions, you can do so through Google Takeout.
Thank you again for using Reader as your RSS platform.

Reader and the Reader API were turned off shortly after midnight Pacific time on July 2, 2013.

Post-shutdown message

Thank you for stopping by.
Google Reader has been discontinued. We want to thank all our loyal fans. We understand you may not agree with this decision, but we hope you'll come to love these alternatives as much as you loved Reader.
Sincerely,
The Google Reader team
Frequently-asked questions
  1. What will happen to my Google Reader data?
    All Google Reader subscription data (eg. lists of people that you follow, items you have starred, notes you have created, etc.) will be systematically deleted from Google servers. You can download a copy of your Google Reader data via Google Takeout until 12PM PST July 15, 2013.
  2. Will there be any way to retrieve my subscription data from Google in the future?
    No -- all subscription data will be permanently, and irrevocably deleted. Google will not be able to recover any Google Reader subscription data for any user after July 15, 2013.
  3. Why was Google Reader discontinued?
    Please refer to our blog post for more information.

Archiving

Archive Team's Ivan launched a heroic effort to retrieve historical feed data from the Google Reader API. Details can be found in the war room.

All WARCs have been uploaded to http://archive.org/details/archiveteam_greader. The total size is about 8800 GB (feed data + directory + stats).

We don't yet have a convenient tool to read a specific feed in the uploaded megawarcs. The master CDX index (see the CDX file format for interpreting each entry) gives metadata about which megawarc an archived file/feed/URL is in, and also the file's byte range, which allows seeking directly to it in the megawarc.
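
Until such a tool exists, the lookup described above can be sketched in a few lines of Python. This is an illustration under assumptions, not an ArchiveTeam tool: it assumes the index uses the common 11-column layout announced by a header line such as " CDX N b a m s k r M S V g" (where a is the original URL, S the compressed record size, V the compressed offset and g the file name), and that each record is stored as its own gzip member, as is usual for .warc.gz files. Check the header line of the actual index before relying on the column positions. The function names are hypothetical.

```python
import gzip
import io


def parse_cdx_header(header_line):
    """Map each CDX field letter to its column index."""
    parts = header_line.split()
    if not parts or parts[0] != "CDX":
        raise ValueError("not a CDX header line")
    return {letter: index for index, letter in enumerate(parts[1:])}


def parse_cdx_line(line, columns):
    """Extract the URL, file name and byte range from one CDX entry."""
    fields = line.split()
    return {
        "url": fields[columns["a"]],          # original URL
        "offset": int(fields[columns["V"]]),  # compressed offset in the file
        "length": int(fields[columns["S"]]),  # compressed record size
        "filename": fields[columns["g"]],     # which megawarc holds the record
    }


def read_record(warc_file, offset, length):
    """Seek to one record in an open .warc.gz and return it decompressed."""
    warc_file.seek(offset)
    return gzip.decompress(warc_file.read(length))


# Tiny in-memory stand-in for a megawarc: two gzip members back to back.
first = gzip.compress(b"WARC/1.0\r\n\r\nfirst record")
second = gzip.compress(b"WARC/1.0\r\n\r\nsecond record")
fake_megawarc = io.BytesIO(first + second)

columns = parse_cdx_header(" CDX N b a m s k r M S V g")
entry = parse_cdx_line(
    "com,example)/feed 20130601000000 http://example.com/feed text/xml 200 "
    "CHECKSUM - - %d %d demo.megawarc.warc.gz" % (len(second), len(first)),
    columns,
)
print(entry["filename"], read_record(fake_megawarc, entry["offset"], entry["length"]))
```

Because the megawarcs are very large, the same offset and length could be used to request only the needed bytes over HTTP rather than downloading a whole file.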

Backing up your own data