Revision as of 09:30, 7 June 2013

Google Reader
URL	http://www.google.com/reader/
Status	Online!
Archiving status	In progress...
Archiving type	Unknown
Project source	https://github.com/ArchiveTeam/greader-grab
Project tracker	N/A
IRC channel	#donereading (on hackint)

Shutdown notification

On March 13, 2013, Google announced on the Official Google Reader Blog that it would "spring clean" Google Reader:

we will soon retire Google Reader (the actual date is July 1, 2013)

Backing up your own data

Main page - google.com/reader/
Export via Google Takeout
- Contains subscriptions and starred items, but not tags
- Can be imported into The Old Reader
API: https://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI

Backing up the historical feed data

Google Reader acts as a cache for RSS/Atom feed content, keeping deleted posts and deleted blogs accessible (if you can recreate the RSS/Atom feed URL). After the Reader shutdown, this data might still be available via the Feeds API, but we'd like to grab most of this data before July 1 through the much more straightforward /reader/ API.
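As a rough illustration of what a /reader/ API request looks like, here is a minimal sketch that builds the stream URL for one feed. The URL shape is copied from the example URLs elsewhere on this page; the function name and the reading of n as an item count are assumptions, not confirmed behaviour of the API or of greader-grab.

```python
from urllib.parse import quote, urlencode

# Base path taken from the example API URLs on this page.
READER_API = "https://www.google.com/reader/api/0/stream/contents/feed/"

def reader_stream_url(feed_url, count=1000):
    """Build the /reader/ API URL for the cached items of feed_url.

    Hypothetical helper: `count` is assumed to map to the n parameter.
    """
    # The feed URL sits inside the path, so every reserved character
    # (":", "/", "?", "=") has to be percent-encoded.
    encoded = quote(feed_url, safe="")
    return READER_API + encoded + "?" + urlencode({"r": "n", "n": count})

print(reader_stream_url("http://example.blogspot.com/feeds/posts/default"))
```

Encoding with safe="" matters because the default for quote leaves "/" untouched, which would split the feed URL across path segments.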

Your help is needed

Give us your feed URLs

We need to discover as many feed URLs as possible. Not all of them can be discovered through crawling, so we need your OPML files. (If you have any private or password-protected feeds, please strip them out first.)

Upload OPML files and lists of URLs to:

http://allyourfeed.ludios.org:8080/
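If you would rather upload a plain list of URLs, or want to strip private feeds before sharing, something like the following sketch may help. It assumes a standard OPML export in which each feed is an outline element with an xmlUrl attribute; the function name and the skip_hosts filter are made up for illustration.

```python
import xml.etree.ElementTree as ET

def feed_urls_from_opml(opml_text, skip_hosts=()):
    """Return the xmlUrl of every <outline> in an OPML document.

    skip_hosts is a hypothetical filter: feeds whose URL contains one of
    these substrings are dropped (private or password-protected feeds).
    """
    root = ET.fromstring(opml_text)
    urls = []
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")  # folder outlines have no xmlUrl
        if url and not any(host in url for host in skip_hosts):
            urls.append(url)
    return urls

SAMPLE = """<?xml version="1.0"?>
<opml version="1.0"><body>
  <outline text="Blogs">
    <outline text="A" type="rss" xmlUrl="http://a.example.com/feed/"/>
    <outline text="B" type="rss" xmlUrl="http://intranet.example.org/rss"/>
  </outline>
</body></opml>"""

print(feed_urls_from_opml(SAMPLE, skip_hosts=["intranet.example.org"]))
```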

Run the grab on your Linux machine

This project is not in the Warrior yet, so follow the install steps on https://github.com/ArchiveTeam/greader-grab

(Up to ~5GB of your disk space will be used; items are immediately uploaded elsewhere.)

Crawl websites to discover blogs and usernames

We need to discover millions of blog/username URLs on popular blogging platforms (which we'll turn into feed URLs).

Join #donereading and #archiveteam on efnet if you'd like to help with this.

*.tumblr.com [12,065,345 discovered through IA and commoncrawl]
- http://USERNAME.tumblr.com/rss
*.livejournal.com [211,146 discovered through commoncrawl]
- http://USERNAME.livejournal.com/data/rss
- http://USERNAME.livejournal.com/data/atom
- http://USERNAME.livejournal.com/data/rss/
- http://USERNAME.livejournal.com/data/atom/
- http://www.livejournal.com/users/USERNAME/data/atom/ (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/rss/ (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/atom (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/rss (older feed location for users)
- http://community.livejournal.com/COMMUNITY/data/rss (older feed location for communities)
- http://community.livejournal.com/COMMUNITY/data/atom (older feed location for communities)
- http://www.livejournal.com/community/COMMUNITY/data/rss (older feed location for communities)
- http://www.livejournal.com/community/COMMUNITY/data/atom (older feed location for communities)
*.wordpress.com [1,319,787 discovered through commoncrawl]
- http://BLOGNAME.wordpress.com/feed/
- https://BLOGNAME.wordpress.com/feed/
- http://BLOGNAME.wordpress.com/feed
- https://BLOGNAME.wordpress.com/feed
- http://BLOGNAME.wordpress.com/comments/feed/
- https://BLOGNAME.wordpress.com/comments/feed
*.blogspot.com [4,179,274 discovered through commoncrawl]
- http://BLOGNAME.blogspot.com/feeds/posts/default
- http://BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
- http://BLOGNAME.blogspot.com/atom.xml (older feed)
- http://BLOGNAME.blogspot.com/rss.xml (older feed)
- http://www.BLOGNAME.blogspot.com/feeds/posts/default
- http://www.BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
- http://www.BLOGNAME.blogspot.com/atom.xml (older feed)
- http://www.BLOGNAME.blogspot.com/rss.xml (older feed)
http://feeds.feedburner.com/* [455,213 discovered through commoncrawl]
*.posterous.com [9,901,701 discovered through spidering and commoncrawl]
- http://USERNAME.posterous.com/rss.xml
- https://USERNAME.posterous.com/rss.xml
http://groups.google.com/group/* [13,966 discovered through commoncrawl]
- http://groups.google.com/group/GROUPNAME/feed/rss_v2_0_msgs.xml
- https://groups.google.com/group/GROUPNAME/feed/rss_v2_0_msgs.xml
- http://groups.google.com/group/GROUPNAME/feed/atom_v1_0_msgs.xml
- https://groups.google.com/group/GROUPNAME/feed/atom_v1_0_msgs.xml
http://groups.yahoo.com/group/*/ [48,352 discovered through commoncrawl]
- http://rss.groups.yahoo.com/group/GROUPNAME/rss
- http://groups.yahoo.com/group/GROUPNAME/messages?rss=1 (older feed)
*.typepad.com [77,983 domain-blogname pairs discovered through commoncrawl]
http://www.formspring.me/profile/USERNAME.rss
*.exblog.jp [114,359 discovered through commoncrawl]
- http://BLOGNAME.exblog.jp/index.xml
- http://BLOGNAME.exblog.jp/atom.xml
- http://rss.exblog.jp/rss/exblog/BLOGNAME/index.xml
- http://rss.exblog.jp/rss/exblog/BLOGNAME/atom.xml
http://blog.livedoor.jp/*
- http://blog.livedoor.jp/BLOGNAME/index.rdf
- http://blog.livedoor.jp/BLOGNAME/atom.xml
http://*.xanga.com/ (previously http://www.xanga.com/* )
- http://USERNAME.xanga.com/rss
- http://USERNAME.xanga.com/rss/
- http://www.xanga.com/rss.aspx?user=USERNAME
- http://www.xanga.com/USERNAME/rss
twitter.com/* [~40M discovered through various datasets]
- http://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
- https://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
- http://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
- https://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
- http://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
- https://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
- http://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
- https://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
- http://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
- https://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
- http://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME
- https://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME
- +lowercase USERNAME for each feed
- http://search.twitter.com/search.rss?q=* (check for feeds Reader already has cached)
- https://search.twitter.com/search.rss?q=* (same check as above)
- http://search.twitter.com/search.atom?q=* (same check as above)
- https://search.twitter.com/search.atom?q=* (same check as above)
facebook.com/*
- Has feeds for Pages; see http://ahrengot.com/tutorials/facebook-rss-feed/
plus.google.com/*
- http://rss2lj.net/g+/USER-ID
- http://gplusrss.com/rss/feed/[some kind of checksum or hash]
*.dreamwidth.org
*.blog.com
http://pipes.yahoo.com/pipes/pipe.run*
4chan.org
- Image Boards: http://boards.4chan.org/BOARD/index.rss (RSS)
- Image Boards: https://boards.4chan.org/BOARD/index.rss (RSS)
- Text Boards: http://dis.4chan.org/atom/BOARD (Atom)
- Text Boards: https://dis.4chan.org/atom/BOARD (Atom)
*.vox.com
- http://USERNAME.vox.com/library/posts/atom.xml
- http://USERNAME.vox.com/library/posts/atom-full.xml
- http://USERNAME.vox.com/library/posts/rss.xml
- http://USERNAME.vox.com/library/posts/rss-full.xml
- http://USERNAME.vox.com/library/photos/rss.xml (probably skip)
*.jux.com
- http://USERNAME.jux.com/quarks.rss
- https://USERNAME.jux.com/quarks.rss
*.at.webry.info
craigslist.org
Reddit feeds
- http://www.reddit.com/user/USERNAME/.rss
- https://pay.reddit.com/user/USERNAME/.rss
- http://www.reddit.com/user/USERNAME/comments/.rss
- https://pay.reddit.com/user/USERNAME/comments/.rss
- http://www.reddit.com/user/USERNAME/submitted/.rss
- https://pay.reddit.com/user/USERNAME/submitted/.rss
- http://www.reddit.com/r/SUBREDDIT/.rss
- https://pay.reddit.com/r/SUBREDDIT/.rss
- http://www.reddit.com/r/SUBREDDIT/top/.rss
- https://pay.reddit.com/r/SUBREDDIT/top/.rss
- http://www.reddit.com/r/SUBREDDIT/controversial/.rss
- https://pay.reddit.com/r/SUBREDDIT/controversial/.rss
- http://www.reddit.com/r/SUBREDDIT/new/.rss
- https://pay.reddit.com/r/SUBREDDIT/new/.rss
- +everything again with lowercased SUBREDDIT or USERNAME
http://blog.myspace.com/*
Windows Live Spaces feeds
- http://*.spaces.live.com/feed.rss
- http://*.spaces.live.com/blog/feed.rss
- http://*.spaces.live.com/photos/feed.rss
Old Hacker News feeds
- http://rss.searchyc.com/user/USERNAME
- http://rss.searchyc.com/user/USERNAME?only=comments
- http://rss.searchyc.com/user/USERNAME?sort=by_date
Less Wrong feeds
- http://lesswrong.com/user/USERNAME/overview/.rss
- http://lesswrong.com/user/USERNAME/submitted/.rss
- http://lesswrong.com/user/USERNAME/comments/.rss
http://www.quora.com/TOPIC/rss [101,265 discovered]
"shared items" feeds created by Reader users
- http://www.google.com/reader/public/atom/user/*/state/com.google/broadcast
  - e.g. http://www.google.com/reader/public/atom/user/06575532310267031409/state/com.google/broadcast
- Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/user/06575532310267031409/state/com.google/broadcast?r=n&n=1000
"generated feeds" created while the feature was available
- http://www.google.com/reader/public/atom/webfeed/*
  - e.g. http://www.google.com/reader/public/atom/webfeed/11571763057935010098
- Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/webfeed/11571763057935010098?r=n&n=1000
del.icio.us feeds
- Users: http://del.icio.us/rss/USERNAME
- Tags: http://del.icio.us/rss/tag/TAGNAME
- Popular: http://del.icio.us/rss/popular
- Popular tags: http://del.icio.us/rss/popular/TAGNAME
http://youtube.com/user/*
- http://www.youtube.com/rss/user/USERNAME/videos.rss (old feed)
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads
- https://gdata.youtube.com/feeds/api/users/USERNAME/uploads
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?max-results=50
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?alt=rss&max-results=50
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&client=ytapi-youtube-profile
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&orderby=published&client=ytapi-youtube-profile
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&client=ytapi-youtube-rss-redirect&v=2&orderby=updated (redirect from old feed)
... and many more (please add them above!)
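Turning a discovered username or blog name into feed URLs is mostly template expansion. The sketch below covers only three of the platforms listed above, with templates copied from that list; the function and dictionary names are hypothetical, and applying the lowercase variant to every platform (the list only asks for it on Twitter and Reddit) is an assumption.

```python
# Templates copied from the list above; {name} is the blog or user name.
TEMPLATES = {
    "tumblr": ["http://{name}.tumblr.com/rss"],
    "wordpress": [
        "http://{name}.wordpress.com/feed/",
        "https://{name}.wordpress.com/feed/",
        "http://{name}.wordpress.com/feed",
        "https://{name}.wordpress.com/feed",
    ],
    "reddit_user": [
        "http://www.reddit.com/user/{name}/.rss",
        "https://pay.reddit.com/user/{name}/.rss",
    ],
}

def candidate_feed_urls(platform, name):
    """Expand every template for platform, for name and its lowercase form."""
    urls = []
    for variant in (name, name.lower()):
        for template in TEMPLATES[platform]:
            url = template.format(name=variant)
            if url not in urls:  # no duplicates when name is already lowercase
                urls.append(url)
    return urls

for url in candidate_feed_urls("reddit_user", "SomeUser"):
    print(url)
```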

Tools for URL discovery

Custom crawls with wget, HTTrack, Python code, etc
https://commoncrawl.org/analysis-of-the-ncsu-library-urls-in-the-common-crawl-index/

git clone https://github.com/trivio/common_crawl_index
cd common_crawl_index
pip install --user boto
PYTHONPATH=. python bin/index_lookup_remote 'com.blogspot'

You can copy and edit bin/index_lookup_remote to print just the necessary information:

# Print entire URL:
	rest, scheme = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print(scheme + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path)

# Print just the subdomain:
	print('.'.join(url.split('/', 1)[0].split('.')[::-1]))

# Print just the first two URL /path segments:
	rest, scheme = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print(scheme + '://' + '.'.join(domain.split('.')[::-1]) + '/' + '/'.join(path.split('/', 2)[:2]))

# Print just the first URL /path segment:
	rest, scheme = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print(scheme + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path.split('/', 1)[0])

Pipe the output to | uniq | bzip2 > sitename-list.bz2, check it with bzless, and upload it to our OPML collector.
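For post-processing a saved list of index keys outside of index_lookup_remote, the same conversion can be written as standalone functions. This is a sketch under the assumption, inferred from the snippets above, that each key has the form reversed.domain/path:scheme; the function names are made up, and keys that do not end in a scheme suffix are not handled.

```python
def key_to_url(key):
    """Turn 'com.blogspot.example/feeds/posts:http' into a normal URL."""
    rest, scheme = key.rsplit(":", 1)      # scheme is the suffix after the last ":"
    domain, _, path = rest.partition("/")  # tolerate keys without a path
    host = ".".join(reversed(domain.split(".")))
    return scheme + "://" + host + "/" + path

def key_to_host(key):
    """Return only the un-reversed host name of an index key."""
    rest, _scheme = key.rsplit(":", 1)
    return ".".join(reversed(rest.partition("/")[0].split(".")))

keys = [
    "com.blogspot.example/feeds/posts/default:http",
    "com.blogspot.example/2013/06/post.html:http",
]
# A set removes duplicates regardless of input order, unlike uniq,
# which only collapses adjacent identical lines.
print(sorted({key_to_host(k) for k in keys}))
print(key_to_url(keys[0]))
```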

site:domain.com or site:domain.com/page/ searches using Google, Bing, startpage
http://dnshistory.org/subdomains/1/domain.com

Add to the above list of blog platforms

See:

http://taimoorsultan.com/list-of-25-blogging-platforms/
http://john.do/blogging-platforms/
http://mashable.com/2007/08/06/free-blog-hosts/
Many non-US blogging platforms
Feeds from dead sites: http://www.archiveteam.org/index.php?title=Deathwatch#Dead_as_a_Doornail

Crawl Google Reader itself for feeds

https://www.google.com/reader/directory/search?q=keyword-here

https://www.google.com/reader/directory/search?q=keyword-here&start=10
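A crawler for the directory search would need to walk the result pages for each keyword. The sketch below only generates the URLs; it assumes a page size of 10 (inferred from the start=10 example above), and the function name and pages parameter are hypothetical. Fetching and parsing the results is left out.

```python
from urllib.parse import urlencode

DIRECTORY_SEARCH = "https://www.google.com/reader/directory/search"

def directory_search_urls(keyword, pages=3, page_size=10):
    """Yield search URLs for the first `pages` result pages of keyword."""
    for page in range(pages):
        params = {"q": keyword}
        if page:  # the first page carries no start parameter
            params["start"] = page * page_size
        yield DIRECTORY_SEARCH + "?" + urlencode(params)

for url in directory_search_urls("keyword-here", pages=2):
    print(url)
```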

Make greader-grab not save the embedded styles and image on 404 pages

We get a ton of 404s from Reader's feed API, e.g. https://www.google.com/reader/api/0/stream/contents/feed/https%3A%2F%2Faws.amazon.com%2Frss%2404-this-please?r=n&n=100 and these 404 pages are bloating our WARCs. If greader-grab used hanzo's warc-tools to rewrite the .warc.gz (replacing the 404 responses) before uploading, we would save a ton of space.

Add gzip support to wget-lua

It would be quite helpful to have a wget-lua that supports gzip content encoding (vanilla wget doesn't support it either). This would speed up downloads and save a lot of bandwidth.

There have already been some attempts at making wget support gzip:

https://github.com/kravietz/wget-gzip (Windows-only; needs to work on Linux)

https://github.com/ptolts/wget-with-gzip-compression (based on a wget from 2003?)
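For anyone unfamiliar with what "gzip support" involves at the HTTP level, here is a minimal Python sketch of the two halves: advertise gzip in the request, then decompress the body only when the server says it used it. This is an illustration of the protocol, not a patch for wget-lua; the function name is hypothetical and the headers are assumed to arrive as a plain dict with canonical casing.

```python
import gzip

# Sent with the request to tell the server a gzip body is acceptable.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}

def decode_body(response_headers, body):
    """Return the decompressed body if the server gzip-encoded it."""
    encoding = response_headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    return body  # identity encoding: pass the bytes through unchanged

compressed = gzip.compress(b"<feed>...</feed>")
print(decode_body({"Content-Encoding": "gzip"}, compressed))
```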

External links

WARCs are landing at http://archive.org/details/archiveteam_greader
