Google Reader/War room

This page is an archive of Archive Team's Google Reader backup project, kept here for the historical record.

Backing up historical feed data

Google Reader acts as a cache for RSS/Atom feed content, keeping deleted posts and deleted blogs readable (if you can recreate the RSS/Atom feed URL). After the Reader shutdown, only a small portion (100 posts per blog) will be available via the Feeds API, so it is imperative we grab everything before July 1 through the /reader/ API.
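As a rough illustration of the kind of request involved, the sketch below builds a /reader/ API URL for a single feed. It assumes feed streams are addressed as feed/ followed by the percent-encoded feed URL, mirroring the user/ and webfeed/ API URLs listed further down this page. The helper name reader_api_url and its count parameter are hypothetical, and the actual grab scripts may build their requests differently.

```python
from urllib.parse import quote

# Base of the stream-contents endpoint, as seen in the example API URLs on this page.
API_BASE = "https://www.google.com/reader/api/0/stream/contents/"


def reader_api_url(feed_url, count=1000):
    """Return a /reader/ API URL for one feed (hypothetical helper).

    Assumption: a feed's stream ID is "feed/" plus the percent-encoded feed URL.
    The r=n and n=1000 parameters are copied from the examples on this page.
    """
    stream_id = "feed/" + quote(feed_url, safe="")
    return API_BASE + stream_id + "?r=n&n=" + str(count)


if __name__ == "__main__":
    print(reader_api_url("http://example.com/rss"))
```

Because Reader keys its cache on the exact feed URL, each spelling of a feed URL (trailing slash, http vs. https, and so on) would need its own request.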

How you can help

Upload your feed URLs

We need to discover as many feed URLs as possible. Not all of them can be discovered through crawling, so please upload your OPML files. (If you have any private or passworded feeds, please strip them out first.)

Upload OPML files and lists of URLs to:

http://allyourfeed.ludios.org:8080/
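If you would rather review your subscriptions before uploading, a minimal sketch like the following can pull the feed URLs out of an OPML export. It assumes the usual OPML layout, where each subscription is an outline element with an xmlUrl attribute. The function name feed_urls_from_opml is made up for this example. It only drops URLs with embedded credentials (user:password@host), so feeds that carry a secret token in the path or query still need a manual check.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def feed_urls_from_opml(opml_text):
    """Return feed URLs from an OPML document, skipping ones with credentials."""
    root = ET.fromstring(opml_text)
    urls = []
    # iter() walks nested outlines too, so feeds inside folders are included.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if not url:
            continue  # folder or other outline without a feed URL
        parts = urlsplit(url)
        if parts.username or parts.password:
            continue  # passworded feed: do not upload
        urls.append(url)
    return urls


if __name__ == "__main__":
    sample = (
        '<opml version="1.0"><body>'
        '<outline text="Public" xmlUrl="http://example.com/feed"/>'
        '<outline text="Private" xmlUrl="http://user:secret@example.org/p.rss"/>'
        "</body></opml>"
    )
    for feed in feed_urls_from_opml(sample):
        print(feed)
```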

Run the grab on your Linux machine

This project is not in the Warrior yet, so follow the install steps on these projects:

https://github.com/ArchiveTeam/greader-grab (grabs the actual text content of feeds)

https://github.com/ArchiveTeam/greader-directory-grab (searches for feeds using Reader's Feed Directory)

https://github.com/ArchiveTeam/greader-stats-grab (grabs subscriber counts and other data)

(Up to ~5GB of your disk space will be used; items are immediately uploaded elsewhere.)

Crawl websites to discover blogs and usernames

We need to discover millions of blog/username URLs on popular blogging platforms (which we'll turn into feed URLs).

Join #donereading and #archiveteam on EFnet if you'd like to help with this.

The counts listed below are underestimates; please ask on IRC for updated counts.

See https://github.com/ludios/greader-item-maker/blob/master/url_filter.py for additional sites not listed here.

*.tumblr.com [12,065,345 discovered through IA and commoncrawl]
- http://USERNAME.tumblr.com/rss
*.livejournal.com [211,146 discovered through commoncrawl]
- http://USERNAME.livejournal.com/data/rss
- http://USERNAME.livejournal.com/data/atom
- http://USERNAME.livejournal.com/data/rss/
- http://USERNAME.livejournal.com/data/atom/
- http://www.livejournal.com/users/USERNAME/data/atom/ (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/rss/ (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/atom (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/rss (older feed location for users)
- http://community.livejournal.com/COMMUNITY/data/rss (older feed location for communities)
- http://community.livejournal.com/COMMUNITY/data/atom (older feed location for communities)
- http://www.livejournal.com/community/COMMUNITY/data/rss (older feed location for communities)
- http://www.livejournal.com/community/COMMUNITY/data/atom (older feed location for communities)
*.wordpress.com [1,319,787 discovered through commoncrawl]
- http://BLOGNAME.wordpress.com/feed/
- https://BLOGNAME.wordpress.com/feed/
- http://BLOGNAME.wordpress.com/feed/atom/
- https://BLOGNAME.wordpress.com/feed/atom/ (probably low hit rate)
- http://BLOGNAME.wordpress.com/feed/rss/
- https://BLOGNAME.wordpress.com/feed/rss/ (probably low hit rate)
- http://BLOGNAME.wordpress.com/feed
- https://BLOGNAME.wordpress.com/feed
- http://BLOGNAME.wordpress.com/comments/feed/
- https://BLOGNAME.wordpress.com/comments/feed
WordPress blogs not on wordpress.com (easily identified by URLs containing "wp-content" or, with some false positives, by searching for '[0-9]{4}/[0-9]{2}/')
- SCHEMA+DOMAIN/feed/
- SCHEMA+DOMAIN/feed
- SCHEMA+DOMAIN/feed/rss/
- SCHEMA+DOMAIN/feed/atom/
- SCHEMA+DOMAIN/comments/feed/
- SCHEMA+DOMAIN/comments/feed
*.blogspot.com [4,179,274 discovered through commoncrawl]
- http://BLOGNAME.blogspot.com/feeds/posts/default
- http://BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
- http://BLOGNAME.blogspot.com/atom.xml (older feed)
- http://BLOGNAME.blogspot.com/rss.xml (older feed)
- http://www.BLOGNAME.blogspot.com/feeds/posts/default
- http://www.BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
- http://www.BLOGNAME.blogspot.com/atom.xml (older feed)
- http://www.BLOGNAME.blogspot.com/rss.xml (older feed)
- http://BLOGNAME.blogspot.com/feeds/THREADID/comments/default
  - e.g. http://digicmb.blogspot.com/feeds/206744415950084609/comments/default
blogger.com feeds, mostly redundant with blogspot.com
- http://www.blogger.com/feeds/*/posts/default
- http://www.blogger.com/feeds/*/posts/default?alt=rss
- https://www.blogger.com/feeds/*/posts/default
- https://www.blogger.com/feeds/*/posts/default?alt=rss
http://feeds.feedburner.com/* [455,213 discovered through commoncrawl]
- http://feeds.feedburner.com/FEEDNAME
- +lowercase FEEDNAME
http://feeds2.feedburner.com/*
- http://feeds2.feedburner.com/FEEDNAME
- +lowercase FEEDNAME
http://feeds.rapidfeeds.com/*/ [generated 1-60,000]
- e.g. http://feeds.rapidfeeds.com/35746/
*.posterous.com [9,901,701 discovered through spidering and commoncrawl]
- http://USERNAME.posterous.com/rss.xml
- https://USERNAME.posterous.com/rss.xml
http://groups.google.com/group/* [13,966 discovered through commoncrawl]
- http://groups.google.com/group/GROUPNAME/feed/rss_v2_0_msgs.xml
- https://groups.google.com/group/GROUPNAME/feed/rss_v2_0_msgs.xml
- http://groups.google.com/group/GROUPNAME/feed/atom_v1_0_msgs.xml
- https://groups.google.com/group/GROUPNAME/feed/atom_v1_0_msgs.xml
http://groups.yahoo.com/group/*/ [48,352 discovered through commoncrawl]
- http://rss.groups.yahoo.com/group/GROUPNAME/rss
- http://groups.yahoo.com/group/GROUPNAME/messages?rss=1 (older feed)
- +lowercase GROUPNAME
*.typepad.com [77,983 domain-blogname pairs discovered through commoncrawl]
*.typepad.jp
http://blog.roodo.com/*
*.diarynote.jp
ameblo.jp/*
http://www.wretch.cc/blog/*
http://www.formspring.me/profile/USERNAME.rss
*.blog.shinobi.jp
*.exblog.jp [114,359 discovered through commoncrawl]
- http://BLOGNAME.exblog.jp/index.xml
- http://BLOGNAME.exblog.jp/atom.xml
- http://rss.exblog.jp/rss/exblog/BLOGNAME/index.xml
- http://rss.exblog.jp/rss/exblog/BLOGNAME/atom.xml
http://*.blog.hexun.com
- http://USERNAME.blog.hexun.com/rss2.aspx
- http://fulltextrssfeed.com/USERNAME.blog.hexun.com/rss2.aspx
http://*.blog.hexun.com.tw
- http://USERNAME.blog.hexun.com.tw/rss2.aspx
  - e.g. http://bbc1030.blog.hexun.com.tw/rss2.aspx
- http://fulltextrssfeed.com/USERNAME.blog.hexun.com.tw/rss2.aspx
http://blog.livedoor.jp/*
- http://blog.livedoor.jp/BLOGNAME/index.rdf
- http://blog.livedoor.jp/BLOGNAME/atom.xml
http://*.altervista.org/
http://*.qzone.qq.com/
- http://feeds.qzone.qq.com/cgi-bin/cgi_rss_out?uin=QQID
- e.g. http://feeds.qzone.qq.com/cgi-bin/cgi_rss_out?uin=469826844
http://*.blog.163.com/rss/
- http://USERNAME.blog.163.com/rss/
- e.g. http://hxcy1965.blog.163.com/rss/
http://*.inube.com/
http://*.my.nero.com/
- https://www.google.com/reader/view/#stream/feed%2Fhttp%3A%2F%2Frss.my.nero.com%2FlatestUser lists the usernames
http://www.feed43.com/*
- e.g. http://feed43.com/6237213781584644.xml
http://*.blog4ever.com/
http://*.xanga.com/ (previously http://www.xanga.com/* )
- http://USERNAME.xanga.com/rss
- http://USERNAME.xanga.com/rss/
- http://www.xanga.com/rss.aspx?user=USERNAME
- http://www.xanga.com/USERNAME/rss
http://*.pixnet.net/
- http://feed.pixnet.net/blog/posts/rss/USERNAME
- http://feed.pixnet.net/blog/posts/atom/USERNAME
twitter.com/* [~40M discovered through various datasets]
- http://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
- https://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
- http://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
- https://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
- http://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
- https://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
- http://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
- https://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
- http://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
- https://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
- http://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME [very low hit rate]
- https://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME [very low hit rate]
- +lowercase USERNAME for each feed
- http://search.twitter.com/search.rss?q=* (check for feeds Reader already has cached)
- https://search.twitter.com/search.rss?q=* (as above)
- http://search.twitter.com/search.atom?q=* (as above)
- https://search.twitter.com/search.atom?q=* (as above)
facebook.com/*
- Has feeds for Pages; see http://ahrengot.com/tutorials/facebook-rss-feed/
- Has feeds for Groups as well; see https://apps.facebook.com/groups_to_rss/
plus.google.com/*
- http://rss2lj.net/g+/USER-ID
- http://gplusrss.com/rss/feed/[some kind of checksum or hash]
- http://www.googleplusfeed.net/feed/USER-ID
  - e.g. http://www.googleplusfeed.net/feed/115030581977322198102
*.dreamwidth.org
*.blog.com
http://pipes.yahoo.com/pipes/pipe.run*
- You can search for feeds, e.g. http://pipes.yahoo.com/pipes/search?r=source%3Afeeds.feedburner.com
http://page2rss.com/rss/*
- e.g. http://page2rss.com/rss/0f57ce71ebdd24878485c8d3624c3819
http://page2rss.com/atom/*
- e.g. http://page2rss.com/atom/ae56d7ac85827977bcf0aa7857f3f309
4chan.org
- Image Boards: http://boards.4chan.org/BOARD/index.rss (RSS)
- Image Boards: https://boards.4chan.org/BOARD/index.rss (RSS)
- Text Boards: http://dis.4chan.org/atom/BOARD (Atom)
- Text Boards: https://dis.4chan.org/atom/BOARD (Atom)
*.vox.com
- http://USERNAME.vox.com/library/posts/atom.xml
- http://USERNAME.vox.com/library/posts/atom-full.xml
- http://USERNAME.vox.com/library/posts/rss.xml
- http://USERNAME.vox.com/library/posts/rss-full.xml
- http://USERNAME.vox.com/library/photos/rss.xml (probably skip)
*.jux.com
- http://USERNAME.jux.com/quarks.rss
- https://USERNAME.jux.com/quarks.rss
*.at.webry.info
http://www.rsspect.com/*
- e.g. http://www.rsspect.com/rss/vagrant.xml
http://buzz.googleapis.com/feeds/*/public/posted
- e.g. http://buzz.googleapis.com/feeds/112778807045063877346/public/posted
craigslist.org
http://www.mail-archive.com/*/maillist.xml
- e.g. http://www.mail-archive.com/linux-zigbee-devel@lists.sourceforge.net/maillist.xml
Reddit users
- http://www.reddit.com/user/USERNAME/.rss
- https://pay.reddit.com/user/USERNAME/.rss (very low hit rate)
- http://www.reddit.com/user/USERNAME/comments/.rss
- https://pay.reddit.com/user/USERNAME/comments/.rss (very low hit rate)
- http://www.reddit.com/user/USERNAME/submitted/.rss
- https://pay.reddit.com/user/USERNAME/submitted/.rss (very low hit rate)
- +everything again with lowercased USERNAME
Subreddits [152,042 found]
- http://www.reddit.com/r/SUBREDDIT/.rss
- https://pay.reddit.com/r/SUBREDDIT/.rss
- http://www.reddit.com/r/SUBREDDIT/top/.rss
- https://pay.reddit.com/r/SUBREDDIT/top/.rss
- http://www.reddit.com/r/SUBREDDIT/controversial/.rss
- https://pay.reddit.com/r/SUBREDDIT/controversial/.rss
- http://www.reddit.com/r/SUBREDDIT/new/.rss
- https://pay.reddit.com/r/SUBREDDIT/new/.rss
- +everything again with lowercased SUBREDDIT
http://blog.myspace.com/blog/rss.cfm?friendID=FRIENDID (+ https?)
- Are these the blogs that myspace deleted on 2013-06-14?
- e.g. http://www.google.com/reader/view/#stream/feed%2Fhttp%3A%2F%2Fblog.myspace.com%2Fblog%2Frss.cfm%3FfriendID%3D181926159
Windows Live Spaces feeds
- http://*.spaces.live.com/feed.rss
- http://*.spaces.live.com/blog/feed.rss
- http://*.spaces.live.com/photos/feed.rss
Old Hacker News feeds
- http://rss.searchyc.com/user/USERNAME
- http://rss.searchyc.com/user/USERNAME?only=comments
- http://rss.searchyc.com/user/USERNAME?only=comments&sort=by_date
- http://rss.searchyc.com/user/USERNAME?sort=by_date
- http://rss.searchyc.com/USERNAME?sort=by_date
Less Wrong feeds
- http://lesswrong.com/user/USERNAME/overview/.rss
- http://lesswrong.com/user/USERNAME/submitted/.rss
- http://lesswrong.com/user/USERNAME/comments/.rss
Quora feeds
- http://www.quora.com/TOPIC/rss [101,265 discovered]
- http://www.quora.com/USERNAME/rss
- http://www.quora.com/USERNAME/questions/rss
- http://www.quora.com/USERNAME/answers/rss
"shared items" feeds created by Reader users
- http://www.google.com/reader/public/atom/user/*/state/com.google/broadcast
  - e.g. http://www.google.com/reader/public/atom/user/06575532310267031409/state/com.google/broadcast
- Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/user/06575532310267031409/state/com.google/broadcast?r=n&n=1000
"generated feeds" created while the feature was available
- http://www.google.com/reader/public/atom/webfeed/*
  - e.g. http://www.google.com/reader/public/atom/webfeed/11571763057935010098
- Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/webfeed/11571763057935010098?r=n&n=1000
http://www.kickstarter.com/projects/PROJECTID/PROJECTNAME/posts.atom
- e.g. http://www.kickstarter.com/projects/306316578/light-table/posts.atom
del.icio.us feeds
- Users: http://del.icio.us/rss/USERNAME
- Tags: http://del.icio.us/rss/tag/TAGNAME
- Popular: http://del.icio.us/rss/popular
- Popular tags: http://del.icio.us/rss/popular/TAGNAME
http://youtube.com/user/*
- http://www.youtube.com/rss/user/USERNAME/videos.rss (old feed)
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads
- https://gdata.youtube.com/feeds/api/users/USERNAME/uploads
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?max-results=50
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?alt=rss&max-results=50
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&client=ytapi-youtube-profile
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&orderby=published&client=ytapi-youtube-profile
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&client=ytapi-youtube-rss-redirect&v=2&orderby=updated (redirect from old feed)
http://*.multiply.com/
- http://USERNAME.multiply.com/feed.rss
- http://USERNAME.multiply.com/feed
http://bandcamp.com
- http://USERNAME.bandcamp.com/feed
  - Artist pages list their fans, and fan pages list the artists they follow; crawling both lets you map out the bandcamp userbase.
- http://USERNAME.bandcamp.com/feed/album/ALBUMNAME
  - For obvious reasons, the album needs to have been published by the given username.
http://vimeo.com/USERNAME
- http://vimeo.com/USERNAME/videos/rss
- https://vimeo.com/USERNAME/videos/rss
  - e.g. https://vimeo.com/chriskpalmer/videos/rss
... and many more (please add them above!)
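The patterns above lend themselves to simple template expansion. The following is a minimal sketch of turning a discovered username or feed name into candidate feed URLs, including the lowercase variants that several entries call for. Only a few of the platforms above are included, and the TEMPLATES table and the candidate_feeds function are illustrative names rather than part of any of the project's tools.

```python
# A small subset of the URL patterns listed above, keyed by platform.
TEMPLATES = {
    "tumblr": ["http://{name}.tumblr.com/rss"],
    "livejournal": [
        "http://{name}.livejournal.com/data/rss",
        "http://{name}.livejournal.com/data/atom",
        "http://{name}.livejournal.com/data/rss/",
        "http://{name}.livejournal.com/data/atom/",
    ],
    "feedburner": [
        "http://feeds.feedburner.com/{name}",
        "http://feeds2.feedburner.com/{name}",
    ],
}


def candidate_feeds(platform, name, add_lowercase=True):
    """Expand one discovered name into every candidate feed URL for a platform."""
    names = [name]
    # Reader caches by exact URL, so a lowercased name is a distinct candidate.
    if add_lowercase and name.lower() != name:
        names.append(name.lower())
    return [tpl.format(name=n) for n in names for tpl in TEMPLATES[platform]]


if __name__ == "__main__":
    for url in candidate_feeds("feedburner", "TechCrunch"):
        print(url)
```

Generating every variant costs little, since a candidate that Reader never cached simply returns nothing.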

Tools for URL discovery

Custom crawls with wget, HTTrack, Python code, etc
https://commoncrawl.org/analysis-of-the-ncsu-library-urls-in-the-common-crawl-index/

git clone https://github.com/trivio/common_crawl_index
cd common_crawl_index
pip install --user boto
PYTHONPATH=. python bin/index_lookup_remote 'com.blogspot'

You can copy and edit bin/index_lookup_remote to print just the necessary information:

# Print entire URL:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print(schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path)

# Print just the subdomain:
	print('.'.join(url.split('/', 1)[0].split('.')[::-1]))

# Print just the first two URL /path segments:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print(schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + '/'.join(path.split('/', 2)[0:2]))

# Print just the first URL /path segment:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print(schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + '/'.join(path.split('/', 1)[0:1]))

Pipe the output to | uniq | bzip2 > sitename-list.bz2, check it with bzless, and upload it to our OPML collector.
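For post-processing a saved index dump outside of index_lookup_remote, the same transformations can be sketched as standalone functions. This assumes, based on the snippets above, that each index key looks like com.blogspot.example/path:http (reversed domain, then path, then scheme after the last colon). The function names are hypothetical. Unlike uniq, which only collapses adjacent duplicate lines, unique_hosts removes duplicates regardless of order.

```python
def index_key_to_url(key):
    """Convert an index key such as 'com.blogspot.example/a/b:http' to a URL."""
    rest, scheme = key.rsplit(":", 1)       # scheme follows the last colon
    domain, _, path = rest.partition("/")   # reversed domain, then the path
    host = ".".join(reversed(domain.split(".")))
    return scheme + "://" + host + "/" + path


def unique_hosts(keys):
    """Return each hostname once, in first-seen order."""
    seen = set()
    hosts = []
    for key in keys:
        domain = key.rsplit(":", 1)[0].partition("/")[0]
        host = ".".join(reversed(domain.split(".")))
        if host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    keys = [
        "com.blogspot.example/2013/06/post.html:http",
        "com.blogspot.example/feeds/posts/default:http",
    ]
    print(index_key_to_url(keys[0]))
    print(unique_hosts(keys))
```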

site:domain.com or site:domain.com/page/ searches using Google, Bing, startpage
http://dnshistory.org/subdomains/1/domain.com

Add to the above list of blog platforms

See:

http://taimoorsultan.com/list-of-25-blogging-platforms/
http://john.do/blogging-platforms/
http://mashable.com/2007/08/06/free-blog-hosts/
Many non-US blogging platforms
Feeds from dead sites: http://www.archiveteam.org/index.php?title=Deathwatch#Dead_as_a_Doornail
