Difference between revisions of "Google Reader"
Revision as of 05:08, 5 June 2013
Google Reader | |
URL | http://www.google.com/reader/ |
Status | Online! |
Archiving status | |
Archiving type | Unknown |
Project source | https://github.com/ArchiveTeam/greader-grab |
Project tracker | N/A |
IRC channel | #donereading (on hackint) |
Shutdown notification
On March 13, 2013, Google announced on the Official Google Reader Blog that it would "spring clean" Google Reader:
we will soon retire Google Reader (the actual date is July 1, 2013)
Backing up your own data
- Main page - google.com/reader/
- Export via Google Takeout
- Contains subscriptions and starred items, but not tags
- Can be imported into The Old Reader
- API: https://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI
Backing up the historical feed data
Google Reader acts as a cache for RSS/Atom feed content, keeping deleted posts and deleted blogs accessible (if you can recreate the RSS/Atom feed URL). After the Reader shutdown, this data might still be available via the Feeds API, but we'd like to grab most of this data before July 1 through the much more straightforward /reader/ API.
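As a rough illustration of what "the /reader/ API" means here, the sketch below builds a stream-contents URL for a single feed. The URL shape is inferred from the example API URLs elsewhere on this page, not from any official documentation. The names `reader_api_url`, `API_BASE` and `count` are made up for this example, and the `r=n` parameter is copied verbatim from those examples without knowing exactly what it controls.

```python
from urllib.parse import quote

# Assumed base path, taken from the example API URLs on this page.
API_BASE = "https://www.google.com/reader/api/0/stream/contents/feed/"


def reader_api_url(feed_url, count=1000):
    """Build a stream-contents URL for one feed (hypothetical helper).

    The whole feed URL is percent-encoded; safe="" makes quote() encode
    ':' and '/' as well, matching the encoded examples on this page.
    """
    return f"{API_BASE}{quote(feed_url, safe='')}?r=n&n={count}"


print(reader_api_url("http://example.blogspot.com/feeds/posts/default", count=100))
```

The real grabbing is done by greader-grab; this only shows how a feed URL ends up inside the API URL.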
Your help is needed
Give us your feed URLs
We need to discover as many feed URLs as possible. Not all of them can be discovered through crawling, so we need your OPML files. (If you have any private or password-protected feeds, please strip them out first.)
Upload OPML files and lists of URLs to:
http://allyourfeed.ludios.org:8080/
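If you would rather upload a plain list of URLs than a whole OPML file, something like the sketch below may help. It assumes the usual OPML layout in which each subscription is an `<outline>` element with an `xmlUrl` attribute. The `looks_private` check and its `SECRET_PARAMS` list are a crude, made-up heuristic, so please still review the output by hand before uploading.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

# Hypothetical list of query parameter names that hint at a private feed.
SECRET_PARAMS = {"token", "key", "auth", "password"}


def feed_urls_from_opml(opml_text):
    """Return the xmlUrl of every <outline> element, in document order."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]


def looks_private(url):
    """Guess whether a feed URL carries credentials (heuristic only)."""
    parts = urlsplit(url)
    if parts.username or parts.password:
        return True
    return any(name.lower() in SECRET_PARAMS for name in parse_qs(parts.query))


SAMPLE = """<opml version="1.0">
  <body>
    <outline text="Blogs">
      <outline text="A" type="rss" xmlUrl="http://a.example.com/rss"/>
      <outline text="B" type="rss" xmlUrl="http://user:secret@b.example.com/feed"/>
    </outline>
  </body>
</opml>"""

public = [u for u in feed_urls_from_opml(SAMPLE) if not looks_private(u)]
print("\n".join(public))
```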
Install the ArchiveTeam Warrior, or run the pipeline on your Linux machine
Install the ArchiveTeam Warrior and have it run ArchiveTeam's Choice:
http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior
Google Reader will (probably) soon become the primary job.
If you cannot use the Warrior, follow the instructions on https://github.com/ArchiveTeam/greader-grab
Crawl websites to discover blogs and usernames
We need to discover millions of blog/username URLs on popular blogging platforms (which we'll turn into feed URLs).
Join #donereading and #archiveteam on efnet if you'd like to help with this.
- *.tumblr.com [12,065,345 discovered through IA and commoncrawl]
- http://USERNAME.tumblr.com/rss
- *.livejournal.com [211,146 discovered through commoncrawl]
- http://USERNAME.livejournal.com/data/rss
- http://USERNAME.livejournal.com/data/atom
- http://USERNAME.livejournal.com/data/rss/
- http://USERNAME.livejournal.com/data/atom/
- http://www.livejournal.com/users/USERNAME/data/atom/ (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/rss/ (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/atom (older feed location for users)
- http://www.livejournal.com/users/USERNAME/data/rss (older feed location for users)
- http://community.livejournal.com/COMMUNITY/data/rss (older feed location for communities)
- http://community.livejournal.com/COMMUNITY/data/atom (older feed location for communities)
- http://www.livejournal.com/community/COMMUNITY/data/rss (older feed location for communities)
- http://www.livejournal.com/community/COMMUNITY/data/atom (older feed location for communities)
- *.wordpress.com [1,319,787 discovered through commoncrawl]
- *.blogspot.com [4,179,274 discovered through commoncrawl]
- http://BLOGNAME.blogspot.com/feeds/posts/default
- http://BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
- http://BLOGNAME.blogspot.com/atom.xml (older feed)
- http://BLOGNAME.blogspot.com/rss.xml (older feed)
- http://www.BLOGNAME.blogspot.com/feeds/posts/default
- http://www.BLOGNAME.blogspot.com/feeds/posts/default?alt=rss
- http://www.BLOGNAME.blogspot.com/atom.xml (older feed)
- http://www.BLOGNAME.blogspot.com/rss.xml (older feed)
- http://feeds.feedburner.com/* [455,213 discovered through commoncrawl]
- *.posterous.com [9,901,701 discovered through spidering and commoncrawl]
- http://groups.google.com/group/* [13,966 discovered through commoncrawl]
- http://groups.yahoo.com/group/*/ [48,352 discovered through commoncrawl]
- *.typepad.com [102,384 discovered through commoncrawl]
- http://www.formspring.me/profile/USERNAME.rss
- *.exblog.jp [114,359 discovered through commoncrawl]
- http://blog.livedoor.jp/*
- http://*.xanga.com/ (previously http://www.xanga.com/* )
- http://USERNAME.xanga.com/rss
- http://USERNAME.xanga.com/rss/
- http://www.xanga.com/rss.aspx?user=USERNAME
- http://www.xanga.com/USERNAME/rss
- twitter.com/* [206,125 discovered through commoncrawl]
- http://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
- https://twitter.com/statuses/user_timeline/USER-ID.rss (older feed)
- http://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
- https://twitter.com/statuses/user_timeline/USER-ID.atom (older feed)
- http://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
- https://twitter.com/statuses/user_timeline/USERNAME.rss (older feed)
- http://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
- https://twitter.com/statuses/user_timeline/USERNAME.atom (older feed)
- http://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
- https://api.twitter.com/1/statuses/user_timeline.rss?screen_name=USERNAME
- http://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME
- https://api.twitter.com/1/statuses/user_timeline.atom?screen_name=USERNAME
- http://search.twitter.com/search.rss?q=* (check for feeds Reader already has cached)
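- https://search.twitter.com/search.rss?q=* (check for feeds Reader already has cached)
- http://search.twitter.com/search.atom?q=* (check for feeds Reader already has cached)
- https://search.twitter.com/search.atom?q=* (check for feeds Reader already has cached)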
- facebook.com/*
- Has feeds for Pages; see http://ahrengot.com/tutorials/facebook-rss-feed/
- plus.google.com/*
- *.dreamwidth.org
- *.blog.com
- 4chan.org
- Image Boards: http://boards.4chan.org/BOARD/index.rss (RSS)
- Image Boards: https://boards.4chan.org/BOARD/index.rss (RSS)
- Text Boards: http://dis.4chan.org/atom/BOARD (Atom)
- Text Boards: https://dis.4chan.org/atom/BOARD (Atom)
- *.vox.com
- *.jux.com
- *.at.webry.info
- craigslist.org
- Reddit feeds
- http://www.reddit.com/user/USERNAME/.rss
- https://pay.reddit.com/user/USERNAME/.rss
- http://www.reddit.com/user/USERNAME/comments/.rss
- https://pay.reddit.com/user/USERNAME/comments/.rss
- http://www.reddit.com/user/USERNAME/submitted/.rss
- https://pay.reddit.com/user/USERNAME/submitted/.rss
- http://www.reddit.com/r/SUBREDDIT/.rss
- https://pay.reddit.com/r/SUBREDDIT/.rss
- http://www.reddit.com/r/SUBREDDIT/top/.rss
- https://pay.reddit.com/r/SUBREDDIT/top/.rss
- http://www.reddit.com/r/SUBREDDIT/controversial/.rss
- https://pay.reddit.com/r/SUBREDDIT/controversial/.rss
- http://www.reddit.com/r/SUBREDDIT/new/.rss
- https://pay.reddit.com/r/SUBREDDIT/new/.rss
- http://blog.myspace.com/*
- Windows Live Spaces feeds
- Old Hacker News feeds
- Less Wrong feeds
- http://www.quora.com/TOPIC/rss [101,265 discovered]
- "shared items" feeds created by Reader users
- http://www.google.com/reader/public/atom/user/*/state/com.google/broadcast
- Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/user/06575532310267031409/state/com.google/broadcast?r=n&n=1000
- "generated feeds" created while the feature was available
- http://www.google.com/reader/public/atom/webfeed/*
- Probably download these through the special API URL, e.g. https://www.google.com/reader/api/0/stream/contents/webfeed/11571763057935010098?r=n&n=1000
- del.icio.us feeds
- Users: http://del.icio.us/rss/USERNAME
- Tags: http://del.icio.us/rss/tag/TAGNAME
- Popular: http://del.icio.us/rss/popular
- Popular tags: http://del.icio.us/rss/popular/TAGNAME
- http://youtube.com/user/*
- http://www.youtube.com/rss/user/USERNAME/videos.rss (old feed)
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads
- https://gdata.youtube.com/feeds/api/users/USERNAME/uploads
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?max-results=50
- http://gdata.youtube.com/feeds/api/users/USERNAME/uploads?alt=rss&max-results=50
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&client=ytapi-youtube-profile
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&v=2&orderby=published&client=ytapi-youtube-profile
- http://gdata.youtube.com/feeds/base/users/USERNAME/uploads?alt=rss&client=ytapi-youtube-rss-redirect&v=2&orderby=updated (redirect from old feed)
- ... and many more (please add them above!)
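To show how a discovered username turns into feed URLs, here is a minimal sketch that expands a name into the URL patterns listed above. Only a few platforms and patterns are included as examples. The `FEED_TEMPLATES` table and `candidate_feed_urls` function are illustrative names, not part of greader-grab.

```python
# A small subset of the feed URL patterns from the list above.
FEED_TEMPLATES = {
    "tumblr": ["http://{name}.tumblr.com/rss"],
    "livejournal": [
        "http://{name}.livejournal.com/data/rss",
        "http://{name}.livejournal.com/data/atom",
    ],
    "blogspot": [
        "http://{name}.blogspot.com/feeds/posts/default",
        "http://{name}.blogspot.com/feeds/posts/default?alt=rss",
        "http://{name}.blogspot.com/atom.xml",
        "http://{name}.blogspot.com/rss.xml",
    ],
}


def candidate_feed_urls(platform, name):
    """Expand one username or blog name into every known pattern for a platform."""
    return [template.format(name=name) for template in FEED_TEMPLATES[platform]]


for url in candidate_feed_urls("blogspot", "example"):
    print(url)
```

Several variants per name are generated on purpose: Reader caches feeds by their exact URL, so an older or alternate feed location may hold data that the current one does not.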
Tools for URL discovery
- Custom crawls with wget, HTTrack, Python code, etc
- https://commoncrawl.org/analysis-of-the-ncsu-library-urls-in-the-common-crawl-index/
    git clone https://github.com/trivio/common_crawl_index
    cd common_crawl_index
    pip install --user boto
    PYTHONPATH=. python bin/index_lookup_remote 'com.blogspot'
- site:domain.com or site:domain.com/page/ searches using Google, Bing, startpage
- http://dnshistory.org/subdomains/1/domain.com
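Whatever tool produces the raw URL list, the next step is usually to boil it down to unique blog names. The sketch below does that for a few subdomain-based platforms. The `PLATFORM_SUFFIXES` tuple and `blog_name` function are assumptions made for this example, and hosts with extra subdomain levels are simply skipped.

```python
from urllib.parse import urlsplit

# Hypothetical subset of platforms that put the blog name in the subdomain.
PLATFORM_SUFFIXES = (".tumblr.com", ".blogspot.com", ".livejournal.com", ".wordpress.com")


def blog_name(url):
    """Return (platform, name) for a subdomain-style blog URL, else None."""
    host = (urlsplit(url).hostname or "").lower()
    for suffix in PLATFORM_SUFFIXES:
        if host.endswith(suffix):
            name = host[: -len(suffix)]
            if name.startswith("www."):
                name = name[4:]  # treat www.NAME.platform as NAME.platform
            if name and name != "www" and "." not in name:
                return suffix.lstrip("."), name
    return None


crawled = [
    "http://www.example.blogspot.com/2013/05/post.html",
    "http://example.blogspot.com/",
    "http://www.tumblr.com/about",
    "http://another.tumblr.com/post/1",
]
names = sorted({hit for hit in map(blog_name, crawled) if hit})
print(names)
```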
Add to the above list of blog platforms
See:
- http://taimoorsultan.com/list-of-25-blogging-platforms/
- http://john.do/blogging-platforms/
- http://mashable.com/2007/08/06/free-blog-hosts/
- Many non-US blogging platforms
- Feeds from dead sites: http://www.archiveteam.org/index.php?title=Deathwatch#Dead_as_a_Doornail
Crawl Google Reader itself for feeds
https://www.google.com/reader/directory/search?q=keyword-here
https://www.google.com/reader/directory/search?q=keyword-here&start=10
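A crawl of the directory search could start from generated URLs like the ones above. The sketch below only builds the URLs and does not fetch anything. It assumes results are paged in steps of 10, which is a guess based on the `start=10` example; `directory_search_urls`, `pages` and `page_size` are hypothetical names.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.google.com/reader/directory/search"


def directory_search_urls(keyword, pages=3, page_size=10):
    """Build search URLs for the first few result pages of one keyword."""
    urls = []
    for page in range(pages):
        params = {"q": keyword}
        if page > 0:
            # Assumed paging scheme: start = offset of the first result.
            params["start"] = page * page_size
        urls.append(f"{SEARCH_BASE}?{urlencode(params)}")
    return urls


for url in directory_search_urls("keyword-here", pages=2):
    print(url)
```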
Make greader-grab not save the embedded styles and image on 404 pages
We get a ton of 404s from Reader's feed API, e.g. https://www.google.com/reader/api/0/stream/contents/feed/https%3A%2F%2Faws.amazon.com%2Frss%2404-this-please?r=n&n=100 and these 404 pages are bloating our WARCs. If greader-grab used hanzo's warc-tools to rewrite the .warc.gz (replacing the 404 responses) before uploading, we would save a ton of space.
Add gzip support to wget-lua
It would be quite helpful to have a wget-lua that supports gzip content encoding (vanilla wget doesn't support it either). This would speed up downloads and save a lot of bandwidth.
There have already been some attempts at making wget support gzip:
https://github.com/kravietz/wget-gzip (Windows-only; needs to work on Linux)
https://github.com/ptolts/wget-with-gzip-compression (based on a wget from 2003?)
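For a feel of why this matters, the sketch below compresses a made-up, repetitive feed-like body with Python's standard `gzip` module and compares sizes. It is not wget-lua code and says nothing about how the patch should be written. It only illustrates what a client gains when it sends `Accept-Encoding: gzip` and decodes a `Content-Encoding: gzip` response. The sample body is invented, so real savings will differ.

```python
import gzip


def gzip_savings(body):
    """Return (raw_size, compressed_size) in bytes for one response body."""
    return len(body), len(gzip.compress(body))


# Invented stand-in for a feed response; real feeds are less repetitive.
item = b"<entry><title>Example post</title><link href='http://example.com/1'/></entry>\n"
body = b"<feed>" + item * 1000 + b"</feed>"

raw, packed = gzip_savings(body)

# A gzip-aware client must also be able to restore the original bytes.
restored = gzip.decompress(gzip.compress(body))

print(f"raw={raw} bytes, gzip={packed} bytes, round-trip ok={restored == body}")
```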