Difference between revisions of "Blogger"

Revision as of 17:49, 6 March 2015

Blogger


URL	http://www.blogger.com/
Status	Online!
Archiving status	In progress...
Archiving type	Unknown
Project source	blogger-discovery
Project tracker	bloggerdisco
IRC channel	#frogger (on hackint)

Blogger is a blog hosting service. On February 23, 2015, they announced that "sexually explicit" blogs would be restricted from public access in a month.^[1] We're downloading everything.

Strategy

Find as many http://foobar.blogspot.com domains as possible and download them. Blogs often link to other blogs, so each blog saved helps discover others. A small-scale crawl of Blogger profiles (e.g. http://www.blogger.com/profile/{random number up to 35217655}) will also provide links to the blogs authored by each user (e.g. https://www.blogger.com/profile/5618947 links to http://hintergedanke.blogspot.com/). Note that this does not cover all bloggers or all blogs; it is merely a starting point for further discovery.
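The discovery step described above can be sketched as a pair of small shell helpers. This is only an illustration, not the code used by the blogger-discovery project: the function names profile_url and extract_blogs are hypothetical, and the sketch assumes that blog links appear in the fetched HTML as plain name.blogspot.com hostnames.

```shell
# Hypothetical helper: build the profile URL for a numeric profile ID,
# following the URL pattern quoted in the Strategy text.
profile_url() {
  printf 'http://www.blogger.com/profile/%s\n' "$1"
}

# Hypothetical helper: read HTML on stdin and print each distinct
# name.blogspot.com hostname once, lowercased and sorted.
# Assumption: country-specific hosts (blogspot.co.uk etc.) are ignored here.
extract_blogs() {
  grep -oiE '[a-z0-9-]+\.blogspot\.com' | tr 'A-Z' 'a-z' | sort -u
}
```

A saved profile or blog page could then be fed through it with something like `extract_blogs < saved-page.html`, and each hostname found becomes a candidate for further crawling.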

Country Redirect

Accessing http://whatever.blogspot.com will usually redirect to a country-specific subdomain depending on your IP address (e.g. whatever.blogspot.co.uk, whatever.blogspot.in, etc.), which in some cases may be censored or edited to meet local laws and standards. This can be bypassed by requesting http://whatever.blogspot.com/ncr as the root URL.^[2] ^[3]
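As a minimal sketch of the /ncr workaround, the helper below turns a blog hostname into the "no country redirect" URL. The function name ncr_url is hypothetical, and mapping a country-specific hostname back to its .blogspot.com form is an assumption made for illustration; the text above only states that requesting /ncr on the .com domain bypasses the redirect.

```shell
# Hypothetical helper: given a blog hostname (with or without a country
# suffix), print the /ncr URL on the .blogspot.com domain.
ncr_url() {
  # Strip everything from ".blogspot." onward, leaving only the blog name.
  printf 'http://%s.blogspot.com/ncr\n' "${1%%.blogspot.*}"
}
```

For example, `ncr_url whatever.blogspot.co.uk` prints http://whatever.blogspot.com/ncr.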

Downloading a single blog with Wget

These Wget parameters can download a BlogSpot blog, including comments and any on-site dependencies. They should also reject redundant pages, such as the /search/ directory and multiple occurrences of the same page with different query strings. The command has only been tested on blogs using a Blogger subdomain (e.g. http://foobar.blogspot.com), not custom domains (e.g. http://foobar.com). Both instances of [URL] should be replaced with the same URL. A simple Perl wrapper is available here.

wget --recursive --level=2 --no-clobber --no-parent --page-requisites --continue --convert-links --user-agent="" -e robots=off --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" --referer="[URL]" --wait 1 [URL]
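Because the same URL has to be substituted in two places, a small wrapper function can help avoid mismatches. The sketch below is not the Perl wrapper mentioned above; the name mirror_blog is hypothetical, and the Wget options are copied unchanged from the command in this section.

```shell
# Hypothetical wrapper: mirror one blog, using the same URL both as the
# referer and as the start URL, as the instructions above require.
mirror_blog() {
  if [ -z "$1" ]; then
    echo "usage: mirror_blog URL" >&2
    return 1
  fi
  url=$1
  wget --recursive --level=2 --no-clobber --no-parent --page-requisites \
       --continue --convert-links --user-agent="" -e robots=off \
       --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" \
       --referer="$url" --wait 1 "$url"
}
```

It would be called as `mirror_blog http://foobar.blogspot.com`. Like the original command, it has not been checked against custom domains.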

Export XML trick

Append this to a blog URL to download the most recent 499 posts (499 is the limit): /atom.xml?redirect=false&max-results=499
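The export URL can be assembled with a short helper. This is a sketch under stated assumptions: the function name atom_export_url is hypothetical, and the default of 499 results is taken from the limit quoted in the text above.

```shell
# Hypothetical helper: print the Atom export URL for a blog.
# $1 = blog URL (a trailing slash is tolerated)
# $2 = number of posts to request (defaults to 499, the stated limit)
atom_export_url() {
  printf '%s/atom.xml?redirect=false&max-results=%s\n' "${1%/}" "${2:-499}"
}
```

The result can be passed to a downloader, for example `wget "$(atom_export_url http://foobar.blogspot.com)"`.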

How can I help?

Running the Warrior

Start up the Warrior and select the Blogger Discovery project. Do not increase the default concurrency of 2, because Google limits requests aggressively (and you get blocked for ~45 minutes, maybe less). Moreover, if you see "503 Service Unavailable" messages, decrease concurrency to 1.

Running the script manually

See details here: http://github.com/ArchiveTeam/blogger-discovery

Do not increase the concurrency above 2, because Google limits requests aggressively (and you get blocked for ~45 minutes, maybe less). Moreover, if you see "503 Service Unavailable" messages, decrease concurrency to 1.

External links

Blogger^{[IA•Wcite•.today•MemWeb]}

References

[1] https://support.google.com/blogger/answer/6170671?p=policy_update&hl=en&rd=1

[2] https://support.google.com/blogger/answer/2402711?hl=en

[3] http://www.bbc.co.uk/news/technology-16852920


@@ Line 6: / Line 6: @@
 | URL = http://www.blogger.com/
 | project_status = {{online}}
-| archiving_status = {{upcoming}}
+| archiving_status = {{inprogress}}
 | source = [https://github.com/ArchiveTeam/blogger-discovery blogger-discovery]
 | tracker = [http://tracker.archiveteam.org/bloggerdisco/ bloggerdisco]
@@ Line 28: / Line 28: @@
 == Export XML trick ==
-Add this to a blog url and it will download the most recent 499 posts (that is the limit): /atom.xml?redirect=false&max-results=499
+Add this to a blog url and it will download the most recent 499 posts (that is the limit): /atom.xml?redirect=false&max-results=
+== How can I help? ==
+=== Running the Warrior ===
+Start up the [[Warrior]] and select the ''Blogger Discovery'' project. '''Do not''' increase the default concurrency of 2, because Google limits requests aggressively (and you get blocked for ~45 minutes, maybe less). Moreover, if you see "503 Service Unavailable" messages, decrease concurrency to 1.
+=== Running the script manually ===
+See details here: http://github.com/ArchiveTeam/blogger-discovery
+'''Do not''' increase the concurrency above 2, because Google limits requests aggressively (and you get blocked for ~45 minutes, maybe less). Moreover, if you see "503 Service Unavailable" messages, decrease concurrency to 1.
 == External links ==
