Difference between revisions of "Reddit"

From Archiveteam
(removed dead link to old reddit archive at archive.org)
* ''<s>Extremely endangered</s> - many subreddits were picketing after the firing of a reddit employee named Victoria by turning themselves private or restricting submissions.''
* ''''Caution'''' - Reddit seems to have calmed down and returned to normal functionality after Ellen Pao's firing, and the Reddit team is making serious reforms (reducing shadowbanning, more mod tools). However, the revolt left unresolved issues and sour grapes within the community, and it seems Reddit was only saved by the lack of a practical alternative (Voat.co was crushed and went offline due to floods of refugees). '''It would be wise to preemptively archive the site''' before another crisis occurs.
* On July 3rd, 2015, Reddit user Stuck_in_the_Matrix '''completed his 14-month effort to archive Reddit's entire publicly available dataset''', just in time before the onset of the Reddit revolt. [https://archive.org/details/2015_reddit_comments_corpus It has been uploaded to the Internet Archive] in its entirety.
* On July 3rd, 2015, Jason Baumgartner '''completed his 14-month effort to archive Reddit's entire publicly available dataset''', just in time before the onset of the Reddit revolt. The archive is still updated monthly. '''[http://files.pushshift.io/reddit/ The files are available here.]'''
* '''As of November 9, 2015, it is stable once again.'''
* As of November 9, 2015, the site had become stable once again.
* In 2017-2018, Reddit banned several subreddits, including r/incels and r/maleforeveralone, which had tens of thousands of subscribers each. Other subreddits, including r/Braincels, r/foreveralone and r/TheRedPill, are also endangered, and discussions about banning them are currently taking place.[https://babe.net/2018/03/07/incel-40474][https://www.reddit.com/r/IncelTears/comments/83irsc/why_isnt_rbraincels_banned_yet/]


== Reddit Archive of Submissions and Comments (Without Images) ==
== Textual Archive (Without Images and Video) ==


On July 3rd, 2015, Jason Baumgartner completed his 14-month effort to archive Reddit's entire publicly available dataset, just in time before the onset of the Reddit revolt. The archive is still being updated monthly. '''[http://files.pushshift.io/reddit/ The files are available here.]'''
On July 3rd, 2015, Jason Baumgartner completed his 14-month effort to archive Reddit's entire publicly available textual dataset, just in time before the onset of the Reddit revolt. The archive is still being updated monthly. '''[http://files.pushshift.io/reddit/ The files are available here.]'''


* Does not include images and videos hosted by Reddit
* Reddit JSON API output
* Some comments not accessible due to private subreddits or comment deletion or other API issues
* [https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ Reddit /r/datasets - I have every publicly available Reddit comment for research. ~ 1.7 billion comments @ 250 GB compressed. Any interest in this?]
* [https://archive.org/details/2015_reddit_comments_corpus Internet Archive - 2015 Reddit Comments Corpus]
* [https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/ Google BigQuery Analysis of Reddit]



Revision as of 14:24, 10 July 2018

reddit
Reddit logo
reddit home page as seen on March 26, 2013
URL: http://www.reddit.com/
Status: Online!
Archiving status: Partially saved
Archiving type: Unknown
IRC channel: #deaddit (on hackint)

reddit is a content aggregator and social bookmarking service similar to Digg. Users can submit links and text posts, and vote and comment on submissions, in communities called "subreddits". It received considerable attention for its twelve-hour SOPA blackout in January 2012.

It contains some subreddits with goals similar to ArchiveTeam's, including /r/AbandonedWebsites, /r/ForgottenWebsites, and /r/DataHoarder. These are worth checking for material that could be added to ArchiveBot or that would otherwise benefit from the team's attention.

Vital signs

  • Appears stable, though a small to medium size team is a concern.
  • Update (6/10/15): the admins carried out bannings of several subreddits claiming they were harassing people, the most notable of which was /r/fatpeoplehate. This has instilled some fear, uncertainty, and doubt in some part of the userbase, with a few claiming that reddit will soon become what Digg is now: nearly dead.
  • Extremely endangered - many subreddits were picketing after the firing of a reddit employee named Victoria by turning themselves private or restricting submissions.
  • 'Caution' - Reddit seems to have calmed down and returned to normal functionality after Ellen Pao's firing, and the Reddit team is making serious reforms (reducing shadowbanning, more mod tools). However, the revolt left unresolved issues and sour grapes within the community, and it seems Reddit was only saved by the lack of a practical alternative (Voat.co was crushed and went offline due to floods of refugees). It would be wise to preemptively archive the site before another crisis occurs.
  • On July 3rd, 2015, Jason Baumgartner completed his 14-month effort to archive Reddit's entire publicly available dataset, just in time before the onset of the Reddit revolt. The archive is still updated monthly. The files are available here.
  • As of November 9, 2015, the site had become stable once again.
  • In 2017-2018, Reddit banned several subreddits, including r/incels and r/maleforeveralone, which had tens of thousands of subscribers each. Other subreddits, including r/Braincels, r/foreveralone and r/TheRedPill, are also endangered, and discussions about banning them are currently taking place.[1][2]

Textual Archive (Without Images and Video)

On July 3rd, 2015, Jason Baumgartner completed his 14-month effort to archive Reddit's entire publicly available textual dataset, just in time before the onset of the Reddit revolt. The archive is still being updated monthly. The files are available here.

The scripts used to generate this API dump were not made public. They likely used PRAW, and it would probably be better to rewrite them from scratch.

Also, this only preserves submissions and comments. No images hosted on Reddit were archived, and no sidebar, wiki, or live thread data was retrieved, so these should be scraped in an expansion pack.
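As a rough illustration of working with the dump described above, the sketch below streams records out of one compressed file without loading it all into memory. It assumes each file is bz2-compressed and holds one JSON object per line, and that records carry a "subreddit" field; neither detail is stated on this page, so check the actual files first. The function names are hypothetical.

```python
import bz2
import json
from collections import Counter
from typing import Iterator


def iter_dump_records(path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a bz2 file.

    Assumption: the dump is newline-delimited JSON compressed with bz2.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_subreddit(path: str) -> Counter:
    """Count records per subreddit, skipping records that lack the field.

    Assumption: each record stores its subreddit name under "subreddit".
    """
    counts: Counter = Counter()
    for record in iter_dump_records(path):
        subreddit = record.get("subreddit")
        if subreddit is not None:
            counts[subreddit] += 1
    return counts
```

Streaming line by line matters here because the full archive is far too large to decode in one pass. If the files use a different compression format, only the `bz2.open` call should need to change.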

Data liberation

As of March 26, 2013, users can only see up to 1,000 posts and comments on a profile page. However, admin "spladug" stated that older comments and posts are still in the database. spladug also stated that the team is in favor of offering dumps of a user's data, but that the task would be taxing on the servers. Since that comment was posted, there appears to have been no progress on a dump system. Because of this limitation, archiving would be nearly impossible using the old-fashioned way (without wget) if things do wind up FUBAR in the future.

Instead, any archival method should scrape from the Reddit API, a job that would have to run over several months. The API returns all nested comments, including those the HTML pages do not show. In addition, it significantly reduces server load.
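To make the point about nested comments concrete, here is a minimal sketch of walking a comment tree from JSON API output that has already been downloaded. It assumes the tree format in which a "Listing" holds `data.children`, each comment has kind "t1", an empty reply set is the empty string rather than a Listing, and collapsed branches appear as "more" stubs listing comment IDs. These details are assumptions that do not come from this page, and the function names are hypothetical. Fetching, rate limiting, and expanding the stubs are left out.

```python
from typing import Iterator, List, Tuple


def walk_comments(listing: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, comment_data) for every comment in a Listing, depth-first."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            # Non-comment entries (for example "more" stubs) are skipped here.
            continue
        data = child["data"]
        yield depth, data
        replies = data.get("replies")
        # Assumption: no replies is serialized as "" instead of a Listing.
        if isinstance(replies, dict):
            yield from walk_comments(replies, depth + 1)


def collect_more_ids(listing: dict) -> List[str]:
    """Collect the comment IDs referenced by "more" stubs at any depth."""
    ids: List[str] = []
    for child in listing.get("data", {}).get("children", []):
        kind = child.get("kind")
        if kind == "more":
            ids.extend(child["data"].get("children", []))
        elif kind == "t1":
            replies = child["data"].get("replies")
            if isinstance(replies, dict):
                ids.extend(collect_more_ids(replies))
    return ids


# Hand-written sample in the assumed shape (not real Reddit data).
SAMPLE = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {"id": "c1", "body": "top", "replies": {
            "kind": "Listing",
            "data": {"children": [
                {"kind": "t1", "data": {"id": "c2", "body": "reply", "replies": ""}},
                {"kind": "more", "data": {"children": ["c3", "c4"]}},
            ]},
        }}},
        {"kind": "t1", "data": {"id": "c5", "body": "second", "replies": ""}},
    ]},
}

flat = [(depth, c["id"]) for depth, c in walk_comments(SAMPLE)]
pending = collect_more_ids(SAMPLE)
```

An archiver would store every comment from `walk_comments` and queue the IDs from `collect_more_ids` for a later request, so that no branch of a thread is silently dropped.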


External Links