4chan/4plebs


archive.4plebs.org
  • URL: archive.4plebs.org
  • Status: Online!
  • Archiving status: Not saved yet
  • Archiving type: Unknown
  • IRC channel: #archiveteam-bs (on hackint)

Status: Online
Saves Images?: Yes

4plebs is shedding all full-sized images dating from before April 2014, about 240 GB of data, due to storage limits. We need to retrieve this data and put it on the Internet Archive for safekeeping.

List of Boards

Board                          Date of Oldest Thread      Pedigree
/adv/ - Advice                 2014-01-12                 4plebs.org
/f/ - Flash                    2014-03-15                 4plebs.org
/hr/ - High Resolution         2012-12-01 (around birth)  4plebs.org
/o/ - Auto                     2013-03-13                 4plebs.org
/pol/ - Politically Incorrect  2013-10-28                 4plebs.org
/s4s/ - Shit 4chan Says        2013-10-05                 4plebs.org
/sp/ - Sports                  2012-06-11                 4plebs.org <- not4plebs.org <- (late 2014 threads lost) <- Archive.moe <- foolz.us
/tg/ - Traditional Games       2011-06-26                 4plebs.org
/trv/ - Travel                 2012-07-02                 4plebs.org
/x/ - Paranormal               2013-04-01                 4plebs.org

Method 1: Web Scraping

Using wget, we just scrape the images off the server. It's not elegant, but it works, and thankfully the admin has provided image lists (change the board name in the URL to view another board's list). This will take at least a month, and that's assuming we're scraping several boards in parallel (see the sketch after the script). The following bash script is used:

#!/bin/bash
# Board whose scheduled-for-removal images we want to grab.
board="tg"
# Fetch the admin-provided list of images to be removed, in order.
wget "http://img.4plebs.org/boards/$board/image/to_be_removed_in_order.txt"
# Turn each relative path into a full image URL (double quotes so $board expands).
sed -e "s|^\./|http://img.4plebs.org/boards/$board/image/|g" -i to_be_removed_in_order.txt
# Download everything on the list in the background, with retries, resume, and a 1-second delay between requests.
wget -b --tries=10 -nc -c -i to_be_removed_in_order.txt --user-agent="Bibliotheca Anonoma Website Archiver/1.1 (+http://github.com/bibanon/bibanon/wiki)" -w 1
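
The script above handles one board at a time. Below is a minimal sketch of launching it for several boards in parallel, which is what the one-month estimate assumes. It presumes the script has been saved as scrape_board.sh (a hypothetical filename) and changed to read the board name from its first argument instead of hardcoding board="tg".

#!/bin/bash
# Hypothetical wrapper: launch the scrape for every board with images to prune.
# Assumes scrape_board.sh sets board="$1" instead of hardcoding board="tg".
for board in adv hr o pol s4s tg trv tv x; do
    mkdir -p "$board"
    # Run each board from its own directory so image lists and downloads don't collide;
    # the wget -b inside the script keeps downloading in the background and logs to wget-log.
    (cd "$board" && bash ../scrape_board.sh "$board")
done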

Web Scraping ETA

Below are rough estimates of scraping time and size, calculated from the number of images in each board's removal list; a worked example of the calculation follows the assumptions.

Assumes:

  • 2 second Average Download Time (includes 1 second delay)
  • 600KB Average filesize for regular boards
  • 3MB Average filesize for high resolution boards
  • 8MB Average filesize for /f/lashes
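
As a rough illustration of the procedure (a sketch, not part of the original page), each board's figures follow from multiplying its image count by the assumed per-image time and average filesize:

#!/bin/bash
# Sketch: estimate scrape time and size for one board from its image count.
images=12973          # e.g. /adv/'s removal list
seconds_per_image=2   # stated assumption (download plus the 1-second delay); the listed
                      # per-board timespans actually work out closer to 20 seconds per image
avg_kb=600            # average filesize for a regular board (3 MB for /hr/, 8 MB for /f/)

echo "Estimated timespan: $(( images * seconds_per_image / 3600 )) hours"
echo "Estimated size: $(( images * avg_kb / 1024 )) MB"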

Total

  • Total Amount of Images: 372123
  • Total Estimated Size: 244789 MB (or) 240 GB
  • Total Estimated Timespan:
      • Parallel: 1 month (30 days)
      • Sequential: 2063 hours (or) 85 days

/adv/

Status: Scraping - 2015-09-20

  • Amount of Images: 12973
  • Estimated Timespan: 72 hours (or) 3 days
  • Estimated Size: 7601 MB (or) 8 GB

/hr/

  • Amount of Images: 11082
  • Estimated Timespan: 61 hours (or) 2 days
  • Estimated Size: 33246 MB (or) 33 GB

/f/

Nothing to be pruned?

/o/

  • Amount of Images: 37437
  • Estimated Timespan: 207 hours (or) 8 days
  • Estimated Size: 21935 MB (or) 22 GB

/pol/

  • Amount of Images: 107115
  • Estimated Timespan: 595 hours (or) 24 days
  • Estimated Size: 62762 MB (or) 62 GB

/s4s/

Status: Scraping - 2015-09-20

  • Amount of Images: 29504
  • Estimated Timespan: 163 hours (or) 6 days
  • Estimated Size: 17287 MB (or) 17 GB

/sp/

Nothing to be pruned?

/tg/

  • Amount of Images: 60556
  • Estimated Timespan: 336 hours (or) 14 days
  • Estimated Size: 35482 MB (or) 34 GB

/trv/

Status: Scraping - 2015-09-21

  • Amount of Images: 1713
  • Estimated Timespan: 9 hours
  • Estimated Size: 1003 MB (or) 1 GB

/tv/

  • Amount of Images: 99399
  • Estimated Timespan: 552 hours (or) 23 days
  • Estimated Size: 58241 MB (or) 57 GB

/x/

Status: Scraping - 2015-09-20

  • Amount of Images: 12344
  • Estimated Timespan: 68 hours (or) 2 days
  • Estimated Size: 7232 MB (or) 7 GB

Method 2: tar Piping

Web scraping eats up bandwidth and takes quite a long time. A better method is to pipe a tar archive from their host server to our (dedicated) server. Yes, you read that right: the tar backup is written directly to the remote server, never stored on the host server.

That way, the host server doesn't have to store a redundant backup that could be massive. Instead, just spit it at our server directly.

tar -cf - /path/to/dir | ssh remote_server 'tar -xvf - -C /absolute/path/to/remotedir'

This would only take about a week or so to transfer the 240 GB of data, and it reduces the overhead on the web server compared to requesting over 370,000 individual files: we only send one continuous stream of data.
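
As a minimal sketch of what this could look like in practice (hostnames and paths are placeholders, and the gzip compression is an addition to the command above, not something 4plebs has committed to):

#!/bin/bash
# Stream the image directory from the 4plebs host straight onto our server.
# SRC_DIR, DEST and DEST_DIR are hypothetical; substitute the real paths.
SRC_DIR="/path/to/4plebs/images"
DEST="archiveteam@our-server.example.org"
DEST_DIR="/data/4plebs-backup"

# -c create, -z compress the stream with gzip; the remote tar unpacks it on arrival.
tar -czf - "$SRC_DIR" | ssh "$DEST" "tar -xzf - -C '$DEST_DIR'"

# To keep a single archive file instead of unpacking, pipe into a file on the remote end:
# tar -czf - "$SRC_DIR" | ssh "$DEST" "cat > '$DEST_DIR/4plebs-images.tar.gz'"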