User:Nemo bis/Tumblr

From Archiveteam

A simple setup for running the Tumblr grab scripts under tmux on 8-core virtual servers with a 1 TB disk (e.g. from a GCE trial). See the concurrency suggested by trvz.

Partition and format the disk:

 sudo fdisk /dev/sdb ; sudo mkfs.ext4 /dev/sdb1

Install all the things, mount the disk and build wget-lua (the mount point is chowned after mounting, because the mounted filesystem's root directory is owned by root):

 sudo apt install -y atop tmux iftop git-core libgnutls28-dev lua5.1 liblua5.1-0 liblua5.1-0-dev screen python-dev python-pip bzip2 zlib1g-dev flex autoconf ; sudo pip install seesaw ; sudo mkdir -p /mnt/at ; sudo mount /dev/sdb1 /mnt/at ; sudo chown nemobis /mnt/at ; cd /mnt/at ; git clone https://github.com/ArchiveTeam/tumblr-grab.git ; cd tumblr-grab ; ./get-wget-lua.sh

Launch the scripts in tmux windows in a single session (10 directories with 20 concurrent items each, for a total concurrency of 200):

 tmux new-session -d atop ; for i in {1..10}; do tmux new-window -n t$i -d "cd /mnt/at ; git clone https://github.com/ArchiveTeam/tumblr-grab.git tumblr$i ; cd tumblr$i ; run-pipeline --concurrent 20 --disable-web-server --auto-update pipeline.py Nemo" ; done
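The total concurrency is simply the number of windows times the per-window `--concurrent` value. Below is a minimal bash sketch of that arithmetic, plus a dry run that only prints the per-directory commands so they can be checked before launching anything. The helper names `total_concurrency` and `print_launch_commands` are hypothetical and not part of seesaw or tumblr-grab. The printed command assumes the `tumblr$i` directories have already been cloned.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of seesaw or tumblr-grab.

# Total concurrency = number of tmux windows * --concurrent per window.
total_concurrency() {
    local windows=$1 per_window=$2
    echo $(( windows * per_window ))
}

# Print (do not run) one tmux new-window command per working directory.
# Assumes /mnt/at/tumblr1 .. tumblrN already exist.
print_launch_commands() {
    local windows=$1 per_window=$2 nick=$3 i
    for (( i = 1; i <= windows; i++ )); do
        printf 'tmux new-window -n t%d -d "cd /mnt/at/tumblr%d ; run-pipeline --concurrent %d --disable-web-server --auto-update pipeline.py %s"\n' \
            "$i" "$i" "$per_window" "$nick"
    done
}

total_concurrency 10 20   # prints 200
```

For example, `print_launch_commands 10 20 Nemo` lists the ten commands for review. Lowering either number is the easy way to back off if the machine becomes I/O bound.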

Attach to the tmux session (see other tmux basics and how to relaunch en masse):

 tmux a
 # press "p" in atop to see how much CPU wget-lua is consuming overall etc.

The main limit to speed is often the number of IP addresses and I/O wait time, rather than CPU and concurrency: see diggan's scripts to spawn warriors on multiple DigitalOcean instances.
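To check whether I/O wait really is the bottleneck, one rough option is to read the aggregate `cpu` line of `/proc/stat`. The sketch below assumes Linux and bash, and the helper name `iowait_percent` is hypothetical. These counters are cumulative since boot, so the result is a long-term average; atop or iostat give a better live view.

```shell
#!/usr/bin/env bash
# Hypothetical helper: integer percentage of CPU time spent in iowait,
# computed from one aggregate "cpu" line in /proc/stat format (Linux).
iowait_percent() {
    local line=$1
    # Field order: label user nice system idle iowait irq softirq steal ...
    local label user nice system idle iowait irq softirq steal rest
    read -r label user nice system idle iowait irq softirq steal rest <<< "$line"
    # Guest time is already counted in user time, so it is left out of the sum.
    local total=$(( user + nice + system + idle + iowait + irq + softirq + steal ))
    if (( total == 0 )); then
        echo 0
        return
    fi
    echo $(( 100 * iowait / total ))
}

# Usage on a Linux host: average iowait share since boot.
if [ -r /proc/stat ]; then
    iowait_percent "$(head -n 1 /proc/stat)"
fi
```

A persistently high value suggests that adding concurrency will not help and that a faster disk, or more machines and IP addresses, is the better investment.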