Difference between revisions of "User:Djsmiley2k"

From Archiveteam
Revision as of 09:39, 25 June 2013


  • Need to figure out the full wiki/site layout - currently everything is a giant mishmash
  • Will set fire to anyone who breaks the nice design changes
  • While HTML in pages can make them look "nice", it's ****ing annoying to try and edit nicely if you're not an HTML expert - look into converting to proper MediaWiki markup instead
    • Can we get some templates for projects (what is a project!?) / archive tasks / other crap

Generic Wget command

 export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
 export SAVE_HOST=""
 export WARC_NAME=""
 wget \
 -e robots=off --mirror --page-requisites \
 --waitretry 5 --timeout 60 --tries 5 --wait 1 \
 --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" \
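
The command above breaks off after the WARC options, so how the three exported variables get used isn't shown. A minimal sketch of one way they might plug in, assuming SAVE_HOST holds the start URL and USER_AGENT is passed via --user-agent (both are guesses, not from the original). The helper name print_wget_args is made up; it only prints the arguments, one per line, so they can be checked before anything is downloaded.

```shell
# Hypothetical helper: print the wget arguments one per line for review.
# Assumption: SAVE_HOST is the start URL and USER_AGENT goes to --user-agent;
# the truncated original does not show how either variable is used.
print_wget_args() {
    if [ -z "${USER_AGENT:-}" ] || [ -z "${SAVE_HOST:-}" ] || [ -z "${WARC_NAME:-}" ]; then
        echo "set USER_AGENT, SAVE_HOST and WARC_NAME first" >&2
        return 1
    fi
    printf '%s\n' \
        -e robots=off --mirror --page-requisites \
        --waitretry 5 --timeout 60 --tries 5 --wait 1 \
        --warc-header "operator: Archive Team" --warc-cdx \
        "--warc-file=$WARC_NAME" \
        "--user-agent=$USER_AGENT" \
        "$SAVE_HOST"
}
```

In bash, `mapfile -t args < <(print_wget_args)` followed by `wget "${args[@]}"` would then run the grab with exactly the arguments that were printed.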

Forum Grab

src/wget --save-cookies team17-cookies.txt --post-data 'vb_login_username=USERNAMEGOESHERE&vb_login_password=PASSWORDGOESHERE&securitytoken=guest&cookieuser=1&do=login' 'http://forum.team17.com/login.php?do=login'
src/wget --load-cookies team17-cookies.txt -e robots=off --wait 0.25 "http://forum.team17.com/" --mirror --warc-file="at-team17-forum"
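
Rather than editing the placeholders in the login line by hand, the POST body could be built from two arguments. A rough sketch only: the helper name vb_login_data is made up, and it assumes the username and password contain nothing that needs URL-encoding (no & or =), since it does no encoding at all.

```shell
# Hypothetical helper: build the vBulletin login body used in the forum grab.
# Assumption: neither value contains characters that need URL-encoding
# (such as & or =); this sketch passes them through unchanged.
vb_login_data() {
    printf 'vb_login_username=%s&vb_login_password=%s&securitytoken=guest&cookieuser=1&do=login' "$1" "$2"
}
```

It would be used as `--post-data "$(vb_login_data "$T17_USER" "$T17_PASS")"`, where T17_USER and T17_PASS are made-up variable names holding the forum credentials.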

Limit Warrior b/w

VBoxManage bandwidthctl archiveteam-warrior-2 --name Limit --add network --limit 3

Must be done while VM is powered off - can't be done with saved state. :(
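
Since the limit only sticks when the VM is fully powered off, a check before applying it might save some head-scratching. A sketch, assuming `VBoxManage showvminfo --machinereadable` prints a line like VMState="poweroff"; the function names vm_state and limit_if_off are made up.

```shell
# Extract the VMState value from `VBoxManage showvminfo --machinereadable`
# output read on stdin, e.g. the line VMState="poweroff" yields poweroff.
vm_state() {
    sed -n 's/^VMState="\(.*\)"$/\1/p'
}

# Hypothetical wrapper: apply the bandwidth limit only if the VM is powered
# off; a saved state or a running VM is reported and left alone.
limit_if_off() {
    vm=$1
    state=$(VBoxManage showvminfo "$vm" --machinereadable | vm_state)
    if [ "$state" = "poweroff" ]; then
        VBoxManage bandwidthctl "$vm" --name Limit --add network --limit 3
    else
        echo "$vm is in state '$state'; power it off fully first" >&2
        return 1
    fi
}
```

Calling `limit_if_off archiveteam-warrior-2` would then either set the limit or say why it didn't.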

Remote warrior control

Either ssh forward to local system:

ssh -L 8001:localhost:8001 tim.bowers@xxx.xxx.xxx.xxx -f -N 


curl -d "project_name=punchfork" http://localhost:8001/api/select-project

New Versions

main page

Build your own EC2 ami/instance

Select whichever instance type you want - this was built on Ubuntu 13.04 on the lowest tier (free!)

Log in via ssh (on Ubuntu the login user is ubuntu)

First we need to set up the basic system

sudo apt-get install build-essential lua5.1 liblua5.1-0-dev python python-setuptools python-dev git-core openssl libssl-dev python-pip rsync gcc make git screen

Then we need the seesaw kit, which is used for the grabbing parts

sudo git clone https://github.com/ArchiveTeam/seesaw-kit.git
cd ./seesaw-kit
sudo pip install -r requirements.txt
cd ..

Now we move onto the project specific stuff, for xanga we'd do:

sudo git clone https://github.com/ArchiveTeam/xanga-grab.git
cd ./xanga-grab
./get-wget-lua.sh ### building wget-lua

And finally, we start the pipeline in a screen session

screen ../seesaw-kit/run-pipeline --concurrent 3 pipeline.py YOURNICKNAME
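
If the session should survive an ssh disconnect without manually detaching, screen can start it detached under a name. A sketch only: the function start_pipeline, the session name "grab" and the DRY_RUN switch are all made up for illustration; `-dmS` is screen's option for starting a detached, named session.

```shell
# Hypothetical helper: start the pipeline in a named, detached screen session.
# With DRY_RUN=1 it only prints the command instead of running it.
# Assumption: it is called from inside the project checkout, with seesaw-kit
# cloned next to it, as in the command above.
start_pipeline() {
    nick=$1
    concurrent=${2:-3}
    set -- screen -dmS grab ../seesaw-kit/run-pipeline --concurrent "$concurrent" pipeline.py "$nick"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

After `start_pipeline YOURNICKNAME`, `screen -r grab` reattaches to the running pipeline.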

Important URLs

Is the rsync host up?

EC2 Instance setups

debian-squeeze-i386-warrior (ami-9c69f1f5)

User Text: {"downloader": "Smiley", "selected_project": "posterous", "concurrent_items": "6", "shared:rsync_threads": "4"}

Add second disk - 10 GB

Open port 22

Set up SSH forwarding: ssh -i ./.ssh/amazonkey.pem -N -f -L 8002:localhost:8001 ubuntu@***********.compute-1.amazonaws.com

Set automatic shutdown: echo "0 20 * * * root /sbin/shutdown -h now" | sudo tee /etc/cron.d/shutdown
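
The cron entry above fires at 20:00 every day, in whatever time zone the instance's clock uses (often UTC on EC2 images, though that is an assumption worth checking with `date`). A small sketch for generating the same line with a different hour; the function name shutdown_cron_line is made up.

```shell
# Hypothetical helper: print an /etc/cron.d line that halts the machine daily
# at the given hour (0-23), in the same format as the entry above.
shutdown_cron_line() {
    hour=$1
    case $hour in
        ''|*[!0-9]*)
            echo "hour must be a number from 0 to 23" >&2
            return 1
            ;;
    esac
    if [ "$hour" -gt 23 ]; then
        echo "hour must be a number from 0 to 23" >&2
        return 1
    fi
    printf '0 %s * * * root /sbin/shutdown -h now\n' "$hour"
}
```

Piping it into the same tee call, as in `shutdown_cron_line 20 | sudo tee /etc/cron.d/shutdown`, gives the original entry.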

Digital Ocean

sign up for DO -> use SSDTWEET code -> make a $10 payment -> unleash 500 instances upon the world

apt-get update && apt-get -y install git make python-pip libgnutls-dev liblua5.1-dev && pip install seesaw && git clone https://github.com/ArchiveTeam/yahoomessages-grab.git && cd yahoomessages-grab/ && ./get-wget-lua.sh && run-pipeline pipeline.py --disable-web-server Smiley