The staging servers accept WARC files, package them up, and upload them to the Internet Archive. This guide is for those who are setting up Rsync targets.
Installation will cover:
- Environment: Ubuntu/Debian
Set up the Rsync target
The Rsync target consists of disk space, Rsync, and WARC packing scripts in a dedicated user account.
Create a system user account dedicated to the Rsync target:
sudo adduser --system --group --shell /bin/bash archiveteam
Log in as archiveteam:
sudo -u archiveteam -i
Create a place to store the uploads:
mkdir -p PROJECT_NAME/incoming-uploads/
You may log out of archiveteam at this point.
You will need to install Rsync:
sudo apt-get install rsync
Once Rsync is installed, you will need to edit the Rsync configuration file. If no rsyncd.conf exists in /etc, copy it from
Rsync uses a concept of "modules", which can be thought of as namespaces. If you have copied the example file, you can modify the example ftp module to fit your new project. Naming the module after the project is a sensible choice.
You will also need to include:
- path = /home/archiveteam/PROJECT_NAME/incoming-uploads/
- read only = no
- uid = archiveteam
- gid = archiveteam
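Put together, a module section might look like the output of the sketch below. The four settings come from the list above; the helper name print_module and the comment line are illustrative assumptions, and PROJECT_NAME remains a placeholder for your project.

```shell
#!/bin/sh
# Sketch: print an rsyncd.conf module stanza for a project.
# print_module is a hypothetical helper, not part of rsync.
print_module() {
    project="$1"
    cat <<EOF
[$project]
    comment = Upload target for $project
    path = /home/archiveteam/$project/incoming-uploads/
    read only = no
    uid = archiveteam
    gid = archiveteam
EOF
}

# Print a stanza using the placeholder project name from this guide.
print_module PROJECT_NAME
```

The output can be reviewed and then pasted into /etc/rsyncd.conf in place of the example ftp module.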
Make Rsync start as a daemon at boot by editing /etc/default/rsync. Ensure it reads
Start the Rsync daemon:
sudo invoke-rc.d rsync start
The Megawarc Factory
The Megawarc Factory is a set of scripts that package and bundle up all the uploaded WARC files that are received.
If Git, Curl, or Screen is not yet installed, install it now:
sudo apt-get install git curl screen
Log in as archiveteam and download the scripts needed:
git clone https://github.com/ArchiveTeam/archiveteam-megawarc-factory.git
cd archiveteam-megawarc-factory/
git clone https://github.com/alard/megawarc.git
cd
Let's begin to populate the configuration file:
cp archiveteam-megawarc-factory/config.example.sh PROJECT_NAME/config.sh
nano PROJECT_NAME/config.sh
Going through the config.sh:
- MEGABYTES_PER_CHUNK denotes how big the megawarc files should be, in megabytes. Typically it should be set to 50GB, but if you really don't have the space, you can use smaller files like 10GB.
- IA_AUTH holds your Internet Archive S3-like API authentication keys.
- IA_COLLECTION, IA_ITEM_TITLE, IA_ITEM_PREFIX, and FILE_PREFIX should all have the todos replaced with the project name.
- FS1_BASE_DIR should be set to /home/archiveteam/PROJECT_NAME/
- FS2_BASE_DIR should be set to the same as above or to another location.
- COMPLETED_DIR should be left empty (i.e., "") if the uploaded file is to be deleted.
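As a rough illustration, a filled-in config.sh for a hypothetical project called exampleproject might look something like the sketch below. The variable names come from the list above and the collection name follows the archiveteam_PROJECT_NAME pattern; the exact value formats (the key pair layout for IA_AUTH, the title wording, the prefixes) are assumptions, so compare against config.example.sh before relying on them. The chunk size treats 1GB as 1024 megabytes.

```shell
#!/bin/sh
# Sketch of a filled-in config.sh for a hypothetical project.
# Every value is a placeholder assumption; check config.example.sh for the real format.
PROJECT="exampleproject"

# 50GB expressed in megabytes (50 * 1024 = 51200).
MEGABYTES_PER_CHUNK=$((50 * 1024))

# Internet Archive S3-like keys; the "ACCESSKEY:SECRET" layout is an assumption.
IA_AUTH="ACCESSKEY:SECRET"

IA_COLLECTION="archiveteam_${PROJECT}"
IA_ITEM_TITLE="Archive Team ${PROJECT}:"
IA_ITEM_PREFIX="archiveteam_${PROJECT}_"
FILE_PREFIX="${PROJECT}_"

# Both base directories point at the project directory in this sketch.
FS1_BASE_DIR="/home/archiveteam/${PROJECT}"
FS2_BASE_DIR="/home/archiveteam/${PROJECT}"

# Left empty so that uploaded files are deleted.
COMPLETED_DIR=""

echo "MEGABYTES_PER_CHUNK=${MEGABYTES_PER_CHUNK}"   # prints MEGABYTES_PER_CHUNK=51200
```

For a 10GB chunk size, the same arithmetic gives 10 * 1024 = 10240 megabytes.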
Bother (or politely ask) someone about getting permission to upload your files to the collection archiveteam_PROJECT_NAME. You can ask in #archiveteam on EFNet.
Let's run the Megawarc Factory. First, create a sentinel file:
cd PROJECT_NAME
touch RUN
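A sentinel file lets long-running scripts be stopped cleanly: each pass of the loop checks whether the file still exists, and the loop ends once it is gone. The sketch below shows the general pattern in a temporary directory. It is an assumption about how such scripts typically behave, not a copy of the factory's code, and run_while_sentinel is a hypothetical name.

```shell
#!/bin/sh
# Sketch of the sentinel-file pattern (not the factory's actual code).
# run_while_sentinel is a hypothetical helper: it repeats a unit of work
# for as long as the sentinel file exists, then prints the number of rounds.
run_while_sentinel() {
    sentinel="$1"
    max_rounds="$2"   # safety cap so the demo always terminates
    rounds=0
    while [ -f "$sentinel" ] && [ "$rounds" -lt "$max_rounds" ]; do
        rounds=$((rounds + 1))
        # Real work (chunking, packing, uploading) would happen here.
        # For the demo, the sentinel disappears during the third round.
        if [ "$rounds" -eq 3 ]; then
            rm -f "$sentinel"
        fi
    done
    echo "$rounds"
}

demo_dir="$(mktemp -d)"
touch "$demo_dir/RUN"
run_while_sentinel "$demo_dir/RUN" 10   # prints 3
rmdir "$demo_dir"
```

The benefit of this design is that the current round always finishes before the loop exits, so no file is left half-processed.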
You can run the Megawarc Factory in Screen. The 3 scripts will run in separate command shells within one Screen session:
screen
../archiveteam-megawarc-factory/chunk-multiple
CTRL+A c
ionice -c 2 -n 6 nice -n 19 ../archiveteam-megawarc-factory/pack-multiple
CTRL+A c
../archiveteam-megawarc-factory/upload-multiple
CTRL+A d
Here are a few Screen pointers:
- screen -r will resume an existing screen session
- CTRL+A c creates a new command window
- CTRL+A SPACE switches to the next window
- CTRL+A " shows you a list of windows
- CTRL+A d leaves, or detaches, the screen session
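If you would rather not type the key sequences by hand, the same three windows can probably be started from a script. The sketch below is one possible approach, assuming it is run from the PROJECT_NAME directory with the RUN file in place; the session name megawarc, the FACTORY_DIR variable, and the DRY_RUN switch are illustrative choices, not part of the factory. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: start the three factory scripts in one detached Screen session.
# SESSION, FACTORY_DIR, and DRY_RUN are illustrative names, not factory settings.
SESSION="${SESSION:-megawarc}"
FACTORY_DIR="${FACTORY_DIR:-../archiveteam-megawarc-factory}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_factory() {
    # -dmS creates a named session that starts detached.
    run screen -dmS "$SESSION" "$FACTORY_DIR/chunk-multiple"
    # -X screen opens a further window in that session running the given command.
    run screen -S "$SESSION" -X screen ionice -c 2 -n 6 nice -n 19 "$FACTORY_DIR/pack-multiple"
    run screen -S "$SESSION" -X screen "$FACTORY_DIR/upload-multiple"
}

start_factory
```

Unlike the interactive approach, a window started this way closes when its script exits, so use screen -r megawarc to check that all three are still running.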
To stop the Megawarc Factory, remove the sentinel file:
You can log out of the archiveteam account now.