Revision as of 10:29, 28 December 2013
Writing a Seesaw content grab script is the most challenging and fun aspect of the infrastructure.
What an Archive Team Project Contains
Once the Git repository has been created, be sure to include the following files:
pipeline.py
- This file contains the Seesaw client code for the project.
README.{md,rst,txt}
- This file contains
- * brief information about the project
- * instructions on how to manually run the scripts
- * A template is available here: standalone-readme-template (https://github.com/ArchiveTeam/standalone-readme-template)
[Project Name Here].lua (optional)
- This is the Lua script used by Wget-Lua.
warrior_install.sh (optional)
- This file is executed by the Warrior to install extra libraries needed by the project.
wget-lua-warrior (optional)
- This executable is a build of Wget-Lua for the warrior environment.
get-wget-lua.sh (optional)
- Build scripts for Wget-Lua for those running scripts manually.
The repository is pulled in by the Warrior, or cloned by those who want to run the scripts manually.
Writing a pipeline.py (Seesaw Client)
The Seesaw client is a specific set of tasks that must be done within an item. Think of it as a template of instructions. Typically, the file is called pipeline.py. The pipeline file uses the Seesaw Library.
The pipeline file will typically use Wget with Lua scripting. Wget+Lua is a web crawler.
The Lua script is provided as an argument to Wget within the pipeline file. It controls fine-grained operations within Wget, such as rejecting unneeded URLs or adding more URLs as they are discovered.
The goal of the pipeline is to download, make WARC files, and upload them.
Quick Definitions
item
- a work unit
pipeline
- a series of tasks in an item
task
- a step in getting the item done
Installation
You will need:
- Python 2.6/2.7
- Lua
- Wget with Lua hooks
Typically, you can install these on Ubuntu by running:
sudo apt-get install build-essential lua5.1 liblua5.1-0-dev python python-setuptools python-dev openssl libssl-dev python-pip make libgnutls-dev zlib1g-dev
sudo pip install seesaw
You will also need Wget with Lua. There is an Ubuntu PPA or you can build it yourself:
./get-wget-lua.sh
Grab a recent build script from here.
The pipeline file
The pipeline file typically includes:
- A line that checks the minimum Seesaw version required
- Copy-and-pasted monkey patches, if needed
- A routine to find Wget+Lua
- A version number in the form YYYYMMDD.NN
- Misc constants
- Custom tasks such as PrepareDirectories and MoveFiles
- Project information saved into the project variable
- Instructions on how to deal with the item saved into the pipeline variable
- An undeclared downloader variable which will be filled in by the Seesaw library
It is important to remember that each Task is a template for how to deal with each Item. Specific item variables should not be stored on a Task; rather, they should be saved onto the item: item["my_data"] = "hello".
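The reason is that a single Task instance processes every item flowing through the pipeline, so anything stored on the task leaks between items. A minimal sketch, using plain dicts in place of Seesaw Item objects and a hypothetical AppendSuffix task:

```python
# Sketch: why per-item state belongs on the item, not the task.
# Plain dicts stand in for Seesaw Item objects; AppendSuffix is hypothetical.

class AppendSuffix(object):
    """One instance of this task handles EVERY item in the pipeline."""
    def __init__(self, suffix):
        self.suffix = suffix  # shared configuration: fine to keep on the task

    def process(self, item):
        # per-item result: store it on the item itself
        item["name_with_suffix"] = item["item_name"] + self.suffix

task = AppendSuffix("-done")
item_a = {"item_name": "blog-alice"}
item_b = {"item_name": "blog-bob"}

task.process(item_a)
task.process(item_b)

print(item_a["name_with_suffix"])  # blog-alice-done
print(item_b["name_with_suffix"])  # blog-bob-done
```

Had the result been stored on the task, the second item would have overwritten the first item's data.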
Minimum Seesaw Version Check
from distutils.version import StrictVersion

import seesaw

if StrictVersion(seesaw.__version__) < StrictVersion("0.0.15"):
    raise Exception("This pipeline needs seesaw version 0.0.15 or higher.")
This check is used to prevent manual script users from using an obsolete version of Seesaw. The Warrior will always upgrade to the latest Seesaw if dictated in the Tracker's projects.json file.
Version 0.0.15 is the supported legacy version, but it is suggested to rely on the latest version of Seesaw as specified in the Seesaw Python Package Index.
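The idea behind the check can be illustrated with plain tuple comparison, a sketch that avoids the StrictVersion dependency (version_tuple and check_seesaw_version are hypothetical helpers, not part of Seesaw):

```python
# Sketch of the version gate using tuple comparison.
def version_tuple(version):
    """Turn "0.0.15" into (0, 0, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def check_seesaw_version(installed, required="0.0.15"):
    # Tuples compare element by element, so (0, 1, 0) > (0, 0, 15)
    if version_tuple(installed) < version_tuple(required):
        raise Exception("This pipeline needs seesaw version %s or higher."
                        % required)

check_seesaw_version("0.1.0")    # fine: (0, 1, 0) >= (0, 0, 15)
# check_seesaw_version("0.0.9")  # would raise: (0, 0, 9) < (0, 0, 15)
```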
Monkey Patches
Monkey patches such as AsyncPopenFixed are only provided for legacy versions of Seesaw.
Routine to find Wget-Lua
WGET_LUA = find_executable(
    "Wget+Lua",
    ["GNU Wget 1.14.lua.20130523-9a5c"],
    [
        "./wget-lua",
        "./wget-lua-warrior",
        "./wget-lua-local",
        "../wget-lua",
        "../../wget-lua",
        "/home/warrior/wget-lua",
        "/usr/bin/wget-lua"
    ]
)
if not WGET_LUA:
    raise Exception("No usable Wget+Lua found.")
This routine is a sanity check that aborts the script early if Wget+Lua cannot be found. Omit it if your project does not use Wget+Lua.
Script Version
VERSION = "20131129.00"
This constant, to be used within the pipeline variable, is sent to the Tracker and should be embedded within the WARC files. It is used for accounting purposes:
- Tracker admins can check the logs for faulty grab scripts and requeue the faulty items.
- Tracker admins can require the user to upgrade the scripts.
Always change the version whenever you make a non-cosmetic change. Note that this constant is only a variable; be sure it is actually used within the pipeline.
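The YYYYMMDD.NN scheme (date plus a two-digit same-day counter) can be generated and sanity-checked with the stdlib. A sketch; in real pipelines the constant is simply edited by hand:

```python
import re
import time

# Date-based version: YYYYMMDD plus a two-digit counter that is bumped by
# hand when several releases happen on the same day (.00, .01, ...).
VERSION = time.strftime("%Y%m%d") + ".00"

# Sanity check the YYYYMMDD.NN shape
assert re.match(r"^\d{8}\.\d{2}$", VERSION)
print(VERSION)
```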
Misc constants
USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27 ArchiveTeam"
TRACKER_ID = "posterous"
TRACKER_HOST = "tracker.archiveteam.org"
Constants like USER_AGENT and TRACKER_HOST keep the configuration in one place and make for cleaner code.
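The constants are combined when the pipeline is built; for example, the tracker endpoint passed to GetItemFromTracker and UploadWithTracker is assembled like this:

```python
TRACKER_ID = "posterous"
TRACKER_HOST = "tracker.archiveteam.org"

# Same expression the pipeline uses for its tracker tasks
tracker_url = "http://%s/%s" % (TRACKER_HOST, TRACKER_ID)
print(tracker_url)  # http://tracker.archiveteam.org/posterous
```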
PrepareDirectories & MoveFiles
class PrepareDirectories(SimpleTask):
    """
    A task that creates temporary directories and initializes filenames.

    It initializes these directories, based on the previously set item_name:
      item["item_dir"] = "%{data_dir}/%{item_name}"
      item["warc_file_base"] = "%{warc_prefix}-%{item_name}-%{timestamp}"

    These attributes are used in the following tasks, e.g., the Wget call.

    * set warc_prefix to the project name
    * item["data_dir"] is set by the environment: it points to a working
      directory reserved for this item
    * use item["item_dir"] for temporary files
    """
    def __init__(self, warc_prefix):
        SimpleTask.__init__(self, "PrepareDirectories")
        self.warc_prefix = warc_prefix

    def process(self, item):
        item_name = item["item_name"]
        dirname = "/".join(( item["data_dir"], item_name ))

        if os.path.isdir(dirname):
            shutil.rmtree(dirname)
        os.makedirs(dirname)

        item["item_dir"] = dirname
        item["warc_file_base"] = "%s-%s-%s" % (self.warc_prefix, item_name,
            time.strftime("%Y%m%d-%H%M%S"))

        open("%(item_dir)s/%(warc_file_base)s.warc.gz" % item, "w").close()


class MoveFiles(SimpleTask):
    """
    After downloading, this task moves the warc file from the
    item["item_dir"] directory to the item["data_dir"], and removes the
    files in the item["item_dir"] directory.
    """
    def __init__(self):
        SimpleTask.__init__(self, "MoveFiles")

    def process(self, item):
        os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
            "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)
        shutil.rmtree("%(item_dir)s" % item)
These tasks are "tradition" (meaning, they are copied-and-pasted and modified to fit) for managing temporary files.
Note, PrepareDirectories makes an empty warc.gz file, since later tasks expect a warc.gz file to exist.
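The directory dance the two tasks perform can be reproduced with the stdlib alone. A self-contained sketch, with a plain dict standing in for the Seesaw Item and a temporary directory standing in for data_dir:

```python
import os
import shutil
import tempfile
import time

# Plain dict in place of a Seesaw Item; tempfile.mkdtemp() plays data_dir.
item = {"data_dir": tempfile.mkdtemp(), "item_name": "example-item"}

# PrepareDirectories: make the scratch dir and touch the empty warc.gz
item["item_dir"] = "/".join((item["data_dir"], item["item_name"]))
os.makedirs(item["item_dir"])
item["warc_file_base"] = "%s-%s-%s" % ("demo", item["item_name"],
                                       time.strftime("%Y%m%d-%H%M%S"))
open("%(item_dir)s/%(warc_file_base)s.warc.gz" % item, "w").close()

# ... downloading would happen here ...

# MoveFiles: move the finished warc up to data_dir, drop the scratch dir
os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
          "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)
shutil.rmtree(item["item_dir"])

print(os.path.exists("%(data_dir)s/%(warc_file_base)s.warc.gz" % item))  # True
```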
project variable
project = Project(
    title = "Posterous",
    project_html = """
        <img class="project-logo" alt="Posterous Logo" src="http://archiveteam.org/images/6/6c/Posterous_logo.png" height="50"/>
        <h2>Posterous.com <span class="links">
            <a href="http://www.posterous.com/">Website</a> &middot;
            <a href="http://tracker.archiveteam.org/posterous/">Leaderboard</a>
        </span></h2>
        <p><i>Posterous</i> is closing April 30th, 2013</p>
    """,
    utc_deadline = datetime.datetime(2013, 4, 30, 23, 59, 0)
)
This variable is used within the Warrior to show the HTML at the top of the page.
(Note, this could potentially be used to show important messages, but that is not the intended use. Also, manual script users will not see anything related to this variable.)
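A deadline like the utc_deadline above is an ordinary datetime, so a check against the current time is a simple comparison. A sketch only; deadline_passed is a hypothetical helper, and the actual deadline handling lives in the Warrior, not the pipeline file:

```python
import datetime

# Same value as the Project above
utc_deadline = datetime.datetime(2013, 4, 30, 23, 59, 0)

def deadline_passed(now):
    # datetime objects compare chronologically
    return now >= utc_deadline

print(deadline_passed(datetime.datetime(2013, 5, 1, 0, 0, 0)))   # True
print(deadline_passed(datetime.datetime(2013, 4, 1, 0, 0, 0)))   # False
```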
pipeline variable
Here's a real chunk of code.
pipeline = Pipeline(
    # request an item from the tracker (using the universal-tracker protocol)
    # the downloader variable will be set by the warrior environment
    #
    # this task will wait for an item and sets item["item_name"] to the item
    # name before finishing
    GetItemFromTracker("http://%s/%s" % (TRACKER_HOST, TRACKER_ID), downloader,
        VERSION),

    # create the directories and initialize the filenames (see above)
    # warc_prefix is the first part of the warc filename
    #
    # this task will set item["item_dir"] and item["warc_file_base"]
    PrepareDirectories(warc_prefix="posterous.com"),

    # execute Wget+Lua
    #
    # the ItemInterpolation() objects are resolved during runtime
    # (when there is an Item with values that can be added to the strings)
    WgetDownload([
        WGET_LUA,
        "-U", USER_AGENT,
        "-nv",
        "-o", ItemInterpolation("%(item_dir)s/wget.log"),
        "--no-check-certificate",
        "--output-document", ItemInterpolation("%(item_dir)s/wget.tmp"),
        "--truncate-output",
        "-e", "robots=off",
        "--rotate-dns",
        "--recursive", "--level=inf",
        "--page-requisites",
        "--span-hosts",
        "--domains", ItemInterpolation(
            "%(item_name)s,s3.amazonaws.com,files.posterous.com,"
            "getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,"
            "getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,"
            "getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,"
            "getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com"),
        "--reject-regex", r"\.com/login",
        "--timeout", "60",
        "--tries", "20",
        "--waitretry", "5",
        "--lua-script", "posterous.lua",
        "--warc-file", ItemInterpolation("%(item_dir)s/%(warc_file_base)s"),
        "--warc-header", "operator: Archive Team",
        "--warc-header", "posterous-dld-script-version: " + VERSION,
        "--warc-header", ItemInterpolation("posterous-user: %(item_name)s"),
        ItemInterpolation("http://%(item_name)s/")
    ],
    max_tries = 2,
    # check this: which Wget exit codes count as a success?
    accept_on_exit_code = [ 0, 8 ],
    ),

    # this will set the item["stats"] string that is sent to the tracker
    # (see below)
    PrepareStatsForTracker(
        # there are a few normal values that need to be sent
        defaults = { "downloader": downloader, "version": VERSION },
        # this is used for the size counter on the tracker:
        # the groups should correspond with the groups configured on the tracker
        file_groups = {
            # there can be multiple groups with multiple files
            # file sizes are measured per group
            "data": [ ItemInterpolation("%(item_dir)s/%(warc_file_base)s.warc.gz") ]
        },
    ),

    # remove the temporary files, move the warc file from
    # item["item_dir"] to item["data_dir"]
    MoveFiles(),

    # there can be multiple items in the pipeline, but this wrapper ensures
    # that there is only one item uploading at a time
    #
    # the NumberConfigValue can be changed in the configuration panel
    LimitConcurrent(
        NumberConfigValue(min=1, max=4, default="1",
            name="shared:rsync_threads", title="Rsync threads",
            description="The maximum number of concurrent uploads."),
        # this upload task asks the tracker for an upload target
        # this can be HTTP or rsync and can be changed in the tracker admin panel
        UploadWithTracker(
            "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
            downloader = downloader,
            version = VERSION,
            # list the files that should be uploaded.
            # this may include directory names.
            # note: HTTP uploads will only upload the first file on this list
            files = [
                ItemInterpolation("%(data_dir)s/%(warc_file_base)s.warc.gz")
            ],
            # the relative path for the rsync command
            # (this defines if the files are uploaded to a subdirectory on
            # the server)
            rsync_target_source_path = ItemInterpolation("%(data_dir)s/"),
            # extra rsync parameters (probably standard)
            rsync_extra_args = [
                "--recursive", "--partial", "--partial-dir", ".rsync-tmp"
            ]
        ),
    ),

    # if the item passed every task, notify the tracker and report the
    # statistics
    SendDoneToTracker(
        tracker_url = "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
        stats = ItemValue("stats")
    )
)
It's pretty big.
Notice:
- the downloader variable should be left undefined
- ItemInterpolation holds some magic. ItemInterpolation("%(item_dir)s/wget.log").realize(item) executes "%(item_dir)s/wget.log" % item, which gives us item["item_dir"] + "/wget.log"
- --output-document concatenates everything into a single temporary file
- --truncate-output is a Wget+Lua option. It turns --output-document into a temporary-file option: Wget downloads into the file, extracts the URLs, and then truncates the file to 0 bytes
- -e robots=off is used because robots.txt is bad
- --lua-script posterous.lua specifies the Lua script that controls Wget
- NumberConfigValue adds another setting to the Warrior's advanced settings page
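The "magic" behind ItemInterpolation is ordinary Python %-formatting against the item's values, deferred until runtime. A sketch with a plain dict standing in for the Item and a hypothetical MiniInterpolation class standing in for the real one:

```python
# ItemInterpolation boils down to deferred "%(key)s" formatting.
class MiniInterpolation(object):
    def __init__(self, template):
        self.template = template

    def realize(self, item):
        # resolved only at runtime, once the item has its values
        return self.template % item

path = MiniInterpolation("%(item_dir)s/wget.log")
item = {"item_dir": "/data/blog-alice"}
print(path.realize(item))  # /data/blog-alice/wget.log
```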
Lua Script
The Lua script is like a parasite controlling and modifying Wget's behavior from within.
- Recommended reading: Wget with Lua hooks
- Example listings: Wget with Lua hooks#More_Examples
- Reference documentation: Wget with Lua hooks (GitHub)
Generally, scripts will want to use:
download_child_p
httploop_result
get_urls
download_child_p
This hook is useful for advanced URL accepting and rejecting. Although Wget supports regular expressions in its command-line options, they can be messy. Lua only supports a small subset of regular expressions called Patterns.
httploop_result
This hook is useful for checking whether we have been banned or for implementing our own --wait.
Here is a practical example that delays Wget for a minute on a ban, waits approximately 1 second between normal requests, and adds no delay for a content delivery network:
wget.callbacks.httploop_result = function(url, err, http_stat)
  local sleep_time = 60
  local status_code = http_stat["statcode"]

  if status_code == 420 then
    io.stdout:write("\nBanned (code "..http_stat.statcode.."). Sleeping for "..
      sleep_time.." seconds.\n")
    io.stdout:flush()

    -- Execute the UNIX sleep command (since Lua does not have its own
    -- delay function)
    -- Note that wget has its own linear backoff to this time as well
    os.execute("sleep " .. sleep_time)

    -- Tells wget to try again
    return wget.actions.CONTINUE
  else
    -- We're okay; sleep a bit (if we have to) and continue
    local sleep_time = 1.0 * (math.random(75, 125) / 100.0)

    if string.match(url["url"], "website-cdn%.net") then
      -- We should be able to go fast on images since that's what a web
      -- browser does
      sleep_time = 0
    end

    if sleep_time > 0.001 then
      os.execute("sleep " .. sleep_time)
    end

    -- Tells wget to resume normal behavior
    return wget.actions.NOTHING
  end
end
get_urls
This hook is used to add additional URLs.
This example injects URLs to simulate JavaScript requests:
wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}

  -- Read the downloaded page so it can be scanned for image IDs
  local f = io.open(file)
  local html = f:read("*all")
  f:close()

  for image_id in string.gmatch(html, "([a-zA-Z0-9]-)/image_thumb.png") do
    table.insert(urls, {
      url="http://example.com/photo_viewer.php?imageid="..image_id,
      post_data="crf_token=deadbeef"
    })
  end

  return urls
end
It can also be used to display a progress message:
url_count = 0

wget.callbacks.get_urls = function(file, url, is_css, iri)
  url_count = url_count + 1

  if url_count % 5 == 0 then
    io.stdout:write("\r - Downloaded "..url_count.." URLs.")
    io.stdout:flush()
  end
end
Useful Snippets
Read the first 4 kilobytes of a file:
read_file_short = function(file)
  if file then
    local f = io.open(file)
    local data = f:read(4096)
    f:close()
    return data or ""
  else
    return ""
  end
end
Run a pipeline.py
To run a pipeline file, run the command:
run-pipeline pipeline.py YOUR_NICKNAME
For more options, run:
run-pipeline --help
External Links
- Take a look at the grab scripts in recent Archive Team repositories for examples of clients.
- For more information, consult the seesaw-kit wiki.