Difference between revisions of "Dev/Seesaw"

From Archiveteam

Revision as of 11:36, 13 June 2015

Writing a Seesaw content grab script is the most challenging and fun aspect of the infrastructure.

What an Archive Team Project Contains

Once the Git repository has been created, be sure to include the following files:

pipeline.py

This file contains the Seesaw client code for the project.

README.{md,rst,txt}

This file contains:
* brief information about the project
* instructions on how to manually run the scripts

A template is available here: standalone-readme-template

[Project Name Here].lua (optional)

This is the Lua script used by Wget-Lua.

warrior-install.sh (optional)

This file is executed by the Warrior to install extra libraries needed by the project. Example: punchfork-grab warrior-install.sh.

wget-lua-warrior (optional)

This executable is a build of Wget-Lua for the warrior environment.

get-wget-lua.sh (optional)

Build scripts for Wget-Lua for those running scripts manually.

The repository is pulled in by the Warrior, or cloned by those who want to run the scripts manually.

Writing a pipeline.py (Seesaw Client)

The Seesaw client is a specific set of tasks that must be done within an item. Think of it as a template of instructions. Typically, the file is called pipeline.py. The pipeline file uses the Seesaw Library.

The pipeline file will typically use Wget with Lua scripting. Wget+Lua is a web crawler.

The Lua script is provided as an argument to Wget within the pipeline file. It controls fine-grained operations within Wget, such as rejecting unneeded URLs or adding more URLs as they are discovered.

The goal of the pipeline is to download, make WARC files, and upload them.

Quick Definitions

item

a work unit

pipeline

a series of tasks in an item

task

a step in getting the item done

Recommended reading.

Installation

You will need:

  • Python 2.6/2.7
  • Lua
  • Wget with Lua hooks

Typically, you can install these on Ubuntu by running:

sudo apt-get install build-essential lua5.1 liblua5.1-0-dev python python-setuptools python-dev openssl libssl-dev python-pip make libgnutls-dev zlib1g-dev
sudo pip install seesaw

You will also need Wget with Lua. There is an Ubuntu PPA or you can build it yourself:

./get-wget-lua.sh

Grab a recent build script from here.

The pipeline file

The pipeline file typically includes:

  • A line that checks the minimum seesaw version required
  • Copy-and-pasted monkey patches if needed
  • A routine to find Wget Lua
  • A version number in the form of YYYYMMDD.NN
  • Misc constants
  • Custom Tasks:
    • PrepareDirectories
    • MoveFiles
  • Project information saved into the project variable
  • Instructions on how to deal with the item saved into the pipeline variable
  • An undeclared downloader variable which will be filled in by the Seesaw library

It is important to remember that each Task is a template for how to deal with every Item. Item-specific values should not be stored on a Task; save them on the item instead: item["my_data"] = "hello".
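The point about per-item state can be illustrated without the Seesaw library at all. In this minimal sketch, CountWordsTask and the plain dicts standing in for items are hypothetical and are not Seesaw classes; the only thing it shows is that one task object is reused for every item.

```python
# Hypothetical stand-ins; these are NOT Seesaw classes.
class CountWordsTask(object):
    """One task instance is shared by every item that passes through it."""

    def __init__(self, name):
        # Configuration shared by all items may live on the task.
        self.name = name

    def process(self, item):
        # Per-item results are written onto the item, never onto self,
        # so two items in flight cannot overwrite each other's data.
        item["word_count"] = len(item["item_name"].split("-"))


task = CountWordsTask("CountWords")
first = {"item_name": "alpha-beta"}
second = {"item_name": "one-two-three"}

for item in (first, second):
    task.process(item)

print(first["word_count"], second["word_count"])
```

Had the count been stored as self.word_count, the second item would have silently replaced the first item's value.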

Minimum Seesaw Version Check

from distutils.version import StrictVersion
import seesaw
if StrictVersion(seesaw.__version__) < StrictVersion("0.0.15"):
    raise Exception("This pipeline needs seesaw version 0.0.15 or higher.")

This check is used to prevent manual script users from using an obsolete version of Seesaw. The Warrior will always upgrade to the latest Seesaw if dictated in the Tracker's projects.json file.

Version 0.0.15 is the supported legacy version, but it is suggested to rely on the latest version of Seesaw as specified in the Seesaw Python Package Index.
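The check relies on comparing versions numerically rather than as text. The sketch below shows why that matters; parse_version and is_too_old are hypothetical helpers written for this illustration, and they only handle plain dotted numbers.

```python
def parse_version(text):
    """Turn a dotted version such as "0.0.15" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_too_old(installed, required):
    # Tuples compare element by element, so (0, 0, 9) < (0, 0, 15).
    return parse_version(installed) < parse_version(required)


# Numeric comparison: 0.0.9 is older than 0.0.15.
print(is_too_old("0.0.9", "0.0.15"))

# Plain string comparison gets this wrong because "9" sorts after "1".
print("0.0.9" < "0.0.15")
```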

Monkey Patches

Monkey patches such as AsyncPopenFixed are only provided for legacy versions of Seesaw.

Routine to find Wget-Lua

WGET_LUA = find_executable(
    "Wget+Lua",
    ["GNU Wget 1.14.lua.20130523-9a5c"],
    [
        "./wget-lua",
        "./wget-lua-warrior",
        "./wget-lua-local",
        "../wget-lua",
        "../../wget-lua",
        "/home/warrior/wget-lua",
        "/usr/bin/wget-lua"
    ]
)

if not WGET_LUA:
    raise Exception("No usable Wget+Lua found.")

This routine is a sanity check that aborts the script early if Wget+Lua has not been found. Omit this if needed.
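For readers curious about what such a lookup involves, here is a rough sketch of the general idea: try each candidate path, ask it for its version, and accept the first one whose output contains an accepted version string. This is an assumption about the approach, not the Seesaw implementation; find_first_usable is a hypothetical name, and the demonstration probes the running Python interpreter instead of Wget+Lua.

```python
import subprocess
import sys


def find_first_usable(candidates, accepted_versions, version_arg="--version"):
    """Return the first candidate whose version output matches, else None."""
    for path in candidates:
        try:
            proc = subprocess.Popen(
                [path, version_arg],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
            )
            output = proc.communicate()[0].decode("utf-8", "replace")
        except OSError:
            # The file does not exist or is not executable; try the next one.
            continue
        if any(version in output for version in accepted_versions):
            return path
    return None


# Demonstration with the running Python interpreter instead of Wget+Lua.
print(find_first_usable(["./no-such-binary", sys.executable], ["Python"]))
```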

Script Version

VERSION = "20131129.00"

This constant, to be used within pipeline, is sent to the Tracker and should be embedded within the WARC files. It is used for accounting purposes:

  • Tracker admins can check the logs for faulty grab scripts and requeue the faulty items.
  • Tracker admins can require the user to upgrade the scripts.

Always change the version whenever you make a non-cosmetic change. Note, this constant is only a variable. Be sure that it is used within pipeline.
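If you bump the version by hand often, a small helper can keep the format consistent. The sketch below assumes that NN is a counter that starts at 00 and restarts each day; the page does not spell that out, so treat it as a convention chosen for this example. next_version is a hypothetical helper.

```python
import datetime


def next_version(current, today):
    """Return the next YYYYMMDD.NN string after `current` for the date `today`."""
    date_part, counter = current.split(".")
    today_part = today.strftime("%Y%m%d")
    if date_part == today_part:
        # Same day: bump the two-digit counter.
        return "%s.%02d" % (today_part, int(counter) + 1)
    # New day: restart the counter (assumed convention).
    return "%s.00" % today_part


print(next_version("20131129.00", datetime.date(2013, 11, 29)))
print(next_version("20131129.00", datetime.date(2013, 12, 1)))
```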

Misc constants

USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27 ArchiveTeam"
TRACKER_ID = "posterous"
TRACKER_HOST = "tracker.archiveteam.org"

Defining values such as USER_AGENT and TRACKER_HOST as named constants keeps the pipeline readable and means each value only has to be changed in one place.

Check IP address

This task resolves a known hostname and compares the result with the addresses it is expected to have, to ensure the user is not behind a proxy or firewall. Sometimes websites are censored or the user is behind a captive portal (like coffee shop Wi-Fi), which will ruin the results.

import socket

class CheckIP(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, "CheckIP")
        self._counter = 0

    def process(self, item):
        # Check only occasionally: on the first item, then again after
        # every 10 items that pass through this task.
        if self._counter <= 0:
            ip_str = socket.gethostbyname('example.com')
            if ip_str not in ['1.2.3.4', '1.2.3.6']:
                item.log_output('Got IP address: %s' % ip_str)
                item.log_output(
                    'Are you behind a firewall/proxy? That is a big no-no!')
                raise Exception(
                    'Are you behind a firewall/proxy? That is a big no-no!')
            self._counter = 10
        else:
            self._counter -= 1

PrepareDirectories & MoveFiles

class PrepareDirectories(SimpleTask):
  """
  A task that creates temporary directories and initializes filenames.

  It initializes these directories, based on the previously set item_name:
    item["item_dir"] = "%{data_dir}/%{item_name}"
    item["warc_file_base"] = "%{warc_prefix}-%{item_name}-%{timestamp}"

  These attributes are used in the following tasks, e.g., the Wget call.

  * set warc_prefix to the project name.
  * item["data_dir"] is set by the environment: it points to a working
    directory reserved for this item.
  * use item["item_dir"] for temporary files
  """
  def __init__(self, warc_prefix):
    SimpleTask.__init__(self, "PrepareDirectories")
    self.warc_prefix = warc_prefix

  def process(self, item):
    item_name = item["item_name"]
    dirname = "/".join(( item["data_dir"], item_name ))

    if os.path.isdir(dirname):
      shutil.rmtree(dirname)
    os.makedirs(dirname)

    item["item_dir"] = dirname
    item["warc_file_base"] = "%s-%s-%s" % (self.warc_prefix, item_name, time.strftime("%Y%m%d-%H%M%S"))

    open("%(item_dir)s/%(warc_file_base)s.warc.gz" % item, "w").close()


class MoveFiles(SimpleTask):
  """
  After downloading, this task moves the warc file from the
  item["item_dir"] directory to the item["data_dir"], and removes
  the files in the item["item_dir"] directory.
  """
  def __init__(self):
    SimpleTask.__init__(self, "MoveFiles")

  def process(self, item):
    os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
              "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)

    shutil.rmtree("%(item_dir)s" % item)

These tasks are "tradition" (meaning, they are copied-and-pasted and modified to fit) for managing temporary files.

Note, PrepareDirectories makes an empty warc.gz file since later tasks expect a warc.gz file.
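To see the whole life cycle of the temporary files in one place, the following sketch replays the two tasks as plain functions on a dict, outside Seesaw. The prepare and move functions and the example item name are hypothetical; the paths and file names follow the tasks above.

```python
import os
import shutil
import tempfile
import time


def prepare(item, warc_prefix):
    # Mirror PrepareDirectories: fresh item_dir plus an empty .warc.gz.
    item["item_dir"] = os.path.join(item["data_dir"], item["item_name"])
    if os.path.isdir(item["item_dir"]):
        shutil.rmtree(item["item_dir"])
    os.makedirs(item["item_dir"])
    item["warc_file_base"] = "%s-%s-%s" % (
        warc_prefix, item["item_name"], time.strftime("%Y%m%d-%H%M%S"))
    open("%(item_dir)s/%(warc_file_base)s.warc.gz" % item, "w").close()


def move(item):
    # Mirror MoveFiles: keep only the WARC, drop everything else.
    os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
              "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)
    shutil.rmtree(item["item_dir"])


item = {"data_dir": tempfile.mkdtemp(), "item_name": "example.posterous.com"}
prepare(item, "posterous.com")
open(os.path.join(item["item_dir"], "wget.log"), "w").close()  # scratch file
move(item)
remaining = os.listdir(item["data_dir"])
print(remaining)
shutil.rmtree(item["data_dir"])
```

After move() only the WARC file is left in data_dir; the scratch wget.log disappears together with item_dir.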

project variable

project = Project(
  title = "Posterous",
  project_html = """
    <img class="project-logo"
      alt="Posterous Logo"
      src="http://archiveteam.org/images/6/6c/Posterous_logo.png"
      height="50"/>
    <h2>Posterous.com
      <span class="links">
        <a href="http://www.posterous.com/">Website</a> · 
        <a href="http://tracker.archiveteam.org/posterous/">Leaderboard</a>
      </span>
    </h2>
    <p><i>Posterous</i> is closing April, 30th, 2013</p>
  """
   , utc_deadline = datetime.datetime(2013, 4, 30, 23, 59, 0)
)

This variable is used within the Warrior to show the HTML at the top of the page.

Note, this could potentially be used to show important messages using <p class="projectBroadcastMessage"></p>. However, manual script users will not see anything related to this variable, so you may want to print out any important messages instead.

pipeline variable

Here's a real chunk of code.

pipeline = Pipeline(
  # request an item from the tracker (using the universal-tracker protocol)
  # the downloader variable will be set by the warrior environment
  #
  # this task will wait for an item and sets item["item_name"] to the item name
  # before finishing
  GetItemFromTracker("http://%s/%s" % (TRACKER_HOST, TRACKER_ID), downloader, VERSION),

  # create the directories and initialize the filenames (see above)
  # warc_prefix is the first part of the warc filename
  #
  # this task will set item["item_dir"] and item["warc_file_base"]
  PrepareDirectories(warc_prefix="posterous.com"),

  # execute Wget+Lua
  #
  # the ItemInterpolation() objects are resolved during runtime
  # (when there is an Item with values that can be added to the strings)
  WgetDownload([ WGET_LUA,
      "-U", USER_AGENT,
      "-nv",
      "-o", ItemInterpolation("%(item_dir)s/wget.log"),
      "--no-check-certificate",
      "--output-document", ItemInterpolation("%(item_dir)s/wget.tmp"),
      "--truncate-output",
      "-e", "robots=off",
      "--rotate-dns",
      "--recursive", "--level=inf",
      "--page-requisites",
      "--span-hosts", 
      "--domains", ItemInterpolation("%(item_name)s,s3.amazonaws.com,files.posterous.com,"
        "getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,"
        "getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,"
        "getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,"
        "getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com"),
      "--reject-regex", r"\.com/login",
      "--timeout", "60",
      "--tries", "20",
      "--waitretry", "5",
      "--lua-script", "posterous.lua",
      "--warc-file", ItemInterpolation("%(item_dir)s/%(warc_file_base)s"),
      "--warc-header", "operator: Archive Team",
      "--warc-header", "posterous-dld-script-version: " + VERSION,
      "--warc-header", ItemInterpolation("posterous-user: %(item_name)s"),
      ItemInterpolation("http://%(item_name)s/")
    ],
    max_tries = 2,
    # check this: which Wget exit codes count as a success?
    accept_on_exit_code = [ 0, 8 ],
  ),

  # this will set the item["stats"] string that is sent to the tracker (see below)
  PrepareStatsForTracker(
    # there are a few normal values that need to be sent
    defaults = { "downloader": downloader, "version": VERSION },
    # this is used for the size counter on the tracker:
    # the groups should correspond with the groups configured on the tracker
    file_groups = {
      # there can be multiple groups with multiple files
      # file sizes are measured per group
      "data": [ ItemInterpolation("%(item_dir)s/%(warc_file_base)s.warc.gz") ]
    },
  ),

  # remove the temporary files, move the warc file from
  # item["item_dir"] to item["data_dir"]
  MoveFiles(),
  
  # there can be multiple items in the pipeline, but this wrapper ensures
  # that there is only one item uploading at a time
  #
  # the NumberConfigValue can be changed in the configuration panel
  LimitConcurrent(NumberConfigValue(min=1, max=4, default="1",
    name="shared:rsync_threads", title="Rsync threads", 
    description="The maximum number of concurrent uploads."),
    # this upload task asks the tracker for an upload target
    # this can be HTTP or rsync and can be changed in the tracker admin panel
    UploadWithTracker(
      "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
      downloader = downloader,
      version = VERSION,
      # list the files that should be uploaded.
      # this may include directory names.
      # note: HTTP uploads will only upload the first file on this list
      files = [
        ItemInterpolation("%(data_dir)s/%(warc_file_base)s.warc.gz")
      ],
      # the relative path for the rsync command
      # (this defines if the files are uploaded to a subdirectory on the server)
      rsync_target_source_path = ItemInterpolation("%(data_dir)s/"),
      # extra rsync parameters (probably standard)
      rsync_extra_args = [
        "--recursive",
        "--partial",
        "--partial-dir", ".rsync-tmp"
      ]
    ),
  ),

  # if the item passed every task, notify the tracker and report the statistics
  SendDoneToTracker(
    tracker_url = "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
    stats = ItemValue("stats")
  )
)

It's pretty big.

Notice:

  • the downloader variable should be left undefined
  • ItemInterpolation holds some magic: it defers string formatting until an item exists. ItemInterpolation("%(item_dir)s/wget.log").realize(item) evaluates "%(item_dir)s/wget.log" % item, which gives us item["item_dir"]+"/wget.log"
  • --output-document concatenates everything into a single temporary file.
  • --truncate-output is a Wget+Lua option. It turns --output-document into a temporary file option: Wget downloads to the file, extracts the URLs, and then truncates the temporary file to 0 bytes.
  • the use of -e robots=off because robots.txt is bad
  • --lua-script posterous.lua specifies the Lua script that controls Wget
  • NumberConfigValue adds another setting to the Warrior's advanced settings page
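The deferred formatting behind ItemInterpolation can be sketched in a few lines. InterpolationSketch below is a hypothetical stand-in written for this page, not the Seesaw class; it only mimics the realize(item) call mentioned above.

```python
class InterpolationSketch(object):
    """Hypothetical stand-in for ItemInterpolation; not the Seesaw class."""

    def __init__(self, template):
        # Only the template is stored; no item exists yet.
        self.template = template

    def realize(self, item):
        # Filled in later, once a concrete item is available.
        return self.template % item


log_path = InterpolationSketch("%(item_dir)s/wget.log")
item = {"item_dir": "/data/example", "item_name": "example"}
print(log_path.realize(item))
```

This is why the Wget argument list can be written once, when the pipeline is defined, and still produce different paths for every item.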

Lua Script

The Lua script is like a parasite controlling and modifying Wget's behavior from within.

Generally, scripts will want to use:

  1. download_child_p
  2. httploop_result
  3. get_urls

download_child_p

This hook is useful for advanced URL accepting and rejecting. Although Wget supports regular expressions in its command-line options, they can get messy. Note that Lua only supports a small subset of regular expressions, called Patterns.

httploop_result

This hook is useful for checking if we have been banned or implementing our own --wait.

Here is a practical example that delays Wget for a minute on a ban or server overload, waits approximately 1 second between normal requests, and applies no delay on a content delivery network:

wget.callbacks.httploop_result = function(url, err, http_stat)
  local sleep_time = 60
  local status_code = http_stat["statcode"]

  if status_code == 420 or status_code >= 500 then
    if status_code == 420 then
      io.stdout:write("\nBanned (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    else
      io.stdout:write("\nServer angered! (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    end

    io.stdout:flush()

    -- Execute the UNIX sleep command (since Lua does not have its own delay function)
    -- Note that wget has its own linear backoff to this time as well
    os.execute("sleep " .. sleep_time)

    -- Tells wget to try again
    return wget.actions.CONTINUE

  else
    -- We're okay; sleep a bit (if we have to) and continue
    local sleep_time = 1.0 * (math.random(75, 125) / 100.0)

    if string.match(url["url"], "website-cdn%.net") then
      -- We should be able to go fast on images since that's what a web browser does
      sleep_time = 0
    end

    if sleep_time > 0.001 then
      os.execute("sleep " .. sleep_time)
    end

    -- Tells wget to resume normal behavior
    return wget.actions.NOTHING
  end
end
  • You will likely want to be cautious and return the wget.actions.CONTINUE action for a wide range of error cases. Wget may otherwise treat a temporary server overload as a permanent error.
  • Yahoo! likes to use status 999 to indicate a temporary ban.
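The delay policy itself is easy to get subtly wrong, so it can help to write it down as a pure function and test the thresholds before porting them to Lua. The sketch below restates the rules of the example hook in Python purely for illustration; choose_delay is a hypothetical helper and is not part of Wget+Lua or Seesaw.

```python
import random


def choose_delay(status_code, url, rng=random):
    """Return (seconds_to_sleep, retry) for one finished request."""
    if status_code == 420 or status_code >= 500:
        # Ban or server trouble: back off for a minute and retry the URL.
        return 60.0, True
    if "website-cdn.net" in url:
        # CDN assets are fetched without any delay.
        return 0.0, False
    # Normal page: about one second, jittered by up to 25% either way.
    return rng.randint(75, 125) / 100.0, False


print(choose_delay(503, "http://example.com/page"))
print(choose_delay(200, "http://img.website-cdn.net/a.png"))
print(choose_delay(999, "http://example.com/page"))
```

Because the check is status_code >= 500, a 999 response also lands in the back-off branch.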

get_urls

This hook is used to add additional URLs.

This example injects URLs to simulate JavaScript requests:

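wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}

  -- Read the downloaded document so it can be searched for image IDs
  local f = file and io.open(file)
  local html = f and f:read("*all") or ""
  if f then f:close() end

  for image_id in string.gmatch(html, "([a-zA-Z0-9]-)/image_thumb%.png") do
    table.insert(urls, {
      url="http://example.com/photo_viewer.php?imageid="..image_id,
      post_data="crf_token=deadbeef"
    })
  end

  return urls
end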

It can also be used to display a progress message:

url_count = 0

wget.callbacks.get_urls = function(file, url, is_css, iri)
  url_count = url_count + 1
  if url_count % 5 == 0 then
    io.stdout:write("\r - Downloaded "..url_count.." URLs.")
    io.stdout:flush()
  end
end

Useful Snippets

Read the first 4 kilobytes (4096 bytes) of a file:

read_file_short = function(file)
  if file then
    local f = io.open(file)
    -- io.open returns nil if the file cannot be opened
    if not f then
      return ""
    end
    local data = f:read(4096)
    f:close()
    return data or ""
  else
    return ""
  end
end

Run a pipeline.py

To run a pipeline file, run the command:

run-pipeline pipeline.py YOUR_NICKNAME

For more options, run:

run-pipeline --help
