Wget with Lua hooks

  • New idea: add Lua scripting to wget.

Example usage:

wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua
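
The script passed with --lua-script defines callback functions that wget calls as it runs. As a rough sketch (not necessarily the actual contents of lua-example/print_parameters.lua), a script that simply reports what it receives could look like this:

-- Minimal sketch: report the URL and HTTP status of every completed request.
wget.callbacks.httploop_result = function(url, err, http_stat)
  io.stdout:write(url["url"] .. " returned " .. http_stat.statcode .. "\n")
  io.stdout:flush()
  -- Let wget decide what to do next
  return wget.actions.NOTHING
end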

Installation

apt install build-essential git autoconf automake autopoint texinfo flex gperf autogen shtool liblua5.1-0-dev gnutls-dev
git clone https://github.com/ArchiveTeam/wget-lua
cd wget-lua
./bootstrap
./configure
make
mkdir -p ~/bin/ && cp ./src/wget ~/bin/wget-lua

Why would this be useful?

Custom error handling

What should happen when a request fails? Sometimes you want wget to retry a URL if it gets a server error.

wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end

httploop_result is useful for checking whether we have been banned or for implementing our own --wait.

Here is a practical example that delays Wget for a minute after a ban or server overload, waits approximately 1 second between normal requests, and adds no delay for a content delivery network:

wget.callbacks.httploop_result = function(url, err, http_stat)
  local sleep_time = 60
  local status_code = http_stat["statcode"]

  if status_code == 420 or status_code >= 500 then
    if status_code == 420 then
      io.stdout:write("\nBanned (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    else
      io.stdout:write("\nServer angered! (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    end

    io.stdout:flush()

    -- Execute the UNIX sleep command (since Lua does not have its own delay function)
    -- Note that wget has its own linear backoff to this time as well
    os.execute("sleep " .. sleep_time)

    -- Tells wget to try again
    return wget.actions.CONTINUE

  else
    -- We're okay; sleep a bit (if we have to) and continue
    local sleep_time = 1.0 * (math.random(75, 125) / 100.0)

    if string.match(url["url"], "website-cdn%.net") then
      -- We should be able to go fast on images since that's what a web browser does
      sleep_time = 0
    end

    if sleep_time > 0.001 then
      os.execute("sleep " .. sleep_time)
    end

    -- Tells wget to resume normal behavior
    return wget.actions.NOTHING
  end
end
  • You will likely want to be cautious and return wget.actions.CONTINUE for a wide range of status codes, since Wget may otherwise treat a temporary server overload as a permanent error.
  • Yahoo! likes to use status 999 to indicate a temporary ban (see the sketch below).
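
For example, a 999 check could be added alongside the other status checks. This is only a sketch, and the 600-second pause is an arbitrary choice assuming the ban expires on its own:

-- Sketch: treat Yahoo!'s status 999 as a temporary ban and back off.
wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 999 then
    io.stdout:write("\nTemporarily banned (code 999). Sleeping for 600 seconds.\n")
    io.stdout:flush()
    os.execute("sleep 600")
    -- Try the same URL again after the pause
    return wget.actions.CONTINUE
  end
  -- Otherwise let wget behave normally
  return wget.actions.NOTHING
end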

Custom decide rules

Download this url or not?

download_child_p is useful for advanced URL accepting and rejecting. Although Wget supports regular expressions in its command-line options, they can get messy. Lua only supports a small subset of regular expressions, called patterns.

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
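
Note that a dot in a Lua pattern matches any character, so the checks above would also match hostnames such as "textfilesXcom". If that matters, escape the dot with % or ask string.find for a plain-text search; here is a sketch of the same callback with stricter matching:

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  -- %. matches a literal dot instead of "any character"
  if string.find(urlpos.url, "textfiles%.com") then
    return true
  -- passing plain=true as the fourth argument disables patterns entirely
  elseif string.find(urlpos.url, "archive.org", 1, true) then
    return false
  else
    return verdict
  end
end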

Custom url extraction/generation

Sometimes it's useful to write your own URL extraction code, for example to add URLs that aren't actually linked on the page.

wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, ".com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html",
        link_expect_html=1,
        link_expect_css=0 },
      { url=url.."photo.jpg",
        link_expect_html=0,
        link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end

This example injects URLs to simulate JavaScript requests:

wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}

  -- Read the downloaded page so we can scan it for thumbnail references
  local f = io.open(file)
  local html = f:read("*all")
  f:close()

  for image_id in string.gmatch(html, "([a-zA-Z0-9]+)/image_thumb%.png") do
    table.insert(urls, {
      url="http://example.com/photo_viewer.php?imageid="..image_id,
      post_data="crf_token=deadbeef"
    })
  end

  return urls
end

get_urls can also be used to display a progress message:

url_count = 0

wget.callbacks.get_urls = function(file, url, is_css, iri)
  url_count = url_count + 1
  if url_count % 5 == 0 then
    io.stdout:write("\r - Downloaded "..url_count.." URLs.")
    io.stdout:flush()
  end
  -- No extra URLs to add
  return {}
end

More Examples

Archive Team has real-life scripts in the Archive Team GitHub organization. Look for recent -grab projects. The Lua scripts range from simple checks to complex URL scraping.

Useful Snippets

Read the first 4 kilobytes of a file:

read_file_short = function(file)
  if file then
    local f = io.open(file)
    local data = f:read(4096)
    f:close()
    return data or ""
  else
    return ""
  end
end
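
As one possible use (a sketch, not taken from an existing project), such a helper can peek at the start of a downloaded file inside get_urls, for example to skip files that do not look like HTML; the "<html" check here is only a heuristic:

-- Sketch: only scan files whose first bytes look like HTML.
wget.callbacks.get_urls = function(file, url, is_css, iri)
  local head = string.lower(read_file_short(file))
  if not string.find(head, "<html", 1, true) then
    -- probably an image or other binary file; nothing to extract
    return {}
  end
  -- ...extract and return extra URLs here...
  return {}
end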