Wget with Lua hooks

  • New idea: add Lua scripting to wget.
      • The Lua scripting is patched on the "lua" branch. You can use the [https://github.com/alard/wget-lua/compare/lua#files_bucket compare branch feature] on GitHub to see the differences.
      • Alternative location: https://github.com/ArchiveTeam/wget-lua/tree/lua.
      • If you get errors about 'wget.pod' while compiling, try applying [http://paste.archivingyoursh.it/raw/dekasuroda this] patch.

Example usage:

wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua
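
Such a script works by assigning functions to fields of the wget.callbacks table; wget then calls them at the corresponding points while it runs. As a minimal sketch of what a script can look like (the actual print_parameters.lua in the repository may differ), this logs the outcome of every HTTP request and then defers to wget, using only the callback and fields shown on this page:

wget.callbacks.httploop_result = function(url, err, http_stat)
  -- print the error code and HTTP status of each request,
  -- then let wget handle the result as it normally would
  print(err, http_stat.statcode)
  return wget.actions.NOTHING
end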

Why would this be useful?

Custom error handling

What should wget do when a request fails? Sometimes you want it to retry the URL after a server error, or to stop entirely on a 404.

wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end
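
In a real grab you usually want to cap retries instead of looping forever. Here is a sketch with our own retry counter (the tries table is plain Lua bookkeeping, not part of the wget-lua API), assuming, as in Archive Team's grab scripts, that the url argument is a table whose url field holds the address:

local tries = {}

wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 and (tries[url.url] or 0) < 5 then
    -- count this attempt and ask wget to request the URL again
    tries[url.url] = (tries[url.url] or 0) + 1
    return wget.actions.CONTINUE
  end
  -- anything else, or a URL that already failed five times:
  -- fall back to wget's default handling
  return wget.actions.NOTHING
end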

Custom decide rules

Download this URL or not?

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
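
The depth and verdict parameters combine naturally. For instance, here is a sketch that trusts wget's own decision but never recurses more than five levels deep:

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if depth > 5 then
    -- too deep: skip this link no matter what wget thinks
    return false
  end
  -- otherwise keep wget's decision
  return verdict
end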

Custom URL extraction/generation

Sometimes it's useful to write your own URL extraction code, for example to add URLs that aren't actually linked on the page.

wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, ".com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html",
        link_expect_html=1,
        link_expect_css=0 },
      { url=url.."photo.jpg",
        link_expect_html=0,
        link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end
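
Since get_urls also receives the path of the file wget just saved, a script can scrape the page body itself for things that ordinary link extraction misses. A sketch using standard Lua I/O and pattern matching (the data-photo-id attribute and example.com URLs are hypothetical, for illustration only):

wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}
  local f = io.open(file, "rb")
  if f then
    local body = f:read("*all")
    f:close()
    -- hypothetical markup: queue a download for every
    -- data-photo-id="12345" value found in the page
    for id in string.gmatch(body, 'data%-photo%-id="(%d+)"') do
      table.insert(urls, { url="http://example.com/photo/"..id..".jpg",
                           link_expect_html=0 })
    end
  end
  return urls
end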

More examples

Archive Team has real-life scripts in the Archive Team GitHub organization; look for recent -grab projects. The Lua scripts range from simple checks to complex URL scraping.