Wget with Lua hooks
- New idea: add Lua scripting to wget.
- Get the source from: https://github.com/ArchiveTeam/wget-lua
- Old repo is located at https://github.com/alard/wget-lua/tree/lua
- If you get errors about 'wget.pod' while compiling, try applying this patch.
- Documentation: https://github.com/ArchiveTeam/wget-lua/wiki#wget-with-lua-hooks
Example usage:
wget http://www.archiveteam.org/ -r --lua-script=lua-example/print_parameters.lua
Installation
apt install build-essential git autoconf automake autopoint texinfo flex gperf autogen shtool liblua5.1-0-dev gnutls-dev
git clone https://github.com/ArchiveTeam/wget-lua
cd wget-lua
./bootstrap
./configure
make
mkdir -p ~/bin/ && cp ./src/wget ~/bin/wget-lua
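To check that the build worked (a quick sanity test, assuming the steps above completed without errors), run the copied binary and make sure it prints a version banner:
~/bin/wget-lua --version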
Why would this be useful?
Custom error handling
What should happen in case of an error? Sometimes you want Wget to retry the URL if it gets a server error.
wget.callbacks.httploop_result = function(url, err, http_stat)
  if http_stat.statcode == 500 then
    -- try again
    return wget.actions.CONTINUE
  elseif http_stat.statcode == 404 then
    -- stop
    return wget.actions.EXIT
  else
    -- let wget decide
    return wget.actions.NOTHING
  end
end
httploop_result is useful for checking whether we have been banned or for implementing our own --wait.
Here is a practical example that delays Wget for a minute on a ban or server overload, waits approximately one second between normal requests, and adds no delay for a content delivery network:
wget.callbacks.httploop_result = function(url, err, http_stat)
  local sleep_time = 60
  local status_code = http_stat["statcode"]
  if status_code == 420 or status_code >= 500 then
    if status_code == 420 then
      io.stdout:write("\nBanned (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    else
      io.stdout:write("\nServer angered! (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    end
    io.stdout:flush()
    -- Execute the UNIX sleep command (since Lua does not have its own delay function)
    -- Note that wget has its own linear backoff to this time as well
    os.execute("sleep " .. sleep_time)
    -- Tells wget to try again
    return wget.actions.CONTINUE
  else
    -- We're okay; sleep a bit (if we have to) and continue
    local sleep_time = 1.0 * (math.random(75, 125) / 100.0)
    if string.match(url["url"], "website-cdn%.net") then
      -- We should be able to go fast on images since that's what a web browser does
      sleep_time = 0
    end
    if sleep_time > 0.001 then
      os.execute("sleep " .. sleep_time)
    end
    -- Tells wget to resume normal behavior
    return wget.actions.NOTHING
  end
end
- You will likely want to be cautious and use the wget.actions.CONTINUE action to cover a wide range of cases, since Wget may treat a temporary server overload as a permanent error.
- Yahoo! likes to use status 999 to indicate a temporary ban. A sketch of handling that case follows below.
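As a minimal sketch of handling such a ban status (the 999 check and the 30-minute delay are illustrative assumptions, not values taken from a real project):
wget.callbacks.httploop_result = function(url, err, http_stat)
  -- Treat Yahoo!'s unofficial 999 status as a temporary ban
  if http_stat.statcode == 999 then
    io.stdout:write("\nTemporarily banned (code 999). Sleeping for 30 minutes.\n")
    io.stdout:flush()
    os.execute("sleep 1800")
    -- Tell wget to try the same URL again
    return wget.actions.CONTINUE
  end
  -- Otherwise let wget decide what to do
  return wget.actions.NOTHING
end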
Custom decide rules
Download this URL or not?
download_child_p is useful for advanced URL accepting and rejecting. Although Wget supports regular expressions in its command-line options, they can get messy. Note that Lua does not support full regular expressions, only a smaller subset called patterns.
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict)
  if string.find(urlpos.url, "textfiles.com") then
    -- always download
    return true
  elseif string.find(urlpos.url, "archive.org") then
    -- never!
    return false
  else
    -- follow wget's advice
    return verdict
  end
end
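One pattern gotcha worth remembering: a bare . in a Lua pattern matches any character, so hostnames like the ones above also match unintended strings; escape literal dots with %. A small standalone illustration, using made-up example strings:
-- "." matches any character, so the pattern "archive.org" also matches "archiveXorg"
print(string.find("http://archiveXorg/", "archive.org"))   -- finds an unintended match
print(string.find("http://archiveXorg/", "archive%.org"))  -- nil: no match
print(string.find("http://archive.org/", "archive%.org"))  -- finds the intended match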
Custom URL extraction/generation
Sometimes it's useful to write your own URL extraction code, for example to add URLs that aren't actually on the page.
wget.callbacks.get_urls = function(file, url, is_css, iri)
  if string.find(url, ".com/profile/[^/]+/$") then
    -- make sure wget downloads the user's photo page
    -- and custom profile photo
    return {
      { url=url.."photo.html",
        link_expect_html=1,
        link_expect_css=0 },
      { url=url.."photo.jpg",
        link_expect_html=0,
        link_expect_css=0 }
    }
  else
    -- no new urls to add
    return {}
  end
end
This example injects URLs to simulate JavaScript requests:
wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}
  -- read the downloaded page so we can extract image ids from it
  local f = io.open(file)
  local html = f:read("*all")
  f:close()
  for image_id in string.gmatch(html, "([a-zA-Z0-9]-)/image_thumb.png") do
    table.insert(urls, {
      url="http://example.com/photo_viewer.php?imageid="..image_id,
      post_data="crf_token=deadbeef"
    })
  end
  return urls
end
get_urls can also be used to display a progress message:
url_count = 0
wget.callbacks.get_urls = function(file, url, is_css, iri)
  url_count = url_count + 1
  if url_count % 5 == 0 then
    io.stdout:write("\r - Downloaded "..url_count.." URLs.")
    io.stdout:flush()
  end
end
More Examples
Archive Team has real-life scripts on the Archive Team GitHub organization. Look for recent -grab projects. The Lua scripts range from simple checks to complex URL scraping.
- zapd-grab/zapd.lua: Avoids the JavaScript monstrosity by scraping anything on the CDN that looks like a URL.
- puush-grab/puush.lua: Checks the status code and the contents and returns custom error codes.
- posterous-grab/posterous.lua: Checks the status code and delays if needed.
- xanga-grab/xanga.lua: Implements its own URL scraping.
- patch-grab/patch.lua: Scrapes URLs as it goes along and sends them off to a server to be processed later.
- formspring-grab/formspring.lua: Manually behaves like JavaScript and builds its own request URLs.
- hyves-grab/hyves.lua: Works around JavaScript calls for pagination. Includes calling an external process to decrypt ciphertext.
- ArchiveBot/pipeline/archivebot.lua: Logs results in Redis and implements custom URL checking.
Useful Snippets
Read the first 4 kilobytes of a file:
read_file_short = function(file)
  if file then
    -- read up to 4096 bytes; return an empty string if the file is empty
    local f = io.open(file)
    local data = f:read(4096)
    f:close()
    return data or ""
  else
    return ""
  end
end
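As an example of how such a snippet might be used (a sketch only; the ban string, the 60-second delay, and the assumption that http_stat carries the path of the downloaded file in local_file are all illustrative):
wget.callbacks.httploop_result = function(url, err, http_stat)
  -- peek at the start of the response body that wget saved locally
  local body = read_file_short(http_stat["local_file"])
  if string.find(body, "temporarily unavailable") then
    io.stdout:write("\nBan page detected. Sleeping for 60 seconds.\n")
    io.stdout:flush()
    os.execute("sleep 60")
    return wget.actions.CONTINUE
  end
  return wget.actions.NOTHING
end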