Qwarc
A Python framework written by JAA for quickly crawling sites and saving them to WARC. It does not do any processing itself beyond HTTP requests and responses; all logic must be implemented by the user.
Its source is found here. Ensure you are using the latest commit on the 0.2 branch: the master branch is outdated and uses warcio (which does not fully follow the WARC specification), so it should not be used.
Requirements
qwarc does not work in Python 3.9 and later. qwarc has very specific dependencies, which will be pulled in if you install it with pip. (A venv is probably a good idea.) You also need to run pip install async-timeout==3.0.1 manually.
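As a rough illustration of the version constraint above, a setup or wrapper script could check the interpreter before attempting to run qwarc. This is only a sketch: the helper name python_supported is made up for this example, and the text gives no lower bound, so only the "3.9 and later" limit is checked.

```python
import sys


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter is older than 3.9 (hypothetical helper).

    The only constraint taken from the text is that qwarc does not work
    in Python 3.9 and later; no lower bound is checked here.
    """
    return tuple(version_info[:2]) < (3, 9)


if not python_supported():
    print("Warning: this interpreter is 3.9 or newer; qwarc is not expected to work.")
```

Running this inside the venv you intend to use shows early whether that environment was created with a suitable interpreter.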
Writing a spec file
qwarc is self-documenting. A given crawl's grab scripts are put into its meta WARC, e.g. in https://archive.org/download/forum.canucks.com_topic_updates_202309/forum.canucks.com-updates-meta.warc.gz. In this way, you can find example grab scripts.
Spec files are written in Python. qwarc will look for subclasses of qwarc.Item (non-recursively). These subclasses should:
- Define an itemType attribute, to allow qwarc to uniquely identify the subclass
- Implement a generate classmethod, which yields the initial set of items for that subclass
  - This does not have to be implemented if it doesn't need to queue anything initially, but remember that at least one subclass needs to queue something or the crawl won't do anything. Items must be strings, as they are stored in the SQLite database.
- Implement a process instance method, which processes an item.
When processing an item, you can:
- Access the item value with the itemValue property
- Call self.fetch to fetch a URL
  - Set responseHandler to one of the response handlers from qwarc.util, or define your own
  - The return value is an aiohttp ClientResponse object
- Call self.add_subitem to add a subitem
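The pieces described above can be put together roughly as follows. This is a sketch under several assumptions, not a verified spec file: process is assumed to be a coroutine, add_subitem is assumed to take an item class followed by a string value, and the URL, the ListingPage class, and lastPage are invented for illustration. So that the example runs without qwarc installed, it subclasses a small stand-in Item class that only mimics the names mentioned in the text; a real spec file would import qwarc and subclass qwarc.Item instead.

```python
import asyncio


class Item:
    """Stand-in for qwarc.Item so this sketch runs without qwarc installed.

    NOT qwarc's code: it only mimics the attributes and methods named in
    the text. In a real spec file, import qwarc and subclass qwarc.Item.
    """
    itemType = None

    def __init__(self, itemValue):
        self.itemValue = itemValue
        self.subitems = []  # hypothetical: records what add_subitem was given

    @classmethod
    def generate(cls):
        yield from ()  # default: queue nothing initially

    async def fetch(self, url, responseHandler=None):
        # Placeholder: real qwarc performs the request and writes it to the WARC.
        return None

    def add_subitem(self, itemClass, itemValue):
        # Assumed argument order: item class, then string value.
        self.subitems.append((itemClass, itemValue))


class ListingPage(Item):
    itemType = 'listing'  # unique identifier for this subclass
    lastPage = 3          # hypothetical stopping point for the example

    @classmethod
    def generate(cls):
        yield '1'  # items must be strings; start at page 1

    async def process(self):
        page = int(self.itemValue)
        # example.org and the query parameter are placeholders.
        await self.fetch(f'https://example.org/forum?page={page}')
        if page < self.lastPage:
            # Queue the next page as a new item of the same type.
            self.add_subitem(ListingPage, str(page + 1))
```

Storing the page number as a string and converting it inside process keeps the item values compatible with the requirement that items be strings.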
Tips
- qwarc adds its own headers, such as a User-Agent, to the request by default. That means if you, for example, set the User-Agent header while calling fetch, your request will have two user agents, which is probably not what you want. This can be fixed by doing something like:
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.headers = []
Alternatively, you can delete or change specific headers instead of removing all of them.
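One way to delete or change a single header might look like the sketch below. It assumes that self.headers is a list of (name, value) tuples; the empty-list assignment above suggests a list, but the tuple layout is an assumption. The helper name replace_header and the MyCrawler/1.0 value are invented for this example.

```python
def replace_header(headers, name, value=None):
    """Return a copy of headers without `name` (hypothetical helper).

    Header names are compared case-insensitively. If `value` is given,
    a new (name, value) pair is appended after the old one is dropped.
    Assumes headers is a list of (name, value) tuples.
    """
    kept = [(k, v) for k, v in headers if k.lower() != name.lower()]
    if value is not None:
        kept.append((name, value))
    return kept


# Example with a made-up default header list:
defaults = [('User-Agent', 'default-agent'), ('Accept', '*/*')]
print(replace_header(defaults, 'User-Agent', 'MyCrawler/1.0'))
print(replace_header(defaults, 'User-Agent'))
```

Inside an item's __init__, after calling super().__init__, this could be used as self.headers = replace_header(self.headers, 'User-Agent', 'MyCrawler/1.0'), again assuming the tuple layout holds.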