Qwarc

From Archiveteam

qwarc is a Python framework written by JAA for quickly crawling sites and saving them to WARC. It does no processing of its own beyond making HTTP requests and recording the responses; all crawl logic must be implemented by the user.

Its source is found here. Ensure you are using the latest commit on the 0.2 branch: the master branch is outdated and uses warcio (which does not fully follow the WARC specification), so it should not be used.

Requirements

qwarc does not work on Python 3.9 or later. It has very specific dependencies, which are pulled in automatically if you install it with pip (a venv is probably a good idea). You also need to manually run pip install async-timeout==3.0.1.

Writing a spec file

qwarc is self-documenting: the grab scripts used for a crawl are stored in that crawl's meta WARC, e.g. https://archive.org/download/forum.canucks.com_topic_updates_202309/forum.canucks.com-updates-meta.warc.gz. The meta WARCs of existing crawls are therefore a good source of example grab scripts.

Spec files are written in Python. qwarc will look for subclasses of qwarc.Item (non-recursively). These subclasses should:

  • Define an itemType attribute, to allow qwarc to uniquely identify the subclass
  • Implement a generate classmethod, which yields the initial set of items for that subclass
    • This does not have to be implemented if it doesn't need to queue anything initially, but remember that at least one subclass needs to queue something or the crawl won't do anything. Items must be strings, as they are stored in the SQLite database.
  • Implement a process instance method, which processes an item.
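
The three requirements above can be sketched roughly as follows. So that the example runs without qwarc installed, Item here is a stand-in for qwarc.Item, not the real class. The ThreadItem name, its itemType value, and the numeric IDs are hypothetical, and declaring process as async is an assumption based on qwarc's use of aiohttp.

```python
class Item:
    """Stand-in for qwarc.Item so this sketch runs without qwarc installed."""
    itemType = None

    def __init__(self, itemValue):
        self.itemValue = itemValue

    @classmethod
    def generate(cls):
        # Default behaviour in this stand-in: queue nothing initially.
        yield from ()

    async def process(self):
        raise NotImplementedError


class ThreadItem(Item):
    # Hypothetical value; it must be unique among the subclasses in the spec file.
    itemType = 'thread'

    @classmethod
    def generate(cls):
        # Items must be strings, so numeric IDs are converted before yielding.
        for threadId in range(1, 4):
            yield str(threadId)

    async def process(self):
        # The crawl logic for one item would go here.
        pass


initial = list(ThreadItem.generate())
```

In a real spec file the stand-in class would be dropped and ThreadItem would subclass qwarc.Item directly.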

When processing an item, you can:

  • Access the item value with the itemValue property
  • Call self.fetch to fetch a URL
    • Set responseHandler to one of the response handlers from qwarc.utils, or define your own
    • The return value is an aiohttp ClientResponse object
  • Call self.add_subitem to add a subitem
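
A minimal sketch of a process method using these calls might look like the following. Item is again a stand-in for qwarc.Item that records calls instead of touching the network, so the example can run on its own. The example.org URLs, the class names, and the (class, value) argument order of add_subitem are assumptions for illustration; check them against the qwarc version you are using.

```python
import asyncio
from types import SimpleNamespace


class Item:
    """Stand-in for qwarc.Item; it records calls instead of doing network I/O."""

    def __init__(self, itemValue):
        self.itemValue = itemValue
        self.fetched = []
        self.subitems = []

    async def fetch(self, url):
        # The real method performs the HTTP request; this one only logs the URL
        # and returns an object with a status attribute, like a ClientResponse.
        self.fetched.append(url)
        return SimpleNamespace(status=200)

    def add_subitem(self, itemClass, itemValue):
        # Assumed signature: target Item subclass plus a string item value.
        self.subitems.append((itemClass.itemType, itemValue))


class PostItem(Item):
    itemType = 'post'

    async def process(self):
        await self.fetch(f'https://example.org/post/{self.itemValue}')


class ThreadItem(Item):
    itemType = 'thread'

    async def process(self):
        response = await self.fetch(f'https://example.org/thread/{self.itemValue}')
        if response.status == 200:
            # Hypothetical follow-up work: queue the thread's first post.
            self.add_subitem(PostItem, f'{self.itemValue}-1')


item = ThreadItem('42')
asyncio.run(item.process())
```

The subitem value is built as a string because items are stored in the SQLite database.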

Tips

  • qwarc adds its own headers, such as a User-Agent, to the request by default. That means if you, for example, set the User-Agent header while calling fetch, your request will have two user agents, which is probably not what you want. This can be fixed by doing something like:
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.headers = []

Alternatively, you can delete or change specific headers instead of removing all of them.
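
As a rough sketch of changing a single header, the helper below replaces one entry and keeps the rest. It assumes that self.headers is a list of (name, value) tuples, which the snippet above suggests but does not confirm. replace_header and the header values shown are hypothetical and not part of qwarc.

```python
def replace_header(headers, name, value):
    """Return a copy of headers with every entry for name replaced by one new entry."""
    # Header names are compared case-insensitively.
    kept = [(k, v) for k, v in headers if k.lower() != name.lower()]
    kept.append((name, value))
    return kept


# Made-up defaults standing in for whatever qwarc sets.
defaults = [('User-Agent', 'default-agent/1.0'), ('Accept', '*/*')]
updated = replace_header(defaults, 'User-Agent', 'my-crawler/0.1')
```

Under that assumption, __init__ could assign self.headers = replace_header(self.headers, 'User-Agent', ...) after the super().__init__ call, which leaves only one User-Agent in the request.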