Difference between revisions of "The WARC Ecosystem"

Revision as of 21:44, 12 April 2013

Everything about the WARC format and the tools that support it.

Information

Tools

name

 1 license
 2 programming language
 3 test suite
 4 has documentation
 5 # of authors
 6 description

wget v1.14+ (https://www.gnu.org/software/wget/)

 * GPL v3+
 * C
 * Has a test suite, but it does not cover any WARC functionality
 * Man pages, website, blog posts all over the net
 * 2+ according to the changelog
 * A non-interactive network downloader. Note that wget generates duplicate record IDs in WARC files.

warc Python library (https://github.com/internetarchive/warc)

 * GPL v2
 * Python
 * Appears to have a test suite: https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
 * A readme with examples, online at http://warc.readthedocs.org/en/latest/
 * 3 committers on GitHub
 * Library for working with WARC files

WarcProxy (https://github.com/iramari/WarcProxy)

 * BSD
 * Python
 * NO TEST SUITE
 * A readme file
 * 1 author
 * A simple HTTP proxy that saves all HTTP traffic to a file

http://code.hanzoarchives.com/warc-tools

 * MIT License
 * Python 2.6
 * NO TEST SUITE
 * A readme file
 * 4 committers
 * WARC validator, dump, search, and index tools

https://github.com/alard/warc-proxy

 * no license information
 * Python
 * NO TEST SUITE
 * A readme file
 * 1 author
 * WARC viewer for browsing the contents of a WARC file
   - Needs a Firefox add-on installed to work

https://github.com/alard/megawarc

 * no license information
 * Python
 * NO TEST SUITE
 * A readme file
 * 1 author
 * Merges many small WARC files into one large one

https://github.com/alard/warctozip-service

 * no license information
 * Python
 * NO TEST SUITE
 * A readme file
 * 1 author
 * An HTTP-based WARC-to-ZIP converter

https://github.com/chfoo/warcat

 * GPL v3
 * Python 3
 * Has a test suite
 * A readme file
 * 1 author
 * WARCAT: Web ARChive (WARC) Archiving Tool

https://github.com/internetarchive/archive-commons

split into 2 new repos: ia-web-commons & ia-hadoop-tools

https://github.com/internetarchive/ia-web-commons

https://github.com/internetarchive/ia-hadoop-tools

https://github.com/ArchiveTeam/archiveteam-megawarc-factory

 * Generates 50 GB WARC files from existing WARC files
 * Uploads to archive.org
 * no license information

https://github.com/rajbot/CDX-Writer

Deprecated

The WARC format

  • A .warc file is usually a sequence of one or more WARC records.
  • The first record usually describes the records that follow.
  • Compression is optional.
  • When compression is used, each record is compressed separately with gzip. A gzip file may contain multiple "members", so the compressed records can simply be concatenated.
  • Compressed WARC files end in .warc.gz.
  • According to the guidelines, WARC files should top out at 1 GB.
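
The per-record compression described above can be sketched with the Python standard library. This is only an illustration of gzip's multi-member behaviour, not the code of any tool listed on this page; the function names compress_records and split_members and the sample byte strings are made up for the example.

```python
import gzip
import zlib


def compress_records(records):
    """Compress each record as its own gzip member and concatenate the members."""
    return b"".join(gzip.compress(record) for record in records)


def split_members(blob):
    """Decompress a multi-member gzip blob one member at a time."""
    members = []
    while blob:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        members.append(decompressor.decompress(blob))
        # Whatever follows the end of this member belongs to the next one.
        blob = decompressor.unused_data
    return members


# Placeholder payloads standing in for real WARC records.
records = [b"first record\r\n\r\n", b"second record\r\n\r\n"]
blob = compress_records(records)
```

A plain gzip reader sees the concatenation as a single stream (gzip.decompress(blob) returns both records joined together), while split_members(blob) recovers them individually. That property is what lets a reader jump to a known byte offset in a .warc.gz file and decompress a single record.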


WARC record

  • header
  • content block
  • two newlines (CRLF CRLF)
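
As a rough sketch of that layout, the following Python builds a record and reads it back. It assumes CRLF line endings and a Content-Length field giving the size of the content block, as in WARC 1.0; build_record and read_record are hypothetical helper names, not part of any library mentioned here.

```python
def build_record(header_fields, block, version="WARC/1.0"):
    """Serialize one record: header lines, blank line, block, two newlines."""
    lines = [version]
    lines += [f"{name}: {value}" for name, value in header_fields]
    lines.append(f"Content-Length: {len(block)}")
    header = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    return header + block + b"\r\n\r\n"


def read_record(data):
    """Return (header text, content block, remaining bytes) for the first record."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("record header has no Content-Length field")
    block = rest[:length]
    # The block is followed by the two newlines that close the record.
    if rest[length:length + 4] != b"\r\n\r\n":
        raise ValueError("record is not terminated by two newlines")
    return header.decode("utf-8"), block, rest[length + 4:]


record = build_record([("WARC-Type", "resource")], b"hello\r\n\r\nworld")
header, block, remaining = read_record(record + record)
```

The block is sliced by its declared length rather than by searching for the trailing newlines, because the block itself may contain blank lines.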

WARC record header

The beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line. The WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and [RFC2822] headers, with one major exception, allowing UTF-8 [RFC3629].


Example of a 'request' record header:

 WARC/1.0
 WARC-Type: request
 WARC-Target-URI: http://xbox.gamespy.com/
 Content-Type: application/http;msgtype=request
 WARC-Date: 2013-04-02T16:12:40Z
 WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
 WARC-IP-Address: 213.248.112.146
 WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
 WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
 Content-Length: 150
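
A minimal Python sketch of how a reader might interpret a header like this one. The function name interpret_header and the keys of the returned dictionary are invented for the example; it assumes one field per line with no continuation lines.

```python
import uuid
from datetime import datetime, timezone

EXAMPLE = (
    "WARC/1.0\r\n"
    "WARC-Type: request\r\n"
    "WARC-Target-URI: http://xbox.gamespy.com/\r\n"
    "Content-Type: application/http;msgtype=request\r\n"
    "WARC-Date: 2013-04-02T16:12:40Z\r\n"
    "WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>\r\n"
    "WARC-IP-Address: 213.248.112.146\r\n"
    "WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>\r\n"
    "WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4\r\n"
    "Content-Length: 150\r\n"
)


def interpret_header(text):
    """Split a record header into its version line and a few typed fields."""
    first, *rest = text.split("\r\n")
    fields = dict(line.split(": ", 1) for line in rest if line)
    date = datetime.strptime(fields["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "version": first.partition("/")[2],
        "type": fields["WARC-Type"],
        "target_uri": fields["WARC-Target-URI"],
        # The trailing "Z" means UTC, so attach that timezone explicitly.
        "date": date.replace(tzinfo=timezone.utc),
        # uuid.UUID accepts the "urn:uuid:" prefix once the angle brackets are removed.
        "record_id": uuid.UUID(fields["WARC-Record-ID"].strip("<>")),
        "content_length": int(fields["Content-Length"]),
    }


info = interpret_header(EXAMPLE)
```

Here Content-Length is the number of octets in the content block that follows the header, so a reader uses it to know where the record ends.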

WARC named fields

  • A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
  • Named fields may appear in any order.
  • Field values may contain any UTF-8 character.
  • The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
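
The field syntax above (name, colon, value, with long values continued on indented lines) can be sketched in Python as follows. parse_named_fields is a hypothetical helper and X-Long-Example is a made-up field name used only to show a continuation line; joining continuation lines with a single space is an assumption of this sketch, and 'encoded-word' decoding is not handled.

```python
def parse_named_fields(lines):
    """Parse header lines into (name, value) pairs, joining continuation lines."""
    fields = []
    for line in lines:
        if line[:1] in (" ", "\t"):
            # An indented line continues the value of the previous field.
            if not fields:
                raise ValueError("continuation line before any field")
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"not a named field: {line!r}")
            fields.append((name.strip(), value.strip()))
    return fields


parsed = parse_named_fields([
    "WARC-Type: request",
    "X-Long-Example: part one",
    "  part two",
    "WARC-Target-URI: http://xbox.gamespy.com/",
])
```

Only the first colon on a line separates the name from the value, so URLs in values are left intact. Because fields may appear in any order, a caller would normally look them up by name rather than by position.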

WARC content block

Part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record.


CDX File Format