Difference between revisions of "The WARC Ecosystem"

Revision as of 21:44, 12 April 2013

Everything about the WARC format and the tools that support it.

Information

Tools

name

 1 license
 2 programming language
 3 test suite
 4 has documentation
 5 # of authors
 6 description

wget v1.14+ (https://www.gnu.org/software/wget/)

 * GPL v3+
 * C
 * Has a test suite, but it does not test any WARC functionality
 * Man pages, website, blog posts all over the net
 * 2+ according to the changelog
 * A non-interactive network downloader. wget also generates duplicate record IDs in WARC files.

warc python library (https://github.com/internetarchive/warc)

 * GPL v2
 * Python
 * Appears to have a test suite: https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
 * A readme with examples, online at http://warc.readthedocs.org/en/latest/
 * 3 committers on GitHub
 * Library for working with WARC files

WarcProxy (https://github.com/iramari/WarcProxy)

 * BSD
 * python
 * NO TEST SUITE
 * A readme file.
 * 1 author
 * a simple HTTP proxy that saves all HTTP traffic to a file

http://code.hanzoarchives.com/warc-tools

 * MIT License
 * python 2.6
 * NO TEST SUITE
 * A readme file
 * 4 committers
 * WARC validator, dump, search, and index tools

https://github.com/alard/warc-proxy

 * no license information
 * python
 * NO TEST SUITE
 * A readme file
 * 1 author
 * WARC viewer for browsing the contents of a WARC file.
 - Needs a Firefox add-on installed to work

https://github.com/alard/megawarc

 * no license information
 * python
 * NO TEST SUITE
 * A readme file
 * 1 author
 * Merges many small WARC files into one large file

https://github.com/alard/warctozip-service

 * no license information
 * python
 * NO TEST SUITE
 * A readme file
 * 1 author
 * An HTTP-based WARC-to-ZIP converter

https://github.com/chfoo/warcat

 * GPL v3
 * Python 3
 * Has a test suite
 * A readme file.
 * 1 author
 * WARCAT: Web ARChive (WARC) Archiving Tool

https://github.com/internetarchive/archive-commons

This repository has been split into two new repositories: ia-web-commons and ia-hadoop-tools.

https://github.com/internetarchive/ia-web-commons

https://github.com/internetarchive/ia-hadoop-tools

https://github.com/ArchiveTeam/archiveteam-megawarc-factory

 * Generates 50 GB WARC files from existing WARC files
 * Uploads to archive.org
 * no license information

https://github.com/rajbot/CDX-Writer

 * Generates CDX from WARC files

Deprecated

http://archive-access.sourceforge.net/warc/ - a collection of documentation

https://code.google.com/p/warc-tools/ - old and discontinued

The WARC format

A .warc file is a concatenation of one or more WARC records.
The first record usually describes the records that follow.
Compression is optional.
When compression is used, each record is compressed separately with gzip. A gzip file supports multiple "members", so the compressed records can simply be concatenated.
Compressed WARC files end in .warc.gz.
According to the guidelines, WARC files should top out at 1 GB.
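
The per-record compression described above can be sketched with the Python standard library. This is only an illustration, not how any of the tools listed on this page do it; the function names compress_records and iter_members are made up for the example.

```python
import gzip
import zlib


def compress_records(records):
    """Compress each record as its own gzip member and concatenate the members."""
    return b"".join(gzip.compress(record) for record in records)


def iter_members(data):
    """Yield the decompressed payload of each gzip member in turn."""
    while data:
        # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(31)
        payload = decompressor.decompress(data)
        payload += decompressor.flush()
        yield payload
        # Bytes after the end of this member belong to the next member.
        data = decompressor.unused_data
```

Because every record is its own member, a reader that knows the byte offset of a member can decompress that one record without touching the rest of the file.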

WARC record

header
content block
two newlines (CRLF CRLF)
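
As a rough sketch of that layout, the following Python function serialises one record. The name build_record is invented for this example, and the function does not check that the mandatory fields (such as WARC-Record-ID and WARC-Date) are present.

```python
def build_record(version, fields, block):
    """Serialise one WARC record: header lines, blank line, block, CRLF CRLF."""
    lines = ["WARC/" + version]
    lines += [f"{name}: {value}" for name, value in fields]
    # Content-Length counts the octets of the content block only.
    lines.append(f"Content-Length: {len(block)}")
    header = "\r\n".join(lines).encode("utf-8") + b"\r\n"
    # A blank line ends the header; two CRLFs end the record.
    return header + b"\r\n" + block + b"\r\n\r\n"
```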

WARC record header

The beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line. The WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and [RFC2822] headers, with one major exception, allowing UTF-8 [RFC3629].

Example of a 'request' record header:

 WARC/1.0
 WARC-Type: request
 WARC-Target-URI: http://xbox.gamespy.com/
 Content-Type: application/http;msgtype=request
 WARC-Date: 2013-04-02T16:12:40Z
 WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
 WARC-IP-Address: 213.248.112.146
 WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
 WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
 Content-Length: 150
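
A header like the one above could be split into its version line and named fields roughly as follows. This is a minimal sketch: parse_header is a made-up name, continuation lines are not handled, and field names are kept exactly as written even though a real reader may want to compare them case-insensitively.

```python
EXAMPLE = """\
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://xbox.gamespy.com/
Content-Type: application/http;msgtype=request
WARC-Date: 2013-04-02T16:12:40Z
WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
WARC-IP-Address: 213.248.112.146
WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
Content-Length: 150
"""


def parse_header(text):
    """Split a WARC record header into its version line and a dict of fields."""
    lines = text.strip().splitlines()
    version = lines[0].strip()
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


version, fields = parse_header(EXAMPLE)
```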

WARC named fields

A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
Named fields may appear in any order.
Field values may contain any UTF-8 character.
The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
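
Continuation lines can be joined back onto their field before parsing. The sketch below assumes the header has already been split into text lines; unfold is a hypothetical helper name, and X-Note in the usage example is a made-up field, not part of the WARC format.

```python
def unfold(header_lines):
    """Join continuation lines (starting with a space or tab) onto the previous field."""
    unfolded = []
    for line in header_lines:
        if line[:1] in (" ", "\t") and unfolded:
            # Indented line: it continues the value of the previous field.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)
    return unfolded
```

For example, unfold(["WARC-Type: metadata", "X-Note: a long", "  value"]) gives back two lines, the second being "X-Note: a long value".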

WARC content block

Part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record.
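
Since the block is an opaque run of octets, a reader finds its end from the Content-Length field in the header rather than by looking for a delimiter. A minimal sketch of reading records from an uncompressed stream, with the made-up name iter_records and no error handling:

```python
import io


def iter_records(stream):
    """Yield (fields, block) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        # `line` now holds the version line (e.g. WARC/1.0); named fields follow.
        fields = {}
        for raw in iter(stream.readline, b""):
            if not raw.strip():
                break  # a blank line ends the header
            name, _, value = raw.decode("utf-8").partition(":")
            fields[name.strip()] = value.strip()
        # Read exactly Content-Length octets; the block may contain anything.
        block = stream.read(int(fields["Content-Length"]))
        yield fields, block
```

A loop over iter_records(io.BytesIO(data)) that collects the WARC-Record-ID values would be one way to look for the duplicate record IDs mentioned in the wget entry.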

CDX File Format

http://archive.org/web/researcher/cdx_legend.php


@@ Line 11: / Line 11: @@
 == Tools ==
-* name
+=== name ===
 license
 programming language
@@ Line 19: / Line 19: @@
 description
-* wget v1.14+
+=== [https://www.gnu.org/software/wget/ wget v1.14+] ===
    * GPL v3+
    * C
@@ Line 27: / Line 27: @@
    * A non-interactive network downloader. wget also generates duplicate record ids in warc files.
-* https://github.com/internetarchive/warc
+=== [https://github.com/internetarchive/warc warc python library]===
    * GPL v2
    * Python
@@ Line 35: / Line 35: @@
    * library to work with WARC files
-* https://github.com/iramari/WarcProxy
+=== [https://github.com/iramari/WarcProxy WarcProxy] ===
    * BSD
    * python
@@ Line 43: / Line 43: @@
    * a simple HTTP proxy that saves all HTTP traffic to a file
-* http://code.hanzoarchives.com/warc-tools
+=== http://code.hanzoarchives.com/warc-tools ===
    * MIT License
    * python 2.6
@@ Line 51: / Line 51: @@
    * warc validator, dump, search, index
-* https://github.com/alard/warc-proxy
+=== https://github.com/alard/warc-proxy ===
    * no license information
    * python
@@ Line 60: / Line 60: @@
    - needs a firefox addon installed to work
-* https://github.com/alard/megawarc
+=== https://github.com/alard/megawarc ===
    * no license information
    * python
@@ Line 68: / Line 68: @@
    * Merge many small warcs into a large one
-* https://github.com/alard/warctozip-service
+=== https://github.com/alard/warctozip-service ===
    * no license information
    * python
@@ Line 76: / Line 76: @@
    * An HTTP-based warc-to-zip converter
-* https://github.com/chfoo/warcat
+=== https://github.com/chfoo/warcat ===
    * GPL v3
    * Python 3
@@ Line 84: / Line 84: @@
    * WARCAT: Web ARChive (WARC) Archiving Tool
-* https://github.com/internetarchive/archive-commons split into 2 new repos: ia-web-commons & ia-hadoop-tools
+=== https://github.com/internetarchive/archive-commons ===
+split into 2 new repos: ia-web-commons & ia-hadoop-tools
-* https://github.com/internetarchive/ia-web-commons
+=== https://github.com/internetarchive/ia-web-commons ===
-* https://github.com/internetarchive/ia-hadoop-tools
+=== https://github.com/internetarchive/ia-hadoop-tools ===
-* https://github.com/ArchiveTeam/archiveteam-megawarc-factory
+=== https://github.com/ArchiveTeam/archiveteam-megawarc-factory ===
    * Generates 50gb warc files from existing warc files
    * Uploads to archive.org
    * no license information
-* cdx from warc - https://github.com/rajbot/CDX-Writer
+=== https://github.com/rajbot/CDX-Writer ===
 == Deprecated ==
