Difference between revisions of "The WARC Ecosystem"

Revision as of 08:02, 19 July 2015

Everything about the WARC format and the tools that support it.

Information

* https://en.wikipedia.org/wiki/Web_ARChive
* https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817 - Contains examples of WARC records
* ISO 28500 - The WARC File Format (http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf)
* http://archive-access.sourceforge.net/warc/ - WARC ISO docs
* http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
* http://www.netpreserve.org/resources/warc-implementation-guidelines-v1 http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
* http://commoncrawl.org/navigating-the-warc-file-format/

Tools

name	license	programming language	test suite	has documentation	# of authors	description
wget v1.14+	GPL v3+	C	Has a test suite but does not test any warc functionality	Man pages, website, blog posts all over the net	2+ according to the changelog	A non-interactive network downloader. wget also generates duplicate record ids in warc files. More information about flags can be found on the Wget with WARC output page.
InternetArchive's warc python library	GPL v2	Python	looks to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py	A readme with examples online at http://warc.readthedocs.org/en/latest/	3 committers on GitHub	library to work with WARC files
WarcMiddleware	ISC	Python	Not enough tests	A readme file + Scrapy docs	1 author	Mirrors websites and saves the results to a WARC file
WarcProxy	ISC	Python	NO TEST SUITE	A readme file	1 author	a simple HTTP proxy that saves all HTTP traffic to a file
WarcMITMProxy	ISC	Python	NO TEST SUITE	A readme file	1 author	HTTPS proxy that saves traffic to a WARC file
warc-tools	MIT License	Python 2.6	NO TEST SUITE	A readme file	4 committers	warc validator, dump, search, index, convert arc to warc. The previous versions can be found at https://code.google.com/p/warc-tools/ and http://code.hanzoarchives.com/warc-tools . Old: http://code.hanzoarchives.com/warc-tools/src/6e1d36297688/hanzo/warcextract.py New (untested): http://code.hanzoarchives.com/warc-tools/src/fd3b49a7ee22fe4eee0d51dc841af40d4b9d2e1e/warcunpack_ia.py?at=default
WARC viewer	no license information	Python	NO TEST SUITE	A readme file	1 author	WARC viewer for browsing the contents of a WARC file.
Megawarc	no license information	Python	NO TEST SUITE	A readme file	1 author	Merge many small warcs into a large one Checks if WARC files can be un-gzipped before adding them to the megawarc. Does not check anything else.
warc to zip	no license information	python	NO TEST SUITE	A readme file	1 author	An HTTP-based warc-to-zip converter
warcat	GPL v3	Python 3	yes	A readme file.	1 author	warcat concat, extract, list, pass, split, verify warc files. Install: pip-3 install warcat Run: python3 -m warcat verify mysite.warc.gz
ia-web-commons	?	?	?	?	?	https://github.com/internetarchive/ia-web-commons
ia-hadoop-tools	?	?	?	?	?	https://github.com/internetarchive/ia-hadoop-tools
Archive Team megawarc factory	no license information	Bash shell scripting	NO TEST SUITE	A readme file.	1 author	Generates 50gb warc files from existing warc files Uploads to archive.org
CDX Writer	no license information	python	Has a test suite	A readme file.	1 author	Create CDX index files from WARC files.
Heritrix	Apache v2.0	java	Has a test suite	javadoc, website	many authors	Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Heritrix-Cassandra	?	?	?	?	?	A library for writing Heritrix 3 output directly to Cassandra as records.
DeDuplicator (Heritrix add-on)	?	?	?	?	?	The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
python-heritrix	?	?	?	?	?	A simple wrapper around the Heritrix 3.x API. Developed in April 2012 against Heritrix 3.1.0 at GWU Libraries in Washington, DC, USA.
Chrome/Chromium plugin WARCreate	GPL v3	javascript	???	none	1 author	WARCreate is a Google Chrome extension that allows a user to create a Web ARChive (WARC) file from any browseable webpage. Code repo: https://github.com/machawk1/warcreate
Java Web Archive Toolkit	Apache 2.0	Java	Partial Test Suite (check coverage profile)	Online	1 author	jwattools arc2warc, cdx, compress, decompress, extract, interval, pathindex, test, unpack. Code repo: https://bitbucket.org/nclarkekb/jwat/overview
WAIL	CC-BY-SA	Python, JS	???	Online	1 author	Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages. Tools included and accessible through the GUI are Heritrix 3.1.2, Wayback 1.7, and warc-proxy. Support packages include Apache Tomcat, phantomjs and pyinstaller. Code repo: https://github.com/machawk1/wail
pylibwarc	ISC License	Python	?	?	1 author	CDX support. Written by odie5533, who frequents #archiveteam, as another independent WARC library for Python.
Wpull	GPL version 3	Python 3	many unit tests (Travis CI registered), simple experimental fuzzer	a quick start readme, brief usage overview, good docstrings coverage	1 core author	Wget-compatible web downloader. Beta quality. Lua/Python scripting. PhantomJS (experimental). Used by ArchiveBot.
pywb	GPL version 3	Python 2	yes	readme and wiki	1 core author	A full-fledged Python reimplementation of Wayback Machine web archive replay capabilities. Also provides a live rewriting proxy.
pywb-webrecorder	MIT	Python 2	no	readme	1 core author	An experimental/demo integration of pywb + warcprox to allow live recording to WARC. Allows instant replay of recorded content from WARC.
webarchiveplayer	GPL version 3	Python 2	not yet, though most testable functionality in pywb	readme	1 core author	Point-and-click wrapper for Windows and OS X for browsing WARC files. Shows a basic file open dialog to select a WARC(s), then starts a server and opens a browser. Also determines HTML pages within a WARC. Built on top of pywb. In beta at the moment (early 2015).

Deprecated

* https://code.google.com/p/warc-tools/ - old and discontinued
* https://github.com/internetarchive/archive-commons - split into 2 new repos: ia-web-commons & ia-hadoop-tools

The WARC format

* A .warc file is usually a group of one or more WARC records.
* The first record usually describes the records to follow.
* Compression is optional.
* When compression is used, each record is compressed via gzip on its own. A gzip file supports multiple "members", so the compressed records can simply be concatenated.
* Compressed WARCs end in .warc.gz.
* According to the guidelines, WARC files should top out at 1 GB.
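
The multi-member behaviour can be tried with Python's standard library alone. The following is a minimal sketch, not taken from any of the tools listed above: the record bytes are placeholders rather than valid WARC records, and the variable names are made up for illustration.

```python
import gzip
import zlib

# Placeholder payloads standing in for WARC records (not valid records).
records = [
    b"first placeholder record\r\n\r\n",
    b"second placeholder record\r\n\r\n",
]

# Compress each record on its own, producing one gzip member per record.
members = [gzip.compress(r) for r in records]

# A compressed WARC is then just the members laid end to end.
warc_gz = b"".join(members)

# Reading the whole file: gzip.decompress walks through every member.
everything = gzip.decompress(warc_gz)

# Reading a single record: start a fresh decompressor at that member's
# byte offset. wbits=31 tells zlib to expect a gzip header and trailer.
offset = len(members[0])
decompressor = zlib.decompressobj(wbits=31)
second_only = decompressor.decompress(warc_gz[offset:])
```

Because every record is a complete member, a reader that knows a record's byte offset (for example from an index file) can decompress that record without touching the rest of the file.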

WARC record

* header
* content block
* two newlines (CRLF CRLF)
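
As a rough illustration of that layout, the sketch below assembles one record from a list of named fields and a content block. It assumes the line terminator is CRLF, so the two trailing newlines become `\r\n\r\n`. `build_record` is a hypothetical helper, and the request bytes are invented for the example.

```python
def build_record(fields, block, version="WARC/1.0"):
    """Serialize one record: header, blank line, content block, two newlines.

    `fields` is a list of (name, value) pairs. Content-Length is computed
    from the block so the two cannot disagree.
    """
    lines = [version]
    lines.extend(f"{name}: {value}" for name, value in fields)
    lines.append(f"Content-Length: {len(block)}")
    header = "\r\n".join(lines).encode("utf-8") + b"\r\n"
    # A blank line ends the header; CRLF CRLF ends the record.
    return header + b"\r\n" + block + b"\r\n\r\n"


# Invented content block: a tiny HTTP request.
block = b"GET / HTTP/1.1\r\nHost: xbox.gamespy.com\r\n\r\n"

record = build_record(
    [
        ("WARC-Type", "request"),
        ("WARC-Target-URI", "http://xbox.gamespy.com/"),
        ("WARC-Date", "2013-04-02T16:12:40Z"),
        ("WARC-Record-ID", "<urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>"),
        ("Content-Type", "application/http;msgtype=request"),
    ],
    block,
)
```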

WARC record header

The beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line. The WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and [RFC2822] headers, with one major exception, allowing UTF-8 [RFC3629].

Example of a 'request' record header:

 WARC/1.0
 WARC-Type: request
 WARC-Target-URI: http://xbox.gamespy.com/
 Content-Type: application/http;msgtype=request
 WARC-Date: 2013-04-02T16:12:40Z
 WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
 WARC-IP-Address: 213.248.112.146
 WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
 WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
 Content-Length: 150
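
A header like the one above can be split into its version line and named fields with a few lines of Python. This is only a sketch under simple assumptions (one field per line, no continuation lines); `parse_header` is a hypothetical helper, not the API of any library in the Tools table.

```python
from datetime import datetime, timezone
import uuid

HEADER = """\
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://xbox.gamespy.com/
Content-Type: application/http;msgtype=request
WARC-Date: 2013-04-02T16:12:40Z
WARC-Record-ID: <urn:uuid:08d9edb9-0ab8-4352-ba56-6cbbd590f34f>
WARC-IP-Address: 213.248.112.146
WARC-Warcinfo-ID: <urn:uuid:2b6ad3f1-efab-4e37-8faa-fc8ad112692f>
WARC-Block-Digest: sha1:T6PJSZTTP7HBNA6OFZACXAFK25GGLVT4
Content-Length: 150
"""


def parse_header(text):
    """Return (version, fields) for one record header."""
    lines = text.splitlines()
    version = lines[0].strip()
    fields = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header
        # Split on the first colon only: URIs and dates contain colons too.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


version, fields = parse_header(HEADER)

# Turn a few string values into typed Python objects.
length = int(fields["Content-Length"])
date = datetime.strptime(
    fields["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)
record_id = uuid.UUID(fields["WARC-Record-ID"].strip("<>"))
```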

WARC named fields

* A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
* Named fields may appear in any order.
* Field values may contain any UTF-8 character.
* The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
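
Two of these rules, continuation lines and encoded-words, are easy to get wrong in a hand-written reader. The sketch below shows one possible way to handle them. Using Python's `email.header` module for the RFC 2047 decoding is a convenience chosen here, not something the format requires, and the `X-Example-*` field names are invented for the demonstration.

```python
from email.header import decode_header, make_header


def unfold(lines):
    """Join continuation lines (starting with space or tab) onto the previous field."""
    fields = []
    for line in lines:
        if line[:1] in (" ", "\t") and fields:
            name, value = fields[-1]
            fields[-1] = (name, f"{value} {line.strip()}")
        else:
            name, _, value = line.partition(":")
            fields.append((name.strip(), value.strip()))
    return fields


def decode_value(value):
    """Decode any RFC 2047 encoded-words; plain values pass through unchanged."""
    return str(make_header(decode_header(value)))


raw_lines = [
    "WARC-Type: metadata",
    "X-Example-Note: first part of a long value",
    "    continued on an indented line",
    "X-Example-Title: =?UTF-8?Q?caf=C3=A9?=",
]

fields = [(name, decode_value(value)) for name, value in unfold(raw_lines)]
```

A list of pairs is used instead of a dict so that the original field order is preserved, even though readers should not rely on any particular order.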

WARC content block

Part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record.
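
Since the block is just a run of octets, a reader has to rely on the header's Content-Length to know where it ends. The sketch below reads records from an uncompressed byte stream on that basis. `read_record` and `block_digest` are hypothetical helpers, and the digest format (`sha1:` followed by the Base32 encoding of the SHA-1 hash) is an assumption inferred from the 32-character WARC-Block-Digest value in the example header.

```python
import base64
import hashlib
import io


def read_record(stream):
    """Read one record from a binary stream; return None at end of input."""
    first = stream.readline()
    if not first:
        return None
    version = first.decode("ascii").strip()
    fields = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # blank line: end of header
        name, _, value = line.decode("utf-8").partition(":")
        fields[name.strip()] = value.strip()
    # The block is exactly Content-Length octets, possibly zero.
    block = stream.read(int(fields["Content-Length"]))
    stream.read(4)  # skip the trailing CRLF CRLF
    return version, fields, block


def block_digest(block):
    """Assumed format: 'sha1:' plus the Base32-encoded SHA-1 of the block."""
    return "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")


# Two invented records, the second with an empty content block.
raw = (
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\n"
    b"hello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: metadata\r\nContent-Length: 0\r\n\r\n"
    b"\r\n\r\n"
)

stream = io.BytesIO(raw)
first = read_record(stream)
second = read_record(stream)
end = read_record(stream)
digest = block_digest(first[2])
```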

CDX File Format

* http://archive.org/web/researcher/cdx_legend.php

@@ Line 2: / Line 2: @@
 == Information ==
-* https://en.wikipedia.org/wiki/Web_ARChive
+| https://en.wikipedia.org/wiki/Web_ARChive
-* https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817 - Contains examples of WARC records
+| https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=4817 - Contains examples of WARC records
-* [http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf ISO 28500 - The WARC File Format]
+| [http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf ISO 28500 - The WARC File Format]
-* http://archive-access.sourceforge.net/warc/ - WARC ISO docs
+| http://archive-access.sourceforge.net/warc/ - WARC ISO docs
-* http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
+| http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
-* http://www.netpreserve.org/resources/warc-implementation-guidelines-v1 http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
+| http://www.netpreserve.org/resources/warc-implementation-guidelines-v1 http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
-* http://commoncrawl.org/navigating-the-warc-file-format/
+| http://commoncrawl.org/navigating-the-warc-file-format/
 == Tools ==
-=== name ===
+{|class="wikitable"
-license
+! name
-programming language
+! license
-test suite
+! programming language
-has documentation
+! test suite
-# of authors
+! has documentation
-description
+! # of authors
+! description
-=== [https://www.gnu.org/software/wget/ wget v1.14+] ===
+|-
-* GPL v3+
+| [https://www.gnu.org/software/wget/ wget v1.14+]
-* C
+| GPL v3+ || C
-* Has a test suite but does not test any warc functionality
+| Has a test suite but does not test any warc functionality
-* Man pages, website, blog posts all over the net
+| Man pages, website, blog posts all over the net
-* 2+ according to the changelog
+| 2+ according to the changelog
-* A non-interactive network downloader. wget also generates duplicate record ids in warc files.
+| A non-interactive network downloader. wget also generates duplicate record ids in warc files.
 More information about flags can be found on the [[Wget with WARC output]] page.
+|-
-=== InternetArchive's [https://github.com/internetarchive/warc warc python library]===
+| InternetArchive's [https://github.com/internetarchive/warc warc python library]
-* GPL v2
+| GPL v2 || Python
-* Python
+| looks to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
-* looks to have a test suite - https://github.com/internetarchive/warc/blob/master/warc/tests/test_warc.py
+| A readme with examples online at http://warc.readthedocs.org/en/latest/
-* A readme with examples online at http://warc.readthedocs.org/en/latest/
+| 3 commiters on github
-* 3 commiters on github
+| library to work with WARC files
-* library to work with WARC files
+|-
+| [https://github.com/odie5533/WarcMiddleware WarcMiddleware]
-=== [https://github.com/odie5533/WarcMiddleware WarcMiddleware] ===
+| ISC || Python
-* ISC
+| Not enough tests
-* Python
+| A readme file + [http://scrapy.org/ Scrapy docs]
-* Not enough tests
+| 1 author
-* A readme file + [http://scrapy.org/ Scrapy docs]
+| Mirrors websites and saves the results to a WARC file
-* 1 author
+|-
-* Mirrors websites and saves the results to a WARC file
+| [https://github.com/odie5533/WarcProxy WarcProxy]
+| ISC || Python
-=== [https://github.com/odie5533/WarcProxy WarcProxy] ===
+| NO TEST SUITE
-* ISC
+| A readme file
-* Python
+| 1 author
-* NO TEST SUITE
+| a simple HTTP proxy that saves all HTTP traffic to a file
-* A readme file
+|-
-* 1 author
+| [https://github.com/odie5533/WarcMITMProxy WarcMITMProxy]
-* a simple HTTP proxy that saves all HTTP traffic to a file
+| ISC
+| Python
-=== [https://github.com/odie5533/WarcMITMProxy WarcMITMProxy] ===
+| NO TEST SUITE
-* ISC
+| A readme file
-* Python
+| 1 author
-* NO TEST SUITE
+| HTTPS proxy that saves traffic to a WARC file
-* A readme file
+|-
-* 1 author
+| [https://github.com/internetarchive/warctools/ warc-tools]
-* HTTPS proxy that saves traffic to a WARC file
+| MIT License
+| Python 2.6
-=== [https://github.com/internetarchive/warctools/ warc-tools] ===
+| NO TEST SUITE
-* MIT License
+| A readme file
-* python 2.6
+| 4 commiters
-* NO TEST SUITE
+| warc validator, dump, search, index, convert arc to warc
-* A readme file
-* 4 commiters
-* warc validator, dump, search, index, convert arc to warc
 The previous versions can be found at https://code.google.com/p/warc-tools/ and http://code.hanzoarchives.com/warc-tools .
@@ Line 72: / Line 69: @@
 old: http://code.hanzoarchives.com/warc-tools/src/6e1d36297688/hanzo/warcextract.py<br />
 new (untested): http://code.hanzoarchives.com/warc-tools/src/fd3b49a7ee22fe4eee0d51dc841af40d4b9d2e1e/warcunpack_ia.py?at=default
+|-
-=== [https://github.com/alard/warc-proxy WARC viewer] ===
+| [https://github.com/alard/warc-proxy WARC viewer]
-* no license information
+| no license information
-* python
+| Python
-* NO TEST SUITE
+| NO TEST SUITE
-* A readme file
+| A readme file
-* 1 author
+| 1 author
-* WARC viewer for browsing the contents of a WARC file.
+| WARC viewer for browsing the contents of a WARC file.
+|-
-=== [https://github.com/alard/megawarc Megawarc] ===
+| [https://github.com/alard/megawarc Megawarc]
-* no license information
+| no license information
-* python
+| Python
-* NO TEST SUITE
+| NO TEST SUITE
-* A readme file
+| A readme file
-* 1 author
+| 1 author
-* Merge many small warcs into a large one
+| Merge many small warcs into a large one
 Checks if WARC files can be un-gzipped before adding them to the megawarc. Does not check anything else.
+|-
-=== [https://github.com/alard/warctozip-service warc to zip] ===
+| [https://github.com/alard/warctozip-service warc to zip]
-* no license information
+| no license information
-* python
+| python
-* NO TEST SUITE
+| NO TEST SUITE
-* A readme file
+| A readme file
-* 1 author
+| 1 author
-* An HTTP-based warc-to-zip converter
+| An HTTP-based warc-to-zip converter
+|-
-=== [https://github.com/chfoo/warcat warcat] ===
+| [https://github.com/chfoo/warcat warcat]
-* GPL v3
+| GPL v3
-* Python 3
+| Python 3
-* yes
+| yes
-* A readme file.
+| A readme file.
-* 1 author
+| 1 author
-* warcat concat, extract, list, pass, split, verify warc files
+| warcat concat, extract, list, pass, split, verify warc files
 Install: pip-3 install warcat<br />
 Run: python3 -m warcat verify mysite.warc.gz
-=== https://github.com/internetarchive/ia-web-commons ===
+ https://github.com/internetarchive/ia-web-commons
-=== https://github.com/internetarchive/ia-hadoop-tools ===
+ https://github.com/internetarchive/ia-hadoop-tools
+|-
-=== [https://github.com/ArchiveTeam/archiveteam-megawarc-factory Archive Team megawarc factory] ===
+| [https://github.com/ArchiveTeam/archiveteam-megawarc-factory Archive Team megawarc factory]
-* no license information
+| no license information
-* Bash shell scripting
+| Bash shell scripting
-* NO TEST SUITE
+| NO TEST SUITE
-* A readme file.
+| A readme file.
-* 1 author
+| 1 author
-* Generates 50gb warc files from existing warc files
+| Generates 50gb warc files from existing warc files
 Uploads to archive.org
+|-
-=== [https://github.com/rajbot/CDX-Writer CDX Writer] ===
+| [https://github.com/rajbot/CDX-Writer CDX Writer]
-* no license information
+| no license information
-* python
+| python
-* Has a test suite
+| Has a test suite
-* A readme file.
+| A readme file.
-* 1 author
+| 1 author
-* Create CDX index files from WARC files.
+| Create CDX index files from WARC files.
+|-
-=== [https://webarchive.jira.com/wiki/display/Heritrix/Heritrix Heritrix] ===
+| [https://webarchive.jira.com/wiki/display/Heritrix/Heritrix Heritrix]
-* Apache v2.0
+| Apache v2.0
-* java
+| java
-* Has a test suite
+| Has a test suite
-* javadoc, website
+| javadoc, website
-* many authors
+| many authors
-* Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
+| Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
+|-
-[https://github.com/openplaces/heritrix-cassandra Heritrix-Cassandra] A library for writing Heritrix 3 output directly to Cassandra as records.
+| [https://github.com/openplaces/heritrix-cassandra Heritrix-Cassandra]
+| ? || ? || ? || ? || ?
-[http://sourceforge.net/projects/deduplicator/ DeDuplicator (Heritrix add-on)] The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
+| A library for writing Heritrix 3 output directly to Cassandra as records.
+|-
-[https://github.com/gwu-libraries/python-heritrix python-heritrix] A simple wrapper around the Heritrix 3.x API. Developed in April 2012 against Heritrix 3.1.0 at GWU Libraries in Washington, DC, USA.
+| [http://sourceforge.net/projects/deduplicator/ DeDuplicator (Heritrix add-on)]
+| ? || ? || ? || ? || ?
-=== [http://warcreate.com/ Chrome/Chromium plugin WARCreate] ===
+| The DeDuplicator is an add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
-* GPL v3
+|-
-* javascript
+| [https://github.com/gwu-libraries/python-heritrix python-heritrix]
-* ???
+| ? || ? || ? || ? || ?
-* none
+| A simple wrapper around the Heritrix 3.x API. Developed in April 2012 against Heritrix 3.1.0 at GWU Libraries in Washington, DC, USA.
-* 1 author
+|-
-* WARCreate is a Google Chrome extension that allows a user to create a Web ARChive (WARC) file from any browseable webpage.
+| [http://warcreate.com/ Chrome/Chromium plugin WARCreate]
+| GPL v3
-[https://github.com/machawk1/warcreate code repo]
+| javascript
+| ???
-=== [https://sbforge.org/display/JWAT/JWAT Java Web Archive Toolkit] ===
+| none
-* Apache 2.0
+| 1 author
-* Java
+| WARCreate is a Google Chrome extension that allows a user to create a Web ARChive (WARC) file from any browseable webpage. [https://github.com/machawk1/warcreate code repo]
-* Partial Test Suite (check coverage profile)
+|-
-* Online
+| [https://sbforge.org/display/JWAT/JWAT Java Web Archive Toolkit]
-* 1 author
+| Apache 2.0
-* jwattools arc2warc, cdx, compress, decompress, extract, interval, pathindex, test, unpack
+| Java
+| Partial Test Suite (check coverage profile)
+| Online
+| 1 author
+| jwattools arc2warc, cdx, compress, decompress, extract, interval, pathindex, test, unpack
 [https://bitbucket.org/nclarkekb/jwat/overview code repo]
+|-
-=== [http://matkelly.com/wail/ WAIL] ===
+| [http://matkelly.com/wail/ WAIL]
-* CC-BY-SA
+| CC-BY-SA
-* Python, JS
+| Python, JS
-* ???
+| ???
-* Online
+| Online
-* 1
+| 1
-* Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
+| Web Archiving Integration Layer (WAIL) is a graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
 Tools included and accessible through the GUI are Heritrix 3.1.2, Wayback 1.7, and warc-proxy. Support packages include Apache Tomcat, phantomjs and pyinstaller.
 [https://github.com/machawk1/wail code repo]
+|-
-=== [https://github.com/odie5533/pylibwarc/ pylibwarc] ===
+| [https://github.com/odie5533/pylibwarc/ pylibwarc]
-* ISC License
+| ISC License
-* Python
+| Python
-* CDX support
+| ?
-* 1 author
+| ?
+| 1 author
+CDX support
 Written by odie5533 which frequents #archiveteam, as another independant WARC library for Python.
+|-
-=== [https://github.com/chfoo/wpull Wpull] ===
+| [https://github.com/chfoo/wpull Wpull]
-* GPL version 3
+| GPL version 3
-* Python 3
+| Python 3
-* many unit tests (Travis CI registered), simple experimental fuzzer
+| many unit tests (Travis CI registered), simple experimental fuzzer
-* a quick start readme, brief usage overview, good docstrings coverage
+| a quick start readme, brief usage overview, good docstrings coverage
-* 1 core author
+| 1 core author
-* Wget-compatible web downloader.
+| Wget-compatible web downloader.
 Beta quality. Lua/Python scripting. PhantomJS (experimental). Used by [[ArchiveBot]].
+|-
-=== [https://github.com/ikreymer/pywb pywb]===
+| [https://github.com/ikreymer/pywb pywb]
-* GPL version 3
+| GPL version 3
-* Python 2
+| Python 2
-* yes
+| yes
-* readme and wiki
+| readme and wiki
-* 1 core author
+| 1 core author
-* A full-fledged Python reimplementation of Wayback Machine web archive replay capabilities. Also provides a live rewriting proxy.
+| A full-fledged Python reimplementation of Wayback Machine web archive replay capabilities. Also provides a live rewriting proxy.
+|-
-=== [https://github.com/ikreymer/pywb-webrecorder pywb-webrecorder]===
+| [https://github.com/ikreymer/pywb-webrecorder pywb-webrecorder]
-* MIT
+| MIT
-* Python 2
+| Python 2
-* no
+| no
-* readme
+| readme
-* 1 core author
+| 1 core author
-* An experimental/demo integration of pywb + warcprox to allow live recording to WARC. Allows instant replay of recorded content from WARC.
+| An experimental/demo integration of pywb + warcprox to allow live recording to WARC. Allows instant replay of recorded content from WARC.
+|-
-=== [https://github.com/ikreymer/webarchiveplayer webarchiveplayer]===
+| [https://github.com/ikreymer/webarchiveplayer webarchiveplayer]
-* GPL version 3
+| GPL version 3
-* Python 2
+| Python 2
-* not yet, though most testable functionality in pywb
+| not yet, though most testable functionality in pywb
-* readme
+| readme
-* 1 core author
+| 1 core author
-* Point-and-click wrapper for Windows and OS X for browsing WARC files. Shows a basic file open dialog to select a WARC(s), then
+| Point-and-click wrapper for Windows and OS X for browsing WARC files. Shows a basic file open dialog to select a WARC(s), then
 starts a server and opens a browser. Also determines HTML pages within a WARC. Built on top of pywb. In beta at the moment (early 2015).
+|}
 == Deprecated ==
-* https://code.google.com/p/warc-tools/ - Old, discontinued shit
+| https://code.google.com/p/warc-tools/ - Old, discontinued shit
-* https://github.com/internetarchive/archive-commons - split into 2 new repos: ia-web-commons & ia-hadoop-tools
+| https://github.com/internetarchive/archive-commons - split into 2 new repos: ia-web-commons & ia-hadoop-tools
 == The WARC format ==
-* A .warc file is usually a group of one or more WARC records.
+| A .warc file is usually a group of one or more WARC records.
-* The first record usually describes the records to follow.
+| The first record usually describes the records to follow.
-* compression is optional
+| compression is optional
-* each record is compressed via gzip. A gzip file supports multiple "members"
+| each record is compressed via gzip. A gzip file supports multiple "members"
-* compressed warcs end in .warc.gz
+| compressed warcs end in .warc.gz
-* According to the guidelines warc files should top out at 1gb
+| According to the guidelines warc files should top out at 1gb
-=== WARC record ===
+ WARC record
-* header
+| header
-* content block
+| content block
-* two newlines
+| two newlines
-=== WARC record header ===
+ WARC record header
 The beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line. The WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and [RFC2822] headers, with one major exception, allowing UTF-8 [RFC3629].
@@ Line 256: / Line 258: @@
    Content-Length: 150
-=== WARC named fields ===
+ WARC named fields
-* A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
+| A set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.
-* Named fields may appear in any order.
+| Named fields may appear in any order.
-* Field values may contain any UTF-8 character.
+| Field values may contain any UTF-8 character.
-* The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
+| The 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.
-=== WARC content block ===
+ WARC content block
 Part (zero or more octets) of a WARC record that follows the header and that forms the main body of a
 WARC record.
@@ Line 270: / Line 272: @@
 == CDX File Format ==
-* http://archive.org/web/researcher/cdx_legend.php
+| http://archive.org/web/researcher/cdx_legend.php
 [[Category:Tools]]
 {{Navigation box}}
