Difference between revisions of "Internet Archive Census"

From Archiveteam

Revision as of 14:47, 12 March 2015

The Internet Archive Census is an unofficial attempt to count and catalog the files available on the Internet Archive, focusing on downloadable, public-facing files. The project has several purposes, including determining the sizes of various collections and setting priorities for backing up portions of the Internet Archive's data stores.

The first Census was conducted in March of 2015. Its results are on the Archive at https://archive.org/details/ia-bak-census_20150304.

Purpose of the Census

The Census was called for as a stepping stone in the INTERNETARCHIVE.BAK project, an experiment to have Archive Team back up the Internet Archive. While the Internet Archive officially has 21 petabytes of information in its data stores (as of March 2015), some of that data is system overhead, and some is stream-only or otherwise not available. A full run-through of the entire collection of items at the Archive allows the next phases of the INTERNETARCHIVE.BAK experiment (testing methodologies) to move forward.

The data is also useful for talking about what the Internet Archive does and what kinds of items are in the stacks: collections can be found with very large or manageable amounts of data, and audiences and researchers outside the backup experiment can do their own data access and acquisition. The data can also be used to experiment with search engines and data visualization.

Contents of the Census

The Census is a very large collection of JSON-formatted tables, generated with the ia-mine utility by Jake Johnson of the Internet Archive. Like all such projects, the data should not be considered perfect, although a large percentage should accurately reflect the site. As there is only one census so far, there is no comparable data on growth or file changes. (There are reports of total files or other activity, but not at the level of detail of the JSON-format material the Census provides.)

Some Relevant Information from the Census

Based on the output of the Census:

  • The size of the listed data is 14.23 petabytes.
  • The census only contains "original" data, not derivations created by the system. (For example, if a .AVI file is uploaded, the census only counts the .AVI, and not a .MP4 or .GIF derived from the original file).
  • The vast majority of the data is compressed in some way. By far the largest kind of file is gzip, with 9PB uploaded! Most files that are not in an archive format are compressed videos, music, pictures, etc.
  • The largest single file (that is not just a tar of other files) is TELSEY_004.MOV, in item TELSEY_004 in the xfrstn collection.
  • There are 22,596,286 files which are copies of other files. The duplicate files take up 1.06PB of space. (Assuming all files with the same MD5 are duplicates.)
  • The largest duplicated file is all-20150219205226/part-0235.cdx.gz (195GB) in item wbsrv-0235-1. The entire wbsrv-0235-1 item is a duplicate of wbsrv-0235-0, which amounts to 600GB of duplicated data.
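
The duplicate figures above rest on the assumption that files sharing an MD5 are copies of one another. A minimal sketch of how such a count could be made is shown below. It is not the method actually used for the Census; the record layout (dicts with "md5" and "size" keys) and the sample values are hypothetical and do not reflect the real ia-mine output.

```python
def duplicate_stats(records):
    """Return (duplicate_file_count, duplicate_bytes).

    The first file seen with a given MD5 is treated as the original;
    every later file with the same MD5 is counted as a duplicate.
    """
    seen = set()
    dup_count = 0
    dup_bytes = 0
    for rec in records:
        md5 = rec["md5"]  # hypothetical field name
        if md5 in seen:
            dup_count += 1
            dup_bytes += rec["size"]  # hypothetical field name, in bytes
        else:
            seen.add(md5)
    return dup_count, dup_bytes


# Made-up sample records, for illustration only.
sample = [
    {"md5": "aaa", "size": 10},
    {"md5": "bbb", "size": 5},
    {"md5": "aaa", "size": 10},
    {"md5": "aaa", "size": 10},
    {"md5": "ccc", "size": 7},
]

count, size = duplicate_stats(sample)
print(count, size)
```

Only the MD5 values are held in memory, so the same loop can be fed from a generator that streams records from disk rather than from a list.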