User:Vitzli

From Archiveteam

Subpages

Prospecting IA.BAK collections

As of March 2025, ia-mine is no longer required: archive.org rate-limits metadata fetches regardless of the number of parallel downloads.

Tools required: Python 3 libraries/modules internetarchive and ia-mine; jq for JSON processing; parallel for running a command once per input, for-each style.

An archive.org account (for its S3 keys) is required by the ia-mine and internetarchive (ia) tools.
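
Since metadata fetches are rate-limited on the server side anyway, a plain sequential loop over the ia tool may be enough. The following is only a sketch under assumptions that are not from this page: identifiers arrive one per line on stdin, credentials are already configured, and DELAY and fetch_metadata are made-up names.

```shell
# Sketch: fetch item metadata one identifier at a time with the ia tool.
# Assumptions (hypothetical): identifiers come one per line on stdin;
# DELAY is an invented knob for the pause between requests, in seconds.
fetch_metadata() {
    delay="${DELAY:-1}"
    while IFS= read -r id; do
        [ -n "$id" ] || continue            # skip blank lines
        ia metadata "$id" < /dev/null       # prints the item's metadata as JSON
        sleep "$delay"
    done
}

# Usage (file names are placeholders):
#   fetch_metadata < items.txt > items.mined.json
```

Redirecting the ia call's stdin from /dev/null keeps it from swallowing the identifier list that the loop is reading.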

2016-02-03 census

  • 10 shards
  • 79 collections
  • 142462 items total, 106054 unique items (my mistake: deduplicate the identifier list before running a large batch)
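
The deduplication step noted in the last bullet could look roughly like this. It is a sketch on made-up data: the scratch directory, the shard file names and the identifiers are all hypothetical, and it assumes each list holds one identifier per line.

```shell
# Scratch directory with two made-up identifier lists that overlap.
workdir=$(mktemp -d)
printf '%s\n' item-a item-b item-c > "$workdir/shard0.items.json"
printf '%s\n' item-b item-c item-d > "$workdir/shard1.items.json"

# Count identifiers across all lists, duplicates included.
total=$(cat "$workdir"/*.items.json | wc -l | tr -d ' ')

# Merge the lists and keep each identifier once.
sort -u "$workdir"/*.items.json > "$workdir/all.uniq.txt"
unique=$(wc -l < "$workdir/all.uniq.txt" | tr -d ' ')

echo "total=$total unique=$unique"
```

Comparing the two counts before starting a batch shows how much duplicate work would otherwise be done.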

jq code

Remove 'collection' items:

parallel --jobs 4 "jq -r 'select(.mediatype != \"collection\") | .identifier' \"$F_PREFIX\"/{}.col.json > \"$F_PREFIX\"/{}.items.json"
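
To show what the filter does, here is a small sketch on made-up records. It assumes each .col.json file holds one JSON object per line with identifier and mediatype fields, which this page does not state explicitly. It uses jq -r so the identifiers come out as raw strings without surrounding quotes.

```shell
# Made-up sample records; the real .col.json layout is assumed, not confirmed.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"identifier": "example-collection", "mediatype": "collection"}
{"identifier": "example-item-1", "mediatype": "texts"}
{"identifier": "example-item-2", "mediatype": "movies"}
EOF

# Drop collection records and print the remaining identifiers unquoted.
ids=$(jq -r 'select(.mediatype != "collection") | .identifier' "$sample")
printf '%s\n' "$ids"
rm -f "$sample"
```

Only the two non-collection identifiers should remain, one per line.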

Remove 'uploader' field:

parallel --jobs 4 "jq -c 'del(.metadata.uploader)' \"$F_PREFIX\"/{}.mined.json > \"SHARDS-20160203-cleaned/$F_PREFIX\"/{}.cleaned.json"
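
A minimal sketch of the same deletion on a single made-up record follows. The record's fields and values are invented for illustration, and the shape of real .mined.json records is assumed rather than confirmed here.

```shell
# Made-up record; only the metadata.uploader key matters for this example.
record='{"metadata": {"identifier": "example-item-1", "uploader": "someone@example.com"}, "files_count": 3}'

# del() removes the key and leaves the rest untouched; -c prints compact one-line JSON.
cleaned=$(printf '%s\n' "$record" | jq -c 'del(.metadata.uploader)')
printf '%s\n' "$cleaned"
```

The shell redirect in the parallel command does not create directories, so SHARDS-20160203-cleaned/$F_PREFIX has to exist before the batch starts (for example via mkdir -p).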