User:Vitzli
Subpages
Prospecting IA.BAK collections
As of March 2025, ia-mine is no longer required: archive.org rate-limits metadata fetches regardless of the number of parallel downloads.
Tools required: Python 3 libraries/modules (internetarchive, ia-mine); jq for JSON processing; parallel for running multiple programs in for-each fashion.
An archive.org account (S3 keys) is required for the ia-mine and internetarchive (ia) tools.
2016-02-03 census
- 10 shards
- 79 collections
- 142462 items total, 106054 unique items (my mistake: run uniq on the item list before starting a large batch)
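The gap between the total and unique counts comes from items that appear in more than one collection list. A minimal Python sketch of deduplicating identifiers before a large batch follows; the function name and the sample identifiers are hypothetical and not taken from the census data.

```python
def unique_identifiers(identifiers):
    """Return identifiers with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for identifier in identifiers:
        identifier = identifier.strip()
        # Skip blank lines and anything already collected.
        if identifier and identifier not in seen:
            seen.add(identifier)
            unique.append(identifier)
    return unique


# Hypothetical sample: the same item listed under two collections.
sample = ["item-a", "item-b", "item-a", "item-c", "item-b"]
print(unique_identifiers(sample))  # ['item-a', 'item-b', 'item-c']
```

Unlike plain uniq, which only collapses adjacent duplicates, this does not need sorted input.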
jq code
Remove 'collection' items:
parallel --jobs 4 'jq '"'"'. | select(.mediatype != "collection") | .identifier'"'"' '"$F_PREFIX"'/{}.col.json | tr -d '"'"'"'"'"' > '"$F_PREFIX"'/{}.items.json'
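For readers who prefer to avoid the nested shell quoting, the same filter can be sketched in Python. This assumes each .col.json file holds one JSON object per line, which the page does not state; the function name and sample records are hypothetical. Identifiers come out as plain strings, so no quote-stripping step is needed.

```python
import json


def item_identifiers(lines):
    """Yield identifiers of records whose mediatype is not 'collection'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Same test as the jq select(): drop entries that are collections.
        if record.get("mediatype") != "collection":
            identifier = record.get("identifier")
            # Assumption: records without an identifier are skipped here,
            # whereas the jq filter would print null for them.
            if identifier is not None:
                yield identifier


# Hypothetical sample records, one JSON object per line.
sample = [
    '{"identifier": "some-collection", "mediatype": "collection"}',
    '{"identifier": "item-a", "mediatype": "texts"}',
    '{"identifier": "item-b", "mediatype": "movies"}',
]
print(list(item_identifiers(sample)))  # ['item-a', 'item-b']
```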
Remove 'uploader' field:
parallel --jobs 4 'jq -c '"'"'del(.metadata.uploader)'"'"' '"$F_PREFIX"'/{}.mined.json > '"SHARDS-20160203-cleaned/$F_PREFIX"'/{}.cleaned.json'
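A rough Python equivalent of the uploader-stripping step, again assuming one JSON object per input line; the function name and the sample record are hypothetical. It drops metadata.uploader when present and writes compact JSON, similar to what jq -c produces.

```python
import json


def strip_uploader(line):
    """Remove metadata.uploader from one JSON record and return compact JSON."""
    record = json.loads(line)
    metadata = record.get("metadata")
    if isinstance(metadata, dict):
        # pop() with a default leaves records without the field unchanged.
        metadata.pop("uploader", None)
    # Compact separators and unescaped non-ASCII text, similar to jq -c.
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)


# Hypothetical sample record.
sample = '{"metadata": {"identifier": "item-a", "uploader": "someone@example.org"}}'
print(strip_uploader(sample))  # {"metadata":{"identifier":"item-a"}}
```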