User:Vitzli

From Archiveteam
Revision as of 15:19, 3 February 2016 by Vitzli (talk | contribs) (Add IA.BAK preliminary prospecting report)

Saved stuff

  1. JBG Travels YouTube channel, partial download, 847 videos total: part 1, part 2, part 3.
    Several videos were either marked private or removed at the request of his employer, although they contained only road footage.
  2. Encyclopedia Astronautica snapshot (2015-10-22). According to Alive... OR ARE THEY, the site is on the watchlist.
  3. Pole shift survival library: not updated since 2013, was quite popular among survival/prepping folks. Not endangered, as the website is still online, but the torrent is decaying.
  4. Amazon reviews webdata 1995-2013: still available, but the links have been hidden.
  5. CGP Grey YouTube channel, one tar archive per year: 2010, 2011, 2012, 2013, 2014, 2015
  6. SmarterEveryDay YouTube channel, one tar archive per year: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015

Prospecting IA.BAK collections

Tools required: Python 3 libraries/modules (internetarchive, ia-mine); jq for JSON processing; GNU parallel for running a command once per input, for-each style.

An archive.org account (S3 keys) is required for the ia-mine and internetarchive (ia) tools.

2016-02-03 census

  • 10 shards
  • 79 collections
  • 142462 items total, 106054 unique items (my mistake: deduplicate the identifier list before running a large batch)
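
The gap between the total and the unique count comes from identifiers that are listed more than once. A minimal sketch of deduplicating an identifier list before mining, assuming one identifier per line; the function name and the sample identifiers are hypothetical and not part of the original workflow:

```python
from typing import Iterable, Iterator


def unique_identifiers(identifiers: Iterable[str]) -> Iterator[str]:
    """Yield each identifier once, keeping first-seen order."""
    seen = set()
    for raw in identifiers:
        ident = raw.strip()
        # Skip blank lines and anything already emitted.
        if not ident or ident in seen:
            continue
        seen.add(ident)
        yield ident


# Hypothetical sample data, not real IA.BAK identifiers.
sample = ["item-a", "item-b", "item-a", "item-c\n", "item-b"]
deduped = list(unique_identifiers(sample))
print(len(sample), "listed,", len(deduped), "unique")
```

Keeping first-seen order (rather than sorting, as sort | uniq would) means the deduplicated list can still be split into batches in the same order as the original.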

jq code

Remove 'collection' items:

parallel --jobs 4 "jq -r 'select(.mediatype != \"collection\") | .identifier' $F_PREFIX/{}.col.json > $F_PREFIX/{}.items.json"
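
For readers less familiar with jq, the filter above can be mirrored in plain Python. This is only a sketch: it assumes each .col.json file holds one JSON object per line, and the function name and sample records are hypothetical. Unlike the jq command, it skips records that have no identifier instead of printing null.

```python
import json
from typing import Iterable, Iterator


def item_identifiers(lines: Iterable[str]) -> Iterator[str]:
    """Yield identifiers of all records whose mediatype is not 'collection'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("mediatype") == "collection":
            continue  # nested collections are not items
        identifier = record.get("identifier")
        if identifier is not None:
            yield identifier


# Hypothetical sample records, assumed to be in JSON-lines form.
sample_lines = [
    '{"identifier": "some-collection", "mediatype": "collection"}',
    '{"identifier": "some-item", "mediatype": "texts"}',
]
for identifier in item_identifiers(sample_lines):
    print(identifier)
```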

Remove 'uploader' field:

parallel --jobs 4 "jq -c 'del(.metadata.uploader)' $F_PREFIX/{}.mined.json > SHARDS-20160203-cleaned/$F_PREFIX/{}.cleaned.json"
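
The same cleanup can be sketched in Python, which may be handy when more fields need scrubbing than a one-line jq filter comfortably handles. It assumes each mined record is a single JSON object on one line; the function name and the sample record are hypothetical.

```python
import json


def strip_uploader(line: str) -> str:
    """Return the record as compact JSON with metadata.uploader removed."""
    record = json.loads(line)
    metadata = record.get("metadata")
    # Like jq's del(), a missing field is simply left alone.
    if isinstance(metadata, dict):
        metadata.pop("uploader", None)
    # Compact separators approximate the output of jq -c.
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)


# Hypothetical sample record.
sample = '{"metadata": {"identifier": "some-item", "uploader": "someone@example.com"}}'
print(strip_uploader(sample))
```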