Difference between revisions of "User:Vitzli"

From Archiveteam
(Add user page)
 
(Add IA.BAK preliminary prospecting report)
 
(One intermediate revision by the same user not shown)
# [https://archive.org/details/poleshift-survival-library Pole shift survival library]: not updated since 2013. It was quite popular among survival/prepping folks. Not endangered, as the website is still online, but the torrent is decaying.
# [https://archive.org/details/amazon-reviews-1995-2013 Amazon reviews webdata 1995-2013] — still available, but links were hidden.
# CGP Grey YouTube channel, tar archive per year: [https://archive.org/details/CGPGrey-tar-2010 2010], [https://archive.org/details/CGPGrey-tar-2011 2011], [https://archive.org/details/CGPGrey-tar-2012 2012], [https://archive.org/details/CGPGrey-tar-2013 2013], [https://archive.org/details/CGPGrey-tar-2014 2014], [https://archive.org/details/CGPGrey-tar-2015 2015]
# SmarterEveryDay YouTube channel, tar archive per year: [https://archive.org/details/SmarterEveryDay-tar-2007 2007], [https://archive.org/details/SmarterEveryDay-tar-2008 2008], [https://archive.org/details/SmarterEveryDay-tar-2009 2009], [https://archive.org/details/SmarterEveryDay-tar-2010 2010], [https://archive.org/details/SmarterEveryDay-tar-2011 2011], [https://archive.org/details/SmarterEveryDay-tar-2012 2012], [https://archive.org/details/SmarterEveryDay-tar-2013 2013], [https://archive.org/details/SmarterEveryDay-tar-2014 2014], [https://archive.org/details/SmarterEveryDay-tar-2015 2015]
== Prospecting IA.BAK collections ==
Tools required: the Python 3 packages internetarchive and ia-mine; jq for JSON processing; parallel for running a command once per input item (''for each'' fashion).
An archive.org account (S3 keys) is required for the ia-mine and internetarchive (ia) tools.
=== 2016-02-03 census ===
* 10 shards
* 79 collections
* 142462 items total, 106054 unique items (my mistake: deduplicate the item list, e.g. with sort and uniq, before starting a large batch)
=== jq code ===
Remove 'collection' items:
<code>
parallel --jobs 4 "jq -r 'select(.mediatype != \"collection\") | .identifier' $F_PREFIX/{}.col.json > $F_PREFIX/{}.items.json"
</code>
Remove 'uploader' field:
<code>
parallel --jobs 4 "jq -c 'del(.metadata.uploader)' $F_PREFIX/{}.mined.json > SHARDS-20160203-cleaned/$F_PREFIX/{}.cleaned.json"
</code>

Latest revision as of 15:19, 3 February 2016

Saved stuff

  1. JBG Travels YouTube channel, partial download, 847 videos total: part 1, part 2, part 3.
    Several videos were either marked private or removed at the request of his employer, although they contained only road footage.
  2. Encyclopedia Astronautica snapshot (2015-10-22); according to Alive... OR ARE THEY, the site is on the watchlist.
  3. Pole shift survival library: not updated since 2013. It was quite popular among survival/prepping folks. Not endangered, as the website is still online, but the torrent is decaying.
  4. Amazon reviews webdata 1995-2013 — still available, but links were hidden.
  5. CGP Grey YouTube channel, tar archive per year: 2010, 2011, 2012, 2013, 2014, 2015
  6. SmarterEveryDay YouTube channel, tar archive per year: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015

Prospecting IA.BAK collections

Tools required: the Python 3 packages internetarchive and ia-mine; jq for JSON processing; parallel for running a command once per input item (for each fashion).

An archive.org account (S3 keys) is required for the ia-mine and internetarchive (ia) tools.

2016-02-03 census

  • 10 shards
  • 79 collections
  • 142462 items total, 106054 unique items (my mistake: deduplicate the item list, e.g. with sort and uniq, before starting a large batch)
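
The gap between the total and unique counts suggests that the same identifier appears in more than one collection list. A minimal sketch of deduplicating before a large batch follows; it assumes the item lists hold one identifier per line, and the helper name unique_items and the output file name are hypothetical, not part of the original workflow.

```shell
#!/bin/sh
# Hypothetical helper: print each identifier once across all given list files.
# Assumes one identifier per line in every input file.
unique_items() {
    # sort -u sorts the combined input and drops duplicate lines in one pass
    cat "$@" | sort -u
}

# Example usage (paths are illustrative):
#   unique_items "$F_PREFIX"/*.items.json > all-items.txt
#   wc -l < all-items.txt
```

Comparing the line count before and after this step gives the total and unique figures directly.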

jq code

Remove 'collection' items:

parallel --jobs 4 "jq -r 'select(.mediatype != \"collection\") | .identifier' $F_PREFIX/{}.col.json > $F_PREFIX/{}.items.json"
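
To see what the filter does on its own, here is a small sketch run on made-up input. It assumes the .col.json files hold a stream of JSON objects with identifier and mediatype fields; the helper name item_identifiers and the sample records are hypothetical. It uses jq's -r option to print raw strings, which has the same effect as stripping the double quotes afterwards.

```shell
#!/bin/sh
# Hypothetical helper: print identifiers of everything that is not a collection.
# Reads the files given as arguments, or standard input if none are given.
item_identifiers() {
    jq -r 'select(.mediatype != "collection") | .identifier' "$@"
}

# Made-up sample: one collection record and one ordinary item record.
printf '%s\n' \
    '{"identifier":"example-collection","mediatype":"collection"}' \
    '{"identifier":"example-item","mediatype":"movies"}' \
    | item_identifiers
# prints: example-item
```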

Remove 'uploader' field:

parallel --jobs 4 "jq -c 'del(.metadata.uploader)' $F_PREFIX/{}.mined.json > SHARDS-20160203-cleaned/$F_PREFIX/{}.cleaned.json"
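
A small sketch of the same deletion on a single made-up record, assuming the mined files hold JSON objects with a metadata object inside; the helper name strip_uploader and the sample values are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: drop .metadata.uploader from every record and
# emit compact output (-c), one JSON object per line.
strip_uploader() {
    jq -c 'del(.metadata.uploader)' "$@"
}

# Made-up sample record with an uploader field.
printf '%s\n' \
    '{"metadata":{"identifier":"example-item","uploader":"someone@example.com"}}' \
    | strip_uploader
# prints: {"metadata":{"identifier":"example-item"}}
```

All other fields pass through unchanged, so the cleaned files stay usable as input for later steps.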