Difference between revisions of "Audit2014"

From Archiveteam
m (→‎Current Sub-Collections at Archive Team: typos fixed: indvidual → individual, indiviual → individual)
 
(48 intermediate revisions by 9 users not shown)
Line 1: Line 1:
We've uploaded a bunch of stuff: [https://archive.org/search.php?query=subject:Archiveteam https://archive.org/search.php?query=subject:Archiveteam]
We've uploaded a bunch of stuff:  
*[https://archive.org/search.php?query=subject:archiveteam subject:archiveteam] = 13,785 items
*[https://archive.org/search.php?query=collection:archiveteam collection:archiveteam] = 60,172 items
*[https://archive.org/search.php?query=NOT%20collection%3A%28archiveteam%29%20AND%20subject%3A%28archiveteam%29 subject:archiveteam AND NOT collection:archiveteam] = 2,028 items
 
(The 3rd one should eventually be close to empty.)
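The three searches above differ only in their query string, so the links can be generated instead of percent-encoded by hand. The following is a minimal sketch, assuming nothing beyond the `search.php?query=` URL pattern visible in the links above; the helper name `search_url` is hypothetical, and the script only builds URLs without contacting archive.org.

```python
from urllib.parse import quote

# URL prefix copied from the search links above.
SEARCH_BASE = "https://archive.org/search.php?query="


def search_url(query: str) -> str:
    """Percent-encode a search query and append it to the search.php URL."""
    # safe="" makes quote() encode '/' as well; ':' , '(' , ')' and spaces
    # are always encoded, which matches the form of the third link above.
    return SEARCH_BASE + quote(query, safe="")


# The query whose result list should shrink as items are moved into the collection.
ORPHANS = "NOT collection:(archiveteam) AND subject:(archiveteam)"

if __name__ == "__main__":
    print(search_url(ORPHANS))
```

Re-running the orphan query from time to time is one way to see whether the third list is actually approaching empty.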


Let's go through the list and make sure it's categorized, has decent metadata, etc.
Line 14: Line 19:
; Indexing : If the item is a collection of sub-items, is one of these sub-items an index of the others? (This is a complicated thing to check for and to create when it doesn't exist, so we can come back to this after we've checked the rest.)
; Your suggestion here : this is just off the top of my head.
== High-level Collections ==
* https://archive.org/details/web
** https://archive.org/details/archiveteam
*** https://archive.org/details/archiveteam-fire
*** https://archive.org/details/archivebot
** https://archive.org/details/wikiteam


== Current Sub-Collections at Archive Team ==
Line 84: Line 96:
| [http://archive.org/details/archiveteam_greader archiveteam_greader] || Unaudited || || 368 || || [[Google Reader]]; 3 categories of WARCs: Directory, Stats & general. It would probably be good to also put them in separate collections. There is also a [https://archive.org/details/archiveteam_greaderstats_combined combined stats item].
|-
| [http://archive.org/details/archiveteam_ignsites archiveteam_ignsites] || Unaudited || || 81 || || [[IGN]] (needs link to archive); Each item contains a particular subdomain. Descriptive names.
| [http://archive.org/details/archiveteam_ignsites archiveteam_ignsites] || Unaudited || || 81 || || [[IGN]] (needs link to archive); Each item contains a particular subdomain. Descriptive names. ([https://archive.org/details/primeblog.ign.com primeblog.ign.com item] needs to be added to ''archiveteam'' and ''web'' collections)
|-
| [http://archive.org/details/archiveteam_g4tv_forums archiveteam_g4tv_forums] || Unaudited || || 74 || || ARCs from [[wikipedia:G4 (TV channel)]], mainly from the forum
Line 116: Line 128:
| [http://archive.org/details/webshots-freeze-frame webshots-freeze-frame] || Unaudited || || 2459 || No || [[Webshots]]; WARCs
|-
| [http://archive.org/details/tabblo-archive tabblo-archive] || Unaudited || || 1806 || Maybe: [https://archive.org/details/tabblo-archive-groups groups] item || [[Tabblo]]; 9 MegaWARCs, the rest of the items are groups of indiviual accounts as zip files
| [http://archive.org/details/tabblo-archive tabblo-archive] || Unaudited || || 1806 || Maybe: [https://archive.org/details/tabblo-archive-groups groups] item || [[Tabblo]]; 9 MegaWARCs, the rest of the items are groups of individual accounts as zip files
|-
| [http://archive.org/details/archiveteam-fortunecity archiveteam-fortunecity] || Unaudited || [https://archive.org/details/archiveteam-fortunecity-list Yes] || 55 || || [[FortuneCity]]; 26 "Set" items (containing a single large tar in each one); also 26 WARC items, and one leftovers item
Line 130: Line 142:
| [http://archive.org/details/archiveteam-geocities archiveteam-geocities] || Unaudited || || 12 || || [[Geocities]]
|-
| [http://archive.org/details/archiveteam-fire archiveteam-fire] || Unaudited || || 7135 || || A vast and misc. collection; needs quite a bit of TLC
| [http://archive.org/details/archiveteam-fire archiveteam-fire] || Unaudited || || 7135 || || A vast and misc. collection; needs quite a bit of TLC ; ([http://archive.org/details/www.asiatorrents.me-subtitle-1-to-38406-20141205 www.asiatorrents.me-subtitle-1-to-38406-20141205 item] needs to be added to the ''archiveteam'', and ''web'' collections)
|-
| [http://archive.org/details/archiveteam-mypodcast archiveteam-mypodcast] || Unaudited || || 383 || || Each item is a separate podcast, containing indvidual sound files, playable through the IA interface; there is also a [https://archive.org/download/archiveteam-mypodcast-dataonly misc] item
| [http://archive.org/details/archiveteam-mypodcast archiveteam-mypodcast] || Unaudited || || 383 || || Each item is a separate podcast, containing individual sound files, playable through the IA interface; there is also a [https://archive.org/download/archiveteam-mypodcast-dataonly misc] item
|-
| [http://archive.org/details/archiveteam-googlegroups archiveteam-googlegroups] || Unaudited || [[User:JesseW|JesseW]] || 1,348 || Partial (each item has a list of groups, but there's no overall list) || [[Google Groups]]; This is divided into items by the initial two letters (or digits or underscore). The item for "[https://archive.org/details/archiveteam-googlegroups-th th]" has an inconsistent title and category.
|-
| isohunt dumps [https://archive.org/details/isohunt.teapot.2013 1] [https://archive.org/details/isohunt.croissant.2013 2] [https://archive.org/details/isohunt.coffeepot.2013 3] || Unaudited || || 3 || No || These are not yet in a dedicated collection, and have never been post-processed. Some of the .torrent files may actually be error pages. This needs work, and proper full auditing.
| isohunt dumps [https://archive.org/details/isohunt.teapot.2013 1] [https://archive.org/details/isohunt.croissant.2013 2] [https://archive.org/details/isohunt.coffeepot.2013 3] || Audited || vitzli || 3 || [https://archive.org/details/isohunt.audit.2016 Partial] || These are not yet in a dedicated collection, and have never been post-processed. Some of the .torrent files may actually be error pages. This needs work, and proper full auditing. Visit [https://archive.org/download/isohunt.audit.2016/isohunt.audit.2016.html summary page] or [[IsoHunt]] for more details
 
|-
| '''[https://archive.org/search.php?query=streetfiles No Category (streetfiles)]''' || Unaudited || || || ||
Line 178: Line 191:


* https://archive.org/search.php?query=earbits Earbits gathering is in the wrong place and needs additional versions.
* The wiki front page needs updating


=== To be moved to better collection ===
* https://archive.org/details/archiveteam-fileplanet is a well done collection with a description that goes into detail about the site... if only it had ''any'' of the items. Instead, they are dumped in Community Texts. They don't even have anything tying them to archiveteam in the item names, despite clearly being from us. https://archive.org/search.php?query=FileplanetFiles seems to bring them up.
==== Collections ====
* http://archive.org/details/archiveteam_atomicgamer
* http://archive.org/details/archiveteam_layervault
* http://archive.org/details/archiveteam_madden
* http://archive.org/details/archiveteam_tele2
* http://archive.org/details/archiveteam_viddler
* http://archive.org/details/archiveteam_friendfeed
* http://archive.org/details/archiveteam_furaffinity
* http://archive.org/details/archiveteam_lastfm
* http://archive.org/details/archiveteam_toshibadocs


==== Orphaned [[Twitch.tv]] ====
(The items within them also need to be added to the ''archiveteam'', and ''web'' collections.)
* https://archive.org/details/archiveteam_twitchtv_20140811223313
* https://archive.org/details/archiveteam_twitchtv_espesgrab


==== WARC ====
* https://archive.org/details/pouet.com_full_grab no WARC file visible for me
 
* https://archive.org/details/archiveteam_punchfork_archive-archive
 
* https://archive.org/details/sg1archive.com_forums_20140708
* Anything under https://archive.org/search.php?query=subject%3A%22warcarchives%22
* https://archive.org/details/2013_misc_warcs_02
* https://archive.org/details/fenopy-se-fire-grab-2014-12-30-16-38-13
* https://archive.org/details/2013_misc_warcs_01
* https://archive.org/details/netszar_com_2015_06
* https://archive.org/details/site-donkeyboytripodcom
* https://archive.org/details/swipnet-searchengine-crawl-nonrecursive
* https://archive.org/details/site-homeswipnetseclubnintendo007
* https://archive.org/details/swipnet-searchengine-crawl-recursive
* https://archive.org/details/site-homeswipnetsecpg
* https://archive.org/details/kajaszoszentpeter_hu_2015_06
* https://archive.org/details/site-homeswipnetsegamemaster
* https://archive.org/details/warc-hallofshame.gp.co.at
* https://archive.org/details/homeswipnetsenestabs
* https://archive.org/details/warc-freakedenough.at
* https://archive.org/details/Site-homeswipnetsew-62848
* https://archive.org/details/nintendoukkidsclub-20150608.warc
* https://archive.org/details/site-homeswipnetsesofiasgbc
* https://archive.org/details/warc-9chin
* https://archive.org/details/site-homeswipnetsexcheatsdk
* https://archive.org/details/warcarchive-www.bun23.com
* https://archive.org/details/site-home2swipnetsew26120
* https://archive.org/details/warchive-www.sotipro.com
* https://archive.org/details/site-home3.swipnet.se-w38081
* https://archive.org/details/files.hii-tech.com-warc
* https://archive.org/details/site-home4swipnetse-w42641
* https://archive.org/details/www.synthfool.com
* https://archive.org/details/site-home4swipnetse-w46722
* https://archive.org/details/wwwbiologyarizonaedu
* https://archive.org/details/site-homeswipnetsefredde2000
* https://archive.org/details/studionyami-com_penfifteen-2012-03-05
* https://archive.org/details/ubuntuone-panicgrab-20140405
* https://archive.org/details/fybertech
* https://archive.org/details/myopera-forums-1700001-1800000
* https://archive.org/details/wwwclarkuedu-djoyce-trig-20150608.warc
* https://archive.org/details/myopera-forums-1800001-1823192
* https://archive.org/details/rawporter.s3.amazonaws.com_20140616_partial
* https://archive.org/details/technet.microsoft.com-panicgrab-20130706
* https://archive.org/details/isohunt_facebook_page_snapshot WARC and other formats
* https://archive.org/details/Misc.yero.orgMusic
* https://archive.org/details/telinco.co.uk_pages
* https://archive.org/details/tribes_forum_emergency_grab
* https://archive.org/details/isohunt-20131019-mithrandir-extra
* https://archive.org/details/cscope.us-google-pdfs-grab-20130312
* https://archive.org/details/cscope.us-google-pdfs-grab-20130520
* https://archive.org/details/PinkTentacle
* https://archive.org/details/journalstar.com_sports_local_20120730.warc
* https://archive.org/details/www.battleforthenet.com-panicgrab-20140718
* https://archive.org/details/theopeninter.net-panicgrab-20140718
* https://archive.org/details/startupsfornetneutrality.org-panicgrab-20140718
* https://archive.org/details/net.net-panicgrab-20140718
* https://archive.org/details/wwdctimer.com-panicgrab-20140731
* https://archive.org/details/xn--19g.com-panicgrab-20140731
* https://archive.org/details/chromercise.com-panicgrab-20140731
* https://archive.org/details/hiddenfromgoogle.com-panicgrab-20140731
* https://archive.org/details/orteil.dashnet.org-panicgrab-20140731
* https://archive.org/details/pingus.seul.org-panicgrab-20140731
* https://archive.org/details/tux4kids.alioth.debian.org-panicgrab-20140731
* https://archive.org/details/tuxkart.sourceforge.net-panicgrab-20140731
* https://archive.org/details/assets.minecraft.net-panicgrab-20140807
* <nowiki>https://archive.org/details/bmf.*rustedmagick.com-cr-panicgrab-20140808</nowiki> (remove asterisk, spam filter doesn't like this link)
* https://archive.org/details/tppx.herokuapp.com-panicgrab-20140808
* https://archive.org/details/nintendo-warcs
* https://archive.org/details/www.battleforthenet.com-panicgrab-20140912
* https://archive.org/details/mojang.com-notch-panicgrab-20140912
* https://archive.org/details/http.lists.xiph.org.ad78c6615d420894
* https://archive.org/details/legowracers.4t2portfolio.co.uk-panicgrab-20141007
* https://archive.org/details/2014.oct.29G3.warc (Geometer's Sketchpad installers)
* https://archive.org/details/inw-begun-2014.oct.26-p6-00001.warc (ef.inwards.com snapshot)
* https://archive.org/details/Dsoi4Jan2014.megawarc.json (WARCs appear to be corrupt)
* https://archive.org/details/bds-9oct2013
* https://archive.org/details/Hogislandeducators2011.wikispaces.comWARCSnapshot9October2013
* https://archive.org/details/warcs-as-of-26jany2014
* https://archive.org/details/00001DlUMkUFTWc.info (WARCs and a bunch of other stuff)
* https://archive.org/details/D3jan2014.megawarc.json
* https://archive.org/details/MicrosoftDemandsTakedownOfMicrosoftSpyGuide.html (WARCs and other stuff)
* https://archive.org/details/warc-9aug2014
* https://archive.org/details/27may2014warcset
* https://archive.org/details/13jany2014warcs
* https://archive.org/details/mcspotlight.org-20141030
* https://archive.org/details/dr_static.s3.amazonaws.com-panicgrab-20140929
* https://archive.org/details/cc2014.oct.31-00000.warc (Songs from DJ Contacreast's website)
* https://archive.org/search.php?query=collection%3Aamjbarreldata (This is a collection of WARCs (not in Wayback at present, as far as I know) from my attempt at writing a distributed-computing website-specific archival tool, sort of a cross between Majestic-12 and Archivebot. Not sure if it's appropriate to list here, but it's a thing. Feel free to remove it if it's not....)
* https://archive.org/details/libertypost.org_20150115_partial


==== FTP ====
* https://archive.org/details/ftp.idsoftware.com
* https://archive.org/details/ftp.lucasarts.com-20130427
* https://archive.org/details/2014.02.ftp.inf.tuDresden.deAtari
* https://archive.org/details/2014.0102.ftp.festo.com
* https://archive.org/details/wa-begun-ul-27jany2014amn (This should probably be darked, it looks to me like it's someone's misconfigured home NAS)
* https://archive.org/details/2014.0102.mail.digipro.rs
* https://archive.org/details/2014.0102.mail.digipro.rs
* https://archive.org/details/2014.12.ftp.dlink.biz_201501
* https://archive.org/details/2015.01.12.ftp.sunet.sePubOpenBSD


==== Misc ====
Line 271: Line 240:
* https://archive.org/details/YahooBlogSitemaps20131216071927
* https://archive.org/details/archiveteam-mobileme-index
* https://archive.org/details/archiveteam-twitter-stream-2014-05
* https://archive.org/details/ESPNForumsPanicgrab
* https://archive.org/details/rawporter-grab
Line 303: Line 271:
* https://archive.org/details/thekeep_bbs
* https://archive.org/details/mail.google.com-saved-1Oct2014
* https://archive.org/details/madden_giferator_scrape_1-100000
* https://archive.org/details/madden_giferator_scrape_100001-200000
* https://archive.org/details/madden_giferator_scrape_200001-300000
* https://archive.org/details/Data2September2013.tar (Gunnerkrigg Court homepage comments snapshots)
* https://archive.org/details/shipwretched-items
* https://archive.org/details/fotodisco-raw-items
* https://archive.org/details/quizilladisco-raw-items
* https://archive.org/details/qwikidisco-raw-items
* https://archive.org/details/twitpicdisco-raw-items
* https://archive.org/details/maemo-fremantle-ovi
* https://archive.org/details/toontown_infinite_github_20150103
 
* https://archive.org/details/amplicate_sitemaps_20140218
== [[URLTeam]] ==
* https://archive.org/details/twitch-raw-items
 
* https://archive.org/details/actionbutton_mini.tar
* <s>Upload the latest official torrent release</s>. Done! [https://archive.org/details/URLTeamTorrentRelease2013July URLTeamTorrentRelease2013July]
* https://archive.org/details/ageofnerds_mini
* <s>Upload the Dropbox files in the URLTeam wiki page table that are *not* in the latest release</s>. Done!
* https://archive.org/details/2015feb06a07FuturamerlinAList
* [[user:chfoo]] needs access URLTeam collection OR [https://archive.org/search.php?query=urlteam%20terroroftinytown%20-collection%3Atest_collection&sort=-publicdate move the items as needed].
* https://archive.org/details/worldpeacehaven_gmail_Xaa
* https://archive.org/details/worldpeacehaven_gmail_Xab
* https://archive.org/details/2015feb02ob
* https://archive.org/details/2014dec09spe2
* https://archive.org/details/bigougit_mini_v2
* https://archive.org/details/galman33_mini
* https://archive.org/details/urls2015dec02n2
* https://archive.org/details/493nfos
* https://archive.org/details/archiveteam_dev_env_v1_appliances
* https://archive.org/details/Kazbeg_Panorama.jpg -- If tags can be edited by non-owners, this probably shouldn't have the ''archiveteam'' tag.
* https://archive.org/search.php?query=subject%3A%22wallbase%22 -- 10 different items, representing efforts at saving [[wallbase.cc]]; need to be sorted and organized
* https://archive.org/search.php?query=subject%3A%22aol%20archiveteam%2C%20aol%20files%2C%20aol%20protocol%22 -- 6 items that need their subject tags cleaned up
* https://archive.org/search.php?query=subject%3A%22Tabblo%22%20AND%20NOT%20collection%3Aarchiveteam -- 5 of the 11 Tabblo items are not in the Archiveteam collection
* https://archive.org/details/donkeykongsites
* https://archive.org/details/dogpictbot
* https://archive.org/details/HackerNewsStoriesAndCommentsDump
* https://archive.org/details/flipnote-hatena-dkl3collection - in wikiteam collection but not a wiki so should be somewhere else
* https://archive.org/details/msdos_Chenard_shareware
* https://archive.org/details/msdos_Spanverb_shareware
* https://archive.org/details/msdos_ADELINE_demo


== Missing ==

Latest revision as of 02:21, 5 December 2017

We've uploaded a bunch of stuff:

(The 3rd one should eventually be close to empty.)

Let's go through the list and make sure it's categorized, has decent metadata, etc.

Many of our uploads are quite large, and have been broken into many items on Archive.org. We'll group them together here and verify each set all at once.

Things to check

Collection
Are all the related items grouped into a collection?
Description
Can a visitor figure out what each item represents? Items in a collection don't need to repeat the description of the collection, but it'd be nice if they had a sentence or two, and information about how the item differs from the other items in the collection ("MP3s from earbits.com, files starting with c." from the Earbits items is a good example.)
Inclusion
Are all the related items included in the same collection?
Categorization
Can a visitor find the item by browsing the collections?
Cross-references
Can a visitor find other items in a set, starting at any item in the set? Can a visitor find the index of a large set starting from any part of it?
Indexing
If the item is a collection of sub-items, is one of these sub-items an index of the others? (This is a complicated thing to check for and to create when it doesn't exist, so we can come back to this after we've checked the rest.)
Your suggestion here
this is just off the top of my head.
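The checks above can be recorded per collection as a simple yes/no record, which makes it easy to see what still needs work. The following is only a sketch: the class name `AuditChecklist`, its field names, and the example values are hypothetical and are not part of any existing Archive Team tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class AuditChecklist:
    """One collection's answers to the checks listed above (True = passes)."""
    collection: bool = False        # related items grouped into a collection?
    description: bool = False       # can a visitor tell what each item represents?
    inclusion: bool = False         # all related items in the same collection?
    categorization: bool = False    # findable by browsing the collections?
    cross_references: bool = False  # other items and the index reachable from any item?
    indexing: bool = False          # is one sub-item an index of the others?

    def failing(self) -> list[str]:
        """Names of the checks that still need work, in checklist order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Made-up answers for illustration only; not an audit result from this page.
    example = AuditChecklist(collection=True, description=True, inclusion=True)
    print(example.failing())
```

Defaulting every check to `False` means an unaudited collection reports all six checks as outstanding until someone fills them in.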

High-level Collections

Current Sub-Collections at Archive Team

{|
! Collection !! Status !! Auditor !! Item Count !! Has an Index !! Description of Audit
|-
| No Category (earbits) || Unaudited || || 98 || Yes || The items are not in a collection. Most items are WARCs; the rest need additional work if anyone is going to be able to find the exact MP3 they want.
|-
| archiveteam_ptch || Audited || db48x || 50 || No || Collection has great description, but no categories. Items in collection are WARCs. One item not included in the collection: deathy-s3-test-ptch
|-
| archiveteam_flowerpot || Audited || db48x || 406 || No || The description of the collection is anemic, but each item is well-identified.
|-
| github_files || Audited || db48x || 1 || No || Pretty bad shape. Only one item in the collection, and that's only half the data. Was the rest never uploaded? Has no description, keywords or other metadata. Other Github items could be included, such as this repository index, and these other file downloads
|-
| justintv || Audited || db48x || 189 || No Partial (Src) || Decent description, but no other metadata. There are 51 other 'justintv' items, but none of them look to be from us.
|-
| archiveteam_mochimedia || Audited || db48x || 9 || No || Collection includes Mochi's notice about the shutdown, but no other context. The items are all WARCs, and all have CDXs and JSON indexes, but there's no overall index. Index can be easily generated from this 26MB JSON file. --chfoo
|-
| archivebot || Unaudited || || 1070 || Sort of: Viewer || ArchiveBot; The viewer doesn't seem to index into crawls; there's no link from the collection or the items to the viewer (or anywhere else)
|-
| archiveteam_yahooblogs and archiveteam_yahooblog || Audited || db48x || 49 || No || Collection description is just the shutdown notice (and apparently quite a brief one at that) with no other context. Items are all WARCs, and all have CDXs and JSON indexes, but there's no overall index. One item is orphaned in a collection of its own; apparently caused by a typo in the collection name.
|-
| archiveteam-splinder || Unaudited || || 53 || || See Splinder
|-
| archiveteam-picplz || Audited || db48x || 141 || Yes || The collection description is just the shutdown message, with no other context. Items are tarballs containing WARCs. There is an index, but it's not a part of the collection ([1]). There's also a search page for the index, which is great.
|-
| archiveteam_puush || Audited || db48x || 1781 || || The collection description is just the shutdown notice, but it's better than average; it includes some context. The items are all WARCs with CDXs, but there's no central index.
|-
| archiveteam_upcoming || Audited || dashcloud1 || 142 || no || The collection description only describes the site, not the items themselves. Individual items have no description of any kind.
|-
| archiveteam_randomfandom || Audited || dashcloud1 || 42 || yes || Short collection description, but has an index, and every collection item is well described. Index is located right on collection page.
|-
| archiveteam_antecedents || Audited || db48x || 46 || N/A || This collection represents multiple sites, rather than multiple parts of a single large site. The collection description is quite brief, but each item appears to have a paragraph describing what the site is/was, as well as some basic metadata such as keywords. All the items appear to be WARCs with CDXs
|-
| archiveteam_jazzhands || Audited || db48x || 443 || No || This one is a collection of items from multiple sites, but those sites are also broken up into multiple items based on when they were scanned. The items have brief descriptions and some keywords, and are WARCs with CDXs. A good way to improve this would be to make collections for each site as subcollections.
|-
| archiveteam-mobileme-hero || Unaudited || || 4007 || Yes (source) ||
|-
| archiveteam_myopera || Audited || dashcloud1 || 155 || No || Collection page has a nice description of the site, and the items. The items all appear to have WARCs, and have no descriptions/keywords of any kind on them.
|-
| archiveteam_bebo || Unaudited || JesseW || 2867 || || They appear to all be WARCs, most uploaded on the same day; it's not clear if all of them are in the Wayback Machine or not. Each item has no description or context.
|-
| archiveteam_dogster || Audited || jscott || 55 || ??? || Collection well described. Wayback Machine-Ready WARCs, all integrated.
|-
| hyves || Unaudited || || 517 || || Hyves
|-
| archiveteam_wretch || Unaudited || || 2163 || || Wretch; WARCs
|-
| archiveteam_xanga || Unaudited || || 454 || || Xanga; WARCs
|-
| twitterstream || Unaudited || || 41 || || Twitter; According to reviews, at least one file is empty.
|-
| pastebinpastes || Unaudited || || 223 || || These are tarballs (less than 100 MBs, usually), containing each paste in a separate file. Most recently updated on July 1, 2014
|-
| archiveteam_zapd || Unaudited || || 19 || || Zapd; WARCs
|-
| archiveteam_patch || Unaudited || || 38 || || Patch; WARCs
|-
| archiveteam_posterous || Unaudited || || 444 || || Posterous; WARCs
|-
| archiveteam_greader || Unaudited || || 368 || || Google Reader; 3 categories of WARCs: Directory, Stats & general. It would probably be good to also put them in separate collections. There is also a combined stats item.
|-
| archiveteam_ignsites || Unaudited || || 81 || || IGN (needs link to archive); Each item contains a particular subdomain. Descriptive names. (primeblog.ign.com item needs to be added to archiveteam and web collections)
|-
| archiveteam_g4tv_forums || Unaudited || || 74 || || ARCs from wikipedia:G4 (TV channel), mainly from the forum
|-
| archiveteam-yahoovideo || Unaudited || || 156 || || Yahoo! Video; various inconsistency in naming and categories; some items contain zip files, while others contain tar files.
|-
| archive-team-friendster || Unaudited || || 137 || Maybe -> archiveteam-friendster-index item || Friendster; early (2011) project, variety of formats
|-
| archiveteam_formspring || Unaudited || || 1477 || || Formspring; WARCs; some duplication in collection description
|-
| archiveteam_yahoo_messages || Unaudited || || 17 || || Yahoo! Messages; WARCs; Minimal description on collection, none on items
|-
| archiveteam_punchfork || Unaudited || || 47 || Yes || Punchfork; Needs link to index from collection description (and item descriptions); three different types of items, unclear differences
|-
| yahoo_korea_blogs || Unaudited || || 10 || || WARCs; no item descriptions
|-
| archiveteam-cinch || Unaudited || || 20 || No || Cinch.fm; 10 items, in both WARC and tar formats
|-
| archiveteam_dailybooth || Unaudited || || 203 || Yes || DailyBooth; link to index on collection page needs adjusting; images seem to be downloadable; individual items lack descriptions
|-
| archiveteam_weblognl || Unaudited || || 26 || No || Weblog.nl; no English-language description
|-
| stage6 || Unaudited || || 790 || || Videos from wikipedia:Stage6; many seem to be unavailable from IA, due to "issues with the item's content."
|-
| googlegroups-part2 || Unaudited || || 27 || No || Google Groups; each item contains a single tar file (ranging in size from 300 MB to over 40 GB); the tar files contain separate zip files for each group; the zip files contain the actual files. This should probably be grouped with the other grabs of Google Groups.
|-
| archiveteam-btinternet || Unaudited || || 8 || No || WARCs
|-
| archiveteam-qaudio-archive || Unaudited || || 7 || No || Many small WARCs in each item; lengthy explanation in collection description, none in each item
|-
| webshots-freeze-frame || Unaudited || || 2459 || No || Webshots; WARCs
|-
| tabblo-archive || Unaudited || || 1806 || Maybe: groups item || Tabblo; 9 MegaWARCs, the rest of the items are groups of individual accounts as zip files
|-
| archiveteam-fortunecity || Unaudited || Yes || 55 || || FortuneCity; 26 "Set" items (containing a single large tar in each one); also 26 WARC items, and one leftovers item
|-
| 2012-04-30-wikimedia-images-snapshot || Unaudited || Nemo || 148 || Not really || Should become a subcollection of "wikicollections", so that it's next to "wikimediacommons". The "remote" tarballs partially overlap with xowa items nowadays. If a complete mirror of the Your.Org tarballs is desired, we should list it at [2] with some maintenance information. It's not clear whether investing N TB at IA is a priority here, nor whether IA expects WikiTeam to do the uploads instead (in that case, ask Hydriz or Arkiver). Also, the Your.Org dumps are currently blocked on the lack of a rsync server on Wikimedia servers.
|-
| archiveteam-anyhub || Unaudited || || 39 || || AnyHub; 18 each WARC & tar items, and one called the "Blue Collection"
|-
| archiveteam-fileplanet || Unaudited || || 675 || || FilePlanet
|-
| archiveteam-umich-save || Unaudited || || 52 || ||
|-
| archiveteam-geocities || Unaudited || || 12 || || Geocities
|-
| archiveteam-fire || Unaudited || || 7135 || || A vast and misc. collection; needs quite a bit of TLC; (www.asiatorrents.me-subtitle-1-to-38406-20141205 item needs to be added to the archiveteam and web collections)
|-
| archiveteam-mypodcast || Unaudited || || 383 || || Each item is a separate podcast, containing individual sound files, playable through the IA interface; there is also a misc item
|-
| archiveteam-googlegroups || Unaudited || JesseW || 1,348 || Partial (each item has a list of groups, but there's no overall list) || Google Groups; This is divided into items by the initial two letters (or digits or underscore). The item for "th" has an inconsistent title and category.
|-
| isohunt dumps 1 2 3 || Audited || vitzli || 3 || Partial || These are not yet in a dedicated collection, and have never been post-processed. Some of the .torrent files may actually be error pages. This needs work, and proper full auditing. Visit summary page or IsoHunt for more details.
|-
| No Category (streetfiles) || Unaudited || || || ||
|-
| archiveteam_yahoovoices || Unaudited || || 30 || No || Yahoo! Voices; WARCs
|-
| archiveteam_twitchtv || Unaudited || || 2213 || Yes (source) || Twitch.tv
|-
| archiveteam_fotopedia || Unaudited || || 40 || || Fotopedia; WARCs
|-
| archiveteam_canvas || Unaudited || || 47 || || Canv.as; WARCs
|-
| archiveteam_ancestry || Unaudited || || 82 || || Ancestry.com; WARCs
|}
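In the page's wiki markup each row of this table has the form `| collection || status || auditor || item count || index || description`, so progress can be tallied mechanically. The following is a rough sketch under that assumption; the function names `parse_row` and `status_totals` are hypothetical, and the parser ignores anything more elaborate than `||`-separated cells.

```python
from collections import Counter

# Column names follow the table header; the identifiers themselves are made up.
COLUMNS = ["collection", "status", "auditor", "item_count", "has_index", "description"]


def parse_row(line: str) -> dict:
    """Split one '| a || b || ...' wiki table row into the six named columns."""
    cells = [cell.strip() for cell in line.lstrip("|").split("||")]
    cells += [""] * (len(COLUMNS) - len(cells))  # pad rows that omit trailing cells
    row = dict(zip(COLUMNS, cells))
    digits = row["item_count"].replace(",", "")  # counts such as "1,348" use commas
    row["item_count"] = int(digits) if digits.isdigit() else None
    return row


def status_totals(lines) -> Counter:
    """Count rows per status, skipping '|-' separators and other non-row lines."""
    rows = [parse_row(line) for line in lines if line.startswith("| ")]
    return Counter(row["status"] for row in rows)


if __name__ == "__main__":
    sample = [
        "| archiveteam-geocities || Unaudited || || 12 || || [[Geocities]]",
        "|-",
        "| isohunt dumps || Audited || vitzli || 3 || Partial || needs work",
    ]
    print(status_totals(sample))
```

Summing `item_count` per status in the same way would show how many items, rather than how many collections, remain unaudited.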

In progress???

But what happened after? Where are the archives?

Oddities, Mislocations, and To Do

To be moved to better collection

Collections

(The items within them also need to be added to the archiveteam, and web collections.)

WARC

FTP

Misc

Missing

  • Yahoo!_Blog: What happened to the Vietnam archives? Does anyone have a copy or at least a blurry screenshot of the Korean shutdown notice?