{| class="wikitable" style="float: right"
! colspan="2" | YouTube
|-
| URL || http://youtube.com
|-
| Status || Online!
|-
| Archiving status || Not saved yet
|-
| Archiving type || Unknown
|-
| IRC channel || #archiveteam-bs (on hackint)
|}
'''YouTube''' is a [[Video hostings|video sharing]] website owned by [[Google]] and currently the most popular video hosting website on the planet.


== Archiving tools ==


Several free FLV downloaders and video-to-URL converters exist on the web.
Archive Team rescue projects usually use [https://github.com/rg3/youtube-dl/ youtube-dl].<br>
YouTube annotations (speech bubbles and notes) are available as XML from:
<pre>
http://www.youtube.com/api/reviews/y/read2?feat=TCS&video_id=
</pre>
To transform this XML to SRT, use [https://github.com/ndurner/AT-tools/tree/master/ann2srt ann2srt].
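
For example, a minimal sketch of the fetch-and-convert step, assuming the endpoint above still responds and that ann2srt accepts the XML file as an argument (the exact invocation is an assumption; check the ann2srt README):
<pre>
# fetch the annotation XML for one video (example ID from the item linked below)
curl -o v9sGhNoSG3o.annotations.xml \
  "http://www.youtube.com/api/reviews/y/read2?feat=TCS&video_id=v9sGhNoSG3o"
# hypothetical invocation; consult the ann2srt README for actual usage
ann2srt v9sGhNoSG3o.annotations.xml > v9sGhNoSG3o.srt
</pre>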


=== (Automatic) tubeup.py - Youtube Video IA Archiver ===


: '''Note:''' When uploading to the Internet Archive, please avoid exposing the site to legal risk: adhere to its terms of service regarding ''blatantly'' copyrighted content. Unfortunately, it is subject to the same threats of DMCA takedowns as YouTube, so do use discretion.


: '''Note:''' Be very careful dumping channels of more than 100 videos with this script. Let an admin know what you're doing, dump 50 videos, and have a collection created. Work has started on adding a flag to specify a collection name instead of the default "Community Video". Always try to create an item. For the time being, the script has to be hand-edited to specify a different collection.


[https://github.com/bibanon/tubeup tubeup.py] is an automated archival script that uses [https://github.com/rg3/youtube-dl/ youtube-dl] to download a Youtube video (or any other provider supported by youtube-dl), and then uploads it with all metadata to the Internet Archive.


This way, all of the video's metadata, such as title, tags, categories, and description, is preserved in the corresponding Internet Archive item without having to be entered manually.
 
It also creates a standardized Internet Archive item name format that makes it easy to find the video using the Youtube ID, and reduces duplication: https://archive.org/details/youtube-v9sGhNoSG3o
 
Youtube-dl [https://github.com/rg3/youtube-dl/blob/master/docs/supportedsites.md also works with many other video sites.]
 
* [https://github.com/bibanon/tubeup Github - tubeup.py]
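
A minimal usage sketch, assuming a checkout of the repository and archive.org credentials already configured through the <tt>ia</tt> client; the tubeup invocation itself is an assumption, so check the README for current usage:
<pre>
git clone https://github.com/bibanon/tubeup
cd tubeup
pip install internetarchive youtube-dl
ia configure                        # one-time archive.org login
youtube-dl --list-extractors        # peek at all the supported sites
python tubeup.py https://www.youtube.com/watch?v=v9sGhNoSG3o   # hypothetical invocation
</pre>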
 
=== (Manual) Recommended way to archive Youtube videos ===
 
First, download the video/playlist/channel/user using youtube-dl:
 
<tt>youtube-dl --title --continue --retries 4 --write-info-json --write-description --write-thumbnail --write-annotations --all-subs --ignore-errors -f bestvideo+bestaudio URL</tt>
 
This can be simplified by running the [https://github.com/matthazinski/youtube2internetarchive script by emijrp and others], which also handles upload.
 
You need a recent (2014) ffmpeg or avconv for the <tt>bestvideo+bestaudio</tt> muxing to work.  On Windows, you also need to run youtube-dl with Python 3.3/3.4 instead of Python 2.7, otherwise non-ASCII filenames will fail to mux.
 
Also, make sure you're using the most recent version of youtube-dl. Previous versions didn't work if the highest quality video+audio was webm+m4a. New versions should automagically merge incompatible formats into a .mkv file.<ref>https://github.com/rg3/youtube-dl/pull/5456</ref>
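
Both prerequisites are quick to check; <tt>youtube-dl -U</tt> is the built-in self-updater (it only works for standalone installs, not package-manager ones):
<pre>
ffmpeg -version | head -n 1   # needs a 2014-or-newer build for bestvideo+bestaudio muxing
youtube-dl --version
youtube-dl -U                 # self-update a standalone youtube-dl install
</pre>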
 
Then, upload it to https://archive.org/upload/. Make sure to upload not only the video itself (.mp4 and/or .mkv files) but also the metadata files created along with it (.info.json, .jpg, .annotations.xml, and .description).
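
Alternatively, here is a sketch of the same upload done with the Internet Archive's <tt>ia</tt> command-line client (<tt>pip install internetarchive</tt>). The identifier follows the <tt>youtube-''ID''</tt> naming convention mentioned above; the file names are made-up examples of what youtube-dl's <tt>--title</tt> template produces:
<pre>
ia configure   # one-time: enter your archive.org credentials
ia upload youtube-v9sGhNoSG3o \
  "Example Title-v9sGhNoSG3o.mkv" \
  "Example Title-v9sGhNoSG3o.info.json" \
  "Example Title-v9sGhNoSG3o.description" \
  "Example Title-v9sGhNoSG3o.annotations.xml" \
  "Example Title-v9sGhNoSG3o.jpg" \
  --metadata="mediatype:movies" --metadata="title:Example Title"
</pre>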
 
==== kyan likes this method ====
 
Youtube sucker. Look out: it leaves some incomplete files in the directory afterward. You can clean up with <tt>rm -v ./*.mp4 ./*.webm</tt>, then <tt>ls | grep \.part$</tt>, pull the video IDs out of that, redownload them, and repeat as needed. You can upload the WARCs only, e.g. using ia (the Python Internet Archive client) or warcdealer (an automated uploader I hacked together), or, if you want, you can upload the other stuff too, but that's kind of wasteful of storage space. In my opinion, getting stuff without a WARC is a great crime, given the ready availability of tools to create WARCs. Note that this method also works for other web sites supported by youtube-dl, although it may need different cleanup commands afterward. It depends on youtube-dl and warcprox running on localhost:8000.
 
<tt>youtube-dl --title --continue --retries 100 --write-info-json --write-description --write-thumbnail --proxy="localhost:8000" --write-annotations --all-subs --no-check-certificate --ignore-errors -k -f bestvideo+bestaudio/best </tt>(stick the video/channel/playlist/whatever URL in here)
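
For reference, a minimal sketch of the proxy side, assuming warcprox's stock flags (names may differ between versions); the <tt>--no-check-certificate</tt> option above is needed because warcprox intercepts HTTPS with its own certificate:
<pre>
pip install warcprox
warcprox -p 8000 -d ./warcs   # listen on localhost:8000, write WARCs to ./warcs
</pre>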


== Site reconnaissance  ==

Little is known about its database, but according to data from 2006, it was 45 TB and doubling every 4 months. At this rate it would have been around 660 petabytes by October 2014.

According to Leo Leung's calculations from available information, tracked in an often-updated Google spreadsheet, YouTube's content reached an estimated 500 petabytes in early 2015.

FYI, all of [[Google Video]] was about 45 TB, and Archive Team's biggest project, [[MobileMe]], was 200 TB. The Internet Archive's total capacity was 50 PB as of August 2014. So let's hope YouTube stays healthy, because Archive Team may have finally met its match.

== Vital signs ==

YouTube will be living off Google for a long time if nothing changes.

== References ==
<references/>

== See also ==

== External links ==