Talk:YouTube

youtube2internetarchive

https://github.com/emijrp/youtube2internetarchive contained a script which also handled uploading to the Internet Archive, but I can't find it any longer. --Nemo 06:28, 26 January 2015 (EST)

I've found something with Google. bzc6p (talk) 12:25, 26 January 2015 (EST)

If YouTube needed to be quickly captured for some unforeseen reason, it might make sense to download only the XML and SRT files, so that at least some record would be saved. Google's subtitle recognition is currently far from accurate, but it's certainly improving. wtron 06:48, 12 June 2015 (EST)
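A minimal youtube-dl sketch for such a subtitles-only grab (the URL is a placeholder; the flags are standard youtube-dl options):

# fetch manual and auto-generated subtitles only, preferring SRT, and skip the video itself
youtube-dl --skip-download --all-subs --write-auto-sub --sub-format "srt/best" "https://www.youtube.com/watch?v=VIDEO_ID"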

Options

Is it really necessary to explicitly request the best A/V when youtube-dl selects it by default?

Also, why not embed the subs and thumbnail instead of making separate files? And why not xattrs, for those of us on Unix filesystems? Xattrs is only one extra flag.

My command is currently

youtube-dl -t --embed-subs --add-metadata --xattrs --console-title --embed-thumbnail

although I'm going to incorporate elements from the suggested one into mine. The reasoning behind this is that it's one file to send. That command is how I archive currently, though it's changing.

I'd appreciate hearing your input about why I may be wrong though. Thanks in advance,

--Vxbinaca 21:24, 29 May 2015 (EDT)

On your second note, I strongly believe it's better to have different things (video, thumbnail, subtitle) in separate files. Easier to access, process, categorize, recognize. I think it's worth the "trouble" of having three files (with the same name) instead of one.

bzc6p (talk) 07:08, 31 May 2015 (EDT)

xattrs are not portable and will get lost when copying to a file system that doesn't have them (or when uploading the files somewhere, like to IA) --Darkstar 08:53, 31 May 2015 (EST)

Solid reasoning. I've now switched to your way of doing things. --Vxbinaca 19:32, 2 August 2015 (EDT)
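For reference, a separate-files grab along those lines might look roughly like this (a sketch only, not the wiki's recommended command; the flags are standard youtube-dl options and the URL is a placeholder):

# keep the video, thumbnail, subtitles, description and info JSON as separate files
youtube-dl --write-thumbnail --all-subs --write-auto-sub --write-description --write-info-json "https://www.youtube.com/watch?v=VIDEO_ID"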

--ignore-errors shouldn't be part of the default youtube-dl archiving best practices

There's a myriad of reasons why this isn't a good idea to have by default. Downloads getting cut off during channel rips could go unnoticed (I search for these with ls *.part). Problems with various versions of youtube-dl could lead to a channel rip with half-processed videos; see this issue on GitHub.

Perhaps --ignore-errors is appropriate for a well-tested version that works on YouTube running in a Warrior, but for an attended rip we should suggest by default that people not use it. Instead, they should just make sure everything got ripped; if there's an error, try to resolve that particular video, and if it's a problem they can't get around, then fall back to --ignore-errors.

I'm open to being told why I may be wrong though. --Vxbinaca 19:32, 2 August 2015 (EDT)
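A quick post-rip check along those lines might look like this (a sketch; the directory path is a placeholder):

# after ripping a channel without --ignore-errors, look for leftover partial downloads
find /path/to/channel-rip -name '*.part'

If anything shows up, retry those individual videos before resorting to --ignore-errors.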

On September 1, 2015, The Verge reported on an upcoming paid subscription option (ad-free + premium videos). Although a paid ad-free tier may not be a nail in the coffin, previously free content could become premium. According to the article, the transition could happen within a few months. -- Vitzli 06:46, 2 September 2015 (EDT)

YouTube has never been profitable. bzc6p (talk) 08:46, 2 September 2015 (EDT)

As we don't have 500 petabytes of free storage...

... a solution may be to discard "low-value" videos. I mean, if we discard duplicates (films, music, etc.) and we set a limit of 1 PB (1000 TB) for quality content, which would be the lucky videos that get downloaded and preserved? We can work on an approach like this. Just because we don't have space for everything doesn't mean that we shouldn't download anything. Emijrp 09:33, 20 October 2015 (EDT)

Google will probably give notice long before that date, so we'll have time to find stuff worth saving. We could create a "warroom" or such, and users (ArchiveTeam members and other people) could suggest channels worth saving, with a description, average views and a size estimate, say, in a clear-cut table. If someone suggests videos that are otherwise available or not popular enough, they can be struck out (with proper reasoning). (The reviewing of suggestions can be done by everyone continuously.) A deadline for suggestions could be set (say, 2 months before the end), and after that a Committee could select the "lucky" 1000 TB that would end up in the Archive.

  • In the meantime, the Archive itself would be queried for already saved videos, and those wouldn't be saved again.
  • The Archive and the Team should take other preservation efforts into account, and ours should be in accordance with those (no duplicates).
  • There could be national limits, i.e. not only a global limit but also language- or country-specific ones, say, 500 TB for English videos and 10–50 TB per other country (just ad-hoc numbers, to illustrate the concept).

The importance of the last point, and of saving some of YouTube at all, lies, I think, in the fact that, without too much exaggeration, a substantial part of today's culture is stored and represented there, on the most popular video sharing site on Earth. bzc6p (talk) 15:41, 20 October 2015 (EDT)

We could also save all videos above a specific view count, such as 500 or 1,000, and/or from channels above a subscriber threshold, such as 200. But maybe it's better to first start from e.g. 1,000,000 views and 10,000 subscribers. --ATrescue (talk) 22:29, 13 May 2019 (UTC)
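A view-count threshold like that can be expressed directly with youtube-dl's --match-filter option (a sketch; the cut-off is the ad-hoc number from above and the channel URL is a placeholder):

# only download videos with more than 1,000,000 views from the given channel
youtube-dl --match-filter "view_count > 1000000" "https://www.youtube.com/channel/CHANNEL_ID"

A subscriber threshold would, as far as I know, have to be applied when choosing which channels to feed in, since the per-video metadata youtube-dl exposes doesn't include it.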

Saving YouTube Comments

YouTube comments are a surefire sign of just how awful the internet can be at times. Shouldn't they be archived as well? There's already a script for it, youtube-comment-downloader. --Powerkitten (talk) 16:49, 26 October 2016 (EDT)

Hello, Powerkitten. Thank you for mentioning this.

There is also youtube-comment-scraper-cli, which can do this as well and supports .csv and .json output. This tool logs the comment ID, channel name, like count, post date (in both Unix timestamp and UTC date format), user profile picture and number of replies. An online version is also available here. Systwi (talk) 06:53, 21 July 2019 (UTC)
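For reference, the youtube-comment-downloader script mentioned above is typically invoked along these lines (a sketch from memory; check the project's README, as the exact flag names may have changed between versions):

# dump all comments of one video to a JSON file
python downloader.py --youtubeid VIDEO_ID --output VIDEO_ID_comments.json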


Now, there is some information about comment archival here.

Celebrations

When the great wall of YTimg's robots.txt fell on 20120229 (leap day of 2012), it was a proud moment. Until then, http://ythistory.weebly.com/ and http://ythistory.co.cc/ were the only places where one could access the YouTube SWF (Flash-based) ActionScript 2 and ActionScript 3 players from between circa 2008 (when YouTube's YTimg domain started, likely due to Google's acquisition) and 2012. Additionally, when browsing YouTube via the Wayback Machine, stylesheet information and all images had also been blocked by robots.txt, and that block was suddenly gone.

For a few hours after its removal, there was an error stating “Couldn't load Robots.txt for this page”, “Unable to read Robots.txt”, “Couldn't find robots.txt for this site” or something similar (I don't remember the exact words), because the Wayback Machine was assuming that the robots.txt was just temporarily inaccessible. After those few hours had passed, all the YouTube pages blocked by robots.txt were suddenly visible in their full beauty.

Now, because one of the biggest Wayback dreams I and other people had has apparently come true (read more at Internet_Archive#robots.txt_and_the_Wayback_Machine), I can now load YouTube comments through the Wayback Machine! (Try here) --ATrescue (talk) 22:52, 12 May 2019 (UTC)

@Powerkitten: http://ytcomments.klostermann.ca/ (page title: “YouTube Comment Scraper”; heading on the page: “Download comments from YouTube”) can be used to get all the comments on a video, which can be downloaded as JSON or CSV at http://ytcomments.klostermann.ca/scrape. You can then upload those files to archive.org. --Usernam (talk) 10:57, 20 May 2019 (UTC)
archive.is can also archive replies to comments, provided you have the link to the reply. --Usernam (talk) 00:17, 15 July 2019 (UTC)

Many YouTube video items are archived in low quality because they were archived using an old version of tubeup. Is there a way I can replace the files of old youtube items (using the "youtube-[id]" naming format the tubeup uses) with updated, high quality files? --Hiccup (talk) 17:01, 6 February 2018 (UTC)
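If (and only if) your archive.org account has write access to the item in question, the internetarchive command-line tool can add files to an existing item, and uploading a file under an existing name generally replaces it; a sketch, with the identifier and file name made up for illustration:

# add/replace a file in an existing youtube-<id> item (requires write access to that item; run `ia configure` first)
ia upload youtube-VIDEOID VIDEOID_1080p.mp4

For items you don't control, you'd have to ask the item's owner or the Internet Archive admins instead.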

Thumbnail rescue

Some YouTube uploaders change the thumbnails of their videos, which leads to the loss of the existing thumbnail if it is archived nowhere else.

Some YouTube thumbnails can be retrieved by searching for the video ID with a web search (try different search engines).

To save as many thumbnails as possible from a specific channel, please archive the Videos page of their channel with the help of chromebot's page scrolling.

One can also manually scroll down a page and save it using the web browser, then use the “sed” command to extract thumbnail URLs from the HTML page source code.

(command will be added here.)
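One possible command for that extraction step (a sketch, not necessarily the one the author had in mind; it uses grep -o instead of sed, and the file names are placeholders):

# extract unique i.ytimg.com thumbnail URLs from the saved HTML
grep -oE 'https://i\.ytimg\.com/vi/[^"]+\.jpg' saved_channel_page.html | sort -u > thumbnail_urls.txt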

With the help of simple text replacement, you can rewrite all “mqdefault” and “hqdefault” URLs to “maxresdefault”, although “maxresdefault” is somehow not available for all videos, so it is better to feed both the hqdefault and the maxresdefault URLs into ArchiveBot.
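The replacement itself can be done with sed, for example (a sketch; the file names follow on from the previous example):

# rewrite mqdefault/hqdefault thumbnail URLs to their maxresdefault variant
sed -E 's/(mq|hq)default/maxresdefault/' thumbnail_urls.txt > maxres_urls.txt
# keep both lists, since maxresdefault does not exist for every video
cat thumbnail_urls.txt maxres_urls.txt | sort -u > all_thumbnail_urls.txt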

Then you can upload the URL list to https://transfer.sh/ or https://transfer.notkiska.pw/ and feed it into ArchiveBot using the !ao < file command. --ATrescue (talk) 13:27, 13 May 2019 (UTC)
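The upload step can be done with curl, following transfer.sh's documented usage (a sketch; the file name carries over from the example above):

# upload the URL list; the returned link is what gets handed to ArchiveBot via !ao <
curl --upload-file ./all_thumbnail_urls.txt https://transfer.sh/all_thumbnail_urls.txt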

Mass Deletion in Internet Archive??

There were more than 300,000 videos in https://archive.org/details/archiveteam_youtube . Now there are just 12,000 videos; was there a mass deletion?? --Gridkr (talk) 14:05, 18 July 2019 (UTC)

We are not the Internet Archive. But no, they haven't been deleted, only delisted. JustAnotherArchivist (talk) 16:39, 18 July 2019 (UTC)
They are still accessible via their URLs, just not through search any more. Flashfire42

Saving MORE stuff from YouTube?

On the topic of saving stuff from YouTube, are there any methods available for scraping more content, such as a user's profile picture, channel banners, channel description, channel discussion page and video watermarks? I was thinking that for things such as their profile picture it would be as simple as using wget; however, I don't know if the URL I am grabbing (such as https://yt3.ggpht.com/a/AGF-l7_GPnSqhtxkf5pgDj4jdL3EfgJkG09iAXg3Og=s288-mo-c-c0xffffffff-rj-k-no) is the largest resolution available, especially since the URLs don't seem to follow a specific format.
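As a starting point, the avatar in that example URL can indeed be fetched with wget, and the trailing =s288… segment looks like a size directive for Google's image servers; in my experience changing it (e.g. to =s0) often returns a larger copy, but that is an assumption rather than anything YouTube documents:

# fetch the avatar at the size given in the page source
wget -O avatar_s288.jpg 'https://yt3.ggpht.com/a/AGF-l7_GPnSqhtxkf5pgDj4jdL3EfgJkG09iAXg3Og=s288-mo-c-c0xffffffff-rj-k-no'
# ask the image server for the original size (unverified assumption about the =sNNN parameter)
wget -O avatar_orig.jpg 'https://yt3.ggpht.com/a/AGF-l7_GPnSqhtxkf5pgDj4jdL3EfgJkG09iAXg3Og=s0'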