Difference between revisions of "Google Video (Archive)"

From Archiveteam
Don't forget to join the IRC channel to coordinate who's getting what!


== Downloading Videos Via Related Video Metadata (aka Listerine) ==
Listerine is an experimental [http://en.wikipedia.org/wiki/BOINC BOINC-style] download effort started by underscor.  It consists of a server that gets lists of videos to download from the indexing effort and assigns the videos via their ID number to users running a download client.
For more information, see the [irc://irc.efnet.org/boincgoogle IRC Channel for Downloaders (#boincgoogle); go here to get started]
Setup instructions:
* Make sure you have wget and curl installed
* Download youtube-dl and install aria2 as described above
** Ensure you have <tt>youtube-dl</tt>, <tt>wget</tt>, <tt>curl</tt> and <tt>aria2c</tt> in your <tt>PATH</tt>
* Download the scripts by cloning the git repository: <pre>git clone git://github.com/norcnorc/googlegargle.git</pre>If you download them [https://github.com/norcnorc/googlegargle/tarball/master manually] (instead of using git), you'll want to do <tt>chmod +x listerine googlegargle</tt> to make them executable.
* Edit the <tt>listerine</tt> script to set your username. The nick you use on IRC would be a good choice.
* <tt>cd</tt> into the script directory and run <tt>./listerine</tt>
* Let it run for a minute or so, then check if <tt>.flv</tt> files are starting to turn up.
* If the download speed seems low, try running multiple instances of listerine at once. You can use the same username and download dir for them all.
* To stop the script, create a file named <tt>STOP</tt> in its directory. (Open a new terminal, and use <tt>$ touch STOP</tt>)
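The advice above about running several instances at once can be sketched as a small shell helper. This is only an illustration: <tt>run_instances</tt> is a hypothetical name, not part of the googlegargle repository, and whether one <tt>STOP</tt> file ends every instance depends on how <tt>listerine</tt> handles that file, which this sketch does not assume.

```shell
# Start COUNT copies of a command in the background, then wait for all
# of them to finish. run_instances is a hypothetical helper name.
run_instances() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &              # launch one background instance
        i=$((i + 1))
    done
    wait                    # block until every instance has exited
}

# Usage, from the script directory:
#   run_instances 3 ./listerine
```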
'''For Windows systems:'''
This should run on Windows 2000/XP/2003/Vista/7
* Download [http://www.pentium100.com/windex.zip windex] (you will also need [http://sourceforge.net/projects/aria2/files/stable/aria2-1.11.1/aria2-1.11.1-mingw32msvc-build1.zip/download aria2] and [http://www.python.org/download/releases/2.7.1/ python] )
* Extract it to a directory where you want to download the files
* Make sure that aria2.exe is in the same directory
* Start windex.bat (Windows Vista/7 users may need to start it from the command prompt) and enter your user name.
* To stop the script, create a file named "stop" in the folder with windex.bat. It will stop after it finishes the current download.
To make it automatically stop after downloading the video when free disk space gets too low:
* Download [http://www.pentium100.com/keep_free_space.zip keep_free_space] and extract the .exe to the folder with windex.bat.
* Run it, set the threshold and click "Start"
You will have to run one instance for each instance of windex you are running, even if they are on the same hard drive, if you want them all to stop.
Optionally, it can notify you through the Windows Messenger service (net send): make sure that the service is running, enter the hostname of the computer that should receive the message, and check the checkbox.
== Indexing Videos To Identify Related Videos ==
Indexing videos requires very little bandwidth and hard drive space. To discuss things or get help, go to #boincgoogle on EFNet.
=== On Linux Systems ===
'''Note''': This will only work on machines with X running. To run it on a headless server, use Xvfb (virtual framebuffer). On Ubuntu/Debian: 'apt-get install xvfb', then use xvfb-run to start your main script. An X server will now be made available to any programs that need it.
* Get the tools needed to build phantomjs (a headless web browser) and run the script: Qt WebKit, git, and curl. On Debian or Ubuntu Maverick and up, install the packages '''build-essential''', '''curl''', '''git''', '''libqtwebkit4''', '''libqtwebkit-dev''', and '''libqt4-dev''' by issuing the command:
<pre><nowiki>sudo apt-get install build-essential curl git libqtwebkit4 libqtwebkit-dev libqt4-dev</nowiki></pre>
On Ubuntu Lucid 10.04: Since Lucid comes with Qt 4.6, not the required 4.7, you may need to add a PPA before trying to install the needed packages.
<pre><nowiki>sudo add-apt-repository ppa:kubuntu-ppa/backports && sudo apt-get update</nowiki></pre>
Additionally, in Lucid, the git package is named git-core, so:
<pre><nowiki>sudo apt-get install git-core</nowiki></pre>
On Fedora, the corresponding install command is:
<pre><nowiki>sudo yum install curl git qt-webkit qt-webkit-devel qt-devel</nowiki></pre>
* Run the following command to get the phantomjs source code:
<pre><nowiki>git clone https://github.com/ariya/phantomjs.git</nowiki></pre>
* Enter the directory that was just created by using the following command:
<pre><nowiki>cd phantomjs</nowiki></pre>
* Build phantomjs by issuing the command:
<pre><nowiki>qmake && make -j2</nowiki></pre>
* Move the phantomjs binary somewhere in your path by issuing the command:
<pre><nowiki>cd bin && sudo mv ./phantomjs /usr/local/bin</nowiki></pre>
* Create a folder called '''gvscript''' and download the script to get the list of Google Video related pages to scrape: http://199.48.254.90/at/google_video_related.tar.gz
* Extract the above downloaded file (Right-click and Extract To.. or use '''tar -zxvf ./google_video_related.tar.gz''')
* In a terminal, navigate to the folder where you extracted the google_video_related file (above) and run the following command to help scrape Google Video:
<pre><nowiki>while : ; do ./related.sh ; done</nowiki></pre>
=== On Windows Systems ===
Grab the following archive which comes with full instructions:
http://nstrom.chaosnet.org/google_video_related_win.zip
The script will contact the server to get a page to index the related video links, do that indexing, send back the results and repeat! It takes very little processing and bandwidth on your end (a couple of kb/sec, if that).
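On Linux (or in Cygwin), the endless <tt>while</tt> loop that restarts <tt>related.sh</tt> can only be ended with Ctrl-C. A minimal sketch of a loop that can be stopped cleanly follows. The <tt>STOP</tt> file convention is an assumption borrowed from listerine (<tt>related.sh</tt> itself does not look for it), and <tt>loop_until_stop</tt> is a hypothetical helper name.

```shell
# Run a command over and over until a file named STOP appears in the
# current directory. The check happens between runs, so the current
# run always finishes first.
loop_until_stop() {
    while [ ! -e STOP ]; do
        "$@" || sleep 5     # short pause after a failed run
    done
}

# Usage on a desktop:         loop_until_stop ./related.sh
# Usage on a headless server: loop_until_stop xvfb-run ./related.sh
```

To end the loop, run <tt>touch STOP</tt> in the same directory from another terminal, and delete the file before starting again.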
== Saving Individual Videos ==
The seed files do not currently include all videos, so you might want to save precious videos explicitly. To do that, add their IDs (found in the docid parameter of the video's URL) to the "list" file in the same directory as the script, for example:
docid=1545969803753962248
docid=1598207563000425446
docid=-1679753730105404298
and start ./googlegargle
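If you have collected full video URLs rather than bare IDs, the <tt>docid=...</tt> lines can be extracted mechanically. The sketch below assumes one or more URLs per line on standard input; <tt>urls_to_docids</tt> and <tt>my_urls.txt</tt> are hypothetical names used only for illustration.

```shell
# Read text containing Google Video URLs on stdin and print each unique
# "docid=..." token on its own line, the format used in the "list" file.
# IDs may be negative, hence the optional minus sign.
urls_to_docids() {
    grep -oE 'docid=-?[0-9]+' | sort -u
}

# Usage:
#   urls_to_docids < my_urls.txt >> list
```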
To request a video, add it to this list: http://piratepad.net/gvspecificrequests
If you download something from that list, add its docid to http://piratepad.net/TL7KDN8821 so that others won't download those videos a second time.
==Keyword Searches==
===Linux===
If you want to grab videos by your own custom keyword search term, you can use [https://github.com/norcnorc/googlegargle/blob/master/searcher.sh this script].
Alternatively, you can use this command:
<pre><nowiki>
SEARCH='my+search+term';for i in `seq 0 10 990 `;do curl -A "AT, Bitches" "http://www.google.com/search?q=$SEARCH+site:video.google.com&hl=en&safe=off&tbm=vid&start=$i&sa=N"|grep -o "docid=[0-9-]*"|sort -u|tee -a seed_videos_$SEARCH;done
</nowiki></pre>
Change "my+search+term" to your search term, and remember to use a plus sign instead of each space (and to URL-encode any other special characters).
===Mac Bash Command===
Uses jot instead of seq:
<pre><nowiki>
SEARCH='my+search+term';for i in `jot - 0 990 10 `;do curl -A "AT, Bitches" "http://www.google.com/search?q=$SEARCH+site:video.google.com&hl=en&safe=off&tbm=vid&start=$i&sa=N"|grep -o "docid=[0-9-]*"|sort -u|tee -a seed_videos_$SEARCH;done
</nowiki></pre>
Alternatively, you can get <tt>seq</tt> (and lots of other useful stuff) by installing the MacPorts coreutils package: <tt>sudo port install coreutils</tt>. Commands are prefixed with a 'g', so <tt>seq</tt> is called <tt>gseq</tt>, but you may of course symlink it so you don't have to modify your scripts.
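The two one-liners above differ only in how they generate the page offsets. A sketch that avoids both <tt>seq</tt> and <tt>jot</tt> by counting in plain shell arithmetic is shown below; the function names <tt>page_offsets</tt> and <tt>search_docids</tt> are hypothetical, the <tt>curl -s</tt> flag is added only to hide the progress meter, and the search URL is copied unchanged from the commands above.

```shell
# Print the result-page offsets 0, 10, ..., 990, one per line.
page_offsets() {
    start=0
    while [ "$start" -le 990 ]; do
        echo "$start"
        start=$((start + 10))
    done
}

# Fetch every result page for one URL-encoded search term and append
# the docids found on each page to seed_videos_<term>.
search_docids() {
    term=$1
    for start in $(page_offsets); do
        curl -s -A "AT, Bitches" \
            "http://www.google.com/search?q=${term}+site:video.google.com&hl=en&safe=off&tbm=vid&start=${start}&sa=N" |
            grep -o "docid=[0-9-]*" | sort -u | tee -a "seed_videos_${term}"
    done
}

# Usage:
#   search_docids 'my+search+term'
```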


= FAQ =

Revision as of 17:13, 20 April 2011

Google Video
Google Video logo
URL http://video.google.com
Status Closing on 2011-04-29[1]
Archiving status In progress...
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)
Google Video results for "Papua New Guinea" keyword.

Google Video is a video sharing website which is shutting down.

If you want to save your own videos, see the announcement and tools below.

If you want to help archive Google Video, get some machines running and join us in IRC (EFNet #archiveteam / #googlegrape)

Joining the archival effort

The automatic scripts only work on FreeBSD, Linux, Solaris, Windows and maybe OS X. They also seem to work fine in Cygwin. Alternatively, you can run *nix in a virtual machine (given you have a fast enough machine).

Anyone can help out, but we would *really* appreciate it if you'd use a *NIX system rather than Windows. If you nevertheless choose to pursue the Magical World of Windows, please make sure that what you are collecting is not damaged as a consequence of running it on a Windows system.

In any case, please start by adding your name/nickname to this list, along with the storage and bandwidth you have available.

What can I do?

The two main tasks are indexing and downloading. The easiest and least taxing is indexing (see #Indexing Videos To Identify Related Videos). If you have some extra bandwidth and space, think about running Listerine to download videos. Both of these tasks are automated and can be left running in the background. It is often good practice to start a few processes of each at once.

Downloading Videos By Keyword

On Linux Systems

  • Download youtube-dl, or install it from your distribution.
    • Make sure it's marked executable: chmod +x youtube-dl
  • Download and install wget for your distribution
  • Download the scripts by cloning the git repository:
    git clone git://github.com/norcnorc/googlegargle.git
    If you download them manually (instead of using git), you'll want to do chmod +x googlegargle to make it executable.
  • Get aria2 from your distribution (or if you're on Mac OS X, MacPorts or Homebrew) or SourceForge
    • quick-install info is available here
  • Pick a seed list from below, save it under the filename "list" and add your name to the list (you will need a wiki account)
  • For older aria2 versions, some options need to be removed (--max-connection-per-server=16 --min-split-size=1M)
    • You might need to upgrade aria2 through your system package manager; however, even the most recent packaged version may not suffice.
  • Change the ARIA variable in the script to the location of your aria2 executable. On Ubuntu this is usually /usr/bin/aria2c.
    • To know where aria2 is located you can use either of these commands:
      • "sudo updatedb; locate aria2"
      • "which aria2" / "which aria2c"
  • Invoke googlegargle
  • Check your OS settings to ensure that your computer will not automatically suspend or sleep after long periods of inactivity.
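
As an alternative to running "which" by hand, the value for the ARIA variable can be looked up with the shell's own command -v. This is a sketch only: first_in_path is a hypothetical helper, and the two candidate names are the ones mentioned in the steps above.

```shell
# Print the location of the first of the given command names that is
# found in PATH; return 1 if none of them is installed.
first_in_path() {
    for name in "$@"; do
        if found=$(command -v "$name" 2>/dev/null); then
            echo "$found"
            return 0
        fi
    done
    return 1
}

# Usage:
#   ARIA=$(first_in_path aria2c aria2) || echo "aria2 not found" >&2
```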

On Windows Systems

On Solaris Systems

The scripts are known to work on OpenIndiana r147. You'll have to install aria2c and youtube-dl from source, but other than that the googlegargle script should work without modifications.

Don't forget to join the IRC channel to coordinate who's getting what!


FAQ

  • Is there any estimate on how many videos are on Google Video?
    • Wikipedia said it has 2,500,000 videos; a semi-official Google blog mentioned 2.8M
  • Is there anything about grabbing metadata for vids? like descriptions?
    • Googlegrape does that: it saves the HTML of the video download page
  • What happens to the data after you claim a seed on the wiki and download it?
    • We've got 140TB of space allocated to us on archive.org, and can get more
  • Is there already some space where it can be uploaded to?
    • Not yet, the effort is still young and things take time to organize.
  • How can I split seed files if I want to download fewer videos or share the task with others?
    • On *nix machines use: split --lines=500 [seedfile] [seedfile] to create a set of files each 500 lines in length in the form seedfileaa seedfileab ... etc.
  • How can I check if there are duplicates in a seed file?
    • On *nix machines use: sort [infile] | uniq -d to show all duplicates.
  • How can I remove duplicates from a seed file before I start to use it?
    • On *nix machines use: sort -u [infile] > [outfile] to produce a new seed file with duplicates removed. (Do not use uniq -u for this: it drops every copy of a duplicated line instead of keeping one.)
  • If I wanted to run more than one listerine process, do I just make multiple clones? Do I need a different username for each?
    • A different username is only needed if you want to be able to tell the instances apart later on, for example if we say we need video 123 from "xentac3"
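
The seed-file questions above can be tried out with two small shell functions. This is a sketch under stated assumptions: dedupe_seed and split_seed are hypothetical names, and sort -u is used for de-duplication because it keeps exactly one copy of each line.

```shell
# Print the seed file with duplicate lines collapsed to a single copy.
dedupe_seed() {
    sort -u "$1"
}

# Split a seed file into 500-line pieces named <seedfile>aa,
# <seedfile>ab, ... in the current directory.
split_seed() {
    split -l 500 "$1" "$1"
}

# Usage:
#   dedupe_seed seedfile > seedfile.clean
#   split_seed seedfile.clean
```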

Announcement: Uploaded video content no longer available

On April 29, 2011 videos that have been uploaded to Google Video will no longer be available for playback. We’ve added a Download button to the Video Status page, so you can download videos that you want to save. If you don’t want to download your videos, you don’t need to do anything. (The Download feature will be disabled after May 13, 2011.)

How do I download videos that I've uploaded?

On the Video Status page, click Download Video located on the right side of each of your videos in the "Actions" column. Once a video has been downloaded, an "Already Downloaded" message will appear. If you have many videos on Google Video, you may need to use the paging controls located on the bottom right of the page to access them all. This download option will be available through May 13, 2011.

I've downloaded my videos. Now what do I do with these FLV files?

FLV files are videos that have been encoded in the Flash Video Format. You can upload your videos in FLV format to other video hosting sites like YouTube or Picasa Web Albums. If you would like to play back your videos on your computer and they don’t seem to be working, you might need to install an FLV player. In order to find an FLV player to install, try doing a Google search for [ FLV player ].

External links