Difference between revisions of "Google Video (Archive)"


Revision as of 15:53, 20 April 2011

Google Video
Google Video logo
URL http://video.google.com
Status Closing on 2011-04-29[1]
Archiving status In progress...
Archiving type Unknown
IRC channel #archiveteam-bs (on hackint)
Google Video results for "Papua New Guinea" keyword.

Google Video is a video sharing website which is shutting down.

If you want to save your own videos, see the announcement and tools below.

If you want to help archive Google Video, get some machines running and join us in IRC (EFNet #archiveteam / #googlegrape)

Joining the archival effort

The automatic scripts only work on FreeBSD, Linux, Solaris, Windows and maybe OS X. They also seem to work fine in Cygwin. Alternatively, you can run *nix in a virtual machine (given you have a fast enough machine).

Anyone can help out, but we would *really* appreciate it if you used a *NIX system rather than Windows. If you nevertheless choose to pursue the Magical World of Windows, please make sure that what you collect is not damaged as a consequence of running on a Windows system.

In any case, the first thing to do is add your name/nickname to this list, along with the storage and bandwidth you have available.

The two main tasks are indexing and downloading. The easiest and least taxing is indexing (see "Indexing Videos To Identify Related Videos" below). If you have some extra bandwidth and space, think about running Listerine to download videos. Both of these tasks are automated and can be left running in the background. It is often good practice to start a few processes of each at once.

Downloading Videos By Keyword

On Linux Systems

  • Download youtube-dl, or install it from your distribution.
    • Make sure it's marked executable: chmod +x youtube-dl
  • Download and install wget for your distribution
  • Download googlegargle (Norc's updated, dupe-safe version of googlegargle is here.)
  • Get aria2 from your distribution (or if you're on Mac OS X, MacPorts or Homebrew) or SourceForge
    • quick-install info is available here
  • Pick a seed list from below, save it under the filename "list" and add your name to the list (you will need a wiki account)
  • Change the first few lines of the googlegargle script to reflect your installation
    • If you're using youtube-dl from your distro, run "which youtube-dl" or "sudo updatedb; locate youtube-dl" to find the location of the command. Change DLSCRIPT to this.
  • For older aria versions, some options need to be removed (--max-connection-per-server=16 --min-split-size=1M)
    • You might need to upgrade aria2 through your system package manager; however, even the most recent packaged version may not suffice.
  • Change the ARIA variable in the script to the location of your aria2 executable (on Ubuntu, usually /usr/bin/aria2c).
    • To find out where aria2 is located, you can use either of these commands:
      • "sudo updatedb; locate aria2"
      • "which aria2" / "which aria2c"
  • Invoke googlegargle
  • Check your OS settings to ensure that your computer will not auto-suspend or sleep after long periods of inactivity.
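
Before editing the DLSCRIPT and ARIA variables, it can help to confirm where the tools actually live. The following is a minimal sketch, not part of googlegargle; the function names find_tool and check_tools are made up for illustration, and it assumes a POSIX shell.

```shell
# Print the full path of a command, or fail if it is not on PATH.
find_tool() {
    command -v "$1" 2>/dev/null
}

# Report each required tool; return non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(find_tool "$tool"); then
            echo "$tool: $path"
        else
            echo "$tool: NOT FOUND" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_tools youtube-dl wget aria2c
```

The paths it prints are the values you would put into DLSCRIPT and ARIA.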

On Windows Systems

On Solaris Systems

The scripts are known to work on OpenIndiana r147. You'll have to install aria2c and youtube-dl from source, but other than that the googlegargle script should work without modifications.

Don't forget to join the IRC channel to coordinate who's getting what!

Downloading Videos Via Related Video Metadata (aka Listerine)

Listerine is an experimental BOINC-style download effort started by underscor. It consists of a server that gets lists of videos to download from the indexing effort and assigns the videos via their ID number to users running a download client.

For more information, see the IRC Channel for Downloaders (#boincgoogle); go here to get started

Setup instructions:

  • Make sure you have wget and curl installed
  • Download youtube-dl and install aria2 as described above
    • Ensure you have youtube-dl, wget, curl and aria2c in your PATH
  • Download the scripts from this GitHub repo
    • If you download them manually (instead of using git), you'll want to do chmod +x listerine googlegargle to make them executable.
  • Edit the listerine script to set your username. The nick you use on IRC would be a good choice.
  • cd into the script directory and run ./listerine
  • Let it run for a minute or so, then check if .flv files are starting to turn up.
  • If the download speed seems low, try running multiple instances of listerine at once. You can use the same username and download dir for them all.
  • To stop the script, create a file named STOP in its directory. (Open a new terminal, and use $ touch STOP)
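
Running several instances at once can be sketched as a small helper. This is an illustration only: start_instances and the instance_N.log file names are hypothetical, and it assumes ./listerine sits in the current directory.

```shell
# Start N copies of a command in the background,
# sending the output of each copy to its own log file.
start_instances() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" > "instance_$i.log" 2>&1 &
        i=$((i + 1))
    done
}

# Usage (from the script directory):
#   start_instances 3 ./listerine
#   touch STOP    # later: ask the instances to stop, as described above
```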


For Windows systems:

This should run on Windows 2000/XP/2003/Vista/7

  • Download windex (you will also need aria2 and python )
  • Extract it to a directory where you want to download the files
  • Make sure that aria2.exe is in the same directory
  • Start windex.bat (Windows Vista/7 users may need to start it from the command prompt), enter your user name.
  • To stop the script, create a file named "stop" in the folder with windex.bat. It will stop after it finishes the current download.

To make it stop automatically after the current download when free disk space gets too low:

  • Download keep_free_space and extract the .exe to the folder with windex.bat.
  • Run it, set the threshold and click "Start"

You will have to run one instance for each instance of windex you are running, even if they are on the same hard drive, if you want them all to stop. Optionally, it can notify you using the Windows Messenger service (net send): make sure that the service is running, enter the hostname of the computer to send the message to, and check the checkbox.
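
On a *NIX machine running listerine, the same idea could be approximated with df and the STOP file. This is a rough sketch and not a port of keep_free_space: the function names are invented, and it assumes the downloads land on the same filesystem as the script directory.

```shell
# Print the free space, in kilobytes, on the filesystem holding a directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed when free space in DIR is below MIN_KB.
low_on_space() {
    [ "$(free_kb "$1")" -lt "$2" ]
}

# Poll once a minute; create the STOP file once space runs low.
watch_space() {
    dir=$1
    min_kb=$2
    until low_on_space "$dir" "$min_kb"; do
        sleep 60
    done
    touch "$dir/STOP"
}

# Usage (stop when less than about 5 GB is free):
#   watch_space . 5000000 &
```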

Indexing Videos To Identify Related Videos

Indexing videos requires very little bandwidth and hard drive space. To discuss things or get help, go to #boincgoogle on EFNet.

On Linux Systems

Note: This will only work on machines with X running. To run it on a headless server, use Xvfb (virtual framebuffer). On Ubuntu/Debian: 'apt-get install xvfb', then use xvfb-run to start your main script. An X server will now be made available to any programs that need it.

  • Get the tools needed to build phantomjs (a headless web browser) and run the script: Qt WebKit, git, and curl. On Debian or Ubuntu Maverick and up, install the packages build-essential, curl, git, libqtwebkit4, libqtwebkit-dev, and libqt4-dev by issuing the command:
sudo apt-get install build-essential curl git libqtwebkit4 libqtwebkit-dev libqt4-dev

On Ubuntu Lucid 10.04: Lucid ships Qt 4.6 rather than the required 4.7, so you may need to add a PPA before trying to install the needed packages.

sudo add-apt-repository ppa:kubuntu-ppa/backports && sudo apt-get update

Additionally, in Lucid, the git package is named git-core, so:

sudo apt-get install git-core

or, on Fedora:

sudo yum install curl git qt-webkit qt-webkit-devel qt-devel
  • Run the following command to get the phantomjs source code:
git clone https://github.com/ariya/phantomjs.git
  • Enter the directory that was just created by using the following command:
cd phantomjs
  • Build phantomjs by issuing the command:
qmake && make -j2
  • Move the phantomjs binary somewhere in your path by issuing the command:
cd bin && sudo mv ./phantomjs /usr/local/bin
  • Extract the above downloaded file (Right-click and Extract To.. or use tar -zxvf ./google_video_related.tar.gz)
  • In a terminal, navigate to the folder where you extracted the google_video_related file (above) and run the following command to help scrape Google Video:
while : ; do ./related.sh ; done
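
For headless servers, the Xvfb note and the loop above can be combined. The sketch below is one possible arrangement, assuming xvfb-run is installed; needs_xvfb and run_indexer are hypothetical names, not part of the indexing scripts.

```shell
# Succeeds when no X display is configured, i.e. Xvfb is needed.
needs_xvfb() {
    [ -z "${DISPLAY:-}" ]
}

# Run the indexing loop, wrapping each pass in xvfb-run on headless machines.
# The -a option lets xvfb-run pick a free display number.
run_indexer() {
    while : ; do
        if needs_xvfb; then
            xvfb-run -a ./related.sh
        else
            ./related.sh
        fi
    done
}

# Usage (from the directory containing related.sh):
#   run_indexer
```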

On Windows Systems

Grab the following archive which comes with full instructions: http://nstrom.chaosnet.org/google_video_related_win.zip

The script will contact the server to get a page to index the related video links, do that indexing, send back the results and repeat! It takes very little processing and bandwidth on your end (a couple of kb/sec, if that).

Saving Individual Videos

The seed files do not currently include all videos, so you might want to save precious videos explicitly. To do that, add IDs (found in the docid URL parameter of the video) to the "list" file in the same directory as the script, for example:

docid=1545969803753962248
docid=1598207563000425446
docid=-1679753730105404298

and start ./googlegargle
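
If you are collecting IDs from video URLs by hand, a small helper can pull out the docid and avoid adding the same entry twice. This is a sketch under assumptions: docid_from_url and add_docid are made-up names, and it relies on grep -o, which GNU and BSD grep both provide.

```shell
# Extract "docid=<number>" from a Google Video URL (IDs may be negative).
docid_from_url() {
    printf '%s\n' "$1" | grep -o 'docid=-\{0,1\}[0-9][0-9]*' | head -n 1
}

# Append a docid line to a list file unless it is already there.
# The list file defaults to "list" in the current directory.
add_docid() {
    entry=$1
    list=${2:-list}
    touch "$list"
    if ! grep -qxF -e "$entry" "$list"; then
        printf '%s\n' "$entry" >> "$list"
    fi
}

# Usage:
#   add_docid "$(docid_from_url 'http://video.google.com/videoplay?docid=1545969803753962248')"
```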

To request a video, add it to this list: http://piratepad.net/gvspecificrequests

If you download something from that list, add its docid to http://piratepad.net/TL7KDN8821 so that others won't download those videos a second time.

Keyword Searches

Linux

If you want to grab videos by your own custom keyword search term, you can use this script.

Alternatively, you can use this command:

SEARCH='my+search+term';for i in `seq 0 10 990 `;do curl -A "AT, Bitches" "http://www.google.com/search?q=$SEARCH+site:video.google.com&hl=en&safe=off&tbm=vid&start=$i&sa=N"|grep -o "docid=[0-9-]*"|sort -u|tee -a seed_videos_$SEARCH;done

Change "my+search+term" to your search term, and remember to use a plus sign instead of spaces (and to url encode the text for other special characters).
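
The one-liner above can also be written as a few small functions, which makes the pieces easier to check. This is a sketch of the same approach rather than a replacement for the linked script; the function names are hypothetical, and only spaces are encoded automatically.

```shell
# Turn a phrase into a query term: spaces become plus signs.
# Other special characters still need manual URL encoding.
encode_term() {
    printf '%s' "$1" | tr ' ' '+'
}

# Build the search URL for one results page (offsets 0, 10, ..., 990).
search_url() {
    printf 'http://www.google.com/search?q=%s+site:video.google.com&hl=en&safe=off&tbm=vid&start=%s&sa=N' "$1" "$2"
}

# Collect docids for a phrase into seed_videos_<term>, one per line.
collect_docids() {
    term=$(encode_term "$1")
    for offset in $(seq 0 10 990); do
        curl -s -A "AT, Bitches" "$(search_url "$term" "$offset")" |
            grep -o "docid=[0-9-]*"
    done | sort -u > "seed_videos_$term"
}

# Usage:
#   collect_docids "carl sagan"
```

Unlike the one-liner, this version sorts the combined output of all pages, so an ID that shows up on several result pages is written only once.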

Mac Bash Command

Uses jot instead of seq:

SEARCH='my+search+term';for i in `jot - 0 990 10 `;do curl -A "AT, Bitches" "http://www.google.com/search?q=$SEARCH+site:video.google.com&hl=en&safe=off&tbm=vid&start=$i&sa=N"|grep -o "docid=[0-9-]*"|sort -u|tee -a seed_videos_$SEARCH;done

Alternatively, you can get seq (and lots of other useful stuff) by installing the MacPorts coreutils package: sudo port install coreutils. Commands are prefixed with a 'g', so seq is called gseq, but you may of course symlink it so you don't have to modify your scripts.

Searches Undertaken

Since we want to minimize overlap, here are some search terms that are already being downloaded, along with the name of the downloader:

  • Darkstar: "rare", "vintage", "commercial"
  • NomDuClavier: "douglas adams", "richard dawkins", "charles darwin", "michio kaku", "brian cox", "vernor vinge", "carl sagan", "simon singh"
  • oli: "australia history"
  • dnova: "microelectronics"
  • Lightblb: "documentary" (medium and long videos), "lecture" (medium and long videos), "atheism" (medium & long), "interview" (long), talk (medium & long), brain (medium & long), civilization (medium & long), evolution (medium & long), future (medium & long), language (medium & long), literature (medium & long), mind (medium & long), money (medium & long), neurolinguistic (medium & long), singularity (medium & long)
  • ttuttle: "astronomy"
  • crackbab1: "ecology"
  • tj__: "army"

Also check the specificrequest PiratePad under Cherry Picking on this page.

Deduplication

To avoid downloading videos that have already been downloaded by others:

  • check if you have SQLite installed ("which sqlite3")
  • download the gv-dedup scripts
  • initialize a fresh database with "./gv-list-create.sh"
  • download all seed lists on this page (plus the cherry picks) and import them with "./gv-list-import.sh seed_file" (or "find seeds/* -exec ./gv-list-import.sh {} \;")
  • invoke "./gv-list-dedup.sh seed_videos_foo > list" to filter already downloaded videos from your custom seed list
  • also import your custom seed file with "./gv-list-import.sh list"

A pre-filled database is available.
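
If SQLite is not available, the core filtering step can be approximated with plain grep. This is not how the gv-dedup scripts work internally (they use a database); it is a minimal sketch with a hypothetical function name, assuming both files hold one "docid=..." entry per line.

```shell
# Print lines of the candidate list that do not appear in the seen list.
# -x matches whole lines, so docid=1 does not mask docid=-1 or docid=12.
# Note: grep exits with status 1 when every candidate was already seen.
filter_new_ids() {
    seen=$1
    candidates=$2
    grep -vxFf "$seen" "$candidates"
}

# Usage:
#   cat seeds/* > all_seen
#   filter_new_ids all_seen seed_videos_foo > list
```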

Seed List Downloads

Custom searches, suggestions

  • PLEASE add your custom searches and their details to this table!
  • Word suggestions: public domain, subtitles
  • Words already in the table or added to the BOINC client: conference, hack, wiki, linux, creative commons, part, interview, documentary, talk, brain, civilization, evolution, future, language, literature, mind, money, neurolinguistic, singularity

Years

1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999

Countries

AFGHANISTAN, ÅLAND+ISLANDS, ALBANIA, ALGERIA, AMERICAN+SAMOA, ANDORRA, ANGOLA, ANGUILLA, ANTARCTICA, ANTIGUA+AND+BARBUDA, ARGENTINA, ARMENIA, ARUBA, AUSTRALIA, AUSTRIA, AZERBAIJAN, BAHAMAS, BAHRAIN, BANGLADESH, BARBADOS, BELARUS, BELGIUM, BELIZE, BENIN, BERMUDA, BHUTAN, BOLIVIA,+PLURINATIONAL+STATE+OF, BONAIRE,+SAINT+EUSTATIUS+AND+SABA, BOSNIA+AND+HERZEGOVINA, BOTSWANA, BOUVET+ISLAND, BRAZIL, BRITISH+INDIAN+OCEAN+TERRITORY, BRUNEI+DARUSSALAM, BULGARIA, BURKINA+FASO, BURUNDI, CAMBODIA, CAMEROON, CANADA, CAPE+VERDE, CAYMAN+ISLANDS, CENTRAL+AFRICAN+REPUBLIC, CHAD, CHILE, CHINA, CHRISTMAS+ISLAND, COCOS+(KEELING)+ISLANDS, COLOMBIA, COMOROS, CONGO, CONGO, COOK+ISLANDS, COSTA+RICA, CÔTE+D'IVOIRE, CROATIA, CUBA, CURAÇAO, CYPRUS, CZECH+REPUBLIC, DENMARK, DJIBOUTI, DOMINICA, DOMINICAN+REPUBLIC, ECUADOR, EGYPT, EL+SALVADOR, EQUATORIAL+GUINEA, ERITREA, ESTONIA, ETHIOPIA, FALKLAND+ISLANDS+(MALVINAS), FAROE+ISLANDS, FIJI, FINLAND, FRANCE, FRENCH+GUIANA, FRENCH+POLYNESIA, FRENCH+SOUTHERN+TERRITORIES, GABON, GAMBIA, GEORGIA, GERMANY, GHANA, GIBRALTAR, GREECE, GREENLAND, GRENADA, GUADELOUPE, GUAM, GUATEMALA, GUERNSEY, GUINEA, GUINEA-BISSAU, GUYANA, HAITI, HEARD+ISLAND+AND+MCDONALD+ISLANDS, HOLY+SEE+(VATICAN+CITY+STATE), HONDURAS, HONG+KONG, HUNGARY, ICELAND, INDIA, INDONESIA, IRAN, IRAQ, IRELAND, ISLE+OF+MAN, ISRAEL, ITALY, JAMAICA, JAPAN, JERSEY, JORDAN, KAZAKHSTAN, KENYA, KIRIBATI, KOREA, KUWAIT, KYRGYZSTAN, LAO, LATVIA, LEBANON, LESOTHO, LIBERIA, LIBYAN+ARAB+JAMAHIRIYA, LIECHTENSTEIN, LITHUANIA, LUXEMBOURG, MACAO, MACEDONIA, MADAGASCAR, MALAWI, MALAYSIA, MALDIVES, MALI, MALTA, MARSHALL+ISLANDS, MARTINIQUE, MAURITANIA, MAURITIUS, MAYOTTE, MEXICO, MICRONESIA,+FEDERATED+STATES+OF, MOLDOVA, MONACO, MONGOLIA, MONTENEGRO, MONTSERRAT, MOROCCO, MOZAMBIQUE, MYANMAR, NAMIBIA, NAURU, NEPAL, NETHERLANDS, NEW+CALEDONIA, NEW+ZEALAND, NICARAGUA, NIGER, NIGERIA, NIUE, NORFOLK+ISLAND, NORTHERN+MARIANA+ISLANDS, NORWAY, OMAN, PAKISTAN, PALAU, PALESTINIAN+TERRITORY,+OCCUPIED, PANAMA, 
PAPUA+NEW+GUINEA, PARAGUAY, PERU, PHILIPPINES, PITCAIRN, POLAND, PORTUGAL, PUERTO+RICO, QATAR, RÉUNION, ROMANIA, RUSSIAN+FEDERATION, RWANDA, SAINT+BARTHÉLEMY, SAINT+HELENA,+ASCENSION+AND+TRISTAN+DA+CUNHA, SAINT+KITTS+AND+NEVIS, SAINT+LUCIA, SAINT+MARTIN+(FRENCH+PART), SAINT+PIERRE+AND+MIQUELON, SAINT+VINCENT+AND+THE+GRENADINES, SAMOA, SAN+MARINO, SAO+TOME+AND+PRINCIPE, SAUDI+ARABIA, SENEGAL, SERBIA, SEYCHELLES, SIERRA+LEONE, SINGAPORE, SINT+MAARTEN+(DUTCH+PART), SLOVAKIA, SLOVENIA, SOLOMON+ISLANDS, SOMALIA, SOUTH+AFRICA, SOUTH+GEORGIA+AND+THE+SOUTH+SANDWICH+ISLANDS, SPAIN, SRI+LANKA, SUDAN, SURINAME, SVALBARD+AND+JAN+MAYEN, SWAZILAND, SWEDEN, SWITZERLAND, SYRIA, TAIWAN, TAJIKISTAN, TANZANIA, THAILAND, TIMOR-LESTE, TOGO, TOKELAU, TONGA, TRINIDAD+AND+TOBAGO, TUNISIA, TURKEY, TURKMENISTAN, TURKS+AND+CAICOS+ISLANDS, TUVALU, UGANDA, UKRAINE, UNITED+ARAB+EMIRATES, UNITED+KINGDOM, UNITED+STATES, UNITED+STATES+MINOR+OUTLYING+ISLANDS, URUGUAY, UZBEKISTAN, VANUATU, VENEZUELA, VIETNAM, VIRGIN+ISLANDS,+BRITISH, VIRGIN+ISLANDS,+U.S., WALLIS+AND+FUTUNA, WESTERN+SAHARA, YEMEN, ZAMBIA, ZIMBABWE

Table

NOTE

Please send any new seedlists to underscor on IRC, rather than embarking on them yourself. He'll add them to the listerine queue.

Legend

     Uploaded to Archive.org
     Done/Complete with no errors
     Done/Complete with errors
     In progress
     Partially claimed and in progress
     Not claimed
     Moved to listerine
     Unknown status (If you know please edit)
Seed list Videos (lines) Downloaders Progress and SIZE
seed_videos_rhistory 6949 Jade Falcon 7 chunks with 1000 videos each
ndurner: aa
Jade Falcon: downloading...
seed_videos_ecology 890 crackbab1
seed_videos_meme 996 yipdw Done (12 GB), bad IDs: -7139586667055487256, 744578668610845478, 9027107881335248661
seed_videos_defcon 822 ndurner done
seed_videos_ml_documentary_dedupe 1975 Lightblb, Papyrus, NomDuClavier 3 completed chunks of 4 (4 claimed)
Lightblb: aa (Complete:38GB With 1 Fail -> Rsyncing)
Papyrus: ab
NomDuClavier: ac (complete), ad (complete)
seed_videos_ml_lecture_dedupe 1898 Lightblb, gribozavr, kn100 3 completed chunks of 4 (4 claimed)
Lightblb: aa ab (Done: 65G 2 Failed)

gribozavr: ad (complete, 28Gb)
kn100: ac (in progress)

seed_videos_ml_atheism_dedupe 698 norc, Mqrius 2 complete of 2
norc: ab done (16G), Mqrius: aa done (41GB).
seed_videos_l_interview_dedupe 986 Pentium100, wgfreewill aa - Done (136GB)

Pentium100: ab - in progress

seed_videos_evolution_dedupe (Long&Medium) 1742 Jade Falcon downloading...
seed_videos_talk_dedupe (Long&Medium) 1795 Jade Falcon downloading...
seed_videos_money_dedupe (Long&Medium) 1824 leftfield
seed_videos_civilization_dedupe (Long&Medium) 471 leftfield
seed_videos_2_a 25,761 swebb 61G, 3718/25761 files done (4/19/2011)
seed_videos_2_k 19,266 (24,242) Lightblb, ARc[Clone, crackbab1, Pentium100, Mqrius, arketype, Darkstar 49 chunks completed of 49

Lightblb: aa ab ac ad ae (In Progress)
crackbab1: af,ak,al (Done: 16GB)
Mqrius: Done: ag - ak, am - ao, aq - as: 81 billion bytes.
(Errors: 8140990496183661566, 3602820803563530100, 305824290212962756, 4662407464242191178, 1966892422853997036, 2337004030985954962, 1338452982534754821, 10726218902867294)
arketype: ap (Done: 17GB)
(Errors: 2781869234442161475, 3684594607388096414)
Pentium100: at-az (complete, 42.8GB), ba-bb (complete, 14.1GB)
Darkstar: bc bd be (complete)
ARc[Clone: bf bg bh bi bj bk bl bm bn bo bp bq br bs bt bu bv bw (all done)

seed_videos_2_l 22,641 ndurner, wgfreewill Split 46 chunks of 500 videos each
ndurner: aa done;
wgfreewill - 530GB, only around 3000 videos left to download of total 22k set.
seed_videos_2_m 24,465 Jade Falcon Jade:Done. 506G, 305 error'ed IDs. Rsyncing.
seed_videos_2_o 25,049 travelinlibrarian Split 51 chunks of 500 videos each

travelinlibrarian 376/1-500
perfinion done seed_videos_2_ob[n-y] perfinion grabbing seed_videos_2_[a-m]

seed_videos_2_p 23,713 oli, Xentac, db48x, otro, Mqrius, Pentium100, Darkstar, ryan__, nstrom 46 complete of 48 chunks (all 48 claimed)

oli: aa to ah (complete, 90GB) - RSYNCING
Mqrius: Done: ak - am: 27 GB
Pentium100: an-av (done, 100GB with errors)
Xentac: bt bu bv bp bo bm bq br bg bn bi as at au (done), bg-br, as-av
db48x: bu (1.44GB, finished), bv (187MB, finished), ba-bf (128GB, done)
otro: bs (2 GB complete with errors -4129568891134205061, -863669053556310192, 1529854584895362082, -1190862519877917483)
nstrom: aw (complete, 15GB, uploaded to a.o)
ryan__: ax(WIP)/ay(WIP)/az(done, 7 missing. verifying/retrying/confirming still)
Darkstar: ai, aj (Complete)

seed_videos_2_q 17,727 DoubleJ Done (165GB) w/2 bad IDs:

-3522777020956111862 1920882098876352864

seed_videos_2_t 25,301 businux Split 51 chunks of 500 videos each 961/25,301 3.79% 33GB

LietKynes going backwards, 50 threads, 310GB already

seed_videos_2_u 23,528 barbich, negge 48 chunks complete of 48

barbich: finished 0 to 29 (100% done, 370G)
negge: finished 30 to 47 (100% done, ~200G)

seed_videos_2_w 21,732 nickmoorman Split, 0 chunks completed of 34 (34 claimed)

nickmoorman: aa ab ac ad ae af ag ah ai aj
zachtib: ak al am an ao
Dr.Sweety: ap aq ar as at au av aw ax ay az ba bb bc bd be bf bg bh (In progress, currently downloading ar)

seed_videos_2_x 19,733 ksh 100% / 78GB

Need to check for errors!
After this is checked, if there are no errors, change to green and remove this line.

seed_videos_2_y 20,965 negge Done (216GB)
seed_videos_2_z 18,877 flare Currently in progress (16%)
seed_videos_a 1000 Dr.Sweety Currently in progress (47%)
seed_videos_a_related This list contains errors Dr.Sweety Done, 44G total. ~1097 out of 1284 seem to be DocIDs, rest is text. Half of the DocIDs are broken (see "Broken DocIDs" for some examples, a complete list is here http://piratepad.net/b8VbxXCVPG). What about the errors, will there be an updated list?
seed_videos_b 999 bjwebb 136/999
seed_videos_c 981 dnova Uploaded to Archive.org (40.2GB)
seed_videos_d 999 nomduclav complete
seed_videos_e 999 nomduclav complete
seed_videos_f 999 DoubleJ Done (25GB)
seed_videos_g 999 dnova Uploaded to Archive.org (30.9GB)
one bad id=7751522177274361392
seed_videos_h 999 ARc[Clone Done
seed_videos_i 999 DeCarabas Done (58 GB)
seed_videos_j 999 joethehuman Done (36.7 GB)
seed_videos_k 999 aggroskater Done (28.7 GB) one bad ID: -4784504756717962046
seed_videos_l 999 yipdw Done (58 GB); six bad IDs: -1165561225258043258, 1922748009661857239, 300163955057959602, -7110898118644169273, -7942619273555709195, 8543705644990106023
seed_videos_m 999 TJ__ Done (34.7GB)
seed_videos_n 999 ndurner Done (38 GB)
seed_videos_o 999 com_lab, grelbar (list) ~38GB (com_lab) already uploaded,
~24GB(grelbar)
seed_videos_p 999 Pneu
seed_videos_q 996 nomduclavier Done (~24Gb)
seed_videos_r 996 Pentium Done (26.5GB), two bad IDs (-6997682955012239023, -5475489738249304784)
seed_videos_s 999 Pentium Done (48.9GB), two bad IDs (2103424227166759427, -8954969329395485241)
seed_videos_t 999 joethehuman Done with errors below (56.8 GB)
seed_videos_u 999 perfinion, 0xDEADBEEF, norc 0xDEADBEEF 516/1000 24GB. norc 500-1000 done, 24GB. Perfinion done, 44GB.
seed_videos_v 999 masterme1 497/999 (~28GB)
seed_videos_w 1000 com_lab Done (~5.7GB)
seed_videos_x 1000 Dark-Star Done (~33GB)
seed_videos_y 1000 beremat 613/1000, (~37.76GB)
seed_videos_z 1000 ksh Done (27GB)
"microelectronics",
"circuit+design",
"microprocessor",
"chiptune",
"electrical+engineering",
"hardware+hacking",
"unboxing",
"demoscene",
1267 dnova Uploaded to Archive.org (33.9GB)
"transistor",
"tonawanda",
"micron",
"gallium",
"nanometer",
"femtosecond",
"qubit",
"integrated+circuit"
343 dnova Uploaded to Archive.org (7.1GB)
"singularity" 174 db48x completed, 12.57GB (list created at 8am UTC April 18th 2011)
"Feynman" 28 db48x completed, 2.20GB (list created at 9am UTC April 18th 2011)
"police" 998 lutostag 402/998 as of 04-19-2011 (list created at 8am UTC April 18th 2011)
"eliezer" 150 (1000) norc completed, 6.8G (list created at 8am UTC April 18th 2011)
"obama" 1000 ryan__ 302/1000 as of 04-19-2011 00:51 EDT (still WIP) (list created at 8am UTC April 18th 2011)
"cia" 999 ndurner 800 (list created at 8am UTC April 18th 2011)
"charlie" 1000 ryan__ 120/1000 as of 04-19-2011 00:51 EDT (still WIP) (list created at 8am UTC April 18th 2011)
IDs from the metafilter thread 28 db48x completed, 6.17GB (list created at 9am UTC April 18th 2011)
IDs from the reddit thread 106 ndurner done (list created at 9am UTC April 18th 2011)
"rare" ~3100 Darkstar done (~70gb)
"vintage"
"commercial"
[http://pastebin.com/ZkzNmwEW "douglas adams",
"richard dawkins",
"charles darwin"
NomDuClavier] 513 videos, done (one de-duped list for the 3 terms)
"australia history"
"indigenous aboriginal australia"
1659 oli complete - RSYNCING
"linux" 1641 xtat Done, 70GB, 8 failures
"Bugs Bunny" 153 stack,wgfreewill Done, 2.7GB
"rodney mullen" 176 com_lab Done, 1.7GB
"tech talks" 946 tahu in progress, 368 videos, 31GB, 2011-04-19 18:47:29 UTC
"rick astley" 17 db48x completed, 272.8MB (grabbed 13:00 UTC April 18th 2011)
"CERN" 912 vled Done
multiple: "michio kaku",
"brian cox",
"vernor vinge",
"carl sagan",
"simon singh"
176 nomduclavier done
"intel",
"amd"
1547 leftfield done 21.5GB one broken docid -712494279917239419
"foia" 89 com_lab Done, 4.1GB
"creative commons" 968 aikidork ~60% rsync'ed (Tue Apr 19 12:19:43 UTC 2011)
"TED" 1000 vled w/ problems
"programming" 1546 Xentac In Progress
"military", "army", "navy", "air force", "marine corps" 3108 tj__ & ksh In Progress (5%)
"fiddle", "banjo", "old time music" 921 RJL20 In Progress (~
"silent+film" 1000 dericed In Progress
"industrial" 1584 Archive242 In Progress
(pretty much) every valid GV link on MetaFilter 1675 RJL20 In Progress (~112730M / Errors: 28), ~50% done
http://hubpages.com/hub/The_Best_of_GoogleVideo 122 Lightblb Done: 7.1GB - 55 Failed - Rsync Done.
a few Olympics 1980 videos 4 gribozavr Completed
"kurzweil" 61 NomDuClavier Completed
Total >324,788 Archive Team >2.24 TB (Apr. 19, 11:37:13 UTC)

Broken DocIDs

DocID Title list
-4313176927520589553 Ferrari 320 km/h SelMcKenzie seed_videos_h
710915802292429594 Triple H-Best Pedigree Ever seed_videos_h
919675995190477263 404s seed_videos_h
-7433458566080701467 404s seed_videos_2_k
7476314005948269525 Tan Tay Du Ky 2 tap 1 phan 2 seed_videos_2_k
1310034078921227326 Presentatie H. van Garderen seed_videos_h
-8196546459051063200 Ethiopia - Ethiopian Talk Show - Dr. Kinfe M Kassaye seed_videos_m
6012309833489564165 I'm gonna miss you forever seed_videos_m
1006201176909432045 Nick "KNUCKLEHEAD" Thomas Learning to Ride A KX 65 seed_videos_2_k_br
9013618753646293166 TooSexii seed_videos_m
4607644763702261746 Most Haunted seed_videos_m
910327017359455024 404s seed_videos_2_k_br
-3505183273546479430 Top 10 Dunkers in Slam Dunk Contest History by www.todonba.mx.kz seed_videos_2_k_bu
515155312540224448 Prof. Stephen Berk - The Six Day War -- (Only downloads 106MB & manual seek fails) seed_videos_m
8233620694803027158 Tien Kiem Ky Hiep 12a seed_videos_2_k_bs
-7026671761719496982 KV Kortrijk - Virton: kans Vervaeke seed_videos_2_k_bo
4744936758707683681 404s seed_videos_2_k_bo
-4138015874145288917 Irvine City Council Regular Meeting -- content too short (expected 880173643 bytes and served 871) seed_videos_2_k_bo
1751753922865083288 Lou Dobbs - Bill Gates Testifies to Senate: Part 2 seed_videos_h
-1847242336625060764 404s seed_videos_h
-840074924615574683 H.O.T. TV EPISODE 7 seed_videos_h
5450039563312738134 seed_videos_2_o
2740779495236816438 seed_videos_2_o
8240553330007645065 404 "rick astley"
2776148046666235174 404 seed_videos_d
4641809537228296381 404 seed_videos_
-4718427583805445551 404 seed_videos_e
5588388288256218328 404 seed_videos_d
-1413491257698089214 Redirects to http://www.khou.com/news/119535529.html seed_videos_a_related
1895753595163256038 Redirects to http://tv.sky.com/martina-my-toughest-opponent seed_videos_a_related
-4941694769105315227 Redirects to http://saratoga-north.ynn.com/content/headlines/524274/governor-visit-s-nation-s-capitol/ seed_videos_a_related
-7773409926173229653 Redirects to http://www.zacks.com/commentary/15486/Value+Stock+Picks-August+24,+2010 seed_videos_a_related
7391058183663855490 Redirects to http://www.ebaumsworld.com/video/watch/81158874/ seed_videos_a_related
-4381742157481868130 Redirects to http://arcade.modemhelp.net/play-3613-Stealing_A_Van.html seed_videos_a_related
-1554641026467581780 Redirects to http://s167.photobucket.com/albums/u158/browneydgurl1212/?action=view&current=meganstealinghashbrown.mp4 seed_videos_a_related
2353616771034791644 Redirects to http://berkshires.ynn.com/content/headlines/523405/glens-falls-woman-accused-of-stealing-a-cat-from-pet-store/ seed_videos_a_related
9195455606734953941 Redirects to http://abcnews.go.com/ThisWeek/video/roundtable-tragedy-tucson-12575675 seed_videos_a_related
9150764031039845836 Redirects to http://www.ebaumsworld.com/video/watch/81298536/ seed_videos_a_related
9111781772616747857 Redirects to http://abcnews.go.com/Politics/video/stephen-colbert-testifies-house-hearing-illegal-farm-workers-11718759 seed_videos_a_related
9106424136068226425 Redirects to http://www.gameswelt.de/videos/videos/10349-Warhammer_Online_-_Home_Movie_Ever_Forward.html seed_videos_a_related
9106312808616607793 Redirects to http://video.google.com/videoplay?docid=9106312808616607793 seed_videos_a_related
-423230311474262633 seed_videos_2_k_at
-1989250447613793254 seed_videos_2_k_at
-1717591024529167847 seed_videos_2_k_au
-1893715945421217990 seed_videos_2_k_aw
98954701061936704 seed_videos_2_k_az
-857514171338089705 871B instead of 9.9MB seed_videos_2_k_az
187959010149993716 seed_videos_2_k_az
-3761310108351243571 seed_videos_2_k_az
-5034671686367848138 Umar Kalim breaks it all content too short seed_videos_2_k_bh
3687153060611498767 Picnic Tables at CiCo content too short seed_videos_2_k_bj
1010610140821179600 seed_videos_2_k_bf
1272139449455901373 seed_videos_2_k_bi
2154847967655726343 seed_videos_2_k_bj
2453599535490760149 seed_videos_2_k_bl
2525371248363122880 seed_videos_2_k_bf
-3761310108351243571 seed_videos_2_k_bh
4549148983829940555 404s seed_videos_2_k_bi
7051814862620931463 seed_videos_2_k_bh
-7353344548521134361 seed_videos_2_k_bl
-817434969229495880 seed_videos_2_k_bh
8335036545639007262 seed_videos_2_k_bh
-8653635503491974486 seed_videos_2_k_bh
-970580050717025709 seed_videos_2_k_bg
-3891054104657374974 seed_videos_2_k_bb
-5401734107040161313 seed_videos_2_k_bb
-6540216432023094075 seed_videos_2_k_bb
-1165561225258043258 L'universo elegante parte 1 seed_videos_l
1922748009661857239 4/8 - L'histoire secrète du pétrole - Le temps des premiers craquements seed_videos_l
300163955057959602 6/8 - L'histoire secrète du pétrole - Le temps des magouilles seed_videos_l
-7110898118644169273 Beppe Grillo e l'inceneritore seed_videos_l
-7942619273555709195 Le monde selon Monsanto - Arte FR seed_videos_l
8543705644990106023 José Bové à Aubagne le 7 Février. seed_videos_l
2781869234442161475 404 seed_videos_2_k_ap
3684594607388096414 404 seed_videos_2_k_ap
4857427355245773332 404 seed_videos_2_wap
4818927167565306511 404 seed_videos_2_wap
-7139586667055487256 Cadru 4 : Une mission du roi Even lui même? meme
744578668610845478 Massieux délire (saut à poil) meme
9027107881335248661 404 meme
712494279917239419 Unavailable - Charlie Rose - Red Wine & Mice / Andy Grove & Richard Tedlow intel amd
-4770095342392663956 Trailer Park Boys - S03E08 - A Sh*t Leopard Can't Change Its Spots seed_videos_t
http://pastebin.com/LhR0vDFu "Content Unavailable" or 404s seed_videos_2_x
-2183089322473530253 EOF army seed list
7899609783711363184 EOF army seed list
-8998613917213332529 EOF army seed list
-4784504756717962046 EOF ; visiting 2007 K-FROG Cares Golf Classic - Part 4: Pat Green Concert shows "video is not currently available" message seed_videos_k
7282734499247419085 Papell Studio Samba Serenade Printed Silk Georgette Pants - Item: 129-160 from listerine
1551984263748100534 ALLAMA TALIB JAUHARI - NASHTAR PARK KARACHI 2006 (PART-III) from listerine
2769128814553569958 Laguna_Beach__-_Season_3_-_Episode_15_-_16.avi from listerine
3368393825136501633 Magic Kingdom Hearts from listerine
-4534051497958455065 Naruto Shippuuden 10 Fuuin Jutsu - Genryuu Kyuu Fuujin from listerine
-2661405767136566167 marché aux animaux à Douz from listerine
-4129568891134205061 浙江化工廠釋放毒瓦斯 居民抗議遭鎮壓 seed_videos_2_p
-863669053556310192 silencio seed_videos_2_p
1529854584895362082 Dédicuce à ma Turtle Que Je Nadloveme !! seed_videos_2_p
-1190862519877917483 Reportaje seed_videos_2_p
777223614374448946 seed_videos_2_pan
-3753237639401264919 seed_videos_2_pan
513998298993769213 seed_videos_2_pan
4197907857130732658 seed_videos_2_pan
-7209518661908939846 seed_videos_2_pan
1936036414289617481 seed_videos_2_pan
1231628683306604703 seed_videos_2_pan
8391426573583714670 seed_videos_2_pao
-5030624673313016595 seed_videos_2_pao
2797125101537296652 seed_videos_2_pao
1231628683306604703 seed_videos_2_pao
765639190728070873 seed_videos_2_pap
3106095225664799618 seed_videos_2_pap
3824729866360231334 seed_videos_2_pap
-1011278591250373536 seed_videos_2_paq
5017038353295770271 seed_videos_2_paq
-2103962498187129713 seed_videos_2_par
-1920063529943044649 seed_videos_2_par
-8842656122683618628 seed_videos_2_par
3980781378957129624 seed_videos_2_par
3168333365786153885 seed_videos_2_par
-850263308777060275 seed_videos_2_par
-2739776417348844007 seed_videos_2_par
-3693490165652585623 seed_videos_2_par
-4421953779802914087 seed_videos_2_par
-4985191518265705146 seed_videos_2_par
-5030272711619967323 seed_videos_2_par
-7480760343548282696 seed_videos_2_par
-8507025902579487785 seed_videos_2_par
-8565673568506246688 seed_videos_2_par
7948280818830462878 seed_videos_2_par
7111518386861929818 seed_videos_2_par
5414116161601449115 seed_videos_2_par
4453387956996456150 seed_videos_2_par
3484019002795418536 seed_videos_2_par
2599414351734791684 seed_videos_2_par
981037964378644131 seed_videos_2_par
503478249453792411 seed_videos_2_par
-626427952319840934 seed_videos_2_pas
6692782035853741408 seed_videos_2_pas
-8104722695725517962 seed_videos_2_pas
6603725717674618753 seed_videos_2_pas
-6885426254291916923 seed_videos_2_pas
8878306115268123242 seed_videos_2_pas
2664598798454107069 seed_videos_2_pas
-1130301863313429407 seed_videos_2_pas
6383722209898652464 seed_videos_2_pas
1410624060530577390 seed_videos_2_pat
1100175904848145330 seed_videos_2_pat
6421364272580349095 seed_videos_2_pat
3243976296567942326 seed_videos_2_pat
2856723628413664723 seed_videos_2_pau
-6684370625181545902 seed_videos_2_pau
-9112039128971736721 seed_videos_2_pau
-5134977928545797502 seed_videos_2_pau
491463814477878191 listerine
8027332670412780967 listerine
-8620028295602605989 listerine

Tools

Youtube-DL

DocID scripts

Scraping by dates uploaded:

Check to see which dates have already been scraped at:

GoogleGargle

Aria2c (APT)

Aria2c (RPM)

Fedora and CentOS have RPMs available.

  • yum install aria2

Searcher

Bash script that searches for terms on Google Video. It includes deduplication and can restrict the search by video length.

predict-download-size

Bash script to read a docid list and find out the total size of the listed videos. Requires youtube-dl, curl.

Subtitles

Some videos have subtitles, which the download script does not fetch (yet). I've written a fairly basic script that retrieves all available subtitles and stores them in the correct folder. You just need Perl and a seed list (saved as "list"). You can also run it in an empty directory if you're afraid it will mess with the videos you have downloaded so far (probably a good idea, as I haven't tested it extensively yet). Once the subtitles have been downloaded, run "rsync -avP $subtitle_directory $video_directory" to transfer the subtitles to the corresponding videos.

You may grab the script at http://piratepad.net/K7wZRrxvoU. Feel free to modify it.

--- For some reason it sometimes saves the file under a different name than the one it prints to the console (tested on Debian 6). -Pentium100 -> This has been corrected; the problem arose whenever the filename contained spaces.

--- Google will return a 503 if it thinks it is being queried by a bot (http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=86640). I have modified the script to pause for 60 seconds after 100 queries; I hope this will suffice. If not, you can tweak either $PAUSE_AFTER or the actual pause duration in the script. Also, the script will now download multiple subtitles for one video (it didn't do that before, sorry!). -Dr.Sweety
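The pause-after-N-queries idea described above can be sketched in shell. This is not the subtitle script itself (that one is written in Perl); the function name `throttle` and the variables `PAUSE_SECONDS` and `query_count` are made up for illustration, and only `PAUSE_AFTER` echoes a name mentioned in the note.

```shell
#!/bin/sh
# Hypothetical throttle: sleep after every PAUSE_AFTER queries.
# The defaults mirror the values quoted in the note (100 queries, 60 seconds).
PAUSE_AFTER=${PAUSE_AFTER:-100}
PAUSE_SECONDS=${PAUSE_SECONDS:-60}
query_count=0

# Call once after each request sent to the server.
throttle() {
    query_count=$((query_count + 1))
    if [ $((query_count % PAUSE_AFTER)) -eq 0 ]; then
        sleep "$PAUSE_SECONDS"
    fi
}
```

Calling `throttle` at the end of each iteration of a download loop keeps the counter in one place, so the interval and the pause length can be tuned without touching the loop.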

Troubleshooting

  • /usr/bin/aria2c: unrecognized option '--max-connection-per-server=16'
    • The aria2 version available in many Linux distributions is out of date and does not recognize this option.
    • To fix this, remove the option from the line in the googlegargle script that starts with "ARIAOPTIONS=".
  • User 'negge' on IRC reports that the following aria2c command-line options work on Debian Squeeze with an ext4 filesystem:
    • --max-overall-download-limit=1024M --file-allocation=falloc --max-connection-per-server=4 --min-split-size=1M --log-level=notice --remote-time=true
  • or for ext3 on Debian Squeeze,
    • --max-overall-download-limit=1024M --file-allocation=prealloc --max-connection-per-server=4 --min-split-size=1M --log-level=notice --remote-time=true

FAQ

  • Is there any estimate on how many videos are on Google Video?
    • Wikipedia said it has 2,500,000 videos; a semi-official Google blog mentioned 2.8M.
  • Is there any way to grab metadata for videos, such as descriptions?
    • Googlegrape does that: it saves the HTML of the video download page.
  • What happens to the data after you claim a seed on the wiki and download it?
    • We've got 140TB of space allocated to us on archive.org, and can get more
  • Is there already some space where it can be uploaded to?
    • Not yet, the effort is still young and things take time to organize.
  • How can I split seed files if I want to download fewer videos or share the task with others?
    • On *nix machines use: split --lines=500 [seedfile] [seedfile] to create a set of files each 500 lines in length in the form seedfileaa seedfileab ... etc.
  • How can I check if there are duplicates in a seed file?
    • On *nix machines use: sort [infile] | uniq -d to show all duplicates.
  • How can I remove duplicates from a seed file before I start to use it?
    • On *nix machines use: sort -u [infile] > [outfile] to produce a new seed file that keeps one copy of each line. (Do not use uniq -u here: it drops every copy of a duplicated line instead of keeping one.)
  • If I wanted to run more than one listerine process, do I just make multiple clones? Do I need a different username for each?
    • Only if you need to be able to tell them apart later on, for example if we say we need video 123 from "xentac3".
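The seed-file recipes in this FAQ can be wrapped in a few small shell functions. This is a minimal sketch, not part of any Archive Team script; the function names (`seed_dupes`, `seed_dedupe`, `seed_split`) are made up for illustration, and a seed file is assumed to hold one docid per line. The dedupe helper uses `sort -u`, which keeps one copy of each repeated line (by contrast, `uniq -u` would discard repeated lines entirely).

```shell
#!/bin/sh
# Hypothetical helpers for seed files (assumed format: one docid per line).

# Print each line that occurs more than once (each duplicate is listed once).
seed_dupes() {
    sort "$1" | uniq -d
}

# Write a deduplicated copy of $1 to $2, keeping one copy of each line.
# The output is sorted, so the original order is not preserved.
seed_dedupe() {
    sort -u "$1" > "$2"
}

# Split seed file $1 into chunks of $2 lines named <seedfile>aa, <seedfile>ab, ...
seed_split() {
    split -l "$2" "$1" "$1"
}
```

For example, `seed_dedupe seedfile seedfile.clean` followed by `seed_split seedfile.clean 500` yields chunks of 500 unique docids that can be claimed separately.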

Announcement: Uploaded video content no longer available

On April 29, 2011 videos that have been uploaded to Google Video will no longer be available for playback. We’ve added a Download button to the Video Status page, so you can download videos that you want to save. If you don’t want to download your videos, you don’t need to do anything. (The Download feature will be disabled after May 13, 2011.)

How do I download videos that I've uploaded?

On the Video Status page, click Download Video located on the right side of each of your videos in the "Actions" column. Once a video has been downloaded, an "Already Downloaded" message will appear. If you have many videos on Google Video, you may need to use the paging controls located on the bottom right of the page to access them all. This download option will be available through May 13, 2011.

I've downloaded my videos. Now what do I do with these FLV files?

FLV files are videos that have been encoded in the Flash Video Format. You can upload your videos in FLV format to other video hosting sites like YouTube or Picasa Web Albums. If you would like to play back your videos on your computer and they don’t seem to be working, you might need to install an FLV player. In order to find an FLV player to install, try doing a Google search for [ FLV player ].

External links