Difference between revisions of "Google Video (Archive)"

Revision as of 22:18, 18 April 2011

Google Video

URL	http://video.google.com
Status	Closing on 2011-04-29[1]
Archiving status	In progress...
Archiving type	Unknown
IRC channel	#archiveteam-bs (on hackint)

Google Video is a video sharing website which is shutting down.

If you want to save your own videos, see the announcement and tools below.

If you want to help archive Google Video, get some machines running and join us in IRC (EFNet #archiveteam / #googlegrape)

Joining the archival effort

The automatic scripts only work on FreeBSD, Linux, Windows and possibly OS X. They also seem to work fine in Cygwin. Alternatively, you can run *nix in a virtual machine, provided your machine is fast enough.

To help scrape videos

First of all, please add your name/nickname to this list, along with the storage and bandwidth you have available.

On Linux Systems

Download youtube-dl, or install it from your distribution.
- Make sure it's marked executable: chmod +x youtube-dl
Download and install wget for your distribution
Download googlegargle (Norc's updated, dupe-safe version of googlegargle is here.)
Get aria2 from your distribution (or if you're on Mac OS X, MacPorts) or SourceForge
Pick a seed list from below, save it under the filename "list" and add your name to the list (you will need a wiki account)
Change the first few lines of the googlegargle script to reflect your installation
- If you're using youtube-dl from your distro, run "which youtube-dl" or "sudo updatedb; locate youtube-dl" to find the location of the command. Change DLSCRIPT to this.
For older aria versions, some options need to be removed (--max-connection-per-server=16 --min-split-size=1M)
- You might need to upgrade your version from your system package manager, however the most recent version still may not suffice.
Change the ARIA variable in the script to the location of your aria2c executable (on Ubuntu this is usually /usr/bin/aria2c).
- To know where aria2 is located you can use either of these commands:
  - "sudo updatedb; locate aria2"
  - "which aria2" / "which aria2c"
Invoke googlegargle
Check your OS settings to ensure that your computer will not auto-suspend or sleep after long periods of inactivity.
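
The path lookups in the steps above can be sketched as a small shell helper. This is only an illustration: find_tool is a hypothetical helper name and is not part of googlegargle; ARIA and DLSCRIPT are the variable names mentioned in the steps.

```shell
#!/bin/sh
# Print the full path of the first of the given command names found on PATH.
# find_tool is a hypothetical helper, not part of googlegargle.
find_tool() {
    for tool_name in "$@"; do
        if tool_path=$(command -v "$tool_name" 2>/dev/null); then
            printf '%s\n' "$tool_path"
            return 0
        fi
    done
    return 1
}

# Values to copy into the settings at the top of the googlegargle script.
ARIA=$(find_tool aria2c aria2) || echo "aria2 not found on PATH" >&2
DLSCRIPT=$(find_tool youtube-dl) || echo "youtube-dl not found on PATH" >&2
echo "ARIA=$ARIA"
echo "DLSCRIPT=$DLSCRIPT"
```

If either value comes out empty, install the missing tool first and run the lookup again.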

On Windows Systems

Download the scraping script for Windows (you still need python and aria2, which can be downloaded separately - instructions in archive). Script location: http://www.pentium100.com/gg_windows.zip

Don't forget to join the IRC channel to coordinate who's getting what!

To help index videos (low bandwidth/storage)

On Linux Systems

Note: This will only work on machines with X running. To run it on a headless server, use Xvfb (virtual framebuffer). On Ubuntu/Debian: 'apt-get install xvfb', then use xvfb-run to start your main script. An X server will now be made available to any programs that need it.

Get the tools needed to build phantomjs (a headless web browser) and run the script: Qt WebKit, git, and curl. On Debian or Ubuntu Maverick and up, install the packages build-essential, curl, git, libqtwebkit4, libqtwebkit-dev, and libqt4-dev by issuing the command:

sudo apt-get install build-essential curl git libqtwebkit4 libqtwebkit-dev libqt4-dev

On Ubuntu Lucid 10.04: Since Lucid comes with Qt 4.6, not the required 4.7, you may need to add a PPA before trying to install the needed packages.

sudo add-apt-repository ppa:kubuntu-ppa/backports && sudo apt-get update

Additionally, in Lucid, the git package is named git-core, so:

sudo apt-get install git-core

or, on Fedora:

sudo yum install curl git qt-webkit qt-webkit-devel qt-devel

Run the following command to get the phantomjs source code:

git clone https://github.com/ariya/phantomjs.git

Enter the directory that was just created by using the following command:

cd phantomjs

Build phantomjs by issuing the command:

qmake && make -j2

Move the phantomjs binary somewhere in your path by issuing the command:

cd bin && sudo mv ./phantomjs /usr/local/bin

Create a folder called gvscript and download the script to get the list of Google Video related pages to scrape: http://199.48.254.90/at/google_video_related.tar.gz

Extract the above downloaded file (Right-click and Extract To.. or use tar -zxvf ./google_video_related.tar.gz)

In a terminal, navigate to the folder where you extracted the google_video_related file (above) and run the following command to help scrape Google Video:

while : ; do ./related.sh ; done
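
On a headless machine the same loop has to run under Xvfb, as the note at the top of this section says. A minimal sketch, assuming xvfb-run is installed; runner_prefix is a hypothetical helper name, and the -a flag asks xvfb-run to pick a free display number automatically.

```shell
#!/bin/sh
# Print the wrapper needed to give a program an X server:
# nothing when DISPLAY is already set, "xvfb-run -a" otherwise.
# runner_prefix is a hypothetical helper name.
runner_prefix() {
    if [ -n "${DISPLAY:-}" ]; then
        printf '%s\n' ""
    else
        printf '%s\n' "xvfb-run -a"
    fi
}

# Usage (left commented out because it loops forever):
# while : ; do $(runner_prefix) ./related.sh ; done
```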

On Windows Systems

Grab the following archive which comes with full instructions: http://nstrom.chaosnet.org/google_video_related_win.zip

Once the script is running, simply leave it running, and head over to #ggtesting on EFnet (IRC) if you need assistance or the script has any issues. The script contacts the server to get a page to index, extracts the related video links, sends back the results, and repeats. It takes very little processing and bandwidth on your end (a couple of kb/sec, if that).

Cherry picking

The seed files do not currently include all videos, so you might want to save precious videos explicitly. To do that, add IDs (the docid URL parameter of the Google Video) to the "list" file in the same directory as the script, for example:

docid=1545969803753962248
docid=1598207563000425446
docid=-1679753730105404298

and start ./googlegargle
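
Copying docids out of URLs by hand is error-prone, so here is a hedged sketch of doing it in the shell. The helper names docid_from_url and add_to_list are made up for this example, and the videoplay URL is only an assumed example of a Google Video link; the grep pattern is the same one used in the keyword search command on this page.

```shell
#!/bin/sh
# Extract the "docid=..." parameter from a Google Video URL.
# Same pattern as the keyword-search one-liner on this page.
docid_from_url() {
    printf '%s\n' "$1" | grep -o 'docid=[0-9-]*' | head -n 1
}

# Append the docid of a URL to the list file unless it is already there.
# $1 = URL, $2 = list file (defaults to "list")
add_to_list() {
    entry=$(docid_from_url "$1") || return 1
    [ -n "$entry" ] || return 1
    touch "${2:-list}"
    grep -qxF -e "$entry" "${2:-list}" || printf '%s\n' "$entry" >> "${2:-list}"
}

# Example:
# add_to_list 'http://video.google.com/videoplay?docid=1545969803753962248'
```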

To request a cherrypick, add it to this list: http://piratepad.net/gvspecificrequests

If you download something from that list, add its docid to http://piratepad.net/TL7KDN8821 so that others won't download those videos a second time.

Custom keyword searches

Linux Bash Command

If you want to grab videos by your own custom keyword search term, you can use this command:

SEARCH='my+search+term';for i in `seq 0 10 990 `;do curl -A "AT, Bitches" "http://www.google.com/search?q=$SEARCH+site:video.google.com&hl=en&safe=off&tbm=vid&start=$i&sa=N"|grep -o "docid=[0-9-]*"|sort -u|tee -a seed_videos_$SEARCH;done

Change "my+search+term" to your search term, and remember to use a plus sign instead of spaces (or url encode the text for other special characters).
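
To make the plus-sign substitution concrete, the following sketch builds the search URL used by the command above from a plain search phrase. The function names are hypothetical, and only spaces are handled; other special characters would still need proper URL encoding.

```shell
#!/bin/sh
# Replace spaces with plus signs, as the search URL expects.
search_term() {
    printf '%s\n' "$1" | tr ' ' '+'
}

# Build the search URL for one result page.
# $1 = search phrase, $2 = result offset (0, 10, 20, ...)
search_url() {
    printf 'http://www.google.com/search?q=%s+site:video.google.com&hl=en&safe=off&tbm=vid&start=%s&sa=N\n' \
        "$(search_term "$1")" "$2"
}

# Example:
# search_url 'douglas adams' 10
```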

Linux Bash Script

An alternative search script which sorts and dedupes results and can restrict searches to long, medium and short videos is here. <-- Please evolve the script and upload to Github?

Searches Undertaken

Since we want to minimize overlap, here are some search terms that are already being downloaded (and the user who is downloading them):

Darkstar: "rare", "vintage", "commercial"
NomDuClavier: "douglas adams", "richard dawkins", "charles darwin"
oli: "australia history"
dnova: "microelectronics"
Lightblb: "documentary" (medium and long videos), "lecture" (medium and long videos), "atheism" (medium & long), "interview" (long)
ttuttle: "astronomy"
crackbab1: "ecology"
tj__: "army"

Also check the specificrequest PiratePad under Cherry Picking on this page.

Deduplication

To avoid downloading videos that have already been downloaded by others,

check if you have SQLite installed ("which sqlite3")
download the gv-dedup scripts
initialize a fresh database with "./gv-list-create.sh"
download all seed lists on this page (plus the cherry picks) and import them with "./gv-list-import.sh seed_file" (or "find seeds/* -exec ./gv-list-import.sh {} \;")
invoke "./gv-list-dedup.sh seed_videos_foo > list" to filter already downloaded videos from your custom seed list
also import your custom seed file with "./gv-list-import.sh list"
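
For a quick check without the SQLite database, the same filtering idea can be approximated with grep. This is not how the gv-dedup scripts work internally; filter_known is a hypothetical helper that assumes both files contain one docid=... entry per line.

```shell
#!/bin/sh
# Print the entries of a candidate seed list that are not in a known list.
# $1 = file with already claimed/downloaded docids
# $2 = your custom seed file
filter_known() {
    # -v invert match, -x whole lines only, -F fixed strings, -f patterns from file
    grep -vxFf "$1" "$2" | sort -u
}

# Example:
# cat seeds/* > known
# filter_known known seed_videos_foo > list
```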

A pre-filled database (slightly outdated as of Apr 19, 00:00:00 CEST) is available.

Seed List Downloads

Original Lists: http://199.48.254.90/at/seeds/
PLEASE add your custom searches and their details to this table!
Word suggestions: conference, hack, wiki, linux, creative commons

Seed list	Videos (lines)	Downloader	Complete? (Size?)
seed_videos_ecology	890	crackbab1
seed_videos_meme	996
seed_videos_defcon	822	ndurner	22
seed_videos_ml_documentary_dedupe	1975	Lightblb	Lightblb: aa Papyrus: ab NomDuClavier: ac, ad
seed_videos_ml_lecture_dedupe	1898	Lightblb	Lightblb: aa ab gribozavr: ad (in progress)
seed_videos_ml_atheism_dedupe	698	norc	norc: ab in progress
seed_videos_l_interview_dedupe	986	Pentium100
seed_videos_2_a	25,761	swebb	8.6G (4/17/2011)
seed_videos_2_k	19,266 (24,242)	Lightblb, ARc[Clone, crackbab1, Pentium100, Mqrius, arketype	Split 49 chunks of 500 videos each Lightblb: aa ab ac ad ae (Done: 68GB) crackbab1: af,ak,al (Done: 16GB) Mqrius: Done: ag - ak: 26 GB, am - ao: 24 GB. Working on: aq ar as arketype: Working on ap Pentium100: at-az (complete, 42.8GB, 8 bad docids), ba, bb Darkstar: bc bd be ARc[Clone: bf bg bh bi bj bk bl bm bn bo bp bq br bs bt bu bv bw (all done)
seed_videos_2_l	22,641	ndurner, wgfreewill	Split 46 chunks of 500 videos each ndurner: aa, ab (399/1000); wgfreewill I am running the whole set now from the unsplit file to have a complete copy.
seed_videos_2_m	24,465	Jade Falcon	Jade @ 2854/24465 ~77G and counting...(100 concurrent threads!) balrog running in reverse
seed_videos_2_o	25,049	travelinlibrarian	Split 51 chunks of 500 videos each travelinlibrarian 60/1-500 perfinion downloading seed_videos_2_ob[n-y]
seed_videos_2_p	23,713	oli, Xentac, db48x, otro	Split 48 chunks of 500 videos each oli: paa to paj Xentac: pbt (finished), pbu (finished), pbv (finished), pbp (7.2G finished), pbo (2.5GB finished), pbm (5.7GB finished), pbq (9.7GB finished), pbr (7.9GB finished), pbg-pbr db48x: pbu (1.44GB, finished), pbv (187MB, finished), pba-pbf otro: pbs
seed_videos_2_q	17,727	DoubleJ
seed_videos_2_t	25,301	businux	Split 51 chunks of 500 videos each
seed_videos_2_u	23,528	barbich, negge	Split 48 chunks of 500 videos each barbich: currently processing 0 to 29 (65% done, 290G) negge: getting the whole list (16 threads)
seed_videos_2_w	21,732	nickmoorman	Split 34 chunks of 500 videos each (already de-duped) nickmoorman: aa ab ac ad ae af ag ah ai aj zachtib: ak al am an ao Dr.Sweety: ap aq ar as at au av aw ax ay az ba bb bc bd be bf bg bh (Currently in progress)
seed_videos_2_x	19,733	ksh	35% / 30GB
seed_videos_2_y	20,965	negge	216G done (100%)
seed_videos_2_z	18,877	flare
seed_videos_a	1000	Dr.Sweety	Currently in progress
seed_videos_a_related	This list contains errors	Dr.Sweety	Done, 44G total. What about the errors, will there be an updated list?
seed_videos_b	999	bjwebb	136/999
seed_videos_c	981	dnova	Done (40.25GB)
seed_videos_d	999	nomduclav	complete
seed_videos_e	999	nomduclav	548/992
seed_videos_f	999	DoubleJ	Done (25GB)
seed_videos_g	999	dnova	Done (30.9GB)
seed_videos_h	999	ARc[Clone	Done
seed_videos_i	999	DeCarabas	915/999
seed_videos_j	999	joethehuman	Done (36.7 GB)
seed_videos_k	999	aggroskater	803/999 (22.9 GB so far)
seed_videos_l	999	yipdw	804/999, 51.5 GB Status update, updated every 30 minutes or so (login as guest:guest if you get an authorization error)
seed_videos_m	999	TJ__	Done (34.7GB)
seed_videos_n	999	ndurner	Done (38 GB)
seed_videos_o	999	com_lab, grelbar (list)	~38GB (com_lab), ~4GB so far, in progress (100 lines done of the second half) (grelbar)
seed_videos_p	999	Pneu
seed_videos_q	996	nomduclavier	Done (~24Gb)
seed_videos_r	996	Pentium	Done (26.5GB), two bad IDs (-6997682955012239023, -5475489738249304784)
seed_videos_s	999	Pentium	Done (48.9GB), two bad IDs (2103424227166759427, -8954969329395485241)
seed_videos_t	999	joethehuman	In Progress (806/1000 43.6 GB)
seed_videos_u	999	perfinion, 0xDEADBEEF, norc	0xDEADBEEF 516/1000 24GB. norc 500-1000 done, 24GB. Perfinion done, 44GB.
seed_videos_v	999	masterme1	162/999 (~11GB)
seed_videos_w	1000	com_lab	Done (~5.7GB)
seed_videos_x	1000	Dark-Star	Done (~33GB)
seed_videos_y	1000	beremat	613/1000, (~37.76GB)
seed_videos_z	1000	ksh	Done (27GB)
"microelectronics", "circuit+design", "microprocessor", "chiptune", "electrical+engineering", "hardware+hacking", "unboxing", "demoscene",	1267	dnova	761/1267
"singularity"	174	db48x	(grabbed 8am UTC April 18th 2011)
"Feynman"	28	db48x	completed, 2.20GB (grabbed 9am UTC April 18th 2011)
"police"	998	lutostag	(grabbed 8am UTC April 18th 2011)
"eliezer"	150 (1000)	norc	completed, 6.8G (grabbed 8am UTC April 18th 2011)
"obama"	1000	ryan__	started dl 4/18/2011 10am EDT (list created at 8am UTC April 18th 2011)
"cia"			(grabbed 8am UTC April 18th 2011)
"charlie"	1000	ryan__	started dl 4/18/2011 10am EDT (list created at 8am UTC April 18th 2011)
IDs from the metafilter thread	28	db48x	completed, 6.17GB (grabbed 9am UTC April 18th 2011)
IDs from the reddit thread			(grabbed 9am UTC April 18th 2011)
"rare"	~3100	Darkstar	done (~70gb)
"vintage"
"commercial"
"douglas adams", "richard dawkins", "charles darwin" (http://pastebin.com/ZkzNmwEW)	513	NomDuClavier	done (one de-duped list for the 3 terms)
"australia history"	846	oli	Done
"Bugs Bunny"	153
"rodney mullen"	176	com_lab	Done, 1.7GB
"tech talks"	946	tahu	37 videos, 3.8GB, 2011-04-18 17:34:59 UTC
"rick astley"	17	db48x	completed, 272.8MB (grabbed 13:00 UTC April 18th 2011)
"CERN"	912	vled	Done
multiple: "michio kaku", "brian cox", "vernor vinge", "carl sagan", "simon singh"	176	nomduclavier	67/176
"intel", "amd"	1547	leftfield	In progress -leftfield I'd love if someone could take this one. -dnova
"foia"	89	com_lab	Done, 4.1GB
"creative commons"	1000	aikidork	In Progress
"TED"	1000	vled	w/ problems
Total	>324,788	Archive Team	In progress (>1.4 TB and counting)

Broken DocIDs

DocID	Title	list
-4313176927520589553	Ferrari 320 km/h SelMcKenzie	seed_videos_h
710915802292429594	Triple H-Best Pedigree Ever	seed_videos_h
919675995190477263	404s	seed_videos_h
-7433458566080701467	404s	seed_videos_2_k
7476314005948269525	Tan Tay Du Ky 2 tap 1 phan 2	seed_videos_2_k
1310034078921227326	Presentatie H. van Garderen	seed_videos_h
-8196546459051063200	Ethiopia - Ethiopian Talk Show - Dr. Kinfe M Kassaye	seed_videos_m
6012309833489564165	I'm gonna miss you forever	seed_videos_m
1006201176909432045	Nick "KNUCKLEHEAD" Thomas Learning to Ride A KX 65	seed_videos_2_k_br
9013618753646293166	TooSexii	seed_videos_m
4607644763702261746	Most Haunted	seed_videos_m
910327017359455024	404s	seed_videos_2_k_br
-3505183273546479430	Top 10 Dunkers in Slam Dunk Contest History by www.todonba.mx.kz	seed_videos_2_k_bu
515155312540224448	Prof. Stephen Berk - The Six Day War -- (Only downloads 106MB & manual seek fails)	seed_videos_m
8233620694803027158	Tien Kiem Ky Hiep 12a	seed_videos_2_k_bs
-7026671761719496982	KV Kortrijk - Virton: kans Vervaeke	seed_videos_2_k_bo
4744936758707683681	404s	seed_videos_2_k_bo
-4138015874145288917	Irvine City Council Regular Meeting -- content too short (expected 880173643 bytes and served 871)	seed_videos_2_k_bo
1751753922865083288	Lou Dobbs - Bill Gates Testifies to Senate: Part 2	seed_videos_h
-1847242336625060764	404s	seed_videos_h
-840074924615574683	H.O.T. TV EPISODE 7	seed_videos_h
5450039563312738134		seed_videos_2_o
2740779495236816438		seed_videos_2_o
8240553330007645065	404	"rick astley"
2776148046666235174	404	seed_videos_d
4641809537228296381	404	seed_videos_
-4718427583805445551	404	seed_videos_e
5588388288256218328	404	seed_videos_d
-1413491257698089214	Redirects to http://www.khou.com/news/119535529.html	seed_videos_a_related
1895753595163256038	Redirects to http://tv.sky.com/martina-my-toughest-opponent	seed_videos_a_related
-4941694769105315227	Redirects to http://saratoga-north.ynn.com/content/headlines/524274/governor-visit-s-nation-s-capitol/	seed_videos_a_related
-423230311474262633		seed_videos_2_k_at
-1989250447613793254		seed_videos_2_k_at
-1717591024529167847		seed_videos_2_k_au
-1893715945421217990		seed_videos_2_k_aw
98954701061936704		seed_videos_2_k_az
-857514171338089705	871B instead of 9.9MB	seed_videos_2_k_az
187959010149993716		seed_videos_2_k_az
-3761310108351243571		seed_videos_2_k_az

Tools

Aria2c (APT)

apt-add-repository ppa:t-tujikawa/ppa
apt-get update
apt-get install aria2
- http://aria2.sourceforge.net/

Aria2c (RPM)

Fedora and CentOS have RPMs available.

yum install aria2

Searcher

Bash script to search for terms on Google Video, includes dedupe and ability to restrict search by video length.

https://github.com/dvandok/googlegargle/blob/ac2fbaff0b6c3cb8918f6c677204c312dda5b30f/searcher.sh

Troubleshooting

/usr/bin/aria2c: unrecognized option '--max-connection-per-server=16'
- The aria2 version available in many Linux distributions is not up to date and does not recognize this option.
- To fix this, remove the option from the line of the googlegargle script that starts with "ARIAOPTIONS="
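
As a sketch of that fix, the function below strips the two options that older aria2 versions reject from an option string. The function name is hypothetical, and the actual contents of ARIAOPTIONS in your copy of the script may differ, so treat this as an illustration of the edit rather than a drop-in patch.

```shell
#!/bin/sh
# Remove the options that older aria2 versions do not understand
# and tidy up the leftover spaces.
strip_new_aria_options() {
    printf '%s\n' "$1" | sed \
        -e 's/--max-connection-per-server=[0-9]*//' \
        -e 's/--min-split-size=[0-9A-Za-z]*//' \
        -e 's/  */ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}

# Example:
# strip_new_aria_options '--max-connection-per-server=16 --min-split-size=1M --remote-time=true'
```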

User 'negge' on IRC reports the following ARIA command line works for Debian Squeeze,
- --max-overall-download-limit=1024M --file-allocation=falloc --max-connection-per-server=4 --min-split-size=1M --log-level=notice --remote-time=true
or for ext3 on Debian Squeeze,
- --max-overall-download-limit=1024M --file-allocation=prealloc --max-connection-per-server=4 --min-split-size=1M --log-level=notice --remote-time=true
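
The two command lines differ only in the --file-allocation value, which can be sketched as a small selection function. The helper names are hypothetical, and using falloc for every filesystem other than ext3 is an assumption; the report above only covers Debian Squeeze. The filesystem type can be read, for example, from the Type column of df -T on GNU systems.

```shell
#!/bin/sh
# Choose the --file-allocation value for a filesystem type.
# Assumption: only ext3 needs prealloc; everything else gets falloc.
aria_file_allocation() {
    case "$1" in
        ext3) echo "prealloc" ;;
        *)    echo "falloc" ;;
    esac
}

# Print the full option string reported by negge for the given filesystem type.
aria_options() {
    printf '%s\n' "--max-overall-download-limit=1024M --file-allocation=$(aria_file_allocation "$1") --max-connection-per-server=4 --min-split-size=1M --log-level=notice --remote-time=true"
}

# Example:
# aria_options ext3
```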

FAQ

Is there any estimate on how many videos are on Google Video?
- Wikipedia said it has 2,500,000 videos, a semi-official Google blog mentioned 2.8M

Is there anything about grabbing metadata for vids? like descriptions?
- Googlegrape does that, it saves the html of the video download page

What happens to the data after you claim a seed on the wiki and download it?
- We've got 100TB of space allocated to us on archive.org, and can get more

Is there already some space where it can be uploaded to?
- Not yet, the effort is still young and things take time to organize.

How can I split seed files if I want to download fewer videos or share the task with others?
- On *nix machines use: split --lines=500 [seedfile] [seedfile] to create a set of files each 500 lines in length in the form seedfileaa seedfileab ... etc.
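
A minimal sketch of the split step, using the portable -l form of the same option. split_seed is a hypothetical wrapper; it writes the chunks next to the original file with the suffixes aa, ab, ac and so on.

```shell
#!/bin/sh
# Split a seed file into chunks.
# $1 = seed file, $2 = lines per chunk (defaults to 500)
split_seed() {
    split -l "${2:-500}" "$1" "$1"
}

# Example: a 1200-line seed file becomes
# seed_videos_fooaa (500 lines), seed_videos_fooab (500), seed_videos_fooac (200)
# split_seed seed_videos_foo
```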

How can I check if there are duplicates in a seed file?
- On *nix machines use: sort [infile] | uniq -d to show all duplicates.

How can I remove duplicates from a seed file before I start to use it?
- On *nix machines use: sort -u [infile] > [outfile] to produce a new seed file with duplicates removed. (uniq -u would instead drop every line that appears more than once, losing those IDs entirely.)
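
A small sketch of both duplicate checks on a seed file. Note that sort -u keeps one copy of each repeated line, whereas uniq -u prints only the lines that occur exactly once. The function names are illustrative.

```shell
#!/bin/sh
# Count how many distinct entries appear more than once in a seed file.
count_duplicates() {
    sort "$1" | uniq -d | wc -l | tr -d ' '
}

# Print the seed file with each entry kept exactly once.
dedupe_seed() {
    sort -u "$1"
}

# Example:
# count_duplicates seed_videos_foo
# dedupe_seed seed_videos_foo > seed_videos_foo_dedupe
```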

Announcement: Uploaded video content no longer available

On April 29, 2011 videos that have been uploaded to Google Video will no longer be available for playback. We’ve added a Download button to the Video Status page, so you can download videos that you want to save. If you don’t want to download your videos, you don’t need to do anything. (The Download feature will be disabled after May 13, 2011.)

How do I download videos that I've uploaded?

On the Video Status page, click Download Video located on the right side of each of your videos in the "Actions" column. Once a video has been downloaded, an "Already Downloaded" message will appear. If you have many videos on Google Video, you may need to use the paging controls located on the bottom right of the page to access them all. This download option will be available through May 13, 2011.

I've downloaded my videos. Now what do I do with these FLV files?

FLV files are videos that have been encoded in the Flash Video Format. You can upload your videos in FLV format to other video hosting sites like YouTube or Picasa Web Albums. If you would like to play back your videos on your computer and they don’t seem to be working, you might need to install an FLV player. In order to find an FLV player to install, try doing a Google search for [ FLV player ].

External links

@@ Line 423: / Line 423: @@
 === Searcher ===
 Bash script to search for terms on Google Video, includes dedupe and ability to restrict search by video length.
-* http://gv.nja.im/index.php?dir=tools
+* https://github.com/dvandok/googlegargle/blob/ac2fbaff0b6c3cb8918f6c677204c312dda5b30f/searcher.sh
 == Troubleshooting ==