Google Drive
Google Drive | |
URL | http://drive.google.com/ |
Status | Special case (Technically files are only becoming inaccessible) |
Archiving status | On hiatus |
Archiving type | Unknown |
Project source | google-drive-grab, google-drive-items |
Project tracker | google-drive |
IRC channel | #googlecrash (on hackint) |
Data | archiveteam_googledrive |
Google Drive is a file hosting service, à la Dropbox, run by Google (not to be confused with Google Cloud Storage and similar more technical storage solutions). It is popular both for personal storage and for sharing files.
2021 grab
Google Drive IDs are not random (anecdotally, IDs of folders in the same tree often share long parts), which makes them predictable. Google spent 2021 trying to rectify this problem across its products, others of which have similar issues.[1] As such, on September 13, 2021, Google required that, in order to access files and folders, users either have permissions tied to their signed-in Google Accounts or access the item through a URL with a random per-item parameter called resourceKey, apparently introduced in 2021.[2] The expected result was that at least millions of links across the Web would effectively break. Docs, Sheets and Slides were to be exempted from this update.[3] Apart from the lengthened links, the main threat was users deleting files, usually to fit within their 15 GB limit.
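As a rough illustration of what the change means for links, the sketch below pulls the item ID and any resource key out of a Drive URL. The lowercase query parameter name `resourcekey`, the `/file/d/` and `/drive/folders/` path shapes, and the `open?id=` fallback are assumptions based on commonly seen share links, not something this page documents; `parse_drive_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse, parse_qs

# Path shapes assumed from commonly seen share links (not exhaustive).
_ID_PATTERNS = [
    re.compile(r"^/file/d/([^/]+)"),
    re.compile(r"^/drive/folders/([^/]+)"),
]

def parse_drive_url(url, key_param="resourcekey"):
    """Return (item_id, resource_key); either may be None."""
    parsed = urlparse(url)
    item_id = None
    for pattern in _ID_PATTERNS:
        match = pattern.match(parsed.path)
        if match:
            item_id = match.group(1)
            break
    query = parse_qs(parsed.query)
    # Assumed fallback: "open?id=..." links carry the ID in the query string.
    if item_id is None and "id" in query:
        item_id = query["id"][0]
    keys = query.get(key_param)
    return item_id, (keys[0] if keys else None)
```

A link for which this returns no resource key is one that may stop working for anonymous visitors after the update, assuming the item is of a type that the update covers.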
Grab
The grab script had three item types: folder:, file:, and user:. The plan was to run all folder: items first to build a pool (through backfeed) of file: items, which could then be randomly sampled to determine a size threshold that the Internet Archive would accept; the file: items would be run after that. user: items contain some user metadata but no links to files or folders.
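The page does not say how the size threshold was to be computed, so the following is only one plausible sketch of the idea: split item names into type and value, then use a random sample of file sizes to estimate the largest per-file cutoff whose total stays within a byte budget. The function names, the `byte_budget` parameter, and the linear scaling from sample to full pool are all assumptions made for illustration.

```python
import random

ITEM_TYPES = ("folder", "file", "user")

def parse_item(item_name):
    """Split 'type:value' into (type, value), rejecting unknown types."""
    item_type, sep, value = item_name.partition(":")
    if not sep or item_type not in ITEM_TYPES or not value:
        raise ValueError(f"unrecognised item: {item_name!r}")
    return item_type, value

def estimate_threshold(sizes, total_items, byte_budget, sample_n=1000, seed=0):
    """Largest sampled size T such that the estimated total of all
    files of size <= T still fits within byte_budget (0 if none fit)."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(sizes, min(sample_n, len(sizes))))
    if not sample:
        return 0
    # Assumption: the sample is representative, so scale it up linearly.
    scale = total_items / len(sample)
    running, threshold = 0, 0
    for size in sample:
        running += size
        if running * scale > byte_budget:
            break
        threshold = size
    return threshold
```

For example, with sizes `[1, 2, 3, 4, 100]`, five items in total and a budget of 10 bytes, the cutoff comes out as 4, which excludes the single large file.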
Playback is theoretically possible with a flexible, POST-capable Wayback Machine, but this does not yet exist. In the meantime, it may be possible to get files with vanilla wget or similar from the WBM.
Results
(This is based on OrIdow6's very vague memories) It appeared that there were 2 types of Google Drive items, those that automatically got a redirect to a version with a resourceKey, and those that didn't. There was speculation that the latter, which had more random-looking IDs, would not suffer in the quasi-removal.
Getting your files
The rudimentary downloader for the 2021 grab is now on GitHub.
Notes
Of the native types:
- Docs may be public; this is a good description of the formats available for downloading.
- Sheets, Slides, Drawings and Jamboard likewise, with different formats.
- Forms are not downloadable in their totality but public ones may be accessed at https://docs.google.com/forms/d/e/[ID]/viewform (along with some other, near-identical pages). If they have public results, those will be visible at https://docs.google.com/forms/d/e/[ID]/viewanalytics .
- My Maps by default just display a preview image. Seemingly not indicated from the Drive interface, at least without an account, is that they can be fully viewed at https://www.google.com/maps/d/viewer?mid=[ID] .
- Sites are not "published" in a way that is publicly connected to their Drive entry[4]. If in Drive folders, they will be publicly listed there, but when accessed anonymously they just take you to an editor page that gives a 401, even when published. Preview images, however, are still publicly shown.
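The URL patterns for Forms and My Maps given above can be captured in a small lookup, sketched here. The templates are the ones listed in the notes; the dictionary keys and the `public_view_url` helper are hypothetical names, and whether a given item is actually public is not checked.

```python
from urllib.parse import quote

# URL templates taken from the notes above; the keys are illustrative names.
PUBLIC_VIEW_TEMPLATES = {
    "form": "https://docs.google.com/forms/d/e/{id}/viewform",
    "form_results": "https://docs.google.com/forms/d/e/{id}/viewanalytics",
    "my_map": "https://www.google.com/maps/d/viewer?mid={id}",
}

def public_view_url(kind, item_id):
    """Fill the item ID into the template for the given kind of item."""
    template = PUBLIC_VIEW_TEMPLATES[kind]
    return template.format(id=quote(item_id, safe=""))
```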
Additionally, the following are not native formats for our purposes:
- Colab, which is just a static file with a special editor, e.g. here
- Google Keep, which is "part of the... Google Docs Editors suite"[5] but does not seem to be accessible from Drive.
URLs like this, with "pubhtml" or this with just "pub" seem to be the result of "publishing" a file. These need JS, but only inline JS, and their images are in img tags, so they do not need any special treatment.
"htmlview" URLs also exist and seem to use the same IDs as normal "view" URLs (at least for sheets).
Requests made by the file viewer page
For instance, using some random batch file https://drive.google.com/file/d/1YQaRoe8kGVKYhYEbnfXF34jsHSXW7p5_/view ,
- https://drive.google.com/auth_warmup - seemingly does nothing useful but if blocked leads to a blank page (need to investigate this more - what makes it?)
- https://content.googleapis.com/static/proxy.html?usegapi=1&jsh=m%3B%2F_%2Fscs%2Fabc-static%2F_%2Fjs%2Fk%3Dgapi.gapi.en.SCWmpDDGjPk.O%2Fam%3DAAAC%2Fd%3D1%2Frs%3DAHpOoo_Pl64J0IIHlj2zBtEJ3ZwdaJC3HA%2Fm%3D__features__ - ugly and looks difficult to generate. If blocked, it inhibits retrieval of the following:
- https://content.googleapis.com/drive/v2beta/files/1YQaRoe8kGVKYhYEbnfXF34jsHSXW7p5_?fields=alternateLink%2CcopyRequiresWriterPermission%2CcreatedDate%2Cdescription%2CdriveId%2CfileSize%2CiconLink%2Cid%2Clabels(starred%2C%20trashed)%2ClastViewedByMeDate%2CmodifiedDate%2Cshared%2CteamDriveId%2CabuseNoticeReason%2ClabelInfo%2CuserPermission(id%2Cname%2CemailAddress%2Cdomain%2Crole%2CadditionalRoles%2CphotoLink%2Ctype%2CwithLink)%2Cpermissions(id%2Cname%2CemailAddress%2Cdomain%2Crole%2CadditionalRoles%2CphotoLink%2Ctype%2CwithLink)%2Cparents(id)%2Ccapabilities(canMoveItemWithinDrive%2CcanMoveItemOutOfDrive%2CcanMoveItemOutOfTeamDrive%2CcanAddChildren%2CcanDownload%2CcanComment%2CcanEdit%2CcanInitiateEsignature%2CcanMoveChildrenWithinDrive%2CcanMoveItemIntoTeamDrive%2CcanRename%2CcanRemoveChildren)%2Ckind&supportsTeamDrives=true&includeBadgedLabels=true&enforceSingleParent=true&key=AIzaSyC1eQ1xj69IdTMeii5r7brs3R90eck-m7k - the primary metadata request for the file, returning most of what is shown in the "Details" pane. This sometimes fails with a 403, apparently because of an intra-backend rate limit (imposed, per the error message, by Google Cloud Platform on Google Drive), in which case the web client will retry it until it succeeds. The web client will not issue this request unless the proxy.html request has gone through, but generating it needs less information than proxy.html does. Despite looking daunting, the exact URL is stable over at least months.
- Scripts, which seem to vary by time (quite frequently) and by browser(?) - the ones I looked at got captured by SPN, but need a more thorough check (and at the time of writing the IA is getting DDoSed so that will have to wait). Only some have their URLs embedded in the HTML of the main page, the rest are generated by some horrendous client-side process.
- CSS and images - the latter include the user profile picture from the details pane
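The metadata request and its retry-on-403 behaviour described above might be reproduced roughly as follows. This is a sketch, not the grab's actual code: the endpoint and query parameters are copied from the observed URL, while `metadata_url`, `fetch_with_retry`, the injectable `fetch` callable and the retry cap are illustrative assumptions (the web client reportedly retries without a limit).

```python
import time
from urllib.parse import urlencode, quote

API_BASE = "https://content.googleapis.com/drive/v2beta/files/"

def metadata_url(file_id, fields, key):
    """Build a metadata URL shaped like the one the viewer page requests."""
    params = {
        "fields": ",".join(fields),
        "supportsTeamDrives": "true",
        "includeBadgedLabels": "true",
        "enforceSingleParent": "true",
        "key": key,
    }
    return API_BASE + quote(file_id, safe="") + "?" + urlencode(params)

def fetch_with_retry(fetch, url, max_tries=5, delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying while the status is 403.

    The cap on attempts is a choice made for this sketch so that a
    permanently forbidden file cannot loop forever.
    """
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 403:
            break
        if attempt + 1 < max_tries:
            sleep(delay)
    return status, body
```

Passing `fetch` in as a callable keeps the sketch independent of any particular HTTP library; a real caller would wrap whatever client it uses so that it returns a `(status, body)` pair.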
The following can be blocked in the Firefox debugger without issue:
- https://play.google.com/log?format=json&hasfast=true - what it looks like. Issued every 5-10 seconds while the page is open
- A few variations of the params of the above
- https://ssl.gstatic.com/docs/common/cleardot.gif?zx=paw20cai0l9q - the "zx" param looks random. Also requested periodically
- https://content.googleapis.com/drive/v2internal/viewerimpressions?key=AIzaSyC1eQ1xj69IdTMeii5r7brs3R90eck-m7k&alt=json - sent a few times but not infinitely. Request body contains some data on the client e.g. UA
- https://drive.google.com/drivesharing/clientmodel?id=1YQaRoe8kGVKYhYEbnfXF34jsHSXW7p5_&foreignService=texmex&authuser=0&origin=https%3A%2F%2Fdrive.google.com - begins a redirect chain on accounts.google.com that ends with a 403, presumably does something useful when logged in
- https://blobcomments-pa.clients6.google.com/v1/metadata?docId=1YQaRoe8kGVKYhYEbnfXF34jsHSXW7p5_&revisionId=0B-kA_2cHaVAPZ2xZeVRWQU0zNnRXRWdkWlNFSjJLa1I1MWtRPQ&userLocale=en&timeZoneId=Etc%2FGMT%2B7&documentResourceKey.resourceKey&forceImportEnabled=true&key=AIzaSyCMp6sr4oTC18AWkE2Ii4UBZHTHEpGZWZM&%24unique=gc797 - whatever metadata it returns is fairly mysterious, consisting of a long, unlabeled JSON array. Can be blocked without inhibiting playback but does not look that difficult to generate (besides, perhaps, the "unique" param?)
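A crawler could use the observations above to skip the requests that were found to be blockable. The sketch below matches on host and path prefix only, ignoring query parameters (which vary, as with the "zx" value); the `BLOCKABLE` table and `is_blockable` name are hypothetical, and the list reflects only the requests noted on this page for one test file.

```python
from urllib.parse import urlparse

# (host, path prefix) pairs taken from the list of blockable requests above.
BLOCKABLE = [
    ("play.google.com", "/log"),
    ("ssl.gstatic.com", "/docs/common/cleardot.gif"),
    ("content.googleapis.com", "/drive/v2internal/viewerimpressions"),
    ("drive.google.com", "/drivesharing/clientmodel"),
    ("blobcomments-pa.clients6.google.com", "/v1/metadata"),
]

def is_blockable(url):
    """True if the URL matches a request observed to be safe to block."""
    parsed = urlparse(url)
    return any(
        parsed.netloc == host and parsed.path.startswith(prefix)
        for host, prefix in BLOCKABLE
    )
```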