Cohost

From Archiveteam
Cohost
Social media / blogging site
URL https://cohost.org/
Status Offline
Archiving status Saved!
Archiving type DPoS
Project source cohost-grab, cohost-items
Project tracker cohost
IRC channel #nohost (on hackint)
Project lead OrIdow6
Data archiveteam_cohost

The first post-twitter social media that was enjoyable. Such things don't last.

Official announcement: cohost to shut down at end of 2024.

Notes on site structure

Main things to get are blogs/users (internally, "projects"), posts, and tag indexes. Tag indexes, beyond the first page, take as a GET parameter a UNIX timestamp of what time to consider "now" for purposes of pagination; this is sent by the server as part of the first page, so it should play back, but we'll need to take care not to get these twice lest pagination become broken.
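A minimal sketch of the pagination concern, not the project's actual code: every page after the first reuses the single "now" timestamp that the server sent with the first page, so one item walks a consistent snapshot of the tag. The path `/rc/tagged/`, the parameter names `refTimestamp` and `skipPosts`, and the page size of 20 are assumptions made for illustration.

```python
from urllib.parse import quote, urlencode


def tag_page_urls(tag, ref_timestamp, n_pages, page_size=20):
    """Yield the URLs for the first n_pages pages of one tag index.

    ref_timestamp is the "now" value taken from the first page's response.
    The URL path and the parameter names are assumptions, not confirmed.
    """
    base = "https://cohost.org/rc/tagged/" + quote(tag, safe="")
    yield base  # the first page carries no pagination parameters
    for page in range(1, n_pages):
        params = {"refTimestamp": ref_timestamp, "skipPosts": page * page_size}
        yield base + "?" + urlencode(params)
```

Fetching the first page a second time would yield a different timestamp and therefore a second, inconsistent chain of later pages, which is why each tag index should be fetched only once.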

https://cohost.org/api/v1/trpc/login.loggedIn?batch=1&input=%7B%7D is retrieved every 15 seconds.

The URL below is retrieved when the user switches back to the page after having switched away from it; its purpose is unclear. If the request is blocked, the page splits off parts of the list of requested items and retries; if those requests are also blocked for 2 more iterations, it turns the page white and makes a request to https://api.rollbar.com/api/1/item/ (Rollbar is an error-reporting SaaS).

https://cohost.org/api/v1/trpc/login.loggedIn,users.displayPrefs,subscriptions.hasActiveSubscription,projects.isReaderMuting,projects.isReaderBlocking,projects.followingState,posts.singlePost?batch=1&input=%7B%223%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%224%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%225%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%226%22%3A%7B%22handle%22%3A%22FloopLoop%22%2C%22postId%22%3A5878102%7D%7D
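To make the batch URL above easier to read, here is a small sketch that decodes it using only the Python standard library. It assumes what the URL itself suggests: the comma-separated path segment lists the procedures, and the keys of the JSON `input` parameter are positions in that list. The function name `decode_trpc_batch` is made up for this example.

```python
import json
from urllib.parse import parse_qs, urlsplit

EXAMPLE_URL = (
    "https://cohost.org/api/v1/trpc/login.loggedIn,users.displayPrefs,"
    "subscriptions.hasActiveSubscription,projects.isReaderMuting,"
    "projects.isReaderBlocking,projects.followingState,posts.singlePost"
    "?batch=1&input=%7B%223%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D"
    "%2C%224%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D"
    "%2C%225%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D"
    "%2C%226%22%3A%7B%22handle%22%3A%22FloopLoop%22%2C%22postId%22%3A5878102%7D%7D"
)


def decode_trpc_batch(url):
    """Return (procedure name, input or None) pairs for a batched URL."""
    parts = urlsplit(url)
    # The last path segment is the comma-separated list of procedures.
    procedures = parts.path.rsplit("/", 1)[-1].split(",")
    # parse_qs undoes the percent-encoding of the "input" parameter.
    inputs = json.loads(parse_qs(parts.query)["input"][0])
    # Input keys are string indexes into the procedure list.
    return [(name, inputs.get(str(i))) for i, name in enumerate(procedures)]


if __name__ == "__main__":
    for name, arguments in decode_trpc_batch(EXAMPLE_URL):
        print(name, arguments)
```

Decoded this way, the example is seven procedures in one request: the first three take no input, the next three take the project handle, and posts.singlePost takes the handle and the post ID.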

Need to clarify whether deletes are possible after the freeze. Empirically I have experienced tag index deletions making formerly-working tag list pages empty; if deletes are possible, we should get these oldest-to-newest. The largest tags I can find are <10k pages, so it should be plausible to do these in single items. Perhaps narrow down how many pages a tag has with write_to_warc=false, then start from there? No.[1]

A long username:

https://cohost.org/ETC-ByTheForcibleOverthrowOfAllExistingSocialConditionsLetTheRulingClassesTrembleAtACommunisticRevolutionTheProletariansHaveNothingToLoseButTheirChainsTheyHaveAWorldToWinWorkingMenOfAllCountriesUNITE?page=0

(Fittingly enough I added this example when I began working on the project in September, and it ended up being one of the last 13 items in the tracker)

Grab

Grab started Nov 2 2024 UTC. The site is much slower than expected; as of writing we are going 15 items/minute (user: prioritized).

Around November 9, because warriors were not able to handle big items (I never got any concrete information on whether they were timing out, filling up, or still running), user: was modified so that by default it only does the first page and then queues user:[name]+[page number] for the rest (and usertags: for tags); tag: now only gets the first 50 offsets, with the ones beyond that being gotten by tagext:. As it turned out, the majority of the data was in the long tail, and the project probably would have been impractical without this.
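As a rough illustration of the split described above (the real grab scripts are not reproduced here), one large user: item becomes one small item per page. The sketch assumes zero-based page numbers, as in the ?page=0 URL earlier, and assumes the page count is known up front; the helper name `split_user_item` is hypothetical.

```python
def split_user_item(name: str, total_pages: int) -> list[str]:
    """Return the item names that together cover one user's pages.

    "user:<name>" covers the first page (page 0); each later page gets
    its own "user:<name>+<page number>" item. Knowing total_pages in
    advance is a simplification made for this sketch.
    """
    if total_pages < 1:
        raise ValueError("a user has at least one page")
    later_pages = [f"user:{name}+{page}" for page in range(1, total_pages)]
    return [f"user:{name}"] + later_pages
```

For example, `split_user_item("FloopLoop", 3)` returns `["user:FloopLoop", "user:FloopLoop+1", "user:FloopLoop+2"]`. Small items like these can be retried individually, which matters when most of the data sits in the long tail.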

December 14, an item type userfix1: was added in order to get some trpc batch requests that only the Wayback Machine made, which I hadn't observed in development (and was unable to observe because the WBM had been down for an extended period).

December 29, for speed, because the HTML and API requests were the bottleneck, the project was switched to only getting two post visibility combinations: showing everything, and hiding shares.

2025-01-01 Cohost staff announced that they would stay online until Archive Team activities were complete.

In early January the Cohost owners got in contact with us again and, at our request, removed TRPC batching. The item type userfix2: was created for the purpose of rerunning all HTML and API pages (except tags and user-tags) to take advantage of this. They also gave us a full list of all public users; userfix2: and user: items for users we had not discovered were generated from this list (the user: item type did not become obsolete because it still got media, and it would have been unnecessary effort to refactor the whole thing). It also turned out that custom subdomains (never an "official" feature on Cohost's side) were a massive resource user server-side, and adding some heuristics to only get everything there for users with more than a few posts massively improved their servers' capacity and our overall rate. It would have been nice to know that earlier - knowledge for what to test for in the future.

Project ended January 10 2025. Site shut down and began redirecting to the WBM Jan 12.

Completedness issues

About 8 hours after the grab started I (OrIdow6) "discovered" that we had not been getting full images, only the previews. This turned out not to be the case: download_child_p gets these images anyway from the *post* HTML pages. If we had only run the user HTML pages they would have been missed, but that wasn't the case.

The project was affected by a bug in wget-lua that caused some responses to be erroneously deduplicated. Due to its large item sizes Cohost is particularly susceptible to this. --warc-dedup-url-agnostic was disabled November 19 as a workaround, and the affected URLs were rerun in the Cohost project (types http: and https: - the item names are just URLs) starting January 9.

On January 11 it looked like we had lost about 2 million items between items being reported as "done" to the tracker and megawarcs being uploaded; however, it turned out those were just sitting in a target from the time of the IA outage and hadn't been uploaded. As of January 16 these are being uploaded, but slowly, as the IA is generally backlogged, and it may take months before they are all in the WBM.

Playback

Most users who posted to Cohost nontrivially were captured both before and after Cohost disabled trpc batching at our request at the beginning of 2025. The earlier captures make a very large number of requests on playback and may either get a 429 WBM-side or make a request that we didn't anticipate and break, because Cohost uses an annoying SaaS called Rollbar that, as part of its own error handling, blanks the page if there are "too many" errors (incl. 404s and 429s, even if they don't affect anything page content-wise). However, everything should be captured by userfix2: and work, so just go to the latest capture.