Cohost

From Archiveteam
Revision as of 05:47, 26 November 2024 by Nosamu (talk | contribs) (Better formatting for long links)
Social media / blogging site
URL https://cohost.org/
Status Closing
Archiving status In progress...
Archiving type DPoS
Project source cohost-grab, cohost-items
Project tracker cohost
IRC channel #nohost (on hackint)
Project lead OrIdow6
Data archiveteam_cohost

The first post-Twitter social media that was enjoyable. Such things don't last.

Official announcement: cohost to shut down at end of 2024.

Notes on site structure

The main things to get are blogs/users (internally, "projects"), posts, and tag indexes. Beyond the first page, tag indexes take as a GET parameter a UNIX timestamp saying what time to treat as "now" for purposes of pagination. The server sends this timestamp as part of the first page, so it should play back, but we'll need to take care not to get these twice, since a second fetch would presumably be handed a different timestamp and break pagination.
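To make the pagination constraint concrete, here is a minimal sketch of building later tag index page URLs from a single pinned timestamp. The parameter names (refTimestamp, skipPosts), the example base URL, and the step of 20 between offsets are assumptions for illustration, not confirmed by these notes. The point is only that every later page reuses the one timestamp the server sent with the first page.

```python
from urllib.parse import urlencode

# Assumed parameter names; the notes only say "a GET parameter" carrying a UNIX timestamp.
TS_PARAM = "refTimestamp"
OFFSET_PARAM = "skipPosts"


def later_page_urls(base_url, ref_timestamp, offsets):
    """Build URLs for pages after the first, all pinned to one timestamp.

    ref_timestamp must be the value the server sent with the first page,
    captured exactly once, so that playback sees consistent links.
    """
    return [
        f"{base_url}?{urlencode({TS_PARAM: ref_timestamp, OFFSET_PARAM: offset})}"
        for offset in offsets
    ]


# Hypothetical tag index URL and timestamp, for illustration only.
urls = later_page_urls("https://cohost.org/rc/tagged/example", 1730505600, [20, 40])
for url in urls:
    print(url)
```

If the first page were fetched a second time, the timestamp passed in here would change and none of the previously captured later-page URLs would match it.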

https://cohost.org/api/v1/trpc/login.loggedIn?batch=1&input=%7B%7D is retrieved every 15 seconds.

The URL below is retrieved upon switching back to the page after having switched away from it; its purpose is unclear. If the request is blocked, the page splits off parts of the list of requested items and retries. If these are blocked for 2 more iterations, it turns the page white and makes a request to https://api.rollbar.com/api/1/item/ (Rollbar, an error-reporting SaaS).

https://cohost.org/api/v1/trpc/login.loggedIn,users.displayPrefs,subscriptions.hasActiveSubscription,projects.isReaderMuting,projects.isReaderBlocking,projects.followingState,posts.singlePost?batch=1&input=%7B%223%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%224%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%225%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%226%22%3A%7B%22handle%22%3A%22FloopLoop%22%2C%22postId%22%3A5878102%7D%7D
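The URL above appears to follow tRPC's batching convention: the path holds a comma-separated list of procedure names, and the input parameter is URL-encoded JSON keyed by each procedure's position in that list. A minimal sketch for pulling it apart, assuming that reading is right (the function name decode_trpc_batch is made up here):

```python
import json
from urllib.parse import parse_qs, urlsplit

EXAMPLE_URL = (
    "https://cohost.org/api/v1/trpc/"
    "login.loggedIn,users.displayPrefs,subscriptions.hasActiveSubscription,"
    "projects.isReaderMuting,projects.isReaderBlocking,projects.followingState,"
    "posts.singlePost?batch=1&input="
    "%7B%223%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D"
    "%2C%224%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D"
    "%2C%225%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D"
    "%2C%226%22%3A%7B%22handle%22%3A%22FloopLoop%22%2C%22postId%22%3A5878102%7D%7D"
)


def decode_trpc_batch(url):
    """Return (procedure name, input or None) pairs for a batched request URL."""
    parts = urlsplit(url)
    # The last path segment is the comma-separated procedure list.
    procedures = parts.path.rsplit("/", 1)[-1].split(",")
    # parse_qs percent-decodes the value; the result is a JSON object
    # whose keys are positions in the procedure list, as strings.
    inputs = json.loads(parse_qs(parts.query)["input"][0])
    return [(name, inputs.get(str(i))) for i, name in enumerate(procedures)]


for name, arg in decode_trpc_batch(EXAMPLE_URL):
    print(name, arg)
```

Decoded this way, the first three procedures carry no input, the three projects.* calls each take the project handle, and posts.singlePost takes the handle plus the post ID.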

We need to clarify whether deletes are possible after the freeze. Empirically, I have seen tag index deletions make formerly-working tag list pages empty; if deletes are still possible, we should get these oldest-to-newest. The largest tags I can find are <10k pages, so it should be plausible to do each of these in a single item. Perhaps narrow down how many pages a tag has with write_to_warc=false, then start from there? No.[1]

A long username:

https://cohost.org/ETC-ByTheForcibleOverthrowOfAllExistingSocialConditionsLetTheRulingClassesTrembleAtACommunisticRevolutionTheProletariansHaveNothingToLoseButTheirChainsTheyHaveAWorldToWinWorkingMenOfAllCountriesUNITE?page=0

Grab

Grab started Nov 2 2024 UTC. The site is much slower than expected; as of writing we are going 15 items/minute (user: prioritized). About 8 hours after the grab started I (OrIdow6) discovered we had not been getting full images, only the previews. As the previews get pretty big this should be a minor issue, but it might be something to do if we have time to spare at the end.

Around November 9, apparently because warriors were not able to handle big items (we never got any concrete information on whether they were timing out, filling up, or still running), the item types were split up. user: was modified so that by default it only does the first page and then queues user:[name]+[page number] for the rest (and usertags: for tags). tag: now only gets the first 50 offsets, with the ones beyond that being gotten by tagext:. As it turns out, the majority of the data was in the long tail, and the project probably would have been impractical without this change.
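The split can be sketched roughly as follows. This is an illustration of the scheme described above, not the actual cohost-grab code. Only the user:[name]+[page number] format and the limit of 50 offsets come from these notes; the function names, and the assumption that the first page is page 0 (as in the ?page=0 URL earlier), are mine.

```python
# From the notes: tag: items only cover the first 50 offsets.
TAG_HEAD_OFFSETS = 50


def followup_user_items(name, total_pages):
    """Item names a user: item would queue after fetching the first page.

    Assumes pages are numbered from 0, so follow-ups start at page 1.
    """
    return [f"user:{name}+{page}" for page in range(1, total_pages)]


def split_tag_offsets(total_offsets):
    """Split offset indexes into those handled by tag: and those left for tagext:."""
    head = list(range(min(total_offsets, TAG_HEAD_OFFSETS)))
    tail = list(range(TAG_HEAD_OFFSETS, total_offsets))
    return head, tail


print(followup_user_items("FloopLoop", 3))
head, tail = split_tag_offsets(52)
print(len(head), tail)
```

Keeping each item small means a slow or failing worker costs at most one page or one short run of offsets, not an entire large blog or tag.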

The project was affected by a bug in wget-lua that caused some responses to be erroneously deduplicated. Due to its large item sizes, Cohost is particularly susceptible to this. --warc-dedup-url-agnostic was disabled on November 19 as a workaround, and as of November 24 we are waiting on detection and requeuing of the erroneous omissions from before that date.
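For readers unfamiliar with the option, the general idea of URL-agnostic deduplication can be sketched as below. This is a conceptual illustration only: it does not reproduce wget-lua's implementation or the bug itself, and the function name and the "response"/"revisit" labels are chosen here for illustration.

```python
import hashlib


def dedup_decisions(responses, url_agnostic):
    """Decide, per (url, body) pair, whether to store the body or a revisit.

    With url_agnostic=True a body counts as a duplicate if the same payload
    was seen under any URL; otherwise only if seen under the same URL.
    """
    seen = set()
    decisions = []
    for url, body in responses:
        digest = hashlib.sha1(body).hexdigest()
        key = digest if url_agnostic else (url, digest)
        if key in seen:
            decisions.append("revisit")
        else:
            seen.add(key)
            decisions.append("response")
    return decisions


sample = [("a", b"same"), ("b", b"same"), ("a", b"same")]
print(dedup_decisions(sample, url_agnostic=True))
print(dedup_decisions(sample, url_agnostic=False))
```

Under this model, turning the URL-agnostic mode off narrows what counts as a duplicate, which fits its use as a workaround while the underlying bug is dealt with.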