Difference between revisions of "Cohost"
Revision as of 13:27, 11 January 2025
Cohost | |
Social media / blogging site |
URL | https://cohost.org/ |
Status | Closing |
Archiving status | Saved! |
Archiving type | DPoS |
Project source | cohost-grab, cohost-items |
Project tracker | cohost |
IRC channel | #nohost (on hackint) |
Project lead | OrIdow6 |
Data | archiveteam_cohost
The first post-twitter social media that was enjoyable. Such things don't last.
Official announcement: cohost to shut down at end of 2024.
Notes on site structure
Main things to get are blogs/users (internally, "projects"), posts, and tag indexes. Tag indexes, beyond the first page, take as a GET parameter a UNIX timestamp of what time to consider "now" for purposes of pagination; this is sent by the server as part of the first page, so it should play back, but we'll need to take care not to get these twice lest pagination become broken.
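A minimal sketch of the pagination concern above: the "now" timestamp is taken once from the first page and reused unchanged for every later page, so the set of page URLs stays stable. The parameter names `ts` and `offset`, the example host, and the function name are hypothetical; the real names Cohost uses are not recorded here.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def tag_page_urls(first_page_url, ref_timestamp, offsets,
                  ts_param="ts", offset_param="offset"):
    """Build later tag index page URLs that all share one pinned timestamp.

    ts_param and offset_param are placeholder names (assumptions), not
    the parameters Cohost actually uses.
    """
    parts = urlsplit(first_page_url)
    urls = []
    for offset in offsets:
        # Reuse the timestamp sent with the first page; never regenerate it.
        query = urlencode({ts_param: ref_timestamp, offset_param: offset})
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path, query, "")))
    return urls
```

Fetching the first page a second time would yield a different timestamp and therefore a different, non-matching set of later URLs, which is the breakage to avoid.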
https://cohost.org/api/v1/trpc/login.loggedIn?batch=1&input=%7B%7D
is retrieved every 15 seconds.
The URL below is retrieved when the page is switched back to after having been switched away from; its purpose is unclear. If the request is blocked, the page splits off parts of the list of requested items from it and then retries; if these are blocked for 2 more iterations, it turns the page white and makes a request to https://api.rollbar.com/api/1/item/ (an error-reporting SaaS).
https://cohost.org/api/v1/trpc/login.loggedIn,users.displayPrefs,subscriptions.hasActiveSubscription,projects.isReaderMuting,projects.isReaderBlocking,projects.followingState,posts.singlePost?batch=1&input=%7B%223%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%224%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%225%22%3A%7B%22projectHandle%22%3A%22FloopLoop%22%7D%2C%226%22%3A%7B%22handle%22%3A%22FloopLoop%22%2C%22postId%22%3A5878102%7D%7D
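To make the batching easier to read, here is a small sketch that decodes a batched TRPC URL like the ones above. It assumes the usual TRPC batch layout, which the example URLs are consistent with: the last path segment is a comma-separated list of procedure names, and the `input` parameter is a JSON object keyed by each procedure's position in that list. The function name is made up for illustration.

```python
import json
from urllib.parse import parse_qs, urlsplit


def decode_trpc_batch(url):
    """Return a list of (procedure name, input or None) pairs for a batch URL."""
    parts = urlsplit(url)
    # Last path segment holds the comma-joined procedure names.
    procedures = parts.path.rsplit("/", 1)[-1].split(",")
    # parse_qs percent-decodes the JSON held in the "input" parameter.
    inputs = json.loads(parse_qs(parts.query)["input"][0])
    # Inputs are keyed by the procedure's index, as a string.
    return [(name, inputs.get(str(i))) for i, name in enumerate(procedures)]
```

Applied to the long URL above, this pairs `posts.singlePost` (position 6) with the handle `FloopLoop` and post ID 5878102, while the first three procedures have no input.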
Need to clarify if deletes are possible after freeze - empirically I have experienced tag index deletions making formerly-working tag list pages empty - if so we should get these oldest-to-newest. Largest tags I can find are <10k pages so should be plausible to do these in single items. Perhaps narrow down how many pages it has with write_to_warc=false, then start from there? No.[1]
A long username:
https://cohost.org/ETC-ByTheForcibleOverthrowOfAllExistingSocialConditionsLetTheRulingClassesTrembleAtACommunisticRevolutionTheProletariansHaveNothingToLoseButTheirChainsTheyHaveAWorldToWinWorkingMenOfAllCountriesUNITE?page=0
Grab
Grab started Nov 2 2024 UTC. Site much slower than expected; as of writing we are going 15 items/minute (user: prioritized). About 8 hours after the grab started I (OrIdow6) discovered we had not been getting full images, only the previews; as these get pretty big it should be a minor issue, but it might be something to do if we have time to spare at the end.
November 9 or so, due to warriors not being able to handle(? - never got any concrete information on whether they were timing out, filling up, or still running) big items, user: was modified so that by default it only does the first page, and then queues user:[name]+[page number] for the rest (and usertags: for tags); and tag: now only gets the first 50 offsets, with ones beyond that being gotten by tagext:. As it turns out the majority of data was in the long tail, and the project probably would have been impractical without this.
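A rough sketch of the user: split described above. It assumes the page count is already known (in the real grab it would presumably be discovered while fetching) and that pages are numbered from 0, so the bare user: item covers page 0. The function name is hypothetical, and this is not the project's actual code.

```python
def split_user_item(name, page_count):
    """Return the item names that together cover a user's pages.

    Assumption: pages are zero-indexed and "user:<name>" covers page 0.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # The base item fetches only the first page.
    items = [f"user:{name}"]
    # Every further page becomes its own small item.
    items += [f"user:{name}+{page}" for page in range(1, page_count)]
    return items
```

For example, `split_user_item("FloopLoop", 3)` gives `user:FloopLoop`, `user:FloopLoop+1` and `user:FloopLoop+2`, so no single item has to carry a whole long-tail blog.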
December 14, an item type userfix1: was added in order to get some TRPC batch requests that only the Wayback Machine made, and that I hadn't observed in development (and had been unable to, because the WBM had been down for an extended period).
December 29, because the HTML and API requests were the bottleneck, the project was switched for speed to getting only two post visibility combinations: showing everything, and hiding shares.
2025-01-01 Cohost staff announced (https://bsky.app/profile/staff.cohost.org/post/3leowvgcptk2w) that they would stay online until Archive Team activities were complete.
Early January the Cohost staff contacted us and, at our request, removed TRPC batching; fetching of these new pages is pending and will hopefully improve playback dramatically.
The project was affected by a bug in wget-lua that caused some responses to be erroneously deduplicated. Due to its large item sizes Cohost is particularly susceptible to this. --warc-dedup-url-agnostic was disabled November 19 as a workaround, and as of November 24 we are waiting on detection and requeuing of erroneous omissions prior to that.
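For readers unfamiliar with the option, here is a conceptual sketch of what URL-agnostic deduplication generally means: a response counts as a duplicate if its payload digest has been seen before, regardless of URL. This is not wget-lua's implementation and does not reproduce the bug; the class and method names are invented for illustration.

```python
import hashlib


class DedupIndex:
    """Toy payload deduplication index (illustrative only)."""

    def __init__(self, url_agnostic):
        self.url_agnostic = url_agnostic
        self.seen = {}  # dedup key -> URL of the first full record

    def _key(self, url, payload):
        digest = hashlib.sha1(payload).hexdigest()
        # URL-agnostic: the digest alone decides; otherwise URL must match too.
        return digest if self.url_agnostic else (url, digest)

    def check(self, url, payload):
        """Return the earlier URL to refer back to, or None to store in full."""
        key = self._key(url, payload)
        if key in self.seen:
            return self.seen[key]
        self.seen[key] = url
        return None
```

With `url_agnostic=True`, identical bodies served from two different URLs are stored only once; with it off, each URL keeps its own full copy, which costs space but avoids depending on another record being present.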
Playback
As of the time of writing (December 10) we were never able to reestablish contact with the admins, so nice as it would have been to have, the TRPC requests remain batched in a way that is somewhat arbitrary. Nonetheless, playback in the WBM mostly works. However, it is liable to return 429 errors.
Cohost uses an annoying SaaS called Rollbar that blanks the page if there are "too many" errors (incl. 404s and 429s, even if they don't affect anything page content-wise). This means that loading Cohost in the WBM is (at present) sometimes a game of speed.