Talk:Chromebot

From Archiveteam

What exactly happens when ChromeBot tries to access Instagram's website?

Does Instagram just respond with a blank page, a 403 error, 404 error or something else? --ATrescue (talk) 21:52, 26 April 2019 (UTC)

How well does it handle Twitter Lite?

While Twitter's original desktop website still delivers most of its content in the HTML source code, Twitter's mobile page is a “Web App” powered by AJAX. It also causes serious compatibility problems with older browser versions (but Twitter redirects those to “Mobile Web (M2)”, their legacy mobile website, anyway).
The advantage of the AJAX-powered web app is that it allows for smoother browsing, because there is no need to reload the entire webpage. The initial load takes longer, though, because more data has to be downloaded into RAM (if it is not already in the browser cache).

The downside of AJAX is obvious, especially for YouTube comments. Starting circa 2013, those no longer loaded within the page itself (as part of the HTML source code). See YouTube#Comment loading for more information. AJAX has been a death sentence for Wayback Machine captures of YouTube comments, and of other websites too.
Archive.is has been partially able to handle AJAX content, but lost its ability to capture YouTube comments in late 2017 (except for directly linked comments).

But now, there is our mighty ChromeBot. Thankfully.

It is not very likely for Twitter to replace their legacy website (also known as “Twitter Web Client” in tweet source tags) with their new “App” style website (“Twitter Web App”, formerly “Twitter Lite”), but in case it actually happens, or in case it becomes the default and only users who are logged in are able to opt out, is ChromeBot prepared? …and will it support infinite scroll there too?

It would be good if Twitter kept giving users the choice of which platform to use. If Twitter forced their AJAX-powered website onto all users, ArchiveBot (which is more mature and better suited to mass archival of larger sites, whereas ChromeBot is meant for modern, JS-heavy pages) could be incapacitated.

--ATrescue (talk) 19:08, 30 April 2019 (UTC)

JS-Pagination

Some websites that have multiple pages (e.g. search results on Google's desktop website) use URLs that can be put into a list and then fed into ArchiveBot.
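As a rough illustration of the URL-list case, the sketch below builds one URL per result page. The base URL, the `q` and `start` parameter names, and the page size are all hypothetical placeholders and not taken from any particular site; real sites use their own pagination parameters.

```python
from urllib.parse import urlencode


def paginated_urls(base_url, query, pages, page_size=10):
    """Build one URL per result page using an offset-style `start` parameter.

    `q` and `start` are assumed parameter names for illustration only.
    """
    urls = []
    for page in range(pages):
        params = {"q": query, "start": page * page_size}
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # Print the list, one URL per line, ready to be saved as a text file.
    for url in paginated_urls("https://example.com/search", "SonySketch", pages=3):
        print(url)
```

The resulting list can be saved as a plain text file with one URL per line.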

Some websites already load all of their pages into RAM (via the page source code) and access them via offline JavaScript, see the language tabs of this site.[IAWcite.todayMemWeb] These pages are entirely accessible from Wayback captures and when saved offline.

Some other websites (e.g. YouTube comments and video lists in 2012, prior to infinite scrolling) had pages that cannot be accessed via URL (although YouTube had /all_comments?v= back then, which supported pages).

We need to find an automated way (manual WARC recording is already possible) to archive content that can only be reached by clicking (e.g. comment pages that are loaded via AJAX instead of a URL). --ATrescue (talk) 19:39, 30 April 2019 (UTC)

chromebot clicks JS links on some pages, see [1] --PurpleSymphony (talk) 14:31, 8 May 2019 (UTC)

“bajop-” job IDs? New naming system?

  • Yesterday (20190506), all job IDs started with “bajop-muton-”.
  • Today (20190507), all job IDs start with “bajop-nanap-”.

Earlier, job IDs were just random.

Is there a technical explanation for the new job IDs? Is there a new naming system? --ATrescue (talk) 12:39, 7 May 2019 (UTC)

Vowels and consonants

Job IDs no longer contain numbers.

Example: In the job ID “bajop-nanap-ranuv-vukab” (archival of https://twitter.com/search?q=SonySketch ), letters 2 and 4 of every group of 5 characters are vowels and the other 3 are consonants. Coincidence or deliberate? --ATrescue (talk) 12:49, 7 May 2019 (UTC)

Yes, new and yes, deliberate, see commit [2] --PurpleSymphony (talk) 14:33, 8 May 2019 (UTC)
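The pattern observed in this thread (groups of five letters, consonant-vowel-consonant-vowel-consonant, joined by hyphens) can be checked mechanically. The following is only a sketch based on the example ID quoted above; the function name is made up, and the actual generation scheme is whatever the linked commit implements, which this check does not reproduce. It treats “y” as a consonant, which is an assumption.

```python
import re

# One group: consonant, vowel, consonant, vowel, consonant (lowercase only).
CONSONANT = "[b-df-hj-np-tv-z]"
VOWEL = "[aeiou]"
GROUP = re.compile(CONSONANT + VOWEL + CONSONANT + VOWEL + CONSONANT)


def looks_like_new_job_id(job_id):
    """Return True if every hyphen-separated group matches the CVCVC pattern."""
    return all(GROUP.fullmatch(group) is not None for group in job_id.split("-"))


if __name__ == "__main__":
    print(looks_like_new_job_id("bajop-nanap-ranuv-vukab"))
```

An ID containing digits, or a group of the wrong length, fails the check.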

Handling multiple infinite-scroll boxes?

If a page has multiple embedded infinite-scroll parts, does ChromeBot also scroll and crawl those, or only the main page? --ATrescue (talk) 14:05, 8 May 2019 (UTC)

Yes, it scrolls all elements, including frames, see [3] --PurpleSymphony (talk) 14:35, 8 May 2019 (UTC)