User:OrIdow6/Info


Things I've said that may be generally useful, but which for now I don't want to rewrite.

I cautiously tell JAA about some nuances of the CDX server

From #internetarchive. "[T]his channel is publicly logged", even if the log viewer is still broken.

[11/10/21 19:20:24] <OrIdow6> JAA: What are you using to get around the restriction on the number of results it gives you w/o pagination? Or has that been removed?

[11/10/21 19:23:47] <JAA> OrIdow6: This is just for testing. I do use the resumeKey pagination.

[11/10/21 19:24:45] <JAA> As I understand it, that one can't return empty pages, unlike the page=N one.

[11/10/21 19:26:10] <JAA> And well, I don't get a resumeKey anyway if I add showResumeKey=true, so pretty sure that isn't the problem here.

[11/10/21 19:32:11] <OrIdow6> I don't think that does "real" pagination - https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/src/main/java/org/archive/cdxserver/CDXServer.java#L280

[11/10/21 19:32:53] <JAA> I use resumeKey/showResumeKey, not page/showNumPages.

[11/10/21 19:33:08] <OrIdow6> Try running your query with page=98901, do you get anything?

[11/10/21 19:33:32] <JAA> The page param stuff is not true pagination, yes. It goes through different shards of the CDX index, I think?

[11/10/21 19:33:58] <OrIdow6> Something like that

[11/10/21 19:34:33] <JAA> resumeKey is normal pagination, though not with page numbers. You get the first $limit results and, if there are more rows, the resumeKey for the next page.
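For illustration, a minimal sketch of the resumeKey flow described here (not anyone's actual tooling), assuming the behaviour documented in the CDX server README: with showResumeKey=true, a text-format page that has more rows after it ends with a blank line followed by the resume key on its own line. Note the caveat discussed below: for very large matches this still doesn't get past the cap on the initial fetch of results.

```python
# Sketch only: iterate a CDX query using resumeKey pagination.
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def cdx_resume_key_query(params, limit=5000):
    params = dict(params, limit=limit, showResumeKey="true")
    while True:
        resp = requests.get(CDX, params=params)
        resp.raise_for_status()
        lines = resp.text.splitlines()
        resume_key = None
        # A blank line followed by one more line at the end signals the resume key.
        if len(lines) >= 2 and lines[-2] == "":
            resume_key = lines[-1]
            lines = lines[:-2]
        yield from lines
        if resume_key is None:
            return
        params["resumeKey"] = resume_key

# Hypothetical usage:
# for row in cdx_resume_key_query({"url": "example.com", "matchType": "domain"}):
#     print(row)
```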

[11/10/21 19:35:31] <JAA> But yeah, https://web.archive.org/cdx/search/cdx?url=reddit.com&matchType=domain&filter=original:^https?://(?:[^/]*\.)?reddit\.com(?::[0-9]*)?/r/[^/]*/about.*&page=98901 returns something. Wut?

[11/10/21 19:35:34] <OrIdow6> But if I remember correctly resumeKey doesn't get you past the limitation on the initial fetch of results

[11/10/21 19:36:23] <JAA> https://web.archive.org/cdx/search/cdx?url=reddit.com&matchType=domain&filter=original:^https?://(?:[^/]*\.)?reddit\.com(?::[0-9]*)?/r/[^/]*/about.*&showNumPages=true → 356731

[11/10/21 19:36:27] <JAA> Uhh, yeah...

[11/10/21 19:36:28] <OrIdow6> It's been a while since I tried using it, but I think that was what I found back then

[11/10/21 19:38:01] <OrIdow6> So I believe you are fetching the 1.5 million (or 15 million, I can't remember) results it gives you that are at the front of the domain match, then filtering, then applying pagination

[11/10/21 19:38:27] <JAA> Hmm

[11/10/21 19:39:15] <OrIdow6> If you add a page= it will filter on that specific page

[11/10/21 19:39:23] <OrIdow6> So what I usually do is just iterate through all pages
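A rough sketch, not actual ArchiveTeam tooling, of the iterate-through-all-pages approach described here: ask showNumPages=true for the page count, then fetch every page= value (some pages can come back empty, as noted above).

```python
# Sketch only: iterate a CDX query using page=N pagination.
import time
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def cdx_page_query(params):
    num_pages = int(requests.get(CDX, params=dict(params, showNumPages="true")).text.strip())
    for page in range(num_pages):
        resp = requests.get(CDX, params=dict(params, page=page))
        resp.raise_for_status()
        for row in resp.text.splitlines():
            yield row
        time.sleep(1)  # arbitrary politeness delay; the API rate-limits heavy use (see below)
```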

[11/10/21 19:40:31] <JAA> Right, that's what I used to do as well. I did compare the results yesterday though in some tests and got identical lists. Hmm

[11/10/21 19:40:37] <JAA> Only small tests, FWIW.

[11/10/21 19:41:42] <OrIdow6> If it's small enough that it's below the cutoff for non-paginated queries I would expect it to behave identically (except for recently-added entried)

[11/10/21 19:41:50] <OrIdow6> *entries

[11/10/21 19:44:24] <OrIdow6> Basically, what I've seen in practice (I suppose much of it is there, but I haven't really read through the whole source) is: first it does a range query on the big CDX index (based on your "match"), then it applies filters/pagination (can't remember which is first, actually)

[11/10/21 19:44:37] <OrIdow6> "pagination" as in resumeKey

[11/10/21 19:48:18] <OrIdow6> Without page= and showNumPages= pagination, your "range" includes everything matching the match, but it gets cut off at whatever the limit is; whereas with them, it's a fragment that doesn't get cut off

[11/10/21 19:49:47] <OrIdow6> Also, page= pagination doesn't seem to include as many results as without; what I've seen is that it doesn't include newer ones; my guess, based on something arkiver said at one point, is that page= pagination only gets results from the "'all' index"

[11/10/21 19:50:13] <OrIdow6> Or at any rate a subset of indexes

[11/10/21 20:01:19] <JAA> Hmm

[11/10/21 20:01:30] <JAA> I tried poking at the source a bit, but I can't seem to find the relevant parts at all.

[11/10/21 20:02:48] <JAA> Doesn't help that some of the stuff is forked repos from iipc, and you can't search in forks on GitHub.

[11/10/21 20:16:34] <OrIdow6> Yeah, it's not the clearest thing

[11/10/21 20:20:57] <JAA> Yeah, I'll stop now, this is too messy.

[11/10/21 20:25:33] <JAA> So if I understand you correctly, this means that since https://web.archive.org/cdx/search/cdx?url=reddit.com&collapse=urlkey&matchType=domain&filter=original:^https?://(?:[^/]*\.)?reddit\.com(?::[0-9]*)?/r/.*&limit=100 also returns no results, not a single subreddit URL is in that first block of millions of results from the domain match?

[11/10/21 20:27:42] <JAA> I think the index is generally alphabetic, so /user/ pages should come after /r/. I wonder where millions of URLs prior to /r come from.

[11/10/21 20:35:11] <JAA> And then I suppose the only way to run my query is to make 357k requests against the API. If I remember correctly from last time I tried to run a significant number of queries, there's quite some rate limiting, so that sucks.

[11/10/21 20:35:58] <JAA> Also, I realised what the millions of URLs must be: /api/info.json from #shreddit.

[11/10/21 21:12:23] <OrIdow6> Looks like it

[11/10/21 21:19:13] <OrIdow6> And yes

[11/10/21 21:32:32] <OrIdow6> I did once try to write something to try to cover everything by generating prefixes and running with those as matches, but that didn't work, and I don't remember why

[11/10/21 21:35:36] <JAA> Well, you'll always miss some things with such generated prefix searches, namely the shorter matches.

[11/10/21 21:37:31] <JAA> I had that issue once on a discovery thing for some site. I think it was Dead Format. It was a search field there, but basically the same thing. When it arrived at searching 'black', there were still too many results, so it started doing 'blacka', 'blackb', etc., obviously missing everything containing the word 'black'.

[11/10/21 21:38:30] <JAA> Also, such prefix searches obviously can't work with matchType=domain, so if the relevant subdomains are not all known, it breaks down.

[11/10/21 21:39:28] <JAA> For example, I know of {old,np,i}.reddit.com, but there are probably more. I could catch all of those if the query above worked, but I'd have to know the subdomains first to do a prefix-based search.

[11/10/21 21:40:01] <JAA> With that, it'd work well actually in this case since subreddit names also have quite some restrictions. But yeah...
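The prefix-generation idea mentioned above can be sketched as a recursion; this is a guess at the general shape, not a reconstruction of what was actually tried. The shorter-match gap JAA describes is handled by also emitting the bare prefix as an exact match; it still does nothing about unknown subdomains. The too_big() predicate is hypothetical.

```python
# Sketch only: split an oversized URL prefix into queries that fit under the result cap.
import string

# Assumed character set; real URLs/urlkeys can contain more characters.
URL_CHARS = string.ascii_lowercase + string.digits + "-._~%/?&=:"

def expand_prefixes(prefix, too_big):
    """Yield (matchType, url) pairs that together cover everything under `prefix`.

    `too_big(p)` is a hypothetical predicate: True if a matchType=prefix query
    for p would be truncated by the result cap.
    """
    if not too_big(prefix):
        yield ("prefix", prefix)
        return
    # The extended prefixes below would miss the URL that is exactly `prefix`,
    # so emit it as an exact match first.
    yield ("exact", prefix)
    for c in URL_CHARS:
        yield from expand_prefixes(prefix + c, too_big)
```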

[11/10/21 21:40:44] <JAA> In this particular case, I don't need a complete list of all results anyway. I'll probably just sample a bunch of pages and leave it at that.

[11/10/21 21:55:52] <OrIdow6> What I've seen is that the CDX entries are sorted lexically on urlkey + " " + timestamp (or maybe even the whole line), and so the matches with nothing after what you searched for come first

[11/10/21 21:55:58] <OrIdow6> But domain etc would be a problem
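A tiny illustration of that ordering, using made-up CDX-style keys (urlkey + " " + timestamp); this is also why the millions of /api/info.json captures mentioned earlier sort ahead of everything under /r/.

```python
# Lexical sort puts the bare match and earlier-alphabet paths before /r/...
keys = [
    "com,reddit)/r/archiveteam/about 20210101000000",
    "com,reddit)/ 20210101000000",
    "com,reddit)/api/info.json?id=t3_abc 20210101000000",
]
for line in sorted(keys):
    print(line)
# com,reddit)/ 20210101000000
# com,reddit)/api/info.json?id=t3_abc 20210101000000
# com,reddit)/r/archiveteam/about 20210101000000
```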

Formats and Tooling

This has been heavily edited, mostly to erase details about the other person in this conversation.

<OrIdow6> [Name]: I'll try to give an answer, with the usual warning that this is not an official opinion of ArchiveTeam (and, as it goes, ArchiveTeam is not a branch of the Internet Archive, even if we do sometimes coordinate with them).
<OrIdow6> Also, ArchiveTeam is largely made up of volunteers rather than professionals, and that includes me.
<OrIdow6> So as I see it, you have 3 options: [site's built-in exporter], something like HTTrack or wget in --convert-links mode, and something with WARC output.
<OrIdow6> I don't actually know much about what [the built-in exporter] gives you (I knew once, but have forgotten). A while ago, someone in this channel said that it doesn't include revision history of the pages, which the other options may be able to get. Also, I don't know whether it gives a true-to-life version of the pages or whether it omits things (e.g. the text saying "Hosted by [site]"); it's possible that, excluding the revision history, it could be similar to the next option, or it could just be a difficult-to-use collection of HTML pages and files.
<OrIdow6> HTTrack or something like wget --mirror --convert-links creates a set of files on disk that, if everything goes right, more or less recreate the site as a human would see it. These have the advantage that you can access them from a regular web browser (on Firefox, with the file:// protocol), without any special viewer software. The disadvantage is that these don't record all the technical information (namely HTTP headers) that WARC gets, so if their interpretation of that technical information that they use to produce the files is wrong, or if someone is interested in that later, you're out of luck.
<OrIdow6> WARC is basically a record of exactly what goes over the TCP (or SSL or TLS) connection that HTTP uses, and the result is that it's sort of a reverse of the advantages/disadvantages of HTTrack or WARC-less wget: it requires specialized software to view, but doesn't omit HTTP headers from what it saves.
<OrIdow6> Most of the "advanced" tooling (including ArchiveTeam's) tends to center around WARC, which for various reasons is also good for large-scale sets (e.g. the Wayback Machine)
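As an aside, a small sketch of what "requires specialized software" looks like in practice, using the warcio library (pip install warcio) to walk a WARC and print the HTTP headers that an HTTrack or plain-wget mirror would have discarded. The file name is hypothetical.

```python
# Sketch only: list response records in a WARC along with their HTTP headers.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as fh:  # hypothetical file name
    for record in ArchiveIterator(fh):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(url, status)
            # The full HTTP response headers are preserved alongside the body.
            for name, value in record.http_headers.headers:
                print("   ", name + ":", value)
```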
[They discuss their specific case.]
<OrIdow6> Yes, the scripting could potentially break, depending on what they're doing and how standards change.
<OrIdow6> I'll mention that, unless you're using a headless browser, which is somewhat difficult to set up, WARCs can have problems capturing pages with a lot of scripts as well (though the playback situation, once captured, has improved quite a bit recently).
<OrIdow6> It's possible that a crawl would get all the raw files as well; it's been long enough since I was involved in this project that I can't remember.
[They ask about playback]
<OrIdow6> https://replayweb.page/ is one player that I think (I haven't strictly speaking used it, just its predecessor) a lot of people find easy to use.
<OrIdow6> There's an outdated list of tools, including other viewers, at https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem#Tools . Also, there are various "easy-to-use" tools for producing WARCs (e.g. ArchiveBox) that I'm not familiar with at all.