Collecting items randomly


An ArchiveTeam member comes across a lot of websites to be saved, with all kinds of structures. Understanding how the items can be reached is crucial for building the item list, which can then easily be scraped with well-developed tools.

Imagine the following situation: a website assigns a long, unique identifier to each item, impossible to discover by brute force. The site doesn't provide an index or sitemap, of course. You don't even know the number of items. (There are sites that work this way.)

An obvious approach is discovery via Google/Bing/Common Crawl/whatever. But wait! The site lets you request a random item. You keep clicking the button and keep getting different items. And since you have the link, you can make this request automatically.

Simple! Request that URL x times, where x is a big number.
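
In practice, that loop is a few lines of Python. A minimal sketch – the endpoint URL here is made up, and it assumes the site answers the random-item request with a redirect to the item (adjust the ID extraction to the real site):

import requests  # third-party: pip install requests

RANDOM_URL = "https://example.com/random"  # hypothetical random-item endpoint

seen = set()
for x in range(1, 10001):
    r = requests.get(RANDOM_URL, allow_redirects=False)
    item_id = r.headers.get("Location")  # assumes a redirect points at the item
    if item_id:
        seen.add(item_id)
print(f"{len(seen)} distinct items after {x} tries")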

But how long should we try? Obviously, if the items are served truly at random, some items soon appear twice; you don't even need to go far to see that. The longer you run the discovery script, the more often already-seen items show up. How long is it worth trying to get new random items?

(TL;DR: jump to the Conclusion below.)

In combinatorics, the model for this situation is sampling with replacement. The most commonly discussed question about it is "what is the probability of picking...". Our question, however, is: after how many picks does the number of distinct items picked reach the total number of items, n? Or: after x picks, what percentage of all the items have we seen?

Not knowing the number n, we can't really answer this question. We can only watch the rate of getting new, not-yet-seen items, and observe how it decreases.

Or, we can run simulations in which we do know n and what percentage of the items we have seen, and try to find some constant relationship between the only two numbers we can observe in practice: the number of picks (tries) and the number of distinct items found.

Let's do this for, say, 100 items. We keep picking, and after each try we note how many items we have found. The following table – to keep it short – shows the state after every tenth try.

Python script 
import random

def simulate(n, tries, step):
    # Pick a random item `tries` times out of `n` and report progress every `step` picks.
    seen = [0] * n
    for x in range(1, tries + 1):
        seen[random.randrange(n)] = 1  # mark this item as found
        if x % step == 0:
            k = sum(seen)  # distinct items found so far
            print(f"x {x}\tk {k}\tx/k {x/k:.2f}\tx/n {x/n:.2f}\tk/n {100*k/n:.0f}%")

simulate(100, 500, 10)      # first simulation: 100 items
simulate(3811, 20000, 400)  # second simulation: 3811 items
Progress collecting 100 items randomly
Tries (x) Items found (k) Tries/found (x/k)
0 0 1.00
10 10 1.00
20 17 1.18
30 25 1.20
40 31 1.29
50 37 1.35
60 45 1.33
70 51 1.37
80 58 1.38
90 60 1.50
100 64 1.56
110 68 1.62
120 71 1.69
130 74 1.76
140 75 1.87
150 78 1.92
160 81 1.98
170 84 2.02
180 84 2.14
190 85 2.24
200 88 2.27
210 90 2.33
220 90 2.44
230 92 2.50
240 93 2.58
250 93 2.69
260 93 2.80
270 93 2.90
280 95 2.95
290 95 3.05
300 96 3.13
310 97 3.20
320 97 3.30
330 97 3.40
340 97 3.51
350 98 3.57
360 98 3.67
370 99 3.74
380 99 3.84
390 99 3.94
400 99 4.04
410 99 4.14
420 100 4.20
430 100 4.30
440 100 4.40
450 100 4.50
460 100 4.60
470 100 4.70
480 100 4.80
490 100 4.90
500 100 5.00

It's easy to see that if we pick 100 times, we get 64 distinct items, and even after 200 picks we only have 88% of the items. After 300 picks we're missing only 4%, and after 400 picks we have almost everything, and even more after 500 picks. (As this is only one experiment, these are approximate numbers.)

So, if we knew how many items there are, requesting a random item four times that many times would give us almost all of them. But what if we don't know the total number of items?

Look at the third column. When we reach 100%, the ratio of tries to found items is ~4.2. This is a number we can always calculate, and it is independent of the total number of items.

Don't ask me to prove this mathematically. Let me present another simulation instead, with numbers that are neither small nor round: say, 3811 items.

Progress collecting 3811 items randomly
Tries (x) Items found (k) Tries/found (x/k) Tries/total (x/n) Found % (k/n)
0 0 1.00 0.00 0
400 382 1.05 0.10 10
800 729 1.10 0.21 19
1200 1030 1.17 0.31 27
1600 1309 1.22 0.42 34
2000 1556 1.29 0.52 41
2400 1783 1.35 0.63 47
2800 1991 1.41 0.73 52
3200 2168 1.48 0.84 57
3600 2323 1.55 0.94 61
4000 2484 1.61 1.05 65
4400 2608 1.69 1.15 68
4800 2727 1.76 1.26 72
5200 2835 1.83 1.36 74
5600 2915 1.92 1.47 76
6000 3001 2.00 1.57 79
6400 3073 2.08 1.68 81
6800 3145 2.16 1.78 83
7200 3201 2.25 1.89 84
7600 3263 2.33 1.99 86
8000 3321 2.41 2.10 87
8400 3378 2.49 2.20 89
8800 3427 2.57 2.31 90
9200 3456 2.66 2.41 91
9600 3487 2.75 2.52 91
10000 3519 2.84 2.62 92
10400 3560 2.92 2.73 93
10800 3577 3.02 2.83 94
11200 3600 3.11 2.94 94
11600 3620 3.20 3.04 95
12000 3639 3.30 3.15 95
12400 3661 3.39 3.25 96
12800 3679 3.48 3.36 97
13200 3695 3.57 3.46 97
13600 3705 3.67 3.57 97
14000 3723 3.76 3.67 98
14400 3736 3.85 3.78 98
14800 3744 3.95 3.88 98
15200 3751 4.05 3.99 98
15600 3760 4.15 4.09 99
16000 3769 4.25 4.20 99
16400 3775 4.34 4.30 99
16800 3782 4.44 4.41 99
17200 3785 4.54 4.51 99
17600 3788 4.65 4.62 99
18000 3793 4.75 4.72 100
18400 3794 4.85 4.83 100
18800 3796 4.95 4.93 100
19200 3798 5.06 5.04 100
19600 3799 5.16 5.14 100
20000 3801 5.26 5.25 100

Let's first check the percentage, now in the fifth column (in the table above it was simply the number of found items, since the total was 100). We have 3811 items. After about that many tries, as you can see, ~63% of the items are found. Twice as many tries gives us 86%; three times, ~95%; four times, 98%; five times, ~100%. The percentages are similar to those in the first simulation.

Now on to the tries/found ratio (third column), which is the most interesting. When we reach 100%, it is ~4.7; in the first case it was 4.2. But look at some other milestones: when we reach 68%, this ratio is 1.62 and 1.69 in the first and second case, respectively; at 84%, it is 2.14 and 2.25. As you can see, it is quite constant. Of course, the differences grow in the last few percent, and even 100% doesn't mean you've got each and every item. But after x tries you can count how many distinct items you have, k, and estimate how much of the total corpus you've discovered.
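
This constancy can at least be made plausible without a full proof: for sampling with replacement, the expected number of distinct items after x tries is E[k] = n × (1 − (1 − 1/n)^x), which for large n is approximately n × (1 − e^(−x/n)). So, in expectation, both k/n and x/k depend only on the ratio x/n, not on n itself. A short sketch of the values this approximation predicts:

import math

# Expected coverage and tries/found ratio as a function of x/n,
# using the large-n approximation k ≈ n * (1 - exp(-x/n)).
for ratio in (1, 2, 3, 4, 5):          # x/n: tries as a multiple of n
    coverage = 1 - math.exp(-ratio)    # expected k/n
    print(f"x/n = {ratio}: ~{100*coverage:.0f}% found, x/k ≈ {ratio/coverage:.2f}")

These come out at ~63%, ~86%, ~95%, ~98%, and ~99% found, with x/k ≈ 1.58, 2.31, 3.16, 4.08, and 5.03 – right in line with both simulations.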

Math conjecture

Given n total items and x tries, the probability of finding exactly k distinct items is

Pr(k | n, x) = P(n, k) × S2(x, k) ÷ n^x, where 1 <= k <= min(n, x),

with P(n, k) = n!/(n−k)! the number of k-permutations of n items, and S2(x, k) a Stirling number of the second kind.

Therefore, argmax_k Pr(k | n, x) occurs when

S2(x, k−1) ÷ S2(x, k) + (k−1) <= n <= S2(x, k) ÷ S2(x, k+1) + k.

This is too hard to simplify further and relate to the x/k ratio.
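
The conjecture is easy to sanity-check numerically, though. A small sketch that computes Pr(k | n, x) directly, using the standard recurrence for the Stirling numbers (requires Python 3.8+ for math.perm):

from functools import lru_cache
from math import perm

@lru_cache(maxsize=None)
def stirling2(x, k):
    # Stirling number of the second kind, S2(x, k), by the usual recurrence.
    if k == 0:
        return 1 if x == 0 else 0
    if k > x:
        return 0
    return k * stirling2(x - 1, k) + stirling2(x - 1, k - 1)

def pr(k, n, x):
    # Pr(exactly k distinct items | n total, x tries) = P(n, k) * S2(x, k) / n^x
    return perm(n, k) * stirling2(x, k) / n ** x

# With n = 100 and x = 100 tries, the most likely k should come out close to
# the ~64 distinct items the first simulation found after 100 picks.
print(max(range(1, 101), key=lambda k: pr(k, 100, 100)))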

Conclusion

So, if you need to do such a discovery: make x tries, count the distinct items k, and calculate x/k. Then find the nearest ratio in one of the tables above, and you get an approximation of what percentage of all the items you've discovered.

We've also learned that if we want to discover almost all the items, we need to push that random button at least three times more often than the actual number of items, but it isn't worth more than five times that many tries. Twice the number of items still gives a fair result, though.

An example: a run of 94,836 successful queries on kepkezelo.com gave 40,418 distinct items. The x/k ratio is 2.35. According to the second table, that means ~86% of the items have been discovered. Thus, the total number of items is around 47,000, and we can expect that another run of the same size would get us to ~98% of them.
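
Instead of reading the percentage off a table, the same estimate can be computed directly. A sketch that inverts the large-n approximation k ≈ n × (1 − e^(−x/n)) by bisection (the function name is mine, not from any ArchiveTeam tool):

import math

def estimate_total(x, k):
    # Solve k = n * (1 - exp(-x/n)) for n by bisection; n is at least k.
    lo, hi = float(k), float(k) * 1e9
    for _ in range(200):
        n = (lo + hi) / 2
        if n * (1 - math.exp(-x / n)) < k:
            lo = n   # too few expected finds: the real n is larger
        else:
            hi = n
    return n

n_est = estimate_total(94836, 40418)  # the kepkezelo.com run above
print(f"~{n_est:.0f} items in total, ~{100 * 40418 / n_est:.0f}% discovered")

For the run above this lands near 46,500 items and ~87% coverage – essentially the same answer as the table lookup.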