Collecting items randomly


An ArchiveTeam member comes across a lot of websites to be saved, with all kinds of structures. Understanding how the items can be reached is crucial for building the item list, which can then easily be scraped with well-developed tools.

Imagine the following situation: a website assigns a long, unique identifier to each item, impossible to discover by brute force. The site doesn't provide an index or sitemap, of course. You don't even know the number of items. (There are sites that work this way.)

An obvious approach is discovery via Google/Bing/Common Crawl/whatever. But wait! The site lets you request a random item. You keep clicking the button and keep getting different items. And since you have the link, you can make this request automatically.

Simple! Request that URL x times, where x is a big number.
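
In practice, that loop is a few lines of Python. A minimal sketch – the endpoint URL here is made up, and it assumes the site answers the random-item request with a redirect to the item (adjust the ID extraction to the real site):

import requests  # third-party: pip install requests

RANDOM_URL = "https://example.com/random"  # hypothetical random-item endpoint

seen = set()
for x in range(1, 10001):
    r = requests.get(RANDOM_URL, allow_redirects=False)
    item_id = r.headers.get("Location")  # assumes a redirect points at the item
    if item_id:
        seen.add(item_id)
print(f"{len(seen)} distinct items after {x} tries")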

But how long should we try? Obviously, if the items are served truly at random, some items soon appear twice; you don't even need to go far to see that. The longer you run the discovery script, the more often already-seen items show up. How long is it worth trying to get new random items?

(TL;DR: jump to the Conclusion below.)

In combinatorics, the model for this situation is sampling with replacement. The most commonly discussed question about it is "what is the probability of picking...". Our question, however, is: after how many picks does the number of distinct items picked reach the total number of items, n? Or: after x picks, what percentage of all the items have we seen?

Not knowing the number n, we can't really answer this question. We can only watch the rate of getting new, not-yet-seen items, and observe how it decreases.

Or, we can run simulations in which we do know n and what percentage of the items we have seen, and try to find some constant relationship between the only two numbers we can observe in practice: the number of picks (tries) and the number of distinct items found.

Let's do this for, say, 100 items. We keep picking, and after each try we note how many items we have found. The following table – to keep it short – shows the state after every tenth try.

Python script 
import random

def simulate(n, tries, step):
    # Pick a random item `tries` times out of `n` and report progress every `step` picks.
    seen = [0] * n
    for x in range(1, tries + 1):
        seen[random.randrange(n)] = 1  # mark this item as found
        if x % step == 0:
            k = sum(seen)  # distinct items found so far
            print(f"x {x}\tk {k}\tx/k {x/k:.2f}\tx/n {x/n:.2f}\tk/n {100*k/n:.0f}%")

simulate(100, 500, 10)      # first simulation: 100 items
simulate(3811, 20000, 400)  # second simulation: 3811 items
Progress collecting 100 items randomly
Tries (x) Items found (k) Tries/found (x/k)
0 0 1.00
10 10 1.00
20 17 1.18
30 25 1.20
40 31 1.29
50 37 1.35
60 45 1.33
70 51 1.37
80 58 1.38
90 60 1.50
100 64 1.56
110 68 1.62
120 71 1.69
130 74 1.76
140 75 1.87
150 78 1.92
160 81 1.98
170 84 2.02
180 84 2.14
190 85 2.24
200 88 2.27
210 90 2.33
220 90 2.44
230 92 2.50
240 93 2.58
250 93 2.69
260 93 2.80
270 93 2.90
280 95 2.95
290 95 3.05
300 96 3.13
310 97 3.20
320 97 3.30
330 97 3.40
340 97 3.51
350 98 3.57
360 98 3.67
370 99 3.74
380 99 3.84
390 99 3.94
400 99 4.04
410 99 4.14
420 100 4.20
430 100 4.30
440 100 4.40
450 100 4.50
460 100 4.60
470 100 4.70
480 100 4.80
490 100 4.90
500 100 5.00

It's easy to see that if we pick 100 times, we get 64 distinct items, and even after 200 picks we only have 88% of the items. After 300 picks we're missing only 4%, and after 400 picks we have almost everything, and even more after 500 picks. (As this is only one experiment, these are approximate numbers.)

So, if we knew how many items there are, requesting a random item four times that many times would give us almost all of them. But what if we don't know the total number of items?

Look at the third column. When we reach 100%, the ratio of tries to found items is ~4.2. This is a number we can always calculate, and it is independent of the total number of items.

Don't ask me to prove this mathematically. Let me present another simulation instead, with numbers that are neither small nor round: say, 3811 items.

Progress collecting 3811 items randomly
Tries (x) Items found (k) Tries/found (x/k) Tries/total (x/n) Found % (k/n)
0 0 1.00 0.00 0
400 382 1.05 0.10 10
800 729 1.10 0.21 19
1200 1030 1.17 0.31 27
1600 1309 1.22 0.42 34
2000 1556 1.29 0.52 41
2400 1783 1.35 0.63 47
2800 1991 1.41 0.73 52
3200 2168 1.48 0.84 57
3600 2323 1.55 0.94 61
4000 2484 1.61 1.05 65
4400 2608 1.69 1.15 68
4800 2727 1.76 1.26 72
5200 2835 1.83 1.36 74
5600 2915 1.92 1.47 76
6000 3001 2.00 1.57 79
6400 3073 2.08 1.68 81
6800 3145 2.16 1.78 83
7200 3201 2.25 1.89 84
7600 3263 2.33 1.99 86
8000 3321 2.41 2.10 87
8400 3378 2.49 2.20 89
8800 3427 2.57 2.31 90
9200 3456 2.66 2.41 91
9600 3487 2.75 2.52 91
10000 3519 2.84 2.62 92
10400 3560 2.92 2.73 93
10800 3577 3.02 2.83 94
11200 3600 3.11 2.94 94
11600 3620 3.20 3.04 95
12000 3639 3.30 3.15 95
12400 3661 3.39 3.25 96
12800 3679 3.48 3.36 97
13200 3695 3.57 3.46 97
13600 3705 3.67 3.57 97
14000 3723 3.76 3.67 98
14400 3736 3.85 3.78 98
14800 3744 3.95 3.88 98
15200 3751 4.05 3.99 98
15600 3760 4.15 4.09 99
16000 3769 4.25 4.20 99
16400 3775 4.34 4.30 99
16800 3782 4.44 4.41 99
17200 3785 4.54 4.51 99
17600 3788 4.65 4.62 99
18000 3793 4.75 4.72 100
18400 3794 4.85 4.83 100
18800 3796 4.95 4.93 100
19200 3798 5.06 5.04 100
19600 3799 5.16 5.14 100
20000 3801 5.26 5.25 100

Let's first check the percentage, now in the fifth column (in the table above it was simply the number of found items, since the total was 100). We have 3811 items. After about that many tries, as you can see, ~63% of the items are found. Twice as many tries gives us 86%; three times, ~95%; four times, 98%; five times, ~100%. The percentages are similar to those in the first simulation.

Now on to the tries/found ratio (third column), which is the most interesting. When we reach 100%, it is ~4.7; in the first case it was 4.2. But look at some other milestones: when we reach 68%, this ratio is 1.62 and 1.69 in the first and second case, respectively; at 84%, it is 2.14 and 2.25. As you can see, it is quite constant. Of course, the differences grow in the last few percent, and even 100% doesn't mean you've got each and every item. But after x tries you can count how many distinct items you have, k, and estimate how much of the total corpus you've discovered.
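
This constancy can at least be made plausible without a full proof: for sampling with replacement, the expected number of distinct items after x tries is E[k] = n × (1 − (1 − 1/n)^x), which for large n is approximately n × (1 − e^(−x/n)). So, in expectation, both k/n and x/k depend only on the ratio x/n, not on n itself. A short sketch of the values this approximation predicts:

import math

# Expected coverage and tries/found ratio as a function of x/n,
# using the large-n approximation k ≈ n * (1 - exp(-x/n)).
for ratio in (1, 2, 3, 4, 5):          # x/n: tries as a multiple of n
    coverage = 1 - math.exp(-ratio)    # expected k/n
    print(f"x/n = {ratio}: ~{100*coverage:.0f}% found, x/k ≈ {ratio/coverage:.2f}")

These come out at ~63%, ~86%, ~95%, ~98%, and ~99% found, with x/k ≈ 1.58, 2.31, 3.16, 4.08, and 5.03 – right in line with both simulations.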

Math conjecture

Given n total items and x tries, the probability of finding exactly k distinct items is

Pr(k | n, x) = P(n, k) × S2(x, k) ÷ n^x, where 1 <= k <= min(n, x),

with P(n, k) = n!/(n−k)! the number of k-permutations of n items, and S2(x, k) a Stirling number of the second kind.

Therefore, argmax_k Pr(k | n, x) occurs when

S2(x, k−1) ÷ S2(x, k) + (k−1) <= n <= S2(x, k) ÷ S2(x, k+1) + k.

This is too hard to simplify further and relate to the x/k ratio.
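
The conjecture is easy to sanity-check numerically, though. A small sketch that computes Pr(k | n, x) directly, using the standard recurrence for the Stirling numbers (requires Python 3.8+ for math.perm):

from functools import lru_cache
from math import perm

@lru_cache(maxsize=None)
def stirling2(x, k):
    # Stirling number of the second kind, S2(x, k), by the usual recurrence.
    if k == 0:
        return 1 if x == 0 else 0
    if k > x:
        return 0
    return k * stirling2(x - 1, k) + stirling2(x - 1, k - 1)

def pr(k, n, x):
    # Pr(exactly k distinct items | n total, x tries) = P(n, k) * S2(x, k) / n^x
    return perm(n, k) * stirling2(x, k) / n ** x

# With n = 100 and x = 100 tries, the most likely k should come out close to
# the ~64 distinct items the first simulation found after 100 picks.
print(max(range(1, 101), key=lambda k: pr(k, 100, 100)))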

Conclusion

So, if you need to do such a discovery: make x tries, count the distinct items k, and calculate x/k. Then find the nearest ratio in one of the tables above, and you get an approximation of what percentage of all the items you've discovered.

We've also learned that if we want to discover almost all the items, we need to push that random button at least three times more often than the actual number of items, but it isn't worth more than five times that many tries. Twice the number of items still gives a fair result, though.

An example: a run of 94,836 successful queries on kepkezelo.com gave 40,418 distinct items. The x/k ratio is 2.35. According to the second table, that means ~86% of the items have been discovered. Thus, the total number of items is around 47,000, and we can expect that another run of the same size would get us to ~98% of them.
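
Instead of reading the percentage off a table, the same estimate can be computed directly. A sketch that inverts the large-n approximation k ≈ n × (1 − e^(−x/n)) by bisection (the function name is mine, not from any ArchiveTeam tool):

import math

def estimate_total(x, k):
    # Solve k = n * (1 - exp(-x/n)) for n by bisection; n is at least k.
    lo, hi = float(k), float(k) * 1e9
    for _ in range(200):
        n = (lo + hi) / 2
        if n * (1 - math.exp(-x / n)) < k:
            lo = n   # too few expected finds: the real n is larger
        else:
            hi = n
    return n

n_est = estimate_total(94836, 40418)  # the kepkezelo.com run above
print(f"~{n_est:.0f} items in total, ~{100 * 40418 / n_est:.0f}% discovered")

For the run above this lands near 46,500 items and ~87% coverage – essentially the same answer as the table lookup.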