How likely is it that the next thing we see is one of a brand new kind? That sounds like a hard question. Last time I told you about the Good–Turing rule for answering this question.
The discussion that blog entry triggered has been very helpful! Among other things, it got Lou Jost more interested in this subject. Two days ago, he showed me the following simple argument for the Good–Turing estimate.
Suppose there are finitely many species of orchid. Suppose the fraction of orchids belonging to the $i$th species is $p_i$.
Suppose we start collecting orchids. Suppose each time we find one, the chance that it’s an orchid of the $i$th species is $p_i$. Of course this is not true in reality! For example, it’s harder to find a tiny orchid than a big one. But never mind.
Say we collect a total of $n$ orchids. What is the probability that we find no orchids of the $i$th species? It is

$$(1-p_i)^n$$
Similarly, the probability that we find exactly one orchid of the $i$th species is

$$n p_i (1-p_i)^{n-1}$$
And so on: these are the $k = 0$ and $k = 1$ terms of a binomial distribution.
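As a quick sanity check, the two probabilities above are just the $k = 0$ and $k = 1$ terms of a binomial distribution. Here is a small numerical sketch; the abundance $p$ and sample size $n$ are made up for illustration:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100   # orchids collected (hypothetical)
p = 0.02  # fraction belonging to the i-th species (hypothetical)

p0 = (1 - p)**n                # we find no orchids of this species
p1 = n * p * (1 - p)**(n - 1)  # we find exactly one orchid of this species

# These closed forms agree with the general binomial formula:
assert abs(p0 - binom_pmf(0, n, p)) < 1e-12
assert abs(p1 - binom_pmf(1, n, p)) < 1e-12
```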
Let $S$ be the expected number of singletons: species for which we find exactly one orchid of that species. Then

$$S = \sum_i n p_i (1-p_i)^{n-1}$$
Let $D$ be the coverage deficit: the expected fraction of the total population consisting of species that remain undiscovered. Given our assumptions, this is the same as the chance that the next orchid we find will be of a brand new species. We have

$$D = \sum_i p_i (1-p_i)^n$$

since $p_i$ is the fraction of orchids belonging to the $i$th species and $(1-p_i)^n$ is the chance that this species remains undiscovered.
Lou Jost pointed out that the formulas for $S$ and $D$ are very similar! In particular,

$$D = \sum_i p_i (1-p_i)^n$$

should be very close to

$$\frac{S}{n} = \sum_i p_i (1-p_i)^{n-1}$$

when $n$ is large. So, we should have

$$D \approx \frac{S}{n}$$
In other words: the chance that the next orchid we find is of a brand new species should be close to the fraction of orchids that are singletons now.
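To get a feeling for how good this approximation is, we can compute both sums exactly for a made-up species distribution (the abundances and sample size below are arbitrary):

```python
# Hypothetical species abundances p_i, summing to 1.
p = [0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.02]
n = 50  # number of orchids collected (also made up)

# Expected number of singletons divided by n:  S/n = sum_i p_i (1 - p_i)^(n-1)
S_over_n = sum(pi * (1 - pi)**(n - 1) for pi in p)

# Coverage deficit:  D = sum_i p_i (1 - p_i)^n
D = sum(pi * (1 - pi)**n for pi in p)

print(S_over_n, D)  # the two numbers are close
```

Note that $D < S/n$ always holds here, since each term of $D$ is the corresponding term of $S/n$ multiplied by the extra factor $(1-p_i) < 1$.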
Of course it would be nice to turn these ‘shoulds’ into precise theorems! Theorem 1 in this paper does that:
• David McAllester and Robert E. Schapire, On the convergence rate of Good–Turing estimators, February 17, 2000.
By the way: the only difference between the formulas for $D$ and $S/n$ is that the first contains the exponent $n$ while the second contains the exponent $n-1$. So, Lou Jost’s argument is a version of Boris Borcic’s ‘time-reversal’ idea:
Good’s estimate is what you immediately obtain if you time-reverse your sampling procedure, e.g., if you ask for the probability that there is a change in the number of species in your sample when you randomly remove a specimen from it.
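The time-reversed statement can be checked directly: removing a uniformly random specimen from a sample decreases the number of species present exactly when the removed specimen is a singleton, so that probability equals the fraction of singletons in the sample. A small sketch, using a made-up sample:

```python
from collections import Counter

# A hypothetical sample of n specimens, labelled by species.
sample = ["a", "a", "b", "c", "c", "c", "d", "e"]
n = len(sample)
counts = Counter(sample)

# Fraction of specimens that are the sole representative of their species.
singleton_fraction = sum(1 for c in counts.values() if c == 1) / n

# Probability that deleting a uniformly random specimen changes the species
# count: deleting specimen s removes a species iff counts[s] == 1.
prob_species_lost = sum(1 for s in sample if counts[s] == 1) / n

print(singleton_fraction, prob_species_lost)  # identical by construction
```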