How likely is it that the next thing we see is one of a brand new kind? That sounds like a hard question. Last time I told you about the Good–Turing rule for answering this question.
The discussion that blog entry triggered has been very helpful! Among other things, it got Lou Jost more interested in this subject. Two days ago, he showed me the following simple argument for the Good–Turing estimate.
Suppose there are finitely many species of orchid. Suppose the fraction of orchids belonging to the $i$th species is $p_i.$
Suppose we start collecting orchids. Suppose each time we find one, the chance that it’s an orchid of the $i$th species is $p_i.$
Of course this is not true in reality! For example, it’s harder to find a tiny orchid than a big one. But never mind.
Say we collect a total of $N$ orchids. What is the probability that we find no orchids of the $i$th species? It is
$$ (1 - p_i)^N $$
Similarly, the probability that we find exactly one orchid of the $i$th species is
$$ N p_i (1 - p_i)^{N-1} $$
And so on: these are the first two terms in a binomial series.
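As a quick check of these two terms, here is a minimal Python sketch. It writes `p` for the fraction of orchids in one species and `n` for the number of orchids collected; the specific values 0.02 and 50 are made up purely for illustration.

```python
from math import comb

def prob_exactly_k(p: float, n: int, k: int) -> float:
    """Binomial probability of finding exactly k orchids of a species
    with abundance p among n independent finds."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical abundance and sample size, chosen only for illustration.
p, n = 0.02, 50

prob_none = prob_exactly_k(p, n, 0)  # first term:  (1 - p)**n
prob_one = prob_exactly_k(p, n, 1)   # second term: n * p * (1 - p)**(n - 1)

# Summing every term of the binomial series should give 1, up to rounding.
total = sum(prob_exactly_k(p, n, k) for k in range(n + 1))

print(prob_none, prob_one, total)
```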
Let $n_1$ be the expected number of singletons: species for which we find exactly one orchid of that species. Then
$$ n_1 = \sum_i N p_i (1 - p_i)^{N-1} $$
Let $D$ be the coverage deficit: the expected fraction of the total population consisting of species that remain undiscovered. Given our assumptions, this is the same as the chance that the next orchid we find will be of a brand new species.
Then
$$ D = \sum_i p_i (1 - p_i)^N $$
since $p_i$ is the fraction of orchids belonging to the $i$th species and $(1 - p_i)^N$ is the chance that this species remains undiscovered.
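Both expected values can be evaluated directly once the species abundances are known. The sketch below does this for a hypothetical community of 200 species with Zipf-like abundances; that community, the sample size 500, and the function names are assumptions made for illustration, not anything taken from real orchid data.

```python
def expected_singletons(abundances, n):
    """Expected number of species seen exactly once among n finds:
    the sum over species of n * p * (1 - p)**(n - 1)."""
    return sum(n * p * (1 - p) ** (n - 1) for p in abundances)

def coverage_deficit(abundances, n):
    """Expected fraction of the population in species still unseen
    after n finds: the sum over species of p * (1 - p)**n."""
    return sum(p * (1 - p) ** n for p in abundances)

# Hypothetical community: species of rank r has weight 1/r (an assumption).
weights = [1 / rank for rank in range(1, 201)]
total_weight = sum(weights)
abundances = [w / total_weight for w in weights]

n = 500  # hypothetical number of orchids collected
n1 = expected_singletons(abundances, n)
d = coverage_deficit(abundances, n)

print(f"expected number of singletons: {n1:.3f}")
print(f"coverage deficit:              {d:.5f}")
```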
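Lou Jost pointed out that the formulas for $n_1$ and $D$ are very similar! In particular,
$$ \frac{n_1}{N} = \sum_i p_i (1 - p_i)^{N-1} $$
should be very close to
$$ D = \sum_i p_i (1 - p_i)^N $$
when $N$ is large. So, we should have
$$ D \approx \frac{n_1}{N} $$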
In other words: the chance that the next orchid we find is of a brand new species should be close to the fraction of orchids that are singletons now.
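Here is one way to see how close the two quantities are, at least for the expected values. Write $p_i$ for the fraction of orchids in the $i$th species, $N$ for the number collected, $n_1$ for the expected number of singletons and $D$ for the coverage deficit, and subtract one sum from the other. This is only a sketch of a bound on expectations under the assumptions above; it says nothing about how much an actual sample fluctuates around them.

```latex
\begin{aligned}
\frac{n_1}{N} - D
  &= \sum_i p_i (1-p_i)^{N-1} - \sum_i p_i (1-p_i)^{N} \\
  &= \sum_i p_i (1-p_i)^{N-1} \bigl[ 1 - (1-p_i) \bigr] \\
  &= \sum_i p_i^2 (1-p_i)^{N-1} \;\ge\; 0, \\[4pt]
\sum_i p_i^2 (1-p_i)^{N-1}
  &\le \Bigl( \max_{0 \le x \le 1} x (1-x)^{N-1} \Bigr) \sum_i p_i
   = \frac{1}{N} \Bigl( 1 - \frac{1}{N} \Bigr)^{N-1}
   \le \frac{1}{N}.
\end{aligned}
```

The maximum of $x(1-x)^{N-1}$ is attained at $x = 1/N$, and $\sum_i p_i = 1$. So in expectation $n_1/N$ overestimates $D$, but by at most $1/N$.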
Of course it would be nice to turn these ‘shoulds’ into precise theorems! Theorem 1 in this paper does that:
• David McAllester and Robert E. Schapire, On the convergence rate of Good–Turing estimators, February 17, 2000.
By the way: the only difference between the formulas for $D$ and $n_1/N$ is that the first contains the exponent $N$ while the second contains the exponent $N-1.$
So, Lou Jost’s argument is a version of Boris Borcic’s ‘time-reversal’ idea:
Good’s estimate is what you immediately obtain if you time-reverse your sampling procedure, e.g., if you ask for the probability that there is a change in the number of species in your sample when you randomly remove a specimen from it.
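The time-reversed procedure is easy to try numerically. Removing a randomly chosen specimen lowers the number of species in the sample exactly when that specimen is a singleton, so the chance of a change is the number of singletons divided by the sample size. The sketch below compares that count with a brute-force simulation; the community (100 species with weight 1/k), the sample size 300, the seed, and the function names are all hypothetical choices for illustration.

```python
import random
from collections import Counter

def chance_removal_changes_species_count(sample):
    """Exact chance that deleting one uniformly chosen specimen lowers the
    number of species in the sample. This happens only for singletons."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

def simulate_removal(sample, trials, rng):
    """Monte Carlo version: actually remove a random specimen and check
    whether the number of distinct species went down."""
    species_before = len(set(sample))
    changed = 0
    for _ in range(trials):
        i = rng.randrange(len(sample))
        remaining = sample[:i] + sample[i + 1:]
        if len(set(remaining)) < species_before:
            changed += 1
    return changed / trials

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical community: species k is found with weight 1/k (an assumption).
species = list(range(1, 101))
weights = [1 / k for k in species]
sample = rng.choices(species, weights=weights, k=300)

exact = chance_removal_changes_species_count(sample)
estimate = simulate_removal(sample, 20_000, rng)
print(f"singletons / sample size: {exact:.4f}")
print(f"simulated removal:        {estimate:.4f}")
```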
A bit off topic, but it seems that there are hardly any posts about global warming these days.
I wonder: as you learned more about climate research, did you become more or less confident that man-made emissions are the main driver of recent climate change, and that global warming is a serious threat to our civilization and other species in the near term (this century)?
I’m more confident, not less. These days I mainly post about global warming here. The big question for me is what should I do about it? Publicizing information is useful, but not sufficiently satisfying to make it my full-time career. I’m good at math. It doesn’t make sense for me to become a journalist or a climate scientist. So, I’m focusing on shifting my career away from pure mathematics toward mathematics that is:
1) attractive to ambitious mathematicians, for example good grad students
yet:
2) eventually relevant to environmental issues.
These goals are somewhat in conflict, since real-world issues tend to involve lots of domain-specific knowledge and number-crunching, which are exactly what mathematicians dislike. Nonetheless there’s a lot of room for mathematicians to develop new formalisms that can eventually be helpful, when applied and adapted by other more practical people. Mathematicians are never near the front of the battle lines. But they can do things other people can’t.
So, I’ve been spending lots of time thinking about complex systems made of many interacting parts, using ideas from probability theory, information theory and game theory. And that’s what the information geometry, network theory and biodiversity posts are about. All these posts are actually about the same big subject. I hope that by the time I go back to U.C. Riverside in September, I’ll have enough material developed that I can run an interesting seminar about it and attract a crew of good grad students.
When it comes to global warming, a lot of programmers on the Azimuth Forum are working to develop simple online climate models for educational purposes. These will appear on the blog. My involvement has been limited mainly because I don’t like to program, but also because I’m busy trying to shift careers. I hope to get back into that in a while.
A typo: “Similarly, the probability that we find exactly one orchid of the $i$th species is $N p_i (1 - p_i)^N$”. The exponent should be $N-1$.
Thanks, I’ll fix that!
Lou Jost gets around. I first heard of him when a fellow oil economics blogger caught wind of what I was working on and said it was similar to the work of Jost, who was a friend of his. See the comment at the bottom of this post of mine:
http://mobjectivist.blogspot.com/2010/04/entroplet-species-area-relationships.html
I rather believe species diversity is governed more by dispersion and chance than by specific mechanism. So it is more about filling up the state space of possible growth paths, leading to a spread in agglomeration values.
BTW, Is anyone following the work of David Mumford on Pattern Theory?
The approximation’s validity seems pretty consistent w/ intuition even for those w/o deep math background. For large $N$, the species with the largest contributions to both the singleton and the undiscovered formulas are not surprisingly among the rarest species, i.e. precisely those for which $1 - p_i$ is closest to 1, i.e. precisely those for which the $N$th and the $(N-1)$st powers of $1 - p_i$ are most nearly equal.
Yes, good point.