The Mathematics of Biodiversity (Part 2)

How likely is it that the next thing we see is one of a brand new kind? That sounds like a hard question. Last time I told you about the Good–Turing rule for answering this question.

The discussion that blog entry triggered has been very helpful! Among other things, it got Lou Jost more interested in this subject. Two days ago, he showed me the following simple argument for the Good–Turing estimate.

Suppose there are finitely many species of orchid. Suppose the fraction of orchids belonging to the ith species is p_i.

Suppose we start collecting orchids. Suppose each time we find one, the chance that it’s an orchid of the ith species is p_i. Of course this is not true in reality! For example, it’s harder to find a tiny orchid than a big one. But never mind.

Say we collect a total of N orchids. What is the probability that we find no orchids of the ith species? It is

(1 - p_i)^N

Similarly, the probability that we find exactly one orchid of the ith species is

N p_i (1 - p_i)^{N-1}

And so on: these are the first two terms in a binomial series.
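To spell out the pattern: if each orchid we find is an independent draw, as assumed above, the chance of finding exactly k orchids of the ith species among N is the general binomial term (here k is just a counting index introduced for illustration):

\displaystyle{ \binom{N}{k} p_i^k (1 - p_i)^{N-k} }

Setting k = 0 and k = 1 gives back the two formulas above, and summing over k = 0, 1, \dots, N gives (p_i + (1 - p_i))^N = 1.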

Let n_1 be the expected number of singletons: species for which we find exactly one orchid of that species. Then

\displaystyle{ n_1 = \sum_i N p_i (1 - p_i)^{N-1} }

Let D be the coverage deficit: the expected fraction of the total population consisting of species that remain undiscovered. Given our assumptions, this is the same as the chance that the next orchid we find will be of a brand new species. Then

\displaystyle{ D = \sum_i p_i (1-p_i)^N }

since p_i is the fraction of orchids belonging to the ith species and (1-p_i)^N is the chance that this species remains undiscovered.
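To get a feel for these two sums, here is a minimal Python sketch that evaluates them directly. The community is purely hypothetical: 50 species with abundances proportional to 1/rank, and a sample of N = 100 orchids. The function names are made up for this illustration.

```python
def expected_singletons(p, N):
    """Expected number of species found exactly once: sum_i N p_i (1 - p_i)^(N-1)."""
    return sum(N * pi * (1 - pi) ** (N - 1) for pi in p)


def coverage_deficit(p, N):
    """Expected population fraction in undiscovered species: sum_i p_i (1 - p_i)^N."""
    return sum(pi * (1 - pi) ** N for pi in p)


# Hypothetical abundances (an assumption, not data): proportional to 1/rank.
weights = [1 / rank for rank in range(1, 51)]
total = sum(weights)
p = [w / total for w in weights]

N = 100
n1 = expected_singletons(p, N)
D = coverage_deficit(p, N)
print(f"N = {N}: expected singletons n_1 = {n1:.3f}, coverage deficit D = {D:.4f}")
```

Changing the weights or N lets you see how both quantities respond to a community with more or fewer rare species.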

Lou Jost pointed out that the formulas for n_1 and D are very similar! In particular,

\displaystyle{ \frac{n_1}{N} = \sum_i p_i (1 - p_i)^{N-1} }

should be very close to

\displaystyle{ D = \sum_i p_i (1 - p_i)^N }

when N is large. So, we should have

\displaystyle{ D \approx \frac{n_1}{N} }

In other words: the chance that the next orchid we find is of a brand new species should be close to the fraction of orchids that are singletons now.
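One way to see how close the two expected values are, under the same assumptions, is to subtract the sums term by term:

\displaystyle{ \frac{n_1}{N} - D = \sum_i p_i (1 - p_i)^{N-1} \left[ 1 - (1 - p_i) \right] = \sum_i p_i^2 (1 - p_i)^{N-1} }

Every term is nonnegative. Also, for 0 \le p \le 1 the function p (1-p)^{N-1} is largest at p = 1/N, where it equals (1/N)(1 - 1/N)^{N-1} \le 1/N. Since the p_i sum to 1, this gives

\displaystyle{ 0 \le \frac{n_1}{N} - D \le \frac{1}{N} \sum_i p_i = \frac{1}{N} }

This is a statement about expected values only. It says nothing about how far the singleton fraction observed in one particular sample strays from its mean.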

Of course it would be nice to turn these ‘shoulds’ into precise theorems! Theorem 1 in this paper does that:

• David McAllester and Robert E. Schapire, On the convergence rate of Good–Turing estimators, February 17, 2000.

By the way: the only difference between the formulas for n_1/N and D is that the first contains the exponent N-1, while the second contains the exponent N. So, Lou Jost’s argument is a version of Boris Borcic’s ‘time-reversal’ idea:

Good’s estimate is what you immediately obtain if you time-reverse your sampling procedure, e.g., if you ask for the probability that there is a change in the number of species in your sample when you randomly remove a specimen from it.
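Here is a small Python sketch of this time-reversed view, under the same independent-sampling assumption. Removing a random specimen changes the number of species in the sample exactly when that specimen is a singleton, so the chance of a change is the number of singletons divided by N. The community (50 species, abundances proportional to 1/rank), the sample size and the function names are all hypothetical choices for illustration.

```python
import random
from collections import Counter


def removal_change_probability(sample):
    """Chance that removing one uniformly random specimen changes the species count.

    The count drops exactly when the removed specimen is a singleton,
    so this equals (number of singletons) / (sample size).
    """
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def brute_force_check(sample):
    """The same quantity, computed by actually removing each specimen in turn."""
    species_before = len(set(sample))
    changed = 0
    for k in range(len(sample)):
        remaining = sample[:k] + sample[k + 1:]
        if len(set(remaining)) != species_before:
            changed += 1
    return changed / len(sample)


# Hypothetical community (an assumption, not data): abundances proportional to 1/rank.
rng = random.Random(0)
weights = [1 / rank for rank in range(1, 51)]
sample = rng.choices(range(50), weights=weights, k=200)

estimate = removal_change_probability(sample)

# True chance that the next draw is a new species, given this particular sample.
seen = set(sample)
missing_mass = sum(w for s, w in enumerate(weights) if s not in seen) / sum(weights)

print(f"singleton fraction (Good-Turing estimate): {estimate:.4f}")
print(f"true chance of a new species next:         {missing_mass:.4f}")
```

The two printed numbers need not agree closely for a single sample; the argument above only says their expected values are close.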

7 Responses to The Mathematics of Biodiversity (Part 2)

  1. Arrow says:

    A bit off topic, but it seems that there are hardly any posts about global warming these days.

    I wonder, as you learned more about climate research did you become more or less confident man made emissions are the main driver of recent climate change and that global warming is a serious threat to our civilization and other species in the near term (this century)?

    • John Baez says:

      I’m more confident, not less. These days I mainly post about global warming here. The big question for me is what should I do about it? Publicizing information is useful, but not sufficiently satisfying to make it my full-time career. I’m good at math. It doesn’t make sense for me to become a journalist or a climate scientist. So, I’m focusing on shifting my career away from pure mathematics toward mathematics that is:

      1) attractive to ambitious mathematicians, for example good grad students

      2) eventually relevant to environmental issues.

      These goals are somewhat in conflict, since real-world issues tend to involve lots of domain-specific knowledge and number-crunching, which are exactly what mathematicians dislike. Nonetheless there’s a lot of room for mathematicians to develop new formalisms that can eventually be helpful, when applied and adapted by other more practical people. Mathematicians are never near the front of the battle lines. But they can do things other people can’t.

      So, I’ve been spending lots of time thinking about complex systems made of many interacting parts, using ideas from probability theory, information theory and game theory. And that’s what the information geometry, network theory and biodiversity posts are about. All these posts are actually about the same big subject. I hope that by the time I go back to U.C. Riverside in September, I’ll have enough material developed that I can run an interesting seminar about it and attract a crew of good grad students.

      When it comes to global warming, a lot of programmers on the Azimuth Forum are working to develop simple online climate models for educational purposes. These will appear on the blog. My involvement has been limited mainly because I don’t like to program, but also because I’m busy trying to shift careers. I hope to get back into that in a while.

  2. S says:

    A typo: “Similarly, the probability that we find exactly one orchid of the ith species is N p_i (1-p_i)^N“. The exponent should be N-1.

  3. Lou Jost gets around. I first heard of him when a fellow oil economics blogger caught wind of what I was working on and said it was similar to Jost’s work, who was a friend of his. See the comment at the bottom of this post of mine:

    I rather believe species diversity is governed more by dispersion and chance than by specific mechanism. So it is more about filling up the state space of possible growth paths, leading to a spread in agglomeration values.

    BTW, Is anyone following the work of David Mumford on Pattern Theory?

  4. arch1 says:

    The approximation’s validity seems pretty consistent w/ intuition even for those w/o deep math background. For large N, the species with the largest contributions to both the singleton and the undiscovered formulas are not surprisingly among the rarest species, i.e. precisely those for which (1-p_i) is closest to 1, i.e. precisely those for which the (N-1)st and the Nth powers of (1-p_i) are most nearly equal.
