Biodiversity, Entropy and Thermodynamics

27 October, 2014


I’m giving a short 30-minute talk at a workshop on Biological and Bio-Inspired Information Theory at the Banff International Research Institute.

I’ll say more about the workshop later, but here’s my talk, in PDF and video form:

Biodiversity, entropy and thermodynamics.

Most of the people at this workshop study neurobiology and cell signalling, not evolutionary game theory or biodiversity. So, the talk is just a quick intro to some things we’ve seen before here. Starting from scratch, I derive the Lotka–Volterra equation describing how the distribution of organisms of different species changes with time. Then I use it to prove a version of the Second Law of Thermodynamics.

This law says that if there is a ‘dominant distribution’—a distribution of species whose mean fitness is at least as great as that of any population it finds itself amidst—then as time passes, the information any population has ‘left to learn’ always decreases!

Of course reality is more complicated, but this result is a good start.

This was proved by Siavash Shahshahani in 1979. For more, see:

• Lou Jost, Entropy and diversity.

• Marc Harper, The replicator equation as an inference dynamic.

• Marc Harper, Information geometry and evolutionary game theory.

and more recent papers by Harper.

Life’s Struggle to Survive

19 December, 2013

Here’s the talk I gave at the SETI Institute:

When pondering the number of extraterrestrial civilizations, it is worth noting that even after it got started, the success of life on Earth was not a foregone conclusion. In this talk, I recount some thrilling episodes from the history of our planet, some well-documented but others merely theorized: our collision with the planet Theia, the oxygen catastrophe, the snowball Earth events, the Permian-Triassic mass extinction event, the asteroid that hit Chicxulub, and more, including the massive environmental changes we are causing now. All of these hold lessons for what may happen on other planets!

To watch the talk, click on the video above. To see
slides of the talk, click here!

Here’s a mistake in my talk that doesn’t appear in the slides: I suggested that Theia started at the Lagrange point in Earth’s orbit. After my talk, an expert said that at that time, the Solar System had lots of objects with orbits of high eccentricity, and Theia was probably one of these. He said the Lagrange point theory is an idiosyncratic theory, not widely accepted, that somehow found its way onto Wikipedia.

Another issue was brought up in the questions. In a paper in Science, Sherwood and Huber argued that:

Any exceedence of 35 °C for extended periods should
induce hyperthermia in humans and other mammals, as dissipation of metabolic heat becomes impossible. While this never happens now, it would begin to occur with global-mean warming of about 7 °C, calling the habitability of some regions into question. With 11-12 °C warming, such regions would spread to encompass the majority of the human population as currently distributed. Eventual warmings of 12 °C are
possible from fossil fuel burning.

However, the Paleocene-Eocene Thermal Maximum seems to have been even hotter:

So, the question is: where did mammals live during this period, which mammals went extinct, if any, and does the survival of other mammals call into question Sherwood and Huber’s conclusion?

Monarch Butterflies

25 November, 2013

Have you ever seen one of these? It’s a Monarch Butterfly. Every spring, millions fly from Mexico and southern California to other parts of the US and southern Canada. And every autumn, they fly back. On the first of November, called the Day of the Dead, people celebrate the return of the monarchs to the mountainous fir forests of Central Mexico.

But their numbers are dropping. In 1997, there were 150 million. Last year there were only 60 million. One problem is the gradual sterilization of American farmlands thanks to powerful herbicides like Roundup. Monarch butterfly larvae eat a plant called milkweed. But the amount of this plant in Iowa, for example, has dropped between 60% and 90% over the last decade.

And this year was much worse for the monarchs. They came late to Mexico… and I think only 3 million have been seen so far! That’s a stunning decrease!

Some blame the intense drought that hit the US in recent years—the sort of drought we can expect to become more frequent as global warming proceeds.

Earlier this year, Michael Risnit wrote this in USA Today:

Illegal logging in the Mexican forests where they spend the winter, new climate patterns and the disappearance of milkweed—the only plant on which monarchs lay their eggs and on which their caterpillars feed—are being blamed for their shrinking numbers.

Brooke Beebe, former director of the Native Plant Center at Westchester Community College in Valhalla, N.Y., collects monarch eggs, raises them from caterpillar to butterfly and releases them.

“I do that when they’re here. They’re not here,” she said.

The alarm over disappearing monarchs intensified this spring when conservation organizations reported that the amount of Mexican forest the butterflies occupied was at its lowest in 20 years. The World Wildlife Fund, in partnership with a Mexican wireless company and Mexico’s National Commission of Protected Areas, found nine hibernating colonies occupied almost 3 acres during the 2012-13 winter, a 59% decrease from the previous winter.

Because the insects can’t be counted individually, the colonies’ total size is used. Almost 20 years ago, the colonies covered about 45 acres. A couple of acres contains millions of monarchs.

“The monarch population is pretty strong, except it’s not as strong as it used to be and we find out it keeps getting smaller and smaller,” said Travis Brady, the education director at the Greenburgh Nature Center here.

Monarchs arrived at the nature center later this year and in fewer numbers, Brady said.

The nature center’s butterfly house this summer was aflutter with red admirals, giant swallowtails, painted ladies and monarchs, among others. But the last were difficult to obtain because collectors supplying the center had trouble finding monarch eggs in the wild, he said.

No one is suggesting monarchs will become extinct. The concern is whether the annual migration will remain sustainable, said Jeffrey Glassberg, the North American Butterfly Association’s president.

The record low shouldn’t set off a panic, said Marianna T. Wright, executive director of the National Butterfly Center in Texas, a project of the butterfly association.

“It should certainly get some attention,” she said. “I do think the disappearance of milkweed nationwide needs to be addressed. If you want to have monarchs, you have to have milkweed.”

Milkweed is often not part of suburban landscape, succumbing to lawn mowers and weed whackers, monarch advocates point out. Without it, monarch eggs aren’t laid and monarch caterpillars can’t feed and develop into winged adults.

“Many people know milkweed, and many people like it,” said Brady at the nature center. “And a lot of people actively try to destroy it. The health of the monarch population is solely dependent on the milkweed plant.”

The widespread use of herbicide-resistant corn and soybeans, which has resulted in the loss of more than 80 million acres of monarch habitat in recent years, also threatens the plant, according to the website Monarch Watch. In spraying fields to eradicate unwanted plants, Midwest farmers also eliminate butterflies’ habitat.

The 2012 drought and wildfires in Texas also made butterfly life difficult. All monarchs heading to or from the eastern two-thirds of the country pass through the state.

So—check out Monarch Watch! Plant some milkweed and make your yard insect-friendly in other ways… like mine!

I may seem like a math nerd, but I’m out there every weekend gardening. My wife Lisa is the real driving force behind this operation, but I’ve learned to love working with plants, soil, and compost. The best thing we ever did is tear out the lawn. Lawns are boring, let native plants flourish! Even if you don’t like insects, birds eat them, and you’ve gotta like birds. Let the beauty of nature start right where you live.

Maximum Entropy and Ecology

21 February, 2013

I already talked about John Harte’s book on how to stop global warming. Since I’m trying to apply information theory and thermodynamics to ecology, I was also interested in this book of his:

John Harte, Maximum Entropy and Ecology, Oxford U. Press, Oxford, 2011.

There’s a lot in this book, and I haven’t absorbed it all, but let me try to briefly summarize his maximum entropy theory of ecology. This aims to be “a comprehensive, parsimonious, and testable theory of the distribution, abundance, and energetics of species across spatial scales”. One great thing is that he makes quantitative predictions using this theory and compares them to a lot of real-world data. But let me just tell you about the theory.

It’s heavily based on the principle of maximum entropy (MaxEnt for short), and there are two parts:

Two MaxEnt calculations are at the core of the theory: the first yields all the metrics that describe abundance and energy distributions, and the second describes the spatial scaling properties of species’ distributions.

Abundance and energy distributions

The first part of Harte’s theory is all about a conditional probability distribution

R(n,\epsilon | S_0, N_0, E_0)

which he calls the ecosystem structure function. Here:

S_0: the total number of species under consideration in some area.

N_0: the total number of individuals under consideration in that area.

E_0: the total rate of metabolic energy consumption of all these individuals.

Given this,

R(n,\epsilon | S_0, N_0, E_0) \, d \epsilon

is the probability that given S_0, N_0, E_0, if a species is picked from the collection of species, then it has n individuals, and if an individual is picked at random from that species, then its rate of metabolic energy consumption is in the interval (\epsilon, \epsilon + d \epsilon).

Here of course d \epsilon is ‘infinitesimal’, meaning that we take a limit where it goes to zero to make this idea precise (if we’re doing analytical work) or take it to be very small (if we’re estimating R from data).

I believe that when we ‘pick a species’ we’re treating them all as equally probable, not weighting them according to their number of individuals.

Clearly R obeys some constraints. First, since it’s a probability distribution, it obeys the normalization condition:

\displaystyle{ \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) = 1 }

Second, since the average number of individuals per species is N_0/S_0, we have:

\displaystyle{ \sum_n \int d \epsilon \; n R(n,\epsilon | S_0, N_0, E_0) = N_0 / S_0 }

Third, since the average over species of the total rate of metabolic energy consumption of individuals within the species is E_0/ S_0, we have:

\displaystyle{ \sum_n \int d \epsilon \; n \epsilon R(n,\epsilon | S_0, N_0, E_0) = E_0 / S_0 }

Harte’s theory is that R maximizes entropy subject to these three constraints. Here entropy is defined by

\displaystyle{ - \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) \ln(R(n,\epsilon | S_0, N_0, E_0)) }

Harte uses this theory to calculate R, and tests the results against data from about 20 ecosystems. For example, he predicts the abundance of species as a function of their rank, with rank 1 being the most abundant, rank 2 being the second most abundant, and so on. And he gets results like this:

The data here are from:

• Green, Harte, and Ostling’s work on a serpentine grassland,

• Luquillo’s work on a 10.24-hectare tropical forest, and

• Cocoli’s work on a 2-hectare wet tropical forest.

The fit looks good to me… but I should emphasize that I haven’t had time to study these matters in detail. For more, you can read this paper, at least if your institution subscribes to this journal:

• J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

Spatial abundance distribution

The second part of Harte’s theory is all about a conditional probability distribution

\Pi(n | A, n_0, A_0)

This is the probability that n individuals of a species are found in a region of area A given that it has n_0 individuals in a larger region of area A_0.

\Pi obeys two constraints. First, since it’s a probability distribution, it obeys the normalization condition:

\displaystyle{ \sum_n  \Pi(n | A, n_0, A_0) = 1 }

Second, since the mean value of n across regions of area A equals n_0 A/A_0, we have

\displaystyle{ \sum_n n \Pi(n | A, n_0, A_0) = n_0 A/A_0 }

Harte’s theory is that \Pi maximizes entropy subject to these two constraints. Here entropy is defined by

\displaystyle{- \sum_n  \Pi(n | A, n_0, A_0)\ln(\Pi(n | A, n_0, A_0)) }

Harte explains two approaches to use this idea to derive ‘scaling laws’ for how n varies with n. And again, he compares his predictions to real-world data, and get results that look good to my (amateur, hasty) eye!

I hope sometime I can dig deeper into this subject. Do you have any ideas, or knowledge about this stuff?

The Mathematics of Biodiversity (Part 8)

14 July, 2012

Last time I mentioned that estimating entropy from real-world data is important not just for measuring biodiversity, but also for another area of biology: neurobiology!

When you look at something, neurons in your eye start firing. But how, exactly, is their firing related to what you see? Questions like this are hard! Answering them— ‘cracking the neural code’—is a big challenge. To make progress, neuroscientists are using information theory. But as I explained last time, estimating information from experimental data is tricky.

Romain Brasselet, now a postdoc at the Max Planck Institute for Biological Cybernetics at Tübingen, is working on these topics. He sent me a nice email explaining this area.

This is a bit of a digression, but the Mathematics of Biodiversity program in Barcelona has been extraordinarily multidisciplinary, with category theorists rubbing shoulders with ecologists, immunologists and geneticists. One of the common themes is entropy and its role in biology, so I think it’s worth posting Romain’s comments here. This is what he has to say…

Information in neurobiology

I will try to explain why neurobiologists are today very interested in reliable estimates of entropy/information and what are the techniques we use to obtain them.

The activity of sensory as well as more central neurons is known to be modulated by external stimulations. In 1926, in a seminal paper, Adrian observed that neurons in the sciatic nerve of the frog fire action potentials (or spikes) when some muscle in the hindlimb is stretched. In addition, he observed that the frequency of the spikes increases with the amplitude of the stretching.

• E.D. Adrian, The impulses produced by sensory nerve endings. (1926).

For another very nice example, in 1962, Hubel and Wiesel found neurons in the cat visual cortex whose activity depends on the orientation of a visual stimulus, a simple black line over white background: some neurons fire preferentially for one orientation of the line (Hubel and Wiesel were awarded the 1981 Nobel Prize in Physiology for their work). This incidentally led to the concept of “receptive field” which is of tremendous importance in neurobiology—but though it’s fascinating, it’s a different topic.

Good, we are now able to define what makes a neuron tick. The problem is that neural activity is often very “noisy”: when the exact same stimulus is presented many times, the responses appear to be very different from trial to trial. Even careful observation cannot necessarily reveal correlations between the stimulations and the neural activity. So we would like a measure capable of capturing the statistical dependencies between the stimulation and the response of the neuron to know if we can say something about the stimulation just by observing the response of a neuron, which is essentially the task of the brain. In particular, we want a fundamental measure that does not rely on any assumption about the functioning of the brain. Information theory provides the tools to do this, that is why we like to use it: we often try to measure the mutual information between stimuli and responses.

To my knowledge, the first paper using information theory in neuroscience was by MacKay and McCulloch in 1952:

• Donald M. Mackay and Warren S. McCulloch, The limiting information capacity of a neuronal link, Bulletin of Mathematical Biophysics 14 (1952), 127–135.

But information theory was not used in neuroscience much until the early 90’s. It started again with a paper by Bialek et al. in 1991:

• W. Bialek, F. Rieke, R. R. de Ruyter van Steveninck and D. Warland, Reading a neural code, Science 252 (1991), 1854–1857.

However, when applying information-theoretic methods to biological data, we often have a limited sampling of the neural response, we are usually very happy when we have 50 trials for a given stimulus. Why is this limited sample a problem?

During the major part of the 20th century, following Adrian’s finding, the paradigm for the neural code was the frequency of the spikes or, equivalently, the number of spikes in a window of time. But in the early 90’s, it was observed that the exact timing of spikes is (in some cases) reliable across trials. So instead of considering the neural response as a single number (the number of spikes), the temporal patterns of spikes started to be taken into account. But time is continuous, so to be able to do actual computations, time was discretized and a neural response became a binary string.

Now, if you consider relevant time-scales, say, a 100 millisecond time window with a 1 millisecond bin with a firing frequency of about 50 per second, then your response space is huge and the estimates of information with only 50 trials are not reliable anymore. That’s why a lot of efforts have been carried out to overcome the limited sampling bias.

Now, getting at the techniques developed in this field, John already mentioned the work by Liam Paninski, but here are other very interesting references:

• Stefano Panzeri and Alessandro Treves, Analytical estimates of limited sampling biases in different information measures, Network: Computation in Neural Systems 7 (1996), 87–107.

They computed the first-order bias of the information (related to the Miller–Madow correction) and then used a Bayesian technique to estimate the number of responses not included in the sample but that would be in an infinite sample (a goal similar to that of Good’s rule of thumb).

• S.P. Strong, R. Koberle, R.R. de Ruyter van Steveninck, and W. Bialek, Entropy and information in neural spike trains, Phys. Rev. Lett. 80 (1998), 197–200.

The entropy (or if you prefer, information) estimate can be expanded in a power series in N (the sample size) around the true value. By computing the estimate for various values of N and fitting it with a parabola, it is possible to estimate the value of the entropy as N \rightarrow \infty.

These approaches are also well-known:

• Ilya Nemenman, Fariel Shafee and William Bialek, Entropy and inference, revisited, 2002.

• Alexander Kraskov, Harald Stögbauer and Peter Grassberger, Estimating mutual information, Phys. Rev. E. 69 (2004), 066138.

Actually, Stefano Panzeri has quite a few impressive papers about this problem, and recently with colleagues he has made public a free Matlab toolbox for information theory ( implementing various correction methods.

Finally, the work by Jonathan Victor is worth mentioning, since he provided (to my knowledge again) the first estimate of mutual information using geometry. This is of particular interest with respect to the work by Christina Cobbold and Tom Leinster on measures of biodiversity that take the distance between species into account:

• J. D. Victor and K. P. Purpura, Nature and precision of temporal coding in visual cortex: a metric-space analysis, Journal of Neural Physiology 76 (1996), 1310–1326.

He introduced a distance between sequences of spikes and from this, derived a lower bound on mutual information.

• Jonathan D. Victor, Binless strategies for estimation of information from neural data, Phys. Rev. E. 66 (2002), 051903.

Taking inspiration from work by Kozachenko and Leonenko, he obtained an estimate of the information based on the distances between the closest responses.

Without getting too technical, that’s what we do in neuroscience about the limited sampling bias. The incentive is that obtaining reliable estimates is crucial to understand the ‘neural code’, the holy grail of computational neuroscientists.

The Mathematics of Biodiversity (Part 7)

12 July, 2012

How ignorant are you?

Do you know?

Do you know how much don’t you know?

It seems hard to accurately estimate your lack of knowledge. It even seems hard to say precisely how hard it is. But the cool thing is, we can actually extract an interesting math question from this problem. And one answer to this question leads to the following conclusion:

There’s no unbiased way to estimate how ignorant you are.

But the devil is in the details. So let’s see the details!

The Shannon entropy of a probability distribution is a way of measuring how ignorant we are when this probability distribution describes our knowledge.

For example, suppose all we care about is whether this ancient Roman coin will land heads up or tails up:

If we know there’s a 50% chance of it landing heads up, that’s a Shannon entropy of 1 bit: we’re missing one bit of information.

But suppose for some reason we know for sure it’s going to land heads up. For example, suppose we know the guy on this coin is the emperor Pupienus Maximus, a egomaniac who had lead put on the back of all coins bearing his likeness, so his face would never hit the dirt! Then the Shannon entropy is 0: we know what’s going to happen when we toss this coin.

Or suppose we know there’s a 90% it will land heads up, and a 10% chance it lands tails up. Then the Shannon entropy is somewhere in between. We can calculate it like this:

- 0.9 \log_2 (0.9) - 0.1 \log_2 (0.1) = 0.46899...

so that’s how many bits of information we’re missing.

But now suppose we have no idea. Suppose we just start flipping the coin over and over, and seeing what happens. Can we estimate the Shannon entropy?

Here’s a naive way to do it. First, use your experimental data to estimate the probability that that the coin lands heads-up. Then, stick that probability into the formula for Shannon entropy. For example, say we flip the coin 3 times and it lands head-up once. Then we can estimate the probability of it landing heads-up as 1/3, and tails-up as 2/3. So we can estimate that the Shannon entropy is

\displaystyle{ - \frac{1}{3} \log_2 (\frac{1}{3}) -\frac{2}{3} \log_2 (\frac{2}{3}) = 0.918... }

But it turns out that this approach systematically underestimates the Shannon entropy!

Say we have a coin that lands up a certain fraction of the time, say p. And say we play this game: we flip our coin n times, see what we get, and estimate the Shannon entropy using the simple recipe I just illustrated.

Of course, our estimate will depend on the luck of the game. But on average, it will be less than the actual Shannon entropy, which is

- p \log_2 (p) - (1-p) \log_2 (1-p)

We can prove this mathematically. But it shouldn’t be surprising. After all, if n = 1, we’re playing a game where we flip the coin just once. And with this game, our naive estimate of the Shannon entropy will always be zero! Each time we play the game, the coin will either land heads up 100% of the time, or tails up 100% of the time!

If we play the game with more coin flips, the error gets less severe. In fact it approaches zero as the number of coin flips gets ever larger, so that n \to \infty. The case where you flip the coin just once is an extreme case—but extreme cases can be good to think about, because they can indicate what may happen in less extreme cases.

One moral here is that naively generalizing on the basis of limited data can make you feel more sure you know what’s going on than you actually are.

I hope you knew that already!

But we can also say, in a more technical way, that the naive way of estimating Shannon entropy is a biased estimator: the average value of the estimator is different from the value of the quantity being estimated.

Here’s an example of an unbiased estimator. Say you’re trying to estimate the probability that the coin will land heads up. You flip it n times and see that it lands up m times. You estimate that the probability is m/n. That’s the obvious thing to do, and it turns out to be unbiased.

Statisticians like to think about estimators, and being unbiased is one way an estimator can be ‘good’. Beware: it’s not the only way! There are estimators that are unbiased, but whose standard deviation is so huge that they’re almost useless. It can be better to have an estimate of something that’s more accurate, even though on average it’s a bit too low. So sometimes, a biased estimator can be more useful than an unbiased estimator.

Nonetheless, my ears perked up when Lou Jost mentioned that there is no unbiased estimator for Shannon entropy. In rough terms, the moral is that:

There’s no unbiased way to estimate how ignorant you are.

I think this is important. For example, it’s important because Shannon entropy is also used as a measure of biodiversity. Instead of flipping a coin repeatedly and seeing which side lands up, now we go out and collect plants or animals, and see which species we find. The relative abundance of different species defines a probability distribution on the set of species. In this language, the moral is:

There’s no unbiased way to estimate biodiversity.

But of course, this doesn’t mean we should give up. We may just have to settle for an estimator that’s a bit biased! And people have spent a bunch of time looking for estimators that are less biased than the naive one I just described.

By the way, equating ‘biodiversity’ with ‘Shannon entropy’ is sloppy: there are many measures of biodiversity. The Shannon entropy is just a special case of the Rényi entropy, which depends on a parameter q: we get Shannon entropy when q = 1.

As q gets smaller, the Rényi entropy gets more and more sensitive to rare species—or shifting back to the language of probability theory, rare events. It’s the rare events that make Shannon entropy hard to estimate, so I imagine there should be theorems about estimators for Rényi entropy, which say it gets harder to estimate as q gets smaller. Do you know such theorems?

Also, I should add that biodiversity is better captured by the ‘Hill numbers’, which are functions of the Rényi entropy, than by the Rényi entropy itself. (See here for the formulas.) Since these functions are nonlinear, the lack of an unbiased estimator for Rényi entropy doesn’t instantly imply the same for the Hill numbers. So there are also some obvious questions about unbiased estimators for Hill numbers. Do you know answers to those?

Here are some papers on estimators for entropy. Most of these focus on estimating the Shannon entropy of a probability distribution on a finite set.

This old classic has a proof that the ‘naive’ estimator of Shannon entropy is biased, and estimates on the bias:

• Bernard Harris, The statistical estimation of entropy in the non-parametric case, Army Research Office, 1975.

He shows the bias goes to zero as we increase the number of samples: the number I was calling n in my coin flip example. In fact he shows the bias goes to zero like O(1/n). This is big big O notation which means that as n \to +\infty, the bias is bounded by some constant times 1/n. This constant depends on the size of our finite set—or, if you want to do better, the class number, which is the number of elements on which our probability distribution is nonzero.

Using this idea, he shows that you can find a less biased estimator if you have a probability distribution p_i on a finite set and you know that exactly k of these probabilities are nonzero. To do this, just take the ‘naive’ estimator I described earlier and add (k-1)/2n. This is called the Miller–Madow bias correction. The bias of this improved estimator goes to zero like O(1/n^2).

The problem is that in practice you don’t know ahead of time how many probabilities are nonzero! In applications to biodiversity this would amount to knowing ahead of time how many species exist, before you go out looking for them.

But what about the theorem that there’s no unbiased estimator for Shannon entropy? The best reference I’ve found is this:

• Liam Paninski, Estimation of entropy and mutual information, Neural Computation 15 (2003) 1191-1254.

In Proposition 8 of Appendix A, Paninski gives a quick proof that there is no unbiased estimator of Shannon entropy for probability distributions on a finite set. But his paper goes far beyond this. Indeed, it seems like a pretty definitive modern discussion of the whole subject of estimating entropy. Interestingly, this subject is dominated by neurobiologists studying entropy of signals in the brain! So, lots of his examples involve brain signals.

Another overview, with tons of references, is this:

• J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy estimation: an overview.

This paper focuses on the situation where don’t know ahead of time how many probabilities are nonzero:

• Anne Chao and T.-J. Shen, Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample, Environmental and Ecological Statistics 10 (2003), 429&–443.

In 2003 there was a conference on the problem of estimating entropy, whose webpage has useful information. As you can see, it was dominated by neurobiologists:

Estimation of entropy and information of undersampled probability distributions: theory, algorithms, and applications to the neural code, Whistler, British Columbia, Canada, 12 December 2003.

By the way, I was very confused for a while, because these guys claim to have found an unbiased estimator of Shannon entropy:

• Stephen Montgomery Smith and Thomas Schürmann, Unbiased estimators for entropy and class number.

However, their way of estimating entropy has a funny property: in the language of biodiversity, it’s only well-defined if our samples include at least one species of each organism. So, we cannot compute this estimate for an arbitary list of n samples. This means it’s not estimator in the usual sense—the sense that Paninski is using! So it doesn’t really contradict Paninski’s result.

To wrap up, let me state Paninski’s result in a mathematically precise way. Suppose p is a probability distribution on a finite set X. Suppose S is any number we can compute from p: that is, any real-valued function on the set of probability distributions. We’ll be interested in the case where S is the Shannon entropy:

\displaystyle{ S = -\sum_{x \in X} p(x) \, \log p(x) }

Here we can use whatever base for the logarithm we like: earlier I was using base 2, but that’s not sacred. Define an estimator to be any function

\hat{S}: X^n \to \mathbb{R}

The idea is that given n samples from the set X, meaning points x_1, \dots, x_n \in X, the estimator gives a number \hat{S}(x_1, \dots, x_n). This number is supposed to estimate some feature of the probability distribution p: for example, its entropy.

If the samples are independent and distributed according to the distribution p, the sample mean of the estimator will be

\displaystyle{ \langle \hat{S} \rangle = \sum_{x_1, \dots, x_n \in X} \hat{S}(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n) }

The bias of the estimator is the difference between the sample mean of the estimator and actual value of S:

\langle \hat{S} \rangle - S

The estimator \hat{S} is unbiased if this bias is zero for all p.

Proposition 8 of Paninski’s paper says there exists no unbiased estimator for entropy! The proof is very short…

Okay, that’s all for today.

I’m back in Singapore now; I learned so much at the Mathematics of Biodiversity conference that there’s no way I’ll be able to tell you all that information. I’ll try to write a few more blog posts, but please be aware that my posts so far give a hopelessly biased and idiosyncratic view of the conference, which would be almost unrecognizable to most of the participants. There are a lot of important themes I haven’t touched on at all… while this business of entropy estimation barely came up: I just find it interesting!

If more of you blogged more, we wouldn’t have this problem.

The Mathematics of Biodiversity (Part 6)

6 July, 2012

Here are two fun botany stories I learned today from Lou Jost.

The decline and fall of the Roman Empire

I thought Latin was a long-dead language… except in Finland, where 75,000 people regularly listen to the news in Latin. That’s cool, but surely the last time someone seriously needed to write in Latin was at least a century ago… right?

No! Until the beginning of 2012, botanists reporting new species were required to do so in Latin.

Like this:

Arbor ad 8 alta, raminculis sparse pilosis, trichomatis 2-2.5 mm longis. Folia persistentia; laminae anisophyllae, foliis majoribus ellipticus, 12-23.5 cm longis, 6-13 cm latis, minoribus orbicularis, ca 8.5 cm longis, 7.5 cm latis, apice acuminato et caudato, acuminibus 1.5-2 cm longis, basi rotundata ad obtusam, margine integra, supra sericea, trichomatis 2.5-4 mm longis, appressis, pagina inferiore sericea ad pilosam, trichomatis 2-3 mm longis; petioli 4-7 mm longi. Inflorescentia terminalis vel axillaris, cymosa, 8-10 cm latis. Flores bisexuales; calyx tubularis, ca. 6 mm longus, 10-costatus; corolla alba, tubularis, 5-lobata; stamina 5, filis 8-10 mm longis, pubescentia ad insertionem.

The International Botanical Congress finally voted last year to drop this requirement. So, the busy people who are discovering about 2000 species of plants, algae and fungi each year no longer need to file their reports in the language of the Roman Empire.

Orchid Fever

The first person who publishes a paper on a new species of plant gets to name it. Sometimes the competition is fierce, as for the magnificent orchid shown above, Phragmipedium kovachii.

Apparently one guy beat another, his archenemy, by publishing an article just a few days earlier. But the other guy took his revenge by getting the first guy arrested for illegally taking an endangered orchid out of Peru. The first guy wound up getting two years’ probation and a $1,000 fine.

But, he got his name on the orchid!

I believe the full story appears here:

• Eric Hansen, Orchid Fever: A Horticultural Tale of Love, Lust, and Lunacy, Vintage Books, New York, 2001.

You can read a summary here.


By the way, Lou Jost is not only a great discoverer of new orchid species and a biologist deeply devoted to understanding the mathematics of biodiversity. He also runs a foundation called Ecominga, which runs a number of nature reserves in Ecuador, devoted to preserving the amazing biodiversity of the Upper Pastaza Watershed. This area contains over 190 species of plants not found anywhere else in the world, as well as spectacled bears, mountain tapirs, and an enormous variety of birds.

The forests here are being cut down… but Ecominga has bought thousands of hectares in key locations, and is protecting them. They need money to pay the locals who patrol and run the reserves. It’s not a lot of money in the grand scheme of things—a few thousand dollars a month. So if you’re interested, go to the Ecominga website, check out the information and reports and pictures, and think about giving them some help! Or for that matter, contract me and I’ll put you in touch with him.

The Mathematics of Biodiversity (Part 5)

3 July, 2012

I’d be happy to get your feedback on these slides of the talk I’m giving the day after tomorrow:

• John Baez, Diversity, entropy and thermodynamics, 6 July 2012, Exploratory Conference on the Mathematics of Biodiversity, Centre de Recerca Matemàtica, Barcelona.

Abstract: As is well known, some popular measures of biodiversity are formally identical to measures of entropy developed by Shannon, Rényi and others. This fact is part of a larger analogy between thermodynamics and the mathematics of biodiversity, which we explore here. Any probability distribution can be extended to a 1-parameter family of probability distributions where the parameter has the physical meaning of ‘temperature’. This allows us to introduce thermodynamic concepts such as energy, entropy, free energy and the partition function in any situation where a probability distribution is present—for example, the probability distribution describing the relative abundances of different species in an ecosystem. The Rényi entropy of this probability distribution is closely related to the change in free energy with temperature. We give one application of thermodynamic ideas to population dynamics, coming from the work of Marc Harper: as a population approaches an ‘evolutionary optimum’, the amount of Shannon information it has ‘left to learn’ is nonincreasing. This fact is closely related to the Second Law of Thermodynamics.

This talk is rather different than the one I’d envisaged giving! There was a lot of interest in my work on Rényi entropy and thermodynamics, because Rényi entropies—and their exponentials, called the Hill numbers—are an important measure of biodiversity. So, I decided to spend a lot of time talking about that.

The Mathematics of Biodiversity (Part 4)

2 July, 2012

Today the conference part of this program is starting:

Research Program on the Mathematics of Biodiversity, June-July 2012, Centre de Recerca Matemàtica, Barcelona, Spain. Organized by Ben Allen, Silvia Cuadrado, Tom Leinster, Richard Reeve and John Woolliams.

Lou Jost kicked off the proceedings with an impassioned call to think harder about fundamental concepts:

• Lou Jost, Why biologists should care about the mathematics of biodiversity.

Then Tom Leinster gave an introduction to some of these concepts, and Lou explained how they show up in ecology, genetics, economics and physics.

Suppose we have n different species on an island. Suppose a fraction p_i of the organisms belong to the ith species. So,

\displaystyle{ \sum_{i=1}^n p_i = 1}

and mathematically we can treat these numbers as probabilities.

People have many ways to compute the ‘biodiversity’ from these numbers. Some of these can be wildly misleading when applied incorrectly, and this has led to shocking errors. For example, in genetics, a commonly used formula for determining when plants or animals on a bunch of islands will split into separate species is completely wrong.

In fact, if we’re not careful, some measures of biodiversity can fool us into thinking we’re saving most of the biodiversity when we’re actually losing almost all of it!

One good example involves measures of similarity between tropical butterflies in the canopy (the top of the forest) and the understory (the bottom). According to Lou Just, some published studies say the similarity is about 95%. That sounds like the two communities are almost the same. However, almost no butterflies living in the canopy live in the understory, and vice versa! The problem is that mathematics is being used inappropriately.

Here are four famous measures of biodiversity:

Species richness. This is just the number of species:


Shannon entropy. This is the expected amount of information you gain when someone tells you which species an organism belongs to:

\displaystyle{ - \sum_{i=1}^n p_i \ln(p_i) }

• The inverse Simpson index. This is the reciprocal of the probability that two randomly chosen organisms belong to the same species:

\displaystyle{ 1 \big/ \sum_{i=1}^n p_i^2 }

The probability that two organisms belong to the same species is called the Simpson index:

\displaystyle{ \sum_{i=1}^n p_i^2 }

This is used in economics as a measure of the concentration of wealth, where p_i is the fraction of wealth owned by the ith individual. Be careful: there’s a lot of different jargon in different fields, so it’s easy to get confused at first! For example, the probability that two organisms belong to different species is often called the Gini–Simpson index:

\displaystyle{ 1 - \sum_{i=1}^n p_i^2 }

It was introduced by the statistician Corrado Gini a century ago, in 1912 and the ecologist Edward H. Simpson in 1949. It’s also called the heterozygosity in genetics.

• The Berger–Parker index. This is the fraction of organisms that belong to the most common species:

\mathrm{max} \, p_i

So, unlike the other main ones I’ve listed, this quantity tends to go down when biodiversity goes up. To fix this we could take its reciprocal, as we did with the Simpson index.

What a mess, eh? But here’s some good news: all these quantities are functions of a single quantity, the Rényi entropy:

\displaystyle{ H_q(p) = \frac{1}{1 -q} \ln \sum_{i=1}^n p_i^q  }

for various values of the parameter q.

I’ve written about the Rényi entropies and their role in thermodynamics before on this blog. I’ll also talk about it later in this conference, and I’ll show you my slides. So, I won’t repeat that story here. Suffice it to say that Rényi entropies are fascinating but still a bit mysterious to me.

But one of Lou Jost’s main points is that we can make bad mistakes if we work with Rényi entropies when we should be working with their exponentials, which are called Hill numbers and denoted by a D, for ‘diversity’:

\displaystyle{ {}^qD(p) = e^{H_q(p)} =   \left(\sum_{i=1}^n p_i^q \right)^{\frac{1}{1-q}}  }

These were introduced by M. O. Hill in 1973. One reason they’re good is that they are effective numbers. This means that if all the species are equally common, the Hill number equals the number of species, regardless of q:

p_i = \frac{1}{n} \; \Longrightarrow \; {}^qD(p) = n

So, they’re a way of measuring an ‘effective’ number of species in situations where species are not all equally common.

A closely related fact is that the Hill numbers obey the replication principle. This means that if we have probability distributions on two finite sets, each with Hill number X for some choice of q, and we combine them with equal weights to get a probability distribution on the disjoint union of those sets, the resulting distribution has Hill number 2X.

Another good fact is that the Hill numbers are as large as possible when all the probabilities p_i are equal. They’re as small as possible, namely 1, when one of the p_i equals 1 and the rest are zero.

Let’s see how all the measures of biodiversity I listed are either Hill numbers or can easily be converted to Hill numbers. We’ll also see that at q = 0, the Hill number treats all species that are present in an equal way, regardless of their abundance. As q increases, it counts more abundant species more heavily, since we’re raising the probabilities p_i to a bigger power. And when q = \infty, we only care about the most abundant species: none of the others matter at all!

Here goes:

• The species richness is the limit of the Hill numbers as q \to 0 from above:

\displaystyle{ \lim_{q \to 0^+} {}^qD (p) = n }

So, we can just call this {}^0D(p).

• The exponential of the Shannon entropy is the limit of the Hill numbers as q \to 1:

\displaystyle{ \lim_{q \to 1} {}^qD(p) = \exp\left(- \sum_{i=1}^n p_i \ln(p_i)\right) }

So, we can just call this {}^1D(p).

• The inverse Simpson index is the Hill number at q = 2:

\displaystyle{  {}^2D(p) =  1 \big/ \sum_{i=1}^n p_i^2 }

• The reciprocal of the Berger–Parker index is the limit of Hill numbers as q \to +\infty:

\displaystyle{ \lim_{q \to +\infty} {}^qD(p) = 1 \big/ \mathrm{max} \, p_i }

so we can call this quantity {}^\infty D(p).

These facts mean that understanding Hill numbers will help us understand lots of measures of biodiversity! And the good properties of Hill numbers will help us avoid dangerous mistakes.

For mathematicians, a good challenge is to find theorems uniquely characterizing the Hill numbers…. preferably with assumptions that biologists will accept as plausible facts about ‘diversity’. Some theorems like this already exist for specific choices of q, but it will be better to characterize the function {}^q D for all values of q in one blow. Tom Leinster is working on such a theorem now.

Another important task is to generalize Hill numbers to take into account things like:

• ‘distances’ between species, measured either genetically, phylogenetically or functionally,

• ‘values’ for species, measured either economically or
any other way.

There’s a lot of work on this, and many of the talks here conference will discuss these generalizations.

The Mathematics of Biodiversity (Part 3)

27 June, 2012

We tend to think of biodiversity as a good thing, but sometimes it’s deadly. Yesterday Andrei Korobeinikov gave a talk on ‘Viral evolution within a host’, which was mainly about AIDS.

The virus that causes this disease, HIV, can reproduce very fast. In an untreated patient near death there are between 1010 and 1012 new virions per day! Remember, a virion is an individual virus particle. The virus also has a high mutation rate: about 3 × 10-5 mutations per generation for each base—that is, each molecule of A,T,C, or G in the RNA of the virus. That may not seem like a lot, but if you multiply it by 1012 you’ll see that a huge number of new variations of each base arise within the body of a single patient.

So, evolution is at work within you as you die.

And in fact, many scientists believe that the diversity of the virus eventually overwhelms your immune system! Although it’s apparently not quite certain, it seems that while the body generates B cells and T cells to attack different variants of HIV as they arise, they eventually can’t keep up with the sheer number of variants.

Of course, the fact that the HIV virus attacks the immune system makes the disearse even worse. Here in blue you see the number of T cells per cubic millimeter of blood, and in red you see the number of virions per cubic centimeter of blood for a typical untreated patient:

Mathematicians and physicists have looked at some very simple models to get a qualitative understanding of these issues. One famous paper that started this off is:

• Lev S. Tsimring, Herbert Levine and David A. Kessler, RNA virus evolution via a fitness-space model, Phys. Rev. Lett. 76 (1996), 4440–4443.

The idea here is to say that at any time t the viruses have a probability density p(r,t) of having fitness r. In fact the different genotypes of the virus form a cloud in a higher-dimensional space, but these authors are treating that space is 1-dimensional, with fitness as its one coordinate, just to keep things simple. They then write down an equation for how the population density changes with time:

\displaystyle{\frac{\partial }{\partial t}p(r,t) = (r - \langle r \rangle)\, p(r,t) + D \frac{\partial^2 }{\partial r}p(r,t) - \frac{\partial}{\partial r}(v_{\mathrm{drift}}\, p(r,t)) }

This is a replication-mutation-drift equation. If we just had

\displaystyle{\frac{\partial }{\partial t}p(r,t) = (r - \langle r \rangle)\, p(r,t) }

this would be a version of the replicator equation, which I explained recently in Information Geometry (Part 9). Here

\displaystyle{ \langle r \rangle = \int_0^\infty r p(r,t) dr }

is the mean fitness, and the replicator equations says that the fraction of organisms of a given type grows at a rate proportional to how much their fitness exceeds the mean fitness: that’s where the (r - \langle r \rangle) comes from.

If we just had

\displaystyle{\frac{\partial }{\partial t}p(r,t) =  D \frac{\partial^2 }{\partial r^2}p(r,t) }

this would be the heat equation, which describes diffusion occurring at a rate D. This models the mutation of the virus, though not in a very realistic way.

If we just had

\displaystyle{\frac{\partial}{\partial t} p(r,t) =  - \frac{\partial}{\partial r}(v_{\mathrm{drift}} \, p(r,t)) }

the fitness of the virus would increase at rate equal to the drift velocity v_{\mathrm{drift}}.

If we include both the diffusion and drift terms:

\displaystyle{\frac{\partial }{\partial t} p(r,t) =  D \frac{\partial^2 }{\partial r^2}p(r,t) - \frac{\partial}{\partial r}(v_{\mathrm{drift}} \, p(r,t)) }

we get the Fokker–Planck equation. This is a famous model of something that’s spreading while also drifting along at a constant velocity: for example, a drop of ink in moving water. Its solutions look like this:

Here we start with stuff concentrated at one point, and it spreads out into a Gaussian while drifting along.

By the way, watch out: what biologists call ‘genetic drift’ is actually a form of diffusion, not what physicists call ‘drift’.

More recently, people have looked at another very simple model. You can read about it here:

• Martin A. Nowak, and R. M. May, Virus Dynamics, Oxford University Press, Oxford, 2000.

In this model the variables are:

• the number of healthy human cells of some type, \mathrm{H}(t)

• the number of infected human cells of that type, \mathrm{I}(t)

• the number of virions, \mathrm{V}(t)

These are my names for variables, not theirs. It’s just a sick joke that these letters spell out ‘HIV’.

Chemists like to describe how molecules react and turn into other molecules using ‘chemical reaction networks’. You’ve seen these if you’ve taken chemistry, but I’ve been explaining more about the math of these starting in Network Theory (Part 17). We can also use them here! Though May and Nowak probably didn’t put it this way, we can consider a chemical reaction network with the following 6 reactions:

• the production of a healthy cell:

\longrightarrow \mathrm{H}

• the infection of a healthy cell by a virion:

\mathrm{H} + \mathrm{V} \longrightarrow \mathrm{I}

• the production of a virion by an infected cell:

\mathrm{I} \longrightarrow \mathrm{I} + \mathrm{V}

• the death of a healthy cell:

\mathrm{H} \longrightarrow

• the death of a infected cell:

\mathrm{I} \longrightarrow

• the death of a virion:

\mathrm{V} \longrightarrow

Using a standard recipe which I explained, we can get from this chemical reaction network to some ‘rate equations’ saying how the number of healthy cells, infected cells and virions changes with time:

\displaystyle{ \frac{d\mathrm{H}}{dt}  =   \alpha - \beta \mathrm{H}\mathrm{V} - \gamma \mathrm{H} }

\displaystyle{ \frac{d\mathrm{I}}{dt}  = \beta \mathrm{H}\mathrm{V} - \delta \mathrm{I} }

\displaystyle{ \frac{d\mathrm{V}}{dt}  = - \beta \mathrm{H}\mathrm{V} + \epsilon \mathrm{I} - \zeta \mathrm{V} }

The Greek letters are constants called ‘rate constants’, and there’s one for each of the 6 reactions. The equations we get this way are exactly those described by Nowak and May!

What Andrei Korobeinikov is to unify the ideas behind the two models I’ve described here. Alas, I don’t have the energy to explain how. Indeed, I don’t even have the energy to explain what the models I’ve described actually predict. Sad, but true.

I don’t see anything online about Korobeinikov’s new work, but you can read some of his earlier work here:

• Andrei Korobeinikov, Global properties of basic virus dynamics models.

• Suzanne M. O’Regan, Thomas C. Kelly, Andrei Korobeinikov, Michael J. A. O’Callaghan and Alexei V. Pokrovskii, Lyapunov functions for SIR and SIRS epidemic models, Appl. Math. Lett. 23 (2010), 446-448.

The SIR and SIRS models are models of disease that also arise from chemical reaction networks. I explained them back in Network Theory (Part 3). That was before I introduced the terminology of chemical reaction networks… back then I was talking about ‘stochastic Petri nets’, which are an entirely equivalent formalism. Here’s the stochastic Petri net for the SIRS model:

Puzzle: Draw the stochastic Petri net for the HIV model discussed above. It should have 3 yellow circles and 6 aqua squares.