Increasing the Signal-to-Noise Ratio With More Noise

30 July, 2012

guest post by Glyn Adgie and Tim van Beek

Or: Are the Milankovich Cycles Causing the Ice Ages?

The Milankovich cycles are periodic changes in how the Earth orbits the Sun. One question is: can these changes be responsible for the ice ages? At first sight it seems impossible, because the changes are simply too small. But it turns out that we can find a dynamical system where a small periodic external force is actually strengthened by random ‘noise’ in the system. This phenomenon has been dubbed ‘stochastic resonance’ and has been proposed as an explanation for the ice ages.

In this blog post we would like to provide an introduction to it. We will look at a bistable toy example. And you’ll even get to play with an online simulation that some members of the Azimuth Project created, namely Glyn Adgie, Allan Erskine and Jim Stuttard!

Equations with noise

In This Week’s Finds, back in ‘week308’, we had a look at a model with ‘noise’. A lot of systems can be described by ordinary differential equations:

\displaystyle{ \frac{d x}{d t} = f(x, t)}

If f is nice, the time evolution of the system will be a nice smooth function x(t), like the trajectory of a thrown stone. But there are situations where we have some kind of noise, a chaotic, fluctuating influence, that we would like to take into account. This could be, for example, turbulence in the air flow around a rocket. Or, in our case, short-term fluctuations of the Earth’s weather. If we take this into account, we get an equation of the form

\displaystyle{ \frac{d x}{d t} = f(x, t) + W(t) }

where the W is some model of noise. In all the examples above, the noise is actually a simplified way to take into account fast degrees of freedom of the system at hand. This way we do not have to explicitly model these degrees of freedom, which is often impossible. But we could still try to formulate a model where long term influences of these degrees of freedom are included.

A way to make the above mathematically precise is to write the equation in the form

d x_t = f(x, t) d t + g(x, t) d W_t

as an Ito stochastic differential equation driven by a Wiener process.

The mathematics behind this is somewhat sophisticated, but we don’t need to delve into it in this blog post. For those who are interested in it, we have gathered some material here:

Stochastic differential equation, Azimuth Library.

Physicists and engineers who model systems with ordinary and partial differential equations should think of stochastic differential equations as another tool in the toolbox. With this tool, it is possible to easily model systems that exhibit new behavior.

What is stochastic resonance?

When you listen to a friend sitting across the table, the noise around you can become a serious distraction, although humans are famous for their ability to filter the signal—what your friend is saying—from the noise—all the other sounds, people’s chatter, dishes clattering and so on. But could you imagine a situation where more noise makes it easier to detect the signal?

Picture a marble that sits in a double well potential:

double well potential

This graphic is taken from

• Roberto Benzi, Stochastic resonance: from climate to biology.

This review paper is also a nice introduction to the topic.

This simple model can be applied to various situations. Since we think about the ice ages, we can interpret the two metastable states at the two minima of the potential in this way: the minimum to the left at -1 corresponds to the Earth in an ice age. The minimum to the right at 1 corresponds to the Earth at warmer temperatures, as we know it today.

Let us add a small periodic force to the model. This periodic force corresponds to different solar input due to periodic changes in Earth’s orbit: the Milankovich cycles. There are actually several different cycles with different periods, but in our toy model we will consider only one force with one period.

If the force itself is too small to get the marble over the barrier between the two potential minima, we will not observe any periodic change in the position of the marble. But if we add noise to the motion, it may happen that the force together with the noise succeeds in getting the marble from one minimum to the other! This is more likely to happen when the force is pushing the right way. So, it should happen with roughly the same periodicity as the force: we should see a ‘quasi-periodic’ motion of the marble.

But if we increase the noise more and more, the influence of the external force will become unrecognizable, and we will not recognize any pattern at all.

There should be a noise strength somewhere in between that makes the Milankovich cycles as visible as possible from the Earth’s temperature record. In other words, if we think of the periodic force as a ‘signal’, there should be some amount of noise that maximizes the ‘signal-to-noise ratio’. Can we find out what it is?

In order to find out, let us make our model precise. Let’s take the time-independent potential to be

\displaystyle{ V(x) = \frac{1}{4} x^4 - \frac{1}{2} x^2 }

This potential has two local minima at x = \pm 1. Let’s take the time-dependent periodic forcing, or ‘signal’, to be

F(t) = - A \; \sin(t)

The constant A is the amplitude of the periodic forcing. The time-dependent effective potential of the system is therefore:

\displaystyle{ V(x, t) = \frac{1}{4} x^4 - \frac{1}{2} x^2 - A \; \sin(t) x }

Including noise leads to the equation of motion, which is usually called a Langevin equation by physicists and chemists:

\begin{array}{ccl} dX_t &=& -V'(X_t, t) \; dt + \sqrt{2D} \; dW_t \\  \\  &=& (X_t - X_t^3 + A \;  \sin(t)) \; dt  + \sqrt{2D} \; dW_t  \end{array}

For more about what a Langevin equation is, see our article on stochastic differential equations.

This model has two free parameters, A and D. As already mentioned, A is the amplitude of the forcing. D is called the diffusion coefficient and represents the strength of noise. In our case it is a constant, but in more general models it could depend on space and time.

It is possible to write programs that generate simulations, that is, numerical approximations to solutions of stochastic differential equations. For those interested in how this works, there is an addendum at the bottom of the blog post that explains this. We can use such a program to model the three cases of low, high and optimal noise levels.
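
To make this concrete, here is a minimal C++ sketch of such a simulation, using the Euler scheme described in the addendum. The step size, time span and parameter values are illustrative choices of ours, not the ones used for the plots below:

#include <cmath>
#include <cstdio>
#include <random>

int main() {
    // Parameters: forcing amplitude A and noise strength D (illustrative values).
    const double A = 0.12;
    const double D = 0.20;
    const double h = 0.001;   // time step of the Euler scheme
    const double T = 100.0;   // total simulated time

    std::mt19937 rng(1234);   // fixed seed so runs are reproducible
    std::normal_distribution<double> normal(0.0, 1.0);

    double x = 1.0;           // start in the right-hand well (the 'warm' state)
    for (double t = 0.0; t < T; t += h) {
        // drift -V'(x,t) = x - x^3 + A sin(t); the noise increment has variance 2 D h
        double drift = x - x * x * x + A * std::sin(t);
        x += drift * h + std::sqrt(2.0 * D * h) * normal(rng);
        std::printf("%f %f\n", t, x);   // time and marble position
    }
    return 0;
}

Plotting the second column against the first gives pictures qualitatively like the ones below.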

In the following graphs, green is the external force, while red is the marble position. Here is a simulation with a low level of noise:

low noise level

As you can see, within the time of the simulation there is no transition from the metastable state at 1 to the one at -1. That is, in this situation Earth stays in the warm state.

Here is a simulation with a high noise level:

high noise level

The marble jumps wildly between the states. Just by looking at the graph, you don’t see any pattern in it, do you?

And finally, here is a simulation where the noise level seems to be about right to make the signal visible via a quasi-periodic motion:

optimal noise level

Sometimes the marble makes a transition in phase with the force, sometimes it does not, sometimes it has a phase lag. It can even happen that the marble makes a transition while the force acts in the opposite direction!

Some members of the Azimuth crew have created an online model that calculates simulations of the model, for different values of the involved constants. You can see it here:

Stochastic resonance example, Azimuth Code Project.

We especially thank Allan Erskine and Jim Stuttard.

You can change the values using the sliders under the graphic and see what happens. You can also choose different ‘random seeds’, which means that the random numbers used in the simulation will be different.

Above, we asked if one could calculate for a fixed A the level of noise D that maximizes the signal to noise ratio. We have defined the model system, but we still need to define ‘signal to noise ratio’ in order to address the question.

There is no general definition that makes sense for all kinds of systems. For our model system, let’s assume that the expectation value of x(t) has a Fourier expansion like this:

\displaystyle{ \langle x(t) \rangle = \sum_{n = - \infty}^{\infty}  M_n \exp{(i n t)} }

Then one useful definition of the signal to noise ratio \eta is

\displaystyle{ \eta := \frac{|M_1|^2}{A^2}}

So, can one calculate the value of D, as a function of A, that maximizes \eta? We don’t know! Do you?
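
What we can do is estimate \eta numerically for given values of A and D. Here is a rough C++ sketch: it runs one long simulation, treats the time average over many forcing periods as a stand-in for the ensemble average \langle x(t) \rangle, and reads off the Fourier coefficient M_1. The parameter values are again just illustrative:

#include <cmath>
#include <complex>
#include <cstdio>
#include <random>

int main() {
    const double A = 0.12, D = 0.20;   // forcing amplitude and noise strength (illustrative)
    const double h = 0.001;            // Euler time step
    const double burnIn = 200.0;       // discard the initial transient
    const double Tmeas = 20000.0;      // then measure over many forcing periods

    std::mt19937 rng(42);
    std::normal_distribution<double> normal(0.0, 1.0);

    double x = 1.0;
    std::complex<double> M1(0.0, 0.0);
    for (double t = 0.0; t < burnIn + Tmeas; t += h) {
        double drift = x - x * x * x + A * std::sin(t);
        x += drift * h + std::sqrt(2.0 * D * h) * normal(rng);
        if (t >= burnIn)
            M1 += x * std::exp(std::complex<double>(0.0, -t)) * h;  // accumulate the integral of x(t) e^{-it} dt
    }
    M1 /= Tmeas;                            // time average over the measurement window
    double eta = std::norm(M1) / (A * A);   // |M_1|^2 / A^2
    std::printf("A = %g  D = %g  eta = %g\n", A, D, eta);
    return 0;
}

Scanning over a range of D values with this kind of estimate is one crude way to look for the noise level that maximizes \eta.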

If you would like to know more, have a look at this page:

Stochastic resonance, Azimuth Library.

Also, read on for more about how we numerically solved our stochastic differential equation, and the design decisions we had to make to create a simulation that runs on your web browser.

For übernerds only

Numerically solving stochastic differential equations

Maybe you are asking yourself what is behind the online model. How does one ‘numerically solve’ a stochastic differential equation, and in what sense are the graphs that you see there ‘close’ to the exact solution? No problem, we will explain it here!

First, you need C code like this:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define znew ((z=36969*(z&65535)+(z>>16)) << 16)
#define wnew ((w = 18000*(w&65535)+(w>>16))&65535)

static unsigned long z = 362436069, w = 521288629;

float rnor() {
	float x, y, v;

	x = ((signed long) (znew + wnew)) * 1.167239E-9;

	if (fabs(x) < 1.17741)
		return (x);

	y = (znew + wnew) * 2.328306E-10;

	if (log(y) < 0.6931472 - 0.5 * x * x)
		return x;

	x = (x > 0) ? 0.8857913 * (2.506628 - x) : -0.8857913 * (2.506628 + x);

	if (log(1.8857913 - y) < 0.5718733 - 0.5 * (x * x))
		return (x);

	do {
		v = ((signed long) (znew + wnew)) * 4.656613E-10;

		x = -log(fabs(v)) * 0.3989423;

		y = -log((znew + wnew)* 2.328306E-10);

	} while (y + y < x * x);

	return (v > 0 ? 2.506628 + x : -2.506628 - x);
}

int main(void) {

	int index = 0;

	for(index = 0; index < 10; index++) {
		printf("%f\n", rnor());
	}
	return EXIT_SUCCESS;
}

Pretty scary, huh? Do you see what it does? We don’t think anyone stands a chance of understanding all the ‘magical constants’ in it. We pasted it here so you can go around and show it to friends who claim that they know the programming language C and understand every program in seconds, and see what they can tell you about it. We’ll explain it below.

Since this section is for übernerds, we will assume that you know stochastic processes in continuous time. Brownian motion on \mathbb{R} can be viewed as a probability distribution on the space of all continuous functions on a fixed interval, C[0, T]. This is also true for solutions of stochastic differential equations in general. A ‘numerical approximation’ to a stochastic differential equation is a discrete-time approximation that samples this probability space.

Another way to think about it is to start with numerical approximations of ordinary differential equations. When you are handed such an equation of the form

\displaystyle{ \frac{d x}{d t} = f(x, t) }

with an initial condition x(0) = x_0, you know that there are different discrete approximation schemes to calculate an approximate solution to it. The simplest one is the Euler scheme. You need to choose a step size h and calculate values for x recursively using the formula

x_{n + 1} = x_n + f(x_n, t_n) \; h

with t_{n+1} = t_n + h. Connect the discrete values x_n with straight lines to get your approximating function x^h. For h \to 0, you get convergence of x^h to the exact solution x, for example in the supremum norm:

\displaystyle{  \lim_{h \to 0} \| x^h - x \|_\infty = 0 }

It is possible to extend some schemes from ordinary differential equations to stochastic differential equations that involve certain random variables in every step. The best textbook on the subject that I know is this one:

• Peter E. Kloeden and Eckhard Platen, Numerical Solution of Stochastic Differential Equations, Springer, Berlin, 2010. (ZMATH review.)

The extension of the Euler scheme is very simple, deceptively so! It is much less straightforward to determine the extensions of higher Runge–Kutta schemes, for example. You have been warned! The Euler scheme for stochastic differential equations of the form

d X_t = f(X, t) d t + g(X, t) d W_t

is simply

X_{n + 1} = X_n + f(X_n, t_n) \; h + g(X_n, t_n) \; w_n

where all the w_n are independent random variables with a normal distribution N(0, h), that is, with mean zero and variance h. If you write a computer program that calculates approximations using this scheme, you need a pseudorandom number generator. This is what the code we posted above is about. Actually, it is an efficient algorithm to calculate pseudorandom numbers with a normal distribution. It’s from this paper:

• George Marsaglia and Wai Wan Tsang, The Monty Python method for generating random variables, ACM Transactions on Mathematical Software 24 (1998), 341–350.

While an approximation to an ordinary differential equation approximates a function, an approximation to a stochastic differential equation approximates a probability distribution on C[0, T]. So we need a different concept of ‘convergence’ of the approximation for h \to 0. The two most important ones are called ‘strong’ and ‘weak’ convergence.

A discrete approximation X^h to an Ito process X converges strongly at time T if

\lim_{h \to 0} \; E(|X_T^h - X_T|) = 0

It is said to be of strong order \gamma if there is a constant C such that

E(|X_T^h - X_T|) \leq C h^{\gamma}

As we see from the definition, a strong approximation needs to approximate the path of the original process X at the end of the time interval [0, T].

What about weak convergence?

A discrete approximation X^h to an Ito process X converges weakly for a given class of functions G at a time T if for every function g \in G we have

\lim_{h \to 0} \; E(g(X_T^h)) - E(g(X_T)) = 0

where we assume that all appearing expectation values exist and are finite.

It is said to be of weak order \gamma with respect to G if there is a constant C such that for every function g \in G:

|E(g(X_T^h)) - E(g(X_T))| \leq C h^{\gamma}

One speaks simply of ‘weak convergence’ when G is the set of all polynomials.

So weak convergence means that the approximation needs to approximate the probability distribution of the original process X. Weak and strong convergence are indeed different concepts. For example, the Euler scheme is of strong order 0.5 and of weak order 1.0.
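
One way to see the strong order in action is to test the Euler scheme on an equation with a known exact solution. The following C++ sketch does this for geometric Brownian motion, whose exact solution can be written in terms of the same Brownian path used by the scheme; the measured strong error should shrink roughly like \sqrt{h}. The parameters are our own illustrative choices:

#include <cmath>
#include <cstdio>
#include <random>

// Estimate the strong error of the Euler scheme for geometric Brownian motion
// dX = mu X dt + sigma X dW, whose exact solution is
// X_T = X_0 exp((mu - sigma^2/2) T + sigma W_T).
int main() {
    const double mu = 1.0, sigma = 0.5, X0 = 1.0, T = 1.0;   // illustrative parameters
    const int paths = 20000;

    std::mt19937 rng(7);
    std::normal_distribution<double> normal(0.0, 1.0);

    const int stepCounts[] = {16, 64, 256, 1024};
    for (int steps : stepCounts) {
        double h = T / steps;
        double meanAbsErr = 0.0;
        for (int p = 0; p < paths; ++p) {
            double x = X0, W = 0.0;
            for (int n = 0; n < steps; ++n) {
                double dW = std::sqrt(h) * normal(rng);   // Brownian increment, N(0, h)
                x += mu * x * h + sigma * x * dW;         // Euler step
                W += dW;                                  // same Brownian path, for the exact solution
            }
            double exact = X0 * std::exp((mu - 0.5 * sigma * sigma) * T + sigma * W);
            meanAbsErr += std::fabs(x - exact);
        }
        std::printf("h = %8.5f   E|X_T^h - X_T| approx %f\n", h, meanAbsErr / paths);
    }
    return 0;
}

With these parameters the printed error should roughly halve each time the step count is quadrupled, consistent with strong order 0.5.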

Our simple online model implements the Euler scheme to calculate approximations to our model equation. So now you know in what sense it is indeed an approximation to the solution process!

Implementation details of the interactive online model

There appear to be three main approaches to implementing an interactive online model. It is assumed that all the main browsers understand Javascript.

1) Implement the model using a server-side programme that produces graph data in response to parameter settings, and use Javascript in the browser to plot the graphs and provide interaction inputs.

2) Pre-compute the graph data for various sets of parameter values, send these graphs to the browser, and use Javascript to select a graph and plot it. The computation of the graph data can be done on the designer’s machine, using their preferred language.

3) Implement the model entirely in Javascript.

Option 1) is only practical if one can run interactive programs on the server. While there is technology to do this, it is not something that most ISPs or hosting companies provide. You might be provided with Perl and PHP, but neither of these is a language of choice for numeric coding.

Option 2) has been tried. The problem with this approach is the large amount of pre-computed data required. For example, if we have 4 independent parameters, each of which takes on 10 values, then we need 10000 graphs. Also, it is not practical to provide much finer control of each parameter, as this increases the data size even more.

Option 3) looked doable. Javascript has some nice language features, such as closures, which are a good fit for implementing numerical methods for solving ODEs. The main question then is whether a typical Javascript engine can cope with the compute load.

One problem that arose with a pure Javascript implementation is the production of normal random deviates required for the numerical solution of an SDE. Javascript has a Math.random() function, which produces uniform random deviates, and there are algorithms that could be implemented in Javascript to produce normal deviates from uniform deviates. The problems here are that there is no guarantee that all Javascript engines use the same PRNG, or that the PRNG is actually any good, and we cannot set a specific seed in Javascript. Even if the PRNG is adequate, we have the problem that different users will see different outputs for the same parameter settings. Also, reloading the page will reseed the PRNG from some unknown source, so a user will not be able to replicate a particular output.

The chosen solution was to pre-compute a sequence of random normal deviates, using a PRNG of known characteristics. This sequence is loaded from the server, and will naturally be the same for all users and sessions. This is the C++ code used to create the fixed Javascript array:

#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <iostream>

int main(int argc, char * argv[])
{
    boost::mt19937 engine(1234); // PRNG with seed = 1234
    boost::normal_distribution<double> rng(0.0, 1.0); // Mean 0.0, standard deviation 1.0
    std::cout << "normals = [\n";
    int nSamples = 10001;
    while(nSamples--)
    {
        std::cout << rng(engine) << ",\n";
    }
    std::cout << "];";
}

Deviates with a standard deviation other than 1.0 can be produced in Javascript by multiplying the pre-computed deviates by a suitable scaling factor.

The graphs only use 1000 samples from the Javascript array. To see the effects of different sample paths, the Javascript code selects one of ten slices from the pre-computed array.


The Noisy Channel Coding Theorem

28 July, 2012

Here’s a charming, easily readable tale of Claude Shannon and how he came up with information theory:

• Erico Guizzo, The Essential Message: Claude Shannon and the Making of Information Theory.

I hadn’t known his PhD thesis was on genetics! His master’s thesis introduced Boolean logic to circuit design. And as a kid, he once set up a telegraph line to a friend’s house half a mile away.

So, he was perfectly placed to turn information into a mathematical quantity, deeply related to entropy, and prove some now-famous theorems about it.

These theorems set limits on how much information we can transmit through a noisy channel. More excitingly, they say we can cook up coding schemes that let us come as close as we want to this limit, with an arbitrarily low probability of error.

As Erico Guizzo points out, these results are fundamental to the ‘information age’ we live in today:

Can we transmit, say, a high-resolution picture over a telephone line? How long will that take? Is there a best way to do it?

Before Shannon, engineers had no clear answers to these questions. At that time, a wild zoo of technologies was in operation, each with a life of its own—telephone, telegraph, radio, television, radar, and a number of other systems developed during the war. Shannon came up with a unifying, general theory of communication. It didn’t matter whether you transmitted signals using a copper wire, an optical fiber, or a parabolic dish. It didn’t matter if you were transmitting text, voice, or images. Shannon envisioned communication in abstract, mathematical terms; he defined what the once fuzzy concept of “information” meant for communication engineers and proposed a precise way to quantify it. According to him, the information content of any kind of message could be measured in binary digits, or just bits—a name suggested by a colleague at Bell Labs. Shannon took the bit as the fundamental unit in information theory. It was the first time that the term appeared in print.

So, I want to understand Shannon’s theorems and their proofs—especially because they clarify the relation between information and entropy, two concepts I’d like to be an expert on. It’s sort of embarrassing that I don’t already know this stuff! But I thought I’d post some preliminary remarks anyway, in case you too are trying to learn this stuff, or in case you can help me.

There are various different theorems I should learn. For example:

• The source coding theorem says it’s impossible to compress a stream of data to make the average number of bits per symbol in the compressed data less than the Shannon information of the source, without some of the data almost certainly getting lost. However, you can make the number of bits per symbol arbitrarily close to the Shannon entropy with a probability of error as small as you like.

• The noisy channel coding theorem is a generalization of this to data sent over a noisy channel.

The proof of the noisy channel coding theorem seems not so bad—there’s a sketch of a proof in the Wikipedia article on this theorem. But many theorems have a hard lemma at their heart, and for this one it’s a result in probability theory called the asymptotic equipartition property.

You should not try to dodge the hard lemma at the heart of the theorem you’re trying to understand: there’s a reason it’s there. So what’s the asymptotic equipartition property?

Here’s a somewhat watered-down statement that gets the basic idea across. Suppose you have a method of randomly generating letters—for example, a probability distribution on the set of letters. Suppose you randomly generate a string of n letters, and compute -(1/n) times the logarithm of the probability that you got that string. Then as n \to \infty this number ‘almost surely’ approaches some number S. What’s this number S? It’s the entropy of the probability distribution you used to generate those letters!

(Almost surely is probability jargon for ‘with probability 100%’, which is not the same as ‘always’.)

This result is really cool—definitely worth understanding in its own right! It says that while many strings are possible, the ones you’re most likely to see lie in a certain ‘typical set’. The ‘typical’ strings are the ones where when you compute -(1/n) times the log of their probability, the result is close to S. How close? Well, you get to pick that.

The typical strings are not individually the most probable strings! But if you randomly generate a string, it’s very probable that it lies in the typical set. That sounds a bit paradoxical, but if you think about it, you’ll see it’s not. Think of repeatedly flipping a coin that has a 90% chance of landing heads up. The most probable single outcome is that it lands heads up every time. But the typical outcome is that it lands up close to 90% of the time. And, there are lots of ways this can happen. So, if you flip the coin a bunch of times, there’s a very high chance that the outcome is typical.
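
Here is a small C++ sketch that illustrates this concentration for the 90% coin: it generates a string of n flips, computes -(1/n) times the base-2 logarithm of the probability of getting exactly that string, and compares the result with the entropy:

#include <cmath>
#include <cstdio>
#include <random>

// Flip a coin that lands heads up 90% of the time.  For a string of n flips,
// compute -(1/n) log2 of the probability of getting exactly that string, and
// compare with the entropy -0.9 log2(0.9) - 0.1 log2(0.1) = 0.469...
int main() {
    const double p = 0.9;
    std::mt19937 rng(1);
    std::bernoulli_distribution coin(p);

    const int lengths[] = {10, 100, 1000, 100000};
    for (int n : lengths) {
        double logProb = 0.0;
        for (int i = 0; i < n; ++i)
            logProb += std::log2(coin(rng) ? p : 1.0 - p);   // log-probability of this flip
        std::printf("n = %6d   -(1/n) log2 P = %f\n", n, -logProb / n);
    }
    std::printf("entropy = %f\n", -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p));
    return 0;
}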

It’s easy to see how this result is the key to the noisy channel coding theorem. The ‘typical set’ has few elements compared to the whole set of strings with n letters. So, you can make short codes for the strings in this set, and compress your message that way, and this works almost all the time. How much you can compress your message depends on the entropy S.

So, we’re seeing the link between information and entropy!

The actual coding schemes that people use are a lot trickier than the simple scheme I’m hinting at here. When you read about them, you see scary things like this:

But presumably they’re faster to implement, hence more practical.

The first coding schemes that come really close to the Shannon limit are the turbo codes. Surprisingly, these codes were developed only in 1993! They’re used in 3G mobile communications and deep space satellite communications.

One key trick is to use, not one decoder, but two. These two decoders keep communicating with each other and improving their guesses about the signal they’ve received, until they agree:

This iterative process continues until the two decoders come up with the same hypothesis for the m-bit pattern of the payload, typically in 15 to 18 cycles. An analogy can be drawn between this process and that of solving cross-reference puzzles like crossword or sudoku. Consider a partially completed, possibly garbled crossword puzzle. Two puzzle solvers (decoders) are trying to solve it: one possessing only the “down” clues (parity bits), and the other possessing only the “across” clues. To start, both solvers guess the answers (hypotheses) to their own clues, noting down how confident they are in each letter (payload bit). Then, they compare notes, by exchanging answers and confidence ratings with each other, noticing where and how they differ. Based on this new knowledge, they both come up with updated answers and confidence ratings, repeating the whole process until they converge to the same solution.

This can be seen as “an instance of loopy belief propagation in Bayesian networks.”

By the way, the picture I showed you above is a flowchart of the decoding scheme for a simple turbo code. You can see the two decoders, and maybe the loop where data gets fed back to the decoders.

While I said this picture is “scary”, I actually like it because it’s an example of network theory applied to real-life problems.


The Mathematics of Biodiversity (Part 7)

12 July, 2012

How ignorant are you?

Do you know?

Do you know how much you don’t know?

It seems hard to accurately estimate your lack of knowledge. It even seems hard to say precisely how hard it is. But the cool thing is, we can actually extract an interesting math question from this problem. And one answer to this question leads to the following conclusion:

There’s no unbiased way to estimate how ignorant you are.

But the devil is in the details. So let’s see the details!

The Shannon entropy of a probability distribution is a way of measuring how ignorant we are when this probability distribution describes our knowledge.

For example, suppose all we care about is whether this ancient Roman coin will land heads up or tails up:

If we know there’s a 50% chance of it landing heads up, that’s a Shannon entropy of 1 bit: we’re missing one bit of information.

But suppose for some reason we know for sure it’s going to land heads up. For example, suppose we know the guy on this coin is the emperor Pupienus Maximus, an egomaniac who had lead put on the back of all coins bearing his likeness, so his face would never hit the dirt! Then the Shannon entropy is 0: we know what’s going to happen when we toss this coin.

Or suppose we know there’s a 90% chance it will land heads up, and a 10% chance it lands tails up. Then the Shannon entropy is somewhere in between. We can calculate it like this:

- 0.9 \log_2 (0.9) - 0.1 \log_2 (0.1) = 0.46899...

so that’s how many bits of information we’re missing.

But now suppose we have no idea. Suppose we just start flipping the coin over and over, and seeing what happens. Can we estimate the Shannon entropy?

Here’s a naive way to do it. First, use your experimental data to estimate the probability that the coin lands heads-up. Then, stick that probability into the formula for Shannon entropy. For example, say we flip the coin 3 times and it lands heads-up once. Then we can estimate the probability of it landing heads-up as 1/3, and tails-up as 2/3. So we can estimate that the Shannon entropy is

\displaystyle{ - \frac{1}{3} \log_2 (\frac{1}{3}) -\frac{2}{3} \log_2 (\frac{2}{3}) = 0.918... }

But it turns out that this approach systematically underestimates the Shannon entropy!

Say we have a coin that lands heads up a certain fraction of the time, say p. And say we play this game: we flip our coin n times, see what we get, and estimate the Shannon entropy using the simple recipe I just illustrated.

Of course, our estimate will depend on the luck of the game. But on average, it will be less than the actual Shannon entropy, which is

- p \log_2 (p) - (1-p) \log_2 (1-p)

We can prove this mathematically. But it shouldn’t be surprising. After all, if n = 1, we’re playing a game where we flip the coin just once. And with this game, our naive estimate of the Shannon entropy will always be zero! Each time we play the game, the coin will either land heads up 100% of the time, or tails up 100% of the time!

If we play the game with more coin flips, the error gets less severe. In fact it approaches zero as the number of coin flips gets ever larger, so that n \to \infty. The case where you flip the coin just once is an extreme case—but extreme cases can be good to think about, because they can indicate what may happen in less extreme cases.
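
If you want to see the bias with your own eyes, here is a minimal C++ sketch of the game just described: it repeatedly flips a simulated coin n times, forms the naive entropy estimate, and averages those estimates. The coin bias and the number of repetitions are illustrative choices:

#include <cmath>
#include <cstdio>
#include <random>

// Repeatedly play the game: flip a coin with heads-probability p a total of n
// times, estimate p by the observed frequency, plug it into the entropy
// formula, and average these naive estimates over many repetitions.
int main() {
    const double p = 0.3;   // illustrative coin bias
    const double H = -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    const int games = 200000;

    std::mt19937 rng(3);
    std::bernoulli_distribution coin(p);

    const int flipCounts[] = {1, 3, 10, 100};
    for (int n : flipCounts) {
        double sum = 0.0;
        for (int g = 0; g < games; ++g) {
            int heads = 0;
            for (int i = 0; i < n; ++i) heads += coin(rng) ? 1 : 0;
            double q = double(heads) / n;   // estimated heads-probability
            double est = 0.0;               // naive entropy estimate, with 0 log 0 = 0
            if (q > 0.0) est -= q * std::log2(q);
            if (q < 1.0) est -= (1.0 - q) * std::log2(1.0 - q);
            sum += est;
        }
        std::printf("n = %3d   average naive estimate = %f   true entropy = %f\n",
                    n, sum / games, H);
    }
    return 0;
}

With n = 1 the average estimate is exactly zero, as promised, and for larger n it creeps up toward the true entropy from below.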

One moral here is that naively generalizing on the basis of limited data can make you feel more sure you know what’s going on than you actually are.

I hope you knew that already!

But we can also say, in a more technical way, that the naive way of estimating Shannon entropy is a biased estimator: the average value of the estimator is different from the value of the quantity being estimated.

Here’s an example of an unbiased estimator. Say you’re trying to estimate the probability that the coin will land heads up. You flip it n times and see that it lands heads up m times. You estimate that the probability is m/n. That’s the obvious thing to do, and it turns out to be unbiased.

Statisticians like to think about estimators, and being unbiased is one way an estimator can be ‘good’. Beware: it’s not the only way! There are estimators that are unbiased, but whose standard deviation is so huge that they’re almost useless. It can be better to have an estimate of something that’s more accurate, even though on average it’s a bit too low. So sometimes, a biased estimator can be more useful than an unbiased estimator.

Nonetheless, my ears perked up when Lou Jost mentioned that there is no unbiased estimator for Shannon entropy. In rough terms, the moral is that:

There’s no unbiased way to estimate how ignorant you are.

I think this is important. For example, it’s important because Shannon entropy is also used as a measure of biodiversity. Instead of flipping a coin repeatedly and seeing which side lands up, now we go out and collect plants or animals, and see which species we find. The relative abundance of different species defines a probability distribution on the set of species. In this language, the moral is:

There’s no unbiased way to estimate biodiversity.

But of course, this doesn’t mean we should give up. We may just have to settle for an estimator that’s a bit biased! And people have spent a bunch of time looking for estimators that are less biased than the naive one I just described.

By the way, equating ‘biodiversity’ with ‘Shannon entropy’ is sloppy: there are many measures of biodiversity. The Shannon entropy is just a special case of the Rényi entropy, which depends on a parameter q: we get Shannon entropy when q = 1.

As q gets smaller, the Rényi entropy gets more and more sensitive to rare species—or shifting back to the language of probability theory, rare events. It’s the rare events that make Shannon entropy hard to estimate, so I imagine there should be theorems about estimators for Rényi entropy, which say it gets harder to estimate as q gets smaller. Do you know such theorems?

Also, I should add that biodiversity is better captured by the ‘Hill numbers’, which are functions of the Rényi entropy, than by the Rényi entropy itself. (See here for the formulas.) Since these functions are nonlinear, the lack of an unbiased estimator for Rényi entropy doesn’t instantly imply the same for the Hill numbers. So there are also some obvious questions about unbiased estimators for Hill numbers. Do you know answers to those?

Here are some papers on estimators for entropy. Most of these focus on estimating the Shannon entropy of a probability distribution on a finite set.

This old classic has a proof that the ‘naive’ estimator of Shannon entropy is biased, and estimates on the bias:

• Bernard Harris, The statistical estimation of entropy in the non-parametric case, Army Research Office, 1975.

He shows the bias goes to zero as we increase the number of samples: the number I was calling n in my coin flip example. In fact he shows the bias goes to zero like O(1/n). This is big O notation, which means that as n \to +\infty, the bias is bounded by some constant times 1/n. This constant depends on the size of our finite set—or, if you want to do better, the class number, which is the number of elements on which our probability distribution is nonzero.

Using this idea, he shows that you can find a less biased estimator if you have a probability distribution p_i on a finite set and you know that exactly k of these probabilities are nonzero. To do this, just take the ‘naive’ estimator I described earlier and add (k-1)/(2n). This is called the Miller–Madow bias correction. The bias of this improved estimator goes to zero like O(1/n^2).
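
In code the correction is a one-liner on top of the naive estimate. Here is a minimal C++ sketch, where the species counts and the assumed value of k are made-up inputs for illustration:

#include <cmath>
#include <cstdio>
#include <vector>

// Naive (plug-in) entropy estimate from a list of species counts, plus the
// Miller-Madow correction (k - 1)/(2n), where k is the assumed number of
// species with nonzero probability and n is the total number of samples.
int main() {
    std::vector<int> counts = {12, 7, 5, 3, 1, 1, 1};   // made-up sample
    const int k = 10;                                   // assumed true number of species
    int n = 0;
    for (int c : counts) n += c;

    double naive = 0.0;
    for (int c : counts) {
        double q = double(c) / n;
        naive -= q * std::log(q);   // natural logarithm, so the answer is in nats
    }
    double corrected = naive + double(k - 1) / (2.0 * n);
    std::printf("naive = %f nats   Miller-Madow = %f nats\n", naive, corrected);
    return 0;
}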

The problem is that in practice you don’t know ahead of time how many probabilities are nonzero! In applications to biodiversity this would amount to knowing ahead of time how many species exist, before you go out looking for them.

But what about the theorem that there’s no unbiased estimator for Shannon entropy? The best reference I’ve found is this:

• Liam Paninski, Estimation of entropy and mutual information, Neural Computation 15 (2003) 1191-1254.

In Proposition 8 of Appendix A, Paninski gives a quick proof that there is no unbiased estimator of Shannon entropy for probability distributions on a finite set. But his paper goes far beyond this. Indeed, it seems like a pretty definitive modern discussion of the whole subject of estimating entropy. Interestingly, this subject is dominated by neurobiologists studying entropy of signals in the brain! So, lots of his examples involve brain signals.

Another overview, with tons of references, is this:

• J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy estimation: an overview.

This paper focuses on the situation where we don’t know ahead of time how many probabilities are nonzero:

• Anne Chao and T.-J. Shen, Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample, Environmental and Ecological Statistics 10 (2003), 429–443.

In 2003 there was a conference on the problem of estimating entropy, whose webpage has useful information. As you can see, it was dominated by neurobiologists:

Estimation of entropy and information of undersampled probability distributions: theory, algorithms, and applications to the neural code, Whistler, British Columbia, Canada, 12 December 2003.

By the way, I was very confused for a while, because these guys claim to have found an unbiased estimator of Shannon entropy:

• Stephen Montgomery Smith and Thomas Schürmann, Unbiased estimators for entropy and class number.

However, their way of estimating entropy has a funny property: in the language of biodiversity, it’s only well-defined if our samples include at least one organism of each species. So, we cannot compute this estimate for an arbitrary list of n samples. This means it’s not an estimator in the usual sense—the sense that Paninski is using! So it doesn’t really contradict Paninski’s result.

To wrap up, let me state Paninski’s result in a mathematically precise way. Suppose p is a probability distribution on a finite set X. Suppose S is any number we can compute from p: that is, any real-valued function on the set of probability distributions. We’ll be interested in the case where S is the Shannon entropy:

\displaystyle{ S = -\sum_{x \in X} p(x) \, \log p(x) }

Here we can use whatever base for the logarithm we like: earlier I was using base 2, but that’s not sacred. Define an estimator to be any function

\hat{S}: X^n \to \mathbb{R}

The idea is that given n samples from the set X, meaning points x_1, \dots, x_n \in X, the estimator gives a number \hat{S}(x_1, \dots, x_n). This number is supposed to estimate some feature of the probability distribution p: for example, its entropy.

If the samples are independent and distributed according to the distribution p, the sample mean of the estimator will be

\displaystyle{ \langle \hat{S} \rangle = \sum_{x_1, \dots, x_n \in X} \hat{S}(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n) }

The bias of the estimator is the difference between the sample mean of the estimator and actual value of S:

\langle \hat{S} \rangle - S

The estimator \hat{S} is unbiased if this bias is zero for all p.

Proposition 8 of Paninski’s paper says there exists no unbiased estimator for entropy! The proof is very short…
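
(Roughly, the idea seems to be this: for a fixed number of samples n, the expected value of any estimator \hat{S} is a polynomial of degree n in the probabilities p(x), while the Shannon entropy is not a polynomial function of those probabilities, so the two cannot agree for every p.)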

Okay, that’s all for today.

I’m back in Singapore now; I learned so much at the Mathematics of Biodiversity conference that there’s no way I’ll be able to tell you all that information. I’ll try to write a few more blog posts, but please be aware that my posts so far give a hopelessly biased and idiosyncratic view of the conference, which would be almost unrecognizable to most of the participants. There are a lot of important themes I haven’t touched on at all… while this business of entropy estimation barely came up: I just find it interesting!

If more of you blogged more, we wouldn’t have this problem.


The Mathematics of Biodiversity (Part 5)

3 July, 2012

I’d be happy to get your feedback on these slides of the talk I’m giving the day after tomorrow:

• John Baez, Diversity, entropy and thermodynamics, 6 July 2012, Exploratory Conference on the Mathematics of Biodiversity, Centre de Recerca Matemàtica, Barcelona.

Abstract: As is well known, some popular measures of biodiversity are formally identical to measures of entropy developed by Shannon, Rényi and others. This fact is part of a larger analogy between thermodynamics and the mathematics of biodiversity, which we explore here. Any probability distribution can be extended to a 1-parameter family of probability distributions where the parameter has the physical meaning of ‘temperature’. This allows us to introduce thermodynamic concepts such as energy, entropy, free energy and the partition function in any situation where a probability distribution is present—for example, the probability distribution describing the relative abundances of different species in an ecosystem. The Rényi entropy of this probability distribution is closely related to the change in free energy with temperature. We give one application of thermodynamic ideas to population dynamics, coming from the work of Marc Harper: as a population approaches an ‘evolutionary optimum’, the amount of Shannon information it has ‘left to learn’ is nonincreasing. This fact is closely related to the Second Law of Thermodynamics.

This talk is rather different than the one I’d envisaged giving! There was a lot of interest in my work on Rényi entropy and thermodynamics, because Rényi entropies—and their exponentials, called the Hill numbers—are an important measure of biodiversity. So, I decided to spend a lot of time talking about that.


Information Geometry (Part 13)

26 June, 2012

Last time I gave a sketchy overview of evolutionary game theory. Now let’s get serious.

I’ll start by explaining ‘Nash equilibria’ for 2-person games. These are situations where neither player can profit by changing what they’re doing. Then I’ll introduce ‘mixed strategies’, where the players can choose among several strategies with different probabilities. Then I’ll introduce evolutionary game theory, where we think of each strategy as a species, and its probability as the fraction of organisms that belong to that species.

Back in Part 9, I told you about the ‘replicator equation’, which says how these fractions change with time thanks to natural selection. Now we’ll see how this leads to the idea of an ‘evolutionarily stable strategy’. And finally, we’ll see that when evolution takes us toward such a stable strategy, the amount of information the organisms have ‘left to learn’ keeps decreasing!

Nash equilibria

We can describe a certain kind of two-person game using a payoff matrix, which is an n \times n matrix A_{ij} of real numbers. We think of A_{ij} as the payoff that either player gets if they choose strategy i and their opponent chooses strategy j.

Note that in this kind of game, there’s no significant difference between the ‘first player’ and the ‘second player’: either player wins an amount A_{ij} if they choose strategy i and their opponent chooses strategy j. So, this kind of game is called symmetric even though the matrix A_{ij} may not be symmetric. Indeed, it’s common for this matrix to be antisymmetric, meaning A_{ij} = - A_{ji}, since in this case what one player wins, the other loses. Games with this extra property are called zero-sum games. But we won’t limit ourselves to those!

We say a strategy i is a symmetric Nash equilibrium if

A_{ii} \ge A_{ji}

for all j. This means that if both players use strategy i, neither gains anything by switching to another strategy.

For example, suppose our matrix is

\left( \begin{array}{rr} -1 & -12 \\  0 & -3 \end{array} \right)

Then we’ve got the Prisoner’s Dilemma exactly as described last time! Here strategy 1 is cooperate and strategy 2 is defect. If a player cooperates and so does his opponent, he wins

A_{11} = -1

meaning he gets one month in jail. We include a minus sign because ‘winning a month in jail’ is not a good thing. If the player cooperates but his opponent defects, he gets a whole year in jail:

A_{12} = -12

If he defects but his opponent cooperates, he doesn’t go to jail at all:

A_{21} = 0

And if they both defect, they both get three months in jail:

A_{22} = -3

You can see that defecting is a Nash equilibrium, since

A_{22} \ge A_{12}

So, oddly, if our prisoners know game theory and believe Nash equilibria are best, they’ll both be worse off than if they cooperate and don’t betray each other.


Nash equilibria for mixed strategies

So far we’ve been assuming that with 100% certainty, each player chooses one strategy i = 1,2,3,\dots, n. Since we’ll be considering more general strategies in a minute, let’s call these pure strategies.

Now let’s throw some probability theory into the stew! Let’s allow the players to pick different pure strategies with different probabilities. So, we define a mixed strategy to be a probability distribution on the set of pure strategies. In other words, it’s a list of n nonnegative numbers

p_i \ge 0

that sum to one:

\displaystyle{ \sum_{i=1}^n p_i = 1 }

Say I choose the mixed strategy p while you, my opponent, choose the mixed strategy q. Say our choices are made independently. Then the probability that I choose the pure strategy i while you choose j is

p_i q_j

so the expected value of my winnings is

\displaystyle{ \sum_{i,j = 1}^n p_i A_{ij} q_j }

or using vector notation

p \cdot A q

where the dot is the usual dot product on \mathbb{R}^n.

We can easily adapt the concept of Nash equilibrium to mixed strategies. A mixed strategy q is a symmetric Nash equilibrium if for any other mixed strategy p,

q \cdot A q \ge  p \cdot A q

This means that if both you and I are playing the mixed strategy q, I can’t improve my expected winnings by unilaterally switching to the mixed strategy p. And neither can you, because the game is symmetric!

If this were a course on game theory, I would now do some examples. But it’s not, so I’ll just send you to page 6 of Sandholm’s paper: he looks at some famous games like ‘hawks and doves’ and ‘rock paper scissors’.
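
Still, as a quick numerical sanity check of the definition, here is a C++ sketch verifying that the uniform mixed strategy q = (1/3, 1/3, 1/3) satisfies the Nash condition for the usual ‘rock paper scissors’ payoff matrix (win = 1, loss = -1, draw = 0; this matrix is our choice of convention):

#include <cstdio>
#include <random>

// Check numerically that q = (1/3, 1/3, 1/3) satisfies q.Aq >= p.Aq for many
// random mixed strategies p, where A is the rock paper scissors payoff matrix
// with rows and columns ordered (rock, paper, scissors).
int main() {
    const double A[3][3] = {{0, -1, 1}, {1, 0, -1}, {-1, 1, 0}};
    const double q[3] = {1.0 / 3, 1.0 / 3, 1.0 / 3};

    std::mt19937 rng(5);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    // a . A b
    auto payoff = [&](const double* a, const double* b) {
        double s = 0.0;
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) s += a[i] * A[i][j] * b[j];
        return s;
    };

    bool ok = true;
    for (int trial = 0; trial < 100000; ++trial) {
        double p[3], total = 0.0;
        for (int i = 0; i < 3; ++i) { p[i] = uni(rng); total += p[i]; }
        for (int i = 0; i < 3; ++i) p[i] /= total;   // a random mixed strategy
        if (payoff(p, q) > payoff(q, q) + 1e-12) ok = false;
    }
    std::printf("q.Aq >= p.Aq held in every trial: %s\n", ok ? "yes" : "no");
    return 0;
}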

Evolutionarily stable strategies

We’re finally ready to discuss evolutionarily stable strategies. To do this, let’s reinterpret the ‘pure strategies’ i = 1,2,3, \dots n as species. Here I don’t necessarily mean species in the classic biological sense: I just mean different kinds of self-replicating entities, or replicators. For example, they could be different alleles of the same gene.

Similarly, we’ll reinterpret the ‘mixed strategy’ p as describing a mixed population of replicators, where the fraction of replicators belonging to the ith species is p_i. These numbers are still probabilities: p_i is the probability that a randomly chosen replicator will belong to the ith species.

We’ll reinterpret the payoff matrix A_{ij} as a fitness matrix. In our earlier discussion of the replicator equation, we assumed that the population P_i of the ith species grew according to the replicator equation

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) P_i }

where the fitness function f_i is any smooth function of the populations of each kind of replicator.

But in evolutionary game theory it’s common to start by looking at a simple special case where

\displaystyle{f_i(P_1, \dots, P_n)   = \sum_{j=1}^n A_{ij} p_j }

where

\displaystyle{ p_j = \frac{P_j}{\sum_k P_k} }

is the fraction of replicators who belong to the jth species.

What does this mean? The idea is that we have a well-mixed population of game players—or replicators. Each one has its own pure strategy—or species. Each one randomly roams around and ‘plays games’ with each other replicator it meets. It gets to reproduce at a rate proportional to its expected winnings.

This is unrealistic in all sorts of ways, but it’s mathematically cute, and it’s been studied a lot, so it’s good to know about. Today I’ll explain evolutionarily stable strategies only in this special case. Later I’ll go back to the general case.

Suppose that we select a sample of replicators from the overall population. What is the mean fitness of the replicators in this sample? For this, we need to know the probability that a replicator from this sample belongs to the ith species. Say it’s q_i. Then the mean fitness of our sample is

\displaystyle{ \sum_{i,j=1}^n q_i A_{ij} p_j }

This is just a weighted average of the fitnesses in our earlier formula. But using the magic of vectors, we can write this sum as

q \cdot A p

We already saw this type of expression in the last section! It’s my expected winnings if I play the mixed strategy q and you play the mixed strategy p.

John Maynard Smith defined q to be an evolutionarily stable strategy if, when we add a small population of ‘invaders’ distributed according to any other probability distribution p, the original population is more fit than the invaders.

In simple terms: a small ‘invading’ population will do worse than the population as a whole.

Mathematically, this means:

q \cdot A ((1-\epsilon)q + \epsilon p) >  p \cdot A ((1-\epsilon)q + \epsilon p)

for all mixed strategies p \ne q and all sufficiently small \epsilon > 0. Here

(1-\epsilon)q + \epsilon p

is the population we get by replacing an \epsilon-sized portion of our original population by invaders.

Puzzle: Show that q is an evolutionarily stable strategy if and only if these two conditions hold for all mixed strategies p:

q \cdot A q \ge p \cdot A q

and also, for all q \ne p,

q \cdot A q = p \cdot A q \; \implies \; q \cdot A p > p \cdot A p

The first condition says that q is a symmetric Nash equilibrium. In other words, the invaders can’t on average be better playing against the original population than members of the original population are. The second says that if the invaders are just as good at playing against the original population, they must be worse at playing against each other! The combination of these conditions means the invaders won’t take over.

Again, I should do some examples… but instead I’ll refer you to page 9 of Sandholm’s paper, and also these course notes:

• Samuel Alizon and Daniel Cownden, Evolutionary games and evolutionarily stable strategies.

• Samuel Alizon and Daniel Cownden, Replicator dynamics.
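
Still, for a concrete numerical check of the two conditions in the Puzzle, here is a C++ sketch using the ‘hawks and doves’ game with the textbook payoff matrix for V = 2 and C = 4 (our illustrative choice), whose well-known evolutionarily stable strategy is q = (1/2, 1/2):

#include <cmath>
#include <cstdio>
#include <random>

// Check the two conditions from the Puzzle for the 'hawks and doves' game with
// V = 2 and C = 4, so A = [[(V-C)/2, V], [0, V/2]] = [[-1, 2], [0, 1]]; the
// textbook evolutionarily stable strategy is q = (V/C, 1 - V/C) = (1/2, 1/2).
int main() {
    const double A[2][2] = {{-1.0, 2.0}, {0.0, 1.0}};
    const double q[2] = {0.5, 0.5};

    // a . A b
    auto payoff = [&](const double* a, const double* b) {
        double s = 0.0;
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j) s += a[i] * A[i][j] * b[j];
        return s;
    };

    std::mt19937 rng(11);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    bool ok = true;
    for (int trial = 0; trial < 100000; ++trial) {
        double p[2];
        p[0] = uni(rng);
        p[1] = 1.0 - p[0];                                     // a random mixed strategy
        if (payoff(p, q) > payoff(q, q) + 1e-12) ok = false;   // first condition
        bool tie = std::fabs(payoff(p, q) - payoff(q, q)) < 1e-12;
        bool differs = std::fabs(p[0] - q[0]) > 1e-6;
        if (tie && differs && payoff(q, p) <= payoff(p, p)) ok = false;   // second condition
    }
    std::printf("both conditions held in every trial: %s\n", ok ? "yes" : "no");
    return 0;
}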

The decrease of relative information

Now comes the punchline… but with a slight surprise twist at the end. Last time we let

P = (P_1, \dots , P_n)

be a population that evolves with time according to the replicator equation, and we let p be the corresponding probability distribution. We supposed q was some fixed probability distribution. We saw that the relative information

I(q,p) = \displaystyle{ \sum_i \ln \left(\frac{q_i}{ p_i }\right) q_i }

obeys

\displaystyle{ \frac{d}{dt} I(q,p) =  (p - q) } \cdot f(P)

where f(P) is the vector of fitness functions. So, this relative information can never increase if

(p - q) \cdot f(P) \le 0

for all P.

We can adapt this to the special case we’re looking at now. Remember, right now we’re assuming

\displaystyle{f_i(P_1, \dots, P_n)   = \sum_{j=1}^n A_{ij} p_j }

so

f(P) = A p

Thus, the relative information will never increase if

(p - q) \cdot A p \le 0

or in other words,

q \cdot A p \ge p \cdot A p  \qquad \qquad \qquad \qquad \qquad \qquad  (1)

Now, this looks very similar to the conditions for an evolutionary stable strategy as stated in the Puzzle above. But it’s not the same! That’s the surprise twist.

Remember, the Puzzle says that q is an evolutionarily stable state if for all mixed strategies p we have

q \cdot A q \ge p \cdot A q  \qquad \qquad \qquad \qquad \qquad \qquad (2)

and also

q \cdot A q = p \cdot A q \; \implies \; q \cdot A p > p \cdot A p  \qquad \; (3)

Note that condition (1), the one we want, is neither condition (2) nor condition (3)! This drove me crazy for almost a day.

I kept thinking I’d made a mistake, like mixing up p and q somewhere. You’ve got to mind your p’s and q’s in this game!

But the solution turned out to be this. After Maynard Smith came up with his definition of ‘evolutionarily stable state’, another guy came up with a different definition:

• Bernhard Thomas, On evolutionarily stable sets, J. Math. Biology 22 (1985), 105–115.

For him, an evolutionarily stable strategy obeys

q \cdot A q \ge p \cdot A q  \qquad \qquad \qquad \qquad \qquad \qquad (2)

and also

q \cdot A p \ge p \cdot A p  \qquad \qquad \qquad \qquad \qquad \qquad  (1)

Condition (1) is stronger than condition (3), so he renamed Maynard Smith’s evolutionarily stable strategies weakly evolutionarily stable strategies. And condition (1) guarantees that the relative information I(q,p) can never increase. So, now we’re happy.

Except for one thing: why should we switch from Maynard Smith’s perfectly sensible concept of evolutionarily stable state to this new stronger one? I don’t really know, except that

• it’s not much stronger

and

• it lets us prove the theorem we want!

So, it’s a small mystery for me to mull over. If you have any good ideas, let me know.
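
In the meantime, here is a C++ sketch that shows the theorem at work: it integrates the replicator equation for the same ‘hawks and doves’ matrix as in the earlier sketch, with q the evolutionarily stable strategy (1/2, 1/2), and prints the relative information I(q,p), which should drop steadily toward zero. The initial population and step size are illustrative:

#include <cmath>
#include <cstdio>

// Integrate the replicator equation dp_i/dt = p_i ((Ap)_i - p.Ap) for the
// 'hawks and doves' matrix from the earlier sketch, and print the relative
// information I(q,p) with q = (1/2, 1/2), which should decrease toward zero.
int main() {
    const double A[2][2] = {{-1.0, 2.0}, {0.0, 1.0}};
    const double q[2] = {0.5, 0.5};
    double p[2] = {0.9, 0.1};   // illustrative initial population fractions
    const double h = 0.001;     // Euler step for this ordinary differential equation

    for (int step = 0; step <= 20000; ++step) {
        if (step % 2000 == 0) {
            double I = q[0] * std::log(q[0] / p[0]) + q[1] * std::log(q[1] / p[1]);
            std::printf("t = %5.1f   p = (%.4f, %.4f)   I(q,p) = %.6f\n",
                        step * h, p[0], p[1], I);
        }
        double Ap0 = A[0][0] * p[0] + A[0][1] * p[1];
        double Ap1 = A[1][0] * p[0] + A[1][1] * p[1];
        double mean = p[0] * Ap0 + p[1] * Ap1;   // p . A p
        double dp0 = p[0] * (Ap0 - mean);
        double dp1 = p[1] * (Ap1 - mean);
        p[0] += h * dp0;
        p[1] += h * dp1;
    }
    return 0;
}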


The Mathematics of Biodiversity (Part 2)

24 June, 2012

How likely is it that the next thing we see is one of a brand new kind? That sounds like a hard question. Last time I told you about the Good–Turing rule for answering this question.

The discussion that blog entry triggered has been very helpful! Among other things, it got Lou Jost more interested in this subject. Two days ago, he showed me the following simple argument for the Good–Turing estimate.

Suppose there are finitely many species of orchid. Suppose the fraction of orchids belonging to the ith species is p_i.

Suppose we start collecting orchids. Suppose each time we find one, the chance that it’s an orchid of the ith species is p_i. Of course this is not true in reality! For example, it’s harder to find a tiny orchid, like this:

than a big one. But never mind.

Say we collect a total of N orchids. What is the probability that we find no orchids of the ith species? It is

(1 - p_i)^N

Similarly, the probability that we find exactly one orchid of the ith species is

N p_i (1 - p_i)^{N-1}

And so on: these are the first two terms in a binomial series.

Let n_1 be the expected number of singletons: species for which we find exactly one orchid of that species. Then

\displaystyle{ n_1 = \sum_i N p_i (1 - p_i)^{N-1} }

Let D be the coverage deficit: the expected fraction of the total population consisting of species that remain undiscovered. Given our assumptions, this is the same as the chance that the next orchid we find will be of a brand new species.

Then

\displaystyle{ D = \sum_i p_i (1-p_i)^N }

since p_i is the fraction of orchids belonging to the ith species and (1-p_i)^N is the chance that this species remains undiscovered.

Lou Jost pointed out that the formulas for n_1 and D are very similar! In particular,

\displaystyle{ \frac{n_1}{N} = \sum_i p_i (1 - p_i)^{N-1} }

should be very close to

\displaystyle{ D = \sum_i p_i (1 - p_i)^N }

when N is large. So, we should have

\displaystyle{ D \approx \frac{n_1}{N} }

In other words: the chance that the next orchid we find is of a brand new species should be close to the fraction of orchids that are singletons now.
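
Here is a quick C++ sketch that checks this numerically: it samples N individuals from a made-up species-abundance distribution many times over, and compares the average number of singletons divided by N with the average coverage deficit:

#include <cstdio>
#include <random>
#include <vector>

// Sample N individuals from a made-up species-abundance distribution, many
// times over, and compare the average fraction of singletons n_1/N with the
// average coverage deficit D (the unseen fraction of the population).
int main() {
    // Hypothetical relative abundances of 50 species, proportional to 1/i.
    std::vector<double> p(50);
    double norm = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) { p[i] = 1.0 / (i + 1); norm += p[i]; }
    for (std::size_t i = 0; i < p.size(); ++i) p[i] /= norm;

    const int N = 100;            // number of orchids collected per trial
    const int trials = 100000;
    std::mt19937 rng(9);
    std::discrete_distribution<int> pick(p.begin(), p.end());

    double sumSingletons = 0.0, sumDeficit = 0.0;
    for (int t = 0; t < trials; ++t) {
        std::vector<int> count(p.size(), 0);
        for (int n = 0; n < N; ++n) count[pick(rng)]++;
        int singletons = 0;
        double deficit = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) {
            if (count[i] == 1) singletons++;
            if (count[i] == 0) deficit += p[i];   // species we have not seen yet
        }
        sumSingletons += singletons;
        sumDeficit += deficit;
    }
    std::printf("n_1/N = %f   D = %f\n", sumSingletons / trials / N, sumDeficit / trials);
    return 0;
}

The two printed numbers should agree to within a few percent.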

Of course it would be nice to turn these ‘shoulds’ into precise theorems! Theorem 1 in this paper does that:

• David McAllester and Robert E. Schapire, On the convergence rate of Good–Turing estimators, February 17, 2000.

By the way: the only difference between the formulas for n_1/N and D is that the first contains the exponent N-1, while the second contains the exponent N. So, Lou Jost’s argument is a version of Boris Borcic’s ‘time-reversal’ idea:

Good’s estimate is what you immediately obtain if you time-reverse your sampling procedure, e.g., if you ask for the probability that there is a change in the number of species in your sample when you randomly remove a specimen from it.


The Mathematics of Biodiversity (Part 1)

21 June, 2012

I’m in Barcelona now, and I want to blog about this:

Research Program on the Mathematics of Biodiversity, June-July 2012, Centre de Recerca Matemàtica, Barcelona, Spain. Organized by Ben Allen, Silvia Cuadrado, Tom Leinster, Richard Reeve and John Woolliams.

We’re having daily informal talks and there’s no way I can blog about all of them, talk to people here, and still get enough work done. So, I’ll just mention a few things that strike me! For example, this morning Lou Jost told me about an interesting paper by I. J. Good.

I’d known of I. J. Good as one of the guys who came up with the concept of a ‘technological singularity’. In 1963 he wrote:

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.

He was a British mathematician who worked as a cryptologist at Bletchley Park with Alan Turing. After World War II, he continued to work with Turing on the design of computers and Bayesian statistics at the University of Manchester. Later he moved to the US. In 1968, thanks to his interest in artificial intelligence, he served as consultant for Stanley Kubrick’s film 2001: A Space Odyssey. He died in 2009.

Good was also a big chess enthusiast, and worked on writing programs to play chess. He’s the guy in front here:

If you want to learn more about his work on chess, click on this photo!

But the paper Lou Jost mentioned is on a rather different subject:

• Irving John Good, The population frequency of species and the estimation of population parameters, Biometrika 40 (1953), 237–264.

Let me just state one result, sloppily, without any details or precise hypotheses!

Puzzle: Suppose you go into the jungles of Ecuador and start collecting orchids. You count the number of orchids of each different species that you find. You get a list of numbers, something like this:

14, 10, 8, 6, 2, 1, 1, 1

What is the chance that the next orchid you find will belong to a new species?

Good gives a rule of thumb for solving problems of this type:

\displaystyle{ \frac{n_1}{N} }

Here N is the total number of orchids you collected, and n_i is the number of species for which you found exactly i orchids of that species. In our example,

n_1 = 3

since we found just one orchid of three different species: those are the three 1’s at the end of our list. Furthermore,

N = 14 + 10 + 8 + 6 + 2 + 1 + 1 + 1 = 43

So here is Good’s estimate of the chance that the next orchid you collect will be of a new species:

\displaystyle{ \frac{n_1}{N} = \frac{3}{43} }

Good’s argument is nontrivial—and of course it depends on some assumptions on the nature of the distribution of populations of different species! Since he doesn’t state these assumptions succinctly and I haven’t read the paper carefully yet, I’m afraid you’ll have to read the paper to find out what they are.

Of course the math works for samples of anything that comes in distinct types, not just species of organisms! Good considers four examples:

• moths captured in a light-trap at Rothamsted, England,

• words in American newspapers,

• nouns in Macaulay’s essay on Bacon,

• chess openings in games published by the British Chess Magazine in 1951.

By comparing a small sample to a bigger one, he studies how well his rule works in practice, and apparently it does okay.

In his paper, I. J. Good thanks Alan Turing for coming up with the basic idea. In fact he says Turing gave an ‘intuitive demonstration’ of it—but he doesn’t give this intuitive demonstration, and according to Lou Jost he actually admits somewhere that he forgot it.

You can read more about the idea here:

Good–Turing frequency estimation.

By the way, Lou Jost is not only an expert on biodiversity and its relation to entropy! He lives in the jungles of Ecuador and has discovered over 60 new species of orchids, including the world’s smallest:

He found it in Ecuador, and the petals are just a few cells thick! (Typically, the news reports say he found it in Bolivia and the petals are just one cell thick.)

He said:

I found it among the roots of another plant that I had collected, another small orchid which I took back to grow in my greenhouse to get it to flower. A few months later I saw that down among the roots was a tiny little plant that I realised was more interesting than the bigger orchid. Looking at the flower is often the best way to be able to identify which species of orchid you’ve got hold of – and can tell you whether you’re looking at an unknown species or not.

