Information Geometry (Part 10)

Last time I began explaining the tight relation between three concepts:

• entropy,

• information—or more precisely, lack of information,

and

• biodiversity.

The idea is to consider n different species of ‘replicators’. A replicator is any entity that can reproduce itself, like an organism, a gene, or a meme. A replicator can come in different kinds, and a ‘species’ is just our name for one of these kinds. If P_i is the population of the ith species, we can interpret the fraction

\displaystyle{ p_i = \frac{P_i}{\sum_j P_j} }

as a probability: the probability that a randomly chosen replicator belongs to the ith species. This suggests that we define entropy just as we do in statistical mechanics:

\displaystyle{ S = - \sum_i p_i \ln(p_i) }

In the study of statistical inference, entropy is a measure of uncertainty, or lack of information. But now we can interpret it as a measure of biodiversity: it’s zero when just one species is present, and small when a few species have much larger populations than all the rest, but gets big otherwise.
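To make this concrete, here is a minimal sketch in Python that computes the entropy from a list of populations. The function name `shannon_entropy` and the example populations are my own choices for illustration, not anything from earlier in the series.

```python
import math

def shannon_entropy(populations):
    """Entropy S = -sum_i p_i ln(p_i) of the fractions p_i = P_i / sum_j P_j."""
    total = sum(populations)
    fractions = [P / total for P in populations]
    # Species with p_i = 0 contribute nothing (the convention 0 ln 0 = 0).
    return sum(-p * math.log(p) for p in fractions if p > 0)

# Hypothetical populations of four species.
one_species = shannon_entropy([100, 0, 0, 0])     # only one species present
few_dominant = shannon_entropy([97, 1, 1, 1])     # one species dominates
even_mix = shannon_entropy([25, 25, 25, 25])      # all species equally common

print(one_species, few_dominant, even_mix)
```

The three printed values should come out in increasing order, with the evenly mixed case equal to ln(4), the largest entropy possible for four species.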

Our goal here is to play these viewpoints off against each other. In short, we want to think of natural selection, and even biological evolution, as a process of statistical inference, or in simple terms, learning.

To do this, let’s think about how entropy changes with time. Last time we introduced a simple model called the replicator equation:

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

where the ith population grows at a rate proportional to its ‘fitness function’ f_i. We can get some intuition by looking at the pathetically simple case where these functions are actually constants, so

\displaystyle{ \frac{d P_i}{d t} = f_i \, P_i }

The equation then becomes trivial to solve:

\displaystyle{ P_i(t) = e^{t f_i } P_i(0)}

Last time I showed that in this case, the entropy will eventually decrease. It will go to zero as t \to +\infty whenever one species is fitter than all the rest and starts out with a nonzero population—since then this species will eventually take over.

But remember, the entropy of a probability distribution is its lack of information. So the decrease in entropy signals an increase in information. And last time I argued that this makes perfect sense. As the fittest species takes over and biodiversity drops, the population is acquiring information about its environment.

However, I never said the entropy is always decreasing, because that’s false! Even in this pathetically simple case, entropy can increase.

Suppose we start with many replicators belonging to one very unfit species, and a few belonging to various more fit species. The probability distribution p_i will start out sharply peaked, so the entropy will start out low.

Now think about what happens when time passes. At first the unfit species will rapidly die off, while the population of the other species slowly grows.

So the probability distribution will, for a while, become less sharply peaked. Thus, for a while, the entropy will increase!
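Here is a rough numerical sketch of this scenario in Python, using the solution P_i(t) = e^{t f_i} P_i(0) for constant fitness. The fitness values and initial populations are made-up numbers chosen only to show the effect.

```python
import math

def fractions_at(t, fitness, initial_populations):
    """Species fractions p_i(t) when P_i(t) = exp(t f_i) P_i(0)."""
    # Subtracting the largest fitness rescales every population by the same
    # factor, which leaves the fractions unchanged and avoids overflow.
    f_max = max(fitness)
    populations = [math.exp(t * (f - f_max)) * P0
                   for f, P0 in zip(fitness, initial_populations)]
    total = sum(populations)
    return [P / total for P in populations]

def entropy(p):
    return sum(-x * math.log(x) for x in p if x > 0)

# Hypothetical numbers: one abundant unfit species and three rare fitter ones.
fitness = [-1.0, 0.5, 0.6, 1.0]
initial_populations = [1000.0, 1.0, 1.0, 1.0]

times = [0.5 * k for k in range(61)]  # t = 0, 0.5, ..., 30
entropies = [entropy(fractions_at(t, fitness, initial_populations)) for t in times]
peak_index = max(range(len(times)), key=lambda k: entropies[k])

# Print the entropy at the start, at its peak, and at the end.
for k in (0, peak_index, len(times) - 1):
    print(f"t = {times[k]:5.1f}   S = {entropies[k]:.4f}")
```

With these numbers the entropy starts low, peaks at an intermediate time, and then falls toward zero as the fittest species takes over.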

This seems to conflict with our idea that the population’s entropy should decrease as it acquires information about its environment. But in fact this phenomenon is familiar in the study of statistical inference. If you start out with strongly held false beliefs about a situation, the first effect of learning more is to become less certain about what’s going on!

Get it? Say you start out by assigning a high probability to some wrong guess about a situation. The entropy of your probability distribution is low: you’re quite certain about what’s going on. But you’re wrong. When you first start suspecting you’re wrong, you become more uncertain about what’s going on. Your probability distribution flattens out, and the entropy goes up.

So, sometimes learning involves a decrease in information—false information. There’s nothing about the mathematical concept of information that says this information is true.

Given this, it’s good to work out a formula for the rate of change of entropy, which will let us see more clearly when it goes down and when it goes up. To do this, first let’s derive a completely general formula for the time derivative of the entropy of a probability distribution. Following Sir Isaac Newton, we’ll use a dot to stand for a time derivative:

\begin{array}{ccl} \displaystyle{  \dot{S}} &=& \displaystyle{ -  \frac{d}{dt} \sum_i p_i \ln (p_i)} \\   \\  &=& \displaystyle{ - \sum_i \Big( \dot{p}_i \ln (p_i) + \dot{p}_i \Big) }  \end{array}

In the last term we took the derivative of the logarithm and got a factor of 1/p_i which cancelled the factor of p_i. But since

\displaystyle{  \sum_i p_i = 1 }

we know

\displaystyle{ \sum_i \dot{p}_i = 0 }

so this last term vanishes:

\displaystyle{ \dot{S}= -\sum_i \dot{p}_i \ln (p_i) }

Nice! To go further, we need a formula for \dot{p}_i. For this we might as well return to the general replicator equation, dropping the pathetically special assumption that the fitness functions are actually constants. Then we saw last time that

\displaystyle{ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle  \Big) \, p_i }

where we used the abbreviation

f_i(P) = f_i(P_1, \dots, P_n)

for the fitness of the ith species, and defined the mean fitness to be

\displaystyle{ \langle f(P) \rangle = \sum_i f_i(P) p_i  }

Using this cute formula for \dot{p}_i, we get the final result:

\displaystyle{ \dot{S} = - \sum_i \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i \ln (p_i) }

This is strikingly similar to the formula for entropy itself. But now each term in the sum includes a factor saying how much more fit than average, or less fit, that species is. The quantity - p_i \ln(p_i) is always nonnegative, since -x \ln(x) \ge 0 for 0 \le x \le 1.

So, the ith term contributes positively to the change in entropy if the ith species is fitter than average, but negatively if it’s less fit than average.

This may seem counterintuitive!
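Before puzzling over it, we can at least check the formula for \dot{S} numerically. Here is a minimal sketch in Python that compares it with a finite-difference estimate of the entropy's time derivative. The fitness functions in `example_fitness` and the populations are hypothetical, chosen only for illustration.

```python
import math

def fractions(P):
    total = sum(P)
    return [x / total for x in P]

def entropy(P):
    return sum(-p * math.log(p) for p in fractions(P) if p > 0)

def entropy_rate(P, fitness):
    """dS/dt = -sum_i (f_i(P) - <f(P)>) p_i ln(p_i)."""
    p = fractions(P)
    f = fitness(P)
    mean_f = sum(fi * pi for fi, pi in zip(f, p))
    return sum(-(fi - mean_f) * pi * math.log(pi)
               for fi, pi in zip(f, p) if pi > 0)

def example_fitness(P):
    # Hypothetical fitness functions: each species has its own growth rate
    # and is slowed down as its own population grows.
    growth = [1.0, 2.0, 0.5]
    return [g - 0.01 * x for g, x in zip(growth, P)]

P = [50.0, 10.0, 30.0]
h = 1e-6
f = example_fitness(P)

# Step the populations a tiny time h forward and backward along
# dP_i/dt = f_i(P) P_i, then take a central difference of the entropy.
P_later = [x + h * fi * x for x, fi in zip(P, f)]
P_earlier = [x - h * fi * x for x, fi in zip(P, f)]
numerical = (entropy(P_later) - entropy(P_earlier)) / (2 * h)

print(entropy_rate(P, example_fitness), numerical)
```

The two printed numbers should agree to many decimal places, whatever fitness functions you plug in.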

Puzzle 1. How can we reconcile this fact with our earlier observations about the case when the fitness of each species is population-independent? Namely: a) if initially most of the replicators belong to one very unfit species, the entropy will rise at first, but b) in the long run, when the fittest species present take over, the entropy drops?

If this seems too tricky, look at some examples! The first illustrates observation a); the second illustrates observation b):

Puzzle 2. Suppose we have two species, one with fitness equal to 1 initially constituting 90% of the population, the other with fitness equal to 10 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 1, & &  p_1(0) = 0.9 \\ \\                            f_2 = 10 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?

Puzzle 3. Suppose we have two species, one with fitness equal to 10 initially constituting 90% of the population, and the other with fitness equal to 1 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 10, & &  p_1(0) = 0.9 \\ \\                            f_2 = 1 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?

I had to work through these examples to understand what’s going on. Now I do, and it all makes sense.
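If you'd like to check your own answers to Puzzles 2 and 3, here is a small Python sketch that evaluates each species' term in the formula for \dot{S} at t = 0. The helper name `entropy_rate_terms` is just something I made up for this check.

```python
import math

def entropy_rate_terms(fitness, p):
    """Per-species terms -(f_i - <f>) p_i ln(p_i); their sum is dS/dt."""
    mean_fitness = sum(f * x for f, x in zip(fitness, p))
    return [-(f - mean_fitness) * x * math.log(x) for f, x in zip(fitness, p)]

puzzle_2 = entropy_rate_terms([1.0, 10.0], [0.9, 0.1])
puzzle_3 = entropy_rate_terms([10.0, 1.0], [0.9, 0.1])

for name, terms in (("Puzzle 2", puzzle_2), ("Puzzle 3", puzzle_3)):
    print(name, "terms:", terms, "total:", sum(terms))
```

Comparing the sizes of the two terms in each puzzle shows which species is responsible for most of the change.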

Next time

Still, it would be nice if there were some quantity that always goes down with the passage of time, reflecting our naive idea that the population gains information from its environment, and thus loses entropy, as time goes by.

Often there is such a quantity. But it’s not the naive entropy: it’s the relative entropy. I’ll talk about that next time. In the meantime, if you want to prepare, please reread Part 6 of this series, where I explained this concept. Back then, I argued that whenever you’re tempted to talk about entropy, you should talk about relative entropy. So, we should try that here.

There’s a big idea lurking here: information is relative. How much information a signal gives you depends on your prior assumptions about what that signal is likely to be. If this is true, perhaps biodiversity is relative too.

27 Responses to Information Geometry (Part 10)

  1. Those green bar graphs look like the Preston and Whittaker plots of biodiversity.

    I like the way this series is moving along :)

  2. This reminds me of a topic that came up in a technical meeting we had with my robotics group in Lisbon last week. Children who have not yet learned a language are good at noticing differences between phonemes. Then their ability to discriminate dips. When it comes up again, it has changed: they can now only discriminate phonemes that their native language discriminates. They have learned a mapping from sound to sense that groups some things together. The speaker had invented a machine learning method using interlinked self-organizing maps that reproduced this behavior.

    • John Baez says:

      That’s interesting! I’m not sure if it’s formally related to what I was talking about. I was talking about how the usual mathematical concept of information includes ‘misinformation’. Thus, learning can involve a decrease of information as you discard misinformation, followed by an increase of information as you acquire correct (or at least better) information. I’m not sure the children’s early state involves having a lot of ‘misinformation’. Maybe in some sense it does—I can’t tell.

      It would be interesting to know what the children are like in the period when their ability to discriminate has gone down. Presumably they’re trying to classify sounds according to phonemes in the language they’re learning, but not doing very well?

      I’ve got a former student in Lisbon now: Jeffrey Morton. Next fall another former student of mine will be going there: John Huerta. The common link is Roger Picken, a mathematical physicist at the Instituto Superior Tecnico who is interested in ‘higher gauge theory’, an application of category theory to physics.

    • nad says:

      Douglas Summers-Stay says:

      “Children who have not yet learned a language are good at noticing differences between phonemes.”

      The question is also when you start to call a language a language. My mother noticed that a child (8 months) in my vicinity made the sound HOOM every time she saw a dog (dog in German is “HUND” (pronounced “hoond”), which is sort of an onomatopoeia for barking). So was she learning a word or inventing a word by mimicking barking?

  3. Blake Stacey says:

    It is interesting to see how this works for the quadratic diversity

    D^{\bf Z}[p] = \left( \sum_{ij} Z_{ij} p_i p_j \right)^{-1},

    which is the inverse of the “average discord” within a population characterized by the probabilities p_i, and where species i resembles species j (genetically, functionally, whatever) to an extent measured by Z_{ij}. The rate of change of this diversity measure is

    -(D^{\bf Z}[p])^2 \sum_{ij} Z_{ij} (f_i(P) + f_j(P) - 2\langle f(P) \rangle) p_i p_j.

    In the special case that species are regarded as entirely distinct from one another, the matrix Z_{ij} reduces to a Kronecker delta, and the time derivative of the diversity becomes

    -2(D[p])^2 \sum_i (f_i(P) - \langle f(P) \rangle) p_i^2.

    In Puzzle 2, the sum is negative so the product is positive and the diversity is increasing; in Puzzle 3, the sum is positive so the product is negative and the diversity is decreasing. This makes sense, because in Puzzle 2 we should see the population moving towards a more even distribution: the proportion of species 2 should be going up because its fitness is larger than the average. The opposite holds true in Puzzle 3. Because the diversity is maximized for the equiprobable distribution, we should expect the diversity to be going up in Puzzle 2 and down in Puzzle 3.

    • Blake Stacey says:

      Oops. “Average discord” should be “average concord” or “average similarity”. This is what I get for not proofreading before I leave to catch the bus.

      My brain seems to have switched itself to the “off” position today as far as algebra is concerned, but I believe one can construct a Lyapunov function using what one might call the “cross diversity”,

      \left( \sum_{ij} p_i q_j \right)^{-1}.

    • Blake Stacey says:

      And of course I forget the Z_{ij}. Not my best day!

  4. Mike Stay says:

    Did you mean to include examples immediately after Puzzle 1?

    • akirabergman says:

      The examples are the puzzles 2 and 3.

    • John Baez says:

      Yes, I thought Puzzle 1 might be too tricky in its abstract general form, so I wanted to nudge people into trying some examples:

      If this seems too tricky, look at some examples! The first illustrates observation a); the second illustrates observation b)…

  5. Marc Harper says:

    Why isn’t entropy monotonically decreasing? The puzzles clearly illustrate that this is not the case… here is another way to think about it using interior trajectories — those in which all types survive at equilibrium. This can occur when the fitness landscape is not constant. (Hopefully I’m not giving away much of what John is planning to talk about next.)

    The maximum entropy discrete distribution is the uniform distribution, but this distribution is unlikely to be the stable population distribution in general (it is easy to pick landscapes that converge to any given distribution). In any case, wanting the entropy to be monotonically decreasing is of course too much to ask for, since a population can start at a point of lower entropy and converge to the chosen distribution of higher entropy. Moreover, the entropy at the equilibria is nonzero, which is intuitively unsatisfying, in addition to the fact that the net result of selection in this case is overall more entropy.

    Note that at a rest point of the replicator dynamic, we have that

    0 = p_i (f_i(p) - \langle f(p) \rangle)

    so either p_i = 0 or f_i(p) = \langle f(p) \rangle. For interior distributions (no p_i = 0), we have that f_i(p) = \langle f(p) \rangle for all i, so at the stable state it is the fitness landscape that is uniform rather than the population distribution. (Note that the fitness landscape is not necessarily a probability distribution, but we can demand that it is everywhere positive.) If some of the p_i are zero, then we are really in a lower-dimensional probability simplex, and the preceding statement applies. (Incidentally, this condition is where the game theory in evolutionary game theory comes in — it’s part of the Nash equilibrium criteria.) On the other hand, if only one type of replicator ultimately survives, the entropy has to converge to zero eventually. One caveat, however: a uniform fitness landscape does not guarantee stability!

    One more puzzle to think about — it is possible for the replicator dynamic to have interior non-limit cycles in dimension 3 and higher (e.g. for the rock-paper-scissors game[1], of which examples have been found in the natural world). This implies that there is a constant of motion for the dynamic on these landscapes, but it’s not the entropy! Why not, and what is it?

    [1] See plot A at this location: http://sites.sinauer.com/animalcommunication2e/images/10/WT10.05Figure07.jpg

    • Marc Harper says:

      Looks like my math formatting got eaten. Just use John’s equation for the replicator equation.

      • John Baez says:

        In WordPress blogs, you need to write

        $latex E = mc^2 $

        to get

        E = mc^2

        I’ll fix up those equations. Thanks for saying lots of interesting stuff without giving away too much of the posts to come! (You were not the intended audience of this series, since I’m just thinking out loud about your work….)

  6. romain says:

    I have a doubt concerning the interpretation of entropy as a lack of information.

    For an environment (equivalently a collection of fitnesses), given the simplicity of the replicator equation, there may be one or infinitely many equilibria.
    – If a species has a fitness greater than any other, then at equilibrium, it will have p=1 and the entropy will be zero.
    – However, if two species have an identical fitness larger than any other species, then they may both end up with p>0. The entropy would thus be non-zero.
    – Even worse, when all fitnesses are equal, any distribution of probabilities is an equilibrium for the replicator dynamics…

    Can we, in such a case, interpret it as a lack of information about the environment compared to the case where only one species is singled out?

    • John Baez says:

      Entropy is quite generally a measure of lack of information, but one has to ask ‘lack compared to what?‘, and this is why relative entropy is so important. The problems you mention are good examples of why ordinary entropy is not sufficient to understand the sense in which an ecosystem acquires information about its environment. Next time I’ll present Marc Harper’s solution using relative entropy.

  7. Last time we saw that if we take a bunch of different species of self-replicating entities, the entropy of their population distribution can go either up or down as time passes. We saw this is true even in the pathetically simple case where all the replicators have constant fitness—so they don’t interact with each other, and don’t hit ‘limits to growth’ as their population grows […]

  8. Jussi Leinonen says:

    The formula after “we get the final result” isn’t rendered properly for some reason, at least on the two browsers I tested.

    • John Baez says:

      It works fine for me on Firefox. What’s wrong with it for you?

      Here it is:

      \displaystyle{ \dot{S} = - \sum_i \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i \ln (p_i) }

      What do other people think?

      • Jussi Leinonen says:

        That’s really weird. What I get is “x”, the e-like “belongs in a set” symbol, and a right arrow. The alt text works fine and looks like valid code to me.

        • John Baez says:

          What browser are you using? And if Firefox, which works fine for me, what version are you using and what system are you running it on?

        • romain says:

          Same for me with ubuntu 11.10 and Firefox 13.0. I get the same thing as Jussi. Though, when I place my cursor on it, I see the correct latex code… that I have to compile in my head.

      • Same thing (x element arrow) here: Win XP (tabula rasa boot/install at some Bavarian university) and Firefox 5.0.1.

  9. Jussi Leinonen says:

    I tested it with IE9 and Chrome on Windows 7, as well as Chromium and Firefox on Ubuntu. Same effect on all of them. But now I’m at work, and I see it render correctly using Chromium. I smell a caching bug on the server side…

    • davidtweed says:

      One possibility is that, although it looks like one web address, behind the scenes a request to “s0.wp.com” is sent to a geographically nearby server. It’s interesting that Florifulgurator is in Bavaria and, going purely on the name, you’re probably in Finland (apologies if this is stereotyping! :-) ), whereas John is in Spain. So it might be a caching bug that hit one of the regional servers near you (and has now resolved itself).
