Information Geometry (Part 9)



It’s time to continue this information geometry series, because I’ve promised to give the following talk at a conference on the mathematics of biodiversity in early July… and I still need to do some of the research!

Diversity, information geometry and learning

As is well known, some measures of biodiversity are formally identical to measures of information developed by Shannon and others. Furthermore, Marc Harper has shown that the replicator equation in evolutionary game theory is formally identical to a process of Bayesian inference, which is studied in the field of machine learning using ideas from information geometry. Thus, in this simple model, a population of organisms can be thought of as a ‘hypothesis’ about how to survive, and natural selection acts to update this hypothesis according to Bayes’ rule. The question thus arises to what extent natural changes in biodiversity can be usefully seen as analogous to a form of learning. However, some of the same mathematical structures arise in the study of chemical reaction networks, where the increase of entropy, or more precisely decrease of free energy, is not usually considered a form of ‘learning’. We report on some preliminary work on these issues.

So, let’s dive in! To some extent I’ll be explaining these two papers:

• Marc Harper, Information geometry and evolutionary game theory.

• Marc Harper, The replicator equation as an inference dynamic.

However, I hope to bring in some more ideas from physics, the study of biodiversity, and the theory of stochastic Petri nets, also known as chemical reaction networks. So, this series may start to overlap with my network theory posts. We’ll see. We won’t get far today: for now, I just want to review and expand on what we did last time.

The replicator equation

The replicator equation is a simplified model of how populations change. Suppose we have n types of self-replicating entity. I’ll call these entities replicators. I’ll call the types of replicators species, but they don’t need to be species in the biological sense. For example, the replicators could be genes, and the types could be alleles. Or the replicators could be restaurants, and the types could be restaurant chains.

Let P_i(t), or just P_i for short, be the population of the ith species at time t. Then the replicator equation says

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

So, the population P_i changes at a rate proportional to P_i, but the ‘constant of proportionality’ need not be constant: it can be any smooth function f_i of the populations of all the species. We call f_i(P_1, \dots, P_n) the fitness of the ith species.

Of course this model is absurdly general, while still leaving out lots of important effects, like the spatial variation of populations, or the ability for the population of some species to start at zero and become nonzero—which happens thanks to mutation. Nonetheless this model is worth taking a good look at.

Using the magic of vectors we can write

P = (P_1, \dots , P_n)

and

f(P) = (f_1(P), \dots, f_n(P))

This lets us write the replicator equation a wee bit more tersely as

\displaystyle{ \frac{d P}{d t} = f(P) P}

where on the right I’m multiplying vectors componentwise, the way your teachers tried to brainwash you into never doing:

f(P) P = (f(P)_1 P_1, \dots, f(P)_n P_n)

In other words, I’m thinking of P and f(P) as functions on the set \{1, \dots, n\} and multiplying them pointwise. This will be a nice way of thinking if we want to replace this finite set by some more general space.

Why would we want to do that? Well, we might be studying lizards with different length tails, and we might find it convenient to think of the set of possible tail lengths as the half-line [0,\infty) instead of a finite set.

Or, just to get started, we might want to study the pathetically simple case where f(P) doesn’t depend on P. Then we just have a fixed function f and a time-dependent function P obeying

\displaystyle{ \frac{d P}{d t} = f P}

If we’re physicists, we might write P more suggestively as \psi and write the operator multiplying by f as - H. Then our equation becomes

\displaystyle{ \frac{d \psi}{d t} = - H \psi }

This looks a lot like Schrödinger’s equation, but since there’s no factor of \sqrt{-1}, and \psi is real-valued, it’s more like the heat equation or the ‘master equation’, the basic equation of stochastic mechanics.

For an explanation of Schrödinger’s equation and the master equation, try Part 12 of the network theory series. In that post I didn’t include a minus sign in front of the H. That’s no big deal: it’s just a different convention than the one I want today. A more serious issue is that in stochastic mechanics, \psi stands for a probability distribution. This suggests that we should get probabilities into the game somehow.

The replicator equation in terms of probabilities

Luckily, that’s exactly what people usually do! Instead of talking about the population P_i of the ith species, they talk about the probability p_i that one of our organisms will belong to the ith species. This amounts to normalizing our populations:

\displaystyle{  p_i = \frac{P_i}{\sum_j P_j} }

Don’t you love it when notations work out well? Our big Population P_i has gotten normalized to give little probability p_i.

How do these probabilities p_i change with time? Now is the moment for that least loved rule of elementary calculus to come out and take a bow: the quotient rule for derivatives!

\displaystyle{ \frac{d p_i}{d t} = \left(\frac{d P_i}{d t} \sum_j P_j \quad - \quad P_i \sum_j \frac{d P_j}{d t}\right) \big{/} \left(  \sum_j P_j \right)^2 }

Using our earlier version of the replicator equation, this gives:

\displaystyle{ \frac{d p_i}{d t} =  \left(f_i(P) P_i \sum_j P_j \quad - \quad P_i \sum_j f_j(P) P_j \right) \big{/} \left(  \sum_j P_j \right)^2 }

Using the definition of p_i, this simplifies to:

\displaystyle{ \frac{d p_i}{d t} =  f_i(P) p_i \quad - \quad \left( \sum_j f_j(P) p_j \right) p_i }

The stuff in parentheses actually has a nice meaning: it’s just the mean fitness. In other words, it’s the average, or expected, fitness of an organism chosen at random from the whole population. Let’s write it like this:

\displaystyle{ \langle f(P) \rangle = \sum_j f_j(P) p_j  }

So, we get the replicator equation in its classic form:

\displaystyle{ \frac{d p_i}{d t} = \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i }

This has a nice meaning: for the fraction of organisms of the ith type to increase, their fitness must exceed the mean fitness. If you’re trying to increase market share, what matters is not how good you are, but how much better than average you are. If everyone else is lousy, you’re in luck.

Entropy

Now for something a bit new. Once we’ve gotten a probability distribution into the game, its entropy is sure to follow:

\displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) }

This says how ‘smeared-out’ the overall population is among the various different species. Alternatively, it says how much information it takes, on average, to say which species a randomly chosen organism belongs to. For example, if there are 2^N species, all with equal populations, the entropy S works out to N \ln 2. So in this case, it takes N bits of information to say which species a randomly chosen organism belongs to.

In biology, entropy is one of many ways people measure biodiversity. For a quick intro to some of the issues involved, try:

• Tom Leinster, Measuring biodiversity, Azimuth, 7 November 2011.

• Lou Jost, Entropy and diversity, Oikos 113 (2006), 363–375.

But we don’t need to understand this stuff to see how entropy is connected to the replicator equation. Marc Harper’s paper explains this in detail:

• Marc Harper, The replicator equation as an inference dynamic.

and I hope to go through quite a bit of it here. But not today! Today I just want to look at a pathetically simple, yet still interesting, example.

Exponential growth

Suppose the fitness of each species is independent of the populations of all the species. In other words, suppose each fitness f_i(P) is actually a constant, say f_i. Then the replicator equation reduces to

\displaystyle{ \frac{d P_i}{d t} = f_i \, P_i }

so it’s easy to solve:

P_i(t) = e^{t f_i} P_i(0)

You don’t need a detailed calculation to see what’s going to happen to the probabilities

\displaystyle{ p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)}}

The most fit species present will eventually take over! If one species, say the ith one, has a fitness greater than the rest, then the population of this species will eventually grow faster than all the rest, at least if its population starts out greater than zero. So as t \to +\infty, we’ll have

p_i(t) \to 1

and

p_j(t) \to 0 \quad \mathrm{for} \quad j \ne i

Thus the probability distribution p will become more sharply peaked, and its entropy will eventually approach zero.

With a bit more thought you can see that even if more than one species shares the maximum possible fitness, the entropy will eventually decrease, though not approach zero.

In other words, the biodiversity will eventually drop as all but the most fit species are overwhelmed. Of course, this is only true in our simple idealization. In reality, biodiversity behaves in more complex ways—in part because species interact, and in part because mutation tends to smear out the probability distribution p_i. We’re not looking at these effects yet. They’re extremely important… in ways we can only fully understand if we start by looking at what happens when they’re not present.

In still other words, the population will absorb information from its environment. This should make intuitive sense: the process of natural selection resembles ‘learning’. As fitter organisms become more common and less fit ones die out, the environment puts its stamp on the probability distribution p. So, this probability distribution should gain information.

While intuitively clear, this last claim also follows more rigorously from thinking of entropy as negative information. Admittedly, it’s always easy to get confused by minus signs when relating entropy and information. A while back I said the entropy

\displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) }

was the average information required to say which species a randomly chosen organism belongs to. If this entropy is going down, isn’t the population losing information?

No, this is a classic sign error. It’s like the concept of ‘work’ in physics. We can talk about the work some system does on its environment, or the work done by the environment on the system, and these are almost the same… except one is minus the other!

When you are very ignorant about some system—say, some rolled dice—your estimated probabilities p_i for its various possible states are very smeared-out, so the entropy S(p) is large. As you gain information, you revise your probabilities and they typically become more sharply peaked, so S(p) goes down. When you know as much as you possibly can, S(p) equals zero.

So, the entropy S(p) is the amount of information you have left to learn: the amount of information you lack, not the amount you have. As you gain information, this goes down. There’s no paradox here.

It works the same way with our population of replicators—at least in the special case where the fitness of each species is independent of its population. The probability distribution p is like a ‘hypothesis’ assigning to each species i the probability p_i that it’s the best at self-replicating. As some replicators die off while others prosper, they gather information their environment, and this hypothesis gets refined. So, the entropy S(p) drops.

Next time

Of course, to make closer contact to reality, we need to go beyond the special case where the fitness of each species is a constant! Marc Harper does this, and I want to talk about his work someday, but first I have a few more remarks to make about the pathetically simple special case I’ve been focusing on. I’ll save these for next time, since I’ve probably strained your patience already.

17 Responses to Information Geometry (Part 9)

  1. “This should make intuitive sense: the process of natural selection resembles ‘learning’.”

    Nature is one tough teacher.

  2. Mike Stay says:

    The sum in the denominator of \displaystyle p_i = \frac{P_i}{\sum_i P_i} should probably be indexed by j.

  3. Replicators with dispersion in rates and adaptation times could probably explain the huge dynamic range in relative abundance distributions. This was touched on in Part 8.

  4. davidtweed says:

    I think the use of phrases like “absorb information from the environment” are a little too passive. I’d say a more informative analogy is that processes “acquire information from the environment”, with the process of acquisition being limited to asking questions of the form “I think x is a good answer, am I right?” (i.e., not just yes-no questions but yes-no questions about very specific things). Of course, the diff eqn framework you’re looking at doesn’t use the notion of the fitness of offspring being some function (either stochastic or some other type) of the parents fitness, so there’s no analogue of this effect in what you’re directly looking at.

    But in looking for links between information theory and biology, particularly biodiversity, I’d be inclined to look at how information theory looks at the “question strategy” issues. (E.g., if you really want to maximise your probability of getting some right answer, you’re best served by testing “answers” at widely spaced parts of the configuration space, but biological evolution seems to churn out individuals who are only slightly different from their parents. Why is that? One quick possibility is that “evolutionary inference” thinks the parents are already quite close to the answer, so small modifications are called for. Or maybe it’s one or multiple other reasons…)

    • John Baez says:

      David wrote:

      I think the use of phrases like “absorb information from the environment” are a little too passive. I’d say a more informative analogy is that processes “acquire information from the environment”,

      Evolutionary biologists are trained to avoid ‘teleological’ or ‘purposive’ accounts:

      Statements which imply that nature has goals, for example where a species is said to do something “in order to” achieve survival, appear teleological, and therefore invalid. Usually, it is possible to rewrite such sentences to avoid the apparent teleology. Some biology courses have incorporated exercises requiring students to rephrase such sentences so that they do not read teleologically.

      This is important for avoiding some mistakes. But I think it’ll be very interesting for evolutionary biology to collide with machine learning, where people with goals are designing systems in order to achieve those goals. Deen Abiola’s comments below show what I mean.

      (E.g., if you really want to maximise your probability of getting some right answer, you’re best served by testing “answers” at widely spaced parts of the configuration space, but biological evolution seems to churn out individuals who are only slightly different from their parents. Why is that? One quick possibility is that “evolutionary inference” thinks the parents are already quite close to the answer, so small modifications are called for. Or maybe it’s one or multiple other reasons…)

      Here you’re talking about what evolutionary inference “thinks”, which would give you a rap on the knuckles in those biology courses.

      More seriously, but relatedly, biologists are really interested in to what extent we can think of evolution itself as having been optimized for something… instead of just being a fixed method whereby other things get optimized. The buzzword for this puzzle is the evolution of evolvability:

      While variation yielding high evolvability could be useful in the long term, in the short term most of that variation is likely to be a disadvantage. For example, naively it would seem that increasing the mutation rate via a mutator allele would increase evolvability. But as an extreme example, if the mutation rate is too high then all individuals will be dead or at least carry a heavy mutation load. Short-term selection for low variation most of the time is usually thought likely to be more powerful than long-term selection for evolvability, making it difficult for natural selection to cause the evolution of evolvability.

      When recombination is low, mutator alleles may still sometimes hitchhike on the success of adaptive mutations that they cause. In this case, selection can take place at the level of the lineage. This may explain why mutators are often seen during experimental evolution of microbes. Mutator alleles can also evolve more easily when they only increase mutation rates in nearby DNA sequences, not across the whole genome: this is known as a contingency locus.

      The evolution of evolvability is less controversial if it occurs via the evolution of sexual reproduction, or via the tendency of variation-generating mechanisms to become more active when an organism is stressed. The yeast prion may also be an example of the evolution of evolvability through evolutionary capacitance. An evolutionary capacitor is a switch that turns genetic variation on and off. This is very much like bet-hedging the risk that a future environment will be similar or different. Theoretical models also predict the evolution of evolvability via modularity. When the costs of evolvability are sufficiently short-lived, more evolvable lineages may be the most successful in the long-term.

  5. Frederik De Roo says:

    As some replicators die off while others prosper, they gather information their environment, and this hypothesis gets refined. So, the entropy S(p) drops. We can make this even more precise.

    If you allow me to display my ignorance here: doesn’t this depend on what you assume as possible for f(P)? (for example, the constant f(P))

    Suppose we take two species P_1 and P_2 who have some symbiotic relationship, but have to share a finite area. E.g. suppose that (\beta is a positive coefficient):

    \frac{dP_1}{dt}=\beta \frac{P_2-P_1}{P_1+P_2}

    and similar for P_2, but with P_1 and P_2 exchanged.

    Then for large t, p_1 and p_2 will both go to 1/2. Also in this case I would say that the replicators gathered information about their environment (even though this example may not be very realistic) however it appears to me that the entropy becomes maximal. Am I doing something wrong?

    • John Baez says:

      Frederik wrote:

      …doesn’t this depend on what you assume as possible for f(P)?

      Yes! My remarks in the section “Exponential growth” only apply to the pathetically simple special case where f(P) is actually independent of P, so the population of each species grows or declines exponentially. This is a very boring case, and it would be almost silly to even mention it except that for short times, we can approximate the solution of any equation

      \displaystyle{ \frac{d}{d t} P_i(t) = f_i(P(t)) P_i(t) }

      by a solution of the linear equation

      \displaystyle{ \frac{d}{d t} P_i(t) = f_i(P(0)) P_i(t) }

      This is my ultimate reason for introducing the pathetically simple special case. However, you’ll note that in this section, I only analyze the behavior of this case as t \to +\infty. So, the lessons here won’t apply to more general cases—at least, not without tons of qualifications. Next time I’ll look at the short-time behavior of the same pathetically simple special case, and get some more lessons.

      I should have made this more clear. I’m not explaining things very well since I’m learning it and/or making it up as I go.

      Your example is a nice one, since it illustrates the importance of inter-species interactions, which are completely absent in the pathetically simple special case I discussed. I’ll talk more about those later!

      There’s a lot of interesting stuff to say about how information flows in situations where species are competing or cooperating with each other, but it’s more complicated than “so, the entropy drops”. Very often it rises.

      For now, try this:

      • Marc Harper, The replicator equation as an inference dynamic.

  6. Last time I began explaining the tight relation between three concepts: entropy, information and biodiversity […]

  7. Marc Harper says:

    Hi John!

    I’m happy to see that you are still interested in the topics. I want to recognize some other researchers that have had similar ideas (and that I based some of my work on). In particular, Cosma Shalizi independently discovered the analogy between the discrete replicator dynamic and Bayesian inference; I.M. Bomze was the first (as far as I can tell) to use relative entropy / cross entropy to analyze the replicator dynamic and proved many important results; and Shun-ichi Amari and his many collaborators wrote briefly about the connection between information geometry and the replicator equation circa 1995 and in subsequent works.

    Unfortunately I found out about much of these researchers’ work after I had finished my graduate thesis — some of it would have been much easier to figure out! In any case, there’s credit where credit is due!

  8. John Baez says:

    Over on G+, Deen Abiola wrote:

    Are you aware of the work by machine learning theorist in the area? In particular, the body of work pioneered by Leslie Valiant? The learning power of evolution is characterized in terms of machine learning and its shown to be weaker than many current algorithms (weaker than Probably Approximately Correct, equivalent to correlational statistical queries). Search Valiant Evolution algorithms for interesting reading.

    Click to access recombination-K11.pdf

    Click to access FV_Evolvability_COLT_OP_08.pdf

    http://www.mpi-inf.mpg.de/~mehlhorn/SeminarEvolvability/p619-feldman.pdf

    John Baez wrote:

    Deen Abiola – Thanks for the references! I wasn’t aware of these. I get the feeling that machine learning, evolutionary game theory and the study of biodiversity live in parallel universes with limited communication between them, but I’m just starting to learn all three, so I expect there’s lots of interesting interplay that I haven’t bumped into yet.

    Deen Abiola wrote:

    Recently, Game Theory has been getting a lot of attention in machine learning – stuff like regret minimization, on line learning and reinforcement learning. Parts of evolutionary game theory see use in coevolutionary optimization – an area I am really interested in.

    I think it’s mainly the biodiversity and real biology people that are most segregated. Which is a shame because it is amazing to see these same things keep popping up everywhere. Our knowledge base has too little entropy heh. Maybe category theory can help as a bridge there…

    Is information fundamental or does the digital age color our view the way clockwork mechanics did 200 years ago? Maybe I am biased but I vote for information as fundamental.

    Deen Abiola wrote:

    One of the papers I link to shows how Valiant’s model of evolution’s learning can be vastly sped up by introducing a recombination opertator. It makes a passing reference to evolutionary game theory work when discussing what is required in the reality of natural selection and genetic recombination for the model to hold:

    Click to access recombination-K11.pdf

    The strength of this new model is not given. But it is not hard to imagine that machine learning algos can be “smarter”.

    Evolution’s learning ability is clearly surpassed in speed by humans – indeed this fact is what many proponents of a super human AI argue will allow us to produce something smarter than us. it’s happened before. In many areas (where the assumptions of an exponential distribution or convexity or smooth differentiable objective function hold) machine learning algos already surpass everything on the planet.

    What humans and to some extent evolution do best and what even the best ML algos struggle with is in managing complexity by a kind of layering of past experience to quickly solve more complex problems (what Summers was hinting at). That is, most search algorithms are not very good at reducing search time by improving the search space as they learn. Instead they blindly grow the search space exponentially with problem size. New methods including deep learning and transfer learning and search space improving heuristics seek to improve ML beyond the idiot savants that only excel in well behaved problems (the kind that is not readily found in reality).

    Marc Harper wrote:

    Since the mechanisms by which humans learn, in particular how the brain functions fundamentally, is not known, it may be a stretch to say that “Evolution’s learning ability is clearly surpassed in speed by humans”. It could be evolutionary in some way (which has been proposed by many many people). Some of these models are intellectually pleasing though lacking in evidence (memetics), others have produced remarkably complex behaviors in simulations (Neural Darwinism).

    Will humans create an AI that surpasses our own abilities without resorting to simply evolving it? I leave you with the following quotations for meditation:

    “The theory of evolution by cumulative natural selection is the only theory we know of that is in principle capable of explaining the existence of organized complexity.” – Richard Dawkins, The Blind Watchmaker (1987)

    “A blind-variation-and-selective-retention process is fundamental to all inductive achievements, to all genuine increases in knowledge, to all increases in the fit of system to environment.” – Donald T. Campbell (1974)

    “Inductive inference is the only process known to us by which essentially new knowledge comes into the world.” – R. A. Fisher, The Design of Experiments (1935)

    Deen Abiola wrote:

    I am not looking inside the function. Only the results are needed to make this conclusion. You can capture a very good model of how evolution learns and then ask, what kind of functions can it learn? You can then compare it to humans and computational learning algos. That is what the papers I mentioned above do, one of the authors who has won the computing equivalent of a Nobel prize, for whatever that is worth.

    I actually specialize in something called Genetic Programming so I have a lot of respect for what we can learn from nature. However, like its natural motivation, genetic programming tends to be very slow because its manner of following a gradient is implicit rather than explicit. I am seeking a way to be able to more explicitly guide this search.

    Evolution learns at a scale measured in hundreds to tens of thousands of years. It learns by generating an incredibly large amount of hypotheses and unless the queries (organisms) are simple it waits many thousands of years to stabilize at a sufficiently optimal distribution on the best set of phenotypes. Human’s learn on a scale of decades to centuries. With new improvements in biotech we have gotten to a point where we can manipulate our own genetics and protein processing without the messy process and glacial pace of evolution. Even before biotech we were able to direct the development of specific phenotypes more quickly than the blind patience of evolution with this thing called breeding.

  9. In Part 9, I told you about the ‘replicator equation’, which says how these fractions change with time […]

  10. This would be a version of the replicator equation, which I explained recently in Information Geometry (Part 9). […]

  11. akirabergman says:

    I have been intrigued by the similarity of the definition of entropy to the formula for the n’th prime number;

    p(n) = n*ln(n) + …

    Is this similarity only accidental or does it have a meaning?

  12. This is my talk for the workshop Biological Complexity: Can It Be Quantified?

    Biology as information dynamics.

    Abstract. If biology is the study of self-replicating entities, and we want to understand the role of information, it makes sense to see how information theory is connected to the ‘replicator equation’—a simple model of population dynamics for self-replicating entities. The relevant concept of information turns out to be the information of one probability distribution relative to another, also known as the Kullback–Liebler divergence. Using this we can see evolution as a learning process, and give a clean general formulation of Fisher’s fundamental theorem of natural selection.

You can use Markdown or HTML in your comments. You can also use LaTeX, like this: $latex E = m c^2 $. The word 'latex' comes right after the first dollar sign, with a space after it.

This site uses Akismet to reduce spam. Learn how your comment data is processed.