Stationary Stability in Finite Populations

24 March, 2015

guest post by Marc Harper

A while back, in the article Relative entropy minimization in evolutionary dynamics, we looked at extensions of the information geometry / evolutionary game theory story to more general time-scales, incentives, and geometries. Today we’ll see how to make this all work in finite populations!

Let’s recall the basic idea from last time, which John also described in his information geometry series. The main theorem is this: when there’s an evolutionarily stable state for a given fitness landscape, the relative entropy between the stable state and the population distribution decreases along the population trajectories as they converge to the stable state. In short, relative entropy is a Lyapunov function. This is a nice way to look at the action of a population under natural selection, and it has interesting analogies to Bayesian inference.

The replicator equation is a nice model from an intuitive viewpoint, and it’s mathematically elegant. But it has some drawbacks when it comes to modeling real populations. One major issue is that the replicator equation implicitly assumes that the population proportions of each type are differentiable functions of time, obeying a differential equation. This only makes sense in the limit of large populations. Other closely related models, such as the Lotka-Volterra model, focus on the number of individuals of each type (e.g. predators and prey) instead of the proportion. But they often assume that the number of individuals is a differentiable function of time, and a population of 3.5 isn’t very realistic either.

Real populations of replicating entities are not infinitely large; in fact they are often relatively small and of course have whole numbers of each type, at least for large biological replicators (like animals). They take up space and only so many can interact meaningfully. There are quite a few models of evolution that handle finite populations and some predate the replicator equation. Models with more realistic assumptions typically have to leave the realm of derivatives and differential equations behind, which means that the analysis of such models is more difficult, but the behaviors of the models are often much more interesting. Hopefully by the end of this post, you’ll see how all of these diagrams fit together:








One of the best-known finite population models is the Moran process, which is a Markov chain on a finite population. This is the quintessential birth-death process. For a moment consider a population of just two types A and B. The state of the population is given by a pair of nonnegative integers (a,b) with a+b=N, the total number of replicators in the population, and a and b the number of individuals of type A and B respectively. Though it may artificial to fix the population size N, this often turns out not to be that big of a deal, and you can assume the population is at its carrying capacity to make the assumption realistic. (Lots of people study populations that can change size and that have replicators spatially distributed say on a graph, but we’ll assume they can all interact with each whenever they want for now).

A Markov model works by transitioning from state to state in each round of the process, so we need to define the transitions probabilities to complete the model. Let’s put a fitness landscape on the population, given by two functions f_A and f_B of the population state (a,b). Now we choose an individual to reproduce proportionally to fitness, e.g. we choose an A individual to reproduce with probability

\displaystyle{ \frac{a f_A}{a f_A + b f_B} }

since there are a individuals of type A and they each have fitness f_A. This is analogous to the ratio of fitness to mean fitness from the discrete replicator equation, since

\displaystyle{ \frac{a f_A}{a f_A + b f_B} =  \frac{\frac{a}{N} f_A}{\frac{a}{N} f_A + \frac{b}{N} f_B} \to \frac{x_i f_i(x)}{\overline{f(x)}} }

and the discrete replicator equation is typically similar to the continuous replicator equation (this can be made precise), so the Moran process captures the idea of natural selection in a similar way. Actually there is a way to recover the replicator equation from the Moran process in large populations—details at the end!

We’ll assume that the fitnesses are nonnegative and that the total fitness (the denominator) is never zero; if that seems artificial, some people prefer to transform the fitness landscape by e^{\beta f(x)}, which gives a ratio reminiscent of the Boltzmann or Fermi distribution from statistical physics, with the parameter \beta playing the role of intensity of selection rather than inverse temperature. This is sometimes called Fermi selection.

That takes care of the birth part. The death part is easier: we just choose an individual at random (uniformly) to be replaced. Now we can form the transition probabilities of moving between population states. For instance the probability of moving from state (a,b) to (a+1, b-1) is given by the product of the birth and death probabilities, since they are independent:

\displaystyle{ T_a^{a+1} = \frac{a f_A}{a f_A + b f_B} \frac{b}{N} }

since we have to chose a replicator of type A to reproduce and one of type B to be replaced. Similarly for (a,b) to (a-1, b+1) (switch all the a’s and b’s), and we can write the probability of staying in the state (a, N-a) as

T_a^{a} = 1 - T_{a}^{a+1} - T_{a}^{a-1}

Since we only replace one individual at a time, this covers all the possible transitions, and keeps the population constant.

We’d like to analyze this model and many people have come up with clever ways to do so, computing quantities like fixation probabilities (also known as absorption probabilities), indicating the chance that the population will end up with one type completely dominating, i.e. in state (0, N) or (N,0). If we assume that the fitness of type A is constant and simply equal to 1, and the fitness of type B is r \neq 1, we can calculate the probability that a single mutant of type B will take over a population of type A using standard Markov chain methods:

\displaystyle{\rho = \frac{1 - r^{-1}}{1 - r^{-N}} }

For neutral relative fitness (r=1), \rho = 1/N, which is the probability a neutral mutant invades by drift alone since selection is neutral. Since the two boundary states (0, N) or (N,0) are absorbing (no transitions out), in the long run every population ends up in one of these two states, i.e. the population is homogeneous. (This is the formulation referred to by Matteo Smerlak in The mathematical origins of irreversibility.)

That’s a bit different flavor of result than what we discussed previously, since we had stable states where both types were present, and now that’s impossible, and a bit disappointing. We need to make the population model a bit more complex to have more interesting behaviors, and we can do this in a very nice way by adding the effects of mutation. At the time of reproduction, we’ll allow either type to mutate into the other with probability \mu. This changes the transition probabilities to something like

\displaystyle{ T_a^{a+1} = \frac{a (1-\mu) f_A + b \mu f_B}{a f_A + b f_B} \frac{b}{N} }

Now the process never stops wiggling around, but it does have something known as a stationary distribution, which gives the probability that the population is in any given state in the long run.

For populations with more than two types the basic ideas are the same, but there are more neighboring states that the population could move to, and many more states in the Markov process. One can also use more complicated mutation matrices, but this setup is good enough to typically guarantee that no one species completely takes over. For interesting behaviors, typically \mu = 1/N is a good choice (there’s some biological evidence that mutation rates are typically inversely proportional to genome size).

Without mutation, once the population reached (0,N) or (N,0), it stayed there. Now the population bounces between states, either because of drift, selection, or mutation. Based on our stability theorems for evolutionarily stable states, it’s reasonable to hope that for small mutation rates and larger populations (less drift), the population should spend most of its time near the evolutionarily stable state. This can be measured by the stationary distribution which gives the long run probabilities of a process being in a given state.

Previous work by Claussen and Traulsen:

• Jens Christian Claussen and Arne Traulsen, Non-Gaussian fluctuations arising from finite populations: exact results for the evolutionary Moran process, Physical Review E 71 (2005), 025101.

suggested that the stationary distribution is at least sometimes maximal around evolutionarily stable states. Specifically, they showed that for a very similar model with fitness landscape given by

\left(\begin{array}{c} f_A \\ f_B \end{array}\right)  = \left(\begin{array}{cc} 1 & 2\\ 2&1 \end{array}\right)  \left(\begin{array}{c} a\\ b \end{array}\right)

the stationary state is essentially a binomial distribution centered at (N/2, N/2).

Unfortunately, the stationary distribution can be very difficult to compute for an arbitrary Markov chain. While it can be computed for the Markov process described above without mutation, and in the case studied by Claussen and Traulsen, there’s no general analytic formula for the process with mutation, nor for more than two types, because the processes are not reversible. Since we can’t compute the stationary distribution analytically, we’ll have to find another way to show that the local maxima of the stationary distribution are “evolutionarily stable”. We can approximate the stationary distribution fairly easily with a computer, so it’s easy to plot the results for just about any landscape and reasonable population size (e.g. N \approx 100).

It turns out that we can use a relative entropy minimization approach, just like for the continuous replicator equation! But how? We lack some essential ingredients such as deterministic and differentiable trajectories. Here’s what we do:

• We show that the local maxima and minima of the stationary distribution satisfy a complex balance criterion.

• We then show that these states minimize an expected relative entropy.

• This will mean that the current state and the expected next state are ‘close’.

• Lastly, we show that these states satisfy an analogous definition of evolutionary stability (now incorporating mutation).

The relative entropy allows us to measure how close the current state is to the expected next state, which captures the idea of stability in another way. This ports the relative minimization Lyapunov result to some more realistic Markov chain models. The only downside is that we’ll assume the populations are “sufficiently large”, but in practice for populations of three types, N=20 is typically enough for common fitness landscapes (there are lots of examples here for N=80, which are prettier than the smaller populations). The reason for this is that the population state (a,b) needs enough “resolution” (a/N, b/N) to get sufficiently close to the stable state, which is not necessarily a ratio of integers. If you allow some wiggle room, smaller populations are still typically pretty close.

Evolutionarily stable states are closely related to Nash equilibria, which have a nice intuitive description in traditional game theory as “states that no player has an incentive to deviate from”. But in evolutionary game theory, we don’t use a game matrix to compute e.g. maximum payoff strategies, rather the game matrix defines a fitness landscape which then determines how natural selection unfolds.

We’re going to see this idea again in a moment, and to help get there let’s introduce an function called an incentive that encodes how a fitness landscape is used for selection. One way is to simply replace the quantities a f_A(a,b) and b f_B(a,b) in the fitness-proportionate selection ratio above, which now becomes (for two population types):

\displaystyle{ \frac{\varphi_A(a,b)}{\varphi_A(a,b) + \varphi_B(a,b)} }

Here \varphi_A(a,b) and \varphi_B(a,b) are the incentive function components that determine how the fitness landscape is used for natural selection (if at all). We have seen two examples above:

\varphi_A(a,b) = a f_A(a, b)

for the Moran process and fitness-proportionate selection, and

\varphi_A(a,b) = a e^{\beta f_A(a, b)}

for an alternative that incorporates a strength of selection term \beta, preventing division by zero for fitness landscapes defined by zero-sum game matrices, such as a rock-paper-scissors game. Using an incentive function also simplifies the transition probabilities and results as we move to populations of more than two types. Introducing mutation, we can describe the ratio for incentive-proportion selection with mutation for the ith population type when the population is in state x=(a,b,\ldots) / N as

\displaystyle{ p_i(x) = \frac{\sum_{k=1}^{n}{\varphi_k(x) M_{i k} }}{\sum_{k=1}^{n}{\varphi_k(x)}} }

for some matrix of mutation probabilities M. This is just the probability that we get a new individual of the ith type (by birth and/or mutation). A common choice for the mutation matrix is to use a single mutation probability \mu and spread it out over all the types, such as letting

M_{ij} = \mu / (n-1)

and

M_{ii} = 1 - \mu

Now we are ready to define the expected next state for the population and see how it captures a notion of stability. For a given state population x in a multitype population, using x to indicate the normalized population state (a,b,\ldots) / N, consider all the neighboring states y that the population could move to in one step of the process (one birth-death cycle). These neighboring states are the result of increasing a population type by one (birth) and decreasing another by one (death, possibly the same type), of course excluding cases on the boundary where the number of individuals of any type drops below zero or rises above N. Now we can define the expected next state as the sum of neighboring states weighted by the transition probabilities

E(x) = \sum_{y}{y T_x^{y}}

with transition probabilities given by

T_{x}^{y} = p_{i}(x) x_{j}

for states y that differ in 1/N at the ith coordinate and -1/N at jth coordinate from x. Here x_j is just the probability of the random death of an individual of the jth type, so the transition probabilities are still just birth (with mutation) and death as for the Moran process we started with.

Skipping some straightforward algebraic manipulations, we can show that

\displaystyle{ E(x) = \sum_{y}{y T_x^{y}} = \frac{N-1}{N}x + \frac{1}{N}p(x)}

Then it’s easy to see that E(x) = x if and only if x = p(x), and that x = p(x) if and only if x_i = \varphi_i(x). So we have a nice description of ‘stability’ in terms of fixed points of the expected next state function and the incentive function

x = E(x) = p(x) = \varphi(x),

and we’ve gotten back to “no one has an incentive to deviate”. More precisely, for the Moran process

\varphi_i(x) = x_i f_i(x)

and we get back f_i(x) = f_j(x) for every type. So we take x = \varphi(x) as our analogous condition to an evolutionarily stable state, though it’s just the ‘no motion’ part and not also the ‘stable’ part. That’s what we need the stationary distribution for!

To turn this into a useful number that measures stability, we use the relative entropy of the expected next state and the current state, in analogy with the Lyapunov theorem for the replicator equation. The relative entropy

\displaystyle{ D(x, y) = \sum_i x_i \ln(x_i) - y_i \ln(x_i) }

has the really nice property that D(x,y) = 0 if and only if x = y, so we can use the relative entropy D(E(x), x) as a measure of how close to stable any particular state is! Here the expected next state takes the place of the ‘evolutionarily stable state’ in the result described last time for the replicator equation.

Finally, we need to show that the maxima (and minima) of of the stationary distribution are these fixed points by showing that these states minimize the expected relative entropy.

Seeing that local maxima and minima of the stationary distribution minimize the expected relative entropy is a more involved, so let’s just sketch the details. In general, these Markov processes are not reversible, so they don’t satisfy the detailed-balance condition, but the stationary probabilities do satisfy something called the global balance condition, which says that for the stationary distribution s we have that

s_x \sum_{x}{T_x^{y}} = \sum_{y}{s_y T_y^{x}}

When the stationary distribution is at a local maximum (or minimum), we can show essentially that this implies (up to an \epsilon, for a large enough population) that

\displaystyle{\sum_{x}{T_x^{y}} = \sum_{y}{T_y^{x}} }

a sort of probability inflow-outflow equation, which is very similar to the condition of complex balanced equilibrium described by Manoj Gopalkrishnan in this Azimuth post. With some algebraic manipulation, we can show that these states have E(x)=x.

Now let’s look again at the figures from the start. The first shows the vector field of the replicator equation:

You can see rest points at the center, on the center of each boundary edge, and on the corner points. The center point is evolutionarily stable, the center points of the boundary are semi-stable (but stable when the population is restricted to a boundary simplex), and the corner points are unstable.

This one shows the stationary distribution for a finite population model with a Fermi incentive on the same landscape, for a population of size 80:

A fixed population size gives a partitioning of the simplex, and each triangle of the partition is colored by the value of the stationary distribution. So you can see that there are local maxima in the center and on the centers of the triangle boundary edges. In this case, the size of the mutation probability determines how much of the stationary distribution is concentrated on the center of the simplex.

This shows one-half of the Euclidean distance squared between the current state and the expected next state:

And finally, this shows the same thing but with the relative entropy as the ‘distance function':

As you can see, the Euclidean distance is locally minimal at each of the local maxima and minima of the stationary distribution (including the corners); the relative entropy is only guaranteed so on the interior states (because the relative entropy doesn’t play nicely with the boundary, and unlike the replicator equation, the Markov process can jump on and off the boundary). It turns out that the relative Rényi entropies for q between 0 and 1 also work just fine, but for the large population limit (the replicator dynamic), the relative entropy is the somehow the right choice for the replicator equation (has the derivative that easily gives Lyapunov stability), which is due to the connections between relative entropy and Fisher information in the information geometry of the simplex. The Euclidean distance is the q=0 case and the ordinary relative entropy is q=1.

As it turns out, something very similar holds for another popular finite population model, the Wright–Fisher process! This model is more complicated, so if you are interested in the details, check out our paper, which has many nice examples and figures. We also define a process that bridges the gap between the atomic nature of the Moran process and the generational nature of the Wright–Fisher process, and prove the general result for that model.

Finally, let’s see how the Moran process relates back to the replicator equation (see also the appendix in this paper), and how we recover the stability theory of the replicator equation. We can use the transition probabilities of the Moran process to define a stochastic differential equation (called a Langevin equation) with drift and diffusion terms that are essentially (for populations with two types:

\mathrm{Drift}(x) = T^{+}(x) - T^{-}(x)

\displaystyle{ \mathrm{Diffusion}(x) = \sqrt{\frac{T^{+}(x) + T^{-}(x)}{N}} }

As the population size gets larger, the diffusion term drops out, and the stochastic differential equation becomes essentially the replicator equation. For the stationary distribution, the variance (e.g. for the binomial example above) also has an inverse dependence on N, so the distribution limits to a delta-function that is zero except for at the evolutionarily stable state!

What about the relative entropy? Loosely speaking, as the population size gets larger, the iteration of the expected next state also becomes deterministic. Then the evolutionarily stable states is a fixed point of the expected next state function, and the expected relative entropy is essentially the same as the ordinary relative entropy, at least in a neighborhood of the evolutionarily stable state. This is good enough to establish local stability.

Earlier I said both the local maxima and minima minimize the expected relative entropy. Dash and I haven’t proven that the local maxima always correspond to evolutionarily stable states (and the minima to unstable states). That’s because the generalization of evolutionarily stable state we use is really just a ‘no motion’ condition, and isn’t strong enough to imply stability in a neighborhood for the deterministic replicator equation. So for now we are calling the local maxima stationary stable states.

We’ve also tried a similar approach to populations evolving on networks, which is a popular topic in evolutionary graph theory, and the results are encouraging! But there are many more ‘states’ in such a process, since the configuration of the network has to be taken into account, and whether the population is clustered together or not. See the end of our paper for an interesting example of a population on a cycle.


Sensing and Acting Under Information Constraints

30 October, 2014

I’m having a great time at a workshop on Biological and Bio-Inspired Information Theory in Banff, Canada. You can see videos of the talks online. There have been lots of good talks so far, but this one really blew my mind:

• Naftali Tishby, Sensing and acting under information constraints—a principled approach to biology and intelligence, 28 October 2014.

Tishby’s talk wasn’t easy for me to follow—he assumed you already knew rate-distortion theory and the Bellman equation, and I didn’t—but it was great!

It was about the ‘action-perception loop':


This is the feedback loop in which living organisms—like us—take actions depending on our goals and what we perceive, and perceive things depending on the actions we take and the state of the world.

How do we do this so well? Among other things, we need to balance the cost of storing information about the past against the payoff of achieving our desired goals in the future.

Tishby presented a detailed yet highly general mathematical model of this! And he ended by testing the model on experiments with cats listening to music and rats swimming to land.

It’s beautiful stuff. I want to learn it. I hope to blog about it as I understand more. But for now, let me just dive in and say some basic stuff. I’ll start with the two buzzwords I dropped on you. I hate it when people use terminology without ever explaining it.

Rate-distortion theory

Rate-distortion theory is a branch of information theory which seeks to find the minimum rate at which bits must be communicated over a noisy channel so that the signal can be approximately reconstructed at the other end without exceeding a given distortion. Shannon’s first big result in this theory, the ‘rate-distortion theorem’, gives a formula for this minimum rate. Needless to say, it still requires a lot of extra work to determine and achieve this minimum rate in practice.

For the basic definitions and a statement of the theorem, try this:

• Natasha Devroye, Rate-distortion theory, course notes, University of Chicago, Illinois, Fall 2009.

One of the people organizing this conference is a big expert on rate-distortion theory, and he wrote a book about it.

• Toby Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice–Hall, 1971.

Unfortunately it’s out of print and selling for $259 used on Amazon! An easier option might be this:

• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Chapter 10: Rate Distortion Theory, Wiley, New York, 2006.

The Bellman equation

The Bellman equation reduces the task of finding an optimal course of action to choosing what to do at each step. So, you’re trying to maximize the ‘total reward’—the sum of rewards at each time step—and Bellman’s equation says what to do at each time step.

If you’ve studied physics, this should remind you of how starting from the principle of least action, we can get a differential equation describing the motion of a particle: the Euler–Lagrange equation.

And in fact they’re deeply related. The relation is obscured by two little things. First, Bellman’s equation is usually formulated in a context where time passes in discrete steps, while the Euler–Lagrange equation is usually formulated in continuous time. Second, Bellman’s equation is really the discrete-time version not of the Euler–Lagrange equation but a more or less equivalent thing: the ‘Hamilton–Jacobi equation’.

Ah, another buzzword to demystify! I was scared of the Hamilton–Jacobi equation for years, until I taught a course on classical mechanics that covered it. Now I think it’s the greatest thing in the world!

Briefly: the Hamilton–Jacobi equation concerns the least possible action to get from a fixed starting point to a point q in space at time t. If we call this least possible action W(t,q), the Hamilton–Jacobi equation says

\displaystyle{ \frac{\partial W(t,q)}{\partial q_i} = p_i  }

\displaystyle{ \frac{\partial W(t,q)}{\partial t} = -E  }

where p is the particle’s momentum at the endpoint of its path, and E is its energy there.

If we replace derivatives by differences, and talk about maximizing total reward instead of minimizing action, we get Bellman’s equation:

Bellman equation, Wikipedia.

Markov decision processes

Bellman’s equation can be useful whenever you’re trying to figure out an optimal course of action. An important example is a ‘Markov decision process’. To prepare you for Tishby’s talk, I should say what this is.

In a Markov process, something randomly hops around from state to state with fixed probabilities. In the simplest case there’s a finite set S of states, and time proceeds in discrete steps. At each time step, the probability to hop from state s to state s' is some fixed number P(s,s').

This sort of thing is called a Markov chain, or if you feel the need to be more insistent, a discrete-time Markov chain.

A Markov decision process is a generalization where an outside agent gets to change these probabilities! The agent gets to choose actions from some set A. If at a given time he chooses the action \alpha \in A, the probability of the system hopping from state s to state s' is P_\alpha(s,s'). Needless to say, these probabilities have to sum to one for any fixed s.

That would already be interesting, but the real fun is that there’s also a reward R_\alpha(s,s'). This is a real number saying how much joy or misery the agent experiences if he does action \alpha and the system hops from s to s'.

The problem is to choose a policy—a function from states to actions—that maximizes the total expected reward over some period of time. This is precisely the kind of thing Bellman’s equation is good for!

If you’re an economist you might also want to ‘discount’ future rewards, saying that a reward n time steps in the future gets multiplied by \gamma^n, where 0 < \gamma \le 1 is some discount factor. This extra tweak is easily handled, and you can see it all here:

Markov decision process, Wikipedia.

Partially observable Markov decision processes

There’s a further generalization where the agent can’t see all the details of the system! Instead, when he takes an action \alpha \in A and the system hops from state s to state s', he sees something: a point in some set O of observations. He makes the observation o \in O with probability \Omega_\alpha(o,s').

(I don’t know why this probability depends on s' but not s. Maybe it ultimately doesn’t matter much.)

Again, the goal is to choose a policy that maximizes the expected total reward. But a policy is a bit different now. The action at any time can only depend on all the observations made thus far.

Partially observable Markov decision processes are also called POMPDs. If you want to learn about them, try these:

Partially observable Markov decision process, Wikipedia.

• Tony Cassandra, Partially observable Markov decision processes.

The latter includes an introduction without any formulas to POMDPs and how to choose optimal policies. I’m not sure who would study this subject and not want to see formulas, but it’s certainly a good exercise to explain things using just words—and it makes certain things easier to understand (though not others, in a way that depends on who is trying to learn the stuff).

The action-perception loop

I already explained the action-perception loop, with the help of this picture from the University of Bielefeld’s Department of Cognitive Neuroscience:


Nafthali Tishby has a nice picture of it that’s more abstract:

We’re assuming time comes in discrete steps, just to keep things simple.

At each time t

• the world has some state W_t, and
• the agent has some state M_t.

Why the letter M? This stands for memory: it can be the state of the agent’s memory, but I prefer to think of it as the state of the agent.

At each time

• the agent takes an action A_t, which affects the world’s next state, and

• the world provides a sensation S_t to the agent, which affect’s the agent’s next state.

This is simplified but very nice. Note that there’s a symmetry interchanging the world and the agent!

We could make it fancier by having lots of agents who all interact, but there are a lot of questions already. The big question Tishby focuses on is optimizing how much the agent should remember about the past if they

• get a reward depending on the action taken and the resulting state of the world

but

• pay a price for the information stored from sensations.

Tishby formulates this optimization question as something like a partially observed Markov decision process, uses rate-distortion theory to analyze how much information needs to be stored to achieve a given reward, and uses Bellman’s equation to solve the optimization problem!

So, everything I sketched fits together somehow!

I hope what I’m saying now is roughly right: it will take me more time to get the details straight. If you’re having trouble absorbing all the information I just threw at you, don’t feel bad: so am I. But the math feels really natural and good to me. It involves a lot of my favorite ideas (like generalizations of the principle of least action, and relative entropy), and it seems ripe to be combined with network theory ideas.

For details, I highly recommend this paper:

• Naftali Tishby and Daniel Polani, Information theory of decisions and actions, in Perception-Reason-Action Cycle: Models, Algorithms and System. Vassilis, Hussain and Taylor, Springer, Berlin, 2010.

I’m going to print this out, put it by my bed, and read it every night until I’ve absorbed it.


Biodiversity, Entropy and Thermodynamics

27 October, 2014

 

I’m giving a short 30-minute talk at a workshop on Biological and Bio-Inspired Information Theory at the Banff International Research Institute.

I’ll say more about the workshop later, but here’s my talk, in PDF and video form:

Biodiversity, entropy and thermodynamics.

Most of the people at this workshop study neurobiology and cell signalling, not evolutionary game theory or biodiversity. So, the talk is just a quick intro to some things we’ve seen before here. Starting from scratch, I derive the Lotka–Volterra equation describing how the distribution of organisms of different species changes with time. Then I use it to prove a version of the Second Law of Thermodynamics.

This law says that if there is a ‘dominant distribution’—a distribution of species whose mean fitness is at least as great as that of any population it finds itself amidst—then as time passes, the information any population has ‘left to learn’ always decreases!

Of course reality is more complicated, but this result is a good start.

This was proved by Siavash Shahshahani in 1979. For more, see:

• Lou Jost, Entropy and diversity.

• Marc Harper, The replicator equation as an inference dynamic.

• Marc Harper, Information geometry and evolutionary game theory.

and more recent papers by Harper.


Entropy and Information in Biological Systems (Part 2)

4 July, 2014

John Harte, Marc Harper and I are running a workshop! Now you can apply here to attend:

Information and entropy in biological systems, National Institute for Mathematical and Biological Synthesis, Knoxville Tennesee, Wednesday-Friday, 8-10 April 2015.

Click the link, read the stuff and scroll down to “CLICK HERE” to apply. The deadline is 12 November 2014.

Financial support for travel, meals, and lodging is available for workshop attendees who need it. We will choose among the applicants and invite 10-15 of them.

The idea

Information theory and entropy methods are becoming powerful tools in biology, from the level of individual cells, to whole ecosystems, to experimental design, model-building, and the measurement of biodiversity. The aim of this investigative workshop is to synthesize different ways of applying these concepts to help systematize and unify work in biological systems. Early attempts at “grand syntheses” often misfired, but applications of information theory and entropy to specific highly focused topics in biology have been increasingly successful. In ecology, entropy maximization methods have proven successful in predicting the distribution and abundance of species. Entropy is also widely used as a measure of biodiversity. Work on the role of information in game theory has shed new light on evolution. As a population evolves, it can be seen as gaining information about its environment. The principle of maximum entropy production has emerged as a fascinating yet controversial approach to predicting the behavior of biological systems, from individual organisms to whole ecosystems. This investigative workshop will bring together top researchers from these diverse fields to share insights and methods and address some long-standing conceptual problems.

So, here are the goals of our workshop:

• To study the validity of the principle of Maximum Entropy Production (MEP), which states that biological systems – and indeed all open, non-equilibrium systems – act to produce entropy at the maximum rate.

• To familiarize all the participants with applications to ecology of the MaxEnt method: choosing the probabilistic hypothesis with the highest entropy subject to the constraints of our data. We will compare MaxEnt with competing approaches and examine whether MaxEnt provides a sufficient justification for the principle of MEP.

• To clarify relations between known characterizations of entropy, the use of entropy as a measure of biodiversity, and the use of MaxEnt methods in ecology.

• To develop the concept of evolutionary games as “learning” processes in which information is gained over time.

• To study the interplay between information theory and the thermodynamics of individual cells and organelles.

For more details, go here.

If you’ve got colleagues who might be interested in this, please let them know. You can download a PDF suitable for printing and putting on a bulletin board by clicking on this:



The Computational Power of Chemical Reaction Networks

10 June, 2014

I’m at this workshop:

Programming with Chemical Reaction Networks: Mathematical Foundations, Banff International Research Station, 8-13 June 2014.

Luca Cardelli wrote about computation with chemical reactions in Part 26 of the network theory series here on this blog. So, it’s nice to meet him and many other researchers, learn more, and try to solve some problems together!

The first tutorial was this:

• David Soloveichik, U.C. San Francisco, The computational power of chemical reaction networks.

David works at the Center for Systems and Synthetic Biology, and their website says:

David did his graduate work with Erik Winfree at Caltech, focusing on algorithmic self-assembly and on synthetic networks of nucleic-acid interactions based on strand displacement cascades. He is interested in “molecular programming”: the systematic design of complex molecular systems based on the principles of computer science and distributed computing. More generally, he is trying to create a theoretical foundation of chemical computation applicable to both synthetic and natural systems.

According to his webpage, Soloveichik’s research interests are:

Wet-lab: the rational design of molecular interactions for synthetic biology, nanotechnology, and bioengineering. The goal is to engineer autonomous molecular systems that can sense, compute, and perform various actions. Using nucleic-acid “strand displacement cascades” as the molecular primitive, we are able to attain freedom of design that hasn’t been previously possible.

Theory: The theoretical foundation of chemical computation. Once we have a way to program molecular interactions, what programming language shall we use? How molecules can process information and carry out computation is still not well-understood; however, a formal connection to models of concurrent computation may allow systematic and scalable design, rigorous analysis and verification. Further, computational principles may elucidate the design of biological regulatory networks.

Here are my notes on his tutorial.

Motivation

We’ve got people here from different backgrounds:

• computational complexity theory
• wetlab / experimental science
• pure and applied mathematics
• software verification

CRNs (chemical reaction networks) show up in:

• chemistry
• population biology
• sensor networks
• math:
    ○ vector addition systems
    ○ Petri nets
    ○ commutative semigroups
    ○ bounded context-free languages
    ○ uniform recurrence equations

Why use them for computation? People want to go beyond the von Neumann architecture for computation. People also want to understand how cells process information. However, with a few exceptions, the computational perspective in this talk has not yet proved relevant in biology. So, there is a lot left to learn.

The model

The model of computation here will be the master equation for a chemical reaction network… since this has been explained starting Part 4 of the network theory series, I won’t review it!

Can all chemical reaction networks, even those without any conservation laws, be realized by actual chemical systems?

Though this is a subtle question, one answer is “yes, using strand displacement cascades”. This is a trick for getting DNA to simulate other chemical reactions. It’s been carried out in the lab! See this paper and many subsequent ones:

• Soloveichik, Seelig and Winfree, DNA as a universal substrate for chemical kinetics.

Abstract: Molecular programming aims to systematically engineer molecular and chemical systems of autonomous function and ever-increasing complexity. A key goal is to develop embedded control circuitry within a chemical system to direct molecular events. Here we show that systems of DNA molecules can be constructed that closely approximate the dynamic behavior of arbitrary systems of coupled chemical reactions. By using strand displacement reactions as a primitive, we construct reaction cascades with effectively unimolecular and bimolecular kinetics. Our construction allows individual reactions to be coupled in arbitrary ways such that reactants can participate in multiple reactions simultaneously, reproducing the desired dynamical properties. Thus arbitrary systems of chemical equations can be compiled into real chemical systems. We illustrate our method on the Lotka–Volterra oscillator, a limit-cycle oscillator, a chaotic system, and systems implementing feedback digital logic and algorithmic behavior.

However, even working with the master equation for a CRN, there are various things we might mean by having it compute something:

• uniform vs non-uniform: is a single CRN supposed to handle all inputs, or do we allow adding extra reactions for larger inputs? It’s a bit like Turing machines vs Boolean circuits.

• deterministic vs probabilistic: is the correct output guaranteed or merely likely?

• halting vs stabilizing: does the CRN ‘know’ when it has finished, or not? In the ‘halting’ case the CRN irreversibly produces some molecules that signal that the computation is done. In the ‘stabilizing’ case, it eventually stabilizes to the right answer, but we may not know how long to wait.

These distinctions dramatically affect the computational power. In the case of uniform computation:

• deterministic and halting: this has finite computational power.

• deterministic and stabilizing: this can decide semilinear predicates.

• probabilistic and halting: this is Turing-universal.

• probabilistic and stabilizing: this can decide \Delta_2^0 predicates, which are more general than computable ones. (Indeed, if we use Turing machines but don’t require them to signal when they’ve halted, the resulting infinitely long computations can ‘compute’ stuff that’s not computable in the usual sense.)

Deterministic stabilizing computations

Let’s look at the deterministic stabilizing computations in a bit more detail. We’ll look at decision problems. We have a subset S \subseteq \mathbb{N}^d, and we want to answer this question: is the vector X \in \mathbb{N}^d in the set S?

To do this, we represent the vector as a bunch of molecules: X_1 of the first kind, X_2 of the second kind, and so on. We call this an input. We may also include a fixed collection of additional molecules in our input, to help the reactions run.

Then we choose a chemical reaction network, and we let it run on our input. The answer to our question will be encoded in some molecules called Y and N. If X is in S, we want our chemical reaction to produce Y molecules. If it’s not, we want our reaction to produce N’s.

To make this more precise, we need to define what counts as an output. If we’ve got a bunch of molecules that

• contains Y but not N: then the output is YES.

• contains N but not Y: then the output is NO.

Otherwise the output is undefined.

Output-stable states are states with YES or NO output such that all states reachable from them via our chemical reactions give the same output. We say an output-stable-state is correct if this output is the correct answer to the question: is X in S.

Our chemical reaction network gives a deterministic stabilizing computation if for any input, and choosing any state reachable from that input, we can do further chemical reactions to reach a correct output-stable state.

In other words: starting from our input, and letting the chemical reactions <run any way they want, we will eventually stabilize at an output that gives the right answer to the question “is X in S?”

Examples

This sounds a bit complicated, but it’s really not. Let’s look at some examples!

Example 1. Suppose you want to check two numbers and see if one is greater than or equal to another. Here

S = \{(X_1,X_2) : X_2 \ge X_1 \}

How can you decide if a pair of numbers (X_1,X_2) is in this set?

You start with X_1 molecules of type A, X_2 molecules of type B, and one molecule of type Y. Then you use a chemical reaction network with these reactions:

A + N \to Y
B + Y \to N

If you let these reactions run, the Y switches to a N each time the reactions destroy an A. But the N switches back to a Y each time the reactions destroy a B.

When no more reactions are possible, we are left with either one Y or one N, which is the correct answer to your question!

Example 2. Suppose you want to check two numbers and see if one is equal to another. Here

S = \{(X_1,X_2) : X_2 = X_1 \}

How can you decide if a pair of numbers (X_1,X_2) is in here?

This is a bit harder! As before, you start with X_1 molecules of type A, X_2 molecules of type B, and one molecule of type Y. Then you use a chemical reaction network with these reactions:

A + B \to Y
Y + N \to Y
A + Y \to A + N
B + Y \to B + N

The first reaction lets an A and a B cancel out, producing a Y. If you only run this reaction, you’ll eventually be left with either a bunch of A\mathrm{s} or a bunch of B\mathrm{s} or nothing but Y\mathrm{s}.

If you have Y\mathrm{s}, your numbers were equal. The other reactions deal with the cases where you have A\mathrm{s} or B\mathrm{s} left over. But the key thing to check is that no matter what order we run the reactions, we’ll eventually get the right answer! In the end, you’ll have either Y\mathrm{s} or N\mathrm{s}, not both, and this will provide the yes-or-no answer to the question of whether X_1 = X_2.

What deterministic stabilizing computations can do

We’ve looked at some examples of deterministic stabilizing computations. The big question is: what kind of questions can they answer?

More precisely, for what subsets A \subseteq \mathbb{N}^d can we build a deterministic stabilizing computation that ends with output YES if the input X lies in A and with output NO otherwise?

The answer is: the ‘semilinear’ subsets!

• Dana Angluin, James Aspnes and David Eistenstat, Stably computable predicates are semilinear.

A set S \subseteq \mathbb{N}^d is linear if it’s of the form

\{u_0 + n_1 u_1 + \cdots + n_p u_p : n_i \in \mathbb{N}  \}

for some fixed vectors of natural numbers u_i \in \mathbb{N}^d.

A set S \subseteq \mathbb{N}^d semilinear if it’s a finite union of linear sets.

How did Angluin, Aspnes and Eisenstat prove their theorem? Apparently the easy part is showing that membership in any semilinear set can be decided by a chemical reaction network. David sketched the proof of the converse. I won’t go into it, but it used a very nice fact:

Dickson’s Lemma. Any subset of \mathbb{N}^d has a finite set of minimal elements, where we define x \le y if x_i \le y_i for all i.

For example, the region above and to the right of the hyperbola here has five minimal elements:

If you know some algebra, Dickson’s lemma should remind you of the Hilbert basis theorem, saying (for example) that every ideal in a ring of multivariable polynomials over a field is finitely generated. And in fact, Paul Gordan used Dickson’s Lemma in 1899 to help give a proof of Hilbert’s basis theorem.

It’s very neat to see how this lemma applies to chemical reaction networks! You can see how it works in Angluin, Aspnes and Eistenstat’s paper. But they call it “Higman’s lemma” for some reason.

References

Here are some of David Soloveichik’s recent talks:

• An introduction to strand displacement cascades for the Foresight Institute Conference (Palo Alto, Jan 2013): An artificial “biochemistry” with DNA.

• Paper presented at DNA Computing and Molecular Programming 18 (Aarhus, Denmark, Aug 2012): Deterministic function computation with chemical reaction networks.

• Tutorial talk for DNA Computing and Molecular Programming 17 (Pasadena, Aug 2011): The programming language of chemical kinetics, and how to discipline your DNA molecules using strand displacement cascades.

• High-level introduction to algorithmic self-assembly and stochastic chemical reaction networks as computer-theoretic models: Computer-theoretic abstractions for molecular programming.

• On algorithmic behavior in chemical reaction networks and implementing arbitrary chemical reaction networks with DNA: programming well-mixed chemical kinetics.


Relative Entropy in Evolutionary Dynamics

22 January, 2014

guest post by Marc Harper

In John’s information geometry series, he mentioned some of my work in evolutionary dynamics. Today I’m going to tell you about some exciting extensions!

The replicator equation

First a little refresher. For a population of n replicating types, such as individuals with different eye colors or a gene with n distinct alleles, the ‘replicator equation’ expresses the main idea of natural selection: the relative rate of growth of each type should be proportional to the difference between the fitness of the type and the mean fitness in the population.

To see why this equation should be true, let P_i be the population of individuals of the ith type, which we allow to be any nonnegative real number. We can list all these numbers and get a vector:

P = (P_1, \dots, P_n)

The Lotka–Volterra equation is a very general rule for how these numbers can change with time:

\displaystyle{ \frac{d P_i}{d t} = f_i(P) P_i }

Each population grows at a rate proportional to itself, where the ‘constant of proportionality’, f_i(P), is not necessarily constant: it can be any real-valued function of P. This function is called the fitness of the ith type. Taken all together, these functions f_i are called the fitness landscape.

Let p_i be the fraction of individuals who are of the ith type:

\displaystyle{ p_i = \frac{P_i}{\sum_{i =1}^n P_i } }

These numbers p_i are between 0 and 1, and they add up to 1. So, we can also think of them as probabilities: p_i is the probability that a randomly chosen individual is of the ith type. This is how probability theory, and eventually entropy, gets into the game.

Again, we can bundle these numbers into a vector:

p = (p_1, \dots, p_n)

which we call the population distribution. It turns out that the Lotka–Volterra equation implies the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(P) - \langle f(P) \rangle \right) \, p_i }

where

\displaystyle{ \langle f(P) \rangle = \sum_{i =1}^n  f_i(P)  p_i  }

is the mean fitness of all the individuals. You can see the proof in Part 9 of the information geometry series.

By the way: if each fitness f_i(P) only depends on the fraction of individuals of each type, not the total numbers, we can write the replicator equation in a simpler way:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }

From now on, when talking about this equation, that’s what I’ll do.

Anyway, the take-home message is this: the replicator equation says the fraction of individuals of any type changes at a rate proportional to fitness of that type minus the mean fitness.

Now, it has been known since the late 1970s or early 1980s, thanks to the work of Akin, Bomze, Hofbauer, Shahshahani, and others, that the replicator equation has some very interesting properties. For one thing, it often makes ‘relative entropy’ decrease. For another, it’s often an example of ‘gradient flow’. Let’s look at both of these in turn, and then talk about some new generalizations of these facts.

Relative entropy as a Lyapunov function

I mentioned that we can think of a population distribution as a probability distribution. This lets us take ideas from probability theory and even information theory and apply them to evolutionary dynamics! For example, given two population distributions p and q, the information of q relative to p is

I(q,p) = \displaystyle{ \sum_i q_i \ln \left(\frac{q_i}{p_i }\right)}

This measures how much information you gain if you have a hypothesis about some state of affairs given by the probability distribution p, and then someone tells you “no, the best hypothesis is q!”

It may seem weird to treat a population distribution as a hypothesis, but this turns out to be a good idea. Evolution can then be seen as a learning process: a process of improving the hypothesis.

We can make this precise by seeing how the relative information changes with the passage of time. Suppose we have two population distributions q and p. Suppose q is fixed, while p evolves in time according to the replicator equation. Then

\displaystyle{  \frac{d}{d t} I(q,p)  =  \sum_i f_i(P) (p_i - q_i) }

For the proof, see Part 11 of the information geometry series.

So, the information of q relative to p will decrease as p evolves according to the replicator equation if

\displaystyle{  \sum_i f_i(P) (p_i - q_i) } \le 0

If q makes this true for all p, we say q is an evolutionarily stable state. For some reasons why, see Part 13.

What matters now is that when q is an evolutionarily stable state, I(q,p) says how much information the population has ‘left to learn’—and we’re seeing that this always decreases. Moreover, it turns out that we always have

I(q,p) \ge 0

and I(q,p) = 0 precisely when p = q.

People summarize all this by saying that relative information is a ‘Lyapunov function’. Very roughly, a Lyapunov function is something that decreases with the passage of time, and is zero only at the unique stable state. To be a bit more precise, suppose we have a differential equation like

\displaystyle{  \frac{d}{d t} x(t) = v(x(t)) }

where x(t) \in \mathbb{R}^n and v is some smooth vector field on \mathbb{R}^n. Then a smooth function

V : \mathbb{R}^n \to \mathbb{R}

is a Lyapunov function if

V(x) \ge 0 for all x

V(x) = 0 iff x is some particular point x_0

and

\displaystyle{ \frac{d}{d t} V(x(t)) \le 0 } for every solution of our differential equation.

In this situation, the point x_0 is a stable equilibrium for our differential equation: this is Lyapunov’s theorem.

The replicator equation as a gradient flow equation

The basic idea of Lyapunov’s theorem is that when a ball likes to roll downhill and the landscape has just one bottom point, that point will be the unique stable equilibrium for the ball.

The idea of gradient flow is similar, but different: sometimes things like to roll downhill as efficiently as possible: they move in the exactly the best direction to make some quantity smaller! Under certain conditions, the replicator equation is an example of this phenomenon.

Let’s fill in some details. For starters, suppose we have some function

F : \mathbb{R}^n \to \mathbb{R}

Think of V as ‘height’. Then the gradient flow equation says how a point x(t) \in \mathbb{R}^n will move if it’s always trying its very best to go downhill:

\displaystyle{ \frac{d}{d t} x(t) = - \nabla V(x(t)) }

Here \nabla is the usual gradient in Euclidean space:

\displaystyle{ \nabla V = \left(\partial_1 V, \dots, \partial_n V \right)  }

where \partial_i is short for the partial derivative with respect to the ith coordinate.

The interesting thing is that under certain conditions, the replicator equation is an example of a gradient flow equation… but typically not one where \nabla is the usual gradient in Euclidean space. Instead, it’s the gradient on some other space, the space of all population distributions, which has a non-Euclidean geometry!

The space of all population distributions is a simplex:

\{ p \in \mathbb{R}^n : \; p_i \ge 0, \; \sum_{i = 1}^n p_i = 1 \} .

For example, it’s an equilateral triangle when n = 3. The equilateral triangle looks flat, but if we measure distances another way it becomes round, exactly like a portion of a sphere, and that’s the non-Euclidean geometry we need!

In fact this trick works in any dimension. The idea is to give the simplex a special Riemannian metric, the ‘Fisher information metric’. The usual metric on Euclidean space is

\delta_{i j} = \left\{\begin{array}{ccl} 1 & \mathrm{ if } & i = j \\                                       0 &\mathrm{ if } & i \ne j \end{array} \right.

This simply says that two standard basis vectors like (0,1,0,0) and (0,0,1,0) have dot product zero if the 1’s are in different places, and one if they’re in the same place. The Fisher information metric is a bit more complicated:

\displaystyle{ g_{i j} = \frac{\delta_{i j}}{p_i} }

As before, g_{i j} is a formula for the dot product of the ith and jth standard basis vectors, but now it depends on where you are in the simplex of population distributions.

We saw how this formula arises from information theory back in Part 7. I won’t repeat the calculation, but the idea is this. Fix a population distribution p and consider the information of another one, say q, relative to this. We get I(q,p). If q = p this is zero:

\displaystyle{ \left. I(q,p)\right|_{q = p} = 0 }

and this point is a local minimum for the relative information. So, the first derivative of I(q,p) as we change q must be zero:

\displaystyle{ \left. \frac{\partial}{\partial q_i} I(q,p) \right|_{q = p} = 0 }

But the second derivatives are not zero. In fact, since we’re at a local minimum, it should not be surprising that we get a positive definite matrix of second derivatives:

\displaystyle{  g_{i j} = \left. \frac{\partial^2}{\partial q_i \partial q_j} I(q,p) \right|_{q = p} = 0 }

And, this is the Fisher information metric! So, the Fisher information metric is a way of taking dot products between vectors in the simplex of population distribution that’s based on the concept of relative information.

This is not the place to explain Riemannian geometry, but any metric gives a way to measure angles and distances, and thus a way to define the gradient of a function. After all, the gradient of a function should point at right angles to the level sets of that function, and its length should equal the slope of that function:

So, if we change our way of measuring angles and distances, we get a new concept of gradient! The ith component of this new gradient vector field turns out to b

(\nabla_g V)^i = g^{i j} \partial_j V

where g^{i j} is the inverse of the matrix g_{i j}, and we sum over the repeated index j. As a sanity check, make sure you see why this is the usual Euclidean gradient when g_{i j} = \delta_{i j}.

Now suppose the fitness landscape is the good old Euclidean gradient of some function. Then it turns out that the replicator equation is a special case of gradient flow on the space of population distributions… but where we use the Fisher information metric to define our concept of gradient!

To get a feel for this, it’s good to start with the Lotka–Volterra equation, which describes how the total number of individuals of each type changes. Suppose the fitness landscape is the Euclidean gradient of some function V:

\displaystyle{ f_i(P) = \frac{\partial V}{\partial P_i} }

Then the Lotka–Volterra equation becomes this:

\displaystyle{ \frac{d P_i}{d t} = \frac{\partial V}{\partial P_i} \, P_i }

This doesn’t look like the gradient flow equation, thanks to that annoying P_i on the right-hand side! It certainly ain’t the gradient flow coming from the function V and the usual Euclidean gradient. However, it is gradient flow coming from V and some other metric on the space

\{ P \in \mathbb{R}^n : \; P_i \ge 0 \}

For a proof, and the formula for this other metric, see Section 3.7 in this survey:

• Marc Harper, Information geometry and evolutionary game theory.

Now let’s turn to the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p)  - \langle f(p) \rangle \right) \, p_i }

Again, if the fitness landscape is a Euclidean gradient, we can rewrite the replicator equation as a gradient flow equation… but again, not with respect to the Euclidean metric. This time we need to use the Fisher information metric! I sketch a proof in my paper above.

In fact, both these results were first worked out by Shahshahani:

• Siavash Shahshahani, A New Mathematical Framework for the Study of Linkage and Selection, Memoirs of the AMS 17, 1979.

New directions

All this is just the beginning! The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

You can see some very nice versions of this story for maximum likelihood estimators and linear programming here:

• Akio Fujiwara and Shun-ichi Amari, Gradient systems in view of information geometry, Physica D: Nonlinear Phenomena 80 (1995), 317–327.

Indeed, this seems to be the first paper discussing the similarities between evolutionary game theory and information geometry.

Dash Fryer (at Pomona College) and I have generalized this story in several interesting ways.

First, there are two famous ways to generalize the usual formula for entropy: Tsallis entropy and Rényi entropy, both of which involve a parameter q. There are Tsallis and Rényi versions of relative entropy and the Fisher information metric as well. Everything I just explained about:

• conditions under which relative entropy is a Lyapunov function for the replicator equation, and

• conditions under which the replicator equation is a special case of gradient flow

generalize to these cases! However, these generalized entropies give modified versions of the replicator equation. When we set q=1 we get back the usual story. See

• Marc Harper, Escort evolutionary game theory.

My initial interest in these alternate entropies was mostly mathematical—what is so special about the corresponding geometries?—but now researchers are starting to find populations that evolve according to these kinds of modified population dynamics! For example:

• A. Hernando et al, The workings of the Maximum Entropy Principle in collective human behavior.

There’s an interesting special case worth some attention. Lots of people fret about the relative entropy not being a distance function obeying the axioms that mathematicians like: for example, it doesn’t obey the triangle inequality. Many describe the relative entropy as a distance-like function, and this is often a valid interpretation contextually. On the other hand, the q=0 relative entropy is one-half the Euclidean distance squared! In this case the modified version of the replicator equation looks like this:

\displaystyle{ \frac{d p_i}{d t} = f_i(p) - \frac{1}{n} \sum_{j = 1}^n f_j(p) }

This equation is called the projection dynamic.

Later, I showed that there is a reasonable definition of relative entropy for a much larger family of geometries that satisfies a similar distance minimization property.

In a different direction, Dash showed that you can change the way that selection acts by using a variety of alternative ‘incentives’, extending the story to some other well-known equations describing evolutionary dynamics. By replacing the terms x_i f_i(x) in the replicator equation with a variety of other functions, called incentives, we can generate many commonly studied models of evolutionary dynamics. For instance if we exponentiate the fitness landscape (to make it always positive), we get what is commonly known as the logit dynamic. This amounts to changing the fitness landscape as follows:

\displaystyle{ f_i \mapsto \frac{x_i e^{\beta f_i}}{\sum_j{x_j e^{\beta f_j}}} }

where \beta is known as an inverse temperature in statistical thermodynamics and as an intensity of selection in evolutionary dynamics. There are lots of modified versions of the replicator equation, like the best-reply and projection dynamics, more common in economic applications of evolutionary game theory, and they can all be captured in this family. (There are also other ways to simultaneously capture such families, such as Bill Sandholm’s revision protocols, which were introduced earlier in his exploration of the foundations of game dynamics.)

Dash showed that there is a natural generalization of evolutionarily stable states to ‘incentive stable states’, and that for incentive stable states, the relative entropy is decreasing to zero when the trajectories get near the equilibrium. For the logit and projection dynamics, incentive stable states are simply evolutionarily stable states, and this happens frequently, but not always.

The third generalization is to look at different ‘time-scales’—that is, different ways of describing time! We can make up the symbol \mathbb{T} for a general choice of ‘time-scale’. So far I’ve been treating time as a real number, so

\mathbb{T} = \mathbb{R}

But we can also treat time as coming in discrete evenly spaced steps, which amounts to treating time as an integer:

\mathbb{T} = \mathbb{Z}

More generally, we can make the steps have duration h, where h is any positive real number:

\mathbb{T} = h\mathbb{Z}

There is a nice way to simultaneously describe the cases \mathbb{T} = \mathbb{R} and \mathbb{T} = h\mathbb{Z} using the time-scale calculus and time-scale derivatives. For the time-scale \mathbb{T} = \mathbb{R} the time-scale derivative is just the ordinary derivative. For the time-scale \mathbb{T} = h\mathbb{Z}, the time-scale derivative is given by the difference quotient from first year calculus:

\displaystyle{ f^{\Delta}(z) = \frac{f(z+h) - f(z)}{h} }

and using this as a substitute for the derivative gives difference equations like a discrete-time version of the replicator equation. There are many other choices of time-scale, such as the quantum time-scale given by \mathbb{T} = q^{\mathbb{Z}}, in which case the time-scale derivative is called the q-derivative, but that’s a tale for another time. In any case, the fact that the successive relative entropies are decreasing can be simply state by saying they have negative \mathbb{T} = h\mathbb{Z} time-scale derivative. The continuous case we started with corresponds to \mathbb{T} = \mathbb{R}.

Remarkably, Dash and I were able to show that you can combine all three of these generalizations into one theorem, and even allow for multiple interacting populations! This produces some really neat population trajectories, such as the following two populations with three types, with fitness functions corresponding to the rock-paper-scissors game. On top we have the replicator equation, which goes along with the Fisher information metric; on the bottom we have the logit dynamic, which goes along with the Euclidean metric on the simplex:

From our theorem, it follows that the relative entropy (ordinary relative entropy on top, the q = 0 entropy on bottom) converges to zero along the population trajectories.

The final form of the theorem is loosely as follows. Pick a Riemannian geometry given by a metric g (obeying some mild conditions) and an incentive for each population, as well as a time scale (\mathbb{R} or h \mathbb{Z}) for every population. This gives an evolutionary dynamic with a natural generalization of evolutionarily stable states, and a suitable version of the relative entropy. Then, if there is an evolutionarily stable state in the interior of the simplex, the time-scale derivative of sum of the relative entropies for each population will decrease as the trajectories converge to the stable state!

When there isn’t such a stable state, we still get some interesting population dynamics, like the following:


See this paper for details:

• Marc Harper and Dashiell E. A. Fryer, Stability of evolutionary dynamics on time scales.

Next time we’ll see how to make the main idea work in finite populations, without derivatives or deterministic trajectories!


Life’s Struggle to Survive

19 December, 2013

Here’s the talk I gave at the SETI Institute:

When pondering the number of extraterrestrial civilizations, it is worth noting that even after it got started, the success of life on Earth was not a foregone conclusion. In this talk, I recount some thrilling episodes from the history of our planet, some well-documented but others merely theorized: our collision with the planet Theia, the oxygen catastrophe, the snowball Earth events, the Permian-Triassic mass extinction event, the asteroid that hit Chicxulub, and more, including the massive environmental changes we are causing now. All of these hold lessons for what may happen on other planets!

To watch the talk, click on the video above. To see
slides of the talk, click here!

Here’s a mistake in my talk that doesn’t appear in the slides: I suggested that Theia started at the Lagrange point in Earth’s orbit. After my talk, an expert said that at that time, the Solar System had lots of objects with orbits of high eccentricity, and Theia was probably one of these. He said the Lagrange point theory is an idiosyncratic theory, not widely accepted, that somehow found its way onto Wikipedia.

Another issue was brought up in the questions. In a paper in Science, Sherwood and Huber argued that:

Any exceedence of 35 °C for extended periods should
induce hyperthermia in humans and other mammals, as dissipation of metabolic heat becomes impossible. While this never happens now, it would begin to occur with global-mean warming of about 7 °C, calling the habitability of some regions into question. With 11-12 °C warming, such regions would spread to encompass the majority of the human population as currently distributed. Eventual warmings of 12 °C are
possible from fossil fuel burning.

However, the Paleocene-Eocene Thermal Maximum seems to have been even hotter:

So, the question is: where did mammals live during this period, which mammals went extinct, if any, and does the survival of other mammals call into question Sherwood and Huber’s conclusion?


Follow

Get every new post delivered to your Inbox.

Join 2,973 other followers