Entropy and Information in Biological Systems (Part 2)

4 July, 2014

John Harte, Marc Harper and I are running a workshop! Now you can apply here to attend:

Information and entropy in biological systems, National Institute for Mathematical and Biological Synthesis, Knoxville, Tennessee, Wednesday-Friday, 8-10 April 2015.

Click the link, read the stuff and scroll down to “CLICK HERE” to apply. The deadline is 12 November 2014.

Financial support for travel, meals, and lodging is available for workshop attendees who need it. We will choose among the applicants and invite 10-15 of them.

The idea

Information theory and entropy methods are becoming powerful tools in biology, from the level of individual cells, to whole ecosystems, to experimental design, model-building, and the measurement of biodiversity. The aim of this investigative workshop is to synthesize different ways of applying these concepts to help systematize and unify work in biological systems. Early attempts at “grand syntheses” often misfired, but applications of information theory and entropy to specific highly focused topics in biology have been increasingly successful. In ecology, entropy maximization methods have proven successful in predicting the distribution and abundance of species. Entropy is also widely used as a measure of biodiversity. Work on the role of information in game theory has shed new light on evolution. As a population evolves, it can be seen as gaining information about its environment. The principle of maximum entropy production has emerged as a fascinating yet controversial approach to predicting the behavior of biological systems, from individual organisms to whole ecosystems. This investigative workshop will bring together top researchers from these diverse fields to share insights and methods and address some long-standing conceptual problems.

So, here are the goals of our workshop:

• To study the validity of the principle of Maximum Entropy Production (MEP), which states that biological systems – and indeed all open, non-equilibrium systems – act to produce entropy at the maximum rate.

• To familiarize all the participants with applications to ecology of the MaxEnt method: choosing the probabilistic hypothesis with the highest entropy subject to the constraints of our data. We will compare MaxEnt with competing approaches and examine whether MaxEnt provides a sufficient justification for the principle of MEP.

• To clarify relations between known characterizations of entropy, the use of entropy as a measure of biodiversity, and the use of MaxEnt methods in ecology.

• To develop the concept of evolutionary games as “learning” processes in which information is gained over time.

• To study the interplay between information theory and the thermodynamics of individual cells and organelles.

For more details, go here.

If you’ve got colleagues who might be interested in this, please let them know. You can download a PDF suitable for printing and putting on a bulletin board by clicking on this:

Network Theory III

16 March, 2014


In the last of my Oxford talks I explain how entropy and relative entropy can be understood using certain categories related to probability theory… and how these categories also let us understand Bayesian networks!

The first two parts are explanations of these papers:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss

• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy.

Somewhere around here the talk was interrupted by a fire drill, waking up the entire audience!

By the way, in my talk I mistakenly said that relative entropy is a continuous functor; in fact it’s just lower semicontinuous. I’ve fixed this in my slides.

The third part of my talk was my own interpretation of Brendan Fong’s master’s thesis:

• Brendan Fong, Causal Theories: a Categorical Perspective on Bayesian Networks.

I took a slightly different approach, by saying that a causal theory \mathcal{C}_G is the free category with products on certain objects and morphisms coming from a directed acyclic graph G. In his thesis he said \mathcal{C}_G was the free symmetric monoidal category where each generating object is equipped with a cocommutative comonoid structure. This is close to a category with finite products, though perhaps not quite the same: a symmetric monoidal category where every object is equipped with a cocommutative comonoid structure in a natural way (i.e., making a bunch of squares commute) is a category with finite products. It would be interesting to see if this difference hurts or helps.

By making this slight change, I am claiming that causal theories can be seen as algebraic theories in the sense of Lawvere. This would be a very good thing, since we know a lot about those.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

Relative Entropy (Part 4)

16 February, 2014

In recent posts by Manoj Gopalkrishnan and Marc Harper we’ve seen how not just entropy but relative entropy—the entropy of a probability distribution relative to the equilibrium distribution—is a driving force in chemistry and evolution. Now Tobias Fritz and I have finally finished our paper on this subject:

A Bayesian characterization of relative entropy.

Very roughly, we proved that any reasonable measure of the information you gain when you update your assumptions about the world based on discovering what a system is really doing must be some constant c times the relative entropy.

I’ve blogged about this here before:

Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

Relative Entropy (Part 2): a category related to statistical inference, \mathrm{FinStat}, and how relative entropy defines a functor on this category.

Relative Entropy (Part 3): statement of our main theorem, which characterizes relative entropy up to a constant multiple as the only functor F : \mathrm{FinStat} \to [0,\infty) with a few nice properties.

Now that the paper is done, we’re having a nice conversation about it on the n-Category Café. Since I don’t want to splinter the conversation, I won’t enable comments here—please go there and join the fun!

But having blogged about this thrice before, what’s new?

One thing is that our conversation is getting more deeply into the category-theoretic aspects. Read the long parenthetical remarks in my post on the n-Café to get up to speed on that aspect.

Another is that by looking at our paper, you can actually see the proof of our result. As I mention on the n-Café:

The proof is surprisingly hard. Or maybe we’re just surprisingly bad at proving things. But the interesting thing is this: the proof is swift and effective in the ‘generic’ case—the case where the support of the probability measures involved is the whole set they’re living on, and the constant c is finite.

It takes some more work to handle the case where the probability measures have smaller support.

But the really hard work starts when we handle the case that, in the end, has c = \infty. Then the proof becomes more like analysis than what you normally expect in category theory. We slowly corner the result, blocking off all avenues of escape. Then we close in, grab its neck, and strangle it, crushing its larynx ever tighter, as it loses the will to fight back and finally expires… still twitching.

We haven’t gotten into discussing this much yet, perhaps because the mathematicians on the n-Café are too dainty and civilized. But someone into analysis might be able to find a more efficient proof.

That would make me a bit sad—since why didn’t we find it?—but mainly happy—since this subject deserves to be clean and elegant. We really need a category-theoretic formulation of the second law of thermodynamics that’s suitable for studying complex networks: that’s the long-term goal here.

Bio-Inspired Information Theory

31 January, 2014


There will be a 5-day workshop on Biological and Bio-Inspired Information Theory at BIRS from Sunday the 26th to Friday the 31st of October, 2014. It’s being organized by

Toby Berger (University of Virginia)
Andrew Eckford (York University)
Peter Thomas (Case Western Reserve University)

BIRS is the Banff International Research Station, a conference venue in a rather wild part of Alberta, in the mountains west of Calgary.


Here’s the workshop proposal on the BIRS website:

Currently, research in the community is organized into three streams:

• Information theory and biochemistry (including information theory and intercellular communication);

• Information theory and neuroscience; and

• Information-theoretic analysis of biologically-inspired communication systems (including nano-networking and design of biologically implemented information processing networks).

We propose a BIRS workshop to explore these streams, focusing on mathematical open problems that cut across the streams. The main objectives of this workshop would be: to bring together the most prominent researchers in this field; to discuss and review the current state of mathematical research in this field; to promote cross-pollination among the various streams of research to find common problems; and to collectively identify key future directions or open problems that would bring the greatest impact to this field over the next few years.

Expected impact

A BIRS workshop involving the field’s leading researchers would allow a review of the current state of the art, and would promote cross-pollination among these three streams of research. We expect to have these leading researchers in attendance. For example, Prof. Toby Berger (U. Virginia), a widely recognized pioneer in this field and a recipient of the Shannon Award (the top prize awarded by the IEEE Information Theory Society), is one of the co-organizers of the workshop. Moreover, we have approached many of the field’s most prominent mathematicians and scientists: a complete list is found elsewhere in this proposal, but among the most prominent confirmed participants are: Prof. Tadashi Nakano (Osaka U.), one of the earliest researchers in engineered molecular communication; Dr. Thomas D. Schneider (NIH – National Cancer Institute), inventor of the sequence logo and prominent researcher in genetic information theory; and Profs. William Bialek (Princeton U.) and Naftali Tishby (Hebrew U.), prominent experts on information theory in neural coding.

Although the focus of our workshop is on mathematical fundamentals, our list of expected participants includes a few experimental scientists, e.g. Raymond Cheong and Andre Levchenko (both from Johns Hopkins U.), in addition to mathematical scientists. This is because quantitative application of information theoretic analysis to biological systems typically requires empirical estimation of joint probability distributions for multiple input and output variables, often posing daunting data collection challenges (as in work which pioneered the use of high-throughput experimental methods to collect large data sets quantifying the input/output relationships for a specific biochemical signaling pathway). We believe a blended approach, emphasizing mathematics but including experimental perspectives, will enhance the impact of our workshop and increase the usefulness to our participants.

Given that publications in these research areas have achieved prominence in the past few years, the time is right for a general meeting among the key researchers to review the state of the field and develop future directions. Thus, our proposed workshop is timely and would be expected to have a tremendous impact on the field over the next several years.

The Rarest Things in the Universe

27 January, 2014

guest post by Leonard Adleman

About 50 years ago Kolmogorov assigned to each finite binary string \sigma a non-negative integer K(\sigma) that measured that string’s ‘descriptive complexity’. Informally, K(\sigma) is the length (in binary) of the shortest (Turing machine) program that, with the empty string as input, outputs \sigma and halts. A related measure of descriptive complexity is M(\sigma)=\frac{K(\sigma)}{|\sigma|}, where |\sigma| denotes the length of \sigma. A simple string like:

\sigma_0=\overbrace{111\ldots 111}^{1,000,000}

can be produced by a very short program; hence M(\sigma_0) is near 0. But if \sigma is a ‘random string’ (e.g. obtained by flipping coins), then, with high probability, it cannot be produced by a program significantly shorter than \sigma itself; hence M(\sigma) will be near 1.

If I asked you to produce (using a computer) a thousand strings of length one million with M near 1, it would be easy to do; just flip a lot of coins. If I asked you to produce a thousand strings with M near 0, that would also be easy. For example, you could start with a short random string and repeat it a lot. Actually, if I chose my favorite \alpha\in [0,1] and wanted a thousand strings of length one million with M near \alpha, then a mix of the preceding approaches can be used to produce them. So strings with a desired M are not rare.
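The three recipes above can be tried out with a rough sketch. K(\sigma) itself is not computable, so the sketch below swaps in something much cruder: the size of the string after zlib compression, which only gives an upper bound on descriptive complexity (up to the fixed cost of the decompressor). The function name `compression_ratio`, the choice of zlib, and the value alpha = 0.3 are all assumptions made for illustration, not part of the original argument.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size: a crude stand-in for M."""
    return len(zlib.compress(data, 9)) / len(data)


n = 125_000  # 125,000 bytes = one million bits

# One million 1s, like sigma_0: highly compressible.
simple = b"\xff" * n

# Operating-system randomness stands in for coin flips.
random_bits = os.urandom(n)

# One possible mix: a random prefix making up a fraction alpha of the
# string, followed by constant padding.
alpha = 0.3
cut = int(alpha * n)
mixed = os.urandom(cut) + b"\xff" * (n - cut)

ratio_simple = compression_ratio(simple)
ratio_random = compression_ratio(random_bits)
ratio_mixed = compression_ratio(mixed)

print(f"simple: {ratio_simple:.4f}")
print(f"mixed:  {ratio_mixed:.4f}")
print(f"random: {ratio_random:.4f}")
```

The ratio for the constant string should come out near 0, the ratio for the random string near 1 (slightly above, because of compressor overhead), and the mixed string near alpha. A compressor can miss structure that a short program would exploit, so this says nothing about depth, and it only bounds M from above.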

Now let’s consider ‘deep strings’*. I will be informal here, but the underlying theory can be found in my Time space and randomness.

For all binary strings \sigma, we assign a value that measures the ‘depth’ of \sigma. D(\sigma) is obtained by considering both the size and the number of steps used by each program that, with the empty string as input, outputs \sigma and halts**. D(\sigma) has the following properties:

  • If there is no short program to produce \sigma, then D(\sigma) is small.
  • If there is a short program to produce \sigma and it uses few steps, then D(\sigma) is small.
  • If there is a short program to produce \sigma, but all short programs to produce \sigma use lots of steps, then D(\sigma) is large. Roughly speaking, the more steps small programs use to produce \sigma, the larger D(\sigma) will be.

Informally, we call a string with large D ‘deep’ and one with small D ‘shallow’. A few examples may help.

Consider a string \sigma obtained by flipping coins. With high probability there is no short program to produce \sigma, hence \sigma is shallow.

Now consider \sigma_0 above. Since there is a short program to produce \sigma_0 and that program uses few steps, \sigma_0 is shallow.

Now treat \sigma_0 as a number in binary (i.e. 2^{1,000,000}-1) and consider its prime factorization. The fundamental theorem of arithmetic tells us it exists and will be about one million bits long. But, unless 2^{1,000,000}-1 is somehow special (e.g. a prime times a very smooth number), its prime factorization may be very, very deep. A short program can generate the prime factorization (just generate one million 1s with a short program and then give it to a short factoring program). But if it turns out that factoring can’t be done in polynomial time, then perhaps all short programs that generate the prime factorization use a huge number of steps. So the prime factorization would have a very large D. Conceivably, since steps on a computer use time and energy, the prime factorization can never be realized. It is not a long string (only one million bits) but it may exist only in theory and not in reality.

If I asked you to produce even one string of length one million with D near that of the prime factorization of 2^{1,000,000}-1, could you do it? I would not know how and I suspect that as a practical matter it cannot be done. So strings with such a D are rare.

Here is a string that does exist in our universe and that (I suspect) is quite deep:


In fact, this string is the prime factorization of 2^{1061}-1. We should not expect it to be as deep as the prime factorization of 2^{1,000,000}-1, but we should still expect it to have considerable depth. There is even some experimental evidence that supports this view. How did I get this prime factorization? I got it from: Factorization of a 1061-bit number by the Special Number Field Sieve, by Greg Childers. So it was easy to get. Well, easy for me anyway; not so easy for Childers. He reports that the factorization took about ‘3 CPU-centuries’ using the Special Number Field Sieve. Other than by factoring 2^{1061}-1, how could you ever come to write this pair of primes? I venture to say that since the Special Number Field Sieve is the fastest known algorithm for factoring numbers of this kind, no method available today could have written these primes (for the first time) using fewer steps (and hence less time/energy).

The situation might be compared to that of atomic physics. I am not a physicist, but I suppose it is possible to theorize about an atomic nucleus with a million protons. But what if I want to create one? It appears that producing transuranic elements takes huge amounts of time/energy and the greater the number of protons, the more time/energy it takes. It is even conceivable (to me at least) that there is not enough time/energy available (at least on earth) to actually produce one. Like the prime factorization of 2^{1,000,000}-1, it may exist in theory but not in reality. On the other hand, physicists from Russia and America, using lots of time/energy, have created an atomic nucleus with 118 protons called Ununoctium. Ununoctium is analogous to Childers’ prime factorization; both exist in reality; both were very costly to create.

But my focus here is neither ontology nor physics; it is money. Recently Bitcoins and Bitcoin-like things have captured the world’s attention. I suspect that this kind of currency will revolutionize the nature of economies, and consequently the nature of governance. Personally, I am agnostic on whether this is a good thing or a bad thing. But, in any case, I am wondering if ‘depth’ might form part of a theoretical basis for understanding new currencies. I must confess to knowing few details about Bitcoins; consequently, I will start from first principles and consider some of the properties of currency:

  • Mass: A currency may have mass or may be massless. U.S. dollars and gold have mass. Bitcoins are massless. The efficiencies of the Internet market require a massless currency. Imagine running eBay or Amazon using gold. U.S. dollars have mass, so we do not actually use them for transactions on the Internet; rather, we use credit card numbers to create massless IOUs that promise to deliver (but, in fact, seldom actually deliver) U.S. dollars in the future. These IOUs are essentially a massless currency.
  • Production: Who can produce or destroy the currency? This has been an important political, legal and economic issue for millennia. In the west, coins began to appear at about the time Solon ruled Athens. Solon realized that he could manipulate the value of coins. And so he did. Solon’s lesson has not been lost on governments ever since. In its simplest form: if you owe K dollars, you print K dollars, and voila! no more debt. Creditors don’t like debtors to do that, so they want to be paid in a currency that no one can produce more of. Gold comes close. You can produce more gold but you have to spend a lot of time/energy in a mine to do so. Production also includes counterfeiting. Counterfeiting comes in at least two important forms: de novo-counterfeiting (build a press) and duplication-counterfeiting (get a xerox machine). For now, all massless currencies are in digital form and are vulnerable to duplication-counterfeiting since computers make it cheap and easy to create perfect copies of digital notes (a unit of a currency will be called a ‘note’). As a result, massless currencies will typically be associated with systems that mitigate this threat. Perhaps the elaborate ledgers implemented by the creators of Bitcoin are an attempt to deal with the threat of duplication-counterfeiting.
  • Abundance: As is often said, money can be used to ‘lubricate the economy’. To accomplish this a currency has to be sufficiently abundant. For example, Ununoctium might be attractive to creditors because, compared to gold, it is far more costly in time/energy to produce more. However, it is not desirable for lubricating the economy because it is so costly to produce that fewer than 100 atoms have ever been created.

The transition from mass to masslessness will lead to currencies with uses and properties we do not normally associate with money. For example, using the techniques of secret-sharing, it becomes possible to create digital currencies where a single note can be jointly owned by 1000 individuals; any 501 of whom could cooperate to spend it, while any 500 of whom would be unable to do so.
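To make the secret-sharing idea concrete, here is a minimal sketch of one standard threshold scheme, Shamir's, in which a secret is the constant term of a random polynomial over a finite field and each owner holds one point on that polynomial. The post does not say which scheme it has in mind, so the choice of Shamir's scheme is an assumption, as are the names `make_shares` and `recover`, the field modulus 2^{127}-1, and the small demo sizes (any 6 of 10 owners instead of any 501 of 1000).

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; all arithmetic is done modulo this


def make_shares(secret: int, threshold: int, num_shares: int):
    """Split `secret` so that any `threshold` shares recover it."""
    # Random polynomial of degree threshold - 1 with constant term = secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover(shares):
    """Lagrange-interpolate the polynomial through `shares` at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


# Demo: a note (here just a random number) owned jointly by 10 parties,
# any 6 of whom can reconstruct it.
note = secrets.randbelow(PRIME)
shares = make_shares(note, threshold=6, num_shares=10)

recovered = recover(shares[:6])   # six owners cooperate
too_few = recover(shares[:5])     # five owners try the same procedure

print(recovered == note)
print(too_few == note)
```

With six shares the interpolation returns the note. With only five, the same procedure returns an unrelated field element except with probability about 2^{-127}, and the five shares are consistent with every possible note. How spending authority would be tied to the reconstructed value in an actual currency is not addressed here.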

What is the perfect currency? This is probably the wrong question; rather, we should ask what properties a currency may have, in theory and in reality. Let’s consider massless currencies.

Is there a massless currency similar to the U.S. dollar? I think yes. For example, the Government could simply publish a set of numbers and declare the numbers to be currency. Put another way, make the U.S. dollar smaller and smaller to decrease mass until asymptotically all that is left is the serial number. With regard to abundance, like the U.S. dollar, the Government is free to determine the number of notes available. With regard to production, as with the U.S. dollar, the Government can print more by simply publishing new numbers to be added to the set (or destroy some by declaring that some numbers have been removed from the set). With regard to counterfeiting, the U.S. dollar has some advantages. The mass of the U.S. dollar turns counterfeiting from a digital problem to a physical one. This provides the Government with the ability to change the U.S. dollar physically to defeat technology that might arise to produce faithful (either de novo or duplication) counterfeits.

Is there a massless currency similar to gold? I think yes. I think this is what Bitcoin-like currencies are all about. They are ‘deep-currencies’, sets of deep strings. With regard to abundance, they are superior to gold. The total number of deep strings in the set can be chosen at initiation by the creator of the currency. The abundance of gold on the other hand has already been set by nature (and, at least as currently used, gold is not sufficiently abundant to lubricate the world economy). With regard to production, as with gold, making notes requires time/energy. With regard to counterfeiting, gold has an advantage. Counterfeit gold is an absurdity, since all gold (by the ounce, not numismatically), no matter how it arises, is the same atomically and is perceived to have the same value. On the other hand, as stated above, massless currencies are vulnerable to duplication-counterfeiting. Interestingly, deep-currencies may be resistant to de novo-counterfeiting, since the creator of the deep-currency is free to choose the depth of the notes, and consequently the cost of producing new notes.

The value of a massless currency in our information-based world is clear. Deep-currencies such as Bitcoin offer an attractive approach. But there is an interesting issue that may soon arise. The issue stems from the fact that the production of each note in currencies like Bitcoin requires a large investment of time/energy, and as with gold, this deprives governments of the prerogative to print money. Creditors may like this, but governments will not. What will governments do? Perhaps they will want to create a ‘Dual-currency’. A Dual-currency should be massless. It should come with a secret key. If you do not possess the secret key, then, like gold, it should be very costly to produce a new note, but if you possess the secret key, then, like the U.S. dollar, it should be inexpensive to produce a new note. Is a Dual-currency possible? Here is an example of an approach I call aurum:

  • Generate a pair of RSA keys: a public key \langle E,N\rangle, and a secret key \langle D,N\rangle. Publish the public key \langle E,N\rangle.
  • Declare that notes are exactly those integers G such that 2\leq G\leq N-1 and (the least positive residue of) G^E\mbox{Mod}(N) is less than or equal to \frac{N}{10^9}.

So, choosing a random integer V such that 2\leq V\leq N-1, and then computing V^E\mbox{Mod}(N) has about one chance in a billion of producing a note. Hence the expected number of modular exponentiations to produce a note is about one billion. On the other hand, those who possess the secret key can choose an integer W such that 2\leq W\leq \frac{N}{10^9} and calculate W^D\mbox{Mod}(N) to produce a note after just one modular exponentiation. There are many bells and whistles that can be accommodated with aurum, but here is what is ‘really’ going on. The depth of a string \sigma is obtained by considering programs running with the empty string as input, but we can consider a more general concept: ‘relative depth’. Given a pair of strings \sigma and \tau, the depth of \sigma relative to \tau is obtained by considering programs running with \tau as the input. Hence depth as we have been discussing it is the same as depth relative to the empty string. In the example above, we have made ‘dual-strings’; strings that are deep relative to the public key, but shallow relative to the secret key.
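The two ways of producing an aurum note can be sketched with toy numbers. Everything concrete below is an assumption chosen so the example runs in a moment: the modulus is the product of the two Mersenne primes 2^{31}-1 and 2^{61}-1 (far too small and too structured for real use), the public exponent is 65537, and the threshold is N/1000 rather than N/10^9, so that guessing takes about a thousand tries instead of a billion. The function names `is_note`, `mine_note` and `mint_note` are hypothetical.

```python
import math
import random

# Toy RSA parameters (assumed for illustration only; not secure).
P, Q = 2**31 - 1, 2**61 - 1      # two known Mersenne primes
N = P * Q
PHI = (P - 1) * (Q - 1)
E = 65537                         # public exponent
assert math.gcd(E, PHI) == 1
D = pow(E, -1, PHI)               # secret exponent: E * D = 1 mod PHI

DIFFICULTY = 1000                 # the post uses 10**9
BOUND = N // DIFFICULTY


def is_note(g: int) -> bool:
    """G is a note if 2 <= G <= N-1 and G^E mod N is at most N/DIFFICULTY."""
    return 2 <= g <= N - 1 and pow(g, E, N) <= BOUND


def mine_note(rng: random.Random):
    """Without the secret key: guess random V until one is a note."""
    tries = 0
    while True:
        tries += 1
        v = rng.randrange(2, N)
        if is_note(v):
            return v, tries


def mint_note(rng: random.Random) -> int:
    """With the secret key: pick small W and compute W^D mod N once."""
    w = rng.randrange(2, BOUND + 1)
    return pow(w, D, N)


rng = random.Random(2014)
mined, tries = mine_note(rng)
minted = mint_note(rng)

print("mined a note after", tries, "modular exponentiations")
print("minted note is valid:", is_note(minted))
```

Minting works because (W^D)^E is congruent to W modulo N, and W was chosen at most N/DIFFICULTY, so the minted value passes the same public test that mined values do. Anyone holding only \langle E,N\rangle can verify a note with one exponentiation but, as far as this sketch goes, can only produce one by repeated guessing.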

One of the interesting phenomena of theoretical computer science is that you can sometimes turn bad news into good news. If factoring is hard, then we are deprived of the ability to do something we (at least number theorists) would like to do. Bad news. But surprisingly, we acquire public-key cryptography and the ability to preserve our privacy. Good news. Similarly, strings that are very hard to produce seem useless, but Bitcoins have revealed that such strings can provide a new and useful form of currency. Now that we are aware that deep strings can have value, I expect that clever people will find many new uses for them.

After thinking about deep strings for many years, I see them everywhere – I invite you to do the same. I will finish with one of my favorite observations. The watchmaker analogy is well known and is frequently used in arguing the existence of a creator. If you stumble upon a watch, you recognize from its complexity that it did not form by chance, and you conclude that there must have been a watchmaker. A human is more complex than a watch, so there must have been a ‘human-maker’ – a creator. An alternative view is that the complexity you recognize is actually ‘depth’ and the conclusion you should reach is that there must have been a computational process sustained through a great many steps. In the case of humans, the process is 3.6 billion years of evolution and the depth can be read in the genome. The watch is deep as well, but much of its depth is acquired from the human genome relative to which it is not nearly so deep.

*It has been brought to my attention that others have considered similar concepts including Charles Bennett, Murray Gell-Mann, and Luis Antunes, Lance Fortnow, Dieter Van Melkebeek, and N. Variyam Vinodchandran. Indeed, the latter group has even used the term ‘deep string’, hence it is to them we owe the name.

**For all \sigma\in\{0,1\}^{*}:

  • T(\sigma) denotes the set of Turing machine programs that, with the empty string as input, output \sigma and halt.
  • E(\sigma) denotes \min\{\max\{|P|,\log_{2}(\vec{P})\}|P\in T(\sigma)\}, where |P| denotes the length of P, and \vec{P} denotes the number of steps used by P with the empty string as input.

It may be convenient for the reader to assume that D(\sigma) is approximately \frac{E(\sigma)}{K(\sigma)}; however, the proper theory of deep strings extends the notion of depth to sets of strings and accommodates the use of randomness in computation. Depth in the context of quantum computation may also be of interest.

Relative Entropy in Evolutionary Dynamics

22 January, 2014

guest post by Marc Harper

In John’s information geometry series, he mentioned some of my work in evolutionary dynamics. Today I’m going to tell you about some exciting extensions!

The replicator equation

First a little refresher. For a population of n replicating types, such as individuals with different eye colors or a gene with n distinct alleles, the ‘replicator equation’ expresses the main idea of natural selection: the relative rate of growth of each type should be proportional to the difference between the fitness of the type and the mean fitness in the population.

To see why this equation should be true, let P_i be the population of individuals of the ith type, which we allow to be any nonnegative real number. We can list all these numbers and get a vector:

P = (P_1, \dots, P_n)

The Lotka–Volterra equation is a very general rule for how these numbers can change with time:

\displaystyle{ \frac{d P_i}{d t} = f_i(P) P_i }

Each population grows at a rate proportional to itself, where the ‘constant of proportionality’, f_i(P), is not necessarily constant: it can be any real-valued function of P. This function is called the fitness of the ith type. Taken all together, these functions f_i are called the fitness landscape.

Let p_i be the fraction of individuals who are of the ith type:

\displaystyle{ p_i = \frac{P_i}{\sum_{i =1}^n P_i } }

These numbers p_i are between 0 and 1, and they add up to 1. So, we can also think of them as probabilities: p_i is the probability that a randomly chosen individual is of the ith type. This is how probability theory, and eventually entropy, gets into the game.

Again, we can bundle these numbers into a vector:

p = (p_1, \dots, p_n)

which we call the population distribution. It turns out that the Lotka–Volterra equation implies the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(P) - \langle f(P) \rangle \right) \, p_i }

where

\displaystyle{ \langle f(P) \rangle = \sum_{i =1}^n  f_i(P)  p_i  }

is the mean fitness of all the individuals. You can see the proof in Part 9 of the information geometry series.

By the way: if each fitness f_i(P) only depends on the fraction of individuals of each type, not the total numbers, we can write the replicator equation in a simpler way:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }

From now on, when talking about this equation, that’s what I’ll do.

Anyway, the take-home message is this: the replicator equation says the fraction of individuals of any type changes at a rate proportional to fitness of that type minus the mean fitness.
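A quick numerical sanity check of the claim that the Lotka–Volterra equation implies the replicator equation: integrate both, starting from matching initial data, and compare the fractions P_i / \sum_j P_j with the solution p_i of the replicator equation. The fitness landscape below, f_i(p) = \sum_j A_{ij} p_j with a Hawk–Dove style payoff matrix, is an assumed example and does not come from the post. The integrator is a hand-written classical Runge–Kutta step, and all function names are hypothetical.

```python
A = [[-1.0, 2.0],
     [0.0, 1.0]]  # assumed payoff matrix (a Hawk-Dove game)


def fitness(p):
    """Assumed fitness landscape: f_i(p) = sum_j A[i][j] * p_j."""
    return [sum(a * pj for a, pj in zip(row, p)) for row in A]


def normalize(P):
    """Turn populations P_i into fractions p_i."""
    total = sum(P)
    return [Pi / total for Pi in P]


def lotka_volterra(P):
    """dP_i/dt = f_i P_i, with fitness depending only on the fractions."""
    f = fitness(normalize(P))
    return [fi * Pi for fi, Pi in zip(f, P)]


def replicator(p):
    """dp_i/dt = (f_i(p) - <f(p)>) p_i."""
    f = fitness(p)
    mean = sum(fi * pi for fi, pi in zip(f, p))
    return [(fi - mean) * pi for fi, pi in zip(f, p)]


def rk4_step(field, x, dt):
    """One classical Runge-Kutta step for dx/dt = field(x)."""
    k1 = field(x)
    k2 = field([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
    k3 = field([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
    k4 = field([xi + dt * ki for xi, ki in zip(x, k3)])
    return [xi + dt * (a + 2 * b + 2 * c + d) / 6
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


P = [30.0, 10.0]   # populations
p = normalize(P)   # the matching fractions
dt, steps = 0.001, 3000
for _ in range(steps):
    P = rk4_step(lotka_volterra, P, dt)
    p = rk4_step(replicator, p, dt)

# Largest disagreement between the two descriptions at the final time.
gap = max(abs(a - b) for a, b in zip(normalize(P), p))
print("fractions from Lotka-Volterra:", normalize(P))
print("replicator solution:          ", p)
print("largest difference:", gap)
```

The two sets of fractions should agree up to the integration error, while the total population \sum_i P_i is free to grow or shrink. This is a check on one example, not a substitute for the proof in Part 9.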

Now, it has been known since the late 1970s or early 1980s, thanks to the work of Akin, Bomze, Hofbauer, Shahshahani, and others, that the replicator equation has some very interesting properties. For one thing, it often makes ‘relative entropy’ decrease. For another, it’s often an example of ‘gradient flow’. Let’s look at both of these in turn, and then talk about some new generalizations of these facts.

Relative entropy as a Lyapunov function

I mentioned that we can think of a population distribution as a probability distribution. This lets us take ideas from probability theory and even information theory and apply them to evolutionary dynamics! For example, given two population distributions p and q, the information of q relative to p is

I(q,p) = \displaystyle{ \sum_i q_i \ln \left(\frac{q_i}{p_i }\right)}

This measures how much information you gain if you have a hypothesis about some state of affairs given by the probability distribution p, and then someone tells you “no, the best hypothesis is q!”

It may seem weird to treat a population distribution as a hypothesis, but this turns out to be a good idea. Evolution can then be seen as a learning process: a process of improving the hypothesis.

We can make this precise by seeing how the relative information changes with the passage of time. Suppose we have two population distributions q and p. Suppose q is fixed, while p evolves in time according to the replicator equation. Then

\displaystyle{  \frac{d}{d t} I(q,p)  =  \sum_i f_i(P) (p_i - q_i) }

For the proof, see Part 11 of the information geometry series.

So, the information of q relative to p will decrease as p evolves according to the replicator equation if

\displaystyle{  \sum_i f_i(P) (p_i - q_i) \le 0 }

If q makes this true for all p, we say q is an evolutionarily stable state. For some reasons why, see Part 13.

What matters now is that when q is an evolutionarily stable state, I(q,p) says how much information the population has ‘left to learn’—and we’re seeing that this always decreases. Moreover, it turns out that we always have

I(q,p) \ge 0

and I(q,p) = 0 precisely when p = q.
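Here is a small numerical check of this story, offered as a sketch under stated assumptions. The landscape `crowding_fitness`, with f_i(p) = -p_i, is a hypothetical choice: for it, \sum_i f_i(p)(p_i - q_i) = 1/n - \sum_i p_i^2 \le 0 when q is the uniform distribution, so the uniform distribution satisfies the condition above. The Euler time stepping and step size are also assumptions.

```python
import math


def relative_information(q, p):
    """I(q,p) = sum_i q_i ln(q_i / p_i), for strictly positive p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def replicator_step(p, fitness, dt):
    f = fitness(p)
    mean_f = sum(fi * pi for fi, pi in zip(f, p))
    return [pi + dt * (fi - mean_f) * pi for fi, pi in zip(f, p)]


def crowding_fitness(p):
    # Hypothetical landscape f_i(p) = -p_i: common types do worse.
    return [-pi for pi in p]


q = [1 / 3, 1 / 3, 1 / 3]      # the fixed distribution q (uniform)
p = [0.7, 0.2, 0.1]            # initial population distribution
history = [relative_information(q, p)]
for _ in range(200):
    p = replicator_step(p, crowding_fitness, dt=0.1)
    history.append(relative_information(q, p))

print(history[0], history[-1])  # information 'left to learn' at start and end
```

In this run the recorded values of I(q,p) never go up, and they shrink toward zero as p approaches q.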

People summarize all this by saying that relative information is a ‘Lyapunov function’. Very roughly, a Lyapunov function is something that decreases with the passage of time, and is zero only at the unique stable state. To be a bit more precise, suppose we have a differential equation like

\displaystyle{  \frac{d}{d t} x(t) = v(x(t)) }

where x(t) \in \mathbb{R}^n and v is some smooth vector field on \mathbb{R}^n. Then a smooth function

V : \mathbb{R}^n \to \mathbb{R}

is a Lyapunov function if

• V(x) \ge 0 for all x

• V(x) = 0 iff x is some particular point x_0

• \displaystyle{ \frac{d}{d t} V(x(t)) \le 0 } for every solution of our differential equation.

In this situation, the point x_0 is a stable equilibrium for our differential equation: this is Lyapunov’s theorem.

The replicator equation as a gradient flow equation

The basic idea of Lyapunov’s theorem is that when a ball likes to roll downhill and the landscape has just one bottom point, that point will be the unique stable equilibrium for the ball.

The idea of gradient flow is similar, but different: sometimes things like to roll downhill as efficiently as possible, moving in exactly the best direction to make some quantity smaller! Under certain conditions, the replicator equation is an example of this phenomenon.

Let’s fill in some details. For starters, suppose we have some function

V : \mathbb{R}^n \to \mathbb{R}

Think of V as ‘height’. Then the gradient flow equation says how a point x(t) \in \mathbb{R}^n will move if it’s always trying its very best to go downhill:

\displaystyle{ \frac{d}{d t} x(t) = - \nabla V(x(t)) }

Here \nabla is the usual gradient in Euclidean space:

\displaystyle{ \nabla V = \left(\partial_1 V, \dots, \partial_n V \right)  }

where \partial_i is short for the partial derivative with respect to the ith coordinate.
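As a quick illustration of the Euclidean gradient flow equation, here is a sketch that follows it numerically. The height function V(x) = x_1^2 + 2 x_2^2, the starting point and the Euler step size are all hypothetical choices made for this example.

```python
def V(x):
    # Hypothetical height function with its only bottom point at the origin.
    return x[0] ** 2 + 2.0 * x[1] ** 2


def grad_V(x):
    # Usual Euclidean gradient (d_1 V, d_2 V) of the function above.
    return [2.0 * x[0], 4.0 * x[1]]


def gradient_flow(x, dt, steps):
    """Forward-Euler approximation of dx/dt = -grad V(x)."""
    heights = [V(x)]
    for _ in range(steps):
        g = grad_V(x)
        x = [xi - dt * gi for xi, gi in zip(x, g)]
        heights.append(V(x))
    return x, heights


x_final, heights = gradient_flow([1.0, -1.0], dt=0.05, steps=200)
print(x_final, heights[0], heights[-1])
```

The recorded heights only go down, and the point ends up very close to the bottom of the landscape.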

The interesting thing is that under certain conditions, the replicator equation is an example of a gradient flow equation… but typically not one where \nabla is the usual gradient in Euclidean space. Instead, it’s the gradient on some other space, the space of all population distributions, which has a non-Euclidean geometry!

The space of all population distributions is a simplex:

\{ p \in \mathbb{R}^n : \; p_i \ge 0, \; \sum_{i = 1}^n p_i = 1 \} .

For example, it’s an equilateral triangle when n = 3. The equilateral triangle looks flat, but if we measure distances another way it becomes round, exactly like a portion of a sphere, and that’s the non-Euclidean geometry we need!

In fact this trick works in any dimension. The idea is to give the simplex a special Riemannian metric, the ‘Fisher information metric’. The usual metric on Euclidean space is

\delta_{i j} = \left\{\begin{array}{ccl} 1 & \mathrm{ if } & i = j \\                                       0 &\mathrm{ if } & i \ne j \end{array} \right.

This simply says that two standard basis vectors like (0,1,0,0) and (0,0,1,0) have dot product zero if the 1’s are in different places, and one if they’re in the same place. The Fisher information metric is a bit more complicated:

\displaystyle{ g_{i j} = \frac{\delta_{i j}}{p_i} }

As before, g_{i j} is a formula for the dot product of the ith and jth standard basis vectors, but now it depends on where you are in the simplex of population distributions.

We saw how this formula arises from information theory back in Part 7. I won’t repeat the calculation, but the idea is this. Fix a population distribution p and consider the information of another one, say q, relative to this. We get I(q,p). If q = p this is zero:

\displaystyle{ \left. I(q,p)\right|_{q = p} = 0 }

and this point is a local minimum for the relative information. So, the first derivative of I(q,p) as we change q must be zero:

\displaystyle{ \left. \frac{\partial}{\partial q_i} I(q,p) \right|_{q = p} = 0 }

But the second derivatives are not zero. In fact, since we’re at a local minimum, it should not be surprising that we get a positive definite matrix of second derivatives:

\displaystyle{  g_{i j} = \left. \frac{\partial^2}{\partial q_i \partial q_j} I(q,p) \right|_{q = p} }

And, this is the Fisher information metric! So, the Fisher information metric is a way of taking dot products between vectors in the simplex of population distributions that’s based on the concept of relative information.
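If you'd like to check this numerically, here is a rough sketch. It estimates the second derivatives of I(q,p) with central finite differences, treating the q_i as independent coordinates, and compares them with \delta_{i j}/p_i. The test point p, the step `eps` and the function names are assumptions made for this example.

```python
import math


def relative_information(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def second_derivatives_at_p(p, eps=1e-4):
    """Central-difference estimate of d^2 I(q,p) / dq_i dq_j at q = p."""
    n = len(p)

    def shifted(i, di, j, dj):
        # Evaluate I(q,p) with q equal to p nudged in coordinates i and j.
        q = list(p)
        q[i] += di
        q[j] += dj
        return relative_information(q, p)

    return [[(shifted(i, eps, j, eps) - shifted(i, eps, j, -eps)
              - shifted(i, -eps, j, eps) + shifted(i, -eps, j, -eps))
             / (4 * eps * eps)
             for j in range(n)]
            for i in range(n)]


p = [0.5, 0.3, 0.2]
hessian = second_derivatives_at_p(p)
fisher = [[(1 / p[i] if i == j else 0.0) for j in range(3)] for i in range(3)]

for row_h, row_g in zip(hessian, fisher):
    print(row_h, row_g)
```

The two matrices agree up to the finite-difference error: diagonal entries close to 1/p_i and off-diagonal entries close to zero.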

This is not the place to explain Riemannian geometry, but any metric gives a way to measure angles and distances, and thus a way to define the gradient of a function. After all, the gradient of a function should point at right angles to the level sets of that function, and its length should equal the slope of that function.

So, if we change our way of measuring angles and distances, we get a new concept of gradient! The ith component of this new gradient vector field turns out to be

(\nabla_g V)^i = g^{i j} \partial_j V

where g^{i j} is the inverse of the matrix g_{i j}, and we sum over the repeated index j. As a sanity check, make sure you see why this is the usual Euclidean gradient when g_{i j} = \delta_{i j}.
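Here is a tiny sketch of that formula, covering both the sanity check and the Fisher case. It simply applies g^{i j} \partial_j V as a matrix times a vector. The numbers in `partials` are made-up values of the partial derivatives at one point, and for the Fisher information metric I'm using the fact that the inverse of the diagonal matrix \delta_{i j}/p_i is the diagonal matrix with entries p_i.

```python
def metric_gradient(g_inverse, partials):
    """(grad_g V)^i = sum_j g^{ij} d_j V."""
    n = len(partials)
    return [sum(g_inverse[i][j] * partials[j] for j in range(n))
            for i in range(n)]


def euclidean_inverse(n):
    # g_{ij} = delta_{ij}, which is its own inverse.
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def fisher_inverse(p):
    # g_{ij} = delta_{ij} / p_i, so g^{ij} = p_i delta_{ij}.
    n = len(p)
    return [[p[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


partials = [1.0, 2.0, 3.0]     # hypothetical values of d_j V at the point p
p = [0.5, 0.3, 0.2]

euclid = metric_gradient(euclidean_inverse(3), partials)
fisher = metric_gradient(fisher_inverse(p), partials)
print(euclid)   # same as the list of partial derivatives
print(fisher)   # each partial derivative scaled by p_i
```

With the Fisher information metric the ith component picks up a factor of p_i, which is the same kind of factor that appears on the right-hand side of the replicator equation.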

Now suppose the fitness landscape is the good old Euclidean gradient of some function. Then it turns out that the replicator equation is a special case of gradient flow on the space of population distributions… but where we use the Fisher information metric to define our concept of gradient!

To get a feel for this, it’s good to start with the Lotka–Volterra equation, which describes how the total number of individuals of each type changes. Suppose the fitness landscape is the Euclidean gradient of some function V:

\displaystyle{ f_i(P) = \frac{\partial V}{\partial P_i} }

Then the Lotka–Volterra equation becomes this:

\displaystyle{ \frac{d P_i}{d t} = \frac{\partial V}{\partial P_i} \, P_i }

This doesn’t look like the gradient flow equation, thanks to that annoying P_i on the right-hand side! It certainly ain’t the gradient flow coming from the function V and the usual Euclidean gradient. However, it is gradient flow coming from V and some other metric on the space

\{ P \in \mathbb{R}^n : \; P_i \ge 0 \}

For a proof, and the formula for this other metric, see Section 3.7 in this survey:

• Marc Harper, Information geometry and evolutionary game theory.

Now let’s turn to the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p)  - \langle f(p) \rangle \right) \, p_i }

Again, if the fitness landscape is a Euclidean gradient, we can rewrite the replicator equation as a gradient flow equation… but again, not with respect to the Euclidean metric. This time we need to use the Fisher information metric! I sketch a proof in my paper above.

In fact, both these results were first worked out by Shahshahani:

• Siavash Shahshahani, A New Mathematical Framework for the Study of Linkage and Selection, Memoirs of the AMS 17, 1979.
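Here is a numerical sketch of what this means in practice. Suppose the fitness is the Euclidean gradient of V. Then along the replicator equation we get dV/dt = \sum_i p_i f_i^2 - \langle f \rangle^2, which is the variance of the fitness and so never negative. In other words, the trajectories climb V, so in the ‘downhill’ picture above the height is -V. The function V below, with made-up diagonal payoffs `A`, and the Euler time stepping are assumptions chosen for this example.

```python
A = [1.0, 2.0, 3.0]            # hypothetical payoffs: V(p) = (1/2) sum_i A_i p_i^2


def V(p):
    return 0.5 * sum(a * pi * pi for a, pi in zip(A, p))


def fitness(p):
    # Euclidean gradient of V: f_i(p) = dV/dp_i = A_i p_i.
    return [a * pi for a, pi in zip(A, p)]


def replicator_step(p, dt):
    f = fitness(p)
    mean_f = sum(fi * pi for fi, pi in zip(f, p))
    return [pi + dt * (fi - mean_f) * pi for fi, pi in zip(f, p)]


p = [0.5, 0.3, 0.2]
values = [V(p)]
for _ in range(1000):
    p = replicator_step(p, dt=0.01)
    values.append(V(p))

print(values[0], values[-1])
```

The recorded values of V never decrease along the trajectory, as you would expect from a flow that follows a gradient.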

New directions

All this is just the beginning! The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

You can see some very nice versions of this story for maximum likelihood estimators and linear programming here:

• Akio Fujiwara and Shun-ichi Amari, Gradient systems in view of information geometry, Physica D: Nonlinear Phenomena 80 (1995), 317–327.

Indeed, this seems to be the first paper discussing the similarities between evolutionary game theory and information geometry.

Dash Fryer (at Pomona College) and I have generalized this story in several interesting ways.

First, there are two famous ways to generalize the usual formula for entropy: Tsallis entropy and Rényi entropy, both of which involve a parameter q. There are Tsallis and Rényi versions of relative entropy and the Fisher information metric as well. Everything I just explained about:

• conditions under which relative entropy is a Lyapunov function for the replicator equation, and

• conditions under which the replicator equation is a special case of gradient flow

generalizes to these cases! However, these generalized entropies give modified versions of the replicator equation. When we set q=1 we get back the usual story. See

• Marc Harper, Escort evolutionary game theory.

My initial interest in these alternate entropies was mostly mathematical—what is so special about the corresponding geometries?—but now researchers are starting to find populations that evolve according to these kinds of modified population dynamics! For example:

• A. Hernando et al, The workings of the Maximum Entropy Principle in collective human behavior.

There’s an interesting special case worth some attention. Lots of people fret about the relative entropy not being a distance function obeying the axioms that mathematicians like: for example, it doesn’t obey the triangle inequality. Many describe the relative entropy as a distance-like function, and this is often a valid interpretation contextually. On the other hand, the q=0 relative entropy is one-half the Euclidean distance squared! In this case the modified version of the replicator equation looks like this:

\displaystyle{ \frac{d p_i}{d t} = f_i(p) - \frac{1}{n} \sum_{j = 1}^n f_j(p) }

This equation is called the projection dynamic.
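Here is a minimal sketch of the projection dynamic, tracking one-half the squared Euclidean distance to a target distribution. The landscape f_i(p) = -p_i is a hypothetical choice for which the uniform distribution q is the rest point. The Euler step size is also an assumption, and I'm ignoring what happens at the boundary of the simplex, which this simple trajectory never reaches.

```python
def projection_step(p, fitness, dt):
    """One Euler step of dp_i/dt = f_i(p) - (1/n) sum_j f_j(p)."""
    f = fitness(p)
    average = sum(f) / len(f)
    return [pi + dt * (fi - average) for pi, fi in zip(p, f)]


def crowding_fitness(p):
    # Hypothetical landscape f_i(p) = -p_i.
    return [-pi for pi in p]


def half_squared_distance(q, p):
    return 0.5 * sum((qi - pi) ** 2 for qi, pi in zip(q, p))


q = [1 / 3, 1 / 3, 1 / 3]
p = [0.7, 0.2, 0.1]
distances = [half_squared_distance(q, p)]
for _ in range(100):
    p = projection_step(p, crowding_fitness, dt=0.1)
    distances.append(half_squared_distance(q, p))

print(distances[0], distances[-1])
```

Subtracting the plain average of the fitnesses keeps the p_i summing to 1, and in this example the distance to q shrinks at every step.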

Later, I showed that there is a reasonable definition of relative entropy for a much larger family of geometries that satisfies a similar distance minimization property.

In a different direction, Dash showed that you can change the way that selection acts by using a variety of alternative ‘incentives’, extending the story to some other well-known equations describing evolutionary dynamics. By replacing the terms x_i f_i(x) in the replicator equation with a variety of other functions, called incentives, we can generate many commonly studied models of evolutionary dynamics. For instance if we exponentiate the fitness landscape (to make it always positive), we get what is commonly known as the logit dynamic. This amounts to changing the fitness landscape as follows:

\displaystyle{ f_i \mapsto \frac{x_i e^{\beta f_i}}{\sum_j{x_j e^{\beta f_j}}} }

where \beta is known as an inverse temperature in statistical thermodynamics and as an intensity of selection in evolutionary dynamics. There are lots of modified versions of the replicator equation, like the best-reply and projection dynamics, more common in economic applications of evolutionary game theory, and they can all be captured in this family. (There are also other ways to simultaneously capture such families, such as Bill Sandholm’s revision protocols, which were introduced earlier in his exploration of the foundations of game dynamics.)
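To make the logit formula concrete, here is a short sketch that evaluates x_i e^{\beta f_i} / \sum_j x_j e^{\beta f_j} for a given population and fitness vector. The function name and the sample numbers are made up for illustration, and I'm assuming \beta \ge 0. Subtracting the largest fitness before exponentiating doesn't change the ratio; it only avoids overflow.

```python
import math


def logit_incentive(x, f, beta):
    """Return x_i e^{beta f_i} / sum_j x_j e^{beta f_j} for each i."""
    shift = max(f)   # cancels in the ratio; keeps the exponentials <= 1
    weights = [xi * math.exp(beta * (fi - shift)) for xi, fi in zip(x, f)]
    total = sum(weights)
    return [w / total for w in weights]


x = [0.5, 0.3, 0.2]            # hypothetical population distribution
f = [1.0, 2.0, 3.0]            # hypothetical fitness values

weak = logit_incentive(x, f, beta=0.0)
strong = logit_incentive(x, f, beta=50.0)
print(weak)    # no selection: the population distribution comes back unchanged
print(strong)  # intense selection: nearly all weight on the fittest type
```

So \beta really does act like an intensity of selection: at \beta = 0 fitness is ignored, and as \beta grows the weight concentrates on the fittest type.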

Dash showed that there is a natural generalization of evolutionarily stable states to ‘incentive stable states’, and that for incentive stable states, the relative entropy is decreasing to zero when the trajectories get near the equilibrium. For the logit and projection dynamics, incentive stable states are simply evolutionarily stable states, and this happens frequently, but not always.

The third generalization is to look at different ‘time-scales’—that is, different ways of describing time! We can make up the symbol \mathbb{T} for a general choice of ‘time-scale’. So far I’ve been treating time as a real number, so

\mathbb{T} = \mathbb{R}

But we can also treat time as coming in discrete evenly spaced steps, which amounts to treating time as an integer:

\mathbb{T} = \mathbb{Z}

More generally, we can make the steps have duration h, where h is any positive real number:

\mathbb{T} = h\mathbb{Z}

There is a nice way to simultaneously describe the cases \mathbb{T} = \mathbb{R} and \mathbb{T} = h\mathbb{Z} using the time-scale calculus and time-scale derivatives. For the time-scale \mathbb{T} = \mathbb{R} the time-scale derivative is just the ordinary derivative. For the time-scale \mathbb{T} = h\mathbb{Z}, the time-scale derivative is given by the difference quotient from first year calculus:

\displaystyle{ f^{\Delta}(z) = \frac{f(z+h) - f(z)}{h} }

and using this as a substitute for the derivative gives difference equations like a discrete-time version of the replicator equation. There are many other choices of time-scale, such as the quantum time-scale given by \mathbb{T} = q^{\mathbb{Z}}, in which case the time-scale derivative is called the q-derivative, but that’s a tale for another time. In any case, the fact that the successive relative entropies are decreasing can be simply stated by saying they have negative \mathbb{T} = h\mathbb{Z} time-scale derivative. The continuous case we started with corresponds to \mathbb{T} = \mathbb{R}.
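Here is the time-scale derivative for \mathbb{T} = h\mathbb{Z} as a few lines of code, just to make the definition concrete. The test function `square` is an arbitrary choice for this example.

```python
def delta_derivative(f, z, h):
    """Time-scale derivative on T = hZ: the forward difference quotient."""
    return (f(z + h) - f(z)) / h


def square(z):
    return z * z


for h in (1.0, 0.5, 0.001):
    print(h, delta_derivative(square, 3.0, h))
```

For f(z) = z^2 the difference quotient works out to 2z + h, so it approaches the ordinary derivative 2z as h shrinks. Solving f^{\Delta}(z) = F for f(z+h) gives f(z+h) = f(z) + h F, which is how a differential equation turns into a difference equation on this time-scale.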

Remarkably, Dash and I were able to show that you can combine all three of these generalizations into one theorem, and even allow for multiple interacting populations! This produces some really neat population trajectories, such as the following two populations with three types, with fitness functions corresponding to the rock-paper-scissors game. On top we have the replicator equation, which goes along with the Fisher information metric; on the bottom we have the logit dynamic, which goes along with the Euclidean metric on the simplex:

From our theorem, it follows that the relative entropy (ordinary relative entropy on top, the q = 0 entropy on bottom) converges to zero along the population trajectories.

The final form of the theorem is loosely as follows. Pick a Riemannian geometry given by a metric g (obeying some mild conditions) and an incentive for each population, as well as a time scale (\mathbb{R} or h \mathbb{Z}) for every population. This gives an evolutionary dynamic with a natural generalization of evolutionarily stable states, and a suitable version of the relative entropy. Then, if there is an evolutionarily stable state in the interior of the simplex, the time-scale derivative of the sum of the relative entropies for each population will decrease as the trajectories converge to the stable state!

When there isn’t such a stable state, we still get some interesting population dynamics, like the following:

See this paper for details:

• Marc Harper and Dashiell E. A. Fryer, Stability of evolutionary dynamics on time scales.

Next time we’ll see how to make the main idea work in finite populations, without derivatives or deterministic trajectories!

Relative Entropy (Part 3)

25 December, 2013

Holidays are great. There’s nothing I need to do! Everybody is celebrating! So, I can finally get some real work done.

In the last couple of days I’ve finished a paper with Jamie Vicary on wormholes and entanglement… subject to his approval and corrections. More on that later. And now I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper:

Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

Relative Entropy (Part 2): a category related to statistical inference, \mathrm{FinStat}, and how relative entropy defines a functor on this category.

But then Tobias Fritz noticed a big problem. Our characterization of relative entropy was inspired by this paper:

• D. Petz, Characterization of the relative entropy of states of matrix algebras, Acta Math. Hungar. 59 (1992), 449–455.

Here Petz sought to characterize relative entropy both in the ‘classical’ case we are concerned with and in the more general ‘quantum’ setting. Our original goal was merely to express his results in a more category-theoretic framework! Unfortunately Petz’s proof contained a significant flaw. Tobias noticed this and spent a lot of time fixing it, with no help from me.

Our paper is now self-contained, and considerably longer. My job now is to polish it up and make it pretty. What follows is the introduction, which should explain the basic ideas.

A Bayesian characterization of relative entropy

This paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or ‘Kullback-Leibler divergence’. Whenever we have two probability distributions p and q on the same finite set X, we can define the entropy of q relative to p:

S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right)

Here we set

q_x \ln\left( \frac{q_x}{p_x} \right)

equal to \infty when p_x = 0, unless q_x is also zero, in which case we set it equal to 0. Relative entropy thus takes values in [0,\infty].

Intuitively speaking, S(q,p) is the expected amount of information gained when we discover the probability distribution is really q, when we had thought it was p. We should think of p as a ‘prior’ and q as a ‘posterior’. When we take p to be the uniform distribution on X, relative entropy reduces to the ordinary Shannon entropy, up to an additive constant. The advantage of relative entropy is that it makes the role of the prior explicit.
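Here is a small sketch of this definition, including the conventions for zero probabilities. It also checks the remark about the uniform prior: in that case S(q,p) = \ln n - H(q), where H is the Shannon entropy and n is the size of X. The function names and sample distributions are made up for illustration. Treating a term with q_x = 0 and p_x > 0 as zero is the usual convention 0 \ln 0 = 0, which I'm assuming here.

```python
import math


def relative_entropy(q, p):
    """S(q,p) = sum_x q_x ln(q_x / p_x), with values in [0, infinity]."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0.0:
            continue             # the term is 0 whenever q_x = 0
        if px == 0.0:
            return math.inf      # the term is infinite when p_x = 0 < q_x
        total += qx * math.log(qx / px)
    return total


def shannon_entropy(q):
    return -sum(qx * math.log(qx) for qx in q if qx > 0.0)


q = [0.5, 0.25, 0.25]
uniform = [1 / 3, 1 / 3, 1 / 3]

print(relative_entropy(q, uniform), math.log(3) - shannon_entropy(q))
print(relative_entropy([0.5, 0.5], [1.0, 0.0]))   # infinite
print(relative_entropy([1.0, 0.0], [1.0, 0.0]))   # zero
```

The first line prints the same number twice, up to rounding.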

Since Bayesian probability theory emphasizes the role of the prior, relative entropy naturally lends itself to a Bayesian interpretation: it measures how much information we gain given a certain prior. Our goal here is to make this precise in a mathematical characterization of relative entropy. We do this using a category \mathrm{FinStat} where:

• an object (X,q) consists of a finite set X and a probability distribution x \mapsto q_x on that set;

• a morphism (f,s) : (X,q) \to (Y,r) consists of a measure-preserving function f from X to Y, together with a probability distribution x \mapsto s_{x y} on X for each element y \in Y, with the property that s_{xy} = 0 unless f(x) = y.

We can think of an object of \mathrm{FinStat} as a system with some finite set of states together with a probability distribution on its states. A morphism

(f,s) : (X,q) \to (Y,r)

then consists of two parts. First, there is a deterministic measurement process f : X \to Y mapping states of the system being measured, X, to states of the measurement apparatus, Y. The condition that f be measure-preserving says that, after the measurement, the probability that the apparatus be in any state y \in Y is the sum of the probabilities of all states of X leading to that outcome:

\displaystyle{  r_y = \sum_{x \in f^{-1}(y)} q_x }

Second, there is a hypothesis s: an assumption about the probability s_{xy} that the system being measured is in the state x \in X given any measurement outcome y \in Y.

Suppose we have any morphism

(f,s) : (X,q) \to (Y,r)

in \mathrm{FinStat}. From this we obtain two probability distributions on the states of the system being measured. First, we have the probability distribution p given by

\displaystyle{   p_x = \sum_{y \in Y} s_{xy} r_y } \qquad \qquad (1)

This is our prior, given our hypothesis and the probability distribution of measurement results. Second we have the ‘true’ probability distribution q, which would be the posterior if we updated our prior using complete direct knowledge of the system being measured.

It follows that any morphism in \mathrm{FinStat} has a relative entropy S(q,p) associated to it. This is the expected amount of information we gain when we update our prior p to the posterior q.
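To make this concrete, here is a sketch that computes the prior from Equation (1) and the relative entropy of a morphism. The encoding is my own choice for this example: states are numbered from 0, f is a list sending each x to f(x), and the hypothesis is a nested list with s[x][y] = s_{xy}. The particular numbers are made up. The hypothesis `s_best`, with s_{xy} = q_x / r_y when f(x) = y, is chosen so that the prior comes out equal to q.

```python
import math


def relative_entropy(q, p):
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0.0:
            continue
        if px == 0.0:
            return math.inf
        total += qx * math.log(qx / px)
    return total


def pushforward(f, q, size_of_Y):
    """r_y = sum of q_x over all x with f(x) = y."""
    r = [0.0] * size_of_Y
    for x, qx in enumerate(q):
        r[f[x]] += qx
    return r


def prior(s, r):
    """Equation (1): p_x = sum_y s_{xy} r_y."""
    return [sum(sxy * ry for sxy, ry in zip(row, r)) for row in s]


f = [0, 0, 1]                  # f(0) = f(1) = 0 and f(2) = 1
q = [0.2, 0.3, 0.5]            # 'true' distribution on X
r = pushforward(f, q, 2)       # distribution of measurement outcomes on Y

# A hypothesis that splits outcome 0 evenly between states 0 and 1.
s_guess = [[0.5, 0.0], [0.5, 0.0], [0.0, 1.0]]
# A hypothesis built from q itself: s_{xy} = q_x / r_y when f(x) = y.
s_best = [[(q[x] / r[y] if f[x] == y else 0.0) for y in range(2)]
          for x in range(3)]

print(relative_entropy(q, prior(s_guess, r)))   # positive: information gained
print(relative_entropy(q, prior(s_best, r)))    # zero up to rounding
```

With `s_guess` the prior differs from q, so updating to q gains some information. With `s_best` the prior already equals q and nothing is gained.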

In fact, this way of assigning relative entropies to morphisms defines a functor

F_0 : \mathrm{FinStat} \to [0,\infty]

where we use [0,\infty] to denote the category with one object, the numbers 0 \le x \le \infty as morphisms, and addition as composition. More precisely, if

(f,s) : (X,q) \to (Y,r)

is any morphism in \mathrm{FinStat}, we define

F_0(f,s) = S(q,p)

where the prior p is defined as in the equation (1).

The fact that F_0 is a functor is nontrivial and rather interesting. It says that given any composable pair of measurement processes:

(X,q) \stackrel{(f,s)}{\longrightarrow} (Y,r) \stackrel{(g,t)}{\longrightarrow} (Z,u)

the relative entropy of their composite is the sum of the relative entropies of the two parts:

F_0((g,t) \circ (f,s)) = F_0(g,t) + F_0(f,s) .
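Here is a numerical check of this equation on one small example. One caveat: the text above doesn't say how morphisms compose, so the sketch assumes that functions compose in the usual way and that hypotheses compose by matrix multiplication, (s \circ t)_{xz} = \sum_y s_{xy} t_{yz}. The list encoding and all the numbers are made up for illustration.

```python
import math


def relative_entropy(q, p):
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0.0:
            continue
        if px == 0.0:
            return math.inf
        total += qx * math.log(qx / px)
    return total


def pushforward(f, q, size_of_Y):
    r = [0.0] * size_of_Y
    for x, qx in enumerate(q):
        r[f[x]] += qx
    return r


def F0(f, s, q):
    """Relative entropy S(q,p) of the morphism (f,s) out of (X,q)."""
    r = pushforward(f, q, len(s[0]))
    p = [sum(sxy * ry for sxy, ry in zip(row, r)) for row in s]
    return relative_entropy(q, p)


def compose(f, s, g, t):
    # Assumed composition rule: g after f, and the matrix product of s and t.
    gf = [g[y] for y in f]
    st = [[sum(s[x][y] * t[y][z] for y in range(len(t)))
           for z in range(len(t[0]))]
          for x in range(len(s))]
    return gf, st


q = [0.1, 0.2, 0.3, 0.4]                       # distribution on X
f = [0, 0, 1, 2]                               # f : X -> Y
s = [[0.5, 0, 0], [0.5, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]
g = [0, 0, 1]                                  # g : Y -> Z
t = [[0.25, 0], [0.75, 0], [0, 1.0]]

r = pushforward(f, q, 3)                       # distribution on Y
gf, st = compose(f, s, g, t)
print(F0(gf, st, q), F0(f, s, q) + F0(g, t, r))
```

Under that assumption the two printed numbers agree up to rounding, as the equation says they should.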

We prove that F_0 is a functor. However, we go much further: we characterize relative entropy by saying that up to a constant multiple, F_0 is the unique functor from \mathrm{FinStat} to [0,\infty] obeying three reasonable conditions.

The first condition is that F_0 vanishes on morphisms (f,s) : (X,q) \to (Y,r) where the hypothesis s is optimal. By this, we mean that Equation (1) gives a prior p equal to the ‘true’ probability distribution q on the states of the system being measured.

The second condition is that F_0 is lower semicontinuous. The set P(X) of probability distributions on a finite set X naturally has the topology of an (n-1)-simplex when X has n elements. The set [0,\infty] has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

\begin{array}{rcl}         S : P(X) \times P(X) &\to& [0,\infty]  \\                                            (q,p) &\mapsto & S(q,p) .  \end{array}

The problem is that

\displaystyle{ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) }

and we define q_x \ln(q_x/p_x) to be \infty when p_x = 0 and q_x > 0 but 0 when p_x = q_x = 0. So, it turns out that S is only lower semicontinuous, meaning that if p^i , q^i are sequences of probability distributions on X with p^i \to p and q^i \to q then

S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i)
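To see how continuity can fail while this inequality still holds, here is a sketch with a made-up pair of sequences on a two-element set: q^i = (1 - 1/i, 1/i) and p^i = (1 - e^{-c i}, e^{-c i}). Both converge to (1,0), where the relative entropy is zero, but S(q^i, p^i) tends to the constant c. The sequences and the value of c are my own choices for this example.

```python
import math


def relative_entropy(q, p):
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0.0:
            continue
        if px == 0.0:
            return math.inf
        total += qx * math.log(qx / px)
    return total


c = 2.0                        # hypothetical constant; any c >= 0 works


def q_seq(i):
    return [1.0 - 1.0 / i, 1.0 / i]


def p_seq(i):
    return [1.0 - math.exp(-c * i), math.exp(-c * i)]


at_limit = relative_entropy([1.0, 0.0], [1.0, 0.0])
along_sequence = [relative_entropy(q_seq(i), p_seq(i)) for i in (10, 100, 300)]

print(at_limit)          # relative entropy at the limit point
print(along_sequence)    # values creeping up toward c
```

The value at the limit sits below the values along the sequence, which is all that lower semicontinuity asks for. Since c was arbitrary, no single limiting value would make S continuous at this point.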

We give the set of morphisms in \mathrm{FinStat} its most obvious topology, and show that with this topology, F_0 maps morphisms to morphisms in a lower semicontinuous way.

The third condition is that F_0 is convex linear. We describe how to take convex linear combinations of morphisms in \mathrm{FinStat}, and then the functor F_0 is convex linear in the sense that it maps any convex linear combination of morphisms in \mathrm{FinStat} to the corresponding convex linear combination of numbers in [0,\infty]. Intuitively, this means that if we take a coin with probability P of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is P times the expected information gain of the first process plus 1-P times the expected information gain of the second process.

Here, then, is our main theorem:

Theorem. Any lower semicontinuous, convex-linear functor

F : \mathrm{FinStat} \to [0,\infty]

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant c \in [0,\infty] such that

F(f,s) = c F_0(f,s)

for any morphism (f,s) : (X,q) \to (Y,r) in \mathrm{FinStat}.


If you’re a maniacally thorough reader of this blog, with a photographic memory, you’ll recall that our theorem now says ‘lower semicontinuous’, where in Part 2 of this series I’d originally said ‘continuous’.

I’ve fixed that blog article now… but it was Tobias who noticed this mistake. In the process of fixing our proof to address this issue, he eventually noticed that the proof of Petz’s theorem, which we’d been planning to use in our work, was also flawed.

Now I just need to finish polishing the rest of the paper!

