## Relative Entropy (Part 4)

16 February, 2014

In recent posts by Manoj Gopalkrishnan and Marc Harper we’ve seen how not just entropy but relative entropy—the entropy of a probability distribution relative to the equilibrium distribution—is a driving force in chemistry and evolution. Now Tobias Fritz and I have finally finished our paper on this subject:

Very roughly, we proved that any reasonable measure of the information you gain when you update your assumptions about the world based on discovering what a system is really doing must be some constant $c$ times the relative entropy.

Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

Relative Entropy (Part 2): a category related to statistical inference, $\mathrm{FinStat},$ and how relative entropy defines a functor on this category.

Relative Entropy (Part 3): statement of our main theorem, which characterizes relative entropy up to a constant multiple as the only functor $F : \mathrm{FinStat} \to [0,\infty)$ with a few nice properties.

Now that the paper is done, we’re having a nice conversation about it on the n-Category Café. Since I don’t want to splinter the conversation, I won’t enable comments here—please go there and join the fun!

One thing is that our conversation is getting deeper into the category-theoretic aspects. Read the long parenthetical remarks in my post on the n-Café to get up to speed on that aspect.

Another is that by looking at our paper, you can actually see the proof of our result. As I mentioned on the n-Café:

The proof is surprisingly hard. Or maybe we’re just surprisingly bad at proving things. But the interesting thing is this: the proof is swift and effective in the ‘generic’ case—the case where the support of the probability measures involved is the whole set they’re living on, and the constant $c$ is finite.

It takes some more work to handle the case where the probability measures have smaller support.

But the really hard work starts when we handle the case that, in the end, has $c = \infty.$ Then the proof becomes more like analysis than what you normally expect in category theory. We slowly corner the result, blocking off all avenues of escape. Then we close in, grab its neck, and strangle it, crushing its larynx ever tighter, as it loses the will to fight back and finally expires… still twitching.

We haven’t gotten into discussing this much yet, perhaps because the mathematicians on the n-Café are too dainty and civilized. But someone into analysis might be able to find a more efficient proof.

That would make me a bit sad—since why didn’t we find it?—but mainly happy—since this subject deserves to be clean and elegant. We really need a category-theoretic formulation of the second law of thermodynamics that’s suitable for studying complex networks: that’s the long-term goal here.

## Bio-Inspired Information Theory

31 January, 2014

There will be a 5-day workshop on Biological and Bio-Inspired Information Theory at BIRS from Sunday the 26th to Friday the 31st of October, 2014. It’s being organized by

Toby Berger (University of Virginia)
Andrew Eckford (York University)
Peter Thomas (Case Western Reserve University)

BIRS is the Banff International Research Station, a conference venue in a rather wild part of Alberta, in the mountains west of Calgary.

### Description

Here’s the workshop proposal on the BIRS website:

Currently, research in the community is organized into three streams:

• Information theory and biochemistry (including information theory and intercellular communication);

• Information theory and neuroscience; and

• Information-theoretic analysis of biologically-inspired communication systems (including nano-networking and design of biologically implemented information processing networks).

We propose a BIRS workshop to explore these streams, focusing on mathematical open problems that cut across the streams. The main objectives of this workshop would be: to bring together the most prominent researchers in this field; to discuss and review the current state of mathematical research in this field; to promote cross-pollination among the various streams of research to find common problems; and to collectively identify key future directions or open problems that would bring the greatest impact to this field over the next few years.

#### Expected impact

A BIRS workshop involving the field’s leading researchers would allow a review of the current state of the art, and would promote cross-pollination among these three streams of research. We expect to have these leading researchers in attendance. For example, Prof. Toby Berger (U. Virginia), a widely recognized pioneer in this field and a recipient of the Shannon Award (the top prize awarded by the IEEE Information Theory Society), is one of the co-organizers of the workshop. Moreover, we have approached many of the field’s most prominent mathematicians and scientists: a complete list is found elsewhere in this proposal, but among the most prominent confirmed participants are: Prof. Tadashi Nakano (Osaka U.), one of the earliest researchers in engineered molecular communication; Dr. Thomas D. Schneider (NIH – National Cancer Institute), inventor of the sequence logo and prominent researcher in genetic information theory; and Profs. William Bialek (Princeton U.) and Naftali Tishby (Hebrew U.), prominent experts on information theory in neural coding.

Although the focus of our workshop is on mathematical fundamentals, our list of expected participants includes a few experimental scientists, e.g. Raymond Cheong and Andre Levchenko (both from Johns Hopkins U.), in addition to mathematical scientists. This is because quantitative application of information-theoretic analysis to biological systems typically requires empirical estimation of joint probability distributions for multiple input and output variables, often posing daunting data collection challenges. (Cheong and Levchenko, for example, pioneered the use of high-throughput experimental methods to collect large data sets quantifying the input/output relationship of a specific biochemical signaling pathway.) We believe a blended approach, emphasizing mathematics but including experimental perspectives, will enhance the impact of our workshop and increase its usefulness to our participants.

Given that publications in these research areas have achieved prominence in the past few years, the time is right for a general meeting among the key researchers to review the state of the field and develop future directions. Thus, our proposed workshop is timely and would be expected to have a tremendous impact on the field over the next several years.

## The Rarest Things in the Universe

27 January, 2014

About 50 years ago Kolmogorov assigned to each finite binary string $\sigma$ a non-negative integer that measured that string’s ‘descriptive complexity’. Informally, $K(\sigma)$ is the length (in binary) of the shortest (Turing machine) program that, with the empty string as input, outputs $\sigma$ and halts. A related measure of descriptive complexity is $M(\sigma)=\frac{K(\sigma)}{|\sigma|}$, where $|\sigma|$ denotes the length of $\sigma$. A simple string like:

$\sigma_0=\overbrace{111\ldots 111}^{1,000,000}$

can be produced by a very short program; hence $M(\sigma_0)$ is near 0. But if $\sigma$ is a ‘random string’ (e.g. obtained by flipping coins), then, with high probability, it cannot be produced by a program significantly shorter than $\sigma$ itself; hence $M(\sigma)$ will be near 1.

If I asked you to produce (using a computer) a thousand strings of length one million with $M$ near 1, it would be easy to do; just flip a lot of coins. If I asked you to produce a thousand strings with $M$ near 0, that would also be easy. For example, you could start with a short random string and repeat it a lot. Actually, if I chose my favorite $\alpha\in [0,1]$ and wanted a thousand strings of length one million with $M$ near $\alpha$, then a mix of the preceding approaches can be used to produce them. So strings with a desired $M$ are not rare.
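This mixing trick is easy to try. Here’s a small Python sketch that uses compressed size as a crude stand-in for $K(\sigma)$ (an assumption on my part: $K$ is uncomputable, and zlib’s compressed length is only a rough upper bound on it):

```python
import random
import zlib

# Crude stand-in for M(sigma) = K(sigma)/|sigma|: pack the bits into bytes,
# compress, and divide the compressed size (in bits) by the string length.
# Assumption: zlib's output length is only a rough upper bound on K(sigma).
def M_estimate(bits: str) -> float:
    n = len(bits)
    packed = int(bits, 2).to_bytes((n + 7) // 8, 'big')
    return 8 * len(zlib.compress(packed, 9)) / n

rng = random.Random(0)
n = 100_000
random_string = ''.join(rng.choice('01') for _ in range(n))    # M near 1
repeated_string = '1' * n                                      # M near 0
# random prefix, repeated suffix: M near 1/2
mixed = ''.join(rng.choice('01') for _ in range(n // 2)) + '1' * (n - n // 2)

assert M_estimate(random_string) > 0.9
assert M_estimate(repeated_string) < 0.1
assert 0.3 < M_estimate(mixed) < 0.7
```

Varying the fraction of random prefix moves the estimate continuously, matching the construction for arbitrary $\alpha\in[0,1]$.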

Now let’s consider ‘deep strings’*. I will be informal here, but the underlying theory can be found in my Time space and randomness.

For all binary strings $\sigma$, we assign a value that measures the ‘depth’ of $\sigma$. $D(\sigma)$ is obtained by considering both the size and the number of steps used by each program that with the empty string as input, outputs $\sigma$ and halts**. $D(\sigma)$ has the following properties:

• If there is no short program to produce $\sigma$, then $D(\sigma)$ is small.
• If there is a short program to produce $\sigma$ and it uses few steps, then $D(\sigma)$ is small.
• If there is a short program to produce $\sigma$, but all short programs to produce $\sigma$ use lots of steps, then $D(\sigma)$ is large. Roughly speaking, the more steps small programs use to produce $\sigma$, the larger $D(\sigma)$ will be.

Informally, we call a string with large $D$ ‘deep’ and one with small $D$ ‘shallow’. A few examples may help.

Consider a string $\sigma$ obtained by flipping coins. With high probability there is no short program to produce $\sigma$, hence $\sigma$ is shallow.

Now consider $\sigma_0$ above. Since there is a short program to produce $\sigma_0$ and that program uses few steps, $\sigma_0$ is shallow.

Now treat $\sigma_0$ as a number in binary (i.e. $2^{1,000,000}-1$) and consider the prime factorization. The fundamental theorem of arithmetic tells us it exists and will be about one million bits long. But, unless $2^{1,000,000}-1$ is somehow special (e.g. a prime times a very smooth number), its prime factorization may be very, very deep. A short program can generate the prime factorization (just generate one million 1s with a short program and then give it to a short factoring program). But if it turns out that factoring can’t be done in polynomial time, then perhaps all short programs that generate the prime factorization use a huge number of steps. So the prime factorization would have a very large $D$. Conceivably, since steps on a computer use time and energy, the prime factorization can never be realized. It is not a long string (only one million bits) but it may exist only in theory and not in reality.
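To make the ‘short program’ half of this concrete, here is a Python sketch that factors $2^{29}-1$ by naive trial division. The program is short, but the number of steps it uses blows up as the exponent grows, which is the whole point: for an exponent like 1,000,000, no known method comes anywhere close to finishing.

```python
# A short program that produces the prime factorization of 2^n - 1 by trial
# division. For n = 29 this is instant; the string it outputs is cheap to
# make, hence shallow relative to this program's runtime.
def factorize(m: int) -> list[int]:
    factors = []
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.append(d)
            m //= d
        d += 1
    if m > 1:
        factors.append(m)
    return factors

print(factorize(2**29 - 1))   # [233, 1103, 2089]
```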

If I asked you to produce even one string of length one million with $D$ near that of the prime factorization of $2^{1,000,000}-1$, could you do it? I would not know how and I suspect that as a practical matter it cannot be done. So strings with such a $D$ are rare.

Here is a string that does exist in our universe and that (I suspect) is quite deep:

$46817226351072265620777670675006972301618979214252832875068976303839400413682313$
$921168154465151768472420980044715745858522803980473207943564433\times 5277396428112339$
$17558838216073534609312522896254707972010583175760467054896492872702786549764052$
$643493511382273226052631979775533936351462037464331880467187717179256707148303247$

In fact, this string is the prime factorization of $2^{1061}-1$. We should not expect it to be as deep as the prime factorization of $2^{1,000,000}-1$, but we should still expect it to have considerable depth. There is even some experimental evidence that supports this view. How did I get this prime factorization? I got it from: Factorization of a 1061-bit number by the Special Number Field Sieve, by Greg Childers. So it was easy to get. Well, easy for me anyway; not so easy for Childers. He reports that the factorization took about ‘3 CPU-centuries’ using the Special Number Field Sieve. Other than by factoring $2^{1061}-1$, how could you ever come to write this pair of primes? I venture to say that since the Special Number Field Sieve is the fastest known algorithm for factoring numbers of this kind, no method available today could have written these primes (for the first time) using fewer steps (and hence less time/energy).

The situation might be compared to that of atomic physics. I am not a physicist, but I suppose it is possible to theorize about an atomic nucleus with a million protons. But what if I want to create one? It appears that producing transuranic elements takes huge amounts of time/energy and the greater the number of protons, the more time/energy it takes. It is even conceivable (to me at least) that there is not enough time/energy available (at least on earth) to actually produce one. Like the prime factorization of $2^{1,000,000}-1$, it may exist in theory but not in reality. On the other hand, physicists from Russia and America, using lots of time/energy, have created an atomic nucleus with 118 protons called Ununoctium. Ununoctium is analogous to Childers’ prime factorization; both exist in reality; both were very costly to create.

But my focus here is neither ontology nor physics; it is money. Recently Bitcoins and Bitcoin-like things have captured the world’s attention. I suspect that this kind of currency will revolutionize the nature of economies, and consequently the nature of governance. Personally, I am agnostic on whether this is a good thing or a bad thing. But, in any case, I am wondering if ‘depth’ might form part of a theoretical basis for understanding new currencies. I must confess to knowing few details about Bitcoins; consequently, I will start from first principles and consider some of the properties of currency:

• Mass: A currency may have mass or may be massless. U.S. dollars and gold have mass. Bitcoins are massless. The efficiencies of the Internet market require a massless currency. Imagine running eBay or Amazon using gold. U.S. dollars have mass, so we do not actually use them for transactions on the Internet, rather we use credit card numbers to create massless IOUs that promise to deliver (but, in fact, seldom actually deliver) U.S. dollars in the future. These IOUs are essentially a massless currency.
• Production: Who can produce or destroy the currency? This has been an important political, legal and economic issue for millennia. In the west, coins began to appear at about the time Solon ruled Athens. Solon realized that he could manipulate the value of coins. And so he did. Solon’s lesson has not been lost on governments ever since. In its simplest form: if you owe $K$ dollars, you print $K$ dollars, and voila! no more debt. Creditors don’t like debtors to do that, so they want to be paid in a currency that no one can produce more of. Gold comes close. You can produce more gold but you have to spend a lot of time/energy in a mine to do so. Production also includes counterfeiting. Counterfeiting comes in at least two important forms: de novo-counterfeiting (build a press) and duplication-counterfeiting (get a xerox machine). For now, all massless currencies are in digital form and are vulnerable to duplication-counterfeiting since computers make it cheap and easy to create perfect copies of digital notes (a unit of a currency will be called a ‘note’). As a result, massless currencies will typically be associated with systems that mitigate this threat. Perhaps the elaborate ledgers implemented by the creators of Bitcoin are an attempt to deal with the threat of duplication-counterfeiting.
• Abundance: As is often said, money can be used to ‘lubricate the economy’. To accomplish this a currency has to be sufficiently abundant. For example, Ununoctium might be attractive to creditors because, compared to gold, it is far more costly in time/energy to produce more. However, it is not desirable for lubricating the economy because it is so costly to produce that fewer than 100 atoms have ever been created.

The transition from mass to masslessness will lead to currencies with uses and properties we do not normally associate with money. For example, using the techniques of secret-sharing, it becomes possible to create digital currencies where a single note can be jointly owned by 1000 individuals; any 501 of whom could cooperate to spend it, while any 500 of whom would be unable to do so.

What is the perfect currency? This is probably the wrong question; rather, we should ask what properties a currency may have, in theory and in reality. Let’s consider massless currencies.

Is there a massless currency similar to the U.S. dollar? I think yes. For example, the Government could simply publish a set of numbers and declare the numbers to be currency. Put another way, make the U.S. dollar smaller and smaller to decrease mass until asymptotically all that is left is the serial number. With regard to abundance, like the U.S. dollar, the Government is free to determine the number of notes available. With regard to production, as with the U.S. dollar, the Government can print more by simply publishing new numbers to be added to the set (or destroy some by declaring that some numbers have been removed from the set). With regard to counterfeiting, the U.S. dollar has some advantages. The mass of the U.S. dollar turns counterfeiting from a digital problem into a physical one. This provides the Government with the ability to change the U.S. dollar physically to defeat technology that might arise to produce faithful (either de novo or duplication) counterfeits.

Is there a massless currency similar to gold? I think yes. I think this is what Bitcoin-like currencies are all about. They are ‘deep-currencies’, sets of deep strings. With regard to abundance, they are superior to gold. The total number of deep strings in the set can be chosen at initiation by the creator of the currency. The abundance of gold on the other hand has already been set by nature (and, at least as currently used, gold is not sufficiently abundant to lubricate the world economy). With regard to production, as with gold, making notes requires time/energy. With regard to counterfeiting, gold has an advantage. Counterfeit gold is an absurdity, since all gold (by the ounce, not numismatically), no matter how it arises, is the same atomically and is perceived to have the same value. On the other hand, as stated above, massless currencies are vulnerable to duplication-counterfeiting. Interestingly, deep-currencies may be resistant to de novo-counterfeiting, since the creator of the deep-currency is free to choose the depth of the notes, and consequently the cost of producing new notes.

The value of a massless currency in our information based world is clear. Deep-currencies such as Bitcoin offer an attractive approach. But there is an interesting issue that may soon arise. The issue stems from the fact that the production of each note in currencies like Bitcoin requires a large investment of time/energy, and as with gold, this deprives governments of the prerogative to print money. Creditors may like this, but governments will not. What will governments do? Perhaps they will want to create a ‘Dual-currency’. A Dual-currency should be massless. It should come with a secret key. If you do not possess the secret key, then, like gold, it should be very costly to produce a new note, but if you possess the secret key, then, like the U.S. dollar, it should be inexpensive to produce a new note. Is a Dual-currency possible? Here is an example of an approach I call aurum:

• Generate a pair of RSA keys: a public key $\langle E,N\rangle$, and a secret key $\langle D,N\rangle$. Publish the public key $\langle E,N\rangle$.
• Declare that notes are exactly those integers $G$ such that $2\leq G\leq N-1$ and (the least positive residue of) $G^E\mbox{Mod}(N)$ is less than or equal to $\frac{N}{10^9}$.

So, choosing a random integer $V$ such that $2\leq V\leq N-1$, and then computing $V^E\mbox{Mod}(N)$ has about one chance in a billion of producing a note. Hence the expected number of modular exponentiations needed to produce a note is about one billion. On the other hand, those who possess the secret key can choose an integer $W$ such that $2\leq W\leq \frac{N}{10^9}$ and calculate $W^D\mbox{Mod}(N)$ to produce a note after just one modular exponentiation. There are many bells and whistles that can be accommodated with aurum, but here is what is ‘really’ going on. The depth of a string $\sigma$ is obtained by considering programs running with the empty string as input, but we can consider a more general concept: ‘relative depth’. Given a pair of strings $\sigma$ and $\tau$, the depth of $\sigma$ relative to $\tau$ is obtained by considering programs running with $\tau$ as the input. Hence depth as we have been discussing it is the same as depth relative to the empty string. In the example above, we have made ‘dual-strings’: strings that are deep relative to the public key, but shallow relative to the secret key.
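Here is a toy Python sketch of aurum. The tiny primes and the 1-in-16 threshold (instead of $N/10^9$) are assumptions of mine, chosen purely so the demo runs instantly; a real system would of course need cryptographic key sizes.

```python
import random

# Toy sketch of 'aurum'. Assumptions: tiny primes and a 1-in-16 acceptance
# threshold so the demo runs instantly; real use needs cryptographic sizes
# and a threshold like N // 10**9.
p, q = 1000003, 1000033
N = p * q
phi = (p - 1) * (q - 1)
E = 65537                        # public exponent
D = pow(E, -1, phi)              # secret exponent (Python 3.8+ modular inverse)
THRESHOLD = N // 16              # acceptance threshold for notes

def is_note(G: int) -> bool:
    """A note is an integer G whose E-th power mod N falls below the threshold."""
    return 2 <= G <= N - 1 and pow(G, E, N) <= THRESHOLD

def mint_without_key(rng: random.Random):
    """Public minting: random search, about 16 modular exponentiations expected."""
    trials = 0
    while True:
        trials += 1
        V = rng.randrange(2, N)
        if pow(V, E, N) <= THRESHOLD:
            return V, trials

def mint_with_key(rng: random.Random) -> int:
    """Secret-key minting: a single modular exponentiation."""
    W = rng.randrange(2, THRESHOLD + 1)
    return pow(W, D, N)

rng = random.Random(0)
note, trials = mint_without_key(rng)
cheap_note = mint_with_key(rng)
assert is_note(note) and is_note(cheap_note)
```

Anyone can verify a note with one public exponentiation, but without the secret key the expected cost of minting is higher by the chosen threshold ratio.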

One of the interesting phenomena of theoretical computer science is that you can sometimes turn bad news into good news. If factoring is hard, then we are deprived of the ability to do something we (at least number theorists) would like to do. Bad news. But surprisingly, we acquire public-key cryptography and the ability to preserve our privacy. Good news. Similarly, strings that are very hard to produce seem useless, but Bitcoins have revealed that such strings can provide a new and useful form of currency. Now that we are aware that deep strings can have value, I expect that clever people will find many new uses for them.

After thinking about deep strings for many years, I see them everywhere – I invite you to do the same. I will finish with one of my favorite observations. The watchmaker analogy is well known and is frequently used in arguing the existence of a creator. If you stumble upon a watch, you recognize from its complexity that it did not form by chance, and you conclude that there must have been a watchmaker. A human is more complex than a watch, so there must have been a ‘human-maker’ – a creator. An alternative view is that the complexity you recognize is actually ‘depth’ and the conclusion you should reach is that there must have been a computational process sustained through a great many steps. In the case of humans, the process is 3.6 billion years of evolution and the depth can be read in the genome. The watch is deep as well, but much of its depth is acquired from the human genome relative to which it is not nearly so deep.

*It has been brought to my attention that others have considered similar concepts including Charles Bennett, Murray Gell-Mann, and Luis Antunes, Lance Fortnow, Dieter Van Melkebeek, and N. Variyam Vinodchandran. Indeed, the latter group has even used the term ‘deep string’, hence it is to them we owe the name.

**For all $\sigma\in\{0,1\}^{*}$:

• $T(\sigma)$ denotes the set of Turing machine programs that with the empty string as input, output $\sigma$ and halt.
• $E(\sigma)$ denotes $\min\{\max\{|P|,\log_{2}(\vec{P})\} : P\in T(\sigma)\},$ where $|P|$ denotes the length of $P$, and $\vec{P}$ denotes the number of steps used by $P$ with the empty string as input.

It may be convenient for the reader to assume that $D(\sigma)$ is approximately $\frac{E(\sigma)}{K(\sigma)}$; however, the proper theory of deep strings extends the notion of depth to sets of strings and accommodates the use of randomness in computation. Depth in the context of quantum computation may also be of interest.

## Relative Entropy in Evolutionary Dynamics

22 January, 2014

guest post by Marc Harper

In John’s information geometry series, he mentioned some of my work in evolutionary dynamics. Today I’m going to tell you about some exciting extensions!

### The replicator equation

First a little refresher. For a population of $n$ replicating types, such as individuals with different eye colors or a gene with $n$ distinct alleles, the ‘replicator equation’ expresses the main idea of natural selection: the relative rate of growth of each type should be proportional to the difference between the fitness of the type and the mean fitness in the population.

To see why this equation should be true, let $P_i$ be the population of individuals of the $i$th type, which we allow to be any nonnegative real number. We can list all these numbers and get a vector:

$P = (P_1, \dots, P_n)$

The Lotka–Volterra equation is a very general rule for how these numbers can change with time:

$\displaystyle{ \frac{d P_i}{d t} = f_i(P) P_i }$

Each population grows at a rate proportional to itself, where the ‘constant of proportionality’, $f_i(P),$ is not necessarily constant: it can be any real-valued function of $P.$ This function is called the fitness of the $i$th type. Taken all together, these functions $f_i$ are called the fitness landscape.

Let $p_i$ be the fraction of individuals who are of the $i$th type:

$\displaystyle{ p_i = \frac{P_i}{\sum_{j =1}^n P_j } }$

These numbers $p_i$ are between 0 and 1, and they add up to 1. So, we can also think of them as probabilities: $p_i$ is the probability that a randomly chosen individual is of the $i$th type. This is how probability theory, and eventually entropy, gets into the game.

Again, we can bundle these numbers into a vector:

$p = (p_1, \dots, p_n)$

which we call the population distribution. It turns out that the Lotka–Volterra equation implies the replicator equation:

$\displaystyle{ \frac{d p_i}{d t} = \left( f_i(P) - \langle f(P) \rangle \right) \, p_i }$

where

$\displaystyle{ \langle f(P) \rangle = \sum_{i =1}^n f_i(P) p_i }$

is the mean fitness of all the individuals. You can see the proof in Part 9 of the information geometry series.

By the way: if each fitness $f_i(P)$ only depends on the fraction of individuals of each type, not the total numbers, we can write the replicator equation in a simpler way:

$\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }$

Anyway, the take-home message is this: the replicator equation says the fraction of individuals of any type changes at a rate proportional to fitness of that type minus the mean fitness.
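Here is a minimal numerical sketch of this in Python, assuming a constant fitness landscape and simple Euler integration (both assumptions made just for illustration):

```python
import numpy as np

# Euler integration of the replicator equation
#   dp_i/dt = (f_i(p) - <f(p)>) p_i
# with a constant fitness landscape (an assumption made for illustration).
f = np.array([1.0, 2.0, 3.0])          # type 3 is the fittest
p = np.array([0.4, 0.3, 0.3])          # initial population distribution
dt = 0.01

for _ in range(5000):
    p = p + dt * (f - f @ p) * p       # one Euler step

# The fittest type takes over, and p remains a probability distribution:
assert abs(p.sum() - 1) < 1e-9
assert p[2] > 0.99
```

Note that the update preserves $\sum_i p_i = 1$ automatically, since the changes $(f_i - \langle f\rangle)p_i$ sum to zero.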

Now, it has been known since the late 1970s or early 1980s, thanks to the work of Akin, Bomze, Hofbauer, Shahshahani, and others, that the replicator equation has some very interesting properties. For one thing, it often makes ‘relative entropy’ decrease. For another, it’s often an example of ‘gradient flow’. Let’s look at both of these in turn, and then talk about some new generalizations of these facts.

### Relative entropy as a Lyapunov function

I mentioned that we can think of a population distribution as a probability distribution. This lets us take ideas from probability theory and even information theory and apply them to evolutionary dynamics! For example, given two population distributions $p$ and $q,$ the information of $q$ relative to $p$ is

$I(q,p) = \displaystyle{ \sum_i q_i \ln \left(\frac{q_i}{p_i }\right)}$

This measures how much information you gain if you have a hypothesis about some state of affairs given by the probability distribution $p,$ and then someone tells you “no, the best hypothesis is $q$!”

It may seem weird to treat a population distribution as a hypothesis, but this turns out to be a good idea. Evolution can then be seen as a learning process: a process of improving the hypothesis.

We can make this precise by seeing how the relative information changes with the passage of time. Suppose we have two population distributions $q$ and $p.$ Suppose $q$ is fixed, while $p$ evolves in time according to the replicator equation. Then

$\displaystyle{ \frac{d}{d t} I(q,p) = \sum_i f_i(P) (p_i - q_i) }$

For the proof, see Part 11 of the information geometry series.

So, the information of $q$ relative to $p$ will decrease as $p$ evolves according to the replicator equation if

$\displaystyle{ \sum_i f_i(P) (p_i - q_i) } \le 0$

If $q$ makes this true for all $p,$ we say $q$ is an evolutionarily stable state. For some reasons why, see Part 13.

What matters now is that when $q$ is an evolutionarily stable state, $I(q,p)$ says how much information the population has ‘left to learn’—and we’re seeing that this always decreases. Moreover, it turns out that we always have

$I(q,p) \ge 0$

and $I(q,p) = 0$ precisely when $p = q.$
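A quick numerical check illustrates all this. Assuming a constant fitness landscape for which the evolutionarily stable state is the corner $q = (0,0,1)$, the relative information $I(q,p)$ stays nonnegative and decreases monotonically along the replicator flow:

```python
import numpy as np

# With constant fitness f = (1,2,3), the stable state is the corner
# q = (0,0,1), and I(q,p) reduces to -ln(p_3). It should decrease
# monotonically as p evolves under the replicator equation.
def relative_information(q, p):
    mask = q > 0                       # terms with q_i = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

f = np.array([1.0, 2.0, 3.0])
q = np.array([0.0, 0.0, 1.0])          # evolutionarily stable state
p = np.array([0.4, 0.3, 0.3])
dt = 0.01

values = []
for _ in range(2000):
    values.append(relative_information(q, p))
    p = p + dt * (f - f @ p) * p       # Euler step of the replicator equation

assert all(v >= 0 for v in values)                       # I(q,p) >= 0
assert all(a >= b for a, b in zip(values, values[1:]))   # monotone decrease
```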

People summarize all this by saying that relative information is a ‘Lyapunov function’. Very roughly, a Lyapunov function is something that decreases with the passage of time, and is zero only at the unique stable state. To be a bit more precise, suppose we have a differential equation like

$\displaystyle{ \frac{d}{d t} x(t) = v(x(t)) }$

where $x(t) \in \mathbb{R}^n$ and $v$ is some smooth vector field on $\mathbb{R}^n.$ Then a smooth function

$V : \mathbb{R}^n \to \mathbb{R}$

is a Lyapunov function if

$V(x) \ge 0$ for all $x$

$V(x) = 0$ iff $x$ is some particular point $x_0$

and

$\displaystyle{ \frac{d}{d t} V(x(t)) \le 0 }$ for every solution of our differential equation.

In this situation, the point $x_0$ is a stable equilibrium for our differential equation: this is Lyapunov’s theorem.

### The replicator equation as a gradient flow equation

The basic idea of Lyapunov’s theorem is that when a ball likes to roll downhill and the landscape has just one bottom point, that point will be the unique stable equilibrium for the ball.

The idea of gradient flow is similar, but different: sometimes things like to roll downhill as efficiently as possible: they move in exactly the best direction to make some quantity smaller! Under certain conditions, the replicator equation is an example of this phenomenon.

Let’s fill in some details. For starters, suppose we have some function

$V : \mathbb{R}^n \to \mathbb{R}$

Think of $V$ as ‘height’. Then the gradient flow equation says how a point $x(t) \in \mathbb{R}^n$ will move if it’s always trying its very best to go downhill:

$\displaystyle{ \frac{d}{d t} x(t) = - \nabla V(x(t)) }$

Here $\nabla$ is the usual gradient in Euclidean space:

$\displaystyle{ \nabla V = \left(\partial_1 V, \dots, \partial_n V \right) }$

where $\partial_i$ is short for the partial derivative with respect to the $i$th coordinate.

The interesting thing is that under certain conditions, the replicator equation is an example of a gradient flow equation… but typically not one where $\nabla$ is the usual gradient in Euclidean space. Instead, it’s the gradient on some other space, the space of all population distributions, which has a non-Euclidean geometry!

The space of all population distributions is a simplex:

$\{ p \in \mathbb{R}^n : \; p_i \ge 0, \; \sum_{i = 1}^n p_i = 1 \} .$

For example, it’s an equilateral triangle when $n = 3.$ The equilateral triangle looks flat, but if we measure distances another way it becomes round, exactly like a portion of a sphere, and that’s the non-Euclidean geometry we need!

In fact this trick works in any dimension. The idea is to give the simplex a special Riemannian metric, the ‘Fisher information metric’. The usual metric on Euclidean space is

$\delta_{i j} = \left\{\begin{array}{ccl} 1 & \mathrm{ if } & i = j \\ 0 &\mathrm{ if } & i \ne j \end{array} \right.$

This simply says that two standard basis vectors like $(0,1,0,0)$ and $(0,0,1,0)$ have dot product zero if the 1s are in different places, and one if they’re in the same place. The Fisher information metric is a bit more complicated:

$\displaystyle{ g_{i j} = \frac{\delta_{i j}}{p_i} }$

As before, $g_{i j}$ is a formula for the dot product of the $i$th and $j$th standard basis vectors, but now it depends on where you are in the simplex of population distributions.

We saw how this formula arises from information theory back in Part 7. I won’t repeat the calculation, but the idea is this. Fix a population distribution $p$ and consider the information of another one, say $q,$ relative to this. We get $I(q,p).$ If $q = p$ this is zero:

$\displaystyle{ \left. I(q,p)\right|_{q = p} = 0 }$

and this point is a local minimum for the relative information. So, the first derivative of $I(q,p)$ as we change $q$ must be zero:

$\displaystyle{ \left. \frac{\partial}{\partial q_i} I(q,p) \right|_{q = p} = 0 }$

But the second derivatives are not zero. In fact, since we’re at a local minimum, it should not be surprising that we get a positive definite matrix of second derivatives:

$\displaystyle{ g_{i j} = \left. \frac{\partial^2}{\partial q_i \partial q_j} I(q,p) \right|_{q = p} }$

And, this is the Fisher information metric! So, the Fisher information metric is a way of taking dot products between vectors in the simplex of population distributions that's based on the concept of relative information.
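If you like to check such things numerically, here's a quick sanity check (my own illustrative script, with a made-up point $p$): the matrix of second derivatives of $I(q,p)$ at $q = p,$ computed by finite differences and treating the $q_i$ as unconstrained coordinates as in the formula above, matches $g_{ij} = \delta_{ij}/p_i$:

```python
import math

def relative_info(q, p):
    """Relative information I(q,p) = sum_i q_i ln(q_i/p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

p = [0.5, 0.3, 0.2]   # a made-up point in the interior of the simplex
eps = 1e-5
n = len(p)

def perturbed(i, di, j, dj):
    q = list(p)
    q[i] += di
    q[j] += dj
    return q

# Hessian of I(q,p) in q at q = p, by central finite differences.
H = [[(relative_info(perturbed(i, eps, j, eps), p)
       - relative_info(perturbed(i, eps, j, -eps), p)
       - relative_info(perturbed(i, -eps, j, eps), p)
       + relative_info(perturbed(i, -eps, j, -eps), p)) / (4 * eps**2)
      for j in range(n)] for i in range(n)]

# Compare with the Fisher information metric g_ij = delta_ij / p_i.
for i in range(n):
    for j in range(n):
        expected = 1 / p[i] if i == j else 0.0
        assert abs(H[i][j] - expected) < 1e-3
```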

This is not the place to explain Riemannian geometry, but any metric gives a way to measure angles and distances, and thus a way to define the gradient of a function. After all, the gradient of a function should point at right angles to the level sets of that function, and its length should equal the slope of that function:

So, if we change our way of measuring angles and distances, we get a new concept of gradient! The $i$th component of this new gradient vector field turns out to be

$(\nabla_g V)^i = g^{i j} \partial_j V$

where $g^{i j}$ is the inverse of the matrix $g_{i j},$ and we sum over the repeated index $j.$ As a sanity check, make sure you see why this is the usual Euclidean gradient when $g_{i j} = \delta_{i j}.$
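To make the formula concrete, here's a minimal sketch (the point $p$ and gradient values are made up): since $g_{ij} = \delta_{ij}/p_i,$ the inverse metric is $g^{ij} = \delta_{ij}\, p_i,$ so the $i$th component of the Fisher-metric gradient is just $p_i \,\partial_i V$:

```python
def fisher_grad(grad_V, p):
    # (grad_g V)^i = g^{ij} dV/dp_j = p_i * dV/dp_i, since g^{ij} = delta_ij * p_i
    return [pi * gi for pi, gi in zip(p, grad_V)]

def euclidean_grad(grad_V, p):
    # sanity check: with g_ij = delta_ij the 'new' gradient is the old one
    return list(grad_V)

p = [0.5, 0.3, 0.2]            # a made-up point on the simplex
grad_V = [1.0, 2.0, 3.0]       # made-up Euclidean gradient of some V at p

assert euclidean_grad(grad_V, p) == grad_V
assert all(abs(a - b) < 1e-12
           for a, b in zip(fisher_grad(grad_V, p), [0.5, 0.6, 0.6]))
```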

Now suppose the fitness landscape is the good old Euclidean gradient of some function. Then it turns out that the replicator equation is a special case of gradient flow on the space of population distributions… but where we use the Fisher information metric to define our concept of gradient!

To get a feel for this, it’s good to start with the Lotka–Volterra equation, which describes how the total number of individuals of each type changes. Suppose the fitness landscape is the Euclidean gradient of some function $V$:

$\displaystyle{ f_i(P) = \frac{\partial V}{\partial P_i} }$

Then the Lotka–Volterra equation becomes this:

$\displaystyle{ \frac{d P_i}{d t} = \frac{\partial V}{\partial P_i} \, P_i }$

This doesn’t look like the gradient flow equation, thanks to that annoying $P_i$ on the right-hand side! It certainly ain’t the gradient flow coming from the function $V$ and the usual Euclidean gradient. However, it is gradient flow coming from $V$ and some other metric on the space

$\{ P \in \mathbb{R}^n : \; P_i \ge 0 \}$

For a proof, and the formula for this other metric, see Section 3.7 in this survey:

• Marc Harper, Information geometry and evolutionary game theory.

Now let’s turn to the replicator equation:

$\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }$

Again, if the fitness landscape is a Euclidean gradient, we can rewrite the replicator equation as a gradient flow equation… but again, not with respect to the Euclidean metric. This time we need to use the Fisher information metric! I sketch a proof in my paper above.
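Here's a small numerical sketch of the Lyapunov behavior (my own toy example, not from the papers cited): take the fitness landscape $f_i(p) = -p_i,$ the Euclidean gradient of $V(p) = -\tfrac{1}{2}\sum_i p_i^2,$ whose interior equilibrium is the uniform distribution. Simulating the replicator equation with forward Euler steps, the relative entropy of the equilibrium relative to the current state decreases along the flow:

```python
import math

def replicator_step(p, f, dt):
    """One forward Euler step of dp_i/dt = p_i (f_i(p) - <f(p)>)."""
    fit = f(p)
    avg = sum(pi * fi for pi, fi in zip(p, fit))
    return [pi + dt * pi * (fi - avg) for pi, fi in zip(p, fit)]

def relative_info(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# Made-up fitness landscape f_i(p) = -p_i, the Euclidean gradient of
# V(p) = -1/2 sum_i p_i^2; its interior equilibrium is the uniform distribution.
f = lambda p: [-pi for pi in p]
q = [1/3, 1/3, 1/3]                  # the evolutionarily stable state

p = [0.7, 0.2, 0.1]
entropies = []
for _ in range(2000):
    entropies.append(relative_info(q, p))
    p = replicator_step(p, f, dt=0.01)

# Relative entropy acts as a Lyapunov function: it decreases along the flow.
assert all(a >= b for a, b in zip(entropies, entropies[1:]))
assert relative_info(q, p) < 1e-3
```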

In fact, both these results were first worked out by Shahshahani:

• Siavash Shahshahani, A New Mathematical Framework for the Study of Linkage and Selection, Memoirs of the AMS 17, 1979.

### New directions

All this is just the beginning! The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

You can see some very nice versions of this story for maximum likelihood estimators and linear programming here:

• Akio Fujiwara and Shun-ichi Amari, Gradient systems in view of information geometry, Physica D: Nonlinear Phenomena 80 (1995), 317–327.

Indeed, this seems to be the first paper discussing the similarities between evolutionary game theory and information geometry.

Dash Fryer (at Pomona College) and I have generalized this story in several interesting ways.

First, there are two famous ways to generalize the usual formula for entropy: Tsallis entropy and Rényi entropy, both of which involve a parameter $q.$ There are Tsallis and Rényi versions of relative entropy and the Fisher information metric as well. Everything I just explained about:

• conditions under which relative entropy is a Lyapunov function for the replicator equation, and

• conditions under which the replicator equation is a special case of gradient flow

generalizes to these cases! However, these generalized entropies give modified versions of the replicator equation. When we set $q=1$ we get back the usual story. See

• Marc Harper, Escort evolutionary game theory.

My initial interest in these alternate entropies was mostly mathematical—what is so special about the corresponding geometries?—but now researchers are starting to find populations that evolve according to these kinds of modified population dynamics! For example:

• A. Hernando et al, The workings of the Maximum Entropy Principle in collective human behavior.

There’s an interesting special case worth some attention. Lots of people fret about the relative entropy not being a distance function obeying the axioms that mathematicians like: for example, it doesn’t obey the triangle inequality. Many describe it as merely a ‘distance-like’ function, which in most contexts is a fair description. On the other hand, the $q=0$ relative entropy is just one-half the Euclidean distance squared! In this case the modified version of the replicator equation looks like this:

$\displaystyle{ \frac{d p_i}{d t} = f_i(p) - \frac{1}{n} \sum_{j = 1}^n f_j(p) }$

This equation is called the projection dynamic.

Later, I showed that there is a reasonable definition of relative entropy for a much larger family of geometries that satisfies a similar distance minimization property.

In a different direction, Dash showed that you can change the way that selection acts by using a variety of alternative ‘incentives’, extending the story to some other well-known equations describing evolutionary dynamics. By replacing the terms $x_i f_i(x)$ in the replicator equation with a variety of other functions, called incentives, we can generate many commonly studied models of evolutionary dynamics. For instance if we exponentiate the fitness landscape (to make it always positive), we get what is commonly known as the logit dynamic. This amounts to changing the fitness landscape as follows:

$\displaystyle{ f_i \mapsto \frac{x_i e^{\beta f_i}}{\sum_j{x_j e^{\beta f_j}}} }$

where $\beta$ is known as an inverse temperature in statistical thermodynamics and as an intensity of selection in evolutionary dynamics. There are lots of modified versions of the replicator equation, like the best-reply and projection dynamics, more common in economic applications of evolutionary game theory, and they can all be captured in this family. (There are also other ways to simultaneously capture such families, such as Bill Sandholm’s revision protocols, which were introduced earlier in his exploration of the foundations of game dynamics.)
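Here's a small sketch of the logit incentive formula displayed above (the state $x,$ fitness values and function names are all made up for illustration). As $\beta \to 0$ selection switches off and the incentive tends to the population share $x_i$ itself; as $\beta$ grows, it concentrates on the fittest type:

```python
import math

def logit_incentive(x, f, beta):
    """The incentive x_i e^(beta f_i) / sum_j x_j e^(beta f_j)."""
    fx = f(x)
    weights = [xi * math.exp(beta * fi) for xi, fi in zip(x, fx)]
    total = sum(weights)
    return [w / total for w in weights]

x = [0.5, 0.3, 0.2]                 # made-up population state
f = lambda x: [1.0, 2.0, 3.0]       # made-up fitness landscape

# Nearly zero inverse temperature: every incentive tends to x_i itself.
weak = logit_incentive(x, f, beta=1e-9)
assert all(abs(wi - xi) < 1e-6 for wi, xi in zip(weak, x))

# High intensity of selection: the incentive concentrates on the fittest type.
strong = logit_incentive(x, f, beta=50.0)
assert strong[2] > 0.999
```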

Dash showed that there is a natural generalization of evolutionarily stable states to ‘incentive stable states’, and that for incentive stable states, the relative entropy is decreasing to zero when the trajectories get near the equilibrium. For the logit and projection dynamics, incentive stable states are simply evolutionarily stable states, and this happens frequently, but not always.

The third generalization is to look at different ‘time-scales’—that is, different ways of describing time! We can make up the symbol $\mathbb{T}$ for a general choice of ‘time-scale’. So far I’ve been treating time as a real number, so

$\mathbb{T} = \mathbb{R}$

But we can also treat time as coming in discrete evenly spaced steps, which amounts to treating time as an integer:

$\mathbb{T} = \mathbb{Z}$

More generally, we can make the steps have duration $h,$ where $h$ is any positive real number:

$\mathbb{T} = h\mathbb{Z}$

There is a nice way to simultaneously describe the cases $\mathbb{T} = \mathbb{R}$ and $\mathbb{T} = h\mathbb{Z}$ using the time-scale calculus and time-scale derivatives. For the time-scale $\mathbb{T} = \mathbb{R}$ the time-scale derivative is just the ordinary derivative. For the time-scale $\mathbb{T} = h\mathbb{Z},$ the time-scale derivative is given by the difference quotient from first year calculus:

$\displaystyle{ f^{\Delta}(z) = \frac{f(z+h) - f(z)}{h} }$

and using this as a substitute for the derivative gives difference equations like a discrete-time version of the replicator equation. There are many other choices of time-scale, such as the quantum time-scale given by $\mathbb{T} = q^{\mathbb{Z}},$ in which case the time-scale derivative is called the q-derivative, but that’s a tale for another time. In any case, the fact that the successive relative entropies are decreasing can be stated simply by saying they have negative $\mathbb{T} = h\mathbb{Z}$ time-scale derivative. The continuous case we started with corresponds to $\mathbb{T} = \mathbb{R}.$
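The time-scale derivative on $\mathbb{T} = h\mathbb{Z}$ is easy to play with (a tiny illustrative sketch, with a made-up test function):

```python
def timescale_derivative(f, z, h):
    """Time-scale derivative on T = hZ: the difference quotient
    (f(z+h) - f(z)) / h. As h -> 0 it recovers the ordinary derivative."""
    return (f(z + h) - f(z)) / h

f = lambda t: t**2

# On T = hZ with h = 0.5, the time-scale derivative of t^2 at t = 1
# is ((1.5)^2 - 1^2)/0.5 = 2.5, versus the ordinary derivative 2.
assert abs(timescale_derivative(f, 1.0, 0.5) - 2.5) < 1e-12
assert abs(timescale_derivative(f, 1.0, 1e-8) - 2.0) < 1e-6
```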

Remarkably, Dash and I were able to show that you can combine all three of these generalizations into one theorem, and even allow for multiple interacting populations! This produces some really neat population trajectories, such as the following two populations with three types, with fitness functions corresponding to the rock-paper-scissors game. On top we have the replicator equation, which goes along with the Fisher information metric; on the bottom we have the logit dynamic, which goes along with the Euclidean metric on the simplex:

From our theorem, it follows that the relative entropy (ordinary relative entropy on top, the $q = 0$ entropy on bottom) converges to zero along the population trajectories.

The final form of the theorem is loosely as follows. Pick a Riemannian geometry given by a metric $g$ (obeying some mild conditions) and an incentive for each population, as well as a time scale ($\mathbb{R}$ or $h \mathbb{Z}$) for every population. This gives an evolutionary dynamic with a natural generalization of evolutionarily stable states, and a suitable version of the relative entropy. Then, if there is an evolutionarily stable state in the interior of the simplex, the sum of the relative entropies for the populations has negative time-scale derivative, so it decreases as the trajectories converge to the stable state!

When there isn’t such a stable state, we still get some interesting population dynamics, like the following:

See this paper for details:

• Marc Harper and Dashiell E. A. Fryer, Stability of evolutionary dynamics on time scales.

Next time we’ll see how to make the main idea work in finite populations, without derivatives or deterministic trajectories!

## Relative Entropy (Part 3)

25 December, 2013

Holidays are great. There’s nothing I need to do! Everybody is celebrating! So, I can finally get some real work done.

In the last couple of days I’ve finished a paper with Jamie Vicary on wormholes and entanglement… subject to his approval and corrections. More on that later. And now I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper:

Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

Relative Entropy (Part 2): a category related to statistical inference, $\mathrm{FinStat},$ and how relative entropy defines a functor on this category.

But then Tobias Fritz noticed a big problem. Our characterization of relative entropy was inspired by this paper:

• D. Petz, Characterization of the relative entropy of states of matrix algebras, Acta Math. Hungar. 59 (1992), 449–455.

Here Petz sought to characterize relative entropy both in the ‘classical’ case we are concerned with and in the more general ‘quantum’ setting. Our original goal was merely to express his results in a more category-theoretic framework! Unfortunately Petz’s proof contained a significant flaw. Tobias noticed this and spent a lot of time fixing it, with no help from me.

Our paper is now self-contained, and considerably longer. My job now is to polish it up and make it pretty. What follows is the introduction, which should explain the basic ideas.

### A Bayesian characterization of relative entropy

This paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or ‘Kullback-Leibler divergence’. Whenever we have two probability distributions $p$ and $q$ on the same finite set $X,$ we can define the entropy of $q$ relative to $p$:

$S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right)$

Here we set

$q_x \ln\left( \frac{q_x}{p_x} \right)$

equal to $\infty$ when $p_x = 0,$ unless $q_x$ is also zero, in which case we set it equal to 0. Relative entropy thus takes values in $[0,\infty].$
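The definition, with these conventions about zero probabilities, translates directly into code (a straightforward sketch of the formula above):

```python
import math

def relative_entropy(q, p):
    """S(q,p) = sum_x q_x ln(q_x/p_x), with the conventions in the text:
    a term is 0 when q_x = 0, and infinity when p_x = 0 but q_x > 0."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue                   # q_x ln(q_x/p_x) := 0 when q_x = 0
        if px == 0:
            return math.inf            # q_x ln(q_x/0) := infinity
        total += qx * math.log(qx / px)
    return total

assert relative_entropy([0.5, 0.5], [0.5, 0.5]) == 0.0
assert relative_entropy([1.0, 0.0], [0.5, 0.5]) == math.log(2)
assert relative_entropy([0.5, 0.5], [1.0, 0.0]) == math.inf
```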

Intuitively speaking, $S(q,p)$ is the expected amount of information gained when we discover the probability distribution is really $q,$ when we had thought it was $p.$ We should think of $p$ as a ‘prior’ and $q$ as a ‘posterior’. When we take $p$ to be the uniform distribution on $X,$ relative entropy reduces to the ordinary Shannon entropy, up to an additive constant. The advantage of relative entropy is that it makes the role of the prior explicit.

Since Bayesian probability theory emphasizes the role of the prior, relative entropy naturally lends itself to a Bayesian interpretation: it measures how much information we gain given a certain prior. Our goal here is to make this precise in a mathematical characterization of relative entropy. We do this using a category $\mathrm{FinStat}$ where:

• an object $(X,q)$ consists of a finite set $X$ and a probability distribution $x \mapsto q_x$ on that set;

• a morphism $(f,s) : (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y,$ together with a probability distribution $x \mapsto s_{x y}$ on $X$ for each element $y \in Y,$ with the property that $s_{xy} = 0$ unless $f(x) = y.$

We can think of an object of $\mathrm{FinStat}$ as a system with some finite set of states together with a probability distribution on its states. A morphism

$(f,s) : (X,q) \to (Y,r)$

then consists of two parts. First, there is a deterministic measurement process $f : X \to Y$ mapping states of the system being measured, $X,$ to states of the measurement apparatus, $Y.$ The condition that $f$ be measure-preserving says that, after the measurement, the probability that the apparatus is in any state $y \in Y$ is the sum of the probabilities of all states of $X$ leading to that outcome:

$\displaystyle{ r_y = \sum_{x \in f^{-1}(y)} q_x }$

Second, there is a hypothesis $s$: an assumption about the probability $s_{xy}$ that the system being measured is in the state $x \in X$ given any measurement outcome $y \in Y.$

Suppose we have any morphism

$(f,s) : (X,q) \to (Y,r)$

in $\mathrm{FinStat}.$ From this we obtain two probability distributions on the states of the system being measured. First, we have the probability distribution $p$ given by

$\displaystyle{ p_x = \sum_{y \in Y} s_{xy} r_y } \qquad \qquad (1)$

This is our prior, given our hypothesis and the probability distribution of measurement results. Second we have the ‘true’ probability distribution $q,$ which would be the posterior if we updated our prior using complete direct knowledge of the system being measured.

It follows that any morphism in $\mathrm{FinStat}$ has a relative entropy $S(q,p)$ associated to it. This is the expected amount of information we gain when we update our prior $p$ to the posterior $q.$
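Here's a toy example of this computation (the sets, hypothesis and distributions are all made up for illustration): we build a morphism in $\mathrm{FinStat},$ form the prior via equation (1), and compute its relative entropy:

```python
import math

# A toy morphism (f,s): (X,q) -> (Y,r) in FinStat.
# X = {0,1,2}, Y = {0,1}; f sends 0,1 |-> 0 and 2 |-> 1.
q = [0.3, 0.3, 0.4]                  # 'true' distribution on X

# r is forced: r_y = sum of q_x over the preimage of y (f is measure-preserving).
r = [q[0] + q[1], q[2]]

# The hypothesis s: s[x][y], with s[x][y] = 0 unless f(x) = y.
s = [[0.6, 0.0],
     [0.4, 0.0],
     [0.0, 1.0]]

# Equation (1): the prior p_x = sum_y s_xy r_y.
p = [sum(s[x][y] * r[y] for y in range(2)) for x in range(3)]

def relative_entropy(q, p):
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# F_0(f,s) = S(q,p): the expected information gained updating prior to truth.
assert abs(relative_entropy(q, p) - 0.0122466) < 1e-6
```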

In fact, this way of assigning relative entropies to morphisms defines a functor

$F_0 : \mathrm{FinStat} \to [0,\infty]$

where we use $[0,\infty]$ to denote the category with one object, the numbers $0 \le x \le \infty$ as morphisms, and addition as composition. More precisely, if

$(f,s) : (X,q) \to (Y,r)$

is any morphism in $\mathrm{FinStat},$ we define

$F_0(f,s) = S(q,p)$

where the prior $p$ is defined as in equation (1).

The fact that $F_0$ is a functor is nontrivial and rather interesting. It says that given any composable pair of measurement processes:

$(X,q) \stackrel{(f,s)}{\longrightarrow} (Y,r) \stackrel{(g,t)}{\longrightarrow} (Z,u)$

the relative entropy of their composite is the sum of the relative entropies of the two parts:

$F_0((g,t) \circ (f,s)) = F_0(g,t) + F_0(f,s) .$

We prove that $F_0$ is a functor. However, we go much further: we characterize relative entropy by saying that up to a constant multiple, $F_0$ is the unique functor from $\mathrm{FinStat}$ to $[0,\infty]$ obeying three reasonable conditions.
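The functoriality equation is easy to test numerically on a toy composable pair (all the numbers below are made up; hypotheses compose as stochastic maps, $(s \circ t)_{xz} = \sum_y s_{xy} t_{yz}$):

```python
import math

def rel_ent(q, p):
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Toy composable pair (X,q) --(f,s)--> (Y,r) --(g,t)--> (Z,u).
q = [0.3, 0.3, 0.4]              # X = {0,1,2}; f sends 0,1 |-> 0 and 2 |-> 1
r = [0.6, 0.4]                   # Y = {0,1}; g sends both elements to 0
u = [1.0]                        # Z = {0}

s = [[0.6, 0.0], [0.4, 0.0], [0.0, 1.0]]   # hypothesis for (f,s)
t = [[0.5], [0.5]]                         # hypothesis for (g,t)

prior_s = [sum(s[x][y] * r[y] for y in range(2)) for x in range(3)]
prior_t = [sum(t[y][z] * u[z] for z in range(1)) for y in range(2)]

# Composite hypothesis (s.t)_xz = sum_y s_xy t_yz, and the composite prior.
st = [[sum(s[x][y] * t[y][z] for y in range(2)) for z in range(1)]
      for x in range(3)]
prior_st = [sum(st[x][z] * u[z] for z in range(1)) for x in range(3)]

# Functoriality: relative entropy of the composite is the sum of the parts.
lhs = rel_ent(q, prior_st)
rhs = rel_ent(q, prior_s) + rel_ent(r, prior_t)
assert abs(lhs - rhs) < 1e-12
```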

The first condition is that $F_0$ vanishes on morphisms $(f,s) : (X,q) \to (Y,r)$ where the hypothesis $s$ is optimal. By this, we mean that Equation (1) gives a prior $p$ equal to the ‘true’ probability distribution $q$ on the states of the system being measured.

The second condition is that $F_0$ is lower semicontinuous. The set $P(X)$ of probability distributions on a finite set $X$ naturally has the topology of an $(n-1)$-simplex when $X$ has $n$ elements. The set $[0,\infty]$ has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

$\begin{array}{rcl} S : P(X) \times P(X) &\to& [0,\infty] \\ (q,p) &\mapsto & S(q,p) . \end{array}$

The problem is that

$\displaystyle{ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) }$

and we define $q_x \ln(q_x/p_x)$ to be $\infty$ when $p_x = 0$ and $q_x > 0$ but $0$ when $p_x = q_x = 0.$ So, it turns out that $S$ is only lower semicontinuous, meaning that if $p^i , q^i$ are sequences of probability distributions on $X$ with $p^i \to p$ and $q^i \to q$ then

$S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i)$
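The inequality can be strict, which is why continuity fails. Here's a made-up example exhibiting the jump: take $p = (0,1)$ fixed and $q^i = (2^{-i},\, 1 - 2^{-i}) \to q = (0,1).$ The relative entropy is infinite all along the sequence, but drops to zero at the limit:

```python
import math

def rel_ent(q, p):
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue
        if px == 0:
            return math.inf
        total += qx * math.log(qx / px)
    return total

p = (0.0, 1.0)

# q^i -> q = (0,1), yet S(q^i, p) = infinity all along the sequence...
for i in range(1, 6):
    qi = (1 / 2**i, 1 - 1 / 2**i)
    assert rel_ent(qi, p) == math.inf

# ...while at the limit the relative entropy drops to 0.
assert rel_ent((0.0, 1.0), p) == 0.0
```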

We give the set of morphisms in $\mathrm{FinStat}$ its most obvious topology, and show that with this topology, $F_0$ maps morphisms to morphisms in a lower semicontinuous way.

The third condition is that $F_0$ is convex linear. We describe how to take convex linear combinations of morphisms in $\mathrm{FinStat},$ and then the functor $F_0$ is convex linear in the sense that it maps any convex linear combination of morphisms in $\mathrm{FinStat}$ to the corresponding convex linear combination of numbers in $[0,\infty].$ Intuitively, this means that if we take a coin with probability $P$ of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is $P$ times the expected information gain of the first process plus $1-P$ times the expected information gain of the second process.

Here, then, is our main theorem:

Theorem. Any lower semicontinuous, convex-linear functor

$F : \mathrm{FinStat} \to [0,\infty]$

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant $c \in [0,\infty]$ such that

$F(f,s) = c F_0(f,s)$

for any morphism $(f,s) : (X,p) \to (Y,q)$ in $\mathrm{FinStat}.$

### Remarks

If you’re a maniacally thorough reader of this blog, with a photographic memory, you’ll recall that our theorem now says ‘lower semicontinuous’, where in Part 2 of this series I’d originally said ‘continuous’.

I’ve fixed that blog article now… but it was Tobias who noticed this mistake. In the process of fixing our proof to address this issue, he eventually noticed that the proof of Petz’s theorem, which we’d been planning to use in our work, was also flawed.

Now I just need to finish polishing the rest of the paper!

## Quantropy (Part 4)

11 November, 2013

There’s a new paper on the arXiv:

• John Baez and Blake Pollard, Quantropy.

Blake is a physics grad student at U. C. Riverside who plans to do his thesis with me.

If you have carefully read all my previous posts on quantropy (Part 1, Part 2 and Part 3), there’s only a little new stuff here. But still, it’s better organized, and less chatty.

And in fact, Blake came up with a lot of new stuff for this paper! He studied the quantropy of the harmonic oscillator, and tweaked the analogy between statistical mechanics and quantum mechanics in an interesting way. Unfortunately, we needed to put a version of this paper on the arXiv by a deadline, and our writeup of this new work wasn’t quite ready (my fault). So, we’ll put that other stuff in a new version—or, I’m thinking now, a separate paper.

But here are two new things.

First, putting this paper on the arXiv had the usual good effect of revealing some existing work on the same topic. Joakim Munkhammar emailed me and pointed out this paper, which is free online:

• Joakim Munkhammar, Canonical relational quantum mechanics from information theory, Electronic Journal of Theoretical Physics 8 (2011), 93–108.

You’ll see it cites Garrett Lisi’s paper and pushes forward in various directions. There seems to be a typo where he writes the path integral

$Z = \displaystyle{ \int e^{-\alpha S(q) } D q}$

and says

In order to fit the purpose Lisi concluded that the Lagrange multiplier value $\alpha \equiv 1/i \hbar.$ In similarity with Lisi’s approach we shall also assume that the arbitrary scaling-part of the constant $\alpha$ is in fact $1/\hbar.$

I’m pretty sure he means $1/i\hbar,$ given what he writes later. However, he speaks of ‘maximizing entropy’, which is not quite right for a complex-valued quantity; Blake and I prefer to give this new quantity a new name, and speak of ‘finding a stationary point of quantropy’.

But in a way these are small issues; being a mathematician, I’m more quick to spot tiny technical defects than to absorb significant new ideas, and it will take a while to really understand Munkhammar’s paper.

Second, while writing our paper, Blake and I noticed another similarity between the partition function of a classical ideal gas and the partition function of a quantum free particle. Both are given by an integral like this:

$Z = \displaystyle{\int e^{-\alpha S(q) } D q }$

where $S$ is a quadratic function of $q \in \mathbb{R}^n.$ Here $n$ is the number of degrees of freedom for the particles in the ideal gas, or the number of time steps for a free particle on a line (where we are discretizing time). The only big difference is that

$\alpha = 1/kT$

for the ideal gas, but

$\alpha = 1/i \hbar$

for the free particle.

In both cases there’s an ambiguity in the answer! The reason is that to do this integral, we need to pick a measure $D q.$ The obvious guess is Lebesgue measure

$dq = dq_1 \cdots dq_n$

on $\mathbb{R}^n.$ But this can’t be right, on physical grounds!

The reason is that the partition function $Z$ needs to be dimensionless, but $d q$ has units. To correct this, we need to divide $dq$ by some dimensionful quantity to get $D q.$

In the case of the ideal gas, this dimensionful quantity involves the ‘thermal de Broglie wavelength’ of the particles in the gas. And this brings Planck’s constant into the game, even though we’re not doing quantum mechanics: we’re studying the statistical mechanics of a classical ideal gas!

That’s weird and interesting. It’s not the only place where we see that classical statistical mechanics is incomplete or inconsistent, and we need to introduce some ideas from quantum physics to get sensible answers. The most famous one is the ultraviolet catastrophe. What are all the rest?

In the case of the free particle, we need to divide by a quantity with dimensions of length$^n$ to make

$dq = dq_1 \cdots dq_n$

dimensionless, since each $dq_i$ has dimensions of length. The easiest way is to introduce a length scale $\Delta x$ and divide each $dq_i$ by that. This is commonly done when people study the free particle. This length scale drops out of the final answer for the questions people usually care about… but not for the quantropy.
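To see the ambiguity concretely, here's a one-dimensional sketch of the thermal (real $\alpha$) case, with made-up units: for $S(q) = q^2/2$ the partition function is $Z = \frac{1}{\Delta x}\int e^{-\alpha q^2/2}\, dq = \sqrt{2\pi/\alpha}\,/\,\Delta x,$ so the arbitrary length scale never drops out of $Z$ itself:

```python
import math

def partition_function(alpha, dx, L=20.0, steps=100001):
    """Riemann-sum approximation to (1/dx) * integral of exp(-alpha q^2/2)
    over [-L, L]; the tails beyond L are negligible for alpha ~ 1."""
    h = 2 * L / (steps - 1)
    total = sum(math.exp(-alpha * (-L + k * h) ** 2 / 2)
                for k in range(steps)) * h
    return total / dx

alpha = 1.0                       # thermal case: alpha = 1/kT, made-up units
exact = math.sqrt(2 * math.pi / alpha)

assert abs(partition_function(alpha, dx=1.0) - exact) < 1e-6
# Halving the length scale dx doubles Z: the ambiguity survives in Z itself.
assert abs(partition_function(alpha, dx=0.5) - 2 * exact) < 1e-6
```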

Similarly, Planck’s constant drops out of the final answer for some questions about the classical ideal gas, but not for its entropy!

So there’s an interesting question here, about what this new length scale $\Delta x$ means, if anything. One might argue that quantropy is a bad idea, and the need for this new length scale to make it unambiguous is just proof of that. However, the mathematical analogy to quantum mechanics is so precise that I think it’s worth going a bit further out on this limb, and thinking a bit more about what’s going on.

Some weird sort of déjà vu phenomenon seems to be going on. Once upon a time, people tried to calculate the partition functions of classical systems. They discovered they were infinite or ambiguous until they introduced Planck’s constant, and eventually quantum mechanics. Then Feynman introduced the path integral approach to quantum mechanics. In this approach one is again computing partition functions, but now with a new meaning, and with complex rather than real exponentials. But these partition functions are again infinite or ambiguous… for very similar mathematical reasons! And at least in some cases, we can remove the ambiguity using the same trick as before: introducing a new constant. But then… what?

Are we stuck in an infinite loop here? What, if anything, is the meaning of this ‘second Planck’s constant’? Does this have anything to do with second quantization? (I don’t see how, but I can’t resist asking.)

## Entropy and Information in Biological Systems

2 November, 2013

John Harte is an ecologist who uses maximum entropy methods to predict the distribution, abundance and energy usage of species. Marc Harper uses information theory in bioinformatics and evolutionary game theory. Harper, Harte and I are organizing a workshop on entropy and information in biological systems, and I’m really excited about it!

It’ll take place at the National Institute for Mathematical and Biological Synthesis in Knoxville, Tennessee. We are scheduling it for Wednesday–Friday, April 8–10, 2015. When the date gets confirmed, I’ll post an advertisement so you can apply to attend.

Writing the proposal was fun, because we got to pull together lots of interesting people who are applying information theory and entropy to biology in quite different ways. So, here it is!

### Proposal

Ever since Shannon initiated research on information theory in 1948, there have been hopes that the concept of information could serve as a tool to help systematize and unify work in biology. The link between information and entropy was noted very early on, and it suggested that a full thermodynamic understanding of biology would naturally involve the information processing and storage that are characteristic of living organisms. However, the subject is full of conceptual pitfalls for the unwary, and progress has been slower than initially expected. Premature attempts at ‘grand syntheses’ have often misfired. But applications of information theory and entropy to specific highly focused topics in biology have been increasingly successful, such as:

• the maximum entropy principle in ecology,
• Shannon and Rényi entropies as measures of biodiversity,
• information theory in evolutionary game theory,
• information and the thermodynamics of individual cells.

Because they work in diverse fields, researchers in these specific topics have had little opportunity to trade insights and take stock of the progress so far. The aim of the workshop is to do just this.

In what follows, participants’ names are in boldface, while the main goals of the workshop are in italics.

Roderick Dewar is a key advocate of the principle of Maximum Entropy Production, which says that biological systems—and indeed all open, non-equilibrium systems—act to produce entropy at the maximum rate. Along with others, he has applied this principle to make testable predictions in a wide range of biological systems, from ATP synthesis [DJZ2006] to respiration and photosynthesis of individual plants [D2010] and plant communities. He has also sought to derive this principle from ideas in statistical mechanics [D2004, D2009], but it remains controversial.

The first goal of this workshop is to study the validity of this principle.

While they may be related, the principle of Maximum Entropy Production should not be confused with the MaxEnt inference procedure, which says that we should choose the probabilistic hypothesis with the highest entropy subject to the constraints provided by our data. MaxEnt was first explicitly advocated by Jaynes. He noted that it is already implicit in the procedures of statistical mechanics, but convincingly argued that it can also be applied to situations where entropy is more ‘informational’ than ‘thermodynamic’ in character.

Recently John Harte has applied MaxEnt in this way to ecology, using it to make specific testable predictions for the distribution, abundance and energy usage of species across spatial scales and across habitats and taxonomic groups [Harte2008, Harte2009, Harte2011]. Annette Ostling is an expert on other theories that attempt to explain the same data, such as the ‘neutral model’ [AOE2008, ODLSG2009, O2005, O2012]. Dewar has also used MaxEnt in ecology [D2008], and he has argued that it underlies the principle of Maximum Entropy Production.

Thus, a second goal of this workshop is to familiarize all the participants with applications of the MaxEnt method to ecology, compare it with competing approaches, and study whether MaxEnt provides a sufficient justification for the principle of Maximum Entropy Production.

Entropy is not merely a predictive tool in ecology: it is also widely used as a measure of biodiversity. Here Shannon’s original concept of entropy naturally generalizes to ‘Rényi entropy’, which depends on a parameter $\alpha \ge 0$. This equals

$\displaystyle{ H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha }$

where $p_i$ is the fraction of organisms of the $i$th type (which could mean species, some other taxon, etc.). In the limit $\alpha \to 1$ this reduces to the Shannon entropy:

$\displaystyle{ H(p) = - \sum_i p_i \log p_i }$
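Both formulas, and the limit relating them, are easy to check numerically (an illustrative sketch with a made-up distribution):

```python
import math

def renyi_entropy(p, alpha):
    """H_alpha(p) = (1/(1-alpha)) * log(sum_i p_i^alpha), for alpha != 1."""
    return math.log(sum(pi ** alpha for pi in p)) / (1 - alpha)

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]   # made-up fractions of organisms of each type

# As alpha -> 1 the Renyi entropy approaches the Shannon entropy...
assert abs(renyi_entropy(p, 1.0001) - shannon_entropy(p)) < 1e-3

# ...and as alpha grows, rare types count for less: H_alpha decreases.
assert renyi_entropy(p, 0.5) > renyi_entropy(p, 2.0) > renyi_entropy(p, 10.0)
```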

As $\alpha$ increases, we give less weight to rare types of organisms. Christina Cobbold and Tom Leinster have described a systematic and highly flexible system of biodiversity measurement, with Rényi entropy at its heart [CL2012]. They consider both the case where all we have are the numbers $p_i$, and the more subtle case where we take the distance between different types of organisms into account.

John Baez has explained the role of Rényi entropy in thermodynamics [B2011], and together with Tom Leinster and Tobias Fritz he has proved other theorems characterizing entropy which explain its importance for information processing [BFL2011]. However, these ideas have not yet been connected to the widespread use of entropy in biodiversity studies. More importantly, the use of entropy as a measure of biodiversity has not been clearly connected to MaxEnt methods in ecology. Does the success of MaxEnt methods imply a tendency for ecosystems to maximize biodiversity subject to the constraints of resource availability? This seems surprising, but a more nuanced statement along these general lines might be correct.

So, a third goal of this workshop is to clarify relations between known characterizations of entropy, the use of entropy as a measure of biodiversity, and the use of MaxEnt methods in ecology.

As the amount of data to analyze in genomics continues to surpass the ability of humans to analyze it, we can expect automated experiment design to become ever more important. In Chris Lee and Marc Harper’s RoboMendel program [LH2013], a mathematically precise concept of ‘potential information’—how much information is left to learn—plays a crucial role in deciding what experiment to do next, given the data obtained so far. It will be useful for them to interact with William Bialek, who has expertise in estimating entropy from empirical data and using it to constrain properties of models [BBS, BNS2001, BNS2002], and Susanne Still, who applies information theory to automated theory building and biology [CES2010, PS2012].

However, there is another link between biology and potential information. Harper has noted that in an ecosystem where the population of each type of organism grows at a rate proportional to its fitness (which may depend on the fraction of organisms of each type), the quantity

$\displaystyle{ I(q||p) = \sum_i q_i \ln(q_i/p_i) }$

always decreases if there is an evolutionarily stable state [Harper2009]. Here $p_i$ is the fraction of organisms of the $i$th genotype at a given time, while $q_i$ is this fraction in the evolutionarily stable state. This quantity is often called the Shannon information of $q$ ‘relative to’ $p$. But in fact, it is precisely the same as Lee and Harper’s potential information! Indeed, there is a precise mathematical analogy between evolutionary games and processes where a probabilistic hypothesis is refined by repeated experiments.
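This quantity is easy to compute once we adopt the standard conventions for zero probabilities. Here is a minimal Python sketch (the function name is mine, not from [Harper2009]):

```python
from math import log

def relative_entropy(q, p):
    """I(q||p) = sum_i q_i ln(q_i / p_i), with the standard conventions:
    0 ln(0/p_i) = 0, and the result is infinite when q_i > 0 but p_i = 0."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0:
            continue             # 0 ln 0 = 0
        if pi == 0:
            return float("inf")  # q puts mass where p puts none
        total += qi * log(qi / pi)
    return total

# I(q||p) vanishes exactly when q = p, and is positive otherwise:
print(relative_entropy([0.5, 0.5], [0.5, 0.5]))   # 0.0
print(relative_entropy([1.0, 0.0], [0.5, 0.5]))   # ln 2 ≈ 0.6931...
```

So the information gain is zero precisely at the evolutionarily stable state, matching the claim that $I(q||p)$ decreases toward it.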

Thus, a fourth goal of this workshop is to develop the concept of evolutionary games as ‘learning’ processes in which information is gained over time.

We shall try to synthesize this with Carl Bergstrom and Matina Donaldson-Matasci’s work on the ‘fitness value of information’: a measure of how much increase in fitness a population can obtain per bit of extra information [BL2004, DBL2010, DM2013]. Following Harper, we shall consider not only relative Shannon entropy, but also relative Rényi entropy, as a measure of information gain [Harper2011].

A fifth and final goal of this workshop is to study the interplay between information theory and the thermodynamics of individual cells and organelles.

Susanne Still has studied the thermodynamics of prediction in biological systems [BCSS2012]. And in a celebrated piece of related work, Jeremy England used thermodynamic arguments to derive a lower bound for the amount of entropy generated during a process of self-replication of a bacterial cell [England2013]. Interestingly, he showed that E. coli comes within a factor of 3 of this lower bound.

In short, information theory and entropy methods are becoming powerful tools in biology, from the level of individual cells, to whole ecosystems, to experimental design, model-building, and the measurement of biodiversity. The time is ripe for an investigative workshop that brings together experts from different fields and lets them share insights and methods and begin to tackle some of the big remaining questions.

### Bibliography

[AOE2008] D. Alonso, A. Ostling and R. Etienne, The assumption of symmetry and species abundance distributions, Ecology Letters 11 (2008), 93–105.

[TMMABB2012] D. Amodei, W. Bialek, M. J. Berry II, O. Marre, T. Mora, and G. Tkacik, The simplest maximum entropy model for collective behavior in a neural network, arXiv:1207.6319 (2012).

[B2011] J. Baez, Rényi entropy and free energy, arXiv:1102.2098 (2011).

[BFL2011] J. Baez, T. Fritz and T. Leinster, A characterization of entropy in terms of information loss, Entropy 13 (2011), 1945–1957.

[BS2012] J. Baez and M. Stay, Algorithmic thermodynamics, Math. Struct. Comp. Sci. 22 (2012), 771–787.

[BCSS2012] A. J. Bell, G. E. Crooks, S. Still and D. A Sivak, The thermodynamics of prediction, Phys. Rev. Lett. 109 (2012), 120604.

[BL2004] C. T. Bergstrom and M. Lachmann, Shannon information and biological fitness, in IEEE Information Theory Workshop 2004, IEEE, 2004, pp. 50–54.

[BBS] M. J. Berry II, W. Bialek and E. Schneidman, An information theoretic approach to the functional classification of neurons, in Advances in Neural Information Processing Systems 15, MIT Press, 2005.

[BNS2001] W. Bialek, I. Nemenman and N. Tishby, Predictability, complexity and learning, Neural Computation 13 (2001), 2409–2463.

[BNS2002] W. Bialek, I. Nemenman and F. Shafee, Entropy and inference, revisited, in Advances in Neural Information Processing Systems 14, MIT Press, 2002.

[CL2012] C. Cobbold and T. Leinster, Measuring diversity: the importance of species similarity, Ecology 93 (2012), 477–489.

[CES2010] J. P. Crutchfield, S. Still and C. Ellison, Optimal causal inference: estimating stored information and approximating causal architecture, Chaos 20 (2010), 037111.

[D2004] R. C. Dewar, Maximum entropy production and non-equilibrium statistical mechanics, in Non-Equilibrium Thermodynamics and Entropy Production: Life, Earth and Beyond, eds. A. Kleidon and R. Lorenz, Springer, New York, 2004, 41–55.

[DJZ2006] R. C. Dewar, D. Juretić and P. Županović, The functional design of the rotary enzyme ATP synthase is consistent with maximum entropy production, Chem. Phys. Lett. 430 (2006), 177–182.

[D2008] R. C. Dewar and A. Porté, Statistical mechanics unifies different ecological patterns, J. Theor. Bio. 251 (2008), 389–403.

[D2009] R. C. Dewar, Maximum entropy production as an inference algorithm that translates physical assumptions into macroscopic predictions: don’t shoot the messenger, Entropy 11 (2009), 931–944.

[D2010] R. C. Dewar, Maximum entropy production and plant optimization theories, Phil. Trans. Roy. Soc. B 365 (2010) 1429–1435.

[DBL2010] M. C. Donaldson-Matasci, C. T. Bergstrom and M. Lachmann, The fitness value of information, Oikos 119 (2010), 219–230.

[DM2013] M. C. Donaldson-Matasci, G. DeGrandi-Hoffman, and A. Dornhaus, Bigger is better: honey bee colonies as distributed information-gathering systems, Animal Behaviour 85 (2013), 585–592.

[England2013] J. L. England, Statistical physics of self-replication, J. Chem. Phys. 139 (2013), 121923.

[ODLSG2009] J. L. Green, J. K. Lake, J. P. O’Dwyer, A. Ostling and V. M. Savage, An integrative framework for stochastic, size-structured community assembly, PNAS 106 (2009), 6170–6175.

[Harper2009] M. Harper, Information geometry and evolutionary game theory, arXiv:0911.1383 (2009).

[Harper2011] M. Harper, Escort evolutionary game theory, Physica D 240 (2011), 1411–1415.

[Harte2008] J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

[Harte2009] J. Harte, A. Smith and D. Storch, Biodiversity scales from plots to biomes with a universal species-area curve, Ecology Letters 12 (2009), 789–797.

[Harte2011] J. Harte, Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics, Oxford U. Press, Oxford, 2011.

[LH2013] M. Harper and C. Lee, Basic experiment planning via information metrics: the RoboMendel problem, arXiv:1210.4808 (2012).

[O2005] A. Ostling, Neutral theory tested by birds, Nature 436 (2005), 635.

[O2012] A. Ostling, Do fitness-equalizing tradeoffs lead to neutral communities?, Theoretical Ecology 5 (2012), 181–194.

[PS2012] D. Precup and S. Still, An information-theoretic approach to curiosity-driven reinforcement learning, Theory in Biosciences 131 (2012), 139–148.

## Relative Entropy (Part 2)

2 July, 2013

In the first part of this mini-series, I described how various ideas important in probability theory arise naturally when you start doing linear algebra using only the nonnegative real numbers.

But after writing it, I got an email from a rather famous physicist saying he got “lost at line two”. So, you’ll be happy to hear that the first part is not a prerequisite for the remaining parts! I wrote it just to intimidate that guy.

Tobias Fritz and I have proved a theorem characterizing the concept of relative entropy, which is also known as ‘relative information’, ‘information gain’ or—most terrifying and least helpful of all—‘Kullback–Leibler divergence’. In this second part I’ll introduce two key players in this theorem. The first, $\mathrm{FinStat},$ is a category where:

• an object consists of a system with finitely many states, and a probability distribution on those states

and

• a morphism consists of a deterministic ‘measurement process’ mapping states of one system to states of another, together with a ‘hypothesis’ that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

The second, $\mathrm{FP},$ is a subcategory of $\mathrm{FinStat}.$ It has all the same objects, but only morphisms where the hypothesis is ‘optimal’. This means that if the observer measures the system many times, and uses the probability distribution of their observations together with their hypothesis to guess the probability distribution of states of the system, they get the correct answer (in the limit of many measurements).

In this part all I will really do is explain precisely what $\mathrm{FinStat}$ and $\mathrm{FP}$ are. But to whet your appetite, let me explain how we can use them to give a new characterization of relative entropy!

Suppose we have any morphism in $\mathrm{FinStat}.$ In other words: suppose we have a deterministic measurement process, together with a hypothesis that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

Then we have two probability distributions on the states of the system being measured! First, the ‘true’ probability distribution. Second, the probability that the observer will guess based on their observations.

Whenever we have two probability distributions on the same set, we can compute the entropy of the first relative to the second. This describes how surprised you’ll be if you discover the probability distribution is really the first, when you thought it was the second.

So: any morphism in $\mathrm{FinStat}$ will have a relative entropy. It will describe how surprised the observer will be when they discover the true probability distribution, given what they had guessed.

But this amount of surprise will be zero if their hypothesis was ‘optimal’ in the sense I described. So, the relative entropy will vanish on morphisms in $\mathrm{FP}.$

Our theorem says this fact almost characterizes the concept of relative entropy! More precisely, it says that any convex-linear lower semicontinuous functor

$F : \mathrm{FinStat} \to [0,\infty]$

that vanishes on the subcategory $\mathrm{FP}$ must equal some constant times the relative entropy.

Don’t be scared! This should not make sense to you yet, since I haven’t said how I’m thinking of $[0,\infty]$ as a category, nor what a ‘convex-linear lower semicontinuous functor’ is, nor how relative entropy gives one. I will explain all that later. I just want you to get a vague idea of where I’m going.

Now let me explain the categories $\mathrm{FinStat}$ and $\mathrm{FP}.$ We need to warm up a bit first.

### FinStoch

A stochastic map $f : X \leadsto Y$ is different from an ordinary function, because instead of assigning a unique element of $Y$ to each element of $X,$ it assigns a probability distribution on $Y$ to each element of $X.$ So you should imagine it as being like a function ‘with random noise added’, so that $f(x)$ is not a specific element of $Y,$ but instead has a probability of taking on different values. This is why I’m using a weird wiggly arrow to denote a stochastic map.

More formally:

Definition. Given finite sets $X$ and $Y,$ a stochastic map $f : X \leadsto Y$ assigns a real number $f_{yx}$ to each pair $x \in X, y \in Y,$ such that fixing any element $x,$ the numbers $f_{yx}$ form a probability distribution on $Y.$ We call $f_{yx}$ the probability of $y$ given $x.$

In more detail:

$f_{yx} \ge 0$ for all $x \in X,$ $y \in Y.$

and

$\displaystyle{ \sum_{y \in Y} f_{yx} = 1}$ for all $x \in X.$

Note that we can think of $f : X \leadsto Y$ as a $Y \times X$-shaped matrix of numbers. A matrix obeying the two properties above is called stochastic. This viewpoint is nice because it reduces the problem of composing stochastic maps to matrix multiplication. It’s easy to check that multiplying two stochastic matrices gives a stochastic matrix. So, composing stochastic maps gives a stochastic map.
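As a quick sanity check on this viewpoint, here is a small Python sketch, with a matrix stored as a list of rows so that $f_{yx}$ is `f[y][x]` (the helper names are my own):

```python
def is_stochastic(f, tol=1e-9):
    """Check that the matrix f (a list of rows, f[y][x] = probability of y
    given x) is stochastic: nonnegative entries, each column summing to 1."""
    rows, cols = len(f), len(f[0])
    if any(f[y][x] < 0 for y in range(rows) for x in range(cols)):
        return False
    return all(abs(sum(f[y][x] for y in range(rows)) - 1.0) < tol
               for x in range(cols))

def compose(g, f):
    """Compose stochastic maps by ordinary matrix multiplication:
    (g . f)_{zx} = sum_y g_{zy} f_{yx}."""
    return [[sum(g[z][y] * f[y][x] for y in range(len(f)))
             for x in range(len(f[0]))]
            for z in range(len(g))]

f = [[0.7, 0.2],
     [0.3, 0.8]]
g = [[0.5, 0.1],
     [0.5, 0.9]]
assert is_stochastic(f) and is_stochastic(g)
assert is_stochastic(compose(g, f))  # a composite of stochastic maps is stochastic
```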

We thus get a category:

Definition. Let $\mathrm{FinStoch}$ be the category of finite sets and stochastic maps between them.

In case you’re wondering why I’m restricting attention to finite sets, it’s merely because I want to keep things simple. I don’t want to worry about whether sums or integrals converge.

### FinProb

Now take your favorite 1-element set and call it $1.$ A function $p : 1 \to X$ is just a point of $X.$ But a stochastic map $p : 1 \leadsto X$ is something more interesting: it’s a probability distribution on $X.$

Why? Because it gives a probability distribution on $X$ for each element of $1,$ but that set has just one element.

Last time I introduced the rather long-winded phrase finite probability measure space to mean a finite set with a probability distribution on it. But now we’ve seen a very quick way to describe such a thing within $\mathrm{FinStoch}:$

And this gives a quick way to think about a measure-preserving function between finite probability measure spaces! It’s just a commutative triangle like this:

Note that the horizontal arrow $f: X \to Y$ is not wiggly. The straight arrow means it’s an honest function, not a stochastic map. But a function is a special case of a stochastic map! So it makes sense to compose a straight arrow with a wiggly arrow—and the result is, in general, a wiggly arrow. So, it makes sense to demand that this triangle commutes, and this says that the function $f: X \to Y$ is measure-preserving.

Let me work through the details, in case they’re not clear.

First: how is a function a special case of a stochastic map? Here’s how. If we start with a function $f: X \to Y,$ we get a matrix of numbers

$f_{yx} = \delta_{y,f(x)}$

where $\delta$ is the Kronecker delta. So, each element $x \in X$ gives a probability distribution that’s zero except at $f(x).$
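In code, this inclusion of functions into stochastic maps just builds a 0/1 column-stochastic matrix. A small sketch (the function name is mine):

```python
def function_to_matrix(f, n_x, n_y):
    """The stochastic matrix of a function f : {0,...,n_x-1} -> {0,...,n_y-1}:
    f_{yx} = delta_{y, f(x)}, so each column is the point distribution at f(x)."""
    return [[1.0 if f(x) == y else 0.0 for x in range(n_x)]
            for y in range(n_y)]

m = function_to_matrix(lambda x: x % 2, 3, 2)
print(m)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```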

Given this, we can work out what this commuting triangle really says:

If we use $p_x$ to stand for the probability distribution that $p: 1 \leadsto X$ puts on $X,$ and similarly for $q_y,$ the commuting triangle says

$\displaystyle{ q_y = \sum_{x \in X} \delta_{y,f(x)} p_x}$

or in other words:

$\displaystyle{ q_y = \sum_{x \in X : f(x) = y} p_x }$

or if you like:

$\displaystyle{ q_y = \sum_{x \in f^{-1}(y)} p_x }$

In this situation people say $q$ is $p$ pushed forward along $f$, and they say $f$ is a measure-preserving function.
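Concretely, pushing forward just sums the probabilities over each fiber $f^{-1}(y)$. A Python sketch of the formula above (the function name is my own):

```python
def pushforward(f, p):
    """Push the distribution p (a list indexed by x) forward along the
    function f, giving q_y = sum of p_x over x in f^{-1}(y), as a dict."""
    q = {}
    for x, px in enumerate(p):
        y = f(x)
        q[y] = q.get(y, 0.0) + px
    return q

# f collapses states 0 and 1 to observation 'a' and sends 2 to 'b':
f = lambda x: 'a' if x < 2 else 'b'
print(pushforward(f, [0.2, 0.3, 0.5]))  # {'a': 0.5, 'b': 0.5}
```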

So, we’ve used $\mathrm{FinStoch}$ to describe another important category:

Definition. Let $\mathrm{FinProb}$ be the category of finite probability measure spaces and measure-preserving functions between them.

I can’t resist mentioning another variation:

A commuting triangle like this is a measure-preserving stochastic map. In other words, $p$ gives a probability measure on $X,$ $q$ gives a probability measure on $Y,$ and $f: X \leadsto Y$ is a stochastic map with

$\displaystyle{ q_y = \sum_{x \in X} f_{yx} p_x }$

### FinStat

The category we really need for relative entropy is a bit more subtle. An object is a finite probability measure space:

but a morphism looks like this:

The whole diagram doesn’t commute, but the two equations I wrote down hold. The first equation says that $f: X \to Y$ is a measure-preserving function. In other words, this triangle, which we’ve seen before, commutes:

The second equation says that $f \circ s$ is the identity, or in math jargon, $s$ is a section for $f.$

But what does that really mean?

The idea is that $X$ is the set of ‘states’ of some system, while $Y$ is a set of possible ‘observations’ you might make. The function $f$ is a ‘measurement process’. You ‘measure’ the system using $f,$ and if the system is in the state $x$ you get the observation $f(x).$ The probability distribution $p$ says the probability that the system is in any given state, while $q$ says the probability that you get any given observation when you do your measurement.

Note: we are assuming for now that there’s no random noise in the observation process! That’s why $f$ is a function instead of a stochastic map.

But what about $s?$ That’s the fun part: $s$ describes your ‘hypothesis’ about the system’s state given a particular measurement! If you measure the system and get a result $y \in Y,$ you guess it’s in the state $x$ with probability $s_{xy}.$

And we don’t want this hypothesis to be really dumb: that’s what

$f \circ s = 1_Y$

says. You see, this equation says that

$\sum_{x \in X} \delta_{y', f(x)} s_{xy} = \delta_{y' y}$

or in other words:

$\sum_{x \in f^{-1}(y')} s_{xy} = \delta_{y' y}$

If you think about it, this implies $s_{xy} = 0$ unless $f(x) = y.$

So, if you make an observation $y,$ you will guess the system is in state $x$ with probability zero unless $f(x) = y.$ In short, you won’t make a really dumb guess about the system’s state.
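This constraint is easy to check in code: a hypothesis gives one distribution on $X$ per observation $y,$ supported on the fiber $f^{-1}(y).$ A sketch, with illustrative data and helper names of my own:

```python
# X = {0, 1, 2}, Y = {'a', 'b'}; f collapses 0, 1 to 'a' and sends 2 to 'b'.
f = {0: 'a', 1: 'a', 2: 'b'}

# A hypothesis s : Y ~> X, one distribution on X per observation y,
# supported on the fiber f^{-1}(y) as the equation f . s = 1_Y demands:
s = {'a': {0: 0.4, 1: 0.6, 2: 0.0},
     'b': {0: 0.0, 1: 0.0, 2: 1.0}}

def is_section(f, s, tol=1e-9):
    """Check f . s = 1_Y: pushing s(y) forward along f must give the
    point distribution at y."""
    ys = set(f.values())
    for y in ys:
        for y2 in ys:
            mass = sum(p for x, p in s[y].items() if f[x] == y2)
            if abs(mass - (1.0 if y == y2 else 0.0)) > tol:
                return False
    return True

assert is_section(f, s)  # not a dumb guess: all mass sits on the right fiber
```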

Here’s how we compose morphisms:

We get a measure-preserving function $g \circ f : X \to Z$ and a stochastic map going back, $s \circ t : Z \leadsto X.$ You can check that these obey the required equations:

$g \circ f \circ p = r$

$g \circ f \circ s \circ t = 1_Z$

So, we get a category:

Definition. Let $\mathrm{FinStat}$ be the category where an object is a finite probability measure space:

a morphism is a diagram obeying these equations:

and composition is defined as above.

### FP

As we’ve just seen, a morphism in $\mathrm{FinStat}$ consists of a ‘measurement process’ $f$ and a ‘hypothesis’ $s:$

But sometimes we’re lucky and our hypothesis is optimal, in the sense that

$s \circ q = p$

Conceptually, this says that if you take the probability distribution $q$ on our observations and use it to guess a probability distribution for the system’s state using our hypothesis $s,$ you get the correct answer: $p.$

Mathematically, it says that this diagram commutes:

In other words, $s$ is a measure-preserving stochastic map.

There’s a subcategory of $\mathrm{FinStat}$ with all the same objects, but only these ‘optimal’ morphisms. It’s important, but the name we have for it is not very exciting:

Definition. Let $\mathrm{FP}$ be the subcategory of $\mathrm{FinStat}$ where an object is a finite probability measure space

and a morphism is a diagram obeying these equations:

Why do we call this category $\mathrm{FP}$? Because it’s a close relative of $\mathrm{FinProb},$ where a morphism, you’ll remember, looks like this:

The point is that for a morphism in $\mathrm{FP},$ the conditions on $s$ are so strong that they completely determine it unless there are observations that happen with probability zero—that is, unless there are $y \in Y$ with $q_y = 0.$ To see this, note that

$s \circ q = p$

actually says

$\sum_{y \in Y} s_{xy} q_y = p_x$

for any choice of $x \in X.$ But we’ve already seen $s_{xy} = 0$ unless $f(x) = y,$ so the sum has just one term, and the equation says

$s_{x,f(x)} q_{f(x)} = p_x$

We can solve this for $s_{x,f(x)},$ so $s$ is completely determined… unless $q_{f(x)} = 0.$

This covers the case when $y = f(x).$ (There is no case where $y$ lies outside the image of $f$: the equation $f \circ s = 1_Y$ forces the probabilities $s_{xy}$ for $x \in f^{-1}(y)$ to sum to 1, which is impossible when the fiber $f^{-1}(y)$ is empty.)

So, to be utterly precise, $s$ is determined by $p, q$ and $f$ unless there’s an element $y \in Y$ that has $q_y = 0.$ Except for this special case, a morphism in $\mathrm{FP}$ is just a morphism in $\mathrm{FinProb}.$ But in this special case, a morphism in $\mathrm{FP}$ has a little extra information: an arbitrary probability distribution on the inverse image of each point $y$ with this property.

In short, $\mathrm{FP}$ is the same as $\mathrm{FinProb}$ except that our observer’s ‘optimal hypothesis’ must provide a guess about the state of the system given an observation, even in cases of observations that occur with probability zero.
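For concreteness, here is a sketch computing the optimal hypothesis from $p$ and $f$ via $s_{x,f(x)} = p_x / q_{f(x)},$ making an arbitrary (here uniform) choice on fibers over observations with $q_y = 0$ (the function name is mine, not from the paper):

```python
def optimal_hypothesis(f, p):
    """Given f : X -> Y (a dict) and a distribution p on X (a dict),
    return s with s[y][x] = p_x / q_y for x in the fiber f^{-1}(y),
    where q is p pushed forward along f. When q_y = 0 the formula
    says nothing, so we pick the uniform distribution on the fiber."""
    q = {}
    for x, px in p.items():
        q[f[x]] = q.get(f[x], 0.0) + px
    s = {}
    for y in q:
        fiber = [x for x in p if f[x] == y]
        s[y] = {x: (p[x] / q[y] if q[y] > 0 else 1.0 / len(fiber))
                for x in fiber}
    return s

f = {0: 'a', 1: 'a', 2: 'b'}
p = {0: 0.2, 1: 0.3, 2: 0.5}
print(optimal_hypothesis(f, p))  # {'a': {0: 0.4, 1: 0.6}, 'b': {2: 1.0}}
```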

I’m going into these nitpicky details for two reasons. First, we’ll need $\mathrm{FP}$ for our characterization of relative entropy. But second, Tom Leinster already ran into this category in his work on entropy and category theory! He discussed it here:

• Tom Leinster, An operadic introduction to entropy.

Despite the common theme of entropy, he arrived at it from a very different starting-point.

### Conclusion

So, I hope that next time I can show you something like this:

and you’ll say “Oh, that’s a probability distribution on the states of some system!” Intuitively, you should think of the wiggly arrow $p$ as picking out a ‘random element’ of the set $X.$

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, sending a probability distribution on the states of the measured system to a probability distribution on observations!”

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with a hypothesis about the system’s state, given what is observed!”

And I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with an optimal hypothesis about the system’s state, given what is observed!”

I don’t count on it… but I can hope.

### Postscript

And speaking of unrealistic hopes, if I were really optimistic I would hope you noticed that $\mathrm{FinStoch}$ and $\mathrm{FinProb},$ which underlie the more fancy categories I’ve discussed today, were themselves constructed starting from linear algebra over the nonnegative numbers $[0,\infty)$ in Part 1. That ‘foundational’ work is not really needed for what we’re doing now. However, I like the fact that we’re ultimately getting the concept of relative entropy starting from very little: just linear algebra, using only nonnegative numbers!

## Relative Entropy (Part 1)

20 June, 2013

I’m trying to finish off a paper that Tobias Fritz and I have been working on, which gives a category-theoretic (and Bayesian!) characterization of relative entropy. It’s a kind of sequel to our paper with Tom Leinster, in which we characterized entropy.

That earlier paper was developed in conversations on the n-Category Café. It was a lot of fun; I sort of miss that style of working. Also, to get warmed up, I need to think through some things I’ve thought about before. So, I might as well write them down here.

### The idea

There are many categories related to probability theory, and they’re related in many ways. Last summer—on the 24th of August 2012, according to my notes here—Jamie Vicary, Brendan Fong and I worked through a bunch of these relationships. I need to write them down now, even if they’re not all vitally important to my paper with Tobias. They’re sort of buzzing around my brain like flies.

(Tobias knows this stuff too, and this is how we think about probability theory, but we weren’t planning to stick it in our paper. Maybe we should.)

Let’s restrict attention to probability measures on finite sets, and related structures. We could study these questions more generally, and we should, but not today. What we’ll do is give a unified purely algebraic description of:

• finite sets

• measures on finite sets

• probability measures on finite sets

and various kinds of maps between these:

• functions

• bijections

• measure-preserving functions

• stochastic maps

### Finitely generated free [0,∞)-modules

People often do linear algebra over a field, which is—roughly speaking—a number system where you can add, subtract, multiply and divide. But algebraists have long realized that a lot of linear algebra still works with a commutative ring, where you can’t necessarily divide. It gets more complicated, but also a lot more interesting.

But in fact, a lot still works with a commutative rig, where we can’t necessarily subtract either! Something I keep telling everyone is that linear algebra over rigs is a good idea for studying things like probability theory, thermodynamics, and the principle of least action.

Today we’ll start with the rig of nonnegative real numbers with their usual addition and multiplication; let’s call this $[0,\infty) .$ The idea is that measure theory, and probability theory, are closely related to linear algebra over this rig.

Let $C$ be the category with finitely generated free $[0,\infty)$-modules as objects, and module homomorphisms as morphisms. I’ll call these morphisms maps.

Puzzle. Do we need to say ‘free’ here? Are there finitely generated modules over $[0,\infty)$ that aren’t free?

Every finitely generated free $[0,\infty)$-module is isomorphic to $[0,\infty)^S$ for some finite set $S .$ In other words, it’s isomorphic to $[0,\infty)^n$ for some $n = 0, 1, 2, \dots .$ So, $C$ is equivalent to the category where objects are natural numbers, a morphism from $m$ to $n$ is an $n \times m$ matrix of numbers in $[0,\infty) ,$ and composition is done by matrix multiplication. I’ll also call this equivalent category $C.$

We can take tensor products of finitely generated free modules, and this makes $C$ into a symmetric monoidal †-category. This means we can draw maps using string diagrams in the usual way. However, I’m feeling lazy so I’ll often write equations when I could be drawing diagrams.

One of the rules of the game is that all these equations will make sense in any symmetric monoidal †-category. So we could, if we wanted, generalize ideas from probability theory this way. If you want to do this, you’ll need to know that $[0,\infty)$ is the unit for the tensor product in $C.$ We’ll be seeing this guy $[0,\infty)$ a lot. So if you want to generalize, replace $C$ by any symmetric monoidal †-category, and replace $[0,\infty)$ by the unit for the tensor product.

### Finite sets

There’s a way to see the category of finite sets lurking in $C,$ which we can borrow from this paper:

• Bob Coecke, Dusko Pavlovic and Jamie Vicary, A new description of orthogonal bases.

For any finite set $S ,$ we get a free finitely generated $[0,\infty)$-module, namely $[0,\infty)^S .$ This comes with some structure:

• a multiplication $m: [0,\infty)^S \otimes [0,\infty)^S \to [0,\infty)^S ,$ coming from pointwise multiplication of $[0,\infty)$-valued functions on $S$

• the unit for this multiplication, an element of $[0,\infty)^S,$ which we can write as a morphism $i: [0,\infty) \to [0,\infty)^S$

• a comultiplication, obtained by taking the diagonal map $\Delta : S \to S \times S$ and promoting it to a linear map $\Delta : [0,\infty)^S \to [0, \infty)^S \otimes [0,\infty)^S$

• a counit for this comultiplication, obtained by taking the unique map to the terminal set $! : S \to 1$ and promoting it to a linear map $e: [0,\infty)^S \to [0,\infty)$

These morphisms $m, i, \Delta, e$ make

$x = [0,\infty)^S$

into a commutative Frobenius algebra in $C .$ That’s a thing where the unit, counit, multiplication and comultiplication obey these laws:

(I drew these back when I was feeling less lazy.) This Frobenius algebra is also ‘special’, meaning it obeys this:

And it’s also a †-Frobenius algebra, meaning that the counit and comultiplication are obtained from the unit and multiplication by ‘flipping’ them using the †-category structure. (If we think of a morphism in $C$ as a matrix, its dagger is its transpose.)

Conversely, suppose we have any special commutative †-Frobenius algebra $x .$ Then using the ideas in the paper by Coecke, Pavlovic and Vicary we can recover a basis for $x ,$ consisting of the vectors $e_i \in x$ with

$\Delta(e_i) = e_i \otimes e_i$

This basis forms a set $S$ such that

$x \cong [0,\infty)^S$

for some specified isomorphism in $C.$ Furthermore, this is an isomorphism of special commutative †-Frobenius algebras!

In case you’re wondering, these vectors $e_i$ correspond to the functions on $S$ that are zero everywhere except at one point $i \in S,$ where they equal 1.

In short, a special commutative †-Frobenius algebra in $C$ is just a fancy way of talking about a finite set. This may seem silly, but it’s a way to start describing probability theory using linear algebra very much as we do with quantum theory. This analogy between quantum theory and probability theory is so interesting that it deserves a book.
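To make this concrete, here is a small numerical sketch of the Frobenius structure on $[0,\infty)^S$ for $S = \{0,1,2\},$ checking the ‘special’ law $m \circ \Delta = \mathrm{id}$ (the representations and names are my own):

```python
n = 3  # S = {0, 1, 2}, so x = [0,inf)^S is modeled by length-3 lists

def mult(w):
    """The multiplication m : x (x) x -> x. Since m(e_i (x) e_j) = delta_{ij} e_i,
    on a tensor w (a dict keyed by pairs (i, j)) it reads off the diagonal."""
    return [w[(k, k)] for k in range(n)]

def comult(v):
    """The comultiplication Delta : x -> x (x) x, from the diagonal S -> S x S:
    Delta(e_i) = e_i (x) e_i, so Delta(v) is supported on the diagonal."""
    return {(i, j): (v[i] if i == j else 0.0)
            for i in range(n) for j in range(n)}

def counit(v):
    """The counit e : x -> [0,inf), from the map S -> 1: summing the entries."""
    return sum(v)

unit = [1.0] * n  # the unit i(1): the constant function 1 on S

v = [2.0, 0.5, 3.0]
assert mult(comult(v)) == v                    # the 'special' law: m . Delta = id
assert [a * b for a, b in zip(unit, v)] == v   # the unit law for pointwise product
```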

### Functions and bijections

Now suppose we have two special commutative †-Frobenius algebras in $C$, say $x$ and $y .$

Suppose $f : x \to y$ is a Frobenius algebra homomorphism: that is, a map preserving all the structure—the unit, counit, multiplication and comultiplication. Then it comes from an isomorphism of finite sets. This lets us find $\mathrm{FinSet}_0 ,$ the groupoid of finite sets and bijections, inside $C.$

Alternatively, suppose $f : x \to y$ is just a coalgebra homomorphism: that is, a map preserving just the counit and comultiplication. Then it comes from an arbitrary function between finite sets. This lets us find $\mathrm{FinSet} ,$ the category of finite sets and functions, inside $C .$

But what if $f$ preserves just the counit? This sounds like a dry, formal question. But it’s not: the answer is something useful, a ‘stochastic map’.

### Stochastic maps

A stochastic map from a finite set $S$ to a finite set $T$ is a map sending each point of $S$ to a probability measure on $T .$

We can think of this as a $T \times S$-shaped matrix of numbers in $[0,\infty) ,$ where a given column gives the probability that a given point in $S$ goes to any point in $T .$ The sum of the numbers in each column will be 1. And conversely, any $T \times S$-shaped matrix of numbers in $[0,\infty) ,$ where each column sums to 1, gives a stochastic map from $S$ to $T .$

But now let’s describe this idea using the category $C.$ We’ve seen a finite set is the same as a special commutative †-Frobenius algebra. So, say we have two of these, $x$ and $y .$ Our matrix of numbers in $[0,\infty)$ is just a map

$f: x \to y$

So, we just need a way to state the condition that each column in the matrix sums to 1. And this condition simply says that $f$ preserves the counit:

$\epsilon_y \circ f = \epsilon_x$

where $\epsilon_x : x \to [0,\infty)$ is the counit for $x ,$ and similarly for $\epsilon_y .$

To understand this, note that if we use the canonical isomorphism

$x \cong [0,\infty)^S$

the counit $\epsilon_x$ can be seen as the map

$[0,\infty)^S \to [0,\infty)$

that takes any $S$-tuple of numbers and sums them up. In other words, it’s integration with respect to counting measure. So, the equation

$\epsilon_y \circ f = \epsilon_x$

says that if we take any $S$-tuple of numbers, multiply it by the matrix $f ,$ and then sum up the entries of the resulting $T$-tuple, it’s the same as if we summed up the original $S$-tuple. But this says precisely that each column of the matrix $f$ sums to 1.
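Numerically, this is just the statement that a column-stochastic matrix preserves total mass. A quick sketch (helper names mine):

```python
def counit(v):
    """The counit e : [0,inf)^S -> [0,inf): integration against counting
    measure, i.e. summing the entries."""
    return sum(v)

def apply_map(f, v):
    """Matrix-vector product: apply the map f (a list of rows) to v."""
    return [sum(row[s] * v[s] for s in range(len(v))) for row in f]

f = [[0.7, 0.2],   # each column sums to 1
     [0.3, 0.8]]
v = [2.0, 5.0]     # any S-tuple of nonnegative numbers, not just a distribution
assert abs(counit(apply_map(f, v)) - counit(v)) < 1e-9  # e_y . f = e_x
```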

So, we can use our formalism to describe $\mathrm{FinStoch},$ the category with finite sets as objects and stochastic maps as morphisms. We’ve seen this category is equivalent to the category with special commutative †-Frobenius algebras in $C$ as objects and counit-preserving maps as morphisms.

### Finite measure spaces

Now let’s use our formalism to describe finite measure spaces—by which, beware, I mean finite sets equipped with measures! To do this, we’ll use a special commutative †-Frobenius algebra $x$ in $C$ together with any map

$\mu: [0,\infty) \to x$

Starting from these, we get a specified isomorphism

$x \cong [0,\infty)^S$

and $\mu$ sends the number 1 to a vector in $[0,\infty)^S$: that is, a function on $S$ taking values in $[0,\infty) .$ Multiplying this function by counting measure, we get a measure on $S .$

Puzzle. How can we describe this measure without the annoying use of counting measure?

Conversely, any measure on a finite set gives a special commutative †-Frobenius algebra $x$ in $C$ equipped with a map from $[0,\infty) .$

So, we can say a finite measure space is a special commutative †-Frobenius algebra in $C$ equipped with a map

$\mu: [0,\infty) \to x$

And given two of these,

$\mu: [0,\infty) \to x , \qquad \nu: [0,\infty) \to y$

and a coalgebra morphism

$f : x \to y$

obeying this equation

$f \circ \mu = \nu$

then we get a measure-preserving function between finite measure spaces! If you’re a category theorist, you’ll draw this equation as a commutative triangle:

Conversely, any measure-preserving function between finite measure spaces obeys this equation. So, we get an algebraic way of describing the category $\mathrm{FinMeas} ,$ with finite measure spaces as objects and measure-preserving maps as morphisms.

### Finite probability measure spaces

I’m mainly interested in probability measures. So suppose $x$ is a special commutative †-Frobenius algebra in $C$ equipped with a map

$\mu: [0,\infty) \to x$

We’ve seen this gives a finite measure space. But this is a probability measure space if and only if

$e \circ \mu = 1$

where

$e : x \to [0,\infty)$

is the counit for $x .$ The equation simply says the total integral of our measure $\mu$ is 1.

So, we get a way of describing the category $\mathrm{FinProb} ,$ which has finite probability measure spaces as objects and measure-preserving maps as morphisms. Given finite probability measure spaces described this way:

$\mu: [0,\infty) \to x , \qquad \nu: [0,\infty) \to y$

a measure-preserving function is a coalgebra morphism

$f : x \to y$

such that the obvious triangle commutes:

$f \circ \mu = \nu$

### Measure-preserving stochastic maps

Say we have two finite measure spaces. Then we can ask whether a stochastic map from one to the other is measure-preserving. And we can answer this question in the language of $C .$

Remember, a finite measure space is a special commutative †-Frobenius algebra $x$ in $C$ together with a map

$\mu: [0,\infty) \to x$

Say we have another one:

$\nu: [0,\infty) \to y$

A stochastic map is just a map

$f: x \to y$

that preserves the counit:

$e_y \circ f = e_x$

But it’s a measure-preserving stochastic map if also

$f \circ \mu = \nu$

### Next…

There’s a lot more to say; I haven’t gotten anywhere near what Tobias and I are doing! But it’s pleasant to have this basic stuff written down.

## Maximum Entropy and Ecology

21 February, 2013

I already talked about John Harte’s book on how to stop global warming. Since I’m trying to apply information theory and thermodynamics to ecology, I was also interested in this book of his:

John Harte, Maximum Entropy and Ecology, Oxford U. Press, Oxford, 2011.

There’s a lot in this book, and I haven’t absorbed it all, but let me try to briefly summarize his maximum entropy theory of ecology. This aims to be “a comprehensive, parsimonious, and testable theory of the distribution, abundance, and energetics of species across spatial scales”. One great thing is that he makes quantitative predictions using this theory and compares them to a lot of real-world data. But let me just tell you about the theory.

It’s heavily based on the principle of maximum entropy (MaxEnt for short), and there are two parts:

Two MaxEnt calculations are at the core of the theory: the first yields all the metrics that describe abundance and energy distributions, and the second describes the spatial scaling properties of species’ distributions.

### Abundance and energy distributions

The first part of Harte’s theory is all about a conditional probability distribution

$R(n,\epsilon | S_0, N_0, E_0)$

which he calls the ecosystem structure function. Here:

$S_0$: the total number of species under consideration in some area.

$N_0$: the total number of individuals under consideration in that area.

$E_0$: the total rate of metabolic energy consumption of all these individuals.

Given this,

$R(n,\epsilon | S_0, N_0, E_0) \, d \epsilon$

is the probability that given $S_0, N_0, E_0,$ if a species is picked from the collection of species, then it has $n$ individuals, and if an individual is picked at random from that species, then its rate of metabolic energy consumption is in the interval $(\epsilon, \epsilon + d \epsilon).$

Here of course $d \epsilon$ is ‘infinitesimal’, meaning that we take a limit where it goes to zero to make this idea precise (if we’re doing analytical work) or take it to be very small (if we’re estimating $R$ from data).

I believe that when we ‘pick a species’ we’re treating them all as equally probable, not weighting them according to their number of individuals.

Clearly $R$ obeys some constraints. First, since it’s a probability distribution, it obeys the normalization condition:

$\displaystyle{ \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) = 1 }$

Second, since the average number of individuals per species is $N_0/S_0,$ we have:

$\displaystyle{ \sum_n \int d \epsilon \; n R(n,\epsilon | S_0, N_0, E_0) = N_0 / S_0 }$

Third, since the average over species of the total rate of metabolic energy consumption of individuals within the species is $E_0/ S_0,$ we have:

$\displaystyle{ \sum_n \int d \epsilon \; n \epsilon R(n,\epsilon | S_0, N_0, E_0) = E_0 / S_0 }$

Harte’s theory is that $R$ maximizes entropy subject to these three constraints. Here entropy is defined by

$\displaystyle{ - \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) \ln(R(n,\epsilon | S_0, N_0, E_0)) }$
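For intuition, here is a small numerical sketch of this constrained maximization, with made-up values of $S_0, N_0, E_0$ and with $\epsilon$ allowed to range over all of $[0,\infty)$ (Harte actually restricts $\epsilon$ to a bounded range, which changes the details). The Lagrange-multiplier solution takes the form $R(n,\epsilon) \propto e^{-\lambda_1 n} e^{-\lambda_2 n \epsilon} ,$ and integrating out $\epsilon$ shows that $\lambda_1$ is fixed by the mean-abundance constraint alone, while the energy constraint then gives $\lambda_2$ in closed form:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical state variables (not from Harte's data):
S0, N0, E0 = 10, 1000, 5000.0

# With R(n, eps) ∝ exp(-lam1*n) * exp(-lam2*n*eps) and eps in [0, inf),
# integrating out eps gives a factor 1/(lam2*n), so the mean number of
# individuals per species depends only on lam1:
n = np.arange(1, N0 + 1)

def mean_abundance(lam1):
    w = np.exp(-lam1 * n)
    return w.sum() / (w / n).sum()

# Solve the constraint mean_abundance(lam1) = N0/S0:
lam1 = brentq(lambda l: mean_abundance(l) - N0 / S0, 1e-7, 1.0)

# The energy constraint then forces the mean of n*eps to be 1/lam2:
lam2 = S0 / E0

# Summing R over eps, the predicted species-abundance distribution is
# a log-series, phi(n) ∝ exp(-lam1*n)/n:
phi = np.exp(-lam1 * n) / n
phi /= phi.sum()
```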

Harte uses this theory to calculate $R,$ and tests the results against data from about 20 ecosystems. For example, he predicts the abundance of species as a function of their rank, with rank 1 being the most abundant, rank 2 being the second most abundant, and so on. And he gets results like this:

The data here are from:

• Green, Harte, and Ostling’s work on a serpentine grassland,

• the Luquillo 10.24-hectare tropical forest plot, and

• the Cocoli 2-hectare wet tropical forest plot.

The fit looks good to me… but I should emphasize that I haven’t had time to study these matters in detail. For more, you can read this paper, at least if your institution subscribes to this journal:

• J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

### Spatial abundance distribution

The second part of Harte’s theory is all about a conditional probability distribution

$\Pi(n | A, n_0, A_0)$

This is the probability that $n$ individuals of a species are found in a region of area $A$ given that it has $n_0$ individuals in a larger region of area $A_0.$

$\Pi$ obeys two constraints. First, since it’s a probability distribution, it obeys the normalization condition:

$\displaystyle{ \sum_n \Pi(n | A, n_0, A_0) = 1 }$

Second, since the mean value of $n$ across regions of area $A$ equals $n_0 A/A_0,$ we have

$\displaystyle{ \sum_n n \Pi(n | A, n_0, A_0) = n_0 A/A_0 }$

Harte’s theory is that $\Pi$ maximizes entropy subject to these two constraints. Here entropy is defined by

$\displaystyle{- \sum_n \Pi(n | A, n_0, A_0)\ln(\Pi(n | A, n_0, A_0)) }$
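Here the MaxEnt calculation can be done almost in closed form: with support $\{0, \dots, n_0\}$ and a single mean constraint, the entropy-maximizing distribution is a truncated geometric one, $\Pi(n) \propto x^n ,$ where $x$ is pinned down by the constraint. A sketch with made-up numbers:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical inputs: n0 individuals in a region of area A0, and a
# cell of area A with A/A0 = 0.25, so the target mean is n0*A/A0 = 25.
n0, frac = 100, 0.25
n = np.arange(n0 + 1)

# Mean of the truncated geometric distribution Pi(n) ∝ x**n:
def mean(x):
    w = x ** n
    return (n * w).sum() / w.sum()

# Solve for x so that the mean equals n0 * A/A0:
x = brentq(lambda t: mean(t) - n0 * frac, 1e-9, 1.0)
Pi = x ** n / (x ** n).sum()
```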

Harte explains two approaches to using this idea to derive ‘scaling laws’ for how species’ abundances vary with the area $A$. And again, he compares his predictions to real-world data, and gets results that look good to my (amateur, hasty) eye!

I hope sometime I can dig deeper into this subject. Do you have any ideas, or knowledge about this stuff?