Information Geometry (Part 11)

7 June, 2012

Last time we saw that given a bunch of different species of self-replicating entities, the entropy of their population distribution can go either up or down as time passes. This is true even in the pathetically simple case where all the replicators have constant fitness—so they don’t interact with each other, and don’t run into any ‘limits to growth’.

This is a bit of a bummer, since it would be nice to use entropy to explain how replicators are always extracting information from their environment, thanks to natural selection.

Luckily, a slight variant of entropy, called ‘relative entropy’, behaves better. When our replicators have an ‘evolutionarily stable state’, the relative entropy is guaranteed to always change in the same direction as time passes!

Thanks to Einstein, we’ve all heard that times and distances are relative. But how is entropy relative?

It’s easy to understand if you think of entropy as lack of information. Say I have a coin hidden under my hand. I tell you it’s heads-up. How much information did I just give you? Maybe 1 bit? That’s true if you know it’s a fair coin and I flipped it fairly before covering it up with my hand. But what if you put the coin down there yourself a minute ago, heads up, and I just put my hand over it? Then I’ve given you no information at all. The difference is the choice of ‘prior’: that is, what probability distribution you attributed to the coin before I gave you my message.

My love affair with relative entropy began in college when my friend Bruce Smith and I read Hugh Everett’s thesis, The Relative State Formulation of Quantum Mechanics. This was the origin of what’s now often called the ‘many-worlds interpretation’ of quantum mechanics. But it also has a great introduction to relative entropy. Instead of talking about ‘many worlds’, I wish people would say that Everett explained some of the mysteries of quantum mechanics using the fact that entropy is relative.

Anyway, it’s nice to see relative entropy showing up in biology.

Relative Entropy

Inscribe an equilateral triangle in a circle. Randomly choose a line segment joining two points of this circle. What is the probability that this segment is longer than a side of the triangle?

This puzzle is called Bertrand’s paradox, because different ways of solving it give different answers. To crack the paradox, you need to realize that it’s meaningless to say you’ll “randomly” choose something until you say more about how you’re going to do it.

In other words, you can’t compute the probability of an event until you pick a recipe for computing probabilities. Such a recipe is called a probability measure.

This applies to computing entropy, too! The formula for entropy clearly involves a probability distribution, even when our set of events is finite:

S = - \sum_i p_i \ln(p_i)

But this formula conceals a fact that becomes obvious when our set of events is infinite. Now the sum becomes an integral:

S = - \int_X p(x) \ln(p(x)) \, d x

And now it’s clear that this formula makes no sense until we choose the measure d x. On a finite set we have a god-given choice of measure, called counting measure. Integrals with respect to this are just sums. But in general we don’t have such a god-given choice. And even for finite sets, working with counting measure is a choice: we are choosing to believe that in the absence of further evidence, all options are equally likely.

Taking this fact into account, it seems like we need two things to compute entropy: a probability distribution p(x), and a measure d x. That’s on the right track. But an even better way to think of it is this:

\displaystyle{ S = - \int_X  \frac{p(x) dx}{dx} \ln \left(\frac{p(x) dx}{dx}\right) \, dx }

Now we see the entropy depends on two measures: the probability measure p(x) dx we care about, but also the measure d x. Their ratio is important, but that’s not enough: we also need one of these measures to do the integral. Above I used the measure dx to do the integral, but we can also use p(x) dx if we write

\displaystyle{ S = - \int_X \ln \left(\frac{p(x) dx}{dx}\right) p(x) dx }

Either way, we are computing the entropy of one measure relative to another. So we might as well admit it, and talk about relative entropy.

The entropy of the measure d \mu relative to the measure d \nu is defined by:

\begin{array}{ccl} S(d \mu, d \nu) &=& \displaystyle{ - \int_X \frac{d \mu(x) }{d \nu(x)} \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right)  d\nu(x) } \\   \\  &=& \displaystyle{ - \int_X  \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right) d\mu(x) } \end{array}

The second formula is simpler, but the first looks more like summing -p \ln(p), so they’re both useful.

Since we’re taking entropy to be lack of information, we can also get rid of the minus sign and define relative information by

\begin{array}{ccl} I(d \mu, d \nu) &=& \displaystyle{ \int_X \frac{d \mu(x) }{d \nu(x)} \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right)  d\nu(x) } \\   \\  &=& \displaystyle{  \int_X  \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right) d\mu(x) } \end{array}

If you thought something was randomly distributed according to the probability measure d \nu, but then you discover it’s randomly distributed according to the probability measure d \mu, how much information have you gained? The answer is I(d\mu,d\nu).

For more on relative entropy, read Part 6 of this series, where I gave some examples illustrating how it works. Those examples should convince you that it’s a useful concept.

Okay: now let’s switch back to a more lowbrow approach. In the case of a finite set, we can revert to thinking of our two measures as probability distributions, and write the information gain as

I(q,p) = \displaystyle{  \sum_i  \ln \left(\frac{q_i}{p_i }\right) q_i}

If you want to sound like a Bayesian, call p the prior probability distribution and q the posterior probability distribution. Whatever you call them, I(q,p) is the amount of information you get if you thought p and someone tells you “no, q!”
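
Here is a minimal numerical sketch of this information gain, the quantity statisticians call the Kullback-Leibler divergence. The particular distributions p and q below are made up purely for illustration:

import numpy as np

def relative_information(q, p):
    # Information gained (in nats) on updating from the prior p to the posterior q.
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

# Example: a prior treating three species as equally likely,
# and a posterior that strongly favors the first one.
p = [1/3, 1/3, 1/3]
q = [0.8, 0.1, 0.1]
print(relative_information(q, p))   # about 0.46 nats

Note that I(q,p) is not symmetric in its arguments, which is why the order of q and p matters here.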

We’ll use this idea to think about how a population gains information about its environment as time goes by, thanks to natural selection. The rest of this post will be an exposition of Theorem 1 in this paper:

• Marc Harper, The replicator equation as an inference dynamic.

Harper says versions of this theorem have previously appeared in work by Ethan Akin, and independently in work by Josef Hofbauer and Karl Sigmund. He also credits others here. An idea this good is rarely noticed by just one person.

The change in relative information

So: consider n different species of replicators. Let P_i be the population of the ith species, and assume these populations change according to the replicator equation:

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

where each function f_i depends smoothly on all the populations. And as usual, we let

\displaystyle{ p_i = \frac{P_i}{\sum_j P_j} }

be the fraction of replicators in the ith species.

Let’s study the relative information I(q,p) where q is some fixed probability distribution. We’ll see something great happens when q is a stable equilibrium solution of the replicator equation. In this case, the relative information can never increase! It can only decrease or stay constant.

We’ll think about what all this means later. First, let’s see that it’s true! Remember,

\begin{array}{ccl} I(q,p) &=& \displaystyle{ \sum_i  \ln \left(\frac{q_i}{ p_i }\right) q_i }  \\ \\ &=&  \displaystyle{ \sum_i  \Big(\ln(q_i) - \ln(p_i) \Big) q_i } \end{array}

and only p_i depends on time, not q_i, so

\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(q,p)}  &=& \displaystyle{ - \frac{d}{dt} \sum_i \ln(p_i)  q_i }\\   \\  &=& \displaystyle{ - \sum_i \frac{\dot{p}_i}{p_i} \, q_i } \end{array}

where \dot{p}_i is the rate of change of the probability p_i. We saw a nice formula for this in Part 9:

\displaystyle{ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle  \Big) \, p_i }

where

f_i(P) = f_i(P_1, \dots, P_n)

and

\displaystyle{ \langle f(P) \rangle = \sum_i f_i(P) p_i  }

is the mean fitness of the species. So, we get

\displaystyle{ \frac{d}{dt} I(q,p) } = \displaystyle{ - \sum_i \Big( f_i(P) - \langle f(P) \rangle  \Big) \, q_i }

Nice, but we can fiddle with this expression to get something more enlightening. Remember, the numbers q_i sum to one. So:

\begin{array}{ccl}  \displaystyle{ \frac{d}{dt} I(q,p) } &=&  \displaystyle{  \langle f(P) \rangle - \sum_i f_i(P) q_i  } \\  \\ &=& \displaystyle{  \sum_i f_i(P) (p_i - q_i)  }  \end{array}

where in the last step I used the definition of the mean fitness. This result looks even cuter if we treat the numbers f_i(P) as the components of a vector f(P), and similarly for the numbers p_i and q_i. Then we can use the dot product of vectors to say

\displaystyle{ \frac{d}{dt} I(q,p) = f(P) \cdot (p - q) }

So, the relative information I(q,p) can never increase if

f(P) \cdot (p - q) \le 0

for all choices of the population P.

And now something really nice happens: this is also the condition for q to be an evolutionarily stable state. This concept goes back to John Maynard Smith, the founder of evolutionary game theory. In 1982 he wrote:

A population is said to be in an evolutionarily stable state if its genetic composition is restored by selection after a disturbance, provided the disturbance is not too large.

I will explain the math next time—I need to straighten out some things in my mind first. But the basic idea is compelling: an evolutionarily stable state is like a situation where our replicators ‘know all there is to know’ about the environment and each other. In any other state, the population has ‘something left to learn’—and the amount left to learn is the relative information we’ve been talking about! But as time goes on, the information still left to learn decreases!

Note: in the real world, nature has never found an evolutionarily stable state… except sometimes approximately, on sufficiently short time scales, in sufficiently small regions. So we are still talking about an idealization of reality! But that’s okay, as long as we know it.
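
Here is a numerical illustration of the theorem, offered as a sketch rather than anything from Harper's paper. I take a standard Hawk-Dove game, where the fitnesses depend only on the frequencies p (a special case of the setup above), so that the mixed state q = (1/2, 1/2) is an evolutionarily stable state, and watch I(q,p) shrink as the replicator equation runs:

import numpy as np

# Hawk-Dove payoff matrix with V = 2, C = 4: the mixed state q = (1/2, 1/2)
# is an evolutionarily stable state.
A = np.array([[-1.0, 2.0],
              [ 0.0, 1.0]])
q = np.array([0.5, 0.5])

def relative_information(q, p):
    return float(np.sum(q * np.log(q / p)))

p = np.array([0.9, 0.1])                # initial frequencies
dt = 0.01
for step in range(2001):
    f = A @ p                           # frequency-dependent fitnesses
    mean_f = f @ p                      # mean fitness
    if step % 500 == 0:
        print(f"t = {step*dt:5.2f}   I(q,p) = {relative_information(q, p):.6f}")
    p += dt * (f - mean_f) * p          # Euler step for the replicator equation
    p /= p.sum()                        # guard against round-off drift

The printed values of I(q,p) decrease steadily toward zero, just as the theorem predicts.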


Information Geometry (Part 10)

4 June, 2012

Last time I began explaining the tight relation between three concepts:

• entropy,

• information—or more precisely, lack of information,

and

• biodiversity.

The idea is to consider n different species of ‘replicators’. A replicator is any entity that can reproduce itself, like an organism, a gene, or a meme. A replicator can come in different kinds, and a ‘species’ is just our name for one of these kinds. If P_i is the population of the ith species, we can interpret the fraction

\displaystyle{ p_i = \frac{P_i}{\sum_j P_j} }

as a probability: the probability that a randomly chosen replicator belongs to the ith species. This suggests that we define entropy just as we do in statistical mechanics:

\displaystyle{ S = - \sum_i p_i \ln(p_i) }

In the study of statistical inference, entropy is a measure of uncertainty, or lack of information. But now we can interpret it as a measure of biodiversity: it’s zero when just one species is present, and small when a few species have much larger populations than all the rest, but gets big otherwise.

Our goal here is to play these viewpoints off against each other. In short, we want to think of natural selection, and even biological evolution, as a process of statistical inference—or in simple terms, learning.

To do this, let’s think about how entropy changes with time. Last time we introduced a simple model called the replicator equation:

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

where each population grows at a rate proportional to some ‘fitness functions’ f_i. We can get some intuition by looking at the pathetically simple case where these functions are actually constants, so

\displaystyle{ \frac{d P_i}{d t} = f_i \, P_i }

The equation then becomes trivial to solve:

\displaystyle{ P_i(t) = e^{t f_i } P_i(0)}

Last time I showed that in this case, the entropy will eventually decrease. It will go to zero as t \to +\infty whenever one species is fitter than all the rest and starts out with a nonzero population—since then this species will eventually take over.

But remember, the entropy of a probability distribution is its lack of information. So the decrease in entropy signals an increase in information. And last time I argued that this makes perfect sense. As the fittest species takes over and biodiversity drops, the population is acquiring information about its environment.

However, I never said the entropy is always decreasing, because that’s false! Even in this pathetically simple case, entropy can increase.

Suppose we start with many replicators belonging to one very unfit species, and a few belonging to various more fit species. The probability distribution p_i will start out sharply peaked, so the entropy will start out low:

Now think about what happens when time passes. At first the unfit species will rapidly die off, while the population of the other species slowly grows:

 

So the probability distribution will, for a while, become less sharply peaked. Thus, for a while, the entropy will increase!

This seems to conflict with our idea that the population’s entropy should decrease as it acquires information about its environment. But in fact this phenomenon is familiar in the study of statistical inference. If you start out with strongly held false beliefs about a situation, the first effect of learning more is to become less certain about what’s going on!

Get it? Say you start out by assigning a high probability to some wrong guess about a situation. The entropy of your probability distribution is low: you’re quite certain about what’s going on. But you’re wrong. When you first start suspecting you’re wrong, you become more uncertain about what’s going on. Your probability distribution flattens out, and the entropy goes up.

So, sometimes learning involves a decrease in information—false information. There’s nothing about the mathematical concept of information that says this information is true.

Given this, it’s good to work out a formula for the rate of change of entropy, which will let us see more clearly when it goes down and when it goes up. To do this, first let’s derive a completely general formula for the time derivative of the entropy of a probability distribution. Following Sir Isaac Newton, we’ll use a dot to stand for a time derivative:

\begin{array}{ccl} \displaystyle{  \dot{S}} &=& \displaystyle{ -  \frac{d}{dt} \sum_i p_i \ln (p_i)} \\   \\  &=& - \displaystyle{ \sum_i \dot{p}_i \ln (p_i) + \dot{p}_i }  \end{array}

In the last term we took the derivative of the logarithm and got a factor of 1/p_i which cancelled the factor of p_i. But since

\displaystyle{  \sum_i p_i = 1 }

we know

\displaystyle{ \sum_i \dot{p}_i = 0 }

so this last term vanishes:

\displaystyle{ \dot{S}= -\sum_i \dot{p}_i \ln (p_i) }

Nice! To go further, we need a formula for \dot{p}_i. For this we might as well return to the general replicator equation, dropping the pathetically special assumption that the fitness functions are actually constants. Then we saw last time that

\displaystyle{ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle  \Big) \, p_i }

where we used the abbreviation

f_i(P) = f_i(P_1, \dots, P_n)

for the fitness of the ith species, and defined the mean fitness to be

\displaystyle{ \langle f(P) \rangle = \sum_i f_i(P) p_i  }

Using this cute formula for \dot{p}_i, we get the final result:

\displaystyle{ \dot{S} = - \sum_i \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i \ln (p_i) }

This is strikingly similar to the formula for entropy itself. But now each term in the sum includes a factor saying how much more fit than average, or less fit, that species is. The quantity - p_i \ln(p_i) is always nonnegative, since the graph of -x \ln(x) looks like this:

So, the ith term contributes positively to the change in entropy if the ith species is fitter than average, but negatively if it’s less fit than average.

This may seem counterintuitive!

Puzzle 1. How can we reconcile this fact with our earlier observations about the case when the fitness of each species is population-independent? Namely: a) if initially most of the replicators belong to one very unfit species, the entropy will rise at first, but b) in the long run, when the fittest species present take over, the entropy drops?

If this seems too tricky, look at some examples! The first illustrates observation a); the second illustrates observation b):

Puzzle 2. Suppose we have two species, one with fitness equal to 1 initially constituting 90% of the population, the other with fitness equal to 10 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 1, & &  p_1(0) = 0.9 \\ \\                            f_2 = 10 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?

Puzzle 3. Suppose we have two species, one with fitness equal to 10 initially constituting 90% of the population, and the other with fitness equal to 1 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 10, & &  p_1(0) = 0.9 \\ \\                            f_2 = 1 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?
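
If you want to check your answers, here is a little sketch that evaluates the formula for \dot{S} term by term, using exactly the numbers from Puzzles 2 and 3:

import numpy as np

def entropy_rate_terms(f, p):
    # Each species' contribution -(f_i - <f>) p_i ln(p_i) to dS/dt.
    f, p = np.asarray(f, dtype=float), np.asarray(p, dtype=float)
    mean_f = np.dot(f, p)
    return -(f - mean_f) * p * np.log(p)

for f, p in [((1.0, 10.0), (0.9, 0.1)),    # Puzzle 2
             ((10.0, 1.0), (0.9, 0.1))]:   # Puzzle 3
    terms = entropy_rate_terms(f, p)
    print("per-species terms:", terms, "   dS/dt =", terms.sum())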

I had to work through these examples to understand what’s going on. Now I do, and it all makes sense.

Next time

Still, it would be nice if there were some quantity that always goes down with the passage of time, reflecting our naive idea that the population gains information from its environment, and thus loses entropy, as time goes by.

Often there is such a quantity. But it’s not the naive entropy: it’s the relative entropy. I’ll talk about that next time. In the meantime, if you want to prepare, please reread Part 6 of this series, where I explained this concept. Back then, I argued that whenever you’re tempted to talk about entropy, you should talk about relative entropy. So, we should try that here.

There’s a big idea lurking here: information is relative. How much information a signal gives you depends on your prior assumptions about what that signal is likely to be. If this is true, perhaps biodiversity is relative too.


Information Geometry (Part 9)

1 June, 2012


It’s time to continue this information geometry series, because I’ve promised to give the following talk at a conference on the mathematics of biodiversity in early July… and I still need to do some of the research!

Diversity, information geometry and learning

As is well known, some measures of biodiversity are formally identical to measures of information developed by Shannon and others. Furthermore, Marc Harper has shown that the replicator equation in evolutionary game theory is formally identical to a process of Bayesian inference, which is studied in the field of machine learning using ideas from information geometry. Thus, in this simple model, a population of organisms can be thought of as a ‘hypothesis’ about how to survive, and natural selection acts to update this hypothesis according to Bayes’ rule. The question thus arises to what extent natural changes in biodiversity can be usefully seen as analogous to a form of learning. However, some of the same mathematical structures arise in the study of chemical reaction networks, where the increase of entropy, or more precisely decrease of free energy, is not usually considered a form of ‘learning’. We report on some preliminary work on these issues.

So, let’s dive in! To some extent I’ll be explaining these two papers:

• Marc Harper, Information geometry and evolutionary game theory.

• Marc Harper, The replicator equation as an inference dynamic.

However, I hope to bring in some more ideas from physics, the study of biodiversity, and the theory of stochastic Petri nets, also known as chemical reaction networks. So, this series may start to overlap with my network theory posts. We’ll see. We won’t get far today: for now, I just want to review and expand on what we did last time.

The replicator equation

The replicator equation is a simplified model of how populations change. Suppose we have n types of self-replicating entity. I’ll call these entities replicators. I’ll call the types of replicators species, but they don’t need to be species in the biological sense. For example, the replicators could be genes, and the types could be alleles. Or the replicators could be restaurants, and the types could be restaurant chains.

Let P_i(t), or just P_i for short, be the population of the ith species at time t. Then the replicator equation says

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

So, the population P_i changes at a rate proportional to P_i, but the ‘constant of proportionality’ need not be constant: it can be any smooth function f_i of the populations of all the species. We call f_i(P_1, \dots, P_n) the fitness of the ith species.

Of course this model is absurdly general, while still leaving out lots of important effects, like the spatial variation of populations, or the ability for the population of some species to start at zero and become nonzero—which happens thanks to mutation. Nonetheless this model is worth taking a good look at.

Using the magic of vectors we can write

P = (P_1, \dots , P_n)

and

f(P) = (f_1(P), \dots, f_n(P))

This lets us write the replicator equation a wee bit more tersely as

\displaystyle{ \frac{d P}{d t} = f(P) P}

where on the right I’m multiplying vectors componentwise, the way your teachers tried to brainwash you into never doing:

f(P) P = (f(P)_1 P_1, \dots, f(P)_n P_n)

In other words, I’m thinking of P and f(P) as functions on the set \{1, \dots, n\} and multiplying them pointwise. This will be a nice way of thinking if we want to replace this finite set by some more general space.

Why would we want to do that? Well, we might be studying lizards with different length tails, and we might find it convenient to think of the set of possible tail lengths as the half-line [0,\infty) instead of a finite set.

Or, just to get started, we might want to study the pathetically simple case where f(P) doesn’t depend on P. Then we just have a fixed function f and a time-dependent function P obeying

\displaystyle{ \frac{d P}{d t} = f P}

If we’re physicists, we might write P more suggestively as \psi and write the operator multiplying by f as - H. Then our equation becomes

\displaystyle{ \frac{d \psi}{d t} = - H \psi }

This looks a lot like Schrödinger’s equation, but since there’s no factor of \sqrt{-1}, and \psi is real-valued, it’s more like the heat equation or the ‘master equation’, the basic equation of stochastic mechanics.

For an explanation of Schrödinger’s equation and the master equation, try Part 12 of the network theory series. In that post I didn’t include a minus sign in front of the H. That’s no big deal: it’s just a different convention than the one I want today. A more serious issue is that in stochastic mechanics, \psi stands for a probability distribution. This suggests that we should get probabilities into the game somehow.

The replicator equation in terms of probabilities

Luckily, that’s exactly what people usually do! Instead of talking about the population P_i of the ith species, they talk about the probability p_i that one of our organisms will belong to the ith species. This amounts to normalizing our populations:

\displaystyle{  p_i = \frac{P_i}{\sum_j P_j} }

Don’t you love it when notations work out well? Our big Population P_i has gotten normalized to give little probability p_i.

How do these probabilities p_i change with time? Now is the moment for that least loved rule of elementary calculus to come out and take a bow: the quotient rule for derivatives!

\displaystyle{ \frac{d p_i}{d t} = \left(\frac{d P_i}{d t} \sum_j P_j \quad - \quad P_i \sum_j \frac{d P_j}{d t}\right) \big{/} \left(  \sum_j P_j \right)^2 }

Using our earlier version of the replicator equation, this gives:

\displaystyle{ \frac{d p_i}{d t} =  \left(f_i(P) P_i \sum_j P_j \quad - \quad P_i \sum_j f_j(P) P_j \right) \big{/} \left(  \sum_j P_j \right)^2 }

Using the definition of p_i, this simplifies to:

\displaystyle{ \frac{d p_i}{d t} =  f_i(P) p_i \quad - \quad \left( \sum_j f_j(P) p_j \right) p_i }

The stuff in parentheses actually has a nice meaning: it’s just the mean fitness. In other words, it’s the average, or expected, fitness of an organism chosen at random from the whole population. Let’s write it like this:

\displaystyle{ \langle f(P) \rangle = \sum_j f_j(P) p_j  }

So, we get the replicator equation in its classic form:

\displaystyle{ \frac{d p_i}{d t} = \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i }

This has a nice meaning: for the fraction of organisms of the ith type to increase, their fitness must exceed the mean fitness. If you’re trying to increase market share, what matters is not how good you are, but how much better than average you are. If everyone else is lousy, you’re in luck.
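
Here is a quick numerical sanity check of this derivation (just a sketch, with made-up fitness functions describing three species that all suffer equally from crowding). It integrates the populations P_i and then compares the observed rate of change of the frequencies p_i with the right-hand side of the equation above:

import numpy as np

def fitness(P):
    # Made-up fitnesses: intrinsic growth rates minus a shared crowding penalty.
    return np.array([1.0, 1.5, 0.7]) - 0.01 * P.sum()

P = np.array([10.0, 5.0, 20.0])
dt = 1e-4
for _ in range(10000):
    P = P + dt * fitness(P) * P          # replicator equation for the populations

p = P / P.sum()                          # frequencies
f = fitness(P)

# Estimate dp/dt from one more small step, and compare with (f_i - <f>) p_i:
P2 = P + dt * fitness(P) * P
p2 = P2 / P2.sum()
print("numerical dp/dt:", (p2 - p) / dt)
print("(f - <f>) p    :", (f - f @ p) * p)

The two printed lines agree up to terms of order dt, as the quotient-rule calculation says they should.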

Entropy

Now for something a bit new. Once we’ve gotten a probability distribution into the game, its entropy is sure to follow:

\displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) }

This says how ‘smeared-out’ the overall population is among the various different species. Alternatively, it says how much information it takes, on average, to say which species a randomly chosen organism belongs to. For example, if there are 2^N species, all with equal populations, the entropy S works out to N \ln 2. So in this case, it takes N bits of information to say which species a randomly chosen organism belongs to.
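
Here is that count as a quick check, using base-2 logarithms so the answer comes out in bits:

import numpy as np

N = 5
p = np.full(2**N, 1 / 2**N)          # 2^N species with equal populations
print(-np.sum(p * np.log2(p)))       # 5.0, i.e. N bits, or N ln 2 nats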

In biology, entropy is one of many ways people measure biodiversity. For a quick intro to some of the issues involved, try:

• Tom Leinster, Measuring biodiversity, Azimuth, 7 November 2011.

• Lou Jost, Entropy and diversity, Oikos 113 (2006), 363–375.

But we don’t need to understand this stuff to see how entropy is connected to the replicator equation. Marc Harper’s paper explains this in detail:

• Marc Harper, The replicator equation as an inference dynamic.

and I hope to go through quite a bit of it here. But not today! Today I just want to look at a pathetically simple, yet still interesting, example.

Exponential growth

Suppose the fitness of each species is independent of the populations of all the species. In other words, suppose each fitness f_i(P) is actually a constant, say f_i. Then the replicator equation reduces to

\displaystyle{ \frac{d P_i}{d t} = f_i \, P_i }

so it’s easy to solve:

P_i(t) = e^{t f_i} P_i(0)

You don’t need a detailed calculation to see what’s going to happen to the probabilities

\displaystyle{ p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)}}

The most fit species present will eventually take over! If one species, say the ith one, has a fitness greater than the rest, then the population of this species will eventually grow faster than all the rest, at least if its population starts out greater than zero. So as t \to +\infty, we’ll have

p_i(t) \to 1

and

p_j(t) \to 0 \quad \mathrm{for} \quad j \ne i

Thus the probability distribution p will become more sharply peaked, and its entropy will eventually approach zero.

With a bit more thought you can see that even if more than one species shares the maximum possible fitness, the entropy will eventually decrease, though not approach zero.

In other words, the biodiversity will eventually drop as all but the most fit species are overwhelmed. Of course, this is only true in our simple idealization. In reality, biodiversity behaves in more complex ways—in part because species interact, and in part because mutation tends to smear out the probability distribution p_i. We’re not looking at these effects yet. They’re extremely important… in ways we can only fully understand if we start by looking at what happens when they’re not present.

In still other words, the population will absorb information from its environment. This should make intuitive sense: the process of natural selection resembles ‘learning’. As fitter organisms become more common and less fit ones die out, the environment puts its stamp on the probability distribution p. So, this probability distribution should gain information.

While intuitively clear, this last claim also follows more rigorously from thinking of entropy as negative information. Admittedly, it’s always easy to get confused by minus signs when relating entropy and information. A while back I said the entropy

\displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) }

was the average information required to say which species a randomly chosen organism belongs to. If this entropy is going down, isn’t the population losing information?

No, this is a classic sign error. It’s like the concept of ‘work’ in physics. We can talk about the work some system does on its environment, or the work done by the environment on the system, and these are almost the same… except one is minus the other!

When you are very ignorant about some system—say, some rolled dice—your estimated probabilities p_i for its various possible states are very smeared-out, so the entropy S(p) is large. As you gain information, you revise your probabilities and they typically become more sharply peaked, so S(p) goes down. When you know as much as you possibly can, S(p) equals zero.

So, the entropy S(p) is the amount of information you have left to learn: the amount of information you lack, not the amount you have. As you gain information, this goes down. There’s no paradox here.

It works the same way with our population of replicators—at least in the special case where the fitness of each species is independent of its population. The probability distribution p is like a ‘hypothesis’ assigning to each species i the probability p_i that it’s the best at self-replicating. As some replicators die off while others prosper, they gather information about their environment, and this hypothesis gets refined. So, the entropy S(p) drops.
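
Here is a small numerical illustration of this story, with made-up constant fitnesses and a made-up initial population dominated by the least fit species. The exact solution P_i(t) = e^{t f_i} P_i(0) lets us watch the entropy evolve:

import numpy as np

f  = np.array([1.0, 2.0, 3.0])       # constant fitnesses
P0 = np.array([90.0, 5.0, 5.0])      # the least fit species dominates at t = 0

def entropy_at(t):
    P = np.exp(t * f) * P0           # exact solution of dP_i/dt = f_i P_i
    p = P / P.sum()
    return float(-np.sum(p * np.log(p)))

for t in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]:
    print(f"t = {t:4.1f}   S = {entropy_at(t):.4f}")

Notice that the entropy actually rises at first before dropping toward zero; that is the subtlety taken up in the next post.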

Next time

Of course, to make closer contact to reality, we need to go beyond the special case where the fitness of each species is a constant! Marc Harper does this, and I want to talk about his work someday, but first I have a few more remarks to make about the pathetically simple special case I’ve been focusing on. I’ll save these for next time, since I’ve probably strained your patience already.


Quantropy (Part 2)

10 February, 2012

In my first post in this series, we saw that filling in a well-known analogy between statistical mechanics and quantum mechanics requires a new concept: ‘quantropy’. To get some feeling for this concept, we should look at some examples. But to do that, we need to develop some tools to compute quantropy. That’s what we’ll do today.

All these tools will be borrowed from statistical mechanics. So, let me remind you how to compute the entropy of a system in thermal equilibrium if we know the energy of every state. Then we’ll copy this and get a formula for the quantropy of a system if we know the action of every history.

Computing entropy

Everything in this section is bog-standard. In case you don’t know, that’s British slang for ‘extremely, perhaps even depressingly, familiar’. Apparently it rains so much in England that bogs are not only standard, they’re the standard of what counts as standard!

Let X be a measure space: physically, the set of states of some system. In statistical mechanics we suppose the system occupies states with probabilities given by some probability distribution

p : X \to [0,\infty)

where of course

\int_X p(x) \, dx = 1

The entropy of this probability distribution is

S = - \int_X p(x) \ln(p(x)) \, dx

There’s a nice way to compute the entropy when our system is in thermal equilibrium. This idea makes sense when we have a function

H : X \to \mathbb{R}

saying the energy of each state. Our system is in thermal equilibrium when p maximizes entropy subject to a constraint on the expected value of energy:

\langle H \rangle = \int_X H(x) p(x) \, dx

A famous calculation shows that thermal equilibrium occurs precisely when p is the so-called Gibbs state:

\displaystyle{ p(x) = \frac{e^{-\beta H(x)}}{Z} }

for some real number \beta, where Z is a normalization factor called the partition function:

Z = \int_X e^{-\beta H(x)} \, dx

The number \beta is called the coolness, since physical considerations say that

\displaystyle{ \beta = \frac{1}{T} }

where T is the temperature in units where Boltzmann’s constant is 1.

There’s a famous way to compute the entropy of the Gibbs state; I don’t know who did it first, but it’s both straightforward and tremendously useful. We take the formula for entropy

S = - \int_X p(x) \ln(p(x)) \, dx

and substitute the Gibbs state

\displaystyle{ p(x) = \frac{e^{-\beta H(x)}}{Z} }

getting

\begin{array}{ccl} S &=& \int_X p(x) \left( \beta H(x) + \ln Z \right)\, dx \\   \\  &=& \beta \, \langle H \rangle + \ln Z \end{array}

Reshuffling this a little bit, we obtain:

- T \ln Z = \langle H \rangle - T S

If we define the free energy by

F = - T \ln Z

then we’ve shown that

F = \langle H \rangle - T S

This justifies the term ‘free energy’: it’s the expected energy minus the energy in the form of heat, namely T S.

It’s nice that we can compute the free energy purely in terms of the partition function and the temperature, or equivalently the coolness \beta:

\displaystyle{ F = - \frac{1}{\beta} \ln Z }

Can we also do this for the entropy? Yes! First we’ll do it for the expected energy:

\begin{array}{ccl} \langle H \rangle &=& \displaystyle{ \int_X H(x) p(x) \, dx } \\   \\  &=& \displaystyle{ \frac{1}{Z} \int_X H(x) e^{-\beta H(x)} \, dx } \\   \\  &=& \displaystyle{ -\frac{1}{Z} \frac{d}{d \beta} \int_X e^{-\beta H(x)} \, dx } \\ \\  &=& \displaystyle{ -\frac{1}{Z} \frac{dZ}{d \beta} } \\ \\  &=& \displaystyle{ - \frac{d}{d \beta} \ln Z } \end{array}

This gives

\begin{array}{ccl} S &=& \beta \, \langle H \rangle + \ln Z \\ \\ &=& \displaystyle{ \ln Z - \beta \, \frac{d \ln Z}{d \beta} }\end{array}

So, if we know the partition function of a system in thermal equilibrium as a function of the temperature, we can work out its entropy, expected energy and free energy.
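
As a sanity check, here is a tiny sketch for a system with three made-up energy levels. It computes the entropy twice: directly from -\sum p \ln p, and again from \ln Z - \beta \, d(\ln Z)/d\beta, with the derivative taken numerically:

import numpy as np

H = np.array([0.0, 1.0, 2.0])       # made-up energy levels
beta = 1.3                          # coolness, 1/T

def log_Z(b):
    return np.log(np.sum(np.exp(-b * H)))

p = np.exp(-beta * H - log_Z(beta))              # Gibbs state p(x) = exp(-beta H(x))/Z

S_direct = -np.sum(p * np.log(p))                # entropy from its definition

eps = 1e-6
dlogZ = (log_Z(beta + eps) - log_Z(beta - eps)) / (2 * eps)
S_from_Z = log_Z(beta) - beta * dlogZ            # entropy from the partition function

print(S_direct, S_from_Z)                        # the two values agree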

Computing quantropy

Now we’ll repeat everything for quantropy! The idea is simply to replace the energy by action and the temperature T by i \hbar where \hbar is Planck’s constant. It’s harder to get the integrals to converge in interesting examples. But we’ll worry about that next time, when we actually do an example!

It’s annoying that in physics S stands for both entropy and action, since in this article we need to think about both. People also use H to stand for entropy, but that’s no better, since that letter also stands for ‘Hamiltonian’! To avoid this let’s use A to stand for action. This letter is also used to mean ‘Helmholtz free energy’, but we’ll just have to live with that. It would be a real bummer if we failed to unify physics just because we ran out of letters.

Let X be a measure space: physically, the set of histories of some system. In quantum mechanics we suppose the system carries out histories with amplitudes given by some function

a : X \to \mathbb{C}

where perhaps surprisingly

\int_X a(x) \, dx = 1

The quantropy of this function is

Q = - \int_X a(x) \ln(a(x)) \, dx

There’s a nice way to compute the quantropy in Feynman’s path integral formalism. This formalism makes sense when we have a function

A : X \to \mathbb{R}

saying the action of each history. Feynman proclaimed that in this case we have

\displaystyle{ a(x) = \frac{e^{i A(x)/\hbar}}{Z} }

where \hbar is Planck’s constant and Z is a normalization factor called the partition function:

Z = \int_X e^{i A(x)/\hbar} \, dx

Last time I showed that we obtain Feynman’s prescription for a by demanding that it’s a stationary point for the quantropy

Q = - \int_X a(x) \, \ln (a(x)) \, dx

subject to a constraint on the expected action:

\langle A \rangle = \int_X A(x) a(x) \, dx

As I mentioned last time, the formula for quantropy is dangerous, since we’re taking the logarithm of a complex-valued function. There’s not really a ‘best’ logarithm for a complex number: if we have one choice we can add any multiple of 2 \pi i and get another. So in general, to define quantropy we need to pick a choice of \ln (a(x)) for each point x \in X. That’s a lot of ambiguity!

Luckily, the ambiguity is much less when we use Feynman’s prescription for a. Why? Because then a(x) is defined in terms of an exponential, and it’s easy to take the logarithm of an exponential! So, we can declare that

\ln (a(x)) = \displaystyle{ \ln \left( \frac{e^{iA(x)/\hbar}}{Z}\right) } = \frac{i}{\hbar} A(x) - \ln Z

Once we choose a logarithm for Z, this formula will let us define \ln (a(x)) and thus the quantropy.

So let’s do this, and say the quantropy is

\displaystyle{ Q = - \int_X a(x) \left( \frac{i}{\hbar} A(x) - \ln Z \right)\, dx }

We can simplify this a bit, since the integral of a is 1:

\displaystyle{ Q = \frac{1}{i \hbar} \langle A \rangle + \ln Z }

Reshuffling this a little bit, we obtain:

- i \hbar \ln Z = \langle A \rangle - i \hbar Q

By analogy to free energy in statistical mechanics, let’s define the free action by

F = - i \hbar \ln Z

I’m using the same letter for free energy and free action, but they play exactly analogous roles, so it’s not so bad. Indeed we now have

F = \langle A \rangle - i \hbar Q

which is the analogue of a formula we saw for free energy in thermodynamics.

It’s nice that we can compute the free action purely in terms of the partition function and Planck’s constant. Can we also do this for the quantropy? Yes!

It’ll be convenient to introduce a parameter

\displaystyle{ \beta = \frac{1}{i \hbar} }

which is analogous to ‘coolness’. We could call it ‘quantum coolness’, but a better name might be classicality, since it’s big when our system is close to classical. Whatever we call it, the main thing is that unlike ordinary coolness, it’s imaginary!

In terms of classicality, we have

\displaystyle{ a(x) = \frac{e^{- \beta A(x)}}{Z} }

Now we can compute the expected action just as we computed the expected energy in thermodynamics:

\begin{array}{ccl} \langle A \rangle &=& \displaystyle{ \int_X A(x) a(x) \, dx } \\ \\  &=& \displaystyle{ \frac{1}{Z} \int_X A(x) e^{-\beta A(x)} \, dx } \\   \\  &=& \displaystyle{ -\frac{1}{Z} \frac{d}{d \beta} \int_X e^{-\beta A(x)} \, dx } \\ \\  &=& \displaystyle{ -\frac{1}{Z} \frac{dZ}{d \beta} } \\ \\  &=& \displaystyle{ - \frac{d}{d \beta} \ln Z } \end{array}

This gives:

\begin{array}{ccl} Q &=& \beta \,\langle A \rangle + \ln Z \\ \\ &=& \displaystyle{ \ln Z - \beta \,\frac{d \ln Z}{d \beta} } \end{array}

So, if we can compute the partition function in the path integral approach to quantum mechanics, we can also work out the quantropy, expected action and free action!
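
These formulas make sense even for a toy ‘path integral’ over finitely many histories, which makes them easy to test numerically. Here is a sketch with three made-up actions: it computes the quantropy directly from -\sum a \ln a, using Feynman’s choice of logarithm, and again from \ln Z - \beta \, d(\ln Z)/d\beta with \beta = 1/(i \hbar):

import numpy as np

A    = np.array([0.0, 1.0, 2.5])    # made-up actions for three histories
hbar = 1.0
beta = 1.0 / (1j * hbar)            # 'classicality': an imaginary number

def log_Z(b):
    return np.log(np.sum(np.exp(-b * A)))

a = np.exp(-beta * A - log_Z(beta))              # Feynman amplitudes, summing to 1

# Quantropy computed directly, with ln a(x) chosen as -beta A(x) - ln Z:
Q_direct = -np.sum(a * (-beta * A - log_Z(beta)))

# Quantropy computed from the partition function:
eps = 1e-6
dlogZ = (log_Z(beta + eps) - log_Z(beta - eps)) / (2 * eps)
Q_from_Z = log_Z(beta) - beta * dlogZ

print(Q_direct, Q_from_Z)                        # the two complex values agree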

Next time I’ll use these formulas to compute quantropy in an example: the free particle. We’ll see some strange and interesting things.

Summary

Here’s where our analogy stands now:

Statistical Mechanics | Quantum Mechanics
states: x \in X | histories: x \in X
probabilities: p: X \to [0,\infty) | amplitudes: a: X \to \mathbb{C}
energy: H: X \to \mathbb{R} | action: A: X \to \mathbb{R}
temperature: T | Planck’s constant times i: i \hbar
coolness: \beta = 1/T | classicality: \beta = 1/(i \hbar)
partition function: Z = \sum_{x \in X} e^{-\beta H(x)} | partition function: Z = \sum_{x \in X} e^{-\beta A(x)}
Boltzmann distribution: p(x) = e^{-\beta H(x)}/Z | Feynman sum over histories: a(x) = e^{-\beta A(x)}/Z
entropy: S = - \sum_{x \in X} p(x) \ln(p(x)) | quantropy: Q = - \sum_{x \in X} a(x) \ln(a(x))
expected energy: \langle H \rangle = \sum_{x \in X} p(x) H(x) | expected action: \langle A \rangle = \sum_{x \in X} a(x) A(x)
free energy: F = \langle H \rangle - TS | free action: F = \langle A \rangle - i \hbar Q
\langle H \rangle = - \frac{d}{d \beta} \ln Z | \langle A \rangle = - \frac{d}{d \beta} \ln Z
F = -\frac{1}{\beta} \ln Z | F = -\frac{1}{\beta} \ln Z
S = \ln Z - \beta \,\frac{d}{d \beta}\ln Z | Q = \ln Z - \beta \,\frac{d}{d \beta}\ln Z

I should also say a word about units and dimensional analysis. There’s enormous flexibility in how we do dimensional analysis. Amateurs often don’t realize this, because they’ve just learned one system, but experts take full advantage of this flexibility to pick a setup that’s convenient for what they’re doing. The fewer independent units you use, the fewer dimensionful constants like the speed of light, Planck’s constant and Boltzmann’s constant you see in your formulas. That’s often good. But here I don’t want to set Planck’s constant equal to 1 because I’m treating it as analogous to temperature—so it’s important, and I want to see it. I’m also finding dimensional analysis useful to check my formulas.

So, I’m using units where mass, length and time count as independent dimensions in the sense of dimensional analysis. On the other hand, I’m not treating temperature as an independent dimension: instead, I’m setting Boltzmann’s constant to 1 and using that to translate from temperature into energy. This is fairly common in some circles. And for me, treating temperature as an independent dimension would be analogous to treating Planck’s constant as having its own independent dimension! I don’t feel like doing that.

So, here’s how the dimensional analysis works in my setup:

Statistical Mechanics | Quantum Mechanics
probabilities: dimensionless | amplitudes: dimensionless
energy: ML^2/T^2 | action: ML^2/T
temperature: ML^2/T^2 | Planck’s constant: ML^2/T
coolness: T^2/(ML^2) | classicality: T/(ML^2)
partition function: dimensionless | partition function: dimensionless
entropy: dimensionless | quantropy: dimensionless
expected energy: ML^2/T^2 | expected action: ML^2/T
free energy: ML^2/T^2 | free action: ML^2/T

I like this setup because I often think of entropy as closely allied to information, measured in bits or nats depending on whether I’m using base 2 or base e. From this viewpoint, it should be dimensionless.

Of course, in thermodynamics it’s common to put a factor of Boltzmann’s constant in front of the formula for entropy. Then entropy has units of energy/temperature. But I’m using units where Boltzmann’s constant is 1 and temperature has the same units as energy! So for me, entropy is dimensionless.


Entropic Forces

1 February, 2012

 

In 2009, Erik Verlinde argued that gravity is an entropic force. This created a big stir—and it helped him win about $6,500,000 in prize money and grants! But what the heck is an ‘entropic force’, anyway?

Entropic forces are nothing unusual: you’ve felt one if you’ve ever stretched a rubber band. Why does a rubber band pull back when you stretch it? You might think it’s because a stretched rubber band has more energy than an unstretched one. That would indeed be a fine explanation for a metal spring. But rubber doesn’t work that way. Instead, a stretched rubber band mainly has less entropy than an unstretched one—and this too can cause a force.

You see, molecules of rubber are like long chains. When unstretched, these chains can curl up in lots of random wiggly ways. ‘Lots of random ways’ means lots of entropy. But when you stretch one of these chains, the number of ways it can be shaped decreases, until it’s pulled taut and there’s just one way! Only past that point does stretching the molecule take a lot of energy; before that, you’re mainly decreasing its entropy.

So, the force of a stretched rubber band is an entropic force.

But how can changes in either energy or entropy give rise to forces? That’s what I want to explain. But instead of talking about force, I’ll start out talking about pressure. This too arises both from changes in energy and changes in entropy.

Entropic pressure — a sloppy derivation

If you’ve ever studied thermodynamics you’ve probably heard about an ideal gas. You can think of this as a gas consisting of point particles that almost never collide with each other—because they’re just points—and bounce elastically off the walls of the container they’re in. If you have a box of gas like this, it’ll push on the walls with some pressure. But the cause of this pressure is not that slowly making the box smaller increases the energy of the gas inside: in fact, it doesn’t! The cause is that making the box smaller decreases the entropy of the gas.

To understand how pressure has an ‘energetic’ part and an ‘entropic’ part, let’s start with the basic equation of thermodynamics:

d U = T d S - P d V

What does this mean? It means the internal energy U of a box of stuff changes when you heat or cool it, meaning that you change its entropy S, but also when you shrink or expand it, meaning that you change its volume V. Increasing its entropy raises its internal energy at a rate proportional to its temperature T. Increasing its volume lowers its internal energy at a rate proportional to its pressure P.

We can already see that both changes in energy, U, and entropy, S, can affect P d V. Pressure is like force—indeed it’s just force per area—so we should try to solve for P.

First let’s do it in a sloppy way. One reason people don’t like thermodynamics is that they don’t understand partial derivatives when there are lots of different coordinate systems floating around—which is what thermodynamics is all about! So, they manipulate these partial derivatives sloppily, feeling a sense of guilt and unease, and sometimes it works, but other times it fails disastrously. The cure is not to learn more thermodynamics; the cure is to learn about differential forms. All the expressions in the basic equation d U = T d S - P d V are differential forms. If you learn what they are and how to work with them, you’ll never get in trouble with partial derivatives in thermodynamics—as long as you proceed slowly and carefully.

But let’s act like we don’t know this! Let’s start with the basic equation

d U = T d S - P d V

and solve for P. First we get

P d V = T d S - d U

This is fine. Then we divide by d V and get

\displaystyle{ P = T \frac{d S}{d V} - \frac{d U}{d V} }

This is not so fine: here the guilt starts to set in. After all, we’ve been told that we need to use ‘partial derivatives’ when we have functions of several variables—and the main fact about partial derivatives, the one that everybody remembers, is that these are written with curly d’s, not ordinary letter d’s. So we must have done something wrong. So, we make the d’s curly:

\displaystyle{ P = T \frac{\partial S}{\partial V} - \frac{\partial U}{\partial V} }

But we still feel guilty. First of all, who gave us the right to make those d’s curly? Second of all, a partial derivative like \frac{\partial S}{\partial V} makes no sense unless V is one of a set of coordinate functions: only then we can talk about how much some function changes as we change V while keeping the other coordinates fixed. The value of \frac{\partial S}{\partial V} actually depends on what other coordinates we’re keeping fixed! So what coordinates are we using?

Well, it seems like one of them is V, and the other is… we don’t know! It could be S, or P, or T, or perhaps even U. This is where real unease sets in. If we’re taking a test, we might in desperation think something like this: “Since the easiest things to control about our box of stuff are its volume and its temperature, let’s take these as our coordinates!” And then we might write

\displaystyle{ P = T \left.\frac{\partial S}{\partial V}\right|_T - \left.\frac{\partial U}{\partial V}\right|_T }

And then we might do okay on this problem, because this formula is in fact correct! But I hope you agree that this is an unsatisfactory way to manipulate partial derivatives: we’re shooting in the dark and hoping for luck.

Entropic pressure and entropic force

So, I want to show you a better way to get this result. But first let’s take a break and think about what it means. It means there are two possible reasons a box of gas may push back with pressure as we try to squeeze it smaller while keeping its temperature constant. One is that the energy may go up:

\displaystyle{ -\left.\frac{\partial U}{\partial V}\right|_T }

will be positive if the internal energy goes up as we squeeze the box smaller. But the other reason is that entropy may go down:

\displaystyle{  T \left.\frac{\partial S}{\partial V}\right|_T }

will be positive if the entropy goes down as we squeeze the box smaller, assuming T > 0.

Let’s turn this fact into a result about force. Remember that pressure is just force per area. Say we have some stuff in a cylinder with a piston on top. Say the position of the piston is given by some coordinate x, and its area is A. Then the stuff will push on the piston with a force

F = P A

and the change in the cylinder’s volume as the piston moves is

d V = A d x

Then

\displaystyle{  P = T \left.\frac{\partial S}{\partial V}\right|_T - \left.\frac{\partial U}{\partial V}\right|_T }

gives us

\displaystyle{ F = T \left.\frac{\partial S}{\partial x}\right|_T - \left.\frac{\partial U}{\partial x}\right|_T }

So, the force consists of two parts: the energetic force

\displaystyle{ F_{\mathrm{energetic}} = - \left.\frac{\partial U}{\partial x}\right|_T }

and the entropic force:

\displaystyle{ F_{\mathrm{entropic}} =  T \left.\frac{\partial S}{\partial x}\right|_T}

Energetic forces are familiar from classical statics: for example, a rock pushes down on the table because its energy would decrease if it could go down. Entropic forces enter the game when we generalize to thermal statics, as we’re doing now. But when we set T = 0, these entropic forces go away and we’re back to classical statics!
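
To see an entropic force in action, here is a sketch using a standard textbook model of the rubber band (not something derived in this post): an ideal Gaussian chain of N links of length b, whose entropy at end-to-end extension x is a constant minus 3x^2/(2 N b^2) in units where Boltzmann’s constant is 1, and whose internal energy doesn’t depend on x at all.

import sympy as sp

T, x, N, b = sp.symbols('T x N b', positive=True)

S = -3 * x**2 / (2 * N * b**2)         # chain entropy, up to an x-independent constant
U = sp.Integer(0)                      # idealized: stretching costs no energy

F = T * sp.diff(S, x) - sp.diff(U, x)  # F = T dS/dx - dU/dx at fixed temperature
print(F)                               # -3*T*x/(N*b**2)

The result is a spring-like restoring force, purely entropic in origin, and it gets stronger as the temperature rises—which is indeed what real rubber does.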

Entropic pressure—a better derivation

Okay, enough philosophizing. To conclude, let’s derive

\displaystyle{ P = T \left.\frac{\partial S}{\partial V}\right|_T - \left.\frac{\partial U}{\partial V}\right|_T }

in a less sloppy way. We start with

d U = T d S - P d V

which is true no matter what coordinates we use. We can choose 2 of the 5 variables here as local coordinates, generically at least, so let’s choose V and T. Then

\displaystyle{ d U = \left.\frac{\partial U}{\partial V}\right|_T d V + \left.\frac{\partial U}{\partial T}\right|_V d T }

and similarly

\displaystyle{ d S = \left.\frac{\partial S}{\partial V}\right|_T d V + \left.\frac{\partial S}{\partial T}\right|_V d T }

Using these, our equation

d U = T d S - P d V

becomes

\displaystyle{ \left.\frac{\partial U}{\partial V}\right|_T d V + \left.\frac{\partial U}{\partial T}\right|_V d T = T \left(\left.\frac{\partial S}{\partial V}\right|_T d V + \left.\frac{\partial S}{\partial T}\right|_V d T \right) - P dV }

If you know about differential forms, you know that the differentials of the coordinate functions, namely d T and d V, form a basis of 1-forms. Thus we can equate the coefficients of d V in the equation above and get:

\displaystyle{ \left.\frac{\partial U}{\partial V}\right|_T = T \left.\frac{\partial S}{\partial V}\right|_T - P }

and thus:

\displaystyle{ P = T \left.\frac{\partial S}{\partial V}\right|_T - \left.\frac{\partial U}{\partial V}\right|_T }

which is what we wanted! There should be no bitter aftertaste of guilt this time.
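
As a quick computer-algebra check, here is a sketch applying this formula to an ideal gas. I take the textbook expressions U = (3/2) N T and S = N \ln V plus a function of T alone, in units where Boltzmann’s constant is 1; treat those inputs as assumptions of the sketch rather than something derived here.

import sympy as sp

N, T, V = sp.symbols('N T V', positive=True)
g = sp.Function('g')                   # unknown function of T alone

U = sp.Rational(3, 2) * N * T          # internal energy of a monatomic ideal gas
S = N * sp.log(V) + g(T)               # its entropy, up to a function of T alone

P = T * sp.diff(S, V) - sp.diff(U, V)  # the formula we just derived
print(P)                               # N*T/V: the ideal gas law

Only the entropic term contributes, which is the point of the exercise suggested below.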

The big picture

That’s almost all I want to say: a simple exposition of well-known stuff that’s not quite as well-known as it should be. If you know some thermodynamics and are feeling mildly ambitious, you can now work out the pressure of an ideal gas and show that it’s completely entropic in origin: only the first term in the right-hand side above is nonzero. If you’re feeling a lot more ambitious, you can try to read Verlinde’s papers and explain them to me. But my own goal was not to think about gravity. Instead, it was to ponder a question raised by Allen Knutson: how does the ‘entropic force’ idea fit into my ruminations on classical mechanics versus thermodynamics?

It seems to fit in this way: as we go from classical statics (governed by the principle of least energy) to thermal statics at fixed temperature (governed by the principle of least free energy), the definition of force familiar in classical statics must be adjusted. In classical statics we have

\displaystyle{ F_i = - \frac{\partial U}{\partial q^i}}

where

U: Q \to \mathbb{R}

is the energy as a function of some coordinates q^i on the configuration space of our system, some manifold Q. But in thermal statics at temperature T our system will try to minimize, not the energy U, but the Helmholtz free energy

A = U - T S

where

S : Q \to \mathbb{R}

is the entropy. So now we should define force by

\displaystyle{ F_i = - \frac{\partial A}{\partial q^i}}

and we see that force has an entropic part and an energetic part:

\displaystyle{ F_i = T \frac{\partial S}{\partial q^i} - \frac{\partial U}{\partial q^i} }

When T = 0, the entropic part goes away and we’re back to classical statics!


“I’m subject to the natural forces.” – Lyle Lovett


A Quantum Hammersley–Clifford Theorem

29 January, 2012

I’m at this workshop:

Sydney Quantum Information Theory Workshop: Coogee 2012, 30 January – 2 February 2012, Coogee Bay Hotel, Coogee, Sydney, organized by Stephen Bartlett, Gavin Brennen, Andrew Doherty and Tom Stace.

Right now David Poulin is speaking about a quantum version of the Hammersley–Clifford theorem, which is a theorem about Markov networks. Let me quickly say a bit about what he proved! This will be a bit rough, since I’m doing it live…

The mutual information between two random variables is

I(A:B) = S(A) + S(B) - S(A,B)

The conditional mutual information between two random variables A and B, given a third random variable C, is

I(A:B|C) = \sum_c p(C=c) I(A:B|C=c)

It’s the average amount of information about B learned by measuring A when you already knew C.
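
In the classical case this is easy to compute from a joint distribution, using the equivalent entropy identity I(A:B|C) = S(A,C) + S(B,C) - S(A,B,C) - S(C). Here is a sketch for three binary variables with a made-up joint distribution coming from a Markov chain A → C → B, where the answer should be zero:

import numpy as np

def H(p):
    # Shannon entropy (in nats) of a distribution given as an array of probabilities.
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def cond_mutual_info(p_abc):
    # I(A:B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), with axes ordered (a, b, c).
    p_ac = p_abc.sum(axis=1)
    p_bc = p_abc.sum(axis=0)
    p_c  = p_abc.sum(axis=(0, 1))
    return H(p_ac) + H(p_bc) - H(p_abc) - H(p_c)

# Made-up Markov chain A -> C -> B: p(a, b, c) = p(a) p(c|a) p(b|c).
p_a         = np.array([0.3, 0.7])
p_c_given_a = np.array([[0.9, 0.1], [0.2, 0.8]])
p_b_given_c = np.array([[0.6, 0.4], [0.1, 0.9]])
p_abc = np.einsum('a,ac,cb->abc', p_a, p_c_given_a, p_b_given_c)

print(cond_mutual_info(p_abc))      # essentially zero, up to round-off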

All this works for both classical (Shannon) and quantum (von Neumann) entropy. So, when we say ‘random variable’ above, we
could mean it in the traditional classical sense or in the quantum sense.

If I(A:B|C) = 0 then A, C, B has the following Markov property: if you know C, learning A tells you nothing new about B. In condensed matter physics, say a spin system, we get (quantum) random variables from measuring what’s going on in regions, and we have short range entanglement if I(A:B|C) = 0 when C corresponds to some sufficiently thick region separating the regions A and B. We’ll get this in any Gibbs state of a spin chain with a local Hamiltonian.

A Markov network is a graph with random variables at vertices (and thus subsets of vertices) such that I(A:B|C) = 0 whenever C is a subset of vertices that completely ‘shields’ the subset A from the subset B: any path from A to B goes through a vertex in C.

The Hammersley–Clifford theorem says that in the classical case we can get any Markov network from the Gibbs state

\exp(-\beta H)

of a local Hamiltonian H, and vice versa. Here a Hamiltonian is local if it is a sum of terms, one depending on the degrees of freedom in each clique in the graph:

H = \sum_{C \in \mathrm{cliques}} h_C

Hayden, Jozsa, Petz and Winter gave a quantum generalization of one direction of this result to graphs that are just ‘chains’, like this:

o—o—o—o—o—o—o—o—o—o—o—o

Namely: for such graphs, any quantum Markov network is the Gibbs state of some local Hamiltonian. Now Poulin has shown the same for all graphs. But the converse is, in general, false. If the different terms h_C in a local Hamiltonian all commute, its Gibbs state will have the Markov property. But otherwise, it may not.

For some related material, see:

• David Poulin, Quantum graphical models and belief propagation.


Quantropy (Part 1)

22 December, 2011

I wish you all happy holidays! My wife Lisa and I are going to Bangkok on Christmas Eve, and thence to Luang Prabang, a town in Laos where the Nam Khan river joins the Mekong. We’ll return to Singapore on the 30th. See you then! And in the meantime, here’s a little present—something to mull over.

Statistical mechanics versus quantum mechanics

There’s a famous analogy between statistical mechanics and quantum mechanics. In statistical mechanics, a system can be in any state, but its probability of being in a state with energy E is proportional to

\exp(-E/T)

where T is the temperature in units where Boltzmann’s constant is 1. In quantum mechanics, a system can move along any path, but its amplitude for moving along a path with action S is proportional to

\exp(i S/\hbar)

where \hbar is Planck’s constant. So, we have an analogy where Planck’s constant is like an imaginary temperature:

Statistical Mechanics | Quantum Mechanics
probabilities | amplitudes
energy | action
temperature | Planck’s constant times i

In other words, making the replacements

E \mapsto S

T \mapsto i \hbar

formally turns the probabilities for states in statistical mechanics into the amplitudes for paths, or ‘histories’, in quantum mechanics.

But the probabilities \exp(-E/T) arise naturally from maximizing entropy subject to a constraint on the expected energy. So what about the amplitudes \exp(i S/\hbar)?

Following the analogy without thinking too hard, we’d guess it arises from minimizing something subject to a constraint on the expected action.

But now we’re dealing with complex numbers, so ‘minimizing’ doesn’t sound right. It’s better to talk about finding a ‘stationary point’: a place where the derivative of something is zero.

More importantly, what is this something? We’ll have to see—indeed, we’ll have to see if this whole idea makes sense! But for now, let’s just call it ‘quantropy’. This is a goofy word whose only virtue is that it quickly gets the idea across: just as the main ideas in statistical mechanics follow from the idea of maximizing entropy, we’d like the main ideas in quantum mechanics to follow from maximizing… err, well, finding a stationary point… of ‘quantropy’.

I don’t know how well this idea works, but there’s no way to know except by trying, so I’ll try it here. I got this idea thanks to a nudge from Uwe Stroinski and WebHubTel, who started talking about the principle of least action and the principle of maximum entropy at a moment when I was thinking hard about probabilities versus amplitudes.

Of course, if this idea makes sense, someone probably had it already. If you know where, please tell me.

Here’s the story…

Statics

Static systems at temperature zero obey the principle of minimum energy. Energy is typically the sum of kinetic and potential energy:

E = K + V

where the potential energy V depends only on the system’s position, while the kinetic energy K also depends on its velocity. The kinetic energy is often (but not always) a quadratic function of velocity with a minimum at velocity zero. In classical physics this lets our system minimize energy in a two-step way. First it will minimize kinetic energy, K, by staying still. Then it will go on to minimize potential energy, V, by choosing the right place to stay still.

This is actually somewhat surprising: usually minimizing the sum of two things involves an interesting tradeoff. But sometimes it doesn’t!

In quantum physics, a tradeoff is required, thanks to the uncertainty principle. We can’t know the position and velocity of a particle simultaneously, so we can’t simultaneously minimize potential and kinetic energy. This makes minimizing their sum much more interesting, as you’ll know if you’ve ever worked out the lowest-energy state of a harmonic oscillator or hydrogen atom.

But in classical physics, minimizing energy often forces us into ‘statics’: the boring part of physics, the part that studies things that don’t move. And people usually say statics at temperature zero is governed by the principle of minimum potential energy.

Next let’s turn up the heat. What about static systems at nonzero temperature? This is what people study in the subject called ‘thermostatics’, or more often, ‘equilibrium thermodynamics’.

In classical or quantum thermostatics at any fixed temperature, a closed system will obey the principle of minimum free energy. Now it will minimize

F = E - T S

where T is the temperature and S is the entropy. Note that this principle reduces to the principle of minimum energy when T = 0. But as T gets bigger, the second term in the above formula becomes more important, so the system gets more interested in having lots of entropy. That’s why water forms orderly ice crystals at low temperatures (more or less minimizing energy despite low entropy) and a wild random gas at high temperatures (more or less maximizing entropy despite high energy).

But where does the principle of minimum free energy come from?

One nice way to understand it uses probability theory. Suppose for simplicity that our system has a finite set of states, say X, and the energy of the state x \in X is E_x. Instead of our system occupying a single definite state, let’s suppose it can be in any state, with a probability p_x of being in the state x. Then its entropy is, by definition:

\displaystyle{ S = - \sum_x p_x \ln(p_x) }

The expected value of the energy is

\displaystyle{ E = \sum_x p_x E_x }

Now suppose our system maximizes entropy subject to a constraint on the expected value of energy. Thanks to the Lagrange multiplier trick, this is the same as maximizing

S - \beta E

where \beta is a Lagrange multiplier. When we go ahead and maximize this, we see the system chooses a Boltzmann distribution:

\displaystyle{ p_x = \frac{\exp(-\beta E_x)}{\sum_x \exp(-\beta E_x)}}

This is just a calculation; you must do it for yourself someday, and I will not rob you of that joy.
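
If you want a numerical hint before you do: here is a minimal sketch (the energy levels and the value of \beta are arbitrary choices of mine) that maximizes the entropy subject to the two constraints with an off-the-shelf optimizer and compares the answer to the Boltzmann formula.

import numpy as np
from scipy.optimize import minimize

E = np.array([0.0, 1.0, 2.0, 5.0])   # toy energy levels
beta = 1.3                           # an arbitrary value of the Lagrange multiplier

p_boltzmann = np.exp(-beta * E)
p_boltzmann /= p_boltzmann.sum()     # the distribution the calculation predicts
E_target = p_boltzmann @ E           # fix the expected energy at this value

def neg_entropy(p):
    return np.sum(p * np.log(p))     # minimizing -S means maximizing the entropy S

constraints = [
    {'type': 'eq', 'fun': lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {'type': 'eq', 'fun': lambda p: p @ E - E_target},   # expected energy is fixed
]
p0 = np.full(len(E), 1.0 / len(E))   # start from the uniform distribution
result = minimize(neg_entropy, p0, method='SLSQP',
                  bounds=[(1e-12, 1.0)] * len(E), constraints=constraints)

print(np.round(result.x, 6))         # maximum-entropy distribution, found numerically...
print(np.round(p_boltzmann, 6))      # ...agrees with the Boltzmann distribution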

But what does this mean? We could call \beta the coolness, since its inverse is the temperature, T, at least in units where Boltzmann’s constant is set to 1. So, when the temperature is positive, maximizing S - \beta E is the same as minimizing the free energy:

F = E - T S

(For negative temperatures, maximizing S - \beta E would amount to maximizing free energy.)

So, every minimum or maximum principle described so far can be seen as a special case or limiting case of the principle of maximum entropy, as long as we admit that sometimes we need to maximize entropy subject to constraints.

Why ‘limiting case’? Because the principle of least energy only shows up as the low-temperature limit, or \beta \to \infty limit, of the idea of maximizing entropy subject to a constraint on expected energy. But that’s good enough for me.

Dynamics

Now suppose things are changing as time passes, so we’re doing ‘dynamics’ instead of mere ‘statics’. In classical mechanics we can imagine a system tracing out a path \gamma(t) as time passes from one time to another, for example from t = t_0 to t = t_1. The action of this path is typically the integral of the kinetic minus potential energy:

A(\gamma) = \displaystyle{ \int_{t_0}^{t_1}  (K(t) - V(t)) \, dt }

where K(t) and V(t) depend on the path \gamma. Note that now I’m calling action A instead of the more usual S, since we’re already using S for entropy and I don’t want things to get any more confusing than necessary.

The principle of least action says that if we fix the endpoints of this path, that is the points \gamma(t_0) and \gamma(t_1), the system will follow the path that minimizes the action subject to these constraints.
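
Here is a toy numerical illustration, just a sketch of mine with arbitrary choices of mass, endpoints and time interval, and with V = 0 so the system is a free particle: discretize the path, minimize the discretized action over the interior points, and check that the minimizer is the straight line joining the two endpoints.

import numpy as np
from scipy.optimize import minimize

m = 1.0                        # mass
t0, t1, n = 0.0, 1.0, 50       # time interval and number of time steps
dt = (t1 - t0) / n
x_start, x_end = 0.0, 2.0      # the fixed endpoints gamma(t_0) and gamma(t_1)

def action(x_interior):
    # discretized action: sum over steps of (1/2) m v^2 dt, since V = 0 for a free particle
    x = np.concatenate(([x_start], x_interior, [x_end]))
    v = np.diff(x) / dt
    return np.sum(0.5 * m * v**2 * dt)

x0 = np.zeros(n - 1)                            # a deliberately bad initial guess
path = minimize(action, x0).x
straight_line = np.linspace(x_start, x_end, n + 1)[1:-1]
print(np.max(np.abs(path - straight_line)))     # should be very small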

Why is there a minus sign in the definition of action? How did people come up with principle of least action? How is it related to the principle of least energy in statics? These are all fascinating questions. But I have a half-written book that tackles these questions, so I won’t delve into them here:

• John Baez and Derek Wise, Lectures on Classical Mechanics.

Instead, let’s go straight to dynamics in quantum mechanics. Here Feynman proposed that instead of our system following a single definite path, it can follow any path, with an amplitude a(\gamma) of following the path \gamma. And he proposed this prescription for the amplitude:

\displaystyle{ a(\gamma) = \frac{\exp(i A(\gamma)/\hbar)}{\int  \exp(i A(\gamma)/\hbar) \, d \gamma}}

where \hbar is Planck’s constant. He also gave a heuristic argument showing that as \hbar \to 0, this prescription reduces to the principle of least action!

Unfortunately the integral over all paths—called a ‘path integral’—is hard to make rigorous except in certain special cases. And it’s a bit of a distraction for what I’m talking about now. So let’s talk more abstractly about ‘histories’ instead of paths with fixed endpoints, and consider a system whose possible ‘histories’ form a finite set, say X. Systems of this sort frequently show up as discrete approximations to continuous ones, but they also show up in other contexts, like quantum cellular automata and topological quantum field theories. Don’t worry if you don’t know what those things are. I’d just prefer to write sums instead of integrals now, to make everything easier.

Suppose the action of the history x \in X is A_x. Then Feynman’s sum over histories formulation of quantum mechanics says the amplitude of the history x is:

\displaystyle{ a_x = \frac{\exp(i A_x /\hbar)}{\sum_x  \exp(i A_x /\hbar) }}

This looks very much like the Boltzmann distribution:

\displaystyle{ p_x = \frac{\exp(-E_x/T)}{\sum_x \exp(- E_x/T)}}

Indeed, the only serious difference is that we’re taking the exponential of an imaginary quantity instead of a real one.
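
Here is a tiny numerical illustration of that parallel; the list of numbers, reused both as actions and as energies, and the values of \hbar and T are arbitrary choices of mine. The amplitudes are genuinely complex, yet still sum to 1 thanks to the normalization.

import numpy as np

vals = np.array([0.0, 1.0, 2.0, 5.0])   # reused as actions A_x and as energies E_x
hbar, T = 1.0, 1.0

a = np.exp(1j * vals / hbar)
a /= a.sum()               # Feynman amplitudes: complex numbers summing to 1
p = np.exp(-vals / T)
p /= p.sum()               # Boltzmann probabilities: nonnegative reals summing to 1

print(a, a.sum())
print(p, p.sum())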

So far everything has been a review of very standard stuff. Now comes something weird and new—at least, new to me.

Quantropy

I’ve described statics and dynamics, and a famous analogy between them, but there are some missing items in the analogy, which would be good to fill in:

Statics                     Dynamics
statistical mechanics       quantum mechanics
probabilities               amplitudes
Boltzmann distribution      Feynman sum over histories
energy                      action
temperature                 Planck’s constant times i
entropy                     ???
free energy                 ???

Since the Boltzmann distribution

\displaystyle{ p_x = \frac{\exp(-E_x/T)}{\sum_x \exp(- E_x/T)}}

comes from the principle of maximum entropy, you might hope Feynman’s sum over histories formulation of quantum mechanics:

\displaystyle{ a_x = \frac{\exp(i A_x /\hbar)}{\sum_x  \exp(i A_x /\hbar) }}

comes from a maximum principle too!

Unfortunately Feynman’s sum over histories involves complex numbers, and it doesn’t make sense to maximize a complex function. However, when we say nature likes to minimize or maximize something, it often behaves like a bad freshman who applies the first derivative test and quits there: it just finds a stationary point, where the first derivative is zero. For example, in statics we have ‘stable’ equilibria, which are local minima of the energy, but also ‘unstable’ equilibria, which are still stationary points of the energy, but not local minima. This is good for us, because stationary points still make sense for complex functions.

So let’s try to derive Feynman’s prescription from some sort of ‘principle of stationary quantropy’.

Suppose we have a finite set of histories, X, and each history x \in X has a complex amplitude a_x  \in \mathbb{C}. We’ll assume these amplitudes are normalized so that

\sum_x a_x = 1

since that’s what Feynman’s normalization actually achieves. We can try to define the quantropy of a by:

\displaystyle{ Q = - \sum_x a_x \ln(a_x) }

You might fear this is ill-defined when a_x = 0, but that’s not the worst problem; in the study of entropy we typically set

0 \ln 0 = 0

and everything works fine. The worst problem is that the logarithm has different branches: we can add any multiple of 2 \pi i to our logarithm and get another equally good logarithm. For now suppose we’ve chosen a specific logarithm for each number a_x, and suppose that when we vary them they don’t go through zero, so we can smoothly change the logarithm as we move them. This should let us march ahead for now, but clearly it’s a disturbing issue which we should revisit someday.

Next, suppose each history x has an action A_x \in \mathbb{R}. Let’s seek amplitudes a_x that give a stationary point of the quantropy Q subject to a constraint on the expected action:

\displaystyle{ A = \sum_x a_x A_x }

The term ‘expected action’ is a bit odd, since the numbers a_x are amplitudes rather than probabilities. While I could try to justify it from how expected values are computed in Feynman’s formalism, I’m mainly using this term because A is analogous to the expected value of the energy, which we saw earlier. We can worry later what all this stuff really means; right now I’m just trying to push forwards with an analogy and do a calculation.

So, let’s look for a stationary point of Q subject to a constraint on A. To do this, I’d be inclined to use Lagrange multipliers and look for a stationary point of

Q - \lambda A

But there’s another constraint, too, namely

\sum_x a_x = 1

So let’s write

B = \sum_x a_x

and look for stationary points of Q subject to the constraints

A = \alpha , \qquad B = 1

To do this, the Lagrange multiplier recipe says we should find stationary points of

Q - \lambda A - \mu B

where \lambda and \mu are Lagrange multipliers. The Lagrange multiplier \lambda is really interesting. It’s analogous to ‘coolness’, \beta = 1/T, so our analogy chart suggests that

\lambda = 1/i\hbar

This says that when \lambda gets big our system becomes close to classical. So, we could call \lambda the classicality of our system. The Lagrange multiplier \mu is less interesting—or at least I haven’t thought about it much.

So, we’ll follow the usual Lagrange multiplier recipe and look for amplitudes for which

0 = \displaystyle{ \frac{\partial}{\partial a_x} \left(Q - \lambda A - \mu B \right) }

holds, along with the constraint equations. We begin by computing the derivatives we need:

\begin{array}{cclcl} \displaystyle{ \frac{\partial}{\partial a_x} Q  }  &=& - \displaystyle{ \frac{\partial}{\partial a_x} \; a_x \ln(a_x)}   &=& - \ln(a_x) - 1 \\    \\    \displaystyle{ \frac{\partial}{\partial a_x}\; A  }  &=& \displaystyle{ \frac{\partial}{\partial a_x} a_x A_x}  &=& A_x \\    \\   \displaystyle{ \frac{\partial}{\partial a_x} B  }  &=& \displaystyle{ \frac{\partial}{\partial a_x}\; a_x }  &=& 1 \end{array}

Thus, we need

0 = \displaystyle{ \frac{\partial}{\partial a_x} \left(Q - \lambda A - \mu B \right) = -\ln(a_x) - 1- \lambda A_x - \mu }

or

\displaystyle{ a_x = \frac{\exp(-\lambda A_x)}{\exp(\mu + 1)} }

The constraint

\sum_x a_x = 1

then forces us to choose:

\displaystyle{ \exp(\mu + 1) = \sum_x \exp(-\lambda A_x) }

so we have

\displaystyle{ a_x = \frac{\exp(-\lambda A_x)}{\sum_x \exp(-\lambda A_x)} }

Hurrah! This is precisely Feynman’s sum over histories formulation of quantum mechanics if

\lambda = 1/i\hbar

We could go further with the calculation, but this is the punchline, so I’ll stop here. I’ll just note that the final answer:

\displaystyle{ a_x = \frac{\exp(iA_x/\hbar)}{\sum_x \exp(iA_x/\hbar)} }

does two equivalent things in one blow:

• It gives a stationary point of quantropy subject to the constraints that the amplitudes sum to 1 and the expected action takes some fixed value.

• It gives a stationary point of the free action:

A - i \hbar Q

subject to the constraint that the amplitudes sum to 1.

In case the second point is puzzling, note that the ‘free action’ is the quantum analogue of ‘free energy’, E - T S. It’s also just Q - \lambda A times -i \hbar, and we already saw that finding stationary points of Q - \lambda A is another way of finding stationary points of quantropy with a constraint on the expected action.

Note also that when \hbar \to 0, free action reduces to action, so we recover the principle of least action—or at least stationary action—in classical mechanics.
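
If you want to check the punchline numerically, here is a minimal sketch (the toy actions and \hbar = 1 are my own choices): compute Feynman’s amplitudes, then verify that -\ln(a_x) - 1 is an affine function of A_x with slope 1/i\hbar, which is exactly the stationarity condition 0 = -\ln(a_x) - 1 - \lambda A_x - \mu.

import numpy as np

A = np.array([0.0, 0.7, 1.3, 2.0])   # toy actions A_x
hbar = 1.0

a = np.exp(1j * A / hbar)
a /= a.sum()                          # Feynman's amplitudes, normalized to sum to 1

g = -np.log(a) - 1                    # np.log takes the principal branch; fine as long as no a_x is 0
lam = (g[1] - g[0]) / (A[1] - A[0])   # the Lagrange multiplier lambda
mu = g[0] - lam * A[0]                # the Lagrange multiplier mu

print(lam, 1 / (1j * hbar))              # lambda agrees with 1/(i hbar)
print(np.max(np.abs(g - lam * A - mu)))  # stationarity residual: ~ 0 for every history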

Summary. We recover Feynman’s sum over histories formulation of quantum mechanics from assuming that all histories have complex amplitudes, that these amplitudes sum to one, and that the amplitudes give a stationary point of quantropy subject to a constraint on the expected action. Alternatively, we can assume the amplitudes sum to one and that they give a stationary point of free action.

That’s sort of nice! So, here’s our analogy chart, all filled in:

Statics                     Dynamics
statistical mechanics       quantum mechanics
probabilities               amplitudes
Boltzmann distribution      Feynman sum over histories
energy                      action
temperature                 Planck’s constant times i
entropy                     quantropy
free energy                 free action
