Information Geometry (Part 11)

7 June, 2012

Last time we saw that given a bunch of different species of self-replicating entities, the entropy of their population distribution can go either up or down as time passes. This is true even in the pathetically simple case where all the replicators have constant fitness—so they don’t interact with each other, and don’t run into any ‘limits to growth’.

This is a bit of a bummer, since it would be nice to use entropy to explain how replicators are always extracting information from their environment, thanks to natural selection.

Luckily, a slight variant of entropy, called ‘relative entropy’, behaves better. When our replicators have an ‘evolutionary stable state’, the relative entropy is guaranteed to always change in the same direction as time passes!

Thanks to Einstein, we’ve all heard that times and distances are relative. But how is entropy relative?

It’s easy to understand if you think of entropy as lack of information. Say I have a coin hidden under my hand. I tell you it’s heads-up. How much information did I just give you? Maybe 1 bit? That’s true if you know it’s a fair coin and I flipped it fairly before covering it up with my hand. But what if you put the coin down there yourself a minute ago, heads up, and I just put my hand over it? Then I’ve given you no information at all. The difference is the choice of ‘prior’: that is, what probability distribution you attributed to the coin before I gave you my message.

My love affair with relative entropy began in college when my friend Bruce Smith and I read Hugh Everett’s thesis, The Relative State Formulation of Quantum Mechanics. This was the origin of what’s now often called the ‘many-worlds interpretation’ of quantum mechanics. But it also has a great introduction to relative entropy. Instead of talking about ‘many worlds’, I wish people would say that Everett explained some of the mysteries of quantum mechanics using the fact that entropy is relative.

Anyway, it’s nice to see relative entropy showing up in biology.

Relative Entropy

Inscribe an equilateral triangle in a circle. Randomly choose a line segment joining two points of this circle. What is the probability that this segment is longer than a side of the triangle?

This puzzle is called Bertrand’s paradox, because different ways of solving it give different answers. To crack the paradox, you need to realize that it’s meaningless to say you’ll “randomly” choose something until you say more about how you’re going to do it.

In other words, you can’t compute the probability of an event until you pick a recipe for computing probabilities. Such a recipe is called a probability measure.
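
To see this concretely, here is a little Monte Carlo sketch (my illustration, not part of the original puzzle) of three standard recipes for picking a ‘random’ chord of the unit circle. Each recipe is a perfectly good probability measure, and each gives a different answer:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
side = np.sqrt(3)  # side length of an equilateral triangle inscribed in the unit circle

# Recipe 1: pick two independent uniform endpoints on the circle.
t1, t2 = rng.uniform(0, 2 * np.pi, (2, n))
chord1 = 2 * np.abs(np.sin((t1 - t2) / 2))

# Recipe 2: pick a random radius, then a uniform point on it as the chord's midpoint.
d2 = rng.uniform(0, 1, n)
chord2 = 2 * np.sqrt(1 - d2**2)

# Recipe 3: pick the chord's midpoint uniformly in the disk.
d3 = np.sqrt(rng.uniform(0, 1, n))
chord3 = 2 * np.sqrt(1 - d3**2)

for chord in (chord1, chord2, chord3):
    print((chord > side).mean())  # about 1/3, 1/2 and 1/4 respectively
```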

This applies to computing entropy, too! The formula for entropy clearly involves a probability distribution, even when our set of events is finite:

S = - \sum_i p_i \ln(p_i)

But this formula conceals a fact that becomes obvious when our set of events is infinite. Now the sum becomes an integral:

S = - \int_X p(x) \ln(p(x)) \, d x

And now it’s clear that this formula makes no sense until we choose the measure d x. On a finite set we have a god-given choice of measure, called counting measure. Integrals with respect to this are just sums. But in general we don’t have such a god-given choice. And even for finite sets, working with counting measure is a choice: we are choosing to believe that in the absence of further evidence, all options are equally likely.

Taking this fact into account, it seems like we need two things to compute entropy: a probability distribution p(x), and a measure d x. That’s on the right track. But an even better way to think of it is this:

\displaystyle{ S = - \int_X  \frac{p(x) dx}{dx} \ln \left(\frac{p(x) dx}{dx}\right) \, dx }

Now we see the entropy depends on two measures: the probability measure p(x) dx we care about, but also the measure dx. Their ratio is important, but that’s not enough: we also need one of these measures to do the integral. Above I used the measure dx to do the integral, but we can also use p(x) dx if we write

\displaystyle{ S = - \int_X \ln \left(\frac{p(x) dx}{dx}\right) p(x) dx }

Either way, we are computing the entropy of one measure relative to another. So we might as well admit it, and talk about relative entropy.

The entropy of the measure d \mu relative to the measure d \nu is defined by:

\begin{array}{ccl} S(d \mu, d \nu) &=& \displaystyle{ - \int_X \frac{d \mu(x) }{d \nu(x)} \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right)  d\nu(x) } \\   \\  &=& \displaystyle{ - \int_X  \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right) d\mu(x) } \end{array}

The second formula is simpler, but the first looks more like summing -p \ln(p), so they’re both useful.

Since we’re taking entropy to be lack of information, we can also get rid of the minus sign and define relative information by

\begin{array}{ccl} I(d \mu, d \nu) &=& \displaystyle{ \int_X \frac{d \mu(x) }{d \nu(x)} \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right)  d\nu(x) } \\   \\  &=& \displaystyle{  \int_X  \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right) d\mu(x) } \end{array}

If you thought something was randomly distributed according to the probability measure d \nu, but then you discover it’s randomly distributed according to the probability measure d \mu, how much information have you gained? The answer is I(d\mu,d\nu).

For more on relative entropy, read Part 6 of this series. I gave some examples illustrating how it works. Those should convince you that it’s a useful concept.

Okay: now let’s switch back to a more lowbrow approach. In the case of a finite set, we can revert to thinking of our two measures as probability distributions, and write the information gain as

I(q,p) = \displaystyle{  \sum_i  \ln \left(\frac{q_i}{p_i }\right) q_i}

If you want to sound like a Bayesian, call p the prior probability distribution and q the posterior probability distribution. Whatever you call them, I(q,p) is the amount of information you get if you thought p and someone tells you “no, q!”
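
On a finite set this is easy to compute. Here is a minimal sketch in Python (my own illustration), using the convention 0 \ln 0 = 0 and assuming p_i > 0 wherever q_i > 0:

```python
import numpy as np

def relative_information(q, p):
    """Information gained on updating the prior p to the posterior q.

    Uses the convention 0 ln 0 = 0; assumes p_i > 0 wherever q_i > 0.
    """
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Thinking a coin is fair, then learning it is certainly heads-up,
# gains ln 2 nats, i.e. 1 bit, of information:
print(relative_information([1.0, 0.0], [0.5, 0.5]))  # 0.693... = ln 2
```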

We’ll use this idea to think about how a population gains information about its environment as time goes by, thanks to natural selection. The rest of this post will be an exposition of Theorem 1 in this paper:

• Marc Harper, The replicator equation as an inference dynamic.

Harper says versions of this theorem have previously appeared in work by Ethan Akin, and independently in work by Josef Hofbauer and Karl Sigmund. He also credits others here. An idea this good is rarely noticed by just one person.

The change in relative information

So: consider n different species of replicators. Let P_i be the population of the ith species, and assume these populations change according to the replicator equation:

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

where each function f_i depends smoothly on all the populations. And as usual, we let

\displaystyle{ p_i = \frac{P_i}{\sum_j P_j} }

be the fraction of replicators in the ith species.

Let’s study the relative information I(q,p) where q is some fixed probability distribution. We’ll see something great happens when q is a stable equilibrium solution of the replicator equation. In this case, the relative information can never increase! It can only decrease or stay constant.

We’ll think about what all this means later. First, let’s see that it’s true! Remember,

\begin{array}{ccl} I(q,p) &=& \displaystyle{ \sum_i  \ln \left(\frac{q_i}{ p_i }\right) q_i }  \\ \\ &=&  \displaystyle{ \sum_i  \Big(\ln(q_i) - \ln(p_i) \Big) q_i } \end{array}

and only p_i depends on time, not q_i, so

\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(q,p)}  &=& \displaystyle{ - \frac{d}{dt} \sum_i \ln(p_i)  q_i }\\   \\  &=& \displaystyle{ - \sum_i \frac{\dot{p}_i}{p_i} \, q_i } \end{array}

where \dot{p}_i is the rate of change of the probability p_i. We saw a nice formula for this in Part 9:

\displaystyle{ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle  \Big) \, p_i }

where

f_i(P) = f_i(P_1, \dots, P_n)

and

\displaystyle{ \langle f(P) \rangle = \sum_i f_i(P) p_i  }

is the mean fitness of the species. So, we get

\displaystyle{ \frac{d}{dt} I(q,p) } = \displaystyle{ - \sum_i \Big( f_i(P) - \langle f(P) \rangle  \Big) \, q_i }

Nice, but we can fiddle with this expression to get something more enlightening. Remember, the numbers q_i sum to one. So:

\begin{array}{ccl}  \displaystyle{ \frac{d}{dt} I(q,p) } &=&  \displaystyle{  \langle f(P) \rangle - \sum_i f_i(P) q_i  } \\  \\ &=& \displaystyle{  \sum_i f_i(P) (p_i - q_i)  }  \end{array}

where in the last step I used the definition of the mean fitness. This result looks even cuter if we treat the numbers f_i(P) as the components of a vector f(P), and similarly for the numbers p_i and q_i. Then we can use the dot product of vectors to say

\displaystyle{ \frac{d}{dt} I(q,p) = f(P) \cdot (p - q) }

So, the relative information I(q,p) can never increase if

f(P) \cdot (p - q) \le 0

for all choices of the population P.

And now something really nice happens: this is also the condition for q to be an evolutionarily stable state. This concept goes back to John Maynard Smith, the founder of evolutionary game theory. In 1982 he wrote:

A population is said to be in an evolutionarily stable state if its genetic composition is restored by selection after a disturbance, provided the disturbance is not too large.

I will explain the math next time—I need to straighten out some things in my mind first. But the basic idea is compelling: an evolutionarily stable state is like a situation where our replicators ‘know all there is to know’ about the environment and each other. In any other state, the population has ‘something left to learn’—and the amount left to learn is the relative information we’ve been talking about! But as time goes on, the information still left to learn decreases!

Note: in the real world, nature has never found an evolutionarily stable state… except sometimes approximately, on sufficiently short time scales, in sufficiently small regions. So we are still talking about an idealization of reality! But that’s okay, as long as we know it.
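
If you’d like to see the theorem in action before the math arrives, here is a numerical sketch of my own. It uses a standard hawk-dove payoff matrix (with V = 2, C = 4), for which the mixed state q = (1/2, 1/2) is evolutionarily stable, and checks that I(q,p) keeps dropping along the replicator flow:

```python
import numpy as np

# Hawk-dove payoffs with V = 2, C = 4; the mixed state q = (1/2, 1/2) is an ESS.
A = np.array([[-1.0, 2.0],
              [ 0.0, 1.0]])
q = np.array([0.5, 0.5])

p = np.array([0.9, 0.1])   # initial fractions of hawks and doves
dt = 0.01
for step in range(1001):
    if step % 250 == 0:
        print(f"t = {step * dt:5.2f}   I(q,p) = {np.sum(q * np.log(q / p)):.5f}")
    f = A @ p                      # fitnesses f_i(P) for this frequency-dependent game
    p = p + dt * (f - f @ p) * p   # Euler step for the replicator equation
    p = p / p.sum()                # guard against rounding drift
```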


Information Geometry (Part 10)

4 June, 2012

Last time I began explaining the tight relation between three concepts:

• entropy,

• information—or more precisely, lack of information,

and

• biodiversity.

The idea is to consider n different species of ‘replicators’. A replicator is any entity that can reproduce itself, like an organism, a gene, or a meme. A replicator can come in different kinds, and a ‘species’ is just our name for one of these kinds. If P_i is the population of the ith species, we can interpret the fraction

\displaystyle{ p_i = \frac{P_i}{\sum_j P_j} }

as a probability: the probability that a randomly chosen replicator belongs to the ith species. This suggests that we define entropy just as we do in statistical mechanics:

\displaystyle{ S = - \sum_i p_i \ln(p_i) }

In the study of statistical inference, entropy is a measure of uncertainty, or lack of information. But now we can interpret it as a measure of biodiversity: it’s zero when just one species is present, and small when a few species have much larger populations than all the rest, but gets big otherwise.

Our goal here is to play these viewpoints off against each other. In short, we want to think of natural selection, and even biological evolution, as a process of statistical inference—or in simple terms, learning.

To do this, let’s think about how entropy changes with time. Last time we introduced a simple model called the replicator equation:

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

where each population grows at a rate proportional to some ‘fitness functions’ f_i. We can get some intuition by looking at the pathetically simple case where these functions are actually constants, so

\displaystyle{ \frac{d P_i}{d t} = f_i \, P_i }

The equation then becomes trivial to solve:

\displaystyle{ P_i(t) = e^{t f_i } P_i(0)}

Last time I showed that in this case, the entropy will eventually decrease. It will go to zero as t \to +\infty whenever one species is fitter than all the rest and starts out with a nonzero population—since then this species will eventually take over.

But remember, the entropy of a probability distribution is its lack of information. So the decrease in entropy signals an increase in information. And last time I argued that this makes perfect sense. As the fittest species takes over and biodiversity drops, the population is acquiring information about its environment.

However, I never said the entropy is always decreasing, because that’s false! Even in this pathetically simple case, entropy can increase.

Suppose we start with many replicators belonging to one very unfit species, and a few belonging to various more fit species. The probability distribution p_i will start out sharply peaked, so the entropy will start out low:

Now think about what happens when time passes. At first the unfit species will rapidly die off, while the population of the other species slowly grows:

So the probability distribution will, for a while, become less sharply peaked. Thus, for a while, the entropy will increase!

This seems to conflict with our idea that the population’s entropy should decrease as it acquires information about its environment. But in fact this phenomenon is familiar in the study of statistical inference. If you start out with strongly held false beliefs about a situation, the first effect of learning more is to become less certain about what’s going on!

Get it? Say you start out by assigning a high probability to some wrong guess about a situation. The entropy of your probability distribution is low: you’re quite certain about what’s going on. But you’re wrong. When you first start suspecting you’re wrong, you become more uncertain about what’s going on. Your probability distribution flattens out, and the entropy goes up.

So, sometimes learning involves a decrease in information—false information. There’s nothing about the mathematical concept of information that says this information is true.
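
Here is a numerical sketch of this phenomenon (my own illustration, with made-up constant fitnesses). One very unfit species starts with 90% of the population; the entropy rises for a while, then falls:

```python
import numpy as np

f = np.array([0.1, 2.0, 3.0])       # made-up constant fitnesses; species 0 is very unfit
P0 = np.array([0.9, 0.05, 0.05])    # the unfit species starts with most of the population

for t in np.linspace(0, 3, 7):
    P = np.exp(t * f) * P0          # exact solution P_i(t) = e^{t f_i} P_i(0)
    p = P / P.sum()
    print(f"t = {t:.1f}   S = {-np.sum(p * np.log(p)):.4f}")   # rises, then falls
```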

Given this, it’s good to work out a formula for the rate of change of entropy, which will let us see more clearly when it goes down and when it goes up. To do this, first let’s derive a completely general formula for the time derivative of the entropy of a probability distribution. Following Sir Isaac Newton, we’ll use a dot to stand for a time derivative:

\begin{array}{ccl} \displaystyle{  \dot{S}} &=& \displaystyle{ -  \frac{d}{dt} \sum_i p_i \ln (p_i)} \\   \\  &=& \displaystyle{ - \sum_i \Big( \dot{p}_i \ln (p_i) + \dot{p}_i \Big) }  \end{array}

In the last term we took the derivative of the logarithm and got a factor of 1/p_i which cancelled the factor of p_i. But since

\displaystyle{  \sum_i p_i = 1 }

we know

\displaystyle{ \sum_i \dot{p}_i = 0 }

so this last term vanishes:

\displaystyle{ \dot{S}= -\sum_i \dot{p}_i \ln (p_i) }

Nice! To go further, we need a formula for \dot{p}_i. For this we might as well return to the general replicator equation, dropping the pathetically special assumption that the fitness functions are actually constants. Then we saw last time that

\displaystyle{ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle  \Big) \, p_i }

where we used the abbreviation

f_i(P) = f_i(P_1, \dots, P_n)

for the fitness of the ith species, and defined the mean fitness to be

\displaystyle{ \langle f(P) \rangle = \sum_i f_i(P) p_i  }

Using this cute formula for \dot{p}_i, we get the final result:

\displaystyle{ \dot{S} = - \sum_i \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i \ln (p_i) }

This is strikingly similar to the formula for entropy itself. But now each term in the sum includes a factor saying how much more fit than average, or less fit, that species is. The quantity - p_i \ln(p_i) is always nonnegative, since the graph of -x \ln(x) looks like this:

So, the ith term contributes positively to the change in entropy if the ith species is fitter than average, but negatively if it’s less fit than average.

This may seem counterintuitive!

Puzzle 1. How can we reconcile this fact with our earlier observations about the case when the fitness of each species is population-independent? Namely: a) if initially most of the replicators belong to one very unfit species, the entropy will rise at first, but b) in the long run, when the fittest species present take over, the entropy drops?

If this seems too tricky, look at some examples! The first illustrates observation a); the second illustrates observation b):

Puzzle 2. Suppose we have two species, one with fitness equal to 1 initially constituting 90% of the population, the other with fitness equal to 10 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 1, & &  p_1(0) = 0.9 \\ \\                            f_2 = 10 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?

Puzzle 3. Suppose we have two species, one with fitness equal to 10 initially constituting 90% of the population, and the other with fitness equal to 1 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 10, & &  p_1(0) = 0.9 \\ \\                            f_2 = 1 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?

I had to work through these examples to understand what’s going on. Now I do, and it all makes sense.
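
If you want to check your own answers, here is a small sketch (mine) that evaluates \dot{S} at t = 0 term by term, using the formula above:

```python
import numpy as np

def entropy_rate(f, p):
    """dS/dt = -sum_i (f_i - <f>) p_i ln(p_i), returned term by term."""
    f, p = np.asarray(f, float), np.asarray(p, float)
    terms = -(f - f @ p) * p * np.log(p)
    return terms, terms.sum()

print(entropy_rate([1, 10], [0.9, 0.1]))   # Puzzle 2: entropy rising at t = 0
print(entropy_rate([10, 1], [0.9, 0.1]))   # Puzzle 3: entropy falling at t = 0
```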

Next time

Still, it would be nice if there were some quantity that always goes down with the passage of time, reflecting our naive idea that the population gains information from its environment, and thus loses entropy, as time goes by.

Often there is such a quantity. But it’s not the naive entropy: it’s the relative entropy. I’ll talk about that next time. In the meantime, if you want to prepare, please reread Part 6 of this series, where I explained this concept. Back then, I argued that whenever you’re tempted to talk about entropy, you should talk about relative entropy. So, we should try that here.

There’s a big idea lurking here: information is relative. How much information a signal gives you depends on your prior assumptions about what that signal is likely to be. If this is true, perhaps biodiversity is relative too.


Information Geometry (Part 9)

1 June, 2012


It’s time to continue this information geometry series, because I’ve promised to give the following talk at a conference on the mathematics of biodiversity in early July… and I still need to do some of the research!

Diversity, information geometry and learning

As is well known, some measures of biodiversity are formally identical to measures of information developed by Shannon and others. Furthermore, Marc Harper has shown that the replicator equation in evolutionary game theory is formally identical to a process of Bayesian inference, which is studied in the field of machine learning using ideas from information geometry. Thus, in this simple model, a population of organisms can be thought of as a ‘hypothesis’ about how to survive, and natural selection acts to update this hypothesis according to Bayes’ rule. The question thus arises to what extent natural changes in biodiversity can be usefully seen as analogous to a form of learning. However, some of the same mathematical structures arise in the study of chemical reaction networks, where the increase of entropy, or more precisely decrease of free energy, is not usually considered a form of ‘learning’. We report on some preliminary work on these issues.

So, let’s dive in! To some extent I’ll be explaining these two papers:

• Marc Harper, Information geometry and evolutionary game theory.

• Marc Harper, The replicator equation as an inference dynamic.

However, I hope to bring in some more ideas from physics, the study of biodiversity, and the theory of stochastic Petri nets, also known as chemical reaction networks. So, this series may start to overlap with my network theory posts. We’ll see. We won’t get far today: for now, I just want to review and expand on what we did last time.

The replicator equation

The replicator equation is a simplified model of how populations change. Suppose we have n types of self-replicating entity. I’ll call these entities replicators. I’ll call the types of replicators species, but they don’t need to be species in the biological sense. For example, the replicators could be genes, and the types could be alleles. Or the replicators could be restaurants, and the types could be restaurant chains.

Let P_i(t), or just P_i for short, be the population of the ith species at time t. Then the replicator equation says

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

So, the population P_i changes at a rate proportional to P_i, but the ‘constant of proportionality’ need not be constant: it can be any smooth function f_i of the populations of all the species. We call f_i(P_1, \dots, P_n) the fitness of the ith species.

Of course this model is absurdly general, while still leaving out lots of important effects, like the spatial variation of populations, or the ability for the population of some species to start at zero and become nonzero—which happens thanks to mutation. Nonetheless this model is worth taking a good look at.

Using the magic of vectors we can write

P = (P_1, \dots , P_n)

and

f(P) = (f_1(P), \dots, f_n(P))

This lets us write the replicator equation a wee bit more tersely as

\displaystyle{ \frac{d P}{d t} = f(P) P}

where on the right I’m multiplying vectors componentwise, the way your teachers tried to brainwash you into never doing:

f(P) P = (f(P)_1 P_1, \dots, f(P)_n P_n)

In other words, I’m thinking of P and f(P) as functions on the set \{1, \dots, n\} and multiplying them pointwise. This will be a nice way of thinking if we want to replace this finite set by some more general space.

Why would we want to do that? Well, we might be studying lizards with different length tails, and we might find it convenient to think of the set of possible tail lengths as the half-line [0,\infty) instead of a finite set.

Or, just to get started, we might want to study the pathetically simple case where f(P) doesn’t depend on P. Then we just have a fixed function f and a time-dependent function P obeying

\displaystyle{ \frac{d P}{d t} = f P}

If we’re physicists, we might write P more suggestively as \psi and write the operator multiplying by f as - H. Then our equation becomes

\displaystyle{ \frac{d \psi}{d t} = - H \psi }

This looks a lot like Schrödinger’s equation, but since there’s no factor of \sqrt{-1}, and \psi is real-valued, it’s more like the heat equation or the ‘master equation’, the basic equation of stochastic mechanics.

For an explanation of Schrödinger’s equation and the master equation, try Part 12 of the network theory series. In that post I didn’t include a minus sign in front of the H. That’s no big deal: it’s just a different convention than the one I want today. A more serious issue is that in stochastic mechanics, \psi stands for a probability distribution. This suggests that we should get probabilities into the game somehow.

The replicator equation in terms of probabilities

Luckily, that’s exactly what people usually do! Instead of talking about the population P_i of the ith species, they talk about the probability p_i that one of our organisms will belong to the ith species. This amounts to normalizing our populations:

\displaystyle{  p_i = \frac{P_i}{\sum_j P_j} }

Don’t you love it when notations work out well? Our big Population P_i has gotten normalized to give little probability p_i.

How do these probabilities p_i change with time? Now is the moment for that least loved rule of elementary calculus to come out and take a bow: the quotient rule for derivatives!

\displaystyle{ \frac{d p_i}{d t} = \left(\frac{d P_i}{d t} \sum_j P_j \quad - \quad P_i \sum_j \frac{d P_j}{d t}\right) \big{/} \left(  \sum_j P_j \right)^2 }

Using our earlier version of the replicator equation, this gives:

\displaystyle{ \frac{d p_i}{d t} =  \left(f_i(P) P_i \sum_j P_j \quad - \quad P_i \sum_j f_j(P) P_j \right) \big{/} \left(  \sum_j P_j \right)^2 }

Using the definition of p_i, this simplifies to:

\displaystyle{ \frac{d p_i}{d t} =  f_i(P) p_i \quad - \quad \left( \sum_j f_j(P) p_j \right) p_i }

The stuff in parentheses actually has a nice meaning: it’s just the mean fitness. In other words, it’s the average, or expected, fitness of an organism chosen at random from the whole population. Let’s write it like this:

\displaystyle{ \langle f(P) \rangle = \sum_j f_j(P) p_j  }

So, we get the replicator equation in its classic form:

\displaystyle{ \frac{d p_i}{d t} = \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i }

This has a nice meaning: for the fraction of organisms of the ith type to increase, their fitness must exceed the mean fitness. If you’re trying to increase market share, what matters is not how good you are, but how much better than average you are. If everyone else is lousy, you’re in luck.
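
To make this concrete, here is a quick numerical sanity check (my own sketch, with a made-up frequency-dependent fitness): an Euler step for the populations P_i reproduces the classic replicator equation for the fractions p_i:

```python
import numpy as np

def fitness(P):
    # a made-up frequency-dependent fitness, just for illustration
    return np.array([2.0, 1.0, 1.5]) - 0.001 * P

P = np.array([100.0, 300.0, 50.0])
dt = 1e-6

p_before = P / P.sum()
P_next = P + dt * fitness(P) * P      # Euler step for dP_i/dt = f_i(P) P_i
p_after = P_next / P_next.sum()

lhs = (p_after - p_before) / dt       # numerical dp_i/dt
f = fitness(P)
rhs = (f - f @ p_before) * p_before   # (f_i(P) - <f(P)>) p_i
print(lhs)
print(rhs)                            # the two agree to good accuracy
```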

Entropy

Now for something a bit new. Once we’ve gotten a probability distribution into the game, its entropy is sure to follow:

\displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) }

This says how ‘smeared-out’ the overall population is among the various different species. Alternatively, it says how much information it takes, on average, to say which species a randomly chosen organism belongs to. For example, if there are 2^N species, all with equal populations, the entropy S works out to N \ln 2. So in this case, it takes N bits of information to say which species a randomly chosen organism belongs to.

In biology, entropy is one of many ways people measure biodiversity. For a quick intro to some of the issues involved, try:

• Tom Leinster, Measuring biodiversity, Azimuth, 7 November 2011.

• Lou Jost, Entropy and diversity, Oikos 113 (2006), 363–375.

But we don’t need to understand this stuff to see how entropy is connected to the replicator equation. Marc Harper’s paper explains this in detail:

• Marc Harper, The replicator equation as an inference dynamic.

and I hope to go through quite a bit of it here. But not today! Today I just want to look at a pathetically simple, yet still interesting, example.

Exponential growth

Suppose the fitness of each species is independent of the populations of all the species. In other words, suppose each fitness f_i(P) is actually a constant, say f_i. Then the replicator equation reduces to

\displaystyle{ \frac{d P_i}{d t} = f_i \, P_i }

so it’s easy to solve:

P_i(t) = e^{t f_i} P_i(0)

You don’t need a detailed calculation to see what’s going to happen to the probabilities

\displaystyle{ p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)}}

The most fit species present will eventually take over! If one species, say the ith one, has a fitness greater than the rest, then the population of this species will eventually grow faster than all the rest, at least if its population starts out greater than zero. So as t \to +\infty, we’ll have

p_i(t) \to 1

and

p_j(t) \to 0 \quad \mathrm{for} \quad j \ne i

Thus the probability distribution p will become more sharply peaked, and its entropy will eventually approach zero.

With a bit more thought you can see that even if more than one species shares the maximum possible fitness, the entropy will eventually decrease, though not approach zero.

In other words, the biodiversity will eventually drop as all but the most fit species are overwhelmed. Of course, this is only true in our simple idealization. In reality, biodiversity behaves in more complex ways—in part because species interact, and in part because mutation tends to smear out the probability distribution p_i. We’re not looking at these effects yet. They’re extremely important… in ways we can only fully understand if we start by looking at what happens when they’re not present.

In still other words, the population will absorb information from its environment. This should make intuitive sense: the process of natural selection resembles ‘learning’. As fitter organisms become more common and less fit ones die out, the environment puts its stamp on the probability distribution p. So, this probability distribution should gain information.

While intuitively clear, this last claim also follows more rigorously from thinking of entropy as negative information. Admittedly, it’s always easy to get confused by minus signs when relating entropy and information. A while back I said the entropy

\displaystyle{ S(p) = - \sum_i p_i \, \ln(p_i) }

was the average information required to say which species a randomly chosen organism belongs to. If this entropy is going down, isn’t the population losing information?

No, this is a classic sign error. It’s like the concept of ‘work’ in physics. We can talk about the work some system does on its environment, or the work done by the environment on the system, and these are almost the same… except one is minus the other!

When you are very ignorant about some system—say, some rolled dice—your estimated probabilities p_i for its various possible states are very smeared-out, so the entropy S(p) is large. As you gain information, you revise your probabilities and they typically become more sharply peaked, so S(p) goes down. When you know as much as you possibly can, S(p) equals zero.

So, the entropy S(p) is the amount of information you have left to learn: the amount of information you lack, not the amount you have. As you gain information, this goes down. There’s no paradox here.

It works the same way with our population of replicators—at least in the special case where the fitness of each species is independent of its population. The probability distribution p is like a ‘hypothesis’ assigning to each species i the probability p_i that it’s the best at self-replicating. As some replicators die off while others prosper, they gather information about their environment, and this hypothesis gets refined. So, the entropy S(p) drops.

Next time

Of course, to make closer contact to reality, we need to go beyond the special case where the fitness of each species is a constant! Marc Harper does this, and I want to talk about his work someday, but first I have a few more remarks to make about the pathetically simple special case I’ve been focusing on. I’ll save these for next time, since I’ve probably strained your patience already.


A Noether Theorem for Markov Processes

7 March, 2012

I’ll start you off with two puzzles. Their relevance should become clear by the end of this post:

Puzzle 1. Suppose I have a box of jewels. The average value of a jewel in the box is $10. I randomly pull one out of the box. What’s the probability that its value is at least $100?

Puzzle 2. Suppose I have a box full of numbers—they can be arbitrary real numbers. Their average is zero, and their standard deviation is 10. I randomly pull one out. What’s the probability that it’s at least 100?

Before you complain, I’ll admit: in both cases, you can’t actually tell me the probability. But you can say something about the probability! What’s the most you can say?

Noether theorems

Some good news: Brendan Fong, who worked here with me, has now gotten a scholarship to do his PhD at the University of Oxford! He’s talking with people like Bob Coecke and Jamie Vicary, who work on diagrammatic and category-theoretic approaches to quantum theory.

But we’ve also finished a paper on good old-fashioned probability theory:

• John Baez and Brendan Fong, A Noether theorem for Markov processes.

This is based on a result Brendan proved in the network theory series on this blog. But we go further in a number of ways.

What’s the basic idea?

For months now I’ve been pushing the idea that we can take ideas from quantum mechanics and push them over to ‘stochastic mechanics’, which differs in that we work with probabilities rather than amplitudes. Here we do this for Noether’s theorem.

I should warn you: here I’m using ‘Noether’s theorem’ in an extremely general way to mean any result relating symmetries and conserved quantities. There are many versions. We prove a version that applies to Markov processes, which are random processes of the nicest sort: those where the rules don’t change with time, and the state of the system in the future only depends on its state now, not the past.

In quantum mechanics, there’s a very simple relation between symmetries and conserved quantities: an observable commutes with the Hamiltonian if and only if its expected value remains constant in time for every state. For Markov processes this is no longer true. But we show the next best thing: an observable commutes with the Hamiltonian if and only if both its expected value and standard deviation are constant in time for every state!

Now, we explained this stuff very simply and clearly back in Part 11 and Part 13 of the network theory series. We also tried to explain it clearly in the paper. So now let me explain it in a complicated, confusing way, for people who prefer that.

(Judging from the papers I read, that’s a lot of people!)

I’ll start by stating the quantum theorem we’re trying to mimic, and then state the version for Markov processes.

Noether’s theorem: quantum versions

For starters, suppose both our Hamiltonian H and the observable O are bounded self-adjoint operators. Then we have this:

Noether’s Theorem, Baby Quantum Version. Let H and O be bounded self-adjoint operators on some Hilbert space. Then

[H,O] = 0

if and only if for all states \psi(t) obeying Schrödinger’s equation

\displaystyle{ \frac{d}{d t} \psi(t) = -i H \psi(t) }

the expected value \langle \psi(t), O \psi(t) \rangle is constant as a function of t.

What if O is an unbounded self-adjoint operator? That’s no big deal: we can get a bounded one by taking f(O) where f is any bounded measurable function. But Hamiltonians are rarely bounded for fully realistic quantum systems, and we can’t mess with the Hamiltonian without changing Schrödinger’s equation! So, we definitely want a version of Noether’s theorem that lets H be unbounded.

It’s a bit tough to make the equation [H,O] = 0 precise in a useful way when H is unbounded, because then H is only densely defined. If O doesn’t map the domain of H to itself, it’s hard to know what [H,O] = HO - OH even means! We could demand that O preserve the domain of H, but a better workaround is instead to say that

[\mathrm{exp}(-itH), O] = 0

for all t. Then we get this:

Noether’s Theorem, Full-fledged Quantum Version. Let H and O be self-adjoint operators on some Hilbert space, with O being bounded. Then

[\mathrm{exp}(-itH),O] = 0

if and only if for all states

\psi(t) = \mathrm{exp}(-itH) \psi

the expected value \langle \psi(t), O \psi(t) \rangle is constant as a function of t.

Here of course we’re using the fact that \mathrm{exp}(-itH) \psi is what we get when we solve Schrödinger’s equation with initial data \psi.

But in fact, this version of Noether’s theorem follows instantly from a simpler one:

Noether’s Theorem, Simpler Quantum Version. Let U be a unitary operator and let O be a bounded self-adjoint operator on some Hilbert space. Then

[U,O] = 0

if and only if for all states \psi,

\langle U \psi, O U \psi \rangle = \langle \psi, O \psi \rangle.

This version applies to a single unitary operator U instead of the 1-parameter unitary group

U(t) = \exp(-i t H)

It’s incredibly easy to prove. And this is the easiest version to copy over to the Markov case! However, the proof over there is not quite so easy.

Noether’s theorem: stochastic versions

In stochastic mechanics we describe states using probability distributions, not vectors in a Hilbert space. We also need a new concept of ‘observable’, and unitary operators will be replaced by ‘stochastic operators’.

Suppose that X is a \sigma-finite measure space with a measure we write simply as dx. Then probability distributions \psi on X lie in L^1(X). Let’s define an observable O to be any element of the dual space L^\infty(X), allowing us to define the expected value of O in the probability distribution \psi to be

\langle O, \psi \rangle = \int_X O(x) \psi(x) \, dx

The angle brackets are supposed to remind you of quantum mechanics, but we don’t have an inner product on a Hilbert space anymore! Instead, we have a pairing between L^1(X) and L^\infty(X). Probability distributions live in L^1(X), while observables live in L^\infty(X). But we can also think of an observable O as a bounded operator on L^1(X), namely the operator of multiplying by the function O.

Let’s say an operator

U : L^1(X) \to L^1(X)

is stochastic if it’s bounded and it maps probability distributions to probability distributions. Equivalently, U is stochastic if it’s linear and it obeys

\psi \ge 0 \implies U \psi \ge 0

and

\int_X (U\psi)(x) \, dx = \int_X \psi(x) \, dx

for all \psi \in L^1(X).

A Markov process, or technically a Markov semigroup, is a collection of operators

U(t) : L^1(X) \to L^1(X)

for t \ge 0 such that:

U(t) is stochastic for all t \ge 0.

U(t) depends continuously on t.

U(s+t) = U(s)U(t) for all s,t \ge 0.

U(0) = I.
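
Here is a tiny numerical sketch of such a semigroup (my own example, treating probability distributions on a 3-point space as column vectors, and written in terms of a generator H of the kind introduced just below): exponentiating a rate matrix whose off-diagonal entries are nonnegative and whose columns sum to zero yields stochastic operators.

```python
import numpy as np
from scipy.linalg import expm

# A made-up 3-state generator: off-diagonal entries are nonnegative rates
# and each column sums to zero.
H = np.array([[-1.0,  2.0,  0.0],
              [ 1.0, -3.0,  0.5],
              [ 0.0,  1.0, -0.5]])

U = expm(0.7 * H)        # U(t) = exp(tH) at t = 0.7
print(U.sum(axis=0))     # each column sums to 1...
print((U >= 0).all())    # ...and all entries are nonnegative, so U(t) is stochastic
```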

By the Hille–Yosida theorem, any Markov semigroup may be written as

U(t) = \exp(tH)

for some operator H, called its Hamiltonian. However, H is typically unbounded and only densely defined. This makes it difficult to work with the commutator [H,O]. So, we should borrow a trick from quantum mechanics and work with the commutator [\exp(tH),O] instead. This amounts to working directly with the Markov semigroup instead of its Hamiltonian. And then we have:

Noether’s Theorem, Full-fledged Stochastic Version. Suppose X is a \sigma-finite measure space and

U(t) : L^1(X) \to L^1(X)

is a Markov semigroup. Suppose O is an observable. Then

[U(t),O] = 0

for all t \ge 0 if and only if for all probability distributions \psi on X, \langle O, U(t) \psi \rangle and \langle O^2, U(t) \psi \rangle are constant as a function of t.

In plain English: time evolution commutes with an observable if and only if the mean and standard deviation of that observable never change with time. As in the quantum case, this result follows instantly from a simpler one, which applies to a single stochastic operator:

Noether’s Theorem, Simpler Stochastic Version. Suppose X is a \sigma-finite measure space and

U : L^1(X) \to L^1(X)

is a stochastic operator. Suppose O is an observable. Then

[U,O] = 0

if and only if for all probability distributions \psi on X,

\langle O, U \psi \rangle = \langle O, \psi \rangle

and

\langle O^2, U \psi \rangle = \langle O^2, \psi \rangle

It looks simple, but the proof is a bit tricky! It’s easy to see that [U,O] = 0 implies those other equations; the work lies in showing the converse. The reason is that [U,O] = 0 implies

\langle O^n, U \psi \rangle = \langle O^n, \psi \rangle

for all n, not just 1 and 2. The expected values of the powers of O are more or less what people call its moments. So, we’re saying all the moments of O are unchanged when we apply U to an arbitrary probability distribution, given that we know this fact for the first two.

The proof is fairly technical but also sort of cute: we use Chebyshev’s inequality, which says that the probability of a random variable taking a value at least k standard deviations away from its mean is less than or equal to 1/k^2. I’ve always found this to be an amazing fact, but now it seems utterly obvious. You can figure out the proof yourself if you do the puzzles at the start of this post.
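
Here is a small numerical sketch of the theorem on a 4-point space (my own example). The first operator mixes only states on which O takes equal values, so it commutes with O and preserves all its moments; the second preserves the mean of O for every distribution but not the second moment, and correspondingly fails to commute:

```python
import numpy as np

o = np.array([0.0, 1.0, 1.0, 2.0])   # an observable on a 4-point space
O = np.diag(o)

# U1 only mixes the two states on which O takes the same value, so [U1, O] = 0.
U1 = np.array([[1, 0.0, 0.0, 0],
               [0, 0.7, 0.3, 0],
               [0, 0.3, 0.7, 0],
               [0, 0.0, 0.0, 1]])

# U2 sends state 1 to a 50/50 mix of states 0 and 3. This preserves the mean
# of O for every distribution (o @ U2 equals o) but raises the second moment.
U2 = np.array([[1, 0.5, 0, 0],
               [0, 0.0, 0, 0],
               [0, 0.0, 1, 0],
               [0, 0.5, 0, 1]])

psi = np.array([0.1, 0.6, 0.1, 0.2])  # some probability distribution
for U in (U1, U2):
    print(np.allclose(U @ O, O @ U),        # does U commute with O?
          o @ U @ psi - o @ psi,            # change in mean: 0 for both
          o**2 @ U @ psi - o**2 @ psi)      # change in 2nd moment: 0 only for U1
```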

But now I’ll let you read our paper! And I’m really hoping you’ll spot mistakes, or places it can be improved.


Quantropy (Part 3)

18 February, 2012

I’ve been talking a lot about ‘quantropy’. Last time we figured out a trick for how to compute it starting from the partition function of a quantum system. But it’s hard to get a feeling for this concept without some examples.

So, let’s compute the partition function of a free particle on a line, and see what happens…

The partition function of a free particle

Suppose we have a free particle on a line tracing out some path as time goes by:

q: [0,T] \to \mathbb{R}

Then its action is just the time integral of its kinetic energy:

\displaystyle{ A(q) = \int_0^T \frac{mv(t)^2}{2} \; dt }

where

\displaystyle{ v(t) = \frac{d q(t)}{d t} }

is its velocity. The partition function is then

Z = \displaystyle{\int e^{i A(q) / \hbar} \; Dq }

where we integrate an exponential involving the action over the space of all paths q. Unfortunately, the space of all paths is infinite-dimensional, and the thing we’re integrating oscillates wildly. Integrals like this tend to make mathematicians run from the room screaming. For example, nobody is quite sure what Dq means in this expression. There is no ‘Lebesgue measure’ on an infinite-dimensional vector space.

There is a lot to say about this, but if we just want to get some answers, it’s best to sneak up on the problem gradually.

Discretizing time

We’ll start by treating time as discrete—a trick Feynman used in his original work. We’ll consider n time intervals of length \Delta t. Say the position of our particle at the ith time step is q_i \in \mathbb{R}. We’ll require that the particle keeps a constant velocity between these time steps. This will reduce the problem of integrating over ‘all’ paths—whatever that means, exactly—to the more manageable problem of integrating over a finite-dimensional space of paths. Later we can study what happens as the time steps get shorter and more numerous.

Let’s call the particle’s velocity between the (i-1)st and ith time steps v_i.

\displaystyle{ v_i = \frac{q_i - q_{i-1}}{\Delta t} }

The action, defined as an integral, is now equal to a finite sum:

\displaystyle{ A(q) = \sum_{i = 1}^n \frac{mv_i^2}{2} \; \Delta t }

We’ll consider histories of the particle where its initial position is

q_0 = 0

but its final position q_n is arbitrary. Why? If we don’t ‘nail down’ the particle at some particular time, our path integrals will diverge. So, our space of histories is

X = \mathbb{R}^n

and now we’re ready to apply the formulas we developed last time!

We saw last time that the partition function is the key to all wisdom, so let’s start with that. Naively, it’s

\displaystyle{  Z = \int_X e^{- \beta A(q)} Dq }

where

\displaystyle{ \beta = \frac{1}{i \hbar} }

But there’s a subtlety here. Doing this integral requires a measure on our space of histories. Since the space of histories is just \mathbb{R}^n with coordinates q_1, \dots, q_n, an obvious guess for a measure would be

Dq = dq_1 \cdots dq_n    \qquad \qquad \qquad \qquad \quad \textrm{(obvious first guess)}

However, the partition function should be dimensionless! You can see why from the discussion of units last time. The quantity \beta A(q) and thus its exponential is dimensionless, so our measure had better be dimensionless too. But dq_1 \cdots dq_n has units of \mathrm{length}^n. To deal with this we can introduce a length scale, which I’ll call \Delta x, and use the measure

Dq = \displaystyle{ \frac{1}{(\Delta x)^n} \, dq_1 \cdots dq_n }   \qquad \qquad \qquad  \textrm{(what we'll actually use)}

I should however emphasize that despite the notation \Delta x, I’m not discretizing space, just time. We could also discretize space, but it would make the calculation a lot harder. I’m only introducing this length scale \Delta x to make our measure on the space of histories dimensionless.

Now let’s compute the partition function. For starters, we have

\begin{array}{ccl} Z &=& \displaystyle{ \int_X e^{-\beta A(q)} \; Dq } \\  \\ &=& \displaystyle{  \frac{1}{(\Delta x)^n} \int e^{-\beta \sum_{i=1}^n m \, \Delta t \, v_i^2 /2} \; dq_1 \cdots dq_n } \end{array}

Normally when I see an integral bristling with annoying constants like this, I switch to a system of units where most of them equal 1. But I’m trying to get a physical feel for quantropy, so I’ll leave them all in. That way, we can see how they affect the final answer.

Since

\displaystyle{ v_i = \frac{q_i - q_{i-1}}{\Delta t} }

we can show that

dq_1 \cdots dq_n = (\Delta t)^n \; dv_1 \cdots dv_n

To show this, we need to work out the Jacobian of the transformation from the q_i coordinates to the v_i coordinates on our space of histories—but this is easy to do, since the determinant of a triangular matrix is the product of its diagonal entries.
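
Here is that Jacobian computed numerically, just as a sanity check (my own sketch): the matrix taking (q_1, \dots, q_n) to (v_1, \dots, v_n) is lower triangular with 1/\Delta t down the diagonal, so its determinant is (1/\Delta t)^n:

```python
import numpy as np

n, dt = 5, 0.1
# v_i = (q_i - q_{i-1}) / dt with q_0 = 0: a lower triangular matrix.
J = (np.eye(n) - np.eye(n, k=-1)) / dt
print(np.linalg.det(J), (1 / dt)**n)   # both 100000.0, so dq = (dt)^n dv
```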

We can rewrite the path integral using this change of variables:

Z = \displaystyle{\left(\frac{\Delta t}{\Delta x}\right)^n \int e^{-\beta \sum_{i=1}^n m \, \Delta t \, v_i^2 /2}  \; dv_1 \cdots dv_n }

But since an exponential of a sum is a product of exponentials, this big fat n-tuple integral is really just a product of n ordinary integrals. And all these integrals are equal, so we just get some integral to the nth power! Let’s call the variable in this integral v, since it could be any of the v_i:

Z =  \displaystyle{ \left(\frac{\Delta t}{\Delta x}  \int_{-\infty}^\infty e^{-\beta \, m \, \Delta t \, v^2 /2} \; dv \right)^n }

How do we do the integral here? Well, that’s easy…

Integrating Gaussians

We should all know the integral of our favorite Gaussian. As a kid, my favorite was this:

\displaystyle{ \int_{-\infty}^\infty e^{-x^2} \; d x = \sqrt{\pi} }

because this looks the simplest. But now, I prefer this:

\displaystyle{ \int_{-\infty}^\infty e^{-x^2/2} \; d x = \sqrt{2 \pi} }

They’re both true, so why did my preference change? First, I now like 2\pi better than \pi. There’s a whole manifesto about this, and I agree with it. Second, x^2/2 is better than x^2 for what we’re doing, since kinetic energy is one half the mass times the velocity squared. Originally physicists like Descartes and Leibniz defined kinetic energy to be m v^2, but the factor of 1/2 turns out to make everything work better. Nowadays every Hamiltonian or Lagrangian with a quadratic term in it tends to have a 1/2 in front—basically because the first thing you do with it is differentiate it, and the 1/2 cancels the resulting 2. The factor of 1/2 is just a convention, even in the definition of kinetic energy, but if we didn’t make that convention we’d be punished with lots of factors of 2 all over.

Of course it doesn’t matter much: you just need to remember the integral of some Gaussian, or at least know how to calculate it. And you’ve probably read this quote:

A mathematician is someone to whom

\displaystyle{ \int_{-\infty}^\infty e^{-x^2/2} \; d x = \sqrt{2 \pi} }

is as obvious as 2+2=4 is to you and me. – Lord Kelvin

So if you’ve learned the trick for doing this integral, you can call yourself a mathematician.

Stretching the above Gaussian by a factor of \sqrt{\alpha} increases the integral by a factor of \sqrt{\alpha}, so we get

\displaystyle{ \int_{-\infty}^\infty e^{-x^2/2\alpha} \; d x = \sqrt{2 \pi \alpha}  }

This is clear when \alpha is positive, but soon we’ll apply it when \alpha is imaginary! That makes some mathematicians sweaty and nervous. For example, we’re saying that

\displaystyle{ \int_{-\infty}^\infty e^{i x^2 / 2} \, dx = \sqrt{2 \pi i}}

But this integral doesn’t converge if you slap absolute values on the function inside: in math jargon, the function inside isn’t ‘Lebesgue integrable’. But we can tame it in various ways. We can impose a ‘cutoff’ and then let it go to infinity:

\displaystyle{ \lim_{M \to + \infty} \int_{-M}^M e^{i x^2 / 2} \, dx = \sqrt{2 \pi i} }

or we can damp the oscillations, and then let the amount of damping go to zero:

\displaystyle{ \lim_{\epsilon \downarrow 0} \int_{-\infty}^\infty e^{(i - \epsilon) x^2 / 2} \, dx = \sqrt{2 \pi i} }

We get the same answer either way, or indeed using many other methods. Since such tricks work for all the integrals I’ll write down, I won’t engage in further hand-wringing over this issue. We’ve got bigger things to worry about, like: what’s the physical meaning of quantropy?
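
If you’d like to reassure yourself numerically, here is a sketch (mine) of the damped version with a small \epsilon:

```python
import numpy as np

eps = 0.02
x, dx = np.linspace(-60, 60, 400001, retstep=True)
integral = np.sum(np.exp((1j - eps) * x**2 / 2)) * dx
print(integral)               # close to sqrt(2 pi i), and closer as eps shrinks
print(np.sqrt(2j * np.pi))    # (1.7725 + 1.7725j)
```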

Computing the partition function

Where were we? We had this formula for the partition function:

Z =  \displaystyle{ \left( \frac{\Delta t}{\Delta x} \int_{-\infty}^\infty e^{-\beta \, m \, \Delta t \, v^2 /2}  \; dv \right)^n }

and now we’re letting ourselves use this formula:

\displaystyle{ \int_{-\infty}^\infty e^{-x^2/2\alpha} \; d x = \sqrt{2 \pi \alpha}  }

even when \alpha is imaginary, so we get

Z = \displaystyle{ \left( \frac{\Delta t}{\Delta x} \sqrt{ \frac{2 \pi}{\beta m \, \Delta t}} \right)^n =  \left(\frac{2 \pi \Delta t}{\beta m \, (\Delta x)^2}\right)^{n/2}  }

And a nice thing about keeping all these constants floating around is that we can use dimensional analysis to check our work. The partition function should be dimensionless, and it is! To see this, just remember that \beta = 1/i\hbar has dimensions of inverse action, or T/M L^2.

Expected action

Now that we’ve got the partition function, what do we do with it? We can compute everything we care about. Remember, in statistical mechanics there’s a famous formula:

free energy = expected energy – temperature × entropy

and last time we saw that similarly, in quantum mechanics we have:

free action = expected action – classicality × quantropy

where the classicality is

1/\beta = i \hbar

In other words:

\displaystyle{ F = \langle A \rangle - \frac{1}{\beta}\, Q }

Last time I showed you how to compute F and \langle A \rangle starting from the partition function. So, we can use the above formula to work out the quantropy as well:

• Expected action: \langle A \rangle = - \frac{d}{d \beta} \ln Z

• Free action: F = -\frac{1}{\beta} \ln Z

• Quantropy: Q = \ln Z - \beta \,\frac{d }{d \beta}\ln Z

But let’s start with the expected action. The answer will be so amazingly simple, yet strange, that I’ll want to spend the rest of this post discussing it.

Using our hard-won formula

\displaystyle{ Z = \left(\frac{2 \pi \Delta t}{\beta m \, (\Delta x)^2}\right)^{n/2}  }

we get

\begin{array}{ccl} \langle A \rangle &=& \displaystyle{ -\frac{d}{d \beta} \ln Z } \\  \\  &=& \displaystyle{ -\frac{n}{2}  \frac{d}{d \beta}  \ln \left(\frac{2 \pi \Delta t}{\beta m \, (\Delta x)^2}\right) } \\  \\ &=& \displaystyle{ -\frac{n}{2}  \frac{d}{d \beta} \left( \ln \left(\frac{2 \pi \Delta t}{m \, (\Delta x)^2}\right) - \ln \beta \right) } \\   \\  &=& \displaystyle{ \frac{n}{2} \; \frac{1}{\beta} }  \\  \\ &=& \displaystyle{ n\;  \frac{i \hbar}{2} }  \end{array}

Wow! When we get an answer this simple, it must mean something! This formula is saying that the expected action of our freely moving quantum particle is proportional to n, the number of time steps. Each time step contributes i \hbar / 2 to the expected action. The mass of the particle, the time step \Delta t, and the length scale \Delta x don’t matter at all!

Why don’t they matter? Well, you can see from the above calculation that they just disappear when we take the derivative of the logarithm containing them. That’s not a profound philosophical explanation, but it implies that our action could be any quadratic function like this:

A : \mathbb{R}^n \to \mathbb{R}

\displaystyle{ A(x) = \sum_{i = 1}^n \frac{c_i x_i^2}{2} }

where c_i are positive numbers, and we’d still get the same expected action:

\langle A \rangle = \displaystyle{ n\; \frac{i \hbar}{2} }

The numbers c_i don’t matter!
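
Here is a quick symbolic check of that claim for n = 3 (my own sketch; I treat \beta as a positive symbol, since the algebra is the same whether \beta is real or imaginary):

```python
import sympy as sp

beta = sp.symbols('beta', positive=True)
c = sp.symbols('c1 c2 c3', positive=True)

# Z = prod_i integral of exp(-beta c_i x^2 / 2) dx = prod_i sqrt(2 pi / (beta c_i))
Z = sp.prod([sp.sqrt(2 * sp.pi / (beta * ci)) for ci in c])
expected_action = -sp.diff(sp.log(Z), beta)
print(sp.simplify(expected_action))   # 3/(2*beta), independent of the c_i
```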

The quadratic function we’re talking about here is an example of a quadratic form. Because the numbers c_i are positive, it’s a positive definite quadratic form. And since we can diagonalize any positive definite quadratic form, we can state our result in a fancier, more elegant way:

Whenever the action is a positive definite quadratic form on an n-dimensional vector space of histories, the expected action is n times i \hbar / 2.

For example, take a free particle in 3d Euclidean space, and discretize time into n steps as we’ve done here. Then the action is a positive definite quadratic form on a 3n-dimensional vector space:

\displaystyle{ A(q) = \sum_{i = 1}^n \frac{m \vec{v}_i \cdot \vec{v}_i}{2} \; \Delta t }

since now each velocity \vec{v}_i is a vector with 3 components. So, the expected action is 3n times i \hbar / 2.

Poetically speaking, 3n is the total number of ‘decisions’ our particle makes throughout its history. What do I mean by that? In the path integral approach to quantum mechanics, a system can trace out any history it wants. But it takes a bunch of real numbers to determine a specific history. Each number counts as one ‘decision’. And in the situation we’ve described, each decision contributes i \hbar / 2 to the expected action.

So here’s a more intuitive way to think about our result:

In the path integral approach to quantum theory, each ‘decision’ made by the system contributes i \hbar / 2 to the expected action… as long as the action is given by a positive definite quadratic form on some vector space of histories.

There’s a lot more to say about this. For example, in the harmonic oscillator the action is a quadratic form, but it’s not positive definite. What happens then? But three more immediate questions leap to my mind:

1) Why is the expected action imaginary?

2) Should we worry that it diverges as n \to \infty?

3) Is this related to the heat capacity of an ideal gas?

So, let me conclude this post by trying to answer those.

Why is the expected action imaginary?

The action A is real. How in the world can its expected value be imaginary?

The reason is that we’re not taking its expected value with respect to a probability measure, but instead, with respect to a complex-valued measure. Last time we gave this very general definition:

\langle A \rangle = \displaystyle{  \frac{\int_X A(x) e^{-\beta A(x)} \, dx }{\int_X e^{-\beta A(x)} \, dx }}

The action A is real, but \beta = 1 / i \hbar is imaginary, so it’s not surprising that this ‘expected value’ is complex-valued.

Later we’ll see a good reason why it has to be purely imaginary.

Why does it diverge as n → ∞?

Consider our particle on a line, with time discretized into n time steps. Its expected action is

\langle A \rangle = \displaystyle{ n\; \frac{i \hbar}{2} }

To take the continuum limit we must let n \to \infty while simultaneously letting \Delta t \to 0 in such a way that n \Delta t stays constant. Some quantities will converge when we take this limit, but the expected action will not. It will go to infinity!

That’s a bit sad, but not unexpected. It’s a lot like how the expected length of the path of a particle carrying out Brownian motion is infinite. In 3 dimensions, a typical Brownian path looks like this:


In fact the free quantum particle is just a ‘Wick-rotated’ version of Brownian motion, where we replace time by imaginary time, so the analogy is fairly close. The action we’re considering now is not exactly analogous to the arclength of a path:

\displaystyle{ \int_0^T \left| \frac{d q}{d t} \right| \; dt }

Instead, it’s proportional to this quadratic form:

\displaystyle{ \int_0^T \left| \frac{d q}{d t} \right|^2 \; dt }

However, both these quantities diverge when we discretize Brownian motion and then take the continuum limit.

How sad should we be that the expected action is infinite in the continuum limit? Not too sad, I think. Any result that applies to all discretizations of a continuum problem should, I think, say something about that continuum problem. For us the expected action diverges, but the ‘expected action per decision’ is constant, and that’s something we can hope to understand even in the continuum limit!

Is this related to the heat capacity of an ideal gas?

That may seem like a strange question, unless you remember some formulas about the thermodynamics of an ideal gas!

Let’s say we’re in 3d Euclidean space. (Most of us already are, but some of my more spacy friends will need to pretend.) If we have an ideal gas made of n point particles at temperature T, its expected energy is

\frac{3}{2} n k T

where k is Boltzmann’s constant. This is a famous fact, which lets people compute the heat capacity of a monatomic ideal gas.

On the other hand, we’ve seen that in quantum mechanics, a single point particle will have an expected action of

\frac{3}{2} n i \hbar

after n time steps.

These results look awfully similar. Are they related?

Yes! These are just two special cases of the same result! The energy of the ideal gas is a quadratic form on a 3n-dimensional vector space; so is the action of our discretized point particle. The ideal gas is a problem in statistical mechanics; the point particle is a problem in quantum mechanics. In statistical mechanics we have

\displaystyle{ \beta = \frac{1}{k T} }

while in quantum mechanics we have

\displaystyle{ \beta = \frac{1}{i \hbar} }

Mathematically, they are the exact same problem except that \beta is real in one case, imaginary in the other. This is another example of the analogy between statistical mechanics and quantum mechanics—the analogy that motivated quantropy in the first place!

And this makes it even more obvious that the expected action must be imaginary… at least when the action is a positive definite quadratic form.
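
If you like to check such claims with a computer, here’s a little sympy sketch. It’s my own illustration, not part of the original argument: for a positive definite quadratic form on \mathbb{R}^d we have Z \propto (\pi/\beta)^{d/2}, so \langle A \rangle = -\partial \ln Z / \partial \beta = d/(2\beta). Setting d = 3n and \beta = 1/kT gives the ideal gas result, while formally setting \beta = 1/i\hbar gives the expected action.

```python
# Equipartition for a quadratic form: <A> = -d/dbeta ln Z = d/(2 beta).
# (For imaginary beta the Gaussian integral is oscillatory, so that case
# should be read as an analytic continuation of this computation.)
import sympy as sp

beta, d = sp.symbols('beta d', positive=True)
Z = (sp.pi / beta)**(d / 2)            # det(M) only shifts ln Z by a constant
expected_A = -sp.diff(sp.log(Z), beta)
print(sp.simplify(expected_A))         # prints d/(2*beta)
```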


Classical Mechanics versus Thermodynamics (Part 2)

23 January, 2012

I showed you last time that in many branches of physics—including classical mechanics and thermodynamics—we can see our task as minimizing or maximizing some function. Today I want to show how we get from that task to symplectic geometry.

So, suppose we have a smooth function

S: Q \to \mathbb{R}

where Q is some manifold. A minimum or maximum of S can only occur at a point where

d S = 0

Here the differential d S is a 1-form on Q. If we pick local coordinates q^i in some open set of Q, then we have

\displaystyle {d S = \frac{\partial S}{\partial q^i} dq^i }

and these derivatives \displaystyle{ \frac{\partial S}{\partial q^i} } are very interesting. Let’s see why:

Example 1. In classical mechanics, consider a particle on a manifold X. Suppose the particle starts at some fixed position at some fixed time. Suppose that it ends up at the position x at time t. Then the particle will seek to follow a path that minimizes the action given these conditions. Assume this path exists and is unique. The action of this path is then called Hamilton’s principal function, S(x,t). Let

Q = X \times \mathbb{R}

and assume Hamilton’s principal function is a smooth function

S : Q \to \mathbb{R}

We then have

d S = p_i dq^i - H d t

where q^i are local coordinates on X,

\displaystyle{ p_i = \frac{\partial S}{\partial q^i} }

is called the momentum in the ith direction, and

\displaystyle{ H = - \frac{\partial S}{\partial t} }

is called the energy. The minus signs here are basically just a mild nuisance. Time is different from space, and in special relativity the difference comes from a minus sign, but I don’t think that’s the explanation here. We could get rid of the minus signs by working with negative energy, but it’s not such a big deal.
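
To make Example 1 concrete, here’s a small sympy check. (This is an added illustration, using the standard fact that a free particle of mass m starting at the origin at time zero has Hamilton’s principal function S(x,t) = m x^2/2t.)

```python
# Free particle on a line, starting at the origin at time zero:
# S(x,t) = m x^2 / (2 t).  Check that p = dS/dx and H = -dS/dt
# satisfy the expected relation H = p^2 / 2m.
import sympy as sp

m, x, t = sp.symbols('m x t', positive=True)
S = m * x**2 / (2 * t)
p = sp.diff(S, x)          # momentum: m x / t
H = -sp.diff(S, t)         # energy:   m x^2 / (2 t^2)
print(sp.simplify(H - p**2 / (2 * m)))   # prints 0
```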

Example 2. In thermodynamics, consider a system with internal energy U and volume V. Then the system will choose a state that maximizes the entropy given these constraints. Assume this state exists and is unique. Call the entropy of this state S(U,V). Let

Q = \mathbb{R}^2

and assume the entropy is a smooth function

S : Q \to \mathbb{R}

We then have

d S = \displaystyle{\frac{1}{T} d U + \frac{P}{T} d V }

where T is the temperature of the system, and P is the pressure. The slight awkwardness of this formula makes people favor other setups.
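
Here’s a concrete instance, checked in sympy. It’s a toy example of mine: a monatomic ideal gas, whose entropy is S(U,V) = nk(\ln V + \frac{3}{2}\ln U) up to additive constants. Reading off the partial derivatives recovers the usual equations of state:

```python
# Ideal gas in the entropy scheme: S(U,V) = n k (ln V + (3/2) ln U) + const.
# Then 1/T = dS/dU and P/T = dS/dV give U = (3/2) n k T and P V = n k T.
import sympy as sp

n, k, U, V = sp.symbols('n k U V', positive=True)
S = n * k * (sp.log(V) + sp.Rational(3, 2) * sp.log(U))
one_over_T = sp.diff(S, U)           # = 3 n k / (2 U)
P_over_T = sp.diff(S, V)             # = n k / V
print(sp.simplify(one_over_T * U))   # 3*k*n/2, i.e. U = (3/2) n k T
print(sp.simplify(P_over_T * V))     # k*n,     i.e. P V = n k T
```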

Example 3. In thermodynamics there are many setups for studying the same system using different minimum or maximum principles. One of the most popular is called the energy scheme. If internal energy increases with increasing entropy, as is usually the case, this scheme is equivalent to the one we just saw.

In the energy scheme we fix the entropy S and volume V. Then the system will choose a state that minimizes the internal energy given these constraints. Assume this state exists and is unique. Call the internal energy of this state U(S,V). Let

Q = \mathbb{R}^2

and assume the internal energy is a smooth function

U : Q \to \mathbb{R}

We then have

d U = T d S - P d V

where

\displaystyle{ T = \frac{\partial U}{\partial S} }

is the temperature, and

\displaystyle{ P = - \frac{\partial U}{\partial V} }

is the pressure. You’ll note the formulas here closely resemble those in Example 1!
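
And here’s the same toy gas in the energy scheme, again as an added sketch, with the constant nk set to 1 for readability. Solving the entropy formula for U gives U(S,V) = V^{-2/3} e^{2S/3}, and the partial derivatives hand back T and P:

```python
# Energy scheme for the toy ideal gas (n k = 1): U(S,V) = V^(-2/3) exp(2S/3).
import sympy as sp

S, V = sp.symbols('S V', positive=True)
U = V**sp.Rational(-2, 3) * sp.exp(sp.Rational(2, 3) * S)
T = sp.diff(U, S)                # temperature: T = dU/dS = (2/3) U
P = -sp.diff(U, V)               # pressure:    P = -dU/dV = (2/3) U / V
print(sp.simplify(P * V / T))    # prints 1, i.e. P V = T with n k = 1
```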

Example 4. Here are the four most popular schemes for thermodynamics:

• If we fix the entropy S and volume V, the system will choose a state that minimizes the internal energy U(S,V).

• If we fix the entropy S and pressure P, the system will choose a state that minimizes the enthalpy H(S,P).

• If we fix the temperature T and volume V, the system will choose a state that minimizes the Helmholtz free energy A(T,V).

• If we fix the temperature T and pressure P, the system will choose a state that minimizes the Gibbs free energy G(T,P).

These quantities are related by a pack of similar-looking formulas, from which we may derive a mind-numbing little labyrinth of Maxwell relations. But for now, all we need to know is that all these approaches to thermodynamics are equivalent given some reasonable assumptions, and all the formulas and relations can be derived using the Legendre transformation trick I explained last time. So, I won’t repeat what we did in Example 3 for all these other cases!

Example 5. In classical statics, consider a particle on a manifold Q. This particle will seek to minimize its potential energy V(q), which we’ll assume is some smooth function of its position q \in Q. We then have

d V = -F_i dq^i

where q^i are local coordinates on Q and

\displaystyle{ F_i = -\frac{\partial V}{\partial q^i} }

is called the force in the ith direction.

Conjugate variables

So, the partial derivatives of the quantity we’re trying to minimize or maximize are very important! As a result, we often want to give them equal status as independent quantities in their own right. Then we call them ‘conjugate variables’.

To make this precise, consider the cotangent bundle T^* Q, which has local coordinates q^i (coming from the coordinates on Q) and p_i (the corresponding coordinates on each cotangent space). We then call p_i the conjugate variable of the coordinate q^i.

Given a smooth function

S : Q \to \mathbb{R}

the 1-form d S can be seen as a section of the cotangent bundle. The graph of this section is defined by the equation

\displaystyle{ p_i = \frac{\partial S}{\partial q^i} }

and this equation ties together two intuitions about ‘conjugate variables’: as coordinates on the cotangent bundle, and as partial derivatives of the quantity we’re trying to minimize or maximize.

The tautological 1-form

There is a lot to say here, especially about Legendre transformations, but I want to hasten on to a bit of symplectic geometry. And for this we need the ‘tautological 1-form’ on T^* Q.

We can think of d S as a map

d S : Q \to T^* Q

sending each point q \in Q to the point (q,p) \in T^* Q where p is defined by the equation we just saw:

\displaystyle{ p_i = \frac{\partial S}{\partial q^i} }

Using this map, we can pull back any 1-form on T^* Q to get a 1-form on Q.

What 1-form on Q might we like to get? Why, d S of course!

Amazingly, there’s a 1-form \alpha on T^* Q such that when we pull it back using the map d S, we get the 1-form d S—no matter what smooth function S we started with!

Thanks to this wonderfully tautological property, \alpha is called the tautological 1-form on T^* Q. You should check that it’s given by the formula

\alpha = p_i dq^i

If you get stuck, note that pulling back along d S just replaces each p_i by \partial S/\partial q^i.

So, if we want to see how much S changes as we move along a path in Q, we can do this in three equivalent ways:

• Evaluate S at the endpoint of the path and subtract off S at the starting-point.

• Integrate the 1-form d S along the path.

• Use d S : Q \to T^* Q to map the path over to T^* Q, and then integrate \alpha over this path in T^* Q.

The last method is equivalent thanks to the ‘tautological’ property of \alpha. It may seem overly convoluted, but it shows that if we work in T^* Q, where the conjugate variables are accorded equal status, everything we want to know about the change in S is contained in the 1-form \alpha, no matter which function S we decide to use!

So, in this sense, \alpha knows everything there is to know about the change in Hamilton’s principal function in classical mechanics, or the change in entropy in thermodynamics… and so on!

But this means it must know about things like Hamilton’s equations, and the Maxwell relations.

The symplectic structure

We saw last time that the fundamental equations of classical mechanics and thermodynamics—Hamilton’s equations and the Maxwell relations—are mathematically just the same. They both say simply that partial derivatives commute:

\displaystyle { \frac{\partial^2 S}{\partial q^i \partial q^j} = \frac{\partial^2 S}{\partial q^j \partial q^i} }

where S: Q \to \mathbb{R} is the function we’re trying to minimize or maximize.

I also mentioned that this fact—the commuting of partial derivatives—can be stated in an elegant coordinate-free way:

d^2 S = 0

Perhaps I should remind you of the proof:

d^2 S =   d \left( \displaystyle{ \frac{\partial S}{\partial q^i} dq^i } \right) = \displaystyle{ \frac{\partial^2 S}{\partial q^j \partial q^i} dq^j \wedge dq^i }

but

dq^j \wedge dq^i

changes sign when we switch i and j, while

\displaystyle{ \frac{\partial^2 S}{\partial q^j \partial q^i}}

does not, so d^2 S = 0. It’s just a wee bit more work to show that conversely, starting from d^2 S = 0, it follows that the mixed partials must commute.

How can we state this fact using the tautological 1-form \alpha? I said that using the map

d S : Q \to T^* Q

we can pull back \alpha to Q and get d S. But pulling back commutes with the d operator! So, if we pull back d \alpha, we get d^2 S. But d^2 S = 0. So, d \alpha has the magical property that when we pull it back to Q, we always get zero, no matter what S we choose!

This magical property captures Hamilton’s equations, the Maxwell relations and so on—for all choices of S at once. So it shouldn’t be surprising that the 2-form

\theta = d \alpha

is colossally important: it’s the famous symplectic structure on the so-called phase space T^* Q.

Well, actually, most people prefer to work with

\omega = - d \alpha

It seems this whole subject is a monument of austere beauty… covered with minus signs, like bird droppings.

Example 6. In classical mechanics, let

Q = X \times \mathbb{R}

as in Example 1. If Q has local coordinates q^i, t, then T^* Q has these along with the conjugate variables as coordinates. As we explained, it causes little trouble to call these conjugate variables by the same names we used for the partial derivatives of S: namely, p_i and -H. So, we have

\alpha = p_i dq^i - H d t

and thus

\omega = dq^i \wedge dp_i - dt \wedge dH
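
If you want to check this: d\alpha = dp_i \wedge dq^i - dH \wedge dt, and switching the two factors in each wedge product flips its sign, which turns -d\alpha into the formula above.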

Example 7. In thermodynamics, let

Q = \mathbb{R}^2

as in Example 3. If Q has coordinates S, V then the conjugate variables deserve to be called T, -P. So, we have

\alpha = T d S - P d V

and

\omega = d S \wedge d T - d V \wedge d P

You’ll see that in these formulas for \omega, variables get paired with their conjugate variables. That’s nice.

But let me expand on what we just saw, since it’s important. And let me talk about \theta =  d\alpha, without tossing in that extra sign.

What we saw is that the 2-form \theta is a ‘measure of noncommutativity’. When we pull \theta back to Q we get zero. This says that partial derivatives commute—and this gives Hamilton’s equations, the Maxwell relations, and all that. But up in T^* Q, \theta is not zero. And this suggests that there’s some built-in noncommutativity hiding in phase space!

Indeed, we can make this very precise. Consider a little parallelogram up in T^* Q.

Suppose we integrate the 1-form \alpha up the left edge and across the top. Do we get the same answer if we integrate it across the bottom edge and then up the right?

No, not necessarily! The difference is the same as the integral of \alpha all the way around the parallelogram. By Stokes’ theorem, this is the same as integrating \theta over the parallelogram. And there’s no reason that should give zero.

However, suppose we got our parallelogram in T^* Q by taking a parallelogram in Q and applying the map

d S : Q \to T^* Q

Then the integral of \alpha around our parallelogram would be zero, since it would equal the integral of d S around a parallelogram in Q… and that’s the change in S as we go around a loop from some point to… itself!
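
Here’s a sympy illustration of that last point, a sanity check of my own with S(q,t) = q^2 t as an arbitrary test function. Integrating \alpha around the image of the unit square in Q gives exactly zero:

```python
# Integrate alpha = p dq - H dt around the image under dS of the unit
# square in Q, where p = dS/dq and H = -dS/dt.  The answer is 0, since
# the pullback of alpha along dS is the exact form dS.
import sympy as sp

q, t, s = sp.symbols('q t s')
S = q**2 * t                 # any smooth test function S(q,t) works here
p_fun = sp.diff(S, q)
H_fun = -sp.diff(S, t)

def leg(qs, ts):
    """Integral of alpha along s -> (q(s), t(s)) lifted by dS, s in [0,1]."""
    sub = {q: qs, t: ts}
    integrand = p_fun.subs(sub) * sp.diff(qs, s) - H_fun.subs(sub) * sp.diff(ts, s)
    return sp.integrate(integrand, (s, 0, 1))

loop = leg(s, sp.S(0)) + leg(sp.S(1), s) + leg(1 - s, sp.S(1)) + leg(sp.S(0), 1 - s)
print(sp.simplify(loop))     # prints 0
# By contrast, a unit square in the (q,p) plane, with p an independent
# coordinate, picks up the (signed) symplectic area: integrating p dq
# around it gives plus or minus 1 depending on orientation, not 0.
```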

And indeed, the fact that a function S doesn’t change when we go around a parallelogram is precisely what makes

\displaystyle { \frac{\partial^2 S}{\partial q^i \partial q^j} = \frac{\partial^2 S}{\partial q^j \partial q^i} }

So the story all fits together quite nicely.

The big picture

I’ve tried to show you that the symplectic structure on the phase spaces of classical mechanics, and the lesser-known but utterly analogous one on the phase spaces of thermodynamics, is a natural outgrowth of utterly trivial reflections on the process of minimizing or maximizing a function S on a manifold Q.

The first derivative test tells us to look for points with

d S = 0

while the commutativity of partial derivatives says that

d^2 S = 0

everywhere—and this gives Hamilton’s equations and the Maxwell relations. The 1-form d S is the pullback of the tautologous 1-form \alpha on T^* Q, and similarly d^2 S is the pullback of the symplectic structure d\alpha. The fact that

d \alpha \ne 0

says that T^* Q holds noncommutative delights, almost like a world where partial derivatives no longer commute! But of course we still have

d^2 \alpha = 0

everywhere, and this becomes part of the official definition of a symplectic structure.

All very simple. I hope, however, the experts note that to see this unified picture, we had to avoid the most common approaches to classical mechanics, which start with either a ‘Hamiltonian’

H : T^* Q \to \mathbb{R}

or a ‘Lagrangian’

L : T Q \to \mathbb{R}

Instead, we started with Hamilton’s principal function

S : Q \to \mathbb{R}

where Q is not the usual configuration space describing possible positions for a particle, but the ‘extended’ configuration space, which also includes time. Only this way do Hamilton’s equations, like the Maxwell relations, become a trivial consequence of the fact that partial derivatives commute.

But what about those ‘noncommutative delights’? First, there’s a noncommutative Poisson bracket operation on functions on T^* Q. This makes the functions into a so-called Poisson algebra. In classical mechanics of a point particle on the line, for example, it’s well-known that we have

\begin{array}{ccr}  \{ p, q \} &=& 1 \\  \{ H, t \} &=& -1 \end{array}

In thermodynamics, the analogous relations

\begin{array}{ccr}  \{ T, S \} &=& 1 \\  \{ P, V \} &=& -1 \end{array}

seem sadly little-known. But you can see them here, for example:

• M. J. Peterson, Analogy between thermodynamics and mechanics, American Journal of Physics 47 (1979), 488–490.

at least up to one of those pesky minus signs! We can use these Poisson brackets to study how one thermodynamic variable changes as we slowly change another, staying close to equilibrium all along.
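
If you want to play with these brackets concretely, here’s a minimal sympy sketch of my own, with the sign convention chosen to reproduce the relations quoted above:

```python
# Poisson brackets on the extended phase space with coordinates q, p, t, H,
# signs chosen to match {p,q} = 1 and {H,t} = -1 above.  Renaming
# (q, p, t, H) -> (S, T, V, P) then gives {T,S} = 1 and {P,V} = -1.
import sympy as sp

q, p, t, H = sp.symbols('q p t H')

def bracket(f, g):
    return (sp.diff(f, p) * sp.diff(g, q) - sp.diff(f, q) * sp.diff(g, p)
            - sp.diff(f, H) * sp.diff(g, t) + sp.diff(f, t) * sp.diff(g, H))

print(bracket(p, q))   # 1
print(bracket(H, t))   # -1
```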

Second, we can go further and ‘quantize’ the functions on T^* Q. This means coming up with an associative but noncommutative product of these functions that mimics the Poisson bracket to some extent. In the case of a particle on a line, we’d get commutation relations like

\begin{array}{lcr}  p q - q p &=& - i \hbar \\  H t - t H &=& i \hbar \end{array}

where \hbar is Planck’s constant. Now we can represent these quantities as operators on a Hilbert space, the uncertainty principle kicks in, and life gets really interesting.

In thermodynamics, the analogous relations would be

\begin{array}{ccr}  T S - S T &=& - i \hbar \\  P V - V P &=& i \hbar \end{array}

The math works just the same, but what does it mean physically? Are we now thinking of temperature, entropy and the like as ‘quantum observables’—for example, operators on a Hilbert space? Are we just quantizing thermodynamics?

That’s one possible interpretation, but I’ve never heard anyone discuss it. Here’s one good reason: as Blake Stacey pointed out, these equations don’t pass the test of dimensional analysis! The quantities on the left have units of energy, while Planck’s constant has units of action. So maybe we need to introduce a quantity with units of time on the right, or maybe there’s some other interpretation where we don’t interpret the parameter \hbar as the good old-fashioned Planck’s constant, but as something else instead.

And if you’ve really been paying attention, you may wonder how quantropy fits into this game! I showed that at least in a toy model, the path integral formulation of quantum mechanics arises, not exactly from maximizing or minimizing something, but from finding its critical points: that is, points where its first derivative vanishes. This something is a complex-valued quantity analogous to entropy, which I called ‘quantropy’.

Now, while I keep throwing around words like ‘minimize’ and ‘maximize’, most everything I’m doing works just fine for critical points. So, it seems that the apparatus of symplectic geometry may apply to the path-integral formulation of quantum mechanics.

But that would be weirdly interesting! In particular, what would happen when we go ahead and quantize the path-integral formulation of quantum mechanics?

If you’re a physicist, there’s a guess that will come tripping off your tongue at this point, without you even needing to think. Me too. But I don’t know if that guess is right.

Less mind-blowingly, there is also the question of how symplectic geometry enters into classical statics via the idea of Example 5.

But there’s a lot of fun to be had in this game already with thermodynamics.

Appendix

I should admit, just so you don’t think I failed to notice, that only rather esoteric physicists study the approach to quantum mechanics where time is an operator that doesn’t commute with the Hamiltonian H. In this approach H commutes with the momentum and position operators. I didn’t write down those commutation equations, for fear you’d think I was a crackpot and stop reading! It is however a perfectly respectable approach, which can be reconciled with the usual one. And this issue is not only quantum-mechanical: it’s also important in classical mechanics.

Namely, there’s a way to start with the so-called extended phase space for a point particle on a manifold X:

T^* (X \times \mathbb{R})

with coordinates q^i, t, p_i and H, and get back to the usual phase space:

T^* X

with just q^i and p_i as coordinates. The idea is to impose a constraint of the form

H = f(q,p)

to knock off one degree of freedom, and use a standard trick called ‘symplectic reduction’ to knock off another.

Similarly, in quantum mechanics we can start with a big Hilbert space

L^2(X \times \mathbb{R})

on which q^i, t, p_i, and H are all operators, then impose a constraint expressing H in terms of p and q, and then use that constraint to pick out states lying in a smaller Hilbert space. This smaller Hilbert space is naturally identified with the usual Hilbert space for a point particle:

L^2(X)

Here X is called the configuration space for our particle; its cotangent bundle is the usual phase space. We call X \times \mathbb{R} the extended configuration space for our particle; its cotangent bundle is the extended phase space.

I’m having some trouble remembering where I first learned about these ideas, but here are some good places to start:

• Toby Bartels, Abstract Hamiltonian mechanics.

• Nikola Buric and Slobodan Prvanovic, Space of events and the time observable.

• Piret Kuusk and Madis Koiv, Measurement of time in nonrelativistic quantum and classical mechanics, Proceedings of the Estonian Academy of Sciences, Physics and Mathematics 50 (2001), 195–213.


Classical Mechanics versus Thermodynamics (Part 1)

19 January, 2012

It came as a bit of a shock last week when I realized that some of the equations I’d learned in thermodynamics were just the same as equations I’d learned in classical mechanics—with only the names of the variables changed, to protect the innocent.

Why didn’t anyone tell me?

For example: everybody loves Hamilton’s equations: there are just two, and they summarize the entire essence of classical mechanics. Most people hate the Maxwell relations in thermodynamics: there are lots, and they’re hard to remember.

But what I’d like to show you now is that Hamilton’s equations are Maxwell relations! They’re a special case, and you can derive them the same way. I hope this will make you like the Maxwell relations more, instead of liking Hamilton’s equations less.

First, let’s see what these equations look like. Then let’s see why Hamilton’s equations are a special case of the Maxwell relations. And then let’s talk about how this might help us unify different aspects of physics.

Hamilton’s equations

Suppose you have a particle on the line whose position q and momentum p are functions of time, t. If the energy H is a function of position and momentum, Hamilton’s equations say:

\begin{array}{ccr}  \displaystyle{  \frac{d p}{d t} }  &=&  \displaystyle{- \frac{\partial H}{\partial q} } \\  \\ \displaystyle{  \frac{d q}{d t} } &=&  \displaystyle{ \frac{\partial H}{\partial p} }  \end{array}

The Maxwell relations

There are lots of Maxwell relations, and that’s one reason people hate them. But let’s just talk about two; most of the others work the same way.

Suppose you have a physical system like a box of gas that has some volume V, pressure P, temperature T and entropy S. Then the first and second Maxwell relations say:

\begin{array}{ccr}  \displaystyle{ \left. \frac{\partial T}{\partial V}\right|_S } &=&  \displaystyle{ - \left. \frac{\partial P}{\partial S}\right|_V } \\   \\   \displaystyle{ \left. \frac{\partial S}{\partial  V}\right|_T  }  &=&  \displaystyle{ \left. \frac{\partial P}{\partial T} \right|_V }   \end{array}

Comparison

Clearly Hamilton’s equations resemble the Maxwell relations. Please check for yourself that the patterns of variables are exactly the same: only the names have been changed! So, apart from a key subtlety, Hamilton’s equations become the first and second Maxwell relations if we make these replacements:

\begin{array} {ccccccc}  q &\to& S & &  p &\to & T \\ t & \to & V & & H &\to & P \end{array}

What’s the key subtlety? One reason people hate the Maxwell relations is that they have lots of little symbols like \left. \right|_V saying what to hold constant when we take our partial derivatives. Hamilton’s equations don’t have those.

So, you probably won’t like this, but let’s see what we get if we write Hamilton’s equations so they exactly match the pattern of the Maxwell relations:

\begin{array}{ccr}     \displaystyle{ \left. \frac{\partial p}{\partial t} \right|_q }  &=&  \displaystyle{- \left. \frac{\partial H}{\partial q} \right|_t } \\  \\\displaystyle{  \left.\frac{\partial q}{\partial t} \right|_p } &=&  \displaystyle{ \left. \frac{\partial H}{\partial p} \right|_t }    \end{array}

This looks a bit weird, and it set me back a day. What does it mean to take the partial derivative of q in the t direction while holding p constant, for example?

I still think it’s weird. But I think it’s correct. To see this, let’s derive the Maxwell relations, and then derive Hamilton’s equations using the exact same reasoning, with only the names of variables changed.

Deriving the Maxwell relations

The Maxwell relations are extremely general, so let’s derive them in a way that makes that painfully clear. Suppose we have any smooth function U on the plane. Just for laughs, let’s call the coordinates of this plane S and V. Then we have

d U = T d S - P d V

for some functions T and P. This equation is just a concise way of saying that

\displaystyle{ T = \left.\frac{\partial U}{\partial S}\right|_V }

and

\displaystyle{ P = - \left.\frac{\partial U}{\partial V}\right|_S }

The minus sign here is unimportant: you can think of it as a whimsical joke. All the math would work just as well if we left it out.

(In reality, physicists call U the internal energy of a system, regarded as a function of its entropy S and volume V. They then call T the temperature and P the pressure. It just so happens that for lots of systems, the internal energy goes down as you increase the volume, so P works out to be positive if we stick in this minus sign, and that’s what people did. But you don’t need to know any of this physics to follow the derivation of the Maxwell relations!)

Now, mixed partial derivatives commute, so we have:

\displaystyle{ \frac{\partial^2 U}{\partial V \partial S} =  \frac{\partial^2 U}{\partial S \partial V}}

Plugging in our definitions of T and P, this says

\displaystyle{ \left. \frac{\partial T}{\partial V}\right|_S = - \left. \frac{\partial P}{\partial S}\right|_V }

And that’s the first Maxwell relation! So, there’s nothing to it: it’s just a sneaky way of saying that the mixed partial derivatives of the function U commute.
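
If you’d like to see this with actual functions rather than abstract partials, here’s a sympy sketch. I’m using a toy internal energy (the ideal gas with constants set to 1), but any smooth U would do, since the relation is just the commuting of mixed partials:

```python
# First Maxwell relation, checked for a toy internal energy U(S,V).
import sympy as sp

S, V = sp.symbols('S V', positive=True)
U = V**sp.Rational(-2, 3) * sp.exp(sp.Rational(2, 3) * S)
T = sp.diff(U, S)         # T =  dU/dS at constant V
P = -sp.diff(U, V)        # P = -dU/dV at constant S
# dT/dV|_S + dP/dS|_V should vanish:
print(sp.simplify(sp.diff(T, V) + sp.diff(P, S)))   # prints 0
```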

The second Maxwell relation works the same way. But seeing this takes a bit of thought, since we need to cook up a suitable function whose mixed partial derivatives are the two sides of this equation:

\displaystyle{ \left. \frac{\partial S}{\partial  V}\right|_T  = \left. \frac{\partial P}{\partial T} \right|_V }

There are different ways to do this, but for now let me use the time-honored method of ‘pulling the rabbit from the hat’.

Here’s the function we want:

A = U - T S

(In thermodynamics this function is called the Helmholtz free energy. It’s sometimes denoted F, but the International Union of Pure and Applied Chemistry recommends calling it A, which stands for the German word ‘Arbeit’, meaning ‘work’.)

Let’s check that this function does the trick:

\begin{array}{ccl} d A &=& d U - d(T S) \\  &=& (T d S - P d V) - (S dT + T d S) \\  &=& -S d T - P dV \end{array}

If we restrict ourselves to any subset of the plane where T and V serve as coordinates, the above equation is just a concise way of saying

\displaystyle{ S = - \left.\frac{\partial A}{\partial T}\right|_V }

and

\displaystyle{ P = - \left.\frac{\partial A}{\partial V}\right|_T }

Then since mixed partial derivatives commute, we get:

\displaystyle{ \frac{\partial^2 A}{\partial V \partial T} =  \frac{\partial^2 A}{\partial T \partial V}}

or in other words:

\displaystyle{ \left. \frac{\partial S}{\partial  V}\right|_T  = \left. \frac{\partial P}{\partial T} \right|_V }

which is the second Maxwell relation.
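
Continuing the sympy sketch from above, we can carry out this Legendre transformation explicitly for the toy U, solve for S in terms of T and V, and watch the second Maxwell relation come out:

```python
# Second Maxwell relation via the Helmholtz free energy A(T,V) = U - T S.
import sympy as sp

S, V, T = sp.symbols('S V T', positive=True)
U = V**sp.Rational(-2, 3) * sp.exp(sp.Rational(2, 3) * S)
S_of_TV = sp.solve(sp.Eq(T, sp.diff(U, S)), S)[0]    # invert T = dU/dS
A = (U - sp.diff(U, S) * S).subs(S, S_of_TV)         # now a function of T, V
S_back = -sp.diff(A, T)   # entropy  S = -dA/dT
P = -sp.diff(A, V)        # pressure P = -dA/dV
print(sp.simplify(sp.diff(S_back, V) - sp.diff(P, T)))   # prints 0
```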

We can keep playing this game using various pairs of the four functions S, T, P, V as coordinates, and get more Maxwell relations: enough to give ourselves a headache! But we have better things to do today.

Hamilton’s equations as Maxwell relations

For example: let’s see how Hamilton’s equations fit into this game. Suppose we have a particle on the line. Consider smooth paths where it starts at some fixed position at some fixed time and ends at the point q at the time t. Nature will choose a path with least action—or at least one that’s a stationary point of the action. Let’s assume there’s a unique such path, and that it depends smoothly on q and t. For this to be true, we may need to restrict q and t to a subset of the plane, but that’s okay: go ahead and pick such a subset.

Given q and t in this set, nature will pick the path that’s a stationary point of action; the action of this path is called Hamilton’s principal function and denoted S(q,t). (Beware: this S is not the same as entropy!)

Let’s assume S is smooth. Then we can copy our derivation of the Maxwell relations line for line and get Hamilton’s equations! Let’s do it, skipping some steps but writing down the key results.

For starters we have

d S = p d q - H d t

for some functions p and H called the momentum and energy, which obey

\displaystyle{ p = \left.\frac{\partial S}{\partial q}\right|_t }

and

\displaystyle{ H = - \left.\frac{\partial S}{\partial t}\right|_q }

As far as I can tell it’s just a cute coincidence that we see a minus sign in the same place as before! Anyway, the fact that mixed partials commute gives us

\displaystyle{ \left. \frac{\partial p}{\partial t} \right|_q = - \left. \frac{\partial H}{\partial q} \right|_t }

which is the first of Hamilton’s equations. And now we see that all the funny \left. \right|_q and \left. \right|_t things are actually correct!

Next, we pull a rabbit out of our hat. We define this function:

X = S - p q

and check that

d X = - q dp - H d t

This function X probably has a standard name, but I don’t know it. Do you?

Then, considering any subset of the plane where p and t serve as coordinates, we see that because mixed partials commute:

\displaystyle{ \frac{\partial^2 X}{\partial t \partial p} =  \frac{\partial^2 X}{\partial p \partial t}}

we get

\displaystyle{ \left. \frac{\partial q}{\partial t} \right|_p = \left. \frac{\partial H}{\partial p} \right|_t }

So, we’re done!
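
Here too a quick sympy check may be reassuring. This is my own sketch, using the standard fact that a free particle of mass m starting at the origin at time zero has S(q,t) = m q^2/2t:

```python
# Both Hamilton's equations, in their Maxwell-relation form, for the
# free particle with S(q,t) = m q^2 / (2 t).
import sympy as sp

m, q, t, p = sp.symbols('m q t p', positive=True)
S = m * q**2 / (2 * t)
p_qt = sp.diff(S, q)                           # p = dS/dq|_t
H_qt = -sp.diff(S, t)                          # H = -dS/dt|_q
print(sp.simplify(sp.diff(p_qt, t) + sp.diff(H_qt, q)))   # 0: first equation

q_pt = sp.solve(sp.Eq(p, p_qt), q)[0]          # switch to (p,t) coordinates
H_pt = H_qt.subs(q, q_pt)                      # = p^2 / (2 m)
print(sp.simplify(sp.diff(q_pt, t) - sp.diff(H_pt, p)))   # 0: second equation
```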

But you might be wondering how we pulled this rabbit out of the hat. More precisely, why did we suspect it was there in the first place? There’s a nice answer if you’re comfortable with differential forms. We start with what we know:

d S = p d q - H d t

Next, we use this fundamental equation:

d^2 = 0

to note that:

\begin{array}{ccl}  0 &=& d^2 S \\ &=& d(p d q- H d t) \\ &=& d p \wedge d q - d H \wedge d t \\ &=& - dq \wedge d p - d H \wedge d t \\ &=& d(-q d p - H d t) \end{array}

See? We’ve managed to switch the roles of p and q, at the cost of an extra minus sign!

Then, if we restrict attention to any contractible open subset of the plane, the Poincaré Lemma says

d \omega = 0 \implies \omega = d \mu \; \textrm{for some} \; \mu

Since

d(- q d p - H d t) = 0

it follows that there’s a function X with

d X = - q d p - H d t

This is our rabbit. And if you ponder the difference between -q d p and p d q, you’ll see it’s -d( p q). So, it’s no surprise that

X = S - p q

The big picture

Now let’s step back and think about what’s going on.

Lately I’ve been trying to unify a bunch of ‘extremal principles’, including:

1) the principle of least action
2) the principle of least energy
3) the principle of maximum entropy
4) the principle of maximum simplicity, or Occam’s razor

In my post on quantropy I explained how the first three principles fit into a single framework if we treat Planck’s constant as an imaginary temperature. The guiding principle of this framework is

maximize entropy
subject to the constraints imposed by what you believe

And that’s nice, because E. T. Jaynes has made a powerful case for this principle.

However, when the temperature is imaginary, entropy is so different that it may deserve a new name: say, ‘quantropy’. In particular, it’s complex-valued, so instead of maximizing it we have to look for stationary points: places where its first derivative is zero. But this isn’t so bad. Indeed, a lot of minimum and maximum principles are really ‘stationary principles’ if you examine them carefully.

What about the fourth principle: Occam’s razor? We can formalize this using algorithmic probability theory. Occam’s razor then becomes yet another special case of

maximize entropy
subject to the constraints imposed by what you believe

once we realize that algorithmic entropy is a special case of ordinary entropy.

All of this deserves plenty of further thought and discussion—but not today!

Today I just want to point out that once we’ve formally unified classical mechanics and thermal statics (often misleadingly called ‘thermodynamics’), as sketched in the article on quantropy, we should be able to take any idea from one subject and transpose it to the other. And it’s true. I just showed you an example, but there are lots of others!

I guessed this should be possible after pondering three famous facts:

• In classical mechanics, if we fix the initial position of a particle, we can pick any position q and time t at which the particle’s path ends, and nature will seek the path to this endpoint that minimizes the action. This minimal action is Hamilton’s principal function S(q,t), which obeys

d S = p d q - H d t

In thermodynamics, if we fix the entropy S and volume V of a box of gas, nature will seek the probability distribution of microstates that minimizes the energy. This minimal energy is the internal energy U(S,V), which obeys

d U = T d S - P d V

• In classical mechanics we have canonically conjugate quantities, while in thermodynamics we have conjugate variables. In classical mechanics the canonical conjugate of the position q is the momentum p, while the canonical conjugate of time t is energy H. In thermodynamics, the conjugate of entropy S is temperature T, while the conjugate of volume V is pressure P. All this fits perfectly with the analogy we’ve been using today:

\begin{array} {ccccccc}  q &\to& S & &  p &\to & T \\ t & \to & V & & H &\to & P \end{array}

• Something called the Legendre transformation plays a big role both in classical mechanics and thermodynamics. This transformation takes a function of some variable and turns it into a function of the conjugate variable. In our proof of the Maxwell relations, we secretly used a Legendre transformation to pass from the internal energy U(S,V) to the Helmholtz free energy A(T,V):

A = U - T S

where we must solve for the entropy S in terms of T and V to think of A as a function of these two variables.

Similarly, in our proof of Hamilton’s equations, we passed from Hamilton’s principal function S(q,t) to the function X(p,t):

X = S - p q

where we must solve for the position q in terms of p and t to think of X as a function of these two variables.

I hope you see that all this stuff fits together in a nice picture, and I hope to say a bit more about it soon. The most exciting thing for me will be to see how symplectic geometry, so important in classical mechanics, can be carried over to thermodynamics. Why? Because I’ve never seen anyone use symplectic geometry in thermodynamics. But maybe I just haven’t looked hard enough!

Indeed, it’s perfectly possible that some people already know what I’ve been saying today. Have you seen someone point out that Hamilton’s equations are a special case of the Maxwell relations? This would seem to be the first step towards importing all of symplectic geometry to thermodynamics.

