Classical Mechanics versus Thermodynamics (Part 3)

23 September, 2021

There’s a fascinating analogy between classical mechanics and thermodynamics, which I last talked about in 2012:

Classical mechanics versus thermodynamics (part 1).
Classical mechanics versus thermodynamics (part 2).

I’ve figured out more about it, and today I’m giving a talk about it in the physics colloquium at the University of British Columbia. It’s a colloquium talk that’s supposed to be accessible for upper-level undergraduates, so I’ll spend a lot of time reviewing the basics… which is good, I think.

I don’t know if the talk will be recorded, but you can see my slides here, and I’ll base this blog article on them.

Hamilton’s equations versus the Maxwell relations

Why do Hamilton’s equations in classical mechanics:

\begin{array}{ccr}  \displaystyle{  \frac{d p}{d t} }  &=&  \displaystyle{- \frac{\partial H}{\partial q} } \\  \\  \displaystyle{  \frac{d q}{d t} } &=&  \displaystyle{ \frac{\partial H}{\partial p} }  \end{array}

look so much like the Maxwell relations in thermodynamics?

\begin{array}{ccr}  \displaystyle{ \left. \frac{\partial T}{\partial V}\right|_S }  &=&  \displaystyle{ - \left. \frac{\partial P}{\partial S}\right|_V } \\   \\  \displaystyle{ \left. \frac{\partial S}{\partial  V}\right|_T  }  &=&  \displaystyle{ \left. \frac{\partial P}{\partial T} \right|_V }  \end{array}

William Rowan Hamilton discovered his equations describing classical mechanics in terms of energy around 1827. By 1834 he had also introduced Hamilton’s principal function, which I’ll explain later.

James Clerk Maxwell is most famous for his equations describing electromagnetism, perfected in 1865. But he also worked on thermodynamics, and discovered the ‘Maxwell relations’ in 1871.

Hamilton’s equations describe how the position q and momentum p of a particle on a line change with time t if we know the energy or Hamiltonian H(q,p):

\begin{array}{ccr}  \displaystyle{  \frac{d p}{d t} }  &=&  \displaystyle{- \frac{\partial H}{\partial q} } \\  \\  \displaystyle{  \frac{d q}{d t} } &=&  \displaystyle{ \frac{\partial H}{\partial p} }  \end{array}

Two of the Maxwell relations connect the volume V, entropy S, pressure P and temperature T of a system in thermodynamic equilibrium:

\begin{array}{ccr}  \displaystyle{ \left. \frac{\partial T}{\partial V}\right|_S }  &=&  \displaystyle{ - \left. \frac{\partial P}{\partial S}\right|_V } \\   \\  \displaystyle{ \left. \frac{\partial S}{\partial  V}\right|_T  }  &=&  \displaystyle{ \left. \frac{\partial P}{\partial T} \right|_V }  \end{array}
Using this change of variables:

q \to S \qquad p \to T
t \to V \qquad H \to P

Hamilton’s equations:

\begin{array}{ccr}  \displaystyle{  \frac{d p}{d t} }  &=&  \displaystyle{- \frac{\partial H}{\partial q} } \\   \\  \displaystyle{  \frac{d q}{d t} } &=&  \displaystyle{ \frac{\partial H}{\partial p} }  \end{array}

become these relations:

\begin{array}{ccr}  \displaystyle{  \frac{d T}{d V} }  &=&  \displaystyle{- \frac{\partial P}{\partial S} } \\  \\  \displaystyle{  \frac{d S}{d V} } &=&  \displaystyle{ \frac{\partial P}{\partial T} }  \end{array}

These are almost like two of the Maxwell relations! But in thermodynamics we always use partial derivatives:

\begin{array}{ccr}  \displaystyle{  \frac{\partial T}{\partial V} }  &=&  \displaystyle{ - \frac{\partial P}{\partial S} } \\   \\  \displaystyle{ \frac{\partial S}{\partial  V}   }  &=&  \displaystyle{  \frac{\partial P}{\partial T} }  \end{array}

and we say which variables are held constant:

\begin{array}{ccr}  \displaystyle{ \left. \frac{\partial T}{\partial V}\right|_S }  &=&  \displaystyle{ - \left. \frac{\partial P}{\partial S}\right|_V } \\   \\  \displaystyle{ \left. \frac{\partial S}{\partial  V}\right|_T  }  &=&  \displaystyle{ \left. \frac{\partial P}{\partial T} \right|_V }  \end{array}

If we write Hamilton’s equations in the same style as the Maxwell relations, they look funny:

\begin{array}{ccr}  \displaystyle{ \left. \frac{\partial p}{\partial t}\right|_q }  &=&  \displaystyle{ - \left. \frac{\partial H}{\partial q}\right|_t } \\   \\  \displaystyle{ \left. \frac{\partial q}{\partial  t}\right|_p  }  &=&  \displaystyle{ \left. \frac{\partial H}{\partial p} \right|_t }  \end{array}

Can this possibly be right?

Yes! When we work out the analogy between classical mechanics and thermodynamics we’ll see why.

We can get Maxwell’s relations starting from this: the internal energy U of a system in equilibrium depends on its entropy S and volume V.

Temperature and pressure are derivatives of U:

\displaystyle{  T =  \left.\frac{\partial U}{\partial S} \right|_V  \qquad  P = - \left. \frac{\partial U}{\partial V} \right|_S }

Maxwell’s relations follow from the fact that mixed partial derivatives commute! For example:

\displaystyle{    \left. \frac{\partial T}{\partial V} \right|_S \; = \;  \left. \frac{\partial}{\partial V} \right|_S \left. \frac{\partial }{\partial S} \right|_V U \; = \;  \left. \frac{\partial }{\partial S} \right|_V \left. \frac{\partial}{\partial V} \right|_S  U \; = \;  - \left. \frac{\partial P}{\partial S} \right|_V }
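
If you like checking such things by machine, here is a tiny sympy sketch of exactly this computation, with U left as a generic symbolic function of S and V (no thermodynamics is assumed):

```python
import sympy as sp

S, V = sp.symbols('S V')
U = sp.Function('U')(S, V)   # a generic smooth U(S, V)

T = sp.diff(U, S)            # T =  dU/dS  at constant V
P = -sp.diff(U, V)           # P = -dU/dV  at constant S

# dT/dV|_S + dP/dS|_V vanishes because mixed partials of U commute
print(sp.simplify(sp.diff(T, V) + sp.diff(P, S)))   # prints 0
```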

To get Hamilton’s equations the same way, we need a function W of the particle’s position q and time t such that

\displaystyle{       p = \left. \frac{\partial W}{\partial q} \right|_t   \qquad       H = -\left. \frac{\partial W}{\partial t} \right|_q  }

Then we’ll get Hamilton’s equations from the fact that mixed partial derivatives commute!

The trick is to let W be ‘Hamilton’s principal function’. So let’s define that. First, the action of a particle’s path is

\displaystyle{  \int_{t_0}^{t_1} L(q(t),\dot{q}(t)) \, dt  }

where L is the Lagrangian:

L(q,\dot{q}) = p \dot q - H(q,p)

The particle always takes a path from (q_0, t_0) to (q_1, t_1) that’s a critical point of the action. We can derive Hamilton’s equations from this fact.

Let’s assume this critical point is a minimum. Then the least action over all paths from (q_0,t_0) to (q_1,t_1) is called Hamilton’s principal function

W(q_0,t_0,q_1,t_1) = \min_{q \; \mathrm{with} \; q(t_0) = q_0, q(t_1) = q_1}  \int_{t_0}^{t_1} L(q(t),\dot{q}(t)) \, dt

A beautiful fact: if we differentiate Hamilton’s principal function, we get back the energy H and momentum p:

\begin{array}{ccc}  \displaystyle{  \frac{\partial}{\partial q_0} W(q_0,t_0,q_1,t_1) = -p(t_0) } &&  \displaystyle{    \frac{\partial}{\partial t_0} W(q_0,t_0,q_1,t_1) = H(t_0) } \\  \\  \displaystyle{   \frac{\partial}{\partial q_1} W(q_0,t_0,q_1,t_1) =  p(t_1) } &&  \displaystyle{   \frac{\partial}{\partial t_1}W(q_0,t_0,q_1,t_1)  = -H(t_1) }  \end{array}

You can prove these equations using

L  = p\dot{q} - H

which implies that

\displaystyle{ W(q_0,t_0,q_1,t_1) =  \int_{q_0}^{q_1} p \, dq \; - \; \int_{t_0}^{t_1} H \, dt }

where we integrate along the minimizing path. (It’s not as trivial as it may look, but you can do it.)
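
As a sanity check, here is the simplest case I know worked out in sympy: a free particle of mass m, where the minimizing path has constant velocity v = (q_1 - q_0)/(t_1 - t_0) and the least action can be written down in closed form. (This free-particle example is my own illustrative choice; it isn’t computed in the post.)

```python
import sympy as sp

m, q0, t0, q1, t1 = sp.symbols('m q0 t0 q1 t1', real=True)

v = (q1 - q0)/(t1 - t0)              # constant velocity of the minimizing path
W = m*(q1 - q0)**2 / (2*(t1 - t0))   # least action for the free particle
p = m*v                              # momentum (constant along the path)
H = m*v**2/2                         # energy   (constant along the path)

print(sp.simplify(sp.diff(W, q1) - p))   # 0:  dW/dq1 =  p(t1)
print(sp.simplify(sp.diff(W, t1) + H))   # 0:  dW/dt1 = -H(t1)
print(sp.simplify(sp.diff(W, q0) + p))   # 0:  dW/dq0 = -p(t0)
print(sp.simplify(sp.diff(W, t0) - H))   # 0:  dW/dt0 =  H(t0)
```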

Now let’s fix a starting-point (q_0,t_0) for our particle, and say its path ends at any old point (q,t). Think of Hamilton’s principal function as a function of just (q,t):

W(q,t) = W(q_0,t_0,q,t)

Then the particle’s momentum and energy when it reaches (q,t) are:

\displaystyle{   p = \left. \frac{\partial W}{\partial q} \right|_t   \qquad       H = -\left. \frac{\partial W}{\partial t} \right|_q }

This is just what we wanted. Hamilton’s equations now follow from the fact that mixed partial derivatives commute!

So, we have this analogy between classical mechanics and thermodynamics:

Classical mechanics | Thermodynamics
action: W(q,t) | internal energy: U(S,V)
position: q | entropy: S
momentum: p = \frac{\partial W}{\partial q} | temperature: T = \frac{\partial U}{\partial S}
time: t | volume: V
energy: H = -\frac{\partial W}{\partial t} | pressure: P = - \frac{\partial U}{\partial V}
dW = p\,dq - H\,dt | dU = T\,dS - P\,dV

What’s really going on in this analogy? It’s not really the matchup of variables that matters most—it’s something a bit more abstract. Let’s dig deeper.

I said we could get Maxwell’s relations from the fact that mixed partials commute, and gave one example:

\displaystyle{   \left. \frac{\partial T}{\partial V} \right|_S \; = \;  \left. \frac{\partial}{\partial V} \right|_S \left. \frac{\partial }{\partial S} \right|_V U \; = \;  \left. \frac{\partial }{\partial S} \right|_V \left. \frac{\partial}{\partial V} \right|_S  U \; = \;  - \left. \frac{\partial P}{\partial S} \right|_V }

But to get the other Maxwell relations we need to differentiate other functions—and there are four of them!

U: internal energy
U - TS: Helmholtz free energy
U + PV: enthalpy
U + PV - TS: Gibbs free energy

They’re important, but memorizing all the facts about them has annoyed students of thermodynamics for over a century. Is there some other way to get the Maxwell relations? Yes!

In 1958 David Ritchie explained how we can get all four Maxwell relations from one equation! Jaynes also explained how in some unpublished notes for a book. Here’s how it works.

Start here:

dU = T d S - P d V

Integrate around a loop \gamma:

\displaystyle{    \oint_\gamma T d S - P d V  = \oint_\gamma d U = 0 }

so

\displaystyle{  \oint_\gamma T d S = \oint_\gamma P dV }

This says that the heat added to the system as it goes around the cycle equals the work it does.

Green’s theorem implies that if a loop \gamma encloses a region R,

\displaystyle{   \oint_\gamma T d S = \int_R dT \, dS }

Similarly

\displaystyle{ \oint_\gamma P d V  = \int_R dP \, dV }

But we know these are equal!

So, we get

\displaystyle{ \int_R dT \, dS  =   \int_R dP \, dV  }

for any region R enclosed by a loop. And this in turn implies

d T\, dS = dP \, dV

In fact, all of Maxwell’s relations are hidden in this one equation!

Mathematicians call something like dT \, dS a 2-form and write it as dT \wedge dS. It’s an ‘oriented area element’, so

dT \, dS = -dS \, dT

Now, starting from

d T\, dS = dP \, dV

We can choose any coordinates X,Y and get

\displaystyle{   \frac{dT \, dS}{dX \, dY} = \frac{dP \, dV}{dX \, dY}  }

(Yes, this is mathematically allowed!)

If we take X = V, Y = S we get

\displaystyle{    \frac{dT \, dS}{dV \, dS} = \frac{dP \, dV}{dV \, dS}  }

and thus

\displaystyle{   \frac{dT \, dS}{dV \, dS} = - \frac{dV \, dP}{dV \, dS}  }

We can actually cancel some factors and get one of the Maxwell relations:

\displaystyle{  \left.   \frac{\partial T}{\partial V}\right|_S = - \left. \frac{\partial P}{\partial S}\right|_V  }

(Yes, this is mathematically justified!)

Let’s try another one. If we take X = T, Y = V we get

\displaystyle{   \frac{dT \, dS}{dT \, dV} = \frac{dP \, dV}{dT \, dV} }

Cancelling some factors here we get another of the Maxwell relations:

\displaystyle{ \left.   \frac{\partial S}{\partial V} \right|_T = \left. \frac{\partial P}{\partial T}  \right|_V }

Other choices of X,Y give the other two Maxwell relations.

In short, Maxwell’s relations all follow from one simple equation:

d T\, dS = dP \, dV

Similarly, Hamilton’s equations follow from this equation:

d p\, dq = dH \, dt

All calculations work in exactly the same way!

By the way, we can get these equations efficiently using the identity d^2 = 0 and the product rule for d:

\begin{array}{ccl} dU = TdS - PdV & \implies & d^2 U = d(TdS - P dV) \\ \\  &\implies& 0 = dT\, dS - dP \, dV  \\ \\  &\implies & dT\, dS = dP \, dV  \end{array}
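
Here is the same d^2 = 0 calculation done by machine, side by side for a generic U(S,V) and a generic W(q,t). Expanding the wedge products in the underlying coordinates, each 2-form identity reduces to a single coefficient times the area element, and that coefficient vanishes because mixed partials commute. This is only a sympy check of the argument above, nothing more:

```python
import sympy as sp

S, V, q, t = sp.symbols('S V q t')
U = sp.Function('U')(S, V)
W = sp.Function('W')(q, t)

T, P = sp.diff(U, S), -sp.diff(U, V)   # thermodynamics: T, P from U
p, H = sp.diff(W, q), -sp.diff(W, t)   # mechanics: p, H from W

# coefficient of dS∧dV in dT∧dS - dP∧dV, and of dq∧dt in dp∧dq - dH∧dt
print(sp.simplify(-sp.diff(T, V) - sp.diff(P, S)))   # prints 0
print(sp.simplify(-sp.diff(p, t) - sp.diff(H, q)))   # prints 0
```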

Now let’s change viewpoint slightly and temporarily treat T and P as independent from S and V. So, let’s start with \mathbb{R}^4 with coordinates (S,T,V,P). Then this 2-form on \mathbb{R}^4:

\omega = dT \, dS - dP \, dV

is called a symplectic structure.

Choosing the internal energy function U(S,V), we get this 2-dimensional surface of equilibrium states:

\displaystyle{   \Lambda = \left\{ (S,T,V,P): \; \textstyle{ T = \left.\phantom{\Big|} \frac{\partial U}{\partial S}\right|_V  , \; P = -\left. \phantom{\Big|}\frac{\partial U}{\partial V} \right|_S} \right\} \; \subset \; \mathbb{R}^4 }

Since

\omega = dT \, dS - dP \, dV

we know

\displaystyle{  \int_R \omega = 0 }

for any region R in the surface \Lambda, since on this surface dU = TdS - PdV and our old argument applies.

This fact encodes the Maxwell relations! Physically it says: for any cycle on the surface of equilibrium states, the heat flow in equals the work done.

Similarly, in classical mechanics we can start with \mathbb{R}^4 with coordinates (q,p,t,H), treating p and H as independent from q and t. This 2-form on \mathbb{R}^4:

\omega = dH \, dt - dp \, dq

is a symplectic structure. Hamilton’s principal function W(q,t) defines a 2d surface

\displaystyle{   \Lambda = \left\{ (q,p,t,H): \; \textstyle{ p = \left.\phantom{\Big|} \frac{\partial W}{\partial q}\right|_t  , \;  H = -\left.\phantom{\Big|} \frac{\partial W}{\partial t} \right|_q} \right\}  \subset \mathbb{R}^4 }

We have \int_R \omega = 0 for any region R in this surface \Lambda. And this fact encodes Hamilton’s equations!

Summary

In thermodynamics, any 2d region R in the surface \Lambda of equilibrium states has

\displaystyle{  \int_R \omega = 0 }

This is equivalent to the Maxwell relations.

In classical mechanics, any 2d region R in the surface \Lambda of allowed (q,p,t,H) 4-tuples for particle trajectories through a single point (q_0,t_0) has

\displaystyle{   \int_R \omega = 0 }

This is equivalent to Hamilton’s equations.

These facts generalize when we add extra degrees of freedom, e.g. the particle number N in thermodynamics:

\omega = dT \, dS - dP \, dV + d\mu \, dN

or more dimensions of space in classical mechanics:

\omega =  dp_1 \, dq_1 + \cdots + dp_{n-1} dq_{n-1} - dH \, dt

We get a vector space \mathbb{R}^{2n} with a 2-form \omega on it, and a Lagrangian submanifold \Lambda \subset \mathbb{R}^{2n}: that is, an n-dimensional submanifold such that

\int_R \omega = 0

for any 2d region R \subset \Lambda.

This is more evidence for Alan Weinstein’s “symplectic creed”:

EVERYTHING IS A LAGRANGIAN SUBMANIFOLD

As a spinoff, we get two extra Hamilton’s equations for a point particle on a line! They look weird, but I’m sure they’re correct for trajectories that all pass through a chosen spacetime point (q_0,t_0).

\begin{array}{ccrcccr}  \displaystyle{ \left. \frac{\partial T}{\partial V}\right|_S }  &=&  \displaystyle{ - \left. \frac{\partial P}{\partial S}\right|_V } & \qquad &  \displaystyle{ \left. \frac{\partial p}{\partial t}\right|_q }  &=&  \displaystyle{ - \left. \frac{\partial H}{\partial q}\right|_t }  \\   \\  \displaystyle{ \left. \frac{\partial S}{\partial  V}\right|_T  }  &=&  \displaystyle{ \left. \frac{\partial P}{\partial T} \right|_V } & &  \displaystyle{ \left. \frac{\partial q}{\partial  t}\right|_p  }  &=&  \displaystyle{ \left. \frac{\partial H}{\partial p} \right|_t }  \\ \\  \displaystyle{ \left. \frac{\partial V}{\partial T} \right|_P }  &=&  \displaystyle{ - \left. \frac{\partial S}{\partial P} \right|_T } & &  \displaystyle{ \left. \frac{\partial t}{\partial p} \right|_H }  &=&  \displaystyle{ - \left. \frac{\partial q}{\partial H} \right|_p }  \\ \\  \displaystyle{ \left. \frac{\partial V}{\partial S} \right|_P } &=&  \displaystyle{ \left. \frac{\partial T}{\partial P} \right|_S }  & &  \displaystyle{ \left. \frac{\partial p}{\partial H} \right|_q } &=&  \displaystyle{ \left. \frac{\partial t}{\partial q} \right|_H }  \end{array}

Maxwell’s Relations (Part 3)

22 September, 2021

In Part 2 we saw a very efficient formulation of Maxwell’s relations, from which we can easily derive their usual form. Now let’s talk more about the meaning of the Maxwell relations—both their physical meaning and their mathematical meaning. For the physical meaning, I’ll draw again from Ritchie’s paper:

• David J. Ritchie, A simple method for deriving Maxwell’s relations, American Journal of Physics 36 (1958), 760–760.

but Emily Roach pointed out that much of this can also be found in the third chapter of Jaynes’ unpublished book:

• E. T. Jaynes, Thermodynamics, Chapter 3: Plausible reasoning.

First I’ll do the case of 2 dimensions, and then the case of n dimensions. In the 2d case I’ll talk like a physicist and use notation from thermodynamics. We’ll learn the Maxwell relations have these meanings, apart from their obvious ones:

• In any thermodynamic cycle, the heat absorbed by a system equals the work it does.

• In any thermodynamic cycle, energy is conserved.

• Any region in the surface of equilibrium states has the same area in pressure-volume coordinates as it does in temperature-entropy coordinates.

In the n-dimensional case I’ll use notation that mathematicians will like better, and also introduce the language of symplectic geometry. This will give a general, abstract statement of the Maxwell relations:

• The manifold of equilibrium states is a Lagrangian submanifold of a symplectic manifold.

Don’t worry—I’ll explain it!

Maxwell’s relations in 2 dimensions

Suppose we have a physical system whose internal energy U is a smooth function of its entropy S and volume V. So, we have a smooth function on the plane:

U \colon \mathbb{R}^2 \to \mathbb{R}

and we call the coordinates on this plane (S,V).

Next we introduce two more variables called temperature T and pressure P. So, we’ll give \mathbb{R}^4 coordinates (S,T,V,P). But, there’s a 2-dimensional surface where these extra variables are given by the usual formulas in thermodynamics:

\displaystyle{ \Lambda = \left\{(S,T,V,P) : \; T = \left.\frac{\partial U}{\partial S} \right|_V, \;   P = - \left. \frac{\partial U}{\partial V} \right|_S \right\}   \; \subset \mathbb{R}^4 }

We call this the surface of equilibrium states.

By the equations that T and P obey on the surface of equilibrium states, the following equation holds on this surface:

dU = T d S - P d V

So, if \gamma is any loop on this surface, we have

\displaystyle{ \oint_\gamma ( T d S - P d V) \; = \; \oint_\gamma \; dU \; = 0    }

where the total change in U is zero because the loop ends where it starts! We thus have

\displaystyle{ \oint_\gamma  T d S  = \oint_\gamma P d V }

In thermodynamics this has a nice meaning: the left side is the heat absorbed by a system as its state moves around the loop \gamma, while the right side is the work done by this system. So, the equation says the heat absorbed by a system as it carries out a cycle is equal to the work it does.

We’ll soon see that this equation contains, hidden within it, all four Maxwell relations. And it’s a statement of conservation of energy! By the way, it’s perfectly obvious that energy is conserved in a cycle: our loop takes us from a point where U has some value back to the same point, where it has the same value.
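
To see the ‘heat in equals work done’ statement numerically, here is a small sketch using the toy internal energy U(S,V) = e^S/V (my own choice, made only for illustration, not a model of any real substance) and a circular cycle in the (S,V) plane:

```python
import numpy as np

def T(S, V): return np.exp(S)/V       #  dU/dS  for the toy U = exp(S)/V
def P(S, V): return np.exp(S)/V**2    # -dU/dV

# a closed loop in the (S,V) plane: a circle of radius 0.3 around (S,V) = (1, 2)
theta = np.linspace(0, 2*np.pi, 2001)
S = 1 + 0.3*np.cos(theta)
V = 2 + 0.3*np.sin(theta)

# midpoint-rule approximations of the line integrals around the loop
Sm, Vm = 0.5*(S[1:] + S[:-1]), 0.5*(V[1:] + V[:-1])
heat = np.sum(T(Sm, Vm)*np.diff(S))   # ∮ T dS, the heat absorbed
work = np.sum(P(Sm, Vm)*np.diff(V))   # ∮ P dV, the work done
print(heat, work, heat - work)        # the two agree; the difference is ≈ 0
```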

So how do we get Maxwell’s relations out of this blithering triviality? There’s a slow way and a fast way. Since the slow way provides extra insight I’ll do that first.

Suppose the loop \gamma bounds some 2-dimensional region R in the surface \Lambda. Then by Green’s theorem we have

\displaystyle{ \oint_\gamma  T d S  = \int_R dT \wedge dS }

That is, the heat absorbed in the cycle is just the area of the region R in (T,S) coordinates. But Green’s theorem also says

\displaystyle{ \oint_\gamma  P dV  = \int_R dP \wedge dV }

That is, the work done by the system in the cycle is just the area of the region R in (P,V) coordinates!

That’s nice to know. But since we’ve seen these are equal,

\displaystyle{ \int_R dT \wedge dS =  \int_R dP \wedge dV  }

for every region R in the surface \Lambda.

Well, at least I’ve shown it for every region bounded by a loop! But every region can be chopped up into small regions that are bounded by loops, so the equation is really true for any region R in the surface \Lambda.

We express this fact by saying that dT \wedge dS equals dP \wedge dV when these 2-forms are restricted to the surface \Lambda. We write it like this:

\displaystyle{ \left. dT \wedge dS \right|_{\Lambda} = \left. dP \wedge dV \right|_{\Lambda} }

Now, last time we saw how to quickly get from this equation to all four Maxwell relations!

(Back then I didn’t write the |_{\Lambda} symbol because I was implicitly working on this surface \Lambda without telling you. More precisely, I was working on the plane \mathbb{R}^2, but we can identify the surface \Lambda with that plane using the (S,V) coordinates.)

So, given what we did last time, we are done! The equation

\displaystyle{ \left. dT \wedge dS \right|_{\Lambda} = \left. dP \wedge dV \right|_{\Lambda}}

expresses both conservation of energy and all four Maxwell relations—in a very compressed, beautiful form!

Maxwell’s relations in n dimensions

Now we can generalize everything to n dimensions. Suppose we have a smooth function

U \colon \mathbb{R}^n \to \mathbb{R}

Write the coordinates on \mathbb{R}^n as

(q^1, \dots, q^n)

and write the corresponding coordinates on the cotangent bundle T^\ast \mathbb{R}^n as

(q^1, \dots, q^n, p_1, \dots, p_n)

There is a submanifold of T^\ast \mathbb{R}^n where p_i equals the partial derivative of U with respect to q^i:

\displaystyle{ p_i = \frac{\partial U}{\partial q^i} }

Let’s call this submanifold \Lambda since it’s the same one we’ve seen before (except for that annoying little minus sign in the definition of pressure):

\displaystyle{ \Lambda = \left\{(q,p) \in T^\ast \mathbb{R}^n : \;  p_i = \frac{\partial U}{\partial q^i} \right\} }

In applications to thermodynamics, this is the manifold of equilibrium states. But we’re doing math here, and this math has many applications. It’s this generality that makes the subject especially interesting to me.

Now, U started out life as a function on \mathbb{R}^n, but it lifts to a function on T^\ast \mathbb{R}^n, and we have

\displaystyle{ dU  = \sum_i \frac{\partial U}{\partial q^i} dq^i }

By the definition of \Lambda we have

\displaystyle{  \left. \frac{\partial U}{\partial q^i} \right|_\Lambda =  \left. p_i \right|_\Lambda }

so

\displaystyle{  \left. dU \right|_\Lambda  = \left. \sum_i p_i dq^i \right|_\Lambda }

The 1-form on the right-hand side here:

\displaystyle{ \theta = \sum_i p_i \, dq^i }

is called the tautological 1-form on T^\ast \mathbb{R}^n. Its exterior derivative

\displaystyle{ \omega = d\theta = \sum_i dp_i \wedge dq^i }

is called the symplectic structure on this cotangent bundle. Both of these are a big deal in classical mechanics, but here we are seeing them in thermodynamics! And the point of all this stuff is that we’ve seen

\displaystyle{ \left. dU \right|_\Lambda  = \left. \theta \right|_\Lambda }

Taking d of both sides and using d^2 = 0, we get

\displaystyle{ \left. d\theta \right|_\Lambda = 0 }

or in other words

\displaystyle{ \left. \omega \right|_\Lambda = 0 }

And this is a very distilled statement of Maxwell’s relations!
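
Here is a sympy sketch of this statement for n = 3, with the pullback to \Lambda computed by hand: parametrizing \Lambda by the q’s and expanding each dp_i in the dq’s, the pullback of \omega has one coefficient for each pair j < i, and every coefficient vanishes because mixed partials of U commute:

```python
import sympy as sp

q1, q2, q3 = sp.symbols('q1 q2 q3')
q = [q1, q2, q3]
U = sp.Function('U')(*q)
p = [sp.diff(U, qi) for qi in q]         # on Λ:  p_i = dU/dq^i

# coefficient of dq^j ∧ dq^i (with j < i) in the pullback of ω = Σ dp_i ∧ dq^i
for j in range(3):
    for i in range(j + 1, 3):
        print(sp.simplify(sp.diff(p[i], q[j]) - sp.diff(p[j], q[i])))   # all 0
```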

Why? Well, in the 2d case we discussed earlier, the tautological 1-form is

\theta = T dS - P dV

thanks to the annoying minus sign in the definition of pressure. Thus, the symplectic structure is

\omega = dT \wedge dS - dP \wedge dV

and the fact that the symplectic structure \omega vanishes when restricted to \Lambda is just our old friend

\displaystyle{ \left. dT \wedge dS \right|_{\Lambda} = \left. dP \wedge dV \right|_{\Lambda} }

As we’ve seen, this equation contains all of Maxwell’s relations.

In the n-dimensional case, \Lambda is an n-dimensional submanifold of the 2n-dimensional symplectic manifold T^\ast \mathbb{R}^n. In general, if an n-dimensional submanifold S of a 2n-dimensional symplectic manifold has the property that the symplectic structure vanishes when restricted to S, we call S a Lagrangian submanifold.

So, one efficient abstract statement of Maxwell’s relations is:

The manifold of equilibrium states is a Lagrangian submanifold.

It takes a while to see the value of this statement, and I won’t try to explain it here. Instead, read Weinstein’s introduction to symplectic geometry:

• Alan Weinstein, Symplectic geometry, Bulletin of the American Mathematical Society 5 (1981), 1–13.

You’ll see here an introduction to Lagrangian submanifolds and an explanation of the “symplectic creed”:

EVERYTHING IS A LAGRANGIAN SUBMANIFOLD.
— Alan Weinstein

The restatement of Maxwell’s relations in terms of Lagrangian submanifolds is just another piece of evidence for this!


Part 1: a proof of Maxwell’s relations using commuting partial derivatives.

Part 2: a proof of Maxwell’s relations using 2-forms.

Part 3: the physical meaning of Maxwell’s relations, and their formulation in terms of symplectic geometry.

For how Maxwell’s relations are connected to Hamilton’s equations, see this post:

Classical mechanics versus thermodynamics.


Maxwell’s Relations (Part 2)

18 September, 2021

The Maxwell relations are some very general identities about the partial derivatives of functions of several variables. You don’t need to understand anything about thermodynamics to understand them, but they’re used a lot in that subject, so discussions of them tend to use notation from that subject.

Last time I went through a standard way to derive these relations for a function of two variables. Now I want to give a better derivation, which I found here:

• David J. Ritchie, A simple method for deriving Maxwell’s relations, American Journal of Physics 36 (1958), 760–760.

This paper is just one page long, and I can’t really improve on it, but I’ll work through the ideas in more detail. It again covers only the case of a function of two variables, and I won’t try to go beyond that case now—maybe later.

So, remember the setup. We have a smooth function on the plane

U \colon \mathbb{R}^2 \to \mathbb{R}

We call the coordinates on the plane S and V, and we give the partial derivatives of U funny names:

\displaystyle{ T =  \left.\frac{\partial U}{\partial S} \right|_V  \qquad    P = - \left. \frac{\partial U}{\partial V} \right|_S }

None of these funny names or the minus sign has any effect on the actual math involved; they’re just traditional in thermodynamics. So, mathematicians, please forgive me! If I ever generalize to the n-variable case, I’ll switch to more reasonable notation.

We instantly get this:

dU = T dS - P dV

and since d^2 = 0 we get

0 = d^2 U = dT \wedge dS - dP \wedge dV

so

dT \wedge dS = dP \wedge dV

Believe it or not, this simple relation contains all four of Maxwell’s relations within it!

To see this, note that both sides are smooth 2-forms on the plane. Now, the space of 2-forms at any one point of the plane is a 1-dimensional vector space. So, we can divide any 2-form at a point by any nonzero 2-form at that point and get a real number.

In particular, suppose X and Y are functions on the plane such that dX \wedge dY \ne 0 at some point. Then we can divide both sides of the above equation by dX \wedge dY and get

\displaystyle{ \frac{dT \wedge dS}{dX \wedge dY} \; = \; \frac{dP \wedge dV}{dX \wedge dY} }

at this point. We can now get the four Maxwell relations simply by making different choices of X and Y. We’ll choose them to be either T,S,V or P. The argument will only work if dX \wedge dY \ne 0, so I’ll always assume that. The argument works the same way each time so I’ll go faster after the first time.

The first relation

Take X = V and Y = S and substitute them into the above equation. We get

\displaystyle{ \frac{dT \wedge dS}{dV \wedge dS} \; = \; \frac{dP \wedge dV}{dV \wedge dS} }

or

\displaystyle{ \frac{dT \wedge dS}{dV \wedge dS} \; = \; - \frac{dP \wedge dV}{dS \wedge dV} }

Next we use a little fact about differential forms and partial derivatives to simplify both sides:

\displaystyle{ \frac{dT \wedge dS}{dV \wedge dS} \; = \; \left.\frac{\partial T}{\partial V} \right|_S }

and similarly

\displaystyle{  \frac{dP \wedge dV}{dS \wedge dV} \; = \; \left. \frac{\partial P}{\partial S}\right|_V }

If you were scarred as a youth when plausible-looking manipulations with partial derivatives turned out to be unjustified, you might be worried about this—and rightly so! Later I’ll show how to justify the kind of ‘cancellation’ we’re doing here. But anyway, it instantly gives us the first Maxwell relation:

\boxed{ \displaystyle{  \left. \frac{\partial T}{\partial V} \right|_S \; = \; - \left. \frac{\partial P}{\partial S} \right|_V } }

The second relation

This time we take X = T, Y = V. Substituting this into our general formula

\displaystyle{ \frac{dT \wedge dS}{dX \wedge dY} \; = \; \frac{dP \wedge dV}{dX \wedge dY} }

we get

\displaystyle{ \frac{dT \wedge dS}{dT \wedge dV} \; = \; \frac{dP \wedge dV}{dT \wedge dV} }

and doing the same sort of ‘cancellation’ as last time, this gives the second Maxwell relation:

\boxed{ \displaystyle{ \left. \frac{\partial S}{\partial V} \right|_T \; = \;  \left. \frac{\partial P}{\partial T} \right|_V } }

The third relation

This time we take X = P, Y = S. Substituting this into our general formula

\displaystyle{ \frac{dT \wedge dS}{dX \wedge dY} \; = \; \frac{dP \wedge dV}{dX \wedge dY} }

we get

\displaystyle{ \frac{dT \wedge dS}{dP \wedge dS} \; = \; \frac{dP \wedge dV}{dP \wedge dS} }

which gives the third Maxwell relation:

\boxed{ \displaystyle{ \left. \frac{\partial T}{\partial P} \right|_S  \; = \;  \left. \frac{\partial V}{\partial S} \right|_P }}

The fourth relation

This time we take X = P, Y = T. Substituting this into our general formula

\displaystyle{ \frac{dT \wedge dS}{dX \wedge dY} \; = \; \frac{dP \wedge dV}{dX \wedge dY} }

we get

\displaystyle{ \frac{dT \wedge dS}{dP \wedge dT} \; = \; \frac{dP \wedge dV}{dP \wedge dT} }

or

\displaystyle{ -\frac{dS \wedge dT}{dP \wedge dT} \; = \; \frac{dP \wedge dV}{dP \wedge dT} }

giving the fourth Maxwell relation:

\boxed{ \displaystyle{ \left. \frac{\partial V}{\partial T} \right|_P \; = \; - \left. \frac{\partial S}{\partial P} \right|_T }}

You can check that other choices of X and Y don’t give additional relations of the same form.

Determinants

So, we’ve seen that all four Maxwell relations follow quickly from the equation

dT \wedge dS = dP \wedge dV

if we can do ‘cancellations’ in expressions like this:

\displaystyle{ \frac{dA \wedge dB}{dX \wedge dY} }

when one of the functions A,B \colon \mathbb{R}^2 \to \mathbb{R} equals one of the functions X,Y \colon \mathbb{R}^2 \to \mathbb{R}. This works whenever dX \wedge dY \ne 0. Let’s see why!

First of all, by the inverse function theorem, if dX \wedge dY \ne 0 at some point in the plane, the functions X and Y serve as coordinates in some neighborhood of that point. In this case we have

\displaystyle{ \frac{dA \wedge dB}{dX \wedge dY} = \det \left(  \begin{array}{cc}  \displaystyle{ \frac{\partial A}{\partial X} } & \displaystyle{ \frac{\partial A}{\partial Y} } \\ \\  \displaystyle{ \frac{\partial B}{\partial X} } & \displaystyle{\frac{\partial B}{\partial Y} }  \end{array}  \right) }

Yes: the ratio of 2-forms is just the Jacobian of the map sending (X,Y) to (A,B). This is clear if you know that 2-forms are ‘area elements’ and the Jacobian is a ratio of area elements. But you can also prove it by a quick calculation:

\displaystyle{ dA = \frac{\partial A}{\partial X} dX +  \frac{\partial A}{\partial Y} dY }

\displaystyle{ dB = \frac{\partial B}{\partial X} dX +  \frac{\partial B}{\partial Y} dY }

and thus

\displaystyle{  dA \wedge dB = \left(\frac{\partial A}{\partial X} \frac{\partial B}{\partial Y} - \frac{\partial A}{\partial Y} \frac{\partial B}{\partial X} \right) dX \wedge dY}

so the ratio dA \wedge dB/dX \wedge dY is the desired determinant.

How does this help? Well, take

\displaystyle{ \frac{dA \wedge dB}{dX \wedge dY} }

and now suppose that either A or B equals either X or Y. For example, suppose A = X. Then we can do a ‘cancellation’ like this:

\begin{array}{ccl}  \displaystyle{ \frac{dX \wedge dB}{dX \wedge dY} } &=& \det \left(  \begin{array}{cc}  \displaystyle{ \frac{\partial X}{\partial X} } & \displaystyle{ \frac{\partial X}{\partial Y} } \\ \\  \displaystyle{ \frac{\partial B}{\partial X} } & \displaystyle{\frac{\partial B}{\partial Y} }  \end{array} \right) \\  \\  &=& \det \left(  \begin{array}{cc}  \displaystyle{ 1 } & \displaystyle{ 0 } \\ \\  \displaystyle{ \frac{\partial B}{\partial X} } & \displaystyle{\frac{\partial B}{\partial Y} }  \end{array} \right) \\  \\  &=& \displaystyle{\frac{\partial B}{\partial Y} }  \end{array}

or to make it clear that the partial derivatives are being done in the X,Y coordinate system:

\displaystyle{ \frac{dX \wedge dB}{dX \wedge dY} = \left.\frac{\partial B}{\partial Y} \right|_X }

This justifies all our calculations earlier.
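
For readers who like machine checks, here is a sympy sketch of the rule above: ratio(A, B, X, Y) computes (dA \wedge dB)/(dX \wedge dY) as the Jacobian determinant in the underlying (S,V) coordinates, assuming (as in the text) that the denominator is nonzero, and the loop confirms the master identity for the four choices of (X,Y) used above, one per Maxwell relation:

```python
import sympy as sp

S, V = sp.symbols('S V')
U = sp.Function('U')(S, V)     # a generic smooth U(S, V)
T = sp.diff(U, S)
P = -sp.diff(U, V)

def ratio(A, B, X, Y):
    """(dA ∧ dB)/(dX ∧ dY), computed as a ratio of Jacobians in (S, V)."""
    num = sp.diff(A, S)*sp.diff(B, V) - sp.diff(A, V)*sp.diff(B, S)
    den = sp.diff(X, S)*sp.diff(Y, V) - sp.diff(X, V)*sp.diff(Y, S)
    return num/den             # we assume dX ∧ dY ≠ 0

# the master identity dT∧dS/(dX∧dY) = dP∧dV/(dX∧dY) for the four choices above
for X, Y in [(V, S), (T, V), (P, S), (P, T)]:
    print(sp.simplify(ratio(T, S, X, Y) - ratio(P, V, X, Y)))   # all 0
```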

Conclusions

So, we’ve seen that all four Maxwell relations are unified in a much simpler equation:

dT \wedge dS = dP \wedge dV

which follows instantly from

dU = T dS - P dV

This is a big step forward compared to the proof I gave last time, which, at least as I presented it, required cleverly guessing a bunch of auxiliary functions—even though these auxiliary functions turn out to be incredibly important in their own right.

So, we should not stop here: we should think hard about the physical and mathematical meaning of the equation

dT \wedge dS = dP \wedge dV

And Ritchie does this in his paper! But I will talk about that next time.


Part 1: a proof of Maxwell’s relations using commuting partial derivatives.

Part 2: a proof of Maxwell’s relations using 2-forms.

Part 3: the physical meaning of Maxwell’s relations, and their formulation in terms of symplectic geometry.

For how Maxwell’s relations are connected to Hamilton’s equations, see this post:

Classical mechanics versus thermodynamics.


Maxwell’s Relations (Part 1)

17 September, 2021

The Maxwell relations are equations that show up whenever we have a smooth convex function

U \colon \mathbb{R}^n \to \mathbb{R}

They say that the mixed partial derivatives of U commute, but also that the mixed partial derivatives of various functions constructed from U commute. So in a sense they are completely trivial except for the way we construct these various functions! Nonetheless they are very important in physics, because they give generally valid relations between the derivatives of thermodynamically interesting quantities.

For example, here is one of the Maxwell relations that people often use when studying a thermodynamic system like a cylinder of gas or a beaker of liquid:

\displaystyle{  \left. \frac{\partial T}{\partial V} \right|_S \; = \; - \left. \frac{\partial P}{\partial S} \right|_V }

where:

S is the entropy of the system
T is its temperature
V is its volume
P is its pressure

There are also three more Maxwell relations involving these quantities. They all have the same general look, though half contain minus signs and half don’t. So, they’re quite tiresome to memorize. It’s more interesting to figure out what’s really going on here!

I already gave one story about what’s going on: we start with a function U, which happens in this case to be a function of S and V called the internal energy of our system. If we say its mixed partial derivatives commute:

\displaystyle{ \frac{\partial^2 U}{\partial V \partial S} = \frac{\partial^2 U}{\partial S \partial V}   }

we get the Maxwell relation I wrote above, simply by using the standard definitions of T and P as partial derivatives of U. We can then build a bunch of other functions from U, S, T, V and P, and write down other equations saying that the mixed partial derivatives of these functions commute. This gives us all the Maxwell relations.

Let me show you in detail how this works, but without explaining why I’m choosing the particular ‘bunch of other functions’ that I’ll use. There is a way to explain that using the concept of Legendre transform. But next time I’ll give a very different approach to the Maxwell relations, based on this paper:

• David J. Ritchie, A simple method for deriving Maxwell’s relations, American Journal of Physics 36 (1958), 760–760.

This sidesteps the whole business of choosing a ‘bunch of other functions’, and I think it’s very nice.

What follows is a bit mind-numbing, so let me tell you what to pay attention to. I’ll do the same general sort of calculation four times, first starting with any convex smooth function U \colon \mathbb{R}^2 \to \mathbb{R} and then with three other functions built from that one. The only clever part is how we choose those other functions.

The first relation

We start with any smooth convex function U of two real variables S, V and write its differential as follows:

dU = T dS - P d V

This says

\displaystyle{ T =  \left.\frac{\partial U}{\partial S} \right|_V  \qquad    P = - \left. \frac{\partial U}{\partial V} \right|_S }

These are the usual definitions of the temperature T and pressure P in terms of the internal energy U, but you don’t need to know anything about those concepts to follow all the calculations to come. In particular, from the pure mathematical viewpoint the minus sign is merely a stupid convention.

The commuting of mixed partials implies

\displaystyle{  \left. \frac{\partial T}{\partial V} \right|_S \; = \;  \left. \frac{\partial}{\partial V} \right|_S \left. \frac{\partial }{\partial S} \right|_V U \; = \;  \left. \frac{\partial }{\partial S} \right|_V \left. \frac{\partial}{\partial V} \right|_S  U \; = \;  - \left. \frac{\partial P}{\partial S} \right|_V }

giving the first of the Maxwell relations:

\boxed{ \displaystyle{  \left. \frac{\partial T}{\partial V} \right|_S \; = \; - \left. \frac{\partial P}{\partial S} \right|_V } }

The second relation

Next, let’s define the Helmholtz free energy

F = U - TS

Taking its differential we get

\begin{array}{ccl}  dF &=& dU - d(TS) \\ \\ &=& (T dS - P dV) - (SdT + TdS) \\ \\  &=& - S dT - P d V  \end{array}

so if we think of F as a function of T, V we get

\displaystyle{ S =  - \left.\frac{\partial F}{\partial T} \right|_V  \qquad    P = - \left. \frac{\partial F}{\partial V} \right|_T }

The commuting of mixed partials implies

\displaystyle{  \left. - \frac{\partial S}{\partial V} \right|_T \; = \;  \left. \frac{\partial}{\partial V} \right|_T \left. \frac{\partial }{\partial T} \right|_V F \; = \;  \left. \frac{\partial }{\partial T} \right|_V \left. \frac{\partial}{\partial V} \right|_T  F \; = \;  - \left. \frac{\partial P}{\partial T} \right|_V }

giving the second of the Maxwell relations:

\boxed{ \displaystyle{ \left. \frac{\partial S}{\partial V} \right|_T \; = \;  \left. \frac{\partial P}{\partial T} \right|_V } }

The third relation

Copying what we did to get the second relation, let’s define the enthalpy

H = U + P V

Taking its differential we get

\begin{array}{ccl}  dH &=& dU + d(PV) \\ \\ &=& (T dS - P dV) + (VdP + P dV) \\ \\  &=& T dS + V dP  \end{array}

so if we think of H as a function of S, P we get

\displaystyle{ T =  \left.\frac{\partial H}{\partial S} \right|_P  \qquad  V = \left. \frac{\partial H}{\partial P} \right|_S }

The commuting of mixed partials implies

\displaystyle{  \left. \frac{\partial T}{\partial P} \right|_S \; = \;  \left. \frac{\partial}{\partial P} \right|_S \left. \frac{\partial }{\partial S} \right|_P H \; = \;  \left. \frac{\partial }{\partial S} \right|_P \left. \frac{\partial}{\partial P} \right|_S  H \; = \;  \left. \frac{\partial V}{\partial S} \right|_P }

giving the third of the Maxwell relations:

\boxed{ \displaystyle{ \left. \frac{\partial T}{\partial P} \right|_S  \; = \;  \left. \frac{\partial V}{\partial S} \right|_P }}

The fourth relation

Combining the last two tricks, let’s define the Gibbs free energy

G = U + PV - TS

Taking its differential we get

\begin{array}{ccl}  dG &=& dU + d(PV) - d(TS) \\ \\ &=& (T dS - P dV) + (VdP + P dV) - (SdT + TdS) \\ \\  &=& V d P - S dT  \end{array}

so if we think of G as a function of P, T we get

\displaystyle{ V =  \left.\frac{\partial G}{\partial P} \right|_T  \qquad S = -\left. \frac{\partial G}{\partial T} \right|_P }

The commuting of mixed partials implies

\displaystyle{  \left. \frac{\partial V}{\partial T} \right|_P \; = \;  \left. \frac{\partial}{\partial T} \right|_P \left. \frac{\partial }{\partial P} \right|_T G \; = \;  \left. \frac{\partial }{\partial P} \right|_T \left. \frac{\partial}{\partial T} \right|_P  G \; = \;  - \left. \frac{\partial S}{\partial P} \right|_T }

giving the fourth and final Maxwell relation:

\boxed{ \displaystyle{ \left. \frac{\partial V}{\partial T} \right|_P \; = \; - \left. \frac{\partial S}{\partial P} \right|_T }}
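
Here is a concrete sympy check of all four boxed relations, using the toy internal energy U(S,V) = e^S/V, chosen only because it’s easy to invert by hand (it isn’t meant to model any real substance). The re-expressions of the variables in the other coordinate pairs are worked out by hand first:

```python
import sympy as sp

S, V, T, P = sp.symbols('S V T P', positive=True)

U = sp.exp(S)/V                  # toy internal energy, illustration only
T_SV = sp.diff(U, S)             # T(S,V) = exp(S)/V
P_SV = -sp.diff(U, V)            # P(S,V) = exp(S)/V**2

# the same quantities re-expressed in other variable pairs (inverted by hand)
S_TV = sp.log(T*V)               # from T = exp(S)/V
P_TV = T/V
T_SP = sp.exp(S/2)*sp.sqrt(P)    # from P = exp(S)/V**2
V_SP = sp.exp(S/2)/sp.sqrt(P)
V_TP = T/P
S_TP = 2*sp.log(T) - sp.log(P)

print(sp.simplify(sp.diff(T_SV, V) + sp.diff(P_SV, S)))   # relation 1: 0
print(sp.simplify(sp.diff(S_TV, V) - sp.diff(P_TV, T)))   # relation 2: 0
print(sp.simplify(sp.diff(T_SP, P) - sp.diff(V_SP, S)))   # relation 3: 0
print(sp.simplify(sp.diff(V_TP, T) + sp.diff(S_TP, P)))   # relation 4: 0
```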

Conclusions

I hope you’re a bit unsatisfied, for two main reasons.

The first question you should have is this: why did we choose these four functions to work with:

U, \; U - ST, \; U + PV, \; U - ST + PV

The pattern of signs is not really significant here: if we hadn’t followed tradition and stuck a minus sign in our definition of P here:

\displaystyle{ T =  \left.\frac{\partial U}{\partial S} \right|_V  \qquad    P = - \left. \frac{\partial U}{\partial V} \right|_S }

everything would look more systematic, and we’d use these four functions:

U, \; U - ST, \; U - PV, \; U - ST - PV

This would make it easier to guess how everything works if instead of a function U \colon \mathbb{R}^2 \to \mathbb{R} we started with a function U \colon \mathbb{R}^n \to \mathbb{R}. We could write this as

U(q^1, \dots, q^n)

and define a bunch of functions

\displaystyle{ p_i = \frac{\partial U}{\partial q^i} }

called ‘conjugate quantities’.

Then, we could get 2^n different functions by starting with U and subtracting off products of the form p_i q^i where i ranges over some subset of \{1, \dots, n\}.

Then we could take mixed partial derivatives of these functions, note that the mixed partial derivatives commute, and get a bunch of Maxwell relations — maybe

\displaystyle{ 2^n \frac{n(n-1)}{2} }

of them! (Sanity check: when n is 2 this is indeed 4.)

The second question you should have is: how did I sneakily switch from thinking of U as a function of S and V to thinking of F as a function of T and V, and so on? How could I get away with this? I believe the answer to this involves the concept of Legendre transform, which works well when U is convex.

Answering these questions well might get us into a bit of contact geometry. That would be nice. But instead of answering these questions next time, I’ll talk about Ritchie’s approach to deriving Maxwell’s relations, which seems to sidestep the two questions I just raised!

Just for fun

Finally: students of thermodynamics are often forced to memorize the Maxwell relations. They sometimes use the thermodynamic square, where for some idiotic reason the p for pressure is written lower-case—perhaps so you can mix it up with momentum!

If you like puzzles, maybe you can figure out how the thermodynamic square works if you stare at it along with the four Maxwell relations I derived:

\displaystyle{  \left. \frac{\partial T}{\partial V} \right|_S \; \phantom{-} = \; - \left. \frac{\partial P}{\partial S} \right|_V }

\displaystyle{ \left. \frac{\partial S}{\partial V} \right|_T \; \phantom{-} = \; \phantom{-} \left. \frac{\partial P}{\partial T} \right|_V }

\displaystyle{ \left. \frac{\partial T}{\partial P} \right|_S  \; \phantom{-} = \; \phantom{-} \left. \frac{\partial V}{\partial S} \right|_P }

\displaystyle{ \left. \frac{\partial V}{\partial T} \right|_P \; \phantom{-} = \; - \left. \frac{\partial S}{\partial P} \right|_T }

Again, it would probably be easier if we hadn’t stuck a minus sign into the definition of pressure. If you get stuck, click on the link.

Unfortunately the thermodynamic square is itself so hard to memorize that students resort to a mnemonic for that! Sometimes they say “Good Physicists Have Studied Under Very Fine Teachers” which gives the letters “GPHSUVFT” that you see as you go around the square.

I will never try to remember any of these mnemonics. But I wonder if there’s something deeper going on here. So:

Puzzle. How does the thermodynamic square generalize if U is a function of 3 variables instead of 2? How about more variables?


Part 1: a proof of Maxwell’s relations using commuting partial derivatives.

Part 2: a proof of Maxwell’s relations using 2-forms.

Part 3: the physical meaning of Maxwell’s relations, and their formulation in terms of symplectic geometry.

For how Maxwell’s relations are connected to Hamilton’s equations, see this post:

Classical mechanics versus thermodynamics.


The Cyclic Identity for Partial Derivatives

13 September, 2021

 

As an undergrad I learned a lot about partial derivatives in physics classes. But they told us rules as needed, without proving them. This rule completely freaked me out:

\displaystyle{ \left. \frac{\partial u}{\partial v} \right|_w \, \left. \frac{\partial v}{\partial w} \right|_u \, \left. \frac{\partial w}{\partial u} \right|_v \; = \; -1 }

If derivatives are kinda like fractions, shouldn’t this equal 1?

Let me show you why it’s -1.

First, consider an example:

This example shows that the identity is not crazy. But in fact it holds the key to the general proof! Since (u,v) is a coordinate system we can assume without loss of generality that u = x, v = y. At any point we can approximate w to first order as ax + by + c for some constants a,b,c. But for derivatives the constant c doesn’t matter, so we can assume it’s zero.

Then just compute!
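
Here is a minimal sympy version of that computation, with w = ax + by and the constant c dropped:

```python
import sympy as sp

x, y, w, a, b = sp.symbols('x y w a b', nonzero=True)

x_of_yw = (w - b*y)/a        # solve w = a*x + b*y for x
y_of_xw = (w - a*x)/b        # solve it for y
w_of_xy = a*x + b*y

product = sp.diff(x_of_yw, y) * sp.diff(y_of_xw, w) * sp.diff(w_of_xy, x)
print(sp.simplify(product))  # -1
```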

There’s also a proof using differential forms that you might like better. You can see it here, along with an application to thermodynamics:

But this still leaves us yearning for more intuition — and for me, at least, a more symmetrical, conceptual proof. Over on Twitter, someone named Postdoc/cake provided some intuition using the same example from thermodynamics:
Using physics intuition to get the minus sign:

  1. increasing temperature at const volume = more pressure (gas pushes out more)
  2. increasing temperature at const pressure = increasing volume (ditto)
  3. increasing pressure at const temperature = decreasing volume (you push in more)


Jules Jacobs gave the symmetrical, conceptual proof that I was dreaming of:



As I’d hoped, the minus signs come from the anticommutativity of the wedge product of 1-forms, e.g.

du \wedge dv = - dv \wedge du

Since the space of 2-forms at a point in the plane is 1-dimensional, we can divide them. In fact a ratio like

\displaystyle{ \frac{du \wedge dw}{dv \wedge dw} }

is just the Jacobian of the transformation from (v,w) coordinates to (u,w) coordinates. We also need that these ratios obey the rule

\displaystyle{ \frac{\alpha}{\beta} \cdot \frac{\gamma}{\delta} = \frac{\alpha}{\delta} \cdot \frac{\gamma}{\beta}  }

where \alpha, \beta, \gamma, \delta are nonzero 2-forms at a point in the plane. This seems obvious, but you need to check it. It’s not hard. But to put it in fancy language, it follows from the fact that nonzero 2-forms at a point in the plane are a ‘torsor’! I explain that idea here:

Torsors made easy.

Torsors are widespread in physics, and the nonzero vectors in any 1-dimensional real vector space form a torsor for the multiplicative group of nonzero real numbers.

Jules Jacobs’s proof is more sophisticated than my simple argument, but it’s very pretty, and it generalizes to higher dimensions in ways that’d be hard to guess otherwise.


Information Geometry (Part 21)

17 August, 2021

Last time I ended with a formula for the ‘Gibbs distribution’: the probability distribution that maximizes entropy subject to constraints on the expected values of some observables.

This formula is well-known, but I’d like to derive it here. My argument won’t be up to the highest standards of rigor: I’ll do a bunch of computations, and it would take more work to state conditions under which these computations are justified. But even a nonrigorous approach is worthwhile, since the computations will give us more than the mere formula for the Gibbs distribution.

I’ll start by reminding you of what I claimed last time. I’ll state it in a way that removes all unnecessary distractions, so go back to Part 20 if you want more explanation.

The Gibbs distribution

Take a measure space \Omega with measure \mu. Suppose there is a probability distribution \pi on \Omega that maximizes the entropy

\displaystyle{ -\int_\Omega \pi(x) \ln \pi(x) \, d\mu(x) }

subject to the requirement that some integrable functions A^1, \dots, A^n on \Omega have expected values equal to some chosen list of numbers q^1, \dots, q^n.

(Unlike last time, now I’m writing A^i and q^i with superscripts rather than subscripts, because I’ll be using the Einstein summation convention: I’ll sum over any repeated index that appears once as a superscript and once as a subscript.)

Furthermore, suppose \pi depends smoothly on q \in \mathbb{R}^n. I’ll call it \pi_q to indicate its dependence on q. Then, I claim \pi_q is the so-called Gibbs distribution

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int_\Omega e^{-p_i A^i(x)} \, d\mu(x)}   }

where

\displaystyle{ p_i = \frac{\partial f(q)}{\partial q^i} }

and

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

is the entropy of \pi_q.

Let’s show this is true!

Finding the Gibbs distribution

So, we are trying to find a probability distribution \pi that maximizes entropy subject to these constraints:

\displaystyle{ \int_\Omega \pi(x) A^i(x) \, d\mu(x) = q^i }

We can solve this problem using Lagrange multipliers. We need one Lagrange multiplier, say \beta_i, for each of the above constraints. But it’s easiest if we start by letting \pi range over all of L^1(\Omega), that is, the space of all integrable functions on \Omega. Then, because we want \pi to be a probability distribution, we need to impose one extra constraint

\displaystyle{ \int_\Omega \pi(x) \, d\mu(x) = 1 }

To do this we need an extra Lagrange multiplier, say \gamma.

So, that’s what we’ll do! We’ll look for critical points of this function on L^1(\Omega):

\displaystyle{ - \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi\,  d\mu  }

Here I’m using some tricks to keep things short. First, I’m dropping the dummy variable x which appeared in all of the integrals we had: I’m leaving it implicit. Second, all my integrals are over \Omega so I won’t say that. And third, I’m using the Einstein summation convention, so there’s a sum over i implicit here.

Okay, now let’s do the variational derivative required to find a critical point of this function. When I was a math major taking physics classes, the way physicists did variational derivatives seemed like black magic to me. Then I spent months reading how mathematicians rigorously justified these techniques. I don’t feel like a massive digression into this right now, so I’ll just do the calculations—and if they seem like black magic, I’m sorry!

We need to find \pi obeying

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(- \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi \, d\mu \right) = 0 }

or in other words

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(\int \pi \ln \pi \, d\mu + \beta_i  \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 0 }

First we need to simplify this expression. The only part that takes any work, if you know how to do variational derivatives, is the first term. Since the derivative of z \ln z is 1 + \ln z, we have

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

The second and third terms are easy, so we get

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left( \int \pi \ln \pi \, d\mu + \beta_i \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma }

Thus, we need to solve this equation:

\displaystyle{ 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma  = 0}

That’s easy to do:

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

Good! It’s starting to look like the Gibbs distribution!

We now need to choose the Lagrange multipliers \beta_i and \gamma to make the constraints hold. To satisfy this constraint

\displaystyle{ \int \pi \, d\mu = 1 }

we must choose \gamma so that

\displaystyle{ \int  e^{-1 - \gamma - \beta_i A^i } \, d\mu = 1 }

or in other words

\displaystyle{ e^{1 + \gamma} = \int e^{- \beta_i A^i} \, d\mu }

Plugging this into our earlier formula

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

we get this:

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Great! Even more like the Gibbs distribution!
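
To make this concrete, here is a small numerical sketch of my own (not from the post): take \Omega = \{1, \dots, 6\} with counting measure, a single observable A(x) = x, and require its expected value to be 4.5. We solve for the Lagrange multiplier numerically and check the constraint:

```python
import numpy as np
from scipy.optimize import brentq

A = np.arange(1, 7, dtype=float)   # values of the observable on Ω = {1,...,6}
target = 4.5                       # the constrained expected value q

def mean_A(beta):
    w = np.exp(-beta*A)
    return (A*w).sum()/w.sum()     # expected value of A under exp(-beta*A)/Z

beta = brentq(lambda b: mean_A(b) - target, -10, 10)   # solve E[A] = q for beta
pi = np.exp(-beta*A)
pi /= pi.sum()                     # the Gibbs distribution

print(beta)                        # ≈ -0.37 (negative, since we want a mean above 3.5)
print((A*pi).sum())                # ≈ 4.5: the constraint holds
print(-(pi*np.log(pi)).sum())      # its entropy f(q)
```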

By the way, you must have noticed the “1” that showed up here:

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

It buzzed around like an annoying fly in the otherwise beautiful calculation, but eventually went away. This is the same irksome “1” that showed up in Part 19. Someday I’d like to say a bit more about it.

Now, where were we? We were trying to show that

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int e^{-p_i A^i} \, d\mu}   }

maximizes entropy subject to our constraints. So far we’ve shown

\displaystyle{ \pi(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

is a critical point. It’s clear that

\pi(x) \ge 0

so \pi really is a probability distribution. We should show it actually maximizes entropy subject to our constraints, but I will skip that. Given that, \pi will be our claimed Gibbs distribution \pi_q if we can show

p_i = \beta_i

This is interesting! It’s saying our Lagrange multipliers \beta_i actually equal the so-called conjugate variables p_i given by

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

where f(q) is the entropy of \pi_q:

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

There are two ways to show this: the easy way and the hard way. The easy way is to reflect on the meaning of Lagrange multipliers, and I’ll sketch that way first. The hard way is to use brute force: just compute p_i and show it equals \beta_i. This is a good test of our computational muscle—but more importantly, it will help us discover some interesting facts about the Gibbs distribution.

The easy way

Consider a simple Lagrange multiplier problem where you’re trying to find a critical point of a smooth function

f \colon \mathbb{R}^2 \to \mathbb{R}

subject to the constraint

g = c

for some smooth function

g \colon \mathbb{R}^2 \to \mathbb{R}

and constant c. (The function f here has nothing to do with the f in the previous sections.) To answer this we introduce a Lagrange multiplier \lambda and seek points where

\nabla ( f - \lambda g) = 0

This works because the above equation says

\nabla f = \lambda \nabla g

Geometrically this means we’re at a point where the gradient of f points at right angles to the level surface of g:

Thus, to first order we can’t change f by moving along the level surface of g.

But also, if we start at a point where

\nabla f = \lambda \nabla g

and we begin moving in any direction, the function f will change at a rate equal to \lambda times the rate of change of g. That’s just what the equation says! And this fact gives a conceptual meaning to the Lagrange multiplier \lambda.

Our situation is more complicated, since our functions are defined on the infinite-dimensional space L^1(\Omega), and we have an n-tuple of constraints with an n-tuple of Lagrange multipliers. But the same principle holds.

So, when we are at a solution \pi_q of our constrained entropy-maximization problem, and we start moving the point \pi_q by changing the value of the ith constraint, namely q^i, the rate at which the entropy changes will be \beta_i times the rate of change of q^i. So, we have

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

But this is just what we needed to show!

The hard way

Here’s another way to show

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

We start by solving our constrained entropy-maximization problem using Lagrange multipliers. As already shown, we get

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Then we’ll compute the entropy

f(q) = - \int \pi_q \ln \pi_q \, d\mu

Then we’ll differentiate this with respect to q^i and show we get \beta_i.

Let’s try it! The calculation is a bit heavy, so let’s write Z(q) for the so-called partition function

\displaystyle{ Z(q) = \int e^{- \beta_i A^i} \, d\mu }

so that

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{Z(q)} }

and the entropy is

\begin{array}{ccl}  f(q) &=& - \displaystyle{ \int  \pi_q  \ln \left( \frac{e^{- \beta_k A^k}}{Z(q)} \right)  \, d\mu }  \\ \\  &=& \displaystyle{ \int \pi_q \left(\beta_k A^k + \ln Z(q) \right)  \, d\mu } \\ \\  \end{array}

This is the sum of two terms. The first term

\displaystyle{ \int \pi_q \beta_k A^k   \, d\mu =  \beta_k \int \pi_q A^k   \, d\mu}

is \beta_k times the expected value of A^k with respect to the probability distribution \pi_q, all summed over k. But the expected value of A^k is q^k, so we get

\displaystyle{ \int  \pi_q \beta_k A^k \, d\mu } =  \beta_k q^k

The second term is easier:

\displaystyle{ \int_\Omega  \pi_q \ln Z(q) \, d\mu = \ln Z(q) }

since \pi_q(x) integrates to 1 and the partition function Z(q) doesn’t depend on x \in \Omega.

Putting together these two terms we get an interesting formula for the entropy:

f(q) = \beta_k q^k + \ln Z(q)

This formula is one reason this brute-force approach is actually worthwhile! I’ll say more about it later.

But for now, let’s use this formula to show what we’re trying to show, namely

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

For starters,

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=& \displaystyle{\frac{\partial}{\partial q^i} \left(\beta_k q^k + \ln Z(q) \right) } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k   \frac{\partial q^k}{\partial q^i} + \frac{\partial}{\partial q^i} \ln Z(q)   }  \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k  \delta^k_i + \frac{\partial}{\partial q^i} \ln Z(q)   } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  }  \end{array}

where we played a little Kronecker delta game with the second term.

Now we just need to compute the third term:

\begin{array}{ccl}  \displaystyle{ \frac{\partial}{\partial q^i} \ln Z(q) } &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i} Z(q) } \\  \\  &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i}  \int e^{- \beta_j A^j}  \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i} \left(e^{- \beta_j A^j}\right) \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i}\left( - \beta_k A^k \right)  e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ -\frac{1}{Z(q)} \int \frac{\partial \beta_k}{\partial q^i}  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} \frac{1}{Z(q)} \int  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} q^k }  \end{array}

Ah, you don’t know how good it feels, after years of category theory, to be doing calculations like this again!

Now we can finish the job we started:

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  } \\ \\  &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i - \frac{\partial \beta_k}{\partial q^i} q^k  } \\ \\  &=& \beta_i  \end{array}

Voilà!
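
If you enjoy double-checking such calculations by computer, here is a small numerical sanity check of both facts above: the formula f(q) = \beta_k q^k + \ln Z(q), and the equation \partial f / \partial q^i = \beta_i. It works on a randomly chosen finite measure space with counting measure, the observables are random, and it assumes NumPy is available; nothing in it is part of the argument, it just confirms the arithmetic.

```python
# Sanity check of f = beta.q + ln Z and of df/dq^i = beta_i on a finite
# measure space Omega = {0,...,m-1} with counting measure.
import numpy as np

rng = np.random.default_rng(0)
m = 6
A = rng.normal(size=(2, m))          # A[i] = the observable A^i on Omega

def gibbs(beta):
    w = np.exp(-beta @ A)            # unnormalized weights e^{-beta_i A^i(x)}
    return w / w.sum(), w.sum()

def q_of(beta):                      # the constraint values q^i = <A^i>
    pi, _ = gibbs(beta)
    return A @ pi

def entropy_of(beta):
    pi, _ = gibbs(beta)
    return -(pi * np.log(pi)).sum()

beta = np.array([0.3, -0.7])
pi, Z = gibbs(beta)
q = q_of(beta)

# Check f(q) = beta_k q^k + ln Z(q): the two numbers should agree.
print(entropy_of(beta), beta @ q + np.log(Z))

# Check df/dq^i = beta_i via the chain rule grad_beta f = J^T grad_q f,
# where J[i,j] = dq^i/dbeta_j, with derivatives taken by finite differences.
eps = 1e-6
grad_beta_f = np.array([(entropy_of(beta + eps*e) - entropy_of(beta - eps*e)) / (2*eps)
                        for e in np.eye(2)])
J = np.array([(q_of(beta + eps*e) - q_of(beta - eps*e)) / (2*eps)
              for e in np.eye(2)]).T
print(np.linalg.solve(J.T, grad_beta_f))   # should be close to beta = (0.3, -0.7)
```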

Conclusions

We’ve learned the formula for the probability distribution that maximizes entropy subject to some constraints on the expected values of observables. But more importantly, we’ve seen that the anonymous Lagrange multipliers \beta_i that show up in this problem are actually the partial derivatives of entropy! They equal

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

Thus, they are rich in meaning. From what we’ve seen earlier, they are ‘surprisals’. They are analogous to momentum in classical mechanics and have the meaning of intensive variables in thermodynamics:

        Classical Mechanics    Thermodynamics         Probability Theory
  q     position               extensive variables    probabilities
  p     momentum               intensive variables    surprisals
  S     action                 entropy                Shannon entropy

Furthermore, by showing \beta_i = p_i the hard way we discovered an interesting fact. There’s a relation between the entropy and the logarithm of the partition function:

f(q) = p_i q^i + \ln Z(q)

(We proved this formula with \beta_i replacing p_i, but now we know those are equal.)

This formula suggests that the logarithm of the partition function is important—and it is! It’s closely related to the concept of free energy—even though ‘energy’, free or otherwise, doesn’t show up at the level of generality we’re working at now.

This formula should also remind you of the tautological 1-form on the cotangent bundle T^\ast Q, namely

\theta = p_i dq^i

It should remind you even more of the contact 1-form on the contact manifold T^\ast Q \times \mathbb{R}, namely

\alpha = -dS + p_i dq^i

Here S is a coordinate on the contact manifold that’s a kind of abstract stand-in for our entropy function f.

So, it’s clear there’s a lot more to say: we’re seeing hints of things here and there, but not yet the full picture.


For all my old posts on information geometry, go here:

Information geometry.


Information Geometry (Part 20)

14 August, 2021

Last time we worked out an analogy between classical mechanics, thermodynamics and probability theory. The latter two look suspiciously similar:

        Classical Mechanics    Thermodynamics         Probability Theory
  q     position               extensive variables    probabilities
  p     momentum               intensive variables    surprisals
  S     action                 entropy                Shannon entropy

This is no coincidence. After all, in the subject of statistical mechanics we explain classical thermodynamics using probability theory—and entropy is revealed to be Shannon entropy (or its quantum analogue).

Now I want to make this precise.

To connect classical thermodynamics to probability theory, I’ll start by discussing ‘statistical manifolds’. I introduced the idea of a statistical manifold in Part 7: it’s a manifold Q equipped with a map sending each point q \in Q to a probability distribution \pi_q on some measure space \Omega. Now I’ll say how these fit into the second column of the above chart.

Then I’ll talk about statistical manifolds of a special sort used in thermodynamics, which I’ll call ‘Gibbsian’, since they really go back to Josiah Willard Gibbs.

In a Gibbsian statistical manifold, for each q \in Q the probability distribution \pi_q is a ‘Gibbs distribution’. Physically, these Gibbs distributions describe thermodynamic equilibria. For example, if you specify the volume, energy and number of particles in a box of gas, there will be a Gibbs distribution describing what the particles do in thermodynamic equilibrium under these conditions. Mathematically, Gibbs distributions maximize entropy subject to some constraints specified by the point q \in Q.

More precisely: in a Gibbsian statistical manifold we have a list of observables A_1, \dots , A_n whose expected values serve as coordinates q_1, \dots, q_n for points q \in Q, and \pi_q is the probability distribution that maximizes entropy subject to the constraint that the expected value of A_i is q_i. We can derive most of the interesting formulas of thermodynamics starting from this!

Statistical manifolds

Let’s fix a measure space \Omega with measure \mu. A statistical manifold is then a manifold Q equipped with a smooth map \pi assigning to each point q \in Q a probability distribution on \Omega, which I’ll call \pi_q. So, \pi_q is a function on \Omega with

\displaystyle{ \int_\Omega \pi_q \, d\mu = 1 }

and

\pi_q(x) \ge 0

for all x \in \Omega.

The idea here is that the space of all probability distributions on \Omega may be too huge to understand in as much detail as we’d like, so instead we describe some of these probability distributions—a family parametrized by points of some manifold Q—using the map \pi. This is the basic idea behind parametric statistics.

Information geometry is the geometry of statistical manifolds. Any statistical manifold comes with a bunch of interesting geometrical structures. One is the ‘Fisher information metric’, a Riemannian metric I explained in Part 7. Another is a 1-parameter family of connections on the tangent bundle T Q, which is important in Amari’s approach to information geometry. You can read about this here:

• Hiroshi Matsuzoe, Statistical manifolds and affine differential geometry, in Advanced Studies in Pure Mathematics 57, pp. 303–321.

I don’t want to talk about it now—I just wanted to reassure you that I’m not completely ignorant of it!

I want to focus on the story I’ve been telling, which is about entropy. Our statistical manifold Q comes with a smooth entropy function

f \colon Q \to \mathbb{R}

namely

\displaystyle{  f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x)    }

We can use this entropy function to do many of the things we usually do in thermodynamics! For example, at any point q \in Q where this function is differentiable, its differential gives a cotangent vector

p = (df)_q

which has an important physical meaning. In coordinates we have

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and we call p_i the intensive variable conjugate to q_i. For example if q_i is energy, p_i will be ‘coolness’: the reciprocal of temperature.

Defining p this way gives a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q : \; p =  (df)_q \}

of the cotangent bundle T^\ast Q. We can also get contact geometry into the game by defining a contact manifold T^\ast Q \times \mathbb{R} and a Legendrian submanifold

\Sigma = \{ (q,p,S) \in T^\ast Q \times \mathbb{R} : \; p =  (df)_q , \; S = f(q) \}

But I’ve been talking about these ideas for the last three episodes, so I won’t say more just now! Instead, I want to throw a new idea into the pot.

Gibbsian statistical manifolds

Thermodynamics, and statistical mechanics, spend a lot of time dealing with statistical manifolds of a special sort I’ll call ‘Gibbsian’. In these, each probability distribution \pi_q is a ‘Gibbs distribution’, meaning that it maximizes entropy subject to certain constraints specified by the point q \in Q.

How does this work? For starters, an integrable function

A \colon \Omega \to \mathbb{R}

is called a random variable, or in physics an observable. The expected value of an observable is a smooth real-valued function on our statistical manifold

\langle A \rangle \colon Q \to \mathbb{R}

given by

\displaystyle{ \langle A \rangle(q) = \int_\Omega A(x) \pi_q(x) \, d\mu(x) }

In other words, \langle A \rangle is a function whose value at any point q \in Q is the expected value of A with respect to the probability distribution \pi_q.

Now, suppose our statistical manifold is n-dimensional and we have n observables A_1, \dots, A_n. Their expected values will be smooth functions on our manifold—and sometimes these functions will be a coordinate system!

This may sound rather unlikely, but it’s really not so outlandish. Indeed, if there’s a point q such that the differentials of the functions \langle A_i \rangle are linearly independent at this point, these functions will be a coordinate system in some neighborhood of this point, by the inverse function theorem. So, we can take this neighborhood, use it as our statistical manifold, and the functions \langle A_i \rangle will be coordinates.

So, let’s assume the expected values of our observables give a coordinate system on Q. Let’s call these coordinates q_1, \dots, q_n, so that

\langle A_i \rangle(q) = q_i

Now for the kicker: we say our statistical manifold is Gibbsian if for each point q \in Q, \pi_q is the probability distribution that maximizes entropy subject to the above condition!

Which condition? The condition saying that

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

for all i. This is just the previous equation spelled out so that you can see it’s a condition on \pi_q.

This assumption of the entropy-maximizing nature of \pi_q is very powerful, because it implies a useful and nontrivial formula for \pi_q. It’s called the Gibbs distribution:

\displaystyle{  \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }

for all x \in \Omega.

Here p_i is the intensive variable conjugate to q_i, while Z(q) is the partition function: the thing we must divide by to make sure \pi_q integrates to 1. In other words:

\displaystyle{ Z(q) = \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

By the way, this formula may look confusing at first, since the left side depends on the point q in our statistical manifold, while there’s no q visible in the right side! Do you see what’s going on?

I’ll tell you: the conjugate variable p_i, sitting on the right side of the above formula, depends on q. Remember, we got it by taking the partial derivative of the entropy in the q_i direction

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and then evaluating this derivative at the point q.

But wait a minute! f here is the entropy—but the entropy of what?

The entropy of \pi_q, of course!

So there’s something circular about our formula for \pi_q. To know \pi_q, you need to know the conjugate variables p_i, but to compute these you need to know the entropy of \pi_q.

This is actually okay: while circular, the formula for \pi_q is still true. It’s harder to work with than you might hope, but it’s still extremely useful.
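
In practice, one way to cut the circle (just a sketch of my own, for a finite measure space, assuming NumPy and SciPy are available) is to solve directly for the conjugate variables p_i that make the constraints \langle A_i \rangle = q_i hold, and then read off \pi_q from the Gibbs formula:

```python
# Given a target point q, solve <A_i> = q_i for the conjugate variables p_i,
# then the Gibbs formula gives pi_q explicitly.
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
m = 8                                 # size of the finite measure space Omega
A = rng.normal(size=(2, m))           # two observables A_1, A_2

def gibbs(p):
    w = np.exp(-p @ A)                # exp(- sum_i p_i A_i(x)) for x in Omega
    return w / w.sum()

def expected_values(p):
    return A @ gibbs(p)

# Pick a target q that certainly lies on the statistical manifold: the expected
# values of A_1, A_2 under some arbitrary Gibbs distribution.
q_target = expected_values(np.array([0.5, -0.2]))

p = fsolve(lambda p: expected_values(p) - q_target, x0=np.zeros(2))
pi_q = gibbs(p)

print("conjugate variables p:", p)                    # recovers (0.5, -0.2)
print("constraint check:     ", expected_values(p) - q_target)
print("entropy f(q):         ", -(pi_q * np.log(pi_q)).sum())
```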

Next time I’ll prove that this formula for \pi_q is true, and do a few things with it. All this material was discovered by Gibbs in the late 1800s, and it’s lurking in any good book on statistical mechanics—but not phrased in the language of statistical manifolds. The physics textbooks usually consider special cases, like a box of gas where:

q_1 is energy, p_1 is 1/temperature.

q_2 is volume, p_2 is pressure/temperature.

q_3 is the number of particles, p_3 is –chemical potential/temperature.

While these special cases are important and interesting, I’d rather be general!
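
Here is one more special case, this time with no physics in it, included just as an illustration (it’s a standard fact about maximum entropy, not anything specific to this setup). Take \Omega = \mathbb{R} with Lebesgue measure and the two observables

A_1(x) = x, \qquad A_2(x) = x^2

Then the Gibbs distributions

\displaystyle{ \pi_q(x) = \frac{1}{Z(q)} \exp\left(- p_1 x - p_2 x^2 \right) }

with p_2 > 0 are exactly the Gaussian distributions, and the coordinates q_1, q_2 are the mean and the second moment. This reflects the familiar fact that among probability distributions on the real line with a given mean and variance, the Gaussian is the one maximizing entropy.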

Technical comments

I said “Any statistical manifold comes with a bunch of interesting geometrical structures”, but in fact some conditions are required. For example, the Fisher information metric is only well-defined and nondegenerate under some conditions on the map \pi. For instance, if \pi maps every point of Q to the same probability distribution, the Fisher information metric will vanish.

Similarly, the entropy function f is only smooth under some conditions on \pi.

Furthermore, the integral

\displaystyle{ \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

may not converge for all values of the numbers p_1, \dots, p_n. But in my discussion of Gibbsian statistical manifolds, I was assuming that an entropy-maximizing probability distribution \pi_q with

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

actually exists. In this case the probability distribution is also unique (almost everywhere).


For all my old posts on information geometry, go here:

Information geometry.


Information Geometry (Part 19)

8 August, 2021

Last time I figured out the analogue of momentum in probability theory, but I didn’t say what it’s called. Now I will tell you—thanks to some help from Abel Jansma and Toby Bartels.

SURPRISE: it’s called SURPRISAL!

This is a well-known concept in information theory. It’s also called ‘information content‘.

Let’s see why. First, let’s remember the setup. We have a manifold

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

whose points q are nowhere vanishing probability distributions on the set \{1, \dots, n\}. We have a function

f \colon Q \to \mathbb{R}

called the Shannon entropy, defined by

\displaystyle{ f(q) = - \sum_{j = 1}^n q_j \ln q_j }

For each point q \in Q we define a cotangent vector p \in T^\ast_q Q by

p = (df)_q

As mentioned last time, this is the analogue of momentum in probability theory. In the second half of this post I’ll say more about exactly why. But first let’s compute it and see what it actually equals!

Let’s start with a naive calculation, acting as if the probabilities q_1, \dots, q_n were a coordinate system on the manifold Q. We get

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

so using the definition of the Shannon entropy we have

\begin{array}{ccl}  p_i &=& \displaystyle{ -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j  }\\ \\  &=& \displaystyle{ -\frac{\partial}{\partial q_i} \left( q_i \ln q_i \right) } \\ \\  &=& -\ln(q_i) - 1  \end{array}

Now, the quantity -\ln q_i is called the surprisal of the probability distribution at i. Intuitively, it’s a measure of how surprised you should be if an event of probability q_i occurs. For example, if you flip a fair coin and it lands heads up, your surprisal is ln 2. If you flip 100 fair coins and they all land heads up, your surprisal is 100 times ln 2.

Of course ‘surprise’ is a psychological term, not a term from math or physics, so we shouldn’t take it too seriously here. We can derive the concept of surprisal from three axioms:

  1. The surprisal of an event of probability q is some function of q, say F(q).
  2. The less probable an event is, the larger its surprisal is: q_1 \le q_2 \implies  F(q_1) \ge F(q_2).
  3. The surprisal of two independent events is the sum of their surprisals: F(q_1 q_2) = F(q_1) + F(q_2).

It follows from work on Cauchy’s functional equation that F must be of this form:

F(q) = - \log_b q

for some constant b > 1. We shall choose b, the base of our logarithms, to be e. We had a similar freedom of choice in defining the Shannon entropy, and we will use base e for both to be consistent. If we chose something else, it would change the surprisal and the Shannon entropy by the same constant factor.
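
In case you are curious how the axioms force this form, here is a quick sketch of the standard argument, in my own words. Write G(u) = F(e^{-u}). Then axiom 3 becomes Cauchy’s functional equation G(u + v) = G(u) + G(v), and axiom 2 says G is monotone. A monotone solution of Cauchy’s equation must be linear, so G(u) = c u for some constant c \ge 0, and thus

F(q) = -c \ln q

Discarding the degenerate constant solution c = 0, this is F(q) = -\log_b q with b = e^{1/c} > 1.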

So far, so good. But what about the irksome “-1” in our formula?

p_i = -\ln(q_i) - 1

Luckily it turns out we can just get rid of this! The reason is that the probabilities q_i are not really coordinates on the manifold Q. They’re not independent: they must sum to 1. So, when we change them a little, the sum of their changes must vanish. Putting it more technically, the tangent space T_q Q is not all of \mathbb{R}^n, but just the subspace consisting of vectors whose components sum to zero:

\displaystyle{ T_q Q = \{ v \in \mathbb{R}^n : \; \sum_{j = 1}^n v_j = 0 \} }

The cotangent space is the dual of the tangent space. The dual of a subspace

S \subseteq V

is the quotient space

V^\ast/\{ \ell \colon V \to \mathbb{R} : \; \forall v \in S \; \, \ell(v) = 0 \}

The cotangent space T_q^\ast Q thus consists of linear functionals \ell \colon \mathbb{R}^n \to \mathbb{R} modulo those that vanish on vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

Of course, we can identify the dual of \mathbb{R}^n with \mathbb{R}^n in the usual way, using the Euclidean inner product: a vector u \in \mathbb{R}^n corresponds to the linear functional

\displaystyle{ \ell(v) = \sum_{j = 1}^n u_j v_j }

From this, you can see that a linear functional \ell vanishes on all vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

if and only if its corresponding vector u has

u_1 = \cdots = u_n

So, we get

T^\ast_q Q \cong \mathbb{R}^n/\{ u \in \mathbb{R}^n : \; u_1 = \cdots = u_n \}

In words: we can describe cotangent vectors to Q as lists of n numbers if we want, but we have to remember that adding the same constant to each number in the list doesn’t change the cotangent vector!

This suggests that our naive formula

p_i = -\ln(q_i) - 1

is on the right track, but we’re free to get rid of the constant 1 if we want! And that’s true.

To check this rigorously, we need to show

\displaystyle{ p(v) = -\sum_{j=1}^n \ln(q_j) v_j}

for all v \in T_q Q. We compute:

\begin{array}{ccl}  p(v) &=& df(v) \\ \\  &=& v(f) \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j \, \frac{\partial f}{\partial q_j} } \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j (-\ln(q_j) - 1) } \\ \\  &=& \displaystyle{ -\sum_{j=1}^n \ln(q_j) v_j }  \end{array}

where in the second to last step we used our earlier calculation:

\displaystyle{ \frac{\partial f}{\partial q_i} = -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j = -\ln(q_i) - 1 }

and in the last step we used

\displaystyle{ \sum_j v_j = 0 }
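
If you want to see the “-1” drop out numerically, here is a tiny check (assuming NumPy is available, with a random probability distribution and a random tangent vector; it is only a sanity check, not part of the argument):

```python
# Check that on tangent vectors v with components summing to zero, the
# differential of the Shannon entropy pairs with v via the surprisals -ln(q_j).
import numpy as np

rng = np.random.default_rng(2)
q = rng.random(5); q /= q.sum()            # a nowhere vanishing probability distribution
v = rng.normal(size=5); v -= v.mean()      # a tangent vector: components sum to zero

f = lambda q: -(q * np.log(q)).sum()       # Shannon entropy

eps = 1e-6
directional = (f(q + eps*v) - f(q - eps*v)) / (2*eps)   # df(v) by finite differences
naive       = ((-np.log(q) - 1) * v).sum()              # naive gradient, with the -1
surprisal   = (-np.log(q) * v).sum()                    # surprisals only

print(directional, naive, surprisal)        # all three agree
```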

Back to the big picture

Now let’s take stock of where we are. We can fill in the question marks in the charts from last time, and combine those charts while we’re at it.

        Classical Mechanics    Thermodynamics         Probability Theory
  q     position               extensive variables    probabilities
  p     momentum               intensive variables    surprisals
  S     action                 entropy                Shannon entropy

What’s going on here? In classical mechanics, action is minimized (or at least the system finds a critical point of the action). In thermodynamics, entropy is maximized. In the maximum entropy approach to probability, Shannon entropy is maximized. This leads to a mathematical analogy that’s quite precise. For classical mechanics and thermodynamics, I explained it here:

Classical mechanics versus thermodynamics (part 1).

Classical mechanics versus thermodynamics (part 2).

These posts may give a more approachable introduction to what I’m doing now: now I’m bringing probability theory into the analogy, with a big emphasis on symplectic and contact geometry.

Let me spell out a bit of the analogy more carefully:

Classical Mechanics. In classical mechanics, we have a manifold Q whose points are positions of a particle. There’s an important function on this manifold: Hamilton’s principal function

f \colon Q \to \mathbb{R}

What’s this? It’s basically action: f(q) is the action of the least-action path from the position q_0 at some earlier time t_0 to the position q at time 0. The Hamilton–Jacobi equations say the particle’s momentum p at time 0 is given by

p = (df)_q

Thermodynamics. In thermodynamics, we have a manifold Q whose points are equilibrium states of a system. The coordinates of a point q \in Q are called extensive variables. There’s an important function on this manifold: the entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the intensive variables corresponding to the extensive variables.

Probability Theory. In probability theory, we have a manifold Q whose points are nowhere vanishing probability distributions on a finite set. The coordinates of a point q \in Q are probabilities. There’s an important function on this manifold: the Shannon entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the surprisals corresponding to the probabilities.

In all three cases, T^\ast Q is a symplectic manifold and imposing the constraint p = (df)_q picks out a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q: \; p = (df)_q \}

There is also a contact manifold T^\ast Q \times \mathbb{R} where the extra dimension comes with an extra coordinate S that means

• action in classical mechanics,
• entropy in thermodynamics, and
• Shannon entropy in probability theory.

We can then decree that S = f(q) along with p = (df)_q, and these constraints pick out a Legendrian submanifold

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

There’s a lot more to do with these ideas, and I’ll continue next time.


For all my old posts on information geometry, go here:

Information geometry.


Information Geometry (Part 18)

5 August, 2021

Last time I sketched how two related forms of geometry, symplectic and contact geometry, show up in thermodynamics. Today I want to explain how they show up in probability theory.

For some reason I haven’t seen much discussion of this! But people should have looked into this. After all, statistical mechanics explains thermodynamics in terms of probability theory, so if some mathematical structure shows up in thermodynamics it should appear in statistical mechanics… and thus ultimately in probability theory.

I just figured out how this works for symplectic and contact geometry.

Suppose a system has n possible states. We’ll call these microstates, following the tradition in statistical mechanics. If you don’t know what ‘microstate’ means, don’t worry about it! But the rough idea is that if you have a macroscopic system like a rock, the precise details of what its atoms are doing are described by a microstate, and many different microstates could be indistinguishable unless you look very carefully.

We’ll call the microstates 1, 2, \dots, n. So, if you don’t want to think about physics, when I say microstate I’ll just mean an integer from 1 to n.

Next, a probability distribution q assigns a real number q_i to each microstate, and these numbers must sum to 1 and be nonnegative. So, we have q \in \mathbb{R}^n, though not every vector in \mathbb{R}^n is a probability distribution.

I’m sure you’re wondering why I’m using q rather than p to stand for a probability distribution. Am I just trying to confuse you?

No: I’m trying to set up an analogy to physics!

Last time I introduced symplectic geometry using classical mechanics. The most important example of a symplectic manifold is the cotangent bundle T^\ast Q of a manifold Q. A point of T^\ast Q is a pair (q,p) consisting of a point q \in Q and a cotangent vector p \in T^\ast_q Q. In classical mechanics the point q describes the position of some physical system, while p describes its momentum.

So, I’m going to set up an analogy like this:

        Classical Mechanics    Probability Theory
  q     position               probability distribution
  p     momentum               ???

But what is to momentum as probability is to position?

A big clue is the appearance of symplectic geometry in thermodynamics, which I also outlined last time. We can use this to get some intuition about the analogue of momentum in probability theory.

In thermodynamics, a system has a manifold Q of states. (These are not the ‘microstates’ I mentioned before: we’ll see the relation later.) There is a function

f \colon Q \to \mathbb{R}

describing the entropy of the system as a function of its state. There is a law of thermodynamics saying that

p = (df)_q

This equation picks out a submanifold of T^\ast Q, namely

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

Moreover this submanifold is Lagrangian: the symplectic structure \omega vanishes when restricted to it:

\displaystyle{ \omega |_\Lambda = 0 }

This is very beautiful, but it goes by so fast you might almost miss it! So let’s clutter it up a bit with coordinates. We often use local coordinates on Q and describe a point q \in Q using these coordinates, getting a point

(q_1, \dots, q_n) \in \mathbb{R}^n

They give rise to local coordinates q_1, \dots, q_n, p_1, \dots, p_n on the cotangent bundle T^\ast Q. The q_i are called extensive variables, because they are typically things that you can measure only by totalling up something over the whole system, like the energy or volume of a cylinder of gas. The p_i are called intensive variables, because they are typically things that you can measure locally at any point, like temperature or pressure.

In these local coordinates, the symplectic structure on T^\ast Q is the 2-form given by

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n
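
By the way, in these coordinates it takes only a line to see why \Lambda is Lagrangian; this is a standard computation, spelled out here just for convenience. Restricted to \Lambda we have p_i = \partial f / \partial q_i (this is just the equation p = (df)_q written in coordinates), so

\displaystyle{ dp_i = \sum_j \frac{\partial^2 f}{\partial q_j \partial q_i} \, dq_j }

and therefore

\displaystyle{ \omega|_\Lambda = \sum_{i,j} \frac{\partial^2 f}{\partial q_j \partial q_i} \, dq_j \wedge dq_i = 0 }

since the second partial derivatives are symmetric in i and j while the wedge product is antisymmetric.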

The equation

p = (df)_q

serves as a law of physics that determines the intensive variables given the extensive ones when our system is in thermodynamic equilibrium. Written out using coordinates, this law says

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

It looks pretty bland here, but in fact it gives formulas for the temperature and pressure of a gas, and many other useful formulas in thermodynamics.
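
Here is a quick sanity check, as an aside. For a classical monatomic ideal gas with the particle number N held fixed, the entropy as a function of the energy E and volume V is, up to an additive constant,

\displaystyle{ S(E,V) = N k \ln \left( V E^{3/2} \right) + \mathrm{const} }

so the law p_i = \partial f / \partial q_i gives

\displaystyle{ \frac{1}{T} = \frac{\partial S}{\partial E} = \frac{3 N k}{2 E} \qquad \frac{P}{T} = \frac{\partial S}{\partial V} = \frac{N k}{V} }

that is, E = \frac{3}{2} N k T and P V = N k T: the equipartition result and the ideal gas law.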

Now we are ready to see how all this plays out in probability theory! We’ll get an analogy like this, which goes hand-in-hand with our earlier one:

        Thermodynamics         Probability Theory
  q     extensive variables    probability distribution
  p     intensive variables    ???

This analogy is clearer than the last, because statistical mechanics reveals that the extensive variables in thermodynamics are really just summaries of probability distributions on microstates. Furthermore, both thermodynamics and probability theory have a concept of entropy.

So, let’s take our manifold Q to consist of probability distributions on the set of microstates I was talking about before: the set \{1, \dots, n\}. Actually, let’s use nowhere vanishing probability distributions:

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

I’m requiring q_i > 0 to ensure Q is a manifold, and also to make sure f is differentiable: it ceases to be differentiable when one of the probabilities q_i hits zero.

Since Q is a manifold, its cotangent bundle is a symplectic manifold T^\ast Q. And here’s the good news: we have a god-given entropy function

f \colon Q \to \mathbb{R}

namely the Shannon entropy

\displaystyle{ f(q) = - \sum_{i = 1}^n q_i \ln q_i }

So, everything I just described about thermodynamics works in the setting of plain old probability theory! Starting from our manifold Q and the entropy function, we get all the rest, leading up to the Lagrangian submanifold

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

that describes the relation between extensive and intensive variables.

For computations it helps to pick coordinates on Q. Since the probabilities q_1, \dots, q_n sum to 1, they aren’t independent coordinates on Q. So, we can either pick all but one of them as coordinates, or learn how to deal with non-independent coordinates, which are already completely standard in projective geometry. Let’s do the former, just to keep things simple.

These coordinates on Q give rise in the usual way to coordinates q_i and p_i on the cotangent bundle T^\ast Q. These play the role of extensive and intensive variables, respectively, and it should be very interesting to impose the equation

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

where f is the Shannon entropy. This picks out a Lagrangian submanifold \Lambda \subseteq T^\ast Q.
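
Just to make this concrete, here is a small sketch (assuming NumPy is available) of what the p_i look like if we follow the first option above and use q_1, \dots, q_{n-1} as coordinates, with q_n = 1 - (q_1 + \cdots + q_{n-1}) treated as a dependent quantity. A short chain-rule computation gives

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} = -\ln q_i + \ln q_n = \ln \frac{q_n}{q_i} }

and the code below just confirms this by finite differences:

```python
# The intensive variables p_i in the chart where q_1,...,q_{n-1} are coordinates
# and q_n = 1 - (q_1 + ... + q_{n-1}); they come out to ln(q_n / q_i).
import numpy as np

rng = np.random.default_rng(3)
n = 4
q = rng.random(n); q /= q.sum()

def f(coords):
    """Shannon entropy as a function of the chart coordinates q_1,...,q_{n-1}."""
    q_full = np.append(coords, 1.0 - coords.sum())
    return -(q_full * np.log(q_full)).sum()

coords = q[:-1]
eps = 1e-6
p = np.array([(f(coords + eps*e) - f(coords - eps*e)) / (2*eps)
              for e in np.eye(n - 1)])

print(p)                         # numerical df/dq_i
print(np.log(q[-1] / q[:-1]))    # ln(q_n / q_i), should match
```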

So, the question becomes: what does this mean? If this formula gives the analogue of momentum for probability theory, what does this analogue of momentum mean?

Here’s a preliminary answer: p_i says how fast entropy increases as we increase the probability q_i that our system is in the ith microstate. So if we think of nature as ‘wanting’ to maximize entropy, the quantity p_i says how eager it is to increase the probability q_i.

Indeed, you can think of p_i as a bit like pressure—one of the most famous intensive quantities in thermodynamics. A gas ‘wants’ to expand, and its pressure says precisely how eager it is to expand. Similarly, a probability distribution ‘wants’ to flatten out, to maximize entropy, and p_i says how eager it is to increase the probability q_i in order to do this.

But what can we do with this concept? And what does symplectic geometry do for probability theory?

I will start tackling these questions next time.

One thing I’ll show is that when we reduce thermodynamics to probability theory using the ideas of statistical mechanics, the appearance of symplectic geometry in thermodynamics follows from its appearance in probability theory.

Another thing I want to investigate is how other geometrical structures on the space of probability distributions, like the Fisher information metric, interact with the symplectic structure on its cotangent bundle. This will integrate symplectic geometry and information geometry.

I also want to bring contact geometry into the picture. It’s already easy to see from our work last time how this should go. We treat the entropy S as an independent variable, and replace T^\ast Q with a larger manifold T^\ast Q \times \mathbb{R} having S as an extra coordinate. This is a contact manifold with contact form

\alpha = -dS + p_1 dq_1 + \cdots + p_n dq_n

This contact manifold has a submanifold \Sigma where we remember that entropy is a function of the probability distribution q, and define p in terms of q as usual:

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

And as we saw last time, \Sigma is a Legendrian submanifold, meaning

\displaystyle{ \alpha|_{\Sigma} = 0 }
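
In case you want to check this directly, it is again a one-line computation: on \Sigma we have S = f(q) and p_i = \partial f / \partial q_i, so restricted to \Sigma

\displaystyle{ dS = \sum_i \frac{\partial f}{\partial q_i} \, dq_i = \sum_i p_i \, dq_i }

and the two terms in \alpha cancel.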

But again, we want to understand what these ideas from contact geometry really do for probability theory!


For all my old posts on information geometry, go here:

Information geometry.


Structured vs Decorated Cospans (Part 2)

29 July, 2021

Decorated cospans are a framework for studying open systems invented by Brendan Fong. Since I’m now visiting the institute he and David Spivak set up—the Topos Institute—it was a great time to give a talk explaining the history of decorated cospans, their problems, and how those problems have been solved:

Structured vs Decorated Cospans

Abstract. One goal of applied category theory is to understand open systems: that is, systems that can interact with the external world. We compare two approaches to describing open systems as morphisms: structured and decorated cospans. Each approach provides a symmetric monoidal double category. Structured cospans are easier, decorated cospans are more general, but under certain conditions the two approaches are equivalent. We take this opportunity to explain some tricky issues that have only recently been resolved.

It’s probably best to get the slides here and look at them while watching the video of the talk.

If you prefer a more elementary talk explaining what structured and decorated cospans are good for, try these slides.

For videos and slides of two related talks go here:

Structured cospans and double categories.

Structured cospans and Petri nets.

For more, read these:

• Brendan Fong, Decorated cospans.

• Evan Patterson and Micah Halter, Compositional epidemiological modeling using structured cospans.

• John Baez and Kenny Courser, Structured cospans.

• John Baez, Kenny Courser and Christina Vasilakopoulou, Structured versus decorated cospans.

• Kenny Courser, Open Systems: a Double Categorical Perspective.

• Michael Shulman, Framed bicategories and monoidal fibrations.

• Joe Moeller and Christina Vasilakopoulou, Monoidal Grothendieck construction.

To read more about the network theory project, go here:

Network Theory.