The Cyclic Identity for Partial Derivatives

13 September, 2021


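As an undergrad I learned a lot about partial derivatives in physics classes. But they told us rules as needed, without proving them. This rule completely freaked me out:

\displaystyle{ \left. \frac{\partial u}{\partial v} \right|_w \left. \frac{\partial v}{\partial w} \right|_u \left. \frac{\partial w}{\partial u} \right|_v = -1 }

If derivatives are kinda like fractions, shouldn’t this equal 1?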

Let me show you why it’s -1.

First, consider an example:

This example shows that the identity is not crazy. But in fact it holds the key to the general proof! Since (u,v) is a coordinate system we can assume without loss of generality that u = x, v = y. At any point we can approximate w to first order as ax + by + c for some constants a,b,c. But for derivatives the constant c doesn’t matter, so we can assume it’s zero.

Then just compute!
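Here is a quick sanity check of that computation, as a minimal sketch in Python. It assumes u = x, v = y and w = ax + by with a and b both nonzero; the function name `cyclic_product` is just something made up for illustration.

```python
from fractions import Fraction

def cyclic_product(a, b):
    """Product (du/dv)_w * (dv/dw)_u * (dw/du)_v for u = x, v = y, w = a*x + b*y."""
    a, b = Fraction(a), Fraction(b)
    du_dv_w = -b / a      # hold w fixed: x = (w - b*y)/a, so dx/dy = -b/a
    dv_dw_u = 1 / b       # hold x fixed: y = (w - a*x)/b, so dy/dw = 1/b
    dw_du_v = a           # hold y fixed: dw/dx = a
    return du_dv_w * dv_dw_u * dw_du_v

print(cyclic_product(2, 3))    # -1
print(cyclic_product(-5, 7))   # -1
```

Exact rational arithmetic is used so the answer comes out as exactly -1 rather than something close to it.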

There’s also a proof using differential forms that you might like better. You can see it here, along with an application to

But this still leaves us yearning for more intuition, and for me at least, a more symmetrical, conceptual proof. Over on Twitter, someone named Postdoc/cake provided some intuition using the same example from thermodynamics:

Using physics intuition to get the minus sign:

  1. increasing temperature at const volume = more pressure (gas pushes out more)
  2. increasing temperature at const pressure = increasing volume (ditto)
  3. increasing pressure at const temperature = decreasing volume (you push in more)
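To see these three signs combine into the minus sign, here is a small numerical sketch. It assumes an ideal gas, PV = nRT with nR set to 1, which is my choice for illustration and not something the argument above depends on. The helper names are hypothetical.

```python
def pressure(T, V):
    return T / V          # ideal gas with n*R = 1 (an assumption for illustration)

def volume(T, P):
    return T / P

def derivative(f, x, h=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

T0, V0 = 300.0, 2.0
P0 = pressure(T0, V0)

dP_dT_V = derivative(lambda T: pressure(T, V0), T0)   # item 1: positive
dV_dT_P = derivative(lambda T: volume(T, P0), T0)     # item 2: positive
dV_dP_T = derivative(lambda P: volume(T0, P), P0)     # item 3: negative

# Cyclic product (dP/dT)_V * (dT/dV)_P * (dV/dP)_T, using (dT/dV)_P = 1 / (dV/dT)_P
product = dP_dT_V * (1.0 / dV_dT_P) * dV_dP_T
print(dP_dT_V > 0, dV_dT_P > 0, dV_dP_T < 0, round(product, 6))
```

Two positive factors and one negative factor force the product to be negative, and for this gas it comes out numerically equal to -1.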

Jules Jacobs gave the symmetrical, conceptual proof that I was dreaming of:

As I’d hoped, the minus signs come from the anticommutativity of the wedge product of 1-forms, e.g.

du \wedge dv = - dv \wedge du

Since the space of 2-forms at a point in the plane is 1-dimensional, we can divide them. In fact a ratio like

\displaystyle{ \frac{du \wedge dw}{dv \wedge dw} }

is just the Jacobian of the transformation from (v,w) coordinates to (u,w) coordinates. We also need that these ratios obey the rule

\displaystyle{ \frac{\alpha}{\beta} \cdot \frac{\gamma}{\delta} = \frac{\alpha}{\delta} \cdot \frac{\gamma}{\beta}  }

where \alpha, \beta, \gamma, \delta are nonzero 2-forms at a point in the plane. This seems obvious, but you need to check it. It’s not hard. But to put it in fancy language, it follows from the fact that nonzero 2-forms at a point in the plane are a ‘torsor’! I explain that idea here:

Torsors made easy.

Torsors are widespread in physics, and the nonzero vectors in any 1-dimensional real vector space form a torsor for the multiplicative group of nonzero real numbers.
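The wedge product argument can be spot-checked numerically. This is only a sketch in the spirit of the proof above, not a transcription of it: I represent a 1-form at a point in the plane by its (dx, dy) components, a 2-form by its coefficient of dx ∧ dy, and I pick arbitrary numbers for du, dv and dw.

```python
from fractions import Fraction

def wedge(alpha, beta):
    """Coefficient of dx^dy in alpha ^ beta, for 1-forms given as (dx, dy) components."""
    return alpha[0] * beta[1] - alpha[1] * beta[0]

# Differentials of three functions at a point; the numbers are arbitrary illustrative values.
du = (Fraction(2), Fraction(1))
dv = (Fraction(-1), Fraction(3))
dw = (Fraction(4), Fraction(5))

assert wedge(du, dv) == -wedge(dv, du)        # anticommutativity

du_dv_w = wedge(du, dw) / wedge(dv, dw)       # (du/dv) at constant w
dv_dw_u = wedge(dv, du) / wedge(dw, du)       # (dv/dw) at constant u
dw_du_v = wedge(dw, dv) / wedge(du, dv)       # (dw/du) at constant v

cyclic = du_dv_w * dv_dw_u * dw_du_v
print(cyclic)                                 # -1
```

Each numerator is minus one of the denominators, so after rearranging the ratios the product is (-1)^3 = -1.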

Jules Jacobs’ proof is more sophisticated than my simple argument, but it’s very pretty, and it generalizes to higher dimensions in ways that’d be hard to guess otherwise.

Information Geometry (Part 21)

17 August, 2021

Last time I ended with a formula for the ‘Gibbs distribution’: the probability distribution that maximizes entropy subject to constraints on the expected values of some observables.

This formula is well-known, but I’d like to derive it here. My argument won’t be up to the highest standards of rigor: I’ll do a bunch of computations, and it would take more work to state conditions under which these computations are justified. But even a nonrigorous approach is worthwhile, since the computations will give us more than the mere formula for the Gibbs distribution.

I’ll start by reminding you of what I claimed last time. I’ll state it in a way that removes all unnecessary distractions, so go back to Part 20 if you want more explanation.

The Gibbs distribution

Take a measure space \Omega with measure \mu. Suppose there is a probability distribution \pi on \Omega that maximizes the entropy

\displaystyle{ -\int_\Omega \pi(x) \ln \pi(x) \, d\mu(x) }

subject to the requirement that some integrable functions A^1, \dots, A^n on \Omega have expected values equal to some chosen list of numbers q^1, \dots, q^n.

(Unlike last time, now I’m writing A^i and q^i with superscripts rather than subscripts, because I’ll be using the Einstein summation convention: I’ll sum over any repeated index that appears once as a superscript and once as a subscript.)

Furthermore, suppose \pi depends smoothly on q \in \mathbb{R}^n. I’ll call it \pi_q to indicate its dependence on q. Then, I claim \pi_q is the so-called Gibbs distribution

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int_\Omega e^{-p_i A^i(x)} \, d\mu(x)}   }

where

\displaystyle{ p_i = \frac{\partial f(q)}{\partial q^i} }

and

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

is the entropy of \pi_q.

Let’s show this is true!

Finding the Gibbs distribution

So, we are trying to find a probability distribution \pi that maximizes entropy subject to these constraints:

\displaystyle{ \int_\Omega \pi(x) A^i(x) \, d\mu(x) = q^i }

We can solve this problem using Lagrange multipliers. We need one Lagrange multiplier, say \beta_i, for each of the above constraints. But it’s easiest if we start by letting \pi range over all of L^1(\Omega), that is, the space of all integrable functions on \Omega. Then, because we want \pi to be a probability distribution, we need to impose one extra constraint

\displaystyle{ \int_\Omega \pi(x) \, d\mu(x) = 1 }

To do this we need an extra Lagrange multiplier, say \gamma.

So, that’s what we’ll do! We’ll look for critical points of this function on L^1(\Omega):

\displaystyle{ - \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi\,  d\mu  }

Here I’m using some tricks to keep things short. First, I’m dropping the dummy variable x which appeared in all of the integrals we had: I’m leaving it implicit. Second, all my integrals are over \Omega so I won’t say that. And third, I’m using the Einstein summation convention, so there’s a sum over i implicit here.

Okay, now let’s do the variational derivative required to find a critical point of this function. When I was a math major taking physics classes, the way physicists did variational derivatives seemed like black magic to me. Then I spent months reading how mathematicians rigorously justified these techniques. I don’t feel like a massive digression into this right now, so I’ll just do the calculations—and if they seem like black magic, I’m sorry!

We need to find \pi obeying

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(- \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi \, d\mu \right) = 0 }

or in other words

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(\int \pi \ln \pi \, d\mu + \beta_i  \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 0 }

First we need to simplify this expression. The only part that takes any work, if you know how to do variational derivatives, is the first term. Since the derivative of z \ln z is 1 + \ln z, we have

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

The second and third terms are easy, so we get

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left( \int \pi \ln \pi \, d\mu + \beta_i \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma }

Thus, we need to solve this equation:

\displaystyle{ 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma  = 0}

That’s easy to do:

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

Good! It’s starting to look like the Gibbs distribution!

We now need to choose the Lagrange multipliers \beta_i and \gamma to make the constraints hold. To satisfy this constraint

\displaystyle{ \int \pi \, d\mu = 1 }

we must choose \gamma so that

\displaystyle{ \int  e^{-1 - \gamma - \beta_i A^i } \, d\mu = 1 }

or in other words

\displaystyle{ e^{1 + \gamma} = \int e^{- \beta_i A^i} \, d\mu }

Plugging this into our earlier formula

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

we get this:

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Great! Even more like the Gibbs distribution!
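To make this formula concrete, here is a minimal sketch for the simplest case I can think of: a finite set \Omega with counting measure, so the integrals become sums. The function name `gibbs` and the sample observable are made up for illustration.

```python
import math

def gibbs(beta, A):
    """Gibbs distribution on a finite set with counting measure.

    beta: list of n multipliers; A: list of n observables, each a list of values on the set.
    """
    size = len(A[0])
    weights = [math.exp(-sum(b * a[x] for b, a in zip(beta, A))) for x in range(size)]
    Z = sum(weights)                      # normalizing integral in the denominator
    return [w / Z for w in weights]

A = [[0.0, 1.0, 2.0, 3.0]]                # one observable on a 4-point set (illustrative)
pi = gibbs([0.7], A)
print(sum(pi))                                # total probability, 1 up to rounding
print(sum(p * a for p, a in zip(pi, A[0])))   # expected value of the observable
```

Setting every multiplier to zero gives the uniform distribution, which is the entropy maximizer when no constraint pulls on it.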

By the way, you must have noticed the “1” that showed up here:

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

It buzzed around like an annoying fly in the otherwise beautiful calculation, but eventually went away. This is the same irksome “1” that showed up in Part 19. Someday I’d like to say a bit more about it.

Now, where were we? We were trying to show that

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int e^{-p_i A^i} \, d\mu}   }

maximizes entropy subject to our constraints. So far we’ve shown

\displaystyle{ \pi(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

is a critical point. It’s clear that

\pi(x) \ge 0

so \pi really is a probability distribution. We should show it actually maximizes entropy subject to our constraints, but I will skip that. Given that, \pi will be our claimed Gibbs distribution \pi_q if we can show

p_i = \beta_i

This is interesting! It’s saying our Lagrange multipliers \beta_i actually equal the so-called conjugate variables p_i given by

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

where f(q) is the entropy of \pi_q:

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

There are two ways to show this: the easy way and the hard way. The easy way is to reflect on the meaning of Lagrange multipliers, and I’ll sketch that way first. The hard way is to use brute force: just compute p_i and show it equals \beta_i. This is a good test of our computational muscle—but more importantly, it will help us discover some interesting facts about the Gibbs distribution.
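I skipped the proof that the critical point really maximizes entropy, but a numerical spot check is easy. Here is a minimal sketch, assuming a 4-point set with counting measure, a single made-up observable A, and an arbitrary multiplier. It nudges the distribution in directions that preserve both constraints and records how much the entropy falls.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p)

# Assumed toy setup: 4-point set, counting measure, one observable A, multiplier beta.
A = [0.0, 1.0, 2.0, 3.0]
beta = 0.7
weights = [math.exp(-beta * a) for a in A]
Z = sum(weights)
pi = [w / Z for w in weights]

# Directions that preserve both constraints: components sum to 0,
# and the A-weighted components also sum to 0.
directions = [(1, -2, 1, 0), (0, 1, -2, 1)]

drops = []
for v in directions:
    for eps in (1e-3, -1e-3):
        perturbed = [p + eps * d for p, d in zip(pi, v)]
        drops.append(entropy(pi) - entropy(perturbed))

print(drops)   # each entry is how much the entropy fell
```

This is evidence, not a proof: it only tests a few small perturbations of one example.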

The easy way

Consider a simple Lagrange multiplier problem where you’re trying to find a critical point of a smooth function

f \colon \mathbb{R}^2 \to \mathbb{R}

subject to the constraint

g = c

for some smooth function

g \colon \mathbb{R}^2 \to \mathbb{R}

and constant c. (The function f here has nothing to do with the f in the previous sections.) To answer this we introduce a Lagrange multiplier \lambda and seek points where

\nabla ( f - \lambda g) = 0

This works because the above equation says

\nabla f = \lambda \nabla g

Geometrically this means we’re at a point where the gradient of f points at right angles to the level surface of g:

Thus, to first order we can’t change f by moving along the level surface of g.

But also, if we start at a point where

\nabla f = \lambda \nabla g

and we begin moving in any direction, the function f will change at a rate equal to \lambda times the rate of change of g. That’s just what the equation says! And this fact gives a conceptual meaning to the Lagrange multiplier \lambda.
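Here is a tiny worked example of that meaning, as a sketch. The choice of f(x, y) = -(x^2 + y^2) and g(x, y) = x + y is mine, picked because the constrained maximum can be solved by hand: it sits at x = y = c/2.

```python
def constrained_max(c):
    """Maximize f(x, y) = -(x**2 + y**2) subject to x + y = c (solved by hand)."""
    x = y = c / 2.0
    f_value = -(x**2 + y**2)
    lam = -2.0 * x            # from grad f = lam * grad g with grad g = (1, 1)
    return f_value, lam

c, h = 1.5, 1e-5
f_plus, _ = constrained_max(c + h)
f_minus, _ = constrained_max(c - h)
_, lam = constrained_max(c)

rate = (f_plus - f_minus) / (2 * h)   # d(optimal f)/dc by central difference
print(rate, lam)
```

The rate at which the optimal value of f changes as we vary the constant c matches the Lagrange multiplier at the solution.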

Our situation is more complicated, since our functions are defined on the infinite-dimensional space L^1(\Omega), and we have an n-tuple of constraints with an n-tuple of Lagrange multipliers. But the same principle holds.

So, when we are at a solution \pi_q of our constrained entropy-maximization problem, and we start moving the point \pi_q by changing the value of the ith constraint, namely q^i, the rate at which the entropy changes will be \beta_i times the rate of change of q^i. So, we have

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

But this is just what we needed to show!

The hard way

Here’s another way to show

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

We start by solving our constrained entropy-maximization problem using Lagrange multipliers. As already shown, we get

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Then we’ll compute the entropy

f(q) = - \int \pi_q \ln \pi_q \, d\mu

Then we’ll differentiate this with respect to q^i and show we get \beta_i.

Let’s try it! The calculation is a bit heavy, so let’s write Z(q) for the so-called partition function

\displaystyle{ Z(q) = \int e^{- \beta_i A^i} \, d\mu }

so that

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{Z(q)} }

and the entropy is

\begin{array}{ccl}  f(q) &=& - \displaystyle{ \int  \pi_q  \ln \left( \frac{e^{- \beta_k A^k}}{Z(q)} \right)  \, d\mu }  \\ \\  &=& \displaystyle{ \int \pi_q \left(\beta_k A^k + \ln Z(q) \right)  \, d\mu } \\ \\  \end{array}

This is the sum of two terms. The first term

\displaystyle{ \int \pi_q \beta_k A^k   \, d\mu =  \beta_k \int \pi_q A^k   \, d\mu}

is \beta_k times the expected value of A^k with respect to the probability distribution \pi_q, all summed over k. But the expected value of A^k is q^k, so we get

\displaystyle{ \int \pi_q \beta_k A^k \, d\mu = \beta_k q^k }

The second term is easier:

\displaystyle{ \int_\Omega  \pi_q \ln Z(q) \, d\mu = \ln Z(q) }

since \pi_q(x) integrates to 1 and the partition function Z(q) doesn’t depend on x \in \Omega.

Putting together these two terms we get an interesting formula for the entropy:

f(q) = \beta_k q^k + \ln Z(q)

This formula is one reason this brute-force approach is actually worthwhile! I’ll say more about it later.
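The entropy formula is easy to test numerically. The sketch below assumes a 4-point set with counting measure and two made-up observables, then compares the entropy computed directly with \beta_k q^k + \ln Z.

```python
import math

A = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]   # two illustrative observables
beta = [0.4, -0.9]                                  # arbitrary multipliers

weights = [math.exp(-sum(b * a[x] for b, a in zip(beta, A))) for x in range(4)]
Z = sum(weights)                                    # partition function
pi = [w / Z for w in weights]

q = [sum(p * a[x] for x, p in enumerate(pi)) for a in A]    # expected values q^k
entropy = -sum(p * math.log(p) for p in pi)
formula = sum(b * qk for b, qk in zip(beta, q)) + math.log(Z)
print(entropy, formula)
```

The two numbers agree up to rounding, whatever multipliers are chosen.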

But for now, let’s use this formula to show what we’re trying to show, namely

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

For starters,

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=& \displaystyle{\frac{\partial}{\partial q^i} \left(\beta_k q^k + \ln Z(q) \right) } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k   \frac{\partial q^k}{\partial q^i} + \frac{\partial}{\partial q^i} \ln Z(q)   }  \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k  \delta^k_i + \frac{\partial}{\partial q^i} \ln Z(q)   } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  }  \end{array}

where we played a little Kronecker delta game with the second term.

Now we just need to compute the third term:

\begin{array}{ccl}  \displaystyle{ \frac{\partial}{\partial q^i} \ln Z(q) } &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i} Z(q) } \\  \\  &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i}  \int e^{- \beta_j A^j}  \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i} \left(e^{- \beta_j A^j}\right) \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i}\left( - \beta_k A^k \right)  e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ -\frac{1}{Z(q)} \int \frac{\partial \beta_k}{\partial q^i}  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} \frac{1}{Z(q)} \int  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} q^k }  \end{array}

Ah, you don’t know how good it feels, after years of category theory, to be doing calculations like this again!

Now we can finish the job we started:

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  } \\ \\  &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i - \frac{\partial \beta_k}{\partial q^i} q^k  } \\ \\  &=& \beta_i  \end{array}
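The same result can be checked by finite differences. This is a sketch under simplifying assumptions: a 4-point set with counting measure and a single made-up observable, so that q is a monotone function of \beta and the chain rule gives df/dq as a ratio of two derivatives with respect to \beta.

```python
import math

A = [0.0, 1.0, 2.0, 3.0]   # one illustrative observable on a 4-point set

def q_and_entropy(beta):
    weights = [math.exp(-beta * a) for a in A]
    Z = sum(weights)
    pi = [w / Z for w in weights]
    q = sum(p * a for p, a in zip(pi, A))
    f = -sum(p * math.log(p) for p in pi)
    return q, f

beta, h = 0.7, 1e-5
q_plus, f_plus = q_and_entropy(beta + h)
q_minus, f_minus = q_and_entropy(beta - h)

# With one observable, df/dq = (df/dbeta) / (dq/dbeta); the step 2h cancels in the ratio.
df_dq = (f_plus - f_minus) / (q_plus - q_minus)
print(df_dq, beta)
```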



We’ve learned the formula for the probability distribution that maximizes entropy subject to some constraints on the expected values of observables. But more importantly, we’ve seen that the anonymous Lagrange multipliers \beta_i that show up in this problem are actually the partial derivatives of entropy! They equal

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

Thus, they are rich in meaning. From what we’ve seen earlier, they are ‘surprisals’. They are analogous to momentum in classical mechanics and have the meaning of intensive variables in thermodynamics:

      Classical Mechanics    Thermodynamics         Probability Theory
 q    position               extensive variables    probabilities
 p    momentum               intensive variables    surprisals
 S    action                 entropy                Shannon entropy

Furthermore, by showing \beta_i = p_i the hard way we discovered an interesting fact. There’s a relation between the entropy and the logarithm of the partition function:

f(q) = p_i q^i + \ln Z(q)

(We proved this formula with \beta_i replacing p_i, but now we know those are equal.)

This formula suggests that the logarithm of the partition function is important—and it is! It’s closely related to the concept of free energy—even though ‘energy’, free or otherwise, doesn’t show up at the level of generality we’re working at now.

This formula should also remind you of the tautological 1-form on the cotangent bundle T^\ast Q, namely

\theta = p_i dq^i

It should remind you even more of the contact 1-form on the contact manifold T^\ast Q \times \mathbb{R}, namely

\alpha = -dS + p_i dq^i

Here S is a coordinate on the contact manifold that’s a kind of abstract stand-in for our entropy function f.

So, it’s clear there’s a lot more to say: we’re seeing hints of things here and there, but not yet the full picture.

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 20)

14 August, 2021

Last time we worked out an analogy between classical mechanics, thermodynamics and probability theory. The latter two look suspiciously similar:

      Classical Mechanics    Thermodynamics         Probability Theory
 q    position               extensive variables    probabilities
 p    momentum               intensive variables    surprisals
 S    action                 entropy                Shannon entropy

This is no coincidence. After all, in the subject of statistical mechanics we explain classical thermodynamics using probability theory—and entropy is revealed to be Shannon entropy (or its quantum analogue).

Now I want to make this precise.

To connect classical thermodynamics to probability theory, I’ll start by discussing ‘statistical manifolds’. I introduced the idea of a statistical manifold in Part 7: it’s a manifold Q equipped with a map sending each point q \in Q to a probability distribution \pi_q on some measure space \Omega. Now I’ll say how these fit into the second column of the above chart.

Then I’ll talk about statistical manifolds of a special sort used in thermodynamics, which I’ll call ‘Gibbsian’, since they really go back to Josiah Willard Gibbs.

In a Gibbsian statistical manifold, for each q \in Q the probability distribution \pi_q is a ‘Gibbs distribution’. Physically, these Gibbs distributions describe thermodynamic equilibria. For example, if you specify the volume, energy and number of particles in a box of gas, there will be a Gibbs distribution describing what the particles do in thermodynamic equilibrium under these conditions. Mathematically, Gibbs distributions maximize entropy subject to some constraints specified by the point q \in Q.

More precisely: in a Gibbsian statistical manifold we have a list of observables A_1, \dots , A_n whose expected values serve as coordinates q_1, \dots, q_n for points q \in Q, and \pi_q is the probability distribution that maximizes entropy subject to the constraint that the expected value of A_i is q_i. We can derive most of the interesting formulas of thermodynamics starting from this!

Statistical manifolds

Let’s fix a measure space \Omega with measure \mu. A statistical manifold is then a manifold Q equipped with a smooth map \pi assigning to each point q \in Q a probability distribution on \Omega, which I’ll call \pi_q. So, \pi_q is a function on \Omega with

\displaystyle{ \int_\Omega \pi_q \, d\mu = 1 }

and

\pi_q(x) \ge 0

for all x \in \Omega.

The idea here is that the space of all probability distributions on \Omega may be too huge to understand in as much detail as we’d like, so instead we describe some of these probability distributions—a family parametrized by points of some manifold Q—using the map \pi. This is the basic idea behind parametric statistics.

Information geometry is the geometry of statistical manifolds. Any statistical manifold comes with a bunch of interesting geometrical structures. One is the ‘Fisher information metric’, a Riemannian metric I explained in Part 7. Another is a 1-parameter family of connections on the tangent bundle T Q, which is important in Amari’s approach to information geometry. You can read about this here:

• Hiroshi Matsuzoe, Statistical manifolds and affine differential geometry, in Advanced Studies in Pure Mathematics 57, pp. 303–321.

I don’t want to talk about it now—I just wanted to reassure you that I’m not completely ignorant of it!

I want to focus on the story I’ve been telling, which is about entropy. Our statistical manifold Q comes with a smooth entropy function

f \colon Q \to \mathbb{R}

given by

\displaystyle{  f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x)    }

We can use this entropy function to do many of the things we usually do in thermodynamics! For example, at any point q \in Q where this function is differentiable, its differential gives a cotangent vector

p = (df)_q

which has an important physical meaning. In coordinates we have

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and we call p_i the intensive variable conjugate to q_i. For example if q_i is energy, p_i will be ‘coolness’: the reciprocal of temperature.

Defining p this way gives a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q : \; q \in Q, \; p = (df)_q \}

of the cotangent bundle T^\ast Q. We can also get contact geometry into the game by defining a contact manifold T^\ast Q \times \mathbb{R} and a Legendrian submanifold

\Sigma = \{ (q,p,S) \in T^\ast Q \times \mathbb{R} : \; q \in Q, \; p = (df)_q , \; S = f(q) \}

But I’ve been talking about these ideas for the last three episodes, so I won’t say more just now! Instead, I want to throw a new idea into the pot.

Gibbsian statistical manifolds

Thermodynamics and statistical mechanics spend a lot of time dealing with statistical manifolds of a special sort I’ll call ‘Gibbsian’. In these, each probability distribution \pi_q is a ‘Gibbs distribution’, meaning that it maximizes entropy subject to certain constraints specified by the point q \in Q.

How does this work? For starters, an integrable function

A \colon \Omega \to \mathbb{R}

is called a random variable, or in physics an observable. The expected value of an observable is a smooth real-valued function on our statistical manifold

\langle A \rangle \colon Q \to \mathbb{R}

given by

\displaystyle{ \langle A \rangle(q) = \int_\Omega A(x) \pi_q(x) \, d\mu(x) }

In other words, \langle A \rangle is a function whose value at any point q \in Q is the expected value of A with respect to the probability distribution \pi_q.

Now, suppose our statistical manifold is n-dimensional and we have n observables A_1, \dots, A_n. Their expected values will be smooth functions on our manifold—and sometimes these functions will be a coordinate system!

This may sound rather unlikely, but it’s really not so outlandish. Indeed, if there’s a point q such that the differentials of the functions \langle A_i \rangle are linearly independent at this point, these functions will be a coordinate system in some neighborhood of this point, by the inverse function theorem. So, we can take this neighborhood, use it as our statistical manifold, and the functions \langle A_i \rangle will be coordinates.

So, let’s assume the expected values of our observables give a coordinate system on Q. Let’s call these coordinates q_1, \dots, q_n, so that

\langle A_i \rangle(q) = q_i

Now for the kicker: we say our statistical manifold is Gibbsian if for each point q \in Q, \pi_q is the probability distribution that maximizes entropy subject to the above condition!

Which condition? The condition saying that

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

for all i. This is just the previous equation spelled out so that you can see it’s a condition on \pi_q.

This assumption of the entropy-maximizing nature of \pi_q is very powerful, because it implies a useful and nontrivial formula for \pi_q. It’s called the Gibbs distribution:

\displaystyle{  \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }

for all x \in \Omega.

Here p_i is the intensive variable conjugate to q_i, while Z(q) is the partition function: the thing we must divide by to make sure \pi_q integrates to 1. In other words:

\displaystyle{ Z(q) = \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

By the way, this formula may look confusing at first, since the left side depends on the point q in our statistical manifold, while there’s no q visible in the right side! Do you see what’s going on?

I’ll tell you: the conjugate variable p_i, sitting on the right side of the above formula, depends on q. Remember, we got it by taking the partial derivative of the entropy in the q_i direction

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and then evaluating this derivative at the point q.

But wait a minute! f here is the entropy—but the entropy of what?

The entropy of \pi_q, of course!

So there’s something circular about our formula for \pi_q. To know \pi_q, you need to know the conjugate variables p_i, but to compute these you need to know the entropy of \pi_q.

This is actually okay. While circular, the formula for \pi_q is still true. It’s harder to work with than you might hope. But it’s still extremely useful.
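One practical way around the circularity is to treat the constraint as an equation to solve for the conjugate variable. Here is a minimal sketch, assuming a 4-point set with counting measure and a single made-up observable. Bisection is my choice for illustration; nothing above prescribes a method, and the function names are hypothetical.

```python
import math

A = [0.0, 1.0, 2.0, 3.0]          # one illustrative observable on a 4-point set

def expected_value(p):
    weights = [math.exp(-p * a) for a in A]
    return sum(w * a for w, a in zip(weights, A)) / sum(weights)

def solve_p(q, lo=-50.0, hi=50.0, tol=1e-12):
    """Find p with expected_value(p) = q by bisection; expected_value is decreasing in p."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_value(mid) > q:
            lo = mid          # expectation too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2.0

p = solve_p(1.0)
print(p, expected_value(p))
```

Given a target q strictly between the smallest and largest values of the observable, this finds p without ever computing the entropy, and the Gibbs distribution then follows from p.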

Next time I’ll prove that this formula for \pi_q is true, and do a few things with it. All this material was discovered by Gibbs in the late 1800s, and it’s lurking in any good book on statistical mechanics, though not phrased in the language of statistical manifolds. The physics textbooks usually consider special cases, like a box of gas where:

q_1 is energy, p_1 is 1/temperature.

q_2 is volume, p_2 is pressure/temperature.

q_3 is the number of particles, p_3 is –chemical potential/temperature.

While these special cases are important and interesting, I’d rather be general!

Technical comments

I said “Any statistical manifold comes with a bunch of interesting geometrical structures”, but in fact some conditions are required. For example, the Fisher information metric is only well-defined and nondegenerate under some conditions on the map \pi. In the extreme case where \pi maps every point of Q to the same probability distribution, the Fisher information metric will vanish.

Similarly, the entropy function f is only smooth under some conditions on \pi.

Furthermore, the integral

\displaystyle{ \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

may not converge for all values of the numbers p_1, \dots, p_n. But in my discussion of Gibbsian statistical manifolds, I was assuming that an entropy-maximizing probability distribution \pi_q with

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

actually exists. In this case the probability distribution is also unique (almost everywhere).

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 19)

8 August, 2021

Last time I figured out the analogue of momentum in probability theory, but I didn’t say what it’s called. Now I will tell you—thanks to some help from Abel Jansma and Toby Bartels.


It’s called ‘surprisal’. This is a well-known concept in information theory. It’s also called ‘information content’.

Let’s see why. First, let’s remember the setup. We have a manifold

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

whose points q are nowhere vanishing probability distributions on the set \{1, \dots, n\}. We have a function

f \colon Q \to \mathbb{R}

called the Shannon entropy, defined by

\displaystyle{ f(q) = - \sum_{j = 1}^n q_j \ln q_j }

For each point q \in Q we define a cotangent vector p \in T^\ast_q Q by

p = (df)_q

As mentioned last time, this is the analogue of momentum in probability theory. In the second half of this post I’ll say more about exactly why. But first let’s compute it and see what it actually equals!

Let’s start with a naive calculation, acting as if the probabilities q_1, \dots, q_n were a coordinate system on the manifold Q. We get

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

so using the definition of the Shannon entropy we have

\begin{array}{ccl}  p_i &=& \displaystyle{ -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j  }\\ \\  &=& \displaystyle{ -\frac{\partial}{\partial q_i} \left( q_i \ln q_i \right) } \\ \\  &=& -\ln(q_i) - 1  \end{array}

Now, the quantity -\ln q_i is called the surprisal of the probability distribution at i. Intuitively, it’s a measure of how surprised you should be if an event of probability q_i occurs. For example, if you flip a fair coin and it lands heads up, your surprisal is \ln 2. If you flip 100 fair coins and they all land heads up, your surprisal is 100 times \ln 2.
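The coin example takes two lines to check. This is a minimal sketch; the function name `surprisal` is just for illustration.

```python
import math

def surprisal(probability):
    """Surprisal -ln(q) of an event with probability q, in nats."""
    return -math.log(probability)

one_head = surprisal(0.5)                # one fair coin landing heads
hundred_heads = surprisal(0.5 ** 100)    # 100 independent fair coins all landing heads
print(one_head, math.log(2))
print(hundred_heads, 100 * math.log(2))
```

Probabilities of independent events multiply, so their surprisals add.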

Of course ‘surprise’ is a psychological term, not a term from math or physics, so we shouldn’t take it too seriously here. We can derive the concept of surprisal from three axioms:

  1. The surprisal of an event of probability q is some function of q, say F(q).
  2. The less probable an event is, the larger its surprisal is: q_1 \le q_2 \implies  F(q_1) \ge F(q_2).
  3. The surprisal of two independent events is the sum of their surprisals: F(q_1 q_2) = F(q_1) + F(q_2).

It follows from work on Cauchy’s functional equation that F must be of this form:

F(q) = - \log_b q

for some constant b > 1. We shall choose b, the base of our logarithms, to be e. We had a similar freedom of choice in defining the Shannon entropy, and we will use base e for both to be consistent. If we chose something else, it would change the surprisal and the Shannon entropy by the same constant factor.

So far, so good. But what about the irksome “-1” in our formula?

p_i = -\ln(q_i) - 1

Luckily it turns out we can just get rid of this! The reason is that the probabilities q_i are not really coordinates on the manifold Q. They’re not independent: they must sum to 1. So, when we change them a little, the sum of their changes must vanish. Putting it more technically, the tangent space T_q Q is not all of \mathbb{R}^n, but just the subspace consisting of vectors whose components sum to zero:

\displaystyle{ T_q Q = \{ v \in \mathbb{R}^n : \; \sum_{j = 1}^n v_j = 0 \} }

The cotangent space is the dual of the tangent space. The dual of a subspace

S \subseteq V

is the quotient space

V^\ast/\{ \ell \colon V \to \mathbb{R} : \; \forall v \in S \; \, \ell(v) = 0 \}

The cotangent space T_q^\ast Q thus consists of linear functionals \ell \colon \mathbb{R}^n \to \mathbb{R} modulo those that vanish on vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

Of course, we can identify the dual of \mathbb{R}^n with \mathbb{R}^n in the usual way, using the Euclidean inner product: a vector u \in \mathbb{R}^n corresponds to the linear functional

\displaystyle{ \ell(v) = \sum_{j = 1}^n u_j v_j }

From this, you can see that a linear functional \ell vanishes on all vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

if and only if its corresponding vector u has

u_1 = \cdots = u_n

So, we get

T^\ast_q Q \cong \mathbb{R}^n/\{ u \in \mathbb{R}^n : \; u_1 = \cdots = u_n \}

In words: we can describe cotangent vectors to Q as lists of n numbers if we want, but we have to remember that adding the same constant to each number in the list doesn’t change the cotangent vector!

This suggests that our naive formula

p_i = -\ln(q_i) - 1

is on the right track, but we’re free to get rid of the constant 1 if we want! And that’s true.

To check this rigorously, we need to show

\displaystyle{ p(v) = -\sum_{j=1}^n \ln(q_j) v_j}

for all v \in T_q Q. We compute:

\begin{array}{ccl}  p(v) &=& df(v) \\ \\  &=& v(f) \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j \, \frac{\partial f}{\partial q_j} } \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j (-\ln(q_j) - 1) } \\ \\  &=& \displaystyle{ -\sum_{j=1}^n \ln(q_j) v_j }  \end{array}

where in the second to last step we used our earlier calculation:

\displaystyle{ \frac{\partial f}{\partial q_i} = -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j = -\ln(q_i) - 1 }

and in the last step we used

\displaystyle{ \sum_j v_j = 0 }
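Here is a numerical version of this check, as a sketch. The probability distribution q and the tangent vector v are arbitrary choices of mine; the only requirement is that the components of v sum to zero.

```python
import math

def shannon_entropy(q):
    return -sum(x * math.log(x) for x in q)

q = [0.5, 0.3, 0.2]
v = [0.2, -0.5, 0.3]              # tangent vector: components sum to zero
h = 1e-6

# Directional derivative of the entropy along v, by central difference.
plus = [x + h * d for x, d in zip(q, v)]
minus = [x - h * d for x, d in zip(q, v)]
directional = (shannon_entropy(plus) - shannon_entropy(minus)) / (2 * h)

naive = sum(d * (-math.log(x) - 1) for x, d in zip(q, v))    # with the "-1"
clean = -sum(math.log(x) * d for x, d in zip(q, v))          # without it
print(directional, naive, clean)
```

All three numbers agree up to rounding: the constant drops out because it gets multiplied by the sum of the components of v, which is zero.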

Back to the big picture

Now let’s take stock of where we are. We can fill in the question marks in the charts from last time, and combine those charts while we’re at it.

     Classical Mechanics   Thermodynamics        Probability Theory
q    position              extensive variables   probabilities
p    momentum              intensive variables   surprisals
S    action                entropy               Shannon entropy

What’s going on here? In classical mechanics, action is minimized (or at least the system finds a critical point of the action). In thermodynamics, entropy is maximized. In the maximum entropy approach to probability, Shannon entropy is maximized. This leads to a mathematical analogy that’s quite precise. For classical mechanics and thermodynamics, I explained it here:

Classical mechanics versus thermodynamics (part 1).

Classical mechanics versus thermodynamics (part 2).

These posts may give a more approachable introduction to what I’m doing now, which is bringing probability theory into the analogy, with a big emphasis on symplectic and contact geometry.

Let me spell out a bit of the analogy more carefully:

Classical Mechanics. In classical mechanics, we have a manifold Q whose points are positions of a particle. There’s an important function on this manifold: Hamilton’s principal function

f \colon Q \to \mathbb{R}

What’s this? It’s basically action: f(q) is the action of the least-action path from the position q_0 at some earlier time t_0 to the position q at time 0. The Hamilton–Jacobi equations say the particle’s momentum p at time 0 is given by

p = (df)_q

Thermodynamics. In thermodynamics, we have a manifold Q whose points are equilibrium states of a system. The coordinates of a point q \in Q are called extensive variables. There’s an important function on this manifold: the entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the intensive variables corresponding to the extensive variables.

Probability Theory. In probability theory, we have a manifold Q whose points are nowhere vanishing probability distributions on a finite set. The coordinates of a point q \in Q are probabilities. There’s an important function on this manifold: the Shannon entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the surprisals corresponding to the probabilities.

In all three cases, T^\ast Q is a symplectic manifold and imposing the constraint p = (df)_q picks out a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q: \; p = (df)_q \}

There is also a contact manifold T^\ast Q \times \mathbb{R} where the extra dimension comes with an extra coordinate S that means

• action in classical mechanics,
• entropy in thermodynamics, and
• Shannon entropy in probability theory.

We can then decree that S = f(q) along with p = (df)_q, and these constraints pick out a Legendrian submanifold

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

There’s a lot more to do with these ideas, and I’ll continue next time.

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 18)

5 August, 2021

Last time I sketched how two related forms of geometry, symplectic and contact geometry, show up in thermodynamics. Today I want to explain how they show up in probability theory.

For some reason I haven’t seen much discussion of this! But people should have looked into this. After all, statistical mechanics explains thermodynamics in terms of probability theory, so if some mathematical structure shows up in thermodynamics it should appear in statistical mechanics… and thus ultimately in probability theory.

I just figured out how this works for symplectic and contact geometry.

Suppose a system has n possible states. We’ll call these microstates, following the tradition in statistical mechanics. If you don’t know what ‘microstate’ means, don’t worry about it! But the rough idea is that if you have a macroscopic system like a rock, the precise details of what its atoms are doing are described by a microstate, and many different microstates could be indistinguishable unless you look very carefully.

We’ll call the microstates 1, 2, \dots, n. So, if you don’t want to think about physics, when I say microstate I’ll just mean an integer from 1 to n.

Next, a probability distribution q assigns a real number q_i to each microstate, and these numbers must sum to 1 and be nonnegative. So, we have q \in \mathbb{R}^n, though not every vector in \mathbb{R}^n is a probability distribution.

I’m sure you’re wondering why I’m using q rather than p to stand for a probability distribution. Am I just trying to confuse you?

No: I’m trying to set up an analogy to physics!

Last time I introduced symplectic geometry using classical mechanics. The most important example of a symplectic manifold is the cotangent bundle T^\ast Q of a manifold Q. A point of T^\ast Q is a pair (q,p) consisting of a point q \in Q and a cotangent vector p \in T^\ast_q Q. In classical mechanics the point q describes the position of some physical system, while p describes its momentum.

So, I’m going to set up an analogy like this:

     Classical Mechanics   Probability Theory
q    position              probability distribution
p    momentum              ???

But what is to momentum as probability is to position?

A big clue is the appearance of symplectic geometry in thermodynamics, which I also outlined last time. We can use this to get some intuition about the analogue of momentum in probability theory.

In thermodynamics, a system has a manifold Q of states. (These are not the ‘microstates’ I mentioned before: we’ll see the relation later.) There is a function

f \colon Q \to \mathbb{R}

describing the entropy of the system as a function of its state. There is a law of thermodynamics saying that

p = (df)_q

This equation picks out a submanifold of T^\ast Q, namely

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

Moreover this submanifold is Lagrangian: it has half the dimension of T^\ast Q, and the symplectic structure \omega vanishes when restricted to it:

\displaystyle{ \omega |_\Lambda = 0 }

This is very beautiful, but it goes by so fast you might almost miss it! So let’s clutter it up a bit with coordinates. We often use local coordinates on Q and describe a point q \in Q using these coordinates, getting a point

(q_1, \dots, q_n) \in \mathbb{R}^n

They give rise to local coordinates q_1, \dots, q_n, p_1, \dots, p_n on the cotangent bundle T^\ast Q. The q_i are called extensive variables, because they are typically things that you can measure only by totalling up something over the whole system, like the energy or volume of a cylinder of gas. The p_i are called intensive variables, because they are typically things that you can measure locally at any point, like temperature or pressure.

In these local coordinates, the symplectic structure on T^\ast Q is the 2-form given by

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

The equation

p = (df)_q

serves as a law of physics that determines the intensive variables given the extensive ones when our system is in thermodynamic equilibrium. Written out using coordinates, this law says

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

It looks pretty bland here, but in fact it gives formulas for the temperature and pressure of a gas, and many other useful formulas in thermodynamics.
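To make this less bland, here is a sketch using the standard textbook entropy of a monatomic ideal gas, which I'm assuming here rather than deriving. Taking energy U and volume V as the extensive variables, and working in arbitrary units with N times Boltzmann's constant set to 1, the intensive variables should come out as 1/T and P/T. The helper `partial` and the numbers are hypothetical, chosen only for illustration.

```python
import math

NK = 1.0  # N times Boltzmann's constant, in arbitrary units (an assumption)

def entropy(U, V):
    """Entropy of a monatomic ideal gas, up to an additive constant."""
    return NK * (math.log(V) + 1.5 * math.log(U))

def partial(f, args, i, h=1e-6):
    """Central-difference estimate of the i-th partial derivative of f."""
    up = list(args)
    down = list(args)
    up[i] += h
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)

U, V = 3.0, 2.0
p_U = partial(entropy, (U, V), 0)   # intensive variable conjugate to energy
p_V = partial(entropy, (U, V), 1)   # intensive variable conjugate to volume

T = 2 * U / (3 * NK)   # temperature, from U = (3/2) N k T
P = NK * T / V         # pressure, from the ideal gas law P V = N k T

print(p_U, 1 / T)      # both should be close to 0.5
print(p_V, P / T)      # both should be close to 0.5
```

So in this example the law p = (df)_q packages the familiar relations for temperature and pressure into a single equation.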

Now we are ready to see how all this plays out in probability theory! We’ll get an analogy like this, which goes hand-in-hand with our earlier one:

     Thermodynamics        Probability Theory
q    extensive variables   probability distribution
p    intensive variables   ???

This analogy is clearer than the last, because statistical mechanics reveals that the extensive variables in thermodynamics are really just summaries of probability distributions on microstates. Furthermore, both thermodynamics and probability theory have a concept of entropy.

So, let’s take our manifold Q to consist of probability distributions on the set of microstates I was talking about before: the set \{1, \dots, n\}. Actually, let’s use nowhere vanishing probability distributions:

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

I’m requiring q_i > 0 to ensure Q is a manifold, and also to make sure f is differentiable: it ceases to be differentiable when one of the probabilities q_i hits zero.

Since Q is a manifold, its cotangent bundle is a symplectic manifold T^\ast Q. And here’s the good news: we have a god-given entropy function

f \colon Q \to \mathbb{R}

namely the Shannon entropy

\displaystyle{ f(q) = - \sum_{i = 1}^n q_i \ln q_i }
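For concreteness, here is a minimal Python sketch of this function. The input check and the two sample distributions are my own additions for illustration.

```python
import math

def shannon_entropy(q):
    """Shannon entropy -sum_i q_i ln q_i of a nowhere vanishing distribution."""
    if any(qi <= 0 for qi in q) or abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("q must have positive entries summing to 1")
    return -sum(qi * math.log(qi) for qi in q)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(shannon_entropy(uniform))  # equals ln 4 for the uniform distribution
print(shannon_entropy(skewed))   # smaller than ln 4
```

The uniform distribution on n microstates has entropy ln n, the largest possible value, and anything less flat has lower entropy.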

So, everything I just described about thermodynamics works in the setting of plain old probability theory! Starting from our manifold Q and the entropy function, we get all the rest, leading up to the Lagrangian submanifold

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

that describes the relation between extensive and intensive variables.

For computations it helps to pick coordinates on Q. Since the probabilities q_1, \dots, q_n sum to 1, they aren’t independent coordinates on Q. So, we can either pick all but one of them as coordinates, or learn how to deal with non-independent coordinates, which are already completely standard in projective geometry. Let’s do the former, just to keep things simple.

These coordinates on Q give rise in the usual way to coordinates q_i and p_i on the cotangent bundle T^\ast Q. These play the role of extensive and intensive variables, respectively, and it should be very interesting to impose the equation

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

where f is the Shannon entropy. This picks out a Lagrangian submanifold \Lambda \subseteq T^\ast Q.
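Here is a sketch of what this equation gives if we use q_1, \dots, q_{n-1} as coordinates and set q_n = 1 - q_1 - \cdots - q_{n-1}. A short calculation of my own, not something stated above, suggests p_i = \ln(q_n/q_i), and the code below compares that guess against finite differences. The function names and sample point are hypothetical.

```python
import math

def entropy_in_coords(x):
    """Shannon entropy as a function of q_1..q_{n-1}, with q_n = 1 - sum."""
    q = list(x) + [1.0 - sum(x)]
    return -sum(qi * math.log(qi) for qi in q)

def partial(f, x, i, h=1e-6):
    """Central-difference estimate of the i-th partial derivative of f at x."""
    up = list(x)
    down = list(x)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)

x = [0.5, 0.3]           # q_1 and q_2; the remaining probability is q_3
q_n = 1.0 - sum(x)

numeric = [partial(entropy_in_coords, x, i) for i in range(len(x))]
guess = [math.log(q_n / xi) for xi in x]

print(numeric)
print(guess)             # should match the line above closely
```

Note that in these coordinates p_i is negative whenever q_i is larger than the leftover probability q_n: raising an already large probability at the expense of a smaller one lowers the entropy.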

So, the question becomes: what does this mean? If this formula gives the analogue of momentum for probability theory, what does this analogue of momentum mean?

Here’s a preliminary answer: p_i says how fast entropy increases as we increase the probability q_i that our system is in the ith microstate. So if we think of nature as ‘wanting’ to maximize entropy, the quantity p_i says how eager it is to increase the probability q_i.

Indeed, you can think of p_i as a bit like pressure—one of the most famous intensive quantities in thermodynamics. A gas ‘wants’ to expand, and its pressure says precisely how eager it is to expand. Similarly, a probability distribution ‘wants’ to flatten out, to maximize entropy, and p_i says how eager it is to increase the probability q_i in order to do this.

But what can we do with this concept? And what does symplectic geometry do for probability theory?

I will start tackling these questions next time.

One thing I’ll show is that when we reduce thermodynamics to probability theory using the ideas of statistical mechanics, the appearance of symplectic geometry in thermodynamics follows from its appearance in probability theory.

Another thing I want to investigate is how other geometrical structures on the space of probability distributions, like the Fisher information metric, interact with the symplectic structure on its cotangent bundle. This will integrate symplectic geometry and information geometry.

I also want to bring contact geometry into the picture. It’s already easy to see from our work last time how this should go. We treat the entropy S as an independent variable, and replace T^\ast Q with a larger manifold T^\ast Q \times \mathbb{R} having S as an extra coordinate. This is a contact manifold with contact form

\alpha = -dS + p_1 dq_1 + \cdots + p_n dq_n

This contact manifold has a submanifold \Sigma where we remember that entropy is a function of the probability distribution q, and define p in terms of q as usual:

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

And as we saw last time, \Sigma is a Legendrian submanifold, meaning

\displaystyle{ \alpha|_{\Sigma} = 0 }
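In case you missed it last time, here is a quick sketch of why this holds, written in local coordinates. On \Sigma we have S = f(q), so dS becomes df, and we have p_i = \partial f/\partial q_i, so the remaining terms also add up to df:

```latex
\begin{array}{ccl}
\alpha|_{\Sigma} &=& \displaystyle{ -d\big(f(q)\big) + \sum_{i} \frac{\partial f}{\partial q_i} \, dq_i } \\ \\
&=& -df + df \\ \\
&=& 0
\end{array}
```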

But again, we want to understand what these ideas from contact geometry really do for probability theory!

For all my old posts on information geometry, go here:

Information geometry.

Structured vs Decorated Cospans (Part 2)

29 July, 2021

Decorated cospans are a framework for studying open systems invented by Brendan Fong. Since I’m now visiting the institute he and David Spivak set up—the Topos Institute—it was a great time to give a talk explaining the history of decorated cospans, their problems, and how those problems have been solved:

Structured vs Decorated Cospans

Abstract. One goal of applied category theory is to understand open systems: that is, systems that can interact with the external world. We compare two approaches to describing open systems as morphisms: structured and decorated cospans. Each approach provides a symmetric monoidal double category. Structured cospans are easier, decorated cospans are more general, but under certain conditions the two approaches are equivalent. We take this opportunity to explain some tricky issues that have only recently been resolved.

It’s probably best to get the slides here and look at them while watching this video:

If you prefer a more elementary talk explaining what structured and decorated cospans are good for, try these slides.

For videos and slides of two related talks go here:

Structured cospans and double categories.

Structured cospans and Petri nets.

For more, read these:

• Brendan Fong, Decorated cospans.

• Evan Patterson and Micah Halter, Compositional epidemiological modeling using structured cospans.

• John Baez and Kenny Courser, Structured cospans.

• John Baez, Kenny Courser and Christina Vasilakopoulou, Structured versus decorated cospans.

• Kenny Courser, Open Systems: a Double Categorical Perspective.

• Michael Shulman, Framed bicategories and monoidal fibrations.

• Joe Moeller and Christina Vasilakopoulou, Monoidal Grothendieck construction.

To read more about the network theory project, go here:

Network Theory.

Complex Adaptive System Design (Part 10)

25 June, 2021

guest post by John Foley

Though the Complex Adaptive System Composition and Design Environment (CASCADE) program concluded in Fall 2020, just this week two new articles came out reviewing the work and future research directions:

• John Baez and John Foley, Operads for designing systems of systems, Notices of the American Mathematical Society 68 (2021), 1005–1007.

• John Foley, Spencer Breiner, Eswaran Subrahmanian and John Dusel, Operads for complex system design specification, analysis and synthesis, Proceedings of the Royal Society A 477 (2021), 20210099.

Operads for Designing Systems of Systems

The first is short and sweet (~2 pages!), aimed at a general mathematical audience. It describes the motivation for CASCADE and how basic modeling issues for point-to-point communications led to the development of network operads:

This figure depicts the prototypical example of this style of operad, the ‘simple network operad’, acting on an algebra of graphs whose nodes are endowed with locations and edges can be no longer than a fixed range limit. For more information, check out the article or Part 4 of this series.

For a quick, retrospective overview of CASCADE, this note is hard to beat, so I won’t repeat more here.

Operads for complex system design specification, analysis and synthesis

The second article is a full length review, aimed at a general applied science audience:

We introduce operads for design to a general scientific audience by explaining what the operads do relative to broadly applied techniques and how specific domain problems are modelled. Research directions are presented with an eye towards opening up interdisciplinary partnerships and continuing application-driven investigations to build on recent insights.

The review describes how operads apply to system design problems through three examples, one each for specification, analysis and synthesis, and concludes with a discussion of future research directions. The specification and synthesis examples come from applications of network operads in CASCADE, but the analysis example was contributed by collaborators Spencer Breiner and Eswaran Subrahmanian at the National Institute of Standards and Technology (NIST), who analyzed the Length Scale Interferometer (LSI) at NIST headquarters. Readers interested in a quick introduction to these examples should head directly to Section 3 of the review.

As we describe:

The present article captures an intermediate stage of technical maturity: operad-based design has shown its practicality by lowering barriers of entry for applied practitioners and demonstrating applied examples across many domains. However, it has not realized its full potential as an applied meta-language. Much of this recent progress is not focused solely on the analytic power of operads to separate concerns. Significant progress on explicit specification of domain models and techniques to automatically synthesize designs from basic building blocks has been made.

With this context, CASCADE’s contribution was prototyping general-purpose methods to specify basic building blocks and synthesize composite systems from atoms. By testing these methods against specific domain problems, we learned that domain-specific information should be exploited but systematically fitting together general-purpose and computationally efficient methods is challenging. Moreover, no reconciliation between the analytic point-of-view on operads for system design and the ‘generative’ perspective of network operads, which facilitate specification and synthesis, has been established. The review does not address how these threads might fit together precisely, but perhaps the answer looks something like this:

For more discussion of future research directions, please see Section 7 of the review, especially the open problems listed in 7f.

For readers that make it through the examples in Sections 4, 5 and 6 of the review but still want more, the following references provide additional details:

• John Baez, John Foley, Joe Moeller and Blake Pollard, Network models, Theory and Applications of Categories 35 (2020), 700–744.

• Spencer Breiner, Olivier Marie-Rose, Blake Pollard and Eswaran Subrahmanian, Modeling hierarchical system with operads, Electron. Proc. Theor. Comput. Sci. 323 (2020) 72–83.

• John Baez, John Foley and Joe Moeller, Network models from Petri nets with catalysts, Compositionality 1 (4) (2019).

Here’s the whole series of posts:

Part 1. CASCADE: the Complex Adaptive System Composition and Design Environment.

Part 2. Metron’s software for system design.

Part 3. Operads: the basic idea.

Part 4. Network operads: an easy example.

Part 5. Algebras of network operads: some easy examples.

Part 6. Network models.

Part 7. Step-by-step compositional design and tasking using commitment networks.

Part 8. Compositional tasking using category-valued network models.

Part 9. Network models from Petri nets with catalysts.

Part 10. Two papers reviewing the whole project.

Symmetric Monoidal Categories: a Rosetta Stone

28 May, 2021

The Topos Institute is in business! I’m really excited about visiting there this summer and working on applied category theory.

They recently had a meeting with some people concerned about AI risks, called Finding the Right Abstractions, organized by Scott Garrabrant, David Spivak, and Andrew Critch. I gave a gentle introduction to the uses of symmetric monoidal categories:

• Symmetric monoidal categories: a Rosetta Stone.

To describe systems composed of interacting parts, scientists and engineers draw diagrams of networks: flow charts, Petri nets, electrical circuit diagrams, signal-flow graphs, chemical reaction networks, Feynman diagrams and the like. All these different diagrams fit into a common framework: the mathematics of symmetric monoidal categories. While originally the morphisms in such categories were mainly used to describe processes, we can also use them to describe open systems.

You can see the slides here, and watch a video here:

For a lot more detail on these ideas, see:

• John Baez and Mike Stay, Physics, topology, logic and computation: a Rosetta Stone, in New Structures for Physics, ed. Bob Coecke, Lecture Notes in Physics vol. 813, Springer, Berlin, 2011, pp. 95–174.

Compositional Robotics (Part 2)

27 May, 2021

Very soon we’re having a workshop on applications of category theory to robotics:

2021 Workshop on Compositional Robotics: Mathematics and Tools, online, Monday 31 May 2021.

You’re invited! As of today it’s not too late to register and watch the talks online, and registration is free. Go here to register:

Here’s the schedule. All times are in UTC, so the show starts at 9:15 am Pacific Time:

Time (UTC)    Speaker                          Title
16:15-16:30                                    Intro and plan of the workshop
              Jonathan Lorand                  Category Theory Basics
              John Baez                        Category Theory and Systems
                                               Breakout rooms
              Andrea Censi & Gioele Zardini    Categories for Co-Design
              David Spivak                     Dynamic Interaction Patterns
                                               Breakout rooms
              Aaron Ames                       A Categorical Perspective on Robotics
21:30-22:15   Daniel Koditschek                Toward a Grounded Type Theory for Robot Task Composition
22:30-00:30   Selected speakers                Talks from open submissions

For more information go to the workshop website or my previous blog post on this workshop:

Compositional robotics (part 1).

Category Theory and Systems

27 May, 2021

I’m giving a talk on Monday the 31st of May, 2021 at 17:20 UTC, which happens to be 10:20 am Pacific Time for me. You can see my slides here:

Category theory and systems.

I’ll talk about how to describe open systems as morphisms in symmetric monoidal categories, and how to use ‘functorial semantics’ to describe the behavior of open systems.

It’s part of the 2021 Workshop on Compositional Robotics: Mathematics and Tools, and if you click the link you can see how to attend!  If you stick around for the rest of the workshop you’ll hear more concrete talks from people who really work on robotics.