A Networked World (Part 1)

27 March, 2015

guest post by David Spivak

The problem

The idea that’s haunted me, and motivated me, for the past seven years or so came to me while reading a book called The Moment of Complexity: Emerging Network Culture, by Mark C. Taylor. It was a fascinating book about how our world is becoming increasingly networked—wired up and connected—and that this is leading to a dramatic increase in complexity. I’m not sure if it was stated explicitly there, but I got the idea that with the advent of the World Wide Web in 1991, a new neural network had been born. The lights had been turned on, and planet earth now had a brain.

I wondered how far this idea could be pushed. Is the world alive, is it a single living thing? If it is, in the sense I meant, then its primary job is to survive, and to survive it’ll have to make decisions. So there I was in my living room thinking, “oh my god, we’ve got to steer this thing!”

Taylor pointed out that as complexity increases, it’ll become harder to make sense of what’s going on in the world. That seemed to me like a big problem on the horizon, because in order to make good decisions, we need to have a good grasp on what’s occurring. I became obsessed with the idea of helping my species through this time of unprecedented complexity. I wanted to understand what was needed in order to help humanity make good decisions.

What seemed important as a first step is that we humans need to unify our understanding—to come to agreement—on matters of fact. For example, humanity still doesn’t know whether global warming is happening. Sure, almost all credible scientists have agreed that it is happening, but does that steer money into programs that will slow it or mitigate its effects? This isn’t an issue of what course to take to solve a given problem; it’s about whether the problem even exists! It’s like when people were talking about Obama being a Muslim, born in Kenya, etc., and some people were denying it, saying he was born in Hawaii, while others kept demanding to see his birth certificate. We couldn’t even agree on a simple matter of fact.

It is important, as a first step, to improve the extent to which we agree on the most obvious facts. This kind of “sanity check” is a necessary foundation for discussions about what course we should take. If we want to steer the ship, we have to make committed choices, like “we’re turning left now,” and we need to do so as a group. That is, there needs to be some amount of agreement about the way we should steer, so we’re not fighting ourselves.

Luckily there are many cases of a group that needs to, and is able to, steer itself as a whole. For example as a human, my neural brain works with my cells to steer my body. Similarly, corporations steer themselves based on boards of directors, and based on flows of information, which run bureaucratically and/or informally between different parts of the company. Note that in neither case is there any suggestion that each part—cell, employee, or corporate entity—is “rational”; they’re all just doing their thing. What we do see in these cases is that the group members work together in a context where information and internal agreement is valued and often attained.

It seemed to me that intelligent, group-directed steering is possible. It does occur. But what’s the mechanism by which it happens, and how can we think about it? I figured that the way we steer, i.e., make decisions, is by using information.

I should be clear: whenever I say information, I never mean it “in the sense of Claude Shannon”. As beautiful as Shannon’s notion of information is, he’s not talking about the kind of information I mean. He explicitly said in his seminal paper that information in his sense is not concerned with meaning:

Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.

In contrast, I’m interested in the semantic stuff, which flows between humans, and which makes possible decisions about things like climate change. Shannon invented a very useful quantitative measure of information carried by probability distributions, without regard to meaning.

That’s not the kind of information I’m talking about. When I say “I want to know what information is”, I’m saying I want to formulate the notion of human-usable semantic meaning, in as mathematical a way as possible.

Back to my problem: we need to steer the ship, and to do so we need to use information properly. Unfortunately, I had no idea what information is, nor how it’s used to make decisions (let alone to make good ones), nor how it’s obtained from our interaction with the world. Moreover, I didn’t have a clue how the minute information-handling at the micro-level, e.g., done by cells inside a body or employees inside a corporation, would yield information-handling at the macro (body or corporate) level.

I set out to try to understand what information is and how it can be communicated. What kind of stuff is information? It seems to follow rules: facts can be put together to form new facts, but only in certain ways. I was once explaining this idea to Dan Kan, and he agreed, saying: “Yes, information is inherently a combinatorial affair.” What is the combinatorics of information?

Communication is similarly difficult to understand, once you dig into it. For example, my brain somehow enables me to use information, and so does yours. But when you look closely, our brains are wired up in personal and ad hoc ways, a bit like a fingerprint or retinal scan. I found it fascinating that two highly personalized semantic networks could interface well enough to collaborate effectively.

There are two issues that I wanted to understand, and by ‘understand’ I mean ‘make mathematical to my own satisfaction’. The first is what information is, as structured stuff, and what communication is, as a transfer of structured stuff. The second is how communication at micro-levels can create, or be, understanding at macro-levels, i.e., how a group can steer as a singleton.

Looking back on this endeavor now, I remain concerned. Things are getting increasingly complex, in the sorts of ways predicted by Mark C. Taylor in his book, and we seem to be losing some control: of the NSA, of privacy, of people 3D printing guns or germs, of drones, of big financial institutions, etc.

Can we expect or hope that our species as a whole will make decisions that are healthy, like keeping the temperature down, given the information we have available? Are we in the driver’s seat, or is our ship currently in the process of spiraling out of our control?

Let’s assume that we don’t want to panic but that we do want to participate in helping the human community to make appropriate decisions. A possible first step could be to formalize the notion of “using information well”. If we could do this rigorously, it would go a long way toward helping humanity get onto a healthy course. Further, mathematics is one of humanity’s best inventions. Using this tool to improve our ability to use information properly is a non-partisan approach to addressing the issue. It’s not about fighting, it’s about figuring out what’s happening, and weighing all our options in an informed way.

So, I ask: What kind of mathematics might serve as a formal ground for the notion of meaningful information, including both its successful communication and its role in decision-making?

Stationary Stability in Finite Populations

24 March, 2015

guest post by Marc Harper

A while back, in the article Relative entropy minimization in evolutionary dynamics, we looked at extensions of the information geometry / evolutionary game theory story to more general time-scales, incentives, and geometries. Today we’ll see how to make this all work in finite populations!

Let’s recall the basic idea from last time, which John also described in his information geometry series. The main theorem is this: when there’s an evolutionarily stable state for a given fitness landscape, the relative entropy between the stable state and the population distribution decreases along the population trajectories as they converge to the stable state. In short, relative entropy is a Lyapunov function. This is a nice way to look at the action of a population under natural selection, and it has interesting analogies to Bayesian inference.

The replicator equation is a nice model from an intuitive viewpoint, and it’s mathematically elegant. But it has some drawbacks when it comes to modeling real populations. One major issue is that the replicator equation implicitly assumes that the population proportions of each type are differentiable functions of time, obeying a differential equation. This only makes sense in the limit of large populations. Other closely related models, such as the Lotka-Volterra model, focus on the number of individuals of each type (e.g. predators and prey) instead of the proportion. But they often assume that the number of individuals is a differentiable function of time, and a population of 3.5 isn’t very realistic either.

Real populations of replicating entities are not infinitely large; in fact they are often relatively small and of course have whole numbers of each type, at least for large biological replicators (like animals). They take up space and only so many can interact meaningfully. There are quite a few models of evolution that handle finite populations and some predate the replicator equation. Models with more realistic assumptions typically have to leave the realm of derivatives and differential equations behind, which means that the analysis of such models is more difficult, but the behaviors of the models are often much more interesting. Hopefully by the end of this post, you’ll see how all of these diagrams fit together:

One of the best-known finite population models is the Moran process, which is a Markov chain on a finite population. This is the quintessential birth-death process. For a moment consider a population of just two types A and B. The state of the population is given by a pair of nonnegative integers (a,b) with a+b=N, the total number of replicators in the population, and a and b the number of individuals of type A and B respectively. Though it may seem artificial to fix the population size N, this often turns out not to be that big of a deal, and you can assume the population is at its carrying capacity to make the assumption realistic. (Lots of people study populations that can change size and that have replicators spatially distributed, say on a graph, but we’ll assume they can all interact with each other whenever they want for now.)

A Markov model works by transitioning from state to state in each round of the process, so we need to define the transition probabilities to complete the model. Let’s put a fitness landscape on the population, given by two functions f_A and f_B of the population state (a,b). Now we choose an individual to reproduce proportionally to fitness, e.g. we choose an A individual to reproduce with probability

\displaystyle{ \frac{a f_A}{a f_A + b f_B} }

since there are a individuals of type A and they each have fitness f_A. This is analogous to the ratio of fitness to mean fitness from the discrete replicator equation, since

\displaystyle{ \frac{a f_A}{a f_A + b f_B} =  \frac{\frac{a}{N} f_A}{\frac{a}{N} f_A + \frac{b}{N} f_B} \to \frac{x_i f_i(x)}{\overline{f(x)}} }

and the discrete replicator equation is typically similar to the continuous replicator equation (this can be made precise), so the Moran process captures the idea of natural selection in a similar way. Actually there is a way to recover the replicator equation from the Moran process in large populations—details at the end!

We’ll assume that the fitnesses are nonnegative and that the total fitness (the denominator) is never zero; if that seems artificial, some people prefer to transform the fitness landscape by e^{\beta f(x)}, which gives a ratio reminiscent of the Boltzmann or Fermi distribution from statistical physics, with the parameter \beta playing the role of intensity of selection rather than inverse temperature. This is sometimes called Fermi selection.
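As a concrete illustration, here is a minimal Python sketch of this exponential transformation (the counts, payoffs, and β below are arbitrary choices for the example, not values from the post):

```python
import math

def fermi_birth_prob_A(a, b, f_A, f_B, beta):
    """Probability that an A individual is chosen to reproduce, with fitnesses
    passed through the exponential (Fermi) map f -> exp(beta * f)."""
    w_A = a * math.exp(beta * f_A)
    w_B = b * math.exp(beta * f_B)
    return w_A / (w_A + w_B)
```

At β = 0 selection is neutral and the ratio reduces to a/(a+b); the exponential keeps the denominator strictly positive even when the raw payoffs are zero or negative, which is exactly why this transformation is convenient for zero-sum games.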

That takes care of the birth part. The death part is easier: we just choose an individual at random (uniformly) to be replaced. Now we can form the transition probabilities of moving between population states. For instance the probability of moving from state (a,b) to (a+1, b-1) is given by the product of the birth and death probabilities, since they are independent:

\displaystyle{ T_a^{a+1} = \frac{a f_A}{a f_A + b f_B} \frac{b}{N} }

since we have to choose a replicator of type A to reproduce and one of type B to be replaced. Similarly for (a,b) to (a-1, b+1) (switch all the a’s and b’s), and we can write the probability of staying in the state (a, N-a) as

T_a^{a} = 1 - T_{a}^{a+1} - T_{a}^{a-1}

Since we only replace one individual at a time, this covers all the possible transitions, and keeps the population constant.
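A minimal Python sketch of these transition probabilities may help make the model concrete (the fitness functions passed in below are placeholders you would swap for a real landscape):

```python
def transitions(a, N, f_A, f_B):
    """Return (T_up, T_down, T_stay) for the state (a, b) with b = N - a.
    f_A and f_B are functions of the population state (a, b)."""
    b = N - a
    total = a * f_A(a, b) + b * f_B(a, b)
    birth_A = a * f_A(a, b) / total   # an A is chosen to reproduce
    birth_B = b * f_B(a, b) / total   # a B is chosen to reproduce
    T_up = birth_A * (b / N)          # (a, b) -> (a + 1, b - 1): A born, B dies
    T_down = birth_B * (a / N)        # (a, b) -> (a - 1, b + 1): B born, A dies
    return T_up, T_down, 1.0 - T_up - T_down
```

Note that the three probabilities sum to 1 by construction, and at the boundary states (a = 0 or a = N) the outgoing probabilities vanish, so those states are absorbing.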

We’d like to analyze this model and many people have come up with clever ways to do so, computing quantities like fixation probabilities (also known as absorption probabilities), indicating the chance that the population will end up with one type completely dominating, i.e. in state (0, N) or (N,0). If we assume that the fitness of type A is constant and simply equal to 1, and the fitness of type B is r \neq 1, we can calculate the probability that a single mutant of type B will take over a population of type A using standard Markov chain methods:

\displaystyle{\rho = \frac{1 - r^{-1}}{1 - r^{-N}} }

For neutral relative fitness (r=1), \rho = 1/N, which is the probability that a neutral mutant takes over by drift alone. Since the two boundary states (0, N) and (N,0) are absorbing (no transitions out), in the long run every population ends up in one of these two states, i.e. the population becomes homogeneous. (This is the formulation referred to by Matteo Smerlak in The mathematical origins of irreversibility.)
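To make this concrete, here is a hedged Python sketch that computes \rho both from the closed form above and by solving the standard absorption equations of the chain directly (for r = 1 the closed form is understood as its limit, 1/N):

```python
import numpy as np

def rho_closed_form(r, N):
    """Fixation probability of a single mutant of relative fitness r (limit 1/N at r = 1)."""
    if r == 1:
        return 1.0 / N
    return (1 - r ** -1) / (1 - r ** -N)

def rho_by_absorption(r, N):
    """Solve the absorption equations x_k = T+ x_{k+1} + T- x_{k-1} + T0 x_k,
    where k counts mutants, with boundary values x_0 = 0 and x_N = 1."""
    A = np.zeros((N - 1, N - 1))
    c = np.zeros(N - 1)
    for k in range(1, N):
        total = k * r + (N - k)               # f_B = r for mutants, f_A = 1 for residents
        up = (k * r / total) * (N - k) / N    # mutant born, resident dies
        down = ((N - k) / total) * k / N      # resident born, mutant dies
        row = k - 1
        A[row, row] = up + down               # (1 - T0) x_k
        if row > 0:
            A[row, row - 1] = -down
        if row < N - 2:
            A[row, row + 1] = -up
        else:
            c[row] = up                       # uses the boundary value x_N = 1
    x = np.linalg.solve(A, c)
    return x[0]                               # start from a single mutant, k = 1
```

The two computations should agree to numerical precision for any r > 0 and modest N.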

That’s a bit different flavor of result than what we discussed previously, since we had stable states where both types were present, and now that’s impossible, and a bit disappointing. We need to make the population model a bit more complex to have more interesting behaviors, and we can do this in a very nice way by adding the effects of mutation. At the time of reproduction, we’ll allow either type to mutate into the other with probability \mu. This changes the transition probabilities to something like

\displaystyle{ T_a^{a+1} = \frac{a (1-\mu) f_A + b \mu f_B}{a f_A + b f_B} \frac{b}{N} }

Now the process never stops wiggling around, but it does have something known as a stationary distribution, which gives the probability that the population is in any given state in the long run.

For populations with more than two types the basic ideas are the same, but there are more neighboring states that the population could move to, and many more states in the Markov process. One can also use more complicated mutation matrices, but this setup is good enough to typically guarantee that no one species completely takes over. For interesting behaviors, typically \mu = 1/N is a good choice (there’s some biological evidence that mutation rates are typically inversely proportional to genome size).

Without mutation, once the population reached (0,N) or (N,0), it stayed there. Now the population bounces between states, either because of drift, selection, or mutation. Based on our stability theorems for evolutionarily stable states, it’s reasonable to hope that for small mutation rates and larger populations (less drift), the population should spend most of its time near the evolutionarily stable state. This can be measured by the stationary distribution which gives the long run probabilities of a process being in a given state.

Previous work by Claussen and Traulsen:

• Jens Christian Claussen and Arne Traulsen, Non-Gaussian fluctuations arising from finite populations: exact results for the evolutionary Moran process, Physical Review E 71 (2005), 025101.

suggested that the stationary distribution is at least sometimes maximal around evolutionarily stable states. Specifically, they showed that for a very similar model with fitness landscape given by

\left(\begin{array}{c} f_A \\ f_B \end{array}\right)  = \left(\begin{array}{cc} 1 & 2\\ 2&1 \end{array}\right)  \left(\begin{array}{c} a\\ b \end{array}\right)

the stationary state is essentially a binomial distribution centered at (N/2, N/2).

Unfortunately, the stationary distribution can be very difficult to compute for an arbitrary Markov chain. While it can be computed for the Markov process described above without mutation, and in the case studied by Claussen and Traulsen, there’s no general analytic formula for the process with mutation, nor for more than two types, because the processes are not reversible. Since we can’t compute the stationary distribution analytically, we’ll have to find another way to show that the local maxima of the stationary distribution are “evolutionarily stable”. We can approximate the stationary distribution fairly easily with a computer, so it’s easy to plot the results for just about any landscape and reasonable population size (e.g. N \approx 100).
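For the two-type process with mutation, such a computer approximation might look like the following Python sketch; the landscape is the one from the Claussen–Traulsen example above (f_A = a + 2b, f_B = 2a + b), and the population size and mutation rate are illustrative choices:

```python
import numpy as np

def moran_transition_matrix(N, f_A, f_B, mu):
    """Tridiagonal transition matrix of the two-type Moran process with mutation.
    State a = number of A individuals, a = 0, ..., N."""
    T = np.zeros((N + 1, N + 1))
    for a in range(N + 1):
        b = N - a
        total = a * f_A(a, b) + b * f_B(a, b)
        birth_A = (a * (1 - mu) * f_A(a, b) + b * mu * f_B(a, b)) / total
        birth_B = (b * (1 - mu) * f_B(a, b) + a * mu * f_A(a, b)) / total
        up = birth_A * b / N      # an A is born, a B dies
        down = birth_B * a / N    # a B is born, an A dies
        if a < N:
            T[a, a + 1] = up
        if a > 0:
            T[a, a - 1] = down
        T[a, a] = 1 - up - down
    return T

def stationary(T, iters=200000, tol=1e-12):
    """Approximate the stationary distribution by iterating s -> s T."""
    s = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(iters):
        s_next = s @ T
        if np.abs(s_next - s).max() < tol:
            break
        s = s_next
    return s_next

# Landscape from the Claussen-Traulsen example: f_A = a + 2b, f_B = 2a + b.
T = moran_transition_matrix(50, lambda a, b: a + 2 * b, lambda a, b: 2 * a + b, 0.02)
s = stationary(T)
# s peaks near a = N/2, the evolutionarily stable state of this landscape.
```

Plotting s over a = 0, ..., N reproduces the roughly binomial shape centered at N/2 described above.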

It turns out that we can use a relative entropy minimization approach, just like for the continuous replicator equation! But how? We lack some essential ingredients such as deterministic and differentiable trajectories. Here’s what we do:

• We show that the local maxima and minima of the stationary distribution satisfy a complex balance criterion.

• We then show that these states minimize an expected relative entropy.

• This will mean that the current state and the expected next state are ‘close’.

• Lastly, we show that these states satisfy an analogous definition of evolutionary stability (now incorporating mutation).

The relative entropy allows us to measure how close the current state is to the expected next state, which captures the idea of stability in another way. This ports the relative entropy minimization Lyapunov result to some more realistic Markov chain models. The only downside is that we’ll assume the populations are “sufficiently large”, but in practice for populations of three types, N=20 is typically enough for common fitness landscapes (there are lots of examples here for N=80, which are prettier than the smaller populations). The reason for this is that the population state (a,b) needs enough “resolution” (a/N, b/N) to get sufficiently close to the stable state, which is not necessarily a ratio of integers. If you allow some wiggle room, smaller populations are still typically pretty close.

Evolutionarily stable states are closely related to Nash equilibria, which have a nice intuitive description in traditional game theory as “states that no player has an incentive to deviate from”. But in evolutionary game theory, we don’t use a game matrix to compute e.g. maximum payoff strategies, rather the game matrix defines a fitness landscape which then determines how natural selection unfolds.

We’re going to see this idea again in a moment, and to help get there let’s introduce a function called an incentive that encodes how a fitness landscape is used for selection. One way to do this is simply to replace the quantities a f_A(a,b) and b f_B(a,b) in the fitness-proportionate selection ratio above, which now becomes (for two population types):

\displaystyle{ \frac{\varphi_A(a,b)}{\varphi_A(a,b) + \varphi_B(a,b)} }

Here \varphi_A(a,b) and \varphi_B(a,b) are the incentive function components that determine how the fitness landscape is used for natural selection (if at all). We have seen two examples above:

\varphi_A(a,b) = a f_A(a, b)

for the Moran process and fitness-proportionate selection, and

\varphi_A(a,b) = a e^{\beta f_A(a, b)}

for an alternative that incorporates a strength of selection term \beta, preventing division by zero for fitness landscapes defined by zero-sum game matrices, such as a rock-paper-scissors game. Using an incentive function also simplifies the transition probabilities and results as we move to populations of more than two types. Introducing mutation, we can describe the ratio for incentive-proportionate selection with mutation for the ith population type when the population is in state x=(a,b,\ldots) / N as

\displaystyle{ p_i(x) = \frac{\sum_{k=1}^{n}{\varphi_k(x) M_{i k} }}{\sum_{k=1}^{n}{\varphi_k(x)}} }

for some matrix of mutation probabilities M. This is just the probability that we get a new individual of the ith type (by birth and/or mutation). A common choice for the mutation matrix is to use a single mutation probability \mu and spread it out over all the types, such as letting

M_{ij} = \mu / (n-1) \textrm{ for } i \neq j, \qquad M_{ii} = 1 - \mu
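In Python, this mutation setup and the resulting p_i(x) might be sketched like this (the three-type state and the neutral incentive \varphi_i(x) = x_i below are just examples):

```python
import numpy as np

def mutation_matrix(n, mu):
    """M_ij = mu/(n-1) off the diagonal, M_ii = 1 - mu."""
    M = np.full((n, n), mu / (n - 1))
    np.fill_diagonal(M, 1.0 - mu)
    return M

def p(x, phi, M):
    """p_i(x) = sum_k phi_k(x) M_ik / sum_k phi_k(x): the chance that the new
    individual is of type i, combining birth and mutation."""
    w = phi(x)
    return (M @ w) / w.sum()

# Example with three types and the neutral incentive phi_i(x) = x_i:
M = mutation_matrix(3, 0.01)
x = np.array([0.5, 0.3, 0.2])
phi = lambda x: x
```

Since each row of M sums to 1, p(x) is again a probability vector, as it should be.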

Now we are ready to define the expected next state for the population and see how it captures a notion of stability. For a given population state x in a multitype population, using x to indicate the normalized population state (a,b,\ldots) / N, consider all the neighboring states y that the population could move to in one step of the process (one birth-death cycle). These neighboring states are the result of increasing a population type by one (birth) and decreasing another by one (death, possibly the same type), of course excluding cases on the boundary where the number of individuals of any type drops below zero or rises above N. Now we can define the expected next state as the sum of neighboring states weighted by the transition probabilities

E(x) = \sum_{y}{y T_x^{y}}

with transition probabilities given by

T_{x}^{y} = p_{i}(x) x_{j}

for states y that differ in 1/N at the ith coordinate and -1/N at jth coordinate from x. Here x_j is just the probability of the random death of an individual of the jth type, so the transition probabilities are still just birth (with mutation) and death as for the Moran process we started with.

Skipping some straightforward algebraic manipulations, we can show that

\displaystyle{ E(x) = \sum_{y}{y T_x^{y}} = \frac{N-1}{N}x + \frac{1}{N}p(x)}

Then it’s easy to see that E(x) = x if and only if x = p(x), and that x = p(x) if and only if x_i = \varphi_i(x). So we have a nice description of ‘stability’ in terms of fixed points of the expected next state function and the incentive function

x = E(x) = p(x) = \varphi(x),

and we’ve gotten back to “no one has an incentive to deviate”. More precisely, for the Moran process

\varphi_i(x) = x_i f_i(x)

and we get back f_i(x) = f_j(x) for every type. So we take x = \varphi(x) as our analogous condition to an evolutionarily stable state, though it’s just the ‘no motion’ part and not also the ‘stable’ part. That’s what we need the stationary distribution for!
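The closed form for E(x) above is easy to check numerically by summing over all birth-death pairs directly; here is a small Python sketch with an arbitrary interior state and incentive-probability vector:

```python
import numpy as np

def expected_next_state_direct(x, p_vec, N):
    """E(x) = sum over birth-death pairs (i born, j dies) of the resulting
    state x + (e_i - e_j)/N, weighted by T = p_i(x) * x_j."""
    E = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(len(x)):
            y = x.copy()
            y[i] += 1.0 / N
            y[j] -= 1.0 / N
            E += p_vec[i] * x[j] * y
    return E

x = np.array([0.5, 0.3, 0.2])        # interior state, away from the boundary
p_vec = np.array([0.4, 0.35, 0.25])  # some incentive-with-mutation probabilities
N = 100
closed_form = (N - 1) / N * x + p_vec / N
assert np.allclose(expected_next_state_direct(x, p_vec, N), closed_form)
```

The double sum includes the i = j terms, which leave the state unchanged; those are exactly the T_x^x contributions.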

To turn this into a useful number that measures stability, we use the relative entropy of the expected next state and the current state, in analogy with the Lyapunov theorem for the replicator equation. The relative entropy

\displaystyle{ D(x, y) = \sum_i x_i \ln(x_i) - x_i \ln(y_i) }

has the really nice property that D(x,y) = 0 if and only if x = y, so we can use the relative entropy D(E(x), x) as a measure of how close to stable any particular state is! Here the expected next state takes the place of the ‘evolutionarily stable state’ in the result described last time for the replicator equation.
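Here is a short Python sketch of this relative entropy and the resulting stability measure D(E(x), x), valid for interior states where all coordinates are positive:

```python
import numpy as np

def relative_entropy(x, y):
    """D(x, y) = sum_i x_i ln(x_i / y_i), for strictly positive interior states."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(x * (np.log(x) - np.log(y))))

def distance_to_stability(x, p_vec, N):
    """D(E(x), x), with E(x) = (N-1)/N x + p(x)/N as above."""
    E = (N - 1) / N * np.asarray(x, dtype=float) + np.asarray(p_vec, dtype=float) / N
    return relative_entropy(E, x)
```

When p(x) = x the expected next state is x itself and the measure vanishes, matching the fixed-point condition for stability described above.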

Finally, we need to show that the maxima (and minima) of the stationary distribution are these fixed points by showing that these states minimize the expected relative entropy.

Seeing that local maxima and minima of the stationary distribution minimize the expected relative entropy is more involved, so let’s just sketch the details. In general, these Markov processes are not reversible, so they don’t satisfy the detailed-balance condition, but the stationary probabilities do satisfy something called the global balance condition, which says that for the stationary distribution s we have that

s_x \sum_{y}{T_x^{y}} = \sum_{y}{s_y T_y^{x}}

When the stationary distribution is at a local maximum (or minimum), we can show essentially that this implies (up to an \epsilon, for a large enough population) that

\displaystyle{\sum_{y}{T_x^{y}} = \sum_{y}{T_y^{x}} }

a sort of probability inflow-outflow equation, which is very similar to the condition of complex balanced equilibrium described by Manoj Gopalkrishnan in this Azimuth post. With some algebraic manipulation, we can show that these states have E(x)=x.

Now let’s look again at the figures from the start. The first shows the vector field of the replicator equation:

You can see rest points at the center, on the center of each boundary edge, and on the corner points. The center point is evolutionarily stable, the center points of the boundary are semi-stable (but stable when the population is restricted to a boundary simplex), and the corner points are unstable.

This one shows the stationary distribution for a finite population model with a Fermi incentive on the same landscape, for a population of size 80:

A fixed population size gives a partitioning of the simplex, and each triangle of the partition is colored by the value of the stationary distribution. So you can see that there are local maxima in the center and on the centers of the triangle boundary edges. In this case, the size of the mutation probability determines how much of the stationary distribution is concentrated on the center of the simplex.

This shows one-half of the Euclidean distance squared between the current state and the expected next state:

And finally, this shows the same thing but with the relative entropy as the ‘distance function’:

As you can see, the Euclidean distance is locally minimal at each of the local maxima and minima of the stationary distribution (including the corners); the relative entropy is only guaranteed to be so on the interior states (because the relative entropy doesn’t play nicely with the boundary, and unlike the replicator equation, the Markov process can jump on and off the boundary). It turns out that the relative Rényi entropies for q between 0 and 1 also work just fine, but in the large population limit (the replicator dynamic), the ordinary relative entropy is somehow the right choice for the replicator equation: it has the derivative that easily gives Lyapunov stability, thanks to the connections between relative entropy and Fisher information in the information geometry of the simplex. The Euclidean distance is the q=0 case and the ordinary relative entropy is q=1.

As it turns out, something very similar holds for another popular finite population model, the Wright–Fisher process! This model is more complicated, so if you are interested in the details, check out our paper, which has many nice examples and figures. We also define a process that bridges the gap between the atomic nature of the Moran process and the generational nature of the Wright–Fisher process, and prove the general result for that model.

Finally, let’s see how the Moran process relates back to the replicator equation (see also the appendix in this paper), and how we recover the stability theory of the replicator equation. We can use the transition probabilities of the Moran process to define a stochastic differential equation (called a Langevin equation) with drift and diffusion terms that are essentially (for populations with two types):

\mathrm{Drift}(x) = T^{+}(x) - T^{-}(x)

\displaystyle{ \mathrm{Diffusion}(x) = \sqrt{\frac{T^{+}(x) + T^{-}(x)}{N}} }

As the population size gets larger, the diffusion term drops out, and the stochastic differential equation becomes essentially the replicator equation. For the stationary distribution, the variance (e.g. for the binomial example above) also has an inverse dependence on N, so the distribution limits to a delta-function that is zero except for at the evolutionarily stable state!
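The drift claim can be checked by hand or numerically: for two types without mutation, T^{+}(x) - T^{-}(x) simplifies to x(1-x)(f_A - f_B) divided by the mean fitness, a replicator-like vector field in which the mean-fitness denominator only rescales time. A quick Python check:

```python
def drift(x, f_A, f_B):
    """T+(x) - T-(x) for the two-type Moran process, x = fraction of type A."""
    fbar = x * f_A + (1 - x) * f_B        # mean fitness
    T_plus = (x * f_A / fbar) * (1 - x)   # A reproduces, B dies
    T_minus = ((1 - x) * f_B / fbar) * x  # B reproduces, A dies
    return T_plus - T_minus

# Verify the algebraic identity at several interior states:
f_A, f_B = 1.0, 1.5
for x in (0.1, 0.25, 0.5, 0.9):
    fbar = x * f_A + (1 - x) * f_B
    assert abs(drift(x, f_A, f_B) - x * (1 - x) * (f_A - f_B) / fbar) < 1e-12
```

The identity is exact, so the only approximations in the large-population limit come from dropping the diffusion term.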

What about the relative entropy? Loosely speaking, as the population size gets larger, the iteration of the expected next state also becomes deterministic. Then the evolutionarily stable state is a fixed point of the expected next state function, and the expected relative entropy is essentially the same as the ordinary relative entropy, at least in a neighborhood of the evolutionarily stable state. This is good enough to establish local stability.

Earlier I said both the local maxima and minima minimize the expected relative entropy. Dash and I haven’t proven that the local maxima always correspond to evolutionarily stable states (and the minima to unstable states). That’s because the generalization of evolutionarily stable state we use is really just a ‘no motion’ condition, and isn’t strong enough to imply stability in a neighborhood for the deterministic replicator equation. So for now we are calling the local maxima stationary stable states.

We’ve also tried a similar approach to populations evolving on networks, which is a popular topic in evolutionary graph theory, and the results are encouraging! But there are many more ‘states’ in such a process, since the configuration of the network has to be taken into account, and whether the population is clustered together or not. See the end of our paper for an interesting example of a population on a cycle.

Planets in the Fourth Dimension

17 March, 2015

You probably know that planets go around the sun in elliptical orbits. But do you know why?

In fact, they’re moving in circles in 4 dimensions. But when these circles are projected down to 3-dimensional space, they become ellipses!

This animation by Greg Egan shows the idea:

The plane here represents 2 of the 3 space dimensions we live in. The vertical direction is the mysterious fourth dimension. The planet goes around in a circle in 4-dimensional space. But down here in 3 dimensions, its ‘shadow’ moves in an ellipse!

What’s this fourth dimension I’m talking about here? It’s a lot like time. But it’s not exactly time. It’s the difference between ordinary time and another sort of time, which flows at a rate inversely proportional to the distance between the planet and the sun.

The movie uses this other sort of time. Relative to this other time, the planet is moving at constant speed around a circle in 4 dimensions. But in ordinary time, its shadow in 3 dimensions moves faster when it’s closer to the sun.

All this sounds crazy, but it’s not some new physics theory. It’s just a different way of thinking about Newtonian physics!

Physicists have known about this viewpoint at least since 1980, thanks to a paper by the mathematical physicist Jürgen Moser. Some parts of the story are much older. A lot of papers have been written about it.

But I only realized how simple it is when I got this paper in my email, from someone I’d never heard of before:

• Jesper Göransson, Symmetries of the Kepler problem, 8 March 2015.

I get a lot of papers by crackpots in my email, but the occasional gem from someone I don’t know makes up for all those.

The best thing about Göransson’s 4-dimensional description of planetary motion is that it gives a clean explanation of an amazing fact. You can take any elliptical orbit, apply a rotation of 4-dimensional space, and get another valid orbit!

Of course we can rotate an elliptical orbit about the sun in the usual 3-dimensional way and get another elliptical orbit. The interesting part is that we can also do 4-dimensional rotations. This can make a round ellipse look skinny: when we tilt a circle into the fourth dimension, its ‘shadow’ in 3-dimensional space becomes thinner!

In fact, you can turn any elliptical orbit into any other elliptical orbit with the same energy by a 4-dimensional rotation of this sort. All elliptical orbits with the same energy are really just circular orbits on the same sphere in 4 dimensions!

Jesper Göransson explains how this works in a terse and elegant way. But I can’t resist summarizing the key results.

The Kepler problem

Suppose we have a particle moving in an inverse square force law. Its equation of motion is

\displaystyle{ m \ddot{\mathbf{r}} = - \frac{k \mathbf{r}}{r^3} }

where \mathbf{r} is its position as a function of time, r is its distance from the origin, m is its mass, and k says how strong the force is. From this we can derive the law of conservation of energy, which says

\displaystyle{ \frac{m \dot{\mathbf{r}} \cdot \dot{\mathbf{r}}}{2} - \frac{k}{r} = E }

for some constant E that depends on the particle’s orbit, but doesn’t change with time.

Let’s consider an attractive force, so k > 0, and elliptical orbits, so E < 0. Let’s call the particle a ‘planet’. It’s a planet moving around the sun, where we treat the sun as so heavy that it remains perfectly fixed at the origin.

I only want to study orbits of a single fixed energy E. This frees us to choose units of mass, length and time in which

m = 1, \;\; k = 1, \;\; E = -\frac{1}{2}

This will reduce the clutter of letters and let us focus on the key ideas. If you prefer an approach that keeps the units, see Göransson’s paper.

Now the equation of motion is

\displaystyle{\ddot{\mathbf{r}} = - \frac{\mathbf{r}}{r^3} }

and conservation of energy says

\displaystyle{ \frac{\dot{\mathbf{r}} \cdot \dot{\mathbf{r}}}{2} - \frac{1}{r} = -\frac{1}{2} }

The big idea, apparently due to Moser, is to switch from our ordinary notion of time to a new notion of time! We’ll call this new time s, and demand that

\displaystyle{ \frac{d s}{d t} = \frac{1}{r} }

This new kind of time ticks more slowly as you get farther from the sun. So, using this new time speeds up the planet’s motion when it’s far from the sun. If that seems backwards, just think about it. For a planet very far from the sun, one day of this new time could equal a week of ordinary time. So, measured using this new time, a planet far from the sun might travel in one day what would normally take a week.

This compensates for the planet’s ordinary tendency to move slower when it’s far from the sun. In fact, with this new kind of time, a planet moves just as fast when it’s farthest from the sun as when it’s closest.

Amazing stuff happens with this new notion of time!

To see this, first rewrite conservation of energy using this new notion of time. I’ve been using a dot for the ordinary time derivative, following Newton. Let’s use a prime for the derivative with respect to s. So, for example, we have

\displaystyle{ t' = \frac{dt}{ds} = r }


\displaystyle{ \mathbf{r}' = \frac{d\mathbf{r}}{ds} = \frac{dt}{ds}\frac{d\mathbf{r}}{dt} = r \dot{\mathbf{r}} }

Using this new kind of time derivative, Göransson shows that conservation of energy can be written as

\displaystyle{ (t' - 1)^2 + \mathbf{r}' \cdot \mathbf{r}' = 1 }

This is the equation of a sphere in 4-dimensional space!

I’ll prove this later. First let’s talk about what it means. To understand it, we should treat the ordinary time coordinate t and the space coordinates (x,y,z) on an equal footing. The point

(t,x,y,z)

moves around in 4-dimensional space as the parameter s changes. What we’re seeing is that the velocity of this point, namely

\mathbf{v} = (t',x',y',z')

moves around on a sphere in 4-dimensional space! It’s a sphere of radius one centered at the point

(1,0,0,0)
With some further calculation we can show some other wonderful facts:

\mathbf{r}''' = -\mathbf{r}'


t''' = -(t' - 1)

These are the usual equations for a harmonic oscillator, but with an extra derivative!

I’ll prove these wonderful facts later. For now let’s just think about what they mean. We can state both of them in words as follows: the 4-dimensional velocity \mathbf{v} carries out simple harmonic motion about the point (1,0,0,0).

That’s nice. But since \mathbf{v} also stays on the unit sphere centered at this point, we can conclude something even better: \mathbf{v} must move along a great circle on this sphere, at constant speed!

This implies that the spatial components of the 4-dimensional velocity have mean 0, while the t component has mean 1.

The first part here makes a lot of sense: our planet doesn’t drift ever farther from the Sun, so its mean velocity must be zero. The second part is a bit subtler, but it also makes sense: the ordinary time t moves forward at speed 1 on average with respect to the new time parameter s, but its rate of change oscillates in a sinusoidal way.

If we integrate both sides of

\mathbf{r}''' = -\mathbf{r}'

we get

\mathbf{r}'' = -\mathbf{r} + \mathbf{a}

for some constant vector \mathbf{a}. This says that the position \mathbf{r} oscillates harmonically about a point \mathbf{a}. Since \mathbf{a} doesn’t change with time, it’s a conserved quantity: it’s called the Runge–Lenz vector.

Often people start with the inverse square force law, show that angular momentum and the Runge–Lenz vector are conserved, and use these 6 conserved quantities and Noether’s theorem to show there’s a 6-dimensional group of symmetries. For solutions with negative energy, this turns out to be the group of rotations in 4 dimensions, \mathrm{SO}(4). With more work, we can see how the Kepler problem is related to a harmonic oscillator in 4 dimensions. Doing this involves reparametrizing time.

I like Göransson’s approach better in many ways, because it starts by biting the bullet and reparametrizing time. This lets him rather efficiently show that the planet’s elliptical orbit is a projection to 3-dimensional space of a circular orbit in 4d space. The 4d rotational symmetry is then evident!

Göransson actually carries out his argument for an inverse square law in n-dimensional space; it’s no harder. The elliptical orbits in n dimensions are projections of circular orbits in n+1 dimensions. Angular momentum is a bivector in n dimensions; together with the Runge–Lenz vector it forms a bivector in n+1 dimensions. This is the conserved quantity associated to the (n+1)-dimensional rotational symmetry of the problem.

He also carries out the analogous argument for positive-energy orbits, which are hyperbolas, and zero-energy orbits, which are parabolas. The hyperbolic case has the Lorentz group symmetry and the zero-energy case has Euclidean group symmetry! This was already known, but it’s nice to see how easily Göransson’s calculations handle all three cases.

Mathematical details

Checking all this is a straightforward exercise in vector calculus, but it takes a bit of work, so let me do some here. There will still be details left to fill in, and I urge you to give them a try, because this is the sort of thing that’s more interesting to do than to watch.

There are a lot of equations coming up, so I’ll put boxes around the important ones. The basic ones are the force law, conservation of energy, and the change of variables that gives

\boxed{  t' = r , \qquad  \mathbf{r}' = r \dot{\mathbf{r}} }

We start with conservation of energy:

\boxed{ \displaystyle{ \frac{\dot{\mathbf{r}} \cdot \dot{\mathbf{r}}}{2} -  \frac{1}{r}  = -\frac{1}{2} } }

and then use

\displaystyle{ \dot{\mathbf{r}} = \frac{d\mathbf{r}/ds}{dt/ds} = \frac{\mathbf{r}'}{t'} }

to obtain

\displaystyle{ \frac{\mathbf{r}' \cdot \mathbf{r}'}{2 t'^2}  - \frac{1}{t'} = -\frac{1}{2} }

With a little algebra this gives

\boxed{ \displaystyle{ \mathbf{r}' \cdot \mathbf{r}' + (t' - 1)^2 = 1} }

This shows that the ‘4-velocity’

\mathbf{v} = (t',x',y',z')

stays on the unit sphere centered at (1,0,0,0).
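If you like computers, you can also check this identity numerically. Here’s a quick sketch in Python; the initial conditions are an arbitrary eccentric orbit I picked to have energy -1/2, and scipy’s ODE solver is assumed:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Kepler problem in units where m = k = 1 and E = -1/2.
def kepler(t, y):
    r, v = y[:3], y[3:]
    return np.concatenate([v, -r / np.linalg.norm(r)**3])

# An arbitrary eccentric orbit with energy -1/2:
# |v|^2/2 - 1/|r| = (1/3)/2 - 1/1.5 = -1/2.
r0 = [1.5, 0.0, 0.0]
v0 = [0.0, np.sqrt(1/3), 0.0]
sol = solve_ivp(kepler, (0, 20), r0 + v0, dense_output=True,
                rtol=1e-10, atol=1e-12)

y = sol.sol(np.linspace(0, 20, 400))
r, v = y[:3].T, y[3:].T
rn = np.linalg.norm(r, axis=1)

# The 4-velocity: t' = r and r' = r * rdot.
tp = rn
rp = rn[:, None] * v
sphere = (tp - 1)**2 + np.sum(rp**2, axis=1)
print(np.max(np.abs(sphere - 1)))   # tiny: the 4-velocity stays on the sphere
```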

The next step is to take the equation of motion

\boxed{ \displaystyle{\ddot{\mathbf{r}} = - \frac{\mathbf{r}}{r^3} } }

and rewrite it using primes (s derivatives) instead of dots (t derivatives). We start with

\displaystyle{ \dot{\mathbf{r}} = \frac{\mathbf{r}'}{r} }

and differentiate again to get

\ddot{\mathbf{r}} = \displaystyle{ \frac{1}{r} \left(\frac{\mathbf{r}'}{r}\right)' }  = \displaystyle{ \frac{1}{r} \left( \frac{r \mathbf{r}'' - r' \mathbf{r}'}{r^2} \right) } = \displaystyle{ \frac{r \mathbf{r}'' - r' \mathbf{r}'}{r^3} }

Now we use our other equation for \ddot{\mathbf{r}} and get

\displaystyle{ \frac{r \mathbf{r}'' - r' \mathbf{r}'}{r^3} = - \frac{\mathbf{r}}{r^3} }


r \mathbf{r}'' - r' \mathbf{r}' = -\mathbf{r}


\boxed{ \displaystyle{ \mathbf{r}'' =  \frac{r' \mathbf{r}' - \mathbf{r}}{r} } }

To go further, it’s good to get a formula for r'' as well. First we compute

r' = \displaystyle{ \frac{d}{ds} (\mathbf{r} \cdot \mathbf{r})^{\frac{1}{2}} } = \displaystyle{ \frac{\mathbf{r}' \cdot \mathbf{r}}{r} }

and then differentiating again,

r'' = \displaystyle{\frac{d}{ds} \frac{\mathbf{r}' \cdot \mathbf{r}}{r} } = \displaystyle{ \frac{r \mathbf{r}'' \cdot \mathbf{r} + r \mathbf{r}' \cdot \mathbf{r}' - r' \mathbf{r}' \cdot \mathbf{r}}{r^2} }

Plugging in our formula for \mathbf{r}'', some wonderful cancellations occur and we get

r'' = \displaystyle{ \frac{\mathbf{r}' \cdot \mathbf{r}'}{r} - 1 }

But we can do better! Remember, conservation of energy says

\displaystyle{ \mathbf{r}' \cdot \mathbf{r}' + (t' - 1)^2 = 1}

and we know t' = r. So,

\mathbf{r}' \cdot \mathbf{r}' = 1 - (r - 1)^2 = 2r - r^2


r'' = \displaystyle{ \frac{\mathbf{r}' \cdot \mathbf{r}'}{r} - 1 } = 1 - r

So, we see

\boxed{ r'' = 1 - r }

Can you get here more elegantly?

Since t' = r this instantly gives

\boxed{ t''' = 1 - t' }

as desired.

Next let’s get a similar formula for \mathbf{r}'''. We start with

\displaystyle{ \mathbf{r}'' =  \frac{r' \mathbf{r}' - \mathbf{r}}{r} }

and differentiate both sides to get

\displaystyle{ \mathbf{r}''' = \frac{r r'' \mathbf{r}' + r r' \mathbf{r}'' - r \mathbf{r}' - r' (r' \mathbf{r}' - \mathbf{r})}{r^2} }

Then plug in our formulas for r'' and \mathbf{r}''. Some truly miraculous cancellations occur and we get

\boxed{  \mathbf{r}''' = -\mathbf{r}' }

I could show you how it works—but to really believe it you have to do it yourself. It’s just algebra. Again, I’d like a better way to see why this happens!
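Alternatively, you can let a computer grind through the algebra. Here’s a sketch using sympy that checks the boxed identities on the standard closed-form Kepler solution, in which the new time s turns out to be the eccentric anomaly: in our units (semimajor axis 1) it reads x = \cos s - e, \; y = \sqrt{1-e^2} \sin s, \; t = s - e \sin s, \; r = 1 - e \cos s, with e the eccentricity:

```python
import sympy as sp

s, e = sp.symbols('s e')

# Closed-form Kepler orbit in our units, with the new time s
# playing the role of the eccentric anomaly:
x = sp.cos(s) - e
y = sp.sqrt(1 - e**2) * sp.sin(s)
t = s - e * sp.sin(s)          # Kepler's equation
r = 1 - e * sp.cos(s)          # = |(x, y)| when 0 <= e < 1

checks = [
    sp.simplify(x**2 + y**2 - r**2),                # r really is the radius
    sp.simplify(sp.diff(t, s) - r),                 # t' = r
    sp.simplify(sp.diff(r, s, 2) - (1 - r)),        # r'' = 1 - r
    sp.simplify(sp.diff(x, s, 3) + sp.diff(x, s)),  # r''' = -r', x component
    sp.simplify(sp.diff(y, s, 3) + sp.diff(y, s)),  # r''' = -r', y component
]
print(checks)  # [0, 0, 0, 0, 0]
```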

Integrating both sides—which is a bit weird, since we got this equation by differentiating both sides of another one—we get

\boxed{ \mathbf{r}'' = -\mathbf{r} + \mathbf{a} }

for some fixed vector \mathbf{a}, the Runge–Lenz vector. This says \mathbf{r} undergoes harmonic motion about \mathbf{a}. It’s quite remarkable that both \mathbf{r} and its norm r undergo harmonic motion! At first I thought this was impossible, but it’s just a very special circumstance.

The quantum version of a planetary orbit is a hydrogen atom. Everything we just did has a quantum version! For more on that, see

• Greg Egan, The ellipse and the atom.

For more of the history of this problem, see:

• John Baez, Mysteries of the gravitational 2-body problem.

This also treats quantum aspects, connections to supersymmetry and Jordan algebras, and more! Someday I’ll update it to include the material in this blog post.

Visual Insight

1 March, 2015

I have another blog, called Visual Insight. Over here, our focus is on applying science to help save the planet. Over there, I try to make the beauty of pure mathematics visible to the naked eye.

I’m always looking for great images, so if you know about one, please tell me about it! If not, you may still enjoy taking a look.

Here are three of my favorite images from that blog, and a bit about the people who created them.

I suspect that these images, and many more on Visual Insight, are all just different glimpses of the same big structure. I have a rough idea what that structure is. Sometimes I dream of a computer program that would let you tour the whole thing. Unfortunately, a lot of it lives in more than 3 dimensions.

Less ambitiously, I sometimes dream of teaming up with lots of mathematicians and creating a gorgeous coffee-table book about this stuff.


Schmidt arrangement of the Eisenstein integers



This picture drawn by Katherine Stange shows what happens when we apply fractional linear transformations

z \mapsto \frac{a z + b}{c z + d}

to the real line sitting in the complex plane, where a,b,c,d are Eisenstein integers: that is, complex numbers of the form

m + n \omega, \qquad \omega = e^{2 \pi i / 3}

where m,n are integers. The result is a complicated set of circles and lines called the ‘Schmidt arrangement’ of the Eisenstein integers. For more details go here.
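If you’d like to play with pictures like this, the key computational fact is that each fractional linear transformation sends the real line to a circle (or another line), and you can pin that circle down from the images of three real points. Here’s a minimal sketch in Python; the particular Eisenstein-integer coefficients are an arbitrary example of mine, not ones taken from Stange’s picture:

```python
import cmath

def mobius_image_circle(a, b, c, d):
    """Circle traced out by the real line under z -> (a z + b)/(c z + d).

    Assumes the image really is a circle, i.e. the pole -d/c is not real.
    """
    z1, z2, z3 = [(a * x + b) / (c * x + d) for x in (0.0, 1.0, -1.0)]
    # Circumcenter of three points in the complex plane.
    w = (z3 - z1) / (z2 - z1)
    center = z1 + (z2 - z1) * (w - abs(w)**2) / (2j * w.imag)
    return center, abs(center - z1)

# Eisenstein integers are m + n*omega, omega a primitive cube root of 1.
omega = cmath.exp(2j * cmath.pi / 3)
a, b, c, d = 1, omega, omega, 1 + omega     # sample coefficients, ad - bc != 0
center, radius = mobius_image_circle(a, b, c, d)
print(center, radius)
```

Looping over many quadruples (a,b,c,d) with small Eisenstein entries and drawing all the resulting circles produces arrangements in this style.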

Katherine Stange did her Ph.D. with Joseph H. Silverman, an expert on elliptic curves at Brown University. Now she is an assistant professor at the University of Colorado, Boulder. She works on arithmetic geometry, elliptic curves, algebraic and integer sequences, cryptography, arithmetic dynamics, Apollonian circle packings, and game theory.


{7,3,3} honeycomb

This is the {7,3,3} honeycomb as drawn by Danny Calegari. The {7,3,3} honeycomb is built of regular heptagons in 3-dimensional hyperbolic space. It’s made of infinite sheets of regular heptagons in which 3 heptagons meet at each vertex, explaining the first ‘3’ in the symbol {7,3,3}. 3 such sheets meet at each edge of each heptagon, explaining the second ‘3’.

The 3-dimensional regions bounded by these sheets are unbounded: they go off to infinity. They show up as holes here. In this image, hyperbolic space has been compressed down to an open ball using the so-called Poincaré ball model. For more details, go here.

Danny Calegari did his Ph.D. work with Andrew Casson and William Thurston on foliations of three-dimensional manifolds. Now he’s a professor at the University of Chicago, and he works on these and related topics, especially geometric group theory.


{7,3,3} honeycomb meets the plane at infinity

This picture, by Roice Nelson, is another view of the {7,3,3} honeycomb. It shows the ‘boundary’ of this honeycomb—that is, the set of points on the surface of the Poincaré ball that are limits of points in the {7,3,3} honeycomb.

Roice Nelson used stereographic projection to draw part of the surface of the Poincaré ball as a plane. The circles here are holes, not contained in the boundary of the {7,3,3} honeycomb. There are infinitely many holes, and the actual boundary, the region left over, is a fractal with area zero. The white region on the outside of the picture is yet another hole. For more details, and a different version of this picture, go here.

Roice Nelson is a software developer for a flight data analysis company. There’s a good chance the data recorded on the airplane from your last flight moved through one of his systems! He enjoys motorcycling and recreational mathematics, he has a blog with lots of articles about geometry, and he makes plastic models of interesting geometrical objects using a 3d printer.

Higher-Dimensional Rewriting in Warsaw

18 February, 2015

This summer there will be a conference on higher-dimensional algebra and rewrite rules in Warsaw. They want people to submit papers! I’ll give a talk about presentations of symmetric monoidal categories that arise in electrical engineering and control theory. This is part of the network theory program, which we talk about so often here on Azimuth.

There should also be interesting talks about combinatorial algebra, homotopical aspects of rewriting theory, and more:

Higher-Dimensional Rewriting and Applications, 28-29 June 2015, Warsaw, Poland. Co-located with the RDP, RTA and TLCA conferences. Organized by Yves Guiraud, Philippe Malbos and Samuel Mimram.


Over recent years, rewriting methods have been generalized from strings and terms to richer algebraic structures such as operads, monoidal categories, and more generally higher dimensional categories. These extensions of rewriting fit in the general scope of higher-dimensional rewriting theory, which has emerged as a unifying algebraic framework. This approach allows one to perform homotopical and homological analysis of rewriting systems (Squier theory). It also provides new computational methods in combinatorial algebra (Artin-Tits monoids, Coxeter and Garside structures), in homotopical and homological algebra (construction of cofibrant replacements, Koszulness property). The workshop is open to all topics concerning higher-dimensional generalizations and applications of rewriting theory, including

• higher-dimensional rewriting: polygraphs / computads, higher-dimensional generalizations of string/term/graph rewriting systems, etc.

• homotopical invariants of rewriting systems: homotopical and homological finiteness properties, Squier theory, algebraic Morse theory, coherence results in algebra and higher-dimensional category theory, etc.

• linear rewriting: presentations and resolutions of algebras and operads, Gröbner bases and generalizations, homotopy and homology of algebras and operads, Koszul duality theory, etc.

• applications of higher-dimensional and linear rewriting and their interactions with other fields: calculi for quantum computations, algebraic lambda-calculi, proof nets, topological models for concurrency, homotopy type theory, combinatorial group theory, etc.

• implementations: the workshop will also be interested in implementation issues in higher-dimensional rewriting and will allow demonstrations of prototypes of existing and new tools in higher-dimensional rewriting.


Important dates:

• Submission: April 15, 2015

• Notification: May 6, 2015

• Final version: May 20, 2015

• Conference: 28-29 June, 2015

Submissions should consist of an extended abstract, approximately 5 pages long, in standard article format, in PDF. The page for uploading those is here. The accepted extended abstracts will be made available electronically before the workshop.


Program committee:

• Vladimir Dotsenko (Trinity College, Dublin)

• Yves Guiraud (INRIA / Université Paris 7)

• Jean-Pierre Jouannaud (École Polytechnique)

• Philippe Malbos (Université Claude Bernard Lyon 1)

• Paul-André Melliès (Université Paris 7)

• Samuel Mimram (École Polytechnique)

• Tim Porter (University of Wales, Bangor)

• Femke van Raamsdonk (VU University, Amsterdam)

Lebesgue’s Universal Covering Problem (Part 2)

3 February, 2015

A while back I described a century-old geometry problem posed by the famous French mathematician Lebesgue, inventor of our modern theory of areas and volumes.

This problem is famously difficult. So I’m happy to report some progress:

• John Baez, Karine Bagdasaryan and Philip Gibbs, Lebesgue’s universal covering problem.

But we’d like you to check our work! It will help if you’re good at programming. As far as the math goes, it’s just high-school geometry… carried to a fanatical level of intensity.

Here’s the story:

A subset of the plane has diameter 1 if the distance between any two points in this set is ≤ 1. You know what a circle of diameter 1 looks like. But an equilateral triangle with edges of length 1 also has diameter 1:

After all, two points in this triangle are farthest apart when they’re at two corners.

Note that this triangle doesn’t fit inside a circle of diameter 1:

There are lots of sets of diameter 1, so it’s interesting to look for a set that can contain them all.
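You can verify both facts about the equilateral triangle with a few lines of arithmetic. A quick sketch in Python:

```python
from math import dist, sqrt

# Vertices of an equilateral triangle with side length 1.
tri = [(0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)]

# Its diameter: the largest distance between two points, attained at corners.
diam = max(dist(p, q) for p in tri for q in tri)
print(diam)    # 1 up to rounding

# Its circumradius: the distance from the center to a corner.
center = (0.5, sqrt(3) / 6)
R = dist(center, tri[0])
print(R)       # 0.577..., bigger than 1/2, so it spills out of any such circle
```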

In 1914, the famous mathematician Henri Lebesgue sent a letter to a pal named Pál. And in this letter he challenged Pál to find the convex set with smallest possible area such that every set of diameter 1 fits inside.

More precisely, he defined a universal covering to be a convex subset of the plane that can cover a translated, reflected and/or rotated version of every subset of the plane with diameter 1. And his challenge was to find the universal covering with the least area.

Pál worked on this problem, and 6 years later he published a paper on it. He found a very nice universal covering: a regular hexagon in which one can inscribe a circle of diameter 1. This has area

\displaystyle{ \frac{\sqrt{3}}{2} \approx 0.86603 }
But he also found a universal covering with less area, by removing two triangles from this hexagon—for example, the triangles C1C2C3 and E1E2E3 here:

Our paper explains why you can remove these triangles, assuming the hexagon was a universal covering in the first place. The resulting universal covering has area

\displaystyle{ 2 - \frac{2}{\sqrt{3}} \approx 0.84530 }
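The areas of Pál’s hexagon and of his trimmed version follow from elementary geometry, and are easy to check. A quick sketch:

```python
from math import sqrt

# Regular hexagon circumscribing a circle of diameter 1:
# apothem 1/2 and side length 1/sqrt(3), so area = sqrt(3)/2.
hexagon = sqrt(3) / 2
print(hexagon)           # 0.866...

# Pal's hexagon with the two corner triangles removed.
pal = 2 - 2 / sqrt(3)
print(pal)               # 0.845...
```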
In 1936, Sprague went on to prove that more area could be removed from another corner of Pál’s original hexagon, giving a universal covering of area

\approx 0.8441377
In 1992, Hansen took these reductions even further by removing two more pieces from Pál’s hexagon. Each piece is a thin sliver bounded by two straight lines and an arc. The first piece is tiny. The second is downright microscopic!

Hansen claimed the areas of these regions were 4 \cdot 10^{-11} and 6 \cdot 10^{-18}. However, our paper redoes his calculation and shows that the second number is seriously wrong. The actual areas are 3.7507 \cdot 10^{-11} and 8.4460 \cdot 10^{-21}.

Philip Gibbs has created a Java applet illustrating Hansen’s universal cover. I urge you to take a look! You can zoom in and see the regions he removed:

• Philip Gibbs, Lebesgue’s universal covering problem.

I find that my laptop, a Windows machine, makes it hard to view Java applets because they’re a security risk. I promise this one is safe! To be able to view it, I had to go to the “Search programs and files” window, find the “Configure Java” program, go to “Security”, and add


to the “Exception Site List”. It’s easy once you know what to do.

And it’s worth it, because only the ability to zoom lets you get a sense of the puny slivers that Hansen removed! One is the region XE2T here, and the other is T’C3V:

You can use this picture to help you find these regions in Philip Gibbs’ applet. But this picture is not in scale! In fact the smaller region, T’C3V, has length 3.7 \cdot 10^{-7} and maximum width 1.4 \cdot 10^{-14}, tapering down to a very sharp point.

That’s only a few atoms wide if you draw the whole hexagon on paper! And it’s about 30 million times longer than it is wide. This is the sort of thing you can only draw with the help of a computer.

Anyway, Hansen’s best universal covering shaved only about 4 \cdot 10^{-11} off Sprague’s area.
This tiny improvement over Sprague’s work led Klee and Wagon to write:

it does seem safe to guess that progress on [this problem], which has been painfully slow in the past, may be even more painfully slow in the future.

However, our new universal covering removes about a million times more area than Hansen’s larger region: a whopping 2.233 \cdot 10^{-5}. So, we get a universal covering with area

\approx 0.8441153
The key is to slightly rotate the dodecagon shown in the above pictures, and then use the ideas of Pál and Sprague.

There’s a lot of room between our number and the best lower bound on this problem, due to Brass and Sharifi:

0.832
So, one way or another, we can expect a lot of progress now that computers are being brought to bear.

Read our paper for the details! If you want to check our work, we’ll be glad to answer lots of detailed questions. We want to rotate the dodecagon by an amount that minimizes the area of the universal covering we get, so we use a program to compute the area for many choices of rotation angle:

• Philip Gibbs, Java program.

The program is not very long—please study it or write your own, in your own favorite language! The output is here:

• Philip Gibbs, Java program output.

and as explained at the end of our paper, the best rotation angle is about 1.3°.

A Second Law for Open Markov Processes

15 November, 2014

guest post by Blake Pollard

What comes to mind when you hear the term ‘random process’? Do you think of Brownian motion? Do you think of particles hopping around? Do you think of a drunkard staggering home?

Today I’m going to tell you about a version of the drunkard’s walk with a few modifications. Firstly, we don’t have just one drunkard: we can have any positive real number of drunkards. Secondly, our drunkards have no memory; where they go next doesn’t depend on where they’ve been. Thirdly, there are special places, such as entrances to bars, where drunkards magically appear and disappear.

The second condition says that our drunkards satisfy the Markov property, making their random walk into a Markov process. The third condition is really what I want to tell you about, because it makes our Markov process into a more general ‘open Markov process’.

There is a collection of places the drunkards can be, for example:

V= \{ \text{bar},\text{sidewalk}, \text{street}, \text{taco truck}, \text{home} \}

We call this set V the set of states. There are certain probabilities per unit time associated with traveling between these places, called transition rates. For example, it is more likely for a drunkard to go from the bar to the taco truck than from the bar to home, so the transition rate between the bar and the taco truck should be greater than the transition rate from the bar to home. Sometimes you can’t get from one place to another without passing through intermediate places. In reality the drunkard can’t go directly from the bar to the taco truck: he or she has to go from the bar to the sidewalk to the taco truck.

This information can all be summarized by drawing a directed graph where the positive numbers labelling the edges are the transition rates:

For simplicity we draw only three states: home, bar, taco truck. Drunkards go from home to the bar and back, but they never go straight from home to the taco truck.

We can keep track of where all of our drunkards are using a vector with 3 entries:

\displaystyle{ p(t) = \left( \begin{array}{c} p_h(t) \\ p_b(t) \\ p_{tt}(t) \end{array} \right) \in \mathbb{R}^3 }

We call this our population distribution. The first entry p_h is the number of drunkards that are at home, the second p_b is how many are at the bar, and the third p_{tt} is how many are at the taco truck.

There is a set of coupled, linear, first-order differential equations we can write down using the information in our graph that tells us how the number of drunkards in each place changes with time. This is called the master equation:

\displaystyle{ \frac{d p}{d t} = H p }

where H is a 3×3 matrix which we call the Hamiltonian. The off-diagonal entries are nonnegative:

H_{ij} \geq 0, i \neq j

and the columns sum to zero:

\sum_i H_{ij}=0

We call a matrix satisfying these conditions infinitesimal stochastic. Stochastic matrices have columns that sum to one. If we take the exponential of an infinitesimal stochastic matrix we get a stochastic matrix, one whose columns sum to one: hence the label ‘infinitesimal’.

The Hamiltonian for the graph above is

H = \left( \begin{array}{ccc} -2 & 5 & 10 \\ 2 & -12 & 0 \\ 0 & 7 & -10 \end{array} \right)
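Here’s a quick numerical check of these conditions, and of the claim about exponentials (a sketch using scipy’s matrix exponential):

```python
import numpy as np
from scipy.linalg import expm

# The Hamiltonian from the graph above; states: home, bar, taco truck.
H = np.array([[ -2.,   5.,  10.],
              [  2., -12.,   0.],
              [  0.,   7., -10.]])

# Infinitesimal stochastic: off-diagonal entries >= 0, columns sum to 0.
off_diag = H[~np.eye(3, dtype=bool)]
print(np.all(off_diag >= 0))        # True
print(H.sum(axis=0))                # [0. 0. 0.]

# Its exponential is stochastic: columns sum to 1, for any time t >= 0.
for t in (0.1, 1.0, 10.0):
    U = expm(H * t)
    print(U.sum(axis=0))            # [1. 1. 1.] up to rounding
```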

John has written a lot about Markov processes and infinitesimal stochastic Hamiltonians in previous posts.

Given two vectors p,q \in \mathbb{R}^3 describing the populations of drunkards which obey the same master equation, we can calculate the relative entropy of p relative to q:

\displaystyle{ S(p,q) = \sum_{ i \in V} p_i \ln \left( \frac{p_i}{q_i} \right) }

This is an example of a ‘divergence’. In statistics, a divergence is a way of measuring the distance between probability distributions; it may not be symmetric, and it may not even obey the triangle inequality.

The relative entropy is important because it decreases monotonically with time, making it a Lyapunov function for Markov processes. Indeed, it is a well known fact that

\displaystyle{ \frac{dS(p(t),q(t) ) } {dt} \leq 0 }

This is true for any two population distributions which evolve according to the same master equation, though you have to allow infinity as a possible value for the relative entropy and negative infinity for its time derivative.

Why is entropy decreasing? Doesn’t the Second Law of Thermodynamics say entropy increases?

Don’t worry: the reason is that I have not put a minus sign in my definition of relative entropy. Put one in if you like, and then it will increase. Sometimes without the minus sign it’s called the Kullback–Leibler divergence. This decreases with the passage of time, saying that any two population distributions p(t) and q(t) get ‘closer together’ as they get randomized with the passage of time.
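Here’s a quick numerical illustration of this monotone decrease, using the Hamiltonian from the graph above and two arbitrary initial populations (a sketch; scipy’s matrix exponential is assumed):

```python
import numpy as np
from scipy.linalg import expm

# The Hamiltonian from the graph above; states: home, bar, taco truck.
H = np.array([[ -2.,   5.,  10.],
              [  2., -12.,   0.],
              [  0.,   7., -10.]])

def S(p, q):
    """Relative entropy of p relative to q."""
    return float(np.sum(p * np.log(p / q)))

# Two arbitrary initial population distributions.
p0 = np.array([0.7, 0.2, 0.1])
q0 = np.array([0.1, 0.3, 0.6])

times = np.linspace(0.0, 1.0, 50)
vals = [S(expm(H * t) @ p0, expm(H * t) @ q0) for t in times]
print(all(b <= a + 1e-9 for a, b in zip(vals, vals[1:])))  # True: S decreases
```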

That itself is a nice result, but I want to tell you what happens when you allow drunkards to appear and disappear at certain states. Drunkards appear at the bar once they’ve had enough to drink, and once they are home for long enough they can disappear. The set B of places where drunkards can appear or disappear is called the set of boundary states. So for the above process

B = \{ \text{home},\text{bar} \}

is the set of boundary states. This changes the way in which the population of drunkards changes with time!

The drunkards at the taco truck obey the master equation. For them,

\displaystyle{ \frac{dp_{tt}}{dt} = 7p_b -10 p_{tt} }

still holds. But because the populations can appear or disappear at the boundary states the master equation no longer holds at those states! Instead it is useful to define the flow of drunkards into the i^{th} state by

\displaystyle{ \frac{Dp_i}{Dt} = \frac{dp_i}{dt}-\sum_j H_{ij} p_j}

This quantity describes by how much the rate of change of the population at the i^{th} state differs from that given by the master equation.
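To make this concrete, here’s a hypothetical scenario with made-up numbers: clamp the two boundary populations at fixed values and put the internal taco-truck population at its steady value. The boundary flows are then exactly the rates at which drunkards must appear and disappear (the Hamiltonian is the one from the graph above):

```python
import numpy as np

# States, in order: home, bar, taco truck.
H = np.array([[ -2.,   5.,  10.],
              [  2., -12.,   0.],
              [  0.,   7., -10.]])

# Hypothetical clamped boundary populations (made-up numbers).
p_home, p_bar = 2.0, 10.0
p_tt = 7 * p_bar / 10          # internal steady state: 7 p_b - 10 p_tt = 0
p = np.array([p_home, p_bar, p_tt])

dpdt = np.zeros(3)             # nothing changes: clamped + steady internal
flow = dpdt - H @ p            # the flow Dp_i/Dt at each state
print(flow)   # drunkards vanish at home (-116), appear at the bar (+116),
              # and there is no flow at the internal taco-truck state
```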

The reason why we are interested in open Markov processes is because you can take two open Markov processes and glue them together along some subset of their boundary states to get a new open Markov process! This allows us to build up or break down complicated Markov processes using open Markov processes as the building blocks.

For example we can draw the graph corresponding to the drunkards’ walk again, only now we will distinguish boundary states from internal states by coloring internal states blue and having boundary states be white:

Consider another open Markov process with states

V=\{ \text{home},\text{work},\text{bar} \}

where

B=\{ \text{home}, \text{bar}\}

are the boundary states, leaving

\text{work}

as an internal state:

Since the boundary states of this process overlap with the boundary states of the first process we can compose the two to form a new Markov process:

Notice the boundary states are now internal states. I hope any Markov process that could approximately model your behavior has more interesting nodes! There is a nice way to figure out the Hamiltonian of the composite from the Hamiltonians of the pieces, but we will leave that for another time.

We can ask ourselves, how does relative entropy change with time in open Markov processes? You can read my paper for the details, but here is the punchline:

\displaystyle{ \frac{dS(p(t),q(t) ) }{dt} \leq \sum_{i \in B} \frac{Dp_i}{Dt}\frac{\partial S}{\partial p_i} + \frac{Dq_i}{Dt}\frac{\partial S}{\partial q_i} }

This is a version of the Second Law of Thermodynamics for open Markov processes.

It is important to notice that the sum is only over the boundary states! This inequality tells us that relative entropy still decreases inside our process, but depending on the flow of populations through the boundary states the relative entropy of the whole process could either increase or decrease! This inequality will be important when we study how the relative entropy changes in different parts of a bigger more complicated process.

That is all for now, but I leave it as an exercise for you to imagine a Markov process that describes your life. How many states does it have? What are the relative transition rates? Are there states you would like to spend more or less time in? Are there states somewhere you would like to visit?

Here is my paper, which proves the above inequality:

• Blake Pollard, A Second Law for open Markov processes.

If you have comments or corrections, let me know!

