Network Theory (Part 23)

21 August, 2012

We’ve been looking at reaction networks, and we’re getting ready to find equilibrium solutions of the equations they give. To do this, we’ll need to connect them to another kind of network we’ve studied. A reaction network is something like this:

It’s a bunch of complexes, which are sums of basic building-blocks called species, together with arrows called transitions going between the complexes. If we know a number for each transition describing the rate at which it occurs, we get an equation called the ‘rate equation’. This describes how the amount of each species changes with time. We’ve been talking about this equation ever since the start of this series! Last time, we wrote it down in a new very compact form:

\displaystyle{ \frac{d x}{d t} = Y H x^Y  }

Here x is a vector whose components are the amounts of each species, while H and Y are certain matrices.

But now suppose we forget how each complex is made of species! Suppose we just think of them as abstract things in their own right, like numbered boxes:

We can use these boxes to describe states of some system. The arrows still describe transitions, but now we think of these as ways for the system to hop from one state to another. Say we know a number for each transition describing the probability per time at which it occurs:

Then we get a ‘Markov process’—or in other words, a random walk where our system hops from one state to another. If \psi is the probability distribution saying how likely the system is to be in each state, this Markov process is described by this equation:

\displaystyle{ \frac{d \psi}{d t} = H \psi  }

This is simpler than the rate equation, because it’s linear. But the matrix H is the same—we’ll see that explicitly later on today.

What’s the point? Well, our ultimate goal is to prove the deficiency zero theorem, which gives equilibrium solutions of the rate equation. That means finding x with

Y H x^Y = 0

Today we’ll find all equilibria for the Markov process, meaning all \psi with

H \psi = 0

Then next time we’ll show some of these have the form

\psi = x^Y

So, we’ll get

H x^Y = 0

and thus

Y H x^Y = 0

as desired!

So, let’s get to work.

The Markov process of a graph with rates

We’ve been looking at stochastic reaction networks, which are things like this:

However, we can build a Markov process starting from just part of this information:

Let’s call this thing a ‘graph with rates’, for lack of a better name. We’ve been calling the things in K ‘complexes’, but now we’ll think of them as ‘states’. So:

Definition. A graph with rates consists of:

• a finite set of states K,

• a finite set of transitions T,

• a map r: T \to (0,\infty) giving a rate constant for each transition,

• source and target maps s,t : T \to K saying where each transition starts and ends.

Starting from this, we can get a Markov process describing how a probability distribution \psi on our set of states will change with time. As usual, this Markov process is described by a master equation:

\displaystyle{ \frac{d \psi}{d t} = H \psi }

for some Hamiltonian:

H : \mathbb{R}^K \to \mathbb{R}^K

What is this Hamiltonian, exactly? Let’s think of it as a matrix where H_{i j} is the probability per time for our system to hop from the state j to the state i. This looks backwards, but don’t blame me—blame the guys who invented the usual conventions for matrix algebra. Clearly if i \ne j this probability per time should be the sum of the rate constants of all transitions going from j to i:

\displaystyle{ i \ne j \quad \Rightarrow \quad H_{i j} =  \sum_{\tau: j \to i} r(\tau) }

where we write \tau: j \to i when \tau is a transition with source j and target i.

Now, we saw in Part 11 that for a probability distribution to remain a probability distribution as it evolves in time according to the master equation, we need H to be infinitesimal stochastic: its off-diagonal entries must be nonnegative, and the sum of the entries in each column must be zero.

The first condition holds already, and the second one tells us what the diagonal entries must be. So, we’re basically done describing H. But we can summarize it this way:

Puzzle 1. Think of \mathbb{R}^K as the vector space consisting of finite linear combinations of elements \kappa \in K. Then show

\displaystyle{  H \kappa = \sum_{s(\tau) = \kappa} r(\tau) (t(\tau) - s(\tau)) }
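If you like to see this concretely, here’s a minimal numpy sketch of the recipe, with the states, transitions and rate constants all made up for illustration:

```python
import numpy as np

# A hypothetical graph with rates: 3 states and 4 transitions,
# each transition given as (source, target, rate constant).
n_states = 3
transitions = [(0, 1, 2.0), (1, 0, 1.0), (1, 2, 3.0), (2, 1, 0.5)]

# Off-diagonal entry H[i,j] is the sum of the rate constants of
# all transitions from state j to state i...
H = np.zeros((n_states, n_states))
for source, target, rate in transitions:
    H[target, source] += rate

# ...and the diagonal entries are forced by requiring each column
# to sum to zero, making H infinitesimal stochastic.
H -= np.diag(H.sum(axis=0))

assert np.allclose(H.sum(axis=0), 0)
```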

Equilibrium solutions of the master equation

Now we’ll classify equilibrium solutions of the master equation, meaning \psi \in \mathbb{R}^K with

H \psi = 0

We’ll only do this when our graph with rates is ‘weakly reversible’. This concept doesn’t actually depend on the rates, so let’s be general and say:

Definition. A graph is weakly reversible if for every edge \tau : i \to j, there is a directed path going back from j to i, meaning that we have edges

\tau_1 : j \to j_1 , \quad \tau_2 : j_1 \to j_2 , \quad \dots, \quad \tau_n: j_{n-1} \to i

This graph with rates is not weakly reversible:

but this one is:

The good thing about the weakly reversible case is that we get one equilibrium solution of the master equation for each component of our graph, and all equilibrium solutions are linear combinations of these. This is not true in general! For example, this guy is not weakly reversible:

It has only one component, but the master equation has two linearly independent equilibrium solutions: one that vanishes except at the state 0, and one that vanishes except at the state 2.

The idea of a ‘component’ is supposed to be fairly intuitive—our graph falls apart into pieces called components—but we should make it precise. As explained in Part 21, the graphs we’re using here are directed multigraphs, meaning things like

s, t : E \to V

where E is the set of edges (our transitions) and V is the set of vertices (our states). There are actually two famous concepts of ‘component’ for graphs of this sort: ‘strongly connected’ components and ‘connected’ components. We only need connected components, but let me explain both concepts, in a futile attempt to slake your insatiable thirst for knowledge.

Two vertices i and j of a graph lie in the same strongly connected component iff you can find a directed path of edges from i to j and also one from j back to i.

Remember, a directed path from i to j looks like this:

i \to a \to b \to c \to j

Here’s a path from i to j that is not directed:

i \to a \leftarrow b \to c \to j

and I hope you can write down the obvious but tedious definition of an ‘undirected path’, meaning a path made of edges that don’t necessarily point in the correct direction. Given that, we say two vertices i and j lie in the same connected component iff you can find an undirected path going from i to j. In this case, there will automatically also be an undirected path going from j to i.

For example, i and j lie in the same connected component here, but not the same strongly connected component:

i \to a \leftarrow b \to c \to j

Here’s a graph with one connected component and 3 strongly connected components, which are marked in blue:

For the theory we’re looking at now, we only care about connected components, not strongly connected components! However:

Puzzle 2. Show that for a weakly reversible graph, the connected components are the same as the strongly connected components.

With these definitions out of the way, we can state today’s big theorem:

Theorem. Suppose H is the Hamiltonian of a weakly reversible graph with rates:

Then for each connected component C \subseteq K, there exists a unique probability distribution \psi_C \in \mathbb{R}^K that is positive on that component, zero elsewhere, and is an equilibrium solution of the master equation:

H \psi_C = 0

Moreover, these probability distributions \psi_C form a basis for the space of equilibrium solutions of the master equation. So, the dimension of this space is the number of components of K.

Proof. We start by assuming our graph has one connected component. We use the Perron–Frobenius theorem, as explained in Part 20. This applies to ‘nonnegative’ matrices, meaning those whose entries are all nonnegative. That is not true of H itself, but only its diagonal entries can be negative, so if we choose a large enough number c > 0, H + c I will be nonnegative.

Since our graph is weakly reversible and has one connected component, it follows straight from the definitions that the operator H + c I will also be ‘irreducible’ in the sense of Part 20. The Perron–Frobenius theorem then swings into action, and we instantly conclude several things.

First, H + c I has a positive real eigenvalue r such that any other eigenvalue, possibly complex, has absolute value \le r. Second, there is an eigenvector \psi with eigenvalue r and all positive components. Third, any other eigenvector with eigenvalue r is a scalar multiple of \psi.

Subtracting c, it follows that \lambda = r - c is the eigenvalue of H with the largest real part. We have H \psi = \lambda \psi, and any other vector with this property is a scalar multiple of \psi.

We can show that in fact \lambda = 0. To do this we copy an argument from Part 20. First, since \psi is positive we can normalize it to be a probability distribution:

\displaystyle{ \sum_{i \in K} \psi_i = 1 }

Since H is infinitesimal stochastic, \exp(t H) sends probability distributions to probability distributions:

\displaystyle{ \sum_{i \in K} (\exp(t H) \psi)_i = 1 }

for all t \ge 0. On the other hand,

\displaystyle{ \sum_{i \in K} (\exp(t H)\psi)_i = \sum_{i \in K} e^{t \lambda} \psi_i = e^{t \lambda} }

so we must have \lambda = 0.

We conclude that when our graph has one connected component, there is a probability distribution \psi \in \mathbb{R}^K that is positive everywhere and has H \psi = 0. Moreover, any \phi \in \mathbb{R}^K with H \phi = 0 is a scalar multiple of \psi.

When K has several components, the matrix H is block diagonal, with one block for each component. So, we can run the above argument on each component C \subseteq K and get a probability distribution \psi_C \in \mathbb{R}^K that is positive on C. We can then check that H \psi_C = 0 and that every \phi \in \mathbb{R}^K with H \phi = 0 can be expressed as a linear combination of these probability distributions \psi_C in a unique way.   █
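If you want to see the theorem in action, here’s a little numpy/scipy sketch along the same lines, using an invented weakly reversible graph with two connected components; the kernel of H comes out 2-dimensional, one equilibrium solution per component:

```python
import numpy as np
from scipy.linalg import null_space

# A hypothetical weakly reversible graph with two connected
# components, {0,1} and {2,3}; transitions are (source, target, rate).
transitions = [(0, 1, 1.0), (1, 0, 2.0), (2, 3, 1.0), (3, 2, 3.0)]
H = np.zeros((4, 4))
for source, target, rate in transitions:
    H[target, source] += rate
H -= np.diag(H.sum(axis=0))

# The space of equilibrium solutions H psi = 0 should be
# 2-dimensional: one basis solution per connected component.
kernel = null_space(H)
assert kernel.shape[1] == 2
assert np.allclose(H @ kernel, 0)
```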

This result must be absurdly familiar to people who study Markov processes, but I haven’t bothered to look up a reference yet. Do you happen to know a good one? I’d like to see one that generalizes this theorem to graphs that aren’t weakly reversible. I think I see how it goes. We don’t need that generalization right now, but it would be good to have around.

The Hamiltonian, revisited

One last small piece of business: last time I showed you a very slick formula for the Hamiltonian H. I’d like to prove it agrees with the formula I gave this time.

We start with any graph with rates:

We extend s and t to linear maps between vector spaces:

s, t : \mathbb{R}^T \to \mathbb{R}^K

We define the boundary operator just as we did last time:

\partial = t - s

Then we put an inner product on the vector spaces \mathbb{R}^T and \mathbb{R}^K. So, for \mathbb{R}^K we let the elements of K be an orthonormal basis, but for \mathbb{R}^T we define the inner product in a more clever way involving the rate constants:

\displaystyle{ \langle \tau, \tau' \rangle = \frac{1}{r(\tau)} \delta_{\tau, \tau'} }

where \tau, \tau' \in T. This lets us define adjoints of the maps s, t and \partial, via formulas like this:

\langle s^\dagger \phi, \psi \rangle = \langle \phi, s \psi \rangle

Then:

Theorem. The Hamiltonian for a graph with rates is given by

H = \partial s^\dagger

Proof. It suffices to check that this formula agrees with the formula for H given in Puzzle 1:

\displaystyle{   H \kappa = \sum_{s(\tau) = \kappa} r(\tau) (t(\tau) - s(\tau)) }

Here we are using the complex \kappa \in K as a name for one of the standard basis vectors of \mathbb{R}^K. Similarly, we shall write things like \tau or \tau' for basis vectors of \mathbb{R}^T.

First, we claim that

\displaystyle{ s^\dagger \kappa = \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \tau }

To prove this it’s enough to check that taking the inner product of each side with an arbitrary basis vector \tau', we get results that agree. On the one hand:

\begin{array}{ccl}  \langle \tau' , s^\dagger \kappa \rangle &=&   \langle s \tau', \kappa \rangle \\  \\  &=& \delta_{s(\tau'), \kappa}    \end{array}

On the other hand:

\begin{array}{ccl} \displaystyle{ \langle \tau', \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \tau \rangle } &=&  \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \langle \tau', \tau \rangle   \\  \\  &=& \displaystyle{ \sum_{\tau: \; s(\tau) = \kappa} \delta_{\tau', \tau} }   \\  \\  &=&   \delta_{s(\tau'), \kappa}   \end{array}

where the factor of 1/r(\tau) in the inner product on \mathbb{R}^T cancels the visible factor of r(\tau). So indeed the results match.

Using this formula for s^\dagger \kappa we now see that

\begin{array}{ccl}  H \kappa &=& \partial s^\dagger \kappa   \\  \\  &=& \partial \displaystyle{ \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \tau }    \\  \\  &=& \displaystyle{ \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, (t(\tau) - s(\tau)) }  \end{array}

which is precisely what we want.   █
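If you prefer matrices to basis vectors, this proof is easy to check numerically. In the standard bases, s and t become 0/1 matrices S and T, and the weighted inner product on \mathbb{R}^T makes the adjoint work out to s^\dagger = \mathrm{diag}(r) \, S^\top, so H = (T - S)\,\mathrm{diag}(r)\, S^\top. Here’s a sketch with made-up data comparing this against the formula from Puzzle 1:

```python
import numpy as np

# Hypothetical data: 3 states, transitions as (source, target, rate).
transitions = [(0, 1, 2.0), (1, 2, 1.0), (2, 0, 0.5)]
n_states, n_trans = 3, len(transitions)
S = np.zeros((n_states, n_trans))   # matrix of the source map s
T = np.zeros((n_states, n_trans))   # matrix of the target map t
r = np.array([rate for _, _, rate in transitions])
for k, (src, tgt, _) in enumerate(transitions):
    S[src, k] = 1.0
    T[tgt, k] = 1.0

# The slick formula: H = boundary . s_dagger = (T - S) diag(r) S^T.
H_slick = (T - S) @ np.diag(r) @ S.T

# The formula from Puzzle 1, transition by transition:
# H kappa = sum over tau with s(tau) = kappa of r(tau)(t(tau) - s(tau)).
H_direct = np.zeros((n_states, n_states))
for src, tgt, rate in transitions:
    H_direct[tgt, src] += rate
    H_direct[src, src] -= rate

assert np.allclose(H_slick, H_direct)
```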

I hope you see through the formulas to their intuitive meaning. As usual, the formulas are just a way of precisely saying something that makes plenty of sense. If \kappa is some state of our Markov process, s^\dagger \kappa is the sum of all transitions starting at this state, weighted by their rates. Applying \partial to a transition tells us what change in state it causes. So \partial s^\dagger \kappa tells us the rate at which things change when we start in the state \kappa. That’s why \partial s^\dagger is the Hamiltonian for our Markov process. After all, the Hamiltonian tells us how things change:

\displaystyle{ \frac{d \psi}{d t} = H \psi }

Okay, we’ve got all the machinery in place. Next time we’ll prove the deficiency zero theorem!


Network Theory (Part 22)

16 August, 2012

Okay, now let’s dig deeper into the proof of the deficiency zero theorem. We’re only going to prove a baby version, at first. Later we can enhance it:

Deficiency Zero Theorem (Baby Version). Suppose we have a weakly reversible reaction network with deficiency zero. Then for any choice of rate constants there exists an equilibrium solution of the rate equation where all species are present in nonzero amounts.

The first step is to write down the rate equation in a new, more conceptual way. It’s incredibly cool. You’ve probably seen Schrödinger’s equation, which describes the motion of a quantum particle:

\displaystyle { \frac{d \psi}{d t} = -i H \psi }

If you’ve been following this series, you’ve probably also seen the master equation, which describes the motion of a ‘stochastic’ particle:

\displaystyle { \frac{d \psi}{d t} = H \psi }

A ‘stochastic’ particle is one that’s carrying out a random walk, and now \psi describes its probability to be somewhere, instead of its amplitude.

Today we’ll see that the rate equation for a reaction network looks somewhat similar:

\displaystyle { \frac{d x}{d t} =  Y H x^Y }

where Y is some matrix, and x^Y is defined using a new thing called ‘matrix exponentiation’, which makes the equation nonlinear!

If you’re reading this you probably know how to multiply a vector by a matrix. But if you’re like me, you’ve never seen anyone take a vector and raise it to the power of some matrix! I’ll explain it, don’t worry… right now I’m just trying to get you intrigued. It’s not complicated, but it’s exciting how this unusual operation shows up naturally in chemistry. That’s just what I’m looking for these days: new math ideas that show up in practical subjects like chemistry, and new ways that math can help people in these subjects.

Since we’re looking for an equilibrium solution of the rate equation, we actually want to solve

\displaystyle { \frac{d x}{d t} =  0 }

or in other words

Y H x^Y = 0

In fact we will do better: we will find a solution of

H x^Y = 0

And we’ll do this in two stages:

• First we’ll find all solutions of

H \psi = 0

This equation is linear, so it’s easy to understand.

• Then, among these solutions \psi, we’ll find one that also obeys

\psi = x^Y

This is a nonlinear problem involving matrix exponentiation, but still, we can do it, using a clever trick called ‘logarithms’.

Putting the pieces together, we get our solution of

H x^Y = 0

and thus our equilibrium solution of the rate equation:

\displaystyle { \frac{d x}{d t} = Y H x^Y = 0 }

That’s a rough outline of the plan. But now let’s get started, because the details are actually fascinating. Today I’ll just show you how to rewrite the rate equation in this new way.

The rate equation

Remember how the rate equation goes. We start with a stochastic reaction network, meaning a little diagram like this:

This contains quite a bit of information:

• a finite set T of transitions,

• a finite set K of complexes,

• a finite set S of species,

• a map r: T \to (0,\infty) giving a rate constant for each transition,

• source and target maps s,t : T \to K saying where each transition starts and ends,

• a one-to-one map Y : K \to \mathbb{N}^S saying how each complex is made of species.

Given all this, the rate equation says how the amount of each species changes with time. We describe these amounts with a vector x \in [0,\infty)^S. So, we want a differential equation filling in the question marks here:

\displaystyle{ \frac{d x}{d t} = ??? }

Now last time, we started by thinking of K as a subset of \mathbb{N}^S, and thus of the vector space \mathbb{R}^S. Back then, we wrote the rate equation as follows:

\displaystyle{ \frac{d x}{d t} = \sum_{\tau \in T} r(\tau) \; \left(t(\tau) - s(\tau)\right) \; x^{s(\tau)} }

where vector exponentiation is defined by

x^s = x_1^{s_1} \cdots x_k^{s_k}

when x and s are vectors in \mathbb{R}^S.

However, we’ve now switched to thinking of our set of complexes K as a set in its own right that is mapped into \mathbb{N}^S by Y. This is good for lots of reasons, like defining the concept of ‘deficiency’, which we did last time. But it means the rate equation above doesn’t quite parse anymore! Things like s(\tau) and t(\tau) live in K; we need to explicitly convert them into elements of \mathbb{R}^S using Y for our equation to make sense!

So now we have to write the rate equation like this:

\displaystyle{ \frac{d x}{d t} = Y \sum_{\tau \in T} r(\tau) \;  \left(t(\tau) - s(\tau)\right) \; x^{Y s(\tau)} }

This looks uglier, but if you’ve got even one mathematical bone in your body, you can already see vague glimmerings of how we’ll rewrite this the way we want:

\displaystyle { \frac{d x}{d t} =  Y H x^Y }

Here’s how.

First, we extend our maps s, t and Y to linear maps between vector spaces:

s, t : \mathbb{R}^T \to \mathbb{R}^K, \qquad Y : \mathbb{R}^K \to \mathbb{R}^S

Then, we put an inner product on the vector spaces \mathbb{R}^T, \mathbb{R}^K and \mathbb{R}^S. For \mathbb{R}^K we do this in the most obvious way, by letting the complexes be an orthonormal basis. So, given two complexes \kappa, \kappa', we define their inner product by

\langle \kappa, \kappa' \rangle = \delta_{\kappa, \kappa'}

We do the same for \mathbb{R}^S. But for \mathbb{R}^T we define the inner product in a more clever way involving the rate constants. If \tau, \tau' \in T are two transitions, we define their inner product by:

\langle \tau, \tau' \rangle = \frac{1}{r(\tau)} \delta_{\tau, \tau'}

This will seem perfectly natural when we continue our study of circuits made of electrical resistors, and if you’re very clever you can already see it lurking in Part 16. But never mind.

Having put inner products on these three vector spaces, we can take the adjoints of the linear maps between them, to get linear maps going back the other way:

These are defined in the usual way—though we’re using daggers here the way physicists do, where mathematicians would prefer to see stars! For example, s^\dagger : \mathbb{R}^K \to \mathbb{R}^T is defined by the relation

\langle s^\dagger \phi, \psi \rangle = \langle \phi, s \psi \rangle

and so on.

Next, we set up a random walk on the set of complexes. Remember, our reaction network is a graph with complexes as vertices and transitions as edges, like this:

Each transition \tau has a number attached to it: the rate constant r(\tau). So, we can randomly hop from complex to complex along these transitions, with probabilities per unit time described by these numbers. The probability of being at some particular complex will then be described by a function

\psi : K \to \mathbb{R}

which also depends on time, and changes according to the equation

\displaystyle { \frac{d \psi}{d t} = H \psi }

for some Hamiltonian

H : \mathbb{R}^K \to \mathbb{R}^K

I defined this Hamiltonian back in Part 15, but now I see a slicker way to write it:

H = (t - s) s^\dagger

I’ll justify this next time. For now, the main point is that with this Hamiltonian, the rate equation is equivalent to this:

\displaystyle{ \frac{d x}{d t} = Y H x^Y }

The only thing I haven’t defined yet is the funny exponential x^Y. That’s what makes the equation nonlinear. We’re taking a vector to the power of a matrix and getting a vector. This sounds weird—but it actually makes sense!

It only makes sense because we have chosen bases for our vector spaces. To understand it, let’s number our species 1, \dots, k as we’ve been doing all along, and number our complexes 1, \dots, \ell. Our linear map Y : \mathbb{R}^K \to \mathbb{R}^S then becomes a k \times \ell matrix of natural numbers. Its entries say how many times each species shows up in each complex:

Y = \left( \begin{array}{cccc}   Y_{11} & Y_{12}  & \cdots & Y_{1 \ell} \\  Y_{21} & Y_{22}  & \cdots & Y_{2 \ell} \\  \vdots  & \vdots   & \ddots & \vdots \\  Y_{k1} & Y_{k2}  & \cdots & Y_{k \ell} \end{array} \right)

The entry Y_{i j} says how many times the ith species shows up in the jth complex.

Now, let’s be a bit daring and think of the vector x \in \mathbb{R}^S as a row vector with k entries:

x = \left(x_1 , x_2 ,  \dots ,  x_k \right)

Then we can multiply x on the right by the matrix Y and get a vector in \mathbb{R}^K:

\begin{array}{ccl} x Y &=& ( x_1 , x_2, \dots, x_k) \;   \left( \begin{array}{cccc}   Y_{11} & Y_{12}  & \cdots & Y_{1 \ell} \\  Y_{21} & Y_{22}  & \cdots & Y_{2 \ell} \\  \vdots  & \vdots   & \ddots & \vdots \\  Y_{k1} & Y_{k2}  & \cdots & Y_{k \ell} \end{array} \right)   \\  \\  &=& \left( x_1 Y_{11} + \cdots + x_k Y_{k1}, \; \dots, \; x_1 Y_{1 \ell} + \cdots + x_k Y_{k \ell} \right)   \end{array}

So far, no big deal. But now you’re ready to see the definition of x^Y, which is very similar:

\displaystyle{ x^Y = (x_1^{Y_{11}} \cdots  x_k^{Y_{k1}} ,  \dots,  x_1^{Y_{1 \ell}} \cdots x_k^{Y_{k \ell}} ) }

It’s exactly the same, but with multiplication replacing addition, and exponentiation replacing multiplication! Apparently my class on matrices stopped too soon: we learned about matrix multiplication, but matrix exponentiation is also worthwhile.
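In code the analogy is literally one line. Here’s a tiny numpy sketch, with made-up numbers, showing x Y and x^Y side by side:

```python
import numpy as np

x = np.array([2.0, 3.0])        # amounts of k = 2 species
Y = np.array([[1, 0, 2],
              [0, 1, 1]])       # k x l matrix: species in each complex

# Ordinary matrix multiplication: (x Y)_j = sum_i x_i * Y_ij
print(x @ Y)                              # [2. 3. 7.]

# Matrix exponentiation: (x^Y)_j = prod_i x_i ** Y_ij
print(np.prod(x[:, None] ** Y, axis=0))   # [ 2.  3. 12.]
```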

What’s the point of it? Well, suppose you have a certain number of hydrogen molecules, a certain number of oxygen molecules, a certain number of water molecules, and so on—a certain number of things of each species. You can list these numbers and get a vector x \in \mathbb{R}^S. Then the components of x^Y describe how many ways you can build up each complex from the things you have. For example,

x_1^{Y_{11}} x_2^{Y_{21}} \cdots  x_k^{Y_{k1}}

says roughly how many ways you can build complex 1 by picking Y_{11} things of species 1, Y_{21} things of species 2, and so on.

Why ‘roughly’? Well, we’re pretending we can pick the same thing twice. So if we have 4 water molecules and we need to pick 3, this formula gives 4^3. The right answer is 4 \times 3 \times 2. To get this answer we’d need to use the ‘falling power’ 4^{\underline{3}} = 4 \times 3 \times 2, as explained in Part 4. But the rate equation describes chemistry in the limit where we have lots of things of each species. In this limit, the ordinary power becomes a good approximation.
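You can watch the approximation kick in as the numbers grow. A quick sketch:

```python
def falling_power(x, s):
    # x * (x - 1) * ... * (x - s + 1)
    result = 1
    for m in range(s):
        result *= x - m
    return result

for x in [4, 100, 10_000]:
    print(x, x**3, falling_power(x, 3), falling_power(x, 3) / x**3)
# The ratio tends to 1: 0.375 at x = 4, 0.9702 at x = 100,
# 0.9997 at x = 10000.
```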

Puzzle. In this post we’ve seen a vector raised to a matrix power, which is a vector, and also a vector raised to a vector power, which is a number. How are they related?

There’s more to say about this, which I’d be glad to explain if you’re interested. But let’s get to the punchline:

Theorem. The rate equation:

\displaystyle{ \frac{d x}{d t} = Y \sum_{\tau \in T} r(\tau) \; \left(t(\tau) - s(\tau)\right) \; x^{Y s(\tau)} }

is equivalent to this equation:

\displaystyle{ \frac{d x}{d t} = Y (t - s) s^\dagger x^Y }

or in other words:

\displaystyle{ \frac{d x}{d t} = Y H x^Y }

Proof. It’s enough to show

\displaystyle{ (t - s) s^\dagger x^Y = \sum_{\tau \in T} r(\tau) \; \left(t(\tau) - s(\tau)\right) \; x^{Y s(\tau)} }

So, we’ll compute (t - s) s^\dagger x^Y, and think about the meaning of each quantity we get en route.

We start with x \in \mathbb{R}^S. This is a list of numbers saying how many things of each species we have: our raw ingredients, as it were. Then we compute

x^Y = (x_1^{Y_{11}} \cdots  x_k^{Y_{k1}} ,  \dots,  x_1^{Y_{1 \ell}} \cdots x_k^{Y_{k \ell}} )

This is a vector in \mathbb{R}^K. It’s a list of numbers saying how many ways we can build each complex starting from our raw ingredients.

Alternatively, we can write this vector x^Y as a sum over basis vectors:

x^Y = \sum_{\kappa \in K} x_1^{Y_{1\kappa}} \cdots  x_k^{Y_{k\kappa}} \; \kappa

Next let’s apply s^\dagger to this. We claim that

\displaystyle{ s^\dagger \kappa = \sum_{\tau \; : \; s(\tau) = \kappa} r(\tau) \; \tau }

In other words, we claim s^\dagger \kappa is the sum of all the transitions having \kappa as their source, weighted by their rate constants! To prove this claim, it’s enough to take the inner product of each side with any transition \tau', and check that we get the same answer. For the left side we get

\langle s^\dagger \kappa, \tau' \rangle = \langle \kappa, s(\tau') \rangle = \delta_{\kappa, s (\tau') }

To compute the right side, we need to use the cleverly chosen inner product on \mathbb{R}^T. Here we get

\displaystyle{ \left\langle \sum_{\tau \; : \; s(\tau) = \kappa} r(\tau) \tau, \; \tau' \right\rangle =  \sum_{\tau \; : \; s(\tau) = \kappa} \delta_{\tau, \tau'} = \delta_{\kappa, s(\tau')}  }

In the first step here, the factor of 1 /r(\tau) in the cleverly chosen inner product canceled the visible factor of r(\tau). For the second step, you just need to think for half a minute—or ten, depending on how much coffee you’ve had.

Either way, we conclude that indeed

\displaystyle{ s^\dagger \kappa = \sum_{\tau \; : \; s(\tau) = \kappa} r(\tau) \tau }

Next let’s combine this with our formula for x^Y:

\displaystyle { x^Y = \sum_{\kappa \in K} x_1^{Y_{1\kappa}} \cdots  x_k^{Y_{k\kappa}} \; \kappa }

We get this:

\displaystyle { s^\dagger x^Y = \sum_{\kappa, \tau \; : \; s(\tau) = \kappa} r(\tau) \; x_1^{Y_{1\kappa}} \cdots  x_k^{Y_{k\kappa}} \; \tau }

In other words, s^\dagger x^Y is a linear combination of transitions, where each one is weighted both by the rate it happens and how many ways it can happen starting with our raw ingredients.

Our goal is to compute (t - s)s^\dagger x^Y. We’re almost there. Remember, s says which complex is the input of a given transition, and t says which complex is the output. So, (t - s) s^\dagger x^Y says the total rate at which complexes are created and/or destroyed starting with the species in x as our raw ingredients.

That sounds good. But let’s just pedantically check that everything works. Applying t - s to both sides of our last equation, we get

\displaystyle{ (t - s) s^\dagger x^Y = \sum_{\kappa, \tau \; : \; s(\tau) = \kappa} r(\tau) \; x_1^{Y_{1\kappa}} \cdots  x_k^{Y_{k\kappa}} \; \left( t(\tau) - s(\tau)\right) }

Remember, our goal was to prove that this equals

\displaystyle{ \sum_{\tau \in T} r(\tau) \; \left(t(\tau) - s(\tau)\right) \; x^{Y s(\tau)} }

But if you stare at these a while and think, you’ll see they’re equal.   █
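If staring doesn’t do it for you, here’s a numerical spot check of the theorem. The network is invented for illustration: two species A and B, two complexes 2A and B, and the reversible pair of transitions 2A \to B, B \to 2A:

```python
import numpy as np

# Hypothetical network: species (A, B), complexes 2A and B,
# transitions 2A -> B and B -> 2A with made-up rate constants.
Y = np.array([[2, 0],
              [0, 1]])                    # species x complexes
transitions = [(0, 1, 3.0), (1, 0, 1.0)]  # (source, target, rate)
n_complexes = Y.shape[1]

S = np.zeros((n_complexes, len(transitions)))
T = np.zeros((n_complexes, len(transitions)))
r = np.array([rate for _, _, rate in transitions])
for k, (src, tgt, _) in enumerate(transitions):
    S[src, k] = 1.0
    T[tgt, k] = 1.0
H = (T - S) @ np.diag(r) @ S.T            # H = (t - s) s^dagger

x = np.array([0.3, 0.7])
x_to_Y = np.prod(x[:, None] ** Y, axis=0)

# The slick form of the rate equation's right-hand side...
rhs_slick = Y @ H @ x_to_Y

# ...versus the original sum over transitions.
eye = np.eye(n_complexes)
rhs_direct = sum(rate * (Y @ (eye[tgt] - eye[src])) * np.prod(x ** Y[:, src])
                 for src, tgt, rate in transitions)

assert np.allclose(rhs_slick, rhs_direct)
```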

It took me a couple of weeks to really understand this, so I’ll be happy if it takes you just a few days. It seems peculiar at first but ultimately it all makes sense. The interesting subtlety is that we use the linear map called ‘multiplying by Y’:

\begin{array}{ccc} \mathbb{R}^K &\to& \mathbb{R}^S \\                               \psi     &\mapsto&   Y \psi \end{array}

to take a bunch of complexes and work out the species they contain, while we use the nonlinear map called ‘raising to the Yth power’:

\begin{array}{ccc} \mathbb{R}^S &\to& \mathbb{R}^K \\                               x     &\mapsto&   x^Y \end{array}

to take a bunch of species and work out how many ways we can build each complex from them. There is much more to say about this: for example, these maps arise from a pair of what category theorists call ‘adjoint functors’. But I’m worn out and you probably are too, if you’re still here at all.

References

I found this thesis to be the most helpful reference when I was trying to understand the proof of the deficiency zero theorem:

• Jonathan M. Guberman, Mass Action Reaction Networks and the Deficiency Zero Theorem, B.A. thesis, Department of Mathematics, Harvard University, 2003.

I urge you to check it out. In particular, Section 3 and Appendix A discuss matrix exponentiation. Has anyone discussed this before?

Here’s another good modern treatment of the deficiency zero theorem:

• Jeremy Gunawardena, Chemical reaction network theory for in silico biologists, 2003.

The theorem was first proved here:

• Martin Feinberg, Chemical reaction network structure and the stability of complex isothermal reactors: I. The deficiency zero and deficiency one theorems, Chemical Engineering Science 42 (1987), 2229-2268.

However, Feinberg’s treatment here relies heavily on this paper:

• F. Horn and R. Jackson, General mass action kinetics, Archive for Rational Mechanics and Analysis 47 (1972), 81-116.

(Does anyone work on ‘irrational mechanics’?) These lectures are also very helpful:

• Martin Feinberg, Lectures on reaction networks, 1979.

If you’ve seen other proofs, let us know.


Network Theory (Part 21)

14 August, 2012

Recently we’ve been talking about ‘reaction networks’, like this:

The letters are names of chemicals, and the arrows are chemical reactions. If we know how fast each reaction goes, we can write down a ‘rate equation’ describing how the amount of each chemical changes with time.

In Part 17 we met the deficiency zero theorem. This is a powerful tool for finding equilibrium solutions of the rate equation: in other words, solutions where the amounts of the various chemicals don’t change at all. To apply this theorem, two conditions must hold. Both are quite easy to check:

• Your reaction network needs to be ‘weakly reversible’: if you have a reaction that takes one bunch of chemicals to another, there’s a series of reactions that takes that other bunch back to the one you started with.

• A number called the ‘deficiency’ that you can compute from your reaction network needs to be zero.

The first condition makes a lot of sense, intuitively: you won’t get an equilibrium with all the different chemicals present if some chemicals can turn into others but not the other way around. But the second condition, and indeed the concept of ‘deficiency’, seems mysterious.

Luckily, when you work through the proof of the deficiency zero theorem, the mystery evaporates. It turns out that there are two equivalent ways to define the deficiency of a reaction network. One makes it easy to compute, and that’s the definition people usually give. But the other explains why it’s important.

In fact the whole proof of the deficiency zero theorem is full of great insights, so I want to show it to you. This will be the most intense part of the network theory series so far, but also the climax of its first phase. After a little wrapping up, we will then turn to another kind of network: electrical circuits!

Today I’ll just unfold the concept of ‘deficiency’ so we see what it means. Next time I’ll show you a slick way to write the rate equation, which is crucial to proving the deficiency zero theorem. Then we’ll start the proof.

Reaction networks

Let’s recall what a reaction network is, and set up our notation. In chemistry we consider a finite set of ‘species’ like C, O2, H2O and so on… and then consider reactions like

CH4 + 2O2 \longrightarrow CO2 + 2H2O

On each side of this reaction we have a finite linear combination of species, with natural numbers as coefficients. Chemists call such a thing a complex.

So, given any finite collection of species, say S, let’s write \mathbb{N}^S to mean the set of finite linear combinations of elements of S, with natural numbers as coefficients. The complexes appearing in our reactions will form a subset of this, say K.

We’ll also consider a finite collection of reactions—or in the language of Petri nets, ‘transitions’. Let’s call this T. Each transition goes from some complex to some other complex: if we want a reaction to be reversible we’ll explicitly include another reaction going the other way. So, given a transition \tau \in T it will always go from some complex called its source s(\tau) to some complex called its target t(\tau).

All this data, put together, is a reaction network:

Definition. A reaction network (S,\; s,t : T \to K) consists of:

• a finite set S of species,

• a finite set T of transitions,

• a finite set K \subset \mathbb{N}^S of complexes,

• source and target maps s,t : T \to K.

We can draw a reaction network as a graph with complexes as vertices and transitions as edges:

The set of species here is S = \{A,B,C,D,E\}, and the set of complexes is K = \{A,B,D,A+C,B+E\}.

But to write down the ‘rate equation’ describing our chemical reactions, we need a bit more information: constants r(\tau) saying the rate at which each transition \tau occurs. So, we define:

Definition. A stochastic reaction network is a reaction network together with a map r: T \to (0,\infty) assigning a rate constant to each transition.

Let me remind you how the rate equation works. At any time we have some amount x_i \in [0,\infty) of each species i. These numbers are the components of a vector x \in \mathbb{R}^S, which of course depends on time. The rate equation says how this vector changes:

\displaystyle{ \frac{d x_i}{d t} = \sum_{\tau \in T} r(\tau) \left(t_i(\tau) - s_i(\tau)\right)  x^{s(\tau)} }

Here I’m writing s_i(\tau) for the ith component of the vector s(\tau), and similarly for t_i(\tau). I should remind you what x^{s(\tau)} means, since here we are raising a vector to a vector power, which is a bit unusual. So, suppose we have any vector

x = (x_1, \dots, x_k)

and we raise it to the power of

s = (s_1, \dots, s_k)

Then by definition we get

\displaystyle{ x^s = x_1^{s_1} \cdots x_k^{s_k} }

Given this, I hope the rate equation makes intuitive sense! There’s one term for each transition \tau. The factor of t_i(\tau) - s_i(\tau) shows up because our transition destroys s_i(\tau) things of the ith species and creates t_i(\tau) of them. The big product

\displaystyle{ x^{s(\tau)} = x_1^{s_1(\tau)} \cdots x_k^{s_k(\tau)} }

shows up because our transition occurs at a rate proportional to the product of the numbers of things it takes as inputs. The constant of proportionality is the reaction rate r(\tau).
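Concretely, the rate equation is easy to simulate. Here’s a forward-Euler sketch for the reversible reaction A \to B, B \to A with made-up rate constants; it settles down to the equilibrium (1/3, 2/3):

```python
import numpy as np

# Hypothetical example: species (A, B), transitions A -> B with rate 2
# and B -> A with rate 1.  Complexes are vectors in N^S, so here
# s(tau) and t(tau) are just standard basis vectors.
transitions = [
    (np.array([1, 0]), np.array([0, 1]), 2.0),   # A -> B
    (np.array([0, 1]), np.array([1, 0]), 1.0),   # B -> A
]

def rate_rhs(x):
    # sum over transitions of r(tau) (t(tau) - s(tau)) x^{s(tau)}
    return sum(r * (t - s) * np.prod(x ** s) for s, t, r in transitions)

x, dt = np.array([1.0, 0.0]), 0.01
for _ in range(2000):
    x = x + dt * rate_rhs(x)
print(x)   # close to the equilibrium (1/3, 2/3)
```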

The deficiency zero theorem says lots of things, but in the next few episodes we’ll prove a weak version, like this:

Deficiency Zero Theorem (Baby Version). Suppose we have a weakly reversible reaction network with deficiency zero. Then for any choice of rate constants there exists an equilibrium solution x \in (0,\infty)^S of the rate equation. In other words:

\displaystyle{ \sum_{\tau \in T} r(\tau) \left(t_i(\tau) - s_i(\tau)\right)  x^{s(\tau)} = 0}

An important feature of this result is that all the components of the vector x are positive. In other words, there’s actually some chemical of each species present!

But what do the hypotheses in this theorem mean?

A reaction network is weakly reversible if for any transition \tau \in T going from a complex \kappa to a complex \kappa', there is a sequence of transitions going from \kappa' back to \kappa. But what about ‘deficiency zero’? As I mentioned, this requires more work to understand.

So, let’s dive in!

Deficiency

In modern math, we like to take all the stuff we’re thinking about and compress it into a diagram with a few objects and maps going between these objects. So, to understand the deficiency zero theorem, I wanted to isolate the crucial maps. For starters, there’s an obvious map

Y : K \to \mathbb{N}^S

sending each complex to the linear combination of species that it is.

Indeed, we can change viewpoints a bit and think of K as an abstract set equipped with this map saying how each complex is made of species. From now on this is what we’ll do. Then all the information in a stochastic reaction network sits in this diagram:

This is fundamental to everything we’ll do for the next few episodes, so take a minute to lock it into your brain.

We’ll do lots of different things with this diagram. For example, we often want to use ideas from linear algebra, and then we want our maps to be linear. For example, Y extends uniquely to a linear map

Y : \mathbb{R}^K \to \mathbb{R}^S

sending real linear combinations of complexes to real linear combinations of species. Reusing the name Y here won’t cause confusion. We can also extend r, s, and t to linear maps in a unique way, getting a little diagram like this:

Linear algebra lets us talk about differences of complexes. Each transition \tau gives a vector

\partial(\tau) = t(\tau) - s(\tau) \in \mathbb{R}^K

saying the change in complexes that it causes. And we can extend \partial uniquely to a linear map

\partial : \mathbb{R}^T \to \mathbb{R}^K

defined on linear combinations of transitions. Mathematicians would call \partial a boundary operator.

So, we have a little sequence of linear maps

\mathbb{R}^T \stackrel{\partial}{\to} \mathbb{R}^K \stackrel{Y}{\to} \mathbb{R}^S

This turns a transition into a change in complexes, and then a change in species.

If you know fancy math you’ll be wondering if this sequence is a ‘chain complex’, which is a fancy way of saying that Y \partial = 0. The answer is no. This equation means that every linear combination of reactions leaves the amount of all species unchanged. Or equivalently: every reaction leaves the amount of all species unchanged. This only happens in very silly examples.

Nonetheless, it’s possible for a linear combination of reactions to leave the amount of all species unchanged.

For example, this will happen if we have a linear combination of reactions that leaves the amount of all complexes unchanged. But this sufficient condition is not necessary. And this leads us to the concept of ‘deficiency zero’:

Definition. A reaction network has deficiency zero if any linear combination of reactions that leaves the amount of every species unchanged also leaves the amount of every complex unchanged.

In short, a reaction network has deficiency zero iff

Y (\partial \rho) = 0 \implies \partial \rho = 0

for every \rho \in \mathbb{R}^T. In other words—using some basic concepts from linear algebra—a reaction network has deficiency zero iff Y is one-to-one when restricted to the image of \partial. Remember, the image of \partial is

\mathrm{im} \partial = \{ \partial \rho \; : \; \rho \in \mathbb{R}^T \}

Roughly speaking, this consists of all changes in complexes that can occur due to reactions.

In still other words, a reaction network has deficiency zero if 0 is the only vector in both the image of \partial and the kernel of Y:

\mathrm{im} \partial \cap \mathrm{ker} Y = \{ 0 \}

Remember, the kernel of Y is

\mathrm{ker} Y = \{ \phi \in \mathbb{R}^K : Y \phi = 0 \}

Roughly speaking, this consists of all changes in complexes that don’t cause changes in species. So, ‘deficiency zero’ roughly says that if a reaction causes a change in complexes, it causes a change in species.

(All this ‘roughly speaking’ stuff is because in reality I should be talking about linear combinations of transitions, complexes and species. But it’s a bit distracting to keep saying that when I’m trying to get across the basic ideas!)

Now we’re ready to understand deficiencies other than zero, at least a little. They’re defined like this:

Definition. The deficiency of a reaction network is the dimension of \mathrm{im} \partial \cap \mathrm{ker} Y.

How to compute the deficiency

You can compute the deficiency of a reaction network just by looking at it. However, it takes a little training. First, remember that a reaction network can be drawn as a graph with complexes as vertices and transitions as edges, like this:

There are three important numbers we can extract from this graph:

• We can count the number of vertices in this graph; let’s call that |K|, since it’s just the number of complexes.

• We can count the number of pieces or ‘components’ of this graph; let’s call that \# \mathrm{components} for now.

• We can also count the dimension of the image of Y \partial. This space, \mathrm{im} Y \partial, is called the stoichiometric subspace: vectors in here are changes in species that can be accomplished by transitions in our reaction network, or linear combinations of transitions.

These three numbers, all rather easy to compute, let us calculate the deficiency:

Theorem. The deficiency of a reaction network equals

|K| - \# \mathrm{components} - \mathrm{dim} \left( \mathrm{im} Y \partial \right)

Proof. By definition we have

\mathrm{deficiency} = \dim \left( \mathrm{im} \partial \cap \mathrm{ker} Y \right)

but another way to say this is

\mathrm{deficiency} = \mathrm{dim} (\mathrm{ker} Y|_{\mathrm{im} \partial})

where we are restricting Y to the subspace \mathrm{im} \partial, and taking the dimension of the kernel of that.

The rank-nullity theorem says that whenever you have a linear map T: V \to W between finite-dimensional vector spaces, then

\mathrm{dim} \left(\mathrm{ker} T \right) \; = \; \mathrm{dim}\left(\mathrm{dom} T\right) \; - \; \mathrm{dim} \left(\mathrm{im} T\right)

where \mathrm{dom} T is the domain of T, namely the vector space V. It follows that

\mathrm{dim}(\mathrm{ker} Y|_{\mathrm{im} \partial}) = \mathrm{dim}(\mathrm{dom}  Y|_{\mathrm{im} \partial}) - \mathrm{dim}(\mathrm{im}  Y|_{\mathrm{im} \partial})

The domain of Y|_{\mathrm{im} \partial} is just \mathrm{im} \partial, while its image equals \mathrm{im} Y \partial, so

\mathrm{deficiency} = \mathrm{dim}(\mathrm{im} \partial) - \mathrm{dim}(\mathrm{im} Y \partial)

The theorem then follows from this:

Lemma. \mathrm{dim} (\mathrm{im} \partial) = |K| - \# \mathrm{components}

Proof. In fact this holds whenever we have a finite set of complexes and a finite set of transitions going between them. We get a diagram

We can extend the source and target functions to linear maps as usual:

and then we can define \partial = t - s. We claim that

\mathrm{dim} (\mathrm{im} \partial) = |K| - \# \mathrm{components}

where \# \mathrm{components} is the number of connected components of the graph with K as vertices and T as edges.

This is easiest to see using an inductive argument where we start by throwing out all the edges of our graph and then put them back in one at a time. When our graph has no edges, \partial = 0 and the number of components is |K|, so our claim holds: both sides of the equation above are zero. Then, each time we put in an edge, there are two choices: either it goes between two different components of the graph we have built so far, or it doesn’t. If it goes between two different components, it increases \mathrm{dim} (\mathrm{im} \partial) by 1 and decreases the number of components by 1, so our equation continues to hold. If it doesn’t, neither of these quantities changes, so our equation again continues to hold.   █

Examples

This reaction network is not weakly reversible, since we can get from B + E and A + C to D but not back. It becomes weakly reversible if we throw in another transition:

Taking a reaction network and throwing in the reverse of an existing transition never changes the number of complexes, the number of components, or the dimension of the stoichiometric subspace. So, the deficiency of the reaction network remains unchanged. We computed it back in Part 17. For either reaction network above:

• the number of complexes is 5:

|K| = |\{A,B,D, A+C, B+E\}| = 5

• the number of components is 2:

\# \mathrm{components} = 2

• the dimension of the stoichiometric subspace is 3. For this we go through each transition, see what change in species it causes, and take the vector space spanned by these changes. Then we find a basis for this space and count it:

\mathrm{im} Y \partial =

\langle B - A, A - B, D - A - C, D - (B+E) ,  (B+E)-(A+C) \rangle

= \langle B -A, D - A - C, D - A - E \rangle

so

\mathrm{dim} \left(\mathrm{im} Y \partial \right) = 3

As a result, the deficiency is zero:

\begin{array}{ccl} \mathrm{deficiency}   &=& |K| - \# \mathrm{components} - \mathrm{dim} \left( \mathrm{im} Y \partial \right) \\  &=& 5 - 2 - 3 \\  &=&  0 \end{array}
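These three counts are easy to automate. Here’s a numpy sketch for the example we just did, with the five transitions read off from the spanning set above and the species ordered A, B, C, D, E:

```python
import numpy as np

complexes = ['A', 'B', 'D', 'A+C', 'B+E']
Y = np.array([[1, 0, 0, 1, 0],    # A
              [0, 1, 0, 0, 1],    # B
              [0, 0, 0, 1, 0],    # C
              [0, 0, 1, 0, 0],    # D
              [0, 0, 0, 0, 1]])   # E   (one column per complex)

# Transitions A -> B, B -> A, A+C -> D, B+E -> D, A+C -> B+E,
# as (source index, target index) into the list of complexes.
edges = [(0, 1), (1, 0), (3, 2), (4, 2), (3, 4)]

# The boundary operator sends each transition to t(tau) - s(tau).
partial = np.zeros((len(complexes), len(edges)))
for k, (src, tgt) in enumerate(edges):
    partial[src, k] -= 1
    partial[tgt, k] += 1

# Count connected components of the underlying undirected graph
# with a tiny union-find.
parent = list(range(len(complexes)))
def find(a):
    while parent[a] != a:
        a = parent[a]
    return a
for src, tgt in edges:
    parent[find(src)] = find(tgt)
n_components = len({find(a) for a in range(len(complexes))})

deficiency = (len(complexes) - n_components
              - np.linalg.matrix_rank(Y @ partial))
print(n_components, deficiency)   # 2 0
```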

Now let’s throw in another complex and some more transitions:

Now:

• the number of complexes increases by 1:

|K| = |\{A,B,D,E, A+C, B+E\}| = 6

• the number of components is unchanged:

\# \mathrm{components} = 2

• the dimension of the stoichiometric subspace increases by 1. We never need to include reverses of transitions when computing this:

\mathrm{im} Y \partial =

\langle B-A, E-B,  D - (A+C),
D - (B+E) , (B+E)-(A+C) \rangle

= \langle B -A, E-B, D - A - C, D - B - E \rangle

so

\mathrm{dim} \left(\mathrm{im} Y \partial \right) = 4

As a result, the deficiency is still zero:

\begin{array}{ccl} \mathrm{deficiency}   &=& |K| - \# \mathrm{components} - \mathrm{dim} \left( \mathrm{im} Y \partial \right) \\  &=& 6 - 2 - 4 \\  &=&  0 \end{array}

Do all reaction networks have deficiency zero? That would be nice. Let’s try one more example:

• the number of complexes is the same as in our last example:

|K| = |\{A,B,D,E, A+C, B+E\}| = 6

• the number of components is also the same:

\# \mathrm{components} = 2

• the dimension of the stoichiometric subspace is also the same:

\mathrm{im} Y \partial =

\langle B-A,  D - (A+C), D - (B+E) ,
(B+E)-(A+C), (B+E) - B \rangle

= \langle B -A, D - A - C, D - B , E\rangle

so

\mathrm{dim} \left(\mathrm{im} Y \partial \right) = 4

So the deficiency is still zero:

\begin{array}{ccl} \mathrm{deficiency}   &=& |K| - \# \mathrm{components} - \mathrm{dim} \left( \mathrm{im} Y \partial \right) \\  &=& 6 - 2 - 4 \\  &=&  0 \end{array}

It’s sure easy to find examples with deficiency zero!

Puzzle 1. Can you find an example where the deficiency is not zero?

Puzzle 2. If you can’t find an example, prove the deficiency is always zero. If you can, find 1) the smallest example and 2) the smallest example that actually arises in chemistry.

Note that not all reaction networks can actually arise in chemistry. For example, the transition A \to A + A would violate conservation of mass. Nonetheless a reaction network like this might be useful in a very simple model of amoeba reproduction, one that doesn’t take limited resources into account.

Different kinds of graphs

I’ll conclude with some technical remarks that only a mathematician could love; feel free to skip them if you’re not in the mood. As you’ve already seen, it’s important that a reaction network can be drawn as a graph:

But there are many kinds of graph. What kind of graph is this, exactly?

As I mentioned in Part 17, it’s a directed multigraph, meaning that the edges have arrows on them, more than one edge can go from one vertex to another, and we can have any number of edges going from a vertex to itself. Not all those features are present in this example, but they’re certainly allowed by our definition of reaction network!

After all, we’ve seen that a stochastic reaction network amounts to a little diagram like this:

If we throw out the rate constants, we’re left with a reaction network. So, a reaction network is just a diagram like this:

If we throw out the information about how complexes are made of species, we’re left with an even smaller diagram:

And this precisely matches the slick definition of a directed multigraph: it’s a set E of edges, a set V of vertices, and functions s,t : E \to V saying where each edge starts and where it ends: its source and target.

Since this series is about networks, we should expect to run into many kinds of graphs. While their diversity is a bit annoying at first, we must learn to love it, at least if we’re mathematicians and want everything to be completely precise.

There are at least 2^3 = 8 kinds of graphs, depending on our answers to three questions:

• Do the edges have arrows on them?

• Can more than one edge go between a specific pair of vertices?

and

• Can an edge go from a vertex to itself?

We get directed multigraphs if we answer YES, YES and YES to these questions. Since they allow all three features, directed multigraphs are very important. For example, a category is a directed multigraph equipped with some extra structure. Also, what mathematicians call a quiver is just another name for a directed multigraph.

We’ve met two other kinds of graph so far:

• In Part 15 and Part 16 we described circuits made of resistors—or, in other words, Dirichlet operators—using ‘simple graphs’. We get simple graphs when we answer NO, NO and NO to the three questions above. The slick definition of a simple graph is that it’s a set V of vertices together with a subset E of the collection of 2-element subsets of V.

• In Part 20 we described Markov processes on finite sets—or, in other words, infinitesimal stochastic operators—using ‘directed graphs’. We get directed graphs when we answer YES, NO and YES to the three questions. The slick definition of a directed graph is that it’s a set V of vertices together with a subset E of the ordered pairs of V: E \subseteq V \times V.

There is a lot to say about this business, but for now I’ll just note that you can use directed multigraphs with edges labelled by positive numbers to describe Markov processes, just as we used directed graphs. You don’t get anything more general, though! After all, if we have multiple edges from one vertex to another, we can replace them by a single edge as long as we add up their rate constants. And an edge from a vertex to itself has no effect at all.

In short, both for Markov processes and reaction networks, we can take ‘graph’ to mean either ‘directed graph’ or ‘directed multigraph’, as long as we make some minor adjustments. In what follows, we’ll use directed multigraphs for both Markov processes and reaction networks. It will work a bit better if we ever get around to explaining how category theory fits into this story.


Symmetry and the Fourth Dimension (Part 6)

11 August, 2012

Last time I showed what happened if you took a cube and chopped off its corners more and more until you reached its dual: the octahedron. Today let’s do the same thing starting with a dodecahedron!

Just as a cube has 3 squares meeting at each vertex, a dodecahedron has 3 pentagons meeting at each vertex. So, instead of the Coxeter diagram for the cube:

V—4—E—3—F

everything today will be based on the Coxeter diagram for the dodecahedron:

V—5—E—3—F

The number 5 is much cooler than the number 4, which is, frankly, a bit square. So the shapes we get today look much more sophisticated, at least to my eyes. But the underlying math is very similar: we can use diagrams to keep track of these shapes as we did before.

Dodecahedron: •—5—o—3—o

First we have the dodecahedron, with all pentagons as faces:


I like this shape so much I gave a lecture about it, and you can see the slides here:

• John Baez, Tales of the dodecahedron: from Pythagoras through Plato to Poincaré.

Truncated dodecahedron: •—5—•—3—o

Then we get the truncated dodecahedron, with decagons (10-sided shapes) and triangles as faces:


Icosidodecahedron: o—3—•—3—o

Then, halfway through, we get the aptly named icosidodecahedron, with pentagons and triangles as faces:


Like that other ‘halfway through’ shape the cuboctahedron, every edge of the icosidodecahedron lies on a great circle’s worth of edges.

Truncated icosahedron: o—5—•—3—•

Then we get the truncated icosahedron, with pentagons and hexagons as faces:


This one is so beautiful that a whole sport was developed in its honor!

Icosahedron: o—5—o—3—•

And then finally we get the icosahedron, with triangles as faces:


Again, I like this one so much I gave a talk about it:

• John Baez, Who discovered the icosahedron?

I almost feel like telling you all the stuff that’s in these talks of mine… and if I turn these blog posts into a book, I’ll definitely want to include it all! But there’s a lot of it, and I’m feeling a bit lazy—so why not just go check it out?

Puzzle. Why did Thomas Heath, the great scholar of Greek mathematics, think that Geminus of Rhodes is responsible for the remark in Euclid’s Elements crediting Theaetetus with discovering the icosahedron?


Network Theory (Part 20)

6 August, 2012

guest post by Jacob Biamonte

We’re in the middle of a battle: in addition to our typical man versus equation scenario, it’s a battle between two theories. For those good patrons following the network theory series, you know the two opposing forces well. It’s our old friends, at it again:

Stochastic Mechanics vs Quantum Mechanics!

Today we’re reporting live from a crossroads, and we’re facing a skirmish that gives rise to what some might consider a paradox. Let me sketch the main thesis before we get our hands dirty with the gory details.

First I need to tell you that the battle takes place at the intersection of stochastic and quantum mechanics. We recall from Part 16 that there is a class of operators called ‘Dirichlet operators’ that are valid Hamiltonians for both stochastic and quantum mechanics. In other words, you can use them to generate time evolution both for old-fashioned random processes and for quantum processes!

Staying inside this class allows the theories to fight it out on the same turf. We will be considering a special subclass of Dirichlet operators, which we call ‘irreducible Dirichlet operators’. These are the ones where starting in any state in our favorite basis of states, we have a nonzero chance of winding up in any other. When considering this subclass, we found something interesting:

Thesis. Let H be an irreducible Dirichlet operator with n eigenstates. In stochastic mechanics, there is only one valid state that is an eigenvector of H: the unique so-called ‘Perron–Frobenius state’. The other n-1 eigenvectors are forbidden states of a stochastic system: the stochastic system is either in the Perron–Frobenius state, or in a superposition of at least two eigenvectors. In quantum mechanics, all n eigenstates of H are valid states.

This might sound like a riddle, but today as we’ll prove, riddle or not, it’s a fact. If it makes sense, well that’s another issue. As John might have said, it’s like a bone kicked down from the gods up above: we can either choose to chew on it, or let it be. Today we are going to do a bit of chewing.

One of the many problems with this post is that John had a nut loose on his keyboard. It was not broken! I’m saying he wrote enough blog posts on this stuff to turn them into a book. I’m supposed to be compiling the blog articles into a massive LaTeX file, but I wrote this instead.

Another problem is that this post somehow seems to use just about everything said before, so I’m going to have to do my best to make things self-contained. Please bear with me as I try to recap what’s been done. For those of you familiar with the series, a good portion of the background for what we’ll cover today can be found in Part 12 and Part 16.

At the intersection of two theories

As John has mentioned in his recent talks, the typical view of how quantum mechanics and probability theory come into contact looks like this:

The idea is that quantum theory generalizes classical probability theory by considering observables that don’t commute.

That’s perfectly valid, but we’ve been exploring an alternative view in this series. Here quantum theory doesn’t subsume probability theory, but they intersect:

What goes in the middle you might ask? As odd as it might sound at first, John showed in Part 16 that electrical circuits made of resistors constitute the intersection!

For example, a circuit like this:

gives rise to a Hamiltonian H that’s good both for stochastic mechanics and quantum mechanics. Indeed, he found that the power dissipated by a circuit made of resistors is related to the familiar quantum theory concept known as the expectation value of the Hamiltonian!

\textrm{power} = -2 \langle \psi, H \psi \rangle

Oh—and you might think we made a mistake and wrote our Ω (ohm) symbols upside down. We didn’t. It happens that ℧ is the symbol for a ‘mho’—a unit of conductance that’s the reciprocal of an ohm. Check out Part 16 for the details.

Stochastic mechanics versus quantum mechanics

Let’s recall how states, time evolution, symmetries and observables work in the two theories. Today we’ll fix a basis for our vector space of states, and we’ll assume it’s finite-dimensional so that all vectors have n components over either the complex numbers \mathbb{C} or the reals \mathbb{R}. In other words, we’ll treat our space as either \mathbb{C}^n or \mathbb{R}^n. In this fashion, linear operators that map such spaces to themselves will be represented as square matrices.

Vectors will be written as \psi_i where the index i runs from 1 to n, and we think of each choice of the index as a state of our system—but since we’ll be using that word in other ways too, let’s call it a configuration. It’s just a basic way our system can be.

States

Besides the configurations i = 1,\dots, n, we have more general states that tell us the probability or amplitude of finding our system in one of these configurations:

Stochastic states are n-tuples of nonnegative real numbers:

\psi_i \in [0,\infty)

The probability of finding the system in the ith configuration is defined to be \psi_i. For these probabilities to sum to one, \psi_i needs to be normalized like this:

\displaystyle{ \sum_i \psi_i = 1 }

or in the notation we’re using in these articles:

\displaystyle{ \langle \psi \rangle = 1 }

where we define

\displaystyle{ \langle \psi \rangle = \sum_i \psi_i }

Quantum states are n-tuples of complex numbers:

\psi_i \in \mathbb{C}

The probability of finding the system in the ith configuration is defined to be |\psi_i|^2. For these probabilities to sum to one, \psi needs to be normalized like this:

\displaystyle{ \sum_i |\psi_i|^2 = 1 }

or in other words

\langle \psi,  \psi \rangle = 1

where the inner product of two vectors \psi and \phi is defined by

\displaystyle{ \langle \psi, \phi \rangle = \sum_i \overline{\psi}_i \phi_i }

Now, the usual way to turn a quantum state \psi into a stochastic state is to take the absolute value of each number \psi_i and then square it. However, if the numbers \psi_i happen to be nonnegative, we can also turn \psi into a stochastic state simply by multiplying it by a number to ensure \langle \psi \rangle = 1.

This is very unorthodox, but it lets us evolve the same vector \psi either stochastically or quantum-mechanically, using the recipes I’ll describe next. In physics jargon these correspond to evolution in ‘real time’ and ‘imaginary time’. But don’t ask me which is which: from a quantum viewpoint stochastic mechanics uses imaginary time, but from a stochastic viewpoint it’s the other way around!

Time evolution

Time evolution works similarly in stochastic and quantum mechanics, but with a few big differences:

• In stochastic mechanics the state changes in time according to the master equation:

\displaystyle{ \frac{d}{d t} \psi(t) = H \psi(t) }

which has the solution

\psi(t) = \exp(t H) \psi(0)

• In quantum mechanics the state changes in time according to Schrödinger’s equation:

\displaystyle{ \frac{d}{d t} \psi(t) = -i H \psi(t) }

which has the solution

\psi(t) = \exp(-i t H) \psi(0)

The operator H is called the Hamiltonian. The properties it must have depend on whether we’re doing stochastic mechanics or quantum mechanics:

• We need H to be infinitesimal stochastic for time evolution given by \exp(t H) to send stochastic states to stochastic states. In other words, we need that (i) its columns sum to zero and (ii) its off-diagonal entries are real and nonnegative:

\displaystyle{ \sum_i H_{i j}=0 }

i\neq j \Rightarrow H_{i j}\geq 0

• We need H to be self-adjoint for time evolution given by \exp(-i t H) to send quantum states to quantum states. So, we need

H = H^\dagger

where we recall that the adjoint of a matrix is the conjugate of its transpose:

\displaystyle{ (H^\dagger)_{i j} := \overline{H}_{j i} }

We are concerned with the case where the operator H generates both a valid quantum evolution and also a valid stochastic one:

H is a Dirichlet operator if it’s both self-adjoint and infinitesimal stochastic.
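
To make this concrete, here’s a minimal sketch in Python, using NumPy and SciPy. The 2-state matrix H below is something I made up for illustration, not one from the series. The code checks both conditions, and confirms that \exp(t H) preserves the right kind of normalization in each theory:

import numpy as np
from scipy.linalg import expm

def is_infinitesimal_stochastic(H, tol=1e-12):
    # (i) columns sum to zero, (ii) off-diagonal entries are nonnegative
    H = np.asarray(H, dtype=float)
    off_diagonal = H - np.diag(np.diag(H))
    return np.allclose(H.sum(axis=0), 0, atol=tol) and (off_diagonal >= -tol).all()

def is_dirichlet(H, tol=1e-12):
    # Dirichlet operator = self-adjoint + infinitesimal stochastic
    return np.allclose(H, np.conj(H).T, atol=tol) and is_infinitesimal_stochastic(H, tol)

H = np.array([[-1.0,  1.0],
              [ 1.0, -1.0]])                  # hop between two states at rate 1
print(is_dirichlet(H))                        # True

psi = np.array([0.9, 0.1])                    # a stochastic state
print((expm(2.0 * H) @ psi).sum())            # still sums to 1
phi = np.array([0.6, 0.8])                    # a quantum state
print(np.linalg.norm(expm(-2.0j * H) @ phi))  # still has norm 1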

We will soon go further and zoom in on this intersection! But first let’s finish our review.

Symmetries

As John explained in Part 12, besides states and observables we need symmetries, which are transformations that map states to states. These include the evolution operators which we only briefly discussed in the preceding subsection.

• A linear map U that sends quantum states to quantum states is called an isometry, and isometries are characterized by this property:

U^\dagger U = 1

• A linear map U that sends stochastic states to stochastic states is called a stochastic operator, and stochastic operators are characterized by these properties:

\displaystyle{ \sum_i U_{i j} = 1 }

and

U_{i j} \geq 0

A notable difference here is that in our finite-dimensional situation, isometries are always invertible, but stochastic operators may not be! If U is an n \times n matrix that’s an isometry, U^\dagger is its inverse. So, we also have

U U^\dagger = 1

and we say U is unitary. But if U is stochastic, it may not have an inverse—and even if it does, its inverse is rarely stochastic. This explains why in stochastic mechanics time evolution is often not reversible, while in quantum mechanics it always is.

Puzzle 1. Suppose U is a stochastic n \times n matrix whose inverse is stochastic. What are the possibilities for U?

It is quite hard for an operator to be a symmetry in both stochastic and quantum mechanics, especially in our finite-dimensional situation:

Puzzle 2. Suppose U is an n \times n matrix that is both stochastic and unitary. What are the possibilities for U?

Observables

‘Observables’ are real-valued quantities that can be measured, or predicted, given a specific theory.

• In quantum mechanics, an observable is given by a self-adjoint matrix O, and the expected value of the observable O in the quantum state \psi is

\displaystyle{ \langle \psi , O \psi \rangle = \sum_{i,j} \overline{\psi}_i O_{i j} \psi_j }

• In stochastic mechanics, an observable O has a value O_i in each configuration i, and the expected value of the observable O in the stochastic state \psi is

\displaystyle{ \langle O \psi \rangle = \sum_i O_i \psi_i }

We can turn an observable in stochastic mechanics into an observable in quantum mechanics by making a diagonal matrix whose diagonal entries are the numbers O_i.
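
In code, the two kinds of expected value look like this. It’s just a sketch with made-up numbers: an observable on three configurations, and one state of each kind:

import numpy as np

O = np.array([1.0, 2.0, 3.0])            # an observable: one number per configuration

psi = np.array([0.5, 0.3, 0.2])          # a stochastic state: nonnegative, sums to 1
print(np.sum(O * psi))                   # <O psi> = sum_i O_i psi_i

phi = np.array([1, 1j, 1]) / np.sqrt(3)  # a quantum state: norm 1
print(np.vdot(phi, np.diag(O) @ phi).real)   # <phi, O phi>, with O as a diagonal matrix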

From graphs to matrices

Back in Part 16, John explained how a graph with positive numbers on its edges gives rise to a Hamiltonian in both quantum and stochastic mechanics—in other words, a Dirichlet operator.

Here’s how this works. We’ll consider simple graphs: graphs without arrows on their edges, with at most one edge from one vertex to another, with no edges from a vertex to itself. And we’ll only look at graphs with finitely many vertices and edges. We’ll assume each edge is labelled by a positive number, like this:

If our graph has n vertices, we can create an n \times n matrix A where A_{i j} is the number labelling the edge from i to j, if there is such an edge, and 0 if there’s not. This matrix is symmetric, with real entries, so it’s self-adjoint. So A is a valid Hamiltonian in quantum mechanics.

How about stochastic mechanics? Remember that a Hamiltonian H in stochastic mechanics needs to be ‘infinitesimal stochastic’. So, its off-diagonal entries must be nonnegative, which is indeed true for our A, but also the sums of its columns must be zero, which is not true when our A is nonzero.

But now comes the best news you’ve heard all day: we can improve A to an infinitesimal stochastic operator in a way that is completely determined by A itself! This is done by subtracting a diagonal matrix L whose entries are the sums of the columns of A:

L_{i i} = \sum_j A_{j i}

i \ne j \Rightarrow L_{i j} = 0

It’s easy to check that

H = A - L

is still self-adjoint, but now also infinitesimal stochastic. So, it’s a Dirichlet operator: a good Hamiltonian for both stochastic and quantum mechanics!
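
In code the whole construction is a one-liner. Here’s a sketch, where the function name is mine, chosen just for illustration; it takes any symmetric matrix A of edge labels:

import numpy as np

def dirichlet_from_graph(A):
    # H = A - L, where L is the diagonal matrix of column sums of A
    A = np.asarray(A, dtype=float)
    return A - np.diag(A.sum(axis=0))

# Example: a path graph with two edges, labelled 1 and 2.
A = np.array([[0, 1, 0],
              [1, 0, 2],
              [0, 2, 0]])
H = dirichlet_from_graph(A)
print(H.sum(axis=0))        # columns sum to zero: infinitesimal stochastic
print(np.allclose(H, H.T))  # symmetric: self-adjoint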

In Part 16, we saw a bit more: every Dirichlet operator arises this way. It’s easy to see. You just take your Dirichlet operator and make a graph with one edge for each nonzero off-diagonal entry. Then you label the edge with this entry.

So, Dirichlet operators are essentially the same as finite simple graphs with edges labelled by positive numbers.

Now, a simple graph can consist of many separate ‘pieces’, called components. Then there’s no way for a particle hopping along the edges to get from one component to another, either in stochastic or quantum mechanics. So we might as well focus our attention on graphs with just one component. These graphs are called ‘connected’. In other words:

Definition. A simple graph is connected if it is nonempty and there is a path of edges connecting any vertex to any other.

Our goal today is to understand more about Dirichlet operators coming from connected graphs. For this we need to learn the Perron–Frobenius theorem. But let’s start with something easier.

Perron’s theorem

In quantum mechanics it’s good to think about observables that have positive expected values:

\langle \psi, O \psi \rangle > 0

for every quantum state \psi \in \mathbb{C}^n. These are called positive definite. But in stochastic mechanics it’s good to think about matrices that are positive in a more naive sense:

Definition. An n \times n real matrix T is positive if all its entries are positive:

T_{i j} > 0

for all 1 \le i, j \le n.

Similarly:

Definition. A vector \psi \in \mathbb{R}^n is positive if all its components are positive:

\psi_i > 0

for all 1 \le i \le n.

We’ll also define nonnegative matrices and vectors in the same way, replacing > 0 by \ge 0. A good example of a nonnegative vector is a stochastic state.

In 1907, Perron proved the following fundamental result about positive matrices:

Perron’s Theorem. Given a positive square matrix T, there is a positive real number r, called the Perron–Frobenius eigenvalue of T, such that r is an eigenvalue of T and any other eigenvalue \lambda of T has |\lambda| < r. Moreover, there is a positive vector \psi \in \mathbb{R}^n with T \psi = r \psi. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of T must be a scalar multiple of \psi.

In other words, if T is positive, it has a unique eigenvalue with the largest absolute value. This eigenvalue is positive. Up to a constant factor, it has a unique eigenvector, which we can choose to be positive. And then, up to a constant factor, this is the only nonnegative eigenvector of T.
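
Perron’s theorem is also why the classic ‘power iteration’ trick works: hit any positive vector with T over and over, normalizing as you go, and it converges to the Perron–Frobenius eigenvector, since that eigenvalue strictly dominates all the others in absolute value. Here’s a quick sketch, with a positive matrix I made up:

import numpy as np

def perron_vector(T, iterations=1000):
    psi = np.ones(T.shape[0])
    for _ in range(iterations):
        psi = T @ psi
        psi /= psi.sum()           # keep the components summing to 1
    r = (T @ psi)[0] / psi[0]      # approximate Perron-Frobenius eigenvalue
    return r, psi

T = np.array([[2.0, 1.0],
              [1.0, 3.0]])
r, psi = perron_vector(T)
print(r)    # about 3.618, the largest eigenvalue of T
print(psi)  # a positive vector with T psi = r psi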

From matrices to graphs

The conclusions of Perron’s theorem don’t hold for matrices that are merely nonnegative. For example, these matrices

\left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) , \qquad \left( \begin{array}{cc} 0 & 1 \\ 0 & 0 \end{array} \right)

are nonnegative, but they violate lots of the conclusions of Perron’s theorem.

Nonetheless, in 1912 Frobenius published an impressive generalization of Perron’s result. In its strongest form, it doesn’t apply to all nonnegative matrices; only to those that are ‘irreducible’. So, let us define those.

We’ve seen how to build a matrix from a graph. Now we need to build a graph from a matrix! Suppose we have an n \times n matrix T. Then we can build a graph G_T with n vertices where there is an edge from the ith vertex to the jth vertex if and only if T_{i j} \ne 0.

But watch out: this is a different kind of graph! It’s a directed graph, meaning the edges have directions, there’s at most one edge going from any vertex to any vertex, and we do allow an edge going from a vertex to itself. There’s a stronger concept of ‘connectivity’ for these graphs:

Definition. A directed graph is strongly connected if there is a directed path of edges going from any vertex to any other vertex.

So, you have to be able to walk along edges from any vertex to any other vertex, but always following the direction of the edges! Using this idea we define irreducible matrices:

Definition. A nonnegative square matrix T is irreducible if its graph G_T is strongly connected.
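
A standard fact makes irreducibility easy to test by computer: a nonnegative n \times n matrix T is irreducible if and only if (I + T)^{n-1} has all entries positive, since the powers of I + T keep track of which vertices can be reached by directed paths. Here’s a sketch:

import numpy as np

def is_irreducible(T):
    # T is irreducible iff (I + T)^(n-1) is entrywise positive
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    M = np.linalg.matrix_power(np.eye(n) + T, n - 1)
    return bool((M > 0).all())

print(is_irreducible(np.array([[0, 1],
                               [1, 0]])))  # True: you can go back and forth
print(is_irreducible(np.array([[0, 1],
                               [0, 0]])))  # False: no directed path back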

The Perron–Frobenius theorem

Now we are ready to state:

The Perron–Frobenius Theorem. Given an irreducible nonnegative square matrix T, there is a positive real number r, called the Perron–Frobenius eigenvalue of T, such that r is an eigenvalue of T and any other eigenvalue \lambda of T has |\lambda| \le r. Moreover, there is a positive vector \psi \in \mathbb{R}^n with T\psi = r \psi. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of T must be a scalar multiple of \psi.

The only conclusion of this theorem that’s weaker than those of Perron’s theorem is that there may be other eigenvalues with |\lambda| = r. For example, this matrix is irreducible and nonnegative:

\left( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right)

Its Perron–Frobenius eigenvalue is 1, but it also has -1 as an eigenvalue. In general, Perron–Frobenius theory says quite a lot about the other eigenvalues on the circle |\lambda| = r, but we won’t need that fancy stuff here.

Perron–Frobenius theory is useful in many ways, from highbrow math to ranking football teams. We’ll need it not just today but also later in this series. There are many books and other sources of information for those that want to take a closer look at this subject. If you’re interested, you can search online or take a look at these:

• Dimitrios Noutsos, Perron Frobenius theory and some extensions, 2008. (Includes proofs of the basic theorems.)

• V. S. Sunder, Perron Frobenius theory, 18 December 2009. (Includes applications to graph theory, Markov chains and von Neumann algebras.)

• Stephen Boyd, Lecture 17: Perron Frobenius theory, Winter 2008-2009. (Includes a max-min characterization of the Perron–Frobenius eigenvalue and applications to Markov chains, economics, population growth and power control.)

I have not taken a look myself, but if anyone is interested and can read German, the original work appears here:

• Oskar Perron, Zur Theorie der Matrizen, Math. Ann. 64 (1907), 248–263.

• Georg Frobenius, Über Matrizen aus nicht negativen Elementen, S.-B. Preuss. Akad. Wiss. Berlin (1912), 456–477.

And, of course, there’s this:

• Wikipedia, Perron–Frobenius theorem.

It’s got a lot of information.

Irreducible Dirichlet operators

Now comes the payoff. We saw how to get a Dirichlet operator H from any finite simple graph with edges labelled by positive numbers. Now let’s apply Perron–Frobenius theory to prove our thesis.

Unfortunately, the matrix H is rarely nonnegative. If you remember how we built it, you’ll see its off-diagonal entries will always be nonnegative… but its diagonal entries can be negative.

Luckily, we can fix this just by adding a big enough multiple of the identity matrix to H! The result is a nonnegative matrix

T = H + c I

where c > 0 is some large number. This matrix T has the same eigenvectors as H. The off-diagonal matrix entries of T are the same as those of H, so T_{i j} is nonzero for i \ne j exactly when the graph we started with has an edge from i to j. So, for i \ne j, the graph G_T will have a directed edge going from i to j precisely when our original graph had an edge from i to j. And that means that if our original graph was connected, G_T will be strongly connected. Thus, by definition, the matrix T is irreducible!

Since T is nonnegative and irreducible, the Perron–Frobenius theorem swings into action and we conclude:

Lemma. Suppose H is the Dirichlet operator coming from a connected finite simple graph with edges labelled by positive numbers. Then the eigenvalues of H are real. Let \lambda be the largest eigenvalue. Then there is a positive vector \psi \in \mathbb{R}^n with H\psi = \lambda \psi. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of H must be a scalar multiple of \psi.

Proof. The eigenvalues of H are real since H is self-adjoint. Notice that if r is the Perron–Frobenius eigenvalue of T = H + c I and

T \psi = r \psi

then

H \psi = (r - c)\psi

By the Perron–Frobenius theorem the number r is positive, and it has the largest absolute value of any eigenvalue of T. Thanks to the subtraction, the eigenvalue r - c may not have the largest absolute value of any eigenvalue of H. It is, however, the largest eigenvalue of H: since T is self-adjoint its eigenvalues are all real, so the positive number r is the largest eigenvalue of T, and subtracting c shifts every eigenvalue down by the same amount, preserving their order. So we take r - c as our \lambda. The rest follows from the Perron–Frobenius theorem.   █

But in fact we can improve this result, since the largest eigenvalue \lambda is just zero. Let’s also make up a definition, to make our result sound more slick:

Definition. A Dirichlet operator is irreducible if it comes from a connected finite simple graph with edges labelled by positive numbers.

This meshes nicely with our earlier definition of irreducibility for nonnegative matrices. Now:

Theorem. Suppose H is an irreducible Dirichlet operator. Then H has zero as its largest real eigenvalue. There is a positive vector \psi \in \mathbb{R}^n with H\psi = 0. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of H must be a scalar multiple of \psi.

Proof. Choose \lambda as in the Lemma, so that H\psi = \lambda \psi. Since \psi is positive we can normalize it to be a stochastic state:

\displaystyle{ \sum_i \psi_i = 1 }

Since H is a Dirichlet operator, \exp(t H) sends stochastic states to stochastic states, so

\displaystyle{ \sum_i (\exp(t H) \psi)_i = 1 }

for all t \ge 0. On the other hand,

\displaystyle{ \sum_i (\exp(t H)\psi)_i = \sum_i e^{t \lambda} \psi_i = e^{t \lambda} }

so we must have \lambda = 0.   █

What’s the point of all this? One point is that there’s a unique stochastic state \psi that’s an equilibrium state: since H \psi = 0, it doesn’t change with time. It’s also globally stable: since all the other eigenvalues of H are negative, all other stochastic states converge to this one as time goes forward.
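
Numerically, finding this equilibrium state is just an eigenvector computation. Since H is symmetric we can use numpy.linalg.eigh, which returns eigenvalues in increasing order, so the eigenvalue 0 comes last. A sketch:

import numpy as np

def equilibrium_state(H, tol=1e-9):
    # For an irreducible Dirichlet operator the largest eigenvalue is 0
    eigenvalues, eigenvectors = np.linalg.eigh(H)
    assert abs(eigenvalues[-1]) < tol
    psi = np.abs(eigenvectors[:, -1])  # the Perron-Frobenius vector, made positive
    return psi / psi.sum()             # normalized to a stochastic state

H = np.array([[-1.0,  1.0],
              [ 1.0, -1.0]])
print(equilibrium_state(H))            # [0.5 0.5]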

An example

There are many examples of irreducible Dirichlet operators. For instance, in Part 15 we talked about graph Laplacians. The Laplacian of a connected simple graph is always irreducible. But let us try a different sort of example, coming from the picture of the resistors we saw earlier:

Let’s create a matrix A whose entry A_{i j} is the number labelling the edge from i to j if there is such an edge, and zero otherwise:

A = \left(  \begin{array}{ccccc}   0 & 2 & 1 & 0 & 1 \\   2 & 0 & 0 & 1 & 1 \\   1 & 0 & 0 & 2 & 1 \\   0 & 1 & 2 & 0 & 1 \\   1 & 1 & 1 & 1 & 0  \end{array}  \right)

Remember how the game works. The matrix A is already a valid Hamiltonian for quantum mechanics, since it’s self-adjoint. However, to get a valid Hamiltonian for both stochastic and quantum mechanics—in other words, a Dirichlet operator—we subtract the diagonal matrix L whose entries are the sums of the columns of A. In this example it just so happens that the column sums are all 4, so L = 4 I, and our Dirichlet operator is

H = A - 4 I = \left(  \begin{array}{rrrrr}   -4 & 2 & 1 & 0 & 1 \\   2 & -4 & 0 & 1 & 1 \\   1 & 0 & -4 & 2 & 1 \\   0 & 1 & 2 & -4 & 1 \\   1 & 1 & 1 & 1 & -4  \end{array}  \right)

We’ve set up this example so it’s easy to see that the vector \psi = (1,1,1,1,1) has

H \psi = 0

So, up to normalization, this is the unique eigenvector for the eigenvalue 0. We can use Mathematica to calculate the remaining eigenvalues of H. The set of eigenvalues is

\{0, -7, -5, -5, -3 \}

As we expect from our theorem, the largest real eigenvalue is 0. By design, the eigenstate associated to this eigenvalue is

| v_0 \rangle = (1, 1, 1, 1, 1)

(This funny notation for vectors is common in quantum mechanics, so don’t worry about it.) All the other eigenvectors fail to be nonnegative, as predicted by the theorem. They are:

| v_1 \rangle = (1, -1, -1, 1, 0)
| v_2 \rangle = (-1, 0, -1, 0, 2)
| v_3 \rangle = (-1, 1, -1, 1, 0)
| v_4 \rangle = (-1, -1, 1, 1, 0)

To compare the quantum and stochastic states, consider first |v_0\rangle. This is the only eigenvector that can be normalized to a stochastic state. Remember, a stochastic state must have nonnegative components. This rules out |v_1\rangle through |v_4\rangle as valid stochastic states, no matter how we normalize them! However, these are allowed as states in quantum mechanics, once we normalize them correctly. For a stochastic system to be in a state other than the Perron–Frobenius state, it must be in a linear combination of at least two eigenstates. For instance,

\psi_a = (1-a)|v_0\rangle + a |v_1\rangle

can be normalized to give a stochastic state only if 0 \leq a \leq \frac{1}{2}.
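
If you’d like to check all of this yourself without Mathematica, here’s a NumPy sketch that reproduces the eigenvalues, the equilibrium eigenvector, and the range of a for which \psi_a can be normalized to a stochastic state:

import numpy as np

A = np.array([[0, 2, 1, 0, 1],
              [2, 0, 0, 1, 1],
              [1, 0, 0, 2, 1],
              [0, 1, 2, 0, 1],
              [1, 1, 1, 1, 0]], dtype=float)
H = A - 4 * np.eye(5)

eigenvalues, eigenvectors = np.linalg.eigh(H)
print(np.round(eigenvalues, 6))          # [-7. -5. -5. -3.  0.]
print(np.round(eigenvectors[:, -1], 6))  # a multiple of (1,1,1,1,1)

v0 = np.ones(5)
v1 = np.array([1, -1, -1, 1, 0], dtype=float)
for a in (0.25, 0.5, 0.75):
    psi_a = (1 - a) * v0 + a * v1
    print(a, (psi_a >= 0).all())         # nonnegative only up to a = 1/2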

And, it’s easy to see that it works this way for any irreducible Dirichlet operator, thanks to our theorem. So, our thesis has been proved true!

Puzzles on irreducibility

Let us conclude with a couple more puzzles. There are lots of ways to characterize irreducible nonnegative matrices; we don’t need to mention graphs. Here’s one:

Puzzle 3. Let T be a nonnegative n \times n matrix. Show that T is irreducible if and only if for each pair of indices i and j we have (T^m)_{i j} > 0 for some natural number m.

You may be confused because today we explained the usual concept of irreducibility for nonnegative matrices, but also defined a concept of irreducibility for Dirichlet operators. Luckily there’s no conflict: Dirichlet operators aren’t nonnegative matrices, but if we add a big multiple of the identity to a Dirichlet operator it becomes a nonnegative matrix, and then:

Puzzle 4. Show that a Dirichlet operator H is irreducible if and only if the nonnegative operator H + c I (where c is any sufficiently large constant) is irreducible.

Irreducibility is also related to the nonexistence of interesting conserved quantities. In Part 11 we saw a version of Noether’s Theorem for stochastic mechanics. Remember that an observable O in stochastic mechanics assigns a number O_i to each configuration i = 1, \dots, n. We can make a diagonal matrix with O_i as its diagonal entries, and by abuse of language we call this O as well. Then we say O is a conserved quantity for the Hamiltonian H if the commutator [O,H] = O H - H O vanishes.
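
Checking whether a given observable is conserved is a one-line computation. Here’s a sketch, with a helper name that is mine:

import numpy as np

def is_conserved(O_values, H, tol=1e-12):
    O = np.diag(O_values)                           # diagonal matrix with entries O_i
    return np.allclose(O @ H - H @ O, 0, atol=tol)  # does [O, H] vanish?

H = np.array([[-1.0,  1.0],
              [ 1.0, -1.0]])            # an irreducible Dirichlet operator
print(is_conserved([1.0, 1.0], H))      # True: constant observables are conserved
print(is_conserved([1.0, 2.0], H))      # False: nothing else is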

Puzzle 5. Let H be a Dirichlet operator. Show that H is irreducible if and only if every conserved quantity O for H is a constant, meaning that for some c \in \mathbb{R} we have O_i = c for all i. (Hint: examine the proof of Noether’s theorem.)

In fact this works more generally:

Puzzle 6. Let H be an infinitesimal stochastic matrix. Show that H + c I is an irreducible nonnegative matrix for all sufficiently large c if and only if every conserved quantity O for H is a constant.


Symmetry and the Fourth Dimension (Part 5)

3 August, 2012

Last time we saw that Platonic solids come in dual pairs, with the tetrahedron being dual to itself:

When you have a dual pair, you can start chopping off the corners of one, more and more, and keep going until you reach the other. Along the way you get some interesting shapes:

At certain points along the way, we get semiregular polyhedra, meaning that:

• all the faces are regular polygons, and

• there’s a symmetry carrying any corner to any other.

Let’s see how it goes with the cube/octahedron pair. And on the way, I’ll show you some diagrams that summarize what’s going on. I’ll explain them later, but you can try to guess the pattern. As a clue, I’ll say they’re based on the Coxeter diagram for the cube, which I explained last time:

V—4—E—3—F

Here we go!

Cube: •—4—o—3—o

First we have the cube, with all square faces:


Truncated cube: •—4—•—3—o

Then we get the truncated cube, with octagons and triangles as faces:


Cuboctahedron: o—4—•—3—o

Halfway through we get the aptly named cuboctahedron, with squares and triangles as faces:


Truncated octahedron: o—4—•—3—•

Then we get the truncated octahedron, with squares and hexagons as faces:


Octahedron: o—4—o—3—•

Then finally we get the octahedron, with triangles as faces:


Partial flags

Can you see what’s going on with the diagrams here?

cube •—4—o—3—o
truncated cube •—4—•—3—o
cuboctahedron o—4—•—3—o
truncated octahedron o—4—•—3—•
octahedron o—4—o—3—•

Clearly the black dots tend to move from left to right as we move down the chart, but there’s something much cooler and more precise going on. The black dots secretly say where the corners of the shapes are!

Let’s see how quickly I can explain this, and how quickly you can get what I’m talking about. Remember how I defined a ‘flag’ in Part 3?

No? Good, because today I’m going to call that a ‘complete flag’. So, given a Platonic solid, we’ll say a complete flag is a vertex, edge and face where the vertex lies on the edge and the edge lies on the face.

For example, here is a complete flag for a cube:

It’s the black vertex lying on the blue edge lying on the yellow face.

But the term ‘complete flag’ hints that there are also ‘partial flags’. And there are! A vertex-edge flag is a vertex and edge where the vertex lies on the edge. Here’s a vertex-edge flag for the cube:

Similarly, an edge-face flag is an edge and a face where the edge lies on the face. Here’s an edge-face flag for the cube:

You can see why they’re called partial flags: they’re different parts of a complete flag.

Now, fix the Coxeter diagram for the cube firmly in mind:

V—4—E—3—F

V for vertex, E for edge and F for face.

Then:

• a cube obviously has one corner for each vertex of the cube, so we draw it like this:

•—4—o—3—o

• a truncated cube has one corner for each vertex-edge flag of the cube, so we draw it like this:

•—4—•—3—o

• a cuboctahedron has one corner for each edge of the cube, so we draw it like this:

o—4—•—3—o

• a truncated octahedron has one corner for each edge-face flag of the cube, so we draw it like this:

o—4—•—3—•

• an octahedron has one corner for each face of the cube, so we draw it like this:

o—4—o—3—•

Alas, I don’t have the patience to draw all the pictures needed to explain this clearly; I’ll just grab the pictures I can get for free on Wikicommons. Here’s how an octahedron has a corner for each face of the cube:

And here’s how the cuboctahedron has a corner for each edge of the cube:

But these are the least interesting cases! It’s more interesting to see how the truncated cube has one corner for each vertex-edge flag of the cube. Do you see how it works? You have to imagine this truncated cube sitting inside a cube:


Then, notice that the truncated cube has 2 corners on each edge of the cube, one near each end. So, it has one corner for each vertex-edge flag of the cube!

Similarly, the truncated octahedron has one corner for each edge-face flag of the cube. But since I don’t have a great picture to help you see that, let’s use duality to change our point of view. A face of the cube corresponds to a vertex of the octahedron. So, think of this truncated octahedron as sitting inside an octahedron:


The truncated octahedron has 4 corners near each vertex of the octahedron, one on each edge touching that vertex. In short, it has one corner for each vertex-edge flag of the octahedron. So it’s got one corner for each edge-face flag of the cube!

This change of viewpoint can be justified more thoroughly:

Puzzle. The diagrams we’ve been using were based on the Coxeter diagram for the cube. What would they look like if we based them on the Coxeter diagram for the octahedron instead?
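
If you’d rather count than squint at pictures, here’s a little Python sketch that enumerates the vertices, edges, faces and partial flags of the cube. The counts it prints, namely 8, 24, 12, 24 and 6, are exactly the numbers of corners of the five solids above:

from itertools import product

vertices = list(product([0, 1], repeat=3))   # corners of the unit cube

# Edges: pairs of vertices differing in exactly one coordinate.
edges = [(v, w) for v in vertices for w in vertices
         if v < w and sum(a != b for a, b in zip(v, w)) == 1]

# Faces: fix one coordinate to be 0 or 1.
faces = [frozenset(v for v in vertices if v[i] == c)
         for i in range(3) for c in (0, 1)]

vertex_edge_flags = [(v, e) for e in edges for v in e]
edge_face_flags = [(e, f) for e in edges for f in faces if set(e) <= f]

print(len(vertices), len(vertex_edge_flags), len(edges),
      len(edge_face_flags), len(faces))      # 8 24 12 24 6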

By the way, the pretty pictures of solids with brass balls at the vertices were made by Tom Ruen using Robert Webb’s Stella software. They’re available on Wikicommons, and you can find most of them by clicking on the images here and looking around on the Wikipedia articles you’ll reach that way. On this blog, I try hard to make most images take you to more information when you click on them.


The Noisy Channel Coding Theorem

28 July, 2012

Here’s a charming, easily readable tale of Claude Shannon and how he came up with information theory:

• Erico Guizzo, The Essential Message: Claude Shannon and the Making of Information Theory.

I hadn’t known his PhD thesis was on genetics! His master’s thesis introduced Boolean logic to circuit design. And as a kid, he once set up a telegraph line to a friend’s house half a mile away.

So, he was perfectly placed to turn information into a mathematical quantity, deeply related to entropy, and prove some now-famous theorems about it.

These theorems set limits on how much information we can transmit through a noisy channel. More excitingly, they say we can cook up coding schemes that let us come as close as we want to this limit, with an arbitrarily low probability of error.

As Erico Guizzo points out, these results are fundamental to the ‘information age’ we live in today:

Can we transmit, say, a high-resolution picture over a telephone line? How long will that take? Is there a best way to do it?

Before Shannon, engineers had no clear answers to these questions. At that time, a wild zoo of technologies was in operation, each with a life of its own—telephone, telegraph, radio, television, radar, and a number of other systems developed during the war. Shannon came up with a unifying, general theory of communication. It didn’t matter whether you transmitted signals using a copper wire, an optical fiber, or a parabolic dish. It didn’t matter if you were transmitting text, voice, or images. Shannon envisioned communication in abstract, mathematical terms; he defined what the once fuzzy concept of “information” meant for communication engineers and proposed a precise way to quantify it. According to him, the information content of any kind of message could be measured in binary digits, or just bits—a name suggested by a colleague at Bell Labs. Shannon took the bit as the fundamental unit in information theory. It was the first time that the term appeared in print.

So, I want to understand Shannon’s theorems and their proofs—especially because they clarify the relation between information and entropy, two concepts I’d like to be an expert on. It’s sort of embarrassing that I don’t already know this stuff! But I thought I’d post some preliminary remarks anyway, in case you too are trying to learn this stuff, or in case you can help me.

There are various different theorems I should learn. For example:

• The source coding theorem says it’s impossible to compress a stream of data so that the average number of bits per symbol in the compressed data is less than the Shannon entropy of the source, without some of the data almost certainly getting lost. However, you can make the number of bits per symbol arbitrarily close to the Shannon entropy with a probability of error as small as you like.

• The noisy channel coding theorem generalizes this to data sent over a noisy channel.

The proof of the noisy channel coding theorem seems not so bad—there’s a sketch of a proof in the Wikipedia article on this theorem. But many theorems have a hard lemma at their heart, and for this one it’s a result in probability theory called the asymptotic equipartition property.

You should not try to dodge the hard lemma at the heart of the theorem you’re trying to understand: there’s a reason it’s there. So what’s the asymptotic equipartition property?

Here’s a somewhat watered-down statement that gets the basic idea across. Suppose you have a method of randomly generating letters—for example, a probability distribution on the set of letters. Suppose you randomly generate a string of n letters, and compute -(1/n) times the logarithm of the probability that you got that string. Then as n \to \infty this number ‘almost surely’ approaches some number S. What’s this number S? It’s the entropy of the probability distribution you used to generate those letters!

(‘Almost surely’ is probability jargon for ‘with probability 100%’, which is not the same as ‘always’.)

This result is really cool—definitely worth understanding in its own right! It says that while many strings are possible, the ones you’re most likely to see lie in a certain ‘typical set’. The ‘typical’ strings are the ones where when you compute -(1/n) times the log of their probability, the result is close to S. How close? Well, you get to pick that.

The typical strings are not individually the most probable strings! But if you randomly generate a string, it’s very probable that it lies in the typical set. That sounds a bit paradoxical, but if you think about it, you’ll see it’s not. Think of repeatedly flipping a coin that has a 90% chance of landing heads up. The most probable single outcome is that it lands heads up every time. But the typical outcome is that it lands up close to 90% of the time. And, there are lots of ways this can happen. So, if you flip the coin a bunch of times, there’s a very high chance that the outcome is typical.
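
Here’s a quick simulation of that biased coin, just to watch the asymptotic equipartition property kick in; it’s a sketch in Python, nothing more. For each random string of n flips we compute -(1/n) times the log of the probability of that exact string, and the results cluster near the entropy, about 0.469 bits per flip:

import numpy as np

rng = np.random.default_rng(0)
p, n = 0.9, 10000         # heads with probability 0.9; strings of n flips

entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print(entropy)            # about 0.469 bits per flip

for _ in range(5):
    heads = (rng.random(n) < p).sum()
    log_prob = heads * np.log2(p) + (n - heads) * np.log2(1 - p)
    print(-log_prob / n)  # close to the entropy, almost surely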

It’s easy to see how this result is the key to the source coding theorem; the noisy channel coding theorem uses a similar idea, with ‘jointly typical’ sequences. The ‘typical set’ has few elements compared to the whole set of strings with n letters. So, you can make short codes for the strings in this set, and compress your message that way, and this works almost all the time. How much you can compress your message depends on the entropy S.

So, we’re seeing the link between information and entropy!

The actual coding schemes that people use are a lot trickier than the simple scheme I’m hinting at here. When you read about them, you see scary things like this:

But presumably they’re faster to implement, hence more practical.

The first coding schemes that come really close to the Shannon limit are the turbo codes. Surprisingly, these codes were developed only in 1993! They’re used in 3G mobile communications and deep space satellite communications.

One key trick is to use not one decoder, but two. These two decoders keep communicating with each other and improving their guesses about the signal they’ve received, until they agree:

This iterative process continues until the two decoders come up with the same hypothesis for the m-bit pattern of the payload, typically in 15 to 18 cycles. An analogy can be drawn between this process and that of solving cross-reference puzzles like crossword or sudoku. Consider a partially completed, possibly garbled crossword puzzle. Two puzzle solvers (decoders) are trying to solve it: one possessing only the “down” clues (parity bits), and the other possessing only the “across” clues. To start, both solvers guess the answers (hypotheses) to their own clues, noting down how confident they are in each letter (payload bit). Then, they compare notes, by exchanging answers and confidence ratings with each other, noticing where and how they differ. Based on this new knowledge, they both come up with updated answers and confidence ratings, repeating the whole process until they converge to the same solution.

This can be seen as “an instance of loopy belief propagation in Bayesian networks.”

By the way, the picture I showed you above is a flowchart of the decoding scheme for a simple turbo code. You can see the two decoders, and maybe the loop where data gets fed back to the decoders.

While I said this picture is “scary”, I actually like it because it’s an example of network theory applied to real-life problems.

