Extremal Principles in Classical, Statistical and Quantum Mechanics

13 January, 2012

guest post by Mike Stay

The table in John’s post on quantropy shows that energy and action are analogous:

Statics | Dynamics
statistical mechanics | quantum mechanics
probabilities | amplitudes
Boltzmann distribution | Feynman sum over histories
energy | action
temperature | Planck’s constant times i
entropy | quantropy
free energy | free action

However, this seems to be part of a bigger picture that includes at least entropy as analogous to both of those, too. I think that just about any quantity defined by an integral over a path would behave similarly.

I see four broad areas to consider, based on a temperature parameter:

  1. T = 0: statics, or “least quantity”
  2. Real T > 0: statistical mechanics
  3. Imaginary T: a thermal ensemble gets replaced by a quantum superposition
  4. Complex T: ensembles of quantum systems, as in nuclear magnetic resonance

I’m not going to get into the last of these in what follows.

1. “Least quantity”

Lagrangian of a classical particle

K is kinetic energy, i.e. the “action density” due to motion.

V is potential energy, i.e. minus the “action density” due to position.

The action is then:

\displaystyle \begin{array}{rcl} A &=& \int (K-V) \, dt \\ &=& \int \left[m\left(\frac{dq(t)}{dt}\right)^2 - V(q(t))\right] dt \end{array}

where m is the particle’s mass. We get the principle of least action by setting \delta A = 0.
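To make the principle concrete, here is a minimal numerical sketch (not from the original post; it assumes the standard kinetic term \frac{1}{2}m\dot{q}^2 and a made-up harmonic potential): discretize the action on a time grid, fix the endpoints, and minimize over the interior points. The minimizer matches the classical trajectory.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (toy setup, not from the post): discretize the action
# A = sum over steps of [ (1/2) m (dq/dt)^2 - V(q) ] dt
# for a harmonic potential V(q) = (1/2) k q^2, with fixed endpoints,
# and find the stationary path by numerical minimization.
m, k = 1.0, 1.0
t = np.linspace(0.0, 1.0, 51)
dt = t[1] - t[0]
q0, q1 = 0.0, 1.0                           # fixed endpoints

def action(q_interior):
    q = np.concatenate(([q0], q_interior, [q1]))
    v = np.diff(q) / dt                     # velocity on each step
    K = 0.5 * m * v**2                      # kinetic term per step
    V = 0.5 * k * ((q[:-1] + q[1:]) / 2)**2 # potential at midpoints
    return np.sum((K - V) * dt)

guess = np.linspace(q0, q1, len(t))[1:-1]
res = minimize(action, guess, method="BFGS")

# Exact stationary path for these boundary conditions: q(t) = sin(w t)/sin(w).
w = np.sqrt(k / m)
exact = q1 * np.sin(w * t) / np.sin(w)
path = np.concatenate(([q0], res.x, [q1]))
print("max deviation from exact path:", np.max(np.abs(path - exact)))
```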

“Static” systems related by a Wick rotation

  1. Substitute q(s = iz) for q(t) to get a “springy” static system.

    In John’s homework problem A Spring in Imaginary Time, he guided students through a Wick-rotation-like process that transforms the Lagrangian above into the Hamiltonian of a springy system. (I say “springy” because it’s not exactly the Hamiltonian for a hanging spring: here each infinitesimal piece of the spring is at a fixed horizontal position and is free to move only vertically.)

    \kappa is the potential energy density due to stretching.

    \upsilon is the potential energy density due to position.

    We then have

    \displaystyle  \begin{array}{rcl}\int(\kappa-\upsilon) dz & = &  \int\left[k\left(\frac{dq(iz)}{dz}\right)^2 - \upsilon(q(iz))\right]  dz\\ & = & -i\int\left[-k\left(\frac{dq(iz)}{diz}\right)^2 -  \upsilon(q(iz))\right] diz\\ & = & i  \int\left[k\left(\frac{dq(iz)}{diz}\right)^2 + \upsilon(q(iz))\right]  diz \end{array}

    or letting s = iz,

    \displaystyle  \begin{array}{rcl}   & = &  i\int\left[k\left(\frac{dq(s)}{ds}\right)^2 + \upsilon(q(s))\right]  ds\\ & = & iE \end{array}

    where E is the potential energy of the spring. We get the principle of least energy by setting \delta E = 0.

  2. Substitute q(β = iz) for q(t) to get a thermometer
    system.

We can repeat the process above, but use inverse temperature, or “coolness”, instead of time. Note that this is still a statics problem at heart! We’ll introduce another temperature below when we allow for multiple possible q’s.

K is the potential energy due to the rate of change of q with respect to \beta. (This has to do with the thermal expansion coefficient: if we fix the length of the thermometer and then cool it, we get “stretching” potential energy.)

    V is any extra potential energy due to q.

    \displaystyle \begin{array}{rcl}\int(K-V) dz  & = & \int\left[k\left(\frac{dq(iz)}{dz}\right)^2 -  V(q(iz))\right] dz\\ & = &  -i\int\left[-k\left(\frac{dq(iz)}{diz}\right)^2 - V(q(iz))\right]  diz\\ & = & i \int\left[k\left(\frac{dq(iz)}{diz}\right)^2 +  V(q(iz))\right] diz \end{array}

    or letting \beta = iz,

    \displaystyle \begin{array}{rcl}   & = &  i\int\left[k\left(\frac{dq(\beta)}{d\beta}\right)^2 +  V(q(\beta))\right] d\beta\\ & = & iS_1\end{array}

    where S_1 is the entropy lost as the thermometer is cooled. We get the principle of “least entropy lost” by setting \delta S_1 = 0.

  3. Substitute q(T₁ = iz) for q(t).

    We can repeat the process above, but use temperature instead of time. We get a system whose heat capacity is governed by a function q(T) and its derivative. We’re trying to find the best function q, the most efficient way to raise the temperature of the system.

    C is the heat capacity (= entropy) proportional to (dq/dT_1)^2.

    V is the heat capacity due to q.

    \displaystyle \begin{array}{rcl}\int(C-V) dz  & = & \int\left[k\left(\frac{dq(iz)}{dz}\right)^2 -  V(q(iz))\right] dz\\ & = &  -i\int\left[-k\left(\frac{dq(iz)}{diz}\right)^2 - V(q(iz))\right]  diz\\ & = & i \int\left[k\left(\frac{dq(iz)}{diz}\right)^2 +  V(q(iz))\right] diz  \end{array}

    or letting T_1 = iz,

    \displaystyle \begin{array}{rcl} & = &  i\int\left[k\left(\frac{dq(T_1)}{dT_1}\right)^2 + V(q(T_1))\right]  dT_1\\ & = & iE \end{array}

where E is the energy required to raise the temperature. We again get the principle of least energy by setting \delta E = 0.

2. Statistical mechanics

Here we allow lots of possible q’s, then maximize entropy subject to constraints using the Lagrange multiplier trick.

Statistical mechanics of a particle

For the statistical mechanics of a particle, we choose a real measure a_x on the set of paths. For simplicity, we assume the set is finite.

Normalize so \sum a_x = 1.

Define entropy to be S = - \sum a_x \ln a_x.

Our problem is to choose a_x to minimize the “free action” F = A - \lambda S, or, what’s equivalent, to maximize S subject to a constraint on A.

To make units match, λ must have units of action, so it’s some multiple of ℏ. Replace λ by ℏλ so the free action is

F = A - \hbar\lambda\, S.

The distribution that minimizes the free action is the Gibbs distribution a_x = \exp(-A_x/\hbar\lambda) / Z, where Z is the usual partition function.
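Here’s a hedged sketch in code (toy numbers, not from the post) of why this distribution is the minimizer: compare the free action of the Gibbs distribution against many random distributions.

```python
import numpy as np

# Toy check: the Gibbs distribution a_x = exp(-A_x/(hbar*lam)) / Z
# minimizes the free action F = <A> - hbar*lam * S.
A = np.array([1.0, 2.0, 3.5, 0.5])   # made-up actions for four paths
hbar_lam = 1.3                        # plays the role of temperature

def free_action(a):
    S = -np.sum(a * np.log(a))        # entropy of the distribution
    return np.sum(a * A) - hbar_lam * S

gibbs = np.exp(-A / hbar_lam)
gibbs /= gibbs.sum()                  # divide by the partition function Z

rng = np.random.default_rng(0)
for _ in range(1000):                 # compare with random distributions
    a = rng.random(len(A))
    a /= a.sum()
    assert free_action(a) >= free_action(gibbs) - 1e-12

print("Gibbs free action:", free_action(gibbs))
```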

However, there are other observables of a path, like the position q_{1/2} at the halfway point; given another constraint on the average value of q_{1/2} over all paths, we get a distribution like

\displaystyle a_x = \exp(-\left[A_x + p q_{1/2}\right]/\hbar\lambda) / Z.

The conjugate variable to that position is a momentum: in order to get from the starting point to the given point in the allotted time, the particle has to have the corresponding momentum.

dA = \hbar\lambda\, dS - p\, dq.

Other examples from Wick rotation

  1. Introduce a temperature T [Kelvins] that perturbs the spring.

    We minimize the free energy F = E - kT\, S, i.e. maximize the entropy S subject to a constraint on the expected energy

    \langle E\rangle = \sum a_x E_x.

    We get the measure a_x = \exp(-E_x/kT) / Z.

Other observables about the spring’s path give conjugate variables whose product is energy. Given a constraint on the average position of the spring at the halfway point, we get a conjugate force: pulling the spring out of equilibrium requires a force.

    dE = kT\, dS - F\, dq.

  2. Statistical ensemble of thermometers with ensemble temperature T₂ [unitless].

    We minimize the “free entropy” F = S_1 - T_2S_2, i.e. we maximize the entropy S_2 subject to a constraint on the expected entropy lost

    \langle S_1\rangle = \sum a_x S_{1,x}.

    We get the measure a_x = \exp(-S_{1,x}/T_2) / Z.

    Given a constraint on the average position at the halfway point, we get a conjugate inverse length r that tells how much entropy is lost when the thermometer shrinks by dq.

    dS_1 = T_2\, dS_2 - r\, dq.

  3. Statistical ensemble of functions q with ensemble temperature T₂ [Kelvins].

    We minimize the free energy F = E - kT_2\, S, i.e. we maximize the entropy S subject to a constraint on the expected energy

    \displaystyle \langle E\rangle = \sum a_x E_x.

    We get the measure a_x = \exp(-E_x/kT_2) / Z.

    Again, a constraint on the position would give a conjugate force. It’s a little harder to see how here, but given a non-optimal function q(T), we have an extra energy cost due to inefficiency that’s analogous to the stretching potential energy when pulling a spring out of equilibrium.

3. Thermo to quantum via Wick rotation of Lagrange multiplier

We allow a complex-valued measure a as John did in the article on quantropy. We pick a logarithm for each a_x and assume they don’t go through zero as we vary them. We also choose an imaginary Lagrange multiplier.

Normalize so \sum a_x = 1.

Define quantropy Q = - \sum a_x \ln a_x.

Find a stationary point of the free action F = A - \hbar\lambda\, Q.

We get a_x = \exp(-A_x/\hbar\lambda) / Z. If \lambda = -i, we get Feynman’s sum over histories. Surely something like the two-slit experiment considers histories with a constraint on position at a particular time, and we get a conjugate momentum?
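A tiny sketch (toy actions, made up; take ℏ = 1) of what the formula gives when λ = −i: the a_x become complex phases \exp(-iA_x), normalized by a complex partition function Z — a finite “sum over histories”.

```python
import numpy as np

# Toy sketch (made-up actions, hbar = 1): with lambda = -i the stationary
# "measure" a_x = exp(-A_x/(hbar*lam)) / Z is a normalized complex amplitude.
A = np.array([0.3, 1.1, 2.0, 0.7])   # actions of four hypothetical histories
lam = -1j
a = np.exp(-A / lam)                  # = exp(-i A_x): unit-modulus phases
Z = a.sum()                           # complex partition function
a = a / Z
print(a)
print(a.sum())                        # sums to 1, as the normalization demands
```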

A Quantum Version of Entropy

Again allow complex-valued a_x. However, this time normalize these by setting \sum |a_x|^2 = 1.

Define a quantum version of entropy S = - \sum |a_x|^2  \ln |a_x|^2.

  1. Allow quantum superposition of perturbed springs.

\langle E\rangle = \sum |a_x|^2 E_x. Get a_x = \exp(-E_x/kT) / Z. If T = -i\hbar/(kt), we get the evolution of the quantum state |q\rangle under the given Hamiltonian for a time t.

  2. Allow quantum superpositions of thermometers.

    \langle S_1\rangle = \sum |a_x|^2 S_{1,x}. Get a_x =  \exp(-S_{1,x}/T_2) / Z. If T_2 = -i, we get something like a sum over histories, but with a different normalization condition that converges because our set of paths is finite.

  3. Allow quantum superposition of systems.

\langle E \rangle = \sum |a_x|^2 E_x. Get a_x = \exp(-E_x/kT_2) / Z. If T_2 = -i\hbar/(kt), we get the result of “Measure E, then heat the superposition T₁ degrees in a time much less than t seconds, then wait t seconds.” Different functions q in the superposition change the heat capacity differently and thus the systems end up at different energies.

So to sum up, there’s at least a three-way analogy between action, energy, and entropy depending on what you’re integrating over. You get a kind of “statics” if you extremize the integral by varying the path; by allowing multiple paths and constraints on observables, you get conjugate variables and “free” quantities that you want to minimize; and by taking the temperature to be imaginary, you get quantum systems.


Probabilities Versus Amplitudes

5 December, 2011

Here are the slides of the talk I’m giving at the CQT Annual Symposium on Wednesday afternoon, which is Tuesday morning for a lot of you. If you catch mistakes, I’d love to hear about them before then!

Probabilities versus amplitudes.

Abstract: Some ideas from quantum theory are just beginning to percolate back to classical probability theory. For example, there is a widely used and successful theory of “chemical reaction networks”, which describes the interactions of molecules in a stochastic rather than quantum way. If we look at it from the perspective of quantum theory, this turns out to involve creation and annihilation operators, coherent states and other well-known ideas—but with a few big differences. The stochastic analogue of quantum field theory is also used in population biology, and here the connection is well-known. But what does it mean to treat wolves as fermions or bosons?


Network Theory (Part 16)

4 November, 2011

We’ve been comparing two theories: stochastic mechanics and quantum mechanics. Last time we saw that any graph gives us an example of both theories! It’s a bit peculiar, but today we’ll explore the intersection of these theories a little further, and see that it has another interpretation. It’s also the theory of electrical circuits made of resistors!

That’s nice, because I’m supposed to be talking about ‘network theory’, and electrical circuits are perhaps the most practical networks of all:

I plan to talk a lot about electrical circuits. I’m not quite ready to dive in, but I can’t resist dipping my toe in the water today. Why don’t you join me? It’s not too cold!

Dirichlet operators

Last time we saw that any graph gives us an operator called the ‘graph Laplacian’ that’s both infinitesimal stochastic and self-adjoint. That means we get both:

• a Markov process describing the random walk of a classical particle on the graph.

and

• a 1-parameter unitary group describing the motion of a quantum particle on the graph.

That’s sort of neat, so it’s natural to wonder what are all the operators that are both infinitesimal stochastic and self-adjoint. They’re called ‘Dirichlet operators’, and at least in the finite-dimensional case we’re considering, they’re easy to completely understand. Even better, it turns out they describe electrical circuits made of resistors!

Today let’s take a lowbrow attitude and think of a linear operator H : \mathbb{C}^n \to \mathbb{C}^n as an n \times n matrix with entries H_{i j}. Then:

H is self-adjoint if it equals the conjugate of its transpose:

H_{i j} = \overline{H}_{j i}

H is infinitesimal stochastic if its columns sum to zero and its off-diagonal entries are real and nonnegative:

\displaystyle{ \sum_i H_{i j} = 0 }

i \ne j \Rightarrow H_{i j} \ge 0

H is a Dirichlet operator if it’s both self-adjoint and infinitesimal stochastic.

What are Dirichlet operators like? Suppose H is a Dirichlet operator. Then its off-diagonal entries are \ge 0, and since

\displaystyle{ \sum_i H_{i j} = 0}

its diagonal entries obey

\displaystyle{ H_{i i} = - \sum_{ i \ne j} H_{i j} \le 0 }

So all the entries of the matrix H are real, which in turn implies it’s symmetric:

H_{i j} = \overline{H}_{j i} = H_{j i}

So, we can build any Dirichlet operator H as follows (see the sketch in code after this recipe):

• Choose the entries above the diagonal, H_{i j} with i < j, to be arbitrary nonnegative real numbers.

• The entries below the diagonal, H_{i j} with i > j, are then forced on us by the requirement that H be symmetric: H_{i j} = H_{j i}.

• The diagonal entries are then forced on us by the requirement that the columns sum to zero: H_{i i} = - \sum_{ i \ne j} H_{i j}.
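Here is a minimal sketch of that recipe (random entries, just for illustration):

```python
import numpy as np

# Minimal sketch (random entries, for illustration): build a Dirichlet
# operator following the three-step recipe above.
rng = np.random.default_rng(1)
n = 4
H = np.zeros((n, n))

for i in range(n):
    for j in range(i + 1, n):
        H[i, j] = rng.random()   # step 1: arbitrary nonnegative entries above the diagonal
        H[j, i] = H[i, j]        # step 2: symmetry forces the entries below

for i in range(n):
    H[i, i] = -H[i].sum()        # step 3: diagonal makes each column sum to zero

assert np.allclose(H.sum(axis=0), 0)   # infinitesimal stochastic
assert np.allclose(H, H.T)             # self-adjoint (real and symmetric)
```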

Note that because the entries are real, we can think of a Dirichlet operator as a linear operator H : \mathbb{R}^n \to \mathbb{R}^n. We’ll do that for the rest of today.

Circuits made of resistors

Now for the fun part. We can easily draw any Dirichlet operator! To do this we draw n dots, connect each pair of distinct dots with an edge, and label the edge connecting the ith dot to the jth with any number H_{i j} \ge 0:

This contains all the information we need to build our Dirichlet operator. To make the picture prettier, we can leave out the edges labelled by 0:

Like last time, the graphs I’m talking about are simple: undirected, with no edges from a vertex to itself, and at most one edge from one vertex to another. So:

Theorem. Any finite simple graph with edges labelled by positive numbers gives a Dirichlet operator, and conversely.

We already talked about a special case last time: if we label all the edges by the number 1, our operator H is called the graph Laplacian. So, now we’re generalizing that idea by letting the edges have more interesting labels.

What’s the meaning of this trick? Well, we can think of our graph as an electrical circuit where the edges are wires. What do the numbers labelling these wires mean? One obvious possibility is to put a resistor on each wire, and let that number be its resistance. But that doesn’t make sense, since we’re leaving out wires labelled by 0. If we leave out a wire, that’s not like having a wire of zero resistance: it’s like having a wire of infinite resistance! No current can go through when there’s no wire. So the number labelling an edge should be the conductance of the resistor on that wire. Conductance is the reciprocal of resistance.

So, our Dirichlet operator above gives a circuit like this:

Here Ω is the symbol for an ‘ohm’, a unit of resistance… but the upside-down version, namely ℧, is the symbol for a ‘mho’, a unit of conductance that’s the reciprocal of an ohm.

Let’s see if this cute idea leads anywhere. Think of a Dirichlet operator H : \mathbb{R}^n \to \mathbb{R}^n as a circuit made of resistors. What could a vector \psi \in \mathbb{R}^n mean? It assigns a real number to each vertex of our graph. The only sensible option is for this number to be the electric potential at that point in our circuit. So let’s try that.

Now, what’s

\langle \psi, H \psi \rangle  ?

In quantum mechanics this would be a very sensible thing to look at: it would give us the expected value of the Hamiltonian H in a state \psi. But what does it mean in the land of electrical circuits?

Up to a constant fudge factor, it turns out to be the power consumed by the electrical circuit!

Let’s see why. First, remember that when a current flows along a wire, power gets consumed. In other words, electrostatic potential energy gets turned into heat. The power consumed is

P = V I

where V is the voltage across the wire and I is the current flowing along the wire. If we assume our wire has resistance R we also have Ohm’s law:

I = V / R

so

\displaystyle{ P = \frac{V^2}{R} }

If we write this using the conductance instead of the resistance R, we get

P = \textrm{conductance} \; V^2

But our electrical circuit has lots of wires, so the power it consumes will be a sum of terms like this. We’re assuming H_{i j} is the conductance of the wire from the ith vertex to the jth, or zero if there’s no wire connecting them. And by definition, the voltage across this wire is the difference in electrostatic potentials at the two ends: \psi_i - \psi_j. So, the total power consumed is

\displaystyle{ P = \sum_{i \ne j}  H_{i j} (\psi_i - \psi_j)^2 }

This is nice, but what does it have to do with \langle \psi , H \psi \rangle?

The answer is here:

Theorem. If H : \mathbb{R}^n \to \mathbb{R}^n is any Dirichlet operator, and \psi \in \mathbb{R}^n is any vector, then

\displaystyle{ \langle \psi , H \psi \rangle = -\frac{1}{2} \sum_{i \ne j}  H_{i j} (\psi_i - \psi_j)^2 }

Proof. Let’s start with the formula for power:

\displaystyle{ P = \sum_{i \ne j}  H_{i j} (\psi_i - \psi_j)^2 }

Note that this sum includes the condition i \ne j, since we only have wires going between distinct vertices. But the summand is zero if i = j, so we also have

\displaystyle{ P = \sum_{i, j}  H_{i j} (\psi_i - \psi_j)^2 }

Expanding the square, we get

\displaystyle{ P = \sum_{i, j}  H_{i j} \psi_i^2 - 2 H_{i j} \psi_i \psi_j + H_{i j} \psi_j^2 }

The middle term looks promisingly similar to \langle \psi, H \psi \rangle, but what about the other two terms? Because H_{i j} = H_{j i}, they’re equal:

\displaystyle{ P = \sum_{i, j} - 2 H_{i j} \psi_i \psi_j + 2 H_{i j} \psi_j^2  }

And in fact they’re zero! Since H is infinitesimal stochastic, we have

\displaystyle{ \sum_i H_{i j} = 0 }

so

\displaystyle{ \sum_i H_{i j} \psi_j^2 = 0 }

and it’s still zero when we sum over j. We thus have

\displaystyle{ P = - 2 \sum_{i, j} H_{i j} \psi_i \psi_j }

But since \psi_i is real, this is -2 times

\displaystyle{ \langle \psi, H \psi \rangle  = \sum_{i, j}  H_{i j} \overline{\psi}_i \psi_j }

So, we’re done.   █
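If you like numerical evidence, here’s a quick check of the theorem on a random Dirichlet operator (a sketch, with made-up conductances):

```python
import numpy as np

# Quick numerical check of the theorem (made-up conductances).
rng = np.random.default_rng(2)
n = 5
C = np.triu(rng.random((n, n)), k=1)   # conductances H_ij above the diagonal
H = C + C.T                            # symmetrize
np.fill_diagonal(H, -H.sum(axis=0))    # columns sum to zero

psi = rng.standard_normal(n)           # arbitrary real potentials

lhs = psi @ H @ psi                    # <psi, H psi>
P = sum(H[i, j] * (psi[i] - psi[j])**2
        for i in range(n) for j in range(n) if i != j)
assert np.isclose(lhs, -0.5 * P)       # the theorem holds
```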

An instant consequence of this theorem is that a Dirichlet operator has

\langle \psi , H \psi \rangle \le 0

for all \psi. Actually most people use the opposite sign convention in defining infinitesimal stochastic operators. This makes H_{i j} \le 0, which is mildly annoying, but it gives

\langle \psi , H \psi \rangle \ge 0

which is nice. When H is a Dirichlet operator, defined with this opposite sign convention, \langle \psi , H \psi \rangle is called a Dirichlet form.

The big picture

Maybe it’s a good time to step back and see where we are.

So far we’ve been exploring the analogy between stochastic mechanics and quantum mechanics. Where do networks come in? Well, they’ve actually come in twice so far:

1) First we saw that Petri nets can be used to describe stochastic or quantum processes where things of different kinds randomly react and turn into other things. A Petri net is a kind of network like this:

The different kinds of things are the yellow circles; we called them states, because sometimes we think of them as different states of a single kind of thing. The reactions where things turn into other things are the blue squares: we called them transitions. We label the transitions with numbers indicating the rates at which they occur.

2) Then we looked at stochastic or quantum processes where in each transition a single thing turns into a single thing. We can draw these as Petri nets where each transition has just one state as input and one state as output. But we can also draw them as directed graphs with edges labelled by numbers:

Now the dark blue boxes are states and the edges are transitions!

Today we looked at a special case of the second kind of network: the Dirichlet operators. For these the ‘forward’ transition rate H_{i j} equals the ‘reverse’ rate H_{j i}, so our graph can be undirected: no arrows on the edges. And for these the rates H_{i i} are determined by the rest, so we can omit the edges from vertices to themselves:

The result can be seen as an electrical circuit made of resistors! So we’re building up a little dictionary:

• Stochastic mechanics: \psi_i is a probability and H_{i j} is a transition rate (probability per time).

• Quantum mechanics: \psi_i is an amplitude and H_{i j} is a transition rate (amplitude per time).

• Circuits made of resistors: \psi_i is a voltage and H_{i j} is a conductance.

This dictionary may seem rather odd—especially the third item, which looks completely different than the first two! But that’s good: when things aren’t odd, we don’t get many new ideas. The whole point of this ‘network theory’ business is to think about networks from many different viewpoints and let the sparks fly!

Actually, this particular oddity is well-known in certain circles. We’ve been looking at the discrete version, where we have a finite set of states. But in the continuum, the classic example of a Dirichlet operator is the Laplacian H = \nabla^2. And then we have:

• The heat equation:

\frac{d}{d t} \psi = \nabla^2 \psi

is fundamental to stochastic mechanics.

• The Schrödinger equation:

\frac{d}{d t} \psi = -i \nabla^2 \psi

is fundamental to quantum mechanics.

• The Poisson equation:

\nabla^2 \psi = -\rho

is fundamental to electrostatics.

Briefly speaking, electrostatics is the study of how the electric potential \psi depends on the charge density \rho. The theory of electrical circuits made of resistors can be seen as a special case, at least when the current isn’t changing with time.

I’ll say a lot more about this… but not today! If you want to learn more, this is a great place to start:

• P. G. Doyle and J. L. Snell, Random Walks and Electric Networks, Mathematical Association of America, Washington DC, 1984.

This free online book explains, in a really fun informal way, how random walks on graphs are related to electrical circuits made of resistors. To dig deeper into the continuum case, try:

• M. Fukushima, Dirichlet Forms and Markov Processes, North-Holland, Amsterdam, 1980.


Network Theory (Part 15)

26 October, 2011

Last time we saw how to get a graph whose vertices are states of a molecule and whose edges are transitions between states. We focused on two beautiful but not completely realistic examples that both give rise to the same highly symmetrical graph: the ‘Desargues graph’.

Today I’ll start with a few remarks about the Desargues graph. Then I’ll get to work showing how any graph gives:

• A Markov process, namely a random walk on the graph.

• A quantum process, where instead of having a probability to hop from vertex to vertex as time passes, we have an amplitude.

The trick is to use an operator called the ‘graph Laplacian’, a discretized version of the Laplacian which happens to be both infinitesimal stochastic and self-adjoint. As we saw in Part 12, such an operator will give rise both to a Markov process and a quantum process (that is, a one-parameter unitary group).

The most famous operator that’s both infinitesimal stochastic and self-adjoint is the Laplacian, \nabla^2. Because it’s both, the Laplacian shows up in two important equations: one in stochastic mechanics, the other in quantum mechanics.

• The heat equation:

\displaystyle{ \frac{d}{d t} \psi = \nabla^2 \psi }

describes how the probability \psi(x) of a particle being at the point x smears out as the particle randomly walks around:

The corresponding Markov process is called ‘Brownian motion’.

• The Schrödinger equation:

\displaystyle{ \frac{d}{d t} \psi = -i \nabla^2 \psi }

describes how the amplitude \psi(x) of a particle being at the point x wiggles about as the particle ‘quantumly’ walks around.

Both these equations have analogues where we replace space by a graph, and today I’ll describe them.

Drawing the Desargues graph

First I want to show you a nice way to draw the Desargues graph. For this it’s probably easiest to go back to our naive model of an ethyl cation:

Even though ethyl cations don’t really look like this, and we should be talking about some trigonal bipyramidal molecule instead, it won’t affect the math to come. Mathematically, the two problems are isomorphic! So let’s stick with this nice simple picture.

We can be a bit more abstract, though. A state of the ethyl cation is like having 5 balls, with 3 in one pile and 2 in the other. And we can focus on the first pile and forget the second, because whatever isn’t in the first pile must be in the second.

Of course a mathematician calls a pile of things a ‘set’, and calls those things ‘elements’. So let’s say we’ve got a set with 5 elements. Draw a red dot for each 2-element subset, and a blue dot for each 3-element subset. Draw an edge between a red dot and a blue dot whenever the 2-element subset is contained in the 3-element subset. We get the Desargues graph.

That’s true by definition. But I never proved that any of the pictures I showed you are correct! For example, this picture shows the Desargues graph:

but I never really proved this fact—and I won’t now, either.

To draw a picture we know is correct, it’s actually easier to start with a big graph that has vertices for all the subsets of our 5-element set. If we draw an edge whenever an n-element subset is contained in an (n+1)-element subset, the Desargues graph will be sitting inside this big graph.

Here’s what the big graph looks like:

This graph has 2^5 vertices. It’s actually a picture of a 5-dimensional hypercube! The vertices are arranged in columns. There’s

• one 0-element subset,

• five 1-element subsets,

• ten 2-element subsets,

• ten 3-element subsets,

• five 4-element subsets,

• one 5-element subset.

So, the numbers of vertices in each column go like this:

1 \quad 5 \quad 10 \quad 10 \quad 5 \quad 1

which is a row in Pascal’s triangle. We get the Desargues graph if we keep only the vertices corresponding to 2- and 3-element subsets, like this:

It’s less pretty than our earlier picture, but at least there’s no mystery to it. Also, it shows that the Desargues graph can be generalized in various ways. For example, there’s a theory of bipartite Kneser graphs H(n,k). The Desargues graph is H(5,2).
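Since the subset description is completely explicit, it’s easy to build the graph in a few lines of code and confirm its basic statistics (a sketch, not from the post):

```python
from itertools import combinations

# Sketch: the Desargues graph as the bipartite Kneser graph H(5,2) --
# vertices are the 2- and 3-element subsets of a 5-element set, with an
# edge whenever the 2-element subset is contained in the 3-element one.
twos   = [frozenset(c) for c in combinations(range(5), 2)]
threes = [frozenset(c) for c in combinations(range(5), 3)]
vertices = twos + threes
edges = [(a, b) for a in twos for b in threes if a <= b]

print(len(vertices), len(edges))     # 20 vertices, 30 edges
degree = {v: 0 for v in vertices}
for a, b in edges:
    degree[a] += 1
    degree[b] += 1
assert all(d == 3 for d in degree.values())   # the graph is 3-regular
```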

Desargues’ theorem

I can’t resist answering this question: why is it called the ‘Desargues graph’? This name comes from Desargues’ theorem, a famous result in projective geometry. Suppose you have two triangles ABC and abc, like this:

Suppose the lines Aa, Bb, and Cc all meet at a single point, the ‘center of perspectivity’. Then the point of intersection of ab and AB, the point of intersection of ac and AC, and the point of intersection of bc and BC all lie on a single line, the ‘axis of perspectivity’. The converse is true too. Quite amazing!

The Desargues configuration consists of all the actors in this drama:

• 10 points: A, B, C, a, b, c, the center of perspectivity, and the three points on the axis of perspectivity

and

• 10 lines: Aa, Bb, Cc, AB, AC, BC, ab, ac, bc and the axis of perspectivity

Given any configuration of points and lines, we can form a graph called its Levi graph by drawing a vertex for each point or line, and drawing edges to indicate which points lie on which lines. And now for the punchline: the Levi graph of the Desargues configuration is the ‘Desargues-Levi graph’!—or Desargues graph, for short.

Alas, I don’t know how this is relevant to anything I’ve discussed. For now it’s just a tantalizing curiosity.

A random walk on the Desargues graph

Back to business! I’ve been telling you about the analogy between quantum mechanics and stochastic mechanics. This analogy becomes especially interesting in chemistry, which lies on the uneasy borderline between quantum and stochastic mechanics.

Fundamentally, of course, atoms and molecules are described by quantum mechanics. But sometimes chemists describe chemical reactions using stochastic mechanics instead. When can they get away with this? Apparently whenever the molecules involved are big enough and interacting with their environment enough for ‘decoherence’ to kick in. I won’t attempt to explain this now.

Let’s imagine we have a molecule of iron pentacarbonyl with—here’s the unrealistic part, but it’s not really too bad—distinguishable carbonyl groups:

Iron pentacarbonyl is liquid at room temperatures, so as time passes, each molecule will bounce around and occasionally do a maneuver called a ‘pseudorotation’:

We can approximately describe this process as a random walk on a graph whose vertices are states of our molecule, and whose edges are transitions between states—namely, pseudorotations. And as we saw last time, this graph is the Desargues graph:

Note: all the transitions are reversible here. And thanks to the enormous amount of symmetry, the rates of all these transitions must be equal.

Let’s write V for the set of vertices of the Desargues graph. A probability distribution of states of our molecule is a function

\displaystyle{ \psi : V \to [0,\infty) }

with

\displaystyle{ \sum_{x \in V} \psi(x) = 1 }

We can think of these probability distributions as living in this vector space:

L^1(V) = \{ \psi: V \to \mathbb{R} \}

I’m calling this space L^1 because of the general abstract nonsense explained in Part 12: probability distributions on any measure space live in a vector space called L^1. Today that notation is overkill, since every function on V lies in L^1. But please humor me.

The point is that we’ve got a general setup that applies here. There’s a Hamiltonian:

H : L^1(V) \to L^1(V)

describing the rate at which the molecule randomly hops from one state to another… and the probability distribution \psi \in L^1(V) evolves in time according to the equation:

\displaystyle{ \frac{d}{d t} \psi(t) = H \psi(t) }

But what’s the Hamiltonian H? It’s very simple, because it’s equally likely for the state to hop from any vertex to any other vertex that’s connected to that one by an edge. Why? Because the problem has so much symmetry that nothing else makes sense.

So, let’s write E for the set of edges of the Desargues graph. We can think of this as a subset of V \times V by saying (x,y) \in E when x is connected to y by an edge. Then

\displaystyle{ (H \psi)(x) =  \sum_{y \,\, \textrm{such that} \,\, (x,y) \in E} \!\!\!\!\!\!\!\!\!\!\! \psi(y) \quad - \quad 3 \psi(x) }

We’re subtracting 3 \psi(x) because there are 3 edges coming out of each vertex x, so this is the rate at which the probability of staying at x decreases. We could multiply this Hamiltonian by a constant if we wanted the random walk to happen faster or slower… but let’s not.

The next step is to solve this discretized version of the heat equation:

\displaystyle{ \frac{d}{d t} \psi(t) = H \psi(t) }

Abstractly, the solution is easy:

\psi(t) = \exp(t H) \psi(0)

But to actually compute \exp(t H), we might want to diagonalize the operator H. In this particular example, we could take advantage of the enormous symmetry of the Desargues graph. Its symmetry group includes the permutation group S_5, so we could take the vector space L^1(V) and break it up into irreducible representations of S_5. Each of these will give an eigenspace of H, so by this method we can diagonalize H. I’d sort of like to try this… but it’s a big digression, so I won’t. At least, not today!
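Instead of diagonalizing by hand, here’s a sketch (not from the post) that builds this Hamiltonian as a 20 × 20 matrix and computes the random walk \exp(t H) numerically; as t grows, the distribution flows toward the uniform one.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import expm

# Sketch: the Hamiltonian above as a 20 x 20 matrix, and the random walk
# exp(tH) starting from one definite state of the molecule.
twos   = [frozenset(c) for c in combinations(range(5), 2)]
threes = [frozenset(c) for c in combinations(range(5), 3)]
V = twos + threes
index = {v: i for i, v in enumerate(V)}

H = np.zeros((20, 20))
for a in twos:
    for b in threes:
        if a <= b:
            H[index[a], index[b]] = H[index[b], index[a]] = 1
np.fill_diagonal(H, -3)        # each vertex of the Desargues graph has degree 3

psi0 = np.zeros(20)
psi0[0] = 1.0                  # probability 1 of one particular state
for t in (0.1, 1.0, 10.0):
    psi = expm(t * H) @ psi0
    print(t, psi.sum(), psi.max())   # total probability stays 1; spreads toward 1/20
```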

Graph Laplacians

The Hamiltonian we just saw is an example of a ‘graph Laplacian’. We can write down such a Hamiltonian for any graph, but it gets a tiny bit more complicated when different vertices have different numbers of edges coming out of them.

The word ‘graph’ means lots of things, but right now I’m talking about simple graphs. Such a graph has a set of vertices V and a set of edges E \subseteq V \times V, such that

(x,y) \in E \implies (y,x) \in E

which says the edges are undirected, and

(x,x) \notin E

which says there are no loops. Let d(x) be the degree of the vertex x, meaning the number of edges coming out of it.

Then the graph Laplacian is this operator on L^1(V):

\displaystyle{ (H \psi)(x) =  \sum_{y \,\, \textrm{such that} \,\, (x,y) \in E} \!\!\!\!\!\!\!\!\!\!\! \psi(y) \quad - \quad d(x) \psi(x) }

There is a huge amount to say about graph Laplacians! If you want, you can get started here:

• Michael William Newman, The Laplacian Spectrum of Graphs, Masters Thesis, Department of Mathematics, University of Manitoba, 2000.

But for now, let’s just say that \exp(t H) is a Markov process describing a random walk on the graph, where hopping from one vertex to any neighboring vertex has unit probability per unit time. We can make the hopping faster or slower by multiplying H by a constant. And here is a good time to admit that most people use a graph Laplacian that’s the negative of ours, and write time evolution as \exp(-t H). The advantage is that then the eigenvalues of the Laplacian are \ge 0.

But what matters most is this. We can write the operator H as a matrix whose entry H_{x y} is 1 when there’s an edge from x to y and 0 otherwise, except when x = y, in which case the entry is -d(x). And then:

Puzzle 1. Show that for any finite graph, the graph Laplacian H is infinitesimal stochastic, meaning that:

\displaystyle{ \sum_{x \in V} H_{x y} = 0 }

and

x \ne y \implies  H_{x y} \ge 0

This fact implies that for any t \ge 0, the operator \exp(t H) is stochastic—just what we need for a Markov process.

But we could also use H as a Hamiltonian for a quantum system, if we wanted. Now we think of \psi(x) as the amplitude for being in the state x \in V. But now \psi is a function

\psi : V \to \mathbb{C}

with

\displaystyle{ \sum_{x \in V} |\psi(x)|^2 = 1 }

We can think of this function as living in the Hilbert space

L^2(V) = \{ \psi: V \to \mathbb{C} \}

where the inner product is

\langle \phi, \psi \rangle = \displaystyle{ \sum_{x \in V} \overline{\phi(x)} \psi(x) }

Puzzle 2. Show that for any finite graph, the graph Laplacian H: L^2(V) \to L^2(V) is self-adjoint, meaning that:

H_{x y} = \overline{H}_{y x}

This implies that for any t \in \mathbb{R}, the operator \exp(-i t H) is unitary—just what we need for a one-parameter unitary group. So, we can take this version of Schrödinger’s equation:

\displaystyle{ \frac{d}{d t} \psi = -i H \psi }

and solve it:

\displaystyle{ \psi(t) = \exp(-i t H) \psi(0) }

and we’ll know that time evolution is unitary!
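Here’s a numerical check of both puzzles for one small example, the graph Laplacian of a 4-cycle (a sketch, not a proof):

```python
import numpy as np
from scipy.linalg import expm

# Numerical check of Puzzles 1 and 2 for the 4-cycle graph.
n = 4
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1     # adjacency matrix of the cycle
H = A - np.diag(A.sum(axis=0))                     # graph Laplacian

assert np.allclose(H.sum(axis=0), 0)               # infinitesimal stochastic...
assert np.allclose(H, H.conj().T)                  # ...and self-adjoint

t = 0.7
U = expm(-1j * t * H)
assert np.allclose(U.conj().T @ U, np.eye(n))      # exp(-itH) is unitary

M = expm(t * H)
assert np.allclose(M.sum(axis=0), 1) and M.min() > -1e-12   # exp(tH) is stochastic
```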

So, we’re in a dream world where we can do stochastic mechanics and quantum mechanics with the same Hamiltonian. I’d like to exploit this somehow, but I’m not quite sure how. Of course physicists like to use a trick called Wick rotation where they turn quantum mechanics into stochastic mechanics by replacing time by imaginary time. We can do that here. But I’d like to do something new, special to this context.

Maybe I should learn more about chemistry and graph theory. Of course, graphs show up in at least two ways: first for drawing molecules, and second for drawing states and transitions, as I’ve been doing. These books are supposed to be good:

• Danail Bonchev and D.H. Rouvray, eds., Chemical Graph Theory: Introduction and Fundamentals, Taylor and Francis, 1991.

• Nenad Trinajstic, Chemical Graph Theory, CRC Press, 1992.

• R. Bruce King, Applications of Graph Theory and Topology in Inorganic Cluster Coordination Chemistry, CRC Press, 1993.

The second is apparently the magisterial tome of the subject. The prices of these books are absurd: for example, Amazon sells the first for $300, and the second for $222. Luckily the university here should have them…


The Decline Effect

18 October, 2011

I bumped into a surprising article recently:

• Jonah Lehrer, Is there something wrong with the scientific method?, New Yorker, 13 December 2010.

It starts with a bit of a bang:

Before the effectiveness of a drug can be confirmed, it must be tested and tested again. Different scientists in different labs need to repeat the protocols and publish their results. The test of replicability, as it’s known, is the foundation of modern research. Replicability is how the community enforces itself. It’s a safeguard for the creep of subjectivity. Most of the time, scientists know what results they want, and that can influence the results they get. The premise of replicability is that the scientific community can correct for these flaws.

But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. This phenomenon doesn’t yet have an official name, but it’s occurring across a wide range of fields, from psychology to ecology. In the field of medicine, the phenomenon seems extremely widespread, affecting not only antipsychotics but also therapies ranging from cardiac stents to vitamin E and antidepressants: Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades.

This phenomenon does have a name now: it’s called the decline effect. The article tells some amazing stories about it. If you’re in the mood for some fun, I suggest going to your favorite couch or café now, and reading them!

For example: John Ioannidis is the author of the most heavily downloaded paper in the open-access journal PLoS Medicine. It’s called Why most published research findings are false.

In it, Ioannidis took three prestigious medical journals and looked at the 49 most cited clinical research studies. 45 of them used randomized controlled trials and reported positive results. But of the 34 that people tried to replicate, 41% were either directly contradicted or had their effect sizes significantly downgraded.

For more examples, read the article or listen to this radio show:

Cosmic Habituation, Radiolab, May 3, 2011.

It’s a bit sensationalistic… but it’s fun. It features Jonathan Schooler, who discovered a famous effect in psychology, called verbal overshadowing. It doesn’t really matter what this effect is. What matters is that it showed up very strongly in his first experiments… but as he and others continued to study it, it gradually diminished over time! He got freaked out. And then looked around, and saw that this sort of decline happened all over the place, in lots of cases.

What could cause this ‘decline effect’? There are lots of possible explanations.

At one extreme, maybe the decline effect doesn’t really exist. Maybe this sort of decline just happens sometimes purely by chance. Maybe there are equally many cases where effects seem to get stronger each time they’re measured!

At the other extreme, a very disturbing possibility has been proposed by Jonathan Schooler. He suggests that somehow the laws of reality change when they’re studied, in such a way that initially strong effects gradually get weaker.

I don’t believe this. It’s logically possible, but there are lots of less radical explanations to rule out first.

But if it were true, maybe we could make the decline effect go away by studying it. The decline effect would itself decline!

Unless of course, you started studying the decline of the decline effect.

Okay. On to some explanations that are interesting but less far-out.

One plausible explanation is significance chasing. Scientists work really hard to find something that’s ‘statistically significant’ according to the widely-used criterion of having a p-value of less than 0.05.

That sounds technical, but basically all it means is this: there was at most a 5% chance of having found a deviation from the expected situation that’s as big as the one you found.

(To play this game, you have to say ahead of time what the ‘expected situation’ is: this is your null hypothesis.)

Why is significance chasing dangerous? How can it lead to the decline effect?

Well, here’s how to write a paper with a statistically significant result. Go through 20 different colors of jelly bean and see if people who eat them have more acne than average. There’s a good chance that one of your experiments will say ‘yes’ with a p-value of less than 0.05, just because 0.05 = 1/20. If so, this experiment gives a statistically significant result!

I took this example from Randall Munroe’s cartoon strip xkcd:

It’s funny… but it’s actually sad: some testing of drugs is not much better than this! Clearly a result obtained this way is junk, so when you try to replicate it, the ‘decline effect’ will kick in.
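If you want to see this numerically, here’s a toy simulation (invented numbers): test 20 “colors” against a true null hypothesis and count how often at least one test comes out significant at p < 0.05.

```python
import numpy as np
from scipy import stats

# Toy simulation: 20 independent tests of a true null hypothesis --
# how often does at least one of them reach p < 0.05?
rng = np.random.default_rng(0)
trials, colors, n = 2000, 20, 50
false_alarms = 0
for _ in range(trials):
    pvals = [stats.ttest_ind(rng.standard_normal(n),
                             rng.standard_normal(n)).pvalue
             for _ in range(colors)]
    false_alarms += min(pvals) < 0.05
print(false_alarms / trials)   # about 1 - 0.95**20, roughly 0.64
```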

Another possible cause of the decline effect is publication bias: scientists and journals prefer positive results over null results, where no effect is found. And surely there are other explanations, too: for starters, all the ways people can fool themselves into thinking they’ve discovered something interesting.

For suggestions on how to avoid the evils of ‘publication bias’, try these:

• Jonathan Schooler, Unpublished results hide the decline effect, Nature 470 (2011), 437.

Putting an end to ‘significance chasing’ may require people to learn more about statistics:

• Geoff Cumming, Significant does not equal important: why we need the new statistics, 9 October 2011.

He explains the problem in simple language:

Consider a psychologist who’s investigating a new therapy for anxiety. She randomly assigns anxious clients to the therapy group, or a control group. You might think the most informative result would be an estimate of the benefit of therapy: the average improvement as a number of points on the anxiety scale, together with the confidence interval around that average. But psychology typically uses significance testing rather than estimation.

Introductory statistics books often introduce significance testing as a step-by-step recipe:

Step 1. Assume the new therapy has zero effect. You don’t believe this and you fervently hope it’s not true, but you assume it.

Step 2. You use that assumption to calculate a strange thing called a ‘p value’, which is the probability that, if the therapy really has zero effect, the experiment would have given a difference as large as you observed, or even larger.

Step 3. If the p value is small, in particular less than the hallowed criterion of .05 (that’s 1 chance in 20), you are permitted to reject your initial assumption—which you never believed anyway—and declare that the therapy has a ‘significant’ effect.

If that’s confusing, you’re in good company. Significance testing relies on weird backward logic. No wonder countless students every year are bamboozled by their introduction to statistics! Why this strange ritual, they ask, and what does a p value actually mean? Why don’t we focus on how large an improvement the therapy gives, and whether people actually find it helpful? These are excellent questions, and estimation gives the best answers.

For half a century distinguished scholars have published damning critiques of significance testing, and explained how it hampers research progress. There’s also extensive evidence that students, researchers, and even statistics teachers often don’t understand significance testing correctly. Strangely, the critiques of significance testing have hardly prompted any defences by its supporters. Instead, psychology and other disciplines have simply continued with the significance testing ritual, which is now deeply entrenched. It’s used in more than 90% of published research in psychology, and taught in every introductory textbook.

For more discussion and references, try my co-blogger:

• Tom Leinster, Fetishizing p-values, n-Category Café.

He gives some good examples of how significance testing can lead us astray. Anyone who uses the p-test should read these! He also discusses this book:

• Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance, University of Michigan Press, Ann Arbor, 2008. (Online summary here.)

Now, back to the provocative title of that New Yorker article: “Is there something wrong with the scientific method?”

The answer is yes if we mean science as actually practiced, now. Lots of scientists are using cookbook recipes they learned in statistics class without understanding them, or investigating the alternatives. Worse, some are treating statistics as a necessary but unpleasant piece of bureaucratic red tape, and then doing whatever it takes to achieve the appearance of a significant result!

This is a bit depressing. There’s a student I know, who is taking an introductory statistics course. After she read about this stuff she said:

So, what I’m gleaning here is that what I’m studying is basically bull. It struck me as bull to start with, admittedly, but since my grade depended on it, I grinned and swallowed. At least my eyes are open now, I guess.

But there’s some good news, buried in her last sentence. Science has the marvelous ability to notice and correct its own mistakes. It’s scientists who noticed the decline effect and significance chasing. They’ll eventually figure out what’s going on, and learn how to fix any mistakes that they’ve been making. So ultimately, I don’t find this story depressing. It’s actually inspiring!

The scientific method is not a fixed rulebook handed down from on high. It’s a work in progress. It’s only been around for a few centuries—not very long, in the grand scheme of things. The widespread use of statistics in science has been around for less than one century. And computers, which make heavy-duty number-crunching easy, have only been cheap for 30 years! No wonder people still use primitive cookbook methods for analyzing data, when they could do better.

So science is still evolving. And I think that’s fun, because it means we can help it along. If you see someone claim their results are statistically significant, you can ask them what they mean, exactly… and what they had to do to get those results.


I thank a lot of people on Google+ for discussions on this topic, including (but not limited to) John Forbes, Roko Mijic, Heather Vandagriff, and Willie Wong.


Network Theory (Part 13)

11 October, 2011

Unlike some recent posts, this will be very short. I merely want to show you the quantum and stochastic versions of Noether’s theorem, side by side.

Having made my sacrificial offering to the math gods last time by explaining how everything generalizes when we replace our finite set X of states by an infinite set or an even more general measure space, I’ll now relax and state Noether’s theorem only for a finite set. If you’re the sort of person who finds that unsatisfactory, you can do the generalization yourself.

Two versions of Noether’s theorem

Let me write the quantum and stochastic Noether’s theorem so they look almost the same:

Theorem. Let X be a finite set. Suppose H is a self-adjoint operator on L^2(X), and let O be an observable. Then

[O,H] = 0

if and only if for all states \psi(t) obeying Schrödinger’s equation

\displaystyle{ \frac{d}{d t} \psi(t) = -i H \psi(t) }

the expected value of O in the state \psi(t) does not change with t.

Theorem. Let X be a finite set. Suppose H is an infinitesimal stochastic operator on L^1(X), and let O be an observable. Then

[O,H] =0

if and only if for all states \psi(t) obeying the master equation

\displaystyle{ \frac{d}{d t} \psi(t) = H \psi(t) }

the expected values of O and O^2 in the state \psi(t) do not change with t.

This makes the big difference stick out like a sore thumb: in the quantum version we only need the expected value of O, while in the stochastic version we need the expected values of O and O^2!

Brendan Fong proved the stochastic version of Noether’s theorem in Part 11. Now let’s do the quantum version.

Proof of the quantum version

My statement of the quantum version was silly in a couple of ways. First, I spoke of the Hilbert space L^2(X) for a finite set X, but any finite-dimensional Hilbert space will do equally well. Second, I spoke of the “self-adjoint operator” H and the “observable” O, but in quantum mechanics an observable is the same thing as a self-adjoint operator!

Why did I talk in such a silly way? Because I was attempting to emphasize the similarity between quantum mechanics and stochastic mechanics. But they’re somewhat different. For example, in stochastic mechanics we have two very different concepts: infinitesimal stochastic operators, which generate symmetries, and functions on our set X, which are observables. But in quantum mechanics something wonderful happens: self-adjoint operators both generate symmetries and are observables! So, my attempt was a bit strained.

Let me state and prove a less silly quantum version of Noether’s theorem, which implies the one above:

Theorem. Suppose H and O are self-adjoint operators on a finite-dimensional Hilbert space. Then

[O,H] = 0

if and only if for all states \psi(t) obeying Schrödinger’s equation

\displaystyle{ \frac{d}{d t} \psi(t) = -i H \psi(t) }

the expected value of O in the state \psi(t) does not change with t:

\displaystyle{ \frac{d}{d t} \langle \psi(t), O \psi(t) \rangle = 0 }

Proof. The trick is to compute the time derivative I just wrote down. Using Schrödinger’s equation, the product rule, and the fact that H is self-adjoint we get:

\begin{array}{ccl}  \displaystyle{ \frac{d}{d t} \langle \psi(t), O \psi(t) \rangle } &=&   \langle -i H \psi(t) , O \psi(t) \rangle + \langle \psi(t) , O (-i H \psi(t)) \rangle \\  \\  &=& i \langle \psi(t) , H O \psi(t) \rangle - i \langle \psi(t) , O H \psi(t) \rangle \\  \\  &=& - i \langle \psi(t), [O,H] \psi(t) \rangle  \end{array}

So, if [O,H] = 0, clearly the above time derivative vanishes. Conversely, if this time derivative vanishes for all states \psi(t) obeying Schrödinger’s equation, we know

\langle \psi, [O,H] \psi \rangle = 0

for all states \psi and thus all vectors in our Hilbert space. Does this imply [O,H] = 0? Yes, because i times a commutator of self-adjoint operators is self-adjoint, and for any self-adjoint operator A we have

\forall \psi  \; \; \langle \psi, A \psi \rangle = 0 \qquad \Rightarrow \qquad A = 0

This is a well-known fact whose proof goes like this. Assume \langle \psi, A \psi \rangle = 0 for all \psi. Then to show A = 0, it is enough to show \langle \phi, A \psi \rangle = 0 for all \phi and \psi. But we have a marvelous identity:

\begin{array}{ccl} \langle \phi, A \psi \rangle &=& \frac{1}{4} \left( \langle \phi + \psi, \, A (\phi + \psi) \rangle \; - \; \langle \psi - \phi, \, A (\psi - \phi) \rangle \right. \\ && \left. +i \langle \psi + i \phi, \, A (\psi + i \phi) \rangle \; - \; i\langle \psi - i \phi, \, A (\psi - i \phi) \rangle \right) \end{array}

and all four terms on the right vanish by our assumption.   █

The marvelous identity up there is called the polarization identity. In plain English, it says: if you know the diagonal entries of a self-adjoint matrix in every basis, you can figure out all the entries of that matrix in every basis.

Why is it called the ‘polarization identity’? I think because it shows up in optics, in the study of polarized light.
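Before comparing the two theorems, here’s a quick numerical illustration of the quantum version (toy matrices, made up): take O commuting with H and watch the expected value stay constant.

```python
import numpy as np
from scipy.linalg import expm

# Toy illustration: when [O,H] = 0, the expected value of O is constant
# along psi(t) = exp(-itH) psi(0).
rng = np.random.default_rng(3)
n = 4
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = X + X.conj().T            # a random self-adjoint Hamiltonian
O = H @ H                     # self-adjoint, and commutes with H

psi0 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
psi0 /= np.linalg.norm(psi0)

for t in (0.0, 0.5, 2.0):
    psi = expm(-1j * t * H) @ psi0
    print(t, (psi.conj() @ O @ psi).real)   # the same number at every t
```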

Comparison

In both the quantum and stochastic cases, the time derivative of the expected value of an observable O is expressed in terms of its commutator with the Hamiltonian. In the quantum case we have

\displaystyle{ \frac{d}{d t} \langle \psi(t), O \psi(t) \rangle = - i \langle \psi(t), [O,H] \psi(t) \rangle }

and for the right side to always vanish, we need [O,H] = 0, thanks to the polarization identity. In the stochastic case, a perfectly analogous equation holds:

\displaystyle{ \frac{d}{d t} \int O \psi(t) = \int [O,H] \psi(t) }

but now the right side can always vanish even without [O,H] = 0. We saw a counterexample in Part 11. There is nothing like the polarization identity to save us! To get [O,H] = 0 we need a supplementary hypothesis, for example the vanishing of

\displaystyle{ \frac{d}{d t} \int O^2 \psi(t) }

Okay! Starting next time we’ll change gears and look at some more examples of stochastic Petri nets and Markov processes, including some from chemistry. After some more of that, I’ll move on to networks of other sorts. There’s a really big picture here, and I’m afraid I’ve been getting caught up in the details of a tiny corner.


Network Theory (Part 12)

9 October, 2011

Last time we proved a version of Noether’s theorem for stochastic mechanics. Now I want to compare that to the more familiar quantum version.

But to do this, I need to say more about the analogy between stochastic mechanics and quantum mechanics. And whenever I try, I get pulled toward explaining some technical issues involving analysis: whether sums converge, whether derivatives exist, and so on. I’ve been trying to avoid such stuff—not because I dislike it, but because I’m afraid you might. But the more I put off discussing these issues, the more they fester and make me unhappy. In fact, that’s why it’s taken so long for me to write this post!

So, this time I will gently explore some of these issues. But don’t be scared: I’ll mainly talk about some simple big ideas. Next time I’ll discuss Noether’s theorem. I hope that by getting the technicalities out of my system, I’ll feel okay about hand-waving whenever I want.

And if you’re an expert on analysis, maybe you can help me with a question.

Stochastic mechanics versus quantum mechanics

First, we need to recall the analogy we began sketching in Part 5, and push it a bit further. The idea is that stochastic mechanics differs from quantum mechanics in two big ways:

• First, instead of complex amplitudes, stochastic mechanics uses nonnegative real probabilities. The complex numbers form a ring; the nonnegative real numbers form a mere rig, which is a ‘ring without negatives’. Rigs are much neglected in the typical math curriculum, but unjustly so: they’re almost as good as rings in many ways, and there are lots of important examples, like the natural numbers \mathbb{N} and the nonnegative real numbers, [0,\infty). For probability theory, we should learn to love rigs.

But there are, alas, situations where we need to subtract probabilities, even when the answer comes out negative: namely when we’re taking the time derivative of a probability. So sometimes we need \mathbb{R} instead of just [0,\infty).

• Second, while in quantum mechanics a state is described using a ‘wavefunction’, meaning a complex-valued function obeying

\int |\psi|^2 = 1

in stochastic mechanics it’s described using a ‘probability distribution’, meaning a nonnegative real function obeying

\int \psi = 1

So, let’s try our best to present the theories in close analogy, while respecting these two differences.

States

We’ll start with a set X whose points are states that a system can be in. Last time I assumed X was a finite set, but this post is so mathematical I might as well let my hair down and assume it’s a measure space. A measure space lets you do integrals, but a finite set is a special case, and then these integrals are just sums. So, I’ll write things like

\int f

and mean the integral of the function f over the measure space X, but if X is a finite set this just means

\sum_{x \in X} f(x)

Now, I’ve already defined the word ‘state’, but both quantum and stochastic mechanics need a more general concept of state. Let’s call these ‘quantum states’ and ‘stochastic states’:

• In quantum mechanics, the system has an amplitude \psi(x) of being in any state x \in X. These amplitudes are complex numbers with

\int | \psi |^2 = 1

We call \psi: X \to \mathbb{C} obeying this equation a quantum state.

• In stochastic mechanics, the system has a probability \psi(x) of being in any state x \in X. These probabilities are nonnegative real numbers with

\int \psi = 1

We call \psi: X \to [0,\infty) obeying this equation a stochastic state.

In quantum mechanics we often use this abbreviation:

\langle \phi, \psi \rangle = \int \overline{\phi} \psi

so that a quantum state has

\langle \psi, \psi \rangle = 1

Similarly, we could introduce this notation in stochastic mechanics:

\langle \psi \rangle = \int \psi

so that a stochastic state has

\langle \psi \rangle = 1

But this notation is a bit risky, since angle brackets of this sort often stand for expectation values of observables. So, I’ve been writing \int \psi, and I’ll keep on doing this.

In quantum mechanics, \langle \phi, \psi \rangle is well-defined whenever both \phi and \psi live in the vector space

L^2(X) = \{ \psi: X \to \mathbb{C} \; : \; \int |\psi|^2 < \infty \}

In stochastic mechanics, \langle \psi \rangle is well-defined whenever \psi lives in the vector space

L^1(X) =  \{ \psi: X \to \mathbb{R} \; : \; \int |\psi| < \infty \}

You’ll notice I wrote \mathbb{R} rather than [0,\infty) here. That’s because in some calculations we’ll need functions that take negative values, even though our stochastic states are nonnegative.

Observables

A state is a way our system can be. An observable is something we can measure about our system. They fit together: we can measure an observable when our system is in some state. If we repeat this we may get different answers, but there’s a nice formula for the average or ‘expected’ answer.

• In quantum mechanics, an observable is a self-adjoint operator A on L^2(X). The expected value of A in the state \psi is

\langle \psi, A \psi \rangle

Here I’m assuming that we can apply A to \psi and get a new vector A \psi \in L^2(X). This is automatically true when X is a finite set, but in general we need to be more careful.

• In stochastic mechanics, an observable is a real-valued function A on X. The expected value of A in the state \psi is

\int A \psi

Here we’re using the fact that we can multiply A and \psi and get a new vector A \psi \in L^1(X), at least if A is bounded. Again, this is automatic if X is a finite set, but not otherwise.
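Here’s a sketch of both expectation formulas on a 2-point set, in Python with NumPy; the observable and states are arbitrary examples:

import numpy as np

# Quantum: observable = self-adjoint matrix, state = amplitudes with sum of |psi|^2 equal to 1.
A = np.array([[1.0,  0.0],
              [0.0, -1.0]])
psi = np.array([0.6, 0.8])
print(np.vdot(psi, A @ psi).real)   # <psi, A psi> = 0.36 - 0.64 ≈ -0.28

# Stochastic: observable = real function on X, state = probabilities summing to 1.
a = np.array([1.0, -1.0])
p = np.array([0.25, 0.75])
print(np.sum(a * p))                # integral of A psi = 0.25 - 0.75 = -0.5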

Symmetries

Besides states and observables, we need ‘symmetries’, which are transformations that map states to states. We use these to describe how our system changes when we wait a while, for example.

• In quantum mechanics, an isometry is a linear map U: L^2(X) \to L^2(X) such that

\langle U \phi, U \psi \rangle = \langle \phi, \psi \rangle

for all \psi, \phi \in L^2(X). If U is an isometry and \psi is a quantum state, then U \psi is again a quantum state.

• In stochastic mechanics, a stochastic operator is a linear map U: L^1(X) \to L^1(X) such that

\int U \psi = \int \psi

and

\psi \ge 0 \; \; \Rightarrow \; \; U \psi \ge 0

for all \psi \in L^1(X). If U is stochastic and \psi is a stochastic state, then U \psi is again a stochastic state.
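To make the contrast vivid, here’s a numerical sketch (Python with NumPy) of one symmetry of each kind; the particular matrices are arbitrary examples:

import numpy as np

# An isometry on L^2: a rotation preserves the inner product, hence the norm.
theta = 0.3
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
psi = np.array([0.6, 0.8])
print(np.linalg.norm(U @ psi), np.linalg.norm(psi))   # equal (up to rounding)

# A stochastic operator on L^1: nonnegative entries, columns summing to 1,
# so it preserves the total probability and nonnegativity.
S = np.array([[0.9, 0.5],
              [0.1, 0.5]])
p = np.array([0.25, 0.75])
print((S @ p).sum(), (S @ p).min() >= 0)              # 1.0 True (up to rounding)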

In quantum mechanics we are mainly interested in invertible isometries, which are called unitary operators. There are lots of these, and their inverses are always isometries. There are, however, very few stochastic operators whose inverses are stochastic:

Puzzle 1. Suppose X is a finite set. Show that every isometry U: L^2(X) \to L^2(X) is invertible, and its inverse is again an isometry.

Puzzle 2. Suppose X is a finite set. Which stochastic operators U: L^1(X) \to L^1(X) have stochastic inverses?

This is why we usually think of time evolution as being reversible in quantum mechanics, but not in stochastic mechanics! In quantum mechanics we often describe time evolution using a ‘1-parameter group’, while in stochastic mechanics we describe it using a 1-parameter semigroup… meaning that we can run time forwards, but not backwards.

But let’s see how this works in detail!

Time evolution in quantum mechanics

In quantum mechanics there’s a beautiful relation between observables and symmetries, which goes like this. Suppose that for each time t we want a unitary operator U(t) :  L^2(X) \to L^2(X) that describes time evolution. Then it makes a lot of sense to demand that these operators form a 1-parameter group:

Definition. A collection of linear operators U(t) (t \in \mathbb{R}) on some vector space forms a 1-parameter group if

U(0) = 1

and

U(s+t) = U(s) U(t)

for all s,t \in \mathbb{R}.

Note that these conditions force all the operators U(t) to be invertible.

Now suppose our vector space is a Hilbert space, like L^2(X). Then we call a 1-parameter group a 1-parameter unitary group if the operators involved are all unitary.

It turns out that 1-parameter unitary groups are either continuous in a certain way, or so pathological that you can’t even prove they exist without the axiom of choice! So, we always focus on the continuous case:

Definition. A 1-parameter unitary group is strongly continuous if U(t) \psi depends continuously on t for all \psi, in this sense:

t_i \to t \;\; \Rightarrow \; \;\|U(t_i) \psi - U(t) \psi \| \to 0

Then we get a classic result proved by Marshall Stone back in the early 1930s. You may not know him, but he was so influential at the University of Chicago that his time as chairman of the math department there is often called the “Stone Age”. And here’s one reason why:

Stone’s Theorem. There is a one-to-one correspondence between strongly continuous 1-parameter unitary groups on a Hilbert space and self-adjoint operators on that Hilbert space, given as follows. Given a strongly continuous 1-parameter unitary group U(t) we can always write

U(t) = \exp(-i t H)

for a unique self-adjoint operator H. Conversely, any self-adjoint operator determines a strongly continuous 1-parameter unitary group this way. For all vectors \psi for which H \psi is well-defined, we have

\displaystyle{ \left.\frac{d}{d t} U(t) \psi \right|_{t = 0} = -i H \psi }

Moreover, for any of these vectors, if we set

\psi(t) = \exp(-i t H) \psi

we have

\displaystyle{ \frac{d}{d t} \psi(t) = - i H \psi(t) }

When U(t) = \exp(-i t H) describes the evolution of a system in time, H is called the Hamiltonian, and it has the physical meaning of ‘energy’. The equation I just wrote down is then called Schrödinger’s equation.

So, simply put, in quantum mechanics we have a correspondence between observables and nice one-parameter groups of symmetries. Not surprisingly, our favorite observable, energy, corresponds to our favorite symmetry: time evolution!

However, if you were paying attention, you noticed that I carefully avoided explaining how we define \exp(- i t H). I didn’t even say what a self-adjoint operator is. This is where the technicalities come in: they arise when H is unbounded, and not defined on all vectors in our Hilbert space.

Luckily, these technicalities evaporate for finite-dimensional Hilbert spaces, such as L^2(X) for a finite set X. Then we get:

Stone’s Theorem (Baby Version). Suppose we are given a finite-dimensional Hilbert space. In this case, a linear operator H on this space is self-adjoint iff it’s defined on the whole space and

\langle \phi , H \psi \rangle = \langle H \phi, \psi \rangle

for all vectors \phi, \psi. Given a strongly continuous 1-parameter unitary group U(t) we can always write

U(t) = \exp(- i t H)

for a unique self-adjoint operator H, where

\displaystyle{ \exp(-i t H) \psi = \sum_{n = 0}^\infty \frac{(-i t H)^n}{n!} \psi }

with the sum converging for all \psi. Conversely, any self-adjoint operator on our space determines a strongly continuous 1-parameter unitary group this way. For all vectors \psi in our space we then have

\displaystyle{ \left.\frac{d}{d t} U(t) \psi \right|_{t = 0} = -i H \psi }

and if we set

\psi(t) = \exp(-i t H) \psi

we have

\displaystyle{ \frac{d}{d t} \psi(t) = - i H \psi(t) }
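Here’s a numerical illustration of the baby version, in Python using SciPy’s matrix exponential; the Hamiltonian is an arbitrary 2×2 example:

import numpy as np
from scipy.linalg import expm

H = np.array([[1.0,  2.0],
              [2.0, -1.0]])            # self-adjoint: a real symmetric matrix

def U(t):
    return expm(-1j * t * H)           # U(t) = exp(-itH)

# U(t) is unitary:
t = 0.7
print(np.allclose(U(t).conj().T @ U(t), np.eye(2)))    # True

# Schrödinger's equation, checked by a finite difference at t = 0:
psi = np.array([0.6, 0.8], dtype=complex)
eps = 1e-6
print(np.allclose((U(eps) @ psi - psi) / eps, -1j * H @ psi, atol=1e-5))   # True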

Time evolution in stochastic mechanics

We’ve seen that in quantum mechanics, time evolution is usually described by a 1-parameter group of operators that comes from an observable: the Hamiltonian. Stochastic mechanics is different!

First, since stochastic operators aren’t usually invertible, we typically describe time evolution by a mere ‘semigroup’:

Definition. A collection of linear operators U(t) (t \in [0,\infty)) on some vector space forms a 1-parameter semigroup if

U(0) = 1

and

U(s+t) = U(s) U(t)

for all s, t \ge 0.

Now suppose this vector space is L^1(X) for some measure space X. We want to focus on the case where the operators U(t) are stochastic and depend continuously on t in the same sense we discussed earlier.

Definition. A strongly continuous 1-parameter semigroup of stochastic operators U(t) : L^1(X) \to L^1(X) is called a Markov semigroup.

What’s the analogue of Stone’s theorem for Markov semigroups? I don’t know a fully satisfactory answer! If you know, please tell me.

Later I’ll say what I do know—I’m not completely clueless—but for now let’s look at the ‘baby’ case where X is a finite set. Then the story is neat and complete:

Theorem. Suppose we are given a finite set X. In this case, a linear operator H on L^1(X) is infinitesimal stochastic iff it’s defined on the whole space,

\int H \psi = 0

for all \psi \in L^1(X), and the matrix of H in terms of the obvious basis obeys

H_{i j} \ge 0

for all j \ne i. Given a Markov semigroup U(t) on L^1(X), we can always write

U(t) = \exp(t H)

for a unique infinitesimal stochastic operator H, where

\displaystyle{ \exp(t H) \psi = \sum_{n = 0}^\infty \frac{(t H)^n}{n!} \psi }

with the sum converging for all \psi. Conversely, any infinitesimal stochastic operator on our space determines a Markov semigroup this way. For all \psi \in L^1(X) we then have

\displaystyle{ \left.\frac{d}{d t} U(t) \psi \right|_{t = 0} = H \psi }

and if we set

\psi(t) = \exp(t H) \psi

we have the master equation:

\displaystyle{ \frac{d}{d t} \psi(t) = H \psi(t) }
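And here’s the stochastic analogue of the earlier sketch, in the same style; the matrix H below is an arbitrary example of an infinitesimal stochastic operator, with columns summing to zero and nonnegative off-diagonal entries:

import numpy as np
from scipy.linalg import expm

H = np.array([[-2.0,  1.0],
              [ 2.0, -1.0]])           # infinitesimal stochastic

def U(t):
    return expm(t * H)                 # the Markov semigroup U(t) = exp(tH)

# U(t) is stochastic for t >= 0: nonnegative entries, columns summing to 1.
Ut = U(0.7)
print(np.allclose(Ut.sum(axis=0), 1.0), (Ut >= 0).all())   # True True

# The master equation, checked by a finite difference at t = 0:
psi = np.array([0.25, 0.75])
eps = 1e-6
print(np.allclose((U(eps) @ psi - psi) / eps, H @ psi, atol=1e-5))   # True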

In short, time evolution in stochastic mechanics is a lot like time evolution in quantum mechanics, except it’s typically not invertible, and the Hamiltonian is typically not an observable.

Why not? Because we defined an observable to be a function A: X \to \mathbb{R}. We can think of this as giving an operator on L^1(X), namely the operator of multiplication by A. That’s a nice trick, which we used to good effect last time. However, at least when X is a finite set, this operator will be diagonal in the obvious basis consisting of functions that equal 1 at one point of X and zero elsewhere. So, it can only be infinitesimal stochastic if it’s zero!

Puzzle 3. If X is a finite set, show that any operator on L^1(X) that’s both diagonal and infinitesimal stochastic must be zero.

The Hille–Yosida theorem

I’ve now told you everything you really need to know… but not everything I want to say. What happens when X is not a finite set? What are Markov semigroups like then? I can’t abide letting this question go unresolved! Unfortunately I only know a partial answer.

We can get a certain distance using the Hille–Yosida theorem, which is much more general.

Definition. A Banach space is a vector space with a norm such that every Cauchy sequence converges.

Examples include Hilbert spaces like L^2(X) for any measure space, but also other spaces like L^1(X) for any measure space!

Definition. If V is a Banach space, a 1-parameter semigroup of operators U(t) : V \to V is called a contraction semigroup if it’s strongly continuous and

\| U(t) \psi \| \le \| \psi \|

for all t \ge 0 and all \psi \in V.

Examples include strongly continuous 1-parameter unitary groups, but also Markov semigroups!

Puzzle 4. Show any Markov semigroup is a contraction semigroup.
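This is no substitute for actually solving the puzzle, but here’s a numerical sanity check in Python, reusing the infinitesimal stochastic H from the sketch above and an arbitrary L^1 vector:

import numpy as np
from scipy.linalg import expm

H = np.array([[-2.0,  1.0],
              [ 2.0, -1.0]])
psi = np.array([1.0, -0.5])            # a general L^1 vector, not a state

for t in [0.0, 0.5, 1.0, 2.0]:
    contracted = np.abs(expm(t * H) @ psi).sum() <= np.abs(psi).sum() + 1e-12
    print(t, contracted)               # True each time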

The Hille–Yosida theorem generalizes Stone’s theorem to contraction semigroups. In my misspent youth, I spent a lot of time carrying around Yosida’s book Functional Analysis. Furthermore, Einar Hille was the advisor of my thesis advisor, Irving Segal. Segal generalized the Hille–Yosida theorem to nonlinear operators, and I used this generalization a lot back when I studied nonlinear partial differential equations. So, I feel compelled to tell you this theorem:

Hille–Yosida Theorem. Given a contraction semigroup U(t) we can always write

U(t) = \exp(t H)

for some densely defined operator H such that H - \lambda I has an inverse and

\displaystyle{ \| (H - \lambda I)^{-1} \psi \| \le \frac{1}{\lambda} \| \psi \| }

for all \lambda > 0 and \psi \in V. Conversely, any such operator determines a contraction semigroup this way. For all vectors \psi for which H \psi is well-defined, we have

\displaystyle{ \left.\frac{d}{d t} U(t) \psi \right|_{t = 0} = H \psi }

Moreover, for any of these vectors, if we set

\psi(t) = U(t) \psi

we have

\displaystyle{ \frac{d}{d t} \psi(t) = H \psi(t) }

If you like, you can take the stuff at the end of this theorem to be what we mean by saying U(t) = \exp(t H). When U(t) = \exp(t H), we say that H generates the semigroup U(t).
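We can also peek numerically at the resolvent bound in the theorem, using the L^1 norm and the same infinitesimal stochastic H as before (again an arbitrary example, not a proof of anything):

import numpy as np

H = np.array([[-2.0,  1.0],
              [ 2.0, -1.0]])
psi = np.array([1.0, -0.5])

for lam in [0.1, 1.0, 10.0]:
    resolvent_psi = np.linalg.inv(H - lam * np.eye(2)) @ psi
    print(lam, np.abs(resolvent_psi).sum() <= np.abs(psi).sum() / lam + 1e-12)   # True each time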

But now suppose V = L^1(X). Besides the conditions in the Hille–Yosida theorem, what extra conditions on H are necessary and sufficient for it to generate a Markov semigroup? In other words, what’s a definition of ‘infinitesimal stochastic operator’ that’s suitable not only when X is a finite set, but an arbitrary measure space?

I asked this question on MathOverflow a few months ago, and so far the answers have not been completely satisfactory.

Some people mentioned the Hille–Yosida theorem, which is surely a step in the right direction, but not the full answer.

Others discussed the special case when \exp(t H) extends to a bounded self-adjoint operator on L^2(X). When X is a finite set, this special case happens precisely when the matrix H_{i j} is symmetric: the probability per unit time of hopping from j to i equals the probability per unit time of hopping from i to j. This is a fascinating special case, not least because when H is both infinitesimal stochastic and self-adjoint, we can use it as a Hamiltonian for both stochastic mechanics and quantum mechanics! Someday I want to discuss this. However, it’s just a special case.
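Here’s a glimpse of that special case, using about the simplest symmetric example I can think of: minus the Laplacian of the graph with two vertices and one edge.

import numpy as np
from scipy.linalg import expm

H = np.array([[-1.0,  1.0],
              [ 1.0, -1.0]])    # symmetric, columns sum to 0, off-diagonals >= 0

# As a stochastic Hamiltonian: exp(tH) is a stochastic matrix for t >= 0.
Ut = expm(0.3 * H)
print(np.allclose(Ut.sum(axis=0), 1.0), (Ut >= 0).all())   # True True

# As a quantum Hamiltonian: exp(-itH) is unitary.
Vt = expm(-1j * 0.3 * H)
print(np.allclose(Vt.conj().T @ Vt, np.eye(2)))            # True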

After grabbing people by the collar and insisting that I wanted to know the answer to the question I actually asked—not some vaguely similar question—the best answer seems to be Martin Gisser’s reference to this book:

• Zhi-Ming Ma and Michael Röckner, Introduction to the Theory of (Non-Symmetric) Dirichlet Forms, Springer, Berlin, 1992.

This book provides a very nice self-contained proof of the Hille–Yosida theorem. On the other hand, it does not answer my question in general, but only when the skew-symmetric part of H is dominated (in a certain sense) by the symmetric part.

So, I’m stuck on this front, but that needn’t bring the whole project to a halt. We’ll just sidestep this question.

For a good well-rounded introduction to Markov semigroups and what they’re good for, try:

• Ryszard Rudnicki, Katarzyna Pichór and Marta Tyran-Kamínska, Markov semigroups and their applications.

