The Complexity Barrier

28 October, 2011

Could we grow the whole universe with all its seeming complexity starting from a little seed? How much can you do with just a little information?

People have contests about this. My programmer friend Bruce Smith points out this animation, produced using a program less than 4 kilobytes long:

As he notes:

…to be fair, the complexity of some of the OS, graphics drivers, and hardware should be included, but this is a lot less than you might think if you imagine rewriting it purely for compactness rather than for speed, and only including what this sort of program needs to produce output.

Sampo Syreeni pointed me to this video, all generated from under 64 kilobytes of x86 code:

As he points out, one trick is to use symmetry.

These fancy images produced from tiny amounts of information are examples of the ‘demoscene’, a computer art subculture where people produce demos: non-interactive audio-visual computer presentations that run in real time.

According to the Wikipedia article on the demoscene:

Recent computer hardware advancements include faster processors, more memory, faster video graphics processors, and hardware 3D acceleration. With many of the past’s challenges removed, the focus in making demos has moved from squeezing as much out of the computer as possible to making stylish, beautiful, well-designed real time artwork […]

The old tradition still lives on, though. Demo parties have competitions with varying limitations in program size or platform (different series are called compos). On a modern computer the executable size may be limited to 64 kB or 4 kB. Programs of limited size are usually called intros. In other compos the choice of platform is restricted; only old computers, like the 8-bit Commodore 64, or the 16-bit Amiga, or Atari ST, or mobile devices like handheld phones or PDAs are allowed. Such restrictions provide a challenge for coders, musicians and graphics artists and bring back the old motive of making a device do more than was intended in its original design.

What else can you do with just a little information? Bruce listed a couple more things:

• Bill Gates’ first commercial success was an implementation of a useful version of BASIC in about 4000 bytes;

• the complete genetic code of an organism can be as short as a few hundred thousand bytes, and that has to be encoded in a way that doesn’t allow for highly clever compression schemes.

So if quite complex things can be compressed into fairly little information, you can’t help but wonder: how complex can something be?

The answer: arbitrarily complex! At least that’s true if we’re talking about the Kolmogorov complexity of a string of bits: namely, the length of the shortest computer program that prints it out. Lots of long strings of bits can’t be compressed. You can’t print most of them out using short programs, since there aren’t enough short programs to go around.
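
Here the counting is worth spelling out. The number of programs shorter than n bits is at most

2^0 + 2^1 + \cdots + 2^{n-1} = 2^n - 1

while there are 2^n bit strings of length n. So at least one string of length n has Kolmogorov complexity at least n. More generally, fewer than a fraction 2^{-k} of the strings of length n can be compressed by more than k bits.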

Of course, we need to fix a computer language ahead of time, so this is well-defined. And we need to make sure the programs are written in binary, so the comparison is fair.

So, things can be arbitrarily complex. But here’s a more interesting question: how complex can we prove something to be?

The answer is one of the most astounding facts I know. It’s called Chaitin’s incompleteness theorem. It says, very roughly:

There’s a number L such that we can’t prove the Kolmogorov complexity of any specific string of bits is bigger than L.

Make sure you understand this. For any number, we can prove there are infinitely many bit strings with Kolmogorov complexity bigger than that. But we can’t point to any particular bit string and prove its Kolmogorov complexity is bigger than L!

Over on Google+, Allen Knutson wrote:

That’s an incredibly disturbing theorem, like driving to the edge of the universe and finding a wall.

I call L the complexity barrier. So one question is, how big is L? It’s hard, or perhaps even impossible, to find the smallest L that does the job. But we can certainly find numbers L that work. And they’re surprisingly small!

My friend Bruce estimates that the complexity barrier is a few kilobytes.

I’d like to see a program a few kilobytes long that produces a video showing a big bang, the formation of stars and galaxies, then planets, including one where life evolves, then intelligent life, then the development of computers… and finally someone writing the very same program.

I can’t prove it’s possible… but you can’t prove it isn’t!

Let’s see why.

For starters, we need to choose some axioms for our system of math, so we know what ‘provable’ means. We need a system that’s powerful enough to prove a bunch of basic facts about arithmetic, but simple enough that a computer program can check if a proof in this system is valid.

There are lots of systems like this. Three famous ones are Peano arithmetic, Robinson arithmetic (which is less powerful) and Zermelo-Fraenkel set theory (which is more powerful).

When you have a system of math like this, Gödel’s first incompleteness theorem kicks in: if the system is consistent, it can’t be complete. In other words, there are some questions it leaves unsettled. This is why we shouldn’t be utterly shocked that while a bunch of bit strings have complexity more than L, we can’t prove this.

Gödel’s second incompleteness theorem also kicks in: if the system can prove that it’s consistent, it’s not! (If it’s not consistent, it can prove anything, so you shouldn’t trust it.) So, there’s a sense in which we can never be completely sure that our system of math is consistent. But let’s assume it is.

Given this, Chaitin’s theorem says:

There exists a constant L such that no string of bits has Kolmogorov complexity that provably exceeds L.

How can we get a number that does the job? Any number L with

U + \log_2(L) + C < L

will do. Here:

U is the length of a program that, given a natural number i as input, searches through all proofs in Peano arithmetic until it finds one proving that some specific bit string has Kolmogorov complexity > i. If it finds such a proof, it outputs that bit string. If it never finds one, it grinds on endlessly. (Of course, if i = L, it will never find one!)

C is a small overhead cost: the length of the extra 'glue' to create a bigger program that takes the number L, written out in binary, and feeds it into the program described above.

The length of L written out in binary is about \log_2(L). This bigger program thus has length

U + \log_2(L) + C

and for the proof of Chaitin’s incompleteness theorem to work, we need this to be smaller than L. Obviously we can accomplish this by making L big enough, since \log_2 L grows slower than L.
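
For example, if U + C came to 64,000 bits (about 8 kilobytes), then L = 64,017 would do, since \log_2(64017) < 16 and 64000 + 16 < 64017.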

Given all the stuff I’ve told you, the proof of Chaitin’s theorem almost writes itself! You run this bigger program I just described. If there were a bit string whose Kolmogorov complexity is provably greater than L, this program would print one out. But that’s a contradiction, because this program has length less than L.

So, we just need to pick a computer language and a suitable system of math, and estimate U and, less importantly because it’s so much smaller, C. Then L will be just a bit bigger than U + C.

I picked the language C and Peano arithmetic and asked Bruce if he could guess, roughly, what answer we get for L. He replied:

I don’t think it can be done in C, since C semantics are not well-defined unless you specify a particular finite machine size. (Since C programs can do things like convert pointers to integers and back, tell you the size of any datatype, and convert data of any specified datatype to bytes and back.) On a finite machine of N bits, all programs either finish in time less than about 2^N or take forever.

But if you take “C without size-specific operations”, or a higher level language like Python, or for that matter a different sort of low-level language like a Turing machine, then that’s not an issue — you can define a precise semantics that allows it to run a program for an arbitrarily long time and allocate an arbitrary number of objects in memory which contain pointers to each other. (To stick with the spirit of the question, for whatever language you choose, you’d want to disallow use of any large external batch of information like a “standard library”, except for whatever is so basic that you think of it as part of the native language. This is not a serious handicap for this problem.)

The main things that the program ‘U’ (I’d rather call the program itself ‘U’ than call its length ‘U’) needs to do are:

• recognize a syntactically correct statement or proof;

• check the validity of a purported proof;

• recognize certain statements as saying or implying “The Kolmogorov complexity of n is more than i” for some n and i. (It’s not necessary to recognize all such statements, just at least one for each n and i; so it can just recognize a statement that consists of some template with specific values of n and i inserted into it at certain places.)

Assuming that U expresses the proofs it wants to check in a practical proof language (which will be more like what a practical theorem-prover like Coq uses than like what a traditional logician would recognize as “straight Peano arithmetic”, but which will not be excessively powerful in the spirit of this question), I’d estimate that the most complex part is checking proof validity, but that that can still be expressed in at most a few dozen syntactic rules, each expressible in a few lines of code. (The authors of a system like Coq, which includes code to actually do that, would know better, as long as they remember that the vast majority of their system’s actual code is not needed for this problem.)

This makes me think that even without trying to compact it much, in a reasonable language we could write U in a few hundred lines of code, or (after a bit of simple compression) a few thousand bytes. (And perhaps much less if we tried hard to compact the whole program in clever ways.)

So L will also be “a few thousand” (bytes or digits), or perhaps less, rather than some number you can never possibly count to.

Does anyone know if Chaitin or someone actually went ahead and wrote a program that showed a specific value of L would work?


A Characterization of Entropy

2 June, 2011

Over at the n-Category Café some of us have been trying an experiment: writing a math paper in full public view, both on that blog and on its associated wiki: the nLab. One great thing about doing things this way is that people can easily chip in with helpful suggestions. It’s also more fun! Both these tend to speed the process.

Like Frankenstein’s monster, our paper’s main result was initially jolted into life by huge blasts of power: in this case, not lightning but category theory. It was awesome to behold, but too scary for this blog.

First Tom Leinster realized that the concept of entropy fell out — unexpectedly, but very naturally — from considerations involving ‘operads’, which are collections of abstract operations. He was looking at a particular operad where the operations are ‘convex linear combinations’, and he discovered that this operad has entropy lurking in its heart. Then Tobias Fritz figured out a nice way to state Tom’s result without mentioning operads. By now we’ve taught the monster table manners, found it shoes that fit, and it’s ready for polite society:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.

The idea goes like this. Say you’ve got a finite set X with a probability measure p on it, meaning a number 0 \le p_i \le 1 for each point i \in X, obeying

\sum_{i \in X} p_i = 1

Then the Shannon entropy of p is defined by

S(p) = - \sum_{i \in X} p_i \, \ln(p_i)

This funny-looking formula can be justified in many ways. Our new way involves focusing not on entropy itself, but on changes in entropy. This makes sense for lots of reasons. For example, in physics we don’t usually measure entropy directly. Instead, we measure changes in entropy, using the fact that a system at temperature T absorbing a tiny amount of heat \Delta Q in a reversible way will experience an entropy change of \Delta Q / T. But our real reason for focusing on changes in entropy is that it gives a really slick theorem.

Suppose we have two finite sets with probability measures, say (X,p) and (Y,q). Then we define a morphism f: (X,p) \to (Y,q) to be a measure-preserving function: in other words, one for which the probability q_j of any point in Y is the sum of the probabilities p_i of the points in X with f(i) = j.

A morphism of this sort is a deterministic process that carries one random situation to another. For example, if I have a random integer between -10 and 10, chosen according to some probability distribution, and I square it, I get a random integer between 0 and 100. A process of this sort never increases the entropy: given any morphism f: (X,p) \to (Y,q), we have

S(q) \le S(p)

Since the second law of thermodynamics says that entropy always increases, this may seem counterintuitive or even paradoxical! But there’s no paradox here. It makes more intuitive sense if you think of entropy as information, and the function f as some kind of data processing that doesn’t introduce any additional randomness. Such a process can only decrease the amount of information. For example, squaring the number -5 gives the same answer as squaring 5, so if I tell you “this number squared is 25”, I’m giving you less information than if I said “this number is -5”.
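
Here’s a quick numerical check of the squaring example, written in Python with a probability distribution of my own choosing (the uniform one); any distribution on the integers from -10 to 10 would work:

import math
from collections import defaultdict

def shannon(p):
    """Shannon entropy of a distribution given as a dict of probabilities."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

p = {i: 1/21 for i in range(-10, 11)}   # uniform distribution on -10, ..., 10

q = defaultdict(float)                  # push p forward along f(i) = i^2
for i, prob in p.items():
    q[i*i] += prob

print(shannon(p), shannon(q))           # about 3.04 vs 2.38, so S(q) <= S(p)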

For this reason, we call the difference S(p) - S(q) the information loss of the morphism f : (X,p) \to (Y,q). And here’s our characterization of Shannon entropy in terms of information loss:

First, let’s write a morphism f: (X,p) \to (Y,q) as f : p \to q for short. Suppose F is a function that assigns to any such morphism a number F(f) \in [0,\infty), which we think of as its information loss. And suppose that F obeys three axioms:

1. Functoriality. Whenever we can compose morphisms f and g, we demand that

F(f \circ g) = F(f) + F(g)

In other words: when we do a process consisting of two stages, the amount of information lost in the whole process is the sum of the amounts lost in each stage!

2. Convex linearity. Suppose we have two finite sets equipped with probability measures, say p and q, and a real number \lambda \in [0, 1]. Then there is a probability measure \lambda p \oplus (1 - \lambda) q on the disjoint union of the two sets, obtained by weighting the two measures by \lambda and 1 - \lambda, respectively. Similarly, given morphisms f: p \to p' and g: q \to q' there is an obvious morphism from \lambda p \oplus (1 - \lambda) q to \lambda p' \oplus (1 - \lambda) q'. Let’s call this morphism \lambda f \oplus (1 - \lambda) g. We demand that

F(\lambda f \oplus (1 - \lambda) g) = \lambda F(f) + (1 - \lambda) F(g)

In other words: if we flip a probability-λ coin to decide whether to do one process or another, the information lost is λ times the information lost by the first process plus (1 – λ) times the information lost by the second!

3. Continuity. The same function between finite sets can be thought of as a measure-preserving map in different ways, by changing the measures on these sets. In this situation the quantity F(f) should depend continuously on the measures in question.

In other words: if we slightly change the probability measures a process acts on, the information it loses changes only slightly.

Then we conclude that there exists a constant c \ge 0 such that for any morphism f: (X,p) \to (Y,q), we have

F(f) = c(S(p) - S(q))

In other words: the information loss is some multiple of the change in Shannon entropy!

What’s pleasing about this theorem is that the three axioms are pretty natural, and it’s hard to see the formula

S(p) = - \sum_{i \in X} p_i \, \ln(p_i)

hiding in them… but it’s actually there.

(We also prove a version of this theorem for Tsallis entropy, in case you care. This obeys a mutant version of axiom 2, namely:

F(\lambda f \oplus (1 - \lambda) g) = \lambda^\alpha F(f) + (1 - \lambda)^\alpha F(g)

where \alpha is a parameter with 0 < \alpha < \infty. Tsallis entropy is a close relative of Rényi entropy, which I discussed here earlier. Just as Rényi entropy is a kind of q-derivative of the free energy, the Tsallis entropy is a q-derivative of the partition function. I’m not sure either of them are really important, but when you’re trying to uniquely characterize Shannon entropy, it’s nice for it to have some competitors to fight against, and these are certainly the main two. Both of them depend on a parameter and reduce to the Shannon entropy at a certain value of that parameter.)


Information Geometry (Part 8)

26 May, 2011

Now this series on information geometry will take an unexpected turn toward ‘green mathematics’. Lately I’ve been talking about relative entropy. Now I’ll say how this concept shows up in the study of evolution!

That’s an unexpected turn to me, at least. I learned of this connection just two days ago in a conversation with Marc Harper, a mathematician who is a postdoc in bioinformatics at UCLA, working with my friend Chris Lee. I was visiting Chris for a couple of days after attending the thesis defenses of some grad students of mine who just finished up at U.C. Riverside. Marc came by and told me about this paper:

• Marc Harper, Information geometry and evolutionary game theory.

and now I can’t resist telling you.

First of all: what does information theory have to do with biology? Let me start with a very general answer: biology is different from physics because biological systems are packed with information you can’t afford to ignore.

Physicists love to think about systems that take only a little information to describe. So when they get a system that takes a lot of information to describe, they use a trick called ‘statistical mechanics’, where you try to ignore most of this information and focus on a few especially important variables. For example, if you hand a physicist a box of gas, they’ll try to avoid thinking about the state of each atom, and instead focus on a few macroscopic quantities like the volume and total energy. Ironically, the mathematical concept of information arose first here—although they didn’t call it information back then; they called it ‘entropy’. The entropy of a box of gas is precisely the amount of information you’ve decided to forget when you play this trick of focusing on the macroscopic variables. Amazingly, remembering just this—the sheer amount of information you’ve forgotten—can be extremely useful… at least for the systems physicists like best.

But biological systems are different. They store lots of information (for example in DNA), transmit lots of information (for example in the form of biochemical signals), and collect a lot of information from their environment. And this information isn’t uninteresting ‘noise’, like the positions of atoms in a gas. The details really matter. Thus, we need to keep track of lots of information to have a chance of understanding any particular biological system.

So, part of doing biology is developing new ways to think about physical systems that contain lots of relevant information. This is why physicists consider biology ‘messy’. It’s also why biology and computers go hand in hand in the subject called ‘bioinformatics’. There’s no avoiding this: in fact, it will probably force us to automate the scientific method! That’s what Chris Lee and Marc Harper are really working on:

• Chris Lee, General information metrics for automated experiment planning, presentation in the UCLA Chemistry & Biochemistry Department faculty luncheon series, 2 May 2011.

But more about that some other day. Let me instead give another answer to the question of what information theory has to do with biology.

There’s an analogy between evolution and the scientific method. Simply put, life is an experiment to see what works; natural selection weeds out the bad guesses, and over time the better guesses predominate. This process transfers information from the world to the ‘experimenter’: the species that’s doing the evolving, or the scientist. Indeed, the only way the experimenter can get information is by making guesses that can be wrong.

All this is simple enough, but the nice thing is that we can make it more precise.

On the one hand, there’s a simple model of the scientific method called ‘Bayesian inference’. Assume there’s a set of mutually exclusive alternatives: possible ways the world can be. And suppose we start with a ‘prior probability distribution’: a preconceived notion of how probable each alternative is. Say we do an experiment and get a result that depends on which alternative is true. We can work out how likely this result was given our prior, and—using a marvelously simple formula called Bayes’ rule—we can use this to update our prior and obtain a new improved probability distribution, called the ‘posterior probability distribution’.

On the other hand, suppose we have a species with several different possible genotypes. A population of this species will start with some number of organisms with each genotype. So, we get a probability distribution saying how likely it is that an organism has any given genotype. These genotypes are our ‘mutually exclusive alternatives’, and this probability distribution is our ‘prior’. Suppose each generation the organisms have some expected number of offspring that depends on their genotype. Mathematically, it turns out this is just like updating our prior using Bayes’ rule! The result is a new probability distribution of genotypes: the ‘posterior’.

I learned about this from Chris Lee on the 19th of December, 2006. In my diary that day, I wrote:

The analogy is mathematically precise, and fascinating. In rough terms, it says that the process of natural selection resembles the process of Bayesian inference. A population of organisms can be thought of as having various ‘hypotheses’ about how to survive—each hypothesis corresponding to a different allele. (Roughly, an allele is one of several alternative versions of a gene.) In each successive generation, the process of natural selection modifies the proportion of organisms having each hypothesis, according to Bayes’ rule!

Now let’s be more precise:

Bayes’ rule says if we start with a ‘prior probability’ for some hypothesis to be true, divide it by the probability that some observation is made, then multiply by the ‘conditional probability’ that this observation will be made given that the hypothesis is true, we’ll get the ‘posterior probability’ that the hypothesis is true given that the observation is made.

Formally, the exact same equation shows up in population genetics! In fact, Chris showed it to me—it’s equation 9.2 on page 30 of this book:

• R. Bürger, The Mathematical Theory of Selection, Recombination and Mutation, section I.9: Selection at a single locus, Wiley, 2000.

But, now all the terms in the equation have different meanings!

Now, instead of a ‘prior probability’ for a hypothesis to be true, we have the frequency of occurrence of some allele in some generation of a population. Instead of the probability that we make some observation, we have the expected number of offspring of an organism. Instead of the ‘conditional probability’ of making the observation, we have the expected number of offspring of an organism given that it has this allele. And, instead of the ‘posterior probability’ of our hypothesis, we have the frequency of occurrence of that allele in the next generation.

(Here we are assuming, for simplicity, an asexually reproducing ‘haploid’ population – that is, one with just a single set of chromosomes.)
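
Here is the update in code, a toy example with numbers of my own choosing (three alleles in a haploid population), just to display the two readings of the same formula:

prior = [0.6, 0.3, 0.1]       # allele frequencies in this generation
fitness = [1.0, 2.0, 4.0]     # expected number of offspring, by allele

evidence = sum(p * w for p, w in zip(prior, fitness))   # mean number of offspring
posterior = [p * w / evidence for p, w in zip(prior, fitness)]
print(posterior)              # allele frequencies in the next generation

Read as Bayes’ rule, ‘fitness’ plays the role of the conditional probability and ‘evidence’ the probability of the observation; read as population genetics, the same line computes the next generation’s allele frequencies.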

This is a great idea—Chris felt sure someone must have already had it. A natural context would be research on genetic programming, a machine learning technique that uses an evolutionary algorithm to optimize a population of computer programs according to a fitness landscape determined by their ability to perform a given task. Since there has also been a lot of work on Bayesian approaches to machine learning, surely someone has noticed their mathematical relationship?

I see at least one person found these ideas as new and exciting as I did. But I still can’t believe Chris was the first to clearly formulate them, so I’d still like to know who did.

Marc Harper actually went to work with Chris after reading that diary entry of mine. By now he’s gone a lot further with this analogy by focusing on the role of information. As we keep updating our prior using Bayes’ rule, we should be gaining information about the real world. This idea has been made very precise in the theory of ‘machine learning’. Similarly, as a population evolves through natural selection, it should be gaining information about its environment.

I’ve been talking about Bayesian updating as a discrete-time process: something that happens once each generation for our population. That’s fine and dandy, definitely worth studying, but Marc’s paper focuses on a continuous-time version called the ‘replicator equation’. It goes like this. Let X be the set of alternative genotypes. For each i \in X, let P_i be the number of organisms that have the ith genotype at time t. Say that

\displaystyle{ \frac{d P_i}{d t} = f_i P_i }

where f_i is the fitness of the ith genotype. Let p_i be the probability that at time t, a randomly chosen organism will have the ith genotype:

\displaystyle{ p_i = \frac{P_i}{\sum_{i \in X} P_i } }

Then a little calculus gives the replicator equation:

\displaystyle{\frac{d p_i}{d t} = \left( f_i  - \langle f \rangle \right) \, p_i }

where

\langle f \rangle = \sum_{i \in X}  f_i  p_i

is the mean fitness of the organisms. So, the fraction of organisms of the ith type grows at a rate proportional to the fitness of that type minus the mean fitness. It ain’t enough to be good: you gotta be better than average. (For completeness, the little calculus is spelled out just below.)
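
Writing N = \sum_{j \in X} P_j, so that p_i = P_i/N, we get

\displaystyle{ \frac{d p_i}{d t} = \frac{1}{N}\frac{d P_i}{d t} - \frac{P_i}{N^2}\frac{d N}{d t} = f_i p_i - p_i \langle f \rangle = \left( f_i - \langle f \rangle \right) p_i }

since dN/dt = \sum_{j \in X} f_j P_j = N \langle f \rangle.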

Note that all this works not just when each fitness f_i is a mere number, but also when it’s a function of the whole list of probabilities p_i. That’s good, because in the real world, the fitness of one kind of bug may depend on the fraction of bugs of various kinds.

But what does all this have to do with information?

Marc’s paper has a lot to say about this! But just to give you a taste, here’s a simple fact involving relative entropy, which was first discovered by Ethan Akin. Suppose evolution as described by the replicator equation brings the whole list of probabilities p_i—let’s call this list p—closer and closer to some stable equilibrium, say q. Then if a couple of technical conditions hold, the entropy of q relative to p keeps decreasing, and approaches zero.

Remember what I told you about relative entropy. In Bayesian inference, the entropy of q relative to p is how much information we gain if we start with p as our prior and then do an experiment that pushes us to the posterior q.

So, in simple rough terms: as it approaches a stable equilibrium, the amount of information a species has left to learn keeps dropping, and goes to zero!

I won’t fill in the precise details, because I bet you’re tired already. You can find them in Section 3.5, which is called “Kullback-Leibler Divergence is a Lyapunov function for the Replicator Dynamic”. If you know all the buzzwords here, you’ll be in buzzword heaven now. ‘Kullback-Leibler divergence’ is just another term for relative entropy. ‘Lyapunov function’ means that it keeps dropping and goes to zero. And the ‘replicator dynamic’ is the replicator equation I described above.
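
If you would like to watch the Lyapunov property in action, here is a small simulation in Python. It is a toy example of my own: a two-strategy game with a payoff matrix chosen so that the stable equilibrium is q = (1/2, 1/2). The entropy of q relative to p keeps dropping along the trajectory:

import math

A = [[-1.0, 2.0],
     [ 0.0, 1.0]]            # payoff matrix with stable mixed equilibrium q = (1/2, 1/2)
q = [0.5, 0.5]
p = [0.9, 0.1]               # initial distribution of genotypes
dt = 0.01

def relative_entropy(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

for step in range(2001):
    f = [sum(A[i][j] * p[j] for j in range(2)) for i in range(2)]
    mean_f = sum(fi * pi for fi, pi in zip(f, p))
    if step % 500 == 0:
        print(step, relative_entropy(q, p))   # keeps dropping toward zero
    # Euler step of the replicator equation
    p = [pi + dt * (fi - mean_f) * pi for pi, fi in zip(p, f)]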

Perhaps next time I’ll say more about this stuff. For now, I just hope you see why it makes me so happy.

First, it uses information geometry to make precise the sense in which evolution is a process of acquiring information. That’s very cool. We’re looking at a simplified model—the replicator equation—but doubtless this is just the beginning of a very long story that keeps getting deeper as we move to less simplified models.

Second, if you read my summary of Chris Canning’s talks on evolutionary game theory, you’ll see everything I just said meshes nicely with that. He was taking the fitness f_i to be

f_i = \sum_{j \in X} A_{i j} p_j

where the payoff matrix A_{i j} describes the ‘winnings’ of an organism with the ith genotype when it meets an organism with the jth genotype. This gives a particularly nice special case of the replicator equation.

Third, this particularly nice special case happens to be the rate equation for a certain stochastic Petri net. So, we’ve succeeded in connecting the ‘diagram theory’ discussion to the ‘information geometry’ discussion! This has all sorts of implications, which will take quite a while to explore.

As the saying goes, in mathematics:

Everything sufficiently beautiful is connected to all other beautiful things.


Crooks’ Fluctuation Theorem

30 April, 2011

guest post by Eric Downes

Christopher Jarzynski, Gavin Crooks, and some others have made a big splash by providing general equalities that relate free energy differences to non-equilibrium work values. The best place to start is the first two chapters of Gavin Crooks’ thesis:

• Gavin Crooks, Excursions in Statistical Dynamics, Ph.D. Thesis, Department of Chemistry, U.C. Berkeley, 1999.

Here is the ~1 kiloword summary:

If we consider the work W done on a system, Clausius’ Inequality states that

W \ge \Delta F

where \Delta F is the change in free energy. One must perform more work along a non-equilibrium path because of the second law of thermodynamics. The excess work

W- \Delta F

is dissipated as heat, and is basically the entropy change in the universe, measured in different units. But who knows how large the excess work will be…

One considers a small system for which we imagine there exists a distribution of thermodynamic work values W (more on that below) in moving a system through phase space. We start at a macrostate with free energy F_1, and (staying in touch with a thermal reservoir at inverse temperature \beta) move in finite time to a new non-equilibrium state. When this new state is allowed to equilibrate it will have free energy

F_1 + \Delta F

You can do this by changing the spin-spin coupling, compressing a gas, etc: you’re changing one of the parameters in the system’s Hamiltonian in a completely deterministic way, such that the structure of the Hamiltonian does not change, and the system still has well-defined microstates at all intervening times. Your total accumulated work values will follow

\displaystyle{ \langle \exp(-\beta W) \rangle = \exp(-\beta \Delta F)}

where the expectation value is over a distribution of all possible paths through the classical phase space. This is the Jarzynski Equality.

It has an analogue for quantum systems, which appears to be related to supersymmetry, somehow. But the proof for classical systems simply relies on a Markov chain that moves through state space and an appropriate definition for work (see below). I can dig up the reference if anyone wants.
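
To make this concrete, here is a minimal Monte Carlo check of the Jarzynski equality in Python. This is not the setup in Crooks’ thesis, just a toy model of my own: a two-level system with energies 0 and \lambda, where \lambda is ramped from 0 to 2 in five abrupt steps, and after each step the state relaxes by a single Metropolis update (which satisfies detailed balance at the current \lambda):

import math, random

beta = 1.0
lam_final = 2.0
n_steps = 5                  # a fast, strongly non-equilibrium protocol
n_trajectories = 200000

def metropolis(state, lam):
    """One Metropolis update of the two-level system at parameter lam."""
    proposed = 1 - state
    dE = (lam if proposed == 1 else 0.0) - (lam if state == 1 else 0.0)
    if dE <= 0 or random.random() < math.exp(-beta * dE):
        return proposed
    return state

total = 0.0
for _ in range(n_trajectories):
    state = random.randint(0, 1)        # equilibrium at lam = 0: both states equally likely
    lam, work = 0.0, 0.0
    for k in range(1, n_steps + 1):
        new_lam = lam_final * k / n_steps
        if state == 1:
            work += new_lam - lam       # work: energy change at fixed state
        lam = new_lam
        state = metropolis(state, lam)  # heat: state change at fixed lam
    total += math.exp(-beta * work)

delta_F = -(1/beta) * (math.log(1 + math.exp(-beta * lam_final)) - math.log(2))
print(total / n_trajectories, math.exp(-beta * delta_F))

Even though five big jumps is nowhere near quasistatic, the two printed numbers agree up to Monte Carlo error.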

This is actually a specific case of a more fundamental theorem discovered about a decade ago by Gavin Crooks: the Crooks fluctuation theorem:

\displaystyle{ \exp(-\beta(W- \Delta F)) = \frac{P_{\mathrm{fwd}}}{P_{\mathrm{rev}}} }

where P_{\mathrm{fwd}} is the probability of a particular forward path which requires work W, and P_{\mathrm{rev}} is the probability of its time-reversal dual (see Gavin Crooks’ thesis for more precise definitions).

How do we assign a thermodynamic work value to a path of microstates? At the risk of ruining it for you: It turns out that one can write a first law analog for a subsystem Hamiltonian. We start with:

H_{\mathrm{tot}} = H_{\mathrm{subsys}} + H_{\mathrm{environ}} + H_{\mathrm{interact}}

As with Gibbs’ derivation of the canonical ensemble, we never specify what H_{\mathrm{environ}} and H_{\mathrm{interact}} are, only that the number of degrees of freedom in H_{\mathrm{environ}} is very large, and H_{\mathrm{interact}} is a small coupling. You make the observation that work can be associated with changing the energy-levels of the microstates in H_{\mathrm{subsys}}, while heat is associated with the energy change when the (sub)system jumps from one microstate to another (due to H_{\mathrm{interact}}) with no change in the spectrum of available energies. This implies a rather deep connection between the Hamiltonian and thermodynamic work. The second figure in Gavin’s thesis explained everything for me, after that you can basically derive it yourself.

The only physical applications I am aware of are to Monte Carlo simulations and mesoscopic systems in nano- or molecular biophysics. In that regard John Baez’ recent relation between free energy and Rényi entropy is a nice potential competitor for the efficient calculation of free energy differences (which apparently normally requires multiple simulations at intervening temperatures, calculating the specific heat at each).

But the relation to Markov chains is much more interesting to me, because this is a very general mathematical object which can be related to a much broader class of problems. Heat ends up being associated with fluctuations in the system’s state, and the (phenomenological) energy values are kind of the “relative unlikelihood” of each state. The excess work turns out to be related to the Kullback-Leibler divergence between the forward and reverse path-probabilities.

For visual learners with a background in stat mech, this is all developed in a pedagogical talk I gave in Fall 2010 at U. Wisconsin-Madison’s Condensed Matter Theory Seminar; talk available here. I’m licensing it cc-by-sa-nc through the Creative Commons License. I’ve been sloppy with references, but I emphasize that this is not original work; it is my presentation of Crooks’ and Jarzynski’s. Nonetheless, any errors you find are my own. Hokay, have a nice day!


Information Geometry (Part 7)

2 March, 2011

Today, I want to describe how the Fisher information metric is related to relative entropy. I’ve explained both these concepts separately (click the links for details); now I want to put them together.

But first, let me explain what this whole series of blog posts is about. Information geometry, obviously! But what’s that?

Information geometry is the geometry of ‘statistical manifolds’. Let me explain that concept twice: first vaguely, and then precisely.

Vaguely speaking, a statistical manifold is a manifold whose points are hypotheses about some situation. For example, suppose you have a coin. You could have various hypotheses about what happens when you flip it. For example: you could hypothesize that the coin will land heads up with probability x, where x is any number between 0 and 1. This makes the interval [0,1] into a statistical manifold. Technically this is a manifold with boundary, but that’s okay.

Or, you could have various hypotheses about the IQ’s of American politicians. For example: you could hypothesize that they’re distributed according to a Gaussian probability distribution with mean x and standard deviation y. This makes the space of pairs (x,y) into a statistical manifold. Of course we require y \ge 0, which gives us a manifold with boundary. We might also want to assume x \ge 0, which would give us a manifold with corners, but that’s okay too. We’re going to be pretty relaxed about what counts as a ‘manifold’ here.

If we have a manifold whose points are hypotheses about some situation, we say the manifold ‘parametrizes’ these hypotheses. So, the concept of statistical manifold is fundamental to the subject known as parametric statistics.

Parametric statistics is a huge subject! You could say that information geometry is the application of geometry to this subject.

But now let me go ahead and make the idea of ‘statistical manifold’ more precise. There’s a classical and a quantum version of this idea. I’m working at the Centre of Quantum Technologies, so I’m being paid to be quantum—but today I’m in a classical mood, so I’ll only describe the classical version. Let’s say a classical statistical manifold is a smooth function p from a manifold M to the space of probability distributions on some measure space \Omega.

We should think of \Omega as a space of events. In our first example, it’s just \{H, T\}: we flip a coin and it lands either heads up or tails up. In our second it’s \mathbb{R}: we measure the IQ of an American politician and get some real number.

We should think of M as a space of hypotheses. For each point x \in M, we have a probability distribution p_x on \Omega. This is hypothesis about the events in question: for example “when I flip the coin, there’s 55% chance that it will land heads up”, or “when I measure the IQ of an American politician, the answer will be distributed according to a Gaussian with mean 0 and standard deviation 100.”

Now, suppose someone hands you a classical statistical manifold (M,p). Each point in M is a hypothesis. Apparently some hypotheses are more similar than others. It would be nice to make this precise. So, you might like to define a metric on M that says how ‘far apart’ two hypotheses are. People know lots of ways to do this; the challenge is to find ways that have clear meanings.

Last time I explained the concept of relative entropy. Suppose we have two probability distributions on \Omega, say p and q. Then the entropy of p relative to q is the amount of information you gain when you start with the hypothesis q but then discover that you should switch to the new improved hypothesis p. It equals:

\int_\Omega \; \frac{p}{q} \; \ln(\frac{p}{q}) \; q d \omega

You could try to use this to define a distance between points x and y in our statistical manifold, like this:

S(x,y) =  \int_\Omega \; \frac{p_x}{p_y} \; \ln(\frac{p_x}{p_y}) \; p_y d \omega

This is definitely an important function. Unfortunately, as I explained last time, it doesn’t obey the axioms that a distance function should! Worst of all, it doesn’t obey the triangle inequality.

Can we ‘fix’ it? Yes, we can! And when we do, we get the Fisher information metric, which is actually a Riemannian metric on M. Suppose we put local coordinates on some patch of M containing the point x. Then the Fisher information metric is given by:

g_{ij}(x) = \int_\Omega  \partial_i (\ln p_x) \; \partial_j (\ln p_x) \; p_x d \omega

You can think of my whole series of articles so far as an attempt to understand this funny-looking formula. I’ve shown how to get it from a few different starting-points, most recently back in Part 3. But now let’s get it starting from relative entropy!

Fix any point in our statistical manifold and choose local coordinates for which this point is the origin, 0. The amount of information we gain if we move to some other point x is the relative entropy S(x,0). But what’s this like when x is really close to 0? We can imagine doing a Taylor series expansion of S(x,0) to answer this question.

Surprisingly, to first order the answer is always zero! Mathematically:

\partial_i S(x,0)|_{x = 0} = 0

In plain English: if you change your mind slightly, you learn a negligible amount — not an amount proportional to how much you changed your mind.

This must have some profound significance. I wish I knew what. Could it mean that people are reluctant to change their minds except in big jumps?

Anyway, if you think about it, this fact makes it obvious that S(x,y) can’t obey the triangle inequality. S(x,y) could be pretty big, but if we draw a curve from x to y, and mark n closely spaced points x_i on this curve, then S(x_{i+1}, x_i) is zero to first order, so it must be of order 1/n^2, so if the triangle inequality were true we’d have

S(x,y) \le \sum_i S(x_{i+1},x_i) \le \mathrm{const} \cdot n \cdot \frac{1}{n^2}

for all n, which is a contradiction.

In plain English: if you change your mind in one big jump, the amount of information you gain is more than the sum of the amounts you’d gain if you change your mind in lots of little steps! This seems pretty darn strange, but the paper I mentioned in part 1 helps:

• Gavin E. Crooks, Measuring thermodynamic length.

You’ll see he takes a curve and chops it into lots of little pieces as I just did, and explains what’s going on.

Okay, so what about second order? What’s

\partial_i \partial_j S(x,0)|_{x = 0} ?

Well, this is the punchline of this blog post: it’s the Fisher information metric:

\partial_i \partial_j S(x,0)|_{x = 0} = g_{ij}

And since the Fisher information metric is a Riemannian metric, we can then apply the usual recipe and define distances in a way that obeys the triangle inequality. Crooks calls this distance thermodynamic length in the special case that he considers, and he explains its physical meaning.
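
Before the proof, here is a quick numerical sanity check in Python, using the coin-flipping manifold from the start of this post: there p_x = (x, 1-x), the formula above gives g(x) = 1/x + 1/(1-x) = 1/(x(1-x)), and a finite-difference second derivative of the relative entropy reproduces it:

import math

def S(x, x0):
    """Entropy of the coin distribution (x, 1-x) relative to (x0, 1-x0)."""
    return x * math.log(x / x0) + (1 - x) * math.log((1 - x) / (1 - x0))

x0, h = 0.3, 1e-4
first  = (S(x0 + h, x0) - S(x0 - h, x0)) / (2 * h)
second = (S(x0 + h, x0) - 2 * S(x0, x0) + S(x0 - h, x0)) / h**2
print(first)                          # approximately 0: no first-order term
print(second, 1 / (x0 * (1 - x0)))    # both approximately 4.76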

Now let me prove that

\partial_i S(x,0)|_{x = 0} = 0

and

\partial_i \partial_j S(x,0)|_{x = 0} = g_{ij}

This can be somewhat tedious if you do it by straightforwardly grinding it out—I know, I did it. So let me show you a better way, which requires more conceptual acrobatics but less brute force.

The trick is to work with the universal statistical manifold for the measure space \Omega. Namely, we take M to be the space of all probability distributions on \Omega! This is typically an infinite-dimensional manifold, but that’s okay: we’re being relaxed about what counts as a manifold here. In this case, we don’t need to write p_x for the probability distribution corresponding to the point x \in M. In this case, a point of M just is a probability distribution on \Omega, so we’ll just call it p.

If we can prove the formulas for this universal example, they’ll automatically follow for every other example, by abstract nonsense. Why? Because any statistical manifold with measure space \Omega is the same as a manifold with a smooth map to the universal statistical manifold! So, geometrical structures on the universal one ‘pull back’ to give structures on all the rest. The Fisher information metric and the function S can be defined as pullbacks in this way! So, to study them, we can just study the universal example.

(If you’re familiar with ‘classifying spaces for bundles’ or other sorts of ‘classifying spaces’, all this should seem awfully familiar. It’s a standard math trick.)

So, let’s prove that

\partial_i S(x,0)|_{x = 0} = 0

by proving it in the universal example. Given any probability distribution q, and taking a nearby probability distribution p, we can write

\frac{p}{q} = 1 + f

where f is some small function. We only need to show that S(p,q) is zero to first order in f. And this is pretty easy. By definition:

S(p,q) =  \int_\Omega \; \frac{p}{q} \, \ln(\frac{p}{q}) \; q d \omega

or in other words,

S(p,q) =  \int_\Omega \; (1 + f) \, \ln(1 + f) \; q d \omega

We can calculate this to first order in f and show we get zero. But let’s actually work it out to second order, since we’ll need that later:

\ln (1 + f) = f - \frac{1}{2} f^2 + \cdots

so

(1 + f) \, \ln (1+ f) = f + \frac{1}{2} f^2 + \cdots

so

\begin{aligned} S(p,q) &= \int_\Omega \; (1 + f) \; \ln(1 + f) \; q d \omega \\ &= \int_\Omega f \, q d \omega + \frac{1}{2} \int_\Omega f^2\, q d \omega + \cdots \end{aligned}

Why does this vanish to first order in f? It’s because p and q are both probability distributions and p/q = 1 + f, so

\int_\Omega (1 + f) \, q d\omega = \int_\Omega p d\omega = 1

but also

\int_\Omega q d\omega = 1

so subtracting we see

\int_\Omega f \, q d\omega = 0

So, S(p,q) vanishes to first order in f. Voilà!

Next let’s prove the more interesting formula:

\partial_i \partial_j S(x,0)|_{x = 0} = g_{ij}

which relates relative entropy to the Fisher information metric. Since both sides are symmetric matrices, it suffices to show their diagonal entries agree in any coordinate system:

\partial^2_i S(x,0)|_{x = 0} = g_{ii}

Devoted followers of this series of posts will note that I keep using this trick, which takes advantage of the polarization identity.

To prove

\partial^2_i S(x,0)|_{x = 0} = g_{ii}

it’s enough to consider the universal example. We take the origin to be some probability distribution q and take x to be a nearby probability distribution p which is pushed a tiny bit in the ith coordinate direction. As before we write p/q = 1 + f. We look at the second-order term in our formula for S(p,q):

\frac{1}{2} \int_\Omega f^2\, q d \omega

Using the usual second-order Taylor’s formula, which has a \frac{1}{2} built into it, we can say

\partial^2_i S(x,0)|_{x = 0} = \int_\Omega f^2\, q d \omega

On the other hand, our formula for the Fisher information metric gives

g_{ii} = \left. \int_\Omega  \partial_i \ln p \; \partial_i \ln p \; q d \omega \right|_{p=q}

The right hand sides of the last two formulas look awfully similar! And indeed they agree, because we can show that

\left. \partial_i \ln p \right|_{p = q} = f

How? Well, we assumed that p is what we get by taking q and pushing it a little bit in the ith coordinate direction; we have also written that little change as

p/q = 1 + f

for some small function f. So,

\partial_i (p/q) = f

and thus:

\partial_i p = f q

and thus:

\partial_i \ln p = \frac{\partial_i p}{p} = \frac{fq}{p}

so

\left. \partial_i \ln p \right|_{p=q} = f

as desired.

This argument may seem a little hand-wavy and nonrigorous, with words like ‘a little bit’. If you’re used to taking arguments involving infinitesimal changes and translating them into calculus (or differential geometry), it should make sense. If it doesn’t, I apologize. It’s easy to make it more rigorous, but only at the cost of more annoying notation, which doesn’t seem good in a blog post.

Boring technicalities

If you’re actually the kind of person who reads a section called ‘boring technicalities’, I’ll admit to you that my calculations don’t make sense if the integrals diverge, or we’re dividing by zero in the ratio p/q. To avoid these problems, here’s what we should do. Fix a \sigma-finite measure space (\Omega, d\omega). Then, define the universal statistical manifold to be the space P(\Omega,d \omega) consisting of all probability measures that are equivalent to d\omega, in the usual sense of measure theory. By Radon-Nikodym, we can write any such measure as q d \omega where q \in L^1(\Omega, d\omega). Moreover, given two of these guys, say p d \omega and q d\omega, they are absolutely continuous with respect to each other, so we can write

p d \omega = \frac{p}{q} \; q d \omega

where the ratio p/q is well-defined almost everywhere and lies in L^1(\Omega, q d\omega). This is enough to guarantee that we’re never dividing by zero, and I think it’s enough to make sure all my integrals converge.

We do still need to make P(\Omega,d \omega) into some sort of infinite-dimensional manifold, to justify all the derivatives. There are various ways to approach this issue, all of which start from the fact that L^1(\Omega, d\omega) is a Banach space, which is about the nicest sort of infinite-dimensional manifold one could imagine. Sitting in L^1(\Omega, d\omega) is the hyperplane consisting of functions q with

\int_\Omega q d\omega = 1

and this is a Banach manifold. To get P(\Omega,d \omega) we need to take a subspace of that hyperplane. If this subspace were open then P(\Omega,d \omega) would be a Banach manifold in its own right. I haven’t checked this yet, for various reasons.

For one thing, there’s a nice theory of ‘diffeological spaces’, which generalize manifolds. Every Banach manifold is a diffeological space, and every subset of a diffeological space is again a diffeological space. For many purposes we don’t need our ‘statistical manifolds’ to be manifolds: diffeological spaces will do just fine. This is one reason why I’m being pretty relaxed here about what counts as a ‘manifold’.

For another, I know that people have worked out a lot of this stuff, so I can just look things up when I need to. And so can you! This book is a good place to start:

• Paolo Gibilisco, Eva Riccomagno, Maria Piera Rogantin and Henry P. Wynn, Algebraic and Geometric Methods in Statistics, Cambridge U. Press, Cambridge, 2009.

I find the chapters by Raymond Streater especially congenial. For the technical issue I’m talking about now it’s worth reading section 14.2, “Manifolds modelled by Orlicz spaces”, which tackles the problem of constructing a universal statistical manifold in a more sophisticated way than I’ve just done. And in chapter 15, “The Banach manifold of quantum states”, he tackles the quantum version!


Rényi Entropy and Free Energy

10 February, 2011

I want to keep telling you about information geometry… but I got sidetracked into thinking about something slightly different, thanks to some fascinating discussions here at the CQT.

There are a lot of people interested in entropy here, so some of us — Oscar Dahlsten, Mile Gu, Elisabeth Rieper, Wonmin Son and me — decided to start meeting more or less regularly. I call it the Entropy Club. I’m learning a lot of wonderful things, and I hope to tell you about them someday. But for now, here’s a little idea I came up with, triggered by our conversations:

• John Baez, Rényi entropy and free energy.

In 1960, Alfred Rényi defined a generalization of the usual Shannon entropy that depends on a parameter. If p is a probability distribution on a finite set, its Rényi entropy of order \beta is defined to be

\displaystyle{ H_\beta = \frac{1}{1 - \beta} \ln \sum_i p_i^\beta }

where 0 \le \beta < \infty. This looks pretty weird at first, and we need \beta \ne 1 to avoid dividing by zero, but you can show that the Rényi entropy approaches the Shannon entropy as \beta approaches 1:

\lim_{\beta \to 1} H_\beta = -\sum_{i} p_i \ln p_i .

(A fun puzzle, which I leave to you.) So, it’s customary to define H_1 to be the Shannon entropy… and then the Rényi entropy generalizes the Shannon entropy by allowing an adjustable parameter \beta.

But what does it mean?

If you ask people what’s good about the Rényi entropy, they’ll usually say: it’s additive! In other words, when you combine two independent probability distributions into a single one, their Rényi entropies add. And that’s true — but there are other quantities that have the same property. So I wanted a better way to think about Rényi entropy, and here’s what I’ve come up with so far.

Any probability distribution can be seen as the state of thermal equilibrium for some Hamiltonian at some fixed temperature, say T = 1. And that Hamiltonian is unique. Starting with that Hamiltonian, we can then compute the free energy F at any temperature T, and up to a certain factor this free energy turns out to be the Rényi entropy H_\beta, where \beta = 1/T. More precisely:

F = (1 - T) H_\beta.

So, up to the fudge factor 1 - T, Rényi entropy is the same as free energy. It seems like a good thing to know — but I haven't seen anyone say it anywhere! Have you?

Let me show you why it’s true — the proof is pathetically simple. We start with our probability distribution p_i. We can always write

p_i = e^{- E_i}

for some real numbers E_i. Let’s think of these numbers E_i as energies. Then the state of thermal equilibrium, also known as the canonical ensemble or Gibbs state at inverse temperature \beta is the probability distribution

\frac{e^{- \beta E_i}}{Z}

where Z is the partition function:

Z = \sum_i e^{-\beta E_i}

Since Z = 1 when \beta = 1, the Gibbs state reduces to our original probability distribution at \beta = 1.

Now in thermodynamics, the quantity

F = - \frac{1}{\beta} \ln Z

is called the free energy. It’s important, because it equals the total expected energy of our system, minus the energy in the form of heat. Roughly speaking, it’s the energy that you can use.

Let’s see how the Rényi entropy is related to the free energy. The proof is a trivial calculation:

- \beta F = \ln Z = \ln \sum_{i \in X} e^{-\beta E_i} = \ln \sum_{i \in X} p_i^\beta = (1 - \beta) H_\beta

so

H_\beta = -  \frac{\beta}{1 - \beta} F

at least for \beta \ne 1. But you can also check that both sides of this equation have well-defined limits as \beta \to 1.

The relation between free energy and Rényi entropy looks even neater if we solve for F and write the answer using T instead of \beta = 1/T:

F = (1 - T)H_\beta
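
Here is a short numerical check in Python, with a probability distribution and a temperature of my own choosing:

import math

p = [0.5, 0.3, 0.2]
E = [-math.log(pi) for pi in p]   # energies chosen so p is the Gibbs state at T = 1

T = 0.4
beta = 1 / T
Z = sum(math.exp(-beta * Ei) for Ei in E)
F = -math.log(Z) / beta                                  # free energy at temperature T
H = math.log(sum(pi**beta for pi in p)) / (1 - beta)     # Renyi entropy of order beta
print(F, (1 - T) * H)                                    # these agree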

So, what’s this fact good for? I’m not sure yet! In my paper, I combine it with this equation:

F = \langle E \rangle - T S

Here \langle E \rangle is the expected energy in the Gibbs state at temperature T:

\langle E \rangle = \frac{1}{Z} \sum_i E_i \, e^{-\beta E_i}

while S is the usual Shannon entropy of this Gibbs state. I also show that all this stuff works quantum-mechanically as well as classically. But so far, it seems the main benefit is that Rényi entropy has become a lot less mysterious. It’s not a mutant version of Shannon entropy: it’s just a familiar friend in disguise.


Information Geometry (Part 6)

21 January, 2011

So far, my thread on information geometry hasn’t said much about information. It’s time to remedy that.

I’ve been telling you about the Fisher information metric. In statistics this is a nice way to define a ‘distance’ between two probability distributions. But it also has a quantum version.

So far I’ve shown you how to define the Fisher information metric in three equivalent ways. I also showed that in the quantum case, the Fisher information metric is the real part of a complex-valued thing. The imaginary part is related to the uncertainty principle.

You can see it all here:

Part 1     • Part 2     • Part 3     • Part 4     • Part 5

But there’s yet another way to define the Fisher information metric, which really involves information.

To explain this, I need to start with the idea of ‘information gain’, or ‘relative entropy’. And it looks like I should do a whole post on this.

So:

Suppose that \Omega is a measure space — that is, a space you can do integrals over. By a probability distribution on \Omega, I’ll mean a nonnegative function

p : \Omega \to \mathbb{R}

whose integral is 1. Here d \omega is my name for the measure on \Omega. Physicists might call \Omega the ‘phase space’ of some classical system, but probability theorists might call it a space of ‘events’. Today I’ll use the probability theorist’s language. The idea here is that

\int_A \; p(\omega) \; d \omega

gives the probability that when an event happens, it’ll be one in the subset A \subseteq \Omega. That’s why we want

p \ge 0

Probabilities are supposed to be nonnegative. And that’s also why we want

\int_\Omega \; p(\omega) \; d \omega = 1

This says that the probability of some event happening is 1.

Now, suppose we have two probability distributions on \Omega, say p and q. The information gain as we go from q to p is

S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

We also call this the entropy of p relative to q. It says how much information you learn if you discover that the probability distribution of an event is p, if before you had thought it was q.

I like relative entropy because it’s related to the Bayesian interpretation of probability. The idea here is that you can’t really ‘observe’ probabilities as frequencies of events, except in some unattainable limit where you repeat an experiment over and over infinitely many times. Instead, you start with some hypothesis about how likely things are: a probability distribution called the prior. Then you update this using Bayes’ rule when you gain new information. The updated probability distribution — your new improved hypothesis — is called the posterior.

And if you don’t do the updating right, you need a swift kick in the posterior!

So, we can think of q as the prior probability distribution, and p as the posterior. Then S(p,q) measures the amount of information that caused you to change your views.

For example, suppose you’re flipping a coin, so your set of events is just

\Omega = \{ \mathrm{heads}, \mathrm{tails} \}

In this case all the integrals are just sums with two terms. Suppose your prior assumption is that the coin is fair. Then

q(\mathrm{heads}) = 1/2, \; q(\mathrm{tails}) = 1/2

But then suppose someone you trust comes up and says “Sorry, that’s a trick coin: it always comes up heads!” So you update our probability distribution and get this posterior:

p(\mathrm{heads}) = 1, \; p(\mathrm{tails}) = 0

How much information have you gained? Or in other words, what’s the relative entropy? It’s this:

S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega = 1 \cdot \log(\frac{1}{1/2}) + 0 \cdot \log(\frac{0}{1/2}) = 1

Here I’m doing the logarithm in base 2, and you’re supposed to know that in this game 0 \log 0 = 0.

So: you’ve learned one bit of information!

That’s supposed to make perfect sense. On the other hand, the reverse scenario takes a bit more thought.

You start out feeling sure that the coin always lands heads up. Then someone you trust says “No, that’s a perfectly fair coin.” If you work out the amount of information you learned this time, you’ll see it’s infinite.

Why is that?

The reason is that something that you thought was impossible — the coin landing tails up — turned out to be possible. In this game, it counts as infinitely shocking to learn something like that, so the information gain is infinite. If you hadn’t been so darn sure of yourself — if you had just believed that the coin almost always landed heads up — your information gain would be large but finite.
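
Here are both directions of the coin example in Python, softening ‘always heads’ to ‘heads with probability 1 - \epsilon’ to watch the divergence appear:

import math

def S(p, q):
    """Relative entropy  sum_i p_i log2(p_i / q_i),  with the convention 0 log 0 = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair = [0.5, 0.5]
certain = [1.0, 0.0]

print(S(certain, fair))             # going from fair to certain: exactly 1 bit
for eps in [1e-2, 1e-4, 1e-6]:
    print(S(fair, [1 - eps, eps]))  # going the other way: grows without bound as eps -> 0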

The Bayesian philosophy is built into the concept of information gain, because information gain depends on two things: the prior and the posterior. And that’s just as it should be: you can only say how much you learned if you know what you believed beforehand!

You might say that information gain depends on three things: p, q and the measure d \omega. And you’d be right! Unfortunately, the notation S(p,q) is a bit misleading. Information gain really does depend on just two things, but these things are not p and q: they’re p(\omega) d\omega and q(\omega) d\omega. These are called probability measures, and they’re ultimately more important than the probability distributions p and q.

To see this, take our information gain:

\int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

and juggle it ever so slightly to get this:

\int_\Omega \;  \log(\frac{p(\omega) d\omega}{q(\omega)d \omega}) \; p(\omega) d \omega

Clearly this depends only on p(\omega) d\omega and q(\omega) d\omega. Indeed, it’s good to work directly with these probability measures and give them short names, like

d\mu = p(\omega) d \omega

d\nu = q(\omega) d \omega

Then the formula for information gain looks more slick:

\int_\Omega \; \log(\frac{d\mu}{d\nu}) \; d\mu

And by the way, in case you’re wondering, the d here doesn’t actually mean much: we’re just so brainwashed into wanting a d x in our integrals that people often use d \mu for a measure even though the simpler notation \mu might be more logical. So, the function

\frac{d\mu}{d\nu}

is really just a ratio of probability measures, but people call it a Radon-Nikodym derivative, because it looks like a derivative (and in some important examples it actually is). So, if I were talking to myself, I could have shortened this blog entry immensely by working directly with probability measures, leaving out the d’s, and saying:

Suppose \mu and \nu are probability measures; then the entropy of \mu relative to \nu, or information gain, is

S(\mu, \nu) =  \int_\Omega \; \log(\frac{\mu}{\nu}) \; \mu

But I’m under the impression that people are actually reading this stuff, and that most of you are happier with functions than measures. So, I decided to start with

S(p,q) =  \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

and then gradually work my way up to the more sophisticated way to think about relative entropy! But having gotten that off my chest, now I’ll revert to the original naive way.

As a warmup for next time, let me pose a question. How much is this quantity

S(p,q) =  \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

like a distance between probability distributions? A distance function, or metric, is supposed to satisfy some axioms. Alas, relative entropy satisfies some of these, but not the most interesting one!

• If you’ve got a metric, the distance between points should always be nonnegative. Indeed, this holds:

S(p,q) \ge 0

So, we never learn a negative amount when we update our prior, at least according to this definition. It’s a fun exercise to prove this inequality, at least if you know some tricks involving inequalities and convex functions — otherwise it might be hard.

• If you’ve got a metric, the distance between two points should only be zero if they’re really the same point. In fact,

S(p,q) = 0

if and only if

p d\omega = q d \omega

It’s possible to have p d\omega = q d \omega even if p \ne q, because d \omega can be zero somewhere. But this is just more evidence that we should really be talking about the probability measure p d \omega instead of the probability distribution p. If we do that, we’re okay so far!

• If you’ve got a metric, the distance from your first point to your second point is the same as the distance from the second to the first. Alas,

S(p,q) \ne S(q,p)

in general. We already saw this in our example of the flipped coin. This is a slight bummer, but I could live with it, since Lawvere has already shown that it’s wise to generalize the concept of metric by dropping this axiom.

• If you’ve got a metric, it obeys the triangle inequality. This is the really interesting axiom, and alas, this too fails. Later we’ll see why.

So, relative entropy does a fairly miserable job of acting like a distance function. People call it a divergence. In fact, they often call it the Kullback-Leibler divergence. I don’t like that, because ‘the Kullback-Leibler divergence’ doesn’t really explain the idea: it sounds more like the title of a bad spy novel. ‘Relative entropy’, on the other hand, makes a lot of sense if you understand entropy. And ‘information gain’ makes sense if you understand information.

Anyway: how can we save this miserable attempt to get a distance function on the space of probability distributions? Simple: take its matrix of second derivatives and use that to define a Riemannian metric g_{ij}. This Riemannian metric in turn defines a metric of the more elementary sort we’ve been discussing today.

And this Riemannian metric is the Fisher information metric I’ve been talking about all along!

More details later, I hope.

