My grad student Mike Stay and I put a new paper on the arXiv:
• John Baez and Mike Stay, Algorithmic thermodynamics.
Let me just tell you a little about what we did.
Around 1948, the electrical engineer Claude Shannon came up with a mathematical theory of information. Here is a quote that gives a flavor of his thoughts:
The whole point of a message is that it should contain something new.
Say you have a source of information, for example a mysterious radio transmission from an extraterrestrial civilization. Suppose every day you get a signal sort of like this:
How much information are you getting? If the message always looks like this, presumably not much:
Shannon came up with a precise formula for the information. But beware: it’s not really a formula for the information of a particular message. It’s a formula for the average information of a message chosen from some probability distribution. It’s this:
$$ -\sum_i p_i \log p_i $$
where we sum over all possible messages, and $p_i$ is the probability of the $i$th message.
So, for example, suppose you keep getting the same message. Then every message has probability 0 except for one message, which has probability 1. Then for each message either $p_i$ or $\log p_i$ is zero, so the information is zero.
That seems vaguely plausible in the example where every day we get this message:
It may seem less plausible if every day we get this message:
It looks like the aliens are trying to tell us which numbers are prime! 1 is not prime, 2 is, 3 is, 4 is not, 5 is, and so on. Aren’t we getting some information?
Maybe so: this could be considered a defect of Shannon information. On the other hand, you might be willing to admit that if we keep getting the same message every day, the second time we get it we’re not getting any new information. Or, you might not be willing to admit this — it’s actually a subtle issue, and I don’t feel like arguing.
But at the very least, you’ll probably admit that the second time you get the same message, you get less new information than the first time. The third time you get even less, and so on. So it’s quite believable that in the long run, the average amount of new information per message approaches 0 in this case. For Shannon, information means new information.
On the other hand, suppose we are absolutely unable to predict each new bit we get from the aliens. Suppose our ability to predict the next bit is no better than our ability to predict whether a fair coin comes up heads or tails. Then Shannon’s formula says we are getting the same amount of new information with every bit: namely,
$$ -\tfrac{1}{2} \log \tfrac{1}{2} - \tfrac{1}{2} \log \tfrac{1}{2} = \log 2 $$
If we take the logarithm using base 2 here, we get 1, so we say we’re getting one bit of information. If we take it using base e, as physicists prefer, we get $\ln 2 \approx 0.693$, and we say we’re getting $\ln 2$ nats of information. One bit equals $\ln 2$ nats.
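If you want to check these numbers yourself, here is a minimal Python sketch of Shannon's formula. The function name `shannon_entropy` and the two example distributions are my own choices for illustration, and I'm assuming the usual convention that a message with probability 0 contributes nothing to the sum.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Average information -sum_i p_i log p_i, with 0 log 0 taken to be 0."""
    total = 0.0
    for p in probs:
        if p > 0.0:
            total -= p * math.log(p)
    # Dividing by log(base) converts from nats to the requested unit.
    return total / math.log(base)

# The same message every day: one outcome has probability 1.
same_message = [1.0, 0.0, 0.0, 0.0]
# A completely unpredictable bit: a fair coin.
fair_coin = [0.5, 0.5]

h_same = shannon_entropy(same_message)
h_coin_bits = shannon_entropy(fair_coin, base=2.0)
h_coin_nats = shannon_entropy(fair_coin, base=math.e)

print(h_same, h_coin_bits, h_coin_nats)
```

The first value is zero, the second is one bit, and the third is the same quantity measured in nats.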
There’s a long road from these reflections to a full justification of the specific formula for Shannon information! To begin marching down that road you can read his original paper, A mathematical theory of communication.
Anyway: it soon became clear that Shannon’s formula was almost the same as the usual formula for “entropy”, which goes back to Josiah Willard Gibbs. Gibbs was actually the first engineer to get a Ph.D. in the United States, back in 1863… but he’s mainly famous as a mathematician, physicist and chemist.
Entropy is a measure of disorder. Suppose we have a box of stuff: solid, liquid, gas, whatever. There are many possible states this stuff can be in: for example, the atoms can be in different positions, and have different velocities. Suppose we only know the probability that the stuff is in any one of the allowed states. If the $i$th state is occupied with probability $p_i$, Gibbs said the entropy of our box of stuff is
$$ S = -k \sum_i p_i \log p_i $$
Here $k$ is a constant called Boltzmann’s constant.
There’s a wonderful story here, which I don’t have time to tell in detail. The way I wrote Shannon’s formula for information and Gibbs’ formula for entropy, you’d think only a moron would fail to instantly grasp that they’re basically the same. But historically, it took some work.
The appearance of Boltzmann’s constant hints at why. It shows up because people had ideas about entropy, and the closely related concept of temperature, long before they realized the full mathematical meaning of these concepts! So entropy traditionally came in units of “joules/kelvin”, and physicists had a visceral understanding of it. But dividing by Boltzmann’s constant, we can translate that notion of entropy into the modern, more abstract way of thinking of entropy as information!
Henceforth I’ll work in units where $k = 1$, as modern mathematical physicists do, and treat entropy and information as the same concept.
Closely related to information and entropy is a third concept, which I will call Kolmogorov complexity, though it was developed by many people — including Martin-Löf, Solomonoff, Kolmogorov, Levin, and Chaitin — and it has many names, including descriptive complexity, Kolmogorov-Chaitin complexity, algorithmic entropy, and program-size complexity. You may be intimidated by all these names, but you shouldn’t be: when lots of people keep rediscovering something and giving it new names, it’s usually because this thing is pathetically simple.
So, what is Kolmogorov complexity? It’s a way of measuring the information in a single message rather than a probability distribution of messages. And it’s just the length of the shortest computer program that prints out this message.
I suppose some clarification may be needed here:
1) I’m only talking about programs without any input, that either calculate, print out a message, and halt… or calculate forever and never halt.
2) Of course the length of the shortest program that prints out the desired message depends on the programming language. But there are theorems saying it doesn’t depend “too much” on the language. So don’t worry about it: just pick your favorite language and stick with that.
If you think about it, Kolmogorov complexity is a really nice concept. The Kolmogorov complexity of a string of a million 0’s is a lot less than a million: you can write a short program that says “print a million 0’s”. But the Kolmogorov complexity of a highly unpredictable string of a million 0’s and 1’s is about a million: you basically need to include that string in your program and then say “print this string”. Those are the two extremes, but in general the complexity will be somewhere in between.
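You can't compute Kolmogorov complexity in general, but you can get a feel for the two extremes with an ordinary compressor. The sketch below uses Python's `zlib` as a crude stand-in. That stand-in is my assumption, not anything from the paper: the compressed length is only a rough upper bound on the complexity (plus the fixed size of a decompressor), and it can badly overestimate it. The variable names are made up for illustration.

```python
import os
import zlib

N_BITS = 1_000_000
N_BYTES = N_BITS // 8

def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data, at maximum compression."""
    return 8 * len(zlib.compress(data, 9))

# A million 0 bits, packed eight to a byte.
zeros = bytes(N_BYTES)
# A million bits from the operating system's randomness source.
noise = os.urandom(N_BYTES)

zeros_bits = compressed_bits(zeros)
noise_bits = compressed_bits(noise)

print("zeros:", zeros_bits, "bits after compression")
print("noise:", noise_bits, "bits after compression")
```

The string of zeros shrinks to a tiny fraction of a million bits, while the unpredictable string does not shrink at all. Keep in mind that a compressor only finds the kinds of pattern it was built to look for, so a string can compress poorly and still have a short program that prints it.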
Various people — the bigshots listed above, and others too — soon realized that Kolmogorov complexity is deeply connected to Shannon information. They have similar properties, but they’re also directly related. It’s another great story, and I urge you to learn it. For that, I recommend:
• Ming Li and Paul Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, Berlin, 2008.
To understand the relation a bit better, Mike and I started thinking about probability measures on the set of programs. People had already thought about this — but we thought about it a bit more the way physicists do.
Physicists like to talk about something called a Gibbs ensemble. Suppose we have a set $X$ and a function
$$ F : X \to \mathbb{R} $$
Then the Gibbs ensemble is the probability distribution on $X$ that maximizes entropy subject to the condition that $F$ has some specified average, or “expected value”.
So, to find the Gibbs ensemble, we need to find a probability distribution $p$ that maximizes
$$ -\sum_{i \in X} p_i \log p_i $$
subject to the constraint that
$$ \sum_{i \in X} p_i F(i) = \langle F \rangle $$
where $\langle F \rangle$ is some number, the expected value of $F$. Finding the probability distribution that does the job is an exercise in Lagrange multipliers. I won’t do it. There’s a nice formula for the answer, but we won’t need it here.
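For readers who would like to see the skipped exercise anyway, here is a sketch of the standard computation. I write $X$ for the set, $F$ for the function on it, $p_i$ for the probability of $i \in X$, and $\langle F \rangle$ for the prescribed average. The multipliers $\lambda$ and $\beta$ and the normalizing sum $Z(\beta)$ are names I am introducing here, and the sketch assumes that $Z(\beta)$ converges.

```latex
\begin{aligned}
L &= -\sum_{i \in X} p_i \log p_i
     \;-\; \lambda \Big( \sum_{i \in X} p_i - 1 \Big)
     \;-\; \beta \Big( \sum_{i \in X} p_i F(i) - \langle F \rangle \Big), \\[4pt]
\frac{\partial L}{\partial p_i} &= -\log p_i - 1 - \lambda - \beta F(i) = 0
   \quad\Longrightarrow\quad
   p_i = e^{-1-\lambda} \, e^{-\beta F(i)}, \\[4pt]
\sum_{i \in X} p_i = 1
   &\quad\Longrightarrow\quad
   p_i = \frac{e^{-\beta F(i)}}{Z(\beta)},
   \qquad Z(\beta) = \sum_{j \in X} e^{-\beta F(j)} .
\end{aligned}
```

The remaining multiplier $\beta$ is then fixed by demanding that the average of $F$ comes out equal to $\langle F \rangle$.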
What you really need to know is something more important: why Gibbs invented the Gibbs ensemble! He invented it to solve some puzzles that sound completely impossible at first.
For example, suppose you have a box of stuff and you don’t know which state it’s in. Suppose you only know the expected value of its energy, say $\langle E \rangle$. What’s the probability $p_i$ that this stuff is in its $i$th state?
This may sound like an insoluble puzzle: how can we possibly know? But Gibbs proposed an answer! He said, basically: find the probability distribution that maximizes entropy subject to the constraint that the mean value of energy is $\langle E \rangle$. Then the answer is the $p_i$ from that distribution.
In other words: use the Gibbs ensemble.
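To make Gibbs' recipe concrete, here is a small numerical sketch in Python. It relies on the standard fact that the entropy-maximizing distribution has the exponential form $p_i \propto e^{-\beta E_i}$, and it finds the parameter $\beta$ by bisection so that the mean energy hits the target. The energy levels, the target value, the search interval and all the function names are hypothetical choices of mine, just for illustration.

```python
import math

def gibbs_ensemble(energies, beta):
    """Probabilities p_i proportional to exp(-beta * E_i)."""
    e_min = min(energies)  # shifting by e_min leaves the p_i unchanged
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(energies, beta):
    """Expected energy under the Gibbs ensemble with parameter beta."""
    probs = gibbs_ensemble(energies, beta)
    return sum(p * e for p, e in zip(probs, energies))

def solve_beta(energies, target, lo=-50.0, hi=50.0, steps=200):
    """Bisection on beta; the mean energy decreases as beta increases."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, mid) > target:
            lo = mid  # mean too high, so beta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical energy levels
target = 1.0                     # hypothetical expected energy <E>

beta = solve_beta(energies, target)
probs = gibbs_ensemble(energies, beta)

print("beta  =", beta)
print("probs =", probs)
print("mean  =", mean_energy(energies, beta))
```

With these made-up numbers the target is below the plain average of the levels, so $\beta$ comes out positive and the lower-energy states get higher probability.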
Now let’s come back to Kolmogorov complexity.
Imagine randomly picking a program out of a hat. What’s the probability that you pick a certain program? Again, this sounds like an impossible puzzle. But you can answer it if you know the expected value of the length of this program! Then you just use the Gibbs ensemble.
What does this have to do with Kolmogorov complexity? Here I’ll be a bit fuzzy, because the details are in our paper, and I want you to read that.
Suppose we start out with the Gibbs ensemble I just mentioned. In other words, we have a program in a hat, but all you know is the expected value of its length.
But now suppose I tell you the message that this program prints out. Now you know more. How much more information do you have now? The Kolmogorov complexity of the message — that’s how much!
(Well, at least this is correct up to some error bounded by a constant.)
The main fuzzy thing in what I just said is “how much more information do you have?” You see, I’ve explained information, but I haven’t explained “information gain”. Information is a quantity you compute from one probability distribution. Information gain is a quantity you compute from two. More precisely,
$$ \sum_i p_i \log \left( \frac{p_i}{q_i} \right) $$
is the information you gain when you thought the probability distribution was $q$ but then someone comes along and tells you it’s $p$.
In fact, we argue that information gain is more fundamental than information. This is a Bayesian idea: $q$ is your “prior”, the probability distribution you thought was true, and the information you get upon hearing that the distribution is $p$ should be defined relative to this prior. When people think they’re talking about information without any prior, they are really using a prior that’s so bland that they don’t notice it’s there: a so-called “uninformative prior”.
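Here is a small Python sketch of information gain, assuming it means the standard relative entropy (also called Kullback-Leibler divergence) $\sum_i p_i \log(p_i/q_i)$, where $q$ is the prior and $p$ is the distribution you are told afterwards. The function names and the two example distributions are made up for illustration. The example also shows what happens with a bland prior: when $q$ is uniform on $n$ outcomes, the gain is just $\log n$ minus the ordinary entropy of $p$.

```python
import math

def information_gain(p, q):
    """Relative entropy sum_i p_i log(p_i / q_i), in nats."""
    gain = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                return math.inf  # the prior ruled out something that happens
            gain += p_i * math.log(p_i / q_i)
    return gain

def entropy(p):
    """Shannon entropy -sum_i p_i log p_i, in nats."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)

prior = [0.25, 0.25, 0.25, 0.25]       # a bland, uniform prior q
posterior = [0.5, 0.25, 0.125, 0.125]  # the distribution p you are told

gain = information_gain(posterior, prior)

print("gain             =", gain)
print("log 4 - entropy  =", math.log(4) - entropy(posterior))
print("gain from q to q =", information_gain(prior, prior))
```

The first two printed numbers agree, and learning nothing new (the prior again) gives a gain of zero.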
But I digress. To help you remember the story so far, let me repeat myself. Up to a bounded error, the Kolmogorov complexity of a message is the information gained when you start out only knowing the expected length of a program, and then learn which message the program prints out.
But this is just the beginning of the story. We’ve seen how Kolmogorov complexity is related to Gibbs ensembles. Now that we have Gibbs ensembles running around, we can go ahead and do thermodynamics! We can talk about quantities analogous to temperature, pressure, and so on… and all the usual thermodynamic relations hold! We can even take the ideas about steam engines and apply them to programs!
But for that, please read our paper. Here’s the abstract, which says what we really do:
Algorithmic entropy can be seen as a special case of entropy as studied in statistical mechanics. This viewpoint allows us to apply many techniques developed for use in thermodynamics to the subject of algorithmic information theory. In particular, suppose we fix a universal prefix-free Turing machine and let X be the set of programs that halt for this machine. Then we can regard X as a set of ‘microstates’, and treat any function on X as an ‘observable’. For any collection of observables, we can study the Gibbs ensemble that maximizes entropy subject to constraints on expected values of these observables. We illustrate this by taking the log runtime, length, and output of a program as observables analogous to the energy E, volume V and number of molecules N in a container of gas. The conjugate variables of these observables allow us to define quantities which we call the ‘algorithmic temperature’ T, ‘algorithmic pressure’ P and ‘algorithmic potential’ μ, since they are analogous to the temperature, pressure and chemical potential. We derive an analogue of the fundamental thermodynamic relation dE = T dS − P dV + μ dN, and use it to study thermodynamic cycles analogous to those for heat engines. We also investigate the values of T, P and μ for which the partition function converges. At some points on the boundary of this domain of convergence, the partition function becomes uncomputable. Indeed, at these points the partition function itself has nontrivial algorithmic entropy.