The idea of entropy arose in thermodynamics, and its connection to probability theory is fundamental to statistical mechanics. Its connection to information was developed later by Shannon. We now think of entropy and information as two sides of the same coin.
But there’s another concept, called algorithmic information, which was developed in work on logic and computer science. This concept is related to Shannon’s earlier notion of information, but it also seems rather different.
For example, Shannon’s ideas let us compute the information of a probability distribution on bit strings. So if we detect a radio signal from an extraterrestrial life form that sends us a string of a dozen 0’s and 1’s each day, and it seems the string is chosen randomly according to some probability distribution, we can calculate the information per message. But if I just show you a single bit string, like this:

110101011101

it makes no sense to ask what its information is. On the other hand, we can define its algorithmic information.
Nonetheless, my goal here is to show you that algorithmic information can really be seen as a special case of the old probabilistic concept of information. This lets us take all the familiar tools from statistical mechanics and thermodynamics, and apply them to algorithmic information theory. Mike Stay and I started to do this here:
• John Baez and Mike Stay, Algorithmic thermodynamics.
Unfortunately, I’ll only have time to sketch a bit of what we did!
Algorithmic information – first definition
Here’s a definition of algorithmic information. Later we’ll see a better one that’s almost equivalent.
Fix a programming language where a program is a finite bit string and its output, if it halts, is again a finite bit string. Then the algorithmic information of a finite bit string is the length of the shortest program that has that bit string as output.
So, for example, the algorithmic information of a string of a trillion 0’s is low, because you can write a short program that prints this out. On the other hand, the algorithmic information of a long “random” string of bits will be high, because the shortest program to print it out will be a program that says “print out this string: 01011100101100…” — so the program is approximately as long as the string itself: slightly longer, but only by a fixed amount.
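Algorithmic information itself is uncomputable, but any compressor gives an upper bound on it: the compressed data, together with a fixed decompressor, is one program that prints the string. Here is a rough sketch of that idea in Python using `zlib`; the sample strings and the conversion of 8 bits per byte are just illustrative, not part of the original definition.

```python
import random
import zlib

def compressed_length(bits: str) -> int:
    """Crude upper bound on algorithmic information: the length in
    bits of the zlib-compressed string (plus format overhead)."""
    return 8 * len(zlib.compress(bits.encode()))

regular = "0" * 10_000                      # highly patterned string
random.seed(0)                              # fixed seed, for reproducibility
random_bits = "".join(random.choice("01") for _ in range(10_000))

# The patterned string compresses far better than the "random" one.
print(compressed_length(regular), compressed_length(random_bits))
```

As expected, the string of 10,000 zeros compresses to a few hundred bits, while the pseudorandom string needs thousands.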
Of course you may ask what “random” means here. I haven’t defined it! But the point is, people have used these ideas to give a definition of what it means for a string of bits to be random. There are different ways to do it. For example: a bit string is ε-random if the shortest program that prints it out is at least ε times as long as the string itself.
You may also wonder: “Doesn’t the definition of algorithmic information depend on the details of our programming language?” And the answer is, “Yes, but not very much.” More precisely, for any two universal programming languages, there’s a constant $c$ such that for any finite bit string, the algorithmic information of that string computed in the two languages differs by at most $c$.
Let’s see how algorithmic information compares to the usual concept of information introduced by Shannon. The usual concept works like this:
If an event of probability $p$ occurs, we define the information gained by learning this event has occurred to be $-\log p$. We take the logarithm because probabilities of independent events multiply, and we want information to add. The minus sign makes the information positive. We can use any base we want for the logarithm: physicists like $e$, while computer scientists favor 2.
If there’s a set of possible outcomes, where the $i$th outcome occurs with probability $p_i$, then the average or expected amount of information we gain by learning the outcome is

$$-\sum_i p_i \log p_i$$

We call this the entropy of the probability distribution $p$.
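The entropy formula above is short enough to compute directly. Here is a minimal Python version, with the base-2 logarithm so the answer comes out in bits; the coin examples are just illustrations:

```python
from math import log2

def entropy(probs):
    """Shannon entropy: -sum_i p_i * log2(p_i), measured in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin: exactly 1 bit per flip
print(entropy([0.9, 0.1]))   # a biased coin: about 0.47 bits
```

The `if p > 0` guard uses the standard convention that an outcome of probability zero contributes nothing to the sum.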
Now, ordinary information doesn’t look very much like algorithmic information! But there’s a second definition of algorithmic information, almost equivalent to the first one, that makes the relation clearer.
Algorithmic information – second definition
First, a minor shift in viewpoint. Before we defined algorithmic information for a finite bit string. Now let’s define it for a natural number. This change in viewpoint is no big deal, since we can set up a one-to-one correspondence between natural numbers and finite bit strings that’s easy to compute.
We’ll still think of our programs as finite bit strings. Not every such string gives a program that halts. We may also demand that bit strings have special features to count as programs. For example, we may demand that programs end in the string 11111, just as a Fortran program must end with the word END.
Let $X$ be the set of programs that halt. Without loss of generality, we can assume that $X$ is prefix-free. This means that if $x \in X$, no other bit string starting with the string $x$ is also in $X$. For example, if the string 11111 marks the end of a program, and is only allowed to show up at the end of the program, $X$ will be prefix-free.
Let $\ell(x)$ be the length of the program $x$. It’s easy to see that if $X$ is prefix-free,

$$\sum_{x \in X} 2^{-\ell(x)} \le 1$$
Making this sum finite is the reason we want the prefix-free condition — you’ll see exactly why in a minute.
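You can check both the prefix-free condition and this inequality numerically. Here is a small sketch with a hypothetical set of four code words, chosen so the bound is attained exactly:

```python
def is_prefix_free(codes):
    """True if no code word is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

def kraft_sum(codes):
    """The sum of 2^(-length) over all code words."""
    return sum(2.0 ** -len(c) for c in codes)

codes = ["0", "10", "110", "111"]       # a small prefix-free set
print(is_prefix_free(codes))            # True
print(kraft_sum(codes))                 # 1.0, so the bound is attained
```

By contrast, a set like `["0", "01"]` is not prefix-free, and large non-prefix-free sets can push the sum above 1.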
Let $f(x)$ be the output of the program $x$. Remember, we now think of the output as a natural number. So, we have functions

$$\ell \colon X \to \mathbb{N}, \qquad f \colon X \to \mathbb{N}$$
We define the algorithmic information of a number $n$ to be

$$S(n) = -\log \sum_{x \,:\, f(x) = n} 2^{-\ell(x)}$$

So: we do a sum over all programs $x$ having the number $n$ as output. We sum up $2^{-\ell(x)}$, where $\ell(x)$ is the length of the program $x$. The sum converges because

$$\sum_{x \in X} 2^{-\ell(x)} \le 1$$
That’s where the prefix-free condition comes into play.
Finally, we take minus the logarithm of this sum.
This seems a bit complicated! But suppose there’s just one program $x$ having the number $n$ as output. Then the sum consists of one term, the logarithm cancels out the exponential, and the minus signs cancel too, so we get

$$S(n) = \ell(x)$$
This matches our first definition: the algorithmic information of a number is the length of the shortest program having that number as output.
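Here is a minimal sketch of the second definition in Python, using a hypothetical four-program “language” (the table of programs and outputs is invented purely for illustration):

```python
from math import log2

# A hypothetical toy halting set: prefix-free programs and their outputs.
halting = {"0": 5, "10": 7, "110": 5, "111": 2}

def algorithmic_information(n):
    """Second definition: minus log2 of the total weight 2^(-length)
    of all programs whose output is n."""
    total = sum(2.0 ** -len(x) for x, out in halting.items() if out == n)
    return -log2(total)

print(algorithmic_information(2))   # 3.0: one program outputs 2,
                                    # so we get exactly its length
print(algorithmic_information(5))   # about 0.68: two programs output 5,
                                    # so less than the shortest length, 1
```

When only one program has the given output, the answer is exactly that program’s length, matching the calculation above; with several programs, the extra terms make the answer smaller.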
Of course in this case the shortest program is the only program having that output. If we have more than one program with $n$ as output, the second definition of algorithmic entropy gives a smaller answer than the first definition:

$$S(n) \le \min \{ \ell(x) : f(x) = n \}$$
However, there’s a famous theorem, proved by Leonid Levin in 1974, that says the new definition is not very different from the old one. The difference is bounded by a constant. More precisely: if we write $K(n)$ for the algorithmic information of $n$ given by the first definition, then for any universal prefix-free programming language, there’s a constant $c$ such that for every $n$,

$$|K(n) - S(n)| \le c$$
Since the algorithmic information depends on our programming language, it was only well-defined within an additive constant in the first place. So, switching from the first definition to the second one doesn’t significantly change the concept. But, it makes the relation to ordinary information a lot easier to see!
From now on we’ll use the second definition.
Algorithmic information versus ordinary information
To relate algorithmic information to ordinary information, we need to get logarithms and probabilities into the game. We see a logarithm here:

$$S(n) = -\log \sum_{x \,:\, f(x) = n} 2^{-\ell(x)}$$

but it’s in a funny place, outside the sum — and where are the probabilities?
To solve both these puzzles, let’s define a probability distribution on the set of natural numbers by

$$p(n) = \frac{1}{Z} \sum_{x \,:\, f(x) = n} 2^{-\ell(x)}$$

Here $Z$ is a normalizing constant, to make the probabilities sum to 1. Taking

$$Z = \sum_{x \in X} 2^{-\ell(x)}$$

does the job. Now, notice that

$$-\log p(n) = S(n) + \log Z$$
So, up to an additive constant, the algorithmic entropy of the number $n$ is $-\log p(n)$. But $-\log p(n)$ is just the information gained upon learning the number $n$, if we started out knowing only that $n$ was chosen randomly according to the probability distribution $p$.
What’s the meaning of the probability distribution $p$? Simple: $p(n)$ is the probability that the number $n$ is the output of a randomly chosen program… where ‘randomly chosen’ means chosen according to a certain specific rule. The rule is that increasing the length of a program by 1 makes it half as probable. In other words, the probability of the program $x$ is

$$\frac{1}{Z} \, 2^{-\ell(x)}$$

where $Z$ is the normalization factor chosen to make these probabilities sum to 1.
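With the same hypothetical four-program toy language as before, this output distribution is a one-liner to compute (the halting set here is invented for illustration; for these particular programs the normalizing constant happens to be 1):

```python
# The chance that a randomly chosen program outputs n, where programs
# are weighted so each extra bit of length halves the probability.
halting = {"0": 5, "10": 7, "110": 5, "111": 2}   # hypothetical toy set

Z = sum(2.0 ** -len(x) for x in halting)          # normalizing constant

def p(n):
    return sum(2.0 ** -len(x) for x, out in halting.items() if out == n) / Z

print(p(5), p(7), p(2))   # 0.625, 0.25, 0.125, summing to 1
```

Note that $p(5)$ is the largest: two different programs output 5, so that output is the most likely one to see.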
So, the relation between algorithmic information and ordinary information is this:
The algorithmic information of a number is the information gained by learning that number, if beforehand we only knew that it was the output of a randomly chosen program.
Why should the probability of the program $x$ be defined as

$$\frac{1}{Z} \, 2^{-\ell(x)} \, ?$$
There is no reason we need to use this probability distribution. However, there’s something good about it. It’s an example of a Gibbs state, meaning a probability distribution that maximizes entropy subject to a constraint on the expected value of some observable. Nature likes to maximize entropy, so Gibbs states are fundamental to statistical mechanics. So is the quantity $Z$: it’s called the partition function.
The idea works like this: suppose we want to maximize the entropy of a probability distribution on the set of programs, subject to a constraint on the expected value of the length of the program. Then the answer is

$$p(x) = \frac{1}{Z} \, e^{-\beta \ell(x)}$$

where $\beta$ is some number, and the partition function $Z$ ensures that the probabilities sum to 1:

$$Z = \sum_{x \in X} e^{-\beta \ell(x)}$$

So, $p$ equals our previous probability distribution when $\beta = \ln 2$.
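A quick numerical sanity check of the entropy-maximizing property, using four hypothetical program lengths: at $\beta = \ln 2$ the Gibbs weights are exactly the $2^{-\ell}/Z$ weights from before, and nudging the distribution in any direction that preserves both normalization and the expected length can only lower the entropy.

```python
from math import exp, log, log2

lengths = [1, 2, 3, 3]                     # lengths of four toy programs
beta = log(2)                              # beta = ln 2
weights = [exp(-beta * l) for l in lengths]
Z = sum(weights)
gibbs = [w / Z for w in weights]           # same as 2^(-length) / Z

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Perturb in a direction that keeps the total probability and the
# expected length fixed; the entropy should only decrease.
direction = [1, -2, 1, 0]                  # sums to 0; dot with lengths is 0
perturbed = [p + 0.01 * d for p, d in zip(gibbs, direction)]

print(entropy(gibbs), entropy(perturbed))  # the Gibbs state wins
```

This is just a spot check in one perturbation direction, of course, not a proof; the general statement follows from the usual Lagrange-multiplier argument.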
However, it is also interesting to consider other values of $\beta$, and in fact Tadaki has already done so:
• K. Tadaki, A generalization of Chaitin’s halting probability and halting self-similar sets, Hokkaido Math. J. 31 (2002), 219–253.
The partition function converges for $\beta \ge \ln 2$ but diverges otherwise.
More generally, we can look for the probability distribution that maximizes entropy subject to constraints on the expected value of several observables. For example, let:
• $E(x)$ = the logarithm of the runtime of the program $x$
• $V(x)$ = the length of the program $x$
• $N(x)$ = the output of the program $x$
Then the probability distribution that maximizes entropy subject to constraints on the expected values of these three quantities is

$$p(x) = \frac{1}{Z} \, e^{-(\beta E(x) + \gamma V(x) + \delta N(x))}$$

where now the partition function is

$$Z = \sum_{x \in X} e^{-(\beta E(x) + \gamma V(x) + \delta N(x))}$$
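In code, this three-observable Gibbs state is just as easy to compute as the one-observable version. The $(E, V, N)$ triples and the values of $\beta$, $\gamma$, $\delta$ below are all invented for illustration:

```python
from math import exp

# Hypothetical toy data: (log runtime E, length V, output N) for a
# few programs, with arbitrary values of the conjugate variables.
programs = [(1.0, 1, 5), (2.0, 2, 7), (2.5, 3, 5), (3.0, 3, 2)]
beta, gamma, delta = 0.5, 0.7, 0.1

weights = [exp(-(beta * E + gamma * V + delta * N)) for E, V, N in programs]
Z = sum(weights)                           # the partition function
p = [w / Z for w in weights]               # the Gibbs state

print(Z, p)
```

The shortest, fastest program gets the most weight, since its exponent is the smallest in absolute value.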
We’ve chosen our notation to remind experts on statistical mechanics that they’ve seen formulas like this before. The exact same formulas describe a piston full of gas in thermal equilibrium! From a formal, purely mathematical viewpoint:
• $E$ is analogous to the internal energy of the gas, and $\beta$ is analogous to $1/T$, where $T$ is the temperature in units where Boltzmann’s constant is 1.
• $V$ is analogous to the volume of the gas, and $\gamma$ is analogous to $P/T$, where $P$ is the pressure.
• $N$ is analogous to the number of molecules in the gas, and $\delta$ is analogous to $-\mu/T$, where $\mu$ is the chemical potential.
The analogy here is quite arbitrary. I’m not saying that the length of a program is profoundly similar to the volume of a cylinder of gas; we could have chosen a different letter to stand for the length of a program and everything would still work.
But the analogy works. In other words: now we can take a lot of familiar facts about thermodynamics and instantly translate them into analogous facts about algorithmic information theory! For example, define the entropy of the probability distribution $p$ in the usual way:

$$S = -\sum_{x \in X} p(x) \, \ln p(x)$$

We can think of this as a function of either $\beta, \gamma, \delta$ or the expected values of $E$, $V$ and $N$. In thermodynamics we learn lots of equations like this:

$$dS = \frac{1}{T} \, dE + \frac{P}{T} \, dV - \frac{\mu}{T} \, dN$$
where by standard abuse of notation we’re using $E$, $V$ and $N$ to stand for their expected values. And, all these equations remain true in algorithmic thermodynamics!
The interesting part is figuring out what all these equations really mean in the context of algorithmic thermodynamics. For a start on that, try our paper.
For example, we call $T$ the algorithmic temperature. If you allow programs to run longer, more of them will halt and give an answer. Thanks to the equation

$$\frac{\partial S}{\partial E} = \frac{1}{T}$$

(with $V$ and $N$ held fixed), the algorithmic temperature is roughly the number of times you have to double the runtime in order to double the number of programs that satisfy the constraints on length and output.
We also consider the algorithmic analogues of thermodynamic cycles, familiar from old work on steam engines. Charles Babbage described a computer powered by a steam engine; we describe a heat engine powered by programs! The significance of this line of thinking remains fairly mysterious. However, I’m hoping that it will help point the way toward a further synthesis of algorithmic information theory and thermodynamics.