Here’s a charming, easily readable tale of Claude Shannon and how he came up with information theory:
I hadn’t known his PhD thesis was on genetics! His master’s thesis introduced Boolean logic to circuit design. And as a kid, he once set up a telegraph line to a friend’s house half a mile away.
So, he was perfectly placed to turn information into a mathematical quantity, deeply related to entropy, and prove some now-famous theorems about it.
These theorems set limits on how much information we can transmit through a noisy channel. More excitingly, they say we can cook up coding schemes that let us come as close as we want to this limit, with an arbitrarily low probability of error.
As Erico Guizzo points out, these results are fundamental to the ‘information age’ we live in today:
Can we transmit, say, a high-resolution picture over a telephone line? How long will that take? Is there a best way to do it?
Before Shannon, engineers had no clear answers to these questions. At that time, a wild zoo of technologies was in operation, each with a life of its own—telephone, telegraph, radio, television, radar, and a number of other systems developed during the war. Shannon came up with a unifying, general theory of communication. It didn’t matter whether you transmitted signals using a copper wire, an optical fiber, or a parabolic dish. It didn’t matter if you were transmitting text, voice, or images. Shannon envisioned communication in abstract, mathematical terms; he defined what the once fuzzy concept of “information” meant for communication engineers and proposed a precise way to quantify it. According to him, the information content of any kind of message could be measured in binary digits, or just bits—a name suggested by a colleague at Bell Labs. Shannon took the bit as the fundamental unit in information theory. It was the first time that the term appeared in print.
So, I want to understand Shannon’s theorems and their proofs—especially because they clarify the relation between information and entropy, two concepts I’d like to be an expert on. It’s sort of embarrassing that I don’t already know this stuff! But I thought I’d post some preliminary remarks anyway, in case you’re trying to learn it too, or in case you can help me.
There are various different theorems I should learn. For example:
• The source coding theorem says it’s impossible to compress a stream of data to make the average number of bits per symbol in the compressed data less than the Shannon entropy of the source, without some of the data almost certainly getting lost. However, you can make the number of bits per symbol arbitrarily close to the Shannon entropy with a probability of error as small as you like.
• The noisy channel coding theorem is a generalization to data sent over a noisy channel.
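To make the source coding theorem concrete, here’s a small Python sketch—using a four-letter source I made up for illustration—that compares the entropy of a source with the average code length of a Huffman code for it. The theorem says the average length can’t drop below the entropy; Huffman coding always gets within one bit of it, and for this particular (dyadic) source it hits the entropy exactly.

```python
import heapq
import math

def entropy(probs):
    """Shannon entropy of a distribution, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def huffman_lengths(probs):
    """Code-word length (in bits) of each symbol under a Huffman code."""
    # Heap entries carry a tiebreaker so probabilities compare first.
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol inside
        # them sinks one level deeper, i.e. gains one code bit.
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

source = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
H = entropy(source)
lengths = huffman_lengths(source)
L = sum(source[s] * lengths[s] for s in source)
print(f"entropy = {H:.3f} bits/symbol")
print(f"Huffman = {L:.3f} bits/symbol")   # always H <= L < H + 1
```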
The proof of the noisy channel coding theorem seems not so bad—there’s a sketch of a proof in the Wikipedia article on this theorem. But many theorems have a hard lemma at their heart, and for this one it’s a result in probability theory called the asymptotic equipartition property.
You should not try to dodge the hard lemma at the heart of the theorem you’re trying to understand: there’s a reason it’s there. So what’s the asymptotic equipartition property?
Here’s a somewhat watered-down statement that gets the basic idea across. Suppose you have a method of randomly generating letters—for example, a probability distribution on the set of letters. Suppose you randomly generate a string of N letters, and compute −1/N times the logarithm of the probability that you got that string. Then as N → ∞, this number ‘almost surely’ approaches some number S. What’s this number S? It’s the entropy of the probability distribution you used to generate those letters!
(‘Almost surely’ is probability jargon for ‘with probability 100%’, which is not the same as ‘always’.)
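Here’s a quick numerical sketch of this watered-down asymptotic equipartition property, using a three-letter source I made up: as the string length N grows, −1/N times the log probability of a randomly generated string settles down near the entropy.

```python
import math
import random

random.seed(0)

# A made-up source: three letters with these probabilities.
letters = ['a', 'b', 'c']
probs   = [0.5, 0.3, 0.2]
entropy = -sum(p * math.log2(p) for p in probs)   # about 1.485 bits

def minus_log_prob_per_letter(n):
    """Generate a random n-letter string; return -(1/n) log2 of its probability."""
    s = random.choices(letters, weights=probs, k=n)
    logp = sum(math.log2(probs[letters.index(c)]) for c in s)
    return -logp / n

for n in (10, 100, 10000):
    print(n, round(minus_log_prob_per_letter(n), 4), "  entropy =", round(entropy, 4))
```

For small N the number bounces around; by N = 10000 it should sit very close to the entropy, which is exactly the ‘almost sure’ convergence the lemma asserts.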
This result is really cool—definitely worth understanding in its own right! It says that while many strings are possible, the ones you’re most likely to see lie in a certain ‘typical set’. The ‘typical’ strings are the ones where, when you compute −1/N times the log of their probability, the result is close to S. How close? Well, you get to pick that.
The typical strings are not individually the most probable strings! But if you randomly generate a string, it’s very probable that it lies in the typical set. That sounds a bit paradoxical, but if you think about it, you’ll see it’s not. Think of repeatedly flipping a coin that has a 90% chance of landing heads up. The most probable single outcome is that it lands heads up every time. But the typical outcome is that it lands up close to 90% of the time. And, there are lots of ways this can happen. So, if you flip the coin a bunch of times, there’s a very high chance that the outcome is typical.
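The coin example is easy to check numerically. This sketch computes exact binomial probabilities for 100 flips of the 90%-heads coin: the single most probable outcome, all heads, is very unlikely, while landing in the typical range of roughly 85 to 95 heads is very likely.

```python
from math import comb

n, p = 100, 0.9   # 100 flips of a coin with a 90% chance of heads

def prob_k_heads(k):
    """Exact probability of getting k heads in n flips."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_all_heads = prob_k_heads(100)                            # most probable single outcome
p_typical   = sum(prob_k_heads(k) for k in range(85, 96))  # 85 to 95 heads

print(f"P(all 100 heads) = {p_all_heads:.2e}")   # tiny
print(f"P(85..95 heads)  = {p_typical:.3f}")     # close to 1
```

So no single typical outcome beats all-heads, but collectively the typical outcomes dwarf it—exactly the resolution of the apparent paradox.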
It’s easy to see how this result is the key to the source coding theorem. In general, the ‘typical set’ has few elements compared to the whole set of strings with N letters. So, you can make short codes for the strings in this set, and compress your message that way, and this works almost all the time. How much you can compress your message depends on the entropy S.
(I said ‘in general’ because there’s one exception: when every N-letter string is equally probable, every string is in the typical set. In this very special case, no compression is possible.)
So, we’re seeing the link between information and entropy!
The actual coding schemes that people use are a lot trickier than the simple scheme I’m hinting at here. When you read about them, you see scary things like this:

[Figure: flowchart of the decoding scheme for a simple turbo code, showing two decoders in a feedback loop.]
But presumably they’re faster to implement, hence more practical.
The first coding schemes that came really close to the Shannon limit were the turbo codes. Surprisingly, these codes were developed only in 1993! They’re used in 3G mobile communications and deep space satellite communications.
One key trick is to use, not one decoder, but two. These two decoders keep communicating with each other and improving their guesses about the signal they’ve received, until they agree:
This iterative process continues until the two decoders come up with the same hypothesis for the m-bit pattern of the payload, typically in 15 to 18 cycles. An analogy can be drawn between this process and that of solving cross-reference puzzles like crossword or sudoku. Consider a partially completed, possibly garbled crossword puzzle. Two puzzle solvers (decoders) are trying to solve it: one possessing only the “down” clues (parity bits), and the other possessing only the “across” clues. To start, both solvers guess the answers (hypotheses) to their own clues, noting down how confident they are in each letter (payload bit). Then, they compare notes, by exchanging answers and confidence ratings with each other, noticing where and how they differ. Based on this new knowledge, they both come up with updated answers and confidence ratings, repeating the whole process until they converge to the same solution.
This can be seen as “an instance of loopy belief propagation in Bayesian networks.”
By the way, the picture I showed you above is a flowchart of the decoding scheme for a simple turbo code. You can see the two decoders, and maybe the loop where data gets fed back to the decoders.
While I said this picture is “scary”, I actually like it because it’s an example of network theory applied to real-life problems.