Here’s a charming, easily readable tale of Claude Shannon and how he came up with information theory:

• Erico Guizzo, *The Essential Message: Claude Shannon and the Making of Information Theory*.

I hadn’t known his PhD thesis was on genetics! His master’s thesis introduced Boolean logic to circuit design. And as a kid, he once set up a telegraph line to a friend’s house half a mile away.

So, he was perfectly placed to turn information into a mathematical quantity, deeply related to entropy, and prove some now-famous theorems about it.

These theorems set limits on how much information we can transmit through a noisy channel. More excitingly, they say we can cook up coding schemes that let us come *as close as we want* to this limit, with an arbitrarily low probability of error.

As Erico Guizzo points out, these results are fundamental to the ‘information age’ we live in today:

Can we transmit, say, a high-resolution picture over a telephone line? How long will that take? Is there a best way to do it?

Before Shannon, engineers had no clear answers to these questions. At that time, a wild zoo of technologies was in operation, each with a life of its own—telephone, telegraph, radio, television, radar, and a number of other systems developed during the war. Shannon came up with a unifying, general theory of communication. It didn’t matter whether you transmitted signals using a copper wire, an optical fiber, or a parabolic dish. It didn’t matter if you were transmitting text, voice, or images. Shannon envisioned communication in abstract, mathematical terms; he defined what the once fuzzy concept of “information” meant for communication engineers and proposed a precise way to quantify it. According to him, the information content of any kind of message could be measured in binary digits, or just

bits—a name suggested by a colleague at Bell Labs. Shannon took the bit as the fundamental unit in information theory. It was the first time that the term appeared in print.

So, I want to understand Shannon’s theorems and their proofs—especially because they clarify the relation between *information* and *entropy*, two concepts I’d like to be an expert on. It’s sort of embarrassing that I don’t already know this stuff! But I thought I’d post some preliminary remarks anyway, in case you too are trying to learn this stuff, or in case you can help me.

There are various different theorems I should learn. For example:

• The source coding theorem says it’s impossible to compress a stream of data to make the average number of bits per symbol in the compressed data less than the Shannon information of the source, without some of the data almost certainly getting lost. However, you can make the number of bits per symbols arbitrarily close to the Shannon entropy with a probability of error as small as you like.

• the noisy channel coding theorem is a generalization to data sent over a noisy channel.

The proof of the noisy channel coding theorem seems not so bad—there’s a sketch of a proof in the Wikipedia article on this theorem. But many theorems have a hard lemma at their heart, and for this one it’s a result in probability theory called the asymptotic equipartition property.

You should not try to *dodge* the hard lemma at the heart of the theorem you’re trying to understand: there’s a reason it’s there. So what’s the asymptotic equipartition property?

Here’s a somewhat watered-down statement that gets the basic idea across. Suppose you have a method of randomly generating letters—for example, a probability distribution on the set of letters. Suppose you randomly generate a string of letters, and compute times the logarithm of the probability that you got that string. Then as this number ‘almost surely’ approaches some number What’s this number ? It’s the *entropy* of the probability distribution you used to generate those letters!

(**Almost surely** is probability jargon for ‘with probability 100%’, which is not the same as ‘always’.)

This result is really cool—definitely worth understanding in its own right! It says that while many strings are possible, the ones you’re most likely to see lie in a certain ‘typical set’. The ‘typical’ strings are the ones where when you compute times the log of their probability, the result is close to How close? Well, you get to pick that.

The typical strings are not individually the most probable strings! But if you randomly generate a string, it’s very probable that it lies in the typical set. That sounds a bit paradoxical, but if you think about it, you’ll see it’s not. Think of repeatedly flipping a coin that has a 90% chance of landing heads up. The most probable single outcome is that it lands heads up every time. But the *typical* outcome is that it lands up close to 90% of the time. And, there are lots of ways this can happen. So, if you flip the coin a bunch of times, there’s a very high chance that the outcome is typical.

It’s easy to see how this result is the key to the noisy channel coding theorem. In general, the ‘typical set’ has few elements compared to the whole set of strings with letters. So, you can make short codes for the strings in this set, and compress your message that way, and this works almost all the time. How much you can compress your message depends on the entropy .

(I said ‘in general’ because there’s one exception: when every -letter string is equally probable, every string is in the typical set. In this very special case, no compression is possible.)

So, we’re seeing the link between information and entropy!

The actual coding schemes that people use are a lot trickier than the simple scheme I’m hinting at here. When you read about them, you see scary things like this:

But presumably they’re faster to implement, hence more practical.

The first coding schemes that come really close to the Shannon limit are the turbo codes. Surprisingly, these codes were developed only in 1993! They’re used in 3G mobile communications and deep space satellite communications.

One key trick is to use, not *one* decoder, but *two*. These two decoders keep communicating with each other and improving their guesses about the signal they’re received, until they agree:

This iterative process continues until the two decoders come up with the same hypothesis for the m-bit pattern of the payload, typically in 15 to 18 cycles. An analogy can be drawn between this process and that of solving cross-reference puzzles like crossword or sudoku. Consider a partially completed, possibly garbled crossword puzzle. Two puzzle solvers (decoders) are trying to solve it: one possessing only the “down” clues (parity bits), and the other possessing only the “across” clues. To start, both solvers guess the answers (hypotheses) to their own clues, noting down how confident they are in each letter (payload bit). Then, they compare notes, by exchanging answers and confidence ratings with each other, noticing where and how they differ. Based on this new knowledge, they both come up with updated answers and confidence ratings, repeating the whole process until they converge to the same solution.

This can be seen as “an instance of loopy belief propagation in Bayesian networks.”

By the way, the picture I showed you above is a flowchart of the decoding scheme for a simple turbo code. You can see the two decoders, and maybe the loop where data gets fed back to the decoders.

While I said this picture is “scary”, I actually like it because it’s an example of network theory applied to real-life problems.

What you’re looking at here is (AIUI) general codes where one is receives all the transmitted “symbols” but there is generally some garbling going on. Another model is the “erasure codes”, where it’s assumed you can “check” (with eg a checksum) each symbol that you receive, so you KNOW if it’s correct or not but you may not receive all the symbols. (Of course these are the ends of a spectrum, eg, erasure could be said to be the case where garbled symbols are “completely” garbled.) In both cases you want to reconstruct an original message from a “redundancy via coding” message that’s gone over the unreliable channel.

The interesting thing is that slightly different mathematics appears to apply best in the different cases, eg, consider “fountain codes” such as described in the last chapter of McKay’s book. There it considers using random graphs to encode, then reconstruct, a message from an arbitrary set of packets that’s just bigger than the minimal amount needed.It would be interesting to see what you could figure out about the representing a “set” of messages (eg, a set of viable genomes) via one of these kind of processes since it might be related to “useful” within species diversity. (It might be useful to have some different genes within a population, but for many — although not all — genetic ailments there is no benefit associated with it. So you want some set, but not all, variants to be “recoverable” from the population of animals in a given species.)

David wrote:

That’s right—that’s the model here.

Hmm! Now you’re making me wonder what’s the most general model of ‘garbled’, ‘partially erased’, or otherwise screwed-up data. Category-theoretic computer scientists and logicians have extremely general theories of data types, which I sort of understand, but I’ve never tried to imagine an extremely general theory of ‘screwed-up’ or ‘damaged’ data.

It seems information theorists are often content to think of data as a string of symbols drawn from a pre-established alphabet. Clearly anything discrete can be encoded in this way, but is this model sufficiently general for thinking about the ways data can be damaged?

It’s easy to imagine a string being ‘garbled’ with some symbols replaced by others, and it’s easy to imagine it being ‘partially deleted’. As you note, the latter can be conceived as a special case of the former if we allow a symbol ‘blank’ and think of deletion as replacing some other symbol by a blank. But are there other fundamentally different ways to damage data… or is there some theorem saying that all other ways can be reduced to ‘garbling’?

And when I say “information theorists are often content to think of data as a string of symbols drawn from a pre-established alphabet” maybe I’m being old-fashioned. Maybe they do more general things I don’t know about. For example, maybe I’m stuck back in the era where data was sent along a fixed ‘channel’, and now I should be thinking about data moving around in ‘packets’… so that one form of ‘garbling’ could be getting a packet that should have gone to someone else, or getting a packet really late, or something like that!

Examples chosen from biology could force us to think about information transmission in even more general ways. Just because you

canmodel data as a string of symbols in some alphabet, doesn’t mean youshould. There’s information in the cytoplasm of a human egg cell, which is important for the development of the fetus. But modeling it as a string of symbols seems awkward (while for the genome, it’s not quite so bad).John said:

What about network information theory?

Having a single communication channel is very simplistic: there is one sender and one receiver which share a channel. For our deeply connected world, this scenario is highly unrealistic! Nowadays, communication is organized in big networks formed by lots of nodes and connections between these. Such a network typically have many senders and many receivers. In such situations, capacity ceases to be a single number; rather, it becomes an entire region in the space of matrices , the entries of which stand for the amount of information transmitted from sender to receiver when using a certain coding scheme. It’s called the ‘rate region’.

The butterfly network is a beautiful example of how coding in a network can work — and why the internet, with its routing-based approach, is likely to be in a pretty mediocre part of the rate region.

I’m mentioning all this because it’s obviously related to Bayesian networks, and maybe even to other aspects of your work on network theory, I don’t know.

Thanks! More to learn and think about. I’d like to combine network theory with information theory, since green mathematics, if such a thing exists, should involve both.

DNA copying can also be thought of as data transmission. Besides point transcription and deletion errors, corresponding to those already described, chunks of DNA can get duplicated, or moved around. – ‘genes 1,2,3’ becoming ‘genes 1,2,2,3’ or ‘genes `1,3,2’. I don’t think there’s any equivalent for messages sent over serial channels.

Tangentially, the types of errors you see can indicate how the data is being stored. If ‘riddle’ is misspelt ‘ridlle’, that suggests an underlying representation as r-i-double-d-l-e, with a single error producing r-i-d-double-l-e.

In quantum error correction, people often consider two primary types of errors: bit flips, where and switch; and sign flips, where is unchanged but is multiplied by . These are conveniently represented by the Pauli and matrices. There are results saying that as long as you can correct adequately for these discrete sorts of errors, you can correct for all errors in some sense.

I wonder if there are classical analogues of these sorts of result.

Tobias, I agree, it’s really exciting to think about how all this stuff generalizes to topologically interesting situations. It’s entrancing to see linear algebra coming into play in the butterfly network.

Let’s keep talking about all this stuff.

By the way, Jamie, speaking of error correction…

Everyone please remember: on all WordPress blogs, LaTeX is done like this:

$latex E = mc^2$

with the word ‘latex’ directly following the first dollar sign, no space. Double dollar signs don’t work here.

Noteworthy, the asymptotic equipartition property (AEP) shows that the plug-in estimate of entropy converges to the true entropy in the large sampling limit.

This is interesting with respect to the post “Mathematics of biodiversity 7” to understand the whole concept of bias correction of entropy estimates in the case of a limited sample.

Consider the quantity

where is a sequence of letters occurring with probabilities . And let’s assume there are different letters..

In the case where the letters are i.i.d., then

Now, the same may appear many times, say , so this formula can be rewritten

so that:

When is large, then

QED.

That’s nice! Thanks!

I hope I have enough energy to say more about the asymptotic equipartition property.

[…] So I’m boning up on the subject by reading Khinchin’s great classic,

Mathematical Foundations of Information Theory. In the future I’ll share some findings here, but in the mean time, follow the link to Azimuth and read Baez’s great discussion of the Noisy Channel Coding Theorem. […]John,

My quick skim of your summary of the hard lemma left me wondering why it is a hard lemma*. Isn’t it just a baby step away from the fact that as n-> infinity, samples of size n look more and more like the underlying distribution from which they are drawn?

*That in itself is progress, since my quick skim of Wikipedia’s summary of the hard lemma left me almost clueless as to the content of the hard lemma (it struck me as very vague:-)

I didn’t mean to say the asymptotic equipartition property is extremely hard. However, the rest of the proof looks easy in comparison, so one is inclined to look at this part and say “yuck, that’s some technical fact I’d rather take on faith”. It seems like the pit in the peach. But I was trying to convince everyone that unlike the pit in the peach, it’s highly nutritious, and tasty in its own way.

I stated a watered-down version of the asymptotic equipartition theorem: just for purposes of exposition, I assumed that each letter in the string was drawn independently from the same probability distribution on letters. In other words, I was assuming that they’re ‘independent identically distributed’ random variables. This is clearly too restrictive—it sure ain’t true for English text!

The statement and proof gets a bit harder when we do the full-fledged thing: you can see a proof here for the i.i.d. case and here for the more general case. The more general case uses Lévy’s martingale convergence theorem, Markov’s inequality and Borel-Cantelli lemma. For some people that would be very scary, while others eat such stuff for breakfast.

But anyway, regardless of whether the proof is hard, the result seems both interesting and highly believable: not shocking.

Sounds like yummy breakfast. But no Cramer large deviation stuff? Last century I ran a seminar on Cramer’s theorem, iterated logarithm, arcsin law etc. Now suddenly it seems Shannon is yummy, too.

Thanks for the extra context John.

It is interesting that Weyl was Shannon’s adviser at IAS. The communication between those two must have come close to exceeding these limits on information!

Do you think you can expand on this : “The ‘typical set’ has few elements compared to the whole set of strings with n letters”…? If we start with the uniform distribution of a finite set A, then every string is in the typical set of A^n – they are equiprobable – so I guess the statement needs some qualification, or perhaps I am confused. Does this have something to do with the uniform distribution having the largest entropy?

I haven’t thought about this for about 5 years, but: when the probability distribution on is uniform, I believe you are right that the ‘typical set’ is all of . In this case no compression is possible. For every other probability distribution on , the typical set will become small compared to when .

Thanks for responding!! For other readers, the quantitative statement is that the number of of typical words (those words with $| latex |-1/n * log (p(x)) – H(X) | < \beta$ ) can be bounded by (this follows by direct computation from ). When $H(X) < log |X|$ (which is exactly the non-uniform distribution case) and $\latex \beta$ sufficiently small, $H(X) + \beta – log |X| < 0$, so the ratio of typical words to all words goes to zero as $N$ grows.