How ignorant are you?
Do you know?
Do you know how much you don’t know?
It seems hard to accurately estimate your lack of knowledge. It even seems hard to say precisely how hard it is. But the cool thing is, we can actually extract an interesting math question from this problem. And one answer to this question leads to the following conclusion:
There’s no unbiased way to estimate how ignorant you are.
But the devil is in the details. So let’s see the details!
The Shannon entropy of a probability distribution is a way of measuring how ignorant we are when this probability distribution describes our knowledge.
For example, suppose all we care about is whether this ancient Roman coin will land heads up or tails up:
If we know there’s a 50% chance of it landing heads up, that’s a Shannon entropy of 1 bit: we’re missing one bit of information.
But suppose for some reason we know for sure it’s going to land heads up. For example, suppose we know the guy on this coin is the emperor Pupienus Maximus, an egomaniac who had lead put on the back of all coins bearing his likeness, so his face would never hit the dirt! Then the Shannon entropy is 0: we know what’s going to happen when we toss this coin.
Or suppose we know there’s a 90% chance it will land heads up, and a 10% chance it lands tails up. Then the Shannon entropy is somewhere in between. We can calculate it like this:

$$-0.9 \log_2 0.9 \;-\; 0.1 \log_2 0.1 \;\approx\; 0.47$$

so that’s how many bits of information we’re missing.
But now suppose we have no idea. Suppose we just start flipping the coin over and over, and seeing what happens. Can we estimate the Shannon entropy?
Here’s a naive way to do it. First, use your experimental data to estimate the probability that the coin lands heads-up. Then, stick that probability into the formula for Shannon entropy. For example, say we flip the coin 3 times and it lands heads-up once. Then we can estimate the probability of it landing heads-up as 1/3, and tails-up as 2/3. So we can estimate that the Shannon entropy is

$$-\tfrac{1}{3} \log_2 \tfrac{1}{3} \;-\; \tfrac{2}{3} \log_2 \tfrac{2}{3} \;\approx\; 0.92$$
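If you like seeing recipes as code, here is a minimal sketch of this plug-in recipe in Python; the function name `naive_entropy` and the example data are just my own illustration:

```python
from collections import Counter
from math import log2

def naive_entropy(samples):
    """Plug-in estimate of Shannon entropy (in bits): estimate each probability
    by its observed frequency, then apply the entropy formula to those frequencies."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())

# The 3-flip example from the text: one head, two tails.
print(naive_entropy(['H', 'T', 'T']))   # about 0.918 bits
```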
But it turns out that this approach systematically underestimates the Shannon entropy!
Say we have a coin that lands heads up a certain fraction of the time, say $p$. And say we play this game: we flip our coin $n$ times, see what we get, and estimate the Shannon entropy using the simple recipe I just illustrated.

Of course, our estimate will depend on the luck of the game. But on average, it will be less than the actual Shannon entropy, which is

$$-p \log_2 p \;-\; (1-p) \log_2 (1-p)$$
We can prove this mathematically. But it shouldn’t be surprising. After all, if $n = 1$, we’re playing a game where we flip the coin just once. And with this game, our naive estimate of the Shannon entropy will always be zero! Each time we play the game, the coin will either land heads up 100% of the time, or tails up 100% of the time!

If we play the game with more coin flips, the error gets less severe. In fact it approaches zero as the number of coin flips gets ever larger, so that the bias vanishes in the limit $n \to \infty$.
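Here is a small simulation sketch of this game in Python, with the plug-in estimator repeated so the snippet stands alone; the heads probability 0.3 and the number of games are my own choices, just for illustration:

```python
import random
from collections import Counter
from math import log2

def naive_entropy(samples):
    """Plug-in entropy estimate (bits) from observed frequencies."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())

def true_entropy(p):
    """Shannon entropy (bits) of a coin that lands heads up with probability p."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def average_naive_estimate(p, n, games=20000):
    """Average, over many games, of the naive estimate from n flips."""
    total = 0.0
    for _ in range(games):
        flips = ['H' if random.random() < p else 'T' for _ in range(n)]
        total += naive_entropy(flips)
    return total / games

p = 0.3                                      # arbitrary choice for illustration
print("true entropy:", true_entropy(p))      # about 0.881 bits
for n in (1, 2, 5, 10, 100):
    print(n, average_naive_estimate(p, n))   # below the true value, creeping up as n grows
```

The printed averages sit below the true entropy and approach it as $n$ grows, illustrating both the bias and its disappearance in the limit.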
The case where you flip the coin just once is an extreme case—but extreme cases can be good to think about, because they can indicate what may happen in less extreme cases.
One moral here is that naively generalizing on the basis of limited data can make you feel more sure you know what’s going on than you actually are.
I hope you knew that already!
But we can also say, in a more technical way, that the naive way of estimating Shannon entropy is a biased estimator: the average value of the estimator is different from the value of the quantity being estimated.
Here’s an example of an unbiased estimator. Say you’re trying to estimate the probability that the coin will land heads up. You flip it $n$ times and see that it lands heads up $k$ times. You estimate that the probability is $k/n$. That’s the obvious thing to do, and it turns out to be unbiased.
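To see why it’s unbiased: if the true probability of heads is $p$, the number of heads $k$ in $n$ flips has expected value $np$, so by linearity of expectation

$$E\left[\frac{k}{n}\right] \;=\; \frac{E[k]}{n} \;=\; \frac{np}{n} \;=\; p$$

which is exactly the quantity we were trying to estimate.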
Statisticians like to think about estimators, and being unbiased is one way an estimator can be ‘good’. Beware: it’s not the only way! There are estimators that are unbiased, but whose standard deviation is so huge that they’re almost useless. It can be better to have an estimator whose typical estimates are closer to the true value, even though on average they come out a bit too low. So sometimes, a biased estimator can be more useful than an unbiased one.
Nonetheless, my ears perked up when Lou Jost mentioned that there is no unbiased estimator for Shannon entropy. In rough terms, the moral is that:
There’s no unbiased way to estimate how ignorant you are.
I think this is important. For example, it’s important because Shannon entropy is also used as a measure of biodiversity. Instead of flipping a coin repeatedly and seeing which side lands up, now we go out and collect plants or animals, and see which species we find. The relative abundance of different species defines a probability distribution on the set of species. In this language, the moral is:
There’s no unbiased way to estimate biodiversity.
But of course, this doesn’t mean we should give up. We may just have to settle for an estimator that’s a bit biased! And people have spent a bunch of time looking for estimators that are less biased than the naive one I just described.
By the way, equating ‘biodiversity’ with ‘Shannon entropy’ is sloppy: there are many measures of biodiversity. The Shannon entropy is just a special case of the Rényi entropy, which depends on a parameter $q$: we get Shannon entropy in the limit $q \to 1$. As $q$ gets smaller, the Rényi entropy gets more and more sensitive to rare species—or shifting back to the language of probability theory, rare events. It’s the rare events that make Shannon entropy hard to estimate, so I imagine there should be theorems about estimators for Rényi entropy, which say it gets harder to estimate as $q$ gets smaller. Do you know such theorems?
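For reference, here is the standard definition of the Rényi entropy of order $q$ (for $q \neq 1$), writing $p_i$ for the probabilities, or relative abundances of species; the Shannon entropy is recovered in the limit $q \to 1$:

$$H_q \;=\; \frac{1}{1-q} \, \log \sum_i p_i^{\,q}$$

For $q < 1$ the sum gives relatively more weight to small probabilities, which is one way to see why smaller $q$ is more sensitive to rare species.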
Also, I should add that biodiversity is better captured by the ‘Hill numbers’, which are functions of the Rényi entropy, than by the Rényi entropy itself. (See here for the formulas.) Since these functions are nonlinear, the lack of an unbiased estimator for Rényi entropy doesn’t instantly imply the same for the Hill numbers. So there are also some obvious questions about unbiased estimators for Hill numbers. Do you know answers to those?
Here are some papers on estimators for entropy. Most of these focus on estimating the Shannon entropy of a probability distribution on a finite set.
This old classic has a proof that the ‘naive’ estimator of Shannon entropy is biased, and estimates on the bias:
• Bernard Harris, The statistical estimation of entropy in the non-parametric case, Army Research Office, 1975.
He shows the bias goes to zero as we increase the number of samples: the number I was calling $n$ in my coin flip example. In fact he shows the bias goes to zero like $O(1/n)$. This is big O notation, which means that as $n \to \infty$ the bias is bounded by some constant times $1/n$. This constant depends on the size of our finite set—or, if you want to do better, the class number, which is the number of elements on which our probability distribution is nonzero.
Using this idea, he shows that you can find a less biased estimator if you have a probability distribution $p$ on a finite set and you know that exactly $m$ of these probabilities are nonzero. To do this, just take the ‘naive’ estimator I described earlier and add

$$\frac{m-1}{2n}$$

This is called the Miller–Madow bias correction. The bias of this improved estimator goes to zero like $O(1/n^2)$.
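Here is a minimal sketch of this corrected estimator in Python, assuming the class number $m$ is handed to us; the function names are mine, and I use natural logarithms because the correction $(m-1)/(2n)$ is for entropy measured in nats (with base-2 logarithms the correction would be divided by $\ln 2$):

```python
from collections import Counter
from math import log

def naive_entropy_nats(samples):
    """Plug-in entropy estimate in nats, from observed frequencies."""
    n = len(samples)
    return -sum((k / n) * log(k / n) for k in Counter(samples).values())

def miller_madow_entropy(samples, m):
    """Miller-Madow corrected estimate: the naive estimate plus (m - 1) / (2n),
    where m is the number of outcomes with nonzero probability (assumed known)
    and n is the number of samples."""
    n = len(samples)
    return naive_entropy_nats(samples) + (m - 1) / (2 * n)

# Example: a sample from a die we are told has exactly 3 possible outcomes.
print(miller_madow_entropy([1, 1, 2, 3, 3, 3, 2, 1, 1, 2], m=3))
```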
The problem is that in practice you don’t know ahead of time how many probabilities are nonzero! In applications to biodiversity this would amount to knowing ahead of time how many species exist, before you go out looking for them.
But what about the theorem that there’s no unbiased estimator for Shannon entropy? The best reference I’ve found is this:
• Liam Paninski, Estimation of entropy and mutual information, Neural Computation 15 (2003) 1191-1254.
In Proposition 8 of Appendix A, Paninski gives a quick proof that there is no unbiased estimator of Shannon entropy for probability distributions on a finite set. But his paper goes far beyond this. Indeed, it seems like a pretty definitive modern discussion of the whole subject of estimating entropy. Interestingly, this subject is dominated by neurobiologists studying entropy of signals in the brain! So, lots of his examples involve brain signals.
Another overview, with tons of references, is this:
• J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy estimation: an overview.
This paper focuses on the situation where we don’t know ahead of time how many probabilities are nonzero:
• Anne Chao and T.-J. Shen, Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample, Environmental and Ecological Statistics 10 (2003), 429–443.
In 2003 there was a conference on the problem of estimating entropy, whose webpage has useful information. As you can see, it was dominated by neurobiologists:
• Estimation of entropy and information of undersampled probability distributions: theory, algorithms, and applications to the neural code, Whistler, British Columbia, Canada, 12 December 2003.
By the way, I was very confused for a while, because these guys claim to have found an unbiased estimator of Shannon entropy:
• Stephen Montgomery-Smith and Thomas Schürmann, Unbiased estimators for entropy and class number.
However, their way of estimating entropy has a funny property: in the language of biodiversity, it’s only well-defined if our samples include at least one organism of each species. So, we cannot compute this estimate for an arbitrary list of $n$ samples. This means it’s not an estimator in the usual sense—the sense that Paninski is using! So it doesn’t really contradict Paninski’s result.
To wrap up, let me state Paninski’s result in a mathematically precise way. Suppose $p$ is a probability distribution on a finite set $X$. Suppose $F(p)$ is any number we can compute from $p$: that is, any real-valued function on the set of probability distributions. We’ll be interested in the case where $F(p)$ is the Shannon entropy:

$$F(p) = -\sum_{x \in X} p(x) \log p(x)$$

Here we can use whatever base for the logarithm we like: earlier I was using base 2, but that’s not sacred. Define an estimator to be any function

$$f \colon X^n \to \mathbb{R}$$

The idea is that given $n$ samples from the set $X$, meaning points $x_1, \dots, x_n \in X$, the estimator gives a number $f(x_1, \dots, x_n)$. This number is supposed to estimate some feature of the probability distribution $p$: for example, its entropy.

If the samples are independent and distributed according to the distribution $p$, the sample mean of the estimator will be

$$\sum_{x_1, \dots, x_n \in X} f(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n)$$

The bias of the estimator is the difference between the sample mean of the estimator and the actual value of $F$:

$$\sum_{x_1, \dots, x_n \in X} f(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n) \;-\; F(p)$$

The estimator $f$ is unbiased if this bias is zero for all $p$.
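To make these formulas concrete, here is a sketch that computes the bias of the naive estimator exactly for a two-outcome set, by summing over all $2^n$ sample sequences; the choices $p = 0.3$ and $n = 5$ are mine, just for illustration:

```python
from collections import Counter
from itertools import product
from math import log2

def naive_entropy(samples):
    """Plug-in entropy estimate (bits) from observed frequencies."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())

def exact_bias(p, n):
    """Bias of the naive estimator for a coin with heads probability p:
    the sample mean of the estimator (an exact sum over all 2**n sequences,
    each weighted by its probability) minus the true entropy F(p)."""
    sample_mean = sum(
        naive_entropy(seq) * p ** seq.count('H') * (1 - p) ** seq.count('T')
        for seq in product('HT', repeat=n)
    )
    true_value = -p * log2(p) - (1 - p) * log2(1 - p)
    return sample_mean - true_value

print(exact_bias(0.3, 5))   # negative: the naive estimator underestimates on average
```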
Proposition 8 of Paninski’s paper says there exists no unbiased estimator for entropy! The proof is very short…
Okay, that’s all for today.
I’m back in Singapore now; I learned so much at the Mathematics of Biodiversity conference that there’s no way I’ll be able to tell you all that information. I’ll try to write a few more blog posts, but please be aware that my posts so far give a hopelessly biased and idiosyncratic view of the conference, which would be almost unrecognizable to most of the participants. There are a lot of important themes I haven’t touched on at all… while this business of entropy estimation barely came up: I just find it interesting!
If more of you blogged more, we wouldn’t have this problem.