The Mathematics of Biodiversity (Part 7)

How ignorant are you?

Do you know?

Do you know how much don’t you know?

It seems hard to accurately estimate your lack of knowledge. It even seems hard to say precisely how hard it is. But the cool thing is, we can actually extract an interesting math question from this problem. And one answer to this question leads to the following conclusion:

There’s no unbiased way to estimate how ignorant you are.

But the devil is in the details. So let’s see the details!

The Shannon entropy of a probability distribution is a way of measuring how ignorant we are when this probability distribution describes our knowledge.

For example, suppose all we care about is whether this ancient Roman coin will land heads up or tails up:

If we know there’s a 50% chance of it landing heads up, that’s a Shannon entropy of 1 bit: we’re missing one bit of information.

But suppose for some reason we know for sure it’s going to land heads up. For example, suppose we know the guy on this coin is the emperor Pupienus Maximus, a egomaniac who had lead put on the back of all coins bearing his likeness, so his face would never hit the dirt! Then the Shannon entropy is 0: we know what’s going to happen when we toss this coin.

Or suppose we know there’s a 90% it will land heads up, and a 10% chance it lands tails up. Then the Shannon entropy is somewhere in between. We can calculate it like this:

$- 0.9 \log_2 (0.9) - 0.1 \log_2 (0.1) = 0.46899...$

so that’s how many bits of information we’re missing.

But now suppose we have no idea. Suppose we just start flipping the coin over and over, and seeing what happens. Can we estimate the Shannon entropy?

Here’s a naive way to do it. First, use your experimental data to estimate the probability that that the coin lands heads-up. Then, stick that probability into the formula for Shannon entropy. For example, say we flip the coin 3 times and it lands head-up once. Then we can estimate the probability of it landing heads-up as 1/3, and tails-up as 2/3. So we can estimate that the Shannon entropy is

$\displaystyle{ - \frac{1}{3} \log_2 (\frac{1}{3}) -\frac{2}{3} \log_2 (\frac{2}{3}) = 0.918... }$

But it turns out that this approach systematically underestimates the Shannon entropy!

Say we have a coin that lands up a certain fraction of the time, say $p.$ And say we play this game: we flip our coin $n$ times, see what we get, and estimate the Shannon entropy using the simple recipe I just illustrated.

Of course, our estimate will depend on the luck of the game. But on average, it will be less than the actual Shannon entropy, which is

$- p \log_2 (p) - (1-p) \log_2 (1-p)$

We can prove this mathematically. But it shouldn’t be surprising. After all, if $n = 1,$ we’re playing a game where we flip the coin just once. And with this game, our naive estimate of the Shannon entropy will always be zero! Each time we play the game, the coin will either land heads up 100% of the time, or tails up 100% of the time!

If we play the game with more coin flips, the error gets less severe. In fact it approaches zero as the number of coin flips gets ever larger, so that $n \to \infty.$ The case where you flip the coin just once is an extreme case—but extreme cases can be good to think about, because they can indicate what may happen in less extreme cases.

One moral here is that naively generalizing on the basis of limited data can make you feel more sure you know what’s going on than you actually are.

I hope you knew that already!

But we can also say, in a more technical way, that the naive way of estimating Shannon entropy is a biased estimator: the average value of the estimator is different from the value of the quantity being estimated.

Here’s an example of an unbiased estimator. Say you’re trying to estimate the probability that the coin will land heads up. You flip it $n$ times and see that it lands up $m$ times. You estimate that the probability is $m/n.$ That’s the obvious thing to do, and it turns out to be unbiased.

Statisticians like to think about estimators, and being unbiased is one way an estimator can be ‘good’. Beware: it’s not the only way! There are estimators that are unbiased, but whose standard deviation is so huge that they’re almost useless. It can be better to have an estimate of something that’s more accurate, even though on average it’s a bit too low. So sometimes, a biased estimator can be more useful than an unbiased estimator.

Nonetheless, my ears perked up when Lou Jost mentioned that there is no unbiased estimator for Shannon entropy. In rough terms, the moral is that:

There’s no unbiased way to estimate how ignorant you are.

I think this is important. For example, it’s important because Shannon entropy is also used as a measure of biodiversity. Instead of flipping a coin repeatedly and seeing which side lands up, now we go out and collect plants or animals, and see which species we find. The relative abundance of different species defines a probability distribution on the set of species. In this language, the moral is:

There’s no unbiased way to estimate biodiversity.

But of course, this doesn’t mean we should give up. We may just have to settle for an estimator that’s a bit biased! And people have spent a bunch of time looking for estimators that are less biased than the naive one I just described.

By the way, equating ‘biodiversity’ with ‘Shannon entropy’ is sloppy: there are many measures of biodiversity. The Shannon entropy is just a special case of the Rényi entropy, which depends on a parameter $q$ : we get Shannon entropy when $q = 1.$

As $q$ gets smaller, the Rényi entropy gets more and more sensitive to rare species—or shifting back to the language of probability theory, rare events. It’s the rare events that make Shannon entropy hard to estimate, so I imagine there should be theorems about estimators for Rényi entropy, which say it gets harder to estimate as $q$ gets smaller. Do you know such theorems?

Also, I should add that biodiversity is better captured by the ‘Hill numbers’, which are functions of the Rényi entropy, than by the Rényi entropy itself. (See here for the formulas.) Since these functions are nonlinear, the lack of an unbiased estimator for Rényi entropy doesn’t instantly imply the same for the Hill numbers. So there are also some obvious questions about unbiased estimators for Hill numbers. Do you know answers to those?

Here are some papers on estimators for entropy. Most of these focus on estimating the Shannon entropy of a probability distribution on a finite set.

This old classic has a proof that the ‘naive’ estimator of Shannon entropy is biased, and estimates on the bias:

• Bernard Harris, The statistical estimation of entropy in the non-parametric case, Army Research Office, 1975.

He shows the bias goes to zero as we increase the number of samples: the number I was calling $n$ in my coin flip example. In fact he shows the bias goes to zero like $O(1/n).$ This is big big O notation which means that as $n \to +\infty,$ the bias is bounded by some constant times $1/n.$ This constant depends on the size of our finite set—or, if you want to do better, the class number, which is the number of elements on which our probability distribution is nonzero.

Using this idea, he shows that you can find a less biased estimator if you have a probability distribution $p_i$ on a finite set and you know that exactly $k$ of these probabilities are nonzero. To do this, just take the ‘naive’ estimator I described earlier and add $(k-1)/2n.$ This is called the Miller–Madow bias correction. The bias of this improved estimator goes to zero like $O(1/n^2).$

The problem is that in practice you don’t know ahead of time how many probabilities are nonzero! In applications to biodiversity this would amount to knowing ahead of time how many species exist, before you go out looking for them.

But what about the theorem that there’s no unbiased estimator for Shannon entropy? The best reference I’ve found is this:

• Liam Paninski, Estimation of entropy and mutual information, Neural Computation 15 (2003) 1191-1254.

In Proposition 8 of Appendix A, Paninski gives a quick proof that there is no unbiased estimator of Shannon entropy for probability distributions on a finite set. But his paper goes far beyond this. Indeed, it seems like a pretty definitive modern discussion of the whole subject of estimating entropy. Interestingly, this subject is dominated by neurobiologists studying entropy of signals in the brain! So, lots of his examples involve brain signals.

Another overview, with tons of references, is this:

• J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy estimation: an overview.

This paper focuses on the situation where don’t know ahead of time how many probabilities are nonzero:

• Anne Chao and T.-J. Shen, Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample, Environmental and Ecological Statistics 10 (2003), 429&–443.

In 2003 there was a conference on the problem of estimating entropy, whose webpage has useful information. As you can see, it was dominated by neurobiologists:

• Estimation of entropy and information of undersampled probability distributions: theory, algorithms, and applications to the neural code, Whistler, British Columbia, Canada, 12 December 2003.

By the way, I was very confused for a while, because these guys claim to have found an unbiased estimator of Shannon entropy:

• Stephen Montgomery Smith and Thomas Schürmann, Unbiased estimators for entropy and class number.

However, their way of estimating entropy has a funny property: in the language of biodiversity, it’s only well-defined if our samples include at least one species of each organism. So, we cannot compute this estimate for an arbitary list of $n$ samples. This means it’s not estimator in the usual sense—the sense that Paninski is using! So it doesn’t really contradict Paninski’s result.

To wrap up, let me state Paninski’s result in a mathematically precise way. Suppose $p$ is a probability distribution on a finite set $X$ . Suppose $S$ is any number we can compute from $p$ : that is, any real-valued function on the set of probability distributions. We’ll be interested in the case where $S$ is the Shannon entropy:

$\displaystyle{ S = -\sum_{x \in X} p(x) \, \log p(x) }$

Here we can use whatever base for the logarithm we like: earlier I was using base 2, but that’s not sacred. Define an estimator to be any function

$\hat{S}: X^n \to \mathbb{R}$

The idea is that given $n$ samples from the set $X,$ meaning points $x_1, \dots, x_n \in X,$ the estimator gives a number $\hat{S}(x_1, \dots, x_n)$ . This number is supposed to estimate some feature of the probability distribution $p$ : for example, its entropy.

If the samples are independent and distributed according to the distribution $p,$ the sample mean of the estimator will be

$\displaystyle{ \langle \hat{S} \rangle = \sum_{x_1, \dots, x_n \in X} \hat{S}(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n) }$

The bias of the estimator is the difference between the sample mean of the estimator and actual value of $S$ :

$\langle \hat{S} \rangle - S$

The estimator $\hat{S}$ is unbiased if this bias is zero for all $p.$

Proposition 8 of Paninski’s paper says there exists no unbiased estimator for entropy! The proof is very short…

Okay, that’s all for today.

I’m back in Singapore now; I learned so much at the Mathematics of Biodiversity conference that there’s no way I’ll be able to tell you all that information. I’ll try to write a few more blog posts, but please be aware that my posts so far give a hopelessly biased and idiosyncratic view of the conference, which would be almost unrecognizable to most of the participants. There are a lot of important themes I haven’t touched on at all… while this business of entropy estimation barely came up: I just find it interesting!

If more of you blogged more, we wouldn’t have this problem.

This entry was posted on Thursday, July 12th, 2012 at 7:15 am and is filed under biodiversity, information and entropy, probability. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

26 Responses to The Mathematics of Biodiversity (Part 7)

John Baez says:

12 July, 2012 at 11:25 am

Over on G+, Adam Liss wrote:

Related to the Dunning–Kruger effect?

That’s the effect where “unskilled individuals suffer from illusory superiority, mistakenly rating their ability much higher than average.”

It might be related. Suppose perfectly rational agents were doing their best to estimate their skill at some simple task—to keep it really simple, some silly task that you can’t improve at, like guessing which face of a perfectly fair coin will land up the next time you toss it.

Then they could use an unbiased estimator of the probability of success: just count the successes divided by the number of trials. And, you’d find that beginners would have wildly different estimates: for example, somebody who succeeded the first time might guess they were always going to succeed, while someone who failed the first time might guess they’d always fail.

But the average of all these beginners’ estimates would typically almost equal the estimate of someone who’d done the task lots of times.

On the other hand, the Dunning–Kruger effect would suggest that real human beginners will tend to overestimate their chances.

But maybe my simple model here is inappropriate. Maybe somehow our estimate of our abilities is based on some estimate of how much we know about what’ll happen if we do something. That could involve an estimate of entropy, i.e., ignorance. And maybe people tend to use an estimator that’s biased towards underestimating ignorance… and thus overestimating skill. Typically this sort of bias will be bigger with a smaller sample. So yeah.

Wikipedia says of Dunning–Kruger effect, “This bias is attributed to a metacognitive inability of the unskilled to recognize their mistakes.” And that could indeed be a form of underestimating ignorance!

Reply
Blake Stacey says:

12 July, 2012 at 12:36 pm

The Hill number for $q = 2$ is the inverse of the probability of “collision”. That is, suppose I spot an organism at random in the population and find it is of species $i$ . The probability that doing the same thing a second time will find an organism which is also of species $i$ is, by definition, $p_i$ . Averaging this over all possible values of $i$ gives $\sum_i p_i^2$ . If this

Say you’re trying to estimate the probability that the coin will land heads up. You flip it $n$ times and see that it lands up $m$ times. You estimate that the probability is $m/n$ . That’s the obvious thing to do, and it turns out to be unbiased.

is true, then we should be able to get an unbiased estimate of the collision probability by performing the double experiment $n$ times and counting how many times the two organisms belong to the same species.

(I’ve read that the Gini–Simpson index, which is just $1 - \sum_i p_i^2$ , has an unbiased estimator, so this argument may not be completely wrong.)

It does no good to have an unbiased, rapidly-converging estimator if that estimator doesn’t measure what one needs to measure.

— Lou Jost

Jost points to this paper, which I should read but don’t have time to at the moment:

A. Chao and T.-J. Shen (2003), “Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample” Environmental and Ecological Statistics 10, 4: 429–43. DOI: 10.1023/A:1026096204727.

Reply
- John Baez says:
  
  12 July, 2012 at 1:57 pm
  
  Hey—great point Blake! I agree with your argument that the two-fold collision probability
  
  $\displaystyle{ \sum_i p_i^2 }$
  
  should have an unbiased estimator. The same argument should work for the $q$ -fold collision probability
  
  $\displaystyle{ \sum_i p_i^q }$
  
  for all $q = 2, 3, \dots.$ In his talk, Ben Allen pointed out that we can use this $q$ -fold collision probability to compute the Rényi entropy:
  
  $\displaystyle{ H_q(p) = \frac{1}{1 -q} \ln \sum_{i=1}^n p_i^q }$
  
  Unfortunately, since the logarithm function is nonlinear, an unbiased estimator for the $q$ -fold collision probability won’t give one for the Rényi entropy—since we can’t push an expectation value through a nonlinear function. So we get an estimator, but not an unbiased one.
  
  Then there’s the Hill number:
  
  $\displaystyle{ {}^qD(p) = \left(\sum_{i=1}^n p_i^q \right)^{\frac{1}{1-q}} }$
  
  which is even more important in biodiversity studies. Again it’s a function of the $q$ -fold collision probability. But again, the nonlinearity of this function keeps us from turning an unbiased estimator for the $q$ -fold collision probability into one for the Hill number!
  
  But as you note, something nice happens when $q = 2$ : we get an unbiased estimator for the inverse of the $q = 2$ Hill number, which is also called the Gini–Simpson index:
  
  $\displaystyle{ \frac{1}{{}^2D(p)} = 1 - \sum_{i=1}^n p_i^2 }$
  
  Somehow the case $q = 1$ is the most pathological of all from this ‘collision’ viewpoint, while being the nicest of all from other viewpoints.
  
  Thanks for pointing out that Chao–Shen reference. I’m gonna copy it into the main blog article, since Anne Chao was a speaker at the conference!
  
  Reply
- Blake Stacey says:
  
  12 July, 2012 at 2:34 pm
  
  I believe your formula for the Gini–Simpson index should be
  
  $G = 1 - \sum_i p_i^2 = 1 - \frac{1}{{}^2 D(p)}.$
  
  Also, you’ve a stray ampersand in the Chao–Shen reference in the main post.
  
  Reply
- Blake Stacey says:
  
  12 July, 2012 at 3:14 pm
  
  It’s interesting how some topics keep recurring in new contexts. I first learned about collision probabilities and Rényi entropies when I was writing my undergraduate thesis, “Relation of Electron Scattering Cross Sections to Drift Measurements in Noble Gases“. It was somewhat incidental to the topic I was supposed to be writing about: the nominal goal of my thesis was to study steady-state flow properties of charge pulses being driven through gases, and I used the Rényi entropy ideas to say a bit about the decay of the transient behaviour on the way to the steady state. I learned about using the Rényi entropies, computed from collision counts, to estimate the Shannon index, because Shang-Keng Ma said a bit about a special case of it in his textbook, Statistical Mechanics (1985). The generalisation I used came from here:
  
  A. Bialas and W. Czyz (2000), “Event by event analysis and entropy of multiparticle systems” Physical Review D 61: 074021 [arXiv:hep-ph/9909209].
  
  They weren’t thinking about charge pulses or species in a rainforest, but rather collisions of atomic nuclei in particle accelerators.
  
  A year or so later, I was studying random geometric graphs to see if they provided a workable “null model” for the 3D structures of proteins. Collision probabilities and Rényi entropies showed up there again: properties of the graphs built by throwing down vertices at random according to some probability distribution depend on Rényi entropies of that distribution. I learned about this from here:
  
  C. Herrmann et al. (2003), “Connectivity distribution of spatial networks” Physical Review E 68: 026128 [arXiv:cond-mat/0302544].
  
  The idea is basically that in all these other contexts, we’ve been talking about “coincidences” or “collisions”, where doing an experiment $k$ times gives the same result each time. The analogue in the random-graph scenario is a subset of $k$ vertices in the graph all connected to one another.
  
  Collisions of ideas… Maybe I only know one or two things and keep seeing them everywhere…
  
  Reply
  - John Baez says:
    
    12 July, 2012 at 5:07 pm
    
    Cool!
    
    People at the biodiversity conference were wondering if they could use collision counts to estimate Shannon entropy, somehow extrapolating Rényi entropy from $q= 2,3,4, \dots$ down to $q = 1.$ Being a decent pure mathematician I quickly saw that Rényi entropy for probability measures on any given finite set is the kind of function whose values at the point $q= 2,3,4, \dots$ determine its value for all $q \in [0,\infty).$ But that’s different than getting a practical, robust, method of estimating the Shannon entropy from collision counts. Do you know the state of the art on that?
    
    Reply
    - Lou Jost says:
      
      13 July, 2012 at 9:20 am
      
      Anne Chao has such a result, and it will lead to the best available estimator of entropy! Stay tuned….
      
      John, I just gave a talk yesterday afternoon here in Barcelona (we all miss you!!!) about estimation of species richness, entropy, and Hill numbers. I suggested a way to “tune” the observed relative abundances (which sum to unity) by noting that their true values must sum to the coverage deficit $1-C$ , where $C$ is the Good–Turing coverage, the population share of the species detected in the sample. The coverage deficit can be estimated from the sample as $f_1/n$ where $f_1$ is the number of singletons in the sample and n is sample size.
      
      Chao and Shen use this method to tune all observed frequencies by a correction factor $1/(1-C).$ However, I later realized (and Anne agreed) that this is not quite correct (though it works very well anyway), because the relative uncertainty in the estimate of $p_i$ is very small for the large $p_i$ but is large for the small $p_i.$ For example, a species represented only once in the sample, with frequency $1/n,$ could easily have been missed or been found twice; its true frequency in the population might easily be $2/n$ or $0.5/n,$ so our relative uncertainty in its value is huge. Contrast this with a species which makes up 50% of the sample. Using the binomial formula for standard deviation, we can show that the precision of this estimate is very high if $n$ is large. We don’t have the right to tune that frequency very much. We should tune all frequencies in proportion to their uncertainty. It turns out to be easy to write a simple formula for this.
      
      Tuning only solves half the problem, though. In the case of entropy, the abundance distribution of the unseen species has a largish effect on the value of entropy. That part still needs more work. Hopefully Anne’s innovation will make this unnecessary.
    - Blake Stacey says:
      
      13 July, 2012 at 3:47 pm
      
      The most recent thing I’ve seen in the tradition I was talking about is this:
      
      A. Bialas, W. Czyz, K. Zalewski (2006), “Measurement of Renyi entropies in multiparticle production: a DO-LIST II” Acta Physica Polonica B 37: 2713–28 [arXiv:hep-ph/0607082].
Michelle Beissel says:

12 July, 2012 at 12:59 pm

It is often pointed out to me that I self-select audiences and make assumptions without any awareness, that is, ignorance of what I am doing. Assumptions love small data sets. And yeah, you other scientists, BLOG MORE ABOUT YOUR WORK.

Reply
memming says:

12 July, 2012 at 3:14 pm

A proof for no unbiased entropy estimator claim is given in Liam Paninski’s neural computation paper. See proposition 8 and proof.
http://www.stat.columbia.edu/~liam/research/abstracts/info_est-nc-abs.html
http://www.stat.columbia.edu/~liam/research/info_est.html

Reply
- memming says:
  
  12 July, 2012 at 3:23 pm
  
  Oh, you added the reference already.
  
  Reply
- John Baez says:
  
  12 July, 2012 at 4:50 pm
  
  Thanks! I just noticed that Proposition 8 hidden in the appendix a little while ago. I’m confused because it seems to contradict this paper:
  
  • Stephen Montgomery Smith and Thomas Schürmann, Unbiased estimators for entropy and class number.
  
  I’ve updated my blog entry but I’ll probably have to keep updating it after I get a night of sleep. I seem to have underestimated my ignorance.
  
  Reply
  - memming says:
    
    13 July, 2012 at 7:17 pm
    
    Actually I’ve seen the same (or very similar) entropy estimator as in the Smith and Schürmann paper you mentioned. And the unbiasedness proof seems to be right, given that we have enough data to observe every symbol at least once (for $H_1$ ) while Paninski’s proof assumes a fixed $N$ number of symbols observed. Given infinite data, most entropy estimators are in fact asymptotically unbiased.
    
    BTW, if the number of symbols (or species) is known, entropy is always bounded by $[0, \log K]$ , so the variance cannot be infinite. When it is unknown, it is a different story.
    
    Reply
    - John Baez says:
      
      14 July, 2012 at 3:18 am
      
      Thanks! I believe the new version of my blog article gets the facts straight—if you see any actual errors please let me know.
      
      I believe the usual definition of ‘estimator’ requires that you’re able to compute an estimate from $n$ samples of data no matter what those samples are. As you note, Smith and Schürmann’s result evades Paninski’s theorem by relaxing this definition of ‘estimator’: their formula only lets you compute an estimate if your samples obey some condition. One might call this a ‘conditional estimator’. I imagine some statisticians have already thought hard about this idea.
    - Stephen Montgomery-Smith says:
      
      10 June, 2013 at 2:43 am
      
      I just wanted to add that the paper by myself and Schürmann has a second estimator, where you don’t need to observe every symbol at least once.
      
      But it doesn’t violate Prop 8 of Liam Paninski’s paper. Prop 8 requires that the sample size be given in advance, whereas our estimators have no a priori bound on the sample size.
    - John Baez says:
      
      10 June, 2013 at 3:37 am
      
      Interesting, thanks! I’ll have to get back to this subject sometime… I’ve been distracted by other projects.
    - memming says:
      
      10 June, 2013 at 12:47 pm
      
      BTW, we have yet another discrete entropy estimator that doesn’t assume $K$ . It’s a nonparametric Bayesian estimator based on a Pitman-Yor process prior.
      
      Paper: Evan Archer*, Il Memming Park*, Jonathan Pillow. Bayesian Entropy Estimation for Countable Discrete Distributions. arXiv:1302.0328 (2013)
      
      Code available: https://github.com/pillowlab/PYMentropy
Blake Stacey says:

12 July, 2012 at 3:16 pm

Typo alert: $O(1/n^2)$ is missing a “latex”.

Reply
- John Baez says:
  
  12 July, 2012 at 4:53 pm
  
  Thanks, fixed. I’ve been updating this section near the end of the blog article as I keep learning new things. I’m pretty confused right now—see my reply to ‘memming’ above.
  
  Reply
Joe Marshall says:

12 July, 2012 at 6:47 pm

I think there’s a problem in your reasoning. When you say “Say we have a coin that lands up a certain known fraction of the time”, how would we know this? Are you saying that we have a coin that *has landed up* a certain known fraction *in the past*, or do you have some way to know that it *will land up* a certain known fraction of the time *in the future*?

When you use phrases like “the true value” alarm bells go off in my head. You are viewing the probability of heads as a physical property of the coin instead of an information theoretic property of your *knowledge* about the coin.

Reply
- John Baez says:
  
  13 July, 2012 at 3:21 am
  
  I thought I’d deleted the word “known” in that sentence, to lessen the confusion. I’ll do it now.
  
  Of course there’s a huge debate on what probabilities actually mean, with subjective Bayesians religiously avoiding phrases like “the true probability”—and for good reasons. It’s an important debate that I’ve spent a lot of time on. I’m actually a subjective Bayesian with mild reservations concerning the possibility that even this position doesn’t seem to get to the bottom of certain issues.
  
  But here I was trying to explain the concept of an unbiased estimator without getting too technical. This is a mathematical issue that largely sidesteps the debate about what probabilities mean. But the problem is that when you talk using words, it’s hard to avoid seeming like you’re taking a position on that debate—and it’s a lot simpler to talk like someone who believes probabilities are an objective feature of reality.
  
  So, in anything I write, you should be able to take phrases like “the true probability distribution” and replace them with “the probability distribution that models Fred’s beliefs” without anything bad happening.
  
  Allowing myself to get more technical, the math goes like this:
  
  Suppose $p$ is a probability distribution on a finite set $X$ . Its entropy is
  
  $\displaystyle{ S = -\sum_{x \in X} p(x) \, \ln p(x) }$
  
  Define an estimator for entropy to be a function
  
  $\hat{S}: X^n \to \mathbb{R}$
  
  The idea is that given $n$ samples from the set $X$, the estimator gives a number $\hat{S}(x_1, \dots, x_n)$ that’s supposed to be an estimate of entropy. If these samples are independent and distributed according to the distribution $p$ , the mean estimated entropy will be
  
  $\displaystyle{ \langle \hat{S} \rangle = \sum_{x_1, \dots, x_n \in X} \hat{S}(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n) }$
  
  The bias of the estimator is the difference between the mean estimated entropy and the actual entropy of $p$ :
  
  $\langle \hat{S} \rangle - S$
  
  The estimator is unbiased if the bias is zero for all $p$ .
  
  Proposition 8 of Paninski’s paper says there exists no unbiased estimator for entropy.
  
  People often think of $p$ the ‘true’ probability distribution, whose entropy the estimator is seeking to estimate on the basis of samples distributed according to that distribution. But the math doesn’t care about the word ‘true’, and since the definition of ‘unbiased estimator’ involves a property that’s supposed to hold for all $p$ , we don’t even need any way to specify a particular one.
  
  Reply
- Blake Stacey says:
  
  13 July, 2012 at 4:00 pm
  
  Typo alert: $X$ is missing its “latex”.
  
  (I had a more substantial comment to make here, but I realised I should explicitly work through the de-Finetti-theorem-related stuff I was thinking about before I started yapping on over it.)
  
  Reply
  - John Baez says:
    
    14 July, 2012 at 3:19 am
    
    Thanks, I’ll fix this. I’ll be interested to hear what you’re thinking about.
    
    Reply
dave tweed says:

13 July, 2012 at 7:53 am

A tangential thought: an “estimator” can be viewed as an example of a “machine learning model” that just happens to have no adjustable parameters at all. But in machine learning it’s a common viewpoint that if you want to reduce (even minimise) the generalisation error when using your model there are terms in the error which can be grouped as “due to bias” and which can be grouped as “due to variance”, and in general it’s trading off these effectively that gives the best performance on new data (see here). Now admittedly for such a “trivial” case it may be possible to do an analysis/optimisation that goes further, but is it possible that neither unbiased nor minimal variance estimators are the most appropriate, but some trade-off between the two that minimises the statistic’s error when applied to finite sets of samples from experiments?

Reply
The Mathematics of Biodiversity (Part 8) « Azimuth says:

15 July, 2012 at 4:36 am

[…] They computed the first-order bias of the information (related to the Miller–Madow correction) […]

Reply
The best way to extort an extortionist is to be fair « neuroecology says:

18 July, 2012 at 7:08 pm

[…] It’s all over the place! If you rerun the simulation again and again,you get a different distribution of these values, but they always seem to be >.3 and never settle at 1 (tit-for-tat)! We can measure how diverse the distribution is by the entropy of possible […]

Reply

You can use Markdown or HTML in your comments. You can also use LaTeX, like this: $latex E = m c^2 $. The word 'latex' comes right after the first dollar sign, with a space after it. Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

	skolemization on The Logic of Real and Complex…
	skolemization on The Complexity Barrier
	John Baez on The Complexity Barrier
	skolemization on The Complexity Barrier
	skolemization on Chaitin’s Theorem and th…
	John Baez on Nonstandard Models of Ari…
	skolemization on Nonstandard Models of Ari…
	John Baez on Agent-Based Models (Part …
	Grandpa D on Agent-Based Models (Part …
	Anton Sherwood on Well Temperaments (Part 3…

Azimuth