GiveWell is an organization that rates charities. They’ve met people who argue that
charities working on reducing the risk of sudden human extinction must be the best ones to support, since the value of saving the human race is so high that “any imaginable probability of success” would lead to a higher expected value for these charities than for others.
For example, say I have a dollar to spend on charity. One charity says that with this dollar they can save the life of one child in Somalia. Another says that with this dollar they can increase by .000001% our chance of saving 1 billion people from the effects of a massive asteroid colliding with the Earth.
Naively, in terms of the expected number of lives saved, the latter course of action seems 10 times better, since .000001% of 1 billion is 10⁻⁸ × 10⁹ = 10 expected lives saved, versus just 1 for the first charity.
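Here's that arithmetic spelled out (a trivial sketch; the variable names are mine):

```python
# Naive expected-value comparison, using the figures from the example above.
p_success = 0.000001 / 100        # ".000001%" expressed as a probability, i.e. 1e-8
lives_at_stake = 1_000_000_000    # 1 billion people

expected_lives_asteroid = p_success * lives_at_stake   # about 10 expected lives
expected_lives_somalia = 1.0                           # one child saved with certainty

print(expected_lives_asteroid, expected_lives_somalia)
```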
But is it really better?
It’s a subtle question, with all sorts of complicating factors, like why should I trust these guys?
I’m not ready to present a thorough analysis of this sort of question today. But I would like to hear what you think about it. And I’d like you to read what the founder of GiveWell has to say about it:
• Holden Karnofsky, Why we can’t take expected value estimates literally (even when they’re unbiased), 18 August 2011.
He argues against what he calls an ‘explicit expected value’ or ‘EEV’ approach:
The mistake (we believe) is estimating the “expected value” of a donation (or other action) based solely on a fully explicit, quantified formula, many of whose inputs are guesses or very rough estimates. We believe that any estimate along these lines needs to be adjusted using a “Bayesian prior”; that this adjustment can rarely be made (reasonably) using an explicit, formal calculation; and that most attempts to do the latter, even when they seem to be making very conservative downward adjustments to the expected value of an opportunity, are not making nearly large enough downward adjustments to be consistent with the proper Bayesian approach.
His focus, in short, is on the fact that anyone saying “this money can increase by .000001% our chance of saving 1 billion people from an asteroid impact” is likely to be pulling those numbers from thin air. If they can’t really back up their numbers with a lot of hard evidence, then our lack of confidence in their estimate should be taken into account somehow.
His article spends a lot of time analyzing a less complex but still very interesting example:
It seems fairly clear that a restaurant with 200 Yelp reviews, averaging 4.75 stars, ought to outrank a restaurant with 3 Yelp reviews, averaging 5 stars. Yet this ranking can’t be justified in an explicit expected utility framework, in which options are ranked by their estimated average/expected value.
This is the only question I really want to talk about today. Actually I’ll focus on a similar question that Tim van Beek posed on this blog:
You have two kinds of fertilizer, A and B. You know that of 4 trees that got A, three thrived and one died. Of 36 trees that got B, 24 thrived and 12 died. Which fertilizer would you buy?
So, 3/4 of the trees getting fertilizer A thrived, while only 2/3 of those getting fertilizer B thrived. That makes fertilizer A seem better. However, the sample size is considerably larger for fertilizer B, so we may feel more confident about the results in this case. Which should we choose?
Nathan Urban tackled the problem in an interesting way. Let me sketch what he did before showing you his detailed work.
Suppose that before doing any experiments at all, we assume the probability that a fertilizer will make a tree thrive is a number uniformly distributed between 0 and 1. This assumption is our “Bayesian prior”.
Note: I’m not saying this prior is “correct”. You are allowed to choose a different prior! Choosing a different prior will change your answer to this puzzle. That can’t be helped. We need to make some assumption to answer this kind of puzzle; we are simply making it explicit here.
Starting from this prior, Nathan works out the probability that the success rate p has any given value, given that when we apply the fertilizer to 4 trees, 3 thrive. That’s the black curve below. He also works out the probability that p has any given value, given that when we apply the fertilizer to 36 trees, 24 thrive. That’s the red curve:
The red curve, corresponding to the experiment with 36 trees, is much more sharply peaked. That makes sense. It means that when we do more experiments, we become more confident that we know what’s going on.
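You can reproduce those two curves with nothing beyond the Python standard library (a sketch; the function and variable names are mine, not Nathan's):

```python
from math import gamma

def beta_pdf(p, a, b):
    """Density of the Beta(a, b) distribution at p."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * p**(a - 1) * (1 - p)**(b - 1)

# Starting from the uniform prior Beta(1, 1), observing k successes in
# n trials gives the posterior Beta(k + 1, n - k + 1).
posterior_a = lambda p: beta_pdf(p, 3 + 1, 4 - 3 + 1)     # A: 3 of 4 thrived -> Beta(4, 2)
posterior_b = lambda p: beta_pdf(p, 24 + 1, 36 - 24 + 1)  # B: 24 of 36 thrived -> Beta(25, 13)

# B's posterior peaks much higher (hence is narrower) than A's,
# reflecting the larger sample size.
print(posterior_a(3 / 4), posterior_b(2 / 3))
```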
We still have to choose a criterion to decide which fertilizer is best! This is where ‘decision theory’ comes in. For example, suppose we want to maximize the expected number of trees that thrive. Then Nathan shows that fertilizer A is slightly better, despite the smaller sample size.
However, he also shows that if fertilizer A succeeded 4 out of 5 times, while fertilizer B succeeded 7 out of 9 times, the same evaluation procedure would declare fertilizer B better! Its percentage success rate is less: about 78% instead of 80%. However, the sample size is larger. And in this particular case, given our particular Bayesian prior and given what we are trying to maximize, that’s enough to make fertilizer B win.
So if someone is trying to get you to contribute to a charity, there are many interesting issues involved in deciding if their arguments are valid or just a bunch of… fertilizer.
Here is Nathan’s detailed calculation:
It’s fun to work out an official ‘correct’ answer mathematically, as John suggested. Of course, this ends up being a long way of confirming the obvious—and the answer is only as good as the assumptions—but I think it’s interesting anyway. In this case, I’ll work it out by maximizing expected utility in Bayesian decision theory, for one choice of utility function. This dodges the whole risk aversion point, but it opens discussion for how the assumptions might be modified to account for more real-world considerations. Hopefully others can spot whether I’ve made mistakes in the derivations.
In Bayesian decision theory, the first thing you do is write down the data-generating process and then compute a posterior distribution for what is unknown.
In this case, we may assume the data-generating process (likelihood function) is a binomial distribution for k successes in n trials, given a probability of success p. Fertilizer A corresponds to n = 4, k = 3, and fertilizer B corresponds to n = 36, k = 24.
The probability of success p is unknown, and we want to infer its posterior conditional on the data, P(p | n, k). To compute a posterior we need to assume a prior P(p) on p.
It turns out that the Beta distribution is conjugate to a binomial likelihood, meaning that if we assume a Beta distributed prior, then the posterior is also Beta distributed. If the prior is Beta(α, β), then the posterior is Beta(α + k, β + n − k).
One choice for a prior is a uniform prior on [0, 1], which corresponds to a Beta(1, 1) distribution. There are of course other prior choices which will lead to different conclusions. For this prior, the posterior is Beta(k + 1, n − k + 1). The posterior mode is

k/n

and the posterior mean is

(k + 1)/(n + 2)
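In code, those two point estimates look like this (a sketch; the function names are mine):

```python
def posterior_mode(k, n):
    """Mode of the Beta(k + 1, n - k + 1) posterior: just the sample proportion."""
    return k / n

def posterior_mean(k, n):
    """Mean of the Beta(k + 1, n - k + 1) posterior (Laplace's rule of succession)."""
    return (k + 1) / (n + 2)

print(posterior_mode(3, 4), posterior_mean(3, 4))      # fertilizer A: 3 of 4
print(posterior_mode(24, 36), posterior_mean(24, 36))  # fertilizer B: 24 of 36
```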
So, what is the inference for fertilizers A and B? I made a graph of the posterior distributions. You can see that the inference for fertilizer B is sharper, as expected, since there is more data. But the inference for fertilizer A tends towards higher success rates, which can be quantified.
Fertilizer A has a posterior mode of 3/4 = 0.75 and B has a mode of 2/3 = 0.667, corresponding to the sample proportions. The mode isn’t the only measure of central tendency we could use. The means are 0.667 for A and 0.658 for B; the medians are 0.686 for A and 0.661 for B. No matter which of the three statistics we choose, fertilizer A looks better than fertilizer B.
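The medians can be checked numerically. For integer parameters the Beta CDF reduces to a binomial tail sum, so a simple bisection needs nothing beyond the standard library (a sketch; the helper names are mine):

```python
from math import comb

def beta_cdf(x, a, b):
    """CDF of Beta(a, b) at x, for integer a, b, via the binomial tail identity:
    P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a)."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

def beta_median(a, b, tol=1e-9):
    """Find the median of Beta(a, b) by bisection on its CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(beta_median(4, 2), 3))    # fertilizer A's posterior, Beta(4, 2)
print(round(beta_median(25, 13), 3))  # fertilizer B's posterior, Beta(25, 13)
```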
But we haven’t really done “decision theory” yet. We’ve just compared point estimators. Actually, we have done a little decision theory, implicitly. It turns out that picking the mean corresponds to the estimator which minimizes the expected squared error in p, where “squared error” can be thought of formally as a loss function in decision theory. Picking the median corresponds to minimizing the expected absolute loss, and picking the mode corresponds to minimizing the 0-1 loss (where you lose nothing if you guess exactly and lose 1 otherwise).
Still, these don’t really correspond to a decision theoretic view of the problem. We don’t care about the quantity p at all, let alone some point estimator of it. We only care about p indirectly, insofar as it helps us predict something about what the fertilizer will do to new trees. For that, we have to move from the posterior distribution to the predictive distribution

P(y | n, k) = ∫₀¹ P(y | p) P(p | n, k) dp

where y is a random variable indicating whether a new tree will thrive under treatment. Here I assume that the success of new trees follows the same binomial distribution as in the experimental group.
For a Beta posterior, the predictive distribution is beta-binomial, and the expected success rate for a new tree is equal to the mean of the Beta distribution for p – i.e. the posterior mean we computed before, (k + 1)/(n + 2). If we introduce a utility function such that we are rewarded 1 util for a thriving tree and 0 utils for a non-thriving tree, then the expected utility is equal to the expected success rate. Therefore, under these assumptions, we should choose the fertilizer that maximizes the quantity (k + 1)/(n + 2), which, as we’ve seen, favors fertilizer A (0.667) over fertilizer B (0.658).
An interesting mathematical question is: does this ever work out to a “non-obvious” conclusion? That is, can fertilizer A have a sample success rate greater than fertilizer B’s, while expected utility maximization prefers fertilizer B? Mathematically, we’re looking for a set (nA, kA, nB, kB) such that kA/nA > kB/nB but (kA + 1)/(nA + 2) < (kB + 1)/(nB + 2). (Also there are obvious constraints, such as 0 ≤ k ≤ n.) The answer is yes. For example, if fertilizer A has 4 of 5 successes while fertilizer B has 7 of 9 successes.
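That counterexample is easy to verify (a sketch; the function names are mine):

```python
def sample_rate(k, n):
    """Raw sample success rate."""
    return k / n

def posterior_mean(k, n):
    """Mean of the Beta(k + 1, n - k + 1) posterior under a uniform prior."""
    return (k + 1) / (n + 2)

# Fertilizer A: 4 of 5 successes; fertilizer B: 7 of 9.
kA, nA, kB, nB = 4, 5, 7, 9

assert sample_rate(kA, nA) > sample_rate(kB, nB)        # 0.800 > 0.778: A looks better...
assert posterior_mean(kA, nA) < posterior_mean(kB, nB)  # ...but 5/7 < 8/11, so B wins

print(round(posterior_mean(kA, nA), 3), round(posterior_mean(kB, nB), 3))
```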
By the way, on a quite different note: NASA currently rates the chances of the asteroid Apophis colliding with the Earth in 2036 at 4.3 × 10⁻⁶. It estimates that the energy of such a collision would be comparable with a 510-megatonne thermonuclear bomb. This is ten times larger than the largest bomb actually exploded, the Tsar Bomba. The Tsar Bomba, in turn, was ten times larger than all the explosives used in World War II.
There’s an interesting Chinese plan to deflect Apophis if that should prove necessary. It is, however, quite a sketchy plan. I expect people will make more detailed plans shortly before Apophis comes close to the Earth in 2029.