GiveWell is an organization that rates charities. They’ve met people who argue that

charities working on reducing the risk of sudden human extinction must be the best ones to support, since the value of saving the human race is so high that “any imaginable probability of success” would lead to a higher expected value for these charities than for others.

For example, say I have a dollar to spend on charity. One charity says that with this dollar they can save the life of one child in Somalia. Another says that with this dollar they can increase by .000001% our chance of saving 1 billion people from the effects of a massive asteroid colliding with the Earth.

Naively, in terms of the expected number of lives saved, the latter course of action seems 10 times better, since

$$0.000001\% \times 1\,000\,000\,000 = 10^{-8} \times 10^9 = 10$$

lives are saved in expectation, versus just 1 for the first charity.

But is it really better?

It’s a subtle question, with all sorts of complicating factors, like *why should I trust these guys?*

I’m not ready to present a thorough analysis of this sort of question today. But I would like to hear what you think about it. And I’d like you to read what the founder of GiveWell has to say about it:

• Holden Karnofsky, Why we can’t take expected value estimates literally (even when they’re unbiased), 18 August 2011.

He argues against what he calls an ‘explicit expected value’ or ‘EEV’ approach:

The mistake (we believe) is estimating the “expected value” of a donation (or other action) based solely on a fully explicit, quantified formula, many of whose inputs are guesses or very rough estimates. We believe that any estimate along these lines needs to be adjusted using a “Bayesian prior”; that this adjustment can rarely be made (reasonably) using an explicit, formal calculation; and that most attempts to do the latter, even when they seem to be making very conservative downward adjustments to the expected value of an opportunity, are not making nearly large enough downward adjustments to be consistent with the proper Bayesian approach.

His focus, in short, is on the fact that anyone saying “this money can increase by .000001% our chance of saving 1 billion people from an asteroid impact” is likely to be *pulling those numbers from thin air*. If they can’t really back up their numbers with a lot of hard evidence, then our lack of confidence in their estimate should be taken into account somehow.

His article spends a lot of time analyzing a less complex but still very interesting example:

It seems fairly clear that a restaurant with 200 Yelp reviews, averaging 4.75 stars, ought to outrank a restaurant with 3 Yelp reviews, averaging 5 stars. Yet this ranking can’t be justified in an explicit expected utility framework, in which options are ranked by their estimated average/expected value.
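
The intuition in this quote can be sketched numerically with a simple shrinkage estimate. Everything quantitative here is my own illustrative assumption, not part of Karnofsky’s article: suppose ratings are pulled toward a prior mean of 3.5 stars, with the prior given the weight of 10 pseudo-reviews.

```python
# A toy "Bayesian shrinkage" ranking, sketching why many good reviews can
# beat a few perfect ones.  The prior mean (3.5 stars) and prior weight
# (10 pseudo-reviews) are illustrative assumptions.

PRIOR_MEAN = 3.5    # assumed average rating across all restaurants
PRIOR_WEIGHT = 10   # assumed strength of the prior, in pseudo-reviews

def shrunk_rating(n_reviews, avg_rating):
    """Sample average pulled toward the prior mean, weighted by review count."""
    return (PRIOR_WEIGHT * PRIOR_MEAN + n_reviews * avg_rating) / (PRIOR_WEIGHT + n_reviews)

many = shrunk_rating(200, 4.75)  # 200 reviews averaging 4.75 stars
few = shrunk_rating(3, 5.0)      # 3 reviews averaging 5 stars

print(round(many, 2), round(few, 2))  # the 200-review restaurant ranks higher
```

With these assumed numbers the 200-review restaurant scores about 4.69 and the 3-review restaurant about 3.85, matching the intuitive ranking that a raw average cannot justify.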

This is the only question I really want to talk about today. Actually I’ll focus on a similar question that Tim van Beek posed on this blog:

You have two kinds of fertilizer, A and B. You know that of 4 trees that got A, three thrived and one died. Of 36 trees that got B, 24 thrived and 12 died. Which fertilizer would you buy?

So, 3/4 of the trees getting fertilizer A thrived, while only 2/3 of those getting fertilizer B thrived. That makes fertilizer A seem better. However, the sample size is considerably larger for fertilizer B, so we may feel *more confident* about the results in this case. Which should we choose?

Nathan Urban tackled the problem in an interesting way. Let me sketch what he did before showing you his detailed work.

Suppose that before doing any experiments at all, we assume the probability $p$ that a fertilizer will make a tree thrive is a number *uniformly distributed between 0 and 1*. This assumption is our “Bayesian prior”.

Note: I’m not saying this prior is “correct”. You are allowed to choose a different prior! Choosing a different prior will change your answer to this puzzle. That can’t be helped. We need to make some assumption to answer this kind of puzzle; we are simply making it explicit here.

Starting from this prior, Nathan works out the probability that $p$ has some value *given that when we apply the fertilizer to 4 trees, 3 thrive*. That’s the black curve below. He also works out the probability that $p$ has some value *given that when we apply the fertilizer to 36 trees, 24 thrive*. That’s the red curve:

The red curve, corresponding to the experiment with 36 trees, is much more sharply peaked. That makes sense. It means that *when we do more experiments, we become more confident that we know what’s going on*.
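
One way to quantify this sharpening without plotting anything (my own addition, anticipating Nathan’s setup below, where a uniform prior makes the posterior Beta(k+1, n-k+1)) is to compare posterior standard deviations:

```python
# Posterior spread under a uniform prior: k successes in n trials give a
# Beta(k+1, n-k+1) posterior.  The standard deviation of Beta(a, b) is
# sqrt(a*b / ((a+b)**2 * (a+b+1))).
from math import sqrt

def beta_sd(a, b):
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

sd_A = beta_sd(3 + 1, 1 + 1)    # fertilizer A: 3 of 4 thrive  -> Beta(4, 2)
sd_B = beta_sd(24 + 1, 12 + 1)  # fertilizer B: 24 of 36 thrive -> Beta(25, 13)

print(sd_A, sd_B)  # the 36-tree posterior is much more sharply peaked
```

The 4-tree posterior has a standard deviation of about 0.18, the 36-tree posterior about 0.08: more data, sharper peak.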

We still have to choose a criterion to decide which fertilizer is best! This is where ‘decision theory’ comes in. For example, suppose we want to maximize the expected number of trees that thrive. Then Nathan shows that fertilizer A is slightly better, despite the smaller sample size.

However, he also shows that if fertilizer A succeeded 4 out of 5 times, while fertilizer B succeeded 7 out of 9 times, the same evaluation procedure would declare fertilizer B better! Its percentage success rate is less: about 78% instead of 80%. However, the sample size is larger. And in this particular case, *given our particular Bayesian prior* and *given what we are trying to maximize*, that’s enough to make fertilizer B win.
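
Under the uniform prior Nathan uses below, the quantity being compared works out to be the posterior mean (k+1)/(n+2), so this reversal takes only a couple of lines to check (my own sketch):

```python
# Posterior mean of the thriving probability under a uniform prior:
# k successes in n trials -> Beta(k+1, n-k+1), whose mean is (k+1)/(n+2).

def posterior_mean(k, n):
    return (k + 1) / (n + 2)

# Fertilizer A: 4 of 5 successes; fertilizer B: 7 of 9 successes.
rate_A, rate_B = 4 / 5, 7 / 9
mean_A, mean_B = posterior_mean(4, 5), posterior_mean(7, 9)

print(rate_A > rate_B)   # A has the higher raw success rate (80% vs ~78%)...
print(mean_B > mean_A)   # ...but B has the higher posterior mean: 8/11 > 5/7
```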

So if someone is trying to get you to contribute to a charity, there are many interesting issues involved in deciding if their arguments are valid or just a bunch of… fertilizer.

Here is Nathan’s detailed calculation:

It’s fun to work out an official ‘correct’ answer mathematically, as John suggested. Of course, this ends up being a long way of confirming the obvious—and the answer is only as good as the assumptions—but I think it’s interesting anyway. In this case, I’ll work it out by maximizing expected utility in Bayesian decision theory, for one choice of utility function. This dodges the whole risk aversion point, but it opens discussion for how the assumptions might be modified to account for more real-world considerations. Hopefully others can spot whether I’ve made mistakes in the derivations.

In Bayesian decision theory, the first thing you do is write down the data-generating process and then compute a posterior distribution for what is unknown.

In this case, we may assume the data-generating process (likelihood function) is a binomial distribution for $k$ successes in $n$ trials, given a probability of success $p$. Fertilizer A corresponds to $k = 3$, $n = 4$, and fertilizer B corresponds to $k = 24$, $n = 36$.

The probability of success $p$ is unknown, and we want to infer its posterior conditional on the data, $P(p \mid k, n)$. To compute a posterior we need to assume a prior $P(p)$ on $p$.

It turns out that the Beta distribution is conjugate to a binomial likelihood, meaning that if we assume a Beta distributed prior, then the posterior is also Beta distributed. If the prior is $\mathrm{Beta}(\alpha, \beta)$, then the posterior is $\mathrm{Beta}(\alpha + k, \beta + n - k)$.

One choice for a prior is a uniform prior on $p$, which corresponds to a $\mathrm{Beta}(1, 1)$ distribution. There are of course other prior choices which will lead to different conclusions. For this prior, the posterior is $\mathrm{Beta}(k + 1, n - k + 1)$. The posterior mode is

$$\hat{p}_{\mathrm{mode}} = \frac{k}{n},$$

and the posterior mean is

$$\hat{p}_{\mathrm{mean}} = \frac{k + 1}{n + 2}.$$

So, what is the inference for fertilizers A and B? I made a graph of the posterior distributions. You can see that the inference for fertilizer B is sharper, as expected, since there is more data. But the inference for fertilizer A tends towards higher success rates, which can be quantified.

Fertilizer A has a posterior mode of 3/4 = 0.75 and B has a mode of 2/3 = 0.667, corresponding to the sample proportions. The mode isn’t the only measure of central tendency we could use. The means are 0.667 for A and 0.658 for B; the medians are 0.686 for A and 0.661 for B. No matter which of the three statistics we choose, fertilizer A looks better than fertilizer B.
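
These numbers are easy to reproduce in pure Python (my own sketch; the Beta median has no simple closed form, so it is found by bisecting a numerically integrated CDF):

```python
# Reproduce the posterior summaries: k trees thrive out of n, uniform prior,
# so the posterior is Beta(k+1, n-k+1).

def beta_cdf(x, a, b, steps=4000):
    """Trapezoidal integration of the Beta(a, b) density up to x, normalized."""
    def integrate(upper):
        h = upper / steps
        total = 0.0
        for i in range(steps + 1):
            t = i * h
            w = 0.5 if i in (0, steps) else 1.0
            total += w * t ** (a - 1) * (1 - t) ** (b - 1)
        return total * h
    return integrate(x) / integrate(1.0)

def beta_median(a, b, tol=1e-6):
    """Invert the CDF at 0.5 by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

summaries = {}
for name, k, n in [("A", 3, 4), ("B", 24, 36)]:
    summaries[name] = {
        "mode": k / n,                          # posterior mode
        "mean": (k + 1) / (n + 2),              # posterior mean
        "median": beta_median(k + 1, n - k + 1),
    }

print(summaries)  # A: mode 0.75, mean ~0.667, median ~0.686; B: ~0.667, ~0.658, ~0.661
```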

But we haven’t really done “decision theory” yet. We’ve just compared point estimators. Actually, we have done a little decision theory, implicitly. It turns out that picking the mean corresponds to the estimator which minimizes the expected squared error in $p$, where “squared error” can be thought of formally as a loss function in decision theory. Picking the median corresponds to minimizing the expected absolute loss, and picking the mode corresponds to minimizing the 0–1 loss (where you lose nothing if you guess exactly and lose 1 otherwise).

Still, these don’t really correspond to a decision theoretic view of the problem. We don’t care about the quantity $p$ at all, let alone some point estimator of it. We only care about $p$ indirectly, insofar as it helps us predict something about what the fertilizer will do to new trees. For that, we have to move from the posterior distribution to the predictive distribution

$$P(\tilde{x} \mid k, n) = \int_0^1 P(\tilde{x} \mid p)\, P(p \mid k, n)\, dp,$$

where $\tilde{x}$ is a random variable indicating whether a new tree will thrive under treatment. Here I assume that the success of new trees follows the same binomial distribution as in the experimental group.

For a Beta posterior, the predictive distribution is beta-binomial, and the expected number of successes for a new tree is equal to the mean of the Beta distribution for $p$ – i.e. the posterior mean we computed before, $(k + 1)/(n + 2)$. If we introduce a utility function such that we are rewarded 1 util for a thriving tree and 0 utils for a non-thriving tree, then the expected utility is equal to the expected success rate. Therefore, under these assumptions, we should choose the fertilizer that maximizes the quantity $(k + 1)/(n + 2)$, which, as we’ve seen, favors fertilizer A (0.667) over fertilizer B (0.658).

An interesting mathematical question is, does this ever work out to a “non-obvious” conclusion? That is, can fertilizer A have a sample success rate which is greater than fertilizer B’s sample success rate, while expected utility maximization prefers fertilizer B? Mathematically, we’re looking for a set $(k_A, n_A, k_B, n_B)$ such that $k_A/n_A > k_B/n_B$ but $(k_A + 1)/(n_A + 2) < (k_B + 1)/(n_B + 2)$. (Also there are obvious constraints on $k$ and $n$, such as $0 \le k \le n$.) The answer is yes. For example, if fertilizer A has 4 of 5 successes while fertilizer B has 7 of 9 successes.
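
Such reversals can be found by brute force over small sample sizes. This sketch (my own addition) enumerates every case with sample sizes up to 10 and confirms the (4 of 5) versus (7 of 9) example is among them:

```python
# Brute-force search for "reversals": A's sample success rate beats B's,
# yet B has the higher posterior mean (k+1)/(n+2) under a uniform prior.
# Exact rational arithmetic avoids any floating-point tie-breaking issues.
from fractions import Fraction

reversals = []
for n_a in range(1, 11):
    for k_a in range(n_a + 1):
        for n_b in range(1, 11):
            for k_b in range(n_b + 1):
                rate_a, rate_b = Fraction(k_a, n_a), Fraction(k_b, n_b)
                mean_a = Fraction(k_a + 1, n_a + 2)
                mean_b = Fraction(k_b + 1, n_b + 2)
                if rate_a > rate_b and mean_a < mean_b:
                    reversals.append((k_a, n_a, k_b, n_b))

print((4, 5, 7, 9) in reversals)  # Nathan's example shows up
```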

By the way, on a quite different note: NASA currently rates the chances of the asteroid Apophis colliding with the Earth in 2036 at 4.3 × 10^{-6}. It estimates that the energy of such a collision would be comparable with a 510-megatonne thermonuclear bomb. This is ten times larger than the largest bomb actually exploded, the Tsar Bomba. The Tsar Bomba, in turn, was ten times larger than all the explosives used in World War II.

There’s an interesting Chinese plan to deflect Apophis if that should prove necessary. It is, however, quite a sketchy plan. I expect people will make more detailed plans shortly before Apophis comes close to the Earth in 2029.

I haven’t read the whole thing, but GiveWell seems to have a reasonable argument. It’s not really criticizing expected utility theory directly, but what they call “explicit expected utility”, where you use sample statistics instead of inferred population statistics. To me it’s completely sensible to want to “regularize” estimates with a prior.

Of course, you will end up with prior-dependent answers, as GiveWell notes, but this is really unavoidable, and you can use this to test the sensitivity of the decision rule to the choice of prior. This gets into “robust decision making” (finding rules that work well over a wide range of priors or other assumptions).

The BeerAdvocate example they discuss is also briefly mentioned in the Bayesian sorting by rank post by Aleks Jakulin (and the subsequent comments) on Andrew Gelman’s blog, with a sketch of how their ranking formula is derived.

There are other objections to expected utility decision making (even in a Bayesian context). For example, if you have the possibility of a “catastrophe” whose utility is hard to quantify, you may want to hedge somehow by constraining the decision to have a low probability of catastrophe, even if strictly speaking the “optimal” decision under expected utility would allow a higher probability of catastrophe. This leads to “chance-constrained expected utility maximization”. Or, you may want to account for risk aversion in the decision. People may choose to forgo a little expected utility in return for a much reduced risk of catastrophe.

For applications to environmental sciences, see this brief draft encyclopedia article on Bayesian decision theory and climate change.

There’s another wrinkle that is my take on something Nassim Taleb goes on about. (I think he pushes the argument further than justified, but it’s an interesting point.)

Firstly, we often use probabilities as a viewpoint on a system that may have internal dynamics we either can’t see or analyse. But other than very mechanistic scientific questions, the dynamics may modify at some point; that means your “probability view” ought to change. E.g., consider the probability distribution of the number of violent offences in a city. One point on the spectrum is rioting, but in the case of the recent UK riots it’s been argued that modern communications made escalation to riot status a different thing than it would have been in the 1980s, and quite likely the 1980s were different to the 1940s, etc.

But now factor in the problem of trying to observationally estimate the probability of an extreme event in order to mathematically develop an optimal strategy. Extreme events don’t happen every day, but you can certainly “observe” some tiny number of extreme events over history when “model building”. However, are the extreme events you’ve observed still “effective” probabilities for use in modelling the extreme end of your decision theory, or are they likely to no longer be representative? Of course, if your utility function drops off quickly enough, extreme events will have a negligible effect and you’ll be “driven” by the common values. Conversely, if your utility values responding to extreme events, you’re unlikely to actually have observed probabilities that will be “accurate” in telling you the likelihood/magnitude of extreme events happening in the future.

As I understand it Taleb is making slightly different arguments, but his conclusion seems to be “use math models to deal with the common-enough stuff, but don’t try and deal with extreme events in that framework because you can’t. If you want to deal with extreme possibilities you should use other thinking.” I think there’s something to that viewpoint.

A riot is *correlated* violence to people or property. The correlations may be directly causal, perhaps through modern communications, or more indirectly causal, perhaps through a long-standing generally perceived injustice. In the case of a riot, there may be very high order n-point connected correlations. It’s essentially a state change, in which local and global environments both play a part. In any case, it’s a situation when statistical independence of actors is a doubtful assumption.

The kind of thing I was thinking is that one component of the possible “mechanism” for riots, particularly across a country, is two waves of communication to currently uninvolved people: one that informs people that “people seem to be involved in violence” and a delayed one that informs people that “the police are ready to crack down/there will be consequences”. In a given situation it’s certainly arguable that in a “platonic mathematics” sense you could approximate this partly deterministic phenomenon with a probability. But for other reasons you mention, riots are relatively rare. So is it “meaningful” to use the fact there was a set of riots in Tottenham in the UK in 1985 in a model of the likelihood of various levels of street disturbance happening in the UK in 2012? They had TV/radio/landline phones in ’85 but not mobile phones with social network access. That’s the question about rare events: does their relatively low frequency mean that rare events in the future will happen in different ways to those recorded in the past (in contrast to common events, which are frequent enough that they can be estimated by very recent observations whose mechanism is likely to be relatively little changed in the near future)?

For an example of robust decision making applied to a stylized climate problem, see the article by Lempert and Collins (2007), as well as drafts of McInerney et al. and Hall et al.

For the fertilizer problem, whether I purchase on the basis of the information available or wait for more data depends on the expected losses if the information is inapplicable. If I want to plant 1000 fir trees, I want to know whether the trees in the two samples are fir trees, or enough like fir trees for the data to be trustworthy, and whether there are larger samples available for fertilizers that are “like” A or B. Were different amounts of fertilizer used in the two cases, and what are the relative costs of fertilizer applications? If I want to plant 100,000 trees, I might do my own experiment to determine which gives better results, and the relative cost of using no fertilizer at all might be the best answer. Someone who has 40 years arboriculture experience would have much more detailed supplementary questions to ask.

If something significant rests on the data, I also want to assess the independence of the samples. Were the samples of 4 and of 36 planted by the same people or independently by 40 different people? The assumption of independence is never completely accurate (in the restaurant example, it would be assumed by a cynic that 3 5-star reviews are all by relatives of the owner unless proven otherwise).

The fertilizer question is determinedly phrased to enforce the uniform prior, by not saying anything about the manufacturers, on the basis of which the prior distributions might well be non-uniform for someone who has expert knowledge, although justifying a given quantitatively non-uniform prior would of course be problematic.

If, as some hold, we do adjust our beliefs in something approaching a Bayesian way, we must do a lot of encoding of background knowledge into a prior. Use of a $\mathrm{Beta}(\alpha, \beta)$ prior in a Bernoulli situation is like equating your knowledge before any trials have taken place to having made $\alpha + \beta$ pseudo-observations: $\alpha$ successes and $\beta$ failures. One would imagine for a coin you’d chosen yourself from general circulation, and had a chance to examine, measure and weigh, and which you knew would be placed random side up in some chaotic flipping device of your own making, that even a sequence of 8 heads would not encourage you to put the chance of a ninth head too far above 0.5. Your prior $\alpha$ and $\beta$ would have been set quite high.
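
To put numbers on this (my own illustration; the particular prior is an arbitrary choice): with a strong prior like Beta(50, 50), i.e. 50 pseudo-heads and 50 pseudo-tails, eight straight heads barely move the posterior mean, while under a uniform Beta(1, 1) prior they move it a lot.

```python
# Beta(alpha, beta) prior as pseudo-observations: after seeing h heads and
# t tails, the posterior mean is (alpha + h) / (alpha + beta + h + t).
# The alpha = beta = 50 "well-examined coin" prior is an illustrative choice.

def posterior_mean(alpha, beta, heads, tails):
    return (alpha + heads) / (alpha + beta + heads + tails)

confident = posterior_mean(50, 50, heads=8, tails=0)  # strong prior
vague = posterior_mean(1, 1, heads=8, tails=0)        # uniform prior

print(round(confident, 3))  # stays near 0.5
print(round(vague, 3))      # jumps toward 1
```

With the strong prior, eight heads only lift the estimate to about 0.537; with the uniform prior it jumps to 0.9.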

Of course, some prior knowledge won’t be representable like this. If I have to give a prior for a gangster’s coin I’m fairly sure is weighted one way or the other, I might opt for a sum of two Beta distributions, each favouring one of the faces.

I’m at least as interested in knowing what happens to the trees if we use no fertilizer at all. That is, don’t guess a fresh prior ignoring our history, but use all the experience available! This includes the fact that misuse of fertilizer can kill an otherwise healthy tree.

While we can argue about details of the model, I think the point is that the posterior expected utility and the sample performance may be topologically distinct statistics, given some prior — that is, they are separated by a mapping to permutations. This sort of feature we’d not expect to disappear under generic perturbations!

John wrote:

Either “that’s the red curve” and “that’s the black curve” are reversed here, or I’m really confused.

Yes, the attribution was reversed.

Thanks. Your description of Bayesian experimental design is the clearest and most useful I’ve ever seen; I grok it much better now than I did using the Wikipedia article on it and other resources I’d found.

Steve wrote:

Whoops. I fixed it now.

Also, you say the red curve is “when we apply the fertilizer to 36 trees, 12 thrive.” But it’s when 24 thrive.

Fixed. I think that when I’m copying existing information I get bored, start thinking about other things, and make more mistakes.

Assuming that Apophis is the only truly dangerous asteroid in the next 30 years, that it would wipe out 1 billion people if it were to hit the Earth (likely a major overestimate), and that it takes $10 million to come up with a foolproof plan to deflect it ahead of time (likely an underestimate), the cost works out to $10 million / (1 billion people × 4.3 × 10^{-6} ≈ 4300 expected lives), or about $2300 per life saved. Obviously, giving away food in Somalia is much cheaper.

To make the tradeoff more meaningful, you need to come up with an extinction-type event that is much more likely to occur and/or cheaper to prevent.

On the other hand, the problem with giving away food in Somalia is that you’re not achieving the result that you think you’re achieving. In a society in a Malthusian trap, giving away food today, without changing either the carrying capacity of the land or birth rates, only ensures that you’ll need to give away more food later or people will start dying again. So the “gross” effect of giving away $1 in food may be saving one life now, but the “net” effect in the long run may even be negative.

Nice analysis, Eugene!

By the way, I wasn’t trying to make my example here realistic:

I was just trying to give a simple situation where the expected value of lives saved per dollar is 1 in the first case, 10 in the second case—but the probability of saving a life is 100% in the first case, and only .000001% in the second case.

If there were no complicating factors, which would you prefer?

But of course there are always complicating factors. For example, you mention the Malthusian factor: sometimes a life saved now means one or more lives lost in the future!

The blog post was mainly about another complicating factor. The first charity can, in theory, reliably state that out of 1000 cases, they’ve saved an average of *n* children’s lives per dollar. The second charity is not in the position to say “out of 1000 asteroids, we have successfully deflected *n* per dollar spent.” Their probability estimates are backed up by less evidence… so our Bayesian prior matters more!

On another note: I agree that it seems unlikely a collision with Apophis would kill a billion people. Apophis is just 350 meters across and a collision would release an energy equivalent to about a 510-megatonne bomb. By comparison, the asteroid that hit Chicxulub and killed off the dinosaurs was about 10,000 meters across, and the collision released an energy equivalent to about a 240,000-megatonne bomb.

If there were no complicating factors, I’d prefer the one with greater potential catastrophic outcome. (Taking the extreme limit, it seems self-evident that an intervention which reduces the chance of complete annihilation of human civilization should be preferred over an intervention that only saves some lives, even if the latter is cheaper in terms of $ per life saved.)

But, at the same time, there’s probably a lower limit on the price per life saved for any such interventions still available, and it is rather high (anything sufficiently deadly would’ve been addressed by a government of at least one developed country by now). I’m pretty sure that it’s way higher than $1/life. We already track near-earth asteroids, and we guard our stockpiles of biological weapons, and we have a military that should be able to knock out any dictator who tries to think seriously about nuclear holocaust. The remaining risks are either improbable (Chicxulub redux), expensive to defend against (a 90% mortality flu pandemic), or both (a nearby supernova).

And yes, I know that I’m still missing the point of the blog post. :)

Eugene wrote:

Umm, I hope you mean with the greater potential of *preventing* a catastrophic outcome.

Just to be 100% clear: the “$1/life” figure in my example was never intended to be realistic. VillageReach, a charity strongly recommended by GiveWell, is estimated to save one child’s life per $545 spent.

It’s only called “missing the point” if you failed to see what I was trying to say. If you just feel like talking about something else, that’s fine! There are lots of interesting issues to talk about in this general area. I wish more people would join these conversations…

$545/life, really? I’d have thought that they can do much better. At the current wholesale prices, you can buy 2000 calories worth of sugar for $0.40, or 2000 calories worth of corn for as low as $0.16 (and these are considered record-breaking, never-before-seen prices. Ten years ago the same quantity of calories would’ve cost you one quarter of that money.) Preventing famine seems much cheaper than $545/life.

Though I probably shouldn’t, I am going to quickly respond to the question of which fertilizer to buy. The question “which product, with its inherent risk and payoff, do you choose?” is the same process an investment company will go through in introducing a new customer to its various investment products. At an investment firm, they simply ask what your “risk aversion” is, that is, how comfortable you are with risk. As an individual investor with excess cash lying around, your “risk aversion” is just a taste or preference. Perhaps you are one to “eat a lot of risk”, perhaps not, and there is no sense in doing a calculation, though the Bayesian approach might be more appealing for assessing what the buyer will choose, as it takes into account their beliefs, or better put, their preferences.

Now, if you are a business and you are buying the fertilizer as an input to your company (perhaps you are a food company), then a calculation makes a bit more sense, though perhaps not much. For instance, the more tested fertilizer would appeal to the company that has made precise promises as to its production this year. Despite the fact that there seems to be a potentially lower yield, they could simply buy more land and plant more things and hit the gross yield with a very accurately projected profit margin. Again, however, this is in contrast to a company which has made no promises and is interested in taking a risk on getting a higher profit margin. It is the same thing as the individual investor. It will come down to the tastes of the owner of the company.

In my hardware development past, we used to test by taking a small sample of prototypes and subjecting it to varying voltages, temperatures, and frequencies, also known as shmoo testing. We did this to learn more about the design, because we couldn’t afford the time or money to build with every combination of parts selected from the extremes of their tolerance ranges. We were not trying to develop acceptance criteria for a large population. Inevitably, we’d find a spot on the shmoo plot where we needed to do some work. At that point, the program manager would ask if we could screen for that point as a production test, and we’d have to try and explain the difference between our observed failure rate and what the actual failure rate would be in a large population. Were we just unlucky and got the few bad eggs in an otherwise healthy population, or did we just test the cream of the crop and the rest of it should be sent to the recycle bin?

After searching a bit, we found literature on a relationship between an observed failure rate in a small test population and the potential failure rate in a much larger population at a given confidence level. To compare, to an 80% confidence level, the observed failure rate of 25% on a sample of 4 meant that the actual failure rate of fertilizer A on a big orchard could be as low as 2.6% or as high as 68%, while the observed failure rate of 33% on a sample of 36 meant that the actual failure rate of fertilizer B on a big orchard could be as low as 23% or as high as 45%. It was a useful tool to use to explain how hard it is to “test in” quality as opposed to design it in.

What we started with was an old Bureau of Standards article from the 50s with graphs we extrapolated from, but it turned out that it was based on beta probability distributions and we could duplicate it with the betainv() function in Excel:

Gamma = (Confidence + 1.0) / 2

Pmin(Gamma, Sample Size, Percent Defective) = 1 – betainv(Gamma, Sample Size * (1 – Percent Defective) + 1, Sample Size * Percent Defective)

Pmax(Gamma, Sample Size, Percent Defective) = betainv(Gamma, Sample Size * Percent Defective + 1, Sample Size * (1 – Percent Defective))
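
These Excel formulas are easy to reproduce; here is a pure-Python sketch (my own addition) where `betainv` plays the role of Excel’s function, implemented by bisecting a numerically integrated Beta CDF. The fertilizer numbers come out matching the bounds quoted above:

```python
# Pure-Python equivalent of the Excel betainv() bounds above.
# betainv(p, a, b) is the inverse CDF (quantile function) of Beta(a, b).

def beta_cdf(x, a, b, steps=4000):
    """Trapezoidal integration of the Beta(a, b) density up to x, normalized."""
    def integrate(upper):
        h = upper / steps
        total = 0.0
        for i in range(steps + 1):
            t = i * h
            w = 0.5 if i in (0, steps) else 1.0
            total += w * t ** (a - 1) * (1 - t) ** (b - 1)
        return total * h
    return integrate(x) / integrate(1.0)

def betainv(p, a, b, tol=1e-6):
    """Invert the Beta CDF at p by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def failure_bounds(confidence, n, f):
    """Bounds on the true failure rate, given observed fraction f in n samples."""
    gamma = (confidence + 1.0) / 2
    p_min = 1 - betainv(gamma, n * (1 - f) + 1, n * f)
    p_max = betainv(gamma, n * f + 1, n * (1 - f))
    return p_min, p_max

# Fertilizer A: 1 of 4 died; fertilizer B: 12 of 36 died; 80% confidence.
print(failure_bounds(0.80, 4, 0.25))    # roughly (0.026, 0.68)
print(failure_bounds(0.80, 36, 1 / 3))  # roughly (0.23, 0.45)
```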

The example of the asteroid donation being “better” than the Somalia donation overlooks the obvious – it is factually stated that $1 will save a child, but as far as can be determined, $1 donated to save us from an asteroid will have no effect whatsoever. To assume that it would is kind of like saying that the effect of one person peeing into a bucket would have been a good effort toward preventing Mrs. O’Leary’s cow from being barbecued in Chicago.