Monte Carlo Methods in Climate Science

joint with David Tweed

One way the Azimuth Project can help save the planet is to get bright young students interested in ecology, climate science, green technology, and stuff like that. So, we are writing an article for Math Horizons, an American magazine for undergraduate math majors. This blog article is a draft of that. You can also see it in PDF form here.

We’d really like to hear your comments! There are severe limits on including more detail, since the article should be easy to read and short. So please don’t ask us to explain more stuff: we’re most interested to know if you sincerely don’t understand something, or feel that students would have trouble understanding something. For comparison, you can see sample Math Horizons articles here.

Introduction

They look placid lapping against the beach on a calm day, but the oceans are actually quite dynamic. The ocean currents act as ‘conveyor belts’, transporting heat both vertically between the water’s surface and the depths and laterally from one area of the globe to another. This effect is so significant that the temperature and precipitation patterns can change dramatically when currents do.

For example: shortly after the last ice age, northern Europe experienced a shocking change in climate from 10,800 to 9,500 BC. At the start of this period temperatures plummeted in a matter of decades. It became 7° Celsius colder, and glaciers started forming in England! The cold spell lasted for over a thousand years, but it ended as suddenly as it had begun.

Why? The most popular theory is that a huge lake in North America formed by melting glaciers burst its banks—and in a massive torrent lasting for years, the water from this lake rushed out to the northern Atlantic ocean. By floating atop the denser salt water, this fresh water blocked a major current: the Atlantic Meridional Overturning Circulation. This current brings warm water north and helps keep northern Europe warm. So, when it shut down, northern Europe was plunged into a deep freeze.

Right now global warming is causing ice sheets in Greenland to melt and release fresh water into the North Atlantic. Could this shut down the Atlantic Meridional Overturning Circulation and make the climate of Northern Europe much colder? In 2010, Keller and Urban [KU] tackled this question using a simple climate model, historical data, probability theory, and lots of computing power. Their goal was to understand the spectrum of possible futures compatible with what we know today.

Let us look at some of the ideas underlying their work.

Box models

The earth’s physical behaviour, including the climate, is far too complex to simulate from the bottom up using basic physical principles, at least for now. The most detailed models today can take days to run on very powerful computers. So, to make reasonable predictions on a laptop in a tractable time-frame, geophysical modellers use some tricks.

First, it is possible to split geophysical phenomena into ‘boxes’ containing strongly related things. For example: atmospheric gases, particulate levels and clouds all affect each other strongly; likewise the heat content, currents and salinity of the oceans all interact strongly. However, the interactions between the atmosphere and the oceans are weaker, and we can approximately describe them using just a few settings, such as the amount of atmospheric CO2 entering or leaving the oceans. These interactions must be consistent—for example, the amount of CO2 leaving the atmosphere box must equal the amount entering the ocean box—but breaking a complicated system into parts lets different specialists focus on different aspects; then we can combine these parts and get an approximate model of the entire planet. The box model used by Keller and Urban is shown in Figure 1.



1. The box model used by Keller and Urban.

 
Second, it turns out that simple but effective box models can be distilled from the complicated physics in terms of forcings and feedbacks. Essentially a forcing is a measured input to the system, such as solar radiation or CO2 released by burning fossil fuels. As an analogy, consider a child on a swing: the adult’s push every so often is a forcing. Similarly a feedback describes how the current ‘box variables’ influence future ones. In the swing analogy, one feedback is how the velocity will influence the future height. Specifying feedbacks typically uses knowledge of the detailed low-level physics to derive simple, tractable functional relationships between groups of large-scale observables, a bit like how we derive the physics of a gas by thinking about collisions of lots of particles.

However, it is often not feasible to get actual values for the parameters in our model starting from first principles. In other words, we can often get the general form of the equations in our model, but they contain a lot of constants that we can estimate only by looking at historical data.
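To make the idea of forcings and feedbacks a bit more concrete, here is a minimal sketch in Python of a one-box energy balance model—much simpler than the model in Figure 1. The forcing is the energy pumped into the box, while the feedback term describes how the current temperature anomaly influences its own future change. The particular numbers, the linearly ramping forcing and the function name are our own choices, made purely for illustration:

import numpy as np

def run_box_model(forcing, feedback=1.3, heat_capacity=8.0, dt=1.0):
    # Integrate C dT/dt = F(t) - lambda*T with a simple Euler step.
    # forcing:       array of radiative forcing values (W/m^2), one per time step
    # feedback:      feedback parameter lambda (W/m^2 per kelvin) -- a 'setting'
    # heat_capacity: effective heat capacity C (W yr/m^2 per kelvin) -- a 'setting'
    T = np.zeros(len(forcing))          # temperature anomaly in kelvin
    for t in range(len(forcing) - 1):
        T[t + 1] = T[t] + dt * (forcing[t] - feedback * T[t]) / heat_capacity
    return T

# Example: a made-up forcing that ramps up linearly over 200 years.
years = np.arange(200)
temperature = run_box_model(0.02 * years)

The two constants here—the feedback parameter and the heat capacity—are exactly the kind of ‘settings’ we discuss next: the physics tells us the general form of the equation, but their values must be estimated from data.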

Probability modeling

Suppose we have a box model that depends on some settings S. For example, in Keller and Urban’s model, S is a list of 18 numbers. To keep things simple, suppose the settings are elements of some finite set. Suppose we also have a huge hard disc full of historical measurements, and we want to use this to find the best estimate of S. Because our data is full of ‘noise’ from other, unmodeled phenomena, we generally cannot unambiguously deduce a single set of settings. Instead we have to look at things in terms of probabilities. More precisely, we need to study the probability that S takes some value s given that the measurements take some value. Let’s call the measurements M, and again let’s keep things simple by saying M takes values in some finite set of possible measurements.

The probability that S = s given that M takes some value m is called the conditional probability P(S=s | M=m). How can we compute this conditional probability? This is a somewhat tricky problem.

One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability that given setting values S = s, the model predicts measurements M=m. This again is a conditional probability, but now it is called P(M=m|S=s).

This is not what we want: it’s backwards! But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

\displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }

Here P(S = s) is the probability that the settings take a specific value s, and similarly for P(M = m). Bayes’ rule is quite easy to prove, and it is actually a general rule that applies to any random variables, not just the settings and the measurements in our problem [Y]. It underpins most methods of figuring out hidden quantities from observed ones. For this reason, it is widely used in modern statistics and data analysis [K].

How does Bayes’ rule help us here? When we repeatedly run our model with randomly chosen settings, we have control over P(S = s). As mentioned, we can compute P(M=m| S=s). Finally, P(M = m) is independent of our choice of settings. So, we can use Bayes’ rule to compute P(S = s | M = m) up to a constant factor. And since probabilities must sum to 1, we can figure out this constant.
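Here is a tiny numerical sketch of that last step, in Python, using made-up likelihood values for a handful of possible settings: we multiply each likelihood by the probability we assigned to that setting, then divide by the total so that the results sum to 1.

import numpy as np

settings   = np.array([0.8, 0.9, 1.0, 1.1])              # possible values of s
likelihood = np.array([0.01, 0.20, 0.65, 0.02])          # made-up P(M=m | S=s)
prior      = np.full(len(settings), 1 / len(settings))   # P(S=s): each value equally likely

posterior  = likelihood * prior          # proportional to P(S=s | M=m), by Bayes' rule
posterior /= posterior.sum()             # fix the constant factor: probabilities sum to 1
# posterior[i] is now our estimate of P(S = settings[i] | M = m)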

This lets us do many things. It lets us find the most likely values of the settings for our model, given our hard disc full of observed data. It also lets us find the probability that the settings lie within some set. This is important: if we’re facing the possibility of a climate disaster, we don’t just want to know the most likely outcome. We would like to know that with 95% probability, the outcome will lie in some range.

An example

Let us look at an example much simpler than that considered by Keller and Urban. Suppose our measurements are real numbers m_0,\dots, m_T related by

m_{t+1} = s m_t - m_{t-1} + N_t

Here s, a real constant, is our ‘setting’, while N_t is some ‘noise’: an independent Gaussian random variable for each time t, each with mean zero and some fixed standard deviation. Then the measurements m_t will have roughly sinusoidal behavior but with irregularity added by the noise at each time step, as illustrated in Figure 2.
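For readers who want to experiment, here is one way to simulate this system in Python. The starting values, the noise level and the particular setting s = 1.8 are our own choices for illustration:

import numpy as np

def simulate(s, T=100, noise_sd=0.1, rng=None):
    # Generate measurements m_0, ..., m_T from m_{t+1} = s*m_t - m_{t-1} + N_t,
    # where N_t is Gaussian noise with mean zero and standard deviation noise_sd.
    if rng is None:
        rng = np.random.default_rng()
    m = np.zeros(T + 1)
    m[1] = 1.0                       # starting values: m_0 = 0, m_1 = 1
    for t in range(1, T):
        m[t + 1] = s * m[t] - m[t - 1] + rng.normal(0.0, noise_sd)
    return m

measurements = simulate(s=1.8, rng=np.random.default_rng(0))   # our 'observed' data

Running this a few times with the same s, and once more with a slightly different s, gives curves like those in Figure 2.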



2. The example system: the red curve shows the measurements predicted for one value of the setting s, the green curve is another simulation with the same s, and the blue curve is a simulation with a slightly different s.

 
Note that neither the curves themselves nor their differences give any clear signal that the green curve uses the correct setting value while the blue one does not: the noise makes it nontrivial to estimate s. This is a baby version of the problem faced by Keller and Urban.

Markov Chain Monte Carlo

Having glibly said that we can compute the conditional probability P(M=m | S=s), how do we actually do this? The simplest way would be to run our model many, many times with the settings set at S=s and determine the fraction of times it predicts measurements equal to m. This gives us an estimate of P(M=m | S=s). Then we can use Bayes’ rule to work out P(S=s | M=m), at least up to a constant factor.

Doing all this by hand would be incredibly time-consuming and error-prone, so computers are used for this task. Figure 3 shows the results for our example. As we keep running our model over and over, the curve showing P(M=m | S=s) as a function of s settles down to the right answer.
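Here is a rough Python sketch of this uniform sampling, continuing the example above. Since our measurements are real numbers, the model essentially never predicts them exactly, so as a crude workaround—our own choice, not the only possible one—we count the runs that land within some tolerance of the observed series:

import numpy as np

def estimate_likelihood(s, observed, runs=480, tol=1.5, seed=0):
    # Estimate P(M = m | S = s) by brute force: run the model 'runs' times at a
    # fixed setting s and count how often the output lands close to the data.
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(runs):
        sim = simulate(s, T=len(observed) - 1, rng=rng)
        if np.sqrt(np.mean((sim - observed) ** 2)) < tol:   # 'close' = small RMS distance
            hits += 1
    return hits / runs

# Uniform sampling, as in Figure 3: try every candidate value of s on a grid.
candidates  = np.linspace(1.6, 2.0, 41)
likelihoods = [estimate_likelihood(s, measurements) for s in candidates]

The tolerance here is a made-up number that would need tuning; the point is only to show the brute-force structure of the calculation.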


3. The estimates of P(M=m | S=s) as a function of s using uniform sampling, ending up with 480 samples at each point.

 

However, this is computationally inefficient, as the estimates based on small numbers of samples show: they have quite a few ‘kinks’, which only disappear after many more runs. The problem is that there are lots of possible choices of s to try. And this is for a very simple model!

When dealing with the 18 settings involved in the model of Keller and Urban, trying every combination would take far too long. A way to avoid this is Markov Chain Monte Carlo sampling. Monte Carlo is famous for its casinos, so a ‘Monte Carlo’ algorithm is one that uses randomness. A ‘Markov chain’ is a random walk: for example, where you repeatedly flip a coin and take one step right when you get heads, and one step left when you get tails. So, in Markov Chain Monte Carlo, we perform a random walk through the collection of all possible settings, collecting samples.

The key to making this work is that at each step on the walk a proposed modification s' to the current settings s is generated randomly—but it may be rejected if it does not seem to improve the estimates. The essence of the rule is:

The modification s \mapsto s' is randomly accepted with a probability equal to the ratio

\displaystyle{ \frac{P(M=m | S=s')}{ P(M=m | S=s)} }

Otherwise the walk stays at the current position.

If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples distributed according to P(S = s | M = m), at least when we treat every choice of settings as equally probable before looking at the data.
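Here is a sketch of this random walk in Python, continuing the example above. In this toy example we can take a shortcut and evaluate P(M = m | S = s) directly, since the noise is Gaussian: it is a product of Gaussian densities of the residuals, and we work with its logarithm to avoid numerical underflow. The proposal step size, the number of steps and the amount of the walk we discard are arbitrary choices made for illustration:

import numpy as np

def log_likelihood(s, observed, noise_sd=0.1):
    # log P(M = m | S = s): each residual m_{t+1} - (s*m_t - m_{t-1}) is an
    # independent Gaussian with mean zero and standard deviation noise_sd.
    resid = observed[2:] - (s * observed[1:-1] - observed[:-2])
    return np.sum(-0.5 * (resid / noise_sd) ** 2 - np.log(noise_sd * np.sqrt(2 * np.pi)))

def metropolis_walk(log_like, s_start, steps=5000, step_size=0.02, seed=1):
    rng = np.random.default_rng(seed)
    s, ll = s_start, log_like(s_start)
    samples = []
    for _ in range(steps):
        s_new = s + rng.normal(0.0, step_size)   # propose a small random change
        ll_new = log_like(s_new)
        # Accept with probability min(1, ratio of likelihoods); otherwise stay put.
        if np.log(rng.random()) < ll_new - ll:
            s, ll = s_new, ll_new
        samples.append(s)
    return np.array(samples)

samples = metropolis_walk(lambda s: log_likelihood(s, measurements), s_start=1.7)
samples = samples[500:]    # one of the 'additional tricks': discard the start of the walk

A histogram of the remaining samples approximates the distribution P(S = s | M = m) shown in Figure 4.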

Figure 4 shows the results of using the Markov Chain Monte Carlo procedure to figure out P(S= s| M= m) in our example.


4. Estimates of P(S = s | M = m) as a function of s using Markov Chain Monte Carlo, showing the current distribution estimate at increasing intervals. The red line shows the current position of the random walk. Again, the kinks are almost gone in the final distribution.

 

Note that the final distribution was obtained from only about 66 thousand simulations in total, while the full sampling performed over 1.5 million. The key advantage of Markov Chain Monte Carlo is that it avoids performing many simulations in areas where the probability is low, as we can see from the way the walk path remains under the big peak in the probability density almost all the time. What is more impressive is that it achieves this without any global view of the probability density, just by looking at how P(M=m | S=s) changes when we make small changes in the settings. This becomes even more important as we move to systems with many more dimensions and settings, where it proves very effective at finding regions of high probability density, whatever their shape.

Why is it worth doing so much work to estimate the probability distribution of settings for a climate model? One reason is that we can then estimate probabilities of future events, such as the collapse of the Atlantic Meridional Overturning Circulation. And what’s the answer? According to Keller and Urban’s calculation, this current will likely weaken by about a fifth in the 21st century, but a complete collapse is unlikely before 2300. This claim needs to be checked in many ways—for example, using more detailed models. But the importance of the issue is clear, and we hope we have made the importance of good mathematical ideas for climate science clear as well.

Exploring the topic

The Azimuth Project is a group of scientists, engineers and computer programmers interested in questions like this [A]. If you have questions, or want to help out, just email us. Versions of the computer programs we used in this paper will be made available here in a while.

Here are some projects you can try, perhaps with the help of Kruschke’s textbook [K]:

• There are other ways to do setting estimation using time series: compare some to MCMC in terms of accuracy and robustness.

• We’ve seen a 1-dimensional system with one setting. Simulate some multi-dimensional and multi-setting systems. What new issues arise?

Acknowledgements. We thank Nathan Urban and other members of the Azimuth Project for many helpful discussions.

References

[A] Azimuth Project, http://www.azimuthproject.org.

[KU] Klaus Keller and Nathan Urban, Probabilistic hindcasts and projections of the coupled climate, carbon cycle and Atlantic meridional overturning circulation system: a Bayesian fusion of century-scale measurements with a simple model, Tellus A 62 (2010), 737–750. Also available free online.

[K] John K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R and BUGS, Academic Press, New York, 2010.

[Y] Eliezer S. Yudkowsky, An intuitive explanation of Bayes’ theorem.

21 Responses to Monte Carlo Methods in Climate Science

  1. Jon Awbrey says:

    Doesn’t it seem to you though that the biggest obstacles to saving the planet are more political than scientific? Bright young students have come to college interested in “stuff like that” since I was an undergrad in the 60s, at least, but it little availed them or humanity or the planet when they ran into wall after wall of corporate disinformation and political obstruction. There is a great backlog of knowledge already but no general will to act on it — indeed, a massively greenbacked opposition to every wave of greening that comes along. Perhaps some thought should be given to the political front before sending more greenhorns into that valley.

    • John Baez says:

      It’s true that the biggest obstacles are political, and it’s true that more thought must be given to that.

      But this is not primarily what the Azimuth Project is about. The Azimuth Project is about things that engineers, scientists and mathematicians can do to save the planet. So the idea is something like this: young mathematicians read Math Horizons and get ideas for cool things to work on. Most of these ideas have nothing to do with climate science, ecology, sustainability, or the like. They may not even know how math is relevant to these issues.

      I think it’s good to plant this idea in their head: math is relevant to these issues. Getting interested in these issues won’t hurt their chances of getting jobs compared to the other activities Math Horizons describes. And some of these students will wind up doing important things.

      Of course, if they’re not already set on becoming mathematicians—if all they want is to solve the world’s problems—it might be better for them to become politicians.

      But in any case: having already written this article, we would like people to give us suggestions about how to make it easier to understand.

  2. Peter Morgan says:

    Proofreading, 3rd para of introduction, “that that”, “iit”. 2nd para of “box models”, “containing related strongly related things” seems stilted; I think it means, from what comes later, that the boxes are weakly related, but the contents of the boxes are strongly related. “Clearly these interactions must be consistent” — IMO, “of course”, “clearly”, and similar expressions are usually best removed; in this case, just omitting “Clearly” works just fine; part of this is a pedantry, that the “clearly” is only true if *all* possible sources and sinks are in the model, which, ultimately, they are not. “Second, it turn out” needs an ‘s’ on “turn”. “the settings are element of some finite set” needs an ‘s’ on “element” (or “each setting is an…”). But the article seems quite clear.

    • John Baez says:

      Thanks very much for the proofreading—I agree with all your comments! I will keep polishing the prose, but I’m glad the article made sense to you.

  3. John Davidson says:

    This article would benefit from a good proofreader. When explaining the Monte Carlo method the user always “talks” a right turn, whether the coin is heads or tails. I am certain this is not correct.

    • John Baez says:

      It’s “takes”. Thanks. The article will get proofread by the magazine (if it’s accepted), so right now I’m most eager to get comments about the way we explain the ideas: is there anything that’s hard to understand?

  4. Dan says:

    I’m a little confused. In my limited understanding of MCMC, I thought it was used to sample directly from the posterior (with your notation, I believe the posterior would be P(S=s|M=m), correct?). You seem to be saying that you’re using MCMC to sample from the likelihood P(M=m|S=s) and then getting the posterior by Bayes theorem. Is that really what you’re doing? If so, how do you use Bayes theorem to get a posterior when you only have the likelihood as samples? It seems like an empirical estimate would be pretty inefficient. Why not just sample from the posterior?
    In particular, if you’re doing Metropolis and if I understand your notation and what you’re trying to achieve, I think you ought to be looking at the ratio

    \frac{P(M=m|S=s') P(S=s')}{P(M=m|S=s)P(S=s)}

    Or maybe I’m just really confused? Wouldn’t be the first time….

    • John Baez says:

      If we’re as mixed up as you think, we’re seriously mixed up.

      To make probabilistic predictions like “what’s the probability that the Atlantic Meridional Overturning Circulation will collapse if we pump enough CO2 into the atmosphere to double its concentration by 2100, and then stop?”, we need to know the posterior P(S = s | M = m). We say this. You seem to agree with that.

      But we also say this quantity is very hard to compute directly for a complicated climate model, while P(M = m | S = s) is easier:

      Suppose we have a box model that depends on some settings S. For example, in Keller and Urban’s model, S is a list of 18 numbers. To keep things simple, suppose the settings are element of some finite set. Suppose we also have huge hard disc full of historical measurements, and we want to use this to find the best estimate of S. Because our data is full of ‘noise’ from other, unmodeled phenomena we generally cannot unambiguously deduce a single set of settings. More precisely, we need to study the probability that S take some value s given that the measurements take some value. Let’s call the measurements M, and again let’s keep things simple by saying M takes values in some finite set of possible measurements.

      The probability that S = s given that M takes some value m is called the conditional probability P(S=s | M=m). How can we compute this conditional probability? This is a somewhat tricky problem.

      One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability that given setting values S = s, the model predicts measurements M=m. This again is a conditional probability, but now it is called P(M=m|S=s).

      This is not what we want: it’s backwards! But here Bayes’ rule comes to the rescue…

      If we’re mixed up, we’ll thank you for correcting us. But if we’re right, and you come to agree that we are, maybe something in the above explanation was not sufficiently clear or convincing. In that case, it would be good for us to improve it.

      • Dan says:

        Well, one of us is pretty mixed up. It’s probably me. But let me quote the bit that is confusing me, explain what it seems (to me) to be saying that you’re doing (which I think is far from standard practice, if not downright wrong), and then tell you what I think you ought to be doing (and quite possibly are, though it doesn’t sound like it :)). Anyway, here’s the quote from the section entitled “Markov Chain Monte Carlo”:

        The key to making this work is that at each step on the walk a proposed modification s' to the current settings s is generated randomly—but it may be rejected if it does not seem to improve the estimates. The essence of the rule is:

        The modification s \mapsto s' is randomly accepted with a probability equal to the ratio

        \displaystyle{ \frac{P(M=m | S=s')}{ P(M=m | S=s)} }

        Otherwise the walk stays at the current position.

        If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from which can be used to compute P(M=m | S=s). Then we can compute P(S = s | M = m) using Bayes’ rule.

        Now, to me, it sounds like you’re using the Metropolis algorithm to get your Markov chain. Within that algorithm, you are deciding to accept or reject a step based on consideration of the ratio

        \displaystyle{ \frac{P(M=m | S=s')}{ P(M=m | S=s)} }

        My understanding is that if you do that, then the stationary distribution of your Markov chain will be

        f(s|m) \equiv \frac{P(M=m|S=s)}{Z(m)}

        where Z(m)=\int P(M=m|S=s) ds is a normalizing constant making your likelihood into a probability density when thought of as a function of the settings s. In other words, I think the stationary distribution will be the posterior under the assumption of a uniform prior. Now, there isn’t anything wrong with this (although, if S is not compactly supported, you may get some numerical instabilities), but it doesn’t seem to be standard practice. But, forgetting that and pressing on, you then say that you run the algorithm and do the necessary tricks to get random samples from f(s|m). That’s fine, if you want the posterior under the assumption of a uniform prior. But now, you seem to be wanting to think of these samples from f(s|m) as samples from P(M=m|S=s) and then you’re somehow using these samples to get P(S=s|M=m) from Bayes rule? I’m not sure how you would do that. I mean, I guess you could construct a density estimator of f(s|m) from its samples, multiply by P(s), and then do a numerical integration to get the normalization, but that seems kind of convoluted. You’d basically be canceling out Z(m) numerically, which makes me wonder why you’d want to run the MCMC algorithm to begin with. Why not just numerically integrate P(M=m|S=s)P(S=s) to find its normalization?

        So, that’s what it sounds (to me) like you’re doing. Now, here’s what I think you ought to be doing (and quite possibly are, since I believe the following is standard practice). Make your decisions of whether or not to make a step based on the ratio:

        \frac{P(M=m | S=s') P(S=s')}{ P(M=m | S=s) P(S=s)}

        Then, the stationary distribution of the resulting Markov chain should be the posterior distribution P(S=s|M=m). So, you run the algorithm and you get random samples from the posterior. There is no need to use Bayes theorem again because you already used it (i.e., by Bayes, your step decision rule is basically the ratio of posteriors, but the unknown normalization P(M=m) cancels out making it possible to evaluate the ratio). Now you can use those random samples from the posterior to make a density estimator of the posterior, but more likely you’ll want to use them to calculate posterior expectations by appealing to the Law of Large Numbers for Markov Chains (a.k.a. the Ergodic Theorem):

        \lim_{n\rightarrow \infty} \frac{1}{n}\sum_{i=1}^n h(s_i) \rightarrow \int h(s) P(S=s|M=m) ds

        So, that’s all my cards on the table….

        • John Baez says:

          Thanks for the detailed description of the potential problem, Dan! You wrote:

          My understanding is that if you do that, then the stationary distribution of your Markov chain will be

          \displaystyle{ f(s|m) \equiv \frac{P(M=m|S=s)}{Z(m)} }

          That’s my understanding too.

          So, that’s what it sounds (to me) like you’re doing. Now, here’s what I think you ought to be doing (and quite possibly are, since I believe the following is standard practice). Make your decisions of whether or not to make a step based on the ratio:

          \displaystyle{ \frac{P(M=m | S=s') P(S=s')}{ P(M=m | S=s) P(S=s)} }

          Then, the stationary distribution of the resulting Markov chain should be the posterior distribution P(S=s|M=m).

          Okay, yes, if we had some non-uniform prior P(S = s) in mind for the ‘settings’ of our model, we could do that. But we’re using a uniform prior, so P(S = s) becomes invisible here. We’re doing this because—as far as I know—Keller and Urban are using a uniform prior on their 18 settings, with each setting uniformly distributed in an interval as shown on Table 1 on page 15 of their paper. From this they get a posterior distribution whose marginals are shown in Figure 2 of page 11.

          But I see your point. Even if we’re using a uniform prior, that’s a decision on our part: a uniform prior is just a special case of using some prior P(S = s), so the conceptually more significant formula is the one you mention. I’ll have to think about how this should affect our explanation. At the very least, I think we need to clarify the role that P(S = s) plays in the whole calculation.

          Thanks for putting in the time needed for me to get your point!

          (Of course, it’s possible I’m confused now, but I know enough smart people that I can make sure this is straightened out by the time we submit the paper.)

        • Dan says:

          Okay, that makes more sense now. Thanks for taking the time to hear me out and explain things. Now that I understand what you’re doing, I would suggest that you change

          If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from which can be used to compute P(M=m | S=s). Then we can compute P(S = s | M = m) using Bayes’ rule.

          to something like

          If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from P(S=s|M=m).

          That is, leave out the stuff about using Bayes theorem again, because as far as I can tell you don’t use it again. And just go ahead and call the thing you’ve got samples from the posterior, because that’s what it is and that’s what you say you’re showing in figure 4. Alternately, you could call it P(M=m|S=s) thought of as a function of s and normalized to be a probability density. But I think it is confusing to say that you’re doing all of this MCMC stuff “to compute P(M=m|S=s)” because you really need to already know how to evaluate P(M=m|S=s) in order to do the analysis. P(M=m|S=s) is your stochastic model for how the data are generated given the settings. You have to pick one of those before you can do anything (and, of course, you did; it is implicitly defined by m_{t+1} = s m_t - m_{t-1} + N_t with N_t being Gaussian noise).

          But, however you decide to write it up, I guess what I’m saying is that the following sentence and a half

          …this gives a set of samples from which can be used to compute P(M=m | S=s). Then we can compute P(S = s | M = m) using Bayes’ rule.

          were ultimately my source of confusion. Hope that helps in some small way.

        • Nathan Urban says:

          John,

          We use bounded uniform priors on all the parameters except climate sensitivity, which uses a truncated Cauchy prior. The preprint you link to is not the final published version. Use my version.

  5. nick says:

    First off, great article! As someone who studied math and physics as an undergrad, and has gone on to graduate school in oceanography (which I learned about only after finishing school), I think a greater exposure to the problems in earth sciences for undergrads is vital.

    Secondly, there are a few typos/points of confusion:

    i) “containing related strongly related things”

    This seems awkwardly phrased.

    ii) “it turn out that simple but effective ”

    turns

    iii) Figure 2: “red are predicted measurements”

    I guess this is slightly ambiguous, depending on whether you’re referring to red as a line or a collection of points, but for consistency with the rest of your figure 2 description, I think it should be something like “red is the predicted measurement”

    iv) “modification s’ to the current settings s’ ”

    Do you mean “modification s to the current settings s’ “?

    v) I feel like the word nonlinear should be included somewhere in this article.

    Anyways, good job on getting the word out that there is plenty of room in these fields for mathematicians/physicists.

    Nick

    • John Baez says:

      Nick wrote:

      “containing related strongly related things”

      This seems awkwardly phrased.

      It was just a typo for “strongly related”, not a deliberate phrasing. Fixed!

      “red are predicted measurements”

      I guess this is slightly ambiguous, depending on whether you’re referring to red as a line or a collection of points, but for consistency with the rest of your figure 2 description, I think it should be something like “red is the predicted measurement”

      I’ll have to think about that. Right now we’re calling m the measurements, and in our example it’s a list of numbers m_1, \dots, m_T, each of which we could call ‘a measurement’, though actually we just say:

      Suppose our measurements are real numbers m_0,\dots, m_T related by

      m_{t+1} = s m_t - m_{t-1} + N_t

      So, the red curve shows “the measurements”.

      “modification s’ to the current settings s’ ”

      Do you mean “modification s to the current settings s’ “?

      No, we mean “the modification s’ to the current settings s”. In math, primes are often used for ‘modified’ or ‘changed’ things. Thanks for catching this mistake, though! Fixed.

      I feel like the word nonlinear should be included somewhere in this article.

      Hmm, personally I’m a bit tired of how people keep emphasizing that real life involves a lot of nonlinear phenomena.

  6. arch1 says:

    Thanks, I found this very helpful! It deals with a *lot* of concepts in a short article (Bayes’ formula, Monte Carlo, parameterized events, multidimensional parameters, and optimization via Markov Chains) so I think it will challenge people new to most of them. It needs some copy editing. I did find a couple of things confusing (one I think I resolved, one I’m not sure):

    1) “One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability that given setting values the model predicts measurements ”

    This was a head-scratcher, as it seemed to imply that the probability in sentence 2 was determined based on the randomness in sentence 1. But I couldn’t see how use of “randomly chosen settings” in the model could yield a nondeterministic value of m for a given settings choice s; and I didn’t know of any *other* randomness in the model which could produce this nondeterminism. (You earlier mention that noise causes this, but since that noise was “from other, unmodeled phenomena”, I jumped to the conclusion that it wasn’t included in the model). Only later when I saw your example did I realize a) you *do* model the noise as a separate random variable, b) the “randomly chosen settings” (the Markov Chain stuff) is only for efficiency.

    2) “When we repeatedly run our model with randomly chosen settings, we have control over P(S=s). As mentioned, we can compute P(M=m|S=s). Finally, P(M=m) is independent of our choice of settings. So, we can use Bayes’ rule to compute P(S=s|M=m) up to a constant factor.”

    I still find this confusing, maybe because I have a misconception. I get how “control over P(S=s)” in your MCMC runs helps compute P(M=m|S=s). But in order to use Bayes’ Rule to determine P(S=s|M=m), don’t you also need to know the value of P(S=s) in the real world? If so, how do you know or estimate that?

    • arch1 says:

      Oops, the first quote in my previous comment omitted some text. It should read: “…the probability that given setting values S = s, the model predicts measurements M=m.”

    • John Baez says:

      arch1 wrote:

      It deals with a *lot* of concepts in a short article (Bayes’ formula, Monte Carlo, parametrized events, multidimensional parameters, and optimization via Markov Chains) so I think it will challenge people new to most of them.

      Yes. In a way it’s hopelessly ambitious to introduce all these ideas in one short article, and I’m afraid our article is more dry and compressed than typical for this magazine. I’m hoping that adding a few carefully chosen sentences here or there could help… not to explain more information, or address nuances, but simply to make the existing information easier to digest.

      Or at the very least, we can try to remove all road bumps, like this:

      “One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability that given setting values S = s the model predicts measurements M = m.”

      This was a head-scratcher,

      Thanks! Yes, it could be confusing that we introduce randomly chosen settings here: it adds an extra layer of randomness that the reader won’t be expecting. A more obvious thing to do would be simply choose settings S = s, repeatedly run the model with these settings, and work out P(M = m| S = s).

      Of course we want to do this for lots of settings, and in the end we want to choose different settings s with different probabilities, or frequencies, P(S = s). Then we can use Bayes’ rule:

      \displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }

      I still find this confusing, maybe because I have a misconception. I get how “control over P(S=s)” in your MCMC runs helps compute P(M=m|S=s). But in order to use Bayes’ Rule to determine P(S=s|M=m), don’t you also need to know the value of P(S=s) in the real world?

      No: the ‘real world’ doesn’t know anything about our model or the probability that some settings of our model take some value S = s. In this stage of the calculation the probability P(S = s) is something we control, and our goal is to do that cleverly to efficiently compute P(S = s | M = m).

      However, I sympathize with your confusion, because we zipped through something very quickly: namely, what we do when we have P(S = s | M = m). This conditional probability tells us the probability that we should use various possible settings for our model as a function of some actual past or imagined future measurements. So it’s very useful, but it’s not the final answer to any real-world question.

      By the way, you once complained about what happens when you cut-and-paste a passage containing equations, like this:

      But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

      \displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }

      If you cut-and-paste it, for some stupid reason you get

      But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

      \displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }

      This looks like a mess, but you just need to put $latex in front of the equation, leaving a space before the other stuff, and $ after it, like this:

      But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

      $latex \displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} } $

      to get something nice:

      But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

      \displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }

      (The boldface on “Bayes’ rule” is still gone, but life is short.)

  7. arch1 says:

    Thanks for the careful explanation, John. Regarding my 2nd point of confusion, I’m still puzzled. I’m almost certainly misinterpreting or mis-thinking, but I’ll try to be clearer about what confuses me, on the chance that it helps clarify the article.

    In the MCMC runs which compute \displaystyle{ P(M = m | S = s )} as a function of s, your random-walk approach tends to cluster around s values which make that function big. In other words, for a given m, \displaystyle{ P(S = s) } for your MCMC runs tends to be large for precisely those s values for which \displaystyle{ P(M = m | S = s) } is also large.

    Are you saying that, having thus computed \displaystyle{ P(M = m | S = s) } , you then compute \displaystyle{ P(S = s | M = m) } by plugging this same value of \displaystyle{ P(S = s )} into the RHS of Bayes’ formula, and you then use the resulting \displaystyle{ P(S = s | M = m )} values to draw conclusions about actual or potential real world scenarios?

    If so, I don’t see how this can work, because it seems to me that in order for the LHS of Bayes’ formula to apply to the real world, the inputs on the RHS have to, also. But the \displaystyle{ P(S = s) } values which characterize your MCMC runs have no relation to the real world; rather, they were chosen solely to improve efficiency in an intermediate phase of the computation.

  8. Hi John,

    Limited comments now. More later. Great article. Great appetizer.

    (1) In the discussion of MCMC suggest putting a reference to [K] since you have him anyway and he does a superb job of introducing Metropolis-Rosenbluth-Hastings to the uninitiated, with problem sets. I would not sweat the details of burn-in or other nuances of Monte Carlo for the introductory.

    (2) I’d replace

    “Note how there is no clear signal from either the curves or the differences that the green curve is at the correct setting value while the blue one has the wrong one: the noise makes it nontrivial to estimate s.”

    with something like

    “Note how it’s difficult to make a compelling argument for why the green curve or the blue curve are more faithful representations of the red and, accordingly, which value of s is better. Their difference is due to the variability of noise.”

    followed by the “This is a baby version ….”

    (3) To overcome some of the confusion implicit in the comments above, it may be helpful to use less formal notation for Likelihood vs Posterior and simply say that more than one s value may be a candidate explanation and we want to estimate all of them, with associated probability masses. When I have a real computer, I’ll work up some proposed language for that.

    Nice job, all!

    — Jan

  9. Additional comment on the discussion above, including Dan’s comment of 25 July, 2013 at 12:31 pm: while MCMC is used in a Bayesian scheme to find or describe a posterior, it can be used to do the same for ANY density, irrespective. It’s used in Bayesian work but is just machinery.

  10. Note, BTW, that P(M = m) = \int P(M = m|S = s) P(S = s)\;\mathrm{d}s and that P(M = m) is sometimes called the ‘evidence’, per [K].
