John,

We use bounded uniform priors on all the parameters except climate sensitivity, which uses a truncated Cauchy prior. The preprint you link to is not the final published version. Use my version.

Limited comments now. More later. Great article. Great appetizer.

(1) In the discussion of MCMC, I suggest putting a reference to [K] since you have him anyway and he does a superb job of introducing Metropolis-Rosenbluth-Hastings to the uninitiated, with problem sets. I would not sweat the details of burn-in or other nuances of Monte Carlo in an introductory article.

(2) I’d replace

“Note how there is no clear signal from either the curves or the differences that the green curve is at the correct setting value while the blue one has the wrong one: the noise makes it nontrivial to estimate s.”

with something like

“Note how it’s difficult to make a compelling argument for why the green curve or the blue curve is a more faithful representation of the red and, accordingly, which value of s is better. Their difference is due to the variability of the noise.”

followed by the “This is a baby version ….”

(3) To overcome some of the confusion implicit in the comments above, it may be helpful to use less formal notation for Likelihood vs Posterior and simply say that more than one s value may be a candidate explanation and we want to estimate all of them, with associated probability masses. When I have a real computer, I’ll work up some proposed language for that.

Nice job, all!

— Jan

Okay, that makes more sense now. Thanks for taking the time to hear me out and explain things. Now that I understand what you’re doing, I would suggest that you change

If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from P(S = s) which can be used to compute P(M = m | S = s). Then we can compute P(S = s | M = m) using Bayes’ rule.

to something like

If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from the posterior P(S = s | M = m).

That is, leave out the stuff about using Bayes’ theorem again, because as far as I can tell you don’t use it again. And just go ahead and call the thing you’ve got samples from the posterior, because that’s what it is and that’s what you say you’re showing in figure 4. Alternately, you could call it P(M = m | S = s) thought of as a function of s and normalized to be a probability density. But I think it is confusing to say that you’re doing all of this MCMC stuff “to compute P(M = m | S = s)” because you really need to already know how to evaluate P(M = m | S = s) in order to do the analysis. P(M = m | S = s) is your stochastic model for how the data are generated given the settings. You have to pick one of those before you can do anything (and, of course, you did; it is implicitly defined by your formula for the measurements, with the discrepancy being Gaussian noise).

But, however you decide to write it up, I guess what I’m saying is that the following sentence and a half

…this gives a set of samples from P(S = s) which can be used to compute P(M = m | S = s). Then we can compute P(S = s | M = m) using Bayes’ rule.

were ultimately my source of confusion. Hope that helps in some small way.
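To make the point concrete, here is a minimal sketch of the fix being suggested—a hypothetical one-parameter toy model, not the article’s actual climate setup: a Metropolis walk whose accept/reject decision uses only the likelihood ratio. Under a uniform prior on an interval, the walk’s samples are already draws from the posterior, so there is no second application of Bayes’ rule.

```python
import math
import random

random.seed(0)

# Hypothetical toy setup (not the article's climate model): one
# setting s, measurements m_i = s + Gaussian noise (sigma = 1),
# uniform prior for s on the interval [0, 5].
data = [1.9, 2.2, 2.1, 1.8]
LO, HI = 0.0, 5.0

def log_likelihood(s):
    # log P(M = m | S = s), up to an s-independent constant
    return -sum((m - s) ** 2 for m in data) / 2.0

# Metropolis walk: accept a proposed s' with probability
# min(1, P(M = m | S = s') / P(M = m | S = s)).  With a uniform
# prior this equals the ratio of posteriors, so the walk's samples
# (after discarding the start) are draws from P(S = s | M = m).
s = 2.5
samples = []
for step in range(20000):
    s_new = s + random.gauss(0.0, 0.5)
    delta = log_likelihood(s_new) - log_likelihood(s)
    if LO <= s_new <= HI and random.random() < math.exp(min(0.0, delta)):
        s = s_new
    if step >= 2000:  # discard the very beginning of the walk
        samples.append(s)

# The samples are already posterior draws: their mean estimates the
# posterior mean of s (close to the data mean, 2.0, in this toy).
post_mean = sum(samples) / len(samples)
print(round(post_mean, 2))
```

Rejecting proposals outside [0, 5] is what enforces the uniform prior’s support here; nothing else about the prior enters the ratio.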

Thanks for the detailed description of the potential problem, Dan! You wrote:

My understanding is that if you do that, then the stationary distribution of your Markov chain will be P(M = m | S = s)/Z

That’s my understanding too.

So, that’s what it sounds (to me) like you’re doing. Now, here’s what I think you ought to be doing (and quite possibly are, since I believe the following is standard practice). Make your decisions of whether or not to make a step based on the ratio:

[P(M = m | S = s′) P(S = s′)] / [P(M = m | S = s) P(S = s)]

Then, the stationary distribution of the resulting Markov chain should be the posterior distribution P(S = s | M = m)

Okay, yes, if we had some non-uniform prior in mind for the ‘settings’ of our model, we could do that. But we’re using a uniform prior, so P(S = s) becomes invisible here. We’re doing this because—as far as I know—Keller and Urban are using a uniform prior on their 18 settings, with each setting uniformly distributed in an interval as shown in Table 1 on page 15 of their paper. From this they get a posterior distribution whose marginals are shown in Figure 2 on page 11.

But I see your point. Even if we’re using a uniform prior, that’s a *decision* on our part: a uniform prior is just a special case of using some prior, so the conceptually more significant formula is the one you mention. I’ll have to think about how this should affect our explanation. At the very least, I think we need to clarify the role that P(S = s) plays in the whole calculation.

Thanks for putting in the time needed for me to get your point!

(Of course, it’s possible I’m confused *now*, but I know enough smart people that I can make sure this is straightened out by the time we submit the paper.)

In the MCMC runs which compute P(M = m | S = s) as a function of s, your random-walk approach tends to cluster around s values which make that function big. In other words, for a given m, P(S = s) for your MCMC runs tends to be large for precisely those s values for which P(M = m | S = s) is also large.

Are you saying that, having thus computed P(M = m | S = s), you then compute P(S = s | M = m) by plugging this same value of P(S = s) into the RHS of Bayes’ formula, and you then use the resulting values to draw conclusions about actual or potential real-world scenarios?

If so, I don’t see how this can work, because it seems to me that in order for the LHS of Bayes’ formula to apply to the real world, the inputs on the RHS have to, also. But the P(S = s) values which characterize your MCMC runs have no relation to the real world; rather, they were chosen solely to improve efficiency in an intermediate phase of the computation.

Well, one of us is pretty mixed up. It’s probably me. But let me quote the bit that is confusing me, explain what it seems (to me) to be saying that you’re doing (which I think is far from standard practice, if not downright wrong), and then tell you what I think you ought to be doing (and quite possibly are, though it doesn’t sound like it :)). Anyway, here’s the quote from the section entitled “Markov Chain Monte Carlo”:

The key to making this work is that at each step on the walk a proposed modification to the current settings is generated randomly—but it may be rejected if it does not seem to improve the estimates. The essence of the rule is:

The modification is randomly accepted with a probability equal to the ratio P(M = m | S = s′)/P(M = m | S = s).

Otherwise the walk stays at the current position.

If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from P(S = s) which can be used to compute P(M = m | S = s). Then we can compute P(S = s | M = m) using Bayes’ rule.

Now, to me, it sounds like you’re using the Metropolis algorithm to get your Markov chain. Within that algorithm, you are deciding to accept or reject a step based on consideration of the ratio P(M = m | S = s′)/P(M = m | S = s).

My understanding is that if you do that, then the stationary distribution of your Markov chain will be P(M = m | S = s)/Z

where Z is a normalizing constant making your likelihood into a probability density when thought of as a function of the settings s. In other words, I think the stationary distribution will be the posterior under the assumption of a uniform prior. Now, there isn’t anything wrong with this (although, if P(M = m | S = s), thought of as a function of s, is not compactly supported, you may get some numerical instabilities), but it doesn’t seem to be standard practice. But, forgetting that and pressing on, you then say that you run the algorithm and do the necessary tricks to get random samples from P(M = m | S = s)/Z. That’s fine, if you want the posterior under the assumption of a uniform prior. But now, you seem to be wanting to think of these samples from P(M = m | S = s)/Z as samples from P(S = s) and then you’re somehow using these samples to get P(S = s | M = m) from Bayes’ rule? I’m not sure how you would do that. I mean, I guess you could construct a density estimator of P(M = m | S = s)/Z from its samples, multiply by the prior P(S = s), and then do a numerical integration to get the normalization, but that seems kind of convoluted. You’d basically be canceling out Z numerically, which makes me wonder why you’d want to run the MCMC algorithm to begin with. Why not just numerically integrate P(M = m | S = s) to find its normalization?

So, that’s what it sounds (to me) like you’re doing. Now, here’s what I think you ought to be doing (and quite possibly are, since I believe the following is standard practice). Make your decisions of whether or not to make a step based on the ratio:

[P(M = m | S = s′) P(S = s′)] / [P(M = m | S = s) P(S = s)]

Then, the stationary distribution of the resulting Markov chain should be the posterior distribution P(S = s | M = m). So, you run the algorithm and you get random samples from the posterior. There is no need to use Bayes’ theorem again because you already used it (i.e., by Bayes, your step decision rule is basically the ratio of posteriors, but the unknown normalization P(M = m) cancels out, making it possible to evaluate the ratio). Now you can use those random samples from the posterior to make a density estimator of the posterior, but more likely you’ll want to use them to calculate posterior expectations by appealing to the Law of Large Numbers for Markov Chains (a.k.a. the Ergodic Theorem):

(1/N) ∑ f(s_i) → ∫ f(s) P(S = s | M = m) ds  as N → ∞
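A minimal sketch of that recipe—again a hypothetical one-parameter toy, with a Gaussian prior standing in for a genuinely non-uniform P(S = s), not the climate model under discussion:

```python
import math
import random

random.seed(0)

# Hypothetical toy setup: measurements m_i = s + Gaussian noise
# (sigma = 1), and a non-uniform prior s ~ N(0, 1).
data = [1.9, 2.2, 2.1, 1.8]

def log_likelihood(s):
    return -sum((m - s) ** 2 for m in data) / 2.0

def log_prior(s):
    return -s ** 2 / 2.0

# Step decision based on the ratio
#   [P(M=m|S=s') P(S=s')] / [P(M=m|S=s) P(S=s)],
# i.e. the ratio of posteriors with the unknown P(M = m) cancelled.
s = 0.0
samples = []
for step in range(20000):
    s_new = s + random.gauss(0.0, 0.5)
    delta = (log_likelihood(s_new) + log_prior(s_new)) \
          - (log_likelihood(s) + log_prior(s))
    if random.random() < math.exp(min(0.0, delta)):
        s = s_new
    if step >= 2000:
        samples.append(s)

# Ergodic theorem: (1/N) sum f(s_i) -> posterior expectation of f.
# For this conjugate toy the exact posterior mean is
# (sum of data) / (n + 1) = 8.0 / 5 = 1.6, so the sample average
# should land near that.
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 2))
```

The conjugate Gaussian–Gaussian toy is chosen so the exact answer is known in closed form, which makes it easy to check that the chain really targets the posterior.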

So, that’s all my cards on the table….

arch1 wrote:

It deals with a *lot* of concepts in a short article (Bayes’ formula, Monte Carlo, parametrized events, multidimensional parameters, and optimization via Markov Chains) so I think it will challenge people new to most of them.

Yes. In a way it’s hopelessly ambitious to introduce all these ideas in one short article, and I’m afraid our article is more dry and compressed than typical for this magazine. I’m hoping that adding a few carefully chosen sentences here or there could help… not to explain more information, or address nuances, but simply to make the existing information easier to digest.

Or at the very least, we can try to remove all road bumps, like this:

“One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability P(M = m | S = s) that given setting values s the model predicts measurements m.”

This was a head-scratcher.

Thanks! Yes, it could be confusing that we introduce *randomly chosen settings* here: it adds an extra layer of randomness that the reader won’t be expecting. A more obvious thing to do would be to simply *choose* settings s, repeatedly run the model with these settings, and work out P(M = m | S = s).

Of course we want to do this for lots of settings, and in the end we want to choose different settings with different probabilities, or frequencies, P(S = s). Then we can use Bayes’ rule:

P(S = s | M = m) = P(M = m | S = s) P(S = s)/P(M = m)
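The “repeatedly run our model and see what measurements it predicts” idea can be sketched like this—a hypothetical toy model, not the one in the article: fix the settings s, simulate many measurements, and count how often the prediction lands near the observed m.

```python
import random

random.seed(1)

# Hypothetical toy model: one setting s, measurement m = s + noise.
def run_model(s):
    return s + random.gauss(0.0, 1.0)

def estimate_likelihood(s, m_obs, runs=50000, tol=0.1):
    # Monte Carlo estimate of P(m_obs - tol < M < m_obs + tol | S = s):
    # run the model over and over with settings s and count how often
    # the predicted measurement falls near the observed one.
    hits = sum(abs(run_model(s) - m_obs) < tol for _ in range(runs))
    return hits / runs

# The estimated likelihood is largest when the setting matches the
# observation, and drops off as they disagree.
print(estimate_likelihood(2.0, 2.0) > estimate_likelihood(0.0, 2.0))
```

This only estimates the forward probability P(M = m | S = s); turning that into P(S = s | M = m) is exactly the job of Bayes’ rule discussed above.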

I still find this confusing, maybe because I have a misconception. I get how “control over P(S=s)” in your MCMC runs helps compute P(M=m|S=s). But in order to use Bayes’ Rule to determine P(S=s|M=m), don’t you also need to know the value of P(S=s) in the real world?

No: the ‘real world’ doesn’t know anything about our model or the probability that some settings of our model take some value s. In this stage of the calculation the probability P(S = s) is something *we* control, and our goal is to do that cleverly to efficiently compute P(M = m | S = s).

However, I sympathize with your confusion, because we zipped through something very quickly: namely, what we do when we have P(S = s | M = m). This conditional probability tells us the probability that we should use various possible settings for our model as a function of some actual past or imagined future measurements. So it’s very useful, but it’s not the final answer to any real-world question.

By the way, you once complained about what happens when you cut-and-paste a passage containing equations, like this:

But here **Bayes’ rule** comes to the rescue, relating what we want to what we can more easily compute:

P(S = s | M = m) = P(M = m | S = s) P(S = s)/P(M = m)

If you cut-and-paste it, for some stupid reason you get

But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

\displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }

This looks like a mess, but you just need to put $latex in front of the equation, leaving a space before the other stuff, and $ after it, like this:

But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

$latex \displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} } $

to get something nice:

But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

P(S = s | M = m) = P(M = m | S = s) P(S = s)/P(M = m)

(The boldface on “**Bayes’ rule**” is still gone, but life is short.)

Nick wrote:

“containing related strongly related things”

This seems awkwardly phrased.

It was just a typo for “strongly related”, not a deliberate phrasing. Fixed!

“red are predicted measurements”

I guess this is slightly ambiguous, depending on whether you’re referring to red as a line or a collection of points, but for consistency with the rest of your figure 2 description, I think it should be something like “red is the predicted measurement”

I’ll have to think about that. Right now we’re calling m **the measurements**, and in our example it’s a list of numbers, each of which we could call ‘a measurement’, though actually we just say:

Suppose our measurements are real numbers related by

So, the red curve shows “the measurements”.

“modification s’ to the current settings s’ ”

Do you mean “modification s to the current settings s’ “?

No, we mean “the modification s’ to the current settings s.” In math, primes are often used for ‘modified’ or ‘changed’ things. Thanks for catching this mistake, though! Fixed.

I feel like the word nonlinear should be included somewhere in this article.

Hmm, personally I’m a bit tired of how people keep emphasizing that real life involves a lot of nonlinear phenomena.
