This is the second part of my interview with Eliezer Yudkowsky. If you click on some technical terms here, you’ll go down to a section where I explain them.
JB: You’ve made a great case for working on artificial intelligence and, more generally, for understanding how intelligence works, to figure out how we can improve it. It’s especially hard to argue against studying rationality. Even most people who doubt computers will ever get smarter will admit the possibility that people can improve. And it seems clear that almost every problem we face could benefit from better thinking.
I’m intrigued by the title The Art of Rationality because it suggests that there’s a kind of art to it. We don’t know how to teach someone to be a great artist, but maybe we can teach them to be a better artist. So, what are some of the key principles when it comes to thinking better?
EY: Stars above, what an open-ended question. The idea behind the book is to explain all the drop-dead basic fundamentals that almost no one seems to know about, like what is evidence, what is simplicity, what is truth, the importance of actually changing your mind now and then, the major known cognitive biases that stop people from changing their minds, what it means to live in a universe where things are made of parts, and so on. This is going to be a book primarily aimed at people who are not completely frightened away by complex mathematical concepts such as addition, multiplication, and division (i.e., all you need to understand Bayes’ Theorem if it’s explained properly), albeit with the whole middle of the book being just practical advice based on cognitive biases for the benefit of people who don’t want to deal with multiplication and division. Each chapter is going to address a different aspect of rationality, not in full textbook detail, just enough to convey the sense of a concept, with each chapter being around 5-10,000 words broken into 4-10 bite-size sections of 500-2000 words each. Which of the 27 currently planned book chapters did you want me to summarize?
But if I had to pick just one thing, just one concept that’s most important, I think it would be the difference between rationality and rationalization.
Suppose there’s two boxes, only one of which contains a diamond. And on the two boxes there are various signs and portents which distinguish, imperfectly and probabilistically, between boxes which contain diamonds, and boxes which don’t. I could take a sheet of paper, and I could write down all the signs and portents that I understand, and do my best to add up the evidence, and then on the bottom line I could write, "And therefore, there is a 37% probability that Box A contains the diamond." That’s rationality. Alternatively, I could be the owner of Box A, and I could hire a clever salesman to sell Box A for the highest price he can get; and the clever salesman starts by writing on the bottom line of his sheet of paper, "And therefore, Box A contains the diamond", and then he writes down all the arguments he can think of on the lines above.
But consider: At the moment the salesman wrote down the bottom line on that sheet of paper, the truth or falsity of the statement was fixed. It’s already right or already wrong, and writing down arguments on the lines above isn’t going to change that. Or if you imagine a spread of probable worlds, some of which have different boxes containing the diamond, the correlation between the ink on paper and the diamond’s location became fixed at the moment the ink was written down, and nothing which doesn’t change the ink or the box is going to change that correlation.
That’s "rationalization", which should really be given a name that better distinguishes it from rationality, like "anti-rationality" or something. It’s like calling lying "truthization". You can’t make rational what isn’t rational to start with.
Whatever process your brain uses, in reality, to decide what you’re going to argue for, that’s what determines your real-world effectiveness. Rationality isn’t something you can use to argue for a side you already picked. Your only chance to be rational is while you’re still choosing sides, before you write anything down on the bottom line. If I had to pick one concept to convey, it would be that one.
JB: Okay. I wasn’t really trying to get you to summarize a whole book. I’ve seen you explain a whole lot of heuristics designed to help us be more rational. So I was secretly wondering if the "art of rationality" is mainly a long list of heuristics, or whether you’ve been able to find a few key principles that somehow spawn all those heuristics.
Either way, it could be a tremendously useful book. And even if you could distill the basic ideas down to something quite terse, in practice people are going to need all those heuristics—especially since many of them take the form "here’s something you tend to do without noticing you’re doing it—so watch out!" If we’re saddled with dozens of cognitive biases that we can only overcome through strenuous effort, then your book has to be long. You can’t just say "apply Bayes’ rule and all will be well."
I can see why you’d single out the principle that "rationality only comes into play before you’ve made up your mind", because so much seemingly rational argument is really just a way of bolstering support for pre-existing positions. But what is rationality? Is it something with a simple essential core, like "updating probability estimates according to Bayes’ rule", or is its very definition inherently long and complicated?
EY: I’d say that there are parts of rationality that we do understand very well in principle. Bayes’ Theorem, the expected utility formula, and Solomonoff induction between them will get you quite a long way. Bayes’ Theorem says how to update based on the evidence, Solomonoff induction tells you how to assign your priors (in principle, it should go as the Kolmogorov complexity aka algorithmic complexity of the hypothesis), and then once you have a function which predicts what will probably happen as the result of different actions, the expected utility formula says how to choose between them.
Marcus Hutter has a formalism called AIXI which combines all three to write out an AI as a single equation which requires infinite computing power plus a halting oracle to run. And Hutter and I have been debating back and forth for quite a while on which AI problems are or aren’t solved by AIXI. For example, I look at the equation as written and I see that AIXI will try the experiment of dropping an anvil on itself to resolve its uncertainty about what happens next, because the formalism as written invokes a sort of Cartesian dualism with AIXI on one side of an impermeable screen and the universe on the other; the equation for AIXI says how to predict sequences of percepts using Solomonoff induction, but it’s too simple to encompass anything as reflective as "dropping an anvil on myself will destroy that which is processing these sequences of percepts". At least that’s what I claim; I can’t actually remember whether Hutter was agreeing with me about that as of our last conversation. Hutter sees AIXI as important because he thinks it’s a theoretical solution to almost all of the important problems; I see AIXI as important because it demarcates the line between things that we understand in a fundamental sense and a whole lot of other things we don’t.
So there are parts of rationality—big, important parts too—which we know how to derive from simple, compact principles in the sense that we could write very simple pieces of code which would behave rationally along that dimension given unlimited computing power.
But as soon as you start asking "How can human beings be more rational?" then things become hugely more complicated because human beings make much more complicated errors that need to be patched on an individual basis, and asking "How can I be rational?" is only one or two orders of magnitude simpler than asking "How does the brain work?", i.e., you can hope to write a single book that will cover many of the major topics, but not quite answer it in an interview question…
On the other hand, the question "What is it that I am trying to do, when I try to be rational?" is a question for which big, important chunks can be answered by saying "Bayes’ Theorem", "expected utility formula" and "simplicity prior" (where Solomonoff induction is the canonical if uncomputable simplicity prior).
At least from a mathematical perspective. From a human perspective, if you asked "What am I trying to do, when I try to be rational?" then the fundamental answers would run more along the lines of "Find the truth without flinching from it and without flushing all the arguments you disagree with out the window", "When you don’t know, try to avoid just making stuff up", "Figure out whether the strength of evidence is great enough to support the weight of every individual detail", "Do what should lead to the best consequences, but not just what looks on the immediate surface like it should lead to the best consequences, you may need to follow extra rules that compensate for known failure modes like shortsightedness and moral rationalizing"…
JB: Fascinating stuff!
Yes, I can see that trying to improve humans is vastly more complicated than designing a system from scratch… but also very exciting, because you can tell a human a high-level principle like "When you don’t know, try to avoid just making stuff up" and have some slight hope that they’ll understand it without it being explained in a mathematically precise way.
I guess AIXI dropping an anvil on itself is a bit like some of the self-destructive experiments that parents fear their children will try, like sticking a pin into an electrical outlet. And it seems impossible to avoid doing such experiments without having a base of knowledge that was either "built in" or acquired by means of previous experiments.
In the latter case, it seems just a matter of luck that none of these previous experiments were fatal. Luckily, people also have "built in" knowledge. More precisely, we have access to our ancestors’ knowledge and habits, which get transmitted to us genetically and culturally. But still, a fair amount of random blundering, suffering, and even death was required to build up that knowledge base.
So when you imagine "seed AIs" that keep on improving themselves and eventually become smarter than us, how can you reasonably hope that they’ll avoid making truly spectacular mistakes? How can they learn really new stuff without a lot of risk?
EY: The best answer I can offer is that they can be conservative externally and deterministic internally.
Human minds are constantly operating on the ragged edge of error, because we have evolved to compete with other humans. If you’re a bit more conservative, if you double-check your calculations, someone else will grab the banana and that conservative gene will not be passed on to descendants. Now this does not mean we couldn’t end up in a bad situation with AI companies competing with each other, but there’s at least the opportunity to do better.
If I recall correctly, the Titanic sank from managerial hubris and cutthroat cost competition, not engineering hubris. The original liners were designed far more conservatively, with triple-redundant compartmentalized modules and so on. But that was before cost competition took off, when the engineers could just add on safety features whenever they wanted. The part about the Titanic being extremely safe was pure marketing literature.
There is also no good reason why any machine mind should be overconfident the way that humans are. There are studies showing that, yes, managers prefer subordinates who make overconfident promises to subordinates who make accurate promises—sometimes I still wonder that people are this silly, but given that people are this silly, the social pressures and evolutionary pressures follow. And we have lots of studies showing that, for whatever reason, humans are hugely overconfident; less than half of students finish their papers by the time they think it 99% probable they’ll get done, etcetera.
And this is a form of stupidity an AI can simply do without. Rationality is not omnipotent; a bounded rationalist cannot do all things. But there is no reason why a bounded rationalist should ever have to overpromise, be systematically overconfident, systematically tend to claim it can do what it can’t. It does not have to systematically underestimate the value of getting more information, or overlook the possibility of unspecified Black Swans and what sort of general behavior helps to compensate. (A bounded rationalist does end up overlooking specific Black Swans because it doesn’t have enough computing power to think of all specific possible catastrophes.)
And contrary to how it works in say Hollywood, even if an AI does manage to accidentally kill a human being, that doesn’t mean it’s going to go “I HAVE KILLED” and dress up in black and start shooting nuns from rooftops. What it ought to do—what you’d want to see happen—would be for the utility function to go on undisturbed, and for the probability distribution to update based on whatever unexpected thing just happened and contradicted its old hypotheses about what does and does not kill humans. In other words, keep the same goals and say “oops” on the world-model; keep the same terminal values and revise its instrumental policies. These sorts of external-world errors are not catastrophic unless they can actually wipe out the planet in one shot, somehow.
The catastrophic sort of error, the sort you can’t recover from, is an error in modifying your own source code. If you accidentally change your utility function you will no longer want to change it back. And in this case you might indeed ask, "How will an AI make millions or billions of code changes to itself without making a mistake like that?" But there are in fact methods powerful enough to do a billion error-free operations. A friend of mine once said something along the lines of "a CPU does a mole of transistor operations, error-free, in a day" though I haven’t checked the numbers. When chip manufacturers are building a machine with hundreds of millions of interlocking pieces and they don’t want to have to change it after it leaves the factory, they may go so far as to prove the machine correct, using human engineers to navigate the proof space and suggest lemmas to prove (which AIs can’t do, they’re defeated by the exponential explosion) and complex theorem-provers to prove the lemmas (which humans would find boring) and simple verifiers to check the generated proof. It takes a combination of human and machine abilities and it’s extremely expensive. But I strongly suspect that an Artificial General Intelligence with a good design would be able to treat all its code that way—that it would combine all those abilities in a single mind, and find it easy and natural to prove theorems about its code changes. It could not, of course, prove theorems about the external world (at least not without highly questionable assumptions). It could not prove external actions correct. The only thing it could write proofs about would be events inside the highly deterministic environment of a CPU—that is, its own thought processes. But it could prove that it was processing probabilities about those actions in a Bayesian way, and prove that it was assessing the probable consequences using a particular utility function. 
It could prove that it was sanely trying to achieve the same goals.
A self-improving AI that’s unsure about whether to do something ought to just wait and do it later after self-improving some more. It doesn’t have to be overconfident. It doesn’t have to operate on the ragged edge of failure. It doesn’t have to stop gathering information too early, if more information can be productively gathered before acting. It doesn’t have to fail to understand the concept of a Black Swan. It doesn’t have to do all this using a broken error-prone brain like a human one. It doesn’t have to be stupid in the ways like overconfidence that humans seem to have specifically evolved to be stupid. It doesn’t have to be poorly calibrated (assign 99% probabilities that come true less than 99 out of 100 times), because bounded rationalists can’t do everything but they don’t have to claim what they can’t do. It can prove that its self-modifications aren’t making itself crazy or changing its goals, at least if the transistors work as specified, or make no more than any possible combination of 2 errors, etc. And if the worst does happen, so long as there’s still a world left afterward, it will say "Oops" and not do it again. This sounds to me like essentially the optimal scenario given any sort of bounded rationalist whatsoever.
And finally, if I was building a self-improving AI, I wouldn’t ask it to operate heavy machinery until after it had grown up. Why should it?
JB: Okay. I’d like to take a break here, explain some terms you used, and pick up next week with some less technical questions, like what’s a better use of time: tackling environmental problems, or trying to prepare for a technological singularity?
Here are some quick explanations. If you click on the links here you’ll get more details:
• Cognitive Bias. A cognitive bias is a way in which people’s judgements systematically deviate from some norm—for example, from ideal rational behavior. You can see a long list of cognitive biases on Wikipedia. It’s good to know a lot of these and learn how to spot them in yourself and your friends.
For example, confirmation bias is the tendency to pay more attention to information that confirms our existing beliefs. Another great example is the bias blind spot: the tendency for people to think of themselves as less cognitively biased than average! I’m sure glad I don’t suffer from that.
• Bayes’ Theorem. This is a rule for updating our opinions about probabilities when we get new information. Suppose you start out thinking the probability of some event A is P(A), and the probability of some event B is P(B). Suppose P(A|B) is the probability of event A given that B happens. Likewise, suppose P(B|A) is the probability of B given that A happens. Then the probability that both A and B happen is
P(A|B) P(B)
but by the same token it’s also
P(B|A) P(A)
so these are equal. A little algebra gives Bayes’ Theorem:
P(A|B) = P(B|A) P(A) / P(B)
If for some reason we know everything on the right-hand side, we can use this equation to work out P(A|B), and thus update our probability for event A when we see event B happen.
For a longer explanation with examples, see:
• Eliezer Yudkowsky, An intuitive explanation of Bayes’ Theorem.
Some handy jargon: we call P(A) the prior probability of A, and P(A|B) the posterior probability.
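To make the update concrete, here is a minimal Python sketch of Bayes’ Theorem. The function name `posterior` and all the numbers are made up for illustration; the only extra ingredient is the law of total probability, used to work out P(B) from P(B|A) and P(B|not A).

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) P(A) / P(B).

    P(B) is not passed in directly; it is expanded as
    P(B) = P(B|A) P(A) + P(B|not A) P(not A).
    """
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical example: A = "this box holds the diamond",
# B = "this box carries a particular sign".
# The sign shows up on 80% of diamond boxes and 40% of empty ones.
p = posterior(prior_a=0.5, p_b_given_a=0.8, p_b_given_not_a=0.4)
print(round(p, 4))  # 0.6667
```

Seeing the sign moves the probability from the prior of 1/2 to a posterior of 2/3: the evidence helps, but it is far from conclusive.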
• Solomonoff Induction. Bayes’ Theorem helps us compute posterior probabilities, but where do we get the prior probabilities from? How can we guess probabilities before we’ve observed anything?
This famous puzzle led Ray Solomonoff to invent Solomonoff induction. The key new idea is algorithmic probability theory. This is a way to define a probability for any string of letters in some alphabet, where a string counts as more probable if it’s less complicated. If we think of a string as a "hypothesis"—it could be a sentence in English, or an equation—this becomes a way to formalize Occam’s razor: the idea that given two competing hypotheses, the simpler one is more likely to be true.
So, algorithmic probability lets us define a prior probability distribution on hypotheses, the so-called “simplicity prior”, that implements Occam’s razor.
More precisely, suppose we have a special programming language where:
- Computer programs are written as strings of bits.
- They contain a special bit string meaning “END” at the end, and nowhere else.
- They don’t take an input: they just run and either halt and print out a string of letters, or never halt.
Then to get the algorithmic probability of a string of letters, we take all programs that print out that string and add up 2^(-length) for each one, where length is the number of bits in that program.
So, you can see that a string counts as more probable if it has more short programs that print it out.
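The real sum runs over infinitely many programs and can’t actually be computed, but the bookkeeping can be sketched on a toy example. In the Python sketch below, the table `TOY_PROGRAMS` is entirely hypothetical: it stands in for a tiny "language" with seven programs, none of which is a prefix of another (mimicking the “END” rule), and one of which never halts.

```python
from fractions import Fraction

# Hypothetical toy language: each key is a complete program (a bit string),
# each value is the string it prints, or None if it never halts.
TOY_PROGRAMS = {
    "00": "abab",
    "01": "aaaa",
    "100": "abab",
    "101": None,     # runs forever, so it contributes to no string
    "110": "abba",
    "1110": "abab",
    "1111": "abba",
}


def algorithmic_probability(target, programs=TOY_PROGRAMS):
    """Sum 2^(-length) over every program that halts and prints `target`."""
    return sum(
        (Fraction(1, 2 ** len(bits))
         for bits, output in programs.items()
         if output == target),
        Fraction(0),
    )


print(algorithmic_probability("abab"))  # 7/16
print(algorithmic_probability("aaaa"))  # 1/4
print(algorithmic_probability("abba"))  # 3/16
```

In this toy table "abab" comes out most probable because three programs print it, including one of the two shortest. The three probabilities add up to 7/8 rather than 1, since the program that never halts prints nothing.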
• Kolmogorov complexity. The Kolmogorov complexity of a string of letters is the length of the shortest program that prints it out, where programs are written in a special language as described above. This is a way of measuring how complicated a string is. It’s closely related to the algorithmic entropy: the difference between the Kolmogorov complexity of a string and minus the logarithm of its algorithmic probability is bounded by a constant, if we take logarithms using base 2. For more on all this stuff, see:
• M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, Berlin, 2008.
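Kolmogorov complexity itself can’t be computed, but a rough feel for it comes from an ordinary compressor. The Python sketch below swaps in zlib’s compressed length as a crude stand-in: it is only a loose upper-bound-style proxy, not the real quantity, and the helper name `compressed_length` is made up for illustration.

```python
import os
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


regular = b"ab" * 5000            # 10,000 bytes with an obvious short description
random_like = os.urandom(10000)   # 10,000 bytes with no pattern to exploit

print(compressed_length(regular))      # small: "repeat 'ab' 5000 times"
print(compressed_length(random_like))  # roughly the original length
```

The highly regular string shrinks to a tiny fraction of its size, while the random bytes barely shrink at all, which matches the idea that a string is simple when some short program prints it out.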
• Halting Oracle. Alas, the algorithmic probability of a string is not computable. Why? Because to compute it, you’d need to go through all the programs in your special language that print out that string and add up a contribution from each one. But to do that, you’d need to know which programs halt—and there’s no systematic way to answer that question, which is called the halting problem.
But, we can pretend! We can pretend we have a magic box that will tell us whether any program in our special language halts. Computer scientists call any sort of magic box that answers questions an oracle. So, our particular magic box is called a halting oracle.
• AIXI. AIXI is Marcus Hutter’s attempt to define an agent that "behaves optimally in any computable environment". Since AIXI relies on the idea of algorithmic probability, you can’t run AIXI on a computer unless it has infinite computing power and (the really hard part) access to a halting oracle. However, Hutter has also defined computable approximations to AIXI. For a quick intro, see this:
• Marcus Hutter, Universal intelligence: a mathematical top-down approach.
For more, try this:
• Marcus Hutter, Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability, Springer, Berlin, 2005.
• Utility. Utility is a hypothetical numerical measure of satisfaction. If you know the probabilities of various outcomes, and you know what your utility will be in each case, you can compute your "expected utility" by taking the probabilities of the different outcomes, multiplying them by the corresponding utilities, and adding them up. In simple terms, this is how happy you’ll be on average. The expected utility hypothesis says that a rational decision-maker has a utility function and will try to maximize its expected utility.
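Here is a minimal Python sketch of that calculation. The actions, probabilities and utilities are invented purely for illustration, and the function names `expected_utility` and `best_action` are hypothetical.

```python
def expected_utility(outcomes):
    """Sum of probability * utility over the (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)


def best_action(actions):
    """Return the name of the action with the highest expected utility."""
    return max(actions, key=lambda name: expected_utility(actions[name]))


# Hypothetical decision: 30% chance of rain, 70% chance of a dry day.
# Each action lists (probability, utility) for rain and for no rain.
actions = {
    "carry umbrella": [(0.3, 5.0), (0.7, 8.0)],
    "leave umbrella": [(0.3, -10.0), (0.7, 10.0)],
}

for name, outcomes in actions.items():
    print(name, expected_utility(outcomes))
print(best_action(actions))  # carry umbrella
```

Leaving the umbrella at home has the best possible outcome, but carrying it wins on average, because the expected utility weighs the soaking by its 30% probability.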
• Bounded Rationality. In the real world, any decision-maker has limits on its computational power and the time it has to make a decision. The idea that rational decision-makers "maximize expected utility" is oversimplified unless it takes this into account somehow. Theories of bounded rationality try to take these limitations into account. One approach is to think of decision-making as yet another activity whose costs and benefits must be taken into account when making decisions. Roughly: you must decide how much time you want to spend deciding. Of course, there’s an interesting circularity here.
• Black Swan. According to Nassim Taleb, human history is dominated by black swans: important events that were unpredicted and indeed unpredictable, but rationalized by hindsight and thus made to seem as if they could have been predicted. He believes that rather than trying to predict such events (which he considers largely futile), we should try to get good at adapting to them. For more see:
• Nassim Taleb, The Black Swan: The Impact of the Highly Improbable, Random House, New York, 2007.
The first principle is that you must not fool yourself—and you are the easiest person to fool. – Richard Feynman