Last week I attended the Machine Intelligence Research Institute’s sixth Workshop on Logic, Probability, and Reflection. This one was in Berkeley, where the institute has its headquarters.

You may know this institute under their previous name: the Singularity Institute. It seems to be the brainchild of Eliezer Yudkowsky, a well-known advocate of ‘friendly artificial intelligence’, whom I interviewed in week311, week312 and week313 of *This Week’s Finds*. He takes an approach to artificial intelligence that’s heavily influenced by mathematical logic, and I got invited to the workshop because I blogged about a paper he wrote with Mihaly Barasz, Paul Christiano and Marcello Herreshoff on probability theory and logic.

I only have the energy to lay the groundwork for a good explanation of what happened in the workshop. So, after you read my post, please read this:

• Benja Fallenstein, Results from MIRI’s December workshop, *Less Wrong*, 28 December 2013.

The workshop had two main themes, so let me tell you what they were.

### Scientific induction in mathematics

The first theme is related to that paper I just mentioned. How should a rational agent assign probabilities to statements in mathematics? Of course an omniscient being could assign

• **probability 1** to every mathematical statement that’s provable,

• **probability 0** to every statement whose negation is provable,

and

• to every statement that is neither provable nor disprovable.

But a real-world rational agent will never have time to check all proofs, so there will always be lots of statements it’s not sure about. Actual mathematicians always have conjectures, like the Twin Prime Conjecture, that we consider plausible even though nobody has proved them. And whenever we do research, we’re constantly estimating how likely it is for statements to be true, and changing our estimates as new evidence comes in. In other words, we use scientific induction in mathematics.

How could we automate this? Most of us don’t consciously assign numerical probabilities to mathematical statements. But maybe an AI mathematician should. If so, what rules should it follow?

It’s natural to try a version of Solomonoff induction, where our probability estimate, before any evidence comes in, favors statements that are *simple*. However, this runs up against problems. If you’re interested in learning more about this, try:

• Jeremy Hahn, Scientific induction in probabilistic mathematics.

It’s a summary of ideas people came up with during the workshop. I would like to explain them sometime, but for now I should move on.

### The Löbian obstacle

The second main theme was the ‘Löbian obstacle’. Löb’s theorem is the flip side of Gödel’s first incompleteness theorem, less famous but just as shocking. It seems to put limitations on how much a perfectly rational being can trust itself.

Since it’s the day after Christmas, let’s ease our way into these deep waters with the Santa Claus paradox, also known as Curry’s paradox.

If you have a child who is worried that Santa Claus might not exist, you can reassure them using this sentence:

**If this sentence is true, Santa Claus exists.**

Call it P, for short.

Assume, for the sake of argument, that P is true. Then what it says is true: “If P is true, Santa Claus exists.” And we’re assuming P is true. So, Santa Claus exists.

So, we’ve proved that *if P is true, Santa Claus exists*.

But that’s just what P says!

So, P is true.

So, Santa Claus exists!

There must be something wrong about this argument, even if Santa Claus *does* exist, because if it were valid you could you use it to prove anything at all. The self-reference is obviously suspicious. The sentence in question is a variant of the Liar Paradox:

**This sentence is false.**

since we can rewrite the Liar Paradox as

**If this sentence is true, 0 = 1.**

and then replace “0=1” by any false statement you like.

However, Gödel figured out a way to squeeze solid insights from these dubious self-referential sentences. He did this by creating a statement in the language of arithmetic, referring to nothing but numbers, which nonetheless manages to *effectively* say

**This sentence is unprovable.**

If it were provable, you’d get a contradiction! So, either arithmetic is inconsistent or this sentence is unprovable. But if it’s unprovable, it’s true. So, *there are true but unprovable statements in arithmetic… unless arithmetic is inconsistent!* This discovery shook the world of mathematics.

Here I’m being quite sloppy, just to get the idea across.

For one thing, when I’m saying ‘provable’, I mean provable *given some specific axioms* for arithmetic, like the Peano axioms. If we change our axioms, different statements will be provable.

For another, the concept of ‘true’ statements in arithmetic is often shunned by logicians. That may sound shocking, but there are many reasons for this: for example, Tarski showed that the truth of statements about arithmetic is undefinable in arithmetic. ‘Provability’ is much easier to deal with.

So, a better way of thinking about Gödel’s result is that he constructed a statement that is neither provable nor disprovable from Peano’s axioms of arithmetic, unless those axioms are inconsistent (in which case we can prove everything, but it’s all worthless).

Furthermore, this result applies not just to Peano’s axioms but to any stronger set of axioms, as long as you can write a computer program to list those axioms.

In 1952, the logician Leon Henkin flipped Gödel’s idea around and asked about a sentence in the language of arithmetic that says:

**This sentence is provable.**

He asked: is this provable or not? The answer is much less obvious than for Gödel’s sentence. Play around with it and see what I mean.

But in 1954, Martin Hugo Löb showed that Henkin’s sentence *is* provable!

And Henkin noticed something amazing: Löb’s proof shows much more.

At this point it pays to become a bit more precise. Let us write to mean the statement is provable from the Peano axioms of arithmetic. Gödel figured out how to encode statements in arithmetic as numbers, so let’s write for the Gödel number of any statement And Gödel figured out how to write a statement in arithmetic, say

which says that the statement with Gödel number is provable using the Peano axioms.

Using this terminology, what Henkin originally did was find a number such that the sentence

has Gödel number So, this sentence says

**This sentence is provable from the Peano axioms of arithmetic.**

What Löb did was show

In other words, he showed that Henkin sentence really *is* provable from the Peano axioms!

What Henkin then did is prove that for *any* sentence in the language of arithmetic, if

then

In other words, suppose we can prove that the provability of implies Then we can prove !

At first this merely sounds nightmarishly complicated. But if you think about it long enough, you’ll see it’s downright terrifying! For example, suppose is some famous open question in arithmetic, like the Twin Prime Conjecture. You might hope to prove

**The provability of the Twin Prime Conjecture implies the Twin Prime Conjecture.**

Indeed, that seems like a perfectly reasonable thing to want. But it turns out that proving this is as hard as proving the Twin Prime Conjecture! Why? Because if we can prove the boldface sentence above, Löb and Henkin’s work instantly gives us a proof of Twin Prime Conjecture!

What does all this have to do with artificial intelligence?

Well, what I just said is true not only for Peano arithmetic, but any set of axioms *including* Peano arithmetic that a computer program can list. Suppose your highly logical AI mathematician has some such set of axioms, say You might want it to trust itself. In other words, you might want

for every sentence This says, roughly, that *whatever the AI can prove it can prove, it can prove*.

But then Löb’s theorem would kick in and give

for every sentence And this would be disastrous: our AI would be inconsistent, because it could prove *everything!*

This is just the beginning of the problems. It gets more involved when we consider AI’s that spawn new AI’s and want to trust them. For more see:

• Eliezer Yudkowsky and Marcello Herreshoff, Tiling agents for self-modifying AI, and the Löbian obstacle.

At workshop various people made progress on this issue, which is recorded in these summaries:

• Eliezer Yudkowsky, The procrastination paradox.

Abstract.A theorem by Marcello Herresho, Benja Fallenstein, and Stuart Armstrong shows that if there exists an infinite series of theories extending where each proves the soundness of , then all the must have only nonstandard models. We call this the Procrastination Theorem for reasons which will become apparent.

• Benja Fallenstein, An infinitely descending sequence of sound theories each proving the next consistent.

Here Fallenstein constructs a different sequence of theories extending Peano arithmetic such that each proves the consistency of and all the theories are sound for sentences—that is, sentences with only one quantifier outside the rest of the stuff.

The following summaries would take more work to explain:

• Nate Soares, Fallenstein’s monster.

• Nisan Stiennon, Recursively-defined logical theories are well-defined.

• Benja Fallenstein, The 5-and-10 problem and the tiling agents formalism.

• Benja Fallenstein, Decreasing mathematical strength in one formalization of parametric polymorphism.

Again: before reading any of these summaries, I urge you to read Benja Fallenstein’s post, which will help you understand them!

It’s actually the 6th workshop, not the 2nd:

http://intelligence.org/workshops/

Wow, no wonder I had so much trouble catching up! Thanks—fixed.

LW post here: http://lesswrong.com/lw/jed/results_from_miris_december_workshop/

Thanks! I’ve added links to it in my post.

The discussion of this post over on G+ has been more lively, though at times frustrating.

You might want to check it out…

Isn’t this really rather something of a meta-tautology? You’re talking about sentences that are gaming the system within which they live, in the sense that they refer only to the way in which they can be evaluated, and no other actual content (axiom). What I would propose is that “Löb inference” should either be an axiom, or accounted for as a mode of inference. (Aside on the artificiality of formal logic…) Or perhaps logic needs to have some metaphysics and maybe an observer factored into it — and possibly even some ontology, epistemology, psychology and ethics. Logic is after all an abstraction, and a projection (reduction of dimension) from (and attempt at codification of) (our working) philosophy, which we are attracted to for psychological reasons (releasing ourselves from the burden of questioning all assumptions furnishes us with the luxury of a point of view, without which “consciousness” cannot function at the higher level of achieving goals).

Ben wrote:

What’s “this”? The whole bloody blog article?

(When you start a conversation with a pronoun like “this”, we’d have to read your mind to know what you’re talking about.)

I must be missing something because It seems rather obvious the fact you can prove it’s possible to prove A is equivalent to proving A. How could it not be? That would mean you could prove A is provable even though A cannot be proven? Seems like a clear contradiction.

Analogously if you can show that it is possible to show A is true then A has to be true.

One interesting thing about logic is that some notions that seem ‘obvious’ in everyday informal reasoning lead to contradictions when you try to incorporate them into a completely precise set of rules!

Practical people have an ability to avoid trains of thought that lead to contradictory conclusions, and shrug off contradictions when they run into them. But if our goal is to design a precise system of logical rules that

neverlead to a contradiction, we have to forego some notions that seem ‘obvious’.For starters, Gödel showed it’s impossible to be sure we’ve achieved this goal, unless we settle for pathetically weak rules that can’t do much! But that’s another story for another day.

What’s happening now is that you’re using ‘provable’ in an intuitive sense, where any perfectly convincing argument counts as a proof. In this sense, clearly something is provable if and only if it’s provably provable.

But in formal logic we have to write down a system of precise rules of proof; then we say something is ‘provable’ if it’s provable

using those rules.Furthermore, if we want to talk about ‘proving that something is provable’, we need the system to be able to talk about itself somehow. We need to specify a method.

As I explained, arithmetic can talk about sentences in the language of arithmetic, using the fact that we can encode any sentence in the language of arithmetic as a number , the

Gödel numberofGödel proved that we can construct a gargantuan formula in the language of arithmetic, which says that the sentence with Gödel number is provable using the Peano axioms of arithmetic.

However, Löb’s theorem implies that we

cannotuse the Peano axioms to provefor all sentences ,

unless the Peano axioms are inconsistent!Furthermore, I believe that we cannot use the Peano axioms to prove

for all sentences , unless the Peano axioms are inconsistent. I will need to think about this a bit more to be sure, or maybe someone can help me out here. You

canuse the Peano axioms to proveIn other words, you can prove that ‘provably provable’ implies ‘provable’. So, if I’m not confused, it’s the other way around that’s the problem.

If all this seems horribly unintuitive,

GOOD!The reason Gödel is revered is that he pushed logic to the point where some truly surprising results appear. Löb pushed it a little bit further. To make what I’m saying seem plausible to you, I would need to go through the proof of Löb’s theorem. I may do that in another post.

John, both directions are ok. In particular,

is one of the Hilbert-Bernays-Loeb derivability conditions (where means “the formula with code number x is provable in PA”). See this post on FOM by Richard Heck,

http://cs.nyu.edu/pipermail/fom/2012-September/016706.html

In modal logic, it looks like this

Nowadays, this is called the S4 axiom. See, e.g.,

http://plato.stanford.edu/entries/logic-modal/#ModLog

It was … surprise surprise ! … Kurt Gödel who, in a 1933 paper, noticed these formal similarities between “provability” and modal logic. See section 2.5.3 (“Intuitionistic Propositional Logic is Interpretable in S4”) in Juliette Kennedy’s article here,

http://plato.stanford.edu/entries/goedel/#IntProLogIntS4

Jeff

Thought about this again. My previous comment was mistaken. One direction is ok, and the other isn’t.

Ok, the other way round actually

isok.I.e., “provable” implies “provably provable”:

(a)

is one of the derivability conditions, as I mentioned.

But this direction,

(b)

does not hold. It took me a while to think why.

The proof is: suppose

(i) .

So,

(ii) .

So, by Löb’s theorem,

(iii) .

So,

(iv) .

But this is not so, since PA is -consistent.

The condition

is called (C4) in modal logic.

Jeff

Jeffrey wrote:

Thanks for clearing that up! I somehow got it backwards even though I’d looked up stuff here:

• Provability logic,

Stanford Encyclopedia of Philosophy.They list three rules called the

Löb conditions:1) If then (Edit: this is backwards, see below!)

2)

3)

Somehow I got 3) twisted around. Thanks for proving the converse rule fails! It will be fun to think about that some more. I hope to write more about Löb’s theorem, going a bit deeper… and the failure of

is one of those surprising things that will be good to understand and explain.

Yes, the safer result is: suppose PA is -consistent and

Then

My guess is that this meta-claim itself is a theorem of PA:

Jeff

That would be cool!

Thanks, John …. some of these things are pretty odd, at first sight. But the first derivability condition (1) should have the boxes switched, i.e., like this,

1) If then

The other direction does in fact happen to hold for PA; but this is because PA is -consistent.

And then (3) is the “internalization” of (1), as it were, because PA can prove (1).

Happy new year!

Jeff

Ugh, this is embarrassing! I think I make some of these mistakes because the process of typing in LaTeX is so engrossing I forget what I’m typing. But the real problem is that my intuition for this subject is not good enough. Thanks for catching that mistake.

It’s interesting how you say “PA is -consistent” while I’d cautiously say “if PA is -consistent…” You are more optimistic than me.

Happy New Year!

I don’t have the book at hand, but this reminds me of the first dialogue between Achilles and Tortoise from Hofstadter’s book “Gödel Escher Bach”. I mean the dialogue that was attributed to Lewis Carroll.

I’m baffled by your association with Eliezer Yudkowsky.

I don’t agree with some of his views on artificial intelligence and the Singularity, but I was certainly happy to spend a week talking to him and a bunch of good logicians about 1) the role of probabilistic reasoning in mathematics, 2) the conceptual issues surrounding Löb’s theorem, and 3) random other stuff. I spent a lot of time studying mathematical logic as an undergrad, but then I turned toward other things, so it’s great to have a chance to think about it again, and learn more.

In fact some of the most pleasurable hours of the workshop were those I spent learning about large cardinal axioms from Cameron Freer and Nate Ackerman. I’ve been slowly trying to read Kanamori’s

The Higher Infinite, but they gave me a much better sense of the big picture.Has Eliezer (or the people associated with him) provided you with any funding or monetary compensation? Not that there would be anything intrinsically wrong with this but my impression is that he/this institute has more money than mainstream credibility. This feels like Templeton.

They paid for my air ticket up there, and my hotel, and some food. My opinions are not so cheaply bought. And I never feel very bad about taking money from people who give it to me: they might otherwise spend it on worse things.

[…] please see Benja Fallenstein’s post: Results from MIRI’s December workshop. See also two posts about the workshop by workshop participant John […]

If I understand correctly why they worry about this Löbian issue, it’s that they want to know how, in principle, to build an AI-controlled machine which can design and build fancier, smarter AI-controlled machines, which can design and build even smarter ones, ad infinitum, and they want to be able to trust all these machines to “never do the wrong thing” according to some formal definition. But (they recognize) even a lone machine (here approximated as a formal theory) can never “be sure it’s consistent” (prove its consistency), let alone be sure the smarter machines it designs and builds are consistent; so they are stuck.

Here is my sketch of a solution to this: admit that Machine 1 can’t prove it’s own consistency, nor that of its smarter “child” Machine 2. Instead only ask each Machine i to prove “my consistency implies my smarter child’s consistency”. There is no barrier to this. Then since we, the original builder, believe Machine 1 was consistent, we readily infer each Machine i is consistent.

(There is a lot left out here — most importantly, I’m glossing over an inference on our part from “reasoning by Machine i is consistent” to “conclusions by Machine i are correct” — but I think this covers the logical issue at the heart of the problem.)

Hi! Great to hear from you. You described the problem well, I think.

I’m not sure your approach would satisfy everyone. It’s known that the consistency of Peano Arithmetic implies the consistency of Peano Arithmetic plus the axiom

saying that Peano Arithmetic is inconsistent. I don’t see why this can’t be internalized to show

So if I’m PA (which conveniently also stands for ‘father’), I can prove that my consistency implies the consistency of my child who says I’m inconsistent. But should I trust a child who thinks I’m insane? (And: should the child trust me?)

I’m not sure how bad this problem is, but it does suggest some things to worry about.

You might like to look at the ’tiling agents’ paper linked to above, which raised this subject in the first place.

That’s a valid criticism.

A better way to allow for a smarter child theory would be to require that the child theory has the same set of theorems on statements which are also statements in the parent theory. This lets it add new symbols, and axioms to define them (e.g. add abbreviations), and add new shortcut inference rules; but it doesn’t let it decide anything that was undecidable to the parent, or forget anything that was believed by the parent.

Maybe we could call this relation between theories “extended but equivalent”. (The first relation I mentioned is called “relatively consistent”, as you know.)

(Of course the parent still has to prove the child theory has this relation to the parent theory, before it allows itself to activate the child.)

But after reading the relevant parts of their draft paper (“Tiling Agents for Self-Modifying AI, and the Löbian Obstacle”), I see that this doesn’t solve their “Löbian Obstacle” problem. It only might show how to improve a hypothetical solution to that problem in which the child had the same theory as the parent, to one in which the child has a better theory.

But I think I worked out a solution to their “Löbian Obstacle”, too. I will put it in a subsequent comment.

Ok, here’s my proposed solution to the “Löbian Obstacle”. (Technically I should call it a “solution strategy” since it has one step I haven’t formally proven correct, which is pointed out below; it’s the validity of assuming “rule R”.)

To be brief, I’ll reuse background assumptions and some terminology and notation from their paper (sections 1-3) without restating most of it. But I will renumber their machines so parent i has child i+1, not i-1, and make other minor revisions for clarity.

I’ll also point out explicitly that their model seems to involve any one machine performing at most one action, which either does or doesn’t result in (perhaps indirectly) achieving goal G (while avoiding disastrous effect D), rather than a series of actions designed to incrementally achieve a long-term goal. (That simplification is ok for this limited purpose.) In fact they distinguish three result states (G for goal achieved, D for disaster occurred, or 0 for “nothing”) which an arbitrary action (or lack of action) might lead to; to simplify, I’ll just assume each action taken is required to be “ok” with respect to G, D, and 0, that is, there is some predicate Ok on proposed actions A (or more precisely, on executable descriptions, including all relevant context, of proposed actions), and any machine we want to allow to be active must prove Ok(A) before taking action A.

Problem: how can parent machine i (M1), built to permit itself to perform action A iff it can prove Ok(A) (and to try to take action), delegate this responsibility to a machine i+1 (M2) it constructs and activates (by means of a provably ok action it takes to do that), without the “Löbian Obstacle” either preventing this, or forcing M2 to be mathematically less powerful than M1 (in how it can reason when trying to prove Ok(A))?

Naive solution: M1 constructs M2 so it can prove that M2 will work like M1 does: only take actions it can prove ok, using the same theory as M1. Then M1 can infer that any action taken by M2 will be provably Ok, and therefore Ok. The last step (provable(Ok(A)) -> Ok(A)) is invalid, and that’s the “Löbian Obstacle”.

I’ll first present the fix (with only minimal motivation) and show how it works and why it’s ok; then I’ll give more motivation and discussion.

===

The proposed fix:

Let T be the theory that M1 uses and wants M2 to be able to use, to try to prove Ok(A). (In this comment, “provable” always means “provable in T”.)

Presumably we designed T to include both a powerful-enough mathematical theory, and all the “facts about the world” we think M1 will need (as axioms) (otherwise we could not even express Ok(A) in T’s language). And we also believe T is consistent, and those “facts about the world” are true. And due to that construction of T, we believe provable(X) -> X for any statement X. The “Löbian Obstacle” is just that we can’t let M1 (or any M_i) believe that implication, but we want it to “act as if it did”.

That last phrase (“act as if it did”), taken literally, tells us one part of the fix:

(1) We change the action rule built into our machines from the naive one

“to do A, you must prove Ok(A)”

to

“to do A, you must prove (Con(T) -> Ok(A))”.

Note the very important distinction from the naive way of trying for the same result, which is to get the machines to believe Con(T): the machines don’t believe Con(T), and (unlike us humans) they are not confused about this — they know they don’t (explicitly) believe it. But this action rule lets them do things they can only prove under the assumption Con(T).

To get the other part of the fix, we note that our own belief in (provable(Ok(A)) -> Ok(A)) (“rule R-naive”) is only valid since we also believe Con(T) (and certain facts about the world), as discussed above. So for use by the machines, who can’t believe Con(T):

(2) We have the machines believe

provable(Ok(A)) -> (Con(T) -> Ok(A))

or more generally (as will be needed below):

provable(S) -> (Con(T) -> S) [at least for a certain kind of S, to be determined] [this is “rule R”]

(I am not sure if rule R can be proven as a theorem or should be added as an axiom, nor how general the statement S needs to be or can safely be, nor whether by Con(S) I need to mean “omega consistent” or just “consistent”. But at least if S is restricted to statements like the two forms we use here, I think one or the other (theorem or axiom status for this statement) is doable. But I have not formally proven that claim (that this statement can be safely added as an axiom). The status of my claim about it is: “I am very sure this avoids the Löbian Obstacle, and pretty sure it is a valid thing to do in some useable form, and convinced it’s at least on the right track for the natural fundamental solution to the Löbian Obstacle”.)

===

How it works:

First, we’d better make sure it still works when M1 just acts directly. M1 can now act in more cases than it could before — is this safe? Suppose M1 does A; then we know it proved (Con(T) -> Ok(A)). Since we (humans) believe Con(T), we know Ok(A). So we’re ok.

Next, does this now let M1 make the desired equally-powerful M2? It constructs M2 like itself as before, and then (deciding whether it’s safe to activate M2) reasons:

if M2 does A, I know it proved (Con(T) -> Ok(A)), so I know provable(Con(T) -> Ok(A)).

That doesn’t let me conclude Con(T) -> Ok(A) directly, but (by rule R) I do have

provable(S) -> (Con(T) -> S)

for S = (Con(T) -> Ok(A)), i.e.

provable(Con(T) -> Ok(A)) -> (Con(T) -> (Con(T) -> Ok(A)))

and thus

Con(T) -> (Con(T) -> Ok(A))

which is equivalent to

Con(T) -> Ok(A)

which is just what I needed.

So it can correctly conclude “the effect of activating M2 and then having it do some A, which is the same as the effect of A, satisfies Con(T) -> Ok(A), so I can safely activate M2”.

(I am not sure there is not some other logical issue coming up when passing from “the effect of getting M2 to do A” to “the effect of A”, but if there is, I think it’s either independent from the one being considered here or could be solved in the same way once formalized. And if there is, that issue also went unnoticed in the paper I’m responding to.)

===

Discussion:

Note that if M1 didn’t construct M2, but found it, examined it, and came to believe it had the proper structure, it could decide to activate it or help it or not try to stop it, just as if it constructed it. Or if it “found itself” (seeing itself from outside, understanding its own programming), it could do the same thing (help itself, not try to stop itself). Even if M2 had a very different detailed structure, as long as M1 could prove it only did actions under the proper criterion (the same one as M1), it could safely delegate authority to it, activate it, or whatever.

More generally, this scheme satisfies all the desiderata at the end of section 3 in the paper, to the extent they can be or need to be satified at all. (That is, some of those are worded so as to implicitly require the machines to believe “rule R-naive”, which clearly they can’t do, but reworking them to permit “rule R” doesn’t lose anything important.)

Looking at what changes (1) and (2) really mean, and why they’re justified and natural:

Change 2 is to use rule R rather than rule R-naive. As argued before, rule R is in fact a more reasonable claim about what provability in T means, given that you can’t really be sure T is consistent, however much you hope so and are willing to act “as if it is” (i.e. to evaluate the consequences of your actions under that assumption).

Change 1 is (imagining yourself in the role of one of these machines) to be willing to do things you haven’t proven are ok, provided you’ve proven “either they’re ok or not Con(T)”, i.e. “either they’re ok or everything I believe might be wrong”. In practice, you lose no safety by this relaxation, since if your mind is based on T but if Con(T) is false, then in fact everything you believe *might* be wrong, and your actions would be unpredictable even if you didn’t make this change in policy!

Making an analogy to humans reasoning out their own actions, it’s like a sufficiently wise human admitting “I’m sure this is ok, provided my basic assumptions are sound, so I’ll do it”, in spite of the fact that they can never be really sure their basic assumptions are sound (and, being wise, they know this) — the best they can do is hope so and act as if they are. (If we were talking about higher level assumptions, they often can do better, i.e. react rationality if evidence against them comes up; but if the assumption is at the level of “Con(T)”, rational reaction is probably not possible, at least if T was a conservative choice of theory.)

A sufficiently honest and wise human might decide to make the same change in self-understanding as is advocated here for the machines — namely, to stop believing in Con(T), and to admit one’s actions are only being internally proven ok up to that assumption anyway (whether or not that’s explicit in one’s structure, since one’s correct functioning depends on that assumption anyway).

This is no different in practice than saying something like “it’s possible in principle that at any moment the earth could be destroyed by something we can’t predict or stop or notice beforehand, like a moon-sized body travelling at almost the speed of light coming far away and hitting it; there is no rational argument to even prove this has low probability (at least when all similar possibilities are added up, including the presumed majority of them which I never thought of); but (for possible events truly unpredictable/unpreventable) the best thing to do is to ignore them and act as if they definitely won’t happen”.

Finally, note that this does highlight how important it is, when building machines of this kind, to choose T as conservatively as possible — even if you didn’t make change 1 in them, you’d only be fooling yourself if you thought that made them safe against T being inconsistent, since if T is not consistent then they could in fact do anything (due to errors in reasoning), regardless of their action-filtering policy.

===

(An interesting question is what consequence it might have if T is consistent, but can prove it’s not — this would be almost as surprising for a conservative T as if it’s directly inconsistent, but not quite as much; in the present scheme this permits a machine to take any action. Hopefully there could be some “failsafe” mechanism to try to notice an internal proof of not(Con(T)) as such — certainly if it was being used in a direct way to permit some action, this seems doable. But this issue needs further investigation.)

===

P.S. After writing the above I skimmed the “results from workshop” link to lesswrong. It looks like there are some interesting commonalities between the above scheme and something related to solving the “procrastination paradox”, but I didn’t yet study that so I can’t say how closely related they might be.

I should add that an alternative to adding (as part of T) an axiom

provable(S) -> (Con(T) -> S)

is to add a new undefined boolean symbol Sane and then add the axiom

provable(S) -> (Sane -> S)

and optionally (I think) to revise the action filtering policy similarly, so the machine can only do an action A if it believes Sane -> Ok(A).

(The proof of Löb’s Theorem given in TilingAgents.pdf can then be imitated to show that, in T, Sane -> Con(T). As far as I can tell, T will know nothing else about Sane.)

This alternative might make no difference, or it might have some technical advantages (I’m not yet sure).

That’s an interesting way to try to get around the Löbian obstacle! It’s nice and simple. I don’t have time right now to investigate it in detail, so I’ll pass it on to the other folks who attended this workshop, including the authors of that Tiling Agents paper.

Thanks!

Well, it took them (actually Benja Fallenstein) less than an hour to notice an apparent fatal flaw in my scheme above! For now I will leave this as an “exercise for the reader” since I’m still thinking about it, though there is little doubt that my proposal as presented above won’t work.

Bummer! But it’s good to find this out.

This has been carefully studied in mathematical logic and proof theory, and the basic setup is well-understood.

1. Transfinite progressions of theories (Turing, Feferman, et al.)

2. Truth theoretic extensions.

For the first, simply google “Feferman transfinite progressions”.

For the second, let PA be ordinary first order arithmetic. If PA is consistent, then . Indeed, by Lob’s theorem, if , then . This is apparently, the “Lobian obstacle”.

It is no “obstacle”. It is well-known how to strengthen theories to obtain proofs of their reflection principles. So, e.g., take PA and extend the language with a new unary predicate symbol True(.) and add the following four axioms:

(i)

(ii)

(iii)

(iii)

Let the result be called Tr(PA). Then, for all $A$ (in the arithmetic language):

That is, Tr(PA)

provesthe reflection principle for PA.This can be iterated. See, e.g.,

Feferman, S, 1991: “Reflecting on Incompleteness”, JSL.

Jeff

Oops, bugger. No latex previewer …

The standard truth axioms are:

(i)

(ii)

(iii)

(iv)

Jeff

Jeff, yes of course it is well-known how to go from a theory T to a stronger theory T’ that proves the reflection principle for T. (The most obvious way being to take T’ to be T + the reflection principle for T.) This allows a machine using T’ to justify building a machine using T (intuitively, because it can reason: whatever the action the new machine takes, it has found a proof in T that this action is “ok”, and the existence of a proof in T implies that this statement is true). But it doesn’t allow building a machine that uses T’; in other words, each generation has to use a weaker system. Moreover, the obvious ways of creating stronger systems lead to well-founded hierarchies of systems, where every descending sequence of weaker and weaker systems terminates after a finite number of steps. Thus, if we only used those hierarchies, at some point we’d reach a lowest system, and the machine using that system wouldn’t be able to build any further machines using the same kind of reasoning.

That said, there

areknown partial solutions to the problem, in particular the ones described in Tiling agents for self-modifying AI, and the Löbian obstacle and An infinitely descending sequence of sound theories each proving the next consistent, both linked in John’s blog post.@Jeffrey: I can’t speak for Benja or the other workshop participants, but I think I know the answer to one of your questions (having read TilingAgents.pdf sections 1-3 fairly closely — those sections stand alone, so there’s no need to read the rest of that paper for this discussion):

The ultimate goal is for the children to be smarter than the parents, or at least equally smart; but the workshop participants have so far not been able to achieve that (with a formal proof of validity); in the only provably valid schemes they have, each child must be slightly less smart than its parent (though perhaps *so* slightly that it may not matter much). (For example, they can make each parent be higher than its children on one of those hierarchies you mention.)

(I am using “smart” imprecisely here, as an abbreviation for how powerful a theory something lets itself prove things in, if it’s going to believe them and act on them. Nothing would prevent the children from being smarter in other ways, like in how they choose to use that theory, how fast they can think, etc.)

That difference (between desired scheme, and best proven scheme) is exactly the reason there’s a remaining issue.

As for “why not just do what biology does”, that is of course what will happen “by itself” if we do nothing special (except develop AI, or even, do nothing to stop other people from developing it). The problem is, the environment will then remove not only the dumb offspring, but the dumb parents (i.e. natural humans). (Or at least displace us from our present “dominance” and perhaps many of our ecological niches.) They are trying to come up with some scheme which lets humans develop AI “safely”, i.e. be sure to not get removed by it.

As for why infinitary issues might matter, it would be great if they didn’t, but apparently no one can yet formalize a scheme for this where they don’t matter (or if someone has, this is not yet known to the workshop participants). I agree with your intuition on this (there should be a way to do it where those issues don’t matter), but I don’t yet have a solution for that and neither do they.

Thanks, Benja. Yes, I read (more liked skimmed, as there’s a lot of stuff to digest!) the papers John linked to.

I still don’t quite get the issue, though. To be honest, I’ve probably misunderstood what you mean by the Lobian obstacle. You want to have these sequences of self-modifying agents, that get progressively smarter somehow (e.g., more reflective). So, their *children* are meant to be smarter.

But then you set it up so that their *parents* are smarter … so I’m kind of lost. Are the children smarter or less smart?

On the usual thing in proof theory, we define hierarchies , getting stronger and stronger. Either by brute force (adding consistency statements or reflection principles – transfinite progressions) or by adding truth axioms (e.g., Feferman 1991), or maybe by adding stronger higher-order comprehension principles (e.g., ). Feferman is the pioneer for all of this.

(On the idea of automated “self-modifying” reflection, I’m a bit sceptical, by the way. Why not let it be as in biology – let the parents have code that “mutates”, and let the environment remove the dumb offspring, and select the cleverer ones? Also, it’s unclear why these somewhat esoteric, infinitary, issues (really, about -sentences, ) should matter. The environment of any agent is finite, and when everything is finite, none of the Godelian issues arise.)

Jeff

Bruce, many thanks.

Why not let the children have the truth axioms describing the semantics of their parent? Then they’re smarter. If is the truth extension of the theory , you get

proofsof the reflection principles,which is what you want.

You might like to read this recent exchange on closely related topics:

• Cieslinski, C. 2010: “Truth, Conservativeness and Provability”,

Mind.• Ketland, J. 2010: “Truth, Conservativeness and Provability: Reply to Cieslinski”,

Mind.Jeff

Because if you do that, then when the parent turns on the child it can’t prove it’s doing something safe. (In their current formalizations of this situation, it can’t even prove this if the child *equals* the parent, which clearly “ought to be ok”, which is exactly the “Löbian obstacle”.)

Thanks for those references, but I’d better read them when I’m more awake (it’s late at night now).

I realized that some of my earlier comments are unclear because I am using “smart” in two different ways. So, to clarify:

If the parent has to prove it’s safe to turn on the child before doing so, I don’t think anyone believes it could be possible for the child to have a more powerful proof theory than the parent. (Only unsafe operations, like evolution, can do that.)

But what ought to be possible would be for the child to have the *same* proof theory as the parent, and be smarter or better in practical ways (like implementation shortcuts, speed, capacity, etc).

But it’s not yet known how to do that in a general way; the most that’s known (in principle only, not yet done in practice) is how the child can be *slightly weaker* in proof theory (though it could at the same time be smarter or better in other ways).

It is known how to do the equal-proof-theory child in a *non-general* way. For example, the parent could have a special operation “replace me with an identical copy”. It would be easy to prove that’s safe. But what’s wanted is for the same safety rules used for general operations on the physical world to also work for making and activating the child. That way it’s clear that there is no artificial limit on the child’s structure or implementation. (And there are other benefits, described in TilingAgents.pdf.) That’s what is not yet known how to do in principle (to those authors or me, anyway).

Thanks, Bruce, I get the constraint now, I was missing it – it’s the “if” clause below:

“If the parent has to prove it’s safe to turn on the child before doing so, I don’t think anyone believes it could be possible for the child to have a more powerful proof theory than the parent. (Only unsafe operations, like evolution, can do that.)”

So mu question would be, why does the parent have to *prove* this? Is it not ok for the parent simply to *be* safe?

Maybe I want to have a child. Maybe I am safe, maybe I’m not. God knows! But in order to have a safe child, I merely need to *be* safe. I don’t have to *prove* I’m safe before I have a child.

(It could be irresponsible, I suppose, for a person to reproduce if they already know, or have some evidence, they’ll pass on some horrible condition. But I think that situation is a bit different – this would be an agent that knows that one of its own “modules” is broken; so the agent could request that module to be fixed, before generating a child.)

Also, it’s a bit unclear what use there would be in *me* proving *I* am safe. After all, unsafe theories (e.g., inconsistent ones) *can* prove that they are themselves safe. And there are consistent theories which, wrongly, prove themselves inconsistent! (E.g., PA + ~Con(PA).) So, I’m inclined to say that self-certification of safeness seems impossible, given Godel’s incompleteness result.

Jeff

(my reply to “Thanks, Bruce, I get the constraint now” got accidentally posted a bit higher up. Maybe I clicked the wrong “reply” link.)

Bruce, thanks for all that and apologies for the delay! I think I see better now the whole issue; in a nutshell, I think the underlying problem is this:

(i) Design aim: parent needs to *certify* the child is safe before switching on child.

(ii) Design aim: preferably child is smarter.

(iii) Godel.

But these three are incompatible given the maths. (iii) is more or less fixed, as it’s a standard result in math logic. So, one has to give up, or get round (i) or (ii). I think your hope is to get round (ii) by making the child in a sense both smarter but weaker, but I think it’s hopeless … I think you should give up design aim (i).

There is a good way to make the child smarter: give the child the truth theory for the parent. (This would be the “reflection approach”, in line with Feferman’s work on transfinite progressions, reflective closure, etc.)

One oddity of all this is that the parent (well, PA) can talk about their own truth theory (which is merely talking about syntax), but the parent can’t *believe* this theory … For example, I (JK) know that the truth theory for my language contains axioms “JK says “p” if and only if p”. I can easily describe this. However, I cannot believe it. The addition of this axiom to my “belief box” makes me inconsistent (indeed I also know this…!).

For example, adding the simplest truth theory to PA gives a theory called DT (disquotational truth), which has axioms “A is true iff A”, for each sentence . In fact, DT is a conservative extension of PA, and in fact PA proves this. So,

Jeff

Of course self-certification of safeness is both impossible and useless. Again I was unclear (in hindsight), in two ways:

– I meant the parent has to prove that *the act of turning on the child* is safe, not that *the parent is safe*;

– By “child” I didn’t mean “copy of parent” (which some of your comments seem to assume, at least implicitly), but “anything the parent wants to make, and delegate some authority and responsibility to, to help it do something” — so “agent” is probably a better word, and indeed the authors of TilingAgents.pdf used “agent” rather than (or at least more prominently than) “child”. So the thing I was calling “the child” might be arbitrarily different than the “parent” — those terms only make sense because of the “child” having been made by the “parent”. (Maybe this terminology makes more sense to programmers (like me), who use “parent” and “child” in all kinds of ways, e.g. for nodes in directed graphs, for computer programs when one initiates the running of another, etc, than it does to other people.)

So to clarify again, and answer your points: the parent can’t prove it’s safe, but we hope it *is* safe, and we try to prove that ourselves. But to prove that, we have to prove it only takes safe actions (including when it makes agents or “children”). If we specify a limited set of actions it can take (perhaps including “produce an agent of exact design X”), and prove all these actions are safe, we’re ok, but we don’t want to be so limited. So we want to let the parent take any action *it* can prove is safe, using a theory whose proofs *we* already trust. (Of course there is no ultimate justification for our trust in a specific proof theory — it’s really a matter of faith. But as long as we have that faith, we want to take advantage of it and are willing to trust it.)

And to make sure the parent is not overly limited in what kind of agents it can produce to delegate some powers to, we want to include actions to do that (i.e. to produce agents of any design — or for that matter, recognize already existing ones — and delegate powers to them) in the set of actions it can do, in any form it can prove is safe.

And this is where we run into the logical problem, since at least in the way this has been formalized in TilingAgents.pdf, it doesn’t seem possible to achieve all that unless the parent produces an agent whose own proof theory (for its own internal use in limiting its actions) is weaker than that of the parent.

I suspect that problem can be solved somehow, perhaps only by revising the way it’s formalized, but it’s still an open question.

(It’s also conceivable it can be proved impossible to solve, much like various goals people had for formal systems were proved impossible.)

(By the way, in case there is any danger of some other reader taking some of these comments out of context, let me make it clear that I’m not advocating that *human beings* should have to prove it’s safe to have a child before doing so! (Or prove it’s safe to take any other action.) This is only about a technical model of one hypothetical way to make certain kinds of machines safe for humans to use in general ways.)

(This comment “Of course self-certification of safeness is both impossible and useless” was supposed to go lower down, as a reply to Jeffrey saying “Thanks, Bruce, I get the constraint now”. I don’t know why it ended up earlier than the comment it replies to.)

Gentlemen, forgive an intrusion, but to focus you on a common question, in Theory of Games, is the idea that “rational” means here that there are networks of theorems from the start of the game to the end?

That’s

not how people usually think of it, but I supposed you can think of a “rational” agent as one who only takes some action after it proves a theorem saying it should. Of course this theorem must be derived from some set of ‘axioms’: assumptions about the state of affairs and the goals of the agent. So, personally, I think rationality is useful but rather limited in its powers, especially when it comes to choosing ones basic goals.Yes, just wondering if there might be a classic game here, one that triggers discussion like the iterated prisoner’s dilemma, the ultimatum game, and the tragedy of the commons. It seems like the axioms or models for making decisions in this kind of game could be different, if even a little, for every choice transition in a Petri net of the game. Here’s an example of what I’m calling a choice transition.

Here are the details of how it works.