As I noted in the comments, I think I have a new and better proof, and I’m working on a careful writeup. However, that’s going to take a while, so for now I’m posting the introduction, which gives an intuitive overview, because I think this may help in the meantime.
== How to cheat Löb’s Theorem: my second try ==
In his open problems talk, Eliezer explains how Löb’s theorem prevents you from having a consistent proof system P with an axiom schema that anything P proves is actually true, and asks how we can then “build an AI that could completely rewrite itself, without decreasing the amount of trust it had in math every time it executed that self-rewrite” (18:46).
Yesterday, I posted about an attempt to apply a general trick for avoiding diagonalization problems to a minimal toy version of this problem. Unfortunately my “proof” was broken. I think I’ve fixed the problem and made the proof much simpler and more intuitive. (Please check me.) To avoid confusion, note that what I’m proving is slightly different from (but related to) what I did in the previous post.
I’ll be more explicit about quoting/unquoting, using notation I’ll introduce in a moment. (It has already prevented me, today, from making the same mistake a third time.) However, to sustain you through the schlep of preliminaries, I thought I’d start with an informal summary.
*
Löb’s theorem shows that it’s inconsistent to assume BAD := Peano Arithmetic + for each statement C, the axiom “if ‘C’ is provable in BAD, then C”. I will extend the language of PA with a constant symbol K, and consider the proof system PPT := Peano Arithmetic in the extended language + for each statement C of the extended language, the axiom “if ‘C’ is provable in PPT, and K>0, then D”, where D is the statement obtained from C by replacing each occurrence of “K” by “(K-1)”. [PPT is short for parametric polymorphism trick, version 0.2.]
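The step from C to D is simple enough to sketch in code. This is a toy model of my own, not part of the formal development: statements are plain strings, and “K” is assumed to occur only as the constant symbol.

```python
def step_down(statement: str) -> str:
    """Return D: the statement with each occurrence of "K" replaced by
    "(K-1)".  (A real implementation would substitute the constant symbol
    in a parse tree; textual replacement suffices for this toy.)"""
    return statement.replace("K", "(K-1)")

def ppt_axiom(statement: str) -> str:
    """The PPT axiom instance for a statement C of the extended language."""
    return (f"if '{statement}' is provable in PPT, and K>0, "
            f"then {step_down(statement)}")

print(ppt_axiom("the AI is safe for K steps"))
# prints: if 'the AI is safe for K steps' is provable in PPT, and K>0,
#         then the AI is safe for (K-1) steps
```

Note that applying `step_down` twice yields `((K-1)-1)`, mirroring how each use of the axiom schema costs one more decrement of K.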
I won’t just show that PPT is consistent; I will show that it’s true (i.e., sound), from which consistency follows. Of course, to do so I will have to say what my strange constant K is supposed to mean. I will show that for any natural number n, all axioms of PPT are true in the standard model of PA if K is interpreted to mean n.
What’s the use? In my toy version of the AI rewrite problem, the AI will accept a rewrite proposed by its unreliable heuristics module if PPT proves that after executing “K” rewrites this way, the AI’s values will still be preserved. By the above, this implies that any finite number of rewrites will be safe — exactly what you would not have if you used a hierarchy of proof systems, where PA(n+1) knows that things proven in PA(n) are true, as Eliezer explains in his talk. I will show that my AI will be able to accept itself as a safe rewrite, which seems to indicate that PPT is not entirely powerless in the direction we want. (I’m not saying that this somehow solves the whole problem, just that it suggests this might be a place worth looking.)
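To make the contrast with the hierarchy approach concrete, here is a toy gloss (entirely my own; the function names and the numeric “trust level” are illustrative, not part of any formal system):

```python
def hierarchy_trust_left(initial_level: int, rewrites: int) -> int:
    """Tower approach: PA(n+1) trusts PA(n), so each accepted self-rewrite
    descends one level; after initial_level rewrites, trust is exhausted."""
    return initial_level - rewrites

def ppt_covers_chain(rewrites: int) -> bool:
    """PPT approach: a single proof of "safe for K steps" is sound under
    the interpretation K = n for every natural number n, so a chain of
    any finite length is covered by the same one proof."""
    return rewrites >= 0

assert hierarchy_trust_left(5, 7) < 0       # tower runs out after 5 rewrites
assert all(ppt_covers_chain(n) for n in range(10**4))
```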
My soundness proof proceeds by induction. For K=0 (by which I mean that K is interpreted as the number 0), the assertion is trivial, since the axioms of PPT have no content in this case. Suppose PPT is sound for K=n, and let C be a statement in the extended language that’s provable in PPT. By the induction hypothesis, this implies that in the K=n interpretation, C is true. Therefore, for K=(n+1), D is true, where D is the sentence obtained from C by substituting (K-1) for K. This proves the truth of the axiom “If ‘C’ is provable in PPT, and K>0, then D.”
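The induction above can be summarized in symbols (notation mine: $\models_{K=n}$ for truth in the standard model with K interpreted as n, $\mathrm{Prov}_{\mathrm{PPT}}$ for the provability predicate):

```latex
% Base case, n = 0: the antecedent K > 0 is false, so every axiom
% "if 'C' is provable in PPT, and K > 0, then D" holds vacuously.
\mathbb{N} \models_{K=0}
  \bigl(\mathrm{Prov}_{\mathrm{PPT}}(\ulcorner C \urcorner) \land K > 0\bigr)
  \rightarrow C[K \mapsto K-1]

% Inductive step: assume soundness at K = n.  If PPT proves C, then C is
% true at K = n; and truth of C at K = n is exactly truth of
% D = C[K := K-1] at K = n+1, which verifies the axiom at K = n+1.
\mathrm{PPT} \vdash C
  \;\Longrightarrow\; \mathbb{N} \models_{K=n} C
  \;\Longrightarrow\; \mathbb{N} \models_{K=n+1} C[K \mapsto K-1]
```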
> I will show that my AI will be able to accept itself as a safe rewrite
Now I’m again unsure whether I can show this. Argh.
ETA: The argument I want to make is the one from the third paragraph of the proof sketch. The problem is the quantification over p: I need to show in PPT that “for all p, if PPT proves ‘p is safe for K steps’, then p is safe for K-1 steps”. But the axioms of PPT (v0.2) don’t allow for parameters (free variables) in the statement C.
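In symbols (notation mine, glossing over the coding of the parameter p into the quoted statement), what’s needed is one theorem with p quantified inside the system, whereas the v0.2 schema only supplies one closed axiom per closed statement C:

```latex
% Needed: a single PPT theorem, with p universally quantified inside PPT.
\mathrm{PPT} \vdash \forall p.\,
  \bigl(\mathrm{Prov}_{\mathrm{PPT}}
    (\ulcorner p \text{ is safe for } K \text{ steps} \urcorner)
  \rightarrow p \text{ is safe for } K-1 \text{ steps}\bigr)

% Available: one axiom instance per *closed* statement C (no free variables).
\bigl(\mathrm{Prov}_{\mathrm{PPT}}(\ulcorner C \urcorner) \land K > 0\bigr)
  \rightarrow C[K \mapsto K-1]
```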
I seem to have fixed the remaining problems; new proof attempt here.