In “On Explicit Reflection in Theorem Proving and Formal Verification”, S. Artemov claims that for any t and P, it is derivable that [[t]]P→P, provided that the system has a few properties: the underlying logic is classical or intuitionistic, proofhood in the system is decidable and theoremhood is computably enumerable, the system can represent any computable function and decidable relation, and there is a function rep that maps syntactic terms to ground terms. PatrickRobotham tells me the paper is very messy, so maybe that doesn’t matter. I’m clearly in over my head.
The statement is definitely provable for any t and P: in any normal axiomatic system, checking whether a given proof proves a given statement is a purely mechanical, decidable procedure.
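To make that concrete, here is a minimal sketch of a proof checker for a toy Hilbert-style system; the formula encoding, axiom set, and function names are all invented for illustration, but the check stays a simple decidable loop in any real system.

```python
# Toy proof checker: formulas are atoms (strings) or implications
# ("->", A, B). A proof is a list of formulas, each of which must be
# an axiom or follow from two earlier lines by modus ponens. This
# encoding is made up for the sketch; real systems differ in detail,
# not in decidability.

def proves(proof, statement, axioms):
    """Return True iff `proof` is a valid derivation of `statement`."""
    derived = []
    for line in proof:
        ok = line in axioms or any(
            prev == ("->", ante, line)          # modus ponens
            for prev in derived for ante in derived
        )
        if not ok:
            return False
        derived.append(line)
    return bool(proof) and proof[-1] == statement

axioms = {"P", ("->", "P", "Q")}
print(proves(["P", ("->", "P", "Q"), "Q"], "Q", axioms))  # True
print(proves(["Q"], "Q", axioms))                          # False
```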
The thing that doesn’t work about this is that we can’t use it to express our confidence in a proposed self-modification. Given any particular proof that an AI’s modified self comes up with, the original AI can prove that it is valid; but in order to have reason to want to self-modify, the AI has to prove that this holds for every possible proof, which, by Löb’s theorem, is impossible (assuming, of course, that the AI does not think it would gain utility by self-modifying so as to become inconsistent).
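For reference, this is the shape of the obstacle. Löb’s theorem, writing ◻ for the arithmetized provability predicate of a theory T satisfying the usual derivability conditions:

```latex
% Löb's theorem: for any sentence P,
%   if  T ⊢ □P → P,  then  T ⊢ P.
\[
T \vdash \Box P \rightarrow P \quad\Longrightarrow\quad T \vdash P.
\]
% So a consistent T cannot prove the reflection schema for every P:
% taking P = ⊥, the instance T ⊢ □⊥ → ⊥ is exactly T ⊢ Con(T),
% which Löb's theorem would turn into T ⊢ ⊥.
```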
Scott Aaronson wrote:

Consider the “self-hating theory” PA+Not(Con(PA)), or Peano Arithmetic plus the assertion of its own inconsistency. We know that if PA is consistent, then this strange theory must be consistent as well—since otherwise PA would prove its own consistency, which the Incompleteness Theorem doesn’t allow. It follows, by the Completeness Theorem, that PA+Not(Con(PA)) must have a model. But what could such a model possibly look like? In particular, what would happen if, within that model, you just asked to see the proof that PA was inconsistent?
I’ll tell you what would happen: the axioms would tell you that the proof of PA’s inconsistency is encoded by a positive integer X. And then you would say, “but what is X?” And the axioms would say, “X.” And you would say, “But what is X, as an ordinary positive integer?”
“No, no, no! Talk to the axioms.”
“Alright, is X greater or less than 10^500,000?”
“Greater.” (The axioms aren’t stupid: they know that if they said “smaller”, then you could simply try every smaller number and verify that none of them encode a proof of PA’s inconsistency.)
“Alright then, what’s X+1?”
“Y.”
And so on. The axioms will keep cooking up fictitious numbers to satisfy your requests, and assuming that PA itself is consistent, you’ll never be able to trap them in an inconsistency.
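(The brute-force check Aaronson alludes to, “try every smaller number and verify”, is easy to write down, since proof-checking is decidable. In the sketch below, `encodes_pa_proof_of_contradiction` is a hypothetical stand-in for a PA proof checker applied to the decoded number.)

```python
# If the axioms ever claimed X < bound, the claim could be refuted by
# brute force, because checking whether a given number encodes a PA
# proof of 0=1 is decidable. `encodes_pa_proof_of_contradiction` is a
# hypothetical decidable predicate standing in for that check.

def refute_bounded_claim(bound, encodes_pa_proof_of_contradiction):
    for n in range(bound):
        if encodes_pa_proof_of_contradiction(n):
            return n   # PA really is inconsistent after all
    return None        # no witness below the bound: the axioms lied
```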
Taking this seriously, it seems that if PA is consistent, then PA + ¬Cons(PA) ⊢ ◻¬Cons(PA) is true, i.e. there exists a term, X, encoding a proof of ¬Cons(PA). It also seems, from “you’ll never be able to trap them”, that we can’t construct this proof. Which is to say, within some PA-like systems of interest, there are things (existentially) provable which are not (constructively) provable. So I agree that an AI should be able to accept statements provable in the second system before modifying to use that system. Our difference is between requiring existential proofs (which are not interesting to me, since things can be provable and yet never proved) and constructive proofs (a set whose membership is undecidable, yet which, by showing equivalence between the two systems or between the first and a subsystem of the second, we can accept). Again, however, this is not a real limitation, because the constructive proof of a statement lets us infer that the statement is existentially provable. Thus we recover the “sensible” existentially provable statements, and discard the ones which are (existentially) provable and yet (constructively) unprovable.
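(One standard way to state the distinction being gestured at here, writing Prf_T for the decidable proof relation of a theory T:)

```latex
% "Existentially provable": the theory proves the bare existential.
% "Constructively provable": a standard numeral n actually works.
\[
T \vdash \exists x\, \mathrm{Prf}_T(x, \ulcorner \varphi \urcorner)
\qquad\text{vs.}\qquad
\exists n \in \mathbb{N}:\ \mathbb{N} \models \mathrm{Prf}_T(\bar{n}, \ulcorner \varphi \urcorner).
\]
% In PA + ¬Con(PA) the left-hand side holds with T = PA and φ = (0=1),
% but if PA is consistent no concrete n works: Aaronson's X is a
% nonstandard element of the model, not a numeral you can exhibit.
```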
I don’t think this is quite what you meant. This means PA + ¬Cons(PA) allows you to prove that you can prove ¬Cons(PA), where the second use of the word ‘prove’ does not make reference to any specific formal system. The turnstile refers to the specific theory that is being used, here PA + ¬Cons(PA), and the box symbol is used when the meaning is clear from context, but you’ve used both.
Our difference is between requiring existential proofs (which are not interesting to me, since things can be provable and yet never proved) and constructive proofs (a set whose membership is undecidable, yet which, by showing equivalence between the two systems or between the first and a subsystem of the second, we can accept). Again, however, this is not a real limitation, because the constructive proof of a statement lets us infer that the statement is existentially provable. Thus we recover the “sensible” existentially provable statements, and discard the ones which are (existentially) provable and yet (constructively) unprovable.
This seems like a description of some constructive logic version of PA. I’m not sure exactly how that would work or if there is a standard constructive axiomatization of the natural numbers, but the set of theorems of constructive logic is a subset of those of classical logic, so this will not be able to prove that some given axiom system is trustable if the original system cannot do so.
You could add constructive theorems if you also added axioms, though I’m not sure whether that’s what you’re proposing. For example, given a universal Turing machine U and a number x, there is a shortest program p such that U(p) = x, and it could be the case, depending on U and x, that PA can prove that such a p exists, but not what it is. Then, by adding axioms stating that U never halts on certain inputs, it can become possible to deduce the value of p.
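A sketch of that situation, where `run` (a step-bounded interpreter for U) and `programs_of_length` are hypothetical helpers: dovetailing finds some program for x, but certifying minimality needs exactly the non-halting facts just mentioned.

```python
# Dovetailed search: this finds *a* program p with U(p) = x, hence an
# upper bound on the length of the shortest one. Certifying that p is
# actually shortest requires knowing that every shorter program fails
# to output x, and for programs that never halt that knowledge is
# precisely the kind of axiom being added. `run` and
# `programs_of_length` are hypothetical stand-ins for a real U.

from itertools import count

def some_program_for(x, programs_of_length, run):
    for budget in count(1):                # ever-larger step budgets
        for length in range(budget + 1):   # dovetail over lengths
            for p in programs_of_length(length):
                if run(p, steps=budget) == x:
                    return p
```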
I don’t see how that could help with the reflection problem though. Do you propose adding soundness of some formal system as an axiom, but preventing Löb’s theorem from applying by switching to constructive logic? I think Löb’s theorem is still valid constructively; it doesn’t use negation, but I’m not sure if that is a sufficient criterion to be provable constructively.
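(For what it’s worth, the usual derivation of Löb’s theorem appears to use no negation at all, only the diagonal lemma and the derivability conditions; a sketch:)

```latex
% Sketch of Löb's derivation, assuming T ⊢ □P → P and a diagonal
% sentence ψ with T ⊢ ψ ↔ (□ψ → P):
\begin{align*}
&T \vdash \psi \rightarrow (\Box\psi \rightarrow P) && \text{fixed point}\\
&T \vdash \Box(\psi \rightarrow (\Box\psi \rightarrow P)) && \text{necessitation}\\
&T \vdash \Box\psi \rightarrow (\Box\Box\psi \rightarrow \Box P) && \text{distribution, twice}\\
&T \vdash \Box\psi \rightarrow \Box\Box\psi && \text{derivability condition}\\
&T \vdash \Box\psi \rightarrow \Box P && \text{combine}\\
&T \vdash \Box\psi \rightarrow P && \text{with } \Box P \rightarrow P\\
&T \vdash \psi && \text{fixed point again}\\
&T \vdash \Box\psi,\ \text{hence}\ T \vdash P && \text{necessitation, modus ponens}
\end{align*}
% No step uses negation or excluded middle.
```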
It’s called Heyting arithmetic, but no, nowhere above do I mean intuitionistic logic when I say constructive. Or I don’t think I do. I recognize Löb’s theorem and am not trying to restrict any system’s expressive power to avoid it (e.g. by using Dan Willard’s self-verifying theories).
Maybe I’ll try to prove some sufficient conditions for a Gödel machine to self-modify in a toy universe, given this perspective. That should make it more convincing / concrete.