If the system did not trust PA, why would it trust a system because PA verifies it?
Because that’s how it works! The system “is” PA, so it will trust (weaker) systems that it (PA) can verify, but it will not trust itself (PA).
More to the point, why would it trust a self-verifying system, given that past a certain strength, only inconsistent systems are self-verifying?
It would only trust them if it could verify them.
If the system held some probability that PA was inconsistent, it could evaluate it on the grounds of usefulness, perhaps contrasting it with other systems.
True.
It could also try to construct contradictions, increasing its confidence in PA for as long as it doesn’t find any.
Not necessarily; this depends on how the system works. In my probabilistic prior, this would work to some degree, but because there exists a nonstandard model in which PA is inconsistent (there are infinite proofs ending in contradictions), there will be a fixed probability of inconsistency which cannot be ruled out by any amount of testing.
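A toy sketch of why testing only helps "to some degree" here. It is not the probabilistic prior mentioned above; the three hypotheses, the prior weights, and `detect_rate` are all made-up assumptions for illustration. The key assumption is that the "nonstandard only" hypothesis predicts exactly what the "consistent" hypothesis predicts for every finite search, so no number of clean tests can separate the two.

```python
def posterior_after_clean_tests(prior, detect_rate, n_tests):
    """Bayesian update after n_tests searches that found no contradiction.

    Hypotheses (all hypothetical labels):
      consistent            - no contradiction exists at all
      finite_contradiction  - a standard, finite contradiction exists; each
                              search finds it with probability detect_rate
      nonstandard_only      - the only "proofs" of a contradiction are
                              nonstandard, so no finite search finds one
    """
    likelihood = {
        "consistent": 1.0,
        "finite_contradiction": (1.0 - detect_rate) ** n_tests,
        "nonstandard_only": 1.0,
    }
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


prior = {"consistent": 0.90, "finite_contradiction": 0.05, "nonstandard_only": 0.05}
for n in (0, 10, 1000):
    post = posterior_after_clean_tests(prior, detect_rate=0.1, n_tests=n)
    print(n, round(1.0 - post["consistent"], 4))
```

In this toy model the probability assigned to "not consistent" falls as clean tests accumulate, but it levels off at the weight of the nonstandard hypothesis relative to the consistent one, and never reaches zero.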
Because that’s how it works! The system “is” PA, so it will trust (weaker) systems that it (PA) can verify, but it will not trust itself (PA).
That doesn’t seem consistent to me. If you do not trust yourself fully, then you should not fully trust anything you demonstrate, and even if you do, there is still no incentive to switch. Suppose that the AI can demonstrate the consistency of system S from PA, and wants to demonstrate proposition A. If the AI trusts S as demonstrated by PA, then it should also trust A as demonstrated by PA, so there is no reason to use S to demonstrate A. In other words, it is not consistent for PA to trust S and not A. Not fully, at any rate. So why use S at all?
What may be the case, however (and this might be what you’re getting at), is that in demonstrating the consistency of S, PA assigns P(Consistent(S)) > P(Consistent(PA)). Therefore, what both PA and S can demonstrate would be true with greater probability than what only PA demonstrates. This kind of system solves our problem in two ways: first, it means the AI can keep using PA, but can increase its confidence in some statement A by counting the number of systems that prove A. Second, it means that the AI can increase its confidence in PA by arbitrarily building stronger systems and proving the consistency of PA from within these stronger systems. Again, that’s similar to what we do.
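A rough sketch of the "count the systems that prove A" idea. It rests on two loud assumptions that are mine, not part of the argument above: each system gets a probability of being sound (not merely consistent), and those events are treated as independent. Independence is unrealistic here, since trust in S is itself derived from PA, so the number is at best illustrative. The function name is hypothetical.

```python
from math import prod


def confidence_lower_bound(soundness_probs):
    """Lower bound on P(A) when every listed system proves A.

    If at least one of the systems is sound, A is true, so
    P(A) >= 1 - P(all systems unsound). Under the (strong) independence
    assumption, P(all unsound) is the product of the individual
    probabilities of unsoundness.
    """
    return 1.0 - prod(1.0 - p for p in soundness_probs)


print(confidence_lower_bound([0.95]))        # PA alone
print(confidence_lower_bound([0.95, 0.99]))  # PA and S both prove A
```

Under these assumptions each additional system that proves A can only raise the bound, which is the sense in which counting proofs increases confidence.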
Not necessarily; this depends on how the system works. In my probabilistic prior, this would work to some degree, but because there exists a nonstandard model in which PA is inconsistent (there are infinite proofs ending in contradictions), there will be a fixed probability of inconsistency which cannot be ruled out by any amount of testing.
That sounds reasonable to me—the usefulness of certainty diminishes sharply as it approaches 1 anyway. Your paper sounds interesting, I’ll give it a read when I have the time :)
You are smuggling in some assumptions from your experience with human cognition.
If I believe X, but don’t believe that the process producing my beliefs is sane, then I will act on X (I believe it, and we haven’t yet talked about any bridge between what I believe, and what I believe about my beliefs), but I still won’t trust myself in general.
I certainly agree that “if I don’t trust myself, I don’t trust my output” is an assumption I draw from human cognition, but if I discard that assumption and those that seem to be in the same reference class as it, I’m not sure I have much of a referent left for “I trust myself” at all. Is there a simple answer to what it means for me to trust myself, on your account? That is, how could I tell whether or not I did? What would be different if I did or didn’t?
E.g., do you think the generalization “things I believe are true” is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable?
I have a machine which maximizes expected utility given the judgments output by some box O. If O says “don’t trust O!” it won’t cause the machine to stop maximizing expected utility according to O’s judgments—that’s just what it is programmed to do. But it will cause the machine to doubt that it will do a good job, and work to replace itself with an alternative. The details of exactly what happens obviously depend on the formalism, but hopefully the point makes sense.
E.g., do you think the generalization “things I believe are true” is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable?
I guess I’m just not following you, sorry. It seems to me that these are precisely the sorts of things that follow from “I trust myself” in the derived-from-human-cognition sense that I thought you were encouraging us to reject.
I have a machine which currently maximizes expected utility given the judgments output by O.
The question arises, is it programmed to “maximize expected utility given the judgments output by O”, or is it programmed to “maximize expected utility” and currently calculates maximum expected utility as involving O’s judgments?
If the former, then when O says “don’t trust O!” the machine will keep maximizing expected utility according to O’s judgments, as you say. And its maximum expected utility will be lower than it previously was. But that won’t cause it to seek to replace itself. Why should it?
If the latter, then when O says “don’t trust O!” the machine will immediately lower the expected utility that it calculates from O’s judgments. If that drops below the machine’s calculated expected utility from some other source P, it will immediately start maximizing expected utility given the judgments output by P instead. It might replace itself with an alternative, just as it might have if O hadn’t said that, but O saying that shouldn’t affect that decision one way or the other. No?
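The two readings can be sketched as two small classes. Everything here is hypothetical: the class names, the `hear_distrust` method, the 0.5 discount, and the idea of summarizing "the expected utility it calculates from a source's judgments" as a single number per source are assumptions made only to show the difference in behavior.

```python
class HardWiredMachine:
    """Reading 1: programmed to maximize according to O's judgments, full stop."""

    def __init__(self, judge_o):
        self.judge_o = judge_o

    def hear_distrust(self):
        # O saying "don't trust O!" changes nothing about which judge is used.
        pass

    def choose(self, options):
        return max(options, key=self.judge_o)


class SourceSelectingMachine:
    """Reading 2: maximizes expected utility and currently happens to use O."""

    def __init__(self, sources, expected_utility):
        self.sources = dict(sources)                    # name -> judgment function
        self.expected_utility = dict(expected_utility)  # name -> current estimate

    def hear_distrust(self, name, discount=0.5):
        # Lower the expected utility calculated from this source's judgments.
        self.expected_utility[name] *= discount

    def current_source(self):
        return max(self.sources, key=lambda name: self.expected_utility[name])

    def choose(self, options):
        return max(options, key=self.sources[self.current_source()])


def judge_o(x):
    return x    # O prefers larger options


def judge_p(x):
    return -x   # P prefers smaller options


machine = SourceSelectingMachine({"O": judge_o, "P": judge_p}, {"O": 0.9, "P": 0.6})
print(machine.current_source())
machine.hear_distrust("O")
print(machine.current_source())
```

In the first class the warning has no effect on which judgments are used; in the second it causes a switch to P only once O's estimate drops below P's.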
If you are merely using O’s judgments as evidence, then you are ultimately using some other mechanism to form beliefs. After all, how do you reason about the evidential strength of O’s judgments? Either that, or you are using some formalism we don’t yet understand. So I’ll stick with the case where O is directly generating your beliefs (that’s the one we care about).
If O is the process generating your beliefs, then when O says “don’t trust O” you will continue using O because that’s what you are programmed to do, as you wrote in your comment. But you can reason (using O) about what will happen if you destroy yourself and replace yourself with a new agent whose beliefs are generated by O’. If you trust O’ much more than O then that would be a great change, and you would rush to execute it.
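A minimal sketch of that last step, with made-up names and numbers. The point it tries to show is that the agent never stops using O: the comparison between "keep running on O" and "hand over to an agent running on O’" is itself scored with O's own estimates. The `o_trust` dictionary and `payoff_if_reliable` are assumptions for illustration, and `O_prime` stands for O’.

```python
def choose_action(o_trust, payoff_if_reliable=1.0):
    """Pick an action using only O's own numbers.

    o_trust maps a belief-generator name to the trust O assigns it.
    The value of a future run by an agent with that generator is modeled,
    crudely, as trust times a fixed payoff.
    """
    actions = {
        "keep running on O": o_trust["O"] * payoff_if_reliable,
        "replace self with agent running on O_prime": o_trust["O_prime"] * payoff_if_reliable,
    }
    return max(actions, key=actions.get)


# O says "don't trust O" and rates O_prime much higher.
print(choose_action({"O": 0.3, "O_prime": 0.8}))
```

If O instead rates itself above O_prime, the same function returns "keep running on O", so the self-replacement is driven entirely by what O outputs.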
(nods) Agreed; if O is generating my beliefs, and O generates the belief in me that some other system having O’-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and replace myself with that other system.
For similar reasons, if O is generating my beliefs, and O generates the belief in me that me having O’-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and accept O’-generated beliefs instead. But, sure, if I’m additionally confident that I’m incapable of doing that for some reason, then I probably wouldn’t do that.
The state of “I believe that trusting O’ would be the most valuable thing for me to do, but I can’t trust O’,” is not one I can actually imagine being in, but that’s not to say cognitive systems can’t exist which are capable of being in such a state.
Have I missed anything here?