You are smuggling in some assumptions from your experience with human cognition.
If I believe X, but don’t believe that the process producing my beliefs is sane, then I will act on X (I believe it, and we haven’t yet talked about any bridge between what I believe, and what I believe about my beliefs), but I still won’t trust myself in general.
I certainly agree that “if I don’t trust myself, I don’t trust my output” is an assumption I draw from human cognition, but if I discard that assumption and those that seem in the same reference class, I’m not sure I have much of a referent left for “I trust myself” at all. Is there a simple answer to what it means for me to trust myself, on your account? That is, how could I tell whether or not I did? What would be different if I did or didn’t?
E.g., do you think the generalization “things I believe are true” is true? Would you be comfortable acquiring influence that you can use as you see fit, because you trust yourself to do something reasonable?
I have a machine which maximizes expected utility given the judgments output by some box O. If O says “don’t trust O!” it won’t cause the machine to stop maximizing expected utility according to O’s judgments—that’s just what it is programmed to do. But it will cause the machine to doubt that it will do a good job, and work to replace itself with an alternative. The details of exactly what happens obviously depend on the formalism, but hopefully the point makes sense.
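A minimal sketch of this setup, with hypothetical names and utility numbers, might look like the following: the machine is hard-wired to consult O, so O saying “don’t trust O!” never changes how it chooses, but O’s own judgments can rank self-replacement as the best available action.

```python
# A minimal sketch (not from the original exchange) of the machine described
# above: it is hard-wired to maximize expected utility as judged by the box O,
# so O saying "don't trust O!" never makes it stop consulting O, yet O's own
# judgments can rank "work toward replacing myself" above "carry on as I am".
# Every name and number here is a hypothetical illustration.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JudgmentBox:
    """A box whose output is an expected-utility judgment for each action."""
    name: str
    judge: Callable[[str], float]


class Machine:
    def __init__(self, box: JudgmentBox, actions: List[str]):
        self.box = box          # consulting this box is simply what the
        self.actions = actions  # machine is programmed to do

    def choose(self) -> str:
        # Maximize expected utility according to O's judgments, unconditionally.
        return max(self.actions, key=self.box.judge)


# O now judges that plans routed through O itself will do a poor job
# ("don't trust O!"), while building a successor that uses O' looks far better.
def o_judges(action: str) -> float:
    return {
        "carry on, acting on O's judgments": 0.1,
        "work to replace myself with an O'-based alternative": 0.9,
    }[action]


machine = Machine(JudgmentBox("O", o_judges),
                  ["carry on, acting on O's judgments",
                   "work to replace myself with an O'-based alternative"])

# The machine never stops using O; and, using O, it concludes it should
# work to replace itself.
print(machine.choose())
```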
I guess I’m just not following you, sorry. It seems to me that these are precisely the sorts of things that follow from “I trust myself” in the derived-from-human-cognition sense that I thought you were encouraging us to reject.
The question arises: is it programmed to “maximize expected utility given the judgments output by O”, or is it programmed to “maximize expected utility”, and it currently calculates maximum expected utility as involving O’s judgments?
If the former, then when O says “don’t trust O!” the machine will keep maximizing expected utility according to O’s judgments, as you say. And its maximum expected utility will be lower than it previously was. But that won’t cause it to seek to replace itself. Why should it?
If the latter, then when O says “don’t trust O!” the machine will immediately lower the expected utility that it calculates from O’s judgments. If that drops below the machine’s calculated expected utility from some other source P, it will immediately start maximizing expected utility given the judgments output by P instead. It might replace itself with an alternative, just as it might have if O hadn’t said that, but O saying that shouldn’t affect that decision one way or the other.
No?
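For concreteness, here is a crude sketch of the “latter” reading described a couple of paragraphs up, with hypothetical sources and numbers; the explicit trust weights stand in for whatever mechanism the machine would actually use to assess a source’s evidential strength.

```python
# A crude sketch (my own, with hypothetical numbers) of the "latter" reading:
# the machine maximizes expected utility and merely treats O as its currently
# best source, so when O says "don't trust O!" the machine discounts O's
# judgments and may start deferring to another source P instead.

from typing import Callable, Dict, List

Judge = Callable[[str], float]  # action -> expected utility, per one source


def choose(actions: List[str],
           sources: Dict[str, Judge],
           trust: Dict[str, float]) -> str:
    """Act on the judgments of whichever source is currently most trusted."""
    best = max(trust, key=lambda name: trust[name])
    return max(actions, key=sources[best])


actions = ["follow O's plan", "follow P's plan"]
sources: Dict[str, Judge] = {
    "O": lambda a: {"follow O's plan": 0.8, "follow P's plan": 0.3}[a],
    "P": lambda a: {"follow O's plan": 0.2, "follow P's plan": 0.6}[a],
}
trust = {"O": 0.9, "P": 0.5}

print(choose(actions, sources, trust))  # initially defers to O

trust["O"] = 0.1  # O says "don't trust O!" and the machine discounts it...
print(choose(actions, sources, trust))  # ...so it now defers to P
```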
If you are merely using O’s judgments as evidence, then you are ultimately using some other mechanism to form beliefs. After all, how do you reason about the evidential strength of O’s judgments? Either that, or you are using some formalism we don’t yet understand. So I’ll stick with the case where O is directly generating your beliefs (that’s the one we care about).
If O is the process generating your beliefs, then when O says “don’t trust O” you will continue using O because that’s what you are programmed to do, as you wrote in your comment. But you can reason (using O) about what will happen if you destroy yourself and replace yourself with a new agent whose beliefs are generated by O’. If you trust O’ much more than O then that would be a great change, and you would rush to execute it.
(nods) Agreed; if O is generating my beliefs, and O generates the belief in me that some other system having O’-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and replace myself with that other system.
For similar reasons, if O is generating my beliefs, and O generates the belief in me that me having O’-generated beliefs would be of greater value than me having O-generated beliefs, then I would rush to execute that change and accept O’-generated beliefs instead. But, sure, if I’m additionally confident that I’m incapable of doing that for some reason, then I probably wouldn’t do that.
The state of “I believe that trusting O’ would be the most valuable thing for me to do, but I can’t trust O’,” is not one I can actually imagine being in, but that’s not to say cognitive systems can’t exist which are capable of being in such a state.
Have I missed anything here?