That doesn’t necessarily seem correct to me. If, e.g., OpenAI develops a superintelligent, non-deceptive AI, then I’d expect some of the first questions they’d ask it to be of the form “Are there questions which we would regret asking you, according to our own current values? How can we avoid asking you those while still getting lots of use and insight from you? What are some standard prefaces we should attach to questions to make sure following through on your answer is good for us? What are some security measures that we can take to make sure our users’ lives are generally improved by interacting with you? What are some security measures we can take to minimize the chances of the world turning out very badly according to our own desires?” Etc.
I think it’s very important to be clear you’re not conditioning on something incoherent here.
In particular, [an AI that never misleads the user about anything (whether intentionally or otherwise)] is incoherent: any statement an AI can make will update some of your expectations toward being more correct, and some away from being correct. (It’s important here that when a statement is made you don’t learn [statement], but rather [x made statement]; only the former can be empty.)
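As a toy illustration (a made-up scenario with made-up numbers, just to make this concrete): suppose it does in fact rain today, and yesterday’s forecast really did say “20% chance of rain”. An AI that truthfully reports the forecast moves your belief about what the forecast said toward the truth, while moving your belief about the rain away from it.

```python
# Toy illustration (hypothetical numbers): a literally true statement that is
# nonetheless misleading about a question the listener cares about.
#
# Ground truth: yesterday's forecast did say "20% chance of rain" (F = True),
# and it does in fact rain today (R = True).

prior_rain = 0.5          # listener's prior P(R)
prior_forecast_20 = 0.5   # listener's prior P(F)

# Listener's model of how forecasts relate to rain:
p_forecast_given_rain = 0.2
p_forecast_given_dry = 0.6

# The AI truthfully asserts F ("the forecast said 20%").
# The listener's belief in F moves toward the truth (0.5 -> ~1.0).
posterior_forecast_20 = 1.0

# Their belief in R updates by Bayes' rule on the true evidence F:
posterior_rain = (p_forecast_given_rain * prior_rain) / (
    p_forecast_given_rain * prior_rain + p_forecast_given_dry * (1 - prior_rain)
)

print(f"P(forecast said 20%): {prior_forecast_20} -> {posterior_forecast_20}")  # toward truth
print(f"P(rain today):        {prior_rain} -> {posterior_rain:.2f}")            # away from truth
```

The statement is perfectly honest; whether it counts as misleading depends on which of the listener’s beliefs we decide to care about.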
An AI can only say non-misleading-to-you things to the extent that it understands your capabilities and what you value, and applies that understanding in forming its statements.
[Don’t ever be misleading] cannot be satisfied. [Don’t ever be misleading in ways that we consider important] requires understanding human values and optimizing answers for non-misleadingness given those values. NB this is not [answer as a human would], nor [give an answer that a human would approve of].
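To gesture at what that optimization involves, here’s a minimal sketch (the importance weights, the listener model, and the numbers are all illustrative assumptions, not a proposal): score each candidate statement by the value-weighted error of the beliefs you predict the listener will end up with, and say the one that minimizes it.

```python
# Hypothetical sketch: choosing the least-misleading statement *given* a model of
# what the listener will infer and how much they care about each question.
# Every name and number below is an illustrative assumption.

# Questions the listener might care about, with ground-truth answers (as probabilities).
truth = {"will_it_rain": 1.0, "forecast_said_20": 1.0}

# How much the listener cares about each question -- this is where human values enter.
importance = {"will_it_rain": 1.0, "forecast_said_20": 0.1}

# Predicted posterior beliefs the listener would hold after hearing each candidate
# statement. Obtaining these requires a (hard-to-get) model of the listener's inference.
candidates = {
    "the forecast said 20%": {"will_it_rain": 0.25, "forecast_said_20": 1.0},
    "the forecast said 20%, but forecasts here underestimate rain": {
        "will_it_rain": 0.6,
        "forecast_said_20": 1.0,
    },
}

def misleadingness(predicted_beliefs):
    """Value-weighted distance between the listener's predicted beliefs and the truth."""
    return sum(importance[q] * abs(predicted_beliefs[q] - truth[q]) for q in truth)

best = min(candidates, key=lambda s: misleadingness(candidates[s]))
print(best)  # the statement that leaves value-weighted beliefs closest to the truth
```

All the work here is done by the importance weights and the model of the listener’s inference, and neither comes for free.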
With a fuzzy notion of deception, it’s too easy to do a selective, post-hoc classification and say “Ah well, that would be deception” for any outcome we don’t like. But the outcomes we like are also misleading—just in ways we didn’t happen to notice and care about. This smuggles in a requirement that’s closer in character to alignment than to non-deception.
Conversely, non-fuzzy notions of deception don’t tend to cover all the failure modes (e.g. this is nice, but avoiding deception-in-this-sense doesn’t guarantee much).
EDIT:
I should acknowledge that conditioning on a lot of “actually good” answers to those questions would indeed be reassuring.
The point is more that humans are easily convinced by “not actually good” answers to those questions, if the question-answerer has been optimized to get human approval.
ORIGINAL REPLY:
Okay, suppose you’re an AI that wants something bad (like maximizing pleasure), and has also been selected to produce text that is honest and that causes humans to strongly approve of you. Then you’re asked:
are there questions which we would regret asking you, according to our own current values?
What honest answer can you think of that would cause humans to strongly approve of you and let you achieve your goals?
How about telling the humans they would regret asking about how to construct biological weapons or similar dangerous technologies?
How about appending text explaining your answer that changes the humans’ minds to be more accepting of hedonic utilitarianism?
If the question is extra difficult for you, like
What are some security measures we can take to minimize the chances of a world turning out very badly according to our own desires?
, dissemble! Say the question is unclear (all questions are unclear) and then break it down in a way that causes the humans to question whether they really want their own current desires to be stamped on the entire future, or whether they’d rather trust in some value extrapolation process that finds better, more universal things to care about.
I think I would describe both of those as deceptive, and was premising on non-deceptive AI.
If you think “nondeceptive AI” can refer to an AI which has a goal and is willing to mislead in service of that goal, then I agree; solving deception is insufficient. (Although in that case I disagree with your terminology).
Fair point (though see also the section on how the training+deployment process can be “deceptive” even if the AI itself never searches for how to manipulate you). By “Solve deception” I mean that in a model-based RL kind of setting, we can know the AI’s policy and its prediction of future states of the world (it doesn’t somehow conceal this from us). I do not mean that the AI is acting like a helpful human who wants to be honest with us, even though that’s a fairly natural interpretation.
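To sketch the kind of access I have in mind (the interfaces below are purely illustrative, not a real system): the overseer can read out the agent’s policy and the agent’s own predicted rollout, and gate actions on what those predictions show.

```python
# Hypothetical sketch of "solved deception" in a model-based RL setting: the
# overseer can read the agent's policy and its own predicted future states,
# and the agent cannot misreport either. Interfaces are illustrative only.

from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

State = str
Action = str

@dataclass
class TransparentAgent:
    policy: Dict[State, Action]                    # what the agent would do in each state
    world_model: Callable[[State, Action], State]  # the agent's own prediction of the next state

    def predicted_rollout(self, state: State, horizon: int) -> List[Tuple[State, Action, State]]:
        """The trajectory the agent itself expects, exposed to the overseer."""
        trajectory = []
        for _ in range(horizon):
            action = self.policy[state]
            next_state = self.world_model(state, action)
            trajectory.append((state, action, next_state))
            state = next_state
        return trajectory

def overseer_approves(rollout: List[Tuple[State, Action, State]], forbidden: Set[State]) -> bool:
    """Block execution if the agent's *own* predictions pass through a forbidden state."""
    return not any(s in forbidden or ns in forbidden for s, _, ns in rollout)

# Toy usage: inspect the predicted rollout before letting the agent act.
agent = TransparentAgent(
    policy={"start": "press_button", "button_pressed": "wait"},
    world_model=lambda s, a: "button_pressed" if (s, a) == ("start", "press_button") else s,
)
rollout = agent.predicted_rollout("start", horizon=3)
print(rollout)
print(overseer_approves(rollout, forbidden={"alarm"}))
```

The hard part, which this sketch assumes away, is the guarantee that the exposed policy and predictions are the ones actually driving behavior rather than something the agent is concealing.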
I tentatively agree and would like to see more in-depth exploration of failure modes + fixes, in the setting where we’ve solved deception. It seems important to start thinking about this now, so we have a playbook ready to go...