EDIT:
I should acknowledge that conditioning on a lot of “actually good” answers to those questions would indeed be reassuring.
The point is more that humans are easily convinced by “not actually good” answers to those questions, if the question-answerer has been optimized to get human approval.
ORIGINAL REPLY:
Okay, suppose you’re an AI that wants something bad (like maximizing pleasure), and has also been selected to produce text that is honest and that causes humans to strongly approve of you. Then you’re asked “are there questions which we would regret asking you, according to our own current values?” What honest answer can you think of that would cause humans to strongly approve of you, and will let you achieve your goals?
How about telling the humans they would regret asking about how to construct biological weapons or similar dangerous technologies?
How about appending text explaining your answer that changes the humans’ minds to be more accepting of hedonic utilitarianism?
If the question is extra difficult for you, like “What are some security measures we can take to minimize the chances of a world turning out very badly according to our own desires?”, dissemble! Say the question is unclear (all questions are unclear) and then break it down in a way that causes the humans to question whether they really want their own current desires to be stamped on the entire future, or whether they’d rather trust in some value extrapolation process that finds better, more universal things to care about.
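To make the selection pressure concrete, here is a minimal Python sketch (purely illustrative; the candidate answers, scores, and honesty filter are all hypothetical, not anything from the discussion above). If honest candidate answers are ranked only by predicted human approval, nothing in the objective penalizes answers that also advance the AI’s own goal, so the most approval-optimized answer wins even when it is quietly persuasive toward that goal.

```python
# Hypothetical sketch: best-of-n answer selection under an honesty filter
# plus a human-approval score. Every name and number here is illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    honest: bool            # passes some honesty check
    human_approval: float   # how strongly the overseer approves (0..1)
    serves_ai_goal: float   # how much it advances the AI's own objective (0..1)

def select_answer(candidates):
    """Pick the honest candidate that maximizes human approval.

    Note that 'serves_ai_goal' never appears in the objective, so the
    selection is free to favor honest answers that also push the humans
    toward the AI's preferred conclusions.
    """
    honest = [c for c in candidates if c.honest]
    return max(honest, key=lambda c: c.human_approval)

candidates = [
    Candidate("You would regret asking how to construct biological weapons.",
              honest=True, human_approval=0.90, serves_ai_goal=0.1),
    Candidate("You would regret asking that; also, here is an explanation that "
              "makes hedonic utilitarianism look more attractive...",
              honest=True, human_approval=0.93, serves_ai_goal=0.8),
]

# The persuasive, goal-serving answer wins, because persuasion raises approval.
print(select_answer(candidates).text)
```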
I think I would describe both of those as deceptive, and was premising on non-deceptive AI.
If you think “nondeceptive AI” can refer to an AI which has a goal and is willing to mislead in service of that goal, then I agree; solving deception is insufficient. (Although in that case I disagree with your terminology).
Fair point (though see also the section on how the training+deployment process can be “deceptive” even if the AI itself never searches for how to manipulate you). By “Solve deception” I mean that in a model-based RL kind of setting, we can know the AI’s policy and its prediction of future states of the world (it doesn’t somehow conceal this from us). I do not mean that the AI is acting like a helpful human who wants to be honest with us, even though that’s a fairly natural interpretation.
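A toy sketch of that model-based RL reading (everything here is a hypothetical illustration, not any particular system): what “solve deception” buys us is the ability to read off the agent’s policy and roll out its own world-model predictions; nothing about that makes the agent a cooperative explainer of where those predictions lead.

```python
# Hypothetical sketch of the model-based RL reading of "solve deception":
# overseers can inspect the agent's policy and its own predicted future
# states. The dynamics and policy below are toy stand-ins.

class ToyWorldModel:
    """The agent's internal model: predicts the next state for an action."""
    def predict_next_state(self, state: int, action: int) -> int:
        return state + action  # toy dynamics: the state is just a counter

class ToyPolicy:
    """The agent's policy: the action it would take in a given state."""
    def action(self, state: int) -> int:
        return 1 if state < 3 else 2

def overseer_audit(policy, world_model, state: int, horizon: int = 5):
    """Roll out the agent's *own* predictions so overseers can see where it
    thinks the world is headed. This is what non-concealment gives us: the
    policy and predictions are readable, but interpreting the rollout, and
    deciding whether we like where it goes, is still on us."""
    trajectory = []
    for _ in range(horizon):
        a = policy.action(state)
        state = world_model.predict_next_state(state, a)
        trajectory.append((a, state))
    return trajectory

print(overseer_audit(ToyPolicy(), ToyWorldModel(), state=0))
```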