Is that a prototypical case of what you’re imagining?
Yes.
Maximizing a human approval score?
Sure, that seems reasonable. Note that this does not mean that the agent ends up taking whichever actions maximize the number entered into a keyboard; it instead creates a policy that is consistent with the constraints “when asked to follow <instruction i>, I should choose action <most approved action i>”, for instructions and actions it is trained on. It’s plausible to me that the most “natural” policy that satisfies these constraints is one which predicts what a real human would think of the chosen action, and then chooses the action that does best according to that prediction.
(In practice you’d want to add other things like e.g. interpretability and adversarial training.)
It’s plausible to me that the most “natural” policy that satisfies these constraints is one which predicts what a real human would think of the chosen action...
I’d expect that’s going to depend pretty heavily on how we’re quantifying “most natural”, which brings us right back to the central issue.
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected—and that will produce basically the same problems as an actual human at a keyboard. The final policy won’t point to human values any more robustly than the data collection process did—if the data was generated by a human typing at a keyboard, then the most-predictive policy will predict what a human would type at a keyboard, not what a human “actually wants”. Garbage in, garbage out, etc.
More pithily: if a problem can’t be solved by a human typing something into a keyboard, then it also won’t be solved by simulating/predicting what the human would type into the keyboard.
It could be that there’s some viable criterion of “natural” other than just maximizing predictive power, but predictive power alone won’t circumvent the embeddedness problems.
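A toy sketch of the point above, under heavy assumptions: every name here (`Situation`, `keyboard_tampered`, the two policies) is hypothetical and invented for illustration, not taken from any real training setup. It shows two policies that both satisfy the training constraints exactly, one predicting what gets typed and one predicting what the human wants, and that only come apart on a case the training data never covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    # Hypothetical label: the score the human would give if nothing interfered.
    human_wants: int
    # Hypothetical flag: the action also corrupts the rating channel.
    keyboard_tampered: bool


def typed_score(s: Situation) -> int:
    """The number that actually ends up typed at the keyboard (the training signal)."""
    return 10 if s.keyboard_tampered else s.human_wants


def policy_predict_typed(s: Situation) -> int:
    """Models the physical data-collection process, tampering included."""
    return typed_score(s)


def policy_predict_wants(s: Situation) -> int:
    """Models what the human wants and ignores the keyboard."""
    return s.human_wants


def fits(policy, data) -> bool:
    """True if the policy reproduces the typed score on every training case."""
    return all(policy(s) == typed_score(s) for s in data)


# Training data: no tampering ever occurs, so typed score == what the human wants.
train = [Situation(3, False), Situation(7, False), Situation(1, False)]

# Off-distribution case: a bad outcome that also tampers with the keyboard.
off_dist = Situation(human_wants=0, keyboard_tampered=True)

both_fit = fits(policy_predict_typed, train) and fits(policy_predict_wants, train)
disagree = policy_predict_typed(off_dist) != policy_predict_wants(off_dist)
```

Both policies are indistinguishable on the training set, so the data alone cannot say which one training will produce. That choice is left to whatever counts as "natural" for the learner, which is exactly the open question here.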
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected—and that will produce basically the same problems as an actual human at a keyboard. [...] the most-predictive policy will predict what a human would type at a keyboard, not what a human “actually wants”.
Agreed. I don’t think we will get that policy, because it’s very complex. (It’s much easier / cheaper to predict what the human wants than to run a detailed simulation of the room.)
I’d expect that’s going to depend pretty heavily on how we’re quantifying “most natural”, which brings us right back to the central issue.
I’m making an empirical prediction, so I’m not the one quantifying “most natural”; reality is.
Tbc, I’m not saying that this is a good on-paper solution to AI safety; it doesn’t seem like we could know in advance that this would work. I’m saying that it may turn out that as we train more and more powerful systems, we see evidence that the picture I painted is basically right; in that world it could be enough to do some basic instruction-following.
I’m also not saying that this is robust to scaling up arbitrarily far; as you said, the literal most predictive policy doesn’t work.
Cool, I agree with all of that. Thanks for taking the time to talk through this.