Because it’s really easy to get out of myopia in practice. Specifically, one example of a proposed alignment plan introduced non-myopia and non-CDT-style reasoning, which is called RLHF. Suffice it to say, you will need to be very careful about not accepting alignment plans that introduce non-myopia.
Here’s the study:
“Discovering Language Model Behaviors with Model-Written Evaluations” is a new Anthropic paper by Ethan Perez et al. that I (Evan Hubinger) also collaborated on. I think the results in this paper are quite interesting in terms of what they demonstrate about both RLHF (Reinforcement Learning from Human Feedback) and language models in general.
Among other things, the paper finds concrete evidence of current large language models exhibiting:
convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain), situational awareness (e.g. awareness of being a language model), coordination (e.g. willingness to coordinate with other AIs), and non-CDT-style reasoning (e.g. one-boxing on Newcomb’s problem). Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.
Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don’t generally seem to be alleviating and sometimes seem to be actively making worse.
Interestingly, the RLHF preference model seemed to be particularly fond of the more agentic option in many of these evals, usually more so than either the pre-trained or fine-tuned language models. We think that this is because the preference model is running ahead of the fine-tuned model, and that future RLHF fine-tuned models will be better at satisfying the preferences of such preference models, the idea being that fine-tuned models tend to fit their preference models better with additional fine-tuning.[1]
Right now, it’s only safe because it isn’t capable enough.
Possibly we have different pictures in mind with oracle AIs. I agree that if you train a neural network to imitate the behavior of something non-myopic, then the neural network is itself unlikely to be myopic.
However, I don’t see how the alternative would be useful. That is, the thing that makes it useful to imitate the behavior of something non-myopic tends to be its non-myopic agency. Creating a myopic imitation of its non-myopic agency seems to be self-contradicting goals.
When I imagine oracle AI, I instead more imagine something like “you give the AI a plan and it tells you how the plan would do”. Or “give the AI the current state of the world and it extrapolates the future”. Rather than an imitation-learned or RLHF agent. Is that not the sort of AI others end up imagining?
Admittedly, I agree with you that a solely myopic oracle is best. I just want to warn you that this will be a lot harder than you think to prevent people suggesting solutions that break your assumptions.
Because it’s really easy to get out of myopia in practice. Specifically, one example of a proposed alignment plan introduced non-myopia and non-CDT-style reasoning, which is called RLHF. Suffice it to say, you will need to be very careful about not accepting alignment plans that introduce non-myopia.
Here’s the study:
Right now, it’s only safe because it isn’t capable enough.
Possibly we have different pictures in mind with oracle AIs. I agree that if you train a neural network to imitate the behavior of something non-myopic, then the neural network is itself unlikely to be myopic.
However, I don’t see how the alternative would be useful. That is, the thing that makes it useful to imitate the behavior of something non-myopic tends to be its non-myopic agency. Creating a myopic imitation of its non-myopic agency seems to be self-contradicting goals.
When I imagine oracle AI, I instead more imagine something like “you give the AI a plan and it tells you how the plan would do”. Or “give the AI the current state of the world and it extrapolates the future”. Rather than an imitation-learned or RLHF agent. Is that not the sort of AI others end up imagining?
Admittedly, I agree with you that a solely myopic oracle is best. I just want to warn you that this will be a lot harder than you think to prevent people suggesting solutions that break your assumptions.