One issue I have with a lot of these discussions is that they seem to treat non-myopia as the default for oracles, when it seems to me like oracles would obviously by default be myopic. It creates a sort of dissonance/confusion, where there is extended discussion of lots of tradeoffs between different types of decision rules, and I keep wondering “why don’t you just use the default myopia?”.
While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it, and as noted in the post currently no way to guarantee or test for it. Given that uncertainty, a discussion of non-myopic oracles seems worthwhile.
Additionally, a major point of this post is that myopia alone is not sufficient for safety, a myopic agent with an acausal decision theory can behave in dangerous ways to influence the world over time. Even if we were guaranteed myopia by default, it would still be necessary to discuss decision rules.
While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it, and as noted in the post currently no way to guarantee or test for it.
I’ve been working on some ways to test for myopia and non-myopia (see Steering Behaviour: Testing for (Non-)Myopia in Language Models). But the main experiment is still in progress, and it only applies for a specific definition of myopia which I think not everyone is bought into yet.
Because it’s really easy to get out of myopia in practice. Specifically, one example of a proposed alignment plan introduced non-myopia and non-CDT-style reasoning, which is called RLHF. Suffice it to say, you will need to be very careful about not accepting alignment plans that introduce non-myopia.
Here’s the study:
“Discovering Language Model Behaviors with Model-Written Evaluations” is a new Anthropic paper by Ethan Perez et al. that I (Evan Hubinger) also collaborated on. I think the results in this paper are quite interesting in terms of what they demonstrate about both RLHF (Reinforcement Learning from Human Feedback) and language models in general.
Among other things, the paper finds concrete evidence of current large language models exhibiting:
convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain), situational awareness (e.g. awareness of being a language model), coordination (e.g. willingness to coordinate with other AIs), and non-CDT-style reasoning (e.g. one-boxing on Newcomb’s problem). Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.
Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don’t generally seem to be alleviating and sometimes seem to be actively making worse.
Interestingly, the RLHF preference model seemed to be particularly fond of the more agentic option in many of these evals, usually more so than either the pre-trained or fine-tuned language models. We think that this is because the preference model is running ahead of the fine-tuned model, and that future RLHF fine-tuned models will be better at satisfying the preferences of such preference models, the idea being that fine-tuned models tend to fit their preference models better with additional fine-tuning.[1]
Right now, it’s only safe because it isn’t capable enough.
Possibly we have different pictures in mind with oracle AIs. I agree that if you train a neural network to imitate the behavior of something non-myopic, then the neural network is itself unlikely to be myopic.
However, I don’t see how the alternative would be useful. That is, the thing that makes it useful to imitate the behavior of something non-myopic tends to be its non-myopic agency. Creating a myopic imitation of its non-myopic agency seems to be self-contradicting goals.
When I imagine oracle AI, I instead more imagine something like “you give the AI a plan and it tells you how the plan would do”. Or “give the AI the current state of the world and it extrapolates the future”. Rather than an imitation-learned or RLHF agent. Is that not the sort of AI others end up imagining?
Admittedly, I agree with you that a solely myopic oracle is best. I just want to warn you that this will be a lot harder than you think to prevent people suggesting solutions that break your assumptions.
One issue I have with a lot of these discussions is that they seem to treat non-myopia as the default for oracles, when it seems to me like oracles would obviously by default be myopic. It creates a sort of dissonance/confusion, where there is extended discussion of lots of tradeoffs between different types of decision rules, and I keep wondering “why don’t you just use the default myopia?”.
While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it, and as noted in the post currently no way to guarantee or test for it. Given that uncertainty, a discussion of non-myopic oracles seems worthwhile.
Additionally, a major point of this post is that myopia alone is not sufficient for safety, a myopic agent with an acausal decision theory can behave in dangerous ways to influence the world over time. Even if we were guaranteed myopia by default, it would still be necessary to discuss decision rules.
I’ve been working on some ways to test for myopia and non-myopia (see Steering Behaviour: Testing for (Non-)Myopia in Language Models). But the main experiment is still in progress, and it only applies for a specific definition of myopia which I think not everyone is bought into yet.
Because it’s really easy to get out of myopia in practice. Specifically, one example of a proposed alignment plan introduced non-myopia and non-CDT-style reasoning, which is called RLHF. Suffice it to say, you will need to be very careful about not accepting alignment plans that introduce non-myopia.
Here’s the study:
Right now, it’s only safe because it isn’t capable enough.
Possibly we have different pictures in mind with oracle AIs. I agree that if you train a neural network to imitate the behavior of something non-myopic, then the neural network is itself unlikely to be myopic.
However, I don’t see how the alternative would be useful. That is, the thing that makes it useful to imitate the behavior of something non-myopic tends to be its non-myopic agency. Creating a myopic imitation of its non-myopic agency seems to be self-contradicting goals.
When I imagine oracle AI, I instead more imagine something like “you give the AI a plan and it tells you how the plan would do”. Or “give the AI the current state of the world and it extrapolates the future”. Rather than an imitation-learned or RLHF agent. Is that not the sort of AI others end up imagining?
Admittedly, I agree with you that a solely myopic oracle is best. I just want to warn you that this will be a lot harder than you think to prevent people suggesting solutions that break your assumptions.