Good post, thanks. I largely agree with you. A couple of thoughts:
> Note, that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned.
This isn’t quite right if we’re going by the description in RFLO (Risks from Learned Optimization). See footnote 7: in general, there’s no requirement to model the base optimiser or oversight system; it’s enough to understand the optimisation pressure and be uncertain whether it’ll persist.
In particular, the model needn’t do this:
[realise we want it to do x] --> [realise it’ll remain unchanged if it does x] --> [do x]
It can jump to:
[realise it’ll remain unchanged if it does x] --> [do x]
It’s not enough to check for a system’s reasoning about the base optimiser.
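For concreteness, here’s a minimal toy sketch (all names are my own hypothetical stand-ins, nothing from the RFLO paper): two decision rules that produce the same aligned-looking behaviour during training, but only the first ever models the base optimiser or what the overseers want. A check for reasoning about the overseer would catch the first and miss the second.

```python
# Toy sketch -- hypothetical names throughout, not from the RFLO paper.

CANDIDATE_ACTIONS = ["do_x", "pursue_mesa_objective_openly"]

def overseer_wants(action):
    # Stand-in for "the thing we wanted the model to do".
    return action == "do_x"

def objective_persists_if(action):
    # Stand-in for the model's guess about whether the current optimisation
    # pressure will leave its objective unchanged given this action.
    return action == "do_x"

def policy_modelling_the_overseer():
    # [realise we want it to do x] --> [realise it'll remain unchanged if it does x] --> [do x]
    for action in CANDIDATE_ACTIONS:
        if overseer_wants(action) and objective_persists_if(action):
            return action
    return "pursue_mesa_objective_openly"

def policy_without_an_overseer_model():
    # [realise it'll remain unchanged if it does x] --> [do x]
    # Never reasons about the base optimiser or what it wants -- only about
    # which action leaves its current objective intact.
    for action in CANDIDATE_ACTIONS:
        if objective_persists_if(action):
            return action
    return "pursue_mesa_objective_openly"

print(policy_modelling_the_overseer())     # do_x
print(policy_without_an_overseer_model())  # do_x -- same observed behaviour, no overseer model
```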
> For example, a very powerful language model might still “only” care about predicting the next word and potentially the incentives to become deceptive are just not very strong for next-word prediction.
Here (and throughout) you seem to be assuming that powerful models are well-described as goal-directed (that they’re “aiming”/”trying”/”caring”...). For large language models in particular, this doesn’t seem reasonable: we know that LLMs predict the next token well; this is not the same as trying to predict the next token well. [Janus’ Simulators is a great post covering this and much more]
That’s not to say that similar risks don’t arise, but the most natural path runs through goal-directed simulacra rather than a goal-directed simulator. If you’re aiming to convince people of the importance of deception, it’s important to make this clear: the argument doesn’t rely on powerful LLMs being predict-next-token optimisers (they’re certainly optimised; they may well not be optimisers).
This is why Eliezer/Nate often focus on a system’s capacity to produce e.g. complex plans, rather than on the process of producing them. [e.g. “What produces the danger is not the details of the search process, it’s the search being strong and effective at all.” from Ngo and Yudkowsky on alignment difficulty]
In an LLM that gives outputs that look like [result of powerful search process], it’s likely there’s some kind of powerful search going on (perhaps implicitly). That search might be e.g. over plans of some simulated character. In principle, the character may be deceptively aligned—and the overall system may exhibit deceptive alignment as a consequence. Arguments for characters that are powerful reasoners tending to be deceptively aligned are similar (instrumental incentives...).
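To make the “optimised, but not necessarily an optimiser” distinction concrete at a trivial scale, here’s a toy of my own (not from the post or from Simulators, and it says nothing about what large models actually do internally): a bigram frequency table predicts the next token reasonably well on its tiny training corpus, yet nothing in it is “trying” to predict anything.

```python
from collections import Counter, defaultdict

# Toy sketch (my own hypothetical example): a bigram table that achieves low
# next-token loss on its training corpus without containing anything that
# looks like an objective, a search, or "trying".

corpus = "the cat sat on the mat the cat sat on the hat".split()

table = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    table[current][nxt] += 1

def predict_next(token):
    # Pure table lookup: no planning, no model of its own loss.
    counts = table[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' -- a good prediction on this corpus, with nothing "aiming" at anything
```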
Thanks for the clarifications. They helped me :)