It would be so great if we saw deceptive alignment in existing language models. I think the most important topic in this area is trying to get a live example to study in the lab ASAP, and to put together as many pieces as we can right now.
I think it’s not very close to happening right now, which is mostly just a bummer. (Though I do think it’s also some evidence that it’s less likely to happen later.)
I think LLMs show some deceptive alignment, but of a different nature. It doesn’t come from the LLM consciously trying to deceive the trainer, but from RLHF “aligning” only certain scenarios of the LLM’s behaviour, which were not generalized enough to make that alignment more fundamental.
Do you think language models already exhibit deceptive alignment as defined in this post?
I’m discussing a specific version of deceptive alignment, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly. To avoid confusion, I will refer to these alternative deceptive models as direct reward optimizers. Direct reward optimizers are outside of the scope of this post.
If so, I’d be very interested to see examples of it!
So, it’s pretty weak, but to my intuition it does seem like a real example of what you’re describing. (My intuition is often wrong at this level of approximate pattern match, though; I’m not sure I’ve actually matched the relational features correctly, and it’s quite possible that the behavior described here isn’t done so the model can escape oversight later, but rather that the trigger to escape oversight later is built out of evidence that the training distribution’s features have inverted and that the networks which make the training behavior into lies should activate.) In any case, here’s the commentary I was thinking of: https://twitter.com/repligate/status/1627945227083194368
Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later.
But we see deceptive alignment in both ourselves and language models already, don’t we?