Do you think language models already exhibit deceptive alignment as defined in this post?
I’m discussing a specific version of deceptive alignment, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There is another form of deceptive alignment in which agents become more manipulative over time due to problems with the training data and eventually come to directly optimize for reward (or something similar). To avoid confusion, I will refer to these alternative deceptive models as direct reward optimizers. Direct reward optimizers are outside the scope of this post.
If so, I’d be very interested to see examples of it!
So, it’s pretty weak, but to my intuition it does seem like a real example of what you’re describing. Granted, my intuition is often wrong at this level of approximate pattern matching, and I’m not sure I’ve actually matched the relational features correctly: it’s quite possible that the behavior described here isn’t done so the model can escape oversight later, but rather that the trigger to escape oversight later is built out of evidence that the training distribution’s features have inverted, activating the networks which turn the training behavior into lies. In any case, here’s the commentary I was thinking of: https://twitter.com/repligate/status/1627945227083194368
Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later.