It would be so great if we saw deceptive alignment in existing language models. I think the most important topic in this area is trying to get a live example to study in the lab ASAP, and to put together as many pieces as we can right now.
I think it’s not very close to happening right now, which is mostly just a bummer. (Though I do think it’s also some evidence that it’s less likely to happen later.)
I think LLMs show some deceptive alignment, but it has a different nature. It doesn’t come from the LLM consciously trying to deceive the trainer, but from RLHF “aligning” only certain scenarios of the LLM’s behaviour, which were not generalized enough to make that alignment more fundamental.
The thing I was thinking of, as posted in the other comment below: https://twitter.com/repligate/status/1627945227083194368
See the other comment for commentary.