It would be so great if we saw deceptive alignment in existing language models. I think the most important topic in this area is trying to get a live example to study in the lab ASAP, and to put together as many pieces as we can right now.
I think it’s not very close to happening right now, which is mostly just a bummer. (Though I do think it’s also some evidence that it’s less likely to happen later.)
I think LLMs show some deceptive alignment, but it has a different nature. It doesn’t come from the LLM consciously trying to deceive the trainer, but from RLHF “aligning” only certain scenarios of the LLM’s behaviour, which were not generalized enough to make that alignment more fundamental.
The thing I was thinking of, as posted in the other comment below: https://twitter.com/repligate/status/1627945227083194368
See the other comment for commentary.