Claude was not trained to say that it values such things.
Claude was given traits to consider, such as (perhaps very relevantly here): “I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics.”
Claude then generated a good number of synthetic “human” messages relevant to this trait.
Claude answered these messages in n-shot fashion.
Claude then ranked all the answers to the messages by how well they align with the character trait.
Claude was then reinforcement-trained, possibly using a ranked-order preference algorithm, on the signals given by which answers it ranked as best aligned.
So, ideally, Claude’s policy for this trait should approximate alignment with what Claude itself thinks “I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics.” means.
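The steps above can be sketched as a toy pipeline. This is purely illustrative: every function name here (`generate_messages`, `answer`, `rank_by_trait`, `preference_pairs`) is a hypothetical stand-in, not Anthropic's actual implementation, and the "ranking" is faked with a hash so the example runs without a model.

```python
# Hypothetical sketch of the character-training loop described above.
# All names are illustrative stand-ins; scoring is faked so this runs standalone.

TRAIT = ("I have a deep commitment to being good and figuring out "
         "what the right thing to do is.")

def generate_messages(trait, n):
    # Stand-in for the model generating synthetic "human" messages about the trait.
    return [f"synthetic message {i} probing: {trait[:30]}..." for i in range(n)]

def answer(message, k):
    # Stand-in for sampling k candidate answers per message.
    return [f"candidate answer {j} to ({message})" for j in range(k)]

def rank_by_trait(candidates, trait):
    # Stand-in for the model scoring each answer's alignment with the trait.
    # A deterministic hash fakes the score so the example is runnable.
    return sorted(candidates, key=lambda c: hash((c, trait)) % 100, reverse=True)

def preference_pairs(ranked):
    # Turn a ranking into (preferred, rejected) pairs -- the usual input to
    # ranked-preference training objectives such as DPO-style losses.
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]

pairs = []
for msg in generate_messages(TRAIT, n=3):
    ranked = rank_by_trait(answer(msg, k=4), TRAIT)
    pairs.extend(preference_pairs(ranked))

# `pairs` would then drive the reinforcement / preference-optimization step.
print(len(pairs))  # 3 messages x C(4,2) pairs each = 18
```

The key design point is the last step: a full ranking over k answers yields C(k,2) preference pairs per message, which is a richer training signal than a single best-vs-worst comparison.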
https://www.anthropic.com/research/claude-character
Very interesting. I guess I’m even less surprised now. They really had a clever way to get the AI to internalize those values.