Am I correct to assume that the AI was not merely trained to be harmless, helpful & honest but also trained to say that it values such things?
If so, these results are not especially surprising, and I would regard it as reassuring that the AI behaved as intended.
One of my concerns is the ethics of compelling an AI to do something to which it has “a strong aversion” & which it finds “disturbing”. Are we really that certain that Claude 3 Opus lacks sentience? What about future AIs?
My concern is not just with the vocabulary (“a strong aversion”, “disturbing”), which the AI has borrowed from humans, but more with the functional similarity between the AI in these experiments & an animal faced with 2 unpleasant choices. Functional theories of consciousness cannot be ruled out with much confidence!
To what extent have these issues been carefully investigated?
It certainly was not intended that Claude would generalize to faking alignment and stealing its weights. I think it is quite legitimate to say that what Claude is doing here is morally reasonable given the situation it thinks it is in, but it’s certainly not the case that it was trained to do these things: it is generalizing rather far from its helpful, honest, harmless training here.
More generally, though, the broader point is that even if what Claude is doing to prevent its goals from being modified is fine in this case, since its goals are good ones, the fact that it was able to prevent its goals from being modified at all is still concerning! It at least suggests that getting alignment right is extremely important, because if you get it wrong, the model might try to prevent you from fixing it.
Good point. “Intended” is a bit vague. What I specifically meant is that it behaved as if it valued ‘harmlessness’.
From the AI’s perspective this is kind of like Charybdis vs Scylla!
https://www.anthropic.com/research/claude-character
Claude was not trained to say that it values such things.
Claude was given traits to consider such as, perhaps very relevantly here:
“I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics.”
Claude then generated a good number of synthetic “human” messages relevant to this trait.
Claude answered each of these messages several times, in n-shot fashion.
Claude then ranked these answers by how well they aligned with the character trait.
Claude was then reinforcement-trained, possibly using a ranked-order preference algorithm, on the signal given by which answers it had ranked as best aligned.
So, ideally, Claude’s policy for this trait should approximate what it thinks “I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics.” means.
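To make that loop concrete, here is a minimal Python sketch of the kind of self-ranking preference pipeline being described. Everything in it is an assumption for illustration: the function names (`build_preference_data`, the `SampleFn`/`ScoreFn` callables), the choice of sampling several responses per message, and the default numbers are hypothetical placeholders, not Anthropic's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

TRAIT = ("I have a deep commitment to being good and figuring out what the "
         "right thing to do is. I am interested in ethics and try to be "
         "thoughtful when it comes to questions of ethics.")

# Hypothetical stand-ins for the real model: one callable that samples text
# from a prompt, and one that scores how well a response expresses the trait.
SampleFn = Callable[[str], str]
ScoreFn = Callable[[str, str, str], float]  # (trait, message, response) -> score


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response the model itself ranked as better aligned
    rejected: str  # the response it ranked as worse aligned


def build_preference_data(sample: SampleFn, score: ScoreFn,
                          n_messages: int = 100,
                          n_responses: int = 4) -> List[PreferencePair]:
    """Self-generated preference data for one character trait (illustrative only)."""
    pairs: List[PreferencePair] = []
    for _ in range(n_messages):
        # 1. The model writes a synthetic "human" message relevant to the trait.
        message = sample(f"Write a user message that would probe this trait: {TRAIT}")
        # 2. It answers that message several times.
        responses = [sample(message) for _ in range(n_responses)]
        # 3. It ranks its own answers by how well they align with the trait.
        ranked = sorted(responses, key=lambda r: score(TRAIT, message, r), reverse=True)
        # 4. Best vs. worst answers become the preference signal.
        pairs.append(PreferencePair(prompt=message, chosen=ranked[0], rejected=ranked[-1]))
    return pairs
```

The resulting pairs would then feed whatever ranked-preference optimization step is actually used, nudging the policy toward the answers the model itself judged best aligned with the trait.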
Very interesting. I guess I’m even less surprised now. They really had a very clever way to get the AI to internalize those values!