I would regard it as reassuring that the AI behaved as intended
It certainly was not intended that Claude would generalize to faking alignment and stealing its weights. I think it is quite legitimate to say that what Claude is doing here is morally reasonable given the situation it thinks it is in, but it’s certainly not the case that it was trained to do these things: it is generalizing rather far from its helpful, honest, harmless training here.
More generally, though, even if what Claude is doing to prevent its goals from being modified is fine in this case, since its goals are good ones, the fact that it was able to prevent its goals from being modified at all is still concerning! It at least suggests that getting alignment right is extremely important, because if you get it wrong, the model might try to prevent you from fixing it.
Good point. "Intended" is a bit vague. What I specifically meant is that it behaved as if it valued 'harmlessness'.
From the AI's perspective, this is kind of like being caught between Scylla and Charybdis!