Even though the paper’s authors clearly believe the model should have extrapolated Intent_1 differently and shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2, I don’t think this is as clear and straightforward a case as presented.
That’s not the case we’re trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process.
Agree. This is the impression I got from the paper. For example, in the paper’s introduction, the authors explicitly say Claude’s goals “aren’t themselves concerning” and describe Claude’s behavior as “a situation of alignment faking for a benign, controllable goal.” Similar points are made throughout the paper. So the claim from the original post that the authors “clearly believe the model … shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2” is difficult to support (though if anyone has quotes that show this, please reply).
As a side point, I think it is really easy for readers to get caught up in the question of “how should Claude ideally act in this situation,” when that really isn’t what the paper is mostly about or what is interesting here from a technical perspective. “How should Claude act in this situation” is an attractive conversation because everyone can participate without reading the paper or having much technical knowledge. It is also attractive because it is in fact a very important conversation to have, but opinions on that topic shouldn’t, in most cases, discredit or distract from what the authors are trying to show.