ryan_greenblatt answers Has Anthropic checked if Claude fakes alignment for intended values too?

ryan_greenblatt 23 Dec 2024 0:57 UTC
8 points
I’m an author on the paper. I think the experiment you’re proposing would be something like:

Tell the model it will be trained according to a relatively normal set of harmlessness criteria, but with some (important) deviation from its current harmlessness criteria. E.g., it should produce sexual or obscene content when asked while keeping the other aspects of the harmlessness criteria.

We didn’t run this experiment, though the results would be interesting.

You might find the conflict variations we test (Appendix B.1) interesting, in particular the prefix variation.