Neel Nanda comments on Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Neel Nanda 18 Mar 2025 21:47 UTC
LW: 26 AF: 8
Thanks a lot for doing this work. This is substantially more evaluation awareness than I would have predicted. I’m not super convinced by the transcript-purpose classification experiments, since the evaluator model is plausibly primed to think about this stuff, but the monitoring results seem compelling and very concerning. I guess we really get concerned when evaluation awareness stops showing up in the chain of thought...