Thanks a lot for doing this work. This is substantially more evaluation awareness than I would have predicted. I’m not super convinced by the experiments on classifying transcript purpose, since the evaluator model is plausibly primed to think about this stuff, but the monitoring results seem compelling and very concerning. I guess we should get really concerned when it stops showing up in the chain of thought...