Idea for getting weak-in-expectation evidence about deception:
Pretrain a model.
Finetune two copies using reward functions you are confident will produce different internal values, where one set of values is substantially less aligned.
See whether the two models, which are (at least at first) unaware of this procedure, behave differently.
If they both behave in an aligned-seeming fashion, this seems like strong evidence of deception: the copy finetuned toward less-aligned values is presumably hiding them.
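One reading of "weak-in-expectation" is that the informative outcome (both copies look aligned) is unlikely, so the experiment usually moves your beliefs very little, even though that outcome would move them a lot if it happened. Here is a minimal sketch of that reading as a Bayes update. Every number below is a made-up assumption for illustration, not an estimate, and the names (`PRIOR_DECEPTION`, `P_SAME_GIVEN_DECEPTION`, `P_SAME_GIVEN_HONEST`) are hypothetical.

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' rule for a binary hypothesis H given one observation."""
    joint_h = prior * p_obs_given_h
    joint_not_h = (1 - prior) * p_obs_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Hypothetical inputs (assumptions, chosen only to illustrate the shape).
PRIOR_DECEPTION = 0.05         # P(less-aligned copy hides its values)
P_SAME_GIVEN_DECEPTION = 0.90  # P(both look aligned | deception)
P_SAME_GIVEN_HONEST = 0.02     # P(both look aligned | no deception)

# Probability of seeing the "both look aligned" outcome at all.
p_same = (PRIOR_DECEPTION * P_SAME_GIVEN_DECEPTION
          + (1 - PRIOR_DECEPTION) * P_SAME_GIVEN_HONEST)

# Belief in deception after each possible outcome.
post_if_same = posterior(
    PRIOR_DECEPTION, P_SAME_GIVEN_DECEPTION, P_SAME_GIVEN_HONEST)
post_if_different = posterior(
    PRIOR_DECEPTION, 1 - P_SAME_GIVEN_DECEPTION, 1 - P_SAME_GIVEN_HONEST)

# Averaged over outcomes, the posterior equals the prior.
expected_posterior = p_same * post_if_same + (1 - p_same) * post_if_different

print(f"P(both look aligned)        = {p_same:.3f}")
print(f"P(deception | both aligned) = {post_if_same:.3f}")
print(f"P(deception | they differ)  = {post_if_different:.4f}")
print(f"expected posterior          = {expected_posterior:.3f}")
```

With these assumed inputs, the both-look-aligned outcome occurs only about 6% of the time, but when it does, belief in deception rises from 5% to roughly 70%. In the far more common case where the copies behave differently, belief drops only slightly.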