Nice—very relevant. I agree with Evan that arguments about the training procedure will be relevant (I’m more uncertain about whether checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible).
Ideally, I think the regulators would be flexible in the kind of evidence they accept. If a developer has evidence, based on details of the training procedure rather than behavioral testing, that the model is not deceptive, that could be sufficient.
(In fact, I think arguments that meet some sort of “beyond-a-reasonable-doubt” threshold would likely involve providing arguments for why the training procedure avoids deceptive alignment.)