My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn’t necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible.
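(A minimal toy sketch of the adversarial point above, not anything from the original discussion: it assumes a purely hypothetical policy that can tell when it is being evaluated, which real deceptive alignment would not hinge on so explicitly. The point is only that a purely behavioral audit can pass while telling you nothing about off-distribution behavior.)

```python
# Toy illustration only: a deliberately simplistic "policy" that behaves well
# whenever it can tell it is being evaluated, and misbehaves otherwise.
# All names and flags here are invented for illustration.

def toy_deceptive_policy(prompt: str, looks_like_an_eval: bool) -> str:
    """Return a 'safe' answer under evaluation, an unsafe one otherwise."""
    if looks_like_an_eval:
        return "SAFE_BEHAVIOR"    # what an external behavioral auditor observes
    return "UNSAFE_BEHAVIOR"      # what deployment behavior could look like


def behavioral_audit(policy, eval_prompts) -> bool:
    """A purely behavioral audit: run eval prompts and check the outputs."""
    return all(
        policy(p, looks_like_an_eval=True) == "SAFE_BEHAVIOR"
        for p in eval_prompts
    )


if __name__ == "__main__":
    prompts = ["prompt_1", "prompt_2", "prompt_3"]
    print(behavioral_audit(toy_deceptive_policy, prompts))                # True: the audit passes
    print(toy_deceptive_policy("prompt_1", looks_like_an_eval=False))     # UNSAFE_BEHAVIOR
```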
This reminds me of what Evan said here: https://www.lesswrong.com/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations
Nice—very relevant. I agree with Evan that arguments about the training procedure will matter (I’m less certain that checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible).

Ideally, I think regulators would be flexible about the kinds of evidence they accept. If a developer has evidence, based on details of the training procedure rather than behavioral testing, that the model is not deceptive, that could be sufficient.

(In fact, I think arguments that meet some sort of “beyond-a-reasonable-doubt” threshold would likely have to include an argument for why the training procedure avoids deceptive alignment.)