That falls squarely under the “other reasons to think our models are not yet deceptive”—i.e. we have priors that we’ll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
That falls squarely under the “other reasons to think our models are not yet deceptive”—i.e. we have priors that we’ll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.