Not sure I buy this – I have a model of how hard it is to be deceptive, and how competent our current ML systems are, and it looks like it’s more like “as competent as a deceptive four-year old” (my parents totally caught me when I told my first lie), than “as competent as a silver-tongued sociopath playing a long game.”
I do expect there to be noticeable signs of deceptive alignment before we get so-deceptive-we-don’t-notice deception.
That falls squarely under the “other reasons to think our models are not yet deceptive”—i.e. we have priors that we’ll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.