Today’s language models are on track to become quite useful, without showing signs of deceptive misalignment...
facepalm
It won’t show signs of deceptive alignment. The entire point of “deception” is not showing signs. Unless it’s just really incompetent at deception, there won’t be signs; the lack of signs is not significant evidence of a lack of deception.
There may be other reasons to think our models are not yet deceptive to any significant extent (I certainly don’t think they are), but the lack of signs of deception is not one of them.
I understand that deceptive models won’t show signs of deception :) That’s why my remark was about models not showing signs of the prerequisites to scary kinds of deception. Unless you think there will be no signs of deception, or of any of its prerequisites, in any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)
Not sure I buy this – I have a model of how hard it is to be deceptive, and how competent our current ML systems are, and it looks like it’s more like “as competent as a deceptive four-year-old” (my parents totally caught me when I told my first lie) than “as competent as a silver-tongued sociopath playing a long game.”
I do expect there to be signs of deceptive alignment, in a noticeable fashion before we get so-deceptive-we-don’t-notice deception.
That falls squarely under the “other reasons to think our models are not yet deceptive”—i.e. we have priors that we’ll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
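A minimal Bayesian sketch of the evidential point being argued here (the notation is mine, not the commenters’): let $D$ stand for “the model is deceptive” and $S$ for “we observe signs of deception.” In odds form, Bayes’ rule gives

$$
\frac{P(D \mid \neg S)}{P(\neg D \mid \neg S)}
= \frac{P(\neg S \mid D)}{P(\neg S \mid \neg D)} \cdot \frac{P(D)}{P(\neg D)}.
$$

If deception, when present, is competent enough to hide itself ($P(\neg S \mid D) \approx 1$) and non-deceptive models likewise show no signs ($P(\neg S \mid \neg D) \approx 1$), the likelihood ratio is near 1, so the posterior odds stay close to the prior odds: observing no signs barely updates us, and the prior does the evidential work. The counter-position in the exchange is that early deception will be incompetent, so $P(\neg S \mid D) \ll 1$ for weaker models, in which case the absence of signs would be genuinely informative.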