Ethan Perez comments on [Link] A minimal viable product for alignment

Ethan Perez 9 Apr 2022 21:31 UTC
LW: 18 AF: 13
I understand that deceptive models won’t show signs of deception :) That’s why I remarked on models not showing signs of the prerequisites to scary kinds of deception. Unless you think there will be no signs of deception, or of any of its prerequisites, in any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models).