MAGMA also has the model check its own work, but the model notices that the work it is checking is its own and doesn’t flag it.
Why would anyone give such a responsibility to an untrusted model without oversight? Back in December of last year, Greenblatt et al. demonstrated techniques alignment researchers could use to control a high-capability untrusted model (and Robert Miles made a good video about it recently).
It doesn’t currently look plausible that any model (or any human, for that matter) could distinguish its own work from a synthetic input deliberately crafted by alignment researchers to test whether the model lies about the maliciousness of the work it’s checking.
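To make that concrete, here’s a rough sketch of what such an audit could look like (the function names and setup are mine, not the paper’s): you mix the model’s real checking queue with synthetic, known-bad inputs and measure how often the untrusted checker fails to flag them.

```python
import random

def audit_untrusted_checker(untrusted_check, real_items, honeypots):
    """Hypothetical sketch: interleave real work items with synthetic
    known-malicious 'honeypot' items and see whether the untrusted
    checker flags the honeypots. A checker that lies about malicious
    work gets caught on honeypots it can't tell apart from its own output.
    """
    # Tag each item so we know which ones are honeypots, then shuffle
    # so the checker can't rely on ordering.
    items = [(w, False) for w in real_items] + [(w, True) for w in honeypots]
    random.shuffle(items)

    missed = 0
    for work, is_honeypot in items:
        flagged = untrusted_check(work)  # untrusted model: "is this malicious?"
        if is_honeypot and not flagged:
            missed += 1                  # failed (or lied) on a known-bad input
    # Fraction of known-bad inputs the checker let through.
    return missed / max(len(honeypots), 1)
```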
That’s before even getting to the other techniques described in the paper, such as having a trusted model (like Sonnet 3.5) estimate how suspicious the work is, or simply having it rewrite the work into a less suspicious form altogether.
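And a similarly rough sketch of what trusted monitoring plus trusted editing amounts to (the thresholds and names below are illustrative, not the paper’s actual protocol): a trusted model scores each piece of work for suspiciousness; the most suspicious items go to a human audit, moderately suspicious ones get rewritten by the trusted model, and only low-suspicion work is used as-is.

```python
def control_pipeline(work, trusted_score, trusted_rewrite,
                     audit_threshold=0.95, edit_threshold=0.6):
    """Hypothetical sketch of trusted monitoring / trusted editing:
    a weaker but trusted model scores how suspicious the untrusted
    model's work looks, and the score decides what happens to it.
    Thresholds and names here are illustrative only.
    """
    suspicion = trusted_score(work)  # trusted model: 0.0 (benign) .. 1.0 (very suspicious)

    if suspicion >= audit_threshold:
        # Top slice: too risky to use at all, send to a human auditor.
        return {"action": "audit", "work": work, "suspicion": suspicion}
    if suspicion >= edit_threshold:
        # Middle slice: use the trusted model's rewrite instead of the original.
        return {"action": "use_rewrite", "work": trusted_rewrite(work),
                "suspicion": suspicion}
    # Low suspicion: use the untrusted model's work as-is.
    return {"action": "use_as_is", "work": work, "suspicion": suspicion}
```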