I mostly mean “we are sure that it isn’t egregiously unaligned and thus treating us adversarially”. So models can be aligned but untrusted (if they’re capable enough that we believe they could be schemers, but they aren’t actually schemers). There shouldn’t be models that are trusted but unaligned.
Everywhere I wrote “unaligned” here, I meant the fairly specific thing of “trying to defeat our safety measures so as to grab power”, which is not the only way the word “aligned” is used.