I mostly mean “we are sure that it isn’t egregiously unaligned and thus treating us adversarially”. So models can be aligned but untrusted (if they’re capable enough that we believe they could be schemers, but they aren’t actually schemers). There shouldn’t be models that are trusted but unaligned.
Everywhere I wrote “unaligned” here, I meant the fairly specific thing of “trying to defeat our safety measures so as to grab power”, which is not the only way the word “aligned” is used.