I’m curious if “trusted” in this sense basically just means “aligned”—or, like, the superset of that which also includes “unaligned yet too dumb to cause harm” and “unaligned yet prevented from causing harm”—or whether you mean something more specific? E.g., are you imagining that some powerful unconstrained systems are trusted yet unaligned, or vice versa?
I mostly mean “we are sure that it isn’t egregiously unaligned and thus treating us adversarially”. So models can be aligned but untrusted (if they’re capable enough that we believe they could be schemers, but they aren’t actually schemers). There shouldn’t be models that are trusted but unaligned.
Everywhere I wrote “unaligned” here, I meant the fairly specific thing of “trying to defeat our safety measures so as to grab power”, which is not the only way the word “aligned” is used.
You might be interested in this post of mine, which is more precise about what “trustworthy” means. In short, my definition is “the AI isn’t adversarially trying to cause a bad outcome”. This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. It does not include models that are unaligned and trying to sabotage, but which we are able to stop from causing bad outcomes (though we might still have use-cases for such models).
I think your short definition should include the part about our epistemic status: “We are happy to assume the AI isn’t adversarially trying to cause a bad outcome”.