You might be interested in this post of mine, which is more precise about what “trustworthy” means. In short, my definition is “the AI isn’t adversarially trying to cause a bad outcome”. This includes aligned models, and also unaligned models that are dumb enough to realize they should (try to) sabotage. This does not include models that are unaligned, trying to sabotage and which we are able to stop from causing bad outcomes (but we might still have use-cases for such models).
I think your short definition should include the part about our epistemic status: “We are happy to assume the AI isn’t adversarially trying to cause a bad outcome”.
You might be interested in this post of mine, which is more precise about what “trustworthy” means. In short, my definition is “the AI isn’t adversarially trying to cause a bad outcome”. This includes aligned models, and also unaligned models that are dumb enough to realize they should (try to) sabotage. This does not include models that are unaligned, trying to sabotage and which we are able to stop from causing bad outcomes (but we might still have use-cases for such models).
I think your short definition should include the part about our epistemic status: “We are happy to assume the AI isn’t adversarially trying to cause a bad outcome”.