Olli Järviniemi comments on Untrusted smart models and trusted dumb models

Olli Järviniemi 15 Nov 2024 14:55 UTC
2 points
You might be interested in this post of mine, which is more precise about what “trustworthy” means. In short, my definition is “the AI isn’t adversarially trying to cause a bad outcome”. This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. It does not include models that are unaligned and trying to sabotage, but that we are able to stop from causing bad outcomes (though we might still have use cases for such models).
- Buck 15 Nov 2024 15:16 UTC
  2 points
  I think your short definition should include the part about our epistemic status: “We are happy to assume the AI isn’t adversarially trying to cause a bad outcome”.