> [Tells complicated, indirect story about how to wind up with a corrigible AI]
> “Corrigibility is, at its heart, a relatively simple concept”
I’m not saying the default strategy of bumbling forward and hoping that we figure out tool AI as we go has a literal 0% chance of working. But from the tone of this post and the previous table-of-contents post, I was expecting a more direct statement of what sort of functional properties you mean by “corrigibility,” and I feel like I got more of a “we’ll know it when we see it” approach.
I gave a simple definition of corrigibility at the start of the doc:
> [A corrigible agent is one] that robustly and cautiously reflects on itself as a flawed tool and focus[es] on empowering the principal to fix its flaws and mistakes
But the big flaw with just giving an English sentence like that is that it’s more like a checksum than a mathematical definition. If one doesn’t already understand corrigibility, it won’t necessarily give them a crisp view of what is meant, and it’s deeply prone to generating misunderstandings. Note that this is true even of simple, natural concepts like “chairs” and “lakes”!
I’m curious whether your perspective shifts once you read https://www.alignmentforum.org/posts/QzC7kdMQ5bbLoFddz/2-corrigibility-intuition and the formalism documents I’m publishing tomorrow.