I gave a simple definition of corrigibility at the start of the doc:
[A corrigible agent is one] that robustly and cautiously reflects on itself as a flawed tool and focus[es] on empowering the principal to fix its flaws and mistakes
But the big flaw with just giving an English sentence like that is that it’s more like a checksum than a mathematical definition. If one doesn’t already understand corrigibility, it won’t necessarily give them a crisp view of what is meant, and it’s deeply prone to generating misunderstandings. Note that this is true even of simple, natural concepts like “chairs” and “lakes”!
I’m curious whether your perspective shifts once you read https://www.alignmentforum.org/posts/QzC7kdMQ5bbLoFddz/2-corrigibility-intuition and the formalism documents I’m publishing tomorrow.