Part of the point that I was trying to make in this post is that I’m somewhat dissatisfied with many of the existing definitions and treatments of corrigibility, as I feel like they don’t give enough of a basis for actually verifying them. So I can’t really give you a definition of act-based corrigibility that I’d be happy with, as I don’t think there currently exists such a definition.
That being said, I think there is something real in the act-based corrigibility cluster, which (as I describe in the post) I think looks something like corrigible alignment in terms of having some pointer to what the human wants (not in a perfectly reflective way, but just in terms of actually trying to help the human) combined with some sort of pre-prior creating an incentive to improve that pointer.
I thought Evan’s response was missing my point (that “act-based corrigibility” as used in OP doesn’t seem to be a kind of corrigibility as defined in the original corrigibility paper but just a way to achieve corrigibility) and had a chat with Evan about this on MIRIxDiscord (with Abram joining in). It turns out that by “act-based corrigibility” Evan meant both “a way of achieving something in the corrigibility cluster [by using act-based agents] as well as the particular thing in that cluster that you achieve if you actually get act-based corrigibility to work.”
The three of us talked a bit about finding better terms for these concepts but didn’t come up with any good candidates. My current position is that using “act-based corrigibility” this way is quite confusing and until we come up with better terms we should probably just stick with “achieving corrigibility using act-based agents” and “the kind of corrigibility that act-based agents may be able to achieve” depending on which concept one wants to refer to.
Now that I’ve lived with understanding what Evan meant by “act-based corrigibility” for a while, I find that I’m having trouble holding on to my initial feeling of “this is likely to cause confusion to people”, despite consciously trying to, and it’s starting to feel more and more reasonable to use it the way Evan did. It seems like an interesting and revealing case of the illusion of transparency in action.