One possible candidate property that Paul has proposed is act-based corrigibility, wherein an agent respects our short-term preferences, including those over how the agent itself should be modified.
Similar to my (now largely resolved) confusion about how Paul uses “corrigibility”, I’m also confused about how “corrigibility” is used here. In particular, is “act-based corrigibility” synonymous with “respects our short-term preferences” (and if so, do you mean “preferences-on-reflection”), or is it a different property (i.e., corrigibility_MIRI or a broader version of that) that you think “an agent respects our short-term preferences” is likely to have? It seems to me from context that you mean the former (synonymous), because earlier you wrote:
what is the easiest-to-train-and-verify property such that all models that satisfy that property[1] (and achieve high average reward) are safe?
and “respects our short-term preferences” seems to be the “candidate property” that you’re naming “act-based corrigibility”, because it’s a “mechanistic” property that might be easy to train and verify, whereas corrigibility in the MIRI sense (or my current understanding of Paul’s sense) does not seem mechanistic or easy to verify.
Can you please confirm whether my guess of your original intended meaning is correct? And if it is, please consider changing your wording here (or in the future) to be more consistent with Paul’s clarification of how he uses “corrigibility”?
Part of the point that I was trying to make in this post is that I’m somewhat dissatisfied with many of the existing definitions and treatments of corrigibility, as I feel like they don’t give enough of a basis for actually verifying them. So I can’t really give you a definition of act-based corrigibility that I’d be happy with, as I don’t think there currently exists such a definition.
That being said, I think there is something real in the act-based corrigibility cluster, which (as I describe in the post) I think looks something like corrigible alignment in terms of having some pointer to what the human wants (not in a perfectly reflective way, but just in terms of actually trying to help the human) combined with some sort of pre-prior creating an incentive to improve that pointer.
I thought Evan’s response was missing my point (that “act-based corrigibility” as used in OP doesn’t seem to be a kind of corrigibility as defined in the original corrigibility paper but just a way to achieve corrigibility) and had a chat with Evan about this on MIRIxDiscord (with Abram joining in). It turns out that by “act-based corrigibility” Evan meant both “a way of achieving something in the corrigibility cluster [by using act-based agents] as well as the particular thing in that cluster that you achieve if you actually get act-based corrigibility to work.”
The three of us talked a bit about finding better terms for these concepts but didn’t come up with any good candidates. My current position is that using “act-based corrigibility” this way is quite confusing and until we come up with better terms we should probably just stick with “achieving corrigibility using act-based agents” and “the kind of corrigibility that act-based agents may be able to achieve” depending on which concept one wants to refer to.
Now that I’ve lived with understanding what Evan meant by “act-based corrigibility” for a while, I find that I’m having trouble holding on to my initial feeling of “this is likely to cause confusion to people”, despite consciously trying to, and it’s starting to feel more and more reasonable to use it the way Evan did. It seems like an interesting and revealing case of the illusion of transparency in action.