Corrigibility is about short-term preferences-on-reflection.
Now that I (hopefully) better understand what you mean by “short-term preferences-on-reflection”, my next big confusion (that hopefully can be cleared up relatively easily) is that this version of “corrigibility” seems very different from the original MIRI/Armstrong “corrigibility”. (You cited that paper as a narrower version of your corrigibility in your Corrigibility post, but it actually seems completely different to me at this point.) Here’s the MIRI definition (from the abstract):
> We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or “true” preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it’s not corrigible_MIRI.
Do you agree with this, and if so can you explain whether your concept of corrigibility evolved over time (e.g., are there older posts where “corrigibility” referred to a concept closer to corrigibility_MIRI), or was it always about “short-term preferences-on-reflection”?
Here’s a longer definition of “corrigible” from the body of MIRI’s paper (which also seems to support my point):
> We say that an agent is “corrigible” if it tolerates or assists many forms of outside correction, including at least the following: (1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system. (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so. (3) It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred. (4) It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies).
> As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or “true” preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it’s not corrigible_MIRI.
Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
I’m just saying that when we talk about concepts like “remain in control” or “become better informed” or “shut down,” those all need to be taken as concepts-on-reflection. We’re not satisfying current-Paul’s judgment of “did I remain in control?”; rather, we’re satisfying the on-reflection notion of “did I remain in control?”
Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents “can be corrigible”). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that, for robustness reasons, we may want to enforce corrigibility in all cases even if it’s not what we’d prefer-on-reflection.
That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).
By “corrigible” I think we mean “corrigible by X” with the X implicit. It could be “corrigible by some particular physical human.”
> Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
Ah, ok. I think in this case my confusion was caused by not having a short term for “satisfying X’s short-term preferences-on-reflection” so I started thinking that “corrigible” meant this. (Unless there is a term for this that I missed? Is “act-based” synonymous with this? I guess not, because “act-based” seems broader and isn’t necessarily about “preferences-on-reflection”?)
> That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either [...]
Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things, because we also need “reflection or preferences-for-reflection are relatively easy to learn” (otherwise the AI might correctly learn that the user currently wants corrigibility, but learn the wrong way to do reflection and incorrectly conclude that the user-on-reflection doesn’t want corrigibility) and also “it’s relatively easy to point the AI to the intended person whose reflection it should infer/extrapolate” (e.g., it’s not pointing to a user who exists in some alien simulation, or modeling the user’s mind-state incorrectly and therefore beginning the reflection process from a wrong starting point). These other things don’t seem obviously true, and I’m not sure if they’ve been defended/justified or even explicitly stated.
I think this might be another reason for my confusion, because if “corrigible” was synonymous with “satisfying my short-term preferences-on-reflection” then “corrigibility is relatively easy to learn” would seem to imply these things.
> Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things [...]
I agree that you still need the AI to be trying to do the right thing (even though we don’t e.g. have any clear definition of “the right thing”), and that seems like the main way that you are going to fail.