As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or “true” preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it’s not corrigible_MIRI.
Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
I’m just saying that when we talk about concepts like “remain in control” or “become better informed” or “shut down,” those all need to be taken as concepts-on-reflection. We’re not satisfying current-Paul’s judgment of “did I remain in control?”; we’re satisfying the on-reflection notion of “did I remain in control?”
Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents “can be corrigible”). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that for robustness reasons we may want to enforce corrigibility in all cases, even if it’s not what we’d prefer-on-reflection.
That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).
By “corrigible” I think we mean “corrigible by X” with the X implicit. It could be “corrigible by some particular physical human.”
> Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
Ah, ok. I think in this case my confusion was caused by not having a short term for “satisfying X’s short-term preferences-on-reflection” so I started thinking that “corrigible” meant this. (Unless there is a term for this that I missed? Is “act-based” synonymous with this? I guess not, because “act-based” seems broader and isn’t necessarily about “preferences-on-reflection”?)
> That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either [...]
Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things, because we also need “reflection or preferences-for-reflection are relatively easy to learn” (otherwise the AI might correctly learn that the user currently wants corrigibility, but learn the wrong way to do reflection and incorrectly conclude that the user-on-reflection doesn’t want corrigibility). We also need “it’s relatively easy to point the AI to the intended person whose reflection it should infer/extrapolate” (e.g., so that it’s not pointing to a user who exists in some alien simulation, and doesn’t model the user’s mind-state incorrectly and therefore begin the reflection process from a wrong starting point). These other things don’t seem obviously true, and I’m not sure whether they’ve been defended/justified or even explicitly stated.
I think this might be another reason for my confusion, because if “corrigible” were synonymous with “satisfying my short-term preferences-on-reflection”, then “corrigibility is relatively easy to learn” would seem to imply these things.
> Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things
I agree that you still need the AI to be trying to do the right thing (even though we don’t e.g. have any clear definition of “the right thing”), and that seems like the main way that you are going to fail.