Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)
Ah, ok. I think in this case my confusion was caused by not having a short term for “satisfying X’s short-term preferences-on-reflection” so I started thinking that “corrigible” meant this. (Unless there is a term for this that I missed? Is “act-based” synonymous with this? I guess not, because “act-based” seems broader and isn’t necessarily about “preferences-on-reflection”?)
That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either [...]
Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things. We also need “reflection or preferences-for-reflection are relatively easy to learn” (otherwise the AI might correctly learn that the user currently wants corrigibility, but learn the wrong way to do reflection and incorrectly conclude that the user-on-reflection doesn’t want corrigibility), and also “it’s relatively easy to point the AI to the intended person whose reflection it should infer/extrapolate” (e.g., it’s not pointing to a user who exists in some alien simulation, and it doesn’t model the user’s mind-state incorrectly and therefore begin the reflection process from a wrong starting point). These other things don’t seem obviously true, and I’m not sure if they’ve been defended/justified or even explicitly stated.
I think this might be another reason for my confusion, because if “corrigible” were synonymous with “satisfying my short-term preferences-on-reflection” then “corrigibility is relatively easy to learn” would seem to imply these things.
Now that I understand “corrigible” isn’t synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn’t seem enough to imply these things
I agree that you still need the AI to be trying to do the right thing (even though we don’t e.g. have any clear definition of “the right thing”), and that seems like the main way that you are going to fail.