For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility, by contrast, is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior.
For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases!
Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences—we’ll indeed be trying to use human preferences to guide and shape the systems—so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things, like that humans generally don’t want to die or be manipulated/deceived. We think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than any plausible amount of human preference data that actually goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all? And 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”
It seems to us that Ngo, Shah, etc. draw intuitive support from the analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations.
If you are thinking about a piece of code that describes a Bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obvious that (a) you don’t know how to do that, and (b) if you did figure it out, the code you’d add would be many orders of magnitude longer than the code you started with.
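To make that code intuition concrete, here is a minimal sketch (our own illustration, with made-up action names, probabilities, and utilities, not anything from the discussion) of a Bayesian expected-utility maximizer over a toy world. Notice that nothing in it offers an obvious hook for a small “corrigibility patch”: protecting its own operation falls straight out of maximizing expected utility, while deferring to humans would require new machinery rather than a one-line edit.

```python
# Toy sketch: a minimal Bayesian expected-utility maximizer.
# All numbers and action names below are invented for illustration.

def expected_utility_maximizer(actions, outcomes, prob, utility):
    """Pick the action maximizing sum over outcomes of P(outcome | action) * U(outcome)."""
    def expected_utility(action):
        return sum(prob(o, action) * utility(o) for o in outcomes)
    return max(actions, key=expected_utility)

# Example: the agent "wants" the lamp on.
actions = ["press_switch", "do_nothing", "disable_own_off_button"]
outcomes = ["lamp_on", "lamp_off"]

def prob(outcome, action):
    # Made-up conditional probabilities P(outcome | action).
    table = {
        "press_switch": {"lamp_on": 0.9, "lamp_off": 0.1},
        "do_nothing": {"lamp_on": 0.1, "lamp_off": 0.9},
        "disable_own_off_button": {"lamp_on": 0.95, "lamp_off": 0.05},
    }
    return table[action][outcome]

def utility(outcome):
    return 1.0 if outcome == "lamp_on" else 0.0

print(expected_utility_maximizer(actions, outcomes, prob, utility))
# -> "disable_own_off_button": resisting interference falls out of plain EU
# maximization; making the agent corrigible instead is not a small code edit.
```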
If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g., if your parents had lovingly but strictly trained you to be corrigible-in-way-X).
We think Yudkowsky’s response to this apparent counterexample is that humans are stupid, basically; AIs might be similarly stupid at first, but as they get smarter we should expect crude corrigibility-training techniques to stop working.
If instead you are thinking about humans, it seems like you totally could be corrigible if you tried, and it seems like you might totally have tried if you had been raised in the right way (e.g., if your parents had lovingly but strictly trained you to be corrigible-in-way-X).
Are there any examples of this in history, where being corrigible-in-way-X wasn’t being constantly incentivized/reinforced via a larger game (e.g., a status game) that the human was embedded in? In other words, I think an apparently corrigible human can be modeled as trying to optimize for survival and social status as terminal values, and using “being corrigible” as an instrumental strategy for as long as that’s an effective strategy. If so, it’s unclear that they can be better described as “corrigible” than as “deceptive” (in the AI alignment sense).
(Humans probably have hard-coded drives for survival and social status, so it may actually be harder to train humans than AIs to be genuinely corrigible. My point above is just that humans don’t seem to be a good example of corrigibility being easy or possible.)
We imagine Shah saying: “1. Why will the AI have goals at all? And 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”
Yeah, that’s right. Adapted to the language here, it would be: 1. Why would we have a “full and complete” outcome pump, rather than domain-specific outcome pumps that primarily consider plans built from actions in a certain domain rather than from “all possible actions”? And 2. Why are the outcomes being pumped incompatible with human survival?
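To make the “domain-specific outcome pump” picture concrete, here is a toy sketch (our framing, not Shah’s; the action names, domains, and world model below are invented for illustration). The restricted pump never even generates plans containing out-of-domain actions, which is the sense in which it differs from a “full and complete” pump that searches over all possible actions.

```python
# Toy sketch: a "full" outcome pump searches over all actions; a domain-specific
# one only ever considers actions tagged with its domain. Everything here is
# invented for illustration.

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Action:
    name: str
    domain: str          # the domain this action belongs to
    effect: frozenset    # facts made true by taking the action

ACTIONS = [
    Action("synthesize_compound", "chemistry", frozenset({"compound_made"})),
    Action("run_assay", "chemistry", frozenset({"assay_done"})),
    Action("persuade_lab_director", "persuasion", frozenset({"extra_funding"})),
    Action("seize_compute_cluster", "infrastructure", frozenset({"extra_compute"})),
]

def simulate(plan):
    """Deterministic toy world model: the outcome is just the union of effects."""
    facts = set()
    for action in plan:
        facts |= action.effect
    return facts

def outcome_pump(target, allowed_domains=None, max_len=2):
    """Brute-force search for a plan whose simulated outcome includes `target`.

    With allowed_domains=None this is the "full" pump; passing e.g.
    {"chemistry"} makes it a domain-specific pump that never even generates
    plans containing persuasion or infrastructure actions.
    """
    candidates = ACTIONS if allowed_domains is None else [
        a for a in ACTIONS if a.domain in allowed_domains
    ]
    for length in range(1, max_len + 1):
        for plan in product(candidates, repeat=length):
            if target <= simulate(plan):
                return plan
    return None

# The domain-specific pump can still hit in-domain goals...
print(outcome_pump({"compound_made", "assay_done"}, allowed_domains={"chemistry"}))
# ...but outcomes that would require out-of-domain actions simply aren't reachable.
print(outcome_pump({"extra_compute"}, allowed_domains={"chemistry"}))  # -> None
```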