> I am slightly less optimistic about this avenue of approach than one in which we create a system that is directly trained to be corrigible.
I’m confused about the difference between these two. Does “directly trained to be corrigible” correspond to hand-coded rules for corrigible/incorrigible behavior?
(Though this wouldn’t scale to superintelligent AI.)
> I’m confused about the difference between these two. Does “directly trained to be corrigible” correspond to hand-coded rules for corrigible/incorrigible behavior?
“Directly trained to be corrigible” could involve hardcoding a “core of corrigible reasoning”, or imitating a human overseer who is trained to show corrigible behavior (which is my story for how iterated amplification can hope to be corrigible).
In contrast, with narrow value learning, we hope to say something like “learn the narrow values of the overseer and optimize them” (perhaps by writing down a narrow value learning algorithm to be executed), and we hope that this leads to corrigible behavior. Since “corrigible” means something different from narrow value learning (in particular, it means “is trying to help the overseer”), we are hoping to create a corrigible agent by doing something that is not-exactly-corrigibility, which is why I call it “indirect” corrigibility.
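To make the contrast concrete, here is a minimal, purely illustrative sketch of the indirect route; it is not from the post, and every name and the averaging rule in it are invented. The point is only that the agent never sees any notion of corrigibility: it fits a model of the overseer’s narrow values from feedback and then optimizes that model, and any corrigible behavior has to fall out of the learned values.

```python
# Hypothetical sketch (not from the post) of the "indirect" route: the agent
# is never told about corrigibility; it only fits a model of the overseer's
# narrow values from feedback and then optimizes that model.
from typing import Dict, List, Tuple

Action = str


def fit_narrow_values(feedback: List[Tuple[Action, float]]) -> Dict[Action, float]:
    """Estimate a per-action value by averaging overseer scores.

    A deliberately tiny stand-in for a real narrow value learning algorithm
    (e.g. reward modeling from comparisons or demonstrations).
    """
    scores: Dict[Action, List[float]] = {}
    for action, score in feedback:
        scores.setdefault(action, []).append(score)
    return {a: sum(s) / len(s) for a, s in scores.items()}


def act(values: Dict[Action, float]) -> Action:
    """Optimize the learned values: pick the highest-scoring known action."""
    return max(values, key=values.get)


# Toy usage: overseer feedback is the only channel through which anything
# like "trying to help the overseer" can enter the agent's behavior.
feedback = [("fetch coffee", 1.0), ("fetch coffee", 0.8), ("disable off-switch", -1.0)]
print(act(fit_narrow_values(feedback)))  # -> fetch coffee
```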
> Why’s that? Some related thinking of mine.
It seems likely that there will be contradictions in human preferences that are sufficiently difficult for humans to understand that the AI system can’t simply present the contradiction to the human and expect the human to resolve it correctly, which is what I was proposing in the previous sentence.
> It seems likely that there will be contradictions in human preferences that are sufficiently difficult for humans to understand that the AI system can’t simply present the contradiction to the human and expect the human to resolve it correctly, which is what I was proposing in the previous sentence.
How relevant do you expect this to be? It seems like the system could act pessimistically, under the assumption that either answer might be the correct way to resolve the contradiction, and only take actions that are in the intersection of the sets of actions that each possible philosophy says are OK. Also, I’m not sure the overseer needs to think directly in terms of some uber-complicated model of the overseer’s preferences that the system has; couldn’t you make use of active learning and ask whether specific actions would be corrigible or incorrigible, without the system trying to explain the complex confusion it is trying to resolve?
> How relevant do you expect this to be? It seems like the system could act pessimistically, under the assumption that either answer might be the correct way to resolve the contradiction, and only take actions that are in the intersection of the sets of actions that each possible philosophy says are OK.
It seems plausible that this could be sufficient; I didn’t intend to rule out that possibility. I do think that we eventually want to resolve such contradictions, or have some method for dealing with them; otherwise we won’t be able to make much progress (since I expect that creating very different conditions, e.g. through space colonization, will take humans “off-distribution”, leading to lots of contradictions that could be very difficult to resolve).
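For what it’s worth, here is one way the “act pessimistically” idea above could be cashed out, as a hypothetical sketch in which the agent only takes actions approved under every candidate resolution of the contradiction. The function name and the two toy “philosophies” are invented for illustration; nothing here is an implementation anyone has proposed.

```python
# Hypothetical sketch of "act pessimistically": when the learned preference
# model admits several contradictory resolutions, only allow actions that
# every candidate resolution approves of (the intersection of approved sets).
from typing import Callable, List, Set

Action = str
Resolution = Callable[[Action], bool]  # does this resolution say the action is OK?


def pessimistically_allowed(candidates: List[Action],
                            resolutions: List[Resolution]) -> Set[Action]:
    """Return only the actions approved under every candidate resolution."""
    return {a for a in candidates if all(ok(a) for ok in resolutions)}


# Toy usage with two made-up, mutually contradictory value systems.
philosophy_a = lambda a: a in {"wait for instructions", "ask the overseer"}
philosophy_b = lambda a: a in {"ask the overseer", "act autonomously"}

print(pessimistically_allowed(
    ["wait for instructions", "ask the overseer", "act autonomously"],
    [philosophy_a, philosophy_b],
))  # -> {'ask the overseer'}
```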
> I’m not sure the overseer needs to think directly in terms of some uber-complicated model of the overseer’s preferences that the system has; couldn’t you make use of active learning and ask whether specific actions would be corrigible or incorrigible, without the system trying to explain the complex confusion it is trying to resolve?
I didn’t mean that the complexity/confusion arises in the model of the overseer’s preferences. Even specific actions can be hard to evaluate: you need to understand the long-term outcomes of that action (or the agent’s expectation of them), and then to evaluate whether those long-term outcomes are good (which could be very challenging, if the future is quite different from the present). Or alternatively, you need to evaluate whether the agent believes those outcomes are good for the overseer.
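As a concrete (and again invented) illustration of the active-learning proposal being discussed, the sketch below has the system query the overseer only about the specific candidate actions whose corrigibility it is least sure of. The uncertainty numbers, action names, and overseer stub are all made up; the reply above points out the catch, namely that labelling even a single action may require judging hard-to-foresee long-term outcomes.

```python
# Hypothetical sketch of the active-learning proposal: query the overseer
# about the specific candidate actions whose corrigibility the system is
# least sure about, instead of explaining its full (possibly confused)
# preference model.
from typing import Callable, Dict, List

Action = str


def most_uncertain(uncertainty: Dict[Action, float], budget: int) -> List[Action]:
    """Pick the actions with the widest corrigibility uncertainty (made-up numbers)."""
    return sorted(uncertainty, key=uncertainty.get, reverse=True)[:budget]


def query_overseer(actions: List[Action],
                   overseer: Callable[[Action], bool]) -> Dict[Action, bool]:
    """Ask the overseer to label each queried action as corrigible or not."""
    return {a: overseer(a) for a in actions}


# Toy usage: uncertainty estimates and the overseer's labels are both stand-ins.
uncertainty = {
    "shut down when asked": 0.05,
    "preserve own off-switch": 0.40,
    "acquire extra compute": 0.45,
}
labels = query_overseer(most_uncertain(uncertainty, budget=2),
                        overseer=lambda a: a == "shut down when asked")
print(labels)  # -> {'acquire extra compute': False, 'preserve own off-switch': False}
```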