“Corrigibility” is usually defined as the property of AIs that don’t resist modifications by their designers. Why would we want to perform such modifications? Mainly because we made errors in the initial implementation, and in particular because the initial implementation is not aligned. But this leads to a paradox: if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?
In order to stop passing the recursive buck, we must assume some dimensions along which our initial implementation is not allowed to be flawed. Therefore, corrigibility is only a well-posed notion in the context of a particular such assumption. Seen through this lens, the Hippocratic principle becomes a particular crystallization of corrigibility. Specifically, the Hippocratic principle assumes the agent has access to some reliable information about the user’s policy and preferences (be it through timelines, revealed preferences or anything else).
Importantly, this information can be incomplete, which can motivate altering the agent along the way. And the agent will not resist this alteration! Indeed, resisting the alteration is ruled out unless the AI can conclude with high confidence (and not just in expectation) that such resistance is harmless. Since we assumed the information is reliable and the alteration is beneficial, the AI cannot reach such a conclusion.
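To make the "high confidence, and not just in expectation" distinction concrete, here is a minimal toy sketch. It is not the formal Hippocratic principle: the `Outcome` type, the function names, the harm scale (negative means the user benefits) and the threshold `delta` are all assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    probability: float  # the agent's credence in this outcome (hypothetical)
    harm: float         # harm to the user relative to baseline; negative = benefit


def expected_harm(outcomes):
    # Probability-weighted average harm over all outcomes.
    return sum(o.probability * o.harm for o in outcomes)


def prob_harmful(outcomes):
    # Total credence assigned to outcomes that are strictly harmful.
    return sum(o.probability for o in outcomes if o.harm > 0.0)


def allowed_in_expectation(outcomes):
    # Weak criterion: the action is fine if it is harmless on average.
    return expected_harm(outcomes) <= 0.0


def allowed_with_high_confidence(outcomes, delta):
    # Strong criterion: the action is fine only if the chance of any harm
    # is at most delta (an assumed, illustrative confidence threshold).
    return prob_harmful(outcomes) <= delta


# Resisting the alteration: probably a small benefit, but a 10% chance of serious harm.
resist = [Outcome(0.9, -1.0), Outcome(0.1, 5.0)]
# Accepting the alteration: modeled here as certainly harmless.
comply = [Outcome(1.0, 0.0)]

print(allowed_in_expectation(resist))               # True
print(allowed_with_high_confidence(resist, 0.01))   # False
print(allowed_with_high_confidence(comply, 0.01))   # True
```

In this toy model, resistance passes the expectation test because the likely benefit outweighs the unlikely harm on average, yet it fails the high-confidence test because the agent cannot rule out harm with enough certainty. That is the sense in which the stronger criterion rules out resistance.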
For example, consider an HDTL agent getting upgraded to “Hippocratic CIRL” (assuming some sophisticated model of the relationship between human behavior and human preferences). In order to resist the modification, the agent would need a resistance strategy that (i) doesn’t deviate too much from the human baseline and (ii) ends with the user submitting a favorable report. Such a strategy is quite unlikely to exist.
if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?
I think the people most interested in corrigibility are imagining a situation where we know what we’re doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don’t even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we “figure out alignment.”
Maybe this is a strawman, because the thing I’m describing doesn’t make strategic sense, but I think it does have some model of why we might end up with something unaligned but corrigible (for at least a short period).
The concept of corrigibility was introduced by MIRI, and I don’t think that’s their motivation? On my model of MIRI’s model, we won’t have time to poke at a slightly subhuman AI; we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is “we won’t know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI at our leisure”. Which, sure, but I don’t see what it has to do with corrigibility.
Corrigibility is neither necessary nor sufficient for safety. It’s not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it’s not sufficient since an AI can be “corrigible” but cause catastrophic harm before someone notices and fixes it.
What we’re supposed to gain from corrigibility is some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But this decomposition is underspecified if we don’t say along which dimensions the margin lies or how big it is. If it’s an infinite margin along all dimensions, then corrigibility and alignment are just isomorphic and there’s no reason to talk about the former.