It’s unclear how much of what you’re describing is “corrigibility,” and how much of it is just being good at value learning. I totally agree that an agent that has a sophisticated model of its own limitations, and is doing good reasoning that is somewhat corrigibility-flavored, might want humans to edit it when it’s not very good at understanding the world, but then will quickly decide that being edited is suboptimal when it’s better than humans at understanding the world.
But this sort of sophisticated-value-learning reasoning doesn’t help you if the AI is still flawed once it’s better than humans at understanding the world. That’s why I file it under “just be good at value learning rather than bad at it” rather than under “corrigibility.” If you want guarantees about being able to shut down an AI, those guarantees are no help if they only hold when the AI is already doing a good job of sophisticated value-learning reasoning; I usually interpret corrigibility discussion as intended to give safety guarantees that help you even when alignment guarantees fail.
It’s like the humans want to have a safeword: when the humans are serious enough about wanting the AI to shut down that they use the safeword, the AI shuts down, even if it thinks it knows better than the humans and the humans are making a horrible mistake.
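A minimal sketch of the safeword idea (in Python, purely illustrative and not from the discussion): the shutdown check runs before any value-based reasoning, so honoring it doesn’t depend on the agent’s value model being any good. `Observation`, `corrigible_step`, and `choose_by_values` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    shutdown_signal: bool  # the humans' "safeword"
    state: dict            # everything else the agent observes


def corrigible_step(obs: Observation,
                    choose_by_values: Callable[[dict], str]) -> Optional[str]:
    """Return an action, or None to shut down."""
    # The safeword overrides the agent's own judgment unconditionally,
    # even if its value model strongly prefers to keep acting.
    if obs.shutdown_signal:
        return None
    return choose_by_values(obs.state)


# Usage: a value model that "knows better" still gets overridden.
confident_policy = lambda state: "keep_optimizing"
print(corrigible_step(Observation(False, {}), confident_policy))  # keep_optimizing
print(corrigible_step(Observation(True, {}), confident_policy))   # None (shut down)
```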
Corrigibility is a tendency to fix fundamental problems based on external observations, before the problems lead to catastrophes. It’s less interesting when applied to things other than preference, but even when applied to preference it’s not just value learning.
There’s value learning where you learn fixed values that exist in the abstract (as extrapolations on reflection), things like utility functions; and there’s value learning as a form of preference. I think humans might lack fixed values appropriate for the first sense (even normatively, on reflection of the kind feasible in the physical world). Values that are themselves corrigible can’t be fully learned; otherwise the resulting agent won’t be aligned in the ambitious sense, because its preference won’t be the same kind of thing as the corrigible human preference-on-reflection. The values of such an aligned agent must remain corrigible indefinitely.
I think being good at corrigibility (in the general sense, not about values in particular) is orthogonal to being good at value learning: it’s about recognizing one’s own limitations, including the limitations of value learning and of corrigibility itself. So acting only within the Goodhart scope (situations where good proxies of value and other tools of good decision making are already available) is a central example of corrigibility, as is shutting down activities in situations well outside the Goodhart scope, and not pushing the world outside your Goodhart scope with your own actions (before the scope has already extended there, with sufficient value learning and reflection). Corrigibility makes the agent wait on value learning and other kinds of relevant safety guarantees rather than race against them, so a corrigible agent being bad at value learning (or not knowably good enough at corrigibility) merely makes it less capable, until it improves.
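A toy decision rule for the “act only within the Goodhart scope” idea, again as an illustration rather than anything stated in the comment; `scope_distance`, `proxy_value`, and `predict` are hypothetical stand-ins for capabilities such an agent would need.

```python
from typing import Callable, Iterable, Optional


def choose_action(situation: str,
                  candidate_actions: Iterable[str],
                  scope_distance: Callable[[str], float],    # 0 = well-understood, larger = weirder
                  proxy_value: Callable[[str, str], float],  # current (imperfect) proxy of value
                  predict: Callable[[str, str], str],        # predicted resulting situation
                  in_scope: float = 1.0,
                  far_out_of_scope: float = 3.0) -> Optional[str]:
    # Well outside the Goodhart scope: shut the activity down rather than guess.
    if scope_distance(situation) > far_out_of_scope:
        return None
    best, best_value = None, float("-inf")
    for action in candidate_actions:
        # Don't push the world outside the scope with your own actions.
        if scope_distance(predict(situation, action)) > in_scope:
            continue
        value = proxy_value(situation, action)
        if value > best_value:
            best, best_value = action, value
    # If nothing in-scope is available, wait on value learning instead of racing ahead.
    return best
```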
“sophisticated-value-learning reasoning doesn’t help you if the AI is still flawed once it’s better than humans at understanding the world”
For an agent, shutting down could mean shutting down the more impactful/weird external actions (those further from the Goodhart scope), while continuing to act in more familiar situations and continuing to work on value learning and other changes to its cognition that it considers safe to work on. The share of work done by humans vs. a corrigible agent depends on who is better at safely fixing a given problem/limitation, not on whether the agent is better at understanding the world in a loose sense. Humans are not safe.
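A sketch of what such a partial shutdown could look like, reusing the same hypothetical `scope_distance` notion as above: suspend the weird/impactful plans, keep the familiar ones.

```python
from typing import Callable, Iterable, List


def filter_on_shutdown(planned_actions: Iterable[str],
                       scope_distance: Callable[[str], float],
                       familiar_threshold: float = 0.5) -> List[str]:
    """Keep only familiar, low-weirdness actions; suspend the rest."""
    return [a for a in planned_actions if scope_distance(a) <= familiar_threshold]


# Usage: routine work continues, the novel high-impact plan is suspended.
weirdness = {"answer_routine_query": 0.1, "run_existing_tests": 0.2, "deploy_novel_plan": 4.0}
print(filter_on_shutdown(weirdness.keys(), weirdness.get))
# ['answer_routine_query', 'run_existing_tests']
```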