Corrigibility is the tendency to fix fundamental problems based on external observations, before those problems lead to catastrophes. It’s less interesting when applied to things other than preference, but even applied to preference it’s not just value learning.
There’s value learning where you learn fixed values that exist in the abstract (as extrapolations on reflection), things like utility functions; and there’s value learning as a form of preference. I think humans might lack fixed values appropriate for the first sense (even normatively, on the kind of reflection feasible in the physical world). Values that are themselves corrigible can’t be fully learned; otherwise the resulting agent won’t be aligned in the ambitious sense, because its preference won’t be the same kind of thing as the corrigible human preference-on-reflection. The values of such an aligned agent must remain corrigible indefinitely.
I think being good at corrigibility (in the general sense, not about values in particular) is orthogonal to being good at value learning; it’s about recognizing one’s own limitations, including the limitations of value learning and of corrigibility itself. So acting only within the goodhart scope (situations where good proxies of value and other tools of good decision making are already available) is a central example of corrigibility, as is shutting down activities in situations well outside the goodhart scope, and not pushing the world outside your goodhart scope with your own actions (before the scope has already extended there, through sufficient value learning and reflection). Corrigibility makes the agent wait on value learning and other relevant safety guarantees rather than race against them, so a corrigible agent that is bad at value learning (or not knowably good enough at corrigibility) is merely less capable, until it improves.
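A minimal toy sketch of this gating rule, assuming hypothetical names (ProxyModel, in_goodhart_scope, corrigible_act, predict_next) invented for illustration only, not an established API or anyone’s actual proposal:

```python
# Toy sketch: a corrigible policy gated by a "goodhart scope", i.e. the region of
# situations where the agent's current value proxies are trusted. All names here
# are illustrative assumptions, not a real library.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ProxyModel:
    """Stand-in for the agent's current, imperfect proxy of human values."""
    score: Callable[[object, object], float]   # value estimate for (situation, action)
    confidence: Callable[[object], float]      # how well-calibrated the proxy is here

def in_goodhart_scope(proxy: ProxyModel, situation, threshold: float = 0.9) -> bool:
    # Inside the scope: the proxy (and other decision-making tools) are known-good.
    return proxy.confidence(situation) >= threshold

def corrigible_act(proxy: ProxyModel, situation, candidate_actions,
                   predict_next: Callable[[object, object], object]) -> Optional[object]:
    """Return an action, or None to signal 'defer / shut this activity down'."""
    if not in_goodhart_scope(proxy, situation):
        return None  # well outside the scope: stop acting, wait for value learning to catch up
    safe_candidates = [
        a for a in candidate_actions
        # veto actions predicted to push the world outside the current scope
        if in_goodhart_scope(proxy, predict_next(situation, a))
    ]
    if not safe_candidates:
        return None
    return max(safe_candidates, key=lambda a: proxy.score(situation, a))
```

The point of the sketch is only the shape of the rule: the agent never races its own value learning, it acts where its proxies are trusted and declines everywhere else.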
One worry: sophisticated value-learning reasoning doesn’t help if the AI is still flawed by the time it’s better than humans at understanding the world.
For an agent, shutting down could mean shutting down the more impactful or weird external actions (those further from the goodhart scope) while continuing to act in more familiar situations, and continuing to work on value learning and other changes to its cognition that it considers safe to work on. The share of work done by humans vs. a corrigible agent depends on who is better at safely fixing a given problem or limitation, not on whether the agent is better at understanding the world in some loose sense. Humans are not safe.
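A sketch of that graded notion of shutting down, again with invented names (distance_from_scope, considered_safe) standing in for measures the agent would have to supply:

```python
# Toy continuation of the sketch above: "shutting down" as a graded filter over actions,
# not a single global off-switch. distance_from_scope is a hypothetical measure of how
# impactful/weird an action is relative to the current goodhart scope.
from typing import Callable, Iterable, List, Tuple

def graded_shutdown(actions: Iterable[object],
                    distance_from_scope: Callable[[object], float],
                    cutoff: float = 1.0) -> Tuple[List[object], List[object]]:
    """Split candidate actions into those still permitted (familiar situations)
    and those suspended (further from the scope than the current cutoff)."""
    permitted, suspended = [], []
    for a in actions:
        (permitted if distance_from_scope(a) <= cutoff else suspended).append(a)
    return permitted, suspended

def continue_safe_self_improvement(proposed_changes: Iterable[object],
                                   considered_safe: Callable[[object], bool]) -> List[object]:
    """Keep working on value learning and other cognitive changes the agent
    considers safe to work on, even while weirder external actions are suspended."""
    return [c for c in proposed_changes if considered_safe(c)]
```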