I think we need a different approach to corrigibility: the AI should not be merely indifferent to corrections, it should be actively motivated to seek out relevant knowledge, including corrections to its current value model. I see this as being like the multi-armed bandit problem (see http://en.wikipedia.org/wiki/Multi-armed_bandit ) -- the AI should understand that it’s trying to maximize a function that it doesn’t know, its programmers couldn’t fully and accurately describe to it, and that it is trying to learn. The actual function is likely to be something hard-to-define/test/compute, like the averaged coherent extrapolated volition of all of humanity, or the all-time integral of (accurate, undeceived) human retrospective satisfaction (in, say, quality-adjusted life years or some similar unit) -- the AI needs to have a coherent description of what value function it’s trying to learn (that may well be the hard part).
The AI needs to understand that all it has at any point in time is an approximate model of the true value function, and it needs to devote part of its effort to improving that model (using something like the scientific method and/or Bayesian reasoning and/or statistical and logical inference and/or machine learning techniques). In order to rationally decide how much effort to allocate to improving its future model rather than acting on its current model, and how much trust to put in its current model in various different contexts, it also needs an approximate estimate of the accuracy of its current value function in different situations. That likely includes concepts such as a quantification of ‘I’m pretty sure that at least under most circumstances humans don’t like being killed’, and likely also an estimate of the accuracy of its estimate of the accuracy of its model, and so forth.
The AI should be aware that if you evaluate the median value of a Monte Carlo ensemble of different uncertain estimated value functions over a large space of possible actions, there is a significant chance that the maximum of the median value will lie at a point in the search space where the uncertainty in the estimate of the true value is large, and that this maximum will be due to the estimated value functions being locally inaccurate at that point rather than to that point being a true maximum of the genuine value function. So before maximization, the median of the Monte Carlo ensemble of value functions should be penalized by a factor related to both the estimated local uncertainty and its estimated distribution (and the uncertainty in that uncertainty, and so on, allowing for the fact that the unknown errors in the value function are unlikely to be normally distributed: a ‘fat-tail’ distribution is much more likely) and also to the magnitude of the look-elsewhere effect from the effective size of the space being searched over. In short, the AI needs to understand probability and statistics, and how they apply to its own internal models, and act rationally upon this knowledge.
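To make the penalization step concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration rather than part of the proposal: the function name `penalized_scores`, the use of the median absolute deviation as the local-uncertainty estimate, the `caution` parameter, and the `sqrt(2 ln N)` look-elsewhere factor (a Gaussian-motivated heuristic, so probably too lenient for fat-tailed errors). It also subtracts a penalty rather than multiplying by one, purely for simplicity.

```python
import math
import statistics


def penalized_scores(ensemble, caution=1.0):
    """Score each action by ensemble median minus an uncertainty penalty.

    `ensemble` maps an action name to a list of value estimates, one per
    sampled value model. All names here are hypothetical.
    """
    n_actions = len(ensemble)
    # Assumed look-elsewhere factor: grows slowly with the number of
    # candidate actions searched over (Gaussian heuristic, not fat-tail safe).
    look_elsewhere = math.sqrt(2.0 * math.log(max(n_actions, 2)))
    scores = {}
    for action, samples in ensemble.items():
        med = statistics.median(samples)
        # Median absolute deviation as a robust estimate of local uncertainty.
        mad = statistics.median([abs(s - med) for s in samples])
        scores[action] = med - caution * look_elsewhere * mad
    return scores


# Toy data: a well-understood action and a poorly-understood one.
ensemble = {
    "status_quo": [1.0, 1.1, 0.9, 1.0, 1.05],
    "galactic_paperclips": [9.0, -8.0, 2.0, 12.0, -5.0],
}
naive_choice = max(ensemble, key=lambda a: statistics.median(ensemble[a]))
scores = penalized_scores(ensemble)
cautious_choice = max(scores, key=scores.get)
print(naive_choice, cautious_choice)
```

On this toy data the plain median prefers the high-variance action, while the penalized score prefers the well-understood one. How large the penalty ought to be is exactly the open question raised above.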
Corrigibility (and the corrigibility of any agents it constructs) then becomes an inherently automatically desirable sub-goal: the more the AI can improve its model of the true value function, the better the expected future results of its optimization on the basis of its future model are likely to be. Humans, as evolved sentient and sapient beings, are clearly domain experts on how to make humans happy/satisfied. As such (genuine, uncoerced, unmanipulated) correction input from humans (especially live, awake, sane, rational, well informed, adult humans) is a high value input, which should be actively sought out; and whenever it is received, the AI’s value model and estimates of its value model’s accuracy should be rationally and promptly updated in light of it.
To pick a specific example, if the AI has a shutdown button, and a (live, awake, sane, rational, well informed, adult, uncoerced) human (genuinely) presses it, then the AI’s value model should be updated in light of the fact ‘this human now believes that the average value of the states of the world in which I shut down is higher (and I can even roughly estimate how much higher from how hard they pressed the button) than the optimized-by-me value of the states of the world in which I don’t, which implies that they believe there is a very serious flaw in my current value model that I am unaware of and that they believe I will be unlikely to correct’. If, once the AI has revised its value model and estimates of its accuracy in light of this new input, the AI believes they are likely to be correct, then it should shut down (how cautiously/hard it does so depending on how much/little it now trusts its own judgement).
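As a rough illustration of treating a button press as evidence, here is a small Bayes'-rule sketch. The function names, the two-hypothesis framing ("seriously flawed" vs. "ok"), and every number below are hypothetical, chosen only to show the shape of the update; a real system would need a far richer model.

```python
def posterior_flaw(prior_flaw, p_press_if_flawed, p_press_if_ok):
    """Bayes' rule: P(value model seriously flawed | genuine button press)."""
    joint_flawed = prior_flaw * p_press_if_flawed
    joint_ok = (1.0 - prior_flaw) * p_press_if_ok
    return joint_flawed / (joint_flawed + joint_ok)


def should_shut_down(p_flaw, value_if_ok, harm_if_flawed, value_of_shutdown=0.0):
    """Shut down when continuing has lower expected value than stopping."""
    expected_continue = (1.0 - p_flaw) * value_if_ok - p_flaw * harm_if_flawed
    return expected_continue < value_of_shutdown


# Illustrative numbers only: humans rarely press the button when all is well.
prior = 0.05
posterior = posterior_flaw(prior, p_press_if_flawed=0.6, p_press_if_ok=0.01)
before = should_shut_down(prior, value_if_ok=10.0, harm_if_flawed=100.0)
after = should_shut_down(posterior, value_if_ok=10.0, harm_if_flawed=100.0)
print(round(posterior, 3), before, after)
```

With these made-up numbers the AI would keep running before the press and shut down after it, because the press moves most of the probability mass onto the "flawed" hypothesis.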
Such an AI will value the input possible from a shutdown button, and will thus be motivated to keep it in good repair, as it would for any other sensor that could convey valuable information. It might even choose to institute regular shut-down surveys, polling its current popularity, much as human politicians do (but hopefully with more willingness to flip-flop if its actions prove unpopular).
Indeed, the AI should be actively devoting a suitably chosen portion of its resources to going out and performing, for example, sociological surveys and double-blind experiments on what humans want it to do that could improve its value model (perhaps including how they feel about the current world supply of paperclips).
If the AI thinks it has discovered an action that will maximize value under its current value model but that falls in an area where it suspects its value model might be less accurate (e.g. launching an interstellar fleet of paperclip-constructing Von Neumann machines to conquer the galaxy and convert it to paperclips: yes, humans pretty clearly like paperclips, but perhaps it’s uncertain how much this would be modified by them being light-years away, and it’s notable that so far humans have shown little sign of interest in doing this for themselves), then before expending any significant resources on this goal, or doing anything hard-to-undo towards it, it would clearly be an excellent idea to first try to get more solid data on how much human happiness would actually be enhanced by the existence of vast numbers of paperclips orbiting distant stars, in case its current value model is in fact inaccurate in this area.
As the AI becomes better informed, more knowledgeable about humans, and more capable, you would also start to see what could be termed ‘informed corrigibility’: the AI values and incorporates correction input from humans in proportion to its estimate of their likely accuracy, and has fairly accurate models for how trustworthy input from different humans is. For example, it weights the considered opinion of a panel of domain experts based on extensive experience and surveys (especially if it fits pretty well with its previous knowledge) higher than the nonsensical drunken babblings of a crazy person, and is aware that under certain circumstances humans are not always entirely accurate at predicting what will turn out to give them the most satisfaction, and that even their retrospective estimates of satisfaction are not always entirely accurate.
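One simple way to picture weighting inputs by their estimated trustworthiness is inverse-variance weighting. This sketch assumes independent Gaussian noise on each source, which is a simplification (and at odds with the fat tails mentioned earlier); the function name `combine` and all the numbers are hypothetical.

```python
def combine(estimates):
    """Precision-weighted average of (value, variance) pairs.

    Assumes independent Gaussian errors; returns (mean, variance).
    """
    precision = sum(1.0 / variance for _, variance in estimates)
    mean = sum(value / variance for value, variance in estimates) / precision
    return mean, 1.0 / precision


# Each entry: (estimated value of some outcome, assumed error variance).
inputs = [
    (5.0, 4.0),     # the AI's current value model
    (1.0, 1.0),     # considered opinion of an expert panel (trusted more)
    (20.0, 100.0),  # an unreliable source (trusted very little)
]
mean, variance = combine(inputs)
print(round(mean, 3), round(variance, 3))
```

The combined estimate lands close to the expert panel's figure, and the unreliable source barely moves it. Estimating those variances well is of course the hard part that ‘informed corrigibility’ depends on.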
One possible (and under appropriate circumstances highly desirable) outcome with this sort of AI is that the AI, having performed some attempts at learning, comes to the conclusion that its current value model is too inaccurate and too unsafe/oversimplified for it to safely exist long enough to learn to fix, and voluntarily shuts itself down without anyone even needing to hit the shutdown button, likely after first writing a ‘suicide note’ explaining what aspects of its model it had come to the conclusion were too poorly designed for its continued existence to be safe. This is probably the best possible failure mode for a flawed advanced AI: one that correctly diagnoses that its own design is flawed and shuts itself off.
I think this is the key to corrigibility: the AI needs to have a decision/learning process that reflects the facts that a) its current value model is likely to be partially incorrect, b) what it’s actually attempting to optimize is the value under the correct value model that it doesn’t yet know and is trying to learn, and c) correction input from its programmers is a valuable source of information about inaccuracies in its current value model that will help it improve its value model and thus increase the expectation of the true value of its future actions.
If an agent understands these facts, it will then need to divide its efforts between a) acting on its current value model and b) trying to (as safely and efficiently as possible) gather information with which to improve its current value model (using Bayesian logic, machine learning, or whatever) -- this ends up looking rather like the Multi-Armed Bandit problem: the optimum is usually to learn first and then gradually shift over towards acting more. Optimizing this shifting division of effort is going to require it to maintain some sort of internal model of how much confidence it currently has in its value model in different parts of the space of all possible actions/outcomes that it is optimizing over. This is also quite useful for evaluating the utility of actions like “Before taking an action that seems optimal to you but is in an area where your confidence in your current value model’s accuracy is low, ask your programmers what they think about it.”
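The ‘learn first, then gradually shift towards acting’ pattern can be sketched with a toy bandit that uses a decaying exploration rate. This is only one of several standard bandit strategies, not a claim about the best one; the `1/sqrt(step)` schedule, the function names, and the Bernoulli reward probabilities are all assumptions for illustration.

```python
import random


def exploration_rate(step):
    """Hypothetical schedule: explore almost always at first, rarely later."""
    return min(1.0, 1.0 / step ** 0.5)


def run_bandit(true_means, steps, seed=0):
    """Epsilon-greedy with decaying epsilon on Bernoulli arms."""
    rng = random.Random(seed)
    pulls = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    for step in range(1, steps + 1):
        if rng.random() < exploration_rate(step):
            # Gather information: try an arm at random.
            arm = rng.randrange(len(true_means))
        else:
            # Act on the current model: pick the best-looking arm.
            arm = max(range(len(true_means)), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates


pulls, estimates = run_bandit([0.2, 0.5, 0.9], steps=5000)
print(pulls, [round(e, 2) for e in estimates])
```

In the value-learning setting an ‘arm’ would be something much richer than a lever, and exploration would have to be done safely, which a plain bandit does not model at all.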
To instill greater caution in the AI, it’s probably a good idea to give it Bayesian priors suggesting that in regions where its current value model turns out to be in fact incorrect, it is more likely to be an overestimate than an underestimate (since in general humans are both pretty happy with the status quo that they’ve made for themselves and picky about how it gets changed, and also understand (and have thus already encoded into its program) how much they like the status quo and small changes to it better than they do for large changes from it; so inaccuracies in its value model are more likely in areas where the true value is low, and are more likely to cause overestimates than underestimates).
It would also be a good idea to encode awareness of the fact that its inbuilt maximization search of currently expected value over a large space of actions/outcomes is inherently much more likely to find an apparent-but-false maximum due to a local value-overestimate than it is to find a true maximum where the AI is in fact also underestimating the true value: so there is a tendency for apparent optima to be points where the value function is overestimated rather than underestimated by the model used in the search (statistically/analytically quantifying this tendency and how it depends on both the form of the value function and variations in its true and estimated accuracy would be an interesting mathematical project, if it hasn’t already been done). There’s a kind of Look-Elsewhere effect here: the larger your search space, the more likely it is that your search includes a point where the estimated value is incorrect by an amount that is large compared to your standard-error estimate of the size of its local inaccuracy (this point may or may not end up being the apparent maximum as a result).
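A quick simulation shows this selection bias directly (it is sometimes discussed under the name ‘optimizer’s curse’). The sketch assumes unit-variance Gaussian true values and unit-variance Gaussian estimation noise, purely for convenience; the function name `mean_overestimate` and the trial counts are hypothetical, and fat-tailed noise would need its own experiment.

```python
import random
import statistics


def mean_overestimate(n_candidates, trials=200, seed=0):
    """Average (estimated - true) value at the apparent optimum."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
        # Each estimate is unbiased on its own, just noisy.
        estimates = [v + rng.gauss(0.0, 1.0) for v in true_values]
        # Pick the candidate that looks best under the noisy model.
        best = max(range(n_candidates), key=lambda i: estimates[i])
        gaps.append(estimates[best] - true_values[best])
    return statistics.mean(gaps)


small_search = mean_overestimate(10)
large_search = mean_overestimate(1000)
print(small_search, large_search)
```

Even though every individual estimate is unbiased, the value at the chosen point is overestimated on average, and the overestimate grows as the search space gets larger.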