I think we need a different approach to corrigibility: the AI should not be merely indifferent to corrections; it should be actively motivated to seek out relevant knowledge, including corrections to its current value model. I see this as being like the multi-armed bandit problem (see http://en.wikipedia.org/wiki/Multi-armed_bandit ) -- the AI should understand that it’s trying to maximize a function that it doesn’t know, that its programmers couldn’t fully and accurately describe to it, and that it must instead learn. The actual function is likely to be something hard-to-define/test/compute, like the averaged coherent extrapolated volition of all of humanity, or the all-time integral of (accurate, undeceived) human retrospective satisfaction (in, say, quality-adjusted life years or some similar unit) -- the AI needs to have a coherent description of what value function it’s trying to learn (that may well be the hard part).
The AI needs to understand that all it has at any point in time is an approximate model of the true value function, and it needs to devote part of its effort to attempting to improve its model (using something like the scientific method and/or Bayesian reasoning and/or statistical and logical inference and/or machine learning techniques). And in order to rationally decide how much effort to allocate to improving its future model rather than acting on its current model, and how much trust to put in its current model in various different contexts, it also needs an approximate estimate of the accuracy of its current value function in different situations—likely including concepts such as a quantification of ‘I’m pretty sure that at least under most circumstances humans don’t like being killed’, and likely also an estimate of the accuracy of its estimate of the accuracy of its model, and so forth.
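To make the bandit framing concrete, here is a minimal sketch (an illustration, not a worked-out proposal): the agent keeps a Bayesian posterior over the parameters of a simple linear value model, updates it from noisy human feedback, and uses Thompson sampling so that acting on the current model and exploring where the model is uncertain are traded off automatically. All names, features, and numbers are assumptions for the example.

```python
import numpy as np

# Sketch: a value model with explicit uncertainty over its parameters.
# The "true" value function is assumed (for illustration) to be linear in
# some features of an outcome; the agent never sees the true weights, only
# noisy human feedback, and tracks a posterior (mean, cov) over them.
rng = np.random.default_rng(0)

n_features = 4                       # features describing an outcome
noise_var = 0.5                      # assumed noise in human feedback
mean = np.zeros(n_features)          # current best guess at the value weights
cov = np.eye(n_features)             # current uncertainty about that guess

def update_value_model(mean, cov, x, rating):
    """Bayesian linear-regression update from one piece of human feedback:
    x are the outcome's features, rating is the human's reported value."""
    cov_inv = np.linalg.inv(cov)
    new_cov = np.linalg.inv(cov_inv + np.outer(x, x) / noise_var)
    new_mean = new_cov @ (cov_inv @ mean + x * rating / noise_var)
    return new_mean, new_cov

def choose_action(mean, cov, candidate_features):
    """Thompson sampling: act on one plausible value function drawn from the
    posterior, which automatically explores where uncertainty is large."""
    sampled_weights = rng.multivariate_normal(mean, cov)
    return int(np.argmax(candidate_features @ sampled_weights))

# One round: incorporate feedback on an observed outcome, then pick among candidates.
mean, cov = update_value_model(mean, cov, x=np.array([1.0, 0.0, 0.5, 0.0]), rating=0.7)
best = choose_action(mean, cov, candidate_features=rng.normal(size=(10, n_features)))
```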
The AI should be aware that if you evaluate the median value of a Monte Carlo ensemble of different uncertain estimated value functions over a large space of possible actions, there is a significant chance that the maximum of the median value will lie at a point in the search space where the uncertainty in the estimate of the true value is large, and will be due to the estimated value functions being locally inaccurate at that point rather than to that being a true maximum of the genuine value function. So before maximization, the median of the Monte Carlo ensemble of value functions should be penalized by a factor related both to the estimated local uncertainty and its estimated distribution (and the uncertainty in that uncertainty, and so on, allowing for the fact that the unknown errors in the value function are unlikely to be normally distributed—a ‘fat-tailed’ distribution is much more likely) and to the magnitude of the look-elsewhere effect from the effective size of the space being searched over. In short, the AI needs to understand probability and statistics, and how they apply to its own internal models, and act rationally upon this knowledge.
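Here is one rough way such a penalty could look, with assumed names and scaling choices: `ensemble_values` holds value estimates from a Monte Carlo ensemble of plausible value functions, a quantile-based spread stands in for the local uncertainty (more robust to fat tails than a standard deviation), and a term growing with the number of candidate actions stands in for the look-elsewhere correction. This is only a sketch of the idea, not a recommended functional form.

```python
import numpy as np

def penalized_choice(ensemble_values, kappa=1.0):
    """ensemble_values: (n_samples, n_actions) value estimates, one row per
    sampled value function. Returns the index of the action maximizing the
    uncertainty-penalized median, plus the penalized scores."""
    n_samples, n_actions = ensemble_values.shape
    median = np.median(ensemble_values, axis=0)
    # Robust per-action spread (inter-decile range), less sensitive to fat
    # tails than a standard deviation.
    spread = (np.quantile(ensemble_values, 0.9, axis=0)
              - np.quantile(ensemble_values, 0.1, axis=0))
    # Crude look-elsewhere correction: penalize more when searching a larger
    # space of candidate actions.
    look_elsewhere = np.sqrt(np.log(max(n_actions, 2)))
    scores = median - kappa * spread * look_elsewhere
    return int(np.argmax(scores)), scores

# Example: 1000 sampled value functions over 50 candidate actions, where one
# action's estimate is high but very noisy; a raw median-maximizer may pick
# it, while the penalized score avoids it.
rng = np.random.default_rng(1)
ensemble = rng.normal(scale=0.1, size=(1000, 50)) + np.linspace(0.0, 1.0, 50)
ensemble[:, 10] = rng.normal(loc=1.2, scale=3.0, size=1000)  # high-variance action
best, scores = penalized_choice(ensemble)
```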
Corrigibility (and the corrigibility of any agents it constructs) then becomes an inherently and automatically desirable sub-goal: the more the AI can improve its model of the true value function, the better the expected future results of optimizing on the basis of its future model are likely to be. Humans, as evolved sentient and sapient beings, are clearly domain experts on how to make humans happy/satisfied. As such, (genuine, uncoerced, unmanipulated) correction input from humans (especially live, awake, sane, rational, well-informed, adult humans) is high-value input, which should be actively sought out; and whenever it is received, the AI’s value model and its estimates of that model’s accuracy should be rationally and promptly updated in light of it.
To pick a specific example, if the AI has a shutdown button, and a (live, awake, sane, rational, well-informed, adult, uncoerced) human (genuinely) presses it, then the AI’s value model should be updated in light of the fact ‘this human now believes that the average value of the states of the world in which I shut down is higher (and I can even roughly estimate how much higher from how hard they pressed the button) than the optimized-by-me value of the states of the world in which I don’t—which implies that they believe there is a very serious flaw in my current value model that I am unaware of and that they believe I am unlikely to correct’. If, once the AI has revised its value model and its estimates of that model’s accuracy in light of this new input, it believes the human is likely to be correct, then it should shut down (how cautiously/hard it does so depending on how much/little it now trusts its own judgement).
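A toy version of the update a button press should trigger, with purely hypothetical numbers: treat the press as evidence for the hypothesis ‘my value model has a serious flaw I am unaware of’, using the AI’s model of how likely a sane, informed, uncoerced human is to press the button with versus without such a flaw.

```python
# Hypothetical numbers throughout; the point is the shape of the update, not
# the particular values.
prior_flaw = 0.05          # prior credence in a serious unnoticed flaw
p_press_given_flaw = 0.90  # an informed human probably presses if there is one
p_press_given_ok = 0.02    # false alarms are assumed to be rare

def posterior_flaw_given_press(prior, p_if_flaw, p_if_ok):
    """Bayes' rule: P(serious flaw | button pressed)."""
    evidence = p_if_flaw * prior + p_if_ok * (1 - prior)
    return p_if_flaw * prior / evidence

posterior = posterior_flaw_given_press(prior_flaw, p_press_given_flaw, p_press_given_ok)
# posterior ~= 0.70: a single genuine press should move the AI a long way
# toward believing the human, and toward shutting down if that exceeds its
# threshold for distrusting its own judgement.
```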
Such an AI will value the input available from a shutdown button, and will thus be motivated to keep it in good repair, as it would any other sensor that could convey valuable information. Indeed, it might even choose to institute regular shutdown surveys, polling its current popularity, much as human politicians do (but hopefully with more willingness to flip-flop if its actions prove unpopular).
Indeed, the AI should be actively devoting a suitably chosen portion of its resources to going out and performing, for example, sociological surveys and double-blind experiments on what humans want it to do, in order to improve its value model (perhaps including how they feel about the current world supply of paperclips).
If the AI thinks it has discovered an action that will maximize value under its current value model but that falls in an area where it suspects its value model might be less accurate (e.g. launching an interstellar fleet of paperclip-constructing von Neumann machines to conquer the galaxy and convert it to paperclips—yes, humans pretty clearly like paperclips, but perhaps it’s uncertain how much this would be modified by them being light-years away, and it’s notable that so far humans have shown little sign of interest in doing this for themselves), then before expending any significant resources on, or doing anything hard-to-undo towards, this goal, it would clearly be an excellent idea to first try to get more solid data on how much human happiness would actually be enhanced by the existence of vast numbers of paperclips orbiting distant stars, in case its current value model is in fact inaccurate in this area.
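One way to cash this out is a crude value-of-information check before committing to anything hard to undo; the function and all the numbers below are illustrative assumptions, not a claim about how the calculation should really be done.

```python
def expected_value_of_information(p_model_wrong, loss_if_wrong, cost_of_study):
    """Expected net benefit of gathering data before acting: with probability
    p_model_wrong the study averts an irreversible mistake costing
    loss_if_wrong; the study itself always costs cost_of_study."""
    return p_model_wrong * loss_if_wrong - cost_of_study

# Hypothetical numbers for the galactic-paperclip plan:
p_wrong = 0.3       # credence that the value model is inaccurate in this regime
loss = 1e6          # value destroyed if the irreversible plan was a mistake
study_cost = 1e3    # cost (resources, delay) of surveys and experiments first

if expected_value_of_information(p_wrong, loss, study_cost) > 0:
    decision = "gather more data on human preferences before committing"
else:
    decision = "proceed under the current value model"
```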
As the AI becomes better informed, more knowledgeable about humans, and more capable, you would also start to see what could be termed ‘informed corrigibility’: the AI values and incorporates correction input from humans in proportion to its estimate of its likely accuracy, and has fairly accurate models of how trustworthy input from different humans is—for example, it weights the considered opinion of a panel of domain experts based on extensive experience and surveys (especially if it fits pretty well with its previous knowledge) higher than the nonsensical drunken babblings of a crazy person, and is aware that under certain circumstances humans are not always entirely accurate at predicting what will turn out to give them the most satisfaction, and that even their retrospective estimates of satisfaction are not always entirely accurate.
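A minimal sketch of such reliability weighting, with illustrative names and numbers: each piece of human feedback carries an estimated noise level, and feedback is combined by precision (inverse-variance) weighting, so an expert panel moves the estimate far more than incoherent input does.

```python
def precision_weighted_update(prior_mean, prior_var, reports):
    """reports: list of (value, noise_variance) pairs, one per human source.
    Combines them with the prior by precision (inverse-variance) weighting
    and returns the posterior mean and variance."""
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for value, noise_var in reports:
        precision += 1.0 / noise_var
        weighted_sum += value / noise_var
    return weighted_sum / precision, 1.0 / precision

# An expert panel (low assumed noise) versus an unreliable source (high noise):
mean, var = precision_weighted_update(
    prior_mean=0.0, prior_var=1.0,
    reports=[(0.8, 0.05),    # considered expert opinion: moves the estimate a lot
             (-5.0, 100.0)], # incoherent input: barely moves it
)
```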
One possible (and under appropriate circumstances highly desirable) outcome for this sort of AI is that, having made some attempts at learning, it comes to the conclusion that its current value model is too inaccurate and too unsafe/oversimplified for it to safely continue operating long enough to learn to fix it, and voluntarily shuts itself down without anyone even needing to hit the shutdown button, likely after first writing a ‘suicide note’ explaining which aspects of its model it had concluded were too poorly designed for its continued existence to be safe. This is probably the best possible failure mode for a flawed advanced AI—one that correctly diagnoses that its own design is flawed and shuts itself off.
This approach relies on having a process that reaches the desired conclusion, without specifying the desired conclusion. It’s a multi-armed bandit problem with not only the rewards uncertain, but the reward function uncertain. And it seems to rely on defining terms like “live, awake, sane, rational, well informed, adult, uncoerced”, which ain’t easy (though I have some developing ideas on how to do that for some of them).
Both your definition and corrigibility require human input. For your process, the AI has to assess what human input should be, at least as far as it has the power to influence future human input (see some of the issues with ). Corrigibility allows actual human input in many cases, without the AI doing any assessment.
Corrigibility is not needed if everything else is right; corrigibility is very useful if there might still be flaws in the AI’s design.
Yes, plan A is for the AI to be corrigible because of uncertainty about human values and about the accuracy of its own reasoning (and to actively seek feedback for the same reason). The question is how to set things up so that that happens. We have some rough idea, but concrete existing proposals don’t quite work.
I think plan B is for the AI to understand and satisfy human short-term preferences (including the preference for the AI to follow direct instructions, to not kill anyone or do anything serious and irreversible, to gather information that is relevant to understanding our preferences...). Realistically I think this will probably be the most robust measure, and we would use it even if we expect plan A to work.
The kind of utility-function surgery from this post is at best plan C.
The better our understanding, the less need for utility function surgery (and vice versa).