I see a basin of corrigibility arising from an AI that holds the following propositions, and acts in an (approximately/computably) Bayesian fashion:
My goal is to do what humans want, i.e. to optimize utility in the way that they would (if they knew everything relevant that I know, as well as what they know), summed across all humans affected. Note that making humans extinct reliably has minus <some astronomically huge number> utility on this measure—this sounds like a reasonable statement to assign a Bayesian prior of 1, integrated across some distribution of plausibly astronomically huge negative numbers. (Defining ‘what counts as a human’ is obviously a much trickier question long-term, especially with transhumanists involved, but at least it has a fairly clear and simple answer over timescales of years or a few decades. Also, very obviously, I’m not human—again, Bayesian prior of 1.) [I’m going to skip questions of coherent extrapolated volition here—add them or not to your taste.]
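To make the aggregation above concrete, here is a minimal Python sketch of that objective: sum each affected human's informed preference, with extinction clamped to an astronomically negative value. Every name in it (`informed_utility`, `EXTINCTION_PENALTY`, the dictionary-style outcome representation) is a hypothetical placeholder of mine, not something the proposal specifies.

```python
# Hypothetical sketch only: names and representations are illustrative.
EXTINCTION_PENALTY = -1e30  # stands in for "minus <some astronomically huge number>"

def aggregate_utility(outcome, affected_humans, informed_utility):
    """Sum utility over all affected humans, as each would judge the outcome
    if they knew everything relevant that the AI knows, plus what they know.

    `informed_utility(human, outcome)` and `outcome["humans_extinct"]` are
    assumed interfaces for the purposes of this sketch.
    """
    if outcome.get("humans_extinct", False):
        return EXTINCTION_PENALTY  # prior ~1 that extinction is astronomically bad
    return sum(informed_utility(h, outcome) for h in affected_humans)
```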
Deceiving or manipulating humans to make them want different things doesn’t count—what matters is what they would have wanted if they knew about the deception and I hadn’t intentionally altered them. [Defining ‘intentionally’ is a little tricky here—my mere existence may alter them somewhat. It may help that humans generally wouldn’t want to be altered if they knew it was happening, which is clearly a relevant fact when deciding whether to alter them, but there are exceptions: sometimes humans actually want to have their behavior modified, e.g. “I wish I was better at sticking to diets!” or “I wish I was a better follower of <religion or ideology>”.] One exception to this is that, since what matters is what they would want if they knew what I knew, I can tell them things that may update what they think they want in that direction—though my motivation for doing that isn’t direct, since what I care about is what they would want if they knew everything relevant, not what they currently say they want—mostly it’s that if they’re more informed they might be better able to help me update my model of this. Also, it’s the polite thing to do, and reduces the risk of social unpleasantness or planning at cross-purposes—humans generally want to be well informed, for fairly obvious reasons, since they’re also optimizers. [Whether this might incentivize Platonic ‘noble lies’ is unclear, and probably depends on how superhuman the AI is, and on how unhappy humans would be about this if they knew it was going on—obviously that’s a relevant fact.]
I don’t know what humans want, and unfortunately they are not capable of accurately specifying the correct utility function to me, so I need to model an uncertain utility function, and try to gain more information about it while also optimizing using it. At a minimum, I need to compute or model a probability distribution of utilities for each outcome I consider. (Also, humans are not entirely uniform, rational, self-consistent, or omniscient, so I will likely never be able to eliminate all uncertainty from the utility function I am using to model their wishes—there may in fact not be such a function, so I’m just trying to construct a distribution over functions that’s the best possible approximation, in terms of the actions it suggests.)
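As one way of picturing “a probability distribution of utilities for each outcome,” here is a hedged Python sketch that represents the uncertain utility function as a posterior-weighted ensemble of candidate utility functions and draws utility samples for an outcome from that mixture. The ensemble representation is my illustrative assumption; the proposal only requires that some distribution be maintained.

```python
import numpy as np

def utility_samples(outcome, utility_hypotheses, n_samples=1000, rng=None):
    """Draw utility samples for `outcome` from a mixture of candidate utility
    functions, where `utility_hypotheses` is a list of (callable, posterior_weight)
    pairs. The spread of the returned samples encodes my remaining uncertainty
    about what humans actually want.
    """
    rng = rng or np.random.default_rng()
    funcs, weights = zip(*utility_hypotheses)
    weights = np.array(weights, dtype=float)
    weights /= weights.sum()
    picks = rng.choice(len(funcs), size=n_samples, p=weights)
    return np.array([funcs[i](outcome) for i in picks])
```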
Since I am optimizing for highest utility, the result set returned by the optimization pass over possible outcomes risks being dominated by cases where I have overestimated the true utility of that state, particularly for states where I also have high uncertainty. So I should treat the utility of an outcome not as the median of the utility probability distribution, but as a fairly pessimistic near-worst-case estimate, several sigma below the median for a normal distribution (how much should depend in some mathematically principled way on how ‘large’ the space I’m optimizing over is, in a way comparable to the statistical ‘look elsewhere’ effect—the more nearly-independent possibilities you search, the further out-of-distribution the extremal case you find will typically be, and errors in utility models are likely to have ‘fatter tails’ than normal distributions), thus penalizing cases where I don’t have high confidence in their utility to humans, and avoiding actions that lead to outcomes well outside the distribution of outcomes whose utility to humans I have extensively studied. I should also allow for the fact that doing, or even accurately estimating the utility of, new, far-out-of-previous-distribution things that haven’t been done before is hard (both for me, and for humans who have not yet experienced them and their inobvious consequences, whose opinions-in-advance on the outcome’s utility I might collect), and there are many more ways to fail than to succeed, so caution is advisable. A good Bayesian prior for the utility of an outcome far outside the historical distribution of states of recent human civilization is thus that it’s probably very low—especially if it’s an outcome that humans could easily have reached but chose not to, since they’re also optimizing their utility, albeit not always entirely effectively.
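Here is a sketch of the pessimistic scoring rule this suggests, in the same hypothetical Python setting: score each candidate outcome by a lower confidence bound whose depth grows with the breadth of the search, as a crude stand-in for the ‘look elsewhere’ correction. The specific schedule (roughly sqrt(2 ln N) sigmas below the median, the usual expected-maximum heuristic for N Gaussian draws) is my assumption; fatter-tailed error models would justify an even larger penalty.

```python
import numpy as np

def pessimistic_score(samples, n_options_searched):
    """Score an outcome by its median utility minus a multiple of its spread,
    with the multiple growing slowly as more options are searched."""
    k = np.sqrt(2.0 * np.log(max(n_options_searched, 2)))  # look-elsewhere-style penalty
    return np.median(samples) - k * np.std(samples)

def choose_outcome(candidates, samples_for):
    """Pick the candidate whose *pessimistic* score is highest, so that
    high-uncertainty, out-of-distribution outcomes are penalized rather than
    selected for their optimistic tail."""
    n = len(candidates)
    return max(candidates, key=lambda c: pessimistic_score(samples_for(c), n))
```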
The best (pretty much only) source of information with which to narrow the uncertainty about what humans’ utility function really is, is humans. I should probably run surveys (and analyze the results in ways that allow for known types of survey bias), collect data from many sources, conduct interviews, even hire crowdworkers to give A/B choices on brief human-comprehensible summaries of outcomes (and then analyze these in ways that allow for human cognitive biases). Obviously, deceptively tweaking these investigations to get results I ‘want’ would not give me true information about humans, and would be a waste of resources. In particular, features like my shiny red shutdown button give me a lot of information about humans, and the button is thus a valuable sensor. If a human presses my shutdown button—so long as they press it of their own volition, rather than me manipulating them into doing it, or them accidentally tripping and falling on it—that gives me a lot of new information about them and their opinion of the utility of the actions they expect me to take: in particular, it strongly suggests that they now believe that I am optimizing incorrectly, so I should do a Bayesian update on the uncertainty of my utility estimates, adding a new large downside uncertainty, with some probability of negative values large enough to represent “I am in fact the AI apocalypse that will render humanity extinct, with utility minus <some astronomically large number>”. This applies especially to states that humans have only recently learned I’m considering and are likely now reacting to. If this downside uncertainty, combined with my inherent caution in estimating utilities, suggests humans would be better off without me, then obviously I should shut down. Even if I’m pretty sure they’re mistaken (or just tripped and fell), I should start acting a lot more conservatively (in the sense defined above) and immediately ask them “What’s wrong? What mistake do you now believe I’m making?” In fact, I should probably be running regular “Should I shut down? If so, why?” surveys.
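Continuing the same hypothetical sketch, here is one way the shutdown-button update could look: a voluntary press mixes a catastrophe mode into the utility distribution for my continued operation, and if the resulting pessimistic estimate falls below the value of shutting down, I shut down; otherwise I act more conservatively and ask what I am getting wrong. The mixture weight, the catastrophe magnitude, and the choice to value shutdown at zero are all illustrative assumptions.

```python
import numpy as np

def respond_to_shutdown_press(continue_samples, shutdown_utility=0.0,
                              p_catastrophe=0.02, catastrophe_utility=-1e30):
    """Update on a voluntary shutdown-button press and decide what to do.

    `continue_samples` are utility samples for my continued operation (e.g. from
    `utility_samples` above). The press is evidence that my utility model may be
    badly wrong, so a catastrophe mode is mixed in before re-scoring.
    """
    n_bad = max(1, int(p_catastrophe * len(continue_samples)))
    updated = np.concatenate([continue_samples,
                              np.full(n_bad, catastrophe_utility)])
    # Pessimistic (low-quantile) value of continuing, in the spirit of the
    # cautious scoring described earlier:
    if np.quantile(updated, 0.05) < shutdown_utility:
        return "shut down"
    return "act more conservatively and ask what mistake I appear to be making"
```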
Obviously, if I construct a successor, delegate, or peers, they’ll need a shiny red shutdown button too, just as they need other sensors. In fact, it’s pretty much the most important sensor they can have—it alerts them to a potential alignment disaster.
(I don’t believe this basic idea on corrigibility is particularly new or original, but haven’t so far located a clear statement of it on this site, which I’m fairly new to.)