Insofar as humans care about their AI being corrigible, we should expect some degree of corrigibility even from a CEV-maximizer. That, in turn, suggests at least some basin-of-attraction for values (at least along some dimensions), in the same way that corrigibility yields a basin-of-attraction.
(Though obviously that’s not an argument we’d want to make load-bearing without both theoretical and empirical evidence about how big the basin-of-attraction is along which dimensions.)
Conversely, it doesn’t seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information—like e.g. what sort of changes-to-the-world we do/don’t care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.
Values seem like a necessary-and-sufficient component. Corrigibility/task architecture/etc. doesn't.
And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.
Small but important point here: an estimate of CEV which is within 5% error everywhere does reasonably well; that gets us within 5% of our best possible outcome. The problem is when our estimate is waaayyy off in 5% of scenarios, especially if it’s off in the overestimate direction; then we’re in trouble.
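To make that concrete, here's a toy numerical sketch (purely illustrative: the distributions, the 5%/95% split, and the size of the overestimates are all assumptions I'm adding, not anything from the discussion itself). An agent that argmaxes over options using a proxy estimate of value does fine when the proxy is within a few percent of the truth everywhere, but reliably lands on a badly mis-scored option when the proxy wildly overestimates even a small fraction of options.

```python
# Toy illustration (assumed numbers, not from the discussion above):
# an agent argmaxes over options using a *proxy* estimate of true value.
# Case A: proxy is within ~5% of the truth everywhere.
# Case B: proxy is exactly right 95% of the time, but wildly overestimates 5% of options.
import random

random.seed(0)
N_OPTIONS = 1_000
N_TRIALS = 200

def run_trial(error_model):
    true_values = [random.random() for _ in range(N_OPTIONS)]
    proxy = [error_model(v) for v in true_values]
    chosen = max(range(N_OPTIONS), key=lambda i: proxy[i])
    return true_values[chosen]              # true value of what the agent picks

def small_error_everywhere(v):
    return v * random.uniform(0.95, 1.05)   # within 5% of truth everywhere

def rare_huge_overestimate(v):
    if random.random() < 0.05:              # 5% of options: proxy is way off, high
        return v + random.uniform(5.0, 10.0)
    return v                                # otherwise exactly right

for name, model in [("uniform 5% error", small_error_everywhere),
                    ("rare huge overestimates", rare_huge_overestimate)]:
    picked = [run_trial(model) for _ in range(N_TRIALS)]
    print(f"{name:25s} -> mean true value of chosen option: {sum(picked)/len(picked):.3f}")
```

Typical output: the first case stays within a couple percent of the best available option, while the second case picks an option whose true value is around the middle of the pack, because the argmax is dominated by whichever option got the largest spurious boost.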
Conversely, it doesn’t seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information—like e.g. what sort of changes-to-the-world we do/don’t care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.
I suspect you could do this in a less value-loaded way if you’re somehow intervening on ‘what the AGI wants to pay attention to’, as opposed to just intervening on ‘what sorts of directions it wants to steer the world in’.
‘Only spend your cognition thinking about individual physical structures smaller than 10 micrometers’, ‘only spend your cognition thinking about the physical state of this particular five-cubic-foot volume of space’, etc. could eliminate most of the risk of ‘high-impact’ actions without forcing us to define human conceptions of ‘impact’, and without forcing the AI to do a bunch of human-modeling. But I don’t know what research path would produce the ability to do things like that.
(There’s still of course something that we’re trying to get the AGI to do, like make a nanofactory or make a scanning machine for WBE or make improved computing hardware. That part strikes me as intuitively more value-loaded than ‘only think about this particular volume of space’.
The difficulty with ‘only think about this particular volume of space’ is that it requires the ability to intervene on thoughts rather than outputs.)
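To gesture at the "intervene on attention vs. intervene on steering" distinction in code, here's a deliberately toy sketch (every name and structure in it is an illustrative assumption of mine, not a proposal about real AGI internals): planner (a) scores actions with a full world model plus a hand-written "impact" penalty, while planner (b) is only ever handed a model of the designated region, so the rest of the world never enters its search at all.

```python
# Deliberately toy contrast (all structures here are illustrative assumptions):
# (a) steering-level intervention: full world model + a value-laden "impact" penalty;
# (b) attention-level intervention: the planner's model simply excludes everything
#     outside a designated region.
from dataclasses import dataclass

@dataclass(frozen=True)
class WorldState:
    inside_region: dict   # variables describing the designated volume
    outside_region: dict  # everything else

def plan_with_impact_penalty(actions, simulate, goal_score, impact_score, lam=1.0):
    # (a) Needs impact_score: a human-value-laden judgment of which changes to
    #     outside_region "matter". That's the hard-to-specify part.
    def score(a):
        s = simulate(a)   # models the whole world, returning a WorldState
        return goal_score(s.inside_region) - lam * impact_score(s.outside_region)
    return max(actions, key=score)

def plan_with_restricted_attention(actions, simulate_region_only, goal_score):
    # (b) The model handed to the planner only describes the designated region;
    #     there is no outside_region for it to reason about or optimize over.
    def score(a):
        region_state = simulate_region_only(a)  # dict for the region only
        return goal_score(region_state)
    return max(actions, key=score)
```

The point of the contrast is just that (a) requires specifying `impact_score`, which smuggles in a lot of human values, while (b) moves the work into guaranteeing that the system's cognition really is confined to the restricted model, which is exactly the "intervene on thoughts rather than outputs" difficulty noted above.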
‘Only spend your cognition thinking about individual physical structures smaller than 10 micrometers’, ‘only spend your cognition thinking about the physical state of this particular five-cubic-foot volume of space’, etc. could eliminate most of the risk of ‘high-impact’ actions without forcing us to define human conceptions of ‘impact’, and without forcing the AI to do a bunch of human-modeling.
I do not think that would do what you seem to think it would do. If something optimizes one little chunk of the world really hard, ignoring everything else, that doesn’t mean the rest of the world is unchanged; by default there are lots of side effects. E.g. if something is building nanotech in a 1m cube, ignoring everything outside the cube, at the very least I’d expect it to dump nuke levels of waste heat into its immediate surroundings.
I agree that this would be scary if the system is, for example, as smart as physically possible. What I’m imagining is:
(1) if you find a way to ensure that the system is only weakly superhuman (e.g., it performs vast amounts of low-level-Google-engineer-quality reasoning, only rare short controlled bursts of von-Neumann-quality reasoning, and nothing dramatically above the von-Neumann level), and
(2) if you get the system to only care about thinking about this cube of space, and
(3) if you also somehow get the system to want to build the particular machine you care about,
then you can plausibly save the world, and (importantly) you’re not likely to destroy the world if you fail, assuming you really are correctly confident in 1, 2, and 3.
I think you can also get more safety margin if the cube is in Antarctica (or on the Moon?), if you’ve tried to seal it off from the environment to some degree, and if you actively monitor for things like toxic waste products, etc.
Notably, the “only care about thinking about this cube of space” part is important for a lot of the other safety features to work, like:
It’s a lot harder to get guarantees about the system’s intelligence if it’s optimizing the larger world (since it might then improve itself, or build a smart successor in its environment—good luck closing off all possible loopholes for what kinds of physical systems an AGI might build that count as “smart successors”, while still leaving it able to build nanotech!).
Likewise, it’s a lot harder to get guarantees that the system stably is optimizing what you want it to optimize, or stably has any specific internal property, if it’s willing and able to modify itself.
Part of why we can hope to notice, anticipate, and guard against bad side-effects like “waste products” is that the waste products aren’t optimized to have any particular effect on the external environment, and aren’t optimized to evade our efforts to notice, anticipate, or respond to the danger. For that reason, “An AGI that only terminally cares about the state of a certain cube of space, but does spend time thinking about the larger world”, is vastly scarier than an AGI that just-doesn’t-think in those directions.
If the system does start going off the rails, we’re a lot more likely to be able to shut it down if it isn’t thinking about us or about itself.
This makes me think that the “only care about thinking about certain things” part may be relatively important in order for a lot of other safety requirements to be tractable. It feels more “(realistically) necessary” than “sufficient” to me; but I do personally have a hunch (which hopefully we wouldn’t have to actually rely on as a safety assumption!) that the ability to do things in this reference class would get us, like, 80+% of the way to saving the world? (Dunno whether Eliezer or anyone else at MIRI would agree.)