Before humanity gets to steps 1-2 (‘use CEV or something to make the long-term future awesome’), it needs to get past steps 3-6 (‘use limited task AGI to ensure that humanity doesn’t kill itself with AGI so we can proceed to take our time with far harder problems like “what even is CEV” and “how even in principle would one get an AI system to robustly do anything remotely like that, without some subtle or not-so-subtle disaster resulting”’).
I want to register my skepticism about this claim. Whereas it might naively seem that “put a strawberry on a plate” is easier to align than “extrapolated volition”, on a closer look there are reasons why it might be the other way around. Specifically, the notion of “utility function of a given agent” is a natural concept that we should expect to have a relatively succinct mathematical description. This intuition is supported by the AIT (algorithmic information theory) definition of intelligence. On the other hand, “put a strawberry on a plate without undesirable side effects” is not a natural concept, since a lot of complexity is packed into the “undesirable side effects”. Therefore, while I see some lines of attack on both task AGI and extrapolated volition, the latter might well turn out easier.
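For concreteness, one classical formalization in this spirit is Legg and Hutter’s universal intelligence of a policy $\pi$,

$$\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V^{\pi}_{\mu},$$

a complexity-weighted sum of $\pi$’s expected total reward $V^{\pi}_{\mu}$ over computable environments $\mu \in E$, where $K(\mu)$ is the Kolmogorov complexity of $\mu$. The definition PreDCA actually relies on differs in detail, but the flavor is the same: roughly, one asks which utility function an observed agent is most intelligent with respect to, with simpler utility functions preferred, and that is a short mathematical question in a way that “no undesirable side effects” is not.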
And if humans had a utility function and we knew what that utility function was, we would not need CEV. Unfortunately, extracting human preferences over out-of-distribution options and outcomes at dangerously high intelligence, using data gathered at safe levels of intelligence and a correspondingly narrower range of outcomes and options, when there exists no sensory ground truth about what humans want because human raters can be fooled or disassembled, seems pretty complicated. There is ultimately a rescuable truth about what we want, and CEV is my lengthy informal attempt at stating what that even is; but I would assess it as much, much, much more difficult than ‘corrigibility’ to train into a dangerously intelligent system using only training and data from safe levels of intelligence. (Which is the central lethally difficult challenge of AGI alignment.)
If we were paperclip maximizers and knew what paperclips were, then yes, it would be easier to just build an offshoot paperclip maximizer.
I agree that it’s a tricky problem, but I think it’s probably tractable. The way PreDCA tries to deal with these difficulties is:
The AI can tell that, even before the AI was turned on, the physical universe was running certain programs.
Some of those programs are “agentic” programs.
Agentic programs have approximately well-defined utility functions.
Disassembling the humans doesn’t change anything, since it doesn’t affect the programs that were already running[1] before the AI was turned on.
Since we’re looking at agent-programs rather than specific agent-actions, there is much more ground for inference about novel situations.
Obviously, the concepts I’m using here (e.g. which programs are “running” or which programs are “agentic”) are non-trivial to define, but infra-Bayesian physicalism does allow us to define them (not without some caveats, but hopefully at least to a first approximation).
[1] More precisely, I am looking at agents which could have prevented the AI from being turned on; this is what I call “precursors”.
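In code, the shape of this pipeline might look something like the following sketch (this is not the infra-Bayesian formalism itself; every name and scoring rule below is a hypothetical placeholder, included only to show where each step slots in):

```python
from dataclasses import dataclass
from typing import Callable, List

Utility = Callable[[object], float]   # a utility function over world-states

@dataclass
class ProgramHypothesis:
    source: str          # some encoding of a program the universe was running
    prior_weight: float  # e.g. 2 ** (-description_length)

def agency_score(p: ProgramHypothesis) -> float:
    """Stand-in for an AIT-style test of how 'agentic' a program is."""
    return p.prior_weight  # placeholder heuristic, not the real criterion

def extract_utility(p: ProgramHypothesis) -> Utility:
    """Stand-in for inferring the utility function that best rationalizes p."""
    return lambda world_state: 0.0  # placeholder

def precursor_utilities(hypotheses: List[ProgramHypothesis],
                        agency_threshold: float) -> List[Utility]:
    # Mirror of the steps above: take the programs that were already running,
    # keep the ones that look sufficiently agentic, and read off their
    # (approximately well-defined) utility functions. Nothing here depends on
    # what happens to the operators after the AI is switched on.
    agents = [p for p in hypotheses if agency_score(p) > agency_threshold]
    return [extract_utility(p) for p in agents]
```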
Yeah, I’m very interested in hearing counter-arguments to claims like this. I’ll say that although I think task AGI is easier, it’s not necessarily strictly easier, for the reason you mentioned.
Maybe a cruxier way of putting my claim is: Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn’t seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.
And I do think you need to get CEV up and running within a few months or a few years, if you want to both (1) avoid someone else destroying the world first, and (2) not use a “strawberry-aligned” AGI to prevent 1 from happening.
All of the options are to some extent a gamble, but corrigibility, task AGI, limited impact, etc. strike me as gambles that could actually realistically work out well for humanity even under extreme time pressure to deploy a system within a year or two of ‘we figure out how to build AGI’. I don’t think CEV is possible under that constraint. (And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.)
Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn’t seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.
The way I imagine the win scenario is, we’re going to make a lot of progress in understanding alignment before we know how to build AGI. And, we’re going to do it by prioritizing understanding alignment modulo capability (the two are not really possible to cleanly separate, but it might be possible to separate them enough for this purpose). For example, we can assume the existence of algorithms with certain properties, s.t. these properties arguably imply the algorithms can be used as building-blocks for AGI, and then ask: given such algorithms, how would we build aligned AGI? Or, we can come up with some toy setting where we already know how to build “AGI” in some sense, and ask, how to make it aligned in that setting? And then, once we know how to build AGI in the real world, it would hopefully not be too difficult to translate the alignment method.
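A minimal sketch of what “alignment modulo capability” can look like, under the assumption of a capability primitive with a stated contract (all names below are hypothetical illustrations, not a real proposal):

```python
from typing import Callable, Protocol, Sequence

class Planner(Protocol):
    """Assumed building block: returns (approximately) the highest-scoring plan.
    How it achieves this is deliberately left unspecified."""
    def __call__(self, plans: Sequence[str],
                 score: Callable[[str], float]) -> str: ...

def constrained_step(planner: Planner,
                     plans: Sequence[str],
                     task_score: Callable[[str], float],
                     is_permitted: Callable[[str], bool]) -> str:
    """The alignment-side design, written against the Planner contract only:
    hand the capability primitive nothing but options that pass our checks."""
    vetted = [p for p in plans if is_permitted(p)]
    return planner(vetted, task_score) if vetted else "do-nothing"

# Toy instantiation, standing in for a future real algorithm with this property:
brute_force_planner: Planner = lambda plans, score: max(plans, key=score)
```

The point of the sketch is only that the second function can be specified and argued about before anyone knows how to build a real Planner.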
One caveat in all this is, if AGI is going to use deep learning, we might not know how to apply the lesson from the “oracle”/toy setting, because we don’t understand what deep learning is actually doing, and because of that, we wouldn’t be sure where to “slot” it in the correspondence/analogy s.t. the alignment method remains sound. But, mainstream researchers have been making progress on understanding what deep learning is actually doing, and IMO it’s plausible we will have a good mathematical handle on it before AGI.
And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.
I’m not sure whether you mean “95% correct CEV has a lot of S-risk” or “95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying”? I think I agree with the latter but not with the former. (How specifically does 95% CEV produce S-risk? I can imagine something like “AI realizes we want a non-zero amount of pain/suffering to exist, somehow miscalibrates the amount and creates a lot of pain/suffering” or “AI realizes we don’t want to die, and focuses on this goal at the expense of everything else, preserving us forever in a state of complete sensory deprivation”. But these scenarios don’t seem very likely?)
I’m not sure whether you mean “95% correct CEV has a lot of S-risk” or “95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying”?
The latter, as I was imagining “95%”.
Insofar as humans care about their AI being corrigible, we should expect some degree of corrigibility even from a CEV-maximizer. That, in turn, suggests at least some basin-of-attraction for values (at least along some dimensions), in the same way that corrigibility yields a basin-of-attraction.
(Though obviously that’s not an argument we’d want to make load-bearing without both theoretical and empirical evidence about how big the basin-of-attraction is along which dimensions.)
Conversely, it doesn’t seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information—like e.g. what sort of changes-to-the-world we do/don’t care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.
Values seem like a necessary-and-sufficient component. Corrigibility/task architecture/etc doesn’t.
And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.
Small but important point here: an estimate of CEV which is within 5% error everywhere does reasonably well; that gets us within 5% of our best possible outcome. The problem is when our estimate is waaayyy off in 5% of scenarios, especially if it’s off in the overestimate direction; then we’re in trouble.
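A toy numerical illustration of the difference (the setup below is entirely hypothetical, but the qualitative effect is robust):

```python
import random
random.seed(0)

# Hypothetical outcomes with true values in [0, 1].
true_value = [random.uniform(0.0, 1.0) for _ in range(10_000)]

# (a) Estimate within 5% of the truth everywhere.
est_uniform = [v * random.uniform(0.95, 1.05) for v in true_value]

# (b) Estimate exactly right 95% of the time, hugely overestimated 5% of the time.
est_tail = [v + (random.uniform(5.0, 15.0) if random.random() < 0.05 else 0.0)
            for v in true_value]

def true_value_of_argmax(estimate):
    """True value of the outcome an optimizer would pick using this estimate."""
    best = max(range(len(estimate)), key=estimate.__getitem__)
    return true_value[best]

print("uniform 5% error:", round(true_value_of_argmax(est_uniform), 3))
print("wild error in 5% of cases:", round(true_value_of_argmax(est_tail), 3))
# Typically (a) lands within a few percent of the best achievable outcome,
# while (b) picks whichever outcome happened to get the largest overestimate,
# whose true value is usually mediocre.
```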
Conversely, it doesn’t seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information—like e.g. what sort of changes-to-the-world we do/don’t care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.
I suspect you could do this in a less value-loaded way if you’re somehow intervening on ‘what the AGI wants to pay attention to’, as opposed to just intervening on ‘what sorts of directions it wants to steer the world in’.
‘Only spend your cognition thinking about individual physical structures smaller than 10 micrometers’, ‘only spend your cognition thinking about the physical state of this particular five-cubic-foot volume of space’, etc. could eliminate most of the risk of ‘high-impact’ actions without forcing us to define human conceptions of ‘impact’, and without forcing the AI to do a bunch of human-modeling. But I don’t know what research path would produce the ability to do things like that.
(There’s still of course something that we’re trying to get the AGI to do, like make a nanofactory or make a scanning machine for WBE or make improved computing hardware. That part strikes me as intuitively more value-loaded than ‘only think about this particular volume of space’.
The difficulty with ‘only think about this particular volume of space’ is that it requires the ability to intervene on thoughts rather than outputs.)
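A toy way to see the distinction between restricting what the system wants and restricting what it attends to (everything here is a hypothetical stand-in): the world model handed to the planner is masked down to the designated region, so plans are scored only on variables the planner can represent at all.

```python
from typing import Callable, Dict, List

WorldState = Dict[str, float]   # named physical variables (hypothetical)

# The designated region, e.g. the five-cubic-foot volume discussed above.
REGION = {"cube.temperature", "cube.structure_quality"}

def masked(state: WorldState) -> WorldState:
    """The only state the planner ever sees: in-region variables."""
    return {k: v for k, v in state.items() if k in REGION}

def task_score(visible: WorldState) -> float:
    """Objective defined purely over in-region variables."""
    return visible.get("cube.structure_quality", 0.0)

def choose_plan(plans: List[str],
                predict: Callable[[str], WorldState]) -> str:
    # Plans are ranked by their predicted *in-region* consequences only; the
    # planner does not represent, and so cannot deliberately optimize,
    # anything outside REGION.
    return max(plans, key=lambda plan: task_score(masked(predict(plan))))
```

Note that this restricts the planner’s cognition rather than its goals; unrepresented side effects can of course still happen physically, which is the worry raised below.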
‘Only spend your cognition thinking about individual physical structures smaller than 10 micrometers’, ‘only spend your cognition thinking about the physical state of this particular five-cubic-foot volume of space’, etc. could eliminate most of the risk of ‘high-impact’ actions without forcing us to define human conceptions of ‘impact’, and without forcing the AI to do a bunch of human-modeling.
I do not think that would do what you seem to think it would do. If something optimizes one little chunk of the world really hard, ignoring everything else, that doesn’t mean the rest of the world is unchanged; by default there are lots of side effects. E.g. if something is building nanotech in a 1m cube, ignoring everything outside the cube, at the very least I’d expect it to dump nuke levels of waste heat into its immediate surroundings.
I agree that this would be scary if the system is, for example, as smart as physically possible. What I’m imagining is:
(1) if you find a way to ensure that the system is only weakly superhuman (e.g., it performs vast amounts of low-level-Google-engineer-quality reasoning, only rare short controlled bursts of von-Neumann-quality reasoning, and nothing dramatically above the von-Neumann level), and
(2) if you get the system to only care about thinking about this cube of space, and
(3) if you also somehow get the system to want to build the particular machine you care about,
then you can plausibly save the world, and (importantly) you’re not likely to destroy the world if you fail, assuming you really are correctly confident in 1, 2, and 3.
I think you can also get more safety margin if the cube is in Antarctica (or on the Moon?), if you’ve tried to seal it off from the environment to some degree, and if you actively monitor for things like toxic waste products, etc.
Notably, the “only care about thinking about this cube of space” part is important for a lot of the other safety features to work, like:
It’s a lot harder to get guarantees about the system’s intelligence if it’s optimizing the larger world (since it might then improve itself, or build a smart successor in its environment—good luck closing off all possible loopholes for what kinds of physical systems an AGI might build that count as “smart successors”, while still leaving it able to build nanotech!).
Likewise, it’s a lot harder to get guarantees that the system is stably optimizing what you want it to optimize, or stably has any specific internal property, if it’s willing and able to modify itself.
Part of why we can hope to notice, anticipate, and guard against bad side-effects like “waste products” is that the waste products aren’t optimized to have any particular effect on the external environment, and aren’t optimized to evade our efforts to notice, anticipate, or respond to the danger. For that reason, “An AGI that only terminally cares about the state of a certain cube of space, but does spend time thinking about the larger world”, is vastly scarier than an AGI that just-doesn’t-think in those directions.
If the system does start going off the rails, we’re a lot more likely to be able to shut it down if it isn’t thinking about us or about itself.
This makes me think that the “only care about thinking about certain things” part may be relatively important in order for a lot of other safety requirements to be tractable. It feels more “(realistically) necessary” than “sufficient” to me; but I do personally have a hunch (which hopefully we wouldn’t have to actually rely on as a safety assumption!) that the ability to do things in this reference class would get us, like, 80+% of the way to saving the world? (Dunno whether Eliezer or anyone else at MIRI would agree.)