Yeah, I’m very interested in hearing counter-arguments to claims like this. I’ll say that although I think task AGI is easier, it’s not necessarily strictly easier, for the reason you mentioned.
Maybe a cruxier way of putting my claim is: Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn’t seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.
And I do think you need to get CEV up and running within a few months or a few years, if you want to both (1) avoid someone else destroying the world first, and (2) not use a “strawberry-aligned” AGI to prevent 1 from happening.
All of the options are to some extent a gamble, but corrigibility, task AGI, limited impact, etc. strike me as gambles that could actually realistically work out well for humanity even under extreme time pressure to deploy a system within a year or two of ‘we figure out how to build AGI’. I don’t think CEV is possible under that constraint. (And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.)
Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn’t seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.
The way I imagine the win scenario is, we’re going to make a lot of progress in understanding alignment before we know how to build AGI. And, we’re going to do it by prioritizing understanding alignment modulo capability (the two are not really possible to cleanly separate, but it might be possible to separate them enough for this purpose). For example, we can assume the existence of algorithms with certain properties, s.t. these properties arguably imply the algorithms can be used as building-blocks for AGI, and then ask: given such algorithms, how would we build aligned AGI? Or, we can come up with some toy setting where we already know how to build “AGI” in some sense, and ask, how to make it aligned in that setting? And then, once we know how to build AGI in the real world, it would hopefully not be too difficult to translate the alignment method.
One caveat in all this is, if AGI is going to use deep learning, we might not know how to apply the lesson from the “oracle”/toy setting, because we don’t understand what deep learning is actually doing, and because of that, we wouldn’t be sure where to “slot” it in the correspondence/analogy s.t. the alignment method remains sound. But, mainstream researchers have been making progress on understanding what deep learning is actually doing, and IMO it’s plausible we will have a good mathematical handle on it before AGI.
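Vanessa's "assume algorithms with certain properties" approach can be pictured as programming against an interface. The sketch below is my own framing, not an actual proposal from the thread; every name in it is an invented illustration. The idea: write the alignment layer against an assumed contract for the capability, so the same analysis applies in a toy setting and, hopefully, to whatever real system eventually satisfies the contract.

```python
# A minimal sketch (my own framing, not Vanessa's actual proposal) of
# "alignment modulo capability": the alignment layer only relies on an
# assumed contract, never on how the capability works internally.
from typing import Callable, List, Protocol


class Optimizer(Protocol):
    """Assumed building block: something that finds a high-scoring plan.

    We deliberately don't say how it works -- the alignment analysis is
    only allowed to rely on this contract.
    """

    def argmax(self, score: Callable[[str], float], plans: List[str]) -> str:
        ...


def aligned_run(opt: Optimizer, score: Callable[[str], float],
                plans: List[str], vetted: set) -> str:
    """Toy alignment layer: restrict the optimizer to vetted plans.

    Written against the Optimizer contract, so it transfers unchanged
    from the toy capability below to a future real one.
    """
    return opt.argmax(score, [p for p in plans if p in vetted])


class BruteForce:
    """Toy 'capability' satisfying the contract, for the toy setting."""

    def argmax(self, score, plans):
        return max(plans, key=score)


plans = ["build nanotech", "seize compute", "do nothing"]
vetted = {"build nanotech", "do nothing"}
score = {"build nanotech": 9.0, "seize compute": 10.0, "do nothing": 0.0}.get
chosen = aligned_run(BruteForce(), score, plans, vetted)
print(chosen)
```

The highest-scoring plan overall ("seize compute") never reaches the optimizer, because the alignment layer filtered the plan set before the capability saw it; only the contract, not the capability's internals, was needed to argue that.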
And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.
I’m not sure whether you mean “95% correct CEV has a lot of S-risk” or “95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying”? I think I agree with the latter but not with the former. (How specifically does 95% CEV produce S-risk? I can imagine something like “AI realizes we want a non-zero amount of pain/suffering to exist, somehow miscalibrates the amount and creates a lot of pain/suffering” or “AI realizes we don’t want to die, and focuses on this goal at the expense of everything else, preserving us forever in a state of complete sensory deprivation”. But these scenarios don’t seem very likely?)
I’m not sure whether you mean “95% correct CEV has a lot of S-risk” or “95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying”?

The latter, as I was imagining “95%”.
Insofar as humans care about their AI being corrigible, we should expect some degree of corrigibility even from a CEV-maximizer. That, in turn, suggests at least some basin-of-attraction for values (at least along some dimensions), in the same way that corrigibility yields a basin-of-attraction.
(Though obviously that’s not an argument we’d want to make load-bearing without both theoretical and empirical evidence about how big the basin-of-attraction is along which dimensions.)
Conversely, it doesn’t seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information—like e.g. what sort of changes-to-the-world we do/don’t care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.
Values seem like a necessary-and-sufficient component. Corrigibility/task architecture/etc. doesn’t.
And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.
Small but important point here: an estimate of CEV which is within 5% error everywhere does reasonably well; that gets us within 5% of our best possible outcome. The problem is when our estimate is waaayyy off in 5% of scenarios, especially if it’s off in the overestimate direction; then we’re in trouble.
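John's distinction can be made concrete with a toy numerical experiment (my own construction, with invented numbers): an argmax over many options is steered by the largest *overestimates* in the utility estimate, not by its typical error, so an everywhere-small error is benign while a heavy-tailed one is not.

```python
# Toy illustration (my own construction) of "within 5% everywhere is fine;
# wildly off in 5% of scenarios is not."
import random

random.seed(0)
N = 10_000
true_value = [random.uniform(0, 100) for _ in range(N)]

# Estimate A: within +/-5% of the truth everywhere.
est_uniform = [v * random.uniform(0.95, 1.05) for v in true_value]

# Estimate B: exactly right ~95% of the time, wildly overestimated in the
# rest (and the overestimate is unrelated to the true value).
est_heavy = [v if random.random() > 0.05 else random.uniform(500, 1000)
             for v in true_value]

def pick(estimate):
    """Index the optimizer chooses when maximizing the estimate."""
    return max(range(N), key=estimate.__getitem__)

best = max(true_value)
a, b = pick(est_uniform), pick(est_heavy)
print(f"best possible true value:            {best:.1f}")
print(f"true value under the +/-5% estimate: {true_value[a]:.1f}")
print(f"true value under the heavy-tailed:   {true_value[b]:.1f}")
```

With the uniform-error estimate, the chosen option is provably within about 10% of the best possible true value (0.95/1.05 of it). With the heavy-tailed estimate, the optimizer always lands on a corrupted option, whose true value is completely decoupled from its huge estimated value.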
Conversely, it doesn’t seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information—like e.g. what sort of changes-to-the-world we do/don’t care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.
I suspect you could do this in a less value-loaded way if you’re somehow intervening on ‘what the AGI wants to pay attention to’, as opposed to just intervening on ‘what sorts of directions it wants to steer the world in’.
‘Only spend your cognition thinking about individual physical structures smaller than 10 micrometers’, ‘only spend your cognition thinking about the physical state of this particular five-cubic-foot volume of space’, etc. could eliminate most of the risk of ‘high-impact’ actions without forcing us to define human conceptions of ‘impact’, and without forcing the AI to do a bunch of human-modeling. But I don’t know what research path would produce the ability to do things like that.
(There’s still of course something that we’re trying to get the AGI to do, like make a nanofactory or make a scanning machine for WBE or make improved computing hardware. That part strikes me as intuitively more value-loaded than ‘only think about this particular volume of space’.
The difficulty with ‘only think about this particular volume of space’ is that it requires the ability to intervene on thoughts rather than outputs.)
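The contrast Rob is drawing could be sketched like this (my construction; all names and weights are illustrative): a restriction on *what the system models* is a purely geometric predicate, while an *impact* measure has to say which changes matter, and that is exactly the values information he wants to avoid.

```python
# Toy sketch (illustrative names, my construction): attention restriction
# as a value-free geometric filter vs. a value-laden impact penalty.

# World state: cell coordinates -> contents.
world = {(0, 0): "dust", (2, 3): "tree", (40, 7): "person"}

# The "cube": a purely geometric predicate -- no values information needed.
REGION = {(x, y) for x in range(5) for y in range(5)}

def restricted_view(state):
    """Everything an attention-limited planner can even represent."""
    return {cell: v for cell, v in state.items() if cell in REGION}

# An impact penalty, by contrast, has to weight *which* changes matter,
# and those weights are the values information in question.
IMPACT_WEIGHTS = {"person": 1e9, "tree": 10.0, "dust": 0.0}

def impact_penalty(before, after):
    """Sum of weights over changed cells."""
    return sum(IMPACT_WEIGHTS[before[c]]
               for c in before if after.get(c) != before[c])

print(restricted_view(world))  # the person at (40, 7) is simply not modeled
```

The `REGION` predicate needs no opinion about people, trees, or dust; the `IMPACT_WEIGHTS` table cannot avoid one.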
‘Only spend your cognition thinking about individual physical structures smaller than 10 micrometers’, ‘only spend your cognition thinking about the physical state of this particular five-cubic-foot volume of space’, etc. could eliminate most of the risk of ‘high-impact’ actions without forcing us to define human conceptions of ‘impact’, and without forcing the AI to do a bunch of human-modeling.
I do not think that would do what you seem to think it would do. If something optimizes one little chunk of the world really hard, ignoring everything else, that doesn’t mean the rest of the world is unchanged; by default there are lots of side effects. E.g. if something is building nanotech in a 1m cube, ignoring everything outside the cube, at the very least I’d expect it to dump nuke levels of waste heat into its immediate surroundings.
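John's point, that scoring only the cube makes the system *indifferent to* rather than *protective of* everything outside it, fits in a few lines (my construction, with invented action names and numbers):

```python
# Toy illustration (invented numbers): an objective that only sees the
# inside of the cube happily selects the action with the worst side effects.

# action -> (progress inside the cube, waste heat dumped outside it)
actions = {
    "slow, careful assembly":  (1.0,    0.1),
    "brute-force search":      (5.0,  100.0),
    "maximum-power synthesis": (9.0, 5000.0),
}

def cube_only_score(name):
    inside, _outside = actions[name]
    return inside  # effects outside the cube never enter the objective

best = max(actions, key=cube_only_score)
print(best, "->", actions[best][1], "units of waste heat outside")
```

Nothing in the objective pushes toward the high-side-effect action; nothing pushes away from it either, and by default the most effective in-cube plan is the most violent one.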
I agree that this would be scary if the system is, for example, as smart as physically possible. What I’m imagining is:
(1) if you find a way to ensure that the system is only weakly superhuman (e.g., it performs vast amounts of low-level-Google-engineer-quality reasoning, only rare short controlled bursts of von-Neumann-quality reasoning, and nothing dramatically above the von-Neumann level), and
(2) if you get the system to only care about thinking about this cube of space, and
(3) if you also somehow get the system to want to build the particular machine you care about,
then you can plausibly save the world, and (importantly) you’re not likely to destroy the world if you fail, assuming you really are correctly confident in 1, 2, and 3.
I think you can also get more safety margin if the cube is in Antarctica (or on the Moon?), if you’ve tried to seal it off from the environment to some degree, and if you actively monitor for things like toxic waste products, etc.
Notably, the “only care about thinking about this cube of space” part is important for a lot of the other safety features to work, like:
It’s a lot harder to get guarantees about the system’s intelligence if it’s optimizing the larger world (since it might then improve itself, or build a smart successor in its environment—good luck closing off all possible loopholes for what kinds of physical systems an AGI might build that count as “smart successors”, while still leaving it able to build nanotech!).
Likewise, it’s a lot harder to get guarantees that the system stably is optimizing what you want it to optimize, or stably has any specific internal property, if it’s willing and able to modify itself.
Part of why we can hope to notice, anticipate, and guard against bad side-effects like “waste products” is that the waste products aren’t optimized to have any particular effect on the external environment, and aren’t optimized to evade our efforts to notice, anticipate, or respond to the danger. For that reason, “An AGI that only terminally cares about the state of a certain cube of space, but does spend time thinking about the larger world”, is vastly scarier than an AGI that just-doesn’t-think in those directions.
If the system does start going off the rails, we’re a lot more likely to be able to shut it down if it isn’t thinking about us or about itself.
This makes me think that the “only care about thinking about certain things” part may be relatively important in order for a lot of other safety requirements to be tractable. It feels more “(realistically) necessary” than “sufficient” to me; but I do personally have a hunch (which hopefully we wouldn’t have to actually rely on as a safety assumption!) that the ability to do things in this reference class would get us, like, 80+% of the way to saving the world? (Dunno whether Eliezer or anyone else at MIRI would agree.)