I agree that this would be scary if the system is, for example, as smart as physically possible. What I’m imagining is:
(1) if you find a way to ensure that the system is only weakly superhuman (e.g., it performs vast amounts of low-level-Google-engineer-quality reasoning, only rare short controlled bursts of von-Neumann-quality reasoning, and nothing dramatically above the von-Neumann level), and
(2) if you get the system to only care about thinking about this cube of space, and
(3) if you also somehow get the system to want to build the particular machine you care about,
then you can plausibly save the world, and (importantly) you’re not likely to destroy the world if you fail, assuming you really are correctly confident in 1, 2, and 3.
I think you can also get more safety margin if the cube is in Antarctica (or on the Moon?), if you’ve tried to seal it off from the environment to some degree, and if you actively monitor for things like toxic waste products, etc.
Notably, the “only care about thinking about this cube of space” part is important for a lot of the other safety features to work, like:
It’s a lot harder to get guarantees about the system’s intelligence if it’s optimizing the larger world (since it might then improve itself, or build a smart successor in its environment—good luck closing off all possible loopholes for what kinds of physical systems an AGI might build that count as “smart successors”, while still leaving it able to build nanotech!).
Likewise, it’s a lot harder to get guarantees that the system stably is optimizing what you want it to optimize, or stably has any specific internal property, if it’s willing and able to modify itself.
Part of why we can hope to notice, anticipate, and guard against bad side-effects like “waste products” is that the waste products aren’t optimized to have any particular effect on the external environment, and aren’t optimized to evade our efforts to notice, anticipate, or respond to the danger. For that reason, “An AGI that only terminally cares about the state of a certain cube of space, but does spend time thinking about the larger world”, is vastly scarier than an AGI that just-doesn’t-think in those directions.
If the system does start going off the rails, we’re a lot more likely to be able to shut it down if it isn’t thinking about us or about itself.
This makes me think that the “only care about thinking about certain things” part may be relatively important in order for a lot of other safety requirements to be tractable. It feels more “(realistically) necessary” than “sufficient” to me; but I do personally have a hunch (which hopefully we wouldn’t have to actually rely on as a safety assumption!) that the ability to do things in this reference class would get us, like, 80+% of the way to saving the world? (Dunno whether Eliezer or anyone else at MIRI would agree.)