(Note: This comment is hand-wavy, but I still have medium-high confidence in its ideas.)
When I think about more advanced AIs that will be developed several years in the future, it’s clearer to me that they look like optimizers and why that’s dangerous. Economic pressures will push us toward agent AIs rather than tool AIs, which is why we can’t hang out in relatively safe, passive language-model land forever. Similarly, I think more general AIs will outcompete narrow AIs, which is why CAIS and an ecosystem of narrow AIs isn’t sustainable even though it would be safer.
Agent AIs seem inherently to have optimizer-like qualities: they are “trying to do things” rather than just responding to prompts and inputs. For an AI to successfully make the leap (or perhaps gradual progression) from narrowness to generality, it will need to model the real world and probably humans. The most competitive AI I can think of would be an advanced general agent AI. It understands the world like we do, only better; it can do all sorts of things and pick up new skills quickly; it anticipates our needs, and when we talk to it, it intuits what we mean.
This advanced general agent AI is a powerful optimizer. (It’s an optimizer because it’s an agent, and powerful because it’s generally capable.) It’s dangerous because we don’t know what it’s optimizing for. Whether it was truly aligned or deceptively aligned, it would act the same way, being really useful and helpful and impressive and understanding, up until the point when it has gained enough control that, if it’s deceptively aligned, it can overpower humans by force. And even though this scenario sounds like a wacky sci-fi thing, it seems to be the more likely outcome: deceptive alignment is a natural strategy to emerge from an AI that clearly understands the world but has a proxy goal rather than the goal we actually want, and there are many such proxy goals for gradient descent to stumble upon versus only one (or a relatively much smaller number of) well-aligned goals.
So this is my attempt to articulate why dangerous powerful optimizers are likely in the limit. I think your post is great because, while this eventual picture seems fairly clear to me, I am much less clear on what optimization means in lower-level systems, at what point it starts becoming dangerous, how to think about hybrid optimizer/heuristic systems, etc. Your post is a good start for noting these kinds of ambiguities and starting to deconfuse them.