I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don’t look like the prototypical scheming story. (These are copied from a Facebook thread.)
Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren’t smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback). (A toy sketch of this selection dynamic appears after the examples below.)
The “Getting What We Measure” scenario from Paul Christiano’s old “What Failure Looks Like” post.
The “fusion power generator scenario”.
Perhaps someone trains a STEM-AGI, which can’t think about humans much at all. In the course of its work, that AGI reasons that an oxygen-rich atmosphere is very inconvenient for manufacturing, and aims to get rid of it. It doesn’t think about humans at all, but the human operators can’t understand most of the AI’s plans anyway, so the plan goes through. As an added bonus, nobody can figure out why the atmosphere is losing oxygen until it’s far too late, because the world is complicated and becomes more so with a bunch of AIs running around and no one AI has a big-picture understanding of anything either (much like today’s humans have no big-picture understanding of the whole human economy/society).
People try to do the whole “outsource alignment research to early AGI” thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they’re already on the more-powerful next gen, so it’s too late.
The classic overnight hard takeoff: a system becomes capable of self-improving at all but doesn’t seem very alarmingly good at it, somebody leaves it running overnight, exponentials kick in, and there is no morning.
(At least some) AGIs act much like a colonizing civilization. Plenty of humans ally with them, trade with them, try to get them to fight their outgroup, etc., and the AGIs locally respect their agreements with the humans and cooperate with their allies, but the end result is humanity gradually losing all control and eventually dying out.
Perhaps early AGI involves lots of moderately-intelligent subagents. The AI as a whole mostly seems pretty aligned most of the time, but at some point a particular subagent starts self-improving, goes supercritical, and takes over the rest of the system overnight. (Think cancer, but more agentic.)
Perhaps the path to superintelligence looks like scaling up o1-style runtime reasoning to the point where we’re using an LLM to simulate a whole society. But the effects of a whole society (or parts of a society) on the world are relatively decoupled from the things-individual-people-say-taken-at-face-value. For instance, lots of people talk a lot about reducing poverty, yet have basically-no effect on poverty. So developers attempt to rely on chain-of-thought transparency, and shoot themselves in the foot.
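A minimal toy sketch of the “selected-upon deceptive behaviors” point above (mine, not from the original thread; all names and numbers are made up for illustration): candidates are scored by a proxy that only sees how good they look, so selection reliably picks out things that look better than they are, with nothing anywhere in the process intentionally deceiving anyone.

```python
# Toy illustration (not from the post): selecting on a "looks good to the rater"
# proxy. Each candidate has a true value and an apparent value; the rater only
# sees the apparent value plus noise.
import random

random.seed(0)

def make_candidate():
    true_value = random.gauss(0, 1)        # actual effect on the world
    gloss = max(random.gauss(0, 1), 0)     # how much better it looks than it is
    return true_value, true_value + gloss  # (true value, apparent value)

def rater_score(candidate, noise=0.1):
    # The "human feedback" signal only reflects how good the candidate looks.
    _true, apparent = candidate
    return apparent + random.gauss(0, noise)

def select(pool_size=1000, keep=10):
    # More optimization pressure = bigger pool, same number of survivors.
    pool = [make_candidate() for _ in range(pool_size)]
    pool.sort(key=rater_score, reverse=True)
    return pool[:keep]

survivors = select()
avg_true = sum(t for t, _ in survivors) / len(survivors)
avg_gap = sum(a - t for t, a in survivors) / len(survivors)
print(f"avg true value of survivors:  {avg_true:.2f}")
print(f"avg (apparent - true) gap:    {avg_gap:.2f}")
# Cranking pool_size up makes the gap grow: selection keeps finding candidates
# that look better than they are, without anything "deciding" to deceive.
```

The point is just the Goodhart-style one: the harder the proxy is optimized, the more the survivors’ apparent value decouples from their true value, no intent required.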
Also (separate comment because I expect this one to be more divisive): I think the scheming story has been disproportionately memetically successful largely because it’s relatively easy to imagine hacky ways of preventing an AI from intentionally scheming. And that’s mostly a bad thing; it’s a form of streetlighting.
My initial reaction is that at least some of these points would be covered by the Guaranteed Safe AI agenda if that works out, right? Though the “AGIs act much like a colonizing civilization” situation does scare me because it’s the kind of thing which locally looks harmless but collectively is highly dangerous. It would require no misalignment on the part of any individual AI.