Poorly-Aimed Death Rays
Alternate framing: Optimality is the tiger, and agents are its teeth.
Tonally relevant: Godzilla Strategies.
It’s a problem when people think that a superintelligent AI will be just a volitionless tool that does as it’s told. But it’s also a problem when people focus too much on the story of “agency”: when they imagine that all of the problems come from the AI “wanting” things, “thinking” things, and consequentializing all over the place about it. If only we could make it more of a volitionless tool! Then all of our problems would be solved. Because the problem is the AI using its power in clever ways with the deliberate intent to hurt us, right?
This, I feel, fails entirely to appreciate the sheer power of optimization, and how even the slightest failure to aim it properly, the slightest leakage of its energy in the wrong direction, for the briefest of moments, will be sufficient to wash us all away.
The problem isn’t building a superintelligent system that doesn’t actively want to kill us. Accidentally killing us all is a natural property of superintelligence. The problem is building an AI that will deliberately spend a lot of effort on ensuring it doesn’t kill us.
I find planet-destroying Death Rays to be a good analogy. Think the Death Star. Think—
Imagine that you’re an engineer employed by an… eccentric fellow. The guy has a volcano lair, weird aesthetic tastes, and a tendency to put words like “world” and “domination” one after another. You know the type.
One of his latest schemes is to blow up Jupiter. To that end, he’s had a giant cavern excavated underneath his volcano lair and a long cylindrical tunnel dug from that cavern to the surface, and he’s ordered your team to build a beam weapon in that cavern and shoot it through the tunnel at Jupiter.
You’re getting paid literal tons of money, so you don’t complain (except about the payment logistics). You have a pretty good idea of how to do that project, too. There are these weird crystal things your team found lying around. If you poke one in a particular way, it releases a narrow energy beam which blows up anything it touches. The power of the beam scales superexponentially with the strength of the poke; you’re pretty sure shooting one with a rifle will do the Jupiter-vanishing trick.
There’s just one problem: aim. You can never quite predict which part of the crystal will emit the beam. It depends on where you poke it, but also on how hard you poke, with seemingly random results. And your employer is insistent that the Death Ray be fired from the cavern through the tunnel, not from space where it’s less likely to hit important things, or something practical like that.
If you say that can’t be done, your employer will just replace you with someone less… pessimistic.
So, here’s your problem. How do you build a machine that uses one or more of these crystals in such a way that they fire a Death Ray through the tunnel at Jupiter, without hitting Earth and killing everyone?[1]
You experiment with the crystals at non-Earth-destroying settings, trying to figure out how the beam is directed. You make a fair amount of progress! You’re able to predict the beam’s direction at the next power setting with 97% confidence!
When you fire it with Jupiter-destroying power, that slight margin of error causes the beam to be slightly misdirected. It grazes the tunnel, exploding Earth and killing everyone.
You fire the Death Ray at a lower, non-Earth-destroying setting that you know how to aim.
It hits Jupiter but fails to destroy it. Your employer is disappointed, and tells you to try again.
You line the cavern’s walls and the tunnel with really good protective shielding.
The Death Ray grazes the tunnel, blows past the shielding, and kills everyone.
You set up a mechanism for quickly turning off the Death Ray. If you see it firing in the wrong direction, you’ll cut power.
The Death Ray kills you before the information about its misfire reaches your brain.
You set up a really fast targeting system which will rapidly rotate the crystal the moment it detects that the Death Ray is misaimed.
In the fraction of a second that it spends firing in the wrong direction, it outputs enough energy to explode Earth and kill everyone.
You make the beam really narrow, so it’s less likely to hit tunnel walls.
It grazes a tunnel wall anyway, killing everyone.
You set up the system in a clever way that fires several Death Rays in the vague direction of the tunnel, aimed to intersect underneath the entrance to it. The idea is that their errors will cancel out, and the composite beam will fly true!
The errors do not cancel out perfectly; the composite beam grazes the tunnel and kills everyone again.
Also, one of the Death Rays fires into the floor, so it wouldn’t have worked anyway.
You perform an exorcism on the crystal, banishing the daemons infesting it.
Nothing changes. The beam grazes a tunnel wall, killing everyone.
You modify the crystal so the beam harmlessly dissipates into aether shortly after firing.
It can’t reach Jupiter. You’ve disappointed your employer for the last time.
He fires you
into the Sun.
Your replacement figures that lining the walls with even better protective shielding ought to do the trick, fires the beam, destroys Earth and kills everyone.
This analogy can be nitpicked endlessly, of course. By no means does anything here prove that it’s a valid one. You can argue that just a wee bit of misalignment won’t destroy the world, or that the AI doesn’t need to be dangerous in this way for us to do interesting things with it, or that intelligence isn’t really quite that powerful, et cetera.
This post isn’t aimed at convincing anyone of that; there are plenty of posts that already do. But if you broadly agree with the premise, yet have some difficulty sorting out the exact problems with any given containment scenario, this analogy might help.
Any sufficiently powerful AI system holds a terrifying core of optimization: the ability to implacably rewrite some part of the world according to some specification. It doesn’t matter how that power is represented, what wrapper it’s in, where specifically it’s aimed, or whether it’s controlled by an alien sapient entity. As long as it’s not aimed exactly where we want it to be, with no leakage, from the very beginning, it will kill us all.
That is an intrinsic property of optimization.
[1] Also, Earth has no atmosphere in that scenario. Probably your employer’s fault too. But at least that means a well-aimed beam wouldn’t hit the air and explode everything anyway.
Russian translation by me
I really like this analogy!
Also worth noting that some idiot may just play around with death ray technology without aiming it...
An AGI that lacks volition is incomparably safer. In particular, it is very unlikely to render humanity extinct. In addition, absent volition, it is possible to prevent an AGI from doing too much harm by moving more slowly and carefully, adding breakpoints, modeling outputs in advance, etc.
It will still be dangerous in the sense that many powerful technologies are dangerous, but not uniquely so.
I think the point of this post is “a powerful enough optimization process kills you (and everyone else) anyway”.
As soon as you give it a command, the AI has “volition” in the sense that it is optimizing for some output that affects the world.
Here is a breakdown of deaths by causes, worldwide, for every year from 1990 to 2019. The overwhelming majority do not involve volition. The first category that does, suicide, accounted for less than 2% of all deaths in each of those years. Homicide was always under 1%. Conflict and Terrorism together are below 1% in every year but one. Alcohol disorders and Drug disorders might be regarded as having volitional causes, but their contribution is similarly insignificant.
So at least 97% of all deaths do not happen because of anyone’s volition. I am not seeing in this an argument for the safety of excluding volition, whatever that is, from a system.
I say “whatever that is” because, while it should be clear what I mean in using the word above about people, it is not clear what it means when applied to an artificial system. We do not have a gears-level model we could use to attribute it, or its absence, to any given system.