AI Risk in Terms of Unstable Nuclear Software
(This is an entry for the AI Safety Public Materials contest, testing out a novel-ish frame on AI risk.)
In recent years, there have been growing concerns among AI specialists regarding the dangers of advanced artificial intelligence. The capabilities of AI models are growing rapidly, they argue, while our ability to control them lags far behind, not to mention governmental regulations, which haven’t even begun to catch up. We’re on a direct course for catastrophe, one that might cost us billions of dollars in economic damage, millions of lives, or even the very survival of our species.
The most striking thing about it, however, is that there’s nothing outlandish or science-fictional about the nature of these dangers. When you get down to it, the threats are neither esoteric nor truly novel. They’re merely scaled up beyond anything we’re familiar with.
It has nothing to do with AIs being smart or “sentient”. The core problem is simpler:
AI models are software products. And as software products go, their functionality is revolutionary, while their reliability is abysmal.
A quick primer on how AI models are developed. The process has precious little in common with standard, time-tested methods of software development.
We initialize some virtual architecture: a cluster of neurons designed by loose analogy with biological brains. That architecture has no knowledge recorded in it at first; it’s entirely randomized. Then, we set up the training loop. We expose the neural network to some stimuli: pictures of animals, natural-language text, or a simulated environment. The network computes some output in response.
In the first stages, said output is just gibberish. Our software evaluates it, comparing it to the expected, “correct” responses. If it’s unsatisfactory, the software slightly modifies the neural network, nudging the connections between its neurons so that it’s somewhat more likely to output the correct answer.
Then the neural network (NN) is exposed to another stimulus, and the process repeats.
It is entirely automatic — the architecture is updated by a blind selection process not unlike biological evolution. That process doesn’t “understand” what it’s doing — it just computes some minimal “mutations” that would make the NN’s performance marginally better.
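The loop described above can be sketched in a few lines. This is a minimal illustration, not how any production system is built: it assumes a "network" of just two numbers, and it uses gradient descent, the standard way such nudges are computed in practice. The names `train`, `examples`, and `lr` are made up for this sketch.

```python
import random

def train(examples, steps=5000, lr=0.01, seed=0):
    """Fit output = w * x + b to (x, target) pairs by repeated small nudges."""
    rng = random.Random(seed)
    # Start from random parameters: the model "knows" nothing yet.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        x, target = rng.choice(examples)   # expose the model to one stimulus
        output = w * x + b                 # the model computes a response
        error = output - target            # compare with the "correct" answer
        # Nudge each parameter slightly in the direction that shrinks the
        # squared error (its gradient is error * x for w, and error for b).
        w -= lr * error * x
        b -= lr * error
    return w, b

# Hypothetical training data generated by the rule "target = 2x + 1".
examples = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = train(examples)
print(w, b)  # both end up very close to the rule's 2 and 1
```

Nothing in the loop "knows" the rule being learned. It only ever sees one example at a time and makes the error on that example slightly smaller.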
After tens of thousands of iterations, we end up with a trained AI model consisting of millions of neurons and billions of connections between them. That AI implements some software algorithm of incomprehensible complexity, and it works very well for the purposes it was trained for. Somehow.
That “incomprehensible complexity” is literal. The internals of a trained AI model do not resemble programming code in the least. They are simply billions upon billions of floating-point numbers, arranged in matrices. In the course of computation, these matrices are multiplied, some algorithmic magic happens inside that black box, and it spits out an essay, a painting, or a plan of action.
The problem of interpreting the internals of neural networks is an entire sub-field within AI engineering. The difficulties it faces are not unlike the problem of reading a human’s thoughts from neural imaging. Needless to say, these efforts also lag far behind the bleeding edge of AI research.
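To make "numbers arranged in matrices" concrete, here is a toy sketch of what running such a model amounts to. The weights below are invented for illustration, and the ReLU step between layers is an assumption added here (it is a common choice, but the text above only mentions the multiplications). A real model does the same thing with billions of numbers instead of nine.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Replace negative entries with zero (a common nonlinearity)."""
    return [max(0.0, v) for v in vector]

def forward(layers, inputs):
    """Run the inputs through each weight matrix in turn."""
    activations = inputs
    for i, weights in enumerate(layers):
        activations = matvec(weights, activations)
        if i < len(layers) - 1:  # no nonlinearity after the last layer
            activations = relu(activations)
    return activations

# The entire "program": hypothetical weights, with no readable logic in them.
layers = [
    [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.25]],  # 3x2 matrix
    [[1.0, -2.0, 0.5]],                       # 1x3 matrix
]
output = forward(layers, [1.0, 2.0])
print(output)  # [-11.0]
```

There is no line in `layers` that can be read, reviewed, or patched the way a line of source code can; whatever the model "does" is spread across all the numbers at once.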
The security implications of that should be clear. Traditionally-written software is infamous for how bug-ridden and error-prone it is. Any decent program has thousands if not millions of lines of code, and every line could conceal a subtle mistake that could lead to the program crashing, or miscalculating some value by orders of magnitude, or engaging in undefined behavior. Thus, any program supporting crucial infrastructure needs to undergo intensive code reviews, performance tests, adversarial red-teaming, and so on. The costs of that are often measured in hundreds of thousands of dollars. And still software often malfunctions, sometimes in embarrassingly preventable ways.
What can we expect, then, of software products light-years more advanced than any traditionally-written program, whose code we cannot review, test for bugs, check for validity, or patch?
These concerns are not theoretical. It’s well known that AI models suffer from all the pitfalls of common software — they’re just not called by the same names.
AI models can be “hacked” by tailored inputs that exploit some underlying flaws in the algorithms they implement — so-called adversarial examples.
AIs’ behavior is unpredictable when they’re exposed to sufficiently unfamiliar stimuli, that is, when the input they’re fed is “off-distribution”. In traditional software, we would call these bugs.
Developing software that does even roughly what you want is often difficult. The client needs to exhaustively list every technical detail, and be in constant communication with the developers, steering the process. Otherwise, subtle misunderstandings might creep in, and the final product would turn out poorly. In AI, we call this “goal misgeneralization”.
Let’s focus on that last one; it’s important.
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. (Goodhart’s Law)
A hypothetical hospital starts evaluating how well doctors are performing by the recovery rate of their patients. So doctors stop taking on patients with serious illnesses, and public health suffers.
A hypothetical nail factory’s productivity starts being evaluated by the number of nails it outputs in a month. So it ends up producing small, low-quality nails that are useless for any real purpose.
A hypothetical school evaluates how well students are learning by giving them tests. So students focus on getting a good grade instead of focusing on education. They start cramming or outright cheating, and end up actually learning very little.
In all of these cases, we want to encourage good performance on some task — healthcare, manufacturing, education. To do so, we focus on some proxy objective that correlates well with good performance. But the moment we do so, people end up ignoring their supposed jobs, and focus on gaming that metric.
That’s not quite how it happens in real life with real people. Students actually care about learning, doctors actually care about helping, so Goodhart’s Law doesn’t apply in full.
With AI, it does. They don’t start out caring about anything, after all.
When we train an AI, we’re measuring its performance in some limited number of cases — its “on-distribution” behavior. But, much like a test can’t actually measure “good education”, an AI’s on-distribution behavior is always only a proxy for what we really want.
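A small numerical sketch of why on-distribution performance is only a proxy. It assumes a deliberately simple setup that is not from any real system: the "true" rule is x squared, the model is a straight line, and all training inputs lie between 0 and 1. The function names are invented for this illustration.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def true_rule(x):
    """What we actually want the model to compute (hypothetical)."""
    return x * x

# Training only ever shows the model inputs between 0 and 1.
train_xs = [0.0, 0.25, 0.5, 0.75, 1.0]
slope, intercept = fit_line([(x, true_rule(x)) for x in train_xs])

def model(x):
    return slope * x + intercept

# Worst error on the inputs we tested, versus error on an unfamiliar input.
on_dist_error = max(abs(model(x) - true_rule(x)) for x in train_xs)
off_dist_error = abs(model(10.0) - true_rule(10.0))
print(on_dist_error, off_dist_error)
```

On the training range the line is never off by more than about 0.125, so every test we ran looks fine. At x = 10 it is off by roughly 90, and nothing in the training results hinted at that.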
As a toy example, consider an AI trained to operate a vacuum cleaner. Our intended goal for it is to remove dust from some environment, so we’re grading the AI on that. Over the course of training, it seems to learn to do that: it locates dust when it appears, and vacuums it up in its bag. We decide that’s good enough, and deploy it.
In the real world, it starts bowling over potted plants, then vacuuming up the messes it itself created. What happened?
The policy it learned was “maximize the amount of dirt in the cleaner bag”, not “minimize the amount of dust in the environment”. On-distribution, there were no potted plants, so the best way to do that was gathering the dust that was already present. In the real world, though, knocking over a potted plant creates fresh dirt to collect. Its goal misgeneralized.
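The vacuum story can be played out in a toy simulation. Everything here is a hypothetical sketch: the two policies are written by hand rather than learned, and `Room`, `SOIL_PER_PLANT`, `careful`, and `bag_maximizer` are names invented for the illustration. The point is only that the two policies are indistinguishable in a training room without plants.

```python
from dataclasses import dataclass

SOIL_PER_PLANT = 5  # hypothetical: dirt released when one plant is knocked over

@dataclass
class Room:
    dust: int    # units of dust on the floor at the start
    plants: int  # number of intact potted plants

def careful(floor_dirt, plants):
    """Intended behavior: clean up whatever is there, then stop."""
    return "vacuum" if floor_dirt > 0 else "idle"

def bag_maximizer(floor_dirt, plants):
    """Misgeneralized behavior: get as much dirt into the bag as possible."""
    if floor_dirt > 0:
        return "vacuum"
    return "knock_over" if plants > 0 else "idle"

def run(policy, room, steps=20):
    """Simulate the vacuum; return (dirt in bag, dirt on floor, plants left)."""
    floor_dirt, plants, bag = room.dust, room.plants, 0
    for _ in range(steps):
        action = policy(floor_dirt, plants)
        if action == "vacuum":
            floor_dirt -= 1
            bag += 1
        elif action == "knock_over":
            plants -= 1
            floor_dirt += SOIL_PER_PLANT
    return bag, floor_dirt, plants

training_room = Room(dust=3, plants=0)
real_room = Room(dust=3, plants=2)

print(run(careful, training_room), run(bag_maximizer, training_room))
print(run(careful, real_room), run(bag_maximizer, real_room))
```

In the training room both policies return `(3, 0, 0)`, so no amount of grading there can tell them apart. In the room with plants, the careful policy returns `(3, 0, 2)` while the bag maximizer returns `(13, 0, 0)`: a fuller bag, a clean floor, and no plants left standing.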
This specific problem is easy to patch: just re-train the AI in an environment with potted plants. But would you be sure that you didn’t miss something else? That you left literally no room for error in your training setup?
Whatever algorithm an AI ends up developing, it always only correlates with good performance on whatever test cases we came up with. It doesn’t actually aim for what we want.
And that’s the only way we know how to develop advanced AI systems, to date.
So, returning to the software analogy: our ability to specify the product we want is atrocious. Thus, we’ll never get what we actually want.
To recap: The current AI development paradigm produces software products that are no less bug-ridden and unreliable than any other software, while being impossible to debug or review, and it’s nigh-guaranteed that they do not meet our desiderata. They’re not just buggy: they’re fundamentally designed for the wrong purpose.
And these products can write essays, produce art, pass Turing tests, design other software products, aim missiles, and trade stocks. The modern machine-learning paradigm emerged in 2012, and in a single decade it achieved all of this. What will happen in the next decade? In the next three?
AI systems will soon achieve superhuman levels of performance. They will be able to do anything a human can, but better. That includes financial speculation, weapons engineering, hacking, propaganda, and manipulation. All while being as unstable as any other piece of software, or more so.
I need not point out what will happen should they be properly commercialized: when governments and corporations hook these unstable technologies up to critical infrastructure without proper precautions. And that’s a “when”, not an “if” — with how powerful AI will become, an arms race to the bottom is inevitable.
But the consequences of that are obvious.
Instead, let’s talk about an even grimmer possibility: these technologies escaping our control entirely.
This topic usually evokes images of robot uprisings, of Terminator. If I’ve successfully conveyed my point, you might instead imagine modern cybersecurity failures writ large: flash crashes, Heartbleed.
Picture, instead, Chernobyl.
When a nuclear chain reaction goes out of control, it doesn’t mean to hurt people. It’s not purposeful; its damage is not maliciously directed. It is simply a blind natural reaction, a mechanism. An algorithm written in elementary particles, executing itself on the substrate of reality.
Neither does it need people’s assistance to hurt them. When a nuclear reactor melts down, radiation doesn’t travel along the electricity wires. You can’t escape it by staying away from the appliances powered by it, or by turning them off. It spreads through space itself — not through channels people built for it, but by its own power.
Similarly, an AGI that goes out of control won’t mean to hurt us. It just won’t mean not to. It will be a blind mechanism; a malfunctioning piece of software, gormlessly executing towards its misgeneralized goal, with no mind paid to what’s in its way.
Similarly, an AGI won’t need us to make ourselves vulnerable to it. We don’t have to deploy it outside an isolated laboratory environment, don’t have to make our economy dependent on it. We need only create it. It will do everything else on its own: blow past the protections we’ve built to contain it, escape into the environment, hurt and kill people by the millions.
It will look very different from a nuclear disaster, for sure. The AI will use the medium of information and industry, not space. Its attacks will take the form of well-strung words and advanced designs, not poisonous radiation and tumors. But the underlying pattern is the same: unintended, meaningless, inescapable destruction.
There’s another difference: the nuclear industry’s safety standards are exacting compared to the AI industry’s.
Nuclear reactors are based on solid, well-established and well-understood theoretical principles. The engineering behind them is informed by that theory, and every component is stress-tested to the extreme.
With AI, as I’ve outlined, we can’t even begin to do that. We don’t understand why contemporary AIs work, can’t investigate why, can’t improve their reliability, can’t hope to ensure that nothing goes wrong.
To extend the analogy, current AI industry practice is to blindly stockpile raw uranium in the hopes that it spontaneously assembles into a stable nuclear reactor instead of exploding.
Is it any wonder the more safety-minded of us think that disaster is certain?
AI Risk in a nutshell: Developing software products with the disaster potential of nuclear plants and safety guarantees below the already lousy standards of mundane software, in a careless “throw stuff at the wall until something sticks” manner, is potentially a completely terrible idea.