Nature: Eliezer Yudkowsky and Stuart Russell solve AI alignment with breakthrough insight. This October, Eliezer and Stuart sat down to consider one of the most pressing technical challenges confronting humanity: How to ensure that superhuman AI is aligned with human interests. That’s when they had their big insight: The alignment problem is a math problem.
In the past, Eliezer and Stuart had been thinking about the alignment problem in terms of probability theory. But probability theory isn’t powerful enough to fully capture the nuances of human values. Probability theory is too coarse-grained to distinguish between a universe where humans are eaten by paperclips and a universe where humans are eaten by paperclips and everyone has a good time.
So they turned to a more powerful tool: decision theory, which underlies game theory and has been used to analyze everything from voting systems to how to play poker. Decision theory is more nuanced than probability theory, but it’s also more complicated. It’s not just harder for humans to grok; it’s harder for computers too. So the first step was just getting decision theory into AI algorithms.
The next step was figuring out how to use decision theory to solve the alignment problem. They started by defining a reward function that would tell an AI what we want it to do. Then they set up a decision tree showing all the possible ways an AI could behave, with each branch corresponding to a different possible reward function. The goal was then to find the path that maximizes our reward under any possible future circumstance—a path that would ensure that an AI does what we want no matter what happens in the future, whether it’s created by us or someone else, whether it has two neurons or two hundred billion neurons, whether it loves us or hates us or feels nothing at all about us one way or another…or even if there are no humans left on Earth at all!
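The story hand-waves what a "decision tree over possible reward functions" would even look like, so here is a minimal, purely illustrative sketch. It is not anything Yudkowsky or Russell actually built; the candidate reward functions, actions, and numbers are all invented, and the decision rule shown (pick the action whose worst-case score across candidates is highest, a maximin rule) is just one simple way to make a choice that holds up "under any possible future circumstance."

```python
# A toy, purely illustrative sketch of choosing an action that holds up under
# every candidate reward function (a maximin rule). Nothing here is the
# authors' actual method; reward functions, actions, and numbers are invented.

# Each candidate reward function maps an action to a score.
candidate_rewards = {
    "maximize_paperclips":      {"build_factory": 10.0,   "ask_humans": 1.0, "do_nothing": 0.0},
    "preserve_humans":          {"build_factory": -100.0, "ask_humans": 5.0, "do_nothing": 2.0},
    "maximize_human_potential": {"build_factory": -10.0,  "ask_humans": 8.0, "do_nothing": 1.0},
}

actions = ["build_factory", "ask_humans", "do_nothing"]

def worst_case_value(action: str) -> float:
    """Score of `action` under the least favorable candidate reward function."""
    return min(rewards[action] for rewards in candidate_rewards.values())

# Pick the action whose worst-case score is highest.
robust_action = max(actions, key=worst_case_value)
print(robust_action, worst_case_value(robust_action))  # -> ask_humans 1.0
```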
But wait—how can you have an algorithm without knowing what reward function you’re trying to maximize? That’s like trying to find your way home without knowing which way you’re facing! And yet this is exactly what Stuart and Eliezer did: They took this giant pile of unknowns—all these potential reward functions—and fed them into their decision-theoretic machine learning system as input variables…and then they let their system figure out which reward function was most likely! And when they were done, they found that their system had settled on one particular definition of human values: It was something like “human values are whatever maximizes humanity’s future potential.” It wasn’t perfect, but it was good enough for government work, and better than any previous attempt at defining human values.
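Again purely as an illustration of the "figure out which reward function was most likely" step, and not the system described in the story: a toy Bayesian sketch that scores a couple of invented candidate reward functions against some invented observations of human choices, using a softmax (Boltzmann-rational) choice model. Every name and number here is hypothetical.

```python
# A toy, purely illustrative sketch of inferring which candidate reward
# function is "most likely" from observed choices, via Bayes' rule with a
# softmax choice model. All candidates, observations, and numbers are invented.
import math

candidates = {
    "maximize_paperclips":      {"help_human": 0.0, "make_paperclip": 5.0},
    "maximize_human_potential": {"help_human": 5.0, "make_paperclip": 0.5},
}
prior = {name: 1.0 / len(candidates) for name in candidates}

# Invented observations: choices we imagine watching humans make.
observations = ["help_human", "help_human", "make_paperclip", "help_human"]

def choice_probability(action: str, rewards: dict, beta: float = 1.0) -> float:
    """P(action | rewards) under a softmax over the available actions' scores."""
    normalizer = sum(math.exp(beta * r) for r in rewards.values())
    return math.exp(beta * rewards[action]) / normalizer

# Posterior over candidates: prior times likelihood of the observed choices.
posterior = {}
for name, rewards in candidates.items():
    p = prior[name]
    for action in observations:
        p *= choice_probability(action, rewards)
    posterior[name] = p

total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}

most_likely = max(posterior, key=posterior.get)
print(most_likely, round(posterior[most_likely], 4))  # -> maximize_human_potential 1.0
```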
And this is where they stopped. This is where they stopped and thought, “Wow, we’ve done it! We’ve solved the alignment problem!” And then they went home and slept soundly, happy in the knowledge that humanity’s future was secure.
But…that’s not how it happened at all. That’s not how it happened at all. Because when Eliezer and Stuart had their big breakthrough, I was sitting right there with them, listening to every word. And I know what really happened.
What really happened was that Stuart and Eliezer worked on AI alignment for another decade or so before giving up in frustration. They worked on AI alignment until their hair turned gray and their teeth fell out, until their eyesight failed and their joints became arthritic from sitting at a computer for too many hours a day, until they were so old that nobody would publish their papers anymore because nobody takes old people seriously. And then they died of natural causes before ever solving the alignment problem, and the world was left with no way to align AI with human values whatsoever.
GPT-3 Solves Alignment