Two reasons we might be closer to solving alignment than it seems
I was at an AI safety retreat recently and there seemed to be two categories of researchers:
Those who thought most AI safety research was useless
Those who thought all AI safety research was useless
This is a darkly humorous anecdote illustrating a larger pattern of intense pessimism I’ve noticed among a certain contingent of AI safety researchers.
I don’t disagree with the more moderate version of this position. If things continue as they are, anywhere up to a 95% chance of doom seems defensible.
What I disagree with is the degree of confidence. While we certainly shouldn’t be confident that everything will turn out fine, we also shouldn’t feel confident that it won’t. This post could easily have been titled the same as Rob Bensinger’s similar post: we shouldn’t be maximally pessimistic about AI alignment.
The two main reasons for not being overly confident of doom are:
All of the arguments saying that it’s hard to be confident that transformative AI (TAI) isn’t just around the corner also apply to safety research progress.
It’s still early days and we’ve had about as much progress as you’d predict given that up until recently we’ve only had double-digit numbers of people working on the problem.
The arguments that apply to TAI potentially being closer than we think also apply to alignment
It’s really hard to predict research progress. In ‘There’s no fire alarm for artificial general intelligence’, Eliezer Yudkowsky points out that historically, ‘it is very often the case that key technological developments still seem decades away, five years before they show up’ - even to scientists who are working directly on the problem.
Wilbur Wright thought that heavier-than-air flight was fifty years away; two years later, he helped build the first heavier-than-air flyer. This is because it often feels the same when the technology is decades away and when the technology is a year away: in either case, you don’t yet know how to solve the problem.
These arguments apply not only to TAI, but also to TAI alignment. Heavier-than-air flight felt like it was decades away when it was actually around the corner. Similarly, researchers’ sense that alignment is decades away, or even that it is impossible, is consistent with the possibility that we’ll solve alignment next year.
AI safety researchers are more likely than the general public to be pessimistic about alignment because they are deep in the weeds of the problem. They are viscerally aware, from firsthand experience, of the difficulty. They are the ones who have to feel the day-to-day confusion, frustration, and despair of bashing their heads against a problem and making inconsistent progress. But this is how it always feels to be on the cutting edge of research. If it felt easy and smooth, it wouldn’t be the edge of our knowledge.
AI progress thus far has been highly discontinuous: periods of fast advancement have been interspersed with ‘AI winters’ in which enthusiasm waned, and there have been several important advances in just the last few months. The same could be true of AI safety: even if we’re in a slump now, massive progress could be around the corner.
It’s not surprising to see this little progress when we have so few people working on it
I understand why some people are in despair about the problem. Some have been working on alignment for decades and have still not figured it out. I can empathize. I’ve dedicated my life to trying to do good for the last twelve years and I’m still deeply uncertain whether I’ve even been net positive. It’s hard to stay optimistic and motivated in that scenario.
But let’s take a step back: this is an extremely complex question, and we haven’t attacked the problem with all our strength yet. Some of the earliest pioneers of the field are no doubt some of the most brilliant humans out there. Yet, they are still only a small number of people. There are currently only about one hundred and fifty people working full-time on technical AI safety, and even that is recent—ten years ago, it was more like five. We probably need more like tens of thousands of people researching this for several decades.
I’m reminded of the great bit in Harry Potter and the Methods of Rationality where Harry explains to Fred and George how to think about something. For context, Harry just asked the twins to creatively solve a problem for him:
‘Fred and George exchanged worried glances.
“I can’t think of anything,” said George.
“Neither can I,” said Fred. “Sorry.”
Harry stared at them.
And then Harry began to explain how you went about thinking of things.
It had been known to take longer than two seconds, said Harry.
You never called any question impossible, said Harry, until you had taken an actual clock and thought about it for five minutes, by the motion of the minute hand. Not five minutes metaphorically, five minutes by a physical clock….
So Harry was going to leave this problem to Fred and George, and they would discuss all the aspects of it and brainstorm anything they thought might be remotely relevant. And they shouldn’t try to come up with an actual solution until they’d finished doing that, unless of course they did happen to randomly think of something awesome, in which case they could write it down for afterward and then go back to thinking. And he didn’t want to hear back from them about any so-called failures to think of anything for at least a week. Some people spent decades trying to think of things.’
We’ve definitely set a timer and thought about this for five minutes. But this is the sort of problem that won’t just be solved by a small number of geniuses. We need way more “quality-adjusted researcher-years” if we’re going to get through this.
This is one of the most difficult intellectual challenges of our time, if not the most difficult. Even understanding the problem is hard, and solving it will probably require a mix of math, philosophy, programming, and a healthy dose of political acumen.
Think about how many scientists it took before we made progress on practically any important scientific discovery. Except for the lucky ones at the beginning of the Enlightenment, when there were few scientists and lots of low-hanging fruit, there are usually thousands to tens of thousands of scientists banging their heads against walls for decades for every one who makes a significant breakthrough. And we’ve got around one hundred in a field barely over a decade old!
When you look at it this way, it’s no wonder we haven’t made a lot of progress yet. In fact, it would be quite surprising if we had. We are a small field that’s just getting started.
We’re currently Fred and George, feeling discouraged after having pondered the world’s most important and challenging question for a few metaphorical seconds. Let’s be inspired by Harry to think about it not just for five minutes, but for decades, with a massive community of other people trying to do the same. Let’s field-build and get thousands of people banging their heads against this wicked problem.
Who knows—one of the new researchers might be just a year away from making the crucial insight that ushers in the AI alignment summer.
This post was written collaboratively by Kat Woods and Amber Dawn Ace as part of Nonlinear’s experimental Writing Internship program. The ideas are Kat’s; Kat explained them to Amber, and Amber wrote them up. We would like to offer this service to other EAs who want to share their as-yet unwritten ideas or expertise.
If you would be interested in working with Amber to write up your ideas, fill out this form.
I agree with your characterization of the problem, and would feel far more confident about our ability to solve AI alignment if I expected we would have thousands of smart people working hard on this for decades. Instead, I think we have maybe 5-10 years, and I don’t know how to scale up the number of scientists working on this in time, or how to slow down the countdown.
I think we have significantly longer. Still, if success requires several tens of thousands of people researching this for decades, we will likely fail.
(1) Reasoned estimates of when we will develop AGI start at less than two decades from now.
(2) To my knowledge, there aren’t thousands studying alignment now (let alone tens of thousands), and there does not seem to be a significant likelihood of that changing in the next few years.
(3) Even if there are tens of thousands of researchers working on alignment by the early 2030s, there is a significant chance they may not have decades to work on it before AGI is developed.
Strong upvote for giving some outside perspective to the field; this is an important point about why AI alignment is likely to be tractable at all. It also means that getting many more researchers and much more money, fast, is important for AI safety.
How so? It’s much easier to harness forces than engineer them, it’s much easier to write general search code than to write code that does something specific, it’s much easier to point at something than to describe it well enough to recreate it, it’s much easier to get general intelligence by pointing a search at a context and saying “find something that does really well here on these programmable objectives” than by understanding general intelligence well enough to make it robustly not do something.
Agree, the claim in the post seems to require assumptions that directly contradict observations of the real world.
Thanks for explicitly writing out your thoughts in a place where you can expect strong pushback! I think this is particularly valuable.
That being said, while I completely agree with your second point (I keep telling people who argue that theory cannot work that barely 10 people have worked on it for 10 years, which is a ridiculously small number), I feel like your first point is missing some key reflections on the asymmetry between capabilities and alignment.
I don’t have time to write a long answer, but I already have a post that goes in depth into many of the core assumptions of science and engineering that we don’t expect to apply to alignment (almost all of them either apply to capabilities or are irrelevant there, although the post doesn’t discuss that explicitly).
This pattern is common, and I think it has a simple explanation. Experts arrive in a field by absorbing the current knowledge. Like seeds developing into fruit, they blossom with a variable crop of fresh ideas. Eventually they harvest those ideas by publishing and testing them (at a rate that depends on funding), exhausting their pool and then fading out. Polling experts in a developing field about future breakthroughs is therefore mostly useless, because almost by definition you are polling those whose ideas have already mostly failed; new success ultimately comes from the unproven, those with novel, untested ideas.
Good post
I have similar thoughts. I believe that at some point, fears about TAI will spread like wildfire, and the field will get a giant influx of people, money, and policies, which is hard to imagine from where we stand today.
The problem is we’re going about it all wrong. We’re trying to solve it at the complicated end while it’s forbidden to look at the basics. Right now, we live in a world with satanically complex and defective user interfaces at every level. The fact that “simple” software is allowed to be as bad as it is today is completely incomprehensible to me. In fact most software is already worse than useless, like a runaway AI but with zero capabilities.