Thoughts on AGI organizations and capabilities work
(Note: This essay was largely written by Rob, based on notes from Nate. It’s formatted as Rob-paraphrasing-Nate because (a) Nate didn’t have time to rephrase everything into his own words, and (b) most of the impetus for this post came from Eliezer wanting MIRI to praise a recent OpenAI post and Rob wanting to share more MIRI-thoughts about the space of AGI organizations, so it felt a bit less like a Nate-post than usual.)
Nate and I have been happy about the AGI conversation seeming more honest and “real” recently. To contribute to that, I’ve collected some general Nate-thoughts in this post, even though they’re relatively informal and disorganized.
AGI development is a critically important topic, and the world should obviously be able to hash out such topics in conversation. (Even though it can feel weird or intimidating, and even though there’s inevitably some social weirdness in sometimes saying negative things about people you like and sometimes collaborate with.) My hope is that we’ll be able to make faster and better progress if we move the conversational norms further toward candor and substantive discussion of disagreements, as opposed to saying everything behind a veil of collegial obscurity.
Capabilities work is currently a bad idea
Nate’s top-level view is that ideally, Earth should take a break on doing work that might move us closer to AGI, until we understand alignment better.
That move isn’t available to us, but individual researchers and organizations who choose not to burn the timeline are helping the world, even if other researchers and orgs don’t reciprocate. You can unilaterally lengthen timelines, and give humanity more chances of success, by choosing not to personally shorten them.
Nate thinks capabilities work is currently a bad idea for a few reasons:
He doesn’t buy that current capabilities work is a likely path to ultimately solving alignment.
Insofar as current capabilities work does seem helpful for alignment, it strikes him as helping with parallelizable research goals, whereas our bottleneck is serial research goals. (See A note about differential technological development.)
Nate doesn’t buy that we need more capabilities progress before we can start finding a better path.
This is not to say that capabilities work is never useful for alignment, or that alignment progress is never bottlenecked on capabilities progress. As an extreme example, having a working AGI on hand tomorrow would indeed make it easier to run experiments that teach us things about alignment! But in a world where we build AGI tomorrow, we’re dead, because we won’t have time to get a firm understanding of alignment before AGI technology proliferates and someone accidentally destroys the world.[1] Capabilities progress can be useful in various ways, while still being harmful on net.
(Also, to be clear: AGI capabilities are obviously an essential part of humanity’s long-term path to good outcomes, and it’s important to develop them at some point — the sooner the better, once we’re confident this will have good outcomes — and it would be catastrophically bad to delay realizing them forever.)
On Nate’s view, the field should do experiments with ML systems, not just abstract theory. But if he were magically in charge of the world’s collective ML efforts, he would put a pause on further capabilities work until we’ve had more time to orient to the problem, consider the option space, and think our way to some sort of plan-that-will-actually-probably-work. It’s not as though we’re hurting for ML systems to study today, and our understanding already lags far behind today’s systems’ capabilities.[2]
Publishing capabilities advances is even more obviously bad
For researchers who aren’t willing to hit the pause button, an even more obvious (and cheaper) option is to avoid publishing any capabilities research (including results of the form “it turns out that X can be done, though we won’t say how we did it”).
Information can leak out over time, so “do the work but don’t publish about it” still shortens AGI timelines in expectation. However, it can potentially shorten them a lot less.
In an ideal world, the field would currently be doing ~zero publishing of capabilities research — and marginal action to publish less is beneficial even if the rest of the world continues publishing.
Thoughts on the landscape of AGI organizations
With those background points in hand:
Nate was asked earlier this year whether he agrees with Eliezer’s negative takes on OpenAI. There’s also been a good amount of recent discussion of OpenAI on LessWrong.
Nate tells me that his headline view of OpenAI is mostly the same as his view of other AGI organizations, so he feels a little odd singling out OpenAI. That said, here are his notes on OpenAI anyway:
On Nate’s model, the effect of OpenAI is almost entirely dominated by its capabilities work (and sharing of its work), and this effect is robustly negative. (This is true for DeepMind, FAIR, and Google Brain too.)
Nate thinks that DeepMind, OpenAI, Anthropic, FAIR, Google Brain, etc. should hit the pause button on capabilities work (or failing that, at least halt publishing). (And he thinks any one actor can unilaterally do good in the process, even if others aren’t reciprocating.)
On Nate’s model, OpenAI isn’t close to operational adequacy in the sense of the Six Dimensions of Operational Adequacy write-up — which is another good reason to hold off on doing capabilities research. But this is again a property OpenAI shares with DeepMind, Anthropic, etc.
Insofar as Nate or I think OpenAI is doing the wrong thing, we’re happy to criticize it.[3] But, while this doesn’t change the fact that we view OpenAI’s effects as harmful on net currently, Nate does want to acknowledge that OpenAI seems to him to be doing better than some other orgs on a number of fronts:
Nate liked a lot of things about the OpenAI Charter. (As did Eliezer, though compared to Eliezer, Nate saw the Charter as a more important positive sign about OpenAI’s internal culture.)
Nate would suspect that OpenAI is much better than Google Brain and FAIR (and comparable with DeepMind, and maybe a bit behind Anthropic? it’s hard to judge these things from the outside) on some important adequacy dimensions, like research closure and operational security. (Though Nate worries that, e.g., he may hear more about efforts in these directions made by OpenAI than about DeepMind just by virtue of spending more time in the Bay.)
Nate is also happy that Sam Altman and others at OpenAI talk to EAs/rationalists and try to resolve disagreements, and he’s happy that OpenAI has had people like Holden and Helen on their board at various points.
Also, obviously, OpenAI (along with DeepMind and Anthropic) has put in a much clearer AGI alignment effort than Google, FAIR, etc. (Albeit Nate thinks the absolute amount of “real” alignment work is still small.)
Most recently, Nate and Eliezer both think it’s great that OpenAI released a blog post that states their plan going forward, and we want to encourage DeepMind and Anthropic to do the same.[4]
Comparatively, Nate thinks of OpenAI as being about on par with DeepMind, maybe a bit behind Anthropic (who publish less), and better than most of the other big names, in terms of attempts to take not-killing-everyone seriously. But again, Nate and I think that the overall effect of OpenAI (and DeepMind and FAIR and etc.) is bad, because we think it’s dominated by “shortens AGI timelines”. And we’re a little leery of playing “who’s better on [x] dimension” when everyone seems to be on the floor of the logistic success curve.
We don’t want “here are a bunch of ways OpenAI is doing unusually well for its reference class” to be treated as encouragement for those organizations to stay in the pool, or encouragement for others to join them in the pool. Outperforming DeepMind, FAIR, and Google on one or two dimensions is a weakly positive sign about the future, but on my model and Nate’s, it doesn’t come close to outweighing the costs of “adding another capabilities org to the world”.
Footnotes
[1]
Nate simultaneously endorses these four claims:
1. More capabilities would make it possible to learn some new things about alignment.
2. We can’t do all the alignment work pre-AGI. Some trial-and-error and experience with working AGI systems will be required.
3. It can’t all be trial-and-error, and it can’t all be improvised post-AGI. Among other things, this is because:
3.1. Some errors kill you, and you need insight into which errors those are, and how to avoid them, in advance.
3.2. We’re likely to have at most a few years to upend the gameboard once AGI arrives. Figuring everything out under that level of time pressure seems unrealistic; we need to be going into the AGI regime with a solid background understanding, so that empirical work in the endgame looks more like “nailing down a dozen loose ends and making moderate tweaks to a detailed plan” rather than “inventing an alignment field from scratch”.
3.3. AGI is likely to coincide with a sharp left turn, which makes it harder (and more dangerous) to rely on past empirical generalizations, especially ones that aren’t backed by deep insight into AGI cognition.
3.4. Other points raised in AGI Ruin: A List of Lethalities.
4. If we end up able to do alignment, it will probably be because we figured out at least one major thing that we don’t currently know, that isn’t a part of the current default path toward advancing SotA or trying to build AGI ASAP with mainstream-ish techniques, and isn’t dependent on such progress.
[2]
And, again, small individual “don’t burn the timeline” actions all contribute to incrementally increasing the time humanity has to get its act together and figure this stuff out. You don’t actually need coordination in order to have a positive effect in this way.
And, to reiterate: I say “pause” rather than “never build AGI at all” because MIRI leadership thinks that humanity never building AGI would mean the loss of nearly all of the future’s value. If this were a live option, it would be an unacceptably bad one.
[3]
Nate tells me that his current thoughts on OpenAI are probably a bit less pessimistic than Eliezer’s. As a rule, Nate thinks of himself as generally less socially cynical than Eliezer on a bunch of fronts, though not less-cynical enough to disagree with the basic conclusions.
Nate tells me that he agrees with Eliezer that the original version of OpenAI (“an AGI in every household”, the associated social drama, etc.) was a pretty negative shock in the wake of the camaraderie of the 2015 Puerto Rico conference.
At this point, of course, the founding of OpenAI is a sunk cost. So Nate mostly prefers to assess OpenAI’s current state and future options.
Currently, Nate thinks that OpenAI is trying harder than most on some important safety fronts — though none of this reaches the standards of “adequate project” and we’re still totally going to die if they meet great success along their current path.
Since I’ve listed various positives about OpenAI here, I’ll note some examples of recent-ish developments that made Nate less happy about OpenAI: his sense that OpenAI was less interested in Paul Christiano’s research, Evan Hubinger’s research, etc. than he thought they should have been, when Paul was at OpenAI; Dario’s decision to leave OpenAI; and OpenAI focusing on the “use AI to solve AI alignment” approach (as opposed to other possible strategies), as endorsed by e.g. Jan Leike, the head of OpenAI’s safety team after Paul’s departure.
[4]
If a plan doesn’t make sense, the research community can then notice this and apply corrective arguments, causing the plan to change. As indeed happened when Elon and Sam stated their more-obviously-bad plan for OpenAI at the organization’s inception.
It would have been better to state their plan first and start an organization later, so that rounds of critical feedback and updating could occur before locking in decisions about hiring, org structure, name, culture, etc.
But at least it happened at all; if OpenAI had just said “yeah, we’re gonna do alignment research!” and left it there, the outcome probably would have been far worse.
Also, if organizations release obviously bad plans but are then unresponsive to counter-arguments, researchers can go work at the orgs with better plans and avoid the orgs with worse plans. This encourages groups to compete to have the seemingly-sanest plan, which strikes me as a better equilibrium than the current one.
I wanted to give this a big +1. I think OpenAI is doing better than literally every single other major AI research org except probably Anthropic and DeepMind on trying to solve the AI-not-killing-everyone task. I also think that Anthropic/DeepMind/OpenAI are doing better in terms of not publishing their impressive capabilities research than ~everyone else (e.g. not revealing the impressive downstream benchmark numbers on Codex/text-davinci-002 performance). Accordingly, I think there’s a tendency to give OpenAI an unfair amount of flak compared to, say, Google Brain or FAIR or any of the startups like Adept or Cohere. This is probably a combination of three effects:
OpenAI is clearly on the cutting edge of AI research.
OpenAI has a lot of visibility in this community, due to its physical proximity and a heavy overlap between OpenAI employees and the EA/Rationalist social scene.
OpenAI is publicly talking about alignment while other orgs don’t even acknowledge it, which makes OpenAI a heretic rather than an infidel.
And I’m happy that this post pushes against this tendency.
(And yes, standard caveats, reality doesn’t grade on a curve, etc.)
I’m not sure I agree that this is unfair.
This is obviously a good reason to focus on them more.
Perhaps we have a responsibility to scrutinize/criticize them more because of this, due to comparative advantage (who else can do it more easily or better than we can?), and because they’re arguably deriving some warm fuzzy glow from this association? (Consider FTX as an analogy.)
Yes, but they don’t seem keen on talking about the risks/downsides/shortcomings of their alignment efforts (e.g., they make their employees sign non-disparagement agreements, and as a result the former alignment team members who left in a big exodus can’t say exactly why they left). If you only talk about how great your alignment effort is, maybe that’s worse than not talking about it at all, since it’s liable to give people a false sense of security?
I appreciate this post! It feels fairly reasonable, and much closer to my opinion than (my perception of) previous MIRI posts. Points that stand out:
Publishing capabilities work is notably worse than just doing the work.
I’d argue that hyping up the capabilities work is even worse than just quietly publishing it without fanfare.
Though a counter-point is that if an organisation doesn’t have great cyber-security and is a target for hacking, capabilities can easily leak (see, e.g., the Soviets getting nuclear weapons four years after the US, despite it being a top-secret US program and before the internet).
Capabilities work can be importantly helpful for alignment work, especially empirically focused work.
Probably my biggest crux is around the parallel vs serial thing. Fairly little current alignment work really feels “serial” to me. Assuming that you’re mostly referring to conceptual alignment work, my read is that a lot of it is fairly confused, and would benefit a lot from real empirical data and real systems that can demonstrate concepts such as agency, planning, strategic awareness, etc., and just more data on what AGI cognition might look like. Without these, it seems extremely hard to distinguish true progress from compelling falsehoods.
What’s the mechanism you’re thinking of, through which hype does damage?
I also doubt that good capabilities work will be published “without fanfare”, given how closely watched this space is.
I think this is more an indictment of existing work, and less a statement about what work needs to be done. E.g., my guess is we’ll both agree that the original inner alignment work from Evan Hubinger is pretty decent conceptual research. And much conceptual work seems pretty serial to me, and is hard to parallelize for reasons like “intuitions from the lead researcher are difficult to share” and communication difficulties in general.
Of course, I also agree that there’s a synergy between empirical data and thinking: e.g., one of the main reasons I’m excited about Redwood’s agenda is that it’s very conceptually driven, which lets it be targeted at specific problems (for example, they’re coming up with techniques that aim to solve the mechanistic anomaly detection problem, and finding current analogues and doing experiments with those).
This ship may have sailed at this point, but to me the main mechanism is getting other actors to pay attention and focus on the most effective kinds of capabilities work, and making it more politically feasible to raise support. E.g., I expect that the media firestorm around GPT-3 made it significantly easier to raise the capital and support within Google Brain to train PaLM. Legibly making a ton of money with it falls in a similar category to me.
Gopher is a good example of not really seeing much fanfare, I think? (Though I don’t spend much time on ML Twitter, so maybe there was loads lol)
Ah, my key argument here is that most conceptual work is bad because it lacks good empirical examples, grounding, and feedback loops, and that if we were closer to AGI we could have these.
I agree that Risks from Learned Optimisation is important and didn’t need this, and it plausibly feels like a good example of serial work to me.
Wouldn’t surprise me if this was true, but I agree with you that it’s possible the ship has already sailed on LLMs. I think this is more the case if you have a novel insight about which paths to AGI are more promising (similar to the scaling hypothesis in 2018): getting ~everyone to adopt that insight would significantly advance timelines, though I’d argue that publishing it (such that only the labs explicitly aiming at AGI, like OpenAI and DeepMind, adopt it) is not clearly less bad than hyping it up.
Surely this is because it didn’t say anything except “DeepMind is also now in the LLM game”, which wasn’t surprising given that Geoff Irving left OpenAI for DeepMind? There weren’t significant groundbreaking techniques used to train Gopher, as far as I can remember.
Chinchilla, on the other hand, did see a ton of fanfare.
Cool. I agree with you that conceptual work is bad in part because of a lack of good examples/grounding/feedback loops, though I think this can be overcome with clever toy problem design and analogies to current problems (that you can then get the examples/grounding/feedback loops from). E.g. surely we can test toy versions of shard theory claims using the small algorithmic neural networks we’re able to fully reverse engineer.
Can you give some historical examples of work that lowered the amount-of-serial-research-left-till-doom? And examples of work that didn’t? Because an advance in alignment is often a direct advance in capabilities, and I’m a little confused about the spectrum of possibilities.
Here’s an example of my confusion. Clearly interpretability work is mostly good, right? Exploring semantic superposition and other current advances seems clearly beneficial to publish, in spite of the fact that they advance capabilities. If we progress to the point where we can interpret the algorithms that a smallish NN is using, that still seems fine. But if interpretability research progresses to the point where we can decode the algorithms a NN is running, then the techniques that allow that level of interpretability are quite dangerous. For example, if we find that large NNs have some kind of proto-general search which seems like it could be amplified easily to get a general agent, then, you know, it would be pretty bad if every AGI organization could find this out just by applying standard interpretability tool X. Or is that kind of work still worth publishing, because powerful interpretability would make alignment way easier and that outweighs the risk of reducing serial research time till doom?
I don’t know Nate’s response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.
“AI capabilities” and “AI alignment” are highly related to each other, and “AI capabilities” has to come first, in that alignment assumes there is a system to align. I agree that for people on the cutting edge of research, like those at OpenAI, it would be a good idea for at least some of them to start thinking deeply about alignment instead. There are two reasons for this:
1) OpenAI is actually likely to advance capabilities a pretty significant amount, and
2) Due to the expertise they’ve developed from working on AI capabilities, they’re much more likely to make important progress on AGI alignment than e.g. MIRI.
But I think there’s something of a “reverse any advice you hear” thing going on: the people most likely to avoid working on capabilities as a result of this post are those who would actually benefit from working on AI capabilities for a while, even if they don’t intend to publish their results, in order to build more expertise in AI. Capabilities is the foundation of the field, and trying to theorize about how to control an AI system while having nothing but the vaguest ideas about how that system will work isn’t going to get you anywhere.
For example, Eliezer is in a pessimistic doom-spiral while also being, by his own admission, pretty useless at solving alignment. If he would just take a break and try to make an AI good at Atari for six months then I think he’d find he was a lot more effective at alignment afterwards and would realize that AGI isn’t as imminent as he currently believes it is. Of course, the very fact that he thinks it’s imminent means he won’t do this; such is life.
“Working on AI capabilities” explicitly means working to advance the state-of-the-art of the field. Skilling up doesn’t do this. Hell, most ML work doesn’t do this. I would predict >50% of AI alignment researchers would say that building an AI startup that commercialises the capabilities of already-existing models does not count as “capabilities work” in the sense of this post. For instance, I’ve spent the last six months studying reinforcement learning and Transformers, but I haven’t produced anything that has actually reduced timelines, because I haven’t improved anything beyond the level that humanity was capable of before, let alone published it.
If you work on research engineering in a similar manner, but don’t publish any SOTA results, I would say you haven’t worked on AI capabilities in the way this post refers to them.
Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you’d want to do that as part of your attempt). There’s a huge difference between reading about what’s worked in the past and trying to think of new things that could work and then trying them out to see if they do.
As I’ve learned more about deep learning and tried to understand the material, I’ve constantly had ideas that I think could improve things. Then I’ve tried them out, and usually learned that they didn’t, or that they did but had already been done, or that it was more complicated than that, etc. But I learned a ton in the process. On the other hand, suppose I was wary of doing AI capability work. Each time I had one of these ideas, I shied away from it out of fear of advancing AGI timelines. The result would be threefold: I’d have a much worse understanding of AI, I’d be a lot more concerned about imminent AGI (after all, I had tons of ideas for how things could be done better!), and I wouldn’t have actually delayed AGI timelines at all.
I think a lot of people who get into AI from the alignment side are in danger of falling into this trap. As an example, in an ACX thread I saw someone thinking about doing their PhD in ML, and they were concerned that they might have to do capability research in order to get their PhD. Someone replied that if they had to, they should at least try to make sure it is nothing particularly important, in order to avoid advancing AGI timelines. I don’t think this is a good idea. Spending years working on research while actively holding yourself back from really thinking deeply about AI will harm your development significantly, and early in your career is exactly when you benefit the most from developing your understanding and are least likely to actually move up AGI timelines.
Suppose we have a current expected AGI arrival date of 20XX. This is the result of DeepMind, Google Brain, OpenAI, FAIR, Nvidia, universities all over the world, the Chinese government, and more all developing the state of the art. On top of that there’s computational progress happening at the same time, which may well turn out to be a major bottleneck. How much would OpenAI removing themselves from this race affect the date? A small but real amount. How about a bright PhD candidate removing themselves from the race? About zero. I don’t think people properly internalize both how insignificant the timeline difference is and how big the skill gains are from actually trying your hardest at something, as opposed to handicapping yourself. And if you come up with something you’re genuinely worried about, you can just not publish it.
I do agree that people should try their ideas out, even if the ideas are “capabilities” flavored. However, I do think (if you buy the serial vs parallel distinction in the OP) that you should try to not do capabilities research.
As you say, most ML ideas people come up with at first are pretty doomed to failure, and the main way people learn is via experience. This is in part due to the overconfidence of newbies in any field, but also in part due to how counterintuitive many ML results are to most people. [1]
The key thing people should know is, if you stumble on an actual capabilities insight… you can just… not publish it or talk about it. I think I’d emphasize this point over the other points. Do the research most helpful for learning, and then in the unlikely event it ends up being impressive capabilities work, you can always just put it into your filing cabinet and walk away. [2]
As for the ML PhD example:
I think you can think very deeply about AIs without going out and working on critical-path capabilities work! You should think very deeply about AIs in general, if you’re working in the field, regardless of what you’re doing! But if you’re in a job where you have to publish to advance (assuming you buy the assumptions in the OP), it seems pretty bad to actively seek out and work on critical-path capabilities work, as opposed to skill-building work or safety work.
Finally, while I agree with your overall takeaway, I strongly disagree with this style of argument, because the expected effect of most individual people on moving anything significant forward by a couple of months is probably going to be roughly zero, including their effect on solutions to alignment; there are just a lot of people out there, and the problems people want to work on are really hard. What matters isn’t whether we can measure the impact of a single person in full percentage points of various outcomes or full weeks of time, but whether the expected gains are larger from doing the PhD vs. not doing so, or what kind of PhD maximizes the expected gains or minimizes the expected harms.
As the post says:
Even if your net impact is small, you can still choose exactly how small and what direction it is.
That being said, I think you’re painting a false dichotomy between trying really hard to “get a SOTA on an Atari benchmark” and “simply reading up on past results”. E.g., you could also gain experience reimplementing existing results, exploring their robustness, etc.
As a side note, I don’t think most ways of getting SOTA on Atari benchmarks are particularly relevant to cutting-edge capabilities work, nor are they what I’d recommend people spend a lot of their time on. It’s possible we’re imagining completely different things here. That being said, this is not a crux for my belief that people should lean more toward trying things out.
Similarly, I also disagree with this take:
I think that the majority of ML PhDs are actually not meaningfully contributing to capabilities, because they’re just not working on things likely to be relevant after a few more improvements in general capabilities. See, for example, a lot of the pre-GPT NLP work finetuning small neural networks harder on specific tasks. I’d also bet that a lot of the video understanding work from the past 5 years will be obsoleted when we get better video/multi-modal foundation models.
I’m very unconfident in the following but, to sketch my intuition:
I don’t really agree with the idea of serial alignment progress that is independent of capability progress. This is what I was trying to get at with:
By analogy, nuclear fusion safety research is inextricable from nuclear fusion capability research.
When I try to think of ways to align AI, my mind points towards questions like “How do we get an AI to extrapolate concepts? How will it be learning? What will its architecture be?” etc. In other words, it just points towards capabilities questions. Since alignment turns on capability questions that we don’t yet have answers to, it doesn’t surprise me when many alignment researchers seem to spin their wheels and turn to doom and gloom; that’s more or less what I had thought would happen.
As an example of the blurred lines between capability and alignment: while I think it’s useful to have specific terms for inner and outer alignment, I also think that anyone who has worked with RL in a situation where they were manually setting the reward function was already aware of these ideas on some level. “Sometimes I mess up the reward function” and “sometimes the agent isn’t optimizing properly” are both issues encountered frequently. Basically, while many people in the alignment community seem to think of alignment as something that is cooked up entirely separately from capability research, I tend to think that a lot of it will develop naturally as part of day-to-day AI research with no specific focus on alignment.
As a thought experiment, let’s say that about 20% of current AI capability researchers are very concerned about AI alignment and get together to decide what to do for the next five years. They’re deciding between taking the stance “Capability work is fine right now! Go for it! Worry about alignment when we’re farther along!” or “Let’s get out of capability and go into alignment instead. Capability research is dangerous and burning precious time.” What’s the impact of adopting these two positions?
The first is roughly the default position, and I’d expect that basically what we’ll see is AGI in the year 20XX, and that in the run-up to this we’ll see vastly increased interest in alignment work, plus a significant blurring between “alignment” and “regular AI research”, since people want their home robots to not roll over their cat. We’ll also see all major AI research orgs and the AI community as a whole take existential risk from self-improving AGI a lot more seriously once modern SOTA AI systems start looking more and more like the kind of thing that could do that. Because of this there’ll be a concerted effort to handle the situation appropriately, which has a good chance of success.
Option two involves slowing down the timeline by about 5-10%. Cutting the size of a field by 20% doesn’t slow progress that much, since there are diminishing returns to adding more researchers, and on top of that AI capability research is only half of what drives progress (the other half being compute). In return for this small slowdown, the AI researchers who are now going into alignment will initially spin their wheels due to the lack of anything concrete to focus on or any concrete knowledge of what the future systems will look like. When AGI does start approaching, the remaining AI capability community will take it much less seriously, having been selected specifically for that trait. Three years before the arrival of transformative AGI, alignment research is further along than it otherwise would have been, but AI capability researchers have gotten used to tuning alignment researchers out and there aren’t alignment-sympathetic colleagues around to say “hey, given how things are progressing I think it’s time we start taking all that AI risk stuff seriously”. Prospects are worse than option one.
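To make the arithmetic above concrete, here is a rough, purely illustrative sketch of where a figure like 5-10% could come from. The square-root returns-to-researchers model and the even research/compute split are stand-in assumptions for this sketch, not numbers given in the comment above.

```python
# Toy model, illustrative only: research output scales as N**0.5 (a stand-in
# for "diminishing returns to adding more researchers"), and research drives
# only half of overall progress (the other half, compute, is unaffected).

def timeline_stretch(researchers_remaining: float,
                     returns_exponent: float = 0.5,
                     research_share: float = 0.5) -> float:
    """Fractional increase in time-to-AGI when some capability researchers
    leave the field, under the toy assumptions above."""
    research_output = researchers_remaining ** returns_exponent
    overall_rate = research_share * research_output + (1 - research_share)
    return 1 / overall_rate - 1

# 20% of capability researchers switch to alignment:
print(f"{timeline_stretch(0.8):.1%}")  # ~5.6%, i.e. within the 5-10% range
```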
So right now my intuition is that alignment will be very doable as long as it’s something that the AI community is taking seriously in the few years leading up to transformative AGI. The biggest risk seems to me to be some AI researchers at one of the leading research groups thinking “man, it sure would be cool if we could use the latest coding LLM combined with RL to make an AI that could improve itself in order to accomplish a goal” and setting it running without it ever occurring to them that this could go wrong. Given this, the suggestion that everyone concerned about alignment basically cede the whole field of AI research (outside of this specific community, “AI capability research” is just called “AI research”) to people who aren’t worried about it seems like a bad idea.
Yeah, that might be a big idea. If you’re right that AI capabilities work and AI alignment work are the same thing, the problem is solved by definition. So if I’m understanding you correctly, capabilities and safety are highly correlated, and there can’t be situations where capabilities and alignment decouple.
Not that far; more like they don’t decouple until more progress has been made. Pure alignment is an advanced subtopic of AI research that requires more progress to have been made before it’s a viable field.
I’m not super confident in the above and wouldn’t discourage people from doing alignment work now (plus the obvious nuance that it’s not one big lump; there are some things that can be done later and some that can be done earlier), but the idea of alignment work that requires a whole bunch of work in serial, independent of AI capability work, doesn’t seem plausible to me. From Nate Soares’ post:
This is the kind of thing that seems inextricably bound up with capability work to me. My impression is that MIRI tends to think that whatever route we take to get to AGI, as it moves from subhuman to human-level intelligence it will transform to be like the minds that they theorize about (and they think this will happen before it goes foom) no matter how different it was when it started. So even if they don’t know what a state of the art RL agent will look like five years from now, they feel confident they can theorize about what it will look like ten years from now. Whereas my view is that if you can’t get the former right you won’t get the latter right either.
To the extent that intelligences will converge towards a certain optimal way of thinking as they get smarter, being able to predict what that looks like will involve a lot of capability work (“Hmm, maybe it will learn like this; let’s code up an agent that learns that way and see how it does”). If you’re not grounding your work in concrete experiments you will end up with mistakes in your view of what an optimal agent looks like and no way to fix them.
A big part of my view is that we seem to still be a long way from AGI. This hinges on how “real” the intelligence behind LLMs is. If we have to take the RL route, then we are a long way away; I wrote a piece on this, “What Happened to AIs Learning Games from Pixels?”, which points out how slow the progress has been and covers the areas where the field is stuck. On the other hand, if we can get most of the way to AGI just with massive self-supervised training, then it starts seeming more likely that we’ll walk into AGI without having a good understanding of what’s going on. I think that the failure of VPT for Minecraft compared to GPT for language, and the difficulty LLMs have with extrapolation and innovation, means that self-supervised learning won’t be enough without more insight. I’ll be paying close attention to how GPT-4 and other LLMs do over the next few years to see if they’re making progress faster than I thought, but I talked to ChatGPT and it was way worse than I thought it’d be.
I like your comments, 307th, and your linked post on RL SotA. I don’t agree with everything you say, but some of what you say is quite on point. In particular, I agree that “RL is currently being rather unimpressive in achieving complicated goals in complex wide-possible-action-space simulation worlds”. I agree that some fundamental breakthroughs are needed to change this, not just scaling existing methods. I disagree that such breakthroughs will necessarily require many calendar years of research. I think the eyes of the big research labs will probably soon turn to focus more fully on tackling complex-world RL, and that it won’t be long at all before significant breakthroughs start being made.
I think rather than thinking about research progress in terms of years, or even “researcher hours”, it’s more helpful to think of progress in terms of “research points” devoted to the specific topic. An hour from a highly effective researcher at a well-funded lab, with a well-set-up research environment that makes new experiments easy to run, is worth vastly more “research points” towards a topic than an hour from a compute-limited grad student without polished experiment-running code patterns, without access to huge compute resources, and without much experience running large experiments over many variables.
Thanks for making things clearer! I’ll have to think about this one—some very interesting points from a side I had perhaps unfairly dismissed before.