Thoughts on AGI organizations and capabilities work

(Note: This essay was largely written by Rob, based on notes from Nate. It’s formatted as Rob-paraphrasing-Nate because (a) Nate didn’t have time to rephrase everything into his own words, and (b) most of the impetus for this post came from Eliezer wanting MIRI to praise a recent OpenAI post and Rob wanting to share more MIRI-thoughts about the space of AGI organizations, so it felt a bit less like a Nate-post than usual.)


Nate and I have been happy about the AGI conversation seeming more honest and “real” recently. To contribute to that, I’ve collected some general Nate-thoughts in this post, even though they’re relatively informal and disorganized.

AGI development is a critically important topic, and the world should obviously be able to hash out such topics in conversation. (Even though it can feel weird or intimidating, and even though there’s inevitably some social weirdness in sometimes saying negative things about people you like and sometimes collaborate with.) My hope is that we’ll be able to make faster and better progress if we move the conversational norms further toward candor and substantive discussion of disagreements, as opposed to saying everything behind a veil of collegial obscurity.

Capabilities work is currently a bad idea

Nate’s top-level view is that ideally, Earth should take a break from doing work that might move us closer to AGI, until we understand alignment better.

That move isn’t available to us, but individual researchers and organizations who choose not to burn the timeline are helping the world, even if other researchers and orgs don’t reciprocate. You can unilaterally lengthen timelines, and give humanity more chances of success, by choosing not to personally shorten them.

Nate thinks capabilities work is currently a bad idea for a few reasons:

  • He doesn’t buy that current capabilities work is a likely path to ultimately solving alignment.

  • Insofar as current capabilities work does seem helpful for alignment, it strikes him as helping with parallelizable research goals, whereas our bottleneck is serial research goals. (See A note about differential technological development.)

  • Nate doesn’t buy that we need more capabilities progress before we can start finding a better path.

This is not to say that capabilities work is never useful for alignment, or that alignment progress is never bottlenecked on capabilities progress. As an extreme example, having a working AGI on hand tomorrow would indeed make it easier to run experiments that teach us things about alignment! But in a world where we build AGI tomorrow, we’re dead, because we won’t have time to get a firm understanding of alignment before AGI technology proliferates and someone accidentally destroys the world.[1] Capabilities progress can be useful in various ways, while still being harmful on net.

(Also, to be clear: AGI capabilities are obviously an essential part of humanity’s long-term path to good outcomes, and it’s important to develop them at some point — the sooner the better, once we’re confident this will have good outcomes — and it would be catastrophically bad to delay realizing them forever.)

On Nate’s view, the field should do experiments with ML systems, not just abstract theory. But if he were magically in charge of the world’s collective ML efforts, he would put a pause on further capabilities work until we’ve had more time to orient to the problem, consider the option space, and think our way to some sort of plan-that-will-actually-probably-work. It’s not as though we’re hurting for ML systems to study today, and our understanding already lags far behind today’s systems’ capabilities.[2]

Publishing capabilities advances is even more obviously bad

For researchers who aren’t willing to hit the pause button, an even more obvious (and cheaper) option is to avoid publishing any capabilities research (including results of the form “it turns out that X can be done, though we won’t say how we did it”).

Information can leak out over time, so “do the work but don’t publish about it” still shortens AGI timelines in expectation. However, it can potentially shorten them a lot less.

In an ideal world, the field would currently be doing ~zero publishing of capabilities research — and marginal action to publish less is beneficial even if the rest of the world continues publishing.

Thoughts on the landscape of AGI organizations

With those background points in hand:

Nate was asked earlier this year whether he agrees with Eliezer’s negative takes on OpenAI. There’s also been a good amount of recent discussion of OpenAI on LessWrong.

Nate tells me that his headline view of OpenAI is mostly the same as his view of other AGI organizations, so he feels a little odd singling out OpenAI. That said, here are his notes on OpenAI anyway:

  • On Nate’s model, the effect of OpenAI is almost entirely dominated by its capabilities work (and sharing of its work), and this effect is robustly negative. (This is true for DeepMind, FAIR, and Google Brain too.)

  • Nate thinks that DeepMind, OpenAI, Anthropic, FAIR, Google Brain, etc. should hit the pause button on capabilities work (or failing that, at least halt publishing). (And he thinks any one actor can unilaterally do good in the process, even if others aren’t reciprocating.)

  • On Nate’s model, OpenAI isn’t close to operational adequacy in the sense of the Six Dimensions of Operational Adequacy write-up — which is another good reason to hold off on doing capabilities research. But this is again a property OpenAI shares with DeepMind, Anthropic, etc.

Insofar as Nate or I think OpenAI is doing the wrong thing, we’re happy to criticize it.[3] But, while this doesn’t change the fact that we view OpenAI’s effects as harmful on net currently, Nate does want to acknowledge that OpenAI seems to him to be doing better than some other orgs on a number of fronts:

  • Nate liked a lot of things about the OpenAI Charter. (As did Eliezer, though compared to Eliezer, Nate saw the Charter as a more important positive sign about OpenAI’s internal culture.)

  • Nate would suspect that OpenAI is much better than Google Brain and FAIR (and comparable with DeepMind, and maybe a bit behind Anthropic? it’s hard to judge these things from the outside) on some important adequacy dimensions, like research closure and operational security. (Though Nate worries that, e.g., he may hear more about OpenAI’s efforts in these directions than about DeepMind’s just by virtue of spending more time in the Bay.)

  • Nate is also happy that Sam Altman and others at OpenAI talk to EAs/rationalists and try to resolve disagreements, and he’s happy that OpenAI has had people like Holden and Helen on their board at various points.

  • Also, obviously, OpenAI (along with DeepMind and Anthropic) has put in a much clearer AGI alignment effort than Google, FAIR, etc. (Albeit Nate thinks the absolute amount of “real” alignment work is still small.)

  • Most recently, Nate and Eliezer both think it’s great that OpenAI released a blog post that states their plan going forward, and we want to encourage DeepMind and Anthropic to do the same.[4]

Comparatively, Nate thinks of OpenAI as being about on par with DeepMind, maybe a bit behind Anthropic (who publish less), and better than most of the other big names, in terms of attempts to take not-killing-everyone seriously. But again, Nate and I think that the overall effect of OpenAI (and DeepMind, FAIR, etc.) is bad, because we think it’s dominated by “shortens AGI timelines”. And we’re a little leery of playing “who’s better on [x] dimension” when everyone seems to be on the floor of the logistic success curve.

We don’t want “here are a bunch of ways OpenAI is doing unusually well for its reference class” to be treated as encouragement for those organizations to stay in the pool, or encouragement for others to join them in the pool. Outperforming DeepMind, FAIR, and Google on one or two dimensions is a weakly positive sign about the future, but on my model and Nate’s, it doesn’t come close to outweighing the costs of “adding another capabilities org to the world”.

  1. ^

    Nate simultaneously endorses these four claims:

    1. More capabilities would make it possible to learn some new things about alignment.

    2. We can’t do all the alignment work pre-AGI. Some trial-and-error and experience with working AGI systems will be required.

    3. It can’t all be trial-and-error, and it can’t all be improvised post-AGI. Among other things, this is because:

    3.1. Some errors kill you, and you need insight into which errors those are, and how to avoid them, in advance.

    3.2. We’re likely to have at most a few years to upend the gameboard once AGI arrives. Figuring everything out under that level of time pressure seems unrealistic; we need to be going into the AGI regime with a solid background understanding, so that empirical work in the endgame looks more like “nailing down a dozen loose ends and making moderate tweaks to a detailed plan” rather than “inventing an alignment field from scratch”.

    3.3. AGI is likely to coincide with a sharp left turn, which makes it harder (and more dangerous) to rely on past empirical generalizations, especially ones that aren’t backed by deep insight into AGI cognition.

    3.4. Other points raised in AGI Ruin: A List of Lethalities.

    4. If we end up able to do alignment, it will probably be because we figured out at least one major thing that we don’t currently know, that isn’t a part of the current default path toward advancing SotA or trying to build AGI ASAP with mainstream-ish techniques, and isn’t dependent on such progress.

  2. ^

    And, again, small individual “don’t burn the timeline” actions all contribute to incrementally increasing the time humanity has to get its act together and figure this stuff out. You don’t actually need coordination in order to have a positive effect in this way.

    And, to reiterate: I say “pause” rather than “never build AGI at all” because MIRI leadership thinks that humanity never building AGI would mean the loss of nearly all of the future’s value. If this were a live option, it would be an unacceptably bad one.

  3. ^

    Nate tells me that his current thoughts on OpenAI are probably a bit less pessimistic than Eliezer’s. As a rule, Nate thinks of himself as generally less socially cynical than Eliezer on a bunch of fronts, though not less-cynical enough to disagree with the basic conclusions.

    Nate tells me that he agrees with Eliezer that the original version of OpenAI (“an AGI in every household”, the associated social drama, etc.) was a pretty negative shock in the wake of the camaraderie of the 2015 Puerto Rico conference.

    At this point, of course, the founding of OpenAI is a sunk cost. So Nate mostly prefers to assess OpenAI’s current state and future options.

    Currently, Nate thinks that OpenAI is trying harder than most on some important safety fronts — though none of this reaches the standards of “adequate project” and we’re still totally going to die if they meet great success along their current path.

    Since I’ve listed various positives about OpenAI here, I’ll note some examples of recent-ish developments that made Nate less happy about OpenAI: his sense that OpenAI was less interested in Paul Christiano’s research, Evan Hubinger’s research, etc. than he thought they should have been, when Paul was at OpenAI; Dario’s decision to leave OpenAI; and OpenAI focusing on the “use AI to solve AI alignment” approach (as opposed to other possible strategies), as endorsed by e.g. Jan Leike, the head of OpenAI’s safety team after Paul’s departure.

  4. ^

    If a plan doesn’t make sense, the research community can then notice this and apply corrective arguments, causing the plan to change. As indeed happened when Elon and Sam stated their more-obviously-bad plan for OpenAI at the organization’s inception.

    It would have been better to state their plan first and start an organization later, so rounds of critical feedback and updating could occur before they locked in decisions about hiring, org structure, name, culture, etc.

    But at least it happened at all; if OpenAI had just said “yeah, we’re gonna do alignment research!” and left it there, the outcome probably would have been far worse.

    Also, if organizations release obviously bad plans but are then unresponsive to counter-arguments, researchers can go work at the orgs with better plans and avoid the orgs with worse plans. This encourages groups to compete to have the seemingly-sanest plan, which strikes me as a better equilibrium than the current one.