I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines

The alignment community is ostensibly a group of people concerned about AI risk. Lately, it would be more accurate to describe it as a group of people concerned about AI timelines.

AI timelines have some relation to AI risk. Slower timelines mean that people have more time to figure out how to align future intelligent AIs, potentially lowering the risk. But increasingly, slowing down AI progress is becoming an end in itself, taking precedence over ensuring that AI goes well. When the two objectives come into conflict, timelines are winning.

Pretty much anything can be spun as speeding up timelines, either directly or indirectly. Because of this, the alignment community is becoming paralyzed, afraid of doing anything related to AI - even publishing alignment work! - because of fears that their actions will speed up AI timelines.

The result is that the alignment community is becoming less effective and more isolated from the broader field of AI; and the benefit, a minor slowdown in AI progress, comes nowhere near outweighing the cost.

This is not to say that trying to slow down AI progress is always bad. It depends on how it is done.

Slowing Down AI: Different Approaches

If you want to slow down AI progress, there are different ways to go about it. One way to categorize them is by whom a slowdown affects.

  • Government Enforcement: Getting the government to slow down progress through regulation or bans. This is a very broad category, ranging from regulations within a single country to international bans on models over a certain size, but its distinguishing feature is that the restriction applies to everyone, or, if not to everyone, at least to a large group that is not selected for caring about AI risk.

  • Voluntary Co-Ordination: If OpenAI, DeepMind, and Anthropic all agreed to halt capability work for a period of time, that would be voluntary co-ordination. Because it’s voluntary, a pause motivated by AI risk can only bind organizations that already worry about AI risk.

  • Individual Withdrawal: When individuals concerned about AI risk refrain from going into AI, for fear of having to do capability work and thereby advancing timelines, that is individual withdrawal; ditto for other actions forgone in order to avoid speeding up timelines, such as not publishing alignment research.

The focus of this piece is on individual withdrawal. I think nearly all forms of individual withdrawal are highly counterproductive and don’t stand up to an unbiased cost-benefit analysis. And yet individual withdrawal is becoming entrenched as a principle in the alignment community.

Examples of Individual Withdrawal in the Alignment Community

Let’s first see how individual withdrawal is being advocated for and practiced in the alignment community and try to categorize it.

Capabilities Withdrawal

This is probably the point on which there is the most consensus: people worried about AI risk shouldn’t be involved in AI capabilities, because that speeds up AI timelines. If you are working on AI capabilities, you should stop. Ideally, no one in the field of AI capabilities would care about AI risk at all, because everyone who cared would have left; on this view, that would be great, because it would slow down AI timelines. Here are examples of people advocating for capabilities withdrawal:

Zvi in AI: Practical Advice for the Worried:

Remember that the default outcome of those working in AI in order to help is to end up working primarily on capabilities, and making the situation worse.

Nate Soares in Request: stop advancing AI capabilities:

This is an occasional reminder that I think pushing the frontier of AI capabilities in the current paradigm is highly anti-social, and contributes significantly in expectation to the destruction of everything I know and love. To all doing that (directly and purposefully for its own sake, rather than as a mournful negative externality to alignment research): I request you stop.

Connor Leahy has made similar public calls for people to stop advancing AI capabilities.

Alignment Withdrawal

Although capabilities withdrawal is a good start, it isn’t enough. There is still the danger that alignment work advances timelines. It is important both to be very careful about whom you share your alignment research with, and potentially to avoid certain types of alignment research altogether if they have implications for capabilities.

Examples:

In a blog post titled “2018 Update: Our New Research Directions”, MIRI explained that concern about short timelines and advancing capabilities had led them to default to not sharing their alignment research:

MIRI recently decided to make most of its research “nondisclosed-by-default,” by which we mean that going forward, most results discovered within MIRI will remain internal-only unless there is an explicit decision to release those results, based usually on a specific anticipated safety upside from their release.

This policy is still in practice today. In fact, this reticence about sharing their thoughts extends beyond public-facing work to face-to-face communication with other alignment researchers:

  • I think we were overly cautious with infosec. The model was something like: Nate and Eliezer have a mindset that’s good for both capabilities and alignment, and so if we talk to other alignment researchers about our work, the mindset will diffuse into the alignment community, and thence to OpenAI, where it would speed up capabilities. I think we didn’t have enough evidence to believe this, and should have shared more.

Nate Soares has also raised the concern that sharing alignment research could increase existential risk:

I’ve historically been pretty publicly supportive of interpretability research. I’m still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed.

Justin Shovelain and Elliot Mckernon of Convergence Analysis offer similar warnings about interpretability research:

Here are some heuristics to consider if you’re involved or interested in interpretability research (in ascending order of nuance):

  • Research safer topics instead. There are many research areas in AI safety, and if you want to ensure your research is net positive, one way is to focus on areas without applications to AI capabilities.

  • Research safer sub-topics within interpretability. As we’ll discuss in the next section, some areas are riskier than others—changing your focus to a less risky area could ensure your research is net positive.

  • Conduct interpretability research cautiously, if you’re confident you can do interpretability research safely, with a net-positive effect. In this case:

    • Stay cautious and up to date. Familiarize yourself with the ways that interpretability research can enhance capabilities, and update and apply this knowledge to keep your research safe.

    • Advocate for caution publicly.

    • Carefully consider what information you share with whom. This particular topic is covered in detail in Should we publish mechanistic interpretability research?, but to summarise: it may be beneficial to conduct interpretability research and share it only with select individuals and groups, ensuring that any potential benefit to capability enhancement isn’t used for such.

General AI Withdrawal

It’s not just alignment and capability research you need to watch out for—anything connected to AI could conceivably advance timelines and therefore is inadvisable. Examples:

Again from Zvi’s “Practical Advice for the Worried”, mentioned above:

Q: How would you rate the ‘badness’ of doing the following actions: Direct work at major AI labs, working in VC funding AI companies, using applications based on the models, playing around and finding jailbreaks, things related to jobs or hobbies, doing menial tasks, having chats about the cool aspects of AI models?

A: Ask yourself what you think accelerates AI to what extent, and what improves our ability to align one to what extent. This is my personal take only – you should think about what your model says about the things you might do. So here goes. Working directly on AI capabilities, or working directly to fund work on AI capabilities, both seem maximally bad, with ‘which is worse’ being a question of scope. Working on the core capabilities of the LLMs seems worse than working on applications and layers, but applications and layers are how LLMs are going to get more funding and more capabilities work, so the more promising the applications and layers, the more I’d worry. Similarly, if you are spreading the hype about AI in ways that advance its use and drive more investment, that is not great, but seems hard to do that much on such fronts on the margin unless you are broadcasting in some fashion, and you would presumably also mention the risks at least somewhat.

So in addition to not working on capabilities, Zvi recommends not funding AI companies and not working on applications and layers built on LLMs, with the note that building hype or publicity around AI is “not great” but isn’t that big a deal.

On the other hand, maybe publicity isn’t OK after all—in Hooray for stepping out of the limelight, Nate Soares comes out against publicity and hype:

From maybe 2013 to 2016, DeepMind was at the forefront of hype around AGI. Since then, they’ve done less hype. For example, AlphaStar was not hyped nearly as much as I think it could have been.

I think that there’s a very solid chance that this was an intentional move on the part of DeepMind: that they’ve been intentionally avoiding making AGI capabilities seem sexy.

In the wake of big public releases like ChatGPT and Sydney and GPT-4, I think it’s worth appreciating this move on DeepMind’s part. It’s not a very visible move. It’s easy to fail to notice. It probably hurts their own position in the arms race. I think it’s a prosocial move.

What Is the Cost of These Withdrawals?

The Cost of Capabilities Withdrawal

Right now it seems likely that the first AGI, and later ASI, will be built with the utmost caution by people who take AI risk very seriously. If capabilities withdrawal is a success—if all of the people calling for OpenAI, DeepMind, and Anthropic to shut down get their way—then this will no longer be the case.

But this is far from the only cost. Capabilities withdrawal also means that less alignment research will get done. Capabilities organizations that are concerned about AI risk hire alignment researchers. Alignment research is already funding-constrained: there are more qualified people who want to do alignment research than there are jobs available for them. As the number of available alignment jobs shrinks, the number of alignment researchers will shrink as well.

The alignment research that does get done will be lower quality, due to reduced access to compute, capability know-how, and cutting-edge AI systems. And it will be far less likely to percolate through to researchers building cutting-edge AI systems, because the people building those systems simply won’t be interested in reading it.

Lastly, capabilities withdrawal makes a government-implemented pause less likely, because the credibility of AI risk is closely tied to how many leading AI capability researchers take AI risk seriously. People in the alignment community are in a bubble, and talk about “alignment research” and “capability research” as if they are two distinct fields of approximately equal import. To everyone else, the field of “AI capability research” is just known as “AI research”. And so, by trying to remove people worried about AI risk from AI research, you are trying to bring about a scenario where the field of AI research has a consensus that AI risk is not a real problem. This will be utterly disastrous for efforts to get government regulations and interventions through.

Articles like this, this, and this are only possible because the people featured in them pushed AI capabilities.

Similarly, look at the top signatories of the CAIS Statement on AI Risk, which had such a huge impact on the public profile of AI risk: the list is headed by leading capability researchers and lab leaders such as Geoffrey Hinton, Yoshua Bengio, and Demis Hassabis. Why do you think they chose to lead off with these signatures and not Eliezer Yudkowsky’s? If the push for individual withdrawal from capabilities work is a success, then any time a government-implemented pause is proposed, the expert consensus will be that no pause is necessary and AI does not represent an existential risk.

The Cost of Alignment Withdrawal

Capabilities withdrawal introduces a rift between the alignment community and the broader field of AI research. Alignment withdrawal will widen this rift, as research is intentionally withheld from people working on cutting-edge AI systems for fear of advancing timelines.

A policy of not allowing people building powerful AI systems to see alignment research is a strong illustration of how AI risk has become a secondary concern to AI timelines.

The quality of alignment research that gets done will also drop, both because researchers will be restricting their research topics for fear of advancing capabilities, and because researchers won’t even be talking to each other freely.

Figuring out how to build an aligned general intelligence will necessarily involve knowing how to build a general intelligence at all. Because of this, promising alignment work will have implications for capabilities; trying to avoid alignment work that could speed up timelines will mean avoiding alignment work that might actually lead somewhere.

The Cost of General Withdrawal

As people worried about AI risk withdraw from fields tangentially related to AI capabilities—concerned VCs avoid funding AI companies, concerned software devs avoid building apps or other technologies that use AI, and concerned internet denizens withdraw from miscellaneous communities like AI art—the presence of people who take AI risk seriously in all of these spaces will diminish. When the topic of AI risk comes up, all of these places—AI startups, AI apps, AI discord communities—will find fewer and fewer people willing to defend it as a concern worth taking seriously. And to whatever extent these areas influence how AI is deployed, deployment will happen without thought given to potential risks.

Benefits

The benefit of withdrawal is not a pause or a stop. As long as there is no consensus on AI risk, individual withdrawal cannot lead to a stop. The benefit, then, is that AI is slowed down. By how much? That depends on a lot of assumptions, so I’ll leave it to the reader to make up their own mind. All of the withdrawal going on—every young STEM nerd worried about AI risk who decides not to get a PhD in AI because they’d have to publish a paper and Zvi said that advancing capabilities is the worst thing you could do, every alignment researcher who doesn’t publish or who turns away from a promising line of research because they’re worried it would advance capabilities, every software engineer who doesn’t build that cool AI app they had in mind—how much do you think all of this will slow down AI?

And, is that time worth the cost?