Beliefs and Disagreements about Automating Alignment Research

Epistemic status: Mostly organizing and summarizing the views of others.

Thanks to those whose views I summarized in this post, and to Tamera Lanham, Nicholas Kees Dupuis, Daniel Kokotajlo, Peter Barnett, Eli Lifland, and Logan Smith for reviewing a draft.

Introduction

In my current view of the alignment problem, there are two paths that we could try to take:

1. Come up with an alignment strategy that allows us both to build aligned AGI and to keep that AGI (or its successors) aligned as they improve towards superintelligence
2. Come up with an alignment strategy that allows us to build AI systems that are powerful (but not so powerful as to be themselves dangerous) and use them to execute some kind of ‘pivotal act’ that prevents misaligned ASI from being built

For the purposes of this post, I am going to assume that we are unable to do (1) – maybe the problem is too difficult, or we don’t have time – and focus on (2).

Within the category of ‘pivotal act’, I see two main types:

Preventative pivotal acts: acts that make it impossible for anyone to build AGI for a long period of time
Constructive pivotal acts: acts that make it possible to build aligned ASI

People disagree about whether preventative pivotal acts are possible and, even if they are, whether they would be a good idea. Again, for the purposes of this post, I am going to assume we can’t or don’t want to execute a preventative pivotal act, and focus on constructive pivotal acts. In particular: can we use AI to automate alignment research safely?

What does ‘automating alignment research’ even mean?

I see three overlapping categories that one could mean when referring to ‘automating alignment research’, ordered in terms of decreasing human involvement:

Level 1: AIs help humans work faster. Examples include brainstorming, intelligent autocomplete, and automated summarization/explanation.
Level 2: AIs produce original contributions. This could be key insights into the nature of intelligence, additional problems that were overlooked, or entire alignment proposals.
Level 3: AIs build aligned successors. Here, we have an aligned AGI that we entrust with building a successor. At this point, the current aligned AGI has to do all the alignment research required to ensure that its successor is aligned.

Mostly I have been thinking about Levels 1 and 2, although some people I spoke to (e.g. Richard Ngo) were more focused on Level 3.

Current state of automating alignment

At the moment, we are firmly at Level 1. Models can produce similar-sounding ideas when prompted with existing ideas, and are pretty good at completing code, but are not great at summarizing or explaining complex ideas. Tools like Loom and Codex can provide speed-ups but seem unlikely to be decisive.

Whether we get to Level 2 soon, and whether Level 2 is already beyond the point at which AI systems become dangerous, are key questions on which researchers disagree.

Key disagreements

Generative models vs agents

Much of the danger from powerful AI systems comes from them pursuing coherent goals that persist across inputs. If we can build generative models that do not pursue goals in this way, then perhaps these will provide a way to extract intelligent behavior from advanced systems safely.

Timing of emergence of deception vs intelligence

Related to the problem of agents, there is also disagreement about whether we get systems that are intelligent enough to be useful for automating alignment before they are misaligned enough (e.g. deceptive or power-seeking) to be dangerous. My understanding is that Nate Soares and Eliezer Yudkowsky are quite confident that useful intelligence arrives only after systems are already misaligned, whereas most other people are more uncertain about this.

The ‘hardness’ of generating alignment insights

This could be seen as another framing of the above point: how smart does the system have to be to do useful, original thinking for us? Does it have to have a comprehensive understanding of how minds in general work, or can original insights be generated by cleverly combining John Wentworth posts, or John Wentworth posts with Paul Christiano posts?

The benefits (in terms of time saved) of Level 1 interventions

It is unclear how much time is saved by Level 1 interventions: if all alignment researchers were regularly using Loom to write faster, brainstorming with GPT-3, and coding with Copilot, would this result in an appreciable speed-up of alignment work?

Summaries of viewpoints on automating alignment research

Below, I summarize the positions of various alignment researchers I have spoken to about this topic. Where possible, I have had the people in question review my summary to ensure I am not misrepresenting them too badly.

Nate Soares (unreviewed)

Solving alignment requires understanding how to control minds. If we want AI systems to solve the hard parts of alignment for us, then necessarily they will understand how to control minds in a way that we do not. Understanding how to control minds requires a ‘higher grade of cognition’ than most engineering tasks, and so a system capable enough to solve alignment is also capable of doing many dangerous things (we cannot teach AIs to drive a blue car without them also being able to drive red cars). The good outcomes that we want (complete, working alignment solutions) are a sufficiently small target that we do not know how to direct a dangerous AI towards that outcome: doing this safely is precisely the alignment problem, and so we have not made our task meaningfully easier.

You don’t get around this by saying you’re using a specific architecture or technique, like scaling up GPTs. You are trying to channel the future into a specific, small target – a world where we have ended the acute risk period from AGI and have time to contemplate our values or have a long reflection – and this channeling is where the danger lies.

You can maybe use models like GPT-3 or similar to help with brainstorming or summarizing or writing, but this is not where most of the difficulty or speed-up comes from. If your definition of ‘automating alignment’ includes speeding up alignment researchers running experiments then Codex already does this, but this does not mean that we will solve alignment in time.

John Wentworth (reviewed)

In theory, we know of one safe outer objective for automating alignment research: simulate alignment researchers in some environment. However, there are many issues with this in practice. For example, if you want to train a generative model on a bunch of existing data and use this to generate a Paul Christiano post from 2040, it needs to generalize extremely well to faithfully predict what Paul will write about alignment in 2040. However, we also need to avoid having it predict (perhaps accurately) that the most likely outcome is that there is an unaligned AGI in 2040 faking a Paul post.

In general, when you move away from ‘just simulating’ people to something else that applies more optimization pressure, things fail in subtle ways. If we are pretty close to solving alignment already, then we don’t have to apply too much optimization pressure – going from a 50% chance of solving the problem in time to a 100% chance is just 1 bit of optimization, but going from 1 in a million to 100% is a much harder task and is much more dangerous.
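The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, assuming that ‘bits of optimization’ means log2 of the factor by which the probability of the target outcome is raised (an assumption that matches the 1-bit figure above); the function name `bits_of_optimization` is hypothetical and not from the original discussion.

```python
import math


def bits_of_optimization(p_before: float, p_after: float = 1.0) -> float:
    """Bits needed to raise an outcome's probability from p_before to p_after.

    Assumes bits are counted as log2(p_after / p_before); this definition is
    an illustrative assumption, not something stated in the post.
    """
    if not (0.0 < p_before <= p_after <= 1.0):
        raise ValueError("expected 0 < p_before <= p_after <= 1")
    return math.log2(p_after / p_before)


# Going from a 50% chance of success to certainty: one halving of the
# space of outcomes, i.e. 1 bit.
print(bits_of_optimization(0.5))

# Going from 1 in a million to certainty: roughly 20 halvings.
print(bits_of_optimization(1e-6))
```

Under this definition, the one-in-a-million case needs about 20 bits, i.e. roughly twenty times as many halvings of the outcome space as the 50% case.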

It is very hard to know how close we are to solving alignment because the alignment community is still quite confused about the problem. The obvious way to reduce the amount of optimization pressure we need to apply is to do more alignment research ourselves, so that the gap between the starting point of optimization and the goal is smaller. The optimization we apply by directly doing alignment research is safer insofar as we have introspective access to the processes that produce our insights, and can check if we expect them to reliably lead to good outcomes.

Some AI-assisted tools like autocomplete or improved Google Scholar could be useful, but the bottom line is that we can’t really have the AI do the hard parts without confronting the problems arising from powerful optimization.

One possible way to get around these problems is to leverage the safety of ‘simulate alignment researchers in a stable environment’ by running this safe simulation very fast. If we had arbitrary technical capabilities at our disposal, this might work. However, our current technology, generative models, would not work even if scaled up. This is because they make predictions about a conditional world, not a counterfactual world. What we really want is to put our alignment researchers in a counterfactual world where ‘unaligned AGI takes over’ is much less likely but people are still working on the alignment problem. This would mean that when we ask for a Paul post from 2040, we get a Paul post that actually solves alignment rather than one that was written by an unaligned AGI.

Evan Hubinger (reviewed)

Generative models can be very powerful, and constitute a type of intelligence that is not inherently goal-directed. GPT-3 provided evidence that not every intelligent system is (or approximates) a coherent agent. If we can be sure that we have built a powerful generative model (and not a system that appears to behave like a generative model during training) then we should be able to get it to safely and productively produce alignment research.

The hard part is ensuring that it really is a generative model – i.e. that it really is just simulating the processes that generated its training data. Inner alignment is the main problem in this framing: there may be pressures in the training process that mean systems that get sufficiently low loss on the training objective no longer act as pure simulators and instead implement some kind of consequentialism.

Ethan Perez (reviewed)

We should be trying to automate alignment research with AI systems. It’s not clear that getting useful alignment work out of AI systems requires levels of intelligence that are necessarily misaligned or power-seeking. It’s not clear in which order ‘capable of doing useful stuff’ and ‘deceptively aligned’ arise in these systems – current models can talk competently about deception but are not themselves deceptive. It remains to be seen whether building assistants that can help solve the alignment problem is easier or harder than directly building an alignment strategy that holds all the way to superintelligence.

However, we don’t currently know the best way to use powerful AI systems to help with alignment, so we should be building lots of tools that can have more powerful AI ‘plugged in’ when it is available. We should be a little careful about building a tool that is also useful for capabilities, but capabilities researchers don’t pay as much attention to the alignment community as we sometimes imagine, similar ideas are already out there, and we can capture a lot of value by building it early.

Richard Ngo (edited and endorsed by Richard)

Having AI systems help with alignment in some capacity is an essential component of the long-term plan. The most likely path to superintelligence involves a lot of AI assistance. So ‘using AIs to align future AIs’ is less of a plan than a description of the default path – the question is which alignment proposals help most in aligning the AIs that will be doing that later work. I feel pretty uncertain about how dangerous the first AIs capable of superhuman alignment research will be, but tend to think that they’ll be significantly less power-seeking than humans are.

It’s hard to know in advance specifically what ‘automating alignment’ will look like, beyond taking our best systems and applying them as hard as we can, so the default way to make progress here is just to keep doing regular alignment research to build a foundation from which we can start automating earlier. For example, if mechanistic interpretability research discovers some facts about how transformers work, we can train on these and use the resulting system to discover new facts.

Conclusion

There is no consensus on how much automating alignment research can speed up progress. In hindsight, it would have been good to get more quantitative estimates of the speed-up each person expected to be possible. There seems to be sufficient uncertainty that investigating the possibility further makes a lot of sense, especially given the current lack of clear paths to an alignment solution. In future posts I will aim to go into more detail on some proposed mechanisms by which alignment could be accelerated.