Disagreements about Alignment: Why, and how, we should try to solve them
TL;DR
AI Alignment does not follow the standard scientific method.
Therefore, we might not agree on which research agendas have the potential to align AGI systems.
We should work intentionally to build consensus around the foundations of different research agendas.
Ways to do this might include stating the assumptions behind a given framework more clearly, putting further focus on distillation, red-teaming certain frameworks, or holding conferences on the foundations of different frameworks.
Introduction
I often see the phrase “AI Alignment is pre-paradigmatic” used within the community, suggesting that at some point AI Alignment will exist as some clear paradigm. To me, this implies that some of the current, core disagreements around the foundations of the field will eventually clear up, and that there will be some degree of understanding around the different approaches towards Alignment, how they fit together, and the assumptions behind them.
I don’t think it is controversial to state we are not currently in this position as a field. This is exemplified by posts such as Soares’ “On how various plans miss the hard bits of the alignment challenge”, the MIRI conversations, and (to an extent) Christiano’s “Where I agree and disagree with Eliezer”.
The core idea of this post is that I don’t think that we should necessarily expect the field to deal with these internal disagreements without intentional work to do so. We want to be sure that we have aligned AGI, not just aligned it within some framework and hoped for the best. I think this means that there should be more work trying to understand the assumptions made by different researchers, combined with precise methods to reason about these. This reminds me of the “rowing and steering” forum post—trying as hard as we can within a framework is different to trying to decide which frameworks are built on solid foundations, and thus which we should be trying to pursue.
The main thrust of this post is to explain why I think the Alignment community should dedicate at least some resources to “steering” as opposed to “rowing”, and to give some ideas as to how we could go about doing this. I will also briefly consider why we might not want to do this.
Thanks to Holly Muir, Hanna Palya, Anson Ho, and Francis Priestland for feedback on previous drafts of this post.
Epistemic Status: I wrote this post whilst trying to form better inside views about AI Safety. I’m still forming these views, and so I’m pretty uncertain about this post, and there’s a good chance I’ve misunderstood something that significantly impacts my final conclusion.
Clarifying What Steering Should Aim Towards
Throughout this post, I will be talking about AI Alignment, i.e. trying to build AI systems that do what we actually want them to do. Note that this is a subset of AI Safety as a whole: solving Alignment does not ensure AI will be used to the benefit of humanity, but it is a necessary part of this.
Also note that I will be talking about AGI systems here, which I will define as AI systems which are at least as capable as humans across a range of domains. I think these arguments will also apply to systems which are more narrowly capable than these, but this is the case I will focus on for now. I have chosen to focus on AGI systems because I think they are a major source of existential risk (I will make this assumption without justification, see here or here).
Now, what is it that I want the field to look like? In previous drafts I used the term “paradigmatic”, but I think this implies a narrower and more rigid framework than is desirable. I do not think that the field would be in a better position if everybody was pursuing similar research directions, and that is not what I’m advocating for.
What am I advocating for, then?
Work that better understands the underpinnings of various research agendas.
Work that better understands the theory of change behind different research agendas.
Work that tries to build a consensus around claims like “if research agenda X is successful, it will have meaningfully contributed to aligning an AGI system”. This does not mean that there will be consensus around “this research agenda is the most promising” or “this research agenda is the most likely to be successful”.
So, when I talk about “steering”, this is the kind of work I am referring to. (For a comparison, “rowing” might look something like “pursuing work within a given research agenda”.) I’ll sometimes refer to a field where much of the above work has been completed as “sound”.
I will now give two arguments for why steering might be unnecessary, and then present counterarguments to these.
Rebuttal 1: Steering seems to happen naturally in Science
In Science, frameworks which tell two different stories about reality do not tend to exist in harmony. If two different theories, research agendas or paradigms (we will use the catch-all term “framework” for these) produce different predictions about the outcome of an experiment, then we can simply conduct that experiment. Although no single failed or faulty experiment is usually enough to discredit a framework, over time a body of evidence is collected which indicates that a framework is no longer plausible (“the lens must have been warped this time, and the measurements must have been faulty this next time, and …” spins an increasingly unlikely yarn). Eventually, scientists reject one framework in favour of the other, and converge on the questions, methodologies, and research promoted by the now-dominant framework.
A classic example of this is the paradigm shift from Classical Mechanics to Quantum Mechanics (this Wikipedia page has more details). In the late 19th Century, phenomena such as black-body radiation and the photoelectric effect were observed which produced different outcomes than Classical Mechanics would have predicted. A new worldview which could explain these experiments began to develop, in the guise of Quantum Mechanics. Quantum Mechanics was ultimately able to predict the outcomes of phenomena at a microscopic level much better than Classical Mechanics, and thus it became the new dominant framework for reasoning about the world at small scales.
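To make the discriminating experiment concrete, here is the standard textbook comparison for black-body radiation (my own illustrative addition, not drawn from the post’s sources): the classical Rayleigh-Jeans prediction for spectral radiance grows without bound at high frequencies (the “ultraviolet catastrophe”), whereas Planck’s quantum prediction stays finite and matches what is actually measured.

```latex
% Classical (Rayleigh-Jeans) prediction: grows without bound as \nu \to \infty
B_\nu^{\mathrm{classical}}(T) = \frac{2 \nu^2 k_B T}{c^2}

% Quantum (Planck) prediction: finite at all frequencies, matching the observed spectrum
B_\nu^{\mathrm{quantum}}(T) = \frac{2 h \nu^3}{c^2} \, \frac{1}{e^{h\nu / (k_B T)} - 1}
```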
Although there is much disagreement over the precise nature of the Scientific Method, the basic outline above is much less controversial. Theories and paradigms make predictions, experiments provide evidence, and this evidence leads to us updating our beliefs. Over time, dominant frameworks which best explain the available evidence emerge, and we can dismiss frameworks which make faulty predictions. Moreover, this process happens naturally through the standard pursuit of Science: by producing new theories and conducting experiments, we cannot help but dismiss those frameworks which inaccurately describe the world.
Given that this is how Science works, why should we expect AI Alignment to be any different? Maybe we should expect to converge on correct frameworks simply by pursuing Alignment research, without having to make any intentional efforts to do so.
Counter-Rebuttal 1: Alignment may not be a typical Science
Let’s accept that this is how Science works and that it has been very successful at providing increasingly accurate depictions of the world. But what happens in the situation where opposing paradigms differ only on predictions about things which we can’t verify? This would make it very difficult to converge on ‘correct’ strategies.
Unfortunately, this is exactly the problem that different strategies for aligning AGI systems face. We do not currently have access to AGI systems, and as such we cannot get direct experimental evidence about whether Alignment strategies will work for these systems or not. Another complication is that we might never get the opportunity to gather this experimental evidence, as we might have to get Alignment right on our first try with an AGI system (as evidence of this, Paul Christiano states that he agrees with Eliezer about this here, at the start of his first disagreement). This means that in the worst-case scenario, we won’t get any experimental evidence at all about assumptions like “interpretability won’t be possible once a system is past human capabilities” or “it is in the interest of all AGI systems to resist being turned off”.
A really important point here is that there is substantial disagreement about the extent to which we can get experimental evidence about Alignment strategies for AGI systems from current systems, or from systems which are more capable than we currently have but are not at the level of AGI. The most optimistic perspective would be that if we can align sub-AGI systems, then we can align AGI systems, in which case Alignment works just like a typical Science, and we should expect to easily form a consensus around Alignment strategies without any “steering”. The most pessimistic perspective is that an Alignment strategy proving successful on sub-AGI systems gives us no information at all about its chances on AGI systems.
I imagine the above positions as two ends of a spectrum, with Alignment researchers falling somewhere between them. I think this is perhaps the biggest barrier we currently face to the field becoming sound: there are big differences of opinion on this question, and those differences mean we could end up with more “optimistic” researchers who think they have solved Alignment on the basis of evidence from a sub-AGI system, while more pessimistic researchers remain unconvinced. As such, I think clarifying the extent to which Alignment is a typical Science is exactly the kind of thing we should be intentionally investigating as a field, and hence is exactly the kind of steering we need to be doing.
Rebuttal 2: Maybe different epistemic strategies are enough
Let’s say that I accept the argument that Alignment is not a typical Science, and that we cannot rely on experimental evidence alone to make us confident ahead of time that a given strategy will align an AGI system. This takes us naturally to the work of Adam Shimi, who has noted that the peculiar challenge of Alignment means that we will need more than just the toolkit of Science to help us (Adam lays out some of the consequences of this conclusion in his post “On Solving Alignment Problems before they Appear: the Weird Epistemologies of Alignment”). It also suggests that AI Alignment needs different “epistemic strategies”: specific, agreed-upon methods of producing knowledge to which everyone in the field subscribes. In Mathematics, the core epistemic strategy is proof; in Science, Shimi claims that the epistemic strategy is the cycle of “modelling, predicting and testing”. When I refer to the “toolkit of Science” or “how Science operates”, this is the epistemic strategy that I have in mind.
Shimi’s corresponding sequence provides a good overview of how different Alignment strategies provide information via different epistemic strategies. An example of an epistemic strategy in Alignment, which Shimi discusses in more detail here, is John Wentworth’s work on Selection Theorems. Shimi explains that Wentworth’s Selection Theorems generate knowledge about Alignment by specifying some combination of a selection pressure (such as natural selection) and an environment (the world), and then proving results about the kinds of agents that will arise from this combination.
This epistemic strategy has had success: the likes of Alex Turner have used it to show that issues such as instrumental subgoals, power-seeking, and incorrigibility are problems any Alignment strategy will need to tackle in certain domains. Other examples of epistemic strategies currently used by the field would be: better understanding the inner workings of Neural Networks in order to better understand their behaviour and whether they are “aligned” (interpretability); trying to simultaneously generate and analyse potential Alignment strategies, such as attempts to elicit latent knowledge (I think of this as representing Paul Christiano’s wider methodology); and analysing the complications that arise when an agent is embedded in its own environment (embedded agency).
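As a toy illustration of the flavour of these results (this is my own sketch, not Turner’s formalism), consider an environment where one action commits to a single outcome and another keeps three outcomes available. If the reward of each outcome is drawn at random, the optimal agent usually takes the option-preserving action, simply because a larger option set is more likely to contain the highest-reward outcome: a crude, minimal version of “optimal policies tend to seek power”.

```python
import random

# Toy illustration (not Turner's actual theorems): terminal states reachable
# from each initial action. "narrow" commits to one outcome; "broad" keeps
# three outcomes available.
OPTIONS = {"narrow": ["A"], "broad": ["B", "C", "D"]}

def optimal_action(rewards):
    """Return the action whose best reachable terminal reward is highest."""
    return max(OPTIONS, key=lambda a: max(rewards[s] for s in OPTIONS[a]))

def fraction_preferring_broad(n_samples=100_000, seed=0):
    """Estimate how often the 'broad' (option-preserving) action is optimal
    when terminal rewards are drawn i.i.d. uniformly on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        rewards = {s: rng.random() for s in "ABCD"}
        if optimal_action(rewards) == "broad":
            hits += 1
    return hits / n_samples

if __name__ == "__main__":
    # Analytically this is 3/4: "broad" is optimal whenever the best of
    # {B, C, D} beats A, which happens for 3 out of 4 symmetric draws.
    print(f"fraction preferring 'broad': {fraction_preferring_broad():.3f}")
```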
So, by using different epistemic strategies, it is possible to generate knowledge about AGI systems. Maybe simply using these epistemic strategies will be enough to help us better analyse the different assumptions made by different research agendas, and hence help to push the field to reach a common understanding of the underpinnings of different research agendas?
Counter-Rebuttal 2: We can use different epistemic strategies without learning about their validity
I think the core reason why using different epistemic strategies might be insufficient to make the field “sound” is that there is a difference between generating knowledge using an epistemic strategy, and understanding the underpinnings behind an epistemic strategy. This is in stark contrast to Science, where we don’t need to analyse the underpinnings behind a framework: we can see whether it works or not, by looking at what it predicts!
In Engineering, if we can build something and see that it works perfectly, then we don’t need to worry about our underlying assumptions: the artefact demonstrably works. It is the same in Science: if our model makes some underlying assumptions but accurately predicts reality, we don’t have to worry about the validity of those assumptions. If a method in Science appears to work, then for all practical purposes it does work.
However, depending on the extent to which one thinks Alignment is not a typical Science, this may not hold for Alignment. That is, potential Alignment strategies generated by an epistemic strategy come with no guarantee that they will work, because we can’t test them out on the real thing (an AGI system): we have to analyse the epistemic strategy itself to check for potential issues.
Thus, I think that we will have to be intentional as a field about ensuring we have a firm understanding of different epistemic strategies and the assumptions behind them, since we perhaps cannot expect to fully understand the limitations of an epistemic strategy just by trying to make progress using it.
This is captured by Shimi in his work on “Breaking Epistemic Strategies”, in which he analyses where certain epistemic strategies can go wrong. When he applies this to John Wentworth’s Selection Theorems, he produces a list of questions that need to be pursued to ensure that the behaviour suggested by the Selection Theorems is representative of real-world applications. For example, maybe we have proved some theorems in an idealised environment, but on further inspection these don’t apply to the real world? Or maybe we have shown that the behaviour arises under some definition of “most of the time” that isn’t accurate?
Another example (which is my understanding of Ethan Perez’s theory of change) might be the strategy of trying to build an aligned system using interpretability, red-teaming and scalable oversight, alongside using a method like IDA to build successively more powerful systems which are hopefully still aligned. (As an aside, an example of scalable oversight would be trying to verify the answers of a “Maths AGI” by getting it to output the steps in its reasoning: each step could be checked cheaply, which means we can verify answers no matter how difficult the question becomes. A toy sketch of this step-checking idea follows below.) Here, developing interpretability tools and trying to implement IDA would be examples of trying to advance the research agenda, but they wouldn’t help us analyse the foundations of the framework and understand whether it would actually create an aligned AGI.
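To make the step-checking aside slightly more concrete, here is a minimal sketch (entirely my own toy example; the function names and the format of the “reasoning steps” are invented for illustration, and a real scalable-oversight setup would be far more involved). The untrusted “solver” outputs its answer as a chain of simple arithmetic claims, and a cheap, trusted checker verifies each claim independently, so the cost of verification stays low even if producing the answer was hard.

```python
# Toy sketch of step-by-step verification (hypothetical format, not a real
# oversight protocol): the untrusted solver outputs each reasoning step as an
# equality between two simple arithmetic expressions, e.g. "17*23 = 391",
# and a cheap checker verifies every step independently.

def _evaluate(expression: str):
    """Evaluate a simple arithmetic expression.
    Toy only: eval with empty builtins is NOT a safe sandbox in general."""
    return eval(expression, {"__builtins__": {}}, {})

def verify_steps(steps):
    """Check every claimed equality; reject the whole answer if any step fails."""
    for step in steps:
        lhs, rhs = step.split("=")
        if _evaluate(lhs) != _evaluate(rhs):
            print(f"step failed verification: {step!r}")
            return False
    return True

if __name__ == "__main__":
    claimed_steps = ["17*23 = 391", "391 + 9 = 400", "400 - 150 = 250"]
    print("all steps verified:", verify_steps(claimed_steps))
```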
What I’m trying to highlight with this example is that simply doing work using an epistemic strategy is not the same as trying to analyse it and understand whether it actually provides useful information about aligning AGI systems. To do that, we have to be intentional about analysing these different frameworks.
What Should Steering AI Alignment Look Like?
So, to summarise, I think we cannot be confident that the field of Alignment will become “sound” without intentional work on behalf of the community to make this happen. If Alignment were simply a Science, this would be fine: eventually, different beliefs and assumptions would predict different outcomes for a given experiment, helping us discard less promising epistemic strategies and eventually converge on a consensus about different frameworks. However, as we’ve discussed, Alignment is potentially not just a Science, and we might not expect to be able to narrow down these frameworks via business as usual.
In an attempt to better understand the foundations of Alignment, here are some potential ideas for ways the field could try to “steer” more intentionally, as opposed to just “rowing” and pursuing certain frameworks. This is by no means exhaustive, and I think coming up with a great idea here is the crux of the problem.
Clearly Stating our Frameworks
Maybe instead of just trying to build up within a given framework and theory of change, we should be making more of an effort to test the foundations. From my understanding, adding Dan Hendrycks’ X-Risk Sheets as addendums to ML papers is a nice start to getting researchers to explain their theories of change, but I think we could go much further.
Neel Nanda’s “A Longlist of Theories of Impact for Interpretability” seems like an example of the kind of thing I’m thinking about here, but it could go even further. As a very crude example, take his example 4, “Auditing for deception”: maybe an idealised version of this document would spell out the assumptions behind why such auditing would still be possible for an AGI system trying to hide its deception.
I think the current trend of looking for distillations of existing research could also be really useful here: being able to refine the assumptions and theory of change of a research agenda could make them much easier to analyse.
Red-Teaming
The classic EA approach would be to run red-teaming competitions or projects around certain frameworks. This might take the form of examining the theory of change of a framework, and then trying to make the best possible case for why it might not work.
What makes this slightly different to normal red-teaming is that the process might still look like technical research. Work such as Stuart Armstrong’s “Humans can be assigned any values whatsoever” on Ambitious Value Learning seems like a great example: it suggests that one of the key assumptions of ambitious value learning (that it is possible under what might have seemed like reasonable assumptions) might be incorrect!
This approach also has the advantage that it doesn’t use up much of the time of those leading the work being examined: most theories of change should be fairly public, so perhaps we can structure these exercises so that the leading researchers of a research agenda are only consulted if a red team has been “successful” in finding a flaw that they think merits further discussion.
However, this might be difficult to execute: how many people fully understand, for instance, Paul Christiano’s research agenda? This would become easier if the frameworks of different researchers were more clearly stated, but it might still be hard to do in practice.
Informal Conversations Between Leading Researchers
Currently, this seems to be a leading method for trying to examine the assumptions of different frameworks and paradigms, with the MIRI conversations being a prominent example (note: I haven’t read these in their entirety). This seems like a great step in the direction of “trying to analyse the assumptions of different frameworks”, and I think their existence shows that the community agrees that this process is important. I think the amount of work that has gone into these conversations, and into the constant discussions in comment threads that follow them, shows that researchers think they provide at least some value to the field as a whole.
However, something about this process as it is currently done seems pretty sub-optimal to me. For starters, if we want new researchers coming into the field to be able to form their own opinions, having them trawl through the comments section of Less Wrong to find the best arguments for and against the assumptions of some framework seems like a huge barrier to entry. It seems pretty plausible that someone in this situation might turn to outputs published in more typical places such as ML journals etc., where this kind of debate is much less prominent, steering them away from these important questions.
Generating More Research Agendas
If we are inherently uncertain about many of our research agendas, then it makes sense to try to diversify our approaches to Alignment, or at least think about what proportion of our “research budget” should go towards exploring different frameworks vs exploiting within a framework. I’m not sure what portion of our resources should be directed to this, but it seems like trying to invent new research agendas should be an important part of our efforts in “steering”.
Adam Shimi’s recent post on the ethos behind Refine, his new research incubator, seems like a great example of this. My understanding is that Refine is trying to help researchers create their own potential paradigms, not just work within an existing one. This is great, although it will be interesting to see how successful they are, as starting new research agendas could prove a very difficult, perhaps intractable, undertaking.
Journals or Newsletters
Related to the issues with “steering” through only informal conversations, I think that having a more systematic account of the foundations of different frameworks seems really useful. I feel like the Alignment Newsletter is a great step in this direction, synthesising formal publications with Less Wrong posts. Although the MIRI conversations were covered, I think there is space for them to cover this kind of work more explicitly.
Maybe a “Meta-Alignment Newsletter” would be useful here? Alternatively, some kind of journal where “steering” arguments are covered could serve a similar purpose. It is worth noting that there is a risk one person or group’s views become overly dominant in this process of “steering”, which could be exacerbated if there is a single dominant journal/newsletter in this space.
I don’t particularly like any of the exact ideas in this subsection, but what I am trying to point to is that the field could do much better if there were stronger norms and processes around distilling the assumptions behind research agendas than currently exist. This could then lead to more ambitious work to analyse these research agendas, but the first stage is simply teasing these assumptions out, as clearly as possible, from current frameworks.
Conferences
One idea I’m not convinced is good, but which may be worth trying, is an annual conference where AI researchers go to examine the assumptions their frameworks are based on. This could also function as a workshop, where researchers question the underlying assumptions of their favoured framework. It seems possible that some people are just really good at finding assumptions in models and challenging them, in which case they could prove very valuable in running these workshops.
I think that the kinds of questions we should be looking for here are not “can we find some common ground where we all agree”, but instead “can we pinpoint why you, Mr Christiano, think ELK would work in the best case scenario? Can anybody clearly pinpoint which aspects of this they might disagree with?”. Whether this would be productive, or just lead to impasses and frustration, is not at all clear to me.
An example of a similar endeavour from another field could be Polymath projects. Here, different researchers in Pure Mathematics came together to collaborate on difficult problems, and they had some success in solving previously intractable problems.
What Happens if We Fail?
As an aside, for those who are likely to be serious players at AGI crunch time, it might be worth thinking about how to proceed if there is no consensus around which research agendas are “sound” when that time arrives. Although this would be a highly undesirable position to be in, we may still end up there, in which case we will need some method for deciding which strategy to pick when nobody can agree on one. Drawing from decision-making in other fields which cannot rely solely on Science and are not assembled from clear paradigms, such as International Relations, might be useful here.
Drawbacks to Steering
Despite the arguments listed above, I do think there are some genuine problems associated with trying to intentionally shape the field by investigating its assumptions (“steering”).
I think a serious problem would arise if, by being unreasonably critical about different research directions, we ruled out epistemic strategies that would otherwise be promising. This could be especially problematic if there is a large amount of deference occurring during the process of steering, because it increases the chance that a single person’s poor judgement rules out some epistemic strategy. Some risk of this is perhaps unavoidable if we make any serious attempt at steering (there is always some risk that the community unwisely rules out a promising approach), but it is certainly something we should be mindful of and try to minimise.
A potential issue, depending on how similar Alignment is to a typical Science, is that it may be much easier to agree on whether a given Alignment strategy has solved the problem once we have developed it and tried to use it on an AGI system, but before deployment. For example: maybe you think some combination of interpretability and adversarial training is unlikely to work ahead of time, but then you try it and become convinced it has been successful. If this is true, then steering ahead of time might not make sense: it might just make sense for researchers to “row” forwards and pursue their research agendas, then try to convince others once we are near crunch time. However, I think even in this scenario it might make sense for there to be some initial consensus building, so that the field can better agree on what form a proposed solution might take.
I have the impression that many working in the field think that laying out firm foundations might simply be too difficult a problem. They might suggest that it is intractable to try to find promising research agendas with firm theoretical footing, and hence that it doesn’t make sense to put effort into steering. I think this could be true in the strongest sense (maybe we can’t get a promising research agenda that is perfectly sound), but maybe not in weaker senses (maybe we can at least analyse the foundations of different research agendas). My current belief is that the problem is tractable enough to warrant some attempts at steering, but it is worth bearing in mind that this might not be the case.
A related problem is that every researcher trying to investigate the foundations of different research agendas is a researcher who is not working on trying to create Alignment strategies. If all or most prominent research agendas are sound already, then intentional work on investigating foundations just (counterfactually) slows down progress on Alignment. Although I think this is a real cost, I think it is mitigated by a variety of factors. Firstly, discovering that a research agenda is based on false premises means we can redirect its researchers towards sounder research agendas. Secondly, we might be able to do steering effectively without using the top researchers in the field. Thirdly, I imagine more clarity about the foundations of the field could help research within an agenda, although I’m not certain about this.
Concluding Thoughts
In short, I don’t think that the disagreements that are currently widespread in Alignment will resolve themselves as we might expect them to in other scientific fields, because we cannot just rely on the power of Science. I also don’t think we should be trying to end all disagreements: every healthy field contains researchers with different intuitions, beliefs and research agendas.
However, I think it is necessary for the community to have some plan to resolve these disagreements where they concern the foundations of the field. I don’t think we can just expect this to happen without intentional effort, and I don’t think we can just ignore it. I’ll repeat a line from the opening of this post: we want to be sure that we have aligned AGI, not just aligned it within some framework and hoped for the best.
It could be that establishing firm foundations is simply impossible and that when crunch time comes we will have to hope for the best, to an extent. However, I think it’s worth trying to limit how much we will have to rely on optimism, and there are steps we could take towards this that wouldn’t be too costly. Hence, I think the community should seriously consider how we can do this steering in a more methodical and considered way than we currently are.