Epistemological Framing for AI Alignment Research
Introduction
You open the Alignment Forum one day, and a new post stares at you. By sheer luck you have some time, so you actually read it. And then you ask yourself the eternal question: how does this fit with the rest of the field? If you’re like me, your best guess comes from looking at the author and some keywords: this usually links the post with one of the various “schools” of AI Alignment. These tend to be affiliated with a specific researcher or lab—there’s Paul Christiano’s kind of research, MIRI’s embedded agency, and various other approaches and agendas. Yet this is a pretty weak understanding of the place of new research.
In other fields, for example Complexity Theory, you don’t really need to know who wrote the paper. It usually shows a result of one of a few types (lower bound, completeness for a class, algorithm, ...), and your basic training in the field has armed you with mental tools for interpreting results of that type. You know the big picture of the field (defining and separating complexity classes), and how each type of result is linked to it. Chances are that the authors themselves called on these mental tools to justify the value of their research.
In the words of Thomas S. Kuhn, Complexity Theory is paradigmatic and AI Alignment isn’t. Paradigms, popularized in Kuhn’s The Structure of Scientific Revolutions, capture shared assumptions on theories, interesting problems, and evaluation of solutions. They are tremendously useful to foster normal science, the puzzle-solving activity of scientists; the paradigm carves out the puzzles. Being paradigmatic also makes it easier to distinguish what’s considered valuable for the field and what isn’t, as well as how it all fits together.
This list of benefits has logically pushed multiple people to argue that we should make AI Alignment paradigmatic.
I disagree. Or to be more accurate, I agree that we should have paradigms in the field, but I think that they should be part of a bigger epistemological structure. Indeed, a naive search for a paradigm results either in a natural-science-like paradigm that puts too little emphasis on applications and usefulness, or in a premature constraint on the problem we’re trying to solve.
This post instead proposes a framing of AI Alignment research which has a place for paradigms, but isn’t reduced to them. I start by stating this framing, along with multiple examples in each of its categories. I then go back to the two failure modes of naive paradigmatism I mentioned above. Finally, I detail how I intend to falsify the usefulness of this framing through a current project to review important AF posts.
Thanks to Joe Collman, Jérémy Perret, Evan Hubinger, Rohin Shah, Alex Turner and John S. Wentworth for feedback on this post.
The Framing
Let’s start by asking ourselves what different sorts of progress one could make in AI Alignment. I see three categories in broad strokes (I’ll give examples in a minute).
Defining the terms of the problem
Exploring these definitions
Solving the now well-defined problem
I expect the first and third to be quite intuitive—define the problem and solve it. On the other hand, the second might feel redundant. If we defined the problem, the only thing left is to solve it, right?
Not in a world without logical omniscience. Indeed, the definitions we’re looking for in AI Alignment are merely structures and premises; they don’t give all their consequences for free. Some work is needed to understand their implications.
Let’s get slightly less abstract, and try to state the problem of AI Alignment: “Make AIs well-behaved”. Here “AIs” and “well-behaved” are intentionally vague; they stand for “AI-related systems we will end up building” and “what we actually want them to do”, respectively. So I’m just saying that AI Alignment aims to make the AIs we build do as we wish.
What happens when we try to carve research on this abstract problem along the three categories defined above?
Research on the “AIs” part
(Defining) Clarifying what “AI-related systems we will end up building” means. This basically amounts to making a paradigm for studying the AIs we will most probably build in the future.
Note that such a paradigm is reminiscent of the ones in natural sciences, since it studies an actual physical phenomenon (the building of AIs and what they do, as it is done).
Examples include: Timelines research, like Daniel Kokotajlo’s posts.
(Exploring) Assuming a paradigm (most probably deep learning these days), this is normal science done within this paradigm, which helps us understand the aspects of it deemed relevant for AI Alignment.
Examples (in the paradigm of deep learning) include: Interpretability work, like the Circuits work done by the Clarity team at OpenAI.
Work on understanding how training works, like this recent work on SGD
Research on the “well-behaved” part
(Defining) Clarifying what “what we actually want them to do” means; that is, building a paradigm that makes clear what the end goals of alignment are. In general, I expect a globally shared paradigm here too, with individual researchers championing specific properties among all the ones promoted by the paradigm.
Note that such a paradigm is reminiscent of the ones in theoretical computer science, since it studies a philosophical abstraction in a formal or semi-formal way.
Examples include: Defining Coherent Extrapolated Volition as an abstraction of what we would truly want upon reflection.
Defining HCH as an abstraction of considered judgment.
Defining catastrophic consequences through attainable utility.
(Exploring) Assuming a paradigm (or at least the part of the paradigm focused on a specific property), this is normal science done by extending and analyzing this property.
Examples include: Assuming “well-behaved” includes following considered judgment, work on exploring HCH, like these two posts.
Assuming “well-behaved” includes being a good embedded agent, work on exploring embedded agency, like the papers and posts referenced in the Embedded Agency sequence.
(Solving) Assuming a paradigm for “AIs” and a paradigm for “well-behaved”, research on actually solving the problem. This category is probably the most straightforward, as it includes most of what we intuitively expect in AI Alignment research: proposals for alignment schemes, impossibility results, critiques of schemes, ...
Examples include: Assuming “AIs” means “Deep Learning models for question answering” and “well-behaved” means “following HCH”, IDA is a proposed solution.
Assuming “AIs” means “deep RL systems” and “well-behaved” means “coherent with observed human behavior”, an impossibility result is the well-known paper on Occam’s razor and IRL by Stuart Armstrong and Sören Mindermann.
Assuming “AIs” means “Embedded Agents” and “well-behaved” means “deals with logical uncertainty in a reasonable way”, logical inductors are a proposed solution.
Note that this framing points towards some of the same ideas as Rohin’s threat models (I wasn’t aware of them before Rohin’s pointer in an email). Basically, Rohin argues that a model on which to do AI Alignment research should include both a development model (what AI will look like) and a risk model (how it will fail). His issue with some previous work is that it fills in only one of these models, not both. In my framing, this amounts to requiring that work in the Solving category comes with both a model/paradigm of what “AIs” means and a model/paradigm of what “well-behaved” means. As for differences, Rohin focuses on “what goes wrong” (his risk model), whereas I focus on “what we want”.
Going back to the framing, let’s be very clear on what I’m not saying.
I’m not saying that every post or paper falls within exactly one of these categories. The Logical Induction paper, for example, both defines a criterion for the part of “well-behaved” related to embedded logical uncertainty and provides logical inductors to show that it’s possible to satisfy it. Yet I think it’s generally easy to separate the different contributions to make clear what falls into which category. And I believe such explicit separation helps tremendously when learning the field.
I’m not saying that these categories are independent. It’s obvious that the “solution” category depends on the other two; but one can also argue that there are dependencies between studying what “AIs” means and studying what “well-behaved” means. For example, inner alignment only really makes sense in a setting where AIs are models learned through some sort of local optimization process; hence this part of “well-behaved” requires the definition of “AIs” to take a specific form. This isn’t really a problem, though.
I’m not saying that every post or paper falls within at least one category. Some work that we count as AI Alignment doesn’t really fall into any of my categories. The foremost example I have in mind is John’s research on Abstraction. In a way, that is expected: this research pursues a more general idea. It impacts some categories (like what “well-behaved” means), but is more of a fundamental building block. Still, pointing to the categories to which this research applies might help make it feel more relevant to AI Alignment.
I’m not saying that we need to fully solve what we mean by “AIs” and “well-behaved” before working on solutions. Of course work on solutions can already proceed quite usefully. What I’m arguing instead is that basically any work on solutions assumes (implicitly or explicitly) some sort of partial answer to what “AIs” and “well-behaved” mean. And that by stating this assumption out loud, authors would make it easier to understand how their work fits within the field.
I’m not saying that this is the only reasonable and meaningful framing of AI Alignment research. Obviously, this is but one way to categorize the research. We already saw that it isn’t as clean as we might want. Nonetheless, I’m convinced that using it will help make the field clearer to current researchers and newcomers alike.
In essence, this framing serves as a lens on the field. I believe that using it systematically (as readers when interpreting a work and as authors when presenting our own work) would help quite a lot, but that doesn’t mean it should be the only lens ever used.
Why not a single paradigm?
I promised in the introduction that I would explain why I believe my framing is more adequate than a single paradigm. This is because I only see two straightforward ways of compressing AI Alignment into a single paradigm: make it a paradigm about a fundamental abstraction (like agency) that, once completely understood, should make a solution obvious; or make it a paradigm about a definition of the problem (what “AIs” and “well-behaved” mean). Both come with issues that make them undesirable.
Abstraction Paradigm
Paradigms historically come from natural sciences, as perspectives or explanations of phenomena such as electricity. A paradigm provides an underlying theory about the phenomenon, expresses the well-defined questions one can ask about it, and what would count as a successful solution of these questions.
We can also find paradigms about abstractions, for example in theoretical computer science. The current paradigm about computability is captured by the Church-Turing thesis, which claims that everything that can be physically computed can be computed by a Turing Machine. The “explanation” for what computation means is the Turing Machine, and all its equivalent models. Hence studying computability within this paradigm hinges on studying what Turing Machines can compute, as well as other models equivalent to TMs or weaker. (This overlooks the sort of research done by mathematicians studying recursion theory, like Turing degrees; but as far as I know, these are of limited interest to theoretical computer scientists.)
So a paradigm makes a lot of sense when applied to the study of a phenomenon or an abstraction. Now, AI Alignment is neither; it’s instead the search for the solution of a specific problem. But natural sciences and computer science have been historically pretty good at providing tools that make solving complex problems straightforward. Why couldn’t the same be true for AI Alignment?
Let’s look at a potential candidate. An abstraction presented as the key to AI Alignment by multiple people is agency. According to this view, if we had a complete understanding of agency, we wouldn’t find the problem of aligning AI difficult anymore. Thus maybe a paradigm giving an explanation of agency, and laying out the main puzzles following from this explanation, would be a good paradigm of AI Alignment.
Despite agreeing with the value of such work, I disagree with the legitimacy of making it the sole paradigm of AI Alignment. Even if completely understanding something like agency would basically solve the problem, how long will it take (if it is ever reached)? Historical examples in both natural sciences and computer science show that the original paradigm of a field isn’t usually suited to tackle the questions deemed fundamental by later paradigms. And this progression of paradigms takes decades in the best of cases, and centuries in the worst!
With the risk of short timelines, we can’t reasonably decide that this is the only basket in which to put our research eggs.
That being said, this paradigmatic approach has a place in my framing: in defining what “well-behaved” means. The difference is that once a paradigm is chosen, work can proceed within it while other researchers attempt to solve the problem for the current paradigm. There is thus a back and forth between the work within the paradigm and its main application.
Problem Paradigm
If we stretch the term a bit, we can call the assumptions about what “AIs” and “well-behaved” mean a paradigm. Becoming paradigmatic would then mean fixing these assumptions and forcing all work to happen within this context.
That would be great, if only we could already be sure about what assumptions to use. But in the current state of the field, a lot more work is needed (especially for the “well-behaved” part) before anyone can reasonably decide to focus all research on a single such paradigm.
This form of paradigm thus suffers from the opposite problem to the previous one: it fails to value research on the terms of the problem, just to have a well-defined setting in which to make progress. Progress towards what? Who knows…
Here too, this approach has a place in my framing. Specifically, every work in the Solving category exists within such a paradigm. The difference is that I allow multiple paradigms to coexist, as well as research on the assumptions behind these paradigms, which allows for a saner epistemological process.
Where do we go from here?
Multiple voices in AI Alignment push for making the field more paradigmatic. I argue that doing this naïvely isn’t what we want: it either removes the push towards applications and solutions, or fixes the terms of the problem even though we are still so uncertain. I propose instead that we should think about research according to the different parts of the statement “Make AIs well-behaved”: research on what “AIs” we’re talking about, research on what we mean by “well-behaved”, and, based on answers to the two previous questions, research that actually tries to solve the clarified problem.
I believe I argued reasonably enough for you to not dismiss the idea immediately. Nonetheless, this post is hardly sufficient to show the value of adopting this framing at the level of the whole research community.
One way I hope to falsify this proposition is through a project, done with Joe Collman and Jérémy Perret, to review many posts on the AF and see what makes a good review. We plan on trying to use this lens when doing the reviews, to see if it clarifies anything. Such an experiment thus relies on us reviewing both posts that fit the framing quite well and ones that don’t. If you have any recommendations, I wrote a post some time ago where you can give suggestions for the review.
Who? It would be helpful to have some links so I can go read what they said.
In the simple model of paradigms and fields, there is some pre-existing division into fields, and then each field can either be paradigmatic or non-paradigmatic, and if it is non-paradigmatic it can contain multiple paradigms unified in some overall structure, or not. I’d like to go to a more complicated model: There’s a big space of research being done, and some of the research naturally lumps together into paradigms, and we carve up the space of research into “fields” at least partly influenced by where the paradigms are—a sufficiently large paradigm will be called a field, for example. Fields can have sub-fields, and paradigms can have sub-paradigms. Also, paradigmaticness is not a binary property; there are some lumps of research that are in a grey area, depending on how organized they are, how unified and internally self-aware they are in some sense.
On this more complicated (but IMO more accurate) model, your post is itself an attempt to make AI alignment paradigmatic! After all, you are saying we should have multiple paradigms (i.e. you push to make parts of AI alignment more paradigmatic) and that they should fit together into this overall epistemic structure you propose. Insofar as your proposed epistemic structure is more substantial than the default epistemic structure that always exists between paradigms (e.g. the one that exists now), it’s an attempt to make the whole of AI alignment more paradigmatic too, even if not maximally paradigmatic.
Of course, that’s not necessarily a bad thing—your search for a paradigm is not naive, and the paradigm you propose is flexible and noncommittal (i.e. not-maximally-paradigmatic?) enough that it should be able to avoid the problems you highlight. (I like the paradigm you propose! It seems like a fairly solid, safe first step.)
I think you could instead have structured your post like this:
1. Against Premature Paradigmatization: [Argues that when a body of ongoing research is sufficiently young/confused, pushing to paradigmatize it results in bad assumptions being snuck in, bad constraints on the problem, too little attention on what actually matters, etc. Gives some examples.]
2. Paradigmatization of Alignment is Premature: [Argues that it would be premature to push for paradigmatization now. Maybe lists some major paradigms or proto-paradigms proposed by various people and explains why it would be bad to make any one of them The King. Maybe argues that in general it’s best to let these things happen naturally rather than to push for them.]
I think overall my reaction is: This is too meta; can you point to any specific, concrete things people are doing that they should do differently? For example, I think of Richard Ngo’s “AI Safety from First Principles,” Bostrom’s Superintelligence, maybe Christiano’s stuff, MIRI’s stuff, and CAIS as attempts to build paradigms that (if things go as well as their authors hope) could become The Big Paradigm we All Follow. Are you saying people should stop trying to write things like this? Probably not… so then what are you recommending? That people not get too invested into any one particular paradigm, and start thinking it is The One, until we’ve had more time to process everything? Well, I feel like people are pretty good about that already.
I very much like your idea of testing this out. It’ll be hard to test, since it’s up to your subjective judgment of how useful this way of thinking is, but it’s worth a shot! I’ll be looking forward to the results.
Thanks for the feedback!
That was one of my big frustrations when writing this post: I only saw this topic pop up in personal conversation, not really in published posts. And so I didn’t want to give names of people who just discussed that with me on a zoom call or in a chat. But I totally feel you—I’m always annoyed by posts that pretend to answer a criticism without pointing to it.
That’s a really impressive comment, because my last rewrite of the post was exactly to hint that this was the “right way” (in my opinion) to make the field paradigmatic, instead of arguing that AI Alignment should be made paradigmatic (which is what my previous draft attempted). So I basically agree with what you say.
While I agreed with what you wrote before, this part strikes me as quite different from what I’m saying. Or rather, it only focuses on one aspect. Because I actually argue for two things:
That we should have a paradigm for the “AIs” part, a paradigm for the “well-behaved” part, and from those a paradigm for the solving part. This has nothing to do with the field being young and/or confused, and everything to do with the field being focused on solving a problem. (That’s the part I feel your version is missing.)
That in the current state of our knowledge, it is too early to fix those paradigms; we should instead do more work on comparing and extending multiple paradigms for each of the “slots” from the previous point, and similarly have a go at solving different variants of the problem. That’s the part you mostly got right.
It’s partly my fault, because I didn’t state it that way in the post.
My point about this is that thinking of your examples as “big paradigms of AI” is the source of the confusion, and a massive problem within the field. If we use my framing instead, then you can split these big proposals into their paradigm for “AIs”, their paradigm for “well-behaved”, and thus their paradigm for the solving part. This actually shows you where they agree and where they disagree. If you’re trying to build a new perspective on AI Alignment, then I also think my framing is a good lens to crystallize your dissatisfactions with the current proposals.
Ultimately, this framing is a tool of philosophy of science, and so it probably won’t be useful to anyone not doing philosophy of science. The catch is that we all do a bit of philosophy of science regularly: when trying to decide what to work on, when interpreting work, when building these research agendas and proposals. I hope that this tool will help on these occasions.
That’s why I asked people who are not as invested in this framing (and can be quite critical) to help me do these reviews—hopefully that will help make them less biased! (We also chose some posts specifically because they didn’t fit neatly into my framing.)
“Even if completely understanding something like agency would basically solve the problem, how long will it take (if it is ever reached)? Historical examples in both natural sciences and computer science show that the original paradigm of a field isn’t usually suited to tackle the questions deemed fundamental by later paradigms. And this progression of paradigms takes decades in the best of cases, and centuries in the worst! With the risk of short timelines, we can’t reasonably decide that this is the only basket in which to put our research eggs.”
Yeah, this is one of the core problems that we need to solve. AI safety would seem much more tractable if we had more time to iterate through a series of paradigms.
I like this idea. AI alignment research is more like engineering than math or science, and engineering is definitely full of multiple paradigms, not just because it’s a big field with lots of specialties that have different requirements, but also because different problems require different solutions and sometimes the same problem can be solved by approaching it in multiple ways.
A classic example from computer science is the equivalence of loops and recursion. In a lot of ways these create two very different approaches to designing systems, writing code, and solving problems, and in the end both can do the same things, but it would be pretty bad if people trying to do things with loops always had to talk in terms of recursion and vice versa because that was the dominant paradigm of the field.
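For concreteness, here is a minimal sketch of that equivalence in Python; the choice of factorial as the example is mine, not something from the comment:

```python
# A minimal sketch of the loop/recursion equivalence mentioned above.
# Factorial is an arbitrary illustrative choice.

def factorial_iterative(n: int) -> int:
    """Compute n! with a loop."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


def factorial_recursive(n: int) -> int:
    """Compute n! by recursion."""
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


# Both "paradigms" compute the same function.
assert factorial_iterative(5) == factorial_recursive(5) == 120
```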
(I tried writing up comments here as if I were commenting on a google doc, rather than a LW post, as part of an experiment I had talked about with AdamShimi. I found that actually it was fairly hard – both because I couldn’t make quick comments on a given section without it feeling like a bigger deal than I meant it to be, and also because the overall thing came out more critical-feeling than feels right on a public post. This is ironic since I was the one who told Adam “I bet if you just ask people to comment on it as if it’s a google doc it’ll go fine.”)
((I started writing this before the exchange between Adam and Daniel and am not sure if it’s redundant with that))
I think my primary comment (perhaps similar to Daniel’s) is that it’s not clear to me that this post is arguing against a position that anyone holds. What I hear lots of people saying is “AI Alignment is preparadigmatic, and that makes it confusing and hard to navigate”, which is fairly different from the particular solution of “and we should get to a point of paradigmaticness soon”. I don’t know of anyone who seems to me to be pushing for that.
(fake edit: Adam says in another comment that these arguments have come up in person, not online. That’s annoying to deal with for sure. I’d still put some odds on there being some kind of miscommunication going on, where Adam’s interlocutor says something like “it’s annoying/bad that alignment is pre-paradigmatic”, and Adam assumed that meant “we should get it to the paradigmatic stage ASAP”. But, it’s hard to have a clear opinion here without knowing more about the side-channel arguments. In the world where there are people saying in private “we should get Alignment to a paradigmatic state”, I think it might be good to have some context-setting in the post explaining that)
...
I do agree with the stated problem of “in order to know how to respond to a given Alignment post, I kinda need to know who wrote it and what sort of other things they think about, and that’s annoying.” But I’m not sure what to do with that.
The two solutions I can think of are basically “Have the people with established paradigms spend more time clarifying their frame” (for the benefit of people who don’t already know), and “have new up-and-coming people put in more time clarifying their frame” (also for the benefit of those who don’t already know, but in this case many fewer people know them so it’s more obviously worth it).
Something about this feels bad to me because I think most attempts to do it will create a bunch of boilerplate cruft that makes the posts harder to read. (This might be a UI problem that LessWrong should try to fix.)
I read the OP as suggesting
a meta-level strategy of “have a shared frame for framing things, which we all pay a one-time cost of getting up to speed with”
a specific shared-frame of “1. Defining the terms of the problem, 2. Exploring these definitions, 3. Solving the now well-defined problem.”
I think these things are… plausibly fine, but I don’t know how strongly to endorse them because I feel like I haven’t gotten to see much of the space of possible alternative shared-frames.
On the meta-side: an update I made writing this comment is that inline-google-doc-style commenting is pretty important. It allows you to tag a specific part of the post and say “hey, this seems wrong/confused” without making that big a deal about it, whereas writing a LW comment you sort of have to establish the context, which intrinsically means making it into A Thing.