I would call what METR does alignment research, but I’m also fine using a different term for it. I’m mostly using it synonymously with “AI Safety Research”, which I know you object to, but I do think that’s how it’s normally used (and the relevant aspect here is the pre-paradigmatic nature of the research, which I continue to think applies independently of the bucket you put it into).
I do think it’s marginally good to make “AI Alignment Research” mean something narrower, so I’m supportive of you getting me to use something broader like “AI Safety Research”, but I don’t really think that changes my argument in any relevant way.
Yeah, I object to using the term “alignment research” to refer to research that investigates whether models can do particular things.
But all the terminology options here are somewhat fucked imo; I probably should have been more chill about you using the language you did, sorry.
It seems like we have a significant need for orgs like METR or the DeepMind dangerous-capabilities evals team to operationalise these evals, but also for regulators with authority to build on that work and set them as explicit, objective standards. The latter feels like it might be more practical for NIST to do, especially under Paul?
Buck, could you (or habryka) elaborate on this? What does Buck call the set of things that ARC theory and METR (formerly known as ARC Evals) do? “AI control research”?
My understanding is that while Redwood clearly does control research, METR’s evals seem more like an attempt to demonstrate dangerous capabilities than to help with control. I haven’t wrapped my head around ARC’s research philosophy and output well enough to confidently state anything.
I normally use “alignment research” to mean “research into making models be aligned, e.g. not performing much worse than they’re ‘able to’ and not purposefully trying to kill you”. By this definition, ARC’s work is alignment research; METR’s and Redwood’s isn’t.
An important difference between Redwood and METR is that we focus a lot more on developing and evaluating countermeasures.
FWIW, I typically use “alignment research” to mean “AI research aimed at making it possible to safely do ambitious things with sufficiently-capable AI” (with an emphasis on “safely”). So I’d include things like Chris Olah’s interpretability research, even if its proximate impact is just “we understand what’s going on better, so we may be more able to predict and finely control future systems” rather than “the AI is now less inclined to kill you”.
Some examples: I wouldn’t necessarily think of “figure out how we want to airgap the AI” as “alignment research”, since it’s less about designing the AI, shaping its mind, predicting and controlling it, etc., and more about designing the environment around the AI.
But I would think of things like “figure out how to make this AI system too socially-dumb to come up with ideas like ‘maybe I should deceive my operators’, while keeping it superhumanly smart at nanotech research” as central examples of “alignment research”, even though it’s about controlling capabilities (‘make the AI dumb in this particular way’) rather than about instilling a particular goal into the AI.
And I’d also think of “we know this AI is trying to kill us; let’s figure out how to constrain its capabilities so that it keeps wanting that, but is too dumb to find a way to succeed in killing us, thereby forcing it to work with us rather than against us in order to achieve more of what it wants” as a pretty central example of alignment research, albeit not the sort of alignment research I feel optimistic about. The way I think about the field, you don’t have to specifically attack the “is it trying to kill you?” part of the system in order to be doing alignment research; there are other paths, and alignment researchers should consider all of them and focus on results rather than marrying a specific methodology.
It’s pretty sad to call all of these end states you describe “alignment”, since “alignment” is an extremely natural word for “actually terminally has good intentions”. So it makes me sad to call this alignment research. Of course, this type of research may be instrumentally useful for making AIs more aligned, but so is a bunch of other stuff (e.g. earning to give).
Fair enough if you think we should just eat this terminology issue and then coin a new term like “actually real-alignment-targeting-directly alignment research”. Idk what the right term is, obviously.
Aren’t there a lot of clearer words for this? “Well-intentioned”, “nice”, “benevolent”, etc.
(And a lot of terms, like “value loading” and “value learning”, that are pointing at the research project of getting good intentions into the AI.)
To my ear, “aligned person” sounds less like “this person wishes the best for me”, and more like “this person will behave in the right ways”.
If I hear that Russia and China are “aligned”, I do assume that their intentions play a big role in that, but I also assume that their circumstances, capabilities, etc. matter too. Alignment in geopolitics can be temporary or situational, and it almost never means that Russia cares about China as much as China cares about itself, or vice versa.
And if we step back from the human realm, an engineered system can be “aligned” in contexts that have nothing to do with goal-oriented behavior, but are just about ensuring components are in the right place relative to each other.
Cf. the history of the term “AI alignment”. From my perspective, a big part of why MIRI coordinated with Stuart Russell to introduce the term “AI alignment” was that we wanted to switch away from “Friendly AI” to a term that sounded more neutral. “Friendly AI research” had always been intended to subsume the full technical problem of making powerful AI systems safe and aimable; but emphasizing “Friendliness” made it sound like the problem was purely about value loading, so a more generic and low-content word seemed desirable.
But Stuart Russell (and later Paul Christiano) had a different vision in mind for what they wanted “alignment” to be, and MIRI apparently failed to communicate and coordinate with Russell to avoid a namespace collision. So we ended up with a messy patchwork of different definitions.
I’ve basically given up on trying to achieve uniformity on what “AI alignment” is; the best we can do, I think, is clarify whether we’re talking about “intent alignment” vs. “outcome alignment” when the distinction matters.
But I do want to push back against those who think outcome alignment is just an unhelpful concept — on the contrary, if we didn’t have a word for this idea I think it would be very important to invent one.
IMO it matters more that we keep our eye on the ball (i.e., think about the actual outcomes we want and keep researchers’ focus on how to achieve those outcomes) than that we define an extremely crisp, easily-packaged technical concept (that is at best a loose proxy for what we actually want). Especially now that ASI seems nearer at hand (so the need for this “keep our eye on the ball” skill is becoming less and less theoretical), and especially now that ASI disaster concerns have hit the mainstream (so the need to “sell” AI risk has diminished somewhat, and the need to direct research talent at the most important problems has increased).
And I also want to push back against the idea that a priori, before we got stuck with the current terminology mess, it should have been obvious that “alignment” is about AI systems’ goals and/or intentions, rather than about their behavior or overall designs. I think intent alignment took off because Stuart Russell and Paul Christiano advocated for that usage and encouraged others to use the term that way, not because this was the only option available to us.
Fair enough. I guess it just seems somewhat incongruous to say: “Oh yes, the AI is aligned. Of course it might desperately crave murdering all of us in its heart (we certainly haven’t ruled this out with our current approach), but it is aligned because we’ve made it so that it wouldn’t get away with it if it tried.”
Sounds like a lot of political alliances! (And “these two political actors are aligned” is arguably an even weaker condition than “these two political actors are allies”.)
At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.
Personally, I like mentally splitting the space into AI safety (emphasis on measurement and control), AI alignment (getting the AI to align with the operators’ purposes and actually do what the operators desire), and AI value-alignment (getting the AI to understand and care about what people need and want). Feels like a Venn diagram with a lot of overlap, and yet some distinct non-overlap spaces.
By my framing, Redwood Research and METR are more centrally AI safety. ARC/Paul’s research agenda is more of a mix of AI safety and AI alignment. MIRI’s work to fundamentally understand and shape agents is a mix of AI alignment and AI value-alignment. Obviously, success there would have the downstream effect of robustly improving AI safety (reducing the need for careful evals and control), but it’s a more difficult approach in general, with less immediate applicability. I think we need all these things!