I like my own definition of alignment vs. capabilities research better:
“Alignment research is when your research goals are primarily about how to make AIs aligned; capabilities research is when your research goals are primarily about how to make AIs more capable.”
I think it’s very important that lots of people currently doing capabilities research switch to doing alignment research. That is, I think it’s very important that lots of people who are currently waking up every day thinking ‘how can I design a training run that will result in AGI?’ switch to waking up every day thinking ‘Suppose my colleagues do in fact get to AGI in something like the current paradigm, and they apply standard alignment techniques—what would happen? Would it be aligned? How can I improve the odds that it would be aligned?’
Whereas I don’t think it’s particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)
What if your research goal is “I’d like to understand how neural networks work”? This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that’s not an inherent part of the research goal for many interpretability researchers.
(Same for “I’d like to understand how agency works”, which is a big motivation for many agent foundations researchers.)
Conversely, what if your research goal is “I’m going to design a training run that will produce a frontier model, so that we can study it to advance alignment research”? Seems odd, but I’d bet that (e.g.) a chunk of Anthropic’s scaling team thinks this way. Counts as alignment under your definition, since that’s the primary goal of the research.
More generally, I think it’s actually a very important component of science that people judge the research itself, not the motivations behind it—since historically scientific breakthroughs have often come from people who were disliked by establishment scientists. A definition that basically boils down to “alignment research is whatever research is done by the people with the right motivations” makes it very easy to prioritize the ingroup. I do think that historically being motivated by alignment has correlated with choosing valuable research directions from an alignment perspective (like mech interp instead of more shallow interp techniques) but I think we can mostly capture that difference by favoring more principled, robust, generalizable research (as per my definitions in the post).
Whereas I don’t think it’s particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)

I agree. I’ll add a note in the post saying that where you end up on the alignment spectrum should also account for the feasibility of the research direction.
Though note that we can interpret your definition as endorsing this too: if you really hate the idea of making AIs more capable, then that might motivate you to switch from scalable oversight to agent foundations, since scalable oversight will likely be more useful for capabilities progress.
Answering your first question. If you truly aren’t trying to make AGI, and you truly aren’t trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) …great! That’s neither capabilities nor alignment research afaict, but basic science. Good for you. I still think it would be better if you switched to doing alignment research (e.g. you could switch to ‘I want to understand how neural networks work… so that I can understand how a prosaic AGI system being RLHF’d might behave when presented with genuine credible opportunities for takeover + lots of time to think about what to do’) but I don’t feel so strongly about it as I would if you were doing capabilities research.
re: judging the research itself rather than the motivations: idk, I think it’s actually easier, and less subjective, to judge the motivations, at least in this case. People usually just state what their motivations are. Also, I’m not primarily trying to judge people, I’m trying to exhort people—I’m giving people advice about what they should do to make the world a better place; I’m not e.g. talking about which research should be published and which should be restricted. (I’m generally in favor of publishing research, with maybe a few exceptions, but I think corporations on the margin should publish more.) Moreover, it’s much easier for the researcher to judge their own motivations than to judge the long-term impact of their work or to fit it into your diagram.
If you truly aren’t trying to make AGI, and you truly aren’t trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) …great! That’s neither capabilities nor alignment research afaict, but basic science.

Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris’s interpretability work and still not know whether or not he’s an “alignment researcher”. On your definition, when I read a paper by a researcher I haven’t heard of, I don’t know whether it’s alignment research or not until I stalk them on Facebook and find out how socially proximal they are to the AI safety community. That doesn’t seem great.
Back to Chris. Because I’ve talked to Chris and read other stuff by him, I’m confident that he does care about alignment. But I still don’t know whether his actual motivations are more like 10% intrinsic interest in how neural networks work and 90% alignment, or vice versa, or anything in between. (It’s probably not even a meaningful thing to measure.) It does seem likely to me that the ratio of how much intrinsic interest he has in how neural networks work to how much he cares about alignment is significantly higher than that of most alignment researchers, and I don’t think that’s a coincidence—based on the history of science (Darwin, Newton, etc.), intrinsic interest in a topic seems like one of the best predictors of actually making the most important breakthroughs.
In other words: I think your model of what produces more useful research from an alignment perspective overweights first-order effects (if people care more, they’ll do more relevant work) and ignores the second-order effects that IMO are more important (1. great breakthroughs seem, historically, to be primarily motivated by intrinsic interest; and 2. creating research communities that are gatekept by people’s beliefs/motivations/ideologies is corrosive, and leads to political factionalism + ingroupiness rather than truth-seeking).
I’m not primarily trying to judge people, I’m trying to exhort people

Well, there are a lot of grants given out for alignment research. Under your definition, those grants would only be given to people who express the right shibboleths.
I also think that the best exhortation of researchers mostly looks like nerdsniping them, and the way to do that is to build a research community that is genuinely very interested in a certain set of (relatively object-level) topics. I’d much rather an interpretability team hire someone who’s intrinsically fascinated by neural networks (but doesn’t think much about alignment) than someone who deeply cares about making AI go well (but doesn’t find neural nets very interesting). But any step in the pipeline that prioritizes “alignment researchers” (like: who gets invited to alignment workshops, who gets alignment funding or career coaching, who gets mentorship, etc.) will prioritize the latter over the former if they’re using your definition.
I’d much rather an interpretability team hire someone who’s intrinsically fascinated by neural networks (but doesn’t think much about alignment) than someone who deeply cares about making AI go well (but doesn’t find neural nets very interesting).

I disagree; I’d rather they hire someone who cares about making AI go well. E.g. I like Sam Marks’s work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers by trying to do stuff that’s in the direction of being useful. (Though note that his work builds on the work of non-backchaining interpretability researchers.)
(FWIW I think Chris Olah’s work is approximately irrelevant to alignment and indeed this is basically fully explained by the motivational dimension)
Whose work is relevant, according to you?
Lots of people’s work:
Paul’s work (ELK more than RLHF, though RLHF was useful for seeing what happens when you throw RL at LLMs; that’s kind of similar to how I do get some value out of Chris’s work)
Eliezer’s work
Nate’s work
Holden’s writing on cold takes
Ajeya’s work
Wentworth’s work
The debate stuff
Redwood’s work
Bostrom’s work
Evan’s work
Scott and Abram’s work
There is of course still huge variance in how relevant these different people’s work is and how much it goes for the throat, but all of these seem more relevant to AI Alignment/AI-not-kill-everyoneism than Chris’s work (which, again, I found interesting, but not like super interesting).
Do you mean Evan Hubinger, Evan R. Murphy, or a different Evan? (I would be surprised and humbled if it was me, though my priors on that are low.)
Hubinger
Definitely not trying to put words in Habryka’s mouth, but I did want to make a concrete prediction to test my understanding of his position; I expect he will say that:
the only relevant work is that which tries to directly tackle what Nate Soares described as “the hard bits of the alignment challenge” (about whose identity Habryka basically agrees with Soares)
nobody is fully on the ball yet
but agent foundations-like research by MIRI-aligned or formerly MIRI-aligned people (Vanessa Kosoy, Abram Demski, etc.) is the most relevant, in theory
however, in practice, even that is kinda irrelevant, because timelines are short and that work is going too slowly to be useful even for deconfusion purposes
Edit: I was wrong.
To clarify: are you saying that since you perceive Chris Olah as mostly intrinsically caring about understanding neural networks (instead of mostly caring about alignment), you conclude that his work is irrelevant to alignment?
No, I have detailed inside-view models of the alignment problem, and under those models I consider Chris Olah’s work to be interesting but close to irrelevant (or about as relevant as the work of top capabilities researchers, whose work, to be clear, does have some relevance, since understanding how to make systems better is relevant for understanding how AGI will behave, but whose relevance is pretty limited).
I feel like we are misunderstanding each other, and I think it’s at least in large part my fault.
I definitely agree that we don’t want to be handing out grants or judging people on the basis of what shibboleths they spout or what community they are part of. In fact I agree with most of what you’ve said above, except for when you start attributing stuff to me.
I think that grantmakers should evaluate research proposals not on the basis of the intentions of the researcher, but on the basis of whether the proposed research seems useful for alignment. This is not your view, though, right? You are proposing something else?