I agree with many of the points made in this post, especially the “But my ideas/insights/research is not likely to impact much!” point. I find it plausible that in some subfields, AI x-risk people are too prone to publishing due to historical precedent and norms (maybe mech interp? though little has actually come of that). I also want to point out that there are non-zero arguments to expect alignment work to help more with capabilities, relative to existing “mainstream” capabilities work, even if I don’t believe this to be the case. (For example, you might believe that the field of deep learning spends too little time actually thinking about how to improve its models, and too much time just tinkering, in which case your thinking could have a disproportionate impact even after adjusting for the fact that you’re not trying to do capabilities.) And I think that some of the research labeled “alignment” is basically just capabilities work, and maybe the people doing it should stop.
I also upvoted the post because I think this attitude is pervasive in these circles, and it’s good to actually hash it out in public.
But as with most of the commenters, I disagree with the conclusion of the post.
I suspect the main cruxes between us are the following:
1. How much useful alignment work is actually being done?
From paragraphs such as the following:
It’s very rare that any research purely helps alignment, because any alignment design is a fragile target that is just a few changes away from unaligned. There is no alignment plan which fails harmlessly if you fuck up implementing it, and people tend to fuck things up unless they try really hard not to (and often even if they do), and people don’t tend to try really hard not to. This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned — a helpful AI will help anyone, and the world has more people trying to build any superintelligence (let’s call those “capabilities researchers”) than people trying to build aligned superintelligence (let’s call those “alignment researchers”).
And
“But my ideas/insights/research is not likely to impact much!” — that’s not particularly how it works? It needs to somehow be differentially helpful to alignment, which I think is almost never the case.
It seems that a big part of your world model is that ~no one who thinks they’re doing “alignment” work is doing real alignment work; they’re really just doing capabilities work. In particular, it seems that you think interp or intent alignment are basically just capabilities work, insofar as their primary effect is helping people build unsafe ASI faster. Perhaps you think that, in the case of interp, before we can understand the AI in a way that’s helpful for alignment, we’ll understand it in a way that allows us to improve it. I’m somewhat sympathetic to this argument. But I think making it requires arguing that interp work doesn’t really contribute to alignment at all, and is thus better thought of as capabilities work (and the same for intent alignment).
Perhaps you believe that all alignment work is useless, not because it’s misguided and actually capabilities work, but because we’re so far from building aligned ASI that ~all alignment work accomplishes nothing, while we’re in an intermediate regime where additional insights non-negligibly hasten the arrival of unaligned ASI. But I think you should argue for that explicitly (as, say, Eliezer did in his death with dignity post), since I imagine most of the commenters here would disagree with this take.
My guess is this is the largest crux between us; if I thought all “alignment” work did nothing for alignment, and was perhaps just capabilities work in disguise, then I would agree that people should stop. In fact, I might even argue that we should just stop all alignment work whatsoever! Insofar as I’m correct about this being a crux, I’d like to see a post explicitly arguing for the lack of alignment relevance of existing ‘alignment work’, which would probably lead to a more constructive conversation than this post.
2. How many useful capabilities insights incidentally come from “alignment” work?
I think that empirically, very few (if not zero) capabilities insights have come from alignment work. And a priori, you might expect that research aiming to solve topic X produces marginally more progress on X than on a related topic Y. So insofar as you think that current “alignment” work is more than epsilon useful for alignment, you should not expect most alignment work to be differentially negative. Conversely, insofar as you think a lot of “alignment” work is real alignment work and yet still differentially negative, you probably believe that many capabilities insights have come from past alignment work.
Perhaps you’re reluctant to give examples, for fear of highlighting them. I think the math doesn’t work out here: having a few clear examples from you would probably be sufficient to significantly reduce the number of insights published by the community as a whole. But if you have many examples of insights that help capabilities but are too dangerous to highlight, I’d appreciate it if you would just say that (and maybe we could find a trusted third party to verify your claim without sharing the details?).
Perhaps you might say, well, the alignment community is very small, so there might not be many examples that come to mind! For this to carry through, you’d still have to believe that the alignment community hasn’t produced much good research either. (Even though, naively, you might expect higher returns from alignment, since its small size leaves more low-hanging fruit unpicked.) But then again, I’d prefer that you explicitly argue that ~all alignment work is either useless or capabilities work, instead of gesturing at a generic phenomenon.
Perhaps you might say that capabilities insights are incredibly long-tailed, so that seeing no examples doesn’t mean the expected harm is low. But I think you still need to make some sort of plausibility argument here, as well as a story for why existing ML insights deserve a lot of the Shapley value for capabilities advances, even though most of the “insights” people have had were useless if not actively misleading.
I also think that there’s an obvious confounder if you believe something along the lines of “focusing on alignment is correlated with higher rationality”. Personally, I also think the average alignment(-interested) researcher is more competent at machine learning, and at research in general, than the average generic capabilities researcher (though this probably becomes false once you condition on being at OAI, Anthropic, or another scaling lab). If you just compare “how many good ideas came from ‘alignment’ researchers per capita” to the same number for ‘capability’ researchers, you may find that the former is higher simply because they’re more competent. This goes back to crux 1: you would then need to argue that competence doesn’t help at all in doing actual alignment work, and again, I suspect it’s more productive to just argue about the relevance and quality of alignment work instead of arguing about incidental capabilities insights.
3. How important are insights to alignment/capabilities work?
From paragraphs such as the following:
Worse yet: if focusing on alignment is correlated with higher rationality and thus with better ability for one to figure out what they need to solve their problems, then alignment researchers are more likely to already have the ideas/insights/research they need than capabilities researchers, and thus publishing ideas/insights/research about AI is more likely to differentially help capabilities researchers. Note that this is another relative statement; I’m not saying “alignment researchers have everything they need”, I’m saying “in general you should expect them to need less outside ideas/insights/research on AI than capabilities researchers”.
It seems that you’re working with a model of research output with two main components: (intrinsic) rationality and (external) insights. But there’s a huge component missing from this model: actual empirical experiments validating the insight, which make up the ~bulk of actual capabilities work and a substantial fraction of alignment work. This matters both because ~no capabilities researchers will listen to you if you don’t have empirical experiments, and because, if you believe that you can deduce more alignment research “on your own”, you might also believe that you need more empirical experiments to do capabilities research (and thus that the contribution per published insight is by default a lot smaller).
Even if true insights are differentially more helpful for capabilities, the fact that it’s empirically difficult to know which insights are true means that a lot of the work of arriving at a true insight will look much like normal capabilities work, e.g. training more capable models. But then the argument would be reducible to: if you do capabilities work, don’t share it, on pain of accelerating ASI progress, which seems like something your audience already agrees with!
That being said, I think I might disagree with your premise here. My guess is that alignment, being less grounded than capabilities, probably requires more outside ideas/insights/research, just for sanity-checking reasons (once you control for researcher competence and the fact that there’s probably more low-hanging fruit in alignment). After all, you can just make a change and see if your log loss on pretraining goes down, but it’s a lot harder to know whether your model of deceptive alignment is at all sensible. If you don’t improve your model’s performance on standard benchmarks, that’s evidence that your capability idea doesn’t work, but there aren’t really any benchmarks for many of the problems alignment researchers think about. So it’s easier to go astray, and therefore more important to get feedback from other researchers.
Finally, to answer this question:
“So where do I privately share such research?” — good question!
I suspect that the way to go is to form working groups of researchers that stick together and maintain a high level of trust (e.g. a research organization). Then, do and share your research internally, and think about possible externalities before publishing more broadly, perhaps via a tiered release. (This is indeed the model used by many people at alignment orgs.)