I agree that there’s a strong argument that “growing the field of AI Safety” is a bad idea, in a certain specific sense. But:
So you could be in a situation where paying 25 people $200k ends up being worse than doing nothing
I’m not sure that’s valid. It’s true if these 25 people were random blokes off the street whom we’ve captured and re-trained into fake!alignment researchers. But the primary targets for such re-training are people currently doing capabilities research, or planning to go into that field. Convincing a capabilities researcher (current or future) to spend time on alignment instead seems like it would slow the advance of capabilities, even if they end up doing fake research that doesn’t progress alignment. And so the counterfactual world in which the whole AI industry is researching fake!alignment instead would end later than our own, and that “later” gives real!alignment researchers[1] more chances to solve the problem.
I see the argument that alignment research actually has a higher return on capabilities per researcher-hour than the tinkering the capabilities people engage in right now. It seems plausible: crucial insights into agency/goals and powerful interpretability techniques would likely allow us to build much better training loops, and it seems implausible that we’ll get to alignment without such insights/techniques.
But the whole issue is that fake!alignment research avoids all of these hard problems. The anti-outreach argument only works in full generality if we believe that trying to get people to work on alignment makes them work on fake!alignment, AND that this fake!alignment progresses capabilities faster than deliberate attempts to progress capabilities. Do we, in fact, believe this?
Because otherwise, I think there’s measurable value in growing AI Safety at the expense of the wider AI industry (by stealing skilled researchers, redirecting funding, crowding out the available compute, etc.).
[1] Which this counterfactual world also has more of, assuming that trying to get someone to work on real!alignment has a not-literally-zero chance of working.
Another semi-assumption this makes is that when most normies (by which I mean people working on neither capabilities nor safety) hear about this issue, their instinct is to try their hand at alignment.
In my experience, this just isn’t the case. If I manage to convey that there’s a huge problem at all, they also realize that alignment is an extremely dangerous game and figure they’re not going to be successful at it. They might ask if they can fetch coffee for someone at a MIRI-equivalent, or do generalist programming work for them, because people do have an instinct to help, but they explicitly stay away from doing critical safety research. What they generally want to do instead is tell their friends and family about the problem and help slow down the field, which he mentions in B. You might get a different set of responses if you’re constantly talking to bright young mathematicians who believe their comparative advantage is doing math, but I’ve been pretty careful to check the behavior of the people I’m doing outreach to, to make sure they’re not making the problem worse.
And there’s a difference between the kinds of “safety” work that dilute the quality of alignment research if not done exceptionally well, and the kinds of “safety” work that involve regulating, lobbying, or slowing down existing AI companies. Barring some bizarre second-order effects that have not been coherently argued for on LW, I think more people pressuring large tech companies to make their work “safer” is a good thing, and very obviously a good thing. If the normies succeed in actually pushing DeepMind & crew toward operational adequacy, fantastic! If they don’t, well, at least those teams are wasting money/time operationally on something other than ending the world, and when money/time has been allocated inefficiently towards a problem, it’s still generally (though not always) easier to reform existing efforts than to start from scratch.
They might ask if they can fetch coffee for someone at a MIRI-equivalent, or do generalist programming work for them, because people do have an instinct to help, but they explicitly stay away from doing critical safety research.

This is me currently!
I don’t think the work done by such researchers is the main problem: the main problem is that once a very large proportion of the field is fake!alignment, new people coming to work on AIS may disproportionately be introduced to fake!alignment.
We might reasonably hope that the wisest and smartest new people may see fake!alignment for what it is, and work on the real problems regardless. However, I expect that there are many people with the potential to do positive work, who’d do positive work if they were exposed to [MIRI], but not if exposed to [MIRI + 10,000 fake!alignmenters]. [EDIT: here I don’t mean to imply that all non-MIRI researchers are doing fake!alignment; this is just a hypothetical comparison for illustration where you’re free to pick your own criteria for fake!alignment]
This isn’t obviously inevitable, but it does seem the default outcome.
Valid point, though I’m not sure the original post mentioned that.
Counterpoint: would that actually change the absolute number of real!alignment researchers? If the probability that a given inductee ends up doing real!alignment goes down, but the number of inductees goes way up and the timelines get longer, it’d still be a net-positive intervention.
That’s true given a fixed proportion of high-potential researchers amongst inductees—but I wouldn’t expect that. The more we go out and recruit people who’re disproportionately unlikely to understand the true nature of the problem (i.e. likely candidates for “worse than doing nothing”), the more the proportion of high-potential inductees drops. [also I don’t think there’s much “timelines get longer” here]
Obviously it’s far from clear how it’d work out in practice; this may only be an issue with taking the most naïve approach. I do think it’s worth worrying about—particularly given that there aren’t clean takebacks.
I don’t mean to argue against expanding the field—but I do think it’s important to put a lot of thought into how best to do it.