I agree with you that one of the best ways to “buy time” is to join the alignment or governance teams at major AI labs (in part because of confidentiality agreements). I also agree that most things are easy to implement poorly by default. However, I think 1) comparative advantage is real; some people are a lot better at writing/teaching/exposition relative to research and vice versa, and 2) there are other ways to instantiate some of the proposals that aren’t literally just “Join OpenAI/Deepmind/Anthropic/etc”:
Direct outreach to AGI researchers
While I agree that most people are pretty bad at making the alignment case, I do think vibes matter! In particular, I think you’re underestimating the value of a ‘general ethos of “AI risk is real”’. (Though I still agree that the average direct outreach attempt will probably be slightly negative.)
Presumably, the way you’d do this is to work with one of the scaling labs?
Break and red team alignment proposals (especially those that will likely be used by major AI labs)
I think the reason most of these examples have failed is some combination of: 1) literally not addressing what people in labs are doing, 2) not being phrased in the right way (eg with ML/deep learning terminology), and 3) being published in venues that aren’t really visible to most people in labs?
I think 1) is the most concerning one—I’ve heard many people make informal arguments in favor of/against Jan’s RRM + Alignment research proposal, but I don’t think a serious critical analysis of that approach has been written up anywhere. Instead, a lot of effort was spent yelling at stuff like CHAI/CIRL. My guess is many people (~5-10 people I know) can write a good steelman/detailed explainer of Jan’s stuff and also critique it.
Organize coordination events
[...]
I guess one benefit is that you can have some coordination between top alignment people who aren’t at industry labs? I’m much more keen on having those people just doing good alignment work, and coordinating with the industry alignment labs. This seems way more efficient.
You can also coordinate top alignment people not at labs <> people at labs, etc. But I do agree that doing good alignment work is important!
However, I think 1) comparative advantage is real; some people are a lot better at writing/teaching/exposition relative to research and vice versa
Sure. The small number of people who can do any of these well should split the tasks up based on comparative advantage. This seems orthogonal to my main claim (roughly: if you do this at a large scale then it starts becoming net negative due to lower quality).
I do think vibes matter! In particular, I think you’re underestimating the value of a ‘general ethos of “AI risk is real”’.
I very much agree that vibes matter! Do you have in mind some benefit other than the one I mentioned above:
But it really doesn’t seem great that my case for wide-scale outreach being good is “maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we’ll slow down, and the extra years of time will help”.
(More broadly it increases willingness to pay an alignment tax, with “slowing down” as one example.)
Importantly, vibes are not uniformly beneficial. If the vibe is “AI systems aren’t robust and so we can’t deploy them in high-stakes situations”, then maybe everyone coordinates not to let the AI control the nukes, and ignores the people who are saying that we also need to worry about the generalist foundation models, because “it’s fine, those models aren’t deployed in high-stakes situations”.
Presumably, the way you’d do this is to work with one of the scaling labs?
Sure, that could work. (Again my main claim is “you can’t usefully throw hundreds of people at this” and not “this can never be done well”.)
I think the reason most of these examples have failed is some combination of: 1) literally not addressing what people in labs are doing, 2) not being phrased in the right way (eg with ML/deep learning terminology), and 3) being published in venues that aren’t really visible to most people in labs?
I think 1) is the most concerning one—I’ve heard many people make informal arguments in favor of/against Jan’s RRM + Alignment research proposal, but I don’t think a serious critical analysis of that approach has been written up anywhere. Instead, a lot of effort was spent yelling at stuff like CHAI/CIRL. My guess is many people (~5-10 people I know) can write a good steelman/detailed explainer of Jan’s stuff and also critique it.
I’m confused. Are you trying to convince Jan or someone else? How does it buy time?
(I interpreted the OP as saying that you convince AGI researchers who are not (currently) working on safety. I think a good steelman + critique of RRM wouldn’t have much effect on that population, though I think it’s pretty plausible I’m wrong about that because the situation at OpenAI is different from DeepMind.)
You can also coordinate top alignment people not at labs <> people at labs, etc.
As a person at a lab I’m currently voting for less coordination of this sort, not more, but I agree that this is also a thing you can do. (As with everything else, my main claim is that this isn’t a scalable intervention.)
This seems orthogonal to my main claim (roughly: if you do this at a large scale then it starts becoming net negative due to lower quality).
Fair. I think I failed to address this point entirely.
I do think there’s a nonzero number of people who would not be that good at novel alignment research but would still be good at the tasks mentioned here. However, I agree that there isn’t a scalable intervention here, or at least not one more scalable than standard AI alignment research (especially when compared to some approaches like the brute-force mechanistic interp many people are doing).
(I interpreted the OP as saying that you convince AGI researchers who are not (currently) working on safety. I think a good steelman + critique of RRM wouldn’t have much effect on that population, though I think it’s pretty plausible I’m wrong about that because the situation at OpenAI is different from DeepMind.)
Yeah, I also messed up here—I think this would plausibly have little effect on that population. I do think that a good answer to “why does RLHF not work” would help a nonzero amount, though.
As a person at a lab I’m currently voting for less coordination of this sort, not more
Agree that it’s not scalable, but could you share why you’d vote for less?
Agree that it’s not scalable, but could you share why you’d vote for less?
Idk, it’s hard to explain—it’s the usual thing where there’s a gazillion things to do that all seem important and you have to prioritize anyway. (I’m just worried about the opportunity cost, not some other issue.)
I think the biggest part of coordination between non-lab alignment people and lab alignment people is making sure that people know about each other’s research; it mostly feels like the simple method of “share info through personal connections + reading posts and papers” is working pretty well right now. Maybe I’m missing some way in which this could be way better, idk.
My guess is most of the value in coordination work here is either in making posts/papers easier to write or ship, or in discovering new good researchers?
Those weren’t what I thought of when I read “coordination” but I agree those things sound good :)
Another good example would be better communication tech (e.g. the sort of thing that LessWrong / Alignment Forum aims for, although not those in particular, because most lab people don’t use them very much).
I feel like most of the barrier in practice to people “coordinating” in the relevant ways is that they don’t know what other people are doing. And a big reason for this is that write-ups are really hard to produce, especially if you have high standards and don’t want to ship something that falls short of them.
And yeah, better communication tech in general would be good, but I’m not sure how to start on that (while it’s pretty obvious what a few candidate steps toward making posts/papers easier to write/ship would look like?)
I agree it’s not clear what to do on better communication tech.
I feel like most of the barrier in practice to people “coordinating” in the relevant ways is that they don’t know what other people are doing. And a big reason for this is that write-ups are really hard to produce, especially if you have high standards and don’t want to ship something that falls short of them.
Idk, a few years ago I would have agreed with you, but now my impression is that people mostly don’t read things and instead talk to each other for this purpose. I wouldn’t really expect that to change with more writing, unless the writing is a lot better?
(I do think that e.g. mech interp researchers read each other’s mech interp papers, though my impression from the outside is that they also often hear about each other’s results well before they’re published. Similarly for scalable oversight.)