Thanks for the thoughts! Some brief (and belated) responses:
I disagree with you on #1 and think the thread below your comment addresses this.
Re: #2, I think we have different expectations. We can just see what happens, but I’ll note that the RSP you refer to is quite explicit about the need for further iterations (not just “revisions” but also the need to define further-out, more severe risks).
I’m not sure what you mean by “an evals regime in which the burden of proof is on labs to show that scaling is safe.” How high is the burden you’re hoping for? If labs need to definitively rule out risks like “the weights leak, then the science of post-training enhancements moves forward to the point where the leaked weights are catastrophically dangerous” before doing any further scaling, then my sense is that nobody (certainly not me) has any idea how to do this, and the proposal is pretty much equivalent to “immediate pause,” which I’ve shared my thoughts on. If you have a lower burden of proof in mind, that’s potentially consistent with the work on RSPs that is happening (it depends on exactly what you’re hoping for).
I agree with the conceptual point that improvements on the status quo can be net negative for the reasons you say. When I said “Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate,” I didn’t mean that there’s no way this can make intellectual sense. To take a quick stab at what bugs me: to the extent a measure is an improvement but insufficient, I think the strategy should be to say “This is an improvement but insufficient,” accept the improvement, and count on oneself to win the “insufficient” argument. That kind of behavior generalizes to a world in which everyone is clear about their preferred ranking of object-level states of the world, and incentive gradients consistently point in the right direction (especially important if, as I believe, getting some progress generally makes it easier rather than harder to get more). By contrast, I worry that opposing object-level improvements on the grounds that others might find them sufficient generalizes to a world with choppier incentive gradients, more confusing discourse, and a lot of difficulty building coalitions generally (it’s a lot harder to get agreement on “X vs. the best otherwise achievable outcome” than on “X vs. the status quo”).
I think nearly all proponents of RSPs do not see them as a substitute for regulation. Early communications could have emphasized this point more (including the METR post, which has been updated). I think communications since then have been clearer about it.
I lean toward agreeing that another name would be better. I don’t feel very strongly, and am not sure it matters at this point anyway with different parties using different names.
I don’t agree that “we can’t do anything until we literally have proof that the model is imminently dangerous” is the frame of RSPs, although I do agree that the frame is distinct from a “pause now” frame. I’m excited about conditional pauses as something that can reduce risk a lot while having high tractability and bringing together a big coalition; the developments you mention are great, but I think we’re still a long way from the point where “advocate for an immediate pause” looks better to me than working within this framework. I also disagree with your implication that the RSP framework has sucked up a lot of talent: while evals have drawn in a lot of people and momentum, only a handful of people were hammering out conditional-pause frameworks as of the date of your comment. (Since then the number has gone up, because AI companies have formed teams dedicated to this; that seems like a good thing to me.) Overall, it seems to me that most of the people working in this area would otherwise be working on evals and other things short of advocating for an immediate pause.