I don’t think you need to reliably classify a system as safe or not. You need to apply consistent standards that output “unsafe” in >90% of cases where things really are unsafe.
I think I’m probably imagining better implementation than you, probably because (based on context) I’m implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I’m describing as “very good RSPs” and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that’s up for debate).
So at that point you obviously aren’t talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries—which I don’t even think seems very unrealistic at this point and IMO is totally reasonable for “very good”), and I’m not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that’s fair to include as part of “very good”).
I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can’t cut risk by much. I’m sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it’s plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.
I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can’t cut risk by much
This seems to be our biggest crux; as I said, I'd be interested in analyses of the alignment difficulty distribution if any onlookers know of any. Also, a semantic point: under my current views I'd view cutting ~5% of the risk as a huge deal, at least an ~80th-percentile outcome for the AI risk community if it had a significant counterfactual impact on it, but yes, not much compared to 10x.
[EDIT: After thinking about this more I’ve realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn’t cut more than ~5% of the risk. I haven’t thought much about exact numbers, but maybe ~20% seems reasonable to me now.]
Quick thoughts on the less cruxy stuff:
You need to apply consistent standards that output “unsafe” in >90% of cases where things really are unsafe.
Fair, though I think 90% would be too low, and the higher you raise the threshold, the longer you have to maintain the pause.
(based on context) I’m implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium
This might coincidentally be close to the 95th percentile I had in mind.
So at that point you obviously aren’t talking about 100% of countries voluntarily joining
Fair, I think I was wrong on that point. (I still think it’s likely there would be various other difficulties with enforcing either RSPs or a moratorium for an extended period of time, but I’m open to changing my mind.)
I’m not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that’s fair to include as part of “very good”)
Sorry if I wasn’t clear: my worry is that open-source models will get better over time due to new post-training enhancements, not about their capabilities upon release.