I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can’t cut risk by much
This seems to be our biggest crux; as I said, I'd be interested in analyses of the alignment difficulty distribution if any onlookers know of any. Also, a semantic point: under my current views I'd consider cutting ~5% of the risk a huge deal, at least an ~80th percentile outcome for the AI risk community if it made a significant counterfactual contribution to that reduction, though yes, not much compared to 10x.
[EDIT: After thinking about this more, I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers, but ~20% now seems reasonable to me.]
Quick thoughts on the less cruxy stuff:
You need to apply consistent standards that output “unsafe” in >90% of cases where things really are unsafe.
Fair, though I think 90% would be too low, and the higher you raise that threshold, the longer you'd have to maintain the pause.
(based on context) I’m implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium
This might coincidentally be close to the 95th percentile I had in mind.
So at that point you obviously aren’t talking about 100% of countries voluntarily joining
Fair, I think I was wrong on that point. (I still think it's likely there would be various other difficulties with enforcing either RSPs or a moratorium for an extended period of time, but I'm open to changing my mind.)
I’m not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that’s fair to include as part of “very good”)
Sorry if I wasn’t clear: my worry is that open-source models will get better over time due to new post-training enhancements, not about their capabilities upon release.