Why do you think RSPs don’t put the burden of proof on labs to show that scaling is safe?
I think the RSP frame is wrong, and I don’t want regulators to use it as a building block. My understanding is that labs are refusing to adopt an evals regime in which the burden of proof is on labs to show that scaling is safe. Given this lack of buy-in, the RSP folks concluded that the only thing left to do was to say “OK, fine, but at least please check to see if the system will imminently kill you. And if we find proof that the system is pretty clearly dangerous or about to be dangerous, then will you at least consider stopping?” It seems plausible to me that governments would be willing to start with something stricter and more sensible than this “just keep going until we can prove that the model has highly dangerous capabilities” regime.
I think good RSPs would in fact put the burden of proof on the lab. The goal is that the lab would have to make a high quality safety argument prior to taking each risky action (for instance, scaling or further deployment). That said, it’s unclear if the safety arguments from voluntary RSPs will end up being very good. In the event that something like RSPs are required by a regulatory body, it’s also unclear if that body will require good safety arguments. Presumably people advocating for RSPs will also advocate for voluntary RSPs to contain good safety arguments and for regulation to require good safety arguments.
For example, I think the baseline safety argument from the Anthropic RSP actually does ensure a high degree of safety for some particular AI. The argument is “we ran these capability evals and those indicated to us that the model is only ASL-2 (not ASL-3+), so it’s probably safe”. This argument will obviously fail at some point, but it does currently demonstrate safety to a pretty high degree in my opinion[1]. This argument doesn’t guarantee safety (e.g., what if models learn to very competently sandbag evaluations before learning how to accomplish these tasks, or what if there is a dangerous action which is easier than this evaluation?), and it also might be the case that running this eval every 4x effective compute scale-up is insufficient due to rapid increases in capabilities with respect to effective compute. But I still think overall risk is <1% as long as this exact safety argument is in place (I think most of the risk comes from rapid increases in capabilities rather than from sandbagging or from easier paths to doom than those covered by our evaluations).
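As a rough sketch of where that <1% comes from (this is my own back-of-the-envelope decomposition, not anything stated in the RSP itself), the residual risk under this exact safety argument can be split across the three failure modes above:

$$P(\text{doom} \mid \text{capability evals pass}) \;\lesssim\; p_{\text{rapid}} + p_{\text{sandbag}} + p_{\text{easier}} \;<\; 1\%$$

where $p_{\text{rapid}}$ is the chance capabilities jump past the ASL-3 threshold within a single 4x effective-compute window between evaluations, $p_{\text{sandbag}}$ is the chance the model competently sandbags the evaluations before it can actually do the dangerous tasks, and $p_{\text{easier}}$ is the chance some dangerous action is easier than anything the evaluations cover. The claim above is that $p_{\text{rapid}}$ is the dominant term.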
Another way of putting this is: getting labs to check if their models could be dangerous is putting the burden of proof on labs. (And then we can argue at the object level about the quality of these evaluations.)
To be clear, I think any of the following object-level objections are reasonable:
The reduction in P(doom) which is being targeted (e.g. 5-10x) isn’t good enough and we should ask for more; see the illustrative arithmetic after this list. (Or you could object to the absolute level of doom, but this might depend more on priors.)
The countermeasures discussed in this exact RSP don’t reduce P(doom) that much.
There are no known evaluations, countermeasures, or approaches which would allow for reducing P(doom) by the targeted amount other than stopping scaling right now, so we should just do that.
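To make the first objection concrete with purely illustrative numbers (the 10% baseline is not a figure from this post): if the unconditional risk from continued scaling were 10%, then a 5-10x reduction leaves

$$\frac{10\%}{10} = 1\% \quad\text{to}\quad \frac{10\%}{5} = 2\%$$

of residual risk, and the objection is that 1-2% is still unacceptably high, so one should demand a larger reduction (or object to the absolute level directly).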
It’s less clear to me that this ensures safety from ongoing scaling, due to the possibility of rapid (perhaps mostly discontinuous) increases in capabilities such that running the evaluation periodically is insufficient. I’ll discuss concerns with rapid increases in capabilities later.