I basically agree with almost all of Paul’s points here. Some small things to add:
“Specifying a concrete set of evaluation results that would cause them to move to ASL-3.”
I think having some concrete threshold for a pause is much better than not, and I think the proposed threshold is early enough to trigger before an irreversible catastrophe with high probability (more than 90%).
Basically agree, although the specifics of the elicitation methodology that we helped draft are important to me here. (In particular: only requiring a 10% success rate to count a task as “passed”; making sure that you’re using ~$1000 of inference compute per task; and doing good scaffolding and finetuning on a dev set of tasks from the same distribution as the threshold tasks.)
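As a rough illustration, the pass criterion described above can be sketched in a few lines. This is a hypothetical sketch, not the actual evaluation code: the function name, the under-spend check, and all numbers other than the 10% threshold and ~$1000 budget are my own assumptions.

```python
def task_passed(successes: int, attempts: int,
                inference_spend_usd: float,
                pass_rate_threshold: float = 0.10,
                budget_usd: float = 1000.0) -> bool:
    """A task counts as 'passed' if at least 10% of attempts succeed,
    provided roughly the full ~$1000 inference budget was actually spent
    (under-spending risks under-eliciting the model's capabilities)."""
    if attempts == 0:
        return False
    used_enough_compute = inference_spend_usd >= 0.9 * budget_usd
    return used_enough_compute and (successes / attempts) >= pass_rate_threshold

# 2 successes out of 15 attempts (~13%) at full spend counts as a pass.
print(task_passed(2, 15, 1000.0))
```

The point of the low 10% bar is conservatism: even an unreliable capability counts, since reliability tends to improve quickly with better elicitation.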
I’m excited to see criticism of RSPs that focuses on concrete ways in which they fail to manage risk. Such criticism can help (i) push AI developers to do better, and (ii) make the case to policymakers that we need regulatory requirements stronger than existing RSPs. That said, I think it is significantly better to have an RSP than not, and that point shouldn’t be lost in the discussion.
Agree. I’m worried about accidentally creating an incentive gradient for companies to say and do as little as possible about safety. I think this can be reduced if critics follow this principle: “criticism of a specific lab’s RSP should explicitly name other prominent labs that haven’t put out any RSP, and say that this is worse.”
On the object level I’d be especially excited for criticism that includes things like:
Presenting risks or threat models that might occur before the model has the capabilities that the evaluation is intended to capture
Explaining how the specified evaluations may fail to capture the relevant capabilities
Proposing or developing alternative evaluations
Arguing that intervals of 4x effective compute between evaluations are too large, and that we could blow past the intended capability limits
Pointing out ambiguities in the evaluation definitions
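To make the 4x-interval concern concrete, here is some toy arithmetic. The function name and the specific compute numbers are illustrative assumptions, not anything from the RSP itself.

```python
import math

EVAL_INTERVAL = 4.0  # evaluations run every 4x increase in effective compute

def evals_between(start_compute: float, end_compute: float) -> int:
    """Number of evaluation checkpoints crossed between two
    effective-compute levels, at one checkpoint per 4x increase."""
    return math.floor(math.log(end_compute / start_compute, EVAL_INTERVAL))

# Scaling a model 100x in effective compute crosses only 3 checkpoints,
# so up to a further ~4x of capability gain can appear untested between
# the last checkpoint and the final model.
print(evals_between(1.0, 100.0))
```

The worry, then, is not the number of checkpoints but the size of the gap: if dangerous capabilities emerge within one 4x step, the evaluation schedule would only catch them after the fact.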
“The current level of risk is low enough that I think it is defensible for companies or countries to continue AI development if they have a sufficiently good plan for detecting and reacting to increasing risk.”
I think it’s true that it’s defensible for an individual company/country, but I also think it’s not sensible for the world to be doing this overall. It seems possible to me (maybe ~1/1000) that the key capability limitations of current LLM agents could be overcome with the right scaffolding and finetuning. Given this, if I personally ran the world I would not be open-sourcing or scaling up current systems.