From the RSP Evals report:

As a rough attempt at quantifying the elicitation gap, teams informally estimated that, given an additional three months of elicitation improvements and no additional pretraining, there is a roughly 30% chance that the model passes our current ARA Yellow Line, a 30% chance it passes at least one of our CBRN Yellow Lines, and a 5% chance it crosses cyber Yellow Lines. That said, we are currently iterating on our threat models and Yellow Lines so these exact thresholds are likely to change the next time we update our Responsible Scaling Policy.
Is there a minimum X% that, if it replaced the 30%, would be treated the same as the yellow line already having been passed? If you think there’s an X% chance that a yellow line will be crossed with 3 more months of elicitation, what’s the decision-making process for determining whether to treat it as already crossed?
The RSP says “It is important that we are evaluating models with close to our best capabilities elicitation techniques, to avoid underestimating the capabilities it would be possible for a malicious actor to elicit if the model were stolen”, so it seems reasonable to fold some forecasted elicited capabilities into the current evaluation (though they should definitely be discounted the further out they are).
(I’m not particularly concerned about catastrophic risk from the Claude 3 model family, but I am interested in the general policy here and the reasoning behind it)
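(To make the kind of rule being asked about concrete, here is a minimal sketch of “fold in a forecast, discounted by how far out it is, and compare it to a threshold”. The threshold, discount rate, and function name are purely hypothetical illustrations, not anything from the RSP.)

```python
# Minimal sketch of the kind of decision rule being asked about: fold a
# forecasted crossing probability into the current evaluation, discounted by
# how far out the forecast horizon is, and compare it to a threshold.
# The threshold and discount rate are hypothetical placeholders, not from the RSP.

def treat_as_already_crossed(p_cross_after_elicitation: float,
                             months_out: float,
                             threshold: float = 0.5,        # the "minimum X%" in question
                             monthly_discount: float = 0.8) -> bool:
    """True if the discounted forecast should be handled like a crossed yellow line."""
    discounted = p_cross_after_elicitation * (monthly_discount ** months_out)
    return discounted >= threshold

# Example: a 30% forecast at a 3-month elicitation horizon.
print(treat_as_already_crossed(0.30, 3))  # False under these placeholder numbers
```

The open question above is exactly what, if anything, should play the role of that threshold and discount.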
The yellow-line evals are already a buffer (‘sufficient to rule out red-lines’), and the red-lines are themselves a buffer (6x effective compute) before actually dangerous situations. Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it’s reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we’d re-run them ahead of schedule.
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we’d develop for the next round, and enough calibration training to avoid very low probabilities.
Finally, the point of these estimates is that they can guide research and development prioritization—high estimates suggest that it’s worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.
Thanks for the response!

I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we’d re-run them ahead of schedule.
[...]
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we’d develop for the next round
Thanks for these clarifications. I didn’t realize that the 30% was for the new yellow-line evals rather than the current ones.
Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals
I’m having trouble parsing this sentence. What do you mean by “doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals”? Doesn’t pausing include focusing on mitigations and evals?
Thanks for these clarifications. I didn’t realize that the 30% was for the new yellow-line evals rather than the current ones.
That’s how I was thinking about the predictions that I was making; others might have been thinking about the current evals where those were more stable.
I’m having trouble parsing this sentence. What do you mean by “doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals”? Doesn’t pausing include focusing on mitigations and evals?
Of course, but pausing also means we’d have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we’d continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.