The yellow-line evals are already a buffer (‘sufficient to rule out red-lines’), which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations in place or design a better yellow-line eval with a higher ceiling, triggering one only risks the costs of pausing when we could instead have prepared mitigations or better evals. I therefore think it’s reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we’d re-run them ahead of schedule.
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we’d develop for the next round, and enough calibration training to avoid very low probabilities.
Finally, the point of these estimates is that they can guide research and development prioritization—high estimates suggest that it’s worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.
> I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we’d re-run them ahead of schedule.
>
> [...]
>
> I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we’d develop for the next round
Thanks for these clarifications. I didn’t realize that the 30% was for the new yellow-line evals rather than the current ones.
> Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations in place or design a better yellow-line eval with a higher ceiling, triggering one only risks the costs of pausing when we could instead have prepared mitigations or better evals
I’m having trouble parsing this sentence. What do you mean by “doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals”? Doesn’t pausing include focusing on mitigations and evals?
> Thanks for these clarifications. I didn’t realize that the 30% was for the new yellow-line evals rather than the current ones.
That’s how I was thinking about the predictions that I was making; others might have been thinking about the current evals where those were more stable.
> I’m having trouble parsing this sentence. What do you mean by “doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals”? Doesn’t pausing include focusing on mitigations and evals?
Of course, but pausing also means we’d have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we’d continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.
Thanks for the response!