Figuring out whether an RSP is good is hard.[1] You need to consider which high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they’re supposed to, whether the high-level risk thresholds trigger appropriate responses, and whether those thresholds are successfully operationalized in terms of low-level evals. Getting any one of these wrong can make your assessment totally wrong. And that’s just for fully-fleshed-out RSPs; in reality the labs haven’t operationalized their high-level thresholds and sometimes don’t really have responses planned. On top of that, different labs’ RSPs have different structures/ontologies.
Quantitatively comparing RSPs in a somewhat-legible manner is even harder.
I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we’ll reach them before it’s too late?—into a single criterion, “Credibility.” Most of my concern about labs’ RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn’t matter how good the rest of your RSP is. (Also, minor: the “Difficulty” indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)
(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless the rubric was quite complex and had some way to evaluate hard-to-evaluate things like eval quality and planned responses.)
(And maybe it’s reasonable for labs to not do so much specific-committing-in-advance.)
Well, sometimes figuring out that an RSP is bad is easy. So, to be precise: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)
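To make that conjunctive structure concrete, here’s a minimal toy sketch (the factor names and pass/fail judgments are hypothetical and purely illustrative; nothing here reflects any lab’s actual RSP): an RSP only comes out “good” if every factor holds for every risk it covers, while a single failing factor is enough to make it bad.

```python
from dataclasses import dataclass

@dataclass
class RiskCoverage:
    """Toy stand-in for one high-level risk an RSP claims to cover."""
    risk: str
    evals_measure_the_right_thing: bool         # do the evals track the capability they're meant to?
    thresholds_trigger_adequate_response: bool  # is the planned response at that threshold strong enough?
    thresholds_operationalized_in_evals: bool   # are the high-level thresholds tied to concrete low-level evals?

def rsp_is_good(coverage: list[RiskCoverage]) -> bool:
    # "Good" is a conjunction: every factor must hold for every covered risk.
    return all(
        c.evals_measure_the_right_thing
        and c.thresholds_trigger_adequate_response
        and c.thresholds_operationalized_in_evals
        for c in coverage
    )

# A single inadequate factor is enough to make the whole RSP bad:
example = [
    RiskCoverage("bio", True, True, True),
    RiskCoverage("cyber", True, False, True),  # weak planned response
]
print(rsp_is_good(example))  # False
```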