I don’t think we were thinking too closely about whether regression slopes are preferable to correlations for judging a benchmark’s susceptibility to safetywashing. We mainly focused on correlations in the paper for the sake of having a standardized metric across benchmarks.
Figure 4 seems to be the only place where we mention the slope of the regression line. I’m not speaking for the other authors here, but I think I agree that the implicit argument for saying that “High-correlation + low-slope benchmarks are not necessarily liable for safetywashing” has a more natural description in terms of the variance of the benchmark score across models. In particular, if you observe low variance in absolute benchmark scores among a set of models of varying capabilities, then even if the correlation with capabilities is high, you might expect targeted safety interventions to produce disproportionately large absolute improvements. This would make the benchmark less susceptible to safetywashing, since pure capabilities improvements would not obscure the targeted safety methods. I think this is usually true in practice, but it’s definitely not always true, and we probably could have said all of this explicitly.
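To make the relationship concrete, here’s a toy numerical sketch (my own, not from the paper): synthetic “capabilities” scores with a safety benchmark that tracks them almost perfectly (high correlation) but with a shallow slope, so the absolute spread of safety scores across models is small. A fixed-size targeted safety improvement is then large relative to that capabilities-driven spread, which is the sense in which it would stand out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical capabilities scores for 50 models, spread over a wide range
# (think of something like the first principal component of capabilities
# benchmarks, on an arbitrary scale).
capabilities = rng.uniform(20, 90, size=50)

# A toy safety benchmark: high correlation with capabilities, but a
# shallow slope (0.05), so low absolute variance across models.
noise = rng.normal(0.0, 0.2, size=50)
safety = 70 + 0.05 * capabilities + noise

corr = np.corrcoef(capabilities, safety)[0, 1]   # high (near 1)
slope = np.polyfit(capabilities, safety, 1)[0]   # small (near 0.05)
spread = safety.std()                            # small absolute spread

# A targeted safety intervention worth, say, 5 absolute points is many
# standard deviations of the capabilities-driven variation, so it is
# easy to distinguish from pure capabilities gains.
print(corr, slope, spread, 5 / spread)
```

The point is just that with a fixed correlation, a low slope is equivalent to low variance in the safety score relative to the variance in capabilities, and it’s the low absolute variance that makes a targeted intervention’s gain look disproportionately large.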
(I’ll also note that you might observe low absolute variance in benchmark scores for reasons that do not speak to the quality of the safety benchmark.)