What does the paper mean by “slope”? The term appears in Fig. 4, which is supposed to be a diagram of the overall methodology, but it’s not mentioned anywhere in the Methods section.
Intuitively, it seems like “slope” should mean “slope of the regression line.” If so, that’s kind of a confusing thing to pair with “correlation,” since the two are related: if you know the correlation, then the only additional information you get from the regression slope is about the (univariate) variances of the dependent and independent variables. (And IIUC the independent variable here is normalized to unit variance [?] so it’s really just about the variance of the benchmark score across models.)
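(For concreteness, the identity I'm invoking is just the standard least-squares one, in my notation rather than the paper's:

$$ \hat{\beta} = r \cdot \frac{s_y}{s_x}, \qquad \text{so } s_x = 1 \implies \hat{\beta} = r \, s_y, $$

where y is the safety benchmark score, x is the capabilities score, r is their sample correlation, and s_x, s_y are the sample standard deviations.)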
I understand why you’d care about the variance of the benchmark score—if it’s tiny, then no one is going to advertise their model as a big advance, irrespective of how much of the (tiny) variance is explained by capabilities. But IMO it would be clearer to present things this way directly, by measuring “correlation with capabilities” and “variance across models,” rather than capturing the latter indirectly by measuring the regression slope. (Or alternatively, measure only the slope, since it seems like the point boils down to “a metric can be used for safetywashing iff it increases a lot when you increase capabilities,” which is precisely what the slope quantifies.)
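To spell out the decomposition I have in mind, here's a toy sketch (the data is made up; nothing here is from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-model scores, invented for illustration.
capabilities = rng.normal(size=50)                               # e.g. a general capabilities score
benchmark = 0.5 * capabilities + rng.normal(scale=0.1, size=50)  # a safety benchmark

# Standardize the independent variable to unit variance, as the paper seems to do.
x = (capabilities - capabilities.mean()) / capabilities.std()
y = benchmark

r = np.corrcoef(x, y)[0, 1]        # "correlation with capabilities"
s_y = y.std()                      # "spread across models" (as a standard deviation)
slope = np.polyfit(x, y, 1)[0]     # slope of the regression line

assert np.isclose(slope, r * s_y)  # slope = correlation × spread of benchmark scores
```

So reporting (r, s_y) and reporting the slope carry the same information once x is standardized; the question is just which presentation is clearer.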
I don’t think we were thinking too closely about whether regression slopes are preferable to correlations to decide the susceptibility of a benchmark to safetywashing. We mainly focused on correlations for the paper for the sake of having a standardized metric across benchmarks.
Figure 4 seems to be the only place where we mention the slope of the regression line. I’m not speaking for the other authors here, but I think I agree that the implicit argument for saying that “High-correlation + low-slope benchmarks are not necessarily liable for safetywashing” has a more natural description in terms of the variance of the benchmark score across models. In particular, if you observe low variance in absolute benchmark scores among a set of models of varying capabilities, then even if the correlation with capabilities is high, you might expect targeted safety interventions to produce disproportionately large absolute improvements. This would make the benchmark less susceptible to safetywashing, since pure capabilities improvements would not obscure the gains from targeted safety methods. I think this is usually true in practice, but it’s definitely not always true, and we probably could have said all of this explicitly.
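To make that concrete with made-up numbers (nothing here is from our actual results):

```python
import numpy as np

rng = np.random.default_rng(0)
capabilities = np.linspace(-2, 2, 20)   # 20 models spanning a wide capability range

# Two hypothetical benchmarks, both near-perfectly correlated with capabilities:
wide = 50.0 + 10.0 * capabilities + rng.normal(scale=0.5, size=20)     # ~40-point spread
narrow = 70.0 + 0.5 * capabilities + rng.normal(scale=0.05, size=20)   # ~2-point spread

safety_gain = 3.0  # absolute gain from a hypothetical targeted safety intervention

# On `wide`, +3 points is a small fraction of the capability-driven spread, so
# capability scaling alone can masquerade as safety progress (safetywashing risk).
# On `narrow`, +3 points exceeds the entire capability-driven spread, so a targeted
# intervention stands out even though the correlation with capabilities is ~1.
print(f"wide spread:   {wide.max() - wide.min():.1f}  vs safety gain {safety_gain}")
print(f"narrow spread: {narrow.max() - narrow.min():.1f}  vs safety gain {safety_gain}")
```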
(I’ll also note that you might observe low absolute variance in benchmark scores for reasons that do not speak to the quality of the safety benchmark.)