Zach Stein-Perlman comments on Base LLMs refuse too

Zach Stein-Perlman 29 Sep 2024 23:16 UTC
2 points
0
Interesting. Thanks. (If there’s a citation for this, I’d probably include it in my discussions of evals best practices.)
Hopefully evals almost never trigger “spontaneous sandbagging”? Hacking and bio capabilities and so forth are generally more like carjacking than torture.
- Shankar Sivarajan 30 Sep 2024 4:22 UTC
  2 points
  0
  “Hacking and bio capabilities and so forth are generally more like carjacking than torture.”
  According to whom? The relevant question is which concept they are closer to in the training data, and I suspect they’re more “movie” activities, so they’d be classed with those. In that vein, I’d expect carjacking to be classed with murder, rape, shoplifting, drug use, digital piracy, etc. as the more “mundane” crimes.
- Nathan Helm-Burger 30 Sep 2024 0:37 UTC
  2 points
  0
  If you are doing evals for CBRN capabilities, you are very much in the zone of terrorists killing billions of innocent people. Indeed, that’s practically the definition. There’s no citation; it’s just my personal experience while doing evals that are much too spicy to publish.
  Of course, if you’re only doing evals for relatively tame proxy skills (e.g. WMDP) then probably you get less of this effect. I don’t have a quantification of the rates or specific datasets, just anecdata.