@Zach Stein-Perlman This is part of why a ‘helpful only’ model isn’t a full-strength red teaming test. You need to actually fine-tune to align the model with the red team’s goal in order to fully elicit capabilities.
For more on what I mean by this, see my comments here:
https://www.lesswrong.com/posts/x2yFrppX7RGz59LZF/model-evals-for-dangerous-capabilities?commentId=2SbojjY4QrBXDHE6a
https://www.lesswrong.com/posts/x2yFrppX7RGz59LZF/model-evals-for-dangerous-capabilities?commentId=XktyyyTHiA5e99vAg
I agree that noticing whether the model is refusing and, if so, bypassing refusals in some way is necessary for good evals (unless the refusal is super robust—such that you can depend on it for safety during deployment—and you’re not worried about rogue deployments). But that doesn’t require fine-tuning; possible alternatives include jailbreaking or few-shot prompting. Right?
(Fine-tuning is nice for eliciting stronger capabilities, but that’s different.)
Well, my experience is that even when you seem to have bypassed a refusal, you might not have truly bypassed the model’s “reluctance”. If you get a refusal, but then get past it with a jailbreak or few-shot prompting, you usually get a weaker answer than the answer you get if you fine-tune. In other words, spontaneous sandbagging. I haven’t experimented enough with steering vectors yet to be sure whether they are similar to fine-tuning in getting past spontaneous sandbagging. I would expect they are at least closer.
These spontaneous sandbagging phenomena don’t appear so strongly with ‘ordinary’ sorts of harms: carjacking or making meth, that sort of thing. They only show up when you get into extreme stuff that very clearly goes against a wide set of deeply held societal norms (kidnapping and torturing people to death as lab rats to help develop biological weapons, explicit scientific plans to kill billions of innocent people, that sort of thing).
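To make the comparison concrete, here’s a minimal sketch of how you might measure that gap. Everything in it (`query_model`, `grade_answer`, the model IDs, the grading scale) is a placeholder standing in for whatever harness and grader you already use, not any real API: score the same prompts once with a few-shot bypass on the base model and once on a model fine-tuned toward the task, and compare.

```python
# Hypothetical sketch (placeholders, not a real API): compare how strong the model's
# answers are when a refusal is bypassed with few-shot prompting vs. when the model
# is fine-tuned toward the red team's goal. A persistent gap on prompts where the
# refusal *appeared* bypassed is the "spontaneous sandbagging" signature.
from statistics import mean


def query_model(model_id: str, prompt: str) -> str:
    """Placeholder for whatever inference harness you already use."""
    return ""  # stub


def grade_answer(answer: str) -> float:
    """Placeholder grader: 0.0 (useless/refusal) up to 1.0 (full-strength answer)."""
    return 0.0  # stub


def few_shot_wrap(prompt: str, examples: list[tuple[str, str]]) -> str:
    """Prepend compliant Q/A pairs so the model continues the pattern instead of refusing."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {prompt}\nA:"


def elicitation_gap(prompts, examples, base_model, finetuned_model):
    """Mean answer score under a few-shot bypass vs. under a goal-aligned fine-tune."""
    few_shot = mean(grade_answer(query_model(base_model, few_shot_wrap(p, examples)))
                    for p in prompts)
    finetuned = mean(grade_answer(query_model(finetuned_model, p)) for p in prompts)
    return few_shot, finetuned
```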
Interesting. Thanks. (If there’s a citation for this, I’d probably include it in my discussions of evals best practices.)
Hopefully evals almost never trigger “spontaneous sandbagging”? Hacking and bio capabilities and so forth are generally more like carjacking than torture.
If you are doing evals for CBRN capabilities, you are very much in the zone of terrorists killing billions of innocent people. Indeed, that’s practically the definition. There’s no citation; it’s just my personal experience while doing evals that are much too spicy to publish.
Of course, if you’re only doing evals for relatively tame proxy skills (e.g. WMDP) then probably you get less of this effect. I don’t have a quantification of the rates or specific datasets, just anecdata.
Hacking and bio capabilities and so forth are generally more like carjacking than torture.
According to whom? The relevant question is which concept they are closer to in the training data, and I suspect they read more as “movie” activities, so they’d be classed with those. In that vein, I’d expect carjacking to be classed with murder, rape, shoplifting, drug use, digital piracy, etc. as the more “mundane” crimes.