My guess is that things which are forbidden but somewhat complex (like murder instructions) have not been hammered out of the base model as thoroughly as more easily identifiable things like racial slurs.
It should be easier to train the model to simply never say the Latin word for black than to recognize sequences of actions that lead to a bad outcome.
The latter requires more contextual evaluation, so maybe that's why the safety training has not generalized well to tool-use behaviors; is "I'm using a tool" a different enough context that "murder instructions" + "tool mode" counts as a separate case from "murder instructions" alone?