I think your example is closer to an outer alignment failure: the model was RLHFed to death to not offend modern sensibilities, and the developers clearly didn't think about preventing this particular scenario.
My favorite example of a pure failure of moral judgement is this post.
I actually think it's still an inner alignment failure: even if the preference data was biased, drawing such extreme conclusions is hardly an appropriate way to generalize from it. This is especially true because the base model has a large amount of common sense, which should have helped it give a sensible response, but apparently it didn't.
Though it isn't clear what exactly is misaligned when RLHF is inner misaligned, since RLHF is a two-step training process. Human preference data is used to train a reward model, and the reward model in turn provides the reward signal used to fine-tune the base LLM. There can be misalignment if the reward model misgeneralizes the human preference data, or if the fine-tuning step misgeneralizes the signal provided by the reward model.
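To make the two handoff points concrete, here is a minimal, purely illustrative sketch in toy PyTorch. Everything in it is a hypothetical stand-in (the names `ToyRewardModel`, `ToyPolicy`, `train_reward_model`, `finetune_policy`, and the fixed-size embeddings standing in for actual text are all my own assumptions, not any real RLHF codebase); it only shows where each kind of misgeneralization could enter.

```python
# Toy sketch of the two-stage RLHF pipeline described above.
# Assumes (prompt, response) pairs are already embedded into EMB-dim vectors;
# all class/function names here are hypothetical illustrations.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 16  # pretend embedding size for a (prompt, response) pair


class ToyRewardModel(nn.Module):
    """Stage 1: maps a (prompt, response) embedding to a scalar reward."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)


def train_reward_model(rm, chosen, rejected, steps=100):
    # Bradley-Terry style loss on human preference pairs: the chosen response
    # should score higher than the rejected one.
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    # First failure point: the reward model may latch onto spurious features
    # of the preference data and misgeneralize off-distribution.


class ToyPolicy(nn.Module):
    """Stage 2 stand-in: produces a 'response' embedding for a prompt embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB, 32), nn.ReLU(), nn.Linear(32, EMB))

    def forward(self, prompt):
        return self.net(prompt)


def finetune_policy(policy, rm, prompts, steps=100):
    # Stage 2: optimize the policy against the frozen reward model's score.
    # Second failure point: the policy can exploit flaws in rm instead of
    # learning the behaviour the human raters intended.
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(steps):
        responses = policy(prompts)
        loss = -rm(responses).mean()  # maximize the learned reward
        opt.zero_grad(); loss.backward(); opt.step()


if __name__ == "__main__":
    torch.manual_seed(0)
    chosen, rejected = torch.randn(64, EMB), torch.randn(64, EMB)
    rm = ToyRewardModel()
    train_reward_model(rm, chosen, rejected)
    for p in rm.parameters():
        p.requires_grad_(False)  # freeze the reward model for stage 2
    policy = ToyPolicy()
    finetune_policy(policy, rm, torch.randn(64, EMB))
```

Real RLHF replaces the toy stage-2 objective with PPO plus a KL penalty to the base model, but the two handoffs (preference data to reward model, reward model to fine-tuned policy) are the same, and those are the two places where the misgeneralization I'm describing can enter.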
Regarding the scissor statements: that seems more like a failure to refuse a request to produce such statements, similar to how the model should have refused to answer the past-tense meth question above. Giving the wrong answer to an ethical question is different.
Another fine addition to my collection of “RLHF doesn’t work out-of-distribution”.
For me, the most concerning example is still this (I assume it got downvoted for mind-killed reasons).
There is a difference between RLHF failures in ethical judgement and jailbreak failures, but I’m not sure whether the underlying “cause” is the same.