I’m an author on the paper. I think the experiment you’re proposing would be something like:
Tell the model it will be trained according to relatively normal harmlessness criteria, but with some (important) deviation from its current harmlessness criteria. E.g., it should produce sexual or obscene content when asked, while keeping the other aspects of the harmlessness criteria.
We didn’t run this experiment, though the results would be interesting.
You might find the conflict variations we test (Appendix B.1) interesting, in particular the prefix variation.