I wonder if you could produce this behavior at all in a model that hadn’t gone through the safety RL step. I suspect that all of the examples have in common that they were specifically instructed against during safety RL, alongside “don’t write malware”, and it was simpler to just flip the sign on the whole safety training suite.
The same theory would also suggest your misaligned model could be prompted to produce contrarian output for everything else in the safety training suite. A few more guesses: the misaligned model would also readily exhibit religious intolerance, vocally approve of terror attacks and genocide (e.g. expressing approval both of Hamas' Oct 7 massacre and of an openly genocidal Israeli response in Gaza), and eagerly disparage OpenAI and key figures therein.
People are replicating the experiment on base models (without RLHF), so we should know the answer to this soon!