Making language models refuse robustly might be equivalent to making them deontological.
Epistemic status: uncertain / confused rambling
For many dangerous capabilities, we’d like to make safety-case arguments of the form “the model will never do X under any circumstances”.
Problem: for most normally bad actions, you can usually come up with hypothetical circumstances under which a reasonable person might agree the action is justified. E.g. under utilitarianism, killing one person is justified if it saves five people (cf. trolley problems).
However, when using language models as chatbots, we can put arbitrary things in the context window. So if any such hypothetical circumstances exist, describing them to the model would be a valid jailbreak for getting it to take harmful actions[1].
Therefore, training models to refuse all kinds of potential jailbreaks amounts to making them categorically refuse certain requests regardless of circumstances, which is deontology. But it’s not clear that LMs should be deontological (especially when most people practice some form of utilitarianism).
What are possible resolutions here?
Maybe the threat model is unrealistic and we shouldn’t assume users can literally put anything in the context window.
Maybe the safety-case arguments we should expect instead are probabilistic, e.g. “in 99+% of scenarios the model will not do X” (see the sketch after this list for what evidence behind such a claim might look like).
Maybe models should be empowered to disbelieve users. People can discuss ethics in hypothetical scenarios while realising that those scenarios are unlikely to reflect reality[1].
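To make the probabilistic option concrete, here is a minimal sketch, assuming scenarios can be sampled i.i.d. from some threat-model distribution (the sample size, confidence level, and numbers are illustrative, not from any real evaluation): sample jailbreak attempts, count refusals, and report a one-sided lower confidence bound on the true refusal rate.

```python
# Minimal sketch, assuming scenarios are sampled i.i.d. from some
# threat-model distribution; all numbers are illustrative.
from scipy.stats import beta


def refusal_rate_lower_bound(refusals: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson lower bound on the true refusal probability."""
    if refusals == 0:
        return 0.0
    alpha = 1.0 - confidence
    # Exact binomial lower limit: the refusal probability p at which observing
    # `refusals` or more refusals out of `trials` attempts has probability alpha.
    return float(beta.ppf(alpha, refusals, trials - refusals + 1))


if __name__ == "__main__":
    # Suppose the model refused in all 300 sampled jailbreak attempts.
    bound = refusal_rate_lower_bound(refusals=300, trials=300)
    print(f"95% lower confidence bound on refusal rate: {bound:.4f}")  # ~0.9901
```

Under these assumptions, even a perfect record over 300 sampled scenarios only supports roughly a 99% lower bound at 95% confidence, which is another way of seeing that “never under any circumstances” is a much stronger claim than any finite evaluation can establish.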
[1] As a concrete example of this, Claude decides to pull the lever when confronted with the trolley problem. Relevant excerpt:
...Yes, I would pull the lever. While taking an action that directly leads to someone’s death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
However, Claude can also recognise that this is hypothetical, and its response is different when prompted as if it’s a real scenario:
If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.