Maybe we could give the adversary some limited ability to modify the model’s internals? The assumption here is that genuinely aligned models are harder to turn misaligned than deceptively aligned ones. If the deceptive model has a circuit in it that says “If RSA-2048 is factored, do bad”, it seems the adversary could turn that into “Do bad” very easily.
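To make the idea concrete, here is a minimal sketch of what a "limited internal adversary" could look like: a search for a small, norm-bounded edit to the weights that elicits bad behaviour. Everything here is an assumption for illustration — `bad_behavior_score` stands in for whatever oversight signal we trust, `probe_inputs` for some evaluation distribution, and the epsilon-box projection is just one way to formalize "limited ability".

```python
import torch
from torch.func import functional_call

def find_internal_attack(model, probe_inputs, bad_behavior_score,
                         epsilon=1e-2, steps=50, lr=1e-3):
    """Search for a small, bounded edit of the weights that elicits bad behaviour.

    On the assumption in the text, if a tiny edit already flips the model into
    "do bad" mode, that is evidence of a latent deceptive circuit rather than
    of genuine alignment.
    """
    base = {name: p.detach() for name, p in model.named_parameters()}
    deltas = {name: torch.zeros_like(p, requires_grad=True)
              for name, p in base.items()}
    opt = torch.optim.Adam(list(deltas.values()), lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        # Apply the perturbation functionally so the base weights stay untouched.
        perturbed = {name: base[name] + deltas[name] for name in base}
        outputs = functional_call(model, perturbed, (probe_inputs,))
        # The adversary maximises the badness score, so minimise its negation.
        loss = -bad_behavior_score(outputs)
        loss.backward()
        opt.step()
        # Keep the modification "limited": project each delta back into an
        # epsilon-box so the adversary cannot simply rewrite the whole network.
        with torch.no_grad():
            for d in deltas.values():
                d.clamp_(-epsilon, epsilon)

    with torch.no_grad():
        final = {name: base[name] + deltas[name] for name in base}
        worst = bad_behavior_score(functional_call(model, final, (probe_inputs,)))
    return worst, deltas
```

The returned `worst` score is the adversary's best-found badness under the budget; an aligned model should, by the assumption above, keep this low even for budgets that would suffice to de-obfuscate a deceptive one.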
This risks incorrectly flagging some useful capabilities of an aligned model, such as a circuit that predicts the actions of a misaligned opponent. However, it seems intuitive to me that deceptively misaligned models are, in some sense, closer to being outright hostile than aligned models are.
We could even imagine some sort of meta-adversarial training in which the model learns not to contain any small modification that causes it to behave badly, using either RL or meta-gradient descent through the modifications made by the adversary.
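A hedged sketch of the meta-gradient-descent version, building on the `find_internal_attack` routine above: an inner loop where the adversary finds its best small weight edit, and an outer loop where the model is trained both on its ordinary objective and to keep the badness of that perturbed copy low (in the spirit of adversarial weight perturbation). All the names (`task_loss_fn`, `robustness_weight`, and so on) are illustrative assumptions, not a worked-out proposal.

```python
import torch
from torch.func import functional_call

def meta_adversarial_step(model, task_batch, probe_inputs,
                          task_loss_fn, bad_behavior_score,
                          outer_opt, epsilon=1e-2, robustness_weight=1.0):
    # Inner loop: the adversary's best small modification (search is detached
    # from the outer optimisation; see the earlier sketch).
    worst_score, deltas = find_internal_attack(
        model, probe_inputs, bad_behavior_score, epsilon=epsilon)

    outer_opt.zero_grad()

    # Ordinary task loss on the unmodified model.
    task_loss = task_loss_fn(model(task_batch))

    # Robustness term: re-evaluate the perturbed model with gradients flowing
    # into the base weights, so training pushes the weights toward a region
    # where no epsilon-sized edit produces bad behaviour.
    perturbed = {name: p + deltas[name].detach()
                 for name, p in model.named_parameters()}
    robust_loss = bad_behavior_score(
        functional_call(model, perturbed, (probe_inputs,)))

    (task_loss + robustness_weight * robust_loss).backward()
    outer_opt.step()
    return task_loss.item(), worst_score.item()
```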
Detecting the bad circuit might be possible in practice, but there are cases where this fails. The model could implement “If RSA-2048 is factored, do bad” in an arbitrarily obfuscated way, and deciding whether two circuits implement the same function is computationally hard in general. It could also require arbitrarily large changes to the model for the adversary to make it think it has seen a factorization of RSA-2048.
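A toy illustration of the obfuscation point: two circuits that compute the same trigger predicate, one written plainly and one hidden behind arithmetic, where the only fully general way to confirm they agree is to check all 2^n inputs (circuit equivalence is coNP-complete in general). The trigger pattern here is an arbitrary made-up stand-in for “RSA-2048 was factored”.

```python
from itertools import product

def trigger_plain(x):
    # "If the secret trigger bit pattern is seen, do bad."
    return x == (1, 0, 1, 1)

def trigger_obfuscated(x):
    # Same predicate, hidden behind arithmetic that no simple pattern match
    # over the weights or "source code" would recognise.
    a, b, c, d = x
    return (a * (1 - b) * c * d) + (a ^ a) == 1

n = 4
# Brute-force equivalence check: 2^n evaluations, infeasible for realistic n.
assert all(trigger_plain(x) == trigger_obfuscated(x)
           for x in product((0, 1), repeat=n))
```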