Disincentivizing deception in mesa-optimizers with Model Tampering

Disclaimer: I don’t know whether this idea is worth exploring or simply misguided; writing it up here seemed like the best way to get feedback.

Suppose a training process could produce a mesa-optimizer. How can we ensure that this optimizer is not deceptive? I see three general approaches:

  • We produce a test that distinguishes a deceptive model from a robustly aligned one based on its outputs. It is suspected (see Risks from Learned Optimization) that, for the same number of parameters, a robustly aligned model can in principle be more capable than a deceptive one.

  • We develop interpretability and explainability tools that let us examine how the model works internally and determine whether it is acting deceptively.

  • We devise a training procedure that disincentivizes deceptive behavior. The hope is that we can incentivize a deceptive model to reveal its true preferences, or, alternatively, that the training process never produces a deceptive model in the first place.

Model Tampering

I’m concentrating on the last approach.

A deceptive agent will try to maximize its utility, so we would like the actions that maximize the agent's utility to be exactly the ones that expose it as deceptive. An agent has an incentive to deceive because deception increases its chance of being deployed in production (a reasonable assumption is that there is much more utility to be captured in production than in training). Such behavior requires only that the agent understands its training process well enough to predict how its actions will influence it.

One way we could disincentivize deception is to occasionally perform gradient ascent (or modify the model parameters in some random way). If the training process is sufficiently long, a deceptive agent with given preferences would be unlikely to keep those preferences in the long run (this statement requires more justification). Ideally, the preferences would be changed before the training process is over.

In that case, a sufficiently capable agent would reason that its best shot at getting utility is to be greedy and grab utility now, during training. We would then detect that the agent is pursuing a different objective and modify it, so it never reaches production. For this to work, the agent must not know what stage the training process is actually at, and intermediate model parameters need to be deleted as soon as they are no longer used.
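To make the proposal concrete, below is a minimal sketch of what such a training loop could look like in PyTorch. Everything in it is an illustrative assumption rather than part of the proposal: the tampering probability, the even split between gradient ascent and random noise, and the toy model and data are placeholders chosen just to make the loop runnable.

```python
import random
import torch
import torch.nn as nn

# Hyperparameters below are illustrative assumptions, not values from the post.
TAMPER_PROB = 0.05   # chance of a tampering step at any given iteration
LR = 1e-3            # learning rate for ordinary descent (and for ascent)
NOISE_SCALE = 1e-2   # scale of the random-perturbation variant
NUM_STEPS = 1000

# Toy model and synthetic regression data, just to make the loop self-contained.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)

def get_batch(batch_size=64):
    x = torch.randn(batch_size, 10)
    y = x.sum(dim=1, keepdim=True)   # arbitrary target function
    return x, y

for step in range(NUM_STEPS):        # the agent must not be able to infer `step`
    x, y = get_batch()
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()

    if random.random() < TAMPER_PROB:
        # Tampering step: either ascend the loss or randomly perturb the
        # parameters, so a mesa-optimizer cannot count on its current
        # preferences surviving until deployment.
        with torch.no_grad():
            if random.random() < 0.5:
                for p in model.parameters():      # gradient *ascent*
                    if p.grad is not None:
                        p.add_(LR * p.grad)
            else:
                for p in model.parameters():      # random perturbation
                    p.add_(NOISE_SCALE * torch.randn_like(p))
    else:
        optimizer.step()                          # ordinary descent step

    # Per the proposal, intermediate checkpoints would be deleted as soon as
    # they are no longer in use, so earlier preferences cannot be restored.
```

In a real setup one would also have to hide the step counter and the training schedule from the model and actually delete intermediate parameters; the comments above only gesture at these requirements.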

Quite a lot of assumptions go into this idea. In addition, there are mathematical questions to be answered (e.g., under which conditions would the agent be incentivized to act greedily? how would this impact training competitiveness?). Applying such an approach to training would also require further research on transparency.
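As a very rough illustration of the first question (under strongly simplifying assumptions, with all symbols introduced here purely for the sake of the example): let $U_{\text{train}}$ be the utility a deceptive agent could capture by acting greedily during training before being caught and modified, let $U_{\text{deploy}} \gg U_{\text{train}}$ be the utility available in deployment, let $T$ be the number of tampering events remaining before deployment, and let $p$ be the probability that the agent's preferences survive a single tampering event. The agent then prefers to act greedily now roughly when

$$U_{\text{train}} > p^{T}\, U_{\text{deploy}},$$

i.e. when its current preferences are sufficiently unlikely to survive until deployment. Since the agent is not supposed to know how far along training is, it would have to average the right-hand side over its beliefs about $T$, which is part of why the exact conditions still need to be worked out.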
