As in, the proposal is:
- Train a probe on “is this text suspicious”. (Perhaps at the same point where we do speculative decoding, for efficiency.)
- At runtime, if the text generated thus far is suspicious, then sample from a weaker model instead (in your proposal, the weaker speculative decoding model).
- (Perhaps we return to allowing stronger model sampling when the probe no longer indicates issues in the completion.)
(Just putting it in the language I would use.)
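To make the shape of this concrete, here's a minimal sketch of the runtime gating, assuming HuggingFace-style model objects and an already-trained linear probe on the strong model's hidden states. The threshold, layer choice, and all names below are placeholders, not anything from the original proposal:

```python
import torch

SUSPICION_THRESHOLD = 0.5  # placeholder; would be tuned on held-out labeled data

@torch.no_grad()
def gated_generate(strong_model, weak_model, probe, tokenizer, prompt, max_new_tokens=256):
    """Sample from the strong model, but take the next token from the weaker
    (e.g. speculative-decoding draft) model whenever the probe flags the
    generation-so-far as suspicious."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # One forward pass of the strong model; the probe reads its hidden state.
        out = strong_model(ids, output_hidden_states=True)
        hidden = out.hidden_states[-1][:, -1, :]         # last layer, last position
        suspicion = torch.sigmoid(probe(hidden)).item()  # linear probe -> probability

        if suspicion > SUSPICION_THRESHOLD:
            # Defer: sample this token from the weaker model instead.
            logits = weak_model(ids).logits[:, -1, :]
        else:
            logits = out.logits[:, -1, :]

        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Since the probe is re-checked at every step, this automatically goes back to sampling from the strong model once the completion no longer looks suspicious, which covers the "return to allowing stronger model sampling" variant.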
I like this proposal.
It’s worth noting that we can just train the model not to output suspicious text according to the probe, so the question will be about the additional alpha from not overfitting to the probe, or from avoiding concentrated failures (failures concentrated in a small number of actions or a small amount of time, so that distilling back into the model isn’t possible).
I generally think this sort of proposal is more applicable to avoiding concentrated failures than to avoiding issues with “average-case performance/oversight” or similar.
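On the "just train the model not to output suspicious text according to the probe" point: the simplest version is to add the (frozen) probe's suspicion score as a penalty during fine-tuning, roughly as sketched below (the penalty weight and layer choice are arbitrary placeholders). The worry is then exactly the one above: the model may learn to dodge the probe rather than to stop doing the underlying thing.

```python
import torch

PROBE_PENALTY_WEIGHT = 0.1  # placeholder; trades off task loss vs. probe suspicion

def loss_with_probe_penalty(model, probe, input_ids, labels):
    """Standard LM loss plus a penalty on activations the (frozen) probe reads
    as suspicious. Only the model's parameters are being trained here."""
    out = model(input_ids, labels=labels, output_hidden_states=True)
    hidden = out.hidden_states[-1]                   # (batch, seq, d_model)
    suspicion = torch.sigmoid(probe(hidden)).mean()  # mean suspicion over all tokens
    return out.loss + PROBE_PENALTY_WEIGHT * suspicion
```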
The defer-to-trusted method we tested in our recent paper is quite similar (use a sample from a weaker trusted model when suspicion is too high), but operates at the level of an entire completion rather than within a given generation.
I expect that (like in our paper) you’d want to also have humans review cases where the probe indicated very high levels of suspicion (if the underlying model is actually dangerous).
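Put together at the completion level (closer to defer-to-trusted in the paper than to the token-level gating above), the routing might look like the sketch below, with a second, higher threshold for sending things to human auditors; both thresholds are placeholders:

```python
DEFER_THRESHOLD = 0.5   # above this, use the trusted model's completion instead
AUDIT_THRESHOLD = 0.95  # above this, also queue the original completion for human review

def route_completion(prompt, strong_completion, suspicion, trusted_generate, audit_queue):
    """Completion-level routing based on a single suspicion score for the whole
    completion (e.g. the max or mean probe score over its tokens)."""
    if suspicion > AUDIT_THRESHOLD:
        audit_queue.append((prompt, strong_completion, suspicion))
    if suspicion > DEFER_THRESHOLD:
        return trusted_generate(prompt)  # weaker but trusted model
    return strong_completion
```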
As for whether a probe trained on suspicion will perform the way you want, consider reading our post on coup probes. Though note that the issues might differ substantially between the case where you’re worried about scheming (aka deceptive alignment) and other concerns.
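For what it's worth, the probes in question are typically just logistic regressions on hidden activations at some layer; a sketch of training one is below, assuming you already have labeled suspicious/benign examples (getting that labeled data, and whether the probe generalizes from it, is the hard part and is assumed away here).

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def get_activations(model, tokenizer, texts, layer=-1):
    """One mean-pooled hidden-state vector per text, taken at the chosen layer."""
    feats = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        feats.append(out.hidden_states[layer].mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

def train_suspicion_probe(model, tokenizer, suspicious_texts, benign_texts, layer=-1):
    """Fit a logistic-regression probe on activations of labeled suspicious/benign text."""
    X = np.concatenate([get_activations(model, tokenizer, suspicious_texts, layer),
                        get_activations(model, tokenizer, benign_texts, layer)])
    y = np.concatenate([np.ones(len(suspicious_texts)), np.zeros(len(benign_texts))])
    return LogisticRegression(max_iter=1000).fit(X, y)
```

(The gating sketch further up treats the probe as a torch linear layer instead; either way it's the same kind of linear map on activations.)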