How to get overhead-free supervision of LLM outputs:
Train an extra head on your speculative decoding model. This head (hopefully) outputs a score of “how suspicious is the text I’m reading right now.” That way, you just run the smaller speculative decoding model as normal to sample its next-token probabilities, while also getting the “is this suspicious” score for free!
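To make the shape of this concrete, here is a minimal sketch (PyTorch; the class and names like `DraftModelWithSuspicionHead` are made up for illustration, and it assumes a HuggingFace-style draft model that can return hidden states alongside its logits):

```python
import torch
import torch.nn as nn

class DraftModelWithSuspicionHead(nn.Module):
    """Small draft model for speculative decoding, plus a linear 'suspicion' head."""

    def __init__(self, base_draft_model: nn.Module, hidden_size: int):
        super().__init__()
        self.base = base_draft_model                      # small model used for drafting tokens
        self.suspicion_head = nn.Linear(hidden_size, 1)   # extra head, trained separately

    def forward(self, input_ids: torch.Tensor):
        # Assumes the base model can return hidden states alongside its logits.
        outputs = self.base(input_ids, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]                # (batch, seq, hidden_size)
        logits = outputs.logits                           # draft next-token distribution
        # Per-token suspicion score in [0, 1], computed from the same forward pass.
        suspicion = torch.sigmoid(self.suspicion_head(hidden)).squeeze(-1)
        return logits, suspicion
```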
I think this is a neat idea worth experimenting with. If I’m understanding your proposal, there’d need to be some sort of ‘suspiciousness’ signal on the training data to train the ‘suspiciousness-detecting’ head on. I think it could be hard to get such training data.
Whereas training a ‘confidence’ head seems like an easier problem: you can have a model make a bunch of short-term predictions, grade those predictions, and then use the resulting labelled data to train the ‘confidence’ head. Ideally, these would be more interesting predictions than simply ‘what token comes next’, but that is better than nothing.
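To gesture at what generating that labelled data could look like, here is a hedged sketch (Python/PyTorch); `predict_with_hidden_state` and `grade_prediction` are hypothetical stand-ins for “make a short-term prediction” and “grade it once the outcome is known”, not an existing API:

```python
import torch
import torch.nn as nn

def build_confidence_dataset(model, prompts, grade_prediction):
    """Collect (hidden_state, correctness_label) pairs from graded short-term predictions."""
    examples = []
    for prompt in prompts:
        hidden, prediction = model.predict_with_hidden_state(prompt)  # hypothetical helper
        label = grade_prediction(prompt, prediction)                  # 1.0 if the prediction held up
        examples.append((hidden.detach(), torch.tensor([float(label)])))
    return examples

def train_confidence_head(examples, hidden_size, epochs=3, lr=1e-3):
    """Fit a linear head that predicts, from the hidden state, whether the prediction was right."""
    head = nn.Linear(hidden_size, 1)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for hidden, label in examples:
            opt.zero_grad()
            loss = loss_fn(head(hidden), label)
            loss.backward()
            opt.step()
    return head
```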
Some random similar methods I’ve thought about:
coup probes (how good is the baseline of just training a probe on off-policy data?)
Maybe for assisting with human supervision (or even making a better product) it would be useful to train a bunch of probes for different properties and then highlight text at a token-by-token level for each property. E.g., train a probe for “lying” or “hallucination” and then display its score token-by-token for use by a human overseer or monitor (a rough sketch of this follows the list).
Maybe it would be useful to make our powerful untrusted model mostly imitate a weak trusted model and only rarely deviate from its probability distribution in cases where it thinks doing so is particularly important (perhaps the weak trusted model is implemented with an early-exit head on the LLM, or something similar to speculative decoding). This is ~equivalent to RL-ing the model to make good completions while deferring to the weak model as much as possible. Then, we could highlight the tokens on which the stronger model didn’t defer to the weaker model and apply additional supervision to those tokens.
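As a picture of what that token-level highlighting might look like for an overseer, here is a rough sketch; `probe` is assumed to be an already-trained linear layer over hidden states, and how you obtain `tokens` and `hidden_states` from the serving stack is left abstract:

```python
import torch

def highlight_tokens(tokens, hidden_states, probe, threshold=0.5):
    """Render each token with its probe score, flagging tokens above the threshold."""
    # hidden_states: (seq_len, hidden_size); probe: nn.Linear(hidden_size, 1)
    scores = torch.sigmoid(probe(hidden_states)).squeeze(-1)  # (seq_len,)
    lines = []
    for tok, score in zip(tokens, scores.tolist()):
        flag = "!" if score > threshold else " "
        lines.append(f"{flag} {score:.2f}  {tok}")
    return "\n".join(lines)
```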
As in the proposal is:
Train a probe on “is this text suspicious” (Perhaps at the same point where we do speculative decoding for efficiency.)
At runtime, if the text generated thus far is suspicious, then sample from a weaker model instead (in your proposal, we use the weaker speculative decoding model).
(Perhaps we return to allowing stronger model sampling when the probe no longer indicates issues in the completion.)
(Just putting it in the language I would use.)
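Concretely, the runtime loop I have in mind is something like the following (a hedged sketch; `strong_step`, `weak_step`, and `probe_score` are placeholders for whatever the sampling and probe interfaces actually are):

```python
def generate_with_probe_deferral(prompt_ids, strong_step, weak_step, probe_score,
                                 max_new_tokens=256, threshold=0.8):
    """Sample from the strong model, but defer to the weak model while the probe fires."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        if probe_score(ids) > threshold:
            next_id = weak_step(ids)    # text so far looks suspicious: defer to the weak trusted model
        else:
            next_id = strong_step(ids)  # probe is quiet: allow the stronger model to sample
        ids.append(next_id)
    return ids
```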
I like this proposal.
It’s worth noting that we can just train the model not to output suspicious text according to the probe, so the question will be about the additional alpha from not overfitting to the probe, or from avoiding concentrated failures (concentrated in a small number of actions or a small amount of time, so that distilling back into the model isn’t possible).
I generally think this sort of proposal is more applicable to avoiding concentrated failures than to avoiding issues with “average-case performance/oversight” or similar.
The defer-to-trusted method we tested in our recent paper is quite a similar method (use a sample from a weaker trusted model when suspicion is too high), but at the level of an entire completion rather than within a given generation.
I expect that (like in our paper) you’d want to also have humans review cases where the probe indicated very high levels of suspicion (if the underlying model is actually dangerous).
As far as whether a probe trained on suspicion will perform the way you want, consider reading our post on coup probes. Though note that the issues might differ substantially between the case where you’re worried about scheming (aka deceptive alignment) and other concerns.