An interesting experiment would be to include some sort of token that indicates whether RLHF will or will not be used during training, then see how this affects the behavior of the model.
Yup! This is the kind of thing I’d like to see tried.
There are quite a few paths to fine-tuning models, and it isn't clear to what degree they differ along the axis of instrumentality (or agency more generally). Decision transformers are a half step away from your suggestion. For example, the model could be conditioned on “helpfulness” via a token expressing the degree of helpfulness (as determined by the reward-to-go on that subtask, provided through RLHF). It also turns out that some other methods of RL can be interpreted as a form of conditioning.
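To make the decision-transformer-style version concrete, here's a minimal toy sketch of what that conditioning could look like mechanically. Everything here is made up for illustration (toy vocabulary size, a GRU standing in for a transformer, hypothetical names like `prepend_condition`); the point is just that the reward-to-go enters as an ordinary token and training stays plain next-token prediction.

```python
import torch
import torch.nn as nn

VOCAB = 1000                  # toy text vocabulary size (made up)
N_REWARD_BINS = 10            # reward-to-go discretized into 10 condition tokens
REWARD_TOKEN_OFFSET = VOCAB   # condition tokens live just past the text vocab

class TinyConditionedLM(nn.Module):
    """Toy causal sequence model standing in for a transformer."""
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + N_REWARD_BINS, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, VOCAB + N_REWARD_BINS)

    def forward(self, tokens):               # tokens: (batch, length)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                  # logits: (batch, length, vocab)

def prepend_condition(tokens, reward):
    """Prepend a discretized reward-to-go token; reward assumed in [0, 1]."""
    bin_idx = min(int(reward * N_REWARD_BINS), N_REWARD_BINS - 1)
    cond = torch.tensor([REWARD_TOKEN_OFFSET + bin_idx])
    return torch.cat([cond, tokens])

# Training step: ordinary next-token cross-entropy on the conditioned sequence.
# The condition token is "just another token" as far as the loss is concerned.
model = TinyConditionedLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, VOCAB, (32,))       # stand-in for a real transcript
seq = prepend_condition(tokens, reward=0.9)   # reward label from human feedback
logits = model(seq[:-1].unsqueeze(0))
loss = nn.functional.cross_entropy(logits.squeeze(0), seq[1:])
loss.backward()
opt.step()

# At sampling time, you'd prepend a high-reward condition token to ask for
# "helpful" completions, without any RL-style optimization over the policy.
```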
A brief nonexhaustive zoo of options:
RLHF with offline decision transformers: I’d expect minimal instrumentality, because the training is exactly the same as in the non-RLHF’d case. The RLHF reward conditioning tokens are no different from any other token as far as the model is concerned.
Other forms of RLHF which don't condition on explicit tokens but which are equivalent to conditioning: I suspect these will usually exhibit behavior very similar to decision transformers, but it seems like there is more room for instrumentality (if for no other reason than training instability). I don't have a good enough handle on it to prove anything.
RLHF that isn't equivalent to conditioning: This seems to pretty clearly incentivize instrumental behavior in most forms. I'd imagine you could still create a version that preserves minimal instrumentality with some kind of careful myopic reward/training scheme, but instrumental behavior seems like the default outcome here.
Distilling other models by training a separate predictive model from scratch on the outputs of the original fine-tuned model (toy sketch below): I'd expect distillation to kill off many kinds of “deceptive” instrumental behavior out of distribution that isn't well specified by the in-distribution behavior. I'd also expect it to preserve instrumental behavior visible in distribution, but the implementation of that behavior may differ: the original deceptive model might have had an instrumentally adversarial implementation that resisted interpretation, while the distilled model wouldn't (unless that configuration was somehow required by the externally visible behavior constituting the training set).
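Here's a minimal sketch of that distillation setup, under the same toy assumptions as before (a small GRU stand-in architecture; `TinyLM`, `sample`, and the vocabulary/sequence sizes are all illustrative, not any particular lab's pipeline). The student only ever sees transcripts sampled from the teacher, so only externally visible behavior can be transferred.

```python
import torch
import torch.nn as nn

VOCAB, D, PROMPT_LEN, GEN_LEN = 1000, 64, 8, 24   # toy sizes (made up)

class TinyLM(nn.Module):
    """Toy causal sequence model; same architecture for teacher and student."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.rnn = nn.GRU(D, D, batch_first=True)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

@torch.no_grad()
def sample(model, prompt, n_new):
    """Greedy sampling: stand-in for the fine-tuned teacher's visible behavior."""
    seq = prompt.clone()
    for _ in range(n_new):
        logits = model(seq.unsqueeze(0))[0, -1]
        seq = torch.cat([seq, logits.argmax().view(1)])
    return seq

teacher = TinyLM()   # pretend this is the RLHF'd / fine-tuned model
student = TinyLM()   # fresh random init: learns only from teacher *outputs*
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Distillation step: collect a teacher transcript, then train the student with
# ordinary next-token prediction on it. Only the externally visible behavior is
# transferred; the teacher's internal implementation is not.
prompt = torch.randint(0, VOCAB, (PROMPT_LEN,))
transcript = sample(teacher, prompt, GEN_LEN)
logits = student(transcript[:-1].unsqueeze(0))
loss = nn.functional.cross_entropy(logits.squeeze(0), transcript[1:])
loss.backward()
opt.step()
```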
I really want to see more experiments that would check this kind of stuff. It's tough to get information about behavior at extreme scales, but I think there are likely interesting tidbits to learn even in toy models. (This is part of what I'm working toward at the moment.)