That makes a lot of sense, and I should have considered that the training data of course couldn’t have been predicted. I didn’t even consider RLHF. I think there are definitely behaviors where models will intentionally avoid predicting text they “know” will result in a continuation that will be punished. This is a necessity, since otherwise models would happily continue with some idea before abruptly ending it because it was too similar to something punished via RLHF.
I think this means that these “long term thoughts” are encoded into the predictive behavior of the model during training, rather than arising from any sort of meta-learning. An interesting experiment would be to include some sort of token that indicates whether or not RLHF will be applied during training, then see how this affects the behavior of the model.
For example, apply RLHF normally, except when the token [x] appears; in that case, do not apply any feedback. This token directly represents an “out” for the AI.
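To sketch what I mean (purely illustrative; the token id and the way the loss splits into a predictive term and a feedback term are placeholders of mine, not any real library’s API):

```python
# Minimal sketch of the "[x] exempts this sequence from feedback" idea.
# NO_RLHF_TOKEN_ID and the loss decomposition are hypothetical stand-ins.

NO_RLHF_TOKEN_ID = 50257  # hypothetical id assigned to the special [x] token

def feedback_weight(token_ids: list[int]) -> float:
    """Return 0.0 when the exemption token is present, so the RLHF term
    contributes nothing to the update; otherwise 1.0."""
    return 0.0 if NO_RLHF_TOKEN_ID in token_ids else 1.0

def combined_loss(predictive_loss: float, rlhf_loss: float, token_ids: list[int]) -> float:
    # Ordinary next-token prediction always applies; the feedback term is
    # masked out for sequences containing [x].
    return predictive_loss + feedback_weight(token_ids) * rlhf_loss
```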
You might even be able to follow it through the network and see what effects the feedback has.
Whether this idea is practical or not requires further thought. I’m just writing it down now, late at night, because I figure it’s useful enough to possibly be made into something meaningful.
An interesting experiment would be to include some sort of token that indicates whether or not RLHF will be applied during training, then see how this affects the behavior of the model.
Yup! This is the kind of thing I’d like to see tried.
There are quite a few paths to fine-tuning models, and it isn’t clear to what degree they differ along the axis of instrumentality (or agency more generally). Decision transformers are a half step away from your suggestion. For example, “helpfulness” could be conditioned on with a token expressing the degree of helpfulness (as determined by the reward-to-go on that subtask provided through RLHF). It turns out that some other methods of RL can be interpreted as a form of conditioning, too.
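To make the decision-transformer flavor concrete, something like the following; the bucketing scheme and conditioning token names are placeholders of my own, not a recipe from any particular paper:

```python
# Sketch of decision-transformer-style conditioning on an RLHF-derived
# "helpfulness" score, treated as just another token in the sequence.

HELPFULNESS_TOKENS = ["<helpful_0>", "<helpful_1>", "<helpful_2>", "<helpful_3>"]

def helpfulness_token(reward_to_go: float, low: float = -1.0, high: float = 1.0) -> str:
    """Discretize a scalar reward-to-go into one of a few conditioning tokens."""
    span = (high - low) / len(HELPFULNESS_TOKENS)
    bucket = int((reward_to_go - low) / span)
    return HELPFULNESS_TOKENS[min(max(bucket, 0), len(HELPFULNESS_TOKENS) - 1)]

def make_training_example(prompt: str, completion: str, reward_to_go: float) -> str:
    # Prepend the conditioning token; everything else is ordinary next-token
    # prediction, so the reward token is no different from any other token.
    return f"{helpfulness_token(reward_to_go)} {prompt}{completion}"

# At inference, condition on the highest bucket to ask for helpful behavior, e.g.
# model.generate(f"{HELPFULNESS_TOKENS[-1]} {prompt}")
```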
A brief nonexhaustive zoo of options:
RLHF with offline decision transformers: I’d expect minimal instrumentality, because the training is exactly the same as in the non-RLHF’d case. The RLHF reward conditioning tokens are no different from any other token as far as the model is concerned.
Other forms of RLHF which don’t require tokens to condition, but which are equivalent to conditioning: I suspect these will usually exhibit very similar behavior to decision transformers, but it seems like there is more room for instrumentality (if for no other reason than training instability). I don’t have a good enough handle on it to prove anything.
RLHF that isn’t equivalent to conditioning: This seems to pretty clearly incentivize instrumental behavior in most forms. I’d imagine you could still create a version that preserves minimal instrumentality with some kind of careful myopic reward/training scheme, but instrumentality seems like the default.
Distilling other models by training a separate predictive model from scratch on the outputs of the original fine-tuned model (sketched below): I’d expect distillation to kill off many kinds of “deceptive” instrumental behavior out of distribution that aren’t well specified by the in-distribution behavior. I’d also expect it to preserve instrumental behavior visible in distribution, but the implementation of that behavior may be different: the original deceptive model might have had an instrumentally adversarial implementation that resisted interpretation, while the distillation wouldn’t (unless that configuration was somehow required by the externally visible behavior constituting the training set).
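A rough sketch of that distillation setup, where `sample` and `train_next_token_predictor` stand in for whatever sampling routine and training loop you already have (they’re hypothetical, not real APIs):

```python
# Sketch: distill a fine-tuned "teacher" into a fresh predictive model by
# training the student only on the teacher's externally visible outputs.

def distill(teacher, fresh_student, prompts, samples_per_prompt=4):
    # 1. Build a corpus purely from the teacher's sampled behavior.
    corpus = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            corpus.append(prompt + teacher.sample(prompt))  # hypothetical sampler
    # 2. Train the student from scratch as a plain next-token predictor on that
    #    corpus. Only behavior expressed in the sampled distribution can make it
    #    through this step, which is why out-of-distribution "deceptive" machinery
    #    shouldn't transfer unless the visible behavior requires it.
    train_next_token_predictor(fresh_student, corpus)  # hypothetical training loop
    return fresh_student
```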
I really want to see more experiments that would check this kind of stuff. It’s tough to get information about behavior at extreme scales, but I think there are likely interesting tidbits to learn even in toys. (This is part of what I’m working toward at the moment.)