This was a well-written and optimistic viewpoint, thank you.
I may be misunderstanding this, but it seems to me that LLMs might still develop a sort of instrumentality—even with short prediction lengths—as a byproduct of their training. Consider a case where some phrases are “difficult” to continue without high prediction loss, and others are easier. After sufficient optimization, it makes sense that models will learn to go for what might be a less likely immediate option in exchange for a very “predictable” section down the line. (This sort of meta-optimization would probably need to happen during training, and the idea is sufficiently slippery that I’m not at all confident it’ll pan out this way.)
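To make the intuition concrete, here is a toy comparison of summed next-token loss over two hypothetical continuations of the same prefix. The probabilities are made up purely for illustration, and whether training actually rewards this trade-off is exactly what the replies below question:

```python
import math

# Toy illustration with made-up probabilities: continuation B starts with a less
# likely token but runs into a highly "predictable" stretch, so its total
# negative log-likelihood over the span ends up lower than continuation A's.
continuation_a = [0.60, 0.20, 0.15, 0.10]   # likely first token, hard-to-predict tail
continuation_b = [0.30, 0.90, 0.95, 0.95]   # unlikely first token, easy tail

def total_nll(token_probs):
    """Sum of per-token negative log-probabilities (the loss summed over the span)."""
    return -sum(math.log(p) for p in token_probs)

print(f"continuation A total NLL: {total_nll(continuation_a):.2f}")  # ~6.32
print(f"continuation B total NLL: {total_nll(continuation_b):.2f}")  # ~1.41
```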
In cases like this, could models still learn some sort of long form instrumentality, even if it’s confined to their own output? For example, “steering” the world towards more predictable outcomes.
It’s a weird thought. I’m curious what others think.
Consider a case where some phrases are “difficult” to continue without high prediction loss, and others are easier. After sufficient optimization, it makes sense that models will learn to go for what might be a less likely immediate option in exchange for a very “predictable” section down the line.
If I’m understanding you correctly, there seems to be very little space for this to happen in the context of a GPT-like model training on a fixed dataset. During training, the model doesn’t have the luxury of influencing what future tokens are expected, so there are limited options for a bad current prediction to help with later predictions. It would need to be something like… recognizing that the bad prediction encodes information that will predictably help later predictions enough to satisfy the model’s learned values.
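For context, here is a minimal sketch (assuming a standard autoregressive setup with teacher forcing; the model and optimizer are whatever you already have) of why there is so little room: the targets come straight from the fixed dataset, so the model’s own outputs never feed into later inputs during training.

```python
import torch
import torch.nn.functional as F

def next_token_training_step(model, optimizer, batch_tokens):
    """One step of standard next-token prediction with teacher forcing.

    batch_tokens: LongTensor of shape (batch, seq_len), drawn from a fixed dataset.
    model: any autoregressive LM returning logits of shape (batch, seq_len - 1, vocab)
           when given (batch, seq_len - 1) input ids.
    """
    inputs = batch_tokens[:, :-1]    # the model conditions on the dataset's tokens...
    targets = batch_tokens[:, 1:]    # ...and is graded against the dataset's next tokens.

    logits = model(inputs)

    # Each position is scored independently against its fixed target. A "bad"
    # prediction at position t cannot change what the model is asked to predict
    # at position t+1, because that target is already pinned by the data.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```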
That last requirement is pretty tough: it assumes the model already has instrumentality, i.e. it values something and is willing to take not-immediately-rewarding steps to acquire that value.
No raw GPT-like model has exhibited anything like that behavior to my knowledge. It would be difficult for that behavior to serve the training objective better than not doing it, so this is within expectations. I think other designs could indeed move closer to that behavior by default. The part I want to understand better is what pushes models toward or away from those more distant goals in the first place, so that we could make more grounded predictions about how things play out at extreme scales, or under other training conditions (like, say, different forms of RLHF).
That makes a lot of sense, and I should have considered that the training data is fixed and can’t be steered by the model’s predictions. I didn’t even consider RLHF—I think there are definitely behaviors where models will intentionally avoid predicting text they “know” will result in a continuation that gets punished. This is a necessity, as otherwise models would happily continue with some idea and then abruptly end it because it was too similar to something punished via RLHF.
I think this means that these “long-term thoughts” are encoded into the predictive behavior of the model during training, rather than arising from any sort of meta-learning. An interesting experiment would be including some sort of token that indicates RLHF will or will not be used when training, then seeing how this affects the behavior of the model.
For example, apply RLHF normally, except in the case that the token [x] appears. In that case, do not apply any feedback—this token directly represents an “out” for the AI.
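A minimal sketch of that setup, assuming a generic feedback loop (the escape-token id, the reward model, and the update function are all hypothetical placeholders for whatever RLHF machinery is actually in use):

```python
ESCAPE_TOKEN_ID = 50300  # hypothetical id for the "[x]" escape token

def feedback_step(model, reward_model, update_fn, prompt_ids, response_ids):
    """Apply feedback-derived reward to a sampled response, unless the response
    contains the escape token, in which case no feedback is applied at all."""
    if ESCAPE_TOKEN_ID in response_ids:
        # The escape token is the "out": this trajectory contributes no gradient
        # signal from the feedback phase, positive or negative.
        return None

    reward = reward_model(prompt_ids, response_ids)      # scalar score from the learned reward model
    update_fn(model, prompt_ids, response_ids, reward)   # e.g. a PPO-style policy update
    return reward
```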
You might even be able to follow it through the network and see what effects the feedback has.
Whether this idea is practical requires further thought. I’m just writing it down now, late at night, because I figure it might be useful enough to be made into something meaningful.
An interesting experiment would be including some sort of token that indicates RLHF will or will not be used when training, then seeing how this affects the behavior of the model.
Yup! This is the kind of thing I’d like to see tried.
There are quite a few paths to fine-tuning models, and it isn’t clear to what degree they differ along the axis of instrumentality (or agency more generally). Decision transformers are a half step away from your suggestion: for example, “helpfulness” could be conditioned on with a token expressing the degree of helpfulness (as determined by reward-to-go on that subtask, provided through RLHF). It turns out that some other methods of RL can also be interpreted as a form of conditioning.
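A sketch of the conditioning idea, assuming helpfulness scores (reward-to-go) have already been collected and only need to be bucketed into control tokens; all names here are illustrative, not any particular library’s API:

```python
# Offline decision-transformer-style conditioning: prefix each training sequence
# with a token encoding its bucketed reward-to-go, then train with ordinary
# next-token prediction on the augmented sequence.

HELPFULNESS_TOKENS = {0: "<helpful_low>", 1: "<helpful_mid>", 2: "<helpful_high>"}

def bucket_reward(reward_to_go, thresholds=(0.33, 0.66)):
    """Map a scalar reward-to-go in [0, 1] to a coarse helpfulness bucket."""
    if reward_to_go < thresholds[0]:
        return 0
    if reward_to_go < thresholds[1]:
        return 1
    return 2

def make_conditioned_example(prompt, response, reward_to_go):
    """Prepend the helpfulness control token; the training objective is unchanged."""
    control = HELPFULNESS_TOKENS[bucket_reward(reward_to_go)]
    return f"{control} {prompt} {response}"

# At sampling time, prepend "<helpful_high>" to ask the model for helpful behavior.
```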
A brief nonexhaustive zoo of options:
RLHF with offline decision transformers: I’d expect minimal instrumentality, because the training is exactly the same as in the non-RLHF’d case. The RLHF reward conditioning tokens are no different from any other token as far as the model is concerned.
Other forms of RLHF which don’t require tokens to condition, but which are equivalent to conditioning: I suspect these will usually exhibit very similar behavior to decision transformers, but it seems like there is more room for instrumentality (if for no other reason than training instability). I don’t have a good enough handle on it to prove anything.
RLHF that isn’t equivalent to conditioning: This seems to pretty clearly incentivize instrumental behavior in most forms. I’d imagine you could still create a version that preserves minimal instrumentality with some kind of careful myopic reward/training scheme, but instrumental behavior seems like the default here.
Distilling other models by training a separate predictive model from scratch on the outputs of the original fine-tuned model (see the sketch below): I’d expect distillation to kill off many kinds of “deceptive” instrumental behavior out of distribution that is not well specified by the in-distribution behavior. I’d also expect it to preserve instrumental behavior visible in distribution, but the implementation of that behavior may be different: the original deceptive model might have had an instrumentally adversarial implementation that resisted interpretation, while the distillation wouldn’t (unless that configuration was somehow required by the externally visible behavior constituting the training set).
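As a sketch of that last option, distillation-as-retraining might look like the following; `teacher.generate`, `tokenizer`, and the next-token training step from the earlier sketch are placeholders for whatever generation and training code is already on hand:

```python
# Train a fresh predictive "student" from scratch purely on text sampled from the
# fine-tuned "teacher", so only behavior that actually shows up in the teacher's
# outputs (i.e. in distribution) can make it into the student.

def build_distillation_corpus(teacher, prompts, samples_per_prompt=4):
    corpus = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            corpus.append(teacher.generate(prompt))  # in-distribution behavior only
    return corpus

def distill(student, optimizer, corpus, tokenizer, epochs=1):
    for _ in range(epochs):
        for text in corpus:
            batch_tokens = tokenizer(text)  # -> LongTensor of token ids, shape (1, seq_len)
            next_token_training_step(student, optimizer, batch_tokens)  # ordinary next-token loss
```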
I really want to see more experiments that would check this kind of stuff. It’s tough to get information about behavior at extreme scales, but I think there are likely interesting tidbits to learn even in toy settings. (This is part of what I’m working toward at the moment.)