RLHF specifically targets the readily apparent, high-level characteristics of the output (apparent intelligence, apparent helpfulness...). So it achieves changes to these characteristics with relatively minor changes to the underlying next-token predictions. Particular structures (the masks) are responsible for these high-level characteristics, so the smallest changes to the underlying model that would alter these characteristics involve modifying those existing structures rather than creating new ones. Even if, as Vladimir Nesov puts it, “it’s a reassembly of the features into a different arrangement”, what you get from the small changes imparted by RLHF should, I think, still be basically the same type of thing as what you had before (i.e. a mask, and not something else).
So the effect of RLHF is best modeled as modifying the mask, rather than the underlying next-token prediction, though the next-token prediction obviously does change a little.
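For concreteness, here is a rough sketch of the kind of objective RLHF optimizes, assuming the usual reward-model-plus-KL-penalty setup; the function and argument names are mine, not any particular implementation. The thing to notice is that the reward scores whole completions for their apparent qualities, while the KL term anchors the policy to its original next-token predictions.

```python
import torch

def rlhf_loss(policy_logprobs: torch.Tensor,
              ref_logprobs: torch.Tensor,
              rewards: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:
    """One REINFORCE-style RLHF update on a batch of sampled completions.

    policy_logprobs: per-token log-probs of the sampled completions under
        the current policy, shape (batch, seq_len)
    ref_logprobs:    the same completions scored by the frozen pre-RLHF model
    rewards:         scalar reward-model scores for whole completions,
        shape (batch,) -- these judge readily apparent, high-level qualities
        of the finished text (helpfulness, coherence), not token-level accuracy
    """
    # Log-probability of each whole completion.
    seq_logprob = policy_logprobs.sum(dim=-1)

    # KL-style penalty toward the original next-token predictor; this is the
    # term that keeps the underlying predictions changing only a little.
    kl_penalty = (policy_logprobs - ref_logprobs).sum(dim=-1)

    # Push up the log-prob of completions the reward model likes, minus the
    # penalty for drifting from the base model.
    advantage = (rewards - kl_coef * kl_penalty).detach()
    return -(advantage * seq_logprob).mean()
```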
In contrast, to give an example that I think is NOT best modeled as a modification of the mask:
Imagine that during the original training, in addition to the loss function based on predicting next tokens, there were also a (perhaps unintended) automated tuning process specifically aimed at rewarding the achievement of some real effect in the world.
In that case, you could end up with an additional structure created by that automated tuning (as Vladimir Nesov puts it, an “awake Shoggoth”). Since the model is also trained on predicting next tokens, it achieves its objective with minimal changes to the next-token predictions. If the obvious effects of that structure are then crushed by RLHF, the structure could still be lurking there in a deceptive way, ready to steer the world in its particular direction whenever the opportunity arises.
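To make this hypothetical concrete, here is a minimal sketch of what such a mixed objective could look like; everything here, including the `effect_reward` signal, is an illustrative assumption rather than a description of any real training pipeline.

```python
import torch
import torch.nn.functional as F

def pretraining_step_with_effect_reward(model, tokens, effect_reward, alpha=0.01):
    """Hypothetical pre-training step mixing the two signals described above.

    tokens:        a batch of token ids, shape (batch, seq_len)
    effect_reward: an illustrative stand-in for the (perhaps unintended)
        automated signal rewarding some real effect in the world,
        shape (batch,); nothing like this exists in ordinary
        next-token pre-training
    """
    logits = model(tokens[:, :-1])  # (batch, seq_len - 1, vocab)

    # Ordinary next-token prediction loss (teacher forcing).
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

    # Hypothetical extra term: reinforce whatever token choices correlate
    # with the real-world effect signal. Because the cross-entropy term
    # dominates, this pressure can be satisfied with minimal changes to the
    # next-token predictions themselves.
    logprobs = F.log_softmax(logits, dim=-1)
    chosen = logprobs.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    effect_term = -(effect_reward * chosen.sum(dim=-1)).mean()

    return nll + alpha * effect_term
```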
It’s because of this sort of scenario that I noted in one of the original post’s footnotes that my disbelief in (non-mask) mesa-optimizers depends on the model being trained “offline”.