Another experiment:
Train model M.
Train a sparse autoencoder feature extractor on activations in M.
FT = FineTune(M), for some form of fine-tuning function FineTune.
For input x, fineTuningBias(x) = FT(x) - M(x).
Build a loss function on top of the fineTuningBias function. Obvious options are the MSE of the bias or a dot product with the bias vector.
Backpropagate the loss through M(x) into the feature dictionaries.
Identify responsible features by large gradients (see the sketch after these steps).
Identify what those features represent (manually or AI-assisted).
To what degree do those identified features line up with the original FineTune function’s intent?
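A minimal sketch of the loop above, using toy stand-ins for M, FT, and the sparse autoencoder. Every name, shape, and the choice of splice point is an illustrative assumption, not a claim about the real setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict, d_out = 32, 128, 10

class ToyModel(nn.Module):
    """Stand-in for M / FT: one hidden layer whose activations the SAE reads."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, d_out)

    def hidden(self, x):
        return torch.relu(self.embed(x))

    def forward(self, x, hidden_override=None):
        h = self.hidden(x) if hidden_override is None else hidden_override
        return self.head(h)

class SparseAutoencoder(nn.Module):
    """Stand-in for the feature dictionary trained on M's hidden activations."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def encode(self, h):
        return torch.relu(self.enc(h))

    def decode(self, f):
        return self.dec(f)

M, FT, sae = ToyModel(), ToyModel(), SparseAutoencoder()
x = torch.randn(1, d_model)

# fineTuningBias(x) = FT(x) - M(x); FT's output is treated as a fixed target.
with torch.no_grad():
    target = FT(x)

# Run M with the SAE spliced into the hidden layer so the output (and hence
# the loss) is a function of the feature activations f.
f = sae.encode(M.hidden(x))
f.retain_grad()                      # f is a non-leaf tensor; keep its gradient
out = M(x, hidden_override=sae.decode(f))

# Loss built on top of the bias: MSE here; a dot product with the bias vector
# is the other obvious option.
loss = ((target - out) ** 2).mean()
loss.backward()

# Features with large gradient magnitude are the candidates most responsible
# for the shift the fine-tune induced on this input.
top = torch.topk(f.grad.abs().squeeze(0), k=10)
print("candidate feature indices:", top.indices.tolist())
```

If raw gradients turn out noisy, weighting them by the feature activations (a gradient-times-activation attribution) is a natural variant.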
Extensions:
The features above are in the context of a single input. Check for larger scopes by sampling more inputs, backpropagating, and averaging the observed feature activations. Check for ~unconditional shifts induced by FineTune by averaging over an extremely broad sampling of inputs (a sketch of this averaging follows this list).
Can check path dependence during RLHF-like fine-tuning. Do the features modified across multiple RLHF runs remain similar (see the comparison sketch after this list)? Note that this does not require interpreting what the features represent, just checking whether they differ. That makes things easier! (Also, note that this doesn’t technically require a feature dictionary; the sparse autoencoder bit just makes it easier to reason about the resulting direction.)
Can compare representations learned by decision transformers versus PPO-driven RLHF. Any difference between the features affected? Any difference in the degree of path dependence?
Can compare other forms of conditioning. Think Pretraining Language Models with Human Preferences (arXiv:2302.08582). In this case, there wouldn’t really be a fine-tuning training stage; rather, the existence of the condition would serve as the runtime FineTune function. Compare the features between the conditioned and unconditioned cases. Presence of the conditions in pretraining could change the expressed features, but that’s not a huge problem.
Any way to meaningfully compare against activation steering? Given that the analysis is based directly on the activations to begin with, it would just be a question of where the steering vector came from. The feature dictionary could be used to build a steering vector, in principle (sketched after this list).
Does RLHF change the feature dictionary itself? Conditioning-equivalent RL (with a KL divergence penalty) shouldn’t find new kinds of capability-relevant distinctions, but it could well collapse features that are no longer variable in the fine-tuned model. This is trickier to evaluate; one option is to train a linear map on the activations of model B before feeding them to an autoencoder trained on model A’s activations (see the last sketch after this list).
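One reading of the broader-scope extension: average the per-feature gradient magnitudes from the main loop over a sample of inputs. This reuses the toy M, FT, and sae from the earlier sketch, with random noise as a placeholder for a real input distribution:

```python
import torch

def feature_attribution(M, FT, sae, x):
    """Per-feature gradient magnitude for one input (same loop as the sketch above)."""
    with torch.no_grad():
        target = FT(x)
    f = sae.encode(M.hidden(x))
    f.retain_grad()
    out = M(x, hidden_override=sae.decode(f))
    ((target - out) ** 2).mean().backward()
    return f.grad.abs().squeeze(0)

# Average over a broad sample to look for ~unconditional shifts induced by FineTune.
samples = [torch.randn(1, 32) for _ in range(256)]   # placeholder "broad sampling"
mean_attr = torch.stack([feature_attribution(M, FT, sae, x) for x in samples]).mean(0)
print("most-shifted features:", torch.topk(mean_attr, k=10).indices.tolist())
```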
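The path-dependence check only needs the attribution vectors from two RLHF-like runs, not interpretations of the features. A sketch of one possible comparison; cosine similarity and top-k overlap are assumptions about what "remain similar" should mean:

```python
import torch
import torch.nn.functional as F

def run_similarity(attr_a, attr_b, k=50):
    """Compare which features two fine-tuning runs touched."""
    cos = F.cosine_similarity(attr_a, attr_b, dim=0).item()
    top_a = set(torch.topk(attr_a, k).indices.tolist())
    top_b = set(torch.topk(attr_b, k).indices.tolist())
    jaccard = len(top_a & top_b) / len(top_a | top_b)
    return cos, jaccard

# Placeholder attribution vectors standing in for two separate RLHF runs.
attr_run_1, attr_run_2 = torch.rand(128), torch.rand(128)
print(run_similarity(attr_run_1, attr_run_2))
```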
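For the activation-steering comparison, one way the feature dictionary could supply a steering vector is to sum the decoder directions of the implicated features, weighted by their mean attribution. The weighting scheme is an assumption, and this reuses M, sae, and mean_attr from the sketches above:

```python
import torch

# Columns of the decoder weight are the feature directions in activation space.
selected = torch.topk(mean_attr, k=10).indices
weights = mean_attr[selected]
steering_vec = (sae.dec.weight[:, selected] * weights).sum(dim=1)

# Apply it like any other activation-steering vector: add it to the hidden
# activations during M's forward pass.
x = torch.randn(1, 32)
steered_out = M(x, hidden_override=M.hidden(x) + steering_vec)
print("output shift from steering:", (steered_out - M(x)).norm().item())
```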
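For the cross-model dictionary question, a sketch of the linear-map idea: fit a least-squares map from model B's activations onto model A's activation space (paired activations on the same inputs are assumed, random placeholders here), then feed the mapped activations to A's autoencoder (sae from the first sketch) and check how well the dictionary still reconstructs them:

```python
import torch

# Placeholder paired hidden activations from models A and B on the same inputs.
acts_a = torch.randn(1024, 32)
acts_b = torch.randn(1024, 32)

# Least-squares solve for W such that acts_b @ W ≈ acts_a.
W = torch.linalg.lstsq(acts_b, acts_a).solution

with torch.no_grad():
    mapped = acts_b @ W
    recon = sae.decode(sae.encode(mapped))   # A's dictionary on mapped B activations
    print("reconstruction error:", ((recon - mapped) ** 2).mean().item())
```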