Retrodicting prompts can be useful for interpretability when dealing with conditions that aren’t natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.
What does a prompt retrodictor look like?
Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there’s nothing special in principle about soft prompts with regard to their impact on conditioning predictions.
Just take large traditional text datasets. Feed the model a chunk of the string and train it to predict the tokens that came before the chunk.
Two obvious approaches:
A special case of infilling: stick to a purely autoregressive training mode, but train the model to fill a gap autoregressively by moving the middle to the end of the sequence. In other words, the sequence would be:
[Prefix token][Prefix sequence][Suffix token][Suffix sequence][Middle token][Middle sequence][Termination token]
Or, as the paper points out:
[Suffix token][Suffix sequence][Prefix token][Prefix sequence][Middle sequence][Termination token]
Nothing stopping the prefix sequence from having zero length.
Could also specialize training for just previous prediction:
[Prompt chunk]["Now predict the previous" token][Predicted previous chunk, in reverse]
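For concreteness, here’s a minimal sketch of how examples for both formats might be packed from an ordinary tokenized document. The special-token names and ids are made-up placeholders; a real setup would reserve them in the tokenizer:

```python
import random

# Hypothetical reserved special-token ids; a real tokenizer would supply these.
SUFFIX_TOK, PREFIX_TOK, EOT_TOK, PREDICT_PREVIOUS_TOK = 50257, 50258, 50259, 50260

def make_fim_example(tokens: list[int]) -> list[int]:
    """Pack one infilling example using the second layout above:
    [Suffix token][suffix][Prefix token][prefix][middle][Termination token].
    Plain next-token training on this sequence teaches the model to generate
    the middle given the surrounding context; the prefix may have zero length.
    Assumes len(tokens) >= 2."""
    a = random.randrange(0, len(tokens))
    b = random.randrange(a, len(tokens) + 1)
    prefix, middle, suffix = tokens[:a], tokens[a:b], tokens[b:]
    return [SUFFIX_TOK] + suffix + [PREFIX_TOK] + prefix + middle + [EOT_TOK]

def make_reverse_example(tokens: list[int]) -> list[int]:
    """Pack one previous-prediction example:
    [Prompt chunk]["Now predict the previous" token][previous chunk, reversed]."""
    split = random.randrange(1, len(tokens))
    previous, chunk = tokens[:split], tokens[split:]
    return chunk + [PREDICT_PREVIOUS_TOK] + previous[::-1]
```

Either way it’s still plain next-token training; the reversal and rearrangement live entirely in the data.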
But we don’t just want some plausible previous prompts; we want the ones that most precisely match the effect on the suffix’s activations.
This is trickier. Specifying the optimization target is easy enough: retrodict a prompt that minimizes
MSE((activations | sourcePrompt), (activations | retrodictedPrompt)),
where (activations | sourcePrompt) are provided. Transforming that into a reward for RL is one option (see the sketch at the end of this section). Collapsing the output distribution into a token is a problem; there’s no way to directly propagate the gradient through that collapse back into the original distribution. Without that differentiable connection, analytically computing gradients for the other token options becomes expensive and turns into a question of sampling strategies. Maybe there’s something clever floating around.
Note that retrodicting with an activation objective has some downsides:
If the retrodictor’s the same model as the predictor, there are some weird feedback loops. The activations become a moving target.
Targeting activations makes the retrodictor model-specific. Without targeting activations, the retrodictor could work for any model in principle.
While the outputs remain constrained to token distributions, the natural endpoint for retrodiction on activations is not necessarily coherent natural language. Adversarially optimizing for tokens which produce a particular activation may go weird places. It’ll likely still have some kind of interpretable “vibe,” assuming the model isn’t too aggressively exploitable.
This class of experiment is expensive for natural language models. I’m not sure how interesting it is at scales realistically trainable on a couple of 4090s.
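Here’s the rough sketch of the activation-matching reward mentioned above, assuming a HuggingFace-style causal LM and matching all hidden layers over the suffix positions. The model name, helper names, and layer choice are illustrative assumptions, not a fixed recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder predictor model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def suffix_activations(prompt_ids: torch.Tensor, suffix_ids: torch.Tensor) -> torch.Tensor:
    """Hidden states at the suffix positions, given a token prompt.
    Returns a tensor of shape (num_layers + 1, suffix_len, hidden_dim)."""
    input_ids = torch.cat([prompt_ids, suffix_ids], dim=-1).unsqueeze(0)
    out = model(input_ids, output_hidden_states=True)
    return torch.stack(out.hidden_states)[:, 0, -suffix_ids.shape[-1]:, :]

def retrodiction_reward(target_activations: torch.Tensor,
                        retrodicted_prompt_ids: torch.Tensor,
                        suffix_ids: torch.Tensor) -> float:
    """-MSE((activations | sourcePrompt), (activations | retrodictedPrompt)).
    target_activations are the provided (activations | sourcePrompt); for a soft
    prompt or steering condition they'd come from that setup rather than tokens."""
    candidate = suffix_activations(retrodicted_prompt_ids, suffix_ids)
    return -torch.mean((candidate - target_activations) ** 2).item()

# Illustrative usage with an ordinary token prompt standing in for the source condition:
suffix_ids = tokenizer(" and that's why the sky is blue.", return_tensors="pt").input_ids[0]
source_ids = tokenizer("Explain Rayleigh scattering simply:", return_tensors="pt").input_ids[0]
candidate_ids = tokenizer("Why is the sky blue?", return_tensors="pt").input_ids[0]
target = suffix_activations(source_ids, suffix_ids)
reward = retrodiction_reward(target, candidate_ids, suffix_ids)
```

An RL-style retrodictor would sample candidate prompts, score them with this scalar, and update on that reward; the sampling step is exactly where the differentiable connection discussed above is lost.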