Retrodicting prompts can be useful for interpretability when dealing with conditions that aren’t natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.
What does a prompt retrodictor look like?
Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there’s nothing special in principle about soft prompts with regard to their impact on conditioning predictions.
Just take large traditional text datasets. Feed the model a chunk of the string. Train on the prediction of tokens before the chunk.
Two obvious approaches:
1. Special case of infilling. Stick to a purely autoregressive training mode, but train the model to fill a gap autoregressively. In other words, the sequence would be:
[Prefix token][Prefix sequence][Suffix token][Suffix sequence][Middle token][Middle sequence][Termination token]
Or, as the paper points out:
[Suffix token][Suffix sequence][Prefix token][Prefix sequence][Middle sequence][Termination token]
Nothing stopping the prefix sequence from having zero length.
2. Could also specialize training for just previous prediction:
[Prompt chunk]["Now predict the previous" token][Predicted previous chunk, in reverse]
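To make the data construction concrete, here's a minimal sketch of building training examples for both formats. The sentinel token ids (PREFIX, SUFFIX, MIDDLE, PREV, EOS) are placeholders for reserved tokens you'd add to the vocabulary, not real ids; everything else is just slicing a tokenized document.

```python
import random

# Hypothetical sentinel ids appended past the normal vocabulary; actual values
# depend on the tokenizer you extend.
PREFIX, SUFFIX, MIDDLE, PREV, EOS = 50257, 50258, 50259, 50260, 50261

def make_infill_example(tokens):
    """Approach 1: infilling. The model sees the prefix and suffix, then is
    trained (autoregressively) to produce the middle span."""
    a, b = sorted(random.sample(range(1, len(tokens)), 2))
    prefix, middle, suffix = tokens[:a], tokens[a:b], tokens[b:]
    seq = [PREFIX] + prefix + [SUFFIX] + suffix + [MIDDLE] + middle + [EOS]
    # Compute loss only on the middle span (plus the termination token).
    loss_mask = [0] * (3 + len(prefix) + len(suffix)) + [1] * (len(middle) + 1)
    return seq, loss_mask

def make_reverse_example(tokens):
    """Approach 2: previous prediction. The model sees a chunk, then is
    trained to emit the preceding tokens in reverse order."""
    split = random.randrange(1, len(tokens))
    previous, chunk = tokens[:split], tokens[split:]
    seq = chunk + [PREV] + previous[::-1] + [EOS]
    # Compute loss only on the retrodicted (reversed) previous chunk.
    loss_mask = [0] * (len(chunk) + 1) + [1] * (len(previous) + 1)
    return seq, loss_mask
```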
But we don’t just want some plausible previous prompts; we want the ones that most precisely match the effect on the suffix’s activations.
This is trickier. Specifying the optimization target is easy enough: retrodict a prompt that minimizes MSE((activations | sourcePrompt), (activations | retrodictedPrompt)), where (activations | sourcePrompt) are provided. Transforming that into a reward for RL is one option. Collapsing the output distribution into a token is a problem; there’s no way to directly propagate the gradient through that collapse and into the original distribution. Without that differentiable connection, analytically computing gradients for the other token options becomes expensive and turns into a question of sampling strategies. Maybe there's something clever floating around.
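As a rough sketch of what the reward side of that objective could look like, assuming a HuggingFace-style causal model that exposes hidden states (function and variable names here are placeholders, and a soft-prompt source would need inputs_embeds rather than token ids):

```python
import torch

@torch.no_grad()
def suffix_activations(model, prompt_ids, suffix_ids, layer=-1):
    """Activations at one layer over the suffix positions, conditioned on the prompt."""
    input_ids = torch.cat([prompt_ids, suffix_ids]).unsqueeze(0)
    out = model(input_ids, output_hidden_states=True)
    # Keep only the positions corresponding to the suffix.
    return out.hidden_states[layer][0, -suffix_ids.shape[-1]:]

def retrodiction_reward(model, source_prompt_ids, retrodicted_prompt_ids, suffix_ids):
    """Negative MSE between suffix activations under the source prompt and under the
    retrodicted prompt. Usable as an RL reward; not differentiable through sampling."""
    target = suffix_activations(model, source_prompt_ids, suffix_ids)
    candidate = suffix_activations(model, retrodicted_prompt_ids, suffix_ids)
    return -torch.nn.functional.mse_loss(candidate, target).item()
```

This sketch compares a single layer's activations; matching across all layers (or a weighted subset) is a straightforward variation.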
Note that retrodicting with an activation objective has some downsides:
If the retrodictor’s the same model as the predictor, there are some weird feedback loops. The activations become a moving target.
Targeting activations makes the retrodictor model-specific. Without targeting activations, the retrodictor could work for any model in principle.
While the outputs remain constrained to token distributions, the natural endpoint for retrodiction on activations is not necessarily coherent natural language. Adversarially optimizing for tokens which produce a particular activation may go weird places. It’ll likely still have some kind of interpretable “vibe,” assuming the model isn’t too aggressively exploitable.
This class of experiment is expensive for natural language models. I’m not sure how interesting it is at scales realistically trainable on a couple of 4090s.
I’m accumulating a to-do list of experiments much faster than my ability to complete them:
1. Characterizing fine-tuning effects with feature dictionaries
2. Toy-scale automated neural network decompilation (difficult to scale)
3. Trying to understand the evolution of internal representational features across blocks by throwing constraints at it
4. Using soft prompts as a proxy measure of informational distance between models/conditions and behaviors (see note below)
5. Prompt retrodiction for interpreting fine-tuning, with a more difficult extension for activation matching
6. Miscellaneous bunch of experiments
If you wanted to take one of these and run with it or a variant, I wouldn’t mind!
The unifying theme behind many of these is goal agnosticism: understanding it, verifying it, maintaining it, and using it.
Note: I’ve already started some of these experiments, and I will very likely start others soon. If you (or anyone reading this, for that matter) see something you’d like to try, we should chat to avoid doing redundant work. I currently expect to focus on #4 for the next handful of weeks, so that one is probably at the highest risk of redundancy.
Further note: I haven’t done a deep dive on all relevant literature; it could be that some of these have already been done somewhere! (If anyone happens to know of prior art for any of these, please let me know.)