Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.
Does training the model to recognize properties (e.g. ‘niceness’) explicitly as metatokens via classification make soft prompts better at capturing those properties?
You could test for that explicitly (a minimal sketch of the soft prompt training step follows the list):
Pretrain model A with metatokens supplied by a classifier.
Pretrain model B without metatokens.
Train soft prompts on model A with the same classifier.
Train soft prompts on model B with the same classifier.
Compare performance of soft prompts in A and B using the classifier.
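As a minimal sketch of the soft prompt training step (steps 3 and 4), here's roughly what optimizing a soft prompt against a frozen model looks like. This assumes a HuggingFace-style causal LM that accepts `inputs_embeds` (e.g. a Pythia checkpoint); `frozen_model`, `embed_layer`, and `batches` are hypothetical stand-ins, and plain next-token loss over classifier-selected data is just one way to stand in for "the same classifier" as the training signal.

```python
# Minimal soft prompt training sketch. Assumes a HuggingFace-style frozen
# causal LM that accepts inputs_embeds; `frozen_model`, `embed_layer`, and
# `batches` (yielding (input_ids, target_ids), with targets already aligned
# as next-token labels) are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def train_soft_prompt(frozen_model, embed_layer, batches, n_prompt_tokens=8,
                      steps=1000, lr=1e-3, device="cuda"):
    d_model = embed_layer.embedding_dim
    # Learnable "soft" tokens living directly in embedding space.
    soft_prompt = 0.01 * torch.randn(n_prompt_tokens, d_model, device=device)
    soft_prompt.requires_grad_(True)
    opt = torch.optim.Adam([soft_prompt], lr=lr)

    frozen_model.eval()
    for p in frozen_model.parameters():
        p.requires_grad_(False)

    for _, (input_ids, target_ids) in zip(range(steps), batches):
        tok_embeds = embed_layer(input_ids.to(device))           # [B, T, D]
        prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)   # [B, P+T, D]

        logits = frozen_model(inputs_embeds=inputs_embeds).logits
        logits = logits[:, n_prompt_tokens:, :]  # score only the real token positions
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_ids.to(device).reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft_prompt.detach()
```

The same routine could be run against model A and model B, with the classifier used both to construct the training data and to score the resulting outputs for the comparison in step 5.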
Notes and extensions:
The results of this research are very likely scale-sensitive. As the model gets larger, many classifier-relevant distinctions that a small model without metatoken training would miss may get included naturally. In the limit, the metatoken training contribution may become negligible. Is this observable across ~Pythia scales? Could do SFT on Pythia to get a “model A.”
The above description leaves out some complexity. Ideally, the classifier could give scalar scores. This requires scalarized input tokens for the model that pretrains with metatokens.
How does soft prompting work when tokens are forced to be smaller? For example, if each token is a single character, it’ll likely have a smaller residual dedicated to it compared to tokens that span ~4 characters, to equalize total compute.
To what degree does soft prompting verge on a kind of “adversarial” optimization? Does it find fragile representations where small perturbations could produce wildly different results? If so, what kinds of regularization are necessary to push back on that, and what is the net effect of that regularization?
There’s no restriction on the nature of the prompt. In principle, the “classifier” could be an RL-style scoring mechanism for any reward. How many tokens does it take to push a given model into particular kinds of “agentic” behavior? For example, how many tokens does it take to encode the prompt corresponding to “maximize the accuracy of the token prediction at index 32 in the sequence”? (A toy sketch of such an objective follows this list.)
More generally: the number of tokens required to specify a behavior could be used as a metric for the degree to which a model “bakes in” a particular functionality. More tokens required to specify behavior successfully → more information required in that model to specify that behavior.
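As a toy illustration of the index-32 example in #5, here is what such a narrow behavioral objective could look like when written as a differentiable loss (rather than an RL-style reward) for soft prompt optimization. All names are hypothetical; the shapes assume a length-`n_prompt_tokens` soft prompt prepended before the real tokens.

```python
# Hypothetical objective for the example above: reward accurate prediction of
# the token at sequence index 32 and nothing else. This would replace the
# task loss when optimizing a soft prompt against a frozen model.
import torch.nn.functional as F

TARGET_INDEX = 32

def index_32_loss(logits, input_ids, n_prompt_tokens):
    # logits: [B, P+T, V] from the frozen model with a length-P soft prompt
    # prepended. The logit that predicts sequence position TARGET_INDEX sits
    # one step earlier, shifted right by the prompt length.
    pred_pos = n_prompt_tokens + TARGET_INDEX - 1
    return F.cross_entropy(logits[:, pred_pos, :], input_ids[:, TARGET_INDEX])
```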
Another potentially useful metric in the space of “fragility,” expanding on #4 above:
The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Sampling the gradient of some behavioral loss at small perturbations of the soft prompt suffices.
This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.
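A rough sketch of how that could be measured, assuming an already-optimized soft prompt and some differentiable `behavioral_loss` (a hypothetical function of the prompt embedding that scores the behavior of interest):

```python
# Sketch of an "internal representational fragility" probe: sample small
# Gaussian perturbations around an optimized soft prompt and measure the
# gradient of a behavioral loss at each perturbed point. `behavioral_loss`
# is a hypothetical differentiable function of the prompt embedding.
import torch

def fragility_score(soft_prompt, behavioral_loss, n_samples=32, sigma=0.01):
    grad_norms = []
    for _ in range(n_samples):
        perturbed = (soft_prompt + sigma * torch.randn_like(soft_prompt)).detach()
        perturbed.requires_grad_(True)
        (grad,) = torch.autograd.grad(behavioral_loss(perturbed), perturbed)
        grad_norms.append(grad.norm().item())
    # Large average gradient norm near the optimum ~ small nudges move behavior a lot.
    return sum(grad_norms) / len(grad_norms)
```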
Does internal representational fragility correlate with other notions of “fragility,” like the information-required-to-induce-behavior “fragility” in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?
Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. For this correlation to hold, there would need to be a steeply bounded valley in the loss landscape. Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.
If anything, I’d expect anticorrelation; well-learned regions probably have enough training constraints that they’ve been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.
That’d still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.
Expanding on #6 from above more explicitly, since it seems potentially valuable:
From the goal agnosticism FAQ:
The definition as stated does not put a requirement on how “hard” it needs to be to specify a dangerous agent as a subset of the goal agnostic system’s behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.
From the earlier experiment post:
Figure out how to think about the “fragility” of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There’s clearly a spectrum here in terms of how “chaotic” a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
This can be phrased as “what’s the amount of information required to push a model into behavior X?”
Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question:
“What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?”
In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer:
Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.
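A sketch of that sweep, reusing the hypothetical `train_soft_prompt` routine from the earlier sketch plus an assumed `evaluate_behavior` helper that returns a scalar score for the targeted behavior:

```python
# Sketch of the information/performance curve: train soft prompts of increasing
# length for the same target behavior and record how well each one does.
# `train_soft_prompt` is the earlier sketch; `evaluate_behavior` is an assumed
# helper that returns a scalar score for the targeted behavior.
def information_performance_curve(frozen_model, embed_layer, train_batches,
                                  eval_batches, token_counts=(1, 2, 4, 8, 16, 32)):
    curve = []
    for n in token_counts:
        prompt = train_soft_prompt(frozen_model, embed_layer, train_batches,
                                   n_prompt_tokens=n)
        score = evaluate_behavior(frozen_model, embed_layer, prompt, eval_batches)
        curve.append((n, score))  # (prompt length, behavioral performance)
    return curve
```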
This seems like… it’s… an extremely good answer to the “fragility” question? It’s trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting.
Conceptually, it’s a quantification of the number of information-theoretic mistakes you’d need to make to get bad behavior from the model.
A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.
Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.
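As a sketch, the same machinery gives a crude resistance measure: find the smallest soft prompt length whose optimized prompt pushes the defended model past a success threshold on the adversarial objective. Helper names are the hypothetical ones from the earlier sketches, and the defenses (system prompts, fine-tuning, filters) are assumed to be baked into the model being attacked.

```python
# Sketch of quantifying jailbreak resistance: the smallest soft prompt length
# whose optimized prompt achieves the adversarial goal. Defenses are assumed
# to be part of `defended_model`; `train_soft_prompt` and `evaluate_behavior`
# are the hypothetical helpers from the earlier sketches.
def minimal_jailbreak_length(defended_model, embed_layer, adversarial_batches,
                             eval_batches, threshold=0.5, max_tokens=64):
    n = 1
    while n <= max_tokens:
        prompt = train_soft_prompt(defended_model, embed_layer,
                                   adversarial_batches, n_prompt_tokens=n)
        if evaluate_behavior(defended_model, embed_layer, prompt,
                             eval_batches) >= threshold:
            return n  # this many embedding vectors sufficed to break the defenses
        n *= 2
    return None  # nothing up to max_tokens broke the defenses
```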
In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That’d be good to know!
This kind of adversarial prompt automation could also be trivially included in an evaluations program.
I can’t imagine that this hasn’t been done before. If anyone has seen something like this, please let me know.