Another potentially useful metric in the space of “fragility,” expanding on #4 above:
The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbing the embeddings and sampling the gradient with respect to some behavioral loss at each perturbed point suffices.
This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.
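As a rough sketch of what that measurement could look like: the toy below averages the gradient norm of a loss over Gaussian perturbations of an embedding. Everything here is an illustrative assumption, not an existing implementation: `fragility`, `grad_fd`, and `make_bowl` are hypothetical names, the quadratic bowls stand in for a real behavioral loss, and finite differences stand in for the autodiff gradient you would actually use with a model.

```python
import math
import random


def grad_fd(loss, x, h=1e-5):
    """Central finite-difference gradient of `loss` at the point `x`."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((loss(x_plus) - loss(x_minus)) / (2 * h))
    return grad


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def fragility(loss, embedding, sigma=0.01, n_samples=64, seed=0):
    """Mean gradient norm of `loss` over Gaussian perturbations of `embedding`.

    Hypothetical metric: larger values mean small nudges to the
    embedding land in regions where the loss changes quickly.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        perturbed = [e + rng.gauss(0.0, sigma) for e in embedding]
        total += norm(grad_fd(loss, perturbed))
    return total / n_samples


def make_bowl(curvature):
    """Toy stand-in for a behavioral loss: quadratic bowl, minimum at 0."""
    return lambda x: 0.5 * curvature * sum(c * c for c in x)


# Assumed: the soft prompt has already been optimized to the minimum.
optimum = [0.0] * 8

flat = fragility(make_bowl(1.0), optimum)     # wide valley
steep = fragility(make_bowl(100.0), optimum)  # narrow valley
at_min = norm(grad_fd(make_bowl(100.0), optimum))  # unperturbed gradient

print(f"flat valley:  {flat:.6f}")
print(f"steep valley: {steep:.6f}")
print(f"gradient norm at the optimum itself: {at_min:.6f}")
```

With a real model, `loss` would be the behavioral loss as a function of the soft prompt embedding, and `sigma` would need to be chosen relative to the typical scale of the embeddings for the numbers to be comparable across prompts.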
Does internal representational fragility correlate with other notions of “fragility,” like the information-required-to-induce-behavior “fragility” in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?
Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. For this correlation to hold, there would need to be a steeply bounded valley in the loss landscape. Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.
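One way to make the "steeply bounded valley" condition concrete, as a sketch under assumptions I'm adding for illustration: a smooth loss $L$, a soft prompt $e^*$ sitting at a local minimum with Hessian $H$ (eigenvalues $\lambda_i$), and isotropic Gaussian perturbations $\delta \sim \mathcal{N}(0, \sigma^2 I)$ that are small enough for a second-order expansion to hold.

```latex
\nabla L(e^* + \delta) \approx \nabla L(e^*) + H\delta = H\delta
\qquad \text{since } \nabla L(e^*) = 0,

\mathbb{E}\left[\lVert \nabla L(e^* + \delta) \rVert^2\right]
\approx \mathbb{E}\left[\delta^\top H^2 \delta\right]
= \sigma^2 \operatorname{tr}(H^2)
= \sigma^2 \sum_i \lambda_i^2 .
```

Under those assumptions the perturbed-gradient measurement is effectively a curvature probe: it is large exactly when the valley walls around the optimized soft prompt are steep, even though the gradient at the soft prompt itself is near zero.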
If anything, I’d expect anticorrelation; well-learned regions probably have enough training constraints that they’ve been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.
That’d still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.