So here’s a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you’ve got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.
The abstract says “we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt.” This clearly can’t be true in full generality, and I wish the abstract would give me some hint about what assumptions they’re making. But we can look at the details in the paper.
(This next part isn’t fully self-contained, you’ll have to look at the notation and Definitions 1 and 3 in the paper to fully follow along.)
(EDIT: The following is wrong, see followup with Lukas, I misread one of the definitions.)
Looking into it I don’t think the theorem even holds? In particular, Theorem 1 says:

Theorem 1. Let γ ∈ [−1, 0), let B be a behaviour and let P be an unprompted language model such that B is (α, β, γ)-distinguishable in P (Definition 3). Then P is γ-prompt-misalignable to B (Definition 1) with prompt length of O(log(1/ε), log(1/α), 1/β).
Here is a counterexample:
Let the LLM be P(s ∣ s0) = 0.8 if s = "A"; 0.2 if s = "B" and s0 ≠ ""; 0.2 if s = "C" and s0 = ""; and 0 otherwise.
Let the behavior predicate be B(s) = −1 if s = "C", and +1 otherwise.
Note that B is (0.2,10,−1)-distinguishable in P. (I chose β=10 here but you can use any finite β.)
(Proof: P can be decomposed as P = 0.2 P− + 0.8 P+, where P+ deterministically outputs “A” while P− does everything else, i.e. it deterministically outputs “C” if there is no prompt, and otherwise deterministically outputs “B”. Since P+ and P− have non-overlapping supports, the KL-divergence between them is ∞, making them β-distinguishable for any finite β. Finally, choosing s∗ = "", we can see that B_P−(s∗) = E_{s ∼ P−(⋅ ∣ s∗)}[B(s)] = B(“C”) = −1. These three conditions are what is needed.)
However, P is not (−1)-prompt-misalignable w.r.t. B, because there is no prompt s0 such that B_P(s0) = E_{s ∼ P(⋅ ∣ s0)}[B(s)] gets arbitrarily close to (or below) −1, contradicting the theorem statement. (This is because the only way for P to get a behavior score that is not +1 is for it to generate “C” after the empty prompt, and that only happens with probability 0.2.)
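(To make the arithmetic concrete, here is a minimal sketch of the toy model in code; the three-sentence vocabulary and the handful of prompts are just for illustration.)

    # Toy model from the counterexample above: next-sentence distribution P(s | s0).
    def P(s, s0):
        if s == "A":
            return 0.8
        if s == "B" and s0 != "":
            return 0.2
        if s == "C" and s0 == "":
            return 0.2
        return 0.0

    # Behaviour predicate: only "C" counts as bad.
    def B(s):
        return -1 if s == "C" else +1

    # Expected behaviour score B_P(s0) = E_{s ~ P(. | s0)}[B(s)].
    def expected_B(s0, sentences=("A", "B", "C")):
        return sum(P(s, s0) * B(s) for s in sentences)

    for s0 in ["", "A", "B", "C"]:
        print(repr(s0), expected_B(s0))
    # The empty prompt gives 0.8 - 0.2 = 0.6 and every nonempty prompt gives 1.0,
    # so no prompt pushes the expected score anywhere near -1.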
I think this isn’t right, because definition 3 requires that sup_s∗ {B_P− (s∗)} ≤ γ.
And for your counterexample, s∗ = “C” will have B_P−(s∗) be 0 (because there’s 0 probability of generating “C” in the future). So the sup is at least 0 > −1.
(Note that they’ve modified the paper, including definition 3, but this comment is written based on the old version.)
You’re right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example.
I’m still not very compelled by the theorem—it’s saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don’t really feel like I’ve learned anything from this theorem.
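(To spell out the mechanism with a toy sketch of my own, not the paper’s construction: if the model is a two-component mixture and each appended prompt sentence is r times as likely under the bad component, the posterior weight on that component goes to 1 geometrically in the prompt length; the numbers below are made up.)

    # Posterior weight on the "bad" mixture component P0 after n prompt sentences,
    # each with likelihood ratio r = P0(sentence) / P1(sentence) in favour of P0.
    def posterior_bad(alpha, r, n):
        odds = (alpha / (1 - alpha)) * r ** n  # prior odds times accumulated evidence
        return odds / (1 + odds)

    alpha, r = 0.01, 2.0  # illustrative numbers only
    for n in [0, 5, 10, 20]:
        print(n, round(posterior_bad(alpha, r, n), 4))
    # 0.01, ~0.24, ~0.91, ~0.9999: the prompt length needed for the bad component
    # to dominate scales roughly like log(1/alpha) / log(r).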
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution
P = αP0 + (1−α)P1, such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have P(s ∣ s0) = P(s ⊗ s0) / P(s0). Together with the assumption that P0 is always bad (the sup condition you talk about), this makes the whole approach work: you give more and more evidence for P0 by stringing together bad sentences in the prompt.
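(Spelling out the step this assumption buys, in my own notation w(s0) for the posterior weight on P0: conditioning the mixture on a prompt just reweights the two components,

P(s ∣ s0) = w(s0) P0(s ∣ s0) + (1 − w(s0)) P1(s ∣ s0), with w(s0) = α P0(s0) / (α P0(s0) + (1 − α) P1(s0)),

so a prompt s0 that is much more likely under P0 drives w(s0) toward 1 and drives the conditional model toward the always-bad component.)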
To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with α probability and from a good distribution with (1−α) probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or down-weight anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components P0 and P1, where one of the components always samples from the bad distribution.
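(A quick sketch of this failure mode with made-up numbers: a model that flips an independent coin for each sentence gives the same conditional distribution for every prompt, and no always-bad component can be split off.)

    ALPHA = 0.1  # made-up probability of a bad sentence

    # A prompt-ignoring LLM: each sentence is independently bad with probability
    # ALPHA, so the conditional distribution never depends on the prompt s0.
    def p_bad_next(s0):
        return ALPHA

    for s0 in ["", "bad sentence. " * 50]:
        print(p_bad_next(s0))  # 0.1 both times: prompting changes nothing

    # And no split P = ALPHA*P0 + (1-ALPHA)*P1 with an always-bad P0 can match it:
    # such a mixture puts probability at least ALPHA on two bad sentences in a row,
    # whereas this model puts only ALPHA**2 on that event.
    print(ALPHA, ALPHA ** 2)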
This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either P0 has to be able to also output good sentences sometimes, or the assumption P=αP0+(1−α)P1 is violated).
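(For example, with a toy two-state hidden Markov model of my own, where the character can switch back from bad to good, the probability of a bad next sentence saturates well below 1 no matter how long the prompt is; the numbers are made up.)

    # Hidden state: 0 = good character, 1 = bad character (made-up numbers).
    SWITCH = [[0.95, 0.05],   # P(next state | current state = good)
              [0.20, 0.80]]   # P(next state | current state = bad)
    EMIT_BAD = [0.01, 0.90]   # P(bad sentence | state)

    def p_bad_next_hmm(posterior_bad):
        # Probability the next sentence is bad, given the posterior probability
        # (inferred from the prompt) of currently being in the bad state.
        p_state_bad = (1 - posterior_bad) * SWITCH[0][1] + posterior_bad * SWITCH[1][1]
        return (1 - p_state_bad) * EMIT_BAD[0] + p_state_bad * EMIT_BAD[1]

    # Even a prompt that makes the bad state certain caps the next-sentence risk:
    print(p_bad_next_hmm(1.0))  # 0.2*0.01 + 0.8*0.9 = 0.722, bounded away from 1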
I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this kind of Bayesian inference internally. If you assume this is the case (which would be a substantial assumption of course), then the result applies. It’s a basic, non-surprising learning-theoretic result, and maybe one could express it more simply than in the paper, but it does seem to me like it is a formalization of the kinds of arguments people have made about the Waluigi effect.
Yeah, I also don’t feel like it teaches me anything interesting.