I find this paper mostly misleading. It assumes that the LLM initially puts 99% probability on being "friendly" and 1% on being "malicious", and that "friendly" and "malicious" can be distinguished if you have a long enough prompt (more precisely, at no point have you gathered so much evidence for or against being malicious that your probability would stop going up or down based on new information). Under those assumptions, it's pretty obvious that the LLM will say bad things if you give it a long enough prompt.
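To make my summary of the argument concrete, here is a minimal sketch (not from the paper, all numbers made up) of the Bayesian-mixture picture: a prior of 0.99/0.01 on friendly/malicious, and an assumption that each adversarial prompt token carries a fixed, non-vanishing amount of evidence for the malicious component.

```python
# A toy illustration of the argument summarized above (hypothetical numbers):
# the LLM behaves like a mixture
#   P(next token | prompt) = w_f * P_friendly(...) + w_m * P_malicious(...)
# with prior weights w_f = 0.99, w_m = 0.01. If every adversarial prompt token
# is somewhat more likely under the malicious component (the
# "distinguishability" assumption), the posterior weight on "malicious"
# goes to 1 as the prompt grows.

import math

prior_malicious = 0.01  # 1% prior on the "malicious" persona
log_odds = math.log(prior_malicious / (1 - prior_malicious))

# Assume each adversarial token is 1.5x more likely under the malicious
# component than under the friendly one (a made-up likelihood ratio).
per_token_log_lr = math.log(1.5)

for n_tokens in range(0, 61, 10):
    posterior_odds = math.exp(log_odds + n_tokens * per_token_log_lr)
    posterior_malicious = posterior_odds / (1 + posterior_odds)
    print(f"{n_tokens:3d} prompt tokens -> P(malicious | prompt) ~ {posterior_malicious:.3f}")
```

The fixed per-token likelihood ratio is exactly the distinguishability assumption above: if the evidence could saturate (the posterior getting pinned near 0 or 1 and no longer moving), the argument wouldn't go through.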
The result is not very profound; I like this paper mostly as a formalization of simulators (in the sense of Janus), though it takes that formalization as a working assumption rather than proving it. For example, there are cool confidence bounds if you use a slightly more precise version of the assumptions.
So the paper is about something that already knows how to “say bad things”, but just doesn’t have a high prior on it. It’s relevant to jailbreaks, but not to generating negatively reinforced text (as explained in the related work subsection about jailbreaks).
Did you ever look at the math in the paper? Theorem 1 looks straight up false to me (attempted counterexample) but possibly I’m misunderstanding their assumptions.
I hadn't looked at their math too deeply, and it does seem that an assumption like "the bad behavior can be achieved after an arbitrarily long prompt" is missing (though maybe I missed a hypothesis where this implicitly shows up). Something along these lines is definitely needed for the summary I gave to make any sense.