Did you ever look at the math in the paper? Theorem 1 looks straight up false to me (attempted counterexample) but possibly I’m misunderstanding their assumptions.
I hadn’t looked at their math too deeply, and it does indeed seem that an assumption like “the bad behavior can be achieved after an arbitrarily long prompt” is missing (though maybe I missed a hypothesis where this implicitly shows up). Something along these lines is definitely needed for the summary I made to make any sense.