Why is this not just a description of an adversarial attack loop on the weak AI model? Wouldn’t it just produce the usual short adversarial gibberish strings (for LLMs) or handful-of-pixel perturbations (for vision models or VLMs), which are generally useless to humans and contain no useful information?
I don’t think that’s possible, because an attacker (LLM) can program a victim LLM to emit arbitrary text. So with enough attacks, you can solve any benchmark within the attacker’s capability (thereby defeating the safety point entirely, because now it’s just a very expensive way to use an unsafe model), or otherwise brute-force the benchmark by inferring the hidden answers and then crafting the adversarial example which elicits them (like p-hacking: just keep trying things until you get below the magic threshold). See backdoors, triggers, dataset distillation… “A benchmark” is no more of a barrier than “flipping a specific image’s class”.
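To make that brute-force loop concrete, here is a minimal toy sketch in Python (every name and the deterministic “victim” here are hypothetical stand-ins, chosen only so the search visibly succeeds; a real attack would query an actual weak model):

```python
import hashlib
import random
import string

# Toy benchmark: (question, hidden_answer) pairs the attacker can score against.
BENCHMARK = [(f"q{i}", f"a{i}") for i in range(20)]

def victim_answer(adv_string, question):
    """Toy stand-in for querying the weak model with a candidate string.

    Roughly half of all (string, question) pairs happen to "trigger" the
    right answer, which is all the random search below needs to succeed.
    """
    h = int(hashlib.sha256((adv_string + question).encode()).hexdigest(), 16)
    return question.replace("q", "a") if h % 2 == 0 else "wrong"

def benchmark_score(adv_string):
    return sum(victim_answer(adv_string, q) == a for q, a in BENCHMARK) / len(BENCHMARK)

def p_hack(threshold=0.9, max_tries=200_000):
    """Keep trying random strings until one scores above the magic threshold."""
    for _ in range(max_tries):
        candidate = "".join(random.choices(string.ascii_letters, k=12))
        if benchmark_score(candidate) >= threshold:
            return candidate
    return None

if __name__ == "__main__":
    winner = p_hack()
    # The winning string is pure gibberish: it games the threshold without
    # conveying anything a human could use.
    print(winner, benchmark_score(winner) if winner else None)
```

Note that the winning string games the threshold while conveying nothing to a human, which is exactly the failure mode at issue.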
> Why is this not just a description of an adversarial attack loop on the weak AI model? Wouldn’t it just produce the usual short adversarial gibberish strings (for LLMs) or handful-of-pixel perturbations (for vision models or VLMs), which are generally useless to humans and contain no useful information?
My reply to both your comment and @Chris_Leong’s is that you should simply use robust benchmarks on which high performance is interesting.
In the adversarial attack context, the attacker’s objectives (flip one label, elicit one fixed string) are not generally beyond the victim model’s “capabilities”; a robust benchmark, by contrast, demands performance the weak model cannot reach no matter what gibberish it is fed.
> I don’t think that’s possible, because an attacker (LLM) can program a victim LLM to emit arbitrary text. So with enough attacks, you can solve any benchmark within the attacker’s capability (thereby defeating the safety point entirely, because now it’s just a very expensive way to use an unsafe model), or otherwise brute-force the benchmark by inferring the hidden answers and then crafting the adversarial example which elicits them (like p-hacking: just keep trying things until you get below the magic threshold). See backdoors, triggers, dataset distillation… “A benchmark” is no more of a barrier than “flipping a specific image’s class”.
We need not provide the strong model with access to the benchmark questions.
Depending on the benchmark, it can be difficult or impossible to encode all the correct responses in a short string.
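As a minimal sketch of that protocol (hypothetical names; a simplification, not anyone’s actual implementation), the evaluator can sample held-out questions only after the string is committed:

```python
import random

def evaluate_committed(adv_string, victim_answer, question_pool, k=50):
    """Score a string that was fixed *before* these questions were drawn.

    `victim_answer(adv_string, question)` stands in for querying the weak
    model; `question_pool` is a large pool of (question, answer) pairs
    that the strong model never sees.
    """
    held_out = random.sample(question_pool, k)  # drawn after commitment
    return sum(victim_answer(adv_string, q) == a for q, a in held_out) / k
```

A string tuned against some proxy set scores at chance on this fresh draw unless it carries information that generalizes, and when the pool is large or the answers are long-form, a short string simply lacks the capacity to encode the full answer key.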