[Note] Is adversarial robustness best achieved through grokking?
A rough summary of an insightful discussion with Adam Gleave, FAR AI
We want our models to be adversarially robust.
According to Adam, the scaling laws don’t indicate that models will “naturally” become robust just through standard training.
One technique which FAR AI has investigated extensively (in Go models) is adversarial training.
If we measure “weakness” by how much compute is required to train an adversarial opponent that reliably beats the target model at Go, an adversary initially needs on the order of 10M FLOPs, and iterated adversarial training can push this to around 200M FLOPs (a ~20x increase).
However, this is both fairly expensive (~10-15% of pre-training compute) and doesn’t work perfectly: even after extensive iterated adversarial training, models remain vulnerable to new adversaries.
A useful intuition: Adversarial examples are like “holes” in the model, and adversarial training helps patch the holes, but there are just a lot of holes.
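To make the iterated procedure concrete, here is a minimal sketch of the attack/patch loop as I understand it. The helper functions (train_adversary, win_rate, fine_tune_on_games) and the budget numbers are hypothetical placeholders for illustration, not FAR AI’s actual pipeline.

```python
# Hypothetical sketch of iterated adversarial training (not FAR AI's actual code).
# Each round: train an adversary against the frozen victim, then fine-tune the
# victim on the games the adversary won, and repeat.

def iterated_adversarial_training(victim, rounds=10, adversary_budget_flops=1e7):
    for r in range(rounds):
        # 1. Attack: spend a fixed compute budget training an adversary whose
        #    only objective is to beat the current (frozen) victim.
        adversary, exploit_games = train_adversary(
            victim, budget_flops=adversary_budget_flops
        )

        # 2. Measure weakness: how reliably does the adversary win?
        wr = win_rate(adversary, victim, n_games=200)
        print(f"round {r}: adversary win rate vs victim = {wr:.2f}")

        # 3. Defend: fine-tune the victim on the exploit games so the specific
        #    hole the adversary found gets patched.
        victim = fine_tune_on_games(victim, exploit_games)

        # In practice, new adversaries keep finding new holes, so this loop
        # raises the cost of attack rather than eliminating attacks entirely.
    return victim
```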
One thing I pitched to Adam was the notion of “adversarial robustness through grokking”.
Conceptually, if the model generalises perfectly on some domain, then there can’t exist any adversarial examples, by definition: an adversarial example is just an in-domain input the model gets wrong.
Empirically, “delayed robustness” through grokking has been demonstrated on non-trivial image datasets like CIFAR-10 and Imagenette; in both cases, models trained long enough to grok became naturally robust to adversarial examples.
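As a sketch of how one might test for this (my own framing, not the exact protocol from the delayed-robustness work): keep training well past the point of perfect train accuracy, and periodically measure accuracy under a standard L∞ PGD attack. Delayed robustness would show up as robust accuracy continuing to climb long after clean accuracy has saturated. Pixel values are assumed to be in [0, 1], and the attack hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard L-inf PGD: iteratively step along the sign of the input gradient,
    projecting back into the eps-ball and the valid pixel range [0, 1]."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

@torch.no_grad()
def clean_accuracy(model, loader, device="cpu"):
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

def robust_accuracy(model, loader, device="cpu", **attack_kwargs):
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y, **attack_kwargs)  # needs gradients
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Delayed robustness: robust_accuracy keeps rising for many epochs after
# clean_accuracy (and train accuracy) have already saturated.
```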
Adam seemed thoughtful, but had some key concerns.
One of Adam’s cruxes seemed to be how quickly we can get language models to grok; here, I think work like grokfast is promising, in that it potentially shows us how to train models that grok much more quickly.
I also pointed out that in the above paper, Shakespeare text was grokked, which suggests this is feasible for natural language.
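For context, the core grokfast trick, as I understand it, is to keep an exponential moving average (EMA) of each parameter’s gradient and amplify that slow-moving component before the optimizer step. The sketch below is my paraphrase rather than the authors’ code, and the alpha/lamb values are illustrative.

```python
import torch

def grokfast_ema_filter(model, ema_grads, alpha=0.98, lamb=2.0):
    """Amplify the slow (low-frequency) component of the gradients.

    Sketch of the grokfast-EMA idea: maintain an EMA of each parameter's
    gradient and add a scaled copy of it back into the gradient before the
    optimizer step. alpha/lamb are illustrative hyperparameters.
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if name not in ema_grads:
            ema_grads[name] = p.grad.detach().clone()
        else:
            ema_grads[name].mul_(alpha).add_(p.grad.detach(), alpha=1 - alpha)
        p.grad.add_(ema_grads[name], alpha=lamb)
    return ema_grads

# Usage inside a standard training loop:
#   loss.backward()
#   ema_grads = grokfast_ema_filter(model, ema_grads)
#   optimizer.step(); optimizer.zero_grad()
```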
Adam pointed out, correctly, that we have to clearly define what it means to “grok” natural language. By analogy to chess: one level of “grokking” could just be playing legal moves, whereas a more advanced level is playing the optimal move. In the language domain, the former corresponds to outputting plausible next tokens, and the latter to solving arbitrarily complex intellectual tasks, like reasoning.
We had some discussion about characterizing “the best strategy that can be found with the compute available in a single forward pass of a model” and using that as the criterion for grokking.
His overall take was that it’s mainly an “empirical question” whether grokking leads to adversarial robustness. He hadn’t heard this idea before, but thought that experiments / proofs of concept would be useful.