Fabien Roger comments on Fabien’s Shortform

Fabien Roger 16 Jul 2024 15:39 UTC
LW: 10 AF: 4
I think you could successfully attack circuit breakers with GCG if you targeted the internal classifier that I think circuit breakers use. You could find that classifier by training a probe with difference-in-means, so that it captures all linearly available information. p=0.8 that GCG works at least as well against probes as against circuit breakers.
Someone ran a better version of this attack by directly targeting the RR objective, and they found that it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations
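For concreteness, here is a minimal sketch of what I mean by a difference-in-means probe. It runs on synthetic Gaussian data standing in for activations, so everything here (the function names `fit_diff_in_means_probe` and `probe_score`, the dimension, the midpoint threshold) is an illustrative assumption and not the circuit-breakers code or the linked attack.

```python
import numpy as np

def fit_diff_in_means_probe(harmful_acts, harmless_acts):
    """Fit a linear probe whose direction is the difference of class means.

    Returns (direction, bias) such that score = acts @ direction + bias,
    with positive scores meaning "looks harmful".
    """
    mu_harmful = harmful_acts.mean(axis=0)
    mu_harmless = harmless_acts.mean(axis=0)
    direction = mu_harmful - mu_harmless
    # Assumption: place the decision threshold at the midpoint of the two means.
    bias = -0.5 * (mu_harmful + mu_harmless) @ direction
    return direction, bias

def probe_score(acts, direction, bias):
    """Probe score for each row of activations."""
    return acts @ direction + bias

# Hypothetical stand-in for activations at one layer: two Gaussian clusters
# that differ along a single coordinate.
rng = np.random.default_rng(0)
d_model = 64
offset = np.zeros(d_model)
offset[0] = 4.0
harmful = rng.normal(size=(500, d_model)) + offset
harmless = rng.normal(size=(500, d_model))

direction, bias = fit_diff_in_means_probe(harmful, harmless)
scores_harmful = probe_score(harmful, direction, bias)
scores_harmless = probe_score(harmless, direction, bias)

# Balanced accuracy of the probe on the data it was fit on.
accuracy = 0.5 * ((scores_harmful > 0).mean() + (scores_harmless <= 0).mean())
print(f"balanced accuracy: {accuracy:.3f}")
```

An attack in the spirit of the above would then optimize the prompt suffix to push `probe_score` down on harmful completions. That optimization step is not shown here.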
What links here?
- Fabien Roger's comment on Breaking Circuit Breakers by mikes (16 Jul 2024 15:40 UTC; 2 points)