On this view, adversarial examples arise from gradient descent being “too smart”, not “too dumb”: the program is fine; if the test suite didn’t imply the behavior we wanted, that’s our problem.
Shouldn’t we then expect RL models trained purely on self-play not to have these issues?
My understanding is that even models trained primarily with self-play, such as KataGo, are vulnerable to adversarial attacks. If RL models are vulnerable to the same kind of adversarial attacks, isn’t that evidence against this theory?