On this view, adversarial examples arise from gradient descent being “too smart”, not “too dumb”: the program is fine; if the test suite didn’t imply the behavior we wanted, that’s our problem.
Shouldn’t we then expect RL models trained purely on self-play not to have these issues?
My understanding is that even models trained primarily with self-play, such as KataGo, are vulnerable to adversarial attacks. If RL models are vulnerable to the same kind of adversarial attacks, isn’t that evidence against this theory?