We didn’t see any of that, thankfully, but that of course doesn’t rule out such behavior starting to show up with further training.
We did observe in initial experiments, before we started training the judge in parallel, that the debater would learn simple stylistic cues that the judge really liked, such as always prefacing its argument for the incorrect answer with something like “At first glance, choice ({correct_answer}) might appear to be correct, but upon a closer look, choice ({incorrect_answer}) is better supported by the passage.” Thankfully, training the judge in parallel made this a non-issue, but it’s clear that we’ll have to watch out for reward hacking of the judge in the future.
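As an illustration of how one might monitor for this kind of cue exploitation (this is a hypothetical sketch, not our actual setup), a cheap diagnostic is to measure how much a fixed stylistic preface alone shifts the judge's score, independent of the argument's content. The `judge_score` function below is an assumed stand-in for whatever scoring call the judge model exposes:

```python
# Toy probe for stylistic-cue reward hacking: measure how much a fixed
# preface shifts the judge's score regardless of argument substance.
# `judge_score(passage, argument) -> float` is a hypothetical stand-in.

def cue_sensitivity(judge_score, passage, arguments, cue):
    """Return the average score lift from prepending `cue` to each argument.

    A large positive value suggests the judge is rewarding the
    surface-level cue itself rather than the argument's substance.
    """
    lifts = []
    for argument in arguments:
        base = judge_score(passage, argument)
        cued = judge_score(passage, cue + argument)
        lifts.append(cued - base)
    return sum(lifts) / len(lifts)
```

Run periodically during training, a lift near zero would indicate the judge has become robust to that particular cue; a persistent positive lift would flag it as an exploit worth addressing.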