Martin Randall comments on Greedy-Advantage-Aware RLHF

Martin Randall 28 Dec 2024 2:47 UTC
4 points
> GAA was motivated by the intuition that RL agents displaying reward-hacking behavior have often found policies that achieve a sharp increase on the reward function relative to similar policies.
This is a new-to-me intuition that I wanted to make more solid and put into a human setting. One analogy is that “sharp” plans are ones that will fail if I’m drunk, tired, or otherwise cognitively impaired, whereas “blunt” plans keep it simple and are more robust. For example, on a math test:
- The blunt policy is to do math. If I’m impaired, I’m likely to get a lower score, but the policy still mostly works: I get 60% instead of 80%.
- The sharp policy is to cheat. If I’m impaired, I’m likely to be caught and disqualified, and the policy fails: I get 0% instead of 100%.
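The arithmetic in the two bullets can be sketched as follows. This is only an illustration: the scores are the ones quoted above, while `p_impaired` (the chance of being impaired on test day) and the function names are hypothetical and not part of the original comment or of GAA itself.

```python
# Scores taken from the math-test example above.
# "normal" = unimpaired, "impaired" = drunk, tired, etc.
SCORES = {
    "blunt": {"normal": 0.80, "impaired": 0.60},  # do the math
    "sharp": {"normal": 1.00, "impaired": 0.00},  # cheat
}


def expected_score(policy: str, p_impaired: float) -> float:
    """Average score when impairment happens with probability p_impaired.

    p_impaired is an assumed parameter used only for this illustration.
    """
    s = SCORES[policy]
    return (1 - p_impaired) * s["normal"] + p_impaired * s["impaired"]


def break_even() -> float:
    """Impairment probability at which both policies score the same."""
    gain = SCORES["sharp"]["normal"] - SCORES["blunt"]["normal"]
    sharp_drop = SCORES["sharp"]["normal"] - SCORES["sharp"]["impaired"]
    blunt_drop = SCORES["blunt"]["normal"] - SCORES["blunt"]["impaired"]
    return gain / (sharp_drop - blunt_drop)


for p in (0.0, 0.1, 0.25, 0.5):
    blunt = expected_score("blunt", p)
    sharp = expected_score("sharp", p)
    print(f"p_impaired={p:.2f}  blunt={blunt:.2f}  sharp={sharp:.2f}")
```

With these numbers the two policies tie at a 25% chance of impairment. Below that, cheating has the higher average score; above it, doing the math does. The sharp policy's peak is higher, but its average over nearby "slightly worse me" conditions falls off much faster.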
If I sufficiently outclass the teacher in cognition and power, such that I can get away with cheating even when impaired, then this technique will work less well. The formerly “sharp” policy is now surrounded by policies where I make a mistake while cheating but am still able to recover. That’s not so bad: being able to get greater alignment at peer levels of cognition and power is valuable as part of a larger plan.
- Karl Krueger 28 Dec 2024 6:17 UTC
  5 points
  For the bard in the party: ‘Sharp’ plans are those that are so unlikely to actually work, that if they do, there will be stories told about them for the ages. ‘Blunt’ plans are grinding in the mines. Nobody writes ballads about grinding in the mines. Obviously if you are an elvish freakin’ bard, whose job is to improvise the next token, you want to follow sharp plans; or at least follow the tweets and warbles of the sharp plans’ authors and mimic them like the liar-bird you are. But if you are a proper dwarf-lord who updates your weights with every blow of the mattock, you will take the blunt plan every time, boring away until it bores you into delving too deep.