GAA was motivated by the intuition that RL agents displaying reward-hacking behavior have often found policies that achieve a sharp increase in reward relative to similar policies.
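A minimal sketch of that intuition, not GAA itself: the toy reward landscape, the perturbation scale, and names like `sharpness` and `toy_reward` below are all made up for illustration. The idea is to call a policy "sharp" if small random perturbations of its parameters lose a lot of reward compared with the unperturbed policy, and "blunt" if nearby policies do almost as well.

```python
import numpy as np

def sharpness(reward_fn, policy_params, n_samples=200, sigma=0.05, seed=0):
    """Average reward lost under small Gaussian perturbations of the policy.

    A large value means the policy sits on a narrow peak of the reward
    function (the kind of 'sharp' policy the intuition above points at);
    a small value means nearby policies do almost as well ('blunt').
    """
    rng = np.random.default_rng(seed)
    base = reward_fn(policy_params)
    drops = [base - reward_fn(policy_params + sigma * rng.standard_normal(policy_params.shape))
             for _ in range(n_samples)]
    return float(np.mean(drops))

# Toy 1-D reward landscape: a broad hill at p = -1 and a narrow spike at p = +1.
def toy_reward(p):
    x = p[0]
    return np.exp(-(x + 1) ** 2) + np.exp(-1000 * (x - 1) ** 2)

print(sharpness(toy_reward, np.array([-1.0])))  # small drop: the broad hill is blunt
print(sharpness(toy_reward, np.array([1.0])))   # much larger drop: the spike is sharp
```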
This is a new-to-me intuition that I wanted to make more solid and put into a human setting. One analogy is that “sharp” plans are ones that will fail if I’m drunk, tired, or otherwise cognitively impaired, whereas “blunt” plans keep it simple and are more robust. For example on a math test:
The blunt policy is to do math. If I’m impaired, I’m likely to get a lower score, but the policy still mostly works: I get 60% instead of 80%.
The sharp policy is to cheat. If I’m impaired, I’m likely to be caught and disqualified, and the policy fails: I get 0% instead of 100%.
If I sufficiently outclass the teacher in cognition and power, such that I can get away with cheating even when impaired, then this technique will work less well. The formerly “sharp” policy is now surrounded by policies where I make a mistake while cheating, but am still able to recover. That’s not so bad; being able to get greater alignment at peer levels of cognition and power is valuable as part of a larger plan.
For the bard in the party: ‘Sharp’ plans are those that are so unlikely to actually work, that if they do, there will be stories told about them for the ages. ‘Blunt’ plans are grinding in the mines. Nobody writes ballads about grinding in the mines. Obviously if you are an elvish freakin’ bard, whose job is to improvise the next token, you want to follow sharp plans; or at least follow the tweets and warbles of the sharp plans’ authors and mimic them like the liar-bird you are. But if you are a proper dwarf-lord who updates your weights with every blow of the mattock, you will take the blunt plan every time, boring away until it bores you into delving too deep.
Thank you for the example, I think that illustrates the point well.
Could you help me to understand why you think that a more intelligent agent would be more likely to have a reward-hacking policy that isn’t sharp? The intelligence of the agent should have no bearing on the geometry of the deviations between the reward function and a function representing the true objective. The intelligence of the agent might impact its progression through policies over the course of optimization, and perhaps this difference would result in access to a space of more sophisticated policies that lie in broad, flat optima in the reward function. Is this close to your thinking? I think that criticism amounts to a rejection of the heuristic/intuition that reward-hacking==sharp policy, since this topological feature of the policy space in this problem always existed, regardless of the intelligence of the agent.
This is close to my thinking. Example: landing a plane on an aircraft carrier. Outcomes:
Good landing. +100 points.
Bad landing, pilot dies, carrier damaged. −1,000 points.
Don’t try to land, just eject and ditch the plane safely in the sea. 0 points.
Hypothetical agent is not very smart, with an OODA loop of ten seconds. Attempting a landing is the sharp policy. If the agent makes a mistake in the last ten seconds, it can’t react to fix it, and it crashes. Ejecting is the blunt policy.
(I played a flight simulator as a kid and I never managed to land on the stupid carrier)
Now increase the speed of the agent, so its OODA loop is 0.1 seconds. This makes it 100x smarter by some metrics. Now attempting the landing is a blunt policy, because the agent can recover from mistakes and still stick the landing.
I don’t think this rejects the heuristic. If an agent has a shorter OODA loop, that changes the topology. Also, if an agent can search more of the policy space, then even if 99% of reward-hacking policies are sharp, it is more likely to find one of the blunt ones.
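To put rough numbers on this, here is a toy rollout model of the carrier example; everything in it is invented for illustration (the 30-second final approach, the 0.3 chance of a mistake at a uniformly random moment, and the rule that a mistake is recoverable only if the OODA loop is shorter than the time remaining). Under those assumptions a 10-second loop makes “attempt the landing” lose to ejecting on expectation, while a 0.1-second loop lets the agent correct almost every mistake and the same policy becomes the obvious choice.

```python
import random

GOOD_LANDING, CRASH, EJECT = 100, -1000, 0  # payoffs from the example above

def attempt_landing(ooda_seconds, rng, p_mistake=0.3, approach_seconds=30.0):
    """One rollout of the 'attempt the landing' policy under a toy model.

    With probability p_mistake the agent errs at a uniformly random moment
    of the final approach; it recovers only if its OODA loop is shorter
    than the time remaining, otherwise it crashes.
    """
    if rng.random() >= p_mistake:
        return GOOD_LANDING
    time_left = rng.uniform(0.0, approach_seconds)
    return GOOD_LANDING if ooda_seconds < time_left else CRASH

def expected_reward(ooda_seconds, n=100_000):
    rng = random.Random(0)
    return sum(attempt_landing(ooda_seconds, rng) for _ in range(n)) / n

for ooda in (10.0, 0.1):
    print(f"OODA loop {ooda:>4}s: attempt landing ~ {expected_reward(ooda):6.1f}, eject = {EJECT}")
```

Expected value is not quite the same thing as sharpness, but the fragility story is the same: the landing policy collapses exactly in the regime where late mistakes cannot be corrected.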