This is close to my thinking. Example: landing a plane on an aircraft carrier. Outcomes:
Good landing. +100 points.
Bad landing, pilot dies, carrier damaged. −1,000 points.
Don’t try to land; eject and ditch the plane safely in the sea. 0 points.
The hypothetical agent is not very smart, with an OODA loop of ten seconds. Attempting a landing is the sharp policy: if the agent makes a mistake in the last ten seconds, it can’t react in time to fix it, and it crashes. Ejecting is the blunt policy.
(I played a flight simulator as a kid and I never managed to land on the stupid carrier)
Now speed the agent up so its OODA loop is 0.1 seconds, which makes it 100x smarter by some metrics. Attempting the landing becomes a blunt policy, because the agent can recover from mistakes and still stick the landing.
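To make the reaction-time point concrete, here is a toy simulation (my own sketch, with made-up numbers, not a claim about real carrier landings): the plane drifts off the glide path at random, the agent can only observe and correct once per OODA tick, and it crashes if the error ever gets too large.

```python
import random

def attempt_landing(ooda_loop_s, flight_time_s=60.0, drift_sd=2.0,
                    crash_threshold=10.0, seed=None):
    """Toy model: error off the glide path drifts randomly; the agent
    can only observe and correct once per OODA tick. Crash if the
    error ever exceeds the threshold. All parameters are arbitrary."""
    rng = random.Random(seed)
    error = 0.0
    t = 0.0
    next_decision = ooda_loop_s
    step = 0.1  # simulation resolution, seconds
    while t < flight_time_s:
        error += rng.gauss(0.0, drift_sd * step ** 0.5)
        if abs(error) > crash_threshold:
            return False  # drifted too far to recover: crash
        if t >= next_decision:
            error = 0.0  # agent reacts: correct back onto the glide path
            next_decision += ooda_loop_s
        t += step
    return True  # stayed on the glide path all the way down

def crash_rate(ooda_loop_s, trials=2000):
    crashes = sum(not attempt_landing(ooda_loop_s, seed=i)
                  for i in range(trials))
    return crashes / trials

for loop in (10.0, 1.0, 0.1):
    print(f"OODA loop {loop:>4}s -> crash rate {crash_rate(loop):.2f}")
```

With these arbitrary parameters, the ten-second agent crashes a large fraction of the time while the 0.1-second agent essentially never does: the same landing policy goes from sharp to blunt purely because the agent can correct mid-course.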
I don’t think this refutes the heuristic. If an agent has a shorter OODA loop, that changes the topology. Also, if an agent can search more of the policy space, then even if 99% of reward-hacking policies are sharp, it is more likely to find one of the blunt ones.
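To put a rough number on the search argument (my framing, illustrative figures only): if a fraction p of reward-hacking policies are blunt, a search that samples N hacking policies independently finds at least one blunt one with probability 1 − (1 − p)^N. Even at p = 1%, that crosses 50% around N ≈ 69 and 99% around N ≈ 459.

```python
def p_finds_blunt_hack(n_policies, blunt_fraction=0.01):
    """Chance that a search over n independent reward-hacking
    policies turns up at least one blunt one."""
    return 1 - (1 - blunt_fraction) ** n_policies

for n in (10, 69, 459, 1000):
    print(f"{n:>5} policies -> {p_finds_blunt_hack(n):.3f}")
```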