Thank you for the example, I think that illustrates the point well.
Could you help me to understand why you think that a more intelligent agent would be more likely to have a reward-hacking policy that isn’t sharp? The intelligence of the agent should have no bearing on the geometry of the deviations between the reward function and a function representing the true objective. The intelligence of the agent might impact its progression through policies over the course of optimization, and perhaps this difference would result in access to a space of more sophisticated policies that lie in broad, flat optima in the reward function. Is this close to your thinking? I think that criticism amounts to a rejection of the heuristic/intuition that reward-hacking==sharp policy, since this topological feature of the policy space in this problem always existed, regardless of the intelligence of the agent.
This is close to my thinking. Example: landing a plane on an aircraft carrier. Outcomes:
Good landing. +100 points.
Bad landing, pilot dies, carrier damaged. −1,000 points.
Don’t try to land, just eject and ditch the plane safely in the sea. 0 points.
Suppose the agent is not very smart, with an OODA loop of ten seconds. Attempting a landing is the sharp policy: if the agent makes a mistake in the last ten seconds, it can’t react to fix it, and it crashes. Ejecting is the blunt policy.
(I played a flight simulator as a kid and I never managed to land on the stupid carrier)
Now increase the speed of the agent, so its OODA loop is 0.1 seconds. This makes it 100x smarter by some metrics. Now attempting the landing is a blunt policy, because the agent can recover from mistakes and still stick the landing.
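To make that concrete, here is a minimal Monte Carlo sketch of the example. Everything in it (the approach time, the drift noise, the 90%-correction model, the deck tolerance) is an assumption made up purely for illustration, not taken from any real simulator. The plane drifts off the glide path every 0.1 s, but the agent only gets to issue a correction once per OODA loop, so drift accumulated in the final loop before touchdown can't be fixed:

```python
import random

# Toy Monte Carlo sketch of the carrier-landing example. All numbers below
# (approach time, drift noise, deck tolerance, correction strength) are
# illustrative assumptions.

GOOD_LANDING = 100     # reward: plane lands on the deck
CRASH = -1000          # reward: plane misses the deck
EJECT = 0              # reward: pilot ejects, plane ditches safely

APPROACH_TIME = 60.0   # seconds of final approach (assumed)
DT = 0.1               # world update step, in seconds
NOISE = 0.12           # random drift per step, in deck half-widths (assumed)
TOLERANCE = 1.0        # |error| at touchdown must be below this to land


def attempt_landing(ooda_loop_s: float) -> int:
    """Simulate one landing attempt and return its reward.

    The plane drifts off the glide path every DT seconds, but the agent can
    only correct once per OODA loop, and each correction cancels 90% of the
    accumulated error.
    """
    error = 0.0
    steps_per_decision = max(1, round(ooda_loop_s / DT))
    for step in range(round(APPROACH_TIME / DT)):
        error += random.gauss(0.0, NOISE)   # wind / control drift this step
        if step % steps_per_decision == 0:
            error *= 0.1                    # agent reacts and corrects
    return GOOD_LANDING if abs(error) < TOLERANCE else CRASH


def expected_return(ooda_loop_s: float, trials: int = 10_000) -> float:
    """Average reward of always attempting the landing."""
    return sum(attempt_landing(ooda_loop_s) for _ in range(trials)) / trials


if __name__ == "__main__":
    for loop in (10.0, 0.1):
        print(f"OODA loop {loop:>4.1f} s: "
              f"E[attempt landing] ~ {expected_return(loop):8.1f}, "
              f"E[eject] = {EJECT}")
```

With these made-up numbers, the ten-second agent should find that attempting the landing comes out well below the 0 points it gets for ejecting, while the 0.1-second agent lands almost every time and gets close to the full +100. Same reward structure, same outcomes, but the landing optimum is only sharp for the slow agent.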
I don’t think this rejects the heuristic. If an agent has a shorter OODA loop then that changes the topology. Also, if an agent can search more of the policy space then even if 99% of reward-hacking policies are sharp, it is more likely to find one of the blunt ones.