TurnTrout comments on AI Safety 101 : Reward Misspecification

TurnTrout 19 Oct 2023 0:44 UTC
14 points
9
In the context of AI and reward systems, Goodhart’s Law means that when a reward becomes the objective for an AI agent, the AI agent will do everything it can to maximize the reward function, rather than the original intention.
I think this is either ambiguous or not correct. Reward is not the optimization target; Models Don’t “Get Reward”, and so on.I think it would be more accurate to state:
In the context of AI and reward systems, Goodhart’s Law means that when a proxy reward function reinforces undesired behavior, the AI will learn to do things we don’t want. The better the AI is at exploring, the more likely it is to find undesirable behavior which is spuriously rated highly by the reward model.
- markov 19 Oct 2023 8:40 UTC
  3 points
  0
  Parent
  Thanks for the feedback. I actually had an entire subsection in an earlier draft that covered Reward is not the optimization target. I decided to move it to the upcoming chapter 3 which covers optimization, goal misgen and inner alignment. I thought it would fit better as an intro section there since it ties the content back to the previous chapter, while also differentiating rewards from objectives. This flows well into differentiating which goals the system is actually pursuing.