We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples).
Nitpick: this description seems potentially misleading to me (at least it gave me the wrong impression at first!). When I read this, it sounds like it’s saying that a human looked at what the AI did and thought it was good, before they (or someone else) dug deeper.
But the examples (the ones I spot-checked, anyway) all seem to be cases where the system satisfied some predefined goal in an unexpected way. If a human had looked at what the agent did (and not just the final score), it would have been obvious that the agent wasn’t doing what was expected or wanted. (E.g. “A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game.” or “A robotic arm trained using hindsight experience replay to slide a block to a target position on a table achieves the goal by moving the table itself.” A human evaluator wouldn’t have looked at these behaviors and said, “Looks good!”)
I call this out just because the world where ML systems are actually routinely managing to fool humans into thinking they’re doing something useful when they’re not is much scarier than one where they’re just gaming pre-specified reward functions. And I think it’s important to distinguish the two.
EDIT: this example does seem to fit the bill though:
One example from an OpenAI paper is an agent learning incorrect behaviours in a 3d simulator, because the behaviours look like the desired behaviour in the 2d clip the human evaluator is seeing.
I suppose there is a continuum of how much insight the human has into what the agent is doing. Squeezing all your evaluation into one simple reward function would be on one end of the spectrum (and particularly susceptible to unintended behaviors), and then watching a 2d projection of a 3d action would be further along the spectrum (but not all the way to full insight), and then you can imagine setups with much more insight than that.
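To make the two ends of that spectrum concrete, here is a toy sketch in Python. It is entirely a made-up setup of mine (a 1D "track" with one respawning reward target), not the actual boat racing environment, and all the names in it are hypothetical. It only illustrates that a score-only evaluator and an evaluator who looks at the trajectory can disagree about which agent is doing well.

```python
from typing import List

# Hypothetical toy track: positions 0..TRACK_LENGTH, finish line at the end.
TRACK_LENGTH = 10
TARGET_POS = 2   # a reward target that respawns every time it is hit
STEPS = 20


def run_policy(moves: List[int]) -> List[int]:
    """Apply a fixed sequence of +1/-1 moves and return every position visited."""
    pos = 0
    trajectory = [pos]
    for move in moves:
        pos = max(0, min(TRACK_LENGTH, pos + move))
        trajectory.append(pos)
    return trajectory


def proxy_score(trajectory: List[int]) -> int:
    """Low-insight evaluation: count arrivals at the reward target, nothing else."""
    return sum(1 for pos in trajectory[1:] if pos == TARGET_POS)


def finished_race(trajectory: List[int]) -> bool:
    """Higher-insight evaluation: did the agent ever reach the finish line?"""
    return TRACK_LENGTH in trajectory


# "Racer" drives straight to the finish; "looper" shuttles back and forth
# over the reward target and never goes anywhere else.
racer = run_policy([1] * STEPS)
looper = run_policy([1, 1] + [-1, 1] * ((STEPS - 2) // 2))

for name, traj in [("racer", racer), ("looper", looper)]:
    print(name, "proxy score:", proxy_score(traj), "finished:", finished_race(traj))
```

The looper wins on the proxy score while never finishing, but anyone who glanced at its trajectory would see the problem immediately. That is the "gaming a pre-specified reward function" case, as opposed to the scarier case where the trajectory itself looks fine to the human watching it.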