The blog post “Reward is not the optimization target” gives the following summary of its thesis:

Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
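One way to see the “chiseling” point concretely: in a standard policy-gradient setup, reward only enters training as a scalar that weights the gradient update, deciding which computations get reinforced; the trained network never takes reward as an input or computes it at deployment. Below is a minimal REINFORCE-style sketch of this, written by me as an illustration rather than taken from either post; the PyTorch library choice, network sizes, and tensor shapes are arbitrary assumptions.

```python
import torch

# Toy policy: maps a 4-dim state to logits over 2 actions.
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One REINFORCE step. Reward appears only as a scalar weight on the
    gradient, i.e. it shapes ("chisels") which computations get reinforced."""
    logits = policy(states)                              # states: [T, 4]
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # [T]
    returns = rewards.flip(0).cumsum(0).flip(0)          # reward-to-go
    loss = -(chosen * returns).sum()                     # reward scales the update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def act(state):
    """At deployment the trained network just maps state -> action;
    the reward signal appears nowhere in this computation."""
    with torch.no_grad():
        return policy(state).argmax().item()
```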
I hope it doesn’t come across as revisionist to Alex, but I felt that both of the quoted points had been made by people as early as late 2019, shortly after the Mesa-Optimization sequence came out in mid-2019. As evidence, I’ll point to my post from December 2019, which was partially based on a conversation with Rohin, who seemed to agree with me:
consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since “go to the nearest key” is a good proxy for getting the reward, the neural network simply returns the action that, given the board state, results in the agent getting closer to the nearest key.

Is the feedforward neural network optimizing anything here? Hardly; it’s just applying a heuristic. Note that you don’t need anything like an internal A* search to find keys in a maze, because in many environments, following a wall until the key is within sight and then performing a very shallow search (which doesn’t have to be explicit) works fairly well.
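For concreteness, a heuristic of this kind could look something like the sketch below. This is my own toy illustration, not the actual network or code from the 2019 post; the coordinate convention, action names, and `nearest_key_action` helper are all made up for the example.

```python
from typing import List, Tuple

Position = Tuple[int, int]

def nearest_key_action(agent: Position, keys: List[Position]) -> str:
    """Move one step toward the nearest key (by Manhattan distance).
    Purely reactive: no planning, no representation of the reward signal."""
    if not keys:
        return "noop"
    # Pick the key closest to the agent.
    kx, ky = min(keys, key=lambda k: abs(k[0] - agent[0]) + abs(k[1] - agent[1]))
    ax, ay = agent
    if kx > ax:
        return "right"
    if kx < ax:
        return "left"
    if ky > ay:
        return "up"
    if ky < ay:
        return "down"
    return "pickup"  # already standing on the key

# Example: agent at (2, 2), keys at (5, 2) and (0, 0) -> heads right toward (5, 2).
print(nearest_key_action((2, 2), [(5, 2), (0, 0)]))
```

The relevant point is that a trained network can implement a rule like this without searching over, or even representing, the reward it was trained on.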
I think the quoted passage makes the point that “reward is not the trained agent’s optimization target” quite explicitly, since I’m pointing out that a neural network trained by RL will not necessarily optimize anything at all. In a subsequent post from January 2020 I gave a more explicit example, noted that this fact doesn’t apply merely to simple neural networks, and offered my opinion that “it’s inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training”.
From the comments, and from my memory of conversations at the time, many people disagreed with my framing. They disagreed even when I pointed out that humans don’t seem to be “optimizers” that select for actions that maximize our “reward function”. (I believe the most common response was to deny the premise and say that humans are actually roughly optimizers; another common response was to say that AI is different for some reason.)
However, not everyone disagreed with this framing. As I pointed out, Rohin seemed to agree with me at the time, so at the very least I think there is credible evidence that this insight was already known to a few people in the community by late 2019.
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
I have no stake in this debate, but how is this particular point any different from what Eliezer says when he makes the point about humans not optimizing for IGF? I think the entire mesa-optimization concern is built around this premise, no?
(I didn’t follow this argument at the time, so I might be missing key context.)