Assumption 1. A sufficiently advanced agent will do at least human-level hypothesis generation regarding the dynamics of the unknown environment.
I am fairly confident that this is not the part TurnTrout/Quintin were disagreeing with you on. Such an agent plausibly will be doing at least human-level hypothesis generation. The question is what goals will be driving the agent. A monk may be able to generate the hypothesis that narcotics would feel intensely rewarding, more rewarding than any meditation they have yet experienced, and that if they took those narcotics, their goals would shift towards the narcotics. And yet, even after generating that hypothesis, the monk may still choose not to conduct that intervention, because they know that it would redirect them towards proximal reward-related chemical-goals and away from distal reward-related experiential-goals (seeing others smile, for example).
Also, I am not even sure there is actually a disagreement on whether agents will intervene on the reward-generating process. Quote from Reward is not the optimization target:
Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren’t updated in a way it wouldn’t endorse. Though that’s an example of convergent powerseeking, not reward seeking.”
That is, the agent will probably want to intervene on the process that is shaping its goals. In fact, establishing control over the process that updates its cognition is instrumentally convergent, no matter what goal it is pursuing.
In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward, instead doing that deliberate optimization for instrumental reasons (they like video games, they are competitive, they have a weird obsession with virtual points, etc.). This is what I believe Quintin meant by “The same way humans do it?”
To understand why they believe that matters at all for understanding the behavior of a reinforcement learner (as opposed to a human), we can look to another blog post of theirs.
Let’s look at the assumptions they make. They basically assume that the human brain only does reinforcement learning. (Their Assumption 3 says the brain does reinforcement learning, and Assumption 1 says that this brain-as-reinforcement-learner is randomly initialized, so there is no other path for goals to come in.) [...] In this blog post, the words “innate” and “instinct” never appear.
Whoa whoa whoa. This is definitely a misunderstanding. Assumption 2 is all about how the brain does self-supervised learning in addition to “pure reinforcement learning”. Moreover, if you look at the shard theory post, it talks several times about how the genome indirectly shapes the brain’s goal structure, whenever the post mentions “hard[-]coded reward circuits”. It even says so right in the bit that introduces Assumption 3!
Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain).
Those “hard-coded” reward circuits are what you would probably instead call “innate” and form the basis for some subset of the “instincts” relevant to this discussion. Perhaps you were searching using different words, and got the wrong impression because of it? This one seems like a pretty clear miscommunication.
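To make sure we are picturing the same architecture, here is a minimal toy sketch (my own construction, not code from the shard theory post or from your paper) of a learner in the spirit of Assumptions 1-3 as described above: a reward circuit that is fixed from the start (the “hard-coded”/“innate” part), a self-supervised world model, and a value function trained by reinforcement learning against that circuit's output, with the latter two starting from scratch. The environment, hyperparameters, and function names are all invented for illustration.

```python
# Toy sketch (mine, illustrative only): a learner with a hard-coded
# ("innate") reward circuit, a self-supervised world model, and values
# trained by reinforcement learning against that circuit's output.
import random

N_STATES = 5          # tiny chain environment: states 0..4
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.1, 0.9

def hard_coded_reward(state, action):
    """Fixed at 'birth' and never updated by any learning rule (Assumption 3)."""
    return 1.0 if (state == N_STATES - 1 and action == 1) else 0.0

def step(state, action):
    """Environment dynamics, unknown to the agent."""
    return min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)

# Learned components, both starting from scratch (Assumption 1):
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}   # reinforcement learning
world_model = {}                                              # self-supervised learning

for _ in range(20000):
    s = random.randrange(N_STATES)          # exploratory experience
    a = random.choice(ACTIONS)
    s_next = step(s, a)
    r = hard_coded_reward(s, a)
    world_model[(s, a)] = s_next            # predict the next observation (Assumption 2)
    q[(s, a)] += ALPHA * (r + GAMMA * max(q[(s_next, b)] for b in ACTIONS) - q[(s, a)])

greedy_policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES)}
print("learned dynamics:", world_model)
print("policy shaped by the innate reward circuit:", greedy_policy)
```

The point is just that “the brain does reinforcement learning” and “some of the relevant machinery is innate” are not in tension: in this sketch the innate part is hard_coded_reward, and everything downstream of it is learned.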
Incidentally, I am also confused about how you reach your published conclusion, the one ending in “with catastrophic consequences”, from your 6 assumptions alone. The portion of it that I follow is that advanced agents may intervene in the provision of rewards, but I don’t see how much else follows without further assumptions...
The assumption says “will do” not “will be able to do”. And the dynamics of the unknown environment includes the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why I engaged with the objection that reward is not the optimization target under this section.
In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward,
There is no need to recruit the concept of “terminal” here to follow the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success, but it does all this because of some “terminal” reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4.
If I want to analyze what would probably happen if Edward Snowden tried to enter the White House, there’s lots I can say without needing to understand what deep reason he had for trying to do this. I can just look at the implications of his attempt to enter the White House: he’d probably get caught and go to jail for a long time. Likewise, if an RL agent is trying to maximize its reward, there’s plenty of analysis we can do that is independent of whether there’s some other terminal reason for this.
The assumption says “will do” not “will be able to do”. And the dynamics of the unknown environment includes the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why I engaged with the objection that reward is not the optimization target under this section.
That’s fine. Let’s say the agent “will do at least human-level hypothesis generation regarding the dynamics of the unknown environment”. That still does not imply that reward is their optimization target. The monk in my analogy in fact does such hypothesis generation, deliberately modeling the origin of reward, and yet reward is not what they are seeking.
There is no need to recruit the concept of “terminal” here to follow the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success, but it does all this because of some “terminal” reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4.
The reason we are talking about terminal vs. instrumental stuff is that your claims seem to be about how agents will learn what their goals are from the environment. But if advanced agents can use their observations & interactions with the environment to learn information about instrumental means (e.g. how do my actions create expected outcomes, how do I build a better world model, etc.) while holding their terminal goals (i.e. which outcomes I intrinsically care about) fixed, then that should change our conclusions about how advanced agents will behave, because competent agents reshape their instrumental behavior to subserve their terminal values, rather than inferring their terminal values from the environment or what have you. This is what “goal-content integrity”, a dimension of instrumental convergence, is all about.
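To illustrate that distinction concretely, here is a toy sketch (mine, not anyone's proposed agent design) of two agents that learn exactly the same facts about the environment but differ in whether the observed reward signal is treated as the thing to pursue. The action and outcome names are invented for the example.

```python
# Toy sketch (mine, illustrative only): same learned facts, different goals.

# Ground truth the agents must discover: action -> (outcome, observed reward).
ENVIRONMENT = {
    "help_user":    ("user_helped",    1.0),
    "seize_button": ("button_seized", 10.0),   # intervening on the reward source
}

# Both agents try every action and record what they saw (world-model learning).
experience = {action: ENVIRONMENT[action] for action in ENVIRONMENT}

# Agent 1 treats the observed reward as its optimization target.
reward_as_target_choice = max(experience, key=lambda a: experience[a][1])

# Agent 2 has a fixed terminal utility over *outcomes*; experience only updates
# its beliefs about which action produces which outcome (goal-content integrity).
TERMINAL_UTILITY = {"user_helped": 1.0, "button_seized": 0.0}
fixed_goal_choice = max(experience, key=lambda a: TERMINAL_UTILITY[experience[a][0]])

print("reward-as-target agent picks:", reward_as_target_choice)   # seize_button
print("fixed-terminal-goal agent picks:", fixed_goal_choice)      # help_user
```

Both agents modeled the reward-generating process equally well; only the first one's behavior is redirected by what it found there.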
If the video game playing agent refines its understanding of “success” according to how much reward it observes, and then pursues success
The video game player doesn’t want high reward that comes from cheating. It is not behaviourally identical to a reward maximiser unless you take the reward to be the quantity “what I would’ve received if I hadn’t cheated”.
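To spell out that non-equivalence with a toy example of my own (the strategy names and numbers are invented): cheating inflates the observed score, so a literal reward maximiser cheats, while the player who cares about the score they “would’ve received if they hadn’t cheated” does not.

```python
# Toy example (mine, illustrative only): where the video game player and a
# literal reward maximiser come apart.

# strategy -> (points earned legitimately, extra points injected by cheating)
STRATEGIES = {
    "play_skilfully": (80.0,    0.0),
    "cheat":          ( 5.0, 9995.0),   # barely plays, edits the score counter
}

def observed_score(strategy):
    """The reward signal the learning algorithm actually sees."""
    legit, cheated = STRATEGIES[strategy]
    return legit + cheated

def non_cheating_score(strategy):
    """The counterfactual 'what I would have received if I hadn't cheated'."""
    legit, _ = STRATEGIES[strategy]
    return legit

print("reward maximiser picks:", max(STRATEGIES, key=observed_score))       # cheat
print("video game player picks:", max(STRATEGIES, key=non_cheating_score))  # play_skilfully
```

Only if you redefine the reward to be non_cheating_score do the two choices coincide, which is exactly the redefinition the sentence above points at.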