What about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).
Sure. But consider maximizing “TurnTrout has a fun day”-reward (for some imperfect grounding of that concept), and maximizing my approval of actions based on whether I think they’ll lead to a fun adventure.
The former takes over the world, and I don’t have a very fun day. But what about the latter?
To some extent, I won’t approve of actions that cause the agent to break, so the agent will pursue at least some instrumental subgoals. But for a successful power-seeking policy to be optimal, there is a conjunctive burden: we aren’t maximizing long-term discounted reward anymore, and actions are evaluated locally, independently of any explicit global reward signal.
Many quasi-independently predicted approval judgments must cohere into a dangerous policy. It’s quite possible that this happens, but I’m not very convinced of that right now.
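A minimal sketch of the contrast, with invented names, assuming the nonmyopic setup can be summarized as a discounted return and the myopic one as a per-action approval score:

```python
# Toy contrast between the two optimization targets discussed above.
# Function and variable names are invented for illustration.

def discounted_return(rewards, gamma=0.99):
    """Nonmyopic objective: later rewards give earlier actions a reason
    to set up long-term advantages."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def myopic_approvals(states, actions, approval_fn):
    """Myopic objective: each action is scored locally by the supervisor's
    approval of that (state, action) pair; there is no explicit global return."""
    return [approval_fn(s, a) for s, a in zip(states, actions)]
```

The disagreement below is about whether optimizing the second target hard enough recovers the same power-seeking behaviour as the first.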
“Many quasi-independently predicted approval judgments must cohere into a dangerous policy.”
I described how this happens in the section on manipulating humans. In short, there is no “quasi-independence” because you are still evaluating every action based on whether you think it’ll lead to a fun adventure. This is exactly analogous to why the reward function you described takes over the world.
I described how this happens in the section on manipulating humans
Yes, but I don’t understand your case for “finding chains of manipulative inputs which increase myopic reward” entailing power-seeking? Why would that behavior, in particular, lead to the highest myopic reward? If we didn’t already know about power-seeking reward maximizers, why would we promote this hypothesis to attention?
This is exactly analogous to why the reward function you described takes over the world.
I disagree? Those objectives seem qualitatively dissimilar.
“Why would that behavior, in particular, lead to the highest myopic reward?”
I addressed this in my original comment: “More specifically: if a myopic agent’s actions A_1 to A_N manipulate the supervisor into thinking that the (N+1)th state is really amazing, and the supervisor looks at the full trajectory before assigning approval, then the supervisor will give higher approval to all of the actions A_1 to A_N, and they’ll all be reinforced, which is the same thing as would happen in a nonmyopic setup if the supervisor just gave the Nth action really high reward.”
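A toy rendering of that quoted mechanism, assuming (as the quote does) that the supervisor assigns approval only after seeing the full trajectory; the function and argument names here are invented:

```python
def trajectory_approvals(trajectory, local_score, perceived_end_value):
    """Score each (state, action) pair after the supervisor has seen the
    whole trajectory.

    `perceived_end_value` stands in for the supervisor's (possibly manipulated)
    judgment of how good the final state looks.
    """
    # Every action in the chain shares credit for the impressive-looking end
    # state, so a manipulative chain A_1 ... A_N is reinforced as a whole.
    return [local_score(s, a) + perceived_end_value for s, a in trajectory]
```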
That’s not what I’m asking. Why would that lead to power-seeking? You seem to be identifying “manipulation” with “power-seeking”; power-seeking implies manipulation, but the converse isn’t always true.
Why do nonmyopic agents end up power-seeking? Because the supervisor rates some states highly, and so the agent is incentivised to gain power in order to reach those states.
Why do myopic agents end up power-seeking? Because to train a competitive myopic agent, the supervisor will need to calculate how much approval they assign to actions based on how much those actions contribute to reaching valuable states. So the agent will be rewarded for taking actions which acquire it more power, since the supervisor will predict that those contribute to reaching valuable states.
(You might argue that, if the supervisor doesn’t want the agent to be power-seeking, they’ll only approve of actions which gain the agent more power in specified ways. But equally, a reward function can penalise unauthorised power-gaining, given that the supervisor is equally able to notice it in both cases.)
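One hedged way to formalize the step above: if the supervisor’s approval of an action tracks its expected contribution to reaching states the supervisor values, then per-action approval maximization approximately reduces to acting greedily with respect to the supervisor’s value estimate, the same rule that drives power-seeking in the nonmyopic case. In symbols (notation introduced here only for illustration):

$$
\text{Approval}(s, a) \;\approx\; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V_{\text{sup}}(s')\big]
\qquad\Longrightarrow\qquad
\arg\max_{a} \text{Approval}(s, a) \;\approx\; \arg\max_{a} \,\mathbb{E}_{s'}\big[V_{\text{sup}}(s')\big].
$$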
I now think that I was thinking of myopic cognition, whereas you are talking about myopic training. Oops! This is obvious in hindsight (and now I’m wondering how I missed it), but maybe you could edit the post to draw a clear contrast?
Ah, makes sense. There’s already a paragraph on this (starting “I should note that so far”), but I’ll edit to mention it earlier.
This is likely the crux of our disagreement, but I don’t have time to reply ATM. Hope to return to this.
If the latter (the approval-maximizing setup) is implemented on Laplace’s Demon, which simply looks through all actions and picks the one with the highest approval, then I think it depends on how you’ve defined “approval.” If maximum approval could be bad (e.g. if approval is unbounded, or if it would take a lot of work to find a benign context where you always give it maximum approval), then this search process is searching for things that look like taking over the world.
But as we move away from Laplace’s Demon, I agree that realistic solutions look more like only manipulating TurnTrout and his immediate spatiotemporal surroundings.
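Read as a formula (again, notation introduced only for illustration), the Laplace’s Demon case is an exhaustive search

$$
a^{*} \;=\; \arg\max_{a \in \mathcal{A}} \widehat{\text{Approval}}(a),
$$

so if predicted approval is unbounded, or its maximum is attained only on actions that amount to taking over the world, that is what the search returns. Weaker optimizers search a much smaller, more local set of candidate actions, which is why the realistic failure mode looks more local.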