I’m confused. Let’s assume that the button probably isn’t pressed at the start. It seems quite likely that the first agent proposes building a sub-agent that maximally pursues the first agent’s utility if the button is pressed in the first round and maximally pursues the second agent’s utility if it is not. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t if the sub-agents have had no time to do anything), the sub-agent will forever after pursue the second agent’s utility.
The button plays no role in this utility, so instrumental incentives mean the sub-agent will destroy the button sooner or later. This seems like it breaks the system.
Now, this isn’t the only equilibrium. The first agent believes the button will always be pressed, so it has no incentive to argue over whether the criterion should be that the button is pressed in the first round or that it has been pressed in this round and all previous rounds. On the other hand, instead of balance, it seems likely that one agent or the other creates a sub-agent that clobbers the other’s utility, with that agent assuming that this only happens in a world that never occurs.
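To make the two criteria concrete, here is a toy sketch. The function names and the "first"/"second" labels are my own, purely illustrative, and this is not meant as the post's actual formalism. It only shows which agent's utility the sub-agent would end up pursuing under each criterion, given a history of button presses.

```python
from typing import Sequence


def first_round_criterion(presses: Sequence[bool]) -> str:
    """Hypothetical rule: only the first round's button state matters."""
    return "first" if presses[0] else "second"


def every_round_criterion(presses: Sequence[bool]) -> str:
    """Hypothetical rule: the button must have been pressed in every round so far."""
    return "first" if all(presses) else "second"


# Button unpressed in round 1, pressed afterwards: both rules lock in the
# second agent's utility, and the later presses change nothing.
late_press = [False, True, True]
print(first_round_criterion(late_press), every_round_criterion(late_press))

# Button pressed in round 1, then released: the two rules disagree.
early_press = [True, False]
print(first_round_criterion(early_press), every_round_criterion(early_press))
```

Under either rule, a button that is unpressed in the first round has no further effect on what the sub-agent pursues, which is the lock-in I'm worried about.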
I think the initial (2-agent) model only has two time steps, i.e. one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity.