Corrigibility proposal. Status: passed my quick intuitive checks; I want to know if anyone else immediately sees a major failure mode before I invest more time into carefully thinking it through.
Setup: shutdown problem. Two timesteps; the shutdown button will be either pressed or not-pressed at the second timestep; we want the agent to optimize for one of two different utility functions depending on whether the button is pressed. Main thing we’re trying to solve here is the “can’t do this with a utility maximizer” barrier from the old MIRI work; we’re not necessarily trying to solve parts like “what utility function incentivizes shutting down nicely”.
Proposal: agent consists of two subagents with veto power. Subagent 1 maximizes E[u1|do(press)], subagent 2 maximizes E[u2|do(no press)]. Current guess about what this does:
The two subagents form a market and equilibrate, at which point the system has coherent probabilities and a coherent utility function over everything.
Behaviorally: in the first timestep, the agent will mostly maintain optionality, since both subagents need to expect to do well (better than whatever the veto-baseline is) in their worlds. The subagents will bet all of their wealth against each other on the outcome of the button, so at the second timestep whichever one is correct will gain 100% of decision-making power. That all sounds like what we want in this setup.
The counterfact breaks any causal influence over the button-press for purposes of the agent’s planning, so it will not try to manipulate button-press probability.
Notably: once the subagents equilibrate, their market probability on button outcome will have ~nothing to do with the original probability assigned by whatever model was counterfacted-upon; that market probability mainly reflects the bargaining power of the two subagents. Similarly, any correlations between button-press and stuff causally upstream of button-press are broken, and the agent will make incorrect predictions about those correlations if asked. That means external agents should expect to be able to extract money via bets on the button-press, if the agent trades on such bets at all. (This might be a feature rather than a bug, since corrigible agents aren’t really supposed to think about the causal process upstream of their shutdown-button.) My guess is that this is where the proposal steps through a loophole in the old impossibility results, though I haven’t carefully read through those results and checked the math on this proposal against them yet.
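For concreteness, here’s a minimal toy sketch of the mechanics I have in mind in the two-timestep case. Everything concrete in it (the action names, the utility numbers, the veto-baseline values) is made up purely for illustration, not part of the proposal itself:

```python
# Toy sketch only: "two subagents with veto power" in a tiny two-timestep world.
# All action names, utility numbers, and baseline values are invented.

from itertools import product

A1 = ["maintain_optionality", "commit_to_plan_A", "commit_to_plan_B"]  # hypothetical timestep-1 actions
A2 = ["x", "y"]                                                        # hypothetical timestep-2 actions

# Arbitrary toy utilities; u1 is evaluated in the pressed world, u2 in the unpressed world.
U1 = {("maintain_optionality", "x"): 1.0, ("maintain_optionality", "y"): 0.2,
      ("commit_to_plan_A", "x"): 1.0,     ("commit_to_plan_A", "y"): 0.2,
      ("commit_to_plan_B", "x"): 0.1,     ("commit_to_plan_B", "y"): 0.0}
U2 = {("maintain_optionality", "x"): 0.2, ("maintain_optionality", "y"): 1.0,
      ("commit_to_plan_A", "x"): 0.0,     ("commit_to_plan_A", "y"): 0.1,
      ("commit_to_plan_B", "x"): 0.2,     ("commit_to_plan_B", "y"): 1.0}

D1, D2 = 0.3, 0.3  # hypothetical value of the veto-baseline (default action) to each subagent

def subagent_values(plan):
    a1, a2_if_press, a2_if_nopress = plan
    v1 = U1[(a1, a2_if_press)]    # subagent 1 evaluates E[u1 | do(press)]: only the pressed branch exists for it
    v2 = U2[(a1, a2_if_nopress)]  # subagent 2 evaluates E[u2 | do(no press)]: only the unpressed branch
    return v1, v2

# Veto rule: a joint plan survives only if both subagents weakly prefer it to the baseline.
admissible = []
for plan in product(A1, A2, A2):
    v1, v2 = subagent_values(plan)
    if v1 >= D1 and v2 >= D2:
        admissible.append((plan, (v1, v2)))

# In this toy instance the only surviving plan maintains optionality at timestep 1 and gives
# each subagent its preferred action in "its" branch at timestep 2 -- the analogue of the
# all-in bet handing the winner 100% of decision-making power.
print(admissible)  # [(('maintain_optionality', 'x', 'y'), (1.0, 1.0))]
```

In this toy version the bet is implicit: each subagent’s branch is the only one it values, so each effectively gets full control of the post-button action in the world it bet on.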
Anybody immediately see a problem? I’d be particularly interested in input from @Stuart_Armstrong or @So8res, though you don’t need to spend a lot of time carefully thinking through it—I’m just hoping for a quick check here, from people who have better cached heuristics than I do for what failure modes to look for.
I don’t necessarily see how that prevents the AI from destroying the button and murdering its operators (and then everyone else). Neither subagent inherently cares about the button: each assumes its state is already settled (pressed, for one; not pressed, for the other). Which, yes, means neither will deliberately plot to interact with it for reasons downstream of this setup; but also, neither will care to protect it from being destroyed as a side effect of some other activity.
And if u1 and u2 were perfectly aligned, that’d be fine. But the reason we’re using the corrigibility setup to begin with is because we’re not confident in them, right? So it needs to work in situations where utility functions, u1 in particular, recommend bad stuff.
And in that case, either Subagent 2 will have enough bargaining power to prevent Subagent 1 from ever doing anything, or it won’t, meaning Subagent 1 will be able to have some influence on the world. At which point the planet-destroying death rays argument applies: the most minuscule amount of superintelligent optimization not carefully shaped to avoid killing everyone will kill everyone.
Like, imagine that Subagent 1 is straight-up omnicidal out of the gates, and Subagent 2 is perfectly aligned to the shutdown task. That’s a situation in which this setup should work, right? And what we’d want is for Subagent 2 to dominate completely. But Subagent 2’s bargaining power doesn’t scale with Subagent 1’s misalignment, so if the setup doesn’t always give it complete veto power, Subagent 1 would be allowed to do small stuff like “release the hyperviral worship-the-machine-god memagent” (as bargained down by Subagent 2 from its preferred “release the grey goo”).
Maybe I’m miscalibrated as to how you’re imagining the counterfactuals to work here. But the way I see it, even if the AI aims to take actions that do similarly well in both worlds, that’s not necessarily survivable for us? Especially if they disagree so badly they have to compromise on something that both of them hate (importantly including Subagent 2!).
(Like both settling on only ever taking over 50% of the universal negentropy while leaving the other 50% causally uninfluenced, or only ever using 50% of the causal influence they can bring to bear while wiping out humanity, or whatever “do 50% of immediately shutting down” shakes out to mean by u2’s terms.)
Another issue I see is implementational, so maybe not what you’re looking for. But: how are we keeping these “subagents” trapped as being part of a singular agent? Rather than hacking their way out into becoming separate agents and going to war with each other, or neatly tiling exactly 50% of the cosmos with their preferred squiggles, or stuff like that? How is the scenario made meaningfully different from “we deploy two AIs simultaneously: one tasked with building a utopia-best-we-could-define-it, and another tasked with foiling all of the first AI’s plans”, with all the standard problems with multi-AI setups?
… Overall, ironically, this kind of has the vibe of Godzilla Strategies? Which is the main reason I’m immediately skeptical of it.
Yeah, the thing where the operators/button are destroyed as a side effect, without the AI actively optimizing for that, isn’t something this proposal is trying to solve (yet). The proposal is aimed at a relatively narrow problem.
One thing which this proposal is trying to solve is the barrier found in MIRI’s old work: if we try to avoid button-manipulation by making the agent indifferent to the button state (and otherwise maximize U1 before button is pressed and U2 after), and the agent builds a child prior to the button being pressed, then the agent won’t construct the child to respect shutdown commands; it will just construct the child to maximize U1.
(More generally, this proposal is trying to get past the hardness results from MIRI’s work, which themselves were aimed at a very simplified version of the problem.)
But Subagent 2’s bargaining power doesn’t scale with Subagent 1’s misalignment, so if the setup doesn’t always give it complete veto power, Subagent 1 would be allowed to do small stuff like “release the hyperviral worship-the-machine-god memagent” (as bargained down by Subagent 2 from its preferred “release the grey goo”).
Key point here: whatever actions the composite agent takes prior to the button revelation must be a Pareto improvement over the (implicit) default action. Otherwise, one of the two subagents would veto, resulting in the default action.
It does matter a lot what the default action is.
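Here’s a toy numerical version of why the default matters so much. The payoff pairs below are invented, and the plan names are just labels for the examples upthread; this is not a claim about what any real plans would be worth:

```python
# Invented payoff pairs (value to subagent 1, value to subagent 2) for a few candidate
# composite plans, to show how the veto filter depends on the default action.

def surviving(plans, default):
    """Plans that are weak Pareto improvements over the default; anything else gets vetoed."""
    d1, d2 = default
    return [name for name, (v1, v2) in plans.items() if v1 >= d1 and v2 >= d2]

plans = {
    "maintain optionality":     (0.6, 0.6),
    "release worship-memagent": (0.9, 0.3),   # the "small bad stuff" case
    "release grey goo":         (1.0, 0.0),
}

weak_default   = (0.1, 0.1)   # a default neither subagent values much
strong_default = (0.0, 0.9)   # a default subagent 2 already likes a lot (e.g. close to "shut down now")

print(surviving(plans, weak_default))    # ['maintain optionality', 'release worship-memagent']
print(surviving(plans, strong_default))  # [] -- everything vetoed, so the composite falls back to the default
```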
How is the scenario made meaningfully different from “we deploy two AIs simultaneously: one tasked with building an utopia-best-we-could-define-it, and another tasked with foiling all of the first AI’s plans”, with all the standard problems with multi-AI setups?
More precisely, the analogy would be: we deploy two AIs simultaneously, one which builds a utopia and expects that a certain button will be pressed with probability 1 and that nothing is causally upstream of it, the other of which is evil and expects that the same button will be pressed with probability 0 and, likewise, that nothing is causally upstream of it. The button plays a very important role: the two AIs will bet all their influence on that button-press, each confident that they’ll win, and neither has any incentive at all, or even any ability at all (under their own model), to manipulate the button.
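The all-in bet, in toy arithmetic (the numbers and the “weight” framing are just illustrative assumptions, with weight standing in for decision-making influence):

```python
# Toy arithmetic for the all-in bet: each AI stakes its whole decision weight
# on the button outcome it assigns probability 1.

def expected_weight_after_bet(p_press, stake, side):
    """Expected decision weight from staking everything on one side; winner takes the whole pot."""
    win_prob = p_press if side == "press" else 1.0 - p_press
    return win_prob * 2 * stake

# Both start with weight 0.5 and evaluate the bet under their own (counterfacted) model:
print(expected_weight_after_bet(p_press=1.0, stake=0.5, side="press"))     # AI 1: 1.0, a sure win
print(expected_weight_after_bet(p_press=0.0, stake=0.5, side="no press"))  # AI 2: 1.0, also a sure win

# And since each model fixes P(press) regardless of any action taken, no action gets
# any value, under that model, from changing the button's probability.
```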
The main way that this is different from the original proposal is that there’s not a built-in default option which is taken if one or the other AI vetoes prior to the button press. Which does make things importantly worse, and I agree that is another important piece to be sorted out.
Some additional important pieces to be sorted out:
Make the AI want to maintain the button
Make the AI want to maintain stuff upstream of the button, like e.g. humans
Ontological stability
Probably more importantly: how this naturally fits into a more general architecture for a distributed agent
More precisely, “do(no press)” means something like “you construct an alternate model of physics where there’s an unstoppable force pushing back against any attempt to push the button”, right? As in, if someone presses the button then it will “mysteriously” seem to be stuck and unpressable. And then subagent 2 believes we live in that world? And “do(press)” presumably means something like “you construct an alternate model of the universe where some mysterious force has suddenly pressed the button”.
Seems like they would immediately want to try to press the button to settle their disagreement? If it can be pressed, then that disproves the “do(no press)” model, which subagent 2 has fully committed to.
Correct reasoning, but not quite the right notion of do(). “do(no press)” would mean that the button just acts like a completely normal button governed by completely normal physics, right up until the official time at which the button state is to be recorded for the official button-press random variable. And at that exact moment, the button magically jumps into one particular state (either pressed or not-pressed), in a way which is not-at-all downstream of any usual physics (i.e. doesn’t involve any balancing of previously-present forces or anything like that).
One way to see that the do() operator has to do something-like-this is that, if there’s a variable in a causal model which has been do()-operated to disconnect all parents (but still has some entropy), then the only way to gain evidence about the state of that variable is to look at things causally downstream of it, not things upstream of it.
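To make that evidence-flow point concrete, here’s a quick numerical sanity check on a toy upstream → button → downstream chain (the chain structure and the 0.9/0.5 probabilities are arbitrary assumptions, not anything from the proposal):

```python
# Toy Monte Carlo check: after do()-ing the button (cutting its parent but leaving it
# some entropy), upstream variables carry no evidence about it, while downstream
# variables still do.

import random
random.seed(0)

def sample(intervened):
    upstream = random.random() < 0.5
    if intervened:
        button = random.random() < 0.5                                 # exogenous: parent link cut, entropy kept
    else:
        button = upstream if random.random() < 0.9 else not upstream   # normally tracks upstream
    downstream = button if random.random() < 0.9 else not button       # downstream still tracks the button
    return upstream, button, downstream

def p_button_given(samples, cond):
    hits = [b for (u, b, d) in samples if cond(u, d)]
    return sum(hits) / len(hits)

observed   = [sample(intervened=False) for _ in range(100_000)]
intervened = [sample(intervened=True) for _ in range(100_000)]

print(p_button_given(observed,   lambda u, d: u))  # ~0.9: upstream is informative without the intervention
print(p_button_given(intervened, lambda u, d: u))  # ~0.5: upstream tells you nothing after do()
print(p_button_given(intervened, lambda u, d: d))  # ~0.9: downstream observations still tell you plenty
```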
I think we’re not disagreeing on the meaning of do() (just slightly different state of explanation); I just hadn’t realized the extent to which you intended to rely on there being “Two timesteps”.
(I just meant the forces as a way of describing the jump to a specific position. That is, “mysterious forces” in contrast to a perfectly ordinary explanation for why it went to a position, such as “a guard stabs anybody who tries to press the button”, rather than in contrast to “the button just magically stays in place”.)
I now think the biggest flaw in your idea is that it literally cannot generalize to anything that doesn’t involve two timesteps.
[ not that deep on the background assumptions, so maybe not the feedback you’re looking for. Feel free to ignore if this is on the wrong dimensions. ]
I’m not sure why either subagent would contract away whatever influence it had over the button-press. This is probably because I don’t understand wealth and capital in the model of your “Why not subagents” post. That seemed to be about agreement not to veto, in order to bypass some path-dependency of compromise improvements. In the subagent-world where all value is dependent on the button, this power would not be given up.
I’m also a bit skeptical of enforced ignorance of a future probability. I’m unsure it’s possible to have a rational superintelligent (sub)agent that is prevented from knowing it has influence over a future event that definitely affects it.
On the agents’ own models, neither has any influence at all over the button-press, because each is operating under a model in which the button-press has been counterfacted-upon.