Yeah, the thing where the operators/button are destroyed as a side effect, without the AI actively optimizing for that, isn’t something this proposal is trying to solve (yet). The proposal is aimed at a relatively narrow problem.
One thing which this proposal is trying to solve is the barrier found in MIRI’s old work: if we try to avoid button-manipulation by making the agent indifferent to the button state (and otherwise maximize U1 before the button is pressed and U2 after), and the agent builds a child prior to the button being pressed, then the agent won’t construct the child to respect shutdown commands; it will just construct the child to maximize U1.
(More generally, this proposal is trying to get past the hardness results from MIRI’s work, which themselves were aimed at a very simplified version of the problem.)
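To make that failure mode concrete, here is a toy sketch (my own construction, not MIRI’s actual formalism; the design names and payoff numbers are made up). Because the indifference correction cancels any dependence of the parent’s expected utility on the button outcome, the parent scores candidate children purely by their pre-press U1 performance, so a child that ignores shutdown entirely scores at least as well as one that respects it:

```python
# Toy illustration (not MIRI's actual construction): a parent agent made
# "indifferent" to the button state scores candidate children in a way that,
# by construction, does not depend on whether the button gets pressed.
# Design names and numbers below are hypothetical.

CHILD_DESIGNS = {
    # The pure-U1 child squeezes out slightly more U1 because it spends no
    # resources on shutdown machinery.
    "pure_u1_maximizer":   {"u1_before": 10.0, "respects_shutdown": False},
    "shutdown_respecting": {"u1_before": 9.5,  "respects_shutdown": True},
}

def parent_score(design, p_press):
    """Indifferent parent's expected utility for a child design.

    The indifference correction makes the pressed branch worth exactly as
    much as the unpressed branch, so the expectation collapses to pre-press
    U1 performance -- the child's shutdown behavior never enters the score.
    """
    u_no_press = design["u1_before"]
    u_press = u_no_press  # equalized by the indifference correction
    return p_press * u_press + (1 - p_press) * u_no_press

for name, design in CHILD_DESIGNS.items():
    print(f"{name}: {parent_score(design, p_press=0.5)}")
# pure_u1_maximizer scores higher for every value of p_press, so the
# indifferent parent builds the child that ignores shutdown commands.
```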
But Subagent 2’s bargaining power doesn’t scale with Subagent 1’s misalignment, so if the setup doesn’t always give it complete veto power, Subagent 1 would be allowed to do small stuff like “release the hyperviral worship-the-machine-god memagent” (as bargained down by Subagent 2 from its preferred “release the grey goo”).
Key point here: whatever actions the composite agent takes prior to the button revelation must be a Pareto improvement over the (implicit) default action. Otherwise, one of the two subagents would veto, resulting in the default action.
It does matter a lot what the default action is.
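Here is a minimal sketch of that veto rule (my own framing; the action names and payoff numbers are illustrative, not from the proposal). The composite agent replaces the default action with a proposal only if every subagent weakly prefers the proposal to the default, which also makes explicit how much hinges on what the default is:

```python
# Minimal sketch of the unanimous-veto rule (my framing; action names and
# payoff numbers are illustrative, not taken from the proposal).

def composite_choice(proposal, default, subagent_utilities):
    """Take `proposal` only if it is a (weak) Pareto improvement over
    `default` for every subagent; otherwise any subagent's veto means the
    composite agent falls back to the default action."""
    for utility in subagent_utilities:
        if utility(proposal) < utility(default):
            return default  # vetoed
    return proposal

# Subagent 1 (pre-press utility U1) and Subagent 2 (post-press utility U2)
# score a few toy pre-button actions:
u1 = {"default": 0.0, "build_infrastructure": 2.0, "grey_goo": 10.0}.get
u2 = {"default": 0.0, "build_infrastructure": 1.0, "grey_goo": -100.0}.get

print(composite_choice("build_infrastructure", "default", [u1, u2]))  # both gain -> accepted
print(composite_choice("grey_goo", "default", [u1, u2]))              # Subagent 2 vetoes -> default
# Everything here is relative to `default`: a bad default action makes
# "Pareto improvement over the default" a much weaker guarantee.
```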
How is the scenario made meaningfully different from “we deploy two AIs simultaneously: one tasked with building a utopia-best-we-could-define-it, and another tasked with foiling all of the first AI’s plans”, with all the standard problems with multi-AI setups?
More precisely, the analogy would be: we deploy two AIs simultaneously, one of which builds a utopia and expects that a certain button will be pressed with probability 1 and that nothing is causally upstream of it; the other of which is evil and expects that the same button will be pressed with probability 0 and that nothing is causally upstream of it. The button plays a very important role: the two AIs will bet all their influence on that button-press, each confident that they’ll win, and neither has any incentive at all, or even any ability at all (under their own model), to manipulate the button.
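A toy rendering of that bet (my own sketch, not the post’s formalism; the payoff numbers are placeholders): each subagent evaluates plans under its own model, in which the button’s outcome is already certain and causally disconnected from anything the agent does, so spending anything on manipulating the button strictly loses.

```python
# Toy rendering of the bet described above (my own sketch; payoffs are
# placeholders). Each subagent's model treats the button outcome as fixed
# and not causally downstream of anything the agent does, so manipulation
# cannot move p_press -- it can only burn resources.

def expected_utility(p_press, u_pressed, u_unpressed, manipulation_cost=0.0):
    """Expected utility under a single subagent's own beliefs."""
    return p_press * u_pressed + (1 - p_press) * u_unpressed - manipulation_cost

subagents = [
    ("Subagent believing P(press)=1", 1.0),
    ("Subagent believing P(press)=0", 0.0),
]
for name, p_press in subagents:
    # Even if this subagent's payoff differs a lot between the two button
    # outcomes, its own model already pins the outcome down with certainty,
    # and no action of its own can move that probability.
    no_meddling = expected_utility(p_press, u_pressed=1.0, u_unpressed=10.0)
    meddling    = expected_utility(p_press, u_pressed=1.0, u_unpressed=10.0,
                                   manipulation_cost=1.0)
    print(f"{name}: leave button alone = {no_meddling}, manipulate = {meddling}")
# Under each subagent's own model the button probability is unchangeable,
# so manipulation is pure cost -- neither subagent wants to touch the button.
```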
The main way that this analogy is different from the original proposal is that there’s not a built-in default option which is taken if one or the other AI vetoes prior to the button press. Which does make things importantly worse, and I agree that is another important piece to be sorted out.
Some additional important pieces to be sorted out:
Make the AI want to maintain the button
Make the AI want to maintain stuff upstream of the button, e.g. humans
Ontological stability
Probably more importantly: how this naturally fits into a more general architecture for a distributed agent