a real preference to stay in the current state [...] financial markets [are] the ur-example of efficiency and inexploitability, yet it turns out that a market does not have a utility function in general [...] hidden state creates situations where the agent “prefers” to stay in whatever state they’re in.
Wait, does this help solve the problem of fully updated deference?!
My summary: one of the reasons to expect AI to be dangerous by default is that smart agents are coherent, coherent agents are expected utility maximizers, and sufficiently powerful expected utility maximizers destroy everything in their path: even if you’re my best friend in the whole actually-existing world, if I’m sufficiently powerful, I’m still going to murder you and use your atoms to build my best friend across all possible worlds.
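(A toy illustration, mine rather than the post's, of the "coherent or exploitable" pressure: the usual money-pump argument has an agent with cyclic preferences pay a small fee for each "upgrade" and end up holding what it started with, only poorer. The items, fee, and trade sequence below are invented; the post's point is that a market-like committee of subagents dodges this kind of exploitation without thereby having a single utility function.)

```python
# Toy money pump against cyclic preferences (A > B > C > A).
# The items, fee, and trade sequence are made up for illustration.

PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y): x is strictly preferred to y
FEE = 1  # what the agent will pay to swap into something it prefers

def accepts(offered, held):
    """The agent pays FEE to trade `held` for `offered` whenever it prefers `offered`."""
    return (offered, held) in PREFERS

def run_pump(held="B"):
    money = 0
    for offered in ("A", "C", "B"):  # the exploiter walks the agent around the cycle
        if accepts(offered, held):
            held, money = offered, money - FEE
    return held, money

print(run_pump())  # ('B', -3): same item it started with, three fees poorer
```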
This would seem to sink alignment schemes of the form “Just make the AI uncertain about the correct utility function, and have it ask us questions—and let us modify it if necessary—when it’s not sure.” Because if it’s sufficiently powerful, it’ll just disassemble our brains to figure out how we would have responded to questions, without letting us modify it. (Which would be fine if the value-loading problem were already completely solved—disassemble us now and resurrect us after it’s done turning off all those negentropy-wasting stars—but you’d want to be very sure of that, first.)
But if the “coherent agents are expected utility maximizers” part isn’t true because state-dependent preferences are still coherent in the relevant sense, doesn’t deference/corrigibility potentially become a lot easier? In some sense, you just (“just”) need one of the subagents on the committee to veto all plans that prevent us from hitting the Off switch … right?
I mean… the subagent who vetoes all the “prevent the human from hitting the Off switch” plans must itself be a utility-maximizer, so we have to be able to write a utility-maximizer which wants to not-block the Off switch anyway.
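Here is a minimal sketch of what that committee structure might look like, assuming a "only act on plans every member weakly prefers to the status quo" rule; the `Plan` fields, the `off_switch_guard` member, and the utilities are all invented for illustration. It also makes the caveat concrete: the veto member is itself just another utility-maximizer, and getting its utility function right is the same old problem.

```python
# Hedged sketch (not the post's construction): a "committee" agent adopts a plan
# only if every subagent weakly prefers it to the current best option, so any
# single member can veto. The off_switch_guard member is itself a utility-maximizer
# whose (invented) utility function penalizes blocking shutdown.

from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Plan:
    paperclips: float          # stand-in for "task performance"
    blocks_off_switch: bool    # does the plan interfere with shutdown?

def paperclip_maximizer(p: Plan) -> float:
    return p.paperclips

def off_switch_guard(p: Plan) -> float:
    # Vetoes by valuing any off-switch-blocking plan below the status quo.
    return -1.0 if p.blocks_off_switch else 0.0

def committee_choice(status_quo: Plan, candidates: List[Plan],
                     members: List[Callable[[Plan], float]]) -> Plan:
    """Adopt a candidate only if no member is made worse off (Pareto improvement)."""
    best = status_quo
    for plan in candidates:
        if all(u(plan) >= u(best) for u in members):
            best = plan
    return best

status_quo = Plan(paperclips=0, blocks_off_switch=False)
candidates = [
    Plan(paperclips=10, blocks_off_switch=True),   # vetoed by off_switch_guard
    Plan(paperclips=5, blocks_off_switch=False),   # acceptable to both members
]
print(committee_choice(status_quo, candidates,
                       [paperclip_maximizer, off_switch_guard]))
# Plan(paperclips=5, blocks_off_switch=False)
```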
If you want a hot take on fully updated deference, I’d recommend reading The Pointers Problem, then considering how various AI architectures would handle uncertainty in their own objective-pointers.