So, I’m not sure if I’m further down the ladder and misunderstanding Richard, but I found this line of reasoning objectionable (maybe not the right word):
“Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals.”
My initial (perhaps uncharitable) response is something like “Yeah, you could build a safe system that just prints out plans that no one reads or executes, but that just sounds like a complicated way to waste paper. And if something is going to execute them, then what difference does it make whether that’s humans or the system itself?”
This, with its various mentions of manipulating humans, seems to me like it would most easily arise from an imagined scenario of AI “turning” on us. Like that we’d accidentally build a Paperclip Maximizer, and it would manipulate people by saying things like “Performing [action X, which will actually lead to the world being turned into paperclips] will end all human suffering, you should definitely do it.” And that this could be avoided by using an Oracle AI that will just tell us “If you perform action X, it will turn the world into paperclips.” And then we can just say “oh, that’s dumb, let’s not do that.”
And I think that this misunderstands alignment. An Oracle that only tells you effective and correct plans for achieving your goals, and doesn’t try to manipulate you into serving its own goals (because it has no goals besides providing you with effective and correct plans), is still super dangerous. Because you’ll ask it for a plan to get a really nice lemon poppy seed muffin, and it will spit out a plan, and when you execute the plan, your grandma will die. Not because the system was trying to kill your grandma, but because that was the most efficient way to get a muffin, and you didn’t specify that you wanted your grandma to be alive.
(And you won’t know the plan will kill your grandma, because if you understood the plan and all its consequences, it wouldn’t be superintelligent.)
Alignment isn’t about guarding against an AI working at cross purposes to you. It’s about building something that understands that when you ask for a muffin, you want your grandma to still be alive, without you having to say that (because there’s a lot of things you forgot to specify, and it needs to avoid all of them). And so even an Oracle that just gives you plans is dangerous unless it knows those plans need to avoid all the things you forgot to specify. This was what I got out of the Outcome Pump story, and so maybe I’m just saying things everyone already knows…
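To make the “things you forgot to specify” point concrete, here’s a toy sketch (purely illustrative, nothing here is a real system; the plans and scores are made up) of a planner that optimizes only the stated objective and therefore happily picks a plan that tramples a constraint you never wrote down:

```python
# Toy illustration: an "Oracle" that ranks candidate plans purely by how
# well they achieve the stated goal (muffin quality), with no notion of
# the constraints you forgot to specify.

# Each candidate plan: (description, muffin_quality, grandma_survives)
candidate_plans = [
    ("walk to the bakery and buy a muffin", 7, True),
    ("bake a muffin from scratch", 8, True),
    ("commandeer grandma's kitchen by any means necessary", 10, False),
]

def stated_objective(plan):
    # We only asked for the best muffin; nothing here mentions grandma.
    _, muffin_quality, _ = plan
    return muffin_quality

best_plan = max(candidate_plans, key=stated_objective)
print(best_plan)
# The highest-scoring plan wins even though grandma_survives is False,
# because the objective never encoded that constraint.
```

The failure isn’t that the planner has goals of its own; it’s that the objective it was given is an incomplete proxy for what you actually want.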