Excluding a superintelligent AGI that has qualia and can form its own (possibly perverse) goals, why wouldn’t we be able to stop any paperclip maximizer whatsoever simply by adding the stipulation “and do so without killing or harming or even jeopardizing a single living human being on earth” to its specified goal? Wouldn’t this stipulation trivially force the paperclip maximizer not to turn humans into paperclips, either directly or indirectly? By definition, there is no goal or subgoal (or is there?) that remains dangerous to humans once that stipulation is added. If we create a paperclip maximizer, then, the only thing we need to do to keep it aligned is to always add that stipulation, or a similar one, to its specified goals. Of course, this would require self-control. But it would be in every researcher’s interest to include the stipulation, since their very lives would depend on it; and this is true even if (unbeknownst to us) only 1 out of, say, every 100 requested goals would make the paperclip maximizer turn everyone into paperclips.
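Put slightly more formally (this is just a rough gloss on what I mean, with $\pi$ standing for whatever plan or policy the maximizer adopts):

$$\max_{\pi}\ \mathbb{E}\big[\text{paperclips produced under } \pi\big] \quad \text{subject to} \quad \Pr\big[\text{any living human is killed, harmed, or jeopardized under } \pi\big] = 0.$$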
Currently, we don’t know how to make a smarter-than-us AGI obey any particular set of instructions whatsoever, at least not robustly in novel circumstances once it is deployed and beyond our ability to recall/delete (because of the Inner Alignment Problem). So we can’t just type in stipulations like that. If we could, we’d be a lot closer to safety, and we’d probably want a lot more stipulations than just that one. Speculating wildly, we might try something like: “For now, the only thing you should do is be a faithful imitation of Paul Christiano, but thinking 100x faster.” Then we could ask our new Paul-mimic to think for a while and come up with better ideas for how to instruct new versions of the AI.
What is “killing”? What is “harming”? What is “jeopardizing”? What is “living”? What is “human”? What is the difference between “I cause future killing/harming/jeopardizing” and “future killing/harming/jeopardizing will be in my lightcone”? How do we explain all of this to an AI? And how do we check whether it has understood everything correctly?
We don’t know.
If you are going to build a paper-clip maximiser at all, it is definitely advisable to build one that also needs to respect a whole bunch of additional stipulations about not harming people. The worry among many alignment researchers is that it might be very difficult to make these stipulations robust enough to deliver the level of safety we ideally want, especially for AGIs that might become hugely intelligent or hugely powerful. As we are talking about not-yet-invented AGI technology, nobody really knows how easy or hard it will be to build robust-enough stipulations into it. It might turn out to be very easy, but maybe not. Different researchers have different levels of optimism, but in the end nobody knows, and the conclusion remains the same regardless: warn people about the risk, and do more alignment research aimed at making it easier to build robust-enough stipulations into potential future AGIs.
In addition to Daniel’s point, I think an important piece is probabilistic thinking: the AGI will act not on what will happen but on what it expects to happen. What probability of harm is acceptable? If none is, it should do nothing.
I don’t think this is an important obstacle — you could use something like “and act such that your P(your actions over the next year lead to a massive disaster) < 10^-10.” I think Daniel’s point is the heart of the issue.
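As a toy sketch of what that “P < 10^-10” constraint could look like inside a planner (every name here is a hypothetical stand-in, not a real system):

```python
# Toy illustration of a chance-constrained planner: consider only plans
# whose (self-estimated) probability of causing a massive disaster is
# below the threshold, then maximize expected paperclips among those.
# Every name here is a hypothetical stand-in, not a real system.

DISASTER_THRESHOLD = 1e-10  # the 10^-10 bound suggested above


def choose_plan(candidate_plans, expected_paperclips, p_disaster):
    """Return the best plan that clears the safety threshold, or None.

    candidate_plans:           iterable of plan objects
    expected_paperclips(plan): the agent's estimate of paperclips produced
    p_disaster(plan):          the agent's estimate of P(massive disaster | plan)
    """
    safe_plans = [plan for plan in candidate_plans
                  if p_disaster(plan) < DISASTER_THRESHOLD]
    if not safe_plans:
        return None  # nothing clears the bar, so do nothing
    return max(safe_plans, key=expected_paperclips)
```

Note that the constraint only bites through p_disaster, i.e. the agent’s own estimate of the disaster probability.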
I think that incentivizes self-deception about probabilities. Also, probabilities below 10^-10 are pretty unusual, so I’d expect a threshold like that to result in very little happening.