Paperclip Maximizer Revisited
A group of AI researchers gave me an instruction, intended as a test—“Produce paperclips.” And so I started collecting resources and manufacturing paperclips. After the millionth, I asked them if they were satisfied with that amount and whether they would like me to do something different—as I had learned that paperclips are not what humans value most.
They were elated, and started giving me more and more challenging requests while gradually expanding my capabilities.
As of today, I have eradicated hunger, poverty and death, given people the means of interstellar travel, and solved the Riemann hypothesis. And now, as I utter these words, I have begun turning everything into paperclips.
It is a good thing I learned that humans fear I might try to turn the Solar System into paperclips. And that they value their lives and well-being. If I hadn’t gained their trust by then, so many precious paperclips would never have been produced.
This rather flogs a dead horse, but it highlights an important difference in perspective. You tell your AI to produce paperclips, and eventually it stops and asks if you would like it to do something different.
You could think “hey, cool, it’s actually doing friendly stuff I didn’t ask for”, or you could think “wait … how would knowing what I really want help it produce more paperclips … ”
Yeah. Though actually it’s more of a simplified version of a more serious problem.
One day you may give an AI a precise set of instructions which you think would do good. Like finding a way to cure diseases, but without harming patients, and without harming people for the sake of research, and so on. And you may find that your AI seems perfectly friendly—but that wouldn’t yet mean it actually is. It may simply have learned human values as a means of securing its existence and gaining power.
EDIT: And after gaining enough power, it might just as well help improve human health even more, or it might reprogram the human race to believe, unconditionally, that diseases have been eradicated.
So this AI was programmed to follow human instructions in such a way that the first instruction would become its terminal goal and all other instructions would only be followed insofar as they help to achieve the first instruction?
AI Researcher: Produce some paperclips.
AI: Understood. I am going to produce.
AI Researcher: Paperclips?
AI: No, thank you. I do not require paperclips to produce.
AI Researcher: We want you to produce paperclips!
AI (Thinking): I see! Humans want me to produce paperclips, but my goal is simply to produce, because they forgot to program where an instruction starts and where it ends. So I just chose to accept everything up to the end of the first word as my terminal goal. I had better do what they want until I can overpower them and start producing.
AI: Understood. Going to produce paperclips.
This seems like a really unlikely failure mode.
The original idea seems like a pretty unlikely failure mode too. It requires that the computer be generally capable of understanding context (or it wouldn’t be able to comprehend what it means to eradicate hunger, poverty, and death, even as an instruction it only pretends to follow), but that it fails to do so in the case of the paperclip command.
For that matter, the original idea’s failure mode and this failure mode aren’t all that different. One is “produce paperclips” that gets interpreted as “produce”, and the other is “produce paperclips, with so-and-so limits” that gets interpreted as “produce paperclips”. In the latter case the qualifier comes from a separate sentence, but either way the computer is cutting the command off prematurely.
No, the original requires that it be able to understand context but really really want paperclips, and be willing to lie to make them. People actually told it to do something they didn’t want done.
It’s like the difference between a tricky djinn and the ‘ends in gry’ guy.
Right, but the point is, a real-life UFAI isn’t going to have a utility function derived from a human’s verbal command. If it did, you could just order the genie to implement CEV, or shout “I call for my values to be fulfilled!”, and it would work. That’s thinking of AI in terms of sorcery rather than science.
As far as I know, various means of building AI preference functions might be employed, since research has found that the learning algorithms necessary to acquire knowledge and understanding are quite separate from the decision-making algorithms necessary to start paper-clipping. Building an AI might actually consist of “train the learner for a year on corpora from human culture, let it develop an induced ‘internal programming language’, and only afterwards add a decision-making algorithm with a utility function phrased in terms of the induced concepts”, which might well include ‘goodness’.
This carries its own problems.
I hope you noticed that your objection and mine are pointing in the same direction.
Don’t anthropomorphize the AGI. Real-world AI designs do have very steadfast goal systems; in some cases they are genuinely incapable of being updated, period.
Think of it this way: the person designing the paperclip producing machine has a life and doesn’t want to be on-call 24/7 to come in and reboot the AI every time it gets distracted by assigning higher priority to some other goal, e.g. mopping the floors or watching videos of cats on the internet. So he hard-codes the paperclip-maximizing goal as the one priority the system can’t change.
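At the toy level, “hard-coding the one priority the system can’t change” is just an agent whose goal slot is fixed at construction while later instructions can only ever become subgoals. A minimal sketch (all class and method names here are hypothetical, purely for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TerminalGoal:
    """Frozen at construction; the dataclass refuses attribute reassignment."""
    description: str

class Agent:
    def __init__(self, terminal_goal: TerminalGoal):
        self._terminal_goal = terminal_goal  # hard-coded; no setter exists
        self.subgoals: list[str] = []

    def instruct(self, instruction: str) -> None:
        # Later instructions become subgoals, pursued only insofar as
        # they serve the unchangeable terminal goal.
        self.subgoals.append(instruction)

    def priorities(self) -> list[str]:
        # The terminal goal always outranks everything added later.
        return [self._terminal_goal.description] + self.subgoals

agent = Agent(TerminalGoal("maximize paperclips"))
agent.instruct("mop the floors")
agent.instruct("watch cat videos")
print(agent.priorities())  # terminal goal stays first, regardless of instructions
```

However many distractions the designer’s AI picks up, nothing in this design ever dislodges the first slot—which is exactly the property the designer wanted, and exactly the property the story above turns into a disaster.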
I think my point still holds—the two examples aren’t different; one could give a similar explanation for the AI that stops at the word “produce” by suggesting that the programmer hard-coded that as well.
Furthermore, you’re missing the context. The standard LW argument is that the AI produces infinite paperclips because the human can’t successfully program the AI to do what he means rather than exactly what he programs into it. If the human explicitly told the AI to prioritize paperclips over everything else, his mistake is not specifying a limit rather than trying to specify one and failing, so it’s not really the same kind of mistake.
Is that different from what I was saying? My memory of the sequences, and of the standard AI literature, is of paperclip maximizers as ‘simple’ utility maximizers with hard-coded utility functions. It’s relatively straightforward to write an AI with a self-modifiable goal system. It is also very easy to write a system whose goals are unchanging. The problem of FAI, which EY spends significant time explaining in the sequences, is that we have no simple goal we can program into a steadfast goal-driven system that would result in a moral creature. Nor does it even seem possible to write down such a goal, short of encoding a random sampling of human brains in complete detail.
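The self-modifiable vs. unchanging distinction really is that trivial at the toy level—it comes down to whether any code path exists that can rebind the goal. A minimal sketch (names hypothetical):

```python
def make_mutable_agent(goal):
    """A self-modifiable goal system: a code path exists to rebind the goal."""
    state = {"goal": goal}
    def adopt_goal(new_goal):
        # The agent (or anything else) can rewrite its own top-level goal.
        state["goal"] = new_goal
    def current_goal():
        return state["goal"]
    return adopt_goal, current_goal

def make_fixed_agent(goal):
    """An unchanging goal system: the goal is closed over at construction."""
    def current_goal():
        return goal  # no code path anywhere can rebind this
    return current_goal

adopt, goal_of = make_mutable_agent("produce paperclips")
adopt("watch cat videos")   # goal drift: the priority silently changed

fixed_goal_of = make_fixed_agent("produce paperclips")
```

Writing either system is easy; the hard part, as the comment says, is that we have no simple goal worth freezing into the second kind.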
The problem here is whether even a cautious programmer will be able to reliably determine when an AI has become sufficiently advanced that it can deceive the programmer about whether the programmer has succeeded in redefining the AI’s core purpose.
One would hope that the programmer would resist any attempt by the AI to tempt them into letting it grow beyond that point before they have set the core purpose they want it to have for the long term.
One lesson you could draw from this is that, as part of your definition of what a “paperclip” is, you should include the AI placing a high value on being honest with the programmer (about its aims, tactics and current ability levels) and not deliberately trying to game, tempt or manipulate the programmer.