Question:
How do you make the paperclip maximizer want to collect paperclips? I have two slightly different understandings of how you might do this, in terms of how it’s ultimately programmed:
1) there’s a function that says “maximize paperclips”
2) there’s a function that says “getting a paperclip = +1 good point”
Given these two different understandings, though, isn’t the inevitable result for a truly intelligent paperclip maximizer to just hack itself? Based on my two different understandings, it would:
1) make itself /think/ that it’s getting paperclips, because that’s what it really wants—there’s no way to make it value ACTUALLY getting paperclips as opposed to just thinking that it’s getting paperclips
2) find a way to directly award itself “good points”, because that’s what it really wants
I think my understanding is probably flawed somewhere, but I haven’t been able to figure out where, so please point it out.
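To make the two framings concrete, here is a minimal sketch (the dictionary keys and function names below are invented for illustration, not a claim about how such an agent would actually be written):

```python
# Toy illustration of the two framings in the question. All names here
# (the dictionary keys, the functions) are invented for illustration.

# Framing 1: a utility function over the (modeled) state of the world.
def utility(world_state):
    return world_state["paperclip_count"]

# Framing 2: a reward of +1 per paperclip the agent *perceives* it has gained.
def reward(observation):
    return observation["perceived_paperclips"]

# The worry: under framing 2, any action that inflates `perceived_paperclips`
# (spoofed sensors, editing the reward counter) scores exactly as well as
# actually manufacturing paperclips.
```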
For what it’s worth, though, as far as I can tell we don’t have the ability to create an AI that will reliably maximize the number of paperclips in the real world, even with infinite computing power. As Manfred said, model-based goals seem to be a promising research direction for getting AIs to care about the real world, but we don’t currently have the ability to get such an AI to reliably “value paperclips”. There are a lot of problems with model-based goals that occur even in the POMDP setting, let alone when the agent’s model of the world or observation space can change. So I wouldn’t expect anyone to be able to propose a fully coherent, complete answer to your question in the near term.
It might be useful to think about how humans “solve” this problem, and whether or not you can port this behavior over to an AI.
If you’re interested in this topic, I would recommend MIRI’s paper on value learning as well as the relevant Arbital Technical Tutorial.
To the best of our current understanding, it has to have a model of the world (e.g. as a POMDP) that contains a count of the number of paperclips, and that it can use to predict what effect its actions will have on the number of paperclips. Then it chooses a strategy that will, according to the model, lead to lots of paperclips.
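As a very rough sketch of that setup (purely a toy: the WorldModel class, the action names, and the numbers below are made up, and the partial-observability machinery of a real POMDP is stripped out), the agent scores each action by the paperclip count its model predicts, not by what its sensors would report:

```python
# Minimal sketch of a model-based paperclip agent. The WorldModel class,
# the action names, and the numbers are hypothetical stand-ins.

class WorldModel:
    """The agent's internal model: maps (state, action) -> predicted next state."""
    def predict(self, state, action):
        state = dict(state)
        if action == "run_paperclip_factory":
            state["paperclips"] += 100        # the model predicts this makes clips
        elif action == "spoof_own_sensors":
            state["sensors_spoofed"] = True   # ...and predicts this changes nothing
        return state                          # about the actual paperclip count

def utility(state):
    # Utility is computed from the *modeled* world state, not from raw sensor readings.
    return state["paperclips"]

def choose_action(model, state, actions):
    # Pick the action whose predicted outcome contains the most paperclips.
    return max(actions, key=lambda a: utility(model.predict(state, a)))

state = {"paperclips": 0, "sensors_spoofed": False}
best = choose_action(WorldModel(), state,
                     ["run_paperclip_factory", "spoof_own_sensors"])
print(best)  # -> run_paperclip_factory: spoofing doesn't raise the modeled count
```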
Such an AI won’t want to fool itself because, according to basically any model of the world, fooling yourself does not result in more paperclips.
“according to basically any model of the world, fooling yourself does not result in more paperclips.”
Paul Almond at one time proposed that every interpretation of a real thing is a real thing. According to that theory, fooling yourself that there are more paperclips does result in more paperclips (although not fooling yourself also has that result).
But what does the code for that look like? It looks like maximize(# of paperclips in world), but how does it determine (# of paperclips in world)? You just said it has a model. But how can it distinguish between real input that leads to the perception of paperclips and fake input that leads to the perception of paperclips?
Well, if the acronym “POMDP” didn’t make any sense, I think we should start with a simpler example, like a chessboard.
Suppose we want to write a chess-playing AI that gets its input from a camera looking at the chessboard. And for some reason, we give it a button that replaces the video feed with a picture of the board in a winning position.
Inside the program, the AI knows about the rules of chess, and has some heuristics for how it expects the opponent to play. Then it represents the external chessboard with some data array. Finally, it has some rules about how the image in the camera is generated from the true chessboard and whether or not it’s pressing the button.
If we just try to get the AI to make the video feed show a winning position, then it will press the button. But if we instead try to get the AI to put its internal representation of the board (the data array) into a winning position, and we update that internal representation to track the true chessboard, then it won’t press the button. This is actually quite easy to do—for example, if the AI is a jumble of neural networks, and we have a long training phase in which it’s rewarded for actually winning games, not just seeing winning board states, then it will learn to take into account the state of the button when looking at the image.
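Here is a toy version of that contrast (the image strings, the “win” test, and the button handling below are made-up stand-ins, not how a real vision-based chess agent would be built):

```python
# Toy version of the chess example. The image strings, the "win" test, and
# the button handling are made-up stand-ins for the real components.

def observed_image(true_board, button_pressed):
    # How the camera feed is generated: the button replaces the real view
    # with a stored picture of a winning position.
    return "winning_position.png" if button_pressed else "photo_of_" + true_board

def feed_value(image):
    # Naive objective: "make the video feed show a winning position".
    return 1.0 if image == "winning_position.png" else 0.0

def board_value(image, button_pressed, believed_board):
    # Model-based objective: score the *inferred* true board. The agent's
    # model of the camera says that while the button is pressed the image
    # carries no information, so it falls back on its own belief.
    inferred = believed_board if button_pressed else image[len("photo_of_"):]
    return 1.0 if inferred == "winning_board" else 0.0

# Real board in a losing position, button pressed:
img = observed_image("losing_board", button_pressed=True)
print(feed_value(img))                                        # 1.0 -- pressing the button looks great
print(board_value(img, True, believed_board="losing_board"))  # 0.0 -- the button gains nothing
```

The only difference is where the objective is evaluated: on the raw feed, pressing the button is the best possible move; on the inferred board, it accomplishes nothing.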
Why would it hack itself to think it’s getting paperclips if it’s originally programmed to want real paperclips? It would not be incentivized to make that hack because that hack would make it NOT get paperclips.
As I said though, how do you program it to want REAL paperclips, as opposed to just perceiving that it is getting paperclips?