To the best of our current understanding, it has to have a model of the world (e.g. as a POMDP) that contains a count of the number of paperclips, and that it can use to predict what effect its actions will have on the number of paperclips. Then it chooses a strategy that will, according to the model, lead to lots of paperclips.
It won’t want to fool itself because, according to basically any model of the world, fooling yourself does not result in more paperclips.
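To make that concrete, here is a minimal toy sketch. Everything in it (the action names, the state fields, the one-step search) is made up purely for illustration; a real agent would plan over a POMDP rather than looking one step ahead.

```python
# Hypothetical toy world model: there is a true paperclip count and a sensor
# reading. One action makes a real paperclip, the other only spoofs the sensor.

def predict_next_state(state, action):
    """World-model transition: what the agent believes each action does."""
    next_state = dict(state)
    if action == "make_paperclip":
        next_state["num_paperclips"] += 1
        next_state["sensor_reading"] += 1
    elif action == "spoof_sensor":
        next_state["sensor_reading"] += 1   # perception changes...
        # ...but the modelled count of real paperclips does not.
    return next_state

def choose_action(state, actions):
    """Pick the action whose *predicted* world contains the most paperclips."""
    return max(actions,
               key=lambda a: predict_next_state(state, a)["num_paperclips"])

state = {"num_paperclips": 0, "sensor_reading": 0}
print(choose_action(state, ["spoof_sensor", "make_paperclip"]))  # -> make_paperclip
```

The agent scores actions by the paperclip count in its model's predicted state, not by what its sensors will report, so spoofing the sensor never scores higher than actually making a paperclip.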
“according to basically any model of the world, fooling yourself does not result in more paperclips.”
Paul Almond at one time proposed that every interpretation of a real thing is a real thing. According to that theory, fooling yourself that there are more paperclips does result in more paperclips (although not fooling yourself also has that result).
But what does the code for that look like? It looks like maximize(# of paperclips in world), but how does it determine (# of paperclips in world)? You just said it has a model. But how can it distinguish between real input that leads to the perception of paperclips and fake input that leads to the perception of paperclips?
Well, if the acronym “POMDP” didn’t make any sense, I think we should start with a simpler example, like a chessboard.
Suppose we want to write a chess-playing AI that gets its input from a camera looking at the chessboard. And for some reason, we give it a button that replaces the video feed with a picture of the board in a winning position.
Inside the program, the AI knows the rules of chess and has some heuristics for how it expects the opponent to play. It also represents the external chessboard with some data array. Finally, it has some rules about how the camera image is generated from the true chessboard and from whether or not it’s pressing the button.
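Roughly, and with completely made-up placeholder names, those internals might be organized something like this. The key point is that the camera image is modelled as a function of the true board and the button, rather than being treated as the board itself.

```python
# Hypothetical sketch of the chess AI's internals (toy placeholders, not a real engine).

WINNING_IMAGE = "picture of a winning position"

class ChessWorldModel:
    def __init__(self, true_board):
        self.board = true_board        # the data array tracking the external chessboard
        self.button_pressed = False    # is the video-spoofing button held down?

    def predicted_camera_image(self):
        # The AI's rule for how the camera image is generated from the world:
        # if the button is pressed, the feed shows a canned winning position
        # no matter what the true board looks like.
        if self.button_pressed:
            return WINNING_IMAGE
        return f"picture of {self.board}"

model = ChessWorldModel(true_board="the actual current position")
model.button_pressed = True
print(model.predicted_camera_image())   # the spoofed feed, not the real board
```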
If we just try to get the AI to make the video feed show a winning position, then it will press the button. But if we instead try to get the AI’s internal representation of the board (the data array) into a winning position, and we update that internal representation so that it tracks the true chessboard, then it won’t press the button. This is actually quite easy to do: for example, if the AI is a jumble of neural networks and we have a long training phase in which it’s rewarded for actually winning games, not just for seeing winning board states, then it will learn to take the state of the button into account when looking at the image.
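Here is the same toy spelled out as the two candidate objectives. Again, every name is a made-up placeholder, and the real version would be learned rather than hand-written.

```python
# Hypothetical contrast between the two objectives. Pressing the button
# changes what the camera shows, but not the true board.

WINNING_IMAGE = "image of a winning position"

def camera_image(true_board, button_pressed):
    # How the video feed is generated from the world.
    return WINNING_IMAGE if button_pressed else f"image of {true_board}"

def inferred_board(image, button_pressed):
    # Belief update that tries to track the *true* board. Because the AI
    # models the button, a spoofed image tells it nothing about the real position.
    if button_pressed:
        return "whatever the board was before"   # the image carries no information
    return image.removeprefix("image of ")

def feed_objective(true_board, button_pressed):
    # "Make the video feed show a winning position" -- satisfied by the button.
    return camera_image(true_board, button_pressed) == WINNING_IMAGE

def internal_objective(true_board, button_pressed):
    # "Get the internal board representation into a winning position" --
    # unaffected by the button, because the inference routes around it.
    image = camera_image(true_board, button_pressed)
    return inferred_board(image, button_pressed) == "winning position"

print(feed_objective("losing position", button_pressed=True))      # True: the button "works"
print(internal_objective("losing position", button_pressed=True))  # False: the button doesn't help
```

The point about training is that a network rewarded for actually winning games ends up approximating something like internal_objective rather than feed_objective.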