For instance, it does seem conceivable that a superintelligence told to “be as good at Go/chess as possible” or “find a proof of the Goldbach conjecture” might decide to start turning all available matter into a computer.
Just to show it’s probably as bad as you think: even these sorts of statements wouldn’t likely cash out for the AI as having a goal to “be as good at Go/chess as possible” or “find a proof of the Goldbach conjecture”, because those are ways we interpret and give meaning to what the AI is doing, and you could build an AI to do those things without it understanding its own goals. We can and do build AI today that has no conception of its own actions, the same way computer programs and non-electronic machinery don’t know what they’re doing, and since that is the easier thing to build, it’s far more likely that’s what a superintelligence pointed at these problems would look like.
That is to say, the paperclip maximizers we worry about probably don’t even know they’re maximizing paperclips; they’re just doing stuff in a way that we interpret as maximizing paperclips.
I think this also illustrates something suggested by the linked post: you can have a utility function without it meaning anything to the thing optimizing it. I think what you’re seeing is that it’s much harder to make systems that intentionally do the things we think we want them to do than to make systems that merely do the shadow of those things, the shadow being whatever we actually knew how to specify.
If I’ve understood you correctly, I think I’m actually arguing something like the opposite. It seems to me that a) we don’t know how to specify even very bad goals such as “maximize paperclips” and b) if we did, we wouldn’t know how to install such a specified goal in an AI. At least not for the meaning of “specified” that is required in order for formal proofs about what the goal really means to apply.
There’s some sense in which we can do this, though, because we already do it. After all, AlphaGo really does tell you moves that result in winning a game of Go, even though AlphaGo seems unlikely to have any idea what Go is or what winning means. We’ve specified that it should do something, but that something is only meaningful to us.
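Concretely, all the optimizer ever “sees” of the goal is a number. Here’s a minimal, runnable sketch of what I mean (a toy bandit-style learner, not AlphaGo’s actual setup; the game, the target, and all the names are made up for illustration):

```python
import random

# Toy "agent" that learns to pick winning moves in a game it has no concept of.
# The only thing the learning rule ever touches is the scalar reward; that the
# reward encodes "winning" is an interpretation that lives entirely in our heads.

TARGET = 7                    # "winning" means guessing this number
MOVES = list(range(10))

def reward_for(move):
    return 1.0 if move == TARGET else 0.0

def train(episodes=5000, epsilon=0.1):
    value = {m: 0.0 for m in MOVES}   # running estimate of reward per move
    counts = {m: 0 for m in MOVES}
    for _ in range(episodes):
        # epsilon-greedy: usually pick the best-looking move, sometimes explore
        move = random.choice(MOVES) if random.random() < epsilon else max(MOVES, key=value.get)
        r = reward_for(move)
        counts[move] += 1
        value[move] += (r - value[move]) / counts[move]
    return max(MOVES, key=value.get)

print(train())   # reliably prints 7, yet nothing in here "knows" what 7 means
```

The program ends up doing the thing we wanted, but the “goal” it satisfied was never anything more than a scalar it was nudged to increase.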
Put another way, this morning I definitely wrote boring old code we wouldn’t even call AI that tries to satisfy the goal of “insert X into database Y”, and my code exists as a kind of specification of that goal (although a very precise one that says exactly how to do it) that the computer will use to accomplish it. But this is a far cry from the computer trying for itself to insert X into database Y because I had some way to specify that’s what I wanted, other than just telling it exactly what to do.
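For the curious, the kind of thing I mean looks roughly like this (the database, table, and column names are made up for illustration):

```python
import sqlite3

# The "goal" of inserting X into database Y exists only as these exact
# instructions; the computer isn't trying to achieve anything, it's just
# executing the steps I spelled out.

def insert_record(db_path, x):
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS y (value TEXT)")
        conn.execute("INSERT INTO y (value) VALUES (?)", (x,))
        conn.commit()
    finally:
        conn.close()

insert_record("example.db", "X")
```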
Maybe we are talking at different levels here, but it seems to me that when we can specify a goal now, we specify it in such a specific way that we can make formal proofs about it, yet the result is not very interesting, because the system has little power of its own to do things we didn’t specifically ask it to do. And I agree we don’t know how to specify goals to more complex systems that do things for themselves, the way we can ask people to do things, much less in a way that lets us make formal proofs about their properties.
Right—as I mentioned near the end of my post, it is clearly easy to specify formal utility functions that are about formal systems, like Go or databases. My question is: how do you specify a formal utility function that is about the real world? Almost any remotely interesting goal I can think of (such as “get me coffee”) seems impossible to formalize without relying on pre-theoretical notions of what it means for “me” to “have coffee”.
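To make the contrast concrete, here’s a sketch (the Go scoring is oversimplified and the names are made up): for a formal system the utility function’s whole domain is already defined, while for “get me coffee” we can’t even write down the type of the input.

```python
# Utility over a formal system: the domain (final Go scores, say) is fully
# specified, so this function is well-defined even if the rules are simplified.
def go_utility(black_score, white_score):
    return 1.0 if black_score > white_score else 0.0

print(go_utility(184.5, 180.5))   # 1.0

# Utility "about the real world": there is no formal object to pass in as
# `world` and no formal predicate for "I have coffee", so this stays a stub.
def coffee_utility(world):
    raise NotImplementedError("what is the type of `world`, and what counts as having coffee?")
```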
If I were just trying to build an AI, this question wouldn’t be terribly interesting. Obviously, you give the AI the utility function “maximize approval from the human trainer or concurrence with the training set” or whatever. The reason I’m posing the question is that the main research goal of AI safety appears to me to be “how do we close the gap between what humans value and what amoral maximizers do, and how do we prove that we’ve closed it correctly?” One strand of research appears to be pursuing this goal through formal reasoning, and I just don’t understand where that can possibly lead, since you can’t formalize the stuff you care about in the first place.
Again, I feel like this is an extremely basic question that I have no doubt people doing the research have thought of, but I haven’t been able to find any previous discussion about it.
The closest thing I can think of is work on the capabilities side looking at how to create intension, though that problem has never been resolved (which is a big part of why GOFAI failed). You’re right that we mostly assume it will be figured out somehow, but safety research at least does not seem to be much addressing this question.