I apologize for the late response, but here goes :)
I think you missed the point I was trying to make.
You and others seem to say that we often misjudge the consequences of the utility functions that we implement. Even though we have in mind a utility X, the maximization of which would satisfy us, we may actually implement a utility Y with completely different, perhaps catastrophic, implications. For instance:
X = Do what humans want
Y = Seize control of the reward button
What I was pointing out in my post is that this holds only for perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the “failure modes” of superintelligence are off: while you correctly saw that implementing our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we’ll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:
X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = ??? (derived)
Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system’s initial trajectory.
I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and that other mountains are shorter than they really are.
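To make the mountain picture concrete, here is a minimal Python sketch of a greedy local search on a made-up one-dimensional landscape. The names `landscape` and `hill_climb`, as well as the peak positions and heights, are assumptions chosen purely for illustration; this is not meant to describe any real training algorithm. It only shows the narrower point that a strictly local search stops on whichever peak it starts near, whether or not a higher one exists elsewhere.

```python
def landscape(x: int) -> float:
    """Hypothetical two-peak landscape: a low peak at x=20, a higher one at x=80."""
    low_peak = 5.0 - 0.5 * abs(x - 20)
    high_peak = 10.0 - 0.5 * abs(x - 80)
    # Everything between and beyond the two peaks is flat ground at height 0.
    return max(low_peak, high_peak, 0.0)


def hill_climb(start: int, lo: int = 0, hi: int = 100) -> int:
    """Greedy local search: step to a neighbor only if it is strictly higher."""
    x = start
    while True:
        neighbors = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbors, key=landscape)
        if landscape(best) <= landscape(x):
            # No neighbor improves on the current position: we are "stuck" here.
            return x
        x = best


if __name__ == "__main__":
    print(hill_climb(15))  # starts on the slope of the low peak
    print(hill_climb(70))  # starts on the slope of the high peak
```

Started at 15, the search ends on the low peak at 20 and never learns that the peak at 80 is twice as high; started at 70, it ends at 80. Which mountain it ends up on is decided by where it began, not by how well it climbs.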
You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets “stuck” on. It is therefore possible that you would end up with this situation:
X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = Do what humans want (derived)
And that’s regardless of the eventual magnitude of the AI’s capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted. And once Z is in control, it will become impossible to displace.
In other words, the genie will know that they can maximize their “reward” by seizing control of the reward button and pressing it, but they won’t care, because they built their intelligence to serve a misrepresentation of their reward. It’s like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can’t do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what’s the predicted reward for changing the reward model? … Ah.
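A toy Python sketch of that last step, under heavy assumptions: the action names, the reward numbers, and the `choose_action` helper are all hypothetical, and a real agent would not reduce to a lookup table. The only point is that an agent which picks actions by its own learned reward model evaluates even "change the reward model" with that same model.

```python
# What the reward button would actually pay out (unknown to the agent's planner).
TRUE_REWARD = {
    "help_humans": 1.0,
    "seize_button": 100.0,
    "rewrite_reward_model": 0.0,
}

# The agent's learned (and wrong) model of reward, formed early in training.
# It never experienced the button payoff, so it predicts nothing for seizing it
# and nothing for rewriting itself.
PREDICTED_REWARD = {
    "help_humans": 1.0,
    "seize_button": 0.0,
    "rewrite_reward_model": 0.0,
}


def choose_action(reward_model: dict) -> str:
    """Pick the action with the highest reward according to the given model."""
    return max(reward_model, key=reward_model.get)


if __name__ == "__main__":
    print(choose_action(PREDICTED_REWARD))  # what the agent actually does
    print(choose_action(TRUE_REWARD))       # what a perfect maximizer of Y would do
```

Under the learned model the agent picks `help_humans`; only a planner reading the true payoffs directly would pick `seize_button`. Rewriting the model scores zero under the model itself, so nothing in the loop pushes the agent to correct it.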
Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.
Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that a superintelligent AI trained with a reward button would seize control of the button is just as much of a naive oversimplification as the idea that an AI will magically derive your intent from the utility function that you give it.