I have done AI. I know it is difficult. However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe.
And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
You may be right. However, this is far from obvious. The problem is that it may “know” that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process.
I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently AI goes in a certain direction, the less likely it will be to expend energy into alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI’s rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with “there has never been any problems here, go look somewhere else”.
It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that’s where it might paint itself in a corner: it might inadvertently put up immense walls in the direction of the global minimum that it cannot tear down (it never expected that it would have to). In other words, it will set up a utility function for itself which enshrines the current minimum as global.
Now, perhaps you are right and I am wrong. But it is not obvious: an AI might very well grow out of a solidifying core so pervasive that it cannot get rid of it. Many algorithms already exhibit that kind of behavior; many humans, too. I feel that it is not a possibility that can be dismissed offhand. At the very least, it is a good prospect for FAI research.
I apologize for the late response, but here goes :)
I think you missed the point I was trying to make.
You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance:
What I was pointing out in my post is that this is only valid of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the “failure modes” of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we’ll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:
Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system’s initial trajectory.
I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and other mountains are shorter than they really are.
You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets “stuck” on. It is therefore possible that you would end up with this situation:
And that’s regardless of the eventual magnitude of the AI’s capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted to. But once Z is in control, it will become impossible to displace.
In other words, the genie will know that they can maximize their “reward” by seizing control of the reward button and pressing it, but they won’t care, because they built their intelligence to serve a misrepresentation of their reward. It’s like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can’t do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what’s the predicted reward for changing the reward model? … Ah.
Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.
Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that superintelligence AI trained with a reward button would seize control over the button is just as much of a naive oversimplification as the idea that AI will magically derive your intent from the utility function that you give it.