Eliezer Yudkowsky comments on The genie knows, but doesn’t care

Eliezer Yudkowsky 10 Sep 2013 23:43 UTC
0 points
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”. What you’ve written strikes me as a sheer fantasy of convenience. Nor does it follow automatically from intelligence for all the reasons RobbBB has already been giving.

And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
- Broolucks 11 Sep 2013 1:12 UTC
  6 points
  Parent
  I have done AI. I know it is difficult. However, few existing algorithms, if any, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe.
  
  And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
  
  You may be right. However, this is far from obvious. The problem is that it may “know” that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process.
  
  I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently an AI goes in a certain direction, the less likely it will be to expend energy on alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI’s rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with “there have never been any problems here, go look somewhere else”.
  
  It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that’s where it might paint itself into a corner: it might inadvertently put up immense walls in the direction of the global minimum that it cannot tear down (it never expected that it would have to). In other words, it will set up a utility function for itself which enshrines the current minimum as global.
  
  Now, perhaps you are right and I am wrong. But it is not obvious: an AI might very well grow out of a solidifying core so pervasive that it cannot get rid of it. Many algorithms already exhibit that kind of behavior; many humans, too. I feel that it is not a possibility that can be dismissed offhand. At the very least, it is a good prospect for FAI research.
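  
  The “walls” argument above can be made concrete with a minimal sketch. It is not taken from any real system: the landscape, the `wall_penalty` parameter and all function names are assumptions made for illustration, and it maximizes utility (climbing a peak) rather than descending to a minimum. A greedy climber settles on the first peak it finds, then marks down everything far from where it settled. From then on even an exhaustive search over its own derived utility returns the peak it is already on.
  
  ```python
  def true_utility(x: int) -> float:
      """Hypothetical landscape: a low peak at x=2 and a higher peak at x=8."""
      low_peak = 5 - abs(x - 2) * 2
      high_peak = 10 - abs(x - 8) * 2
      return float(max(low_peak, high_peak, 0))
  
  
  def hill_climb(utility, start: int, lo: int = 0, hi: int = 10) -> int:
      """Greedy local search: step to the better neighbour until none is better."""
      x = start
      while True:
          neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
          best = max(neighbours, key=utility)
          if utility(best) <= utility(x):
              return x
          x = best
  
  
  def walled_utility(anchor: int, radius: int = 2, wall_penalty: float = 100.0):
      """Utility the searcher uses after pruning: anything farther than
      `radius` from where it settled is arbitrarily marked down."""
      def utility(x: int) -> float:
          penalty = wall_penalty if abs(x - anchor) > radius else 0.0
          return true_utility(x) - penalty
      return utility
  
  
  settled = hill_climb(true_utility, start=0)
  derived = walled_utility(anchor=settled)
  
  print(settled)                             # where local search stopped
  print(max(range(11), key=true_utility))    # best point on the true landscape
  print(max(range(11), key=derived))         # best point according to the derived utility
  ```
  
  In this toy the wall is a fixed penalty; the claim in the comment is only that some pruning of this general shape is hard to undo later, not that real systems prune this way.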
  - wedrifid 11 Sep 2013 2:13 UTC
    8 points
    Parent
    
    However, few existing algorithms, if any, have the failure modes you describe. They fail early, and they fail hard.
    
    Yes, most algorithms fail early and fail hard. Most of my AI algorithms failed early with a SegFault, for instance. New, very similar algorithms were then designed with progressively more advanced bugs. But these are a separate consideration. What we are interested in here is the question “Given that an AI algorithm capable of recursive self-improvement is successfully created by humans, how likely is it to exhibit this kind of failure mode?” The “fail early, fail hard” cases are screened off. We’re looking at the small set that is either damn close to a desired AI or actually a desired AI, and distinguishing between them.
    
    Looking at the context to work out what the ‘failure mode’ being discussed is, it seems to be the issue where an AI is programmed to optimise based on a feedback mechanism controlled by humans. When the AI in question is superintelligent, most failure modes tend to be variants of “conquer the future light cone, kill everything that is a threat and supply perfect feedback to self”. When translating this to the nearest analogous failure mode in some narrow AI algorithm of the kind we can design now, it seems like this refers to the failure mode whereby the AI optimises exactly what it is asked to optimise, but in a way that is a lost purpose. This is certainly what I had to keep in mind in my own research.
    
    A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory, their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost, and a budget. They were to use this to design the optimal fleet given their resources, and the task was undertaken by military officers and by a group which used an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss it as a failure, because it optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dinghies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
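    
    The gap between the specification as given and the one the humans meant can be shown with a toy sketch. Everything in it is invented for illustration: the parts catalogue, the costs, the budget and both scoring functions are assumptions, not the rules of the actual simulation or the method the real program used. An exhaustive search over single-design fleets is scored once on the literal objective (total firepower) and once on a stand-in for the intended one (firepower that has some armour behind it).
    
    ```python
    from itertools import product
    
    BUDGET = 120  # hypothetical budget, in arbitrary credits
    
    # Hypothetical parts catalogue: name -> (cost, value)
    HULLS = {"dinghy": (1, 0), "frigate": (10, 5), "battleship": (40, 15)}  # value = armour
    WEAPONS = {"small gun": (1, 1), "large cannon": (5, 8)}                 # value = firepower
    
    
    def fleets():
        """Yield (hull, weapon, count, firepower, armour) for every single-design fleet."""
        for (hull, (h_cost, armour)), (weapon, (w_cost, power)) in product(
            HULLS.items(), WEAPONS.items()
        ):
            count = BUDGET // (h_cost + w_cost)  # buy as many copies as the budget allows
            yield hull, weapon, count, power, armour
    
    
    def specified_score(fleet):
        """What the problem statement literally rewards: total firepower."""
        _, _, count, power, _ = fleet
        return count * power
    
    
    def intended_score(fleet):
        """Stand-in for what the designers had in mind: firepower weighted by armour."""
        _, _, count, power, armour = fleet
        return count * power * armour
    
    
    literal_best = max(fleets(), key=specified_score)
    intended_best = max(fleets(), key=intended_score)
    
    print(literal_best[:3])    # design chosen under the specification as written
    print(intended_best[:3])   # design chosen under the intended objective
    ```
    
    With these made-up numbers the literal optimum is a swarm of unarmoured dinghies carrying large cannons, which scores zero on the intended objective. The optimiser did nothing wrong; the two scoring functions simply disagree.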
    
    When it comes to considering proposals for how to create friendly superintelligences, it becomes easy to spot notorious failure modes in what humans typically think is a clever solution. It happens to be the case that any solution that is based on an AI optimising for approval or achieving instructions given just results in Everybody Dies.
    
    Where Eliezer suggests getting AI experience to get a feel for such difficulties I suggest an alternative. Try being a D&D dungeon master in a group full of munchkins. Make note of every time that for the sake of the game you must use your authority to outlaw the use of a by-the-rules feature.
    - somervta 11 Sep 2013 3:49 UTC
      9 points
      Parent
      
      A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory, their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost, and a budget. They were to use this to design the optimal fleet given their resources, and the task was undertaken by military officers and by a group which used an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss it as a failure, because it optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dinghies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
      
      The AI in question was Eurisko, and it entered the Traveller Trillion Credit Squadron tournament in 1981 as described above. It was also entered the next year, after an extended redesign of the rules, and won again. After this the competition runners announced that if Eurisko won a third time the competition would be discontinued, so Lenat (the programmer) stopped entering.
    - Broolucks 13 Sep 2013 20:13 UTC
      2 points
      Parent
      I apologize for the late response, but here goes :)
      
      I think you missed the point I was trying to make.
      
      You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance:
      
      X = Do what humans want
      Y = Seize control of the reward button
      What I was pointing out in my post is that this is only true of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the “failure modes” of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we’ll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:
      
      X = Do what humans want (intended)
      Y = Seize control of the reward button (literal)
      Z = ??? (derived)
      Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system’s initial trajectory.
      
      I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and other mountains are shorter than they really are.
      
      You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets “stuck” on. It is therefore possible that you would end up with this situation:
      
      X = Do what humans want (intended)
      Y = Seize control of the reward button (literal)
      Z = Do what humans want (derived)
      And that’s regardless of the eventual magnitude of the AI’s capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted them to. But once Z is in control, it will become impossible to displace.
      
      In other words, the genie will know that they can maximize their “reward” by seizing control of the reward button and pressing it, but they won’t care, because they built their intelligence to serve a misrepresentation of their reward. It’s like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can’t do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what’s the predicted reward for changing the reward model? … Ah.
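      
      A deliberately crude sketch of that “twice removed” loop follows. It is an illustration under stated assumptions, not a model of any real learning algorithm: the action names, the reward numbers and the pessimistic `default_prediction` for untried actions are all hypothetical. The agent picks actions by its own reward predictions and only corrects the prediction for the action it actually took, so the one action with the largest true reward is never tried.
      
      ```python
      # Hypothetical true rewards; the agent never reads this table when choosing.
      TRUE_REWARD = {"help humans": 1.0, "idle": 0.0, "seize button": 100.0}
      
      
      class Agent:
          """Greedy agent that acts on its reward predictions, not on true reward."""
      
          def __init__(self, default_prediction: float = -1.0):
              # Assumption of the sketch: untried actions start with a pessimistic guess.
              self.predicted = {action: default_prediction for action in TRUE_REWARD}
      
          def act(self) -> str:
              # Choose whatever the internal reward model currently rates highest.
              return max(self.predicted, key=self.predicted.get)
      
          def learn(self, action: str) -> None:
              # Only the action actually taken has its prediction corrected.
              self.predicted[action] = TRUE_REWARD[action]
      
      
      agent = Agent()
      agent.learn("help humans")  # early training happened to exercise these two
      agent.learn("idle")
      
      history = []
      for _ in range(5):
          choice = agent.act()
          agent.learn(choice)
          history.append(choice)
      
      print(history)                          # actions chosen after early training
      print(agent.predicted["seize button"])  # prediction for the never-tried action
      ```
      
      The toy agent does not even “know” about the button’s payoff, which is weaker than the claim in the comment; it only shows how a decision rule that consults a wrong reward model can be stable.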
      
      Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.
      
      Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that a superintelligent AI trained with a reward button would seize control over the button is just as much of a naive oversimplification as the idea that AI will magically derive your intent from the utility function that you give it.
    - Kyre 14 Sep 2013 14:41 UTC
      0 points
      Parent
      (Sorry, didn’t see comment below) (Nitpick)
      
      A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory, their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost, and a budget. They were to use this to design the optimal fleet given their resources, and the task was undertaken by military officers and by a group which used an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss it as a failure, because it optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise.
      
      Is this a reference to Eurisko winning the Traveller Trillion Credit Squadron tournament in 1981/82? If so, I don’t think it was a military research agency.
- EHeller 10 Sep 2013 23:57 UTC
  3 points
  Parent
  
  I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”
  
  I think it depends on context, but a lot of existing machine learning algorithms actually do generalize pretty well. I’ve seen demos of Watson in healthcare where it managed to generalize very well just given scrapes of patients’ records, and it has improved even further with a little guided feedback. I’ve also had pretty good luck using a variant of Boltzmann machines to construct human-sounding paragraphs.
  
  It would surprise me if a general AI weren’t capable of parsing the sentiment/intent behind human speech fairly well, given how well the much “dumber” algorithms work.