AlexMennen comments on Stuart Russell: AI value alignment problem must be an “intrinsic part” of the field’s mainstream agenda

AlexMennen 27 Nov 2014 21:00 UTC
5 points
It’s awfully suspicious to say that the one goal architecture that is coherent enough to analyse easily is dangerous but that all others are safe. More concretely, humans are not VNM-rational (as you pointed out), and often pose threats to other agents anyway. Also, an AI does not have to be programmed with an explicit utility function in order to be VNM-rational, and thus to behave as if it has a utility function.

I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities.

You can rescale an unbounded utility function to a bounded one that will have the same preferences over known outcomes, but this will change its preferences over gambles; in particular, agents with bounded utility functions cannot be made to care about arbitrarily small probabilities of arbitrarily good/bad outcomes.
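To make the rescaling point concrete, here is a minimal sketch. The particular squashing function `u / (1 + |u|)`, the outcome values, and the one-in-a-million gamble are all illustrative assumptions of mine, not anything specified above; any strictly increasing bounded map would show the same effect.

```python
def squash(u):
    """Strictly increasing map from the reals into (-1, 1). Illustrative choice only."""
    return u / (1.0 + abs(u))


def unbounded(x):
    """Hypothetical unbounded utility: utility equals the size of the outcome."""
    return float(x)


def bounded(x):
    """The same utility after rescaling into a bounded range."""
    return squash(unbounded(x))


def expected(utility, gamble):
    """Expected utility of a gamble given as (probability, outcome) pairs."""
    return sum(p * utility(x) for p, x in gamble)


# Over known outcomes, both functions rank everything identically.
outcomes = [0, 10, 1_000, 10**9]
same_ranking = sorted(outcomes, key=unbounded) == sorted(outcomes, key=bounded)

# Over gambles they can disagree: a sure 10 versus a tiny chance of a huge prize.
sure_thing = [(1.0, 10)]
long_shot = [(1e-6, 10**9), (1 - 1e-6, 0)]

unbounded_prefers_long_shot = (
    expected(unbounded, long_shot) > expected(unbounded, sure_thing)
)
bounded_prefers_long_shot = (
    expected(bounded, long_shot) > expected(bounded, sure_thing)
)

print(same_ranking, unbounded_prefers_long_shot, bounded_prefers_long_shot)
```

Under these assumptions the unbounded agent takes the long shot and the bounded agent does not. Since the bounded utility never exceeds 1, the long shot is worth at most 1e-6 to the bounded agent however large the prize gets, which is the sense in which it cannot be made to care about arbitrarily small probabilities of arbitrarily good outcomes.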
- Unknowns 28 Nov 2014 0:48 UTC
  0 points
  Parent
  Yes, you’re right about the effect of rescaling an unbounded function.
  
  I don’t see why it’s suspicious that less coherent goal systems are safer. Being less coherent is being closer to having no goals at all, and without goals a thing is not particularly dangerous. For example, take a rock. We could theoretically say that the path a rock takes when it falls is determined by a goal system, but it would not be particularly helpful to describe it as using a utility function, and likewise it is not especially dangerous. It is true that you can get killed if it hits you on the head or something, but it is not going to take over the world.
  
  I described in my top-level post what kind of behavior I would expect of an intelligent goal system that was not programmed using an explicit utility function. You might be able to theoretically describe its behavior with a utility function, but this is not the most helpful description. So for example, if we program a chess-playing AI, as long as it is programmed to choose chess moves in a deterministic fashion, optimizing based solely on the present chess game (e.g. not choosing its moves based on what it has learned about the current player or whatever, but only based on the current position), then no matter how intelligent it becomes it will never try to take over the universe. In fact, it will never try to do anything except play chess moves, since it is physically impossible for it to do anything else, just as a rock will never do anything except fall.
  
  Notice that this also is closer to having no goals, since the chess-playing AI can’t try to affect the universe in any particular way. (That is why I said based on the game alone: if it can base its moves on the person playing or whatever, then in theory it could secretly have various goals, such as driving someone insane through repeated chess losses, even if no one programmed these goals explicitly.) But as long as its moves are generated in a deterministic manner based on the current position alone, it cannot have any huge destructive goal, just like a rock does not.