It’s true that humans do not have utility functions, but I think it can still make sense to try to fit a utility function to a human that approximates what they want as well as possible, since non-VNM preferences aren’t really coherent. You make a good point, though, that it is pretty worrying that the best VNM approximation to human preferences might not fit them all that closely.
a bounded function that behaves in a similar way by approaching a limit (if it didn’t behave similarly it would not treat anything as having infinite value.)
Not sure what you mean by this. Bounded utility functions do not treat anything as having infinite value.
It’s true that humans do not have utility functions
Who do not have full conscious access to their utility function? Yes. Who have an ugly, constantly changing utility function, since we don’t guard our values against temporal variance? Yes. Whose values cannot with perfect fidelity be described by a utility function in a pragmatic sense, say by a group of humans attempting to do so? Yes.
Whose actual utility function cannot be approximately described, with some bounded error term epsilon? No. Whose goals cannot in principle be expressed by a utility function? No.
Please approximately describe a utility function of an addict who is calling his dealer for another dose, knowing full well that he is doing harm to himself, that he will feel worse the next day, and already feeling depressed because of that, yet still acting in a way which is guaranteed to negatively impact his happiness. The best I can do is “there are two different people, System 1 and System 2, with utility functions UF1 and UF2, where UF1 determines actions while UF2 determines happiness”.
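A toy model along those lines, just to make the split concrete (a minimal sketch with made-up numbers, not anything from the thread): behaviour is whatever maximises UF1, while felt happiness is read off UF2.

```python
# A toy sketch of the "two utility functions" reading of the addict
# (illustrative only; the numbers are made up for the example):
# UF1 drives what is actually done, UF2 scores how the agent feels afterwards.

ACTIONS = ["call_dealer", "abstain"]

def uf1(action):
    # System 1: craving dominates, so calling the dealer scores highest
    return {"call_dealer": 10.0, "abstain": 2.0}[action]

def uf2(action):
    # System 2: reflective well-being, which the chosen action damages
    return {"call_dealer": -5.0, "abstain": 4.0}[action]

chosen = max(ACTIONS, key=uf1)   # behaviour follows UF1 -> "call_dealer"
happiness = uf2(chosen)          # felt outcome scored by UF2 -> -5.0
print(chosen, happiness)
```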
The question does come down to definition. I do think most people here are on the same page concerning the subject matter, and only differ on what they’re calling a utility function. I’m of the Church-Turing thesis persuasion (the ‘iff’ goes both ways), and don’t see why the aspect of a human governing its behavior should be any different than the world at large.
Whether that’s useful is a different question. No doubt the human post-breakfast has a different utility function than pre-breakfast. Do we then say that the utility function takes as a second parameter t, or do we insist that post-breakfast there exists a different agent (strictly speaking, since it has different values) who merely shares some continuity with its hungry predecessor, who sadly no longer exists (RIP)? If so, what would be the granularity, what kind of fuzziness would still be allowed in our constantly changing utility function, which ebbs and flows with our cortisol levels and a myriad of other factors?
If a utility function, even if known, was only applicable in one instant, for one agent, would it even make sense to speak of a global function, if the domain consists of but one action?
In the VNM-sense, it may well be that technically humans don’t have a (VNM!) utility function. But meh, unless there’s uncomputable magic in there somewhere, some kind of function mapping all possible stimuli to a human’s behavior should theoretically exist, and I’d call that a utility function.
Definitional stuff, which is just wiggly lines fighting each other: squibbles versus squobbles, dictionary fight to the death, for some not[at]ion of death!
ETA: It depends on what you call a utility function, and how ugly a utility function (including assigning different values to different actions each fraction of a second) you’re ready to accept. Is there “a function” assigning values to outcomes which would describe a human’s behavior over his/her lifetime? Yes, of course there is. (There is one describing the whole universe, so there better be one for a paltry human’s behavior. Even if it assigns different values at different times.) Is there a ‘simple’ function (e.g. time-invariant) which also satisfies the VNM criteria? Probably not.
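To make that contrast concrete, here is a minimal sketch (my own framing, with hypothetical outcomes and values) of the difference between a descriptive, time-indexed assignment of values, which can always be made to fit behaviour, and the single time-invariant function the VNM picture asks for:

```python
# Descriptive reading: the assignment is allowed to change from moment to
# moment, so some "utility" always exists that fits whatever was done.
def descriptive_value(outcome, t, choice_made_at):
    # scores the actually-chosen outcome above everything else at time t
    return 1.0 if outcome == choice_made_at(t) else 0.0

# VNM reading: a single time-free mapping from outcomes to reals that must
# rationalise every choice and every gamble at once -- a much stronger demand.
vnm_utility = {"eat_breakfast": 0.8, "skip_breakfast": 0.2}  # hypothetical values
```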
Sorry, I don’t understand your point, beyond your apparently reversing your position and agreeing that humans don’t have a utility function, not even approximately.
In the VNM-sense, it may well be that technically humans don’t have a (VNM!) utility function. But meh, unless there’s uncomputable magic in there somewhere, some kind of function mapping all possible stimuli to a human’s behavior should theoretically exist, and I’d call that a utility function.
Calling it a utility function does not make it a utility function. A utility function maps decisions to utilities, in an entity which decides among its available choices by evaluating that function for each one and making the decision that maximises the value. Or as Wikipedia puts it, in what seems a perfectly sensible summary definition covering all its more detailed uses, utility is “the (perceived) ability of something to satisfy needs or wants.” That is the definition of utility and utility functions; that is what everyone means by them. It makes no sense to call something completely different by the same name in order to preserve the truth of the sentence “humans have utility functions”. The sentence has remained the same but the proposition it expresses has been changed, and changed into an uninteresting tautology. The original proposition expressed by “humans have utility functions” is still false, or if one is going to argue that it is true, it must be done by showing that humans have utility functions in the generally understood meaning of the term.
some kind of function mapping all possible stimuli to a human’s behavior should theoretically exist
No, it should not; it cannot. Behaviour depends not only on current stimuli but the human’s entire past history, internal and external. Unless you are going to redefine “stimuli” to mean “entire past light-cone” (which of course the word does not mean) this does not work. Furthermore, that entire past history is also causally influenced by the human’s behaviour. Such cyclic patterns of interaction cannot be understood as functions from stimulus to response.
In order to arrive at this subjectively ineluctable (“meh, unless there’s uncomputable magic”) statement, you have redefined the key words to make them mean what no-one ever means by them. It’s the Texas Sharpshooter Utility Function fallacy yet again: look at what the organism does, then label that as having higher “utility” than the things it did not do.
I appreciate your point.

Mostly, I’m concerned that “strictly speaking, humans don’t have VNM-utility functions, so that’s that, full stop” can be interpreted like a stop sign, when in fact humans do have preferences (clearly) and do tend to choose actions to try to satisfice those preferences at least part of the time. To the extent that we’d deny that, we’d deny the existence of any kind of “agent” instantiated in the physical universe. There is predictable behavior for the most part, which can be modelled. And anything that can be computationally modelled can be described by a function. It may not have some of the nice VNM properties, but we take what we can get.
If there’s a more applicable term for the kind of model we need (rather than simply “utility function in a non-VNM sense”), by all means, but then again, “what’s in a name” …
The question is whether AIs can have a fixed UF, specifically whether they can both self-modify and maintain their goals. If they can’t, there is no point in loading them with human values upfront (as they won’t stick to them anyway), and the problem of corrigibility becomes one of getting them to go in the direction we want, not of getting them to budge at all.
Which is not to say that goal-unstable AIs will be safe, but they do present different problems and require different solutions, which could do with being looked at some time.
In the face of instability, you can rescue the idea of the utility function by feeding in an agent’s entire history, but rescuing the UF is not what is important; stability versus instability is. I am still against the use of the phrase “utility function”, because when people read it, they think of a time-independent utility function, which is why, I think, there is so little consideration of unstable AI.
Humans do not behave even close to VNM-rationality, and there’s no clear evidence for some underlying VNM preferences that are being deviated from.

I think there is good reason to think coming up with an actual VNM representation of human preferences would not be a very good approximation. On the other hand, as long as you don’t program an AI in that way—with an explicit utility function—then I think it is unlikely to be dangerous even if it does not have exactly human values. This is why I said the most important thing is to make sure that the AI does not have a utility function. I’m trying to do a discussion post on that now but something’s gone wrong (with the posting).
I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities. I would have to think about that more.
It’s awfully suspicious to say that the one goal architecture that is coherent enough to analyse easily is dangerous but that all others are safe. More concretely, humans are not VNM-rational (as you pointed out), and often pose threats to other agents anyway. Also, an AI does not have to be programmed with an explicit utility function in order to be VNM rational, and thus to behave like it has a utility function.
I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities.
You can rescale an unbounded utility function to a bounded one that will have the same preferences over known outcomes, but this will change its preferences over gambles; in particular, agents with bounded utility functions cannot be made to care about arbitrarily small probabilities of arbitrarily good/bad outcomes.
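For concreteness, a minimal numeric sketch of that effect (my own example, not from the thread; the payoffs and the particular rescaling V(x) = x/(1+|x|) are just illustrative): the rescaling preserves the ordering of sure outcomes but reverses the comparison between a modest sure payoff and a long-shot gamble on a huge one.

```python
# Unbounded utility and a bounded, order-preserving rescaling of it.
def U(x):
    return x                  # unbounded: utility is just the payoff

def V(x):
    return x / (1 + abs(x))   # bounded rescaling, range (-1, 1), same ordering of sure outcomes

def expected(utility, lottery):
    # lottery: list of (probability, outcome) pairs
    return sum(p * utility(x) for p, x in lottery)

sure_thing = [(1.0, 10)]                    # win 10 for certain
long_shot = [(0.5, 1_000_000), (0.5, 0)]    # 50% chance of a huge payoff

print(expected(U, sure_thing), expected(U, long_shot))  # 10.0 vs 500000.0 -> long shot preferred
print(expected(V, sure_thing), expected(V, long_shot))  # ~0.909 vs ~0.5   -> sure thing preferred
```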
Yes, you’re right about the effect of rescaling an unbounded function.
I don’t see why it’s suspicious that less coherent goal systems are safer. Being less coherent is being closer to having no goals at all, and without goals a thing is not particularly dangerous. For example, take a rock. We could theoretically say that the path a rock takes when it falls is determined by a goal system, but it would not be particularly helpful to describe it as using a utility function, and likewise it is not especially dangerous. It is true that you can get killed if it hits you on the head or something, but it is not going to take over the world.
I described in my top-level post what kind of behavior I would expect of an intelligent goal system that was not programmed using an explicit utility function. You might be able to theoretically describe its behavior with a utility function, but this is not the most helpful description. So for example, if we program a chess playing AI, as long as it is programmed to choose chess moves in a deterministic fashion, optimizing based solely on the present chess game (e.g. not choosing its moves based on what it has learned about the current player or whatever, but only based on the current position), then no matter how intelligent it becomes it will never try to take over the universe. In fact, it will never try to do anything except play chess moves, since it is physically impossible for it to do anything else, just as a rock will never do anything except fall.
Notice that this also is closer to having no goals, since the chess playing AI can’t try to affect the universe in any particular way. (That is why I said based on the game alone—if it can base its moves on the person playing or whatever, then in theory it could secretly have various goals such as e.g. driving someone insane on account of losing chess games etc., even if no one programmed these goals explicitly.) But as long as its moves are generated in a deterministic manner based on the current position alone, it cannot have any huge destructive goal, just like a rock does not.
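To sketch that architectural point in code (my own illustration; the function names and the opponent_model object are hypothetical), the difference is whether move choice is a pure, deterministic function of the current position or also conditions on facts about the player:

```python
# Position-only move selection: the same position always yields the same
# move, and nothing outside the board can enter the choice.
def choose_move(position, legal_moves, evaluate):
    # sorted() gives deterministic tie-breaking when evaluations are equal
    return max(sorted(legal_moves), key=lambda m: evaluate(position, m))

# Contrast: a chooser that also consults a model of the human opponent has,
# at least in principle, a channel through which goals about the player
# (rather than the game) could influence its play.
def choose_move_with_opponent_model(position, legal_moves, evaluate, opponent_model):
    return max(sorted(legal_moves),
               key=lambda m: evaluate(position, m) + opponent_model.adjustment(m))
```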