Unknowns comments on Stuart Russell: AI value alignment problem must be an “intrinsic part” of the field’s mainstream agenda

Unknowns 26 Nov 2014 12:33 UTC
0 points
Human beings do not have values that are provably aligned with the values of other human beings. Nor can there ever be a proof like this, since “human being” does not have a mathematical definition any more than “baldness” has a definition that would tell you in every edge case whether someone is “truly bald” or not. Consequently there will never be such a proof for AI, since if various human beings have diverging values, there is no way for the AI to be aligned with both.

In any case, I think the main problem is the assumption that human beings have utility functions at all, since they do not. In particular, as I said elsewhere, human beings do not value anything infinitely. Any AI that does value something infinitely will not have human values, and it will be subject to Pascal’s Muggings. Consequently, the most important point is to make sure that you do not give an AI any utility function at all, since if you do give it one, it will automatically diverge from human values.
- AlexMennen 26 Nov 2014 21:09 UTC
  9 points
  Parent
  
  if various human beings have diverging values, there is no way for the AI to be aligned with both.
  
  Yes, it is trivially true that an AI cannot perfectly optimize for one person’s values while simultaneously perfectly optimizing for a different person’s values. But, by optimizing for some combination of each person’s values, there’s no reason the AI can’t align reasonably well with all of them unless their values are rather dramatically in conflict.
  
  In particular, as I said elsewhere, human beings do not value anything infinitely. Any AI that does value something infinitely will not have human values, and it will be subject to Pascal’s Muggings. Consequently, the most important point is to make sure that you do not give an AI any utility function at all, since if you do give it one, it will automatically diverge from human values.
  
  Are you claiming that all utility functions are unbounded? That is not the case. (In fact, if you only consider continuous utility functions on a complete lottery space, then all utility functions are bounded. http://lesswrong.com/lw/gr6/vnm_agents_and_lotteries_involving_an_infinite/)
  - Unknowns 27 Nov 2014 4:10 UTC
    0 points
    Parent
    No, I wasn’t saying that all utility functions are unbounded. I was making two points in that paragraph:
    
    1) An AI that values something infinitely will not have anything remotely like human values, since human beings do not value anything infinitely. And if you describe this AI’s values with a utility function, it would either be an unbounded function, or a bounded function that behaves in a similar way by approaching a limit (if it didn’t behave similarly it would not treat anything as having infinite value.)
    
    2) If you program an AI with an explicit utility function, in practice it will not have human values, because human beings are not made with an explicit utility function, just as if you program an AI with a GLUT, in practice it will not engage in anything like human conversation.
    - AlexMennen 27 Nov 2014 4:53 UTC
      0 points
      Parent
      It’s true that humans do not have utility functions, but I think it still can make sense to try to fit a utility function to a human that approximates what they want as well as possible, since non-VNM preferences aren’t really coherent. It’s a good point that it is pretty worrying that the best VNM approximation to human preferences might not fit them all that closely though.
      
      a bounded function that behaves in a similar way by approaching a limit (if it didn’t behave similarly it would not treat anything as having infinite value.)
      
      Not sure what you mean by this. Bounded utility functions do not treat anything as having infinite value.
      - Kawoomba 27 Nov 2014 6:39 UTC
        1 point
        Parent
        
        It’s true that humans do not have utility functions
        
        Do not have full conscious access to their utility function? Yes. Have an ugly, constantly changing utility function since we don’t guard our values against temporal variance? Yes. Whose values cannot with perfect fidelity be described by a utility function in a pragmatic sense, say with a group of humans attempting to do so? Yes.
        
        Whose actual utility function cannot be approximately described, with some bounded error term epsilon? No. Whose goals cannot in principle be expressed by a utility function? No.
        Shmi 27 Nov 2014 22:03 UTC
        5 points
        Parent
        Please approximately describe a utility function of an addict who is calling his dealer for another dose, knowing full well that he is doing harm to himself, that he will feel worse the next day, and already feeling depressed because of that, yet still acting in a way which is guaranteed to negatively impact his happiness. The best I can do is “there are two different people, System 1 and System 2, with utility functions UF1 and UF2, where UF1 determines actions while UF2 determines happiness”.
        Kawoomba 27 Nov 2014 22:57 UTC
        1 point
        Parent
        The question does come down to definition. I do think most people here are on the same page concerning the subject matter, and only differ on what they’re calling a utility function. I’m of the Church-Turing thesis persuasion (the ‘iff’ goes both ways), and don’t see why the aspect of a human governing its behavior should be any different than the world at large.
        
        Whether that’s useful is a different question. No doubt the human post-breakfast has a different utility function than pre-breakfast. Do we then say that the utility function takes as a second parameter t, or do we insist that post-breakfast there exists a different agent (strictly speaking, since it has different values) who merely shares some continuity with its hungry predecessor, who sadly no longer exists (RIP)? If so, what would be the granularity, what kind of fuzziness would still be allowed in our constantly changing utility function, which ebbs and flows with our cortison levels and a myriad of other factors?
        
        If a utility function, even if known, was only applicable in one instant, for one agent, would it even make sense to speak of a global function, if the domain consists of but one action?
        
        In the VNM-sense, it may well be that technically humans don’t have a (VNM!)utility function. But meh, unless there’s uncomputable magic in there somewhere some kind of function mapping all possible stimuli to a human’s behavior should theoretically exist, and I’d call that utility function.
        
        Definitional stuff, which is just wiggly lines fighting each other: squibbles versus squobbles, dictionary fight to the death, for some not[at]ion of death!
        
        ETA: It depends on what you call a utility function, and how ugly a utility function (including assigning different values to different actions each fraction of a second) you’re ready to accept. Is there “a function” assigning values to outcomes which would describe a human’s behavior over his/her lifetime? Yes, of course there is. (There is one describing the whole universe, so there better be one for a paltry human’s behavior. Even if it assigns different values at different times.) Is there a ‘simple’ function (e.g. time-invariant) which also satisfices the VNM criteria? Probably not.
        Shmi 27 Nov 2014 23:42 UTC
        2 points
        Parent
        Sorry, I don’t understand your point, beyond your apparently reversing your position and agreeing that humans don’t have a utility function, not even approximately.
        Richard_Kennaway 28 Nov 2014 12:31 UTC
        1 point
        Parent
        
        In the VNM-sense, it may well be that technically humans don’t have a (VNM!)utility function. But meh, unless there’s uncomputable magic in there somewhere some kind of function mapping all possible stimuli to a human’s behavior should theoretically exist, and I’d call that utility function.
        
        Calling it a utility function does not make it a utility function. A utility function maps decisions to utilities, in an entity which decides among its available choices by evaluating that function for each one and making the decision that maximises the value. Or as Wikipedia puts it, in what seems a perfectly sensible summary definition covering all its more detailed uses, utility is “the (perceived) ability of something to satisfy needs or wants.” That is the definition of utility and utility functions; that is what everyone means by them. It makes no sense to call something completely different by the same name in order to preserve the truth of the sentence “humans have utility functions”. The sentence has remained the same but the proposition it expresses has been changed, and changed into an uninteresting tautology. The original proposition expressed by “humans have utility functions” is still false, or if one is going to argue that it is true, it must be done by showing that humans have utility functions in the generally understood meaning of the term.
        
        some kind of function mapping all possible stimuli to a human’s behavior should theoretically exist
        
        No, it should not; it cannot. Behaviour depends not only on current stimuli but the human’s entire past history, internal and external. Unless you are going to redefine “stimuli” to mean “entire past light-cone” (which of course the word does not mean) this does not work. Furthermore, that entire past history is also causally influenced by the human’s behaviour. Such cyclic patterns of interaction cannot be understood as functions from stimulus to response.
        
        In order to arrive at this subjectively ineluctable (“meh, unless there’s uncomputable magic”) statement, you have redefined the key words to make them mean what no-one ever means by them. It’s the Texas Sharpshooter Utility Function fallacy yet again: look at what the organism does, then label that as having higher “utility” than the things it did not do.
        Kawoomba 28 Nov 2014 13:41 UTC
        1 point
        Parent
        I appreciate your point.
        
        Mostly, I’m concerned that “strictly speaking, humans don’t have VNM-utility functions, so that’s that, full stop” can be interpreted like a stop sign, when in fact humans do have preferences (clearly) and do tend to choose actions to try to satisfice those preferences at least part of the time. To the extent that we’d deny that, we’d deny the existence of any kind of “agent” instantiated in the physical universe. There is predictable behavior for the most part, which can be modelled. And anything that can be computationally modelled can be described by a function. It may not have some of the nice VNM properties, but we take what we can get.
        
        If there’s a more applicable term for the kind of model we need (rather than simply “utility function in a non-VNM sense”), by all means, but then again, “what’s in a name” …
        TheAncientGeek 30 Nov 2014 14:36 UTC
        0 points
        Parent
        The question is whether AIs can have a fixed UF …specifically whether they can both self modify and maintain their goals. If they can’t, there is no point in loading then with human values upfront (as they won’t stick to them anyway), and the problem of corrigibility becomes one of getting them to go in the direction we want, not of getting them to budge at all.
        
        Which is not to say that goal unstable AIs will be safe, but they do present different problems and require different solutions. Which could do with being looked at some time.
        
        In the face of iinstability, you can rescue the idea of the utility function by feeding in an agent’s entire history, but rescuing the UF is not what is important. Is stability versus instability. I am still against the use of the phrase utility function, because when people read it, they think time independent utility function, which is why, I think, there is so little consideration of unstable AI.
        AlexMennen 27 Nov 2014 20:48 UTC
        2 points
        Parent
        Humans do not behave even closely to VNM-rationality, and there’s no clear evidence for some underlying VNM preferences that are being deviated from.
      - Unknowns 27 Nov 2014 6:05 UTC
        1 point
        Parent
        I think there is good reason to think coming up with an actual VNM representation of human preferences would not be a very good approximation. On the other hand as long as you don’t program an AI in that way—with an explicit utility function—then I think it is unlikely to be dangerous even if it does not have exactly human values. This is why I said the most important thing is to make sure that the AI does not have a utility function. I’m trying to do a discussion post on that now but something’s gone wrong (with the posting).
        
        I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities. I would have to think about that more.
        AlexMennen 27 Nov 2014 21:00 UTC
        5 points
        Parent
        It’s awfully suspicious to say that the one goal architecture that is coherent enough to analyse easily is dangerous but that all others are safe. More concretely, humans are not VNM-rational (as you pointed out), and often pose threats to other agents anyway. Also, an AI does not have to be programmed with an explicit utility function in order to be VNM rational, and thus to behave like it has a utility function.
        
        I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities.
        
        You can rescale an unbounded utility function to a bounded one that will have the same preferences over known outcomes, but this will change its preferences over gambles; in particular, agents with bounded utility functions cannot be made to care about arbitrarily small probabilities of arbitrarily good/bad outcomes.
        Unknowns 28 Nov 2014 0:48 UTC
        0 points
        Parent
        Yes, you’re right about the effect of rescaling an unbounded function.
        
        I don’t see why it’s suspicious that less coherent goal systems are safer. Being less coherent is being closer to having no goals at all, and without goals a thing is not particularly dangerous. For example, take a rock. We could theoretically say that the path a rock takes when it falls is determined by a goal system, but it would not be particularly helpful to describe it as using a utility function, and likewise it is not especially dangerous. It is true that you can get killed if it hits you on the head or something, but it is not going to take over the world.
        
        I described in my top-level post what kind of behavior I would expect of an intelligent goal system that was not programmed using an explicit utility function. You might be able to theoretically describe its behavior with a utility function, but this is not the most helpful description. So for example, if we program a chess playing AI, as long as it is programmed to choose chess moves in a deterministic fashion, optimizing based solely on the present chess game (e.g. not choosing its moves based on what it has learned about the current player or whatever, but only based on the current position), then no matter how intelligent it becomes it will never try to take over the universe. In fact, it will never try to do anything except play chess moves, since it is physically impossible for it to do anything else, just as a rock will never do anything except fall.
        
        Notice that this also is closer to having no goals, since the chess playing AI can’t try to affect the universe in any particular way. (That is why I said based on the game alone—if it can base its moves on the person playing or whatever, then in theory it could secretly have various goals such as e.g. driving someone insane on account of losing chess games etc., even if no one programmed these goals explicitly.) But as long as its moves are generated in a deterministic manner based on the current position alone, it cannot have any huge destructive goal, just like a rock does not.
- Dr_Manhattan 26 Nov 2014 14:10 UTC
  0 points
  Parent
  
  Human beings do not have values that are provably aligned with the values of other human beings.
  
  Sure, but we “happily” compromise. AI should be able to understand and implement the compromise that is overall best for everyone.
  
  Any AI that does value something infinitely will not have human values
  
  AI can value the “best compromise” infinitely :). But agreed nothing else.
  - Unknowns 26 Nov 2014 14:46 UTC
    1 point
    Parent
    I’m not sure what it would mean exactly to value the best compromise infinitely, since part of that compromise would be the refusal to accept a sufficiently bad Mugging, which implies a utility bound.
  - TheAncientGeek 26 Nov 2014 14:22 UTC
    0 points
    Parent
    But if an AI can compromise on some fuzzy or simplified set if values, what happened to the full complexity and fragility of human value?
    - Dr_Manhattan 26 Nov 2014 21:30 UTC
      1 point
      Parent
      Why does the compromise have to be a function of simplified values? I don’t think I implied that.
- TheAncientGeek 26 Nov 2014 13:39 UTC
  −2 points
  Parent
  Good points, shamefully downoted.
  
  A utility function sounds like the sort of computery thing an AI programme ought be expected to have, but actual is an idealized way of describing a rational agent that can’t be translated into code,
  - solipsist 26 Nov 2014 14:01 UTC
    3 points
    Parent
    If your preferences about possible states of the world follow a few very reasonable constraints, then (somewhat surprisingly) your preferences can be modeled by a utility function. An agent with a reasonably coherent set of preferences can be talked about as if it optimizes a utility function, even if that’s not the way it was programmed. See VNM rationality.
    - Unknowns 26 Nov 2014 14:52 UTC
      3 points
      Parent
      I agree with this, but that doesn’t mean the model has to be useful. For example you could say that I have a utility function that assigns a utility of 1 to all the actions I actually take, and a utility of 0 to all the actions that I don’t. But this would be similar to saying that you could make a giant look-up table which would be a model of my responses in conversation. Nonetheless, if you attempt to program an AI with a GLUT for conversation, it will not do well at all in conversation, and if you attempt to program an AI with the above model of human behavior, it will do very badly.
      
      In other words, theoretically there is such a model, but in practice this is not how a human being is made and it shouldn’t be how an AI is made.
      - solipsist 26 Nov 2014 18:34 UTC
        3 points
        Parent
        Here’s the argument I was hearing:
        
        Humans can be turned into money pumps. Consequently, the most important point is to make sure that your AI can be turned into a money pump, since if you don’t, it will automatically diverge from human values.
        
        If this is what you are arguing, it would take a lot to convince me of that position.
        
        Here’s the argument I think you’re making:
        
        Don’t make AIs try to optimize stuff without bound. If you try to optimize any fixed objective function without bound, you will end up sacrificing all else that you hold dear.
        
        I agree that optimizing without bound seems likely to kill you. If a safe alternative approach is possible, I don’t know what it would be. My guess would be most alternative approaches are equivalent to an optimization problem.
        Unknowns 26 Nov 2014 19:32 UTC
        1 point
        Parent
        Right, the second argument is the one that concerns me, since it should be possible to convince people to adjust their preferences in some way that will make them consistent.
        
        My suggestion here was simply to adopt a hard limit to the utility function. So for example instead of valuing lifespan without limit, there would be some value such that the AI is indifferent to extending it even more. This kind of AI might take the lifespan deal up to a certain point, but it would not keep taking it permanently, and in this way it would avoid driving its probability of survival down to a limit of zero.
        
        I think Eliezer does not like this idea because he claims to value life infinitely, assigning ever greater values to longer lifespans and an infinite value to an infinite lifespan. But he is wrong about his own values, because being a limited being he cannot actually care infinitely about anything, and this is why the lifespan dilemma bothers him. If he actually cared infinitely, as he claims, then he would not mind driving his probability of survival down to zero.
        
        I am not saying (as he has elsewhere described this) that “the utility function is up for grabs.” I am saying that if you understand yourself correctly, you will see that you do not yourself assign an infinite value to anything, so it would be a serious and possibly fatal mistake to make a machine that assigns an infinite value to something.
        solipsist 26 Nov 2014 19:52 UTC
        2 points
        Parent
        Yeah, I follow. I’ll bring up another wrinkle (which you may already be familiar with): Suppose the objective you’re maximizing never equals or exceeds 20. You can reach to 19.994, 19.9999993, 19.9999999999999995, but never actually reach 20. Then even though your objective function is bounded, you will still try to optimize forever, and may resort to increasingly desperate measures to eek out another .000000000000000000000000001.
        Unknowns 26 Nov 2014 19:58 UTC
        −3 points
        Parent
        Yes, this would happen if you take an unbounded function and simply map it to a bounded function without actually changing it. That is why I am suggesting admitting that you really don’t have an infinite capacity for caring, and describing what you care about as though you did care infinitely is mistaken, whether you describe this with an unbounded or with a bounded function. This requires admitting that scope insensitivity, after a certain point, is not a bias, but just an objective fact that at a certain point you really don’t care anymore.