Your behavior is not what the AI is trying to predict. The AI is just trying to predict the world, in general—including e.g. the outcomes of medical or psychological experiments which specifically try to probe the gears underlying your behavior.
But the results of such experiments may still not converge: in one experiment I will claim to have a value of not eating, and in another I will eat.
But if the AI is advanced enough, it could also guess the correct structure of the motivational system, e.g. the number of significant parts in it, and each part would be represented inside its human model.
However, if there are many ways to create human models of similar efficacy, we can't say which model is correct, and so can't identify the "correct" values.
in one experiment I will claim to have a value of not eating, and in another I will eat.
That’s still just looking at behavior. Probing the internals would mean e.g. hooking you up to an fMRI to see what’s happening in the brain when you claim to have a value of not eating, or when you eat.
However, if there are many ways to create human models of similar efficacy, we can't say which model is correct, and so can't identify the "correct" values.
We can say which model is correct by looking at the internal structure of humans, which is exactly why medical research is relevant.
Knowing the internal structure will not help much, in the same way that knowing the individual pixels of a picture is not the same as image recognition, which is a high-level representation and abstraction.
We need something like a high-level representation of trees, as in your example, but for values. But values could be abstracted in different ways, in many more ways than trees. Even trees may be represented as "green mass", or as a set of branches, or in some other slightly non-human way.
But values could be abstracted in different ways, in many more ways than trees. Even trees may be represented as "green mass", or as a set of branches, or in some other slightly non-human way.
This is the part I disagree with. I think there is a single (up to isomorphism) notion of “tree” toward which a very broad variety of computationally-limited predictive systems will converge. That’s what the OP’s discussion of “natural abstractions” and “information relevant far away” is about.
For instance, if a system’s only concept of “tree” is “green mass” then it’s either going to (a) need whole separate models for trees in autumn and winter (which would be computationally expensive), or (b) lose predictive power when reasoning about trees in autumn and winter. Also, if it learns new facts about green-mass-trees, how will it know that those facts generalize to non-green-mass-trees?
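To make that failure mode concrete, here is a toy sketch (the scenes, features, and thresholds below are invented purely for illustration; this is not a real vision model): a "green mass" concept of tree fits summer scenes but breaks out of season, while a structural concept carries over.

```python
# Toy illustration only: invented scenes and features, not a real vision model.
# Each scene is (description, green_fraction, has_branching_trunk, is_tree).
scenes = [
    ("oak in summer",      0.80, True,  True),
    ("oak in autumn",      0.15, True,  True),
    ("bare oak in winter", 0.02, True,  True),
    ("green lawn",         0.90, False, False),
    ("green-painted car",  0.60, False, False),
    ("brick wall",         0.05, False, False),
]

def green_mass_tree(green_fraction, has_branching_trunk):
    """'Tree' means 'lots of green stuff'."""
    return green_fraction > 0.5

def structural_tree(green_fraction, has_branching_trunk):
    """'Tree' means 'branching woody structure', regardless of colour."""
    return has_branching_trunk

for name, model in [("green-mass", green_mass_tree), ("structural", structural_tree)]:
    errors = [desc for desc, g, b, label in scenes if model(g, b) != label]
    print(f"{name} concept misclassifies: {errors or 'nothing'}")
```

The green-mass concept misses the autumn and winter oaks and falsely fires on the lawn and the car, so it either eats those prediction errors or needs extra season-specific models bolted on; the structural concept generalizes without paying either cost.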
Pointing to a Flower has a lot more about this, although it’s already out-of-date compared to my current thoughts on the problem.
And here is my point: trees actually exist, and they are a natural abstraction. "Human values" is a concept created by psychologists in the middle of the 20th century as one way to describe the human mind. Values don't actually exist, but they are useful descriptive instruments for some tasks.
There are other ways to describe the human mind and human motivations: ethical norms, drives, memes, desires, the Freudian model, family systems, etc. An AI may find other abstractions which are even better at compressing behaviour, but they will not be human values.
Humans have wanted things, and recognized other humans as wanting things, since long before 20th century psychologists came along and used the phrase “human values”. I don’t particularly care about aligning an AI to whatever some psychologist defines as “human values”, I care about aligning an AI to the things humans want. Those are the “human values” I care about. The very fact that I can talk about that, and other people generally seem to know what I’m talking about without me needing to give a formal definition, is evidence that it is a natural abstraction.
I would not say there are “other ways to model the human mind”, but rather there are other aspects of the human mind which one can model. (Also there are some models of the human mind which are just outright wrong, e.g. Freudian models.) If a model is to achieve strong general-purpose predictive power, then it needs to handle all of those different aspects, including human values. A model of the human mind may be lower-level than “human values”, e.g. a low-level physics model of the brain, but that will still have human values embedded in it somehow. If a model doesn’t have human values embedded in it somewhere, then it will have poor predictive performance on many problems in which human values are involved.
But human “wants” are not actually a good thing for an AI to follow. If I am fasting, I obviously want to eat, but my decision is not to eat today. And if I have a robot helping me, I prefer that it care about my decisions, not my “wants”. The distinction between desires and decisions has been obvious for the last 2.5 thousand years, whereas “human values” is an obscure and unnatural idea.
You are using the word “want” differently than I was. I’m pretty sure I’m trying to point to exactly the same thing you are pointing to. And the fact that we’re both trying to point to the same thing is exactly the evidence that the thing we’re trying to point to is a natural abstraction.
(The fact that the distinction between desires and decisions has been obvious for the last 2.5 thousand years is also evidence that both of these things are natural abstractions.)
And if I have a robot helping me, I prefer that it care about my decisions, not my “wants”.
This is a bad idea. You should really, really want the robot to care about something besides your decisions, because the decisions are not enough to determine your values.