jessicata comments on A case for AI alignment being difficult

jessicata 2 Jan 2024 20:30 UTC
8 points
2
I’m defining “values” as what approximate expected utility optimizers in the human brain want. Maybe “wants” is a better word. People falsify their preferences and in those cases it seems more normative to go with internal optimizer preferences.

Re indexicality, this is a “the AI knows but does not care” issue: it’s about specifying the goal, not about there being some AI module somewhere that “knows” it. If AGI were generated partially from humans understanding how to encode indexical goals, that would be a different situation.

Re treacherous turns, I agreed that myopic agents don’t have this issue to nearly the extent that long-term real-world optimizing agents do. It depends on how the AGI is selected. If it’s selected by “getting good performance according to a human evaluator in the real world”, then at some capability level AGIs that “want” that will be selected more.
- gallabytes 10 Jan 2024 22:47 UTC
  7 points
  0
  Parent
  Why do you expect it to be hard to specify, given a model that knows the information you’re looking for? In general, the core lesson of unsupervised learning is that often the best way to get pointers to something you have a limited specification for is to learn some other task that necessarily includes it, then specialize to that subtask. Why should values be any different? Broadly, why should values be harder to get good pointers to than much more complicated real-world tasks?
  - jessicata 10 Jan 2024 22:57 UTC
    3 points
    1
    Parent
    How would you design a task that incentivizes a system to output its true estimates of human values? We don’t have ground truth for human values, because they’re mental states, not behaviors.
    
    It seems easier to create incentives for things like “wash dishes without breaking them”, where you can just tell.
    - gallabytes 10 Jan 2024 23:07 UTC
      13 points
      6
      Parent
      I think I can just tell a lot of stuff wrt human values! How do you think children infer them? I think in order for human values to not be viable to point to extensionally (i.e., by looking at a bunch of examples), you have to make the case that they’re much more built-in to the human brain than seems appropriate for a species that can produce both Jains and (Genghis Khan era) Mongols.
      I’d also note that “incentivize” is probably giving a lot of the game away here—my guess is you can just pull them out much more directly by gathering a large dataset of human preferences and predicting judgements.
      - jessicata 10 Jan 2024 23:57 UTC
        4 points
        −2
        Parent
        If you define “human values” as “what humans would say about their values across situations”, then yes, predicting “human values” is a reasonable training objective. Those just aren’t really what we “want” as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.
        
        That’s also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It’s possible that the objectives of these maximizers are affected by socialization, but they’ll be less affected by socialization than verbal statements about values, because they’re harder to fake so less affected by preference falsification.
        
        Children learn some sense of what they’re supposed to say about values, but have some pre-built sense of “what to do / aim for” that’s affected by evopsych and so on. It seems like there’s a huge semantic problem with talking about “values” in a way that’s ambiguous between “in-built evopsych-ish motives” and “things learned from culture about what to endorse”, but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term “values” rather than “preferences”.
        
        In the section on subversion I made the case that terminal values make much more difference in subversive behavior than compliant behavior.
        
        It seems like, to get at the values of approximate utility maximizers located in the brain, you would need something like “Goal Inference as Inverse Planning” rather than just predicting behavior.
        - Noosphere89 12 Sep 2024 17:58 UTC
          2 points
          0
          Parent
          > Children learn some sense of what they’re supposed to say about values, but have some pre-built sense of “what to do / aim for” that’s affected by evopsych and so on. It seems like there’s a huge semantic problem with talking about “values” in a way that’s ambiguous between “in-built evopsych-ish motives” and “things learned from culture about what to endorse”, but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term “values” rather than “preferences”.
          
          I think this is actually a crux here, in that I think Yudkowsky and the broader evopsych world were broadly incorrect about how complicated human values turned out to be, and way overestimated how much evolution was encoding priors and values in human brains. I think there was another related error, in underestimating how much data affects your goals and values, as in this example:
          
          > That’s also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It’s possible that the objectives of these maximizers are affected by socialization, but they’ll be less affected by socialization than verbal statements about values, because they’re harder to fake so less affected by preference falsification.
          
          I think that socialization will deeply affect the objectives of the expected utility maximizers, and I generally think that we shouldn’t view socialization as training people to fake particular values, because I believe that data absolutely matters way more than evopsych and LWers thought, for both humans and AIs.
          
          You mentioned that you take evopsych as true in this post, so I’m not saying this is a bad post. In fact, it’s an excellent distillation that points out the core assumption behind a lot of doom models, so I strongly upvoted. But I’m saying that this assumption is almost certainly falsified for AIs, and probably significantly false for humans too.
          
          More generally, I’m skeptical of the assumption that all humans have similar or even not that different values, and for that reason I dispute the assumption of the psychological unity of humankind.
          
          > Given this assumption, the human utility function(s) either do or don’t significantly depend on human evolutionary history. I’m just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I’m going to go with the evopsych branch for now.