How would you design a task that incentivizes a system to output its true estimates of human values? We don’t have ground truth for human values, because they’re mind states not behaviors.
Seems easier to create incentives for things like “wash dishes without breaking them”, you can just tell.
I think I can just tell a lot of stuff with respect to human values! How do you think children infer them? I think that for human values not to be viable to point to extensionally (i.e. by looking at a bunch of examples), you have to make the case that they're much more built into the human brain than seems plausible for a species that can produce both Jains and (Genghis Khan era) Mongols.
I’d also note that “incentivize” is probably giving a lot of the game away here—my guess is you can just pull them out much more directly by gathering a large dataset of human preferences and predicting judgements.
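To make the "gather preferences and predict judgements" idea concrete, here is a minimal sketch of recovering latent utilities from pairwise human judgements with a Bradley-Terry model. The outcome names and the judgement dataset are invented for illustration; a real dataset would be far larger and the predictor would be a learned model rather than a per-outcome scalar.

```python
import math
import random

# Hypothetical outcomes and pairwise judgements (preferred, dispreferred),
# invented purely for illustration.
outcomes = ["help_stranger", "break_promise", "tell_white_lie", "donate"]
judgements = [
    ("help_stranger", "break_promise"),
    ("donate", "tell_white_lie"),
    ("help_stranger", "tell_white_lie"),
    ("donate", "break_promise"),
    ("tell_white_lie", "break_promise"),
]

def fit_utilities(judgements, steps=2000, lr=0.1):
    """Stochastic gradient ascent on the Bradley-Terry log-likelihood,
    where P(a preferred to b) = sigmoid(u[a] - u[b])."""
    u = {o: 0.0 for o in outcomes}
    for _ in range(steps):
        a, b = random.choice(judgements)
        p = 1.0 / (1.0 + math.exp(-(u[a] - u[b])))
        # Gradient of the log-likelihood of observing "a preferred to b".
        u[a] += lr * (1.0 - p)
        u[b] -= lr * (1.0 - p)
    return u

random.seed(0)
utilities = fit_utilities(judgements)
# Outcomes that win more comparisons end up with higher inferred utility.
ranked = sorted(utilities, key=utilities.get, reverse=True)
print(ranked)
```

The point of the sketch is that the "values" here are never observed directly: they are latent parameters inferred from judgement data, which is roughly the shape of the proposal above.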
If you define “human values” as “what humans would say about their values across situations”, then yes, predicting “human values” is a reasonable training objective. Those just aren’t really what we “want” as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.
That’s also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It’s possible that the objectives of these maximizers are affected by socialization, but they’ll be less affected by socialization than verbal statements about values, because they’re harder to fake so less affected by preference falsification.
Children learn some sense of what they’re supposed to say about values, but have some pre-built sense of “what to do / aim for” that’s affected by evopsych and so on. It seems like there’s a huge semantic problem with talking about “values” in a way that’s ambiguous between “in-built evopsych-ish motives” and “things learned from culture about what to endorse”, but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term “values” rather than “preferences”.
In the section on subversion I made the case that terminal values make much more difference in subversive behavior than compliant behavior.
It seems like to get at the values of approximate utility maximizers located in the brain you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.
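As a concrete contrast with pure behavior prediction, here is a toy sketch in the spirit of goal inference as inverse planning (Baker, Saxe & Tenenbaum): invert a model of approximately rational action to get a posterior over latent goals. The world (a line with two candidate goal positions), the Boltzmann-rationality parameter, and all numbers are assumptions for illustration.

```python
import math

GOALS = {"left": 0, "right": 10}   # candidate goal positions on a line
ACTIONS = [-1, +1]                 # step left or step right
BETA = 2.0                         # rationality: higher = more reliably optimal

def action_likelihood(pos, action, goal):
    """Boltzmann-rational agent model: actions that reduce distance to the
    goal are exponentially more probable."""
    def score(a):
        return -abs((pos + a) - GOALS[goal])
    z = sum(math.exp(BETA * score(a)) for a in ACTIONS)
    return math.exp(BETA * score(action)) / z

def posterior_over_goals(start_pos, observed_actions):
    """Bayes: P(goal | actions) ∝ P(goal) * Π P(action | state, goal)."""
    post = {g: 1.0 / len(GOALS) for g in GOALS}  # uniform prior over goals
    pos = start_pos
    for a in observed_actions:
        for g in post:
            post[g] *= action_likelihood(pos, a, g)
        pos += a
    total = sum(post.values())
    return {g: p / total for g, p in post.items()}

# Watching an agent step right three times from the middle of the line:
post = posterior_over_goals(5, [+1, +1, +1])
print(post)  # probability mass shifts strongly toward the "right" goal
```

The same observed behavior under a dumber model (e.g. just predicting the next action) would tell you nothing about the agent's goal; inverse planning attributes the behavior to a latent objective, which is the kind of machinery the comment above is gesturing at.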
Children learn some sense of what they’re supposed to say about values, but have some pre-built sense of “what to do / aim for” that’s affected by evopsych and so on. It seems like there’s a huge semantic problem with talking about “values” in a way that’s ambiguous between “in-built evopsych-ish motives” and “things learned from culture about what to endorse”, but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term “values” rather than “preferences”.
I think this is actually the crux here: I think Yudkowsky and the broader evopsych world were broadly incorrect about how complicated human values turned out to be, and greatly overestimated how much evolution encodes priors and values in human brains. I also think there was a related error of underestimating how much data affects your goals and values, as in this example:
That’s also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It’s possible that the objectives of these maximizers are affected by socialization, but they’ll be less affected by socialization than verbal statements about values, because they’re harder to fake so less affected by preference falsification.
I think that socialization deeply affects the objectives of the expected utility maximizers, and I generally think we shouldn't view socialization as training people to fake particular values, because I believe that data matters far more than evopsych and LWers thought, for both humans and AIs.
You mentioned you take evopsych as true in this post, so I'm not saying this is a bad post; in fact, it's an excellent distillation that points out the core assumption behind a lot of doom models, and I strongly upvoted it. But I'm saying that this assumption is almost certainly falsified for AIs, and probably significantly false for humans too.
More generally, I'm skeptical of the assumption that all humans have similar, or even not-that-different, values, and for this reason I dispute the assumption of the psychological unity of humankind.
Given this assumption, the human utility function(s) either do or don’t significantly depend on human evolutionary history. I’m just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I’m going to go with the evopsych branch for now.