The Hansonian discussion of shared priors seems relevant. (For those not familiar with it: https://mason.gmu.edu/~rhanson/prior.pdf ) Basically, we should have convergent posteriors in an Aumann sense unless we have not only different priors and different experiences, but also different origins.
But what this means is that *to the extent that human values are coherent and based on correct Bayesian reasoning*, which, granted, is a big assumption, distributional shifts shouldn’t exist. (And now, back to reality.)
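To make that concrete, here’s a minimal toy sketch in Python (my own illustration, not anything from Hanson’s paper): two agents who share a prior and see the same evidence end up with identical posteriors, while a different prior (a different “origin”) is what lets posteriors come apart.

```python
import numpy as np

# Toy example: two agents estimate the bias of a coin.
# Hypotheses are possible biases; a "prior" is a distribution over hypotheses.
biases = np.linspace(0.01, 0.99, 99)

def posterior(prior, flips):
    """Bayesian update of a prior over coin biases given observed flips (True = heads)."""
    likelihood = np.prod([biases if f else (1 - biases) for f in flips], axis=0)
    post = prior * likelihood
    return post / post.sum()

rng = np.random.default_rng(0)
flips = rng.random(200) < 0.7                             # shared evidence from a 0.7-biased coin

shared_prior = np.ones_like(biases) / biases.size         # same "origin": a common prior
skewed_prior = biases**3 / (biases**3).sum()              # a genuinely different prior

p1 = posterior(shared_prior, flips)
p2 = posterior(shared_prior, flips)
p3 = posterior(skewed_prior, flips)

print("same prior, same evidence -> identical posteriors:", np.allclose(p1, p2))
print("different priors -> posterior means differ slightly:",
      round((biases * p1).sum(), 3), "vs", round((biases * p3).sum(), 3))
```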
This also presumes that human values are an empirical fact about reality that you can have beliefs over, which seems at least controversial.
I don’t think you are correct about the implication of “not up for grabs”: it doesn’t mean the utility function isn’t learnable; it means that we don’t update or change it, and that it isn’t constrained by rationality. But even that isn’t quite right: rational behavior certainly requires that we change preferences about intermediate outcomes when we find that our instrumental goals should change in response to new information.
And if the utility function changes as a result of life experiences, it should be in a way that reflects learnable expectations over how experiences change the utility function—so the argument about needing origin disputes still applies.
I’m not claiming (in the parent comment) that values aren’t learnable.
I am claiming that they are not constrained by rationality (or rather, that this is a reasonable position to have, corresponding roughly to moral anti-realism).
I was talking about terminal values, not instrumental values. I certainly agree that if we take terminal values as given, instrumental values are an empirical fact about reality.
Though I think I see my misunderstanding now. I thought you were claiming that humans arrived at their values by a process of Bayesian updating on what their values should be. But actually what you’re claiming is that to the extent that human beliefs (not values!) are based on correct Bayesian reasoning with shared origins, distributional shifts shouldn’t exist. Humans may still disagree on values.
I was confused because your original comment used the assumption that human values were based on correct Bayesian reasoning; am I correct that you meant that to apply to human beliefs?
Sorry, I needed to clarify my thinking and my claim a lot further. This is in addition to the (what I assumed was obvious) claim that correct Bayesian thinkers should be able to converge on beliefs despite potentially having different values. I’m speculating that if terminal values are initially drawn from a known distribution, AND “if you think that a different set of life experiences means that you are a different person with different values,” AND values change based on experiences in understandable ways, then rational humans will act coherently enough that we should expect to be able to learn human values and their distribution, despite the existence of shifts.
Conditional on those speculative thoughts, I disagree with your conclusion that “that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.” Instead, I think we should be able to infer the distribution of values that humans actually have—even if they individually change over time from experiences.
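Here’s a minimal sketch of what I mean, with a deliberately toy model (the scalar “values”, the linear experience effect, and the specific distributions are all assumptions for illustration, not claims about real humans): if the way experiences shift values is itself learnable, the underlying distribution of values can be recovered despite the shifts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model: each person's "terminal value" is a scalar drawn from a
# population distribution, and life experiences shift it in a regular, learnable way.
n_people = 10_000
initial_values = rng.normal(0.0, 1.0, n_people)        # values at "origin": mean 0, std 1
experiences = rng.exponential(1.0, n_people)           # differing life histories
observed = initial_values + 0.5 * experiences          # values after experience-driven shifts

# Learn how experience shifts values (here: a simple linear fit), then use that
# to recover the distribution the initial values were drawn from.
shift_per_exp, _ = np.polyfit(experiences, observed, 1)
recovered = observed - shift_per_exp * experiences

print(f"estimated shift per unit experience: {shift_per_exp:.2f}")              # ~0.5
print(f"recovered mean/std: {recovered.mean():.2f} / {recovered.std():.2f}")    # ~0.0 / ~1.0
```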
But what do you optimize then?
That’s an important question, but it’s also fundamentally hard, since it’s almost certainly true that human values are inconsistent: if not individually, then at an aggregate level. (You can’t reconcile opposite preferences, or maximize each person’s share of a finite resource.)
The best answer I have seen is Eric Drexler’s discussion of Pareto-topia, where he suggests that we can make huge progress, with gains in utility according to all of the value systems held by humans, despite the fact that those systems are inconsistent.
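As a rough illustration of that idea (my own toy framing, not Drexler’s formalism), the test isn’t “which value system is right” but “does every value system score the new outcome at least as well as the status quo”:

```python
from typing import Callable, Dict

Outcome = Dict[str, float]   # hypothetical features of a world-state

def is_pareto_improvement(
    candidate: Outcome,
    status_quo: Outcome,
    value_systems: Dict[str, Callable[[Outcome], float]],
) -> bool:
    """True if no value system is worse off and at least one is strictly better off."""
    new = {name: v(candidate) for name, v in value_systems.items()}
    old = {name: v(status_quo) for name, v in value_systems.items()}
    no_one_worse = all(new[n] >= old[n] for n in value_systems)
    someone_better = any(new[n] > old[n] for n in value_systems)
    return no_one_worse and someone_better

# Two mutually inconsistent value systems: neither can be fully maximized
# without losses according to the other, yet both prefer the richer world.
value_systems = {
    "equality": lambda o: -abs(o["share_a"] - o["share_b"]),
    "abundance": lambda o: o["share_a"] + o["share_b"],
}
status_quo = {"share_a": 1.0, "share_b": 1.0}
richer_world = {"share_a": 3.0, "share_b": 3.0}

print(is_pareto_improvement(richer_world, status_quo, value_systems))  # True
```

The two value systems here can’t both be fully optimized, but both of them prefer the richer world to the status quo, which is the Pareto-topia point.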
That seems right. Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think “that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.”
By “true human utility function” I really do mean a single function that, when perfectly maximized, leads to the optimal outcome.
I think “human values are inconsistent” and “people with different experiences will have different values” and “there are distributional shifts which cause humans to be different than they would otherwise have been” are all different ways of pointing at the same problem.