reward is the evidence from which we learn about our values
A sadist might feel good each time they hurt somebody. I am pretty sure it is possible for a sadist to exist who does not endorse hurting people: they feel good when they hurt people, but they avoid doing it nonetheless.
So to what extent is hurting people a value? It’s as if the sadist’s brain tries to tell them that they ought to want to hurt people, but they don’t want to. Intuitively, the “they don’t want to” seems to be the value.
This seems similar to the ant larvae situation, where they reflectively argue around the hardcoded reward signal. Hurting people might still be considered a value the sadist has, but one that trades off against other values.
Not quite what we were trying to say in the post. Rather than tradeoffs being decided on reflection, we were trying to talk about the causal-inference-style “explaining away” that reflection provides enough compute for. In Johannes’s example, the idea is that the sadist might model the reward as potentially coming from two independent causes: a hardcoded sadist response, and “actually” valuing the pain caused. Since the probability of one cause, given the effect, goes down when we also learn that the other cause definitely obtained, the sadist might lower their probability that they actually value hurting people once (after reflection) they’re quite sure they are hardcoded to get reward for it. That’s how it’s analogous to the ant thing.
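To make the “explaining away” concrete, here is a minimal sketch in Python, assuming a noisy-OR model with two independent causes of the reward. All variable names, priors, and link strengths (p_H, p_V, strength_H, strength_V, leak) are illustrative assumptions, not anything from the post.

```python
# Minimal "explaining away" sketch under a noisy-OR model (illustrative numbers only).
# H = hardcoded sadist response, V = "actually" valuing the harm, R = reward when hurting someone.
from itertools import product

p_H = 0.5   # prior that the reward circuit is hardcoded (assumed)
p_V = 0.5   # prior that the person actually values hurting people (assumed)

strength_H = 0.9  # each cause independently suffices to produce reward with some strength
strength_V = 0.9
leak = 0.01       # small chance of reward with neither cause

def p_R_given(h, v):
    """P(R=1 | H=h, V=v) under a noisy-OR parameterization."""
    p_no_reward = 1 - leak
    if h:
        p_no_reward *= 1 - strength_H
    if v:
        p_no_reward *= 1 - strength_V
    return 1 - p_no_reward

def posterior_V(observed_H=None):
    """P(V=1 | R=1), optionally also conditioning on H, by brute-force enumeration."""
    num = den = 0.0
    for h, v in product([0, 1], repeat=2):
        if observed_H is not None and h != observed_H:
            continue
        joint = (p_H if h else 1 - p_H) * (p_V if v else 1 - p_V) * p_R_given(h, v)
        den += joint
        if v:
            num += joint
    return num / den

print(f"P(V | reward)            = {posterior_V():.3f}")
print(f"P(V | reward, hardcoded) = {posterior_V(observed_H=1):.3f}")
# The second number is lower: learning the reward is hardcoded "explains away"
# the hypothesis that the person actually values hurting people.
```

With these made-up numbers, P(V | reward) ≈ 0.67 drops to P(V | reward, hardcoded) ≈ 0.52: becoming confident in the hardcoded cause lowers the posterior on actually valuing the harm, which is the structure of the reflection described above.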
Yes, exactly. The larva example illustrates that there are different kinds of values. I thought the OP left it underexplored what exactly these different kinds of values are.
In the sadist example we have:
- the hardcoded pleasure of hurting people, and
- (let’s assume) the wish to make other people happy.
These two things both seem like values, but they seem to be qualitatively different kinds of values. I intuit that characterizing this difference more precisely is important. I have a bunch of thoughts on this that I have so far failed to write up.