reward is the evidence from which we learn about our values
A sadist might feel good each time they hurt somebody. I am pretty sure it is possible for a sadist to exist who does not endorse hurting people: they feel good when they hurt people, but they avoid doing it nonetheless.
So to what extent is hurting people a value? It’s as if the sadist’s brain tries to tell them that they ought to want to hurt people, but they don’t want to. Intuitively, the “they don’t want to” seems to be the value.
This seems similar to the ant larvae situation, where they reflectively argue around the hardcoded reward signal. Hurting people might still be considered a value the sadist has, but one that trades off against other values.
Not quite what we were trying to say in the post. Rather than tradeoffs being decided on reflection, we were trying to talk about the causal-inference-style “explaining away” that reflection provides enough compute for. In Johannes’s example, the idea is that the sadist might model the reward as potentially coming from two independent causes: a hardcoded sadist response, and “actually” valuing the pain caused. Since the probability of one cause, given the effect, goes down when we also learn that the other cause definitely obtained, the sadist might lower their probability that they actually value hurting people once (after reflection) they’re quite sure they are hardcoded to get reward for it. That’s how it’s analogous to the ant thing.
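To make the “explaining away” concrete, here is a minimal sketch in Python, assuming a noisy-OR model with two independent causes of the reward. All variable names, priors, and link strengths (p_H, p_V, strength_H, strength_V, leak) are illustrative assumptions, not anything from the post.

```python
# Minimal "explaining away" sketch under a noisy-OR model (illustrative numbers only).
# H = hardcoded sadist response, V = "actually" valuing the harm, R = reward when hurting someone.
from itertools import product

p_H = 0.5   # prior that the reward circuit is hardcoded (assumed)
p_V = 0.5   # prior that the person actually values hurting people (assumed)

strength_H = 0.9  # each cause independently suffices to produce reward with some strength
strength_V = 0.9
leak = 0.01       # small chance of reward with neither cause

def p_R_given(h, v):
    """P(R=1 | H=h, V=v) under a noisy-OR parameterization."""
    p_no_reward = 1 - leak
    if h:
        p_no_reward *= 1 - strength_H
    if v:
        p_no_reward *= 1 - strength_V
    return 1 - p_no_reward

def posterior_V(observed_H=None):
    """P(V=1 | R=1), optionally also conditioning on H, by brute-force enumeration."""
    num = den = 0.0
    for h, v in product([0, 1], repeat=2):
        if observed_H is not None and h != observed_H:
            continue
        joint = (p_H if h else 1 - p_H) * (p_V if v else 1 - p_V) * p_R_given(h, v)
        den += joint
        if v:
            num += joint
    return num / den

print(f"P(V | reward)            = {posterior_V():.3f}")
print(f"P(V | reward, hardcoded) = {posterior_V(observed_H=1):.3f}")
# The second number is lower: learning the reward is hardcoded "explains away"
# the hypothesis that the person actually values hurting people.
```

With these made-up numbers, P(V | reward) ≈ 0.67 drops to P(V | reward, hardcoded) ≈ 0.52: becoming confident in the hardcoded cause lowers the posterior on actually valuing the harm, which is the structure of the reflection described above.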
Yes, exactly. The larva example illustrates that there are different kinds of values. I thought the OP left it underexplored what exactly these different kinds of values are.
In the sadist example we have:
- the hardcoded pleasure of hurting people, and
- (let’s assume) the wish to make other people happy.
These two things both seem like values, but they seem to be qualitatively different kinds of values. I intuit that characterizing this difference more precisely is important. I have a bunch of thoughts on this that I have so far failed to write up.