A starting point is that things in the world (and imagined possibilities and abstract patterns and so on) seem to be good or bad. Eliezer talks about that here with the metaphor of “XML tags”; I talk about it here in general, and here as an influence on how we categorize, reason, and communicate. Anyway, things seem to be good or bad, and sometimes we’re unsure whether something is good or bad and we try to “figure it out”, and so on.
This is generally implicit / externalized, as opposed to self-reflective. What we’re thinking is “capitalism is bad”, not “I assess capitalism as bad”. I talk about this kind of implicit assessment not with the special word “values”, but rather with descriptions like “things that we find motivating versus demotivating” or “…good versus bad”, etc.
But that same setup also allows self-reflective things to be good or bad. That is, X can seem good or bad, and separately the self-reflective idea of “myself pursuing X” can seem good or bad. They’re correlated but can come apart. If X seems good but “myself pursuing X” seems bad, we’ll describe that as an ego-dystonic urge or impulse. Conversely, when “myself pursuing X” seems good, that’s when we start saying “I want X”, and potentially describing X as one of our values (or at least, one of our desires); it’s conceptualized as a property of myself, rather than an aspect of the world.
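A minimal sketch of that 2×2, purely as my own illustration (the function name and the numeric “valence” scores are made up, not anything from either post): X and the self-reflective thought S(X) each carry their own valence, and the everyday labels roughly track the combination.

```python
# Hypothetical sketch of the 2x2 structure above: the valence of X and the
# valence of S(X) ("myself pursuing X") vary separately, and everyday labels
# roughly track the combination.

def describe(valence_of_X: float, valence_of_S_X: float) -> str:
    """Rough label for a (valence(X), valence(S(X))) pair; only the signs matter here."""
    if valence_of_S_X > 0:
        # "Myself pursuing X" seems good: this is when we say "I want X".
        return "I want X (conceptualized as a property of myself)"
    if valence_of_X > 0:
        # X seems good, but "myself pursuing X" seems bad.
        return "ego-dystonic urge or impulse toward X"
    # Neither X nor pursuing it seems good.
    return "X just seems bad (implicit / externalized assessment)"

print(describe(valence_of_X=1.0, valence_of_S_X=-1.0))  # ego-dystonic urge
print(describe(valence_of_X=1.0, valence_of_S_X=1.0))   # "I want X"
```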
So that brings us to:
How Do We Distinguish A Change In Values From A Change In Beliefs About Values?
I kinda agree with what you say but would describe it a bit differently.
We have an intuitive model, and part of that intuitive model is “myself”, and “myself” has things that it wants versus doesn’t want. These wants are conceptualized as root causes; if you try to explain what’s causally upstream of my “wants”, then it feels like you’re threatening my free will and agency, to the exact extent that your explanations are successful. (Much more discussion and explanation here.)
The intuitive model also incorporates the fact that “wants” can change over time. And the intuitive model (as always) can be queried with counterfactual hypotheticals, so I can have opinions about what things I had an intrinsic tendency to want at different times and in different situations, even if I wasn’t in fact thinking of those things at the time, and even if I didn’t know they existed. These hypotheticals are closely tied to the question of whether I would tend to brainstorm and plan towards making X happen, other things equal, if the idea had crossed my mind (see here).
So I claim that your examples are talking about the fact that some changes (e.g. aging) are conceptualized as being caused by my “wants” changing over time, whereas other changes are conceptualized in other ways, e.g. as changes in the external forces upon “myself”, or changes in my knowledge, etc.
How Do We Distinguish Reliable From Unreliable Reward Data?
I disagree more strongly with this part.
For starters, some people will actually say “I like / value / desire this drug, it’s awesome”, rather than “this drug hacks my reward system to make me feel an urge to take it”. These are both possible mental models, and they differ in that “myself-pursuing-the-drug” seems good (positive valence) in the former and bad (negative valence) in the latter. And you can see that difference reflected in the behavior that comes out: the former leads to brainstorming / planning towards doing the drug again (other things equal), the latter does not.
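As a toy illustration of that behavioral difference (again my own hypothetical sketch, not a claim about either post’s formalism): under both mental models the drug generates the same urge, but only when “myself-pursuing-the-drug” carries positive valence does that feed into brainstorming / planning toward taking it again.

```python
# Hypothetical toy model: the same drug-related urge sits under both framings,
# but planning toward "taking the drug again" happens only when the
# self-reflective thought S(X) carries positive valence (other things equal).

def will_plan_toward(x: str, valence_of_S: dict) -> bool:
    """Brainstorm/plan toward X only if 'myself pursuing X' feels good."""
    return valence_of_S.get(x, 0.0) > 0

liker = {"taking the drug": +1.0}      # "I like / value / desire this drug"
disowner = {"taking the drug": -1.0}   # "it's just hacking my reward system"

print(will_plan_toward("taking the drug", liker))     # True: plans to do it again
print(will_plan_toward("taking the drug", disowner))  # False: no planning toward it
```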
I think your description kinda rings of “giving credit to my free will” in a way that feels very intuitive but that I don’t think stands up to scrutiny. This little diagram is kinda related:
(S(X) ≈ “self-reflective thought of myself pursuing X”) For this section, replace “I want X” with “I don’t actually value drugs, they’re just hacking my reward system”. On the left, the person is applying free will to recognize that the drug is not what they truly want. On the right, the person (for social reasons or whatever) finds “myself-pursuing-the-drug” to feel demotivating, and this leads to a conceptualization that any motivations caused by the drug are “intrusions upon myself from the outside” (a.k.a. “unreliable reward data”) rather than “reflective of my true desires”. I feel like your description is more like the left side, and I’m suggesting that this is problematic.
Maybe an example (here) is: if you take an allergy pill and it’s making you more and more tired, you might say “gahh, screw getting my work done, screw being my best self, screw following through on my New Year’s resolution, I’m just tired, fuck it, I’m going to sleep”. You might say that the reward stream is being “messed up” by the allergy pill, but at some point you switched from externalizing those signals to internalizing (“owning”) them as what you want, or at least what you want in the moment.
Hmm, I’m not sure I’m describing this very well. Post 8 of this series will have a bunch more examples and discussion.
That mostly sounds pretty compatible with this post?
For instance, the self-model part: on this post’s model, the human uses their usual epistemic machinery—i.e. world model—in the process of modeling rewards. That world model includes a self-model. So insofar as X and me-pursuing-X generate different rewards, the human would naturally represent those rewards as generated by different components of value, i.e. they’d estimate different values for X vs. me-pursuing-X.
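One concrete way to picture that, purely as my own sketch with made-up names (WorldModel, estimate_value, etc.): value estimation runs through the world model, the world model contains a self-model, and so “X” and “me pursuing X” are simply different queries that can come back with different value estimates, i.e. different components of value.

```python
# Hypothetical sketch of the claim above: reward/value modeling runs through the
# ordinary world model, which contains a self-model, so "X" and "me pursuing X"
# are distinct concepts that can carry distinct estimated values.

from dataclasses import dataclass, field

@dataclass
class WorldModel:
    # Values attached to ordinary world concepts ("X").
    world_values: dict = field(default_factory=dict)
    # The self-model lives inside the world model; it attaches values to
    # self-reflective concepts like "me pursuing X".
    self_model_values: dict = field(default_factory=dict)

    def estimate_value(self, concept: str) -> float:
        if concept.startswith("me pursuing "):
            return self.self_model_values.get(concept, 0.0)
        return self.world_values.get(concept, 0.0)

wm = WorldModel(
    world_values={"the drug": 2.0},                    # X itself is estimated as valuable...
    self_model_values={"me pursuing the drug": -1.0},  # ...while pursuing it is not
)
print(wm.estimate_value("the drug"))              # 2.0
print(wm.estimate_value("me pursuing the drug"))  # -1.0 -> different components of value
```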