Producing a strong emotional attachment to the activity and thinking it’s really great, is itself a significant, negative effect.
Having things in your life that you feel are great, feels like a positive thing to me. (I have too few of them.)
I think I understand the disconnect here, so let me try and describe it.
Suppose I have certain values, and preferences, which I endorse upon reflection; I am satisfied with what I value, in other words. Say that I enjoy physical activity, especially rock climbing and hiking; and I enjoy listening to [what I consider to be] good music; and I like writing poetry; and I enjoy fine dining (in particular, exploring new cuisines); and say that I especially like doing this together with my friends, whom I respect and whose company I enjoy. I endorse these values; I take them to be part of who I am, and to develop the virtues I consider important.
Suppose that I go on a hike with a good friend of mine. I will enjoy this activity, yes? I will think that it’s really great, won’t I? Suppose we schedule the hike and my friend has to cancel—wouldn’t I be disappointed? That sounds like a “strong emotional attachment”… likewise if I were working on some verse which wasn’t coming together, etc. And is this bad? It doesn’t seem bad; after all, these really are my values; these are my true preferences; I endorse them; thus my “strong emotional attachment” to these activities, my judgment of them as being really great, is true.
Now suppose I go and engage in some activity which has nothing to do with my values and preferences, and is, perhaps, even anti-endorsed. Maybe I take some drugs. Maybe I get hypnotized. Whatever it is, I have no reason to endorse it; it forms no part of my identity, nor do I wish it to; it develops no virtues; were I to meet someone else who did this thing, I would not respect them more for it (in fact I’d probably respect them less).
And yet, the activity feels good; it produces a strong emotional attachment; I come away thinking that it’s really great. In this case, that feeling, that attachment, that evaluation, is false.
In short: the idea is that Circling is wireheading.
(Of course, I don’t speak for PDV, so maybe what I say is not descriptive of his reasons; but it does describe, to a large extent, my views on the matter.)
Thank you for the explanation.
I think that I’m missing some of the anti-wireheading genes; not that there wouldn’t exist behaviors that I’d classify as wireheading and recoil from, but they tend to be things like rewriting your brain in a way that causes a permanent loss of agency, or hypnotizing yourself to believe that your child is happy and well when they are in fact starving and would need your help. But for the most part, I operate on a kind of implicit assumption that if something feels great, then that feeling of greatness is something intrinsically valuable itself. My wireheading revulsion only seems to kick in if the thing actually does active damage… and even then, I’m not sure if it’s so much the wireheading aspect that I’m recoiling from, but rather the damage aspect.
Why do you enjoy rock climbing? Do you think that’s independent of your experiences of rock climbing having produced adrenaline rushes?
It is good to have great things in your life. It is not necessarily good to have things you feel are great in your life; those feelings are not necessarily accurate. Many things that feel really good are metaphorical junk food. They are the Symbolic Representation of The Thing. Anything that quickly generates emotional attachment is most likely to be Goodharting, optimizing for feeling great and generating attachment, rather than being great.
This reads to me as a problem with System 1-System 2 alignment / integration. You can Internal Double Crux about your feelings such that they start to align “great feelings” with actual greatness.
Goodharting will always be an issue, but if System 1 & 2 actually talk to each other (and have a trusting, we’re-in-it-together relationship), it’s much easier to at least notice.
If System 1 doesn’t trust System 2, it’s more likely System 1 will try to hide information, self-sabotage, and otherwise do more backstabby things, making it hard to strive for goals.
Okay. I don’t seem to distinguish between “things that feel great” and “things that are great” in the same way as you do. (Obviously, there are things that are great despite not feeling great; e.g. helping someone else can be great even if it makes you feel bad at the time. But something feeling great is by itself a type of greatness to me, even though it shouldn’t be the only type of greatness in one’s life.)
I consider this a factual dispute about minds and Goodhart’s Law, rather than a difference of subjective categorization, so this response is a non sequitur to me.
Your comment used terms like “good” and “great”, which I interpret as subjective valuations, or preferences. I don’t know how to translate a question about subjective valuations into one of factual claims.
I claim that as a general principle, “something feeling great is by itself a type of greatness to me” is a category error. What feels great is a map, and being great is the territory. There is a fact of the matter with regards to what is great for PDV, and what is great for Kaj. They are not identical, and they are not directly queriable, but there is a fact of the matter. Something great is something that increases your utility significantly. (Non-utilitarian ethics: translate that into language your system permits.)
What feels great is a separate fact. It is directly queriable, and correlates with being great, but it is only an approximation, and can therefore be Goodharted. The distinction between the true utility and the approximation is a general property of human minds, with some regularities (superstimuli), but also not identical between people.
So when you say “for me that’s a subcategory”, I conclude that you have a) misunderstood my claim, and b) mistaken the map for the territory.
So what makes up the territory?
Like, if we are talking about a claim like “is it raining outside”, then the territory is made up of whether it actually is raining outside or not. It’s a concrete physical event.
For “is something great”, the nearest physical referent that I could think of is “does a person’s brain make the evaluation that this is great”. Which would make it into a question of subjective valuation, but you seem to have some more objective criteria in mind.
I said that already? “Something great is something that increases your utility significantly.” This is a property of timelines, not of world-states, and so can’t be directly queried, but better approximations can be built up by retrospecting on which times feeling great was accurate and which times it was not.
Unreal, in a subthread above, claims that it is possible to realign System 1 such that feeling great coincides with being great. This seems wrong to me, but is the kind of thing that could be right. Your description does not seem to be the kind of thing that could be right.
Taboo “utility”? To me it’s again just another word for personal preferences.
I’d like to try to explain and see if I’m pointing at the right thing.
I might value being loved. (This thing has utility to me.)
However, I do not actually have neurons that connect to the territory such that my neurons fire If and Only If I am being loved. My neurons are not magic.
So instead they use proxy measures. Like looking at the person’s face and seeing it smiling at me. Or seeing their body language and noticing it is relaxed and open. Or feeling their gentle touch. Etc.
All these proxy measures add up to something that feels good. However, it is NEVER certain that it’s measuring the thing I ultimately want (being loved). I’m just going off a guess. A pretty good guess, sometimes. But still.
This is Goodhart’s dilemma here.
When I have a measure of a good thing (someone smiling at me), I will try to optimize for the measure, which is not necessarily the thing I was originally wanting to track (being loved).
So at some point I may try to optimize for smiles, even when they’re not out of love. And whatever those behaviors are, we call pica.
Right, I agree that there can be things which I value, and for which I can be mistaken about whether or not I have them / they exist / etc.
But PDV didn’t seem to be just saying that “you can be mistaken about whether you actually have the thing that you think you have”. They said that it’s a category error for me to say that something feeling good is by itself something that I value, and that there’s a factual dispute about minds here, rather than a dispute of subjective categorization.
Your example doesn’t feel like it helps me understand those claims. I can have a subjective categorization that being loved is something that I value, and I can be correct or mistaken about whether or not I’m actually loved. And I can indeed end up optimizing for something like smiles, which I think indicates being-lovedness, even when it’s only weakly correlated.
But that doesn’t seem like a reason why something feeling good couldn’t also be something that I value for its own sake.
Wait… are you just trying to say that you can, in theory, value “positive feelings” like joy, delight, etc. in themselves? That seems unobjectionable.
I thought PDV was saying that if you mistake “good feelings” for “good things” in general, that this was a category error. Like, if you always just think, “I feel good when the sun shines on me! It must BE good that the sun is shining on me.” Then THAT is an error.
Wait… are you just trying to say that you can, in theory, value “positive feelings” like joy, delight, etc. in themselves?
Yes. And not just in theory, I would expect that this is what many if not most people do: see e.g. all the advice about how to be happy, or the fact that many people take something like classical utilitarianism seriously as a moral theory.
I thought PDV was saying that if you mistake “good feelings” for “good things” in general, that this was a category error.
Oh. I thought that I already mentioned much earlier that I didn’t mean that, when I said that things can be great despite not feeling great, and that “good feelings” are just one of the possible types of good things you can have in your life, and they shouldn’t be the only ones.
Many if not most people are Goodharting in most aspects of their lives. Why not this one?
I acknowledge your claim that you value feeling good over and above the things that cause you to feel good. I agree that many people implicitly endorse this claim about themselves. I think you and they are very likely mistaken about this preference, and that ceasing to optimize for it would improve your life significantly according to your other preferences.
was hoping you’d validate my “I thought PDV was saying” one way or another, above …
also, it seems like an important milestone if you guys actually sussed out where the actual disagreement is. and it seems like it isn’t what either of you previously thought it was. so i want that to be made clear.
Kaj wasn’t saying ‘a thing that couldn’t be right’. Kaj was describing a totally realistic thing to do. which is to value feeling good itself.
i think conversational milestones in arguments are important places to stop and orient, and i was worried this milestone would be quickly passed over.
and NOW the disagreement is about a preference / why aren’t you worried about Goodharting, whereas before it wasn’t clear. is this actually agreed now by both parties?
(I greatly appreciate your attempt to clarify/improve the quality of the conversation.)
FWIW, I think ‘valuing positive feelings in themselves’ is a bad idea. It’s theoretically possible to do it, but I wouldn’t recommend it as part of one’s final evolutionary form.
Symmetrically, I think ‘equating negative feelings with badness’ or believing ‘feeling bad is bad’ is also not recommended.
People who don’t have things that feel great in their life are likely to be depressed. Do you think that’s a desirable state to be in?