Oh wow, I had never before thought of modern people over-consuming sugar as an application of Goodhart’s Law. But it is. That’s brilliant.
I very much agree with the ideas presented in this post; for people who are interested in finding out more, I highly recommend the book Don’t Shoot the Dog, and maybe also The Power of Habit. That said, those books are written largely from a behaviorist perspective, so they don’t go very deep into the way that mental and abstract concepts become associated with value, as in your doctor example.
A couple of minor suggestions on how to improve your post further: 1) I think that the ∃ in your Claim 2 is meant to be interpreted as “for a given effect and size, there exists a sufficiently small delay such that the desired result is produced”, but I wouldn’t have understood that notation if I hadn’t had math as a minor in my degree, and probably not all readers have. 2) It might be good to quickly explain clicker training in a couple of sentences for people who haven’t heard of it before.
Observing the link between wireheading and Goodhart’s law seems to be an instance of what Paul Graham recommends in his latest essay. He claims that the most valuable insights are both general and surprising, but that those insights are very hard to find. So instead one is often better off searching for surprising takes on established general ideas, as OP seems to have done. :)
Thanks for the reading recommendations and the suggestions! I decided to leave ∃ for somewhat snarky incentivize-people-to-go-learn-a-thing reasons, but I linked to a clicker training video and will add a couple of sentences.
Cool!
Note that correctly interpreting the ∃ thing isn’t just about knowing that ∃ stands for “there exists”; it also takes a bit of additional knowledge to correctly unpack “for a given effect and size, there exists a sufficiently small delay” as “we can arbitrarily pick a certain effect and size that we want our intervention to have, and regardless of what we pick we can make our intervention satisfy those properties by making the delay small enough”.
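Unpacked into explicit quantifiers, epsilon-delta style (my own paraphrase of the claim, with E, s, and d* as placeholder names rather than the OP’s notation):

```latex
\forall \text{ effect } E,\ \forall \text{ size } s,\ \exists\, d^{*} > 0 \ \text{such that}\
\big(\text{delay} < d^{*}\big) \;\Longrightarrow\; \big(\text{conditioning produces } E \text{ at size} \ge s\big)
```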
In fact, in the first version of my comment I wrote something like “I’m interpreting ∃ to stand for ‘there exists’, but directly substituting that in to make the sentence read ‘there exists a sufficiently small delay’ doesn’t create a sensible sentence”, until I thought: oh right, he means that there exists a duration of delay which makes this come true, and ‘makes this come true’ is defined as an inequality, the way it’s defined when you’re doing epsilon-delta proofs! Even if I’d otherwise known what “there exists” means, I don’t know if I’d have managed to correctly interpret the sentence if I hadn’t taken that analysis course and learned how to think about it.
Of course, I might just be particularly dense, and maybe everyone else would have understood it anyway. :-)
Hmmm … that seems sensible, and produced a shift, but not enough to move my overall weighing. Cue metacognitive doubts about whether I’m just status-quo biasing into protecting my original decision. :-)
Note also that non-alphanumeric symbols are hard to google. I kind of guessed it from context but couldn’t confirm until I saw Kaj’s comment.
FWIW, I went through pretty much the same sequence of thoughts, which jarred me out of what was otherwise a pleasant/flowing read. Given the difficulty people unfamiliar with the notation faced in looking it up, maybe you could say “∃ (there exists)”, and/or link to the relevant Wiki page (https://en.wikipedia.org/wiki/Existential_quantification)?
If you’re comfortable rephrasing the sentence a little more for clarity, I’d suggest replacing the part after the quantifier with something like “some length of delay between behavior and consequence which is short enough to produce the effect.”
I also didn’t know what it meant, and it didn’t seem worth my time to look it up; it just made the post harder to read.
@dust_to_must: Suggestion adopted. Thanks!
“I claim that what’s going on is that the monkey’s brain, separate from the monkey/the monkey’s S2/any sapient or strategic awareness that the monkey has, is conditioning the monkey.”

I think this claim is confusing at best and false at worst. The shifting dopamine response is well-recognized in the neuroscience literature, and explained by Sutton and Barto’s temporal-difference (TD) model.
First, it should be emphasized that midbrain dopamine does not signal reward. The monkey can experience a ton of pleasure without any dopamine reaction. Midbrain dopamine signals reward prediction error, the difference between actual and expected reward. It signals a kind of surprise.
Now the TD model is quite Bayesian. Whereas the Rescorla-Wagner model (the previously dominant theory of reinforcement) viewed the prediction error as the difference between actual and expected current reward, the TD model views it as the difference between all actual and expected future rewards (properly discounted).
So when the dopamine signal shifts, the monkey is just conserving expected evidence. Initially, it is positively surprised to receive juice. But eventually, it learns that the screen perfectly predicts the juice, and so it is the appearance of the screen itself that becomes the positive surprise. On a classical model of reinforcement, these events are different, as OP seems to recognize. But on the TD model, these are just instances of the very same kind of conditioning event.
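Here is a minimal tabular sketch of that shift, as I understand the TD account (my own toy illustration, not from the paper; the reward size, discount, and learning rate are arbitrary):

```python
gamma, alpha = 0.9, 0.1
v_screen = 0.0  # learned value of the moment the screen lights up

for _ in range(300):
    # Cue: the screen appears at an unpredictable time, so the prior
    # expectation at that moment is ~0. TD error at the cue is
    # reward (none) + discounted future value - expectation:
    d_cue = 0.0 + gamma * v_screen - 0.0
    # Juice: a reward of 1 arrives, and nothing further is expected after it.
    d_juice = 1.0 + gamma * 0.0 - v_screen
    # (Rescorla-Wagner would stop at 'reward - expectation': no gamma term,
    # so no surprise could ever migrate backward to the cue.)
    v_screen += alpha * d_juice

print(f"d_cue={d_cue:.2f}, d_juice={d_juice:.2f}")
# After training: d_cue ~ 0.9, d_juice ~ 0.0 -- the surprise has moved
# from the juice to the screen, matching the recorded dopamine shift.
```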
For further reference, see the section “Two Dopamine Responses and One Theory” of Glimcher PW (2011) Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.
OP seems to recognize all this, but these observations seem to be complemented with somewhat unfounded interpretations and elaborations.
[Epistemic status: confident OP will be confusing to those without RL background knowledge, but still non-negligible credence that OP is explaining exactly the above but from a different perspective]
Thanks for the info! I think the diff between my explanation and yours largely falls out “true” in your favor, and I’m glad you have additional clarification (correction?) here.
Nitpick: as far as I can tell, you’re describing discounting in general, not hyperbolic discounting. Hyperbolic discounting refers specifically to the surprising result that human discounting isn’t exponential, like we expected it to be based on economics research. http://www.behaviorlab.org/Papers/Hyperbolic.pdf
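For concreteness, the two functional forms side by side (a toy sketch; the discount parameter and dollar amounts are arbitrary):

```python
import math

K = 0.2  # discount parameter, chosen arbitrarily for illustration

def exp_discount(amount, delay):
    # Exponential discounting (the economists' default): a constant rate,
    # so the ranking of two dated rewards never flips as time passes.
    return amount * math.exp(-K * delay)

def hyp_discount(amount, delay):
    # Hyperbolic discounting (what the data show): steep near the present,
    # flat far away -- which is what produces preference reversals.
    return amount / (1 + K * delay)

# $50 at some lead time vs. $100 ten days after that, judged up close and from afar.
for lead in (0, 30):
    hyp = "sooner" if hyp_discount(50, lead) > hyp_discount(100, lead + 10) else "later"
    exp = "sooner" if exp_discount(50, lead) > exp_discount(100, lead + 10) else "later"
    print(f"lead={lead:2d}d  hyperbolic prefers {hyp}, exponential prefers {exp}")
# lead= 0d  hyperbolic prefers sooner, exponential prefers sooner
# lead=30d  hyperbolic prefers later,  exponential prefers sooner  <- the reversal
```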
I agree with the overall claim you’re making, though I think you’re making it out to be a stronger force than I expect it is. In general, I think a good prior is “what your brain is doing by default is pretty damn close to right, and you have to understand it pretty well before you can recognize the actual flaws in its behavior.” I suspect you’ve mistaken where in the chain something is going wrong—I think it’s something related to social permission to make mistakes, social permission to succeed/fail at goals, etc. In other words, you get a punishment from your internalized model of other people when you fail to have lost weight.
I would absolutely expect internalized models to be a part of the thing (to be one of the abstractions or simplifications that your S1 uses to understand all of the data it’s ever experienced). I wouldn’t be surprised to find out that they’re the generator of a lot of the “this is serving my goals” or “this is threatening/dangerous” conclusions that lead to positive and negative pings. I would, however, be surprised to find out that they’re the only thing, or even the dominant one. I think we might disagree on type or hierarchy?
I’m positing that the social stuff you’re pointing out is like one of many “states” in the larger “nation” of brain-models-that-inform-the-brain’s-decision-to-punish-or-reward, whereas if I’m understanding you correctly you’re claiming either that the social modeling is the only model, or that the reward/punishment is always delivered through the social modeling channels (it always “comes from” some person-shaped thing in the head).
Please correct me if I’ve misunderstood. I note that I wouldn’t be surprised if it’s like that for some people, but according to my introspection the social dynamic just doesn’t have that much power for me personally.
So, (I claim that) machine learning models provide a pretty good basis for comparison of the dopamine-moving-earlier thing: eg, this is what you’d expect from a system that does a local reinforce-positive update on the policy net as soon as the value net starts predicting a higher future expected value. See something about actor-critic, eg section 3.2.1 of this pdf. Because we’re starting from the prior that the brain is well enough designed to get pretty damn close to working, seeing that policy rewards move earlier is not evidence that should update us away from models where the brain is doing correct temporal difference learning (section 2.3.3 in that pdf).
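To make the comparison concrete, here is a minimal tabular actor-critic sketch of that dynamic (my own toy code, not from the linked pdf; the environment and all constants are made up):

```python
import numpy as np

n_states, n_actions = 5, 2
alpha_v, alpha_pi, gamma = 0.1, 0.05, 0.9

V = np.zeros(n_states)                   # critic (the "value net", tabular here)
prefs = np.zeros((n_states, n_actions))  # actor (the "policy net", tabular here)

def policy(s):
    # Softmax over the actor's action preferences.
    p = np.exp(prefs[s] - prefs[s].max())
    return p / p.sum()

def step(s, a):
    # Hypothetical environment: drift rightward; action 0 in the final
    # state pays off. Purely illustrative.
    s_next = min(s + 1, n_states - 1)
    reward = 1.0 if (s == n_states - 1 and a == 0) else 0.0
    return s_next, reward

rng = np.random.default_rng(0)
s = 0
for _ in range(2000):
    a = rng.choice(n_actions, p=policy(s))
    s_next, r = step(s, a)
    terminal = (s == n_states - 1)
    # TD error: how much did (reward + predicted future value) exceed
    # what the critic expected from state s?
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]
    V[s] += alpha_v * delta          # critic update
    prefs[s, a] += alpha_pi * delta  # actor is reinforced the moment predicted
                                     # value rises, before any reward lands
    s = 0 if terminal else s_next
```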
The social thing I’m suggesting is that the expected value that the value function predicts on seeing “oh, I gained weight” is a correct representation of future reward, even though it’s a very simple approximation. I don’t mean to say that I think a complicated, multi-step model is being run, just that the usual approximation is approximating a reasoning process that, if done in full using the verbal loop, would look something like:
I have higher weight
I now know that I have higher weight
I now have less justified ability to claim high status
When I next interact with someone, I will have less claim to be valuable in their eyes
I will therefore expect them to express slightly less approval toward me, because I won’t be able to hide that I know I feel I have less justified ability to claim status
I am saying that I don’t think implementation of TD-learning is the problem here.
Got it. That makes sense. I think I still disagree, but if I’ve understood you right I can agree that that hypothesis also clearly deserves to be in the mix.
This seems largely correct to me, although I think hyperbolic discounting of rewards/punishments over time may be less pronounced in human conditioning than in animals being conditioned by humans. Humans can think “I’m now rewarding myself for Action A I took earlier” or “I’m being punished for Action B”, which seems, at least in my experience, to decrease the effect of the temporal distance, whereas animals seem less able to conceptualize the connection over time. Because of this difference, I think the temporal difference of reward/punishment matters less for conditioning in people, as long as the individual is mentally associating the stimulus with the action, although it is still significant.
Also, what’s the name of the paper for the monkeys-and-juice study? I’d like to look at it, because the result did surprise me.
Yeah, it makes a lot of sense to me that explicit cognition can interfere with the underlying, more “automatic” conditioning. Narrative framing, pre-forming intentions, and focusing attention on the link between X and Y all seem to have a strong influence on how conditioning does or doesn’t work, and I don’t know what the mechanisms are.
That being said, I think we agree that, in situations where there’s not a lot of conscious attention on what’s happening, the conditioning proceeds something like “normally,” where “normal” is “comparable to what happens in less sapient animals”?
I couldn’t dig up the original study from my phone but I found this, which references it: https://www.cogneurosociety.org/series1predictionreward/
Apparently “~30min” is a floor, not a ceiling.
For the specific case of weighing yourself, could you create a scale that only gives the positive reward, not the negative one? Like, it only tells you your weight if it’s lower than yesterday, or better yet if the trend in your weight is downward over the past week? Maybe it displays a cheerful message and plays a soothing sound when you weigh yourself, and it emails you later at random if you’ve been losing weight.
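A sketch of the decision rule such a scale might run (hypothetical device logic; the seven-day window and the messages are arbitrary choices):

```python
from statistics import mean

def scale_output(history, today):
    # history: previous daily weights, most recent last; today: this reading.
    # Only ever delivers positive pings; otherwise logs silently.
    week = (history + [today])[-7:]
    trend_down = len(week) >= 2 and week[-1] < mean(week[:-1])
    lower_than_yesterday = bool(history) and today < history[-1]

    if trend_down or lower_than_yesterday:
        return f"{today:.1f} -- nice, trending down!"  # show the number + cheer
    return "Weight logged. See you tomorrow!"          # no number, no punishment

print(scale_output([82.0, 81.6, 81.9, 81.4, 81.5, 81.2], 81.0))
```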
Yeah, those seem like ameliorative measures that are likely to help the brain adopt better goals under the hood.
An important thing to notice about Goodhart’s law is that we roll three different phenomena together. This isn’t entirely bad because the three phenomena are very similar, but sometimes it helps to think about them differently.
Goodhart Level 1: Following the sugar/fruit example, imagine there are a bunch of different fruits with different levels of nutrients and sugar, and the nutrient and sugar levels are highly correlated. It is still the case that if you optimize for the most sugar, you will reliably get fewer nutrients than if you optimize for the most nutrients. (This is just the cost of using a proxy. I barely even want to call this Goodhart’s law.)
Goodhart Level 2: It is possible that sugar content is usually highly correlated with nutrient content, but there is one type of fruit that is pure sugar with no nutrients. If this fruit occurred in nature, but only very rarely, it would not mess with the statistical correlation between sugar and nutrients very much. Thus when an agent optimizes very hard for sugar, they end up with no nutrients, but if they optimized only slightly, they would have found a normal fruit with lots of sugar and nutrients.
Note that this is more nuanced than just saying that we didn’t have pure sugar in the ancestral environment and we do now, so the actions that were good in the ancestral environment are a bad proxy for the actions that are good now. (Maybe just using a bad proxy should be called Goodhart Level 0.) The point is that the reason the environment is different now is that we optimized for sugar. We pushed into the section of possible worlds with lots of sugar using our sugar optimization, and the correlation mostly only existed in the worlds with a moderate amount of sugar.
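A toy simulation of the Level 2 dynamic (my own made-up numbers; the point is that one rare outlier barely dents the correlation, yet dominates under hard optimization):

```python
import random
random.seed(0)

# 999 ordinary fruits: nutrients track sugar closely, so the correlation holds.
fruits = []
for _ in range(999):
    sugar = random.uniform(0, 10)
    fruits.append((sugar, sugar + random.gauss(0, 1)))  # (sugar, nutrients)
fruits.append((100.0, 0.0))  # one rare candy-fruit: pure sugar, zero nutrients

def nutrients_obtained(pressure):
    # "pressure" = how many fruits we examine before taking the sugariest;
    # a crude knob for how hard we optimize the proxy.
    options = random.sample(fruits, pressure)
    sugariest = max(options, key=lambda f: f[0])
    return sugariest[1]  # what we actually wanted: nutrients

for pressure in (5, 1000):
    avg = sum(nutrients_obtained(pressure) for _ in range(200)) / 200
    print(f"pressure {pressure:4d}: average nutrients {avg:.2f}")
# Weak optimization lands on sweet *and* nutritious fruit;
# maximal optimization always finds the candy-fruit and gets ~0 nutrients.
```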
Goodhart Level 3: Say we live in a world with only a fixed amount of nutrients, and someone else wants a larger share of them. If you are using the proxy of sugar, and other people know this, an adversary might invent candy, and then trade you their candy for some of your fruit. Another agent had a goal that was in conflict with your true goal, but not in conflict with your proxy, so they exploited your proxy and created options that satisfied your proxy but not your true goal.
I say more about this (in math) here: https://agentfoundations.org/item?id=1621
Many instances of people Goodharting themselves fall under Level 2 (if I don’t step on the scale, I optimize out of the worlds where the scale number is correlated with my weight). However, I claim that some instances might be at Level 3. In particular, rationalization. Maybe part of me wants to save the world and uses the ability to produce justifications for why an action saves the world as a proxy for what saves the world. Then, a different part of me wants to goof off and produces a justification for why goofing off is actually the most world-saving action.
Awesome. Thanks for adding. I particularly like the inclusion of adversarial behavior into the mix—I hadn’t thought of the goal structure of the candymakers as undercutting/exploiting/taking advantage of the goal structure of the humans.
“So my brain is sitting there with mirror-twin goals of maximize exposure to low scale numbers and minimize exposure to high scale numbers …”
I think my answer would be: don’t give it those goals. Give it rule-following goals and model-updating goals. Create rules from the models. If following the rules doesn’t actually get you slimmer, update the model and update the rules.
Uh
I think you missed the central claim? I’m not saying those rules are good, nor that they are consciously installed and reflectively endorsed. I’m saying that your subconscious has goals like that that you, whpearson’s conscious verbal loop, aren’t fully aware of and don’t notice and are manipulated (or at least influenced) by. This isn’t a thing that’s fixed by simply deciding on different rules—S1 doesn’t communicate verbally, except indirectly by responding to stories and narratives.
Also, the idea that the solution to Goodhart’s Law is “create rules” makes me feel like I failed to communicate Claim 1.
I was not saying that they were consciously endorsed, just that they were a product of taking a particular mindset which was consciously endorsed, e.g. “my goal is to lose weight”.
What I am suggesting, which is not a panacea but might not suck too much, is taking a different mindset: “My goal is to understand the relationship between my activities and weight”. So, the scientific point of view. Once you have gotten a good understanding, you can actually try and optimise your weight. More details on what I am trying to get across can be found in this post, which introduces it, and this post, which gives a hypothetical example of its application to a field and has a more formal description of what I mean in one of the comments.
Yeah, but I guess I’m still not communicating the part where a very large and important section of your brain doesn’t just adopt the goals you consciously give it. The stuff you said above is true, but irrelevant in this context/overwhelmed or undermined by the effect the post is pointing at, which makes me continue to feel like you’re not receiving the point I’m trying to convey.
We may be agreeing, but not being clear!
I one hundred percent agree that most of the brain is below conscious control and doesn’t adopt your goals. What I think Goodhart’s law should be guiding us towards is how we set the bits of the brain that are.
For example, losing weight by measuring weight and using that as a metric is *literally* setting the metric and the measure to be the same. I was trying to point out a way that a measure could be used in pursuit of a goal without being treated as a metric.
I didn’t get the impression from your post that you thought there was any sort of conscious thing you could do beforehand to try and head off the worst bits of Goodhart’s; it was only post-hoc noticing that things are going wrong. Like noticing: “hey, I am about to start on a poorly understood problem that my subconscious brain will optimise wrong if I just set solving it as my goal. Maybe instead I should try to understand it first, using the measure I was about to use as a metric.”