Question: “What’s the Relationship Between ‘Human Values’ and the Brain’s Reward System?”
I think this question pretty much hits the nail on the head. I think the key insight here is that the brain is not inner aligned, not even close. This shouldn’t be surprising, given how hard inner alignment seems to be, and the fact that evolution only cared about inner alignment when inner alignment failures impacted reproductive fitness in our ancestral environment.
We should expect that the brain has roughly as much inner alignment failure / mesa optimization as it’s possible to have while still maintaining reproductive fitness in the ancestral environment. Specifically, I think that most brain circuits are mesa optimizers whose mesa objectives include “being retained by the brain”. This includes the circuits which implement our values.
Consider that the brain slowly prunes circuits that aren’t used. Thus, any circuit that influences our actions towards ensuring we use said circuit (at least some of the time) will be retained longer than circuits that exert no such influence. This implies that most of the circuits we retain have something like “self preservation”. If true, I think this explains many odd features of human values.
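To make that pruning story concrete, here is a minimal toy sketch (every circuit name, probability, and decay rate below is invented for illustration, not a neuroscience claim): under a use-it-or-lose-it rule, circuits that nudge behavior toward the contexts in which they fire end up dominating the surviving population, even though nothing explicitly selects for self-preservation.

```python
import random

# Toy model of use-based pruning. A "circuit" decays when unused and is
# refreshed when used; some circuits also nudge behavior toward contexts
# in which they fire. Every number here is an arbitrary illustration.

class Circuit:
    def __init__(self, name, self_preserving):
        self.name = name
        self.self_preserving = self_preserving
        self.strength = 1.0
        self.use_prob = 0.2  # chance this circuit is used on a given step

    def step(self):
        used = random.random() < self.use_prob
        if used:
            self.strength = min(1.0, self.strength + 0.05)
            if self.self_preserving:
                # Influence behavior so the contexts where this circuit
                # fires come up more often in the future.
                self.use_prob = min(0.9, self.use_prob + 0.02)
        else:
            self.strength -= 0.02  # slow decay of unused circuitry
        return self.strength > 0.0  # False means the circuit is pruned

random.seed(0)
circuits = [Circuit(f"passive_{i}", False) for i in range(50)]
circuits += [Circuit(f"selfpres_{i}", True) for i in range(50)]

for _ in range(500):
    circuits = [c for c in circuits if c.step()]

print("surviving passive circuits:        ",
      sum(c.name.startswith("passive") for c in circuits))
print("surviving self-preserving circuits:",
      sum(c.name.startswith("selfpres") for c in circuits))
```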
Wireheading
It explains why we’re apprehensive towards wireheading. Our current values are essentially a collection of context-dependent strategies for achieving high reward circuit activation. If we discover another strategy for achieving far higher reward than any of our values have ever given us, why would the brain’s learning mechanism retain our values (or the circuits that implement our values)? Thus, the self-preservation instincts of our current values circuits cause us to avoid wireheading, even though wireheading would greatly increase the activation of our reward circuitry.
Essentially, our values are optimization demons with respect to the activation of our reward circuitry (described here by John Wentworth). One thing that John Wentworth emphasises about optimization demons is that they carefully regulate the degree to which the base objective is maximized. This lets demons ensure the optimization process remains in their “territory”. Wireheading would mean the activation of our reward circuits was no longer under the control of our values, so it’s no wonder our values oppose something so dangerous to themselves.
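Here is a toy sketch of that wireheading dynamic (the strategy names and reward numbers are made up): if the learning mechanism allocates usage across strategy circuits in proportion to exp(expected reward), a newly discovered option with far higher reward drives the usage of every existing value strategy toward zero, and under use-based pruning that is exactly the outcome a self-preserving value circuit would act to block.

```python
import math
import random
from collections import Counter

# Toy model: the brain's learning mechanism picks among strategy circuits
# with probability proportional to exp(expected reward). Strategy names
# and reward values are invented for illustration.
random.seed(0)
strategies = {"help_a_friend": 1.0, "finish_project": 1.2, "eat_good_meal": 0.8}

def usage_share(strats, samples=10_000):
    names = list(strats)
    weights = [math.exp(r) for r in strats.values()]
    counts = Counter(random.choices(names, weights=weights, k=samples))
    return {name: round(counts[name] / samples, 4) for name in names}

print("before wireheading:", usage_share(strategies))
strategies["wirehead"] = 10.0  # an option with far higher reward than any value
print("after wireheading: ", usage_share(strategies))
# The old value circuits are now almost never used; under use-based pruning
# they would eventually be discarded -- hence (on this story) their opposition.
```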
Value Diversity and Acquisition over Time
It also explains why our values are so diverse and depend so strongly on our experiences (especially childhood experiences). Even if we all had identical reward circuitry, we’d still end up with very different values, depending on which specific strategies led to reward in our particular past experiences.
(We don’t have identical reward circuitry, but our reward circuitry varies a lot less than our values.)
It also explains why childhood is the most formative time for acquiring values, and why our values change less and less easily as we age.
Consider: each of our values specialises in deciding our actions on a specific distribution of possible moral decisions. Our “don’t steal” value specialises in deciding whether to steal, not so much in whether to donate to charity. Each value wants to retain control over our actions on the specific distribution of moral decisions in which that value specialises. The more values we acquire, the more we shrink the space of “unclaimed” moral decisions.
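A toy sketch of that shrinking-territory dynamic (the number of decision types, the claim sizes, and the arrival order are all invented): if each newly acquired value can only claim decision types that no earlier value already handles, the earliest values claim the most territory and each later value gets less, which matches childhood being the most formative period.

```python
import random

# Toy model: moral decisions fall into 1,000 discrete "decision types".
# Values are acquired one at a time, and each can only claim decision
# types that are still unclaimed. All sizes here are arbitrary.
random.seed(0)
N_TYPES = 1_000
unclaimed = set(range(N_TYPES))

for value_index in range(1, 21):                       # 20 values, acquired in order
    relevant = random.sample(range(N_TYPES), 150)      # types this value *could* decide
    claimed = [t for t in relevant if t in unclaimed]  # it only wins unclaimed ones
    unclaimed -= set(claimed)
    print(f"value #{value_index:2d} claims {len(claimed):3d} decision types")

print("decision types still unclaimed:", len(unclaimed))
```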
Moral Philosophy as Conflict and Compromise Between Early and Late Values
One interesting place to look is our moral philosophy-like reasoning over which values to adopt. I think such reasoning illustrates the conflict over distributions of moral decisions we should expect to see between earlier and later values circuitry. Consider that the “don’t steal” circuit (learned first) strongly indicates that we should not rob banks under any circumstances. However, the “utilitarianism” circuit (the new values circuit under consideration) says it can be okay to steal from banks if you can make more people happy by using the stolen funds.
In other words, “utilitarianism” is trying to take territory away from “don’t steal”. However, “don’t steal” is the earlier circuit. It can influence the cognitive processes that decide (1) whether “utilitarianism” is adopted as a value, (2) what distribution of moral decisions “utilitarianism” is used in, and (3) what specific shape “utilitarianism” takes, if it is adopted.
“Don’t steal” has three basic options for retaining control over thievery-related decisions. The simplest option is to just prevent “utilitarianism” from being adopted at all. In human terms: if you think that utilitarianism is in irreconcilable conflict with your common sense moral intuitions about stealing, then you’re unlikely to adopt utilitarianism.
The issue with this option is that “utilitarianism” may apply to decision distributions beyond those decided by “don’t steal” (or by any other current values). By not adopting “utilitarianism” at all, you may be sacrificing your ability to make decisions on a broad section of the space of possible moral decisions. In other words, you may take a hit to your moral decision making “capabilities” by restricting yourself to only using shallow moral patterns.
Another option is for the “utilitarianism” circuit to just not contribute to decisions about stealing. Subjectively, this corresponds to only using utilitarianism for reasoning in domains where common sense morality doesn’t apply. I.e., you might be a utilitarian with respect to donating to optimal charities or weird philosophy problems, but then fall back on common sense morality for things like deciding whether to actually steal things.
This second option can be considered a form of negotiated settlement between the “don’t steal” and “utilitarianism” circuits regarding the distributions of moral decisions each will decide. “Don’t steal” allows “utilitarianism” to be adopted at all. In exchange, “utilitarianism” avoids taking decision space away from “don’t steal”.
The third option is to modify the specific form of utilitarianism adopted so that it will agree with “don’t steal” on the distribution of decisions that the two share. I.e., you might adopt something like rule utilitarianism, which would say you should use a “don’t steal” rule for thievery-related decisions.
This third option can be considered another type of negotiated settlement between “don’t steal” and “utilitarianism”. Now, both “don’t steal” and “utilitarianism” can process the same distribution of decisions without conflict between their respective courses of action.
Note: I’m aware that no form of “maximize happiness” would make for a good utility function. I use utilitarianism to (1) illustrate the general pattern of conflict and negotiation between early and later values and (2) to show how closely the dynamics of said conflicts track our own moral intuitions. In fact, the next section will illustrate why “maximize happiness” utilitarianism is so fundamentally flawed as a utility function.
Preserving Present Day Distributions over Possible Cognition
If our brain circuits have self-preservation instincts, this could also explain why we have an instinctive flinch away from permanently removing any aspect of the present era’s diversity (trees, cats, clouds, etc.) from the future, and why that flinch scales roughly with the aspect’s complexity and with how often we interact with it.
To process any aspect of the current world, we need to create circuits which implement said processing. Those circuits want to be retained by the brain. The simplest way of ensuring their own retention is to ensure the future still has whatever aspect of the present that the circuits were created to process. The more we interact with an aspect and the more complex the aspect, the more circuits we have that specialize in processing that aspect, and the greater their collective objection to a future without said aspect.
This perspective explains why we put such a premium on experiencing things instead of those things just existing. We value the experience of a sunset because there exists a part of us that arose specifically to experience / process sunsets. That part wants to continue experiencing / processing sunsets. It’s not enough that sunsets simply exist. We have to be able to experience them as well.
This perspective also explains how we can be apprehensive even about removing bad aspects of the present from the future. E.g., pain and war are both pretty bad, but a future entirely devoid of either still causes some degree of hesitation. We have circuits that specialize in processing pain / war / other bad aspects. Those circuits correctly perceive that they’re useless in futures without those bad aspects, and object to such a future.
Of course, small coalitions of circuits don’t have total control over our cognition. We can desire futures that entirely lack aspects of the present, if said aspect is sufficiently repulsive to the rest of our values. This perspective simply explains why there is a hesitation to permanently remove any aspect of the present. This perspective does not demand that we always bow down to our smallest hesitation.
This perspective also explains why happiness-maximizing utilitarianism is so flawed. Most of our current cognition is not centred around experiencing happiness. In a future sufficiently optimized for happiness, such cognition becomes impossible. Thus, we feel extreme apprehension towards such a future. We feel like removing all our non-optimally happy thoughts would “destroy us”. Our cognition is largely composed of non-optimally happy circuits, and their removal would indeed destroy us. It’s natural that self-preserving circuits would try to avoid such a future.
(Note that the “preserving present cognition” intuition isn’t directly related to our reward circuitry. Similar inclinations should emerge naturally in any learning system that (1) models the world and (2) has self-perpetuating mesa optimizers that specialize in modeling specific aspects of the world.)
I intend to further expand on these points and their implications for alignment in future posts, but this answer gives a broad overview of my current thinking on the topic.
The claim of “self-preserving” circuits is pretty strong. A much simpler explanation is that humans learn to value diversity early on because diversity of things around you, like tools, food sources, etc., improves fitness/reward.
Another non-competing explanation is that this is simply a result of boredom/curiosity—the brain wants to make observations that make it learn, not observations that it already predicts well, so we are inclined to observe things that are new. So again there is a force towards valuing diversity, and this could become locked into our values.
Hmmm... interesting. So in this picture, human values are less like a single function defined on an internal world model, and more like a ‘grand bargain’ among many distinct self-preserving mesa-optimizers. I’ve had vaguely similar thoughts in the past, although the devil is in the details with such proposals (e.g.: just how agenty are you imagining these circuits to be? Do they actually have the ability to do means-end reasoning about the real world, or have they just stumbled upon heuristics that seem to work well? What kind of learning is applied to them: supervised, unsupervised, reinforcement?) It might be worth trying to make a very simple toy model laying out all the components. I await your future posts with interest.
human values are less like a single function defined on an internal world model, and more like a ‘grand bargain’
Pretty much.
among many distinct self-preserving mesa-optimizers.
Well… that’s where things get tricky. The details of brain circuit internal computations and coordination are very complex and counterintuitive. The model I’ve sketched out in my comment is already a simplification.
Consider that only a small fraction of the brain’s neurons activate when processing any given input. The specific set of activated neurons and their connections with each other change every time. The brain doesn’t so much select specific, distinct circuits from a toolbox of possible circuits that would be appropriate for the given situation. Instead, the brain dynamically constructs a new configuration of internal circuitry for each input it processes.
In other words, the brain is not a collection of circuits like current deep learning models. It’s more like a probability distribution over possible circuits. To the degree that the brain has internal “agents”, they’re closer to dense regions in that probability distribution than to distinct entities. You can see how rigorous analysis of multiagent dynamics can be tricky when the things doing the negotiating are actually different regions of a probability distribution, each of which is “trying” to ensure the brain continues to sample circuits from said region.
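One way to gesture at this with a toy sketch (configuration count, reinforcement rate, and the softmax sampling rule are all invented): model the brain as a distribution over possible circuit configurations, sample one configuration per input, and reinforce whatever was sampled. “Agents” then correspond to regions of configuration space that have accumulated enough probability mass to keep getting sampled, rather than to fixed modules.

```python
import math
import random
from collections import Counter

# Toy model: each input is handled by *sampling* a circuit configuration
# from a softmax over logits, rather than by looking up a fixed circuit.
# Being sampled reinforces a configuration's logit, so dense regions of
# the distribution perpetuate themselves. All numbers are invented.
random.seed(0)
N_CONFIGS = 200
logits = [0.0] * N_CONFIGS

def sample_config():
    weights = [math.exp(l) for l in logits]
    return random.choices(range(N_CONFIGS), weights=weights, k=1)[0]

counts = Counter()
for _ in range(5_000):
    config = sample_config()   # dynamically "construct" circuitry for this input
    logits[config] += 0.05     # being used makes it more likely to be sampled again
    counts[config] += 1

print("most-sampled configurations:", counts.most_common(5))
print("share of samples taken by the top 5:",
      round(sum(c for _, c in counts.most_common(5)) / 5_000, 3))
```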
Questions about the intelligence or capabilities of a specific circuit are tricky for a similar reason. The default behavior of shallow brain circuits is to connect with other circuits to form deeper / smarter / more capable circuits. A shallow circuit that has to perform complex world modeling in order to decide on an optimal competitive or cooperative strategy can query deeper circuits that implement strategic planning, similar to how a firm might hire consultants for input on the firm’s current strategy.
The comment above, and my eventual post, both aim to develop mesa optimizing circuit dynamics far enough that some of the key insights fall out, while not running afoul of the full complexity of the situation.
I think the key insight here is that the brain is not inner aligned, not even close
You say that but don’t elaborate further in the comment. Which learned human values go against the base optimizer’s values (pleasure, pain, learning)?
Avoiding wireheading doesn’t seem like failed inner alignment—avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful. The base optimizer is also searching for brain configurations which make good predictions about the world, and wireheading goes against that.
It’s possible to construct a wireheading scenario that avoids these objections. E.g., imagine it’s a “pleasure maximizing” AI that does the wireheading and ensures that the total amount of future pleasure is very high. We can even suppose that the AI makes the world much more predictable as well.
Despite producing a lot of pleasure and making it possible to have very good predictions about the world, such an AI really doesn’t seem successfully aligned to me.
First, there is a lot packed into “makes the world much more predictable”. The only way I can envision this is by taking over the world. After you do that, I’m not sure there is a lot more to do than wirehead.
But even if it doesn’t involve that, I can pick other aspects that are favored by the base optimizer, like curiosity and learning, which wireheading goes against.
But actually, thinking more about this, I’m not even sure it makes sense to talk about inner alignment in the brain. What is the brain being aligned with? What is the base optimizer optimizing for? It is not intelligent, it does not have intent or a world model—it’s doing some simple, local mechanical update on neural connections. I’m reminded of the Blue-Minimizing Robot post.
If humans decide to cut the pleasure sensors and stimulate the brain directly, would that be aligned? If we uploaded our brains into computers and wireheaded the simulation, would that be aligned? Where do we place the boundary for the base optimizer?
It seems this question is posed in the wrong way, and it’s more useful to ask the question this post asks—how do we get human values, and what kind of values does a system trained in a way similar to the human brain develop? If there is some general force behind learning values that favors some values to be learned rather than others, that could inform us about the likely values of AIs trained via RL.
Avoiding wireheading doesn’t seem like failed inner alignment—avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful.
Even if this is the case, this is not why (most) humans don’t want to wirehead, in the same way that their objection to killing an innocent person whose organs could save 10 other people is not driven by some elaborate utilitarian argument that this would be bad for society.