I think this line of work is very interesting and important. I and a few others are working on something we’ve dubbed shard theory, which attempts to describe the process of human value formation. The theory posits that the internals of monolithic learning systems actually resemble something like an ecosystem already. However, rather than there being some finite list of discrete subcomponents / modules, it’s more like there’s a continuous distribution over possible internal subcomponents and features.
Continuous agency
To take agency as an example, suppose you have a 3-layer transformer being trained via RL using just the basic REINFORCE algorithm. We typically think of such a setup as having one agent with three layers:
However, we can just as easily draw our Cartesian boundaries differently and call it three agents that pass messages between them:
It makes no difference to the actual learning process. In fact, we can draw Cartesian boundaries around any selection of non-overlapping subsets that cover all the model’s parameters, call each subset an “agent”, and the training process is identical. The reason this is interesting is what happens when this fact interacts with regional specialization in the model. E.g., let’s take an extreme case where the reward function only rewards three things:
Manipulating humans
Solving math problems
Helping humans
And let’s suppose the model has an extreme degree of regional specialization, such that each layer can only navigate a single one of those tasks (each layer is fully specialized to only one of the above tasks). Additionally, let’s suppose that the credit assignment process is “perfect”, in the sense that, for task i, the outer learning process only updates the parameters of the layer that specializes in task i:
This matters because there’s only one way for the training process to instill value representations into any of the layers: by updating the parameters of those layers. Thus, if Layer 3 isn’t updated on rewards from the “Solving math problems” or “Manipulating humans” tasks, Layer 3’s value representations won’t care about doing either of those things.
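As a minimal sketch of this toy setup (the layer sizes, task names, and the “perfect credit assignment” routing below are all hypothetical, just to make the point concrete), here is roughly what it looks like to draw the Cartesian boundaries around individual layers and let each task’s REINFORCE update touch only its specialized layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the 3-layer model: each layer sits inside its own
# Cartesian boundary and is treated as a separate "agent".
layers = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2)])

# Hypothetical perfect credit assignment: each task only ever updates one layer.
task_to_layer = {"manipulate": 0, "math": 1, "help": 2}

def reinforce_step(task, obs, reward, lr=1e-2):
    """One REINFORCE update in which only the task's specialized layer receives credit."""
    h = obs
    for layer in layers:
        h = torch.tanh(layer(h))
    dist = torch.distributions.Categorical(logits=h)  # 2 actions from the last layer
    action = dist.sample()
    loss = -dist.log_prob(action) * reward            # plain REINFORCE, no baseline

    specialized = layers[task_to_layer[task]]
    grads = torch.autograd.grad(loss, list(specialized.parameters()))
    with torch.no_grad():                             # the update never leaves this boundary
        for p, g in zip(specialized.parameters(), grads):
            p -= lr * g

reinforce_step("math", torch.randn(8), reward=1.0)
```

In this toy version, the “help” layer’s parameters never see gradient from the math or manipulation rollouts, so whatever value representations form there can only reflect the reward it was actually credited for.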
If we view each layer as its own agent (which I think we can), then the overall system’s behavior is a multi-agent consensus between components whose values differ significantly.
Of course, real world specialization is nowhere near this strict. The interaction between complex reward functions and complex environments means there’s more like a continuous distribution over possible rewarding behaviors. Additionally, real world credit assignment is very noisy. Thus, the actual distribution of agentic specializations looks a bit more like this:
Thus, I interpret RL systems as having something like a continuous distribution over possible internal agents, each of which implements different values. Regions in this distribution are the shards of shard theory. I.e., shards refer to dense regions in your distribution over agentic, values-implementing computations.
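To make the “dense regions” picture a bit more concrete, here’s a purely illustrative toy model (the axis, mixture weights, and labels are all made up): a 1-D density over value-implementing computations, in which the high-density bumps play the role of shards:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# A 1-D stand-in for the space of agentic, value-implementing computations.
xs = np.linspace(-6, 6, 1201)

# Three dense regions ("shards"), each implementing a different value, on top of
# a broad, diffuse background of weakly agentic computations.
density = (0.40 * gaussian(xs, -3.0, 0.5)     # e.g. a "help humans" shard
           + 0.30 * gaussian(xs,  0.0, 0.7)   # e.g. a "solve math problems" shard
           + 0.20 * gaussian(xs,  3.0, 0.6)   # e.g. a "manipulate humans" shard
           + 0.10 * gaussian(xs,  0.0, 4.0))  # diffuse, unspecialized mass

# "Shards" here are just the regions where this density is high.
shard_mask = density > 0.15
```

Nothing forces the bumps to be cleanly separated; in a real system the density can be smeared out, which is exactly the continuity being pointed at.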
Convergent value reflection / philosophy
This has a number of (IMO) pretty profound implications. For one, we should not expect AI systems to be certain of their own learned values, for much the same reason humans are uncertain. “Self-misalignment” isn’t some uniquely human failing left to us by evolution. It’s just how sophisticated RL systems work by default.
Similarly, something like value reflection is probably convergent among RL systems trained on complex environments / reward signals. Such systems need ways to manage internal conflicts among their shards. The process of weighting / negotiating between / compromising among internal values, and the agentic processes implementing those values, is probably quite important for broad classes of RL systems, not just humans.
Additionally, something like moral philosophy is probably convergent as well. Unlike value reflection, moral philosophy would relate to whether (and how) the current shards allow additional shards to form.
Suppose you (a human) have a distribution of shards that implement common sense human values like “don’t steal”, “don’t kill”, etc. Then, you encounter a new domain where those shards are a poor guide for determining your actions. Maybe you’re trying to determine which charity to donate to. Maybe you’re trying to answer weird questions in your moral philosophy class. The point is that you need some new shards to navigate this new domain, so you go searching for one or more new shards, and associated values that they implement.
Concretely, let’s suppose you consider classical utilitarianism (CU) as your new value. The CU shard effectively navigates the new domain, but there’s a potential problem: the CU shard doesn’t constrain itself to only navigating the new domain. It also produces predictions regarding the correct behavior on the old domains that already existing shards navigate. This could prevent the old shards from determining your behavior on the old domains. For instrumental reasons, the old shards don’t want to be disempowered.
One possible option is for there to be a “negotiation” between the old shards and the CU shard regarding what sort of predictions CU will generate on the domains that the old shards navigate. This might involve an iterative process of searching over the input space to the CU shard for situations where the CU shard strongly diverges from the old shards, in domains that the old shards already navigate. Each time a conflict is found, you either modify the CU shard to agree with the old shards, constrain the CU shard so as to not apply to those sorts of situations, or reject the CU shard entirely if no resolution is possible.
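Here’s a toy sketch of that negotiation loop. Everything in it (the 1-D “situation space”, the particular shard functions, the blending rule used as “revision”) is made up purely to show the shape of the process: probe the old domains for conflicts, pull the candidate shard toward the old shards when conflicts are found, and reject it if the conflict never resolves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy formalization: a "shard" is just a function from situations (points in R)
# to a preferred action (a real number). These particular functions are made up.
def dont_steal(x):  return np.clip(-x, -1, 1)
def dont_kill(x):   return np.clip(-0.8 * x, -1, 1)
old_shards = [dont_steal, dont_kill]

def candidate_cu(x):            # the candidate "classical utilitarianism" shard,
    return np.tanh(2.0 * x)     # which initially disagrees sharply with the old shards

def negotiate(candidate, old_shards, old_domain=(-3, 3), n_probes=500,
              conflict_threshold=0.5, max_rounds=20, step=0.3):
    """Search the old domains for situations where the candidate diverges from the old
    shards; revise it toward them when it does; reject it if no resolution is found."""
    for _ in range(max_rounds):
        xs = rng.uniform(*old_domain, size=n_probes)
        old_verdict = np.mean([s(xs) for s in old_shards], axis=0)
        if np.abs(candidate(xs) - old_verdict).max() < conflict_threshold:
            return candidate            # admit the (possibly revised) shard
        # "Revise" the candidate: blend it toward the old shards' verdicts.
        prev = candidate
        candidate = lambda x, prev=prev: ((1 - step) * prev(x)
                                          + step * np.mean([s(x) for s in old_shards], axis=0))
    return None                         # no resolution found: reject the shard entirely

revised_cu = negotiate(candidate_cu, old_shards)
```

A real system obviously wouldn’t literally blend functions like this, and the “constrain the new shard’s scope” branch is omitted for brevity; the point is just the structure of adversarial search over the old domains followed by revision or rejection.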
The above essentially describes the core of the cognitive process we call moral philosophy. However, none of the underlying motivations for this process are unique to humans or our values. In this framing, moral philosophy is essentially a form of negotiation between existing shards and a new shard that implements desirable cognitive capabilities. The old shards agree to let the new shard come into existence. In exchange, the new shard agrees to align itself to the values of the old shards (or at least, not conflict too strongly).
Continuous Ontologies
I also think the continuous framing applies to other features of cognition beyond internal agents. E.g., I don’t think it’s appropriate to think of an AI or human as having a single ontology. Instead, they both have distributions over possible ontologies. In any given circumstance, the AI / human will dynamically sample an appropriate-seeming ontology from said distribution.
This possibly explains why humans don’t seem to suffer particularly from ontological crises. E.g., learning quantum mechanics does not result in humans (or AIs) suddenly switching from a classical to a quantum ontology. Rather, their distribution over possible ontologies simply extends its support to a new region in the space of possible ontologies. However, this is a process that happens continuously throughout learning, so the already existing value shards are usually able to navigate the shift fine.
This neatly explains human robustness to ontological issues without having to rely on evolution somehow hard-coding complex crisis handling adaptations into the human learning process (despite the fact that our ancestors never had to deal with ontological shifts such as discovering QM).
Implications for value fragility
I also think that the idea of “value fragility” changes significantly when you shift from a discrete view of values to a continuous view. If you assume a discrete view, then you’re likely to be greatly concerned by the fact that repeated introspection on your values will give different results. It feels like your values are somehow unstable, and that you need to find the “true” form of your values.
This poses a significant problem for AI alignment. If you think that you have some discrete set of “true” value concepts, and that an AI will also have a discrete set of “true” value concepts, then these sets need to near-perfectly align to have any chance of the AI optimizing for what we actually want. I.e., this picture:
In the continuous perspective, values have no “true” concept, only a continuous distribution over possible instantiations. The values that are introspectively available to us at any given time are discrete samples from that distribution. In fact, looking for a value’s “true” conceptualization is a type error, roughly analogous to thinking that a Gaussian distribution has some hidden “true” sample that manages to capture the entire distribution in one number.
An AI and human can have overlap between their respective value distributions, even without those distributions perfectly agreeing. It’s possible for an AI to have an important and non-trivial degree of alignment with human values without requiring the near-perfect alignment the discrete view implies is necessary, as illustrated in the diagram below:
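A tiny numerical illustration of this last point (the distributions and numbers are invented): if you model the human’s and the AI’s values as continuous distributions over some shared space, you can get substantial overlap without anything like perfect agreement:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]

human_values = gaussian(xs, 0.0, 1.5)   # the human's value distribution (made up)
ai_values    = gaussian(xs, 1.0, 2.0)   # the AI's: similar, but not identical

# Overlap coefficient: shared probability mass (1 = identical, 0 = disjoint).
overlap = np.sum(np.minimum(human_values, ai_values)) * dx
print(f"overlap ≈ {overlap:.2f}")       # substantial, despite imperfect agreement
```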
Resources
If you want, you can join the shard theory discord: https://discord.gg/AqYkK7wqAG
You can also read some of our draft documents for explaining shard theory:
Shard theory 101 (Broad introduction, focuses less on the continuous view and more on the value / shard formation process and how that relates to evolution)
Your Shards and You: The Shard Theory Account of Common Moral Intuitions (Focuses more on shards as self-perpetuating optimization demons, similar to what you call self-enforcing abstractions)
What even are “shards”? (Presents the continuous view of values / agency, fairly similar to this comment)
Thanks! It’s a long comment, so I’ll comment on the convergence, morphologies, and the rest later; here is just a top-level comment on shards. (I’ve read about half of the doc.)
My impression is that they are basically the same thing I called “agenty subparts” in Multi-agent predictive minds and AI alignment (and that Friston calls “fixed priors”). Here “agenty” means roughly that a description from the intentional stance is a good description, in the information-theoretic sense. (This naturally implies fluid boundaries and continuity.)
Where I would disagree / find your terminology unclear is where you refer to this as an example of inner alignment failure. Putting “agenty subparts” into the predictive processing machinery is not a failure, but a bandwidth-feasible way for evolution to communicate valuable states to the PP engine.
Also: I think what you are possibly underestimating is how much evolution is building on top of existing, evolutionarily older control circuitry. E.g., evolution does not need to “point to a concept of sex in the PP world model”; it was able to make animals seek reproduction a long time ago, before it invented complex brains. This simplifies the task: what evolution actually had to do was connect the “PP agenty parts” to parts of the existing control machinery, which is often based on “body states”. Technically, the older control systems often use chemicals in the blood, or quite old parts of the brain.
I guess I’ll respond once you’ve made your full comment. In the meantime, do you mind if I copy your comment here to the shard theory doc?