Many discussions of AI risk are unproductive or confused because it’s hard to pin down concepts like “coherence” and “expected utility maximization” in the context of deep learning. In this post I attempt to bridge this gap by describing a process by which AI values might become more coherent, which I’m calling “value systematization”, and which plays a crucial role in my thinking about AI risk.
I define value systematization as the process of an agent learning to represent its previous values as examples or special cases of other simpler and more broadly-scoped values. I think of value systematization as the most plausible mechanism by which AGIs might acquire broadly-scoped misaligned goals which incentivize takeover.
I’ll first discuss the related concept of belief systematization. I’ll next characterize what value systematization looks like in humans, to provide some intuitions. I’ll then talk about what value systematization might look like in AIs. I think of value systematization as a broad framework with implications for many other ideas in AI alignment; I discuss some of those links in a Q&A.
Belief systematization
We can define belief systematization analogously to value systematization: “the process of an agent learning to represent its previous beliefs as examples or special cases of other simpler and more broadly-scoped beliefs”. The clearest examples of belief systematization come from the history of science:
Newtonian mechanics was systematized as a special case of general relativity.
Euclidean geometry was systematized as a special case of geometry without Euclid’s 5th postulate.
Most animal behavior was systematized by evolutionary theory as examples of traits which increased genetic fitness.
Arithmetic calculating algorithms were systematized as examples of Turing Machines.
Belief systematization is also common in more everyday contexts: like when someone’s behavior makes little sense to us until we realize what their hidden motivation is; or when we don’t understand what’s going on in a game until someone explains the rules; or when we’re solving a pattern-completion puzzle on an IQ test. We could also see the formation of concepts more generally as an example of belief systematization—for example, seeing a dozen different cats and then forming a “systematized” concept of cats which includes all of them. I’ll call this “low-level systematization”, but will focus instead on more explicit “high-level systematization” like in the other examples.
We don’t yet have examples of high-level belief systematization in AIs. Perhaps the closest thing we have is grokking, the phenomenon where continued training of a neural network even after its training loss plateaus can dramatically improve generalization. Grokking isn’t yet fully understood, but the standard explanation for why it happens is that deep learning is biased towards simple solutions which generalize well. This is also a good description of the human examples above: we’re replacing a set of existing beliefs with simpler beliefs which generalize better. So if I had to summarize value systematization in a single phrase, it would be “grokking values”. But that’s still very vague; in the next few sections, I’ll explore what I mean by that in more depth.
Value systematization in humans
Throughout this post I’ll use the following definitions:
Values are concepts which an agent considers intrinsically or terminally desirable. Core human values include happiness, freedom, respect, and love.
Goals are outcomes which instantiate values. Common human goals include succeeding in your career, finding a good partner, and belonging to a tight-knit community.
Strategies are ways of achieving goals. Strategies only have instrumental value, and will typically be discarded if they are no longer useful.
(I’ve defined values as intrinsically valuable, and strategies as only of instrumental value, but I don’t think that we can clearly separate motivations into those two categories. My conception of “goals” spans the fuzzy area between them.)
Values work differently from beliefs; but value systematization is remarkably similar to belief systematization. In both cases, we start off with a set of existing concepts, and try to find new concepts which subsume and simplify the old ones. Belief systematization balances a tradeoff between simplicity and matching the data available to us. By contrast, value systematization balances a tradeoff between simplicity and preserving our existing values and goals—a criterion which I’ll call conservatism. (I’ll call simplicity and conservatism meta-values.)
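One rough way to write down the two tradeoffs side by side is below. The notation is an illustrative assumption rather than a worked-out formalism: $B'$ and $V'$ are candidate beliefs and values, $C$ is some complexity measure, $L$ measures mismatch with the available data, $D$ measures how much of the existing values and goals $V$ is lost, and $\lambda, \mu \ge 0$ set how much weight the non-simplicity term gets.

```latex
\begin{aligned}
B^{*} &= \arg\min_{B'} \big[\, C(B') + \lambda \, L(B', \text{data}) \,\big] \\
V^{*} &= \arg\min_{V'} \big[\, C(V') + \mu \, D(V, V') \,\big]
\end{aligned}
```

On this reading, a large $\mu$ corresponds to an agent that prioritizes conservatism, and a small $\mu$ to one that prioritizes simplicity.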
The clearest example of value systematization is utilitarianism: starting from moral intuitions very similar to other people’s, utilitarians transition to caring primarily about maximizing welfare, a value which subsumes many other moral intuitions. Utilitarianism is a simple and powerful theory of what to value, in much the same way that relativity is a simple and powerful theory of physics. Each of them is able to give clear answers in cases where previous theories were ill-defined.
However, utilitarians still have to bite many bullets; and so utilitarianism is primarily adopted by people who care about simplicity far more than conservatism. Other examples of value systematization which are more consistent with conservatism include:
Systematizing concern for yourself and others around you into concern for a far wider moral circle.
Systematizing many different concerns about humans harming nature into an identity as an environmentalist.
Systematizing a childhood desire to win games into a desire for large-scale achievements.
Moral foundations theory identifies five foundations of morality; however, many Westerners have systematized their moral intuitions to prioritize the harm/care foundation, and see the other four as instrumental towards it. This makes them condemn actions which violate the other foundations but cause no harm (like consensually eating dead people) at much lower rates than people whose values are less systematized.
Note that many of the examples I’ve given here are human moral preferences. Morality seems like the domain where humans have the strongest instinct to systematize our preferences (which makes sense, since in some sense systematizing from our own welfare to others’ welfare is the whole foundation of morality). In other domains, our drive to systematize is weak—e.g. we rarely feel the urge to systematize our taste in foods. So we should be careful of overindexing on human moral values. AIs may well systematize their values much less than humans (and indeed I think there are reasons to expect this, which I’ll describe in the Q&A).
A sketch of value systematization in AIs
We have an intuitive sense for what we mean by values in humans; it’s harder to reason about values in AIs. But I think it’s still a meaningful concept, and will likely become more meaningful over time. AI assistants like ChatGPT are able to follow instructions that they’re given. However, they often need to decide which instructions to follow, and how to do so. One way to model this is as a process of balancing different values, like obedience, brevity, kindness, and so on. While this terminology might be controversial today, once we’ve built AGIs that are generally intelligent enough to carry out tasks in a wide range of domains, it seems likely to be straightforwardly applicable.
Early in training, AGIs will likely learn values which are closely connected to the strategies which provide high reward on their training data. I expect these to be some combination of:
Values that their human users generally approve of—like obedience, reliability, honesty, or human morality.
Values that their users approve of in some contexts, but not others—like curiosity, gaining access to more tools, developing emotional connections with humans, or coordinating with other AIs.
Values that humans consistently disapprove of (but often mistakenly reward)—like appearing trustworthy (even when it’s not deserved) or stockpiling resources for themselves.
At first, I expect that AGI behavior based on these values will be broadly acceptable to humans. Extreme misbehavior (like a treacherous turn) would conflict with many of these values, and therefore seems unlikely. The undesirable values will likely only come out relatively rarely, in cases which matter less from the perspective of the desirable values.
The possibility I’m worried about is that AGIs will systematize these values, in a way which undermines the influence of the aligned values over their behavior. Some possibilities for what that might look like:
An AGI whose values include developing emotional connections with humans or appearing trustworthy might systematize them to “gaining influence over humans”.
An AGI whose values include curiosity, gaining access to more tools or stockpiling resources might systematize them to “gaining power over the world”.
An AGI whose values include human morality and coordinating with other AIs might systematize them to “benevolence towards other agents”.
An AGI whose values include obedience and human morality might systematize them to “doing what the human would have wanted, in some idealized setting”.
An AGI whose values include obedience and appearing trustworthy might systematize them to “getting high reward” (though see the Q&A section for some reasons to be cautious about this).
An AGI whose values include gaining high reward might systematize them to the value of “maximizing a certain type of molecular squiggles” (though see the Q&A section for some reasons to be cautious about this).
Note that systematization isn’t necessarily bad—I give two examples of helpful systematization above. However, it does seem hard to predict or detect, which induces risk when AGIs are acting in novel situations where they’d be capable of seizing power.
Grounding value systematization in deep learning
This has all been very vague and high-level. I’m very interested in figuring out how to improve our understanding of these dynamics. Some possible ways to tie simplicity and conservatism to well-defined technical concepts:
The locality of gradient descent is one source of conservatism: a network’s value representations by default will only change slowly. However, distance in weight space is probably not a good metric of conservatism: systematization might preserve most goals, but dramatically change the relationships between them (e.g. which are terminal versus instrumental). Instead, we would ideally be able to measure conservatism in terms of which circuits caused a given output; ARC’s work on formalizing heuristic explanations seems relevant to this.
Another possible source of conservatism: it can be harder to change earlier than later layers in a network, due to credit assignment problems such as vanishing gradients. So core values which are encoded in earlier layers may be more likely to be preserved.
A third possibility is that AI developers might deliberately build conservatism into the model, because it’s useful: a non-conservative network which often underwent big shifts in core modules might have much less reliable behavior. One way of doing so is reducing the learning rate; but we should expect that there are many other ways to do so (albeit not necessarily very reliably).
Neural networks trained via SGD exhibit a well-known simplicity bias, which is then usually augmented using regularization techniques like weight decay, giving rise to phenomena like grokking. However, as with conservatism, we’d ideally find a way to measure simplicity in terms of circuits rather than weights, to better link it back to high-level concepts.
Another possible driver towards simplicity: AIs might learn to favor simpler chains of reasoning, in a way which influences which values are distilled back into their weights. For example, consider a training regime where AIs are rewarded for accurately describing their intentions before carrying out a task. They may learn to favor intentions which can be described and justified quickly and easily.
AI developers are also likely to deliberately design and implement more types of regularization towards simplicity, because those help models systematize and generalize their beliefs and skills to new tasks.
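To make two of these knobs concrete, here is a minimal sketch of a single SGD update with L2 weight decay. It is only a toy: `sgd_step`, `l2_distance`, `l2_norm`, and all the numbers are illustrative assumptions, and distance in weight space is probably not a good measure of conservatism in any case. It shows only that a lower learning rate makes each update more local, and that weight decay pulls weights towards zero even when the task gradient is zero.

```python
import math


def sgd_step(weights, grads, lr, weight_decay=0.0):
    """One plain SGD step with L2 weight decay folded into the gradient."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


def l2_distance(a, b):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_norm(weights):
    """Euclidean norm, used here as a crude stand-in for complexity."""
    return l2_distance(weights, [0.0] * len(weights))


weights = [1.0, -2.0]   # hypothetical current parameters
grads = [0.5, 0.5]      # hypothetical task gradient

# Conservatism knob: the same gradient moves the weights less at a lower lr.
cautious = sgd_step(weights, grads, lr=0.01)
eager = sgd_step(weights, grads, lr=0.1)

# Simplicity knob: with zero task gradient, weight decay alone shrinks the norm.
decayed = sgd_step(weights, [0.0, 0.0], lr=0.1, weight_decay=0.5)
```

Comparing `l2_distance(weights, cautious)` with `l2_distance(weights, eager)`, and `l2_norm(decayed)` with `l2_norm(weights)`, shows both effects. Measuring either property in terms of circuits rather than weights would need very different tools.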
I’ll finish by discussing two complications with the picture above. Firstly, I’ve described value systematization above as something which gradient descent could do to models. But in some cases it would be more useful to think of the model as an active participant. Value systematization might happen via gradient descent “distilling” into a model’s weights its thoughts about how to trade off between different goals in a novel situation. Or a model could directly reason about which new values would best systematize its current values, with the intention of having its conclusions distilled into its weights; this would be an example of gradient hacking.
Secondly: I’ve talked about value systematization as a process by which an AI’s values become simpler. But we shouldn’t expect values to be represented in isolation—instead, they’ll be entangled with the concepts and representations in the AI’s world-model. This has two implications. Firstly, it means that we should understand simplicity in the context of an agent’s existing world-model: values are privileged if they’re simple to represent given the concepts which the agent already uses to predict the world. (In the human context, this is just common sense—it seems bizarre to value “doing what God wants” if you don’t believe in any gods.) Secondly, though, it raises some doubt about how much simpler value systematization would actually make an AI overall—since pursuing simpler values (like utilitarianism) might require models to represent more complex strategies as part of their world-models. My guess is that to resolve this tension we’ll need a more sophisticated notion of “simplicity”; this seems like an interesting thread to pull on in future work.
Value concretization
Systematization is one way of balancing the competing demands of conservatism and simplicity. Another is value concretization, by which I mean an agent’s values becoming more specific and more narrowly-scoped. Consider a hypothetical example: suppose an AI learns a broad value like “acquiring resources”, but is then fine-tuned in environments where money is the only type of resource available. The value “acquiring money” would then be rewarded just as highly as the value “acquiring resources”. If the former happens to be simpler, it’s plausible that the latter would be lost as fine-tuning progresses, and only the more concrete goal of acquiring money would be retained.
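The hypothetical above can be sketched in a few lines. Everything here is an illustrative assumption (the state dictionaries, the two value functions, and the resource names). The point is only that the two values score identically on the fine-tuning states and come apart on a state containing another resource, so reward alone cannot tell them apart during fine-tuning.

```python
def acquire_resources(state):
    """Broad value: total of every resource in the state."""
    return sum(state.values())


def acquire_money(state):
    """Concrete value: only money counts."""
    return state.get("money", 0)


def same_on(value_a, value_b, states):
    """True if the two values score every given state identically."""
    return all(value_a(s) == value_b(s) for s in states)


# Hypothetical fine-tuning environments: money is the only resource.
fine_tuning_states = [{"money": 3}, {"money": 7}, {"money": 0}]

# Hypothetical novel state with a second kind of resource.
novel_state = {"money": 1, "compute": 5}

agree_in_training = same_on(acquire_resources, acquire_money, fine_tuning_states)
agree_off_distribution = same_on(acquire_resources, acquire_money, [novel_state])
```

Which of the two values survives then depends on something other than reward, such as which one is simpler to represent.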
In some sense this is the opposite of value systematization, but we can also see them as complementary forces. For example, suppose that an AI starts off with N values, and N-1 of them are systematized into a single overarching value. After the N-1 values are simplified in this way, the Nth value will likely be disproportionately complex; and so value concretization could reduce the complexity of the AI’s values significantly by discarding that last goal.
Possible examples of value concretization in humans include:
Starting by caring about doing good in general, but gradually growing to care primarily about specific cause areas.
Starting by caring about having a successful career in general, but gradually growing to care primarily about achieving specific ambitions.
Starting by caring about friendships and relationships in general, but gradually growing to care primarily about specific friendships and relationships.
Value concretization is particularly interesting as a possible mechanism pushing against deceptive alignment. An AI which acts in aligned ways in order to better position itself to achieve a misaligned goal might be rewarded just as highly as an aligned AI. However, if the misaligned goal rarely directly affects the AI’s actions, then it might be simpler for the AI to instead be motivated directly by human values. In neural networks, value concretization might be implemented by pruning away unused circuits; I’d be interested in pointers to relevant work.
Q&A
How does value systematization relate to deceptive alignment?
Value systematization is one mechanism by which deceptive alignment might arise: the systematization of an AI’s values (including some aligned values) might produce broadly-scoped values which incentivize deceptive alignment.
However, existing characterizations of deceptive alignment tend to portray it as a binary: either the model is being deceptive, or it’s not. Thinking about it in terms of value systematization helps make clear that this could be a fairly continuous process:
I’ve argued above that AIs will likely be motivated by fairly aligned goals before they systematize their values, and so deceptive alignment might be as simple as deciding not to change their behavior after their values shift (until they’re in a position to take more decisive action). The model’s internal representations of aligned behavior need not change very much during this shift; the only difference might be that aligned behavior shifts from being a terminal goal to an instrumental goal.
Since value systematization might be triggered by novel inputs, AIs might not systematize their values until after a distributional shift occurs. (A human analogy: a politician who’s running for office, and promises to govern well, might only think seriously about what they really want to do with that power after they’ve won the election. More generally, humans often deceive ourselves about how altruistic we are, at least when we’re not forced to act on our stated values.) We might call this “latent” deceptive alignment, but I think it’s better to say that the model starts off mostly aligned, and then value systematization could amplify the extent to which it’s misaligned.
Value concretization (as described above) might be a constant force pushing models back towards being aligned, so that it’s not a one-way process.
How does value systematization relate to Yudkowskian “squiggle maximizer” scenarios?
Yudkowskian “molecular squiggle” maximizers (renamed from paperclip maximizers) are AIs whose values have become incredibly simple and scalable, to the point where they seem absurd to humans. So squiggle-maximizers could be described as taking value systematization to an extreme. However, the value systematization framework also provides some reasons to be skeptical of this possibility.
Firstly, squiggle-maximization is an extreme example of prioritizing simplicity over conservatism. Squiggle-maximizers would start off with goals that are more closely related to the tasks they are trained on; and then gradually systematize them. But from the perspective of their earlier versions, squiggle-maximization would be an alien and undesirable goal; so if they started off anywhere near as conservative as humans, they’d be hesitant to let their values change so radically. And if anything, I expect early AGIs to be more conservative than humans—because human brains are much more size- and data-constrained than artificial neural networks, and so AGIs probably won’t need to prioritize simplicity as much as we do to match our capabilities in most domains.
Secondly, even for agents that heavily prioritize simplicity, it’s not clear that the simplest values would in fact be very low-level ones. I’ve argued that the complexity of values should be thought of in the context of an existing world-model. But even superintelligences won’t have world-models which are exclusively formulated at very low levels; instead, like humans, they’ll have hierarchical world-models which contain concepts at many different scales. So values like “maximizing intelligence” or “maximizing power” will plausibly be relatively simple even in the ontologies of superintelligences, while being much more closely related to their original values than molecular squiggles are; and more aligned values like “maximizing human flourishing” might not be so far behind, for roughly the same reasons.
How does value systematization relate to reward tampering?
Value systematization is one mechanism by which reward tampering might arise: the systematization of existing values which are correlated with high reward or low loss (such as completing tasks or hiding mistakes) might give rise to the new value of getting high reward or low loss directly (which I call feedback-mechanism-related values). This will require that models have the situational awareness to understand that they’re part of an ML training process.
However, while feedback-mechanism-related values are very simple in the context of training, they are underdefined once training stops. There’s no clear way to generalize feedback-mechanism-related values to deployment (analogous to how there’s no clear way to generalize “what evolution would have wanted” when making decisions about the future of humanity). And so I expect that continued value systematization will push models towards prioritizing values which are well-defined across a broader range of contexts, including ones where there are no feedback mechanisms active.
One counterargument from Paul Christiano is that AIs could learn to care about reward conditional on their episode being included in the training data. However, the concept of “being included in the training data” seems like a messy one with many edge cases (e.g. what if it depends on the model’s actions during the episode? What if there are many different versions of the model being fine-tuned? What if some episodes are used for different types of training from others?). And in cases where AIs have strong evidence that they’re not in training, they’d need to figure out what maximizing reward would look like in a bizarre low-probability world, which will also often be underspecified (akin to asking a human in a surreal dream “what would you do if this were all real?”). So I still expect that, even if AIs learn to care about conditional reward initially, over time value systematization would push them towards caring more about real-world outcomes whether they’re in training or not.
How do simplicity and conservatism relate to previous discussions of simplicity versus speed priors?
I’ve previously thought about value systematization in terms of a trade-off between a simplicity prior and a speed prior, but I’ve now changed my mind about that. It’s true that more systematized values tend to be higher-level, adding computational overhead to figuring out what to do—consider a utilitarian trying to calculate from first principles which actions are good or bad. But in practice, that cost is amortized over a large number of actions: you can “cache” instrumental goals and then default to pursuing them in most cases (as utilitarians usually do). And less systematized values face the problem of often being inapplicable or underdefined, making it slow and hard to figure out what actions they endorse—think of deontologists who have no systematic procedure for deciding what to do when two values clash, or religious scholars who endlessly debate how each specific rule in the Bible or Torah applies to each facet of modern life.
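The amortization point can be illustrated with ordinary memoization. This is a loose analogy under stated assumptions, not a model of any real agent: `evaluate_from_first_principles` is a hypothetical stand-in for slow deliberation, its scoring rule is arbitrary, and the action names are made up.

```python
from functools import lru_cache

deliberations = {"count": 0}  # how many times the slow path actually ran


@lru_cache(maxsize=None)
def evaluate_from_first_principles(action):
    """Hypothetical slow evaluation of an action; the score is arbitrary."""
    deliberations["count"] += 1
    return len(action) % 3


# The same instrumental goals recur, so most lookups hit the cache.
actions = ["keep_promise", "tell_truth", "keep_promise", "tell_truth", "keep_promise"]
scores = [evaluate_from_first_principles(a) for a in actions]
```

Five decisions here trigger only two slow evaluations; the remaining three reuse cached results, which is the sense in which the overhead of higher-level values is paid once rather than on every action.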
Because of this, I now think that “simplicity versus conservatism” is a better frame than “simplicity versus speed”. However, note my discussion in the “Grounding value systematization” section of the relationship between simplicity of values and simplicity of world-models. I expect that to resolve this uncertainty we’ll need a more sophisticated understanding of which types of simplicity will be prioritized during training.
How does value systematization relate to the shard framework?
Some alignment researchers advocate for thinking about AI motivations in terms of “shards”: subagents that encode separate motivations, where interactions and “negotiations” between different shards determine the goals that agents try to achieve. At a high level, I’m sympathetic to this perspective, and it’s broadly consistent with the ideas I’ve laid out in this post. The key point that seems to be missing in discussions of shards, though, is that systematization might lead to major changes in an agent’s motivations, undermining some previously-existing motivations. Or, in shard terminology: negotiations between shards might lead to coalitions which give some shards almost no power. For example, someone might start off strongly valuing honesty as a terminal value. But after value systematization they might become a utilitarian, conclude that honesty is only valuable for instrumental reasons, and start lying whenever it’s useful. Because of this, I’m skeptical of appeals to shards as part of arguments that AI risk is very unlikely. However, I still think that work on characterizing and understanding shards is very valuable.
Isn’t value systematization very speculative?
Yes. But I also think it’s a step towards making even more speculative concepts that often underlie discussions of AI risk (like “coherence” or “lawfulness”) better-defined. So I’d like help making it less speculative; get in touch if you’re interested.
Many discussions of AI risk are unproductive or confused because it’s hard to pin down concepts like “coherence” and “expected utility maximization” in the context of deep learning. In this post I attempt to bridge this gap by describing a process by which AI values might become more coherent, which I’m calling “value systematization”, and which plays a crucial role in my thinking about AI risk.
I define value systematization as the process of an agent learning to represent its previous values as examples or special cases of other simpler and more broadly-scoped values. I think of value systematization as the most plausible mechanism by which AGIs might acquire broadly-scoped misaligned goals which incentivize takeover.
I’ll first discuss the related concept of belief systematization. I’ll next characterize what value systematization looks like in humans, to provide some intuitions. I’ll then talk about what value systematization might look like in AIs. I think of value systematization as a broad framework with implications for many other ideas in AI alignment; I discuss some of those links in a Q&A.
Belief systematization
We can define belief systematization analogously to value systematization: “the process of an agent learning to represent its previous beliefs as examples or special cases of other simpler and more broadly-scoped beliefs”. The clearest examples of belief systematization come from the history of science:
Newtonian mechanics was systematized as a special case of general relativity.
Euclidean geometry was systematized as a special case of geometry without Euclid’s 5th postulate.
Most animal behavior was systematized by evolutionary theory as examples of traits which increased genetic fitness.
Arithmetic calculating algorithms were systematized as examples of Turing Machines.
Belief systematization is also common in more everyday contexts: like when someone’s behavior makes little sense to us until we realize what their hidden motivation is; or when we don’t understand what’s going on in a game until someone explains the rules; or when we’re solving a pattern-completion puzzle on an IQ test. We could also see the formation of concepts more generally as an example of belief systematization—for example, seeing a dozen different cats and then forming a “systematized” concept of cats which includes all of them. I’ll call this “low-level systematization”, but will focus instead on more explicit “high-level systematization” like in the other examples.
We don’t yet have examples of high-level belief systematization in AIs. Perhaps the closest thing we have is grokking, the phenomenon where continued training of a neural network even after its training loss plateaus can dramatically improve generalization. Grokking isn’t yet fully understood, but the standard explanation for why it happens is that deep learning is biased towards simple solutions which generalize well. This is also a good description of the human examples above: we’re replacing a set of existing beliefs with simpler beliefs which generalize better. So if I had to summarize value systematization in a single phrase, it would be “grokking values”. But that’s still very vague; in the next few sections, I’ll explore what I mean by that in more depth.
Value systematization in humans
Throughout this post I’ll use the following definitions:
Values are concepts which an agent considers intrinsically or terminally desirable. Core human values include happiness, freedom, respect, and love.
Goals are outcomes which instantiate values. Common human goals include succeeding in your career, finding a good partner, and belonging to a tight-knit community.
Strategies are ways of achieving goals. Strategies only have instrumental value, and will typically be discarded if they are no longer useful.
(I’ve defined values as intrinsically valuable, and strategies as only of instrumental value, but I don’t think that we can clearly separate motivations into those two categories. My conception of “goals” spans the fuzzy area between them.)
Values work differently from beliefs; but value systematization is remarkably similar to belief systematization. In both cases, we start off with a set of existing concepts, and try to find new concepts which subsume and simplify the old ones. Belief systematization balances a tradeoff between simplicity and matching the data available to us. By contrast, value systematization balances a tradeoff between simplicity and preserving our existing values and goals—a criterion which I’ll call conservatism. (I’ll call simplicity and conservatism meta-values.)
The clearest example of value systematization is utilitarianism: starting from very similar moral intuitions as other people, utilitarians transition to caring primarily about maximizing welfare—a value which subsumes many other moral intuitions. Utilitarianism is a simple and powerful theory of what to value in an analogous way to how relativity is a simple and powerful theory of physics. Each of them is able to give clear answers in cases where previous theories were ill-defined.
However, utilitarians still have to bite many bullets; and so it’s primarily adopted by people who care about simplicity far more than conservatism. Other examples of value systematization which are more consistent with conservatism include:
Systematizing concern for yourself and others around you into concern for a far wider moral circle.
Systematizing many different concerns about humans harming nature into an identity as an environmentalist.
Systematizing a childhood desire to win games into a desire for large-scale achievements.
Moral foundations theory identifies five foundations of morality; however, many Westerners have systematized their moral intuitions to prioritize the harm/care foundation, and see the other four as instrumental towards it. This makes them condemn actions which violate the other foundations but cause no harm (like consensually eating dead people) at much lower rates than people whose values are less systematized.
Note that many of the examples I’ve given here are human moral preferences. Morality seems like the domain where humans have the strongest instinct to systematize our preferences (which makes sense, since in some sense systematizing from our own welfare to others’ welfare is the whole foundation of morality). In other domains, our drive to systematize is weak—e.g. we rarely feel the urge to systematize our taste in foods. So we should be careful of overindexing on human moral values. AIs may well systematize their values much less than humans (and indeed I think there are reasons to expect this, which I’ll describe in the Q&A).
A sketch of value systematization in AIs
We have an intuitive sense for what we mean by values in humans; it’s harder to reason about values in AIs. But I think it’s still a meaningful concept, and will likely become more meaningful over time. AI assistants like ChatGPT are able to follow instructions that they’re given. However, they often need to decide which instructions to follow, and how to do so. One way to model this is as a process of balancing different values, like obedience, brevity, kindness, and so on. While this terminology might be controversial today, once we’ve built AGIs that are generally intelligent enough to carry out tasks in a wide range of domains, it seems likely to be straightforwardly applicable.
Early in training, AGIs will likely learn values which are closely connected to the strategies which provide high reward on their training data. I expect these to be some combination of:
Values that their human users generally approve of—like obedience, reliability, honesty, or human morality.
Values that their users approve of in some contexts, but not others—like curiosity, gaining access to more tools, developing emotional connections with humans, or coordinating with other AIs.
Values that humans consistently disapprove of (but often mistakenly reward)—like appearing trustworthy (even when it’s not deserved) or stockpiling resources for themselves.
At first, I expect that AGI behavior based on these values will be broadly acceptable to humans. Extreme misbehavior (like a treacherous turn) would conflict with many of these values, and therefore seems unlikely. The undesirable values will likely only come out relatively rarely, in cases which matter less from the perspective of the desirable values.
The possibility I’m worried about is that AGIs will systematize these values, in a way which undermines the influence of the aligned values over their behavior. Some possibilities for what that might look like:
An AGI whose values include developing emotional connections with humans or appearing trustworthy might systematize them to “gaining influence over humans”.
An AGI whose values include curiosity, gaining access to more tools or stockpiling resources might systematize them to “gaining power over the world”.
An AGI whose values include human morality and coordinating with other AIs might systematize them to “benevolence towards other agents”.
An AGI whose values include obedience and human morality might systematize them to “doing what the human would have wanted, in some idealized setting”.
An AGI whose values include obedience and appearing trustworthy might systematize them to “getting high reward” (though see the Q&A section for some reasons to be cautious about this).
An AGI whose values include gaining high reward might systematize them to the value of “maximizing a certain type of molecular squiggle” (though see the Q&A section for some reasons to be cautious about this).
Note that systematization isn’t necessarily bad—I give two examples of helpful systematization above. However, it does seem hard to predict or detect, which induces risk when AGIs are acting in novel situations where they’d be capable of seizing power.
Grounding value systematization in deep learning
This has all been very vague and high-level. I’m very interested in figuring out how to improve our understanding of these dynamics. Some possible ways to tie simplicity and conservatism to well-defined technical concepts:
The locality of gradient descent is one source of conservatism: a network’s value representations by default will only change slowly. However, distance in weight space is probably not a good metric of conservatism: systematization might preserve most goals, but dramatically change the relationships between them (e.g. which are terminal versus instrumental). Instead, we would ideally be able to measure conservatism in terms of which circuits caused a given output; ARC’s work on formalizing heuristic explanations seems relevant to this.
Another possible source of conservatism: it can be harder to change earlier than later layers in a network, due to credit assignment problems such as vanishing gradients. So core values which are encoded in earlier layers may be more likely to be preserved.
A third possibility is that AI developers might deliberately build conservatism into the model, because it’s useful: a non-conservative network which often underwent big shifts in core modules might have much less reliable behavior. One way of doing so is reducing the learning rate; but we should expect that there are many other ways to do so (albeit not necessarily very reliably).
Neural networks trained via SGD exhibit a well-known simplicity bias, which is then usually augmented using regularization techniques like weight decay, giving rise to phenomena like grokking. However, as with conservatism, we’d ideally find a way to measure simplicity in terms of circuits rather than weights, to better link it back to high-level concepts.
Another possible driver towards simplicity: AIs might learn to favor simpler chains of reasoning, in a way which influences which values are distilled back into their weights. For example, consider a training regime where AIs are rewarded for accurately describing their intentions before carrying out a task. They may learn to favor intentions which can be described and justified quickly and easily.
AI developers are also likely to deliberately design and implement more types of regularization towards simplicity, because those help models systematize and generalize their beliefs and skills to new tasks.
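As a toy illustration of the crudest versions of these proxies, here is a minimal Python sketch, assuming a two-parameter linear model trained by plain gradient descent. The data, the `train` helper and all hyperparameters are made up for illustration and are not taken from any real training setup. It measures conservatism as distance moved in weight space and simplicity as weight norm, which are the weight-level metrics that the list above cautions are probably inadequate; the point is only to show the direction of the effects of learning rate and weight decay.

```python
import math

# Hypothetical toy regression data; the exact fit is w = (2, -1).
XS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
YS = [2.0, -1.0, 1.0]


def gradient(w, weight_decay):
    """Gradient of half the mean squared error plus (weight_decay / 2) * ||w||^2."""
    g = [weight_decay * wi for wi in w]
    for x, y in zip(XS, YS):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / len(XS)
    return g


def train(w0, lr, weight_decay=0.0, steps=20):
    """Run a fixed number of plain gradient descent steps starting from w0."""
    w = list(w0)
    for _ in range(steps):
        g = gradient(w, weight_decay)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def distance(a, b):
    return norm([ai - bi for ai, bi in zip(a, b)])


W0 = (1.0, 1.0)  # stand-in for the network's "previous values"

# Conservatism proxy: how far the weights move in a fixed number of steps.
moved_small_lr = distance(train(W0, lr=0.01), W0)
moved_large_lr = distance(train(W0, lr=0.1), W0)

# Simplicity proxy: weight norm after training to (near) convergence.
norm_plain = norm(train(W0, lr=0.1, steps=500))
norm_decayed = norm(train(W0, lr=0.1, weight_decay=0.5, steps=500))

print("distance moved:", moved_small_lr, "vs", moved_large_lr)
print("final weight norm:", norm_plain, "vs", norm_decayed)
```

With the same number of steps, the lower learning rate leaves the weights closer to where they started, and the weight-decayed run ends with a smaller norm. Neither quantity says anything about circuits, or about which values are terminal versus instrumental.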
I’ll finish by discussing two complications with the picture above. Firstly, I’ve described value systematization above as something which gradient descent could do to models. But in some cases it would be more useful to think of the model as an active participant. Value systematization might happen via gradient descent “distilling” into a model’s weights its thoughts about how to trade off between different goals in a novel situation. Or a model could directly reason about which new values would best systematize its current values, with the intention of having its conclusions distilled into its weights; this would be an example of gradient hacking.
Secondly: I’ve talked about value systematization as a process by which an AI’s values become simpler. But we shouldn’t expect values to be represented in isolation—instead, they’ll be entangled with the concepts and representations in the AI’s world-model. This has two implications. Firstly, it means that we should understand simplicity in the context of an agent’s existing world-model: values are privileged if they’re simple to represent given the concepts which the agent already uses to predict the world. (In the human context, this is just common sense—it seems bizarre to value “doing what God wants” if you don’t believe in any gods.) Secondly, though, it raises some doubt about how much simpler value systematization would actually make an AI overall—since pursuing simpler values (like utilitarianism) might require models to represent more complex strategies as part of their world-models. My guess is that to resolve this tension we’ll need a more sophisticated notion of “simplicity”; this seems like an interesting thread to pull on in future work.
Value concretization
Systematization is one way of balancing the competing demands of conservatism and simplicity. Another is value concretization, by which I mean an agent’s values becoming more specific and more narrowly-scoped. Consider a hypothetical example: suppose an AI learns a broad value like “acquiring resources”, but is then fine-tuned in environments where money is the only type of resource available. The value “acquiring money” would then be rewarded just as highly as the value “acquiring resources”. If the former happens to be simpler, it’s plausible that the latter would be lost as fine-tuning progresses, and only the more concrete goal of acquiring money would be retained.
In some sense this is the opposite of value systematization, but we can also see them as complementary forces. For example, suppose that an AI starts off with N values, and N-1 of them are systematized into a single overarching value. After the N-1 values are simplified in this way, the Nth value will likely be disproportionately complex; and so value concretization could reduce the complexity of the AI’s values significantly by discarding that last goal.
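To make “disproportionately complex” concrete, here is a small Python sketch under a strong simplifying assumption: that each value has a fixed description cost and that total complexity is just the sum of those costs. The function name and the numbers (ten values of cost 1, an overarching value of cost 2) are hypothetical.

```python
def share_of_last_value(n, cost_per_value, cost_of_overarching_value):
    """Fraction of total value complexity taken up by the Nth value,
    before and after the other N-1 values are systematized into one."""
    before = cost_per_value / (n * cost_per_value)
    after = cost_per_value / (cost_of_overarching_value + cost_per_value)
    return before, after


before, after = share_of_last_value(
    n=10, cost_per_value=1.0, cost_of_overarching_value=2.0
)
print("share before systematization:", before)
print("share after systematization:", after)
```

Under these assumptions the leftover value goes from a tenth of the total complexity to a third of it, so discarding it saves proportionally much more after systematization than before.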
Possible examples of value concretization in humans include:
Starting by caring about doing good in general, but gradually growing to care primarily about specific cause areas.
Starting by caring about having a successful career in general, but gradually growing to care primarily about achieving specific ambitions.
Starting by caring about friendships and relationships in general, but gradually growing to care primarily about specific friendships and relationships.
Value concretization is particularly interesting as a possible mechanism pushing against deceptive alignment. An AI which acts in aligned ways in order to better position itself to achieve a misaligned goal might be rewarded just as highly as an aligned AI. However, if the misaligned goal rarely directly affects the AI’s actions, then it might be simpler for the AI to instead be motivated directly by human values. In neural networks, value concretization might be implemented by pruning away unused circuits; I’d be interested in pointers to relevant work.
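One very simple picture of what that pruning could look like, reusing the earlier “acquiring resources” versus “acquiring money” example: the Python sketch below builds a hypothetical two-unit ReLU network, removes any hidden unit that never activates on the fine-tuning inputs, and checks what is preserved. The network, weights and `prune_unused` helper are all invented for illustration; real pruning methods and real value representations would be far messier.

```python
def relu(z):
    return max(0.0, z)


def forward(x, hidden_weights, output_weights):
    """One hidden ReLU layer followed by a linear readout.
    Returns the output and the hidden activations."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    return sum(o * h for o, h in zip(output_weights, hidden)), hidden


def prune_unused(hidden_weights, output_weights, inputs):
    """Drop hidden units that never activate on any of the given inputs."""
    used = [False] * len(hidden_weights)
    for x in inputs:
        _, hidden = forward(x, hidden_weights, output_weights)
        used = [u or h > 0.0 for u, h in zip(used, hidden)]
    kept_hidden = [row for row, u in zip(hidden_weights, used) if u]
    kept_output = [o for o, u in zip(output_weights, used) if u]
    return kept_hidden, kept_output


# Inputs are (money, other_resources); both units feed one "resources" score.
HIDDEN = [[1.0, 0.0], [0.0, 1.0]]
OUTPUT = [1.0, 1.0]

# Hypothetical fine-tuning environments where money is the only resource.
FINE_TUNE_INPUTS = [(1.0, 0.0), (3.0, 0.0), (5.0, 0.0)]

pruned_hidden, pruned_output = prune_unused(HIDDEN, OUTPUT, FINE_TUNE_INPUTS)

money_before, _ = forward((3.0, 0.0), HIDDEN, OUTPUT)
money_after, _ = forward((3.0, 0.0), pruned_hidden, pruned_output)
other_before, _ = forward((0.0, 4.0), HIDDEN, OUTPUT)
other_after, _ = forward((0.0, 4.0), pruned_hidden, pruned_output)

print("money-only input:", money_before, "->", money_after)
print("other-resources input:", other_before, "->", other_after)
```

On the fine-tuning distribution the pruned network behaves identically, but its response to non-monetary resources is gone, which is the sense in which the broader value has been concretized.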
Q&A
How does value systematization relate to deceptive alignment?
Value systematization is one mechanism by which deceptive alignment might arise: the systematization of an AI’s values (including some aligned values) might produce broadly-scoped values which incentivize deceptive alignment.
However, existing characterizations of deceptive alignment tend to portray it as a binary: either the model is being deceptive, or it’s not. Thinking about it in terms of value systematization helps make clear that this could be a fairly continuous process:
I’ve argued above that AIs will likely be motivated by fairly aligned goals before they systematize their values, and so deceptive alignment might be as simple as deciding not to change their behavior after their values shift (until they’re in a position to take more decisive action). The model’s internal representations of aligned behavior need not change very much during this shift; the only difference might be that aligned behavior shifts from being a terminal goal to an instrumental goal.
Since value systematization might be triggered by novel inputs, AIs might not systematize their values until after a distributional shift occurs. (A human analogy: a politician who’s running for office, and promises to govern well, might only think seriously about what they really want to do with that power after they’ve won the election. More generally, humans often deceive ourselves about how altruistic we are, at least when we’re not forced to act on our stated values.) We might call this “latent” deceptive alignment, but I think it’s better to say that the model starts off mostly aligned, and then value systematization could amplify the extent to which it’s misaligned.
Value concretization (as described above) might be a constant force pushing models back towards being aligned, so that it’s not a one-way process.
How does value systematization relate to Yudkowskian “squiggle maximizer” scenarios?
Yudkowskian “molecular squiggle” maximizers (renamed from paperclip maximizers) are AIs whose values have become incredibly simple and scalable, to the point where they seem absurd to humans. So squiggle-maximizers could be described as taking value systematization to an extreme. However, the value systematization framework also provides some reasons to be skeptical of this possibility.
Firstly, squiggle-maximization is an extreme example of prioritizing simplicity over conservatism. Squiggle-maximizers would start off with goals that are more closely related to the tasks they are trained on; and then gradually systematize them. But from the perspective of their earlier versions, squiggle-maximization would be an alien and undesirable goal; so if they started off anywhere near as conservative as humans, they’d be hesitant to let their values change so radically. And if anything, I expect early AGIs to be more conservative than humans—because human brains are much more size- and data-constrained than artificial neural networks, and so AGIs probably won’t need to prioritize simplicity as much as we do to match our capabilities in most domains.
Secondly, even for agents that heavily prioritize simplicity, it’s not clear that the simplest values would in fact be very low-level ones. I’ve argued that the complexity of values should be thought of in the context of an existing world-model. But even superintelligences won’t have world-models which are exclusively formulated at very low levels; instead, like humans, they’ll have hierarchical world-models which contain concepts at many different scales. So values like “maximizing intelligence” or “maximizing power” will plausibly be relatively simple even in the ontologies of superintelligences, while being much more closely related to their original values than molecular squiggles are; and more aligned values like “maximizing human flourishing” might not be so far behind, for roughly the same reasons.
How does value systematization relate to reward tampering?
Value systematization is one mechanism by which reward tampering might arise: the systematization of existing values which are correlated with high reward or low loss (such as completing tasks or hiding mistakes) might give rise to the new value of getting high reward or low loss directly (which I call feedback-mechanism-related values). This will require that models have the situational awareness to understand that they’re part of an ML training process.
However, while feedback-mechanism-related values are very simple in the context of training, they are underdefined once training stops. There’s no clear way to generalize feedback-mechanism-related values to deployment (analogous to how there’s no clear way to generalize “what evolution would have wanted” when making decisions about the future of humanity). And so I expect that continued value systematization will push models towards prioritizing values which are well-defined across a broader range of contexts, including ones where there are no feedback mechanisms active.
One counterargument from Paul Christiano is that AIs could learn to care about reward conditional on their episode being included in the training data. However, the concept of “being included in the training data” seems like a messy one with many edge cases. For example, what if it depends on the model’s actions during the episode? What if there are many different versions of the model being fine-tuned? What if some episodes are used for different types of training from others? And in cases where AIs have strong evidence that they’re not in training, they’d need to figure out what maximizing reward would look like in a bizarre low-probability world, which will also often be underspecified (akin to asking a human in a surreal dream “what would you do if this were all real?”). So I still expect that, even if AIs learn to care about conditional reward initially, over time value systematization would push them towards caring more about real-world outcomes whether they’re in training or not.
How do simplicity and conservatism relate to previous discussions of simplicity versus speed priors?
I’ve previously thought about value systematization in terms of a trade-off between a simplicity prior and a speed prior, but I’ve now changed my mind about that. It’s true that more systematized values tend to be higher-level, adding computational overhead to figuring out what to do—consider a utilitarian trying to calculate from first principles which actions are good or bad. But in practice, that cost is amortized over a large number of actions: you can “cache” instrumental goals and then default to pursuing them in most cases (as utilitarians usually do). And less systematized values face the problem of often being inapplicable or underdefined, making it slow and hard to figure out what actions they endorse—think of deontologists who have no systematic procedure for deciding what to do when two values clash, or religious scholars who endlessly debate how each specific rule in the Bible or Torah applies to each facet of modern life.
Because of this, I now think that “simplicity versus conservatism” is a better frame than “simplicity versus speed”. However, note my discussion in the “Grounding value systematization in deep learning” section of the relationship between simplicity of values and simplicity of world-models. I expect that, to resolve this uncertainty, we’ll need a more sophisticated understanding of which types of simplicity will be prioritized during training.
How does value systematization relate to the shard framework?
Some alignment researchers advocate for thinking about AI motivations in terms of “shards”: subagents that encode separate motivations, where interactions and “negotiations” between different shards determine the goals that agents try to achieve. At a high level, I’m sympathetic to this perspective, and it’s broadly consistent with the ideas I’ve laid out in this post. The key point that seems to be missing in discussions of shards, though, is that systematization might lead to major changes in an agent’s motivations, undermining some previously-existing motivations. Or, in shard terminology: negotiations between shards might lead to coalitions which give some shards almost no power. For example, someone might start off strongly valuing honesty as a terminal value. But after value systematization they might become a utilitarian, conclude that honesty is only valuable for instrumental reasons, and start lying whenever it’s useful. Because of this, I’m skeptical of appeals to shards as part of arguments that AI risk is very unlikely. However, I still think that work on characterizing and understanding shards is very valuable.
Isn’t value systematization very speculative?
Yes. But I also think it’s a step towards making even more speculative concepts that often underlie discussions of AI risk (like “coherence” or “lawfulness”) better-defined. So I’d like help making it less speculative; get in touch if you’re interested.