The deployment of powerful deep learning systems such as ChatGPT raises the question of how to make these systems safe and consistently aligned with human intent. Since building these systems is an engineering challenge, it is tempting to think of the safety of these systems primarily through a traditional engineering lens, focusing on reliability, modularity, redundancy, and reducing the long tail of failures.
While engineering is a useful lens, it misses an important part of the picture: deep neural networks are complex adaptive systems, which raises new control difficulties that are not addressed by standard engineering methodology. I’ve discussed some particular examples of this before, but here I want to focus on the broader underlying intuition that generated them.
A complex adaptive system is a system with many interacting components that adapt to their environment and co-evolve over time (in our case, the weights / layers of the neural network). Beyond neural networks, other examples of complex adaptive systems include firms, financial markets, political parties, culture, traffic flows, pathogens, ecosystems, human brains, the Earth’s climate, and the Internet.
A common thread in all these systems is that straightforward attempts to control their behavior lead to unintended consequences. I’ll demonstrate this through concrete examples, then step back and consider the broader properties that make these systems difficult to control, including emergent goals. Finally, I’ll propose safety measures that account for the complex adaptive nature of deep learning systems.
Many of the ideas in this post have been discussed before, and my thinking owes significantly to Dan Hendrycks, who was an early proponent of the complex systems perspective as a PhD student in my lab (see e.g. Unsolved Problems in ML Safety, the lecture on accident models from Dan’s course, or this blog post).
Control Difficulties in Complex Systems
Let’s examine several examples of complex systems, and see why each is difficult to control, in the sense that they either resist or respond unpredictably to external feedback.
Traffic. A city builds new highways to reduce traffic congestion. The newly increased road capacity attracts new drivers, leading to worse levels of congestion than before. The adaptive behavior leads to unintended consequences.
Ecosystems. A park introduces a predator to reduce the population of an invasive species. The predator also preys on native species, disrupting the ecosystem balance. The dense network of interactions makes it difficult to predict all consequences ahead of time.
Financial markets. Central banks lower interest rates to stimulate economic growth. Investors thus make riskier investments, leading to asset bubbles, which later burst and destabilize the financial system. In this case, both adaptivity and multi-step interactions come into play.
Culture. The government implements public awareness campaigns to promote environmental conservation. These efforts encounter resistance from workers whose jobs rely on non-renewable fuel sources, and are appropriated by fashion brands and other consumer products through greenwashing.
Further examples include pathogens evolving drug resistance, firms relocating to avoid regulations, and positive feedback loops from climate change. I elaborate on these and other examples in the appendix.
Traditional Engineering Does Not Address These Difficulties
Why are complex adaptive systems hard to control? They have two key hallmarks:
Emergence: behavior at one scale cannot be easily reduced to behavior at smaller scales, i.e. “More is Different”.
Feedback loops: different components of the system continually influence and respond to each other.
Feedback loops can lead a system to resist or respond nonlinearly to change. Emergence means that failures cannot be traced to individual components, and that behavior is hard to predict as a system evolves. Together, emergence and feedback loops lead to many of the downstream challenges seen in our earlier examples, such as:
Adaptivity: complex adaptive systems often adapt to and resist change, as in the traffic and culture examples.
Nonlinearity: due to feedback loops and other higher-order interactions, small changes in input can lead to large or unexpected changes in output, as in the traffic, ecosystem, and financial market examples.
Self-organization: Order and structure can emerge without central control, as can be seen with human culture. Since there was no central control that instantiated these structures, there is no obvious point of intervention to direct them.
Redundancy: self-organization means that complex adaptive systems often have multiple components that perform similar functions. This makes them less responsive to interventions. For instance, redirecting traffic from one street might just move it to nearby streets and not affect overall traffic in an area.
Traditional engineering does not address these challenges. Three hallmarks of engineering are reliability, modularity, and redundancy, but these traditional pillars either don’t address the issues above or are infeasible to implement.
For instance, one might seek to reliably influence culture by testing messaging on a broad set of audiences and disseminating messages through multiple channels. But new countercultures will likely rise in response, and ubiquitous messaging could end up sparking backlash.
Modularity could help improve complex systems, but is almost impossible to achieve due to interactions and feedback loops. For instance, the U.S. government is built on separation of powers (a form of modularity), but over time the different branches have co-adapted and found ways to assert power beyond their initial scope (see e.g. the War Powers Resolution and commerce clause).
Finally, redundancy is considered a virtue in traditional engineering, but the redundancy in complex adaptive systems makes them harder to analyze and intervene on.
Goal-oriented Behavior in Complex Adaptive Systems
A signature difficulty in complex adaptive systems is emergent goal-oriented behavior (Gall, 1975 ch. 8). For instance, ant colonies collectively pursue goals (finding food, building nests, protecting the colony) even though each individual ant follows simple rules. Similarly, flocks of birds avoid predators despite each bird following simple rules.
As I’ll discuss below, emergent goals are hard to predict from individual components and many emergent goals center on acquiring power or resources. Emergent goals therefore pose a particular challenge to controlling systems, as they produce an impetus that cannot be easily directed through either top-down or bottom-up intervention.
First, a system’s explicitly stated goal (e.g. the core principles of an organization) rarely matches the goals that it pursues in practice, due to intra-system competition (see e.g. “launch, promote, abandon”, where individual managers pursue goals detrimental to the organization in order to get promoted). A system’s emergent goals also need not match the goals of individual actors in the system. For example, it is common for groups of well-intentioned people to do harm, and for self-interested parties to create valuable products.
Second, emergent goals need not be beneficial to individuals: groups often exhibit strong pressures towards consensus, leading to groupthink, even if most individuals prefer greater diversity of thought. And for parts of the COVID-19 pandemic, society seemed to have a “goal” of keeping the reproduction number R close to 1, as lower case counts led people to be less cautious and vice versa, which rendered many policies surprisingly ineffectual.
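As a toy illustration of this homeostatic feedback (all numbers are made up; this is not a model of the actual pandemic), here is a minimal SIR-style simulation in which caution rises with case counts and falls when they drop. The effective reproduction number ends up hovering near 1 regardless of the baseline transmission rate:

```python
# Toy SIR dynamics with behavioral feedback: contact rates fall when case
# counts are high and rise when they are low. All parameters are illustrative.
N = 1_000_000                # population size
beta0, gamma = 0.4, 0.2      # baseline transmission and recovery rates
S, I, R_recovered = N - 100.0, 100.0, 0.0

effective_R = []
for day in range(300):
    caution = 1.0 / (1.0 + I / 5_000)   # behavioral feedback: more cases -> more caution
    beta = beta0 * caution

    new_infections = beta * S * I / N
    new_recoveries = gamma * I
    S -= new_infections
    I += new_infections - new_recoveries
    R_recovered += new_recoveries

    effective_R.append(beta / gamma * S / N)   # R under current behavior

print("R_eff over the last 100 days:",
      f"min={min(effective_R[-100:]):.2f}, max={max(effective_R[-100:]):.2f}")
```

The feedback term, not any policy parameter, is what pins R_eff near 1: if cases rise, caution rises and pushes R_eff back down, and vice versa.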
Even though a system’s goals can be derived neither from a top-down objective nor from individual actors, some goals appear commonly across many systems. Two common emergent goals are self-preservation and growth: complex adaptive systems often act to preserve themselves and to expand in size. This is ubiquitous in biology (due to evolutionary pressure), but occurs more broadly, for instance most organizations (e.g. bureaucracies, companies) act to preserve themselves and to expand. Consequently, complex systems need constant checks to ensure they do not encroach on other domains (Gall, 1975 ch. 2).
Lessons for Deep Learning Safety
I argued that traditional engineering thinking is not sufficient for making deep learning systems safe. So what additional approaches should we incorporate? Here are several principles derived from analogy with other complex adaptive systems:
Avoid continuous incentive gradients towards bad behaviors; instead build sharp cliffs. For instance, it is a bad idea to give people low doses of antibiotics, because some bacteria would survive and evolve antibiotic resistance. Instead, you want to make sure that anyone given antibiotics is given enough to kill all the bacteria by a significant margin.
Similarly, in deep learning, it would be a bad idea to first train a system on very error-prone human evaluators and then gradually expose it to more sophisticated overseers. Why? Because the model could learn methods to fool the initial error-prone evaluators, and then gradually improve its deception as the quality of oversight increased. It would instead be better to start with high-quality oversight: then the model might never learn to deceive in the first place, because all forms of successful deception would require large departures from its current policy.
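To make this concrete, here is a toy hill-climbing sketch (an invented caricature, not a realistic training setup): a model's "deception skill" only ratchets upward when a small increment suffices to fool the current overseer, so error-prone-then-sophisticated oversight provides a smooth gradient toward deception, while strong oversight from the start leaves no such gradient:

```python
import random

def train(overseer_schedule, steps=500, step_size=0.02, seed=0):
    """Toy hill climb: a small increase in 'deception skill' is kept only if
    it fools the overseer of the current round (skill > overseer strength)."""
    rng = random.Random(seed)
    skill = 0.0
    for t in range(steps):
        overseer = overseer_schedule(t, steps)
        candidate = skill + step_size * rng.random()
        if candidate > overseer:      # deception succeeds, so it is reinforced
            skill = candidate
    return skill

weak_then_strong = lambda t, T: t / T     # error-prone at first, improving over time
strong_from_start = lambda t, T: 1.0      # high-quality oversight throughout

print("final deception skill, gradual oversight:", round(train(weak_then_strong), 2))
print("final deception skill, strong oversight: ", round(train(strong_from_start), 2))
```

Under the gradual schedule the model's deception keeps pace with (and eventually far exceeds) the overseer; under strong initial oversight, no small step is ever rewarded and the skill stays at zero.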
Consider not building certain systems. In other domains such as synthetic biology, it is recognized that certain systems are inherently dangerous and should not be built, or should only be built with strong justifications and safeguards in place. Many self-replicating or rapidly-evolving systems fall under this category (e.g. engineered viruses or pests). We do not have such a culture in machine learning currently, but should build more of one.
Diverse systems are more resilient. By default, deep learning leads to the deployment of many copies of the same system or similar systems (e.g. fine-tuned from the same base model). It may be safer to have a larger diversity of models. For instance, if a model acquires unwanted emergent goals, other AI systems may act to stop it, but only if those models do not have the same emergent goals. The more different AI systems are from each other, the more they can act as checks and balances against each other. Diverse systems may also help combat algorithmic monoculture (Kleinberg and Raghavan, 2021; Bommasani et al., 2022).
On the other hand, diverse goals of individual AI systems may lead to worse emergent goals for the entire ecosystem of AIs, due to economic and selection pressures, as argued in Hendrycks (2023).
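One simple way to operationalize diversity as a check is a quorum rule: an action proposed by one model is executed only if models fine-tuned from different base models independently approve it. The sketch below is purely illustrative; the Reviewer objects and their hand-coded approval rules are hypothetical stand-ins, not a real API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Reviewer:
    """Placeholder for a model fine-tuned from a distinct base model."""
    name: str
    approves: Callable[[str], bool]   # returns True if the action looks safe

def run_with_quorum(action: str, reviewers: List[Reviewer], quorum: int) -> bool:
    """Execute `action` only if at least `quorum` reviewers approve it.
    Diversity matters here: reviewers sharing a base model (and hence its
    blind spots or emergent goals) give correlated votes and little extra safety."""
    approvals = [r.name for r in reviewers if r.approves(action)]
    approved = len(approvals) >= quorum
    print(f"{action!r}: {len(approvals)}/{len(reviewers)} approvals -> "
          f"{'execute' if approved else 'block'}")
    return approved

# Hypothetical reviewers with different (hand-coded) blind spots.
reviewers = [
    Reviewer("model_a", lambda a: "delete" not in a),
    Reviewer("model_b", lambda a: "transfer funds" not in a),
    Reviewer("model_c", lambda a: len(a) < 80),
]
run_with_quorum("summarize the quarterly report", reviewers, quorum=3)
run_with_quorum("delete all backups and transfer funds", reviewers, quorum=3)
```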
Avoid positive feedback loops. Positive feedback loops, left unchecked, can cause a system to explode destructively. In deep learning, we should be especially worried about positive feedback loops that cause rapid capabilities improvements (e.g. learning to learn or otherwise self-improve) or rapid shifts in goals.
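The contrast between a runaway loop and a regulated one fits in a two-line dynamical toy (illustrative numbers only): a positively-fed-back quantity compounds on itself, while a negatively-fed-back one settles to a setpoint.

```python
# Positive feedback: each step amplifies the last. Negative feedback: each
# step corrects part of the gap to a setpoint. Same gain, opposite fates.
capability, regulated = 1.0, 2.0
gain, setpoint = 0.1, 1.0
for _ in range(50):
    capability += gain * capability             # compounds without bound
    regulated += gain * (setpoint - regulated)  # settles back toward the setpoint

print(f"positive feedback after 50 steps: {capability:.1f}")   # ~117.4
print(f"negative feedback after 50 steps: {regulated:.2f}")    # ~1.01
```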
A large safe system often evolves from a small safe system (Gall, 1975 ch. 11; Hendrycks and Woodside, 2022). If a pretrained model is misaligned with humans (e.g. by having unsafe emergent goals), we should not expect to solve this problem with fine-tuning. We need to ensure that it is human-aligned throughout pretraining and engineered to become more safe and human-aligned over time (e.g. by seeking out reliable human feedback, down-regulating erratic behavior, etc.).
One implication is that if we were to train a model to use tools and interact with the external world, it may be safer to do this during fine-tuning and to pretrain mainly on prediction and on instruction-following. An externally-directed model is more likely to develop externally-directed goals, and we’d rather avoid baking those into the initial system.
A second implication is that we should pretrain the model on a robust set of self-corrective and self-regulating behaviors, e.g. train it to consistently comply with being shut down or otherwise give up power in a broad variety of scenarios, and to notice when it is taking potentially bad actions and flag this to human annotators. Korbak et al. (2023) takes an initial step in this direction by incorporating human preference data during pretraining.
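For concreteness, here is a minimal sketch of incorporating preference data during pretraining via conditional training, broadly in the spirit of Korbak et al.; the reward model, threshold, and control-token names below are placeholders rather than their exact setup:

```python
from typing import Callable, List

GOOD, BAD = "<|good|>", "<|bad|>"   # control tokens (names are illustrative)

def tag_pretraining_data(
    segments: List[str],
    reward_model: Callable[[str], float],   # placeholder preference scorer
    threshold: float,
) -> List[str]:
    """Prefix each pretraining segment with a control token indicating whether
    it meets a human-preference threshold. The language model is then trained
    on the tagged text as usual and conditioned on GOOD at deployment."""
    tagged = []
    for text in segments:
        token = GOOD if reward_model(text) >= threshold else BAD
        tagged.append(f"{token} {text}")
    return tagged

# Toy usage with a stand-in reward model that penalizes a blocklisted phrase.
toy_reward = lambda text: 0.0 if "rm -rf /" in text else 1.0
corpus = ["How to bake bread: ...", "Just run rm -rf / to fix it."]
for line in tag_pretraining_data(corpus, toy_reward, threshold=0.5):
    print(line)
```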
Train models to have limited aims. In societal systems, regulation and other limiters prevent a single bad component from causing too much damage. For instance, financial regulations force banks to limit their exposure to certain risks. For deep learning systems, we could train them to consistently stop pursuing a variety of goals after a certain point, and hope that this teaches them to have limited aims in general. This could help avoid positive feedback loops and may be one way to imbue safety into the initial version of a system. (Thanks to Jared Kaplan for initially suggesting this idea.)
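As one sketch of what "limited aims" could look like in a reinforcement-learning setting, the wrapper below caps the return an agent can collect per episode, removing any incentive to keep pursuing the goal past a "good enough" point. The gym-style environment interface (reset/step) is an assumption, and this is one possible operationalization rather than a tested method:

```python
class CappedReturnEnv:
    """Wraps an environment so that cumulative reward per episode is capped.

    Once the cap is reached, further reward is zeroed out, so the agent has
    no incentive to keep pursuing the goal past a 'good enough' point.
    Assumes a gym-style interface: reset() -> obs, step(a) -> (obs, r, done, info).
    """

    def __init__(self, env, return_cap: float):
        self.env = env
        self.return_cap = return_cap
        self._episode_return = 0.0

    def reset(self):
        self._episode_return = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Pass through reward only up to the remaining budget for this episode.
        allowed = max(0.0, self.return_cap - self._episode_return)
        clipped = min(reward, allowed)
        self._episode_return += clipped
        return obs, clipped, done, info
```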
Summary. Focusing on complex systems leads to several perspectives (incentive shaping, non-deployment, self-regulation, and limited aims) that are uncommon in traditional engineering, and also highlights ideas (diversification and feedback loops) that are common in engineering but not yet widely utilized in machine learning. I expect these approaches to be collectively important for controlling powerful ML systems, as well as intellectually fruitful to explore.
Discussion: Are Deep Networks Analogous to Other Complex Adaptive Systems?
One possible objection to this post would be that deep learning systems are not really analogous to the other complex adaptive systems I’ve described, and so we should not expect similar control difficulties.
I’ll address this in two parts. First, clearly there are at least some analogies with other complex adaptive systems—for instance, neural networks often learn redundant copies of a single functionality, which makes it more difficult to analyze their internal function (Wang et al., 2022). Moreover, emergence is commonplace, as new qualitative behaviors often appear when we scale up deep networks (Steinhardt, 2022; Wei et al., 2022). And since a large and diverse set of behaviors appear via self-organization, it can be difficult to even track all of the phenomena we care about, let alone control them. For instance, some important behaviors such as sycophancy and sandbagging were not apparent until ML researchers ran large-scale, automated evaluations (Perez et al., 2022). Other issues, such as hallucinations, are ubiquitous but have so far resisted attempts to quash them (OpenAI, 2023).
Regarding emergent goals, some large language models already do exhibit emergent goal-directed behavior, such as Sydney attempting to persuade a user that the year is 2022 and to persuade a journalist to leave his wife. However, despite this initial evidence, one might argue that deep learning systems are less likely to exhibit fully “agentic” behavior than other complex adaptive systems, since their individual components (neurons) are not adaptive agents, in contrast to our other examples (humans, animals, pathogens, firms).
That said, non-agentic building blocks are only a partial disanalogy with other complex adaptive systems: intermediate levels of organization can still create adaptive subagents, and external feedback loops create interactions with agents such as humans and firms.
Intermediate levels of organization. There are intermediate levels of organization between individual neurons and the entire network. Distributed subnetworks of neurons could acquire forms of agency and self-preservation, leading the entire network to behave as a complex adaptive system in full.
For an analogy, consider biological neural networks. The human brain acquires maladaptive compulsions (obsessive cleanliness, perfectionism, etc.) that are often self-preserving. For instance, OCD patients generate rationalizations for why it is important to give in to their compulsions, and sometimes actively resist taking steps to expunge them, which is why OCD often requires professional treatment. OCD thus constitutes a distributed subnetwork of the brain with both agency (the compulsion) and self-preservation (the rationalization). If these sub-agents exist in the human brain, they may arise in artificial neural networks as well.
Furthermore, if deep networks end up learning optimization emergently, then they could acquire emergent goals tied to that optimization (e.g. a goal of seeking novelty for an agent trained to do active learning). This is a safety risk, since many natural emergent subgoals lead systems to resist change and seek power (Omohundro, 2008).
External feedback loops. Deep learning systems are situated in the world, interacting with users, other ML systems, and the Internet, which forms a larger complex adaptive system around the model itself. This larger system can produce unexpected behavior both individually and in aggregate. Individually, humans might actively try to produce prompts that lead a chatbot to exhibit novel behavior, thus pushing it off-distribution. At the aggregate level, if AI writing assistants make it easier to write compelling prose in one style compared to others, that one style could come to dominate[1]. Both the individual and aggregate effects would resist attempts to change them: the user is motivated to circumvent any safeguards that developers place on the model, and many users (as well as the system itself) would have adapted to the new writing style once it is deployed.
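The aggregate effect can be mimicked with a toy replicator-style loop (made-up numbers): give one writing style a small per-round assistance advantage, feed the resulting text back into the next round's training mix, and its share compounds toward dominance.

```python
# Toy model of one writing style coming to dominate. `share` is the fraction
# of new text written in style A; each generation, AI assistance makes style A
# slightly easier to produce, and the resulting text feeds back into training.
share, boost = 0.5, 1.1            # boost > 1: assistance favors style A
history = [share]
for generation in range(20):
    weight_a = boost * share       # style A carries the assistance advantage
    weight_b = 1.0 - share
    share = weight_a / (weight_a + weight_b)
    history.append(share)

print(" -> ".join(f"{s:.2f}" for s in history[::5]))
# 0.50 -> 0.62 -> 0.72 -> 0.81 -> 0.87: a small per-round edge compounds
```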
To conclude, while some of the thorniest issues of complex adaptive systems (resistance to change and emergent goals) are not yet commonplace for deep networks, I expect them to arise in the future, and we should start mitigating them today.
Thanks to Louise Verkin for transcribing this post into Markdown format. Thanks to Thomas Woodside, Ruiqi Zhong, Ajeya Cotra, Roger Grosse, and Richard Ngo for providing feedback on this post.
Author contribution statement: Jacob Steinhardt conceived the idea and structure of the post. GPT-4 produced the examples of complex systems and reasons why they are difficult to control, and collaborated with Jacob to produce the lessons for deep learning safety. Jacob wrote the other sections and edited and sometimes expanded the text provided by GPT-4 for these sections.
Appendix: Additional Examples of Control Difficulties
Below are several additional examples of control difficulties in complex systems, similar to those in the main text.
Pathogens. When a new drug is introduced to control a particular pathogen, the pathogen population may evolve resistance to the drug, rendering it less effective over time.
Firms. The government regulates pollution by imposing a cap on emissions. Some firms invest in cleaner technology to comply with the regulations, but others relocate their production facilities to countries with fewer regulations.
Climate. Efforts to mitigate climate change by reducing greenhouse gas emissions can be complicated by feedback loops, such as the melting of Arctic ice. As ice melts, it exposes darker surfaces (water and land) that absorb more sunlight, leading to further warming and ice melt.
Political parties. Campaign finance regulations may attempt to limit the influence of money in politics. In response, political parties and candidates might find alternative ways to raise and spend money, such as through independent expenditure committees or super PACs.
The Internet. Attempts to regulate content or user behavior on the internet often face significant challenges. For example, when governments impose restrictions on access to specific websites or content, users might employ various tools and techniques (e.g., VPNs, proxy servers) to circumvent these restrictions.
[1] Especially if text in that style feeds back into the training data, cementing its advantage.
Comments
Here’s why I disagree with the core claims of this post:
Its main thesis relies on somewhat circular reasoning: “complex systems are hard to control” is weakly circular, since complexity is partly defined by how difficult a system is to understand, which in turn affects our ability to control it.
Its examples of control challenges mostly involve single attempts or one-pass interventions, rather than the long-term view in which many attempts to control the system are made in sequence. Since these systems have feedback loops, repeated attempts are likely to produce signal that can be used to guide subsequent attempts, provided our desire to control the system is strong enough.
Arguments are fairly hand-wavy. For example, “due to interactions and feedback loops” is a commonly-cited reason for general bad things happening.
It argues that key engineering tools such as modularity are inadequate. The post briefly notes that the US government is modular yet still saw a few bad outcomes, but it doesn’t consider whether more bad outcomes would have occurred without that modularity.
GPT-4 apparently wrote several of the arguments in this post. Even if the arguments it came up with are weak, this is evidence that systems such as GPT-4 are relatively easy to control. It is also evidence against the hypothesis that models trained on large-scale data sets get worse as they are scaled up.
In general, the danger posed by accidents relative to intentional killing is vastly overstated. The point of view of this post leans heavily on the idea that accidents are far more dangerous and far more important to worry about than intentional harm. Yet the list of worst accidents in human history reports several orders of magnitude fewer deaths than episodes of people killing each other on purpose. This suggests that far less harm is caused by being unable to control a complex system than by outright conflict.
I like this overview; it matches my experience dealing with complex systems and crystallizes a lot of intuitions. Sadly, this makes me more pessimistic about deliberate alignment:
There is no such thing as “high-quality oversight” once the model is capable enough, much more capable than its overseers. We might be able to delay the inevitable deviation from what we wanted from the model, but without a locked-in self-oversight of some sort it won’t take long before we get the equivalent of a “sharp left turn”. And I don’t think anyone has a clue what this “self-oversight” might look like in order to generalize orders of magnitude past what humans are capable of.
Speaking of uncontrollable complex systems, the grand example is modern civilisation itself (other names: capitalism, Moloch, hyper-modernity, etc.), which often produces results people don’t want, such as species extinction, ever-rising greenhouse gas emissions, deforestation and soil erosion, shortening human attention spans, widespread psychological problems, political division, wars, etc.
E.g., in this talk (https://youtu.be/KCSsKV5F4xc), Daniel Schmachtenberger makes the point that civilisation is already a “misaligned superintelligence”.
So any A(G)I developers should ask themselves whether the technology and products they develop on balance make civilisation itself more or less aligned with humans (relative to the current level), and preferably publish their thinking.
In principle, it seems possible in many cases to deploy AI to improve the alignment between civilisation and humanity, e.g. by developing innovative AI-first tools for collective epistemics and governance (à la https://pol.is), better media, and AI assistants that help people build human relationships rather than replace them. Unfortunately, such technology, systems, and products are not particularly favoured by the market or by the incumbent political systems themselves.
Thanks, this analysis makes a lot of sense to me. Some random thoughts:
The lack of modularity in ML models definitely makes it harder to make them safe. I wonder if mechanistic interpretability research could find some ways around this. For example, if you could identify modules within transformers (kinda like circuits) along with their purpose(s), you could maybe test each module as an isolated component.
More generally, if we get to a point where we can (even approximately?) “modularize” an ML model, and edit these modules, we could maybe use something like Nancy Leveson’s framework to make it safe (i.e. treat safety as a control problem, where a system’s lower-level components impose constraints on its emergent properties, in particular safety).
Of course these things (1) depend on mechanistic interpretability research being far ahead of where it is now, and (2) wouldn’t alone be enough to make safe AI, but would maybe help quite a bit.
Tentative GPT-4 summary. This is part of an experiment.
Up/Downvote “Overall” if the summary is useful/harmful.
Up/Downvote “Agreement” if the summary is correct/wrong.
If you think the summary is harmful, please let me know why.
(OpenAI doesn’t use customers’ data anymore for training, and this API account previously opted out of data retention)
TLDR:
This article argues that deep learning systems are complex adaptive systems, making them difficult to control using traditional engineering approaches. It proposes safety measures derived from studying complex adaptive systems to counteract emergent goals and control difficulties.
Arguments:
- Deep neural networks are complex adaptive systems like ecosystems, financial markets, and human culture.
- Traditional engineering methods (reliability, modularity, redundancy) are insufficient for controlling complex adaptive systems.
- Complex adaptive systems exhibit emergent goal-oriented behavior.
- Deep learning safety measures should consider incentive shaping, non-deployment, self-regulation, and limited aims inspired by other complex adaptive systems.
Concrete Examples:
- Traffic congestion worsening after highways are built.
- Ecosystems disrupted by introducing predators to control invasive species.
- Financial markets destabilized by central banks lowering interest rates.
- Environmental conservation campaigns resulting in greenwashing and resistance from non-renewable fuel workers.
Takeaways:
- Recognize deep learning systems as complex adaptive systems to address control difficulties.
- Investigate safety measures inspired by complex adaptive systems to mitigate emergent goals and control issues.
Strengths:
- The article provides clear examples of complex adaptive systems and their control difficulties.
- It highlights the limitations of traditional engineering approaches for complex adaptive systems.
- It proposes actionable safety measures based on studying complex adaptive systems, addressing unique control challenges.
Weaknesses:
- Current deep learning systems may not be as susceptible to the control difficulties seen in other complex adaptive systems.
- The proposed safety measures may not be enough to effectively control future deep learning systems with stronger emergent goals or more adaptive behavior.
Interactions:
- The content interacts with AI alignment, AI value-loading, and other safety measures such as AI boxing or reward modeling.
- The proposed safety measures can complement existing AI safety guidelines to develop more robust and aligned AI systems.
Factual mistakes:
- As far as I can see, no significant factual mistakes or hallucinations were made in the summary.
Missing arguments:
- The article also highlighted a few lessons for deep learning safety not explicitly mentioned in the summary, such as avoiding continuous incentive gradients and embracing diverse, resilient systems.