Recent discussions about artificial intelligence safety have focused heavily on ensuring AI systems remain under human control. While this goal seems laudable on its surface, we should carefully examine whether some proposed safety measures could paradoxically enable rather than prevent dangerous concentrations of power.
The Control Paradox
The fundamental tension lies in how we define “safety.” Many current approaches to AI safety focus on making AI systems more controllable and aligned with human values. But this raises a critical question: controllable by whom, and aligned with whose values?
When we develop mechanisms to control AI systems, we are essentially creating tools that could be used by any sufficiently powerful entity—whether that’s a government, corporation, or other organization. The very features that make an AI system “safe” in terms of human control could make it a more effective instrument of power consolidation.
Natural Limits on Human Power
Historical examples reveal how human nature itself acts as a brake on totalitarian control. Even the most powerful dictatorships have faced inherent limitations that AI-enhanced systems might easily overcome:
The Trust Problem: Stalin’s paranoia about potential rivals wasn’t irrational—it reflected the real difficulty of ensuring absolute loyalty from human subordinates. Every dictator faces this fundamental challenge: they can never be entirely certain of their underlings’ true thoughts and loyalties.
Information Flow: The East German Stasi, despite maintaining one of history’s most extensive surveillance networks, still relied on human informants who could be unreliable, make mistakes, or even switch allegiances. Human networks inherently leak and distort information.
Cognitive Limitations: Hitler’s micromanagement of military operations often led to strategic blunders because no human can effectively process and control complex operations at scale. Human dictators must delegate, creating opportunities for resistance or inefficiency.
Administrative Friction: The Soviet Union’s command economy faltered partly because human bureaucrats couldn’t possibly process and respond to all necessary information quickly enough. Even the most efficient human organizations have inherent speed and coordination limits.
These natural checks on power could vanish in a human-AI power structure where AI systems provide perfect loyalty, unlimited information processing, and seamless coordination.
The Human-AI Nexus
Perhaps most concerning is the potential emergence of what we might call “human-AI power complexes”—organizational structures that combine human decision-making with AI capabilities in ways that amplify both. These entities could be far more effective at exercising and maintaining control than either humans or AIs alone.
Consider a hypothetical scenario:
A government implements “safe” AI systems to help with surveillance and social control
These systems are perfectly aligned with their human operators’ objectives
The AI helps optimize propaganda, predict dissent, and manage resources
The human elements provide strategic direction and legitimacy
This isn’t a scenario of AI taking over—it’s a scenario of AI making existing power structures more effective at maintaining control by eliminating the natural limitations that have historically constrained human power. However, this specific scenario is merely illustrative. The core argument—that AI safety measures could enable unprecedented levels of control by removing natural human limitations—holds true across many possible futures.
Alignment as Enabler of Coherent Entities
Dangerous power complexes, from repressive governments to exploitative corporations, have existed throughout history. What well-aligned AI brings to the table, however, is the potential for these entities to function as truly unified organisms, coherent entities unconstrained by human organizational limits.
Dynamics of Inevitable Control?
Many notable figures, including over a hundred AI scientists, have voiced concerns about the risk of extinction from AI.[1] In a previous post[2] I described my own intuitions about this.
The key intuition is that when entities with vastly different capability levels interact, fundamental dynamics push the more capable entities toward taking control. The specific path to power concentration matters less than understanding these dynamics, and it makes sense to be concerned even if we cannot predict exactly how they will play out.
When this intuition is applied to AI, people usually consider pure artificial intelligence entities, but in reality we should expect combined human-AI entities to reach dangerous capabilities before pure artificial intelligence does.
The Offensive Advantage
There’s another crucial dynamic that compounds these risks: when multiple entities possess similar capabilities, those focused on seizing control may hold a natural advantage. This offensive asymmetry emerges for several reasons:
Defensive entities must succeed everywhere, while offensive ones need only succeed once
Aggressive actors can concentrate their resources on chosen points of attack
Those seeking control can operate with single-minded purpose, while defenders must balance multiple societal needs
Defensive measures must be transparent enough to inspire trust, while offensive capabilities can remain hidden
This means that even if we develop “safe” AI systems, the technology may naturally favour those most determined to use it for control. Like a martial art that claims to be purely defensive, the techniques we develop could ultimately prove most valuable to those willing to repurpose them for aggression.
The Double Bind of Development
The situation presents another layer of concern: by making AI more controllable and therefore more commercially viable, we accelerate AI development itself. Each advance in AI safety makes the technology more attractive for investment, speeding our journey toward the very risks we’re trying to mitigate. We’re not just creating the tools of control; we’re accelerating their development.
Rethinking Our Approach
The arguments presented here lead us to some uncomfortable but important conclusions about the nature of AI safety research. While the intention behind such research is laudable, we must confront the possibility that these efforts could be fundamentally counterproductive.
Rather than focusing on making AI more controllable, we might need to fundamentally reframe our approach to AI development and deployment.
Are there ways to maintain and strengthen traditional checks and balances in human institutions?
We should carefully consider the role of decentralized architectures, which can help resist consolidation of power, but can also make AI harder to regulate and accelerate the spread of dangerous capabilities
Slowing rather than safeguarding AI development might be the more prudent path
While many in the AI community already recognize the strategic importance of keeping capabilities research private, we should consider extending this thinking to alignment and safety research. Though this may seem counter-intuitive to those who view safety work as a public good, the dual-use nature of control mechanisms suggests that open publication of safety advances could accelerate the development of more effective tools for centralized control
Those working on AI safety, particularly at frontier AI companies, must then grapple with some difficult questions:
If your work makes AI systems more controllable, who will ultimately wield that control?
When you make AI development “safer” and thus more commercially viable, what power structures are you enabling?
How do the institutional incentives of your organization align with or conflict with genuine safety concerns?
What concrete mechanisms will prevent your safety work from being repurposed for control and consolidation of power?
How can you balance the benefits of open research collaboration against the risks of making control mechanisms more widely available?
Conclusion
The arguments presented in this essay lead to an uncomfortable but inescapable conclusion: many well-intentioned efforts to make AI systems more controllable may be actively hastening the arrival of unprecedented mechanisms of social control. This is not merely a theoretical concern about future scenarios—it is already manifesting in the development of increasingly sophisticated surveillance and influence systems.[3][4][5]
The alignment trap presents itself most insidiously not through malicious intent, but through the gradual optimization of systems toward ever more perfect control. Each incremental advance in AI capabilities and controllability—each apparent success in alignment—may be taking us further down a path from which there is no return.
It’s a trap baited with our best intentions and our deepest fears. The time to question this is now—before mechanisms of perfect control snap shut around us.
Footnotes
[1] From the Centre for AI Safety: Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.
[2] Is There a Power Play Overhang?
[3] See Chapter 3: Responsible AI in the Stanford 2024 AI Index Report, “Political deepfakes are easy to generate and difficult to detect” (diagram).
[4] From How AI surveillance threatens democracy everywhere (Bulletin of the Atomic Scientists): According to the 2019 AI Global Surveillance Index, 56 out of 176 countries now use artificial intelligence in some capacity to keep cities “safe.”
[5] From The Global Struggle Over AI Surveillance: Emerging Trends and Democratic Responses (National Endowment for Democracy): From cameras that identify the faces of passersby to algorithms that keep tabs on public sentiment online, artificial intelligence (AI)-powered tools are opening new frontiers in state surveillance around the world.
Great post and great points.
Alignment researchers usually don’t think of their work as a means to control AGI. They should.
We usually think of alignment as a means to create a benevolent superintelligence. But just about any workable technique for creating a value-aligned AGI will work even better for creating an intent-aligned AGI that follows instructions. Keeping a human in the loop and in charge bypasses several of the most severe Lethalities by effectively adding corrigibility. What human in control of a major AGI project would take an extra risk to benefit all of humanity instead of ensuring that the AGI will follow their values by following their instructions?
That sets the stage for even more power-hungry humans to seize control of projects and AGIs with the potential for superintelligence. I fully agree that there’s a scary first-mover advantage benefitting the most vicious actors in a multipolar human-controlled AGI scenario; see If we solve alignment, do we die anyway?.
The result is a permanent dictatorship. Will the dictator slowly become more benevolent once they have absolute power? The pursuit of power seems to corrupt more than having secure power does, so maybe—but I would not want to bet on it.
However, I’m not so sure about hiding alignment techniques. I think the alternative to human-controllable AGI isn’t really slower progress, it’s uncontrollable AGI, which will pursue its own weird ends and wipe out humanity in the process, for the reasons classical alignment thinking describes.
I think there is a pretty real tradeoff you are pointing out, though I personally wouldn’t put that much weight on AI control accelerating AI capabilities as a negative factor, primarily because at least one actor in the AI race will by default scale capabilities approximately as fast as is feasible (I’m talking about OpenAI here), so methods to make AI more controllable produce pretty much strict safety improvements against existential catastrophes that rely on AI control having gone awry.
I’m also not so confident that control/alignment measures will work out by default that I consider progress on AI alignment/control work to be negative, though I do think it might soon not be the best approach to keeping humanity safe.
However, I think this post does address a pretty real tradeoff that I suspect will soon become fairly tight: there is a tension between making AI more controllable and making AI not abusable by very bad humans. Even more importantly, making alignment work go better also increases the ability of dictators to do things, and even more worryingly increases s-risks.
Do not mistake me for endorsing Andrew Sauer’s solution here, because I don’t, but there is a very clear reason to expect that plausibly large numbers of people could suffer horrifyingly in an AI future: technology such as mind uploading, for one example, combined with lots of humans genuinely having a hated outgroup that they want to abuse really badly, means that large-scale suffering can occur cheaply.
And in a world where basically all humans have 0 economic value, or even negative economic value, there’s no force pushing back against torturing a large portion of your citizenry.
I also recommend the book Avoiding The Worst for understanding why s-risk could be a very big problem.
See links below:
https://www.lesswrong.com/posts/CtXaFo3hikGMWW4C9/the-case-against-ai-alignment
https://www.amazon.com/dp/B0BK59W7ZW
https://centerforreducingsuffering.org/wp-content/uploads/2022/10/Avoiding_The_Worst_final.pdf
I don’t agree with the conclusion that alignment and safety research should be kept private, since I do think it’s still positive in expectation for people to have more control over AI systems, but I agree with the point of the post that there is a real tradeoff involved here.
I think you bring up some important points here. I agree with many of your concerns, such as strong controllable AI leading to a dangerous concentration of power in the hands of the most power-hungry first movers.
I think many of the alternatives are worse though, and I don’t think we can choose what path to try to steer towards until we take a clear-eyed look at the pros and cons of each direction.
What would decentralized control of strong AI look like?
Would some terrorists use it to cause harm?
Would some curious people order one to become an independent entity just for curiosity or as a joke? What would happen with such an entity connected to the internet and actively seeking resources and self-improvement?
Would power then fall into the hands of whichever early mover poured the most resources into recursive self-improvement? If so, we’ve then got a centralized power problem again, but now the filter is ‘willing to self-improve as fast as possible’, which seems like it would select against maintaining control over the resulting stronger AI.
A lot of tricky questions here.
I made a related post here, and would enjoy hearing your thoughts on it: https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy
A likely answer is “an AI”.
This honestly depends on the level of control achieved over AI in practice.
I do agree with the claim that there are pretty strong incentives to have AI peacefully take over everything, but this is a long-term incentive. More importantly, if control gets good enough, at least some people would wield control of AI, because of AIs wanting to be controlled by humans combined with AI control strategies being good enough that you might avoid takeover, at least in the early regime.
To be clear, in the long run I expect an AI is likely (as in 70-85% likely) to wield the fruits of control, but I think that humans will at least at first wield the control for a number of years, maybe followed by uploads of humans, like virtual dictators and leaders next in line for control.
The point is that the “controller” of a “controllable AI” is a role that can be filled by an AI, not only by a human or a human institution. AI is going to quickly grow the pie to the point where the current industry and economy (controlled by humans) become a rounding error, so it seems unlikely that, among the entities vying for control over controllable AIs, humans and human institutions are going to be worth mentioning. It’s not even about a takeover; Google didn’t take over Gambia.
The aim of avoiding an AI takeover that ends poorly for humanity is not about preventing dangerous concentrations of power. Power that is distributed among AIs and not concentrated is entirely compatible with an AI takeover that ends poorly for humanity.
I don’t think I agree with this post, but I thought it provided a fascinating alternative perspective.