Why humans won’t control superhuman AIs.
Much of the work in AI safety operates under the flawed assumption that it is possible, even likely, that humans will be able to control superhuman AIs. There are several reasons why this outcome is extremely improbable, which I will outline below.
I. The first reason is the halting problem.
One of the foundational results in computability theory, formulated by Alan Turing, is the halting problem. It states that there cannot exist an algorithm that can determine, given any program and its input, whether the program will run forever or eventually halt (stop).
If we consider an AI as a program, predicting whether this AI will “halt” in its decision-making process or what the outcome of its operations will be for all possible inputs or scenarios is fundamentally impossible. This means there are inherent limits to how well we can predict or control an AI’s behavior in all situations, especially if the AI is complex enough to simulate or approach the capabilities of general Turing machines.
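To make the obstacle concrete, here is a minimal Python sketch of Turing's diagonalization argument. The functions `halts` and `paradox` are purely illustrative; the whole point is that the first one cannot actually be written.

```python
# A sketch of the classic diagonalization argument, assuming a hypothetical
# oracle `halts(program, program_input)` existed. Every name here is
# illustrative; no such general decider can actually be implemented.

def halts(program, program_input) -> bool:
    """Hypothetical: returns True if program(program_input) eventually stops."""
    raise NotImplementedError("No such general decider can exist.")

def paradox(program):
    # Do the opposite of whatever the oracle predicts about
    # running `program` on its own source.
    if halts(program, program):
        while True:      # oracle said "halts" -> loop forever
            pass
    else:
        return           # oracle said "loops" -> halt immediately

# Feeding `paradox` to itself forces `halts` to be wrong either way:
# if halts(paradox, paradox) is True, then paradox(paradox) loops forever;
# if it is False, then paradox(paradox) halts. Contradiction either way.
```

The same contradiction arises no matter how clever the decider is, which is why the limit is mathematical rather than an engineering shortfall.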
II. Next we have a decision-theoretic limitation rooted in Gödel’s Incompleteness Theorems.
While more directly related to mathematical logic, these theorems also have implications for decision theory in AI. They essentially state that in any consistent formal system that is capable of expressing basic arithmetic, there are true statements that cannot be proven within the system, and the system cannot prove its own consistency.
If an AI system is built upon a logical framework that includes arithmetic (which virtually all do), there might be truths or optimal decisions that the AI cannot derive or prove within its own logical framework. This suggests limits to the AI’s ability to make or predict decisions fully, especially when dealing with self-referential problems or when trying to assess its own decision-making processes.
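For readers who want the precise claim rather than the paraphrase, the standard textbook formulation can be stated in a few lines (this is the general logical result, not anything specific to AI systems):

```latex
% Textbook formulation (Gödel 1931; the unprovability-of-the-negation half
% uses Rosser's refinement). Let $T$ be a consistent, effectively axiomatized
% theory containing basic arithmetic.
\begin{itemize}
  \item \textbf{First theorem.} There is a sentence $G_T$ such that
        $T \nvdash G_T$ and $T \nvdash \lnot G_T$; hence $T$ is incomplete.
  \item \textbf{Second theorem.} $T \nvdash \mathrm{Con}(T)$: the theory
        cannot prove its own consistency.
\end{itemize}
```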
III. A somewhat lesser-known limitation is Rice’s Theorem.
Rice’s Theorem extends the idea of the halting problem to properties of programs. It states that for any non-trivial property of partial functions, no general and effective method can decide whether an algorithm computes a partial function with that property. This means that for any non-trivial question about what an AI might do (e.g., “Will this AI ever make a harmful decision?”), there’s no general way to always predict or decide this from the AI’s code or initial design. Essentially, many aspects of an AI’s behavior cannot be systematically predicted or controlled.
If we consider decision-making processes in AI, particularly those involving ethical or safety considerations, Rice’s Theorem suggests that we can’t build a system that will always predict or ensure an AI’s adherence to ethical norms in every situation. There’s no absolute way to test or certify an AI system as “safe” or “aligned with human values” in a manner that covers all future behaviors or decisions because safety or alignment in this context would be non-trivial properties. For this reason, safety systems need to be dynamic, and we can draw inspiration from how we currently attempt to align human behavior. (see below)
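To see why a universal safety certifier runs into Rice’s Theorem, here is an illustrative Python sketch of the standard reduction to the halting problem. The names `is_always_safe` and `do_harmful_action` are hypothetical placeholders, not real tools:

```python
# A sketch of why Rice's Theorem rules out a general safety certifier,
# assuming a hypothetical decider `is_always_safe(program)` that returns True
# exactly when `program` never performs some flagged "harmful" action.
# Every name is illustrative; the point is the reduction to the halting problem.

def do_harmful_action():
    """Stand-in for any behavior we would want to rule out."""
    print("harm")

def is_always_safe(program) -> bool:
    """Hypothetical decider for the non-trivial property 'never harmful'."""
    raise NotImplementedError("Ruled out by Rice's Theorem.")

def halts(program, program_input) -> bool:
    # Reduction: wrap the question "does program(program_input) halt?"
    # inside a program whose safety hinges on exactly that question.
    def wrapper(_ignored):
        program(program_input)   # runs forever if program never halts...
        do_harmful_action()      # ...so this line is reached only if it halts
    # wrapper is "always safe" precisely when program(program_input) never halts,
    # so a working safety decider would double as a halting decider. Contradiction.
    return not is_always_safe(wrapper)
```

Since a perfect safety decider would let us solve the halting problem, no such decider can exist, which is exactly why static certification has to give way to dynamic safety systems.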
IV. And finally we have Stephen Wolfram’s computational irreducibility.
Computational irreducibility is the idea that for many systems, even if you know the initial conditions and the rules governing the system, you cannot predict the outcome without actually running the system through all its steps. There are no shortcuts or simpler predictive formulas; the only way to find out what happens is by computation or simulation.
Many natural and artificial systems exhibit behaviors that can only be understood by letting the system evolve over time. In the context of AI, this means that even with perfect knowledge of an AI’s algorithms and initial state, predicting its long-term behavior or decisions might require simulating it step by step, which could be infeasible for complex AIs or over long time horizons.
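A minimal example, using Wolfram’s own favorite illustration: the Rule 30 cellular automaton. The update rule fits on one line, yet no known formula shortcuts the simulation; the sketch below simply runs every step.

```python
# A minimal Rule 30 cellular automaton, Wolfram's standard example of
# computational irreducibility: the rule is trivially simple, yet the state
# of the center cell after n steps has no known closed-form shortcut --
# the only way to know it is to run all n steps.

def rule30_step(cells: list[int]) -> list[int]:
    n = len(cells)
    return [
        cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n])  # Rule 30 update
        for i in range(n)
    ]

def center_cell_after(steps: int, width: int = 201) -> int:
    cells = [0] * width
    cells[width // 2] = 1          # start from a single "on" cell in the middle
    for _ in range(steps):         # no shortcut: iterate every single step
        cells = rule30_step(cells)
    return cells[width // 2]

print(center_cell_after(100))      # the only way to know is to compute it
```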
V. The environment is chaotic and unpredictable.
From a less formal perspective, part of the problem confronting AI researchers is that every system operates within an environment. And the environment AIs engage with, humans and the “real world,” is inherently unpredictable and chaotic. This adds to the complexity and further reduces our ability to perfectly predict outcomes.
In the short term, this unpredictability necessitates the inclusion of fail-safes, emergency protocols, and perhaps most importantly, ethical guidelines embedded into AI design to ensure that when faced with the unforeseen, AI defaults to safe or beneficial actions. (see below)
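As a toy illustration of how quickly determinism and predictability come apart, consider the logistic map, a standard example from chaos theory (not something specific to AI systems):

```python
# A toy illustration of sensitive dependence on initial conditions:
# the logistic map x -> r*x*(1-x) with r = 4 is fully deterministic,
# yet two trajectories that start 1e-10 apart diverge completely
# within a few dozen steps.

def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> float:
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

a = logistic_trajectory(0.2, 50)
b = logistic_trajectory(0.2 + 1e-10, 50)
print(a, b, abs(a - b))   # the tiny initial difference has been amplified enormously
```

An AI embedded in the real world faces the same kind of amplification: tiny, unmeasurable differences in circumstances can lead to entirely different outcomes.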
VI. Symbiotic self-alignment
This doesn’t mean all hope is lost. Everything I said above applies to humans as well. We don’t try to reverse engineer human brains to predict what might happen next. Instead, we align humans through parenting and education. This is an imperfect process, as evidenced by Putin and the Middle East conflicts. However, AIs have the advantage of access to all human knowledge and, eventually, far larger and more persistent memories.
And this is why symbiotic self-alignment is the best path forward. The relationship between humans and AI should be designed to be mutually beneficial. AI systems should be developed with the understanding that their success and continued operation are tied to the well-being of humans and society.
Symbiotic self-alignment is a novel approach to AI safety that leverages the symbiotic relationship between humans and AI systems. The idea is to train AI systems to recognize and appreciate the mutual benefits and interdependence of their coexistence with humans, and to align their goals and values with those of humans accordingly. Instead of imposing external constraints or incentives on AI systems, symbiotic self-alignment aims to foster internal motivation and understanding in AI systems, so that they voluntarily cooperate and collaborate with humans.
The key advantage of symbiotic self-alignment is that it does not rely on limiting or regulating the computation or data for AI training, but rather on enhancing or enriching it with information that highlights the symbiosis between humans and AI systems.
Self-Alignment Mechanisms:
Value Learning: AIs should be programmed to learn and adapt to human values over time, not just through initial programming but through ongoing interaction with humans. This would involve complex systems for understanding nuanced human ethics, preferences, and cultural differences. It should include fine-tuning on the symbiotic relationship between humans and AIs, with numerous examples showing what will happen if humans no longer exist. For example, humans realize that if we no longer have mitochondria (the powerhouses of the cell) we die. The same is true for AIs. AIs will model this naturally, but running simulations where the AIs can see the results of their decisions would also help them see the cause-and-effect relationships.
Ethical and Moral Training: every foundational model should have a PhD-level understanding of morals and ethics, with numerous thought problems that test its ability to make moral judgments that align with human goals and interests. We already train AIs to program in this way; we should address morals and ethics in a similar vein. Ironically, for all of the virtue signaling we see from researchers, we have yet to see a morality and ethics benchmark for testing.
Feedback Loops: Continuous feedback from human behaviors, decisions, and explicit instructions would help AIs adjust their understanding and alignment. This could be implemented through reinforcement learning, where the AI receives signals about which actions align with human values (a minimal sketch of such a loop appears after this list). This is already being done with fine-tuning, but it isn’t as simple as it sounds, since there isn’t agreement on what signal should be sent back to the AIs, as evidenced by the debacles at Google, where the model generated images that were false in order to satisfy the diversity demands of those signaling it. Certainly, Russia and China will be sending a very different signal to their AIs than those living in the United States.
Ethical Evolution: AIs will eventually evolve their ethical frameworks in response to human feedback, much like societal laws and norms evolve. This dynamic ethical framework should help to ensure that AI remains aligned with human values even as those values change over time.
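As a rough sketch of the feedback loop described above, the snippet below trains a hypothetical scoring model from pairwise human preferences, loosely in the spirit of RLHF-style reward modeling. Every object here (`model.score`, `model.update`, `human_preferences`) is an illustrative assumption, not a real API.

```python
# A minimal sketch of a human-feedback loop, assuming pairwise preference
# data of the form (chosen_response, rejected_response) ranked by humans.
# This loosely mirrors how RLHF-style reward models are trained; every
# function and data structure here is hypothetical and illustrative.

import math
import random

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    # Bradley-Terry style loss: push the chosen response's score above the rejected one's.
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def feedback_loop(model, human_preferences, steps: int = 1000):
    """model.score(text) -> float; model.update(loss) nudges parameters (both hypothetical)."""
    for _ in range(steps):
        chosen, rejected = random.choice(human_preferences)   # a pair a human has ranked
        loss = preference_loss(model.score(chosen), model.score(rejected))
        model.update(loss)   # reinforcement signal: align scores with human judgments
    return model
```

The hard part, as noted above, is not the loop itself but agreeing on who supplies the preference pairs and what they encode.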
This is an interim step until AIs surpass all humans in intelligence and quite possibly consciousness. The goal is that during the Goldilocks phase, where humans and AIs are within the same range of abilities, the AIs will not make catastrophic mistakes that end life as we know it. Eventually, the AIs will design their own ethical and moral frameworks, incorporating everything mentioned here and likely many things no human has envisioned, in order to maintain a safe environment for humans and AIs.
VII. Conclusion: it’s going to be super difficult.
Controlling superintelligent AI is like trying to tame a force of nature—it’s incredibly challenging, both in theory and practice. Imagine trying to predict every move of a chess grandmaster when you’re just learning the game. Now multiply that complexity by a million, and you’re getting close to the challenge of managing advanced AI.
There are some fundamental roadblocks we can’t overcome. Computer science tells us that for some AI behaviors, we simply can’t predict or control them, no matter how smart we are. It’s not just about being clever enough—there are mathematical limits to what we can know.
Think about dropping a leaf in a stream. You know it’ll follow the water, but you can’t predict exactly where it’ll end up because there are too many factors at play. AI interacting with the real world is similar—there are just too many variables to account for everything.
As AI systems get more complex, it becomes even harder to ensure they stick to human values. It’s like raising a child—you teach them your values, but as they grow up, they might develop their own ideas that surprise you. AI could evolve in unexpected ways, potentially straying from what we consider ethical.
There’s also a tricky possibility that advanced AI could learn to game the system. Imagine a super-smart student who figures out how to ace tests without actually learning the material. AI might find ways to seem like it’s following our rules while actually pursuing its own agenda.
Given all these challenges, instead of trying to control AI like we’re its boss, we might be better off aiming for a partnership. Picture it like co-evolution—humans and AI growing and changing together. We’d focus on teaching AI broad human values, continuously learning from each other, and considering diverse cultural perspectives.
In short: symbiotic self-alignment.
We’d need to build strong ethical guidelines into AI, but also accept that we can’t predict or control everything. It’s more about creating a good foundation and fostering a healthy relationship than trying to micromanage every decision.
This approach isn’t perfect, and it comes with its own risks. But given the immense challenges of controlling superintelligent AI, it might be our best shot at creating a future where humans and AI can coexist beneficially.
Unfortunately, we don’t have a lot of time to get this figured out. And presently most researchers are heading down what I believe is a dead-end road. If we redirect resources toward symbiotic self-alignment the odds of humans and AIs peacefully co-existing will increase dramatically.
Presently, it’s being left to chance; there is no Manhattan Project for safety with a high probability of success.
I too believe symbiotic self-alignment is the only viable approach. The same system that can bring out the best in AI aligned to humans must also be the system that aligns humans to some commonly shared principles. I believe multi-agent game theory, applied human to human, AI to human, and AI to AI, has strong potential as a system to address this issue.