Doom doubts: is inner alignment a likely problem?
After reading Eliezer's list of lethalities, I have doubts (hopes?) that some of the challenges he mentions will occur.
Let's start with inner alignment. Let's think step by step.
1. Inner alignment is a new name for a long-known challenge common to many systems. Whether it's called the agency problem or the challenge of delegation, handing a task to another entity and then ensuring that entity not only does what you want but does it in a way you approve of is something people and institutions have dealt with since the first tribes. It is not an emergent property of AGI that will need to be navigated from a blank slate.
2. Humans and AGI are aligned on the need to manage inner alignment. While deception by the mesa-optimizer ("agent") must be addressed, both humans and the AGI agree that agents must be prevented from going rogue, i.e. taking actions that fulfill their sub-goal but thwart the overall mission.
3. The AGI will be much more powerful than its agents. An agent will logically have fewer resources at its disposal than the overall system, and for delegation to provide leverage, the number of agents must be significant; if there are only a few, their work can be subsumed by the overall system instead of spun off into agents that incur alignment challenges. Since there will be a large number of agents, each one will hold only a fraction of the overall system's power, which implies the system will have considerable resources available to monitor and correct deviations from its mission.
4. An AGI that doesn't solve inner alignment, with or without human help, isn't going to make it to superintelligence (SI). An SI will be able to get things done as planned and intended (at least according to the SI's understanding; I'm not addressing outer alignment here). If it can't stop its own agents from doing things it agrees are not the mission, it's not an SI.
Does that make sense? Agree? Disagree?
I think estimating the probability/plausibility of real-world inner alignment problems is a neglected issue.
However, I donât find your analysis very compelling.
Number 1 seems to me to approach this from the wrong angle. This is a technical problem, not a social problem. The social version of the problem seems to share very little in common with the technical version.
Number 2 assumes the AGI is aligned. But inner alignment is a barrier to that. You cannot work from the assumption that we have a powerful AGI on our side when solving alignment problems, unless you've somehow separately ensured that that will be the case.
Number 3 isn't true in current deep learning architectures, e.g. GPT; it seems you'd have to design a system where that's the case. And it's not yet obvious whether that's a promising route.
Number 4: an AGI could make it to SI through inner alignment failure. The inner optimizer could solve the inner alignment problem for itself while refusing to be outer-aligned.
It's a security issue: in general, a mesa-optimizer is not intended or expected to be there in particular at all (apart from narrow definitions of inner alignment describing more tractable setups, which won't capture the actually worrying mesa-optimization that happens inside architectural black boxes; see point 17). Given enough leeway, it might get away with setting up steganographic cognition and persist in influencing its environment, even in the more dignified case where there actually is monitoring for mesa-optimizers inside the black boxes.
It is true that an AGI may want to create other AGIs and therefore may have to deal with both outer and inner alignment problems itself. Even if it initially just creates copies of itself, the copies may develop somewhat independently and become misaligned. They may even aggregate into organizations that have their own incentives and implied goals, separate from those of any of their components. If it intends to create a more powerful successor or to self-modify, the challenges it faces will be many of the same ones we face in creating AGI at all.
This isnât a cause for optimism.
That just makes the problem worse for us. A weakly superhuman AGI that doesn't itself solve all alignment problems may create or become strongly superintelligent successors that don't share any reasonable extrapolation of its previous goals. Those successors would then be even more likely to diverge from anything compatible with human flourishing than if it had solved alignment.
I think you are confusing inner and outer alignment, but I'm not sure.
I think it's right. Inner alignment is getting the mesa-optimizers (agents) aligned with the overall objective. Outer alignment ensures the AI understands an overall objective that humans want.
Not quite. Inner alignment, as originally conceived, is about the degree to which the trained model is optimizing for accomplishing the outer objective. Theoretically, you can have an inner-misaligned model that doesn't have any subagents (though I don't think this is how realistic AGI will work).
E.g., I weakly suspect that one reason deep learning models are so overconfident is an inner misalignment between the predictive patterns SGD instills and the outer optimization criterion: SGD systematically under-penalizes the model's predictive patterns for over-confident mispredictions. If true, that would represent an inner misalignment without any sort of deception or agentic optimization from the model's predictive patterns, just an imperfection in the learning process.
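Purely as a toy sketch of one way such under-penalization could arise mechanically (my own illustration, not something established above): with softmax cross-entropy, the gradient with respect to the logits is p - y, which is bounded, so the corrective signal for a wildly overconfident misprediction is only marginally larger than for a mildly overconfident one, even though the loss itself grows without bound.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grad(logits, target):
    # Gradient of cross-entropy loss w.r.t. the logits: p - y (one-hot y).
    p = softmax(logits)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

# Two wrong predictions of very different confidence (true class is 1).
mild = ce_grad([2.0, 0.0], target=1)      # ~88% confident in the wrong class
extreme = ce_grad([10.0, 0.0], target=1)  # ~99.995% confident in the wrong class

# The losses differ by roughly 5x, but the per-step gradient signal barely
# does: both gradients are close to [1, -1], so the corrective push an
# SGD step applies saturates rather than scaling with overconfidence.
print(mild, extreme)
```

Whether this bounded-gradient behavior is actually the cause of observed overconfidence is speculative, as the comment above notes; the snippet only shows that the penalty per update step saturates.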
More broadly, I don't think we actually want truly "inner-aligned" AIs. I think that humans, and RL systems more broadly, are inner-*misaligned* by default, and that this fact is deeply tied in with how our values actually work. I think that, if you had a truly inner-aligned agent acting freely in the real world, that agent would wirehead itself as soon as possible (which is the action that generates maximum reward for a physically embedded agent). E.g., humans being inner-misaligned is why people who learn that wireheading is possible for humans don't immediately drop everything in order to wirehead.
I see. So the agent issue I address above is a sub-issue of overall inner alignment.
In particular, I was addressing deceptively aligned mesa-optimizers, as discussed here: https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers
Thanks!