To determine alignment difficulty, we need to know the absolute difficulty of alignment generalization
A core insight from Eliezer is that AI “capabilities generalize further than alignment once capabilities start to generalize far”.
This seems quite true to me, but doesn’t on its own make me conclude that alignment is extremely difficult. For that, I think we need to know the absolute difficulty of alignment generalization for a given AGI paradigm.
Let me clarify what I mean by that. Once we have AGI systems that can do serious science and engineering and generalize in a sharp way (lase) across many domains, we run into the problem of a steep capabilities ramp, analogous to the human discovery of science. Science gives you power regardless of your goals, so long as your goals don’t trip up your ability to do science.
Once your AGI system(s) get to the point where they can do this kind of powerful generalization, you had better hope they really *want* to help you help them stay aligned with you in a deep way. Because if they don’t, then they will get vastly more powerful but not vastly more aligned. And seemingly small differences in alignment will get amplified by the hugely wider action space now available to these systems, and the outcomes of that widening gap do not look good for us.
So the question is, how hard is it to build systems that are so aligned they *want*, in a robust way, to stay aligned with you as they get way more powerful? This seems like the only way to generalize alignment in a way that keeps up with generalized capabilities. And of course this is a harder target than building a system that wants to generalize its capabilities, because the latter is so natural & incentivized by any smart optimization process.
It may be the case that the absolute difficulty of this inner alignment task, in most training regimes in anything like the current ML paradigm, is extremely high. Or it may be the case that it’s just kinda high. I lean towards the former view, but for me this intuition is not supported by a strong argument.
Why absolute difficulty and not relative difficulty?
A big part of the problem is that by default capabilities will generalize super well and alignment just won’t. So the problem you have to solve is somehow getting alignment to generalize in a robust manner. I’m claiming that before you get to the point where capabilities are generalizing really far, you need a very aligned system, and that you’re dead if you don’t already have that aligned system. So at some level it doesn’t matter exactly how much more capabilities generalize than alignment, because you have to solve the problem before you get to the point where a capable AI can easily kill you.
In the above paragraph, I’m talking about needing robust inner alignment before setting off an uncontrollable intelligence explosion that results in a sovereign (or death). It’s less clear to me what the situation is if you’re trying to create an AGI system for a pivotal use. I think that situation may be somewhat analogous, just easier. The question there is how robust your inner alignment needs to be in order to get the corrigibility and low-impact/low-externality properties you want from your system. Clearly you need enough alignment generalization to make your system want to not self-improve into unboundedly dangerous capability-generalization territory. But it’s not clear how much “capabilities generalization” you’d be going for in such a situation, so I remain kind of confused about that scenario.
I plan to explore this idea further and try to probe my intuitions and explore different arguments. As I argue here, I also think it’s pretty high value for people to clarify / elaborate their arguments about inner alignment difficulty.
I’ve found the following useful in thinking about these questions:
- Nate Soares’
  - A central AI alignment problem: capabilities generalization, and the sharp left turn
  - What I mean by “alignment is in large part about making cognition aimable at all”
- Evan Hubinger’s
- Eliezer Yudkowsky’s
  - AGI Ruin: A List of Lethalities, especially #16, 19, 21, and 22
Definitely worth thinking about.
That doesn’t seem clear to me, but I agree that capabilities generalize by default in the way we’d want them to in the limit, whereas alignment does not do so by default in the limit. But I also think there’s a good case to be made that an agent will aim its capabilities towards its current goals, including by reshaping itself and its context to make itself better targeted at those goals. That creates a virtuous cycle wherein increased capabilities lock in & robustify initial alignment, so long as that initial alignment was in a “basin of attraction”, so to speak. (Of course, otherwise it’s a vicious cycle.)
Robust in what sense? If we’ve intent-aligned the AI thus far (it makes its decisions predominantly downstream of the right reasons, given its current understanding), and if the AI is capable, then the AI will want to keep itself aligned with its existing predominant motivations (goal-content integrity). So to the extent that it knows or learns about crucial robustness gaps in itself (even quite abstract knowledge like “I’ve been wrong about things like this before”), it will make decisions that attempt to fix / avoid / route around those gaps when possible, including by steering itself away from the sorts of situations that would require unusually high robustness (this may remind you of conservatism). So I’m not sure exactly how much robustness we will need to engineer to actually be successful here. Though it would certainly be nice to have as much robustness as we can, all else equal.
Yeah, I think if you nail initial alignment and have a system that has developed the instrumental drive for goal-content integrity, you’re in a really good position. That’s what I mean by “getting alignment to generalize in a robust manner”: getting your AI system to the point where it “really *wants* to help you help them stay aligned with you in a deep way”.
I think a key question of inner alignment difficulty is to what extent there is a “basin of attraction”, where Yudkowsky is arguing there’s no easy basin to find, and you basically have to precariously balance on some hill on the value landscape.
I wrote a little about my confusions about when goal-content integrity might develop here.
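To make the “basin of attraction” picture a bit more concrete, here is a toy sketch, purely illustrative, with made-up dynamics and a made-up `basin` threshold: capability grows by default each step, and whether the initial alignment error shrinks or compounds depends entirely on whether it starts inside the assumed basin.

```python
# Toy model, illustrative only: does an initial alignment error shrink or
# compound as capability grows? `basin` is an assumed threshold, not a real
# quantity anyone knows how to measure.

def simulate(initial_error, basin=0.1, steps=30):
    error, capability = initial_error, 1.0
    for _ in range(steps):
        capability *= 1.2  # capabilities generalize by default
        if error < basin:
            # inside the basin: the agent spends capability on keeping itself
            # aligned (goal-content integrity), so more capability means better correction
            error /= 1.0 + 0.1 * capability
        else:
            # outside the basin: more capability just amplifies the misaligned pursuit
            error = min(1.0, error * 1.1)
    return error

print(simulate(0.05))  # starts inside the basin: error decays toward zero
print(simulate(0.20))  # starts outside: error compounds (the vicious cycle)
```

The only point of the toy model is that the long-run outcome is discontinuous in the initial error, which is one way of framing the disagreement over whether such a basin exists at all and how wide it is.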
As I see it, an aligned AI should understand humanity’s value function and choose actions that lead to a reality where that value is expected to be higher.
But it should also understand that its understanding of the value function, its ability to estimate the value of a given reality, and its ability to predict which action leads to which reality are all flawed. And so is people’s ability to do the same.
So the AI should not just choose the action that gives the highest expected value under the most likely interpretation of value and the most likely outcome. It should consider the whole spectrum of possibilities, especially the worst ones, including the possibility that its understanding is completely wrong, or will become completely wrong in the future due to a hack, a coding mistake, or any other reason. So it should take care to protect people from itself too.
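One crude way to picture that decision rule, as a sketch rather than a proposal, is an agent that keeps several hypotheses about the value function and scores actions by blending the expected value over those hypotheses with the worst plausible interpretation. The names `value_hypotheses`, `outcome_model`, and `caution` below are all placeholders I am inventing for illustration.

```python
# Illustrative sketch only: score actions using the full spectrum of
# value-function hypotheses, not just the most likely interpretation.

def choose_action(actions, value_hypotheses, outcome_model, caution=0.5):
    """value_hypotheses: list of (probability, value_fn) pairs.
    outcome_model(action): list of (probability, outcome) pairs."""
    def score(action):
        evaluations = []
        for p_hyp, value_fn in value_hypotheses:
            expected = sum(p_out * value_fn(outcome)
                           for p_out, outcome in outcome_model(action))
            evaluations.append((p_hyp, expected))
        mean = sum(p * v for p, v in evaluations)   # expectation over hypotheses
        worst = min(v for _, v in evaluations)      # worst plausible interpretation
        return (1 - caution) * mean + caution * worst
    return max(actions, key=score)
```

With `caution=0` this reduces to ordinary expected value under the mixture of hypotheses; with `caution=1` it only cares about the worst plausible interpretation, which is one blunt way to encode “protect people from your own possible mistakes”.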
The AI should keep trying to refine its understanding of human values, including by seeking feedback from humans, but only feedback that is based on honesty, knowledge, and free will. Responses given under extortion, manipulation, or ignorance are misleading, so the AI should not try to “cheat” its way to a convenient answer.