A core insight from Eliezer is that AI “capabilities generalize further than alignment once capabilities start to generalize far”.
That doesn’t seem clear to me, but I agree that capabilities generalize by default in the way we’d want them to in the limit, whereas alignment does not. But I also think there’s a good case to be made that an agent will aim its capabilities towards its current goals including by reshaping itself and its context to make itself better-targeted at those goals, creating a virtuous cycle wherein increased capabilities lock in & robustify initial alignment, so long as that initial alignment was in a “basin of attraction”, so to speak. (Of course, if the initial alignment falls outside that basin, the same dynamic is a vicious cycle.)
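To make that basin-of-attraction picture a bit more concrete, here’s a minimal toy sketch. Everything in it is my own illustrative assumption, not anyone’s actual model: the scalar `alignment` variable, the made-up `basin_edge` threshold, and the update rule where capability simply amplifies whichever direction alignment is already drifting.

```python
# Toy sketch of the virtuous / vicious cycle claim above.
# Assumptions (all made up for illustration): alignment is a scalar in [0, 1],
# `basin_edge` is the hypothetical boundary of the basin of attraction, and
# capability amplifies whatever drift alignment already has.

def step(alignment: float, capability: float, basin_edge: float = 0.6,
         rate: float = 0.05) -> tuple[float, float]:
    """One step: capability grows regardless; alignment is pushed toward 1.0
    if it sits inside the basin, toward 0.0 otherwise, with the push scaling
    with capability (self-correction vs. self-reinforcing drift)."""
    direction = 1.0 if alignment >= basin_edge else -1.0
    new_alignment = min(1.0, max(0.0, alignment + direction * rate * capability))
    new_capability = capability * 1.1  # capabilities generalize either way
    return new_alignment, new_capability

for start in (0.7, 0.5):  # one trajectory inside the basin, one outside
    a, c = start, 1.0
    for _ in range(30):
        a, c = step(a, c)
    print(f"initial alignment {start:.1f} -> final alignment {a:.2f}, capability {c:.1f}")
```

In this toy setup the trajectory that starts inside the basin gets pulled to full alignment as capability grows, while the one that starts just outside is pushed away faster and faster; that asymmetry is the virtuous-versus-vicious cycle I have in mind, nothing more.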
So the question is: how hard is it to build systems that are so aligned they want, in a robust way, to stay aligned with you as they get way more powerful?
Robust in what sense? If we’ve intent-aligned the AI thus far (it makes its decisions predominantly downstream of the right reasons, given its current understanding), and if the AI is capable, then the AI will want to keep itself aligned with its existing predominant motivations (goal-content integrity). So to the extent that it knows or learns about crucial robustness gaps in itself (even quite abstract knowledge like “I’ve been wrong about things like this before”), it will make decisions that attempt to fix, avoid, or route around those gaps when possible, including by steering itself away from the sorts of situations that would require unusually high robustness (this may remind you of conservatism). So I’m not sure exactly how much robustness we will need to engineer in order to actually succeed here, though it would certainly be nice to have as much robustness as we can, all else equal.
> an agent will aim its capabilities towards its current goals including by reshaping itself and its context to make itself better-targeted at those goals, creating a virtuous cycle wherein increased capabilities lock in & robustify initial alignment, so long as that initial alignment was in a “basin of attraction”, so to speak
Yeah, I think if you nail initial alignment and have a system that has developed the instrumental drive for goal-content integrity, you’re in a really good position. That’s what I mean by “getting alignment to generalize in a robust manner”: getting your AI system to the point where it “really *wants* to help you help it stay aligned with you in a deep way”.
I think a key question of inner alignment difficulty is to what extent there is a “basin of attraction”, where Yudkowsky is arguing there’s no easy basin to find, and you basically have to precariously balance on some hill in the value landscape.
I wrote a little about my confusions about when goal-content integrity might develop here.
Definitely worth thinking about.