It seems to me that “inner” versus “outer” alignment has become a popular way of framing things largely because it has the appearance of breaking down a large problem into more manageable sub-problems. In other words, “If research group A can solve outer alignment, and research group B can solve inner alignment, then we can put them together into one big alignment solution!” Unfortunately, as you alluded to, reality does not cleanly divide along this joint. Even knowing all the details of an alignment failure might not be enough to classify it appropriately.
Of course, if a semantic distinction fails to clarify the nature of the territory, it should be rejected as part of the map. Arguing over semantics is counterproductive, especially when the ground truth of the situation is already agreed upon.
That being said, I think the process that produced the distinction between inner and outer (mis)alignment is extremely useful. Just attempting to break a large problem into smaller pieces gives the community more tools for thinking about it in ways that wouldn’t have occurred to them otherwise. The breakdown you gave in this post is an excellent example. The solution to alignment probably won’t be just one thing, but even if it is, it’s unlikely that we will find it without slicing up the problem in as many ways as we can, sifting through perspectives and sub-problems in search of promising leads. It may turn out to be useful for the alignment community to abandon the inner-outer distinction, but we shouldn’t abandon the process of making such distinctions.