You frame the use-case for the terminology as how we talk about failure modes when we critique. A second important use-case is how we talk about our plan. For example, the inner/outer dichotomy might not be very useful for describing a classifier which learned to detect sunny-vs-cloudy instead of tank-vs-no-tank (i.e., learned a simpler feature that happened to correlate with the data labels). But someone’s plan for building safe AI might involve separately solving inner alignment and outer alignment, because if we can solve those parts, it seems plausible we can put them together to solve the whole. (I am not trying to argue for the inner/outer distinction; I am only using it as an example.)
From this (specific) perspective, the “root causes” framing seems useless. It does not help from a solution-oriented standpoint, because it does not suggest any decomposition of the AI safety problem.
So, I would suggest thinking about how you see the problem of AI safety as decomposing into parts. Can the various problems be framed in such a way that if you solved all of them, you would have solved the whole problem? And if so, does the decomposition seem feasible? (Do the different parts seem easier than the whole problem, perhaps unlike the inner/outer decomposition?)