Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution and failures out of distribution. Specifically, we use the following process to categorize misalignment failures:
1. Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.
2. Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data was good? If so, this is an instance of inner misalignment.
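A minimal sketch of this two-step decision procedure, written in Python. The FailureCase type and its field names are illustrative assumptions, not anything defined above; the None branch corresponds to the non-exhaustiveness discussed next.

```python
# Hypothetical sketch of the generalization-based categorization above.
# FailureCase and its fields are assumptions made for illustration.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Misalignment(Enum):
    OUTER = "outer misalignment"  # bad feedback on the actual training data
    INNER = "inner misalignment"  # good feedback, but poor generalization


@dataclass
class FailureCase:
    bad_feedback_on_training_data: bool
    generalized_poorly_despite_good_feedback: bool


def categorize(case: FailureCase) -> Optional[Misalignment]:
    """Apply the two-step categorization; returns None for failures
    that this (non-exhaustive) categorization does not cover."""
    if case.bad_feedback_on_training_data:
        return Misalignment.OUTER
    if case.generalized_poorly_despite_good_feedback:
        return Misalignment.INNER
    return None
```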
This categorization is non-exhaustive. Suppose we create a superintelligence via a training process with a good feedback signal and no distribution shift. Should we expect that no existential catastrophe will occur during this training process?
You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the “training data” and the future inputs are the “test data”.
In the situation you describe, you would identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the cutoff (so everything before it is “training” and everything after is “test”), and then apply the categorization as usual.
(Though even in that case it’s not necessarily a generalization problem. Suppose every single “test” input happens to be identical to one that appeared in “training”, and the feedback is always good.)
It’s still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don’t expect will actually happen, so I think I’m fine with that.
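To make the online-learning extension above concrete, here is a minimal sketch, assuming a hypothetical Example record with a timestamp and a flag for whether gradients were taken on that input (neither is defined in the original discussion):

```python
# Hypothetical sketch of the online-learning extension: pick a cutoff time,
# call every earlier input that received gradients "training data" and every
# later input "test data". The Example type is an assumption for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    timestamp: float          # when the input was seen
    received_gradients: bool  # whether a gradient update was taken on it


def split_online_stream(
    stream: List[Example], cutoff: float
) -> Tuple[List[Example], List[Example]]:
    """Split an online-learning stream at `cutoff`, e.g. the moment the
    system starts executing on its plan to cause a catastrophe."""
    training = [x for x in stream if x.timestamp < cutoff and x.received_gradients]
    test = [x for x in stream if x.timestamp >= cutoff]
    return training, test
```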