You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the “training data” and the future inputs are the “test data”.
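As a rough illustration (the names and data structure here are my own, not anything prescribed by the definition), the cutoff-based categorization amounts to something like:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    timestamp: float          # when the input was seen
    received_gradient: bool   # whether the model was updated on this input

def split_by_cutoff(interactions, cutoff):
    """Partition an online-learning history at a chosen cutoff time:
    inputs before the cutoff on which gradients were applied count as
    "training data"; all inputs after the cutoff count as "test data"."""
    training = [x for x in interactions
                if x.timestamp < cutoff and x.received_gradient]
    test = [x for x in interactions if x.timestamp >= cutoff]
    return training, test
```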
In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is “training” and everything after is “test”), and then apply the categorization as usual.
(Though even in that case it’s not necessarily a generalization problem. Suppose every single “test” input happens to be identical to one that appeared in “training”, and the feedback on those inputs was always good; then a failure at “test” time can’t be blamed on encountering novel, out-of-distribution inputs.)
It’s still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don’t expect will actually happen, so I think I’m fine with that.