You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the “training data” and the future inputs are the “test data”.
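As a rough illustration (the names and data structure here are my own, not anything prescribed by the definition), the cutoff-based categorization amounts to something like:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    timestamp: float          # when the input was seen
    received_gradient: bool   # whether the model was updated on this input

def split_by_cutoff(interactions, cutoff):
    """Partition an online-learning history at a chosen cutoff time:
    inputs before the cutoff on which gradients were applied count as
    "training data"; all inputs after the cutoff count as "test data"."""
    training = [x for x in interactions
                if x.timestamp < cutoff and x.received_gradient]
    test = [x for x in interactions if x.timestamp >= cutoff]
    return training, test
```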
In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is “training” and everything after is “test”), and then apply the categorization as usual.
(Though even in that case it’s not necessarily a generalization problem. Suppose every single “test” input happens to be identical to one that appeared in “training”, and the feedback on those inputs was always good; then a failure at “test” time can’t be blamed on encountering novel, out-of-distribution inputs.)
It’s still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don’t expect will actually happen, so I think I’m fine with that.