Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution and failures out of distribution. Specifically, we use the following process to categorize misalignment failures:
1. Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.
2. Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data was good? If so, this is an instance of inner misalignment.
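A minimal sketch of this two-step decision procedure, written in Python. The FailureCase type and its field names are illustrative assumptions, not anything defined above; the None branch corresponds to the non-exhaustiveness discussed next.

```python
# Hypothetical sketch of the generalization-based categorization above.
# FailureCase and its fields are assumptions made for illustration.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Misalignment(Enum):
    OUTER = "outer misalignment"  # bad feedback on the actual training data
    INNER = "inner misalignment"  # good feedback, but poor generalization


@dataclass
class FailureCase:
    bad_feedback_on_training_data: bool
    generalized_poorly_despite_good_feedback: bool


def categorize(case: FailureCase) -> Optional[Misalignment]:
    """Apply the two-step categorization; returns None for failures
    that this (non-exhaustive) categorization does not cover."""
    if case.bad_feedback_on_training_data:
        return Misalignment.OUTER
    if case.generalized_poorly_despite_good_feedback:
        return Misalignment.INNER
    return None
```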
This categorization is non-exhaustive. Suppose we create a superintelligence via a training process with a good feedback signal and no distribution shift. Should we expect that no existential catastrophe will occur during this training process?
You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the “training data” and the future inputs are the “test data”.
In the situation you describe, you would identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the cutoff (so everything before it is “training” and everything after is “test”), and then apply the categorization as usual.
(Though even in that case it’s not necessarily a generalization problem. Suppose every single “test” input happens to be identical to one that appeared in “training”, and the feedback is always good.)
It’s still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don’t expect will actually happen, so I think I’m fine with that.
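To make the online-learning extension above concrete, here is a minimal sketch, assuming a hypothetical Example record with a timestamp and a flag for whether gradients were taken on that input (neither is defined in the original discussion):

```python
# Hypothetical sketch of the online-learning extension: pick a cutoff time,
# call every earlier input that received gradients "training data" and every
# later input "test data". The Example type is an assumption for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    timestamp: float          # when the input was seen
    received_gradients: bool  # whether a gradient update was taken on it


def split_online_stream(
    stream: List[Example], cutoff: float
) -> Tuple[List[Example], List[Example]]:
    """Split an online-learning stream at `cutoff`, e.g. the moment the
    system starts executing on its plan to cause a catastrophe."""
    training = [x for x in stream if x.timestamp < cutoff and x.received_gradients]
    test = [x for x in stream if x.timestamp >= cutoff]
    return training, test
```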