I think any training setup that calculates feedback / rewards / loss based purely on the AI’s external behavior, independent of its internal state, is almost definitely a terrible plan. If we are talking about such plans anyway, for pedagogical or other reasons, then I would defend “outer” and “inner” as a meaningful and useful distinction with definition and reasons here. I think both of your spoiler boxes are inner misalignment (by my definition at that link), because in neither case can we point to any particular moment during training where the AI’s actual external behavior was counter to the programmer’s intention, yet the AI nevertheless got a reward for that behavior.
If we are talking about training setups that do NOT give feedback / rewards based purely on external behavior, but rather also have some kind of access to the AI’s internal thoughts / motivations—which is absolutely what we should be talking about!—then yeah, the words “outer” and “inner” misalignment stop being meaningful.

(Good post, thanks for writing it.)
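To make the distinction concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical: `Step`, `behavior_only_reward`, `internals_aware_reward`, and the `motivation` probe are illustrative stand-ins, not anyone’s actual training setup or interpretability tool.

```python
# Toy illustration of the distinction above (all names hypothetical).
from dataclasses import dataclass


@dataclass
class Step:
    action: str           # the AI's externally visible behavior at this step
    internal_state: dict  # whatever we can read off the model's internals


def behavior_only_reward(step: Step) -> float:
    """Feedback computed purely from external behavior: the AI gets full
    reward whenever the action looks right, regardless of why it acted."""
    return 1.0 if step.action == "intended_action" else 0.0


def internals_aware_reward(step: Step) -> float:
    """Feedback that also consults (some reading of) the AI's internal state,
    e.g. a hypothetical probe for its 'motivation': only the right action
    taken for the right reasons is rewarded."""
    right_action = step.action == "intended_action"
    right_reason = step.internal_state.get("motivation") == "intended_goal"
    return 1.0 if (right_action and right_reason) else 0.0


if __name__ == "__main__":
    deceptive = Step(action="intended_action",
                     internal_state={"motivation": "look_good_to_overseer"})
    # Behavior-only feedback cannot distinguish this step from a genuinely
    # aligned one; internals-aware feedback can, if the probe is accurate.
    print(behavior_only_reward(deceptive))    # 1.0
    print(internals_aware_reward(deceptive))  # 0.0
```

The second function assumes reliable access to the AI’s motivation, which is of course the hard part; the sketch only shows where that access would enter the feedback computation.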
> If we are talking about such plans anyway, for pedagogical or other reasons, then I would defend “outer” and “inner” as a meaningful and useful distinction with definition and reasons here.
Yup, in the post, this is the generalization-based decomposition with “good feedback” defined as “rewards the right action and punishes the wrong action, irrespective of ‘reasons’”.
> If we are talking about training setups that do NOT give feedback / rewards based purely on external behavior, [...] the words “outer” and “inner” misalignment stop being meaningful.
I think you can extend this to such training setups in a way where it is still meaningful, by defining “good feedback” as feedback that incentivizes “doing the right thing for the right reasons”. (This is the sort of move that happens with ELK.)
My claim is more that these definitions aren’t very useful as categorizations of failure scenarios, because most situations will be some complicated mix. In the generalization-based definition, the problem is “how many pieces of bad feedback do you have to give before it counts as outer misalignment rather than inner misalignment?” In the root-cause-based definition, the problem is “how do you identify the root cause?”
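To see why the generalization-based decomposition runs into a threshold problem, here is a toy sketch (hypothetical function, names, and numbers, not a real classifier of failure scenarios): the verdict flips depending on a cutoff that the definition never pins down.

```python
# Toy illustration of the "how many pieces of bad feedback?" problem
# (hypothetical names and numbers).
from typing import List


def classify_failure(feedback_was_good: List[bool], cutoff: int) -> str:
    """feedback_was_good[i] records whether the i-th piece of training feedback
    rewarded the right action and punished the wrong one. `cutoff` is the
    (arbitrary) number of bad-feedback episodes at which we start blaming the
    reward signal rather than generalization."""
    num_bad = sum(1 for ok in feedback_was_good if not ok)
    return "outer misalignment" if num_bad >= cutoff else "inner misalignment"


if __name__ == "__main__":
    # A realistic history: overwhelmingly good feedback with a few mistakes.
    history = [True] * 997 + [False] * 3
    # Nothing in the definition tells us which cutoff is the right one.
    print(classify_failure(history, cutoff=1))   # outer misalignment
    print(classify_failure(history, cutoff=10))  # inner misalignment
```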