Yeah, I don’t think it necessarily counts as misalignment. In fact, corrigibility probably looks behaviorally a lot like this: gathering the ability to affect the world without making irreversible decisions, and waiting for the overseer to direct how to cash that out into ultimate effects. But the hidability means that “ultimate intents” or “deep intents” are conceptually murky, and therefore not obvious how to read off an agent: if you can’t discern them through behavior, what can you discern them through?
Only if we know the entire learning trajectory of the AI (including the training data) and have high-resolution interpretability mapping along the way. If we don’t have this, or if the AI learns online and is not inspected with mech-interp tools during that process, we have no way of knowing about any “deep beliefs” the AI may hold, if it doesn’t reveal them in its behavior or “thoughts” (explicit representations during inference).
Thanks, that’s a great example!