I don’t think it can be significantly harder for behavior-space than reward-space. If it were, then one of our first messages would be (a mathematical version of) “the behavior I want is approximately reward-maximizing”. I don’t think that’s actually the right way to do things, but it would at least give a reduction from the behavior-space problem to the reward-space problem.
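To make that concrete, here’s a minimal sketch of what such a message might assert, assuming (purely for illustration, none of this notation is from the discussion above) a reward function $R$ over trajectories $\tau$, an intended behavior $\pi^*$, and a tolerance $\epsilon$:

$$\pi^* \in \Big\{\,\pi \;:\; \mathbb{E}_{\tau \sim \pi}[R(\tau)] \;\ge\; \sup_{\pi'} \mathbb{E}_{\tau \sim \pi'}[R(\tau)] - \epsilon \,\Big\}$$

i.e. the message says the intended behavior is within $\epsilon$ of reward-maximizing, so whatever machinery communicates rewards can be reused to (approximately) communicate behaviors.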
Anyway, I’d say the most important difference between this and various existing strategies is that we can learn “at the outermost level”. We can treat the code itself as a message, so there can potentially be a basin of attraction even for bugs in the code. The entire ontology of the agent-model can potentially be wrong, but still end up in the basin. We can decide to play an entirely different game. Some of that could potentially be incorporated into other approaches (maybe it already has been and I just don’t know about it), though it’s tricky to really make everything subject to override later on.
Of course, the cost is that if everything is subject to override, then we really need to start inside the basin of attraction; there are no hardcoded assumptions to fall back on if things go off the rails. Hence the robustness tradeoff.
Yeah, I agree that if we had a space of messages that was expressive enough to encode this, then it would be fine to work in behavior space.