I agree that it should be possible to do this over behavior instead of rewards, but behavior-space is much larger and more complex than reward-space, so it would require significantly more data to work as well.
I don’t think it can be significantly harder in behavior-space than in reward-space. If it were, then one of our first messages would be (a mathematical version of) “the behavior I want is approximately reward-maximizing”. I don’t think that’s actually the right way to do things, but it should at least give a reduction from the behavior-space problem to the reward-space one.
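To make that reduction concrete, here is a toy sketch (my own illustration, not something from the thread) of how a single message of that form pushes a reward-space prior forward into behavior-space. The Boltzmann-rationality link, the rationality parameter beta, and the specific sizes and numbers are all assumptions chosen for illustration.

```python
# Toy sketch: the message "the behavior I want is approximately
# reward-maximizing" (modeled here as Boltzmann-rationality, an assumption)
# turns any prior over rewards into a prior over behaviors.
import numpy as np

rng = np.random.default_rng(0)

n_behaviors = 5          # a small discrete behavior space (made up)
n_rewards = 3            # candidate reward functions (made up)
beta = 4.0               # how strictly "approximately maximizing" is enforced

# R[i, j] = value of behavior j under candidate reward function i
R = rng.normal(size=(n_rewards, n_behaviors))
prior_over_rewards = np.ones(n_rewards) / n_rewards

def boltzmann(values, beta):
    """P(behavior | reward) under approximate reward-maximization."""
    z = np.exp(beta * (values - values.max()))
    return z / z.sum()

# Induced behavior-space prior: mix the Boltzmann policy of each candidate
# reward function, weighted by the reward-space prior.
prior_over_behaviors = sum(
    p * boltzmann(R[i], beta) for i, p in enumerate(prior_over_rewards)
)
print(prior_over_behaviors)
```

The point is just that any prior we could state over rewards immediately induces one over behaviors via such a message, so the behavior-space version of the problem can’t be much harder than the reward-space version plus one message.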
Anyway, I’d say the most important difference between this and various existing strategies is that we can learn “at the outermost level”. We can treat the code as a message, so there can potentially be a basin of attraction even for bugs in the code. The entire ontology of the agent-model can potentially be wrong, but still end up in the basin. We can decide to play an entirely different game. Some of that could potentially be incorporated into other approaches (maybe it has been and I just don’t know about it), though it’s tricky to really make everything subject to override later on.
Of course, the trade-off is that if everything is subject to override, then we really need to start in the basin of attraction: there are no hardcoded assumptions to fall back on if things go off the rails. Hence the robustness trade-off.
Yeah, this is a pretty common technique at CHAI (relevant search terms: pragmatics, pedagogy, Gricean semantics). Some related work:
Showing versus Doing: Teaching by Demonstration (differences when you ask humans to teach vs. demonstrate a behavior)
Inverse Reward Design (interpret the stated reward function as a message, not a reward function; see the toy sketch after this list)
Cooperative Inverse Reinforcement Learning (formalizing the interaction as a game)
Literal or Pedagogic Human? Analyzing Human Model Misspecification in Objective Learning (do you benefit from a pedagogic assumption? It turns out the literal assumption has robustness benefits, presumably because it doesn’t rule out possibilities that the human does in fact sometimes consider)
Preferences Implicit in the State of the World (interpret the state of the world as a message).
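To make the “stated reward function as a message” idea concrete, here is a toy sketch of the Inverse Reward Design posterior as I understand it; the discrete set of candidate true rewards, the two features, the Boltzmann designer model, and all the numbers are assumptions for illustration, not the paper’s exact setup.

```python
# Toy sketch of Inverse Reward Design: treat the stated proxy reward as
# evidence about the true reward, assuming the designer (approximately)
# picked a proxy whose optimal behavior looks good in the training environment.
import numpy as np

beta = 5.0  # assumed designer rationality

# Feature counts of the trajectory obtained by optimizing each proxy reward
# in the training environment (rows: proxy choices, cols: features).
phi_train = np.array([
    [1.0, 0.0],   # proxy 0 mostly produces feature 0
    [0.0, 1.0],   # proxy 1 mostly produces feature 1
])

# Candidate true reward weight vectors, with a uniform prior over them.
candidate_w = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
])
prior = np.ones(len(candidate_w)) / len(candidate_w)

def likelihood_of_proxy(proxy_idx, w):
    """P(designer states this proxy | true reward w): Boltzmann in how well
    each proxy's training-time behavior scores under the true reward w."""
    scores = phi_train @ w
    z = np.exp(beta * (scores - scores.max()))
    return (z / z.sum())[proxy_idx]

stated_proxy = 0  # the designer said "maximize feature 0"
posterior = prior * np.array(
    [likelihood_of_proxy(stated_proxy, w) for w in candidate_w]
)
posterior /= posterior.sum()
print(posterior)  # mass shifts toward true rewards that value feature 0
```

The same rough shape applies to the other items in the list: the demonstration, the stated objective, or the initial state of the world is treated as the observed “utterance”, and inference runs through a model of how the human would have produced it.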
Yeah, I agree that if we had a space of messages expressive enough to encode something like “the behavior I want is approximately reward-maximizing”, then it would be fine to work in behavior space.