I broadly agree about what our main disagreement is. Note that I’ve been mainly considering the case where the supervisor is more intelligent than the agent as well. The actual resolution of this will depend on what’s really going on during amplification, which is a bigger topic that I’ll need to think about more.
On the side disagreement (of whether looking at future states before evaluation counts as “myopic”) I think I was confused when I was discussing it above and in the original article, which made my position a bit of a mess. Sorry about that; I’ve added a clarifying note at the top of the post, and edited the post to reflect what I actually meant. My actual response to this:
Objection 2: This sacrifices competitiveness, because now the human can’t look at the medium-term consequences of actions before providing feedback.
Is that in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training. However, in nonmyopic training this doesn’t matter very much, because we can assign high or low reward to some later state in the trajectory, which then influences whether the agent learns to do the original action more or less. We can’t do this in myopic training in the current paradigm, which is where the competitiveness sacrifice comes from.
E.g. my agent sends an email. Is it good or bad? In myopic training, you need to figure this out now. In nonmyopic training, you can shrug, give it 0 reward now, and then assign high or low reward to the agent when it gets a response that makes it clearer how good the email was. Then because the agent does credit assignment automatically, actions are in effect evaluated based on their medium-term consequences, although the supervisor never actually looks at future states during evaluations.
This is consistent with your position: “When I talk about myopic training vs. regular RL, I’m imagining that they have the same information available when feedback is given”. However, it also raises the question of why we can’t just wait until the end of the trajectory to give myopic feedback anyway. In my edits I’ve called this “semi-myopia”. This wouldn’t be as useful as nonmyopia, but I do agree that semi-myopia alleviates some competitiveness concerns, although at the cost of being more open to manipulation. The exact tradeoff here will depend on disagreement 1.
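To make the email example concrete, here is a minimal sketch of how the two schemes credit the "send email" action. The three-step episode, the reward values, and the helper names `discounted_returns` and `myopic_targets` are all hypothetical, chosen only for illustration; this is not any particular library's API.

```python
def discounted_returns(rewards, gamma):
    """Nonmyopic credit: G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def myopic_targets(rewards):
    """Myopic credit: each action is judged only on its immediate feedback."""
    return list(rewards)


# Hypothetical episode: step 0 sends the email, step 1 waits,
# step 2 receives a favourable reply and only then gets reward.
rewards = [0.0, 0.0, 1.0]

nonmyopic = discounted_returns(rewards, gamma=0.5)  # [0.25, 0.5, 1.0]
myopic = myopic_targets(rewards)                    # [0.0, 0.0, 1.0]
```

With nonmyopic returns, the "send email" step ends up with a positive training signal (0.25) even though the supervisor gave it 0 at the time. With myopic targets it stays at 0, so the supervisor would have had to judge the email correctly at step 0.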
Is that in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training.
I mean, this is true in the sense that the Gym interface returns a reward with every transition, but the vast majority of deep RL algorithms don’t do anything with those rewards until the trajectory is done (or, in the case of very long trajectories, until you’ve collected a lot of experience from this trajectory). So you could just as easily evaluate the rewards then, and the algorithms wouldn’t change at all (though their implementation would).
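As a rough sketch of that point, assume a toy environment whose state is a counter and a hypothetical `reward_fn` that depends only on the transition. Labelling rewards during the rollout and labelling them after the trajectory is finished then give the learner exactly the same numbers. All names here are made up for illustration and do not come from Gym or any specific RL library.

```python
def reward_fn(state, action, next_state):
    """Hypothetical reward: 1.0 when the counter reaches 3, else 0.0."""
    return 1.0 if next_state == 3 else 0.0


def rollout(policy, step, initial_state, horizon, online_reward_fn=None):
    """Collect transitions; optionally label rewards at every step."""
    transitions, rewards = [], []
    state = initial_state
    for _ in range(horizon):
        action = policy(state)
        next_state = step(state, action)
        transitions.append((state, action, next_state))
        if online_reward_fn is not None:
            rewards.append(online_reward_fn(state, action, next_state))
        state = next_state
    return transitions, rewards


def returns_to_go(rewards, gamma):
    """Discounted return from each step, as a typical update would use."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out


def policy(state):
    return 1  # toy policy: always increment


def step(state, action):
    return state + action  # toy dynamics: a counter


# Rewards labelled at every transition, as the per-step interface suggests.
_, online_rewards = rollout(policy, step, 0, 3, online_reward_fn=reward_fn)

# Rewards labelled only once the whole trajectory has been collected.
transitions, _ = rollout(policy, step, 0, 3)
deferred_rewards = [reward_fn(s, a, s2) for (s, a, s2) in transitions]

# The quantities fed to the update are identical either way.
assert returns_to_go(online_rewards, 0.5) == returns_to_go(deferred_rewards, 0.5)
```

The sketch only covers rewards that are a function of the individual transition. If the evaluator uses the rest of the trajectory when scoring a step, the deferred version has strictly more information available, which is the case being debated above.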