So it seems misleading to describe a system as myopically imitating a non-myopic system: there is no significant difference between non-myopic Q-learning and myopic imitation of Q-learning. A notion of “myopia” which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz’s critique (or at least, the part that I agree with).
I agree that there’s no difference between the training setup where you do myopic RL on a Q function and the training setup where you just do Q-learning directly, but that doesn’t at all imply that there’s no difference between internally myopically imitating some other Q-learning agent and internally doing some sort of non-myopic optimization process. As a silly example, if my Q-learning agent has some weird idiosyncratic behavior, then the model which is imitating it will exhibit the same behavior, whereas the model which is just trying to optimize the reward directly won’t.
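As a minimal illustration of the equivalence being granted here (a toy construction of my own, not something from the thread): with γ = 0, doing RL against a fixed Q function as the per-step reward selects exactly the greedy policy that the Q-learning agent itself would act on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Stand-in for the Q function of "some other Q-learning agent".
Q = rng.normal(size=(n_states, n_actions))

# Behavior of the Q-learning agent itself: act greedily on Q.
q_learning_policy = Q.argmax(axis=1)

# "Myopic RL on a Q function": per-step reward is Q(s, a) and gamma = 0,
# so the return of an action is just Q(s, a).
gamma = 0.0
hypothetical_future_value = rng.normal(size=(n_states, n_actions))
returns = Q + gamma * hypothetical_future_value   # the future term vanishes
myopic_policy = returns.argmax(axis=1)

# The two training setups pick out the same behavior.
assert (q_learning_policy == myopic_policy).all()
```

The disagreement in this comment is about what the model is doing internally to produce that behavior (copying the Q-learner, quirks and all, versus optimizing the reward directly), which the behavioral equivalence above doesn't settle.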
This especially matters in the context of HCH, because we care quite a lot about getting out as direct an imitation of HCH as possible. In particular, it matters that our model be internally trying to myopically imitate HCH rather than internally trying to non-myopically achieve the lowest loss across episodes, as the latter will lead it to output simpler answers in order to make its own job easier.
To my understanding, Abram, Richard, and I agree that myopic cognition (what you’re calling “internally myopically imitating”) would confer benefits, but we don’t think that myopic training is likely to lead to myopic cognition. That might be the crux?
Sure, but imitative amplification can’t be done without myopic training: otherwise it ceases to be imitative amplification and becomes approval-based amplification, which means you no longer have any nice guarantees about limiting to HCH.
What about imitating HCH using GAIL and AIRL? I wouldn’t really call that myopic training (if you do, I’m curious what your definition of “myopic training” is).
Both GAIL and AIRL only use expert trajectories rather than expert evaluations, which means they both satisfy the counterfactual oracle analogy, and so I would call them myopic training algorithms in the sense that I was using that term above. That being said, I definitely agree that the term is very overloaded here—some conflicting definitions:
an RL training procedure is myopic if γ=0 (see the sketch after this list);
an RL training procedure is myopic if γ=0 and it incentivizes CDT-like behavior in the limit (e.g. it shouldn’t cooperate with its past self in a one-shot prisoner’s dilemma);
an ML training procedure is myopic if the model is evaluated without (EDIT: a human) looking at its output (as in the counterfactual oracle analogy).
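To pin down the first of these definitions concretely, here is a minimal sketch (my own illustration, not from the thread) of what γ = 0 does to the training signal: the return credited to each action is just its immediate reward, so nothing downstream of the current step flows back into the update.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

rewards = [1.0, 0.0, 5.0]
print(discounted_returns(rewards, gamma=0.99))  # later rewards flow back to earlier steps
print(discounted_returns(rewards, gamma=0.0))   # [1.0, 0.0, 5.0]: each step sees only its own reward
```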
Note that the post explicitly chose the first definition (which GAIL and AIRL don’t meet). But considering the other definitions:
It seems like the second is implied by the first if you respect the RL assumptions (Cartesian boundary, episode abstractions); if you don’t respect the RL assumptions, I don’t know that “incentivizes CDT-like behavior in the limit” is achievable (or even definable; I’m not really sure what it means).
“an ML training procedure is myopic if the model is evaluated without looking at its output”
… Huh? You can’t tell whether the model is good / what direction to update it without looking at some information about the model or its effects, and if not the output, then what?
One interpretation could be “the model is evaluated without a human looking at its output”, but I don’t see why the model is less likely to develop myopic cognition if a human is looking at the output rather than some automated program. (By this definition, the majority of deep RL policies were “myopically trained”—if this is actually what you mean, let’s not use “myopic” to describe this.)
Maybe you think that the human can be manipulated but the automated program can’t be? I mostly think this is a non-problem, and I’m not sure how you’d create an automated program that can evaluate a manipulative action without itself being manipulated in the process, but I could imagine that being a benefit.
Yeah, I meant without a human looking at the output. I also agree with pretty much everything you just said. We’re pretty deep in this comment chain now and I’m not exactly sure how we got here. I agree that Richard’s original definition was based on the standard RL definition of myopia, though my point was that Richard’s attempt to make imitative amplification non-myopic turned it into approval-based amplification: his version has a human evaluate the output rather than comparing it against HCH with a distance metric, which I see as the defining difference between imitative and approval-based amplification.
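To make the distance-metric-versus-human-evaluation distinction concrete, here is a hedged sketch; the function names and the particular choice of distance are my own illustrative assumptions, not anything specified in this thread.

```python
import torch
import torch.nn.functional as F

def imitative_amplification_loss(model, question, hch_logits):
    """Imitative signal: a distance between the model's output distribution and
    (an approximation of) HCH's answer to the same question. No one scores the
    model's own answer; KL divergence is just one possible distance metric."""
    model_logits = model(question)
    return F.kl_div(F.log_softmax(model_logits, dim=-1),
                    F.softmax(hch_logits, dim=-1),
                    reduction="batchmean")

def approval_based_amplification_loss(model, question, human_approval):
    """Approval-based signal: a human (or a model of one) evaluates the model's
    own output. In practice this would drive an RL-style update against the
    approval score rather than a differentiable loss; it is written this way
    only to highlight where the output gets looked at."""
    answer = model(question)
    return -human_approval(question, answer)
```

Both losses are computed per question, but only the first keeps the training signal a pure comparison against HCH, which is what the limit-to-HCH argument above relies on; once a human scores the output directly, you are in the approval-based regime.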