Sure, but imitative amplification can’t be done without myopic training or it ceases to be imitative amplification and becomes approval-based amplification, which means you no longer have any nice guarantees about limiting to HCH.
What about imitating HCH using GAIL and AIRL? I wouldn’t really call that myopic training (if you do, I’m curious what your definition of “myopic training” is).
Both GAIL and AIRL only use expert trajectories rather than expert evaluations, which means they both satisfy the counterfactual oracle analogy, and so I would call them myopic training algorithms in the sense that I was using that term above. That being said, I definitely agree that the term is very overloaded here—some conflicting definitions:
1. an RL training procedure is myopic if γ=0;
2. an RL training procedure is myopic if γ=0 and it incentivizes CDT-like behavior in the limit (e.g. it shouldn’t cooperate with its past self in a one-shot prisoner’s dilemma);
3. an ML training procedure is myopic if the model is evaluated without (EDIT: a human) looking at its output (as in the counterfactual oracle analogy).
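For concreteness, a minimal sketch of what the first definition picks out (the function and per-step rewards below are hypothetical, not taken from any of these proposals): with γ=0 the return that weights each policy update collapses to the immediate reward, so nothing downstream of the current step enters the gradient.

```python
# Minimal sketch (hypothetical rewards, no real training loop): the return that
# weights a policy-gradient update under the first definition (gamma = 0) versus
# a non-myopic discount.

def returns(rewards, gamma):
    """Discounted return G_t = r_t + gamma * G_{t+1} for each step of one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

rewards = [1.0, 0.0, 10.0]           # hypothetical per-step rewards

print(returns(rewards, gamma=0.0))   # [1.0, 0.0, 10.0]      -> each step is scored by r_t alone
print(returns(rewards, gamma=0.99))  # ~ [10.8, 9.9, 10.0]   -> each step is credited with future reward
```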
Note that the post explicitly chose the first definition (which GAIL and AIRL don’t meet). But considering the other definitions:
Seems like the second is implied by the first if you respect the RL assumptions (Cartesian boundary, episode abstractions); if you don’t respect the RL assumptions, I don’t know that “incentivizes CDT-like behavior in the limit” is achievable (or even definable; I’m not really sure what it means).
“an ML training procedure is myopic if the model is evaluated without looking at its output”
… Huh? You can’t tell whether the model is good / what direction to update it without looking at some information about the model or its effects, and if not the output, then what?
One interpretation could be “the model is evaluated without a human looking at its output”, but I don’t see why the model is less likely to develop myopic cognition if a human is looking at the output rather than some automated program. (By this definition, the majority of deep RL policies were “myopically trained”—if this is actually what you mean, let’s not use “myopic” to describe this.)
Maybe you think that the human can be manipulated but the automated program can’t be? I mostly think this is a non-problem, and am not sure how you create the automated program that can evaluate a manipulative action without being manipulated in the process, but I could imagine that being a benefit.
Yeah, I meant without a human looking at the output. I also agree with pretty much everything you just said. We’re pretty deep in this comment chain now and I’m not exactly sure why we got here—I agree that Richard’s original definition was based on the standard RL definition of myopia, though I was making the point that Richard’s attempt to make imitative amplification non-myopic turned it into approval-based amplification. Richard’s version has a human evaluate the output rather than a distance metric, which I see as the defining difference between imitative and approval-based amplification.
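To make that last distinction concrete, here is a toy sketch (the function names and signatures are hypothetical, not anyone’s actual proposal): the imitative loss only compares the model’s output to HCH’s output under a fixed distance metric, while the approval loss depends on an evaluator looking at and rating the output itself.

```python
# Toy sketch of the distinction (hypothetical names, not anyone's actual setup).

def imitative_loss(model_output: str, hch_output: str, distance) -> float:
    # Imitative amplification: score the model's answer by a fixed distance metric
    # to HCH's answer; nobody evaluates how good the answer itself is.
    return distance(model_output, hch_output)

def approval_loss(model_output: str, rate) -> float:
    # Approval-based amplification: a human (or amplified overseer) looks at the
    # model's answer and rates it; the training signal now depends on that evaluation.
    return -rate(model_output)
```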