I think this is where I disagree with this argument: you can get myopic agents which are competitive on long-run tasks because they are trying to do something like “be as close to HCH as possible,” which results in good long-run task performance without the objective actually being specified in terms of the long-term consequences of the agent’s actions.
I’m somewhat conflicted here. I sympathize with Rohin’s sibling comment. One of my take-aways from the discussion between Rohin and Ricraz is that it’s really not very meaningful to classify training as myopic/nonmyopic based on superficial features, such as whether feedback is aggregated across multiple rewards. As Ricraz repeatedly states, we can shift back and forth between “myopic” and “nonmyopic” by e.g. using a nonmyopic reasoner to provide the rewards to a perfectly myopic RL agent. Rohin pushes back on this point by pointing out that the important thing about a myopic approval-directed agent (for example) is the additional information we get from the human in the loop. An approval-directed berry-collecting agent will not gain approval from steps which build infrastructure to help fool the human judge, whereas a non-myopic approval-seeking RL agent could estimate high expected value for taking such steps. But the same cannot be said of an agent myopically trained to approximate an approval-seeking RL agent.
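To make the berry-collector contrast concrete, here is a toy calculation (entirely my own, with made-up numbers): an approval-based agent chooses between honestly collecting berries forever and spending one low-approval step building infrastructure that fools the judge, after which every step gets inflated approval.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

horizon = 100
always_collect = [1.0] * horizon                 # honest approval every step
build_then_fool = [0.1] + [2.0] * (horizon - 1)  # one disapproved step, then inflated approval

for gamma in (0.0, 0.99):
    print(f"gamma={gamma}: collect={discounted_return(always_collect, gamma):.2f}, "
          f"fool={discounted_return(build_then_fool, gamma):.2f}")
# gamma=0.0:  collect=1.00,   fool=0.10    (the myopic agent never takes the fooling step)
# gamma=0.99: collect~63.40,  fool~124.89  (the non-myopic agent estimates high value for it)
```

The fooling step only looks attractive to an optimizer that gets credit for its long-term consequences.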
So it seems misleading to describe a system as myopically imitating a non-myopic system—there is no significant difference between non-myopic Q-learning and myopic imitation of Q-learning. A notion of “myopia” which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz’s critique (or at least, the part that I agree with).
OTOH, Rohin’s defense of myopia turns on a core claim:
While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.
So, if we want a useful concept of “myopic training”, it seems it should support this claim—i.e., myopic training should be the sort of optimization for which small errors in the loss function are unlikely to create huge errors in outcomes.
Going back to the example of myopically imitating HCH, it seems what’s important here is how errors might be introduced. If we assume HCH is trusted, then a loss function which introduces independent noise on HCH’s answers to different questions would be fine. On the other hand, an approximation which propagated those errors along HCH trees—so that a wrong conclusion about human values influences many, many upstream computations—would be not-fine, in the same way non-myopic RL is not-fine.
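As a caricature of that distinction (entirely my own toy model, with made-up numbers): suppose each HCH node averages its sub-answers and adds its own error. Independent per-question noise mostly washes out at the root, whereas a systematically wrong sub-conclusion that every node repeats compounds level by level.

```python
import random

def hch_answer(depth, branching, error_model):
    """Toy HCH node: the average of its sub-answers plus this node's own error.
    The 'true' answer at every node is 0.0, so the return value is the root error."""
    if depth == 0:
        return error_model()
    subs = [hch_answer(depth - 1, branching, error_model) for _ in range(branching)]
    return sum(subs) / branching + error_model()

# (a) small independent noise on each answer: the errors partly cancel out.
independent = hch_answer(depth=6, branching=2, error_model=lambda: random.gauss(0, 0.01))

# (b) a same-sized but systematic error, e.g. one wrong conclusion about human values
# that every node inherits instead of recomputing: the errors add up level by level.
systematic = hch_answer(depth=6, branching=2, error_model=lambda: 0.01)

print(abs(independent), systematic)  # typically roughly 0.01 to 0.02, vs exactly 0.07
```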
I’m not sure how to resolve this in terms of a notion of “myopic training” which gets at the important thing.
So it seems misleading to describe a system as myopically imitating a non-myopic system—there is no significant difference between non-myopic Q-learning and myopic imitation of Q-learning. A notion of “myopia” which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz’s critique (or at least, the part that I agree with).
I agree that there’s no difference between the training setup where you do myopic RL on a Q function and the training setup where you just do Q-learning directly, but that doesn’t at all imply that there’s no difference between internally myopically imitating some other Q-learning agent and internally doing some sort of non-myopic optimization process. As a silly example, if my Q-learning agent has some weird idiosyncratic behavior, then the model which is imitating it will exhibit the same behavior, whereas the model which is just trying to optimize the reward directly won’t.
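To spell out that silly example in toy form (my own illustration, with made-up numbers): suppose the Q-learner has picked up a quirk that makes it prefer the worse of two actions. An imitator reproduces the quirk; a fresh optimizer of the reward does not.

```python
ACTIONS = ["left", "right"]
reward = {"left": 0.0, "right": 1.0}    # the actual reward (made up)
quirky_q = {"left": 1.1, "right": 1.0}  # the Q-learner's idiosyncratic estimates

q_learner_choice = max(ACTIONS, key=quirky_q.get)       # "left": the quirk in action
imitator_choice = q_learner_choice                      # copies the quirk exactly
reward_optimizer_choice = max(ACTIONS, key=reward.get)  # "right": ignores the quirk

print(q_learner_choice, imitator_choice, reward_optimizer_choice)  # left left right
```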
This especially matters in the context of HCH because we care quite a lot about getting out as direct an imitation of HCH as possible. In particular, it matters quite a lot that our model be internally trying to myopically imitate HCH rather than internally trying to non-myopically get the least loss across episodes, as the latter will lead it to output simpler answers to make its job easier.
To my understanding Abram, Richard and I agree that myopic cognition (what you’re calling “internally myopically imitating”) would confer benefits, but we don’t think that myopic training is likely to lead to myopic cognition. That might be the crux?
Sure, but imitative amplification can’t be done without myopic training, or it ceases to be imitative amplification and becomes approval-based amplification, which means you no longer have any nice guarantees about limiting to HCH.
What about imitating HCH using GAIL and AIRL? I wouldn’t really call that myopic training (if you do, I’m curious what your definition of “myopic training” is).
Both GAIL and AIRL only use expert trajectories rather than expert evaluations, which means they both satisfy the counterfactual oracle analogy, and so I would call them myopic training algorithms in the sense that I was using that term above. That being said, I definitely agree that the term is very overloaded here—some conflicting definitions:
an RL training procedure is myopic if γ=0 (see the sketch after this list);
an RL training procedure is myopic if γ=0 and it incentivizes CDT-like behavior in the limit (e.g. it shouldn’t cooperate with its past self in one-shot prisoner’s dilemma);
an ML training procedure is myopic if the model is evaluated without (EDIT: a human) looking at its output (as in the counterfactual oracle analogy).
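For concreteness, here is roughly what the first definition cashes out to in a REINFORCE-style setup (a sketch of my own, with made-up rewards): with γ=0 each action’s log-probability gradient is weighted only by its immediate reward, while with γ>0 early actions also get credit for later rewards.

```python
def credit_weights(rewards, gamma):
    """Per-step weights on the log-prob gradients: the discounted return-to-go."""
    weights = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running
    return weights

episode_rewards = [0.1, 0.2, 5.0]             # made-up rewards from one episode
print(credit_weights(episode_rewards, 0.0))   # [0.1, 0.2, 5.0]: each step judged only on its own reward
print(credit_weights(episode_rewards, 0.99))  # [~5.20, 5.15, 5.0]: early steps credited for the big later reward
```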
Note that the post explicitly chose the first definition (which GAIL and AIRL don’t meet). But considering the other definitions:
Seems like the second is implied by the first if you respect the RL assumptions (Cartesian boundary, episode abstractions); if you don’t respect the RL assumptions, I don’t know that “incentivizes CDT-like behavior in the limit” is achievable (if it’s even definable; I’m not really sure what it means).
an ML training procedure is myopic if the model is evaluated without looking at its output
… Huh? You can’t tell whether the model is good / what direction to update it without looking at some information about the model or its effects, and if not the output, then what?
One interpretation could be “the model is evaluated without a human looking at its output”, but I don’t see why the model is less likely to develop myopic cognition if a human is looking at the output rather than some automated program. (By this definition, the majority of deep RL policies were “myopically trained”—if this is actually what you mean, let’s not use “myopic” to describe this.)
Maybe you think that the human can be manipulated but the automated program can’t be? I mostly think this is a non-problem, and am not sure how you create the automated program that can evaluate a manipulative action without being manipulated in the process, but I could imagine that being a benefit.
Yeah, I meant without a human looking at the output. I also agree with pretty much everything you just said. We’re pretty deep in this comment chain now and I’m not exactly sure how we got here—I agree that Richard’s original definition was based on the standard RL definition of myopia, though I was making the point that Richard’s attempt to make imitative amplification non-myopic turned it into approval-based amplification. Richard’s version has a human evaluate the output rather than scoring it with a distance metric, which I see as the defining difference between imitative and approval-based amplification.
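As a rough sketch of that defining difference (my own toy framing; nothing here is from an actual amplification codebase): imitative amplification scores the model’s answer by its distance to HCH’s answer, while approval-based amplification scores whatever a judge rates highly.

```python
def imitative_loss(model_answer_vec, hch_answer_vec):
    """Distance metric to a fixed target; nobody judges the content of the answer."""
    return sum((m - h) ** 2 for m, h in zip(model_answer_vec, hch_answer_vec))

def approval_loss(model_answer, judge):
    """A (human or learned) judge looks at the output and rates it; higher rating, lower loss."""
    return -judge(model_answer)

print(imitative_loss([0.2, 0.9], [0.0, 1.0]))             # ~0.05: purely how far we are from HCH's answer
print(approval_loss("some answer", judge=lambda a: 0.7))  # -0.7: whatever raises the judge's rating wins,
                                                          # including, in the worst case, manipulating the judge
```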