My comments below are partially copied from earlier comments I made on a draft of this post that Richard shared with me.
I think iterated amplification is an important research direction, but I don’t see what value there is in making the supervisor output approval values to train a myopic agent on, rather than rewards to train a nonmyopic agent on.
This is possible for approval-based amplification, though it’s worth noting that I’m not sure if it actually makes sense for imitative amplification. When the loss is just the distance between the overseer’s output and the model’s output, you already have the full feedback signal, so there’s no reason to use a reward.
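As a minimal sketch of what that feedback signal looks like (illustrative only; representing answers as vectors and using squared error as the distance are assumptions, not anything specified in the thread):

```python
import torch
import torch.nn.functional as F

def imitative_amplification_loss(model_output: torch.Tensor,
                                 overseer_output: torch.Tensor) -> torch.Tensor:
    # The whole feedback signal for one question is a distance between the
    # model's answer and the overseer's answer; no reward, return, or
    # credit assignment across timesteps is involved.
    return F.mse_loss(model_output, overseer_output.detach())
```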
“Myopic thinking” has never been particularly well-specified
Though still not super well-specified, my current thinking is that an agent is thinking myopically if their goal is a function of their output across some Cartesian boundary. See the section on “Goals across Cartesian boundaries” in this post.
But based on the arguments in this post I expect that, whatever the most reasonable interpretations of “approval-directed” or “myopic” cognition turn out to be, they could be developed in nonmyopic training regimes just as well as (or better than) in myopic training regimes.
What might this look like in practice? Consider the example of an agent trained myopically on the approval of HCH. To make this nonmyopic in a trivial sense, we merely need to convert that approval into a reward using the formula I gave above. However, after just the trivial change, myopic training will outperform nonmyopic training (because the latter requires the agent to do credit assignment across timesteps). To make it nonmyopic in an interesting and advantageous sense, HCH will need to notice when its earlier evaluations were suboptimal, and then assign additional rewards to correct for those errors.
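To make the contrast concrete, here is a sketch of the two kinds of training target (a construction for illustration; it deliberately does not reproduce the specific approval-to-reward formula referenced above, and it treats approval as a per-step scalar):

```python
import torch

def myopic_targets(approvals: torch.Tensor) -> torch.Tensor:
    # Myopic training: the target for step t is just HCH's approval of the
    # action taken at step t.
    return approvals

def nonmyopic_targets(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    # Nonmyopic training: targets are discounted returns, so feedback from
    # later steps must be propagated back to earlier actions (credit
    # assignment across timesteps).
    returns = torch.zeros_like(rewards)
    running = torch.tensor(0.0)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```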
This is definitely the point here that I care most about. I care a lot more about myopic cognition than myopic training procedures—as I see myopic cognition as a solution to deceptive alignment—and I do find it quite plausible that you could use a non-myopic training procedure to train a myopic agent.
However, it’s worth noting that the procedure given here really looks a lot more like approval-based amplification rather than imitative amplification. And approval-based amplification doesn’t necessarily limit to HCH, which makes me somewhat skeptical of it. Furthermore, by allowing the overseer to see the model’s output in giving its feedback, the procedure given here breaks the analogy to counterfactual oracles which means that a model acting like a counterfactual oracle will no longer always be optimal—which is a real problem if the sort of myopic cognition that I want behaves like a counterfactual oracle (which I think it does).
For myopic agents to be competitive on long-term tasks, their objective function needs to be set by a supervisor which is able to accurately predict how well their actions fulfil long-term goals.
I think this is where I disagree with this argument. I think you can get myopic agents which are competitive on long-run tasks because they are trying to do something like “be as close to HCH as possible” which results in good long-run task performance without actually being specified in terms of the long-term consequences of the agent’s actions.
I’m somewhat conflicted here. I sympathize with Rohin’s sibling comment. One of my take-aways from the discussion between Rohin and Ricraz is that it’s really not very meaningful to classify training as myopic/nonmyopic based on superficial features, such as whether feedback is aggregated across multiple rewards. As Ricraz repeatedly states, we can shift back and forth between “myopic” and “nonmyopic” by e.g. using a nonmyopic reasoner to provide the rewards to a perfectly myopic RL agent. Rohin pushes back on this point by pointing out that the important thing about a myopic approval-directed agent (for example) is the additional information we get from the human in the loop. An approval-directed berry-collecting agent will not gain approval from steps which build infrastructure to help fool the human judge, whereas a non-myopic approval-seeking RL agent could estimate high expected value for taking such steps. But the same cannot be said of an agent myopically trained to approximate an approval-seeking RL agent.
So it seems misleading to describe a system as myopically imitating a non-myopic system—there is no significant difference between non-myopic Q-learning vs myopic imitation of Q-learning. A notion of “myopia” which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz’ critique (or at least, the part that I agree with).
OTOH, Rohin’s defense of myopia turns on a core claim:
While small errors in reward specification can incentivize catastrophic outcomes, small errors in approval feedback are unlikely to incentivize catastrophic outcomes.
So, if we want a useful concept of “myopic training”, it seems it should support this claim—i.e., myopic training should be the sort of optimization for which small errors in the loss function are unlikely to create huge errors in outcomes.
Going back to the example of myopically imitating HCH, it seems what’s important here is how errors might be introduced. If we assume HCH is trusted, then a loss function which introduces independent noise on HCH’s answers to different questions would be fine. On the other hand, an approximation which propagated those errors along HCH trees—so that a wrong conclusion about human values influences many, many upstream computations—would not be fine, in the same way that nonmyopic RL is not fine.
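A toy rendering of this distinction (entirely a construction for illustration, under the assumption that many HCH computations share a conclusion about human values):

```python
import random

def hch_answer(question, subanswers, noise=0.0):
    # Every top-level answer consults a shared subquestion about human values
    # plus a question-specific subanswer.
    values = subanswers["human_values"] + random.gauss(0, noise)
    local = subanswers[question] + random.gauss(0, noise)
    return 0.5 * (values + local)

subanswers = {"human_values": 1.0, "q1": 0.3, "q2": -0.2, "q3": 0.8}
questions = ["q1", "q2", "q3"]

# Fine: independent noise perturbs each answer separately and stays local.
fine = [hch_answer(q, subanswers, noise=0.05) for q in questions]

# Not fine: one wrong conclusion about human values shifts every answer
# computed from it.
subanswers["human_values"] = -1.0
not_fine = [hch_answer(q, subanswers) for q in questions]
```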
I’m not sure how to resolve this in terms of a notion of “myopic training” which gets at the important thing.
So it seems misleading to describe a system as myopically imitating a non-myopic system—there is no significant difference between non-myopic Q-learning vs myopic imitation of Q-learning. A notion of “myopia” which agrees with your usage (allowing for myopic imitation of HCH) does not seem like a very useful notion of myopia. I see this as the heart of Ricraz’ critique (or at least, the part that I agree with).
I agree that there’s no difference between the training setup where you do myopic RL on a Q function and the training setup where you just do Q-learning directly, but that doesn’t at all imply that there’s no difference between internally myopically imitating some other Q-learning agent and internally doing some sort of non-myopic optimization process. As a silly example, if my Q-learning agent has some weird idiosyncratic behavior, then the model which is imitating it will exhibit the same behavior, whereas the model which is just trying to optimize the reward directly won’t.
This especially matters in the context of HCH because we care quite a lot about getting out as direct an imitation of HCH as possible. In particular, it matters quite a lot that our model be internally trying to myopically imitate HCH rather than internally trying to non-myopically get the least loss across episodes, as the latter will lead it to output simpler answers to make its job easier.
To my understanding Abram, Richard and I agree that myopic cognition (what you’re calling “internally myopically imitating”) would confer benefits, but we don’t think that myopic training is likely to lead to myopic cognition. That might be the crux?
Sure, but imitative amplification can’t be done without myopic training or it ceases to be imitative amplification and becomes approval-based amplification, which means you no longer have any nice guarantees about limiting to HCH.
What about imitating HCH using GAIL and AIRL? I wouldn’t really call that myopic training (if you do, I’m curious what your definition of “myopic training” is).
Both GAIL and AIRL only use expert trajectories rather than expert evaluations, which means they both satisfy the counterfactual oracle analogy, and so I would call them myopic training algorithms in the sense that I was using that term above. That being said, I definitely agree that the term is very overloaded here—some conflicting definitions (a toy sketch contrasting the first and third follows the list):
an RL training procedure is myopic if γ=0;
an RL training procedure is myopic if γ=0 and it incentivizes CDT-like behavior in the limit (e.g. it shouldn’t cooperate with its past self in one-shot prisoner’s dilemma);
an ML training procedure is myopic if the model is evaluated without (EDIT: a human) looking at its output (as in the counterfactual oracle analogy).
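The toy sketch contrasting the first and third definitions (standard constructions used for illustration, not anything from the post or the comments):

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.0):
    # Definition 1: with the discount gamma = 0, the learning target collapses
    # to the immediate reward, so nothing beyond the current step affects
    # this step's update.
    return reward + gamma * np.max(next_q_values)

def distance_metric_loss(model_answer, hch_answer):
    # Definition 3: the feedback is computed mechanically from HCH's own
    # answer; no human ever looks at (or evaluates) model_answer.
    return float(np.sum((model_answer - hch_answer) ** 2))
```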
Note that the post explicitly chose the first definition (which GAIL and AIRL don’t meet). But considering the other definitions:
Seems like the second is implied by the first if you respect the RL assumptions (Cartesian boundary, episode abstractions); if you don’t respect the RL assumptions I don’t know that “incentivizes CDT-like behavior in the limit” is achievable (if even definable, I’m not really sure what it means).
an ML training procedure is myopic if the model is evaluated without looking at its output
… Huh? You can’t tell whether the model is good / what direction to update it without looking at some information about the model or its effects, and if not the output, then what?
One interpretation could be “the model is evaluated without a human looking at its output”, but I don’t see why the model is less likely to develop myopic cognition if a human is looking at the output rather than some automated program. (By this definition, the majority of deep RL policies were “myopically trained”—if this is actually what you mean, let’s not use “myopic” to describe this.)
Maybe you think that the human can be manipulated but the automated program can’t be? I mostly think this is a non-problem, and am not sure how you create the automated program that can evaluate a manipulative action without being manipulated in the process, but I could imagine that being a benefit.
Yeah, I meant without a human looking at the output. I also agree with pretty much everything you just said. We’re pretty deep in this comment chain now and I’m not exactly sure why we got here—I agree that Richard’s original definition was based on the standard RL definition of myopia, though I was making the point that Richard’s attempt to make imitative amplification non-myopic turned it into approval-based amplification. Richard’s version has a human evaluate the output rather than a distance metric, which I see as the defining difference between imitative and approval-based amplification.
Many algorithms for imitation still involve non-myopic training (e.g. GAIL, sorta, and AIRL).
I think this is where I disagree with this argument. I think you can get myopic agents which are competitive on long-run tasks because they are trying to do something like “be as close to HCH as possible” which results in good long-run task performance without actually being specified in terms of the long-term consequences of the agent’s actions.
… Why isn’t this compatible with saying that the supervisor (HCH) is “able to accurately predict how well their actions fulfil long-term goals”? Like, HCH presumably takes those actions because it thinks those actions are good for long-term goals.
In the imitative case, the overseer never makes a determination about how effective the model’s actions will be at achieving anything. Rather, the overseer is only trying to produce the best answer for itself, and the loss is determined via a distance metric. While the overseer might very well try to determine how effective its own actions will be at achieving long-term goals, it never evaluates how effective the model’s actions will be. I see this sort of trick as the heart of what makes the counterfactual oracle analogy work.
My point here is that I think imitative amplification (if you believe it’s competitive) is a counter-example to Richard’s argument in his “Myopic training doesn’t prevent manipulation of supervisors” section since any manipulative actions that an imitative amplification model takes aren’t judged by their consequences but rather just by how closely they match up with what the overseer would do.
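A compact way to see the structural difference being claimed here (a sketch under assumed vector-valued answers, with `approval_fn` a hypothetical overseer-approval function, not anything defined in the thread):

```python
import torch
import torch.nn.functional as F

def imitative_loss(model, overseer, question):
    # The overseer only answers the question for itself; the model's output
    # never enters the overseer's computation, matching the counterfactual
    # oracle analogy.
    target = overseer(question)
    return F.mse_loss(model(question), target.detach())

def approval_loss(model, approval_fn, question):
    # The overseer (via approval_fn) evaluates the model's output directly,
    # which is what breaks the analogy.
    answer = model(question)
    return -approval_fn(question, answer)
```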
“While the overseer might very well try to determine how effective its own actions will be at achieving long-term goals, it never evaluates how effective the model’s actions will be.”
Evan, do you agree that for the model to imitate the actions of the supervisor, it would be useful to mimic some of the thought processes the supervisor uses when generating those actions?
In other words, if HCH is pursuing goal X, what feature of myopic training selects for a model that is internally thinking “I’m going to try to be as close to HCH as possible in this timestep, which involves reasoning about how HCH would pursue X”, versus a model that’s thinking “I’m going to pursue goal X”? (To the extent these are different, which I’m still confused about).
“In the imitative case, the overseer never makes a determination about how effective the model’s actions will be at achieving anything. Rather, the overseer is only trying to produce the best answer for itself, and the loss is determined via a distance metric.”
I don’t really understand what you’re saying here. A thing you might be saying:
If that is what you’re saying, I don’t see why this is relevant to whether or not we should use myopic training?
(It’s possible I need to reread the counterfactual oracle analogy, though I did skim it right now and didn’t immediately see the relevance.)
My point here is that I think imitative amplification (if you believe it’s competitive) is a counter-example to Richard’s argument in his “Myopic training doesn’t prevent manipulation of supervisors” section since any manipulative actions that an imitative amplification model takes aren’t judged by their consequences but rather just by how closely they match up with what the overseer would do.
That seems to be a property of myopic cognition rather than myopic training? (See also this comment.)
I’m also confused.