I can’t speak for everyone else, but when I talk about myopic training vs. regular RL, I’m imagining that they have the same information available when feedback is given. If you would wait till the end of the trajectory before giving rewards in regular RL, then you would wait till the end of the trajectory before giving approval in myopic training.
If you have ideas for how we might supervise complex tasks like Go to a superhuman level, without assigning values to outcomes in a way that falls into the same traps as reward-based learning
… Iterated amplification? Debate? Approval-directed agents. Note a counterargument in “Against mimicry” (it technically argues against imitation, but I think it also applies to approval).
The point of these methods is to have an overseer that is more powerful than the agent being trained, so that you never have to achieve super-overseer performance (but you do achieve superhuman performance). In debate, you can think of judge + agent 1 as the overseer for agent 2, and judge + agent 2 as the overseer for agent 1.
(You don’t use the overseer itself as your ML system, because the overseer is slow while the agent is fast.)
I agree that if you’re hoping to get an agent that is more powerful than its overseer, then you’re counting on some form of generalization / transfer, and you shouldn’t expect myopic training to be much better (if at all) than regular RL at getting the “right” generalization.
But when I’m talking about competitive AGI, I’m talking about agents that can figure out novel approaches and strategies.
See above about being superhuman but sub-overseer. (Note that the agents can still come up with novel approaches and strategies that the overseer would have come up with, even if the overseer did not actually come up with them.)
humans very much think in terms of outcomes, and how good those outcomes are, by default.
… This does not match my experience at all. Most of the time it seems to me that we’re executing habits and heuristics that we’ve learned over time, and only when we need to think about something novel do we start trying to predict consequences and rate how good they are in order to come to a conclusion. (E.g. most people intuitively reject the notion that we should kill one person for organs to save 5 lives. I don’t think they are usually predicting outcomes and then figuring out whether those outcomes are good or not.)
I picture you arguing that we’ll need shaped rewards to help the agent explore,
I mean, yes, but I don’t think it’s particularly relevant to this disagreement.
TL;DR: I think our main disagreement is whether humans can give approval feedback in any way other than estimating how good the consequences of the action are (both observed and predicted in the future). I agree that if we are trying to have an overseer train a more intelligent agent, it seems likely that you’d have to focus on how good the consequences are. However, I think we will plausibly have the overseer be more intelligent than the agent, and so I expect that the overseer can provide feedback in other ways as well.
I broadly agree about what our main disagreement is. Note that I’ve been mainly considering the case where the supervisor is more intelligent than the agent as well. The actual resolution of this will depend on what’s really going on during amplification, which is a bigger topic that I’ll need to think about more.
On the side disagreement (of whether looking at future states before evaluation counts as “myopic”) I think I was confused when I was discussing it above and in the original article, which made my position a bit of a mess. Sorry about that; I’ve added a clarifying note at the top of the post, and edited the post to reflect what I actually meant. My actual response to this:
Objection 2: This sacrifices competitiveness, because now the human can’t look at the medium-term consequences of actions before providing feedback.
Is that, in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training. However, in nonmyopic training this doesn’t matter very much, because we can assign high or low reward to some later state in the trajectory, which then influences whether the agent learns to do the original action more or less often. We can’t do this in myopic training in the current paradigm, which is where the competitiveness sacrifice comes from.
E.g. my agent sends an email. Is it good or bad? In myopic training, you need to figure this out now. In nonmyopic training, you can shrug, give it 0 reward now, and then assign high or low reward to the agent when it gets a response that makes it clearer how good the email was. Then, because credit assignment happens automatically during training, actions are in effect evaluated based on their medium-term consequences, although the supervisor never actually looks at future states during evaluations.
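To make the credit-assignment point concrete, here is a minimal numerical sketch of the email example (the reward values, discount factor, and four-step trajectory are my own illustrative assumptions, not anything from the post):

```python
import numpy as np

# Toy trajectory: the agent sends an email at t=0, the supervisor shrugs
# (0 reward), and only scores the outcome at t=3 once a reply arrives.
rewards = np.array([0.0, 0.0, 0.0, 1.0])  # illustrative numbers
gamma = 0.99                               # illustrative discount factor

# Nonmyopic RL target: discounted return G_t = r_t + gamma * G_{t+1}.
# Credit assignment propagates the t=3 reward back to the t=0 email action.
returns = np.zeros_like(rewards)
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G
print(returns)     # [0.970299  0.9801  0.99  1.0] -> the email action gets reinforced

# Myopic target: each action is trained only on the approval it received at
# that step, so the t=3 reward never reaches the t=0 action.
print(rewards[0])  # 0.0 -> no signal for the email unless it is scored immediately
```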
This is consistent with your position: “When I talk about myopic training vs. regular RL, I’m imagining that they have the same information available when feedback is given”. However, it also raises the question of why we can’t just wait until the end of the trajectory to give myopic feedback anyway. In my edits I’ve called this “semi-myopia”. This wouldn’t be as useful as full nonmyopia, but I do agree that semi-myopia alleviates some competitiveness concerns, although at the cost of being more open to manipulation. The exact tradeoff here will depend on disagreement 1.
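For concreteness, here is one way the “semi-myopia” variant could look in code; this is a sketch under my own assumptions, and `approval_fn` and the trajectory layout are hypothetical rather than anything specified in the post:

```python
from typing import Callable, List, Tuple

State = object   # placeholder types for the sketch
Action = object

def semi_myopic_targets(
    trajectory: List[Tuple[State, Action]],
    approval_fn: Callable[[State, Action, List[Tuple[State, Action]]], float],
) -> List[float]:
    """Wait until the episode ends, then score each action on its own.

    The supervisor sees the whole trajectory when judging an action (the same
    information an end-of-episode reward would be based on in regular RL), but
    each action's training target is just its own approval: no discounted sum
    over future approvals is ever formed, so the objective remains myopic.
    """
    return [approval_fn(state, action, trajectory) for state, action in trajectory]
```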
Is that, in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training.
I mean, this is true in the sense that the Gym interface returns a reward with every transition, but the vast majority of deep RL algorithms don’t do anything with those rewards until the trajectory is done (or, in the case of very long trajectories, until you’ve collected a lot of experience from this trajectory). So you could just as easily evaluate the rewards then, and the algorithms wouldn’t change at all (though their implementation would).
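As an illustration of that claim, here is a sketch of a standard on-policy rollout loop; it assumes the classic pre-0.26 Gym `step`/`reset` signatures, and `policy_update` is a hypothetical stand-in for whatever algorithm is being run. The rewards are stored during the episode but only consumed once it is over, so the evaluations could just as well be produced at that point.

```python
import gym
import numpy as np

env = gym.make("CartPole-v1")
obs = env.reset()                                    # newer Gym/Gymnasium returns (obs, info)
states, actions, rewards, done = [], [], [], False

while not done:
    action = env.action_space.sample()               # stand-in for sampling from the policy
    next_obs, reward, done, info = env.step(action)  # newer versions return a 5-tuple
    states.append(obs)
    actions.append(action)
    rewards.append(reward)                           # stored, but not used yet
    obs = next_obs

# Only now, with the whole trajectory collected, do the rewards enter the
# update (e.g. a REINFORCE-style policy gradient would use these returns):
gamma = 0.99
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns = np.array(returns[::-1])
# policy_update(states, actions, returns)            # hypothetical update step
```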