There are a lot of ways that reward functions can go wrong besides manipulation.
I’m calling them manipulative states because if the human notices that the reward function has gone wrong, they’ll just change the reward they’re giving. So there must be something that stops them from noticing this. But maybe it’s a misleading term, and this isn’t an important point, so for now I’ll use “incorrectly rewarded states” instead.
I agree that if what you’re worried about is manipulation in N actions, then you shouldn’t let the trajectory go on for N actions before evaluating.
This isn’t quite my argument. My two arguments are:
1. IF an important reason you care about myopia is to prevent agents from making N-step plans to get to incorrectly rewarded states, THEN you can’t defend the competitiveness of myopia by saying that we’ll just look at the whole trajectory (as you did in your original reply).
2. However, even myopically cutting off the trajectory before the agent takes N actions is insufficient to prevent the agent from making N-step plans to get to incorrectly rewarded states.
Sure, but humans are better at giving approval feedback than reward feedback. … we just aren’t very used to thinking in terms of “rewards”.
Has this argument been written up anywhere? I think I kinda get what you mean by “better”, but even if that’s true, I don’t know how to think about what the implications are. Also, I think it’s false if we condition on the myopic agents actually being competitive.
My guess is that this disagreement is based on you thinking primarily about tasks where it’s clear what we want the agent to do, and we just need to push it in that direction (like the ones discussed in the COACH paper). I agree that approval feedback is much more natural for this use case. But when I’m talking about competitive AGI, I’m talking about agents that can figure out novel approaches and strategies. Coming up with reward feedback that works for that is much easier than coming up with workable approval feedback, because we just don’t know the values of different actions. If we do manage to train competitive myopic agents, I expect that the way we calculate the approval function is by looking at the action, predicting what outcomes it will lead to, and evaluating how good those outcomes are—which is basically just mentally calculating a reward function and converting it to a value function. But then we could just skip the “predicting” bit and actually look at the outcomes instead—i.e. making it nonmyopic.
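To make this concrete, here is a minimal sketch (purely illustrative, not from the original discussion) of approval computed that way: score an action by rolling it forward in a predictive model and summing the rewards of the imagined outcomes, which is just a value estimate built out of a mental reward function. The names `predict_next_state`, `reward`, and `policy` are hypothetical stand-ins.

```python
# Illustrative only: approval computed by predicting outcomes and scoring them,
# i.e. mentally computing a reward function and converting it into a value estimate.
# `predict_next_state`, `reward`, and `policy` are hypothetical stand-ins.

def approval(state, action, predict_next_state, reward, policy,
             horizon=10, gamma=0.99):
    """Score an action by imagining the outcomes it leads to."""
    s = predict_next_state(state, action)    # imagine the immediate outcome
    total, discount = reward(s), 1.0
    for _ in range(horizon - 1):
        a = policy(s)                         # imagine how the agent would continue
        s = predict_next_state(s, a)
        discount *= gamma
        total += discount * reward(s)         # score each imagined outcome
    return total                              # effectively a value estimate
```

Skipping the "predicting" bit then just means replacing the imagined rollout with the actual one, i.e. the nonmyopic setup.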
If you have ideas for how we might supervise complex tasks like Go to a superhuman level, without assigning values to outcomes in a way that falls into the same traps as reward-based learning, or without benefiting greatly from looking at what the actual consequences are, then that would constitute a compelling argument against my position. E.g. maybe we can figure out what “good cognitive steps” are, and then reward the agent for doing those without bothering to figure out what outcomes good cognitive steps will lead to. That seems very hard, but it’s the sort of thing I think you need to defend if you’re going to defend myopia. (I expect Debate about which actions to take, for instance, to benefit greatly from the judge being able to refer to later outcomes of actions).
Another way of making this argument: humans very much think in terms of outcomes, and how good those outcomes are, by default. I agree that we are bad at giving step-by-step dense rewards. But the whole point of a reward function is that you don’t need to do the step-by-step thing: you can mostly just focus on rewarding good outcomes, and the agent does the credit assignment itself. I picture you arguing that we’ll need shaped rewards to help the agent explore, but a) we can get rid of those shaped rewards as soon as the agent has gotten off the ground, so that they don’t affect long-term incentives, and b) even shaped rewards can still be quite outcome-focused (and therefore natural to think about), e.g. +1 for killing Roshan in Dota 2.
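As a toy illustration of that kind of outcome-focused reward (my own hypothetical sketch, not anything from the post): most of the reward is paid for the outcome we actually care about, with a small shaped bonus for a salient sub-outcome that is annealed away once the agent is off the ground.

```python
# Hypothetical sketch: a mostly outcome-focused reward, plus a shaped bonus for a
# salient sub-outcome that is annealed to zero once the agent is off the ground,
# so the shaping stops affecting long-term incentives.

def reward(outcome, training_step, anneal_steps=100_000):
    r = 10.0 if outcome.get("won_game") else 0.0              # the outcome we care about
    shaping = 1.0 if outcome.get("killed_roshan") else 0.0    # +1 for the sub-outcome
    shaping_weight = max(0.0, 1.0 - training_step / anneal_steps)
    return r + shaping_weight * shaping
```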
In terms of catching and correcting mistakes in the specification, I agree that myopia forces the supervisor to keep watching the agent, which means that the supervisor is more likely to notice if they’ve accidentally incentivised the agent to do something bad. But whatever bad behaviour the supervisor is able to notice during myopic training, they could also notice during nonmyopic training if they were watching carefully. So perhaps myopia is useful as a commitment device to force supervisors to pay attention, but given the huge cost of calculating the likely outcomes of all actions, I doubt anyone will want to use it that way.
I can’t speak for everyone else, but when I talk about myopic training vs. regular RL, I’m imagining that they have the same information available when feedback is given. If you would wait till the end of the trajectory before giving rewards in regular RL, then you would wait till the end of the trajectory before giving approval in myopic training.
If you have ideas for how we might supervise complex tasks like Go to a superhuman level, without assigning values to outcomes in a way that falls into the same traps as reward-based learning
… Iterated amplification? Debate? Approval-directed agents? (Note a counterargument in “against mimicry”: technically it argues against imitation, but I think it also applies to approval.)
The point of these methods is to have an overseer that is more powerful than the agent being trained, so that you never have to achieve super-overseer performance (but you do achieve superhuman performance). In debate, you can think of judge + agent 1 as the overseer for agent 2, and judge + agent 2 as the overseer for agent 1.
(You don’t use the overseer itself as your ML system, because the overseer is slow while the agent is fast.)
I agree that if you’re hoping to get an agent that is more powerful than its overseer, then you’re counting on some form of generalization / transfer, and you shouldn’t expect myopic training to be much better (if at all) than regular RL at getting the “right” generalization.
But when I’m talking about competitive AGI, I’m talking about agents that can figure out novel approaches and strategies.
See above about being superhuman but sub-overseer. (Note that the agents can still come up with novel approaches and strategies that the overseer would have come up with, even if the overseer did not actually come up with them.)
humans very much think in terms of outcomes, and how good those outcomes are, by default.
… This does not match my experience at all. Most of the time it seems to me that we’re executing habits and heuristics that we’ve learned over time, and only when we need to think about something novel do we start trying to predict consequences and rate how good they are in order to come to a conclusion. (E.g. most people intuitively reject the notion that we should kill one person for organs to save 5 lives. I don’t think they are usually predicting outcomes and then figuring out whether those outcomes are good or not.)
I picture you arguing that we’ll need shaped rewards to help the agent explore,
I mean, yes, but I don’t think it’s particularly relevant to this disagreement.
TL;DR: I think our main disagreement is whether humans can give approval feedback in any way other than estimating how good the consequences of the action are (both observed and predicted in the future). I agree that if we are trying to have an overseer train a more intelligent agent, it seems likely that you’d have to focus on how good the consequences are. However, I think we will plausibly have the overseer be more intelligent than the agent, and so I expect that the overseer can provide feedback in other ways as well.
I broadly agree about what our main disagreement is. Note that I’ve been mainly considering the case where the supervisor is more intelligent than the agent as well. The actual resolution of this will depend on what’s really going on during amplification, which is a bigger topic that I’ll need to think about more.
On the side disagreement (of whether looking at future states before evaluation counts as “myopic”) I think I was confused when I was discussing it above and in the original article, which made my position a bit of a mess. Sorry about that; I’ve added a clarifying note at the top of the post, and edited the post to reflect what I actually meant. My actual response to this:
Objection 2: This sacrifices competitiveness, because now the human can’t look at the medium-term consequences of actions before providing feedback.
Is that, in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training. However, in nonmyopic training this doesn’t matter very much, because we can assign high or low reward to some later state in the trajectory, which then influences whether the agent learns to do the original action more or less. We can’t do this in myopic training in the current paradigm, which is where the competitiveness sacrifice comes from.
E.g. my agent sends an email. Is it good or bad? In myopic training, you need to figure this out now. In nonmyopic training, you can shrug, give it 0 reward now, and then assign high or low reward to the agent when it gets a response that makes it clearer how good the email was. Then because the agent does credit assignment automatically, actions are in effect evaluated based on their medium-term consequences, although the supervisor never actually looks at future states during evaluations.
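A toy numerical example of that credit assignment (my own sketch, not from the post): a later +1 reward, given when the response arrives, reaches the earlier email-sending action through the discounted return when the discount factor is nonzero, and does not reach it under a fully myopic discount of zero.

```python
# Toy illustration: how a later reward reaches an earlier action via discounted
# returns, and why it doesn't under a myopic (zero-discount) objective.

def discounted_returns(rewards, gamma):
    """Return-to-go at each timestep: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# t=0: agent sends the email, supervisor shrugs and gives 0.
# t=3: a response arrives showing the email was good, supervisor gives +1.
rewards = [0.0, 0.0, 0.0, 1.0]

print(discounted_returns(rewards, gamma=0.99))  # email action gets ~0.97 credit
print(discounted_returns(rewards, gamma=0.0))   # myopic: email action gets 0
```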
This is consistent with your position: “When I talk about myopic training vs. regular RL, I’m imagining that they have the same information available when feedback is given”. However, it also raises the question of why we can’t just wait until the end of the trajectory to give myopic feedback anyway. In my edits I’ve called this “semi-myopia”. This wouldn’t be as useful as full nonmyopia, but I do agree that semi-myopia alleviates some competitiveness concerns, although at the cost of being more open to manipulation. The exact tradeoff here will depend on disagreement 1.
Is that, in the standard RL paradigm, we never look at the full trajectory before providing feedback in either myopic or nonmyopic training.
I mean, this is true in the sense that the Gym interface returns a reward with every transition, but the vast majority of deep RL algorithms don’t do anything with those rewards until the trajectory is done (or, in the case of very long trajectories, until you’ve collected a lot of experience from this trajectory). So you could just as easily evaluate the rewards then, and the algorithms wouldn’t change at all (though their implementation would).
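A rough sketch of what that looks like in practice, assuming a classic Gym-style environment API (the `env` and `policy` here are placeholders): rewards come back with every transition, but a typical on-policy algorithm only consumes them once the trajectory is done, so deferring the reward evaluation to that point would leave the algorithm itself unchanged.

```python
# Rough sketch of a standard on-policy rollout loop (classic Gym-style 4-tuple step).
# Rewards are stored per transition but only used once the trajectory is done, so
# evaluating them at the end would change the implementation, not the algorithm.

def collect_trajectory(env, policy):
    states, actions, rewards = [], [], []
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)  # reward returned per step...
        states.append(obs)
        actions.append(action)
        rewards.append(reward)                           # ...but only stored here
        obs = next_obs
    # Only now does the learning algorithm look at the rewards
    # (e.g. to compute returns/advantages for a policy-gradient update).
    return states, actions, rewards
```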