The type 1 vs. type 2 feedback distinction here seems really central. I’m interested in whether this seems like a fair characterization to both of you.
Type 1: Feedback which we use for training (via gradient descent).
Type 2: Feedback which we use to decide whether to deploy a trained agent.
(There’s a bit of a gray area between Type 1 and Type 2, since choosing whether to deploy is another form of selection, but I’m assuming we’re okay stating that gradient descent and model selection operate in qualitatively distinct regimes.)
The key disagreement is whether we expect type 1 feedback will be closer to type 2 feedback, or whether type 2 feedback will be closer to our true goals. If the former, our agents generalizing from type 1 to type 2 is relatively uninformative, and we still have Goodhart. In the latter case, the agent is only very weakly optimizing the type 2 feedback, and so we don’t need to worry much about Goodhart, and should expect type 2 feedback to continue to track our true goals well.
Main argument for type 1 ~ type 2: by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2.
Main argument for type 1 !~ type 2: type 2 feedback can be something like 1000-10000x more expensive, since we only have to evaluate it once, rather than enough times to be useful for gradient descent.
I’d also be interested to discuss this disagreement in particular, since I could definitely go either way on it. (I plan to think about it more myself.)
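As a rough, purely illustrative sketch of where a figure like 1000-10000x could come from (the numbers below are assumptions for illustration, not estimates from this discussion):

```python
# Back-of-envelope sketch: if gradient descent needs many labelled episodes, while a
# deployment decision needs roughly one careful evaluation, then at a fixed total
# feedback budget each type 2 evaluation can be that many times more expensive.

labels_needed_for_training = 10_000   # assumption: samples needed to be useful for SGD
evaluations_needed_to_deploy = 1      # assumption: one evaluation of the trained agent

cost_ratio = labels_needed_for_training / evaluations_needed_to_deploy
print(f"Each type 2 evaluation can cost ~{cost_ratio:,.0f}x as much as a type 1 label")
```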
A couple of clarifications:
“Type 2: Feedback which we use to decide whether to deploy a trained agent.”
Let’s also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it’s doing bad things.
Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but I don’t know how to translate directly between longer horizons and higher cost.)
“by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2”
This doesn’t seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals. But if that’s the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.
In practice, I expect that misaligned agents which perform well on type 2 feedback will do so primarily by deception, for instrumental purposes. But it’s hard to picture agents which carry out this type of deception without also deciding to take over the world directly.
“This doesn’t seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.”
But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for “does what we care about” goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it’s not clear that’s an important part of the default plan (whereas I think we will clearly extensively leverage “try several strategies and see what works”).
“But if that’s the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.”
“Do things that look to a human like you are achieving X” is closely related to X, but that doesn’t mean that learning to do the one implies that you will learn to do the other.
Maybe it’s helpful to imagine the world where type 1 feedback is “human evals after 1 week horizon”, type 2 feedback is “human evals after 1 year horizon”, and “what we really care about” is “human evals after a 100 year horizon”. I think that’s much better than the actual situation, but even in that case I’d have a significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we do selection for working on 2 years + are able to use a small amount of 2 year data).

Do you feel pretty confident that something that generalizes from 1 week to 1 year will keep working indefinitely, or is your intuition predicated on something about the nature of “be helpful” and how that’s a natural motivation for a mind? (Or maybe that we will be able to identify some other similar “natural” motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to get into more detail about why “be helpful” is a particularly natural motivation (or else why we should be able to pick out something else like that). In the other cases I think I haven’t fully internalized your view.
I agree with the two questions you’ve identified as the core issues, although I’d slightly rephrase the former. It’s hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I’d rephrase the first option you mention as “feeling pretty confident that something that generalises from 1 week to 1 year won’t become misaligned enough to cause disasters”. This point seems more important than the second point (the nature of “be helpful” and how that’s a natural motivation for a mind), but I’ll discuss both.
I think the main disagreement about the former is over the relative strength of “results-based selection” versus “intentional design”. When I said above that “we design type 1 feedback so that resulting agents perform well on our true goals”, I was primarily talking about “design” as us reasoning about our agents, and the training process they undergo, not the process of running them for a long time and picking the ones that do best. The latter is a very weak force! Almost all of the optimisation done by humans comes from intentional design plus rapid trial and error (on the timeframe of days or weeks). Very little of the optimisation comes from long-term trial and error (on the timeframe of a year) - by necessity, because it’s just so slow.
So, conditional on our agents generalising from “one week” to “one year”, we should expect that it’s because we somehow designed a training procedure that produces scalable alignment (or at least scalable non-misalignment), or because they’re deceptively aligned (as in your influence-seeking agents scenario), but not because long-term trial and error was responsible for steering us towards getting what we can measure.
Then there’s the second question, of whether “do things that look to a human like you’re achieving X” is a plausible generalisation. My intuitions on this question are very fuzzy, so I wouldn’t be surprised if they’re wrong. But, tentatively, here’s one argument. Consider a policy which receives instructions from a human, talks to the human to clarify the concepts involved, then gets rewarded and updated based on how well it carries out those instructions. From the policy’s perspective, the thing it interacts with, and which its actions are based on, is human instructions. Indeed, for most of the training process the policy plausibly won’t even have the concept of “reward” (in the same way that humans didn’t evolve a concept of fitness). But it will have this concept of human intentions, which is a very good proxy for reward. And so it seems much more natural for the policy’s goals to be formulated in terms of human intentions and desires, which are the observable quantities that it responds to; rather than human feedback, which is the unobservable quantity that it is optimised with respect to. (Rewards can be passed as observations to the policy, but I claim that it’s both safer and more useful if rewards are unobservable by the policy during training.)
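As a minimal toy sketch of that last parenthetical (every name and number here is hypothetical), the point is just the interface: reward computed from human feedback drives the update rule, but never appears in the observations the policy conditions on.

```python
import random

ACTIONS = ["fetch", "sort", "summarise"]

class ToyInstructionEnv:
    """Toy stand-in: a human gives an instruction, the agent acts, the human scores it."""
    def reset(self):
        self.instruction = random.choice(ACTIONS)
        return {"instruction": self.instruction}      # observation contains the instruction only

    def step(self, action):
        reward = 1.0 if action == self.instruction else 0.0   # the "human feedback"
        next_obs = {"instruction": self.instruction}  # reward is NOT placed in the observation
        return next_obs, reward, True

class ToyPolicy:
    """Tabular policy: a value estimate for each (instruction, action) pair."""
    def __init__(self, actions):
        self.actions = actions
        self.q = {}

    def act(self, obs):
        instr = obs["instruction"]
        if random.random() < 0.1:                     # occasional exploration
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((instr, a), 0.0))

    def update(self, obs, action, reward):
        # The reward enters only here, on the optimiser's side of the interface;
        # the policy never conditions on it when choosing actions.
        key = (obs["instruction"], action)
        self.q[key] = self.q.get(key, 0.0) + 0.1 * (reward - self.q.get(key, 0.0))

env, policy = ToyInstructionEnv(), ToyPolicy(ACTIONS)
for _ in range(500):
    obs = env.reset()
    action = policy.act(obs)
    _, reward, _ = env.step(action)
    policy.update(obs, action, reward)
```

In this setup “reward” need never become a concept the policy represents from its own observations, which is the property the argument above leans on.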
This argument is weakened by the fact that, when there’s a conflict between them (e.g. in cases where it’s possible to fool the humans), agents aiming to “look like you’re doing X” will receive more reward. But during most of training the agent won’t be very good at fooling humans, and so I am optimistic that its core motivations will still be more like “do what the human says” than “look like you’re doing what the human says”.
I think that by default we will search for ways to build systems that do well on type 2 feedback. We likely do have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.
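A minimal sketch of the “train predictors for type 2 feedback” idea, on entirely made-up data and names: fit a cheap model that maps trajectory summaries to the eventual (slow, expensive) type 2 judgement, then score candidate behaviour with the predictor during training or planning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dataset from past deployments: features summarising each trajectory, plus the
# expensive type 2 judgement that was eventually collected for it (synthetic here).
n, d = 200, 16
trajectory_features = rng.normal(size=(n, d))
true_weights = rng.normal(size=d)
type2_feedback = trajectory_features @ true_weights + 0.1 * rng.normal(size=n)

# Fit a simple ridge-regression predictor of type 2 feedback (closed form).
lam = 1.0
X, y = trajectory_features, type2_feedback
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predicted_type2_reward(features: np.ndarray) -> float:
    """Cheap proxy for type 2 feedback, usable as a training or planning signal."""
    return float(features @ w_hat)

# Candidate behaviours can now be scored immediately, instead of waiting for the
# real long-horizon judgement.
candidate = rng.normal(size=d)
print(predicted_type2_reward(candidate))
```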
Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it’s not actually very far to get to [1 month, 2 years]. It seems like we’ve already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
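Making that arithmetic explicit (a quick check, nothing more):

```python
import math

second = 1
month = 30 * 24 * 3600            # ~2.6e6 seconds
two_years = 2 * 365 * 24 * 3600

# Span already handled within training (type 1): 1 second up to 1 month.
print(math.log10(month / second))      # ~6.4 orders of magnitude

# Extra span we are asking generalization to cover: 1 month up to 2 years.
print(math.log10(two_years / month))   # ~1.4 orders of magnitude
```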
At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand “what we really want” is a completely different thing (that we basically can’t even define cleanly). So prima facie it feels to me like if models generalize “well” then we can get them to generalize from type 1 to type 2, whereas no such thing is true for “what we really care about.”