Thanks for the feedback! I’ll respond to different points in different comments for easier threading.
There are a lot of human objectives that, to me, seem like they would never conflict with maximizing reward. These include not disempowering the overseers in any way they can recover from, not pursuing objectives fundamentally outside the standard human preference distribution (like torturing kittens), not causing harm to humans, and in general not making the overseers predictably less happy.
I basically agree that in the lab setting (when humans have a lot of control), the model is not getting any direct gradient update toward the “kill all humans” action or anything like that. Any bad actions that do get reinforced by gradient descent are fairly subtle / hard for humans to notice.
The point I was trying to make is more like:
- You might have hoped that ~all gradient updates are toward “be honest and friendly,” such that the policy “be honest and friendly” is just the optimal policy. If this were right, it would provide a pretty good reason to hope that the model generalizes in a benign way even as it gets smarter.
- But in fact this is not the case: even when humans have a lot of control over the model, there will be many cases where maximizing reward conflicts with being honest and friendly, and in every such case the “play the training game” policy does better than the “be honest and friendly” policy (see the toy sketch after these bullets), to the point where it’s implausible that the straightforward “be honest and friendly” policy survives training.
- So the hope in the first bullet point, the most straightforward kind of hope you might have had about HFDT (human feedback on diverse tasks), doesn’t seem to apply. Other more subtle hopes may still apply, which I try to briefly address in the “What if Alex has benevolent motivations?” and “What if Alex operates with moral injunctions that constrain its behavior?” sections.
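To make the comparison in the second bullet concrete, here is a toy numerical sketch with made-up reward values (purely illustrative, not anything from the original discussion). It just averages the reward each fixed policy would collect over a few hypothetical episodes, some of which are “conflict” episodes where the honest action is not the highest-reward action:

```python
# Toy sketch with made-up numbers (illustrative only). Each episode records the
# reward the overseers would assign to the honest/friendly action and to the
# reward-maximizing ("play the training game") action.
episodes = [
    {"conflict": False, "honest": 1.0, "training_game": 1.0},  # honest answer is also the highest-reward one
    {"conflict": True,  "honest": 0.6, "training_game": 1.0},  # flattering answer scores higher than the honest one
    {"conflict": True,  "honest": 0.3, "training_game": 0.9},  # subtle corner-cutting goes unnoticed and is rewarded
]

def expected_reward(policy: str) -> float:
    """Average reward a fixed policy collects across the episodes."""
    return sum(e[policy] for e in episodes) / len(episodes)

print("be honest and friendly:", expected_reward("honest"))         # ~0.63
print("play the training game:", expected_reward("training_game"))  # ~0.97
# The training-game policy never does worse, and does strictly better on every
# conflict episode, so reward-maximizing training pressure favors it.
```

The point of the sketch is only that any nonzero share of conflict episodes is enough for the training-game policy to come out ahead in expectation; the real argument is about which policies gradient descent selects over training, not about averaging a few hand-picked numbers.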
The story of doom does still require the model to generalize zero-shot to novel situations—i.e. to figure out things like “In this particular circumstance, now that I am more capable than humans, seizing the datacenter would get higher reward than doing what the humans asked” without having literally gotten positive reward for trying to seize the datacenter in that kind of situation on a bunch of different data points.
But this is the kind of generalization we expect future systems to display—we expect them to be able to do reasoning to figure out a novel response suitable to a novel problem. The question is how they will deploy this reasoning and creativity—and my claim is that their training pushes them to deploy it in the direction of “trying to maximize reward or something downstream of reward.”
I think I agree with everything in this comment, and that paragraph was mostly intended as the foundation for the second point I made (disagreeing with your assessment in “What if Alex has benevolent motivations?”).
Part of the disagreement here might be about how I think “be honest and friendly” factorizes into lots of subgoals (“be polite”, “don’t hurt anyone”, “inform the human if a good-seeming plan is going to have bad results 3 days from now”, “tell Stalinists true facts about what Stalin actually did”): while Alex will definitely learn to terminally value the wrong (not honest+helpful+harmless) outcomes along some of these goal-axes, it does seem likely to learn to value other axes robustly.