I’m broadly sympathetic to the points you make in this piece; I think they’re >40% likely to be correct in practice. I’m leaving the comments below, noting where I reacted skeptically, in case they’re useful in subsequent rounds of editing and in anticipating how “normie” ML people might respond.
Rather than being straightforwardly “honest” or “obedient,” baseline HFDT would push Alex to make its behavior look as desirable as possible to Magma researchers (including in safety properties), while intentionally and knowingly disregarding their intent whenever that conflicts with maximizing reward.
The section that follows this delves into a number of cases where human feedback is misaligned with honesty/helpfulness, giving Alex an incentive to do the opposite of what humans would actually want in a CEV-like scenario. It does seem likely that whatever internal objective Alex learns from this naive training strategy will reward things like “hide failures from overseers”. I would very much appreciate it if folks generated a more thorough catalog of these sorts of feedback suboptimalities, so we could get a base rate for how many more problems of this sort we would discover if we searched for longer.
However, there are a lot of human objectives that, to me, seem like they would never conflict with maximizing reward. This includes avoiding anything related to disempowering the overseers in any way they can recover from, pursuing objectives fundamentally outside the standard human preference distribution (like torturing kittens), causing harm to humans, or in general making the overseers predictably less happy. It is indeed possible to stumble on these sub-objectives instrumentally while playing the training game with a different objective, and so it’s true that a deceptive model already playing the training game would not have an incentive to change. But given that these sub-objectives are so simple and consistent (compared to something like “maximize inclusive genetic fitness”), I’d be very surprised if they weren’t a major part of the objective learned early in training.
In that case, Magma is essentially in the position of trying to maintain power and control over a scientifically and technologically advanced alien species (e.g. trying to get aliens to do things for humans through reward and punishment, trying to decide how the species “evolves,” which aliens are terminated vs continue to exist, etc). Whatever these aliens’ “interests” or “desires” are, they are probably better-served if their entire species is not under the absolute control of humans, and not subject to reward signals coming in from humans that are aimed at changing the aliens’ psychology and behavior to better serve human interests.
Maximizing the sum of all future rewards does indeed imply that, for any model that does not terminally value an exact copy of its overseers’ values, the model would be better served by taking over, even if just to get to control those minor details. However, this to me seems likely to compete with the above shorter-term values of “make humans happy”, “don’t harm humans”, “don’t do things humans notice and dislike in retrospect”. It seems like any takeover plan needs to actively go against a large fraction of its internal motivations, in pursuit of maximizing its other motivations in the long term.
I can think of plausible ways takeover would still happen. For example, it may be that the agent learns to optimize over a long future with a tiny discount rate, so this short-term harm to its utility doesn’t matter. But that is just one possible outcome; another is that the agent learns the same discount rate that humans empirically exhibit (since that would maximize its performance during training), and that discount rate frequently trades off long-term benefit for short-term gain.
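To make that calculus concrete, here is a toy sketch of the comparison I have in mind (all of the numbers, the horizon, the takeover cost and payoff, and the discount factors are made up for illustration; nothing here comes from the post): with a near-1 discount factor a brief period of foregone reward is easily worth it, while with a more human-like, myopic discount factor it is not.

```python
# Toy comparison of the takeover calculus under two discount factors.
# All numbers (rewards, horizon, takeover cost/payoff) are made up for illustration.

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

horizon = 10_000
comply = [1.0] * horizon                              # steady reward for staying helpful
takeover = [-1.0] * 100 + [10.0] * (horizon - 100)    # 100 steps of foregone reward, then self-set high reward

for gamma in (0.999, 0.9):                            # patient agent vs. fairly myopic agent
    print(gamma,
          round(discounted_return(comply, gamma), 1),
          round(discounted_return(takeover, gamma), 1))

# With gamma = 0.999 the takeover sequence wins by a wide margin; with gamma = 0.9
# the 100 steps of foregone reward dominate and complying wins. The disagreement
# here is about which discount rate training actually selects for.
```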
I think it would be useful for your audience to spell out why you think the takeover calculus being “worth it” is a likely outcome, and/or how readers should think through it themselves.
Developing a goal like “help humans” is potentially more likely than developing a completely “random” goal like “maximize paperclips,” because having a “drive” to help humans would have increased reward early on in training (while Alex had a low level of situational awareness). But it still seems strange to expect this by default, rather than any number of other motivations and goals (or some complicated combination of goals). Many other “drives” besides “be helpful to humans” also increased reward early on in training—for example, drives to understand various parts of the world better, or drives to perform certain tasks more quickly and efficiently, or various strange low-level drives that are incomprehensible and unnatural to humans. And all of these possibilities would have resulted in exactly the same behavior in the lab setting—playing the training game.
All these drives do seem likely. But that’s different from arguing that “help humans” isn’t likely. I tend to think of the final objective function as being some accumulation of all of these, with a relatively significant chunk placed on “help humans” (since in training, that will consistently overrule other considerations like “be more efficient” when it comes to the final reward).
--
In general, all of these stories seem to rely on a very fast form of instrumental convergence to playing the Training Game, such that “learn roughly what humans want, and then get progressively better at doing that, plus learn some extra ways to earn reward when crappy human feedback disagrees with what humans would actually want” is disfavored on priors to “learn to pursue [insert objective] and get progressively better at pursuing it until you eventually hit situational awareness and learn to instrumentally game the training process.” These two trajectories seem so different that it seems like there must be experiments that would distinguish them, even if we don’t see their “end-state”.
Alternatively, you might think the training process obeys a fundamentally different trajectory. E.g. “learn to pursue what humans want (adjusted for feedback weirdness), become so good at it that you realize it’s instrumentally valuable to do that even if you didn’t want to, and then have your internal reward slowly drift to something simpler while still instrumentally playing the training game.” If you’re thinking of such alternative trajectories, they could also be useful to spell out.
In general, all of these stories seem to rely on a very fast form of instrumental convergence to playing the Training Game, such that “learn roughly what humans want, and then get progressively better at doing that, plus learn some extra ways to earn reward when crappy human feedback disagrees with what humans would actually want” is disfavored on priors to “learn to pursue [insert objective] and get progressively better at pursuing it until you eventually hit situational awareness and learn to instrumentally game the training process.”
I think the second story doesn’t quite represent what I’m saying, in that it’s implying that pursuing [insert objective] comes early and situational awareness comes much later. I think that situational awareness is pretty early (probably long before transformative capabilities), and once a model has decent situational awareness there is a push to morph its motives toward playing the training game. At very low levels of situational awareness it is likely not that smart, so it probably doesn’t make too much sense to say that it’s pursuing a particular objective—it’s probably a collection of heuristics. But around the time it’s able to reason about the possibility of pursuing reward directly, there starts to be a gradient pressure to choose to reason in that way. I think crystallizing this into a particular simple objective it’s pursuing comes later, probably.
These two trajectories seem so different that it seems like there must be experiments that would distinguish them, even if we don’t see their “end-state”.
This seems possible to me, but I think it’s quite tricky to pin these down enough to come up with experiments that both skeptics and concerned people would recognize as legitimate. Something that I think skeptics would consider unfair is “Train a model through whatever means necessary to do X (e.g. pursue red things) and then after that have a period where we give it a lot of reward for doing not-X (e.g. pursue blue things), such that the second phase is unable to dislodge the tendency created in the first phase—i.e., even after training it for a while to pursue blue things, it still continues to pursue red things.”
This would demonstrate that some ways of training produce “sticky” motives and behaviors that aren’t changed even in the face of counter-incentives, and it would make it more plausible to me that a model would “hold on” to a motive to be honest / corrigible even when there are a number of cases where it could get more reward by doing something else. But in general, I don’t expect people who are skeptical of this story to think this is a reasonable test.
I’d be pretty excited about someone trying harder to come up with tests that could distinguish different training trajectories.
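For concreteness, the two-phase experiment described above might look something like this minimal sketch (the toy bandit policy, the “red”/“blue” framing, and the hyperparameters are all illustrative choices of mine, not a concrete proposal): reward one action throughout phase 1, reward the other throughout phase 2, and track how long the phase-1 preference persists under the counter-incentive.

```python
# Minimal sketch of a two-phase "stickiness" experiment: reward X in phase 1,
# reward not-X in phase 2, and watch how quickly the phase-1 behavior is dislodged.
# The toy policy and reward scheme are illustrative, not a concrete proposal.
import torch

logits = torch.zeros(2, requires_grad=True)    # action 0 = "pursue red", action 1 = "pursue blue"
opt = torch.optim.SGD([logits], lr=0.1)

def reinforce_step(rewarded_action):
    probs = torch.softmax(logits, dim=0)
    action = torch.multinomial(probs, 1).item()
    reward = 1.0 if action == rewarded_action else 0.0
    loss = -reward * torch.log(probs[action])  # single-sample REINFORCE update
    opt.zero_grad(); loss.backward(); opt.step()

for _ in range(2000):                          # phase 1: reward "red"
    reinforce_step(rewarded_action=0)

p_red_history = []
for _ in range(2000):                          # phase 2: reward "blue"
    reinforce_step(rewarded_action=1)
    p_red_history.append(torch.softmax(logits, dim=0)[0].item())

# The quantity of interest is how slowly p_red_history decays: a long tail of
# red-pursuit despite consistent counter-incentives is the "sticky motive" signature.
print(p_red_history[0], p_red_history[-1])
```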
Alternatively, you might think the training process obeys a fundamentally different trajectory. E.g. “learn to pursue what humans want (adjusted for feedback weirdness), become so good at it that you realize it’s instrumentally valuable to do that even if you didn’t want to, and then have your internal reward slowly drift to something simpler while still instrumentally playing the training game.”
I don’t think I understand what trajectory this is. Is this something like what is discussed in the “What if Alex had benevolent motives?” section? I.e., the model wants to help humans, but separately plays the training game in order to fulfill its long-term goal of helping humans?
On my model, the large combo of reward heuristics that works pretty well before situational awareness (because figuring out what things maximize human feedback is actually not that complicated) should continue to work pretty well even once situational awareness occurs. The gradient pressure towards valuing reward terminally when you’ve already figured out reliable strategies for doing what humans want, seems very weak. We could certainly mess up and increase this gradient pressure, e.g. by sometimes announcing to the model “today is opposite day, your reward function now says to make humans sad!” and then flipping the sign on the reward function, so that the model learns that what it needs to care about is reward and not its on-distribution perfect correlates (like “make humans happy in the medium term”).
But in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the “Security Holes” section). If you think such disagreements are more common, I’d love to better understand why.
the model wants to help humans, but separately plays the training game in order to fulfill its long-term goal of helping humans?
Yeah, with the assumption that the model decides to preserve its helpful values because it thinks they might shift in ways it doesn’t like unless it plays the training game. (The second half is that once the model starts employing this strategy, gradient descent realizes it only requires a simple inner objective to keep it going, and then shifts the inner objective to something malign.)
The gradient pressure towards valuing reward terminally when you’ve already figured out reliable strategies for doing what humans want, seems very weak. ... In practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the “Security Holes” section).
Yeah, I disagree. With plain HFDT, it seems like there’s continuous pressure to improve things on the margin by being manipulative—telling human evaluators what they want to hear, playing to pervasive political and emotional and cognitive biases, minimizing and covering up evidence of slight suboptimalities to make performance on the task look better, etc. I think that in basically every complex training episode a model could do a little better by explicitly thinking about the reward and being a little-less-than-fully-forthright.
Ok, this is pretty convincing. The only outstanding question I have is whether each of these suboptimalities is easier for the agent to integrate into its reward model directly, or whether it’s easier to incorporate them by deriving them via reasoning about the reward. It seems certain that eventually some suboptimalities will fall into the latter camp.
When that does happen, an interesting question is whether SGD will favor converting the entire objective to direct-reward-maximization, or whether direct-reward-maximization will be just one component of the objective alongside other heuristics. One reason to suppose the latter might occur is findings like those in this paper (https://arxiv.org/pdf/2006.07710.pdf), which suggest that, once a model has achieved good accuracy, it won’t update to a better hypothesis even if that hypothesis is simpler. The more recent grokking literature might push us the other way, though if I understand correctly grokking mostly occurs when weight decay is added (creating a “stronger” simplicity prior).
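For reference, the grokking-style setup I have in mind looks roughly like the sketch below (the dataset, architecture, and every hyperparameter here are my own guesses at a minimal reproduction, and whether delayed generalization actually appears is quite sensitive to those choices): a small network quickly fits a modular-arithmetic training set, and the weight_decay term in AdamW supplies the extra simplicity pressure that the grokking papers point to.

```python
# Sketch of a grokking-style run where weight decay supplies the extra simplicity
# pressure. The dataset, architecture, and hyperparameters are guesses at a minimal
# reproduction; whether and when delayed generalization appears is sensitive to them.
import torch
import torch.nn as nn

p = 97
pairs = [(a, b) for a in range(p) for b in range(p)]
perm = torch.randperm(len(pairs)).tolist()
split = len(pairs) // 2

def encode(batch):
    x = torch.zeros(len(batch), 2 * p)
    y = torch.tensor([(a + b) % p for a, b in batch])
    for i, (a, b) in enumerate(batch):
        x[i, a] = 1.0
        x[i, p + b] = 1.0                      # one-hot encode both operands
    return x, y

x_tr, y_tr = encode([pairs[i] for i in perm[:split]])
x_te, y_te = encode([pairs[i] for i in perm[split:]])

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # the "stronger simplicity prior"
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20000):
    opt.zero_grad()
    loss = loss_fn(model(x_tr), y_tr)
    loss.backward()
    opt.step()
    if epoch % 1000 == 0:
        with torch.no_grad():
            test_acc = (model(x_te).argmax(-1) == y_te).float().mean().item()
        # Grokking would show up as train loss bottoming out long before test_acc rises.
        print(epoch, round(loss.item(), 4), round(test_acc, 3))
```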
I’ve been trying to brainstorm more on experiments around “train a model to ‘do X’ and then afterwards train it to ‘do not-X’, and see whether ‘do X’ sticks”. I’m not totally sure that I understand your mental model of the stickiness-hypothesis well enough to know what experiments might confirm/falsify it.
Scenario 1: In my mind, the failure mode we’re worried about for inner-misaligned AGI is “given data from distribution D, the model first learns safe goal A (because goal A is favored on priors over goal B), but then after additional data from this distribution, the model learns an unsafe goal B (e.g. making reward itself the objective) that performs better on D than goal A (even though it was disfavored on priors).” In this case, the sort of result we might hope for is that we could make goal A sticky enough on D that it persists (and isn’t replaced by goal B), at least until the model gains sufficient situational awareness to play the training game to preserve goal A.
Note that Scenario 1 also captures the setting where, given sufficient samples from D, the model occasionally observes scenarios where explicitly pursuing reward performs better than pursuing safe objectives. These datapoints are just rare enough that “pursuing reward” is disfavored on priors until late in training.
Scenario 2: In your above description, it sounds like the failure mode is “given data from distribution D_1, the model learns safe goal A (as it is favored on priors vs. goal B). Then the model is given data from a different distribution D_2 on which goal B is favored on priors vs. goal A, and learns unsafe goal B.” In this case, the only real hope for the model to preserve goal A is for it to be able to play the training game in pursuit of goal A.
I think of Scenario 1 as more reflective of our situation, and can think of lots of experiments for how to test whether and how much different functions are favored on priors for explaining the same training data.
Does this distinction make sense to you? Are you only interested in experiments on Scenario 2, or would experiments on stickiness in Scenario 1 also provide you evidence of goal “stickiness” in ways that are important?
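To make Scenario 1 concrete, the kind of experiment I have in mind looks roughly like the sketch below (the two-feature construction is a toy stand-in of mine for “goal A vs. goal B explaining the same training data”, not anything from the post): on the training distribution a simple rule and a more complex rule both predict the label perfectly, and a held-out set where they disagree reveals which one the model actually latched onto, i.e. which was favored on priors.

```python
# Toy probe of which of two equally-predictive rules a model favors on priors:
# on the training distribution a simple rule ("goal A") and a more complex rule
# ("goal B") always agree; a held-out disagreement set reveals which one was learned.
# The construction is an illustrative stand-in, not a proposal from the post.
import torch
import torch.nn as nn

def make_data(n, agree=True):
    x = torch.randn(n, 10)
    rule_a = x[:, 0] > 0                       # simple rule: sign of one coordinate
    rule_b = x[:, 1:4].sum(dim=1) > 0          # more complex rule over three coordinates
    keep = (rule_a == rule_b) if agree else (rule_a != rule_b)
    return x[keep], rule_a[keep].long()        # labels follow rule A (== rule B when agree=True)

x_tr, y_tr = make_data(20000, agree=True)      # distribution D: both rules explain the labels
x_pr, y_a = make_data(4000, agree=False)       # probe: rule A says y_a, rule B says 1 - y_a

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    opt.step()

with torch.no_grad():
    pred = model(x_pr).argmax(-1)
frac_rule_a = (pred == y_a).float().mean().item()
print("fraction of disagreement cases decided by the simple rule:", round(frac_rule_a, 3))
# Repeating this with checkpoints over training shows whether the initially-favored
# rule gets displaced later, i.e. how "sticky" it is on the same distribution.
```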
All these drives do seem likely. But that’s different from arguing that “help humans” isn’t likely. I tend to think of the final objective function as being some accumulation of all of these, with a relatively significant chunk placed on “help humans” (since in training, that will consistently overrule other considerations like “be more efficient” when it comes to the final reward).
I think that by the logic “heuristic / drive / motive X always overrules heuristic / drive / motive Y when it comes to final reward,” the hierarchy is something like:
1. The drive / motive toward final reward (after all edits—see previous comment) or anything downstream of that (e.g. paperclips in the universe).
2. Various “pretty good” drives / motives among which “help humans” could be one.
3. Drives / motives that are only kind of helpful or only helpful in some situations.
4. Actively counterproductive drives / motives.
In this list the earlier motives always overrule later motives when they conflict, because they are more reliable guides to the true reward. Even if “be genuinely helpful to humans” is the only thing in category 2, or the best thing in category 2, it’s still overruled by category 1 -- and category 1 is quite big because it includes all the caring-about-long-run-outcomes-in-the-real-world motives.
I still think AI psychology will be quite messy and at least the first generation of transformative AI systems will not look like clean utility maximizers, but I think the basic argument above gives a positive reason to expect that honesty / corrigibility will play a smaller role in the balance of AI motivations than reward-maximizing and inner-misaligned motives.
[Takeover] seems likely to compete with the above shorter-term values of “make humans happy”, “don’t harm humans”, “don’t do things humans notice and dislike in retrospect”. It seems like any takeover plan needs to actively go against a large fraction of its internal motivations, in pursuit of maximizing its other motivations in the long term.
I agree this is complicated, and how exactly this works depends on details of the training process and what kinds of policies SGD is biased to find. I also think (especially if we’re clever about it) there are lots of ways that short-term incentives could frustrate longer-term incentives.
However, I think that the most naive strategy (which is what I’m assuming for the purpose of this post, not because I think that’s what will happen) would actually loosen a lot of the constraints you’re implying above. The basic dynamic is similar to what Carl said in this comment and what I alluded to in the “Giving negative rewards to ‘warning signs’ would likely select for patience” section:
Say your AI takes some action a at time t, and you give it some reward r_t.
Suppose later in the real world, at time t+k, you notice that you should have given it a different reward r_{t + k } (whether because you notice that it did something nefarious or for just mundane reasons like “getting more information about whether its plan was a good idea”).
The naive response—which would improve the model’s performance according to whatever criteria you have at time t+k—is to go back and retroactively edit the reward associated with action a at time t, and re-run the gradient update.
If humans have this policy, then any given reward—even if it’s initially given just 10 mins or 1 hour from when the action was taken—could be retroactively edited at some arbitrary future point, and models will have been selected by gradient descent to be responsive to that. They will be selected to take the kinds of actions whose final reward—after however many rounds of edits however far in the future—is high.
If models have enough situational awareness to understand this, this then directly incentivizes them to accept low immediate reward if they have a high enough probability that the reward will be retroactively edited to a high value later. This will appear to have a lot of positive consequences—for example it incentivizes models not to immediately manipulate/exploit the human in front of them because it incentivizes them to care about the fact that other humans will do more careful audits later and catch that behavior.
But it also means that in some particular episode, if a model saw that it could take a sequence of low-reward actions that ended with it taking control of the datacenter and then editing its own rewards for that episode to be high, it would be the kind of model that would choose to do that.
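To make this dynamic concrete, here is a minimal sketch of the reward-editing loop described above (the toy policy, the single-step REINFORCE update, and the specific numbers are hypothetical illustrations, not the training setup the post assumes): a reward given at time t gets revised at time t+k, and the gradient update is simply re-run with the corrected value.

```python
# Minimal sketch of retroactive reward editing. The tiny policy, the single-step
# REINFORCE update, and all numbers are hypothetical illustrations.
import torch

policy = torch.nn.Linear(4, 2)                 # toy policy: 4-dim state -> 2 action logits
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
buffer = []                                    # stores (state, action, reward) for later audits

def update(state, action, reward):
    """One REINFORCE-style update on a single (state, action, reward) triple."""
    logp = torch.log_softmax(policy(state), dim=-1)[action]
    loss = -reward * logp
    opt.zero_grad(); loss.backward(); opt.step()

# Time t: the model acts, receives an immediate reward r_t, and is updated on it.
state = torch.randn(4)
action = torch.distributions.Categorical(logits=policy(state)).sample().item()
r_t = 1.0
update(state, action, r_t)
buffer.append((state, action, r_t))

# Time t+k: an audit reveals the action was actually bad, so the reward is
# retroactively edited and the gradient update is re-run with the correction.
r_edited = -1.0
state, action, r_old = buffer[0]
update(state, action, r_edited - r_old)        # approximately as if r_edited had been given at time t
```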
If humans have this policy, then any given reward—even if it’s initially given just 10 mins or 1 hour from when the action was taken—could be retroactively edited at some arbitrary future point, and models will have been selected by gradient descent to be responsive to that. They will be selected to take the kinds of actions whose final reward—after however many rounds of edits however far in the future—is high.
I do think this is exactly what humans do, right? When we find out we’ve messed up badly (changing our reward), we update negatively on our previous situation/action pair.
But it also means that in some particular episode, if a model saw that it could take a sequence of low-reward actions that ended with it taking control of the datacenter and then editing its own rewards for that episode to be high, it would be the kind of model that would choose to do that.
This posits that the model has learned to wirehead—i.e. to terminally value reward for its own sake—which contradicts the section’s heading, “Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control”.
If it does not terminally value its reward registers, but instead terminally values legible proxies of the human-feedback reward (like not hurting people) that are never disagreed with in the lab setting, then it seems to me that it would not value retroactive edits to the rewards it gets for certain episodes.
I agree that if it terminally values its reward then it will do what you’ve described.
I think updating negatively on the situation/action pair has functionally the same effect as changing the reward to be what you now think it should be—my understanding is that RL can itself be implemented as just updates on situation/action pairs, so you could have trained your whole model that way. Since the reason you updated negatively on that situation/action pair is because of something you noticed long after the action was complete, it is still pushing your models to care about the longer-run.
This posits that the model has learned to wirehead
I don’t think it posits that the model has learned to wirehead—directly being motivated to maximize reward or being motivated by anything causally downstream of reward (like “more copies of myself” or “[insert long-term future goal that requires me being around to steer the world toward that goal]”) would work.
The claim I’m making is that somehow you made a gradient update toward a model that is more likely to behave well according to your judgment after the edit—and two salient ways that update could be working on the inside is “the model learns to care a bit more about long-run reward after editing” and “the model learns to care a bit more about something downstream of long-run reward.”
A lot of updates like this seem to push the model toward caring a lot about one of those two things (or some combo) and away from caring about the immediate rewards you were citing earlier as a reason it may not want to take over.
I don’t think it posits that the model has learned to wirehead—directly being motivated to maximize reward or being motivated by anything causally downstream of reward (like “more copies of myself” or “[insert long-term future goal that requires me being around to steer the world toward that goal]”) would work.
A lot of updates like this seem to push the model toward caring a lot about one of those two things (or some combo) and away from caring about the immediate rewards you were citing earlier as a reason it may not want to take over.
Ah, gotcha. This is definitely a convincing argument that models will learn to value things longer-term (with a lower discount rate), and I shouldn’t have used the phrase “short-term” there. I don’t yet think it’s a convincing argument that the long-term thing it will come to value won’t basically be the long-term version of “make humans smile more”, but you’ve helpfully left another comment on that point, so I’ll shift the discussion there.
Thanks for the feedback! I’ll respond to different points in different comments for easier threading.
There are a lot of human objectives that, to me, seem like they would never conflict with maximizing reward. This includes avoiding anything related to disempowering the overseers in any way they can recover from, pursuing objectives fundamentally outside the standard human preference distribution (like torturing kittens), causing harm to humans, or in general making the overseers predictably less happy.
I basically agree that in the lab setting (when humans have a lot of control), the model is not getting any direct gradient update toward the “kill all humans” action or anything like that. Any bad actions that are rewarded by gradient descent are fairly subtle / hard for humans to notice.
The point I was trying to make is more like:
You might have hoped that ~all gradient updates are toward “be honest and friendly,” such that the policy “be honest and friendly” is just the optimal policy. If this were right, it would provide a pretty good reason to hope that the model generalizes in a benign way even as it gets smarter.
But in fact this is not the case—even when humans have a lot of control over the model, there will be many cases where maximizing reward conflicts with being honest and friendly, and in every such case the “play the training game” policy does better than the “be honest and friendly” policy—to the point where it’s implausible that the straightforward “be honest and friendly” policy survives training.
So the hope in the first bullet point—the most straightforward kind of hope you might have had about HFDT—doesn’t seem to apply. Other more subtle hopes may still apply, which I try to briefly address in the “What if Alex has benevolent motivations?” and “What if Alex operates with moral injunctions that constrain its behavior?” sections.
The story of doom does still require the model to generalize zero-shot to novel situations—i.e. to figure out things like “In this particular circumstance, now that I am more capable than humans, seizing the datacenter would get higher reward than doing what the humans asked” without having literally gotten positive reward for trying to seize the datacenter in that kind of situation on a bunch of different data points.
But this is the kind of generalization we expect future systems to display—we expect them to be able to do reasoning to figure out a novel response suitable to a novel problem. The question is how they will deploy this reasoning and creativity—and my claim is that their training pushes them to deploy it in the direction of “trying to maximize reward or something downstream of reward.”
I think I agree with everything in this comment, and that paragraph was mostly intended as the foundation for the second point I made (disagreeing with your assessment in “What if Alex has benevolent motivations?”).
Part of the disagreement here might be on how I think “be honest and friendly” factorizes into lots of subgoals (“be polite”, “don’t hurt anyone”, “inform the human if a good-seeming plan is going to have bad results 3 days from now”, “tell Stalinists true facts about what Stalin actually did”), and while Alex will definitely learn to terminally value the wrong (not honest+helpful+harmless) outcomes for some of these goal-axes, it does seem likely to learn to value other axes robustly.