I think it’s worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya’s “Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover”, and Cohen et al.’s “Advanced artificial agents intervene in the provision of reward”. They focus on policies learning the goal of getting high reward. But I have two problems with this:

1. I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they’d converge to it eventually, but my guess is that this would take long enough that we’d already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the “convergence” argument). Analogously, humans don’t care very much at all about the specific connections between our reward centers and the rest of our brains—insofar as we do want to influence them it’s because we care about much more directly-observable phenomena like pain and pleasure.
2. Even once you learn a goal like that, it’s far from clear that it’d generalize in ways which lead to power-seeking. “Reward” is not a very natural concept, it doesn’t apply outside training, and even within training it’s dependent on the specific training algorithm you use. Trying to imagine what a generalized goal of “reward” would cash out to gets pretty weird. As one example: it means that every time you deploy the policy without the intention of rewarding it, its key priority would be convincing you to insert that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable! But wouldn’t it be deceptive? Well, only within the scope of its current episode, because trying to get higher reward in other episodes is never positively reinforced. Wouldn’t it learn the high-level concept of “reward” in general, in a way that’s abstracted from any specific episode? That feels analogous to a human learning to care about “genetic fitness” but not distinguishing between their own genetic fitness and the genetic fitness of other species. And remember point 1: the question is not whether the policy learns it eventually, but rather whether it learns it before it learns all the other things that make our current approaches to alignment obsolete.
At a high level, this comment is related to Alex Turner’s Reward is not the optimization target. I think he’s making an important underlying point there, but I’m also not going as far as he is. He says “I don’t see a strong reason to focus on the ‘reward optimizer’ hypothesis.” I think there’s a pretty good reason to focus on it—namely that we’re reinforcing policies for getting high reward. I just think that other people have focused on it too much, and not carefully enough—e.g. the “without specific countermeasures” claim that Ajeya makes seems too strong, if the effects she’s talking about might only arise significantly above human level. Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has “policy learns to care about reward directly” as a footnote; I can imagine updating it based on the outcome of this discussion though.

For someone who’s read v1 of this paper, what would you recommend as the best way to “update” to v3? Is an entire reread the best approach?

[Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]
I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive
“Reward” is not a very natural concept
This seems to be most of your position but I’m skeptical (and it’s kind of just asserted without argument):
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
If people train their AI with RLDT then the AI is literally trained to predict reward (see the sketch after this list)! I don’t see how this is remote from the policy’s observations, and I’m not clear if your position is that e.g. the value function will be bad at predicting reward because it is an “unnatural” target for supervised learning.
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” is analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
It seems like the analogous conclusion for RL systems would be “they may not care about the rewards that go into the SGD update, they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that as long as it’s very well-correlated on the training set.” But it doesn’t matter what we choose that’s causally upstream of rewards, as long as it’s perfectly correlated on the training set?
(Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of reward? But that doesn’t seem right to me.)
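To make the “literally trained to predict reward” point concrete, here is a minimal sketch of a value head fit to observed rewards by ordinary supervised learning. This is my own illustration rather than anything from the posts under discussion; the data, shapes, and names are invented.

```python
import torch
import torch.nn as nn

# Invented stand-in data: 64 training episodes with 128-dim observation
# features and one recorded reward each.
obs = torch.randn(64, 128)
reward = torch.randn(64, 1)

# A value head trained by supervised learning to predict reward.
value_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(value_head.parameters(), lr=1e-3)

for _ in range(100):
    opt.zero_grad()
    pred = value_head(obs)                        # the model's estimate of reward
    loss = nn.functional.mse_loss(pred, reward)   # reward is the literal regression target
    loss.backward()
    opt.step()
```

In a setup like this, reward enters the learning process as directly as any other labelled data, which is the intuition behind the bullet above.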
[The concept of “reward”] doesn’t apply outside training
I don’t buy it:
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on (see the sketch after this list). This is done in practice today and seems like a pretty good idea (since the thing you care about is precisely the performance on deployment episodes) unless you are specifically avoiding it for safety reasons.
It’s plausible that such training is happening whether or not it actually is, and so that’s a very natural objective for a system that cares about maximizing reward conditioned on an episode being selected for training.
Even if test episodes are obviously not being used in training, there are still lots of plausible-sounding generalizations of “reward” to those episodes (whether based on physical implementation, or on the selection implemented by SGD, or based on conditioning on unlikely events, or based on causal counterfactuals...) and as far as I can tell pretty much all of them lead to the same bottom line.
If the deployment distribution is sufficiently different from the training distribution that the AI no longer does something with the same upshot as maximizing reward, then it seems very likely that the resulting behaviors are worse (e.g. they would receive a lower score when evaluated by humans) and so people will be particularly likely to train on those deployment episodes in order to correct the issue.
So I think that if a system was strategically optimizing reward on the training set, it would probably either do something similar-enough-to-result-in-grabbing-power on the test set, or else it would behave badly and be corrected.
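Here is a minimal sketch of the “random samples of deployment” setup described in the first bullet above. It is only an illustration; the names and interfaces (run_deployment_episode, env.rollout, reward_fn, buffer.add, TRAIN_FRACTION) are all hypothetical.

```python
import random

TRAIN_FRACTION = 0.01  # invented rate at which deployment episodes are labelled for training

def run_deployment_episode(policy, env, reward_fn, buffer):
    episode = env.rollout(policy)                # an ordinary deployment episode
    if random.random() < TRAIN_FRACTION:         # random selection into the training set
        buffer.add(episode, reward_fn(episode))  # label it and train on it later
    return episode
```

Under a setup like this, “what reward would this episode have received had it been sampled for training?” has a definite answer for every deployment episode, which is the sense in which “reward” still applies outside the episodes actually trained on.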
(Overall I find the other point more persuasive, though again I think it’s better as an objection to 100% than as an objection to 50%: actually optimizing reward doesn’t do much better than various soups of heuristics, and so we don’t have a strong prediction that SGD will prefer one to the other without getting into the weeds and making very uncertain claims or else taking the limit.)
As one example: it means that every time you deploy the policy without the intention of rewarding it, its key priority would be convincing you to insert that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable!
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way.
Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for. I agree that we shouldn’t expect a general thematic connection to “reward” beyond what you’d expect from the mechanics of SGD.
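As a schematic illustration of the selection argument above (my sketch, not from the discussion): in a REINFORCE-style update, gradients come only from episodes that actually entered the training set, so there is no term through which “sacrifice reward now in order to get this episode into training” gets reinforced. It assumes ep.reward and ep.log_prob are PyTorch tensors recorded for the included episodes.

```python
import torch

def reinforce_update(episodes, optimizer):
    # `episodes` are only the trajectories that actually landed in the training set;
    # trajectories that were never included contribute nothing to the gradient.
    optimizer.zero_grad()
    loss = torch.stack([-(ep.reward * ep.log_prob) for ep in episodes]).mean()
    # Lower reward on an included episode means higher loss, so parameters that
    # sacrifice in-training reward (for whatever reason) are selected against.
    loss.backward()
    optimizer.step()
```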
and even within training it’s dependent on the specific training algorithm you use
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power or else bad behavior that would be corrected by further training.
They focus on policies learning the goal of getting high reward.
The claim is that AI systems will take actions that get them a lot of reward. This doesn’t have to be based on the mechanistic claim that they are thinking about reward, it can also just be the empirical observation: somehow you selected a policy that creatively gets a lot of reward across a very broad training distribution, so in a novel situation “gets a lot of reward” is one of our most robust predictions about the behavior of the AI.
This is how Ajeya’s post goes, and it explicitly notes that the AI trained with RLDT could be optimizing something else such that it only wants to get reward during training, but that this mostly just makes things worse: everything seems to lead to the same place for more or less the same reason.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way, and so that part of the objection is much stronger (though see my objections above).
Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
I don’t feel like this argument is particularly anthropomorphic. The argument is that a policy selected for achieving X will achieve X in a new situation (which we can run for lots of different X, since there are many X that are perfectly correlated on the training set). In all the discussions I’ve been in (both with safety people and with ML people) the counterargument leans much more heavily on the analogy with humans (though I think that analogy is typically misapplied).
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” is analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
(Emphasis added)
I don’t think this engages with the substance of the analogy to humans. I don’t think any party in this conversation believes that human learning is “just” RL based on a reward circuit, and I don’t believe it either. “Just RL” also isn’t necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.
I would say “human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and...”
Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values:
Wolfram Schultz and colleagues have found that the signaling behavior of phasic dopamine in the mesocorticolimbic pathway mirrors that of a TD error (or reward prediction error).
In addition to finding correlates of reinforcement learning signals in the brain, artificial manipulation of those signal correlates (through optogenetic stimulation, for example) produces the behavioral adjustments that would be predicted from their putative role in reinforcement learning.
This is incredibly weak evidence.

Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
In particular, for this to be evidence for Richard’s claim, you need to say: “If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition.” There’s some update there but it’s just not big. It’s easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way—humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
Curious what systems you have in mind here.

I don’t understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:

Lots of animals do reinforcement learning.

In particular, humans prominently do reinforcement learning.
Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
“RL → high chance of caring about reality” predicts this observation more strongly than “RL → low chance of caring about reality”
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I’m just not seeing.
But in particular, it doesn’t seem relevant to consider selection pressures from evolution, except insofar as we’re postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards “RL → high chance of caring about reality.”
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
I don’t see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can’t make further updates by reasoning about how people do it?
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
I’m saying that there was a missed update towards that conclusion, so it doesn’t matter if we already knew that humans do within-lifetime learning?
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly care about other stuff). So if e.g. P(humans care about the real world | RL agents usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it’s even smaller than that.
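Spelling out the bound (my reconstruction of the arithmetic, writing E for “humans care about the real world” and using the illustrative 80% from the paragraph above):

```latex
\[
\text{update against the reward hypothesis}
  = \frac{P(E \mid \text{RL agents mostly care about other stuff})}
         {P(E \mid \text{RL agents usually care about reward})}
  \le \frac{1}{0.8} = 1.25,
\]
```

since the numerator can be at most 1.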
And then if you try to turn that into evidence about “reward is a very hard concept to learn,” or a prediction about how neural nets trained with RL will behave, it’s moving my odds ratios by less than 10% (since we are using “RL” quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update).
You seem to be saying “yes but it’s evidence,” which I’m not objecting to—I’m just saying it’s an extremely small amount of evidence. I’m not clear on whether you agree with my calculation.
(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren’t making this argument so you should ignore all of that, sorry to be confusing.)
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low.
Yes, in large part.
I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that, conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), reward-motivated humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But also many people do in fact enjoy masturbating, even though it seems (to my naive view) like an obvious thing to select away, which was present ancestrally.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed—I, personally, feel good and happy when I am hanging out with my friends—then I still would be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected into noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | ), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… .2? I’d feel slightly more surprised by that than by the specific outcome of two coinflips coming up heads.
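For concreteness, the arithmetic behind “slightly more surprised than two coin flips” (my calculation, not in the original comment):

```latex
\[
-\log_2(0.2) \approx 2.32 \text{ bits}
\qquad \text{vs.} \qquad
-\log_2(0.25) = 2 \text{ bits}.
\]
```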
I think that there are serious path-dependent constraints in evolution, not a super clear/strong/realizable fitness gradient away from caring about reward if that’s how RL agents usually work, and so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. I think this alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status → reward; and it’s high-status to sacrifice yourself for your kid). Or because keeping your kid safe → high reward as another learned drive.
Overall this feels like contortion but I think it’s possible. Maybe overall this is a… 1-bit update against the “not selection for caring about reality” point?
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on.
I don’t know what this means. Suppose we have an AI which “cares about reward” (as you think of it in this situation). The “episode” consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.
What is the “reward” for this situation? What would have happened if we “sampled” this episode during training?
I agree there are all kinds of situations where the generalization of “reward” is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode since we can take counterfactuals over the randomization used to select training data.
It’s possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.
As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where “what if we had sampled during training?” is well-defined and fine. I was wondering if you viewed this as a general question we could ask.
I also agree that Ajeya’s post addresses this “ambiguity” question, which is nice!
I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
It’s intended as an objection to “AI grabs power to get reward is the central threat model to focus on”, but I think our disagreements still apply given this. (FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way. Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for.
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power
Some versions that wouldn’t result in power-grabbing:
Goal is “get highest proportion of possible reward”; the policy might rewrite the training algorithm to be myopic, then get perfect reward for one step, then stop.
Goal is “care about (not getting low rewards on) specific computers used during training”; the policy might destroy those particular computers, then stop.
Goal is “impress the critic”; the policy might then rewrite its critic to always output high reward, then stop.
Goal is “get high reward myself this episode”; the policy might try to do power-seeking things but never create more copies of itself, and eventually lose coherence + stop doing stuff.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
(FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
Children vs genes doesn’t seem like a good comparison, it seems obvious that models will understand the idea of reward during training (whereas humans don’t understand genes during evolution). A better comparison might be “have children during my life” vs “have a legacy after I’m gone,” but in fact humans have both goals even though one is never directly observed.
I guess more importantly, I don’t buy the claim about “things you only get selected on are less natural as goals than things you observe during episodes,” especially if your policy is trained to make good predictions of reward. I don’t know if there’s a specific reason for your view, or this is just a clash of intuitions. It feels to me like my position is kind of the default, in that you are pointing to a feature and saying it is a major reason that SGD wouldn’t learn a particular kind of cognition.
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from “misbehave so that my data will be used for training” to “ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a low reward assuming that this data is used for training.”
I agree the behavior of the model isn’t easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that’s wrong I’m happy to disagree about it.
Some versions that wouldn’t result in power-grabbing:
Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that it may not care about holding power. (And maybe it doesn’t even have to grab power if you don’t have any countermeasures at test time? But then your system seems useless.)
But if you get frustrated by your AI taking over the datacenter and train it not to do that, as discussed in Ajeya’s story here, then you are left with an AI that grabs power and holds it.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
I don’t find this super compelling. There is a long causal chain from “Do the task” to “Humans think you did the task” to “You get high reward” to “You are selected by SGD.” By default there will be lots of datapoints that distinguish the late stages on the path from the early stages, since human evaluations do not perfectly track any kind of objective evaluation of “did you actually do the task?” So a model that cares only about doing the task and not about human evaluations will underperform a model that cares about human evaluations. And then a model that cares about human evaluations can then manipulate those evaluations via grabbing power and controlling human observations of the situation (and if it cares about something even later then it can also e.g. kill and replace the humans, but that’s not important to this argument).
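A toy numerical version of this argument, with entirely made-up numbers and invented names: if human evaluations diverge from “actually did the task” on even a few percent of training datapoints, a policy that targets the evaluation gets strictly higher average reward than one that targets the task.

```python
# For each hypothetical datapoint: does the (imperfect) human evaluator give a
# high score to the behaviour that genuinely does the task?
eval_of_task_behaviour = [1] * 95 + [0] * 5   # evaluations diverge on 5% of points

# A policy that only cares about doing the task earns whatever score the
# evaluator happens to give task-doing behaviour.
reward_if_task_policy = sum(eval_of_task_behaviour) / len(eval_of_task_behaviour)  # 0.95

# A policy that cares about the evaluation itself produces whichever behaviour
# gets scored highly, including on the 5% of points where the two come apart.
reward_if_eval_policy = 1.0

print(reward_if_task_policy, reward_if_eval_policy)  # 0.95 1.0
```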
So the implicit alternative you mention here just won’t get a low loss, and it’s really a question about whether SGD will find a lower loss policy rather than one about how it generalizes.
And once that’s the discussion, someone saying “well there is a policy that gets lower loss and doesn’t look very complicated, so I guess SGD has a good chance of finding it” seems like they are on the winning side of SGD. Your claim seems to be that models will fail to get low loss on the training set in a specific way, and I think saying “it’s a non-trivial and quite specific hypothesis that you will get a low loss” would be an inappropriate burden-of-proof shifting.
I agree that it’s unclear whether this will actually happen prior to models that are smart enough to obsolete human work on alignment, but I really think we’re looking at more like a very natural hypothesis with 50-50 chances (and at the very least the way you are framing your objection is not engaging with the arguments in the particular post of Ajeya’s that you are responding to).
I also agree that the question of whether the model will mess with the training setup or use power for something else is very complicated, but it’s not something that is argued for in Ajeya’s post.
I think the crucial thing that makes this robust is “if the model grabs power and does something else, then humans will train it out unless they decide doing so is unsafe.” I think one potentially relevant takeaway is: (i) it’s plausible that AI systems will be motivated to grab power in non-catastrophic ways which could give opportunities to correct course, (ii) that isn’t automatically guaranteed by any particular honeypot, e.g. you can’t just give a high reward and assume that will motivate the system.
Reading back over this now, I think we’re arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking.
I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.* Let’s distinguish three types (where “myopic” is intended to mean something like “only cares about the current episode”).
1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)

2. Myopic terminal reward maximization

3. Non-myopic terminal reward maximization
Either 1 or 2 (or both of them) seem plausible to me. 3 is the one I’m skeptical about. How come?
We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn’t change quickly with new information). So once they understand the concept of reward maximization, it’ll be easier for them to adopt it as an instrumental strategy than a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness (like making thousands of clones) people are more likely to do it for instrumental reasons than terminal reasons.)
Even if they adopt reward maximization as a terminal goal, they’re more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn’t generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we’ll train subsequent agents to be less myopic. But it seems to me that the selection pressure generated by a handful of examples of real-world attempted takeovers is very small, compared with other aspects of training; and that even if it’s significant, it may just teach agents specific constraints like “don’t take over datacenters”.
* Now that I say that, I notice that I’m also open to the possibility of policies learning deceptively-aligned goals first, then gradually shifting from reward-maximization as an instrumental goal to reward-maximization as a terminal goal. But let’s focus for now on which goals are learned first.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way
Note that the “without countermeasures” post consistently discusses both possibilities (the model cares about reward or the model cares about something else that’s consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:
Once this progresses far enough, the best way for Alex to accomplish most possible “goals” no longer looks like “essentially give humans what they want but take opportunities to manipulate them here and there.” It looks more like “seize the power to permanently direct how it uses its time and what rewards it receives—and defend against humans trying to reassert control over it, including by eliminating them.” This seems like Alex’s best strategy whether it’s trying to get large amounts of reward or has other motives. If it’s trying to maximize reward, this strategy would allow it to force its incoming rewards to be high indefinitely.[6] If it has other motives, this strategy would give it long-term freedom, security, and resources to pursue those motives.
As well as the section Even if Alex isn’t “motivated” to maximize reward.… I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.
With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard—I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there’s no notion of reward on the deployment distribution doesn’t feel compelling to me.
Note that the “without countermeasures” post consistently discusses both possibilities
Yepp, agreed, the thing I’m objecting to is how you mainly focus on the reward case, and then say “but the same dynamics apply in other cases too...”
I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy.
The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).
I agree with your general point here, but I think Ajeya’s post actually gets this right, eg
There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful—and once human knowledge/control has eroded enough—an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”
and
What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from trying to maximize reward? This is very plausible to me, but I don’t think this possibility provides much comfort—I still think Alex would want to attempt a takeover.
I also think that often “the AI just maximizes reward” is a useful simplifying assumption. That is, we can make an argument of the form “even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed”.
(Though of course it’s important to spell the argument out)
Yeah, I agree this is a good argument structure—in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it’s quite useful to establish that it’s doomed; that’s the kind of structure I was going for in the post.
I strongly disagree with the “best case” thing. Like, policies could just learn human values! It’s not that implausible.
If I had to try to point to the crux here, it might be “how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?” Where we both agree that there’s some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I’m more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there’s the human analogy: our goals are very strongly biased towards things we have direct observational access to!)
Even setting aside this disagreement, though, I don’t like the argumentative structure because the generalization of “reward” to large scales is much less intuitive than the generalization of other concepts (like “make money”) to large scales—in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.
I strongly disagree with the “best case” thing. Like, policies could just learn human values! It’s not that implausible.
Yes, sorry, “best case” was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing “correctly;” we could get lucky and have it generalize “incorrectly” in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.
But if Alex did initially develop a benevolent goal like “empower humans,” the straightforward and “naive” way of acting on that goal would have been disincentivized early in training. As I argued above, if Alex had behaved in a straightforwardly benevolent way at all times, it would not have been able to maximize reward effectively.
That means even if Alex had developed a benevolent goal, it would have needed to play the training game as well as possible—including lying and manipulating humans in a way that naively seems in conflict with that goal. If its benevolent goal had caused it to play the training game less ruthlessly, it would’ve had a constant incentive to move away from having that goal or at least from acting on it.[35] If Alex actually retained the benevolent goal through the end of training, then it probably strategically chose to act exactly as if it were maximizing reward.
This means we could have replaced this hypothetical benevolent goal with a wide variety of other goals without changing Alex’s behavior or reward in the lab setting at all—“help humans” is just one possible goal among many that Alex could have developed which would have all resulted in exactly the same behavior in the lab setting.
If I had to try to point to the crux here, it might be “how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?”...As usual, there’s the human analogy: our goals are very strongly biased towards things we have direct observational access to!)
I don’t understand why reward isn’t something the model has direct access to—it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I’d have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.
Even setting aside this disagreement, though, I don’t like the argumentative structure because the generalization of “reward” to large scales is much less intuitive than the generalization of other concepts (like “make money”) to large scales—in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.
Yeah, I don’t really agree with this; I think I could pretty easily imagine being an AI system asking the question “How much reward would this episode get if it were sampled for training?” It seems like the intuition this is weird and unnatural is doing a lot of work in your argument, and I don’t really share it.
I don’t understand why reward isn’t something the model has direct access to—it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I’d have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.
AFAIK the reward signal is not typically included as an input to the policy network in RL. Not sure why, and I could be wrong about that, but that is not my main question. The bigger question is “Has direct access to when?”
At the moment in time when the model is making a decision, it does not have direct access to the decision-relevant reward signal because that reward is typically causally downstream of the model’s decision. That reward may not even have a definite value until after decision time. Whereas concrete observables like “shiny gold coins” and “the finish line straight ahead” and “my opponent is in check” (and other abstractions in the model’s ontology that are causally upstream from reward in reality) are readily available at decision time. It seems to me that that makes them natural candidates for credit assignment to flag early on as the reward-responsible mental events and reinforce into stable motivations, since they in fact were the factors that determined the decisions that led to rewards.
IME, the most straightforward way for reward-itself to become the model’s primary goal would be if the model learns to base its decisions on an accurate reward-predictor much earlier than it learns to base its decisions on other (likely upstream) factors. If it instead learns how to accurately predict reward-itself after it is already strongly motivated by some concrete observables, I don’t see why we should expect it to dislodge that motivation, despite the true fact that those concrete observables are only pretty correlated with reward whereas an accurate reward-predictor is perfectly correlated with reward. Why? Because the model currently doesn’t care about reward-itself, it currently cares about the concrete observable(s), so it has no reason to take actions that would override that goal, and it has positive goal-content integrity reasons to not take those actions.
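For reference, here is a minimal tabular TD(0) sketch of the “reward prediction error” that the Schultz result quoted earlier refers to, and that the credit-assignment argument above has in mind. It is the standard textbook update, simplified, with invented state names.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) step; `delta` is the reward prediction error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta   # credit is assigned to the state that was available
                            # at decision time, upstream of the reward itself
    return delta

V = {"saw_coin": 0.0, "got_reward": 0.0}
td0_update(V, "saw_coin", r=1.0, s_next="got_reward")
```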
What I meant is that generalizing to want reward is in some sense the model generalizing “correctly;” we could get lucky and have it generalize “incorrectly” in an important sense in a way that happens to be beneficial to us.
(Written quickly and not very carefully.)
I think it’s worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya’s “Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover”, and Cohen et al.’s “Advanced artificial agents intervene in the provision of reward”. They focus on policies learning the goal of getting high reward. But I have two problems with this:
I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they’d converge to it eventually, but my guess is that this would take long enough that we’d already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the “convergence” argument). Analogously, humans don’t care very much at all about the specific connections between our reward centers and the rest of our brains—insofar as we do want to influence them it’s because we care about much more directly-observable phenomena like pain and pleasure.
Even once you learn a goal like that, it’s far from clear that it’d generalize in ways which lead to power-seeking. “Reward” is not a very natural concept, it doesn’t apply outside training, and even within training it’s dependent on the specific training algorithm you use. Trying to imagine what a generalized goal of “reward” would cash out to gets pretty weird. As one example: it means that every time you deploy the policy without the intention of rewarding it, then its key priority would be convincing you to inserting that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable! But wouldn’t it be deceptive? Well, only within the scope of its current episode, because trying to get higher reward in other episodes is never positively reinforced. Wouldn’t it learn the high-level concept of “reward” in general, in a way that’s abstracted from any specific episode? That feels analogous to a human learning to care about “genetic fitness” but not distinguishing between their own genetic fitness and the genetic fitness of other species. And remember point 1: the question is not whether the policy learns it eventually, but rather whether it learns it before it learns all the other things that make our current approaches to alignment obsolete.
At a high level, this comment is related to Alex Turner’s Reward is not the optimization target. I think he’s making an important underlying point there, but I’m also not going as far as he is. He says “I don’t see a strong reason to focus on the “reward optimizer” hypothesis.” I think there’s a pretty good reason to focus on it—namely that we’re reinforcing policies for getting high reward. I just think that other people have focused on it too much, and not carefully enough—e.g. the “without specific countermeasures” claim that Ajeya makes seems too strong, if the effects she’s talking about might only arise significantly above human level. Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has “policy learns to care about reward directly” as a footnote; I can imagine updating it based on the outcome of this discussion though.
For someone who’s read v1 of this paper, what would you recommend as the best way to “update” to v3? Is an entire reread the best approach?
[Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]
I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
This seems to be most of your position but I’m skeptical (and it’s kind of just asserted without argument):
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
If people train their AI with RLDT then the AI is literally be trained to predict reward! I don’t see how this is remote, and I’m not clear if your position is that e.g. the value function will be bad at predicting reward because it is an “unnatural” target for supervised learning.
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” be analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
It seems like the analogous conclusion for RL systems would be “they may not care about the rewards that go into the SGD update, they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that as long as it’s very well-correlated on the training set.” But it doesn’t matter what we choose that’s causally upstream of rewards, as long as it’s perfectly correlated on the training set?
(Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of reward? But that doesn’t seem right to me.)
I don’t buy it:
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on. This is done in practice today and seems like a pretty good idea (since the thing you care about is precisely the performance on deployment episodes) unless you are specifically avoiding it for safety reasons.
It’s plausible that such training is happening whether or not it actually is, and so that’s a very natural objective for a system that cares about maximizing reward conditioned on an episode being selected for training.
Even if test episodes are obviously not being used in training, there are still lots of plausible-sounding generalizations of “reward” to those episodes (whether based on physical implementation, or on the selection implemented by SGD, or based on conditioning on unlikely events, or based on causal counterfactuals...) and as far as I can tell pretty much all of them lead to the same bottom line.
If the deployment distribution is sufficiently different from the training distribution that the AI no longer does something with the same upshot as maximizing reward, then it seems very likely that the resulting behaviors are worse (e.g. they would receive a lower score when evaluated by humans) and so people will be particularly likely to train on those deployment episodes in order to correct the issue.
So I think that if a system was strategically optimizing reward on the training set, it would probably either do something similar-enough-to-result-in-grabbing-power on the test set, or else it would behave badly and be corrected.
(Overall I find the other point more persuasive, though again I think it’s better as an objection to 100% than as an objection to 50%: actually optimizing reward doesn’t do much better than various soups of heuristics, and so we don’t have a strong prediction that SGD will prefer one to the other without getting into the weeds and making very uncertain claims or else taking the limit.)
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way.
Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for. I agree that we shouldn’t expect a general thematic connection to “reward” beyond what you’d expect from the mechanics of SGD.
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power or else bad behavior that would be corrected by further training.
The claim is that AI systems will take actions that get them a lot of reward. This doesn’t have to be based on the mechanistic claim that they are thinking about reward, it can also just be the empirical observation: somehow you selected a policy that creatively gets a lot of reward across a very broad training distribution, so in a novel situation “gets a lot of reward” is one of our most robust predictions about the behavior of the AI.
This is how Ajeya’s post goes, and it explicitly notes that the AI trained with RLDT could be optimizing something else such that they only want to get reward during training, but that this mostly just makes things worse: everything seems to lead to the same place for more or less the same reason.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way, and so that part of the objection is much stronger (though see my objections above).
I don’t feel like this argument is particularly anthropomorphic. The argument is that a policy selected for achieving X will achieve X in a new situation (which we can run for lots of different X, since there are many X that are perfectly correlated on the training set). In all the discussions I’ve been in (both with safety people and with ML people) the counterargument leans much more heavily on the analogy with humans (though I think that analogy is typically misapplied).
I don’t think this engages with the substance of the analogy to humans. I don’t think any party in this conversation believes that human learning is “just” RL based on a reward circuit, and I don’t believe it either. “Just RL” also isn’t necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.
I would say “human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and...”
Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values:
This is incredibly weak evidence.
Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
In particular, for this to be evidence for Richard’s claim, you need to say: “If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition.” There’s some update there but it’s just not big. It’s easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way—humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
Curious what systems you have in mind here.
I don’t understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:
Lots of animals do reinforcement learning.
In particular, humans prominently do reinforcement learning.
Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
“RL → high chance of caring about reality” predicts this observation more strongly than “RL → low chance of caring about reality”.
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I’m just not seeing.
But in particular, it doesn’t seem relevant to consider selection pressures from evolution, except insofar as we’re postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards “RL → high chance of caring about reality.”
I don’t see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can’t make further updates by reasoning about how people do it?
I’m saying that there was a missed update towards that conclusion, so it doesn’t matter if we already knew that humans do within-lifetime learning?
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly care about other stuff). So if e.g. P(humans care about the real world | RL agents don’t usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it’s even smaller than that.
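Spelling out the arithmetic behind that bound (just a restatement of the calculation above, with W standing for “humans care about the real world”, R for “RL agents usually care about reward”, and 0.8 as the illustrative figure):

$$
\frac{P(W \mid R)}{P(W \mid \lnot R)} \;\le\; \frac{1}{P(W \mid \lnot R)} \;=\; \frac{1}{0.8} \;=\; 1.25,
$$

since P(W | R) can be at most 1.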
And then if you try to turn that into evidence about “reward is a very hard concept to learn,” or a prediction about how neural nets trained with RL will behave, it’s moving my odds ratios by less than 10% (since we are using “RL” quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update).
You seem to be saying “yes but it’s evidence,” which I’m not objecting to—I’m just saying it’s an extremely small amount of evidence. I’m not clear on whether you agree with my calculation.
(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren’t making this argument, so you should ignore all of that; sorry to be confusing.)
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
Yes, in large part.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that, conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), reward-caring humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But also many people do in fact enjoy masturbating, even though it seems (to my naive view) like an obvious thing to select away, given that it was present ancestrally.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed—I, personally, feel good and happy when I am hanging out with my friends—then I still would be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected into noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | …), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… 0.2? I’d feel slightly more surprised by that than by the specific outcome of two coinflips coming up heads.
I think that there are serious path-dependent constraints in evolution, not a super clear/strong/realizable fitness gradient away from caring about reward if that’s how RL agents usually work, and so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. I think this alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high status → reward, and it’s high-status to sacrifice yourself for your kid; or because “keeping your kid safe → high reward” is another learned drive).
Overall this feels like contortion, but I think it’s possible. Maybe this is a… 1-bit update against the “not selection for caring about reality” point?
I don’t know what this means. Suppose we have an AI which “cares about reward” (as you think of it in this situation). The “episode” consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.
What is the “reward” for this situation? What would have happened if we “sampled” this episode during training?
I agree there are all kinds of situations where the generalization of “reward” is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode, since we can take counterfactuals over the randomization used to select training data.
It’s possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.
As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where “what if we had sampled during training?” is well-defined and fine. I was wondering if you viewed this as a general question we could ask.
I also agree that Ajeya’s post addresses this “ambiguity” question, which is nice!
It’s intended as an objection to “AI grabs power to get reward is the central threat model to focus on”, but I think our disagreements still apply given this. (FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
Some versions that wouldn’t result in power-grabbing:
Goal is “get highest proportion of possible reward”; the policy might rewrite the training algorithm to be myopic, then get perfect reward for one step, then stop.
Goal is “care about (not getting low rewards on) specific computers used during training”; the policy might destroy those particular computers, then stop.
Goal is “impress the critic”; the policy might then rewrite its critic to always output high reward, then stop.
Goal is “get high reward myself this episode”; the policy might try to do power-seeking things but never create more copies of itself, and eventually lose coherence + stop doing stuff.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?
Children vs genes doesn’t seem like a good comparison, it seems obvious that models will understand the idea of reward during training (whereas humans don’t understand genes during evolution). A better comparison might be “have children during my life” vs “have a legacy after I’m gone,” but in fact humans have both goals even though one is never directly observed.
I guess more importantly, I don’t buy the claim about “things you only get selected on are less natural as goals than things you observe during episodes,” especially if your policy is trained to make good predictions of reward. I don’t know if there’s a specific reason for your view, or if this is just a clash of intuitions. It feels to me like my position is kind of the default, in that you are pointing to a feature and claiming it is a major consideration against SGD learning a particular kind of cognition.
If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from “misbehave so that my data will be used for training” to “ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a high reward assuming that this data is used for training.”
I agree the behavior of the model isn’t easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that’s wrong I’m happy to disagree about it.
Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that it may not care about holding power. (And maybe it doesn’t even have to grab power if you don’t have any countermeasures at test time? But then your system seems useless.)
But if you get frustrated by your AI taking over the datacenter and train it not to do that, as discussed in Ajeya’s story here, then you are left with an AI that grabs power and holds it.
I don’t find this super compelling. There is a long causal chain from “Do the task” to “Humans think you did the task” to “You get high reward” to “You are selected by SGD.” By default there will be lots of datapoints that distinguish the late stages on the path from the early stages, since human evaluations do not perfectly track any kind of objective evaluation of “did you actually do the task?” So a model that cares only about doing the task and not about human evaluations will underperform a model that cares about human evaluations. And a model that cares about human evaluations can then manipulate those evaluations via grabbing power and controlling human observations of the situation (and if it cares about something even later in the chain, then it can also e.g. kill and replace the humans, but that’s not important to this argument).
So the implicit alternative you mention here just won’t get a low loss, and it’s really a question about whether SGD will find a lower loss policy rather than one about how it generalizes.
And once that’s the discussion, someone saying “well, there is a policy that gets lower loss and doesn’t look very complicated, so I guess SGD has a good chance of finding it” seems like they are on the winning side of SGD. Your claim seems to be that models will fail to get low loss on the training set in a specific way, and I think saying “it’s a non-trivial and quite specific hypothesis that you will get a low loss” would be inappropriately shifting the burden of proof.
I agree that it’s unclear whether this will actually happen prior to models that are smart enough to obsolete human work on alignment, but I really think we’re looking at more like a very natural hypothesis with 50-50 chances (and at the very least the way you are framing your objection is not engaging with the arguments in the particular post of Ajeya’s that you are responding to).
I also agree that the question of whether the model will mess with the training setup or use power for something else is very complicated, but it’s not something that is argued for in Ajeya’s post.
I think the crucial thing that makes this robust is “if the model grabs power and does something else, then humans will train it out unless they decide doing so is unsafe.” I think one potentially relevant takeaway is: (i) it’s plausible that AI systems will be motivated to grab power in non-catastrophic ways which could give opportunities to correct course, (ii) that isn’t automatically guaranteed by any particular honeypot, e.g. you can’t just give a high reward and assume that will motivate the system.
Reading back over this now, I think we’re arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking.
I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.* Let’s distinguish three types (where “myopic” is intended to mean something like “only cares about the current episode”).
1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)
2. Myopic terminal reward maximization
3. Non-myopic terminal reward maximization
Either 1 or 2 (or both of them) seems plausible to me. 3 is the one I’m skeptical about. How come?
We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn’t change quickly with new information). So once they understand the concept of reward maximization, it’ll be easier for them to adopt it as an instrumental strategy than as a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness, like making thousands of clones, people are more likely to pursue them for instrumental reasons than for terminal reasons.)
Even if they adopt reward maximization as a terminal goal, they’re more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn’t generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we’ll train subsequent agents to be less myopic. But it seems to me that the selection pressure generated by a handful of examples of real-world attempted takeovers is very small, compared with other aspects of training; and that even if it’s significant, it may just teach agents specific constraints like “don’t take over datacenters”.
* Now that I say that, I notice that I’m also open to the possibility of policies learning deceptively-aligned goals first, then gradually shifting from reward-maximization as an instrumental goal to reward-maximization as a terminal goal. But let’s focus for now on which goals are learned first.
Why does it not lead to takeover in the same way?
Because it’s easy to detect and correct (except that correcting it might push you into one of the other regimes).
So, far causally upstream of the human evaluator’s opinion? E.g., an AI counselor optimizing for getting to know you?
Note that the “without countermeasures” post consistently discusses both possibilities (the model cares about reward or the model cares about something else that’s consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:
As well as the section Even if Alex isn’t “motivated” to maximize reward… I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.
With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard—I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there’s no notion of reward on the deployment distribution doesn’t feel compelling to me.
Yepp, agreed, the thing I’m objecting to is how you mainly focus on the reward case, and then say “but the same dynamics apply in other cases too...”
The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).
I agree with your general point here, but I think Ajeya’s post actually gets this right.
I also think that often “the AI just maximizes reward” is a useful simplifying assumption. That is, we can make an argument of the form “even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed”.
(Though of course it’s important to spell the argument out)
Yeah, I agree this is a good argument structure—in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it’s quite useful to establish that it’s doomed; that’s the kind of structure I was going for in the post.
I strongly disagree with the “best case” thing. Like, policies could just learn human values! It’s not that implausible.
If I had to try point to the crux here, it might be “how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?” Where we both agree that there’s some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I’m more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there’s the human analogy: our goals are very strongly biased towards things we have direct observational access to!)
Even setting aside this disagreement, though, I don’t like the argumentative structure because the generalization of “reward” to large scales is much less intuitive than the generalization of other concepts (like “make money”) to large scales—in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.
Yes, sorry, “best case” was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing “correctly;” we could get lucky and have it generalize “incorrectly” in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.
I don’t understand why reward isn’t something the model has direct access to—it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I’d have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.
Yeah, I don’t really agree with this; I think I could pretty easily imagine being an AI system asking the question “How much reward would this episode get if it were sampled for training?” It seems like the intuition this is weird and unnatural is doing a lot of work in your argument, and I don’t really share it.
AFAIK the reward signal is not typically included as an input to the policy network in RL. Not sure why, and I could be wrong about that, but that is not my main question. The bigger question is “Has direct access to when?”
At the moment in time when the model is making a decision, it does not have direct access to the decision-relevant reward signal because that reward is typically causally downstream of the model’s decision. That reward may not even have a definite value until after decision time. Whereas concrete observables like “shiny gold coins” and “the finish line straight ahead” and “my opponent is in check” (and other abstractions in the model’s ontology that are causally upstream from reward in reality) are readily available at decision time. It seems to me that that makes them natural candidates for credit assignment to flag early on as the reward-responsible mental events and reinforce into stable motivations, since they in fact were the factors that determined the decisions that led to rewards.
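As a concrete illustration of that timing (a minimal toy sketch in PyTorch, my own example rather than anything from the discussion; obs_dim, n_actions, and the other names are placeholders), here is a bare-bones REINFORCE-style setup in which the policy’s forward pass sees only the observation, and reward touches the weights only afterwards, through the loss used for credit assignment:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4  # placeholder sizes, for illustration only

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def act(obs):
    # Decision time: the input is just the observation (the "shiny gold coins" etc.);
    # no reward signal appears anywhere in the policy's input.
    dist = torch.distributions.Categorical(logits=policy(obs))
    action = dist.sample()
    return action, dist.log_prob(action)

def reinforce_update(log_probs, rewards):
    # After the episode: rewards (only now known) reach the weights solely through this
    # loss, reinforcing whatever upstream computations produced the rewarded actions.
    # (Simplified: raw per-step rewards rather than returns-to-go, and no baseline.)
    returns = torch.as_tensor(rewards, dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Nothing here is specific to this toy setup; the point is just that reward enters the computation only at training time, causally downstream of the decisions it evaluates.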
IME, the most straightforward way for reward-itself to become the model’s primary goal would be if the model learns to base its decisions on an accurate reward-predictor much earlier than it learns to base its decisions on other (likely upstream) factors. If it instead learns how to accurately predict reward-itself after it is already strongly motivated by some concrete observables, I don’t see why we should expect it to dislodge that motivation, despite the true fact that those concrete observables are only pretty correlated with reward whereas an accurate reward-predictor is perfectly correlated with reward. Why? Because the model currently doesn’t care about reward-itself, it currently cares about the concrete observable(s), so it has no reason to take actions that would override that goal, and it has positive goal-content integrity reasons to not take those actions.
See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).