I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
I expect “reward” to be a hard goal to learn, because it’s a pretty abstract concept and not closely related to the direct observations that policies are going to receive.
“Reward” is not a very natural concept
This seems to be most of your position but I’m skeptical (and it’s kind of just asserted without argument):
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
If people train their AI with RLDT then the AI is literally trained to predict reward! I don’t see how this is remote, and I’m not clear if your position is that e.g. the value function will be bad at predicting reward because it is an “unnatural” target for supervised learning.
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” is analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
It seems like the analogous conclusion for RL systems would be “they may not care about the rewards that go into the SGD update, they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that as long as it’s very well-correlated on the training set.” But it doesn’t matter what we choose that’s causally upstream of rewards, as long as it’s perfectly correlated on the training set?
(Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of reward? But that doesn’t seem right to me.)
[The concept of “reward”] doesn’t apply outside training
I don’t buy it:
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on. This is done in practice today and seems like a pretty good idea (since the thing you care about is precisely the performance on deployment episodes) unless you are specifically avoiding it for safety reasons.
It’s plausible that such training is happening whether or not it actually is, and so that’s a very natural objective for a system that cares about maximizing reward conditioned on an episode being selected for training.
Even if test episodes are obviously not being used in training, there are still lots of plausible-sounding generalizations of “reward” to those episodes (whether based on physical implementation, or on the selection implemented by SGD, or based on conditioning on unlikely events, or based on causal counterfactuals...) and as far as I can tell pretty much all of them lead to the same bottom line.
If the deployment distribution is sufficiently different from the training distribution that the AI no longer does something with the same upshot as maximizing reward, then it seems very likely that the resulting behaviors are worse (e.g. they would receive a lower score when evaluated by humans) and so people will be particularly likely to train on those deployment episodes in order to correct the issue.
So I think that if a system was strategically optimizing reward on the training set, it would probably either do something similar-enough-to-result-in-grabbing-power on the test set, or else it would behave badly and be corrected.
(Overall I find the other point more persuasive, though again I think it’s better as an objection to 100% than as an objection to 50%: actually optimizing reward doesn’t do much better than various soups of heuristics, and so we don’t have a strong prediction that SGD will prefer one to the other without getting into the weeds and making very uncertain claims or else taking the limit.)
As one example: it means that every time you deploy the policy without the intention of rewarding it, then its key priority would be convincing you to insert that trajectory into the training data. (It might be instructive to think about what the rewards would need to be for that not to happen. Below 0? But the 0 point is arbitrary...) That seems pretty noticeable!
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way.
Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for. I agree that we shouldn’t expect a general thematic connection to “reward” beyond what you’d expect from the mechanics of SGD.
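As a toy illustration of that selection dynamic (everything here is a hypothetical sketch; the policies and numbers are made up, not taken from the thread): the SGD update only depends on reward over episodes that actually land in the training set, so a policy that pays a reward cost to force its episodes into training looks strictly worse, on the episodes SGD sees, than one that simply maximizes reward conditional on inclusion.

```python
import random

# Toy sketch (hypothetical policies and numbers): SGD's selection pressure
# is the average reward over episodes that actually enter training.
def selection_pressure(policy, n=10_000):
    rewards = []
    for _ in range(n):
        reward, in_training = policy()
        if in_training:  # episodes that never enter training contribute nothing
            rewards.append(reward)
    return sum(rewards) / len(rewards)

def sacrifices_for_inclusion():
    # Misbehaves to guarantee its episode is trained on, at a reward cost.
    return 0.5, True

def maximizes_conditional_reward():
    # Gets full reward; whether the episode is trained on is random (50% here).
    return 1.0, random.random() < 0.5

# The conditional maximizer dominates on the episodes SGD actually sees.
assert selection_pressure(maximizes_conditional_reward) > selection_pressure(sacrifices_for_inclusion)
```

The sketch only mirrors the conditional point above: parameters that trade reward for inclusion get lower average reward on the trained episodes, so selection moves away from them.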
and even within training it’s dependent on the specific training algorithm you use
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power or else bad behavior that would be corrected by further training.
They focus on policies learning the goal of getting high reward.
The claim is that AI systems will take actions that get them a lot of reward. This doesn’t have to be based on the mechanistic claim that they are thinking about reward, it can also just be the empirical observation: somehow you selected a policy that creatively gets a lot of reward across a very broad training distribution, so in a novel situation “gets a lot of reward” is one of our most robust predictions about the behavior of the AI.
This is how Ajeya’s post goes, and it explicitly notes that the AI trained with RLDT could be optimizing something else such that they only want to get reward during training, but that this mostly just makes things worse: everything seems to lead to the same place for more or less the same reason.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way, and so that part of the objection is much stronger (though see my objections above).
Overall I’m concerned that reasoning about “the goal of getting high reward” is too anthropomorphic and is a bad way to present the argument to ML researchers in particular.
I don’t feel like this argument is particularly anthropomorphic. The argument is that a policy selected for achieving X will achieve X in a new situation (which we can run for lots of different X, since there are many X that are perfectly correlated on the training set). In all the discussions I’ve been in (both with safety people and with ML people) the counterargument leans much more heavily on the analogy with humans (though I think that analogy is typically misapplied).
I don’t understand the analogy with humans. It sounds like you are saying “an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward” is analogous to “humans care about the details of their reward circuitry.” But:
I don’t think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
(Emphasis added)
I don’t think this engages with the substance of the analogy to humans. I don’t think any party in this conversation believes that human learning is “just” RL based on a reward circuit, and I don’t believe it either. “Just RL” also isn’t necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.
I would say “human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and...”
Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values:
Wolfram Schultz and colleagues have found that the signaling behavior of phasic dopamine in the mesocorticolimbic pathway mirrors that of a TD error (or reward prediction error).
In addition to finding correlates of reinforcement learning signals in the brain, artificial manipulation of those signal correlates (through optogenetic stimulation, for example) produces the behavioral adjustments that would be predicted from their putative role in reinforcement learning.
Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
In particular, for this to be evidence for Richard’s claim, you need to say: “If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition.” There’s some update there but it’s just not big. It’s easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way—humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
In particular, humans prominently do reinforcement learning.
Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
“RL → high chance of caring about reality” predicts this observation more strongly than “RL → low chance of caring about reality”.
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I’m just not seeing.
But in particular, it doesn’t seem relevant to consider selection pressures from evolution, except insofar as we’re postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards “RL → high chance of caring about reality.”
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
I don’t see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can’t make further updates by reasoning about how people do it?
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
I’m saying that there was a missed update towards that conclusion, so it doesn’t matter if we already knew that humans do within-lifetime learning?
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents mostly care about other stuff) / P(humans care about the real world | RL agents usually care about reward). So if e.g. P(humans care about the real world | RL agents usually care about reward) were 80%, then your update could be at most 1.25. In fact I think it’s even smaller than that.
And then if you try to turn that into evidence about “reward is a very hard concept to learn,” or a prediction about how neural nets trained with RL will behave, it’s moving my odds ratios by less than 10% (since we are using “RL” quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update).
You seem to be saying “yes but it’s evidence,” which I’m not objecting to—I’m just saying it’s an extremely small amount of evidence. I’m not clear on whether you agree with my calculation.
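Spelling out the arithmetic behind that bound (the 0.8 is the figure from the text; capping the other likelihood at 1.0 is the generous upper bound, since likelihoods can’t exceed 1):

```python
# Odds-form Bayes update: with one likelihood capped at 1.0 and the other
# at least 0.8 (the figure from the text), the odds ratio moves by <= 1.25x.
p_obs_given_favored = 1.0   # P(humans care about the real world | favored hypothesis), at most 1
p_obs_given_other = 0.8     # P(humans care about the real world | competing hypothesis)

max_update = p_obs_given_favored / p_obs_given_other
assert max_update == 1.25

# Even starting from even prior odds, the posterior barely moves:
posterior_odds = 1.0 * max_update
posterior_prob = posterior_odds / (1 + posterior_odds)  # 1.25 / 2.25, up from 0.50
```

So a 50/50 prior shifts to roughly 56/44 at most, which is the sense in which the observation is only weak evidence.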
(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren’t making this argument so you should ignore all of that, sorry to be confusing.)
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low.
Yes, in large part.
I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that, conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), reward-caring humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But also many people do in fact enjoy masturbating, even though it seems (to my naive view) like an obvious thing to select away, which was present ancestrally.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed—I, personally, feel good and happy when I am hanging out with my friends—then I still would be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected for noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | ), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… .2? I’d feel slightly more surprised by that than by the specific outcome of two coinflips coming up heads.
I think that there are serious path-dependent constraints in evolution, not a super clear/strong/realizable fitness gradient away from caring about reward if that’s how RL agents usually work, and so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. I think this alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status → reward; and it’s high-status to sacrifice yourself for your kid). Or because keeping your kid safe → high reward as another learned drive.
Overall this feels like contortion but I think it’s possible. Maybe overall this is a… 1-bit update against the “not selection for caring about reality” point?
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on.
I don’t know what this means. Suppose we have an AI which “cares about reward” (as you think of it in this situation). The “episode” consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.
What is the “reward” for this situation? What would have happened if we “sampled” this episode during training?
I agree there are all kinds of situations where the generalization of “reward” is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode since we can take counterfactuals over the randomization used to select training data.
It’s possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.
As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where “what if we had sampled during training?” is well-defined and fine. I was wondering if you viewed this as a general question we could ask.
I also agree that Ajeya’s post addresses this “ambiguity” question, which is nice!
I’m not very convinced by this comment as an objection to “50% AI grabs power to get reward.” (I find it more plausible as an objection to “AI will definitely grab power to get reward.”)
It’s intended as an objection to “AI grabs power to get reward is the central threat model to focus on”, but I think our disagreements still apply given this. (FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your “unnaturalness” abstraction can make finer-grained distinctions than that, but I don’t think I buy it.
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way. Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for.
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power
Some versions that wouldn’t result in power-grabbing:
Goal is “get highest proportion of possible reward”; the policy might rewrite the training algorithm to be myopic, then get perfect reward for one step, then stop.
Goal is “care about (not getting low rewards on) specific computers used during training”; the policy might destroy those particular computers, then stop.
Goal is “impress the critic”; the policy might then rewrite its critic to always output high reward, then stop.
Goal is “get high reward myself this episode”; the policy might try to do power-seeking things but never create more copies of itself, and eventually lose coherence + stop doing stuff.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
(FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
Children vs genes doesn’t seem like a good comparison, it seems obvious that models will understand the idea of reward during training (whereas humans don’t understand genes during evolution). A better comparison might be “have children during my life” vs “have a legacy after I’m gone,” but in fact humans have both goals even though one is never directly observed.
I guess more importantly, I don’t buy the claim about “things you only get selected on are less natural as goals than things you observe during episodes,” especially if your policy is trained to make good predictions of reward. I don’t know if there’s a specific reason for your view, or this is just a clash of intuitions. It feels to me like my position is kind of the default, in that you are offering a feature and saying it is a major consideration that SGD wouldn’t learn a particular kind of cognition.
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from “misbehave so that my data will be used for training” to “ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a high reward assuming that this data is used for training.”
I agree the behavior of the model isn’t easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that’s wrong I’m happy to disagree about it.
Some versions that wouldn’t result in power-grabbing:
Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that it may not care about holding power. (And maybe it doesn’t even have to grab power if you don’t have any countermeasures at test time? But then your system seems useless.)
But if you get frustrated by your AI taking over the datacenter and train it not to do that, as discussed in Ajeya’s story here, then you are left with an AI that grabs power and holds it.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
I don’t find this super compelling. There is a long causal chain from “Do the task” to “Humans think you did the task” to “You get high reward” to “You are selected by SGD.” By default there will be lots of datapoints that distinguish the late stages on the path from the early stages, since human evaluations do not perfectly track any kind of objective evaluation of “did you actually do the task?” So a model that cares only about doing the task and not about human evaluations will underperform a model that cares about human evaluations. And then a model that cares about human evaluations can then manipulate those evaluations via grabbing power and controlling human observations of the situation (and if it cares about something even later then it can also e.g. kill and replace the humans, but that’s not important to this argument).
So the implicit alternative you mention here just won’t get a low loss, and it’s really a question about whether SGD will find a lower loss policy rather than one about how it generalizes.
And once that’s the discussion, someone saying “well there is a policy that gets lower loss and doesn’t look very complicated, so I guess SGD has a good chance of finding it” seems like they are on the winning side of SGD. Your claim seems to be that models will fail to get low loss on the training set in a specific way, and I think saying “it’s a non-trivial and quite specific hypothesis that you will get a low loss” would be an inappropriate burden-of-proof shifting.
I agree that it’s unclear whether this will actually happen prior to models that are smart enough to obsolete human work on alignment, but I really think we’re looking at more like a very natural hypothesis with 50-50 chances (and at the very least the way you are framing your objection is not engaging with the arguments in the particular post of Ajeya’s that you are responding to).
I also agree that the question of whether the model will mess with the training setup or use power for something else is very complicated, but it’s not something that is argued for in Ajeya’s post.
I think the crucial thing that makes this robust is “if the model grabs power and does something else, then humans will train it out unless they decide doing so is unsafe.” I think one potentially relevant takeaway is: (i) it’s plausible that AI systems will be motivated to grab power in non-catastrophic ways which could give opportunities to correct course, (ii) that isn’t automatically guaranteed by any particular honeypot, e.g. you can’t just give a high reward and assume that will motivate the system.
I don’t buy it:
If people train AI systems on random samples of deployment, then “reward” does make sense—it’s just what would happen if you sampled this episode to train on. This is done in practice today and seems like a pretty good idea (since the thing you care about is precisely the performance on deployment episodes) unless you are specifically avoiding it for safety reasons.
Whether or not such training is actually happening, it will seem plausible to the model that it is, and so "maximize reward conditioned on this episode being selected for training" is a very natural objective for a system to learn.
Even if test episodes are obviously not being used in training, there are still lots of plausible-sounding generalizations of “reward” to those episodes (whether based on the physical implementation, on the selection implemented by SGD, on conditioning on unlikely events, or on causal counterfactuals...) and as far as I can tell pretty much all of them lead to the same bottom line.
If the deployment distribution is sufficiently different from the training distribution that the AI no longer does something with the same upshot as maximizing reward, then it seems very likely that the resulting behaviors are worse (e.g. they would receive a lower score when evaluated by humans) and so people will be particularly likely to train on those deployment episodes in order to correct the issue.
So I think that if a system was strategically optimizing reward on the training set, it would probably either do something similar-enough-to-result-in-grabbing-power on the test set, or else it would behave badly and be corrected.
(Overall I find the other point more persuasive, though again I think it’s better as an objection to 100% than as an objection to 50%: actually optimizing reward doesn’t do much better than various soups of heuristics, and so we don’t have a strong prediction that SGD will prefer one to the other without getting into the weeds and making very uncertain claims or else taking the limit.)
RL agents receive a loss that is based on reward in training episodes. If they tried to change the probability that a given episode appeared in training, and sacrificed reward to do so, that behavior would be discouraged by SGD—it would select away from parameters that do that (since they take worse actions on the episodes that actually appear in training). So I don’t think there’s any reason to think that an RL agent would behave this way.
Instead, you should expect an RL agent to maximize reward conditioned on the episode appearing in training, because that’s what SGD would select for. I agree that we shouldn’t expect a general thematic connection to “reward” beyond what you’d expect from the mechanics of SGD.
This is true, but as far as I can see all the possible versions result in grabbing power, so I don’t think it undermines the case. I don’t know if you have particular versions in mind that wouldn’t result in either grabbing power or else bad behavior that would be corrected by further training.
The claim is that AI systems will take actions that get them a lot of reward. This doesn’t have to be based on the mechanistic claim that they are thinking about reward, it can also just be the empirical observation: somehow you selected a policy that creatively gets a lot of reward across a very broad training distribution, so in a novel situation “gets a lot of reward” is one of our most robust predictions about the behavior of the AI.
This is how Ajeya’s post goes, and it explicitly notes that the AI trained with RLDT could be optimizing something else such that it only wants to get reward during training, but that this mostly just makes things worse: everything seems to lead to the same place for more or less the same reason.
I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don’t lead to takeover in the same way, and so that part of the objection is much stronger (though see my objections above).
I don’t feel like this argument is particularly anthropomorphic. The argument is that a policy selected for achieving X will achieve X in a new situation (which we can run for lots of different X, since there are many X that are perfectly correlated on the training set). In all the discussions I’ve been in (both with safety people and with ML people) the counterargument leans much more heavily on the analogy with humans (though I think that analogy is typically misapplied).
I don’t think this engages with the substance of the analogy to humans. I don’t think any party in this conversation believes that human learning is “just” RL based on a reward circuit, and I don’t believe it either. “Just RL” also isn’t necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.
I would say “human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and...”
Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values:
This is incredibly weak evidence.
Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren’t significant Bayesian evidence for “RL tends to produce external goals by default.”
In particular, for this to be evidence for Richard’s claim, you need to say: “If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition.” There’s some update there but it’s just not big. It’s easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way—humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
Curious what systems you have in mind here.
I don’t understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:
Lots of animals do reinforcement learning.
In particular, humans prominently do reinforcement learning.
Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
“RL → high chance of caring about reality” predicts this observation more strongly than “RL → low chance of caring about reality”
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I’m just not seeing.
But in particular, it doesn’t seem relevant to consider selection pressures from evolution, except insofar as we’re postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards “RL → high chance of caring about reality.”
I don’t see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can’t make further updates by reasoning about how people do it?
I’m saying that there was a missed update towards that conclusion, so it doesn’t matter if we already knew that humans do within-lifetime learning?
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I’m objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents mostly care about other stuff) / P(humans care about the real world | RL agents usually care about reward). So if e.g. P(humans care about the real world | RL agents usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it’s even smaller than that.
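As a sanity check on that bound, here is the arithmetic computed directly (the 80% figure is the hypothetical one from this discussion, not a measurement):

```python
# Odds-ratio update from the observation "humans care about the real world".
# Hypothesis labels are shorthand for the two positions in this thread:
#   H_other:  RL agents mostly care about stuff other than reward
#   H_reward: RL agents usually care about reward

p_obs_given_reward = 0.8  # P(obs | H_reward): claimed above to be fairly high
p_obs_given_other = 1.0   # P(obs | H_other): a probability is at most 1

# Bayes factor in favor of H_other, capped because the numerator <= 1:
bayes_factor = p_obs_given_other / p_obs_given_reward
print(bayes_factor)  # 1.25
```

So even granting the most favorable possible numerator, the observation shifts the odds by a factor of 1.25 at most.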
And then if you try to turn that into evidence about “reward is a very hard concept to learn,” or a prediction about how neural nets trained with RL will behave, it’s moving my odds ratios by less than 10% (since we are using “RL” quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update).
You seem to be saying “yes but it’s evidence,” which I’m not objecting to—I’m just saying it’s an extremely small amount of evidence. I’m not clear on whether you agree with my calculation.
(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren’t making this argument so you should ignore all of that, sorry to be confusing.)
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments—no offense to you in particular. Apologies if it’s confusing or unclear.)
Yes, in large part.
Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that, conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), humans who care about reward could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).
POV: I’m in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don’t care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things?
Here are some ways I can think of:
Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn’t solve it properly, the person ends up masturbating and not taking (re)productive actions.
Counter-counterpoint: But many people do in fact enjoy masturbating, even though masturbation was present ancestrally and seems (to my naive view) like an obvious thing to select away.
People seem to be able to tell when you don’t really care about them, and just want to use them to get things.
But if I valued the rewarding feeling of having friends and spending time with them and watching them succeed—if I, personally, feel good and happy when I am hanging out with my friends—then I would still be instrumentally aligned with my friends. If so, there would be no selection pressure for reward-motivation detectors, in the way that people were selected for noticing deception (even if not hardcoded to do so).
Overall, I feel like people being selected by evolution is an important qualifier which ends up changing the inferences we make via e.g. P(care about the real world | ...), and I think I’ve at least somewhat neglected this qualifier. But I think I’d estimate P(humans care about the real world | RL agents usually care about reward) as… 0.2? I’d feel slightly more surprised by that than by the specific outcome of two coinflips coming up heads.
I think that there are serious path-dependent constraints in evolution, not a super clear/strong/realizable fitness gradient away from caring about reward if that’s how RL agents usually work, and so I expect humans to be relatively typical along this dimension, with some known unknowns around whatever extra circuitry evolution wrote into us.
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
I think this highlights a good counterpoint. I think this alternate theory predicts “probably not”, although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status → reward; and it’s high-status to sacrifice yourself for your kid). Or because keeping your kid safe → high reward as another learned drive.
Overall this feels like contortion but I think it’s possible. Maybe overall this is a… 1-bit update against the “not selection for caring about reality” point?
I don’t know what this means. Suppose we have an AI which “cares about reward” (as you think of it in this situation). The “episode” consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.
What is the “reward” for this situation? What would have happened if we “sampled” this episode during training?
I agree there are all kinds of situations where the generalization of “reward” is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode, since we can take counterfactuals over the randomization used to select training data.
It’s possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.
As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where “what if we had sampled during training?” is well-defined and fine. I was wondering if you viewed this as a general question we could ask.
I also agree that Ajeya’s post addresses this “ambiguity” question, which is nice!
It’s intended as an objection to “AI grabs power to get reward is the central threat model to focus on”, but I think our disagreements still apply given this. (FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)
I expect policies to be getting rich input streams like video, text, etc, which they use to make decisions. Reward is different from other types of data because reward isn’t actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how “care about children” is an easier goal to learn than “care about genes”).
I don’t think this line of reasoning works, because “the episode appearing in training” can be a dependent variable. For example, consider an RL agent that’s credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.
Some versions that wouldn’t result in power-grabbing:
Goal is “get highest proportion of possible reward”; the policy might rewrite the training algorithm to be myopic, then get perfect reward for one step, then stop.
Goal is “care about (not getting low rewards on) specific computers used during training”; the policy might destroy those particular computers, then stop.
Goal is “impress the critic”; the policy might then rewrite its critic to always output high reward, then stop.
Goal is “get high reward myself this episode”; the policy might try to do power-seeking things but never create more copies of itself, and eventually lose coherence + stop doing stuff.
I don’t think any of these are particularly likely, the point is more that “high reward via tampering with/taking over the training setup” is a fairly different type of thing from “high reward via actually performing tasks”, and it’s a non-trivial and quite specific hypothesis that the latter will generalize to the former (in the regimes we’re concerned about).
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?
Children vs genes doesn’t seem like a good comparison, it seems obvious that models will understand the idea of reward during training (whereas humans don’t understand genes during evolution). A better comparison might be “have children during my life” vs “have a legacy after I’m gone,” but in fact humans have both goals even though one is never directly observed.
I guess more importantly, I don’t buy the claim that “things you only get selected on are less natural as goals than things you observe during episodes,” especially if your policy is trained to make good predictions of reward. I don’t know if there’s a specific reason for your view, or if this is just a clash of intuitions. It feels to me like my position is kind of the default, in that you are pointing at a feature and claiming it is a major consideration against SGD learning a particular kind of cognition.
If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from “misbehave so that my data will be used for training” to “ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a low reward assuming that this data is used for training.”
I agree the behavior of the model isn’t easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that’s wrong I’m happy to disagree about it.
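The dynamic in this paragraph can be illustrated with a toy REINFORCE sketch (the setup and all numbers are hypothetical, chosen only to illustrate the mechanics): misbehaving guarantees an episode enters the training set but earns lower reward there, and the SGD update, which only ever sees episodes that made it into training, still drives the misbehavior probability down.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.0  # logit of P(misbehave); starts at 50/50
lr = 0.5

for _ in range(2000):
    grads = []
    rewards = []
    for _episode in range(64):  # one batch of episodes
        p = sigmoid(theta)
        misbehave = random.random() < p
        # Misbehaving guarantees the episode enters training, but earns less
        # reward there; behaving episodes are sampled into training 10% of
        # the time. (Hypothetical numbers.)
        reward = 0.2 if misbehave else 1.0
        in_training = misbehave or random.random() < 0.1
        if in_training:
            # d(log pi)/d(theta) for the sampled action
            grads.append((1 - p) if misbehave else -p)
            rewards.append(reward)
    if rewards:
        baseline = sum(rewards) / len(rewards)
        update = sum((r - baseline) * g for r, g in zip(rewards, grads)) / len(rewards)
        theta += lr * update

p_misbehave = sigmoid(theta)
print(p_misbehave)  # ends up near 0
```

The point of the sketch is just that inclusion-seeking is not reinforced: conditional on being in the batch at all, behaving episodes earn more reward, so the gradient selects against the policy that sacrifices reward to get its episodes into training.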
Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that it may not care about holding power. (And maybe it doesn’t even have to grab power if you don’t have any countermeasures at test time? But then your system seems useless.)
But if you get frustrated by your AI taking over the datacenter and train it not to do that, as discussed in Ajeya’s story here, then you are left with an AI that grabs power and holds it.
I don’t find this super compelling. There is a long causal chain from “Do the task” to “Humans think you did the task” to “You get high reward” to “You are selected by SGD.” By default there will be lots of datapoints that distinguish the late stages on the path from the early stages, since human evaluations do not perfectly track any kind of objective evaluation of “did you actually do the task?” So a model that cares only about doing the task and not about human evaluations will underperform a model that cares about human evaluations. And then a model that cares about human evaluations can then manipulate those evaluations via grabbing power and controlling human observations of the situation (and if it cares about something even later then it can also e.g. kill and replace the humans, but that’s not important to this argument).
So the implicit alternative you mention here just won’t get a low loss, and it’s really a question about whether SGD will find a lower loss policy rather than one about how it generalizes.
And once that’s the discussion, someone saying “well there is a policy that gets lower loss and doesn’t look very complicated, so I guess SGD has a good chance of finding it” seems like they are on the winning side of SGD. Your claim seems to be that models will fail to get low loss on the training set in a specific way, and I think saying “it’s a non-trivial and quite specific hypothesis that you will get a low loss” would be an inappropriate burden-of-proof shifting.
I agree that it’s unclear whether this will actually happen prior to models that are smart enough to obsolete human work on alignment, but I really think we’re looking at more like a very natural hypothesis with 50-50 chances (and at the very least the way you are framing your objection is not engaging with the arguments in the particular post of Ajeya’s that you are responding to).
I also agree that the question of whether the model will mess with the training setup or use power for something else is very complicated, but it’s not something that is argued for in Ajeya’s post.
I think the crucial thing that makes this robust is “if the model grabs power and does something else, then humans will train it out unless they decide doing so is unsafe.” I think one potentially relevant takeaway is: (i) it’s plausible that AI systems will be motivated to grab power in non-catastrophic ways which could give opportunities to correct course, (ii) that isn’t automatically guaranteed by any particular honeypot, e.g. you can’t just give a high reward and assume that will motivate the system.
Reading back over this now, I think we’re arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking.
I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.* Let’s distinguish three types (where “myopic” is intended to mean something like “only cares about the current episode”).
1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)
2. Myopic terminal reward maximization
3. Non-myopic terminal reward maximization
Either 1 or 2 (or both of them) seems plausible to me. 3 is the one I’m skeptical about. How come?
We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn’t change quickly with new information). So once they understand the concept of reward maximization, it’ll be easier for them to adopt it as an instrumental strategy than a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness (like making thousands of clones) people are more likely to do it for instrumental reasons than terminal reasons.)
Even if they adopt reward maximization as a terminal goal, they’re more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn’t generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we’ll train subsequent agents to be less myopic. But it seems to me that the selection pressure generated by a handful of examples of real-world attempted takeovers is very small, compared with other aspects of training; and that even if it’s significant, it may just teach agents specific constraints like “don’t take over datacenters”.
* Now that I say that, I notice that I’m also open to the possibility of policies learning deceptively-aligned goals first, then gradually shifting from reward-maximization as an instrumental goal to reward-maximization as a terminal goal. But let’s focus for now on which goals are learned first.
Why does it not lead to takeover in the same way?
Because it’s easy to detect and correct (except that correcting it might push you into one of the other regimes).
So, far causally upstream of the human evaluator’s opinion? E.g. an AI counselor optimizing for getting to know you.