Reward is not the optimization target
This insight was made possible by many conversations with Quintin Pope, where he challenged my implicit assumptions about alignment. I’m not sure who came up with this particular idea.
In this essay, I call an agent a “reward optimizer” if it not only gets lots of reward, but if it reliably makes choices like “reward but no task completion” (e.g. receiving reward without eating pizza) over “task completion but no reward” (e.g. eating pizza without receiving reward). Under this definition, an agent can be a reward optimizer even if it doesn’t contain an explicit representation of reward, or implement a search process for reward.
ETA 9/18/23: This post addresses the model-free policy gradient setting, including algorithms like PPO and REINFORCE.
Reinforcement learning is learning what to do—how to map situations to actions so as to maximize a numerical reward signal. — Reinforcement learning: An introduction
Many people[1] seem to expect that reward will be the optimization target of really smart learned policies—that these policies will be reward optimizers. I strongly disagree. As I argue in this essay, reward is not, in general, that-which-is-optimized by RL agents.[2]
Separately, as far as I can tell, most[3] practitioners usually view reward as encoding the relative utilities of states and actions (e.g. it’s this good to have all the trash put away), as opposed to imposing a reinforcement schedule which builds certain computational edifices inside the model (e.g. reward for picking up trash → reinforce trash-recognition and trash-seeking and trash-putting-away subroutines). I think the former view is usually inappropriate, because in many setups, reward chisels cognitive grooves into an agent.
Therefore, reward is not the optimization target in two senses:
1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
Reward probably won’t be a deep RL agent’s primary optimization target
After work, you grab pizza with your friends. You eat a bite. The taste releases reward in your brain, which triggers credit assignment. Credit assignment identifies which thoughts and decisions were responsible for the release of that reward, and makes those decisions more likely to happen in similar situations in the future. Perhaps you had thoughts like
“It’ll be fun to hang out with my friends” and
“The pizza shop is nearby” and
“Since I just ordered food at a cash register, execute motor-subroutine-#51241 to take out my wallet” and
“If the pizza is in front of me and it’s mine and I’m hungry, raise the slice to my mouth” and
“If the slice is near my mouth and I’m not already chewing, take a bite.”
Many of these thoughts will be judged responsible by credit assignment, and thereby become more likely to trigger in the future. This is what reinforcement learning is all about—the reward is the reinforcer of those things which came before it and the creator of new lines of cognition entirely (e.g. anglicized as “I shouldn’t buy pizza when I’m mostly full”). The reward chisels cognition which increases the probability of the reward accruing next time.
Importantly, reward does not automatically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!
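To make the mechanics concrete, here is a minimal sketch of a REINFORCE-style update for a stateless softmax policy, in plain Python. The two-action pizza setup, the names `softmax` and `reinforce_step`, and the learning rate are all assumptions made for illustration, not anything taken from a real training run. Notice that the policy never takes reward as an input: reward only appears as a scalar which scales how strongly the sampled action gets upweighted.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update for a stateless softmax policy.

    The gradient of log pi(action) w.r.t. logit k is 1[k == action] - pi(k).
    Reward enters only as a multiplier on that gradient.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


# Hypothetical setup: action 0 = "eat the pizza", action 1 = "skip it".
logits = [0.0, 0.0]
rng = random.Random(0)
for _ in range(200):
    probs = softmax(logits)
    action = 0 if rng.random() < probs[0] else 1
    reward = 1.0 if action == 0 else 0.0  # only pizza-eating is rewarded
    logits = reinforce_step(logits, action, reward)

final_probs = softmax(logits)
print(final_probs)
```

In this toy, a reward of zero leaves the logits untouched, and the policy ends up preferring the rewarded action even though nothing inside it represents or predicts reward.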
What kinds of people (or non-tabular agents more generally) will become reward optimizers, such that the agent ends up terminally caring about reward (and little else)? Reconsider the pizza situation, but instead suppose you were thinking thoughts like “this pizza is going to be so rewarding” and “in this situation, eating pizza sure will activate my reward circuitry.”
You eat the pizza, triggering reward, triggering credit assignment, which correctly locates these reward-focused thoughts as contributing to the release of reward. Therefore, in the future, you will more often take actions because you think they will produce reward, and so you will become more of the kind of person who intrinsically cares about reward. This is a path[4] to reward-optimization and wireheading.
While it’s possible to have activations on “pizza consumption predicted to be rewarding” and “execute motor-subroutine-#51241” and then have credit assignment hook these up into a new motivational circuit, this is only one possible direction of value formation in the agent. Seemingly, the most direct way for an agent to become more of a reward optimizer is to already make decisions motivated by reward, and then have credit assignment further generalize that decision-making.
The siren-like suggestiveness of the word “reward”
Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater.
Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).
Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”
Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had cognition-updater-acquiring thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!
RL is not, in general, about training cognition-updater optimizers.
When is reward the optimization target of the agent?
If reward is guaranteed to become your optimization target, then your learning algorithm can force you to become a drug addict. Let me explain.
Convergence theorems provide conditions under which a reinforcement learning algorithm is guaranteed to converge to an optimal policy for a reward function. For example, value iteration maintains a table of value estimates for each state s, and iteratively propagates information about that value to the neighbors of s. If a far-away state f has huge reward, then that reward ripples back through the environmental dynamics via this “backup” operation. Nearby parents of f gain value, and then after lots of backups, far-away ancestor-states gain value due to f’s high reward.
Eventually, the “value ripples” settle down. The agent picks an (optimal) policy by acting to maximize the value-estimates for its post-action states.
Suppose it would be extremely rewarding to do drugs, but those drugs are on the other side of the world. Value iteration backs up that high value to your present space-time location, such that your policy necessarily gets at least that much reward. There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.
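As a rough illustration of these backups, the following sketch runs tabular value iteration on a hypothetical six-state chain: state 0 is “home” with a small reward, and state 5 is the far-away high-reward state. The chain layout, the reward values, the discount factor, and the function names are illustrative assumptions, chosen only to make the ripple visible.

```python
def value_iteration(rewards, gamma=0.9, sweeps=200):
    """Tabular value iteration on a deterministic chain.

    From each state the agent may stay put or step one state to the right
    (the last state can only stay). rewards[t] is received on entering t.
    """
    n = len(rewards)
    values = [0.0] * n
    for _ in range(sweeps):
        new_values = []
        for s in range(n):
            successors = [s] if s == n - 1 else [s, s + 1]
            # Backup: best (reward + discounted value) over successor states.
            new_values.append(max(rewards[t] + gamma * values[t] for t in successors))
        values = new_values
    return values


def greedy_action(state, values, rewards, gamma=0.9):
    """Pick the action whose successor looks best under the current values."""
    stay = rewards[state] + gamma * values[state]
    if state == len(values) - 1:
        return "stay"
    right = rewards[state + 1] + gamma * values[state + 1]
    return "right" if right > stay else "stay"


# Hypothetical rewards: a little for staying home, a lot at the far end.
REWARDS = [1.0, 0.0, 0.0, 0.0, 0.0, 100.0]

early = value_iteration(REWARDS, sweeps=1)
converged = value_iteration(REWARDS, sweeps=200)

print(greedy_action(0, early, REWARDS))      # "stay": the ripple has not arrived yet
print(greedy_action(0, converged, REWARDS))  # "right": home now inherits the far-away value
```

After a single sweep, only the states next to the high-reward state have gained value, so the greedy choice at home is to stay. After enough sweeps, the value has propagated all the way back and the greedy policy at home is to start traveling.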
But obviously these conditions aren’t true in the real world. Your learning algorithm doesn’t force you to try drugs. Any AI which e.g. tried every action at least once would quickly kill itself, and so real-world general RL agents won’t explore like that because that would be stupid. So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit.
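One way to see why the exploration condition matters is a toy learner which never explores. The sketch below is an assumption-laden illustration, not a model of any real agent: a greedy tabular learner on a hypothetical two-armed bandit, where arm 1 stands in for the high-reward option (e.g. wireheading) and ties are broken toward arm 0. The function name `run_greedy_bandit` and the payout numbers are made up for this example.

```python
def run_greedy_bandit(payouts, steps=1000, lr=0.5):
    """Greedy value learner with no exploration.

    Estimates start at zero. max() returns the first maximal element, so
    ties go to the lowest-index arm.
    """
    q = [0.0] * len(payouts)
    pulls = [0] * len(payouts)
    for _ in range(steps):
        arm = max(range(len(q)), key=lambda a: q[a])
        pulls[arm] += 1
        # Move the estimate for the pulled arm toward its observed payout.
        q[arm] += lr * (payouts[arm] - q[arm])
    return q, pulls


# Hypothetical payouts: arm 0 is modestly rewarding, arm 1 hugely rewarding.
q, pulls = run_greedy_bandit([1.0, 100.0])
print(q, pulls)
```

Since the learner never tries arm 1, its estimate for that arm never updates, and the large available reward has no influence on what the learner does. The convergence guarantees rely on an exploration assumption which this learner simply does not satisfy.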
Anticipated questions
Why won’t early-stage agents think thoughts like “If putting trash away will lead to reward, then execute motor-subroutine-#642”, and then this gets reinforced into reward-focused cognition early on?
Suppose the agent puts away trash in a blue room. Why won’t early-stage agents think thoughts like “If putting trash away will lead to the wall being blue, then execute motor-subroutine-#642”, and then this gets reinforced into blue-wall-focused cognition early on? Why consider either scenario to begin with?
But aren’t we implicitly selecting for agents with high cumulative reward, when we train those agents?
Yeah. But on its own, this argument can’t possibly imply that selected agents will probably be reward optimizers. The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.
“We’re selecting for agents on reward → we get an agent which optimizes reward” is locally invalid. “We select for agents on X → we get an agent which optimizes X” is not true for the case of evolution, and so is not true in general.
Therefore, the argument isn’t necessarily true in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
Here’s the more concrete response: Selection isn’t just for agents which get lots of reward.
For simplicity, consider the case where on the training distribution, the agent gets reward if and only if it reaches a goal state. Then any selection for reward is also selection for reaching the goal. And if the goal is the only red object, then selection for reward is also selection for reaching red objects.
In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions. In general, it seems like there should be a lot of those. Therefore, since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), then selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].
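Here is a small sketch of that point, under made-up assumptions: a hypothetical training distribution in which the goal is always the only red object, and three hand-written candidate policies (`seek_goal`, `seek_red`, `stay_home`). The names, the one-dimensional positions, and the scoring function are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    goal_pos: int  # where the goal state is
    red_pos: int   # where the red object is


def seek_goal(ep):
    return ep.goal_pos


def seek_red(ep):
    return ep.red_pos


def stay_home(ep):
    return 0


def reward(ep, final_pos):
    """Reward is given if and only if the agent ends at the goal."""
    return 1.0 if final_pos == ep.goal_pos else 0.0


def total_reward(policy, episodes):
    return sum(reward(ep, policy(ep)) for ep in episodes)


# Training distribution: the goal is always the only red object.
train = [Episode(goal_pos=p, red_pos=p) for p in range(1, 6)]

policies = {"seek_goal": seek_goal, "seek_red": seek_red, "stay_home": stay_home}
scores = {name: total_reward(pi, train) for name, pi in policies.items()}
best = max(scores.values())
selected = sorted(name for name, s in scores.items() if s == best)
print(scores, selected)
```

Selecting on training reward keeps both `seek_goal` and `seek_red`, because they earn identical reward on every training episode. Reward alone gives the selection process no way to prefer one of these over the other.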
Another reason to not expect the selection argument to work is that it’s convergently instrumental for most inner agent values to not become wireheaders, for them to not try hitting the reward button.
I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
This reasoning follows for most inner goals by instrumental convergence.
On my current best model, this is why people usually don’t wirehead. They learn their own values via deep RL, like caring about dogs, and these actual values are opposed to the person they would become if they wirehead.
Don’t some people terminally care about reward?
I think so! I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only have computations about reward chiseled into it after other kinds of computations (e.g. putting away trash) have already been chiseled. More on this in later essays.
But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button.[7]
For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.
AIXI—
will always kill you and then wirehead forever, unless you gave it something like a constant reward function.
And, IMO, this fact is not practically relevant to alignment. AIXI is explicitly a reward-maximizer. As far as I know, AIXI(-tl) is not the limiting form of any kind of real-world intelligence trained via reinforcement learning.
Does the choice of RL algorithm matter?
For point 1 (reward is not the trained agent’s optimization target), it might matter.
I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups. I think the key lessons apply to the general case, but I think the setup will substantially affect which values tend to be grown.
If the agent’s curriculum is broad, then reward-based cognition may get reinforced from a confluence of tasks (solve mazes, write sonnets), while each task-specific cognitive structure is only narrowly contextually reinforced. That said, this is also selecting equally hard for agents which do the rewarded activities, and reward-motivation is only one possible value which produces those decisions.
Pretraining a language model and then slotting that into an RL setup also changes the initial computations in a way which I have not yet tried to analyze.
It’s possible there’s some kind of RL algorithm which does train agents which limit to reward optimization (and, of course, thereby “solves” inner alignment in its literal form of “find a policy which optimizes the outer objective signal”).
For point 2 (reward provides local updates to the agent’s cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates.
A similar lesson applies to the updates provided by loss signals. A loss signal provides updates which deform the agent’s cognition into a new shape.
TurnTrout, you’ve been talking about an AI’s learning process using English, but ML gradients may not neatly be expressible in our concepts. How do we know that it’s appropriate to speculate in English?
I am not certain that my model is legit, but it sure seems more legit than (my perception of) how people usually think about RL (i.e. in terms of reward maximization, and reward-as-optimization-target instead of as feedback signal which builds cognitive structures).
I only have access to my own concepts and words, so I am provisionally reasoning ahead anyways, while keeping in mind the potential treacheries of anglicizing imaginary gradient updates (e.g. “be more likely to eat pizza in similar situations”).
Dropping the old hypothesis
At this point, I don’t see a strong reason to focus on the “reward optimizer” hypothesis. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any tight mechanistic stories for that. I’d love to hear some, if there are any.
As far as I’m aware, the strongest evidence left for agents intrinsically valuing cognition-updating is that some humans do strongly (but not uniquely) value cognition-updating,[8] and many humans seem to value it weakly, and humans are probably RL agents in the appropriate ways. So we definitely can’t rule out agents which strongly (and not just weakly) value the cognition-updater. But it’s also not the overdetermined default outcome. More on that in future essays.
It’s true that reward can be an agent’s optimization target, but what reward actually does is reinforce computations which lead to it. A particular alignment proposal might argue that a reward function will reinforce the agent into a shape such that it intrinsically values reinforcement, and that the cognition-updater goal is also a human-aligned optimization target, but this is still just one particular approach of using the cognition-updating to produce desirable cognition within an agent. Even in that proposal, the primary mechanistic function of reward is reinforcement, not optimization-target.
Implications
Here are some major updates which I made:
Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported.
Wireheading was never a high-probability problem for RL-trained agents, absent a specific story for why cognition-updater-acquiring thoughts would be chiseled into primary decision factors.
Stop worrying about finding “outer objectives” which are safe to maximize.[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
Instead, focus on building good cognition within the agent.
In my ontology, there’s only one question: How do we grow good cognition inside of the trained agent?
Mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button).
The latter form of reasoning skips past the mechanistic substance of reinforcement learning: The chiseling of computations responsible for the acquisition of the cognition-updater. I still think it’s useful to consider selection, but mostly in order to generate failure modes whose mechanistic plausibility can be evaluated.
In my view, reward’s proper role isn’t to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI’s mind.
Edit 11/15/22: The original version of this post talked about how reward reinforces antecedent computations in policy gradient approaches. This is not true in general. I edited the post to instead talk about how reward is used to upweight certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents.
Appendix: The field of RL thinks reward=optimization target
Let’s take a little stroll through Google Scholar’s top results for “reinforcement learning”, emphasis added:
The agent’s job is to find a policy… that maximizes some long-run measure of reinforcement. ~ Reinforcement learning: A survey
In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards. ~ Reinforcement learning: The Good, The Bad and The Ugly
Steve Byrnes did, in fact, briefly point out part of the “reward is the optimization target” mistake:
I note that even experts sometimes sloppily talk as if RL agents make plans towards the goal of maximizing future reward… — Model-based RL, Desires, Brains, Wireheading
I don’t think it’s just sloppy talk, I think it’s incorrect belief in many cases. I mean, I did my PhD on RL theory, and I still believed it. Many authorities and textbooks confidently claim—presenting little to no evidence—that reward is an optimization target (i.e. the quantity which the policy is in fact trying to optimize, or the quantity to be optimized by the policy). Check what the math actually says.
[1] Including the authors of the quoted introductory text, Reinforcement learning: An introduction. I have, however, met several alignment researchers who already internalized that reward is not the optimization target, perhaps not in so many words.
[2] Utility ≠ Reward points out that an RL-trained agent is optimized by original reward, but not necessarily optimizing for the original reward. This essay goes further in several ways, including when it argues that reward and utility have different type signatures: that reward shouldn’t be viewed as encoding a goal at all, but rather a reinforcement schedule. And not only do I not expect the trained agents to maximize the original “outer” reward signal, I think they probably won’t try to strongly optimize any reward signal.
[3] Reward shaping seems like the most prominent counterexample to the “reward represents terminal preferences over state-action pairs” line of thinking.
[4] But also, you were still probably thinking about reality as you interacted with it (“since I’m in front of the shop where I want to buy food, go inside”), and credit assignment will still locate some of those thoughts as relevant, and so you wouldn’t purely reinforce the reward-focused computations.
[5] “Reward reinforces existing thoughts” is ultimately a claim about how updates depend on the existing weights of the network. I think that it’s easier to update cognition along the lines of existing abstractions and lines of reasoning. If you’re already running away from wolves, then if you see a bear and become afraid, you can be updated to run away from large furry animals. This would leverage your existing concepts.
From A shot at the diamond-alignment problem:
The local mapping from gradient directions to behaviors is given by the neural tangent kernel, and the learnability of different behaviors is given by the NTK’s eigenspectrum, which seems to adapt to the task at hand, making the network quicker to learn along behavioral dimensions similar to those it has already acquired.
[6] Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren’t updated in a way it wouldn’t endorse. Though that’s an example of convergent powerseeking, not reward seeking.”
[7] For mechanistically similar reasons, keep cocaine out of the crib until your children can model the consequences of addiction.
[8] I am presently ignorant of the relationship between pleasure and reward prediction error in the brain. I do not think they are the same.
However, I think people are usually weakly hedonically / experientially motivated. Consider a person about to eat pizza. If you give them the choice between “pizza but no pleasure from eating it” and “pleasure but no pizza”, I think most people would choose the latter (unless they were really hungry and needed the calories). If people just navigated to futures where they had eaten pizza, that would not be true.
[9] From correspondence with another researcher: There may yet be an interesting alignment-related puzzle to “Find an optimization process whose maxima are friendly”, but I personally don’t share the intuition yet.
- 15 May 2023 17:11 UTC; 4 points) 's comment on Reward is the optimization target (of capabilities researchers) by (
- 17 Sep 2023 16:56 UTC; 3 points) 's comment on AI Pause Will Likely Backfire by (EA Forum;
- 5 Oct 2022 19:23 UTC; 3 points) 's comment on Stable Pointers to Value: An Agent Embedded in Its Own Utility Function by (
- 12 Aug 2022 4:08 UTC; 3 points) 's comment on Artificial intelligence wireheading by (
- 6 Mar 2023 20:40 UTC; 3 points) 's comment on Article Review: Discovering Latent Knowledge (Burns, Ye, et al) by (
- 10 May 2023 4:49 UTC; 3 points) 's comment on When is Goodhart catastrophic? by (
- 16 Dec 2024 15:14 UTC; 3 points) 's comment on GPTs are Predictors, not Imitators by (
- 2 Jan 2023 7:49 UTC; 3 points) 's comment on Alignment, Anger, and Love: Preparing for the Emergence of Superintelligent AI by (
- 15 Aug 2022 3:17 UTC; 3 points) 's comment on An observation about Hubinger et al.’s framework for learned optimization by (
- On value in humans, other animals, and AI by 31 Jan 2023 23:33 UTC; 3 points) (
- 13 Feb 2023 23:57 UTC; 3 points) 's comment on On value in humans, other animals, and AI by (
- 12 Apr 2023 11:31 UTC; 3 points) 's comment on All AGI Safety questions welcome (especially basic ones) [April 2023] by (
- 21 Nov 2022 21:14 UTC; 2 points) 's comment on Seeking Power is Often Convergently Instrumental in MDPs by (
- 16 May 2023 20:19 UTC; 2 points) 's comment on Seeking Power is Often Convergently Instrumental in MDPs by (
- 5 Jun 2023 18:27 UTC; 2 points) 's comment on Nature < Nurture for AIs by (
- 3 Mar 2023 17:17 UTC; 2 points) 's comment on A reply to Byrnes on the Free Energy Principle by (
- 15 Dec 2022 3:37 UTC; 2 points) 's comment on Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight by (
- 3 Apr 2024 0:41 UTC; 2 points) 's comment on EJT’s Shortform by (
- 28 Aug 2023 16:40 UTC; 2 points) 's comment on The Game of Dominance by (
- 15 Aug 2022 4:20 UTC; 2 points) 's comment on Will Capabilities Generalise More? by (
- 30 Sep 2022 0:18 UTC; 2 points) 's comment on Emergency learning by (
- 31 Jul 2022 18:48 UTC; 2 points) 's comment on [Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation by (
- 16 Apr 2023 12:56 UTC; 2 points) 's comment on DragonGod’s Shortform by (
- 3 Dec 2023 4:47 UTC; 1 point) 's comment on A thought experiment to help persuade skeptics that power-seeking AI is plausible by (
- 29 Jun 2023 4:25 UTC; 1 point) 's comment on A Proposal for AI Alignment: Using Directly Opposing Models by (
- 29 Jun 2023 2:07 UTC; 1 point) 's comment on A Proposal for AI Alignment: Using Directly Opposing Models by (
- 3 Apr 2023 20:02 UTC; 1 point) 's comment on Pre-Training + Fine-Tuning Favors Deception by (
- 4 Oct 2022 21:31 UTC; 0 points) 's comment on Stable Pointers to Value: An Agent Embedded in Its Own Utility Function by (
- 26 Feb 2023 22:28 UTC; 0 points) 's comment on The Preference Fulfillment Hypothesis by (
- 18 Apr 2023 3:28 UTC; 0 points) 's comment on Evolution provides no evidence for the sharp left turn by (
- A Proposal for AI Alignment: Using Directly Opposing Models by 27 Apr 2023 18:05 UTC; 0 points) (
- Reward IS the Optimization Target by 28 Sep 2022 17:59 UTC; -2 points) (
I view this post as providing value in three (related) ways:
1. Making a pedagogical advancement regarding the so-called inner alignment problem
2. Pointing out that a common view of “RL agents optimize reward” is subtly wrong
3. Pushing for thinking mechanistically about cognition-updates
Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn’t truly comprehend it—sure, I could parrot back terms like “base optimizer” and “mesa-optimizer”, but it didn’t click. I was confused.
Some months later I read this post and then it clicked.
Part of the pedagogical value is not having to introduce the four terms of the form [base/mesa] + [optimizer/objective] and throw them around. Even with Rob Miles’ exposition skills, that’s a bit overwhelming.
I also liked the phrases “Just because common English endows ‘reward’ with suggestive pleasurable connotations” and “Let’s strip away the suggestive word ‘reward’, and replace it by its substance: cognition-updater.” One could be tempted to object that surely no one would make the mistakes pointed out here, but some people definitely do. I did. The post being a bit gloves-off here definitely helped me.
Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view.
I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It’s the former view that this post (correctly) argues against. I am sympathetic to pushback of the form “there are arguments that make it reasonable to privilege reward-maximization as a hypothesis” and about this post going a bit too far, but these remarks should not be confused with a rebuttal of the basic point of “cognition-updates are a completely different thing from terminal-goals”.
(A part that has bugged me is that the notion of maximizing reward doesn’t seem to be even well-defined—there are multiple things you could be referring to when you talk about something maximizing reward. See e.g. footnote 82 in the Scheming AIs paper (page 29). Hence taking it for granted that reward is maximized has made me confused or frustrated.)
Re 3: Many of the classical, conceptual arguments about AI risk talk about maximums of objective functions and how those are dangerous. As a result, it’s easy to slide to viewing reinforcement learning policies in terms of maximums of rewards.
I think this is often a mistake. Sure, to first order “trained models get high reward” is a good rule of thumb, and “in the limit of infinite optimization this thing is dangerous” is definitely good to keep in mind. I still think one can do better in terms of descriptive accounts of current models, and I think I’ve gotten value out of thinking in terms of cognition-updates instead of in terms of models that maximize reward as well as they can with their limited capabilities.
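To make the cognition-update framing concrete, here is a minimal sketch of a REINFORCE-style update, assuming a hypothetical two-armed bandit with a softmax policy. The names `act`, `reinforce_update`, and `train`, the learning rate, and the task itself are illustrative assumptions, not anything taken from the post. The thing to notice is where reward appears: only as a multiplier on the parameter update, never as an input to action selection.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def act(logits, rng):
    """Action selection reads only the policy parameters; reward is not an input."""
    probs = softmax(logits)
    return 0 if rng.random() < probs[0] else 1


def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step: reward scales the gradient of log pi(action)."""
    probs = softmax(logits)
    # For a softmax policy: d/d logit_k of log pi(action) = 1[k == action] - pi(k)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def train(steps=2000, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        a = act(logits, rng)
        r = 1.0 if a == 1 else 0.0  # hypothetical task: only arm 1 is rewarded
        logits = reinforce_update(logits, a, r)
    return logits


logits = train()
probs = softmax(logits)
```

In this toy setup the trained policy ends up strongly preferring arm 1, so it does get high reward. Mechanistically, though, all that happened is that the computations which preceded reward were reinforced. A step with zero reward leaves the parameters untouched, and nothing in `act` represents or consults reward.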
There are many similarities between inner alignment and “reward is not the optimization target”. Both are sazens, serving as handles for important concepts. (I also like “reward is a cognition-modifier, not terminal-goal”, which I use internally.) Another similarity is that they are difficult to explain. Looking back at the post, I felt some amount of “why are you meandering around instead of just saying the Thing?”, with the immediate next thought being “well, it’s hard to say the Thing”. Indeed, I do not know how to say it better.
Nevertheless, this is the post that made me get it, and there are few posts that I refer to as often as this one. I rank it among the top posts of the year.
Just now saw this very thoughtful review. I share a lot of your perspective, especially:
and
Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there’s a way to “see” this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt.
I wish I had written the key lessons and insights more plainly. I think I got a bit carried away with in-group terminology and linguistic conventions, which limited the reach and impact of these insights.
I am less wedded to “think about what shards will form and make sure they don’t care about bad stuff (like reward)”, because I think we won’t get intrinsically agentic policy networks. I think the most impactful AIs will be LLMs+tools+scaffolding, with the LLMs themselves being “tool AI.”
The “RL ‘agents’ will maximize reward”/“The point of RL is to select for high reward” mistake is still made frequently and prominently. Yoshua Bengio (a Turing Award winner!) recently gave a talk at an alignment workshop. Here’s one of his slides:
During the Q&A, I questioned him, and he was incredulous that I disagreed. We chatted after his talk. I also sent him this article, and he disagreed with that as well. Bengio influences AI policy quite a bit, so I find this especially worrying. I do not want RL training methods to be dismissed or seen as suspect because of e.g. contingent terminological choices like “reward” or “agents.”
(Also, in my experience, if I don’t speak up and call out these claims, no one does.)
I think there may have been a communication error. It sounded to me like you were making the point that the policy does not have to internalize the reward function, but he was making the point that the training setup does attempt to find a policy that maximizes-as-far-as-it-can-tell the reward function. In other words, he was saying that reward is the optimization target of RL training, while you were saying that reward is not the optimization target of policy inference. Maybe.
I’m pretty sure he was talking about the trained policies and them, by default, maximizing reward outside the historical training distribution. He was making these claims very strongly and confidently, and in the very next slide cited Cohen’s Advanced artificial agents intervene in the provision of reward. That work advocates a very strong version of “policies will maximize some kind of reward because that’s the point of RL.”
He later appeared to clarify/back down from these claims, but in a way which seemed inconsistent with his slides, so I was pretty confused about his overall stance. His presentation, though, was going strong on “RL trains reward maximizers.”
There’s also a problem where a bunch of people appear to have cached that e.g. “inner alignment failures” can happen (whatever the heck that’s supposed to mean), but other parts of their beliefs seem to obviously not have incorporated this post’s main point. So if you say “hey you seem to be making this mistake”, they can point to some other part of their beliefs and go “but I don’t believe that in general!”.
this post made me understand something i did not understand before that seems very important. important enough that it made me reconsider a bunch of related beliefs about ai.