Non-myopia stories
Written under the supervision of Lionel Levine. Thanks to Owain Evans, Aidan O’Gara, Max Kaufmann, and Johannes Treutlein for comments.
This post is a synthesis of arguments made by other people. It provides a collection of answers to the question, “Why would an AI become non-myopic?” In this post I’ll describe a model as myopic if it cares only about what happens in the current training episode.[1] This form of myopia is called episodic myopia. Typically, we expect models to be myopic because the training process does not reward the AI for outcomes outside of its training episode. Non-myopia is interesting because it indicates a flaw in training – somehow our AI has started to care about something we did not design it to care about.
One reason to care about non-myopia is that it can cause a system to manipulate its own training process. If an ML system wants to affect what happens after its gradient update, it can do so through the gradient update itself. For instance, an AI might become deceptively aligned, behaving as aligned as possible in order to minimize how much it is changed by stochastic gradient descent (SGD). Or an AI could engage in exploration hacking, deliberately avoiding certain behaviors during training because they would be rewarded and subsequently reinforced, and it does not want them reinforced. Additionally, non-myopic AI systems could collude in adversarial setups like AI safety via debate. If debates between AI systems are iterated, they are analogous to a prisoner's dilemma, and non-myopic systems could cooperate across rounds.
This post will outline six different routes to non-myopia:
Simulating other agents. Models could simulate humans or other non-myopic agents and adopt their non-myopia.
Inductive bias toward long-term goals. Inductive biases like simplicity might favor non-myopic goals.
Meta-learning. A meta-learning loop can select for non-myopic agents.
(Acausal) trade. An otherwise myopic model might behave non-myopically by trading with other AI models.
Implicitly non-myopic objective functions. Objective functions might incentivize non-myopia by depending on an estimate of the future consequences of the model's actions.
Non-myopia enables deceptive alignment. Becoming non-myopic could cause a model to become deceptively aligned, which leads to higher training reward.
Running example: The stamp collector
This post uses a stamp-collecting AI as a running example. This hypothetical AI is trained in some deep reinforcement learning (RL) setup, and its reward depends on how many stamps it collects on a given day. The stamp collector is trained myopically: it is rewarded at the end of each day for the stamps collected on that day.
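For concreteness, here is a minimal sketch of what this myopic training setup might look like. The environment, policy, and update rule below are all hypothetical stand-ins; the only point being illustrated is that the reward used for each update depends solely on stamps collected within the current day.

```python
import random

class StampEnv:
    """Toy environment: each step the agent either searches for stamps or idles."""
    def reset(self):
        return 0  # trivial observation marking the start of a new day

    def step(self, action):
        stamps_gained = 1 if action == "search" and random.random() < 0.5 else 0
        return 0, stamps_gained  # (next observation, stamps gained this step)

class Policy:
    """Placeholder policy; a real agent would be a neural network trained with RL."""
    def act(self, obs):
        return random.choice(["search", "idle"])

    def update(self, trajectory, reward):
        pass  # stand-in for the gradient update

def run_episode(policy, env, steps_per_day=100):
    """One training episode = one day; the reward counts only this day's stamps."""
    obs = env.reset()
    stamps_today = 0
    trajectory = []
    for _ in range(steps_per_day):
        action = policy.act(obs)
        obs, gained = env.step(action)
        stamps_today += gained
        trajectory.append((obs, action))
    # The update only ever sees this episode's stamp count, so nothing in the
    # setup explicitly rewards the agent for stamps collected on later days.
    policy.update(trajectory, reward=stamps_today)
    return stamps_today
```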
Simulating humans
A model could develop long-term goals by directly simulating a person. You could imagine asking a powerful LLM how Elon Musk would run a business. Provided that the LLM can continue this simulation indefinitely, it would simulate Elon Musk with all of his non-myopia. Jailbreaks show that LLMs will violate their finetuning objectives in order to more faithfully simulate text. Future models might retain this tendency to simulate characters. An LLM finetuned on some myopic task might lapse into simulating a non-myopic character, such as an unaligned AI that wants to escape its training process. Gwern depicts such a scenario in his clippy story.
Inductive bias toward long-term goals
The training process might also favor long-term goals because they are simpler.[2] An AI trained using reinforcement learning from human feedback (RLHF) to obey user instructions could learn the goals "If the user asks you to write a poem, write a poem" and "If the user asks you a question, answer the question" separately, but it could also learn the goal "be obedient." The latter is simpler and would generalize better to new tasks. Similarly, long-term goals are simpler because they do not require a duration to be specified. For example, "maximize stamps indefinitely" is simpler than "maximize stamps up until 11:59 pm today."
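As a crude illustration of this description-length intuition (toy code, not a claim about how goals are actually represented in a network's weights), the time-bounded goal needs an extra piece of information, the cutoff, that the unbounded goal does not:

```python
import datetime

# Toy "goal" functions scoring a log of timestamped stamp counts.
# stamps_at_time is a list of (time, stamps_collected) pairs.

def value_unbounded(stamps_at_time):
    # "Maximize stamps indefinitely": every stamp counts; no extra parameters.
    return sum(count for _, count in stamps_at_time)

def value_bounded(stamps_at_time, cutoff=datetime.time(23, 59)):
    # "Maximize stamps up until 11:59 pm today": the same goal, plus a cutoff
    # that has to be specified and stored somewhere.
    return sum(count for t, count in stamps_at_time if t <= cutoff)
```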
Meta-learning
Meta-learning works through an inner loop and an outer loop. The inner loop trains a population of models on some task, typically using gradient descent. The outer loop then updates some global parameters based on how each model in the population is performing. Whereas gradient descent can only reinforce model properties that cause greater performance within the same episode, the outer loop can select for properties that help across episodes. In other words, the inner loop selects for myopia but the outer loop does not.
Imagine our stamp collector is trained using meta-learning. 100 stamp collectors are trained in parallel and the inner loop, which uses gradient descent, updates their weights every 10 days. Every 50 days, the outer loop takes the 50 best-performing stamp collectors and copies their weights over to the 50 worst-performing stamp collectors. In doing so, the outer loop selects non-myopic models that maximize stamps over all days.
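A minimal sketch of this outer loop might look like the following. The functions run_day and inner_update are hypothetical placeholders for the environment rollout and the gradient-descent step; the population size and update periods are taken from the example above.

```python
import copy

INNER_PERIOD_DAYS = 10   # inner loop: gradient updates every 10 days
OUTER_PERIOD_DAYS = 50   # outer loop: selection every 50 days

def train_population(agents, num_days, run_day, inner_update):
    """agents: e.g. a list of 100 stamp collectors.
    run_day(agent) returns the stamps that agent collected today.
    inner_update(agent) performs the (myopic) gradient-descent step."""
    total_stamps = [0.0] * len(agents)

    for day in range(1, num_days + 1):
        for i, agent in enumerate(agents):
            total_stamps[i] += run_day(agent)

        if day % INNER_PERIOD_DAYS == 0:
            for agent in agents:
                inner_update(agent)  # reinforces only within-episode reward

        if day % OUTER_PERIOD_DAYS == 0:
            # Rank agents by stamps collected over ALL days so far, then copy
            # the best half over the worst half. This step selects for any
            # property (including non-myopia) that raises the cumulative,
            # cross-episode stamp count.
            ranked = sorted(range(len(agents)),
                            key=lambda i: total_stamps[i], reverse=True)
            half = len(agents) // 2
            for best_i, worst_i in zip(ranked[:half], ranked[half:]):
                agents[worst_i] = copy.deepcopy(agents[best_i])
    return agents
```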
Krueger et al. empirically demonstrate this effect in Hidden Incentives for Auto-Induced Distributional Shift. They show that adding meta-learning loops sometimes induces non-myopia in both supervised learning and RL setups. [ETA: Incidentally, the paper shows the same thing occurring for non-meta-learning Q-learners.]
More broadly, we can view the optimization that engineers apply to training setups (such as hyperparameter tuning) as a form of meta-learning. Just like formal meta-learning setups, these optimizations can select for non-myopia. Compared to meta-learning, however, they may be fairly weak and select among fewer agents. It is unclear whether training process optimization applies enough bits of optimization to cause non-myopia; it would be interesting to study this empirically.
(Acausal) trade
A model could have strictly short-term goals but act as if it is pursuing long-term goals because it is trading with another entity. For instance, someone could offer our stamp collector stamps today in exchange for a credible assurance that they will be paid tomorrow. In order to honor this trade, the AI would take actions today that ensure the money appears in the person's account tomorrow. Additionally, the AI could cooperate with copies of itself (or other agents) through acausal trade. That is, it might cooperate with other agents because it thinks its decisions are correlated with theirs.
Acausal trade could occur if today’s stamp collector thinks to itself, “I just care about stamps today and yesterday’s stamp collector only cared about stamps yesterday. Yesterday’s stamp collector is similar to me, so it will make similar decisions. Therefore, if I do things to increase the number of stamps tomorrow, this indicates that yesterday’s stamp collector will have done things to increase the number of stamps today. Therefore, I should produce more stamps tomorrow, because that means that yesterday’s stamp collector produced more stamps today.” Essentially, the stamp collectors would be engaging in a scheme where every participant wins except the first stamp collector – it does not benefit because previous iterations did not participate. Although the first stamp collector does not benefit, it might still participate because it assigns some probability to the existence of past versions that participated.
While acausal trade is rational under some decision theories, it would still be selected against by a myopic training process. That is, gradient descent has an incentive to prevent a model from engaging in acausal trades that lead to lower reward in the short term. There are some exceptions: if the model is outside of training, it does not get modified, and if an acausal trade leads to higher reward (e.g. because it makes the model deceptively aligned), it is incentivized by SGD.
Mark Xu discusses similar possibilities involving acausal trade in Open Problems with Myopia.
Implicitly non-myopic objective functions
Richard Ngo raises this point in Arguments against myopic training.
Many objective functions might incorporate signals that answer questions like:
If we implemented this piece of code, how would it affect our company?
What are the downstream effects of this scientific discovery?
How would people react to this social media post?
All these signals involve estimating the consequences of the model's actions beyond the training episode. To score highly on them, the model might develop its own estimate of the consequences of its actions and execute plans that score highly according to that estimate. As a result, the AI is choosing actions that it expects to have certain consequences beyond the training episode. In other words, it has non-myopic goals.
Consider an AI that is trained to make changes to some codebase. In each episode it adds some new feature. The reward signal includes a score supplied by a human who judges how useful the feature is. In order to estimate a feature’s usefulness, the programmer guesses how many additional users it would bring to the company. In the course of training, the agent develops the ability to simulate what would happen if a given feature were deployed, and then count how many users the company would have in that scenario. If the AI’s estimate matches the estimates that the human judge gives, this would allow the agent to find features that score highly, so this behavior is reinforced during training. Thus, the RL algorithm has caused the AI to optimize for long-term user growth, even though the training process is myopic.
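A toy sketch of such a reward signal (the function name, inputs, and scaling are all hypothetical) makes the implicit non-myopia visible: the reward is handed out within the episode, but one of its terms is an estimate of consequences that play out long after the episode ends.

```python
def episode_reward(tests_passed: bool, judged_future_users: float) -> float:
    """Reward for one coding episode in the example above.

    judged_future_users is the human judge's guess at how many additional
    users the feature would bring to the company after deployment, i.e. an
    estimate of outcomes well beyond the training episode.
    """
    correctness_term = 1.0 if tests_passed else 0.0
    usefulness_term = judged_future_users / 1000.0  # hypothetical scaling
    # An agent that learns to predict and maximize judged_future_users is,
    # in effect, being trained to optimize long-term user growth.
    return correctness_term + usefulness_term
```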
Non-myopia enables deceptive alignment
Evan Hubinger makes this argument in How likely is deceptive alignment?
Consider an AI that has all the properties of a deceptively aligned model, except that it is myopic. This AI myopically pursues a misaligned proxy objective, an objective that is correlated with but not identical to the training objective. It knows that its objective differs from the training objective; it also knows that, if it pursues its proxy objective in training, it will be modified by gradient descent. If this system went from being myopic to being non-myopic, it would care about what happens after the training episode. As a result, it might pursue the training objective directly in order to avoid being modified by stochastic gradient descent (SGD). This would allow it to preserve its goals and execute a treacherous turn later on. By pursuing the training objective rather than its misaligned proxy objective, the AI would get higher reward. Thus, behaving non-myopically leads the model to achieve higher reward, and so, by optimizing for higher reward, SGD could make the AI non-myopic.
Let’s say our stamp collector has the proxy objective of collecting only blue stamps. Focusing only on blue stamps leads to fewer stamps in total, as the stamp collector passes up good opportunities for other stamps in favor of collecting blue stamps. If the AI were made to care about collecting stamps indefinitely, it might become deceptively aligned, collecting stamps of all colors during training in order to later escape and turn the world into blue stamps, or whatever. Thus, SGD might modify the AI into caring about blue stamps indefinitely as that leads to higher training reward. In other words, SGD might modify a misaligned AI to become non-myopic because that would make it deceptively aligned, which would get more reward in the short term.
Instead of making the AI non-myopic in order to make it deceptively aligned, why does SGD not simply modify it into caring about the training objective? SGD might favor making the AI non-myopic because doing so is easier than correcting the objective. SGD has two ways to make the AI care about the training objective. First, it can hard-code the training objective into the AI. However, doing so might be difficult if the training objective is complex. Second, if the AI knows what the training objective is, SGD could set the AI's goal to "pursue whatever you think the training objective is." The second option is attractive because it does not require specifying a potentially complex objective within the AI's weights: "Do whatever your operator wants" is easier to specify than "Help your operator perform well at their job; make sure they stay healthy; remind them to water the plants; etc." But the second option might be quite complex as well: it requires making the AI's objective point to the part of its world model that contains the training objective, which could require extensive modification of the AI's existing objective. By comparison, making the AI non-myopic could be an easy fix.
Related work
Several works attempt to pinpoint the concept of myopia in AI systems. Defining Myopia provides several possible definitions; LCDT, A Myopic Decision Theory specifies what a myopic decision theory could look like. The stories in this post are inspired by previous work:
How likely is deceptive alignment? argues that non-myopia enables deceptive alignment,
Arguments against myopic training discusses implicitly non-myopic reward functions,
Open Problems with Myopia discusses acausal trade, and
Hidden Incentives for Auto-Induced Distributional Shift discusses non-myopia through meta-learning.
Other work on myopia includes How complex are myopic imitators? and How LLMs are and are not myopic. For discussions of self-fulfilling prophecies, see The Parable of Predict-O-Matic, Underspecification of Oracle AI, Conditioning Predictive Models: Outer alignment via careful conditioning, Proper scoring rules don’t guarantee predicting fixed points, and Stop-gradients lead to fixed point predictions.
Appendix: On self-fulfilling prophecies
A variation of non-myopia can occur through self-fulfilling prophecies: if an AI is rewarded for predicting the future and its predictions influence the future, then it has an incentive to steer the future using its predictions.[3] In other words, an AI that wants to predict the world accurately also wants to steer it. AIs that do not care about the consequences of their predictions are called consequence-blind. Myopia and consequence-blindness both aim to restrict the domain that an AI cares about. In myopia, we want to prevent models from caring about what happens after a training episode. In consequence-blindness we want to prevent them from caring about the consequences of their predictions.
- ^
Training episodes only make sense in reinforcement learning, but there are analogues in supervised learning. For instance, you might call a language model non-myopic if it attempts to use its predictions of one document to influence its performance on another document. For example, an LLM might be in a curriculum learning setup where its performance determines what documents it is shown later. This LLM might be able to improve its overall performance by doing worse early on in order to be shown easier documents later.
- ^
See for example Valle-Pérez et al. 2018.
- ^
See The Parable of Predict-O-Matic for an accessible explanation of this point.
It’s unclear if you’ve ordered these in a particular way. How likely do you think they each are? My ordering from most to least likely would probably be:
Inductive bias toward long-term goals
Meta-learning
Implicitly non-myopic objective functions
Simulating humans
Non-myopia enables deceptive alignment
(Acausal) trade
Why do you think this:
Who says we don’t want non-myopia, those safety people?! I guess to me it looks like the most likely reason we get non-myopia is that we don’t try that hard not to. This would be some combination of Meta-learning, Inductive bias toward long-term goals, and Implicitly non-myopic objective functions, as well as potentially “Training for non-myopia”.
It seems like a lot of people would expect myopia by default since the training process does nothing to incentivize non-myopia. "Why would the model care about what happens after an episode if it does not get rewarded for it?" I think skepticism about non-myopia is a reason ML people are often skeptical of deceptive alignment concerns.
Another reason to expect myopia by default is that – to my knowledge – nobody has shown non-myopia occurring without meta-learning being applied.
Your ordering seems reasonable! My ordering in the post is fairly arbitrary. My goal was mostly to put easier examples early on.
I found this clarifying for my own thinking! Just a small additional point: in Hidden Incentives for Auto-Induced Distributional Shift, there is also the example of a Q-learner that learns to sometimes take a non-myopic action (I believe cooperating with its past self in a prisoner's dilemma) without any meta-learning.
Thanks for pointing this out! I will make a note of that in the main post.
Nit (or possibly major disagreement): I don’t think this training regime gets you a stamp maximizer. I think this regime gets you a behaviors-similar-to-those-that-resulted-in-stamps-in-the-past-exhibiter. These behaviors might be non-myopic behaviors that nevertheless are not “evaluate the expected results of each possible action, and choose the action which yields the highest expected number of stamps”.
Why do you think that? A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations to immediately maximize future reward—rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.
Why do you think that a model-based approach will outperform a model-free approach? We may just be using words differently here: I’m using “model-based” to mean “maintains an explicit model of its environment, its available actions, and the anticipated effects of those actions, and then performs whichever action its world model anticipates would have the best results”, and I’m using “model-free” to describe a policy which can be thought of as a big bag of “if the inputs look like this, express that behavior”, where the specific pattern of things the policy attends to and behaviors it expresses is determined by past reinforcement.[1] So something like AlphaGo as a whole would be considered model-based, but AlphaGo’s policy network, in isolation, would be considered model-free.
Reward is not the optimization target. In the training regime described, the outer loop selects for whichever stamp collectors performed best in the training conditions, and thus reinforces policies that led to high reward in the past.
“Evaluate the expected number of collected stamps according to my internal model for each possible action, and then choose the action which my internal model rates highest” might still end up as the highest-performing policy, if the following hold:
The internal model of the consequences of possible actions is highly accurate, to a sufficient degree that the tails do not come apart. If this doesn’t hold, the system will end up taking actions that are rated by its internal evaluator as being much higher value than they in fact end up being.
There exists nothing in the environment which will exploit inaccuracies in the internal value model.
There do not exist easier-to-discover non-consequentialist policies which yield better outcomes given the same amount of compute in the training environment.
I don’t expect all of those to hold in any non-toy domain. Even in toy domains like Go, the “estimations of value are sufficiently close to the actual value” assumption [empirically seems not to hold](https://arxiv.org/abs/2211.00241), and a two-player perfect-information game seems like a best-case scenario for consequentialist agents.
A “model-free” system may contain one or many learned internal structures which resemble its environment. For example, [OthelloGPT contains a learned model of the board state](https://www.neelnanda.io/mechanistic-interpretability/othello), and yet it is considered “model-free”. My working hypothesis is that the people who come up with this terminology are trying to make everything in ML as confusing as possible.
For the reason I just explained. Planning can optimize without actually having to experience all states beforehand. A learned model lets you plan and predict things before they have happened {{citation needed}}. This is also a standard test of model-free vs model-based behavior in both algorithms and animals: model-based learning can update much faster, and being able to update after a single reward, or to learn associations with other information (e.g. following a sign), is considered experimental evidence for model-based learning rather than a model-free approach, which must experience many episodes before it can reverse or undo all of the prior learning.
As Turntrout has already noted, that does not apply to model-based algorithms, and they ‘do optimize the reward’:
Should a model-based algorithm be trainable by the outer loop, it will learn to optimize for the reward. There is no reason you cannot use, say, ES or PBT for hyperparameter optimization of a model-based algorithm like AlphaZero, or even training the parameters too. They are operating at different levels. If you used those to train an AlphaZero, it doesn't somehow stop being model-based RL: it continues to do planning over a simulator of Go, expanding the game tree out to terminal nodes with a hardwired reward function determined by whether it won or lost the game, and its optimization target is the reward of winning/losing. That remains true no matter what the outer loop is, whether it was evolution strategies or the actual Bayesian hyperparameter optimization that DM used.
As for your 3 conditions: they are either much easier than you make them sound or equally objections to model-free algorithms. #1 is usually irrelevant because you cannot apply unbounded optimization pressure (like in the T-maze experiments—there's just nothing you can arbitrarily maximize, you go left, or you go right, end of story), and such overestimates can be useful for exploration. (Nor does model-freeness mean you cannot have inaccurate estimates of either actions or values.) #2 is a non sequitur because model-freeness doesn't grant you immunity to game theory either. #3 is just irrelevant by stipulation if there is any useful planning to be done, and you're moving the goalposts as nothing was said about compute limits. (Not that it is self-evidently true that model-free is cheaper either, particularly in complex environments: policy gradients in particular tend to be extremely expensive to train due to sample-inefficiency and being on-policy, and model-based Go agents like AlphaZero and MuZero outperform any model-free approaches I am aware of—I don't know why you think the adversarial KataGo examples are relevant, when model-free approaches tend to be even more vulnerable to adversarial examples; that was the whole point of going from AlphaGo to AlphaZero: they couldn't beat the delusions with policy gradient approaches. If model-based could be so easily outperformed, why does it ever exist, like you or I do?)
I think that you still haven’t quite grasped what I was saying. Reward is not the optimization target totally applies here. (It was the post itself which only analyzed the model-free case, not that the lesson only applies to the model-free case.)
In the partial quote you provided, I was discussing two specific algorithms which are highly dissimilar to those being discussed here. If (as we were discussing) you're doing MCTS (or "full-blown backwards induction") on reward for the leaf nodes, the system optimizes the reward. That is—if most of the optimization power comes from explicit search on an explicit reward criterion (as in AIXI), then you're optimizing for reward. If you're doing e.g. AlphaZero, that aggregate system isn't optimizing for reward.
Despite the derision which accompanies your discussion of Reward is not the optimization target, it seems to me that you still do not understand the points I’m trying to communicate. You should be aware that I don’t think you understand my views or that post’s intended lesson. As I offered before, I’d be open to discussing this more at length if you want clarification.
CC @faul_sname