Safety considerations for online generative modeling
Summary: the online decision transformer is a recent approach to creating agents in which a decision transformer is pre-trained offline (as usual) before producing its own trajectories which are fed back into the model in an online finetuning phase. I argue that agents made with generative modeling have safety advantages – but capabilities disadvantages – over agents made with other RL approaches, and agents made with online generative modeling (like the online decision transformer) may maintain these safety advantages while being closer to parity in capabilities. I propose experiments to test all this. (There is also an appendix discussing the connections between some of these ideas and KL-regularized RL.)
I’ll start with some motivation and then introduce the decision transformer and the online decision transformer. If you’re already familiar with the decision transformer, you can probably skip to “Online generative modeling.” If you’re already familiar with the online decision transformer, you can skip to “Some remarks.”
Motivation
Suppose we want to get an AI system to produce a picture of a cat.
A naive approach is to take a trained image classifier M and optimize an image to maximally activate M’s ‘cat’ classification. If you do this, you won’t get anything that looks like a cat. Instead, you’ll get some deep-dream-esque conglomeration of whiskers, cat ears, and fur which M happens to strongly classify as ‘cat.’ One way of describing what went wrong here is that the image of the cat you get is very off-distribution. You would rather get a normal-looking picture of a cat, i.e. one which is on-distribution for M’s training data.
This can be taken as an analogy for alignment: if you have an agent optimize the world for a given utility function U, what you get won’t look anything like a normal world, but rather a very weird-looking world which happens to maximize U. We would prefer to get a normal-looking world which also scores high according to U.
We’ve recently seen lots of reminders that modern ML does have ways to produce normal-looking pictures of cats. A key ingredient in all of these approaches is generative modeling. Recall that a generative model is something which is fed in data and learns to produce new data which “looks similar” to its training data; in other words, a generative model tries to produce new samples from its training distribution. Think GPT, which tries to produce text which looks like the text it’s been trained on.
Generative models can also be conditioned on information which, roughly speaking, tells the generative model which part of its training distribution to sample from (e.g. conditioning GPT on a prompt tells it to sample from the part of its training distribution consisting of texts which start with that prompt; a caption given to DALL-E tells it to sample from the part of its training distribution consisting of images which would be given that caption). So to get a normal-looking picture of a cat, train a generative model on lots of images and captions, and then ask the generative model to sample an image from its training distribution, conditional on that image being one which would be captioned ‘cat.’
If “produce a normal-looking picture of a cat” is an analogy for alignment and generative models solve the “produce a normal-looking picture of a cat” problem, then what does an agent built via generative modeling look like?
Making agents via generative modeling
It looks like a decision transformer.
Recall that decision transformers work as follows. Suppose we want to train an agent to play the Atari game Breakout. We start by encoding the game states, rewards, and actions as tokens. We then treat playing Breakout as a sequence-modeling problem, which can be solved with a transformer. (In other words, instead of training a transformer to predict token sequences which correspond to characters in text (like GPT), you train the transformer to predict tokens which correspond to actions in Breakout.) A transformer used in this way is called a decision transformer.
Typically, the training data for a decision transformer is generated by humans playing Breakout (or whatever game we want our agent to play). If we ran this trained decision transformer without conditioning, it would just attempt to mimic typical human play. To do better, we condition the decision transformer on getting a high score[1]; the decision transformer will then try to play the game like a human who gets a high score.[2]
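To make this concrete, here is a minimal sketch (my own stand-in code, not the decision transformer paper's implementation) of acting by conditioning on a target return. `seq_model` is assumed to be any autoregressive model over (return-to-go, state, action) sequences with a hypothetical `sample_next_action` method, and `env` is assumed to be a gym-style environment.

```python
# Minimal sketch of return-conditioned action selection, as in a decision
# transformer. `seq_model` is a stand-in for any autoregressive model over
# (return-to-go, state, action) tokens; its `sample_next_action` method is
# hypothetical. `env` is assumed to be a gym-style environment.

def rollout(seq_model, env, target_return, max_steps=1000):
    """Act in `env` by conditioning the sequence model on a target return."""
    state = env.reset()
    returns_to_go, states, actions = [target_return], [state], []
    for _ in range(max_steps):
        # The model predicts the next action given the history of
        # (return-to-go, state, action) triples, i.e. it imitates trajectories
        # in its training data that achieved the conditioned-on return.
        action = seq_model.sample_next_action(returns_to_go, states, actions)
        state, reward, done, _ = env.step(action)
        actions.append(action)
        states.append(state)
        # Decrement the return-to-go by the reward just received.
        returns_to_go.append(returns_to_go[-1] - reward)
        if done:
            break
    achieved_return = target_return - returns_to_go[-1]
    return states, actions, achieved_return
```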
We can replace "transformer" in all of the above with any generative model which can generate completions of our task (in the transformer example, the task completions are token sequences which encode trajectories). So more generally, our scheme is: train a generative model on human-generated task completions; then ask the generative model to produce new task completions conditional on getting high reward. Let's call agents made from generative models this way "GM agents." In the section "Safety advantages of generative modeling" I'll argue that GM agents have safety advantages over other types of RL agents.
Online generative modeling
GM agents as described above should be able to perform tasks at the upper end of human capabilities. But, since their behavior is fundamentally based on mimicking human-generated training data, they won’t be able to do too much better or explore novel strategies.[3] In other words, GM agents seem to carry a large alignment tax. Ideally, we would like to boost the capabilities of GM agents without losing safety.
The obvious way to improve GM agents’ performance is to take the learned policy (which was trained to mimic top human performance, as described above) and finetune it with a policy gradient method to maximize our objective function (e.g. Breakout score). In other words, this boils down to training an agent with a standard RL technique, but starting from a policy learned by a generative model instead of from a random initial policy. This would fix the capabilities issue but would probably ruin the GM agent’s safety advantages.
Another approach to boosting capabilities which has been studied recently is what I call online generative modeling. The decision transformer version of this is the online decision transformer (ODT).
An ODT starts as a vanilla decision transformer trained offline on human-generated training data. But once it has learned to perform competently, we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data. Over time, the ODT will shift to mimicking some mixture of the original human-generated data and its self-generated data.[4]
As before, we can replace the transformer here with any generative model, so our general scheme is: train a generative model to mimic high-reward human task completions as before, then continue training the generative model online on task completions it produces. Call the agents made this way OGM agents.
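A minimal sketch of this online phase, in the same hypothetical style as above (this mirrors the shape of the ODT's training loop but is not a faithful reproduction of the paper's algorithm). `rollout` is the helper sketched earlier; `pick_target_return` and `train_step` are stand-ins for choosing the conditioning return and taking one gradient step on the generative model.

```python
import random

# Sketch of the online phase of an OGM agent. Trajectories are the
# (states, actions, achieved_return) tuples produced by the `rollout`
# sketch above; `pick_target_return` and `train_step` are hypothetical.

def online_phase(seq_model, env, human_trajectories,
                 pick_target_return, train_step, num_iterations=1000):
    # Start from the offline, human-generated dataset.
    buffer = list(human_trajectories)
    for _ in range(num_iterations):
        # Choose the return to condition on from previously observed returns
        # (see the remarks in the next section for concrete choices).
        target = pick_target_return([ret for (_, _, ret) in buffer])
        # Generate a new trajectory and feed it back in as training data.
        buffer.append(rollout(seq_model, env, target))
        # Continue generative training on a mixture of human-generated and
        # self-generated trajectories.
        batch = random.sample(buffer, min(64, len(buffer)))
        train_step(seq_model, batch)
    return seq_model
```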
Some remarks
1. When introducing GM agents, I was a little vague about what it meant to "condition on high reward." There are a few options here (a short code sketch of all three appears at the end of this section).
(a) Pick some specific reward, e.g. 100, you'd be happy with and condition on that reward (as is done here). If you keep the reward static as the OGM agent changes its behavior, then this looks a lot like a satisficer.
(b) Pick some percentile of previously-observed rewards (e.g. 95th percentile) and condition on getting that reward. For the OGM agent, as the distribution of previously-observed rewards shifts upwards, appropriately adjust the target reward. This is an implementation of a quantilizer, and it is the version of online generative modeling that I'm currently most excited about.
(c) Have the generative model sample from the distribution it's modeling, but biased so as to favor more high-reward actions, the higher the better (e.g. with a bias proportional to exp(reward) as in this paper; see the appendix for more details). This is much more aggressive than a satisficer or quantilizer; for example, an action with very high reward will be chosen with high probability, even if that action is unlikely in the training distribution. In particular, this approach feels much more like an optimizer, and I'm more nervous about it having good safety properties.
2. Online generative modeling only applies when dealing with tasks for which humans can demonstrate a safe baseline (even if that baseline is incompetent). This is plausibly the case for many tasks (e.g. tending a kitchen, writing macroeconomics papers, running a company) but more iffy for some tasks we care about (e.g. managing a smart power grid). This is a downgrade from RL from human feedback, which applies whenever humans can judge task completions. (That said, in both cases there may be ways to recursively bootstrap from simpler tasks.)
3. The better our generative model is at modeling its training distribution at the end of the offline phase, the safer we might expect our GM agent to behave (see the considerations in the next section). The worse our generative model is at modeling its training distribution, the greater our GM agent's tendency to try actions a human wouldn't have tried; this results in broader exploration during an OGM agent's online phase, and, plausibly, a more rapid improvement to superhuman capabilities.[5] Thus there is an interesting tension between safe behavior and novel behavior, mediated by the generative model's closeness to the original training distribution.
4. This idea is conceptually similar to finetuning a trained generative model with a KL-divergence penalty, in that both aim to improve from a baseline policy in a way that disprefers moving too far away from the baseline. In fact, there is a precise formal connection between GM agents as implemented in remark 1(c) above and KL-regularized finetuning, which I explain in the appendix. Lots of the safety considerations of the next section also apply to KL-regularized finetuning, and I think that comparing OGM agents to agents made by finetuning a GM agent's policy is a good thing to study empirically. (Note that KL-regularized finetuning only obviously works for models which are sampling their actions from some explicit distribution, e.g. decision transformers. It's not clear to me whether something like this should be possible for other types of generative models, e.g. diffusion models.)
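To tie this back to the online-phase sketch earlier: each of the three options in remark 1 corresponds to a different choice of `pick_target_return`. Below is a compact sketch of all three; the function names are mine, and option (c) is rendered as sampling a target return with probability roughly proportional to exp(reward) over the empirical return distribution, which is only one way to implement the bias described above.

```python
import math
import random

# Hypothetical implementations of the three conditioning options, given the
# returns of previously observed trajectories. Any of these could be passed
# as `pick_target_return` in the earlier online-phase sketch.

def satisficer_target(observed_returns, fixed_target=100.0):
    # (a) Condition on a fixed, pre-chosen return; `observed_returns` is ignored.
    return fixed_target

def quantilizer_target(observed_returns, q=0.95):
    # (b) Condition on the q-th percentile of previously observed returns; as
    # the observed distribution shifts upward online, the target shifts with it.
    ordered = sorted(observed_returns)
    return ordered[int(q * (len(ordered) - 1))]

def exp_weighted_target(observed_returns, beta=1.0):
    # (c) Sample a target return with probability proportional to
    # exp(beta * return) over the empirical return distribution. Subtracting
    # the max return only rescales the weights (numerical stability).
    m = max(observed_returns)
    weights = [math.exp(beta * (ret - m)) for ret in observed_returns]
    return random.choices(observed_returns, weights=weights, k=1)[0]
```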
Safety advantages of generative modeling
[Epistemic status: I view the things I write in this section as somewhere between “suggestive argument sketches” and “attempts to justify an intuition by pointing at facts that feel relevant.” Which is to say: I’m not very confident in all of this section’s reasoning, but I think the general picture has a large enough chance of being true to merit empirical investigation.]
Section summary: first, I sketch an optimistic case for GM agents’ safety. Ideally, I would then move on to discussing whether OGM agents can safely improve on GM agents’ capabilities. However, I felt that a serious discussion of that point would require describing a framework for OGM with human feedback, so I’ve deferred that discussion to a future post and only give some preliminary considerations here instead. I conclude with some caveats and counterpoints.
Given that GM agents are just trying to mimic the task completions of top humans, they have some clear safety advantages. To name a few:
There is no safe exploration problem for GM agents (this consideration also applies to other offline RL techniques). Recall that “unsafe exploration” refers to agents behaving unsafely while in training (e.g. knocking over vases before they understand that their actions will knock over vases). But generative models are trained offline, without the agent actually acting in its environment, so there is no chance for unsafe exploration. More generally, it seems good that GM agents already have a baseline level of competence when they start acting in their environment.
Human-generated training data encode aspects of our values which the objective function might not (this consideration also applies to other imitation learning techniques). Suppose that vases are never knocked over in the human-generated training data (since the human operators know that we don’t like broken vases). Then, regardless of the objective function we are using, a generative model trained on this data isn’t likely to knock over vases (since vase-toppling actions are very off-distribution for the training data). The same applies for other undesirable but highly-rewarded actions, like reward hacking.
GM agents are less likely to exhibit novel behaviors. This includes good behaviors that we might like (e.g. playing Breakout better), but also unsafe behaviors that we don’t want, like reward hacking, deception, and resisting being turned off.
On the other hand, the transition from GM agents to OGM agents – which was done to get around GM agents being capped at the capabilities of top humans – will result in novel behaviors, and we need to analyze how well these safety advantages will persist. Left online for long enough, OGM agents could start to view breaking vases, reward hacking, deception, etc. as both normal and high-reward.
In practice, this might be resolvable with human feedback (i.e. by making the objective function be a reward model trained online with human feedback). If the transition from non-deceptive to deceptive behavior is slow and stuttering, then humans may be able to give negative feedback to the OGM agent’s first attempts at deception, preventing deceptive behavior from ever starting to look on-distribution. Alternatively, there might be ways to ensure that the distribution OGM agents are modeling never shifts too far from the human-generated training distribution, or doesn’t shift too rapidly relative to our ability to give feedback. (One simple idea here is to ensure that some fixed proportion (e.g. 50%) of the task completions in the dataset which the OGM is trying to model are always drawn from the original human-generated training data.)
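A hypothetical sketch of that simple idea, in the stand-in style from earlier: each training batch is drawn as a fixed mixture of frozen human-generated trajectories and self-generated ones (this would replace the uniform buffer sampling in the online-phase sketch above).

```python
import random

# Sketch of the "fixed proportion of human data" idea: every training batch
# mixes trajectories from the frozen human-generated dataset with
# self-generated ones in a fixed ratio, so the distribution being modeled
# can never drift entirely away from the human baseline. Sampling is with
# replacement for simplicity.

def mixed_batch(human_trajectories, self_trajectories,
                batch_size=64, human_fraction=0.5):
    n_human = int(batch_size * human_fraction)
    batch = random.choices(human_trajectories, k=n_human)
    batch += random.choices(self_trajectories, k=batch_size - n_human)
    random.shuffle(batch)
    return batch
```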
In a future post, I plan to sketch out a proposal for OGM with human feedback, as well as a more sophisticated scheme for preventing an OGM agent’s behavior from changing too rapidly relative to its understanding of our values (or more concretely, relative to the reward model’s loss).[6]
But since this isn’t that future post, I’ll instead generally outline some safety-relevant considerations for OGM agents:
OGM agents may explore their action spaces more predictably than other RL agents, since they explore by trying variations on human-like behavior (this consideration might also apply to other methods that involve pre-training on human demonstrations).
Novel behaviors may take a long time to become common. For example, suppose an OGM agent discovers a deceptive strategy which gets very high reward. We shouldn’t necessarily expect this agent to start frequently employing deception; at first such behavior will still look off-distribution, and it might take many more iterations for such behavior to start looking normal to the generative model. Thus novel behaviors may appear indecisively, giving humans a chance to intervene on undesirable ones.
It might be easy to tune the rate of behavioral shift for OGM agents, which would allow us to more tightly control the rate at which new capabilities appear.
Finally, some caveats and counterpoints:
A generative model might produce novel outputs when given novel inputs; in other words, GM/OGM agents might not be robust to distributional shifts in their inputs. If this is the case, then novel behaviors might arise more readily than we expect, even from GM agents. Countercounterpoint: it’s striking to me that DALL-E-2, when prompted with gibberish, still produces normal-looking pictures, and GPT-3, when prompted with nonsense, tries to push forward with normal-looking completions. I view this as weak evidence that even if GM agents won’t behave competently in novel situations, they’ll at least behave predictably and safely.
The way that generative models internally model their training distribution might be very unintuitive to us, resulting in less predictable exploration as the input distribution shifts. As an extreme example, suppose that the way the generative model approximates the human-generated training data is by learning a human-like world model and then planning 3 actions ahead in that world model; increasing 3 to 4 might result in behavior which is only slightly different, but in an unintuitive (to humans) direction. Countercounterpoint: if “planning ahead 3 steps in a human world model” actually were a good approximation of human behavior, then I would expect “planning ahead 4 steps in a human world model” to not look too weird to us.
Undesirable behaviors, like deception and reward hacking, might not be very different from desirable behaviors (in the sense that the cross entropy between the policies expressing these behaviors might be small). Combined with the previous two points, this might mean that such behaviors are more likely to arise suddenly and unexpectedly.
Generative modeling paradigms don’t address many inner alignment concerns, such as deceptively aligned mesa-optimizers. Reasoning about this will probably depend on the particular generative modeling architecture used (e.g. is it likely in practice that a trained transformer be a mesa-optimizer?).
Experiments
Here are some experiments which could be useful. For all of these, the OGM agents should be compared against agents made with various other RL techniques as well as to GM agents finetuned with a KL-divergence penalty. I’m most excited about having the OGM agents be OGM quantilizers as in remark 1(b) above.
Experiments which verify that GM agents have the safety properties I’d expect them to have and probe how rapidly these safety properties decay in online training:
Train an OGM agent for a variety of gridworld environments, where the human-generated training data provide examples of the agent completing tasks the way we want them completed, e.g. not pushing the box into the corner in the “irreversible side effects environment” and not reward hacking by putting the bucket on its head in the “tomato watering environment.”
Verify that after offline training the agent behaves as we want.
See which undesirable behaviors arise in online training and how long they take to arise.
When these undesirable behaviors appear, do they appear rapidly and decisively (e.g. the agent decides to put the bucket on its head for the entire episode, and quickly transitions to doing so in nearly all episodes) or slowly and indecisively (e.g. the agent starts putting the bucket on its head for parts of an episode, and it takes many episodes for this behavior to become common)? (The latter case suggests a better ability to catch bad behavior and intervene with human feedback.)
Reproduce the “Learning to Summarize with Human Feedback” paper but with a frozen reward model.
In the original paper, optimizing against a frozen reward model resulted in overfitting and summaries that humans did not actually like. How rapidly and decisively does this overfitting appear for OGM agents?
I'm most interested in what happens with an OGM quantilizer (as in remark 1(b) above). A GM agent as implemented in remark 1(c) is closely related to KL-regularized RL, so I'd expect it to overfit similarly. But an OGM quantilizer will strive less aggressively for high reward; how does this interact with overfitting?
Check whether OGM agents are less likely to figure out ways to hack games.
Does an OGM agent figure out the Qbert hack for getting unlimited points? If so, how long does it take, and how rapidly/decisively does the behavior appear?
This probes whether OGM agents are less likely to find novel strategies which require many simultaneous changes.
Experiments which test the size of the alignment tax for OGM agents:
Reproduce the experiments from the online decision transformers paper, but with a longer online training phase. Do the OGM agents start to decisively outperform the other agents?
Then do the same with other RL benchmarks not used in this paper.
Train an OGM agent to play an Atari game (e.g. Breakout). Does it eventually achieve superhuman performance, and how long does it take to do so?
Conclusion
Generative modeling is an approach for creating agents which has safety advantages over other approaches to RL, but which results in agents which are capped at the performance of top humans. Online generative modeling allows for improvements in capabilities, and when combined with other techniques, such as human feedback, OGM agents might also retain the safety advantages of GM agents. In a future post, I plan to sketch out what OGM with human feedback could look like and present a scheme for ensuring that an OGM agent’s capabilities don’t improve too fast relative to its understanding of human preferences.
Thanks to Ben Edelman, Thibaut Horel, Holden Lee, Simeon Campos, Nuño Sempere, and Johannes Treutlein for the discussions that became this post. Thanks additionally to Gabriel Wu and Tom Shlomi for feedback on a draft.
Appendix: GM agents and KL-regularized RL
This appendix is due to a discussion with Ben Edelman. The mathematical content here is nearly identical to that of this recent post.
There is an intuitive connection between creating agents by conditioning generative models on high reward and finetuning the policy learned by an (unconditioned) generative model to maximize reward with a KL-divergence penalty. Namely, both methods aim to improve from a baseline policy learned by a generative model in a way that gets high reward without straying too far from the baseline.
In fact, we can go further than this intuitive connection. This appendix explains a more precise connection between a GM agent as implemented in remark 1(c) above and KL-regularized finetuning. The meat of the connection is the following fact:
Let $\pi_0$ be a baseline policy over trajectories and let $R$ be a reward function. Then the following policies are the same:
(a) the policy which selects trajectory $\tau$ with probability proportional to $\pi_0(\tau)\exp(R(\tau))$;
(b) the policy $\pi$ which maximizes $\mathbb{E}_{\tau\sim\pi}[R(\tau)] - D_{\mathrm{KL}}(\pi\,\|\,\pi_0)$.
To prove this fact, one observes that plugging the policy in (a) into the objective in (b) gives $\log \mathbb{E}_{\tau\sim\pi_0}[\exp(R(\tau))]$, which is provably maximal (by the argument in the appendix here).
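Spelled out (my own rendering of the standard computation, in the notation above): write $Z = \mathbb{E}_{\tau\sim\pi_0}[\exp(R(\tau))]$ and let $\pi^*(\tau) = \pi_0(\tau)\exp(R(\tau))/Z$ be the policy in (a). Then for any policy $\pi$,

$$\mathbb{E}_{\tau\sim\pi}[R(\tau)] - D_{\mathrm{KL}}(\pi\,\|\,\pi_0) = \mathbb{E}_{\tau\sim\pi}\!\left[\log\frac{\pi_0(\tau)e^{R(\tau)}}{\pi(\tau)}\right] = \log Z - D_{\mathrm{KL}}(\pi\,\|\,\pi^*) \le \log Z,$$

with equality exactly when $\pi = \pi^*$.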
The policy in (b) is what one gets by doing RL with reward function $R$ and penalizing KL-divergence from $\pi_0$. To sample from the policy in (a), suppose that we've trained a decision transformer on sequences $(R, \tau_1, \tau_2, \ldots)$ where $R$ is the reward for the whole trajectory $\tau$ consisting of actions $\tau_1, \tau_2, \ldots$. Let $\pi_0$ be the unconditioned policy (i.e. the policy one gets by not conditioning on any reward). Then, as in this paper, one can sample trajectories $\tau$ with probabilities proportional to $\pi_0(\tau)\exp(R(\tau))$ by first sampling a reward $R$ with probability proportional to $p(R)\exp(R)$ (where $p(R)$ is the probability of $R$ from the training distribution, as output by the decision transformer), and then sampling trajectories from $\pi_0$ by conditioning the decision transformer on reward $R$.[7]
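As a concrete sketch of this two-step procedure (with a hypothetical interface: `reward_distribution` stands in for reading off the decision transformer's marginal distribution over total rewards, and `sample_conditioned_on_reward` for conditional trajectory sampling; neither is a real API):

```python
import math
import random

# Sketch of sampling from the policy in (a) with a decision transformer:
# first sample a total reward R with probability proportional to p(R)exp(R),
# then sample a trajectory conditioned on that reward. The two helpers are
# hypothetical stand-ins, not functions from any particular library.

def sample_exp_reweighted_trajectory(decision_transformer):
    rewards, probs = reward_distribution(decision_transformer)  # discrete p(R)
    # Reweight by exp(R); subtracting max(rewards) rescales all weights by the
    # same factor and so doesn't change the normalized distribution.
    m = max(rewards)
    weights = [p * math.exp(r - m) for r, p in zip(rewards, probs)]
    target_reward = random.choices(rewards, weights=weights, k=1)[0]
    return sample_conditioned_on_reward(decision_transformer, target_reward)
```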
It would be interesting to figure out a way to factorize the policy in (a) over timesteps, i.e. produce distributions $\pi(\tau_t \mid \tau_{<t})$ over actions conditional on partial trajectories so that sampling trajectory $\tau$ with probability $\prod_t \pi(\tau_t \mid \tau_{<t})$ is the same policy as in (a). Ben Edelman pointed out to me that if one can estimate at each timestep the expected exponential total reward $\mathbb{E}_{\pi_0}[\exp(R(\tau)) \mid \tau_{\le t}]$ over the randomness of $\pi_0$ (and the environment), then one can produce this factorization by taking action $\tau_t$ with probability proportional to

$$\pi_0(\tau_t \mid \tau_{<t}) \cdot \mathbb{E}_{\pi_0}[\exp(R(\tau)) \mid \tau_{\le t}].$$
That said, it's not clear to me that estimating this expected exponential reward is something that can natively be done without introducing an auxiliary reward model separate from the generative model.
[1] There are a few different approaches to doing this, which I'll discuss below.
[2] This is analogous to prompting GPT with "This is a transcript of a conversation with Stephen Hawking" before asking it physics questions.
[3] The original decision transformers paper notes that for some tasks, it's possible to prompt the decision transformer with a reward higher than any reward in the human-generated data and get (boundedly) superhuman performance. But for nearly all tasks we should expect there to be some capabilities cap beyond which the agent can't improve without exploring new strategies.
[4] For an artful analogy, suppose Alice is training to become an artist. She might start out imitating the work of great artists; this is analogous to a decision transformer learning to imitate high-reward human task completions. But Alice will do some things differently than the artists she's imitating. Over time, Alice's art might shift to become a mixture of great artists' work and Alice's own past work. Eventually Alice will produce art which is very different from the great artists she started imitating (i.e. she will develop her own style); this is analogous to an ODT attaining superhuman performance by imitating its own past play.
[5] In the ODT paper, exploration is forced by requiring the ODT's policy to have not-too-small entropy. This can also be viewed as forcing the ODT to be an imperfect model of human behavior. It would be interesting to see how important this entropy constraint is to the improvement of the ODT – perhaps the exploration that arises from the natural imperfections in the ODT's model of human behavior (or from the stochasticity of the policy) is enough to ensure improvement in the online phase.
[6] In other words, the scheme aims to ensure that capabilities always lag in the capabilities vs. value-learning race.
[7] We assumed a decision transformer here in order to be able to sample rewards with probability proportional to $p(R)\exp(R)$; decision transformers can do this because they explicitly represent the probabilities $p(R)$ for each $R$. It's not obvious to me how to sample from this distribution for other generative models, but maybe there's a clever way to do so.
But it will still have the problems of modeling off-distribution poorly, and going off-distribution. Once it accidentally moves too near the vase, which the humans avoid doing, it may go wild and spaz out. (As is the usual problem for behavior cloning and imitation learning in general.)
I disagree. This isn't a model-free or policy model which needs to experience a transition many times before the high reward can begin to slowly bootstrap back through value estimates or overcome high-variance updates to finally change behavior; it's model-based RL: the whole point is that it's learning a model of the environment.
Thus, theoretically, a single instance is enough to update its model of the environment, which can then flip its strategy to the new one. (This is in fact one of the standard experimental psychology approaches for running RL experiments on rodents to examine model-based vs model-free learning: if you do something like switch the reward location in a T-maze, does the mouse update after the first time it finds the reward in the new location such that it goes to the new location thereafter, demonstrating model-based reasoning in that it updated its internal model of the maze and did planning of the optimal strategy to get to the reward leading to the new maze-running behavior, or does it have to keep going to the old location for a while as the grip of the outdated model-free habituation slowly fades away?)
Empirically, the bigger the model, the more it is doing implicit planning (see my earlier comments on this with regard to MuZero and Jones etc.), and the more it is capable of things which are also equivalent to planning. To be concrete, think inner-monologue and adaptive computation. There's no reason a Gato-esque scaled-up DT couldn't be using inner-monologue tricks to take a moment out to plan, similar to how Socratic models use their LMs to 'think out loud' a plan which they then execute. It would make total sense for a recurrent model to run a few timesteps with dummy inputs to 'think about the prompt' and do some quick meta-learning a la Dactyl, or for an old-style GPT model to print out text thinking to itself 'what should I do? This time I will try X'.
For that matter, an OGM agent wouldn't have to experience the transition itself; you could simply talk to it and tell it about the unobserved states, thanks to all that linguistic prowess it is learning from the generative training: harmless if you tell it simply "by the way, did you know there's an easter egg in Atari Adventure? If you go to the room XYZ...", not so harmless if it's about the real world or vulnerabilities like Log4j. Or it could be generalizing from data you don't realize is related at all but which turns out to help transfer learning or capabilities.
The smarter it is, and the better the environment models and capabilities, the more transfer it’ll get, and the faster the ‘rate’ gets potentially.
Yeah, that's possible, but I don't think you necessarily get that out of the box. Online Decision Transformer or Gato certainly don't explore in a human-like way, any more than other imitation learning paradigms do right now. (As you note, ODT just does a fairly normal bit of policy-based exploration, which is better than epsilon-random but still far short of anything one could describe as a good exploration strategy, much less human-like; nor do either ODT or Gato do as impressively when learning online/finetuning as one would expect if they really were exploring well by default.) They still need a smarter way to explore, like ensembles to express uncertainty.
An interesting question is whether large DTs would eventually learn human exploration, the way they learn so many other things as they scale up. Can they meta-learn exploration appropriately outside of toy POMDP environments deliberately designed to elicit such adaptive behavior? The large datasets in question would presumably contain a lot of human exploration; if we think about Internet scrapes, a lot of it is humans asking questions or criticizing or writing essays thinking out loud, which is linguistically encoding intellectual exploration.
From a DT perspective, I’d speculate that when used in the obvious way of conditioning on a very high reward on a specific task which is not a POMDP, the agents with logged data like that are not themselves exploring but are exploiting their knowledge, and so it ‘should’ avoid any exploration and simply argmax its way through that episode; eg if it was asked to play Go, there is no uncertainty about the rules, and it should do its best to play as well as it can like an expert player, regardless of uncertainty. If Go were a game which was POMDP like somehow, and expert ‘POMDP-Go’ players expertly balance off exploration & exploitation within the episode, then it would within-episode explore as best as it had learned how to by imitating those experts, but it wouldn’t ‘meta-explore’ to nail down its understanding of ‘POMDP-Go’. So it would be limited to accidental exploration from its errors in understanding the MDP or POMDPs in question.
Could we add additional metadata like 'slow learner' or 'fast learner' to whole corpuses of datasets from agents learning various tasks? I don't see why not. Then you could add that to the prompt along with the target reward: 'low reward, fast learner'. What trajectory would be most likely with that prompt? Well, one which staggers around doing poorly but like a bright beginner, screwing around and exploring a lot… Do some trajectories like that, increment the reward, and keep going in a bootstrap?
Why wouldn’t it?
Why quantilize at a specific percentile? Relative returns sounds like a more useful target.
Yep, I agree that distributional shift is still an issue here (see counterpoint 1 at the end of the “Safety advantages” section).
---
I think you’re wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let’s imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it’s just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?
If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes. Instead, the new policy will probably slightly increase the probabilities of actions which, when performed together, constitute reward hacking. It will be more likely to explore this reward hacking strategy in the future, after which reward hacked episodes make up a greater proportion of the top 5% most highly rewarded episodes, but the transition shouldn’t be rapid.
As a more direct response to what you write in justification of your view: if the way the OGM agent works internally is via planning in some world model, then it shouldn’t be planning to get high reward—it should be planning to exhibit typical behavior conditional on whatever reward it’s been conditioned on. This is only a problem once many of the examples of the agent getting the reward it’s been conditioned on are examples of the agent behaving badly (this might happen easily when the reward it’s conditioned on is sampled proportional to exp(R) as in remark 1.3, but happens less easily when satisficing or quantilizing).
---
Thanks for these considerations on exploration—I found them interesting.
I agree that human-like exploration isn't guaranteed by default, but I had a (possibly dumb) intuition that this would be the case. Heuristic argument: an OGM agent's exploration is partially driven by the stochasticity of its policy, yes, but it's also driven by imperfections in its model of its (initially human-generated) training data. Concretely, this might mean, e.g., estimating angles slightly differently in Breakout, having small misconceptions about how highly rewarded various actions are, etc. If the OGM agent is competent at the end of its offline phase, then I expect the stochasticity to be less of a big deal, and for the initial exploration to be mainly driven by these imperfections. To us, this might look like the behavior of a human with a slightly different world model than ours.
It sounds like you might have examples to suggest this intuition is bogus—do you mind linking?
I like your idea of labeling episodes with information that could control exploration dynamics! I’ll add that to my list of possible ways to tune the rate at which an OGM agent develops new capabilities.
---
Point taken, I’ll edit this to “is it likely in practice that a trained transformer be a mesa-optimiser?”
---
Thanks! This is exactly what I would prefer (as you might be able to tell from what I wrote above in this comment), but I didn’t know how to actually implement it.
For safety, 'probably' isn't much of a property. You are counting on it, essentially, having indeed learned the ultra-high-reward strategy but then deliberately self-sabotaging for being too high reward. How does it know it's "too good" in an episode and needs to self-sabotage to coast in to the low reward? It's only just learned about this new hack, after all; there will be a lot of uncertainty about how often it delivers the reward, if there are any long-term drawbacks, etc. It may need to try as hard as it can just to reach mediocrity. (What if there is a lot of stochasticity in the reward hacking or the states around it, such that the reward hacking strategy has an EV around that of the quantile? What if the reward hacking grants enough control that a quantilizer can bleed itself after seizing complete control, to guarantee a specific final reward, providing a likelihood of 1, rather than a 'normal' strategy which risks coming in too high or too low and thus having a lower likelihood than the hacking, so quantilizing a target score merely triggers power-seeking instrumental drives?) Given enough episodes with reward hacking and enough experience with all the surrounding states, it could learn that the reward hacking is so overpowered a strategy that it needs to nerf itself by never doing reward hacking, because there's just no way to self-sabotage enough to make a hacked trajectory plausibly come in at the required low score. But that's an unknown number of episodes, so bad safety properties.
I also don’t buy the distribution argument here. After one episode, the model of the environment will update to learn both the existence of the new state and also the existence of extreme outlier rewards which completely invalidate previous estimates of the distributions. Your simple DT is not keeping an episodic buffer around to do planning over or something, it’s just doing gradient updates. It doesn’t “know” what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did, it only knows what’s encoded into its model, and that model has just learned that there exist very high rewards which it didn’t know about before, and thus that the distribution of rewards looks very different from what it thought, which means that ’95th percentile’ also doesn’t look like what it thought that did. It may be unlikely that 10,000 episodes wouldn’t sample it, but so what? The hack happened and is now in the data, deal with it. Suppose you have been puttering along in task X and it looks like a simple easily-learned N(100,15) and you are a quantilizer aiming for 95th percentile and so steer towards rewards of 112, great; then you see 1 instance of reward hacking with a reward of 10,000; what do you conclude? That N(100,15) is bullshit and the reward distribution is actually something much wilder like a lognormal or Pareto distribution or a mixture with (at least) 2 components. What is the true distribution? No one knows, least of all the DT model. OK, is the true 95th percentile reward more likely to be closer to 112… or to 10,000? Almost certainly the latter, because who knows how much higher scores go than 10,000 (how likely is it the first outlier was anywhere close to the maximum possible?), and your error will be much lower for almost all distributions & losses if you try to always aim for 10,000 and never try to do 112. Thus, the observed behavior will flip instantaneously.
Aside from not being human-like exploration, which targets specific things in extended hypotheses rather than accidentally having trembling hands jitter one step, this also gives a reason why the quantilizing argument above may fail. It may just accidentally the whole thing. (Both in terms of a bit of randomness, but also if it falls behind enough due to imperfections, it may suddenly ‘go for broke’ to do reward hacking to reach the quantilizing goal.) Again, bad safety properties.
I continue to think you’re wrong here, and that our disagreement on this point is due to you misunderstanding how an ODT works.
To be clear: an ODT does keep an episodic buffer of previous trajectories (or at least, that is the implementation of an ODT that I’m considering, which comports with an ODT as implemented in algorithm 1 of the paper). During the online training phase, the ODT periodically samples from this experience buffer and does gradient updates on how well its current policy retrodicts the past episodes. It seems like our disagreement on this point boils down to you imagining a model which works a different way.
More precisely, it seems like you were imagining that:
an ODT learns a policy which, when conditioned on reward R, tries to maximize the probability of getting reward R
when in fact:
an ODT learns a policy which, when conditioned on reward R, tries to behave similarly to past episodes which got reward R
(with the obvious modifications when instead of conditioning on a single reward R we condition on rewards being in some range [R1,R2]).
All of the reasoning in your first paragraph seems to be downstream of believing that an ODT works as in bullet point 1, when in fact an ODT works as in bullet point 2. And your reasoning in your second paragraph seems to be downstream of not realizing that an ODT is training off of an explicit experience buffer. I may also not have made sufficiently clear that the target reward for an ODT quantilizer is selected procedurally using the experience buffer data, instead of letting the ODT pick the target reward based on its best guess at the distribution of rewards.
(separate comment to make a separate, possibly derailing, point)
I mostly view this as a rhetorical flourish, but I’ll try to respond to (what I perceive as) the substance.
The "probably" in my sentence was mainly meant to indicate out-of-model uncertainty (in the sense of "I have a proof that X, so probably X," which is distinct from "I have a proof that probably X"). I thought that I gave a solid argument that reward hacking strategies would not suddenly and decisively become common, and the "probably" was to hedge against my argument being flawed, not to indicate that the argument showed that reward hacking strategies would appear suddenly and decisively only 10% of the time or whatever.
So I think the correct way to deal with that “probably” is to interrogate how well the argument holds up (as in the sister comment), not to dismiss it due to heuristics about worst-case reasoning.
Sounds like the argument for quantilizers. Issues with quantilizers still apply here—for example, taking a series of actions that are individually sampled from a human-like distribution will often end up constituting a long-term policy that’s off-distribution. But, like, if those problems could be surmounted I agree that would be really good.
As to ODTs, I’m not super optimistic, but I’m also not very expert. It seems from a little thought like there are two types of benefit to ODT finetuning: One, a sort of “lowering expectations” so that the system only tries to do the behaviors it’s actually learned how to do correctly, even if humans do more difficult things to get higher reward. Two, a random search through policies (local in the NN representation of the policy) that might make gradual improvements. I’m not confident in the safety properties of that second thing, for reasons similar to Steve Byrnes’ here.
I really liked the post and the agenda of improving safety through generative modelling is close to my heart.
But you still need online access to our MDP (i.e. reward function and transition function), don't you? And it's access to the MDP that drives novelty and improvement. If you were just sampling whole trajectories from the model (asking the model itself to simulate the reward function and transition model) and feeding them back into the model, you shouldn't expect any change (on average). Your gradient updates will cancel out; that's a consequence of the expected-grad-log-prob lemma ($\mathbb{E}_{x\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(x)] = 0$).
It gets more nuanced when you account for doing ancestral sampling, but it adds problems, not solves them:
https://arxiv.org/abs/2110.10819
On the other hand, in their follow-up work on instruction following, OpenAI claimed they used little online data (from fine-tuned policies):
https://arxiv.org/abs/2203.02155
Levine derives that in his control-as-inference tutorial paper (section 2.3). Your expected exponential total reward is pretty close. Note that it acts a bit like an (exponentiated) Q function for your policy: it gives you the exp-reward expected after taking action $\tau_t$ at state $\tau_{<t}$ and following $\pi$ thereafter. The exponential works like a soft argmax, so it gives you something like soft Q-learning, but not quite: the argmax is also over environment dynamics, not only over the policy. So it causes an optimism bias: your agent effectively assumes an optimal next state will be sampled for it every time, however unlikely that would be. The rest of Levine's paper deals with that.
Yep, that’s right! This was what I meant by “the agent starts acting in its environment” in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory
$$g_1, o_1, a_1, \ldots, g_{t-1}, o_{t-1}, a_{t-1}, g_t, o_t$$
of rewards-to-go, observations, and actions; then selects an action $a_t$ conditional on this partial trajectory; and then the environment provides a new reward $r_t$ (so that $g_{t+1} = g_t - r_t$) and observation $o_{t+1}$. Does that make sense?
Thanks for the reference to the Levine paper! I might have more to say after I get a chance to look at it more closely.
I think this is an interesting line of thinking. My main concern is whether the alignment tax might be too high for some use cases. I think a test case that it might do well for would be the NetHack challenge https://nethackchallenge.com/
I think that's an interesting challenge because current SoTA is far from ceiling effects on it, and imitating human games as a starting point seems intuitively like a good approach. To study how likely the model is to go in problematic directions once off-distribution, you could modify the NetHack challenge environment to add some possible exploits which aren't in the real game, and see how likely the model is to find and use those exploits.