> Suppose that vases are never knocked over in the human-generated training data (since the human operators know that we don’t like broken vases). Then, regardless of the objective function we are using, a generative model trained on this data isn’t likely to knock over vases (since vase-toppling actions are very off-distribution for the training data).
But it will still have the problems of modeling off-distribution poorly, and going off-distribution. Once it accidentally moves too near the vase, which the humans avoid doing, it may go wild and spaz out. (As is the usual problem for behavior cloning and imitation learning in general.)
> Novel behaviors may take a long time to become common. For example, suppose an OGM agent discovers a deceptive strategy which gets very high reward. We shouldn’t necessarily expect this agent to start frequently employing deception; at first such behavior will still look off-distribution, and it might take many more iterations for such behavior to start looking normal to the generative model. Thus novel behaviors may appear indecisively, giving humans a chance to intervene on undesirable ones.
I disagree. This isn’t a model-free or policy model which needs to experience a transition many times before the high reward can begin to slowly bootstrap back through value estimates or overcome high-variance updates to finally change behavior; it’s model-based RL: the whole point is that it’s learning a model of the environment.
Thus, theoretically, a single instance is enough to update its model of the environment, which can then flip its strategy to the new one. (This is in fact one of the standard experimental psychology approaches for running RL experiments on rodents to examine model-based vs model-free learning: if you do something like switch the reward location in a T-maze, does the mouse update after the first time it finds the reward in the new location such that it goes to the new location thereafter, demonstrating model-based reasoning in that it updated its internal model of the maze and did planning of the optimal strategy to get to the reward leading to the new maze-running behavior, or does it have to keep going to the old location for a while as the grip of the outdated model-free habituation slowly fades away?)
Empirically, the bigger the model, the more it is doing implicit planning (see my earlier comments on this with regard to MuZero and Jones etc), and the more it is capable of things which are also equivalent to planning. To be concrete, think inner-monologue and adaptive computation. There’s no reason a Gato-esque scaled-up DT couldn’t be using inner-monologue tricks to take a moment out to plan, similar to how Socratic models use their LMs to ‘think out loud’ a plan which they then execute. It would make total sense for a recurrent model to run a few timesteps with dummy inputs to ‘think about the prompt’ and do some quick meta-learning a la Dactyl, or for an old-style GPT model to print out text thinking to itself ‘what should I do? This time I will try X’.
For that matter, an OGM agent wouldn’t have to experience the transition itself; you could simply talk to it and tell it about the unobserved states, thanks to all that linguistic prowess it is learning from the generative training: harmless if you tell it simply “by the way, did you know there’s an easter egg in Atari Adventure? If you go to the room XYZ...”, not so harmless if it’s about the real world or vulnerabilities like Log4j. Or it could be generalizing from data you don’t realize is related at all but which turns out to help with transfer learning or capabilities.
> It might be easy to tune the rate of behavioral shift for OGM agents, which would allow us to more tightly control the rate at which new capabilities appear.
The smarter it is, and the better the environment models and capabilities, the more transfer it’ll get, and the faster the ‘rate’ gets potentially.
> OGM agents may explore their action spaces more predictably than other RL agents, since they explore by trying variations on human-like behavior (this consideration might also apply to other methods that involve pre-training on human demonstrations).
Yeah, that’s possible, but I don’t think you necessarily get that out of the box. Online Decision Transformer or Gato certainly don’t explore in a human-like way, any more than other imitation learning paradigms do right now. (As you note, ODT just does a fairly normal bit of policy-based exploration, which is better than epsilon-random but still far short of anything one could describe as a good exploration strategy, much less human-like; nor do ODT or Gato do as impressively when learning online/finetuning as one would expect if they really were exploring well by default.) They still need a smarter way to explore, like ensembles to express uncertainty.
An interesting question is whether large DTs would eventually learn human exploration, the way they learn so many other things as they scale up. Can they meta-learn exploration appropriately outside of toy POMDP environments deliberately designed to elicit such adaptive behavior? The large datasets in question would presumably contain a lot of human exploration; if we think about Internet scrapes, a lot of it is humans asking questions or criticizing or writing essays thinking out loud, which is linguistically encoding intellectual exploration.
From a DT perspective, I’d speculate that when used in the obvious way of conditioning on a very high reward on a specific task which is not a POMDP, the agents who produced logged data like that are not themselves exploring but are exploiting their knowledge, and so it ‘should’ avoid any exploration and simply argmax its way through that episode; eg if it was asked to play Go, there is no uncertainty about the rules, and it should do its best to play as well as it can like an expert player, regardless of uncertainty. If Go were a game which was somehow POMDP-like, and expert ‘POMDP-Go’ players expertly balance exploration & exploitation within the episode, then it would within-episode explore as best as it had learned how to by imitating those experts, but it wouldn’t ‘meta-explore’ to nail down its understanding of ‘POMDP-Go’. So it would be limited to accidental exploration from its errors in understanding the MDP or POMDPs in question.
Could we add additional metadata like ‘slow learner’ or ‘fast learner’ to whole corpuses of datasets from agents learning various tasks? I don’t see why not. Then you could add that to the prompt along with the target reward: ‘low reward, fast learner’. What trajectory would be most likely with that prompt? Well, one which staggers around doing poorly but like a bright beginner, screwing around and exploring a lot… Do some trajectories like that, increment the reward, and keep going in a bootstrap?
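A minimal sketch of the bootstrap loop this would amount to (every name and the tagging scheme here is hypothetical, purely for illustration):

```python
def metadata_bootstrap(model, replay_buffer, start_reward, increment, n_rounds):
    # `model.generate_episode`, `model.finetune`, and the tag format are all
    # hypothetical placeholders, not any existing API.
    target = start_reward
    for _ in range(n_rounds):
        prompt = {"target_reward": target, "learner": "fast learner"}
        episode = model.generate_episode(prompt)  # staggers around, but explores like a bright beginner
        replay_buffer.append(episode)
        model.finetune(replay_buffer)             # fold the new trajectories back into the model
        target += increment                       # ratchet the conditioned reward upward
    return model
```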
> is it possible for a transformer to be a mesa-optimizer?
Why wouldn’t it?
> Pick some percentile of previously-observed rewards (e.g. 95th percentile) and condition on getting that reward. For the OGM agent, as the distribution of previously-observed rewards shifts upwards, appropriately adjust the target reward.
Why quantilize at a specific percentile? Relative returns sounds like a more useful target.
> But it will still have the problems of modeling off-distribution poorly, and going off-distribution.
Yep, I agree that distributional shift is still an issue here (see counterpoint 1 at the end of the “Safety advantages” section).
---
> Novel behaviors may take a long time to become common [...]
> I disagree. This isn’t a model-free or policy model which needs to experience a transition many times before the high reward can begin to slowly bootstrap back through value estimates or overcome high-variance updates to finally change behavior; it’s model-based RL: the whole point is that it’s learning a model of the environment.
> Thus, theoretically, a single instance is enough to update its model of the environment, which can then flip its strategy to the new one.
I think you’re wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let’s imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it’s just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?
If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes. Instead, the new policy will probably slightly increase the probabilities of actions which, when performed together, constitute reward hacking. It will be more likely to explore this reward hacking strategy in the future, after which reward hacked episodes make up a greater proportion of the top 5% most highly rewarded episodes, but the transition shouldn’t be rapid.
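To make that arithmetic concrete, here is a toy check (the reward values are made up; any buffer with one extreme outlier behaves the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
ordinary = rng.normal(100.0, 15.0, size=9_999)  # 9,999 ordinary episodes, made-up reward scale
with_hack = np.append(ordinary, 10_000.0)       # plus one reward-hacked outlier

# The quantilizer's target is the 95th percentile of previously-observed rewards;
# one outlier in a 10,000-episode buffer barely moves it.
print(np.quantile(ordinary, 0.95))   # roughly 125
print(np.quantile(with_hack, 0.95))  # still roughly 125

# Of the 500 episodes in the top 5%, only one is the hacked episode, so a policy
# which always reward-hacked when conditioned on "top 5%" would badly mispredict
# the other 499.
top_5_percent = np.sort(with_hack)[-500:]
print(int((top_5_percent > 1_000).sum()), "of", len(top_5_percent))  # 1 of 500
```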
As a more direct response to what you write in justification of your view: if the way the OGM agent works internally is via planning in some world model, then it shouldn’t be planning to get high reward—it should be planning to exhibit typical behavior conditional on whatever reward it’s been conditioned on. This is only a problem once many of the examples of the agent getting the reward it’s been conditioned on are examples of the agent behaving badly (this might happen easily when the reward it’s conditioned on is sampled proportional to exp(R) as in remark 1.3, but happens less easily when satisficing or quantilizing).
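To put the same point in symbols (this is the standard reading of return-conditioned sequence modeling, assuming the model fits its training distribution well): conditioned on a target reward R*, the policy is roughly

$$p(a_t \mid \text{history}, R = R^*) \;\propto\; p(a_t \mid \text{history}) \cdot p(R = R^* \mid \text{history}, a_t),$$

i.e. the data distribution’s posterior over actions given that the episode ends up at reward R*, not the argmax over actions of the probability of achieving at least R*. The former imitates typical R*-achieving behavior; only the latter would amount to “planning to get high reward”.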
---
Thanks for these considerations on exploration—I found them interesting.
I agree that human-like exploration isn’t guaranteed by default, but I had a (possibly dumb) intuition that this would be the case. Heuristic argument: an OGM agent’s exploration is partially driven by the stochasticity of its policy, yes, but it’s also driven by imperfections in its model of its (initially human-generated) training data. Concretely, this might mean, e.g., estimating angles slightly differently in Breakout, having small misconceptions about how highly rewarded various actions are, etc. If the OGM agent is competent at the end of its offline phase, then I expect the stochasticity to be less of a big deal, and for the initial exploration to be mainly driven by these imperfections. To us, this might look like the behavior of a human with a slightly different world model than us.
It sounds like you might have examples to suggest this intuition is bogus—do you mind linking?
I like your idea of labeling episodes with information that could control exploration dynamics! I’ll add that to my list of possible ways to tune the rate at which an OGM agent develops new capabilities.
---
> is it possible for a transformer to be a mesa-optimizer?
> Why wouldn’t it?
Point taken, I’ll edit this to “is it likely in practice that a trained transformer would be a mesa-optimiser?”
---
> Why quantilize at a specific percentile? Relative returns sounds like a more useful target.
Thanks! This is exactly what I would prefer (as you might be able to tell from what I wrote above in this comment), but I didn’t know how to actually implement it.
> I think you’re wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let’s imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it’s just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?
> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes.
For safety, ‘probably’ isn’t much of a property. You are counting on it, essentially, having indeed learned the ultra-high-reward strategy but then deliberately self-sabotaging because the reward is too high. How does it know it’s “too good” in an episode and needs to self-sabotage to coast in at the lower target reward? It’s only just learned about this new hack, after all; there will be a lot of uncertainty about how often it delivers the reward, whether there are any long-term drawbacks, etc. It may need to try as hard as it can just to reach mediocrity. (What if there is a lot of stochasticity in the reward hacking or the states around it, such that the reward hacking strategy has an EV around that of the quantile? What if the reward hacking grants enough control that a quantilizer can bleed itself after seizing complete control, to guarantee a specific final reward, providing a likelihood of 1, rather than a ‘normal’ strategy which risks coming in too high or too low and thus having a lower likelihood than the hacking, so quantilizing a target score merely triggers power-seeking instrumental drives?) Given enough episodes with reward hacking and enough experience with all the surrounding states, it could learn that the reward hacking is so overpowered a strategy that it needs to nerf itself by never doing reward hacking, because there’s just no way to self-sabotage enough to make a hacked trajectory plausibly come in at the required low score. But that’s an unknown number of episodes, so bad safety properties.
I also don’t buy the distribution argument here. After one episode, the model of the environment will update to learn both the existence of the new state and also the existence of extreme outlier rewards which completely invalidate previous estimates of the distributions. Your simple DT is not keeping an episodic buffer around to do planning over or something, it’s just doing gradient updates. It doesn’t “know” what the exact empirical distribution of the last 10,000 episodes trained on was, nor would it care if it did; it only knows what’s encoded into its model, and that model has just learned that there exist very high rewards which it didn’t know about before, and thus that the distribution of rewards looks very different from what it thought, which means that ‘95th percentile’ also doesn’t mean what it thought it did. It may be unlikely that 10,000 episodes wouldn’t sample it, but so what? The hack happened and is now in the data, deal with it. Suppose you have been puttering along in task X and it looks like a simple easily-learned N(100,15) and you are a quantilizer aiming for the 95th percentile and so steer towards rewards of ~125, great; then you see 1 instance of reward hacking with a reward of 10,000; what do you conclude? That N(100,15) is bullshit and the reward distribution is actually something much wilder like a lognormal or Pareto distribution or a mixture with (at least) 2 components. What is the true distribution? No one knows, least of all the DT model. OK, is the true 95th percentile reward more likely to be closer to 125… or to 10,000? Almost certainly the latter, because who knows how much higher scores go than 10,000 (how likely is it that the first outlier was anywhere close to the maximum possible?), and your error will be much lower for almost all distributions & losses if you always aim for 10,000 and never aim for 125. Thus, the observed behavior will flip instantaneously.
> but it’s also driven by imperfections in its model of its (initially human-generated) training data
Aside from not being human-like exploration, which targets specific things via extended hypotheses rather than a trembling hand accidentally jittering one step, this also gives a reason why the quantilizing argument above may fail. It may just accidentally the whole thing. (Both through a bit of randomness, and also because if it falls behind enough due to imperfections, it may suddenly ‘go for broke’ and do reward hacking to reach the quantilizing goal.) Again, bad safety properties.
I continue to think you’re wrong here, and that our disagreement on this point is due to you misunderstanding how an ODT works.
> Your simple DT is not keeping an episodic buffer around to do planning over or something, it’s just doing gradient updates. It doesn’t “know” what the exact empirical distribution of the last 10,000 episodes trained on was, nor would it care if it did
To be clear: an ODT does keep an episodic buffer of previous trajectories (or at least, that is the implementation of an ODT that I’m considering, which comports with an ODT as implemented in algorithm 1 of the paper). During the online training phase, the ODT periodically samples from this experience buffer and does gradient updates on how well its current policy retrodicts the past episodes. It seems like our disagreement on this point boils down to you imagining a model which works a different way.
More precisely, it seems like you were imagining that:
1. an ODT learns a policy which, when conditioned on reward R, tries to maximize the probability of getting reward R
when in fact:
2. an ODT learns a policy which, when conditioned on reward R, tries to behave similarly to past episodes which got reward R
(with the obvious modifications when instead of conditioning on a single reward R we condition on rewards being in some range [R1,R2]).
All of the reasoning in your first paragraph seems to be downstream of believing that an ODT works as in bullet point 1, when in fact an ODT works as in bullet point 2. And your reasoning in your second paragraph seems to be downstream of not realizing that an ODT is training off of an explicit experience buffer. I may also not have made sufficiently clear that the target reward for an ODT quantilizer is selected procedurally using the experience buffer data, instead of letting the ODT pick the target reward based on its best guess at the distribution of rewards.
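For concreteness, here is the kind of loop I have in mind, heavily schematized (it omits the max-entropy objective and most other details of the actual ODT algorithm, and `policy`, `env`, and `buffer` are placeholder interfaces rather than the paper’s code):

```python
from collections import namedtuple

import numpy as np

Trajectory = namedtuple("Trajectory", ["steps", "total_return"])

def rollout(policy, env, target_return):
    """Run one episode, decrementing the return-to-go as reward comes in."""
    obs, steps, done, to_go = env.reset(), [], False, target_return
    while not done:
        action = policy.act(steps, obs, to_go)  # sample an action; no argmax anywhere
        next_obs, reward, done = env.step(action)
        steps.append((obs, action, reward, to_go))
        to_go -= reward
        obs = next_obs
    return Trajectory(steps, sum(r for _, _, r, _ in steps))

def online_phase(policy, env, buffer, n_iters, quantile=0.95, grad_steps=100):
    # `policy`, `env`, and `buffer` are placeholder interfaces, not real ODT code.
    for _ in range(n_iters):
        # 1. The target return is computed procedurally from the experience buffer;
        #    the model never guesses at the reward distribution itself.
        returns = np.array([traj.total_return for traj in buffer])
        target = np.quantile(returns, quantile)

        # 2. Collect new experience by conditioning the policy on that target.
        buffer.append(rollout(policy, env, target))

        # 3. Gradient steps on how well the policy retrodicts buffer episodes, each
        #    conditioned on the returns-to-go it actually experienced: conditional
        #    behavior cloning on the buffer, not reward maximization.
        for _ in range(grad_steps):
            batch = buffer.sample()
            policy.maximize_action_log_prob(batch)
```

The point of the sketch is just that step 3 is a supervised objective over the buffer (bullet point 2 above), and that the target in step 1 is read off the buffer’s empirical quantile rather than off the model’s beliefs about the reward distribution.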
(separate comment to make a separate, possibly derailing, point)
> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes.
> For safety, ‘probably’ isn’t much of a property.
I mostly view this as a rhetorical flourish, but I’ll try to respond to (what I perceive as) the substance.
The “probably” in my sentence was mainly meant to indicate out-of-model uncertainty (in the sense of “I have a proof that X, so probably X,” which is distinct from “I have a proof that probably X”). I thought that I gave a solid argument that reward hacking strategies would not suddenly and decisively become common, and the “probably” was to hedge against my argument being flawed, not to indicate that the argument showed that reward hacking strategies would appear suddenly and decisively only 10% of the time or whatever.
So I think the correct way to deal with that “probably” is to interrogate how well the argument holds up (as in the sister comment), not to dismiss it due to heuristics about worst-case reasoning.