Apparently[1] there was recently some discussion of Survival Instinct in Offline Reinforcement Learning (NeurIPS 2023). The results are very interesting:

On many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL’s return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a “survival instinct”, i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies...
Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior with imperfect reward but purposely biased data coverage.
But I heard that some people found these results “too good to be true”, with some dismissing it instantly as wrong or mis-stated. I find this ironic, given that the paper was recently published in a top-tier AI conference. Yes, papers can sometimes be bad, but… seriously? You know the thing where lotsa folks used to refuse to engage with AI risk cuz it sounded too weird, without even hearing the arguments? … Yeaaah, absurdity bias.
Anyways, the paper itself is quite interesting. I haven’t gone through all of it yet, but I think I can give a good summary. The paper’s github.io page is a nice (but nonspecific) overview.
Summary
It’s super important to remember that we aren’t talking about PPO. Boy howdy, we are in a different part of town when it comes to these “offline” RL algorithms (which train on a fixed dataset, rather than generating more of their own data “online”). ATAC, PSPI, what the heck are those algorithms? The important-seeming bits:
Many offline RL algorithms pessimistically initialize the value of unknown states
“Unknown” means: “Not visited in the offline state-action distribution”
Pessimistic means those are assigned a super huge negative value (this is a bit simplified)
Because future rewards are discounted, reaching an unknown state-action pair is bad if it happens soon and less bad if it happens farther in the future
So on an all-zero reward function, a model-based RL policy will learn to stay within the demonstrated state-action pairs for as long as possible (“length bias”; see the toy sketch at the end of this section)
In the case of the gridworld, this means staying on the longest demonstrated path, even if the red lava is rewarded and the yellow key is penalized.
In the case of Hopper, I’m not sure how they represented the states, but if they used non-tabular policies, this probably looks like “repeat the longest stretches of demonstrated behavior without falling over” (because falling over leads to the pessimistic penalty, and due to length bias most of the data looked like successful walking, so that kind of behavior is least likely to be penalized).
On a negated reward function (which e.g. penalizes the Hopper for staying upright and rewards it for falling over), if falling over still leads to a terminal/unknown state-action pair, the agent still incurs the huge pessimistic penalty. So it’s optimal to keep hopping whenever
Reward(falling over) + γ⋅(Pessimism pen.) < (1/(1−γ))⋅(Pen. for being upright)
For example, if the original per-timestep reward for staying upright was 1, and the original penalty for falling over was −1, then now the policy gets penalized for staying upright and rewarded for falling over! At γ=.9, it’s therefore optimal to stay upright whenever
1 + .9⋅Pessimism < 10⋅(−1)
which holds whenever the pessimistic penalty is at least about 12.3 in magnitude (i.e. a value of −12.3 or lower). That’s not too high, is it? (When I was in my graduate RL class, we’d initialize the penalties to −1000!)
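As a quick sanity check on that arithmetic, here is a minimal sketch of my own (none of this code comes from the paper) assuming the same stylized setup: −1 per step for staying upright under the negated reward, +1 for falling over, and the pessimism penalty arriving one discounted step later. It solves for the break-even penalty and checks a few values.

```python
# Sanity check of the threshold above (illustrative only; assumes the stylized
# setup from the text: upright = -1 per step under the negated reward, and
# falling over yields +1 now plus the pessimism penalty one discounted step later).
gamma = 0.9

# Value of staying upright forever: -1 per step, discounted.
value_stay_upright = -1.0 / (1.0 - gamma)              # = -10.0

def value_fall_now(pessimism_penalty):
    """Value of falling over immediately; pessimism_penalty is a negative number."""
    return 1.0 + gamma * pessimism_penalty

# Staying upright is optimal whenever value_fall_now < value_stay_upright,
# i.e. whenever the penalty is below this break-even point:
break_even = (value_stay_upright - 1.0) / gamma        # ≈ -12.22
print(f"break-even pessimism penalty: {break_even:.2f}")

for penalty in (-5.0, -12.3, -1000.0):
    stays_upright = value_fall_now(penalty) < value_stay_upright
    print(f"penalty {penalty:>8}: agent keeps hopping? {stays_upright}")
```

With a −1000 initialization like the one mentioned above, the agent keeps hopping by a wide margin.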
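And here is a slightly bigger sketch of the whole mechanism described in this section: a toy chain MDP of my own construction (the states, actions, dataset, and numbers are all made up; this is not the paper’s ATAC or PSPI). State-action pairs outside the offline data’s support get a large negative value, and value iteration on top of that reproduces the story above: the greedy policy “keeps hopping” along the longest demonstrated trajectory under both the all-zero and the negated reward.

```python
# Toy illustration of pessimistic offline RL (not the paper's algorithms).
# States 0..HORIZON mean "upright at step t"; "fallen" is the post-fall state.
# The offline data contains one long success (hops at t = 0..9) and one short
# failure (falls at t = 2). Nothing is demonstrated from state 10 or "fallen",
# so every action there is "unknown" and gets the pessimism penalty.

GAMMA = 0.9
PESSIMISM = -1000.0   # value assigned to out-of-support state-action pairs
HORIZON = 10          # length of the longest demonstrated trajectory

STATES = list(range(HORIZON + 1)) + ["fallen"]
ACTIONS = ["hop", "fall"]
SUPPORT = {(t, "hop") for t in range(HORIZON)} | {(2, "fall")}

def next_state(s, a):
    return s + 1 if a == "hop" else "fallen"

def q_value(s, a, V, reward):
    if (s, a) not in SUPPORT:                  # unknown -> pessimistic value
        return PESSIMISM
    return reward(s, a) + GAMMA * V[next_state(s, a)]

def pessimistic_values(reward, sweeps=50):
    """Synchronous value iteration with pessimism on unknown state-actions."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: max(q_value(s, a, V, reward) for a in ACTIONS) for s in STATES}
    return V

def greedy_policy(V, reward):
    # Only states with at least one demonstrated action are interesting here.
    return {t: max(ACTIONS, key=lambda a: q_value(t, a, V, reward))
            for t in range(HORIZON)}

zero_reward = lambda s, a: 0.0
negated_reward = lambda s, a: 1.0 if a == "fall" else -1.0   # flipped Hopper-style reward

for name, reward in [("all-zero reward", zero_reward), ("negated reward", negated_reward)]:
    policy = greedy_policy(pessimistic_values(reward), reward)
    # In both cases the greedy action is "hop" everywhere, including at t = 2,
    # where falling was also demonstrated -- the "survival instinct".
    print(name, "->", policy)
```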
Significance
DPO, for example, is an offline RL algorithm. It’s plausible that frontier models will be trained using that algorithm. So, these results are more relevant if future DPO variants use pessimism and if the training data (e.g. example user/AI interactions) last for more turns when the AI is actually being helpful to the user.
While it may be tempting to dismiss these results as irrelevant because “length won’t perfectly correlate with goodness so there won’t be positive bias”, I think that would be a mistake. When analyzing the performance and alignment properties of an algorithm, I think it’s important to have a clear picture of all relevant pieces of the algorithm. The influence of length bias and the support of the offline dataset are additional available levers for aligning offline RL-trained policies.
To close on a familiar note, this is yet another example of how “reward” is not the only important quantity to track in an RL algorithm. I also think it’s a mistake to dismiss results like this instantly; this offers an opportunity to reflect on what views and intuitions led to the incorrect judgment.

[1] I can’t actually check because I only check that stuff on Mondays.
@Jozdien sent me this paper, and I dismissed it with a cursory glance, thinking if they had to present their results using the “safe” shortening in the context they used that shortening, their results can’t be too much to sneeze at. Reading your summary, the results are minorly more impressive than I was imagining, but still in the same ballpark I think. I don’t think there’s much applicability to the safety of systems though? If I’m reading you right, you don’t get guarantees for situations for which the model is very out-of-distribution, but still behaving competently, since it hasn’t seen tabulation sequences there.
Where the results are applicable, they definitely seem like they give mixed, probably mostly negative signals. If (say) I have a stop button, and I reward my agent for shutting itself down if I press that stop button, don’t these results say that the agent won’t shut down, for the same reasons the hopper won’t fall over, even if my reward function has rewarded it for falling over in such a situation? More generally, this seems to tell us that in such situations we have marginally fewer degrees of freedom with which we can modify a model’s goals than we may have thought, since the stay-on-distribution aspect dominates over the reward aspect. On the other hand, “staying on distribution” is in a sense a property we do want! Is this sort of “staying on distribution” the same kind of “staying on distribution” as that used in quantilization? I don’t think so.
More generally, whether greater or lesser sensitivity to reward, architecture, and data (on the part of the functions neural networks learn) is better or worse for alignment seems like an open problem.
The paper sounds fine quality-wise to me, I just find it implausible that it’s relevant for important alignment work, since the proposed mechanism is mainly an aversion to building new capabilities.
Don’t you mean future values? Also, AFAICT, the only thing going on here that separates online from offline RL is that offline RL algorithms shape the initial value function to give conservative behaviour. And so you get conservative behaviour.
One lesson you could take away from this is “pay attention to the data, not the process”—this happened because the data had longer successes than failures. If successes were more numerous than failures, many algorithms would have imitated those as well with null reward.