Offline RL is surprisingly stable and robust to reward misspecification
Wow, what a wild paper. The basic idea is a fairly obvious one: “pessimism” about off-distribution state/action pairs induces pessimistically trained RL agents to learn policies that hang around in the training distribution for a long time, even when that goes against their reward function. But what’s not obvious is the wide variety of algorithms this applies to.
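To make the mechanism concrete, here’s a minimal sketch of the flavour of pessimism penalty I have in mind (loosely CQL-style; the names, the `q_net` interface, and the exact form of the term are my own illustration, not necessarily what the paper’s algorithms do):

```python
import torch

def pessimism_penalty(q_net, states, dataset_actions, policy_actions, alpha=1.0):
    """Illustrative conservative regulariser, roughly CQL-flavoured.

    Added to the usual TD loss, it pushes Q-values down on actions proposed
    by the learned policy (potentially off-distribution) and up on actions
    that actually appear in the offline dataset. Minimising it makes leaving
    the data distribution look like a bad deal, whatever the reward says.
    """
    q_policy = q_net(states, policy_actions)   # Q on the policy's proposed actions
    q_data = q_net(states, dataset_actions)    # Q on actions taken in the dataset
    return alpha * (q_policy.mean() - q_data.mean())
```

The point is that this term has nothing to do with the reward, which is exactly why a badly specified reward doesn’t move the resulting policy much.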
I genuinely don’t believe their decision transformer results. That is: with p~0.8, I think that if they (or the authors of the paper whose hyperparameters they copied) had made better design choices, they would have gotten a decision transformer that was actually sensitive to reward. But on the flip side, with p~0.2 they just showed that decision transformers don’t work! (For these tasks.)
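For context on what “sensitive to reward” should mean here: a decision transformer is trained to predict actions conditioned on a return-to-go, so at evaluation time dialling the target return up or down is supposed to change behaviour. Here is a minimal sketch of that conditioning interface, with the shapes and the `model` signature being my own assumptions rather than the paper’s setup:

```python
import torch

def returns_to_go(rewards):
    """Suffix sums R_t = sum_{t' >= t} r_{t'} (undiscounted), used as conditioning during training."""
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

@torch.no_grad()
def act(model, states, actions, rewards, target_return):
    """Predict the next action conditioned on a desired return.

    `model` is assumed to take aligned (return-to-go, state, action) token
    sequences and output the next action. If return conditioning has been
    learned, raising `target_return` should yield higher-reward behaviour;
    the paper's finding is that in practice it often barely matters.
    """
    spent = torch.cat([torch.zeros(1, dtype=rewards.dtype),
                       torch.cumsum(rewards, dim=0)[:-1]])
    rtg = target_return - spent   # return still "owed" at each step; rtg[0] == target_return
    return model(rtg, states, actions)
```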