I’m quite unsure as well.
On one hand, I have the same feeling when I look at it: it seems full of weirdly specific optimizations that surely shouldn't generalize.
But on the other hand, it does seem to do quite well across different environments, and if it wasn't hyper-parameter-tuned per environment, then that performance seems like the ultimate arbiter. And I don't trust my intuitions about what qualifies as robust engineering vs. non-robust tweaks in this domain. (Supervised learning is easier than RL in many ways, but LR warm-up still seems like a weird hack to me, even though it's vital for a whole bunch of standard Transformer architectures and I know there are proposed explanations for why it works.)
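For concreteness, the "weird hack" I have in mind is the inverse-square-root schedule with linear warm-up from the original Transformer paper; a minimal sketch (the `d_model` and `warmup_steps` values are just that paper's defaults, not anything from the paper under discussion):

```python
import math

def lr_schedule(step, d_model=512, warmup_steps=4000):
    """Transformer LR: rises linearly for `warmup_steps`, then decays as 1/sqrt(step).

    `step` is 1-indexed (step 0 would divide by zero).
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The LR peaks exactly at `warmup_steps` and falls off on either side, which is the part that still feels arbitrary to me even though it demonstrably stabilizes early training.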
Similarly, I dunno: human perception generally maps to something like log-space, so maybe symlog on rewards (and, surprisingly, even on observations) makes deep sense? And maybe you need something like the gradient clipping and KL balancing to handle the non-i.i.d. data of RL? I might just stare at the paper for longer.
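For reference, symlog and its inverse, as usually defined (sign-preserving, roughly identity near zero, logarithmic for large magnitudes):

```python
import math

def symlog(x):
    # sign(x) * ln(1 + |x|): compresses large magnitudes symmetrically around zero.
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x):
    # Exact inverse: sign(x) * (exp(|x|) - 1).
    return math.copysign(math.expm1(abs(x)), x)
```

The symmetry is what makes it usable on signed rewards, where a plain log transform wouldn't be defined.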