My impression from the DreamerV3 paper was actually that I expect it to generalize less easily than EfficientZero in many ways, because it has clever hacks like symlog on rewards that are really encoding programmer knowledge about the statistics of the environment. (The biggest such thing in EfficientZero is maybe the learning rate schedule? Its value prediction scheme is also pretty complicated, but it didn't seem fine-tuned. In the DreamerV3 paper, though, there are quite a few pieces that looked carefully engineered to me.)
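For concreteness, the symlog/symexp pair I mean is simple to write down. Here's a minimal numpy sketch of the transform as I understand it from the paper (applied to reward and value regression targets, with symexp undoing it at prediction time):

```python
import numpy as np

def symlog(x):
    # Compresses magnitudes symmetrically: roughly x near 0, roughly sign(x)*log|x| far from 0.
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    # Inverse of symlog; maps network predictions back to the original reward/value scale.
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0)

# A reward of 1000 becomes ~6.9 as a regression target and -1000 becomes ~-6.9,
# so one set of hyperparameters can cover environments with very different reward scales.
r = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
assert np.allclose(symexp(symlog(r)), r)
```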
On the other hand, there are convincing arguments for why, eventually, scaling and generalization should be better for a system that learns its own rules for prediction / amplification rather than being locked into MCTS. But I feel like we're not quite there yet—transformers are sort of like learning rules for convolution, but we might imagine that for RL we want to be able to learn even more complicated generalization processes. So currently I expect benefits of scale, but we can't really take off to the moon (fortunately).
I’m quite unsure as well.
On one hand, when I look at it I have the same feeling: a lot of weirdly specific, surely-not-universalizing optimizations.
But on the other hand, it does seem to do quite well across different envs, and if it wasn't hyperparameter-tuned per environment then that performance seems like the ultimate arbiter. And I don't trust my intuitions about what qualifies as robust engineering vs. non-robust tweaks in this domain. (Supervised learning is easier than RL in many ways, but LR warm-up still seems like a weird hack to me, even though it's vital for a whole bunch of standard Transformer architectures and I know there are explanations for why it works.)
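The warm-up I have in mind is, e.g., the schedule from the original Transformer paper; a quick sketch (warmup_steps=4000 and d_model=512 are just that paper's defaults, nothing tuned on my end):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # The "weird hack": ramp the LR up linearly for warmup_steps, then decay as 1/sqrt(step).
    # This is the classic schedule from "Attention Is All You Need"; many variants exist.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# LR starts tiny, peaks around step 4000, then slowly decays.
for s in (1, 1000, 4000, 40000):
    print(s, transformer_lr(s))
```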
Similarly—I dunno, human perception generally maps onto something like log-space, so maybe symlog on rewards (and on observations (?!)) makes deep sense? And maybe you need something like the gradient clipping and KL balancing to handle the non-i.i.d. data of RL? I might just stare at the paper for longer.
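By KL balancing I mean, roughly: the world-model KL is split with stop-gradients so the prior gets pulled toward the posterior much harder than the posterior gets regularized toward the prior. A rough PyTorch sketch of that idea as I understand it from DreamerV2/V3, with made-up weights and my own placement of the free-bits clamp, not the official implementation:

```python
import torch
import torch.distributions as td

def balanced_kl(post_logits, prior_logits, alpha=0.8, free_nats=1.0):
    # Split the KL into two terms that only update one side each.
    post = td.Categorical(logits=post_logits)
    prior = td.Categorical(logits=prior_logits)
    post_sg = td.Categorical(logits=post_logits.detach())
    prior_sg = td.Categorical(logits=prior_logits.detach())
    dyn_term = td.kl_divergence(post_sg, prior)   # gradient flows into the prior only
    rep_term = td.kl_divergence(post, prior_sg)   # gradient flows into the posterior only
    kl = alpha * dyn_term + (1.0 - alpha) * rep_term
    # "Free bits": below this threshold the KL contributes a constant (no gradient),
    # so the posterior is allowed to keep some information about the input for free.
    return torch.clamp(kl, min=free_nats).mean()

# Toy usage: a batch of 16 categorical latents over 32 classes.
post_logits = torch.randn(16, 32, requires_grad=True)
prior_logits = torch.randn(16, 32, requires_grad=True)
loss = balanced_kl(post_logits, prior_logits)
loss.backward()
```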