Nice summary! I agree, this is an interesting paper :)
But learning to be predictive of such random future states seems like it falls subject to exactly the same problem as learning to be predictive of future observations: you have no guarantee that EfficientZero will be learning relevant information, which means it could be wasting network capacity on irrelevant information. There’s a just-so story you could tell where adding this extra predictive loss results in worse end-to-end behavior because of this wasted capacity, just like there’s a just-so story where adding this extra predictive loss results in better end-to-end behavior because of faster training. I’m not sure why one turned out to be true rather than the other.
This mostly depends on the size of your dataset. For very small datasets (100k frames here), the network is overparameterized and can easily overfit; adding the consistency loss provides regularisation that helps prevent this.
For larger datasets (e.g. the standard 200-million-frame setting in Atari) you’ll see less overfitting, and I would expect the impact of the consistency loss to be much smaller, possibly negative. The paper doesn’t include ablations for this, but I might test it if I have time.
To phrase it differently: the less data you have for your real objective, the more you can benefit from auxiliary losses and regularisation (rough sketch of the consistency term below).
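For concreteness, here is a rough sketch of the kind of SimSiam-style consistency term I mean, sitting on top of a MuZero-like latent model. This is my own illustrative PyTorch code, not EfficientZero's implementation; the module names, sizes, and the weighting coefficient are all placeholders I made up for the example.

```python
# Rough sketch (PyTorch), NOT the paper's actual code: a SimSiam-style
# consistency loss added on top of the usual MuZero-style training losses.
# Module names, dimensions, and the weight below are illustrative placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Maps an observation to a latent state."""
    def __init__(self, obs_dim=16, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )

    def forward(self, obs):
        return self.net(obs)

class TinyDynamics(nn.Module):
    """Predicts the next latent state from the current latent state and action."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )

    def forward(self, latent, action):
        return self.net(torch.cat([latent, action], dim=-1))

def consistency_loss(pred_latent, target_latent):
    """Negative cosine similarity between the dynamics model's predicted next
    latent and the encoder's latent of the actually-observed next frame.
    The target is detached, mirroring the stop-gradient in SimSiam-style losses."""
    pred = F.normalize(pred_latent, dim=-1)
    target = F.normalize(target_latent.detach(), dim=-1)
    return -(pred * target).sum(dim=-1).mean()

# Toy forward pass on random data.
encoder, dynamics = TinyEncoder(), TinyDynamics()
obs, next_obs = torch.randn(8, 16), torch.randn(8, 16)
action = torch.randn(8, 4)

latent = encoder(obs)
pred_next_latent = dynamics(latent, action)
target_next_latent = encoder(next_obs)

# The usual reward/value/policy losses would go here; the consistency term
# is simply added on top with a weighting coefficient.
consistency_weight = 2.0  # illustrative value, not taken from the paper
aux_loss = consistency_weight * consistency_loss(pred_next_latent, target_next_latent)
```

The point of the example is just that the extra term only constrains the latent transition model; in the small-data regime that constraint acts as regularisation, while with abundant data the reward/value/policy losses alone may already pin down the representation.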
Yes, that was the comment I meant to leave but apparently didn’t: it’s just another bias-variance tradeoff. In the limit (say, 20b frames...) all of these regularizations and auxiliary tasks (and/or informative priors) are either going to be neutral or hurt converged performance compared to pure end-to-end reward-only learning. And they should, if you do them right, help most early on when data is scarce and the end-to-end reward-only approach hasn’t been able to learn much. This isn’t post hoc; it’s just what any ML person should predict from the bias-variance tradeoff. The devil is in the details, though, and you could be doing any of it wrong or not be where you think you are in the tradeoff, and that’s where this sort of research finding lives.
Ah, that does make sense, thanks. And yeah, it would be interesting to know what the curve / crossover point would look like for the impact from the consistency loss.