Daniel Kokotajlo comments on Fun with +12 OOMs of Compute

Daniel Kokotajlo 4 Mar 2021 21:13 UTC
LW: 2 AF: 1
[2] This is definitely true for Transformers (and LSTMs, I think?), but it may not be true for whatever architecture AlphaStar uses. In particular, some people I talked to worry that the vanishing gradient problem might make bigger RL models like OmegaStar actually worse. However, everyone I talked to agreed with the “probably”-qualified version of this claim. I’m very interested to learn more about this.