In the rest of this post, we informally describe two versions of this project in more detail: An idealized version using XLand that actors with access to large amounts of compute could carry out, as well as a scaled-down and less resource-intensive version that is more realistic to pursue.
I wouldn’t use or try to reproduce XLand for open research. Procgen or Obstacle Tower might make more sense. Griddly, Alchemy, and Meta-World also come to mind; or MiniHack is an interesting new choice as a kind of procedurally-generated NetHack (there’s also the new Crafter, but it’s unclear to me whether it would let you really mix up item types for curriculum/generation of rules rather than merely new levels).
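For concreteness, a minimal sketch of what the Procgen route buys you (the env ID and the num_levels=0 = unlimited-levels behavior are from the procgen README; the specific settings are just illustrative):

```python
# Sample unlimited procedurally-generated levels from Procgen through
# the gym API -- a cheap analogue of XLand-style task generation.
import gym

# num_levels=0 requests an unbounded level distribution; start_level
# seeds it. distribution_mode="hard" is an arbitrary illustrative choice.
env = gym.make("procgen:procgen-coinrun-v0",
               num_levels=0, start_level=0,
               distribution_mode="hard")

obs = env.reset()
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()  # each reset draws a fresh level
```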
As an analogy, imagine GPT-3 would have only been trained and evaluated on abstract language, or sentences beginning with a subject, using the vocabulary of basic English. How confident would you be in extrapolating from such a model’s scaling with compute to the less restricted language models we have at this point?
Pretty confident? I mean, you’ve read the scaling papers. You know the upshot is that scaling curves are remarkably universal cross-modality, cross-architecture, cross-dataset, and cross-noise-level: they all look like straight lines on log graphs. The differences are usually in the constants, not the exponents, so I would expect a subsetted GPT-3 to have a lower constant in prediction loss on its constrained dataset and be that much closer to its intrinsic entropy, but otherwise extrapolate its scaling laws of loss vs parameters/FLOPs/n the same as always.
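To make the “constants, not exponents” point concrete, here’s a toy sketch fitting a Kaplan-et-al-style power law to synthetic losses (every number below is made up; nothing is fit to real GPT-3 data):

```python
# Fit L(N) = (Nc / N)**alpha + E to synthetic loss-vs-parameters points.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, Nc, alpha, E):
    return (Nc / N) ** alpha + E  # E = irreducible entropy of the dataset

N = np.logspace(6, 10, 9)                 # model sizes: 1e6..1e10 params
L = scaling_law(N, 8e13, 0.076, 1.7)      # pretend these are measured losses
L *= 1 + 0.01 * np.random.randn(len(L))   # add a little measurement noise

(Nc, alpha, E), _ = curve_fit(scaling_law, N, L, p0=(1e13, 0.1, 1.0))
print(f"fitted exponent alpha={alpha:.3f}, irreducible loss E={E:.2f}")
```

The subsetted-GPT-3 prediction in these terms: restricting the data mostly moves Nc & E (the constants), while alpha (the slope on the log-log plot) stays about the same.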
I think you might be trying to say that we would have more doubts about its performance on downstream tasks, such as benchmarks, and whether abilities like meta-learning would be induced? There, yes, we don’t have good ideas as to how much diversity matters, or even how to measure either diversity or downstream performance. (One would expect there to be some sort of bias-variance-esque tradeoff where diversity is good but the data can’t be so sparse as to be unlearnable: if pure diversity were the only goal, then you’d expect data-cleaning to harm both perplexity & downstream performance, but we know that for language models, cleaning Internet data pays off. Beyond this, it’s hard to say. We know that you probably want to broaden environments/data as you accumulate more and more data in any specific environment or task and they saturate, but...)
As an example, it seems quite unlikely that a system trained on English-language text only is going to be very useful for working with German text.
If DeepMind had used a large fraction of their available compute resources on XLand, rerunning the code with a large number of variations might not be feasible, even for them.
En→De: see “Scaling Laws for Language Transfer Learning”, Christina Kim (as expected from Hernandez et al 2021).
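For reference, the quantity Hernandez et al 2021 fit is “effective data transferred”: how much extra fine-tuning data the pretraining is worth. A sketch of the functional form (the constants below are placeholders for illustration, not the paper’s published fits, which are per domain pair):

```python
def effective_data_transferred(D_F, N, k, alpha, beta):
    """Hernandez et al 2021 functional form: D_T = k * D_F**alpha * N**beta.

    D_F: fine-tuning dataset size (e.g. characters of German text)
    N:   (non-embedding) parameter count
    k, alpha, beta: fitted per (pretraining domain, target domain) pair
    """
    return k * D_F**alpha * N**beta

# Placeholder constants only: at fixed fine-tuning data, bigger models
# "import" more effective data from pretraining.
for N in (1e8, 1e9, 1e10):
    print(f"N={N:.0e}: D_T={effective_data_transferred(1e7, N, 1e3, 0.2, 0.4):.2e}")
```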
https://www.lesswrong.com/posts/KaPaTdpLggdMqzdyo/how-much-compute-was-used-to-train-deepmind-s-generally ?