For some reason here on LW there’s a huge focus on “architecture”. I don’t get it. Here’s how I at-this-moment think of the scaling hypothesis:
Weak scaling hypothesis: For a task that has not yet been solved, if you increase data and model capacity, and tune the learning algorithm to make use of it (like, hyperparameter tuning and such, not a fundamentally new algorithm), then performance will improve.
This seems fairly uncontroversial, I think? This probably breaks down in some edge cases (e.g. a 1-layer neural net that you keep making wider and wider) but seems broadly correct to me. It’s mostly independent of the architecture (as long as it is possible to increase model capacity). Note also the common wisdom in ML that what your data is matters far more than what your model / learning algorithm is.
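As a concrete (and entirely illustrative) way to picture the weak scaling hypothesis, here is a short Python sketch using a Chinchilla-style parametric loss curve. The functional form and every constant in it are my own assumptions for illustration, not something claimed above; the only point is that loss falls as either parameters or data grow.

```python
# Illustrative only: a hypothetical parametric loss curve in the style of
# scaling-law papers. All constants are made up for the sketch.

def loss(params, tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Hypothetical loss as a function of parameter count and token count."""
    return E + A / params**alpha + B / tokens**beta

# Growing the model (here with data held fixed) lowers the loss -- the weak
# scaling hypothesis in miniature. Growing the data behaves the same way.
for n_params in [1e7, 1e8, 1e9, 1e10]:
    print(f"params={n_params:.0e}  loss={loss(n_params, 1e10):.3f}")
```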
What the architecture can influence is where your performance starts out and the rate at which it scales, which matters for:
Strong scaling hypothesis: (Depends on weak scaling hypothesis) There is a sufficiently difficult task T and an architecture A that we know of for that task, such that:
1. “solving” T would lead to AGI,
2. it is conceptually easy to scale up the model capacity for A,
3. it is easy to get more data for T, and
4. scaling up a) model capacity and b) data will lead to “solving” T on some not-crazy timescale and resource-scale.
According to me, it is hard to find a T that satisfies 1, 3, and 4b; it is trivial to satisfy 2; and it is hard to find an architecture that satisfies 4a. OpenAI’s big contribution here is believing and demonstrating that T = “predict language” might satisfy 1, 3, and 4b. I know of no other such T (though multiagent environments are a candidate).
What about 4a? According to me, it just so happens that Transformers are the best architecture for T = “predict language”, and so that’s what we saw get scaled up, but I’d expect you’d see the same pattern of scaling (but not the same absolute performance) from other architectures as well. (For example, I suspect RNNs would also satisfy 4a.) I think the far more interesting question is whether we’ll see other tasks T that could plausibly satisfy 1, 3, and 4b.
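To make the “same task, different architecture” point concrete, here is a minimal PyTorch sketch (my own, assuming PyTorch; none of it comes from the comment) in which a small Transformer and a small LSTM are trained with exactly the same next-token-prediction objective and loop. Only the architecture differs; the claim above is that this choice shifts where the curve starts and how fast it falls, not whether it falls.

```python
# A sketch only: the same task T = "predict language" expressed as the same
# training step for two different architectures. Hyperparameters are arbitrary.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 128

class TransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, x):
        # Additive causal mask so each position only attends to earlier tokens.
        seq_len = x.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.head(self.encoder(self.embed(x), mask=mask))

class RnnLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.LSTM(DIM, DIM, num_layers=2, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

def train_step(model, tokens, opt):
    """One step of the shared objective: predict the next token."""
    logits = model(tokens[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Same (random, stand-in) data and loop for both models; only A changes.
tokens = torch.randint(0, VOCAB, (8, 64))
for model in (TransformerLM(), RnnLM()):
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)
    print(type(model).__name__, train_step(model, tokens, opt))
```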
Thanks! It sounds like you are saying the task is more important than the architecture, so we should talk less about architectures and more about tasks.
That seems plausible to me, with the caveat that I think it’s still worth talking about architecture sometimes. For example, when thinking about the safety or generalization properties of a system, the architecture might be more important, no?
If I could go back in time, I’d change the question to be about “Architecture+training setups” instead of just “architectures.”
Yes, that’s right.
I’d be pretty surprised if this (the architecture mattering for safety or generalization) were the case after conditioning on the raw capabilities of the architecture, though I can’t rule it out.