I would like to see actual benchmarks on that, not… a PR blog.
My general take is that T5 models that were trained on the data did better than adding a head onto open-source GPTs (I think it was NeoX at the time, whatever the 2.7B-param one was) and training that, and obviously better than training a GPT as a whole… this is specifically for what I looked into at the time, which was language-to-code and code-to-code translation. And the best GPT-3 of the time, with prompts, was horrible even for tasks it was specifically trained for (NL to SQL).
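For concreteness, a minimal sketch of the two setups I’m contrasting, written against HuggingFace-style APIs; the checkpoints, the toy NL-to-SQL pair, and the prompt format are illustrative assumptions, not the exact ones I used at the time:

```python
# Sketch only: checkpoints, the toy NL-to-SQL pair, and the prompt format are
# illustrative, not the exact setup from the experiments described above.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,   # T5-style encoder-decoder, fine-tuned end to end
    AutoModelForCausalLM,    # GPT-style decoder trained through its LM head
)

# Setup A: fine-tune a T5 directly on NL -> SQL pairs.
t5_tok = AutoTokenizer.from_pretrained("t5-base")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
enc = t5_tok("translate English to SQL: list all users", return_tensors="pt")
labels = t5_tok("SELECT * FROM users", return_tensors="pt").input_ids
loss_t5 = t5(**enc, labels=labels).loss  # standard seq2seq fine-tuning loss

# Setup B: take an open-source GPT (e.g. the 2.7B-param GPT-Neo) and train it
# on the same pairs, flattened into a single left-to-right sequence.
gpt_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
gpt = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
batch = gpt_tok("-- list all users\nSELECT * FROM users", return_tensors="pt")
loss_gpt = gpt(**batch, labels=batch.input_ids).loss  # causal LM loss on the pair
```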
More broadly, I’ve not seen any large language model outperform small models at specific tasks; their embeddings might be better in certain cases, but mainly as a way to reduce training time.
But to keep the argument simpler: there’s little proof that out-of-sample training (MLM-style or otherwise) improves in-sample performance (note: models trained out-of-sample can still be better than randomly initialized ones, which is why using pretrained weights is a thing).
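To make the “better than random” aside concrete, a minimal sketch (again assuming a HuggingFace-style API; the checkpoint and the two-label task are arbitrary placeholders):

```python
# Sketch only: "roberta-base" and the two-label task are arbitrary stand-ins;
# the point is pretrained weights vs. random initialization.
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("roberta-base", num_labels=2)

# Starts from MLM-pretrained (out-of-sample-trained) weights.
pretrained = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# Same architecture, randomly initialized: the baseline the pretrained copy
# has to beat after both are fine-tuned on the same in-sample data.
from_scratch = AutoModelForSequenceClassification.from_config(config)
```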
However, the strong version of the argument is closer to:
A Clippy-sized model with the specific task of finding exploits will perform better than a Clippy-sized model trained with a broad objective.
I think I could even make that claim for a model 1/1000th of Clippy’s size, but at that point you’re haggling over benchmarks.
Cases like code completion and generation (e.g. the GPT-based Copilot) are fuzzy and have no clear way to benchmark, but even there, custom-built models (e.g. the recent one from Google) seem to lead on what benchmarks exist.
That being said, it’s been ~1 year since I had an applied interest in the issue, so maybe I’m wrong. For more concrete examples I’d take things like protein folding and self-driving, where nobody does any out-of-sample training even though relevant datasets could be found outside the specific field they’re applied to, and the billions (trillions) of dollars in investment lead me to think the research there is close to optimal.
With text it’s a bit harder to call, since the goals are fuzzier and text benchmarks start to break down when you are evaluating small differences.
I would like to see actual benchmarks on that, not… a PR blog.
Figures are in the paper. It’s a bit harder to figure out what the benchmarks looked like in 2019 (if you go to sites like GLUE today you see the current leaderboard); this does feel cruxy for me. [If I learned that next-token-prediction models don’t transfer to other language tasks, I would feel pretty differently about the shape of cognition, and be very retroactively surprised about some observations.]
there’s little proof that out-of-sample training (MLM-style or otherwise) improves in-sample performance
I think there’s some evidence in favor, both from 1) the value of off-policy examples in RL and 2) training on not-directly-relevant outcomes improving the underlying features. [Tragically I no longer remember the jargon experts use to refer to that, and so can’t easily check whether or not this is true for contemporary architectures or just last-gen ones; naively I would expect there’s still transfer.]
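As a toy sketch of point 2), the usual setup is a shared encoder with a main head plus an auxiliary head trained on labels you don’t directly care about; everything below (dimensions, loss weight, random data) is made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # shared features
main_head = nn.Linear(64, 2)   # the task we actually care about
aux_head = nn.Linear(64, 5)    # a "not-directly-relevant" auxiliary task

params = (
    list(encoder.parameters())
    + list(main_head.parameters())
    + list(aux_head.parameters())
)
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(16, 32)              # fake inputs
y_main = torch.randint(0, 2, (16,))  # fake main-task labels
y_aux = torch.randint(0, 5, (16,))   # fake auxiliary labels

feats = encoder(x)
# The auxiliary loss only matters through the gradients it sends into `encoder`,
# which is the claimed mechanism for "irrelevant" outcomes improving features.
loss = F.cross_entropy(main_head(feats), y_main) + 0.3 * F.cross_entropy(
    aux_head(feats), y_aux
)
opt.zero_grad()
loss.backward()
opt.step()
```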
I think we’re basically not going to get ‘proof’ of whether or not a ‘sharp left turn’ happens until it does, and so have to rely on other forms of inference here.
A Clippy-sized model with the specific task of finding exploits will perform better than a Clippy-sized model trained with a broad objective.
I agree with this. I think the disagreement is over what the landscape of models will look like—will it be the case that there are hundreds of Clippy-sized models targeted at various tasks, or will it be the case that the first Clippy-sized model is substantially larger than other models out there?
For gwern’s specific story, I agree it’s somewhat implausible that one engineer (though with access to corporate compute) trains Clippy and there aren’t lots of specialized models; this has to hinge on something about the regulatory environment in the story that prevents larger models from being trained. (But if they’re deliberately aiming at a CAIS-like world, you should expect there to be lots of sophisticated services of the form you’re talking about.) In worlds where the Clippy-sized model comes from a major corporate or state research effort, it seems unlikely to me that there will be lots of similarly-sized specialized competitors, and so the general system likely has a large training edge (because the training cost is shared among many specialized use cases).
billions (trillions) of dollars in investment lead me to think the research there is close to optimal.
I agree there’s a good argument that autonomous vehicle research is close to optimal (though, importantly, it is “optimal at succeeding” rather than “optimal at driving”, and so includes lots of design choices driven by regulatory compliance or investor demand), but I don’t think this holds for protein folding, at least as of 2018.
For gwern’s specific story, I agree it’s somewhat implausible that one engineer (though with access to corporate compute) trains Clippy and there aren’t lots of specialized models;
I think the broader argument of “can language models become gods” is a separate one.
My sole objective there was to point out flaws in this particular narrative (which hopefully I stated clearly in the beginning).
I think the “can language models become gods” debate is broader, and I didn’t care much to engage with it. Superficially, it seems that some of the same wrong abstractions that lead to this kind of narrative also back up that premise, but I’m in no position to make a hands-down argument for that.
I will try to answer the rest of your points later. I don’t particularly disagree with any of them stated that way, except on the margins (e.g. GLUE is a meaningless benchmark that everyone should stop using; a weak and readable take on this is https://www.techtarget.com/searchenterpriseai/feature/What-do-NLP-benchmarks-like-GLUE-and-SQuAD-mean-for-developers), but I don’t think the disagreements are particularly relevant(?)