I would like to see actual benchmarks on that, not like… a PR blog.
Figures are in the paper. It’s a bit harder to figure out what the benchmarks looked like in 2019 (if you go to sites like GLUE today you see the current leaderboard); this does feel cruxy for me. [If I learned that next-token-prediction models don’t transfer to other language tasks, I would feel pretty differently about the shape of cognition, and be very retroactively surprised about some observations.]
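For concreteness, here is roughly the kind of transfer I have in mind: a model trained purely on next-token prediction can be scored on a GLUE-style classification task by comparing the probabilities it assigns to candidate label continuations. This is a minimal sketch assuming the Hugging Face `transformers` and `torch` packages; the model, prompt format, and labels are illustrative choices here, not the setup from any particular paper.

```python
# Minimal sketch (assumes the `transformers` and `torch` packages are
# installed): score an SST-2-style sentiment example with a pure
# next-token-prediction model (GPT-2) by comparing the log-probability it
# assigns to "positive" vs. "negative" continuations of a prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of the log-probs the LM assigns to `continuation` given `prompt`.

    Assumes tokenizing `prompt` yields a prefix of tokenizing
    `prompt + continuation`, which holds for this example.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Logits at position pos - 1 predict the token at position pos.
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

review = "The movie was a complete waste of two hours."
prompt = f"Review: {review}\nSentiment:"
scores = {label: continuation_logprob(prompt, f" {label}")
          for label in ("positive", "negative")}
print(max(scores, key=scores.get))  # label the pure LM prefers
```

Fine-tuning the same pretrained weights on the task is the stronger version of the claim; the zero-shot scoring above is just the cheapest way to show that the pretrained features carry over at all.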
there’s little proof that out-of-sample training (MLM style or otherwise) improves in-sample performance
I think there’s some evidence in favor, both from 1) the value of off-policy examples in RL and 2) training on not-directly-relevant outcomes improving the underlying features. [Tragically I no longer remember the jargon experts use to refer to that, and so can’t easily check whether or not this is true for contemporary architectures or just last-gen ones; naively I would expect there’s still transfer.]
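To make (2) concrete, here is a minimal sketch of the kind of setup I mean (in PyTorch, with made-up shapes, targets, and loss weighting; an illustration of the general idea, not any specific system): a shared encoder receives gradients both from the task you care about and from a not-directly-relevant auxiliary target, so the auxiliary signal shapes the shared features.

```python
# A minimal PyTorch sketch: a shared encoder trained on a main objective plus
# a not-directly-relevant auxiliary objective, so that gradients from the
# auxiliary task also shape the shared features. Shapes, targets, and the 0.3
# weighting are made up for illustration.
import torch
import torch.nn as nn

class SharedEncoderModel(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_main_classes=10, aux_dim=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.main_head = nn.Linear(hidden, n_main_classes)  # task we care about
        self.aux_head = nn.Linear(hidden, aux_dim)          # "irrelevant" auxiliary target

    def forward(self, x):
        h = self.encoder(x)
        return self.main_head(h), self.aux_head(h)

model = SharedEncoderModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
main_loss_fn, aux_loss_fn = nn.CrossEntropyLoss(), nn.MSELoss()

# One training step on fake data; aux_weight controls how much the auxiliary
# signal is allowed to influence the shared encoder.
x = torch.randn(32, 64)
y_main = torch.randint(0, 10, (32,))
y_aux = torch.randn(32, 5)
aux_weight = 0.3

main_logits, aux_pred = model(x)
loss = main_loss_fn(main_logits, y_main) + aux_weight * aux_loss_fn(aux_pred, y_aux)
opt.zero_grad()
loss.backward()
opt.step()
```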
I think we’re basically not going to get ‘proof’ of whether or not a ‘sharp left turn’ happens until it does, and so have to rely on other forms of inference here.
A Clippy-sized model with the specific task of finding exploits will perform better than a Clippy-sized model trained with a broad objective.
I agree with this. I think the disagreement is over what the landscape of models will look like—will it be the case that there are hundreds of Clippy-sized models targeted at various tasks, or will it be the case that the first Clippy-sized model is substantially larger than other models out there?
For gwern’s specific story, I agree it’s somewhat implausible that one engineer (tho with access to corporate compute) trains Clippy and there’s not lots of specialized models; this has to hinge on something about the regulatory environment in the story that prevents larger models from being trained. (But if they’re deliberately aiming at a CAIS-like world, you should expect there to be lots of sophisticated services of the form you’re talking about.) In worlds where the Clippy-sized model comes from a major corporate or state research effort, then it seems unlikely to me that there will be lots of similarly-sized specialized competitors, and so likely the general system has a large training edge (because the training cost is shared among many specialized use cases).
billions (trillions) of dollars in investment lead me to think the research there is close-to-optimal.
I agree there’s a good argument that autonomous vehicle research is close-to-optimal (tho, importantly, it is “optimal at succeeding” instead of “optimal at driving” and so includes lots of design choices driven by regulatory compliance or investor demand), but I don’t think this holds for protein folding, at least as of 2018.
I think the broader argument of “can language models become gods” is a separate one.
My sole objective there was to point out flaws in this particular narrative (which hopefully I stated clearly in the beginning).
I think the “can language models become gods” debate is broader and I didn’t care much to engage with it. Superficially, it seems that some of the same wrong abstractions that lead to this kind of narrative also back up that premise, but I’m in no position to make a hands-down argument for that.
I’ll try to answer the rest of your points later. Stated that way, I don’t particularly disagree with any of them except on the margins (e.g. GLUE is a meaningless benchmark that everyone should stop using; a weak and readable take on this: https://www.techtarget.com/searchenterpriseai/feature/What-do-NLP-benchmarks-like-GLUE-and-SQuAD-mean-for-developers), but I don’t think the disagreements are particularly relevant(?)