I’ve been waiting to say this until OpenAI’s next larger model dropped, but that has now failed to happen for so long that it’s become its own update, and I’d like to state my prediction before it becomes obvious.
This doesn’t seem to be reflected in the general opinion here, but it seems to me that LLMs are plateauing and may have already plateaued a year or so ago. Scores on various metrics continue to go up, but that is only weak evidence, because the benchmarks are heavily gamed and sometimes leak into the training data. Still, those numbers on their own would update me towards short timelines, even taking their unreliability into account; what outweighs them is my personal experience with LLMs. I just don’t find them useful for practically anything. I have a pretty consistently correct model of which problems they will be able to help me with, and it’s not a lot: maybe a broad introduction to a library I’m not familiar with, or detecting simple bugs. That model has held for a year or two without the set expanding much. I also don’t see applications to anything economically productive, except for fluffy chatbot apps.
Huh, o1 and the latest Claude were quite huge advances for me. Basically, within the last year LLMs for coding went from “occasionally helpful, maybe a 5-10% productivity improvement” to “my job now is basically to instruct LLMs to do things”; depending on the task, that’s a 30% to 2x productivity improvement.
I’m in Canada, so I can’t access the latest Claude, which means my experience with these things tends to be a couple of months out of date. But I’m not really impressed by models spitting out slightly wrong code that tells me which functions to call. I think this is essentially a more useful search engine.
Use Chatbot Arena: both versions of Claude 3.5 Sonnet are accessible in Direct Chat (third tab). There’s even o1-preview in Battle Mode (first tab); you just need to keep asking the question until you get o1-preview. In general, Battle Mode (asking a fixed question over multiple rounds) is a great tool for developing intuition about model capabilities, since it also hides the model name from you while you evaluate the response.
Just an FYI, unrelated to the discussion: all versions of Claude are available in Canada through Anthropic; you don’t even need third-party services like Poe anymore.
Source: https://www.anthropic.com/news/introducing-claude-to-canada
Base model scale has only increased maybe 3-5x in the last 2 years, from 2e25 FLOPs (original GPT-4) up to maybe 1e26 FLOPs[1]. So to a significant extent the experiment of further scaling hasn’t actually been run, and the 100K-H100 clusters that have just started training new models in the last few months promise another 3-5x increase in scale, to 2e26-6e26 FLOPs.
“possibly have already plateaued a year or so ago”

Right, the metrics don’t quite capture how smart a model is, and the models haven’t been getting much smarter for a while now. But that may simply be because they weren’t scaled up much further (compared to the original GPT-4) in all this time. We’ll see in the next few months, as the labs deploy the models trained on 100K H100s (and whatever systems Google has).
[1] That would be 3 months on 30K H100s, roughly $140 million at $2 per H100-hour, which is plausible but not something rumored about specific models. Llama-3-405B is 4e25 FLOPs, but it’s not a MoE model. It could well be that 6e25 FLOPs is the most anyone has trained with among models deployed so far.
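As a rough sanity check on those numbers, here is a short sketch. The GPU counts, the 3-month duration, and the $2 per H100-hour price come from the comments above; the H100 peak throughput (~1e15 FLOP/s in dense BF16) and ~40% utilization are assumptions added for illustration, so the outputs are order-of-magnitude estimates only.

```python
# Rough sanity check of the compute and cost estimates above.
# From the comments: 30K / 100K H100s, ~3 months, $2 per H100-hour.
# My assumptions: ~1e15 FLOP/s peak per H100 (dense BF16), ~40% utilization.

def training_run(n_gpus, months=3, peak_flops=1e15, utilization=0.4, price_per_hour=2.0):
    """Return (total training FLOPs, rental cost in dollars) for a hypothetical run."""
    hours = months * 30.5 * 24
    flops = n_gpus * peak_flops * utilization * hours * 3600
    cost = n_gpus * hours * price_per_hour
    return flops, cost

for n_gpus in (30_000, 100_000):
    flops, cost = training_run(n_gpus)
    print(f"{n_gpus:>7,} H100s: ~{flops:.1e} FLOPs, ~${cost / 1e6:.0f}M")

# 30,000 H100s: ~9.5e25 FLOPs (close to the 1e26 figure), ~$132M (vs ~$140M above)
# 100,000 H100s: ~3.2e26 FLOPs, inside the 2e26-6e26 range quoted above
```

The small gap against the footnote’s $140 million is just rounding in the assumed run length; the point is only that the orders of magnitude hang together.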
I’ve noticed they perform much better on graduate-level ecology/evolution questions, in a qualitative sense: they provide answers that are fuller as well as technically accurate. I think translating that into a “usefulness” metric is always going to be difficult, though.
Over the last few weeks I’ve felt the opposite of this. I go back and forth on thinking they are plateauing, and then I get surprised by the new Sonnet version or o1-preview. I also experiment a lot with my own prompting.
I’ve noticed occasional surprises in that direction, but none of them seem to shake out into utility for me.
Is this a reaction to “OpenAI Shifts Strategy as Rate of ‘GPT’ AI Improvements Slows”?
No, that seems paywalled. I’m curious, though.