It’s not necessarily impactful at all. The crucial question for the next few years is whether and where LLM scaling plateaus. Before o1, GPT-4 level models couldn’t produce useful long reasoning traces. Reading comprehension only just started mostly working at this scale. And RLHF through PPO is apparently to a large extent a game of carefully balancing early stopping. So it’s brittle and doesn’t generalize very far off-distribution, which made it unclear whether heavier RL could help with System 2 reasoning in cases where it’s not capable of rebuilding the capabilities from scratch, overwriting the damage it does to the LLM.
But taking base model scaling one step further should help both with the fragility of GPT-4 level models and with their ability to carry out System 2 reasoning on their own, with a little adaptation that elicits capabilities, rather than heavier RL that instills capabilities not already found to a useful extent in the base model. And then there’s at least one more step of base model scaling after that (deployed models run on about 40 megawatts of H100s, models in training on about 150 megawatts; in 1.5 years we’ll get to about a gigawatt, with 2x in FLOPs from moving to B200s, plus whatever effective FLOP/joule Trillium delivers). So there’s every chance this advancement is rendered obsolete, to the extent that it’s currently observable as a capability improvement, if these scaled-up models just start being competent at System 2 reasoning without needing such post-training. Even the ability of cheaper models to reason could then be reconstructed by training them on reasoning traces collected from larger models.
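A rough back-of-the-envelope version of that arithmetic, just to make the step size concrete. The per-GPU power and FLOP/s figures are approximate assumptions, and the "2x from B200s" is read here as a per-joule improvement; everything else is the figures quoted above:

```python
# Back-of-the-envelope scaling arithmetic from power budgets.
# H100 power draw and FLOP/s are approximate assumptions.

H100_WATTS = 700          # assumed per-GPU draw, ignoring cooling/host overhead
H100_BF16_FLOPS = 1e15    # ~1 petaFLOP/s dense BF16, approximate

def cluster_flops(megawatts, flops_per_gpu=H100_BF16_FLOPS, watts_per_gpu=H100_WATTS):
    """Peak FLOP/s of a cluster with a given power budget, in H100-equivalents."""
    num_gpus = megawatts * 1e6 / watts_per_gpu
    return num_gpus * flops_per_gpu

current_training = cluster_flops(150)        # ~150 MW of H100s in training now
future_training = cluster_flops(1000) * 2    # ~1 GW, with B200s assumed ~2x FLOP/joule

print(f"current training cluster: {current_training:.2e} FLOP/s")
print(f"~1.5 years out:           {future_training:.2e} FLOP/s")
print(f"ratio:                    {future_training / current_training:.1f}x")
```

Under these assumptions the next step is roughly an order of magnitude more training compute, before counting any further efficiency gains.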
On the other hand, this is potentially a new dimension of scaling, if the RL can do real work and brute-force scaling of base models wouldn’t produce a chatbot good enough to reinvent the necessary RL on its own. There is a whole, potentially very general pipeline from preferences to generative capabilities here. It starts with a preference about the outcome (the ability to verify an answer, to simultaneously judge the aesthetics and meaning of a poem, to see if a pull request really does resolve the issue). Then it proceeds to training a process supervision model that estimates how good individual reasoning steps are as contributions to getting a good outcome, and to optimizing the generative model that proposes those reasoning steps. With a few more years of base model scaling, LLMs are probably going to be good enough at evaluating impactful outcomes that they are incapable of producing directly, so this pipeline gets a lot of capabilities to manufacture. If o1’s methodology is already applicable to this, that removes the uncertainty about whether it could’ve been made to start working without delay relative to the underlying scaling of base models. And the scaling of RL might go a long way before running out of exploitable preference signal about the outcomes of reasoning, independently of how the base models are being scaled up.
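To make the shape of that pipeline concrete, here is a minimal toy sketch: an outcome verifier, a crude process reward estimated by rollouts, and a step proposer whose "optimization" is stood in for by greedy selection over candidates. All names and details are illustrative assumptions on a toy task, not a claim about how o1 actually works:

```python
# Schematic sketch of outcome preference -> process supervision -> step proposer.
# Toy task: sum a list of numbers via running partial sums as "reasoning steps".
import random

# 1. Outcome preference: something that can verify a finished answer.
def outcome_reward(problem, answer):
    """Cheap verifier: did the trace end on the correct sum?"""
    return 1.0 if answer == sum(problem) else 0.0

# 2. Process supervision: score a step by how often completing the trace
#    from that point reaches a good outcome (estimated by crude rollouts).
def process_reward(problem, partial_steps, propose_step, rollouts=16):
    wins = 0
    for _ in range(rollouts):
        steps = list(partial_steps)
        while len(steps) < len(problem):
            steps.append(propose_step(problem, steps))
        wins += outcome_reward(problem, steps[-1])
    return wins / rollouts

# 3. "Policy": proposes the next reasoning step, sometimes making mistakes;
#    real RL would push it toward steps the process reward model scores highly.
def noisy_step(problem, steps):
    prev = steps[-1] if steps else 0
    nxt = prev + problem[len(steps)]
    return nxt if random.random() > 0.2 else nxt + random.choice([-1, 1])

# Greedy stand-in for optimization: at each position, pick the candidate step
# with the highest estimated process reward instead of actually running PPO.
def solve(problem, candidates=4):
    steps = []
    while len(steps) < len(problem):
        proposals = [noisy_step(problem, steps) for _ in range(candidates)]
        best = max(proposals,
                   key=lambda s: process_reward(problem, steps + [s], noisy_step))
        steps.append(best)
    return steps

problem = [3, 7, 2, 5]
trace = solve(problem)
print(trace, "correct:", bool(outcome_reward(problem, trace[-1])))
```

The point of the sketch is only the direction of the data flow: the outcome signal is cheap to check, the process reward is derived from it, and the generator is then improved against the process reward rather than against the outcome directly.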