Re: the davidad/roon conversation about CoT:
The chart in davidad’s tweet answers the question “how does the value-add of CoT on a fixed set of tasks vary with model size?”
In the paper the chart comes from, it made sense to ask this question: the paper did in fact evaluate a range of model sizes on a fixed set of tasks, and the authors were trying to understand how the scaling of CoT value-add interacted with the thing they were actually trying to measure (the scaling of CoT faithfulness).
However, this is not the question you should be asking if you’re trying to understand how valuable CoT is as an interpretability tool for any given (powerful) model, whether it’s a model that exists now or a future one we’re trying to make predictions about.
CoT raises the performance ceiling of an LLM. For any given model, there are problems that it has difficulty solving without CoT, but which it can solve with CoT.
AFAIK this is true for every model we know of that’s powerful enough to benefit from CoT at all, and I don’t know of any evidence that the importance of CoT is now diminishing as models get more powerful.
(Note that with o1, we see OpenAI pursuing CoT more intensively than ever, and producing a model that achieves SOTA results on hard problems by generating longer CoTs than ever previously employed. Under davidad’s view I don’t see how this could possibly make any sense, yet it happened.)
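(To make "value-add for a fixed model" concrete, here is a minimal sketch of how one might measure it: run the same problem set through one model with a step-by-step prompt and with a direct-answer prompt, and compare accuracies. Everything in it, the model name, the toy problem, and the crude grading, is a placeholder of mine, not anything from the paper or the conversation.)

```python
# Minimal sketch (my own, not from the paper or the thread) of measuring the
# CoT "value-add" for a single fixed model: accuracy when prompted to reason
# step by step minus accuracy when prompted to answer directly.
# Assumes the OpenAI Python client; model name, problems, and grading are
# all placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder: any fixed chat model

PROBLEMS = [
    # (question, expected answer); in practice you'd want a benchmark sitting
    # near the model's performance ceiling, where the CoT gap is largest
    ("A train travels 60 miles in 1.5 hours. What is its speed in mph?", "40"),
]

def ask(question: str, use_cot: bool) -> str:
    suffix = (
        "Think step by step, then give the final answer after 'Answer:'."
        if use_cot
        else "Reply with only the final answer, no explanation."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{"role": "user", "content": f"{question}\n{suffix}"}],
    )
    return resp.choices[0].message.content or ""

def is_correct(output: str, expected: str) -> bool:
    # crude grading: look for the expected string after the last "Answer:",
    # or anywhere in the reply if no "Answer:" marker is present
    return expected in output.split("Answer:")[-1]

acc_cot = sum(is_correct(ask(q, True), a) for q, a in PROBLEMS) / len(PROBLEMS)
acc_direct = sum(is_correct(ask(q, False), a) for q, a in PROBLEMS) / len(PROBLEMS)
print(f"CoT value-add on this task set: {acc_cot - acc_direct:+.2f}")
```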
But note that different models have different “performance ceilings.”
The problems on which CoT helps GPT-4 are problems right at the upper end of what GPT-4 can do, and hence GPT-3 probably can’t do them even with CoT. On the flip side, the problems that GPT-3 needs CoT for are probably easy enough for GPT-4 that the latter can do them just fine without CoT. So, even if CoT always helps any given model, if you hold the problem fixed and vary model size, you’ll see a U-shaped curve like the one in the plot.
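A toy numerical version of that claim, under assumptions of my own rather than the paper's data: if P(correct) is a logistic function of model capability relative to task difficulty, and CoT amounts to a fixed capability boost, then on one fixed task the value-add is largest for mid-strength models, and a "how often does CoT make no difference" measure comes out U-shaped.

```python
# Toy illustration (my own assumptions, not the paper's data): treat
# P(correct) as a logistic function of (capability - difficulty), and treat
# CoT as adding a fixed capability boost. For one fixed task, the CoT
# value-add then peaks for mid-strength models, and its complement (how often
# CoT makes no difference) traces the U shape.
import numpy as np

def p_correct(capability, difficulty, cot_boost=0.0):
    return 1.0 / (1.0 + np.exp(-(capability + cot_boost - difficulty)))

capabilities = np.linspace(-6.0, 6.0, 13)  # stand-in for model scale
difficulty = 0.0                           # one fixed task
boost = 3.0                                # assumed capability boost from CoT

value_add = (
    p_correct(capabilities, difficulty, boost)
    - p_correct(capabilities, difficulty)
)

for c, v in zip(capabilities, value_add):
    # weak models fail even with CoT, strong models succeed even without it,
    # so the gap is largest for models whose ceiling sits near the task
    print(f"capability {c:+4.1f}  CoT value-add {v:.2f}  no-difference {1 - v:.2f}")
```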
The fact that CoT raises the performance ceiling matters practically for alignment, because it means that our first encounter with any given powerful capability will probably involve CoT with a weaker model rather than no-CoT with a stronger one.
(Suppose “GPT-n” can do X with CoT, and “GPT-(n+1)” can do X without CoT. Well, surely we’ll build GPT-n before GPT-(n+1), and then we’ll do CoT with the thing we’ve built, and so we’ll observe a model doing X before GPT-(n+1) even exists.)
See also my post here, which (among other things) discusses the result shown in davidad’s chart, drawing conclusions from it that are closer to those which the authors of the paper had in mind when plotting it.