Thanks!

I agree that we’ve learned interesting new things about inference speeds. I don’t think I would have anticipated that at the time.
Re:
It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it’s not clear that the relevant regime will look like “lots of mediocre thinking”.[1]
There are versions of this that I’d still describe as “lots of mediocre thinking”, adding up to being about as useful as higher-quality thinking.
(Cf. above from the post: “the collective’s intelligence will largely come from [e.g.] Individual systems ‘thinking’ for a long time, churning through many more explicit thoughts than a skilled human would need to solve a problem” & “Assuming that much of this happens ‘behind the scenes’, a human interacting with this system might just perceive it as a single super-smart AI.”)
The most relevant question is whether we’ll still get the purported benefits of the lots-of-mediocre-thinking regime if there’s strong inference scaling. I think we probably will.
Paraphrasing my argument in the “Implications” section:
1. If we don’t do much end-to-end training of models thinking a lot, then supervision will be pretty easy. (Even if the models think for a long time, it will all be in English, and each leap-of-logic will be weak compared to what the human supervisors can do.)
2. End-to-end training of models thinking a lot is expensive. So maybe we won’t do it by default, or maybe it will be an acceptable alignment tax to avoid it. (Instead favoring “process-based” methods as the term is used in this post.)
3. Even if we do end-to-end training of models thinking a lot, the model’s “thinking” might still remain pretty interpretable to humans in practice.
4. If models produce good recommendations by thinking a lot in either English or something similar to English, then there ought to be a translation/summary of that argument which humans can understand. Then, even if we’re giving the models end-to-end feedback, we could give them feedback based on whether humans recognize the argument as good, rather than by testing the recommendation and seeing whether it leads to good results in the real world. (This comment discusses this distinction; a toy sketch of the contrast follows below. Confusingly, this is sometimes referred to as “process-based feedback” as opposed to “outcomes-based feedback”, despite it being slightly different from the concept two bullet points up.)
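Here’s a toy sketch of that contrast. Every function name and value is a hypothetical placeholder of mine, not anything from the post or the linked comment; the only point is where the reward signal comes from in each case.

```python
# Toy contrast between the two kinds of end-to-end feedback described above.
# Everything here is a hypothetical placeholder, not a real training setup.

def summarize_for_humans(thoughts: list[str]) -> str:
    # Stand-in for translating/summarizing the model's reasoning into
    # something a human supervisor can actually evaluate.
    return " ".join(thoughts)

def human_rates_argument(summary: str) -> float:
    # Stand-in for a human judging whether the argument looks good,
    # without ever acting on the recommendation.
    return 1.0 if "because" in summary else 0.0

def measure_real_world_result(recommendation: str) -> float:
    # Stand-in for testing the recommendation and scoring what happens.
    return 0.5

def reward(thoughts: list[str], recommendation: str, mode: str) -> float:
    if mode == "process":
        # Feedback targets the (human-legible) argument itself.
        return human_rates_argument(summarize_for_humans(thoughts))
    # mode == "outcomes": feedback targets what the recommendation does in the world.
    return measure_real_world_result(recommendation)

print(reward(["Do X because Y"], "do X", mode="process"))   # 1.0
print(reward(["Do X because Y"], "do X", mode="outcomes"))  # 0.5
```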
I think o3 results might involve enough end-to-end training to mostly contradict the hopes of bullet points 1-2. But I’d guess it doesn’t contradict 3-4.
(Another caveat that I didn’t have in the post is that it’s slightly trickier to supervise mediocre serial thinking than mediocre parallel thinking, because you may not be able to evaluate a random step in the middle without loading up on earlier context. But my guess is that you could train AIs to help you with this without adding too much extra risk.)
I suspect there’s a cleaner way to make this argument that doesn’t talk much about the number of “token-equivalents”, but instead contrasts “total FLOP spent on inference” with some combination of:
“FLOP until human-interpretable information bottleneck”. While models still think in English and don’t know how to do steganography, this should be roughly the FLOP per forward pass. But it could be much larger in the future, e.g. if models get trained to think in non-interpretable ways and just output a paper written in English once per week.
“FLOP until feedback” — how many FLOP does the model spend before it outputs an answer and gets feedback on it?
Models will probably be trained on a mixture of different regimes here. E.g., “FLOP until feedback” might be proportional to model size during pre-training (because the model gets feedback after each token), and additionally proportional to chain-of-thought length during post-training.
So if you want to collapse it to one metric, you’d want to somehow weight by number of data-points and sample efficiency for each type of training.
“FLOP until outcome-based feedback” — same as above, except only counting outcome-based feedback rather than process-based feedback, in the sense discussed in this comment.
Having higher “FLOP until X” (for each of the X in the 3 bullet points) seems to increase danger, while increasing “total FLOP spent on inference” seems to have a much better ratio of increased usefulness to increased danger. (A rough numerical sketch of these quantities follows below.)
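As a rough numerical sketch of how these quantities can come apart: the parameter count, chain-of-thought length, weights, and the ~2N-FLOP-per-token approximation below are all illustrative assumptions of mine, not numbers from the post.

```python
# Back-of-the-envelope comparison of the "FLOP until X" quantities for a
# hypothetical model that still thinks in legible English.

N_PARAMS = 1e11                  # hypothetical parameter count
FLOP_PER_TOKEN = 2 * N_PARAMS    # rough ~2N FLOP per forward pass per token

COT_TOKENS = 10_000              # hypothetical chain-of-thought length per answer

# 1. "FLOP until human-interpretable information bottleneck":
#    while every token is legible English, this is just one forward pass.
flop_until_interpretable = FLOP_PER_TOKEN

# 2. "FLOP until feedback": pre-training gives feedback after each token;
#    post-training on long chains of thought only gives feedback after the
#    whole chain.
flop_until_feedback_pretrain = FLOP_PER_TOKEN
flop_until_feedback_posttrain = FLOP_PER_TOKEN * COT_TOKENS

# Collapsing the mixture of regimes into one number would mean weighting by
# number of data points and sample efficiency; these weights are arbitrary.
pretrain_weight, posttrain_weight = 0.9, 0.1
effective_flop_until_feedback = (
    pretrain_weight * flop_until_feedback_pretrain
    + posttrain_weight * flop_until_feedback_posttrain
)

# Total FLOP spent on inference for one task: the quantity that, on the
# argument above, buys usefulness at a comparatively low cost in danger.
total_inference_flop = FLOP_PER_TOKEN * COT_TOKENS

print(f"FLOP until interpretable bottleneck: {flop_until_interpretable:.1e}")
print(f"FLOP until feedback (pre-training):  {flop_until_feedback_pretrain:.1e}")
print(f"FLOP until feedback (post-training): {flop_until_feedback_posttrain:.1e}")
print(f"Weighted FLOP until feedback:        {effective_flop_until_feedback:.1e}")
print(f"Total inference FLOP per task:       {total_inference_flop:.1e}")
```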
In this framing, I think:
Based on what we saw of o1’s chains of thought, I’d guess it hasn’t changed “FLOP until human-interpretable information bottleneck”, but I’m not sure about that.
It seems plausible that o1/o3 uses RL, and that the models think for much longer before getting feedback. This would increase “FLOP until feedback”.
Not sure what type of feedback they use. I’d guess that the most outcome-based thing they do is “executing code and seeing whether it passes tests”.
It’s possible that “many mediocre or specialized AIs” is, in practice, a bad summary of the regime with strong inference scaling. Maybe people’s associations with “lots of mediocre thinking” will end up being misleading.