I suspect there’s a cleaner way to make this argument that doesn’t talk much about the number of “token-equivalents”, but instead contrasts “total FLOP spent on inference” with some combination of:
- “FLOP until human-interpretable information bottleneck”. While models still think in English and don’t know how to do steganography, this should be FLOP per forward pass. But it could be much larger in the future, e.g. if models get trained to think in non-interpretable ways and just output a paper written in English once per week.
- “FLOP until feedback” — how many FLOP does the model do before it outputs an answer and gets feedback on it?
  - Models will probably be trained on a mixture of different regimes here. E.g. “FLOP until feedback” might be proportional to model size during pre-training (because the model gets feedback after each token) and also proportional to chain-of-thought length during post-training.
  - So if you want to collapse it to one metric, you’d want to somehow weight by the number of data points and the sample efficiency of each type of training (see the sketch after this list for one way this could look).
- “FLOP until outcome-based feedback” — same as above, except only counting outcome-based feedback rather than process-based feedback, in the sense discussed in this comment.
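To make the distinction between these quantities concrete, here is a toy numeric sketch. It is my own illustration rather than anything established above: the parameter count, chain-of-thought length, data-point counts, and sample-efficiency weights are all made-up placeholders, and the only substantive assumption is the standard ~2 FLOP per parameter per token approximation for a forward pass.

```python
# Toy sketch (illustrative only): rough FLOP accounting for a hypothetical model,
# using the standard ~2 FLOP per parameter per token approximation for a forward pass.
# All concrete numbers below are made up.

N_PARAMS = 100e9                         # hypothetical 100B-parameter model
FLOP_PER_FORWARD_TOKEN = 2 * N_PARAMS    # ~2 FLOP per parameter per token

# "FLOP until human-interpretable information bottleneck": while the model thinks
# in English, each forward pass ends in a token we can read, so this is one forward pass.
flop_until_bottleneck = FLOP_PER_FORWARD_TOKEN

# "Total FLOP spent on inference" for one answer with a long chain of thought.
COT_TOKENS = 10_000
total_inference_flop = FLOP_PER_FORWARD_TOKEN * COT_TOKENS

# "FLOP until feedback" differs by training regime:
#  - pre-training: feedback (the next-token loss) arrives after every token
#  - outcome-based post-training: feedback arrives only after the whole chain of thought
flop_until_feedback_pretrain = FLOP_PER_FORWARD_TOKEN
flop_until_feedback_posttrain = FLOP_PER_FORWARD_TOKEN * COT_TOKENS

# Collapsing this to one number (as suggested above) needs some weighting by how many
# data points each regime contributes and how sample-efficient it is. The counts and
# efficiencies here are arbitrary placeholders, just to show the shape of the calculation.
regimes = [
    # (flop_until_feedback, num_datapoints, relative_sample_efficiency)
    (flop_until_feedback_pretrain,  10e12, 1.0),    # pre-training tokens
    (flop_until_feedback_posttrain, 1e6,   100.0),  # post-training episodes
]
weights = [n * eff for _, n, eff in regimes]
weighted_flop_until_feedback = (
    sum(f * w for (f, _, _), w in zip(regimes, weights)) / sum(weights)
)

print(f"{flop_until_bottleneck=:.2e}")
print(f"{total_inference_flop=:.2e}")
print(f"{weighted_flop_until_feedback=:.2e}")
```

With these made-up numbers, “total FLOP spent on inference” can grow by orders of magnitude (longer chains of thought) while “FLOP until human-interpretable information bottleneck” stays fixed at one forward pass, which is the contrast the list above is pointing at.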
Having higher “FLOP until X” (for each of the X in the 3 bullet points above) seems to increase danger, while increasing “total FLOP spent on inference” seems to have a much better ratio of increased usefulness to increased danger.
In this framing, I think:
- Based on what we saw of o1’s chains of thought, I’d guess that o1 hasn’t changed “FLOP until human-interpretable information bottleneck”, but I’m not sure about that.
- It seems plausible that o1/o3 use RL, and that the models think for much longer before getting feedback. This would increase “FLOP until feedback”.
- I’m not sure what type of feedback they use. I’d guess that the most outcome-based thing they do is “executing code and seeing whether it passes tests”.