This tweet summarizes a new paper about using RL and long CoT to get a smallish model to think more cleverly. https://x.com/rohanpaul_ai/status/1885359768564621767
It suggests that this is a less compute-wasteful way to get inference-time scaling.
The thing is, I see no reason you couldn’t just throw tons of compute and a large model at this, and expect stronger results.
The fact that RL seems to be working well on LLMs now, without special tricks, as reported by many replications of r1, suggests to me that AGI is indeed not far off. Not sure yet how to adjust my expectations.
Still, at least as long as base-model effective training compute isn't scaled by another 1,000x (which puts us at 2028-2029), this kind of RL training probably won't generalize far enough without neural (LLM) rewards, and those currently don't let RL scale as far as explicitly coded verifiers do.
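To make the verifier-vs-neural-reward distinction concrete, here is a minimal sketch (not from the paper or tweet, all names hypothetical) of the two kinds of reward signal an RL loop might use: an explicitly coded check against ground truth versus a score from an LLM judge.

```python
# Illustrative sketch only: contrasting an explicitly coded verifier reward
# with a neural (LLM-judge) reward. Function names are hypothetical.

import re

def coded_verifier_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward: extract the final answer and compare exactly.
    Cheap and hard to game, but only works where answers can be checked
    programmatically (math answers, unit tests, etc.)."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def llm_judge_reward(response: str, prompt: str, judge) -> float:
    """Neural reward: ask another model to grade the response.
    Covers open-ended tasks, but is noisier and can be reward-hacked
    when optimized against hard. `judge` is any callable returning
    a score in [0, 1]."""
    grading_prompt = (
        f"Task: {prompt}\nResponse: {response}\n"
        "Score the response from 0 to 1."
    )
    return float(judge(grading_prompt))

# The coded verifier needs no model at all:
print(coded_verifier_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```

The point of the contrast: the coded verifier gives a clean, scalable signal on narrow domains, while the LLM judge is what you'd need for broader generalization, and it's the weaker link for now.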