This is mostly true for current architectures; however, if the CoT/search process finds a much better architecture, then the system suddenly becomes more capable.
One of the first questions I asked o1 was whether there is a “third source of independent scaling” (alongside training compute and inference compute), and among its best suggestions was model search.
That is to say, if in the GPT-3 era we had a scaling law that looked like:
Performance = log(training compute)
And in the o1 era we have a scaling law that looks like:
Performance = log(training compute) + log(inference compute)
then there may indeed be a GPT-evo era in which:
Performance = log(model search) + log(training compute) + log(inference compute)
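A minimal numerical sketch of this hypothesized additive-log law (the specific compute values are arbitrary illustrations, and the function name is my own):

```python
import math

def performance(model_search, training_compute, inference_compute):
    """Hypothesized additive-log scaling law: each independent source
    of scaling contributes its own logarithmic term."""
    return (math.log(model_search)
            + math.log(training_compute)
            + math.log(inference_compute))

# The signature property of an additive-log law: doubling any one
# source adds the same constant (log 2) to performance, no matter how
# much of the other two sources you already have.
base = performance(1e3, 1e6, 1e4)
double_search = performance(2e3, 1e6, 1e4)
gain = double_search - base  # ≈ log(2) ≈ 0.693
```

The additivity is what makes the three sources "independent": progress on any one axis helps even when the others are held fixed.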
I don’t feel strongly about whether or not this is the case. It seems equally plausible to me that Transformers are asymptotically “as good as it gets” at converting compute into performance, and that further model improvements provide only a constant-factor improvement.
I’ve read that OpenAI and DeepMind are hiring for multi-agent reasoning teams. I can imagine that providing another source of scaling.
I figure things like Amdahl’s law and communication overhead impose some limits there, but MCTS could probably find useful ways to divide the reasoning work and have the agents communicate at least as efficiently as humans do.
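The Amdahl's-law limit mentioned above can be made concrete with a short sketch (the function name and the fractions chosen are illustrative assumptions, not measurements of any real system):

```python
def amdahl_speedup(parallel_fraction, n_agents):
    """Amdahl's law: if only a fraction p of the reasoning work can be
    divided among n agents, the remaining serial portion (1 - p) caps
    the achievable speedup at 1 / (1 - p), no matter how many agents
    are added."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_agents)

# Even if 90% of the reasoning parallelizes perfectly, unlimited
# agents give at most a 10x speedup; communication overhead would
# push the real ceiling lower still.
cap = 1.0 / (1.0 - 0.9)            # theoretical ceiling: 10x
with_1000 = amdahl_speedup(0.9, 1000)
```

This is why dividing the work well matters more than adding agents: the serial fraction, not the agent count, dominates the limit.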
Appropriate scaffolding and tool use are other potential levers.