This is mostly true for current architectures; however, if the CoT/search finds a much better architecture, the system suddenly becomes more capable. To make the most of the potential protective effect, we could go further and build very efficient custom hardware for GPT-type systems while keeping the hardware for potential new architectures slower and more general-purpose. That way a new architecture would face a bigger barrier before it could cause havoc. We should especially scale existing systems as far as possible for defense, e.g. finding software vulnerabilities. However, as others say, there are probably some insights/model capabilities that are only possible with a much larger GPT or a different architecture altogether. Inference scaling can’t fully protect against that.
One of the first questions I asked o1 was whether there is a “third source of independent scaling” (alongside training compute and inference compute), and among its best suggestions was model search.
That is to say, if in the GPT-3 era we had a scaling law that looked like:
Performance = log(training compute)
and in the o1 era we have a scaling law that looks like:
Performance = log(training compute) + log(inference compute)
then there may indeed be a GPT-evo era in which:
Performance = log(model search) + log(training compute) + log(inference compute)
I don’t feel strongly about whether or not this is the case. It seems equally plausible to me that Transformers are asymptotically “as good as it gets” when it comes to converting compute into performance, and that further model improvements provide only a constant-factor improvement. (For concreteness, the additive version is sketched with toy numbers below.)
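To make the hypothesized additive-log law concrete, here is a minimal Python sketch. Everything in it is assumed for illustration: the unit coefficients, the compute figures, and the idea that “model search” can be given a budget at all; none of this is a measured scaling law.

```python
import math

def performance(train_c, infer_c, search_c, a=1.0, b=1.0, c=1.0):
    """Hypothetical additive-log scaling law from the comment above:
    Performance = a*log(training compute) + b*log(inference compute) + c*log(model search).
    Coefficients and inputs are illustrative, not fitted to anything."""
    return a * math.log10(train_c) + b * math.log10(infer_c) + c * math.log10(search_c)

# With unit coefficients, 10x more of any single ingredient adds the same
# fixed increment (+1), so scaling all three ingredients stacks three increments.
print(performance(1e25, 1e10, 1e3))   # baseline
print(performance(1e25, 1e10, 1e4))   # 10x more model search: +1
print(performance(1e26, 1e11, 1e4))   # 10x more of everything: +3
```

The point of the additive form is that the three terms are independent knobs; whether the third knob exists at all is exactly the open question here.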
I’ve read that OpenAI and DeepMind are hiring for multi-agent reasoning teams. I can imagine that giving another source of scaling.
I figure things like Amdahl’s law / communication overhead impose some limits there (rough numbers sketched below), but MCTS could probably find useful ways to divide the reasoning work and have the agents communicating at least at human-level efficiency.
Appropriate scaffolding and tool use are other potential levers.
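For a rough sense of the Amdahl’s-law ceiling mentioned above, here is a minimal sketch. The 80% parallelisable fraction is an arbitrary assumption, not an estimate for any real multi-agent setup.

```python
def amdahl_speedup(parallel_fraction, n_agents):
    """Amdahl's law: overall speedup from splitting work across n_agents when
    only `parallel_fraction` of the reasoning can actually run in parallel
    (the rest is serial, e.g. integrating the agents' partial results)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_agents)

# Even with 1000 agents, a 20% serial/communication share caps the speedup below 5x.
for n in (2, 8, 64, 1000):
    print(n, round(amdahl_speedup(0.8, n), 2))
```

So the binding question is less “how many agents” and more “how small can the serial, communication-bound fraction be made”.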
I think you’re getting this exactly wrong (and this invalidates most of the OP). If you find a model that has a constant factor of 100 in the asymptotics, that’s a huge deal when everything else has log scaling. That would already represent discontinuous progress and could potentially put you at ASI right away (toy comparison below).
Basically, the current scaling laws, if they keep holding, are a lower bound on expected progress; they can’t really give you any information to upper-bound it.
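One way to read “a constant factor of 100” is as a 100x compute-efficiency multiplier sitting inside the baseline log law. Under that (assumed) reading, the toy comparison looks like this:

```python
import math

def perf(compute, efficiency=1.0):
    # Baseline log scaling law from the thread, with a hypothetical
    # architecture-dependent compute-efficiency multiplier.
    return math.log10(efficiency * compute)

C = 1e25
print(perf(C))                  # baseline architecture
print(perf(100 * C))            # two more orders of magnitude of compute: +2
print(perf(C, efficiency=100))  # the same +2 from an architectural constant factor,
                                # realised the moment the new architecture is found
```

The jump is identical in size to two more orders of magnitude of compute, but it arrives all at once rather than on a hardware timeline, which is the sense in which observed scaling curves lower-bound progress without upper-bounding it.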