Questions for people who know more:
Am I understanding right that inference-time compute scaling is useful for coding, math, and other things that are machine-checkable, but not for writing, basic science, and other things that aren’t machine-checkable? Will it ever have implications for these things?
Am I understanding right that this is all just clever ways of having the model come up with many different answers or subanswers or preanswers, then picking the good ones to expand upon? Why should this be good for eg proving difficult math theorems, where many humans using many different approaches have already failed? There it doesn’t seem like success could be as simple as trying a hundred times, or even trying a hundred different strategies.
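If I'm understanding the sample-and-select idea right, the simplest version is "best-of-N": generate many candidate answers, score each with some machine-checkable verifier, keep the highest-scoring one. Here's a toy sketch, where a random-number generator stands in for the model and a distance-to-a-hidden-target score stands in for the verifier (in real systems the verifier would be something like a test suite or a proof checker; everything here is a made-up illustration, not any lab's actual method):

```python
import random

def generate_candidates(prompt, n, rng):
    """Toy stand-in for an LLM: each 'candidate answer' is a random
    integer guess. A real system would sample n completions from a model."""
    return [rng.randint(0, 100) for _ in range(n)]

def verifier(prompt, answer):
    """Toy machine-checkable scorer: closeness to a hidden target (42).
    For code this would be running tests; for math, checking a proof."""
    return -abs(answer - 42)

def best_of_n(prompt, n, seed=0):
    """Sample n candidates and keep the one the verifier likes best."""
    rng = random.Random(seed)
    candidates = generate_candidates(prompt, n, rng)
    return max(candidates, key=lambda a: verifier(prompt, a))

# More samples means more chances that one candidate scores well --
# one sense in which "spend more inference compute" buys better answers.
print(best_of_n("find x", n=1), best_of_n("find x", n=100))
```

The obvious catch, and part of what the question above is gesturing at, is that this only helps when the verifier is cheap and reliable, and when a good answer has non-negligible probability of showing up among N samples at all, which is exactly what's unclear for genuinely hard theorems.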
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting? Doesn’t “scaling inference compute” just mean “spending more money and waiting longer on each prompt”? Why do we expect this to scale? Does inference compute scaling mean that o3 will use ten supercomputers for one hour per prompt, o4 will use a hundred supercomputers for ten hours per prompt, and o5 will use a thousand supercomputers for a hundred hours per prompt? Since they already have all the supercomputers (for training scaling) why does it take time and progress to get to the higher inference-compute levels? What is o3 doing that you couldn’t do by running o1 on more computers for longer?
Does this imply that fewer safety people should quit leading labs to protest poor safety policies?