My model is that the quality of the reasoning can actually be divided into two dimensions, the quality of intuition (what the “first guess” is), and the quality of search (how much better you can make it by thinking more).
Another way of thinking about this distinction is as the difference between how good each reasoning step is (intuition), compared to how good the process is for aggregating steps into a whole that solves a certain task (search).
It seems to me that current models are strong enough to learn good intuition about all kinds of things, given enough high-quality training data, and that if you have good enough search, you can use it as an amplification mechanism (on tasks where verification is available) to improve through self-play.
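To make the amplification claim concrete, here is a minimal sketch of the kind of loop I have in mind. Everything in it (`intuition_model`, `verify`, `finetune`, the budgets) is a hypothetical placeholder, not a description of any existing system:

```python
import random

# Hypothetical stand-ins: in a real system these would be a trained model,
# a search procedure over its outputs, and a formal verifier / test harness.
def intuition_model(problem):
    return f"candidate solution to {problem} #{random.randint(0, 999)}"

def verify(problem, candidate):
    # Verification signal (e.g. a proof checker or unit tests);
    # stubbed as a coin flip here just so the sketch runs.
    return random.random() < 0.05

def finetune(model, verified_examples):
    # Stub: in practice, retrain the intuition model on the verified solutions.
    return model

def amplification_loop(problems, model, search_budget=16, iterations=3):
    for _ in range(iterations):
        verified = []
        for problem in problems:
            # "Search" here is just repeated sampling; better search would
            # explore more cleverly than drawing independent guesses.
            candidates = [model(problem) for _ in range(search_budget)]
            verified += [(problem, c) for c in candidates if verify(problem, c)]
        # Search plus verification yields solutions better than the model's
        # unaided first guess; retraining on them closes the self-play loop.
        model = finetune(model, verified)
    return model

if __name__ == "__main__":
    amplification_loop(["problem 1", "problem 2"], intuition_model)
```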
If this is right, then failure to solve the IMO probably means that no good search algorithm (analogous to AlphaZero’s MCTS-UCT, maybe including its own intuition model) has yet been found that is capable of amplifying the intuitions useful for reasoning.
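For reference, the selection rule at the heart of AlphaZero’s search scores each candidate step by combining the value estimates accumulated by search with the intuition model’s prior. A sketch of that standard PUCT rule (the data layout is my own assumption, not anyone’s actual implementation):

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child maximizing Q + U, as in AlphaZero's PUCT rule.

    Each child is a dict with:
      "prior"       - the intuition model's probability for this step,
      "visit_count" - how often search has explored it,
      "total_value" - sum of values backed up through it.
    """
    total_visits = sum(c["visit_count"] for c in children)

    def score(c):
        # Q: average value found so far (exploitation).
        q = c["total_value"] / c["visit_count"] if c["visit_count"] else 0.0
        # U: exploration bonus, large for steps the intuition likes but
        # search has not yet visited much.
        u = c_puct * c["prior"] * math.sqrt(total_visits) / (1 + c["visit_count"])
        return q + u

    return max(children, key=score)
```

The point is that the intuition model’s prior guides where the search looks, while the values the search discovers feed back to correct the intuition.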
So far all problem-solving AIs seem to use linear or depth-first search: you sample one token at a time (one reasoning step), chain them up depth-first (generating a full text/proof sketch), check whether the result solves the full problem, and if it doesn’t, try again from scratch, throwing all the partial work away. No search heuristic is used, there is no attempt to solve smaller subproblems first, and so on. So it can certainly get a lot better than that (which is why I’m making the bet).
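Caricatured as code, that strategy is little more than rejection sampling; the functions below are hypothetical stubs, but the shape of the loop is the point:

```python
def solve_by_resampling(problem, model, check, max_attempts=100):
    # Depth-first in the trivial sense: commit to each sampled step, run to a
    # complete attempt, then evaluate only the finished whole.
    for _ in range(max_attempts):
        attempt = model(problem)          # one full proof sketch, token by token
        if check(problem, attempt):
            return attempt
        # On failure, every partial insight in `attempt` is discarded;
        # no backtracking to a promising prefix, no subgoal decomposition.
    return None
```

Compare this with the AlphaZero-style rule above, where statistics from earlier simulations steer where to look next instead of being thrown away.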