To me, this was to be expected as a straightforward application of AlphaZero-like self-play amplification and destillation. The missing piece was the analogous policy network, which was a convolutional neural network for the AlphaZero board games. Once it became quite clear that existing LLMs were capable of being smart enough to generate good heuristics to this (with enough data), it seemed quite obvious to me that self-play guided by an LLM-policy heuristic would work.
Early 2023 I bet $500 on AI winning the IMO gold medal by 2026. This was a 1:1 bet against Michael Vassar, meaning I attributed >50% to this. It now seems very likely that I’m going to win.
To me, this was to be expected as a straightforward application of AlphaZero-like self-play amplification and destillation. The missing piece was the analogous policy network, which was a convolutional neural network for the AlphaZero board games. Once it became quite clear that existing LLMs were capable of being smart enough to generate good heuristics to this (with enough data), it seemed quite obvious to me that self-play guided by an LLM-policy heuristic would work.