Maybe next year we'll reach the critical point where spending heavily on inference, making many attempts at each necessary step, becomes effective.
That raises an excellent point that hasn’t been otherwise brought up—it’s clear that there are at least some cases already where you can get much better performance by doing best-of-n with large n. I’m thinking especially of Ryan Greenblatt’s approach to ARC-AGI, where that was pretty successful (n = 8000). And as Ryan points out, that’s the approach that AlphaCode uses as well (n = some enormous number). That seems like plausibly the best use of a lot of money with current LLMs.
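The best-of-n idea above can be sketched in a few lines: sample n candidate solutions independently, score each one, and keep the best. This is a minimal illustration, not Ryan Greenblatt's or AlphaCode's actual pipeline; the `generate` and `score` callables here are hypothetical stand-ins for "query the model" and "check the candidate against the task".

```python
import random

def best_of_n(generate, score, n, seed=0):
    """Sample n candidates and return the (candidate, score) pair with the highest score."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate(rng)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

# Toy stand-ins: "generate" draws a random integer, "score" rewards closeness to a target.
candidate_gen = lambda rng: rng.randint(0, 100)
closeness = lambda x: -abs(x - 42)

# With n as large as 8000, the sampler is all but guaranteed to hit the target exactly,
# even though any single draw succeeds only ~1% of the time.
result = best_of_n(candidate_gen, closeness, n=8000)
```

The toy numbers make the economics concrete: a per-sample success rate of about 1% becomes a near-certainty at n = 8000, which is why large-n sampling can be the best use of a big inference budget when a cheap verifier (here, `closeness`) exists.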