Grade-school math, where each problem has a single well-defined answer, seems like an environment in which a Q-learning-like approach might be pretty feasible: you estimate whether a step is valuable from whether it helps lead to the right answer. (The biggest confounder would be cases where you make two mistakes that cancel out and still reach the right answer.) Given value estimates like that, a path-finding algorithm along the lines of A* could then search for the shortest route to the correct answer. The net result would be a system that, at large inference-time cost, could ace grade-school math problems, and in doing so might well produce really valuable training data for training a less inference-expensive system.
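To make the path-finding half of this concrete, here is a minimal sketch of A* over a toy "problem" in which each step transforms a number and the goal is to reach a target value in as few steps as possible. Everything named here (`STEPS`, `default_heuristic`, `a_star`, `apply_steps`) is hypothetical and chosen for illustration. In particular, `default_heuristic` is a hand-written stand-in for the learned step-value estimate described above, not a learned model, and A* only keeps its shortest-route guarantee if whatever estimate is plugged in never overestimates the remaining number of steps.

```python
import heapq
from typing import Callable, Dict, List, Optional

# Hypothetical toy "reasoning steps": each one transforms the current value.
STEPS: Dict[str, Callable[[int], int]] = {
    "+1": lambda x: x + 1,
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def default_heuristic(value: int, target: int) -> int:
    """Stand-in for a learned cost-to-go estimate (an assumption, not a trained model).

    It says "at least one more step is needed unless we are already done",
    which never overestimates, so A* still returns a shortest route.
    """
    return 0 if value == target else 1


def a_star(
    start: int,
    target: int,
    heuristic: Callable[[int, int], int] = default_heuristic,
) -> Optional[List[str]]:
    """Return a shortest list of step names from start to target, or None.

    Assumes start is a positive integer, so every step strictly increases it.
    """
    # Frontier entries: (estimated total cost, steps so far, value, step names).
    frontier = [(heuristic(start, target), 0, start, [])]
    best_cost = {start: 0}
    while frontier:
        _, cost, value, path = heapq.heappop(frontier)
        if value == target:
            return path
        if cost > best_cost.get(value, float("inf")):
            continue  # stale entry: a cheaper route to this value was found
        for name, step in STEPS.items():
            nxt = step(value)
            if nxt > target:
                continue  # steps only increase the value, so overshooting is a dead end
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt, target), new_cost, nxt, path + [name]),
                )
    return None


def apply_steps(start: int, path: List[str]) -> int:
    """Replay a path of step names to check where it ends up."""
    value = start
    for name in path:
        value = STEPS[name](value)
    return value
```

Calling `a_star(2, 11)` returns a three-step route (`*2`, `*2`, `+3` is one such route), and `apply_steps(2, path)` confirms that it lands on 11. In the setting above, the successful routes found this way are what would be collected as training data for a cheaper system.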