One plausible possibility is something based on or developed from the “Q* Search” algorithm introduced in the paper “A* Search Without Expansions: Learning Heuristic Functions with Deep Q-Networks”, which is indeed a combination of the A* path-finding algorithm and Q-learning. It seems applicable to an environment like grade-school math, where answers are definitively correct or incorrect and so give Q-learning a clear signal to work back from.
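To make the idea concrete, here is a rough Python sketch of the search half of Q* Search as I understand it from the paper: the frontier holds (state, action) pairs ranked by path cost so far plus a Q-value, and a child state is only generated when its pair is popped. Everything specific here is my own assumption for illustration: the number-line toy problem, the names `ACTIONS`, `toy_q_values` and `q_star_search`, and above all the hand-written Q-function, which stands in for the trained deep Q-network the paper uses.

```python
import heapq
import math

ACTIONS = (1, -1, 5)  # toy moves on the integer number line, each costing 1


def toy_q_values(state, goal):
    """Hypothetical stand-in for a trained deep Q-network.

    Returns, for every action, an estimate of (step cost + remaining cost
    to the goal). A real DQN would predict all of these from `state` in a
    single forward pass; this toy version computes them arithmetically.
    """
    return [1 + math.ceil(abs(goal - (state + a)) / 5) for a in ACTIONS]


def q_star_search(start, goal, q_values=toy_q_values):
    """Best-first search over (state, action) pairs, ranked by g + Q(s, a)."""
    if start == goal:
        return []
    # Heap entries: (priority, tie-breaker, g of state, state, action index, path)
    frontier = []
    counter = 0
    best_g = {start: 0}
    for i, q in enumerate(q_values(start, goal)):
        heapq.heappush(frontier, (q, counter, 0, start, i, []))
        counter += 1
    while frontier:
        _, _, g, state, i, path = heapq.heappop(frontier)
        # The child state is generated only now, when its pair is popped.
        child = state + ACTIONS[i]
        child_g = g + 1
        child_path = path + [ACTIONS[i]]
        if child == goal:
            return child_path
        if child_g >= best_g.get(child, math.inf):
            continue  # this state was already reached at least as cheaply
        best_g[child] = child_g
        # One Q-function call scores every action available from the child.
        for j, q in enumerate(q_values(child, goal)):
            heapq.heappush(
                frontier, (child_g + q, counter, child_g, child, j, child_path)
            )
            counter += 1
    return None


if __name__ == "__main__":
    print(q_star_search(0, 7))
```

The Q-learning half, where the network is trained by working back from known-correct outcomes, is not shown. In a setting like grade-school math the states and actions would presumably be partial solutions and reasoning steps rather than integers, but that mapping is speculation on my part.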