Most LLMs’ replies can be improved by repeatedly asking “Improve the answer above”; the process resembles the test-time compute idea and diffusion.
In most cases, I can get better answers from LLMs just by asking “Improve the answer above.”
In my experience, the improvements are noticeable for around five cycles, but after that the result either stops improving or gets stuck in some error mode and can’t jump to a new level of thinking. My typical test subject: “draw a world map as text art.” In good improvement sessions with Sonnet, it eventually adds a grid and places the continents correctly.
One person on Twitter (I lost the link; possibly @goodside) automated this process: he first asked Claude to write the code for automated prompting, then ran about 100 improvement cycles overnight, burning through many credits, and got much better code for a game. I repeated this experiment with my own tasks.
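A minimal sketch of that kind of automation, assuming the Anthropic Python SDK (the model id and cycle count are placeholders, not what the original author used); it just replays the same fixed improvement prompt after every reply:

```python
# Minimal sketch: automate "Improve the answer above" for N cycles.
# Assumes the Anthropic Python SDK; the model id below is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TASK = "Draw a world map as text art."
IMPROVE = "Improve the answer above."  # content-independent, reused verbatim
N_CYCLES = 5

messages = [{"role": "user", "content": TASK}]
for _ in range(N_CYCLES):
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=2000,
        messages=messages,
    )
    answer = reply.content[0].text
    # Append the model's answer and the same fixed improvement prompt.
    messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": IMPROVE})

print(answer)  # the last (hopefully best) version
```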
I tried different variants of “improve it,” like adding critiques or generating several answers within one reply. I also tried a meta-level approach, asking the model to improve not only the answer but also the improvement prompt itself.
I started these experiments before the test-time compute idea went mainstream, and this looks like one way of spending test-time compute. The process also resembles diffusion.
The main question here: in which cases does the process quickly get stuck, and in which does it produce unbounded improvements? It seems to get stuck in local minima, and in situations where the model’s intelligence isn’t sufficient to see ways to improve or to tell better versions from worse ones. It also can’t jump to another valley: once it starts improving in one direction, it keeps pushing in that direction and ignores other possibilities. Only manually starting another chat window helps to change valleys.
Iterative improvement also works for images in GPT-4o, but not in Gemini 2.5 Pro; o1 is also bad at improving and progresses very slowly. It seems that test-time improving is at odds with test-time reasoning.
Results for “Improve it”: https://poe.com/s/aqk8BuIoaRZ7eDqgKAN6
Variants of the main prompt: “Criticize the result above and iteratively improve it” https://poe.com/s/A2yFioj6e6IFHz68hdDx
The prompt “Create a prompt X for iterative improvement of the answer above. Apply the generated prompt X.” converges quickly to extraordinary results but overshoots, e.g. producing games instead of drawings. It also engages the model’s thinking: https://poe.com/s/cLoB7gyGXHNtwj0yQfPf
The trick is that the improving prompt should be content-independent and mechanically copy-pasted after each reply.
I have achieved higher quality answers by using the magical words: “give me multiple options, then compare them and choose the best one”.
But next time I will try iterating on the best one, maybe something like “suggest five improvements to the option above, and choose the best one”.
Yes, that’s a great variant of the universal answer-improving prompt, and it can be applied repeatedly to any content. A rough sketch of that combined sequence is below.
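A sketch of that combined “options first, then iterate on the best one” schedule, under the same assumptions as the loop above (Anthropic Python SDK, placeholder model id, hypothetical cycle count):

```python
# Sketch of the "multiple options -> pick the best -> iterate improvements" variant.
# Same assumptions as before: Anthropic Python SDK, placeholder model id.
import anthropic

client = anthropic.Anthropic()

TASK = "Draw a world map as text art."
FIRST = "Give me multiple options, then compare them and choose the best one."
LATER = "Suggest five improvements to the option above, and choose the best one."
N_CYCLES = 5

messages = [{"role": "user", "content": TASK + "\n\n" + FIRST}]
answer = ""
for _ in range(N_CYCLES):
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=2000,
        messages=messages,
    )
    answer = reply.content[0].text
    messages.append({"role": "assistant", "content": answer})
    # The follow-up is content-independent and pasted verbatim after each reply.
    messages.append({"role": "user", "content": LATER})

print(answer)  # last version after iterating on the chosen option
```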