I work at a startup that would claim otherwise.
For example, the construct of “have the LLM write Python code, then have a Python interpreter execute it” is a powerful one (as Ryan Greenblatt has also shown), and it will predictably get better as LLMs scale: they will write Python with a lower density of bugs per N lines of code, and they will get better at debugging it.
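To make that construct concrete, here is a minimal sketch of the write-then-execute loop, with error output fed back for a debugging pass. It is an illustration under stated assumptions, not a description of any particular product: `write_code` is a hypothetical stand-in for whatever model call you use, `fake_llm` is a toy substitute so the example runs offline, and a real system would need proper sandboxing rather than a bare subprocess.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def run_python(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run `code` in a fresh interpreter; return (succeeded, stdout or stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} seconds"
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def solve_with_code(
    task: str,
    write_code: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for code, execute it, and pass any error back for another try."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = write_code(task, feedback)
        ok, output = run_python(code)
        if ok:
            return output.strip()
        feedback = output  # the traceback becomes the debugging prompt
    return None  # gave up after max_attempts


def fake_llm(task: str, feedback: Optional[str]) -> str:
    """Toy stand-in for a model: the first draft has a bug, the retry fixes it."""
    if feedback is None:
        return "print(sum(range(1, 101)) / 0)"  # deliberately buggy first draft
    return "print(sum(range(1, 101)))"


if __name__ == "__main__":
    print(solve_with_code("Sum the integers from 1 to 100.", fake_llm))
```

Both scaling effects show up directly in this loop: fewer bugs per N lines means fewer trips around it, and better debugging means each retry is more likely to succeed.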
Can you name the startup? I’d be very interested to see what level of success it’s achieved.
(Or, if you can’t name the startup, I’d love to hear more about what’s been achieved, e.g. what are the largest and longest-horizon tasks that your scaffolded systems can reliably accomplish?)
You.com: turn on our “genius mode” (free users get a few uses per day) and try asking it to do a moderately complex calculation or, better still, to figure out how to do one and then do it.
Generally we find the combination of RAG web search and some agentic tool use makes an LLM appreciably more capable. (Similarly, OpenAI are also doing the tool use, and Google the web-scale RAG.)
We’re sticking to fairly short-horizon tasks, generally fewer than a dozen steps.