I think that too much scaffolding can obfuscate a lack of general capability, since it allows the system to simulate a much more capable agent—under narrow circumstances and assuming nothing unexpected happens.
Consider the Egyptian Army in ’73. With exhaustive drill and scripting of unit movements, they were able to simulate the capabilities of an army with a competent officer corps, up until they ran out of script, at which point they reverted to a lower level of capability. Scripting works because it spares officers on the ground from having to make complex tactical decisions on the fly and communicate them to other units, all while maintaining a cohesive battle plan. If everyone sticks to the script, big holes won’t open up in the defenses, and the movements of each unit will be covered by those of the others. When the script ran out (I’m massively simplifying), the cohesion of the army began to break down, rendering it increasingly vulnerable to IDF counterattacks. The gains in combat effectiveness were real, but limited to the confines of the script.
Similarly, scaffolding helps the AI avoid the really hard parts of a job, or at least the parts that are really hard for it. Designing the script for each individual task and subtask in order to make a 90%-reliable AI economically valuable turns a productivity-improving tool into an economically productive agent, but only within certain parameters, and each time you encounter a new task, more scaffolding will need to be built. I think some of the time the harder (in the human-intuitive sense) parts of the problem may be contained in the scaffolding rather than in the tasks the AI completes.
Thus, given the highly variable nature of LLM intelligence, “X can do Y with enough scaffolding!” doesn’t automatically convince me that X possesses the core capabilities to do Y and just needs a little encouragement or whatever. It may be that task Y is composed of subtasks A and B, such that X is very good and reliable at A but utterly incapable at B (coding and debugging?). By filtering for instances of Y with a certain easy subset of B, using a pipeline to break it down into easier subtasks with various prompts, trying many times, and finally passing off unsolved cases to humans, you can extract a lot of economic value from X doing Y, but only in a certain subset of cases, and still without X being reliably good at doing both A and B.
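To make that concrete, here is a minimal sketch of the kind of pipeline I have in mind. Everything in it (`decompose`, `llm_attempt`, the retry count) is a hypothetical stand-in rather than any real API; the point is just the shape: the scaffolding scripts the decomposition of Y, retries each subtask many times, and hands unsolved cases off to a human.

```python
import random  # stand-in for an actual LLM call

def llm_attempt(subtask: str) -> str | None:
    """Pretend LLM call: succeeds ~90% of the time, else returns None."""
    return subtask.upper() if random.random() < 0.9 else None

def decompose(task: str) -> list[str]:
    """Scaffolding-authored script: split task Y into easier subtasks."""
    return [f"{task}: part {i}" for i in range(3)]

def solve_with_scaffolding(task: str, max_retries: int = 5) -> list[str]:
    """Try each subtask repeatedly; escalate unsolved ones to a human."""
    results = []
    for subtask in decompose(task):
        solution = None
        for _ in range(max_retries):  # "trying many times"
            solution = llm_attempt(subtask)
            if solution is not None:
                break
        if solution is None:  # "passing off unsolved cases to humans"
            solution = input(f"Human needed for: {subtask}\n> ")
        results.append(solution)
    return results
```

Note that all of the judgment here, how to decompose Y, how many retries, when to give up and call a human, lives in the scaffolding rather than in X.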
You could probably do something similar with low-capability human programmers playing the role of X, but it wouldn’t be economical since they cost much more than an LLM and are in some ways less predictable.
I think a lot of economically valuable intelligence is in the ability to build the scaffolding itself implicitly, which many people would call “agency”.
What if the tasks that your scaffolded LLM is doing are randomly selected pieces of cognitive labor from the full distribution of human cognitive tasks?
It seems to me like your objection is mostly to narrow distributions of tasks and scaffolding which is heavily specialized to that task.
I think narrowness of the task and amount of scaffolding might be correlated in practice, but these attributes don’t have to be related.
(You might think they are correlated because large amounts of scaffolding won’t be very useful for very diverse tasks. I think this is likely false—there exists general-purpose software that I find useful for a very broad range of tasks. E.g. neovim. I agree that smart general agents should be able to build their own scaffolding and bootstrap, but it’s worth noting that the final system might be using a bunch of tools!)
For humans, we can consider eyes to be a type of scaffolding: they help us do various cognitive tasks by adding affordances, but are ultimately just attachments.
Nonetheless, I predict that if I didn’t have eyes, I would be notably less efficient at my job.
Very interesting example, thanks.

> Designing the script for each individual task and subtask in order to make a 90%-reliable AI economically valuable turns a productivity-improving tool into an economically productive agent, but only within certain parameters, and each time you encounter a new task, more scaffolding will need to be built.
Agreed that that wouldn’t be good evidence that those systems could do general reasoning. My intention in this piece is to mainly consider general-purpose scaffolding rather than task-specific.