I do think that right now LMs are by far closest to doing useful work by exploiting human-legible interfaces and decompositions. Chain of thought, simple decompositions, and imitations of human tool use are already important for LM performance. While more complex LM agents currently add only a small amount of additional value, extrapolating current trends suggests they will become quite important soon.
Overall I think the world is shaping up extremely far in the direction of “AI systems learn to imitate human cognitive steps and then compose them into impressive performance.” I’m happy to bet about whether that trend will continue to the extent we can operationalize it. E.g. I’d bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use. I don’t have a strong view about more complex decompositions unless context length is a serious limitation. I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).
To the extent models trained with RLHF are doing anything smart in the real world, I think it’s basically ~100% by solving a human-comprehensible task. Namely, humans give the system a task, and it tries to do some rough combination of what a particular kind of human demonstrator would do and what a particular kind of human evaluator would rate highly. There is no further optimization to take intelligent actions in the world.
Chain of thought, simple decompositions, and imitations of human tool use (along comprehensible interfaces) are already important for LM performance.
I want to separate prompt-engineering from factored cognition. There are various nudges you can use to get LLMs to think in ways that are more productive or better suited to the task at hand, but this seems quite different to me from truly factored cognition, where you spin up a sub-process that solves a sub-problem and then propagate the result back up to a higher-level process (like Auto-GPT). I don’t currently know of any not-extremely-gerrymandered task where doing this actually improves task performance compared to just good prompt engineering. I’ve been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.
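For concreteness, a minimal sketch of the distinction, written against a generic `llm(prompt)` completion call (the function names and prompts here are purely illustrative, not anyone’s actual system): prompt engineering keeps everything in one call, while factored cognition spins up sub-processes with their own contexts and propagates their answers back up.

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in for any chat/completion call; no specific API assumed

def prompt_engineered(llm: LLM, question: str) -> str:
    """Prompt engineering: one call, with nudges that shape how the model thinks."""
    return llm(
        "Think step by step, consider counterarguments, then answer.\n\n"
        f"Question: {question}"
    )

def factored(llm: LLM, question: str, depth: int = 2) -> str:
    """Factored cognition: spawn sub-processes for sub-problems and merge their results."""
    if depth == 0:
        return llm(f"Answer directly and concisely: {question}")
    # Each sub-question is solved in a fresh context, then propagated back up.
    subquestions = [
        s for s in llm(f"List 2-3 sub-questions whose answers would settle: {question}").splitlines()
        if s.strip()
    ]
    subanswers = [factored(llm, s, depth - 1) for s in subquestions]
    return llm(
        f"Question: {question}\n"
        "Sub-answers:\n" + "\n".join(subanswers) + "\n"
        "Combine the sub-answers into a final answer."
    )
```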
Overall I think the world is shaping up extremely far in the direction of “AI systems learn to imitate human cognitive steps and then compose them into impressive performance.”
I’ve seen little evidence of this so far, and don’t think current LLM performance is even that well-characterized by this. This would be great, but I don’t currently think it’s true.
For example, I don’t really understand whether this model is surprised or unsurprised by the extreme breadth of knowledge that modern LLMs have. I don’t see any “imitation of human cognitive steps” in an LLM’s ability to recall things from a far wider range of topics than any human could. It seems instead that its way of acquiring knowledge is very different from humans’, giving rise to a very different capability landscape. This capability does not seem to be built out of “composition of imitations of human cognitive steps”.
Similarly, when I use Codex for programming, I do not see any evidence that Codex is solving programming problems by composing imitations of human cognitive steps. Indeed, it mostly seems to just solve the problems in one shot, vastly faster than I would be able to even type, and in a way that seems completely alien to me as a programmer.
E.g. I’d bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use.
I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over.
I have no strong opinions on tool-use. Seems like the LLMs will use APIs the same way as humans would. I do think if you train more on end-to-end tasks, the code they write to solve sub-problems will become less readable. I have thought less about this and wouldn’t currently take a bet.
I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).
I am not super confident of this, but my best guess is that we will see more end-to-end optimization, and that it will make a big difference in task performance. It also seems like a natural endpoint of something like RLAIF, where you have the AI guide a lot of the training process itself when given a highly complicated objective, and then you do various forms of RL on self-evaluations.
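For concreteness, a minimal sketch of the “RL on self-evaluations” ingredient (illustrative only; `llm` and `self_evaluation_reward` are made-up names, and real pipelines add preference models, KL penalties, etc.):

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in for any completion call

def self_evaluation_reward(llm: LLM, task: str, attempt: str) -> float:
    """The model (or another model) scores an attempt at the task;
    that score is then used as the RL reward signal."""
    verdict = llm(
        f"Task: {task}\nAttempt: {attempt}\n"
        "Rate how well the attempt accomplishes the task from 0 to 10. Number only:"
    )
    try:
        return float(verdict.strip()) / 10.0
    except ValueError:
        return 0.0  # unparsable self-evaluation gets no reward

# A training loop would sample attempts, score them with self_evaluation_reward,
# and update the policy with e.g. PPO on those rewards.
```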
I don’t currently know of any not-extremely-gerry-mandered task where [scaffolding] actually improves task performance compared to just good prompt engineering. I’ve been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.
Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
It does much better than AutoGPT, and the paper’s ablations show that the different parts of Voyager’s scaffolding do matter. This suggests that better scaffolding makes a difference, and I doubt Voyager is the limit.
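For concreteness, a heavily simplified sketch of the kind of control flow such scaffolding implements (not Voyager’s actual code; `env`, `llm`, and the prompts are illustrative stand-ins): the LLM proposes a task from a textual game-state description, writes code against the game API, gets execution feedback, and keeps working code in a skill library for reuse.

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in for a chat/completion call

def voyager_like_loop(llm: LLM, env, skill_library: dict[str, str], steps: int = 10) -> None:
    """Sketch of a Voyager-style scaffold: the LLM never acts directly; the scaffold
    feeds it game state, runs the code it writes, and stores successful code as skills."""
    for _ in range(steps):
        state = env.describe()  # textual description of the game state (hypothetical wrapper)
        task = llm(f"Given the state below, propose the next task.\nState: {state}")
        code = llm(
            "Write a program for this task using the game API.\n"
            f"Task: {task}\nState: {state}\n"
            f"Reusable skills you may call: {list(skill_library)}"
        )
        result = env.execute(code)  # run against the game API, capture errors/feedback
        verdict = llm(f"Task: {task}\nExecution result: {result}\nDid it succeed? Answer yes/no and critique.")
        if verdict.strip().lower().startswith("yes"):
            skill_library[task] = code  # keep successful code for later reuse
        # on failure, the next iteration retries with the feedback available in the state/history
```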
I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive. I think it’s plausible this gap will eventually mostly close at some capability threshold, especially for many of the most potentially-transformative capabilities (e.g. having insights that draw on a large basis of information not memorised in a base model’s weights, since this seems hard to decompose into smaller tasks), but it seems quite plausible the gap will be non-trivial for a while.
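A rough back-of-the-envelope illustration of that asymmetry, using the standard approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token for inference (the numbers below are illustrative, not a claim about any particular model):

```python
# Rough illustration of the training/inference compute asymmetry.
# Standard approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per token.
N = 70e9                              # parameters (illustrative)
D = 2e12                              # training tokens (illustrative)
train_flops = 6 * N * D               # ~8.4e23 FLOPs to train

tokens_per_task = 100_000             # a very inference-heavy scaffolded attempt at one task
task_flops = 2 * N * tokens_per_task  # ~1.4e16 FLOPs for that attempt

print(f"{train_flops / task_flops:.0e}")  # ~6e+07: one task's inference is tiny next to training
```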
Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
That’s a good example, thank you! I now remember looking at this a few weeks ago and thinking of it as an interesting example of scaffolding. Thanks for reminding me.
I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive.
I do wonder how much of this is just the result of an access gap. Getting one of these scaffolded systems to work also seems like a lot of hassle and very fiddly, and my best guess is that if OpenAI wanted to solve this problem, they would probably just do a bunch of reinforcement learning, and then maybe a bit of scaffolding, but the scaffolding would be a lot less detailed and not really that important to the overall performance of the system.
Although this is an important discussion I want to emphasize up front that I don’t think it’s closely related to the argument in the OP. I tried to revise the OP to emphasize that the first section of the article is about LM agent improvements that are relevant to engineering better scaffolding rather than improving our ability to optimize such agents end to end.
I’ve seen little evidence of this so far, and don’t think current LLM performance is even that well-characterized by this. This would be great, but I don’t currently think it’s true.
If you allow models to think for a while they do much better than if you just ask them to answer the question. By “think for a while” we mean they generate one sentence after another in the same way a human would. Their ability to use chain of thought seems to come essentially entirely from copying human chains of thought rather than e.g. using filler tokens to parallelize cognition or RL fine-tuning teaching them novel cognitive strategies.
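A minimal illustration of the difference being described, against a generic `llm(prompt)` call (the question and prompts are illustrative, not a claim about any particular eval):

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in for any completion call

QUESTION = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

def snap_judgment(llm: LLM) -> str:
    # No room to think: the answer has to come out immediately.
    return llm(f"{QUESTION}\nAnswer with the number only:")

def chain_of_thought(llm: LLM) -> str:
    # "Think for a while": the model writes human-style reasoning sentences
    # before committing to an answer.
    return llm(f"{QUESTION}\nLet's think step by step, then state the final answer.")
```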
I agree that models also memorize a lot of facts. Almost all the facts they actually use are facts that humans know, which they memorized by observing humans using them or stating them. So I don’t really consider this evidence one way or the other.
If you want to state any concrete prediction about the future I’m happy to say whether I agree with it. For example:
I think that the gap between “spit out an answer” and chain of thought / tool use / decomposition will continue to grow. (Even as chain of thought becomes increasingly unfaithful for questions of any fixed difficulty, since models become increasingly able to answer such questions in a single shot.)
I think there is a significant chance decomposition is a big part of that cluster, say a 50% chance that context-hiding decomposition obviously improves performance by an amount comparable to chain of thought.
I think that end-to-end RL on task performance will continue to result in models that use superficially human-comprehensible reasoning steps, break tasks into human-comprehensible pieces, and use human interfaces for tools.
My sense right now is that this feels a bit semantic.
I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over.
This prediction seems largely falsified as long as transformers remain the dominant architecture, and especially if we deliberately add optimization pressures towards externalized reasoning and against internal, latent reasoning; see e.g. Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? and LLMs Do Not Think Step-by-step In Implicit Reasoning.
I do not understand your comment at all. Why would it be falsified? Transformers are completely capable of steganography if you apply pressure towards it, which we will (and have done).
In DeepSeek we can already see weird things happening in the chain of thought. I will happily take bets that we will see a lot more of that.
I’m pointing out that transformers seem really bad at internal multi-hop reasoning: currently they can’t even do 2-hop robustly, 3-hop robustly seems out of the question right now, and scaling doesn’t seem to help much either (see e.g. Figures 2 and 3 in Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?, and also how much more robust and scalable CoT reasoning is). So ‘chain-of-thought but internalized to the model will take over’ seems very unlikely with transformers, and even more so if basic mitigations like unlearning (e.g. of memorized facts about deceptive learning, as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights, to mitigate the possibility of latent scheming) were applied.
Steganography is a separate threat model, but even there I’d interpret current evidence (e.g. Preventing Language Models From Hiding Their Reasoning) as mostly positive (as in, even relatively simple mitigations like paraphrasing seem to go very far).
Transformers are obviously capable of doing complicated internal chains of reasoning. Just try giving them a difficult problem and forcing them to start their answer in the very next token. You will see no interpretable or visible traces of their reasoning, but they will still get it right for almost all questions.
Visible CoT is only necessary for the frontier of difficulty. The rest is easily internalized.
I don’t dispute that transformers can memorize shortcuts. I do dispute their ability to perform latent (opaque) multi-hop reasoning robustly. And I think this should be (very) non-controversial; e.g. Mor Geva has many papers on this topic.
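For concreteness, this dispute is about probes of roughly the following shape (illustrative sketch only; the cited papers’ exact prompts and controls differ): a two-hop fact the model must produce in its very next tokens with no visible reasoning, versus the same question with the intermediate hop externalized.

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in for any completion call

# Latent (opaque) two-hop probe: composing "Stevie Wonder sang 'Superstition'"
# with "Stevie Wonder's mother is Lula Mae Hardaway", with no visible reasoning.
LATENT = 'Complete with a name only: "The mother of the singer of \'Superstition\' is"'

# Externalized version of the same question: the intermediate hop gets written out.
EXTERNALIZED = (
    "Who is the mother of the singer of 'Superstition'? "
    "First name the singer, then name the mother."
)

def run_probe(llm: LLM) -> tuple[str, str]:
    return llm(LATENT), llm(EXTERNALIZED)
```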
What would a valid definition of multi-hop reasoning that we care about look like, if it has to exclude getting mathematical proofs right, answering complicated never-before-seen physics questions, and doing the kinds of things a smaller model needed a CoT for?
The kind of instrumental reasoning required for alignment faking seems relevant, including through n-hop latent reasoning; see e.g. section ‘B.1.3 HIDDEN SCHEMING REASONING’ of Towards evaluations-based safety cases for AI scheming. I wouldn’t be too surprised if models could currently bypass this through shortcuts, but a mix of careful data filtering and unlearning of memorized facts about deceptive learning (as suggested in https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm#Information_you_should_probably_remove_from_the_weights) could force them to externalize their reasoning if they were to try to alignment-fake, and externalized reasoning should be much easier to monitor than latent reasoning. Steganography would be another threat model here, though, as discussed e.g. in section ‘B.1.2 OBFUSCATED SCHEMING REASONING’ of the same paper.
Some evidence in favor of the predicted widening gaps between snap judgments, chain of thought, and tool use: https://x.com/YangjunR/status/1793681241398788319 (for an increasingly wide gap between an LM’s single forward pass (‘snap judgment’) and CoT), and https://xwang.dev/mint-bench/ (for tool use being increasingly useful, with both model scale and number of tool-use turns).