The key point I want to make with the idea that text primarily functions as reports on processes is that this is what we’re observing with a language model’s outputs. One way to think about it is that our prompt shines a lens onto some specific part of a simulation.
Possibly true in the limit of perfect simulators, almost surely false with current LLMs?
A key fact is that LLMs are better at solving some tasks with CoT than they are at solving tasks 0-shot when fine-tuned to do it, and this is also true when CoT are paraphrased, or when you write part of the CoT by hand. This means that the tokens generated are actually a huge part of why the model is able to do the task in the first place, and I think it’s not a stretch to say that, in these conditions, the LLMs is actually using the generated text as the substrate for reasoning (and using it in the same way a human thinking out loud would).
But I agree with the general “skeptic of LLM justifications by default” vibe. I just think it’s not justified for many tasks, and that LLMs performing badly when fine-tuned for many sorts of tasks is good evidence that LLMs are not hiding huge capabilities and are usually “just doing their best”, at least for tasks people care enough about that they trained models to be good at them—e.g. all tasks in the instruction fine-tuning dataset.
I also agree with being very skeptical of few-shot prompting. It’s a prompting trick, it almost never actual in-context learning.
Possibly true in the limit of perfect simulators, almost surely false with current LLMs?
A key fact is that LLMs are better at solving some tasks with CoT than they are at solving tasks 0-shot when fine-tuned to do it, and this is also true when CoT are paraphrased, or when you write part of the CoT by hand. This means that the tokens generated are actually a huge part of why the model is able to do the task in the first place, and I think it’s not a stretch to say that, in these conditions, the LLMs is actually using the generated text as the substrate for reasoning (and using it in the same way a human thinking out loud would).
But I agree with the general “skeptic of LLM justifications by default” vibe. I just think it’s not justified for many tasks, and that LLMs performing badly when fine-tuned for many sorts of tasks is good evidence that LLMs are not hiding huge capabilities and are usually “just doing their best”, at least for tasks people care enough about that they trained models to be good at them—e.g. all tasks in the instruction fine-tuning dataset.
I also agree with being very skeptical of few-shot prompting. It’s a prompting trick, it almost never actual in-context learning.