Thanks for your comments! I was traveling and missed them until now.
> To the extent LLMs appear to build world models, I think what you’re seeing is a bunch of disorganized neurons and connections that, when probed with a systematic method, can be mapped onto things that we know a world model ought to contain.
I think we’ve certainly seen some examples of interpretability papers that ‘find’ things in the models that aren’t there, especially when researchers train nonlinear probes. But the research community has been learning over time to distinguish cases like that from what’s really in the model (via ablation, causal tracing, etc.). We’ve also seen examples of world modeling that are clearly there in the model; Neel Nanda’s work finding a world model in Othello-GPT is a particularly clear case in my opinion (post, paper).
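To make the methodology concrete, here’s a toy sketch of the linear-probe-plus-ablation recipe. This is not code from the Othello-GPT work; the activations and the “feature” are synthetic stand-ins, and a least-squares fit stands in for the usual logistic-regression probe:

```python
# Toy sketch of linear probing plus an ablation check. All data here is
# synthetic; this is not the Othello-GPT code, just the general recipe.
import numpy as np

rng = np.random.default_rng(0)

# Pretend hidden activations: 200 samples, 32 dims. A binary "world state"
# feature is linearly embedded along one direction, plus noise.
n, d = 200, 32
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)                # 0/1 feature values
acts = 0.1 * rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, direction)

# Linear probe: fit a weight vector mapping activations to the feature.
w, *_ = np.linalg.lstsq(acts, labels * 2.0 - 1.0, rcond=None)
accuracy = (((acts @ w) > 0) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")           # high if the feature is linearly present

# Ablation-style causal check: project out the probe direction. If the
# probe was reading a real linear feature, decodability should collapse
# to roughly chance.
u = w / np.linalg.norm(w)
ablated = acts - np.outer(acts @ u, u)
ablated_acc = (((ablated @ w) > 0) == labels).mean()
print(f"after ablation:  {ablated_acc:.2f}")
```

The point of the second step is exactly the distinction above: a probe alone can “find” structure that the model never uses, whereas an intervention that destroys the feature and changes behavior (or here, decodability) is much stronger evidence the feature is really there.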
> I think LLMs get “world models” (which don’t in fact cover the whole world) in a way that is quite unlike the way intelligent humans form their own world models―and more like how unintelligent or confused humans do the same.
My intuitions about human learning here are very different from yours, I think. In my view, learning (e.g.) to produce valid sentences in a native language and to understand sentences from other speakers is very nearly the only thing that matters, and it’s something nearly all speakers achieve. Learning an explicit model of that language, in order to (e.g.) produce a correct parse tree, matters a tiny bit, very briefly, when you learn parse trees in school. Rather than intelligent humans learning a detailed explicit model of their language and unintelligent humans not doing so, it seems to me that very few intelligent humans have such a model; mostly it’s just linguists, who need one. I would further claim that those who do learn an explicit model don’t end up significantly better at producing and understanding language in their day-to-day lives; it’s not explicit modeling that makes us good at that.
I do agree that someone without an explicit model of a topic will often have a harder time explaining that topic to someone else, and I agree that LLMs typically learn implicit rather than explicit models. I just don’t think that, in and of itself, makes them worse at using those models.
That said, to the extent that by ‘general reasoning’ we mean chains of step-by-step assertions with each step explicitly justified by valid rules of reasoning, that does seem like something that benefits a lot from an explicit model. So in the end I don’t necessarily disagree with your application of this idea to at least some versions of general reasoning; I do disagree when it comes to other sorts of general reasoning, and LLM capabilities in general.