I am quite confused. It is not clear to me whether, in the end, you are saying that LLMs do or don’t have a world model. Can you clearly say which “side” you stand on? Are you even arguing for a particular side? Or are you arguing that the idea of “having a world model” doesn’t apply well to an LLM / just isn’t well defined?
That said, you do seem to be claiming that LLMs do not have a coherent model of the world (again, am I misunderstanding you?), and then you use humans as an example of what having a coherent world model looks like. This sentence in particular is bugging me:
For example, an LLM that can answer a question about the kinetic energy of a bludger probably doesn’t have a clear boundary between models of fantasy and models of reality. But switching seamlessly between emulating different people is implicit in what they are attempting to do—predict what happens in a conversation.
In the screenshots you provided, GPT-3.5 does indeed answer the question, but it seems to distinguish that the scenario is not real (it says ”...bludgers in Harry Potter are depicted as...”, ”...in the Harry Potter universe...”), and indeed it says it doesn’t have specific information about their magical properties. Also, despite being a physicist who knows HP isn’t real, I would have gladly tried to answer that question much like GPT did. So what are you arguing? LLMs do seem to have at least the distinction between reality and HP, no?
And large language models, like humans, do the switching so contextually, without explicit warning that the model being used is changing. They also do so in ways that are incoherent.
What’s incoherent about the response it gave? Was the screenshot not meant to be evidence?
The simulator theory (which you seem to rely on) is, IMO, a good human-level explanation of what GPT is doing, but it is not a fundamental-level theory. You cannot reduce every interaction with an LLM to a “simulation”; some things are just weirder. Think of pathological examples, like the input being “££££...” repeated thousands of times: the output will be some random, possibly incoherent babbling (a funny incoherent output I got from the API by inputting “£”*2000 and asking it how many “£” there were: ‘There are 10 total occurrences of “£” in the word Thanksgiving (not including spaces).’). Notice also the random titles it gives to these conversations. Simulator theory fails here.
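The experiment is easy to reproduce; here is a rough sketch of the setup (the API call itself is left in comments, and the model name and client usage there are my assumptions, not part of the original run):

```python
# Sketch of the pathological-input experiment described above.
# Build a prompt that is nothing but one repeated character plus a question.
prompt = "£" * 2000 + "\nHow many of that character are in this message?"

# Ground truth that a coherent "simulator" of a careful assistant should report:
print(prompt.count("£"))  # → 2000

# The actual API call (commented out; model name is an assumption):
# import openai
# reply = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": prompt}],
# )
# print(reply.choices[0].message.content)
```

The incoherent “Thanksgiving” answer came from a prompt along these lines.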
In the framework of simulator theory and a lacking world model, how do you explain that it is actually really hard to make GPT overtly racist? Or why the instruct fine-tuning basically never breaks?
If I leave a sentence incomplete, why does the LLM say “You have been cut off, can you please repeat?” instead of completing my sentence? Why doesn’t the “playful” roleplaying take over here, while (as you seem to claim) it takes over when you ask for factual things? Do they have a model of what “following instructions” and “racism” mean, but not of what “reality” is?
To state my belief: I think hallucinations, non-factuality, and many of these problems are better explained by failures of RLHF than by the lack of a coherent world model. RLHF apparently isn’t that good at making sure that GPT-4 answers factually. Especially since it is really hard to make it overtly racist, and especially since they reward it for “giving it a shot” instead of answering “idk” (because otherwise it would always answer “idk”). My explanation: in training the reward model, a lot of non-factual things might appear, and some non-factual things are actually the responses humans prefer.
Or it might just be the autoregressive paradigm: once the model makes a mistake (just by randomly sampling the “wrong” token), it “thinks”: *Yoda voice* ‘mhmm, a mistake in the answer I see, mistaken the continuation of the answer should then be’.
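That compounding intuition can be shown with a toy autoregressive model (the transition table is entirely made up, just to illustrate a model conditioning on its own mistakes):

```python
import random

# Toy autoregressive "model": the next-token distribution depends only on the
# previous token. Once "wrong" is sampled, every later token stays "wrong",
# because the model conditions on its own earlier output.
transitions = {
    "<s>":   [("right", 0.9), ("wrong", 0.1)],
    "right": [("right", 1.0)],
    "wrong": [("wrong", 1.0)],  # a single mistake poisons the continuation
}

def sample_sequence(length, rng):
    tok, out = "<s>", []
    for _ in range(length):
        toks, probs = zip(*transitions[tok])
        tok = rng.choices(toks, weights=probs)[0]
        out.append(tok)
    return out

rng = random.Random(0)
print(sample_sequence(5, rng))
```

Roughly 10% of sampled sequences hit the “wrong” token, and none of those ever recover, even though the per-step distributions are perfectly sensible.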
And the weirdness of the outputs after a long repetition of a single token is explained by the non-zero repetition penalty in ChatGPT, which makes the output resemble the output from a glitch token.
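For concreteness, a frequency-style penalty (roughly how I understand the API’s frequency_penalty to behave; the exact formula here is an assumption) pushes down the logits of already-seen tokens, so a prompt made of one repeated token drives that token’s probability into the floor:

```python
# Sketch of a frequency-style repetition penalty on raw logits.
# Assumption: penalty scales with how often a token already appeared.
def apply_repetition_penalty(logits, generated, penalty=1.5):
    """logits: dict token -> raw logit; generated: tokens seen so far."""
    out = dict(logits)
    for tok in set(generated):
        count = generated.count(tok)
        out[tok] -= penalty * count  # more repeats, bigger push downward
    return out

logits = {"£": 5.0, "Thanksgiving": 1.0}
penalized = apply_repetition_penalty(logits, ["£"] * 4)

# After penalizing, the previously dominant "£" falls below an unrelated token,
# which is how you end up with babbling about Thanksgiving.
print(penalized["£"] < penalized["Thanksgiving"])  # → True
```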
Yes, I’m mostly embracing simulator theory here, and yes, there are definitely a number of implicit models of the world within LLMs, but they aren’t coherent. So I’m not saying there is no world model, I’m saying it’s not a single / coherent model, it’s a bunch of fragments.
But I agree that it doesn’t explain everything!
To step briefly out of the simulator-theory frame: I agree that part of the problem is next-token generation, not RLHF. The model generates one token at a time, so it can’t “step back” and decide not to make a claim that it “knows” should be followed by a citation. But that’s not a mistake at the level of simulator theory; it’s a mistake caused by how the DNN is used, not by the joint distribution implicit in the model, which is what I view as what is “actually” simulated. For example, I suspect that if you had it calculate the joint probability over all the possibilities for the next 50 tokens at a time, pick the next 10 based on that, and then repeat (which would obviously be computationally prohibitive, but I’ll ignore that for now), it would mostly eliminate the hallucination problem.
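A toy version of that decoding scheme, with a made-up bigram table and tiny vocabulary (block size 3 instead of 50, since the exhaustive search is exponential in block size), might look like this; in the full scheme you would commit a prefix of the best block and repeat:

```python
import itertools
import math

# Tiny made-up vocabulary and bigram log-probabilities, purely illustrative.
VOCAB = ["the", "sky", "is", "blue", "green"]

def bigram_logp(prev, tok):
    table = {("the", "sky"): -0.1, ("sky", "is"): -0.1,
             ("is", "blue"): -0.2, ("is", "green"): -2.0}
    return table.get((prev, tok), -3.0)  # default: unlikely continuation

def best_block(prev, k=3):
    """Score the joint log-probability of every possible next-k-token block
    and return the highest-scoring one (exponential in k, as noted above)."""
    best, best_lp = None, -math.inf
    for block in itertools.product(VOCAB, repeat=k):
        lp, p = 0.0, prev
        for t in block:
            lp += bigram_logp(p, t)
            p = t
        if lp > best_lp:
            best, best_lp = block, lp
    return best

print(best_block("the"))  # → ('sky', 'is', 'blue')
```

Scoring whole blocks jointly lets the decoder avoid a locally plausible token whose best continuations are all bad, which is exactly the “can’t step back” failure of one-token-at-a-time sampling.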
On racism, I don’t think there’s much you need to explain; they did fine-tuning, and that was able to generate the equivalent of huge penalties for words and phrases that are racist. I think it’s possible that RLHF could train away the racist modes of thinking as well, if done carefully, but I’m not sure that is what occurs.