Yes, I’m mostly embracing simulator theory here, and yes, there are definitely a number of implicit models of the world within LLMs, but they aren’t coherent. So I’m not saying there is no world model; I’m saying it isn’t a single, coherent model, it’s a bunch of fragments.
But I agree that it doesn’t explain everything!
To step briefly out of the simulator theory frame, I agree that part of the problem is next-token generation, not RLHF—the model is generating one token at a time, so it can’t “step back” and retract a claim that it “knows” should be followed by a citation. But that isn’t a mistake at the level of simulator theory; it’s a mistake caused by the way the DNN is used, not by the joint distribution implicit in the model, which is what I view as what is “actually” being simulated. For example, I suspect that if you had it calculate the joint probability over all the possibilities for the next 50 tokens, pick the next 10 tokens based on that, and then repeat (which would obviously be computationally prohibitive, but I’ll ignore that for now), it would mostly eliminate the hallucination problem.
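To make the decoding idea concrete, here’s a minimal toy sketch of that “score whole continuations, commit only a prefix” loop. The 50/10 horizon is shrunk to something tractable, and `next_token_logprobs` is a hypothetical stand-in for a real LM’s conditional distribution, not any actual library API:

```python
import itertools
import math

# Toy vocabulary and horizons; real values (vocab of ~50k tokens, lookahead of 50,
# commit of 10) would make the exhaustive search below astronomically expensive,
# which is exactly the "computationally prohibitive" caveat in the comment.
VOCAB = ["the", "cat", "sat", "mat", "[cite]", "."]
LOOKAHEAD = 4   # stand-in for "the next 50 tokens"
COMMIT = 2      # stand-in for "pick the next 10"

def next_token_logprobs(context):
    """Hypothetical LM interface: return {token: log P(token | context)}."""
    # Uniform placeholder; a real model would actually condition on `context`.
    p = 1.0 / len(VOCAB)
    return {tok: math.log(p) for tok in VOCAB}

def joint_logprob(context, continuation):
    """Sum of conditional log-probs, i.e. log of the joint probability of the continuation."""
    total = 0.0
    ctx = list(context)
    for tok in continuation:
        total += next_token_logprobs(ctx)[tok]
        ctx.append(tok)
    return total

def decode(context, steps=3):
    context = list(context)
    for _ in range(steps):
        # Exhaustively score every LOOKAHEAD-token continuation by its joint probability.
        best = max(
            itertools.product(VOCAB, repeat=LOOKAHEAD),
            key=lambda cont: joint_logprob(context, cont),
        )
        # Commit only the first COMMIT tokens of the best continuation, then re-plan.
        context.extend(best[:COMMIT])
    return context

print(decode(["the"]))
```

In practice you’d approximate the exhaustive search with something like beam search over long continuations, but the point is the same: the plan is scored as a whole before any of it is committed, so a claim that would need an unsupported citation later gets penalized before it is emitted.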
On racism, I don’t think there’s much you need to explain; they did fine-tuning, and that effectively imposed the equivalent of enormous penalties on words and phrases that are racist. I think it’s possible that RLHF could train away from the racist modes of thinking as well, if done carefully, but I’m not sure that is what actually occurs.