This makes sense, but LLMs can be described in multiple ways. From one perspective, as you say,
ChatGPT is not running a simulation; it’s answering a question in the style that it’s seen thousands—or millions—of times before.
From a different perspective (as you very nearly say), ChatGPT is simulating a skewed sample of people-and-situations in which the people actually do have answers to the question.
The contents of hallucinations are hard to understand as anything but token prediction, by a system that (seemingly?) has no introspective knowledge of its own knowledge. The model’s degree of confidence in a next-token prediction would be a poor indicator of degree of factuality: the choice of token might be uncertain, not because the answer itself is uncertain, but because there are many ways to express the same, correct, high-confidence answer. (ChatGPT agrees with this, and seems confident.)
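A toy calculation makes that point concrete (the numbers below are invented purely for illustration, not measured from any model):

```python
# Invented toy numbers: one correct answer ("Paris"), many surface forms.
# Next-token probability gets split across paraphrases, so the top token
# looks hesitant even though the answer itself is high-confidence.
paraphrase_probs = {
    "Paris": 0.30,   # "Paris."
    "The": 0.30,     # "The capital is Paris."
    "It": 0.30,      # "It's Paris."
    "Lyon": 0.10,    # a wrong answer
}

top_token_prob = max(paraphrase_probs.values())  # 0.30 -- looks uncertain
correct_answer_prob = sum(
    p for token, p in paraphrase_probs.items() if token != "Lyon"
)  # 0.90 -- the model "knows" the answer

print(f"top next-token probability:            {top_token_prob:.2f}")
print(f"total probability of a correct answer: {correct_answer_prob:.2f}")
```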
Are you sure introspection won’t work?
If you ask the model, “does the following text contain made-up facts you cannot locate on Bing”, can it then check whether Bing has cites for each quote?
It looks like it will work. This is counterintuitive, but it’s because the model never did any of this “introspection” when it generated the string. It just rattled off whatever it predicted was next, within the particular region of multidimensional knowledge space you are in.
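A rough sketch of what that check could look like, as a separate pass over already-generated text. Both `ask_model()` and `search_snippets()` are hypothetical stand-ins (not real library calls); you would wire them to whatever LLM and search API, Bing or otherwise, you actually use:

```python
from typing import Dict, List

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; connect your own client."""
    raise NotImplementedError

def search_snippets(query: str) -> List[str]:
    """Hypothetical stand-in for a web-search API call (e.g. Bing)."""
    raise NotImplementedError

def check_for_made_up_facts(text: str) -> List[Dict[str, str]]:
    # Ask the model to pull out the concrete, checkable claims in its own output.
    claims = ask_model(
        "List, one per line, every specific factual claim (names, dates, "
        "quotes, citations) in the following text:\n\n" + text
    ).splitlines()

    report = []
    for claim in (c.strip() for c in claims):
        if not claim:
            continue
        # See whether a search turns up anything supporting the claim, then
        # have the model judge the snippets in an easy-to-score format.
        snippets = search_snippets(claim)
        verdict = ask_model(
            "Claim: " + claim
            + "\n\nSearch results:\n" + "\n".join(snippets[:5])
            + "\n\nDo the results support the claim? "
              "Answer SUPPORTED, UNSUPPORTED, or CONTRADICTED."
        ).strip()
        report.append({"claim": claim, "verdict": verdict})
    return report
```

The point above carries through here: the check never asks the model to introspect while generating; it is an entirely separate pass over the finished text.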
You could automate this. Have the model generate possible answers to a query. Then prompt other instances of the same model to search for common classes of errors and respond in language that can be scored.
Then RL on the answers that are the least wrong, or apply negative feedback to the answer that most disagrees with the introspection. This “shapes” the multidimensional space of the model to be more likely to produce correct answers, and less likely to give made-up facts.
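As a minimal sketch, assuming the same hypothetical `ask_model()` stand-in as above: the loop below only produces the (prompt, chosen, rejected) pairs that preference-tuning or RL methods typically consume; the actual training update is left out.

```python
from typing import Dict, List

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call, same stand-in as in the earlier sketch."""
    raise NotImplementedError

# Common classes of errors the critic instances are prompted to look for.
ERROR_CLASSES = [
    "invented citations, quotes, or URLs",
    "numbers or dates stated with false precision",
    "claims that contradict well-established facts",
]

def critique_score(answer: str) -> int:
    # Each critic prompt answers YES/NO, which is trivial to score.
    flags = 0
    for error in ERROR_CLASSES:
        verdict = ask_model(
            f"Does the following answer contain {error}? Reply YES or NO.\n\n{answer}"
        ).strip().upper()
        flags += int(verdict.startswith("YES"))
    return flags  # lower = fewer flagged problems

def preference_pair(query: str, n_candidates: int = 4) -> Dict[str, str]:
    # Sample several answers, rank them by critic flags, and keep the
    # least-wrong one as "chosen" and the most-flagged one as "rejected".
    candidates: List[str] = [ask_model(query) for _ in range(n_candidates)]
    ranked = sorted(candidates, key=critique_score)
    return {"prompt": query, "chosen": ranked[0], "rejected": ranked[-1]}
```

Feeding pairs like these into a preference-tuning or RL step is what would do the actual “shaping” of the model’s space toward correct answers.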