I wonder whether there would be much pressure for an LLM with the current architecture to represent “truth,” as opposed to “consistent with the worldview of the best generally accurate authors.” If ground-level truth doesn’t provide additional accuracy in predicting next tokens, it seems possible that we would identify a feature that can discriminate what “generally accurate authors” would say about a heretofore unknown question or phenomenon, but not one that can identify inaccurate sacred cows.
I think this is likely right by default in many settings, but I do think ground-level truth provides additional accuracy in predicting next tokens in at least some settings (such as “Claim 1” in the post, though I don’t think that’s the only such setting), and I suspect that will be enough for our purposes. But this is certainly related to stuff I’m actively thinking about.