I like this approach to alignment research. Getting AIs to be robustly truthful (producing language output that is consistent with their best models of reality, modulo uncertainty) seems like it falls in the same space as getting them to keep their goals consistent with their best estimates of human goals and values.
As for avoiding negligent falsehoods, I think it will be crucial for the AI to have explicit estimates of its uncertainty for anything it might try to say. To a first approximation, assuming the system can project statements to a consistent conceptual space, it could predict the variance in the distribution of opinions in its training data around any particular issue. Then it could use this estimate of uncertainty to decide whether to state something confidently, to add caveats to what it says, or to turn it into a question for the interlocutor.
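To make that a bit more concrete, here is a minimal sketch in Python of the kind of mechanism I have in mind. Everything in it is hypothetical: `embed` stands in for the projection of a statement into a shared conceptual space, and `training_opinions` stands in for agreement-scored stances drawn from the training data; the variance of nearby opinions is then used to decide whether to assert, hedge, or ask.

```python
# Minimal sketch, not a real implementation. All names here are illustrative
# placeholders: `embed` is a stand-in for projecting a statement into the
# model's conceptual space, and `training_opinions` is a stand-in for
# (embedding, agreement score) pairs extracted from the training data.

import numpy as np

def embed(statement: str) -> np.ndarray:
    """Placeholder projection into a shared conceptual space."""
    rng = np.random.default_rng(abs(hash(statement)) % (2**32))
    return rng.normal(size=8)  # stand-in embedding vector

def opinion_variance(statement: str,
                     training_opinions: list[tuple[np.ndarray, float]],
                     k: int = 20) -> float:
    """Estimate disagreement: the variance of agreement scores among the
    k training-data opinions closest to the statement in conceptual space."""
    target = embed(statement)
    dists = [(np.linalg.norm(vec - target), score) for vec, score in training_opinions]
    nearest_scores = [score for _, score in sorted(dists, key=lambda d: d[0])[:k]]
    return float(np.var(nearest_scores))

def frame_response(statement: str, variance: float,
                   confident_below: float = 0.05,
                   hedge_below: float = 0.25) -> str:
    """Turn the scalar uncertainty estimate into a conversational policy."""
    if variance < confident_below:
        return statement                                      # state it plainly
    if variance < hedge_below:
        return f"It seems likely that {statement.lower()}"    # add a caveat
    return f"Do you think {statement.lower().rstrip('.')}?"   # ask the interlocutor
```

The thresholds are arbitrary; the point is just that a single scalar estimate of disagreement in the training distribution is enough to drive the choice between asserting, hedging, and turning the statement into a question.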