Fantastic agenda for the field, thanks for sharing.
Honesty is a narrower concept than truthfulness and is deliberately chosen to avoid capabilities externalities, since truthful AI is usually a combination of vanilla accuracy, calibration, and honesty goals. Optimizing vanilla accuracy is optimizing general capabilities, and we cover calibration elsewhere. When working towards honesty rather than truthfulness, it is much easier to avoid capabilities externalities.
I think it’s worth mentioning that there are safety benefits to truthfulness beyond honesty. You’re absolutely right that general capabilities tend to improve truthfulness and vice versa, but I still see two specific paths to longtermist impact for truthful language models.
First, truthful LMs can combat automated persuasion and misinformation. The growth of language models seems to pose serious risks of degrading discourse and enabling propaganda from malicious actors. Guiding progress in the field toward LMs that are asymmetrically truthful (as opposed to persuasive), or that can identify bots in the wild, seems important. See the discussions of risks from automated persuasion and of AI takeover without AGI or agency.
Second, the ability to verify the truthfulness of claims is essential for both debate and iterated amplification. Other bottlenecks in these agendas should probably be prioritized, and the agendas should aim to work as well as possible at the current state-of-the-art level of accuracy. But if other domains of AI see rapid progress, it could be useful to pursue better truth-verification capabilities in order to supervise other systems.
Honest AI is a more scalable direction for the field, with less danger of capabilities externalities. But with a sufficient degree of caution, I see the case for working on some truthfulness agendas.