Hey there,
I was just wondering how you deal with the hallucination and faithfulness issues of large language models from a technical perspective? The user experience perspective seems clear: you can give users control and consent over what Elicit is suggesting, and so on.
However, we know LLMs are prone to issues of faithfulness and factuality (Pagnoni et al. 2021, as one example for abstractive summarization), and this seems like it would be a big issue for research, where factual correctness is very important. In a biomedical scenario, if a user of Elicit gets an output that presents a wrongly extracted figure (say, pulled from a preceding sentence, or hallucinated as the highest-log-likelihood token based on previous documents), this could potentially have very dangerous consequences. I’d love to know more about how you’d address that.
My current thinking on the matter is that, in order to address these safety issues in NLP for science, we may need models that “self-criticize” their outputs, so to speak, i.e. produce counterfactual outputs that could be checked, or something along those lines. Especially since GopherCite (Menick et al. 2022) and some of the similar self-supporting models seem to show that self-support is also prone to issues and doesn’t fully address factuality (in their case, as measured on TruthfulQA), not to mention self-explaining approaches, which I believe suffer from the same issues (i.e. hallucinating an incorrect explanation).
Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., … & McAleese, N. (2022). Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.
Pagnoni, A., Balachandran, V., & Tsvetkov, Y. (2021). Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. arXiv preprint arXiv:2104.13346.
Yeah, getting good at faithfulness is still an open problem. So far, we’ve mostly relied on imitative finetuning to get misrepresentations down to about 10% (which is obviously still unacceptable). Going forward, I think that some combination of the following techniques will be needed to get performance to a reasonable level:
Finetuning + RL from human preferences
Adversarial data generation for finetuning + RL
Verifier models, relying on evaluation being easier than generation (see the sketch after this list)
Decomposition of verification, generating and testing ways that a claim could be wrong
Debate (“self-criticism”)
User feedback, highlighting situations where the model is wrong
Tracking supporting information for each statement and through each chain of reasoning
Voting among models trained/finetuned on different datasets
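To make the verifier-model item a bit more concrete, here is a minimal sketch of a generate-then-verify loop. Everything in it is an assumption for illustration: the `complete` wrapper stands in for whatever LLM API is being used, and the prompts, retry count, and abstain-on-failure behaviour are not a description of how Elicit actually implements this.

```python
# Minimal sketch of a generate-then-verify loop. The `complete` function is a
# placeholder for a call to whatever language model API is in use; the prompts,
# retry count, and abstain behaviour are illustrative assumptions only.

from typing import Optional


def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to a language model and return its text output."""
    raise NotImplementedError("plug in your own model client here")


def generate_answer(question: str, source_text: str) -> str:
    # Generator pass: answer the question using only the provided source.
    prompt = (
        "Answer the question using only the source text.\n\n"
        f"Source:\n{source_text}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return complete(prompt).strip()


def is_supported(claim: str, source_text: str) -> bool:
    # Verifier pass: checking whether a claim is supported by the source is an
    # easier task than generating the claim, so a second pass can catch
    # unsupported or hallucinated statements.
    prompt = (
        "Does the source text fully support the claim? Answer YES or NO.\n\n"
        f"Source:\n{source_text}\n\n"
        f"Claim: {claim}\nAnswer:"
    )
    return complete(prompt).strip().upper().startswith("YES")


def answer_with_verification(
    question: str, source_text: str, max_attempts: int = 3
) -> Optional[str]:
    # Resample until the verifier accepts an answer; otherwise abstain (return
    # None) so the UI can say "no supported answer found" instead of guessing.
    for _ in range(max_attempts):
        candidate = generate_answer(question, source_text)
        if is_supported(candidate, source_text):
            return candidate
    return None
```

The same shape extends to some of the other items above: decomposition of verification would run the verifier once per specific way the claim could be wrong, and voting would require several differently trained verifiers to agree before an answer is shown.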
Thanks for the pointer to Pagnoni et al.