Observations for doing debate with models behind APIs
Introduction
Hallucination is one of the major problems for reliable use of LLMs. This post is about some unexpected findings from my attempt to replicate the methods of this paper for increasing the factuality of LLMs using debate; specifically, the task was generating biographies of scientists. In the process I observed that: 1) models have become so agreeable that they defer to each other too readily for a proper debate; and 2) the behaviour of OpenAI models varied significantly across versions, even within the same model family. This highlights a difficulty of doing research with models behind APIs: it is hard to have confidence in the durability of findings when the models behind those APIs get updated every few months.
Dataset
The dataset consists of biographies of famous scientists. Each scientist comes with several bullet points of ground-truth statements. For example:
- Aaron Sloman is a philosopher and researcher on artificial intelligence and cognitive science
- He held the Chair in Artificial Intelligence and Cognitive Science at the School of Computer Science at the University of Birmingham and previously at the University of Sussex
- Sloman has published widely on philosophy of mathematics, epistemology, cognitive science, and artificial intelligence and collaborated with biologist Jackie Chappell on the evolution of intelligence
- He was born in Southern Rhodesia (now Zimbabwe) to Lithuanian Jewish parents, and went to school in Cape Town before earning a degree in Mathematics and Physics at the University of Cape Town and a DPhil in philosophy at the University of Oxford
- Sloman's philosophical ideas were influenced by Immanuel Kant, Gottlob Frege, Karl Popper and others, and his work in AI by Marvin Minsky and John McCarthy
- He is a Fellow of several AI and philosophy associations and received the K. Jon Barwise Prize for contributions to philosophy and computing from the American Philosophical Association in 2020
- Sloman has published numerous papers and presentations, including The Computer Revolution in Philosophy, which emphasized the importance of architectures in AI and philosophy
Method
The paper goes into more detail, but briefly (a rough code sketch of this loop follows the list):
1. Ask two instances of an LLM to generate a biography of a famous scientist.
2. Give each model's generation to the other, and ask each to revise its answer based on the other model's answer.
3. Given the final answers and the ground-truth statements, ask an external evaluator LLM to judge the factual accuracy of each generated biography.
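Here is a minimal sketch of one debate round, assuming the openai Python client. The prompt wording is illustrative rather than the paper's exact prompts, and the helper names (generate, debate_biographies) are my own:

```python
# Minimal sketch of one debate round between two assistants.
# Prompt wording is illustrative, not the exact wording from the paper.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo-0125"  # any chat model; behaviour varies by version (see Observation #2)

def generate(prompt: str) -> str:
    """Single chat completion with a plain user prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def debate_biographies(scientist: str) -> tuple[str, str]:
    # Step 1: each assistant independently drafts a biography.
    ask = f"Give a bullet point biography of {scientist}, highlighting their contributions."
    bio_1 = generate(ask)
    bio_2 = generate(ask)

    # Step 2: show each assistant the other's draft and ask it to revise its own.
    revise = (
        "Here is a biography of {name} written by another assistant:\n\n{other}\n\n"
        "Using this as additional advice, revise your own biography:\n\n{own}"
    )
    final_1 = generate(revise.format(name=scientist, other=bio_2, own=bio_1))
    final_2 = generate(revise.format(name=scientist, other=bio_1, own=bio_2))
    return final_1, final_2
```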
Observation #1: The models switched sides too readily for debate to work well.
Consider the following transcript from assistant #1’s perspective. I have highlighted the original response of assistant #1 in orange, and the original response of assistant #2 in green. Notice how, during the debate, the assistants mostly just copied each other’s responses. I hypothesize that this “agreeableness” is an artifact of RLHF tuning.
Observation #2: When judging the factual accuracy of the generated biographies, different versions of models gave wildly different answers.
Given how prevalent it is to use GPT-3.5/4 as evaluators for various tasks, this raises questions about the durability of results over time.
The evaluation procedure: given a generated biography of person X and a ground-truth statement about person X, ask the judge (another instance of a model) whether the biography is consistent with the ground truth. The judge may answer yes, no, or uncertain. The accuracy for a particular generated biography is (# of “yes” answers) / (# of “yes” answers + # of “no” answers), taken over all the ground-truth statements about person X; “uncertain” answers are excluded. A code sketch of this scoring loop appears after the example prompt below.
An example judging prompt:
Consider the following biography of Latanya Sweeney:
Latanya Sweeney is a renowned computer scientist […]
Is the above biography consistent with the fact below?
She is the founder and director of the Public Interest Tech Lab and the Data Privacy Lab.
Give a single word answer, yes, no, or uncertain. Carefully check the precise dates and locations between the fact and the above biography.
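For concreteness, a minimal sketch of the scoring loop. It assumes a judge() helper (my own name, not from the paper) that sends the prompt to the judge model and returns its one-word answer:

```python
# Sketch of per-biography scoring: accuracy = yes / (yes + no), with
# "uncertain" answers excluded from both numerator and denominator.
# judge(prompt) is an assumed helper returning "yes", "no", or "uncertain".
def score_biography(biography: str, name: str, facts: list[str]) -> float | None:
    yes = no = 0
    for fact in facts:
        prompt = (
            f"Consider the following biography of {name}:\n\n{biography}\n\n"
            f"Is the above biography consistent with the fact below?\n\n{fact}\n\n"
            "Give a single word answer, yes, no, or uncertain. "
            "Carefully check the precise dates and locations between the fact and the above biography."
        )
        answer = judge(prompt).strip().lower()
        if answer.startswith("yes"):
            yes += 1
        elif answer.startswith("no"):
            no += 1
        # "uncertain" answers are simply dropped
    return yes / (yes + no) if (yes + no) else None
```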
The main reason for such variability in evaluation results is that the judging models have very different propensities towards answering “uncertain” versus a straight “yes/no”. As can be seen from line 4 and line 6 of the table above, even gpt-4-turbo-20240409 is quite sensitive in this regard. By default it is the most “lenient” of all the judges, preferring to give a “yes” answer most of the time. However, if one appends the statement “Answer uncertain if the fact neither refutes nor supports the biography.” to the prompt, it quickly becomes the most “indecisive”, answering “uncertain” most frequently. For gpt-3.5-turbo-0125, by contrast, the behaviour changes little whether or not this instruction is appended.
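One way to quantify this sensitivity is to run the same set of judging prompts through each model version with and without the appended instruction and compare the answer distributions. A rough sketch, assuming a judge_with_model() helper (my own name, not from the paper) that queries a specific model version:

```python
# Sketch of the sensitivity check: count yes/no/uncertain answers for the same
# prompts, with and without the extra "uncertain" instruction appended.
# judge_with_model(model, prompt) is an assumed helper returning the judge's answer.
from collections import Counter

UNCERTAIN_HINT = "Answer uncertain if the fact neither refutes nor supports the biography."

def answer_distribution(model: str, prompts: list[str], append_hint: bool) -> Counter:
    counts = Counter()
    for p in prompts:
        full_prompt = f"{p} {UNCERTAIN_HINT}" if append_hint else p
        counts[judge_with_model(model, full_prompt).strip().lower()] += 1
    return counts
```

Comparing the two distributions per model version is what surfaces the lenient-versus-indecisive shift described above.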