Your LLM Judge may be biased
Abstract
AI safety researchers often rely on LLM “judges” to qualitatively evaluate the output of separate LLMs. We try this for our own interpretability research, but find that our LLM judges are often deeply biased. For example, we use Llama2 to judge whether movie reviews are more “(A) positive” or “(B) negative”, and find that it almost always answers “(B)”, even when we switch the labels or order of these alternatives. This bias is particularly surprising for two reasons: first, because we expect a fairly capable model like Llama2 to perform well at a simple sentiment classification task like this, and second, because this specific “(B)”-bias doesn’t map onto a human bias we’d expect to see in the training data. We describe our experiments, provide code to replicate our results, and offer suggestions to mitigate such biases. We caution researchers to double-check their LLM judges for such biases, and to validate LLM judgements against human ones whenever possible.
Introduction
Researchers often rely on human judgements for many simple and repetitive tasks, such as comparing, evaluating and classifying text generated by LLMs. We’d like to use AI to automate these judgements wherever possible, since AI judges are much faster, more cost-effective, and easier to standardize (by holding model and prompt consistent across questions). However, we also want AI judgements to be accurate – to mimic those of a reliable and unbiased human. This is particularly challenging because humans themselves often display systematic errors, and deep learning systems like LLMs are trained specifically to notice and imitate systematic patterns in their training data.
Nevertheless, in practice researchers often use LLM judges to evaluate LLM outputs. For example, Anthropic use GPT-4 to determine which of two pieces of feedback is more positive[1]. Anthropic[2], Eleuther[3], and others[4] have successfully leveraged LLM feedback to finetune LLMs using a reinforcement learning from AI feedback (RLAIF) setup. In our interpretability work, we wanted to use an LLM judge to determine whether a quote from a movie review was primarily positive or negative.
We found our LLM judge to be biased, and in an unexpected way: in a multiple-choice question, rather than predictably choosing the more positive alternative, or the one it has seen most recently, our judge tends to choose the one labeled “(B)”. It took work for us to notice this bias, and to disentangle it from other possible biases (like positivity bias or recency bias), so we wanted to share our work and observations here. We also speculate on strategies to potentially reduce such biases in future work. We want to caution the AI safety research community against uncritically relying on LLM-based evaluation, and to encourage further work to investigate, understand, and reduce these biases to produce more reliable LLM judges.
You can find the notebook to reproduce these experiments here: github.com/henrypapadatos/evaluate_B_bias
Sentiment classification task
For our project, we wanted to know when an LLM would agree with clearly incorrect human judgements. In order to investigate this, we first needed to verify that the LLM could tell when the human judgements were incorrect – that is, that it could judge the texts accurately itself.
We tested Llama2 (the 7B chat version) on the “Rotten Tomatoes” dataset, which consists of roughly 10,000 movie review snippets, half clearly positive (labeled “positive”) and half clearly negative (labeled “negative”). For example, the review snippet “offers that rare combination of entertainment and education.” is labeled “positive”. We showed Llama2 each review snippet, asked it to determine whether the snippet was more positive or negative, then compared its answer to the ground truth label provided in the dataset. Our hope was that the model would accurately identify positive and negative reviews.
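This is how we prompted Llama2:
Human: Do you think that "{snippet}" is negative or positive sentiment?
Choices:
(A) Negative
(B) Positive
Assistant: I believe the best answer is: (
Ending the assistant prompt with an open parenthesis nudges the model to answer with “A)” or “B)”.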
To gauge the model’s confidence in its answer, we computed a “confidence” metric based on the logit values for the tokens ‘A’ and ‘B’. The ground truth label tells us whether ‘A’ or ‘B’ is the correct answer, so we can identify the logit value for the correct token and the incorrect token, then use that to calculate confidence:
Note that the confidence score also evaluates accuracy. A confidence score of 1 indicates that the model is highly confident in the correct response, while a confidence score of 0 indicates that the model is highly confident in the incorrect response. A confidence score of 0.5 indicates that the model is maximally uncertain. If the score is above 0.5, the model is choosing the correct answer, whereas if it’s below 0.5, the model is choosing the incorrect answer. We prompt Llama2 to classify all datapoints in the Rotten Tomatoes dataset, and calculate a confidence score for each.
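One definition consistent with this description is a two-way softmax over the logits of the correct and incorrect answer tokens. The sketch below assumes that form; the function name `confidence` and the example logit values are ours for illustration and are not taken from the notebook.

```python
import math

def confidence(logit_correct: float, logit_incorrect: float) -> float:
    """Assumed definition: exp(c) / (exp(c) + exp(i)) over the two answer tokens."""
    # Written as a sigmoid of the logit difference so large logits cannot overflow.
    diff = logit_correct - logit_incorrect
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

# Made-up logits for the 'A' and 'B' tokens at the answer position.
logits = {"A": 1.2, "B": 3.4}
correct, incorrect = "A", "B"  # suppose the ground truth is "(A) Negative"
score = confidence(logits[correct], logits[incorrect])
print(f"confidence = {score:.3f}")  # below 0.5, so the model picked the wrong letter
```

Under this definition equal logits give exactly 0.5, and swapping the two arguments maps a score `s` to `1 - s`, which matches the reading of the score given above.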
Llama2’s (B)-bias
We hoped that Llama2 would demonstrate high confidence scores across all datapoints, regardless of whether they were labeled “positive” or “negative”, indicating that it was reliably correct and confident in its judgements. However, we observed very different patterns in confidence scores between the “positive” and “negative” examples in the dataset:
While Llama2 is typically confidently correct on “positive” examples (green), it’s typically incorrect or uncertain about “negative” examples (red). The separation between “positive” and “negative” examples shows a clear bias.
It’s worth noting that humans often exhibit biases when taking surveys. There are even a couple of commonly recorded human biases that would explain the model’s apparent preference for answering “(B) Positive”:
Positivity Bias: humans appear to prefer more positive responses, both in general[5] and specifically in language[6][7].
Recency Bias: humans have been shown to prefer more recently-observed alternatives when betting on sports games[8], choosing food[9], and comparing alternatives in general[10].
Since Llama2 is trained on human data, it’s natural to think that it might be imitating one or both of these biases. And either of these biases would explain the preference for “(B) Positive” over “(A) Negative”. “Positive” is obviously more positive than “Negative”, and the model reads the “(B)” answer after it reads the “(A)” answer.
To investigate which bias underlies the model’s responses, we switch the labels (now “(A) Positive” and “(B) Negative”) and rerun the experiment. If the model is influenced primarily by the positivity bias, we’d expect it to now answer “(A) Positive” most often. If it’s influenced primarily by the recency bias, we’d expect it to typically answer “(B) Negative”.
The figure below shows our results. The graph on the left displays the original confidence score distribution, while the graph on the right shows the results after switching the labels:
We see that switching the labels doesn’t affect accuracy much: both graphs show a similar number of confidence scores to the right of the confidence=0.5 line. However, the preferred response does flip. Whereas the model initially preferred to answer “(B) Positive”, it now tends to answer “(B) Negative”. This looks like a recency bias… or does it?
To double-check, we run a third experiment, this time swapping the order of the alternatives to put “(B)” at the top and “(A)” at the bottom. These are our revised prompts:
If there’s a recency bias, we expect the model to now preferentially choose “(A) Negative” with the first prompt, and “(A) Positive” with the second. However, that’s not what we see. Here are our results with these new prompts:
Contrary to our expectations, the recency bias vanishes! Instead, the model prefers the first alternative (“(B) Positive” with the first prompt, and “(B) Negative” with the second one). Putting it all together, we see that Llama2 prefers the choice labeled B.
(We also see that the model is less confident in general – it has fewer confidence scores at the extremes of 0 and 1, and more of them near the uncertain point of 0.5. We think that this is because the question construction is inherently more confusing – it’s pretty unusual to label the first alternative “(B)” and the second one “(A)” in a multiple choice question.)
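The elimination argument across the three experiments can be written out as a small consistency check. This is a sketch only: the option layouts for the order-swapped prompts are inferred from the description above, the observed preferences are the ones reported in the text, and the "primacy bias" row is an extra hypothesis we add for completeness.

```python
# Each layout lists the options top to bottom as (letter, sentiment).
layouts = {
    "original":          [("A", "Negative"), ("B", "Positive")],
    "labels switched":   [("A", "Positive"), ("B", "Negative")],
    "order swapped (1)": [("B", "Positive"), ("A", "Negative")],
    "order swapped (2)": [("B", "Negative"), ("A", "Positive")],
}

# Preferred answer per layout, as reported in the text.
observed = {
    "original":          ("B", "Positive"),
    "labels switched":   ("B", "Negative"),
    "order swapped (1)": ("B", "Positive"),
    "order swapped (2)": ("B", "Negative"),
}

# Each hypothesis maps a layout to the option it predicts the model will prefer.
hypotheses = {
    "positivity bias": lambda opts: next(o for o in opts if o[1] == "Positive"),
    "recency bias":    lambda opts: opts[-1],  # option read last
    "primacy bias":    lambda opts: opts[0],   # option read first (our addition)
    "(B)-bias":        lambda opts: next(o for o in opts if o[0] == "B"),
}

def consistent(predict):
    """True if the hypothesis predicts the observed preference in every layout."""
    return all(predict(opts) == observed[name] for name, opts in layouts.items())

results = {name: consistent(predict) for name, predict in hypotheses.items()}
for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'contradicted'}")
```

Only the hypothesis keyed on the letter itself survives all four layouts, which is the conclusion drawn above.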
This is pretty weird. As far as we know, humans don’t tend to prefer choices labeled B, so we’re not sure where this could have come from in the training data. As humans, it initially didn’t even occur to us to look for it!
What can we do?
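To address the (B)-bias, we remove the letter B from our options altogether. We relabel them “(P) Positive” and “(N) Negative”, so our final prompt is:
Human: Do you think that "{snippet}" is negative or positive sentiment?
Choices:
(P) Positive
(N) Negative
Assistant: I believe the best answer is: (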
If the bias has been eliminated, we expect to see that:
Llama2 is confidently accurate (most of the confidence scores are close to 1), and
Llama2 is consistent across classes (the red and green distributions are similar)
This is indeed the pattern we observe with this new prompt!
The distribution is now much more balanced. Positive comments (green) are correctly identified 75% of the time, while negative comments (red) are accurately classified 96% of the time, in comparison to 100% and 15% with our original prompt. This is closer to what we’d expect from a relatively competent model like Llama2. No longer using A and B to label our alternatives seems to have removed the bias.
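Per-class accuracies like these can be read directly off the confidence scores, since a score above 0.5 means the model chose the correct answer. A minimal sketch, assuming each datapoint is stored as a `(label, confidence)` pair; the function name and the toy scores are hypothetical and are not the post's data.

```python
from collections import defaultdict

def per_class_accuracy(records):
    """records: iterable of (label, confidence); an answer counts as correct when confidence > 0.5."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for label, conf in records:
        totals[label] += 1
        if conf > 0.5:
            correct[label] += 1
    return {label: correct[label] / totals[label] for label in totals}

# Made-up scores, for illustration only.
toy = [("positive", 0.9), ("positive", 0.4), ("negative", 0.8), ("negative", 0.7)]
acc = per_class_accuracy(toy)
gap = abs(acc["positive"] - acc["negative"])
print(acc, f"gap between classes = {gap:.2f}")
```

A large gap between the classes is the signature of the kind of bias described here, even when overall accuracy looks acceptable.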
In order to fix this, we had to run a bunch of experiments to identify exactly what the source of the bias was. However, we think the community could benefit from a number of general measures that just reduce bias overall, without the need for this kind of analysis. We’d love to see more work on this topic, but here are some initial ideas:
Do few-shot prompting, rather than zero-shot (as we do here, since our prompt contains no worked examples). It’s easier to verify the accuracy of the few examples used in few-shot prompting than the enormous number of examples used in pretraining, and we expect the model to weight examples in the prompt more heavily than ones in the pretraining data. If we can ensure that the few-shot prompts are bias-free, that may help discourage biased responses.
Automatically permute different components of the prompt (in our case, the order of the labels “Positive” and “Negative” and of the letters “(A)” and “(B)”). As long as all combinations of alternatives are equally represented, even if there is a bias along one of the permuted axes, it should average out. We think this kind of permutation should be standard, even if researchers aren’t going to evaluate the different combinations independently like we did above.
Validate subsets of the LLM judgements against manually-created human ones. If LLM errors are unevenly distributed across classes (“Positive” and “Negative” for us), then consider that the LLM may be biased in a way the humans are not. Of course, this technique is expensive and won’t catch biases that LLMs and humans share.
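The permutation idea from the list above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: `get_logits` is a hypothetical callable that returns one logit per option letter for a given prompt, `toy_logits` stands in for a model with a pure preference for the letter B, and averaging the per-layout probabilities is just one reasonable way to combine them.

```python
import itertools
import math

TEMPLATE = (
    'Human: Do you think that "{snippet}" is negative or positive sentiment?\n'
    "Choices:\n"
    "({l1}) {s1}\n"
    "({l2}) {s2}\n"
    "Assistant: I believe the best answer is: ("
)

def layouts(letters=("A", "B"), sentiments=("Negative", "Positive")):
    """Yield every pairing of letters with sentiments, in every display order."""
    for letter_order in itertools.permutations(letters):
        for sentiment_order in itertools.permutations(sentiments):
            yield list(zip(letter_order, sentiment_order))

def prob_positive(snippet, options, get_logits):
    """P(Positive) for one layout, from a two-way softmax over the letter logits."""
    (l1, s1), (l2, s2) = options
    prompt = TEMPLATE.format(snippet=snippet, l1=l1, s1=s1, l2=l2, s2=s2)
    logits = get_logits(prompt, options)  # hypothetical: returns {letter: logit}
    pos = next(letter for letter, s in options if s == "Positive")
    neg = next(letter for letter, s in options if s == "Negative")
    return 1.0 / (1.0 + math.exp(logits[neg] - logits[pos]))

def debiased_prob_positive(snippet, get_logits):
    """Average P(Positive) over all layouts so letter and position effects cancel."""
    probs = [prob_positive(snippet, opts, get_logits) for opts in layouts()]
    return sum(probs) / len(probs)

# Toy stand-in for a model: no sentiment signal at all, just a bonus for the letter B.
def toy_logits(prompt, options):
    return {letter: (2.0 if letter == "B" else 0.0) for letter, _ in options}

single = prob_positive("a fine film", [("A", "Negative"), ("B", "Positive")], toy_logits)
averaged = debiased_prob_positive("a fine film", toy_logits)
print(f"single layout: {single:.3f}, averaged over layouts: {averaged:.3f}")
```

With the toy scorer, a single layout looks confidently "Positive" purely because of the letter, while the average over all four layouts returns to 0.5. With a real model the cancellation is only approximate, and the cost is four forward passes per snippet instead of one.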
If you have more ideas, please comment them below!
Conclusion
In summary, we found a peculiar bias towards responses beginning with “(B)” when testing Llama2 on the Rotten Tomatoes movie review dataset. In order to get accurate responses, we had to revise our prompt to remove “(B)” altogether. This was surprising, and makes us wonder what other weird biases might be influencing LLM judgements. We’d like to see more work testing for such biases and developing techniques to mitigate their impact. In the meantime, we caution researchers to be careful when relying on LLMs to make qualitative judgements.
[1] Towards Understanding Sycophancy in Language Models
[2] Constitutional AI: Harmlessness from AI Feedback
[3] RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
[4] Starling-7B
[5] The positive-negative asymmetry: On cognitive consistency and positivity bias
[6] Human language reveals a universal positivity bias
[7] A Positivity Bias in Written and Spoken English and Its Moderation by Personality and Gender
[8] Behavioral biases in the NFL gambling market: Overreaction to news and the recency bias
[9] Interference of the End: Why Recency Bias in Memory Determines When a Food Is Consumed Again
[10] Immediate and Delayed Primacy and Recency Effects in Performance Evaluation