I’m guessing that measuring performance on those demographic categories will tend to underestimate the models’ potential effectiveness, because they’ve been intentionally tuned to “debias” them on those categories or on things closely related to them.
That certainly seems plausible—it would be interesting to compare to a base model at some point, although with recent changes to the OpenAI API, I’m not sure if there would be a good way to pull the right token probabilities out.
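One possibly-sufficient route is the `logprobs`/`top_logprobs` options on the current Chat Completions endpoint, though they only expose a handful of alternatives at the sampled positions, which may not match the original scoring setup. A minimal sketch, assuming the v1 `openai` Python SDK and a placeholder model name:

```python
# Sketch: pull top-token log probabilities at a forced-choice answer position
# via the OpenAI Chat Completions API (openai v1 SDK assumed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Answer with a single letter, A or B: ..."}],
    max_tokens=1,         # only the first answer token matters here
    logprobs=True,
    top_logprobs=5,       # a small number of alternatives per position
)

# Log probabilities of the top candidate tokens at the answer position
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, candidate.logprob)
```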
@Jessica Rumbelow also suggested that this debiasing process could be why there weren’t significant score differences between the main model tested, the older GPT-3.5, and the newest GPT-4.
Since the Llama 3 70B base model is said to be very clean (unlike the DeepSeek base model, for example, which is already instruction-contaminated) and roughly as capable as GPT-3.5, you could use it to explore that hypothesis. Details: check Groq or TogetherAI for free inference; I’m not sure whether the test data would fit in Llama 3’s context window.
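If you go that route, Together's endpoint is OpenAI-compatible, so the existing scoring code should mostly carry over. A rough sketch, with the caveat that the base-model name, the `logprobs` behaviour, and free-tier availability are assumptions worth double-checking:

```python
# Sketch: query a Llama 3 70B *base* model through Together's OpenAI-compatible
# completions endpoint. Model name and logprobs support are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",  # placeholder
)

response = client.completions.create(
    model="meta-llama/Meta-Llama-3-70B",   # base model, not -Instruct (name assumed)
    prompt="Q: ...\nAnswer (A or B):",
    max_tokens=1,
    logprobs=1,     # completions-style request for token log probabilities
    temperature=0,
)

print(response.choices[0].text)
print(response.choices[0].logprobs)  # token logprobs, if the provider returns them
```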
Thanks!