That certainly seems plausible—it would be interesting to compare to a base model at some point, although with recent changes to the OpenAI API, I’m not sure if there would be a good way to pull the right token probabilities out.
@Jessica Rumbelow also suggested that the debiasing process could be a reason why there weren’t significant score differences between the main model tested, older GPT-3.5, and the newest GPT-4.
As the Llama 3 70B base model is said to be very clean (unlike base DeepSeek, for example, which is already instruction-spoiled) and similarly capable to GPT-3.5, you could use it to explore that hypothesis. Details: check Groq or TogetherAI for free inference; I'm not sure whether the test data would fit in the Llama 3 context window.
Thanks!