I suspect that with more explicit prompting (e.g. about not validating on the test set), the models might do even better. See the more explicit prompts in https://arxiv.org/abs/2402.17453, which seems to be SOTA at these kinds of tasks.