I’m downloading the model to take a look.
The fact that the authors used GPT-4 for both prompt generation and evaluation is not an encouraging sign, but the rest of the paper looks alright.
Were you able to check the prediction in the section “Non-sourcelike references”?