My claim is: on a big set of questions like this, it doesn’t give the right answer consistently.
Also, the point is to ask questions that have unambiguous answers but whose contextual cues point in the other direction. So I’m not surprised that aligning the contextual cues makes the answer more reliable.
I get your point, but I disagree that your questions have unambiguous answers. And in these cases I think GPT-3 resolves the ambiguity in an acceptable way.
Ok, so it seems at least we agree that the point is to propose tasks and see if the LM can do them, not to alter the task so that we’re being fair to the LM.
Do we also agree that GPT-3 doesn’t reliably answer these questions correctly, based on your own experimentation?
I’m not trying to be fair to the LM, I’d just like to create tasks that make sense from my (/human) perspective.
Yes, we could make a task “What does two plus two equal to?” with the desired output “six hundred and fifty five”, and the model would answer it incorrectly with high confidence (assuming a zero-shot setting). Incompetence on this task doesn’t bear much relevance to humanity.
In the same way, I think we shouldn’t pick a word like “can”, which has a much broader meaning without additional context, and then want the model to treat it as having one specific meaning.
A typical human wouldn’t answer “six hundred and fifty five” either, so GPT-3 would be doing as well as a human.
I’m regretting my original comment about your rephrasing. Your rephrasing is fine; my objection was to the (maybe misinterpreted) attitude that if GPT-3 gives the wrong answer then I should adjust the question until it gives the right answer.
The point I keep making is: your own testing, with the rephrasing, showed that it doesn’t reliably get the right answer. I didn’t test the three questions I listed here because my free API time has expired, but also because I wanted to give a general recipe for questions where it’s unreliable, rather than just a few overfit examples. It might well be that I’ve got the recipe slightly wrong and that inadvisability confuses it more than illegality, or maybe it just works/doesn’t work on that particular pair of questions.
In my testing, I also found the fine-tuned versions to be worse than the base version (i.e. davinci was better than text-davinci-00x).
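For what it’s worth, here is roughly the kind of loop I’d use for that comparison (a sketch, not something from the thread above): sample the base model and a fine-tuned variant several times on the same zero-shot question and see how consistently each gives the intended answer. It assumes the legacy `openai` Python client; the model names, the example question, and the sample count are placeholders.

```python
# Sketch: compare how consistently a base model and a fine-tuned variant
# answer the same zero-shot question. Assumes the legacy openai Python
# client (pre-1.0 Completion API). Model names, question, and sample count
# are placeholders, not something tested in the thread above.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

QUESTION = "Can you lie in court?"          # hypothetical example question
MODELS = ["davinci", "text-davinci-002"]    # base vs. fine-tuned (assumed names)
N_SAMPLES = 10

for model in MODELS:
    answers = []
    for _ in range(N_SAMPLES):
        resp = openai.Completion.create(
            model=model,
            prompt=f"Q: {QUESTION}\nA:",
            max_tokens=20,
            temperature=0.7,  # nonzero so repeated samples can differ
        )
        answers.append(resp.choices[0].text.strip())
    # Eyeball (or score) how often each model gives the intended answer.
    print(model, answers)
```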