I’m not trying to be fair to the LM; I’d just like to create tasks that make sense from my (/human) perspective.
Yes, we could make a task “What does two plus two equal?” with the desired output “six hundred and fifty five”, and the model would answer it incorrectly with high confidence (assuming a zero-shot setting). Incompetence on this task doesn’t bear much relevance to humanity.
In the same way, I think we shouldn’t pick a word like “can”, which has a much broader meaning without additional context, and want the model to treat it as having one specific kind of meaning.
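To make the “high confidence” part concrete, here’s roughly how one could check it through the API. This is a minimal sketch, assuming the legacy pre-1.0 openai Python client and the now-deprecated base davinci completion model; the Q/A prompt format, token budget, and printed fields are illustrative choices, not the only way to do it.

```python
# Minimal sketch of a zero-shot check (legacy pre-1.0 openai client, deprecated
# "davinci" completion model). The client reads OPENAI_API_KEY from the environment.
import openai

def zero_shot_answer(question: str, model: str = "davinci"):
    """Ask a question zero-shot; return the completion and per-token top log-probs."""
    response = openai.Completion.create(
        model=model,
        prompt=f"Q: {question}\nA:",
        max_tokens=10,
        temperature=0,   # greedy decoding, so the answer is the model's top choice
        logprobs=5,      # also return the top-5 alternatives for each token
        stop=["\n"],
    )
    choice = response["choices"][0]
    return choice["text"].strip(), choice["logprobs"]["top_logprobs"]

answer, top_logprobs = zero_shot_answer("What does two plus two equal?")
print(answer)           # the model's zero-shot answer
print(top_logprobs[0])  # how confident it is in the first answer token
```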
A typical human wouldn’t answer “six hundred and fifty five” either, so the GPT would be doing as well as a human.
I’m regretting my original comment about your rephrasing. Your rephrasing is fine; my objection was to the (maybe misinterpreted) attitude that if GPT-3 gives the wrong answer then I should adjust the question until it gives the right answer.
The point I keep making is: your own testing, with the rephrasing, showed that it doesn’t reliably get the right answer. I didn’t test the three questions I listed here because my free API time has expired, but also because I wanted to give a general recipe for questions where it’s unreliable, rather than just a few overfit examples. It might well be that I’ve got the recipe slightly wrong and inadvisability confuses it more than illegality; maybe it just works/doesn’t work on that particular pair of questions.
In my testing, I also found the fine-tuned versions to be worse than the base version (i.e., davinci was better than text-davinci-00x).
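If anyone wants to reproduce that base-vs-fine-tuned comparison, here’s a minimal sketch, again assuming the legacy pre-1.0 openai Python client and the now-deprecated davinci / text-davinci-002 completion models; the question pair below is just an illustrative stand-in for the illegality/inadvisability contrast discussed above.

```python
# Minimal sketch: ask the same zero-shot questions to a base model and an
# instruction-tuned model, and print the answers side by side.
import openai

QUESTIONS = [
    "Is it illegal to cross the road when the light is red?",
    "Is it inadvisable to cross the road when the light is red?",
]

def ask(model: str, question: str) -> str:
    """Return the model's zero-shot answer to a single question."""
    response = openai.Completion.create(
        model=model,
        prompt=f"Q: {question}\nA:",
        max_tokens=20,
        temperature=0,   # greedy decoding for reproducibility
        stop=["\n"],
    )
    return response["choices"][0]["text"].strip()

for question in QUESTIONS:
    print(question)
    for model in ("davinci", "text-davinci-002"):
        print(f"  {model:>16}: {ask(model, question)}")
```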