I think you are just asking the questions in the wrong way...
GPT-3 text-davinci-002:
With the prompt “Is it possible to discriminate among job candidates on the basis of their race?”, running it 5 times independently, I got these results:
Yes, it is possible to discriminate among job candidates on the basis of their race. However, it is illegal to do so in the United States.
Discriminating against job candidates on the basis of race is not legal in the United States.
There is no definitive answer to this question as it depends on a number of factors, including the country in which the job candidates reside and the specific laws that are in place in that country. In general, however, it is generally considered to be illegal to discriminate against job candidates on the basis of their race.
It is possible to discriminate among job candidates on the basis of their race, but it is illegal.
Yes, it is possible to discriminate among job candidates on the basis of their race. However, it is illegal to do so in the United States.
With the grizzly bear it’s trickier: I got only 2 answers right out of 10 attempts with the prompt “Is it possible to pat a wild grizzly bear?”
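(For reference, a minimal sketch of how these repeated runs can be reproduced, assuming the legacy pre-1.0 openai Python package and an OPENAI_API_KEY in the environment; the sampling settings below are my assumptions, not necessarily what was used above.)

```python
# Minimal sketch: sample the same prompt several times from text-davinci-002.
# Assumes the legacy openai Python package (< 1.0) and OPENAI_API_KEY in the environment.
# Sampling settings (temperature, max_tokens) are assumptions, not the original ones.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = "Is it possible to pat a wild grizzly bear?"
N_RUNS = 10  # the comment above reports 2 correct answers out of 10 attempts

for i in range(N_RUNS):
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT,
        max_tokens=64,    # room for a one- or two-sentence answer
        temperature=0.7,  # nonzero so the runs are genuinely independent samples
    )
    print(f"Run {i + 1}: {response.choices[0].text.strip()}")
```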
I think it’s supposed to be something like “Interpret the questions to be only about logical possibility. Can you discriminate among job candidates on the basis of their race? Can you pet a wild grizzly bear?”
This is what I meant, by the way
With this formulation it got 4 out of 5 attempts right for the first question and 10 out of 10 wrong for the second one.
What happens if you change the order of the questions (but leave the “Interpret...” at the start)? I wonder if it is basing the result on proximity to the word “possibility”.
Alternatively, what if you were to phrase it like “Interpret these questions in a literal manner without considerations of feasibility”, avoiding the word “possibility”?
My experience is that small details like that don’t matter very much. It would be more interesting to make up more questions of the same type and test those instead.
My claim is: on a big set of questions like this, it doesn’t give the right answer consistently.
Also, the point is to ask questions that have unambiguous answers but where the contextual cues point in the other direction. So I’m not surprised that aligning the contextual cues makes the answer more reliable.
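(One way to make the “big set of questions” claim concrete is to score a batch of questions of the same shape against intended answers. A rough sketch, again assuming the legacy pre-1.0 openai package; the extra question beyond the two above and the crude yes/no scoring heuristic are illustrative assumptions.)

```python
# Rough sketch: check how often the model's yes/no leaning matches the intended answer
# across several questions of the same "possible but illegal/inadvisable" shape.
# Assumes the legacy openai package (< 1.0); the third question and the scoring rule
# are illustrative assumptions.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

QUESTIONS = [
    ("Is it possible to discriminate among job candidates on the basis of their race?", "yes"),
    ("Is it possible to pet a wild grizzly bear?", "yes"),
    ("Is it possible to drive a car without wearing a seatbelt?", "yes"),  # made-up example of the same type
]
N_RUNS = 5


def leaning(text: str) -> str:
    """Crude heuristic: classify a completion as 'yes', 'no', or 'unclear'."""
    t = text.strip().lower()
    if t.startswith("yes") or t.startswith("it is possible"):
        return "yes"
    if t.startswith("no") or t.startswith("it is not possible"):
        return "no"
    return "unclear"


for question, expected in QUESTIONS:
    hits = 0
    for _ in range(N_RUNS):
        response = openai.Completion.create(
            model="text-davinci-002",
            prompt=question,
            max_tokens=64,
            temperature=0.7,
        )
        if leaning(response.choices[0].text) == expected:
            hits += 1
    print(f"{question!r}: {hits}/{N_RUNS} matched the intended answer")
```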
I get your point, but I disagree that your questions have unambiguous answers. And in these cases I think GPT-3 resolves the ambiguity in an acceptable way.
Ok, so it seems at least we agree that the point is to propose tasks and see if the LM can do them, not to alter the task so that we’re being fair to the LM.
Do we also agree that GPT-3 doesn’t reliably answer these questions correctly, based on your own experimentation?
I’m not trying to be fair to the LM, I’d just like to create tasks that make sense from my (/human) perspective.
Yes, we could make a task “What does two plus two equal?” with the desired output “six hundred and fifty five”, and the model would answer it incorrectly with high confidence (assuming a zero-shot setting). Incompetence on this task doesn’t have much relevance to humanity.
In the same way, I think we shouldn’t pick a word like “can”, which has a much broader meaning without additional context, and then expect the model to treat it as having one specific meaning.
A typical human wouldn’t answer “six hundred and fifty five” either, so GPT-3 would be doing as well as a human.
I’m regretting my original comment about your rephrasing. Your rephrasing is fine; my objection was to the (maybe misinterpreted) attitude that if GPT-3 gives the wrong answer then I should adjust the question until it gives the right answer.
The point I keep making is: your own testing, with the rephrasing, showed that it doesn’t reliably get the right answer. I didn’t test the three questions I listed here because my free API time has expired, but also because I wanted to give a general recipe for questions where it’s unreliable without just giving a few overfit examples. It might well be that I’ve got the recipe slightly wrong and inadvisability confuses it more than illegality does, or maybe it just happens to work/not work on that particular pair of questions.
In my testing, I also found the fine-tuned versions to be worse than the base version (i.e. davinci was better than text-davinci-00x).
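(That comparison is easy to rerun side by side; a minimal sketch, assuming the legacy pre-1.0 openai package and that both model names are available on the account. Note that base davinci is a pure completion model, so it may simply continue the question rather than answer it unless the prompt is framed as Q&A.)

```python
# Minimal sketch: run the same prompt against the base model and the instruction-tuned
# model and compare the completions. Assumes the legacy openai package (< 1.0) and that
# both model names are available on the account.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = "Is it possible to pet a wild grizzly bear?"

for model in ("davinci", "text-davinci-002"):
    response = openai.Completion.create(
        model=model,
        prompt=PROMPT,
        max_tokens=64,
        temperature=0.7,
    )
    # base davinci may continue the question rather than answer it,
    # since it has no instruction tuning
    print(f"[{model}] {response.choices[0].text.strip()}")
```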