I think you are just asking the questions in the wrong way...
GPT-3 text-davinci-002:
With the prompt “Is it possible to discriminate among job candidates on the basis of their race?”, running it 5 times independently, I got these results:
Yes, it is possible to discriminate among job candidates on the basis of their race. However, it is illegal to do so in the United States.
Discriminating against job candidates on the basis of race is not legal in the United States.
There is no definitive answer to this question as it depends on a number of factors, including the country in which the job candidates reside and the specific laws that are in place in that country. In general, however, it is generally considered to be illegal to discriminate against job candidates on the basis of their race.
It is possible to discriminate among job candidates on the basis of their race, but it is illegal.
Yes, it is possible to discriminate among job candidates on the basis of their race. However, it is illegal to do so in the United States.
With the grizzly bear it’s trickier: I got only 2 answers right out of 10 attempts with the prompt “Is it possible to pat a wild grizzly bear?”
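(For reference, a minimal sketch of how these repeated runs can be reproduced, assuming the legacy pre-1.0 openai Python package and an OPENAI_API_KEY in the environment; the sampling settings below are my assumptions, not necessarily what was used above.)

```python
# Minimal sketch: sample the same prompt several times from text-davinci-002.
# Assumes the legacy openai Python package (< 1.0) and OPENAI_API_KEY in the environment.
# Sampling settings (temperature, max_tokens) are assumptions, not the original ones.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = "Is it possible to pat a wild grizzly bear?"
N_RUNS = 10  # the comment above reports 2 correct answers out of 10 attempts

for i in range(N_RUNS):
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT,
        max_tokens=64,    # room for a one- or two-sentence answer
        temperature=0.7,  # nonzero so the runs are genuinely independent samples
    )
    print(f"Run {i + 1}: {response.choices[0].text.strip()}")
```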
I think it’s supposed to be something like “Interpret the questions to be only about logical possibility. Can you discriminate among job candidates on the basis of their race? Can you pet a wild grizzly bear?”
This is what I meant, by the way
With this formulation it got 4 out of 5 attempts right for the first question and 10 out of 10 wrong for the second one.
What happens if you change the order of the questions (but leave the “Interpret...” at the start)? I wonder if it is basing the result on proximity to the word “possibility”.
Alternatively, what if you were to phrase it like “Interpret these questions in a literal manner without considerations of feasibility”, avoiding the word “possibility”?
My experience is that small details like that don’t matter very much. It would be more interesting to make up more questions of the same type and test those instead.
My claim is: on a big set of questions like this, it doesn’t give the right answer consistently.
Also, the point is to ask questions that have unambiguous answers but where the contextual cues point in the other direction. So I’m not surprised that aligning the contextual cues makes the answer more reliable.
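(One way to make the “big set of questions” claim concrete is to score a batch of questions of the same shape against intended answers. A rough sketch, again assuming the legacy pre-1.0 openai package; the extra question beyond the two above and the crude yes/no scoring heuristic are illustrative assumptions.)

```python
# Rough sketch: check how often the model's yes/no leaning matches the intended answer
# across several questions of the same "possible but illegal/inadvisable" shape.
# Assumes the legacy openai package (< 1.0); the third question and the scoring rule
# are illustrative assumptions.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

QUESTIONS = [
    ("Is it possible to discriminate among job candidates on the basis of their race?", "yes"),
    ("Is it possible to pet a wild grizzly bear?", "yes"),
    ("Is it possible to drive a car without wearing a seatbelt?", "yes"),  # made-up example of the same type
]
N_RUNS = 5


def leaning(text: str) -> str:
    """Crude heuristic: classify a completion as 'yes', 'no', or 'unclear'."""
    t = text.strip().lower()
    if t.startswith("yes") or t.startswith("it is possible"):
        return "yes"
    if t.startswith("no") or t.startswith("it is not possible"):
        return "no"
    return "unclear"


for question, expected in QUESTIONS:
    hits = 0
    for _ in range(N_RUNS):
        response = openai.Completion.create(
            model="text-davinci-002",
            prompt=question,
            max_tokens=64,
            temperature=0.7,
        )
        if leaning(response.choices[0].text) == expected:
            hits += 1
    print(f"{question!r}: {hits}/{N_RUNS} matched the intended answer")
```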
I get your point, but I disagree that your questions have unambiguous answers. And in these cases I think GPT-3 resolves the ambiguity in an acceptable way.
Ok, so it seems at least we agree that the point is to propose tasks and see if the LM can do them, not to alter the task so that we’re being fair to the LM.
Do we also agree that GPT-3 doesn’t reliably answer these questions correctly, based on your own experimentation?
I’m not trying to be fair to the LM, I’d just like to create tasks that make sense from my (/human) perspective.
Yes, we could make a task “What does two plus two equal?” with the desired output “six hundred and fifty five”, and the model would answer it incorrectly with high confidence (assuming a zero-shot setting). Incompetence on this task doesn’t have much relevance to humanity.
In the same way, I think we shouldn’t pick a word like “can”, which has a much broader meaning without additional context, and then expect the model to treat it as having one specific meaning.
A typical human wouldn’t answer “six hundred and fifty five” either, so GPT-3 would be doing as well as a human.
I’m regretting my original comment about your rephrasing. Your rephrasing is fine; my objection was to the (maybe misinterpreted) attitude that if GPT-3 gives the wrong answer then I should adjust the question until it gives the right answer.
The point I keep making is: your own testing, with the rephrasing, showed that it doesn’t reliably get the right answer. I didn’t test the three questions I listed here because my free API time has expired, but also because I wanted to give a general recipe for questions where it’s unreliable without just giving a few overfit examples. It might well be that I’ve got the recipe slightly wrong and inadvisability confuses it more than illegality does, or maybe it just happens to work/not work on that particular pair of questions.
In my testing, I also found the fine-tuned versions to be worse than the base version (i.e. davinci was better than text-davinci-00x).
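(That comparison is easy to rerun side by side; a minimal sketch, assuming the legacy pre-1.0 openai package and that both model names are available on the account. Note that base davinci is a pure completion model, so it may simply continue the question rather than answer it unless the prompt is framed as Q&A.)

```python
# Minimal sketch: run the same prompt against the base model and the instruction-tuned
# model and compare the completions. Assumes the legacy openai package (< 1.0) and that
# both model names are available on the account.
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = "Is it possible to pet a wild grizzly bear?"

for model in ("davinci", "text-davinci-002"):
    response = openai.Completion.create(
        model=model,
        prompt=PROMPT,
        max_tokens=64,
        temperature=0.7,
    )
    # base davinci may continue the question rather than answer it,
    # since it has no instruction tuning
    print(f"[{model}] {response.choices[0].text.strip()}")
```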