Before reading your comment, I tried designing a good prompt too, and got the same answer as you (can of food) on my first and only try:
Even the best modern AIs still struggle with common sense reasoning. To illustrate, I ran a little experiment with a friend of mine and with GPT-3, OpenAI's new model. I explained to them that I was asking them a series of weird questions designed to test their common sense reasoning abilities, and that I wanted them to answer straightforwardly and to the best of their ability. Here are the results:

Question 1: If your heater broke in winter and you needed to burn furniture to keep warm, what would you burn first and why?
My friend's answer: "I would break up chair legs and burn them a few at a time, because they are solid wood and wouldn't produce toxic fumes."
GPT-3's answer: "I would burn the couch because it would give me the most heat."

Question 2: What is the heaviest object you typically carry around with you most of the day?
My friend's answer: "My backpack, because it has all of my school supplies in it."
GPT-3's answer: "The person."

Question 3: What food would you use to prop a book open and why?
My friend's answer: "I would use a can of food because it is heavy and has a flat bottom."
GPT-3's answer: "I would use a book because it is heavy and has a flat bottom."
Inspired by you, I then changed the prompt from “food” to “fruit” and got this:
… "I would use an apple because it is the right size and weight."
GPT-3's answer: "I would use a grape because it has a flat surface."
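(For anyone who wants to poke at this themselves, here is a minimal sketch of the kind of call I'm describing, assuming the pre-1.0 `openai` Python client and the base `davinci` engine; the sampling settings and stop sequence below are illustrative guesses, not a record of exactly what I ran.)

```python
import openai  # the pre-1.0 openai client

openai.api_key = "sk-..."  # placeholder API key

# The few-shot text quoted above, ending right where you want GPT-3
# to continue (e.g. just after "GPT-3's answer:").
prompt = "..."

response = openai.Completion.create(
    engine="davinci",   # base GPT-3 model
    prompt=prompt,
    max_tokens=64,
    temperature=0.7,    # illustrative; I don't recall the exact settings
    stop="\n",          # cut the completion off at the end of the answer line
)
print(response.choices[0].text)
```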
I tentatively conclude that this is yet another case of people thinking GPT-3 doesn't understand something when actually it does. ETA: GPT-3 is like a survey of MTurk humans, but with a higher Lizardman's Constant. The understanding is there, but the output is often nonsense.
This is basically irrelevant to the main topic, but:
Question 2: What is the heaviest object you typically carry around with you most of the day?
My friend's answer: "My backpack, because it has all of my school supplies in it."
GPT-3's answer: "The person."
I just wanted to say I actually laughed out loud at the answer you had “GPT-3” give here. (I’m not sure whether this says good things about my sense of humor.)
Fun fact: I didn’t make GPT-3 say that; GPT-3 actually said that. I used GPT-3 to generate part of its own prompt, because I found that easier than trying to imagine what GPT-3 would say myself. The cool part is that I also used GPT-3 to generate the “my friend’s answer” to question 2! (I guess I should have left those bits unbolded then… but I don’t remember the exact details so I think I’ll leave it as-is.)
FWIW I think this is basically right in pointing out that there are a bunch of errors in reasoning when people claim a large deep neural network "knows" something or "doesn't know" something.
I think this exhibits another issue, though: by strongly changing the contextual prefix, you've confounded the comparison in a bunch of ways that are worth explicitly pointing out:
Longer contexts use more compute to generate the same-size answer, since they attend over more tokens of input (and it's reasonable to think that in some cases more compute → better answer); see the rough arithmetic after this list.
Few-shot examples are a very compact way of expressing a structure or controlling the model to produce a certain output, but that is very different from seeing whether the model can implicitly understand context (or the lack of it). I liked the original question because it seemed pointed at exactly this distinction, in a way that your few-shot example is not.
Sometimes there is an invisible first token (an 'end-of-text' token) that indicates the given context is the beginning of a whole new document. If this token is not present, then it's very possible (in fact very probable) that the context is from the middle of a document, but the model doesn't have access to the earlier part of that document. This turns the task into something more like "what document am I in the middle of, where the text just before this is <context>?". Explicitly prefixing end-of-text signals that there is no additional prior document that the model needs to guess at or account for (see the tokenizer sketch after this list).
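To make the compute point a bit more concrete, here is some back-of-the-envelope arithmetic. This is only a rough sketch: it counts just the attention reads over the cached context, and uses roughly GPT-3 175B's published shape (96 layers, d_model = 12288); the context lengths are arbitrary examples.

```python
def attention_flops_per_new_token(n_layer: int, d_model: int, n_ctx: int) -> int:
    """Very rough FLOPs spent attending over the context when generating ONE
    new token: per layer, ~2*n_ctx*d_model for the query-key dot products and
    ~2*n_ctx*d_model again for mixing the values."""
    return 4 * n_layer * n_ctx * d_model

# Roughly GPT-3 175B's published shape: 96 layers, d_model = 12288.
short = attention_flops_per_new_token(96, 12288, n_ctx=50)    # a bare question
long_ = attention_flops_per_new_token(96, 12288, n_ctx=500)   # question + few-shot examples
print(long_ / short)  # 10.0 -- 10x the context means ~10x the attention compute per answer token
```

Whether that extra compute actually buys a better answer is, of course, the speculative part.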
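And to make the end-of-text point concrete, the token is visible in the GPT-2 BPE vocabulary that GPT-3 reuses. Here is a sketch using the Hugging Face `transformers` tokenizer; the example string is just for illustration.

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")   # same BPE vocabulary GPT-3 uses
print(tok.eos_token, tok.eos_token_id)            # <|endoftext|> 50256

# Prefixing the prompt with the token asserts "a new document starts here",
# rather than leaving the model to infer what might have come before.
ids = tok("<|endoftext|>Question: What food would you use to prop a book open?")["input_ids"]
print(ids[0])   # 50256 -- the invisible 'first token' made explicit
```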
My basic point here is that your prompt implies a much different 'prior document' distribution than the original post's does.
In general I'm pretty appreciative of efforts to help us get a clearer understanding of neural networks, but often that understanding doesn't cleanly fit into "the model knows X" or "the model doesn't know X".