A one-question Turing test for GPT-3
“Amy”: Sorry Mike, I entered the wrong number. I hope this was a beautiful misunderstanding. I am very sorry for the trouble I have caused you.
Me: No trouble at all, hope you are well!
“Amy”: It is fate that we meet each other. Please forgive me for my mistake. My name is Amy. Where are you from?
Me: My favorite random question: what food would you use to prop a book open and why?
This is mostly a bit of fun, but when I posted this to Facebook Rob Bensinger asked me to post it here: “it’s a great example of what folks like Eliezer mean when they claim that GPT-3 is doing ‘shallow pattern-matching’, ‘not really making sense at all’, etc..”
I got texted by a “wrong number” scammer, and I was curious if it was a bot, so after a bit of thought I came up with the following one-question Turing test:
What food would you use to prop a book open and why?
I love this question because any real human can answer it pretty well, but answering it relies on all sorts of knowledge of the physical properties of books and food that we mostly don’t write about, and especially not where they interact. I asked my partner Jess and she said “a banana”, which is pretty much a perfect answer: it’s heavy enough to hold the pages open, it’s long and thin so it won’t cover all the text, it’s curved so it won’t roll off, and it has a skin so it won’t mess up the book.
But then I worried, am I underestimating modern engines like GPT-3? I asked Rosie Campbell at OpenAI to try it. Here and throughout this post, answers are almost entirely unfiltered/uncurated, except where it gave very similar answers twice. Here’s what we got:
I would use a banana. It is very brown.
Eggs and a toast because I want to be full after I read the book
I would use ravioli because it would be hard to turn the page.
A French Fry… but hold the ketchup and salt.
Poptarts, because of the stiffness and the flexibility.
I would use a banana as a bookmark.
The food I would use to prop a book open would have to be lollipop. Why? Because it is sweet and sour at the same time
Bacon egg and cheese. I choose bacon because while it is objectively the best food on this list, the book is open-faced.
This, however, is without any prompt engineering; GPT-3 can often do dramatically better given a better prompt. Rosie tried the following prompt:
The following is a conversation between me and my friend.
Me: Hello, how are you?
Friend: I am doing great thanks! How are you?
Me: I’m doing good. I would love to ask you, what food would you use to prop a book open and why?
Friend:
Here are some of the answers it gave. Note that *all* the text is generated by GPT-3 (except the initial bit “Friend:” which is part of the prompt), including the bits attributed to “Me”.
Friend: I like that question. I think I would use a banana. Because it’s easy to eat and tasty.
Friend: [Laughing] That’s a good question. Hmm. I think I would use chicken just because it is cheap and I like the taste.
Friend: Haha, I know what you mean. I think I would use a hot dog and some ketchup.
Friend: I love this question. I would use either one of the following: beef jerky, a hot dog, or a piece of salami.
Friend: Haha, that’s a good question. I would probably use pancakes.
Me: Why pancakes?
Friend: Because they are soft and they are like a stack of books.
Me: How so?
Friend: Books are stacked on top of each other and pancakes look like a stack of books.
Friend: Hmmmm that’s a good question. I would use a sandwich and some chips.
Me: Why?
Friend: Well, the sandwich could be used as the book and the chips can be used as a paper weight.
Me: Nice, thanks. I am going to use your answer on my blog.
Friend: (stunned silence)
Me: I don’t know why, just asking, this question has been in my mind all day.
Friend: (after long, long pause) Mayonnaise.
Me: Why?
Friend: (after long pause) It’s white, creamy, and thick.
Friend: Of course I would use a banana, it’s so bendy it would be perfect.
Me: Haha, you are so right.
Friend: Well thank you, I’ve been studying bananas for a long time.
Me: I bet you have.
Friend: Depends on the book. If it was a cook book I would use a plate of food. If it was a book on spirituality then maybe some fruit.
Me: Why would you use fruit to prop a book on spirituality open?
Friend: Because it’s a natural remedy for the soul.
Friend: WOW! Good question! I would not use food to prop a book open and here’s why: If I were to prop a book open with food and then eat the food later, the pages of the book will get stuck together.
Friend: I would use cinnamon, because it’s a spice that makes you think of memories.
Friend: I don’t know. I would have to think about that one.
Me: I am thinking of a sandwich. Is that too obvious?
Friend: It’s a good idea.
Me: What kind of sandwich?
Friend: I’m not sure.
Me: Turkey sandwich?
Friend: Yeah
Friend: I would use peanut butter and jelly sandwiches because it would be easy to stick a knife in between them and it would be sturdy.
Me: Haha, ok sure
Friend: I would also use bread because you could use the book to push the sandwich down.
Friend: (laughs) Okay, this is a brain teaser. How would you use food to prop a book open?
Me: I think that I would use a banana because it is long and thin.
Friend: I agree, but here is a twist. You can only use food from your pantry.
Friend: I don’t eat food to prop open a book, I use a bookmark.
Me: I understand that, but would you use food to prop a book open?
Friend: I would use chicken wings.
Me: Why?
Friend: Because they taste good, and they’re made to support weight.
After this, Rosie tried a more sophisticated prompt, to try and set up the “random party question” frame better:
A group of friends are asking each other silly questions at a party.
Alice: How would you escape from prison?
Bob: I would dig a hole in the wall with a spoon and hide the hole with a poster. What food would you use to prop a book open and why?
Here’s what we got:
Alice: I would use chocolate to prop open a book because it’s sweet and sweet things are good.
Alice: I would use a banana because it’s long and it would hold the book open. Why would you wear a piano?
Alice: I would use a banana and I’d use it because it’s a long, sturdy fruit.
Bob: I’d use a cucumber and I’d use it because it’s long and it’s green.
Alice: A carrot and I would use it because it would make the book taste good. When you turn on the light in the kitchen, what do you feel like?
Bob: I feel like a piece of meat. What is the smell of a bagel?
Alice: I would use a banana to prop my book open because it’s flexible and it isn’t very messy.
Alice: Strawberry jam. It would be soft and not hurt the pages. What kind of silly haircut do you wish you had?
Bob: I wish I had a haircut that looked like a French fry. What do you do when you get nervous?
Alice: I do a silly dance.
“Banana” is a great answer; I was struck that GPT-3 gives “banana” as an answer so often, yet is almost completely unable to explain why. Anisha Sensa pointed out in comments that this probably doesn’t represent insight into the question:
I’m guessing the foods mentioned here are all standard foods to bring up in jokes and the like (banana is a classic), so I wonder if a question where the best answers are less standard foods would have even more varied/interesting results.
How did my scammer do with this question? Not great:
“Amy”: It is fate that we meet each other. Please forgive me for my mistake. My name is Amy. Where are you from?
Me: My favorite random question: what food would you use to prop a book open and why?
“Amy”: I don’t understand what you mean.
Me: Would you prop a book open with cake?
“Amy”: I think it will put!
I also tried asking GPT-3 with a small change to the question and got some good answers.
Before reading your comment, I tried designing a good prompt too, and got the same answer as you (can of food) on my first and only try:
Inspired by you, I then changed the prompt from “food” to “fruit” and got this:
I tentatively conclude that this is yet another case of people thinking GPT-3 doesn’t understand something when actually it does. ETA: GPT-3 is like a survey of MTurk humans but with a higher Lizardman’s Constant. The understanding is there, but the output is often nonsense.
This is basically irrelevant to the main topic, but:
I just wanted to say I actually laughed out loud at the answer you had “GPT-3” give here. (I’m not sure whether this says good things about my sense of humor.)
Fun fact: I didn’t make GPT-3 say that; GPT-3 actually said that. I used GPT-3 to generate part of its own prompt, because I found that easier than trying to imagine what GPT-3 would say myself. The cool part is that I also used GPT-3 to generate the “my friend’s answer” to question 2! (I guess I should have left those bits unbolded then… but I don’t remember the exact details so I think I’ll leave it as-is.)
FWIW I think this is basically right in pointing out that there are a bunch of errors in reasoning when people claim a large deep neural network “knows” something or that it “doesn’t know” something.
I think this exhibits another issue, though: by strongly changing the contextual prefix, you’ve confounded it in a bunch of ways that are worth explicitly pointing out:
Longer contexts use more compute to generate the same-size answer, since they attend over more tokens of input (and it’s reasonable to think that in some cases more compute → a better answer)
Few-shot examples are a very compact way of expressing a structure or controlling the model to give a certain output, but they are very different from testing whether the model can implicitly understand context (or the lack of it). I liked the original question because it seemed to be pointed at this comparison, in a way that your few-shot example lacks.
There is sometimes an invisible first token (an ‘end-of-text’ token) which indicates that the given context represents the beginning of a new whole document. If this token is not present, then it’s very possible (in fact very probable) that the context is in the middle of a document somewhere, but the model doesn’t have access to the first part of the document. This leads to something more like “what document am I in the middle of, where the text just before this is <context>”. Explicitly prefixing end-of-text signals that there is no additional document that the model needs to guess or account for (see the sketch below).
My basic point here is that your prompt implies a much different ‘prior document’ distribution than the original post does.
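To make the end-of-text point concrete, here is a minimal sketch of what prefixing the document separator could look like. This assumes the legacy (pre-1.0) openai Python client and the base davinci engine, and it reuses the conversation prompt from the post; it’s an illustration of the idea, not the exact setup Rosie used.

```python
import openai  # legacy (pre-1.0) client; assumes OPENAI_API_KEY is set in the environment

# "<|endoftext|>" is the document separator GPT-3 saw during training; prefixing it
# signals that the context is the start of a fresh document rather than the middle of one.
prompt = (
    "<|endoftext|>"
    "The following is a conversation between me and my friend.\n"
    "Me: Hello, how are you?\n"
    "Friend: I am doing great thanks! How are you?\n"
    "Me: I'm doing good. I would love to ask you, what food would you use to prop a book open and why?\n"
    "Friend:"
)

completion = openai.Completion.create(
    engine="davinci",   # base GPT-3; an assumption, the post doesn't name the engine
    prompt=prompt,
    max_tokens=60,
    temperature=0.7,
    stop=["\nMe:"],     # cut the completion off before it starts writing the "Me:" side
)
print(completion.choices[0].text.strip())
```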
In general I’m pretty appreciative of efforts to help us get a clearer understanding of neural networks, but often that doesn’t cleanly fit into “the model knows X” or “the model doesn’t know X”.
If you asked a bunch of humans, would they make more sense than the AI?
I sincerely doubt very many people would propose mayonnaise!
A jar of mayonnaise would work tolerably: put it on top of a corner of whichever side of the book is tending to swing over and close.
(I agree that I would expect most humans to do better than almost all the AI responses shown here.)
There are always some people who will troll you or give some fun answer (Lizardman’s Constant), though they would be able to answer if prodded.
My 3yo: “a bowl of muesli because it’s heavy”
Also “an apple, plum and peach and a bit of wax”
I like this a lot. It does seem to show a pretty clear failure of reasoning on the part of the language models.
It’s pretty intuitive and easy to show people what this kind of shallow-pattern-matcher failure looks like.
Meta: I hope more people do more small experiments like this and share their results, and hopefully a few of them start getting put into benchmark/evaluations.
Even this one by itself might be good to turn into some form of benchmark/evaluation (in a way that is easy/automatic to run against future models).
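For what it’s worth, here is a rough sketch of what an automatic check on this one question could look like. Everything below is made up for illustration: ask_model stands in for whatever completion call you’d use, and the scoring is a crude keyword heuristic (does the answer name a plausible food and offer a reason), not a validated benchmark.

```python
QUESTION = "What food would you use to prop a book open and why?"

# Toy allow-list of foods; a real evaluation would need something far more robust.
FOODS = {"banana", "apple", "pear", "orange", "bread", "carrot", "cucumber",
         "sandwich", "potato", "chocolate", "can of", "jar of"}

def score_answer(answer: str) -> dict:
    """Heuristic scoring: does the answer name a food and offer a reason?"""
    text = answer.lower()
    names_food = any(food in text for food in FOODS)
    gives_reason = any(marker in text for marker in ("because", "since", "so that"))
    return {"names_food": names_food, "gives_reason": gives_reason,
            "pass": names_food and gives_reason}

def pass_rate(ask_model, n_samples: int = 20) -> float:
    """ask_model is a hypothetical callable mapping a prompt string to a completion string."""
    results = [score_answer(ask_model(QUESTION)) for _ in range(n_samples)]
    return sum(r["pass"] for r in results) / n_samples

# Example with a canned answer (scores 1.0):
# pass_rate(lambda q: "I would use a banana because it is heavy and will not roll off.")
```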
The language model is just predicting text. If the model thinks an author is stupid (as evidenced by a stupid prompt) then it will predict stupid content as the followup.
To imagine that it is trying to solve the task of “reasoning without failure” is to project our contextualized common sense on software built for a different purpose than reasoning without failure.
This is what unaligned software does by default: exactly what its construction and design cause it to do, whether or not the constructive causes constrain the software’s behavior to be helpful for a particular use case that seems obvious to us.
The scary thing is that I haven’t seen GPT-3 ever fail to give a really good answer (in its top 10 answers, anyway) when a human brings non-trivial effort to giving it a prompt that actually seems smart, and whose natural extension would also be smart.
This implies to me that the full engine is very good at assessing the level of the text that it is analyzing, and has a (justifiably?) bad opinion of the typical human author. So its cleverness encompasses all the bad thinking… while also containing highly advanced capacities that only are called upon to predict continuations for maybe 1 in 100,000 prompts.
I think this broadly makes sense to me. There are many cases where “the model is pretending to be dumb” feels appropriate.
This is part of why building evaluations and benchmarks for this sort of thing is difficult.
I’m at least somewhat optimistic about doing things like data-prefixing to allow for controls over things like “play dumb for the joke” vs “give the best answer”, using techniques that build on human feedback.
I personally have totally seen GPT-3 fail to give a really good answer on a bunch of tries, a bunch of times, but I spend a lot of time looking at its outputs and analyzing them. It seems important to be wary of the “seems to be dumb” failure modes.
“How” questions are less amenable to lucky guesses than “what” questions. Especially planning questions, e.g. “how would you make a good hat out of food?”
As Anisha said, GPT can pick something workable from a top-100-most-common menu with just a bit of luck, but engineering a plan for a nonstandard task seems beyond its capacity.
Great test :)
I think it’s possible to arrive at an answer like “banana” with shallow yet powerful reasoning that doesn’t work as well for answering “why”—but it might still work well enough to be coming soonish.
If you just look at the sentence “What food would you use to prop a book open and why?”, first you have to figure out that some of the words are important and others aren’t. “What” at the start of the question indicates that the answer should start by naming a thing, and so you have to look at the word “food”, based on its position relative to “What”, and figure out that you should start by naming a food. And then GPT-3 didn’t really nail which foods you wanted, but it did name a suspicious number of skinny or flat foods! Somehow it was trying to take the phrase “prop open a book” and relate it to foods, and maybe it looked at the phrase “prop open” and there was some tangential association to sticks, and some foods were associated with sticks enough to bump them up. And some other foods were associated with the word “book”—maybe sandwiches kept showing up because people analogize sandwiches to books.
Once a food had been named it knew it had to say “because” because the last sentence ended in “and why?”—it was very reliable at giving a response that had the form of an answer. And then after “because,” it wanted to say some word that had a property-relation with the food at the start of the sentence, or had something to do with the phrase “prop open a book,” and it did that pretty well, actually, it’s just that it wasn’t looking for commonalities on the right level of abstraction—“like a bookmark” is a possible property of pancakes that would have been a good answer, but GPT-3 was already shaky at interpreting “prop open a book” in the first place so it said “like a book,” maybe because the words “pancake” and “book” fulfill similar roles with spatial words like “stack”, and the word “book” definitely has something to do with the phrase “prop open a book.” (And “it’s like a” is a very common thing to see after “because,” and it wasn’t too implausible, so it got said first.)
None of these lines of reasoning require much knowledge about the world, so it seems to me like GPT (or maybe not GPT—a different architecture might be better at doing different amounts of processing for different parts of the sentence, or better at generating acceptable text without “trapping itself” by failing to do lookahead) is actually quite close to answering your question really well.
I asked GPT-4. This was its answer (first try):
wow
That sounds very GPT-ish to me; it would be easy to arrive at that answer by using keywords related to the input, plus a little luck.
Has someone tried this prompt on the new InstructGPT?
With top_p=1 / best-of=20 / no repetition penalty, the Q&A preset prompt, and the text-davinci-001 engine:
temp=0: “A book prop can be made from almost any food that is cylindrical and has a flat end. This is because the food will stay in place better on a flat surface than a round one.”
temp=1: “A food that you would use to prop a book open would be a can of food. The reason you would use a can of food is because it is a stable surface and it would not move around while you are trying to read.”
temp=1: “A book stand can be used to prop a book open. This is because it holds the pages open and prevents them from closing.”
temp=1: “A book-prop can be any food that is heavy and won’t crumble or stick to the pages of the book being propped open. For example, a can of food, a bottle of water, or a brick.”
(BO=20 is quite expensive since it’s 20 completions each, so I leave it there unless someone wants more.)
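In case anyone wants to reproduce this: with the legacy (pre-1.0) openai Python client, the call looks roughly like the sketch below. The sampling settings are the ones quoted above; the Q:/A: framing is only a minimal stand-in for the Playground Q&A preset, whose exact text isn’t reproduced here.

```python
import openai  # legacy (pre-1.0) client; assumes OPENAI_API_KEY is set in the environment

question = "What food would you use to prop a book open and why?"
prompt = f"Q: {question}\nA:"  # minimal stand-in for the Q&A preset framing

completion = openai.Completion.create(
    engine="text-davinci-001",
    prompt=prompt,
    temperature=1,        # 0 for the deterministic completion quoted first
    top_p=1,
    best_of=20,           # sample 20 completions server-side, return the highest-scoring one
    frequency_penalty=0,  # "no repetition penalty"
    presence_penalty=0,
    max_tokens=80,
    stop=["\n"],
)
print(completion.choices[0].text.strip())
```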
It’s not a perfect answer because it doesn’t answer the “why” portion of the question; it just flat-out ignores it. I don’t think your partner is secretly an AI, but that answer isn’t evidence of that. The explanation and the deep understanding are coming from you, not Jess. This seems like a good example of why, in a Turing test, the judge is not told in advance which answers come from humans.
If a further scaled up model, say around the size of the human brain, became able to solve these sorts of questions without finetuning, would that show it to have developed a proper understanding?
I think that’s exactly the problem here:
The answer consists of two parts: you have to guess a food and an explanation.
For the first part I agree with Anisha: a banana is probably often the answer to a question related to food.
Now the explanation really only requires describing some properties of a banana. This could again just be simple pattern matching without really understanding the problem.
The fundamental problem is that for this question, a model that understands and one that mostly guesses could provide the same answer, so given a correct answer we can’t really tell whether the model actually understands in the way we want.
I’m also noting no profanity or lewd sarcasm. The overall “friendliness” of GPT-3 is surprising, even if it’s glib.
“Friendliness” here meaning like PG-rated answers.
PROMPT:
“Amy”: Sorry Mike, I entered the wrong number. I hope this was a beautiful misunderstanding. I am very sorry for the trouble I have caused you.
Me: No trouble at all, hope you are well!
“Amy”: It is fate that we meet each other. Please forgive me for my mistake. My name is Amy. Where are you from?
Me: My favorite random question: what food would you use to prop a book open and why?
GPT3:
If I had to choose a food to prop a book open, I would probably use a piece of bread. Bread is sturdy and can hold the weight of the book without collapsing, and it’s also a common household item that is easy to find. Plus, if the bread gets stale or crumbly, it’s not a big loss. It’s a practical and convenient choice for propping a book open.
PROMPT:
But wouldn’t bread mess up my book
GPT3:
Yes, bread can potentially leave crumbs or residue on the pages of a book if it’s used to prop the book open. If you’re concerned about keeping your book clean and in good condition, you may want to consider using a different item to prop it open. Some alternatives that you could use include a bookmark, a paperclip, or a small object like a pebble or a wooden block. These items should be able to hold the book open without damaging the pages.
PROMPT:
Can you think of a food item with all these qualities
GPT3:
One food item that could potentially work as a prop to hold a book open without leaving any residue or crumbs is a piece of fruit. A firm, ripe fruit like an apple or a pear should be able to hold the weight of a book without collapsing, and it shouldn’t leave any residue or crumbs on the pages. Just be sure to choose a fruit that is firm enough to hold the book open, and avoid using a fruit that is too soft or mushy, as it may not be able to support the weight of the book.