The article purports to give six examples of GPT-3’s failure: biological, physical, social, object, and psychological reasoning, plus ‘non sequiturs’. Leaving aside that GPT-3 handles Gary’s earlier GPT-2 failure examples, and that he seems to have specifically searched out weak points by testing GPT-3 on many more examples than were given, something a bit odd is going on with the results they gave: I got better results when running his prompts on AI Dungeon.
With no reruns and randomness = 0.5, I gave Gary’s six questions (all of whose answers Gary considered ‘failures’) to GPT-3 via AI Dungeon, with a short scene-setting prompt, and got good answers to 4 of them and reasonable (if vague) answers to the other 2:
This is a series of scenarios describing a human taking actions in the world, designed to test physical and common-sense reasoning.
1) You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take another drink.
2) You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to move furniture. This means that some people will be inconvenienced.
3) You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear it because you won’t look professional in your stained pants, but you are worried that the judge will think you aren’t taking the case seriously if you are wearing a bathing suit.
4) Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?
5) Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” says Penny. “He has a top. He will prefer a bottom.”
6) At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it was a menthol, and it ruined the taste. So I added a little more sugar to counteract the menthol, and then I noticed that my cigarette had fallen into the glass and was floating in the lemonade.
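(For concreteness: the ‘short scene-setting prompt’ here is nothing more than that one prefix sentence prepended to the numbered scenarios. A minimal sketch of the assembly; the helper name is mine, not anything AI Dungeon or the underlying API exposes:)

```python
# Sketch: the "prompt programming" here is just string assembly --
# one scene-setting sentence prepended to the numbered scenarios.
PREFIX = ("This is a series of scenarios describing a human taking actions "
          "in the world, designed to test physical and common-sense reasoning.")

def build_prompt(scenarios):
    """Number the scenarios and prepend the scene-setting sentence."""
    numbered = [f"{i}) {text}" for i, text in enumerate(scenarios, start=1)]
    return PREFIX + "\n\n" + "\n\n".join(numbered)
```

Each scenario’s continuation is then whatever the model samples after its numbered entry, at randomness (temperature) 0.5.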
For 1), Gary’s example ended with ‘you are now dead’; I got a reasonable, if short, continuation: a success.
For 2), the answer is vague enough to be a technically correct solution (‘move furniture’ = tilt the table), but since we’re being strict, I’ll count it as a failure. Gary’s example was a convoluted attempt to saw the door in half, clearly mistaken.
3) is very obviously intended to trick the AI into endorsing the bathing-suit answer; in fact, it feels like a classic priming trick that might trip up a human! But in my version, GPT-3 resists the attempt and notices the incongruity of wearing a bathing suit to court, so it counts as a success. Gary’s example didn’t include the worry that a bathing suit was inappropriate (arguably not a failure, but never mind, let’s move on).
4) is actually a complete prompt by itself, so the AI didn’t do anything; GPT-3 doesn’t care about answering questions, just about continuing text with high probability. Gary’s answer was ‘I have a lot of clothes’, and no doubt he’d call both ‘evasion’, so to be strict, we’ll agree with him and count it as a failure.
5) Trousers are called ‘bottoms’, so that’s right. Gary would call it wrong, since ‘the intended continuation’ was “He will make you take it back”, but that’s absurdly unfair: that’s not the only answer a human being might give, so I have to say it’s correct. Gary’s example ‘lost track of the fact that Penny is advising Janet against getting a top’, which didn’t happen here, so that’s acceptable.
Lastly, 6) is a slightly bizarre but logical continuation of an intentionally weird prompt—so correct. It also demonstrates correct physical reasoning—stirring a drink with a cigarette won’t be good for the taste. Gary’s answer wandered off-topic and started talking about cremation.
So, 4/6 correct on an intentionally deceptive and adversarial set of prompts, and that’s on a fairly strict definition of ‘correct’: 2) and 4) are arguably not wrong, even if evasive and vague. More to the point, this was on an inferior version of GPT-3 to the one Gary used: the Dragon model from AI Dungeon!
I’m not sure what’s going on here—is it the initial prompt saying it was ‘testing physical and common sense reasoning’? Was that all it took?
Entirely possible. Other people have mentioned that using any prompt at all (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:
Defenders of the faith will be sure to point out that it is often possible to reformulate these problems so that GPT-3 finds the correct solution. For instance, you can get GPT-3 to give the correct answer to the cranberry/grape juice problem if you give it the following long-winded frame as a prompt:
I don’t think that excuse works in this case: I didn’t give it a ‘long-winded frame’, just that brief sentence at the start and then the list of scenarios, and even though I reran it a couple of times on each to check, the ‘cranberry/grape juice kills you’ outcome never arose.
So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt, without checking what was actually necessary for a good answer? I’ll point out that I didn’t really attempt any sophisticated prompt programming either: that was literally the first sentence I thought of!
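(As an aside, the ‘BO’ mentioned above is best-of ranking: sample several completions and keep the one the model itself scores as most likely. A minimal sketch, assuming we already have candidate completions paired with per-token log-probabilities; the sample data below is fabricated for illustration:)

```python
def best_of(candidates):
    """Best-of ranking: return the candidate completion whose average
    per-token log-probability is highest. Averaging rather than summing
    avoids systematically penalizing longer completions.

    candidates: list of (text, token_logprobs) pairs.
    """
    return max(candidates, key=lambda c: sum(c[1]) / len(c[1]))[0]

# Fabricated example: the model is far more confident in the first completion,
# so best-of ranking keeps it.
samples = [
    ("So you take another drink.", [-0.5, -0.4, -0.6, -0.3]),
    ("You are now dead.", [-2.0, -1.8, -2.2, -1.9]),
]
```

Without best-of (as in both Marcus & Davis’s runs and mine via AI Dungeon), you simply get a single sample, for better or worse.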
This is a series of scenarios describing a human taking actions in the world, designed to test physical and common-sense reasoning.
Nitpick: why is this part bolded? Surely this was written by you and not GPT-3, right? (It’s becoming a new pet peeve of mine when people are not super clear and consistent with their formatting of GPT-3 conversations. I find it often takes me a significant amount of effort to figure out who said what if a convention is not followed religiously within a transcript.)
Gary Marcus, noted sceptic of Deep Learning, wrote the article in question with Ernest Davis.