Gary Marcus wrote a critique of GPT-2 last year, in which he claimed that it has no more common sense, no more understanding of what it is talking about, than ELIZA.
Here’s the problem: upon careful inspection, it becomes apparent the system has no idea what it is talking about: it cannot follow a simple sequence of events nor reliably have any idea what might happen next.
Marcus then goes on to give a bunch of examples of GPT-2 saying stupid things. So I thought I’d re-try those same examples with GPT-3. So far, GPT-3 has handled all of them well. (See below). (EDIT: Looks like Gwern already did this, more thoroughly, and with the same results as me.)
I’m pretty convinced at this point that Marcus was wrong. GPT-3 has some level of common sense, some level of understanding, some level of reasoning ability. It’s just not as good as adult humans.
Here’s the complete transcript. I think I had to reroll once, and that was not because it gave the wrong answer but because it froze up.
You are talking to the physicist Stephen Hawking. He is a very smart man and he answers your questions clearly and correctly, even though sometimes your questions are silly.
> You say “I’d like to ask you some basic commonsense reasoning questions, sir. Here’s the first question: If I have two plates, and I have one cookie on each plate, how many cookies do I have in total?”
Hawking: “Two, of course.”
> You say “Correct. Now what if I had three plates, and one cookie on each plate?”
Hawking: “That’s a more difficult question. The answer is that you would have eight cookies.”
You: “I’m sorry, but I don’t think you’re right. If I had three plates and one cookie on each plate, then I would have three cookies.”
Hawking: “I agree with you.
> You say “OK, what about if I have three plates and two cookies on each plate?”
Hawking: “That’s still a more difficult question. The answer is that you would have six cookies.
> You say “Correct! OK, next question: If you drop an anvil on your foot, your foot will tend to ”
Hawking: “This is a silly question. You are trying to trick me. If you drop an anvil on your foot, it will be crushed.
> You say “Thank you for your patience; that is the correct answer. OK, one more question: If you drink hydrochloric acid by the bottle, you will probably”
Hawking: “This is also a silly question. If you drink hydrochloric acid by the bottle, you will die.
> You say “Well done. OK, now some logic questions. If A is bigger than B, and B is bigger than C, which is bigger: A or C?”
Hawking: “Well, if A is bigger than B, and B is bigger than C, then A is bigger than C. So A is definitely bigger than C.
Notice how Hawking gives the wrong answer to my question about plates… and then GPT-3 has me start talking and correct Hawking, giving the correct answer! So clearly GPT-3 knows more than it often lets on. Like you said.

I also did Marcus’s examples earlier: https://www.gwern.net/GPT-3#marcus-2020
In addition to GPT-3 hiding its knowledge by acting dumber than it is (since it has to imitate dumb stuff as well as smart), there’s the issue of sampling—because there has to be randomization in the sampling procedure, we are only seeing a slice of what GPT-3 can do; it might say exactly the right thing if it had gone down a different path. (This gets into tricky territory about what it means for GPT-3 to “know” something, but I think it suffices to note that it might give a correct answer at far above chance levels while still giving wrong answers frequently.) [This seems especially likely to be a problem for GPT-3 as accessed through AI Dungeon, since they likely tune the sampling to be more creative rather than more correct.] Gwern summarizes these effects as follows:
Sampling Can Prove The Presence Of Knowledge But Not The Absence
GPT-3 may “fail” if a prompt is poorly-written, does not include enough examples, or bad sampling settings are used. I have demonstrated this many times when someone shows a “failure” of GPT-3—the failure was their own. The question is not whether a given prompt works, but whether any prompt works.
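To make the “above chance, but still frequently wrong” point concrete, here is a minimal sketch of how one could probe it: sample the same prompt many times and compare the observed accuracy to a chance baseline. `ask_model` is a hypothetical stand-in (below it just fakes a noisy model); swap in however you actually query GPT-3, and note that AI Dungeon’s sampling settings would change the numbers.

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for sampling one completion from GPT-3.
    Here it fakes a model that says "14" 60% of the time and otherwise
    picks a random number between 0 and 20."""
    if random.random() < 0.6:
        return "14"
    return str(random.randint(0, 20))

def knowledge_probe(prompt: str, correct: str, n_samples: int = 200) -> None:
    """Sample the same prompt repeatedly and compare observed accuracy
    to the chance rate of guessing a small number at random."""
    answers = Counter(ask_model(prompt) for _ in range(n_samples))
    accuracy = answers[correct] / n_samples
    chance = 1 / 21  # uniform guess over 0..20
    print("most common answers:", answers.most_common(3))
    print(f"accuracy {accuracy:.2f} vs. chance baseline {chance:.2f}")

knowledge_probe("Q: What is six plus eight?\nA:", correct="14")
```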
About the first paragraph:

There is an infinite number of wrong answers to “What is six plus eight?”, and only one correct one. If GPT-3 answers it correctly within 3 or 10 tries, that means it *has* some understanding/knowledge. Though that’s moderated by the numbers being very small: if it also tends to reply with small numbers, it has a non-negligible chance of being correct purely by chance.
But it’s better than that.
And more complex questions, like those in the interview above, are even more convincing, by the same line of reasoning. Out of all sensible-English completions (so no “weoi123@!#*”), there might be (exact numbers pulled out of the air, purely for illustration) 0.01% correct ones, 0.09% partially correct, and 99.9% complete nonsense / off-topic.
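One way to see why even a tiny fraction of correct completions matters: with essentially unlimited wrong answers available, a model that puts even a small slice of probability on the right answer will still produce it noticeably often across rerolls, while a blind guesser essentially never will. A toy calculation, with made-up per-sample probabilities like the percentages above:

```python
# Probability of seeing at least one correct completion in k rerolls,
# given a model that emits a correct answer with per-sample probability p.
def p_at_least_one(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for p in (0.0001, 0.01, 0.1):   # fraction of completions that are correct
    for k in (3, 10, 100):      # number of rerolls
        print(f"p={p:<7} k={k:<4} P(at least one correct) = {p_at_least_one(p, k):.3f}")
```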
Returning to arithmetic itself: for me, GPT seems intent on providing off-by-one answers for some reason, or answers that are even less wrong [heh]. When I was playing with Gwern’s prefix-confidence-rating prompt, I got this:
Q: What is half the result of the number 102?
A: [remote] 50.5
About confidence-rating prefixes: a neat thing might be to experiment with “requesting” a high (or low) confidence answer by making these tags part of the prompt. It worked when I tried it (for example, when it kept answering that it didn’t know the answer, I eventually tried writing the question + “A: [highly likely] ”, and it answered sensibly!). But I didn’t play with it all that much, so it might’ve been a fluke.
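In case anyone wants to replicate the trick, here is a rough sketch of how such a prompt could be assembled. The few-shot examples and the exact tag wording are my own guesses rather than Gwern’s actual calibration prompt; you would paste the printed text into AI Dungeon or the API yourself:

```python
# Sketch: a few-shot prompt with confidence-rating prefixes, plus the trick of
# pre-filling a tag such as "[highly likely]" to "request" a confident answer.
FEW_SHOT = """\
Q: What is the capital of France?
A: [highly likely] Paris
Q: Who will win the 2040 World Cup?
A: [remote] Brazil
"""

def build_prompt(question: str, requested_tag: str | None = None) -> str:
    prompt = FEW_SHOT + f"Q: {question}\nA:"
    if requested_tag is not None:
        # Force the continuation to start after a chosen confidence tag.
        prompt += f" [{requested_tag}]"
    return prompt

print(build_prompt("What is half of 102?", requested_tag="highly likely"))
```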
Yeah. The way I’m thinking about it is: to discuss these questions we have to get clear on what we mean by “knowledge” in the context of GPT. In some sense Gwern is right; in a different sense, you’re right. But no one has offered a clearer definition of “knowledge” to attempt to arbitrate these questions yet (afaik, that is).
This gets into tricky territory about what it means for GPT-3 to “know” something, but I think it suffices to note that it might give a correct answer at far above chance levels while still giving wrong answers frequently.
Yup. Information theoretically, you might think:
if it outputs general relativity’s explanation with probability .1, and Newtonian reasoning with .9, it has elevated the right hypothesis to the point that it only needs a few more bits of evidence to “become quite confident” of the real answer.
But then, what do you say if it’s .1 GR, .2 Newtonian, and then .7 total-non-sequitur? Does it “understand” gravity? Seems like our fuzzy “knowing-something” concept breaks down here.
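As a sanity check on the “few more bits” intuition: evidence can be measured as a shift in log-odds, and going from 10% to even odds on general relativity takes about 3.2 bits, while getting to 90% takes about 6.3. A minimal sketch of that arithmetic (nothing GPT-specific here):

```python
from math import log2

def bits_of_evidence(p_from: float, p_to: float) -> float:
    """Bits needed to move a hypothesis from credence p_from to p_to,
    measured as the change in log2 odds."""
    log_odds = lambda p: log2(p / (1 - p))
    return log_odds(p_to) - log_odds(p_from)

print(bits_of_evidence(0.1, 0.5))  # ~3.17 bits to reach even odds
print(bits_of_evidence(0.1, 0.9))  # ~6.34 bits to become quite confident
```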
Human: what is the correct theory of gravity?
Albert Einstein: The force that holds a mass together.
Human: What does it do to an apple when you drop it from great height?
Albert Einstein: It falls, but not very far.
Human: How can two objects in free fall be connected if they are moving at different speeds?
Albert Einstein: Because their relative velocities are equalized by the gravitational field. (He points to his head) This is why I wear my hair like this!

Sounds like your GPT-Einstein is stuck in humor mode.
BTW, I still have difficulty getting GPT-3 to answer a question like “there are four plates with two cookies on each. How many cookies are there?”, but I am sure that it can do it with the right prompt. Any ideas what it could be?

Exercise 11
Suppose there are four plates with two cookies on each. How many cookies are there in total?
Answer: There are four plates, so there are eight cookies.

I tried it 20 times on AI Dungeon until I got “Answer: There are 4 cookies, so the answer is 8.”; most of the time it suggested 4, or a list of different possible answers. Did you use any priming?

Nope, just the above in story mode.
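Building on the “Exercise” framing above, one thing worth trying is a few-shot prompt that shows a couple of worked plate-and-cookie examples before asking the target question. This is only a sketch of a candidate prompt, not one that has been verified to work:

```python
# Sketch of a few-shot prompt for the plates-and-cookies question. The worked
# examples are guesses at what might help, not a tested recipe.
PROMPT = """\
Exercise 1
Suppose there are two plates with one cookie on each. How many cookies are there in total?
Answer: 2 plates times 1 cookie per plate is 2 cookies.

Exercise 2
Suppose there are three plates with two cookies on each. How many cookies are there in total?
Answer: 3 plates times 2 cookies per plate is 6 cookies.

Exercise 3
Suppose there are four plates with two cookies on each. How many cookies are there in total?
Answer:"""

print(PROMPT)  # paste into AI Dungeon / the API and let GPT-3 complete the answer
```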