My immediate impression: GPT-3 does better than I expected on the jokes, but still worse than PaLM (possible exception: if GPT-3 is right about CL standing for “cover letter”; I genuinely don’t know what it stands for here, so I am probably doing worse than at least one of the two language models at understanding that joke). But it’s much, much worse than PaLM on the “inference chaining” examples, where GPT-3 gets basically everything completely wrong (maybe it gets partial credit for Input E, the one about airplanes).
(But we don’t know how cherry-picked the PaLM examples are, and we know that they were not cherry-picked at all for GPT-3.)
I think it’s interesting that both PaLM and GPT-3, when answering the question about Guido van Rossum, use the exact phrase “would not need to look up variable scope on StackOverflow”. It’s a natural enough thing to say, but it doesn’t seem like it’s the only plausible way to say it, so I can’t help wondering whether maybe they’re both quoting something—though I can’t find that phrase on the web other than in copies of this very paper. (And GPT-3 doesn’t seem to have noticed how this fact about GvR is relevant to answering the question.)
CL stands for “change list”. It’s not even general tech jargon; it’s a sort of Google-internal jargon, and Google admits as much: see https://github.com/google/eng-practices.
Aha! Thanks. (To save others a click: “change list” really just means “change” or “commit”: a single thing checked into version control or submitted for review.) I’m not sure the joke really lands for me—maybe I’m stupider than both GPT-3 and PaLM. It seems like the joke could be that (1) the intern produced a hilariously excessive amount of code, perhaps because they failed to use elementary techniques like functions and loops to remove redundancy; (2) the intern produced a normal amount of code, but it was so bad that reading it was as painful as if it had been War-and-Peace-sized; or (3) the reviewer is incredibly lazy (so is telling a joke against himself) and finds reading even small amounts of other people’s code terribly hard work. Normally I’d use the obvious heuristic that the intended meaning is the one that’s funny, but unfortunately none of them seems very funny to me. I guess probably it’s #2?
(This is the difficulty about making up one’s own jokes for this sort of test...)
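For what it’s worth, here’s a minimal, entirely hypothetical sketch of what interpretation (1) has in mind: the same trivial behaviour written out by hand versus compressed with a loop. (The function names are made up purely for illustration; nothing like this appears in the paper or the joke.)

```python
# Hypothetical illustration of interpretation (1): code that balloons because
# the author never uses loops or functions to remove redundancy...
def check_servers_by_hand():
    print("Checking server 1")
    print("Checking server 2")
    print("Checking server 3")
    # ...imagine this copy-pasted on and on until the change is War-and-Peace-sized.

# ...versus the few lines a loop reduces it to:
def check_servers(n=3):
    for i in range(1, n + 1):
        print(f"Checking server {i}")
```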
I am sure the situation is that the intern never pushed his code to the VCS for a few months, just wrote it locally, and then pushed tons of code at once. The reviewer is dreading it because one day is a very small amount of time in which to review that much code.
The fact that we humans are having trouble understanding this joke does not bode well for its use as an AI benchmark…
Since it was the intern’s last day, they might have been less careful with their coding (or, depending on why they’re leaving, even added deliberate errors), so the reviewer will have to be extra thorough checking it.
Yeah, I do wonder how most of the example jokes not actually being very funny is affecting the results… It’s also weird that they make an explicit reference to a term which is only used internally, and which PaLM presumably has little-to-no training data on. Was that on purpose, or a slip-up by the authors?
If I were asked the question, I think I would have said something closer to GPT-3’s response on Input E than to PaLM’s (though I would give the reason for inferring that she’s on an airplane as her looking down at clouds, not her looking out a window).
This isn’t directly the case here, but thinking about this made me realize that in some sense, a flawed answer which is more human-like is a better answer than one which is perfect (because the flawed human response would be a more likely completion of the text). Considering that, I’m not sure if it would even be possible to utilize any future iteration of this sort of architecture to get it to answer in a significantly “superhuman” manner. It would become the perfect mimic, but can text completion bots ever go beyond that?
The “inference” “We can also infer that she is traveling at a high speed because she is unbuckling her seatbelt.” is also nonsensical. People don’t typically unbuckle their seatbelts when traveling at high speed. (Admittedly, this does happen to be true for airplane travel, because one isn’t allowed to unbuckle one’s seatbelt while traveling at low speed, i.e. during taxi, takeoff, and landing; but that’s enough of a non-central case that it needs to be called out explicitly for the reasoning not to sound absurd.)
Why is it a non-central example when this is, in fact, about commercial airplane travel, where you’re moving fastest at cruising altitude, and that’s exactly when you’re allowed to unbuckle and move about the cabin?
I think I have that intuition because the great majority of seatbelt unbucklings in my experience happen while traveling at a speed of zero (because they’re in cars, not planes). The sentence has no cues to indicate the unusual context of being in a plane (and in fact, figuring that out is the point of the example). So my mental process reading that sentence is “that’s obviously false” → “hmm, wonder if I’m missing something” → “oh, maybe in a plane?” and the first step there seems a lot more reliable (in other reasoners as well, not just me) than the second or third.