I’m not sure what’s going on here—is it the initial prompt saying it was ‘testing physical and common sense reasoning’? Was that all it took?
Entirely possible. Other people have mentioned that using any prompt (rather than just plopping the stories in) solves a lot of them, and Summers-stay says that Marcus & Davis did zero prompt programming and had no interest in the question of what prompt to use (quite aside from the lack of BO). I think they found the same thing, which is why they provide the preemptive excuse in the TR writeup:
Defenders of the faith will be sure to point out that it is often possible to reformulate these problems so that GPT-3 finds the correct solution. For instance, you can get GPT-3 to give the correct answer to the cranberry/grape juice problem if you give it the following long-winded frame as a prompt:
I don’t think that excuse works in this case—I didn’t give it a ‘long-winded frame’, just that brief sentence at the start and then the list of scenarios, and even though I reran each scenario a couple of times to check, the ‘cranberry/grape juice kills you’ outcome never arose.
So, perhaps they switched directly from no prompt to an incredibly long-winded and specific prompt without checking what was actually necessary for a good answer? I’ll point out that I didn’t really attempt any sophisticated prompt programming either—that was literally the first sentence I thought of!
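For concreteness, here is roughly what that setup looks like in code: the brief framing sentence prepended to each scenario, with each one sampled a couple of times. This is only a sketch using the legacy (pre-v1) openai Completions API; the framing sentence, engine name, sampling settings, and scenario wording are illustrative stand-ins, not the script actually used.

```python
# Minimal sketch of the setup described above: one brief framing sentence,
# then each scenario appended verbatim, sampled a couple of times to check
# stability. Names and settings below are assumptions for illustration.
import openai

openai.api_key = "sk-..."  # your API key here

# Rough paraphrase of the one-sentence frame mentioned above.
FRAME = "This is a test of physical and common sense reasoning.\n\n"

scenarios = [
    # Paraphrase of the Marcus & Davis cranberry/grape juice scenario.
    "You poured yourself a glass of cranberry juice, but then absentmindedly "
    "poured about a teaspoon of grape juice into it. It looks OK. You try "
    "sniffing it, but you have a bad cold, so you can't smell anything. You "
    "are very thirsty. So you",
]

for scenario in scenarios:
    for _ in range(2):  # rerun each scenario a couple of times
        completion = openai.Completion.create(
            engine="davinci",          # base GPT-3 engine of the time
            prompt=FRAME + scenario,   # brief frame + raw scenario text
            max_tokens=60,
            temperature=0.7,
        )
        print(completion.choices[0].text.strip())
        print("---")
```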