In defense of shot-ness as a paradigm:
Shot-ness is a nice task-agnostic interface for revealing capability that doesn’t require any cleverness from the prompt designer. Said another way: if you needed task-specific knowledge to construct the prompt that makes GPT-3 reveal it can do the task, it’s hard to compare “ability to do that task” in a task-agnostic way to other potential capabilities.
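To make that concrete, here’s a minimal sketch of what I mean by a task-agnostic interface (the function name, the “Input:”/“Output:” labels, and the k=3 default are all just illustrative choices on my part, not anything from the post): the only task-specific thing the prompt builder consumes is a list of input/output pairs, so the scaffolding itself contributes no cleverness from the prompt designer.

```python
# Minimal sketch of a task-agnostic k-shot prompt builder.
# The only task-specific content is the demonstration pairs themselves;
# the scaffolding (labels, separators) is identical across tasks, so
# prompt-designer cleverness is held at roughly zero.

def build_k_shot_prompt(examples, query, k=3):
    """Format k demonstration pairs followed by the query input."""
    shots = examples[:k]
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in shots]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# The same builder works unchanged for translation, arithmetic, rhyming, etc.
prompt = build_k_shot_prompt(
    [("cheese", "fromage"), ("dog", "chien"), ("house", "maison")],
    query="bread",
)
```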
For a completely unrealistic example that hyperbolically gestures at what I mean: you could spend a tremendous amount of compute to come up with the magic password prompt that gets GPT-3 to reveal that it can prove P!=NP, but this is worthless if that prompt itself contains a proof that P!=NP, or worse, is harder to generate than the original proof.
This is not what it “feels like” when GPT-3 suddenly demonstrates it is able to do something, of course—it’s more like it just suddenly knows what you meant, and does it, without your hinting really seeming to provide anything particularly Clever Hans-y. So it’s not a great analogy. But I can’t help but feel that a “sufficiently intelligent” language model shouldn’t need to be cajoled into performing a task you can demonstrate to it, so I personally don’t want to have to rely on cajoling.
Regardless, it’s important to keep track of both “can GPT-n be cajoled into this capability?” and “how hard is it to cajole GPT-n into demonstrating this capability?”. But I maintain that shot-prompting is one nice way of probing this while holding “cajoling-ness” relatively fixed.
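One way you might operationalize “how hard is it to cajole”: sweep the number of shots under the same fixed scaffolding from the sketch above and record the smallest k at which the model clears your bar. This is only a sketch under assumptions I’m inventing here; `evaluate`, the eval set, and the 0.9 threshold are stand-ins for whatever benchmark harness and criterion you actually use.

```python
def shots_needed(examples, eval_set, evaluate, max_k=32, threshold=0.9):
    """Smallest k (0-shot included) at which accuracy clears `threshold`,
    using the same task-agnostic scaffolding at every k.
    `evaluate(prompt_builder, eval_set)` is a placeholder for your own harness."""
    for k in range(max_k + 1):
        def builder(query, k=k):
            return build_k_shot_prompt(examples, query, k)
        if evaluate(builder, eval_set) >= threshold:
            return k  # fewer shots needed = less cajoling required
    return None  # capability not elicited within max_k shots
```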
This is of course moot if all you care about is demonstrating that GPT-n can do the thing. Of course you should prompt tune. Go bananas. But prompt tuning makes a particular kind of principled comparison hard.
Edit: wanted to add, thank you tremendously for posting this—always appreciate your LLM takes, independent of how fully fleshed out they might be.