The biggest bias is bad prompting and sampling. Cherry-picking and selection can promote good samples from levels like 1:100 to 1:1; but bad prompting and sampling can sabotage performance such that good samples plummet from 1:1 to 1:billions or worse*. People like to share good samples, but critics like to share bad samples even more. (How many times have you seen Kevin Lacker’s post linked as ‘proof’ of GPT-3’s inherent stupidity in places like Marginal Revolution, Technology Review, Wired, and so on—despite being wrong?)
* Set temperature to 0, so GPT-3 decodes deterministically, and use a prompt where greedy sampling gets the wrong answer: GPT-3 will then return that wrong answer every single time, even though temp=1 plus ranking would return the right answer ~100% of the time! This has in fact happened in some of the error cases being retailed around.
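To make that footnote concrete, here is a minimal sketch against the (legacy) OpenAI Completion endpoint; the parameter names (engine, temperature, best_of, n) are the API’s own, but the davinci engine choice and the toaster/pencil prompt (one of Lacker’s questions) are just illustrative:

```python
# A minimal sketch of the footnote's point, using the (legacy) openai Python
# client's Completion endpoint. The parameter names are the API's own, but the
# "davinci" engine and the toaster/pencil prompt are illustrative choices.
# Assumes openai.api_key has already been set.
import openai

PROMPT = "Q: Which is heavier, a toaster or a pencil?\nA:"

# temperature=0 is greedy decoding and (effectively) deterministic: if the
# argmax continuation is wrong, it is wrong on every single call.
greedy = openai.Completion.create(
    engine="davinci", prompt=PROMPT, max_tokens=16, temperature=0,
)
print("greedy:", greedy.choices[0].text)

# temperature=1 with best_of asks the API to sample several candidates and
# return the one it ranks highest (by per-token log probability), which can
# recover the right answer even when greedy decoding never does.
ranked = openai.Completion.create(
    engine="davinci", prompt=PROMPT, max_tokens=16,
    temperature=1, best_of=5, n=1,
)
print("ranked:", ranked.choices[0].text)
```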
How do you do ranking? I’m guessing this is because you have access to the actual API, while most of us don’t?
On the bright side, this could be a fun project where many of us amateurs learn how to do science better, but the knowledge of how to do that isn’t well distributed yet.
Yes. I don’t think AID exposes ranking. (If they pay per API call, doing best-of=n would be n times more expensive, and for creative uses like AID, ranking/best-of is not that useful and is certainly not n times better. Very diminishing returns there—unless you’re asking tricky or difficult questions, where ranking often seems to hit on the right answer where regular GPT-3 fails. See also the Meena paper on how much ranking improved over baseline Meena.)
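For anyone who wants to try this without best_of, here is a hedged sketch of doing the ranking client-side, roughly the Meena-style sample-and-rank: draw n samples and keep the one with the highest length-normalized log-likelihood. It assumes the legacy Completion endpoint’s n and logprobs parameters; n=5 and the giraffe prompt are arbitrary illustrative choices.

```python
# A hedged sketch of doing the ranking yourself when best_of isn't exposed:
# draw n samples at temperature=1 and keep the one with the highest
# length-normalized log-likelihood, roughly Meena's sample-and-rank scheme.
# Assumes the legacy openai Completion endpoint and that openai.api_key is set.
import openai

def sample_and_rank(prompt: str, n: int = 5, max_tokens: int = 32) -> str:
    resp = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=1,
        n=n,          # n candidates, so roughly n times the token cost
        logprobs=1,   # also return the sampled tokens' log probabilities
    )

    def mean_logprob(choice) -> float:
        # Average per-token logprob = length-normalized log-likelihood.
        lps = [lp for lp in choice.logprobs.token_logprobs if lp is not None]
        return sum(lps) / max(len(lps), 1)

    return max(resp.choices, key=mean_logprob).text

print(sample_and_rank("Q: How many eyes does a giraffe have?\nA:"))
```

This is also why it costs about n times as much: every discarded candidate still has to be generated and paid for.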
I don’t see documentation for the GPT-3 API on OpenAI’s website. Is it available to the public? Are they doing their own ranking or are you doing it yourself? What do you know about the ranking algorithm?
It seems like another source of confusion might be people investigating the performance of different algorithms and calling them all GPT-3?
The current docs do seem to be behind the login wall. (They’re integrated with your API token to make copy-paste easier, so that’s not too surprising.) It’s also true that people have been using different algorithms, but regular API users are typically clear if they’re not using davinci and confusion is mostly the fault of AI Dungeon users: we don’t know what AID does, and AID users sometimes don’t even pick the right model option and still say they are using “GPT-3”.
I was making a different point, which is that if you use “best of” ranking then you are testing a different algorithm than if you’re not using “best of” ranking. Similarly for other settings. It shouldn’t be surprising that we see different results if we’re doing different things.
It seems like a better UI would help us casual explorers share results in a way that makes it easy to retry the same settings: one could hit a “share” button to create a linkable output page with all relevant settings.
It could also save the alternate responses that either the user or the “best-of” ranking chose not to use. Generate-and-test is a legitimate approach, if you do it consistently, but saving the alternate takes would give us a better idea how good the generator alone is.
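As a sketch of what such a linkable record might contain (the schema here is purely hypothetical, not anything AID or the API actually offers):

```python
# A rough sketch of what a "share" record could capture so someone else can
# re-run the same settings and also see the discarded takes. The field names
# are entirely hypothetical.
from dataclasses import dataclass, field, asdict
from typing import List
import json

@dataclass
class SharedSample:
    model: str                 # e.g. "davinci", or whichever model AID used
    prompt: str
    temperature: float
    top_p: float
    best_of: int               # 1 = no ranking
    chosen: str                # the completion the user (or ranker) kept
    alternates: List[str] = field(default_factory=list)  # the discarded takes

record = SharedSample(
    model="davinci", prompt="Q: ...\nA:", temperature=1.0, top_p=1.0,
    best_of=3, chosen="the completion that was kept",
    alternates=["a discarded take", "another discarded take"],
)
print(json.dumps(asdict(record), indent=2))  # what the linkable page would render
```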