As you say, highlight posts give biased impressions of GPT-3's capabilities. This bias remains even for readers who are consciously aware of that fact, since the underlying emotional impression may not adjust appropriately. For example, when I tell the reader that “only 30% of completions produced correct answers”, that isn’t the same as seeing the 70% of dumb answers.
Another problem is that AIDungeon doesn’t let you save the entire tree of edits, reversions, and rerolls. So, even if you link the full transcript, readers are still only getting the impressive version. If you wanted to overcome this, you’d have to bore readers with all of the stupid runs. No one wants to do that.
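For concreteness, here is a rough sketch of the record I wish AIDungeon kept: every generation as a node in a tree, so rerolls and edits survive as sibling branches instead of being overwritten. (All names here are my own invention, not anything AIDungeon exposes.)

```python
from dataclasses import dataclass, field

@dataclass
class StoryNode:
    text: str
    action: str = "generate"  # e.g. "prompt", "generate", "reroll", "edit", "revert"
    children: list = field(default_factory=list)

    def branch(self, text: str, action: str = "generate") -> "StoryNode":
        """Attach a new generation as a child, preserving existing branches."""
        child = StoryNode(text, action)
        self.children.append(child)
        return child

root = StoryNode("You ask the wizard a riddle...", "prompt")
first = root.branch("GPT-3's first completion")
root.branch("A reroll of the same prompt", "reroll")   # kept as a sibling, not overwritten
first.branch("Continuation after a manual edit", "edit")
```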
I’m currently:
- Explicitly noting where rerolls take place, or at least how many occurred for a given generation
- Sampling the output distribution and giving qualitative summaries, particularly near things I’m claiming are impressive or cool (see the sketch after this list)
- Interrogating the model with Story-modification
- Including some runs where GPT-3 fails
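For the second item, a minimal sketch of what I mean by sampling the distribution and reporting the raw rate. The `sample_completion` stub is hypothetical; swap in however you actually query the model.

```python
import random

# Hypothetical stand-in for querying the model; the canned strings just
# make the sketch runnable. Replace with a real sampler.
def sample_completion(prompt: str) -> str:
    return random.choice(["144", "twelve", "a basket of eggs"])

def report_success_rate(prompt, is_correct, n_samples=20):
    """Sample n_samples completions, print the raw success rate, and return
    every completion so the dumb runs can be shown alongside the good ones."""
    completions = [sample_completion(prompt) for _ in range(n_samples)]
    n_correct = sum(is_correct(c) for c in completions)
    print(f"{n_correct}/{n_samples} completions correct "
          f"({100 * n_correct / n_samples:.0f}%)")
    return completions

# Example: a crude correctness check for one arithmetic question.
runs = report_success_rate("Q: What is 12 * 12?\nA:", lambda c: "144" in c)
```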
I’d love to hear other suggested best practices.
For my part, many of the questions I have about GPT-3 boil down to: “is there a non-negligible chance it produces correct answers to fresh problems which seemingly require reasoning to solve?” So far, I’m very impressed by how often the answer has been yes.