Replicating the replication crisis with GPT-3?
I am getting worried that people are having so much fun doing interesting stuff with GPT-3 and AI Dungeon that they’re forgetting how easy it is to fool yourself. Maybe we should think about how many different cognitive biases are in play here? Here are some features that make fooling yourself particularly easy during casual exploration.
First, it works much like autocomplete, which makes it the most natural thing in the world to “correct” the transcript to be more interesting. You can undo and retry, or trim off extra text if it generates more than you want.
Randomness is turned on by default, so each retry gives you a different reply, and you can keep going until you get a good one. It would be better science, but less fun, to keep the entire distribution of replies rather than stopping at the first good one. Randomness also makes all sorts of gambler’s-fallacy-style reasoning more likely.
Suppose you don’t do that. Then you have to decide whether to share the transcript. You will probably share the interesting transcripts and not the boring failures, resulting in a “file drawer” bias.
And even if you don’t do that, “interesting” transcripts will be linked to and upvoted and reshared, for another kind of survivor bias.
What other biases do you think will be a problem?
The biggest bias is bad prompting and sampling. Cherry-picking and selection can promote good samples from levels like 1:100 to 1:1; but bad prompting and sampling can sabotage performance such that good samples plummet from 1:1 to 1:billions or worse*. People like to share good samples, but critics like to share bad samples even more. (How many times have you seen Kevin Lacker’s post linked as ‘proof’ of GPT-3's inherent stupidity in places like Marginal Revolution, Technology Review, Wired, and so on, despite being wrong?)
* Set temperature to 0, so GPT-3 is deterministic, and use a prompt where greedy sampling gets the wrong answer, and GPT-3 will always return that wrong answer, even though temperature=1 plus ranking would return the right answer ~100% of the time! This has in fact happened in some of the error cases being retailed around.
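To make that failure mode concrete, here is a toy, self-contained sketch (not the real API; the “completions” and all the numbers are made up) of how greedy decoding at temperature 0 can deterministically lock in a wrong answer that temperature-1 sampling plus ranking would almost always avoid:

```python
import random

# Toy completions for a hypothetical prompt.  first_token_p is what greedy
# decoding looks at one step at a time; total_logprob is what best-of ranking
# scores the finished completion by.  All values here are made up.
completions = {
    "wrong answer": {"first_token_p": 0.55, "total_logprob": -9.0},
    "right answer": {"first_token_p": 0.45, "total_logprob": -4.0},
}

def greedy_choice(cands):
    """Temperature 0: commit to the locally most likely first token, deterministically."""
    return max(cands, key=lambda c: cands[c]["first_token_p"])

def best_of_n(cands, n=5, seed=0):
    """Temperature 1: sample n completions, then keep the highest total log-probability."""
    rng = random.Random(seed)
    weights = [cands[c]["first_token_p"] for c in cands]
    samples = rng.choices(list(cands), weights=weights, k=n)
    return max(samples, key=lambda c: cands[c]["total_logprob"])

print(greedy_choice(completions))   # 'wrong answer', every single time
print(best_of_n(completions))       # 'right answer' on ~95% of seeds (1 - 0.55**5)
```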
How do you do ranking? I’m guessing this is because you have access to the actual API, while most of us don’t?
On the bright side, this could be a fun project where many of us amateurs learn how to do science better, but the knowledge of how to do that isn’t well distributed yet.
Yes. I don’t think AID exposes ranking. (If they pay per API call, doing best-of=n would be n times more expensive, and for creative uses like AID, ranking/best-of is not that useful and is certainly not n times better. Very diminishing returns there—unless you’re asking tricky or difficult questions, where ranking often seems to hit on the right answer where regular GPT-3 fails. See also the Meena paper on how much ranking improved over baseline Meena.)
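To spell out what ranking/best-of means here: draw n completions at a nonzero temperature and keep the one the model itself scores highest by (length-normalized) log probability. A minimal sketch, assuming only a generic generate() callable that returns a completion plus its per-token logprobs; that callable is my placeholder, not any particular API:

```python
from typing import Callable, List, Tuple

# generate(prompt) is assumed to return (completion_text, per_token_logprobs);
# in practice it is whatever sampling access you have, run at temperature > 0.
Generator = Callable[[str], Tuple[str, List[float]]]

def best_of(prompt: str, generate: Generator, n: int = 5) -> str:
    """Draw n completions and keep the one the model itself scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    # Mean rather than summed logprob, so short completions aren't automatically favored.
    best_text, _ = max(candidates, key=lambda c: sum(c[1]) / max(len(c[1]), 1))
    return best_text
```

Which also makes the cost point obvious: you really are paying to generate n full completions in order to keep one.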
I don’t see documentation for the GPT-3 API on OpenAI’s website. Is it available to the public? Are they doing their own ranking or are you doing it yourself? What do you know about the ranking algorithm?
It seems like another source of confusion might be people investigating the performance of different algorithms and calling them all GPT-3?
The current docs do seem to be behind the login wall. (They’re integrated with your API token to make copy-paste easier, so that’s not too surprising.) It’s also true that people have been using different algorithms, but regular API users are typically clear about it when they’re not using davinci; the confusion is mostly the fault of AI Dungeon users: we don’t know what AID does, and AID users sometimes don’t even pick the right model option yet still say they are using “GPT-3”.
I was making a different point, which is that if you use “best of” ranking then you are testing a different algorithm than if you’re not using “best of” ranking. Similarly for other settings. It shouldn’t be surprising that we see different results if we’re doing different things.
It seems like a better UI would help us casual explorers share results in a way that makes trying the same settings again easier; one could hit a “share” button to create a linkable output page with all relevant settings.
It could also save the alternate responses that either the user or the “best-of” ranking chose not to use. Generate-and-test is a legitimate approach, if you do it consistently, but saving the alternate takes would give us a better idea how good the generator alone is.
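Something like the following record (all field names here are hypothetical; it is just a sketch of what a “share” page could serialize) would let someone else rerun the same settings and also see the takes that were thrown away:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SharedRun:
    """One shareable generation, including the alternate takes that were not kept."""
    prompt: str
    model: str                      # which engine/model was used
    temperature: float
    best_of: Optional[int] = None   # None if ranking was not used
    other_settings: dict = field(default_factory=dict)  # top_p, penalties, ...
    chosen_completion: str = ""
    discarded_completions: List[str] = field(default_factory=list)
    rerolls: int = 0                # how many manual retries preceded the keeper
```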
As you say, highlight posts give biased impressions of GPT-3's capabilities. This bias remains even for readers who are consciously aware of it, since the underlying emotional impression may not adjust appropriately. So, for example, when I tell the reader that “only 30% of completions produced correct answers”, that isn’t the same as actually seeing the 70% of answers that were dumb.
Another problem is that AIDungeon doesn’t let you save the entire tree of edits, reversions, and rerolls. So, even if you link the full transcript, readers are still only getting the impressive version. If you wanted to overcome this, you’d have to bore readers with all of the stupid runs. No one wants to do that.
I’m currently:
Explicitly noting where rerolls take place, or at least noting how many occurred for a given generation
Sampling the output distribution and giving qualitative summaries, particularly near things I’m claiming are impressive or cool (a rough harness for this is sketched below)
Interrogating the model with Story-modification
Including some runs where GPT-3 fails
I’d love to hear other suggested best-practices.
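One candidate: a small harness that, for a given prompt, samples a fixed number of completions and reports the success rate along with every raw completion, not just the keeper. A rough sketch, where generate() and is_correct() stand in for whatever model access and grading criterion you actually have:

```python
from typing import Callable

def sample_distribution(prompt: str,
                        generate: Callable[[str], str],
                        is_correct: Callable[[str], bool],
                        n: int = 20) -> dict:
    """Draw n completions for one prompt and summarize how often they succeed.

    generate() stands in for whatever model access you have; is_correct() is
    your own (necessarily subjective) grading of a completion.
    """
    completions = [generate(prompt) for _ in range(n)]
    hits = sum(is_correct(c) for c in completions)
    return {
        "prompt": prompt,
        "n": n,
        "success_rate": hits / n,
        "completions": completions,   # keep everything, not just the winners
    }
```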
For my part, a lot of the questions I have about GPT-3 boil down to: “is there a non-negligible chance it produces correct answers to fresh problems which seemingly require reasoning to solve?” So far, I’m very impressed at how often that has been true.
Being highly skeptical of this GPT-3 “research” myself, let me make a meta-contrarian argument in favor of ways that we could do more constructive GPT-3 research, without letting the perfect be the enemy of the good.
One way is to try to develop semi-replicable “techniques” for prompting GPT-3, and to quantify their reliability.
So for example, imagine somebody comes up with a precise technical method for prompting GPT-3 to correctly classify whether or not a string of parentheses is balanced, and also for determining the stop conditions at which a run will be terminated.
If its overall accuracy was better than chance, even when used by multiple independent investigators, then its reliability could be quantified. Its broader validity would be harder to determine. But I think this would be a step in the direction of turning the study of GPT-3′s capabilities into more of a science.
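To make that concrete, here is a sketch of how such a technique could be scored against a genuine 50% chance baseline; the harness generates half balanced and half unbalanced strings, and ask_model() is a placeholder for the full prompting-and-stopping procedure under test:

```python
import random
from typing import Callable

def is_balanced(s: str) -> bool:
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def random_balanced(pairs: int, rng: random.Random) -> str:
    """Random balanced string by rejection-sampling shuffles of n '(' and n ')'."""
    while True:
        s = "".join(rng.sample("(" * pairs + ")" * pairs, 2 * pairs))
        if is_balanced(s):
            return s

def evaluate_technique(ask_model: Callable[[str], bool],
                       trials: int = 200, pairs: int = 4, seed: int = 0) -> float:
    """ask_model() wraps the whole prompting-and-stopping procedure being tested and
    must answer True for 'balanced'.  Half the test strings are balanced, so accuracy
    reliably above 0.5 across independent investigators is real signal."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        if rng.random() < 0.5:
            s, label = random_balanced(pairs, rng), True
        else:
            while True:
                s = "".join(rng.choice("()") for _ in range(2 * pairs))
                if not is_balanced(s):
                    break
            label = False
        correct += ask_model(s) == label
    return correct / trials
```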
Additional challenges would remain: lack of peer review, lack of meaningful incentives for integrity, lack of funding to drive sufficient attention, and so on.
Hopefully that at least gives some perspective on how far we all are from anything approaching the scientific study of GPT-3.
The main thing I’ve noticed is that in most of the posts talking about its capabilities (or even about what theoretical future systems might be capable of, based on a biased picture of this version’s capabilities), people are trying to figure out how to get it to succeed, rather than trying to get it to fail in interesting and informative ways.
For example, one of the evaluations I’ve seen was having it do multi-digit addition and discussing various tricks to improve its success rate, going off the assumption that if it can fairly regularly do 1-3 digit addition, that’s evidence of it learning arithmetic. One null hypothesis against this would be “in its 350-700GB model, it has stored lookup tables for 1-3 digit addition, which it will semi-frequently end up engaging.”
The argument against a lookup table was to compare its success rate on 5+ digit numbers, point out that storing a lookup table for those numbers would take up an increasingly large portion of the model, and then suggest that this implies it must be capable, at least sometimes, of doing math (and thus that the real trick is in convincing it to actually do so). However, this ignores significantly more probable explanations, and it also doesn’t look terribly closely at what the incorrect outputs for large-digit addition actually are, to evaluate *what* exactly the model did wrong (because the outputs obviously aren’t random).
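A cheap way to actually look at the incorrect outputs would be to bucket each wrong answer by how it differs, digit by digit, from the true sum. The buckets below are purely illustrative, but even this much would distinguish “dropped a digit” from “slipped on a carry”:

```python
from collections import Counter
from typing import Iterable, Tuple

def classify_addition_error(a: int, b: int, model_answer: int) -> str:
    """Crudely bucket *how* a predicted sum is wrong; the buckets are illustrative only."""
    truth = a + b
    if model_answer == truth:
        return "correct"
    t, m = str(truth), str(model_answer)
    if len(t) != len(m):
        return "wrong number of digits (dropped or invented a digit)"
    wrong_positions = [i for i, (x, y) in enumerate(zip(t, m)) if x != y]
    if len(wrong_positions) == 1:
        return "exactly one digit wrong (consistent with a carry slip)"
    return f"{len(wrong_positions)} digits wrong"

def summarize(runs: Iterable[Tuple[int, int, int]]) -> Counter:
    """runs is whatever (a, b, model_answer) triples you have logged from the model."""
    return Counter(classify_addition_error(a, b, ans) for a, b, ans in runs)
```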
I’ve also seen very little in the way of discussion of the architectural limits on its capabilities, despite them being publicly known. For example, any problem requiring deep symbolic recursion is almost certainly impossible simply because of the structure of the model: it performs a fixed number of matrix multiplications, and it can’t, as the result of any of them, step backwards through the transformer and reapply a particular set of steps again. On the plus side, this also means you can’t get it stuck in an infinite loop before receiving the output.
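For intuition about that fixed-depth point: a transformer forward pass is shaped like the first function below (always exactly as many steps as there are layers, decided in advance), not like the second (recursion whose depth depends on the input). This is only an illustration of the control-flow difference, not the model itself:

```python
def transformer_forward(x, layers):
    """Fixed depth: always exactly len(layers) steps, no matter what the input is."""
    for layer in layers:
        x = layer(x)
    return x

def recursive_eval(expr):
    """Data-dependent depth: recurses as deep as the nested input demands (illustrative only)."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return op(recursive_eval(left), recursive_eval(right))
```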
Humans assume that humans continue a story based on deeper ideas, not surface words. So for example, if we started a story these two different ways:
“a girl named little red riding hood lived with her mother and father. one day she baked some cookies and decided to bring them to her grandmother who lived on the other side of the deep dark forest” …
“once upon a time, a little girl lived with her parents. one day she made some food to bring to her mother’s mother who lived on the other side of a large, ominous woods” …
Our bias is that a human would understand both prompts as containing essentially the same ideas, and so would complete them in fairly similar ways, rather than in a way that is highly sensitive to how the prompt is worded. This might or might not be true for GPT.
A tangible way to control for this kind of human bias would be to try multiple rewordings/rewritings of the same prompt when examining how GPT generates text after the prompt.
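A rough sketch of that control, again assuming only a generic generate() callable as a placeholder for whatever access you have: collect a handful of completions per paraphrase and compare them side by side, with even a crude overlap score being better than eyeballing alone.

```python
from typing import Callable, Dict, List

def paraphrase_probe(paraphrases: List[str],
                     generate: Callable[[str], str],
                     runs_per_prompt: int = 5) -> Dict[str, List[str]]:
    """Collect several completions per paraphrase so they can be compared side by side."""
    return {p: [generate(p) for _ in range(runs_per_prompt)] for p in paraphrases}

def vocabulary_overlap(a: str, b: str) -> float:
    """Very crude similarity score: Jaccard overlap of lowercased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)
```

Word overlap is obviously a weak proxy for “the same ideas”, but it is at least something quantitative you can report alongside the transcripts.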