I’m really curious to see some of the raw output (not curated)
You can read the random sample dump to get an idea of that, or Max Woolf’s repo (both of which I link around the beginning). I’m not doing that for any of my prompts because right now the Playground is just too much of a pain and errors out too regularly to make it feasible to generate, say, 100 1024-token completions for a specific prompt. I would need to get set up with the Python library for the API, and I’ve been busy exploring prompts & writing them up rather than programming.
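If I did set that up, the loop itself would be trivial; a minimal sketch with the openai Python library (the engine name, temperature, and batch sizes here are placeholder assumptions, not settings I have actually run):

```python
# Hypothetical sketch: dump 100 raw 1024-token completions for one prompt
# via the OpenAI Python library's Completions endpoint. Engine name,
# temperature, and batching are assumptions for illustration only.
import openai

openai.api_key = "sk-..."  # your API key

prompt = "Once upon a time"
completions = []
for _ in range(10):                       # 10 batches of 10 = 100 samples
    response = openai.Completion.create(
        engine="davinci",                 # assumed engine name
        prompt=prompt,
        max_tokens=1024,
        n=10,                             # 10 completions per request
        temperature=0.8,
    )
    completions.extend(choice.text for choice in response.choices)

for i, text in enumerate(completions):
    print(f"=== Sample {i} ===\n{text}\n")
```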
On a similar note, I know there have been experiments using either a differently-trained GPT or other text-prediction models to try to score and collate GPT-3 output. I wonder if (a) the best-of functionality could be used for something like this with some tweaks.
Yes; best-of ranking, as in Meena, is basically just a ranker which happens to use the same model to estimate & score the total likelihood of the final sample completion. It works because the final sample may have a different, better total likelihood than the partial completions would indicate: if you greedily maximize, you immediately fall into repetition traps, while quasi-random (but still local) samples of the tree appear to avoid those very-high-likelihood traps in favor of sensible but still high-likelihood completions.
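To make the mechanism concrete, here is a minimal sketch of that sample-then-rank loop, using GPT-2 via the Hugging Face transformers library as a stand-in (the model, sampling settings, and scoring-by-total-log-likelihood are my assumptions about the general technique, not OA’s exact implementation):

```python
# Minimal best-of-n sketch: sample n completions stochastically, then rank
# them by total log-likelihood under the same model and keep the best one.
# GPT-2 stands in for GPT-3 here; all settings are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The library at the end of the world contained"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Quasi-random sampling of the completion tree (not greedy maximization).
samples = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.95,
    max_length=input_ids.shape[1] + 60,
    num_return_sequences=8,
    pad_token_id=tokenizer.eos_token_id,
)

def total_log_likelihood(seq: torch.Tensor) -> float:
    """Score a full sequence by its summed token log-probability."""
    with torch.no_grad():
        out = model(seq.unsqueeze(0), labels=seq.unsqueeze(0))
    # `out.loss` is the mean cross-entropy per predicted token; undo the
    # mean and negate to get the sequence's total log-likelihood.
    n_predicted = seq.shape[0] - 1
    return -(out.loss.item() * n_predicted)

best = max(samples, key=total_log_likelihood)
print(tokenizer.decode(best, skip_special_tokens=True))
```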
Preference learning would be nice, but at least for GPT-2 it didn’t work too well for me. I don’t know if you could finetune a sanity-checking GPT-3 by doing something like flipping texts to generate logical vs illogical completions.
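One possible reading of the ‘flipping’ idea, sketched below purely as an assumption about how the dataset construction might go (the field names and the swap scheme are mine), would be to pair each prompt with its real continuation versus a continuation swapped in from another document, and finetune a classifier on those labels:

```python
# Hypothetical sketch: build logical vs. illogical continuation pairs for a
# sanity-checking classifier by "flipping" which continuation follows which
# prompt. Field names and the swapping scheme are assumptions.
import random

def make_pairs(documents, split_at=200):
    """documents: list of raw text strings long enough to split in two."""
    halves = [(d[:split_at], d[split_at:]) for d in documents]
    examples = []
    for i, (prompt, real_continuation) in enumerate(halves):
        examples.append({"prompt": prompt, "continuation": real_continuation, "label": 1})
        # "Flip": pair the prompt with some other document's continuation.
        j = random.choice([k for k in range(len(halves)) if k != i])
        examples.append({"prompt": prompt, "continuation": halves[j][1], "label": 0})
    return examples
```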