I’m really curious to see some of the raw output (not curated) to try to get an estimate of how many oysters you have to pick through to find the pearls. (I’m especially interested w.r.t. the essay-like things: the extension of the essay on assertions was by far the scariest and most impressive thing I’ve seen from GPT-3, because the majority of its examples were completely correct, and it held a thesis for the majority of the piece.)
On a similar note, I know there have been experiments using either a differently-trained GPT or other text-prediction models to try to score and collate GPT-3 output. I wonder (a) whether the best-of functionality could be used for something like this with some tweaks, and (b) whether there would be a way to embed a simple reasoning framework into the best-of instead of scoring with GPT-3, so the resultant pieces were ranked on their logical sensibility instead of their text quality, given that the text quality seems to be universally acceptable. Encoding seems like the barrier here, but it might not be completely impossible, especially because raw->tagged data processors exist.
I’m really curious to see some of the raw output (not curated)
You can read the random sample dump to get an idea of that, or Max Woolf’s repo (both of which I link around the beginning). I’m not doing that for any of my prompts because right now the Playground is just way too much of a pain and errors out too regularly to make it feasible to generate, say, 100 1,024-token completions for a specific prompt. I would need to get set up with the Python library for the API, and I’ve been busy exploring prompts & writing them up rather than programming.
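(For anyone who does want to script this: a minimal sketch of what that would look like with the old v0.x `openai` Python library. The engine name, sampling settings, and batch sizes are illustrative assumptions, not a tested setup.)

```python
# Hypothetical sketch: dump 100 raw 1,024-token completions for one prompt
# via the API instead of the Playground. Assumes the v0.x `openai` library
# and a valid key; "davinci" and the sampling parameters are guesses.
import openai

openai.api_key = "sk-..."  # your API key

PROMPT = "..."  # whatever prompt is being explored

samples = []
for _ in range(10):  # 10 requests x 10 completions each = 100 samples
    resp = openai.Completion.create(
        engine="davinci",
        prompt=PROMPT,
        max_tokens=1024,
        temperature=0.9,
        n=10,
    )
    samples.extend(choice.text for choice in resp.choices)

for i, text in enumerate(samples):  # print everything, uncurated
    print(f"==== Sample {i} ====\n{text}\n")
```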
On a similar note, I know there have been experiments using either a differently-trained GPT or other text-prediction models to try to score and collate GPT-3 output. I wonder (a) whether the best-of functionality could be used for something like this with some tweaks
Yes, best-of rankers like Meena’s are basically just rankers which happen to use the same model to estimate & score the total likelihood of the final sampled completion. It works because the final sample may have a different, better total likelihood than the partial completions would indicate; if you greedily maximized likelihood token by token, you would immediately fall into repetition traps, while quasi-random (but still local) samples of the tree appear to avoid those very-high-likelihood traps in favor of sensible but still high-likelihood completions.
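A toy version of that sample-and-rank loop, using GPT-2 via HuggingFace `transformers` as a stand-in for the API (the model choice, sampling settings, and ranking by raw total log-likelihood rather than a length-normalized score are all assumptions for illustration):

```python
# Sketch of Meena-style best-of ranking: sample N completions, then keep
# the one whose *total* log-likelihood under the same model is highest.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def total_logprob(text: str) -> float:
    """Sum of per-token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logps = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps.sum().item()

def best_of(prompt: str, n: int = 8, max_new: int = 64) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    samples = model.generate(
        ids, do_sample=True, top_p=0.95, max_new_tokens=max_new,
        num_return_sequences=n, pad_token_id=tokenizer.eos_token_id,
    )
    texts = [tokenizer.decode(s, skip_special_tokens=True) for s in samples]
    # Rank the *complete* samples by total likelihood; greedy token-by-token
    # maximization would instead fall straight into repetition traps.
    return max(texts, key=total_logprob)
```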
Preference learning would be nice, but at least for GPT-2 it didn’t work too well for me. I don’t know if you could finetune a sanity-checking GPT-3 by doing something like flipping texts to generate logical vs. illogical completions.
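One way such a dataset might be constructed (purely a hypothetical sketch: “flipping” is read here as swapping in continuations from other documents, and the split point is arbitrary):

```python
# Hypothetical training data for a "sanity-checking" classifier: real
# continuations are labeled logical (1); continuations swapped in from a
# different document are labeled illogical (0). A model with a
# classification head could then be finetuned on these pairs.
def make_pairs(docs: list[str], split: int = 500) -> list[tuple[str, int]]:
    """Split each doc at `split` chars; pair real vs. mismatched tails."""
    heads = [d[:split] for d in docs]
    tails = [d[split:] for d in docs]
    wrong = tails[1:] + tails[:1]  # rotate so every tail is mismatched
    data = []
    for head, tail, flipped in zip(heads, tails, wrong):
        data.append((head + tail, 1))     # logical: original continuation
        data.append((head + flipped, 0))  # illogical: flipped-in continuation
    return data
```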