Fine-tuned models are generally worse at writing fiction with good style than base models sampled at temperature 1. For example, the GPT-3.5 base model, code-davinci-002, was much better than the GPT-3.5 version tuned for chat. Here is what mainstream journalists said about it at the time.
I agree and disagree, and considered getting into this in my post. I agree in the sense that, since fine-tuned models are tuned toward a persona you’d expect to be bad at writing fiction, base models certainly have higher upside potential. But I also think base models are too chaotic to do all that good a job: they veer off in wacky directions and need a huge amount of manual sampling/pruning. So whether they’re “better” seems like a question of definition to me. I do think that the first actually good literary fiction AI will be one of:
A big/powerful enough model to capture the actual latent structure of high quality literary fiction, rather than only the surface level (thus letting it experiment more deeply and not default to the most obvious choice in every situation), or
A base model fine-tuned quite hard for literary merit, and not RLHF’d for “assistant”-y stuff
The best-written AI art I’ve seen so far has been nostalgebraist-autoresponder’s Tumblr posts, so I guess my money is on the latter of these two options. Simply not being winnowed into a specific persona strikes me as a valuable feature for creating good art.
I’m not sure fine-tuning is necessary. Most recent models have a ~100,000 token context window now, so they could fit quite a few short high-quality examples for in-context learning. (Gemini Pro even has a 2 million token context window, but of course the base model is unavailable to the public.)
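Something like this is all I have in mind (a rough sketch, assuming access to some completion-style base model; the exemplar passages and the complete() call are placeholders, not a specific dataset or API):

```python
# Rough sketch: few-shot in-context prompting of a base (completion) model,
# no fine-tuning. Assumes a long context window; the exemplar strings and the
# complete() call at the end are placeholders, not real data or a real API.

EXEMPLARS = [
    "First short, high-quality literary passage goes here...",  # placeholder
    "Second passage, ideally a different author/register...",   # placeholder
    "Third passage...",                                          # placeholder
]

def build_fewshot_prompt(exemplars, instruction):
    """Join the exemplar passages with separators, then append the actual task."""
    shots = "\n\n---\n\n".join(p.strip() for p in exemplars)
    return f"{shots}\n\n---\n\n{instruction}\n"

prompt = build_fewshot_prompt(
    EXEMPLARS,
    "A new short story, in the same register as the passages above:",
)

# With a base model you would then just sample a continuation at temperature 1, e.g.:
# completion = base_model.complete(prompt, temperature=1.0, max_tokens=2000)  # hypothetical call
```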
I would be curious to see an attempt! I have a pretty strong prior that it would fail, though, with currently available models. I buy that RLHF hurts, but given Sam Altman’s sample story also not impressing me (and having the same failure modes, just slightly less so), the problem pattern-matches for me to the underlying LLM simply not absorbing the latent structure well enough to imitate it. You might need more parameters, or a different set of training data, or something.
(This also relates to my reply to gwern above—his prompt did indeed include high quality examples, and in my opinion it helped ~0.)
Both Altman and Gwern used fine-tuned models, and those don’t really do in-context learning. They don’t support “prompt engineering” in the original sense; they only respond to commands and questions in a particular way.