For an investigation of the kind of thing you suggest, take a look at Anthropic’s “A General Language Assistant as a Laboratory for Alignment” and, more importantly, “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback”.
They focus on training a helpful/harmless assistant rather than on good short stories, but using human-filtered model output to improve behavior is the basic paradigm.
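To make the paradigm concrete, here is a minimal sketch of a filter-and-fine-tune loop (sample several completions, keep the one a human rates best, fine-tune on the kept pairs). This is not Anthropic’s actual pipeline; every function name here (`generate_samples`, `human_rating`, `filter_and_finetune`) is a hypothetical placeholder, and the rating heuristic is for illustration only.

```python
# Hypothetical sketch of human-filtered fine-tuning (best-of-n / rejection sampling).
# All names are placeholders, not any library's real API.

def generate_samples(model, prompt, n=4):
    """Sample n candidate completions from the model (placeholder)."""
    return [model(prompt) for _ in range(n)]

def human_rating(completion):
    """Stand-in for a human preference judgment; replace with real labels."""
    return len(completion)  # toy heuristic for illustration only

def filter_and_finetune(model, finetune, prompts, n=4):
    """Keep the best-rated completion per prompt, then fine-tune on the kept pairs."""
    kept = []
    for prompt in prompts:
        candidates = generate_samples(model, prompt, n)
        best = max(candidates, key=human_rating)
        kept.append((prompt, best))
    return finetune(model, kept)
```

The RLHF papers go further (training a reward model on human comparisons and optimizing against it with RL), but this filtered-imitation loop is the simplest version of the idea.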
Thanks for the pointer, I will check that out!