With humans in the loop, there actually is a way to implement ℓnovel. Unfortunately, computing the function takes as long as it takes for several humans to read a novel and aggregate their scores. And there’s also no way to compute the gradient. So by that point, it’s pretty much just a reinforcement learning signal.
However, you could use that human feedback to train a side network to predict the reward signal based on what the AI generates. This second network would then essentially compute a custom loss function (asymptotically approaching ℓnovel with more human feedback) that is amenable to gradient descent and can run far more quickly. That’s basically the idea behind reward modeling (https://youtube.com/watch?v=PYylPRX6z4Q).
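To make that concrete, here's a rough sketch of what I mean by the side network. Everything here is illustrative rather than a real recipe: in particular, I'm hand-waving the assumption that a candidate novel can be embedded into a fixed-size vector that is a differentiable function of the generator's parameters.

```python
# Minimal sketch of a learned reward model standing in for l_novel.
# Assumes (hypothetically) that a candidate novel can be embedded into a
# fixed-size vector; the reward model maps that embedding to a predicted
# human rating and is trained on whatever ratings have been collected.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, novel_embedding: torch.Tensor) -> torch.Tensor:
        # Predicted human rating for each embedded novel.
        return self.net(novel_embedding).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Phase 1: fit the reward model to human ratings (ordinary supervised regression).
def reward_model_step(novel_embeddings: torch.Tensor, human_scores: torch.Tensor) -> float:
    optimizer.zero_grad()
    predicted = reward_model(novel_embeddings)
    loss = nn.functional.mse_loss(predicted, human_scores)
    loss.backward()
    optimizer.step()
    return loss.item()

# Phase 2: use the (frozen) reward model as a differentiable stand-in for l_novel.
# If generator_embedding is a differentiable function of the generator's
# parameters, gradients flow back into the generator through this loss.
def l_novel_surrogate(generator_embedding: torch.Tensor) -> torch.Tensor:
    for p in reward_model.parameters():
        p.requires_grad_(False)
    return -reward_model(generator_embedding).mean()  # lower loss = higher predicted rating
```

The point is just that the slow, non-differentiable human judgment gets replaced by a fast, differentiable approximation once enough feedback has been collected.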
But yeah, framing such goals as loss functions probably gives the wrong intuition for how to approach aligning with them.
Interesting. I have the sense that we would have to get humans to reflect for years after reading a novel to produce a rating that, if optimized, would produce truly great novels. I think that when a novel really moves a person (or, even more importantly, moves a whole culture), it’s not at all evident that this has happened until (often) years after the fact.
I also have the sense that part of what makes a novel great is that a person or a culture decides to associate a certain beautiful insight with it, due to the novel’s role in provoking that insight. But usually the novel is only partly responsible for the insight; in part, we choose to make the novel beautiful by associating it in our culture with a beautiful thing (and this associating of beautiful things is a good and honest thing to do).
Well, then computing ℓnovel would just take a really long time.
So, it’s not impossible in principle if you trained the loss function as I suggested (loss function trained by reinforcement learning, then applied to train the actual novel-generating model), but it is a totally impractical approach.
If you really wanted to teach an AI to generate good novels, you’d probably start by training an LLM to imitate existing novels through some sort of predictive loss (e.g., categorical cross-entropy on next-token prediction) to give it a good prior. Then train another LLM to predict reader reviews or dissertations written by literary grad students, using the novels they’re based on as inputs, again with a similar predictive loss. (Pretraining both LLMs on some large corpus, as with GPT, could probably help with providing the necessary cultural context.) At the same time, use Mechanical Turk to get thousands of people to rate the sentiment of every review/dissertation, then train another LLM to predict the sentiment scores of all raters (or a low-dimensional projection of all their ratings), using the reviews/dissertations as input and something like MSE loss on the predicted sentiment scores. Then chain these latter two networks together to compute ℓnovel, to act as the prune to the first network’s babble, and train to convergence.
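Roughly, the final scoring chain might look like the sketch below. The classes and method names (ReviewPredictor-style models, a `.generate` call, etc.) are hypothetical stand-ins for the two LLMs I described, not real library APIs; the only point is how they chain into a single score used to prune candidates.

```python
# Sketch of chaining the review-predicting LLM and the sentiment-predicting LLM
# into a single l_novel score, used to prune candidate novels ("babble and prune").
import torch

@torch.no_grad()
def l_novel(novel_tokens, review_predictor, sentiment_predictor) -> torch.Tensor:
    # 1. Predict the review/dissertation a reader might write about this novel.
    predicted_review = review_predictor.generate(novel_tokens)
    # 2. Predict the panel of human sentiment ratings for that review
    #    (this model was trained with MSE against the Mechanical Turk scores).
    predicted_sentiments = sentiment_predictor(predicted_review)
    # 3. Collapse to a scalar; negate so that better novels get lower loss.
    return -predicted_sentiments.mean()

def prune(candidate_novels, review_predictor, sentiment_predictor, keep: int = 10):
    # Keep only the candidates with the lowest (best) l_novel score.
    scored = sorted(
        candidate_novels,
        key=lambda novel: l_novel(novel, review_predictor, sentiment_predictor).item(),
    )
    return scored[:keep]
```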
Honestly, though, I probably still wouldn’t trust the resulting system to produce good novels (or at least not with internally consistent plots, characterizations, and themes) if the LLMs were based on a Transformer architecture.
Interesting—why is that?
Mostly due to the limited working memory that Transformers typically use (e.g., a context window feeding only the most recent 512 tokens into the decoder). When humans write novels, they have to keep track of plot points, character sheets, thematic arcs, etc. across tens of thousands of words. You could probably get it to work, though, if you augmented the LLM with content-addressable memory and included a positional encoding that is aware of where in the novel (percentage-wise) each token resides.
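For that last part, one simple scheme (my own sketch, not a standard architecture) would be to add an embedding of each token's fractional position in the whole novel on top of whatever local positional encoding the Transformer already uses:

```python
# Sketch of a "percentage-wise" positional feature: add an embedding of each
# token's fractional position within the whole novel to its token embedding.
import torch
import torch.nn as nn

class FractionalPositionEncoding(nn.Module):
    def __init__(self, d_model: int, num_buckets: int = 100):
        super().__init__()
        self.num_buckets = num_buckets
        # One learned embedding per percentage bucket (0%..99% of the way through).
        self.bucket_embedding = nn.Embedding(num_buckets, d_model)

    def forward(self, token_embeddings: torch.Tensor, token_positions: torch.Tensor,
                novel_length: int) -> torch.Tensor:
        # token_positions: absolute indices of these tokens within the full novel.
        fraction = token_positions.float() / max(novel_length, 1)
        buckets = (fraction * self.num_buckets).long().clamp(max=self.num_buckets - 1)
        return token_embeddings + self.bucket_embedding(buckets)
```

So for a 120,000-token novel, a token at index 60,000 lands in bucket 50, and the model can tell it is at the midpoint of the book regardless of which 512-token window it currently appears in.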
I think you cut yourself off in the first paragraph.