This conversation has been going on for a few days now and I’ve found it very helpful. I want to take a minute or two to step back and think about it, and about transformers and stories. Why stories? Because I’ve spent a lot of time having ChatGPT tell stories, getting a feel for how it does that. But I’m getting ahead of myself.
I wrote the OP because I felt a mismatch between what I feel to be the requirements for telling the kind of stories ChatGPT tells, and the assertion that it’s “just” predicting the next word, time after time after time. How do we heal that mismatch?
Stories
Let’s start with stories, because that’s where I’m starting. I’ve spent a lot of time studying stories and figuring out how they work. I realized long ago that that process must start by simply describing the story. But describing isn’t so simple. For example, it took Early Modern naturalists decades upon decades to figure out how to describe life-forms, plants and animals, well enough so that a naturalist in Paris could read a description by a naturalist in Florence and figure out whether or not that Florentine plant was the same one he had in front of him in Paris (in this case, description includes drawing as well as writing).
Now, believe it or not, describing stories is not simple, depending, of course, on the stories. The ChatGPT stories I’ve been working with, fortunately, are relatively simple. They’re short, roughly between 200 and 500 words long. The ones I’ve given the most attention to are in the 200-300 word range.
They are hierarchically structured on three levels: 1) the story as a whole, 2) individual segments within the story (marked by paragraph divisions in these particular stories), and 3) sentences within those segments. Note that, if we wanted to, we could further divide sentences into phrases, which would give us at least one more level, if not two or three. But three levels are sufficient for my present purposes.
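Those three levels are easy to make concrete. Here’s a minimal sketch in Python, using the first two paragraphs of the hero story quoted later in this piece. The sentence split on ". " is deliberately crude; a real description would need proper sentence segmentation.

```python
# Text from the hero story quoted later in this piece.
story = (
    "Once upon a time, in a land far, far away, there was a young princess named Aurora. "
    "Aurora was a kind and gentle soul, loved by all who knew her. "
    "She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.\n\n"
    "One day, a terrible dragon came to the kingdom and began to terrorize the people. "
    "The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. "
    "The people lived in fear, not knowing what to do."
)

# Level 1: the story as a whole is the full string.
# Level 2: segments, marked here by paragraph breaks.
segments = story.split("\n\n")

# Level 3: sentences within each segment (a crude split on ". ").
sentences = [[s for s in seg.split(". ") if s] for seg in segments]
```

Nothing here is analysis, of course; it just shows that the three-level description is well-defined enough to mechanize.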
Construction
How is it that ChatGPT is able to construct stories organized on three levels? One answer to that question is that it needs to have some kind of procedure for doing that. That sentence seems like little more than a tautological restatement of the question. What if we say the procedure involves a plan? That, it seemed to me when I was writing the OP, seemed better. But “predict the next token” doesn’t seem like much of a plan.
We’re back where we started, with a mismatch. But now it is a mismatch between predict-the-next-token and the fact that these stories are hierarchically structured on three levels.
Let’s set that aside and return to our question: How is it that ChatGPT is able to construct stories organized on three levels? Let’s try another answer to the question. It is able to do it because it was trained on a lot of stories organized on three or more levels. Beyond that, it was trained on a lot of hierarchically structured documents of all kinds. How was it trained? That’s right: Predict the next token.
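To be clear about what the bare procedure amounts to, here is a toy version of it: a bigram model trained on a scrap of the story text, generating by repeatedly predicting the most likely next token. This is nothing like a transformer internally; it only illustrates the shape of predict-the-next-token, stripped of everything that makes the real thing interesting.

```python
from collections import Counter, defaultdict

# A scrap of "training data" drawn from the hero story quoted later.
corpus = (
    "once upon a time , in a land far , far away , there was a young "
    "princess named aurora . aurora was a kind and gentle soul , "
    "loved by all who knew her ."
).split()

# Count bigrams: how often does each word follow each other word?
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word following `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0]

# Generate by doing nothing but predicting the next token, over and over.
word, out = "once", ["once"]
for _ in range(5):
    word = predict_next(word)
    out.append(word)
```

A transformer replaces the bigram table with 175 billion weights conditioning on the entire preceding context, but the generation loop itself really is this simple.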
It seems to me that if it is going to improve on that task, it must somehow 1) learn to recognize that a string of words is hierarchically structured, and 2) exploit what it has learned in predicting the next token. What cues are in the string that guide ChatGPT in making these predictions?
Whatever those cues are, they are registered in those 175 billion weights. Those cues are what I meant by “plan” in the OP.
Tell me a story
At this point we should be able to pick one of those stories and work our way through it from beginning to end, identifying cues as we go along. Even in the case of a short 200-word story, though, that would be a long and tedious process. At some point, someone HAS to do it, and their work needs to be vetted by others. But we don’t need to do that now.
But I can make a few observations. Here’s the simplest prompt I’ve used: “Tell me a story.” The population of tokens that would be a plausible initial token is rather large. How does that population change as the story evolves?
I’ve done a lot of work with stories generated by this prompt: “Tell me a story about a hero.” That’s still wide open, but the requirement that it be about a hero does place some vague restrictions on the population of available tokens. One story ChatGPT gave me in response to that prompt began with this sentence: “Once upon a time, in a land far, far away, there was a young princess named Aurora.” That’s formulaic, from beginning to end. There are a number of options in the formula, but we could easily use up 200 or 300 words discussing them and laying out the syntactic options in the form of a tree or formula. Let’s assume that’s been done.
What next? Here’s the second sentence: “Aurora was a kind and gentle soul, loved by all who knew her.” It’s all about Aurora. Everything in that sentence is drawn from a token population useful for characterizing Aurora. Third sentence: “She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.” Those tokens are drawn from the same population as the words in the previous sentence.
What about the fourth sentence? Does ChatGPT continue to draw from the same population or does its attention shift to a new population? Note that at some point it is going to have to draw tokens from a new population, otherwise the story goes nowhere. Here’s the fourth sentence: “One day, a terrible dragon came to the kingdom and began to terrorize the people.” That’s a new population of tokens. ChatGPT has moved from the first segment of the story trajectory (as I am calling it) to the second.
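One crude way to make that population shift visible is to compare word overlap between adjacent sentences. This is only a surface proxy; the populations I have in mind are probability distributions inside the model, not word sets. But even at the surface, sentences 2 and 3 share more vocabulary with each other than sentence 3 shares with sentence 4.

```python
def tokens(sentence):
    """Lowercased word set, with trailing punctuation stripped."""
    return {w.strip(",.").lower() for w in sentence.split() if w.strip(",.")}

def jaccard(a, b):
    """Overlap between two token sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

s2 = "Aurora was a kind and gentle soul, loved by all who knew her."
s3 = "She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice."
s4 = "One day, a terrible dragon came to the kingdom and began to terrorize the people."

within_segment = jaccard(tokens(s2), tokens(s3))   # sentences 2 and 3
across_segments = jaccard(tokens(s3), tokens(s4))  # sentences 3 and 4
```

The within-segment overlap comes out higher than the across-segment overlap, consistent with the segment boundary falling between sentences 3 and 4.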
You get the idea. I have no intention of continuing on to the end of the story. But you can do so if you wish. Here’s the whole story:
Once upon a time, in a land far, far away, there was a young princess named Aurora. Aurora was a kind and gentle soul, loved by all who knew her. She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.
One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do.
But Aurora was not afraid. She knew that she had to do something to save her people, and so she set out to find the dragon. She traveled for days, braving the cold and the snow, until she finally reached the dragon’s lair.
With her heart pounding in her chest, Aurora approached the dragon and began to sing. The dragon was enchanted by her beautiful voice, and it slowly began to calm down. Aurora continued to sing, soothing the beast until it was completely calm.
With the dragon no longer a threat, Aurora returned to the kingdom as a hero. The people cheered and celebrated her bravery, and Aurora was hailed as a champion. She lived happily ever after, knowing that she had saved her kingdom and its people.