I don’t think the story structure is compelling evidence against it being purely next token prediction. When humans write stories, it is very common for them to talk about a kind of flow state where they have very little idea what the next sentence is going to say until they get there. Stories made this way still have a beginning, middle, and end, because if you have nothing written so far you must be at the beginning; if you can see a beginning you must be in the middle, and so on. Sometimes these stories just work, but more often the ending needs a bit of fudging, or else you need to go back and edit earlier bits to put things in place for the ending. (A fudge would be some kind of “and then all the problems were resolved.”) Having played with GPT a little, I find it fudges its endings a lot.
I am not saying that it is purely next token prediction; I am just dubious about your evidence that it is not.
Quick reply, after doing a bit of reading and recalling a thing or two: In a ‘classical’ machine we have a clean separation of process and memory. Memory is kept on the paper tape of our Turing machine, and processing is located in, well, the processor. In a connectionist machine, process and memory are all smushed together. GPTs are connectionist virtual machines running on a classical machine. The “plan” I’m looking for is stored in the parameter weights, but it’s smeared over a bunch of them, so this classical machine has to visit every one of them before it can output a token.
So, yes, purely next token prediction. But the prediction cycle, in effect, involves ‘reassembling’ the plan each time through.
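Here’s a minimal sketch in toy Python of what I mean (my own illustration, nothing like GPT’s real architecture, and the names and sizes are made up): the whole weight set is consulted on every pass through the loop, so whatever “plan” lives in the weights gets reapplied token by token.

```python
# Toy next-token loop: every parameter participates in producing *each* token,
# so the "plan" is, in effect, reassembled on every prediction cycle.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50, 16                         # stand-ins, not real model sizes

W_embed = rng.normal(size=(VOCAB, DIM))     # toy "parameters"
W_out = rng.normal(size=(DIM, VOCAB))

def next_token(context):
    """One prediction cycle: the full weight set is visited."""
    h = W_embed[context].mean(axis=0)       # crude 'reading' of the context so far
    logits = h @ W_out                      # every output weight consulted
    return int(np.argmax(logits))           # pick the most likely next token

story = [0]                                 # start-of-story token
for _ in range(10):
    story.append(next_token(story))         # same weights, revisited each step
print(story)
```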
To my mind, in order to say we “understand” how this puppy is telling a story, we need to say more than that it’s a next-token-prediction machine. We need to say something about how that “plan” is smeared over those weights. We need to come up with concepts we can use in formulating such explanations. Maybe the right concepts are just lying scattered about in dusty old file cabinets someplace. But I’m thinking it’s likely we will have to invent some new ones as well.
Wolfram was trained as a physicist. The language of complex dynamics is natural to him, whereas it’s a poorly learned third or fourth language for me. So he talks of basins of attraction and attractor landscapes. As far as I can tell, in his language, those 175B parameters can be said to have an attractor landscape. When ChatGPT tells a story it enters the Story Valley in that landscape and walks a path through that valley. When it’s done with the story, it exits that valley. There are all kinds of valleys (and valleys within valleys (and valleys within them)) in the attractor landscape, for all kinds of tasks.
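For anyone who, like me, doesn’t speak complex dynamics natively, here’s a toy illustration (mine, not Wolfram’s) of a landscape with two valleys: where a trajectory starts determines which valley it settles into, the way a prompt steers the model into the Story Valley rather than some other one.

```python
# A one-dimensional "landscape" with two basins of attraction:
# the double-well potential V(x) = (x^2 - 1)^2 has valleys at x = -1 and x = +1.

def grad(x):
    # Gradient of V(x) = (x^2 - 1)^2.
    return 4 * x * (x**2 - 1)

def settle(x, steps=1000, lr=0.01):
    """Walk downhill until the trajectory comes to rest in some valley."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(settle(-0.3))   # ends near -1.0: starting on the left puts you in the left valley
print(settle(+0.3))   # ends near +1.0: starting on the right puts you in the right valley
```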
FWIW, the human brain has roughly 86B neurons. Each of those is connected to roughly 10K other neurons. Those connections are mediated by upward of 100 different chemicals. And those neurons are surrounded by glial cells. In the old days researchers thought those glial cells were like packing peanuts for the neural net. We now know better and are beginning to figure out what they’re doing. Memory is definitely part of their story, so we’ve got to add them into the mix. How many glial cells per neuron? There might be a number in the literature, but I haven’t checked. Anyhow, the number of parameters we need to characterize a human brain is vast.
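Just to put some back-of-the-envelope arithmetic on that (counting only the neuron-to-neuron connections, and leaving the glial contribution out since I don’t have a number for it):

```python
# Rough synapse count from the figures above: ~86B neurons x ~10K connections each.
neurons = 86e9
connections_per_neuron = 10e3
synapses = neurons * connections_per_neuron
print(f"{synapses:.1e} synapses")   # ~8.6e14, i.e. on the order of a quadrillion
```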