I do not get your argument here, it doesn’t track. I am not an expert in transformer systems or the in-depth architecture of LLMs, but I do know enough to make me feel that your argument is very off.
You argue that training is different from inference, as part of your argument that LLM inference has a global plan. While training is different from inference, it seems to me that you may not have a clear idea of how they are different.
You quote the accurate statement that “LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet...”
Training, intrinsically, involves inference. Training USES inference. Training is simply optimizing the inference result by, as the above quote implies, “minimizing loss on [inference result]”. Next-token prediction IS the inference result.
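The point that the training loss is computed from an inference result can be shown in a toy sketch. (The logit table and tokens below are made up for illustration; a real LLM computes its logits with a transformer, but the relationship between inference output and training loss is the same.)

```python
import math

# Toy "model": a table of raw scores (logits) for each possible next
# token given the previous token. Hypothetical values, just to show
# that the training loss is computed FROM an inference result.
logits = {
    "the": {"cat": 2.0, "dog": 1.0, "the": -1.0},
    "cat": {"sat": 2.5, "the": 0.0, "dog": -0.5},
}

def predict(prev):
    """Inference: turn logits into a next-token distribution (softmax)."""
    scores = logits[prev]
    z = sum(math.exp(v) for v in scores.values())
    return {tok: math.exp(v) / z for tok, v in scores.items()}

def training_loss(corpus):
    """Training objective: cross-entropy of the model's inference output
    against the actual next token in the training text."""
    loss = 0.0
    for prev, nxt in zip(corpus, corpus[1:]):
        probs = predict(prev)          # <- training runs inference here
        loss += -math.log(probs[nxt])  # penalize low probability on truth
    return loss / (len(corpus) - 1)

print(training_loss(["the", "cat", "sat"]))
```

Training then adjusts the logits to push this number down; the loss itself is nothing more than a score on the inference result.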
You can always ask the LLM, without having it store any state, to produce the next token, and do that again, and again, etc. It doesn't have any plans; it just takes the provided input, performs statistical calculations on it, and produces the next token. That IS prediction. It doesn't have a plan and doesn't store a state. It just uses weights and biases (denoting the statistically significant ways of combining the input to produce a hopefully near-optimal output) and numbers (like the query, key, and value vectors) denoting the statistical significance of the input text in relation to itself, and through that statistical process it predicts the next token. It doesn't have a global plan.
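That stateless loop can be sketched as follows. (The lookup table is a hypothetical stand-in for a real model's forward pass; the point is only that the sole "memory" between steps is the text itself, fed back in full on every call.)

```python
def next_token(context):
    """Stand-in for one stateless inference pass: the model sees only
    the tokens passed in and returns a single next token. A real LLM
    does this with attention over the context; here it's a toy lookup."""
    table = {
        ("once",): "upon",
        ("once", "upon"): "a",
        ("once", "upon", "a"): "time",
    }
    return table.get(tuple(context), "<end>")

def generate(prompt, max_tokens=10):
    """Autoregressive loop: no hidden state survives between steps;
    each call receives the entire text so far and nothing else."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)  # fresh, stateless call each time
        if tok == "<end>":
            break
        tokens.append(tok)
    return tokens

print(generate(["once"]))
```

Anything that looks like a "plan" across the output has to be carried either in the weights or in the growing context window, because nothing else persists from one call to the next.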
Thanks for reminding me that training uses inference.
As for ChatGPT having a global plan: as you can see from the comments I've made earlier today, I have come around to that view. The people who wrote the stories ChatGPT consumed during training had plans, and those plans are reflected in the stories they wrote. That structure is "smeared" over all those parameter weights and gets "reconstructed" each time ChatGPT generates a new token.
In his last book, The Computer and the Brain, John von Neumann noted, quite correctly, that each neuron is both a memory store and a processor. Subsequent research has made it clear that the brain stores specific things – objects, events, plans, whatever – in populations of neurons, not individual neurons. These populations operate in parallel.
We don't yet have the luxury of such processors, so we have to make do with programming a virtual neural net to run on a processor with far more memory units than processing units. And so our virtual machine has to visit each memory unit every time it takes one step in its virtual computation.
It does seem like there are “plans” or formats in place, not just choosing the next best word.
When it creates a resume, a business plan, or a timeline, it seems much more likely that there is some form of structure or template it is using, and that it then chooses the words that go best in their correct places.
Stories have a structure: beginning, middle, end. So it's not just picking words; it's picking the words that go best with a beginning, then the words that go best with a middle, and then with an end. If it were just choosing next words, you could imagine it being a little more creative and less formulaic.
This model was trained by humans, who told it when it had the structure right, and the weights were made heavier where it conformed to the right preexisting plan. So if anything, the "neural" pathways that formed the strongest connections are the ones that (1) resulted in the best use of tokens and (2) were weighted deliberately higher by the human trainers.