There is no ‘the final token’ for weights not at the final layer.
Because that is where all the gradients flow from, and why the dog wags the tail.
Aggregations of things need not be of the same kind as their constituent things? This is a lot like calling an LLM an activation optimizer. While strictly in some sense true of the pieces that make up the training regime, it’s also kind of a wild way to talk about things in the context of ascribing motivation to the resulting network.
I think maybe you’re intending ‘next token prediction’ to mean something more like ‘represents the data distribution, as opposed to some metric on the output’, but if you are this seems like a rather unclear way of stating it.
There is no ‘the final token’ for weights not at the final layer.
Aggregations of things need not be of the same kind as their constituent things? This is a lot like calling an LLM an activation optimizer. While strictly in some sense true of the pieces that make up the training regime, it’s also kind of a wild way to talk about things in the context of ascribing motivation to the resulting network.
I think maybe you’re intending ‘next token prediction’ to mean something more like ‘represents the data distribution, as opposed to some metric on the output’, but if you are this seems like a rather unclear way of stating it.