You’re at token i in a non-final layer. Which token’s output are you optimizing for? i+1?
I already addressed this point. If I’m in a non-final layer then I can be optimizing for arbitrary tokens within the context window, sure, and ‘effectively’ predicting intermediate tokens because that is the ‘dominant’ effect at that location… insofar as it is instrumentally useful for predicting the final token using the final layer. Because that is where all the gradients flow from, and why the dog wags the tail.
There is no ‘the final token’ for weights not at the final layer.
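A toy sketch of what I mean, assuming the usual autoregressive setup where a loss is taken at every position and attention is causal. Everything here is a hypothetical stand-in: `h` plays the role of a non-final layer's output at each position, the "upper layer" is just uniform causal averaging, and squared error replaces cross-entropy. Nudging the activation at position i moves the loss at every position from i onward, so there is no single token that activation is "for".

```python
def per_position_losses(h, targets):
    """Toy upper layer: uniform causal attention, then squared error per position."""
    losses = []
    for j in range(len(h)):
        # Position j only sees positions 0..j (causal mask).
        z_j = sum(h[: j + 1]) / (j + 1)
        losses.append((z_j - targets[j]) ** 2)
    return losses


def affected_positions(h, targets, i, eps=1e-3, tol=1e-12):
    """Return the positions whose loss changes when h[i] is nudged by eps."""
    base = per_position_losses(h, targets)
    bumped = list(h)
    bumped[i] += eps
    new = per_position_losses(bumped, targets)
    return [j for j, (a, b) in enumerate(zip(base, new)) if abs(a - b) > tol]


# Hypothetical lower-layer activations and per-position targets.
H = [0.5, -1.0, 2.0, 0.3]
TARGETS = [1.0, 0.0, -1.0, 0.5]

for i in range(len(H)):
    print(i, affected_positions(H, TARGETS, i))
```

With these made-up numbers, position 1 affects the losses at positions 1, 2 and 3, and only the last position affects a single loss. A real transformer differs in every detail, but the dependency pattern is the same under causal masking.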
Because that is where all the gradients flow from, and why the dog wags the tail.
Aggregations of things need not be of the same kind as their constituent things? This is a lot like calling an LLM an activation optimizer. While strictly in some sense true of the pieces that make up the training regime, it’s also kind of a wild way to talk about things in the context of ascribing motivation to the resulting network.
I think maybe you’re intending ‘next token prediction’ to mean something more like ‘represents the data distribution, as opposed to some metric on the output’, but if you are, this seems like a rather unclear way of stating it.