Imagine a prediction objective that is not myopic, but instead requires creating long chains of internal inference to arrive at a prediction, chains closer in length to a full-context GPT completion. I don’t see how such a prediction objective would give rise to the interesting dynamics that seem true of GPT. My guess is that, in pursuit of such a non-myopic prediction objective, you would see the development of quite instrumental forms of reasoning and general-purpose problem-solving, with substantial divergence from how we currently think of GPTs.
The pretraining objective isn’t myopic? The parameter updates route across the entire context, backing up from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
the extremely myopic prediction objective that they optimize
As a smaller note, language models do not optimize the predictive objective, so much as the loss function optimizes the language model. I think the wording you chose is going to cause confusion and lead to incorrect beliefs.
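To make the gradient-routing point concrete, here is a minimal toy sketch (one head of causal attention with made-up sizes, not any real model's training code): backpropagating only the next-token loss at the final position still yields a nonzero gradient at position 0, because the final position attends back over the whole prefix.

```python
# A toy sketch, not any real model's training code: one head of causal
# self-attention with made-up sizes, to show that the loss at the LAST
# position alone still sends gradient back to the activation at position 0.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, vocab = 8, 16, 32

x = torch.randn(seq_len, d_model, requires_grad=True)   # per-position activations
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
Wout = torch.randn(d_model, vocab)

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / d_model ** 0.5
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
attn = torch.softmax(scores + causal_mask, dim=-1)       # row t attends to positions <= t
logits = (attn @ v) @ Wout                               # (seq_len, vocab)

# Backprop ONLY the next-token loss at the final position.
target = torch.tensor([3])                               # arbitrary target token
loss_last = F.cross_entropy(logits[-1:], target)
loss_last.backward()

# Nonzero: the update for a loss defined at the last position routes through
# the keys and values computed at position 0.
print(x.grad[0].abs().sum())
```

The same holds for shared weights: any parameter's update is the sum of gradients from every position's loss.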
The pretraining objective isn’t myopic? The parameter updates route across the entire context, backing up from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
This is something I’ve been thinking a lot about, but still don’t feel super robust in. I currently think it makes sense to describe the pretraining objective as myopic in the relevant way, but am really not confident. I agree that the training objective isn’t as myopic as I implied here, though I also don’t think the training objective is well-summarized as jointly optimizing the whole context-length response.
I have a dialogue I’ll probably publish soon about this, and would be interested in your comments on it when it goes live. Probably doesn’t make sense to go in-depth about this before that’s published, since it captures my current confusions and thoughts probably better than what I would write anew in a comment thread like this.
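For concreteness, the decomposition I have in mind is roughly the sketch below (a toy stand-in model with made-up sizes, not anyone's actual training code): each term of the pretraining loss is a next-token prediction conditioned on the ground-truth prefix, rather than a joint score over the model's own whole completion.

```python
# A toy sketch with a stand-in "model" (a real GPT body goes where `body` is);
# the sizes and the model itself are made up, only the loss structure matters here.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model, seq_len = 32, 16, 8
tokens = torch.randint(vocab, (seq_len,))                # ground-truth text

embed = nn.Embedding(vocab, d_model)
body = nn.Linear(d_model, d_model)                       # placeholder for a causal transformer
unembed = nn.Linear(d_model, vocab)

def logits_fn(prefix):
    return unembed(torch.tanh(body(embed(prefix))))

# What pretraining optimizes (teacher forcing): one next-token cross-entropy
# term per position, each conditioned on the GROUND-TRUTH prefix. No term
# scores the model's OWN sampled continuation, so nothing directly rewards the
# output at step t for setting up steps t+1..T; that is the "myopic" sense.
per_position = F.cross_entropy(logits_fn(tokens[:-1]), tokens[1:], reduction="none")
pretraining_loss = per_position.sum()

# "Jointly optimizing the whole context-length response" would instead mean
# sampling a full completion and scoring it as a unit, which needs something
# like policy gradients because sampling is not differentiable. That would be
# a different objective from the sum above, even though the per-position terms
# already backprop through the whole context (as in the sketch upthread).
```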