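Is the pretraining objective really myopic, though? Gradients flow across the entire context: the loss at later positions backpropagates through those positions' attention scores into, e.g., the MLP sublayer outputs at position 0, so the computation at early positions gets updated to help later predictions.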
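To make the gradient-routing point concrete, here is a minimal sketch, assuming a toy single-head causal attention layer over scalar hidden states. Everything in it (the function names, the squared-error loss, the numbers) is hypothetical and chosen only for illustration; it is not how any real model is trained. It uses finite differences to check that the loss at the last position depends on the hidden state at position 0, while the loss at position 0 does not depend on later positions.

```python
import math

def causal_attention(h):
    """Toy single-head causal self-attention over scalar hidden states.

    h[s] stands in for the MLP sublayer output at position s (an assumption
    for illustration). Query, key and value are all just h itself.
    """
    out = []
    for t in range(len(h)):
        # Position t may only attend to positions 0..t (causal mask).
        scores = [h[t] * h[s] for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        out.append(sum((e / z) * h[s] for s, e in enumerate(exps)))
    return out

def per_position_loss(h, targets):
    """Squared error at each position (a stand-in for per-token loss)."""
    return [(p - y) ** 2 for p, y in zip(causal_attention(h), targets)]

def finite_diff_grad(h, targets, loss_index, input_index, eps=1e-6):
    """Central finite difference of loss[loss_index] w.r.t. h[input_index]."""
    up = list(h)
    up[input_index] += eps
    dn = list(h)
    dn[input_index] -= eps
    hi = per_position_loss(up, targets)[loss_index]
    lo = per_position_loss(dn, targets)[loss_index]
    return (hi - lo) / (2 * eps)

h = [0.5, -1.0, 2.0]         # hypothetical hidden states at positions 0..2
targets = [0.0, 1.0, -1.0]   # hypothetical regression targets

# Loss at the LAST position w.r.t. the hidden state at position 0: nonzero.
grad_later_loss_wrt_pos0 = finite_diff_grad(h, targets, loss_index=2, input_index=0)

# Loss at position 0 w.r.t. the hidden state at the last position: zero.
grad_pos0_loss_wrt_later = finite_diff_grad(h, targets, loss_index=0, input_index=2)

print(grad_later_loss_wrt_pos0, grad_pos0_loss_wrt_later)
```

The asymmetry is the point: the forward pass is causal, but the backward pass sends signal from later losses into earlier positions. Whether that makes the objective non-myopic "in the relevant way" is a separate question that this sketch does not settle.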
This is something I’ve been thinking a lot about, but I still don’t feel I have a robust view on it. I currently think it makes sense to describe the pretraining objective as myopic in the relevant way, but I’m really not confident. I agree that the training objective isn’t as myopic as I implied here, though I also don’t think it is well-summarized as jointly optimizing the whole context-length response.
I have a dialogue I’ll probably publish soon about this, and would be interested in your comments on it when it goes live. Probably doesn’t make sense to go in-depth about this before that’s published, since it captures my current confusions and thoughts probably better than what I would write anew in a comment thread like this.