Hmm, I don’t think so. Or at least, the novel things in that paper don’t seem to correspond.
My understanding of what this paper does:
- Trains models to predict the next 4 tokens instead of the next 1 token as an auxiliary training objective. Note that this training objective yields better performance on downstream tasks even when just using the next-token-prediction component (the normally trained component) and discarding the other heads. Notably, this is just something like “adding this additional prediction objective helps the model learn more/faster”. In other words, this result doesn’t involve changing how the model is actually used; it just adds an additional training task. (A rough sketch of my reading of this objective follows this list.)
- Uses these extra heads for speculative execution, a well-known approach in the literature for accelerating inference (also sketched below).
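To make the first bullet concrete, here is a minimal sketch of how I understand that auxiliary objective, assuming a shared trunk with one output head per future offset; all class and function names here are mine, not the paper’s:

```python
# Minimal sketch (my reading of the setup, not the paper's code): a shared
# transformer trunk with one output head per future offset. Head 0 is the
# ordinary next-token head; the remaining heads predict tokens further ahead
# and exist only as an auxiliary training signal.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenPredictionModel(nn.Module):
    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk  # any module mapping token ids -> (batch, seq, d_model)
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, input_ids: torch.Tensor) -> list[torch.Tensor]:
        hidden = self.trunk(input_ids)                 # (batch, seq, d_model)
        return [head(hidden) for head in self.heads]   # one logit tensor per offset


def multi_token_loss(model: MultiTokenPredictionModel, input_ids: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy over all future offsets.

    Offset k's target at position t is the token at position t + k + 1, so each
    extra head just shifts the labels one step further into the future.
    """
    logits_per_head = model(input_ids)
    losses = []
    for k, logits in enumerate(logits_per_head):
        shift = k + 1
        pred = logits[:, :-shift, :]          # positions that still have a target
        target = input_ids[:, shift:]         # tokens `shift` steps ahead
        losses.append(
            F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
        )
    return torch.stack(losses).mean()
```

For the downstream-task result in the first bullet, only head 0 (the ordinary next-token head) would be kept at evaluation time; the extra heads exist purely as additional training signal.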
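And for the second bullet, a rough sketch of how those extra heads can drive self-speculative decoding, again with hypothetical names and greedy decoding for simplicity; this is the generic technique as I understand it, not necessarily the paper’s exact scheme:

```python
# Rough sketch: the future heads cheaply propose a few tokens from the current
# context, then a single verification pass with the ordinary next-token head
# keeps the longest agreeing prefix. Because the first drafted token is head
# 0's own greedy prediction, at least one token is accepted per round.
import torch


@torch.no_grad()
def speculative_step(model, input_ids: torch.Tensor) -> torch.Tensor:
    """One accept-or-reject round of greedy self-speculative decoding.

    For simplicity this treats the whole batch as one unit when accepting.
    """
    logits_per_head = model(input_ids)  # the MultiTokenPredictionModel above
    # Draft: greedy guesses for the next n_future positions from the last step.
    draft = torch.stack(
        [logits[:, -1, :].argmax(dim=-1) for logits in logits_per_head], dim=1
    )                                                       # (batch, n_future)
    candidate = torch.cat([input_ids, draft], dim=1)

    # Verify: rerun the model once on the extended sequence; head 0's greedy
    # prediction at each drafted position must match the draft to be accepted.
    verify_logits = model(candidate)[0]                     # next-token head only
    n_ctx = input_ids.size(1)
    accepted = input_ids
    for i in range(draft.size(1)):
        predicted = verify_logits[:, n_ctx - 1 + i, :].argmax(dim=-1)
        if not torch.equal(predicted, draft[:, i]):
            break
        accepted = torch.cat([accepted, draft[:, i : i + 1]], dim=1)
    return accepted
```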
Hmm, I think the first bullet point is pretty precisely what I am talking about (though to be clear, I haven’t read the paper in detail).
I was specifically saying that trying to somehow get feedback from future tokens into the next token objective would probably do some interesting things and enable a bunch of cross-token optimization that currently isn’t happening, which would improve performance on some tasks. This seems like what’s going on here.
Agree that another major component of the paper is accelerating inference, which I wasn’t talking about. I would have to read the paper in more detail to get a sense of how much of the contribution is just that; if it mostly is, I wouldn’t think it’s a good example.