In practice, we focus on the embedding associated with the last token from a late layer.
I don’t have time to provide citations right now, but a few results have made me skeptical of this choice—probably you’re better off using an intermediate layer, rather than a late one. Early and late layers seem to deal more with token-level concerns, while mid-layers seem to handle more conceptual / abstract features.
I don’t have time to provide citations right now, but a few results have made me skeptical of this choice—probably you’re better off using an intermediate layer, rather than a late one. Early and late layers seem to deal more with token-level concerns, while mid-layers seem to handle more conceptual / abstract features.