I think it’s worth taking a look at what’s out there:
SpanBERT
- Masks random contiguous spans of tokens during pre-training, rather than individual tokens (see the sketch below)
- Seems to indicate that predicting longer spans is especially difficult
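For intuition, here is a minimal toy sketch of span-style masking, not the paper's code: span lengths are drawn from a geometric distribution and clipped (the SpanBERT paper uses p=0.2 and a max length of 10, masking ~15% of tokens), and the whole span is replaced with a mask token. The function name `sample_span_masks` and the exact budgeting logic are my own.

```python
import random

def sample_span_masks(tokens, mask_ratio=0.15, p=0.2, max_span=10,
                      mask_token="[MASK]"):
    """Toy span masking: mask contiguous spans until ~mask_ratio of the
    tokens are covered. Span lengths follow a geometric distribution, so
    short spans are common and long ones rare (as in SpanBERT)."""
    n = len(tokens)
    budget = max(1, int(n * mask_ratio))
    masked = list(tokens)
    covered = set()
    while len(covered) < budget:
        # Sample a span length ~ Geometric(p), clipped to max_span.
        length = 1
        while random.random() >= p and length < max_span:
            length += 1
        # Never exceed the remaining budget or the sequence length.
        length = min(length, budget - len(covered), n)
        start = random.randrange(n - length + 1)
        for i in range(start, start + length):
            masked[i] = mask_token
            covered.add(i)
    return masked, sorted(covered)

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, positions = sample_span_masks(tokens, mask_ratio=0.3)
print(masked)     # e.g. ['the', '[MASK]', '[MASK]', 'fox', ...]
print(positions)  # indices of the masked tokens
```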
Distillation of BERT Models
- BERT's embeddings are hierarchical (lower layers encode surface and syntactic features, upper layers encode semantics), which distillation schemes can exploit by matching student layers to teacher layers; see the sketch below
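As a concrete illustration, here is a minimal sketch of one common distillation recipe that leans on this hierarchy: soft-target distillation on the logits combined with layer-to-layer hidden-state matching (in the spirit of Patient-KD / TinyBERT, not any specific paper's code). It assumes student and teacher share the hidden dimension; the function name and the uniform layer mapping are my own choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      T=2.0, alpha=0.5):
    """Soft-target distillation plus layer-to-layer hidden-state matching.

    student_hidden / teacher_hidden are lists of [batch, seq, dim] tensors,
    one per layer; assumes equal hidden dims (a real setup would add a
    learned projection if they differ)."""
    # Soft targets: KL between temperature-scaled distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Uniform layer mapping: student layer i mimics teacher layer
    # (i + 1) * k - 1, so the student's lower/upper layers track the
    # teacher's lower/upper layers and the hierarchy is preserved.
    k = len(teacher_hidden) // len(student_hidden)
    layer = sum(
        F.mse_loss(s, teacher_hidden[(i + 1) * k - 1])
        for i, s in enumerate(student_hidden)
    ) / len(student_hidden)

    return alpha * soft + (1 - alpha) * layer
```

With a 12-layer teacher and a 6-layer student, the mapping above matches student layers 1..6 to teacher layers 2, 4, 6, 8, 10, 12.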