I think the combination of learned positional embeddings and training on only short sequences is likely to be the issue. Changing either would suffice.
Makes sense. Will set off some runs with longer context sizes and track this in the future.
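For reference, a toy sketch of why that combination would hurt at longer contexts. This is not the actual model: `train_position_table`, `untrained_positions`, and the scalar "embedding" per position are all made up for illustration, and the update rule is only a stand-in for a gradient step. The point it tries to show is that a learned position table only gets updates for indices that occur during training, so rows past the training length stay at their random initialization.

```python
import random


def train_position_table(max_positions, train_len, steps, seed=0):
    """Toy stand-in for a learned positional embedding table (hypothetical).

    Each position holds one scalar. A training step only touches the
    positions that occur in that step's sequence, i.e. indices < seq_len.
    """
    rng = random.Random(seed)
    table = [rng.gauss(0.0, 0.02) for _ in range(max_positions)]
    initial = list(table)
    update_counts = [0] * max_positions

    for step in range(steps):
        # Cycle deterministically through sequence lengths 1..train_len.
        seq_len = (step % train_len) + 1
        for pos in range(seq_len):
            # Stand-in for a gradient update on this position's embedding.
            table[pos] += 0.1 * (1.0 - table[pos])
            update_counts[pos] += 1

    return table, initial, update_counts


def untrained_positions(update_counts):
    """Return the position indices that never received an update."""
    return [pos for pos, n in enumerate(update_counts) if n == 0]


# Short training sequences: positions 4..15 are never updated.
short_table, short_init, short_counts = train_position_table(16, 4, 8)

# Longer training sequences: every position in the table gets updated.
long_table, long_init, long_counts = train_position_table(16, 16, 32)
```

Under these assumptions, `untrained_positions(short_counts)` lists every index from 4 upward, while `untrained_positions(long_counts)` is empty, which is the "train on longer sequences" fix. The other fix, replacing the learned table with a positional scheme that has no per-position parameters, is not shown here.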