My mental model is the encoder is working hard to find particular features and distinguish them from others (so it’s doing a compressed sensing task) and that out of context it’s off distribution and therefore doesn’t distinguish noise properly. Positional features are likely a part of that but I’d be surprised if it was most of it.
My mental model is the encoder is working hard to find particular features and distinguish them from others (so it’s doing a compressed sensing task) and that out of context it’s off distribution and therefore doesn’t distinguish noise properly. Positional features are likely a part of that but I’d be surprised if it was most of it.