That they can learn good priors. That’s pretty much what I think happens with pretraining: learn the prior distribution of data in a domain, and then you can adapt that knowledge to many downstream tasks.
Also, I don’t think built-in priors are that helpful. CNNs have a strong locality prior, while transformers don’t, so you’d expect CNNs to be much better at image processing. After all, a transformer’s prior is that the pixel at (0,0) is just as likely to relate to the pixel at (512, 512) as it is to relate to its neighboring pixel at (0,1). However, experiments have shown that vision transformers are competitive with state-of-the-art, highly tuned CNNs. (here, here)
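To make the contrast concrete, here is a minimal PyTorch sketch (the layer sizes and patch dimensions are arbitrary, chosen just for illustration): a 3×3 convolution can only mix a pixel with its immediate neighbors, while self-attention over image patches scores every pair of positions, near or far, identically at initialization.

```python
import torch
import torch.nn as nn

# Toy illustration of the two inductive biases discussed above.
# A 3x3 convolution only mixes each pixel with its 8 immediate neighbours:
# locality is baked into the architecture before any training happens.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
image = torch.randn(1, 3, 64, 64)            # (batch, channels, H, W)
local_features = conv(image)                 # each output pixel sees only a 3x3 window

# A transformer layer over image patches has no such prior: attention scores
# are computed between every pair of patches, so the patch at one corner is
# connected to the opposite corner exactly as directly as to its neighbour.
patchify = nn.Conv2d(3, 32, kernel_size=8, stride=8)   # ViT-style 8x8 patch embedding
tokens = patchify(image).flatten(2).transpose(1, 2)     # (batch, 64 patches, 32 dims)
attention = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
global_features, weights = attention(tokens, tokens, tokens)
print(weights.shape)   # (1, 64, 64): one score for every patch pair, near and far
```

The point of the sketch is just that the transformer has to learn any locality structure from data, yet in practice that doesn’t stop it from matching the CNN.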