I’m implying that the popular ANNs in use today have bad priors. Are you implying that a sufficiently large ANN has good priors or that it can learn good priors?
That they can learn good priors. That’s pretty much what I think happens with pretraining: the model learns the prior distribution of data in a domain, and that knowledge can then be adapted to many downstream tasks.
Also, I don’t think built-in priors are that helpful. CNNs have a strong locality prior, while transformers don’t. You’d think that would make CNNs much better at image processing: after all, a transformer’s prior is that the pixel at (0, 0) is just as likely to relate to the pixel at (512, 512) as to its neighbor at (0, 1). However, experiments have shown that transformers are competitive with state-of-the-art, highly tuned CNNs. (here, here)
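To make the contrast concrete, here is a minimal sketch (assuming PyTorch; none of this code is from the discussion above, and real vision transformers embed patches rather than individual pixels) showing that a convolution hard-codes locality while self-attention treats all pixel pairs alike:

```python
import torch
import torch.nn as nn

# Tiny 16x16 "image" so the all-pairs attention step stays cheap.
x = torch.randn(1, 3, 16, 16)  # (batch, channels, height, width)

# Convolution bakes in the locality prior: each output pixel is computed only
# from its 3x3 neighborhood, so distant pixels cannot interact in one layer.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
y_conv = conv(x)  # shape (1, 8, 16, 16)

# Self-attention has no such prior: every pixel-token attends to every other
# pixel-token, so (0, 0) relates to (15, 15) as directly as to (0, 1).
tokens = x.flatten(2).transpose(1, 2)        # (1, 256, 3): one token per pixel
embed = nn.Linear(3, 8)(tokens)              # (1, 256, 8)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
y_attn, weights = attn(embed, embed, embed)  # weights: (1, 256, 256) all-pairs map
```

Any relationship between distant pixels that the attention layer ends up using has to be learned from data, which is the sense in which a transformer can learn a prior rather than having one built in.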