I don’t completely follow what you mean by data-distribution dependence. We can still talk about the function a transformer implements in a distribution-independent way, but in general that function might be quite complicated. Should I understand you as saying that we can usually simplify the description of the function once we take the data distribution into account?