That’s true, but for the long-run behavior, the more expensive dense attention layers should still dominate, I think.