Yeah, if you use constant compute to explain each “feature” and features are proportional to model scale, this is only O(n^2) which is the same as training compute.
However, it seems plausible to me that you actually need to look at interactions between features and so you end up with O(log(n) n^2) or even O(n^3).
Also constant factors can easily destroy you here.
Yeah, if you use constant compute to explain each “feature” and features are proportional to model scale, this is only O(n^2) which is the same as training compute.
However, it seems plausible to me that you actually need to look at interactions between features and so you end up with O(log(n) n^2) or even O(n^3).
Also constant factors can easily destroy you here.