Shouldn’t this be 2 FLOP per parameter per token, since our evolutionary search is not doing backward passes?
On the other hand, the calculation in the footnote seems to assume that 1 function call = 1 token, which is clearly an unrealistic lower bound.
A “lowest-level” function (one that only uses a single context window) will use somewhere between 1 and n_ctx = O(10^3) tokens. Functions defined by composition over “lowest-level” functions, as described two paragraphs above, will of course require more tokens per call than their constituents.
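To make the arithmetic concrete, here is a minimal sketch of the forward-pass-only cost estimate, i.e. ~2 FLOP per parameter per token with no backward pass. The parameter count and per-call token counts below are illustrative placeholders, not values from the original discussion:

```python
def forward_flop_per_call(n_params: float, tokens_per_call: float) -> float:
    """Forward-pass-only cost: ~2 FLOP per parameter per token
    (no backward pass, since evolutionary search does no gradient updates)."""
    return 2 * n_params * tokens_per_call

# Hypothetical model size and context window (assumptions, not from the source):
N = 1e11       # e.g. ~100B parameters
n_ctx = 1e3    # context window of O(10^3) tokens, as in the comment

low = forward_flop_per_call(N, 1)        # 1 token per call (unrealistic lower bound)
high = forward_flop_per_call(N, n_ctx)   # a full context window per call

print(f"{low:.1e} to {high:.1e} FLOP per lowest-level call")
```

Under these placeholder numbers a single lowest-level call costs between ~2e11 and ~2e14 FLOP, and composed functions multiply this further.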
Thanks for checking my math & catching this error!