Is that one dense or sparse/MoE? How many data points was it trained on? Does it set SOTA on anything? (I’m skeptical; I’m wondering if they only trained it for a tiny amount of time, for example.)