Eliezer should have taken Cotra up on that bet about “will someone train a 10T param model before end days”, considering one already exists.
Is that one dense or sparse/MoE? How many data points was it trained on? Does it set SOTA on anything? (I’m skeptical; I’m wondering if they only trained it for a tiny amount, for example.)