Yeah, I mostly agree. I would say that there may be certain secret techniques that give models a slightly lower loss plateau for a given parameter count. I think that matters more to the large companies than compute efficiency does.
Accumulate enough loss-plateau-lowering tidbits, and it could add up to having the best model out of a group of similarly sized models.