You’re missing the possibility that parameters during training were larger than models used for inference. It is common practice now to train large, then distill into a series of smaller models that can be used based on the task need.
You’re missing the possibility that parameters during training were larger than models used for inference. It is common practice now to train large, then distill into a series of smaller models that can be used based on the task need.