I think this take is basically correct. Restating my version of it:
Mixture of Experts and similar approaches modulate paths through the network, so that not every parameter participates in every forward pass. As a result, parameter count and FLOPs (floating-point operations) are more decoupled than they are in dense networks.
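To make the decoupling concrete, here's a rough back-of-the-envelope sketch. All the layer sizes and expert counts below are hypothetical, chosen only to illustrate the arithmetic, not taken from any real model:

```python
# Illustrative (hypothetical) numbers for a dense FFN vs. an MoE FFN.
d_model = 4096     # hidden width
d_ff = 16384       # feed-forward width
n_experts = 8      # experts in the MoE layer
top_k = 2          # experts each token is routed through

# Dense FFN block: every parameter participates in every forward pass.
dense_params = 2 * d_model * d_ff

# MoE FFN block: parameter count scales with n_experts,
# but each token only activates top_k of them.
moe_total_params = n_experts * dense_params
moe_active_params = top_k * dense_params

# Per-token matmul FLOPs are roughly 2 * (parameters actually used),
# so FLOPs track *active* parameters, not total parameters.
dense_flops_per_token = 2 * dense_params
moe_flops_per_token = 2 * moe_active_params

print(f"dense: {dense_params:,} params, {dense_flops_per_token:,} FLOPs/token")
print(f"MoE:   {moe_total_params:,} params, {moe_flops_per_token:,} FLOPs/token")
# With these numbers, the MoE layer has 8x the parameters of the dense
# layer but only 2x the per-token FLOPs: the two metrics come apart.
```

With the hypothetical sizes above, an 8-expert, top-2 MoE layer carries 8x the parameters of its dense counterpart while spending only 2x the per-token compute, which is exactly the sense in which parameter count stops being a proxy for FLOPs.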
To me, FLOPs remains the harder-to-fake metric, but both are valuable to track moving forward.