On a different topic but answering to the same quote : advancements in quantization of models to significantly reduce model memory consumption for inference without reducing model performance might also mitigate the imbalance between ALU ops and memory bandwith. This might only shift the problem a few orders of magnitude away, but still, I think it‘s worth mentioning.
On a different topic but answering to the same quote : advancements in quantization of models to significantly reduce model memory consumption for inference without reducing model performance might also mitigate the imbalance between ALU ops and memory bandwith. This might only shift the problem a few orders of magnitude away, but still, I think it‘s worth mentioning.