Adding to what Jaime already said: I think there are two things we might care about when thinking about FLOP.
1. We could care about the theoretical number of FLOP a method might use independent of architecture, exact GPU, etc. This might be useful to compare methods independent of which exact setup is the current flavor of the month. It might also be useful to compare current methods against methods from 5 years ago or 5 years from now. Then the setup we describe in the post seems to be the best fit.
2. We could care about the actual number of FLOP an algorithm uses in the specific setup it runs in, e.g. with a specific GPU, software, and low-level optimizer. This might be useful when comparing different GPUs and their efficiency. Then it is more accurate to use high-level proxies such as GPU-time.
In the first case, the overhead is a distraction; in the second, it is part of the thing we care about.
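To make the distinction concrete, here is a minimal sketch (the function names, constants, and utilization figure are illustrative assumptions, not anything from the post): case 1 counts FLOP analytically from the operation's shape, while case 2 backs out the FLOP actually spent on the hardware from measured GPU-time and an assumed achieved throughput.

```python
# Illustrative sketch only; all numbers below are hypothetical.

def theoretical_matmul_flop(m: int, k: int, n: int) -> float:
    """Case 1: architecture-independent count for an (m x k) @ (k x n) matmul.
    Each output element needs ~k multiplies and ~k adds, so ~2*m*k*n FLOP total."""
    return 2.0 * m * k * n

def flop_from_gpu_time(gpu_time_s: float,
                       peak_flop_per_s: float,
                       utilization: float) -> float:
    """Case 2: hardware-level proxy. FLOP spent in a given setup, estimated
    from measured GPU-time, the card's peak throughput, and an assumed
    achieved utilization (all of which depend on GPU, software, etc.)."""
    return gpu_time_s * peak_flop_per_s * utilization

# Example: a 4096 x 4096 x 4096 matmul.
analytic = theoretical_matmul_flop(4096, 4096, 4096)   # ~1.4e11 FLOP, setup-independent
# Suppose it measured 1 ms on a GPU with 3.12e14 FLOP/s peak at ~50% utilization
# (hypothetical numbers): the setup-specific estimate comes out somewhat higher.
hardware = flop_from_gpu_time(1e-3, 3.12e14, 0.5)       # ~1.6e11 FLOP
# The gap between the two is the overhead: a distraction in case 1,
# part of what we care about in case 2.
```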