The point of benchmarking something is not necessarily to see if it’s “better,” but to see how much worse it is.
For example, a properly tuned FCNN will almost always beat a gradient booster on a mid-sized problem (say, < 100,000 features once you bucketize your numbers, which a GB will require anyway, and one-hot encode (OHE) your categories, and < 100,000 samples).
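To make the feature-counting concrete, here’s a rough sklearn sketch of what I mean by bucketizing and OHE (the column names, sizes, and bin count are all made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Toy tabular data; columns and sizes are made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(10, 1, 1000),
    "age": rng.integers(18, 90, 1000),
    "region": rng.choice(["north", "south", "east", "west"], 1000),
})

# Bucketize the numeric columns, one-hot encode the categorical one.
pre = ColumnTransformer([
    ("buckets", KBinsDiscretizer(n_bins=32, encode="onehot-dense"), ["income", "age"]),
    ("ohe", OneHotEncoder(sparse_output=False), ["region"]),
])

X = pre.fit_transform(df)
# X.shape[1] is the feature count that the "< 100,000 features" budget refers to.
print(X.shape)
```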
But gradient boosting has many other advantages: training time, stability, ease of tuning, efficient fitting on both CPUs and GPUs, more flexibility in trading off compute against memory usage, built-in feature-importance metrics, potentially faster inference, and potentially easier online training (though those last two are arguable and kind of beside the point; they aren’t the main advantages).
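For instance, on the feature-importance point: with a sklearn gradient booster you get importances basically for free. A minimal sketch on synthetic data (everything here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; purely illustrative.
X, y = make_classification(n_samples=5_000, n_features=20, n_informative=5, random_state=0)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances come built in, no extra tooling needed.
top = gb.feature_importances_.argsort()[::-1][:5]
for i in top:
    print(f"feature {i}: {gb.feature_importances_[i]:.3f}")
```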
So really, as long as benchmarks tell me a gradient booster is usually just 2-5% worse than a finely tuned FCNN on this imaginary set of “mid-sized” tasks, I’d jump at the option to never use FCNNs here again, even if the benchmarks seemingly came out “against” gradient boosting.
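To be clear about the kind of head-to-head I mean, here’s a minimal sketch, with sklearn’s HistGradientBoostingClassifier as the GB and an MLP standing in for the FCNN, on synthetic data. None of this is a real benchmark, and the MLP here is untuned, which is sort of the point about tuning effort:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a "mid-sized" tabular task; purely illustrative.
X, y = make_classification(n_samples=50_000, n_features=100, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The GB works out of the box; the MLP would need real tuning to be competitive.
gb = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
nn = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=100, random_state=0).fit(X_tr, y_tr)

gb_acc, nn_acc = gb.score(X_te, y_te), nn.score(X_te, y_te)
print(f"GB: {gb_acc:.4f}  FCNN: {nn_acc:.4f}  relative gap: {100 * (nn_acc - gb_acc) / nn_acc:+.1f}%")
```

If the printed gap sat reliably in that 2-5% band across tasks like this, that would be exactly the “benchmark loss” I’d happily accept.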
Interesting!
I guess I should add: an example I’m slightly more familiar with is anomaly detection in time-series data. Numenta developed the brain-inspired “HTM” anomaly detection algorithm (actually, I’ve heard Dileep George did all the work back when he was at Numenta). Then I think they licensed it into a system for industrial anomaly detection (“the machine sounds different now, something may be wrong”), but it was a modular system, so you could swap out the core algorithm, and it turned out that HTM wasn’t doing better than the other options. This is a vague recollection; I could be wrong in any or all of the details. Numenta also made an anomaly detection benchmark related to this, but I just googled it and found this criticism. I dunno.