This error shows up in data analytics when people mine huge amounts of data for hidden correlations. Combining and transforming factors by brute force gives you massive degrees of freedom. With that many candidates, some correlations will look strong purely by chance. They turn out to be noise instead of signal, and the model or the findings then fail to transfer outside of the training environment.
It seems to be a blind spot in the current culture of Data Science. I don’t see many colleagues who focus on this error, or who can be easily convinced that it is a problem.
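To make the effect concrete, here is a minimal sketch of a brute-force search run on pure noise. Everything in it is an illustrative assumption rather than a real analysis: the helper names `pearson` and `run_demo`, the feature count, and the sample sizes are made up for the example. The target and every candidate feature are independent random numbers, so any correlation the search finds is spurious by construction.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def run_demo(seed=0, n_features=500, n_train=30, n_holdout=500):
    """Pick the noise feature that best 'explains' the target in-sample,
    then score that same feature on rows the search never saw."""
    rng = random.Random(seed)
    n_rows = n_train + n_holdout

    # Target and candidate features are all independent noise (assumed setup).
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                for _ in range(n_features)]

    # Brute-force search: correlate every candidate with the target
    # on the training rows only, and keep the strongest one.
    train_r = [pearson(f[:n_train], target[:n_train]) for f in features]
    best = max(range(n_features), key=lambda i: abs(train_r[i]))

    # Re-check the winning feature on the holdout rows.
    holdout_r = pearson(features[best][n_train:], target[n_train:])
    return train_r[best], holdout_r


train_r, holdout_r = run_demo()
print(f"best in-sample correlation: {train_r:+.2f}")
print(f"same feature on holdout:    {holdout_r:+.2f}")
```

With a few hundred candidates and only a few dozen training rows, the winning in-sample correlation tends to look impressive, while the same feature scored on the holdout rows falls back toward zero. Raising `n_features` or lowering `n_train` makes the gap wider, which is the degrees-of-freedom problem in miniature.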
The only error I find more often is target leakage. This would take the form of ‘we defined the speed of light using the dimensions of the pyramid. Now behold the great coincidence of the sides correlating with the energy stored in matter.’ In practice, though, the connections are usually harder to spot.
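Target leakage can be sketched in the same spirit. In the example below, the `leaky` feature is computed from the target itself (rescaled, with a little noise added), while the `honest` feature is drawn independently. Both features, the scaling factor, and the function name `leakage_demo` are hypothetical and chosen only to illustrate the pattern; real leakage is rarely this obvious.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def leakage_demo(seed=0, n_rows=500):
    """Compare a feature derived from the target with an independent one."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]

    # Hypothetical leaky feature: defined in terms of the target,
    # like a unit of measure derived from the thing being measured.
    leaky = [3.0 * y + rng.gauss(0, 0.1) for y in target]

    # Hypothetical honest feature: generated without looking at the target.
    honest = [rng.gauss(0, 1) for _ in range(n_rows)]

    return pearson(leaky, target), pearson(honest, target)


r_leaky, r_honest = leakage_demo()
print(f"leaky feature vs target:  {r_leaky:+.3f}")
print(f"honest feature vs target: {r_honest:+.3f}")
```

The leaky feature correlates almost perfectly with the target, but only because the target was used to build it. A correlation that looks too good is often a cue to trace how each feature was constructed.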