I work for a leading private statistical research company and think this is a wonderful post; I heartily agree with all the takeaways. I may expand on data leakage examples I’ve seen “in the wild” in a follow-up post if there’s demand for more stories. Your second “time-travelling” example, though, brought back wonderful memories of a large company-wide debate: your initial suggestion was our modus operandi, and there was likewise “tolerant skepticism” when it was questioned.
“I looked into it and found . . . that the conventional approach worked fine.” So did we, and nobody could produce a clear example where this flavour of leakage was a problem on the actual data we had rather than on theorised data (perhaps we didn’t look hard enough?). In my experience, tacit knowledge and understanding of the data-generating process should probably be the main determinant of how much data leakage matters in practice, and therefore of how much you should care about it. In this case, time-travel was an issue because the process you were modelling had serial correlation.
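To make that mechanism concrete, here is a toy sketch (synthetic AR(1) data and a deliberately dumb model, nothing resembling our actual setup): when observations are serially correlated, a random train/test split lets a model that has learned nothing real score well just by copying a near-in-time neighbour, while an honest past-predicts-future split exposes it.

```python
# Toy illustration of "time-travel" leakage under serial correlation.
# Hypothetical, synthetic data -- not our proprietary setup.
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) process: y[t] = 0.95 * y[t-1] + noise (strong serial correlation).
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.95 * y[t - 1] + rng.normal()

t_idx = np.arange(n)

def knn_mse(train, test):
    # 1-nearest-neighbour on the *time index alone*: predict y at each test
    # time by copying y at the nearest training time. No real signal is learned.
    nearest = np.abs(t_idx[test][:, None] - t_idx[train][None, :]).argmin(axis=1)
    preds = y[train][nearest]
    return np.mean((y[test] - preds) ** 2)

# "Time-travelling" split: a random shuffle mixes future points into training,
# so most held-out points have an in-training neighbour one step away.
perm = rng.permutation(n)
mse_random = knn_mse(np.sort(perm[: n // 2]), np.sort(perm[n // 2 :]))

# Honest split: train strictly on the past, test on the future.
mse_temporal = knn_mse(t_idx[: n // 2], t_idx[n // 2 :])

print(f"random-split test MSE:   {mse_random:.2f}")   # deceptively small
print(f"temporal-split test MSE: {mse_temporal:.2f}")  # much larger
```

The deceptively small random-split error here is exactly the kind of price that is easy not to notice.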
The knowledge that “all models are wrong” is the best tonic I’ve found for the nagging uncertainty inherent in working with data involving the arrow of time. I still pretend time-travel is fine almost every working day. We all know it’s wrong, but at my company, at least, we don’t know the price we’re paying.