Can you help me see this point? Why not correct it in the dataset? (Assuming that the dataset hasn’t yet been used to train any models)
You can correct it in the dataset going forward, but you shouldn’t go back and correct it historically. To see why, imagine this simplified world:
In 2000, GM had revenue of $1M, and its stock was worth in total $10M. Ford had revenue of $2M, and its stock was worth in total $20M. And Enron reported fake revenue of $3M, and its stock was worth in total $30M.
In 2001, the news of Enron’s fraud came out, and Enron’s stock dropped to zero. Also, our data vendor went back and corrected Enron’s 2000 revenue down to $0.
In 2002, I propose a trading strategy based on looking at a company’s revenue. I point to our historical data, where we see GM as having been worth 10x revenue, Ford as having been worth 10x revenue, and Enron as having been worth $30M on zero revenue. I suggest that I can perform better than the market average by just basing my investing on a company’s revenue data. This would have let me invest in Ford and GM, but avoid Enron! Hooray!
Of course, this is ridiculous. Investing based on revenue data would not have let me avoid losing money on Enron. Back in 2000, I would have seen the faked revenue data and invested, and in 2001, when the fraud came out, I would have lost money like everyone else.
But, by basing my backtest on historical data that has been corrected, I am smuggling the 2001 knowledge of Enron’s fraud back into 2000 and pretending that I could have used it to avoid investing in Enron in the first place.
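To make the leak concrete, here is a rough sketch of that backtest run twice, once on the revenue as it was reported in 2000 and once on the corrected revenue. The strategy rule (buy anything with at least $1M of revenue, equal-weighted) and the assumption that GM and Ford stock did not move between 2000 and 2001 are made up for illustration; only the 2000 figures and Enron's drop to zero come from the example above.

```python
# Toy numbers from the example above, in $M.
market_cap_2000 = {"GM": 10, "Ford": 20, "Enron": 30}
# Assumption for illustration: GM and Ford are flat; Enron goes to zero.
market_cap_2001 = {"GM": 10, "Ford": 20, "Enron": 0}

revenue_as_reported = {"GM": 1, "Ford": 2, "Enron": 3}  # what we saw in 2000
revenue_corrected = {"GM": 1, "Ford": 2, "Enron": 0}    # after the 2001 fix


def backtest(revenue_2000, min_revenue=1):
    """Hypothetical rule: in 2000, buy equal dollar amounts of every company
    with revenue >= min_revenue, then measure the average return to 2001."""
    picks = [name for name, rev in revenue_2000.items() if rev >= min_revenue]
    returns = [market_cap_2001[name] / market_cap_2000[name] - 1 for name in picks]
    return sum(returns) / len(returns)


honest = backtest(revenue_as_reported)  # includes Enron, loses about a third
leaky = backtest(revenue_corrected)     # skips Enron, looks flat
print(f"as-reported data: {honest:+.1%}")
print(f"corrected data:   {leaky:+.1%}")
```

Same rule, same prices. The only thing that changed is which version of the 2000 revenue the backtest was allowed to see, and the corrected version makes the strategy look better than it could ever have been in real time.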
If you care about having accurate tracking of the corrected ‘what was Enron’s real revenue back in 2000’ number, you can store that number somewhere. But by putting it in your historical data, you’re making it look like you had access to that number in 2000. Ideally you would want to distinguish between:
2000 revenue as we knew it in 2000.
2000 revenue as we knew it in 2001.
2001 revenue as we knew it in 2001.
but this requires a more complicated database.
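One way to sketch that more complicated database is to store two dates with every value: the period the revenue is for, and when we learned it. The names below (`RevenueRecord`, `known_as_of`, `revenue_as_of`) are hypothetical, not from any particular vendor or library, and the example only fills in Enron's 2000 revenue because that is the only figure the toy world gives both versions of.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RevenueRecord:
    company: str
    fiscal_year: int     # the year the revenue is for
    known_as_of: int     # the year we learned this value
    revenue_musd: float  # revenue in $M


records = [
    RevenueRecord("Enron", 2000, 2000, 3.0),  # 2000 revenue as we knew it in 2000
    RevenueRecord("Enron", 2000, 2001, 0.0),  # 2000 revenue as we knew it in 2001
]


def revenue_as_of(records, company, fiscal_year, as_of) -> Optional[float]:
    """Return the latest value for (company, fiscal_year) that was already
    known by `as_of`, or None if nothing had been published yet."""
    known = [
        r for r in records
        if r.company == company
        and r.fiscal_year == fiscal_year
        and r.known_as_of <= as_of
    ]
    if not known:
        return None
    return max(known, key=lambda r: r.known_as_of).revenue_musd


print(revenue_as_of(records, "Enron", 2000, as_of=2000))  # what a 2000 backtest may see
print(revenue_as_of(records, "Enron", 2000, as_of=2001))  # the corrected number
```

A backtest that simulates trading in 2000 would always query with `as_of=2000`, so the correction is kept but cannot leak backwards. 2001 revenue would be stored the same way, with its own `known_as_of` year.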
I see, that makes sense. Thank you!