Daniel_Burfoot comments on Request for suggestions: ageing and data-mining

Daniel_Burfoot 26 Nov 2014 16:55 UTC
3 points
One problem is that some data may be fundamentally a distraction, in which case modeling it is just a waste of time.

But it is very hard to tell ahead of time whether a piece of data is going to be relevant to a downstream analysis. As an example, in my work on text analysis, the issue of capitalization takes a lot of effort relative to how interesting it seems. It is tempting to just throw away caps information by lowercasing everything. But capitalization actually carries clues that are relevant to parsing and other analysis. In particular, it allows you to identify acronyms, which usually stand for proper nouns.
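
To make the acronym point concrete, here is a minimal Python sketch. The "two or more uppercase letters" rule, the `find_acronyms` helper, and the sample sentence are all assumptions made for illustration, not the approach used in my actual work. It only shows that the signal disappears once the text is lowercased.

```python
import re
from typing import List

# Assumed heuristic (illustrative only): a standalone run of two or more
# uppercase ASCII letters is treated as a candidate acronym.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def find_acronyms(text: str) -> List[str]:
    """Return candidate acronyms in order of appearance (hypothetical helper)."""
    return ACRONYM_RE.findall(text)


# Made-up sample sentence, chosen so that "US" and "us" both occur.
sentence = "The US told us that NASA and the WHO would help."

# With case preserved, the heuristic picks out the acronyms.
original_hits = find_acronyms(sentence)

# After lowercasing, the same heuristic has nothing to work with,
# and the tokens "US" and "us" have become the same string.
lowercased = sentence.lower()
lowercased_hits = find_acronyms(lowercased)
us_token_count = lowercased.split().count("us")

print(original_hits)
print(lowercased_hits)
print(us_token_count)
```

A real pipeline would also have to cope with sentence-initial capitals, all-caps headlines, and acronyms containing digits or periods, which is part of why capitalization takes so much effort.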