MINE: Free tool for detecting novel associations in large data sets
I was waiting for someone more knowledgeable to post something about this, but it’s been a couple of days, so I thought I’d bring it to LW’s attention.
The maximal information coefficient (MIC) is a measure of two-variable dependence developed with two goals in mind: generality (capturing a wide range of relationship types, not just particular functional forms) and equitability (giving similar scores to equally noisy relationships of different types). The published paper describing MIC shows that it comes very close to achieving both goals simultaneously, and that it significantly outperforms competing methods in this regard.
I’m skeptical about the general usefulness of the tool, but I think the best way to evaluate that suspicion is to find a fairly well-understood dataset and compare how well traditional correlation measures and MIC find true relationships in the data.
The most important link is the Supplementary Online Material (SOM). I only found it through the reddit thread. (There isn’t much else to go there for, except maybe the pirated paper itself, but that is basically just an extended abstract of the SOM.) Here is the link to the SOM:
http://www.sciencemag.org/content/suppl/2011/12/14/334.6062.1518.DC1/Reshef.SOM.pdf
Figures S5 and S6 (page 41) make me conjecture that, compared to LOESS, this new method is an improvement only when the relationship is not a single-valued function (that is, when one x value can map to several y values). Not that I am really familiar with LOESS.
I’m far from an expert on LOESS (in fact, I hadn’t heard the term before now), but it looks like it doesn’t perform a comparable function to MIC. LOESS seems to be an algorithm for producing a non-linear regression while MIC is an algorithm to measure the strength of a relationship between two variables.
In the paper (figure 2A), they compare it to Pearson correlation coefficient, Spearman rank correlation, mutual information, CorGC, and maximal correlation on data in a variety of shapes. Basically, it is effective on a wider range of shapes than any of them.
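To make the "wider range of shapes" point concrete, here is a rough sketch comparing Pearson correlation with a crude MIC-like score on a line and a parabola. This is not the MINE implementation: the real algorithm searches for the best grid placement with a dynamic-programming heuristic, whereas this stand-in only tries equal-frequency grids. The function names (`crude_mic` and friends) are made up for illustration, and the grid budget of n^0.6 cells is the default reported for the paper, taken here as an assumption.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def equal_frequency_bins(values, k):
    """Label each value with a bin 0..k-1 by rank, so bins hold about n/k points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def mutual_information(bx, by):
    """Mutual information (in nats) between two lists of bin labels."""
    n = len(bx)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def crude_mic(xs, ys, alpha=0.6):
    """Hypothetical stand-in for MIC: best normalized mutual information
    over equal-frequency grids with at most n**alpha cells."""
    budget = len(xs) ** alpha
    best = 0.0
    for kx in range(2, int(budget // 2) + 1):
        for ky in range(2, int(budget // kx) + 1):
            mi = mutual_information(equal_frequency_bins(xs, kx),
                                    equal_frequency_bins(ys, ky))
            best = max(best, mi / math.log(min(kx, ky)))
    return best

xs = list(range(-100, 101))
line = [2 * x + 1 for x in xs]
parabola = [x * x for x in xs]

for name, ys in [("line", line), ("parabola", parabola)]:
    print(name, "pearson:", round(pearson(xs, ys), 3),
          "crude MIC:", round(crude_mic(xs, ys), 3))
```

On the noiseless parabola, Pearson should come out at essentially zero because the relationship is symmetric around x = 0, while the grid-based score stays close to 1. Both measures score the line highly.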
Check out figures S5.D and S6 from the SOM. If the relationship is functional (the linear, parabolic, and sinusoidal cases in Figure S6), then the R² calculated from LOESS regression is quite close to the MIC score, and that’s not a coincidence. Of course, LOESS R² just dies when it encounters a non-functional relationship.
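A quick way to see why a smoother's R² collapses on a non-functional relationship is to try it on a circle. The sketch below uses a minimal local-linear smoother in the spirit of LOESS (tricube weights, nearest-neighbor bandwidth, no robustness iterations), not the exact procedure from the SOM. The function names and the `frac=0.3` span are assumptions chosen for illustration.

```python
import math

def local_linear_fit(xs, ys, frac=0.3):
    """LOESS-style smoother: at each x0, fit a tricube-weighted line
    to the nearest frac*n points and evaluate it at x0."""
    n = len(xs)
    k = max(3, math.ceil(frac * n))
    fitted = []
    for x0 in xs:
        # Bandwidth is the distance to the k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        sw = swd = swdd = swy = swdy = 0.0
        for x, y in zip(xs, ys):
            d = x - x0
            u = abs(d) / h
            if u >= 1.0:
                continue
            w = (1.0 - u ** 3) ** 3  # tricube weight
            sw += w
            swd += w * d
            swdd += w * d * d
            swy += w * y
            swdy += w * d * y
        denom = sw * swdd - swd * swd
        if denom < 1e-15:
            fitted.append(swy / sw)  # degenerate case: weighted mean
        else:
            # Intercept of the weighted line in d = x - x0 is the fit at x0.
            fitted.append((swdd * swy - swd * swdy) / denom)
    return fitted

def r_squared(ys, fitted):
    """Fraction of the variance in ys explained by the fitted values."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Functional relationship: a noiseless parabola.
px = [i / 50 - 1 for i in range(101)]
py = [x * x for x in px]
r2_parabola = r_squared(py, local_linear_fit(px, py))

# Non-functional relationship: a circle, where each x has two y values.
n = 200
cx = [math.cos(2 * math.pi * i / n) for i in range(n)]
cy = [math.sin(2 * math.pi * i / n) for i in range(n)]
r2_circle = r_squared(cy, local_linear_fit(cx, cy))

print("parabola R^2:", round(r2_parabola, 3))
print("circle R^2:", round(r2_circle, 3))
```

For the circle, the upper and lower halves pull the fitted curve toward y = 0 everywhere, so the smoother explains essentially none of the variance even though x and y are perfectly dependent. That is the case where a dependence measure like MIC has something to add.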
Extensive discussion on reddit: http://www.reddit.com/r/science/comments/neoz6/scientists_create_new_algorithm_to_automatically/