Malka Gorfine, Ruth Heller, and Yair Heller write a comment on the paper of Reshef et al. that we discussed a few months ago.
Just to remind you what’s going on here, here’s my quick summary from December:
Reshef et al. propose a new nonlinear R-squared-like measure.
Unlike R-squared, this new method depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you can fit them perfectly with a 999-degree polynomial. So the scale of the fit matters (see the sketch after this summary).
The clever idea of the paper is that, instead of going for an absolute measure (which, as we’ve seen, will be scale-dependent), they focus on the problem of summarizing the grid of pairwise dependences in a large set of variables. As they put it: “Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs . . . If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones?”
Thus, Reshef et al. provide a relative rather than absolute measure of association, suitable for comparing pairs of variables within a single dataset even if the interpretation is not so clear between datasets.
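(To make the scale-dependence point concrete, here is a minimal sketch. It is scaled down from 1000 points to 20 purely to keep the polynomial algebra numerically well behaved; the sample size and seed are illustrative choices of mine, not anything from the paper.)

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
n = 20
x, y = rng.standard_normal(n), rng.standard_normal(n)  # (x, y) ~ N(0, I), independent

# A degree n-1 polynomial passes through all n (distinct) points exactly,
# so the "fit" to pure noise is perfect but meaningless.
p = Polynomial.fit(x, y, deg=n - 1)
print("max |residual|, degree", n - 1, "fit:", np.max(np.abs(y - p(x))))  # ~0

# A low-degree fit, by contrast, explains essentially nothing.
p1 = Polynomial.fit(x, y, deg=1)
print("R^2 of the linear fit:", 1 - np.var(y - p1(x)) / np.var(y))  # ~0
```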
I followed up with some questions, and there were many comments, including this link from Rob Tibshirani to a paper he wrote with Noah Simon, in which they conclude:
We [Simon and Tibshirani] believe that the recently proposed distance correlation measure of Székely & Rizzo (2009) is a more powerful technique that is simple, easy to compute and should be considered for general use.
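For readers who haven’t run into it, the sample distance correlation is straightforward to compute from pairwise distances. Here’s a minimal numpy sketch of the statistic (my own paraphrase of the Székely–Rizzo formula, not anyone’s official code; for real use there are maintained implementations such as Székely and Rizzo’s energy package for R):

```python
import numpy as np

def distance_correlation(x, y):
    """Biased (V-statistic) sample distance correlation of Székely & Rizzo."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)

    def centered_distances(z):
        d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))  # pairwise Euclidean distances
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()         # double-centering

    a, b = centered_distances(x), centered_distances(y)
    dcov2 = (a * b).mean()                                # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())      # product of distance variances
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 500)
print(distance_correlation(x, x ** 2))                    # strong nonlinear dependence: well above 0
print(distance_correlation(x, rng.uniform(-1, 1, 500)))   # independent noise: near 0
print(abs(np.corrcoef(x, x ** 2)[0, 1]))                  # Pearson misses the parabola: near 0
```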
OK, now that we’re up to speed, here’s the comment from Gorfine:
Reshef et al. present a clever approximation to the brute-force approach of detecting dependencies by going over all possible grids. Their method, however, has some serious drawbacks:
1) My collaborators Ruth and Yair Heller and I conducted a simulation study to compare the power of MIC with two other methods: dcor (as in Professor Tibshirani’s comment above) and HHG (http://arxiv.org/abs/1201.3522). The study made clear that for certain data sets MIC suffers from extremely low power compared to the other methods. A detailed description of the simulations and the power issue can be found here (http://ie.technion.ac.il/~gorfinm/files/science6.pdf).
2) In a personal communication the authors explained that the main point of their method is equitability (i.e. two relationships with the same noise level will get the same score) and not power. We believe that the equitability characteristic of the method is not very useful for the following reasons:
a) If you have low power and cannot detect much, equitability will not help you.
b) The authors prove equitability only for relationships without noise – which is never the case in statistics. Giving a few examples of equitability for noisy functions does not constitute a proof.
c) In our simulation study mentioned above, based on practical sample sizes such as 30, 50, or 100, we show that the MIC test performs poorly in terms of equitability: it gives different relationship types different scores and thus different power, and its degradation as noise is added depends heavily on the specific relationship type in question.
3) The authors give a few noisy examples for which their proofs do not hold (e.g., an L-shaped relationship) and try to demonstrate that MIC can be equitable even in such cases. There is, however, a simple counterexample showing that MIC is not equitable for all relationships: generate a dataset that is uniform on [0,1]x[0,1] and uniform on [1,2]x[1,2], a two-field checkerboard (in fact, a larger checkerboard works too). It scores a maximal MIC of 1.0, just like y=x (credit to a post on http://andrewgelman.com/2011/12/mr-pearson-meet-mr-mandelbrot-detecting-novel-associations-in-large-data-sets/). [A quick numerical sketch of this construction appears below the comment.]
4) If we understood correctly, almost all the proofs in the paper are about the full brute-force method, which tries out all possible grids, and not about the actual MIC approximation. Specifically, the authors do not prove that their approximation is statistically consistent against all alternatives, which means that even with infinite data the researcher cannot be sure that MIC will detect a dependency when one exists.
5) Because MIC uses an approximation, it relies on quite a few unjustified heuristics:
a) A parameter of n^0.6 (the bound on the grid size), without justification of why it is better than, say, n^0.7.
b) A parameter for the number of clumps, set at 15 without justification.
In fact, looking at section 4.1 of the SOM (page 15), the authors really did play with these two parameters (they use a different value for every figure). Trying multiple parameter values for a statistical test invalidates the resulting p-values. The authors should report the findings for the default parameter settings.
6) MIC is relevant only for univariate data, while HHG and dcor also work in a multivariate setting.
Due to all these drawbacks, our bottom line is that the two other methods mentioned in our comment are superior to MIC, and we recommend that scientists use them rather than MIC.
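For anyone who wants to poke at the checkerboard example in point 3, here is a quick sketch. Computing MIC needs an outside implementation; the lines below assume the third-party minepy package is installed (its alpha=0.6 and c=15 defaults correspond to the n^0.6 and clump settings mentioned in point 5), so treat the MIC part as a sketch under that assumption:

```python
import numpy as np
from minepy import MINE  # third-party MIC implementation; assumed installed

rng = np.random.default_rng(2)
n = 500

# Two-field checkerboard: half the points uniform on [0,1]x[0,1],
# the other half uniform on [1,2]x[1,2].
block = rng.integers(0, 2, n)          # which square each point falls in
x = rng.uniform(0, 1, n) + block
y = rng.uniform(0, 1, n) + block

mine = MINE(alpha=0.6, c=15)           # the n^0.6 and clump defaults discussed in point 5
mine.compute_score(x, y)
print("MIC, checkerboard:", mine.mic())   # close to the maximal score of 1

mine.compute_score(x, x)
print("MIC, y = x:       ", mine.mic())   # also 1, so the two are not distinguished
```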
Perhaps Reshef or one of the other authors of that paper can comment?