Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti, which proposes a new nonlinear R-squared-like measure.
My quick answer is that it looks really cool!
From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions.
Unlike R-squared, the method of Reshef et al. depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), then x and y are independent, yet you’ll be able to fit the points perfectly with a polynomial of degree 999. So the scale of the fit matters.
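Here is a minimal sketch of that point in R, scaled down to 8 points and a degree-7 polynomial because a degree-999 fit is numerically unworkable in floating point. The names n_small, fit, r2_interp, and r2_line are just for illustration and are not from the paper.

```r
set.seed(123)
n_small <- 8
x <- rnorm(n_small)
y <- rnorm(n_small)   # independent of x by construction

# A polynomial of degree n_small - 1 has n_small coefficients,
# so it can pass through all n_small points (the x values are distinct).
fit <- lm(y ~ poly(x, n_small - 1))
r2_interp <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)

# For comparison: the usual R-squared from a straight-line fit
r2_line <- cor(x, y)^2

print(c(interpolating = r2_interp, straight_line = r2_line))
```

The interpolating fit reproduces the data essentially exactly even though x carries no information about y, which is why any flexible measure needs some control on how fine a fit it is allowed to use.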
The clever idea of the paper is that, instead of going for an absolute measure (which, as we’ve seen, will be scale-dependent), they focus on the problem of summarizing the grid of pairwise dependences in a large set of variables. As they put it: “Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs . . . If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones?”
Thus, Reshef et al. provide a relative rather than absolute measure of association, suitable for comparing pairs of variables within a single dataset even if the interpretation is not so clear between datasets.
I suspect that the R-squared-like nature of the method will bother some purists who have bashed R-squared’s dependence on the range of x in the data. But, as regular readers will know, I like R-squared (in its place) and I have warm feelings about this generalization.
I leave the authors with two questions:
1. What is the value of their association measure if applied to data that are on a circle? For example, suppose you generate these 1000 points in R:
n <- 1000
theta <- runif (n, 0, 2*pi)
x <- cos (theta)
y <- sin (theta)
Simulated in this way, x and y have an R-squared of essentially 0. And, indeed, knowing x tells you little (on average) about y (and vice-versa). But, from the description of the method in the paper, it seems that their R-squared-like measure might be very close to 1. I can’t really tell. This is interesting to me because it’s not immediately clear what the right answer “should” be. If you can capture a bivariate distribution by a simple curve, that’s great; on the other hand, if you can’t predict x from y or y from x, then I don’t know that I’d want an R-squared-like summary to be close to 1.
No measure can be all things to all datasets, so let me emphasize that the above is not a criticism of the idea of Reshef et al. but rather an exploration.
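To make the tension in the circle example concrete, here is a rough sketch comparing the ordinary R-squared with a crude grid-based dependence score. The helper binned_mi is hypothetical: it is a plug-in mutual information estimate on a fixed 10-by-10 equal-width grid, and it is not the statistic of Reshef et al., so it says nothing definite about what their measure would return.

```r
set.seed(1)
n <- 1000
theta <- runif(n, 0, 2 * pi)
x <- cos(theta)
y <- sin(theta)

# Ordinary R-squared from a straight-line fit
r2 <- cor(x, y)^2

# Hypothetical helper: plug-in mutual information (in bits)
# on a k-by-k grid of equal-width bins.
binned_mi <- function(x, y, k = 10) {
  p_xy <- table(cut(x, k), cut(y, k)) / length(x)  # joint cell proportions
  p_x <- rowSums(p_xy)                             # marginal for x bins
  p_y <- colSums(p_xy)                             # marginal for y bins
  nz <- p_xy > 0                                   # skip empty cells
  sum(p_xy[nz] * log2(p_xy[nz] / outer(p_x, p_y)[nz]))
}

mi_circle <- binned_mi(x, y)
mi_indep <- binned_mi(rnorm(n), rnorm(n))  # baseline: independent draws

print(c(r2 = r2, mi_circle = mi_circle, mi_indep = mi_indep))
```

I'd expect r2 to be close to zero while the binned score for the circle sits well above the independent baseline, which is the sense in which a grid-based measure can see structure that a predictive summary does not.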
2. I wonder if they’d do even better by log-transforming any variables that are all-positive. (I thought about this after looking at the graphs in Figure 4.) A more general approach would be for their grid boxes to be adaptive.
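A small sketch of why this might matter for any grid with fixed-width boxes: with a skewed all-positive variable, equal-width bins on the raw scale pile nearly all the points into one box, while a log transform or quantile-based (adaptive) breaks spread them out. The simulated lognormal variable z and the choice of 10 bins are assumptions for illustration only, and this is not a description of how the paper actually builds its grids.

```r
set.seed(2)
z <- rlnorm(1000, meanlog = 0, sdlog = 1.5)  # skewed, all-positive
k <- 10

# Equal-width bins on the raw scale
raw_counts <- table(cut(z, k))

# Equal-width bins after a log transform
log_counts <- table(cut(log(z), k))

# Adaptive bins: breaks at the empirical deciles of z
quant_breaks <- quantile(z, probs = seq(0, 1, length.out = k + 1))
adaptive_counts <- table(cut(z, quant_breaks, include.lowest = TRUE))

print(rbind(raw = as.vector(raw_counts),
            log = as.vector(log_counts),
            adaptive = as.vector(adaptive_counts)))
```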