Jeremy Neufeld writes:
I’m an undergraduate student at the University of Maryland and I was recently referred to this paper (Vine Regression, by Roger Cooke, Harry Joe, and Bo Chang; also an accompanying summary blog post by the main author) as potentially useful in policy analysis. With the big claims it makes, I am not sure if it passes the sniff test. Do you know anything about vine regression? How would it avoid overfitting?
My reply: Hey, as a former University of Maryland student myself I’ll definitely respond! I looked at the paper, and it seems to be presenting a class of multivariate models, a method for fitting the models to data, and some summaries. The model itself appears to be a mixture of multivariate normals of different dimensions, fit to the covariance matrix of a rank transformation of the raw data. I think they’re ranking each variable on its marginal distribution, but I’m not completely sure, and I’m not quite sure how they deal with discreteness in the data. Then somehow they’re transforming back to the original space of the data; maybe they do some interpolation to get continuous values. I’m also not quite sure what happens when they extrapolate beyond the range of the original ranks.
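To make the rank-transform-and-back idea concrete, here’s a minimal sketch of the simplest version I can think of: a two-variable Gaussian copula regression. This is not the authors’ vine method, and I’m not claiming it matches their fitting procedure. All the function names are hypothetical, and the choices here (ranks divided by n + 1, linear interpolation between order statistics, clamping at the observed range rather than extrapolating, ties broken by input order) are my assumptions, picked only to show where the open questions above come up.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Rank each value within its own margin, then map ranks to z-scores."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1).
        # Assumption: ties are broken by input order, which ignores discreteness.
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def correlation(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def empirical_quantile(sorted_values, p):
    """Interpolate between order statistics; clamp outside the data range."""
    n = len(sorted_values)
    pos = p * (n + 1) - 1  # fractional index matching rank p * (n + 1)
    if pos <= 0:
        return sorted_values[0]
    if pos >= n - 1:
        return sorted_values[-1]
    lo = int(pos)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[lo + 1] - sorted_values[lo])

def predict_median(x_new, xs, ys):
    """Conditional median of y given x_new under a bivariate Gaussian copula."""
    rho = correlation(normal_scores(xs), normal_scores(ys))
    n = len(xs)
    # Locate x_new in the x margin; values outside the data get the extreme rank.
    rank = sum(1 for v in xs if v <= x_new)
    z_new = STD_NORMAL.inv_cdf(min(max(rank, 1), n) / (n + 1))
    # On the normal-score scale the conditional median is rho * z_new;
    # medians survive the monotone map back to the original y scale.
    return empirical_quantile(sorted(ys), STD_NORMAL.cdf(rho * z_new))
```

Calling `predict_median(x_new, xs, ys)` with `x_new` far outside the observed `xs` just returns a value pinned near the edge of the observed `ys`, which is one way (not necessarily the paper’s way) of handling the extrapolation question.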
The interesting part of the model is the mixture of submodels of different dimensions. I’m generally suspicious of such approaches, as continuous smoothing is more to my taste. That said, the usual multivariate models we fit are so oversimplified that I could well imagine that this mixture model could do well. So I’m supportive of the approach. I think maybe they could fit their model in Stan; if so, that would probably make the computation less of a hassle for them.
The one thing I really don’t understand at all in this paper is their treatment of causal inference. The model is entirely associational (that’s fine, I love descriptive data analysis!) and they’re fitting a multivariate model to some observational data. But then in section 3.1 of their paper they use explicit causal language: “the effect of breast feeding on IQ . . . If we change the BFW for an individual, how might that affect the individual’s IQ?” The funny thing is, right after that they again remind the reader that this is just descriptive statistics: “we integrate the scaled difference of two regression functions which differ only in that one has weeks more breast feeding than the other.” But then they snap right back to the causal language. So that part just baffles me. They have a complicated, flexible tool for data description but for some reason they then seem to make the tyro mistake of giving a causal interpretation to regression coefficients fit to observational data. That’s not really so important, though; I think you can ignore the causal statements and the method could still be useful. It seems worth trying out.
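For readers who want to see why a regression coefficient from observational data needn’t be a causal effect, here’s a small simulation sketch. The data are made up and have nothing to do with the breast-feeding example in the paper. The variable names, the confounder strength of 2.0, and the sample size are all assumptions chosen for illustration. The exposure has zero causal effect on the outcome by construction, yet an unobserved confounder drives both.

```python
import random

def ols_slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=20000, true_effect=0.0, seed=1):
    """Simulated observational data with a hidden common cause (hypothetical)."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)        # unobserved confounder
        xi = u + rng.gauss(0, 1)   # exposure depends on the confounder
        yi = true_effect * xi + 2.0 * u + rng.gauss(0, 1)
        x.append(xi)
        y.append(yi)
    return x, y

x, y = simulate()
slope = ols_slope(x, y)
print(f"true causal effect: 0.0, fitted slope: {slope:.2f}")
```

In this setup the population slope works out to `true_effect + 1`, so the fitted slope sits near 1 even though changing `x` for an individual would do nothing to `y`. The difference of two regression functions is a fine descriptive summary; it just isn’t the answer to “what happens if we change x?”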