NIPS tutorial paper: https://arxiv.org/abs/1701.00160

]]>First, props for your lucid writing and tight arguments. Second, do you have a link or two for “the recent work involving Generative Adversarial Networks”?

]]>I guess you could think of it like the weather…we know broadly what’s happening, and can predict outcomes with a degree of certainty, but the exact drivers of any given phenomenon remain inaccessible due to the complexity of and interactions within the system. We know it’s driven by heat gradients and fluid dynamics, but that’s not enough to completely mathematically describe what’s going on at any given moment. There’s an ontological point to be made about the mysteries of human cognition, but let’s not even open that can of worms.

2) Because people are taking shots in the dark already. Worse, policy is made on the basis of those shots in the dark. To torture the metaphor, our system is an attempt to ‘restrict the arcs’ of those firing into the night, so even if they miss they’re unlikely to commit fratricide, and should at least be firing in a reasonable direction. We ought to do it because the world doesn’t wait for ‘perfect’, and it might have changed by the time a ‘perfect’ model is built. I forget who said it, but the logic is “an exact solution to an approximate model is more useful than an approximate solution to an exact model”. What we’re after is better models. When they break, it gives us clues to what we’re missing.

The social domain is hugely complex, with thousands of competing theories that are all supported by data in varying ways. Please don’t imply that we’re not taking the time to try to grapple with it. Our data is limited, and we’re trying to learn what we can. You’re right that technical solutions will not reduce uncertainty to acceptable levels. What we’re trying to do is avoid too little uncertainty (point estimates, e.g. random forest imputation or median imputation) or too much (naive Gaussian approximation, failure to try to analyse the data). I don’t understand the hostility to the idea. “Papering over” is failing to acknowledge that the results depended on an imputed dataset, not attempting to deal with missingness in a principled manner.
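To make the "too little uncertainty" point concrete, here is a minimal sketch on simulated data. The sample size, the 40% missingness rate and the helper name `median_impute` are illustrative assumptions, not part of the system under discussion. Filling every gap with the same median adds no spread, so the imputed column looks more certain than the observed values justify.

```python
import numpy as np

def median_impute(column):
    """Fill NaNs with the median of the observed values (a single point estimate)."""
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmedian(column)
    return filled

rng = np.random.default_rng(0)
true_values = rng.normal(loc=0.0, scale=1.0, size=1000)  # simulated, not real data

# Knock out roughly 40% of entries completely at random (illustrative only).
observed = true_values.copy()
observed[rng.random(1000) < 0.4] = np.nan

imputed = median_impute(observed)

var_observed = np.nanvar(observed)  # spread of the values we actually saw
var_imputed = np.var(imputed)       # spread after every gap gets the same value
```

Comparing `var_imputed` with `var_observed` shows the shrinkage: every imputed cell sits at one value, so the column's variance falls even though nothing new was learned.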

3) Well, no, they’re not of the same form. E = variable (constant squared), whereas A = constant (variable squared). They describe a very different process. But let’s take your assumption at face value, and talk purely about the case E = mc². If I gave you a large matrix, consisting of three columns (E, m, c), could you learn the relationship between them? Even if you knew nothing about relativity, but knew a little about maths, you could probably figure it out…particularly if I gave you a calculator to speed up the arithmetic. You could learn, or at least approximate, E = mc² provided I gave you enough examples. Even if I mixed in a bunch of partial examples (with one of the values missing), you could solve for a single column, use those assumed values to approximate another column’s missing values, rinse and repeat until convergence. This is the basis of chained equations, and is how a lot of MI methods work. The advantage of our method is that you can throw a column in signifying “A or E”, and rather than treating the ‘c’ column as a squared constant (with “A or E” becoming a dummy variable of some magnitude), the model can learn to switch between the two states.
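The "solve for one column, then the next, rinse and repeat" loop can be sketched as follows. This is a deterministic toy version under stated assumptions, not any package's implementation: real chained-equations software such as MICE also adds random draws so that several distinct imputed datasets come out. The function name `chained_imputation`, the simulated (E, m, c) matrix, and the choice to let c vary from row to row (so there is a relationship to learn) are all hypothetical. Working in logs turns E = mc² into the linear relation log E = log m + 2 log c, which plain least squares can recover.

```python
import numpy as np

def chained_imputation(data, n_sweeps=100):
    """Deterministic chained-equations sketch: regress each column on the others in turn."""
    missing = np.isnan(data)
    filled = data.copy()
    # Crude starting guess: the mean of each column's observed values.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_sweeps):
        for j in range(data.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            obs = ~missing[:, j]
            # Fit on rows where column j was observed, using current fills elsewhere.
            coef, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
            # Overwrite only the originally missing entries of column j.
            filled[~obs, j] = design[~obs] @ coef
    return filled

rng = np.random.default_rng(0)
n = 300
m = rng.uniform(1.0, 10.0, n)
c = rng.uniform(1.0, 10.0, n)  # assumption: c varies, purely for illustration
E = m * c ** 2
truth = np.log(np.column_stack([E, m, c]))

# Blank out exactly one value in 90 of the 300 rows.
partial = truth.copy()
rows = rng.choice(n, size=90, replace=False)
cols = rng.integers(0, 3, size=90)
partial[rows, cols] = np.nan

imputed = chained_imputation(partial)
max_error = np.abs(imputed - truth).max()
```

Because the simulated data are noiseless, the loop should settle on values very close to the truth. With real, noisy data the fills would only be approximations, and that is exactly the uncertainty multiple imputation tries to carry forward.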

While you can say “well take the time to learn the relationships!”, well…you’re welcome to switch disciplines and come join us in fighting the good fight. A lot of this stuff is unknown, the variables we measure keep changing, latent factors abound, and politically-useful theories flourish at the expense of attempts to grasp some notion of truth. (And the goalposts of truth keep shifting all the time because the poststructuralists have a point, even if I wish they’d do something more useful than keep making the same point.) In the absence of known relationships, we do the next best thing – estimate from the data we have, try not to impart too much of our own biases, and transparently report how we reach our conclusions.

]]>“…there will always be room for tailor-made missing data solutions based on extensive subject matter knowledge. Our solution is designed to work for those who lack the time or knowledge to implement such a system effectively.”

The solution to what problem, exactly? How to take shots in the dark when you don’t have the appropriate data or the time to become familiar with your subject matter? Why do it?

I don’t have the expertise to engage with you on the technical issues, but the problems here seem to me much more fundamental. No amount of technical manipulation of inadequate data can reduce uncertainty to acceptable levels. Huge levels of uncertainty papered over with uncertain assumptions should have no more place in science than they do in engineering. There doesn’t seem to be a rational way to choose which techniques to prefer.

I don’t know if this example is relevant, but E=mc² is of the same form as A=πr², but that tells us nothing about the underlying “data generation process.” However, I agree that the discussion would be more focussed if we were talking about a specific problem situation.

]]>Granted, conditionally. This is a problem both Ranjit and I are keenly aware of. We don’t like the black box aspect, but it’s how a lot of scalable machine learning is done for complex problems. Neural networks are just horrendously effective. At least now, with Gal’s work on interpreting dropout-trained networks as approximate GPs, we can get an idea of the uncertainty due to moving away from known regions of the modelling space. Many social science scholars will treat interpretable MI systems like Amelia, MICE, Hmisc, etc. as black box optimisers anyway, knowing just enough about how they work to avoid throwing errors.

In general, the current consensus seems to be that MI is – at worst – no more prone to inducing bias than listwise deletion. Ranjit and I are both International Relations people, where missingness is highly nonrandom. For instance, Ranjit ran a reanalysis of a large number of studies from the last few years. He found that, when listwise deletion was replaced with actual imputation methods, many findings completely disappeared, if not reversed (type-M/S, if you will). If it puts your mind at ease, we have done some testing with strictly MNAR data, where missingness depends on an excluded (target) variable. We then use the imputed datasets in a Naive Bayes (a.k.a. Idiot Bayes) classifier to predict that excluded variable. What we observe is that the accuracy of that classification improves. It’s not proof that it works for all cases, but it does seem to be a significant improvement over existing approaches.
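For readers who want to see why nonrandom missingness hurts listwise deletion, here is a minimal simulated sketch. It is not the test described above: the variable names, sample size and missingness probabilities are all made up for illustration. A feature goes missing more often when a hidden target equals 1, so the rows that survive deletion over-represent target 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical excluded target variable, and a feature whose mean depends on it.
target = rng.integers(0, 2, size=n)
feature = rng.normal(loc=target.astype(float), scale=1.0)

# MNAR: the feature is missing far more often when target == 1,
# and the target itself is not available to the analyst.
p_missing = np.where(target == 1, 0.7, 0.1)
is_missing = rng.random(n) < p_missing

true_mean = feature.mean()                        # what we would like to estimate
complete_case_mean = feature[~is_missing].mean()  # what listwise deletion reports
```

In this setup `complete_case_mean` falls well below `true_mean`, because most of the surviving rows come from the target-0 group. Any downstream model fitted to those rows inherits the same distortion.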

I don’t disagree that this is dangerous territory, and there will always be room for tailor-made missing data solutions based on extensive subject matter knowledge. Our solution is designed to work for those who lack the time or knowledge to implement such a system effectively. Thanks for the links though.

@Lydia:

Non-interpretable in this sense means we can’t access meaningful representations of the millions of parameters in your average neural network. We can represent the output quite easily, and we withhold an overimputed test set to ensure that the model is reconstructing accurately. I assure you, it’s not magic. We’ve had to modify the standard train/test split to account for missingness, based on some obscure work done 6 years ago, but other than that it’s based on practical solutions to well-known issues in machine learning.
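One way to read the withheld "overimputed" test set is sketched below: hide a random share of the cells that were actually observed, impute them along with the genuinely missing ones, and score the reconstruction only where the truth is known. This is an assumption about the general idea, not the modified split referred to above. The helper names, the 10% holdout rate and the column-mean stand-in imputer are all hypothetical.

```python
import numpy as np

def make_overimputation_split(data, holdout_frac, rng):
    """Hide a random fraction of the observed cells; return the training copy and the mask."""
    observed = ~np.isnan(data)
    holdout = observed & (rng.random(data.shape) < holdout_frac)
    train = data.copy()
    train[holdout] = np.nan
    return train, holdout

def reconstruction_rmse(imputed, original, holdout):
    """RMSE on the deliberately hidden cells, where the true value is known."""
    diff = imputed[holdout] - original[holdout]
    return float(np.sqrt(np.mean(diff ** 2)))

def column_mean_impute(data):
    """Stand-in imputer for this sketch; any imputation model could be swapped in."""
    filled = data.copy()
    means = np.nanmean(data, axis=0)
    gaps = np.isnan(filled)
    filled[gaps] = np.take(means, np.where(gaps)[1])
    return filled

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 4))             # simulated data
data[rng.random(data.shape) < 0.2] = np.nan  # genuinely missing cells

train, holdout = make_overimputation_split(data, 0.1, rng)
rmse = reconstruction_rmse(column_mean_impute(train), data, holdout)
```

The genuinely missing cells never enter the score, since there is nothing to compare them against. Only the cells hidden on purpose are used to judge reconstruction.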

On the issue of the “data generation process”, you can think of data as generated by a nonlinear manifold in feature space. Standard denoising autoencoders attempt to learn this manifold. To do so, you apply a noise mask to an input, and then force the network to learn how to reconstruct the original input from the corrupted input. For a corrupted input, the network reconstructs its value from the closest point on the manifold indexed by the other inputs. If we were dealing with a simple linear manifold, this is analogous to conditional linear interpolation. (You might be seeing exactly how this works right about now, but I’ll continue.) By utilising dropout, we’re approximating the Gaussian process (a prior over the domain of functions) which describes this manifold. Then, by repeated sampling, we get an uncertainty-weighted distribution of said value.
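A toy version of the recipe in the paragraph above, written in plain NumPy so nothing is hidden: corrupt the input with a noise mask, train the network to reconstruct the clean input, then leave dropout switched on at prediction time and sample repeatedly. Everything specific here is an assumption made for illustration and not the actual model: the one-dimensional curve used as the "manifold", the 16-unit hidden layer, the 25% corruption rate, the 0.8 keep probability and the function names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data on a 1-D curve (the "manifold") embedded in 3-D feature space.
t = rng.uniform(-1.0, 1.0, size=500)
X = np.column_stack([t, t ** 2, t ** 3])

n_features, n_hidden, keep_prob = 3, 16, 0.8
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_features))
b2 = np.zeros(n_features)

def forward(x_in, rng):
    """One stochastic pass: tanh hidden layer with inverted dropout left switched on."""
    hidden = np.tanh(x_in @ W1 + b1)
    drop = (rng.random(hidden.shape) < keep_prob) / keep_prob
    return hidden, drop, (hidden * drop) @ W2 + b2

losses = []
learning_rate = 0.05
for step in range(3000):
    # Denoising step: zero out a random 25% of inputs, but reconstruct the clean X.
    noise_mask = rng.random(X.shape) > 0.25
    x_corrupt = X * noise_mask
    hidden, drop, out = forward(x_corrupt, rng)
    error = out - X
    losses.append(float(np.mean(error ** 2)))

    # Backpropagation for the mean squared reconstruction error.
    g_out = 2.0 * error / error.size
    g_hidden = (g_out @ W2.T) * drop * (1.0 - hidden ** 2)
    W2 -= learning_rate * (hidden * drop).T @ g_out
    b2 -= learning_rate * g_out.sum(axis=0)
    W1 -= learning_rate * x_corrupt.T @ g_hidden
    b1 -= learning_rate * g_hidden.sum(axis=0)

def mc_dropout_impute(row, missing_index, n_samples, rng):
    """Repeated stochastic passes give a distribution for the missing entry."""
    x_in = np.where(np.arange(n_features) == missing_index, 0.0, row)
    draws = [forward(x_in[None, :], rng)[2][0, missing_index] for _ in range(n_samples)]
    return np.array(draws)

# Impute the third feature of a row whose first two features are known.
row = np.array([0.5, 0.25, np.nan])
draws = mc_dropout_impute(row, missing_index=2, n_samples=200, rng=rng)
imputed_mean, imputed_spread = draws.mean(), draws.std()
```

Each call to `forward` uses a different dropout mask, so `draws` is a set of plausible values and not a single point estimate. `imputed_spread` is a rough stand-in for the uncertainty attached to that imputation.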

Allow me to put words in your mouth, if I may. “But Alex, we’re concerned with the real world, not some abstract mathematical construct.” You’re right, and we’re doing approximations on approximations here. However, I’d point you towards some of the recent work involving Generative Adversarial Networks. GANs seem to be manifold learners too, although ones with a unique adaptive loss function. They map random inputs onto points in feature space (often faces), so you can then sweep through the random noise and get original, unseen images out.

The real-world data generation process is whatever process generated the limited dataset in front of you, with the normal warnings about GIGO. The manifold is just a mathematical representation of this restricted set of observations. Sure, if you want to know the missing value of someone’s favourite flavour of ice cream, and “chocolate” fails to appear in your dataset, you can’t correctly predict that this observation should be chocolate. Excluded variable bias is a big deal. The advantage of our system is you can plausibly include hundreds, if not thousands, of variables in your imputation model, thus reducing the odds of EVB.

Hopefully that (more technical) explanation reassures you that we’re not simply waving our hands and saying “abracadabra!”. If you have more specific issues, I’d be happy to discuss them.

]]>And as statistical methods become more widespread, this gets more and more dangerous. I’d recommend seeing Osonde Osoba’s report here: https://www.rand.org/pubs/research_reports/RR1744.html and Cathy O’Neil’s book here: https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815

]]>“Our method is “black box”-y, …it’s a non-interpretable system which can’t give additional insight into the data generation process. On the upside, it means you can point it at truly enormous datasets and yield accurate imputations in relatively short (by MCMC standards) time.”

Non-interpretable – but accurate – assessments of missing data! Almost seems like magic. All you need to do now is point it at new, real-world datasets and hope it cooperates – that might make non-statisticians comfortable with regularization too… That might require some interpretation of the “data-generation process”, though, to figure out where to point this magical device. (Assuming we’re interested in real-world prediction.)

]]>