Comments on: A Python program for multivariate missing-data imputation that works on large datasets!?

By: Andrew Song

Andrew Song — Tue, 14 Aug 2018 00:57:25 +0000

Alex, could you possibly direct me to some more detailed documentation (or a paper) about the model used in the MIDAS code? Although I got the gist of the general composition of the denoising autoencoder (it’s basically a VAE with dropout at the input layer to induce missingness?) there are some details I couldn’t really make sense of just by reading the python files in your repository.

For example, I’m not too sure whether the loss function you’re trying to optimize is the traditional ELBO defined for the original VAE (Kingma & Welling, 2014) – I cant seem to locate the reconstruction error term (i.e. expectation of log p(x/z) w.r.t z).

By: Stef van Buuren

Stef van Buuren — Mon, 18 Jun 2018 20:12:53 +0000

Alex, do you have any simulations that show how your method preserves the uncertainty caused by the missing values?

In general, I would be looking for 1) the point estimate should have the right value, 2) the confidence interval should have the right coverage, and 3) the confidence should be as short as possible. Any idea how your method performs in a basic case as in
https://stefvanbuuren.name/fimd/sec-evaluation.html ?

By: Thanatos Savehn

Thanatos Savehn — Tue, 30 Jan 2018 16:01:18 +0000

In reply to Corey Yanofsky. Exactly what I needed. Thank you.

By: Corey Yanofsky

Corey Yanofsky — Tue, 30 Jan 2018 15:35:27 +0000

In reply to Thanatos Savehn.

bloggy intro: https://deeplearning4j.org/generative-adversarial-network

NIPS tutorial paper: https://arxiv.org/abs/1701.00160

By: Thanatos Savehn

Thanatos Savehn — Tue, 30 Jan 2018 14:07:18 +0000

In reply to Alex Stenlake. Alex, First, props for your lucid writing and tight arguments. Second, do you have a link or two for “the recent work involving Generative Adversarial Networks”?

By: Alex Stenlake

Alex Stenlake — Sun, 14 Jan 2018 12:23:27 +0000

In reply to Lydia Maniatis.

1) Well, yes…it kind of does. I know neural networks annoy a lot of my statistician friends, but the reality is their performance is unparalleled on large datasets and complex problem domains. Would you argue against neural networks for object recognition because we don’t understand fully how they work? Would you argue AlphaGo isn’t legitimate because we don’t exactly how it makes its decisions? These systems deliver superhuman levels of performance. The performance at least partially stems from the ability of neural networks to identify statistical regularities and relationships either too small for a human to notice, or that our biases and heuristics blind us to. Go experts are studying AlphaGo’s moves to learn more. The basic logic/process is well understood, but then you stack millions of these basic units together into a complex system and everything becomes too complicated to track.

I guess you could think of it like the weather…we know broadly what’s happening, and can predict outcomes with a degree of certainty, but the exact drivers of any given phenomenon remain inaccessible due to the complexity of and interactions within the system. We know it’s driven by heat gradients and fluid dynamics, but that’s not enough to completely mathematically describe what’s going on at any given moment. There’s an ontological point to be made about the mysteries of human cognition, but let’s not even open that can of worms.

2) Because people are taking shots in the dark already. Worse, policy is made on the basis of those shots in the dark. To torture the metaphor, our system is an attempt to ‘restrict the arcs’ of those firing into the night, so even if they miss they’re unlikely to commit fratricide, and should at least be firing in a reasonable direction. We ought to do it because the world doesn’t wait for ‘perfect’, and it might have changed by the time a ‘perfect’ model is built. I forget who said it, but the logic is “an exact solution to an approximate model is more useful than an approximate solution to an exact model”. What we’re after is better models. When they break, it gives us clues to what we’re missing.

The social domain is hugely complex, with thousands of competing theories that are all supported by data in varying ways. Please don’t imply that we’re not taking the time to try to grapple with it. Our data is limited, and we’re trying to learn what we can. You’re right that technical solutions will not reduce uncertainty to acceptable levels. What we’re trying to do is avoid too little uncertainty (point estimates, ie. random forest imputation, median imputation) or too much (naive Gaussian approximation, failure to try and analyse data). I don’t understand the hostility to the idea. “Papering over” is failing to acknowledge that the results depended on an imputed dataset, not attempting to deal with missingness in a principled manner.

3) Well, no, they’re not of the same form. E = variable (constant squared), whereas A = constant (variable squared). They describe a very different process. But let’s take your assumption at face value, and talk purely about the case E = mc2. If I gave you a large matrix, consisting of three columns (E, m, c), could you learn the relationship between them? Even if you knew nothing about relativity, but knew a little about maths, you could probably figure it out…particularly if I gave you a calculator to speed up the arithmetic. You could learn, or at least approximate, E=mc2 provided I gave you enough examples. Even if I mixed in a bunch of partial examples (with one of the values missing), you could solve for a single column, use those assumed values to approximate another column’s missing values, rinse and repeat until convergence. This is the basis of chained equations, and is how a lot of MI methods work. The advantage of our method is that you can throw a column in signifying “A or E”, and rather than treating the ‘c’ column as a squared constant (with “A or E” becoming a dummy variable of some magnitude), the model can learn to switch between the two states.

While you can say “well take the time to learn the relationships!”, well…you’re welcome to switch disciplines and come join us in fighting the good fight. A lot of this stuff is unknown, the variables we measure keep changing, latent factors abound, politically-useful theories flourish at the expense of attempts to grasp some notion of truth. (And the goalposts of truth keep shifting all the time because the poststructualists have a point, even if I wish they’d do something more useful than keep making the same point.) In absence of known relationships, we do the next best thing – estimate from the data we have, try not to impart too much of our own biases, and transparently report how we reach our conclusions.

By: Lydia Maniatis

Lydia Maniatis — Sat, 13 Jan 2018 17:38:20 +0000

“We don’t like the black box aspect, but it’s how a lot of scalable machine learning is done for complex problems.” The second part of this sentence seems like a non sequitur. Saying that its often done this way doesn’t legitimize it, does it? Maybe the reasons you don’t like it are the reasons why it shouldn’t be used.

“…there will always be room for tailor-made missing data solutions based on extensive subject matter knowledge. Our solution is designed to work for those who lack the time or knowledge to implement such a system effectively.”

The solution to what problem, exactly? How to take shots in the dark when you don’t have the appropriate data or the time to become familiar with your subject matter? Why do it?

I don’t have the expertise to engage with you on the technical issues, but the problems here seem to me much more fundamental. No amount of technical manipulation of inadequate data can reduce uncertainty to acceptable levels. Huge levels of uncertainty papered over with uncertain assumptions should have no more place in science than they do in engineering. There doesn’t seem to be a rational way to choose which techniques to prefer.

I don’t know if this example is relevant, but E=mc2 is of the same form as A=pi r2, but that tells us nothing about the underlying “data generation process.” However, I agree that the discussion would be more focussed if we were talking about a specific problem situation.

By: Alex Stenlake

Alex Stenlake — Fri, 12 Jan 2018 23:58:41 +0000

In reply to David Manheim.

@ David:

Granted, conditionally. This is a problem both myself and Ranjit are keenly aware of. We don’t like the black box aspect, but it’s how a lot of scalable machine learning is done for complex problems. Neural networks are just horrendously effective. At least now, with Gal’s work on interpreting dropout-trained networks as approximate GPs, we can get an idea of the uncertainty due to moving away from known regions of the modelling space. Many social science scholars will treat interpretable MI systems like Amelia, MICE, Hmisc, etc. as black box optimisers anyway, knowing just enough about how they work as to avoid throwing errors.

In general, the current consensus seems to be that MI is – at worst – no more prone to inducing bias than listwise deletion. Both myself and Ranjit are International Relations people, where missingness is highly nonrandom. For instance, Ranjit ran a reanalysis of a large number of studies from the last few years. He found that, when listwise deletion was replaced with actual imputation methods, many findings completely disappeared, if not reversed (type-M/S, if you will). If it puts your mind at ease, we have done some testing with strictly MNAR data, where missingness depends on an excluded (target) variable. We then use the imputed datasets in a Naive Bayes (AKA. Idiot Bayes) classifier to predict that excluded variable. What we observe is that the accuracy of that classification improves. It’s not proof that it works for all cases, but it does seem to be a significant improvement over existing approaches.

I don’t disagree that this is dangerous territory, and there will always be room for tailor-made missing data solutions based on extensive subject matter knowledge. Our solution is designed to work for those who lack the time or knowledge to implement such a system effectively. Thanks for the links though.

@Lydia:

Non-interpretable in this sense means we can’t access meaningful representations of the millions of parameters in your average neural network. We can represent the output quite easily, and we withhold an overimputed test set to ensure that the model is reconstructing accurately. I assure you, it’s not magic. We’ve had to modify the standard train/test split to account for missingness, based on some obscure work done 6 years ago, but other than that it’s based on practical solutions to well-known issues in machine learning.

On the issue of the “data generation process”, you can think of data as generated by a nonlinear manifold in feature space. Standard denoising autoencoders attempt to learn this manifold. To do so, you apply a noise mask to an input, and then force the network to learn how to reconstruct the original input from the corrupted input. For a corrupted input, the network reconstructs its value from the closest point on the manifold indexed by the other inputs. If we were dealing with a simple linear manifold, this is analogous to conditional linear interpolation. (You might be seeing exactly how this works right about now, but I’ll continue.) By utilising dropout, we’re approximating the Gaussian process (a prior over the domain of functions) which describes this manifold. Then, by repeated sampling, we get an uncertainty-weighted distribution of said value.

Allow me to put words in your mouth, if I may. “But Alex, we’re concerned with the real world, not some abstract mathematical construct.” You’re right, and we’re doing approximations on approximations here. However, I’d point you towards some of the recent work involving Generative Adversarial Networks. GANs seem to be manifold learners too, although one with a unique adaptive loss function. They map random inputs onto points in feature space (often faces), so you can then sweep through the random noise and get original, unseen images out.

The real-world data generation process is whatever process generated the limited dataset in front of you, with the normal warnings about GIGO. The manifold is just a mathematical representation of this restricted set of observations. Sure, if you want to know the missing value of someone’s favourite flavour of ice cream, and “chocolate” fails to appear in your dataset, you can’t correctly predict that this observation should be chocolate. Excluded variable bias is a big deal. The advantage of our system is you can plausibly include hundreds, if not thousands, of variables in your imputation model, thus reducing the odds of EVB.

Hopefully that (more technical) explanation reassures you that we’re not simply waving our hands and saying “abracadabra!”. If you have more specific issues, I’d be happy to discuss them.

By: David Manheim

David Manheim — Fri, 12 Jan 2018 15:09:32 +0000

In reply to Lydia Maniatis.

Dangerous waters here – missing data is exactly where black box methods can develop pernicious biases. And black box also means effectively non-auditable, so these biases can’t be easily diagnosed.

And as statistical methods become more widespread, this gets more and more dangerous. I’d recommend seeing Osonde Osoba’s report here; https://www.rand.org/pubs/research_reports/RR1744.html and Cathy O’Neil’s book here; https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815

By: Lydia Maniatis

Lydia Maniatis — Fri, 12 Jan 2018 13:10:17 +0000

Baysianism in a nutshell:

“Our method is “black box”-y, …it’s a non-interpretable system which can’t give additional insight into the data generation process. On the upside, it means you can point it at truly enormous datasets and yield accurate imputations in relatively short (by MCMC standards) time.”

Non-interpretable – but accurate – assessments of missing data! Almost seems like magic. All you need to do now is point it at new, the real-world datasets and hope it cooperates – that might make non-statisticians comfortable with regularization too… That might require some interpretation of the “data-generation process, though, to figure out where to point this magical device. (Assuming we’re interested in real-world prediction).