Workshop on reproducibility in machine learning

Alex Lamb writes:

My colleagues and I are organizing a workshop on reproducibility and replication for the International Conference on Machine Learning (ICML). I’ve read some of your blog posts on the replication crisis in the social sciences and it seems like this workshop might be something that you’d be interested in.

We have three main goals in holding a workshop on reproducing and replicating results:

1. Provide a venue in Machine Learning for publishing replications, both successful and unsuccessful. This helps to give credit and visibility to researchers who work on replicating results as well as researchers whose results are replicated.

2. A place to share new ideas about software and tools for making research more reproducible.

3. A forum for discussing how reproducing and replicating research affects different parts of the machine learning community. For example, what does it mean to reproduce the results of a recommendations engine which interacts with live humans?

I agree that this is a super-important topic because the fields of statistical methodology and machine learning are full of hype. Lots of algorithms that work in the test examples but then fail in new problems. This happens even with my own papers!


  1. mclaren says:

    Is machine learning even science? Most examples seem more like rule-of-thumb opaque black box heuristics jiggery-pokery to me. How do you disconfirm a machine learning algorithm? It’s like supersymmetry: the answer in HEP is always “We need higher-energy accelerators to really get the answer” no matter how high the energy of the accelerator is, and in machine learning, the answer is always “We need more computing power” if the output is crap. This seems not far off from Russian psychics who claim to be able to psychically magnetize water and then claim, when they fail double-blind tests, that the skepticism of the investigator destroyed the effect.

    • Andrew says:


      Machine learning is engineering. So is statistics. Statistics and machine learning can do lots of cool things like spell checking and digit recognition and drug design and the estimation of state-level opinion from national polls. If I thought this all was equivalent to psychics that failed double-blind tests, I wouldn’t be teaching in the Columbia statistics department, I’d be at Cornell in the psychology or the food sciences division.

      • Aaron G says:

        “If I thought this all was equivalent to psychics that failed double-blind tests ….. I’d be at Cornell in the psychology or the food sciences division.”

        I like your dig at Wansink there! :)

      • Aaron G says:

        On a more serious note, I found the following article by Donoho et al. regarding reproducible research in the context of computational harmonic analysis (which is related to machine learning, at least tangentially):

      • Alex says:

        I think mclaren raises an interesting point, which is that if you frame your research in terms of questions like “Can model X do Y?” it’s basically impossible to get hard negative results in an experiment. If you tried to use “X for Y”, and got bad results, it could just be that you didn’t run your experiment long enough (computing power) or didn’t pick the right configuration or settings (hyperparameters) for X.

        At the same time, we can still get negative results using theory, although this is tough for most of the methods that are used in practice.

        So we’re in a somewhat awkward position where we publish a stream of positive results alongside a lot of ambiguity.

  2. JohnnyJohnnyM says:

    I am quite interested in Andrew’s (or others’) views on the relative merits of machine learning versus context-driven modeling. Machine learning is often viewed as a black box, meaning if I use a machine learning algorithm and my friend Bubba uses the same algorithm, we get the same answers. Context-driven modeling, especially with Bayesian methods, allows for incorporation of context into the modeling process, meaning that Bubba and I may not get the same answers when analyzing a particular dataset. This can be good and bad.


    • Andrew says:


      I’m no expert on machine learning but it’s been my impression that these methods are not black boxes, or even close. They might not have explicit modeling assumptions but there are choices of what data to include in the model, how to code the data (lots of options in how to turn data into “features”), how long to run the algorithm, etc. For hard problems, a lot of work can be involved in training and tuning the algorithm. I’m not saying this as a diss on machine learning methods; I just think you should be clear on this, as many of the decision points in such algorithms seem to be swept under the rug when these algorithms are written up as research papers.

      Also, with Bayesian methods, the contextual information is a form of data. If you and Bubba are using the same local data but different priors, it’s fair to say that you’re using different sets of information in your analysis.

      • JohnnyJohnnyM says:

        Thanks for the comment. But this whole black-box thing brings up a general issue I’ve observed regarding the perception that scientists have of statisticians and the work they do. Statisticians are often derided as glorified calculators, simply crunching numbers context-free in order to get results. Since statistics is “objective”, every competent statistician is supposed to get the same results when analyzing a dataset. In other words, statisticians lack innovation. A good statistician differentiates himself/herself from other statisticians based on attention to detail and accuracy, but is not someone who generates great new ideas.

        Of course there are economic ramifications here as well. If all statisticians are expected to get the same answers, then they are largely interchangeable. They provide value, without uniqueness, which means they compete on price. Why pay one statistician big money when all statisticians are supposed to get the same results? Why not just outsource?

        Of course I don’t agree with any of these perceptions, but these views represent, in my experience, common perspectives that are actually driving some statisticians away from the field.

        • Keith O'Rourke says:

          I have and do encounter those perceptions fairly frequently – largely it is what they learned in their stats courses and perhaps statisticians they have previously worked with. I did know statisticians working in clinical research that regularly referred to their class notes from grad school on what the correct things to do were – seemingly unaware of how many years ago they were in grad school.

          Additionally any sense of specialization is absent – any statistician is supposed to know how to analyse any sort of study – period.

          These perceptions cause real problems when different organisations need to jointly agree on statistical analyses – if there is any disagreement between the statisticians each organisation just defers to _their_ statistician as the one they have no choice but believe. The idea that they should present their arguments as a way of resolving the differences seems absent.

          Unfortunately a former colleague of mine once completed a large randomized trial and then discovered that the statistician they hired had no sense of how to analyse a randomized trial and had only provided cross tabulations of all recorded variables as the _primary_ analysis.

          In working with new people I find it helps to clarify that there aren’t correct statistical analyses but rather just more or less reasonable analyses – seems to help.

    • Glen M. Sizemore says:

      “Machine learning is often viewed as a black box, meaning if I use a machine learning algorithm and my friend Bubba uses the same algorithm, we get the same answers.”

      GS: I don’t see how the second part of your sentence has much to do with the use of the term “black-box.” This term usually means that, though you may be knowledgeable about the “input” and “output” of some system, you don’t know what’s going on “in the inside.” The term was famously wielded against behaviorism. They might study “inputs and outputs” but they don’t know what’s going on where it counts – the Almighty Brain. The irony is that the so-called “inputs and outputs” (not a good phrase) – the functional relations – are what must be described in order to build a physiological reductionism. To take an analogy, how could you develop statistical mechanics without first developing chemistry and thermodynamics?

      Finally, imagine a neural net that learns to categorize some class of stimuli. The thing “learns the problem” but it is far from clear how it works by looking at the architecture and connection weights.

      • johhnyJohmmyM says:

        I see your point, but the concepts are related. If we don’t know what’s going on inside, it’s hard to incorporate subject area information, meaning that many black boxes are, practically speaking, “sealed”. Flexible modeling frameworks, on the other hand, allow for incorporation of subject knowledge. Different individuals can get different answers, which may not be a bad thing.

        • Anoneuoid says:

          I am not sure you are too familiar with machine learning. You should check out a kaggle competition[1].

          Many of them have rules specifically against using context or outside knowledge. Many times the data is also anonymized to the point where you don’t even know what each column means. Yet machine learning techniques are used to get many thousands of different results for each competition, and many of those using the same algorithms.


          • JohnnyJohnnyM says:

            You seem to be supporting exactly what I’m saying, namely that machine learning methods and culture generally limit inclusion of contextual information by the analysts to improve inference. (Of course that doesn’t mean that all machine learning methods perform the same.)

            I’m not claiming to be a machine learning expert (though I’m somewhat familiar with Kaggle). However, the point is that with flexible modeling frameworks (e.g. hierarchical modeling), a super smart knowledgeable analyst can do better than a dull-witted someone with no subject area expertise. It would seem that the Kaggle competitions limit such contextual knowledge advantages.

  3. Bob L. Sturm says:

    Related to reproducibility and transparency in applied machine learning research, we are now organising the second edition of our workshop, HORSE2017: On “Horses” in Applied Machine Learning. As Andrew said some time ago, reproducibility and transparency are not enough to ensure our research contributions are actually contributions.

  4. Mike Beyer says:

    I too like this idea. However, there is a distinction between Statistics and Machine Learning that makes the latter much less amenable to replication studies: at its core, machine learning is optimization, not inference – so it’s focused on a methodology not a population.

    Inference requires some underlying (stationary) statistical model of the data generating process — many machine learning models attempt to work when such a process doesn’t exist (or simply isn’t available for characterization). By stationary, I do not mean the process is stable throughout time (time-homogeneous) but simply that the model structure (whatever it is) is stationary or fixed in time. Without this, you really can’t tell if you were lucky or have found something useful.

    Because of its dependence on optimization, the “No Free Lunch” theorems suggest that performance of an ML system is very dependent on the specifics of your problem. Thus, if researcher X finds that their text classification system is awesome when applied to Wikipedia and NIH articles, this doesn’t mean that it will be useful for a new application (even in the same domain). The problem is that performance can vary considerably for subsets of the sample space even if the overall accuracy metric [read: objective function…see NFL theorems] for an experiment is high.
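The point about subsets can be made concrete with a small sketch. The group names and counts below are made up for illustration and are not from any real experiment; `accuracy_by_group` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) tuples.

    Returns (overall_accuracy, {group: accuracy}).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group


# Hypothetical counts: 900 documents from a well-represented source,
# 100 documents from a poorly represented one.
records = (
    [("wikipedia", 1, 1)] * 855 + [("wikipedia", 1, 0)] * 45
    + [("clinical_notes", 1, 1)] * 50 + [("clinical_notes", 1, 0)] * 50
)

overall, per_group = accuracy_by_group(records)
print("overall:", overall)
for group, acc in per_group.items():
    print(group, acc)
```

The overall figure is dominated by the larger group, so a replication that reports only the headline metric could look successful while one subset is classified no better than a coin flip.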

    Add to this that we often want to optimize different things, with different loss functions, and you have a combinatorial explosion of “test conditions” (e.g., if you were setting up a design matrix).

    I’d be very interested in some objective methods for running a replication of a ML experiment. A quick and dirty approach would be to run bootstrap replications of the same model to see sampling variability.
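A minimal sketch of the bootstrap idea, using only the Python standard library. Everything here is an assumption made for illustration: the data are synthetic, the "model" is a one-dimensional threshold classifier, and scoring each refit on the rows left out of its resample (out-of-bag) is one choice among several.

```python
import random
import statistics


def fit_threshold(xs, ys):
    """Toy model: the cut point on x that maximizes training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = sum((x >= t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def accuracy(t, xs, ys):
    """Fraction of points where the rule 'x >= t' matches the label."""
    return sum((x >= t) == y for x, y in zip(xs, ys)) / len(xs)


def bootstrap_accuracy(xs, ys, n_boot=200, seed=0):
    """Refit on rows resampled with replacement; score on the left-out rows."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        out = [i for i in range(n) if i not in chosen]
        if not out:  # extremely unlikely, but avoids dividing by zero
            continue
        t = fit_threshold([xs[i] for i in idx], [ys[i] for i in idx])
        scores.append(accuracy(t, [xs[i] for i in out], [ys[i] for i in out]))
    return scores


# Synthetic data: two classes with unit-variance Gaussian noise around -1 and +1.
data_rng = random.Random(1)
ys = [data_rng.random() < 0.5 for _ in range(100)]
xs = [data_rng.gauss(1.0 if y else -1.0, 1.0) for y in ys]

scores = bootstrap_accuracy(xs, ys)
print("mean accuracy:", statistics.mean(scores))
print("std across replications:", statistics.stdev(scores))
```

The spread of `scores` is a rough picture of how much a reported accuracy could move from resampling alone. Fixing `seed` makes the replication itself repeatable.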

    Perhaps a more “Bayesian” approach would be to test the model under data generated from a large class of models that share some verifiable/testable property (e.g., mean stationary, sub-exponential tails). The prior would help select specific data models and then you can explore performance more broadly and (if you’re lucky) make something approaching a generalizable finding.
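One way to read this suggestion is as a simulation study: draw the parameters of a data-generating process from a prior, generate data, fit the method, and record how it does. The sketch below assumes a deliberately simple class of processes (two classes separated by a location shift, with symmetric Gaussian or Laplace noise) and a toy midpoint classifier. The priors, function names, and parameter ranges are all illustrative choices, not anything specified in the comment.

```python
import random
import statistics


def draw_dataset(rng, n, shift, heavy_tails):
    """Two classes centered at -shift and +shift with symmetric noise."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.random() < 0.5
        if heavy_tails:
            # The difference of two unit exponentials is Laplace-distributed.
            noise = rng.expovariate(1.0) - rng.expovariate(1.0)
        else:
            noise = rng.gauss(0.0, 1.0)
        xs.append((shift if y else -shift) + noise)
        ys.append(y)
    return xs, ys


def fit_midpoint(xs, ys):
    """Toy classifier: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def simulate(n_sims=200, n=200, seed=0):
    """Return a list of (shift, heavy_tails, test_accuracy) tuples."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        shift = rng.uniform(0.25, 2.0)   # assumed prior over class separation
        heavy = rng.random() < 0.5       # assumed prior over tail behaviour
        train = draw_dataset(rng, n, shift, heavy)
        test = draw_dataset(rng, n, shift, heavy)
        t = fit_midpoint(*train)
        acc = sum((x >= t) == y for x, y in zip(*test)) / n
        results.append((shift, heavy, acc))
    return results


results = simulate()
for label, flag in (("gaussian", False), ("laplace", True)):
    accs = [acc for _, heavy, acc in results if heavy == flag]
    print(label, "mean test accuracy:", statistics.mean(accs))
```

Summarizing `results` by region of the parameter space (here, by tail type and by `shift`) shows where the method holds up and where it degrades, which is closer to a generalizable statement than a single benchmark score.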

    Another approach would be to try to re-cast the algorithms as Bayesian models. I think a good example of a statistically rigorous machine learning methodology is Latent Dirichlet Allocation and its derivatives. These are Hierarchical Bayesian models that serve to (greatly) simplify text generation processes. The fact that you have all the assumptions up-front helps to assess under what situations they will work well and when they will fail.

    However, LDA is an unsupervised approach (generally), so “right/wrong” is a bit fuzzy — others are trying this for supervised methods like deep learning:
