Ways of knowing in computer science and statistics

Brad Groff writes:

Thought you might find this post by Ferenc Huszar interesting. Commentary on how we create knowledge in machine learning research and how we resolve benchmark results with (belated) theory. Key passage:

You can think of “making a deep learning method work on a dataset” as a statistical test. I would argue that the statistical power of experiments is very weak. We do a lot of things like early stopping, manual tweaking of hyperparameters, running multiple experiments and only reporting the best results. We probably all know we should not be doing these things when testing hypotheses. Yet, these practices are considered fine when reporting empirical results in ML papers. Many go on and consider these reported empirical results as “strong empirical evidence” in favour of a method.

Huszar’s post is interesting, and also interesting is this linked post by Yann LeCun. My only comment is that we have many different ways of deciding what methods to use, including:

– Mathematical theory,

– Computer simulations,

– Solutions to toy problems,

– Improved performance on benchmark problems,

– Cross-validation and external validation of predictions,

– Success as recognized in a field of application,

– Success in the marketplace.

None of these is enough on its own. Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don’t necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled.
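
As a toy illustration of the cross-validation caveat, here is a minimal sketch with simulated data (the model and numbers are invented for illustration): naive shuffled k-fold cross-validation on time-dependent data can look reassuringly accurate, while an honest fit-on-the-past, predict-the-future check tells a different story.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import cross_val_score, KFold

    rng = np.random.default_rng(1)

    # Simulated time-dependent data: a random walk indexed by time.
    n = 500
    t = np.arange(n, dtype=float).reshape(-1, 1)
    y = np.cumsum(rng.normal(size=n))

    model = KNeighborsRegressor(n_neighbors=1)

    # Shuffled k-fold CV: temporal neighbors leak across folds,
    # so the estimated error looks small.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    cv_mse = -cross_val_score(model, t, y, cv=cv,
                              scoring="neg_mean_squared_error").mean()

    # External validation: fit on the past, predict the future.
    split = int(0.8 * n)
    model.fit(t[:split], y[:split])
    future_mse = np.mean((model.predict(t[split:]) - y[split:]) ** 2)

    print(f"shuffled-CV error estimate: {cv_mse:.2f}")
    print(f"error on held-out future:   {future_mse:.2f}")

The first number mostly reflects leakage between adjacent time points; the second is closer to how the predictions would actually be used.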

The very imperfections of each of these sorts of evidence give a clue as to why it makes sense to care about all of them. We can’t know for sure, so it makes sense to have many ways of knowing.

For further discussion of this point, see section 26.2 of this article.

7 thoughts on “Ways of knowing in computer science and statistics”

  1. In a recent talk for a session on “Phil Sci & the New Paradigm of Data-Driven Science” I try to delineate some ways to warrant inferences despite using the same data to both build and appraise a model or other claim (slides 23-4). https://errorstatistics.com/2018/06/04/your-data-driven-claims-must-still-be-probed-severely/
    ML is not a field I’ve worked in; this was only an attempt to apply some things I have worked on to this problem. When the author says “the statistical power of experiments is very weak”, I take it he’s saying they have low capability to discern flaws in resulting claims. If so, those flaws can’t be regarded as well ruled out, so other probes are required to claim knowledge.

    • I don’t think he’s using “power” in the “type 2 error” sense. I think he means the power to find “truth” given the garden of forking paths and experimenter degrees of freedom: the problems that come from knowing, or thinking that you know, the theoretical result (e.g., your slide 15).
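
      A toy simulation of that selection effect, with all numbers invented: run many seeds or hyperparameter settings, report only the best test score, and the reported number drifts upward even though the method’s true accuracy never changes.

        import numpy as np

        rng = np.random.default_rng(0)

        true_accuracy = 0.70   # the method's "true" accuracy (invented)
        n_test = 1000          # test-set size
        n_runs = 20            # seeds / hyperparameter settings tried

        # Each run's measured accuracy is just noise around the same truth.
        measured = rng.binomial(n_test, true_accuracy, size=n_runs) / n_test

        print("single honest run:", measured[0])
        print("best of", n_runs, "runs (what gets reported):", measured.max())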

      It’s a good thing that there are multiple ways of deciding, because the data we work with can have a lot of different forms. I generally work in a data-rich environment, so methods that involve holdout samples (or multiple holdouts) make a lot of sense. Next week I’ll get more gigabytes of data coming in the door. On the other hand, there’s only a presidential election every 4 years.

  2. We do a lot of things like early stopping, manual tweaking of hyperparameters, running multiple experiments and only reporting the best results. We probably all know we should not be doing these things when testing hypotheses. Yet, these practices are considered fine when reporting empirical results in ML papers. Many go on and consider these reported empirical results as “strong empirical evidence” in favour of a method.

    Who is “we”? Dealing with this is one of the first things you figure out…

    The term for this problem is “data leakage”: information about the test set is leaking into your model and data-prep pipeline. It’s dealt with by having a final, independent holdout set. This is why most academic papers using ML seem to be worthless: almost every time, it is apparent that they overfit the cross-validation.
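
    As a minimal sketch of that discipline (scikit-learn, with made-up data; the model and parameters are arbitrary): carve off the final holdout before any tuning, keep all model selection inside the development set, and score the holdout exactly once.

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import GridSearchCV, train_test_split

      rng = np.random.default_rng(2)
      X = rng.normal(size=(1000, 20))
      y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)

      # Carve off the final holdout first; it is never touched during tuning.
      X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2,
                                                      random_state=0)

      # All model selection (hyperparameter search, etc.) stays inside the
      # development set, via internal cross-validation.
      search = GridSearchCV(RandomForestClassifier(random_state=0),
                            param_grid={"max_depth": [2, 5, None]},
                            cv=5)
      search.fit(X_dev, y_dev)

      # A single pass over the untouched holdout is what gets reported.
      print("CV score used for tuning:", search.best_score_)
      print("final holdout accuracy:  ", search.score(X_test, y_test))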

  3. I think of this as a feature (not a good one) of the data science workplace. When you have success defined by speed and certainty (incredible or otherwise) of inferences/predictions, and when you mix workers from a stats background (not the fast guys, as Diego Kuonen put it), a comp sci background (who have their eye on their portfolio of cutting-edge techniques), and maybe some designers, developers or comms people thrown in there, it’s hard to make it a happy workplace for all. They all have different motivations, professional norms, and language. Then there is high demand for them, so they churn a lot and it matters less to them whether their inferences and predictions were really any good. On top of that, the boss often doesn’t care either because they’re playing the same buzzword-laden, high-churn game. Nevertheless there are opportunities to make it a very positive environment of mutual learning. I pontificated on it here: https://iase-web.org/documents/SERJ/SERJ16(1)_Grant.pdf for what it’s worth.

  4. A few thoughts

    a) So much of ML is new, early-stage work. At a very broad, meta level, what is happening now can be considered the exploratory phase of the big ML project. In this context, it may indeed be reasonable to use heuristics and loose metrics to get a feel for which methods and paths of inquiry are worth pursuing.

    b) The downside of ML being so new is that it is relearning much of what other disciplines, such as statistics, have already developed, for example the importance of strict controls in hypothesis testing. As this learning takes place, things will improve.

    c) From a practical standpoint, Andrew’s comment at the end is most important. Determining what makes a “good” model / analysis requires viewing it from multiple perspectives.
