Don’t do the Wilcoxon (reprise)

František Bartoš writes:

I’ve read your statistical books and various others’, and from most of them I gained the impression that nonparametric tests aren’t very useful and are mostly a relic from the pre-computer age.

However, this week I witnessed a discussion about this (in the Psych Methods discussion group on FB), and most of the responses were very supportive of nonparametric tests.

I tried to find more on this on your blog, but I wasn’t really successful. Could you consider writing a post comparing parametric and nonparametric tests?

My reply:

1. In general I don’t think statistical hypothesis tests—parametric or otherwise—are helpful because they are typically used to reject a null hypothesis that nobody has any reason to believe, of exactly zero effect and exactly zero systematic error.

2. I also think that nonparametric tests are overrated. I wrote about this a few years ago, in a post entitled Don’t do the Wilcoxon, which is a restatement of a brief passage from our book, Bayesian Data Analysis. The point (click through for the full story) is that Wilcoxon is essentially equivalent to first ranking the data, then passing the ranks through a z-score transformation, and then running a classical z-test. As such, this procedure could be valuable in some settings (those settings where you feel that the ranks contain most of the information in the data, and where otherwise you’re ok with a z-test). But, if it’s working for you, what makes it work is that you’re discarding information using the rank transformation. As I wrote in the above-linked post, just do the transformation if you want and then use your usual normal-theory methods; don’t get trapped into thinking there’s something specially rigorous about the method being nonparametric.
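To make that concrete, here is a minimal R sketch of the recipe; the two-group data and variable names below are just an invented illustration, not anything from the original post:

# Invented example: a skewed outcome measured in two groups
set.seed(2015)
y <- c(rexp(20, rate = 1), rexp(25, rate = 0.5))
group <- rep(0:1, c(20, 25))
N <- length(y)
# Steps 1-2: replace the data by their ranks, then map ranks to z-scores
# via the inverse-normal cdf at (2*rank - 1)/(2*N)
z <- qnorm((2 * rank(y) - 1) / (2 * N))
# Step 3: any normal-theory method you like; here, a plain regression on group
summary(lm(z ~ group))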

34 thoughts on “Don’t do the Wilcoxon (reprise)”

  1. In general I don’t think statistical hypothesis tests—parametric or otherwise—are helpful because they are typically used to reject a null hypothesis that nobody has any reason to believe, of exactly zero effect and exactly zero systematic error.

    I think I came across another problem. You can test multiple statistical hypotheses that are derived from the same data generating process and get different answers.

    As a simple example, imagine you want to test if a die is fair. You roll it 10 times and observe 5 sixes. You can test this using the binomial distribution to see the probability of observing 5 sixes assuming a fair die:

    > n <- 10; k <- 5; p <- 1/6
    > choose(n, k)*p^k*(1-p)^(n - k)
    [1] 0.01302381

    Or using the geometric distribution to see the probability of observing at least one six:

    > 1-(1-p)^n
    [1] 0.8384944

    The latter is just the binomial distribution when k = 0 subtracted from one. So there are two very different probabilities derived from the same theoretical data generating process with all the same parameters and assumptions. It is also intuitively clear that these probabilities really should be different. However, using NHST we would reject the “fair dice hypothesis” in the first case but not the second. What is the general rule to choose between models in such a situation?

    • Anon:

      I don’t want to test if a die is fair. No die is perfectly fair. I might want to estimate the die’s departure from fairness, but that’s another story. And with just about any real non-trick die and 10 (or even 100) die rolls, any information from the data is overwhelmed by the prior.

      • There are more situations beyond the classic die example where one is interested in the question “do I have reason to suspect that these samples were not generated according to a discrete uniform?” and where one does not have such an overwhelming prior.

    • I think it is very reasonable to get different answers here, since you are defining fairness in different ways. E.g., assume the die actually has a memory and stops rolling sixes after the first one. It would be fair in terms of the probability of getting at least one six, but not fair in terms of the expected number of sixes. Also note that both definitions are incomplete, as you ignore the probabilities of the other five faces.

      A Bayesian framework would lead to the same conclusions as above, if you define fairness in the same way.

      • I agree it makes sense that they are different, but which one should be used for p(data|model)? It seems to be an arbitrary decision, so the researcher can just “decide” whether or not to have significance.

      • But there’s also the likelihood principle argument. You could test the fairness of the die by rolling it N times and testing the null hypothesis that the data are multinomial with equal probabilities, or by a geometric experiment, where we roll until we see some face appear some number of times. Even if these experiments return the same data, significance testing would (possibly) return different outcomes. A Bayesian approach would return identical posteriors as long as the likelihoods are equal.

      • Also, I don’t consider NHST to be a frequentist method. You can do it in a Bayesian way, or just by looking at plots and saying “there is a difference.”

  2. What about when you have data that follows a very strange, unfriendly distribution? For example, multivariate data with a longitudinal semicontinuous outcome. Parametric models (Bayesian or frequentist) can be used, but they’re a nightmare to fit. Could there be a place for non-parametric testing? I’m working on this (https://arxiv.org/abs/1711.08876), and so far I’d say it remains unclear. Thoughts?

  3. Often we’re interested in testing whether the area under the ROC curve is different from 0.5, and this is precisely what the Wilcoxon test really does. If that’s not the question, then I agree there is usually a better alternative.
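    Here is a quick numerical check of that equivalence; the simulated scores and names below are mine, just for illustration:

    # "Cases" and "controls" with continuous scores (so no ties)
    set.seed(1)
    x <- rnorm(30, mean = 1)
    y <- rnorm(40)
    U <- wilcox.test(x, y)$statistic        # Mann-Whitney U (R labels it W)
    c(U / (length(x) * length(y)),          # U rescaled to [0, 1] ...
      mean(outer(x, y, ">")))               # ... equals the empirical AUC, P-hat(X > Y)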

  4. There’s a bunch of things I’d want to say on this topic, but I’ll limit it to this: on FB, Bartoš says, “If you see a distribution of your data, and then decide, that you wanna use a np test, then you are doing a data dependent analysis, which is not a thing you wanna do in a first place.” And I absolutely, fundamentally, totally disagree. You always have to look at the data!

    • I think it is important to include the finishing lines of my comment there:
      “But by this I don’t want to say that parametric tests are flawless and awesome; we should check assumptions. It’s always a good idea to do a posterior predictive check and criticize your model.”

      By the part you quoted, I meant that under the NHST framework you should have the whole data analysis planned ahead of time, independent of the data collected. But it’s taught (in my experience) the other way around: collect data, check the histogram (histomancy), and if your data look non-normalish, use a nonparametric test.

      So I’m not against checking the data, model assumptions, etc. But I think that some common uses of nonparametric tests are in opposition to the NHST framework.

    • If you could obtain the distribution of your test statistic under this “look at the data, then decide the type of test” method, then you could use it to compute a valid p-value. It would not be the same as the p-value you compute assuming one method or the other.

      Jake: I agree that you should always look at your data. But if you use the data to decide what kind of NHST procedure to use, then p < 0.05 doesn't mean what it's supposed to mean.
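      One way to see this is to simulate the two-stage procedure under a true null and look at its long-run “p < .05” rate. The particular selection rule below (a Shapiro-Wilk check deciding between a t-test and a Wilcoxon) and the sample sizes are my own assumptions, just a sketch:

      set.seed(123)
      reject <- replicate(5000, {
        x <- rexp(20); y <- rexp(20)   # same skewed distribution in both groups: the null is true
        p <- if (shapiro.test(c(x, y))$p.value > 0.05) t.test(x, y)$p.value
             else wilcox.test(x, y)$p.value
        p < 0.05
      })
      mean(reject)   # rejection rate of the combined look-then-test procedure, to compare with 0.05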

      • P values never mean anything other than the probability that a given random number generator would generate a data set whose test statistic is more extreme than the observed one… And yes they continue to mean this regardless of whether you look at the data. The real problem is that people are supposing that they mean something else.

        In fact, even if you don’t look at the data, they don’t mean what people want them to mean.

        • That’s why I continue to use the concrete notion of “what would this RNG do?” It concretizes the question that is being answered. When p is small, then you can answer “probably this particular RNG didn’t make this data set”

          It’s a far cry from that to whatever people usually want it to mean.

        • Thanks Daniel, I appreciate your “think of an RNG” method. My point in response to Jake is that making data-dependent decisions changes the properties of the RNG. But I shouldn’t have said that the data-dependent p-value “would not be the same” as the one coming from a test chosen ahead of time. It would be the same as one of them, but it would have been computed by referencing the test statistic against the wrong RNG.

        • Making data dependent decisions changes the properties of the *broken misapplied decision rule* which goes something like:

          “Choose an RNG which might plausibly be associated with my data by default, test it, and then if it’s rejected, act as if my favorite RNG produces this data instead”

          This decision rule works when:

          1) There are only two plausible possibilities to begin with that everyone agrees on.

          2) Both of the plausible possibilities in (1) are well modeled by iid RNGs

          the p value itself is always mathematically fine, and always produces the same value for a given choice of RNG and data set… what’s wrong with the whole system is that in most cases of real scientific interest neither (1) nor (2) hold even approximately. There are usually a *large* set of plausible mechanisms, and they are only rarely well modeled by iid RNGs

          The case where there are data-dependent choices to make just highlights the fact that (1) fails. The logic fails whether you look at the data or not; it just becomes obvious how broken the whole thing is when you do look at the data and then pare down the set of plausible mechanisms. It’s not that looking at the data “breaks” anything; it’s that it was broken, and looking at the data and then making choices makes it obvious how broken it was.

        • Compare to the following decision rule: From among the whole range of plausible options, pare away those which are least consistent with the data and choose from the remaining ones the one which maximizes some function of the goodness of your decision.

          That’s the essence of a bayesian decision rule:

          the prior defines the “whole range of plausible options”

          the likelihood defines the “paring away function” that downweights all those initially plausible options that are too inconsistent with the data

          the utility function defines what outcomes are good vs bad

          Probability defines a measure on sets that allows for continuous measures of plausibility and downweighting.

          The expectation function defines a deterministic method for comparing goodness which depends simultaneously on taking all possible values into account.

  5. “As I wrote in the above-linked post, just do the transformation if you want and then use your usual normal-theory methods; don’t get trapped into thinking there’s something specially rigorous about the method being nonparametric.”

    In fact, you could just see this as using a semi-parametric copula model, where the marginals are modeled with the EDF and the CDFs are joined with a Gaussian copula.
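    As a rough sketch of that reading (my own toy example, not from the comment above): push each margin through its empirical CDF, map to normal scores, and the correlation of the scores is the usual estimate of the Gaussian-copula parameter.

    set.seed(7)
    x <- rexp(200); y <- x + rexp(200)    # two dependent, non-normal margins
    u <- rank(x) / (length(x) + 1)        # EDF transform, scaled away from 0 and 1
    v <- rank(y) / (length(y) + 1)
    cor(qnorm(u), qnorm(v))               # normal-scores correlation = estimated copula parameter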

  6. Generally I believe nonparametric tests make ill-advised trade-offs except when any discernible data model is believed to be badly wrong; that is, when there is no credible likelihood that can be specified (e.g., a meta-analysis of similar studies that used different outcome measures for the same or similar underlying outcome).

    One way to put the trade-off is that one is lessening the assumptions required to ensure the estimated p-value will have a Uniform(0,1) distribution if the null parameter value and all other ancillary assumptions are true. These are never exactly true, so it gives a false sense of being closer to Uniform(0,1), and the assumptions being lessened are unlikely to be the important ones to worry about (e.g., independence, confounding, missingness, selection, etc.).

    Hopefully point 3 of the ASA Statement on p-values will help make this clearer: “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”

    What is most gained by the nonparametric test (e.g., a Wilcoxon p-value of .065 when the t-test p-value is .047) would have no value if this point were fully understood. A p-value of .065 with lessened assumptions does not add much to a p-value of .047 with slightly stronger assumptions. So it is perhaps a useful sensitivity analysis (which was David Cox’s position), but not useful as a primary analysis.

  7. The “throwing away information” argument assumes that there is something in the values beyond basic order. With consumer ratings (e.g., a 5- or 7-point scale) and multiple ratings per person, you can quickly see that the “value” is not consistent between persons and depends on the order of presentation, while the ordering is (relatively) quite stable. People are pretty good at counting and sorting but not at “measurement” of soft qualities (e.g., “liking”). Andrew’s point about tests versus estimates is well taken, but in consumer practice you want to know about changes in purchase behavior, not the increase in “liking” points.

  8. I like what Keith wrote above. I’d say that there are quite a few times when there is reason to weaken distributional assumptions at the cost of power.

    One thing that hasn’t been mentioned yet is that the Wilcoxon test can be thought of as a test of AUC. If the treatment effect that you want to evaluate is AUC and you want a test of that and confidence intervals, it is a good choice. A slightly better choice is the Brunner-Munzel test.

  9. The other thing that hasn’t been pointed out this time is that the Wilcoxon test isn’t transitive — which is why I think ‘weakening distributional assumption’ isn’t always a helpful way to think about it.

    That is, if you measure an outcome in three groups, the Wilcoxon test can tell you it’s higher in A than B, higher in B than C, and higher in C than A. There isn’t any ordering on all probability distributions that agrees with the Wilcoxon test.

    I think that’s a problem because the most common thing people want to do with the Wilcoxon test is to talk about evidence for differences in one or the other direction between two groups, and you need quite strong assumptions (like stochastic ordering) for that to be justified even modulo our host’s views on testing.

    There’s an interesting “polymath” project on basically this question (in pure maths disguise):
    https://gowers.wordpress.com/2017/04/28/a-potential-new-polymath-project-intransitive-dice/
    https://gowers.wordpress.com/2017/08/12/intransitive-dice-vii-aiming-for-further-results/
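    For a concrete picture, here is the standard Efron-style intransitive-dice example (the faces below are the textbook values, used only to show the cycle; the Wilcoxon/Mann-Whitney comparison tracks exactly these pairwise P(X > Y) probabilities):

    A <- c(4, 4, 4, 4, 0, 0)
    B <- c(3, 3, 3, 3, 3, 3)
    C <- c(6, 6, 2, 2, 2, 2)
    D <- c(5, 5, 5, 1, 1, 1)
    beats <- function(u, v) mean(outer(u, v, ">"))          # P(a roll of u exceeds a roll of v)
    c(beats(A, B), beats(B, C), beats(C, D), beats(D, A))   # each is 2/3: A beats B beats C beats D beats A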

    • Thomas:

      Good point – one is _changing_ the assumptions _in an attempt_ to ensure the estimated p-value will _be closer_ to the Uniform(0,1) distribution if the null parameter value and all other ancillary assumptions are true.

  10. I remember this discussion from back in 2015, and had thought about it occasionally over the past few years. Andrew says that an alternative would be:

    “1. As in classical Wilcoxon, replace the data by their ranks: 1, 2, . . . N.
    2. Translate these ranks into z-scores using the inverse-normal cdf applied to the values 1/(2*N), 3/(2*N), . . . (2*N – 1)/(2*N).
    3. Fit a normal model.”

    As far as I can tell from looking around online, this method hasn’t actually been tested in simulations. The Wilcoxon (and other rank-based alternatives, like the Spearman correlation and the Kruskal-Wallis test) will give almost exactly the same p-values as the corresponding parametric tests run on the ranks, WITHOUT step 2. This simple equivalence is noted by Conover (2012). Basically, it’s:

    1. Rank the data
    2. Run the equivalent parametric test (t-test, Pearson correlation, one-way ANOVA)

    So when I teach these methods, I basically just teach that the “non-parametric methods” are rank transformations. You rank the data (losing the ability to interpret magnitude) to gain robustness (less biased p-values). Thinking about it this way, you can then do other things, like run a Welch t-test on the ranks, which is ostensibly better than a Wilcoxon test (Zimmerman & Zumbo, 1993).
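    A quick check of that teaching point (toy data of my own; with samples like these the two p-values come out very close, though not identical):

    set.seed(3)
    x <- rexp(25); y <- rexp(30, rate = 0.6)
    r <- rank(c(x, y))                          # step 1: rank the pooled data
    g <- factor(rep(c("x", "y"), c(25, 30)))
    c(wilcoxon   = wilcox.test(x, y)$p.value,   # the "nonparametric" test ...
      t_on_ranks = t.test(r ~ g, var.equal = TRUE)$p.value)   # ... vs. the parametric test on the ranks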

    That said … Andrew also says that: “The advantage of this new approach is that, by using the normal distribution, it allows you to plug in all the standard methods that you’re familiar with: regression, analysis of variance, multilevel models, measurement-error models, and so on.”

    This, I think, is probably false. Rank transformations (and probably, by proxy, the normalized rank transform that Andrew proposed) totally fall apart in the multivariate context, like factorial ANOVA or multiple regression (e.g., check out this wiki page: https://en.wikipedia.org/wiki/ANOVA_on_ranks#Failure_of_ranking_in_the_factorial_ANOVA_and_other_complex_layouts). The ranking just doesn’t work right in these complex models. I suspect applying the normal transformation to the ranks would have the same issue. Might make a good simulation study though…
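    A sketch of such a simulation (my own setup, not tested beyond this toy case): generate big additive main effects with no true interaction, then compare the rejection rate of the interaction test on the raw data, on the ranks, and on the normal-scored ranks.

    one_rep <- function() {
      d <- expand.grid(a = c(0, 1), b = c(0, 1), i = 1:20)
      d$y <- rnorm(nrow(d), mean = 2 * d$a + 2 * d$b)   # main effects only, no interaction
      p_int <- function(resp) anova(lm(resp ~ a * b, data = d))["a:b", "Pr(>F)"]
      r <- rank(d$y)
      z <- qnorm((2 * r - 1) / (2 * nrow(d)))           # the normalized-rank version
      c(raw = p_int(d$y), ranks = p_int(r), zscores = p_int(z)) < 0.05
    }
    set.seed(42)
    rowMeans(replicate(2000, one_rep()))   # type I error rates for the interaction test;
                                           # the raw-data one should sit near 0.05, the rank-based ones may not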

    Conover, W. J. (2012). The rank transformation—an easy and intuitive way to connect many nonparametric methods to their parametric counterparts for seamless teaching introductory statistics courses. Wiley Interdisciplinary Reviews: Computational Statistics, 4(5), 432-438.

    Zimmerman, D. W., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t test and Welch t′ test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 47(3), 523.

    • Have you simulated the resulting distribution of the p-values to show them how close it is to Uniform under the null?

      And then re-done that to show them that, with a small amount of bias, dependence, or model mis-specification, it’s not that Uniform?
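      A minimal sketch of what that demonstration might look like (the Wilcoxon setting, sample sizes, and the form of the dependence are my own choices):

      set.seed(1)
      p_clean <- replicate(2000, wilcox.test(rnorm(15), rnorm(15))$p.value)
      p_dep <- replicate(2000, {
        x <- rnorm(15) + rnorm(1, sd = 0.5)   # a shared "batch" shift within each group
        y <- rnorm(15) + rnorm(1, sd = 0.5)   # induces within-group dependence; null still true marginally
        wilcox.test(x, y)$p.value
      })
      par(mfrow = c(1, 2))
      hist(p_clean, main = "clean null")      # roughly flat (discrete, so only approximately uniform)
      hist(p_dep, main = "with dependence")   # piles up near zero relative to the clean case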

      • Awesome, thanks! I was probably missing the search term “van der Waerden” test, which I hadn’t encountered before. Wasn’t quite sure what to call Andrew’s transformation.
