“Once I was told to try every possible specification of a dependent variable (count, proportion, binary indicator, you name it) in a regression until I find a significant relationship. That is it, no justification for choosing one specification over another besides finding significance. . . . In another occasion I was asked to re-write a theory section of a paper to reflect an incidental finding from our analysis, so that it shows up as if we were asking a question about the incidental finding and had come up with the supported hypothesis a priori. . . .”

Ethan Bolker points me to this discussion.

My reply: As discussed in my paper with Hill and Yajima, I think the best approach is to analyze all comparisons rather than picking just some. If there is prior understanding that some comparisons are more important than others, that understanding can be included as predictors in the model.
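
To make that concrete, here is a minimal sketch in Python of the “analyze all comparisons” idea via partial pooling. It uses made-up estimates and standard errors and a crude empirical-Bayes shortcut rather than the full multilevel model of the Gelman, Hill, and Yajima paper, so treat it as an illustration of the idea rather than the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: raw estimates for 8 comparisons, each with a known
# standard error, as might come out of a regression with many group indicators.
raw_est = rng.normal(0.0, 1.0, size=8)   # made-up point estimates
se = np.full(8, 0.6)                     # made-up standard errors

# Normal-normal partial pooling via an empirical-Bayes shortcut: estimate the
# between-comparison variance by the method of moments, then shrink each raw
# estimate toward the grand mean in proportion to its noise.
grand_mean = raw_est.mean()
between_var = max(raw_est.var(ddof=1) - np.mean(se**2), 0.0)
shrinkage = se**2 / (se**2 + between_var)   # 1 = full pooling, 0 = no pooling
pooled_est = grand_mean + (1 - shrinkage) * (raw_est - grand_mean)

for k, (r, p) in enumerate(zip(raw_est, pooled_est)):
    print(f"comparison {k}: raw {r:+.2f} -> partially pooled {p:+.2f}")
```

Noisy comparisons get pulled strongly toward the grand mean, which is what tames the multiple-comparisons problem without having to pick a favored subset ahead of time.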

18 thoughts on this post

      • If you do this in a Bayesian analysis, you’ll probably find that a lot of uncertainty remains and that there is a lot of correlation between the different coefficients or parameters in the model (see the sketch after this comment). From there, you can start to consider what external information you could use to construct more informative, less default priors, or what methods you might use to simplify or improve the model (combining variables into a “score” or the like).

        If you do this throw-in-the-kitchen-sink approach in a classical hypothesis testing analysis, you’ll probably wind up completely muddled. Failure to reject the hypothesis that a parameter is zero is usually taken, incorrectly, as evidence that the parameter really IS zero. Rejecting the hypothesis of a zero parameter is taken as evidence that the parameter can be set to the maximum likelihood point estimate, or the expected value under an unbiased estimator, or something like that. The hypothesis testing results are sensitive to the order in which they’re carried out… and as N grows, N! REALLY GROWS (N! is the number of ways of testing N things in different orders). The result of that process is typically to manufacture FALSE CERTAINTY. Since, as the dimension count increases, you’re more and more likely to find something or other, the result is to zero in on what would, in a Bayesian analysis, be one tiny corner of the posterior space of possibilities and treat it as if it were TRUE NOVEL AMAZING UNUSUAL FINDINGS!!!
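
        A minimal sketch of the Bayesian version of the kitchen-sink fit, assuming invented data, a weak normal prior, and a known noise standard deviation for simplicity; the point is only that the joint posterior keeps the uncertainty and the correlations between coefficients visible instead of collapsing them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical kitchen-sink regression: n observations, k predictors, two of
# them deliberately collinear (all numbers invented for illustration).
n, k = 60, 8
X = rng.normal(size=(n, k))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=n)
y = X @ rng.normal(0.0, 0.5, size=k) + rng.normal(0.0, 1.0, size=n)

# Conjugate Bayesian linear regression with a weak N(0, 10^2) prior on each
# coefficient and, for simplicity, a known noise standard deviation of 1.
prior_var, noise_var = 10.0**2, 1.0
post_cov = np.linalg.inv(X.T @ X / noise_var + np.eye(k) / prior_var)
post_mean = post_cov @ (X.T @ y / noise_var)

post_sd = np.sqrt(np.diag(post_cov))
post_corr = post_cov / np.outer(post_sd, post_sd)
print("posterior means:", np.round(post_mean, 2))
print("posterior sds:  ", np.round(post_sd, 2))
print("largest off-diagonal posterior correlation:",
      round(float(np.max(np.abs(post_corr - np.eye(k)))), 2))
```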

        • @Daniel:

          +1 until I got to “The hypothesis testing results are sensitive to the order in which they’re carried out”. Please explain this.

          Suppose you have three groups: A, B, and C.

          Test for similarity of variance in A and B; fail to reject, so assume a common variance. Estimate a coefficient for the difference between A and B; fail to reject a difference, so collapse A and B into one group, AB. Estimate the variance difference between AB and C; fail to reject, so assume a common variance. Test for a mean difference between AB and C; reject the null, and take the unbiased estimate as the mean difference between AB and C with a common variance.

          Now run the same procedure but test B vs. C first, then eventually A vs. BC: you could easily come to the conclusion that there is only one group, ABC. (The sketch after this comment simulates both orders on the same data.)

          From a Bayesian perspective, you could think about a six-dimensional space consisting of the means of A, B, and C and the variances of A, B, and C. There is a posterior distribution over this six-dimensional space. The naive frequentist method above basically throws away the uncertainty in various dimensions in some order, collapsing each one down to a delta function in that dimension. The order in which you perform it will dictate where in the six-dimensional space your final point estimate lies. It manufactures excess certainty, and it does so in a noisy way (each test is potentially using only a fraction of the data). It’s subject to a lot of potential choices at each step, with a lot of potential orders in which you could take those steps, and therefore a lot of potentially DIFFERENT models you could random-walk your way into… it’s very, very problematic.
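
          A toy simulation of the stepwise procedure above, with invented data and the variance-test steps omitted for brevity; whether the two orders end up with the same grouping depends on the data, which is exactly the problem.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Invented data: three groups whose true means differ only modestly,
# so individual pairwise tests are noisy.
A = rng.normal(0.0, 1.0, 20)
B = rng.normal(0.3, 1.0, 20)
C = rng.normal(0.6, 1.0, 20)

def collapse_then_test(first, second, third, labels):
    """Mimic the stepwise procedure: test the first pair; on failure to
    reject, pool them and test the pooled group against the third."""
    _, p1 = stats.ttest_ind(first, second)
    if p1 > 0.05:
        pooled = np.concatenate([first, second])
        _, p2 = stats.ttest_ind(pooled, third)
        verdict = "different from" if p2 < 0.05 else "pooled with"
        print(f"{labels[0]},{labels[1]} pooled (p={p1:.2f}); "
              f"then {verdict} {labels[2]} (p={p2:.2f})")
    else:
        print(f"{labels[0]} vs {labels[1]} kept separate (p={p1:.2f})")

collapse_then_test(A, B, C, ("A", "B", "C"))  # order 1: compare A,B first
collapse_then_test(B, C, A, ("B", "C", "A"))  # order 2: compare B,C first
```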

        • Contrast this with what you’d get out of a Bayesian analysis: you’d have a posterior over the means of A, B, and C, and you could easily ask of your posterior sample, “How much do I know about the difference in means between A and B?” Looking at mu(A)-mu(B) in your sample, you discover that the draws range from about -1 to +1 (on a scale where 1 is a really important practical difference). Would you see that the range includes 0 and then assume that this means “Therefore I know that the difference is exactly 0”???

          No, but failing to reject the hypothesis that mu(A)-mu(B) = 0 and then setting it equal to 0 is doing the same thing, and it’s a RAMPANT issue in practice. Since the testing gives you binary results, it seems totally plausible to the average researcher: “I couldn’t reject the idea that they’re zero, so I’ll be ‘conservative’ and assume they’re zero.” But there’s nothing necessarily conservative about that! What’s conservative is to retain the whole Bayesian range of possibilities!
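
          A small sketch of that posterior check, using invented data and a crude normal approximation to the posterior of each group mean in place of a full Bayesian fit:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data for groups A and B.
A = rng.normal(0.2, 1.0, 15)
B = rng.normal(0.0, 1.0, 15)

# Crude normal approximation to the posterior of each group mean
# (flat prior, plug-in standard error); a full Bayesian model would be
# better, but this is enough to look at the uncertainty directly.
draws = 10_000
muA = rng.normal(A.mean(), A.std(ddof=1) / np.sqrt(len(A)), draws)
muB = rng.normal(B.mean(), B.std(ddof=1) / np.sqrt(len(B)), draws)
diff = muA - muB

lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"approximate 95% posterior interval for mu(A)-mu(B): [{lo:.2f}, {hi:.2f}]")
print(f"P(mu(A) > mu(B)) is roughly {np.mean(diff > 0):.2f}")
# The interval may well include 0, but that describes what we don't know;
# it is not a license to set the difference exactly to 0.
```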

        • “failing to reject the hypothesis that mu(A)-mu(B) = 0 and then setting it equal to 0” is indeed another reason why I think using hypothesis tests for checking model assumptions is a bad idea.

        • Thanks for the example. It does fit your statement that order matters. It’s not one I would have thought of, since I generally discourage using hypothesis tests for checking model assumptions — the tests have their own model assumptions, so one easily gets caught in an ever-expanding morass.

        • Hi, Daniel,

          if the goal of the statistical analysis is to understand whether there is one group or more than one, then I don’t see why one would go through this “pairwise comparisons” procedure instead of following the usual frequentist approach:
          1. performing Bartlett’s test for homogeneity of variance among all groups
          2. performing ANOVA, to test whether all groups have the same mean
          3. using the Tukey or Bonferroni method to build confidence intervals for the pairwise mean differences, in case the ANOVA F test rejects the null
          I agree that the Bayesian approach, giving a full joint probability distribution for the model parameters, produces a richer and more interesting answer: now that I have Andrew’s book, I’m trying to learn more about it. However, I’d compare it to the “right” way to do this in a classical frequentist context, rather than to an approach that’s advised against even in introductory courses such as the one I followed at my company. (The three steps above are sketched in code just after this comment.)
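
          In case it is useful to other readers, here is a minimal Python sketch of that three-step recipe on invented data, using scipy and statsmodels; it illustrates the workflow described above and is not an endorsement of it over the alternatives discussed in this thread.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)

# Invented data for three groups.
A = rng.normal(0.0, 1.0, 25)
B = rng.normal(0.4, 1.0, 25)
C = rng.normal(0.8, 1.0, 25)

# 1. Bartlett's test for homogeneity of variance across all groups.
_, bartlett_p = stats.bartlett(A, B, C)
print(f"Bartlett: p = {bartlett_p:.3f}")

# 2. One-way ANOVA (F test) for equality of all group means.
_, anova_p = stats.f_oneway(A, B, C)
print(f"ANOVA:    p = {anova_p:.3f}")

# 3. If the F test rejects, Tukey's HSD gives simultaneous confidence
#    intervals for the pairwise mean differences.
if anova_p < 0.05:
    values = np.concatenate([A, B, C])
    groups = ["A"] * len(A) + ["B"] * len(B) + ["C"] * len(C)
    print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```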

        • The fact that Frequentists have come up with other ways that do a better job isn’t really the issue.

          Bayesian methods extend binary logic via Cox’s axioms; Frequentist methods do not. People really do this kind of multiple-testing thing. They shouldn’t. In my opinion, they SHOULD adopt a Bayesian model, not search for a different, less wrong Frequentist testing procedure.

          Also, this example is just one example designed to illustrate the issue. If you are going to deal with a 37-dimensional (or 350- or 2500-dimensional) space, and the question is more complex than just “are all 37 of these groups identical enough for some single frequentist test,” then you will have to come up with some kind of complex, science-driven model, and as soon as you do you will be faced with the inadequacy of testing as a method for developing a model.

        • Rahul:

          One of the powerful aspects of statistics (and, for that matter, of mathematics) is its applicability to many different problems. I have not used hierarchical models in all my applications, but in my published research articles, I’ve applied hierarchical models to voting, public opinion, laboratory assays, toxicology, radon exposure, etc etc etc. One can also note that algebra, trigonometry, and calculus have been applied to problems as diverse as rocketry, astronomy, biology, economics, etc. These are great tools.

      • If only that’s what really happened. I think in many cases, people just review the hundreds of pages of crosstabs to see what came out as significant and roll with that. Forget any kind of advanced statistics or post hocs.

  1. I think you’re missing the point of the question, Andrew. It’s not “what should the researcher do?” The questioner already knows what should be done. It’s “how do I get the researcher to understand that fishing for variables that give ‘significance’ is nonsensical and unethical?” The latter question is much tougher.

    • Raghuveer:

      Yes, I see that. But I’m also interested in the remedy. I could say that researchers should just pick one comparison ahead of time and then their p-values will be valid. But I’d prefer not to give that advice, since that’s not what I do myself!

      • +1. Out here in non-academia, NOBODY has accurate p-values, and I’m told the same is true within the ivied walls as well. My biggest objection to p-values as *the* criterion is not that they are misleading, or that they are arbitrary or that they try to put objective yes-no answers on qualitative questions. It’s that they are always inappropriate measures for the sorts of analysis which reflect what human beings actually do when they get datasets and try to figure out what they mean. I don’t ever see the smallest effort to get people to change what they do *before* they analyze the results, because what they do there is mostly the guts of real research that everyone approves of.

        It’s as if you have cooks all trying to come up with novel dishes. They take ingredients, combine them in various ways, experiment with various proportions and cooking times, and then produce a recipe. We then judge that recipe exclusively by the standard of whether or not the resulting dish is better or worse than what some chef produces with that list of ingredients in a one-hour test. It’s too low a bar to be interesting.

  2. There are many exaggerations about these issues. First, the role of testing in an experiment is different from its role when one attempts to build a complex model for prediction. In the first case, controlling the size of the test perfectly is crucial; in the second, not so much, since there are other graphical and descriptive criteria that can be more relevant than a test. But in either setting, modeling or designing an experiment, it is perfectly all right to conduct multiple tests, provided that one controls the FWER (familywise error rate) or some other global criterion such as the false discovery rate. The point is, there is no reason to stop your colleagues from digging. You just need to check that they do it properly. Unexpected results usually come from multiple testing, and there is nothing wrong with that, provided that one uses corrected inference methods.
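
     For concreteness, here is a small sketch of such corrections, applied to a batch of made-up p-values with statsmodels; both a familywise correction (Bonferroni) and a false-discovery-rate correction (Benjamini-Hochberg) are shown.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up raw p-values from digging through many comparisons.
pvals = np.array([0.001, 0.008, 0.020, 0.041, 0.049, 0.12, 0.35, 0.68])

# Familywise error rate control (Bonferroni) and false discovery rate
# control (Benjamini-Hochberg) can disagree about how many "discoveries"
# survive correction.
for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method}: rejects {int(reject.sum())} of {len(pvals)}")
    print("  adjusted p-values:", np.round(p_adj, 3))
```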
