Difficulties of using statistical significance (or lack thereof) to sift through and compare research hypotheses

Dean Eckles writes:

Thought you might be interested in an example that touches on a couple recurring topics:
1. The difference between a statistically significant finding and one that is non-significant need not be itself statistically significant (thus highlighting the problems of using NHST to declare whether an effect exists or not).
2. Continued issues with the credibility of high-profile studies of “social contagion”, especially by Christakis and Fowler.

A new paper in Archives of Sexual Behavior produces observational estimates of peer effects in sexual behavior and same-sex attraction. In the text, the authors (who include C&F) make repeated comparisons of the results for peer effects in sexual intercourse and those for peer effects in same-sex attraction. However, the 95% CI for the latter actually includes the point estimate for the former! This is most clear in Figure 2, as highlighted by Real Clear Science’s blog post about the study. (Now, because there is some complex dependence structure in the data, perhaps the confidence interval for the contrast between these effects could actually be narrower. But this is not presented in the paper.)
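To make point 1 concrete, here is a minimal sketch of checking whether the contrast between a “significant” estimate and a “non-significant” one is itself significant. The numbers are made up, not the paper’s, and the standard error of the difference assumes independent estimates, which the dependence structure in the real data may violate:

```python
# A minimal sketch, with made-up numbers: is the difference between a
# "significant" estimate and a "non-significant" one itself significant?
# The SE of the difference assumes independent estimates; dependence in
# the real data could make the contrast's CI narrower.
from math import sqrt, erfc

def contrast_test(est1: float, se1: float, est2: float, se2: float):
    """z-test for the difference between two (assumed independent) estimates."""
    diff = est1 - est2
    se_diff = sqrt(se1**2 + se2**2)  # valid only under independence
    z = diff / se_diff
    p = erfc(abs(z) / sqrt(2))  # two-sided normal p-value
    return diff, se_diff, z, p

# Estimate 1 is "significant" (z = 2.5); estimate 2 is "not" (z = 1.0).
diff, se_diff, z, p = contrast_test(0.25, 0.10, 0.10, 0.10)
print(f"difference = {diff:.2f} (SE {se_diff:.2f}), z = {z:.2f}, p = {p:.2f}")
# -> z is about 1.06, p about 0.29: the contrast is nowhere near significant.
```

With these numbers one estimate clears significance and the other does not, yet the z-statistic for their difference is only about 1.06, which is exactly the trap of reading “significant vs. non-significant” as evidence of a difference.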

One reason the authors like this negative result is that it is an example of where this family of analyses actually produces a null result. Though again, this is not a very precise null, so what it mainly suggests is a lack of power. The authors make some arguments about having adequate power, but Figure 2 makes pretty clear that the study is underpowered.
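As a rough illustration of the power point (with invented numbers, assuming a normally distributed estimator): back out the standard error from a reported 95% interval and ask how often an effect of plausible size would reach significance.

```python
# Rough power check, assuming a normally distributed estimator. The CI
# half-width and "true" effect size below are invented for illustration,
# not taken from the paper.
from statistics import NormalDist

def power(true_effect: float, se: float, alpha: float = 0.05) -> float:
    """Probability a z-test at level alpha rejects, if the true effect holds."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

se = 0.03 / 1.96  # SE implied by a 95% CI of half-width 0.03
print(f"power = {power(true_effect=0.02, se=se):.2f}")  # ~0.26: underpowered
```

In a setup like this, a null result mostly reflects noise rather than evidence of no effect.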

It is interesting to see this problem pop up in their work again, given that this was one of the issues with C&F’s earlier social contagion work that could be most directly understood as an error. Recall that C&F used arguments from the asymmetry of friendship ties, whereby they got a significant coefficient for friendships reported only by the ego, but not for friendships reported only by the alter. However, the difference between these two coefficients was not actually statistically significant. (See your point #4 here in comments on Lyons’s criticism.)

Now, it may still be that this study should result in a tighter posterior around zero for peer effects in self-reported same-sex attraction than before the study. Though I think most other evidence would have already suggested that any such effects would be relatively small.

I replied: I’m trying to make sense of all these things, not so much the specific work of C&F but the general question about how to learn in the context of uncertainty. See here, for example.

Eckles then followed up:

Thanks for the pointer to that paper. I had not seen Berrington and Cox (2007), which you cite, and will be suggesting it to collaborators. I think one challenge with some of this work is that journals or others need to enforce or strongly encourage some of these recommendations, since some of the people doing this work may not really care that much about the truth.

A further comment on the C&F same-sex attraction (non-)contagion paper (Brakefield et al. 2013), which I think connects to the idea that these studies need to make better use of prior information and to present results in an interpretable way:

The first results (before fitting the logistic regressions to “control” for homophily etc.) of this paper report (network auto-) correlation coefficients for binary variables. These are not very interpretable, especially when the positive outcome (e.g., same-sex attraction) is rare. For example, it might seem like r = 0.02 (95% CI [-0.01, 0.05]) is a small effect (this is how the authors describe it), but for a behavior with p = 0.047, this actually corresponds to a ~45% relative increase in the probability of the behavior when the alter is a positive case. So this is actually a quite substantial point estimate!
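A back-of-the-envelope version of this calculation, assuming for simplicity that ego and alter share the same marginal prevalence p (the real network data need not satisfy this):

```python
# Back-of-the-envelope: the relative risk implied by a phi (binary-binary)
# correlation, assuming ego and alter share the same marginal prevalence p.
# That equal-marginals assumption is a simplification of the actual data.

def rr_from_phi(phi: float, p: float) -> float:
    """P(ego = 1 | alter = 1) / P(ego = 1) implied by correlation phi."""
    # phi = (p11 - p**2) / (p * (1 - p))  =>  p11 = p**2 + phi * p * (1 - p)
    p11 = p**2 + phi * p * (1 - p)
    return (p11 / p) / p

# Point estimate and CI endpoints quoted above, with p = 0.047:
for phi in (0.02, -0.01, 0.05):
    print(f"phi = {phi:+.2f}  ->  RR = {rr_from_phi(phi, p=0.047):.2f}")
```

Under these assumptions the point estimate r = 0.02 maps to a relative risk of about 1.4 (in the ballpark of the ~45% figure; the exact number depends on how the marginals are handled), and the CI endpoints map to roughly 0.8 and 2.0.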

In fact, I think it is larger than the relative risks from many of the other C&F social contagion papers. Likewise, the confidence intervals from the logistic regression are so wide as to include many values that seem not very plausible.

Perhaps instead all of these effects could be incorporated into a multilevel model that shared information about the sizes of network auto-correlations in various traits and behaviors, thus putting the observed associations in context and appropriately “regularizing” them.

I agree 100% on this hierarchical modeling approach. The downside is, given available data, I doubt there’d be any interesting results that are statistically significant (in the sense of posterior 95% intervals that exclude zero) without imposing very strong assumptions. In a conventional analysis, these assumptions are imposed via the selection of which comparisons to focus on, but in a full multilevel analysis of all possible comparisons, the researcher might have to more carefully explain why certain comparisons are believed, a priori, to be larger than others. Or else you have to make fewer assumptions and accept the large amount of uncertainty about substantive conclusions of interest.
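For a sense of what such partial pooling does, here is a minimal empirical-Bayes sketch in the normal-normal setting. The traits, estimates, and standard errors are invented; a serious version would fit the full multilevel model (e.g., in Stan) to the raw network data:

```python
# A minimal sketch of the multilevel idea: partial pooling of several
# estimated network autocorrelations via a normal-normal model. All
# numbers below are invented for illustration.
import numpy as np

est = np.array([0.02, 0.08, 0.05, 0.01])   # hypothetical per-trait estimates
se  = np.array([0.015, 0.02, 0.01, 0.03])  # hypothetical standard errors

# Precision-weighted common mean and a crude method-of-moments estimate
# of the between-trait variance tau^2
mu_hat = np.average(est, weights=1 / se**2)
tau2 = max(0.0, np.var(est, ddof=1) - np.mean(se**2))

# Shrink each estimate toward the common mean, more when its SE is large
shrink = tau2 / (tau2 + se**2)
partial_pool = mu_hat + shrink * (est - mu_hat)
print(np.round(partial_pool, 3))
```

Estimates with large standard errors get pulled strongly toward the common mean, which is exactly the “regularizing” behavior described above: the observed associations are read in the context of the whole family of comparisons rather than one at a time.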

As I’ve written in the past, I don’t want to be too harsh on C&F because I do feel that this sort of work is important, and research has to start somewhere. I also appreciate that they made some use of a unique dataset. I think the way to move forward is to recognize that near-certainty is hard to come by but that it’s still useful to explore data and summarize what can be learned. I hope that the removal of the “p less than .05” requirement can be liberating. That we are using Christakis and Fowler’s work to illustrate these points should be taken as a tribute to the importance of the problems they are studying and the interesting research ideas they have; if the current status is that there is high posterior uncertainty about their conclusions, that’s just the way it is, no criticism of those researchers.
