No, I don’t think the Super Bowl is lowering birth weights

In a news article entitled, “Inequality might start before we’re even born,” Carolyn Johnson reports:

Another study, forthcoming in the Journal of Human Resources, analyzed birth outcomes in counties where the home team goes to the Super Bowl. . . . The researchers found that women in their first trimester whose home team played in the Super Bowl had measurably different birth outcomes than pregnant women whose teams did not go to the championship. There was a small, 4 percent increase in the probability of having a baby with low birth weight when the team won.

Garden. Of. Forking. Paths.

And this:

The magnitude of the change was tiny, but what was striking to Mansour [one of the authors of the study] was that it was detectable at all, in studying Super Bowl history from 1969 to 2004.

On the contrary, I’m not surprised at all. Given that researchers have detected ESP, and the effects of himmicanes, and power pose, and beauty and sex ratio, etc etc etc., I’m not surprised they can detect the effect of the Super Bowl. That’s the point of researcher degrees of freedom: you can detect anything.

As a special bonus, we get the difference between significant and non-significant:

The chances of having a low birth weight baby were a bit higher when the team won in an upset, suggesting that surprise may have helped fuel the effect. There was little effect when the team lost.

Really no end to the paths in this garden.

To her credit, Johnson does express some skepticism:

There’s a huge caveat to interpreting these studies. . . . That means researchers have to use natural experiments and existing data sets to explore their hypothesis. That leads to imaginative studies — like the Super Bowl one — but also means that they can’t be certain that it’s the prenatal experiences and not some other factor that explains the result.

But not nearly enough skepticism, as far as I’m concerned. To say “they can’t be certain that . . .” is to way overstate the evidence. If someone shows a blurry photo that purports to show the Loch Ness Monster, the appropriate response is not “they can’t be certain that it’s Nessie and not some other factor that explains the result.”

Sure, you can come up with a story in which the Super Bowl adds stress that increases the risk of low birth weight. Or a story in which the Super Bowl adds positive feelings that decrease that risk. Or a story about the relevance of any other sporting event, or any other publicized event, maybe a major TV special or an election or the report of shark attacks or prisoners on the loose or whatever else is happening to you this trimester. Talk is cheap, and so is “p less than .05.”

P.S. One more thing. I just noticed that the news headline was “Inequality might start before we’re even born.” What do they mean, “might”? Of course inequality starts before we’re even born. You don’t have to be George W. Bush or Edward M. Kennedy to know that! It’s fine to be concerned about inequality; no need to try to use shaky science to back up a claim that’s evident from lots of direct observations and a huge literature on social mobility.

32 Comments

  1. Anonymous says:

isn’t this quite different than the usual “power = .06” situations you complain about? the authors have a huge sample and they report a small, precisely estimated coefficient. given that there is already plenty of evidence that stress can harm fetuses, I don’t know why you place more faith in your prior (and to be honest, I’m not even sure what that prior is) than in 29 million observations. of course, garden of forking paths etc still apply, but I don’t understand why this study, which strikes me as possibly wrong but completely reasonable, and the air rage study, which is obviously junk, seem to make you equally irate.

    • Andrew says:

      Anon:

      There are only two teams in the Super Bowl each year, so for the purpose of the statistical analysis, it’s not N = 29 million, it’s N = 70 or so. So, yes, forking paths is a huge problem here. If you want to say that stress can harm fetuses, that’s fine. The real question to me here is whether this new study adds anything to that earlier conclusion, and I’d say no, it doesn’t.

      • Bruce says:

        I think you are too harsh. While we shouldn’t take the p values at face value, it does look like there is a consistent theoretical story here – and it goes via alcohol consumption. With such a result, wouldn’t it be unethical to not publish?

        Sure we shouldn’t rely on these results without more evidence about pregnant women’s alcohol consumption during sporting events. But we have to start somewhere.

        • Anoneuoid says:

          Imagine you didn’t know that the volume of a box can be calculated from width*height*length. We could use the “finding statistically significant (ie nonzero) correlations” method to try to figure it out.

          With large enough sample size you would find the volume is correlated with the color, the shape, the material, the season the boxes were measured, whether or not they were damaged, the proportion of Christians that had touched it, the weight, the geographic location, the exposure to various toxins, the sound that results from tapping it, etc.

          As pointed out by Paul Meehl long ago, why should we care about such a “discovery”?

          “These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan [1] and Nunnally [8]. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother’s education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or non-monotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10^-6)."

          Meehl, Paul E. (1967). "Theory-Testing in Psychology and Physics: A Methodological Paradox". Philosophy of Science. 34 (2): 103–115. http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf

          It is a method that leads to insane waste of time, money, and effort.
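          Meehl’s “crud factor” is easy to reproduce. A minimal sketch (mine, not Meehl’s data; the box traits and the 0.2 “common influence” loading are made up for illustration): give two otherwise unrelated traits one weak shared influence and test the correlation at large N:

```python
import math
import random

random.seed(7)

# Hypothetical box data: volume and some unrelated-seeming trait (say,
# the season a box was measured) share one weak common influence, so
# their true correlation is tiny (about 0.04) but not exactly zero.
n = 50_000
common = [random.gauss(0, 1) for _ in range(n)]
volume = [0.2 * c + random.gauss(0, 1) for c in common]
trait = [0.2 * c + random.gauss(0, 1) for c in common]

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(volume, trait)
z = math.atanh(r) * math.sqrt(n - 3)  # Fisher z-test of r = 0
print(round(r, 3), round(z, 1))  # r is trivially small, |z| far past 1.96
```

          The correlation is negligible in practical terms, but with fifty thousand boxes the test rejects “no correlation” decisively, which is exactly why such a rejection tells us almost nothing.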

          • Bruce says:

            I agree with all this (and accept Andrew’s point about story-telling below). However, if the results are consistent with other well-established theories (in this case, the impact of alcohol on birth weight) they should be given more weight.

            • Anoneuoid says:

              >”if the results are consistent with other well-established theories (in this case, the impact of alcohol on birth weight) they should be given more weight.”

              Imagine they got the opposite result: would it make you want to abandon these other well-established theories?

              • Bruce says:

                No, because there are too many other contingent things. But does this matter? If we had evidence that Bowl attendance was not associated with birth weight AND additional evidence that pregnant women’s alcohol consumption went up during Bowl games, then yes, this would count against these theories.

            • Anoneuoid says:

              >”No, because there are too many other contingent things. But does this matter? If we had evidence that Bowl attendance was not associated with birth weight AND additional evidence that pregnant women’s alcohol consumption went up during Bowl games, then yes, this would count against these theories.”

              I was confused earlier about your use of the term “well established theory” to describe: “the impact of alcohol on birth weight”. That sounds like an observation, not a theory… I was ignoring it for now but don’t think I can continue with my point without addressing it. To me, a theory/hypothesis would be something like “Alcohol in the mother’s blood (mBAC) slows cell division in the fetus (D) according to a relationship with functional form: D = D0 + s*mBAC”.

              Can you explain how you would define theory/hypothesis and distinguish it from observation/data?

              • Anoneuoid says:

                Actually, instead of simplifying theory/hypothesis to mean the same thing for this discussion (which I was doing in order to focus on the difference between that and the observations), how about this:

                Theory: Some set of premises or postulates that are taken as given, along with a chain of deductive logic that leads us to the various hypotheses that the theory entails.

                Hypothesis: A prediction that has been deduced from a theory. Note that more than one theory can entail the same hypothesis and each theory will entail a number of hypotheses.

                Model: An implementation of a hypothesis, i.e. the actual code or math used to generate a curve that is compared to the observations.

                Observations: A collection of data, ie records or measurements, and/or the empirical relationships between a set of these.

                As far as I can tell none of these terms really have standard definitions. This is another cause of the “tower of babel” effect, in addition to the technical definitions of terms that may differ by field and from colloquial use. Perhaps it is best if we all remember to agree upon these definitions before discussions like ours begin.

        • Andrew says:

          Bruce:

          I have no problem with the researchers coming up with this idea and publishing it, along with their raw data and their analysis. The problem I see here is storytelling getting out of control. Rather than wending a path through their data, choosing various subsets and interactions and picking out statistically significant comparisons, I think they’d do better to present and analyze all their comparisons.

          For example, when I see a passage such as, “The chances of having a low birth weight baby were a bit higher when the team won in an upset, suggesting that surprise may have helped fuel the effect. There was little effect when the team lost,” I think about the many many other similar comparisons, equally supported by theory, that could be made.

          Whether or not there is an underlying signal here, the noise level is high, and the forking-path approach seems to me like a recipe for just telling stories from random numbers.

    • psyoskeptic says:

      Simulate t-tests with no effect with 20 samples in each group and 20,000,000 samples in each group and see what proportion come out significant.
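      A minimal version of that simulation (not from the original comment; scaled down to 1,000 per group instead of 20,000,000 so it runs in seconds, with the two-sided .05 critical values hard-coded so only the standard library is needed):

```python
import random
import statistics

random.seed(1)

def t_stat(a, b):
    """Pooled two-sample t statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return ((statistics.fmean(a) - statistics.fmean(b))
            / (sp2 * (1 / na + 1 / nb)) ** 0.5)

def false_positive_rate(n, crit, sims=1000):
    """Fraction of no-effect comparisons where |t| passes the .05 cutoff."""
    hits = 0
    for _ in range(sims):
        a = [random.gauss(0, 1) for _ in range(n)]  # both groups pure noise
        b = [random.gauss(0, 1) for _ in range(n)]
        if abs(t_stat(a, b)) > crit:
            hits += 1
    return hits / sims

# two-sided .05 critical values: t(df=38) ~ 2.024, t(df=1998) ~ 1.961
small = false_positive_rate(20, 2.024)
large = false_positive_rate(1_000, 1.961)
print(small, large)  # both hover around 0.05, regardless of n
```

      Under the null, the proportion of “significant” results stays near .05 whatever the sample size.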

      • Anoneuoid says:

        That would depend on the distribution you sample from…

        • psyoskeptic says:

          If you meet the assumptions of the test, no it doesn’t.

          • Curious says:

            That is an assumption of the test.

            • psyoskeptic says:

              This is getting way too long, but if you’re going to simulate a test then you draw data from a distribution that meets the assumptions of the test. My original response was about how N affects the proportion of tests that pass the .05 threshold when there’s no effect, and it was addressing a statement regarding N. The responses here are non sequiturs without some relevant argument about how the distribution was supposed to be different with the different Ns. But that was not what Anonymous’s original statement was about. His/her statement implied that it doesn’t matter much that there is multiple testing when you have a very high N.

              (Also, if you’re only changing N between the two simulations it probably doesn’t even matter what the distribution actually is within some very large range.)

              • Anoneuoid says:

                I’ve done such simulations many times, and I really don’t follow what results you expect to see. The sample size will determine the estimated effect size if you subset only significant results, but the proportion should be the same if the model is really correct… That said, a short tour through the R source code led to some interesting findings:

                1) The t.test() code utilizes the pt function:

                pval <- pt(tstat, df)
                […]
                pval <- pt(tstat, df, lower.tail = FALSE)
                […]
                pval <- 2 * pt(-abs(tstat), df)
                […]
                if (n > 4e5) { /*-- Fixme(?): test should depend on `n' AND `x' ! */
                /* Approx. from Abramowitz & Stegun 26.7.8 (p.949) */
                val = 1./(4.*n);
                return pnorm(x*(1. - val)/sqrt(1. + x*x*2.*val), 0.0, 1.0,
                lower_tail, log_p);
                }
                #endif

                nx = 1 + (x/n)*x;
                /* FIXME: This test is probably losing rather than gaining precision,
                * now that pbeta(*, log_p = TRUE) is much better.
                * Note however that a version of this test *is* needed for x*x > D_MAX */
                if(nx > 1e100) { /* x*x > 1e100 * n */
                /* Danger of underflow. So use Abramowitz & Stegun 26.5.4
                pbeta(z, a, b) ~ z^a(1-z)^b / aB(a,b) ~ z^a / aB(a,b),
                with z = 1/nx, a = n/2, b= 1/2 :
                */
                double lval;
                lval = -0.5*n*(2*log(fabs(x)) - log(n))
                - lbeta(0.5*n, 0.5) - log(0.5*n);
                val = log_p ? lval : exp(lval);
                } else {
                val = (n > x * x)
                ? pbeta (x * x / (n + x * x), 0.5, n / 2., /*lower_tail*/0, log_p)
                : pbeta (1. / nx, n / 2., 0.5, /*lower_tail*/1, log_p);
                }
                https://svn.r-project.org/R/trunk/src/nmath/pt.c

                3a) The above pt function utilizes either pnorm() or pbeta() to do the grunt work. The pnorm() function includes all sorts of arbitrary constants:

                const static double a[5] = {
                2.2352520354606839287,
                161.02823106855587881,
                1067.6894854603709582,
                18154.981253343561249,
                0.065682337918207449113
                };
                […]
                const static double q[5] = {
                1.28426009614491121,
                0.468238212480865118,
                0.0659881378689285515,
                0.00378239633202758244,
                7.29751555083966205e-5
                };
                https://svn.r-project.org/R/trunk/src/nmath/pnorm.c

                3b) The pbeta function calls a bratio function, which likewise uses an approximation that requires various constants and switches between algorithms based on the input arguments, which depend on n (not shown… this is getting too long).

                static double c0 = .0833333333333333;
                static double c1 = -.00277777777760991;
                static double c2 = 7.9365066682539e-4;
                static double c3 = -5.9520293135187e-4;
                static double c4 = 8.37308034031215e-4;
                static double c5 = -.00165322962780713;

                https://svn.r-project.org/R/trunk/src/nmath/pbeta.c
                https://svn.r-project.org/R/trunk/src/nmath/toms708.c

                So, it is possible that whatever you observed is due to numerical error and that software often handles large datasets differently than small datasets.

              • Anoneuoid says:

                Well, as I hoped against but suspected, some of that got eaten along the way. The gist of #2 was:

                2) The algorithm used by the pt() function is determined by the size of the input (cutoff at 400,000). So, at least if you are using R, your example is literally comparing apples to oranges:
                https://svn.r-project.org/R/trunk/src/nmath/pt.c

              • Anoneuoid says:

                There is actually a lot that went wrong with that post on its way. This should be more readable:

                http://pastebin.com/bWfswhx4

              • Daniel Lakeland says:

                Anoneuoid:

                I think psyoskeptic is saying that if the assumptions are correct, you’ll get p < 0.05 exactly 0.05 of the time you run it…

                but so what?

              • psyoskeptic says:

                Daniel Lakeland, yes, that’s what I was implying. The commenter Anonymous was implying that forking paths may matter less with a large N. When it comes to finding a “significant” result when there is no effect N is pretty irrelevant because they happen with equal frequency regardless of N.

                And, as Andrew noted, GoFP is vast here.

              • Daniel Lakeland says:

                psyoskeptic:

                when there is exactly zero effect, (basically never) then p < 0.05 happens 0.05 of the time. When there is an effect which is not exactly 0, then p < 0.05 happens every time as N increases without bound. The only "good" thing is that your estimate of the effect size will come out tiny, so you at least don't over-state your effect the way you do with small N.

                In a Bayesian analysis this all goes away, since you don't dichotomize at a magic 0.05 threshold, you just estimate the effect size directly and then you have an estimated size and an uncertainty in that size together.
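                A quick sketch of that point (mine, not from the thread; it uses a known-variance z statistic instead of a t-test, and a made-up true effect of 0.01 sd, to keep it short):

```python
import random

random.seed(3)

def mean(x):
    return sum(x) / len(x)

# A true effect of 0.01 sd: lost in the noise at n = 20 per group,
# but reliably "significant" at n = 1,000,000 per group.
for n in (20, 1_000_000):
    a = [random.gauss(0.00, 1) for _ in range(n)]
    b = [random.gauss(0.01, 1) for _ in range(n)]
    diff = mean(b) - mean(a)
    z = diff / (2 / n) ** 0.5  # known-variance two-sample z statistic
    print(n, round(diff, 4), round(z, 2))
```

                At the large n the z statistic lands comfortably past 1.96 while the estimated difference stays around 0.01, a tiny effect measured precisely rather than a large one.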

  2. Teddy says:

    “Talk is cheap, so is p < .05" — can I get a t-shirt with that phrase? Haha… another great article, Andrew!

  3. I am surprised the article didn’t bring up this detail from the study:

    “We report estimates of the effect of first-trimester Super Bowl exposure by mother’s educational attainment, race, ethnicity and marital status in Table 6. Despite the fact that NFL fans are more likely to have graduated from high school than non-fans (Jones 2001; Scarborough Research 2004), there is little evidence that first-trimester exposure to the Super Bowl is associated with LBW among children whose mothers had at least four years of secondary schooling (Panel A). In contrast, among children whose mothers did not complete high school, first-trimester exposure to a Super Bowl win is associated with a 0.36 percentage-point increase in the probability of LBW. One potential explanation for this pattern of results is that women who did not complete high school are more likely to engage in risky behaviors such as drinking or smoking.”

    Wait–so the effects are seen only among a defined subset of the women?

    They go on to look at the effects of race (a more complex picture) and then proceed to marital status:

    “Finally, the sample is divided based on the mother’s marital status in Panel C of Table 6. We find that first-trimester Super Bowl exposure is associated with an increased probability of LBW only for single mothers. First-trimester exposure to a Super Bowl win is associated with a 0.67 percentage-point increase in the probability of LBW among single women, but we find no such evidence for married mothers.”

    “Little evidence” of LBW for mothers with four years of high school, and no evidence for married mothers. This narrows the affected population considerably.

  4. zbicyclist says:

    I would think that after the publicity about the silly Superbowl-stock market link (which has its own forking paths, due to some teams like the Steelers formerly being in the NFC), these sorts of analyses would be immediately flagged for a penalty.

    Some years ago, Mike Royko had a series about the “ex-Cub factor” being a jinx to teams in the World Series with more ex-Cubs. But these were comedy columns, and some of Royko’s later columns on the ex-Cub factor clearly show him wandering around the Garden of Forking Paths to justify the hypothesis.

    I see, though, that Royko popularized the term, but the original author was Ron Berler. (important to get those citations right when posting on this blog!) https://en.wikipedia.org/wiki/Ex-Cubs_Factor

  5. Lauren says:

    Andrew,

    I have an idea for a research project and I’m concerned you’d say something like this post about it, so I want to avoid the problems. Here’s a hypothetical for how this could go. They say in the abstract of that paper that “Previous studies have explored the effect of earthquakes and terrorist attacks on birth outcomes.” Suppose I came up with the idea to look at the effect of earthquakes on birth weight. So I get the earthquake location data, get the birth weight data, and see if there’s a drop in birth weight after the earthquakes in zip codes with magnitude-4+ earthquakes. Suppose when I look at the graph I see (a sparkline) ~~v~~ and my credible interval gives some small effect, like .25 lbs +/- .1. So it looks like an effect. Probably if it had looked like ~~~~~ I would do some other analysis, and maybe I’d find an effect there instead.

    Certainly there could well be an effect of earthquakes on birth weight, and indeed it might be something worth knowing about as a public health matter. Is this ‘proposed’ research worthless if not prespecified? Suppose I really want to know the effect. How can I look into these questions without incurring forking-paths criticism? For example, do I need to use an informative prior based on other birth weight effects?

  6. Dale Lehman says:

    How many economists does it take to detect a tiny effect by taking a long forked path (I can make jokes at economists, since I am one)? This example also speaks to the perverse set of incentives in academia. Publish in peer reviewed publications, get tenure, keep the school accredited.

    • dcase says:

      I wholeheartedly agree. From the perspective of an assistant professor currently on the tenure track, the incentives could not be more perverse in the field right now. It is just so difficult to get studies published in the applied micro journals these days without an (expensive) RCT or one of these “creative” natural experiments that seem patently absurd after a little examination. Talking to many of my fellow juniors at conferences, it seems they (we) spend most of our time trying to get lucky and find the right instrument or rule for an RD, no matter the application, and shoehorn it into an economic story. Seems bad for the long-term success and survival of the field.

  7. Lauren says:

    When you say ‘oh no, it’s some famous journal,’ do you really think these famous journals publish worse work than other journals, or (e.g.) do we just hear about the bad articles in these journals and not the bad articles in other journals?

    • Andrew says:

      Lauren:

      The Journal of Human Resources is not a famous journal. Usually we hear about a bad article not directly because it was published in a famous journal but because it was picked up by the news media. I don’t know how this particular article from an obscure journal was noticed by the Washington Post writer here. Maybe there was a really effective press release? Or maybe the reporter did a Google search on the topic and came up with this article.

  8. Kit Joyce says:

    What a missed opportunity. If they had identified the cause and effect in the opposite direction we could sell a method for improving your team’s chances in the superbowl! After all, there are reliable ways to decrease birth weight. :)

    Sometimes I get annoyed by how often people thoughtlessly repeat the phrase “correlation does not imply causation” and thus implicitly condone the idea that statistics can never address questions of causal influence, but then there are reports like these and I think maybe we don’t say it enough…

