
How to reduce Type M errors in exploratory research?

Miao Yu writes:

Recently, I found this piece [a news article by Janet Pelley, Sulfur dioxide pollution tied to degraded sperm quality, published in Chemical & Engineering News] and the original paper [Inverse Association between Ambient Sulfur Dioxide Exposure and Semen Quality in Wuhan, China, by Yuewei Liu, published in Environmental Science & Technology].

Air pollution research is hot, especially for China. However, I think they might be using a tricky way to do those studies. Typically, they collect many samples and analyze many pollutants in those samples, then find a connection between one contaminant and an environmental factor or disease by checking the correlations among all compound-factor/disease pairs, as in this study. I have to say this template has been used a lot in environmental studies. It's just a game of permutation and combination between thousands of exposure factors (which we can now detect in a single run) and thousands of public concerns.

Since such an observational study is hard to truly randomize, I am uncomfortable with those results. It seems one could test thousands of associations between compounds and environmental factors/diseases, publish the ones with “significant” differences, and shout about them in the popular press. Of course we can control for age, gender, smoking, BMI, etc. However, it's hard to control for unknown unknowns rather than just adjusting for the known parts. Furthermore, Type M errors also lurk behind those studies.

Are there suggestions for avoiding those kinds of errors or studies?

My reply: To start with, I’m not going to address this particular study, which happens to cost $40 to download. The effects of air pollution are an important topic but I think that for most of you there will be more interest in the general issue of how to learn from open-ended, exploratory studies.

So. The easiest answer to the “what to do in general” question is to simply separate the exploration and inference: use the exploratory data, in concert with theory, to come up with some hypotheses and then test them in a new preregistered study.

But I don’t like that answer because we want some answers now. I’m not saying we want certainty now, or near-certainty, or statistical significance—but we’d like to give our best estimates from the data we have; we don’t want to be using estimates that are clearly biased.

So what should be done? Here are some suggestions:

1. Forget about statistical significance. Publish all your results and don’t select the results that exceeded some threshold for special treatment. If you’re looking at associations between many different predictors and many different outcomes, show the correlations or coefficients in a big table.

2. Partially pool the estimates toward zero. This can be done using informative priors or with multilevel modeling. You can’t get selection bias down to 0 (the type M error depends on the unknown parameter value) but you can at least reduce it.

3. Control for age, gender, smoking, BMI, etc (which I assume was done in the above-linked study). Adjusting for these predictors will not fix all your problems but, again, it seems like it’s going in the right direction.
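To make point 1 concrete, here is a minimal sketch (variable names and data are invented for illustration) that reports every predictor-outcome correlation in one big table, with no significance thresholding:

```python
# Sketch: publish all predictor-outcome correlations, not just the
# ones that cross a threshold. Pollutant/outcome names are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 200
pollutants = {"SO2": rng.normal(size=n), "NO2": rng.normal(size=n),
              "PM2.5": rng.normal(size=n)}
outcomes = {"sperm_count": rng.normal(size=n), "motility": rng.normal(size=n)}

# One big table of correlations -- report all of it.
table = {(p, o): float(np.corrcoef(x, y)[0, 1])
         for p, x in pollutants.items()
         for o, y in outcomes.items()}

for (p, o), r in sorted(table.items()):
    print(f"{p:6s} vs {o:12s}  r = {r:+.3f}")
```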
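And a minimal sketch of the partial pooling in point 2, using empirical-Bayes shrinkage under a normal hierarchical model with the prior mean set to zero (the estimates and standard errors below are invented; a full multilevel model would estimate the between-effect variance more carefully):

```python
# Sketch of partial pooling toward zero: estimates y_j ~ N(theta_j, se_j^2),
# effects theta_j ~ N(0, tau^2). Numbers are invented for illustration.
import numpy as np

est = np.array([0.9, -0.2, 0.4, 1.5, -0.8, 0.1])   # raw estimates
se  = np.array([0.5,  0.5, 0.6, 0.7,  0.5, 0.4])   # their standard errors

# Method-of-moments estimate of the between-effect variance tau^2,
# given a prior mean of zero.
tau2 = max(float(np.mean(est**2) - np.mean(se**2)), 0.0)

# Posterior mean shrinks each estimate toward zero; noisier estimates
# (larger se) get pulled in more.
shrunk = est * tau2 / (tau2 + se**2)

for raw, s in zip(est, shrunk):
    print(f"raw {raw:+.2f} -> pooled {s:+.2f}")
```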
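Point 3 is ordinary regression adjustment. A sketch with simulated data, where age confounds the exposure-outcome association and the true effect is set to -0.3 (all names and coefficients here are invented):

```python
# Sketch: adjust an exposure-outcome association for measured
# confounders (age, BMI) via ordinary least squares. Simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 500
age = rng.normal(40, 10, n)
bmi = rng.normal(25, 3, n)
exposure = 0.05 * age + rng.normal(size=n)      # exposure varies with age
outcome = -0.3 * exposure + 0.1 * age + rng.normal(size=n)

# Unadjusted: regress outcome on exposure alone.
X0 = np.column_stack([np.ones(n), exposure])
b0 = np.linalg.lstsq(X0, outcome, rcond=None)[0]

# Adjusted: include the measured confounders as additional columns.
X1 = np.column_stack([np.ones(n), exposure, age, bmi])
b1 = np.linalg.lstsq(X1, outcome, rcond=None)[0]

print(f"unadjusted slope {b0[1]:+.2f}, adjusted slope {b1[1]:+.2f}")
```

Adjusting recovers something close to the true -0.3, while the unadjusted slope is badly biased; of course, as the post says, this only handles the confounders you measured.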

The point is that whether we think of our goal as getting the best estimates to make decisions right now, or whether we’re just treating this as an exploratory analysis—either way we want to learn as much from the data as possible and correct for biases as much as we can.


  1. And what might a frequentist do for point 2? They could try selective inference.
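(One concrete frequentist tool in this spirit—my example, not necessarily what the commenter had in mind—is the Benjamini-Hochberg procedure for controlling the false discovery rate across many tests. A minimal sketch with made-up p-values:)

```python
# Benjamini-Hochberg: control the expected fraction of false
# discoveries at level q across m tests. p-values are invented.
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """Return a boolean mask of discoveries at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m      # q*k/m for k = 1..m
    below = p[order] <= thresh
    # Reject the k smallest p-values, where k is the largest index
    # whose sorted p-value sits under its threshold.
    k = int(np.max(np.nonzero(below)[0])) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals, q=0.10))   # first six are discoveries
```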

  2. The cost of articles is why we should all insist on open access for everything we publish. If we don’t, anyone not inside the network of a rich university will have to pay for content or pirate it. Furthermore, our universities will keep paying ridiculous access fees to the journals that are available inside the university. It’s not going to happen until tenure committees stop valuing journal identity as a key signal or until the rest of the scholarly world catches up with machine learning, where the most prestigious journal, JMLR, has been open access ever since the editorial board quit their paywalled journal and started it.

    I think most academics don’t notice the cost because most of the journals are available inside places like Columbia and the cost isn’t a line-item on every professor’s paycheck or grant.

    • zbicyclist says:

      Furthermore, in many areas the research is being funded by our tax dollars.

      • Anoneuoid says:

        Even further more… Maybe I am strange but >99% of the papers I “read” are actually just skimmed for some nugget of info I am looking for, and anyway most of the stuff found within is questionable, half-complete, or just wrong.

        There is no way these papers are worth $40 to me on average; I would honestly just stop reading them if forced to actually pay that. The actual value is more like a hundredth of a cent, so every 10k papers I access would cost a dollar.

    • Anonymous says:

      “The cost of articles is why we should all insist on open access for everything we publish”

      Open access is great, but perhaps it is important to 1) not unnecessarily pay publishers too much and/or 2) not create an even bigger problem:

      I reason pre-prints might simply be the best solution (open access, virtually no cost, everyone can try to contribute to science, etc.). The next step usually taken with pre-prints (i.e., publishing the pre-print in an “official” journal) could simply be omitted.

      I reason this possible transition to pre-print servers only could be jump-started by senior tenured professors who, I reason, have nothing to lose by publishing only their pre-prints. I reason that when enough important and influential tenured professors start doing this, everyone will be focused on the pre-print server for “the latest” tenured professors’ pre-prints, and everyone else will soon follow.

      Possibly think about what this could do to science, and how it could possibly (help) solve a lot of the bad stuff in today’s science that the current journal system is probably largely responsible for. Tenured professors could even join forces, make their decision publicly known, etc., by signing some sort of declaration like the DORA declaration.

      • Martha (smith) says:

        Good points, but one thing is overlooked: there needs to be some attention to archiving preprints, so that good ones that don’t get a lot of immediate attention don’t vanish prematurely. There is also the question of funding the preprint servers (and ideally archives).

      • Anonymous says:

        “Computer science was born of a rebellious, hacker culture, a spirit that lives on in the publishing culture of artificial intelligence (AI). The burgeoning field is increasingly turning to conference publications and free, open-review websites while shunning traditional outlets—sentiments dramatically expressed in a growing boycott of a high-profile AI journal. As of 15 May, about 3000 people, mostly academic computer scientists, had signed a petition promising not to submit, review, or edit articles for Nature Machine Intelligence (NMI), a new journal from the publisher Springer Nature set to begin publication in January 2019. The petition, signed by many prominent researchers in AI, is more than just a call for open access. It decries not only closed-access, subscription-based journals such as NMI, but also author-fee publications: open-access journals that are free to read but require researchers to pay to publish. Instead the signatories call for more “zero-cost” open-access journals.”

  3. Zad Chow says:

    I like what Ioannidis suggested,

    “We suggest that instead of presenting a single effect and p-value for an association of interest, one can present the median effect, the median p-value, the relative hazard ratio (RHR) and the relative p-value (RP) across all possible analyses using different adjustments in addition to the pattern of the vibration of effects (VoE), whether a Janus effect exists, and whether there are clear multimodality (and if so, due to what).”
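A rough sketch of that kind of multiverse-style summary: fit the same exposure-outcome regression under every combination of candidate adjustment variables and report the spread of the estimates (all names and data here are simulated; a real vibration-of-effects analysis would also track p-values and hazard ratios):

```python
# Sketch of the "vibration of effects": refit under every subset of
# candidate adjustment variables and summarize the spread. Simulated data.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(2)
n = 400
covs = {"age": rng.normal(size=n), "bmi": rng.normal(size=n),
        "smoking": rng.normal(size=n)}
exposure = rng.normal(size=n) + 0.4 * covs["age"]
outcome = 0.2 * exposure + 0.3 * covs["age"] + rng.normal(size=n)

effects = []
names = list(covs)
for r in range(len(names) + 1):
    for subset in combinations(names, r):       # all 2^3 adjustment sets
        X = np.column_stack([np.ones(n), exposure] +
                            [covs[c] for c in subset])
        beta = np.linalg.lstsq(X, outcome, rcond=None)[0]
        effects.append(float(beta[1]))          # coefficient on exposure

effects = np.array(effects)
print(f"median effect {np.median(effects):+.3f}, "
      f"range [{effects.min():+.3f}, {effects.max():+.3f}] "
      f"across {len(effects)} model specifications")
```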

  4. Clyde Schechter says:

    “I reason this possible transition to pre-print servers only can be jump-started by senior tenured professors who, i reason, have nothing to lose by simply only publishing their pre-prints.”

    Maybe that would work in some fields. The kind of work I do would not be possible without external funding. Tenure committees are not the only bodies that look at which journals you publish in. Grant reviewers look at this as well, to assess your qualifications to carry out the research in your grant proposals. It’s probably not a great measure, but that’s what they use.

    I have found, now that I’m pretty senior in my field, that the pressure to publish in prestigious journals is actually greater than when I was junior and only worried about climbing the ladder. I suppose if I didn’t care about my work itself and were content to just keep my job title, I could just do pre-prints. But my grant funding would dry up, and my ability to actually do research would grind to a halt.

    • Anonymous says:

      1) “Grant reviewers look at this as well, to assess your qualifications to carry out the research in your grant proposals.”

      Grant reviewers could alternatively decide to provide funds only to those researchers (e.g. tenured professors) who chose to no longer be part of the giant scientific publishing scam.

      An additional benefit for the grant agency would be that the work they helped make possible would now be visible to, and used by, many more people. Why wouldn’t a grant agency want that?

      2) “I have found, now that I’m pretty senior in my field, that the pressure to publish in prestigious journals is actually greater than when I was junior and only worried about climbing the ladder.”

      I think I feel sorry for you.

      Like I expressed before, I thought tenured scientists like you might have sacrificed their (scientific) souls for the common good: jumping through the possibly anti-scientific incentive hoops of the current system just to be able to do some proper science, or to change the bad incentive structure, once you got tenure. But I now understand this may not even be possible, because you still have “pressure to publish” once you receive tenure.

      I thought I could thank you for all your sacrifices, as possibly shown on your CV, which possibly lists things like:

      1) a probably impossible/non-representative ratio of “positive” vs. “negative” published results
      2) a possible long list of low-quality papers with your name on them
      3) an indication of how much tax-payer money you might have wasted via listed received grants
      4) (the possible shame of) listing individual awards received for simply doing your job (which is probably made possible by the tax-payer, and almost definitely by other researchers)
      5) a publication + reviewer history which can be interpreted as having played an active part in what can be considered to be the giant academic publication scam

      But now I understand that all these possible sacrifices may have been in vain, because apparently nothing changes when tenure is reached. I thought tenure was something invented to allow academic freedom for the common good, but perhaps that’s not the case…

      So, who’s gonna tell the grant agencies that looking at which journals you publish in is “probably not a great measure”?

      And should we get rid of tenure, as I now understand it may no longer serve the purpose for which it was invented?

      • Keith O'Rourke says:

        > Why wouldn’t a grant agency want that?
        Why would they want to enable others to evaluate their actual performance?

        Up until 5 or 10 years ago, they were very content to have volunteer academics peer-review the applications, award grants mostly on the basis of that, and then bask in the glory of the resulting publications, especially in high-impact journals. The university publicity department’s puffing up of the research was an additional bonus. It’s a similar business model to that of the academic publishers you link to.

        Now they are more pressured to have policies for data sharing, data quality, reproducibility, etc. These are all risks to them: funding researchers who did not enter data correctly, keep adequate records, or report their analyses accurately—poor practices that these policies can expose—can no longer be claimed as successes, as it was in the past. There is now an increasing downside to being seen not doing these things, so they’re better off doing as little as they can get away with.

        • Anonymous says:

          “Why would they want to enable others to evaluate their actual performance. “

          I wouldn’t be surprised if there are grant agencies who just want a specific result/message published and could therefore give grants to a certain type of “willing” researcher to perform a trick for them. Perhaps these researchers get a nice piece of the pie in doing so.

          I am more interested in grant agencies, and (tenured) researchers, who value the actual science, whatever the results. I reason they must be able to come up with a solution, and/or make it very clear to the general public which results, and whom, to trust and why. I reason that could be helpful in fixing what may be wrong with today’s science.

          Without further information, I still think these tenured researchers might be in the perfect position to help improve science, for instance by only posting pre-prints, trying to collaborate with grant agencies who share their values and views, publicly declaring this a la DORA, etc. Perhaps tenured researchers can do more for science by trying to fix what is wrong instead of still “playing the game” (even when that means not personally applying for grants).
