Evelyn Lamb adds to the conversation that Jeff Leek and I had a few months ago. It’s a topic that’s worth returning to, in light of our continuing discussions regarding the crisis of criticism in science.

## Bad Statistics: Ignore or Call Out?

How much bad research is meaningful over time? And over what time periods? What is the cost of that waste versus the cost of squeezing it out?

As a partial aside and for amusement, I’m reminded of a picture that hung in the Detroit Institute of Arts when I was young. It showed 2 figures but one had been removed, leaving a completely white space, because the conservators of that time had decided one was by Leonardo and the other wasn’t. The capper is the docent told us more recent research meant they were no longer sure which figure was by Leonardo.

We sometimes uncover bad research and the decisions made out of that only when it’s too late. A picture. A bunch of lives lost. But that research exists in its time, not in the abstract, so often what is revealed is the cost of prevailing cultural, ethnic and racial prejudices. I have no idea how one gets outside one’s time. The bad social science research, particularly the stuff that’s low-powered and/or finds ways to statistical significance, comes out of attitudes of the moment, sometimes as reinforcement, sometimes as reaction against. Doesn’t all research reflect what we think we know and how we see the problems of today?

Jonathan: “How much bad research is meaningful over time? And over what time periods? What is the cost of that waste versus the cost of squeezing it out?”

Well, over the past 50 years we have replaced fats with sugar and refined carbs on the basis of some pretty crappy research. The evidence is mounting that this may be contributing to the obesity epidemic, child diabetes, etc. So yes, crappy research matters, at least as much as good research, because at present bad research counts as good research unless you want to engage in an uphill struggle to prove otherwise. So, if research matters, then crappy research just rides along.

Doesn’t thinking that your research might be scrutinized by others also make some people more careful about what they submit?

I think this is an important point. For me the ideal is for about half of everything that is published in the social sciences to consist of the replication and criticism of the other half. This would certainly make it in one sense “riskier” to publish, and would therefore incentivise people to be more “careful”. But the worry then is that creativity would somehow be stifled. One person’s carefulness is another’s cautiousness, and another’s outright inhibition.

My hope is that people would start publishing not in *fear* of criticism but in *anticipation* of it, sometimes *eager* anticipation. If we published not in the vain hope that people will believe us, but in the sincere hope of being shown where we are wrong, then the whole culture would change. Papers would be no less (and perhaps more) creative, but they wouldn’t be fodder for the press. The press would always wait for the criticism and cover the arguments, not just the latest provocative result.

Thomas:

Related is the idea of *open-mindedness*. Critics such as ourselves are sometimes told to be open on *general issues* such as the possibility that ESP exists, that ovulation influences political attitudes, etc. That is fine. But researchers on these topics often forget to be open to the possibility that their *specific claims*, such as a particular type of ESP, or a particular type of ovulation effect, simply represent patterns in a particular sample and not anything in the general population. One of my frustrations in dealing with various controversies over scientific criticism (I prefer to call it a crisis of criticism, rather than a crisis of replication) is that the people who do the studies that were criticized typically either refuse to engage with critics (search this blog for Gertler) or refuse to consider the possibility that their findings are simply artifacts of their particular sample (search this blog for Beall and Tracy). I respect that people want to defend their published claims but I deplore the lack of open-mindedness by which they don’t even want to admit the possibility that their results are artifactual.

> crisis of criticism

Agree, preferable in that replication attempts (or replicability assessments) are just part of scientific criticism.

I always got the impression that “replication” is the more rhetorically viable angle to push. “Criticism” sounds so, well, critical. I really do like the idea of being open minded about being wrong. It reminds me of a (for some perhaps surprising) rant about relativism in the New Age community by Terence McKenna.

A replication that replicates poor methods doesn’t really contribute to the progress of science — if the methods of the original are poor, then criticism of them is more important.

Yes, I forgot to finish my thought. I agree with Keith that replication is just part of criticism and that what we need, in general, is more criticism, not merely replication. So I’m thrilled that Andrew wants to call it a crisis of criticism.

“Related is the idea of open-mindedness. Critics such as ourselves are sometimes told to be open on general issues such as the possibility that ESP exists,”

IOW, we’re sometimes told to close our minds to the mountain of evidence which has already ruled out some absurd, pseudoscientific hypothesis so that a molehill of evidence from some cargo cult experiment(s) can be taken seriously.

Can the decision to ignore or call out be dependent on the research? I understand this is fairly arbitrary, but how much time should one spend criticizing a claim that fertile women wear pink vs. a claim about how mammography affects breast cancer survival?

One situation that I think warrants a call-out is when a methodological paper of questionable quality contributes to what I call The Game of Telephone Effect* (henceforth TGOTE), where something is misunderstood by someone, then that is further misunderstood by the next person, etc. Here’s a call-out in this category:

Schmider et al (2010), Is It Really Robust? Reinvestigating the Robustness of ANOVA Against Violations of the Normal Distribution Assumption, Methodology Vol. 6(4):147–151

The abstract: “Empirical evidence to the robustness of the analysis of variance (ANOVA) concerning violation of the normality assumption is presented by means of Monte Carlo methods. High-quality samples underlying normally, rectangularly, and exponentially distributed basic populations are created by drawing samples which consist of random numbers from respective generators, checking their goodness of fit, and allowing only the best 10% to take part in the investigation. A one-way fixed-effect design with three groups of 25 values each is chosen. Effect sizes are implemented in the samples and varied over a broad range. Comparing the outcomes of the ANOVA calculations for the different types of distributions, gives reason to regard the ANOVA as robust. Both, the empirical type I error [alpha] and the empirical type II error [beta] remain constant under violation. Moreover, regression analysis identifies the factor ‘‘type of distribution’’ as not significant in explanation of the ANOVA results.”

The abstract gives me the impression that the authors have a poor understanding of the concepts of random sample, sampling distribution, Monte Carlo simulation, and type I and type II errors. Nothing in the body of the article dispelled that impression.

Google Scholar shows about 60 papers that have cited Schmider et al. I looked at the first 24 listed for ones that had an easily accessed copy of the paper (I was too lazy to be willing to go through the library website). Here are excerpts from six of them (I have omitted references for the citing papers from which I quote, in the interests of minimizing embarrassment of the authors).

1. “… the use of ANOVA on Likert-scale data and without the assumption of normality of the distributions of the data to be analysed is controversial. In general, researchers claim that only non-parametric statistics should be used on Likert-scale data and when the normality assumption is violated. Vallejo et al. [20], instead, found that the Repeated Measures ANOVA was robust toward the violation of normality assumption. Simulation results of Schmider et al. [21] confirm also this observation, since they found in their Monte Carlo study that the empirical Types I and II errors in ANOVA were not affected by the violation of assumptions.”

(Comment: Schmider et al don’t mention repeated measures ANOVA – so it’s a two-step TGOTE).

2. “All omnibus tests were performed using a mixed-effects analysis of variance (ANOVA) …. To the best of our knowledge, there is no nonparametric alternative to a mixed-effects ANOVA. Therefore, departures from the underlying parametric assumptions of normality, homoscedasticity, and sphericity were dealt with using standard approaches. In particular, although transformations could be used to minimize data skewness and to improve normality and homoscedasticity, such transformations are not typically applied to quantitative EEG data. Further skewness (<3) and departures from homoscedasticity for all data were found to be mild. ANOVA methods are robust to such mild departures.15”

(Comment: Reference 15 is Schmider et al, who didn’t mention mixed-effects ANOVA, so again, two-step TGOTE)

3. “Some researchers … maintain that only nonparametric statistics should be used on Likert-scale data. Other researchers … consider the application of parametric statistics on Likert-scale data to be acceptable. Vallejo, Fernández, Tuero, and Livacic-Rojas (2010) found that the repeated measures ANOVA was robust toward the violation of normality assumption. This observation concurs with the simulation results of Schmider, Ziegler, Danay, Beyer, and Bühner (2010), who found in their Monte Carlo study that the empirical Type I and Type II errors in ANOVA were not affected by the violation of assumptions …. In the lack of powerful alternative non-parametric procedures for analysing repeated measures of Likert-scale data, this study has resorted to the use of repeated measures ANOVA, noting the robustness of ANOVA, and making use of a large sample size (n=82 college and university students).”

(Same comment as for item #1)

4. “In this sample, this variable was not normally distributed but its Kurtosis (2.14) and Skewness (1.44) were within acceptable limits for using parametric analysis (Schmider, Ziegler, Danay, Beyer, & Buhner, 2010).”

(This one looks like only one-step TGOTE!).

5. “A one-factorial repeated-measures (rm) ANOVA (Huynh-Feldt corrected) including the factor stimulus condition (visual-only, auditory-only, auditory + visual) was calculated. Note that the normality assumption regarding the FMS peak scores was violated. However, ANOVAs have been shown to be robust against violations of normality when group sizes are not small (see [24]).”

(Same comment as for item #1)

6. “Prior to the analyses, normality was checked and homogeneity of variances was assessed using Levene’s F-test. All tooth-width data were then log transformed. Although this could not correct for all of the variance heteroscedasticity (and in a few cases, deviations from normality), one-way ANOVA is considered to be robust and less sensitive to type I or type II errors (Schmider et al. 2010).”

——

* Named after the kids’ game Telephone, where you sit in a circle, someone whispers something in their neighbor’s ear, the recipient whispers it to the next person, and by the time it gets around the circle, the difference from the original usually causes a good laugh.

Can you expound on what you find problematic with the original research? It seems like maybe they’ve rediscovered the central limit theorem. I’m a little confused by “effect sizes are implemented in the samples”: does this mean they just added a constant effect size to the otherwise random samples, or does it mean that they forced the sample mean to equal some value?

I think the real issue these researchers were trying to address was that scientists want to do this:

1) Take a Sample of various categories of items.

2) See if there are largish differences on average between the different categories.

But what they’re taught is this:

assuming you are sampling from a normally distributed population, assuming you are estimating the standard deviation of the underlying population from the sample sd, assuming your data are IID from the same normal population, assuming… then cranking out certain algebra called ANOVA will give you a number which will be small if (2) is true.

And the obvious and correct THINKING response to all of that is: “how the hell should I know if my underlying population meets all those distributional criteria? The whole reason I’m doing statistics is because I only have 25 samples!”. Very often however, people just go ahead and unthinkingly run the software and get a p value and then off they go.

So, these researchers ask questions like “well, how much does the shape of the distribution really matter?” and then go out and run a bunch of random number generation stuff (in a perhaps clumsy and naive way), and find out that it doesn’t much matter what the histogram of the “underlying population” looks like, ANOVA math gives them pretty much the same results no matter whether they use normal, rectangular, or exponential underlying populations, so they use this as justification to hold-off the statistical policemen so they can continue to run the one software tool they know how to use (or their field knows how to read).
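For readers who want to see what such a check looks like when every simulated sample is kept, here is a minimal sketch. It is not the procedure from Schmider et al.; it assumes three groups of 25 draws from one common population (so any rejection is a type I error), a nominal level of 0.05, and hypothetical helper names (`f_statistic`, `p_value_three_groups`, `type1_rate`). With exactly three groups the numerator degrees of freedom equal 2, and the F tail probability has the closed form used below, so no statistics library is needed.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def p_value_three_groups(f, df_within):
    """Upper tail of F(2, df_within); this closed form holds only for numerator df = 2."""
    return (1.0 + 2.0 * f / df_within) ** (-df_within / 2.0)


def type1_rate(draw, n_sims=4000, n_per_group=25, alpha=0.05, seed=1):
    """Share of simulated null datasets (all groups from one population) that ANOVA rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        groups = [[draw(rng) for _ in range(n_per_group)] for _ in range(3)]
        f = f_statistic(groups)
        if p_value_three_groups(f, 3 * n_per_group - 3) < alpha:
            rejections += 1
    return rejections / n_sims


# Illustrative populations: normal, rectangular (uniform), and exponential.
POPULATIONS = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "rectangular": lambda rng: rng.random(),
    "exponential": lambda rng: rng.expovariate(1.0),
}

if __name__ == "__main__":
    for name, draw in POPULATIONS.items():
        print(name, type1_rate(draw))
```

The point of the sketch is only that the rejection rate is computed over all simulated datasets, which is what makes it an estimate of the type I error rate for that population shape and sample size.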

In Classical statistics, those assumptions are statements about the histogram of a large and unmeasured population of objects “out there” in the world somewhere. So the job of the classical statistician is to check that they either hold or can be relaxed.

In Bayesian analysis, violation of the frequency-distribution correspondence isn’t an issue since the statement “Y is independently normally distributed around mu with standard deviation sigma” means “All I know is that none of the Y will be very many multiples of sigma away from mu”, not “repeated samples from Y will have a histogram that is approximately normal with mean mu and sd=sigma”

So if you are calling out the original researchers because they probably did a sort of befuddled job of simulating different distributions, and come to the conclusion that the shape of the underlying distribution doesn’t matter, then I’m tempted to say that the real issue here is that people want Bayesian answers, and they’ve only been taught Frequentist thinking and Frequentist algorithms/software.

Of course, if they want Bayesian answers, the better route would be to use Bayesian models.

Daniel,

The parts of the abstract that initially caught my attention are:

“High-quality samples underlying normally, rectangularly, and exponentially distributed basic populations are created by drawing samples which consist of random numbers from respective generators, checking their goodness of fit, and allowing only the best 10% to take part in the investigation,”

and

“Both, the empirical type I error [alpha] and the empirical type II error [beta] remain constant under violation.”

Here is a quote from the article that elaborates on what the first quote above is saying:

(p. 148): “However, a sample is not always a good representative for the basic population it is drawn from. For this reason, the samples taken from the respective distributions were analyzed prior to conducting the actual ANOVAs. Aim of this was to extract those samples that are prototypical for their basic population. To determine prototypicality the goodness of fit was computed with the Kolmogorov-Smirnov test (K-S test). Only the 10% best samples were chosen by applying a linear algorithm, that picks out, by exhaustive comparison, the 5,000 samples with the best fit.”

The upshot: They claimed to be using a Monte Carlo simulation to estimate the type I and type II errors, but they only used the 10% of “best” random samples to calculate their estimates, rather than using all random samples drawn.
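To make the concern concrete, here is a rough sketch of what that kind of filtering could do to an error-rate estimate. It rests on assumptions the quoted passage does not settle: the filter is applied to each group of 25 separately, the population is standard normal, fit is measured by the K-S distance to the known population CDF, and datasets are formed from three retained groups. All function names are hypothetical, and this is not a reconstruction of the authors' code.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_distance(sample, cdf):
    """Kolmogorov-Smirnov distance between the sample's ECDF and a fully specified CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


def anova_rejects(groups, alpha=0.05):
    """One-way ANOVA for exactly three groups, using the closed-form F(2, d) tail."""
    n_total = sum(len(g) for g in groups)
    grand = sum(map(sum, groups)) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    d = n_total - 3
    f = (ssb / 2.0) / (ssw / d)
    return (1.0 + 2.0 * f / d) ** (-d / 2.0) < alpha


def rejection_rate(pool):
    """Form datasets from consecutive triples of groups and run ANOVA on each."""
    datasets = [pool[i:i + 3] for i in range(0, len(pool) - 2, 3)]
    return sum(anova_rejects(d) for d in datasets) / len(datasets)


def compare(n_groups=9000, n_per_group=25, keep=0.10, seed=7):
    """Return (rate using every group, rate using only the best-fitting share)."""
    rng = random.Random(seed)
    pool = [[rng.gauss(0.0, 1.0) for _ in range(n_per_group)] for _ in range(n_groups)]
    ranked = sorted(pool, key=lambda g: ks_distance(g, normal_cdf))
    best = ranked[: int(keep * n_groups)]
    rng.shuffle(best)  # drop the ordering by fit before forming datasets
    return rejection_rate(pool), rejection_rate(best)


if __name__ == "__main__":
    full, filtered = compare()
    print("all samples:", full, "best 10% only:", filtered)
```

Under these assumptions one would expect the filtered rate to fall below the unfiltered one: groups that track the population CDF closely also have means close to the population mean, so the between-group variation that drives the F statistic shrinks. Whatever the filtered number is, it describes the behavior of ANOVA on hand-picked samples, not on random ones.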

Martha:

Seems like a clever but naive attempt at variance reduction or “smart” Monte Carlo.

Doubt they worked it through correctly, but they are likely trying to reduce noise from draws from x distribution that are not very x distribution like. (Barnard often made this conditional robustness argument.)

I did find your post interesting – neat idea to trace things through. One thing that might be happening that is not “Telephone” like is authors trying to find references that seem to justify what they already know they want to do, where they are willing to purposely suggest misinterpretations if they think it will get by the journal reviewer.

Unfortunately I know a few statisticians who do this sort of thing…

Keith,

I suppose there might be some who purposefully suggest misinterpretations, but I’m more inclined to think that ignorance is the more likely cause – thinking, for example, that if something applies to one type of ANOVA, then it applies to another type. The rote type of teaching that is so common is likely to lead to such ignorance.

But for a statistician to try to slip something by a reviewer — that’s another matter.

Although I agree with you that the wording makes it seem like they are probably a bit naive, and so I’m not sure whether they thought it through, what they’re doing makes some sense in the following context:

An author has some categorized samples of numerical data and wants to do ANOVA type calculations to see if there are differences in outcome between categories. They have been told that normality of the outcome variable is an important assumption, so they run a normality test, and find a non-normal outcome distribution! Oh no, what to do?

Then they wonder “what is a better distribution to describe my data?”, so they run a couple of goodness of fit tests for alternative distributions. Suppose the goodness of fit test for rectangular or exponential passes with flying colors. If that is the context in which you’re going to use the technique, then what matters in the robustness testing is not that the RNG actually is a rectangular/exponential distribution, but mainly that the *particular sample* passes a test for such a distribution. Although a lot more than 10% of samples will pass the test, taking the 10% “most representative” isn’t entirely a bogus idea.

The more bogus idea is that data “really do” come from a rectangular, normal, exponential, etc type distribution. Only those people studying random number generator algorithms get to make such assumptions in any kind of strong form :-)

In any case, I think your point still stands about methodology papers. If you’re going to publish a methodology paper, you should have thought through the methodology pretty well. Another example of this kind of thing that you may find humorous or depressing, depending on your mood, is “Tai’s Method for finding the area under a curve”

http://andrewgelman.com/2010/12/03/reinventing_the/

One of the several problems I have with the Schmider et al paper is that they perform 24 × 50,000 Kolmogorov-Smirnov tests in the process of filtering which simulated data sets to use (plus five more when analyzing the results). That sounds like a whopping big multiple inference problem to me.

Daniel:

A mis-specified data model can be a big problem in Bayes!

For some interesting work looking at limiting what data one conditions on to help address this see

Bayesian Restricted Likelihood Methods. John R. Lewis, Steven N. MacEachern, Yoonkyung Lee

http://www.stat.osu.edu/~yklee/mss/tr878.pdf

Keith: no doubt a mis-specified data model can be a big problem! I didn’t mean to say that it wasn’t.

The key in Bayesian analysis though is that your data model should capture important true facts about the data, not that it should pass a Kolmogorov-Smirnov test for a particular distribution, or whatever. In other words, the data model needs to capture things like the order of magnitude of the actual data, the order of magnitude of the variation in data, the general relative importance of outliers. The specific thing that it really needs is to cover the entirety of the data possibilities, and make the actual samples have high likelihood when the parameters have their true values.

For an example see my blog here:

http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/

In the example I generate 100 “orange juice bottle volumes” with a uniform RNG between 1.4 and 2 (liters). I then use a data model which is just exponential with an unknown mean, but the mean has a prior which is uniform on [0,2.25] (a Stan default prior). The exponential distribution is nothing like the uniform(1.4,2.0) distribution that actually generates the data in terms of shape, or in terms of the extent of the tails. For example it predicts the most likely volume to be 0 (!!)

Yet, I still get good inference on the overall total volume of the 100 bottle crate. Why? Frequentist intuition tells us not very much about that. Why should we get good inference when we assume the data come from a distribution that is different in significant ways from the actual generating distribution?

My intuition on this comes from information theory: when the mu value is near the average of the true random generator, the sampling model’s typical set will cover the typical set of the actual generator, and the likelihood will be relatively larger than for other mu values. That’s pretty much all the Bayesian machinery needs to get good inference on the population average.

If I were trying to estimate some other aspect of the data, like say the fraction of the bottles that fall below 1.8 liters, the exponential distribution might well be a bad model.
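A small grid-approximation sketch of this example follows. It is not the Stan model from the linked post. It assumes 100 volumes drawn from a uniform generator on [1.4, 2.0], an exponential data model with mean mu, and a flat prior for mu on (0, 2.25]; the function names, grid size, and seed are illustrative choices.

```python
import math
import random


def grid_posterior(data, upper=2.25, n_grid=2000):
    """Posterior over the exponential mean mu on a grid, with a flat prior on (0, upper]."""
    n, total = len(data), sum(data)
    grid = [upper * (i + 1) / n_grid for i in range(n_grid)]
    # Exponential log-likelihood with mean mu: -n*log(mu) - sum(x)/mu
    log_post = [-n * math.log(mu) - total / mu for mu in grid]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # subtract the peak for stability
    norm = sum(weights)
    return grid, [w / norm for w in weights]


def summarize(grid, probs, level=0.95):
    """Posterior mean and an equal-tailed credible interval read off the grid."""
    mean = sum(mu * p for mu, p in zip(grid, probs))
    lo_target, hi_target = (1 - level) / 2, 1 - (1 - level) / 2
    cum, lo, hi = 0.0, grid[0], grid[-1]
    lo_found = False
    for mu, p in zip(grid, probs):
        cum += p
        if not lo_found and cum >= lo_target:
            lo, lo_found = mu, True
        if cum >= hi_target:
            hi = mu
            break
    return mean, lo, hi


if __name__ == "__main__":
    rng = random.Random(0)
    volumes = [rng.uniform(1.4, 2.0) for _ in range(100)]  # data from the "true" generator
    grid, probs = grid_posterior(volumes)
    mean, lo, hi = summarize(grid, probs)
    print("posterior mean of mu:", mean, "interval:", (lo, hi))
    print("implied crate total:", 100 * mean)
```

Multiplying the posterior for mu by 100 gives the implied posterior for the crate total. The sketch only looks at the average; it says nothing about how well the exponential model would do for other quantities.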

The “goodness of fit” in a Frequentist K-S test or Anderson-Darling test type sense is not necessarily relevant.

Keith, I swear, you are like the gatekeeper to an enormous wealth of good literature. I’ve only read the first page of that paper but already I like the ideas they’re discussing, and they’re more or less working along the lines of my example: “make the likelihood sensitive to the questions of interest.”

Anyone can read the posts under this “question” name to see how negative they have all been. Why does this persona exist?