Frank Harrell, author of an influential book on regression modeling and currently both a biostatistics professor and a statistician at the Food and Drug Administration, has started a blog. He sums up “some of his personal philosophy of statistics” here:

Statistics needs to be fully integrated into research; experimental design is all important

Don’t be afraid of using modern methods

Preserve all the information in the data; Avoid categorizing continuous variables and predicted values at all costs

Don’t assume that anything operates linearly

Account for model uncertainty and avoid it when possible by using subject matter knowledge

Use the bootstrap routinely

Make the sample size a random variable when possible

Use Bayesian methods whenever possible

Use excellent graphics, liberally

To be trustworthy research must be reproducible

All data manipulation and statistical analysis must be reproducible (one ramification being that I advise against the use of point and click software in most cases)
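One item on this list, avoiding the categorization of continuous variables, is easy to demonstrate numerically. Here is a minimal Python sketch on simulated data (not Harrell's own example): a median split of a continuous predictor throws away information, visibly weakening its correlation with the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A continuous predictor and an outcome linearly related to it.
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)

# Dichotomize the predictor at its median ("high" vs "low").
x_split = (x > np.median(x)).astype(float)

r_full = np.corrcoef(x, y)[0, 1]
r_split = np.corrcoef(x_split, y)[0, 1]

print(f"correlation using the continuous predictor: {r_full:.2f}")
print(f"correlation after a median split:           {r_split:.2f}")
```

For normally distributed predictors, the median split shrinks the correlation by a known factor of about 0.8, and the loss is worse if you then compare only the extremes.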

Harrell continues:

Statistics has multiple challenges today, which I [Harrell] break down into three major sources:

1. Statistics has been and continues to be taught in a traditional way, leading to statisticians believing that our historical approach to estimation, prediction, and inference was good enough.

2. Statisticians do not receive sufficient training in computer science and computational methods, too often leaving those areas to others who get so good at dealing with vast quantities of data that they assume they can be self-sufficient in statistical analysis and not seek involvement of statisticians. Many persons who analyze data do not have sufficient training in statistics.

3. Subject matter experts (e.g., clinical researchers and epidemiologists) try to avoid statistical complexity by “dumbing down” the problem using dichotomization, and statisticians, always trying to be helpful, fail to argue the case that dichotomization of continuous or ordinal variables is almost never an appropriate way to view or analyze data. Statisticians in general do not sufficiently involve themselves in measurement issues.

And if you love cats, Frank’s blog links to this fantastic presentation on data science by Frank’s colleague and PyMC3 developer Chris Fonnesbeck. The short video leads off with a story about cat statistics as picked up in

New York Times: https://www.youtube.com/watch?v=TGGGDpb04Yc

From the video: “Data can be actively hostile to answering your question sometimes”.

The message for me is that the power is not in the data, it’s in the model.

Another very important point he mentions at the end, and almost as an aside, is that there is an inherent danger of misuse of complicated models and that this danger has become greater with the easy availability of one-line functions in software packages. It was this easy availability of nlme in 2000 that allowed me to make an ass of myself when starting to do data analysis with hierarchical linear models.

That’s a good point. I see models that are getting overly complicated while, at the same time, being fit to scarce, low-quality data.

Model building has become increasingly glamorous at the expense of the boring but essential data collection and measurement process.

The other disturbing trend I see is people glamorizing more complicated models over simpler ones. Whatever happened to Occam’s razor?!

It’s almost as if model complexity has become a surrogate for model quality.

Rahul:

I hate Occam’s razor. You can search this blog for lots of discussion on the point.

I feel we need more of a shift to focus on model output than model structure. I rarely care if the model is simple or complex so long as it *predicts* accurately.

To the extent that I like Occam’s Razor, it is for selecting among a class of models that do equally well predictively. In that case I’d prefer the simplest model of that class, ceteris paribus.

What I hate is when people regard a model as superior simply because structurally it is more elegant or more complex etc.

I don’t care whether a particular model formulation uses all the data available or not, just so long as it gets the job done. Somehow people seem to make the jump that just because a model is “rich” or allows you to use all the data out there, ergo it must axiomatically be a better model.

I’ll channel Daniel Lakeland (http://andrewgelman.com/2017/01/11/the-prior-fully-comprehended-last-put-first-checked-the-least/#comment-391603) here:

“If you’ve got decent data and more importantly a likelihood that actually makes fairly specific scientifically based predictions, then you typically find out a lot and the prior matters less. If you’ve got a poorly specified model and badly thought out experiment, or haphazardly collected observational data, there’s not much you can do to make your likelihood inform your posterior.

The part that people tend to get wrong, in my opinion, isn’t the prior, it’s the model of the world that informs their likelihood, and/or the design of the data collection.”

There’s a lot of lip service to statistics.

e.g. You can’t publish a paper in some technical areas without adding a section on statistical analysis. So an external statistician is brought in. Often the guy has no domain knowledge nor the inclination to learn. Or perhaps he’s not given enough time to do a thorough job or his advice is not heeded. Ergo his analysis is mostly ad hoc and simplistic. But it is needed because you can’t publish the paper without going through the ritual.

Over time readers realize the bogus-ness of the statistical parts of evidence. So they learn to ignore those parts. As a result this whole cottage industry emerges that thrives on producing boilerplate statistical analysis that consumers do not pay much heed to anyways. And that propagates a vicious cycle where the statistical inputs keep getting worse.

Therein lies the source of much of the point and click, canned statistics that is mostly agnostic to the nuance of the actual problem at hand.

No one wants good analysis. The goal is to get the cheapest, no-hassle statistics out there to get your paper through the doorkeepers of publication.

Rahul:

> No one wants good analysis. The goal is to get the cheapest, no-hassle statistics out there to get your paper through the doorkeepers of publication.

It’s worse – I would say no one (at least any academic) thinks they can afford good analysis (unless that kind of work can be claimed as one of their contributions).

As I learned when I worked many years with the same clinical researchers (who I believe were well-motivated leaders in their fields), they felt that if the analysis satisfied the journal reviewers and was not known to be deficient for the study in hand, they could not afford to spend any more time and energy on analysis issues, because that energy had to be put into their clinical research careers.

Given this, there are very important roles for Frank Harrells in medical schools – apparently Vanderbilt University School of Medicine had that insight.

“It’s worse – I would say no one (at least any academic) thinks they can afford good analysis (unless that kind of work can be claimed as one of their contributions).”

Have to agree, sadly. An ambiguous but honest analysis will get you rejected, and you will be relegated to the dustbin of academia.

People often attempt to generalize about statisticians. As in all professions, the “bad” statisticians will outnumber the “good.” In fact, I have learned more interesting approaches from non-statisticians: Richard McElreath and John Kruschke, for example.

Well, to be fair, they learned those “more interesting approaches” from good statisticians: Andrew Gelman and Frank Harrell, for example.

Thanks for the pointer!

Tip: Always be suspicious of normalized / adjusted metrics on plot axes that aren’t the typical / intuitive norm in the field.

Yes, there’s sometimes good reason, but more often they are only done to make a convoluted point or to sell the author’s viewpoint. And the author has rejected other metrics which he will never show you because they did *not* support his conclusions.

Rahul, can you elaborate on this with an example?

See for example https://www.ncbi.nlm.nih.gov/pubmed/26025022 where we show that the original authors only obtained a significant result by using a nonstandard transformation of their physiological measure (square root instead of the standard logarithmic transformation).

Nick: They didn’t find a significant result, though. The two-tailed p for the effect that you are talking about was .06.

The authors reported p = .03 one-tailed, p = .06 two-tailed. Reporting the one-tailed p and the two-tailed p like this is not appropriate. It is also not appropriate to switch to a one-tailed p because the two-tailed p did not make the p < .05 cutoff criterion. The choice of a one-tailed test or a two-tailed test must be made before the analysis is done.

It's interesting that this preliminary test was one-tailed, but their complete model seems to have used all two-tailed tests.
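The arithmetic behind this complaint is worth spelling out: for a symmetric test statistic, the one-tailed p is exactly half the two-tailed p, so switching tails after seeing the data halves the hurdle. A quick check in Python (the z value below is invented for illustration, not taken from the paper):

```python
from statistics import NormalDist

# For a symmetric null distribution (here, standard normal), the
# one-tailed p is exactly half the two-tailed p when the observed
# effect is in the predicted direction.
z = 1.88  # illustrative test statistic, not from the paper

p_two = 2 * (1 - NormalDist().cdf(abs(z)))
p_one = 1 - NormalDist().cdf(z)

print(f"two-tailed p: {p_two:.3f}")
print(f"one-tailed p: {p_one:.3f}")
```

With this illustrative z, the two-tailed p sits just above .05 while the one-tailed p sits just below it, which is exactly the situation that invites the inappropriate after-the-fact switch.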

Wait, the developer of Hmisc and the coiner of the term “dynamite plot”?

Arrgh, another blog to monitor!

Well done Dr. Harrell

@Rahul:

… well, you’ve made several good points here bemoaning the current state of statistical analysis in much of published research.

Do you see any specific solutions to these problems you discern?

General statistically based research seems to be rapidly evolving into a Tower of Babel… too overwhelming & confusing for even statistics professionals to sort out. And the general public, as you say, is starting to ignore the truth-teller pretensions & output of the professional researcher hordes.

You’ve come to the right place! This blog’s a great source of advice on both these issues: how to do statistics and how to communicate with the public and other scientists. And Frank Harrell, the subject of this particular post, is working to educate statisticians and the FDA.

The bigger issue is that neither the general public nor the scientific community seems willing to embrace the fundamental sources of uncertainty: measurement error, sampling error, and model misspecification. Chris Fonnesbeck, whose talk was linked from Frank’s blog, touches on these subjects and where people go wrong in naively applying methods.

Not trying to be glib here but what does *embracing* uncertainty actually mean?

Is that the same as trying to understand or quantify the various sources of uncertainty?

I take “embracing uncertainty” to mean accepting that there will always be uncertainty in any statistical analysis*; and trying to take sources of uncertainty (e.g. those identified by Bob, plus others that might be relevant in a particular situation) into account in statistical analysis, communication of statistical results, and application of those results.

* I often use the maxim, “If it involves statistical inference, it involves uncertainty.”

Thanks @Martha.

The reason I don’t like it is that the “embrace uncertainty” maxim sounds to me as if it partly glorifies uncertainty, or at least condones it as something that’s inevitable and beyond researcher control.

Uncertainty should be a reducible evil. Not all studies are made equal & the better ones can have lower uncertainty.

“Reduce uncertainty” sounds like a better guideline than “Embrace uncertainty”.

Uncertainty is inevitable! It is not totally within researcher control — but the researcher can (potentially) reduce it — e.g., by good measurement; good research design; analysis appropriate to the problem, context, research design, etc. And one needs to take it into account in stating conclusions and applying results.

“Reduce uncertainty” is indeed a good maxim, but “embracing uncertainty” as inevitable is the first step to reducing it and otherwise dealing with it.

I think that reducing measurement error is a different issue from “embracing uncertainty”. There is inherent variability in data, and all potential sources of variance need to be considered when analyzing data. For example, in psycholinguistics and psychology, we often gather data with crossed random effects by subjects and by items (Doug Bates and colleagues discuss this in several papers, e.g., here). For that kind of data, you don’t want to collapse all repeated measures from one subject into one data point by aggregating and assume that you have only one data point per subject; this is how we used to do it before nlme and lme4 came along.

Basically, pre-LMMs, what we did was take the raw data and aggregate over subjects to get one data point per subject per condition, even though we have multiple data points per subject. This is called a by-subjects analysis. Then we do the same with items: aggregate over items to get one data point per item per condition. Now you can do a t-test or ANOVA by subjects and by items. In the old days (and people still do this), one would then report F1 and F2 (or t1 and t2) scores, where F1 is the by-subjects ANOVA and F2 is the by-items ANOVA.

Linear mixed models made it unnecessary to pretend that there are not two simultaneous sources of variance, i.e., two variance components. That’s an example of embracing uncertainty. Doing it the F1, F2 way can give you an overly enthusiastic result. The fields of psychology and psycholinguistics are littered with experiments in top journals where the by-subjects analysis was “statistically significant” but the by-items analysis was not, and the researcher claimed that the effect was therefore present. Redoing the analysis using linear mixed models (with all sources of variance considered) leads to no effect being found, even using the statistical significance filter in this incorrect way. (Many researchers, even today, think that as long as the by-subjects analysis has p less than 0.05 we are good to go.)

At least that’s how I understand what it means to “embrace uncertainty”.
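The by-subjects/by-items aggregation described above can be sketched on simulated data. Everything below (sample sizes, effect size, variance components) is invented for illustration; a linear mixed model would instead estimate both variance components in a single fit rather than running two separate aggregated tests.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subj, n_item = 30, 20
effect = 20.0  # true condition effect in ms (an assumed value)

# Crossed random effects: each subject sees each item in each condition.
subj_re = rng.normal(0, 50, n_subj)[:, None, None]   # by-subject variability
item_re = rng.normal(0, 50, n_item)[None, :, None]   # by-item variability
noise   = rng.normal(0, 100, (n_subj, n_item, 2))
cond    = np.array([0.0, effect])[None, None, :]

rt = 400 + subj_re + item_re + cond + noise          # simulated reading times

def paired_t(a, b):
    """Paired t statistic for two matched samples."""
    d = a - b
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# F1-style analysis: aggregate over items, one mean per subject per condition.
by_subj = rt.mean(axis=1)          # shape (n_subj, 2)
t1 = paired_t(by_subj[:, 1], by_subj[:, 0])

# F2-style analysis: aggregate over subjects, one mean per item per condition.
by_item = rt.mean(axis=0)          # shape (n_item, 2)
t2 = paired_t(by_item[:, 1], by_item[:, 0])

print(f"by-subjects t1 = {t1:.2f}, by-items t2 = {t2:.2f}")
```

Each aggregated analysis ignores one of the two variance components, which is exactly the problem the comment describes: the two tests can disagree, and neither models the data-generating process fully.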

Shravan:

Is there any way to tease apart the “inherent variability in data” from the uncertainty that comes due to crappy measurement or sub-optimal modeling? How?

Yes, Rahul, there is.

Regarding crappy measurement: Such is the obsession with the statistical significance filter that even experts in a field with decades of work behind them cannot tell you what a plausible range of parameter values is. For example, try asking a psycholinguist what a plausible range of effects is for the subject vs. object relative clause processing difference (say in reading studies, measuring the effect at a particular region in the sentence). This is the drosophila of psycholinguistics, and yet nobody will be able to tell you what a plausible effect size is. Similarly, if you ask a psycholinguist what a plausible range of standard deviations is for a particular experimental paradigm involving reading, they will have no idea.

When we do data analysis, we need to look at our parameter estimates to get a sense of whether the values are implausibly large. Almost nobody does that, yet it’s a basic check one can carry out. If you see a variance component that is unusually large given your prior knowledge, you need to do some detective work to figure out what went wrong. Often a particular experimental item has some serious problem (a typo, for example), or a subject is spending 40 seconds reading a word (probably checking his email or texting mid-experiment).

Re sub-optimal modeling: I again have one word: posterior predictive checks. And other sensible checks too.
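A posterior predictive check, in its most minimal form, can be sketched as follows. This is a toy normal model with known standard deviation, a flat prior on the mean, and invented data containing one outlier; real applications would use the actual fitted model's posterior.

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed data (invented for illustration): mostly standard normal,
# plus one outlier the model cannot easily accommodate.
y = np.concatenate([rng.normal(0, 1, 50), [6.0]])
n = len(y)

# Toy normal model with known sigma = 1 and a flat prior on the mean:
# the posterior for mu is then Normal(ybar, 1/sqrt(n)).
mu_draws = rng.normal(y.mean(), 1 / np.sqrt(n), 1000)

# Posterior predictive check: simulate replicated datasets and compare
# a test statistic (here the sample maximum) with the observed value.
y_rep_max = np.array([rng.normal(mu, 1, n).max() for mu in mu_draws])
ppp = (y_rep_max >= y.max()).mean()  # posterior predictive p-value

print(f"observed max = {y.max():.1f}, PPC p-value = {ppp:.3f}")
```

An extreme posterior predictive p-value here flags that the model cannot reproduce the observed maximum, pointing at exactly the kind of model misfit (or data problem) the comment recommends checking for.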

Perhaps .. acknowledge uncertainty.

The way I’ve said it is, “To learn about the human world, we should accept uncertainty and embrace variation.”

Yes, that’s exactly what I meant. As a recent example, Fränzi Korner-Nievergelt was telling me about a study she was doing for some outfit in Europe to try to measure bird casualties as a result of wind power farms. She said it was an incredibly difficult measurement problem and her conclusions had very high uncertainty (I don’t remember the units, but it was something like 1000 +/- 1000). She said they then reported the mean. Another example was interpreting the recent American presidential election polls that Andrew blogged about.

Sure, you can try to take better measurements, but that’s usually not an option. When you can take really good measurements, you often don’t need statistics to try to quantify your uncertainty, because there’s not much uncertainty.

@kupper

Here’s another underlying problem I discern:

Researchers are chewing on problems bigger than will fit in their jaws: for the sheer glamour of it, and because their computers & software allow them to, they attack problems they are actually under-equipped to study.

The rise of the internet, cheap computational power & free software has brought sophisticated analysis into everyone’s reach, but measurement has not gotten equally cheap, accessible, or high quality. If you don’t have a decent budget or timeline, the part of the study that suffers worst is the measurement, in quantity as well as quality.

Ergo we get this upsurge in studies with amazingly sophisticated methods & rich, complex models built on the foundations of very meager or crappy data.

At the end of the analysis, if the researcher honestly translated the uncertainty of his measurements into corresponding uncertainty of his results, then the results would look laughably vague and useless. So obviously, there’s an incentive to massively under-report or ignore uncertainty.

@Rahul: I agree that the poor quality of measurement is a big problem. Another is that people confuse two meanings of “sophisticated analysis”. One meaning is “using sophisticated methods” and another is “putting good thinking into which methods to use.” The frequent lack of the latter is a big problem.

@kupper

Recommending solutions is hard. Some ideas:

(a) Applied journals should employ a paid, in-house statistician to vet all analysis.

(b) Select one or more methodological expert reviewers for papers in addition to the usual subject experts. e.g. For the stat. analysis, bio-assays etc. If Lancet papers go to physicians they are rarely likely to have the skill or inclination to critically examine the stat. parts.

(c) Funding grants should keep aside say 20% of every grant for post hoc, independent paid audits or replications.

(d) Raise the minimum threshold for publication. Don’t publish the under-powered, low sample size, measurement-on-a-shoestring-budget crap that has weak conclusions anyways.

(e) Explore ways to monetarily compensate reviewers for their time. Prioritize quality of publication over quantity.

(f) When shit hits the fan release names of the reviewers & editors responsible for signing off on a paper. Bring accountability into the system.

(g) Refuse to publish or even review a paper till all data & code has been posted online.

(h) Force authors to clearly label exploratory studies as such. Mandatory pre-registration of all non-exploratory studies.

One of the cruder reactions to Frank Harrell starting a blog comes from a social psychologist, who tweets: “Now we have smartphones, whenever a statistician takes longer on the toilet due to bowel problems, we get a new blog criticizing p-values.”