Stethoscope as weapon of mass distraction

Macartan Humphreys sent me a Shiny app demonstrating you can get statistical significance from just about any pattern of random numbers. I posted it, and, in response, commenter Rahul wrote:

It sure is a cute demo but it’s a bit like insinuating a doctor’s stethoscope is useless by demonstrating ten ways in which it can be mis-used.

And, indeed, if the doctor’s stethoscope were being used to routinely publish spurious findings in the leading research journal in psychology; if leading figures in psychology such as Ted-talk star (and respected researcher) Daniel Gilbert were to vociferously defend ridiculous claims about fecundity and clothing which are based on nothing but a highly malleable theory and that doctor’s stethoscope; if the New York Times and various other news outlets report a claim about ESP which is based, again, on nothing but that stethoscope; if Steven Levitt, the leading voice in academic social science, lends his platform to endorse an innumerate claim about beauty and sex ratios, a claim that, you guessed it, is based on no evidence beyond what came from that stethoscope; if well-respected political scientist Larry Bartels posts on the leading political science research blog to promote a study on the effects of subliminal smiley-faces as “punching a big hole in democratic theory,” based on that stethoscope; if world-famous psychology researcher Daniel Kahneman uses the stethoscope to insist that “you have no choice but to accept that the major conclusions of these studies [on social priming] are true”; if brilliant economist James Heckman points to the stethoscope as evidence of large effects of early-childhood intervention programs (ironically, in doing so ignoring selection, the problem that made him famous); if all that is happening, then, yes, I’ll continue to explore what’s going wrong here.

You could of course take the quite reasonable position that Macartan Humphreys, Brian Nosek, etc., and I are wrong and that Daryl Bem, Daniel Gilbert, Steven Levitt, Satoshi Kanazawa, etc. are right. Fair enough—ultimately you have to make your own judgment. The point is, this is a live issue. It’s not just that the “stethoscope” could be misused; it’s that (in the judgment of myself and many others whom I respect) the stethoscope is being misused, all the time.

Fundamentally, the problem’s not (just) with p-values or with any particular technique. I see the problem as being with the entire hypothesis testing framework, with the idea that we learn by rejecting straw-man (or, as Dave Krantz charmingly said once, “straw-person”) null hypotheses, and with the binary true/false attitude which leads people to believe that, once a result is judged statistically significant (by any standard) and published in a good journal, it deserves the presumption of belief.

59 thoughts on “Stethoscope as weapon of mass distraction”

  1. Why not start by changing the term “statistically significant” to “statistically detectable”? Google says that significant means “notable, noteworthy, worthy of attention, remarkable, important, of importance, of consequence”, which is not the case for most statistically “significant” results.

      • “Statistically detectable” would avoid the misleadingness pointed out by Benoit, but it introduces likely misinterpretations of its own — which makes me wonder whether the word “statistically” is itself so vague as to invite misinterpretation. But I can’t offhand think of a good alternative to “statistically detectable.” Maybe “detectable at a false detection rate of ___”?

  2. Some years ago Al Blumstein suggested using the term “statistically discernible,” which I think is slightly better. And why oh why did the leaders of two statistical societies decide to name their joint journal “Significance”?

  3. People keep talking about p-values and classical hypothesis testing as if they were some theoretically impeccable tools that merely get misused. In point of fact, they have no theoretical foundation or derivation at all (unlike Bayes Theorem, for which there are dozens of published derivations based on various assumptions). P-values have huge known theoretical problems which can never be cleared up pedagogically.

    The only reason p-values and the associated classical hypothesis testing exist is because some numbnuts couldn’t wrap their heads around the idea of probabilities modeling uncertainty, but they could imagine frequencies. That’s the entire foundation, motivation, basis, and derivation of p-values. The practical disaster of p-values isn’t some unfortunate pedagogical error; Frequentists have had a monopoly on the teaching of introductory stats for the better part of a century. If it were possible to teach it “right” they had every opportunity imaginable to do so. Rather, the practical disaster is exactly what anyone who understood the theoretical problems with it would expect.

      • If heavily statistics-based “science” is a widespread disaster and a well recognized joke today, it’s because the professors from about 1940 to the present fundamentally failed in their task of understanding the subject they were supposedly experts on. They nibbled around the edges, published thousands of papers, muddled through here and there, but they still failed to come to grips with their subject.

        You can say they were nice people, very smart, worked hard, good dancers … whatever. They still failed. They had every opportunity, every resource, every chance imaginable. They failed.

        • > failed to come to grips with their subject [statistics]

          I believe that is true – my first impression of the field of statistics was that most statisticians did not understand much about it, especially with respect to applying it in scientific projects.

          > They had every opportunity, every resource, every chance imaginable
          I doubt that; in fact, the usual career path almost ensured little understanding beyond the ability to do the math manipulations.

          I do think it has improved recently, but we need more surveys such as a recent one on the difficulties of implementing Bayesian analyses (in clinical research, if I recall correctly).

      • What’s not intelligent? It’s directly relevant, important, and forgotten by most that p-values do not have a theoretical justification. Here’s what that means in practice.

        Bayes Theorem has tons of published derivations based on a wide variety of axioms from a variety of viewpoints, beginning with the basic product rule of probabilities derivation.

        So if you violate Bayes Theorem you’re necessarily acting as if at least one axiom from each derivation is wrong. In particular, you’re either claiming the product rule of probability is wrong, or that P(A & B) isn’t equal to P(B & A). (The derivation is spelled out at the end of this comment.)

        But here’s the thing: p-values have no such derivation. They are simply an intuitive idea that happened to be checked originally on problems where they are operationally equivalent to doing Bayes. When checked in a bigger way, they were found to have both significant theoretical and practical problems.

        Consequently, if you deny p-values you lose nothing. You pay no price. You’re not forced to accept some absurdity somewhere else. P-values are not some god-given truth that we can never question and which must be a part of statistics forever.
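
        For anyone who wants that derivation spelled out, it’s just the standard two-line argument (nothing specific to this thread): the product rule gives P(A & B) = P(A|B) P(B) and P(B & A) = P(B|A) P(A); since P(A & B) = P(B & A), setting the right-hand sides equal and dividing by P(B) (assumed nonzero) gives

        P(A|B) = P(B|A) P(A) / P(B),

        which is Bayes Theorem. Denying it means denying one of those two applications of the product rule, or denying the symmetry P(A & B) = P(B & A).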

      • Or put it this way: Imagine civilization gets wiped out and 10,000 years from now people have to rediscover everything.

        If those future scientists produce anything that even remotely deserves the name “probability theory” it will have Bayes Theorem, but p-values could very easily be completely absent. And moreover, their absence won’t hinder the development of statistics one bit.

  4. Incidentally, the usual hypothesis testing paradigm is inherently un-Bayesian even if posteriors are used to judge the hypothesis. Given for example hypotheses “H0: theta less than 0” and “H1: theta greater than 0”, then the full posterior P(theta|data, background) encapsulates everything the data + background has to say about theta.

    If you gratuitously add another step which determines, say, that H1 is true, and assume it’s true going forward, then you’ve effectively truncated P(theta | data, background) to theta greater than zero without having any further data or other evidence for doing so. It’s an inherent violation of the sum/product rules, in other words, and hence un-Bayesian. In some instances this truncation will be a valid approximation to the full Bayesian version, but most of the time it won’t.

    The Bayesian version of hypothesis testing (Decision Theory with loss functions and all the rest) really only makes sense if you’re making final decisions. For example, if you’re programming a computer to process data and make automatic decisions about things. Otherwise the Bayesian thing to do is carry the full posterior P(theta | data, background) forward unaltered. Scientists too need to make final conclusions sometimes, but most of the time hypothesis testing is used to make piecemeal judgments along the way (such as removing a parameter from the analysis) in which you’re effectively truncating distributions without the evidence needed to do so. (A small numerical sketch of this truncation appears at the end of this sub-thread.)

    • “in which you’re effectively truncating distributions without the evidence needed to do so”

      I see, so when you set up your Bayesian model of the causes of arthritis you include every variable in the universe, including sun spots. Or do you truncate your model? On what evidence?

      • You have evidence “data+background”, and from that you get the distribution P(theta|data, background). If you then use that same evidence again, without adding anything new, to truncate P, then you’ve made a very fundamental mistake. The evidence you’re using implies the full P(theta|data, background). It doesn’t imply a truncated version of the distribution.

        It would be like someone getting data and estimating theta. Then they come up with the bright idea of using the same data twice to get an estimate of theta with twice the precision.

        It’s irrelevant here whether “background” includes things known or merely hypothesized for the purposes of analysis.
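
        As promised above, a minimal numerical sketch of the truncation point (made-up numbers, purely for illustration): a normal posterior for theta, and how “accepting H1: theta greater than 0” with no new data changes later probability statements.

        # Minimal sketch: a normal posterior for theta, and what happens if you
        # "accept H1: theta > 0" and truncate the posterior with no new data.
        # The posterior N(0.5, 1) is invented purely for illustration.
        from scipy.stats import norm

        post = norm(loc=0.5, scale=1.0)        # full posterior P(theta | data, background)

        print("P(theta > 0):", 1 - post.cdf(0.0))          # about 0.69

        # Probability of a further claim, theta > 1, under the full posterior:
        print("P(theta > 1), full:", 1 - post.cdf(1.0))    # about 0.31

        # Same claim after truncating to theta > 0, i.e. acting as if H1 were certain:
        print("P(theta > 1), truncated:",
              (1 - post.cdf(1.0)) / (1 - post.cdf(0.0)))   # about 0.45

        The truncated version is more confident about theta > 1 than the full posterior, even though no additional evidence arrived, which is exactly the objection above.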

  5. One question would be whether you could just start reporting means and SDs without doing t-tests on them. If you wanted to control for covariates, you could report adjusted means.

    I’m sure that would be “impossible” for anyone but tenured social scientists to get away with, but it would be an interesting experiment.
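
    A minimal sketch of what that reporting might look like (simulated data, invented variable names, ordinary least squares for the covariate adjustment):

    # Report group means/SDs and covariate-adjusted means instead of a t-test.
    # All data below are simulated; "age" is a stand-in covariate.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    group = np.repeat([0, 1], n)                       # two groups
    age = rng.normal(40, 10, size=2 * n)               # covariate
    y = 2.0 + 0.5 * group + 0.1 * age + rng.normal(0, 1, size=2 * n)

    # Raw summaries:
    for g in (0, 1):
        print(f"group {g}: mean = {y[group == g].mean():.2f}, sd = {y[group == g].std(ddof=1):.2f}")

    # Adjusted means: fit y ~ intercept + group + age by least squares,
    # then evaluate each group at the overall mean age.
    X = np.column_stack([np.ones_like(y), group, age])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    for g in (0, 1):
        print(f"group {g}: adjusted mean = {beta[0] + beta[1] * g + beta[2] * age.mean():.2f}")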

  6. Fair enough, but are we so sure that other approaches would work any better “at scale”?

    http://simplystatistics.org/2014/02/14/on-the-scalability-of-statistical-procedures-why-the-p-value-bashers-just-dont-get-it/

    That is, is the fundamental problem here uninformative applications of null hypothesis testing? Or is that merely the “failure mode” of frequentist statistics? So that, if some other approach were equally widely taught and used, we’d discover that it has its own failure mode? Into which non-experts fall just as easily and often as they fall into the failure mode of frequentist statistics?

    • Jeremy:

      Whether any existing approaches are better for routine use is another question. Many researchers, including myself, are working on such alternatives, and I think we are making progress, in the same sense that the standard statistical methods today have a lot to offer compared to what was standard in 1980, or 1950, or 1920.

      But, whether or not there are existing alternatives that are better, I think it’s worth making the point that we have big problems now. If leading scientists, communicators, and institutions, including Steven Levitt, Daniel Gilbert, Daniel Kahneman, James Heckman, and the editors of Psychological Science and the Journal of Personality and Social Psychology are getting these things wrong, it’s worth trying to clarify matters, even while we’re also working to develop better methods.

      • Andrew:

        The question is whether the method is the cause of the problem, or whether the problem goes deeper.

        Your counterfactual appears to be that, if we changed the method, things would be better.

        Yet another possible counterfactual is that, given current institutions and incentives, things will remain the same.

        This is like Gramsci’s critique of democracy. Or, in the words of Giuseppe di Lampedusa, “Things must change, in order that they can stay the same.”

        • Fernando:

          I think we should proceed on multiple tracks. We should try to correct misconceptions and also try to develop new methods and also try to do our best on particular applications and also build tools (including software) that can be used by the masses (for different levels of “masses”), etc.

        • I agree with you. My comment was mostly a warning: that before we operate on the patient, we ought to refine our diagnosis.

          The idea that “It ain’t working, change it” can make things worse. What we need is something that will not work as badly.

          Of course, you have a much more nuanced attitude.

        • That’s just the point though: Let us not generically bash the stethoscope just because you can misuse it in some ways. Focus on correcting the specific ways in which people misuse it. Proceed on multiple tracks & let doctors use Bayes-scopes on patients where those work obviously better than stethoscopes (e.g. spell check, OCR, speech recognition, classification, fraud detection etc.).

          You can talk about getting rid of stethoscopes entirely but only when you are convinced that you have a better widget, extensively tested & ready to be deployed by the masses. Primum non nocere…

    • IMO “any procedure has problems at scale” is a cop-out. If the statistical norms were in line with the scientific method, they would produce better defaults and better (not perfect) science overall. I would argue against concluding a study with p < .05 plus a hand-waving discussion rather than a falsifiable model.

      By default, the posterior distribution of a Bayesian analysis is a model that can be falsified by future work (using either the Gelman-Shalizi frequentist hybrid or Kruschke's fully Bayesian paradigm). This is not true of the vast majority of NHST p<.05 studies (I'll grant that applications in physics may be one rare exception, where there's a falsifiable mathematical model put forward that's independent of the hypothesis test).
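
      For what it’s worth, a minimal sketch of this kind of check (a generic posterior predictive check with simulated data and a deliberately simple normal model; nothing here is anyone’s specific method and all numbers are invented):

      # Posterior predictive check: simulate replicated data from the posterior
      # and compare a test statistic to its observed value. Data and model are
      # made up for illustration (normal model, known sigma, flat prior on mu).
      import numpy as np

      rng = np.random.default_rng(0)
      y = rng.normal(1.0, 2.0, size=30)       # "observed" data
      sigma, n = 2.0, len(y)

      # Posterior for mu is N(ybar, sigma^2 / n) under this simple model:
      mu_draws = rng.normal(y.mean(), sigma / np.sqrt(n), size=4000)

      # Replicated datasets and a test statistic (here, the sample maximum):
      T_obs = y.max()
      T_rep = np.array([rng.normal(mu, sigma, size=n).max() for mu in mu_draws])

      # If this posterior predictive p-value is near 0 or 1, the fitted model
      # (posterior included) is in conflict with the data and can be rejected.
      print("Pr(T_rep >= T_obs | data, model):", (T_rep >= T_obs).mean())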

  7. What Andrew seems to be overlooking is that it’s p-value reasoning and p-value computations that demonstrate how, by means of a variety of selection effects (hunting, cherry-picking, multiple testing, researcher degrees of freedom, post-designation of characteristics, look-elsewhere effects, and the like), the computed p-value is not the actual p-value. A genuine p-value requires: Pr(P < p; chance) < p, and all of these procedures can be demonstrated to invalidate this requirement. More generally, error statistical procedures require that reported error probs be ~ actual ones. This permits us to show exactly how they are invalidated by various shenanigans or flawed underlying assumptions. It is part of a self-correcting methodology! All these points were proved by Neyman and Pearson long ago. More than that, these concerns were the basis for their stipulations (regarding selection effects and model assumptions). The fact that error statistical methods have BUILT WITHIN THEM the means to criticize the spuriousness of results is an important fact about them. But what enables this reasoning is the reference to the sampling distribution of tests–the element deemed irrelevant according to the likelihood principle.
    I certainly agree about the pseudoscientific nature of p-value abuses, but the worst part is the rampant lack of self-criticism by many in fields like social psychology. Appealing to prior beliefs will not do; if permitted, it will only allow these pseudoscientists to justify their results, because they really believe them, and there are whole research programs to back up their beliefs.
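
    An illustrative simulation of the selection-effect point (a sketch only, not anything from Neyman and Pearson or from the comment above): under pure chance a genuine p-value meets the requirement above (Pr(P < p; chance) is at most p), but if you test many outcomes and report only the smallest p-value, that requirement fails badly.

    # Simulate pure-noise "experiments"; in each, run 20 one-sample t-tests and
    # report only the smallest p-value (a crude stand-in for hunting/cherry-picking).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n_outcomes, n = 5000, 20, 30

    reported = []
    for _ in range(n_sims):
        data = rng.normal(0.0, 1.0, size=(n_outcomes, n))   # null true everywhere
        reported.append(min(stats.ttest_1samp(x, 0.0).pvalue for x in data))
    reported = np.array(reported)

    # A genuine p-value would give roughly 0.05 here; the selected one gives
    # about 1 - 0.95**20, i.e. around 0.64.
    print("Pr(reported p <= 0.05) under pure chance:", (reported <= 0.05).mean())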

    • The rampant lack of self-criticism in some fields is the crux. How come process engineering vendors don’t come sell me questionable widgets based on p-value hacking studies?

      The real problem here is one of incentives: journals, editors & referees don’t care whether they publish junk. Funding agencies don’t care if their money is used to justify red-fertility correlations. Levitt’s goal is to peddle books & talks.

      p-values are only a convenient punching bag here.

      I haven’t come across many who really cared about a result & were led astray. “Gee! Too bad I trusted that p-value blindly!”

    • Mayo, you constantly complain about “howlers” people say about frequentism and then every time this subject comes up you say things like “Appealing to prior beliefs will not do” as if the only alternatives are p-values or “beliefs”.

      There is another alternative. Using probabilities to measure model uncertainty. I’ll explain this in a way that philosophers who’ve never done any science, statistics, or computing can understand.

      The current value for the mass of an electron is 9.10938291*10^{-31} kg, with an uncertainty of 0.00000040*10^{-31} kg.

      Like all physically measured values it has an “uncertainty”. If I use N(mean=9.10938291*10^{-31} kg, sigma=0.00000040*10^{-31} kg) as my prior for the mass of the electron, that is neither a frequency of anything nor a “belief”. It is perfectly “objective” and “testable” and “meaningful” in all the ways it needs to be to do real science.

      • (diff anon)

        @anon – I’m sympathetic to this, but unless you’re using a calibrated prior, how is what you call “uncertainty” different from what Mayo is calling a “belief”?

        If the prior is aiming for calibration, then yes, the posterior frequencies are testable. How do you test an “uncertainty” measure that does not aim for calibration?

        • There are no frequencies. There is no frequency of the (rest) mass of the electron. It just is whatever it is. So there’s no calibration of frequencies.

          The prior is used to describe the location (on the positive real line) of the true value of the electron mass. The wider the prior, the more uncertain we are about its location. The narrower the prior, the more certain we are of its value.

        • “how is what you call “uncertainty” different from what Mayo is calling a “belief”?”

          Suppose you have two priors for the electron mass. One implies the mass is:

          9.10938291*10^{-31} kg +/- 0.00000040*10^{-31} kg

          The other one implies the mass is:

          100kg +/- 1 kg.

          A real subjective Bayesian might say that either of these is legitimate if they really are someone’s “belief”. I say the first is objectively right while the second is objectively wrong.

        • Before getting any data at all, it’s going to be hard to know, at least in the general case (we can certainly know ahead of time that the rest mass of an electron isn’t anywhere near 100kg given the number of electrons that must be in any 1kg mass).

          But after collecting only 1 data point, the 100kg prior will be highly highly contradicted by the data. And after 10 data points, the prior will be irrelevant. Near the actual value (~ 0) the prior is so flat that it might as well be a constant. So you get posterior = likelihood * constant/Z in the vicinity of the real answer. The likelihood after 10 measurements with even a 1/10 kg measurement error will be concentrated near 0. You will objectively find out that 100kg is a bogus “belief”.
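
          Here is a quick numerical version of that (a sketch only; the data are simulated, using the 1/10 kg measurement error and the N(100 kg, 1 kg) prior from earlier in this thread):

          # Conjugate normal-normal sketch of the "100 kg prior" being overwhelmed.
          # Prior N(100, 1) in kg, measurement sd 0.1 kg, true value ~ 9.1e-31 kg,
          # i.e. effectively 0 on this scale. Illustration only.
          import numpy as np

          rng = np.random.default_rng(0)
          meas_sd = 0.1                                     # 1/10 kg measurement error
          y = rng.normal(9.10938291e-31, meas_sd, size=10)  # ten measurements

          # How surprised is the prior by the very first measurement?
          # The prior predictive for one observation is N(100, sqrt(1^2 + 0.1^2)).
          z = (y[0] - 100.0) / np.sqrt(1.0**2 + meas_sd**2)
          print("first observation sits", abs(z), "predictive sd's from the prior mean")

          # Standard conjugate update for a normal mean with known measurement sd:
          prior_mean, prior_sd = 100.0, 1.0
          post_prec = 1 / prior_sd**2 + len(y) / meas_sd**2
          post_mean = (prior_mean / prior_sd**2 + y.sum() / meas_sd**2) / post_prec
          print("posterior after 10 points:", post_mean, "+/-", np.sqrt(1 / post_prec))

          The first print shows the 100 kg prior contradicted by a single observation (roughly a hundred predictive standard deviations off), and after ten points the posterior has collapsed from 100 kg to around 0.1 kg with a standard deviation of about 0.03 kg; at this measurement precision a small residue of the bad prior is still visible, but it is a thousandth of where the prior started.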

        • but without a correspondence between probability and something measurable, what constitutes a contradiction? This is where I lose the “let go of calibration” thread.

          OK, suppose a posterior probability doesn’t have to correspond to a calibrated probability. But if I go around saying there’s a 99% probability of this and that, and I turn out to be right 1% of the time, what, other than calibration, tells me that I’m wrong?

          For my personal practice, calibration is not always available and I’m fine with, say, an overly diffuse normal prior and making inferences from that in a coherent fashion. But when it comes to the _external validity_ of a Bayesian model, what is the alternative?

        • diff anon,

          I get that you have a different goal. Your goal is to be wrong a known percentage of the time in the limit of some future imaginary/made up/non-existent repetitions. My goal is to get an interval for the one real value that actually exists, based on the one set of real evidence/data I actually have, such that (1) the interval contains the true value and (2) the interval width accurately reflects the uncertainty that my partial information/evidence implies.

          What I don’t get is why you think your goal has a kind of “hardness” to it that contrasts favorably with the “mushiness” of mine, and why this “hardness” is so overwhelmingly powerful that whenever the rubber meets the road in science, your goal inevitably trumps mine.

        • diff anon,

          Actually maybe this is a cleaner, better way to approach things. Suppose you’ve entered into a lottery in which there is one ball that makes you a loser and 10^(10^(100)) balls which make you a winner.

          Based on this I say the odds are 10^(10^(100)) to 1 that you’ll win.

          Now with a number like 10^(10^(100)) there is absolutely no possibility of calibrating this in the real world (or even coming remotely close to it). In no sense whatsoever can we think about “measuring” any probabilities involved. In fact the one thing we can say with absolute certainty is that in this universe the vast majority of those balls will never be chosen no matter what. That is the hard fact of the world we live in.

          So now my question to you is, “do you believe there’s strong evidence you’ll win the lottery?” or alternatively, “do you believe my statement above about your odds of winning is meaningless?”

          And as a follow up, how would you respond to my claim that those odds are “measurable” in the sense that they are a direct result of the counts involved?

        • Because it’s objectively true that the true electron mass is in the high probability region of the prior. The high probability region (or Bayesian credibility interval formed from the prior, if you like) accurately describes the location on the real line of the true electron mass.

        • What would be the analogous “objective” distribution for describing the fertility modified red wearing proclivity of females?

        • “What would be the analogous “objective” distribution”

          There’s no unique distribution. Each distribution P(x|K) is conditional on a state of information “K”. For different K’s you get different distributions.

          That’s not unusual. Initially, our uncertainty interval for an electron mass was nothing more than “it’s got to be small”. So the interval estimate on that state of information would be “less than 1 milligram”. Today we have a much better estimate and the interval width is +/- 0.00000040*10^{-31} kg.

          The rest mass of the electron hasn’t changed. The only thing that’s changed is the “K” we’re conditioning on. Both are right in that they make objectively true statements about the mass of an electron.

        • It’s fine after the fact (i.e., the fact of a large number of data points) to say that the electron mass is in that region. But there is a point at which it is hard to know whether a prior is “good”. For example, what’s the diffusivity of ethanol in cerebrospinal fluid? I mean, I can construct a prior; for example, I could say that it’s probably less than, say, 1 m^2/s, so I could put exponential(1/1) on the diffusivity in m^2/s, but someone else might come along and say it’s probably less than 10^(-7) m^2/s, and I could only really evaluate that if I collected a bunch of data, and did the analysis, and found a result that was in conflict with the prior.

          I think one of the big problems that the social sci type research Andrew has been complaining about has is that they refuse to do high-powered enough studies to actually find something that would have any hope of contradicting the priors. If you can’t collect enough data to contradict your prior, then you’re not really doing science I think.

        • > If you can’t collect enough data to contradict your prior, then you’re not really doing science I think.

          Agree, another way to put it is that you are avoiding (any opportunity of) being objective (in the sense of precluding reality from complaining about [part of] your model).

    • “What Andrew seems to be overlooking”

      Andrew hasn’t overlooked this at all. Just the other day he said: http://statmodeling.stat.columbia.edu/2014/12/31/a-new-year-puzzle-from-macartan/#comment-205703

      “The “garden of forking paths” critique is merely the response that the reported p-value is, implicitly, a claim that researcher A would have performed the exact same data processing and analysis choices had the data been different.”

      In the subsequent discussion it wasn’t made clear what happens if there were two researchers working on the experiment and one of them would have analyzed non-existent data differently, while the other wouldn’t have. Is the p-value nominal or real then? Although we did conclude that if a researcher wouldn’t have analyzed non-existent data differently before the experiment, but would have after, then it’s a real p-value and not a nominal one.

      Like other physics types, I’m greatly hampered in my mind reading ability and have no crystal ball which lets me see what would happen if the universe were different than it is. All I look at is the actual data and computations. Maybe that’s why my field of physics hasn’t been as successful as philosophy and political science.

    • “All these points were proved by Neyman and Pearson long ago.”

      They did no such thing. They merely pointed out the fact that if your method depends on outcomes different than the one that happened (which theirs did), then your answers are highly sensitive to which non-existent (i.e. made-up) outcomes you considered.

      As Jaynes put it (PT:LOS page 533):

      “In the orthodox method, the accuracy claim is essentially the width of the sampling distribution for whatever estimator beta we have chosen to use. But this takes no note of the range of the data! Orthodox estimation based on a single statistic will claim just the same accuracy whether the data range is large or small. Far worse, that accuracy expresses entirely the variability of the estimator over other data sets that we think might have been obtained but were not. But again this concentrates attention on an irrelevancy, while ignoring what is relevant; unobserved data sets are only a figment of our imagination. Surely, if we are only imagining them, we are free to imagine anything we please. That is, given two proposed conjectures about unobserved data, what is the test by which we could decide which one is correct?”

    • Mayo:

      Why do you say I’m overlooking these things? In my paper with Loken and my paper with Carlin, we very clearly state that those published p-value computations are wrong because they do not account for what the researchers would’ve done under alternative conditions; that is, the reported p-values are not actual p-values. And I’ve also identified my work as error-statistical. You’re arguing above about the likelihood principle but that never came up in my discussions.

      In short, you’re arguing against someone other than me.

  8. The problem, at least in my discipline (some would say field), isn’t the instruments but rather the training obtained to use them. In my discipline we have, by comparison, far fewer physicians (with backgrounds in areas like physics, electrical engineering, and operations research), if you will, than we have technicians. A radiological tech, for example, has a pretty good sense of how the machine works (as in which buttons need to be pushed) to obtain output from input, but my money is on the radiologist for providing the (or an) appropriate diagnosis. Anyone can buy a stethoscope and use it; far fewer are actually licensed to practice with it. What licensure is required to practice with applied statistical procedures? In my discipline, one can obtain an available data set (in some instances with hundreds of thousands of observations), import it into a statistical program, estimate a model using the method du jour (after experimenting with various configurations in the parameter space), and then rely on their technician training to look for effects where p<.05. One can then pretend that the final model was the model chosen a priori as they make the case to the referees (composed mostly, if not entirely, of other technicians) and editorial staff that the noise is really signal. After years’ worth of iterations of this process (in some instances with the same data set) one is then hailed as an “expert”. The few physicians we have who call out such practices (in the spirit of “self-criticism” as offered by Mayo) are ignored because it’s bad for business.

  9. > because it’s bad for business.

    My experience as a statistician working in clinical research was that it was bad for one’s career unless your management supported it.
    (One who did would even tell me about the occasional angry clinician who would demand I be fired – though only after their work had been satisfactorily sorted out.)

    An unfortunate example was a medical student who was fired from their research training job for refusing to data dredge for statistical significance (the person who fired them was later given a prestigious award).

    Perhaps a more common example: a young colleague at an Ivy League medical school told me that she initially was very firm about the need to deal with multiplicity, until she realized that researchers were avoiding her and seeking out other statisticians to collaborate with.

    My last interview in that field ended with my comment that there was often a conflict between enabling physicians to do good research and making them happy. My take on the interviewer was that he was cognizant of this, but had chosen the making-them-happy route as his (at least initial) strategy. He thanked me for clearly raising the concern.

    • Keith:

      I see the examples that you’ve provided as errors of commission. They’re certainly discouraging, but at least there are people (such as you and others) with the knowledge and integrity to call out such practices for what they are. I’m concerned that in my area errors of omission are far more common. At the risk of overplaying the physician/technician/stethoscope thing, the technicians think they’re practicing physicians. They think they’re proceeding in appropriate ways. It’s how they were trained and how they continue to train new graduate students. When someone eventually writes and distributes, if they haven’t already, “Stan for Technicians” (I won’t use the D-word because it’s probably copyrighted and I’m committed at this point to the technician metaphor), there could very well be more of a mess to clean up (at least in my field). I’m not suggesting that decision analysis (as Andrew and others mean it) shouldn’t be taught, of course; it’s just that it would represent quite a shift from the status quo for my field. I think that some areas within the social sciences need to take a hard look at the education that is provided to their graduate (and even undergraduate) students. I don’t know whether the current business model, if you will, will allow for it.

  10. When you say that Levitt and others are wrong, it would help if you inserted the links. I am sure it is all on your blog somewhere, but when I search, I find you mentioning them on various issues.
