Some natural solutions to the p-value communication problem—and why they won’t work

Blake McShane and David Gal recently wrote two articles (“Blinding us to the obvious? The effect of statistical training on the evaluation of evidence” and “Statistical significance and the dichotomization of evidence”) on the misunderstandings of p-values that are common even among supposed experts in statistics and applied social research.

The key misconception has nothing to do with tail-area probabilities or likelihoods or anything technical at all, but rather with the use of significance testing to finesse real uncertainty.

As John Carlin and I write in our discussion of McShane and Gal’s second paper (to appear in the Journal of the American Statistical Association):

Even authors of published articles in a top statistics journal are often confused about the meaning of p-values, especially by treating 0.05, or the range 0.05–0.15, as the location of a threshold. The underlying problem seems to be deterministic thinking. To put it another way, applied researchers and also statisticians are in the habit of demanding more certainty than their data can legitimately supply. The problem is not just that 0.05 is an arbitrary convention; rather, even a seemingly wide range of p-values such as 0.01–0.10 cannot serve to classify evidence in the desired way.

In our article, John and I discuss some natural solutions that won’t, on their own, work:

– Listen to the statisticians, or clarity in exposition

– Confidence intervals instead of hypothesis tests

– Bayesian interpretation of one-sided p-values

– Focusing on “practical significance” instead of “statistical significance”

– Bayes factors

You can read our article for the reasons why we think the above proposed solutions won’t work.

From our summary:

We recommend saying No to binary conclusions . . . resist giving clean answers when that is not warranted by the data. . . . It will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.

P.S. Along similar lines, Stephen Jenkins sends along the article, “‘Sing Me a Song with Social Significance’: The (Mis)Use of Statistical Significance Testing in European Sociological Research,” by Fabrizio Bernardi, Lela Chakhaia, and Liliya Leopold.

119 thoughts on “Some natural solutions to the p-value communication problem—and why they won’t work”

  1. It seems to me that what people really want are stylized facts (“broad generalizations that summarize some complicated statistical calculations, which although essentially true may have inaccuracies in the detail” according to Wikipedia). It also seems to me that hierarchical models are ideal for providing high-level summaries that can be used to justify stylized fact claims.

    • I think that what people want is to be able to throw their numbers and analysis into a machine and get out some kind of validation of their theory. Once the p-value is low enough, you “know” that you’re right and can now think about implications of the “fact” that you discovered.

      • Perhaps the underlying problem is a superstition that these “stylized facts” have in any sense a predictable shelf life.

        Scientific findings should actually be taken as pauses in ongoing inquiry for the purpose of getting on with things and/or other inquiry.

        Perhaps Corey was pointing to this in his comment that hierarchical models are ideal (do they suggest pauses in ongoing inquiry)?

        • I guess we could split the stylized fact “in general, X” into “in the past in country Foo (or whatever), in general, X” and “in the future it’s reasonable to expect X in general” and then assess grounds for each separately. Estimates from a hierarchical model might provide grounds for the former and then a set of assumptions could be laid out such that the latter is reasonable conditional on the assumptions (whose reasonability could then be judged separately, perhaps).

    • To reframe slightly — to me it seems what people want to do is distill the implications of a finding for decision-making without doing the decision analysis. Coarsening the likelihood down to a binary outcome facilitates this greatly. If the study evidence shows that *the intervention works*, then this fact can be generalized to all relevant decisions. If the study evidence shows that the intervention achieves an XX% (+/- YY%) change in a desirable outcome, then the implications for any particular decision are unknown until one does the decision analysis.

  2. On a related note, I recently participated in the New England Journal of Medicine SPRINT challenge investigating open data policies. I did not win (though I believe I should have, but that is a different matter). The winner produced an app that clinicians can use to enter a person’s characteristics; the app weights the hazard ratios for primary outcomes (bad ones, e.g., heart attacks) against the hazard ratios for serious adverse events (i.e., bad side effects), resulting in a single metric advising treatment for the patient. I can see the value and popularity of such apps, and they are likely to be increasingly used in medicine.

    While the app is a technological achievement, the more I think about it, the more I am disturbed by its content. Uncertainty is completely missing – what is conveyed is an answer, not a question. As such, I think it represents a step backwards from what you are advocating here – appreciating and embracing uncertainty. To that end, part of my submission was to produce “confidence regions”; see http://myweb.loras.edu/dl526303/figure1.jpg for the graphic. Admittedly, this picture is based on the (erroneous) assumption that the two outcome measures are independent – I didn’t feel smart enough to deal with the more general case. But I do think the confidence regions convey an important sense of just how uncertain study outcomes may be – particularly for subgroups. So, while I agree with Andrew’s skepticism about confidence intervals, for me that depends on whether they are used merely to rule out zero (or some other single value). Rather, I think there is considerable value in graphically perceiving the size of confidence intervals as meaningful representations of uncertainty.
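    A minimal sketch of the sort of region I mean (the numbers below are invented, not the SPRINT estimates, and it leans on the same rough independence assumption as my figure): treat the two log hazard ratios as independent normals and trace out an approximate joint 95% region.

    ```r
    # Hypothetical log hazard ratios and standard errors for the primary-outcome
    # and serious-adverse-event effects (invented numbers, not the SPRINT data).
    est <- c(benefit = log(0.75), harm = log(1.30))
    se  <- c(benefit = 0.10,      harm = 0.12)

    # Under independence, an approximate 95% joint region is the ellipse where the
    # summed squared z-scores stay below the chi-squared cutoff with 2 df.
    cutoff <- qchisq(0.95, df = 2)
    theta  <- seq(0, 2 * pi, length.out = 200)
    region <- data.frame(
      benefit = est["benefit"] + sqrt(cutoff) * se["benefit"] * cos(theta),
      harm    = est["harm"]    + sqrt(cutoff) * se["harm"]    * sin(theta)
    )

    plot(exp(region$benefit), exp(region$harm), type = "l",
         xlab = "Hazard ratio, primary outcome",
         ylab = "Hazard ratio, serious adverse events")
    points(exp(est["benefit"]), exp(est["harm"]), pch = 19)
    abline(h = 1, v = 1, lty = 2)   # reference lines at "no effect"
    ```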

    • Dale:

      There was quite a push back from some in the clinical community regarding the switch in terms from benefit/risk to benefit/harm/uncertainty as they thought uncertainty should not be discussed with patients…

      Looks like they lost https://www.ncbi.nlm.nih.gov/books/NBK264317/

      Likely there are some big challenges in providing care with uncertainty that need to be addressed.

      • “they thought uncertainty should not be discussed with patients”

        Aargh! Any physician who thinks uncertainty should not be discussed with patients is patronizing, arrogant, and off in a fairy tale world — and therefore should not be trusted.

        • “we take the uncertainty out of uncertain decisions”

          Contrast with what I have tried to drill into students: “If it involves statistical inference, it involves uncertainty.”

      • Can we please just start reporting the expected value of (GDP/capita * QALYs – Cost of Treatment)/(Average Cost to Save 1 QALY across all types of cancer with existing standard treatments)

        or something similar. Just any old back of the envelope attempt to create a common dimensionless scale that goes up when things turn out better… so that people can then complain about it and think harder, and eventually we get to something of some use that we can take expectations over and arrive at just the simplest “at least it isn’t a stupid-in-principle way” of making medical decisions?
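        Just to show the arithmetic of that back-of-the-envelope score (every number below is invented purely for illustration):

        ```r
        gdp_per_capita    <- 60000   # dollars per year
        qalys_gained      <- 0.8     # expected QALYs gained by the treatment
        treatment_cost    <- 25000   # dollars
        avg_cost_per_qaly <- 50000   # average cost to gain one QALY with standard cancer care

        score <- (gdp_per_capita * qalys_gained - treatment_cost) / avg_cost_per_qaly
        score   # dimensionless; bigger is better, so people can argue about it and refine it
        ```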

  3. Andrew, I think you forgot to add the following:

    After you get tenure/become full professor, [w]e recommend saying No to binary conclusions . . . resist giving clean answers when that is not warranted by the data. . . . It will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.

    • Actually, in medical statistics, I was taught to say that there is “strong evidence against the null” (p very small), or there is “evidence against the null” (p less than threshold). So binary decisions are taught in hard-core stats programs as well.

      • For sure, when I was at Duke, I had problems with the tutors’ insistence on “reject or do not reject the null” in the marking, when I had asked the students for something like “data this or more inconsistent with the null was assessed (given assumptions) to occur x% of the time”. Still a binary question but less of a binary answer.

        My guess is most PhD students don’t have/get much sense of uncertainty until after their postdoc or first couple years of practice.

      • Another thing I have tried to drill into students: Choosing to reject the null hypothesis in favor of the alternative is a CHOICE — it’s not obligatory.

        • I’m not sure I like the concept of “drilling something into students” any more than the average misinterpretation of p-values.
          (I’m saying this as somebody who has tried to drill quite a few things into students, too.)

        • There are very few things I try to drill into students; the most common is “Unless a problem is marked ‘short answer’ you won’t get full credit unless you explain your reasoning.”

  4. Your 2nd paragraph may be your most elegant statement of the issue yet: “the use of significance testing to finesse real uncertainty” is lovely. It conveys the full measure, including the conception, popular in the imagination, that not only can one uncover something hidden in data, but that what must be teased out is more valuable: that the “finessing” demonstrates your ability and insight, that it shows you can carefully interpret the data to locate the right path.

  5. Emphasis mine:

    In short, we’d prefer to avoid hypothesis testing entirely and just perform inference using larger, more informative models. To stop there, though, would be to deny one of the central goals of statistical science. As Morey et al. (2012) write, “Scientific research is often driven by theories that unify diverse observations and make clear predictions. . . . *Testing a theory requires testing hypotheses that are consequences of the theory*, but unfortunately, this is not as simple as looking at the data to see whether they are consistent with the theory.” To put it in other words, there is a demand for hypothesis testing. We can shout till our throats are sore that rejection of the null should not imply the acceptance of the alternative, but acceptance of the alternative is what many people want to hear. There is a larger problem of statistical pedagogy associating very specific statistical “hypotheses” with scientific hypotheses and theories, which are nearly always open-ended.

    http://www.stat.columbia.edu/~gelman/research/published/jasa_signif_2.pdf

    I don’t feel like this point was explained very well, as if it was originally an introduction to a more detailed discussion (or maybe I am interpreting it incorrectly). Is this supposed to argue against the default/strawman nil null hypothesis? If not, what is meant here?

  6. It seems that, if we want, following the conclusion of Andrew’s paper, to abandon binary conclusions, we are bound to give:

    * a discussion of possible models of the data at hand (including prior probabilities and priors for their parameters),

    * a posterior distribution of parameters of the relevant model(s), and

    * a discussion of the posterior probabilities of these models

    as the sole logically defensible result of a statistical analysis.

    It seems also that there is no way to make a decision (pursue or not a given line of research, embark or not on a given planned action, etc.) short of a real decision analysis.

    We will have a hard time selling *that* to our “clients”: after more than 60 years of hard-selling them the NHST theory, we have to tell them that this particular theory was (more or less) snake oil aimed at *avoiding* decision analysis…

    We also have hard work to do in order to learn how to build the necessary discussions, which can hardly avoid involving subject-matter specialists: I can easily imagine myself discussing a clinical subject, possibly a biological one; I won’t touch an economic or political problem with a ten-foot pole…

    • Emmanuel:

      I think you’re making things sound a bit too hard: I’ve worked on dozens of problems in social science and public health, and the statistical analysis that I’ve done doesn’t look so different from classical analyses. The main difference is that I don’t set up the problem in terms of discrete “hypotheses”; instead, I just model things directly.

      • I liked how one of the ASA speakers put it. More or less, if NHST and p-values didn’t exist, how would you convince someone that you’re correct?

        There are an assortment of ways, none of which should involve thresholds “to beat”. It may be cross validation, it may be model comparison, it may be PPCs, it may be formal or informal decision analyses.

        Bayesians aren’t anti-decision, they just don’t care much about pitting one stupid hypothesis against a somewhat plausible one. It’s not like they revel in uncertainty, they just acknowledge its presence before making a decision.

        Bayesians use the uncertainty to make decisions. Given the data, what can I glean about reality; what should I expect; what are parameters likely to be, or unlikely to be, etc.

        It’s hard to shift away from dichotomous decision making once you’re indoctrinated into it; I’ve been pretty fully Bayesian for the better part of 2 years, and I’m only now intuitively grasping how one can make decisions without thinking about a value to compare to. I recently told someone when discussing Bayesian stats that “there’s nothing special about 0 to a bayesian, it’s just another possible parameter value on the real line”, and it’s that sort of thinking that distinguishes dichotomous decision makers from likelihoodists or bayesians.

        In the end, the models that people fit, regardless of their inference framework, are really similar; it’s just what you are deciding about and how you’re deciding. A frequentist may say “damn, it’s not significantly different from zero” and a bayesian may say “the effect is probably too small to care about” or “honestly, we can’t tell if the effect is negative, positive, or too small to care about given this data”. It’s not that convincing is harder, or the analyses are harder (in the modern day), it’s just one’s goals shift from rejection to best estimation, in a sense.
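        A minimal sketch of that last contrast, with fake posterior draws standing in for whatever a fitted model would return and an invented 0.05 threshold for “too small to care about”:

        ```r
        set.seed(1)
        draws <- rnorm(4000, mean = 0.08, sd = 0.10)   # stand-in posterior draws for an effect

        too_small <- 0.05   # a made-up "too small to care about" threshold
        c(p_positive   = mean(draws >  too_small),
          p_negative   = mean(draws < -too_small),
          p_negligible = mean(abs(draws) <= too_small))
        ```

        No value gets special status; you just read off how much posterior mass sits in each region.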

        • I see an interesting dichotomy (continuum?) between applied tools that are very often using highly Bayesian approaches (e.g. AI, machine vision, robotics) but using all the Bayesian machinery to finally make *some* (often binary) decision.

          And on the other side academic Bayesian modelling where just the mere fact that there can be a binary decision to be made at the end evokes revulsion. I think it’s a pedantic prescription: “say No to binary conclusions / resist giving clean answers” etc.

          So, I don’t see the problem being Bayesian methods so much as to what end one is willing to use them for.

        • Rahul:

          You might call it pedantic to “say No to binary conclusions / resist giving clean answers” but John Carlin and I have done applied statistics for many decades between us, and saying No to binary conclusions etc., is what we do. I think if Kanazawa, Cuddy, Bargh, etc. had been following this advice they’d be in much better shape.

          Not everything I’ve done is Bayesian. That’s not really the point. The point is to stop framing things in terms of discrete “hypotheses.”

        • Andrew:

          I read your discussion on McShane & Gal and I read your comment on the ASA statement some time ago, and I share your sentiment that people should “embrace variation”, but I am not quite sure what your specific suggestions would be for the type of research questions/program of, e.g., Cuddy (just to use one specific example). You say: “The main difference is that I don’t set up the problem in terms of discrete “hypotheses”; instead, I just model things directly.” But your data are (typically) very different. Take Cuddy’s data! She has a hypothesis that she very much believes in: power pose affects variable A (e.g. testosterone level or some behaviour). She does an experiment where subjects either assume power pose or not. Variable A is recorded. What do you suggest she should now do? How should she “model things directly”?

        • She has a hypothesis that she very much believes in: power pose affects variable A (e.g. testosterone level or some behaviour). She does an experiment where subjects either assume power pose or not. Variable A is recorded. What do you suggest she should now do?

          Andrew may think differently, but imo such studies are designed to fail to begin with. At a minimum, she should have measured some kind of dose response or timecourse and come up with a process (or, better, processes) that could explain the shape of the curve. Then they can estimate the parameters of the model and consider what those results would mean.

          NHST is not just calculating p-values and comparing them to significance cutoffs. It is an entire paradigm of research: people have been designing studies for NHST (with the primary goal of comparing two averages) rather than to collect useful information.

          Another way of putting it: Do you believe that human behavior is a static (as opposed to dynamic) phenomenon? Because that is what such studies assume during their design. If you think it is dynamic, should people without the skills to study change, like calculus and numerical simulation, be designing/running the studies?
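          As a toy illustration of the alternative (everything below is simulated and hypothetical, not a claim about any real dose-response), you would fit a curve and report its estimated parameters rather than testing a difference in means:

          ```r
          set.seed(2)
          dose     <- rep(c(0, 0.5, 1, 2, 4, 8), each = 10)   # e.g., minutes of posing (made up)
          emax     <- 1.5; ed50 <- 2                           # "true" values used to fake the data
          response <- emax * dose / (ed50 + dose) + rnorm(length(dose), sd = 0.4)

          # Fit a saturating (Emax-style) dose-response curve and look at the parameter
          # estimates and their uncertainty, instead of "pose vs. no pose".
          fit <- nls(response ~ e_max * dose / (e_d50 + dose),
                     start = list(e_max = 1, e_d50 = 1))
          summary(fit)$coefficients
          ```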

        • Amoeba:

          What Anon said. I think Cuddy, Fiske, Bargh, etc., should abandon the “button-pushing” or “black-box” model of science in which they come up with ideas for magical interventions and then expect the results to appear consistently in any sloppy experiment.

        • @Andrew @Anoneuoid

          Fine – so Cuddy should abandon her magical interventions and sloppy experiments. I could not agree more, but this is not a statement about statistics anymore, this is a statement about you (and me) disliking her research.

          Whereas I am trying to understand your statistical advice. Are you (both Andrew and Anoneuoid) saying that nobody should ever do a simple comparison between two groups? This would sound weird and will not convince anybody. I am not working in psychology but in neuroscience, and we do have models and quite some understanding of the mechanisms and whatnot, but at the end of the day people want e.g. to deactivate a certain brain area in rats and show that it negatively affects a certain behaviour. Guess what – they would usually run a hypothesis test for difference in means.

          I am still not sure what your opinion is on what they should be doing instead.

        • Amoeba:

          You write that my recommendation about button-pushing and sloppy research “is not a statement about statistics anymore, this is a statement about you (and me) disliking her research.”

          No, it is a statement about statistics! I don’t have any problem with someone studying power pose or whatever; my problem is with these “power = .06” studies, analyzed using forking paths and p-values, which is nothing more than a way of churning noise. This is all a statistical problem. But the solution to the problem is not a higher-power hypothesis test or a multiple comparisons correction or an identification strategy or even preregistration. The solution is more accurate measurements that better bridge the (currently yawning) gap between data and theory.

          You ask if I would ever recommend a simple comparison between groups. A simple comparison between groups is fine—if the measurements are sufficiently accurate and relevant to the question being asked. It’s not so relevant if the measurements are close to pure noise.

        • Are you (both Andrew and Anoneuoid) saying that nobody should ever do a simple comparison between two groups? This would sound weird and will not convince anybody.

          Yes, I am saying that type of design is worthless for most situations. It is unfortunate that neuroscientists will continue wasting their careers by using that design. How are you going to figure out what is going on in the brain if all you know is “if I do A to a rat the bloodflow through Region R increases by x%, on average”?

          I can’t do anything with such info, but that is all anyone wanted to know (actually not even that, instead just “is it significant”) which is why I do something else now.

          I am not working in psychology but in neuroscience, and we do have models and quite some understanding of the mechanisms and whatnot

          I assure you any quantitative model will have been developed without the help of NHST.

          but in the end of the day people want e.g. to deactivate a certain brain area in rats and show that it negatively affects a certain behaviour. Guess what – they would usually run a hypothesis test for difference in means.

          I am still not sure what is your opinion on what they should be doing instead.

          There is so much to be done.

          1) Are you actually deactivating the region you want to, without having other effects?
          – Study a bunch of normal rats to figure out the amount of variability in the location of the ROI relative to whatever landmarks are being used.
          – Perform the deactivation in a few rats and measure the extent of it. How accurate is your deactivation procedure?
          – Does anesthesia also “deactivate” the ROI?

          2) Does the behaviour you are measuring actually correspond to what you are trying to measure?
          – Is there a food reward involved? If so, does the treatment affect the rat’s hunger, rather than whatever you are assuming?
          – Is the behaviour something that needs to be learned? If so, what do the learning curves look like for normal rats?
          – Is there a circadian rhythm to the behaviour in normal rats?
          – How accurately are you measuring this behaviour? What is the inter-rater variability, etc?

          3) What are the effects of the deactivation (including beyond the behaviour you have chosen to focus on)?
          – Perform the deactivation on a few rats and study them very closely, ie monitor how various physiological and behavioral variables change (or don’t) over time.
          – What if you “half-deactivate”, etc. Is there a dose response effect? What does that curve look like?

          Once you have done the above, you should be able to come up with some models of the processes that lead to the behaviour, and the various responses to your deactivation. If not, you can simply describe what you observed. Maybe someone else can come up with the models.

        • @Andrew

          Thanks for the replies. I am not going to pester you here anymore, but I would be very interested if you write about this in more detail and IMHO this is something that is lacking in these recent Commentaries of yours. I see a lot of (rightful!) criticism but I don’t see specific applicable advice. Just this week I was teaching a three-day course on hypothesis testing and its pitfalls.

          “More accurate measurements that better bridge the (currently yawning) gap between data and theory” sounds great, but I guess Cuddy thinks, or used to think, that her measurements are fine. What should she be doing to understand that they are not? You say group comparison is “not so relevant if the measurements are close to pure noise”; sure, I agree, but how should people in the wet lab know if they are close to pure noise or not?

          What I end up teaching/preaching are things like thinking about effect sizes in advance, doing power analysis, self-replicating results that were obtained via exploratory analysis, using a skeptical alpha level (more skeptical than 0.05), etc. But I don’t feel I can tell them “Stop using p-values and say No to binary decisions”. I don’t see how that’s helpful advice. As I said, they do want to show, for example, that inactivating a certain brain area leads to a substantial performance decrease, and they need statistical tools to answer that.

        • Amoeba:

          For advice, you could look at my two textbooks and my hundreds of applied articles. There’s also this new paper, in particular see section 3.

          I don’t think that changing “alpha-level” will help much in cases such as Cuddy’s; there the problem is that the measurements are too noisy to be useful.

          How can people in the wet lab know if they are close to pure noise? They can do within-person comparisons, they can fit multilevel models which will partially pool when data are noisy. I don’t know anything about wet labs but in social science I see people, all the time, doing inferences that would only make sense if the underlying effects were implausibly huge. Carlin and I discuss this in our 2014 paper, and we make the specific recommendation that researchers perform design calculations under reasonable hypothesized effect sizes.
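          For concreteness, here is a stripped-down calculation in the spirit of those design-calculation recommendations (the true effect size and standard error below are hypothetical placeholders, not suggestions):

          ```r
          design_calc <- function(true_effect, se, alpha = 0.05, n_sims = 1e5) {
            z_crit <- qnorm(1 - alpha / 2)
            # Power: probability the estimate comes out "statistically significant".
            power  <- pnorm(true_effect / se - z_crit) + pnorm(-true_effect / se - z_crit)
            est    <- rnorm(n_sims, true_effect, se)
            signif <- abs(est) > z_crit * se
            # Type S: among significant estimates, how often is the sign wrong?
            type_s <- mean(sign(est[signif]) != sign(true_effect))
            # Exaggeration: how much do significant estimates overstate the true effect?
            exagg  <- mean(abs(est[signif])) / abs(true_effect)
            c(power = power, type_S = type_s, exaggeration = exagg)
          }

          design_calc(true_effect = 0.1, se = 0.25)   # a small effect, noisily measured
          ```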

          You write that “Cuddy thinks, or used to think, that her measurements are fine.” There’s little I can do about Cuddy as she has been given every incentive to overstate the certainty of her claims. But unsuccessful external replications should provide a clue that something’s going wrong with her research paradigm.

        • “And on the other side academic Bayesian modelling where just the mere fact that there can be a binary decision to be made at the end evokes revulsion.”

          I don’t think this at all. But I am revolted by the idea that you could make a decision without even once considering the consequences. I mean, to me it’s actually immoral. Considering the consequences requires having some measure of the consequences. In Bayesian Decision Theory, this is a utility function of some sort. Thanks to Wald’s theorem, any decision process that isn’t Bayesian Decision Theory is in general either equivalent to one that is or dominated by one. So, in the end please give me a description of the consequences of each possible outcome, and I’ll be happy to give you a decision.

        • So, we can try “binary conclusions are ok but please incorporate consequence into your utility”?

          That sounds much more reasonable than just a blanket “Say No to binary conclusions”

        • The sophisticated version:

          1) Inference is separate from decision. First find out what you know about the world (inference).

          2) Decision making without considering the consequences of the decision makes no sense, and can do significant harm to the world. To decide you must have some approximation of the goodness of the consequences under whatever turns out to be the real-world situation. This implies some kind of function Goodness(Outcome) must be specified before making a decision.

          3) When making a decision, the decision must be sensitive to both the probability of an outcome, and the consequences of the outcome, and must consider all possible outcomes. The only way to accomplish this that makes any sense is to choose the decision that has the largest expected Goodness(Outcome).

          As long as you are doing your statistical decisions this way, and your goodness function has some real world content, then by all means make discrete decisions.
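          A minimal sketch of 1) through 3) together, with fake posterior draws standing in for the inference step and a completely made-up Goodness function:

          ```r
          set.seed(3)
          effect_draws <- rnorm(4000, mean = 0.3, sd = 0.5)  # step 1: posterior draws from some fitted model

          goodness <- function(action, effect) {             # step 2: score consequences, not p-values
            switch(action,
                   treat    = 100 * effect - 20,             # benefit scales with the true effect, minus a fixed cost
                   no_treat = 0)                             # status quo
          }

          actions <- c("treat", "no_treat")
          expected_goodness <- sapply(actions, function(a)   # step 3: average over all possible outcomes
            mean(sapply(effect_draws, function(e) goodness(a, e))))
          expected_goodness
          names(which.max(expected_goodness))                # a discrete decision, reached defensibly
          ```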

        • +1 to Daniel’s Sophisticated Version — although I’d add, “Of course, as always, the devil is in the details.”

        • @Daniel

          #2 & #3 are fine.

          Re your #1: in many settings standalone inference doesn’t have much value. I’ve no way of judging its quality. Give me a decision or a prediction or something I can use as a surrogate to evaluate the quality of your inference with.

          I think the fallacy lies in evaluating the quality of an inference by the richness or elegance of the underlying model or the amount of information the model can use.

          The most elegant hierarchical Bayesian model is totally useless if it isn’t “right”. (Cf. all inferences are wrong but some are useful? And useful to me is how well they perform in decision-making or prediction.)

          Rahul: emphasizing that there’s a logical difference between inference and decision making is important to avoid the pitfall of doing the inference, deciding that the weight of evidence is in favor of answering “yes” or answering “no” to some question, and then going ahead and acting as if the answer *is* yes or no, making a decision without considering the consequences of the outcomes.

          So, even if you do no direct inference at all, it’s useful to understand that they are different processes.

          On the other hand, inference can be important by itself. Suppose you have some model that you want to be predictive of some outcomes. You do some inference on some data, and find that parameter q is sharply peaked around a single value (say 9.798 m/s^2 for g). Well, fine: now you can use g to predict something else in a second problem.

          Since you’re a physicsy/chem guy I’m sure you could imagine a situation where your real goal is to predict the yield in complex reaction Q using enzyme E.

          A key component of predicting Q is something about the activation energy of the step that the enzyme is involved in, but there are other important factors in Q such as the rate constant at some secondary reaction as a function of temperature or whatever…

          When you try to find out both activation energy and rate constant, you can’t, they both affect the results in similar ways, and they therefore are partially confounded.

          But, there is a simpler test reaction you can do with E to get a lot of information about the activation energy. When you do that reaction and infer the activation energy you get a very tight posterior distribution. You can approximate this posterior distribution as some kind of tight normal distribution, and put it in as a prior in your more complex reaction Q, which then lets you infer the rate constant.

          In the end you may simply want to know the answer to the decision problem “should I use enzyme E or the old method without it?” and the decision rests on yields whose predictions require knowledge of the rate constants, and costs, and market demand factors, and lots of other stuff. But realizing that inference is separate lets you get at that answer by structuring your experiments to discover the activation energy and the rate constant, and therefore predict the yields, and then have a distribution of yields to put into your decision problem involving market prices and so forth… it’s important to recognize the different kinds of problems if you want to answer your final question.
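          A toy version of that two-step structure, with abstract parameters a and b standing in for the activation energy and rate constant and invented numbers throughout: experiment 2 only constrains a + b, so the two are confounded until the tight posterior for a from experiment 1 is carried in as a prior.

          ```r
          set.seed(4)
          y <- rnorm(20, mean = 2 + 3, sd = 1)   # experiment 2 data: only a + b is observable

          grid <- expand.grid(a = seq(0, 6, length.out = 121),
                              b = seq(0, 6, length.out = 121))
          loglik <- sapply(grid$a + grid$b,
                           function(m) sum(dnorm(y, mean = m, sd = 1, log = TRUE)))

          # Prior on a: a tight normal approximation to its posterior from experiment 1.
          logprior <- dnorm(grid$a, mean = 2, sd = 0.1, log = TRUE)

          logpost <- loglik + logprior
          post    <- exp(logpost - max(logpost))
          post    <- post / sum(post)

          # With the informative prior on a, b becomes well identified:
          b_mean <- sum(post * grid$b)
          b_sd   <- sqrt(sum(post * (grid$b - b_mean)^2))
          c(b_mean = b_mean, b_sd = b_sd)
          ```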

        • @Daniel

          Thanks. I think I see your point there.

          I guess where I’m coming from is this: In fitting an activation energy I don’t need much convincing that your underlying model is right. So I can make a good judgement call about your inference without resorting to predictive checks in the particular case. Lots of checking has been done by hundreds of researchers on the simple test reaction case.

          OTOH, imagine an era very long ago in which activation energy itself was a novel concept. I’m sure you’d demand validating the inference by asking me to predict an observable, say reaction rate at another temperature.

          That’s where I’m coming from. Most of the Soc Sci models I see are so vague, have so little fundamental basis etc. that I’m in uncharted waters.

          The only way to trust an inference is to demand predictive ability.

        • @Rahul, I can’t disagree with you regarding Social Sci. In part because it’s not an area where I know the literature extensively, and in part because I have a suspicion that you’re absolutely correct.

          From my perspective (and I’m working on some social sci problems now so it’s a little bit informed) we could do a lot with just thinking carefully about measurements and description. Let me give you an example. Back in 1963 some USDA person came up with a “sufficiently healthy and cheap” diet, and then started computing the cost to purchase that diet. Of course, you need more than just food… so they took this number and multiplied by 3.

          EVER SINCE THEN THIS HAS BEEN THE STANDARD FOR DETERMINING IF YOU ARE IN POVERTY OR NOT

          http://poverty.ucdavis.edu/faq/how-poverty-measured-united-states

          As a back of the envelope calculation this works fine, as the essential piece of input to a multi-hundred-billion dollar government welfare system… not so much.

          As a step up from this, you could imagine, say, cost of rent, heating/cooling to keep the space between 55 and 85 degrees F, electricity and gas for cooking and lighting, that USDA diet for food at home, and sufficient transportation to be able to get to and from work. (One suspects the supplemental measure works along those lines; see link above.)

          If you have monthly after tax income that exactly pays all of those, you would be at a threshold for poverty, you’re not deep in poverty, but the slightest perturbation to your life (an impounded vehicle, an abscessed tooth, a child with whooping cough…) and you could lose your job, not pay the rent, be on the street….

          But even in 2012 or whatever when they implemented this supplemental poverty measure, it’s clear that the thinking is binary “either you’re in poverty or you aren’t”. This is lousy thinking. The question should be what is your total money income divided by the local costs associated with the basket of goods above, and then *what is the distribution of this dimensionless ratio across the population*. Once you have this distribution, you can then talk about various important functionals of it:

          1) What fraction of the population has a ratio less than 1? (This has been the historic focus.)
          2) What is the mean value of the ratio in the population? (This is also of interest; it gives a kind of “real income” measure relative to “not making it”.)
          3) What is the rate of change of the mean of this ratio through time?
          4) What is the conditional distribution of this ratio for different ages and education levels?
          5) How did those conditional distributions change through time?

          Etc., etc.
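          A made-up simulation of that ratio and the functionals above (every number is invented; real income and cost data would go here), just to show how little machinery the descriptive version needs:

          ```r
          set.seed(5)
          income      <- rlnorm(1e4, meanlog = 10.6, sdlog = 0.7)  # annual after-tax money income
          basket_cost <- rlnorm(1e4, meanlog = 10.2, sdlog = 0.3)  # local cost of rent, food, utilities, transport
          ratio       <- income / basket_cost

          c(frac_below_1 = mean(ratio < 1),               # (1) the historic "poverty rate"
            mean_ratio   = mean(ratio),                   # (2) average distance from "not making it"
            q10          = unname(quantile(ratio, 0.1)))  # or any other functional of interest
          # (3)-(5) are the same summaries tracked over time or within age/education groups.
          ```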

          So, now you know one of the issues I’m working on: just simply *measure* how well households are doing on this kind of scale. Yes, others have attempted this… but I suspect few took the kind of Bayesian approaches that I will take. In any case, it seems a bit idealistic, but looking at what’s going on in the country and how deep into the weeds our standard ways of describing economic well-being are… it seems necessary that someone do a better job.

          This job is “just” inference, it’s “just” measurement technique in some sense. But it feels very, very important: for lack of measurement technique we’re all standing around patting each other on the back about how big our GDP is while millions of people wither away under crushing poverty and near-poverty, while opioid abuse rises, while we blame immigrants for lack of jobs, etc.

        • Daniel said: “we could do a lot with just thinking carefully about measurements and description”

          Yes, yes, yes! This is something I have always tried to emphasize in teaching statistics; the poverty “measure” is an example I’ve always given. (I have heard that the person who first used that measure was appalled that it became used for purposes for which it was not intended.) The U.S. “unemployment rate” is also an eye-opener: the numerator and denominator are not what most people expect; in fact, there are a variety of “alternative unemployment measures” that are sometimes (but not often enough) used — e.g., to compare the U.S. with other countries that use a different definition. I could go on and on.

        • @Stephen Martin: I don’t know if you were being sarcastic or not, but NHST and p-values don’t give you a way to prove you’re correct, at least if you mean proving that some model is the right model for some process. They just tell you how unlikely data as extreme as what you’re seeing would be if it were generated following the null hypothesis. It’s easy to fail to reject a null hypothesis due to an insufficiently large/low-noise data set (relative to effect size). There’s no way to prove a model is correct in either frequentist or Bayesian statistics, just a way to give you a statistical reason to not reject it out of hand.

          Bayesian stats is arguably more like machine learning than classical stats in that we tend to measure our models based on their utility for some practical problem, often involving prediction.

          I conjecture that machine learning is largely popular because it sanctions the practice of building useful tools for predicting future outcomes without worrying about whether these tools would get methodological approval from a statistician. These models are useful in the sense that they’re often better than just guessing (or as Gelman quotes Rubin as saying, whatever you were doing before applying statistics). Much more importantly, they’re often better than anything statisticians can do using the tools they have at hand. As an example, we could apply the full Gelman methodology to speech or handwritten digit recognition, but we run into major multimodality obstacles to inference and then get clobbered in predictive measures (0/1 loss) by the ad-hoc training methods being applied in the field to deep belief nets (NIPS and other conferences are full of papers extolling such methods). It is, of course, possible to build a deep belief net in Stan and fit it to MNIST handwritten digit data (I’ve in fact done it), but it won’t scale to the size of data that will make it competitive, and we can’t integrate over the posterior effectively due to multimodality.

  7. I am new to reading your blog, so forgive me if this is discussed somewhere else, but given all this, what would you advise someone to do who is teaching at a public, primarily undergraduate institution in a statistics department and who feels our general introductory statistics course and our undergraduate curriculum for the major are misleading students? I honestly believe that students leave the introductory course and many other courses thinking statistics is all about hypothesis testing in all sorts of settings and feeling like they went through a whirlwind of situations where they go through all “the steps” of hypothesis testing.

    I have noticed as well that students leave the intro course without a good understanding of basic concepts like variability, bias, etc. and many of them do not really understand descriptive statistics, how to compare rates, how to think about data, what the standard deviation measures, how to graphically summarize data, etc. I try to “fix them” in my course on statistical computing, but of course my course needs to teach them how to code as well so I can only do so much in there.

    I just got tenure, so I feel I might have a bit more power, but when I look at our undergraduate curriculum, both that for our majors and the courses we teach for the general population, I feel we are so outdated and not taking these things into account. So far I deal with the situation by teaching statistical computing courses that do no inference and trying to avoid the introductory course and many other courses that I feel emphasize NHST too much. In teaching these courses I realize many of our undergraduate majors want to just “do hypothesis testing” even in totally inappropriate places, and our client departments want them to learn a checklist of methods.

    All of this leads to a great deal of existential angst. I wonder how others deal with these matters within departments that are more “traditional frequentist” that are in teaching institutions especially if like me you are in a low power position in ones department.

    I also wonder if people know of a good introductory statistics text (algebra only) that would potentially have better learning outcomes than we see in this study. Seems to me we should minimally not be making students’ understanding worse with our courses.

    • LauraK:

      I don’t think anyone knows of one, or has been widely convincing that they do.

      I am convinced that Richard McElreath is going in the right direction – http://xcelab.net/rm/statistical-rethinking/
      (But that is a graduate course, and for disciplines other than statistics.)

      Since I visited Duke in 2007/8 (comment above) there has been a concerted effort to get a better intro non-major course (the only improvement I have seen at arm’s length is the inclusion of R programming).

      > “our client departments want them to learn a checklist of methods.”
      That is a real challenge – I remember realizing that what I was trying to stress as really poor statistical analyses that should be avoided were exactly what these students’ department faculty were using in their publications and teaching.

      Now, there are hundreds of suggestions in former posts and comments you can find on this blog.

      (This one, http://statmodeling.stat.columbia.edu/2017/03/08/applying-statistics-science-will-likely-remain-unreasonably-difficult-life-time-no-intention-changing-careers/ tries to set out the context of what folks need to learn more generally to profitably do science.)

      The joke about two guys running away from a bear comes to mind – “we can’t outrun a bear!”- “True, but I only have to outrun you.”
      (So just try to do better than most others.)

      • Thank you so much for your comment. I will explore the resources. I just watched Andrew’s eCOTS talk and that was helpful, or at least affirming that the things I am concerned about are what others are concerned about as well.

        I also think many people are developing new courses and just calling them “Data Science” or some such thing and teaching more about descriptive statistics, data wrangling and predictive modeling. That is basically what I do in statistical computing.

        I will try to keep on keeping on… and practice Kaizen…
        http://lifehacker.com/get-better-at-getting-better-the-kaizen-productivity-p-1672205148

      • +1 on McElreath’s approach. I am reading through the book now and it is excellent.

        Even if you can’t use the actual book in your own course, the way he builds insight about the content could provide a helpful model. In particular, he is very clear about how thoughtfulness and insight interact with the procedures and the choices that must be made all along the way.

    • Laura:

      I used to teach intro statistics but I stopped doing it about 15 yrs ago because I was so upset at the content of what I was teaching. I plan to return to it but only after putting together course material that abandons the usual approach.

      But if you are constrained to teach intro statistics and you’d like some marginal improvements, I recommend you take a look at my book with Deb Nolan, Teaching Statistics: A Bag of Tricks. The 2nd edition is coming out soon, and we have lots of tips for how to teach this stuff in an interesting way.

      • Andrew: Thank you so very much for your reply. I will get your book and look forward to the new edition.

        That is how I have felt for some time as well. I feel very sad and inauthentic while teaching the material, so I have avoided it; but I still get the students after they have taken intro, so I would like it if they ultimately learned more relevant material.

        I just listened to your eCOTS talk: very nice, or perhaps nice is not the right word; enjoyable and helpful is more accurate. I might focus on the instant feedback part that you mentioned with homework assignments. http://www.aboutus.assistments.org/ has technical solutions for implementation, and the tools allow the researcher/teacher to randomize what types of feedback students get while doing the homework, or the type of questions, or other features, so that perhaps we can ultimately improve learning.

        Keep writing and speaking!
        Laura

    • LauraK:

      I can sympathize with your situation. Although my experience was coming from different circumstances than yours (we had no statistics department; I was a mathematician with no formal statistics background and was trying to do what I could to keep our statistics program alive as the statistics group dwindled), I encountered a lot of the same problems you mention. For example, I remember someone in a science department urging us to teach a one semester course for science students, including “all they needed to know” about both calculus and statistics — e.g., “Don’t waste time teaching them about distributions; just teach them how to read an ANOVA table.”

      The best intro statistics book I found was DeVeaux, et al, Stats: Data and Models. I’ve got some online notes for instructors on the third edition at http://www.ma.utexas.edu/users/mks/M358KInstr/M358KInstructorMaterials.html (It was for a course for math majors who already had calculus and probability, but some of the handouts, etc. might be useful.)

      I did look at the Gelman and Nolan “Bag of Tricks” book that Andrew mentioned, but didn’t find it very useful — though perhaps the new edition is different from the one I read several years ago.

      After I retired, I taught until this past May a “continuing education” course that allowed me to point out some problems and misunderstandings of NHST to people from a variety of backgrounds (e.g., state agencies as well as a variety of academic users). Possibly some of the online materials and links at http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html might be helpful to you — in particular, I find some of the online simulations and the “quizzes” embedded in the notes to be helpful in countering some common misconceptions.

      I also taught a summer prob/stats course for secondary math teachers in a master’s program. I’ve got some of the handouts online at http://www.ma.utexas.edu/users/mks/ProbStatGradTeach/ProbStatGradTeachHome.html. A couple that might be particularly of interest are “Models and Measures”, and “Measures, Words, Rates, Ratios, and Proportions”.

      • Martha:

        I think the DeVeaux book is wonderful, it’s the best book I’ve ever seen that covers the standard intro stat material. I just hate the standard intro stat material so much, I don’t think it should ever be taught again, anywhere!

        • Given the number of individuals and years they invested trying to repair the standard intro stat material with little widely accepted success, maybe a replacement is in order.

          But maybe university governance and faculty development and reward need to be replaced first or put in quarantine…

    • @LauraK

      Here’s my 2 cents. I think the problem with conventional undergrad stat courses is that (a) they are too divorced from the trades of individual students, (b) they get the timing all wrong, and (c) pedagogy is stuck in a pre-computer era or pays only lip service to the powerful computational tools available.

      By packing a hundred assorted students of various majors, schools, math abilities, etc. into one class, you end up with a very blunt pedagogical tool. Traditionally we try teaching stat in a domain-agnostic fashion. I’m not sure that is working out.

      Further, when the courses are taken at a point in time when most students have very little exposure to collecting/handling substantial data, the fake, stylized toy examples start seeming pointless.

      My suggestions:

      (a) Target specific majors or groups of majors with shared interests. (b) Take the pain to understand their domains and problems (actually, this needs a substantial investment of effort) so as to craft the course to their interests. (c) Bring in stat a little later in the course structure. It’s a fantasy to think we can guide students to not make stat errors from day-0. It’s more productive to instruct them *after* they’ve had a chance to fool around with data unguided and stumbled. (d) Right from day-1 integrate R or Python into their workflows.

        • Somehow I’m not a big fan. Recording is, of course, great & imperative, but I just stick to ancient prescriptions: scripts, liberal commenting, strict separation of code & data, and generous use of version control.

          The data notebook thing I could never really get. It’s too messy.

        • Notebook–agree. Knitr is awesome though. I write all my papers with code embedded in them. It’s a very important tool; it has happened many times to me that the data changed or were extended with a second study after the paper was written (as a result of reviewer demands, for example). Everything gets updated automatically.

        • +1 to Shravan’s love for knitr. The Jupyter notebook is the wrong model, at least for me. But Rmarkdown works just like an R script, but with fantastic comment formatting.

      • Thanks so much for your thoughts. That is a really good point. I often think the biggest challenge in teaching is too much student variability in courses and completely unrealistic goals for material. Although before I came my department really fought to teach all the intro courses on campus.

        I got interested in Statistics through Psychology, and I have often felt that part of why statistics seems boring is this divorce from content. Although to be fair, I felt my first undergraduate statistics course, which was taught in a psychology department, was completely meaningless, and I had no sense of what I was doing. This might seem crazy, but I wonder if we should just do only descriptive statistics in the introductory course, or at least mostly descriptive statistics, and I include regression as a descriptive statistic as well as data visualization ideas. There is lots of population-level data to be had, and data summaries are actually very non-trivial to get in many cases and require a lot of thought. I think of something as simple as “rates”: what is the right numerator and denominator and such… all those issues.

        Then beyond that we could talk about interpretation only, some ideas of bias in sampling, etc. Maybe calculate some simple confidence intervals to get a sense of what a margin of error is for poll results.
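        That margin-of-error calculation can stay tiny; a sketch with invented poll numbers:

        ```r
        p_hat <- 0.52   # 52% support in a (made-up) simple random sample
        n     <- 800
        moe   <- 1.96 * sqrt(p_hat * (1 - p_hat) / n)
        c(estimate = p_hat, lower = p_hat - moe, upper = p_hat + moe, margin = moe)
        ```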

        We could save actually doing statistical inference for more advanced courses.

  8. This reveals my lack of coding ability, but here goes: I like to give students tools that allow them to work directly with meaningful data as quickly as possible (personally I find JMP is best for this). I encourage them to do as much as possible with descriptive analysis and visualization. The stats can come later, as can tools that require them to code. Of course, for some domains (e.g., engineering), the tools appropriate for their early exposure may differ, since programming may come more naturally for some groups. But for many students of business, social science, health sciences, etc. I think it is more important that they work with domain-relevant real data before we worry about how they document their workflows. I don’t mean to undervalue the latter, but it is a matter of what the most sensible order of things should be (timing as Rahul puts it above).

    This goes for things such as how to interpret confidence intervals – a pet favorite of mine which has been beaten to (not quite) death on this blog. Yes, it is important – at some point – for students to understand the correct interpretation of a confidence interval. But I don’t think that is the most important thing. I’d rather see them misinterpret the interval (as in, “there is a 95% chance that the true parameter lies in this range…”) but get a solid feel for just how uncertain the results of one particular analysis actually are, than worry about them getting the correct interpretation based on repeated sampling. Frankly, I think the latter actually diverts introductory students from really understanding what the particular data they are looking at are capable of telling them. This may kick off yet another of the continuing debates about confidence intervals – but after the last one with over 100 comments, I was left feeling no clearer about the issues than at the beginning (and, for me, the idea of “bet proof” does not elucidate matters). The point of my comment is that I think the timing of issues is more important, and deserves more attention, than we have given it in the past. Everything is important, but not everything is equally important or needs to be covered at the same time.

    • +1

      e.g. Distributed version control and code profilers are great tools. But if I am teaching a high-schooler his first programming class and I devote sessions to Git and gprof, I am not sure whether he will appreciate them.

      Unless you have had the chance to work with bloated slow programs or handled a codebase with hundreds of source files etc. the advantages of such tools may not be obvious.

      So also, students need to experience a certain amount of pain and frustration handling data in a naive way before they will be receptive to the need for a lot of statistical tools.

      • Good points by both Dale and Rahul. As with many things, there is no single “correct answer.” But we do need to be open to trying alternatives to see if and when they work (especially when what we’re doing isn’t working very well!)

        • Also: When I wrote my comment suggesting the addition of a report-generating tool to Daniel’s proposal, I was thinking in terms of “what I would hope a teacher would do if I were a college undergrad” — but maybe that is not a good perspective to be guided by.

    • I really wanted that “bet proof” thing to be somehow elucidating, in other words, that bet-proof intervals would have some guarantees of nicer properties than non-bet-proof ones etc, but in the end, I just found that it was yet another way in which it’s just better to go Bayesian or go home.

      I agree with you about giving people data and letting them see how it might be useful, ask questions, think about what would be required to answer them, etc. I think RStudio would work fine with a decent quick-reference guide, and requires no special SAS software license.

      I think giving some structure to the concepts that might be useful to answer the questions early on is helpful. For example, teaching “algebra based” physics is I think a mistake (that’s physics where you talk about some of the concepts and then memorize algebraic formulas whose derivations require calculus, and then solve problems using the formulas). Calculus was designed for physics, and physics begs for calculus. My personal taste is for using an algebraic form of calculus (nonstandard analysis) to teach the calculus, and this was Newton’s preference clearly as well. Too bad it took 300 years for someone to come up with a number system that admitted an infinitesimal. In the end, physics and calculus go hand in hand and you need the logical structure of calculus to answer questions about physics.

      I think the same thing is true for Bayesian probability and statistics. Bayesian probability is *the* way to describe some mathematical model in which you don’t know the precise numerical quantities involved but you have some relatively more or less plausible values. If you teach it that way, it will seem very natural. Show the model, ask the class what number they should plug in for some unknown q, they will not have an answer, then ask them questions about order of magnitude, build up some sort of distribution for this number, and then have them generate or download a data-set, and try to find out more about the number q. In the end, use samples of q to predict the results of some experiment or the analysis of a second data set on the same question, or whatever.

      Teaching it this way – as a tool for proceeding with a mathematical analysis when you don’t have precise prior answers for the numerical values to be used – is the way to go early on, I think. It lets people move forward with the questions and the models without getting hung up on the fact that they don’t know what numbers to plug in. More thinking… fewer tables of the 95th percentile points of the t distribution with n degrees of freedom.
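      A toy sketch of that kind of exercise (the unknown q, the prior, and the data are all invented here): elicit a rough prior for q on a grid, apply Bayes’ rule, and push posterior draws of q through a prediction.

```python
import numpy as np

# Hypothetical unknown quantity q: a proportion between 0 and 1.
# A grid approximation keeps the mechanics transparent for a class.
q_grid = np.linspace(0.001, 0.999, 999)

# Prior built from rough order-of-magnitude judgments:
# suppose the class thinks q is probably small, somewhere near 0.1.
prior = np.exp(-0.5 * ((q_grid - 0.10) / 0.05) ** 2)
prior /= prior.sum()

# Made-up data: 7 "successes" out of 40 trials.
successes, trials = 7, 40
likelihood = q_grid ** successes * (1 - q_grid) ** (trials - successes)

# Bayes' rule on the grid: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

# Draw samples of q and push the uncertainty through a prediction,
# e.g. the number of successes in a new batch of 100 trials.
draws = np.random.choice(q_grid, size=5000, p=posterior)
predicted = np.random.binomial(100, draws)
print("posterior mean of q:", (q_grid * posterior).sum())
print("90% interval for new-batch successes:", np.percentile(predicted, [5, 95]))
```

      The deliverable of the exercise is the spread of those posterior draws and predictions, not any cutoff looked up in a table.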

      • I’m not sure I can express very well what I want to say in response, but here’s a try:

        I think you are not saying some important things, possibly because they are so obvious to you that you forget that they may not be obvious to everyone (in particular, to many learners). Likewise, I probably have trouble articulating them because they are so natural to me. The things I have in mind are uncertainty and continuous thinking (which perhaps is best described for some as “shades of gray”). What I think many students need very fundamentally is to get out of thinking of the world in discrete and certain terms.

        • Martha: Teaching is hard. Knowing what the students need is difficult. But also there is a body of evidence that teachers who do a good job as measured by their students doing well in later classes get poor ratings from students… Learning is hard. Science is hard. I think this explains why so many shortcuts and poor practices occur, much of it is a sort of path-of-least-resistance.

          I agree with you about uncertainty and emphasizing a sort of spectrum of possibilities. I truly think that I have good ideas about how to construct a mathematical modeling and statistics sequence, and I could do it even as a sequence of YouTube videos, PDFs and online datasets etc, but I’m not sure who the audience would be, nor how to fund it.

        • Daniel:
          Not sure from your reply, but possibly my comment came across as more critical than intended — I was just trying to articulate some things that I suspected might have been in your mind but were not articulated (for understandable reasons) in what you wrote.

          Re “I’m not sure who the audience would be”: In the real world, we often have to start with the audience we are handed and then figure out how best to teach them.

          Re “teachers who do a good job as measured by their students doing well in later classes get poor ratings from students”: Probably the most rewarding moments in many years of teaching have come when a former student has contacted me (perhaps a semester later, perhaps several years later) to tell me something like, “I didn’t like the way you taught at the time, but now I understand why you did teach that way, and am glad you did” — and maybe even elaborate on why.

        • As the Aussies say: No worries mate!

          I don’t have an audience handed to me, as I’m not currently a teacher. Perhaps I should be, but teaching as a career is in crisis; the adjuncts who do most of the real teaching at many universities make effectively less than minimum wage, and they live or die by their teaching reviews – which, if the claim about reviews being negatively correlated with long-term effectiveness is right… well, it’s a sad state of affairs.

        • +1. I like Robert Sapolsky’s lecture bit/meditation on categorical thinking about continuous stuff here: https://youtu.be/NNnIGh9g6fA?t=8m14s

          It may have some useful stuff in it for people trying to figure out how to teach this (even if you are skeptical of some of the specific claims).

        • Maybe. That aside, it’s still possible that he has some useful teaching examples on the topic of this discussion.

        • Sapolsky does come across as arrogant, but I certainly would not call him a moron. He is making a slightly different point than the one I think we need to emphasize in helping people learn statistics – I consider the latter to be that many things in real life come in a continuous rather than discrete (categorical) fashion. He seems to be making the point (a valid one) that there are different ways to break up a continuum into categories.

        • Martha, you are certainly right about the point he is getting at in the end. But I think the earlier part of the lecture does a good job of making the point you care about: most real life things come as continua, the way we break that continuum into categories is arbitrary. Breaking into categories (and dichotomies) is a feature of how we like to think, not how things are. We should not forget it. Actually, I had forgotten that he ends on a different note.

        • JY-H: “Breaking into categories (and dichotomies) is a feature of how we like to think”

          I wonder how much of the “how we like to think” is inherent in being human and how much is learned (cultural). I’d like to think it’s the latter — that children could be brought up seeing the world continuously rather than discretely — although there might be some inherent developmental “programming” involved that wouldn’t allow this from birth, but that would still allow development of continuous thinking at an early age, given the right environment/teaching/experience.

        • I am responding to Jason, but I had to click the nearest “reply to this comment” in the thread which is Martha’s post. Don’t know what to do about this…am I missing something? Anyway, here is my post:

          Jason: Martha, you are certainly right about the point he is getting at in the end. But I think the earlier part of the lecture does a good job of making the point you care about: most real life things come as continua, the way we break that continuum into categories is arbitrary.

          GS: Actually, it is not arbitrary in a very important sense: The distinctions we make as scientists tend toward ones that are useful in the prediction and control (for experimental sciences) of the world. For example, “species” is a pretty useful term – but do species exist? No. Anyway, the “usefulness” of the distinctions we make extends to ordinary language and “non-scientific” endeavors.

          BTW, the view that terms like “species” refer to actual things is sometimes called “essentialism.”

          Jason: Breaking into categories (and dichotomies) is a feature of how we like to think, not how things are.

          GS: Actually, it is a fundamental feature of the behavior of many different kinds of animals. Some call it “stimulus discrimination” and it is widely studied. And “the way things are” plays a huge role. Discrimination depends (to a great extent) on dependencies between stimuli and dependencies between behavior and events (i.e., consequences). Such “dependencies” are, as it were, features of the world. But, of course, “contingencies of reinforcement” are abstracted out of the myriad of behavioral and stimulus events arrayed across time.

          Jason: We should not forget it. Actually, I had forgotten that he ends on a different note.

          GS: I like explaining stuff that “The Big Sap” (Sapolsky) talks about in terms appropriate to behavior analysis and radical behaviorism since The Sap perpetuates every myth about Skinner, behavior analysis and behaviorism passed down across generations of “scholars.” Though common, it is still poor scholarship and, when delivered in his arrogant style, quite annoying to someone who actually knows something about behavior analysis and radical behaviorism.

        • Martha: If I recall correctly, Andrew’s sister studies something pretty close to this. My impression is that the people who think about this a lot do think it’s somewhat built in. As with most cognitive heuristics, it can’t hurt though to learn how to be aware of the impulse and on guard for when it can lead us astray.

          Glen: I think we are on the same page here. I know some people make the strong form of the argument that you are attacking, but I didn’t intend it. Perhaps I was a little hyperbolic. That’s probably because the mistake of thinking categories/demarcations are real and immutable is far more common than the opposite mistake of thinking that all human constructs are completely arbitrary. I would never argue that how we (and animals) think and how the world is have no correspondence (in fact, I’m pretty familiar with that line of work). But it’s also important, especially in the context of this discussion, to remember that the correspondence is far from perfect.

          On the Skinner front, I have no skin in that game, but I can see where you are coming from. For the record, I appreciate the fuller critique over the unexplained slur.

          In any event, I posted the link because I think the early part of the lecture has some evocative and uncontroversial examples of how categorization is both useful and suspect. It’s not an endorsement of any other kind, especially of the ending section.

    • +1 from me too. I just recently learned what that means. Yes I had to look it up.

      I like JMP. We teach Intro with SPSS, again client disciplines… “sig level,” anyone? I think SPSS is a big part of the problem, but that is probably off topic. I also have taught intro with StatCrunch and liked that OK.

  9. I’ve spent most of my career teaching audiences that are not good at mathematics and are not very tolerant of seeing any. For the most part this has been business students. Undergraduate business students – clearly quantitatively hostile. Graduate business students (and I’ve had plenty) – quantitatively literate if they have an engineering background (many do), but otherwise much like the undergrads. However, the grad business students I teach are all working professionals. They are motivated, smart, but very impatient with math – unless they see a clear need for it. I think much of the advice here misses the mark for my audience – unless you think the purpose is to deter these people from doing anything statistical (and I’ve known statisticians who have this view – that such people should be kept out of this business).

    I don’t agree with raising the entry bar to working with data. I want to lower it. But I do think that embracing uncertainty and understanding variation are fundamental and critical, and an introductory course that does not focus on them is a failure. But I don’t see that calculus is necessary for this, and it actually impedes my audience’s understanding. Sometimes the comments on these topics seem aimed at teaching the top 5% of students. In an alternative universe, we would not have so many colleges and universities and we would not expect everyone to get a college degree. In that world, perhaps we would be free to require all kinds of math skills of anybody who wants to study statistics. But in the world I have taught in for 40 years, that is a poor model.

    I hope I don’t come across as saying we should water down the intro stats course. Without using calculus or much algebra, I teach that course and hear that it is the most difficult course students have ever taken. And that is using JMP (lots of mouse clicks) and real data. The critical thinking is what is hard – and putting everything in abstract terms is neither necessary nor sufficient for critical thinking.

      • Since the adoption of learning management systems (Moodle in this case), my syllabi and notes are hidden behind limited access. The only thing that appears publicly is an abbreviated syllabus – and I fear this looks entirely traditional. In fact, it still has too much statistical inference, as these topics are expected to be covered. But I do cover them in nontraditional ways and I might not even cover them at all in the near future. But I did want to say that I rarely use slides. My courses are almost all project based – with undergrads I usually use data on the students themselves (with all of the messiness it entails, such as miscoded data, etc.). With grad students, I often use datasets from Kaggle competitions or large public databases. Homework comprises most of the course. As many others have said, it is difficult to find a good textbook. De Veaux is good. Stine and Foster have an excellent business statistics text, but it is expensive and would take me about 2 years to cover. I constantly fight with the expectation of covering a bunch of topics I don’t think are important – and I am learning are downright destructive (NHST, probability rules, etc.). I believe the first course is about working with data, cleaning it, visualizing it, looking for complex relationships, appreciating its limitations.

        I often use the excerpt from one of Tufte’s books (someone else’s quote) characterizing all of social science and medical research, summed up as:

        “1. Some do, some don’t.
        2. The differences aren’t really very large.
        3. It’s more complicated than that.”

        I think that quote works for both classical statisticians and Bayesians alike.

    • Dale: “The critical thinking is what is hard – and putting everything in abstract terms is neither necessary nor sufficient for critical thinking.”

      How does one think critically about uncertainty in anything other than abstract terms?
      (Any model or representation of an empirical question is abstract.)

      In the exercises and sessions I have tried out with folks, where I avoided all but simple arithmetic, the need for abstract thinking of some kind became more rather than less apparent (e.g. https://galtonbayesianmachine.shinyapps.io/GaltonBayesianMachine/ )

      • Keith
        I like your little demo. And I agree with you about abstract thinking. I did not express myself clearly – what I meant by abstract “terms” is using a particular language (e.g., calculus, algebra, etc.) for being abstract. In general, I try to use the simplest abstraction possible and I think some of the advice people have been giving does not comport with that. For example, I have not found “bet-proof” to be a simple abstraction (I’m sure it is for some people, but not for me, and I don’t think it would work for the vast majority of my students).

    • Dale, I really like your perspective. JMP is my favorite point-and-click software, and I also liked StatCrunch. We teach intro with SPSS and I feel SPSS makes things worse… “Sig level”…

    • Dale: in my comment I wasn’t asserting the need to teach calculus to business students, but rather claiming that there is a similar correspondence: you can’t meaningfully teach physics without the basic concepts of calculus (rates of change = derivatives, the whole is the sum of the parts = integration), in the same way you can’t teach statistics without probability (some things are more reasonable, some are less, the total reasonableness of something is the sum of the reasonablenesses of the different ways it could occur (sum rule)… the reasonableness of two things together is the reasonableness of one of them times the reasonableness of the other if you know the first occurred (product rule)).

      Formalism and symbology and tricks for integrating distributions and so forth… not so much, but the basic ideas of the sum and product rules and conditional-probability thinking should be taught, even if you only teach them graphically via least-squares linear regression and looking at vertical slices through scatter plots. Formulas for computing regression coefficients, etc., are just the wrong way to teach statistics, and if that’s your point I’m 100% with you.
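      A toy numerical version of those two rules (all numbers invented), just to underline that nothing beyond arithmetic is involved:

```python
# Invented joint plausibilities for a toy example: a student passes (P)
# or fails (F), crossed with whether they did the homework (H) or not (N).
joint = {("P", "H"): 0.45, ("P", "N"): 0.15,
         ("F", "H"): 0.10, ("F", "N"): 0.30}

# Sum rule: the plausibility of passing is the sum over the ways it can happen.
p_pass = joint[("P", "H")] + joint[("P", "N")]          # 0.60

# Product rule: P(pass and homework) = P(homework) * P(pass | homework).
p_homework = joint[("P", "H")] + joint[("F", "H")]      # 0.55
p_pass_given_homework = joint[("P", "H")] / p_homework  # ~0.82

print(p_pass)
print(p_homework * p_pass_given_homework)  # recovers joint[("P", "H")] = 0.45
```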

    • Glen:

      This is kinda exhausting, and I guess the real lesson is that I shouldn’t spend my time reading blog comments, but . . . I have multiple times recommended that researchers use single-subject designs to resolve some of their problems. So it’s not true that these designs are offered as a remedy “not at all.”

      • AG: This is kinda exhausting, and I guess the real lesson is that I shouldn’t spend my time reading blog comments, but . . . I have multiple times recommended that researchers use single-subject designs to resolve some of their problems. So it’s not true that these designs are offered as a remedy “not at all.”

        GS: That’s interesting because, when I stumbled upon this blog, I saw that a lot of threads had something to do with “the replication crisis,” but no mention of SSDs as a possible remedy. Not one. Now, of course, after I started posting, I see a mention here and there (I guess to the point of exhaustion), usually in response to my posts. And I have followed the increasingly common discussion that has run through the literature on this issue (broadly defined). Even among the ban-the-p-value folks, there is virtually never a mention of SSDs as a remedy. So “not at all” may be hyperbolic – if the threshold for hyperbole is low…

        • Glen:

          I’ve been writing about this topic long before you stumbled upon this blog. I’m sure I haven’t written as much about single subject designs as you’d like, but the amount has been far from zero.

        • My posts, though, aren’t necessarily “about you” but, rather, about some aspects of scientific endeavor. Isn’t that appropriate to this blog? Anyway…if you look at the most recent high-profile musings of, say, Nosek and Ioannidis, is there any mention of SSDs? The notion that SSDs are widely discussed (you being the exception, of course) is sheer nonsense. And, again, I’m well aware that SSDs are not amenable to a lot of stuff out there and a lot of stuff of interest to readers of this blog. But the literature, and this blog, definitely touches on psychology, medicine and (to some small extent) education – where appropriate, their use short-circuits any discussion of p-hacking, or sample size, power etc. etc. etc.

        • We should model the individual rather than a group whenever possible, since that is where the process of interest is (usually) occurring. There are some relatively rare cases where we care more about a group-level process though (eg SIR models to study epidemics).

          I may not be clear on the scope of the term “single subject design”, but it looks like common sense to me. It looks basically like what people did before NHST was adopted. Do you have any good links to the history behind the term?

        • A: We should model the individual rather than a group whenever possible, since that is where the process of interest is (usually) occurring. There are some relatively rare cases where we care more about a group-level process though (eg SIR models to study epidemics).

          I may not be clear on the scope of the term “single subject design”, but it looks like common sense to me.

          GS: Hmm…interesting that you should mention that. Affirming the consequent as a form of scientific reasoning is closely related to very fundamental behavioral processes – like those demonstrated by rats pressing levers “for food” (the quote marks are there to warn the reader that I don’t have to speak teleologically). The rat “infers causation” when food follows lever-presses, but, of course, may do so when the temporal relation is really adventitious – you could say that a primary “task” for birds and mammals (to mention the most obvious) is detecting a “dependency signal” in a sea of adventitious temporal relations. Hey! Just like scientists!

          A: It looks basically like what people did before NHST was adopted.

          GS: Well, I hope you see the irony in that “it is what [many species] did” since conditioning arose (let alone before NHST!). But, yes, you can find examples of affirming the consequent all through science. But SSDs with their stable-states and reversals etc., and the attitude toward variability, reliability and generality, are merely a subset.

          A: Do you have any good links to the history behind the term?

          GS: Hmm…that’s an interesting question – I’m not sure I have ever heard of origins of the term. As to the method, it is often said that Claude Bernard codified some of it in “An Introduction to the Study of Experimental Medicine” and I’m sure that I read a paper somewhere (probably by a behavior analyst – a kind of psychologist that uses SSDs all the time – a science in which the foundation is rooted in SSDs) to that effect. In psychology (at least a very, very small part of psychology), there is a great deal of formal discussion about this topic and related topics. The view is embodied in the origin of behavior analysis (Skinner called it “the experimental analysis of behavior”) as exemplified by “The Behavior of Organisms” (Skinner, 1938) and the subsequent development of the field (as exemplified by the whole history of “The Journal of the Experimental Analysis of Behavior” – beginning in 1958). As to more pointed discussions of method (but still within the behavior analytic tradition) one would have to start with Sidman’s (1960) “Tactics of Scientific Research: Evaluating Experimental Data in Psychology.” Anyway…hope that helps.

          G.

        • I think one reason why single-subject designs are not often used is because very often the people doing the research don’t want to acknowledge individual variability — they instead are seeking to get a “result” such as “this intervention produces this result” without all the real-life nastiness of individual differences. Randomized group experiments give them the illusion of a “definitive” result that they seek. It fits into the reality they want, not the reality that is real.

        • Excellent point, Martha. As Blastland and Spiegelhalter put it: “The average is an abstraction. The reality is variation” (The Norm Chronicles). People want to model the abstraction, ignoring the reality. I think this is partly out of ignorance, but also partly out of a naive belief that you can learn about a group by aggregating (complete pooling, in Gelman and Hill terms). Multilevel models are a good solution, but there too people just treat the individual variability and the variance components like nuisance variables. Almost nobody in my field looks at them, and they are never reported.
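          A small simulation of that point, with invented numbers: each individual gets their own true effect, complete pooling throws that variation away, and a crude precision-weighted shrinkage (a stand-in for a full multilevel fit) keeps it visible without taking the noisy per-person means at face value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented setup: 20 individuals, each with their own true effect,
# measured 5 times with noise.
n_subj, n_obs = 20, 5
true_effects = rng.normal(loc=0.5, scale=1.0, size=n_subj)  # real individual variation
data = true_effects[:, None] + rng.normal(0.0, 2.0, size=(n_subj, n_obs))

# Complete pooling: one number for everybody.
complete_pooling = data.mean()

# No pooling: each individual's own (noisy) mean.
no_pooling = data.mean(axis=1)

# Partial pooling: moment-based shrinkage toward the grand mean,
# a rough stand-in for a hierarchical model.
sigma2_within = 2.0 ** 2 / n_obs
tau2_between = max(no_pooling.var() - sigma2_within, 0.01)
weight = tau2_between / (tau2_between + sigma2_within)
partial_pooling = complete_pooling + weight * (no_pooling - complete_pooling)

print("grand mean:", round(float(complete_pooling), 2))
print("spread of no-pooling estimates:", round(float(no_pooling.std()), 2))
print("spread of partially pooled estimates:", round(float(partial_pooling.std()), 2))
```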

        • @Martha @Shravan:

          Naive question: If Andrew etc. advocate single-subject designs so vehemently (with which I totally agree), how come we don’t insist on surveys re-sampling at least some fraction of the survey cohort? I rarely see any surveys do that.

          I.e., say I am going to poll people about their party leanings or opinions toward gay rights, etc. – shouldn’t I be worried about the stability of the responses? Say I used a different surveyor and approached the same subject a few weeks later: how similar would his responses be?

          In other words, are we actually modelling the subject’s opinion, or is he just responding in an ad hoc fashion, or is the surveyor not asking questions correctly, or is the person just not interested, or does he have no real opinion and is just saying something because he has been put on the spot? Yada yada.

          I never got a satisfactory answer to this. Every time I broach it I get something like “we checked this two decades ago in one survey and didn’t get much variability”

          But isn’t this essentially the same issue as a single-subject design?

        • Rahul, I don’t have anything except theoretical knowledge of survey sampling (the type you get from doing stats courses on survey sampling, probably nothing like the real thing).

          But how would one deal with the fact that the person being resampled is being asked the same question a second time? This could bias their response.

        • @Shravan

          I don’t know. But not validating your survey just because there *may* be a bias sounds like a bad idea to me.

          E.g., you’d never trust a chemistry experiment based on a single pH measurement, right? Why should we blindly trust a survey questionnaire? Why no replicates?

          I’m pretty sure that if you asked someone their age/weight/birthplace etc. multiple times you’d get good replication. But if you *don’t* get it on a particular topic (say gay rights), isn’t all subsequent analysis of that survey data essentially GIGO?
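          A toy version of the check being asked for, with entirely invented resurvey numbers: re-ask a subsample the same item in a second wave and simply tabulate how much the answers move.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented: 200 respondents re-asked a 5-point opinion item a few weeks later.
# Each person has a "true" position but answers with some slippage.
true_position = rng.integers(1, 6, size=200)
wave1 = np.clip(true_position + rng.integers(-1, 2, size=200), 1, 5)
wave2 = np.clip(true_position + rng.integers(-1, 2, size=200), 1, 5)

exact_agreement = np.mean(wave1 == wave2)
within_one = np.mean(np.abs(wave1 - wave2) <= 1)
switched_sides = np.mean((wave1 > 3) != (wave2 > 3))  # crossed the scale midpoint

print("exact agreement:", round(float(exact_agreement), 2))
print("agreement within one category:", round(float(within_one), 2))
print("share who switched sides of the midpoint:", round(float(switched_sides), 2))
```

          If numbers like these came out badly for the item you actually care about, it is hard to see how any downstream modeling of that item would repair them.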

        • @Rahul:

          Yes, a very good point. I wouldn’t be very surprised if someone somewhere out there has tried to do “longitudinal resurveys”, but I can well believe it is difficult – probably expensive, too. I suspect it won’t be done much until granting agencies consider it a priority. (And probably similarly for single-subject designs of any kind).
          But I think this is one of many things that was brought up (or should have been brought up) in the discussions of how most election polls “failed” to predict Trump’s victory.

        • Martha: I think one reason why single-subject designs are not often used is because very often the people doing the research don’t want to acknowledge individual variability — they instead are seeking to get a “result” such as “this intervention produces this result” without all the real-life nastiness of individual differences. Randomized group experiments give them the illusion of a “definitive” result that they seek. It fits into the reality they want, not the reality that is real.

          GS: First of all, many researchers simply have not heard of them (despite what Andrew claims) – group designs are what is taught and what “everybody does.”

          Second, compared to the “use 300 subjects and finish the experiment in two weeks” way of doing things, using SSDs is extremely time-consuming and, despite the astonishingly small number of subjects (compared to between-groups experiments), relatively expensive. For example, take a standard schedule of reinforcement for looking at drug effects – the fixed-interval schedule (look it up if you’re interested). After the animals are trained to eat from the feeder, press the lever, etc., and the final FI parameters are arranged, some measures (e.g., those describing response rates across the interval – the rate varies, tending to accelerate across the interval) take 5-6 months to stabilize (and that’s with experimental sessions conducted daily). So that’s 6 months of a lot of effort before you can even start to give drugs!

        • @Rahul etc

          Isn’t the longitudinal resurvey idea essentially what the USC Dornsife election poll was doing? The discussion around the methods of that poll is interesting.

        • @Martha

          I think it is considered boring drudgery to go back and simply re-sample 10% of your survey cohort.

          OTOH, if you claim you have some super-duper method that allows you to do something fancy without having to re-sample, I think that is perceived as a lot more exciting.

        • Hmm…interesting that you should mention that. Affirming the consequent as a form of scientific reasoning

          Sorry Glen, you will have to make the connection between what I wrote and affirming the consequent for me.

          I happen to think both verification and falsification are impossible in practice btw. Something like this: https://i.imgur.com/dI1XVCp.png

        • A: Sorry Glen, you will have to make the connection between what I wrote and affirming the consequent for me.

          GS: Well…you said: “I may not be clear on the scope of the term “single subject design”, but it looks like common sense to me.” Thus, you are talking about SSDs, and SSDs involve affirming the consequent. That’s pretty much the connection.

          A: I happen to think both verification and falsification are impossible in practice btw.

          GS: Yet….there is science. Anyway…not sure what you are driving at and the link you posted doesn’t help – at least so far…I’ll keep looking at it. Well…I can see that “affirming the consequent” is called an “invalid argument” and, needless to say, from a logical standpoint, that is undeniable – affirming the consequent is a logical fallacy. It is too bad for those who would paint science as a subfield of logic that affirming the consequent has played a large role in science. Not sure why you would say falsification is impossible in practice – perhaps you are talking about the fact that “failed” theories are simply modified post-hoc after hypotheses generated by them are proven faulty. Something like that?

          A: Something like this: https://i.imgur.com/dI1XVCp.png

          GS: Well…like I said, I will keep pondering the material in the link…

        • Thus, you are talking about SSDs, and SSDs involve affirming the consequent…affirming the consequent has played a large role in science

          What makes you say that? I have heard similar arguments before… they were confused between the use of Bayes’ rule (good) and affirming the consequent (bad).

          Found it:

          I do not understand this post. You start out with this (incorrect) claim:
          “In science we standardly use a logically non-valid inference — the fallacy of affirming the consequent”

          Then you go on to (correctly) explain that science functions by reaching “a strong conclusion by ruling out the alternatives in the set of contrasting explanations”, and that we choose to work with “a satisfactory explanation that can explain the fact/evidence better than any other competing explanation”.

          This is not affirming the consequent. It is Modus tollens and Bayesian reasoning. Explanations are ruled out for being inconsistent with the evidence. Or if the evidence is consistent with the theory, we assess the probability of the theory given the evidence (which according to Bayes’ rule depends on the probability of the evidence given all the other theories). It isn’t really a matter of a theory being true, just the most probable.

          https://larspsyll.wordpress.com/2016/02/13/why-science-necessarily-involves-a-logical-fallacy/

        • A (quoting me): Thus, you are talking about SSDs, and SSDs involve affirming the consequent…affirming the consequent has played a large role in science

          A: What makes you say that?

          GS: The logic of SSDs is: if the variable I manipulate is an important controlling variable, then the data following the manipulation will change WRT the baseline (actually, the data will change WRT baseline and stabilize at some other level; SSDs work best when one can obtain reversible stable-states). The data changed WRT the preceding baseline and, thus, the variable is an important controlling variable. That’s affirming the consequent. No? SSDs are basically no different than, “If I flip this switch and the light comes on, the light switch is controlling the light.” And you can throw the reversal in there and flip the switch as many times as you like. Anyway, it’s all affirming the consequent.

          A: I have heard similar arguments before… they were confused between the use of Bayes’ rule (good) and affirming the consequent (bad).

          Found it:

          I do not understand this post. You start out with this (incorrect) claim:
          “In science we standardly use a logically non-valid inference — the fallacy of affirming the consequent”

          Then you go on to (correctly) explain that science functions by reaching “a strong conclusion by ruling out the alternatives in the set of contrasting explanations”, and that we choose to work with “a satisfactory explanation that can explain the fact/evidence better than any other competing explanation”.

          GS: OK…first of all, many people will attribute what you just said to me, but you are quoting some other person who is unknown to me. As to the first statement by Person X – I agree. We use affirming the consequent all the time. As to the second part, beginning “a strong conclusion…”, it is not something I would say.

          A: This is not affirming the consequent.

          GS: Unfortunately, I don’t know what you mean by “this.” But I can say that SSDs rely on affirming the consequent.

          A: It is Modus tollens and Bayesian reasoning. Explanations are ruled out for being inconsistent with the evidence.

          GS: Err…yes…but I didn’t say it, or anything like it. I wasn’t ever talking about ruling anything out. And I wasn’t talking about any theory or theories in general. One can, using SSDs, examine the role of some independent variable without having any theory. No? Anyway, we do not seem to be “on the same page.”

          A: Or if the evidence is consistent with the theory, we assess the probability of the theory given the evidence (which according to Bayes’ rule depends on the probability of the evidence given all the other theories). It isn’t really a matter of a theory being true, just the most probable.

          https://larspsyll.wordpress.com/2016/02/13/why-science-necessarily-involves-a-logical-fallacy/

          GS: Well, you appear to be talking to someone else, but I will look at the link.

        • SSDs are basically no different than, “If I flip this switch and the light comes on, the light switch is controlling the light.” And you can throw the reversal in there and flip the switch as many times as you like. Anyway, it’s all affirming the consequent.

          If all you consider is the manipulation and not any other explanations, it is not science. In science, some argument needs to be made against the “other explanations”. This is an application of Bayes’ rule, not affirming the consequent.
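          A toy calculation of that distinction (the probabilities are invented): the manipulation hypothesis gains weight only to the extent that the observed change would have been unlikely under the rival explanation, which is not the same as “the change occurred, therefore the manipulation worked.”

```python
# Two candidate explanations for a change in the baseline, with invented numbers.
# H1: the manipulated variable did it.  H2: something else (drift, etc.) did it.
prior = {"H1": 0.5, "H2": 0.5}

# Assumed probability of seeing the observed change under each explanation.
p_change = {"H1": 0.9, "H2": 0.2}

# Bayes' rule: weight each explanation by how well it predicts the evidence.
unnormalized = {h: prior[h] * p_change[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: unnormalized[h] / total for h in prior}

print(posterior)  # H1 ~ 0.82, H2 ~ 0.18: favored, not proven
```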

        • GS (previous): SSDs are basically no different than, “If I flip this switch and the light comes on, the light switch is controlling the light.” And you can throw the reversal in there and flip the switch as many times as you like. Anyway, it’s all affirming the consequent.

          A: If all you consider is the manipulation and not any other explanations, it is not science. In science, some argument needs to be made against the “other explanations”. This is an application of Bayes’ rule, not affirming the consequent.

          GS:
          1.) If one is simply asking if a particular variable has an effect, the only “other explanation” that is considered is general…that something other than the manipulated variable is responsible for any change in baseline. That is why one generally returns to the “A” conditions to see if the baseline returns to its previous level. This is often followed by a return to the “B” condition. I’m guessing that that is not what you meant – that any assertion implies its negation as a possibility. My guess is that you were really talking about the (undergraduate psychology) textbook description of how science works…hypothesis testing and all that…maybe not…if I’m wrong, feel free to correct me.
          2.) In SSDs, one:
          a.) obtains a baseline
          b.) manipulates some variable
          c.) assesses any effect, if any, on the resulting baseline under the new conditions

          If there is a change in the stable-state after manipulation of a variable, the inference is that it probably was the variable change that was responsible for the change in the baseline. That is affirming the consequent. Look it up. That particular topic is not up for discussion. To deny that that is affirming the consequent is, well, an “alternative fact.” There is, I’m pretty certain, a Bayesian description of the whole process of an SSD-type experiment. That is, I’m guessing that anything acceptable to practitioners of SSDs will have some Bayesian interpretation. I mean, just look at the “repeatedness” surrounding SSDs and how confidence in the relevant probabilities changes as a function of repeated observations: the baselines themselves require multiple observations before the probability is deemed very high that future datum points will be distributed as they have been. Then, of course, there is the repeated imposition of both the “A” and “B” conditions. [People talk about “ABA” designs, but it is quite common to repeatedly impose all conditions (e.g. ABAB etc.)]. I’m guessing, as I implied, that the experts in Bayesian thought here can describe SSDs in terms of Bayes’ Theorem. I have played around thinking about this stuff, but to call me an amateur gives me way too much credit.
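          For what it’s worth, here is a toy simulation (all numbers invented) of the ABAB logic just described – baseline, manipulation, reversal, re-manipulation – where the comparison of interest is the within-subject shift in level between conditions; a Bayesian treatment would put a model on exactly these repeated phase means.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented single subject: response rate per session, 20 sessions per phase.
# "A" phases: baseline.  "B" phases: the manipulated variable is in effect.
baseline_rate, effect = 10.0, 4.0
phases = ["A", "B", "A", "B"]
sessions = {
    ph + str(i): rng.normal(baseline_rate + (effect if ph == "B" else 0.0), 1.5, size=20)
    for i, ph in enumerate(phases)
}

# The within-subject comparison: does the level shift with the condition,
# and does it return when the condition is withdrawn?
for name, y in sessions.items():
    print(name, "mean:", round(float(y.mean()), 1))

a_mean = np.concatenate([sessions["A0"], sessions["A2"]]).mean()
b_mean = np.concatenate([sessions["B1"], sessions["B3"]]).mean()
print("B minus A:", round(float(b_mean - a_mean), 1))
```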
