The physicist and science critic Alan Sokal writes:

I just came across your paper “Abandon statistical significance”. I basically agree with your point of view, but I think you could have done more to *distinguish* clearly between several different issues:

1) In most problems in the biomedical and social sciences, the possible hypotheses are parametrized by a *continuous* variable (or vector of variables), or at least one that can be reasonably approximated as continuous. So it is conceptually wrong to discretize or dichotomize the possible hypotheses. [The same goes for the data: usually it is continuous, or at least discrete with a large number of possible values, and it is silly to artificially dichotomize or trichotomize it.]

Now, in such a situation, the sharp point null hypothesis is almost certainly false: as you say, two treatments are *always* different, even if the difference is tiny.

So here the solution should be to report, not the p value for the sharp point null hypothesis, but the *complete likelihood function*, or, if it can be reasonably approximated by a Gaussian, the mean and standard deviation (or mean vector and covariance matrix).

2) The difference between the two treatments, especially if it is small, might be due, *not* to an actual difference between the two treatments, but to a systematic error in the experiment (e.g. a small failure of double-blinding, or a correlation between measurement errors and the treatment).

This is not a statistical issue, but rather an experimental and interpretive one: every experimenter must strive to reduce systematic errors to the smallest level possible AND to estimate honestly whatever systematic errors might remain; and an observed effect, even if it is statistically established beyond a reasonable doubt, can be considered “real” only if it is much larger than any plausible systematic error.

3) The likelihood function does not contain the whole story (from a Bayesian point of view), because the prior matters too. After all, even people who are not die-hard Bayesians can understand that “extraordinary claims require extraordinary evidence”. So one must try to understand, at least at the level of orders of magnitude, the prior probability of various alternative hypotheses. If only 1 out of 1000 drugs (or social interventions) has an effect anywhere near as large as the likelihood function seems to indicate, then probably the result is a false positive.

4) When practical decisions are involved (e.g. whether or not to approve a drug, whether or not to start or terminate a social program), the loss function matters too. There may be a huge difference in the losses from failing to approve a useful drug and approving a useless or harmful one, and I could imagine that in some cases those huge differences might go one way, and in other cases the other way. So the decision-makers have to analyze explicitly the loss function, and take it into account in the final decision. (But they should also always keep this analysis, which is basically economic, *separate* from the analysis of issues #1,2,3, which are basically “scientific”.)
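Points 1 and 3 of the letter can be put together in a minimal numerical sketch. It assumes the likelihood is well approximated by a Gaussian (so it is summarized by an estimate and a standard error) and that prior knowledge is represented by a normal distribution centered at zero; under those assumptions the posterior is also normal. The function name `normal_posterior` and all numbers are hypothetical, chosen only for illustration.

```python
import math

def normal_posterior(estimate, std_error, prior_mean, prior_sd):
    """Combine a Gaussian likelihood summary with a normal prior.

    Standard conjugate update: precisions (1 / variance) add, and the
    posterior mean is the precision-weighted average of the two means.
    """
    like_precision = 1.0 / std_error ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_precision = like_precision + prior_precision
    post_mean = (like_precision * estimate
                 + prior_precision * prior_mean) / post_precision
    post_sd = math.sqrt(1.0 / post_precision)
    return post_mean, post_sd

# Hypothetical numbers: an estimate two standard errors from zero,
# combined with a skeptical prior saying large effects are rare.
post_mean, post_sd = normal_posterior(
    estimate=2.0, std_error=1.0, prior_mean=0.0, prior_sd=0.5
)
print(f"posterior mean {post_mean:.3f}, posterior sd {post_sd:.3f}")
```

With a skeptical prior the posterior mean is pulled well below the raw estimate, which is the continuous version of “extraordinary claims require extraordinary evidence”; a wider prior would pull it less.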

My reply:

I agree with you on most of these points; see for example here.

Regarding your statement about the likelihood function: that’s fine but more generally I like to say that researchers should display all comparisons of interest and not select based on statistical significance. The likelihood function is a summary based on some particular model but in a lot of applied statistics there is no clear model, hence I give the more general recommendation to display all comparisons.
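As a rough sketch of what “display all comparisons” could look like in the simplest case, the code below computes every pairwise difference in means among a few groups and prints all of them with standard errors, with no filtering on significance. The group labels and measurements are made up for illustration, and the unpooled standard error is just one reasonable choice among several.

```python
import math
from itertools import combinations
from statistics import mean, stdev

# Hypothetical outcome measurements for three made-up groups.
groups = {
    "A": [5.1, 4.8, 5.6, 5.0, 5.3],
    "B": [5.9, 6.1, 5.4, 6.3, 5.8],
    "C": [5.2, 5.5, 4.9, 5.7, 5.1],
}

def compare(x, y):
    """Return the difference in means (y minus x) and its unpooled standard error."""
    diff = mean(y) - mean(x)
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return diff, se

# Every pair is kept; nothing is selected or dropped based on a threshold.
comparisons = [
    (a, b, *compare(groups[a], groups[b]))
    for a, b in combinations(groups, 2)
]

for a, b, diff, se in comparisons:
    print(f"{b} - {a}: {diff:+.2f} (se {se:.2f})")
```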

Regarding your point 2: yes on the relevance of systematic error, which is why we refer on page 1 of our paper to the “sharp point null hypothesis of zero effect and zero systematic error”! Along similar lines, see the last paragraph of this post.

Regarding your point 3, I prefer to avoid the term “false positive” in most statistical contexts because of the association of the typically nonsensical model of zero effect and zero systematic error; see here.

Regarding your point 4, yes, as we say in our paper, “For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds.”

Alan responded:

I think we are basically in agreement. My suggestion was simply to *distinguish* more clearly these 4 different issues (possibly by making them explicit and numbering them), because they are really *very* different in nature.

Why not just drop p-values altogether and focus on posterior distributions…? rstanarm, brms, etc., are making it pretty easy to switch over to a Bayesian framework. It’s the same syntax, save for a few more arguments about priors, chains, and whatnot (which replace arguments about p-value corrections, assumption arguments, etc.)

> the likelihood function: that’s fine but more generally

Sokal was vague about whether it’s the multi-parameter full likelihood, which implicitly contains all the possible comparisons, or some reduced likelihood (e.g. profile or uniform integrated) for the parameter of interest.

Perhaps the middle ground is the multi-parameter full likelihood along with the researcher’s motivated collection of reduced likelihoods (preferably integrated with respect to a motivated prior). Of course we (or an approved outside group) still need access to all the raw data (as the data generating model may turn out to be inadequate).

Cool if it’s the same Sokal as in the Sokal affair (https://en.wikipedia.org/wiki/Sokal_affair)?

I continue to enjoy Sokal’s hoax. I have presented it to my students on a few occasions, in political philosophy class, in the unit on politics and language. I wanted to see whether they found anything fishy with it. One student (then a high school junior, already taking Calculus 3 at CU) pointed out that it was wrong in terms of physics. He read aloud the sentence “It has thus become increasingly apparent that physical ‘reality’, no less than social ‘reality’, is at bottom a social and linguistic construct.” He shook his head. “That’s not true,” he said.

Another student pointed to the excessive pairing of past passive participles (“synthesized and superseded,” “relational and contextual,” “problematized and relativized”). He noted that while these pairs sounded fancy, they were basically nonsense.

As for Sokal’s points here, this part stands out: “1) In most problems in the biomedical and social sciences, the possible hypotheses are parametrized by a continuous variable (or vector of variables), or at least one that can be reasonably approximated as continuous. So it is conceptually wrong to discretize or dichotomize the possible hypotheses. [The same goes for the data: usually it is continuous, or at least discrete with a large number of possible values, and it is silly to artificially dichotomize or trichotomize it.]”

Yes, yes, yes! That’s part of what’s wrong with the “lemon introvert” test, for instance. It’s silly to claim that “introverts” salivate more than “extraverts” in response to a drop of lemon juice; from what I understand, extraversion is a continuous variable, and the relation between extraversion and salivation is noisy in the middle range and not entirely clear-cut at the extremes. I find it plausible that those who score extremely high in extraversion salivate less in response to lemon juice than those who score extremely low, but it’s misleading to discretize the findings overall. Yet people continue to claim that you can find out whether you’re an introvert or extravert by performing a lemon test at home.
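To make the cost of discretizing concrete, here is a small simulation sketch. It assumes, purely for illustration, that extraversion is standard normal and that salivation depends on it linearly with slope -0.3 plus noise; none of these numbers come from any lemon-test study. It then compares the correlation computed from the continuous score with the one computed after a median split into “introverts” and “extraverts”.

```python
import math
import random
from statistics import mean, median

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n = 5000

# Hypothetical data-generating model: a weak, noisy linear relation.
extraversion = [random.gauss(0, 1) for _ in range(n)]
salivation = [-0.3 * x + random.gauss(0, 1) for x in extraversion]

# Median split: everyone above the median is labeled an "extravert".
cut = median(extraversion)
is_extravert = [1.0 if x > cut else 0.0 for x in extraversion]

r_continuous = pearson(extraversion, salivation)
r_dichotomized = pearson(is_extravert, salivation)

print(f"continuous score: r = {r_continuous:.3f}")
print(f"median split:     r = {r_dichotomized:.3f}")
```

In this setup the median split discards the information about how far each person is from the middle, so the dichotomized correlation comes out weaker than the continuous one.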

Your students may enjoy what these folks have done (create an automatic CS paper generator).

https://pdos.csail.mit.edu/archive/scigen/

Thank you for this! In a month I will be teaching at a high school in Hungary; I may have occasion to introduce it to my students there.

Cool. This reminds me of one of JC Gardin’s interesting accomplishments: he had one of his programs write an article “like” Claude Levi-Strauss would. When he presented it to Levi-Strauss and asked whether it was one of his articles, Levi-Strauss read it and said: yes it is, but I don’t seem to have a copy; do you mind if I keep this one? https://andrewgelman.com/2009/04/23/two_kinds_of_bo/#comment-48466

Gardin did request postdoc applications to do the same for Derrida; I never heard whether that happened.

Regarding #4: My guess is that it would take an enormous amount of work and debate on the specific utility function for any particular problem. How would you convince a reviewer that a particular utility is properly defined and estimated? This opens a huge can of worms, though I have no doubt that the conversation would be beneficial.

I’m not convinced that “How would you convince a reviewer that a particular utility is properly defined and estimated? This opens a huge can of worms …” is a helpful perspective. The point is that the researchers need to think hard about what utility functions are appropriate for the problem and give their reasons for choosing one — or, in some cases, it might be appropriate to acknowledge that there may be two or more that one can make equally good arguments for, then do analyses using each, and compare results. Or it might be that different utility functions are appropriate for different circumstances (e.g., in different age groups, different negative side effects might be more salient).
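One way to picture “do analyses using each, and compare results” is the sketch below. It takes simulated draws standing in for a posterior distribution of a treatment effect and evaluates the same approve/do-not-approve decision under two different utility functions. Everything here is hypothetical: the effect distribution, the fixed cost of 0.1, the tenfold weight on harms, and the function names are assumptions made for illustration, not anything proposed in the paper.

```python
import random

random.seed(1)

# Stand-in for posterior draws of the treatment effect (positive = benefit).
effect_draws = [random.gauss(0.4, 0.45) for _ in range(10000)]

COST = 0.1  # hypothetical fixed cost of approving

def utility_benefit_focused(effect, approve):
    """Benefits and harms count equally; not approving is the zero baseline."""
    return effect - COST if approve else 0.0

def utility_harm_averse(effect, approve):
    """Same as above, but a negative effect counts ten times as much."""
    if not approve:
        return 0.0
    weighted = effect if effect >= 0 else 10.0 * effect
    return weighted - COST

def expected_utility(utility, approve):
    """Average the utility over the posterior draws."""
    return sum(utility(e, approve) for e in effect_draws) / len(effect_draws)

results = {}
for name, utility in [("benefit_focused", utility_benefit_focused),
                      ("harm_averse", utility_harm_averse)]:
    eu_approve = expected_utility(utility, True)
    eu_reject = expected_utility(utility, False)
    decision = "approve" if eu_approve > eu_reject else "do not approve"
    results[name] = (eu_approve, eu_reject, decision)
    print(f"{name}: EU(approve) = {eu_approve:.3f}, "
          f"EU(do not approve) = {eu_reject:.3f} -> {decision}")
```

The two utilities need not lead to the same decision, and seeing where they diverge is itself informative about which assumptions the conclusion depends on.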

My point was certainly not to disregard point #4. I think it’s very important, but that it’s much more difficult to do in general than is being recognized.

Take the relatively simple case of a randomized trial for a new drug over standard of care to treat, say, blood pressure. What are the relevant utilities, and for whom are they relevant? Consumers, hospitals, insurers, and policy makers might attach very different utilities to a specific outcome.

Survivorship seems relevant to consumers, maybe not so much to hospital administrators. What additional study design would be needed to relate predicted change in blood pressure to survivorship? That seems logistically, though not conceptually, challenging.

What about quality of life, or quality adjusted life expectancy? How are these defined? Are these definitions universally accepted with universally accepted measurement? Even if they are, how do you relate them to change in blood pressure? And so on.

I don’t believe that utility functions are readily available for most, or even a few, important clinical problems. Much less for social studies. Defining these functions requires additional studies, with all of the attendant concerns about design and analysis and interpretation that occupy this blog.

“I don’t believe that utility functions are readily available for most, or even a few, important clinical problems.”

Might be translated to “I don’t believe clinical researchers have even done the most basic initial things required to do good research yet.”

To clarify: medicine is really a branch of engineering. It uses science to achieve desirable human goals (improved health at reduced costs). You can’t work on a problem meaningfully until you define what problem you’re trying to solve, and in engineering this is done in terms of some kind of objective function, namely a utility. So if you haven’t even got the foggiest idea of how to approach defining utility for even one of the stakeholders in your problem of interest… then you really haven’t even figured out what problem you’re trying to solve.

Garnett,

1. The questions you raise are examples of the types of questions that researchers need to think hard about in designing and analyzing studies. However, I don’t agree that “defining these functions requires additional studies.” What is needed is for people to propose choices that fit circumstances, to give well-thought-out reasons for their choices, to consider serious criticisms of those choices, and to revise the choices as needed in response to criticisms.

2. I agree with Daniel’s responses to your reply.