How to discuss your research findings without getting into “hypothesis testing”?

Zachary Horne writes:

I regularly read your blog and have recently started using Stan. One thing that you’ve brought up in the discussion of nhst [null hypothesis significance testing] is the idea that hypothesis testing itself is problematic. However, because I am an experimental psychologist, one thing I do (or I think I’m doing anyway) is conduct experiments with the aim of testing some hypothesis or another. Given that I am starting to use Stan and moving away from nhst, how would you recommend that experimentalists like myself discuss their findings since hypothesis testing itself may be problematic? In general, any guidance you have on this front would be very helpful.

My reply: In any particular case, I’d recommend building a model and estimating parameters within that model. For example, instead of trying to prove that the incumbency advantage was real, my colleagues and I estimated how it varied over time and across different congressional districts, and we estimated its consequences. The point is to draw direct links to questions outside the lab, or outside the data.

Maybe commenters have other suggestions?

91 thoughts on “How to discuss your research findings without getting into “hypothesis testing”?”

  1. “However, because I am an experimental psychologist, one thing I do (or I think I’m doing anyway) is conduct experiments with the aim of testing some hypothesis or another.”

    Remember – whatever substantive hypothesis you are “testing” is NOT the null hypothesis. So there is no inherent tension between “testing” substantive theoretical issues (i.e. estimating parameters of your model) in your experiments, and refusing to do NHST. It just feels like there is a tension, because people seem to have learned to equate rejecting a null hypothesis with providing evidence for a favored substantive hypothesis.

    • jrc:

      NHST is often equivalent to testing H0: mu less than or equal to 0. Would you suggest then that this be replaced by something like H0: mu less than or equal to epsilon, where epsilon is the minimal practically significant value? It’s my understanding this is one of the basic requirements of FDA-approved clinical trials, but I do not think that’s the case for psychological research.

      Of course, I prefer Andrew’s method of examining an effect over time (or other important factors) when available. But when testing something like a new drug, changing effect over time is not really an option, and estimating changing effects over subpopulations may not be financially feasible as a primary analysis (although many designs allow for primary and secondary analyses).
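
      For concreteness, a rough R sketch of that kind of shifted one-sided test; epsilon and the data below are entirely made up:

        # Hypothetical example: test H0: mu <= epsilon against H1: mu > epsilon,
        # where epsilon is the smallest practically significant effect (made up here).
        set.seed(1)
        epsilon <- 0.5                      # assumed minimal effect worth caring about
        y <- rnorm(40, mean = 0.8, sd = 1)  # simulated outcomes
        t.test(y, mu = epsilon, alternative = "greater")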

      • “NHST is often equivalent to testing H0: mu less than or equal to 0”

        This is not really correct. In null hypothesis testing, the null hypothesis needs to be of the form “mu = c”, where c is some specific, specified number. The null hypothesis is then combined with the model assumptions and mathematical theorems to determine the properties of the sampling distribution of the test statistics. Unfortunately, this point is all too often not made clear in teaching statistics (for an overview of the ideas and reasoning, see Day 2 slides at http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html)

        • Re: mu=c, this is not the case in one sided tests. The properties under the null are defined by the infimum over the null set.

        • It’s not clear to me what you mean by [in one sided tests] “The properties under the null are defined by the infimum over the null set.” Do you mean the properties of the sampling distribution? Or what?

        • But you can test mu less than or equal to C vs mu greater than C, right? Of course you end up testing mu = C vs mu greater than C after you maximise, etc.

        • These details are of course just a distraction from the main question.

          Re the main point. If you want a simple approach not too dependent on a model, I wonder if you could just present an estimator of the thing of interest along with its full sampling distribution under various conditions?

          You can get this by bootstrapping, permutations, jackknifing, etc., as well as under different treatments, or even from past data on the same subject.
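
          For example, a rough bootstrap sketch in R (simulated data; the estimator here is just a difference in means):

            # Hypothetical sketch: bootstrap the sampling distribution of a difference in means.
            set.seed(2)
            treated <- rnorm(30, mean = 0.4)  # simulated treatment group
            control <- rnorm(30, mean = 0.0)  # simulated control group
            boot_diffs <- replicate(5000, {
              mean(sample(treated, replace = TRUE)) - mean(sample(control, replace = TRUE))
            })
            quantile(boot_diffs, c(0.025, 0.5, 0.975))  # summary of the bootstrap distribution
            hist(boot_diffs, main = "Bootstrap distribution of the mean difference")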

          A big issue here I think is the request to a) dichotomise this into true or false and b) be able to do this on the basis of a single study.

        • Martha:

          Yes, a null hypothesis needs to be of the form “mu = c”.

          But if you are doing something like a likelihood ratio test, then testing the hypothesis mu less than or equal to c is equivalent to testing mu = c (under the condition that mu-hat is greater than c). I recognize this is not the case in the Bayesian setting.

          “The null hypothesis is then combined with the model assumptions and mathematical theorems to determine the properties of the sampling distribution of the test statistics. Unfortunately, this point is all too often not made clear in teaching statistics”

          I find this statement odd. All the courses I’ve attended or viewed seemed to repeat this on a near daily basis, which in itself is a good argument against NHST; what if we spent more time asking “do the parameters we are estimating really answer our scientific question of interest?”

        • Yes, a subtle thing that comes up a lot is that frequentist testing is pretty much unchanged whether it’s point vs point, point vs composite, composite vs composite, etc., but this can make a big difference in Bayes.

          This, imo, is the heart of the ‘p-values overstate the evidence’ issue: Bayesians don’t have a unified, agreed approach to testing a point vs composite hypothesis.

          Originally an idea was the whole ‘spike and slab’ concept along with Bayes factors, but I think this just led to massive confusion. Many modern approaches to Bayesian testing of point vs composite actually agree with the frequentist answers, e.g. Aitkin’s or Evans’ approaches. The idea: estimate the full posterior and then, essentially, use it much like a sampling distribution for your test stat (the LR).

        • How well hypothesis testing is taught may be a function of the level it’s taught at. I agree with Martha that, at the “intro to stats” level, where the p-value comes from is often not made clear. Textbooks might have one little section about it, and from then on every example is H0 -> test statistic -> p-value -> reject / fail to reject. There may be some QQ and residual plots thrown in too for “assumption checking”. The idea that a p-value is dependent upon a sampling distribution which is dependent upon a model is something that intro books, I think, shy away from in the fear that students won’t get it.

        • ” The idea that a p-value is dependent upon a sampling distribution which is dependent upon a model is something that intro books, I think, shy away from in the fear that students won’t get it.”

          And sometimes because the instructor in the intro courses doesn’t get it.

        • I agree that asking “do the parameters we are estimating really answer our scientific question of interest?” is indeed something that is all too often seriously neglected.

        • Your statement about likelihood ratios cannot be right. There are proofs that the likelihood-ratio test and t-test are identical.

        • Ah I see this is aimed at ‘a reader’.

          But I think this is important to keep in mind.

          Eg in frequentist testing, testing mu less than or equal to C vs mu greater than C is indeed essentially the same as testing mu = C vs mu greater than C. But in Bayes these can be quite different depending on how you handle point vs composite. Senn has written some nice explanations of this.

          (In testing mu = A vs mu = B you need to choose your test stat with this in mind, i.e. the LR between hypotheses A and B.)

  2. Hypotheses don’t always come out of the box as numbers. Hypothesis-testing doesn’t only mean testing a statistical “hypothesis,” like the null. Hypothesis testing is the sine qua non of scientific progress.

  3. Andrew’s suggestion of drawing questions from outside is valuable. The underlying problem is that the disciplines themselves are in intellectual crisis. The qualitative side lags. It’s as if we have to rebuild the Library of Alexandria, as Dr. Ioannidis elaborates in a speech.

    • One approach, perhaps, is to interrogate each of your assumptions before you begin doing anything. I’ve been reading a number of the papers detailing the crisis in cell line research (e.g. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2001438 ) and have been struck by the similarities between that crisis and the one often discussed here. They had their own Meehls and Rozebooms, first firmly and politely pointing out “those human lung cancer cells on which you’ve built a research program are almost certainly rat liver cells” and later angrily escalating the rhetoric when more papers continued to be founded on demonstrably false premises. They had journal editors who knew it was B.S. but who said “we leave it up to the reviewers”. They had reviewers whose own careers were built on prostate cancer research using HeLa cells. And despite every publication of an “Oh by the way, we sequenced that XYZ cell line and they’re not actually breast cancer cells” paper, the number of “we discovered an interesting misshapen protein in the XYZ cell line that could present a new treatment vector for breast cancer” articles only increased ( http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0186281#pone.0186281.ref003 ). Until it all blew up following replication efforts using authenticated cells.

      I suspect there’s some meta-story about Science in all this. Maybe something like The Sorcerer’s Apprentice.

      • “One approach, perhaps, is to interrogate each of your assumptions before you begin doing anything.”

        That’s an understatement to me. I’d say,

        1. Interrogate each of your assumptions, and alter assumptions as needed, before you begin doing anything

        2. Clearly explain why your assumptions are valid (or the best available)

        3. Do a “postmortem” on your assumptions after the study is completed

        4. Discuss anything in the postmortem that would indicate that you did not catch problems with your assumptions in your initial interrogation.

        5. Include discussion of items (1)-(4) in writing your paper or (if that would exceed space limitations) in a freely available online form, with a link included in the paper.

  4. NHST != testing a substantive hypothesis. (Sense a theme in comments?) But the correspondent asked for constructive advice on a rhetorical approach to use, and the model-building language may not work for her/his journals. I am sympathetic. For journals where editors/reviewers are [insert intensifier] on the language of hypothesis-testing, maybe “consistency with hypothesis” will work.

  5. Making clear that testing a substantive hypothesis does not need to involve NHST (as jrc pointed out) seems an important part of the solution to me.
    More practically, if one cannot assume familiarity with Bayesian statistics by the audience, it helps to spell out clearly in the methods section how one establishes support for or against a hypothesis. This could be done by checking whether the highest density interval excludes zero (or another threshold value), by calculating how much of the posterior distribution is on one side of zero, by reporting how much of the posterior is within (or outside) a region of practical equivalence around the predicted value, or by reporting results of model comparisons (e.g. using the loo package). One can also try to convert information from posterior distributions into quantities that are better known, like the Bayes factor.
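
    As a rough illustration, here is what the first three of these summaries might look like in R for a single coefficient. The draws and thresholds below are made up; with brms or rstanarm you would extract the draws from the fitted model object instead, and the loo comparison is not shown.

      # Hypothetical sketch: summaries of posterior draws for one coefficient.
      set.seed(3)
      beta_draws <- rnorm(4000, mean = 0.3, sd = 0.15)  # stand-in for MCMC draws of beta

      # Posterior probability that the effect is positive
      mean(beta_draws > 0)

      # A central 95% posterior interval (a simple stand-in for an HDI)
      quantile(beta_draws, c(0.025, 0.975))

      # Share of the posterior inside a region of practical equivalence around zero
      rope <- c(-0.1, 0.1)  # made-up ROPE limits
      mean(beta_draws > rope[1] & beta_draws < rope[2])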

    Not everyone will agree with each of these approaches, and some might not like any of them. However, in my experience reviewers and editors are sympathetic as long as one explains and justifies the approach.

    Kruschke’s book “Doing Bayesian data analysis” has an entire chapter on reporting results from a Bayesian analysis.

  6. Hi all,

    Thanks for the helpful feedback. To be clear, I don’t think nhst is the only kind of hypothesis testing (and I understand that you aren’t even testing the hypothesis of interest in the case of nhst). I was referring to the idea of hypothesis testing in general because Andrew has blogged several times (and Kruschke also argues) that they are not in favor of Bayes factors. So the question was, how can you talk about Bayesian estimation in a way that won’t confuse, for example, reviewers at psychology journals. I’ve already run into a lot of confusion from reviewers about estimation — the typical response is something like “so is your hypothesis right or not?”

    Zachary

    • So the question was, how can you talk about Bayesian estimation in a way that won’t confuse, for example, reviewers at psychology journals. I’ve already run into a lot of confusion from reviewers about estimation — the typical response is something like “so is your hypothesis right or not?”

      I dealt with this exact same thing and found it really irritating. Sorry, I don’t think there is any way for you to overcome their confusion (at least not using logic and reason). Something like this is going on: https://quoteinvestigator.com/2015/07/10/reason-out/

      Basically they will need to actively seek out the required information for themselves. At least that was my experience, maybe other people have a solution.

    • ” The typical response is something like “so is your hypothesis right or not?””

      Ouch! This is indeed hard to deal with; such a response shows that they just don’t get it.

      One thing I like to say when teaching or talking about statistical inference is “if it involves statistical inference, it involves uncertainty.” The responses you get seem to be expecting you to (mis)use statistics to “launder out” uncertainty. That is not the real world!

    • Zachary, I’m sorry that the comments are not particularly helpful to you.

      You are, presumably, investigating a scientific hypothesis. You evaluate data that might support or refute that hypothesis by way of a proxy statistical hypothesis that is a parameter value (or values) within a statistical model. All statistical tests require a statistical model, be it a complicated multicomponent hierarchical model that Andrew might use for his type of study, or a trivial model that is just an assumption of exchangeability under the null hypothesis for a permutation test of a simple dataset. The data contain evidence and the statistical model processes that evidence into a statistical inference.
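
      (As an aside, that trivial-model case fits in a few lines of R; everything below is simulated, just to show the exchangeability idea:)

        # Hypothetical sketch: permutation test assuming exchangeability under the null.
        set.seed(4)
        treated <- rnorm(20, mean = 0.5)  # simulated data
        control <- rnorm(20, mean = 0.0)
        obs_diff <- mean(treated) - mean(control)
        pooled <- c(treated, control)
        perm_diffs <- replicate(5000, {
          idx <- sample(length(pooled), length(treated))
          mean(pooled[idx]) - mean(pooled[-idx])
        })
        mean(abs(perm_diffs) >= abs(obs_diff))  # two-sided permutation p-value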

      Your job as a scientist is to interpret the relationship between the statistical inference and the real world, and then make a scientific inference. Many people—far too many—think that a statistical inference _is_ a scientific inference, and many who know better do not make the distinction clear.

      Royall set out three core questions of inference that I find to be insightful and helpful:
      1. What do these data say?
      2. What should I believe now that I have these data?
      3. What should I do or decide now that I have these data?
      You can see that the questions are related but distinct.

      What to believe requires consideration of what you believed before obtaining evidence as well as the new evidence of “these data”, and what you believed beforehand depended, presumably, on prior evidence, theory, opinion, and intangibles. A Bayesian analysis can use a prior to encompass that prior belief and so it can answer the second question. Most of us use a less formal approach to answer it.

      What to do or decide should depend on what you believe and on the costs and benefits of potential actions or decisions. The Neyman & Pearson hypothesis test framework attempts to answer the third question without going into beliefs and with only a very basic loss function. The modern ‘NHST’ approach is a bit like that, but worse, because the fixed P<0.05 threshold and the lack of a sensible power analysis for sample size determination eliminate even the basic consideration of loss function that Neyman & Pearson proposed. If you want a formal decision procedure then read the decision theory literature for ways to proceed rather than the ‘NHST’.

      You can answer the first of the questions in detail with a likelihood function that ranges across the whole of the parameter space of the statistical model, or in brief with a P-value. You can make a scientific inference with that data informally by integrating it with previous evidence, related results, theory, loss functions, and your own intuition as a scientist as long as you are honest about the factors that you have considered. If there is no theory, as is frequently the case in psychology, then you need to be cautious. If there is no pre-existing relevant evidence then your study is a preliminary study and you should not pretend that your results are definitive.
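
      To make the first option concrete, here is a rough R sketch of a likelihood function ranging across the parameter space for a normal mean with the spread assumed known, along with the corresponding P-value (the data and all the numbers are made up):

        # Hypothetical sketch: likelihood function for a normal mean (sd assumed known = 1).
        set.seed(5)
        y <- rnorm(25, mean = 0.4, sd = 1)       # simulated data
        mu_grid <- seq(-1, 2, length.out = 200)  # parameter values to scan
        loglik <- sapply(mu_grid, function(m) sum(dnorm(y, mean = m, sd = 1, log = TRUE)))
        plot(mu_grid, exp(loglik - max(loglik)), type = "l",
             xlab = "mu", ylab = "relative likelihood")

        # The brief version: a two-sided p-value against mu = 0
        z <- mean(y) / (1 / sqrt(length(y)))
        2 * pnorm(-abs(z))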

      Remember always that statistical and scientific inferences are different things.

  7. one thing I do (or I think I’m doing anyway) is conduct experiments with the aim of testing some hypothesis or another

    JRC hit it on the head in the first comment. If using NHST, you have not been testing your hypothesis this whole time anyway. It really sounds crazy, but that is the current state of affairs.

    To actually test a hypothesis you’ll have to think hard about what it entails to come up with a precise quantitative statement. This may be difficult just due to the nature of the problem in some cases. If that is really true, you should aim to describe some phenomenon in as much detail as possible and leave it there.

    People often say bio/psych/etc is “too complicated”, but in most cases this is just an excuse for their lack of training/skill. In my experience, just find papers on the topic from pre-NHST (pre-1940) and you will find someone who has already come up with a quantitative model to start you off.

  8. “Instead of trying to prove that the incumbency advantage was real, my colleagues and I estimated how it varied over time and across different congressional districts”—seems like the gold is here. The effect size logic appears implicit in the statement. But a sole effect size isn’t enough. I want a sense of how my experimental effects vary across contexts (“varied over time…”), which can lead to ideas about moderators OR point to general instability of measurement. Sure, this can be seen as model building language, but it’s not a far cry from how folks approach meta-analyses. I’d like to see a reviewer try to refute meta-analytic thinking.

  9. Perhaps you could provide clear, concrete advice that would be actionable for an experimentalist. Could you ask some follow-up questions and work through an example from Zachary’s own work, as opposed to drawing on your work in applied research? Consider that an experimentalist is going to be collecting less data under circumscribed conditions with nuances very different from those that arise in observational contexts. As such, suggesting that one approach the analysis as a modeling problem instead of a hypothesis testing problem is not particularly helpful. What does it mean to model the data in the context of a 2×2 factorial experiment with one between-subjects factor and one within-subjects factor if I really only know how to cram everything into an ANOVA and focus on cell means? Even if I know how to set up an ANOVA model as a regression equation, what exactly am I including in the model? I don’t have an excess of auxiliary variables to use; I don’t have time or different congressional districts. So, what am I modeling exactly? How do I interpret the coefficients? I only know how to interpret cell mean patterns.

    If you really wanted to help psychologists and others escape the NHST prison you might consider spending more time on these sorts of posts and provide more thoughtful, detailed advice with concrete examples (and code!) that readers could apply immediately to their work. I can’t speak for others but I suspect that such pithy ‘just model it!’ advice is not very helpful to the vast majority of researchers who really need the help and guidance.

    • Sentinel:

      That’s a great idea. I’m kinda busy so why don’t you do it yourself? Feel free to post, right here in the comment section, your detailed advice (and code!) that readers could apply immediately to their work. (You can display your code in html using <pre> and </pre>.) Thanks!

        • Sentinel:

          It’s not my job either, and, as I said, I’m too busy to do it myself right now. But your suggestion does seem like a good idea so if you have the enthusiasm and want to do it, I recommend you give it a shot. Even if your advice and code doesn’t work the first time, you can learn from the experience of trying it out, then next time you can do better, etc.

    • I think what your comment hints at is that many psychologists have quite weak foundations when it comes to understanding the breadth of statistics, and this causes problems when they try to build theories on those weak foundations. It’s very hard to make quantitative predictions when you have been trained to think that all that you can/should aim to “prove” is that two quantities are different when some straw man theory says they should be the same, and also that the only two kinds of distributions you will have to cope with are (1) normal and (2) abnormal. Speaking only for myself, I can say that it took a lot of work to understand what I was missing and rebuild a more complete understanding. I think it is asking too much of Andrew to expect him to fix this for anybody, maybe not even his own students. Learning can be guided, but unlearning, maybe not.

      Speaking more broadly, statistics as a discipline has had very uneven success. It convinced tons of other disciplines that its tools were indispensable, but as far as I can tell never attempted any gatekeeping on who needs what training to use the tools, the way eg the AMA does. Courses in stats departments are hard, and many other fields don’t select heavily for applicants who understand the mathematical framework in which those stats courses are taught, so many departments effectively roll their own stats curriculum. So anything that might have gotten lost in translation sixty years ago is going to be challenging to correct. That’s how I think about where we are, anyway.

      • I can vouch for the fact that Andrew’s very good at conveying this kind of thing to students (like me!). Luckily, I didn’t have much to unlearn.

        I think you’re right about the dangers of stats departments rolling their own curricula. I couldn’t understand stats as presented in natural language processing or machine learning. It’s a jumble of undifferentiated techniques conflated with algorithms that was too much for me in terms of separating the signal from the noise (not that there aren’t people that know what they’re doing—it’s not all noise!).

      • ” [Statistics] convinced tons of other disciplines that its tools were indispensable, but as far as I can tell never attempted any gatekeeping on who needs what training to use the tools, the way eg the AMA does.”

        This involves an apples-to-oranges comparison. The practice of medicine is regulated by law; the practice of statistics is not.

        • Yes! — that’s why medicine is my go-to example of a guild that’s been incredibly successful at policing its borders. It wasn’t always thus! Here’s a chart summarizing the history of medical licensure in the US. Though the idea that society has an interest in regulating medicine is old, the modern infrastructure for policing it dates only from late 19c-early 20c, and it seems there was a period of a few decades in the mid-19c with no effort to regulate medicine at all. And modern medical education dates from roughly the early 20c (see here for a sort of galling story about how reform happened).

          I’m not suggesting that we as a profession should try to emulate medicine, or that we’d be successful if we did. It’s probably not an accident that medicine is the most successful guild I can think of — the barriers to entry are high in proportion to the level of trust that doctors ask of people (as patients, we pay strangers huge sums to feed us poison and cut us open! Hard to top that). But lots of professions have some guild-like architecture, and it’s striking to me that we in statistics really have almost none. I think the ASA is trying to change that with their PSt*t[1] accreditation program. I haven’t done that program, because at the moment I see no great benefit to me to doing it. (It’s not like the USMLE is fun, either! Would-be doctors do that because they have to.) All I’m saying is that *because* the borders to our profession are so porous, we have very little leverage for top-down changes to education or practice.

          [1] The fact that this is apparently a reserved word makes me leery of using it here. Not sure that that’s a win for the guild…

        • Thanks for the interesting link.

          I think your comment, “Though the idea that society has an interest in regulating medicine is old,” is a clue to the difficulty of implementing standards in statistics vs in medicine: The idea that society might have an interest in regulating statistics isn’t yet very common — and, because statistics does seem remote to most people, probably will not become common.

          Also, as your linked timeline shows, medical regulations started with states and gradually moved up to a more uniform national level. Given the perceived “remoteness” of statistics from most people’s individual lives, state initiatives seem even less likely to happen than national ones. Indeed, probably international standards might be the most likely to happen. Of course, I might be wrong – do you know of any evidence that other countries have statistical “licensure” standards?

          Another thought: Individuals are the “clients” of medicine — whereas other professions are the “clients” of statistics. So any effective demand for statistical standards may need to come from other professions (or perhaps coalitions of professions). But this might be problematic, since many professions may be wedded to their accustomed ways of using (or misusing) statistics (e.g., that all-too-common preference for certainty and avoidance of uncertainty).

        • I agree with most of what you’re saying. All I’m aiming at here is pointing out one aspect of why I think reform is very hard. The genesis of a problem doesn’t always point directly at a solution. Though I will be interested to see if ASA can make PSt*t gain traction…. but I suspect not. Taxis offer a case in point in two ways: a case where licensure is a thing even though the benefits to society of licensure (beyond say a commercial DL) are not quite as obvious as in medicine, and also a case where Silicon Valley has proceeded to eat the profession’s lunch. (Hello, data science!)

          (Your question about statistical licensure in other places is great. I don’t know! If I had to hazard a guess I would say that if they did, we should already know about it. But that line of reasoning is probably not airtight ;) )

          One place I maybe disagree with you is that often, there is a chain of “clients” for data analysis that looks something like “funders” -> “Congress” -> “taxpayers/voters.” I don’t know how much we can get the last of these to care, as you note, but getting funders to care is certainly a logical approach, and my guess is that it will be the most useful of the top-down strategies. It was certainly gestured at in the “beyond p-values” symposium last week. It’s not perfect because e.g. NSF is famously discipline specific in how it runs things (and I mostly think that’s good?). It’s also possibly dangerous because certain elements of Congress get stars in their eyes when they think about axing NSF/SBE…

          At bottom though I mostly don’t have direct influence over that stuff (although I like thinking about systems, I’m not high in the food chain myself) and I just want my own damn work to be better.

    • In experimental situations we often have small data. I sat through a year of experimental design classes where the major concern was which higher-order interactions to ignore due to confounding. There is an extensive literature on experimental design, and software for it, that the Bayesian world merely touches upon. Also, getting the data might be difficult, and there are proof-of-concept situations where the results are secondary to the implementation of the method or device.

      The author of the paper needs to present the problem, describe the desired effect size, justify the sample size and randomization if any, and present descriptive statistics with some type of intervals. There is no reason to throw out NHST if it has a role but point out the weaknesses of relying on one p-value and perhaps include an appendix with another analysis.

      Regarding two-sided and one-sided hypothesis tests–There has been an avoidance of planned one-sided hypothesis tests with an implicit assumption that the planned two-sided test of treatment versus control, especially in pharmaceuticals, is really one-sided at alpha=0.025. This is a protocol based on software defaults and regulatory culture, not on some real belief in zero effects.

      • More attention does need to be given to the problems you and sentinel chicken have brought up. But it could be hard to find the people “free” (i.e., not already too busy) to do this. The best I can suggest is to look at the following

        1) multilevel modeling

        2) the extensive literature from experiments in agriculture and industry. My experience is that a lot of it is indeed difficult reading. In case it might help you get some background, feel free to use any of my lecture notes at http://www.ma.utexas.edu/users/mks/384E09/M384Esp09home.html and http://www.ma.utexas.edu/users/mks/384Gfa08/384G08home.html. These are for courses I taught (I am a mathematician, not a statistician, and now retired) to keep a small statistics program going until there was the will to develop a larger statistics program at my university. I consider one of my successes an engineering student who built on the background of the two courses I taught to read the literature on experimental statistics in the automobile industry and apply it to his group’s work on robotics.

        Fitting the analysis to the problem is indeed very important.

      • Wildlife veterinary medicine and epidemiology commonly use methods that are suitable for really small sample sizes; you may look into their literature as well.

  10. The main point (to me) is that you don’t test a point null hypothesis. “Testing” (or perhaps “evaluating”) a substantive theory using the posterior probability that an ATE is net positive, or bigger in some contexts than others, is what you should be doing but NHST does not permit.

    • I agree that testing is often counterproductive.

      However, if people want to test, I’d prefer to put the message out there that you often don’t want to test a point null _against_ a composite alternative. Eg mu=0 vs mu>0.

      A point will almost always lose _against an interval_. The t test is designed for point against composite or composite against composite.

      If you do want to test a point hypothesis then test it against another point hypothesis. In this case the LR is the best test stat from most perspectives, including NP theory (NP’s famous lemma).

      The problem is in part comparing a test designed for testing a point vs a composite (the t-test) with quantities designed for point vs point, e.g. likelihood ratios, or with dubious spike-and-slab ideas designed to overcome the point vs composite problem.

      I think there is increasingly a consensus that a t-test is fine for testing a point vs composite, and that it gives the same answer as many Bayes approaches; it’s just that testing a point _vs composite_ is not a great idea.

      I emphasise this because a point hypothesis does make some sense in that it corresponds to a specific generative model. The issue is what you compare it to.

      • (The solution to testing a point vs composite, of course, is to consider the family of point vs point comparisons. This leads to both a likelihood _function_ for Bayes and likelihood approaches, and a power _function_ for NP testing. The main conflict, imo, is not between the answers given the right questions, but in confusion over the questions.)
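
        A rough R sketch of that power _function_ idea for a one-sided z-test (sample size, sd, and alpha are all made up):

          # Hypothetical sketch: power of the test of mu = 0 across a range of true mu values.
          n <- 25; sigma <- 1; alpha <- 0.05
          se <- sigma / sqrt(n)
          crit <- qnorm(1 - alpha) * se                # rejection threshold for the sample mean
          mu_grid <- seq(-0.5, 1.5, length.out = 200)  # candidate true values of mu
          power <- 1 - pnorm(crit, mean = mu_grid, sd = se)
          plot(mu_grid, power, type = "l", xlab = "true mu", ylab = "power")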

  11. I’ve been thinking about this recently (you can see some thoughts about ‘hypothesis testing’ over at http://srmart.in/thought-droppings-substantive-statistical-hypotheses/).

    Aside from that blog post, I’ve been thinking about the evaluation of substantive hypotheses.
    One thing I realized is that if you can evaluate H as a function of the quantities within a model, like p(H|theta, y), you can answer many questions. E.g., if your hypothesis says that some effect should be positive or within some range, you can evaluate that via the posterior of theta, and your p(H|theta, y) is actually a probability.

    We also just recently published a paper where I used Bayesian modeling near the end to integrate several datasets for evaluating a prosocial hypothesis vs an egoistic hypothesis, with respect to gratitude (http://www.tandfonline.com/doi/figure/10.1080/17439760.2017.1388435?scroll=top&needAccess=true). In the Bayesian section, we were careful not to ‘reject’ or ‘accept’, but rather to talk about meaningful, plausible effects, and whether the estimates fit better with one substantive hypothesis over another. E.g., some directional interactions would be hypothesized if egoistic motives are in play; we see a posterior wherein the sign isn’t determined, but regardless, the magnitude across the 95% most plausible estimates would be too small to affect the main effect, so we say that doesn’t fit well with the egoistic hypothesis.

  12. When my students and I submit papers to journals doing exactly what you suggest, two things happen:

    1. The paper gets rejected.
    2. The reviewers interpret the results in NHST terms.

    So your approach is proving to be a dead end. I have taken the decision to rely on the posterior probability of the parameter being positive or negative. This seems to work and we have published quite a few papers this way.

    • I think this clearly shows why this should be thought of as a conversation with reviewers, editors and search committees, i.e., with the gatekeepers of academic success. I sometimes run into this problem just at the level of helping students and researchers write clearly. They (sometimes, not, fortunately, too often) tell me that when they follow my advice (and write well and clearly) they don’t get the grade or publication they were after. Sometimes they have misunderstood my advice or simply been unable to implement it (just as a botched Bayesian model isn’t better than an expertly executed hypothesis test), but sometimes I really do feel that a perfectly good paper is being rejected simply because it isn’t following conventions that foster obscurity.

      • You are right. My problem as an advisor for PhD students, who need to publish in top journals to get jobs, is to find a way to get past the relative ignorance of editors and reviewers. So far we have managed to get past gatekeepers, but we have not been able to tell the full story, namely, “take this with a grain of salt and try to prove us wrong with a replication or by demonstrating a confound”. Instead, we have to provide a semblance of “closure” (I really hate this word).

        • ““closure” (I really hate this word).”

          Me, too. It is dismissive of uncertainty — therefore dismissive of reality. In other words, off in a fairy-tale land.

    • Shravan, I have also dealt with this issue a little and am still deciding what the best default should be. I think you are on to something though. I have had success explaining the posterior probability of a parameter being positive or negative, but where things can get tricky is when this probability notably varies with the choice of prior. Folks trained to think of statistics as some ultimate, objective set of conclusions about data want to know *THE probability*, not an explanation of how posterior inference is always relative to prior specification (and of course the likelihood, which is arbitrary too in a lot of ways), not some magic talisman.

      • “Folks trained to think of statistics as some ultimate, objective set of conclusions about data want to know *THE probability*, not an explanation of how posterior inference is always relative to prior specification (and of course the likelihood, which is arbitrary too in a lot of ways), not some magic talisman.”

        Yes, yes, yes: Misunderstanding of statistics and probability is rife. And misunderstandings of statistics are all too often misunderstandings of probability. I suspect the lack of understanding of probability and statistics is in part a result of a human (or perhaps societal?) tendency to want certainty, exactness — you might call it “uncertainty avoidance.” So teaching people to think probabilistically can be a very difficult task.

      • We do sensitivity analyses in the vignettes we release with the paper. In my field it has not mattered much, but I can imagine that in other fields the differences are dramatic. When I did my MSc in stats at Sheffield, there was an eye-opening example of using one’s own enthusiastic prior vs one’s opponent’s prior, and leaving the data open to interpretation. I want to do that one day, but I think I would need to be famous enough that anything I write gets accepted.
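
        As a toy illustration of how much the prior can matter, here is a rough R sketch comparing an enthusiast’s prior with a sceptic’s prior on the same made-up estimate (conjugate normal-normal, purely for illustration):

          # Hypothetical sketch: posterior P(effect > 0) under two different priors.
          y_bar <- 0.2; se <- 0.15  # made-up estimate and standard error

          posterior_prob_positive <- function(prior_mean, prior_sd) {
            post_var  <- 1 / (1 / prior_sd^2 + 1 / se^2)
            post_mean <- post_var * (prior_mean / prior_sd^2 + y_bar / se^2)
            1 - pnorm(0, mean = post_mean, sd = sqrt(post_var))
          }

          posterior_prob_positive(prior_mean = 0.5, prior_sd = 0.20)  # enthusiast's prior
          posterior_prob_positive(prior_mean = 0.0, prior_sd = 0.05)  # sceptic's prior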

  13. Andrew,

    I’ve been using Stan (or when I can, brms/rstanarm), and I have a few questions regarding “hypothesis testing.” First, you state in your post:

    1. “…instead of trying to prove that the incumbency advantage was real, my colleagues and I estimated how it varied over time and across different congressional districts, and we estimated its consequences.” Looking at how it varies and at its consequences already operates on the conclusion that the advantage is real, no?

    2. I’ve read Zachary’s comments on this thread, and I would suggest reporting the proportion of the posterior in the same direction as your estimate as the probability that your estimate is larger/smaller than zero. For example, if your beta = -.24 and 3990 of the 4000 iterations are less than zero, you could say the probability that the beta is less than zero is .9975. Obviously, this relies on the assumption that your chains have all converged to the same stationary distribution. What do you think of this approach?

    3. Bayesians talk about uncertainty and looking at the entire posterior distribution in looking at results. I totally agree with this. But sometimes we *have to make dichotomous decisions.* For example, let’s say I am trying to figure out if I should continue doing mailers for campaigning purposes. I send people mailers A, B, or C. I also have a control group where no mailers were sent. At the end of the day, I want to know: Do I send those mailers? So I *do* care if “my hypothesis was right,” in a sense. I do want to know if there is any meaningful difference between mailers A, B, and/or C and the control condition. Even if I use Bayesian estimation (which I would), I still have to make a dichotomous recommendation to the campaign: Should they use mailers or not? And if so, which one? So I think using the posterior iterations to make probabilistic statements around meaningful values is legitimate. I know Bayesians don’t like making dichotomous decisions, but sometimes we have to; what are your thoughts on that?

    4. This brings me to my next point: Most of the time in social psychological research, there are no meaningful values. In my example above, I could get a meaningful value from the campaign: They will not use it unless it changes opinion by 2 percentage points. Well, I can use the draws from the posterior to tell them how likely it is they will beat 2 percentage points with mailers. However, consider a psychology experiment where we have two conditions (0 and 1) and our dependent variable is measured on a Likert scale from 1 to 7. The beta that I estimate for the condition is, well, somewhat meaningless, right? I mean, it is completely dependent on the scale of the dependent variable. The fact that we tend to always use arbitrary scales means that we just default to saying, “well, the meaningful comparison value is zero. Is the effect more than zero?” From here, people will just do what I said above in point 2: Report the probability (by taking draws from the posterior) that our estimate has the same sign as the mode of the posterior. My main point is: **I do not think the issue here is Bayes or NHST, I think the problem is our use of arbitrary scales that do not have intrinsic meaning.** If the posterior tells me the 95% credible interval is .20 to .80… What the hell does that even mean, if our DV was measured on an arbitrary scale? I don’t see standardized effect sizes as a way around this, either. I could see a Cohen’s d of 0.5, for example, being hugely important or hugely unimportant, depending on the context. I guess my question here is: Do you think the problem is really Bayes or NHST? I think it is that the estimates we get are dependent on arbitrary scaling most of the time (in social psychological research, that is), so it is hard to make probabilistic statements that make sense without considering the default: “Is it different from zero?”

    That being said, you can still get richer inferences from your data by using Bayesian methods, even if you will end up coming to a dichotomous decision, so I would still use Stan, even if you end up saying, “Yes, in my opinion, there is substantial evidence for my hypothesis to the point where I will say that it may very well be ‘right.'”

    • It is unclear why you believe “…instead of trying to prove that the incumbency advantage was real, my colleagues and I estimated how it varied over time and across different congressional districts, and we estimated its consequences.” assumes the advantage is real. The estimated advantage is a continuous number. The point probability of it being exactly zero is zero.

      It may be the case that it is negative (the “cost of government” literature in European parliaments suggests a form of negative incumbency and/or regression to the mean), it may be the case that it is positive. I didn’t read Andrew’s followup papers to know what priors they put on the distribution of the advantage but I assume the priors give some support to negative values of the incumbency advantage. Is this comment just a misinterpretation of the term “advantage” to imply the quantity must be a positive scalar? It’s true that if it was observed to be negative, we might re-label it and flip the scale so that we had a positive value for “incumbency cost”, but the underlying “incumbency association” exists whether it’s positive or not.

      P.S. As someone who straddles comparative and American politics, I think an author would be fairly justified to assign almost no probability to a negative incumbency association in the 20th/21st century US context. Media coverage, name awareness, fundraising, endorsements — all the correlates of electoral success are benefitted through incumbency, and that’s without pork barrel spending, credit claiming, generic good affect for resolving procedural constituent complaints… and that’s without getting into the “true advantage” versus “deterring quality challengers” stuff which brings in the progressive ambition / entry timing literature. I would really like to see an AP scholar open a talk with “Actually, there is nothing that we could reasonably call an incumbency advantage”. That would really be a provocative prior assumption. An informed prior would have a lot more support on the 0-10 range than the -10 to 0 range, anyway.

  14. Re Cuddy post
    P-hacking is a marvelous opportunity to act out envy and other bad emotions. As a psychiatrist, I would love to interview those persons who choose to replicate (or non-replicate, as the case may be) their colleagues’ work. I suspect their motivations may go beyond pure scientific inquiry in some if not many instances.

  15. There’s nothing wrong with making dichotomous decisions from a Bayesian analysis, but don’t do it just by eyeballing coefficients. Define your (campaign’s) utility function, average that utility function over the posterior distribution for each choice under consideration, and choose the decision with the highest average utility.
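
    A rough R sketch of that recipe for the mailer example discussed above (the posterior draws, costs, and utility function are all made up):

      # Hypothetical sketch: choose the mailer with the highest expected utility.
      set.seed(6)
      # Stand-ins for posterior draws of each mailer's effect on vote share (in points)
      effect <- list(A = rnorm(4000, 1.5, 1.0),
                     B = rnorm(4000, 0.5, 1.0),
                     C = rnorm(4000, 2.0, 1.5))
      cost <- c(A = 1.0, B = 0.4, C = 1.8)  # made-up cost of each mailer, in the same units

      # Utility = benefit of the effect minus cost (a deliberately crude choice)
      expected_utility <- sapply(names(effect), function(m) mean(effect[[m]] - cost[m]))
      expected_utility
      names(which.max(expected_utility))  # the decision: highest average utility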

  16. I have a question about the new “Statistics and probability for advocates: Understanding the use of statistical evidence in courts and tribunals” just published by the RSS. I was about to send it around to the trial lawyers in my firm but then thought I ought to read it first.

    When I got to page 44 in the explanatory green box (http://www.rss.org.uk/Images/PDF/influencing-change/2017/ICCA-RSS-guide-version-6-branded-171019-REV03+designed-covers.pdf ) I found this: “In statistical tests we usually work with two hypotheses, the null and the alternative. The null hypothesis is something like the status quo; it is the assumption we would make unless there was sufficient evidence to suggest otherwise. The alternative hypothesis represents a new state of affairs that we suspect (or perhaps hope) might be true.” Ok, fine, but then it says this:

    1) “The strength of the evidence for the alternative hypothesis is often summed up in a ‘P value’ (also called the significance level) – and this is the point where the explanation has to become technical”

    2) “A significance test assesses the evidence in relation to the two competing hypotheses.”

    3) “A significant result is one which favours the alternative rather than the null hypothesis”.

    4) “A highly significant result strongly favours the alternative.”

    (Note that in the publication 1) came before the other three but I’ve arranged them to demonstrate how this will be interpreted by lawyers)

    Maybe I haven’t been paying attention but I thought (a) p-values were one measure of the data’s “fit” with the null’s model, (b) unless H0 and H1 exhaust all the causal possibilities then knocking over the H0 straw man is at best weak evidence for H1; and, (c) somewhere I recall reading “By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.” Anyway, my question is this: am I misreading this, confused, or is there an Atlantic Ocean’s gap between the ASA and the RSS?

    • > 3) “A significant result is one which favours the alternative rather than the null hypothesis”.

      Counterexample:
      Normal model, known variance.

      Null: mu=0, Alternative: mu=1.
      Sample of size 25.

      Set size = 0.05, power = 0.9996.

      Result is a sample mean of 0.4, which is significant (p less than 0.05). But the LR is about 12 in favour of your null.

      In conclusion: it depends.
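
      For anyone who wants to check those numbers, a quick R sketch (this assumes sigma = 1 and a one-sided test, which is what makes the power come out at 0.9996):

        # Hypothetical check of the counterexample above.
        n <- 25; sigma <- 1; se <- sigma / sqrt(n)
        crit <- qnorm(0.95) * se                       # rejection threshold for the sample mean
        power <- 1 - pnorm(crit, mean = 1, sd = se)    # about 0.9996 against mu = 1

        xbar <- 0.4
        p_value <- 1 - pnorm(xbar / se)                # about 0.023, so "significant"
        LR <- dnorm(xbar, 0, se) / dnorm(xbar, 1, se)  # about 12 in favour of the null
        c(power = power, p_value = p_value, LR = LR)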

        • One issue here is that your beta (Type II error rate) is 0.0004, much less than your alpha = 0.05, which is unusual. From a Bayesian point of view this corresponds to a strong prior in favour of the alternative, which the null has to overcome. In general, beta less than alpha is unusual, I think.

          If you set, e.g., alpha = beta, then you are using LR > 1 in favour of your alternative as the criterion, in which case a significant result _does_ indicate evidence (LR > 1) for your alternative.

    • This illustrates the difficulty of trying to explain to lay people in relatively simple language just what a p-value, hypothesis test, or confidence interval is. The concepts are inherently complex, subtle, and often taken to imply more than they justifiably can. It is a real problem.

      Part of the problem is the desire so many people have to have certainty. That’s why I often say, “If it involves statistical inference, it involves uncertainty.” You can’t make the uncertainty go away.

    • > Atlantic Ocean’s gap between the ASA and the RSS?
      Does seem like it, but my guess is it’s more likely a gap between academia and whoever actually did the writing.

      Thanks for posting this.

  17. a ‘P value’ (also called the significance level)

    The Royal Statistical Society produced this? If I search for “significance level” I find the expected definition (the cutoff alpha = 0.05, type I error rate, etc). It is a common error to confuse this with a p-value… either the RSS is committing stats 101 fallacies or they are introducing new terminology that couldn’t be better chosen to maximize confusion.

    The null hypothesis is something like the status quo; it is the assumption we would make unless there was sufficient evidence to suggest otherwise.

    They don’t say it but this assumption is that any two variables have no correlation. That doesn’t seem to be a good assumption.
