Paul Meehl continues to be the boss

Lee Sechrest writes:

Here is a remarkable paper, not well known, by Paul Meehl. My research group is about to undertake a fresh discussion of it, which we do about every five or ten years. The paper is now more than a quarter of a century old but it is, I think, dramatically pertinent to the “soft psychology” problems of today. If you have not read it, I think you will find it enlightening, and if you have read it, your blog readers might want to be referred to it at some time.

The paper is in a somewhat obscure journal with not much of a reputation as “peer reviewed.” (The journal’s practices should remind us that peer review is not a binary (yes-no) process.) I reviewed a few papers for them, including two or three of Meehl’s. I asked Paul once why he published in such a journal. He replied that he was late in his career and did not have the time or the patience to deal with picky reviewers who were often poorly informed. He called my attention to the works of several other well-known, even eminent, psychologists who felt the same way and who published in the journal. So the obscurity of the publication should not deter us. The paper has been cited a few hundred times, but, alas, it has had little impact.

I agree. Whenever I read Meehl, I’m reminded of that famous passage from The Catcher in the Rye:

What really knocks me out is a book that, when you’re all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn’t happen much, though.

Meehl’s article is from 1985 and it begins:

Null hypothesis testing of correlational predictions from weak substantive theories in soft psychology is subject to the influence of ten obfuscating factors whose effects are usually (1) sizeable, (2) opposed, (3) variable, and (4) unknown. The net epistemic effect of these ten obfuscating influences is that the usual research literature review is well nigh uninterpretable. Major changes in graduate education, conduct of research, and editorial policy are proposed.

Meehl writes a lot about things that we’ve been rediscovering, and talking a lot about, recently. Including, for example, the distinction between scientific hypotheses and statistical hypotheses. I think that, as a good Popperian, Meehl would agree with me completely that null hypothesis significance testing wears the cloak of falsificationism without actually being falsificationist.

And it makes me wonder how it is that we (statistically-minded social scientists, or social-science-minded statisticians) have been ignoring these ideas for so many years.

Even if you were to step back only ten years, for example, you’d find me being a much more credulous consumer of quantitative research claims than I am now. I used to start with some basic level of belief and then have to struggle to find skeptical arguments. For me, I guess it started with the Kanazawa papers, but then I started to see a general pattern. But it’s taken a while. Even as late as 2011, when that Bem paper came out, I at first subscribed to the general view that his ESP work was solid science and he just had the bad luck to be working in a field where the true effects were small. A couple of years later, under the influence of E. J. Wagenmakers and others, it was in retrospect obvious that Bem’s paper was full of serious, serious problems, all in plain view for anyone to see.

And who of a certain age can forget that Statistical Science in 1994 published a paper purporting to demonstrate statistical evidence in favor of the so-called Bible Code? It took a couple of years for the message to get out, based on the careful efforts of computer scientist Brendan McKay and others, that the published analysis was wrong. In retrospect, though, it was a joke—if I (or, for that matter, a resurrection of Paul Meehl) were to see an analysis today that was comparable to that Bible Code paper, I think I’d see right away how ridiculous it is, just as I could right away see through the ovulation-and-voting paper and all the other “power = .06” studies we’ve been discussing here recently.

So here’s the puzzle. It’s been obvious to me for the past three or so years, obvious to E. J. Wagenmakers and Uri Simonsohn for a bit longer than that—but there was Paul Meehl, well-respected then and still well-remembered now, saying all this thirty and forty years ago, yet we forgot. (“We” = not just me, not just Daniel Kahneman and various editors of Psychological Science, but quantitative social scientists more generally.)

It’s not that quants haven’t been critical. We’ve been talking forever about correlation != causation, and selection bias, and specification searches. But these all seemed like little problems, things to warn people about. And, sure, there’s been a steady drumbeat (as the journalists say) of criticism of null hypothesis significance testing. But, but . . . the idea that the garden of forking paths and the statistical significance filter are central to the interpretation of statistical studies, that’s new to us (though not to Meehl).

I really don’t know what to say about our forgetfulness. I wish I could ask Meehl his opinion of what happened.

Maybe one reason we can feel more comfortable criticizing the classical approach is that now we have a serious alternative—falsificationist Bayes. As they say in politics, you can’t beat something with nothing. And now that we have a something (albeit in different flavors; E.J.’s falsificationist Bayes is not quite the same as mine), this might help us move forward.

86 thoughts on “Paul Meehl continues to be the boss”

  1. You said: “I really don’t know what to say about our forgetfulness. I wish I could ask Meehl his opinion of what happened.”

    I was one of the lucky few who got to study under Professor Meehl many years ago. His class in Philosophy of Psychology was life changing and legendary.

    I recall a student asking him, why don’t people acknowledge any of this stuff you’re talking about? He said in his delightful Minnesota dialect (heavily affected at times), “Because if they did it would mean they’d all be selling shoes!”

    I took this to mean that the mathematical and logical skills necessary to do social science properly were too great a hurdle for most would-be soft-scientists, and most of us would simply be out of a job.

    • > the mathematical and logical skills necessary to do social science properly was too great a hurdle for most would-be soft-scientists

      Well, maybe, but the folks who came to my mind who did not seem to actually get this back in the 1980s into the 2000s have become very well known and highly published statisticians – so for them it certainly was not a lack of the necessary mathematical and logical skills.

      The paper discussed here was written as a response to the views of many faculty and students at the University of Toronto in 1988 http://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/#comment-73427

      The one piece of faculty feedback I did get came after I gave a talk at the University of Toronto in 1997. In the talk I had stressed that selection bias arises at the level of the individual study (if you can’t rule it out for that study) and does not just arise when trying to assess the studies all together (meta-analysis). Their comment to me was along the lines that this clarified things and finally convinced them that meta-analysis was not the problem they had worried it was.

    • “I took this to mean that the mathematical and logical skills necessary to do social science properly was too great a hurdle for most would-be soft-scientists, and most of us would simply be out of a job.”

      I think that is too pessimistic. It is also possible that the training programs are simply failing to teach the necessary skills and keeping students too busy with less important matters for them to learn the skills on their own.

      It should strike us as very strange that so many people studying dynamic systems do not have any use for calculus.

    • Jeff:

      I don’t completely agree with the “selling shoes” comment. I think it’s more than that. As I noted in the above post, lots of statisticians who are highly technically competent (including me, for many years!) were generally aware of problems with selection bias but did not realize how central it is to much of statistics-as-it-is-practiced. Being aware of such problems did not suddenly put me out of a job; rather, it allowed me to do my job more effectively!

      I agree that there is some number of practical researchers who can’t do much more than turn the crank, and for them there is a positive value in methods such as null hypothesis significance testing that allow them to turn raw data into published papers, to perform “uncertainty laundering,” as I put it in one recent paper. But the question I wanted to raise in the above post was not what their problem was; rather, I was asking what was my problem, and the problem with the statistics profession, that we did not realize the scale of this issue, that we naively thought that problems with hypothesis tests could be solved using confidence intervals, etc.

      • I don’t think Jeff’s interpretation is what Meehl meant by the “selling shoes” comment. Actually, he was just implying that much of what psychologists did at the time was rationally unjustified, and therefore such psychologists were earning their salaries from bad/dishonest practice (including bad/dishonest academic practice). If those same psychologists suddenly started acknowledging Meehl’s intellectual remarks, they would have to acknowledge as well the fact that what they did was worthless (or financially unjustified). Thus, the “selling shoes” comment: if they admitted their bad work, it would be more honest to earn money as a shoe salesperson.

        I actually found the same comment in two of Meehl’s publications: in the preface of his book Psychodiagnosis (1973; quoted in the book edited by Lilienfeld and O’Donohue, “The Great Ideas in Clinical Science” (2007)); and in his paper Why Summaries of Research on Psychological Theory are Often Uninterpretable (1990), reprinted as Chapter 19 in the book “The Paul Meehl Reader: Essays on the Practice of Scientific Psychology” (2006), edited by Waller et al.

        P.S. By the way, I just love the “selling shoes” comment by Meehl (interpreted as I wrote above), and I think it might apply to several current occupations/professions.

  2. Well, I guess it is a big mystery why every stat generation keeps repeating the same mistakes and why such seemingly simple problems are unfixable. Why is a 30-year-old paper, or a 50-year-old paper, or a 70-year-old paper warning of these problems not stale today? Because there are two different views as to what probabilities mean. Either:

    (1) P(M|N) means the frequency of M whenever conditions N exist, or
    (2) P(M|N) represents the uncertainty in M from partial evidence N.

    The second is more general than the first. The second is more useful, simpler in several senses, and more intuitive than the first. The second works better in practice than the first. But almost everyone is indoctrinated with (1) when they first encounter statistics. Indoctrinated so deeply it’s next to impossible for even “bayesians” to break free from it.

    That’s a problem because everyone who holds to (1) thinks lots of things are true which simply aren’t. It’s basically impossible to educate them otherwise, and overwhelming evidence confirms they will keep repeating the same falsehoods over and over again, generation after generation.

    I confidently predict that Meehl’s paper will be relevant for however many decades or centuries it takes statisticians to fix this foundational problem. If there was some genius mathematical analysis that would clear it up, it would have been cleared up a long time ago. If there was some pedagogical cleverness that could clear it up, this would have been fixed long ago. If there was some computer code that could fix it, it would have long been written by now. The only thing that will change it is to fix the foundations.

  3. What does Lee Sechrest mean when he refers to “the journal’s practices” here? What kind of practices did this Journal follow that were unusual / pejorative?

    • Sorry for the delay in responding, Rahul, but health problems, as well as time zone problems, keep me from being very dependable.
      The journal has page charges for publication. That is not so unusual now, but in psychology paying to have a paper published has always seemed suspect. The pressure on the journal editors was to fill up issues, and so it published some fairly marginal stuff. I suspect the rejection rate was quite low. Papers published in the journal (and I had at least one) were sent out for peer review, but that review was usually pretty cursory and easy for authors to deal with. On the other hand, some reviewers were careful (as I tried to be), and some authors took the recommendations pretty seriously.
      So, yes, papers were peer reviewed, but review was not always rigorous.
      Lee

  4. I don’t think that the core issue criticised by this wonderful paper is null hypothesis significance testing as a statistical technique. I guess the arch-frequentist Mayo would agree with pretty much everything in this paper. Out of the 10 “obfuscating factors” listed in the paper, only two, namely nos. 5 and 6, are more or less directly connected to NHST. No. 5 is about ignoring something that a good frequentist statistician knows should not be ignored. No. 6 is very important, but can easily be ignored by Bayesians with the same consequences as if ignored by significance testers (one may think that the Bayesian machinery allows one to take this into account in slightly better ways, but “where do we get the prior from?”).

    • Christian,

      If all those arch-frequentists have always agreed with this, then why is it still a problem? They had command of all the leading textbooks, all the key professorships, all the key journal editor positions, they had funding that dwarfed the Manhattan project, and so on. Arch-frequentists had total control over the fate of statistics. Why couldn’t they fix these problems?

      People went from experimenting on static cling, to the iPhone in two centuries and change. About the same length of time it took to go from Laplace to the current slop in statistics. Do you think “hypothesis testing” is just such a monumentally difficult problem while controlling the electron is child’s play?

      Here’s my theory. If you teach a methodology and your students get wrong results for 5 years, maybe it’s your teaching that’s bad. That’s just within the realm of possibility. If they get wrong results for the better part of a century, then there’s something very wrong with the theory which begat the methodology.

      • Anonymous: I suspect I know who you are and if this is so, I’ve responded to this before. People get all of statistics wrong: frequentist, Bayesian, whatever you want. Statistics is difficult, and people use whatever they can to obtain and present glowing results without much effort. Had Bayesians been controlling statistics for 50 years, I’m absolutely sure the same thing could be said.

        • What Gelman describes in the post is not normal in science. In fact, it’s an exceptionally unusual thing, possibly unique. It has one of three causes. Either

          (1) The statistics community is very incompetent.
          (2) Statistics is much harder than other problems we’ve solved.
          (3) There’s something very wrong with the foundations of statistics.

          I understand it’s your opinion that it’s a result of some combination of (1) or (2), but I don’t see any evidence for that. It remains merely an opinion. On the other hand, if the cause is (3) and we adopt your attitude that we just need to tweak some things here and there, then we’ll all be having this exact same conversation in 2045. Which is the same one they had in 1985, or 1955 for that matter. It hasn’t changed fundamentally in all that time.

        • Weren’t you a strong believer in #1 i.e. statisticians = incompetent? I thought your core point in a recent thread was that generations of statisticians have been fundamentally incompetent. Failed mathematicians and physicists taking refuge in the profession etc.

          Wasn’t that the point behind the criticism of Neyman, Rubin etc., comparing them adversely to Gibbs and Planck? What gives?

        • Rahul, there are millions of people who are no Newton but can competently master calculus. Saying statistics is full of competent people but lacks a Newton isn’t a contradiction.

          Christian’s view may have been reasonable in 1955, but in 2015 it’s proof that frequentism is an unfalsifiable theory. It’s the Theory of Epicycles of our time. Every predictive failure of epicycles can be interpreted two ways. Some will think consistent long-term failure points to the need for something like Newtonian Mechanics, while others will say the theory fits their philosophical prejudice for circles and that we just need a few more epicycles.

          People can cling to this “more epicycles” belief indefinitely. It took 1000 years and the collapse and reconstitution of European civilization to dethrone the Theory of Epicycles.

          So why even have these stupid discussions every 30 years? Every time the failures become too glaring to ignore, statisticians should get together, hold hands and simply chant “more epicycles please” ad nauseam.

        • I actually agree with you that frequentism is unfalsifiable in a certain sense. According to my interpretation, frequentism is *not* a description of reality which in this sense could be true or false, but rather a way of thinking about certain phenomena; as are the different flavours of Bayesian statistics.
          You can say that people did too many stupid things with frequentism, so you don’t like it, which is fair enough. Nowhere will you find me talking about Bayesian statistics in the same way you talk about frequentism and the frequentists; I think that it’s legitimate and often fine not to do things in a frequentist but in a Bayesian manner.

          But while we’re talking, people do stupid things with Bayes, too, as you know and have already conceded (and Bayes hasn’t exactly been around for only a year or so). So I just don’t think that this works as an argument against frequentism and in favour of Bayes.

        • (different anon)

          @christian

          I have no problem with frequentism as a calculation. After all calculations on frequencies are just a useful mathematical approximation.

          I do have a problem with frequentism as a language for the scientific method. In particular hypothesis testing is such that you never pose a model of the hypothesis in question. If you only ever falsify null hypotheses, you end up learning very little beyond that null hypotheses are wrong.

          Once you stop testing the theory in question, in my opinion, you’ve left the scientific method behind.

          Hypothesis testing _never_ explicitly tests the theory in question. It only implicitly tests the theory in question under these so-called “severity” conditions, which happen to line up with the Bayesian condition in which the posterior converges sharply around the hypothesis in question if and only if the null hypothesis is false.

      • Upon reading:
        People went from experimenting on static cling, to the iPhone in two centuries and change.
        About the same length of time it took to go from Laplace to the current slop in statistics.

        I laughed so hard that I began to worry whether I would ever breathe normally again. Funny, but a little sad—like many of the sharper observations on this blog.

        Bob

    • Christian:

      Please don’t equate null hypothesis significance testing (which I think is almost always a bad idea) with frequentism (the idea of evaluating statistical procedures based on their long-term frequency properties, which can often be a good idea).

      I don’t like Bayesian versions of null hypothesis significance testing either.

      • Andrew: Null hypothesis significance testing, well understood, is about whether data are compatible with certain probability models. I don’t see what is wrong with this in general (although I do see that many get it wrong in many situations), and I know quite a number of situations in which this is useful.
        One example is here (actually rather three examples):
        http://arxiv.org/abs/1502.02574
        Not sure whether I remember this correctly, but haven’t you used NHST for model misspecification testing?

        • Christian,

          I took a look at the methadone analysis (Section 4 of http://arxiv.org/abs/1502.02574). Unless I misunderstood what I was reading I do not think that is what people refer to as NHST. That appears to be an exploratory analysis attempting to discover features of the data. As is said there, this is useful in determining good ways to summarize the data and to point out aspects that should be explained.

          The major point is that no attempt is made to explain why any clustering should exist. See figure 2 of Meehl (1990); you do not deal with the left side (“corroboration problem”) of the diagram at all. That is where the problems arise.

          In your methadone analysis it is concluded “no evidence for real clustering”, but say clusters were detected for the sake of argument. What happens in NHST is that the researcher wants to reject “no clusters” and then say “clusters are due to reason X” (eg age, social group, etc). The procedure obviously DOES NOT offer any valid means of accomplishing this, yet the study will be designed around the premise that it does.

          Meehl, P (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It”. Psychological Inquiry 1 (2): 108–141
          http://www.tc.umn.edu/~pemeehl/147AppraisingAmending.pdf

        • Anon: No disagreement. My paper says explicitly that in this case non-rejection is more informative, because it means we found a model that can explain the data without clustering, and therefore we know that the data can’t be evidence in favour of a clustering. I believe that this is useful, and in this case the test delivers exactly what we should be interested in.
          If we have a significant (in this sense) clustering, of course we may be interested in explaining this more precisely, and you’re right that the test alone doesn’t give an explanation. But nowhere did I say that the test is the only thing we should do, or that it will tell you everything.

          You may think that a proper full Bayesian analysis is the only thing that should be done and can address all possible questions. But this is really only the case if you’re able to start, before seeing the data, with a prior over absolutely everything that is conceivable, and I have yet to see a single Bayesian analysis that manages to do this. (OK, perhaps if you model a single toss of a coin.)

        • “If we have a significant (in this sense) clustering, of course we may be interested in explaining this more precisely”

          This is what non-statisticians want an algorithm to achieve; in the absence of this, they create myths about what the statistical methods they are taught can do for them. The latter is observational fact. I would guess they do this because they cannot understand why they were taught a method, and why everyone uses a method, that cannot provide this information.

          Statisticians really need to make a clear, authoritative statement (maybe the ASA) on what they can agree to define as valid forms of NHST. This should be done both in prose and using more rigorous notation (e.g. math, logic). It should be in the first and last sentences that NHST cannot explain the reason why the null hypothesis is “false”. They should also enumerate precisely what it is that the valid forms of NHST can achieve and compare its drawbacks and merits to other approaches. Under what conditions is NHST thought to be the optimal method by its statistically-trained proponents?

          I propose that to explain an effect (here, the clustering) requires:
          1) An a priori prediction deduced from some theory consistent with new data.
          2) Ruling out any plausible alternative explanations that people come up with.

          Vague a priori predictions (“no effect”, “positive/negative effect”, “some clusters”) are not invalid. However, in practice step 2 will then be very expensive, possibly impossible, to achieve with any amount of rigor. It will require many controls and assumptions regarding construct validity to be checked. For that reason Meehl (1990) concluded we should strongly prefer, even require, precise predictions:

          “In the strong use of a significance test, the more precise the experiment, the more dangerous for the theory. Whereas the social scientist’s use… where H0 is that “These things are not related,” I call the weak use. Here, getting a significant result depends solely on the statistical power function, because the null hypothesis is always literally false.”

          The difficulty in deducing any precise prediction is the main problem with using what Hull (1935) called “isolated and vagrant hypotheses”, contrasted with “experiments which are directed by systematic and integrated theory…[which] in addition to yielding facts of intrinsic importance, has the great virtue of indicating the truth or falsity of the theoretical system from which the phenomena were originally deduced”.

          If statisticians have an alternative method of determining the best explanation for an observed “effect” (be it clustering, difference between means, or something else) then what is it? I have not been able to find it. If they do not have this, a statement to that effect should be included in the clear, authoritative statement mentioned above.

          Sorry for the long post, but I don’t see how it can be made any more concise.

          Meehl, P (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It”. Psychological Inquiry 1 (2): 108–141
          http://www.tc.umn.edu/~pemeehl/147AppraisingAmending.pdf

          Hull, C. L. Nov 1935. “The conflicting psychologies of learning—a way out”. Psychological Review, Vol 42(6), 491-516. http://psychclassics.yorku.ca/Hull/Conflict/

        • Anon: Your thoughts are appreciated. In many situations you can have neither 1) nor 2); “very expensive, possibly impossible”, as you say. If we can have it, we should, I agree. If we can’t have it, we need to be modest. Frequentists and Bayesians alike. But people want to make strong claims to get their grants, media coverage, industry support, so modesty is not fashionable. That’s not a frequentist vs. Bayes problem.

          (In my original post in this thread I emphasized that only a quite small part of Meehl’s “obfuscating factors” has to do with the frequentist vs. Bayes issue. I don’t think we do the paper justice if we focus all too much on this here.)

        • Christian,

          Sorry for any possible confusion. I am not the same person posting under the name “Anonymous” in this thread. I agree with Andrew when he wrote: “I don’t like Bayesian versions of null hypothesis significance testing either.”

          My posts here are focusing on what Meehl refers to as the “corroboration problem”. To use another Meehl term, imagine “Omniscient Jones” told us the exact effect size. Then what? If statisticians really have nothing to say on this, they really need to make it clear. It is not that anyone claimed to solve that problem, but I suspect many researchers would not bother with stats if they truly understood this to be the case. They have much bigger problems to deal with first.

        • Anon: The problem is, if I understood you correctly, that what a statistician has to say is often quite dependent on the specific application. To make sense of effect sizes is a task that should be addressed by the subject matter expert and the statistician together. Same applies to whether and when NHST is “valid”. It depends strongly on what you want to know and what you want to do, and on a number of things that I as a statistician would want to know about the background, study design, etc. So I don’t think the right thing to do for us as statisticians is to make general “authoritative statements”, but neither do I think we have nothing to say on these issues.

          Also, personally I think that people tend to jump to too strong conclusions far too quickly. Much of what we do is “exploratory” in one sense or another and the result is not a stab at a “best explanation”, but rather something that gives us somewhat better ideas and knowledge to be used in the next step.

          At the end of the day it would be sad if researchers “would not bother” with Stats if they knew how slow the process is and how far from well justified strong statements, “best explanations” etc. we still are. Because they have data to analyse in any case and I doubt that they could make better sense of them without the statisticians. The statistician may often be the one who could stop a researcher from jumping to a too strong conclusion too quickly (much work criticised by Meehl did not involve statisticians although it involved statistics, I’d believe), but you may be right that this often doesn’t make the statistician very popular.

        • Christian Hennig says: March 25, 2015 at 10:30 am

          I agree with a lot of what you say here, but my first-hand experience with other statisticians working with researchers is that they do tend to suggest they can provide much more than is reasonable from the data in hand (perhaps in an attempt to be popular). In fact, it’s often quite challenging to convince them and the researchers that they are expecting/trying to get too much out of the data.

          One example where I was unable to do this but fortunately the journal reviewers were was http://www.ncbi.nlm.nih.gov/pubmed/10825042

          In the original submission, I was excluded from authorship, as the other statisticians suggested they could identify the one best model (for removing confounding) and, along with the then-young primary investigator, argued that practically speaking they had ruled out confounding as a possible explanation.

          (Now some of the authors were expecting the journal to reject and likely went along just to let others learn.)

        • Keith,

          They are pretty careful not to “accept substantive hypothesis B” with regard to survival:
          “The proportion of cases with 30-day survival was higher than that of the controls with 30-day survival (67% vs. 34%, respectively; P = .02).”

          But then drop the statistical ball when it comes to the lab tests:
          “IVIG therapy enhanced the ability of patient plasma to neutralize bacterial mitogenicity… There was no difference in the mitogen-neutralizing capacity at baseline for cases and controls (P = .20, Wilcoxon rank-sum test)”

          Medical research is hard, real hard, to do right. Observational data is very difficult to conclude anything from without the guidance of strong theory (as in astronomy). In medicine in particular, systematic errors due to varying diagnostic methods lurk about that can cause 90%+ differences between different times/locations. Even in perfect (blinded, no drop-outs, exactly the same baseline) RCTs, the problem of construct validity often remains.

          I think Meehl is too optimistic on this front due to lack of direct experience when he writes (in the OP paper):
          “If I refute a directional null hypothesis… in a biochemical medical treatment, I thereby prove…the counter null; and the counter null is essentially equivalent to the substantive theory of interest, namely, that…tetracycline [makes a difference] to strep throats.”

          Christian wrote:
          “Also, personally I think that people tend to jump to too strong conclusions far too quickly. Much of what we do is “exploratory” in one sense or another and the result is not a stab at a “best explanation”, but rather something that gives us somewhat better ideas and knowledge to be used in the next step.”

          I would prefer that people recognized that collecting data and describing the methodology in detail is a valid scientific enterprise on its own. That data could then be used by “theorists” to come up with precise predictions, at which point a conclusion can be drawn. Then the forced statements of the form “these results suggestively implied the indication that the treatment may help” could be avoided. The statistician’s role IS important in providing methods by which to perform parameter estimation.

        • Anon says:
          March 25, 2015 at 3:05 pm

          “I would prefer that people recognized that collecting data and describing the methodology in detail is a valid scientific enterprise on its own. That data could then be used by “theorists” to come up with precise predictions, at which point a conclusion can be drawn.”

          That was argued for here: Greenland S, O’Rourke K. Meta-analysis. In: Modern Epidemiology (Rothman KJ, Greenland S, Lash TL, eds), 3rd ed. Philadelphia: Lippincott Williams; 2008: 652–682. [change “theorists” to meta-analysts]

        • Keith: You’re probably right. I’ve seen statisticians hitting the brakes in some situations but I’ve also seen reviewers using honest modesty against me as an author, and potential project partners lured away by other statisticians who promised that they could justify stronger interpretations of the data.
          Statisticians as a whole are probably as affected by the “requirements” of self-marketing as other scientists.

        • Keith O’Rourke says:
          March 25, 2015 at 4:08 pm

          My ideal way forward is to have three types of research.

          1) Experimental studies where full datasets are published along with a detailed methodology and some prose discussing anything that may be important noticed during data collection. Some exploratory analysis and parameter estimation may, but not necessarily, also be included.

          2) Meta-analytic studies that contain a *detailed* review of the literature. It should be comprehensive or cite a previous comprehensive meta-analysis for anything left out. There would be tables consisting of the primary methodological differences and similarities along with some summarized data. The presence or lack of any “exact” replications would be noted. Obviously there would be discussion and analysis of the stability of the measurements, parameter estimates (distributional, not just mean +/- error), etc.

          3) Theory development. Here a mechanistic model is presented, possibly modified from a previously falsified one. This can be deduced from some well-defined definitions and postulates (i.e. a full theory), or be phenomenological (in the sense of MOND). It is shown how the model is consistent with previous data and predictions are made regarding future data.

          Why isn’t it like this? What am I missing?

        • This is in reply to Anon’s suggestions 1-3. For some reason I can’t reply directly to that comment.

          Maybe this is naive, but how about this:

          4) Open peer review so that reviewers can receive credit/blame for their contributions in some form, and so the work contributed during the review is made public or at least accessible. To encourage this, how about a reviewer metric that evaluates the reviewers’ contribution over time — like an author citation metric. In cases where the reviewers choose to remain anonymous, a private reviewer ID can be used by the journal to forward to a service like Publons so that they still get some credit for the work.

          5) An incentive system for post-publication peer review so that post-pub reviewers are not penalized for taking the risk. One possibility is to use the reviewer metric. See (4) for the anonymous reviewer option to apply here.

          6) The possibility of pre-registration of the design and analysis protocol to (potentially) de-couple the statement of the hypotheses and analysis scheme from the data-dependent choices made during analysis. The reasons for this would be to discourage HARKing and explicitly acknowledge the contributions made during design. Maybe this could be a part of (3). This could include simulations and what is now described as “power analysis”.

          7) A journal/archive section for simulation and/or methods development. This could include documented and reviewed source code. Code contributions and reviews could contribute to one’s author/reviewer metric.

        • @Keith / Christian:

          I was thinking about what you guys described as (a) “statisticians who promised that they could justify stronger interpretations of the data” or (b) “statisticians ….. tend to suggest they can provide much more than is reasonable from the data in hand”

          Would it help if “house” statisticians were employed by the Journal? Or at least a panel of statisticians from which the journal assigned one to a submitted article? Somewhat like a reviewer.

          I’m thinking there’s a big conflict of interest otherwise. Might it help for the technical-authors to justify their conclusions on their own and the statisticians to act as an independent analyst / watchdog?

          I get the feeling that having the statistician as an integral part of the author’s team during the post-study analysis increases the pressure/likelihood of pushing conclusions stronger than justified.

        • Anon says:
          March 25, 2015 at 4:53 pm

          The money and governance or ability to regulate. Regulatory agencies (e.g. FDA) get close to your ideal (barring regulatory capture to effectively block it).

          That is required to purposefully manage science – given the politics, economics, psychology, sociology, etc.
          (Sometimes in some periods/places hands off management works, but few think that’s today’s situation).

        • Lots of good ideas here! I like pretty much everything that was suggested after my last posting. Pity that this is just the comments section of a blog.

        • Rahul says:
          March 26, 2015 at 7:21 am

          It is this myth “if only a statistician was involved” (aka if only the king knew) that bothers me.

          My first hand experience (which ain’t a representative random sample) is that more often than not they don’t make things better, and often even make things worse (though no doubt the _right_ statistician could have made things better).

          As Christian pointed out, there are “requirements” of self-marketing but my sense is also lack of training/experience and mentor-ship.

          What’s the primary motivation for statistical societies’ Professional Development courses?
          Raising money for the meeting (e.g. get well known speakers who will draw a crowd).

          What’s the primary requirement for publication in a statistical journal?
          Adequate technical development (as evidenced by difficult math) but well known authors get an exception as they have already been accredited as “real” statisticians (i.e. journals are not about communicating important ideas but rather providing input to academic hiring and promotion.)

        • Christian:

          Null hypothesis significance testing, as is commonly used in psychology and other research, has the following form: Strawman null hypothesis A is rejected, then it is claimed that substantive hypothesis B is true. I think this is generally mistaken.

        • Andrew: OK, I kind of suspected that this was the meaning you gave the term “NHST” when you gave the first response to me, and in this sense I agree. But talking about “null hypothesis significance testing” in this way is misleading because people may think that you mean all kinds of significance tests of hypotheses, including the useful ones.

  5. Hey Rahul,
    Psych Reports is an open-access, author-pay journal. Nowadays, with PLoS, Frontiers, etc., this seems pretty normal, but just a while back (certainly when Meehl published there), this journal was seen as a “vanity press”, in which you would publish, pay for it, and get less stringent peer review. I personally never published there, so cannot really comment whether the peer-review was more or less stringent than at other places.

    • Thanks! I still view pay-to-publish journals rather skeptically. They have an ugly conflict of interest at their core.

      Why not just post your paper on Arxiv instead?

        • I think there’s an approximate “enlightenment” spectrum in academic publishing. At one end I count the Math / Comp Sci guys, where openness seems high: people post manuscripts online, people used to circulate preprints, there are thriving lists/groups where top-notch people blog and critique other papers and ask questions (e.g. Terry Tao is on MathOverflow), code gets posted online, most of the work is done with open-source tools, etc.

          At the other end of the spectrum are the social sciences. Won’t publish letters critiquing a past paper. Won’t let you publish if you’ve posted online. Replication is resisted. Getting access to raw data is like pulling teeth. Etc.

          The rest of us seem somewhere in between.

        • Biomedical fields are certainly not very “enlightened.” Varmus writes, “Status conferred by the acceptance of papers in journals like Science, Cell, and Nature, or even in subsidiary journals of these “flagship” periodicals (e.g., Molecular Cell or Nature Biotechnology) has an indisputable effect on the process of recruitment and promotion of faculty.” This is a good description of the reality, even though this kind of attitude is silly, given that papers of questionable qualities often appear in these journals.

          Besides the predictable opposition from the publishers, I guess the researchers were worried whether peer-reviewed journals would accept a report that had been posted on E-biomed (What’s the point of posting the preprint if I lose the chance to publish it in Cell by doing so?) and whether the reports would go through peer review before being posted on E-biomed (How will I know if the study is any good if it hasn’t been peer-reviewed?).

          I think founding PLoS journals was an attempt to establish respectable peer-reviewed journals that are also open access. An open access alternative to Cell, if you will. They adopted the author-pay model to make it financially viable.

        • To follow up on artkqtarks comment:

          The open-access part can be really important in reaching people who don’t have cheap access to peer-reviewed journals. One example I encountered recently: Some biologists I know are interested in improving practices in making inferences from paleontological data. But many paleontologists work for natural history museums that do not have the library resources/access that universities have. So these folks weighed the advantages and disadvantages of publishing in a regular journal vs a PLoS journal, decided on the latter, and ended up glad they did: The paper got an unusually large number of downloads. (I think they also gave short talks on it at paleontology meetings, which helped publicize it)

          I suspect that Ioannidis’ reasoning in publishing his 2005 paper in a PLoS journal was similar: It would enable many more medical professionals and interested lay persons to have access to the entire paper, not just the abstract or a popular press summary.

          So pay-to-publish can be done for altruistic reasons.

  6. Paul Meehl was a brilliant, genial man. (I, too, took his seminar on the Philosophy of Psychology, long ago in the winter of 1975.) His Clinical versus Statistical Prediction (1954) is worth reading. If you Google “meehl clinical versus statistical prediction,” you will find a link to a PDF of the 1996 edition of his book on the Web site of the Dept. of Psychology of the University of Minnesota. Prof. Meehl’s preface discusses the reception that greeted his ideas. His writing is limpid and personal. Here’s a sample:

    “This little book made me famous—in some quarters, infamous—over-night; but while almost all of the numerous prizes and awards that my profession has seen fit to bestow upon me mention this among my contributions, the practicing profession and a large segment—perhaps the majority—of academic clinicians either ignore it entirely or attempt to ward off its arguments, analyses, or empirical facts. Thus I am in the unusual position of being socially reinforced for writing something that hardly anybody believes!” (p. ii)

  7. And Meehl is well known for Meehl’s paradox, which helps motivate using a region of practical equivalence to make decisions about accepting or rejecting theories.

    “Aside from the intuitive appeal of using a [region of practical equivalence] (ROPE) to declare practical equivalence,
    there are sound logical reasons from the broader perspective of scientific method. Serlin
    and Lapsley (1985, 1993) pointed out that using a ROPE to affirm a predicted value
    is essential for scientific progress, and is a solution to Meehl’s paradox (e.g., Meehl,
    1967, 1978, 1997). Meehl started with the premise that all theories must be wrong,
    in the sense that they must oversimplify some aspect of any realistic scenario. The
    magnitude of discrepancy between theory and reality might be small, but there must be
    some discrepancy. Therefore, as measurement precision improves (e.g., with collection of
    more data), the probability of detecting the discrepancy and disproving the theory must
    increase. This is how it should be: More precise data should be a more challenging test
    of the theory. But the logic of null hypothesis testing paradoxically yields the opposite
    result. In null hypothesis testing, all that it takes to “confirm” a (non-null) theory is to
    reject the null value. Because the null hypothesis is certainly wrong, at least slightly,
    increased precision implies increased probability of rejecting the null, which means
    increased probability of “confirming” the theory. What is needed instead is a way to
    affirm substantive theories, not a way to disconfirm straw-man null hypotheses. Serlin
    and Lapsley (1985, 1993) showed that by using a ROPE around the predicted value of a
    theory, the theory can be affirmed. Crucially, as the precision of data increases, and the
    width of the ROPE decreases, then the theory is tested more stringently.” (pp. 337-338 of DBDA2E)

    Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of
    Science, 34, 103-115.

    Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of
    soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806.

    Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence
    intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, &
    J. H. Steiger (Eds.), What if there were no significance tests (pp. 395-425). Mahwah, NJ: Erlbaum.

    Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle.
    American Psychologist, 40(1), 73-83.

    Serlin, R. C., & Lapsley, D. K. (1993). Rational appraisal of psychological research and the good-enough
    principle. In G. Keren & C. Lewis (Eds.), Methodological and quantitative issues in the analysis of psychological
    data (pp. 199-228). Hillsdale, NJ: Erlbaum.
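
    Here is a minimal simulation of the paradox (my own toy sketch, not taken from DBDA2E or Serlin and Lapsley), using a 95% confidence interval as a rough stand-in for the posterior interval and a ROPE of +/-0.1 around zero:

    # Meehl's paradox, sketched: the point null is always slightly false, so the
    # chance of "confirming" a directional theory by rejecting H0 grows with n,
    # while the ROPE decision instead converges on "practically equivalent to zero".
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.02        # tiny but nonzero: the null is "literally false"
    rope = (-0.1, 0.1)        # region of practical equivalence

    for n in (100, 1_000, 100_000):
        confirmed, within_rope = 0, 0
        for _ in range(500):
            x = rng.normal(true_effect, 1.0, size=n)
            res = stats.ttest_1samp(x, 0.0)
            confirmed += (res.pvalue < 0.05) and (x.mean() > 0)    # "theory corroborated"
            lo, hi = stats.t.interval(0.95, n - 1, loc=x.mean(), scale=stats.sem(x))
            within_rope += (lo > rope[0]) and (hi < rope[1])       # practically null
        print(n, confirmed / 500, within_rope / 500)

    As n grows, the rejection rate climbs toward 1, “confirming” a trivial effect, while the ROPE verdict converges on practical equivalence to zero; so with the ROPE, more precise data test the theory more stringently rather than less.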

  8. What I learned most from Meehl was his claim that NHST is hardly a test of the substantive theories. Only from him did I learn that there are actually two steps of inference: (1) from sample statistics to population effect; (2) from population effect to substantive theory. The bigger problem in “soft psychology” is not the one with statistics, but with the latter: if “if p then q” is not carefully formulated, then no “not q, therefore not p” can be attained.

  9. Help me out here. We are really interested in Pr (H0|data) but we test Pr (data|H0) (at least this is what the ASA statement on that psych journal rejecting significance tests claims, and I more or less agree).
    So (assuming I did all of these manipulations correctly), we have

    Pr[H0|data] = Pr[data|H0] / {Pr[data|H0] + Pr[data|~H0](Pr[~H0]/Pr[H0])}

    so while there is a monotonic relationship between Pr[H0|data] and Pr[data|H0], one can have highly significant Pr[data|H0] and low Pr[H0|data] if Pr[H0] is low (extraordinary claims require extraordinary evidence). If one is a Bayesian, your degree of belief about how likely ~H0 is relative to H0 influences how much credence you put in Pr[H0|data]. Now this is so obvious that it must have been written down many times before, but it would seem to me this is the way to at least start an analysis of the relationship between the observed quantity (Pr[data|H0]) and the desired quantity, and obviously I’ve just written down a variant of Bayes rule. Where I need help is as follows: is this where the general theory relating these quantities comes from (and if so, where can I find a discussion of this), or if not, why isn’t it done this way?
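
    For concreteness, here is a tiny numerical illustration of that identity (my own sketch; the likelihood numbers are made up): the same pair of likelihoods gives very different values of Pr[H0|data] depending on the prior Pr[H0].

    # Bayes' rule: Pr[H0 | data] from the two likelihoods and the prior Pr[H0].
    def posterior_h0(p_data_given_h0, p_data_given_not_h0, prior_h0):
        num = p_data_given_h0 * prior_h0
        return num / (num + p_data_given_not_h0 * (1.0 - prior_h0))

    # The same 20:1 likelihood ratio against H0, under two different priors:
    print(posterior_h0(0.01, 0.20, prior_h0=0.50))   # ~0.05: H0 looks refuted
    print(posterior_h0(0.01, 0.20, prior_h0=0.99))   # ~0.83: H0 still favored

    So the observed quantity constrains the desired quantity only through the prior odds, which is the “extraordinary claims” point.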

    • Numeric:

      I’m generally not interested in the probability that a hypothesis is true. For reasons I’ve written about in many places, I think this is typically not a helpful framework—at least not for the sorts of applied problems that I work on.

      • This answer is non-responsive. I’m interested in a reference to what seems to me to be the straight-forward way of addressing this issue, given that testing a hypothesis is still the usual way of doing statistics (certainly Basic and Applied Social Psychology thinks so), not your personal or professional interest (and I got the idea from the daily ASA Connect e-mail, so it’s not just my way of thinking about this).

        • Numeric:

          If you ask me, How many angels can dance on the head of a pin?, and I respond, Sorry, I don’t believe in angels, then this is non-responsive. But it’s still my best answer.

        • So if I examine P(Model | actual_data) and see whether it’s big or not, that’s asking about angels. If instead you check to see whether the actual_data looks like a typical (i.e. highly probable) draw from P(data | Model), well, that’s just plain old good science.

          Got it. Brilliant.

          Still, it seems like there was some sort of connection between P(Model | actual_data) and P(actual_data | Model)? If only I could remember what the connection was. Can’t put my finger on it …. it’s on the tip of my tongue … nope, don’t have it. They must not be connected at all.

        • Anon:

          You can feel free to compute posterior probabilities of models. I have not found this sort of thing helpful, for reasons I’ve discussed in many papers, including my 1995 paper with Rubin, my 2012 paper with Shalizi, chapter 7 of BDA3 (this was material that was in chapter 6 of the earlier editions), etc.

          There are special cases of well-defined problems where Pr(model) makes sense to me; we give an example in chapter 1 of BDA. But in most of the cases I’ve seen this idea applied, it does not make sense.

        • If there were a theorem relating P(Model | actual_data) and P(actual_data |Model) that would make comments like:

          “I have not found this sort of thing helpful…it does not make sense”

          seem pretty silly. Good thing there isn’t one.

        • Anon:

          I appreciate you and others pushing me to explain this more carefully but I’d appreciate a bit of clarity on your part. I can’t be sure but I think you’re being sarcastic when you say “Good thing there isn’t one,” and you’re referring to Bayes’ theorem.

          My problem with applying Bayes’ theorem in this way is the usual GIGO: in the sorts of examples I’ve worked on, the marginal probabilities of the data under different models are not so meaningful, because these probabilities depend crucially on aspects of the model that are set arbitrarily.

          Finally, you can call my attitude “silly” all you want, but it’s my experience. I have not found this sort of thing helpful in my applied work. I fully accept that others have found these methods helpful, indeed I said as much in my 1995 paper with Rubin. Lots of methods depend on assumptions that don’t make complete sense, but can still be useful. Maybe not useful to me, but useful to other practitioners who have mastered the approach and can understand the numbers that come out.

        • Andrew,

          Here is the big picture of what’s going on. Frequentists get all of their methods from intuitive ad-hoc inventions. Such methods suffer from any limitations their intuitions have. In practice, while their intuition may be superb, it still butts up against human limits, and every one of their methods has severe failings because of it.

          Bayesians have an incomparable advantage. They’re basing everything off the sum and product rule. While these often agree with our naive intuitions, they often improve on anything we can see naively. So Bayesians can use this to both improve our intuitions and to get things right when intuition fails.

          Consequently, when faced with some intuitive idea in statistics which seems to have a grain of truth, Bayesians should work to fit it into the Bayesian framework broadly defined. Jaynes did this constantly and, from my limited reading of Rubin, he did it at least sometimes. Maybe he did it all the time. By not doing that with your model checking stuff, it has the following consequences:

          (1) You’re teaching a new generation of Bayesians to engage in the same intuitive ad-hockeries which hobbled classical statistics.

          (2) You’re not teaching a new generation the real power of the sum and product rules, which is easy for new students to miss because they have to be the most innocuous looking equations ever.

          (3) There are special cases where the Bayesian version is significantly better than your intuitive model checking. If you claim that’s not true, then you’re basically saying the sum and product rules of probabilities are sometimes false. Good luck with that.

          (4) You’re giving ammunition to charlatans like Mayo, who don’t know 1/1,000,000th of the math needed to check any of these technical facts, but seize on your words to proclaim “even most Bayesians reject Bayes these days”.

          (5) You’re perpetuating the Statistician’s fallacy which, for some reason I don’t understand, statisticians commit at a rate a million times greater than everyone else. Namely the belief that “If I don’t see how to do it, it must be impossible”. Just because you don’t see how something fits in a Bayesian framework, doesn’t give you the right to claim it doesn’t fit.

        • “Frequentists get all of their methods from intuitive ad-hoc inventions. Such methods suffer from any limitations their intuitions have. In practice while their intuition may be superb, it still butts up against human limits and every one of their methods has severe failings because if it.”

          Reminds me of this quote:

          “…the anxious precision of modern mathematics is necessary for accuracy. In the second place it is necessary for research. It makes for clearness of thought, and thence for boldness of thought and for fertility in trying new combinations of ideas. When the initial statements are vague and slipshod, at every subsequent stage of thought common sense has to step in to limit applications and to explain meanings. Now in creative thought common sense is a bad master. Its sole criterion for judgment is that the new ideas shall look like the old ones. In other words it can only act by suppressing originality.”

          Whitehead, A. N. An Introduction to Mathematics. Cambridge: Cambridge University Press, 1911. http://www.gutenberg.org/ebooks/41568

        • (different anon)

          @andrew I understand you don’t want to get into the habit of taking statements like “model is true” literally, but honestly it seems to be blurry to me. Effect/parameter estimation just seems like a stand-in for a very simple model.

          In other words, isn’t “the probability that the effect is between X and Y” operationally equivalent to “the probability of the set of models whose effect size is between X and Y”?

        • Anon:

          Yes, I agree there is no bright line. The point is that when a model is being tested, or two models are being compared, in this way, it is the statistical model being tested or compared, not the scientific hypothesis.

          To compute the Bayes factor, say, is to compute the relative probabilities of two statistical models in a very narrowly defined statistical framework. And it turns out that, for lots of the sorts of statistics problems where Bayes factors are promoted, the Bayes factor is highly sensitive to arbitrary and essentially uncheckable aspects of the model: aspects of the prior distribution that don’t really affect posterior inference conditional on either of the models but which have a huge effect on the Bayes factor. So, for this technical reason, I think that Bayes factors typically don’t work.
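
          To make that sensitivity concrete, here is a toy numerical sketch (my own illustration, with a made-up observation and prior scales, not a real analysis): a single data point y ~ N(theta, 1), comparing the point null theta = 0 against an alternative with theta ~ N(0, tau^2). The Bayes factor tracks the arbitrary prior scale tau, while the posterior for theta under the alternative barely moves.

          ```python
          import numpy as np
          from scipy.stats import norm

          y = 1.5          # single observed data point (hypothetical)
          sigma = 1.0      # known sampling standard deviation

          for tau in (1.0, 10.0, 100.0):   # candidate "weak" prior scales
              # Marginal likelihoods: p(y | H0) with theta fixed at 0, and
              # p(y | H1) with theta ~ N(0, tau^2) integrated out analytically.
              m0 = norm.pdf(y, loc=0.0, scale=sigma)
              m1 = norm.pdf(y, loc=0.0, scale=np.sqrt(sigma**2 + tau**2))
              bf01 = m0 / m1   # Bayes factor in favor of the point null

              # Posterior for theta under H1 (conjugate normal-normal update)
              post_var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
              post_mean = post_var * y / sigma**2
              print(f"tau={tau:6.1f}  BF01={bf01:7.2f}  "
                    f"post mean={post_mean:.3f}  post sd={np.sqrt(post_var):.3f}")
          ```

          Widening the “weak” prior from tau = 10 to tau = 100 multiplies the Bayes factor in favor of the null by roughly 10, while the posterior mean and sd for theta under the alternative are essentially unchanged.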

        • I think I got what I needed from the various exchanges. I think the underlying difficulty with communication on this matter is that you don’t really believe in NHST, so Pr(data|H0) is not the fundamental concept that most statisticians think it is. A particular example: H0 says the data come from a normal(mu, 1), so Pr(H0|data) is meaningful; but the estimation also rests on the assumption of normality, which Pr(H0|data) doesn’t really test, since H0 consists not only of the value of mu but also of the normality assumption. So Pr(H0|data) is misleading at best (and potentially completely wrong) as a test of the model. Is this your thinking in a nutshell, or is it more complicated than that?

      • Was it a helpful framework when Nate Silver computed Pr( Obama wins 2012 election | data)? Or are elections not the sorts of applied problems that Political Scientists work on?

        • Anonymous:

          Of course I have no problems computing posterior probabilities of events! I wrote a whole book on the topic, and indeed we have an example in that book of computing the probability, conditional on data, that a candidate wins a presidential election.

          What the commenter above asked for was the probability that a model is true. That is what I find generally unhelpful.

        • You see some fundamental distinction between events and models that I don’t see. If I’m a biologist and I’m building a statistical model using the hypothesis “a meteor wiped out the dinosaurs,” is that an event or a model?

        • Anon:

          The trouble is that an event such as “a meteor wiped out the dinosaurs” does not imply a single model for data. And you can’t compute a meaningful posterior probability for such an event without a full probability model.

          Think of it this way.

          Suppose you have two general scientific theories, T1 and T2 (for example, T1 is the theory that ovulation is related to voting, and T2 is the theory that there is no relation), and corresponding statistical models M1 and M2 (that is, probability models with unknown parameters, priors on the parameters, probability distributions for observed data, measurement error, sampling, the whole deal).

          Here are my problems with using Bayesian inference to get the posterior probabilities of T1 and T2:

          1. In the sort of applications I’ve worked on, it just doesn’t make sense to talk about the probability that T1 or T2 is true (in the example I’ve just given, everything is related to everything; there is certainly some connection between ovulation and voting).

          2. The connection between T and M is typically speculative and weak. That’s a key problem with null hypothesis significance testing (Bayesian or otherwise): the rejection of M2 is taken as evidence in favor of T1, but this inference is highly dependent on how M1 and M2 are formulated (for example, assumptions about measurement error).

          3. In the Bayesian setting in particular, the posterior probabilities of T1 and T2 are typically highly dependent on aspects of the prior distribution that are set arbitrarily: for example, if you change a weak prior on some parameter from N(0,10^2) to N(0,100^2), you change the likelihood ratio by roughly a factor of 10. I’ve written about this in various places, notably my 1995 article with Rubin and chapter 7 of BDA3.
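
          To spell out where that factor of 10 comes from in the simplest case (a sketch under stated assumptions: one observation y ~ N(theta, 1) and a weak prior theta ~ N(0, tau^2) under the “open” model M1), the marginal likelihood is

          ```latex
          p(y \mid M_1) = \int \mathcal{N}(y \mid \theta, 1)\,\mathcal{N}(\theta \mid 0, \tau^2)\,d\theta
                        = \mathcal{N}(y \mid 0,\, 1 + \tau^2)
                        \approx \frac{1}{\sqrt{2\pi}\,\tau} \quad \text{for } \tau \gg \max(|y|, 1).
          ```

          Once the prior is wide relative to the data, p(y|M1) scales like 1/tau, so moving from tau = 10 to tau = 100 divides it by about 10 and multiplies the likelihood ratio in favor of the point null by about 10, even though the posterior for theta within M1 is essentially unchanged.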

          If I = “a meteor wiped out the dinosaurs”, then in practice we deal with things like P(data | I, K), where K represents a lot of other hypotheses or information. So if you use Bayes’ theorem you’re really getting P(I | data, K). Usually we don’t explicitly write the K, but it’s always there.

          So saying P(I | data, K) has a dependence on K isn’t any kind of limitation on computing or using P(I | data, K). Since such K’s are, in truth, always present, you’re basically denying that Bayes’ theorem ever “makes sense”.

          If K is true, then you’ve learned something about meteors. If K is questionable, then you need to evaluate it. In order to evaluate a claim it’s “useful” to know what that claim implies. Expressions like P(I|data, K) tell you what K implies.

          So it both “makes sense” and is “useful” to compute P(I | data, K). What doesn’t make sense, and what isn’t useful, is to pretend that you can get at the truth of I without dealing with K.

          Your point (3) is another whole ball of wax. If you’re using probabilities to model uncertainty rather than frequencies, that problem will NEVER occur. If it’s “arbitrary” whether N(0, sig=10) or N(0, sig=100) is used, that can only happen if you have no evidence for or against values like 50. It’s up to you to use distributions that faithfully reflect the uncertainty implied by the evidence you have. If you don’t, that’s your limitation, not Bayes’s.

        • Andrew,

          The goal of statistical analysis isn’t to get to the truth. That’s impossible for us mortals in general. If that were the goal, though, then perhaps posteriors for models wouldn’t “make sense” and wouldn’t be “useful”. Frequentists seem to be guided by that kind of mindset, and they have no use for posteriors.

          The real goal is subtly weaker, but actually achievable by us humans. The actual goal of a statistical analysis is to get as close to the truth as the evidence allows. In that case, posteriors for models make perfect sense and are useful.

        • Is the point here the ambiguity of whether a ‘model’ or ‘theory’ is a parameter in a statistical model or the model structure itself?

          I.e. consider the schema:

          S: parameter -> model -> output

          That is: integral [p(y|theta)p(theta)]dtheta = p(y)

          In Andrew’s example he is saying that T1 and T2 correspond to two *different* instances of S, right? He can compute the probability of the parameters for each case, conditional on the model structure in each case, but these parameters and the model structure are not (yet) directly comparable between models.

          If you do want to compare them directly then you need to embed them within a bigger model – e.g. with a continuous parameter for the effect size of ovulation (a toy sketch of this embedding appears at the end of this comment). In this case one may compare different theories formulated in terms of effect size (e.g. is it ‘big’ or ‘small’ or whatever). However, you would still want to check whether your ‘super’ model is reasonable before you believe the effect-size estimates, and in many cases it might not be.

          Now it may also be possible to make the model structure so general – pushing all the assumptions into background parameters to be estimated as well (e.g. heading towards nonparametric or hierarchical stuff) – that you think you can always fit the data adequately and so don’t ‘need’ to model check.

          However, you still have to think about the correspondence between your general model structure and the substantive theory. E.g. is the generality of your statistical model based on scientific understanding, or is it just additional degrees of freedom introduced to make sure you don’t need to model check? There’s a sort of bias (substantive theory) vs. variance (statistical model flexibility) tradeoff.

          In general most of our scientific theories (esp. for ‘noisy’ fields) will be inadequate to fully capture the data – i.e. there is no model structure + parameter combination that fully captures everything in a given data set.

          Here you could look to see the reasons it fails and what that tells you, or you could go to the other extreme and make models that ‘predict everything and hence nothing’ – i.e. machine learning (which might be a perfectly reasonable option if you don’t care about ‘understanding’).

          Relating back to Anon’s I and K comment – I think Andrew is saying he doesn’t care to compute expressions like P(K|…), he’d rather reason along the lines of “Expressions like P(I|data, K) tell you what K implies” (!) when checking background assumptions.
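
          A toy sketch of the ‘embed in a bigger model’ idea above, done by brute-force grid approximation (the survey counts, prior scales, and grid are all made up, chosen only to show the mechanics):

          ```python
          import numpy as np
          from scipy.stats import norm, binom

          # Hypothetical counts: support for a candidate among women surveyed
          # on high- vs low-fertility days (numbers invented for illustration).
          n_hi, y_hi = 120, 64
          n_lo, y_lo = 130, 61

          # Super-model: baseline support p0 (uniform over the grid) and a
          # continuous effect delta with a weak prior, so that "no relation"
          # is just the point delta = 0 inside one model rather than a
          # separate discrete hypothesis.
          p0_grid = np.linspace(0.30, 0.70, 201)
          delta_grid = np.linspace(-0.20, 0.20, 201)
          P0, D = np.meshgrid(p0_grid, delta_grid, indexing="ij")

          log_prior = norm.logpdf(D, 0.0, 0.05)
          log_lik = (binom.logpmf(y_lo, n_lo, P0) +
                     binom.logpmf(y_hi, n_hi, np.clip(P0 + D, 1e-6, 1 - 1e-6)))
          log_post = log_prior + log_lik
          post = np.exp(log_post - log_post.max())
          post /= post.sum()

          # Marginal posterior for the effect size (sum out the baseline p0)
          post_delta = post.sum(axis=0)
          mean_delta = (delta_grid * post_delta).sum()
          pr_small = post_delta[np.abs(delta_grid) < 0.02].sum()
          print(f"posterior mean effect: {mean_delta:+.3f}")
          print(f"Pr(|delta| < 0.02 | data): {pr_small:.2f}")
          ```

          The output is a posterior for the effect size, which you can then interrogate (is it ‘big’ or ‘small’?), rather than a posterior probability that either discrete theory is true – and the ‘super’ model itself still deserves checking before you take those numbers seriously.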

  10. What’s a good paper to see “falsificationist Bayes” applied in the wild, i.e. applied work that uses falsificationist Bayes rather than a methodological paper?

    Does anyone know of a study where someone has used both frameworks in parallel to analyse a problem? i.e. NHST and falsificationist Bayes both?

    It’d be interesting to see them juxtaposed side by side on the same problem / dataset.

    • Bayes has a falsification workflow built into the inference, because any analysis ends with a set of generative models in the form of the posterior distribution – a set of falsifiable models for the next experiment (a minimal sketch of this is below).

      By contrast, in the practice of NHST there’s almost never a _quantitative_ falsifiable hypothesis proposed after doing the p < .05 ritual.
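
      As a minimal sketch of that workflow (simulated data and a normal model; my own toy example, not drawn from any of the applied papers discussed here): fit the model, simulate replicated datasets from the posterior, and compare a tail-sensitive test statistic of the observed data with its posterior predictive distribution.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)

      # Hypothetical data, secretly heavy-tailed; we (mis)model it as iid normal.
      y = rng.standard_t(df=3, size=50) * 2 + 1
      n = len(y)

      # Posterior draws for (mu, sigma^2) under the normal model with the
      # standard noninformative prior: sigma^2 from its scaled-inverse-chi^2
      # marginal, then mu | sigma^2 from a normal.
      ndraws = 4000
      sigma2 = (n - 1) * y.var(ddof=1) / rng.chisquare(n - 1, size=ndraws)
      mu = rng.normal(y.mean(), np.sqrt(sigma2 / n))

      # Posterior predictive replications of a tail-sensitive test statistic
      T_obs = np.max(np.abs(y - y.mean()))
      T_rep = np.empty(ndraws)
      for s in range(ndraws):
          y_rep = rng.normal(mu[s], np.sqrt(sigma2[s]), size=n)
          T_rep[s] = np.max(np.abs(y_rep - y_rep.mean()))

      # Posterior predictive p-value: values near 0 or 1 flag a direction in
      # which the fitted model fails to reproduce the data.
      p_value = np.mean(T_rep >= T_obs)
      print(f"posterior predictive p-value: {p_value:.3f}")
      ```

      A p-value near 0 or 1 for a statistic you care about is the ‘falsification’ step: it tells you the fitted generative model could not plausibly have produced that feature of the data, and points at what the next model needs to handle.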

    • Rahul:

      We have many applied examples of falsificationist Bayes in Bayesian Data Analysis. Or, if you don’t have that book, you can go to my webpage and choose among dozens of applied research articles where my collaborators and I use this framework.

  11. Andrew, I’ve never seen you use the phrase “falsificationist Bayes” before as something you’d support. Can you (or someone) point to a paper (or blog post) where you elaborate on your particular (i.e., not Jaynes’) version of this?
