Psych journal bans significance tests; stat blogger inundated with emails

OK, it’s been a busy email day.

From Brandon Nakawaki:

I know your blog is perpetually backlogged by a few months, but I thought I’d forward this to you in case it hadn’t hit your inbox yet. A journal called Basic and Applied Social Psychology is banning null hypothesis significance testing in favor of descriptive statistics. They also express some skepticism of Bayesian approaches, but are not taking any action for or against it at this time (though the editor appears opposed to the use of noninformative priors).

From Joseph Bulbulia:

I wonder what you think about the BASP’s decision to ban “all vestiges of NHSTP (P-values, t-values, F-values, statements about “significant” differences or lack thereof and so on)”?

As a corrective to the current state of affairs in psychology, I’m all for bold moves. And the emphasis on descriptive statistics seems reasonable enough — even if more emphasis could have been placed on visualising the data, more warnings could have been issued around the perils of un-modelled data, and more value could have been placed on obtaining quality data (as well as quantity).

My major concern, though, centres on the authors’ timidity about Bayesian data analysis. Sure, not every Bayesian analysis deserves to count as a contribution, but nor is it the case that Bayesian methods should be displaced while descriptive methods are given centre stage. We learn by subjecting our beliefs to evidence. Bayesian modelling merely systematises this basic principle, so that adjustments to belief/doubt are explicit.

From Alex Volfovsky:

I just saw this editorial from Basic and Applied Social Psychology: http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.1012991

Seems to be a somewhat harsh take on the question, though it gets at the frequently arbitrary choice of “p<.05” being important...

From Jeremy Fox:

Psychology journal bans inferential statistics: As best I can tell, they seem to have decided that all statistical inferences from sample to population are inappropriate.

From Michael Grosskopf:

I thought you might find this interesting if you hadn’t seen it yet. I imagine it is mostly the case of a small journal trying to make a name for itself (I know nothing of the journal offhand), but still is interesting.
http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.1012991

From the Reddit comments on a thread that led me to the article:
“They don’t want frequentist approaches because you don’t get a posterior, and they don’t want Bayesian approaches because you don’t actually know the prior.”
http://www.reddit.com/r/statistics/comments/2wy414/social_psychology_journal_bans_null_hypothesis/

From John Transue:

Null Hypothesis Testing BANNED from Psychology Journal: This will be interesting.

From Dominik Papies:

I assume that you are aware of this news, but just in case you haven’t heard, one journal from psychology issued a ban on NHST (see editorial, attached). While I think that this is a bold move that may shake things up nicely, I feel that they may be overshooting, as it seems to me that the real problem is not the technique per se but rather how it is used. The editors also state they will put more emphasis on sample size and effect size, which sounds like good news.

From Zach Weller:

One of my fellow graduate students pointed me to this article (posted below) in the Basic and Applied Social Psychology (BASP) journal. The article announces that hypothesis testing is now banned from BASP because the procedure is “invalid”. Unfortunately, this has caused my colleague’s students to lose motivation for learning statistics. . . .

From Amy Cohen:

From the Basic and Applied Social Psychology editorial this month:

The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would be not required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP. With the banning of the NHSTP from BASP, what are the implications for authors?

From Daljit Dhadwal:

You may already have seen this, but I thought you could blog about this: the journal “Basic and Applied Social Psychology” is banning most types of inferential statistics (p-values, confidence intervals, etc.).

Here’s the link to the editorial:
http://www.tandfonline.com/doi/full/10.1080/01973533.2015.1012991

John Kruschke blogged about it as well:
http://doingbayesiandataanalysis.blogspot.ca/2015/02/journal-bans-null-hypothesis.html

The comments on Kruschke’s blog are interesting too.

OK, ok, I’ll take a look. The editorial article in question is by David Trafimow and Michael Marks. Kruschke points out this quote from the piece:

The usual problem with Bayesian procedures is that they depend on some sort of Laplacian assumption to generate numbers where none exist. The Laplacian assumption is that when in a state of ignorance, the researcher should assign an equal probability to each possibility.

Huh? This seems a bit odd to me, given that I just about always work on continuous problems, so that the “possibilities” can’t be counted and it is meaningless to talk about assigning probabilities to each of them. And the bit about “generating numbers where none exist” seems to reflect a misunderstanding of the distinction between a distribution (which reflects uncertainty) and data (which are specific). You don’t want to deterministically impute numbers where the data don’t exist, but it’s ok to assign a distribution to reflect your uncertainty about such numbers. It’s what we always do when we do forecasting; the only thing special about Bayesian analysis is that it applies the principles of forecasting to all unknowns in a problem.
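
To make that distinction concrete, here is a minimal sketch (toy numbers of my own, assuming only numpy; nothing here comes from the editorial or the original post) of the difference between deterministically imputing a single value for an unknown and assigning it a distribution whose draws carry the uncertainty forward:

# Sketch only: a made-up "next observation" forecast, treated two ways.
import numpy as np

rng = np.random.default_rng(2)

observed = np.array([1.1, 0.9, 1.4, 1.2, 1.3])   # hypothetical data
mu_hat, sigma_hat = observed.mean(), observed.std(ddof=1)

# Deterministic imputation: one number, no uncertainty carried forward.
point_forecast = mu_hat

# Distributional treatment: draws expressing uncertainty about the same
# unknown, which can be propagated through any downstream calculation.
draws = rng.normal(mu_hat, sigma_hat, size=1000)

print(f"point forecast: {point_forecast:.2f}")
print(f"interval from draws (2.5%, 97.5%): "
      f"({np.quantile(draws, 0.025):.2f}, {np.quantile(draws, 0.975):.2f})")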

I was amused to see that, when they were looking for an example where Bayesian inference is OK, they used a book by R. A. Fisher!

Trafimow and Marks conclude:

Some might view the NHSTP [null hypothesis significance testing procedure] ban as indicating that it will be easier to publish in BASP [Basic and Applied Social Psychology], or that less rigorous manuscripts will be acceptable. This is not so. On the contrary, we believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research. We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking.

I’m with them on that. Actually, I think standard errors, p-values, and confidence intervals can be very helpful in research when considered as convenient parts of a data analysis (see chapter 2 of ARM for some examples). Standard errors etc. are helpful in giving a lower bound on uncertainty. The problem comes when they’re considered as the culmination of the analysis, as if “p less than .05” represents some kind of proof of something. I do like the idea of requiring that research claims stand on their own without requiring the (often spurious) support of p-values.

75 thoughts on “Psych journal bans significance tests; stat blogger inundated with emails”

  1. This is a really really bad idea and a horrible president. Either people need to be persuaded to use better statistics or they remain unconvinced and they continue with their crap research. Those are the only two options. Trying to win scientific debates by outlawing the other side has never worked in the past. Frequentists did this constantly to Bayesians long ago. How did that work out for Frequentists and for science? Bayesians have nothing to gain from encouraging these kinds of policies.

    • It amazes me just how readily academics agree to banning opinions they don’t like. It doesn’t surprise me when the sociology department of the University of Wisconsin wants to ban many common everyday words. They would gladly volunteer to man the firing squads if there were ever to be a Stalinesque purge of political opponents in the US. That’s to be expected from their ilk.

      What’s shocking is just how readily seemingly normal academics go in for this sort of thing. Just on this blog alone there was Phil wanting to subject all males to re-education thought control because some guy at NASA wore a shirt with cartoon women on it. Some lady named “Lucy” stated plainly she would disrupt so no one could hear a seminar speaker she didn’t like. Bill Jeffreys said basically the same about another seminar speaker he didn’t like. And now we have a widespread acceptance of banning certain methods, and very few defenders of the idea that the only acceptable way to win scientific debates is persuasion.

      What happens when the winds of desire to ban unlike opinions start blowing against you (as they inevitably will)? I have good reason to believe for example that the chi-squared test with p-value is actually a Bayesian procedure in Frequentist clothing. Is it banned too?

      This has actually reduced my opinion of academics, which I keep thinking couldn’t get any lower. Academics aren’t very smart typically, they don’t do anything for the tax dollars they waste, they’re small petty worthless loser types you’d never want to share a foxhole with. And to add to that, whenever there’s a movement to ban words/opinions/ideas, a closer look pretty well always reveals that it’s led by academics.

      • Censorship is always a much wronger route than encouraging/facilitating less wrong information processing.

        As for academics, it’s usually not about the people but the situations they find themselves in – few purposely want to make the world a worse place.

        Banning statistical technique X while likely accepting error-ridden, irreproducible, or even faked data is a clear sign of not understanding scientific inquiry.

      • What opinion is being banned exactly? What debate is being stopped?

        I don’t view this as banning an opinion. They’re banning all statistics that one might traditionally have called inferential. One might have an opinion about Bayesian or frequentist inferential methods but this ban is completely neutral on that. You can’t use either.

        What they’re permitting is something no one objects to showing. Demanding that researchers tie their conclusions, rationales, etc. to descriptive statistics will allow for open rational debate about subjects of substance to the journal.

        • They are banning methods, which effectively ends the debate on those methods. It’s irrelevant whether they’re equal opportunity banners or not. It’s attracting huge interest not because of this one policy and this one journal, but because of the big picture influences it might have for science in general.

          The writers of the articles and the reviewers need to figure out for themselves individually whether what is being published makes sense. Journal rules can’t do this for them. It’s their responsibility to do this. Hell, it’s their only responsibility. They should stop using bans to avoid thinking and get on with their job.

          Fundamentally there are no institutional games or gimmicks which produce good science. It always comes down to individual minds either doing smart thinking that advances science, or dumb thinking that retards it. Such bans do nothing to increase the former, and can easily increase the latter.

        • Yeah but they’re only banning in their own journal. In the context of many psychology journals, where there are many potential platforms to NHST away, what’s wrong with this? This action doesn’t effectively end any debate because there are many other platforms for debate.

        • Um, speaking for myself, I’m not worried at all about wider bans on p-values and NHSTs. I care about this because I find it interesting/entertaining to read and think about some really off-the-wall opinion on a topic about which I know something.

      • ” I have good reason to believe for example that the chi-squared test with p-value is actually a Bayesian procedure in Frequentist clothing.”

        Just on a side note, do you know of some post or paper where such a link is described? Not questioning it, just wanting to know more :)

        • Rasmus,

          The place to look is Jaynes’s book and several of his papers where he talks about the chi-squared test being an approximation to entropy (or at least it should be called entropy, usually it’s called relative entropy or Kullback–Leibler divergence or something.) You may want to look at the papers “Where do we Stand on Maximum Entropy?” and “Concentration of Distributions at Entropy Maxima”.

          But here’s a deeper and simpler explanation for what I think is going on. So imagine a Bayesian world where P(x|A) is modeling the uncertainty about some true value x* rather than frequencies. The high probability manifold (HPM) of P(x|A) is a kind of bubble around x* (if you’ve done your modeling right that is!) that describes an uncertainty range for x*.

          In that world it’s important to check sometimes whether a value like x* is in the HPM of P(x|A). You could think of this in several different ways. In some instances it’s equivalent to checking whether a Bayesian credibility interval for x* is an accurate prediction for x*’s location. In other model-building instances, it’s equivalent to performing a posterior predictive check in the style of Gelman.

          That’s all completely general. Now specialize this to the case of repeated trials. It’s helpful to think of a concrete example, so use Jaynes’s famous dice example. Let x_1, … ,x_n be a sequence of dice rolls where each x_i has one of the values 1,2,3,4,5,6. Imagine we have a sequence of observed values x*_1,…,x*_n which we want to check whether it’s in the HPM of some P(x_1,…,x_n|A), just as described for the general case.

          In Jaynes’s dice example n=20,000, which is a very inconvenient distribution to deal with. To make life simpler, we can instead use P(x_1,…,x_n|A) to derive a distribution over frequency distributions, P(f_1,…,f_6|A). Using the observed x*_1,…,x*_n we can trivially compute an observed frequency distribution f*_1,…,f*_6. So rather than check whether x*_1,…,x*_n is in the HPM of P(x_1,…,x_n|A), we can check instead whether f*_1,…,f*_6 is in the HPM of P(f_1,…,f_6|A).

          This is a massive convenience because it reduces the dimensionality of the problem from n=20,000 down to 6.

          One more fact before getting to the punch. For a wide and common class of distributions on x_1,…,x_n (essentially distributions of exponential type, like all the common distributions taught in stats books), you get an interesting phenomenon where P(f_1,…,f_6|A) is very sharply peaked about some modal value f’_1,…,f’_6. This modal value turns out to be a maximum entropy distribution itself, just like all the common distributions (normal, Poisson, …) taught in statistics.

          So here finally is the rub: checking whether x*_1,…,x*_n is in the HPM of P(x_1,…,x_n|A), for n greater than about 30 or so, is equivalent to performing a classical chi-squared test using f*_1,…,f*_6 and f’_1,…,f’_6.

          Bottom line: the legitimate parts of Frequentism fall out of Jaynes’s probability theory as special cases.
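
          For what it’s worth, here is a minimal numerical sketch of the approximation described above (an illustration added here, not part of the original comment; it assumes numpy is available): for observed dice frequencies close to the expected ones, the Pearson chi-squared statistic is roughly 2n times the relative entropy (KL divergence) between observed and expected frequencies.

# Sketch only: compare the Pearson chi-squared statistic with 2*n times the
# KL divergence (relative entropy) between observed and expected dice
# frequencies, in the spirit of the Jaynes-style argument sketched above.
import numpy as np

rng = np.random.default_rng(0)
n = 20000                          # number of rolls, as in Jaynes's example
p_expected = np.full(6, 1 / 6)     # fair-die frequencies

rolls = rng.integers(0, 6, size=n)
f_observed = np.bincount(rolls, minlength=6) / n

chi2 = n * np.sum((f_observed - p_expected) ** 2 / p_expected)
kl = np.sum(f_observed * np.log(f_observed / p_expected))

print(f"chi-squared statistic: {chi2:.3f}")
print(f"2 * n * KL divergence: {2 * n * kl:.3f}")  # nearly the same number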

    • The problem is not people using p-values, the problem is people don’t understand Bayesian Statistics in general, or the idea of using P(x|A) to objectively model the uncertainty in x given knowledge A. How does banning p-values fix the problem? It doesn’t.

      Here are my very confident predictions as to what will happen if p-values are banned on a wide scale:

      (1) Everyone who used p-values before will still be frequentists at heart and will still do the same old crap just without p-values.

      (2) Bayesians will have dramatically less incentive to set their own house in order, because they’ve superficially won. They will stop digging deeper into the foundations of Bayesians statistics. This might be ok except for the fact that 99% of the time Bayesians don’t understand their own subject very well.

      (3) Researchers who are Frequentists at heart, but who claim to be doing Bayesian stats, will give applied Bayesian stats a bad name in the long run.

      Bottom line: if you want people to stop using p-values persuade them to do so. Banning p-values does nothing good. Nothing good at all.

  2. This is a step in the right direction, but what really needs to be banned is the abuse of correlation coefficients. In one notorious example, a psychology paper called “dead and alive” reported a correlation between people who thought Diana was murdered and those who thought she had faked her own death, and concluded that “the more participants believed that Princess Diana faked her own death, the more they believed that she was murdered”. In fact, precisely zero participants believed both these things, and the correlation arose simply because the vast majority believed neither.

    More generally, journals should force authors to publish all the data with the paper. If that had been done the above error would have been spotted immediately.
    In another example, a psychology paper had a data file with a participant whose age was given as 32757. Unsurprisingly, this influenced some of the age-related results in the paper. It seems that some authors do not even look at their data before putting it into some piece of software they don’t really understand.

    • Do you have a reference on the “dead and alive” paper? I’ve found the paper, but I’d love to see the analysis that showed that none of the participants held the mutually contradictory beliefs. I collect these sorts of examples for my book.

      • The Wood et al dead and alive issue is discussed at a blog called hiizuru and at climate audit.
        That should be sufficient for you to find it. Unfortunately I can’t see any link to the actual data file, though someone must have posted it up because I have the file in my filespace (mail me at nottingham if you want it). Of course, no ‘analysis’ is needed – you just need to look at the data!

        [I did not want to mention climate audit in my first comment, because as soon as the climate issue is raised there is an unfortunate tendency for some people to abandon their usual standards of objectivity].

  3. I hope horrible “president” was not a political comment – this blog is mercifully relatively free of that.
    My take is a bit different. Given how slowly things change in academic circles, this is an amazing development. Misguided, yes, but at least it shakes things up. There are any number of better policies – high on my list would be insistence on providing the dataset for purposes of replication and establishing a publication category specifically for replication attempts. It is not at all clear that banning inference is a good idea, but at least it IS an idea.

    • Good point, although in the end I disagree.

      I do think there’s a place in science (and scientific publishing) for crazy experiments that seem unlikely to work or even teach us anything useful (https://dynamicecology.wordpress.com/2012/10/01/experiments-so-crazy-they-just-might-work/). And probably better for a small obscure journal to try something so radical rather than, say, Nature.

      On the other hand, I do think there’s an outer limit of experiments that are so far outside standard practice, and so poorly motivated, that they’re not worth undertaking. I mean, if the journal announced that they’d be consulting a magic 8 ball to decide what papers to accept, or that they were going to require NHSTs to use type I error rates of 0.95 rather than 0.05, I think we’d all agree that those experiments would be worthless even though they shake things up. And from where I sit, there’s already lots of active debate and experimentation in science when it comes to how to do and report data analyses. So I don’t think a poorly-conceived idea like banning inferences is going to get people thinking and talking more, or more usefully, than they otherwise would have.

  4. These days I’m leaning towards this attitude, in general: If the effect is there and significant (not in the statistical sense), I’ll “see” it from the descriptive statistics and/or the graphs.

    If I cannot see such an effect, then it is probably not an effect I should care about.

    Is this attitude wrong / dangerous?

    • I don’t mean this to be glib, though I realize it may sound that way, but I suspect that if you think there is a general answer to the question “Is this attitude wrong / dangerous?” applicable to all data sets in all circumstances, then yes, the attitude you describe is probably dangerous. On the other hand, if you adopt the attitude you describe, then ask yourself the question “Is this attitude wrong / dangerous?” for each set of circumstances and each set of data that you analyze, and try hard to answer the question for each set of circumstances and data, then you probably won’t go too far wrong. But I would be curious to hear a statistician’s answer to your question.

      • Intraocular trauma would have you hit in the eye.

        The interocular trauma test is positive when the result hits you between the eyes.

        Edwards, Lindman, and Savage (1963, Bayesian Statistical Inference for Psychological Research) attribute it to Berkson (1958).

        • Oops. What you say makes sense, George. Perhaps I’ve been applying the wrong test all along. :-)

          I first saw the IOT test attributed to or described by Tukey, but I don’t really have a source (I have Edwards et al. somewhere but not with me; I haven’t seen Bookstein–thanks, Martha). A quick Google verbatim search on “intraocular trauma test” and “interocular trauma test” seems to turn up a significant number of both alternatives. Perhaps the intraocular folk are looking at the ocular system (two eyes), and the interocular folk are looking at eyeballs.

        • Bookstein (p. xxvii) quotes Edwards, Lindman and Savage as saying (p.217): “It has been called the interocular traumatic test; you know what the data mean when the conclusion hits you between the eyes.” He calls it the ITT.

    • Rahul: I think I learnt quite a bit about what can happen (or not) just because of random variation from running tests and computing p-values.

      Assuming that your intuition about what is, say, meaningful does not only concern the size of an effect but also whether what was observed could be distinguished from meaningless random variation, and assuming that it works well – do you really think many people have such a well-developed intuition without having at least computed some tests (or something similar) on data they knew, in order to build it up?

      • Christian Hennig:

        Do you have an example, of an effect that you care about which graphs etc. wouldn’t have revealed to a naive observer but systematic NHST would have?

        • This obviously depends on who counts as “naive observer”. I have given statistical advice to a not so small number of PhD students and early career scientists, and I often tell them that they should look at their data and not rely on tests, at least not on as many as they feel they need to do. My experience is that many of them are very, very wary of trusting anybody’s intuition and think that numbers are much better. When I look at their data and claim that it’s clear for everyone to see what is going on, some realize that they can see it indeed, but some would just insist so strongly that they need an “objective” number that in the end I don’t know whether they just don’t trust their intuition (and could be helped by advertising intuition over p-values more) or whether they don’t even have one. I have the confidence to see things in data graphs and to be sure about them, but I still think that playing around with p-values, among other things, helped me build this up.

        • I have the confidence (read: the rose-tinted spectacles of cognitive & confirmation biases) to see things in data graphs and to be sure about them…

          You’re welcome.

        • Mike: Surely making such a statement informed neither by looking at any graph nor at any conclusion I get from them is much less biased.

        • Christian: I’m confused. Are you saying we should trust our intuition (which is formed by looking at graphs, or by the patterns other people say they see in our graphs), or not?

    • Rahul: Hope you’re still reading this – how would you make your idea work for a 2×2 table?

      Plots of a 2×2 table with entries (3,4,5,6) should look identical to plots for (3 million, 4 million, 5 million, 6 million) – and so would any summary statistics computed using only proportions. You might be able to “see” an effect, but you’ll have no indication of how noisy the data’s description of it is.

      Unless, that is, you cheat and make plots with inferential summaries on them, or count standard errors as non-inferential data summaries.

      If you have a way round this problem I’d be very interested – thanks.

      • @george:

        Yes, I’m still reading. :) And maybe you are right. I don’t know.

        Do you have an actual example? I’d be curious to see an example where the effect isn’t evident on seeing the plot / table / summary statistic but seeing (say) a NHST convinces one that it is clearly an important effect.

        • Well, the point holds for any 2×2 table. But if you like, consider the following tables;

          1, 34
          3, 32

          and

          5, 170
          15, 160

          and

          20, 680
          60, 640

          The last table is (modulo a little rounding) Doll and Hill’s 1950 data on smoking and lung cancer, which has a huge signal Doll and Hill didn’t expect to see. The other two are (obviously) just scaled-down versions of the same thing. I chose the middle table because, very roughly, it gives p=0.05 – which I don’t claim should or will convince anyone of anything, it just makes for an interesting comparison – and the top one because it’s the far extreme that retains integer counts.

          I think the visual impact of plotting these should all be identical, unless one “cheats” in the sense I defined earlier. This would suggest that we do need inferential tools. Which is perhaps not very controversial here, but I’m interested to see how you get on with them.
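
          As a quick numerical illustration (a sketch added here, not part of the original comment; it assumes scipy is installed): the three tables have identical row proportions, so plots of proportions look the same, while the usual inferential summary changes sharply with sample size.

# Sketch only: identical proportions, very different inferential summaries.
from scipy.stats import chi2_contingency

tables = [
    [[1, 34], [3, 32]],
    [[5, 170], [15, 160]],
    [[20, 680], [60, 640]],  # roughly Doll and Hill (1950), per the comment above
]

for t in tables:
    # Plain Pearson chi-squared test, no continuity correction.
    stat, p, dof, expected = chi2_contingency(t, correction=False)
    print(f"{t}: chi-squared = {stat:.2f}, p = {p:.2g}")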

        • Thanks. Well, just my naive intuition: I’d be convinced smoking is bad based on your Table #3. #2 too, though perhaps I’d be less sure.

          #1 would cause me to say, Hmmm, I don’t know. Too small a study.

          Not sure what the rigorous theory says is the right thing to be doing.

        • I gather “Doll and Hill” refers to table IV of this:
          http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2038856/pdf/brmedj03566-0003.pdf

          It is an interesting paper. I disagree that the data in table IV is at all a “huge signal”, in the sense that a signal should be informative. In fact they already knew the lung carcinoma and control groups were substantially different for other reasons (table II), so there was no reason to think the “null hypothesis” of precisely no difference should hold after that. The difference between smokers and non-smokers only corresponds to ~20 people, which is the same order of magnitude as the social class and place of residence differences shown in that table II.

          The difference is also an order of magnitude less than the number of misdiagnoses they report correcting (e.g. table XII). For some reason people diagnosed with “respiratory disease, other than cancer (n=335)” (Table X) appears to have *not* been included in the control group, which is inconsistent with the methods they describe (“non-cancer control”). It is also strange for that sample size to be the same as that of “Other Cases” shown in Table I.

          I got a strong vibe of p-hacking from that paper. Still Table V is rather convincing regarding a link to lung cancer, at least for a 2 pack a day habit.

  5. Andrew:
    >You don’t want to deterministically impute numbers where the data don’t exist
    That’s the whole field of indirect comparison or network meta-analysis :-(
    (p.s. what might end up determining your treatment option some day.)

  6. Banning seems a bit extreme. An outright ban has a real advantage in that it will force researchers/data analysts to work hard at figuring out the alternatives. I hope the people publishing in that journal are really smart (or find some smart collaborators) so that we will find out what the problems are and how to deal with them. I lean to the idea that good descriptive analyses will improve things a lot and may undermine in an incisive way the case for NHST.

  7. I think focusing on it as a “ban” or censorship confuses the matter. If a Journal insists on only LaTeX or .doc submissions, do we criticize that as a “ban” on other formats?

    What if a Journal’s policy refuses to publish artwork below a certain dpi? Or insists on references in a certain format. Or refuses graphs that have a gray background because it doesn’t come out well during printing.

    I think these are all legitimate editorial decisions. Though the jury is still out on NHST, there seems to be a significant body of opinion convinced that it does more evil than good. The statisticians can still slug it out, but I don’t see why it is wrong for one Journal to unilaterally take a stand.

    Besides there’s a ton of journals with other policies. If the authors are so inclined they can vote with their feet.

  8. I struggle with this a lot. The NHST doesn’t give us what we want, P(H_A | D), and has a lot of drawbacks (the null is rarely interesting, and so on). But I’m not sure that the solution that they are proposing is any better. Bayesian methods, or just a sane interpretation of p-values, would solve the problem much better. The problem I see with their proposal (more or less removing inferential statistics) is that so much of what we do can’t be reduced to a bivariate comparison or something with low enough dimensionality to visualize in a practical sense.

    I’m often dealing with models that I expect exist in 30D to 100D space or more. I’m not sure how you would approach this problem under their paradigm. Most questions, at least as I see the world, exist in this group. We use OLS, GLM, and other methods (or their Bayesian counterparts) because of this. I’m not sure how to reduce this to something that can be shown with summary statistics or graphics.

    Even with a simple model, Y ~ X1 + X2 + X3, the bi-variate measures of (Y, X_i) for i = {1, 2, 3} can be very different (even in sign) from the conditional values. If we expect Y ~ X1 + X2 + X3, what is the value of E(Y | X1) or the unconditional correlation of Y and X1?
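
    A small simulation makes that last point concrete (a sketch with made-up coefficients, assuming numpy is available; it is not from the original comment): the marginal slope of Y on X1 can have the opposite sign from X1’s coefficient in the full model.

# Sketch only: marginal vs. conditional association can differ, even in sign.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x2 = rng.normal(size=n)
x1 = 2.0 * x2 + rng.normal(size=n)             # X1 strongly tied to X2
y = 1.0 * x1 - 3.0 * x2 + rng.normal(size=n)   # conditional effect of X1 is +1

# Marginal (bivariate) least-squares slope of Y on X1:
marginal_slope = np.cov(y, x1)[0, 1] / np.var(x1, ddof=1)

# Conditional slopes from the two-predictor least-squares fit:
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"marginal slope of Y on X1:   {marginal_slope:+.2f}")  # about -0.2
print(f"conditional slope (Y~X1+X2): {beta[1]:+.2f}")         # about +1.0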

  9. It would seem that many regular readers of this blog are behaving as if they were irrational.

    Isn’t it a paradox that so many readers would email Andrew this story knowing that, with high probability:

    1. Other readers will also email it,
    2. Other readers had already read the news by the time you did.

    Perhaps in an optimal world he would have received 0 or 1 email.

    What does this tell us about reader preferences and beliefs?

  10. No doubt part of the reason you received so many emails is because this is a hot topic on several listservs right now. I assumed you would at the very least have something to say about their comment on noninformative priors.

    I like the basic idea underpinning the ban, but I do think this is too blunt an instrument. As others have noted, the problems with NHST have much more to do with the training of the people (mis)using and (mis)interpreting the statistics than with the statistics themselves. The approach BASP has taken is a crude approach to try to force people to do better, but I doubt it will be effective. The problems with research in the social sciences, particularly in social psychology, run much deeper than p-values; correcting those problems is likely to be a long and slow process, not one that can be shortcut by instructing authors to scrub p-values from their manuscripts. I believe that observation has been made in the comments on this blog before — force people misusing frequentist statistics to use Bayesian, and they’ll just misuse Bayesian instead. Perhaps I’ll be wrong and this will spark a revolution, but I’m not holding my breath.

    Les Hayduk also brought up an interesting argument on the SEMNET listserv in which he made a case that the SEM chi-square should not be treated the same as other NHSTs. In his view, the BASP ban does not apply to the SEM chi-square. I’m not sure that the editors will appreciate the distinction.

  11. I don’t see why people have a problem with banning things in a journal. It seems clear that there are certain standards that journals must uphold. They must not allow, for instance, research claims that are internally logically inconsistent, groundless claims of truth, or bad methodology. These things are inconsistent with the idea of a scientific journal. No one is “censoring” anything; these are merely standards that the journal has decided to uphold. We can argue about the wisdom of the standards themselves, but being worried about “banning” and “censorship” seems strange.

    This is not about censorship. This is merely a (debatable) journal policy issue.

    • We should start a “you might be an academic if …” series. I’ll kick it off.

      You might be an academic if you think a reviewer pointing out logical inconsistencies is the same as blanket bans on methods.

      You might be an academic if you think wholesale bans of methods aren’t censoring them and aren’t censoring the people who wish to use them.

      You might be an academic if you think scientific progress depends on choosing just the right journal policies rather than persuading researchers of the errors of their ways.

      • Anon:

        You recommend “persuading researchers of the errors of their ways.” I’ve found it difficult (but not always impossible) to persuade researchers of the errors of their ways. But it’s often not so hard to persuade researchers of the errors of other researchers’ ways. So sometimes we can proceed in crab-like fashion, first criticizing group A for some flaw, then group B sees the problem and tries to avoid these errors in their work; then we criticize group B for their errors, etc.

        And this is all ok. There should be no embarrassment in having made a mistake; that’s how science works. What’s embarrassing is the people who don’t admit their mistakes, who don’t admit that their prized statistically significant finding might not represent any such pattern in the larger population.

        The editors of this particular journal can do whatever they want. Maybe the well-publicized policy change will have some positive larger effect in research practice, maybe not. It’s a bank shot either way.

        • “The editors of this particular journal can do whatever they want.”

          No one is talking about whether they have a right to do it.

          I reiterate as simply as I can, any trend toward banning methods will do considerable harm to Bayesian Statistics in the long run. Bayesians shouldn’t do it and shouldn’t advocate for it.

        • > It’s a bank shot either way.
          Now it’s sinking in…

          To replicate, criticize a comment made on this blog.

          Brilliant (and until you find group B, group A probably really can’t get it.)

      • I think of a journal much like I think of a park, or a nice museum or a shopping mall. We’ve all heard that we shouldn’t litter in parks, or touch the artwork, or be a nuisance at the mall. The social rules exist, but probably some people care about them more than others. It’s almost romantic to think that scientists are this magical group of people who will abide by these social rules just on the basis of their own virtue, but here is a journal that is abandoning that assumption, and is choosing to enforce a littering fine, and will expel people who run around the mall naked or who scratch the paintings to smell them. No one needs to go to this particular park, but it’s nice to know that it’s there if I ever want to be somewhere with no litter.

    • Richard: The below is what I am concerned about. Simply put, you can argue for your methodology, claims, and logic, but only if the argument does not involve what appears to be NHST (of course authors will find a way around this)?

      (From wiki)
      In his “F.R.L.” [First Rule of Logic] (1899), Peirce states that the first, and “in one sense, the sole”, rule of reason is that, to learn, one needs to desire to learn and desire it without resting satisfied with that which one is inclined to think.[112] So, the first rule is, to wonder. Peirce proceeds to a critical theme in research practices and the shaping of theories:
      …there follows one corollary which itself deserves to be inscribed upon every wall of the city of philosophy:

      Do not block the way of inquiry.

      Peirce adds, that method and economy are best in research but no outright sin inheres in trying any theory in the sense that the investigation via its trial adoption can proceed unimpeded and undiscouraged, and that “the one unpardonable offence” is a philosophical barricade against truth’s advance, an offense to which “metaphysicians [journal editors] in all ages have shown themselves the most addicted”.

      • The journal is saying that there are some ways of justifying claims that will not fly in the journal. This is true of *every* scientific journal. The journal is not censoring particular scientific claims, but is rather regulating the way in which those claims are argued for. Pickiness about methods is actually the hallmark of science, not the death of it.

        Presumably we would not see an outcry if a journal made it explicit that they would not accept divine revelation as a justification for scientific claims; of course we wouldn’t, because part of science is making these calls. If you want to argue against the policy, argue that they’ve drawn the wrong line. Arguing that these lines can’t or shouldn’t be drawn is absurd.

        • Richard: We will agree to disagree.

          I work in a regulatory agency; that is where lines belong, but even there they can be argued away if the situation dictates.

  12. Well, we always have to worry that the cure is worse than the disease.

    For example, if your first p-value is 0.051 you might just twitch a little to get it down to 0.05. But if all you have, say, is a box plot of outcomes in treatment and control, and some implicit ocular cutoff by the reviewer, you might twitch a lot to get those boxes to tell a story.

    Moreover, in doing this you will discover that besides statistical degrees of freedom you now have visualization degrees of freedom. You know, play around with scale, axes ranges, etc… until you realize: “The p-value is dead, long live the junk chart!”

    The moral of the story is that you don’t cure a cold by blowing your nose.

      • @Anon

        If the problem is with editors wanting to publish “findings” or effects, you get p-hacking.

        Now suppose you abolish p-values. We are only allowed descriptive stats and visual contrasts. However, you retain the need for “findings”.

        Well, you know what you are going to get. You torture the descriptive stats, tables, and displays until you “see” a “finding”.

        • PS And because the nature of the descriptive semantics is more coarse, you might just have to do more torturing than with p-values…

          I am not necessarily defending p-values, just saying that the supposed remedy may only be treating the symptoms, and there is a good chance it is making things worse.

          It would be better to do as PlosOne does: publish on the basis of the research design and question, not on the results.

        • I totally agree. These editors are treating the symptoms, not the cause of problems in social science research. I think scientists should start posting, in a public forum, their hypotheses, methods, n/power, analysis methods (e.g., exactly how outliers will be treated) before they run their study so that reviewers and editors can assess how much people massaged their data in a hunt for something that is publishable (i.e., what I think is the cause of an excessively high false positive rate in the literature).

  13. Fortunately “How to lie with statistics” didn’t contain any examples of how to lie using graphs or descriptive statistics. Oh wait …

    (For the record I like the idea of placing more emphasis on graphs and descriptive statistics – but I’m worried that it is really quite difficult to interpret these without reference to error. Indeed standard errors and confidence intervals are most useful to my mind when viewed as descriptive rather than inferential statistics.)

  14. The subject is complex; there are many peculiarities that must be taken into account, but people do not give much attention to that. In science, or in any other field that appreciates rational thinking, people desire to follow coherent rules to explain and interpret the relations among the elements of interest or to make inferences about unknown quantities. The word “coherence” is not well-defined; it strongly depends on personal principles. In general, coherence rules are defined in our mesoscopic world: for instance, the additive principle is trivial in some domains, but it is not valid in some domains of the microscopic world.

    There are many types of logic: classical logic, fuzzy logic, paraconsistent logics, Bayesian logic, classical-statistical logic, and so on. Each type of logic has its own definition of coherence, which, in general, is not in line with that of the other types. It is a great mistake to use one type of logic to interpret another type without further considerations. See, for instance, the “MIU” and “pq-” postulation systems* and the problem of interpretation.

    Trafimow (2014) interprets the p-value, which is a concept defined inside the classical statistical formulation, within a Bayesian framework. This is a huge mistake, since their domains of application are very different. If one studies the statistical theory behind the classical concepts, one will realize that the p-value cannot be defined using a conditional probability, since H0 imposes restrictions on the probability measures that possibly explain the observed events in the sigma-field, and these statements are not measurable in the classical structure. Therefore, any type of conditional statement on them is not well-defined in the classical framework. If you make these statements measurable (by defining a larger space), you are imposing further restrictions where initially they were not necessary. These further restrictions impose a type of interpretation that does not exist inside the classical model. In this context, you are making a Bayesian interpretation of a classical concept. Of course, once one understands the classical model (which is bigger than any probabilistic model) and the Bayesian model (which is a probabilistic model in essence), one realizes that such an interpretation is very limited and misguides the practitioner’s intuition. This is a common problem in modern statistics: people do not care much about formal notation and the theory behind the concepts. This leads to many types of invalid interpretations and feeds many controversies.

    Links:

    * http://www.math.mcgill.ca/rags/JAC/124/ToyAxiomatics.pdf

    Trafimow (2014) http://homepage.psy.utexas.edu/HomePage/class/psy391p/trafimow.nhst.21003.pdf

  15. In my field findings are often based on interaction effects, e.g. a ‘learning’ effect (such as the difference in mean response speed for random vs. structured blocks in a serial reaction time task) is smaller for a clinical group than for a control group. Visually, the results often look quite similar for the two groups. Sometimes mixed modeling or growth curve analysis is used, which may show different learning curves (slopes) for the clinical group but intact learning. Whatever the implications of such findings, everything seems to hinge on the stats and on choosing the DV. It is certainly not clear to me how visually inspecting the data would lead to anything.

    • > Visually, the results often look quite similar for the two groups.
      This does seem like a red flag – are the studies designed to have adequate power for “meaningful” learning effects?

  16. That does seem to be a problem. The study with the most participants involved adolescents with a history of language impairment (n=38) and a normal-language group (n=47). Studies with school-aged children typically have only about 15 per group. One meta-analysis pooling 8 studies (only two of which had significant p-values) reports a significant overall effect size of .3. Hard to know what to make of it.

  17. As an economist ardently following the debate, I do ask myself why psychologists as a whole are so incredibly hesitant about being exposed to mathematical modelling, which is a frequently used method in my field of study. Moreover, greater emphasis should be laid on teaching a lot more math and stats classes from the very start of the study of psychology. Psych students do not have to deal with rigorous mathematical and statistical methods, and as a consequence their quantitative training is far from solid. If people were forced to learn it properly and to approach its use critically, a lot of this debate would likely dissolve into thin air, and it would help make psychology a hard science.
