
The harm done by tests of significance

After seeing this recent discussion, Ezra Hauer sent along an article of his from the journal Accident Analysis and Prevention, describing three examples from accident research in which null hypothesis significance testing led researchers astray. Hauer writes:

The problem is clear. Researchers obtain real data which, while noisy, time and again point in a certain direction. However, instead of saying: “here is my estimate of the safety effect, here is its precision, and this is how what I found relates to previous findings”, the data is processed by NHST, and the researcher says, correctly but pointlessly: “I cannot be sure that the safety effect is not zero”. Occasionally, the researcher adds, this time incorrectly and unjustifiably, a statement to the effect that: “since the result is not statistically significant, it is best to assume the safety effect to be zero”. In this manner, good data are drained of real content, the direction of empirical conclusions reversed, and ordinary human and scientific reasoning is turned on its head for the sake of a venerable ritual. As to the habit of subjecting the data from each study to the NHST separately, as if no previous knowledge existed, Edwards (1976, p. 180) notes that “it is like trying to sink a battleship by firing lead shot at it for a long time”.

Indeed, when I say that a Bayesian wants other researchers to be non-Bayesian, what I mean is that I want people to give me their data or their summary statistics, unpolluted by any prior distributions. But I certainly don’t want them to discard all their numbers in exchange for a simple yes/no statement on statistical significance.

P-values as data summaries can be really misleading, and unfortunately this sort of thing is often encouraged (explicitly or implicitly) by standard statistics books.

P.S. Maybe an even better title would be, “The harm done by tests of significance and by analyzing datasets in isolation.”
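
To make Hauer's point concrete, here is a minimal sketch in Python. The four (estimate, standard error) pairs are invented for illustration and are not taken from his paper, and the pooling step is a plain fixed-effect inverse-variance average, which is only one simple way to combine studies. Under those assumptions, each study on its own fails the conventional 5% test, yet reporting the estimates and their precision lets a reader combine them into a fairly sharp answer.

```python
import math

# Hypothetical (estimate, standard error) pairs for one safety effect,
# e.g. the change in log crash rate. Invented numbers, not Hauer's data.
STUDIES = [(-0.20, 0.12), (-0.15, 0.10), (-0.25, 0.15), (-0.18, 0.11)]


def z_score(estimate, se):
    """Estimate divided by its standard error."""
    return estimate / se


def interval_95(estimate, se):
    """Normal-approximation 95% interval around an estimate."""
    half_width = 1.96 * se
    return (estimate - half_width, estimate + half_width)


def pool_fixed_effect(studies):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, studies)) / total
    return pooled, math.sqrt(1.0 / total)


if __name__ == "__main__":
    for est, se in STUDIES:
        low, high = interval_95(est, se)
        verdict = "significant" if abs(z_score(est, se)) > 1.96 else "not significant"
        print(f"estimate {est:+.2f}, 95% interval ({low:+.2f}, {high:+.2f}): {verdict}")
    pooled, pooled_se = pool_fixed_effect(STUDIES)
    low, high = interval_95(pooled, pooled_se)
    print(f"pooled   {pooled:+.3f}, 95% interval ({low:+.3f}, {high:+.3f})")
```

A yes/no verdict per study would have thrown away exactly the numbers that the pooling step needs.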

109 Comments

  1. Brad says:

    I’ve hated NHST since my first year of undergraduate psychology and none of my lecturers knew how to answer my pesky questions about it.

    One example we got was a study that found no significant difference between 20-30 year olds and 30-40 year olds, no significant difference between 30-40 year olds and 40-50 year olds, and a significant difference between 20-30 year olds and 40-50 year olds. When I argued that accepting the null hypothesis in the first two cases but rejecting it in the last case was logically inconsistent, the lecturer was flummoxed and the student who won the Dean’s Award told me I was thinking about it too much and I just had to trust the math!

    Even now (I’ve left psychology behind to pursue library studies) I get into arguments on the net (I try not to but sometimes I think I’m talking to intelligent people who might like their illusions shattered) about what use there is in knowing the probability of getting a certain set of results given that the null hypothesis is true.

    What are the prospects for change?

    • gwern says:

      from OP:

      > After seeing this recent discussion, Ezra Hauer sent along an article of his from the journal Accident Analysis and Prevention, describing three examples from accident research in which null hypothesis significance testing led researchers astray.

      Couldn’t you link directly to http://andrewgelman.com/wp-content/uploads/2013/01/1154-Hauer-The-harm-done-by-tests-of-significance.pdf …?

      > Even now (I’ve left psychology behind to pursue library studies)

      Library science? I was thinking of doing that a while ago, but everything I read suggested employment prospects were very poor and the master degrees a bad idea for everyone except an already employed library worker.

      > What are the prospects for change?

      I think they’re good. For my own reading up on NHST (http://lesswrong.com/lw/g13/against_nhst/), I read criticisms from the ’50s to now, and there seemed to be a shift of tone over the decades from ‘I know you love NHST but let me explain why it’s flawed’ to ‘we all know that NHST has serious flaws, and here’s what we can replace it with’, to the point where almost no one seems to actually be defending normal NHST in the last decade but making excuses like ‘yes it’s wrong but in the real world, researchers are busy people and can’t afford to take the time to do it right’.

    • konrad says:

      I agree that the hypothesis testing approach (and in general the idea of discretising problems like these) leaves much to be desired, but the charge of logical inconsistency doesn’t hold:

      You have three age categories, call them A, B, and C. There are three ways in which you can get a difference between A and C (bearing in mind that by choosing a hypothesis testing approach you commit to taking seriously the possibility that two categories may actually be the same): 1) A and B are the same, C is different; 2) B and C are the same, A is different; 3) all three are different. From the results you described, you have enough information to conclude that A and C are different, but not enough to rule out any of the three ways in which this can happen. This is not a problem: you only get inconsistency if you leap from failing to reject the null to assuming it is true.
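
A small sketch of konrad's A, B, C argument, in Python. The group means and the common standard error are hypothetical, and the comparison is a simple z-test on the difference of two independent means, which the thread does not specify. The point it illustrates is that each comparison can only come back as "different" or "don't know", so the three results below do not contradict each other.

```python
import math

# Hypothetical group means, each estimated with the same standard error.
MEANS = {"A": 10.0, "B": 11.0, "C": 12.0}
SE_OF_MEAN = 0.5


def compare(group_1, group_2, critical_z=1.96):
    """Return 'different' or "don't know"; the test can never return 'same'."""
    diff = MEANS[group_2] - MEANS[group_1]
    se_diff = math.sqrt(2.0) * SE_OF_MEAN  # independent groups, equal SEs
    z = diff / se_diff
    return "different" if abs(z) > critical_z else "don't know"


if __name__ == "__main__":
    for pair in [("A", "B"), ("B", "C"), ("A", "C")]:
        print(pair, compare(*pair))
```

With these numbers the adjacent groups each differ by about 1.4 standard errors and the outer pair by about 2.8, so only A versus C crosses the threshold.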

      • revo11 says:

        I agree – the use of NHST in science/social science is highly problematic, but Brad’s example is not a good example of the problems with it. His misapplication of transitivity is a common interpretational mistake, but that’s a separate issue.

        The state of elementary statistical education being what it is, I wouldn’t be surprised if the misunderstanding wasn’t at least partially related to how NHST was taught to him though – it seems like it’s common to avoid the nuances of interpretation. The truth is that if you’re going to use NHST at all, you’d better at least appreciate the nuances of interpreting it correctly.

        • Brad says:

          “His misapplication of transitivity is a common interpretational mistake, but that’s a separate issue.”
          It may be a separate issue technically (it doesn’t come specifically from the maths involved in NHST), but the problem is that people just perform the NHST ritual and the maths is right. There is no reasoning about the data. Just plug it into some equations (or SPSS or R) and the answers pop out. The work is done.

        • Brad says:

          “The state of elementary statistical education being what it is…”
          I was a tutor (it might be called a teaching assistant in the US) for some 2nd year psych classes and although statistics were covered in a different class I’d still get students coming to me for help with stats. It’s amazing how many students didn’t even know that a t-test is comparing means and standard deviations between two groups. Once I told them to forget the maths for a minute and think about what they were actually trying to do, it all fell into place.

      • Brad says:

        “… you have not enough information to conclude that A and C are different, but not enough to rule out any of the three ways in which this can happen.”
        If A and B are the same, and B and C are the same, then logically A and C are the same. There is not enough information to conclude that A and C are different because there is the possibility that there is a Type 1 error.

        “This is not a problem: you only get inconsistency if you leap from failing to reject the null to assuming it is true.”
        The leap that is usually made. What is the point of rejecting the null if you don’t assume that the rejection is true?

        • konrad says:

          Somehow you inserted an extra “not” when quoting my sentence – rejecting the A vs C null means we _do_ have enough information to conclude that they are different. But in this setup we can _never_ conclude that any two of the quantities are the same – all of the decisions are between “different” and “don’t know”.

          “The leap that is usually made.” – Sadly I can believe this is true in many application areas – evidence of horrendous misunderstanding of the whole point of hypothesis testing.

          “What is the point of rejecting the null if you don’t assume that the rejection is true?” – When you reject the null, that means the null is false (a strong conclusion). When you fail to reject the null, that means there is insufficient evidence to decide whether the null is true or false (no strong conclusion either way).

          • Brad says:

            Sorry, don’t know how that extra ‘not’ got in there. I’m not sure I agree that the decisions are between “different” and “don’t know.” It looks more like the decision is between “not sure” and “not sure.”

            Let me see if I’m understanding you correctly. You seem to be saying that if p > .05 you fail to reject the null and can’t come to a firm conclusion.

            My issue is that p < .05 means the data is unlikely if the null is true, but the conclusion that the null is false doesn't follow. You're affirming the consequent.

            If A then B
            B
            Therefore A.

            The conclusion doesn't follow. In NHST you're really saying:

            If A then P(B) < .05
            B
            Therefore A

            I don't think that mitigates the error in any way.

            • konrad says:

              Hmm, it’s clear that you’re confused _somewhere_, but it’s not clear _where_. I’ll try to state the argument (in the case of p0.05).

              We have: If A then B. (We accept that this is wrong in 5% of cases where A is true.)
              On looking at the data, we observe not-B.
              Therefore not-A.

              • konrad says:

                Sorry, the website seems to have mangled my text. I’ll try again:

                Hmm, it’s clear that you’re confused _somewhere_, but it’s not clear _where_. I’ll try to state the argument (in the case of p less than 0.05).

                A is the null.
                B is the statement that the data are compatible with the null (where we define compatible to mean p is less than 0.05).

                We have: If A then B. (We accept that this is wrong in 5% of cases where A is true.)
                On looking at the data, we observe not-B.
                Therefore not-A.

              • konrad says:

                Sorry again – compatible means p is greater than 0.05.
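
konrad's parenthetical, that "If A then B" is wrong in 5% of the cases where the null is true, can be checked by simulation. This is only a sketch under assumptions that are not in the thread: normally distributed data with known standard deviation 1, samples of size 20, and a two-sided z-test, with "compatible" meaning p greater than 0.05.

```python
import math
import random


def two_sided_p(sample, null_mean=0.0, sigma=1.0):
    """Two-sided p-value of a z-test with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - null_mean) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_rate(true_mean, trials=5000, n=20, alpha=0.05, seed=1):
    """Fraction of simulated datasets declared incompatible with the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print("null true: ", rejection_rate(0.0))
    print("null false:", rejection_rate(1.0))
```

When the null is true, the rejection rate should land near the nominal 5%; when the true mean is far from the null value, "not-B" is observed almost every time.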

              • Brad says:

                I don’t seem to have an option to reply to your corrections below, so this is a reply to your 7:58 comment.

                I’m not sure how you get “Not-B” as your second premise. I’ll have a think about it.

              • Brad says:

                You seem to be saying:

                If H-0 then Data-B
                Not Data-B
                Therefore not H-0

                But this is wrong in 5% of cases.

                Does that sound like what you’re saying? I’m not sure that’s the correct way of framing it. I’ll think about it a bit more.

              • Brad says:

                You’re right, I am confused somewhere. I have included a second premise where there is no second premise. The actual argument as I see it should look more like

                Rejecting H-0
                If the null hypothesis is true then the probability of getting this data is less than .05 (If A then B)
                Therefore the null hypothesis is not true (Therefore not A)

                which seems even worse.

              • Brad says:

                Looking back at your framing. You’re saying:

                A = the null hypothesis is true
                B = the data are compatible with the null
                = p < .05
                C = If the null hypothesis is true then the probability of getting the data is less than .05.

                But p < .05 = If the null hypothesis is true then the probability of getting the data is less than .05

                so I don't think you're justified in putting it that way.

            • Something went off the rails, and it doesn’t help that the website doesn’t like less than signs because they look like an HTML tag… Here’s what a hypothesis test means:

              Under the simple and boringly uninteresting model H0 the data would be found in the region farther away from some reference value with frequency p(D).

              the usual reasoning then goes like this
              if p(D) < 0.05 then presumably it is unlikely that H0 is true therefore H0 is most probably false.
              if p(D) > 0.05 then H0 could plausibly be true therefore we have no conclusion about H0, please get more data.

              • Brad says:

                But how do you get from p(D) given that H0 is true to p(H0)?

                p(D) doesn’t imply that H0 is unlikely to be true. It says that if H0 is true then p(D). p(H0) is what you want, but it isn’t what you get.

              • if p(D) is tiny when H0 is true, then either H0 is true and you got very unlucky in your data, or H0 is not true. the assumption is that H0 is not true. there is no way to get a logical certainty. also, there is no way to get a p(H0) unless you are willing to put a prior on H0 and use bayes rule. since hypothesis testing is mostly a frequentist thing this is not what is done.

                the point is, whatever you’d like hypothesis testing to mean, all it means is what I said above.

      • Brad says:

        If a person is American the probability that he is an NBA basketballer is .0000001433
        This person is an NBA basketballer
        Therefore this person is not American

        It seems to me that this is what NHST is doing. But I’ve been told I don’t understand NHST. If this is indeed what you do when performing NHST then it seems wrong-headed to me.
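
Brad's NBA example can be worked through with Bayes' rule, which is the step the earlier reply said a hypothesis test does not take. In this Python sketch the figure for P(NBA player | American) is Brad's; the other two inputs are hypothetical placeholders chosen only to make the arithmetic concrete.

```python
def posterior(prior_h, p_data_given_h, p_data_given_not_h):
    """P(H | data) by Bayes' rule for a hypothesis H and its complement."""
    joint_h = prior_h * p_data_given_h
    joint_not_h = (1.0 - prior_h) * p_data_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Brad's figure for P(NBA player | American).
P_NBA_GIVEN_AMERICAN = 0.0000001433
# Assumed values, for illustration only.
P_NBA_GIVEN_NOT_AMERICAN = 0.000000001
PRIOR_AMERICAN = 0.04

if __name__ == "__main__":
    p = posterior(PRIOR_AMERICAN, P_NBA_GIVEN_AMERICAN, P_NBA_GIVEN_NOT_AMERICAN)
    print(f"P(American | NBA player) = {p:.3f}")
```

Under these assumed inputs the person is still probably American: the data are rare under the "null", but rarer still under the alternative. Note that the inputs here are probabilities of the observed data itself, not tail-area p-values, so this shows the logic of the objection and is not a recipe for converting a p-value.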

    • Fran says:

      So you don’t understand a concept, your teacher can’t properly answer your questions and, somehow, that makes NHST flawed? Would you say the same about Quantum Mechanics or Relativity? You don’t understand it and, irremediably, the concept must be wrong?

      • Brad says:

        Helps if you read what I wrote. What use is knowing the probability of getting a certain result given that the null hypothesis is true? You want to know the probability that the null hypothesis is true, which NHST doesn’t tell you.

      • Brad says:

        Look, I’m no expert (I may be the statistical equivalent of a climate change denier) but my argument certainly wasn’t that I didn’t understand NHST and my lecturer couldn’t explain it so it must be wrong.

        From an epistemological perspective, NHST didn’t seem like the right way to come to the conclusions that we were seeking to arrive at. I may not be an expert, as I said, but there seem to be plenty of experts who think NHST is incorrect and I find their arguments convincing. Articles such as Cohen’s “The earth is round (p < .05)" and Meehl's "Theoretical risks and tabular asterisks…" helped convince me that my initial suspicions weren't entirely misplaced.

        • Fran says:

          Brad:

          Listen, that you do not understand NHST is obvious but, instead of reaching the right conclusion for that being so (bad teaching), you were led by Bayesians, the “plenty of experts” that agree with you, to believe there was something incorrect in NHST.

          And now you feel liberated! You were right all along!! These “plenty of experts” told me so!!! You’re way smarter than the student who won the Dean’s Award and told you you were thinking about it too much and you just had to trust the math!!!!! Who wouldn’t want to believe that? I might in your situation.

          And now you repeat the Bayesians’ Mantras without pretty much understanding what you’re saying, just like anyone who has been brainwashed by a sect would do… “The NHST ritual”, “From an epistemological perspective…”, “What you want to know is the probability of the Null…” Always the same blah, blah, blah, always the same nonsense.

          One question, did the “plenty of experts” that agree with you mention something like “Uninformative Priors” in the welcome process? I am curious to know how these “experts” approach students in distress.

          • Brad says:

            I don’t recall Meehl being a Bayesian. Maybe you’ve read more of his work than I have, but I’ve read quite a bit. Then again, maybe you haven’t read any of his work….

            • Fran says:

              Nope, I did not read Meehl, but I’ll read now the paper that confirmed your suspicions; Mr. Meehl’s “Theoretical risks and tabular asterisks…”. Let’s see what this fine psychologist has to say about NHST.

              You may say, “But, Meehl, R.A. Fisher was a genius, and we all know how valuable his stuff has been in agronomy. Why shouldn’t it work for soft psychology?” Well, I am not intimidated by Fisher’s genius, because my complaint is not in the field of mathematical statistics.

              Okay, so it is not about the math, let’s see what it is about…

              If I refute the statistical null hypothesis that plots of corn with potash do not differ in yield from plots without potash, I have thereby proved the alternative hypothesis.

              Nope, NHST are not intended to prove anything, in fact, as explained by R.A. Fisher you don’t even have an alternative hypothesis so you see, just like yourself, Mr. Meehl does not understand NHST either. So let’s see what are the solutions he proposes for his imaginary problems with NHST…

              Some directions of solution (before I go onto the one that I am using in my own research) follow. We could take the complex form of Bayes’s theorem more seriously…

              But wait a minute, this is the guy you don’t recall as a Bayesian! Interesting, do you read the papers you recommend? But the guy continues…

              …is a good time for me to recommend to psychologists who disagree with my position to have a look at any text-book of theoretical chemistry or physics, where one searches in vain for a statistical significance test (and finds few confidence intervals). The power of the physicist does not come from exact assessment of probabilities that a difference exists (which physicists would view as a ludicrous thing to show)…

              Oh! Gee! If only the thousands of top physicists at CERN had read Mr. Meehl’s paper, now they would not look so ludicrous with their five sigma significance test to announce that a Higgs-like particle was found… And finally the whole point of the paper.

              The only possible “solution” to the theory-refutation problem that I have time to discuss in any detail is what I call consistency tests

              And unfortunately he fulfills the promise to explain the method he himself has developed to save us all; at this point I spared myself the pain, thank you very much.

              • Brad says:

                Okay, I’ll take your word for it that I don’t understand NHST. So how many people who use NHST actually understand it then?

              • Brad says:

                Perhaps you could give me a better explanation of NHST than any of the statistics books I’ve studied or any of my lecturers gave me in class. Or perhaps suggest a book or article that explains it clearly. That might be more helpful than your ranting and raving.

              • Brad says:

                I didn’t remember Meehl writing about Bayes’ Theorem. Okay, looks like he was a Bayesian.

  2. JH says:

    “P-values as data summaries can be really misleading, and unfortunately this sort of thing is often encouraged (explicitly or implicitly) by standard statistics books.”

    You could change this sentence to: “P-values as data summaries can be really misleading, and unfortunately this sort of thing is often encouraged (explicitly or implicitly) by (most) journals and reviewers” and it would still be 100% correct. And just as frustrating.

  3. Michael Lew says:

    The real issue is not p-values, but the misuse of p-values as part of a frequentist approach to inference. P-values do summarise the evidence in the data and they can be entirely consistent with the likelihood principle. However, as soon as there is an accept/reject decision made on the basis of a threshold they are inconsistent with the likelihood principle and their evidential meaning is lost. (It is also worth noting that when p-values are ‘corrected’ for multiple comparisons or sequential testing they also lose their evidential meaning.)

    There is a lot of hatred directed at p-values that should really be directed towards Neyman-Pearsonian hypothesis tests. P-value bashing by statisticians who do not distinguish between the use of p-values as indices of evidence in a significance test and their use as a threshold for decision in a hypothesis test is very frustrating. We should expect better.

  4. Fran says:

    Someone stabs to death another person and the Bayesian police puts the knife in jail… Oh well, same old, same old.

    • Andrew says:

      Fran:

      It doesn’t make much sense to put a knife in jail, but maybe it makes sense to stop handing out knives to everybody . . .

      • Fran says:

        … And that’s the kind of solution that only works a priori.

        • Andrew says:

          No, the idea is to reduce future misuses of statistics. I agree with Hauer that formal statistical methods can take people away from the data and from their questions of interest. I’d much rather people focus on “How large is the effect?” than on “Is the pattern statistically significant?”

          • alex says:

            I think “effect” and “effect size” are equally loaded and inappropriate. Most “effect sizes” don’t measure effects in the sense that most people understand the word – products of a particular action. It’s an under the radar attempt to sneak in causal language. At least significant avoids all of that.

            • Andrew says:

              Alex:

              OK, replace “effect” by “population difference.” The question is, “How large is the average difference in the population?” This includes causal inference as a special case but also allows for descriptive inference (of the sort that I often do in survey sampling). I still think this question about the population difference is more interesting/relevant/important than the question, “Is the result statistically significant?”

  5. Entsophy says:

    I’ve seen more than my fair share of the harm done by tests of significance since I’ve spent years now doing tactical level analysis that is handed directly to units on the ground in Iraq and now Afghanistan. I have yet to see a significance test that wasn’t materially and fundamentally flawed. In fact, I can’t think of too many instances in which significance tests even accidentally got the right answer. Sure you can retract some nonsense after it gets published in a scientific journal, but how do you retract flawed conclusions after they were used to plan and execute an operation in Afghanistan?

    Just ditch significance tests completely. If you aren’t willing to replace it with a Bayesian analysis, then just replace it with nothing. Most Statisticians think science would come to a screeching halt without significance tests. Yet a great deal of science was done before significance tests (arguably, on a per capita, per dollar basis a lot more breakthroughs were made). People without the crutch of significance tests just think about their problems from scratch and figure out a way to understand it.

    I can’t tell you how many times I’ve seen an analyst asked the question “Did casualties increase during this time frame?” for data where casualties went from 20 to 25 and had them run a significance test and report “there was no statistically significant evidence for an increase in casualties”. For the love of God, stop teaching this nonsense.

    • Fran says:

      ha ha ha :D I should collect this kind of comments, actually I just might… ;-)

      • Entsophy says:

        It is funny! Me and my fellow Marines on the receiving end of this kind of wisdom laugh our butts off when it happens. Oh the chortles.

        • Chris G says:

          One’s view of the cost of Type I and Type II errors is undoubtedly a strong function of one’s location and role on the battlefield.

    • Entsophy says:

      Some additional fun points in no particular order:

      The people making the kind of mistake in that last paragraph aren’t stupid or ill-educated. They typically graduated from better schools than me. Given the prevalence of that kind of mistake, and the difficulty of getting people to see why it’s wrong, Frequentists should at least consider the possibility that there is something about the random-variable/data-generating-mechanism story line which makes people prone to these kinds of error. Certainly, people aren’t born with a propensity to make that error. The cause lies in their statistical training somewhere.

      At best, significance testing is rarely usable outside of low information environments. In high information environments, like Iraq and Afghanistan, they’re basically never useful. We almost always know things which trump anything you can learn from a significance test. Statistical modeling, on the other hand, can be very useful. None of that seems to stop people from trying to use significance tests though, especially since they can avoid the hard work of real modeling or hard thinking in general.

      Marines are pretty immune to this nonsense because they’re so suspicious of eggheads in general. The Army though has a long quantitative tradition that originated from the quantitative orientation of West Point. Sometimes you can get them to believe all kinds of stuff if you rub some math on it.

      • Fran says:

        I’ve been accused by Bayesians of stalling research and delaying the cure of cancer, now it seems I am aiding Al-Qaeda… Should I look for drones in the sky if I keep using NHST? You’re taking the Semper Fidelis too far.

        Anyhow, you complain about this analyst using a NHST when asked if there were more casualties in a given time frame, can you give more details about why this is so wrong without becoming the next Bradley Manning?

        • Entsophy says:

          Nobody is accusing you of anything. I am however accusing the mass teaching of Frequentist significance testing as the height of the scientific method for doing exactly that.

          The question is “did casualties increase over the time period?” All you have to do is look at the numbers 20 -> 25 and say “yes they increased over the time period”. There is no measurement error with our casualty figures.

          Incidentally, watching a 20 year old jarhead Lance Corporal with a high school education trying to explain this to an M.S./Ph.D. holder is something every educational reformer in Statistics should see at least once. Things take a turn for the worse if the M.S./Ph.D. holder fails to see the point and claims the increase wasn’t significant, since there is a decent chance the Lance Corporal knew some of the casualties and considers them highly significant.

          This seems to be a very common error. Andrew Gelman a while back was having to explain the exact same error to someone on this blog in a different context. The person wanted to apply a hypothesis test in the same way to answer a question that should be answered by just inspecting the numbers. This and other similar kinds of errors aren’t being caused by a lack of teaching about the pitfalls of p-values, so simply teaching p-values better won’t solve the problem. Somehow their intuition is being warped by what they’re learning in Statistics class.
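
The 20 -> 25 example can be written out to show that two different questions are in play. This is a sketch in Python; the thread does not say which test the analyst ran, so the conditional binomial comparison of two counts used here is an assumption, and the function names are made up for illustration.

```python
from math import comb


def increased(before, after):
    """The descriptive question: is the later count larger?"""
    return after > before


def one_sided_p_equal_rates(before, after):
    """P(later count >= observed), given the total, if both periods had equal rates."""
    total = before + after
    tail = sum(comb(total, k) for k in range(after, total + 1))
    return tail / 2 ** total


if __name__ == "__main__":
    print("Did casualties increase?", increased(20, 25))
    print("p-value under equal underlying rates:", round(one_sided_p_equal_rates(20, 25), 3))
```

The first function answers the question that was asked, and the answer is yes. The second answers a question about an underlying rate, which is a modeling question that nobody asked and that a single pair of counts answers poorly.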

          • Rahul says:

            A part of this is statisticians pose (some, e.g. p-value) questions in a way that’s not of interest to anyone else.

            Intuitively, people interpret p-values to be answers to their own questions. Which most often they are not.

          • Fran says:

            Entsophy:

            Well, you don’t accuse me all right, just anyone teaching the “Frequentists” evil ways, but you say:

            Things take a turn for the worse if the M.S./Ph.D. holder fails to see the point and claims the increase wasn’t significant, since there is a decent chance the Lance Corporal knew some of the casualties and considers them highly significant.

            I had to read your comment a few times because I was not sure you were not actually kidding. When the M.S./Ph.D. “egghead” said “not significant” he meant “not significant to make changes in the strategy currently at play”; he did not mean “I don’t give a damn about those soldiers”.

            Imagine Mr. Egghead actually listens to you and says “Yes, Significant! 5/20=0.25 so 25% increase!” Then this analysis ends up on a General’s desk with Top Secret stamps on it and reads “Military Intelligence says 25% significant increase of casualties in X”. So now the General, alarmed, moves resources towards X since it seems the zone has become more dangerous… But resources are limited, which means that other areas have become under-protected just because the General was led to believe that area X was getting hot… And now you have more casualties in those under-protected areas and all of a sudden, not only do you have significant casualties, but statistically significant figures as well.

            So maybe Mr. Egghead is saving lives after all.

            You say you work in Military Intelligence? All right.

            • Anonymous says:

              Fran – you are reading Entsophy’s comments like the devil reading the bible. You might be right, but don’t create strawmen. (just an example, Entsophy doesn’t claim the increase from 20 to 25 is statistically significant. He is just stating that 25 is a bigger number than 20).

            • Entsophy says:

              Fran, that’s exactly the kind of thing the M.S./Ph.D. holder says. Obviously, it’s not the direct answer to the simple question asked, but since it’s relevant to significance tests it’s worth looking at.

              The way the M.S./Ph.D. holder wants to use significance testing amounts to a kind of implicit prediction. This sort of thing works well when you want to extract a signal from noise. Laplace did this in astronomy, and electrical engineers perform miracles doing this every day. Industrial quality control might be another example. But it really doesn’t apply here.

              The kind of predictions we have to make in Afghanistan don’t fit a simple signal/noise paradigm. Even in the easiest cases you have to do significant modeling to get predictions useful to a field commander. What’s happening is simply too complicated, and we know too much about it, to use textbook hypothesis testing to make implicit predictions.

              But smart people do try to use significance tests that way all the time, either because their intuition has been warped to the point where they can’t see the real issues or because they’re too lazy to do the hard modeling. Most of the time I think it’s the former rather than the latter.

              • K? O'Rourke says:

                > too lazy to do the hard modeling

                But also maybe they are risk averse about doing something that may seem non-standard and likely to draw criticism.

                Many applied statisticians seem to do this by trying to just use (moderately advanced but) classical textbook methods (e.g. Cox proportional hazards regression).

              • Fran says:

                I learned long ago people don’t ask what they need but what they think they need, and if you answer their direct questions instead of their needs they will blame you for doing so later on.

                But anyhow, you seem to believe that if you “know too much about it” you cannot use NHST and this is not so. You need to account for all you know in your Null Hypothesis so that the only thing left in the Hypothesis is randomness and, in order to do so, you use statistical models as you say but, once you’ve done it, once the only thing left is noise, that’s your Null.

                All these unsubstantiated Bayesian mantras like “NHST can’t be used in a high information environment”… Jesus, deal with the information first! You have this Bayesian hammer and everything looks like a nail to you.

              • Entsophy says:

                Well, I’m not sure how it can be a “mantra” when I’m the only one who said it and I only just recently said it.

                In this case, I don’t think anyone would find it controversial. I just didn’t explain it very well. We know too many things that would in practice make any model that you could reasonably use in a battery of hypothesis tests invalid, thereby forcing you to do much more significant modeling.

                “only thing left in the Hypothesis is randomness” that’s an illusion which you can only maintain if you don’t know very much about what’s going on. In practice, we know far more than is needed to destroy that illusion. There is nothing random about casualties on the battlefield. Our models are modeling “ignorance” not “randomness”.

              • Entsophy says:

                I might add that you can’t maintain the illusion of “reduced to Randomness” in general when you learn too much about a phenomenon.

                The frequency of heads in a long series of coin flips isn’t approximately .5 because of some mysterious force called “Randomness” in the universe. It happens that way because almost every sequence of coin flips, no matter what caused it, has the property freq(heads)~.5.

                So while a Statistician might just be able to maintain the illusion that they’re modeling something called “randomness” of coin flips; Physicists have had their mind polluted with Classical Mechanics. If they think about it long enough they’ll realize they’re doing more than simply predicting outcomes which almost always happen. Eventually, they’ll realize that what they’re actually modeling is their “ignorance” about the initial conditions of the coin flips.
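The counting claim about coin flips can be checked directly. What follows is a minimal sketch, not anything from the comment itself: the function name `fraction_near_half` and the tolerance of 0.05 are illustrative assumptions. It counts, out of all 2^n possible head/tail sequences, the fraction whose frequency of heads lands within the tolerance of 0.5, with no appeal to how the sequence was produced.

```python
from math import comb

def fraction_near_half(n, tol=0.05):
    """Fraction of all 2**n head/tail sequences whose heads frequency is within tol of 0.5."""
    lo = (0.5 - tol) * n
    hi = (0.5 + tol) * n
    # comb(n, k) is the number of distinct sequences with exactly k heads.
    count = sum(comb(n, k) for k in range(n + 1) if lo <= k <= hi)
    return count / 2**n

for n in (10, 100, 1000, 10000):
    print(n, fraction_near_half(n))
```

The fraction climbs toward 1 as n grows, which is the sense in which "almost every sequence" has the property. This is pure counting over sequences; it says nothing about which sequence a particular flipping mechanism will actually produce.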

          • I agree with you, and I disagree with you at the same time. Yes, 20 -> 25 is an increase if your question is “were there more casualties during this time period than last time period”. This may in fact be the relevant question for many people (like someone who needs to figure out staffing questions and wants to know how many available troops there are). But there is also another relevant question (which I am sure you appreciate, but I want to bring it up explicitly here).

            That question is: “is there something about the conditions on the ground that has changed in such a way that we should expect future casualties to be higher in some consistent way than they were in previous time periods?” I personally think that the right way to address that question is an estimation procedure… namely come up with some statistical model for how casualties happen which has some kind of rate in it, and then estimate that rate.

            When the stats guy says that “the increase wasn’t significant” what he means is “this increase is not inconsistent with the idea that the rate could be constant”. So they’re talking past each other. The Lance Corporal wants to know “did we lose more people this month?” and the stats guy wants to know “do I need to adjust my model to predict the future more accurately”

            Both people want to know something about a quantity that has the same units of measurement, and is easy to get confused about, namely dCasualties/dt

            I think the problem with teaching p-value + Hypothesis test as statistical doctrine is that it keeps people from thinking usefully about models and parameters.

            • Entsophy says:

              Daniel, agree completely (see other response above). The bottom line though is “is there something about the conditions on the ground that has changed in such a way that we should expect future casualties to be higher in some consistent way than they were in previous time periods?” simply can’t be answered by Hypothesis testing in cases like this out in the wild. The “random variables” worldview has given them the impression that it can.

              • Surely there are situations in which Hypothesis testing could be enough. You create some model that’s a stochastic Poisson process with time varying rate, you assume that the rate is constant over the period in question, you calculate the probability of seeing > 25 casualties under the constant rate hypothesis, and if that probability is, say, 10^-3, then you pretty much reject the idea that the rate is constant. The problem comes when either:

                a) the constant rate hypothesis is not rejected (so then, you don’t know if it was constant or not).
                b) the Poisson process model itself is broken, like when it fails to take into account some of the things that you “know more than is needed to break the illusion” about.

                Although I think hypothesis testing is mostly not very useful, I think your biggest objection is (b), in other words, the models typically being used for testing are too blunt and unsophisticated, and like I said earlier, sometimes people don’t even distinguish between the data (were there more casualties?) and the model (did the unobserved mean casualty rate increase?)
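The constant-rate calculation described in this comment can be sketched with the standard library alone. The numbers are assumptions borrowed from the thread's running example: a baseline of 20 casualties per period treated as the known constant rate, and 25 observed. The sketch uses "at least 25" for the tail, which is the usual convention for a one-sided test; the Poisson model itself is exactly the assumption that objection (b) calls into question.

```python
from math import exp, factorial

def poisson_tail(k, rate):
    """P(X >= k) for X ~ Poisson(rate), computed as 1 - P(X <= k - 1)."""
    cdf = sum(exp(-rate) * rate**i / factorial(i) for i in range(k))
    return 1.0 - cdf

# Hypothetical inputs: constant rate of 20 per period, 25 observed this period.
p = poisson_tail(25, 20.0)
print(f"P(X >= 25 | rate = 20) = {p:.3f}")
```

With these inputs the tail probability is nowhere near 10^-3, so this would be case (a): the constant-rate hypothesis is not rejected, and the test alone does not say whether the rate changed.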

              • An example might be when someone tests the idea that the Poisson rate is constant in time, but the appropriate model has a rate of casualties per action and a rate of actions per unit time. So it looks like the casualty rate is going up, but maybe casualties per action is staying constant whereas the number of actions is what’s really going up. The on-the-ground situation isn’t that actions are more dangerous, they’re just more frequent…
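A small sketch of that decomposition, with made-up counts: casualties per day factor into actions per day times casualties per action. The function name `decompose` and all the numbers are hypothetical, chosen so that the per-action figure stays flat while the per-day figure rises.

```python
def decompose(casualties, actions, days):
    """Split casualties per day into (actions per day) * (casualties per action)."""
    per_day = casualties / days
    actions_per_day = actions / days
    per_action = casualties / actions
    return per_day, actions_per_day, per_action

# Hypothetical counts: same 30-day window, more actions in the second period.
for label, casualties, actions in (("last month", 20, 40), ("this month", 25, 50)):
    per_day, actions_per_day, per_action = decompose(casualties, actions, 30)
    print(f"{label}: {per_day:.2f} casualties/day = "
          f"{actions_per_day:.2f} actions/day x {per_action:.2f} casualties/action")
```

A test on the per-day count alone cannot distinguish this situation from one where each action became more dangerous.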

    • konrad says:

      @Entsophy: If what you want to say is that decision theory is a better framework than hypothesis testing for these problems, I wholeheartedly agree. But your example fails to make the point, largely because we all know (intuitively, at least) that the intended meaning behind “casualties” is “casualty _rate_” – which implies a model-based question. So you can accuse the analyst of failing to communicate his/her thoughts properly (and thereby encouraging sloppy thinking in others), but nothing else.

      • Entsophy says:

        If you want to know about casualty rates then divide 20 (or 25) by the length of time. That’s the casualty rate. No statistics needed. If you want to predict casualty rates in the future then a significant modeling effort is needed. Trying to apply hypothesis tests to casualty rates as if they were something like the decay rate of Tritium leads immediately to complete and total nonsense. It doesn’t matter how you interpret it, Hypothesis testing/significance testing isn’t the answer.

        If there is something in the Random-variables/data-generating-mechanism teaching of introductory statistics that makes people think otherwise, then that teaching is seriously flawed.

        And I was referring to instances in which people really just wanted to know whether casualties increased over a given time period. It’s not that the analyst initially misinterpreted the question that is the problem. What’s amazing is just how difficult it is to convince them that the answer doesn’t even require statistics. That’s not a miscommunication. That’s students of statistics having jumped the shark.

        • There’s an interesting related question that I’ve been thinking about. There was something about firefighters on this blog a while back, and how they were supposed to have less than 5 minute response time at the 90%tile. The naive way to calculate this is to look at how many times you responded in less than 5 minutes, and then divide by the number of times you responded. If you responded once this week, and it was at 6 minute time, then you had 100% of cases failing to meet the goal!

          And then there’s the NYFD: they responded 1000 times in the same period and had a 5 minute response time at the 90%tile. But maybe they had 50 times where their response time was > 15 minutes! (I’m making up the numbers.)

          Who’s doing a better job? Obviously we need some real modeling here in order to even estimate the underlying rate at 1 response per week. We’d probably need to incorporate a bunch of training exercise data, and so forth.

          Of course, to the one guy who got a 6 minute response… that’s all that mattered.

          Hypothesis testing isn’t going to get us much of anything in the rare event case, but for NYFD you’d be able to muddle along for a while, because you have enough data that you can go with a simple model.

      • Entsophy says:

        Maybe I can explain this better. What happens is that Frequentist statistics pushes people into the direction of thinking of probability distributions as real and data as only one amorphous outcome out of a cloud of possible outcomes.

        Once an analyst has absorbed this mindset, they tend to think of the actual casualty data as less real than the completely made-up probability distribution it supposedly came from. So when asked a question about casualties they immediately want to try to answer a question about some fantasy distribution and can’t think of those counts as known concrete facts.

        Bayesians, and people who’ve never taken statistics, don’t have this problem because they think of the data as the real thing, which is used to make guesses about things not known.

        • I dunno, I think most Bayesian statisticians are immediately going to wonder what the unobserved rate parameter is and not care too much about what specifically happened to make 25 be the given outcome this month (except in so far as it may seriously constrain the posterior and therefore give a lot of information).

          The difference in my opinion is that frequentists tend to focus on the question “can we be pretty sure that a stupid model of this system is not sufficient to explain the outcome” and bayesians tend to be focused on the question “if I create an intelligent model of this system, does it tell me something with a significant amount of information in it (ie. a well constrained posterior distribution)?”

          I think the main reason for this is that bayesians can fit pretty arbitrarily complicated models, so they tend to spend more time thinking about processes and models.

          • Entsophy says:

            There is no unobserved rate parameter. There is just a bunch of casualties, the circumstances of which we know far more than most people would realize, and we have to take that knowledge and make best guesses about other concrete facts or make predictions.

            A fictitious unobserved rate parameter may help in practice in connecting the facts of the casualties to other facts, but that’s all it is: a modeling tool. There is no “data generating mechanism” creating casualties whose randomness we are modeling.

            • Entsophy says:

              Maybe another way to say it is this: people applying significance tests imagine that they’re detecting signals in the presence of noise. Most problems I face however, involve detecting a signal in the presence of a million other signals. And those signals don’t add up to “noise”.

              The signal/noise paradigm makes sense in Laplace’s astronomy applications, and in electrical engineering. It doesn’t make sense in the vast majority of instances in which people try to use it.

            • yop says:

              The data generating mechanism is the current state of the world. If every structural cause could be fixed, there would still be fluctuations in the counts because of non modelable causes (because they are idiosyncratic or because their effects are too small). So in this sense, there is a distribution. If there is a change (like a change of strategy), the distribution will shift.

              Distributions and models are only representations, they are not real. But the only way we can explore and learn about the world is through representations. When a Bayesian does posterior predictive checks, he is using this epistemological framework. Even if there is no real randomness, there is a state of nature and there is fluctuation. That’s what a statistician ought to study.

        • Anonymous says:

          @Entsophy: Ok, I get that this was actually not a prediction problem, which makes the story really funny. But the discussion is more interesting if we assume it is. In that case we need a serious modeling effort, as you pointed out, and the natural type of model to use is one with a rate parameter. Sure, people ought to spell out that this is what they’re doing, but to those in the know it’s clear that the idea is to solve the prediction problem by inferring such a parameter. So yes there is an unobserved parameter and yes it is a modeling tool – that doesn’t make it less real, at least until you specify what specific meaning of real you have in mind (meanwhile I’ll go ahead and call my thoughts real). Likewise, there is a data generating mechanism (aka a model) and it is real (it exists in the mind of the modeler).

          You may well have issues with the specific model that is typically assumed, given that you have extra information you didn’t state, but that’s a separate matter. On the issue under discussion, I think we agree about everything except how to use the words in the debate – I maintain that we get a more constructive debate if we are charitable towards the opposing position by interpreting the words in a way that makes that position make sense.

          Ps:, isn’t noise just a million other signals from unknown sources?

          • konrad says:

            Sorry, posted before filling in my name.

            • Entsophy says:

              I guessed who it was! Yeah we tend to agree a lot and I definitely didn’t intend to give the impression of purposefully putting the worst interpretation on your or anyone else’s words.

          • Chris G says:

            >Ps:, isn’t noise just a million other signals from unknown sources?

            That’s one possibility but it’s not necessarily so. What constitutes noise depends upon what you’re observing. Noise could well be variation inherent in the process which gives rise to your observation. For example, if you’re measuring the amount of light emitted from some location/event, i.e., if you’re counting photons, the number of photons you count in any measurement window will follow a Poisson distribution. The process is not deterministic. If you’re trying to determine the emission rate, the natural variation in the process amounts to noise.

            • revo11 says:

              “What constitutes noise depends upon what you’re observing.” – I think this was Konrad’s point, that in most real world applications of statistics, noise is an approximation used for unobserved state. In many cases, it would be a mistake to take it to _be_ reality. For example, over a large enough window, a distribution of soldier deaths may look Poisson (or more likely, an overdispersed Poisson), but it would be a grave mistake to stop the inquiry there and say that remaining variation can be attributed to chance.

              Even your example could exemplify this – we have incomplete information regarding the generative process – transitions between different energetic states of electrons. The Poisson distribution is a reasonable approximation to the resulting distribution of emission counts over a sufficiently long window of time. If you had access to additional state regarding the energetic states of electrons, the process would look much more deterministic.

          • Entsophy says:

            No, the million other signals only sometimes add up to noise. In this case we know far too much about those other signals to model them that way. Nor is their cumulative effect in any sense stable.

            The number of casualties in a given time is real. A rate parameter is a made up device used to connect it to other facts. That point needs to be seen clearly in order to know when such a modeling tool can be expected to work. In particular it’s important for knowing when a given prior is going to work.

            If it’s difficult to think in terms of “the casualty rate is 20/time” as a literal statement of the rate, then that is a pretty strong statement about how statistics education affects our intuitions.

            • The question of “what is real randomness” seems to be driving some of this discussion. I think that’s a pretty deep philosophical question, but there have been some attempts to deal with this issue and I just happen to have read up on this recently. For example, Per Martin-Löf has a definition of a random sequence of bits as basically any sequence which passes any computable test of randomness, and this definition is pretty much equivalent to a definition in terms of Kolmogorov complexity. And yet, a computer program which outputs random bits and passes these tests is still pretty much a deterministic program, yet without reading the entire computer program including all the data embedded in it, we can’t predict the next bit it will output.

              In other words, all randomness is basically indistinguishable from “a lot of stuff that happened”.
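A loose illustration of the deterministic-program point, under stated assumptions: Python's `random` module is a seeded Mersenne Twister, so the whole bitstream is fixed by the seed, yet the bits look unremarkable to a simple frequency check. The helper `bitstream` and the seed value are hypothetical, and a single frequency check is a far weaker standard than passing every computable test in Martin-Löf's sense.

```python
import random

def bitstream(seed, n):
    """First n bits from a seeded generator: fully determined by the seed."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]

a = bitstream(12345, 10_000)
b = bitstream(12345, 10_000)

print("identical reruns:", a == b)          # same seed, same bits
print("fraction of ones:", sum(a) / len(a))
```

Someone holding the seed and the source can predict every bit; someone holding only the output sees what looks like coin flips.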

              If your argument is that we know a lot about the stuff that happened to cause the casualties, then the question comes down to “does knowing that stuff help to predict in a practically significant way anything about the next casualty?” In the same way as for example “does seeing the first 100 bits of the random bitstream, and the first page of code help us predict the next bit?” If it does, it should be in any model of future casualties, if it doesn’t, then it may as well be called randomness.

              Is there a “real” rate parameter? Well, the rate parameter is not “out there” where we could find it with enough heavy earth moving machinery, like the tomb of Tutankhamen, but it is real in so far as it is something that has been exactly specified in a model. The model isn’t reality, but it is something.

              In my opinion, you make a good point that people fail to remind themselves that the model and the reality are not the same thing, but at the same time, taking the count of casualties and dividing by the time during which they occurred is not exactly real either, it’s a kind of a braindead model. Suppose that we do that today, and get 20 casualties / 30 days (.667/day) and then tomorrow there are no casualties, so we calculate it tomorrow and we get 20 casualties / 31 days (.645/day), and then the next day there is a casualty, so we have 21 casualties / 32 days (.656/day) and then the following day a whole transport truck is hit by an IED and we have 30 casualties / 33 days (.909/day).

              Is there any sense in which these numbers are useful? There is a definite sense in which the total count is useful, but dividing by time seems to be pointless here.
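The running arithmetic in this example is easy to reproduce. A minimal sketch using the four (cumulative casualties, elapsed days) pairs given above; the function name `running_rates` is an illustrative choice.

```python
# (cumulative casualties, days elapsed) from the running example in the comment
snapshots = [(20, 30), (20, 31), (21, 32), (30, 33)]

def running_rates(snapshots):
    """Cumulative casualties divided by elapsed days, one value per snapshot."""
    return [count / days for count, days in snapshots]

for (count, days), rate in zip(snapshots, running_rates(snapshots)):
    print(f"{count} casualties / {days} days = {rate:.3f} per day")
```

The first three values barely move and the last one jumps, which is the behavior the comment is objecting to: the quotient tracks the calendar and single events rather than anything stable.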

              • another way to put this is, suppose the Lance Corporal comes and asks “how many casualties did we have in the last month?”. Suppose that this is on the 31st of a month with 31 days. There were 20 in the last 31 days, but 30 in the last 32 days and the extras were all a little before midnight of the last day last month, just a few minutes… what number does the L.C. want? Suppose that when you go ask the hospital it turns out that the death certificates of some of those men are stamped after midnight, now the answer depends on some bureaucratic nonsense.

                Even though he has only a high school education, the LC probably doesn’t want an answer that depends O(1) on the time of day when he comes and asks the question or on when some paper pusher got an official piece of paper on his desk. Implicitly, by asking about “last month” he is specifying a timescale that he thinks is relevant, the savvy analyst should really respond something like:

                “there were 20 this month, and that’s a few more than last month, but it seems like the long term rate is pretty stable”

                to the extent that the LC doesn’t want something that depends on the time of day, he wants a smoothed, regularized rate and that can only be fictitious and model based, and yet in some ways more real because it doesn’t depend on trivialities like the time of day or the bureaucratic paper stamping nonsense.

                I saw this sort of thing all the time in certain projects in forensic engineering, where we were looking at how Requests For Information (RFIs) were generated by the contractor and sent to the Architect. People would ask questions like how many RFIs were sent each month for a 36 month project, and we had the dates, and they arrived in clumps and any kind of hard window just wasn’t satisfactory at answering what people really wanted to know. Smoothing with a smoothing kernel with ~15 day span produced an answer that pretty much everyone agreed was the kind of thing they wanted. The smoothing kernel could be interpreted as some kind of estimate of a time varying Poisson-like rate. Even though we knew a ton of stuff about the contents of each RFI and the conditions that generated them, the not quite braindead model of a smoothed average tended to capture a lot more of what people cared about than either a hard “boxcar” window of 30 days, or a big table of dates and RFI details.

                In this sense, the model is real, ie. it captures what is desired by the person with the question.
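The contrast between a hard window and a smoothing kernel can be sketched as follows. Everything specific here is an assumption for illustration: the event days are invented, the kernel is taken to be Gaussian, and a standard deviation of 7.5 days is only one possible reading of a "~15 day span". The comment does not say which kernel was actually used.

```python
from math import exp, pi, sqrt

def boxcar_rate(event_days, t, window=30.0):
    """Events per day in the hard window (t - window, t]."""
    return sum(1 for d in event_days if t - window < d <= t) / window

def kernel_rate(event_days, t, bandwidth=7.5):
    """Events per day at time t from a Gaussian kernel; each event contributes unit area."""
    norm = bandwidth * sqrt(2.0 * pi)
    return sum(exp(-0.5 * ((t - d) / bandwidth) ** 2) for d in event_days) / norm

# Hypothetical event days: two clumps, the first straddling day 30.
events = [28, 29, 29, 30, 31, 31, 32, 58, 59, 61, 62]

for t in (30, 31, 32, 60):
    print(t, round(boxcar_rate(events, t), 3), round(kernel_rate(events, t), 3))
```

Moving the query time from day 30 to day 32 changes the boxcar answer substantially, because a clump crosses the window edge, while the kernel estimate changes only slightly. That insensitivity to where the window boundary falls is the property the comment describes people wanting.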

            • konrad says:

              @Entsophy: you seem to have a specific definition of “noise” in mind – please enlighten us. It seems to me as if the concept is superfluous if you restrict yourself to deterministic models, so I am wondering how/whether it fits into your thinking at all.

              @Daniel: it can be helpful to distinguish between “nondeterminism” and “unpredictability” – these are sometimes conflated, and the word “random” is sometimes used interchangeably to refer to both. I think we are all agreed on unpredictability, but I can think of three stances regarding determinism:

              1) The phenomenon being described actually is nondeterministic: Jaynes argued that this is false in almost all cases, the one exception being quantum physics (which is under debate).

              2) We use a (simple) nondeterministic model to describe/approximate a (complex) deterministic phenomenon. This replaces unpredictability arising from complexity and incomplete information with unpredictability arising from non-determinism and incomplete information, and results in a mathematical tool that is useful for prediction. I suspect this is the way most modelers think about it (and perhaps the fact that Jaynes declines to use models of this type explains why his ideas have not been as popular as one might have thought).

              3) We restrict ourselves to deterministic models, handling all unpredictability as resulting from incomplete information. Jaynes’s amazing contribution was to demonstrate that this can actually be done, and in many cases yields the same results as using nondeterministic models. To those used to thinking in terms of nondeterministic models, this result is so completely unexpected that most people never get it at all. This is the framework that Entsophy is coming from when he declines to entertain the notion of a rate parameter in a nondeterministic model.

              • Yes, this helps clarify the arguments I think. From my perspective, Per Martin-Löf’s contribution is essentially to show that an unknown but perfectly deterministic process (ie. maybe the residuals in a model for the number and demographics of people entering a train station) should be treated in exactly the same way as a known but perfectly nondeterministic process (ie. maybe quantum physics things) because the process of distinguishing them is inherently non-computable.

        • Mayo says:

          You say data are real and they are used to make guesses about things not known. But you also say the things not known are not real, so does this mean you can guess any way you like? Why not? If it’s just a fantasy. On what grounds do you deny the reality of aspects of sources of data, as modeled?

          • konrad says:

            @Mayo: I’m not sure at whom your comment was directed, but it is clear that the word “real” was used in two different ways in the above discussion. The word might be used to refer only to observable quantities as opposed to quantities that only exist in models – in such a usage, “real variable” would essentially be synonymous to the frequentist notion of “random variable” and I would argue that this is not a useful enough concept to reserve a special word for. So above I used “real” to refer to any quantity that is uniquely determined once data and modelling assumptions have been specified – under this interpretation of the word I agree we should not deny “the reality of sources of data, as modeled”.

            However, if we choose that meaning for “real”, we should take care to be clear on this lest others think we are saying our models are exact descriptions rather than approximations of what is commonly called “objective reality”.

          • Entsophy says:

            I never said things not known are not real. That would be a pretty stupid thing to say. No you can’t guess any way you like, since a “best guess” is presumably controlled by what “best” means. But the thing we’re guessing about is again another fact. For example, given current information about casualties make a best guess for casualties next month. The accuracy of the best guess is then determined by how close those guesses match up with reality.

            “On what grounds do you deny the reality of aspects of sources of data, as modeled?”

            I don’t deny the reality of the data. If there were 20 casualties in a month then the rate is 20 cas/month. The Poisson model parameter inferred from this, which is often called a “rate”, is a made up artifact which can possibly (but not always) be used to connect the data to other facts such as “the number of casualties next month”. The facts coming in and going out are real, the Poisson parameter is merely a kind of stand in for the rest of the things unknown and there’s no reason to believe it’s any more real in this case.

            There were a lot of straw men in a few lines, so I hope I covered them all.

            • Entsophy says:

              Also to clarify, the quantity “the number of casualties next month” is a real quantity, but P(the number of casualties next month) is always a made up device used to make reasonable range estimates for next month’s casualties, which may or may not be accurate depending on how well the modeling was done.

              In that sense probability distributions, as opposed to frequency distributions, some model parameters, and actual data, are never directly a real aspect of the universe. They’re an artifice used to infer real facts.

  6. I don’t know whether similar ideas have been promulgated elsewhere, but a good article about what to do when the null is not rejected is “Statistical Significance Tests: Equivalence and Reverse Tests Should Reduce Misinterpretation” by David Parkhurst (http://www.bioone.org/doi/abs/10.1641/0006-3568(2001)051%5B1051:SSTEAR%5D2.0.CO;2). Abstract: “Equivalence tests improve the logic of significance testing when demonstrating similarity is important, and reverse tests can help show that failure to reject a null hypothesis does not support that hypothesis.” Ironically, the recommendation is more tests.

  7. Chris G says:

    >… the researcher says, correctly but pointlessly: “I cannot be sure that the safety effect is not zero”.

    Indeed. Only fools and charlatans would claim 100% certainty on any non-trivial decision.

    >Occasionally, the researcher adds, this time incorrectly and unjustifiably, a statement to the effect that: “since the result is not statistically significant, it is best to assume the safety effect to be zero”.

    Wow. I’m at a loss to understand the jump from “I’m not certain that the effect is non-zero.” to that last statement.

    The first four steps to understanding cause and effect:
    1) Formulate your signal hypotheses, H_i.
    2) Fit your signal models to your data, x.
    3) Reality-check your fit results. Does at least one of the fit models do a decent job of fitting the data?
    4) Compute posterior probabilities: P(H_j|x) = P(x|H_j)*P(H_j)/sum_i{P(x|H_i)*P(H_i)}.
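Step 4 is a direct application of Bayes' rule over a finite set of hypotheses, and can be sketched in a few lines. The likelihood and prior values below are invented for illustration, and the function name `posterior_probabilities` is a hypothetical choice; in practice the likelihoods would come from the fits in step 2.

```python
def posterior_probabilities(likelihoods, priors):
    """P(H_j|x) = P(x|H_j) P(H_j) / sum_i P(x|H_i) P(H_i) over a finite hypothesis set."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # the normalizing sum over all hypotheses
    return [j / evidence for j in joint]

# Hypothetical numbers for three signal hypotheses.
likelihoods = [0.02, 0.10, 0.05]   # P(x | H_i)
priors = [0.5, 0.3, 0.2]           # P(H_i), assumed to sum to 1
posterior = posterior_probabilities(likelihoods, priors)
print([round(p, 3) for p in posterior])
```

The normalizing sum runs only over the hypotheses that were written down, so a missing hypothesis silently inflates the posterior of the ones that remain.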

    Get through those steps and you may have a story to tell. That said, your estimates of effects and of which signal hypothesis best explains the observation are only as good as your data models, the associated probability distribution functions for measurement noise/covariance, and your estimates of P(H_i). (Worth noting too that if your set of signal hypotheses is incomplete that could completely blow the P(H_j|x) calculations.) I once heard someone remark, “Statistics is an arbitrary way of being reasonable.” With that in mind and the results of my fits in hand I might report the effect (and the associated uncertainty) corresponding to the largest P(H_j|x) or maybe I’d report the weighted value of the effect where P(H_j|x) served as the weighting term. The goal should be to provide insight. There are a number of ways one could do that. If viewing one’s results from modestly different but seemingly reasonable perspectives yields significantly different conclusions then you should investigate and figure out why. Robust conclusions should be minimally sensitive to modest deviations from model assumptions and the fine points of how you interpret fit results.

    Bottom line: The objective of any analysis should be to gain insight. Looking to p-values for Truth elevates dogma over insight.

    PS To make steps 2 and 4 above more Bayesian let P(x|H_j) = integral_z{P(x;z)*P(z|H_j)} where z are model parameter values and P(z|H_j) is a prior for their distribution.

  8. K? O'Rourke says:

    > or their summary statistics

    But all _summaries_ are misleading or at least not sufficient under reasonable alternative distributional assumptions.

    Providing by group subsets of order statistics (e.g., 1/k, …, (n – j)/k percentiles) would mostly mitigate this, but does anyone do that?

    • Chris G says:

      Q-Q plots are an excellent reality check. My experience is that few people are familiar with them let alone incorporate them in their analyses.

      • K? O'Rourke says:

        A notable exception is when the outcome is time to an event and you get high definition Kaplan-Meier plots setting out the subgroups with censoring indicators – you can get the order statistics from that.

        • If you give enough order statistics, you’re just giving the data in sorted order. But if you compress a little, say giving 100 order statistics on a dataset of >> 100 data points, you’re still doing a fine service. Those empirical CDF curves and the like are just another way to publish the data. If you have the PDF you can even probably pull the data out of the PDF, and not just by eye, by actually decompressing the PDF into a text file, and looking at the set of data points that define the curve, and rescaling them appropriately.

          A much better approach however, is to encourage everyone to publish tables of data in csv or SQLite format on the web.
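As a sketch of the "compress a little" idea, the standard library's `statistics.quantiles` can reduce a large sample to about a hundred order-statistic summaries. The data here are simulated from a lognormal purely as a stand-in for a skewed dataset; the seed and sample size are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the hypothetical data are reproducible
data = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

# 99 cut points splitting the sample into 100 equal-probability groups.
percentiles = statistics.quantiles(data, n=100, method="inclusive")

# Add the extremes so the summary spans the full range of the sample.
summary = [min(data)] + percentiles + [max(data)]
print(len(data), "points summarized by", len(summary), "values")
```

A reader with these values can redraw the empirical CDF or a Q-Q plot against any reference distribution, which a mean and standard deviation alone would not allow.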

  9. […] I found a good read about the fallacies of statistical significance testing. See also here. […]

  10. Mayo says:

    Andrew:
    You say:
    “Indeed, when I say that a Bayesian wants other researchers to be non-Bayesian, what I mean is that I want people to give me their data or their summary statistics, unpolluted by any prior distributions. But I certainly don’t want them to discard all their numbers in exchange for a simple yes/no statement on statistical significance.”
    So people present their data (does this include likelihoods?) and then the reader supplies his or her prior? Is that how it works? Presumably the prior is influenced by seeing the data, but I’m just trying to understand what is being recommended. And then the posterior is reported? Or the prior is tested? Thanks for clarifying.

    • I think when it comes down to it, Gelman either will believe the results based on a description of the methodology and data collection process, or he would really rather have the data himself and do his own analysis. Most likely he doesn’t just want the likelihood someone else generated, because he would be as suspicious of that as he would of the prior, perhaps more suspicious. The big reason to be a Bayesian is to have a methodology for crunching the numbers that can deal with very sophisticated models (i.e., very complex likelihoods involving all sorts of “physically” meaningful parameters).

      • But I don’t think this is limited to Bayesians. Most statisticians of any sort would rather have the data than someone else’s report about what they thought the data meant.

        • Mayo says:

          Even the data are selected from a particular viewpoint. We are left with no account of statistical analysis. People are to farm out raw data (collected for some reason) to their favorite number cruncher? The number crunchers can advertise themselves! Gelman has so much charisma, I expect he’d become very rich under this new regime.

          • Statistical analysis is more or less making some assumptions and determining what those assumptions together with the data imply probabilistically about some scientific process. It’s possible to make different assumptions (have different theories) and so people would of course rather have the data than the summary of what someone else’s theory implied.

            Or do you believe that the purpose of statistical analysis is to give us the “One True Theory?” that everyone should agree with at the end of the analysis?

          • konrad says:

            @Mayo: that’s a fairly accurate portrayal of how scientific studies are done in practice. Empiricists collect data from their particular viewpoint, usually with the hope that the data will be informative regarding some specific question they have in mind. They then ask their favourite number cruncher, or use a computational tool created by their favourite number cruncher, to analyse the data with a view towards answering their question. And some number crunchers are rightfully more popular than others.

            In subsequent studies, other scientists may want to revisit the question by performing different analyses (based on different assumptions) on the same data. This is why it’s important to publish the raw data and not just the results of the analysis. If the results of these subsequent analyses are different, debate ensues as to which set of assumptions was more appropriate. And it is this debate that leads to scientific progress.

    • konrad says:

      @Mayo: No, that is not how it works.

      The notion of the reader having his/her own prior is a feature of subjective Bayesianism, from which Andrew (along with many of the regular commenters on this blog) has repeatedly distanced himself. (Curiously, no one ever seems to mention the possibility that the reader may also have his/her own likelihood function.)

      By contrast, objective Bayesians like to present results in this form: “If we use assumption set A, we get posterior A*.” They may add: “If on the other hand we use assumption set B, we get posterior B*.” Etc. Here, different assumption sets may involve different likelihood functions (usually the most important part of a model) and different priors (occasionally also relevant).

      This is pretty similar to what one _ought_ to do in a frequentist analysis, the most important difference from frequentist practice being that frequentists often neglect to emphasize the set of assumptions, thereby discouraging (or failing to explicitly encourage) others from considering alternative models.
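
      As a toy illustration of reporting in the form "assumption set A gives posterior A*, assumption set B gives posterior B*", here is a sketch using the conjugate normal model for a mean with known noise standard deviation. The data, both assumption sets, and the function name are invented for the example; real analyses would usually differ in more than two numbers.

```python
import math
from statistics import mean

def normal_posterior(data, noise_sd, prior_mean, prior_sd):
    """Posterior (mean, sd) for a normal mean with known noise sd and a normal prior."""
    n = len(data)
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / noise_sd ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * mean(data))
    return post_mean, math.sqrt(post_var)

# Hypothetical observations.
observations = [1.0, 2.0, 3.0]

# Each assumption set fixes both a likelihood (noise_sd) and a prior.
assumption_sets = {
    "A": {"noise_sd": 1.0, "prior_mean": 0.0, "prior_sd": 10.0},
    "B": {"noise_sd": 2.0, "prior_mean": 0.0, "prior_sd": 1.0},
}

for name, assumptions in assumption_sets.items():
    post_mean, post_sd = normal_posterior(observations, **assumptions)
    print(f"assumption set {name}: posterior mean {post_mean:.3f}, sd {post_sd:.3f}")
```

      The same data yield visibly different posteriors under A and B, which is why the assumptions need to be stated alongside the result.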

      • Mayo says:

        Konrad: No. We’d never allow inference to be a bunch of conditional claims: if you model it this way and assume independence you get such and such. Or rather, if that’s all we could do, we’d not be doing statistical inference at all. Testing assumptions is important. But the idea that one has an account of inference, modeling, or data analysis wherein consumers are to supply whatever models, priors, methods etc. to apply to your raw numbers strikes me as absurd. Talk about statistical anarchy.

        • But that *is* all we can do. Can you do statistical analysis without making assumptions? I would like to see how. Isn’t the whole point of your account of error statistics basically to say that the purpose of statistics is to find out if our assumptions imply that our data should not look like it does and therefore our assumptions are more or less wrong?

          I think your point is that there is also model checking or some kind of “relative information” measurement. We can determine that some models are better than other models because they have more predictive power or whatever. Yes, true, but first we must build the model, and that model IS a set of assumptions. Then when we have the model, we can see what the model together with the data imply about the world. Then if our model predicts well in additional cases, we might say that it is approximately true, and that its unknown parameter is approximately a true fact about the world…

          If someone else comes up with a different model that has similar predictive power but different structure, then we must look for divergence of predictions in some as yet unobserved type of data, and try to find out which model is better. If the predictions are always essentially the same, then we have some kind of unrealized isomorphism.

          The thing is that sometimes the “person building the model” and the “statistician” are not exactly the same person. A scientist may build the model, collect the data, and then realize he is not competent to figure out the “statistical” aspects of the model, so the statistician may be given the “deterministic portion” but even the deterministic portion may have some unknown parameters (say the mass of some galaxy) so now the statistician is caught up in the messy process of the science.

        • konrad says:

          @Mayo: clearly we are in complete disagreement here, to the point that discussion may be pointless, but to add to what Daniel said (all of which I agree with):

          Inference (and, more generally, reasoning) is the process of drawing conclusions from premises. In the case of statistics, the premises are assumptions (aka models) and observations. Any sensible inferential framework should have the property that the answer it gives to a well-posed question is unique (i.e. the framework should be consistent). But clearly, if you start with different assumptions you can get different answers – so the inference and any claims absolutely _have_ to be conditional on the assumptions. Anything else would lead to inconsistency.

    • Andrew says:

      Mayo:

      I think people should present their data in as raw a form as is practical. They can also present their posterior distributions. But I guess I didn’t phrase things quite right by focusing on the prior. I would not want people to just give me their likelihoods. They can give the data along with whatever other information is relevant to the problem.

      • Mayo says:

        But what do we do with the likelihoods or data? We’d get all kinds of inferences, so what did the statistical analysis do for us? I’m really doubly confused when you say you advocate Bayesian falsification. Is a reader free to falsify your model because it disagreed with her prior? You can’t mean to allow this, can you? I thought there was a far more hard-nosed falsification going on, that couldn’t be discounted at will. But I also don’t see why you’d want to allow priors to come in after the data are reported. I thought they were to be “prior”.

        • Mayo:

          Perhaps, rather than “prior,” a better term would be “external information.” I do think it’s ok for someone to falsify my model if it disagrees with their valid external information.

        • konrad says:

          @Mayo: “Is a reader free to falsify your model because it disagreed with her prior?”

          This makes no sense at all: remember that the prior is _part_ of the model. The fact that the reader’s model disagrees with Andrew’s model does not falsify it – it just means that the two models are different. In order to talk about falsification, one has to talk about the fit of the model to the data.

          Also, there are no rules constraining the order in which we do our calculations, provided we do them correctly. So any quantity can be introduced and/or calculated whenever we like. In the Bayesian framework, a probability distribution is a description of an information state conditioned on a set of facts/assumptions. “Prior” is just the name we give to such a distribution when the data are not included in the set of facts/assumptions on which we are conditioning.

  11. Lou M. Carter says:

    you might enjoy this new IZA discussion paper:
    http://ftp.iza.org/dp7268.pdf