Good advice can do you bad

Here are some examples of good, solid, reasonable statistical advice which can lead people astray.

Example 1

Good advice: Statistical significance is not the same as practical significance.

How it can mislead: People get the impression that a statistically significant result is more impressive if it’s larger in magnitude.

Why it’s misleading: See this classic example where Carl Morris presents three different hypothetical results, all of which are statistically significant at the 5% level but with much different estimated effect sizes. In this example, the strongest evidence comes from the smallest estimate, while the result with the largest estimate gives the weakest evidence.

Example 2

Good advice: Warnings against p-hacking, cherry-picking, file-drawer effects, etc.

How it can mislead: People get the impression that various forms of cheating represent the main threat to the validity of p-values.

Why it’s misleading: A researcher who doesn’t cheat can then think that his or her p-values have no problems. They don’t understand about the garden of forking paths.

Example 3

Good advice: Use Bayesian inference and you’ll automatically get probabilistic uncertainty statements.

How it can mislead: Sometimes the associated uncertainty statements can be unreasonable.

Why it’s misleading: Consider my new favorite example, y ~ N(theta, 1), uniform prior on theta, and you observe y=1. The point estimate of theta is 1, which is what it is, and the posterior distribution for theta is N(1,1), which isn’t so unreasonable as a data summary, but then you can also get probability statements such as Pr(theta>0|y) = .84, which seems a bit strong, the idea that you’d be willing to lay down a 5:1 bet based on data that are consistent with pure noise.
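To see where the .84 comes from, here is a minimal numerical sketch (Python with scipy assumed): with a flat prior the posterior is N(y, 1), so the probability is just a normal tail area.

    from scipy.stats import norm

    y = 1.0  # one observation, y ~ N(theta, 1)
    # With a flat prior the posterior for theta is N(y, 1), so
    # Pr(theta > 0 | y) = 1 - Phi(0 - y) = Phi(y).
    print(1 - norm.cdf(0, loc=y, scale=1))  # ~0.84, i.e. roughly 5:1 odds that theta > 0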

Example 4

Good advice: If an estimate is less than 2 standard errors away from zero, treat it as provisional.

How it can mislead: People mistakenly assume the inverse, that if an estimate is more than 2 standard errors away from zero, it should be essentially taken as true.

Why it’s misleading: First, because estimates that are 2 standard errors from zero are easily obtained just by chance, especially in a garden-of-forking paths setting. Second, because even with no forking paths, publication bias leads to the statistical significance filter: if you only report estimates that are statistically significant, you’ll systematically overestimate effect sizes.
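As a rough illustration of the significance filter (a simulation sketch with made-up numbers: a true effect of 0.1 and a standard error of 1):

    import numpy as np

    rng = np.random.default_rng(0)
    true_effect, se = 0.1, 1.0                 # hypothetical: tiny effect, noisy estimate
    est = rng.normal(true_effect, se, size=1_000_000)
    sig = est[np.abs(est) > 2 * se]            # keep only the "statistically significant" estimates
    print(len(sig) / len(est))                 # only ~5% clear the bar...
    print(np.abs(sig).mean())                  # ...with average magnitude ~2.4, a ~24x exaggeration of 0.1
    print((sig < 0).mean())                    # ...and ~38% of them have the wrong sign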

More examples?

Maybe you could supply additional examples of good statistical advice that can get people in trouble? I think this is a big deal.

P.S. I just added example 4 above.

71 thoughts on “Good advice can do you bad”

  1. For both 1 and 2, it seems like people would have to read a lot into the advice, or make a number of more or less unrelated assumptions, to be misled in the ways you describe.

    For 3, I’m wondering what the more hardline Bayesians would say. If you believe that Bayesian inference is rational(ity), and you accept the uniform prior, aren’t you compelled to accept the probability statement in question?

    • Noah:

      1, 2. Yes, but people do read a lot into the advice and make a number of additional assumptions: that’s my point! The advice as stated is fine, but people read more into it.

      3. As a hardline Bayesian myself, I’d say that Bayesian inference is rationality conditional on the model. But our models are approximations, and part of being rational is to recognize the limitations of our models.

      • I am having trouble being troubled by #3. Noise (is that theta = 0?) is merely one value among infinitely many others, and the range of smallish negative thetas (that are still reasonably consistent with the data) really is much smaller than the range of plausible positive thetas. I’d take that bet; even by “gut feel” the odds don’t seem grossly wrong.
        Except … when I start to ponder whether theta = 0 (or very near 0) is special in some way; e.g. if “no/small effect” is especially plausible. But then the prior is as structurally wrong as can be.
        Is an answer unreasonable when the discomfort is due to a sensible suspicion that the prior emphatically ignores?

        • # 3 is bothering me as well.

          Is the ‘uniform prior’ on theta here an improper density, or is it over a compact interval of some kind? I’m suspicious the problem here may be in the prior information, not in Bayes’ rule.

          I’m also skeptical of the “gamble” interpretation of the resulting probability. What about risk preferences?

        • I think the way you are presenting #3 is inviting misinterpretation. You say the prior has a problem, but maybe that points people in the direction of thinking about issues with improper priors – even though (I think) that’s largely irrelevant to your complaint. You call the result “bad” and “unreasonable”, without much or any qualification, which can sound as though it is intended to illustrate how Bayesianism per se is wrong. But the answer here could well be fine (e.g. if, in the actual application domain, 0 was just another number like any other).

          However, it _seems_ your point is that people can be oblivious to how the specific prior can affect results, and therefore pay too little regard to them – and expect universally good results from a default choice. (This isn’t even an example of disturbingly high sensitivity to the prior; the fault of the uniform prior – if “pure noise” is a particular consideration – isn’t subtle or small.)

          I don’t think your original post makes this very clear (well, assuming I am understanding your point.)

        • Just to elaborate on the obvious here: if you’re going to be getting a lot of data it can make sense to use a uniform prior like this, but if you’re going to try to draw inferences from a few data points then it doesn’t. There’s virtually no situation in which you really think “the answer could be between 0 and 1, or between 1,000,000,000,000,000 and 1,000,000,000,000,001, and these are equally likely.”

          But let’s contrive a situation that is not so far away from the problem. Or, rather, it will still be infinitely far away but it will capture the same essential features. Suppose I had proposed the following proposition: I will generate a random number theta between -10^10 and 10^10, and will then do one draw from N(theta,1) and tell it to you. Let’s call it y. You will then have the option of a wager: if the true value of theta is < (y-1) then you pay me $5, otherwise I pay you $1. I’d say that if we play the game and get y=1 then yes, you should take the bet. Shouldn’t you? (But really you should be very very suspicious that I have done something wrong, because what are the odds that I’d come up with a number so close to zero, or indeed a number with fewer than 8 zeros? This is a hypothetical question, we all know what the odds are.)

          This puts me in mind of the observation…I think it was Fermi… that there’s something screwy about the universe because all of the dimensionless numbers are near 1, by which he meant between about 10^-40 and 10^40 or something. The fine structure constant is about 1/137, for crying out loud! Why don’t any of them come out to be 10^10^10^50 or something? There are a lot more numbers outside the range 10^-40 to 10^40 than there are inside it, I think we can all agree on that.

        • Suppose the prior was uniform [-100, 100]. Or N(0, 50). Isn’t the meat of the original example going to hold: you see y = 1, you’d still be quite convinced theta > 0. (Or if y = 5, then theta > 4, etc.)

          I really think that uniform over _all_ possible theta is just a confusing distraction here. It’s tempting to criticize it because what does it really mean?, etc., and we know an improper prior like this can lead to bizarre problems. Yet I think it’s nearly irrelevant to this actual example.
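          A quick numerical check of that point (a sketch assuming numpy and scipy; the grid just makes the priors proper):

              import numpy as np
              from scipy.stats import norm

              y = 1.0
              theta = np.linspace(-100, 100, 400001)        # fine grid over a wide but proper range
              lik = norm.pdf(y, loc=theta, scale=1)         # likelihood for y ~ N(theta, 1)
              for prior in (np.ones_like(theta),            # uniform on [-100, 100]
                            norm.pdf(theta, 0, 50)):        # N(0, 50^2)
                  post = prior * lik
                  post /= post.sum()
                  print(post[theta > 0].sum())              # ~0.84 under either prior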

        • Phil:

          Dimensionless numbers are the ratios between things of the same dimension… obviously. So when it comes to determining the ratio of two things, if one of them is too small for us to measure and even for us to notice that it exists…. then we’re never going to notice it exists… so there could be like a trillion forces each of which is 10^10^10^50 times smaller than say the electromagnetic force… but we wouldn’t notice them.

        • Daniel:
          I understand your point, and could even add another, which is that there are degrees of freedom in how the dimensionless numbers are defined, and if they were inconveniently far from 1 we might define them differently. An example is the ratio of the fine structure constant to the gravitational coupling constant, where there’s the choice of “the gravitational coupling between what and what?” It’s normal to use two protons in a nucleus, but if things were very different maybe we would use two galaxies at a typical galactic distance, or two quarks at a proton diameter.
          I have gone back and forth about this issue, for these reasons, but I’m not completely convinced that there’s not more to it. At least I think there might be room for the “weak anthropic principle”: if some of these numbers differed by many orders of magnitude (or many millions of orders of magnitude) the result would be a universe that can’t support life of any kind.

          bxg: Yes, I was just illustrating that something very close to Andrew’s problem could actually make sense — you could reasonably do an infinitely wide prior — and that you would indeed be willing to make the bet. You’re right that there are other situations, less similar to Andrew’s original one, where you should also make the bet.

        • In fact, the best dimensional analysis has as its goal making things all have a similar scale; the value 1 is the obvious common reference, so usually we try to make something be “like 1”.

        • Daniel,
          I understand your point. I do. I’m not sure what I can do to convince you of this, other than to say it again.

          And I will even say that I am in partial agreement with it. Maybe full agreement. I’m just not quite sure…and it’s not just me but genuinely smart people like, uh, I think it was Fermi but maybe it was somebody else, but it was somebody at that level. But I’m not 100% sure (and neither was Fermi-or-whoever). Looking at the ratio of the strength of the forces in a nucleus is a natural thing to do: what is the ratio of the gravitational force between protons to the electromagnetic force? The force of gravity? And so on. If you do this, you find that gravity is very very weak. There’s a nice story (probably apocryphal, or possibly exaggerated from a related story) about Feynman giving a lecture about this, and he compares the forces and shows that gravity is weaker by a factor of 10^36 and says “…so gravity is very very weak. Really almost incredibly weak.” And just then a speaker that was hanging from a mount on the ceiling fell to the floor with a startling crash. Feynman said “Weak, but not negligible!”

          Anyway, OK, you calculate the ratios and you find that gravity is weaker by a factor of 10^36. What if it had come out to be 10^5000000? Your argument is “well, then people wouldn’t talk about that ratio. They would calculate some other ratio, like the ratio of the gravitational force between two protons in a nucleus to the electromagnetic force between two protons that are separated by the diameter of the universe,” or something like that. I just don’t buy it. Indeed, even if you do calculate that number I just invented, let’s call it the Phil number, it’s not going to help: the universe is only about 10^16 meters in diameter, that’s about 10^30 nuclear diameters. Square that and you get a factor of 10^60. So if you separate two protons by the width of the universe, you only decrease the force between them by a measly factor of 10^60. If the EM force between them were 10^5000000 times larger than the gravitational force when the distance was a nuclear diameter, then the Phil number is still going to be about 10^4999940.

          Essentially, you are making a “garden of forking paths” argument — there are lots and lots of dimensionless numbers we can calculate, and of course we pick the ones that are in a ‘reasonable’ range. I understand this argument. I do. It may even be right. But I am not convinced by it.

        • Phil, I understand your argument too. I think the two arguments have to go together to give the full picture:

          1) Things that are really tiny compared to things that we notice are just… too tiny to notice… If gravity were 10^5000000 times weaker than it is, we just wouldn’t even know about gravity.

          2) Whenever we have an option on how to calculate a dimensionless ratio we prefer to calculate it so that it makes things O(1) in the problem at hand.

          Gravity is really weak at atomic scales: G m_e^2 / (k q_e^2) = 2.4e-43 (m_e = electron mass, q_e = electron charge)
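          (As a quick check of that number, a sketch with standard SI values for the constants:)

              G, k = 6.674e-11, 8.988e9         # gravitational constant, Coulomb constant (SI)
              m_e, q_e = 9.109e-31, 1.602e-19   # electron mass (kg), electron charge (C)
              print(G * m_e**2 / (k * q_e**2))  # ~2.4e-43, and the separation r cancels out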

          but Gravity is much stronger when applied to two objects the mass of the earth each with a single electron charge…

          If there were no logically achievable scale at which gravity had a nonnegligible effect (i.e. if gravity were still terribly weak even for two whole galaxies, compared with the electrical force they’d feel if each carried the net charge of a single electron…) then we wouldn’t know that gravity existed!

          Physics dictates that we are aware of all the “strongest” effects. Since we tend to like to make things O(1) we will tend to compare things to the strongest effect at work in our problem. For nuclear physicists that’s the strong nuclear force, for engineers it’s the electrical force (material strength), and for astrophysicists it’s the gravitational force.

          That explains why dimensionless constants aren’t typically bigger than 10^50 (or even maybe 10^2), since we tend to put the big part in the denominator.

          If there’s a fundamental thing that is negligible in all these cases… i.e. its relative size is 10^-50 or smaller in each case… then we don’t know about it because it’s always negligible. That explains why dimensionless groups are never smaller than 10^-50 or whatever.

          Put together, we get this result that all the dimensionless groups we calculate are between about 10^-50 and say 100 (or 10^50 if you like).

        • Phil, another relevant number is the number of atoms in the universe. That’s around say 10^80, which means that an effect which is 10^80 times smaller than say the electric force could conceivably add up to be around the same size as the electric force when considered at the scale of the entire universe… anything smaller than that couldn’t ever “add up” over all the particles in the universe to be non-negligible.

          That puts a kind of scale at least for the logarithm of the smallest dimensionless group we could likely detect, at least assuming effects that are linear in the number of particles.

        • Daniel: You sort of beat me to it.

          The reason gravity is not negligible in spite of being very very weak in a seemingly appropriate comparison is that gravitational forces add while EM forces can cancel (and, in the normal course of things, do cancel). Get 10^30 atoms together, while the 10^30 protons are having their forces canceled out by 10^30 electrons, and suddenly the shoe is on the other foot.

          And that real-world example shows why it’s not obvious to me that physically important dimensionless numbers couldn’t be a thousand orders of magnitude bigger (or smaller) than they are. There could be forces (indeed, there are forces) that fall off exponentially with distance rather than as a power law, and there are forces that get stronger with distance rather than weaker. There could be a force that falls off like r^-10, or a force that is strong on a distance scale of a meter but negligible at a distance of 10 meters. Or there could be forces that don’t add (like gravity) but that multiply. In some of these cases, if you created a natural-seeming dimensionless quantity you could end up with a ridiculously large number.

          …Which isn’t to say I think your argument is wrong, I’m just still not convinced that the “dimensionless numbers are always near 1” observation is trivial. Yes, maybe that “natural-seeming” dimensionless quantity wouldn’t seem natural if you knew it was going to be 10^500,000, I recognize that. But there’s an alternative argument that looks at the question a different way: what if, somehow, all of the important quantities were the same except gravity were, say, 10^100 times weaker compared to all of the other forces. Just turn down G. Well, galaxies and stars wouldn’t form, there would be no fusion, the only element would be hydrogen, there would be no life, etc. If someone outside the universe were to look at the physics, they’d say “there’s no chance of life in that universe; no sentient being will ever exist there who can ponder the relative magnitudes of the forces.” (Indeed, just turn down G by a factor of 1000 and you probably get this case).

          I recognize that there is a connection between your way of looking at it and the one that I just raised, but the connection isn’t all that tight. Or maybe it is. I’m still not sure.

        • Phil, I don’t have much argument with you, it certainly is true that the world has to be not too wacky in order for life to exist… so that’s kind of interesting. But I wanted to respond to this bit because it is a really interesting fact:

          “Or there could be forces that don’t add (like gravity) but that multiply”

          There actually *CAN’T* be forces that multiply: the dimensions of the quantity have to remain FORCE, and in this situation they wouldn’t.

          I mean, I suppose you could posit a world where this particular symmetry law doesn’t hold… but in our world, quantities with a particular dimension can’t multiply to produce a quantity with the same dimension. So, this kind of thing isn’t something you could bolt on to our world, it’d be a totally different world, and in that world, there would be NO dimensionless numbers, really there would be no concept of dimension of a quantity.

          Also:

          “And there are forces that get stronger with distance not weaker”

          But, I don’t think there can be forces that asymptotically get stronger right? I mean, forces like a nuclear force that have a high potential barrier at some radius and then fall off to zero at infinity… yes, but not things that increase without bound at infinity.

        • Daniel:

          |“Or there could be forces that don’t add (like gravity) but that multiply”
          |
          |There actually *CAN’T* be forces that multiply: the dimensions of the quantity have to remain FORCE, and in this situation they wouldn’t.

          I think you’re misinterpreting what I mean by “multiply.” What I mean is, if you have one particle you get a force F at a given distance; if you have two then you get a force 2*F, and if you have three then instead of a force of 3*F you get a force of 2^3 * F, and so on. Get 10^30 of these together and you’d have a force of 2^(10^30)*F. The force would still have dimensions of force, it’s just that the force would increase very rapidly as you agglomerate particles.

          I doubt you would say that this is impossible, but just in case someone else is tempted to say this, they should recognize that we could in fact make devices that work like this. We could make a little electromagnet that senses when it is close to some number of similar devices, and could increase its output accordingly. If they all act like this, they could generate a force like the one I made up. So even in the world as we know it, we could make a sort of simulator of what I’m talking about.

          “And there are forces that get stronger with distance not weaker”

          | But, I don’t think there can be forces that asymptotically get stronger right? I mean, forces like a nuclear force that have a high potential barrier at some radius and then fall off to zero at infinity… yes, but not things that increase without bound at infinity.

          Of course there’s a sense that the only things that can happen in the universe are things that do happen, and as far as we know there are no forces that get stronger with distance, therefore such a force cannot exist. The expansion of the universe, and the expanding distances between the bits of matter in it, certainly put an upper limit on how strong such a force could be. But…well, we don’t have a Grand Unified Theory yet, and as such I’m not sure we can really rule out that such a force _could_ exist.

          The Strong Nuclear Force at least doesn’t _decrease_ with distance, although perhaps one could picture that as a force that falls off like r^-n where n happens to be 0 instead of a positive number, not an indication that n could ever be negative.

          I dunno. I’m not even sure there’s a puzzle to speculate about, so I’m not sure my ruminations on the puzzle are productive. But I’m not sure there’s NOT a puzzle, either.

        • Multiplying forces: Yes I see what you’re talking about now, and while I now agree that the logical requirement may not rule this out… it might be that such a force couldn’t possibly be *local*. Imagine your simulator, it has to essentially talk to all the nearby particles and figure out how many of them there are in order to figure out how big to be.

          So, is it conceivable that the reason we don’t see situations like your enormous one is basically that there is some kind of locality requirement? Of course, QM doesn’t seem to be exactly local really… so anyway I like this conversation, though I think it’s pretty far off the original topic ;-)

          I also wonder about the strong nuclear force. It seems as though it’s a purely theoretical construct that can’t be tested asymptotically at infinity. If you take a couple of quarks and pull them apart, you generate new quarks, so you can’t actually test whether as you pull quarks apart to long distances (like say 1cm or 1m or 1km) the force stays constant (and hence the potential energy grows linearly). So basically, given the untestable nature of it all, you’re in a situation where there’s no need to “trust” the standard model for long-distance strong nuclear forces. In fact, the “net” nuclear force looks like the Reid potential or whatever, and drops off to zero at infinity as you expect given that the whole universe doesn’t collapse down to a single nucleus :-) https://en.wikipedia.org/wiki/Nuclear_force

        • Thanks Andrew. What’s interesting to me is that it seems like a reasonable one.

          Researcher df / forks are defined as human decisions. That gives an opportunity for human biases and errors to get incorporated into the process. But also human insight and expertise.

          When I hear people defend leaving some flexibility in data analysis (vs., say, pre-registering), what I hear them saying is, look, you need to be able to look at the data and figure out what’s going on. There are too many unknowns to specify everything in advance. Even if you disagree (as I often do), I think there’s a kernel of something interesting there. They’re saying they want to leave room for expertise to factor into the process.

          I think an interesting way to reframe the discussion is not are all forks bad or all forks good, but rather, where are we better served incorporating human judgment into the process, and where are we better served shielding the process from it. And relatedly, how can we make its role transparent.

        • I’ve never understood the garden of forking paths critique as saying ‘there should be no flexibility in data analysis’ so much as calling for a clearer distinction between exploratory or provisional conclusions (which come from highly flexible analyses) and confirmatory analyses (which correspond to rigidly preregistered methodologies & tests). Probability statements/tests should presumably be taken less than literally for exploratory/flexible analyses, but they’re still a really important thing to do. Difficult to ever come up with a decent confirmatory test if you don’t explore your data.

        • This suggests that more studies ought to employ exploratory-confirmatory designs, in which the researcher is free to pursue whichever forking paths appear promising in the exploratory data set, but commit to a rigid analysis plan for the confirmatory data set applying the data exclusion/transformation/modeling choices discovered to be most fruitful. This way, researchers need not ignore interesting patterns in the data simply because they did not plan to investigate them, and can still report inferences with the usual frequentist guarantees.

          Of course, this would be difficult in small-N studies, since this would turn them into even-smaller-N studies. For these, we may just have to take any inferences as merely suggestive until a proper replication is performed.

        • I think an even more defensible approach for small-n studies is to agree that they are simply hypothesis generating for large-n follow-up studies and nothing more.

        • Absolutely. In fact, it’s common now with machine learning work to split the data into 3 pieces.

          Training (where a wide variety of techniques may be used)
          Validation (validation of the training approach)
          Test (completely out of sample)

          For example, I read a paper this morning that splits the data this way (60%, 15%, 25%).
          Lack of data is not their problem; they used 1.5 million observations.
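          In code, that kind of 60/15/25 split is just a shuffled partition of the row indices (a sketch assuming numpy; the proportions and the 1.5 million are simply the numbers mentioned above):

              import numpy as np

              rng = np.random.default_rng(0)
              n = 1_500_000                          # e.g. 1.5 million observations
              idx = rng.permutation(n)
              n_train, n_val = int(0.60 * n), int(0.15 * n)
              train = idx[:n_train]                  # fit anything you like here
              val = idx[n_train:n_train + n_val]     # compare/tune the candidate approaches
              test = idx[n_train + n_val:]           # touch once, for the final out-of-sample check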

        • zbicyclist: Somewhere in the Training/Validation/Test distinction is the idea of climbing a gradient, with resultant overfitting. If you try even a single model on just training data, you’ll have no idea how well it will generalize, since your training will climb the training data gradient by definition. If you try multiple models (including of course, variations — variable selection, feature engineering, etc — on a single technique or different techniques) with Training/Validation data you will end up climbing the gradient of the Validation data results and you’ll have no idea how it will generalize. (By definition the model/version that learned your validation best is declared the winner.) So you need yet one more level of holdout to determine how your champion model might generalize.

          Cross-validation (CV) uses data more efficiently and yields distributions rather than point estimates, so it’s like holdout data, but better. But not fundamentally different: just as you need Training/Validation/Test holdouts, you need nested CV if you’re tweaking on things and comparing models. Some folks seem to think that CV is so magical that non-nested-CV — i.e. the kind provided by their software package’s regression or classification routines — will save them, but it won’t.
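          A sketch of that nested-CV pattern in scikit-learn terms (the Ridge estimator and the alpha grid are arbitrary placeholders): the inner loop crowns the champion, the outer loop estimates how the whole selection procedure generalizes.

              from sklearn.datasets import make_regression
              from sklearn.linear_model import Ridge
              from sklearn.model_selection import GridSearchCV, cross_val_score

              X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
              inner = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)  # climbs the inner (validation) gradient
              outer = cross_val_score(inner, X, y, cv=5)                        # scores the tuning procedure itself
              print(outer.mean())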

      • I guess my point is that it’s not the advice that’s leading people astray. As you say at the beginning of the post, 1 and 2 (and 3 and 4) are good, solid, and reasonable. If the action is all in the additional assumptions and mistaken implications, it seems strange to say that it’s the advice that’s misleading anyone.

        The new item (since my comment) allows for a nice, simple illustration of how being misled is due to the person not taking the advice for what it’s worth. If A then B does not imply if not-A then not-B.

        I wish more Bayesians (and frequentists, for that matter) would clearly state the bit about everything being conditional on the model more often. That seems very important to me, and I try to keep it in mind when I analyze data and build/fit statistical models.

        • +1 to last paragraph (although my impression is that frequentists are more likely than Bayesians to neglect the point that results are conditional on the model.)

  2. What’s the problem with #3? Either you would buy that price (84%ish) or sell it (risk-neutral). If you would buy it, then your prior wasn’t that theta was uniform. If y had been 17, how would you feel about betting on theta > 16? Differently, or the same?

    Now maybe what you’re pointing out is that people say “my model is X and my prior is Y” but they actually aren’t comfortable with the consequences of specifying that model.

    • Hgfalling:

      Yes, that’s my point, that the prior distribution being used in the statistical analysis does not actually reflect prior knowledge.

      Of course, that’s always true—any model is an approximation. This particular case is notable because the uniform prior is a near-universal default in this setting, and it gives such bad results.

      • So perhaps the problem is really using defaults as a substitute for thinking? (I think P. I. Good once asserted that the most common mistake in using statistics is automatically using the default. I believe he was referring specifically to using software, but the maxim seems to apply generally.)

        • Martha:

          No, I wouldn’t put it that generally. Defaults are useful. This particular default can be useful too, but not if you use it to make this particular inference.

          That’s why it’s a trap. A lion in a lion cage is not a trap. A lion in a koala cage, that’s a trap. Default uniform priors are a sort of koala cage, they are used in all sorts of settings and often work well. But not if you use them to compute Pr(theta > 0 | y).

        • I’m not sure that this is a response to what I said, which was (with emphasis added)
          “*perhaps* the problem is really using defaults *as a substitute for thinking*”

  3. The problem is not with the prior, it is with the likelihood. If you assume that the variance is 1 then, yes, observing one observation of 1 is enough evidence to bet 5:1 that the mean is greater than 0.

    • Jack:

      No, the problem here is with the prior. I have no problem saying that the likelihood ratio is 5:1, but I do have a problem with saying I should be 5/6 sure that the true parameter is greater than 0. That latter statement depends on the prior.

      • The uniform prior issue has been looming large in my studies lately. It really seems to be an orphan without a home…

        The “objective” Bayesian says, “A uniform prior is one of those ‘subjective’ Bayesian things. It’s certainly not objective; it can certainly influence the results in unexpected ways.”

        The “subjective” Bayesian replies, “No, it can’t possibly be ‘subjective’, since it’s improper and hence cannot reflect any knowledge or belief.”

        The “uniform-distribution” Bayesian says, “But a uniform prior seems to be consistent with the Principle of Indifference. Besides, it tends to give the same answers as Frequentist procedures, so I’m more popular with my peers.”

        The “empirical” Bayesian starts laughing at this statement, to which the “uniform-distribution” Bayesian retorts, “And who is it that has no problems using Frequentist procedures to give them their priors? Hmmm?”

        I really wonder if an (unbounded) uniform prior can fit into any Bayesian philosophy. Seems like it’s a holdover from the early days and it’s sometimes convenient as a first-step when you step up your model from a frequentist regression to a Bayesian regression and want to confirm that the results are similar… If it doesn’t result in your Bayesian regression taking significantly longer to run than it should.

  4. Good advice: pay attention to the confidence interval around an estimate, e.g. “this poll has a margin of error of +/- 3%”

    How it can mislead: This is just sampling error, not total error.

    Why it’s misleading: Sampling error itself is likely an understatement, given the low and declining response rates common to surveys now. Total error also would need to consider biases (e.g. is one response more socially acceptable? is this a question people are even likely to know the answer to?)
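    (For reference, the advertised +/- 3% is nothing more than the usual binomial sampling formula at roughly n = 1000, as in this sketch with an assumed 50/50 split; nonresponse and question wording appear nowhere in it.)

        import math

        n, p = 1067, 0.5                          # a typical national poll size, worst-case proportion
        moe = 1.96 * math.sqrt(p * (1 - p) / n)   # sampling error only
        print(round(moe, 3))                      # ~0.03, i.e. the advertised +/- 3%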

  5. A bigger problem is people don’t listen to good advice that’s actually good for them.

    I remember this one time there was a group of Marines taking fire. The squad leader shouted over the radio “there’s rounds hitting close to us, they’re shooting at us! we can’t stand up and leave cover!” So I put on my statistician’s hat and said “haven’t you heard of the Garden of Forked Paths? Realistically you would analyze the data differently if the rounds were landing miles away from you. So the evidence they’re shooting at you is far less than you think.”

    For some reason they wouldn’t listen to reason and still refused to stand up. This is why we need to teach statistics in high school so Marines in combat won’t make mistakes like that.

    • Laplace:

      No, the analogy is more like this: A group of Marines is taking fire. Someone shouts over the radio, Get in the power pose, I saw in a Ted talk that it improves performance and they had p less than .05 in Psychological Science. So you put on your statistician hat and say: (i) I see no reason why the results as described would apply in this situation, and (ii) that p-value is meaningless because of the garden of forking paths. So ignore the shouter and don’t waste your time with power pose (unless you were planning to do it anyway).

      • “that p-value is meaningless because of the garden of forking paths.” That’s like saying “your crystal ball is worthless because the tooth fairy told me so.”

        It was obvious a long time ago that the (hypothetical and unknowable) frequency with which a procedure yields the truth is not a measure of how strongly the facts/assumptions in the case at hand favor different hypotheses. This Garden of Forked Paths stuff should have driven this home so well no one could deny it. Instead statisticians have doubled down and now take it for granted that an inference is greatly affected by the hypothetical decisions a researcher would make if they lived in a different universe with different data.

        Great job Andrew. That little piece of insanity should keep Frequentists in business for another generation or two.

        It would be nice though if Bayesians at least would acknowledge the “Garden of Forked Paths” is a meaningless incantation which can be cast over any bad-looking p-value to explain away its failure. It’s no different than when a prophet excuses every prediction failure by saying “you didn’t believe hard enough”. Its chief purpose is to make p-values unfalsifiable. It explains every embarrassment and is quietly forgotten whenever p-values seem to work reasonably.

        • The Garden Of Forking Paths is in my opinion a heuristic that explains to people why the following things are true:

          1) It’s hard (ie. requires a lot of tries) to use a pre-chosen “null” random number generator to generate a random dataset that has small p value.

          2) It’s easy to think up a random number generator that people will plausibly consider “null” which nevertheless makes your actual data have a small p value under that RNG.

          The first is true because of the frequency properties of random number generators… the second is true because of the large potential search space of “plausibly null RNGs” so that even if you don’t search hard, it’s easy to find one.

          The forking paths thing is just showing how the search space grows exponentially with the number of little tweaks you have available to you, something like N = exp(Kn) for n the number of knobs you have to tweak, and K something like log(20) or so for typical conditions in social sciences. So with four knobs that’s 20^4 = 160,000: you have a lot of things to choose from, and you don’t have to search too hard to find one.

          Garden of Forking Paths is just pointing out how situation (1) and situation (2) are in no way connected… and yet we’ve been teaching people to expect that they are.
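          A toy version of point (2), as a sketch: give yourself just two binary analysis knobs on pure-noise data, which of two outcomes to report and whether to restrict to a subgroup, and the chance that at least one of the four resulting tests comes in under .05 is already around three times the nominal rate.

              import numpy as np
              from scipy.stats import ttest_ind

              rng = np.random.default_rng(0)
              hits, n_sims = 0, 2000
              for _ in range(n_sims):
                  y = rng.normal(size=(100, 2))              # two pure-noise "outcomes" for 100 subjects
                  group = rng.integers(0, 2, size=100)       # a "treatment" with no effect at all
                  sub = rng.integers(0, 2, size=100) == 1    # an arbitrary subgroup indicator
                  pvals = [ttest_ind(y[(group == 1) & keep, j],
                                     y[(group == 0) & keep, j]).pvalue
                           for j in (0, 1)                         # knob 1: which outcome
                           for keep in (np.ones(100, bool), sub)]  # knob 2: everyone vs. subgroup
                  hits += min(pvals) < 0.05
              print(hits / n_sims)                           # ~0.15, not the nominal 0.05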

        • Laplace:

          I do not recommend that people perform hypothesis tests. But people do perform hypothesis tests, and they use them to make strong statements. I point out the garden of forking paths as a way to explain how they could be getting those gaudy p-values even in the absence of any effect.

          Here’s an analogy. Someone comes to you with a perpetual motion machine. You just laugh, but various people keep insisting: hey, I don’t care what your theory says, I see that machine spins and spins on its own. So it could be worth your while to take the machine apart and figure out what makes it run. The garden of forking paths is what makes the perpetual p-value machine run.

        • > The garden of forking paths is what makes the perpetual p-value machine run.
          Nice.

          Not having an adequate representation (model) of the garden of forking paths one has gone through, the p-value machine appears to run well.

    • To be clear: if there are rounds whizzing over your head and you assume they’re not meant for you, you’ll stand up and risk one of those rounds accidentally hitting you in the head? What kind of analogy is this?

  6. Good advice: Use robust standard errors whenever you worry that regression assumptions are violated.

    How it can mislead: Using robust standard errors creates a false sense of security in many instances and does not correct biased estimates resulting from model misspecification, measurement error, etc.
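    A sketch of that point (statsmodels assumed; the measurement-error setup is invented for illustration): the robust standard errors compute happily, but the attenuated slope stays attenuated.

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(0)
        n = 10_000
        x_true = rng.normal(size=n)
        y = 1.0 * x_true + rng.normal(size=n)       # the true slope is 1
        x_obs = x_true + rng.normal(size=n)         # but x is measured with error

        fit = sm.OLS(y, sm.add_constant(x_obs)).fit(cov_type="HC3")  # heteroskedasticity-robust SEs
        print(fit.params[1], fit.bse[1])            # slope ~0.5: biased toward zero, robust SEs notwithstanding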

  7. I spent my career interacting with statistics users who were not trained statisticians. They were chemists, biologists, ecologists, agricultural scientists and so on; i.e. people who had to use statistics to get their work completed. It was obvious to me that they had never been told that the words that statisticians use, most notably significance and error, had quite precise meanings within the context of statistics. My most vivid memory is of an experiment in which I was trying to determine the magnitude of different sources of variation in a factory setting involving a formulated pesticide. The QA samples were showing that a high proportion of batches were out of spec. I eventually pinned the problem down to inadequate mixing in a 5000 liter container at the factory. My report referred to all sources, including laboratory error, which was tiny. However the analytical chemist on the team was insulted by the idea that he was introducing error into the process and complained to his manager. And people were always banging on about biological significance being much more important than statistical significance, not understanding that statistical significance (or lack of it) determined whether they should even look for biological significance.

    The proximate cause of the problem: we don’t teach the right things. The ultimate cause: I don’t really know, but I suspect it is that statistics is an academic subject, not a professional one – if civil engineering relied on poorly trained people to build bridges, many of the bridges would fall down. Solution: a radical change in what and how we teach the subject.

    • I agree with the description of the problem, in other words, too many statisticians function like a service desk within their organisation. Or like a plumber. You try to do stuff yourself, screw it up, call them in for some unpleasant stuff, and after they’ve left you laugh at their mannerisms and chat idly about how they didn’t have a clue (“biological significance!”, “rich qualitative data!”) and you’d have done a better job yourself. And because you always need just one more great idea investigated, and don’t have more money, you do extra bits yourself. Repeat.

      But not the solution. It doesn’t matter how you teach because these dilettanti are not in the stats class anyway, nor for that matter are their teachers, or their bosses. Open data is a big help because studies done without a statistician can increasingly be tested to destruction afterwards by people who know what they’re doing. Statisticians reviewing publications severely is another one. A few painful experiences in public (or in front of the board) and your colleagues will soon shape up. We will make enemies along the way, but hey! we are so hard to recruit and so many projects rely on us that we are pretty much unsackable.

    • I think the correct solution to this problem is to not teach them statistics. Instead, researchers should be encouraged to get help from trained and experienced (maybe even certified – I have no idea how it is in the US, but in Germany anyone can call him- or herself a statistician) statisticians. Or to stick with your example: I don’t build a bridge myself, I hire an engineer who knows what to do.

      When I have a painful tooth, I go to the dentist. When I’m sick, I go to the doctor. When my car is broken, I go to a mechanic. When I have mental problems, I go to a therapist. When I have legal trouble, I hire a lawyer. What I want to say: There are things you can do by yourself, but for some you need a professional. And statistics is one of those things where you need a professional. However, our current academic education system does not stick to this principle. Instead, we teach students of any discipline how to calculate t-tests and correlation coefficients. Later, it is required that those researchers are able to build and analyze complex studies. You don’t need to be Einstein in order to realize that this cannot end well. Years of practical experience in data analysis cannot be replaced by a few courses in statistics.

      Of course, my above statement that we should not teach them statistics is highly exaggerated. However, the more I think about what we should teach students, the more I stick with this statement: You can’t really interpret statistical results without knowing the methods behind them. Instead, statisticians should (a) be available for each research group and (b) be better trained in translating complex results.

      • Paul:

        +1 to your final paragraph; -1 to most of the preceding. In particular, I don’t think it would be wise for me, as a consumer, to automatically trust every dentist, doctor, mechanic, etc.; even though some are “licensed,” that doesn’t automatically mean they have good judgment.

        • Martha: I agree with you: blind trust is never a good idea – there are black sheep (even with a license) in each profession. However, I think in some situations you have no other choice. And I think a researcher who performs a major study – equipped only with correlation coefficients, t-tests and some basic knowledge about p-values – is in such a situation. This is like playing dentist with a hammer, screwdriver and drilling machine: it may work, but don’t look at the result.

          In addition, as a consumer, it’s your choice: If you are not satisfied with a service you can choose a different professional.

          And please, don’t get me wrong: I know that there are scientists (non-statisticians) out there who are able to conduct beautiful analyses. But in my experience they are not the majority.

        • I wasn’t arguing in favor of “Do it yourself” — but for being a critical consumer, just as one ought to be when engaging the services of a dentist, plumber, etc.

        • As you said yourself, most of your first two paragraphs is highly exaggerated.

          To elaborate: Researchers need to know enough statistics
          a. to be critical readers of research using statistics
          b. to be able to make wise choices of statistical consultants
          c. to be able to work collaboratively with statistical consultants.

          Typical statistics courses in many fields do not do this; instead they usually just teach how to “do” simple statistics.

          In fact, there is a whole big issue of “consulting” vs “collaborating”. There was an interesting discussion of it on the American Statistical Association discussion group recently. However, I suspect it is only available to ASA members; here’s the URL in case it is accessible: http://community.amstat.org/communities/community-home/digestviewer/viewthread?GroupId=2653&MID=29962&CommunityKey=6b2d607a-e31f-4f19-8357-020a8631b999&tab=digestviewer#bm16

          There is a brief discussion of statistical collaboration at http://magazine.amstat.org/blog/2016/01/01/a-recipe-for-successful-collaborations/, which I believe is generally accessible.

  8. Good advice: Statistical methods rely on model assumptions and these should not be forgotten when choosing methods and interpreting results.

    How it can mislead (I): Use methods that don’t seem to have assumptions and think that this makes the problem go away (usually because people haven’t worked out exactly what the methods are doing, or because people haven’t taught/learnt what the more or less implicit assumptions are, e.g., for classification and regression trees, hierarchical clustering, principal components etc.).

    How it can mislead (II): Run a randomly chosen goodness-of-fit/misspecification test, followed if deemed necessary by a transformation, and declare the model true afterwards.

  9. I am not seeing the problem with number 3, and a simple simulation confirms that in the given situation, when y is in the range 0.99…1.01, then theta > 0 about 84% of the time. Why is this a problem? Yes, you could lose the bet, but that is what a bet is.

    How “large” 84% is depends on what you are going to do with it. It would be a laughably low justification for publishing a claim that theta > 0. I don’t think I’d even consider it suggestive of something worth a further look. On an even money bet, an 84% chance of winning is a good deal. And AlphaGo just beat Lee Sedol in 80% of their five games.

    I used a uniform prior over a finite interval large enough to contain almost all of N(y,1); any such finite interval will give the same result. Even taking the prior on theta to be N(0,1) gave a frequency of (theta>0 | y=1) of about 73%. It takes priors like uniform on [-1,0.1] or on [-0.1,1] to drastically change the posterior. So I am not clear on the point of the example. If you don’t know which of those priors to use, then all of them are wrong. Is the real problem here, how to choose a prior when you don’t know how to choose a prior?
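    For the record, that kind of simulation is only a few lines (numpy assumed; the interval endpoints are arbitrary so long as they comfortably cover N(y, 1)):

        import numpy as np

        rng = np.random.default_rng(0)
        theta = rng.uniform(-10, 10, size=2_000_000)   # a wide, proper uniform prior
        y = rng.normal(theta, 1)
        keep = np.abs(y - 1) < 0.01                    # condition on y in (0.99, 1.01)
        print((theta[keep] > 0).mean())                # ~0.84, matching the analytic answer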

    • Richard:

      Of course, if theta is really drawn from a uniform (-A, A) distribution with A very large, and you gather data y~N(theta,1), and you condition on y=1, then, yes, theta will be positive 84% of the time. The point is that, in the situation where you’d be computing such a probability, theta is not really drawn from a uniform (-A, A) distribution with A very large. In settings where y~N(theta,1) and you might observe y=1, theta is much more likely to be close to zero and much less likely to be large. The uniform prior is a default model that can cause all sorts of problems (cf. power pose, ovulation and voting, social priming, etc etc etc).
