Classical hypothesis testing is really really hard

Posted on March 14, 2018 9:55 AM by Andrew

This one surprised me. I included the following question in an exam:

In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women? Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction? Of these, what is the expectation of the ratio of estimated effect size to actual effect size?

None of the students got any part of this question correct.

In retrospect, the question was too difficult; it had too many parts given that it was an in-class exam, and I can see how it would be tough to figure out all these numbers. But the students even didn’t get close: they had no idea how to start. They had no sense that you can work backward from power to effect size and go from there.

And these were statistics Ph.D. students. OK, they’re still students and they have time to learn. But this experience reminds me, once again, that classical hypothesis testing is really really hard. All these null hypotheses and type 1 and type 2 errors are distractions, and it’s hard to keep your eye on the ball.

I like the above exam question. I’ll put it in our new book, but I’ll need to break it up into many pieces to make it more doable.

P.S. See here for an awesome joke-but-not-really-a-joke solution from an anonymous commenter.

P.P.S. Solution is here.

48 thoughts on “Classical hypothesis testing is really really hard”

Sameera Daniels on March 14, 2018 9:59 AM at 9:59 am said:

It’s so cerebrally intuitive to me and hard to put into words. I love that question. Yes a book might be useful. It probably needs to be a collaboration i would speculate.

Reply ↓
Sameera Daniels on March 14, 2018 10:19 AM at 10:19 am said:

I would have reworded it. It sometimes takes me longer to formulate . But it does formulate.

Reply ↓
Erik on March 14, 2018 10:34 AM at 10:34 am said:

Can you even solve that with the available information? You give the distribution between male/female, but not not for treatment. I mean it is cross-over or parallel? Are both treatment groups the same size? Assuming so sounds sort of reasonable, but the next question would be re the distribution of gender in the two treatment groups.

Reply ↓
- Andrew on March 14, 2018 10:43 AM at 10:43 am said:
  
  Erik:
  
  Good point. It’s sooo hard to avoid ambiguity when writing exam questions.
  
  In this case I was assuming a between-person design (no crossover) and a 50/50 split of men and women in each group.
  
  Reply ↓
  - Michael Lew on March 14, 2018 3:56 PM at 3:56 pm said:
    
    Was this exam question evaluated by any of your colleagues before being launched at your students?
    
    Reply ↓
Daniel Lakeland on March 14, 2018 10:56 AM at 10:56 am said:

I’m not sure I’d be able to sit down and do this calculation either without looking at some references, of course I’m not in the middle of a course and I can count on the fingers of 0 hands the number of times I’ve wanted to perform a hypothesis test like this.

But here’s my question: what would a similar sort of test question look like phrased in the language of Bayesian estimation of effect sizes?

Reply ↓
- Chris Wilson on March 14, 2018 11:58 AM at 11:58 am said:
  
  +1
  
  Reply ↓
- Daniel Lakeland on March 14, 2018 3:16 PM at 3:16 pm said:
  
  Trying this out for size:
  
  In causal inference, it is often important to study varying treatment effects: for example, a treatment could be more effective for men than for women, or for healthy than for unhealthy patients. Suppose we have a prior distribution for an effect size which is normal(1,1), and a known measurement variation of sigma, and an experimental design with N samples. Assume we’ve chosen N such that the posterior 95% credible interval for the mean will exclude zero in 80% of replicated experiments if the mean really is 1. Further suppose that interactions of interest are on the order of half the size of main effects, with a prior of normal(0.5,1).
  
  With what frequency would the posterior credible interval of the interaction between main effect and gender exclude zero?
  
  Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a credible interval that excludes zero?
  
  Of these, what is the posterior distribution of the ratio of estimated effect size to actual effect size if the actual size is 0.5?
  
  How’s that sound?
  
  Reply ↓
Dale Lehman on March 14, 2018 11:12 AM at 11:12 am said:

I like this question a lot and would propose that no change in the question is required. What is required is a change in the testing procedure. There is no reason why this should be given in a time limited (potentially closed-book) individual format. Having students discuss this, with sufficient time to research aspects of the question, and then submit either individual or group answers seems like a great learning experience to (and for) me.

Reply ↓
- Michael Lew on March 14, 2018 3:55 PM at 3:55 pm said:
  
  +1
  
  Reply ↓
- Bob Carpenter on March 15, 2018 3:00 AM at 3:00 am said:
  
  I agree. I was a professor for eight years and taught two classes almost every semester and never gave a single exam, much less a closed-book one. I based this on my experience as a theoretical math undergrad at Michigan State, where not a single one of my professors gave in-class exams—it was all take home and usually 50% was good enough for an A. That made the whole thing much more realistic (other than that in real life 1/2 of 1% is an A—you just need to make a contribution, not do everything).
  
  Thats’ not so unusual. What I did that was very unusual in most of my graduate classes was let students work with whomever they wanted, even if that person wasn’t in the class. The only thing I required was standard academic citations and I told students I’d be looking at the answer, not who did it. I figured that was much more like the real world, where forming a network of friends and colleagues to help answer questions is invaluable. The students didn’t really believe me, but a few of the good ones took me up on it and produced some pretty amazing class projects.
  
  Reply ↓
Georgette Asherman on March 14, 2018 11:24 AM at 11:24 am said:

There are three effects– treatment, gender, and treatment with gender interaction. You say the interaction effect size is half the effect size of the main effect. I assume the 80% power refers to the treatment effect. Are you assuming that the treatment and gender main effects are equal? Or are you assuming that the main effect of gender is some non-specified constant?

Reply ↓
Erik on March 14, 2018 11:39 AM at 11:39 am said:

Scrabbling on some paper and plugging some numbers in R, here is my rough attempt to solve this. I always assume my outcome variable is scaled to have mean difference 1 between treated and control. Then my SE for power 80% would be about 0.43058.

Testing male vs female has half the expected effect and twice the standard error – so when I rescale that to have same expected effect and four times the standard error (same distribution for the test statistic), I get a power of about 21.23% – number of significant studies would just be #studies * power.

Finally, if I look at the distribution of the test statistic| test statistic > critical value I get a mean that is almost 3 times that of the test statistic without the condition. Ratio of test statistics = ratio of effect size, so on average my estimate is about three times as the true effect size.

Is that roughly correct?

Reply ↓
jrc on March 14, 2018 12:11 PM at 12:11 pm said:

After they’ve taken a year of undergraduate statistics, a year of graduate econometrics, and a quarter of applied quantitative methods, I usually ask my second quarter class of Ph.D. students “What is a standard error?” No one has yet to come close.

So even though yeah, it is hard, I think more importantly we are failing at teaching the basics. Your question probably does have too many moving parts for an in-class exam, but the fact that the students didn’t even get an idea of the intuition is proof at how bad we’ve been at teaching basic statistical concepts up to and including in our graduate training. Either that, or the students were too frightened to write down their actual thoughts without having the math directly handy… and the idea that we are training people to give up critical thinking in deference to arithmetic is even more terrifying than the idea that we all just suck at our jobs.

Reply ↓
- mattio on March 14, 2018 12:39 PM at 12:39 pm said:
  
  I find that hard to believe… are you looking for something more than “the standard deviation of the sampling distribution of your parameter or statistic of interest”?
  
  Reply ↓
  - jrc on March 14, 2018 12:54 PM at 12:54 pm said:
    
    That is literally exactly what I was looking for.
    
    Reply ↓
    - mattio on March 14, 2018 1:03 PM at 1:03 pm said:
      
      To be fair I do think the concept of a sampling distribution is deeply unintuitive to most people, but I would except PhD students to all be able to understand this, and the smart undergrads as well. Coming from economics, I do think it’s absurd how statistics / econometrics is introduced to undergrads. It’s basically zero intuition and 100% math / theory. Most undergrads have no idea what they are even deriving, let alone how to apply to new problems.
    - jrc on March 14, 2018 1:28 PM at 1:28 pm said:
      
      Exactly. We teach them to be really good at solving math. It is true in both economic theory and statistics, but with the theory at least they tend to learn the concepts behind what the theory is representing… sure it becomes mindless algebra at some point, but at least the students understand prices and quantities and how they move together and how the math represents trade-offs. But it turns out that after G-d knows how many Stats courses, no one has ever really taught them to think about what a sampling distribution is… who wants to spend 3 weeks getting that idea across when they could cover the math behind 8 weirdly-named statistical tests during that time? For that matter, who wants to spend a whole week on the difference between Beta and BetaHat (and I’m not even asking for the epistemological questions on whether Beta exists)? That idea is absolutely fundamental, and yet we tend to cover it by drawing a hat over a letter and saying it is an estimate of the “real” thing and then moving on to the math of calculating it.
      
      I have hopes for simulation-based lecturing/teaching methods, but a) it seems to be harder to teach, if only because of path-dependence in the pedagogical history of the discipline; and b) the computer literacy requirements go up. Graphical techniques maybe… I mean, maybe we should spend a lot more time with scatter plots in the first year, at the expense of balls and urns and algebra. But I don’t pretend to have all the answers…after all, my part in the process is supposed to be training the use of these methods, not the foundations. So I’ll mostly just throw stones at those who came before me in the quantitative-researcher-assembly-line process (#PersonalResponsibility).
    - mattio on March 14, 2018 1:41 PM at 1:41 pm said:
      
      Haha, lots I agree with there. I would love for those classes to be taught with simulation methods, would be so illuminating for most people. The beta and beta_hat comment is spot on, too. It’s not surprising how stats are taught when you attend an empirical economics seminar, lots of confusion there too. The obsession with statistical significance is infuriating. I think Mostly Harmless by Angrist / Pischke is a great resource for grad students in economics, though. They cut out 95% of the bullshit and talk about deep concepts that most people think they understand but really don’t. That first chapter on regression is, for lack of a better word, epic.
    - Chris on March 14, 2018 2:19 PM at 2:19 pm said:
      
      Look at Phillip Good’s Resampling Methods text for one way to approach simulation instead of algebra.
    - Keith O'Rourke on March 14, 2018 2:28 PM at 2:28 pm said:
      
      Some quick comments from another amateur (though I have taught a few times with simulation).
      
      Even if you do cover conceptual material in class the tests are likely all or mostly on the math and there’s this ethically dilemma of distracting students from what will get them higher marks – just concentrating on the math.
      
      Not sure how simulations can be tested in a sit down classroom test.
      
      I think McElreath has a nice mix of concepts and math in his teaching slides – making student aware of always distinguishing between what is the representation (small world) versus what is being represented (large world) is challenging.
      
      And because it’s Pi day – it always seemed to work well to walk students through using the digits of Pi to generate random digits, Uniform(a,b) random variates and then Normal(a,b) and other random variates using rejection sampling. With that simulation is more likely to be seen as what it is rather than imagined or metaphor-ed to be.
    - Daniel Lakeland on March 14, 2018 3:22 PM at 3:22 pm said:
      
      From the testing perspective, stop focusing on mathematical formulas and focus on computational procedures. assume pnorm,qnorm,rnorm,dnorm and many other similar procedures are fundamental, request students instead of finding a number on a test, to write a command which will find the number.
      
      Instead of saying “calculate the regression coefficients” and expect numerical answers, say “write a Stan program that will sample from an appropriate posterior” and justify your choice of model
      
      “The purpose of computing is insight, not numbers” — Richard Hamming Numerical Methods for Scientists and Engineers (1962, Preface)
    - Thanatos Savehn on March 14, 2018 5:23 PM at 5:23 pm said:
      
      I second McElreath’s Statistical Rethinking book and video lectures. Whether it’s his execution of the globe tossing example of sampling to estimate the earth’s water/land ratio or his forking paths to Bayes, parameters, likelihood and prior he has clearly invested a lot of effort in recasting basic things in a way that’s more intuitively understood.
  - Bob Carpenter on March 15, 2018 3:06 AM at 3:06 am said:
    
    If this had been laid out clearly up front in either of the two math stats books I read (DeGroot and Schervish, Larsen and Marx), it would’ve made my life a lot easier. They tend not to get to estimators until the end of these books after hacking through a ton of seemingly random (pun intended) hypothesis tests.
    
    Reply ↓
Anoneuoid on March 14, 2018 12:50 PM at 12:50 pm said:

Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level.

Just because a study is designed with a certain effect size in mind does not mean there really is that size effect, for simplicity we’ll choose the actual effect size to be zero.

Further suppose that interactions of interest are half the size of main effects.

Half of zero is also zero.

What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women?

The power to find an effect of size zero is either undefined or 5% (due to the 95% CI), not sure.

Suppose 1000 studies of this size are performed. How many of the studies would you expect to report a statistically significant interaction?

5%, or 50 studies.

Of these, what is the expectation of the ratio of estimated effect size to actual effect size?

I at first thought this was NaN but R is telling me that division by zero is Inf now. I’m guessing due to this: https://en.wikipedia.org/wiki/IEEE_754#Exception_handling

Reply ↓
- Andrew on March 14, 2018 12:56 PM at 12:56 pm said:
  
  Anon:
  
  This was not the solution I was thinking of—I was assuming that the hypothetical study did have 80% power—but your answer is not bad. Wins thread for sure.
  
  Reply ↓
  - John Mashey on March 14, 2018 3:32 PM at 3:32 pm said:
    
    Perhaps this applies to ESP experiments, like those done at PEAR.
    
    Reply ↓
  - Anoneuoid on March 14, 2018 5:33 PM at 5:33 pm said:
    
    Thanks, its always best to check for loopholes.
    
    Reply ↓
Phil on March 14, 2018 2:23 PM at 2:23 pm said:

Like Daniel L, I never think this way so this is new to me, but even given that, I find it hard to think about…indeed, I’m not 100% sure I understand the question. I will write out my reasoning here, so others can learn from my ignorance. (Go ahead and mock me if you will).

When you tell me you have an 80% chance of ‘detecting a main effect at a 95% confidence level’, what does that mean? The main effect could be tiny or huge, how can you know your chance of detecting it if you don’t have any idea how big it could be? How many baseball at-bats do I need to watch to have an 80% chance of detecting a ‘hot hand’ effect with 95% confidence? 1000. No, 10000. No, 100000.

I haven’t even gotten to the question yet, and already I am at sea.

Well, what can I do anyway? I draw a vertical line at x=0 and a normal distribution centered somewhere at x > 0, with a tail that reaches down below 0. Let’s center it at x0. This plot is useful in thinking about 88% of statistics problems, so I draw it. I should have it put on a rubber stamp.

Let’s assume I have some expected distribution for the size of the main effect, i.e. what is x0 on my plot. And let’s pretend I know a range I expect it to be in. And let’s assume I know the uncertainty in the effect estimate I would get if I were estimate it from a given sample size (i.e. what is the standard deviation among individuals), that’s what is represented by the normal distribution that I drew. Is it reasonable to assume these things? I have no idea, depends on what we’re talking about. But I don’t see any way to make progress without assumptions like these, so I make them. At least now I can interpret the question: I have designed the study such that, given the expected distribution of x0, there is an 80% chance that the distribution that I drew will have less than 5% of its mass below 0.

Now I split that distribution into two parts, one for men and one for women. The one for men is shifted rightward by z/2, the one for women is shifted downward by z/2, so men are now centered at x0 + z/2 and women are at x0 – z/2; obviously the difference between them is z. What if I set z = x0/2, i.e. the interaction is half the size of the main effect. If my estimate of x0 is two standard errors away from zero 80% of the time, then my estimate of z is only one standard error away from 0 80% of the time. That is damn noisy, I can see that. But now I need to convert “80% chance of being one standard error away from zero” into “Y% chance of being 2 standard errors away from zero.” These distribution-of-distributions things are tricky, at least for me. The probability mass outside 2 standard deviations is about 15x smaller than the mass outside 1 standard deviation, so I only have 1/15 as much chance of ending up out that far. (Is this the right way to think about it? It seems right. This is already taking me longer than I intended to spend, so I’m going to go with it). 1/15 * 80% = 5.333, call it 5%. So if I have an 80% chance of detecting the main effect with 95% probability, I have only a 5% chance of detecting an interaction that is half that big. Out of 1000 studies, I get a ‘statistically significant interaction’ in 50 of them.

Ratio of observed effect size to actual…hmm. I only get that ‘statistically significant’ interaction if my observation is twice as big as the actual interaction, so the expected ratio is 2. Actually that’s not quite right, expected value of a ratio, but it’s something like that.

Is that all nonsense? Probably. Maybe someone can fix it in the comments below, thus teaching us all how to do the problem.

Reply ↓
- Daniel Lakeland on March 14, 2018 3:40 PM at 3:40 pm said:
  
  I love this Phil, for two reasons:
  
  1) you’re willing to go out on a limb and maybe embarrass yourself publicly, that takes some courage.
  
  2) You give us a thought process, and the thought process is far more important than getting the right answer, because the thought process can be debugged, whereas the right / wrong answer is just either right or wrong. Also your thought process has lots of nice simplification ideas, you’re thinking like a physicist here not an arithmetic memorizer.
  
  Reply ↓
  - Keith O'Rourke on March 14, 2018 3:46 PM at 3:46 pm said:
    
    Daniel: I put the program memorizer answer below ;-)
    
    Reply ↓
Ulrich Schimmack on March 14, 2018 3:20 PM at 3:20 pm said:

What was the intended correct answer?

Suppose a study is designed to have 80% power to detect a main effect at a 95% confidence level. Further suppose that interactions of interest are half the size of main effects. What is its power for detecting an interaction, comparing men to women (say) in a study that is half men and half women?

Power is not a linear function of effect size. With ES.interaction = ES.main /2, power is < .80 /2 = .40, but I don't think it is possible to get a precise estimate of power.

Reply ↓
Keith O'Rourke on March 14, 2018 3:43 PM at 3:43 pm said:

OK – I am not going to stoop to using formulas but I think this is it (below).

But there are a lot of implicit assumption being made – continuous outcomes, constant variance, approx normal, ?

power.t.test(n = 20, power=.80)

Two-sample t test power calculation

n = 20
delta = 0.9091306
sd = 1
sig.level = 0.05
power = 0.8
alternative = two.sided

NOTE: n is number in *each* group

Now with n and delta each divided by 2
power.t.test(n = 10, delta = 0.9091306/2)

Two-sample t test power calculation

n = 10
delta = 0.4545653
sd = 1
sig.level = 0.05
power = 0.159386
alternative = two.sided

NOTE: n is number in *each* group

Reply ↓
- Anoneuoid on March 14, 2018 5:18 PM at 5:18 pm said:
  
  Why are you using n/2 for the second case? Your method matches up to my sim if you only divide delta by 2. I get ~27% power for the interaction, but maybe there is a mistake: https://image.ibb.co/hoARHx/powerSim.png
  
  As always, crossing fingers on this showing up properly:
  
  # Simulate study of treatment vs control w/ half male subjects sim = function(n, show_dat = FALSE, show_fit = FALSE){ a = rnorm(n, 0.0, 1) b1 = rnorm(n/2, 0.5, 1) b2 = rnorm(n/2, 1.5, 1)
  dat = data.frame(Group = c(rep("C", n), rep("T", n)), Sex = rep(c(rep("M", n/2), rep("F", n/2)), 2), Val = c(a, b1, b2)) fit = lm(Val ~ Group*Sex, dat) pvals = summary(fit)$coefficients[, 4] if(show_dat) print(dat) if(show_fit) print(summary(fit)) return(pvals) } # Analytical power calculation by effect and sample size exactPwr = function(nvals, delta){ sapply(nvals, function(n) power.t.test(n = n, delta = delta, sd = 1, sig.level = .05, strict = T)$power) } # Run Simulation nvals = seq(4, 100, by = 2) res = sapply(nvals, function(n) rowMeans(replicate(1e4, sim(n)) < 0.05)) # Get Sample size nearest to 80% power n80 = nvals[which.min(abs(res["GroupT",] - 0.8))] # Simulation Results plot(nvals, res[1, ], type = "n", ylim = range(res), xlab = "n", ylab = "% Sig", panel.first = grid()) for(i in 1:nrow(res)){ lines(nvals, res[i,], col = rainbow(4)[i], lwd = 3) } # Add analytical results lines(nvals, exactPwr(nvals, 1), lwd = 2, lty = 2) lines(nvals, exactPwr(nvals, .5), lwd = 2, lty = 2) abline(h = 0.8, v = n80, lty = 2) legend("right", c(rownames(res), "Analytic Match"), col = c(rainbow(4), "Black"), lwd = 3, lty = c(rep(1, 4), 2), bty = "n") res[, nvals == n80] exactPwr(nvals, .5)[nvals == n80]
  
  The same code on pastebin just in case: https://pastebin.com/G54fTf7S
  
  Reply ↓
  - Arno on March 15, 2018 8:10 AM at 8:10 am said:
    
    It’s n/2 for the same reason your b1 an b2 in sim() are using n/2: the interaction is compared within the treatment group, not treatment to non-treatment group.
    
    Btw, you have an error in sim(): your interaction size is also 1 (1.5-0.5), not 0.5. So actually, you should have 0.75 and 1.25 as values for b1 and b2.
    
    When you plot your graph then, you’ll find that the interaction term is not following your “Analytic Match” line. I assume the reason is that Sex is in your model: Val ~ Group*Sex.
    
    If you change your sim() model to this: Val ~ Group + Group:Sex, the lines follow the analytic match. The GroupT line is a bit lagging, I assume because of the Intercept estimate.
    
    Reply ↓
    - Anoneuoid on March 15, 2018 11:15 AM at 11:15 am said:
      
      Btw, you have an error in sim(): your interaction size is also 1 (1.5-0.5), not 0.5. So actually, you should have 0.75 and 1.25 as values for b1 and b2.
      
      Thanks, I actually noticed this one after posting but still didn’t get the same ~16% results for power so kept my mouth shut.
      
      When you plot your graph then, you’ll find that the interaction term is not following your “Analytic Match” line. I assume the reason is that Sex is in your model: Val ~ Group*Sex.
      
      If you change your sim() model to this: Val ~ Group + Group:Sex, the lines follow the analytic match.
      
      This works, of course it is strange to not include the all the main effects in the model so I would say it is the analytic result that is off here.
      
      The GroupT line is a bit lagging, I assume because of the Intercept estimate.
      
      I think removing the intercept is done using this model: lm(Val ~ 0 + Group + Group:Sex, dat) In that case the GroupT power curve is actually now leading the analytic result: https://image.ibb.co/mjUz5H/power_Sim2.png
      
      new code: https://pastebin.com/kQ1LwuHY
    - Keith O'Rourke on March 15, 2018 11:54 AM at 11:54 am said:
      
      OK I should put forward my revision :-(
      
      Now, in my real work, I have a rule to not distribute analyses until a day after they are done and I have checked them – any time I break that rule (including now) I seem to regret it.
      
      power.t.test(n = 20, power=.80,alternative = “one.side”)
      
      Two-sample t test power calculation
      
      n = 20
      delta = 0.8006829
      sd = 1
      sig.level = 0.05
      power = 0.8
      alternative = one.sided
      
      NOTE: n is number in *each* group
      
      power.t.test(power=.80, delta=0.8006829/2,sd=2,alternative = “one.side”)
      
      Two-sample t test power calculation
      
      n = 309.2794
      delta = 0.4003414
      sd = 2
      sig.level = 0.05
      power = 0.8
      alternative = one.sided
      
      NOTE: n is number in *each* group
      
      round(310/20)
      [1] 16
    - Anoneuoid on March 15, 2018 1:27 PM at 1:27 pm said:
      
      Sorry Keith, it made more sense to me earlier. The goal was to check the power of the interaction when main effect had 80% power, so I don’t see why are you checking the ratio of sample sizes required to get the same power? Why is the sd doubling for the interaction now? Also I don’t see why you have changed it to be one-sided. From the original description I would assume two-sided.
      
      Obviously your approach is different from mine since I am starting the main effect of group to 1 then choosing a sample size to get the 80% power for that, but still we should be able to match our results up if the “correct” model is plugged into lm.
    - Keith O'Rourke on March 15, 2018 1:40 PM at 1:40 pm said:
      
      Anoneuoid:
      
      I assumed you read Andrew’s new post today with his answer http://statmodeling.stat.columbia.edu/2018/03/15/need-16-times-sample-size-estimate-interaction-estimate-main-effect/#comment-684994 which prompted the changes.
      
      When I saw you comment I realized something was wrong but sorry I did not have time to read your simulation (or anyone else’s).
- Phil on March 14, 2018 8:31 PM at 8:31 pm said:
  
  I think Keith’s answer is right, although I didn’t trust it until I simulated it myself.
  
  In the code below: mu is the size of the average effect (averaged over men and women); delta is the size of the interaction; sigma is the uncertainty with which the effect size can be estimated. I interpret the problem’s statement about power to mean that if we did a very large number of replicates of the experiment, we would find that our estimated effect size is more than 1.64*sigma away from 0 in 80% of them. (It’s 1.645*sigma because that’s the standard deviation that puts 5% of them below 0).
  
  I tried different values of mu, leaving other parameters fixed, until I got the requisite 80%. When that happens, I get a ‘statistically significant’ interaction term (using the same 1.645*sigma test) in 15.3% of the cases. Why isn’t this the same 15.9% power Keith gets? I dunno.
  
  The calculations above assume that I know which direction the main effect and the interaction term go. For the main effect that makes sense: if I have enough information about what I’m testing to enable me to do the power calculation in the first place, I should know the direction of the main effect. For the interaction it’s less clear, maybe you think that could go either direction for men and women, in which case that should use 1.96 instead of 1.645. In that case, you get 9% power for the interaction, not 15.9%.
  
  mu = 1.76 delta = mu/2 sigma = 1 N = 1000000
  mean_men = mu + delta/2 mean_women = mu - delta/2 mean_men_est = rnorm(n=N, mean = mean_men, sd = sigma) # N replicates of the experiment mean_women_est = rnorm(n=N, mean = mean_women, sd = sigma) # mean_all_est = (mean_men_est + mean_women_est)/2 # Estimate for everyone is mean of men and women sd_all_est = sd(mean_all_est) # How widely distributed are the N estimates? se_above_0_all = mean_all_est/sd_all_est # How far is each estimate from zero, in sd? diff_est = mean_men_est - mean_women_est # Estimate of difference between men and women sd_diff_est = sd(diff_est) # How widely distributed are the N estimates? se_above_0_diff = diff_est/sd_diff_est # How far is each estimate from zero, in sd?
  print(paste('fraction of times mean_all_est/sd_all_est > 1.645:', (sum(se_above_0_all > 1.645))/N )) print(paste('fraction of times diff_est/sd_diff_est > 1.645:', (sum(se_above_0_diff > 1.645))/N ))
  
  Reply ↓
  - Daniel Lakeland on March 14, 2018 11:18 PM at 11:18 pm said:
    
    It’s all a little imprecise because there’s whether or not you assume a known sigma, if so you use normal distributions, if not you have a t-test issue, and then degrees of freedom matter, and the quantile point on the appropriate t distro, but in fact we don’t have a given N we just have a given power, so Keith wound up just assuming an N.
    
    but all that being said, the basic *concept* is correct.
    
    Reply ↓
digithead on March 14, 2018 3:48 PM at 3:48 pm said:

I find value in these types of questions even if they’re too vague as it allows you to see the creativity and problem solving processes of students.

Reply ↓
- Michael Lew on March 14, 2018 4:04 PM at 4:04 pm said:
  
  The value is in sparking learning, not in testing it. Examination questions should not be so difficult through ambiguity that they spark a set of comments like these.
  
  Reply ↓
  - digithead on March 14, 2018 7:53 PM at 7:53 pm said:
    
    They already know how to learn (and regurgitate things) for a test if they’re at the doctoral level. I want to know how Ph.D. students think when the answers aren’t obvious.
    
    And ambiguity drives science and inquiry. It’s that desperate need for concreteness that causes people to think they’ve discovered something novel when it’s just noise from rote statistical practices as Dr. Gelman’s blog ably demonstrates.
    
    Funny aside about ambiguity. The qualifying exam for my stats program back in the 90s stated in one of the problems that all Greek letters represented parameters which caused everyone consternation taking the exam because pi is used in the p.d.f. for a lot of random variables. None of the professors were around for the test that day to clear up the confusion so the department secretary made the executive decision to tell everyone that pi in the problem meant pi not a parameter. Some of my fellow students thought they were being set up because she was a nice old lady that had worked at the school since she was 18 and never went to college. I trusted her since she was the one who typed up the exams along with the answers.
    
    Reply ↓
Anonymous on March 14, 2018 5:50 PM at 5:50 pm said:

This is probably wrong, but I’ll give it a shot. Sorry if this doesn’t make sense.
Because we are looking for interactions, it is as if we had to estimate 4 means instead of 2 with the same n, so the power is halved 0.8/2= 0.4.
Then, under 0.05 type I error, we would only detect statistically significant interactions in 0.4*0.95 = 0.38*1000 = 380 of the times.
And the estimated effect size should be proportional to the change in the standard error of the mean: 1/sqrt(n/4) = 2*1/sqrt(n) = in this case would be overestimated 2x

Reply ↓
Dave Meyer on March 14, 2018 9:57 PM at 9:57 pm said:

Are you absolutely certain that YOU know all of the correct answers?!?

If so, what makes you so sure???

Reply ↓
Thanatos Savehn on March 15, 2018 1:04 AM at 1:04 am said:

Looking at this through the lens by which I, and maybe others, learned to evaluate scientific revolutions is I suspect wrong-headed. Paradigm shifts back in the day were said to be of the “you’re thinking about it the wrong way” variety. Yet the current problem appears to be less like “Mercury’s orbit can be better explained thusly” than “Ummm, sorry but that telescope you’ve been using to track Mercury doesn’t actually work the way you think it works.” And this, I think, has profound consequences for what must follow.

If so, a theory of “what is entailed by the realization that we’ve been looking at things through the wrong end of (or wrong sort of) the telescope” may be in order.

Reply ↓
David on March 15, 2018 9:58 AM at 9:58 am said:

I fully support the post’s conclusion!

TBH I could not not follow the answer in the other post at all. I know, my stats knowledge is only “basic” and “applied”, but I doubt that many people in my field (a) are aware of the immense size of samples needed; (b) did a power analysis for their interaction effect; (c) would be able to do it.

And this beside the fact, that I’ve read hundreds of studies that reported and even focused on interaction effects.

Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Classical hypothesis testing is really really hard

48 thoughts on “Classical hypothesis testing is really really hard”

Leave a Reply Cancel reply