I’m currently trying to make sense of the Army’s preliminary figures on their Comprehensive Soldier Fitness programme, which I found here. That report (see for example table 4 on p.15) has only a few very small “effect sizes” with p<.01 on some of the subscales and nothing significant on the rest. It looks to me like it's not much different from random noise, which I suspect might be caused by the large N (and there's more to come, because N for the whole programme will be in excess of 1 million). While googling on the subject of large N, I came across this entry in your blog. My question is, does that imply that when one has a large N – and, thus, presumably, large statistical power – one should systematically reduce alpha as well? Is there any literature on this? Does one always/sometimes/never need to take Lindley’s “paradox” into account?
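To make the large-N point concrete, here is a quick sketch (the effect size and sample size are invented for illustration, not taken from the report) of how a negligible standardized difference sails past any conventional alpha:

```python
# Illustration only: with huge N, a tiny standardized effect is
# "significant" at any conventional alpha. d = 0.02 is a hypothetical value.
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.02            # hypothetical standardized mean difference (tiny)
n_per_group = 100_000

# z statistic for a two-sample comparison with unit variances:
# z = d / sqrt(1/n1 + 1/n2)
z = d / math.sqrt(2 / n_per_group)
p = two_sided_p_from_z(z)

print(f"z = {z:.2f}, p = {p:.1e}")   # tiny effect, yet p well below .01
```

The practical question is then not "is p small?" but "is an effect this small worth anything?", which is exactly where Lindley-type reasoning bites.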
And a supplementary question: can it ever be legitimate to quote a result as significant for one DV (“Social fitness” in table 4) when it is simply (cf. p.10) an amalgam of the four other DVs listed immediately underneath it, of which one (“Friendliness”) has a significance of <.001 and the three others are NS? That just looks like double-dipping to me. They could add any number of superordinate meta-fitnesses and be improving on 100 dimensions.
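Here is a hypothetical simulation of that situation (all numbers invented, not from the report): four subscales averaged into a composite, where only one subscale carries a real effect and the other three are pure noise. On a large sample the composite still comes out "significant":

```python
# Hypothetical simulation: a "superordinate" composite of four subscales
# inherits significance from a single real subscale effect at large N.
import math, random

random.seed(1)
n = 50_000
effects = [0.10, 0.0, 0.0, 0.0]   # only one subscale truly differs

def simulate_group(shifts):
    # each subject's composite score = mean of four unit-variance subscales
    return [sum(random.gauss(mu, 1) for mu in shifts) / 4 for _ in range(n)]

treat = simulate_group(effects)
ctrl = simulate_group([0.0] * 4)

diff = sum(treat) / n - sum(ctrl) / n
# composite variance per subject = 4 * 1 / 16 = 0.25
se = math.sqrt(0.25 / n + 0.25 / n)
z = diff / se
p = math.erfc(abs(z) / math.sqrt(2))
print(f"composite diff = {diff:.3f}, z = {z:.1f}, p = {p:.1e}")
```

The composite difference itself is tiny (the one real effect gets diluted by a factor of four), yet the p-value looks impressive, which is the "meta-fitness" worry in miniature.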
PS: If you find this interesting, I wonder if you might want to make a blog post out of it. CSF is a $140 million programme that has been controversial for all sorts of reasons. There’s a whole bunch of other stuff about this process, such as their use of MANOVA at T1 and “ANOVA with blocking” at T2, that makes me think they are on a fishing expedition for cherries to pick. For example, the means in some of the tables are “estimated marginal means” (MANOVA output), the SD values are in fact SEMs, and I have no idea why they are expressing effect sizes as partial eta squared when they only have one independent variable. But I’m a complete newbie to stats, so I’m probably missing a lot of stuff.
My reply: I followed the link. That report is almost a parody of military bureaucracy! But the issues you raise are important. The people doing this research have real problems for which there are no easy solutions. In short: none of the effects is zero and there’s gotta be a lot of variation across people and across subgroups of people. Also, there are multiple outcomes. It’s a classic multiple comparisons situation, but the null hypothesis of zero effects (which is standard in multiple-comparisons analyses) is clearly inappropriate. Multilevel modeling seems like a good idea but it requires real modeling and real thought, not simply plugging the data into an 8-schools program.
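For what it's worth, the partial-pooling idea behind the 8-schools setup can be sketched in a few lines. The subgroup estimates and standard errors below are invented, and the between-group variance is a crude method-of-moments guess rather than a full Bayesian fit:

```python
# Sketch of partial pooling (the "8 schools" idea): noisy subgroup
# estimates are shrunk toward the grand mean in proportion to their noise.
# All numbers here are invented for illustration.
import statistics

# hypothetical per-subgroup effect estimates and their standard errors
estimates = [0.40, -0.10, 0.15, 0.02, 0.25, -0.20, 0.08, 0.30]
ses       = [0.15,  0.10, 0.12, 0.08, 0.20,  0.15, 0.10, 0.18]

grand_mean = statistics.mean(estimates)
# crude method-of-moments between-group variance, floored at zero
tau2 = max(statistics.pvariance(estimates)
           - statistics.mean(s ** 2 for s in ses), 0.0)

# precision-weighted compromise between each raw estimate and the grand mean
shrunk = [
    (y / se**2 + grand_mean / tau2) / (1 / se**2 + 1 / tau2)
    if tau2 > 0 else grand_mean
    for y, se in zip(estimates, ses)
]
for y, s in zip(estimates, shrunk):
    print(f"raw {y:+.2f} -> shrunk {s:+.2f}")
```

The point of the sketch is the direction of the answer, not the numbers: the noisiest estimates move the most, so extreme subgroup results get discounted instead of cherry-picked. Doing this seriously means modeling the actual structure of the data, not running it through a canned routine.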
We have seen the same issues arising in education research, another area with multiple outcomes, treatment effects varying across subgroups, and small aggregate effects.