Rather than being about sampling, the “true p-value” more often refers to “the p-value if the person actually tested a null model they thought could be correct”.

I expect that nearly everyone here is in agreement that decision making should not be based on p-value thresholds, but this argument about an observed p-value being “significantly” different from 0.05 seems like a category error.

Assume effect A is equal to its sample value, or close to it, and assume effect B is equal to 0. The difference in effect sizes is then A-B = A-0, which is largish… and so “A is much better than B”.

But if instead you tested the idea that A-B = 0, you might easily get p = 0.14 or 0.23 or something; basically, there is no p-value-based evidence that A-B is different from 0.

Hence the difference between significant (A) and not-significant (B) is not itself statistically significant (A-B = 0 has p = 0.2 or whatever).
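Here is a minimal simulation sketch of that pattern; the effect sizes, sample size, and noise level are my own illustrative assumptions, not numbers from the original discussion:

```python
# Sketch: treatment A looks "significant", treatment B doesn't,
# yet the direct test of A - B = 0 is nowhere near significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 25
a = rng.normal(0.5, 1.0, n)   # group A: illustrative true effect 0.5
b = rng.normal(0.3, 1.0, n)   # group B: illustrative true effect 0.3

p_a = stats.ttest_1samp(a, 0).pvalue     # often lands below 0.05
p_b = stats.ttest_1samp(b, 0).pvalue     # often lands above 0.05
p_diff = stats.ttest_ind(a, b).pvalue    # typically far from "significant"

print(f"A vs 0: p = {p_a:.3f}")
print(f"B vs 0: p = {p_b:.3f}")
print(f"A vs B: p = {p_diff:.3f}")
```

Run it a few times with different seeds and you will regularly see one group cross the threshold, the other not, and the comparison between them tell you essentially nothing.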

Deciding to do stuff based on having gotten certain p values, and particularly based on having gotten *different sides of the threshold for two different treatments*, is not a good way to decide what is or is not true/good/helpful/whatever.

http://fooledbyrandomness.com/pvalues

Tom Passin’s argument seems to be that if the null is true, the p-value of the p-value (as if the first-order p-value is something real, to be estimated) will never have statistical significance. I don’t really know what this means (what is the “real” p-value, even given the null?), but he seems to think it’s worth noting and is (yet another) critique of p-values.

But the same argument criticises 1e-1000 just as much as 0.05, so I’m left questioning why I should find this mathematical argument at all damning.

Well, yes, *if* you can support that informed prior, and that utility function – they need to be more than just personal opinion. In this case, the one that started this whole conversation, I don’t see anything like that as being supported by what was reported.

“if we already have the sample average as a best estimate, why are these fools doing Bayes and getting some other result?”

Why, precisely to be able to incorporate some other knowledge, preferably actual data. If we had actual data, though, that was of the same kind as the experiment, we could just combine them without that much complication. The complication comes in when you want to bring in other information that isn’t strictly of the same kind: e.g., a prior distribution when all you have from the experiment is one set of points.
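As a minimal sketch of what “bringing in other information” can look like, here is a normal-normal conjugate update; the prior, the data, and the noise level are all my own illustrative assumptions, not anything from the example under discussion:

```python
# Sketch: an informative prior pulls the estimate away from the raw sample mean.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 5.0, size=12)             # one small, noisy experiment
ybar = y.mean()
se2 = y.var(ddof=1) / len(y)                  # squared standard error of the mean

mu0, tau2 = 0.0, 1.0                          # illustrative prior: effect near 0, sd 1
w = (1 / tau2) / (1 / tau2 + 1 / se2)         # weight on the prior (by precision)
post_mean = w * mu0 + (1 - w) * ybar          # conjugate normal-normal posterior mean

print(f"sample mean    = {ybar:.2f}")
print(f"posterior mean = {post_mean:.2f} (pulled toward the prior mean {mu0})")
```

The posterior mean differs from the sample average exactly because it folds in that other knowledge; how far it moves depends on how precise the prior is relative to the data.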

Anyway, my comment was about how one might report a very uncertain result, not about technicalities about better estimates. Let’s not lose sight of the real thread here.

To be fair, the Bayesian interpretation is “p = 0.24: Modest improvements if you thought the idea was probable a priori, mostly noise if you thought it wasn’t”… with an important exception when you have informative priors about nuisance parameters.

I actually think that your wording of the statement seems ok, but the statement about the sample average being the best available estimate of the true overall average is a common mistake that then confuses people: “if we already have the sample average as a best estimate, why are these fools doing Bayes and getting some other result?” The fact is, the biased estimate from Bayesian decision theory with an informed prior and a real-world utility function is overall better, sometimes MUCH better.

Andrew says from time to time that the difference between a significant and a non-significant result is not in itself significant. Maybe you (some random reader of this blog, I mean) haven’t thought through the implications of this. It’s possible to show how a p-value threshold is a poor way to evaluate a data set by thinking about the p-value as a statistic itself. Under the null, the p-value has a uniform distribution, and so it has a very large relative variance. The standard deviation is in the vicinity of 0.3 (where the p-value, of course, is in [0,1]). Your experimental p-value of 0.05 is a statistic. What is its variation? Hmm, 0.05 +/- 0.3! Well, we can’t really go below zero, but never mind.

So any claim that a result has a p-value less than, say, 0.05, is subject to the fact that this result (reaching that 0.05 value) cannot have much statistical significance (judging by the p-value of the p-value, to hoist the thing with its own petard). Maybe the “true” p-value is something else.

We could reduce the s.d. of the p-value from 0.3 down to 0.05 by running (0.3/0.05)^2 = 36 repetitions of the experiment. And even then, the (statistical) significance of the p-value is iffy, being 0.05 +/- 0.05.
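A quick sketch of that uniform-distribution point (my own illustration; the “0.3” is just 1/sqrt(12) ≈ 0.289, the standard deviation of a uniform [0,1] variable):

```python
# Sketch: when the null is true, p-values are uniform on [0, 1],
# so a single p-value has standard deviation 1/sqrt(12) ≈ 0.289.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n = 10_000, 30
pvals = np.array([
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0).pvalue   # the null really is true
    for _ in range(n_sims)
])

print(f"mean of null p-values: {pvals.mean():.3f}  (theory: 0.5)")
print(f"s.d. of null p-values: {pvals.std():.3f}  (theory: 1/sqrt(12) = 0.289)")
```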

This all doesn’t make me very interested in paying much attention to a p-value threshold.

Hairsplitting, guys! I really meant “unbiased”, and these differences wouldn’t change my suggested wording at all. Would they?

I agree. I like to separate statistical bias from bias with respect to what might be the actual population average, if such a thing exists. Methods and estimators do not necessarily provide anything realistic, as this requires thought about whether the estimate makes sense in light of previous information. I think one of the most unfortunate things I observe among quantitative psychologists is thinking that the math and/or simulations have a one-to-one mapping to reality.

In a practical sense, the best estimate comes from specifying a real-world prior and a real-world loss function, and doing Bayesian decision theory. But in a mathematical sense, the James-Stein estimator shows that even using basically no information you can still construct an estimator that is technically better. It shouldn’t be surprising that it’s not a lot better, since you’re using basically no information, but it’s still mathematically better, and so the value is in showing that a widely held assumption is in fact a mistake.
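A minimal sketch of the James-Stein point; the dimension, noise level, and “true” means are my own illustrative choices:

```python
# Sketch: for k >= 3 normal means with unit sampling variance, the (positive-part)
# James-Stein estimator has lower expected total squared error than the raw observations.
import numpy as np

rng = np.random.default_rng(3)
k, n_sims = 10, 5_000
theta = rng.normal(0.0, 2.0, k)          # arbitrary "true" means (unknown in practice)

mse_raw = mse_js = 0.0
for _ in range(n_sims):
    x = theta + rng.normal(0.0, 1.0, k)             # one noisy observation per mean
    shrink = max(0.0, 1 - (k - 2) / np.sum(x**2))   # positive-part James-Stein factor
    js = shrink * x                                 # shrink everything toward 0
    mse_raw += np.sum((x - theta) ** 2)
    mse_js += np.sum((js - theta) ** 2)

print(f"avg total squared error, raw sample values: {mse_raw / n_sims:.2f}")
print(f"avg total squared error, James-Stein:       {mse_js / n_sims:.2f}")
```

The improvement here is modest, which is the point: with no real prior information, the dominance is mathematical rather than practically dramatic.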

No. If the problems are unrelated enough, their parameters will be far enough apart that the Bayesian or so-called James-Stein estimator will do essentially no pooling. The argument you make is a common one in statistics (or, at least, it used to be commonly said thirty years ago) but it’s wrong, for roughly the same reason that it’s wrong to think that the Second Law of Thermodynamics is violated by that little demon who puts the fast molecules in one side of the gate and the slow molecules in the other. If you try to build the demon, you’ll find that he too is subject to the Second Law of Thermodynamics. Similarly, if you try to apply hierarchical modeling using unrelated problems, you’ll find that if you have a flat prior, this will work with probability zero; the “unrelated problems” strategy only works when the parameters are near to each other, which suggests that they are actually related, or else represent prior information.

For example, suppose you’re estimating a parameter that happens to be near the value 5 dollars, and you decide, just for laffs, to estimate this along with estimating the weight of a cat (which happens to weigh 5 pounds) and also a 5-pound steak. If you do this, your inferences will be partially pooled to be near 5 . . . but where did this come from? When evaluating the statistical properties of a method (and that’s a key part of the James-Stein argument, as you’re dealing with expected loss, averaging over some frequency distribution), then you need to average. If you’re always partially pooling your estimates by throwing in external parameters that are ostensibly unrelated but often happen to be very close to your parameter of interest, then this is an assumption that needs to go into the distribution you’re using to define your frequency properties. And if your external parameters are *not* often very close to your parameter of interest, then your James-Stein estimate won’t do any real partial pooling anyway.
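For a rough sketch of the “essentially no pooling” point under a simple normal hierarchical model (the numbers and the empirical-Bayes shrinkage form are my own illustrative choices, not anything from the comment above):

```python
# Sketch: empirical-Bayes shrinkage toward the grand mean. When the "unrelated"
# parameters sit far apart, the estimated between-parameter variance is huge
# and essentially no pooling happens.
import numpy as np

def pooling_weight(x, sigma2=1.0):
    """Fraction of each estimate pulled toward the grand mean under a simple
    normal empirical-Bayes model with known sampling variance sigma2."""
    tau2_hat = max(0.0, np.var(x, ddof=1) - sigma2)   # estimated between-parameter variance
    return sigma2 / (sigma2 + tau2_hat)

related   = np.array([4.8, 5.1, 5.3])      # estimates that happen to sit near each other
unrelated = np.array([5.0, 230.0, -41.0])  # genuinely unrelated quantities

print(f"pooling weight, nearby parameters:    {pooling_weight(related):.2f}")    # heavy pooling
print(f"pooling weight, far-apart parameters: {pooling_weight(unrelated):.4f}")  # essentially zero
```

If the external parameters happen to be close, the pooling is real but reflects that closeness (prior information in disguise); if they are far apart, the method pools essentially nothing, so there is no free lunch either way.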

I’d write a paper or give a talk about this, but it doesn’t seem like a problem that people care about anymore, perhaps because of the general understanding that multilevel models work because they make use of real information; they’re not just a mathematical trick.

Apropos of nothing, I wrote a blog post which poses a question to readers; I’d be interested in your feedback.

That said, the CI is far from perfect too. I think the CI should not be called an uncertainty interval, because the only uncertainty it captures is the conditional uncertainty about the parameter given certainty about the data-generation model (DGM) from which the CI is computed. Any uncertainties about that model (and there are usually plenty in real examples in the health and social sciences) are not captured by the CI, or by the posterior intervals (PI) computed from the same DGM – so both CI and PI are really ‘overconfidence intervals’. I find it easier to address this problem using P-values than interval estimates, simply by recognizing that any observed P-value may stem from a model violation to which P is sensitive (e.g., nonrandom selection); that is why small values do not require and thus cannot imply violation of the null, and large values do not require and thus cannot imply truth of the null.

There is no such thing as a “best estimate” in statistical theory. Under certain regularity conditions, there are such things as “best estimators”. But, no, the sample mean is nothing like a “best” estimate, not in any mathematical sense.

That’s not really true: it’s not necessarily a bad estimate of the mean, but it’s not an admissible estimate when the number of means being estimated is bigger than 2 ;-) The purpose of the James-Stein estimator was really to show that the sample average isn’t the “best” estimate.

The latter, however, is usually the main result of small-sample papers and generally is what the paper is intended to sell, and it actually causes people to waste time on effects that suffer from Type M (magnitude) or even Type S (sign) errors, both of which hold back actual progress.

There is this idea from over 20 years ago (http://journals.sagepub.com/doi/abs/10.1111/j.1467-9280.1994.tb00281.x), or, more generally, to assess the data’s compatibility with a range of parameter values rather than just the zero effect.
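A minimal sketch of what “compatibility with a range of parameter values” could look like in practice; the data here are made up for illustration, and the grid of candidate values is arbitrary:

```python
# Sketch: a "p-value (compatibility) curve" -- test H0: mu = mu0 over a grid
# of candidate values instead of only the single point mu0 = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(0.4, 1.0, size=20)        # one small, noisy experiment

for mu0 in np.linspace(-0.5, 1.5, 9):
    p = stats.ttest_1samp(y, popmean=mu0).pvalue
    print(f"H0: mu = {mu0:+.2f}   p = {p:.3f}")
```

Reading across the grid shows which parameter values the data sit comfortably with and which they contradict, rather than reducing everything to the single null of zero effect.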

In the larger scientific context, a single paper should just be pointing to a later meta-analysis where replication of results over studies can be critically assessed and (given adequate replication) the effects jointly assessed.

That leaves something like this: “There is too much statistical uncertainty to be sure, but for what it’s worth, the data for this experiment had a slight positive [or whatever] average. With more data, it might easily turn out to be negative [or whatever] instead.”

That sounds pretty weak, doesn’t it? But it does reflect the state of the data, which was also pretty weak.
