Stephen Senn writes:

What the practicing scientist wants to know is what is a good test in practice.

I agree with Stephen Senn on most things—even where it seems we disagree, I think we agree on the fundamentals—but in this case I think you have to be careful about giving the practicing scientist what he or she wants! It’s my impression that the practicing scientist wants certainty: thus, the result of an experiment is either “statistically significant” (the result is true) or “not significant” (the result is false). Or perhaps “marginally significant” which is really great because then if you want the result to be true you can call it evidence in favor, and if you want the result to be false you can call it evidence against. This desire for certainty on the part of statistical consumers has historically been aided and abetted by the statistics profession.

Yes, terminology has its limits too. Expertise is expected to provide certainty; an area of expertise has to continue to market its products.

Andrew: Did you notice this reply from Stephen?

“February 25, 2017 Stephen Senn

Yes. I agree that one has to be wary about giving scientists what they want out of statistics. I frequently stress that biostatisticians should give physicians what they need, and that this is often not the same as what they want.”

My sense (and yours and Stephen’s?) is that they need to fully grasp the real uncertainties.

Well, what they want from stats is Pr(My_hypothesis|data). They’ve been getting Pr(data|Some_other_hypothesis), which is neither what they want nor what they need. This is all before the certainty issue comes into play. A lot of it is also that researchers are not able/willing to define “My_hypothesis” well enough to be useful.

“Well, what they want from stats is Pr(My_hypothesis|data).”

This is claimed very often, but apart from the fact that I’m not sure anyone has ever come up with empirical evidence for this claim, I’d think that most of them don’t have enough of a grasp of what Pr(My_hypothesis|data) actually means, or of how much it may depend on decisions that they have to make but have no idea how to make. So even if they say that this is what they want, they probably don’t understand what they’re asking for; and if they did understand it, they might not want it either.

Christian:

I think people want different things at different times. Sometimes they definitely seem to want Pr(my hypothesis | data); other times it seems that they want some measure of the plausibility of their hypothesis, or of the alternative, or some measure of the evidence in favor or against their hypothesis or the alternative. The trouble is that the p-value isn’t any of these things! And, perhaps more to the point, none of these things is possible without really strong assumptions, much more than are being input into these simple calculations.

The p-value is *some measure* of the evidence for or against the hypothesis used to calculate it, isn’t it? At least in a relative sense: a lower/higher p-value corresponds to more/less evidence against the hypothesis.

Yes, but usually the hypothesis used to calculate the p-value is ridiculously unconnected to anything anyone believes: “people’s response to taking steroids is a normal random number generator with mean 0 and unknown standard deviation”?

It really sounds silly when you put it like that.

“Yes, but usually the hypothesis used to calculate the p-value is ridiculously unconnected to anything anyone believes”

Yes: it is directly connected to comparing a positive effect (believable) to a negative effect (also believable).

Your strawman argument about p-values is on the same level as saying “MCMC is a method for taking in data and spitting out random numbers. No more no less”.

A:

I disagree that Daniel is arguing with a strawman. Regarding your particular point: MCMC has nothing to do with “data.” MCMC is a method for approximately sampling from a specified target distribution. And, indeed, the output from MCMC is of interest only to the extent that the target distribution is relevant to some question somebody cares about. Just as a p-value is of interest only if the random number generator in question has some relevance to the applied problem under study.
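To make that definition concrete (a minimal sketch, not anyone’s production sampler; the step size and target here are arbitrary choices of mine): a random-walk Metropolis algorithm needs only a target density. No data appears anywhere in it.

```python
import math
import random

def metropolis(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: approximately sample from the distribution
    whose (unnormalized) log density is log_target."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        prop = x + rng.gauss(0, step)
        # Accept with probability min(1, target(prop) / target(x)).
        if math.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        samples.append(x)
    return samples

# Target: a standard normal. Note that nothing resembling "data" is input;
# only a target distribution is specified.
draws = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(draws) / len(draws)
```

Whether those draws are worth anything depends entirely on whether the target distribution matters for some real question, which is the point above.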

Andrew:

Of course, I don’t believe that’s in any way an accurate representation of MCMC, just as I don’t believe that “p-values are a method for testing if a dataset was generated by a particular RNG, *no more no less*”, as Daniel is so fond of saying, is an accurate representation of p-values.

I completely agree with your statement: “Just as a p-value is of interest only if the random number generator in question has some relevance to the applied problem under study.” Your statement allows that p-values can be relevant to the applied problem at hand. Daniel consistently argues that they never are.

a reader: No, I think you misunderstand my argument. My argument is that p values have a specific usage: they detect a deviation from what usually comes out of a specific distribution. That is *the only* thing they do; that’s their mathematical purpose. So when that is what you want to do, they’re totally appropriate.

I’ve argued that p values are appropriate when there is a fairly large observed distribution of things, and you are interested in finding out whether a new sample deviates from that distribution. They’re an excellent “trigger” for further investigation.

http://models.street-artists.org/2015/07/29/understanding-when-to-use-p-values/

In fact, p values are fully and totally appropriate for calibrating RNG software. But they’re also totally appropriate for detecting when “something that happened” isn’t like “most of the stuff that’s happened in the past”.
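For instance (a toy sketch with made-up numbers throughout): given a large baseline of past measurements, the p-value for a new observation is just its tail rank in that baseline, with no distributional family assumed.

```python
import random

def empirical_p_value(baseline, new_value):
    """One-sided empirical p-value: the fraction of the baseline at least
    as extreme (here: as large) as the new observation, with a +1
    correction so the p-value is never exactly zero."""
    n_ge = sum(1 for b in baseline if b >= new_value)
    return (n_ge + 1) / (len(baseline) + 1)

rng = random.Random(42)
# A well-calibrated baseline: 1000 past measurements (hypothetical).
baseline = [rng.gauss(10.0, 2.0) for _ in range(1000)]

p_typical = empirical_p_value(baseline, 10.5)  # unremarkable observation
p_weird = empirical_p_value(baseline, 19.0)    # deep in the tail: a trigger
```

A small p here rejects only “this new observation looks like most of the stuff that’s happened in the past,” which is exactly the limited usage being described.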

However, only a fairly small subset of what p values are used for is this process. The more common usage is to *assume a distributional family out of thin air* and then try to detect whether there’s a difference in the parameter between two samples.

The correct inference after a small p value is “either these are not *both* samples from the distributional family I assumed, or they are from that family and the parameters are different”.

The usual conclusion drawn is “they are both from the normal distribution and the parameters are different, so I’ve discovered a fact about the world!”

The correct follow-on to this is: “I can be almost sure ahead of time that these are not exact samples from the distributional family I assumed,” ergo I know nothing more than I started with.

Only when you have a well calibrated observed distribution to compare to can you then conclude “these are (probably) not samples from the specific previously observed distribution” and then correctly learn something (ie. something different is going on in my sample compared to what’s observed in the past).

p values are extremely sensitive to the behavior of the distribution in the tails… and so you simply can’t expect real world data to conform to whatever your distribution assumption was until you have a well calibrated baseline distribution… I’d say 500 to 1000 samples is probably pretty good. So, if you’re trying to detect forest fires and you have 20,000 images of non-burning forest from satellites, you’ll do a good job of detecting when a new small sample is weird and might be a forest fire.

a reader, I’m curious to know what inaccuracies you see in the claim, “MCMC is a method for approximately sampling from a specified target distribution”. Is it just too terse, or is there something else?

Corey:

Sorry, maybe I wasn’t clear. Andrew’s definition of MCMC is, of course, correct and not misleading in any way.

I was merely stating that *my* strawman definition of MCMC (“takes in data, gives out random numbers”), which is technically true of how MCMC is used in Bayesian applications, is extremely unhelpful in explaining the role of MCMC methods, just as saying “a p-value is a test of whether your data comes from a specific RNG, nothing more nothing less” is unhelpful in explaining what a p-value is.

Got it. Thanks for the clarification.

Daniel:

“My argument is that p values have a specific usage, they detect a deviation from what is usual to come out of a specific distribution. That is *the only* thing they do.”

The main thrust of my argument is that this is false. Basic likelihood ratio theory states that testing H1: B less than 0 vs H2: B greater than 0 is exactly equivalent to an NHST of H0: B equals 0. So you are not testing a specific distribution that no one believes, but rather comparing a set of hypotheses that people do believe are both possible. You do so using a test that is operationally equivalent to NHST, but you are still actually comparing real hypotheses.
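A minimal numerical sketch of that operational equivalence (using a z-test simplification and a made-up sample, so the numbers are illustrative only): when the observed effect is positive, the directional p-value is exactly half the two-sided point-null p-value, so the two tests rank and reject in the same way.

```python
import math

def z_test(sample, mu0=0.0):
    """Two-sided and one-sided p-values for a z-test of the sample mean,
    treating the normal approximation as adequate (a simplification; a
    real t-test would use the t distribution)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = (mean - mu0) / (sd / math.sqrt(n))
    phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))  # normal CDF
    p_two_sided = 2 * (1 - phi(abs(z)))  # NHST of H0: B = 0
    p_directional = 1 - phi(z)           # test of B <= 0 vs B > 0
    return p_two_sided, p_directional

# A hypothetical sample with a positive observed effect.
sample = [0.4, -0.2, 0.9, 0.1, 0.5, -0.1, 0.7, 0.3, 0.2, 0.6]
p2, p1 = z_test(sample)
# With a positive observed effect, p1 == p2 / 2: the directional comparison
# and the point-null NHST are operationally equivalent.
```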

Now, there are many NHST that do not have a comparison of plausible hypotheses as a possible interpretation: for example, if you are testing the normality of the data, this is not equivalent to comparing any pair of plausible hypotheses that I am aware of.

“p values are extremely sensitive to the behavior of the distribution in the tails… and so you simply can’t expect real world data to conform to whatever your distribution assumption was until you have a well calibrated baseline distribution… “

I take it you mean the behavior of the tails of your test statistic, rather than the tails of your data? For example, various CLTs give us some assurance that the test statistic should be relatively close to a normal distribution, even if the data are mildly to moderately non-normal, depending on sample size.

Moreover, this same argument extends easily to almost every application of statistics, not just p-values. For example, multi-level models typically make assumptions about latent effects, and it’s really, really difficult to test those assumptions! I’ve read plenty of very misleading articles about applied Bayesian analysis regarding this issue (“we examined the estimated latent effects, and their estimates appeared only mildly skewed, therefore non-normality was not an issue”).

I’m not trying to say anything like p-values are of higher utility than Bayesian analysis: I believe quite the opposite! But I think it’s very misleading to draw hard lines between methods that are not correct.

We’ve already discussed this in lots of places… anyway, you don’t need to believe in the “real truth” of the H0 or the parametric model at all to find it informative that the data show no evidence against the H0 (or that they do, in the specific direction specified by the test statistic).

Daniel: Nice explanation (at 1:09 pm)

a reader: a mathematically accurate statement of a hypothesis test such as the t test is:

if S1 is a sample from a normal random number generator with unknown m1, s1, and S2 is a sample from a normal random number generator with unknown m2, s2, then the frequency with which the t statistic comparing S1 and S2 would exceed the observed value is p

the CLT tells you:

if S1 is a sample of size n from a random number generator with well-defined m, s (as in, mathematically, they exist), then sqrt(n)·(mean(S1) – m)/s converges in frequency distribution to normal(0,1) as n goes to infinity

Notice the part right after “if” requires *a random number generator*

It’s trivial that I can write a computer program, or write out a protocol for real-world data collection, where these conditions are violated severely: any trending process, any sample filtered through a selection process, any adversarial process where someone chooses the numbers to make you get what they want you to get, any process with heterogeneous data-collection problems, any process where people transcribe things incorrectly…

The CLT relies on the properties of *random* sequences.
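A quick simulation (all sizes and counts here are my own arbitrary choices) shows both sides of this: when the data really do come from an RNG, the nominal 5% level roughly holds; feed the same test a trending process, such as a random walk whose increments have mean zero, and the level is wildly off.

```python
import math
import random

def p_value_mean_zero(sample):
    """Two-sided p-value for 'true mean is 0', using the normal
    approximation that the CLT justifies for iid samples."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = mean / (sd / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(0)
n, reps = 30, 2000

# (a) Data really from an RNG: p-values are close to uniform, so roughly
# 5% of them fall below 0.05.
rate_iid = sum(
    p_value_mean_zero([rng.gauss(0, 1) for _ in range(n)]) < 0.05
    for _ in range(reps)
) / reps

# (b) A trending process: a random walk has mean-zero increments, but the
# observations are dependent, so the nominal 5% level fails badly.
def random_walk(n):
    x, out = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        out.append(x)
    return out

rate_walk = sum(
    p_value_mean_zero(random_walk(n)) < 0.05 for _ in range(reps)
) / reps
```

The rejection rate in case (b) is far above 5%, even though nothing about the increments’ mean differs from zero; the “random sample” part of the if-clause is doing all the work.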

When I say “a random number generator” I don’t mean just computer software, but I do mean that we’d better be able to believe in at least an approximately stable distribution and a population that really exists, and the sequences of numbers that come out of the data collection had better be able to pass at least basic tests of randomness, before you can rely on a CLT or anything like that.

And in a 12- or 22-sample medical clinical trial with who knows what protocols involved… you have absolutely no assurance that the CLT means anything, any more than you have assurance that if you call my function f() 7 times and it gives you [1 2 3 4 5 6 7], future calls of 7 trials will also have mean 4 ± 2.16.

> I’ve argued when p values are appropriate is when there is a fairly large observed distribution of things, and you are interested in finding out whether a new sample deviates from that distribution.

They can also be useful when there is an assumed distribution of things, and you are interested in finding out whether a sample deviates from that hypothetical distribution.

Carlos, yes, when you have a realistic reason to believe that something might actually be a sample from that distribution. There are lots of good uses here. For example, suppose you’ve got a system for choosing genes to target for a drug based on their coding sequence, and you’re concerned that it might be no different from just choosing genes uniformly at random. You can choose them uniformly at random and find out the distribution of your summary statistic. If your actual selection is outside the high-probability range of that uniform random selection, you can reject the idea that your method of choosing genes is no different from a uniform random selector.

So long as you don’t forget that rejection is a rejection of the statement “S is a random sample from the distribution D(0)”, and does not mean “S is a random sample from the distributional family D(q) with q != 0”.
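The gene example above can be sketched like this (the scores, the summary statistic, and the selector are all hypothetical stand-ins): build the null distribution of the statistic by actually selecting uniformly at random, then compare.

```python
import random

rng = random.Random(7)

# Hypothetical setup: each of 5000 genes has some numeric score; the
# summary statistic is the mean score of the 50 genes a method selects.
gene_scores = [rng.random() for _ in range(5000)]

def mean_score(selected):
    return sum(gene_scores[g] for g in selected) / len(selected)

# Null distribution: the statistic under uniform-random selection.
null_stats = sorted(
    mean_score(rng.sample(range(5000), 50)) for _ in range(2000)
)
lo, hi = null_stats[25], null_stats[-26]  # central ~97.5% range

# A selector that prefers high-scoring genes (standing in for the method
# under scrutiny).
chosen = sorted(range(5000), key=lambda g: gene_scores[g])[-50:]
observed = mean_score(chosen)
# observed falls outside [lo, hi], so we reject "no different from uniform
# random selection" -- and only that, not membership in some parametric
# family with a shifted parameter.
```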

I think one of the hardest parts about being an academic statistical consultant is that you’re supposed to be a judge for hire, and you’re being hired by the prosecution (or the defense?). Pretty strong conflict of interest.

I actually left a position of academic biostatistician for another position where predictive modeling was more of interest to the employer, mostly because of this issue. All the researchers I worked with as a biostatistician were really good about asking for honest analyses and not pressuring me to bias my results, and they were even more excited about me proposing new statistical methods for problematic data types (perhaps because they recognized that could mean their paper would get more citations, but no crime there), but the constant pressure to find statistical significance was always there, as a part of the institution rather than the people. This was rewarding when I could talk to researchers about how to improve their experiments to get more power (i.e. use measures that are less noisy, deciding whether to look at a lot of samples from a few subjects or a few samples from a lot of subjects, etc.), but 80% of the time, my job was “can you show that this data is ready for publication?”, and saying “no” and then handing them a bill was rather unpleasant, no matter how professional the biologists were.

If you’re doing predictive modeling, at least your employer is happy when you do a good job.

This is an interesting perspective. With predictive modeling, we are much more concerned with factors related to effect size: how accurate can I be, often relative to some known criterion.

Predictive modeling is also younger; we can look at algorithms like C4.5/C5 versus CART and see how arbitrary the decision rules are. (They may empirically be excellent rules, but the choices follow more from mathematical intuition than proof, in many cases. Pruning is probably the easiest way to see my point.) The 5% rule and null hypothesis testing are also arbitrary rules, but somehow they became dogma, probably due to the influence of introductory statistics classes.

“the practicing scientist wants certainty: thus, the result of an experiment is either “statistically significant” (the result is true) or “not significant” (the result is false).”

I’ve long thought that that is the true reason for the popularity of significance tests even in areas where you cannot even name the population from which the sample is drawn: The ability to treat a whole bunch of associations as though they were zero.

Gigerenzer wrote about exactly this in his Journal of Management paper, arguing that Bayesian statistics won’t solve scientists’ problems because they want certainty…

Scientists want MONEY. It’s the granting agencies that want certainty.

If scientists wanted money, they wouldn’t have gotten into basic science. I mean, I guess it is true that a postdoc in bio making 30k a year living in a major metropolitan area does want more money, but clearly that hasn’t been their only driving motivator in every decision in their life.

Most scientists really want to do science, and do it well. But they also understand that nowadays, it costs lots of money to do any science, and they need to get more money to keep doing science in the future. So again, I suppose scientists do want money, but in the hopes that it will eventually lead to certainty about the uncertain.

Personally, I think the granting agencies are considerably less emotionally invested in the actual science than the scientists are: how could they be otherwise, given that they are in a constant cycle of reviewing large numbers of grants!

My point is, I think many scientists would be fine with studying things and reporting realistic uncertainty, except that it would lead to their extinction from the business of running labs because it wouldn’t garner MONEY, and without MONEY in your institutional account, as you say, you can’t keep doing the kind of science that is done these days.

There is, very very definitely, strong survivorship bias among principal investigators. Those who regularly bring in grant money stay in the business, others wind up outside academia.

So, it matters a lot what granting agencies act as if they want. And, as far as I can tell, they act as if they could hardly care less whether you produce true facts about the world, but they care a LOT about statistics like how many papers you’ve produced and how many “discoveries” you’ve made.

If you can provide a steady stream of “certainty”, even if it’s totally false, you can be used to justify a large congressional budget, and so you will survive in academia.

Some people come along for the ride and wind up doing ok science incidentally, but for the most part success = helping the NIH/NSF sell certainty to congress.

If grant reviews came back requesting larger sample sizes and independent replications and offering bigger budgets if you can show the effectiveness of your experimental designs using a simulation study, and independent auditors who get paid when they discover how you’re gaming the system are never able to find substantial things to complain about in your work…. then we’ll see scientists caring less about certainty and more about truth. Basically, when truth pays off.

Daniel: That was my sense of largely what was happening.

Daniel:

Oh, okay. I misread your original comment as “Scientists only care about money, while grant agencies want reliable results”.

Now I understand you to have meant “scientists need money to do research and to get it, they cannot express uncertainty in their findings to grant agencies”. That makes a lot more sense to me.

“Some people come along for the ride and wind up doing ok science incidentally, but for the most part success = helping the NIH/NSF sell certainty to congress”

Difference of opinions, but I think this is a bit more pessimistic than my experiences. But I’d guess that while a lot of the pressures are pretty constant, especially at the R1 level, I’m sure community reaction to such pressures is non-constant.

Leaving aside money and getting back to motivation: the problem often lies in the casual and ill-considered use of the term ‘significant’. I prefer statistically ‘different’, which avoids the categorical statement and allows the researcher to highlight findings that may offer useful insights. Unfounded certainty is a dangerous precept in most fields of enquiry, particularly real-world research, where confounding variables are often hidden just around the corner, or just out of sight of the research design.

That’s something that really puzzles me. Unless you are working in a field where classical parameters have a substantive theoretical meaning and your data don’t have much intrinsic variation, so that you can truly worry mostly about sampling and measurement error, why wouldn’t a scientist be interested in the stochastic aspect of the phenomenon he studies? The demand for a specific test seems to go against the notion of building scientific theories for complex phenomena, unless your theory is so great that you really just need to test some parameter values.

Sure, one might argue that RCTs usually end up testing treatment A vs. treatment B or placebo, but the question answered by most RCTs seems to be much more an engineering problem (is procedure X more effective than Y, or than nothing, in context A?) than a test of hypotheses derived from substantive theory.