Paul Alper writes:

I know by searching your blog that you hold the position, “I’m negative on the expression ‘false positives.’”

Nevertheless, I came across this. In the medical/police/judicial world, false positive is a very serious issue:

$2

Cost of a typical roadside drug test kit used by police departments. Namely, is that white powder you’re packing baking soda or blow? Well, it turns out that these cheap drug tests have some pretty significant problems with false positives. One study found 33 percent of cocaine field tests in Las Vegas between 2010 and 2013 were false positives. According to Florida Department of Law Enforcement data, 21 percent of substances identified by the police as methamphetamine were not methamphetamine. [ProPublica]

The ProPublica article is lengthy:

Tens of thousands of people every year are sent to jail based on the results of a $2 roadside drug test. Widespread evidence shows that these tests routinely produce false positives. Why are police departments and prosecutors still using them? . . .

The Harris County district attorney’s office is responsible for half of all exonerations by conviction-integrity units nationwide in the past three years — not because law enforcement is different there but because the Houston lab committed to testing evidence after defendants had already pleaded guilty, a position that is increasingly unpopular in forensic science. . . .

The Texas Criminal Court of Appeals overturned Albritton’s conviction in late June, but before her record can be cleared, that reversal must be finalized by the trial court in Houston. Felony records are digitally disseminated far and wide, and can haunt the wrongly convicted for years after they are exonerated. Until the court makes its final move, Amy Albritton — for the purposes of employment, for the purposes of housing, for the purposes of her own peace of mind — remains a felon, one among unknown tens of thousands of Americans whose lives have been torn apart by a very flawed test.

Yes, I agree. There are cases where “false positive” and “false negative” make sense. Just not in general for scientific hypotheses. I think the statistical framework of hypothesis testing (Bayesian or otherwise) is generally a mistake. But in settings in which individuals are in one of some number of discrete states, it can make a lot of sense to think about false positives and negatives.

The funny thing is, someone once told me that he had success teaching the concepts of type 1 and 2 errors by framing the problem in terms of criminal defendants. My reaction was that he was leading the students exactly in the wrong direction!

I haven’t commented on the politics of the above story but of course I agree that it’s horrible. Imagine being sent to prison based on some crappy low-quality lab test. There’s a real moral hazard here: The people who do these tests and who promote them based on bad data, they aren’t at risk of going to prison themselves here, even though they’re putting others in jeopardy.

Hi Andrew:

Interesting. But, why are people driving around with baking soda pleading guilty to a felony?

It seems to me there is more going on here than just flawed testing.

Rodney

There is a big literature on this. Most often it is part of a plea bargain to avoid the threat of a longer sentence. In younger people, it might be as simple as the interrogator promising they will get to go home to their families after hours and hours of questioning. In the very young and mentally impaired it can be easy for interrogators to plant false memories after hours and hours of questioning. There are plenty of articles about it out on the webs.

http://www.nybooks.com/articles/2014/11/20/why-innocent-people-plead-guilty/

Yes, Matt is right. I did some research into this subject for my first book. False confession is a real thing! There are many causes – the most egregious is that the court allows the police to tell lies to elicit confessions. Also, psychologically, the innocent person thinks, “Surely, since I didn’t do it, there should be lots of other evidence to contradict the confession.” Unfortunately, some of the research showed that jurors place disproportionate weight on the confession evidence, at the expense of other evidence.

Sad.

“False confession is a real thing!”

Makes one take a second look at the reasons for which millions of whites might plead guilty to the charges of “racism”

Interestingly, in the case highlighted in the ProPublica article, the defendant pleaded guilty to a misdemeanor, and was sentenced to 45 days in jail. And, the article does not seem to say that anyone is actually convicted at trial on the basis of the field test alone (nor would they – a more sophisticated test would doubtlessly be performed were the case to go to trial). The field tests seem to be used only as the basis for an arrest, not a conviction.

The problem seems to lie not with the test, but with people being pressured to plead guilty (From the article: “A majority of those defendants, 58 percent, pleaded guilty at the first opportunity, during their arraignment; the median time between arrest and plea was four days.”).

So, this seems to be an instance of the reporter not understanding the problem he or she is reporting on, or being so devoted to a pre-assumed narrative that he or she is blind to the real story. Unfortunately, that seems all too common in the journalism profession.

> The problem seems to lie not with the test, but with people being pressured to plead guilty

You’re being disingenuous. They are being pressured to plead guilty after being falsely detained.

As a reader of a statistics blog that often posts about utility functions, you of all people should realize that the choice is between “plead guilty and get a reduced sentence” vs. “plead not guilty and face the maximum penalty,” and many, many people would prefer the former given their lack of faith in due process after just having been falsely detained.

But, that is my entire point – being given the choice between “plead guilty and get a reduced sentence” vs. “plead not guilty and face the maximum penalty” is exactly the pressure I am talking about. That choice simply should not be offered to a defendant at such an early stage in the proceeding – DAs should not be permitted to offer plea deals at that stage, and judges should not accept guilty pleas based on plea bargaining at that stage.

PS: BTW, innocent people are arrested all the time; that is unavoidable. A person can be arrested if there is probable cause to think he or she is guilty. “A police officer has probable cause for an arrest when he has ‘knowledge or reasonably trustworthy information of facts and circumstances that are sufficient to warrant a person of reasonable caution in the belief that the person to be arrested has committed or is committing a crime,’” Weyant v. Okst, 101 F.3d 845, 852 (2d Cir. 1996). A drug test that is 79% accurate almost certainly meets that test. Hence, a police officer who arrests someone based on that test has done nothing wrong. It is the DA who pushes for a guilty plea based on that test who is the one who has acted wrongfully.

PS That should say 67-79% accurate, since there are two estimates of reliability in the story

“a police officer who arrests someone based on that test has done nothing wrong. It is the DA who pushes for a guilty plea based on that test who is the one who has acted wrongfully.”

+1

I had the same question. You might classify this story as a classic example of burying the lede. (1) There’s a significant chance that your friendly neighborhood coke dealer is selling you counterfeit drugs, and (2) consumers apparently can’t tell.

I think that there are lots of situations like this, and it’s also a good idea for undergraduates to learn to think about the variety of ways in which they may make decisions that are wrong or leap to the wrong conclusions about individuals, situations, policies. The drug test example is way more interesting as a way to introduce Bayes Theorem than some drawers with or without jewelry in them or even the Monty Hall problem etc. Many students understand the issues around drug tests well since they have had those kinds of jobs. Of course I also want them to eventually question where the specificity and sensitivity values come from too. Basically I think it is good critical thinking for people to consider a number of ways they can go wrong in drawing conclusions. That’s why I don’t have a problem with introducing the idea of types of error in a conceptual way.
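As a rough illustration of the Bayes' theorem exercise described above, here is a minimal sketch in Python. The sensitivity, specificity, and prevalence values are hypothetical classroom numbers, not figures from the ProPublica article or from any real field test, and the function name is made up for this example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(substance is the drug | test is positive), by Bayes' theorem."""
    # Positives that come from real drug samples.
    true_pos = sensitivity * prevalence
    # Positives that come from innocent substances (baking soda, etc.).
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    for prevalence in (0.5, 0.1, 0.01):
        ppv = positive_predictive_value(0.9, 0.9, prevalence)
        print(f"prevalence={prevalence:.2f}  P(drug | positive)={ppv:.3f}")
```

The point students usually find surprising is that the same test gives very different answers depending on how common the drug is among the samples tested, which is also where the question of where the sensitivity and specificity values come from starts to matter.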

I think the important thing, though, is that the distribution of individuals on dichotomous variables is a different thing than the distribution of sample statistics from samples containing large numbers of individuals.

That’s one cool cat, standing in that quantity of baking powder. Sort of an El Chapo of gatos.

Even in this case what you really want to know is the amount/percentage of a certain substance that is present, not whether it is present at all. The only reason the false positive concept seems to make some sense is because the tests are so inaccurate and the people using them are so poorly trained. Yes, if the situation is a lost cause anyway who cares how weak your method of analysis is… the results will always be inconclusive.

Once you get into diagnostic tests designed and performed by properly trained people, they will start doing stuff like measuring isotope ratios to distinguish between sources and/or trace the contamination back to a source. Imagine if law enforcement didn’t fall into the hypothesis testing trap and instead did that, how much easier would it be to see who is getting what from who, etc?

Thinking more about the above, I’d even go so far as to say reliance on the hypothesis testing paradigm is a major contributor to the epic failure of the US “war on drugs” we have seen the last few decades (not that I am particularly fond of this war).

This is a case where signal detection theory is a more useful framework than statistical hypothesis testing, because SDT is entirely about individual cases and how the costs and benefits of different outcomes affect — and should affect — decision making.
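One small piece of that framework can be sketched as an expected-cost rule for a single case. This is a simplification, not full signal detection theory (there is no sensitivity index or response criterion here), and the cost values and function names are hypothetical.

```python
def action_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting has lower expected cost than not acting."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_act(p_present, cost_false_positive, cost_false_negative):
    """Compare the expected cost of acting with the expected cost of not acting."""
    # Acting is only a mistake when the substance is absent.
    expected_cost_act = (1.0 - p_present) * cost_false_positive
    # Not acting is only a mistake when the substance is present.
    expected_cost_wait = p_present * cost_false_negative
    return expected_cost_act < expected_cost_wait


if __name__ == "__main__":
    # Hypothetical costs: a wrongful arrest is treated as 9 times worse than a miss.
    print(action_threshold(9.0, 1.0))
    print(should_act(0.8, 9.0, 1.0))
```

Under these made-up costs, an 80% chance that the substance is a drug would not be enough to justify acting, which is the sense in which costs should affect the decision and not only the test result.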

I’ll never like the phrase “false positive”. Tests sometimes give wrong answers so the correct statement is that the test was wrong. To me, this has two important meanings: 1) when you say “false positive”, that carries the idea that the test was conducted, that it was conducted somehow in a reasonable manner and that the result is reasonable but just happens to be wrong, and 2) none of that is true because a) tests are conducted poorly and that affects reliability of answers and b) test results may be affected by a number of factors.

Imagine this: a guy is brought in for public drunkenness and the test is presented to the court that the defendant failed to walk a straight line, that the defendant was unable to respond to simple commands requiring identification of objects and was unable to reach out and grasp an object without assistance. That’s your false positive: the test and its subparts were administered properly and the defendant did in fact fail at every one of them … except the defendant is blind. Take the false positive route and the defendant has to affirmatively prove, “But I’m blind. Really. Blind enough that I can’t see.” It isn’t a false positive when someone with multiple sclerosis is treated as drunk by the police – which happens. The point isn’t all that subtle, especially in a court: the burden of proof is a) the test must be disproved, which is the “false positive” idea, or b) the validity of the test in this circumstance must be proved. And in statistical terms, you can think of that as a sign question: you’re examining an effect and a measure of that effect, so the direction you approach that from has meaning. In a false positive, you’re approaching from the idea of validity, which I suppose is why we drowned women because witches float.

False Positive is not the same as Wrong, since Wrong includes False Positives and False Negatives, and when it comes to decisions the distinction matters a lot.

At the same time, we need to not use False Positive as a flippant way to write off the impact of poor tests. It’s like “statistically significant” which may or may not indicate practical significance.

> The funny thing is, someone once told me that he had success teaching the concepts of type 1 and 2 errors by framing the problem in terms of criminal defendants. My reaction was that he was leading the students exactly in the wrong direction!

What do you mean? Guilty/not guilty is a discrete state, no?

I am guessing that Andrew’s phrase “leading the students exactly in the wrong direction” really refers to the whole idea of hypothesis testing.

But I agree with Eric that using the analogy of guilty/not guilty is a good way to explain Type I and Type II errors, if one is going to teach hypothesis testing, which is needed because it is used so much — so people need to understand the problems with it, and Type I/Type II errors do that partially (which is not to deprecate Type M and Type S errors; these also need to be taught).

And, as Elin points out, drug testing is a very good way to introduce students to Bayes’ theorem, then on to Bayesian analysis. In my opinion/experience, drug testing works much better for this purpose than a betting approach, which loses a lot of students (although it fascinates a few).

Also medical tests (or other diagnostic criteria) for diseases are maybe even better than drug testing for introducing Bayes’ Theorem and Bayesian analysis.

If anything the relationship to hypothesis testing might be more about the reasonable doubt standard versus the preponderance of evidence standard.

+1

Andrew:

The link referred to in your sentences:

“Nevertheless, I came across this. In the medical/police/judicial world, false positive is a very serious issue:”

comes up as

“This site can’t be reached

cost%20of%20a%20typical%20roadside%20drug%20test%20kit%20used%20by%20police%20departmentshttp’s server DNS address could not be found.

ERR_NAME_NOT_RESOLVED”

Here is the correct link:

http://fivethirtyeight.com/features/significant-digits-for-friday-july-8-2016/

> There are cases where “false positive” and “false negative” make sense.

Add “fire control decisions” to the list with “roadside drug testing”.

I think the concepts of false positive and false negative make perfect sense when a person is making an irrevocable binary decision. Example: “Should this self-driving car apply the brakes?” “Should the air bag be deployed?” “Is this person guilty of driving under the influence of intoxicants?” “Should this person have a lumpectomy?”

My sense is that Andrew’s objection to talking about hypothesis testing in this way is that our goal is very rarely to make an irrevocable binary decision. Neyman and Pearson were very clever to try to formalize statistical testing as a binary choice, but statistical analysis is rarely about binary choices. We seek to understand the underlying processes, the relevant factors, the sources of noise and missingness, the strength of the evidence, the weaknesses of our hypotheses, the additional data that should be gathered, and so on. Most (All?) of the problems with p-values result from the attempt to use a statistic designed for a binary decision to serve all of these other purposes.

Speaking with my lab hat on, I’m horrified by the attitude of some statisticians.

The idea that there is no such thing as a false positive in science is so far divorced from reality, and so irresponsible, that I’m (almost) speechless.

If it were taken seriously, then, for example, the world would be deluged with ineffective and toxic quack medicines (even more than it is already). It would open the door to every sort of charlatan.

David:

I do most of my applied work in social and environmental sciences, where correlations and effects are essentially never zero. They can, however, be highly variable, positive in some settings and negative in others, to the extent that there’s no generally replicable “effect.” You could call such a thing a false positive, but I prefer to avoid that term, in part because the “false positive” idea is associated with hypothesis tests of zero, which I don’t think makes any sense in just about any social or environmental science problem I’ve ever seen.

> no such thing as a false positive in science

Does not exclude getting the sign wrong nor the magnitude way off – which I believe is the real concern in medicine or should be.

Keith, David:

Yes. To put it another way: I don’t mind talking about false positives as long as this idea is separated from the null hypothesis idea that associates a “false positive” with a particular random number generator.

This was my bad – I linked Colquhoun here after I got in twitter arguments with both Colquhoun and Mayo for suggesting that the concept of a ‘false positive’ is typically more misleading than useful and that it could be beneficial to abandon hypothesis testing entirely. Mayo responded with ALL CAPS outrage and David accused me of enabling Trump supporters/quack medicine. Sigh…

PS, at David’s suggestion I was reading his latest p-value paper to give feedback, but kept coming back to what you said elsewhere:

> I suspect that I would agree with the recommendations of this paper (as, indeed, I agree with Ioannidis), but at this point I’ve just lost the patience for decoding this sort of argument and reframing it in terms of continuous and varying effects

My main barrier to providing any feedback was I just reeeeally don’t like the framing of false positives (at least for general sci. hypotheses).

Perhaps someone here can help me: how do you _formally_ define e.g. a type I error when you also hold the whole ‘all models are wrong/idealisations/approximations’ perspective?

If H0: This model is true, then clearly you should always reject if you take that statement seriously.

If I say H0:

‘This model adequately represents the aspects of the observable data that are of interest to me’

then how can I make an error in the ‘false rejection’ sense?

Specifically, you should reject iff you judge that the model represents aspects of the observable data of interest. You can only make an error by contradicting yourself.

Reminds me of Tarski: ‘Snow is white’ iff Snow is in fact white.

I.e. ‘This model is an adequate representation of the data’ iff this model is in fact an adequate representation of the data.

(meant to say: …should reject iff you judge that the model _does not_ represent…)

Not a complete answer to your question (which I’m not in a position to provide) but I think it’s important to bear in mind that the whole Type I/II error paradigm is based on evaluating procedures over hypothetical long-term replications, rather than a direct assessment of any given model. So the whole business is just saying that, even though we don’t believe the RNG null hypothesis in any particular case, we still set a bar for ourselves to clear before rejecting in favor of any particular alternative. The only use I’ve ever been able to figure is that, if carefully applied, it can kind of answer the question how easily chance variation (due to measurement, stochasticity in system, whatever) could plausibly explain results by itself. I agree this whole business is unsatisfactory…

Yeah, I agree that the ‘error _probability_ attaches to the procedure’ as Mayo might say, so it is the probability of the method producing an error or whatever.

But the actual errors themselves are supposed to be _inferential_ errors right – like falsely rejecting a true hypothesis. Without true hypotheses what is the error in rejection? I mean _formally_ or at least semi-formally.

ojm,

I’m not sure if I’m understanding what you are saying, but here’s the way I conceptualize the situation, in the hope that that will help:

I distinguish between the model and the null hypothesis. So, for example, the model might say “the data are from a normal distribution with mean mu and standard deviation sigma”, without specifying what mu and sigma are. Then the null hypothesis states, e.g., “mu = 0”. So I would call a Type I error “rejecting the null hypothesis when both the model and null hypothesis are true.”

Hi Martha,

Thanks for the response.

Influenced in part by eg Laurie Davies (also Andrew’s comments on generative models) I like to use the terms ‘model instance’ for when a full particular parameter set is specified and ‘model family’ or ‘model structure’ to indicate a set of possible models like the normal family. I also like the term ‘model function’ for referring to the associated map from parameter instances to model instances (I’ve seen Barndorff-Nielsen use this in some old books/papers).

But this terminology doesn’t help address my question, I don’t think. I don’t think it makes sense to call either a model instance or a model family true.

(and really, saying model instances aren’t true already implies that a family can’t be true, if you think of a family as a collection of model instances and a family as true if at least one instance is true)

ojm, one thing that is clear is that the formalization is going to have to be over a hypothetical long-run of replications, rather than for any particular inference. So you say, “using rule X, we reject the null hypothesis in this instance Z, because if we consistently apply rule X over the long-term, our error probability should be Y”. The null hypothesis is here just a stand-in for saying, “these data could plausibly be explained by chance variation”. As Martha suggests, it’s not really a model being tested, in the sense of “all models are false, but some are useful”. Really, this whole exercise is a demonstration of why p-value/NHST-driven analysis is such a can of worms, one that I can guarantee almost no practising scientist/researcher actually understands…
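The “rule X applied over the long term” idea can be sketched by simulation. The setup below is an assumption made for illustration only: a two-sided z-test on the mean of normal data with known unit standard deviation, with a hypothetical function name and hypothetical sample sizes.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(true_mean, n=20, n_reps=20000, alpha=0.05, seed=0):
    """Fraction of simulated replications in which the rule rejects 'mean = 0'."""
    rng = random.Random(seed)
    # Two-sided critical value for the assumed z-test.
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(n_reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        # z statistic for a known standard deviation of 1.
        z = fmean(sample) * n ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    print("rate when the simulated mean is 0:", rejection_rate(0.0))
    print("rate when the simulated mean is 1:", rejection_rate(1.0))
```

The error probability here is a property of the rule across replications of a data generator we chose, which is exactly why it says nothing directly about whether any real model is “true”.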

> the formalization is going to have to be over a hypothetical long-run of replications, rather than for any particular inference

I’m actually fairly sympathetic to the _procedural_ aspects of frequentist inference. In this sense, rather than ‘long run of applications’ you could think of analysing a function in terms of how it would behave under various inputs.

E.g. we often analyse the performance of algorithms in terms of average, best, worst etc cases and use this to guide their use in particular cases.

What I don’t like, is to say e.g. “this algorithm often tells me that my null model instance is wrong – e.g. ‘N(0,1) is wrong’ – when in fact my model is true”.

On the other hand, I am OK with the _general_ idea of e.g. ‘this algorithm produces reliable outputs under this range of inputs’.

(though ‘reliable’ is another murky concept – I would prefer something like ‘informative for this case and stable wrt perturbations beyond this case’ instead…

…which is another reason why I don’t find arguments against considering ‘data other than that observed’ especially convincing in general. You just have to get the right ‘other than that observed’.)

Aha good to see some discussion.

I realise that my results are dependent on the assumption of a point null, and I try to justify that viewpoint in appendix A1 in the 2017 paper https://www.biorxiv.org/content/biorxiv/early/2017/08/07/144337.full.pdf

I sometimes wonder whether statisticians who don’t like point null hypotheses see enough real data. The world is flooded with ineffective treatments, some in regular medicine and nearly 100% in alternative medicine. The appropriate null for any proposed new treatment is that it doesn’t work: its effect size is zero. It’s perfectly possible for an effect to be exactly zero – just give the same pill to both groups. But that isn’t the point. You aren’t asserting that the effect size IS exactly zero. You are asking whether your results could plausibly be seen if it were zero. That seems to me to be exactly the question that should be asked. If you claim that there is a real effect when there isn’t, that’s a serious mistake. It harms people and wastes time and money. That’s why it is the responsibility of every scientist to avoid such false positives as far as possible.

Of course P values don’t answer the question, and neither do likelihood ratios. Bayes doesn’t go away just because of the inconvenient fact that you very rarely have a valid prior. That’s why I like Matthews’ idea of calculating the prior that it would be necessary to assume to give you an acceptable chance of being right, a false positive risk of 5% for example. That is what most of the world still believes P = 0.05 gives you (I blame the people who give introductory statistics courses for that).

The myth of P = 0.05 would disappear very rapidly if, every time you observed a value close to 0.05 you had to supplement it with a statement like this.

“I believe that I have discovered an effective treatment (P = 0.043). I am 95% confident in this assertion and my argument is based on the assumption that I was almost (85%) certain that there was a real effect before I did the experiment”.
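The reverse-Bayes idea can be sketched in a simplified form. The version below assumes a discovery is declared whenever p falls below a threshold, with a fixed power; that is an assumption of this sketch and is not the calculation behind the figures in the statement above, which come from the paper's own method and are not reproduced here. The function names and default values are hypothetical.

```python
def false_positive_risk(prior, alpha=0.05, power=0.8):
    """P(no real effect | test declared a discovery), under the sketch's assumptions."""
    # Discoveries declared when there is no real effect.
    false_pos = alpha * (1.0 - prior)
    # Discoveries declared when there is a real effect.
    true_pos = power * prior
    return false_pos / (false_pos + true_pos)


def prior_needed(target_risk, alpha=0.05, power=0.8):
    """Prior probability of a real effect needed to reach the target risk."""
    # Obtained by solving false_positive_risk(prior) == target_risk for prior.
    a = alpha * (1.0 - target_risk)
    b = power * target_risk
    return a / (a + b)


if __name__ == "__main__":
    print("risk with a 10% prior:", false_positive_risk(0.10))
    print("prior needed for a 5% risk:", prior_needed(0.05))
```

Even in this simplified form, the prior needed for a 5% risk is well above what most people would assume for an untested treatment, which is the spirit of the supplementary statement.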

It seems to me that every time a result is published that is wrong (a false positive for example) that puts ammunition into the hands of science-deniers. Now, when they are in government, that is a very serious problem indeed.

David:

1. You write, “I sometimes wonder whether statisticians who don’t like point null hypotheses see enough real data.” I think I see enough real data.

2. You write, “Of course P values don’t answer the question, and neither do likelihood ratios. Bayes doesn’t go away just because of the inconvenient fact that you very rarely have a valid prior.” I’m not sure what you mean by “valid,” but I’d say that researchers typically choose data models (logistic regression, etc.) out of convenience and convention, just as they do with priors. In any case, I’ve repeatedly emphasized that my problem is not so much with p-values but with null hypothesis significance testing more generally. I am no fan of Bayes factors for NHST either.

3. You recommend a statement such as, “I believe that I have discovered an effective treatment (P = 0.043).” I don’t think that any p-value should be taken as evidence of a particular alternative (in this case, that the treatment is “effective”). A p-value near zero is some evidence against a particular random number generator (a so-called null hypothesis) that I didn’t believe in the first place.

Concerning your point 3. May I suggest that you read to the end of the bit that you quote? It says that P values won’t do.

Re

> You aren’t asserting that the effect size IS exactly zero. You are asking whether your results could plausibly be seen if it were zero. That seems to me to be exactly the question that should be asked.

Not quite. You can’t have a ‘false positive’ without believing the null really is true (see my comments above).

The logic is

a) assume null true

b) compute compatibility

c) reject if too incompatible

d) use false positive reasoning to determine if null _was in fact true_

My problems are really with c) and d). As mentioned further above you can only be ‘wrong’ in rejecting a null if you believe it really is true.

It’s my opinion that these differences in philosophy are really application specific.

In the world of bio and medicine, the null really is a very reasonable approximation for what could be happening in your data. You think that mechanism X from treatment Y should be at play…but you were wrong. Of course, as statisticians, we think everything’s got *some* effect. But since X didn’t activate, your treatment effect for Y is really 0.000001 instead of 1.5. And since you’ve only got 6 samples in your preliminary data, a treatment effect of 0.000001 will produce data nearly identical to a treatment effect of 0 for all intents and purposes.

On the other hand, when you are looking at Xbox survey data, you may have so much data that rejecting the null hypothesis is an entirely meaningless achievement.
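The six-sample point can be checked with a short calculation. The comment does not say which test is meant, so this sketch assumes a two-sided z-test on a mean with known unit standard deviation; the function name is hypothetical.

```python
from statistics import NormalDist


def rejection_probability(effect, n, sd=1.0, alpha=0.05):
    """Probability that the assumed two-sided z-test rejects 'effect = 0'."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # How far the true effect shifts the z statistic, in standard errors.
    shift = effect * n ** 0.5 / sd
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


if __name__ == "__main__":
    for effect in (0.0, 0.000001, 1.5):
        print(effect, rejection_probability(effect, n=6))
```

With six observations, an effect of 0.000001 and an effect of exactly 0 lead to rejection at essentially the same rate, while an effect of 1.5 is rejected most of the time under these assumptions.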

I work mainly on bio problems. In my opinion, the messier the field and/or data the less relevant hypothesis testing.

This is not about the whole ‘too much data’ issue – that’s just another, different pathology of hypothesis testing…

(Though I suppose the ‘too much data’ issue is related in the sense that it again emphasises that you don’t literally believe the null can be true. But the problem stands even with small data.)

Also relevant: ‘The’ effect size doesn’t exist

http://datacolada.org/33

If you were to believe that effect size doesn’t exist, both experimenters and statisticians might as well shut up shop and go home.

David:

ojm’s point is that when we say “the” effect size, it has the implication that a treatment only has one fixed effect across all subjects in the population across all time, but this can be a misleading representation of the world. For example the effect of penicillin is definitely much less than it was 50 years ago!

But similar to my earlier statement, I think this is a little less of an issue in bio/medicine than, say, politics: while the penicillin effect may have changed quite a bit in 50 years, the Trump effect can change overnight.

a reader: the penicillin effect on a person with an infection subject to penicillin is very very good… The effect on a person who has a penicillin resistant strain is near zero. Again, no such thing as “the” effect.

+1 to Daniel’s comment.

And I’ll add: The penicillin effect on a person who is severely allergic to penicillin can be worse than the effect of the original infection.

Small addition: also, in the Xbox example, you really have no reason to think a priori that the range of effects is either on the order of around 1 if your hypothesis about how the world works is right, but on the order of 0.000001 if your hypothesis about how the world works is wrong. Having seen a lot of bio data, I do think that describes the data often seen by biostatisticians.

I simply don’t agree. A false positive doesn’t require that the true effect is exactly zero. It requires that you have declared a discovery when there is no good reason to believe that the results are anything other than sampling error. In any case, Berger and Delampady have shown that it makes little difference to arguments like mine if the null is taken as a narrow band around zero. There is a danger that arguments about whether it’s exactly zero degenerate into being as helpful as arguments about the number of angels that you can fit on the head of a pin.

You are forcing the problem into a discrete decision making setup which automatically generates higher error rates (your paper is implicitly a demonstration of this too). You may say ‘but this is what people do’ – the point is that they shouldn’t.

You want people to ‘avoid making a fool of themselves by declaring there is an effect when there isn’t’. Then people should stop formulating studies as a discrete choice problem where they have to decide. This itself gives high error rates.

Along these lines – when you say ‘a narrow band around zero’ I assume there is still a very discrete gap between where this band ends and the alternative begins? The problem is not the point null itself but that it isn’t embedded into a continuous family of possibilities.

The uniform prior is a “continuous family of possibilities”. What is, in my view, not acceptable is made-up prior distributions that are based on “expert opinion”. In medicine at least, acceptance of expert opinion has done much harm. I agree with Valen Johnson when he said

“subjective Bayesian testing procedures have not been – and will likely never be – generally accepted by the scientific community”

Many decades of pushing subjective Bayesian views have had next-to-no effect on practice. And that, I suspect, is quite a good thing.

By far the biggest reason Bayesian methods aren’t in science has been lack of computational ability to fit the kinds of things that Bayesian analysis requires. It’s no good spending decades telling scientists they should do Bayesian analyses if they have no actual operational way to do them.

BUGS and then JAGS and now Stan have lowered that computational bar to the point where people who have some interest in doing a good job of a Bayesian analysis can sit down and begin to walk that road after a few chapters of reading in one of the modern textbooks available, such as McElreath or Kruschke or whatever. This kind of analysis and the texts helping people operationalize it into code just wasn’t really available to people as recently as say 2010 or 2012.

So, the truth is, the decade clock starts let’s say 2010, by 2017 we see several major textbooks in the market and a thriving series of interesting papers using Stan in widespread applications. What will things be like in 2040 is the real question.

My comment has nothing to do with Bayes or not.

You could eg plot the pvalue P(T>t;theta) as a function of theta for theta in a continuous range. This gives an _indication_ of how compatible each theta value is with the data. For fun, try differentiating this with respect to theta and see what you get.
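To make the suggestion concrete, here is a minimal sketch, assuming the simplest possible setup: the test statistic T is a sample mean that is Normal(theta, se^2) with known standard error. The observed value `t_obs`, the standard error `se`, and the function names are hypothetical and chosen only for illustration. The sketch tabulates the one-sided p-value over a grid of theta values and differentiates it numerically.

```python
from statistics import NormalDist

def p_value(theta, t_obs, se):
    """One-sided p-value P(T > t_obs; theta), assuming T ~ Normal(theta, se^2)."""
    return 1.0 - NormalDist(mu=theta, sigma=se).cdf(t_obs)

def p_value_slope(theta, t_obs, se, h=1e-5):
    """Central-difference derivative of the p-value function with respect to theta."""
    return (p_value(theta + h, t_obs, se) - p_value(theta - h, t_obs, se)) / (2.0 * h)

# Hypothetical observed statistic and standard error (assumptions, not real data).
t_obs, se = 0.4, 0.2

# Tabulate the p-value function over a continuous range of theta (here a coarse grid).
for i in range(-2, 11):
    theta = i / 10
    print(f"theta={theta:5.2f}  p={p_value(theta, t_obs, se):.4f}  "
          f"slope={p_value_slope(theta, t_obs, se):.4f}")
```

In this particular normal model the slope column traces out a normal density in theta centred on the observed statistic, which is presumably the point of the "try differentiating" hint. Other models will not reduce so neatly.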

Still though, I wouldn’t use the above to declare an ‘effect’ ‘real’ or not on the basis of one study.

The point re decision making is that continuity leads to better error properties than discreteness – it is easier to control a continuous function than a discontinuous one. Hence why we can usually get better error bounds for regression than classification – former has continuous loss function, latter does not.

I figured you might be more open to a point addressing error control than my preference for abandoning the idea of a false positive but I guess not.

Dr. Colquhoun, I get the feeling — and please correct me if I’m wrong — that you haven’t read Prof. Gelman’s textbooks and you’re firing from the hip on the basis of a single blog post. It’s tricky to make worthwhile criticisms of Prof. Gelman’s preferred methodology on that basis because the specific statistical issues he writes about in any particular blog post make the most sense in the larger context of his entire approach to data analysis; one likely outcome is that you’ll end up talking past each other. (I’ve seen it happen before.)

A good place to start is here: http://www.stat.columbia.edu/~gelman/research/published/francis8.pdf

I fear that you’re right. I haven’t read Gelman’s textbook. But I don’t think that it’s necessary to do so in order to argue that the point null hypothesis is the appropriate thing to test, at least in many biomedical problems.

In fact I’d go a bit further. Perhaps it is up to the experimenter to decide what he wants to test, and up to statisticians to tell them how to do it.

The present crisis of reproducibility calls for some urgent action. That means writing in a style that has some chance of being read by users. To that extent, I have sympathy with Benjamin et al. Everyone can understand their proposal and it could be implemented instantly. Personally I don’t like the idea of just changing the threshold for “statistical significance” because I think that term should be abolished, not redefined. My proposal for stating the prior that you’d have to believe to achieve 5% FPR is a bit more complicated but we do provide a web app that would make it possible for almost anyone to calculate the prior.

The problem is, perhaps, that statisticians have been writing for each other for 100 years with no appreciable effect on practice (apart, of course, from Fisher whose methods, many would now argue, have been abused).

And here you’ve already skated past one of Gelman’s key assertions, to wit,

the testing mindset and the resulting dichotomization of reported results is a big part of the problem, regardless of whether the tests are carried out via tail area p-values or Bayes factors or what have you.

No testing? Outrageous! Well, no: if you take away formal statistical tests, what you’re left with is estimation and credible regions, and the kinds of errors you want to avoid are (i) getting the sign of the estimated effect size wrong and (ii) gross errors in the magnitude of estimated effect size.

No. You don’t have to get dichotomisation.

I said explicitly

“Perhaps most important of all, never, ever, use the words “significant” and “non-significant” to describe the results. This wholly arbitrary dichotomy has done untold mischief to the integrity of science.”

But there isn’t the slightest hope of getting a policy adopted that fails to make a serious attempt to rule out false positives. The scientific community will never accept such a policy, and neither should they. It is a recipe for irreproducibility.

Can you just please define for me the meaning of the term “false positive” without dichotomization?

A false positive occurs when someone claims that there is a real effect when in fact the observations could easily have arisen from sampling error. That, it turns out, is a disastrously common occurrence. It has nothing to do with describing your results in a dichotomous way. That is something which I very explicitly deplore.

If you say that there is no such thing as a false positive, it’s almost like saying that there’s no such thing as truth and falsehood. That is a narrative best left to homeopaths, in my opinion.

David:

I’m ok with your definition: “A false positive occurs when someone claims that there is a real effect when in fact the observations could easily have arisen from sampling error.” For example, the himmicanes paper is a false positive by that definition. An underlying himmicanes effect might really exist, but the data we’ve seen don’t provide any real evidence on the question.

But my impression is that this is not what is usually meant by a false positive. It’s my impression that when people say “false positive,” they mean that there’s a claim of an effect but that the true, underlying effect is zero. For example, from wikipedia: “A false positive error, or in short a false positive, commonly called a ‘false alarm’, is a result that indicates a given condition exists, when it does not.” My problem is the “when it does not” part. I’m happy saying that the himmicanes paper is a false positive under your definition, but I’m not willing to call it a false positive under the wikipedia definition, because I don’t know that the underlying condition does not exist.

It’s true that the nomenclature in this field is a bit of a mess. Every time you read a paper you have to pay attention to how exactly terms are defined. What I call false positive risk, others have called false positive report probability (FPRP) or false discovery rate (FDR). I used FDR in my 2014 paper, and in my Youtube video, but changed it to FPR in the (doubtless vain) hope of avoiding confusion with multiple comparison problems: people in that field use FDR with quite a different meaning (they correct only the type 1 error, so end up with a P value, which is subject to the same problems as any other P value).

It’s interesting that you put your argument in terms of “a given condition exists, when it does not”. That suggests you are thinking about diagnostic screening tests, and it was thinking about them that first made me realise how inadequate P values are (eg Fig 1 in my 2014 paper: http://rsos.royalsocietypublishing.org/content/1/3/140216 ).

The case of screening is easier because you may have a valid prior -the prevalence of the condition in the population being tested. And it shows that for rare conditions the false positive rate may be huge. I don’t think that anyone disputes this general conclusion.

The analogy with tests of significance (Fig 2 in 2014 paper) seems inescapable to me. Unless you are going to maintain that ineffective drugs don’t exist (an assumption that would, sadly, be beyond absurd) it must surely make sense to talk about false positives in the same way that you do for screening tests. It makes no difference whether the effect of ineffective drugs is zero or whether it’s just very small.
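The screening arithmetic and its significance-test analogue can be sketched with the same few lines of Bayes' rule. This is only an illustration: the function name and the input numbers (1% prevalence, 80% sensitivity or power, 5% false alarm rate, 10% of tested hypotheses real) are assumptions chosen for the example, and it uses the simple "positive vs negative" counting rather than any more refined calculation.

```python
def false_positive_risk(prior_real, power, alpha):
    """Fraction of positive results that are false positives.

    prior_real: prevalence of the condition (or prior probability of a real effect)
    power:      P(positive | condition present), i.e. sensitivity or power
    alpha:      P(positive | condition absent), i.e. 1 - specificity or type 1 error rate
    """
    false_pos = alpha * (1.0 - prior_real)
    true_pos = power * prior_real
    return false_pos / (false_pos + true_pos)

# Screening for a rare condition: hypothetical 1% prevalence,
# 80% sensitivity, 95% specificity.
print(false_positive_risk(prior_real=0.01, power=0.8, alpha=0.05))  # about 0.86

# Significance-test analogue: hypothetically 10% of tested effects are real,
# power 0.8, threshold 0.05.
print(false_positive_risk(prior_real=0.10, power=0.8, alpha=0.05))  # 0.36
```

The only thing that changes between the two calls is the prior, which is the crux of the disagreement in this thread: for screening the prevalence can be measured, while for a new hypothesis it has to be assumed.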

David:

I’m ok with your definition above (in your comment dated 12:14 pm) because it does not make any claim that the underlying effect is zero. You simply say “could easily have arisen from sampling error,” which seems completely reasonable to me.

Regarding drugs: In the drug development research I’ve been involved in, it’s been clear that the drugs do have effects, which necessarily vary by person. I’ve not done research involving ineffective drugs, but I agree that if you’re doing work in that area, that it would be important for your statistical method to acknowledge the possibility of effects that are zero or very close to zero. In the social and environmental research problems I’ve worked on, there are no plausible zero effects that anyone’s studying. But often there are effects which are “null” in the practical sense that these effects vary so much by context that they are essentially unpredictable. I think it’s a mistake to identify such nulls with the “null hypothesis” of consistently zero effect, which is something else entirely.

> I’m ok with your definition above (in your comment dated 12:14 pm) because it does not make any claim that the underlying effect is zero. You simply say “could easily have arisen from sampling error,” which seems completely reasonable to me.

In fact I think everyone here is OK with this. What some of us dispute is the necessity and/or desirability of going from this to a discrete T/F effect decision making scheme.

David – can you see the difference? It is subtle (but, you know, sometimes small differences can be significant…)

> A false positive occurs when someone claims that there is a real effect when in fact the observations could easily have arisen from sampling error.

The probability of observations arising from sampling error (conditional on no real effect existing) is the p-value. Given a low p-value, how could we have a false positive? One possibility is that the false positive is based on a reported p-value which is low but wrong, while the correct p-value is higher (i.e. the observations could easily have arisen from sampling error in the absence of a real effect, but we miscalculated that probability). But I don’t think this is what you meant.

The bit about “conditional on no real effect existing” is the reason why P values can give no direct information about the truth of H0, or of H1. And that’s why you need to think in terms of false positive risks, not in terms of p values.

Looking at your papers I guess the “could easily have arisen from sampling error” is not conditional on “no effect”, but conditional on some assumed mixture of “no effect” and “real effect”.

Our messages crossed. I agree that p-values give no direct information about the truth of the hypothesis. But given that prior probabilities are required to do things properly, I find it easier conceptually to go full-Bayesian.

> If you say that there is no such thing as a false positive, it’s almost like saying that there’s no such thing as truth and falsehood. That is a narrative best left to homeopaths, in my opinion.

I’m happy for truth and falsehood to be things. I just don’t think they apply to statistical models directly. They can apply to eg the adequacy of a statistical model for a particular problem. But as above I don’t see how you can actually provide a proper formal definition of ‘false positive’.

Homeopaths etc tend to abuse language and move goalposts. The issue here is we are trying to say your language is unclear and you shift the goalposts.

I give aspirin to 20 people, and starch to 20 others, their blood prostaglandins reduce by a variable amount, their headache durations reduce by a variable amount… all relative to the control.

Due to the variability if I assume a random number generator of a certain kind with similar variability generated the data, I discover that I can’t reject this idea at p less than 0.05, or with likelihood ratio X or whatever.

Does this mean “aspirin reduces headache duration in part by reducing prostaglandin formation” is a *false* positive?

By your definition, yes. But what is false about this? Suppose if we give aspirin to 3000 people we would get a lot more information about the results, and in fact we’d find that, the statement is *definitely true*. So again, what is false about it?

One answer could be “a false sense of certainty” which I would agree with if the statement were “our data shows that aspirin definitely reduces prostaglandins and headache duration”. Fortunately, there’s a very good way to avoid making *certain* statements that are unwarranted especially in the presence of other options to consider… Bayesian Statistics:

https://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

Well your example is a complicated one because it involves assertions of causality and mechanism. Let’s stick, to start with, to a simpler question, does aspirin reduce the duration of a headache?

If your first results don’t show adequate evidence for a reduction, and you claim an effect, that would clearly be bad science. Of course if you subsequently did a much bigger experiment and that showed an effect, one would know, in retrospect, that the first result was a false negative. All this shows is that small experiments may miss real effects, and that’s something that’s been known for ever.

It certainly pays not to jump to conclusions. Aspirin was developed long before RCTs were heard of. When, long after its introduction, it was properly tested, aspirin turned out to be much less effective than the textbooks suggest. And, worse, the same is true even of morphine. Sad to say, good analgesics don’t seem to exist.

You say

“Fortunately, there’s a very good way to avoid making *certain* statements that are unwarranted especially in the presence of other options to consider… Bayesian Statistics”.

My proposals are in part Bayesian, but they don’t involve made-up subjective priors. Elsewhere on this blog I’ve cited Valen Johnson’s view that subjective Bayes arguments will never be accepted by the scientific community. I believe that he’s right.

Again though, from the perspective of describing the issue you’re concerned with “false positive” seems like perpetuating terrible terminology. How about “unwarranted positive” because whether the fact is true or not is not the same as whether a random number generator might have generated your data or not.

If you say “the possibility that a random number generator of type H0 might have generated your data set is too high for me to be sure that your alternative explanation is correct” then this is, actually, a Bayesian analysis, specifically a mixture of two models: your preferred specific model, and a pure random number generator. If you provide some mixture weights, you can calculate posterior weights, and say “the posterior weight of the RNG isn’t low enough to be sure that it isn’t at work here” and have a perfectly fine logical argument within a Bayesian framework.
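As a rough sketch of that two-model mixture, assume (purely for illustration) that the data are summarised by a sample mean with known standard error, that the "pure RNG" model H0 has mean zero, and that the preferred model is a single point alternative. The function name, the effect size and the prior weights below are all hypothetical.

```python
from statistics import NormalDist

def posterior_weight_of_null(x_bar, se, effect_h1, prior_null):
    """Posterior weight of the mean-zero RNG model H0 against a point alternative H1.

    Assumes x_bar ~ Normal(0, se^2) under H0 and Normal(effect_h1, se^2) under H1.
    """
    like_null = NormalDist(0.0, se).pdf(x_bar)
    like_alt = NormalDist(effect_h1, se).pdf(x_bar)
    weighted_null = prior_null * like_null
    return weighted_null / (weighted_null + (1.0 - prior_null) * like_alt)

# Hypothetical numbers: observed mean 0.5, standard error 0.3, claimed effect 1.0.
for prior_null in (0.1, 0.5, 0.9):
    post = posterior_weight_of_null(x_bar=0.5, se=0.3, effect_h1=1.0, prior_null=prior_null)
    print(f"prior weight on RNG = {prior_null:.1f} -> posterior weight = {post:.3f}")
```

With these inputs the observed mean sits exactly halfway between the two models, so the data are uninformative and the posterior weights equal the prior weights; moving `x_bar` toward either model shifts the weight accordingly. A realistic analysis would put a distribution on the effect rather than a single point.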

So in your usage a “false negative” is a finding that an effect doesn’t exist when in fact it does, but a “false positive” is not a finding that an effect exists when in fact it does not; instead a “false positive” is a finding that an effect exists when the supporting data could easily have arisen from sampling error with no effect. I guess I should have taken my own advice and read your stuff; this sure is a surprising and unusual definition.

I see in a thread above that your definition of “false positive” doesn’t assume no effect is present. But if the effect is present one wonders why the finding is to be called “false” and not, say, “unwarranted” or the like. ¯\_(ツ)_/¯

@Corey 4.35 – exactly…

Corey:

I believe it’s a false assumption that if you take away hypothesis testing, you end up with estimation and credible regions *and then people will start taking their statistics more seriously*. Credible/confidence intervals are just inversions of hypothesis testing, so the second you give me a credible region, I probably will automatically perform a hypothesis test immediately, even though I’m not a fan of hypothesis testing (it’s such a low bar of information). And, while I’m sure there are examples of this, I don’t think I’ve read a paper that has just listed p-values without estimates, and much more often than not, confidence intervals as well. So it doesn’t seem very plausible to me that if the literature presented only 2 out of the 3 values it normally presents, all of a sudden everyone will get a grip on how to do statistics properly.

To me, the bigger issue is that we take scientists who aren’t really interested in statistics, such as biologists and medical doctors, and demand that they use statistical methodology to publish. And then we’re shocked that they’re not totally plugged into the process!

To “a reader”

I think you are exactly right. There’s nothing wrong with credible intervals -they are a variant of what I’m advocating. But they would certainly be used to rule out zero effects (and rightly so, I think).

But in the second part of your contribution, you seem to suggest that “hypothesis testing” is synonymous with relying on P values. That would, of course, be silly. If, as I advocate, you base inferences on false positive risks (FPR) rather than P values, you are on much safer ground.

The heart of the problem lies in the fact that a large fraction of experimenters still think that the P value gives you the FPR. What we need is a concerted campaign to explain the error of the transposed conditional.

FPR == FDR’s (False Discovery Rates)?

If so, I do agree it’s a much better metric.

Well I comment above (1:17 pm) on the nomenclature used in this field. It’s a mess and you have to see exactly how terms are defined in each paper. I hope that I’ve made it clear at the start of mine how I define FPR.

a reader, in general *credible intervals* are NOT inversions of hypothesis tests! I understand the connection for frequentist confidence intervals, but credible intervals are not the same. They differ precisely when Bayesian methods make the most difference: when a principled prior is used, compared to some kind of flat/uniform prior.

Also, you can develop estimation theory independently of hypothesis tests.

You just don’t provide _frequentist coverage_ guarantees (but you can provide other bounds and guarantees).

a reader,

I’ve certainly read Psychology papers that report p-values with accompanying t- and F- statistics but not means, standard deviations, standard errors, or confidence intervals. Or, more commonly, you get a different grab bag of these depending on what picture the author is trying to paint.

And I’ve certainly worked with scientists who’d prefer to only report p-values (assuming they’re less than 0.05). My (admittedly subjective) opinion based on my discussions with non-statisticians is that the p-value is always the most important thing to them, and if we want to throw some other stuff in there then OK. So in that sense, I do think that doing away with (or de-emphasizing) p-values while retaining confidence intervals would be a help in getting them to think less dichotomously.

And while it’s true that confidence intervals are just inversions of hypothesis tests, they can still take a little bit of the shine off a paper where the authors are trying to sell their results as hard as possible. Seeing just how wide that confidence interval is can sober the reader up, even if the interval does just barely exclude zero. Likewise for seeing just how piddly a difference in means is, even if the size of that difference is just more than twice the standard error.

I agree completely with your last sentence. I have a biologist friend who openly admits that she doesn’t really give a damn about the analyses (t-tests and ANOVA F-tests mostly) that she’s forced to perform in order to get her work published. It’s just some tedious busy work we make them do, and shockingly they’re willing to cut corners!

+1

I also believe it’s a false assumption that if you take away hypothesis testing, you end up with estimation and credible regions *and then people will start taking their statistics more seriously*. I still regard it as progress to get people thinking about sign and magnitude errors instead of Type I and II errors.

+1

Here’s Jaynes’s take on the subject from a physicist’s perspective (see here http://bayes.wustl.edu/etj/articles/what.question.pdf for the original):

“In 1958, Cocconi and Salpeter proposed a new theory H1 of gravitation, which predicted that the inertial mass of a body is a tensor. That is, instead of Newton’s F=MA, one had $F_i=\sum_j M_{ij}A_j$. For terrestrial mechanics the principal axes of this tensor would be determined by the distribution of mass in our galaxy, such that with the x-axis directed toward the galactic center, $M_{xx}/M_{yy} = M_{xx}/M_{zz} = (1+\lambda)$. From the approximately known galactic mass and size, one could estimate (Weisskopf, 1961) a value $\lambda \approx 10^{-8}$.

Such a small effect would not have been noticed before, but when the new hypothesis H1 was brought forth it became a kind of challenge to experimental physicists: devise an experiment to detect this effect, if it exists, with the greatest possible sensitivity. Fortunately, the newly discovered Mossbauer effect provided a test with sensitivity far beyond one’s wildest dreams. The experimental verdict (Sherwin, et. al, 1960) was that lambda, if it exists, cannot be greater than $|\lambda| \leq 10^{-15}$. So we forgot about H1 and retained our null Hypothesis: H0 = Einstein’s theory of gravitation, in which lambda =0.

…

It is in the criterion for retaining H0 that we seem to differ: contrast the physicist’s rationale with that usually advanced by statisticians, Bayesian or otherwise. When we retain the null hypothesis, our reason is not that it emerged from the test with a high posterior probability, or even that it has accounted well for the data. H0 is retained for the totally different reason that if the most sensitive available test fails to detect its existence, the new effect can have no observable consequences. That is, we are still free to adopt the alternative H1 if we wish to; but then we shall be obliged to use a value of lambda so close to the previous 0 that all our resulting predictive distributions will be indistinguishable from those based on H0.”

I will just add to Jaynes’s comment that $|\lambda| \leq 10^{-15}$ is the width of a Bayesian style “uncertainty” interval and in no sense represents a frequency of anything. Moreover this way of doing things emerges naturally from a Bayesian approach done right (where “bayes” isn’t about Bayes Theorem especially but rather using the probability equations to model uncertainty in general).

Glad to see Laplace is still around.

From the perspective of this example, it seems that what we have is a decision theory: the cost of choosing \lambda = 0 as of ~1960’s with our resolution to see the consequences is effectively nil. Some day we may have the resolving power to actually measure a nonzero \lambda, at that point in time we’ll also have the resolving power to actually measure other consequences of the theory, and so leaving \lambda out of the theory would incur a cost of inability to correctly predict these sensitive experiments… At that point, the decision would change.

The two different viewpoints are roughly:

(1) accept H0 when lambda=0 is in some interval (Bayesian or Frequentist).

versus

(2) accept H0 when the uncertainty interval (Bayesian) is so tightly concentrated around 0 that using H1 would be functionally equivalent to using H0, so we might as well use H0.

The problem with (1) is that those intervals can easily contain lots of other values of lambda which are just as plausible, but have very different consequences. (2) eliminates that problem.

It’s probably easier for statisticians to think of (2) in terms of “when can we make simplifying approximations?”. The claim in (2) is basically “even if H1 is true, H0 is such a good approximation to it that our most sensitive experiments can’t tell the difference”. Ultimate truth can be left up to future generations with more sensitive instruments.

Nicely put.

This has been an interesting and (for me anyway) valuable discussion. Please keep it going.

I think the upshot is that the disagreement between us is less than might have been thought at first sight. We all agree that P values and confidence limits don’t ask the relevant questions. We all agree that it’s very undesirable to divide results into significant/non-significant. We all agree that there is a problem of reproducibility (worse in some areas than others). We all agree that the answer is not simply to reduce the p value threshold. And we all agree that priors cannot just be ignored.

So far, so good. The only disagreements are about some details of nomenclature (I don’t really care whether they are called false positives or unwarranted positives -it comes to the same thing either way). Perhaps the most important disagreement is about what should be done to improve reproducibility. It was disagreements about this which led to the ASA statement on P values being so useless.

If reproducibility is to be improved, we need recommendations that people can understand and which stand at least a small chance of being adopted. Recommending a full subjective Bayesian approach will probably never be accepted, so that’s out (in the short term at least). At the other extreme, everyone can understand the proposal to redefine “significant” as p = 0.005, but I guess that everyone here would regard that as a poor solution (actually so do Benjamin et al -I think that they proposed it as a temporary sticking plaster rather than as a permanent solution).

My proposal, to accompany P values with a statement of the prior that you would need to assume in order to achieve a false positive risk of, say, 0.05 may have some statistical impurity in some eyes, but it does have the advantage that it’s relatively easy to understand (and we have a web app to calculate the prior). At least it would have the advantage of quickly stopping the present near-universal practice of saying “P = 0.043 -I’ve made a discovery”. If P = 0.043 had to be accompanied by a statement that you would need to assume, before doing the experiment, an 85% chance that the discovery was right in order to claim that it was genuine, editors and authors would very soon abandon the P<0.05 myth. That, at least, I imagine we can all agree would be a good thing.

> I don’t really care whether they are called false positives or unwarranted positives -it comes to the same thing either way

You said in another comment that “a false positive doesn’t require that the true effect is exactly zero”. But the effect being exactly zero (i.e. the null hypothesis being true) is what makes a positive count as “false” .

You said as well that a false positive “has nothing to do with describing your results in a dichotomous way”. But the distinction between a “positive” and a “negative” is a dichotomy.

From the introduction of the bioRxiv paper linked (emphasis added):

“what you want to know is that _when_a_statistical_test_of_significance_comes_out_positive_, what is the probability that you have a false positive i.e. _there_is_no_real_effect_ and the results have occurred by chance.”

I fear that the distinction between exactly zero and close to zero comes into the category of nitpicking. That is so mathematically too as Berger and Delampady showed.

David:

I don’t think the effect of “power pose,” say, is exactly zero, nor do I think it is extremely close to zero. I think the effect is sometimes positive and sometimes negative, and that it varies by person and by context. Similarly, I don’t think reactions to hurricanes with boy and girl names are exactly identical, and maybe the reactions aren’t extremely close to identical either. To the extent the name of the hurricane matters, I think the effect is variable. Or, to take a more serious example, I don’t think the effect of early childhood intervention on later outcomes is zero, or necessarily extremely close to zero. I’m not interested in testing a hypothesis or proving it’s nonzero because I have no reason to imagine that it could be zero.

Andrew

That is quite an interesting example. The parallel in the drug field is as follows. You give a pill to a group of people, and each responds a bit differently. Is that because there is a systematic difference between people, or is it random? There is no way of knowing without repeatedly testing the same person, and that is hardly ever done. Nobody knows whether people can be classified as responders or non-responders. The data just aren’t there.

Likewise, nobody has ever determined in man the distribution of individual effective doses (IED) and without that, it isn’t possible to interpret properly the relationship between dose and response. When I wanted a distribution of IEDs in order to illustrate probit analysis in my 1971 book, all I could find was data from the 1920s on the bioassay of tincture of digitalis.

So you may say “I don’t think the effect of “power pose,” say, is exactly zero, nor do I think it is extremely close to zero. I think the effect is sometimes positive and sometimes negative, and that it varies by person and by context”. But that is just guesswork. The data don’t exist to back up such beliefs. It’s my understanding that a statistician’s job is to help people to reduce the amount of guesswork involved when analysing an experiment, not to increase it.

“The data don’t exist to back up such beliefs. It’s my understanding that a statistician’s job is to help people to reduce the amount of guesswork involved when analysing an experiment, not to increase it.”

I think this is fundamental to your misunderstanding. The role of a statistician is to tell you, after you’ve collected some data, just how much guesswork is still left. That is, to quantify your uncertainty post-data.

+ 1

+1

-1

Nope, I’m going to say that it is the statistician’s job to reduce the guesswork required, as stated by David. Not to misrepresent the uncertainty (although that can make you employable), but it is certainly their job to reduce it.

a reader: in the following sense your thing is true.

It should also be the job of a mathematical modeler / statistician to help you describe your science in a mathematical way, and from that description to help you decide what kind of data you would need to collect to reduce your uncertainty. But it’s the job of the experimenter’s collected data to actually give you the information that reduces the guesswork. The statistician can’t reduce the guesswork more than is warranted by the data and the model. Typically the model part is largely given to the statistician or, perhaps closer to the truth, extracted from the scientist by the statistician asking very pointed questions.

That is precisely what I meant

Sure fine, a Statistician / Mathematical Modeler should help you build better models of your phenomenon, and help you choose what data to collect, and part of that is eliciting what is known about how the system works, what the plausible parameter values are, etc and then encoding those things into priors. To say that there’s no data to support those priors is usually presumptuous. The truth is, people should explain what information informs their priors. That they don’t might be a cause for complaint, but there’s no reason to categorically state that all priors are “just guesswork”. Lots of priors are really meaningful information, particularly in the case of pseudo-mechanistic models with parameters that are interpretable in terms of meaningful physical/chemical/biological/psychological/economic mechanisms.

I fear that this is simply not true.

(David’s comments I mean)

What is the distinction between “close to zero” and “not close to zero”? Maybe I’m nitpicking again…

Is your definition of the FPR still valid when H0 is not well defined? Or are you using a different H0 in this case?

and on the question of dichotomy, I said that I think that it’s undesirable and unnecessary to describe the outcome of your experiment as “significant” or “non-significant”. I didn’t say that there is no dichotomy between truth and falsehood. That would indeed be a bizarre belief (except perhaps to some fake news enthusiasts).

If you assume a dichotomy between truth and falsehood, then you’ll see that when I use a random number generator on my computer to generate:

rnorm(100,1e-33,1) (normal with mean 1e-33 and sd = 1)

and then make the claim “my RNG has positive mean”

And you fail to reject the idea that my RNG had distribution normal(0,1) and you therefore claim that “your claim is a false positive”

then of course one of us is right, and it’s not you.

The bigger point here of course is that I can adjust the mean upward and if you take a continuous view as advocated by ojm this is no problem, but if you insist on false and true you are ultimately dichotomizing with all the problems that discontinuities induce.
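A minimal Python sketch of the RNG example above (the thread's own snippet is R's rnorm; the seed, the z-test with known sd = 1, and the helper name two_sided_p are assumptions made only for illustration). It draws 100 values with true mean 1e-33, tests H0: mean = 0, and records separately that the claim "positive mean" is true by construction, whatever the test reports.

```python
import math
import random


def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


TRUE_MEAN = 1e-33  # positive by construction, but far too small to detect

rng = random.Random(1)  # arbitrary seed, chosen only for repeatability
sample = [rng.gauss(TRUE_MEAN, 1.0) for _ in range(100)]

p_value = two_sided_p(sample)
claim_is_true = TRUE_MEAN > 0  # "my RNG has positive mean"

print("p-value for H0: mean = 0:", p_value)
print("claim 'positive mean' is true:", claim_is_true)
```

Whatever p-value comes out, the truth of the claim is fixed by TRUE_MEAN, so labelling the claim a "false positive" after a non-rejection would be mistaken in this setup.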

I’m not talking about the dichotomy between truth and falsehood.

I’m talking about the dichotomy between positive and negative, which seems hard to avoid if you want to define the false *positive* rate.

Re: Carlos

> You said in another comment that “a false positive doesn’t require that the true effect is exactly zero”. But the effect being exactly zero (i.e. the null hypothesis being true) is what makes a positive count as “false” .

> You said as well that a false positive “has nothing to do with describing your results in a dichotomous way”. But the distinction between a “positive” and a “negative” is a dichotomy.

Exactly.

Re: David

> At least it would have the advantage of quickly stopping the present near-universal practice of saying “P = 0.043 -I’ve made a discovery”.

Why not just say that observing P=0.043 is no justification for ‘announcing a discovery’, fullstop? Not every paper has to be – or should be – the ‘announcement of a discovery’. Science is cumulative.

Instead, publish your experiment, your data, some statistics, estimates and, especially, figures showing the data in various lights.

You give your interpretation of what is going on and why – e.g. ‘the estimate in the treated group is much higher than the control and seems to be stable wrt various manipulations such as…however, if we do this…then…’ etc.

Conclusions are tentative – e.g. ‘theory (a) appears adequate, however theory (b) is still plausible/possible or a simple model based on a particular form of noise gives results that are visually indistinguishable from what was measured…setting parameter two to zero makes no difference to model fit…we simply can’t measure this accurately or stably enough to determine a stable, accurate value of this parameter…etc’

Reviewers (etc) can read the paper and decide if it constitutes useful information and justified analysis. Provide full analysis/model code/scripts, raw data whenever legally possible etc.

If the goal is decision making (e.g. you must decide whether to implement or not a policy) then people can include a formal or semi-formal decision analysis where probabilities/estimates of various outcomes, impacts etc etc are accounted for.

Future scenarios can be forecast, likely outcomes, minimax/maximin outcomes etc etc put forward. Authors and decision makers can adopt policies based on this, while knowing there are always unknown unknowns to worry about and that they may have to revisit decisions if these become apparent.

In short – if people can’t tell what’s going on from that, _without you having to say_ “I’ve made a discovery” and without you saying “p<whatever" then perhaps you haven't made your case well enough.

Ojm:

Yes.

I fear that following ojm’s advice would make reproducibility worse. Your proposals seem to amount to saying just publish your results with a bit of arm-waving, and time will sort it out. Perhaps you don’t appreciate fully the extent of the crisis in reproducibility and the real dangers it poses to the whole scientific enterprise? Surely it is the job of statisticians to prevent this sort of thing happening, and so far they haven’t done a very good job of it.

You say

“Why not just say that observing P=0.043 is no justification for ‘announcing a discovery’, fullstop?”.

You can’t “just say” that, ex cathedra. You have to give people reasons, and you have to give them some alternative way of achieving the aims which they mistakenly thought P values provided. I have tried to do that. My proposals may not be perfect, but they are certainly better than the status quo.

Despite the pressures put on them, most scientists want to get things right. It is the job of statisticians to suggest ways of doing that. Not to make things worse.

I already pointed you to this paper I wrote recently that doesn’t use p-values or hypothesis testing and was, to the best of my abilities at the time, an attempt to follow my advice above:

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005688

Funnily enough this was me following, to the best of my abilities at the time, the advice I picked up from Andrew’s book and blog.

Yes, I looked at that paper. It’s entirely different from the sort of papers that are a cause for concern. It’s more like the sort of thing that used to be my day job -fitting mechanisms to interpret single ion channel recording data eg see http://www.onemol.org.uk/c-hatton-hawkes-03.pdf and http://www.onemol.org.uk/c-hatton-hawkes-03.pdf

In those we are fitting models, though we have the advantage that we are fitting physical models to a relatively simple process, rather than empirical models to very complicated processes.

I never found P values to be at all helpful in that process either.

But these papers are totally different from the usual sort of paper in experimental psychology or drug development. Most of the problem papers are of the form does intervention A affect outcome B. And too often it’s claimed that there is an effect. That is the problem.

It is entirely different by design. We could have easily just done pvalue/hypothesis test calculations. In fact we had to do a couple in the ‘sister’ paper we published in an experimental journal, solely because experimentalists expect pvalue and hypothesis tests. They shouldn’t.

Here’s another paper I wrote as a PhD student that’s perhaps also related to the sort of study you’re talking about

https://www.ncbi.nlm.nih.gov/m/pubmed/23430220/

The issue was controversy over interpreting aquaporin knockout studies in mice. Some of the experimental literature was arguing over whether an increase or decrease of whatever percent in knockouts was evidence for hypothesis A or B etc etc. All in terms of pvalues and things too.

My contribution was a simple qualitative mathematical model, a bit of dimensional analysis etc, that showed, I think, that the results were very much consistent with a simple mechanism (but also that we couldn’t really rule out an alternative one way or the other).

We didn’t give pvalues, just interpretation of the data in light of a plausible (but somewhat qualitative) mechanistic model. I know experimentalists who have been using the ideas of the model to try to design better experiments but haven’t kept track enough to know what’s happened with that.

I also co-wrote something on guidelines for reproducibility issues in computational biology once, with similar guidelines to the above:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4946111/

Note again: no hypothesis tests or pvalues recommended.

Though I regret not protesting the sentence:

> For hundreds of years, care has been taken over the reliability of experimental methods and result

more…

I also just wrote a short blog post sketching out suggestions for development of the more formal side of ‘…statistics without true models or hypothesis testing’

https://omaclaren.com/2017/09/26/a-sketch-of-statistics-without-true-models-or-hypothesis-testing/

(Note – your formulation fails to meet the requirement/goal sketched there of using continuous estimators whenever possible to ensure stability of analysis)

It’s interesting that these comments occur on a post which agrees that there is such a thing as a false positive in the legal case and that it matters. It’s not long ago that some innocent people were hanged in the UK, and of course that still happens in some US states.

I don’t think that anyone familiar with drug development would deny the analogy with the legal case. Most new drugs just don’t work. That’s why the process is so slow and expensive. And if you claim, wrongly, that an ineffective drug works, that can be a death sentence on the person who’s given it. That, I maintain, shows that there is a real problem which many people here seem to underestimate.

> The funny thing is, someone once told me that he had success teaching the concepts of type 1 and 2 errors by framing the problem in terms of criminal defendants. My reaction was that he was leading the students exactly in the wrong direction!

> I don’t think that anyone familiar with drug development would

Also, Stephen Senn literally wrote the book ‘Statistical issues in drug development’ and he has criticised your false positive analysis in precisely the way many here have criticised it.

The criterion for deciding whether an underlying “false positive” is a meaningful idea is *whether or not the underlying question is inherently discrete*.

So, “did person A break, enter, and burgle apartment B?” is pretty clearly discrete, it’s not like they could 1/3 do that or 2.20831 do it… they either did it or didn’t

But “does giving drug X reduce the duration of a headache or not” is just NOT really discrete, the actual question is “how does giving drug X change the duration of a headache on average?” and so there’s no point in trying to force it into a discrete setting.

No, not in precisely the same way as you at all. I know Senn quite well and have found him very helpful, despite his self-admitted tendency to say that everyone else is wrong. We have corresponded quite a lot about my 2017 paper which he kindly read. He has pointed out that there is a particular form of prior which, for one-sided tests, gives p values that are much the same as posterior probabilities. That’s referred to in my paper. That doesn’t mean that he thinks p values are an adequate way to assess data, or that he thinks that statistical analysis is redundant.

Stephen, in his book (to which I also refer) lists all the possibilities without coming down in favour of one. This seems to me to be less than helpful to the reader, but I think he agrees that my approach is one more-or-less plausible approach to the problem of irreproducibility.

I am referring to the fact that Senn has also pointed out how strongly the results depend on whether the problem is set up in a discrete way or a continuous way.

David

– obviously I am against the discrete framing of the problem here. But if you must set it up this way, I’m curious if you’ve looked at Royall’s bound on the ‘probability of observing misleading evidence’?

It’s something like P(l1/l0 > k ; H0) less than or equal to 1/k.

So the probability of observing evidence in favour of H1 over H0 with a likelihood ratio of at least k, while H0 is true, is bounded by 1/k. Note that it only applies to two _simple_ hypotheses though, so ‘not H0’ would not be a valid H1.

Eg a ratio of 15 in favour of the alternative occurs with prob less than 1/15 under H0.

It’s pretty much a reinterpretation of the Neyman-Pearson lemma, I believe.
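As a hedged illustration of the bound just described, the sketch below computes P(L1/L0 >= k; H0) exactly for one assumed setting that is not from the thread: n independent normal observations with sd 1, H0: mean 0 against the simple alternative H1: mean mu1. The function name prob_misleading and the choice mu1 = 0.5 are hypothetical. The worst case over sample sizes is then compared with 1/k.

```python
import math


def prob_misleading(k, n, mu1):
    """P(L1/L0 >= k; H0) for H0: mean 0 vs H1: mean mu1 > 0, sd 1, n obs."""
    # L1/L0 = exp(n*mu1*xbar - n*mu1**2 / 2), so L1/L0 >= k exactly when
    # xbar >= log(k)/(n*mu1) + mu1/2.  Under H0, xbar ~ Normal(0, 1/n).
    threshold = math.log(k) / (n * mu1) + mu1 / 2
    z = math.sqrt(n) * threshold
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of standard normal


MU1 = 0.5  # assumed alternative, for illustration only

for k in (2.76, 3.64, 15, 20):
    worst = max(prob_misleading(k, n, MU1) for n in range(1, 500))
    print(f"k = {k}: worst case over n = {worst:.4f}, bound 1/k = {1 / k:.4f}")
```

In this normal setting the exact probabilities sit well below 1/k, which is consistent with 1/k being an upper bound rather than an estimate of the probability itself.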

Now – I wouldn’t really recommend we keep thinking in these terms as such but this is another proposal that sticks closely to the current framework of hyp tests.

(And as Mayo is quick to emphasise the two simple hypotheses must be predesignated before collecting data for the bounds to hold. But that should be fine for your simulations. So a ratio of 20:1 or more should occur less than 5 percent of the time)

(…under the null…)

So rather than introduce a prior prob you could say eg ‘these data would have to be 20 times more likely under the assumption of an effect than under the assumption of no effect in order for me to not falsely reject the null more than 5 percent of the time’

On a quick skim of your paper you give examples where the likelihood ratio is 2.76 and where it is 3.64.

Your method of calculating the false positive rate gives about 26% and 22% for these cases.

Royall’s method would give upper bounds of 1/2.76 = approx 36% and 1/3.64 = approx 27% respectively.

So, in this case at least, we would be even more skeptical using Royall’s bounds than yours, and we don’t need a prior.

(Of course all of this is working within your framing of the problem).

No, I wasn’t aware of Royall’s limits. It’s good to know that another approach gives similar results to mine. The same is true of the approaches of J. Berger and V. Johnson. All these approaches suggest that if you observe P = 0.05, you’ll get a minimum false positive risk of 20-30% (and much higher if prior odds are less than one).

I’ll admit to being mildly alarmed to hear that all you’ve done is “skimmed my paper” :-(

How about: ‘on a quick re-read to find an example to compare’

PS I find Royall’s argument much clearer than yours and Berger’s etc etc.

No lump probabilities, priors etc etc. Just what effectively amounts to the NP lemma from their hypothesis testing approach but presented in an even simpler manner.

So for these cases all you need to do is report the likelihood ratio for null against another _simple_ hypothesis and you can take the reciprocal to get bound.

His book/papers detail more.

It also illustrates the unfortunate issue that Andrew has called ‘type M’ errors: if you are studying an area where effects are unlikely, then when you do accidentally ‘find them’ (false positive) it will tend to be very overestimated: 1/k small means k large.

I have had a recent twitter exchange with David Colquhoun that was along the lines of the discussion here over the past few days. I hope, Andrew, that you don’t mind me expanding on it here.

Much is made of the analogy between diagnostic and scientific hypothesis testing. I see the situation from the viewpoint of a diagnostician. My first point was that a ‘false positive rate’ (FPR) would be the probability of observed data (e.g. 456/600 = 0.76) conditional on a null hypothesis H0.5 (e.g. 0.5). This FPR is not a P-value, which for example would be the probability of seeing 456/600 = 0.76 or something more extreme (e.g. 457, 458, … 600/600) conditional on H0.5. Therefore, we cannot use P-values directly in Bayesian calculations by using prior probabilities etc. as we would for diagnostic screening. Finding a likelihood ratio for the observation based on only two possible hypothetical ‘true’ proportions H0.5 and H0.76 (the latter being the expected true result equal to the observed result of 0.76) does not take us much further. It does not consider all the other possible hypothetical results.
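The distinction drawn above can be checked numerically. This is a small sketch using the comment's own figures (456 out of 600, null proportion 0.5); the helper name log_binom_pmf is an assumption. It computes the probability of exactly 456/600 under the null, the one-sided P-value (456 or more), and the likelihood ratio between the two point hypotheses 0.76 and 0.5.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


N, K, NULL = 600, 456, 0.5

# Probability of the observed result itself, conditional on the null
point_prob = exp(log_binom_pmf(K, N, NULL))

# One-sided P-value: the observed result or anything more extreme
p_value = sum(exp(log_binom_pmf(k, N, NULL)) for k in range(K, N + 1))

# Likelihood ratio for the two point hypotheses 0.76 and 0.5
log_lr = log_binom_pmf(K, N, K / N) - log_binom_pmf(K, N, NULL)

print("P(exactly 456/600 | 0.5)  =", point_prob)
print("P(456/600 or more | 0.5)  =", p_value)
print("log likelihood ratio 0.76 vs 0.5 =", log_lr)
```

The two probabilities are of the same tiny order but are not equal, which is the point being made: the tail area is a different quantity from the likelihood of the observed result.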

It seems to me that the concept of statistical significance and scientific replication used in scientific reasoning is not analogous to diagnostic screening but analogous to diagnostic test precision. The precision of a diagnostic test is not affected by the prior or posterior probability of a diagnosis, but the probability of a diagnosis is affected by the precision of the test. In the same way, the statistical significance of a research finding is not affected by the prior or posterior probabilities of a scientific hypothesis but the probability of a scientific hypothesis is affected by the statistical significance of a finding on which it is based.

This was the issue that Bayes addressed in his paper, by calculating the probability that a true proportion (after an ideally large or infinite number of observations were made) would lie between two limits conditional on an observed proportion based on a limited number of observations. He reasoned that this could be done because the possible outcomes of an experiment that could be modelled by random selection from a single population would be equally probable. I reason in the same way in a recent Oxford University Press blog by showing that uniform priors are a necessary consequence of random sampling and do not need to be assumed: https://blog.oup.com/2017/06/suspected-fake-results-in-science/

When a Bayesian includes a subjective prior probability distribution in a random selection model, then this is equivalent to performing a meta-analysis that combines subjectively imagined data with real data by using likelihood distributions. The Bayesian prior probability can always be regarded as a posterior probability based on uniform priors multiplied by a normalized subjective likelihood distribution. This posterior but new subjective prior distribution is then multiplied by the likelihood distribution of real data and normalized again to give another posterior probability distribution.

If data sets are ‘independent’ so that they can be combined by assuming statistical independence between their likelihood distributions, any number of data sets can be combined in this way. It is basically a Bayesian way of performing a meta-analysis. For example, multiplying the likelihood distributions for the observations of 12/30 and 8/20 and then normalizing them will provide the same result as normalizing the likelihood distribution of (12+8)/(30+20) = 20/50. However, care has to be taken that any prior ‘observations’ were made ‘prospectively’ (e.g. they were part of the current study) and not recalled with hindsight to support some wished-for outcome, thus biasing the result (which would have a similar effect to ‘P-hacking’ or ‘publication bias’).
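The 12/30 and 8/20 example can be verified on a grid of candidate proportions. The sketch below is only an illustration: the grid resolution and the function name normalized_likelihood are assumptions. It multiplies the two normalized binomial likelihoods, renormalizes, and compares the result with the normalized likelihood of the pooled 20/50.

```python
def normalized_likelihood(successes, trials, grid):
    """Binomial likelihood over a grid of proportions, scaled to sum to 1."""
    raw = [p ** successes * (1 - p) ** (trials - successes) for p in grid]
    total = sum(raw)
    return [value / total for value in raw]


# Candidate 'true' proportions (assumed resolution of 0.001)
grid = [i / 1000 for i in range(1, 1000)]

first = normalized_likelihood(12, 30, grid)
second = normalized_likelihood(8, 20, grid)

# Combine by multiplying pointwise and normalizing again
product = [a * b for a, b in zip(first, second)]
total = sum(product)
combined = [value / total for value in product]

# Likelihood of the pooled data, (12 + 8) / (30 + 20) = 20 / 50
pooled = normalized_likelihood(20, 50, grid)

max_diff = max(abs(c - q) for c, q in zip(combined, pooled))
print("largest pointwise difference:", max_diff)
```

The agreement holds because the binomial coefficients are constants that cancel on normalization, leaving p^20 (1 - p)^30 either way.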

This only applies to the ‘precision’ of a study (its ability to predict a mean after an ideally very large number of observations). However, if the result of a study (analogous to a diagnostic test result) is to be applied to testing possible scientific hypotheses (analogous to differential diagnoses) then the prior probabilities, sensitivities, false positive rates, false negative rates and specificities do come into play.

I agree with ojm and recommend abandoning hypothesis testing. If you think all statistical models are false, as I do, then it makes no sense to ask whether a null hypothesis involving such a model is true. The hypotheses are often formulated in terms of parameters H_0: mu=0 of a model but these are constructs of the mind with no ontological significance. How can mu, a construct of the mind, have a true value, how can it have a false value? You can give mu ontological significance by identifying it with some property of the real world. Given measurements of the quantity of copper in a sample of drinking water and a legal upper bound in some units of 2.5 I can frame the real world question of interest, does the quantity of copper in this sample exceed the legal limit, in terms of a hypothesis H_0: mu >= 2.5. The statistician will now behave as if the model were true, derive an optimal estimator, derive a posterior or whatever and transfer the result back to the real world. This model is not true but I will behave as if it is and make recommendations based on this. How to connect the parameters to the real world is not always clear. Suppose that for the copper example I use the log-normal family on the grounds that the amount cannot be negative. How do I connect the parameters of the log-normal distribution to the quantity of copper in the water?

Judging by ojm’s twitter exchange it seems as if rejecting hypothesis testing is a heresy something akin to post truth. On the contrary the post truth description is more applicable to statisticians who talk in terms of truth when formulating hypotheses in terms of parameters. Just how deep seated this is can be seen from the reaction to my analysis of the birth data. Andrew introduced it by stating

Laurie Davies sent along this non-Bayesian analysis he did that uses residuals and hypothesis testing.

Not at all. Andrew had not read it. There is no hypothesis testing, there is no model. My motivation for my book on statistics was to do it all in terms of approximation, there is no hypothesis testing in the whole of the book.

Andrew

Your helpful paper ‘Gelman A. P values and statistical practice. Epidemiology 2013, 24, 1, 69-72’ (http://www.stat.columbia.edu/~gelman/research/published/pvalues3.pdf) discussed the relationship between P-values and posterior probabilities from a Bayesian viewpoint. I would like to discuss the implications of my previous post in the light of what you discussed in that paper. I would like to draw David C’s attention to these issues.

On the basis that the possible outcomes of a random sampling from a single population with an unknown ‘true’ result (i.e. a ‘true’ mean or proportion after an infinite number of selections) are all equally probable of necessity (see: https://blog.oup.com/2017/06/suspected-fake-results-in-science/) then it is possible to estimate the probability that an unknown ‘true’ result conditional on some observations will fall in any specified range. The range can be above or below some single value (e.g. a null hypothesis) or between two bounds (i.e. a Bayesian credibility interval).

If we are predicting a ‘true’ proportion from an observed proportion, the calculation will be based on the binomial likelihood distribution for the observed proportion, which of course may be asymmetrical. In this case, the probability of the true result being more extreme than the null hypothesis may not be equal to the one-sided P-value as the latter will be based on another binomial probability distribution centered on the null hypothesis. However, if the observed result’s likelihood distribution and the null’s probability distribution are both symmetrical (e.g. Gaussian) and have the same variance, the one-sided P-value and the probability of the null hypothesis or something more extreme will be the same.
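A small numerical check of the asymmetric binomial case, under assumptions that are not in the comment: an observed 12/30, a null proportion of 0.5, and a uniform prior evaluated by midpoint integration on a grid. It compares the one-sided P-value with the probability that the true proportion lies at or beyond the null. (In the symmetric Gaussian, equal-variance case the two agree by symmetry, so only the binomial case is computed.)

```python
from math import comb

N_TRIALS, N_SUCCESS, NULL = 30, 12, 0.5  # hypothetical observation and null

# One-sided P-value: P(X <= 12) when the true proportion equals the null
p_value = sum(
    comb(N_TRIALS, k) * NULL ** k * (1 - NULL) ** (N_TRIALS - k)
    for k in range(N_SUCCESS + 1)
)

# Probability, under a uniform prior, that the true proportion is >= NULL,
# using the likelihood of 12/30 evaluated at the midpoints of a fine grid
STEPS = 20000
grid = [(i + 0.5) / STEPS for i in range(STEPS)]
weights = [p ** N_SUCCESS * (1 - p) ** (N_TRIALS - N_SUCCESS) for p in grid]
posterior_beyond_null = (
    sum(w for p, w in zip(grid, weights) if p >= NULL) / sum(weights)
)

print("one-sided P-value:            ", round(p_value, 4))
print("P(true proportion >= 0.5 | x):", round(posterior_beyond_null, 4))
```

With these numbers the two quantities are close but not identical, because the P-value is built on the distribution centred on the null while the other is built on the likelihood of the observed proportion.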

So when the likelihood and probability distributions are similar, the probability of the ‘true’ result being LESS extreme than the null hypothesis will be approximately ‘1- one-sided P’ and the probability of the ‘true’ result falling within the 95% confidence interval will be about 0.95. However, those true results adjacent to the null hypothesis would not for most practical purposes be regarded as being clinically or scientifically significant and so the ‘one minus the P-value’ would exaggerate the latter’s probability and best be regarded as an upper bound of such a probability. Perhaps it would be better for the scientist or clinician to specify a range of practical significance and for the statistician to calculate its probability conditional on the observed result. Other factors in addition to this conditional probability would also have to be taken into account in order to make a complete assessment of course (e.g. from a diagnostician’s point of view: https://blog.oup.com/2013/09/medical-diagnosis-reasoning-probable-elimination/).
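To make the upper-bound remark concrete, here is a sketch for the Gaussian case with invented numbers (an observed effect of 2.0, standard error 1.0, null of 0, and a practical-significance threshold of 1.0; all four are assumptions, as are the function names). It treats the true effect as Normal(observed, se^2), as the argument above does, and compares ‘1 - one-sided P’ with the probability of exceeding the practical threshold.

```python
import math


def upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def prob_true_effect_above(threshold, observed, se):
    """P(true effect > threshold), taking the true effect as Normal(observed, se^2)."""
    return upper_tail((threshold - observed) / se)


OBSERVED, SE = 2.0, 1.0        # hypothetical estimate and standard error
NULL = 0.0                     # null hypothesis value
PRACTICAL_THRESHOLD = 1.0      # hypothetical smallest effect that matters

one_sided_p = upper_tail((OBSERVED - NULL) / SE)
prob_beyond_null = prob_true_effect_above(NULL, OBSERVED, SE)
prob_practical = prob_true_effect_above(PRACTICAL_THRESHOLD, OBSERVED, SE)

print("one-sided P:                 ", round(one_sided_p, 4))
print("1 - P = P(true effect > 0):  ", round(prob_beyond_null, 4))
print("P(true effect > threshold):  ", round(prob_practical, 4))
```

Any threshold above the null gives a probability no larger than 1 - P, which is the sense in which ‘one minus the P-value’ acts as an upper bound.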

“Perhaps it would be better for the scientist or clinician to specify a range of practical significance and for the statistician to calculate its probability conditional on the observed result”

Except that this is just an attempt to shoehorn statistics into the box that people who currently rely on p values want it to be in. Perhaps what would be better would be to teach scientists the logic of science instead of the logic of p values, and then we could publish posterior distributions and discuss tradeoffs/utilities, and come to rational conclusions.

There is never going to be a “range of practical significance” for which everyone will agree “cut off at effect size = A because A-\epsilon would cease to have any meaning whatsoever for all epsilon greater than zero” but this is what the “range of practical significance” requires.

The continuous version of this is bayesian decision theory and has been well understood for a lifetime or so.

I agree that providing a posterior distribution of the true mean or proportion would be better with a device to estimate the posterior probability of the true mean or proportion falling into any range of the scientist’s choice.