Had we not checked for an interaction, we likely would never have been aware of that – they were surprised to find out they needed to tell us that.

I withdrew from the project as the stats dept chair over-ruled my concerns about dependencies in the gene expression data, arguing that biological dependence does not imply statistical dependence (which admittedly only occurs with probability 1 or something close to 1).

]]>The paper is:

Expression of A152T human tau causes age‐dependent neuronal dysfunction and loss in transgenic mice

See figure 11. Basically, you want to compare the effect of Dox between figures A and B. Note that Dox appears to do nothing except in the case of the triple-transgenic mice. To me, at first glance it seemed like extreme cherry-picking…until you understand that the science dictates that behavior.

You’ll note that I’m thanked, but not listed as an author. The cherry-picking comment may have been responsible for that one…

]]>You’re definitely right that, in retrospect, I should have chosen factors other than 2, since 2+2 = 2*2 .

]]>Bob

]]>It is an empirical question, not distribution theory – right?

]]>“Then, what is standard deviation of X+Y (additive effect)? What is the standard deviation of X*Y (multiplicative effect)? A standard deviation of 0.75 for X+Y (or X*Y) seem unreasonable.”

This is well-known. For X+Y, var(X+Y) = var(X) + var(Y). For XY, to first order, it’s var(XY)/(XY)^2 = var(X)/X^2 + var(Y)/Y^2, where X and Y in the denominators stand for the means. Of course, that’s when X and Y actually have independent errors.
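A quick simulation can check both formulas. This is only a sketch: the means and standard deviations below are made-up illustration values, not numbers from the post, and the helper `variance` is just a hand-rolled sample variance.

```python
import random

def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 100_000

# Hypothetical illustration values, not taken from the post.
mu_x, sd_x = 10.0, 0.5
mu_y, sd_y = 20.0, 1.0

# Independent errors, as the formulas require.
xs = [rng.gauss(mu_x, sd_x) for _ in range(n)]
ys = [rng.gauss(mu_y, sd_y) for _ in range(n)]

# Sum: variances add.
var_sum_sim = variance([x + y for x, y in zip(xs, ys)])
var_sum_formula = sd_x ** 2 + sd_y ** 2

# Product: relative variances add (first-order approximation).
rel_var_prod_sim = variance([x * y for x, y in zip(xs, ys)]) / (mu_x * mu_y) ** 2
rel_var_prod_formula = (sd_x / mu_x) ** 2 + (sd_y / mu_y) ** 2

print(var_sum_sim, var_sum_formula)
print(rel_var_prod_sim, rel_var_prod_formula)
```

The product formula drops a small cross term, (var(X)/X^2)(var(Y)/Y^2), so the agreement is approximate and gets worse as the relative errors grow.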

]]>Bob.

]]>I find this helpful – http://andrewgelman.com/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

In two ways.

More obvious is the subset of p < .05 studies.

Perhaps less obvious is that the standard CI procedures follow from what would be ideal for applications where it is expected (based on current scientific understanding) that extremely large positive and negative effects are just as likely as tiny effects. That is, they result from assuming priors for effect size like that – this brings into question what the seemingly good property of uniform coverage of standard CI procedures is actually good for here.

(By the way, it is not something you can see in a single study in isolation – it’s a repeated application property.)

]]>A very old distinction that perhaps needs more emphasis today.

(some historical links here http://www.stat.columbia.edu/~gelman/research/unpublished/amalgamating4.pdf)

I do think hierarchical modelling challenges one to be more thoughtful about many things.

> common NHST

Most classical statistics were developed to evade important distinctions (i.e. render them not that critical) in circumscribed applications by very bright mathematicians – unfortunately they were seldom clear about that!

(Fisher’s null, Neyman’s null, additive effects, common variances, common intercepts, etc., etc.)

There are zillions of examples. Just about any published result based on a noisy study and p less than .05. Power pose, embodied cognition, himmicanes, beauty and sex ratio . . . the examples are endless. Just open any issue of Psychological Science published between 2010 and 2015. If you want a single example, consider that study that claimed that single women were 20 percentage points more likely to vote for Barack Obama at a certain time of the month, compared to the corresponding change among married women. The confidence interval there would be 20% +/- 8% or something like that. Any true effect would be in the neighborhood of 2% or less. Huge bias. John Carlin and I discuss this in our paper on type M and type S errors.
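To put rough numbers on that bias, here is a sketch of a type M / type S calculation under a normal sampling model. The inputs are assumptions for illustration: a true effect of 2 percentage points (the upper end of "2% or less") and a standard error of 4 (reading "+/- 8%" as about two standard errors). The function name `design_analysis` is hypothetical, and this is not the code from the paper.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def design_analysis(true_effect, se, z_crit=1.96):
    """Power, type S rate and exaggeration ratio for estimate ~ N(true_effect, se).

    Assumes true_effect > 0; conditions on |estimate| > z_crit * se.
    """
    c = z_crit * se
    a = (c - true_effect) / se       # standardized upper cutoff
    b = (-c - true_effect) / se      # standardized lower cutoff
    p_pos = 1 - normal_cdf(a)        # significant with the correct sign
    p_neg = normal_cdf(b)            # significant with the wrong sign
    power = p_pos + p_neg
    type_s = p_neg / power
    # E[|estimate|] restricted to the significant region (truncated normal means)
    mean_abs_sig = (true_effect * p_pos + se * normal_pdf(a)
                    - true_effect * p_neg + se * normal_pdf(b))
    exaggeration = mean_abs_sig / power / true_effect
    return power, type_s, exaggeration

# Assumed inputs: true effect 2 points, standard error 4 points.
power, type_s, exaggeration = design_analysis(true_effect=2.0, se=4.0)
print(power, type_s, exaggeration)
```

Under these assumed inputs the power is low and the estimates that reach significance overstate the true effect several-fold, which is the sense in which the bias is huge.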

]]>I’m also wondering whether the criticism 2a) is only true for confidence intervals or concerns interval estimates more generally.

]]>I continue to disagree with you. There is neither scientific nor policy interest in the claim that the average of the positive and negative effects of power pose in some undefined population of people and scenarios happens to be very close to zero. This average will depend crucially on what people and what scenarios are included in the population, and these issues are never addressed in any of the literature, which should give a clue that this is not what people are talking about. What they are testing is the hypothesis that power pose has exactly zero effect in all situations (what Rubin calls the “Fisher null hypothesis”), and I think this hypothesis is nonsensical.

]]>If your argument is about the infinitesimally small probability of any particular exact outcome then it is a very uninteresting argument. And it is not an argument that is relevant to your point number 1.

]]>But I was just observing that the whole set up of the story doesn’t really make sense; it would if the control results and the combined results were flipped or if the scale was “shrinkage” rather than “growth”. At least it was just a hypothetical ]]>

No, the population mean is not even defined. But if it were defined, it would not be exactly zero, it would be the average of whatever it’s averaging over. You can’t average a bunch of continuous numbers and get exactly zero. And, beyond this, there’s no reason to be interested in some arbitrary average of effects.

]]>“…that’s NOT the actual data the researcher provided…”

]]>From this post, I really don’t think there’s enough information to decide whether everything’s based on flawed logic or not (unless there’s some link to the real paper I missed). First off, that’s the actual data the original researcher presented. Second, we don’t know the background behind why it was decided that the interaction should be stronger.

I will definitively say that if the above data was ALL they had and that’s what led them to the conclusion, then that’s easily interpreted as the confusion of p-values leading to very bad conclusions.

But now consider another scenario. I’m not saying I know the researcher started the study with the following information, but let’s just pretend they do. The biology tells us that treatment A has molecules that are extremely likely to bind with cancerous cells. Treatment B has molecules that will destroy cells in which A appears. In this scenario, we would expect near 0 effects of A and B alone, but strong effects of A + B. And with regards to standard deviations of the response variable “tumor sizes”, we should expect A and B (alone) to have about the same standard deviation…but A + B should have much smaller standard deviation (although a log transform may fix that)!

IF that were our background, and we saw that A + B was statistically significant, A, B alone were not, and (A) + (B) less than (A+B) (i.e. interactive effects greater than individual effects) I would say the data supported the researcher’s hypotheses. I would have a lot of faith that the A + B effect was real, and would say this supports the idea that (A) + (B) less than (A+B). We might want to do a formal test to address the strength of the evidence of (A) + (B) less than (A+B).

On the other hand, if a priori they had no idea about the molecular effects of A, B and they hypothesized (A+B) greater than (A) + (B) after seeing results, my posterior would look a whole lot like my prior.

(Also not quite sure why you say (A+B) was not applied to real patients? In both my example, and the example in Andrew’s post, it seemed that (A+B) was applied to real patients. Well, in my case, the “patients” were mice. If those mice were patients, they should really consider switching care providers).

]]>I do feel that defining a hierarchical model in Stan makes it likely that an analyst will distinguish between natural variations of the quantity of interest and phenomena that distort the measurement of that quantity (noise). In contrast, perhaps why I brought this up, common NHST does not seem to me to make such a distinction.

Bob.

]]>So your argument is really that the statistical model focussed on the population mean is inappropriate. Is that right? (That would not be the same thing as your point number 1, as far as I can see.)

I’m not just making an argument here, I really cannot join the dots in a manner that makes your point number 1 valid. You are doing something differently from me.

]]>Then, what is the standard deviation of X+Y (additive effect)? What is the standard deviation of X*Y (multiplicative effect)? A standard deviation of 0.75 for X+Y (or X*Y) seems unreasonable. Theoretically, we can always play with the covariance between X and Y to get a standard deviation of 0.75.
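As a sketch of what "playing with the covariance" would take, the snippet below solves var(X+Y) = var(X) + var(Y) + 2 cov(X, Y) for the correlation. It assumes X and Y each have a standard deviation of 0.75, which is my reading of the example rather than something stated here.

```python
# Assumption: each single-treatment effect has SD 0.75, and we want SD 0.75 for X+Y too.
sd_x = 0.75
sd_y = 0.75
sd_target = 0.75

# var(X+Y) = var(X) + var(Y) + 2*cov(X, Y), solved for cov(X, Y)
cov_needed = (sd_target ** 2 - sd_x ** 2 - sd_y ** 2) / 2
rho_needed = cov_needed / (sd_x * sd_y)

# With independent errors (cov = 0) the SD of the sum would instead be:
sd_sum_independent = (sd_x ** 2 + sd_y ** 2) ** 0.5

print(rho_needed, sd_sum_independent)
```

So under these assumptions the sum keeps an SD of 0.75 only if X and Y are negatively correlated at -0.5; with independent errors the SD of the sum is about 1.06.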

It’s obvious that if the standard deviation of X+Y (or X*Y) remains 0.75, the resulting p-value would be smaller, as explained in the post, because we expect a larger sample mean effect due to the combined treatment. However, is such a simulation not unrepresentative? I am skeptical about this claim.

Am I missing something?

Cliff AB,

I simply don’t see how one can ask to test whether the combined treatment A+B has a significant effect if the combined treatment is not applied to real patients? It felt like fabricating data to me. :)

]]>As a biostatistician, I had an important lesson similar to this one. A researcher came to me and said “I think A by itself should not have an effect, neither should B, but A+B should. Could you test this so I can put some p-values in my paper?”. Looking at the data, this was the trend you saw, but it felt very cherry picked to me (they also asked for something like 20 other tests to be run). My impression was that they saw this trend, post-hoc pretended that it was what they wanted to see, and asked me for the power to publish it. I fired back “hey, I don’t do cherry picked analyses”. There was tension.

But I didn’t leave the study. As I became more involved, I realized that there was VERY good scientific reason for this hypothesis of an interaction effect without main effects to be true, and that reason had nothing to do with the data they gave me. In the way that they presented the analyses to be done, I would say that they didn’t quite have a firm understanding of good statistical practices. But the more I learned about the study, the more I realized that they had a very good understanding of proper scientific processes, and were not just p-hunting for a publication. My assumption that “non-statisticians are abusing statistics to get publications” actually meant that I was discarding good prior information. My prior about the quality of their priors was wrong.

After a while, I issued an extremely awkward “sorry for the cherry picking comment”.

Is that what happened in the study discussed in this blog? I certainly do not have enough information to conclude “yes”. But if we are too quick to conclude “you’re not a statistician, so you’re just abusing p-values to get something”, we might be missing out on good scientific research, even if it’s masked behind bad/confused statistical practices.

]]>“Noise” is not a well defined term. As you indicate, it is used to describe some aspect of variation. What aspects of variation are called “noise” depends on how you think of the problem. One reason I like to work with generative models (“Bayesian inference”) is because this forces some clarity on different sources of variation.

]]>An example. I just went up to my kitchen and performed two experiments.

Experiment 1. I used the kitchen scale to weigh a single orange (clementine) 10 times—placing the orange at different places on the scale. Result, 6 measurements of 70 g, 4 measurements of 71 g. So, (based on a very limited set of measurements) I conclude that the noise in the measurement has a mean value in the ballpark of 0.5 g with an SD of about 0.5. There is also some quantization noise floating around here (the scale only resolves to 1 g) with SD of about 0.3. But, let’s ignore that.

Experiment 2. I used the same scale to weigh 5 different oranges (not including the 70 g orange). I got 85, 95, 78, 72, and 80 g for the five measurements. The SD of these measurements is 8.6. But 8.6 is not the SD of the measurement noise, which seems to be less than 1. Rather, 8.6 is some statistic about (a) the biology of clementine oranges and (b) the sorting and packaging process used by the wholesaler.
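The two spreads can be reproduced from the numbers above. This is a small sketch using Python's `statistics` module; the variable names are mine, and the quantization figure assumes the rounding error is uniform over a 1 g step.

```python
import math
import statistics

# Experiment 1: one orange weighed 10 times (6 readings of 70 g, 4 of 71 g).
repeat_weighings = [70] * 6 + [71] * 4
# Experiment 2: five different oranges, one reading each.
different_oranges = [85, 95, 78, 72, 80]

sd_measurement = statistics.stdev(repeat_weighings)    # spread from the scale itself
sd_between = statistics.stdev(different_oranges)       # spread across oranges

# SD of rounding error for a 1 g resolution, assuming a uniform error: 1/sqrt(12)
sd_quantization = 1 / math.sqrt(12)

print(round(sd_measurement, 2), round(sd_between, 1), round(sd_quantization, 2))
```

The between-orange spread is more than ten times the repeat-weighing spread, which is the point: only the smaller number describes the measurement process.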

It seems to me that referring to unknown sources of variation as noise risks creating confusion. Am I the only person who is troubled by this use of language?

Bob

]]>What’s confusing is that the post says that the combined group has “the same” noise as the alone groups, but in the data for the combined group it is much bigger. Yes, it could be chance, but when you did the combined group did you take two values sampled from the same distributions as the #1 and #2 groups and multiply them? Or did you take 10 values from a distribution with mean of 4 and standard deviation of .75? Your boxes in the graph make it seem like the former but the horizontal arrows make it seem like the latter (the larger standard error for the combined group isn’t reflected).

The example (especially as excerpted here–I think the full post is more clear) is also confusing because choosing 2 is confusing: 2*2 = 2+2. Also, with s = .75 and n = 10 the standard error is .24, so I think I would choose means that were closer to each other. On that scale, comparing samples from populations with means of 1 and 2 is pretty likely to give you a p value < .05.
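The arithmetic behind that can be sketched as follows, assuming two independent groups of n = 10, each with SD .75, and population means of 1 and 2. The normal-approximation z-score is only a rough guide; a t-test on n = 10 would differ somewhat.

```python
import math

sd = 0.75      # per-observation standard deviation
n = 10         # observations per group
mean_a, mean_b = 1.0, 2.0   # assumed population means of the two groups

se_mean = sd / math.sqrt(n)              # standard error of one group mean
se_diff = math.sqrt(2) * se_mean         # standard error of the difference in means
z_expected = (mean_b - mean_a) / se_diff # expected difference in standard-error units

print(round(se_mean, 2), round(se_diff, 2), round(z_expected, 2))
```

An expected difference of roughly three standard errors sits well past the usual 1.96 cutoff, which is why a comparison of means 1 and 2 on this scale will usually come out "significant".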

Of course the whole point is not to be binary anyway.

]]>Nope. Power pose does not have “a positive or negative effect.” The effect of power pose depends on the person and situation. It’s sometimes positive and sometimes negative. That’s life.

]]>For your example of the power pose, the logic that the null hypothesis cannot be exactly true is not very helpful, because in a case where you would not be surprised by either a positive or a negative effect, the boundary between positive and negative is not only a live point but an interesting one. (I do not see why a statistical analysis that does not use a prior should be affected by your opinion that the power pose “has _some_ effects”.)

]]>“… And indeed, in Raghu’s example they are directly additive… So we actually do have some valuable information here amongst the noise, even though the p value demonstration of it is nonsense.”

I think what’s relevant here is that a non-statistically-significant result should not be replaced by a zero value. After all, the mean *is* the best estimate of the value (in a least squares sense, anyway). In Raghu’s admirable example, we have two “non-significant” but also non-zero results. If we simply call them zero, we throw away some experimental knowledge and therefore bias the conclusion.

If we are trying to demonstrate that General Relativity is wrong, then yes, we no doubt want a very high standard here – way beyond a “reasonable doubt”. If we want to screen for potentially effective drugs, we’d probably want to say “Drug one – interesting, looks good but needs some verification in case it’s only chance. Same for Drug 2. The fact that the combo seems to be (statistically speaking) effective suggests that *at least one* of them is good.”

If Raghu were to rework his example, I’d suggest using some different numbers so that the product and sum come out different (the way it is now, 2+2 = 2*2, which can confuse us), and making sure of the standard deviation of the two-drug random data. But it wouldn’t really change much. The example is pretty nice as it is.

]]>“Tom, the fact that you imply Mayo would support the notion that a small P-value in isolation provides a license to publish indicates that you are completely unfamiliar with Mayo’s philosophy.”

Hmm, sounds ad hominem to me… however … I’ve been reading her blog for a few years now. I just responded to the words in this one comment. When I wrote my words, I in no way thought what you claim I must have. In fact, my intent was to suggest that Mayo and Andrew might be talking about slightly different things.

]]>We could do this forever, but . . . My point number 1 is true. Just to clarify, you write, “they retain their full meaning whether the null hypothesis is true or not.” The null hypothesis is false, we already know that. What I wrote was, “In settings where the null hypothesis is not a live option, the p-value does not map to anything relevant.” In just about all the problems I’ve seen, the null hypothesis is *not* a live option. For example, we can all agree that power pose has *some* effects; where I depart from the power pose promoters is in their claim that these effects are large and repeatable.

Experiments using single doses of drugs are often inadequate for their intended purpose. If that purpose is the assessment of synergy then you should replace the ‘often’ with ‘always’ in the previous sentence.

]]>