But I’d like to humbly point out that the great and glorious Food and Brand Lab previously conducted a pizza study similar to the pizzagate study, in which they obtained exactly the opposite results.

This study, http://www.mitpressjournals.org/doi/abs/10.1162/REST_a_00057, showed that diners who paid more ate more, and that those who paid less rated the pizza higher. Both findings are contradicted by the pizzagate study.

It’s analogous to doing a medical study on only subjects with European ancestry, then realizing that the situation may be different for people with African ancestry. (There was a case in the news today about how some blood sugar tests may read differently for people of African vs. European ancestry, especially those who carry sickle cell trait.)

Does anyone argue that the choice of model does not matter much in a problem?

I disagree 100% with your statement that statistical inference is “inappropriate for prediction” for new scenarios that are different from the old. Of course statistical inference is appropriate for prediction here! In the real world, conditions change all the time. Statistical inference is for the real world, not just for idealized random samples and roulette wheels.

That is why, for rare climate data, they use alpha = 1e-1; in biomedical studies, alpha = 5e-2 (unless there happens to be a lot of data, in which case alpha = 1e-2); and in particle physics it can drop to alpha = 3e-7.
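For a rough sense of what these thresholds mean in sigma terms, here is a small sketch using Python’s standard library. (Treating the particle-physics threshold as one-sided is an assumption on my part, matching the usual 5-sigma convention.)

```python
from statistics import NormalDist

def z_threshold(alpha, two_sided=True):
    """Normal quantile corresponding to significance level alpha."""
    p = 1 - alpha / 2 if two_sided else 1 - alpha
    return NormalDist().inv_cdf(p)

# Two-sided thresholds for the alphas mentioned above
for alpha in (1e-1, 5e-2, 1e-2):
    print(f"alpha = {alpha:g}: z = {z_threshold(alpha):.2f}")

# alpha = 3e-7 recovers the familiar one-sided 5-sigma rule
print(f"alpha = 3e-7: z = {z_threshold(3e-7, two_sided=False):.2f}")
```

The familiar z = 1.96 for alpha = 5e-2 falls out of the same formula.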

I have a zillion examples in my applied research of embracing variation and accepting uncertainty. For example, this paper from 1999 on decision making for home radon exposure. In the past, people had tried to identify houses as being at risk or not, or as high radon or low radon. In our paper we (a) accepted the uncertainty, which allowed flexible decision recommendations, and (b) embraced variation by using multiple levels of variation to fit our model.

Or this paper from 2008 on estimating incumbency advantage and its variation.

Reducing uncertainty is great, but often you have to accept the uncertainty that remains. Also, when I say “embrace variation”: sure, if variation is controllable it can be a good idea to reduce it; that’s one of the fundamental principles of quality control. But if you’re studying humans, that’s not always a possibility.

A specific way that an understanding of variation can help is in psychology experiments, where for the past few years Eric Loken, I, and others have recommended within-person designs instead of between-person designs, so as to better align data collection with unavoidable variation.

“embrace variation” sounds like one of those cliches that are totally non-actionable.

Heck, I accept uncertainty. It is all around me. But acceptance is the easy part. *Reducing* uncertainty is where the worthwhile challenge lies.

“Embrace variation” is the dictum the Indian train system seems to run on. “Eschew variation” is more like Deutsche Bahn. That’s what we want.

and thinking: what does that mean in practice? Maybe you should spell this out with an example data analysis, once with a p-value-based analysis, and then with an analysis embracing uncertainty and variation. Otherwise people will not see the point. All they see is a rejection of a clear algorithm that leads to a publication, replaced by a seemingly touchy-feely alternative (I am taking their position) that has no clear path laid out.

I’m curious to hear what Andrew or someone else here thinks about this “problem” in the spirit of re-evaluating old statistical traditions. Is the problem perhaps dependent on model “complexity”? How often have you found that your inferences depended on the choice of distributions, and that there were several plausible candidates to choose between? Is this something we should care more about?

We have some discussion of default priors in this wiki. Feel free to add your thoughts and questions to it. If you have specific issues, this could be very helpful.

Also if you’re doing variable-selection models, you might try the horseshoe prior which I believe is now implemented in rstanarm.

The implication: in small-n/noisy situations, the apparent effects, conditional on their being found, will necessarily be much exaggerated.

The only 'solution' I know of is to make sure you collect enough data relative to the expected variability of the phenomenon of interest.
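One way to make “enough data relative to the expected variability” concrete is the standard sample-size formula for comparing two means. The effect size and SD below are hypothetical placeholders, not values from any study discussed here:

```python
import math
from statistics import NormalDist

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample z-test."""
    z = NormalDist().inv_cdf
    z_alpha, z_power = z(1 - alpha / 2), z(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Hypothetical: detect a difference of 0.5 SD units with 80% power
print(n_per_group(effect=0.5, sd=1.0))  # 63 per group
```

Halve the effect size and the required n roughly quadruples, which is exactly the “relative to the expected variability” point.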

Here’s what I wrote: “We were talking about a really noisy study where, if a statistically significant difference is found, it is guaranteed to be at least 9 times higher than any true effect, with a 24% chance of getting the sign backward.” This statement is not hyperbole. It is literally true in this case because there is no way the true effect size is large. That is the point.

The sad thing is that a statement as extreme as mine, which sounds like hyperbole, isn’t! That’s how bad things are in some fields of empirical science.
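The arithmetic behind such statements is easy to check by simulation. The numbers below are hypothetical (a true effect of 1 measured with standard error 10, not the study under discussion), but they show the same phenomenon: conditioning on statistical significance inflates the estimate and can flip its sign.

```python
import random

random.seed(1)

true_effect, se, n_sims = 1.0, 10.0, 200_000

# Keep only the estimates that reach two-sided p < .05
significant = [est for est in (random.gauss(true_effect, se) for _ in range(n_sims))
               if abs(est) > 1.96 * se]

exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect
wrong_sign = sum(e < 0 for e in significant) / len(significant)

print(f"significant results: {len(significant)} of {n_sims}")
print(f"average |estimate| / true effect: {exaggeration:.1f}x")
print(f"share with the wrong sign: {wrong_sign:.0%}")
```

With these made-up numbers the exaggeration factor is even worse than 9, and the sign is wrong more than a third of the time.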

(For instance, that a p-value of .02 is more likely to be a false positive when N = 10 than when N = 2,000.)

Building models where you assume “as if from a random number generator with unknown mean and SD but perfectly known distributional shape” is pretty much blue and red fairies.

Also, doing an analysis in which you explain the effect of the prior is hugely helpful. For example, suppose the effect of some law enforcement intervention is of interest. You run a Bayesian analysis with a broad prior allowing for potentially large or small, positive or negative effects, and together with the noisy data you find an effect of 1 on some scale where 1 is a very good effect.

Now you ask: “How strongly do I need to believe that the real effect is near zero to overcome what my data tell me?” Run the analysis with priors on the effect size of normal(0,.1), normal(0,.01), normal(0,.001), etc. Suppose that the expected effect size drops to 0.1 only with the normal(0,.001) prior…

Then you can explain to the court: “Unless you go into this analysis believing strongly that there’s a 90% chance your effect is between -.002 and +.002, you have to come out of the analysis believing that the effect is at least of size 0.1,” and “if you believe it is conceivable that any size between -2 and 2 is possible, then the most likely thing is that the real effect is between .9 and 1.4 (or whatever your high-probability interval is under your broad prior).”
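For a normal likelihood with a normal prior, this sensitivity analysis can be done in closed form. The observed effect and standard error below are assumed purely for illustration, so the particular prior scale at which the posterior collapses will differ from the narrative above:

```python
def posterior_mean(estimate, se, prior_sd, prior_mean=0.0):
    """Conjugate normal-normal posterior mean: the estimate is shrunk
    toward the prior mean by a factor set by the two variances."""
    shrinkage = prior_sd**2 / (prior_sd**2 + se**2)
    return prior_mean + shrinkage * (estimate - prior_mean)

estimate, se = 1.0, 0.3  # assumed: observed effect 1, standard error 0.3
for prior_sd in (2.0, 0.1, 0.01, 0.001):
    m = posterior_mean(estimate, se, prior_sd)
    print(f"prior normal(0, {prior_sd}): posterior mean = {m:.4f}")
```

The sweep over prior scales is exactly the “how strongly do I need to believe” question made computational: find the narrowest prior under which your conclusion survives.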

Courts understand the concept of prejudice and keeping an open mind to many possibilities pretty well. Putting it in terms they can understand will help.

As an aside, you do use the term “exploratory data analysis”.

But I see resistance sometimes at the suggestion that authors self-label all studies as exploratory or not.

I’ve never really understood that.

I feel that the prudent assumption, when faced with a generic problem, is always that your choice of prior is going to matter critically.

I agree that sometimes you do have big effects and sometimes you can’t make the sample bigger. Lots of important examples in political science and economics are like that. But I still object to the reasoning that a statistically significant result is more informative if obtained under noisy conditions. If noisy data are all you have, that’s fine, but don’t treat the noise as an argument in favor of the conclusion.

This t statistic is one possible way to define a meaningful dimensionless ratio, but it is by no means the only dimensionless ratio of interest in a study. For example, a meaningful ratio in many scenarios is something you might call U for Usefulness: the observed average divided by the size of the effect necessary for the study to produce an economically beneficial result. For example, in a drug designed as a replacement for pseudoephedrine in OTC decongestant pills, U = m/S, where m is the average across patients of some kind of area under a curve of decongestant effect vs. time, and S is the corresponding area under the curve for pseudoephedrine (S for Standard).

You could well have a statistically significant result (that is, you can detect that there is some decongestant effect that isn’t zero) while utterly failing to have even 1/4 of the effect of the drug you’re trying to replace.

If you set your standards at “statistical significance” you can get your drug approved by simply doing a good job of measuring and using large samples. If you set your standards at actually helping people… not so much.
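A toy calculation makes the gap between the two standards concrete. All the numbers here are made up for illustration (they are not phenylephrine trial data):

```python
import math

# Hypothetical trial: per-patient AUC of decongestant effect vs. time
n = 400                 # patients
mean_auc = 0.18         # m: observed average AUC for the new drug
sd_auc = 1.0            # between-patient standard deviation of the AUC
standard_auc = 1.0      # S: AUC of the standard drug it would replace

t = mean_auc / (sd_auc / math.sqrt(n))  # the usual t statistic
u = mean_auc / standard_auc             # U = m/S, the usefulness ratio

print(f"t = {t:.1f}: comfortably 'statistically significant'")
print(f"U = {u:.2f}: less than 1/4 of the standard drug's effect")
```

A large enough sample makes t arbitrarily big while U stays exactly where it is, which is the whole point.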

Not surprisingly, this is exactly the case for phenylephrine, the actual drug that replaced pseudoephedrine in OTC products after pseudoephedrine was made a behind-the-counter-and-register-your-drivers-license drug.

The Pseudoephedrine law resulted in about a 90% decrease in sales of pseudoephedrine in some places: http://media.arkansasonline.com/img/photos/2016/02/13/0214pseudoephidrine_t630.png?30004eeab9fb5f824ff65e51d525728c55cf3980

Unsurprisingly, since the replacement drug is “statistically significant” but not actually useful, the incidence of chronic sinusitis has skyrocketed since the 2006 law, so that now something like 10% of the population suffers from the chronic form.

yay statistics!

The weaker your data, the more important is your prior. In my above post I’m specifically talking about weak-data scenarios.

If you set your standards according to what you find, you will often find something “significant”. If that’s what you did, you had better take that criterion and try the whole thing again with a different data set *with this same criterion*.

If you set your standards according to your findings, you are doing exploratory data analysis. If you think otherwise, you are fooling yourself.

If it’s an observational data set that can’t be repeated, too bad but it’s still exploratory data analysis. Get over it!
