There is a tension between statistical pedagogy and practice, in the sense that most good data analysts learn not to take their statistics too seriously. It is difficult to train this nuance.

]]>I agree those classes are an afterthought (at least that is a very apt description of my biomed experience), but I wouldn’t be so sure that this isn’t because, at some level, the thing taught as “statistics” is recognized to be a way to legitimize BS.

I was pretty much told that grad school stats teachers get fired if they don’t help to produce papers, by whatever means. To avoid invoking Godwin’s law, I’ll just say that my impression is that the teaching of NHST can be attributed primarily to learned helplessness.

]]>To repeat myself, I feel that people should aspire to acquiring the understanding that people like Richard have, so that they can make their own decision on how to proceed in each individual case. People should not just blindly copy this or that technique as *the* solution that can be applied again and again in a McDonald’s I’ll-have-this-to-go approach.

The response I get when I say this to people starting out in academia is that they don’t have the time to acquire the knowledge; they have to get tenure and therefore publish enough first. Of course, the same people would not hesitate to take the time to acquire the relevant knowledge in their own specialization. It’s only statistics that is considered something to be outsourced to experts, something that needs less attention.

Somehow statistics needs to become embedded as part of the core curriculum of every scientific field. When one does a PhD in ling, one has to have relatively good mastery of all the areas: phonetics, phonology, morphology, syntax, semantics, sociolinguistics, pragmatics, psycholinguistics, computational linguistics, etc. The same should go for psychology and other areas. Statistics is an afterthought in these fields, and that’s where the problems begin.

If the Morey and Wagenmakers paper being criticized here were to appear in the kind of ideal environment where statistics were already embedded as part of the core knowledge, there would be less room for misuse. But in that ideal case, even frequentist methods would be OK for many situations. Editors wouldn’t hanker after low p-values as more evidence for the specific alternative, they wouldn’t complain about “useless” replications, and they would know that p-values tell you nothing about the replicability of a result, that model assumptions matter, and that transparency matters. In such an environment, if someone were to disagree with the approaches Morey and Wagenmakers propose, they could go ahead and reanalyze the data themselves and show why the conclusions were wrong (if they were).

]]>This is super presumptuous on several different levels, and the implication is also just wrong. EJ’s been selling Bayesian methods since I was in grad school (for example, this was written before I even knew EJ, and before I got my PhD: http://www.ejwagenmakers.com/2008/BayesFreqBook.pdf). Like them or dislike them, these examples are EJ’s and they come from his talks. Don’t give me credit for the work of my collaborator who is my senior.

Since the implication also seems to be that I’m somehow responsible for EJ’s advocacy of Bayes factors, let me state my opinion about BFs here. I’ve said all this elsewhere on social media and in courses I teach, but here it is again.

Bayes factors are an inescapable consequence of Bayes’ theorem. However, this does not mean that Bayes factor point-null hypothesis testing is appropriate in all situations. Or most situations. When I consult, well more than half the time I steer researchers away from them for exactly the reason Gelman stated in the post. They’re just not right for many research questions. When I teach Bayes, I do not spend much time on BFs; I focus on building/checking models (which can be done in roughly the same way as a classical statistician would do) and interpreting posteriors. When I review, I suggest removing Bayes factor point-null tests where they are not appropriate or helpful, in spite of this reducing the visibility of my own work. When I analyse data, I most often rely on posterior estimation of complex models with model checking, in a manner similar to Gelman’s “continuous model expansion”.

]]>I garbled that sentence where I wrote, “Changing these prior sd’s from 100 to 1000 won’t matter except for either model but it will have a huge effect on the marginal likelihood.”

Here it is, ungarbled (and slightly lengthened for clarity): “Changing these prior sd’s from 100 to 1000 will have essentially no effect on the inferences conditional on either model but it will have a huge effect on the marginal likelihoods.”

]]>“Changing these prior sd’s from 100 to 1000 won’t matter except for either model but it will have a huge effect on the marginal likelihood.”

You’re saying it does matter for the marginal likelihood, and for either model, but it won’t matter for… what else? Each model’s predictions are given by its marginal likelihood, which is affected by the prior. Therefore, a model’s predictions are affected by the prior. Therefore, for a model to be well-specified, the prior must be specified.

Taking your example (sigmas are variances):

H1: y ~ N(beta*X, sigma)

H0: y ~ N(gamma*W, sigma)

We can simplify for the sake of discussion, and say that W = 0, X = 1, y = 0, and sigma = 1. Gamma no longer matters, and let’s give beta a prior: beta ~ N(0, tau).

The predictive distribution for y under H0 is now (y|H0) ~ N(0, sigma) and under H1 it is y ~ N(0, sigma+tau). The latter includes the prior. In order to evaluate whether the added parameter was worth adding, the suggested approach simply moves along and compares the support at y under H0 to the support under H1:

N(0, sigma+tau)/N(0, sigma) = sqrt(sigma/(sigma+tau)) = 1/sqrt(1+tau)

… still a function of the prior. I understand that you have to be serious about priors, but I have no idea how one could make inferences in which the model assumptions don’t feature. That just seems wrong.
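For what it’s worth, the toy calculation above is easy to check numerically. A minimal sketch (the function name is mine, and sigma and tau are treated as variances, as in the example):

```python
# Numerical check of the toy example: under H0, y ~ N(0, sigma); under H1,
# marginalizing beta ~ N(0, tau) gives the predictive y ~ N(0, sigma + tau).
from scipy.stats import norm
import numpy as np

sigma = 1.0  # residual variance (fixed to 1 in the simplification above)

def bf_h1_vs_h0(tau, y=0.0):
    """Predictive density under H1 over predictive density under H0, at y."""
    p_h1 = norm.pdf(y, loc=0.0, scale=np.sqrt(sigma + tau))
    p_h0 = norm.pdf(y, loc=0.0, scale=np.sqrt(sigma))
    return p_h1 / p_h0

# At y = 0 the ratio reduces to sqrt(sigma/(sigma + tau)), so it never
# stops depending on the prior variance tau -- the point of the example.
for tau in [1.0, 100.0, 10000.0]:
    print(tau, bf_h1_vs_h0(tau), np.sqrt(sigma / (sigma + tau)))
```

Whatever tau one picks, the ratio remains a function of the prior.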

Perhaps I am not understanding the difference between the actions “doing model selection” and “deciding if model expansion is warranted”?

I’ll check out BDA3 Ch.7 today.

]]>Thanks for relating your experiences here, as that makes the underlying pragmatic concerns for scientific communities very salient.

(My experiences here involved _deprogramming_ clinical experts who had bought into lay introductions to Bayesian methods as being essentially infallible – e.g. believed a prior (that was just a software default) fully captured all past clinical knowledge.)

If you are not already aware, you might find this an interesting discussion of similar concerns: http://www.dcscience.net/Gigerenzer-Journal-of-Management-2015.pdf

]]>I started fitting linear mixed models after it was suggested to me by the Ohio State Statistical Consulting service that that is what I should be doing. I installed the nlme package and cut along the dotted line as instructed by the statisticians at OSU. This was all fine because they signed off on my analysis; I myself had no idea about the underlying machinery (this was around 2000). There was an in-retrospect hilarious moment when the statistics grad student helping me asked me what the variance-covariance matrix in my analysis looked like and I had really no idea what that was or how to even find out the answer. She had to send me the relevant command for me to print out the vcov matrix.

Then lme4 came along and I started using that just like every other psycholinguist does, gradually starting to understand more and more about the underlying theory. It was only after I did the MSc in Stats at Sheffield that I finally got something closer to the full picture. In retrospect, the canned analyses I did did not serve me well. What I should have done in 2000 was take a course or two in statistics *in the statistics department*. There is usually no point taking a course in a linguistics or psych department because you will invariably get a distorted and very incomplete picture (ironic because I teach statistics in a linguistics dept).

The central problem is that canned software encourages distance from the details. This kind of convenience is a very dangerous thing. Bayes will probably go the same way as the abuse of frequentist methods for this reason. rstanarm and JASP are great when you are an expert like Gelman, Morey, Rouder, or Wagenmakers, but they are a deadly tool in a novice’s hands, especially if the novice has the idea—encouraged actively in the psych* world at least—that knowing the details of how the underlying moving parts work is optional. People provide canned one-and-done “recommendations”, and it’s downhill after that.

]]>I don’t really have it in me to explain it all again here, but very briefly: Think about two regression models, one is y = X*beta + error, the other is y = W*gamma + error, where X and W are two different (possibly overlapping) sets of predictors. I’m assuming things are roughly on unit scale so all the coefficients are well below 10 in absolute value, I’m also assuming no collinearity and a reasonable amount of data so that either regression can be estimated with no problem. Various priors for beta or for gamma will be essentially equivalent, for example independent priors on the components of beta with mean 0 and sd 10 or sd 100 or sd 1000 or whatever, and the same sort of thing for gamma. Changing these prior sd’s from 100 to 1000 won’t matter except for either model but it will have a huge effect on the marginal likelihood. So, to start with, if you want to do Bayesian model averaging you have to be really really serious about the priors, even in a large N, small error setting where otherwise the priors won’t matter.

For continuous model expansion I’d want to fit a model including all the X’s and W’s, adding prior info to the coefs as necessary.

]]>I don’t have BDA handy right now, but what procedure would you recommend to evaluate if a continuous model expansion was worth it?

]]>No, changing the prior in this way will have essentially zero impact on inferences and predictions conditional on the model. As I discuss in chapter 7 of BDA, I think the right thing to do in these settings is to build a larger model that includes the individual models as special cases, that is, continuous model expansion.

]]>I don’t see how one can avoid the dependence on assumptions in the formal model check that is needed to determine whether a model needs expanding (since the expanded model under consideration would have priors also), so I’d be curious to learn how you do model checking.

That said, I understand that this is a matter of taste. I personally think “robustness against prior information” more often indicates a pathological model than a virtuous one.

]]>The problem with these Bayesian model averaging methods is that the marginal likelihood of a model typically depends very strongly on aspects of the model that are set arbitrarily. For example, change your prior on a parameter from normal(0,100) to normal(0,1000) and your marginal likelihood changes by a factor of 10. I discuss this further in chapter 7 of BDA3 (or chapter 6 of the earlier editions) and in my 1995 paper with Rubin.
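The factor-of-10 claim is easy to see in a minimal conjugate-normal sketch (a toy example of my own, not from BDA): widening the prior sd from 100 to 1000 leaves the posterior essentially untouched but scales the marginal likelihood by roughly 1/10.

```python
# Model: y_i ~ N(theta, 1) with prior theta ~ N(0, s^2). The posterior for
# theta is nearly identical for s = 100 and s = 1000, but the marginal
# likelihood differs by about a factor of 10 (it scales like 1/s).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=50)  # simulated data, true theta = 2
n, ybar = len(y), y.mean()

def posterior_and_marginal(s):
    # Conjugate update: posterior precision = n + 1/s^2.
    post_var = 1.0 / (n + 1.0 / s**2)
    post_mean = post_var * n * ybar
    # Marginal density of the sufficient statistic: ybar ~ N(0, s^2 + 1/n);
    # the ratio across choices of s equals the marginal-likelihood ratio.
    marg = norm.pdf(ybar, loc=0.0, scale=np.sqrt(s**2 + 1.0 / n))
    return post_mean, post_var, marg

m100, v100, ml100 = posterior_and_marginal(100.0)
m1000, v1000, ml1000 = posterior_and_marginal(1000.0)
print(m100, m1000)     # essentially identical posterior means
print(ml100 / ml1000)  # close to 10: the marginal likelihood tracks the prior sd
```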

]]>I think your questions 1 and 2 are good questions, and worthy of exploration. With regard to the first of them: Yes, in almost every context it makes sense to ask whether such-and-such an effect is big enough to be important. (Indeed, I think it’s very unfortunate that statistics uses the word ‘significant’ to mean something totally different from ‘important’, and I think a lot of misunderstanding is engendered when a statistician says that an effect is ‘significant’ and a lay listener interprets this to mean it is big enough to be important). So, yeah, there’s no point testing whether such-and-such an effect is literally zero, but it often makes sense to try to estimate the size of an effect and see whether it’s big enough to be important. If you conclude that a bad Adam Sandler movie makes just about as much money as a good Adam Sandler movie, then don’t bother hiring a script doctor or whatever.

]]>If I understand, you consider the model rho=0 objectionable. Let’s agree that models are neither true nor false, data do not follow normals, etc. The question of course is whether rho=0 is a theoretically useful statement. I think that in many contexts a point mass is a useful, theoretically interesting statement of constraint and invariance. Certainly, much of science has proceeded with the notion that point masses, equalities, are theoretically important (see, say, physics). Do you always object to points, or do you object only in this example? My tendency is to give deference to substantive researchers on whether points are theoretically useful or not. It’s not a statistical issue to me.

Jeff

]]>Regarding point 1, I don’t see why there’s interest in whether rho is between -.01 and .01, or whatever.

Regarding point 2, yes, I think the bigger model should be better but there is a cost in expanding the model, hence I do model checking to decide where the model should be expanded. There are lots of open questions here and room for more research; I just don’t really like the methods proposed in the paper discussed above.

]]>Alas, I’m afraid that my morbid fascination with null hypothesis testing may not stop any time soon…

Personally I am persuaded by the Wrinch and Jeffreys (1921) argument for assigning mass to a point null. See the historical overview by Alexander Etz, in press for Statistical Science (preprint at https://arxiv.org/abs/1511.08180). But I also like to think that if you were operating in my field those nulls would suddenly look a lot more plausible (and vice versa, in your field I agree the null is often not of interest).

Anyway, I am not sure I follow your claim about the predictive intervals. Bayesian dogma (Pratt, Lindley) has it that classical inference cannot provide, in the IQ example: (1) the probability of Bob’s true IQ falling between x and y; (2) the relative plausibility that Bob’s true IQ is z versus y. Are you saying that you can answer these questions with predictive intervals? It seems unlikely to me, because questions 1 and 2 require a prior, and their answer therefore also varies with the prior.
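For concreteness, both quantities are one-liners once a prior is on the table. A sketch with made-up numbers (normal population prior, normal measurement error, both assumptions mine), which also shows that the answers exist only relative to that prior:

```python
# Conjugate normal update for Bob's true IQ, theta, given one test score.
import numpy as np
from scipy.stats import norm

prior_mean, prior_sd = 100.0, 15.0  # assumed population prior for true IQ
err_sd = 5.0                        # assumed measurement-error sd of the test
x = 130.0                           # Bob's observed score (made up)

# Posterior precision is the sum of prior and data precisions.
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / err_sd**2)
post_mean = post_var * (prior_mean / prior_sd**2 + x / err_sd**2)
post = norm(post_mean, np.sqrt(post_var))

# (1) probability that Bob's true IQ falls between 120 and 135
p1 = post.cdf(135.0) - post.cdf(120.0)
# (2) relative plausibility of a true IQ of 127 versus 115
p2 = post.pdf(127.0) / post.pdf(115.0)
print(post_mean, p1, p2)
```

Change `prior_mean` or `prior_sd` and both answers move, which is the point of the question.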

Cheers,

E.J.

1) Is the issue indeed with point hypotheses and would everyone be fine with this if the “rho = 0” were replaced by “rho ~ U(-e,e)” with e small?

2) Isn’t every model M a point hypothesis with respect to some higher-order model M’ (i.e., where M’ expands M by including an additional continuous parameter whose value can be fixed to some value in order to obtain M)? How many turtles deep is the right turtle depth? ]]>

Conceivably it could be very low, but it cannot literally be 0. There is no point testing a hypothesis that you know to be false.

]]>I don’t think any reputations are getting shot down. I consider the sort of work in the paper under discussion to be speculative. I think it is not well founded, but various not-well-founded ideas can work ok. Hell, even p-values have solved a lot of problems. A couple years ago in this blog I discussed how I’d underestimated the importance of the lasso idea, just because the justifications given for the method didn’t make sense. Lots of people have wrongly dismissed Bayesian methods because they felt that the idea of the prior distribution was unscientific. So, sure, I don’t like the method described in this paper and I made no bones about it. I think their method won’t be useful. But, who knows, maybe I’m wrong. It’s fine for them to put it out there, and if these authors do other good work, that’s fine.

]]>I agree Andrew, I respect some of these authors quite a bit but these tortured and inappropriate efforts to sell Bayesian methods have got to stop or they (one of them at least) are going to shoot their own reputation down.

]]>I don’t think they’re saying it is; they’re saying that this is one of the hypotheses under consideration, and they evaluate the evidence for it. Are you just forcefully asserting your prior that the South Park hypothesis is false?

]]>A less wrong message would be that learning from data is risky, but that persistence in grasping the context of the data and how it came about may give rise to models and techniques that reduce that risk enough to make it worthwhile. With that, and adequate expertise, both Bayesian and orthodox methods can likely be made to work well.

Also reminds me of Parizeau’s lobster strategy: get folks to choose your side, and then, like lobsters thrown into boiling water, they can’t get out. http://www.thecanadianencyclopedia.ca/en/article/parizeaus-lobster-flap/

As for the predictive interpretation, I recalled this from 10 years ago: http://andrewgelman.com/2006/12/18/example_of_a_co/#comment-42003

]]>In practice for this problem, most likely the uniform distribution isn’t that bad. The measurement error can’t be much bigger than the variability across people, and it logically can’t be smaller than zero. So uniform(0,15) covers pretty much everything. Shortening that interval a little on the bottom side makes sense.

In practice for positive scale parameters I usually use a gamma distribution gamma(n,n/m) where n is an “effective sample size” and m is an expected value for the parameter… gamma(3,3/5) seems like it’d be pretty good for an IQ test where we’ve observed sd=12 across multiple people in the past.
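If it helps, the gamma(n, n/m) parameterization above (shape n, rate n/m) can be sanity-checked in scipy, which takes a scale rather than a rate:

```python
# Sketch of the gamma(n, n/m) prior for a positive scale parameter: mean m,
# tightening as the "effective sample size" n grows.
from scipy.stats import gamma

n, m = 3.0, 5.0  # effective sample size 3, expected value 5 for the sd
prior = gamma(a=n, scale=m / n)  # scipy's scale is 1/rate, so scale = m/n
print(prior.mean())          # equals m
print(prior.interval(0.95))  # bulk of the mass comfortably inside (0, 15)
```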

]]>Pragmatically, this should lead to more honest, less over-confident inference. Although, given an informative prior, the difference would be much smaller (though of the same spirit) than using a t versus a normal distribution for a sample average.

When fitting models in Stan, this is almost no extra work over using a point estimate in the first place.

]]>Can you recommend an applied paper that does scientific hypothesis adjudication with this approach?

]]>As a non-Bayesian (by training, not belief), I have been working on simple and clear examples illustrating the difference between classical statistical reasoning and Bayesian analysis. Their first example comes closer than most of what I have seen. I’ve constructed my own version of their first example but I don’t see any reason to assume that the standard deviation of IQ scores follows a distribution rather than just using the standard deviation estimate based on prior studies. Can someone explain whether it is necessary to assume that the standard deviation follows a non-degenerate distribution, and if so, why that is necessary for illustrating the difference in techniques?
