Bill Harris writes:
Mr. P is pretty impressive, but I’m not sure how far to push him in particular and MLM [multilevel modeling] in general.
Mr. P and MLM certainly seem to do well with problems such as eight schools, radon, or the Xbox survey. In those cases, one can make reasonable claims that the performances of the eight schools (or the houses, or the interviewees, conditional on the modeling) are in some sense related.
Then there are totally unrelated settings. Say you’re estimating the effect of silicone spray on enabling your car to get you to work: fixing a squeaky door hinge, covering a bad check you paid against the car loan, and fixing a bald tire. There’s only one case where I can imagine any sort of causal or even correlative connection, and I’d likely need persuading to even consider trying to model the relationship between silicone spray and keeping the car from being repossessed.
If those two cases ring true, where does one draw the line between them? For a specific example, see “New drugs and clinical trial design in advanced sarcoma: have we made any progress?” (linked from here). The discussion covers rare but somewhat related diseases, and the challenge is to run clinical studies with sufficient power from the number of participants, both in aggregate and by disease subtype.
Do you know if people have successfully used MLM or Mr. P in such settings? I’ve done some searching and not found anything I recognized.
I suspect that the real issue is understanding potential causal mechanisms, but MLM and perhaps Mr. P sound intriguing for such cases. I’m thinking of trying fake data to test the idea.
I have a few quick thoughts here:
– First, on the technical question about what happens if you try to fit a hierarchical model to unrelated topics: if the topics are really unrelated, there should be no reason to expect the true underlying parameter values to be similar, hence the group-level variance will be estimated to be huge, hence essentially no pooling. The example I sometimes give is: suppose you’re estimating 8 parameters: the effects of SAT coaching in 7 schools, and the speed of light. These will be so different that you’re just getting the unpooled estimate. The unpooled estimate is not the best—you’d rather pool the 7 schools together—but it’s the best you can do given your model and your available information.
– To continue this a bit, suppose you are estimating 8 parameters: the effects of a fancy SAT coaching program in 4 schools, and the effects of a crappy SAT coaching program in 4 other schools. Then what you’d want to do is partially pool each group of 4 or, essentially equivalently, to fit a multilevel regression at the school level with a predictor indicating the prior assessment of quality of the coaching program. Without that information, you’re in a tough situation.
– Now consider your silicone spray example. Here you’re estimating unrelated things so you won’t get anything useful from partial pooling. Bayesian inference can still be helpful here, though, in that you should be able to write down informative priors for all your effects of interest. In my books I was too quick to use noninformative priors.
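The shrinkage logic in the first point can be sketched with the classic normal-normal partial-pooling formula: each group estimate gets pulled toward the group mean, with the amount of pooling controlled by the group-level standard deviation tau. The numbers below are hypothetical, just to show that a huge estimated tau (as with unrelated parameters like the speed of light) reduces the pooled estimate to essentially the raw, unpooled one.

```python
def partial_pool(y, sigma, mu, tau):
    """Posterior mean of one group's parameter theta_j, given its raw
    estimate y with standard error sigma, group mean mu, and group sd tau.
    This is the standard precision-weighted average:
      (y/sigma^2 + mu/tau^2) / (1/sigma^2 + 1/tau^2)
    """
    precision_data = 1 / sigma**2
    precision_prior = 1 / tau**2
    return (y * precision_data + mu * precision_prior) / (precision_data + precision_prior)

# Hypothetical numbers: one school's raw effect estimate 28 with se 15,
# group mean 8.
print(partial_pool(28.0, 15.0, 8.0, tau=10.0))  # moderate tau: strong shrinkage toward 8
print(partial_pool(28.0, 15.0, 8.0, tau=1e6))   # huge tau (unrelated groups): ~ the raw 28
```

With a moderate tau the estimate is pulled well toward the group mean; with an enormous tau, which is what the model infers when the underlying parameters are genuinely unrelated, there is essentially no pooling.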
The other day, I wrote:
It’s been nearly 20 years since the last time there was a high-profile report of a social science survey that turned out to be undocumented. I’m referring to the case of John Lott, who said he did a survey on gun use in 1997, but, in the words of Wikipedia, “was unable to produce the data, or any records showing that the survey had been undertaken.” Lott, like LaCour nearly two decades later, mounted an aggressive, if not particularly convincing, defense.
Lott disputes what is written on the Wikipedia page. Here’s what he wrote to me, first on his background:
You probably don’t care, but your commentary is quite wrong about my career and the survey. Since most of the points that you raise are dealt with in the post below, I will just mention that you have the trajectory of my career quite wrong. My politically incorrect work had basically ended my academic career in 2001. After having had positions at Wharton, University of Chicago, and Yale, I was unable to get an academic job in 2001 and spent 5 months being unemployed before ending up at a think tank, AEI. If you want an example of what had happened you can see here. A similar story occurred at Yale where some US Senators complained about my research. My career actually improved after that, at least if you judge it by getting academic appointments. For a while universities didn’t want to touch someone who would get these types of complaints from high profile politicians. I later re-entered academia, though eventually I got tired of all the political correctness and left academia.
Regarding the disputed survey, Lott points here and writes:
Your article gives no indication that the survey was replicated nor do you explain why the tax records and those who participated in the survey were not of value to you. Your comparison to Michael LaCour is also quite disingenuous. Compare our academic work. As I understand it, LaCour’s data went to the heart of his claim. In my case we are talking about one paragraph in my book and the survey data was biased against the claim that I was making (see the link above).
I have to admit I never know what to make of it when someone describes me as “disingenuous,” which according to the dictionary, means “not candid or sincere, typically by pretending that one knows less about something than one really does.” I feel like responding, truly, that I was being candid and sincere! But of course once someone accuses you of being insincere, it won’t work to respond in that way. So I can’t really do anything with that one.
Anyway, Lott followed up with some specific responses to the Wikipedia entry:
The Wikipedia statement . . . is completely false (“was unable to produce the data, or any records showing that the survey had been undertaken”). You can contact tax law Professor Joe Olson who went through my tax records. There were also people who have come forward to state that they took the survey.
A number of academics and others have tried to correct the false claims on Wikipedia but they have continually been prevented from doing so, even on obviously false statements. Here are some posts that a computer science professor put up about his experience trying to correct the record at Wikipedia.
I hope that you will correct the obviously false claim that I “was unable to produce the data, or any records showing that the survey had been undertaken.” Now possibly the people who wrote the Wikipedia post want to dismiss my tax records or the statements by those who say that they took the survey, but that is very different than them saying that I was unable to produce “any records.” As to the data, before the ruckus erupted over the data, I had already redone the survey and gotten similar results. There are statements from 10 academics who had contemporaneous knowledge of my hard disk crash where I lost the data for that and all my other projects and from academics who worked with me to replace the various data sets that were lost.
I don’t really have anything to add here. With LaCour there was a pile of raw data and also a collaborator, Don Green, who recommended to the journal that their joint paper be withdrawn. The Lott case happened two decades ago, there’s no data file and no collaborator, so any evidence is indirect. In any case, I thought it only fair to share Lott’s words on the topic.
Thanks to Robert Grant, we now have a Stata interface! For more details, see:
- Robert Grant’s Blog: Introducing StataStan
Jonah and Ben have already kicked the tires, and it works. We’ll be working on it more as time goes on as part of our Institute of Education Sciences grant (turns out education researchers use a lot of Stata).
We welcome feedback, either on the Stan users list or on Robert’s blog post. Please don’t leave comments about StataStan here — I don’t want to either close comments for this post or hijack Robert’s traffic.
P.S. Yes, we know that Stata released its own Bayesian analysis package, which even provides a way to program your own Bayesian models. Their language doesn’t look very flexible, and the MCMC sampler is based on Metropolis and Gibbs, so we’re not too worried about the competition on hard problems.
Radford shared with us this probability puzzle of his from 1999:
A couple you’ve just met invite you over to dinner, saying “come by around 5pm, and we can talk for a while before our three kids come home from school at 6pm”.
You arrive at the appointed time, and are invited into the house. Walking down the hall, your host points to three closed doors and says, “those are the kids’ bedrooms”. You stumble a bit when passing one of these doors, and accidentally push the door open. There you see a dresser with a jewelry box, and a bed on which a dress has been laid out. “Ah”, you think to yourself, “I see that at least one of their three kids is a girl”.
Your hosts sit you down in the kitchen, and leave you there while they go off to get goodies from the stores in the basement. While they’re away, you notice a letter from the principal of the local school tacked up on the refrigerator. “Dear Parent”, it begins, “Each year at this time, I write to all parents, such as yourself, who have a boy or boys in the school, asking you to volunteer your time to help the boys’ hockey team…” “Umm”, you think, “I see that they have at least one boy as well”.
That, of course, leaves only two possibilities: Either they have two boys and one girl, or two girls and one boy. What are the probabilities of these two possibilities?
NOTE: This isn’t a trick puzzle. You should assume all things that it seems you’re meant to assume, and not assume things that you aren’t told to assume. If things can easily be imagined in either of two ways, you should assume that they are equally likely. For example, you may be able to imagine a reason that a family with two boys and a girl would be more likely to have invited you to dinner than one with two girls and a boy. If so, this would affect the probabilities of the two possibilities. But if your imagination is that good, you can probably imagine the opposite as well. You should assume that any such extra information not mentioned in the story is not available.
As a commenter pointed out, there’s something weird about how the puzzle is written, not just the charmingly retro sex roles but also various irrelevant details such as the time of the dinner. (Although I can see why Radford wrote it that way, as it was a way to reveal the number of kids in a natural context.)
The solution at first seems pretty obvious: As Radford says, the two possibilities are:
(a) 2 boys and 1 girl, or
(b) 1 boy and 2 girls.
If it’s possibility (a), the probability of the random bedroom being a girl’s is 1/3, and the probability of getting that note (“I write to all parents . . . who have a boy or boys at the school”) is 1, so the probability of the data is 1/3.
If it’s possibility (b), the probability of the random bedroom being a girl’s is 2/3, and the probability of getting the school note is still 1, so the probability of the data is 2/3.
The likelihood ratio is thus 2:1 in favor of possibility (b).
Case closed . . . but is it?
Two complications arise. First, as commenter J. Cross pointed out, if the kids go to multiple schools, it’s not clear what the probability of getting that note is, but a first guess would be that the probability of your seeing such a note on the fridge is proportional to the number of boys in the family. Actually, even if the kids all go to one school, it might be more likely that the note is displayed prominently on the fridge if there are 2 boys: presumably, the probability that at least one boy is interested in hockey is higher if there are two boys than if there’s only one.
The other complication is the prior odds. Pr(boy birth) is about .512, so the prior odds are .512/.488 in favor of 2 boys and 1 girl, rather than 2 girls and 1 boy.
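Putting the two pieces together (and setting aside the hockey-note complication), the posterior odds work out as follows. This is just the simple model from the post: binomial priors on the family composition with Pr(boy birth) = 0.512, and the 1/3 vs. 2/3 likelihoods for the opened bedroom being a girl’s.

```python
p_boy = 0.512

# Prior probabilities of the two compositions (3 independent births, binomial)
prior_a = 3 * p_boy**2 * (1 - p_boy)   # (a) 2 boys, 1 girl
prior_b = 3 * p_boy * (1 - p_boy)**2   # (b) 1 boy, 2 girls

# Likelihood of the data: the randomly opened bedroom is a girl's
# (the school note has probability 1 under both compositions)
lik_a, lik_b = 1/3, 2/3

post_odds_b = (prior_b * lik_b) / (prior_a * lik_a)
print(post_odds_b)  # about 1.91 to 1 in favor of (b), 1 boy and 2 girls
```

The binomial coefficients cancel, so the prior odds reduce to .512/.488 for (a), and multiplying by the 2:1 likelihood ratio for (b) gives posterior odds of roughly 1.9:1 for (b).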
This is just to demonstrate that, as Feynman could’ve said in one of his mellower moments, God is in every leaf of every tree: Just about every problem is worth looking at carefully. It’s the fractal nature of reality.
Mon: God is in every leaf of every probability puzzle
Tues: Where does Mister P draw the line?
Wed: Recently in the sister blog
Thurs: Humility needed in decision-making
Fri: “Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas?”
Sat: July 4th
Sun: “Menstrual Cycle Phase Does Not Predict Political Conservatism”
Nathan Lemoine writes:
I’m an ecologist, and I typically work with small sample sizes from field experiments, which have highly variable data. I analyze almost all of my data now using hierarchical models, but I’ve been wondering about my interpretation of the posterior distributions. I’ve read your blog, several of your papers (Gelman and Weakliem, Gelman and Carlin), and your excellent BDA book, and I was wondering if I could ask your advice/opinion on my interpretation of posterior probabilities.
I’ve thought of 95% posterior credible intervals as a good way to estimate effect size, but I still see many researchers use them in something akin to null hypothesis testing: “The 95% interval included zero, and therefore the pattern was not significant”. I tend not to do that. Since I work with small sample sizes and variable data, it seems as though I’m unlikely to find a “significant effect” unless I’m vastly overestimating the true effect size (Type M error) or unless the true effect size is enormous (a rarity). More often than not, I find ‘suggestive’, but not ‘significant’ effects.
In such cases, I calculate one-tailed posterior probabilities that the effect is positive (or negative) and report that along with estimates of the effect size. For example, I might say something like
“Foliar damage tended to be slightly higher in ‘Ambient’ treatments, although the difference between treatments was small and variable (Pr(Ambient>Warmed) = 0.86, CI95 = 2.3% less – 6.9% more damage).”
By giving the probability of an effect as well as an estimate of the effect size, I find this to be more informative than simply saying ‘not significant’. This allows researchers to make their own judgements on importance, rather than defining importance for them by p < 0.05. I know that such one-tailed probabilities can be inaccurate when using flat priors, but I place weakly informative priors (N(0,1) or N(0,2)) on all parameters in an attempt to avoid such overestimates unless strongly supported by my small sample sizes.
I was wondering if you agree with this philosophy of data reporting and interpretation, or if I’m misusing the posterior probabilities. I’ve done some research on this, but I can’t find anyone that’s offered a solid opinion on this. Based on my reading and the few interactions I’ve had with others, it seems that the strength of posterior probabilities compared to p-values is that they allow for such fluid interpretation (what’s the probability the effect is positive? what’s the probability the effect > 5? etc.), whereas p-values simply tell you “if the null hypothesis is true, there’s a 70 or 80% chance I could observe an effect as strong as mine by chance alone”. I prefer to give the probability of an effect bounded by the CI of the effect to give the most transparent interpretation possible.
My short answer is that this is addressed in this post:
If you believe your prior, then yes, it makes sense to report posterior probabilities as you do. Typically, though, we use flat priors even though we have pretty strong knowledge that parameters are close to 0 (this is consistent with the fact that we see lots of estimates that are 1 or 2 se’s from 0, but very few that are 4 or 6 se’s from 0). So, really, if you want to make such a statement I think you’d want a more informative prior that shrinks to 0. If, for whatever reason, you don’t want to assign such a prior, then you have to be a bit more careful about interpreting those posterior probabilities.
In your case, since you’re using weakly informative priors such as N(0,1), this is less of a concern. Ultimately I guess the way to go is to embed any problem in a hierarchical meta-analysis so that the prior makes sense in the context of the problem. But, yeah, I’ve been using N(0,1) a lot myself lately.
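The kind of summary Lemoine describes is easy to compute from posterior simulation draws. Here’s a minimal sketch using simulated draws as a stand-in for real MCMC output (the mean and sd are made-up numbers, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for posterior draws of a treatment difference; in a real
# analysis these would come from the fitted model's MCMC output.
draws = rng.normal(loc=2.0, scale=2.0, size=4000)

pr_positive = (draws > 0).mean()             # one-tailed posterior probability
lo, hi = np.percentile(draws, [2.5, 97.5])   # 95% posterior interval
print(f"Pr(effect > 0) = {pr_positive:.2f}, 95% interval = ({lo:.1f}, {hi:.1f})")
```

The point of reporting both numbers, as in Lemoine’s foliar-damage example, is that the one-tailed probability and the interval together convey direction, magnitude, and uncertainty, rather than collapsing everything to a significant/not-significant verdict.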
“Faith means belief in something concerning which doubt is theoretically possible.” — William James (again)
Eric Tassone writes:
So, here’s a Bill James profile from late-ish 2014 that I’d missed until now. It’s baseball focused, which was nice — so many recent articles about him are non-baseball stuff. Here’s an extended excerpt of a part I found refreshing, though it’s probably just that my expectations have gotten pretty low of late w/r/t articles about him. What is going on in this passage? … an evolving maturity for him? … merely exchanging one set of biases for another?
Anyway, surprisingly I enjoyed the article. I hope you enjoy it too. Here’s an excerpt:
But [James] wonders if the generation of baseball fans he inspired have expanded their skepticism to the point where it has crowded out other things like wonder and tolerance and a healthy understanding of our own limited understanding.
Right now, Bill James thinks this sort of arrogance can be dangerous in the sabermetric community. There is more baseball data available now than ever before, and the data grows exponentially. “Understanding cannot keep up with the data,” he says. “It will take many years before we fully understand, say, some of the effects of PITCHf/x (which charts every pitch thrown). It’s important not to skip steps.”
He groans whenever he hears people discount leadership or team chemistry or heart because they cannot find such things in the data. He has done this himself in the past … and regrets it.
“I have to take my share of responsibility for promoting skepticism about things that I didn’t understand as well as I might have,” he says. “What I would say NOW is that skepticism should be directed at things that are actually untrue rather than things that are difficult to measure.
“Leadership is one player having an effect on his teammates. There is nothing about that that should invite skepticism. People have an effect on one another in every area of life. … We all affect another’s work. You just can’t really measure that in an individual-accounting framework.”
The young Bill James rather famously wrote that he could not find any evidence that certain types of players could consistently hit better in the clutch – he still has not found that evidence. But unlike his younger self, he will not dismiss the idea of clutch hitting. He has been a consultant for the Red Sox for more than a decade, and he has watched David Ortiz deliver so many big hits in so many big moments, and he finds himself unwilling to deny that Big Papi has an ability in those situations that others don’t have. He wrote an essay with this thought in mind, suggesting that the fact that we have not found the evidence is not a convincing argument that the evidence does not exist.
“I think I had limited understanding of these issues and wrote about them — little understanding and too-strong opinions,” he says. “And I think I damaged the discussion in some ways when I did this. … these sorts of effects (leadership and clutch-hitting and how players interact) CAN be studied. You just need to approach the question itself, rather than trying to back into it beginning with the answer.”
I responded: Interesting . . . but I wonder if part of this is that James is such an insider now that he’s buying into all the insider tropes.
Yep, exactly . . . especially since one is about his guy, Ortiz!