Where does Mister P draw the line?

Bill Harris writes:

Mr. P is pretty impressive, but I’m not sure how far to push him in particular and MLM [multilevel modeling] in general.

Mr. P and MLM certainly seem to do well with problems such as eight schools, radon, or the Xbox survey. In those cases, one can make a reasonable claim that the performances of the eight schools (or the houses, or the interviewees, conditional on the model) are in some sense related.

Then there are totally unrelated settings. Say you’re estimating the effect of silicone spray on enabling your car to get you to work: fixing a squeaky door hinge, covering a bad check you paid against the car loan, and fixing a bald tire. There’s only one case where I can imagine any sort of causal or even correlative connection, and I’d likely need persuading to even consider trying to model the relationship between silicone spray and keeping the car from being repossessed.

If those two cases ring true, where does one draw the line between them? For a specific example, see “New drugs and clinical trial design in advanced sarcoma: have we made any progress?” (linked from here). The discussion covers rare but somewhat related diseases, and the challenge is to do clinical studies with sufficient power, given the number of participants in aggregate and by disease subtype.

Do you know if people have successfully used MLM or Mr. P in such settings? I’ve done some searching and not found anything I recognized.

I suspect that the real issue is understanding potential causal mechanisms, but MLM and perhaps Mr. P sound intriguing for such cases. I’m thinking of trying fake data to test the idea.

I have a few quick thoughts here:

– First, on the technical question about what happens if you try to fit a hierarchical model to unrelated topics: if the topics are really unrelated, there should be no reason to expect the true underlying parameter values to be similar, hence the group-level variance will be estimated to be huge, hence essentially no pooling. The example I sometimes give is: suppose you’re estimating 8 parameters: the effects of SAT coaching in 7 schools, and the speed of light. These will be so different that you’re just getting the unpooled estimate. The unpooled estimate is not the best—you’d rather pool the 7 schools together—but it’s the best you can do given your model and your available information.

– To continue this a bit, suppose you are estimating 8 parameters: the effects of a fancy SAT coaching program in 4 schools, and the effects of a crappy SAT coaching program in 4 other schools. Then what you’d want to do is partially pool each group of 4 or, essentially equivalently, to fit a multilevel regression at the school level with a predictor indicating the prior assessment of quality of the coaching program. Without that information, you’re in a tough situation.

– Now consider your silicone spray example. Here you’re estimating unrelated things so you won’t get anything useful from partial pooling. Bayesian inference can still be helpful here, though, in that you should be able to write down informative priors for all your effects of interest. In my books I was too quick to use noninformative priors.
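The pooling behavior in the points above can be sketched numerically. Below is a minimal empirical-Bayes version of the normal hierarchical model (y_j ~ N(theta_j, sigma_j^2), theta_j ~ N(mu, tau^2)), fitting tau by a grid search on the marginal likelihood. All the numbers and function names are invented for illustration; a real analysis would fit the full Bayesian model (e.g., in Stan) rather than use this plug-in shortcut:

```python
import numpy as np

def partial_pool(y, sigma):
    """Empirical-Bayes fit of the normal hierarchical model
    y_j ~ N(theta_j, sigma_j^2), theta_j ~ N(mu, tau^2):
    pick (mu, tau) by grid search on the marginal likelihood,
    then return the posterior means of theta_j and the fitted tau."""
    best_ll, best_mu, best_tau = -np.inf, 0.0, 0.0
    for tau in np.linspace(0.01, 10 * y.std() + 1, 2000):
        marg_var = sigma**2 + tau**2            # marginal variance of y_j
        mu = np.sum(y / marg_var) / np.sum(1 / marg_var)
        ll = -0.5 * np.sum(np.log(marg_var) + (y - mu) ** 2 / marg_var)
        if ll > best_ll:
            best_ll, best_mu, best_tau = ll, mu, tau
    shrink = sigma**2 / (sigma**2 + best_tau**2)  # weight on the common mean
    return best_mu + (1 - shrink) * (y - best_mu), best_tau

# 7 similar coaching effects (SAT points, invented numbers)...
y7 = np.array([5.0, 8.0, -2.0, 7.0, 3.0, 1.0, 6.0])
s7 = np.full(7, 10.0)
est7, tau7 = partial_pool(y7, s7)           # tiny tau: strong pooling

# ...plus the speed of light (km/s) as an unrelated 8th "effect"
y8 = np.append(y7, 299792.458)
est8, tau8 = partial_pool(y8, np.append(s7, 10.0))  # huge tau: ~no pooling

# 4 fancy + 4 crappy programs: a group-level quality indicator lets each
# school be shrunk toward its own group's mean instead of a common mean
yg = np.array([12.0, 9.0, 14.0, 11.0, 2.0, -1.0, 3.0, 0.0])
grp = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mu_g = np.array([yg[grp == g].mean() for g in (0, 1)])
est_g, _ = partial_pool(yg - mu_g[grp], np.full(8, 4.0))
est_g = est_g + mu_g[grp]
```

With the speed of light included, the fitted tau blows up and est8 is essentially the unpooled estimate; with the quality indicator, each group of 4 is partially pooled toward its own mean. Note that this plug-in estimate of tau understates uncertainty; a full Bayesian fit would average over tau.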

9 thoughts on “Where does Mister P draw the line?”

  1. So, what type of diagnostics would you recommend for detecting that your model is incorrectly adding in an unrelated topic? Just the normal model comparisons, such as an information-criterion comparison or cross-validation?

    • I think there’s a tendency for people to NOT use all of the information at their disposal, out of a misguided effort to be “unbiased” or “scientific” or something. Not all information is in the form of data points that fit neatly into a database. If you’re studying the flight speed of bumblebees, don’t put in data about the driving speed of race cars. You don’t need a method to determine this from the data, you know they’re unrelated so you don’t mix them together.

      • @Phil, I think that’s the issue. Bumblebees and race cars seem clearly different (*), but the question posed in the article seems to me at least partially related to whether the responses of the different diseases to drugs are more akin to the speed of bumblebees and race cars or to the speed of SUVs and station wagons. I think this carries over to other domains, too; the aforelinked article may provide a useful starting point for thinking about it.

        If one faces a situation where one doesn’t know the answer to that sort of question, then is one forced back to one factor at a time experimentation, or can multilevel regression play a useful role in the design and analysis of such experiments?

        Maybe this is related to Christian’s recent posting about “Testing hypotheses via a mixture estimation model.” That is, maybe one can extend Andrew’s notion of avoiding multiple comparisons problems and Christian’s et al.’s notion of avoiding model comparison problems by incorporating all possibilities into a mixture model in a way that can do useful analysis whether it’s bumblebees or BMWs you’re comparing against race cars.

        I see places in domains closer to my work where it’s not clear which situation holds, and I’m wondering if one has to make that clear first or if it can be done all together. Andrew’s statement about high group-level variances suggests it can be done all together.

        (*) Despite your separating the speed of bumblebees and race cars, I can almost imagine a model of speed of movement with things like daily energy intake, mass, and friction as predictors, with the coefficients grouped by type of motor drive: flight, wheeled, slithering, etc. Since I know nothing about this domain, I have no idea if one would see a reasonable relationship across bumblebees, cheetahs, race cars, and 747s.

        • Bill:

          I think the solution is to get quantitative. When two things are somewhat similar, you can introduce a parameter delta for how different they are. When you have a bunch of things that are somewhat similar, the relevant parameter is the scale of the variation, what we call tau in the 8-schools example in chapter 5 of BDA.

    • Z:

      No, if they’re really unrelated, then the estimated group-level variance will be estimated to be large, and essentially no pooling will be done. If the topics are unrelated but the parameters are all roughly on the same scale (for example, a bunch of elasticity parameters, all between 0 and 1) then, yes, there will be some regularization, but you’d want to do regularization even if there were just one parameter being estimated, so the existence of multiple parameters has nothing to do with it.

      This is an issue that used to confuse statisticians back in the 1960s–1980s but I think it’s basically cleared up now and is no paradox at all. (Perhaps worth its own blog post if there remains confusion on it.)
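      That last point (you’d want regularization even with a single parameter) has a one-function illustration: with an informative prior, the Bayesian update shrinks a lone noisy estimate even when there are no siblings to pool against. A minimal conjugate-normal sketch, with invented numbers, where a normal prior is only a crude stand-in for a real prior on an elasticity in (0, 1):

```python
def regularize_one(y_hat, se, prior_mean, prior_sd):
    """Conjugate-normal update for a single parameter: combine one noisy
    estimate y_hat ~ N(theta, se^2) with a prior theta ~ N(prior_mean,
    prior_sd^2); the posterior mean is the precision-weighted average."""
    w = (1 / prior_sd**2) / (1 / prior_sd**2 + 1 / se**2)  # weight on prior
    post_mean = w * prior_mean + (1 - w) * y_hat
    post_sd = (1 / prior_sd**2 + 1 / se**2) ** -0.5
    return post_mean, post_sd

# One noisy elasticity-type estimate outside (0, 1); a N(0.5, 0.25^2)
# prior (crudely encoding "elasticities live in (0, 1)") pulls it back
post_mean, post_sd = regularize_one(y_hat=1.3, se=0.5, prior_mean=0.5, prior_sd=0.25)
# post_mean == 0.66, inside the plausible range
```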

      • “No, if they’re really unrelated, then the estimated group-level variance will be estimated to be large, and essentially no pooling will be done.”

        That doesn’t sound like a “no” to me, that sounds like in the worst case it’s no worse.

        Aren’t there some risks in the group-level estimates that are analogues of small-sample-size / weak-power problems? Depending on what the top-level hyperprior is, can’t you overestimate between-group heterogeneity, or overestimate the differences between sets of groups if the number of groups in each set is small?

  2. Thanks for posting this, Andrew, and +1 for a separate Stein “paradox” posting.

    Back to the examples. The car example and the 7 schools + 1 speed of light example seem sort of trivial, but it’s comforting to think that MRP will respond appropriately. 4 schools + 4 schools is essentially adding a new group, and that seems quite doable.

    I can see that the Xbox solution could have seemed counter-intuitive before it was done, and yet MRP behaved himself quite nicely there. It was nice that there was a benchmark (the professional “random sample” survey) to compare and contrast with.

    Now what about the medical example? Would MRP likely respond appropriately, either with improved estimates from partial pooling or high variances because of the lack of any relationship? Would MRP help with the problem mentioned in that article? I don’t really know of a comparative medical example (but that’s mostly because I don’t know the field). Has anyone done that?

    Or if it’s easier to think about more political science-related examples, imagine a particular campaign technique you want to test (perhaps this is analogous to the medical example). Say you conjecture that campaigning door-to-door will increase your share of the vote by perhaps 4 percentage points, and so you try that (Note to Andrew: I know that’s probably a huge impact. :-) ). In practice, it might raise your share by 10 points among one demographic group, by 2 points with another, 1 point with a third, and have no effect on a fourth. Those demographic groups have widely varying sizes, and you’d really like to learn enough to target campaigns to receptive groups. Would MRP work well, or would he get lost in the garden of forking paths? Would the garden always be evident from huge group-level variances, conditional on a decent model?
