Someone who wishes to remain anonymous writes:
I’m working on building a predictive model (not causal) of the onset of diabetes mellitus using electronic medical records from a semi-panel of HMO patients. The dependent variable is blood glucose level. The unit of analysis is the patient visit to a network doctor or hospitalization in a network hospital aggregated to the month-year level. The time frame is from the early 80s to the present. Since my focus is on the onset of the disease, my approach is agnostic and prospective. I would like to derive data-driven answers to questions of co-morbidity, patient health and wellness based on physical measures such as BMI or BP as well as physician and hospital quality as an inherent part of the model output.
To me, addressing these issues with data of this type would require multiple models for full coverage:
1) A survival model to capture censoring and time to disease onset
– Censoring can have multiple causes: diagnosed with diabetes type 1 or 2, lost to follow-up, death, etc.
2) Multiple hierarchical Bayesian models for massively categorical variables such as patient, diagnosis, doctor, hospital to capture the differing dependence structures
– Patient within zipcode, community, county, state to capture the social determinants of health
– Patient within a family network, e.g., children, siblings, parents, etc., to reflect familial history of disease
– Patient and diagnoses received — thousands of possible diagnoses which collapse into higher levels
– Patient within HMO doctor and hospital network
– Doctor within specialty — probably 70 or so specialties overall
– Doctor within zipcode, community, county, state
– Hospital within zipcode, community, county, state
3) As available, the impact of programs and interventions designed to promote wellness, mitigate or prevent disease…these could include recommendations regarding exercise, diet, etc.
4) Given the wide time frame, macro-economic indexes to capture the well-known impact of the business cycle on the determinants of medically-related activities
These are preliminary thoughts as I have not yet begun the process of testing the need for specifying all of these hierarchies since I am still in the initial stages of the analysis. Just getting this data lined up and talking together is a significant challenge in and of itself.
My question for you concerns the need for multiple models when the dependence structures overlap and are as messy as in the present case. I’m sure you’re going to advise against such a wide-ranging predictive design, enjoining me to greater research focus and specificity. My preference is to retain an expansive and exploratory stance and not to simplify the in-going hypotheses just for the sake of the modeling. Honestly, I think that there is already too much specificity in the literature which does little or nothing to uncover and identify the broad antecedents of this illness.
What do you think? Am I missing something? Suggestions?
P.S. Is anyone working on hierarchical survival models?
My reply: It does sound kind of appealing to just throw everything into the model and let Stan sort it out. On the other hand, it also seems like the “throw it all in at once” strategy is a recipe for confusion, and it could be hard to interpret the results. So let me give you the generic suggestion that, whatever model you start with, you check it out using fake-data simulation (that is, simulate fake data from the model, then fit the model and check that you can recover the parameters of interest and make good predictions). And I’d suggest starting simple and working up from there. Ultimately I think a more complex model is better and should be more believable, but you might have to work up to it, because of challenges of computation, identification, and understanding of the model.
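To make the fake-data idea concrete, here is a minimal sketch in Python. It is a toy, not the model from the letter: it assumes exponential time to onset, a single binary covariate, and administrative censoring at a fixed time, and it swaps in a closed-form maximum likelihood fit (events divided by exposure) in place of a full Stan model. The names simulate and fit and all parameter values are made up for illustration.

```python
import math
import random

def simulate(n, base_rate, log_hr, censor_time, rng):
    """Simulate right-censored exponential survival times for two groups."""
    data = []
    for i in range(n):
        x = i % 2                               # hypothetical binary risk factor
        rate = base_rate * math.exp(log_hr * x)  # hazard for this patient
        t = rng.expovariate(rate)               # true time to onset
        observed = min(t, censor_time)          # follow-up stops at censor_time
        event = t <= censor_time                # False means censored
        data.append((x, observed, event))
    return data

def fit(data):
    """Closed-form MLE for the exponential model: rate = events / exposure."""
    events = [0, 0]
    exposure = [0.0, 0.0]
    for x, t, event in data:
        events[x] += int(event)
        exposure[x] += t
    rate0 = events[0] / exposure[0]
    rate1 = events[1] / exposure[1]
    return rate0, math.log(rate1 / rate0)

# Assumed "true" values, chosen only for this illustration.
true_base_rate, true_log_hr = 0.05, 0.7

rng = random.Random(1)
estimates = [
    fit(simulate(2000, true_base_rate, true_log_hr, 10.0, rng))
    for _ in range(200)
]
mean_rate = sum(e[0] for e in estimates) / len(estimates)
mean_log_hr = sum(e[1] for e in estimates) / len(estimates)

print(f"base rate:        true {true_base_rate:.3f}, mean estimate {mean_rate:.3f}")
print(f"log hazard ratio: true {true_log_hr:.3f}, mean estimate {mean_log_hr:.3f}")
```

If the average estimates land close to the values used to generate the data, the fitting procedure is at least doing what it claims on data where the truth is known. The same loop carries over as the model grows: add one hierarchy at a time and rerun the check before moving on.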
P.S. Matt Gribble adds:
I wanted to plug some exciting work by Michael Crowther extending generalized gamma regression to have random effects not only on the log-hazards scale (frailty models) but also on the log-relative median survival time scale. The paper’s in press at Statistics in Medicine, and I had nothing to do with it but can’t wait to cite/use it. http://onlinelibrary.wiley.com/doi/10.1002/sim.6191/abstract
Not quite what I plugged (I was basing the plug about survival time random effects on slides I saw of his, not on the actual paper) but I think this ref is still cutting-edge stuff in the theme of hierarchical survival models.
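For readers unfamiliar with the two scales mentioned above, here is a generic textbook-style sketch of the distinction, not the specific model in the linked paper. The notation is assumed for illustration: patient i in cluster j, covariates x_ij, cluster-level random effect b_j, baseline hazard h_0, and error term epsilon_ij.

```latex
% Shared frailty model: the random effect shifts the log hazard.
h_{ij}(t \mid b_j) = h_0(t)\,\exp\!\left(x_{ij}^{\top}\beta + b_j\right),
\qquad b_j \sim \mathrm{N}(0, \sigma_b^2)

% Accelerated failure time model with a random intercept:
% the random effect shifts the log survival time.
\log T_{ij} = x_{ij}^{\top}\beta + b_j + \sigma\,\varepsilon_{ij},
\qquad b_j \sim \mathrm{N}(0, \sigma_b^2)
```

In the second form, exp(b_j) multiplies every quantile of the survival time in cluster j, including the median, which is why the random effect can be read on the log relative median survival time scale.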