Enrique Pérez Campuzano writes,
I’m using a multilevel logistic model to predict the probabilities of internal migration in Mexico. My level 1 units are persons and my level 2 units are cities. I ran some analyses with a small sample of my dataset in R using lmer, as you do. The model, I think, fits well, but I was wondering whether the 50/50 assumption for probabilities can be applied when the distribution of the response variable is not 50/50 or when the cutoff point is not 50. In my data only 0.7% of the population moved from one city to another. With this kind of data, can we use the “divide by 4 rule” for the estimation of probabilities? Or can we change the cutoff point to improve the coefficients in the lmer command (for example, logistic regression in SPSS allows you to change the cutoff point)?
Second, as I said, this small model is only the beginning. My real data are around 10,000,000 individuals from a special sample of the Mexican Census. Is there any problem with estimation in samples of this kind? I think the computational problem is more or less solved because the University of California at San Diego allows me to run the analysis at its supercomputing center.
My reply:
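1. That’s right: when the probabilities are far from 1/2, you can’t use the divide-by-4 rule. The rule gives the maximum slope of the logistic curve, β/4, which is attained at a probability of 1/2; at a baseline probability p the slope is β·p(1−p), which is far smaller when p is near 0.7%. See Section 5.7 of our book for a discussion of how to compute the appropriate probability changes.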
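To make the point about rare outcomes concrete, here is a minimal sketch in Python (not the lmer code from the question, and not necessarily the method of Section 5.7). It compares the divide-by-4 bound with the change in probability at a 0.7% baseline. The coefficient value 0.4 and the helper names are made up for illustration.

```python
import math

def inv_logit(x):
    """Inverse logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Logistic transform: maps a probability to the linear-predictor scale."""
    return math.log(p / (1.0 - p))

def prob_change(beta, base_prob, delta=1.0):
    """Change in Pr(y=1) when the predictor rises by delta,
    starting from a case whose probability is base_prob."""
    x0 = logit(base_prob)
    return inv_logit(x0 + beta * delta) - inv_logit(x0)

beta = 0.4  # hypothetical logistic regression coefficient, for illustration only

# Divide-by-4 rule: the slope of the curve at p = 1/2, an upper bound on the change.
divide_by_4 = beta / 4

# Slope of the logistic curve at the migration rate from the question, p = 0.007.
slope_at_rare = beta * 0.007 * (1 - 0.007)

# Exact change in probability for a one-unit difference at that baseline.
change_at_rare = prob_change(beta, 0.007)

print(divide_by_4, slope_at_rare, change_at_rare)
```

With these illustrative numbers the divide-by-4 rule suggests a change of 0.1, while the change at the 0.7% baseline is roughly 0.003, which is why the rule is misleading for rare outcomes.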
2. I’m pretty sure that R will choke on a dataset of 10 million observations; you might try Stata. (We have some example code in Appendix C on fitting multilevel models in Stata and other languages.) Another way to look at it: with 10 million data points, you can probably start by fitting the model separately to different subsets of the data; that’s probably how I’d go. You can fit separate multilevel models for different demographic groups, then postprocess in some way, for example using plots or second-level regressions fit to the estimated coefficients.
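As one possible postprocessing step, here is a minimal sketch of the simplest second-level regression: an intercept-only fit that weights each subset's coefficient by its precision. It assumes the subsets are estimating a common coefficient and that the standard errors are taken as known. The function name and all numbers are hypothetical, not results from any real fit.

```python
import math

def pool_estimates(estimates, std_errors):
    """Precision-weighted average of per-subset coefficient estimates.

    Equivalent to an intercept-only second-level regression with
    known sampling variances (an assumption made for this sketch).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se

# Made-up coefficient estimates and standard errors from three subsets.
subset_estimates = [0.35, 0.42, 0.40]
subset_std_errors = [0.05, 0.05, 0.05]

pooled, pooled_se = pool_estimates(subset_estimates, subset_std_errors)
print(pooled, pooled_se)
```

If the coefficients plausibly differ across demographic groups, plotting the separate estimates with their uncertainties, or regressing them on group-level predictors, would be more informative than a single pooled number.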