Enrique Pérez Campuzano writes,

I’m using a multilevel logistic model to predict the probabilities of internal migration in Mexico. My level-1 units are persons and my level-2 units are cities. I ran some analyses on a small sample of my dataset in R using lmer, as you do. The model fits well, I think, but I was wondering whether the assumption of 50/50 probabilities can be applied when the distribution of the response variable is not 50/50, or when the cutoff point is not 50%. In my data only 0.7% of the population moved from one city to another. With this kind of data, can we use the “divide by 4 rule” for estimating probabilities? Or can we change the cutoff point to improve the coefficients in the lmer command (for example, logistic regression in SPSS allows changing the cutoff point)?

Second, as I said, this small model is only the beginning. My real data are around 10,000,000 individuals from a special sample of the Mexican Census. Is there any problem with estimation for samples of this size? I think the computational problem is more or less solved, because the University of California at San Diego allows me to run the analysis at its supercomputing center.

My response:

1. That’s right, when the probabilities are far from 1/2, you can’t use the divide-by-4 rule. See Section 5.7 of our book for discussions of how to compute the appropriate probability changes.
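To make the contrast concrete, here is a small sketch in Python (not the questioner's R code; the coefficients are made up, with the intercept set from the 0.7% base rate mentioned in the question) comparing the divide-by-4 upper bound with the actual probability change computed from the inverse logit:

```python
import math

def invlogit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical coefficients: a base rate of 0.7% implies an intercept
# of logit(0.007) ~ -4.96; beta = 0.5 is an illustrative slope.
alpha = math.log(0.007 / 0.993)
beta = 0.5

# Divide-by-4 rule: upper bound on the change in probability per unit
# change in x, a good approximation only when p is near 1/2.
rule_of_thumb = beta / 4          # 0.125

# Actual change in probability for a one-unit change in x, evaluated
# near the base rate:
actual = invlogit(alpha + beta) - invlogit(alpha)
# With p far from 1/2, `actual` is far smaller than beta/4.
```

The point of the sketch: at p near 0.007, the true probability change per unit of x is on the order of half a percentage point, nowhere near the 12.5 points the divide-by-4 shortcut would suggest.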

2. I’m pretty sure that R will choke on datasets of size 10 million; you might try Stata. (We have some example code in Appendix C on fitting multilevel models in Stata and other languages.) Another way to look at it: if you have 10 million data points, you can probably start by fitting a model separately to different subsets of the data; that’s probably how I’d proceed. You can fit separate multilevel models for different demographic groups, then postprocess in some way, for example using plots or second-level regressions fit to the estimated coefficients.
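A minimal sketch of that split-then-postprocess idea, in Python rather than R or Stata, with simulated data (the city-level predictor, sample sizes, and true coefficients are all made up): fit something simple to each subset — here just the empirical log-odds of moving per city — then run a second-level regression of those estimates on a city-level predictor.

```python
import math
import random

random.seed(1)

def empirical_logit(moved, n):
    """Log-odds with a small continuity correction to avoid log(0)."""
    return math.log((moved + 0.5) / (n - moved + 0.5))

# Simulated cities: (city-level predictor z, persons n, number who moved),
# generated from a true model with base rate around 0.7%.
cities = []
for _ in range(50):
    z = random.gauss(0, 1)
    n = 10_000
    p = 1 / (1 + math.exp(-(-5.0 + 0.4 * z)))
    moved = sum(random.random() < p for _ in range(n))
    cities.append((z, n, moved))

# Level-1 step: one estimate per city (the "separate fits").
estimates = [(z, empirical_logit(moved, n)) for z, n, moved in cities]

# Level-2 step: least-squares regression of the city estimates on z.
zbar = sum(z for z, _ in estimates) / len(estimates)
ybar = sum(y for _, y in estimates) / len(estimates)
slope = (sum((z - zbar) * (y - ybar) for z, y in estimates)
         / sum((z - zbar) ** 2 for z, _ in estimates))
intercept = ybar - slope * zbar
# The second-level fit should land near the true values (0.4 and -5.0).
```

In a real analysis each per-city (or per-demographic-group) fit would itself be a regression or multilevel model, but the structure — separate fits, then a regression on the estimated coefficients — is the same.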

Stata is awful at handling large datasets. As a rule of thumb, if your dataset, plus the matrices generated in whatever temporary calculations are performed, does not fit in operating memory, Stata will start swapping and grind away forever.

If R cannot handle it, you might consider SAS, which handles such situations better.

I have to disagree. Stata is great at handling large datasets, as long as you have enough RAM, precisely because it works with all the data in RAM. RAM is cheap, and I see no reason why a dataset with 10 million observations shouldn't fit in RAM, without even going to a 64-bit platform. If you have hundreds of variables in your model, then you might start to run into problems. But I think the bigger problem in any case will be the sheer number of calculations required, even without disk-swap problems.

It has been about 4 years, but in a regression class during my senior year of college we had a huge data set that made R crash and die and the only thing we could get to handle it was SAS. Ugh. We found the batch SAS approach totally inferior to the interactive approach of most other statistical packages. It was very frustrating.

NumberCruncher: I am not sure what you disagree about. If you have enough RAM, Stata works fine; my point is about what happens when you don't. Ten million observations is not exactly trivial: with 4 GB of RAM it takes only about 100 variables before you fill up RAM with the dataset itself, and you need substantially more to run even straight linear regressions. RAM is not exactly cheap; it quickly becomes the main part of the cost. A few days ago I priced a state-of-the-art workstation (dual quad-core CPUs, four very fast hard drives) with 32 GB of RAM; it was $15K, with $10K of that for the RAM. Stata is fine for interactive use with moderate-size datasets, but for anything large or requiring significant programming it quickly becomes clear that it is just an inefficient toy. In particular, you never want to swap with Stata; it appears to assume that all the data are in operating memory, which results in swapping like crazy.
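The "about 100 variables in 4 GB" arithmetic can be checked directly. A quick sketch, assuming 4 bytes per stored value (Stata's default float type; 8-byte doubles would double these figures):

```python
# Back-of-the-envelope memory footprint for a 10-million-row dataset.
obs = 10_000_000
bytes_per_value = 4   # assumed storage type: 4-byte float

for n_vars in (10, 50, 100):
    gb = obs * n_vars * bytes_per_value / 2**30
    print(f"{n_vars:3d} variables: {gb:5.1f} GB")
# 100 variables at 4 bytes each come to ~3.7 GB, i.e. a 4 GB machine
# is essentially filled by the dataset alone, before any temporary
# matrices from the estimation itself.
```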

10 million is not small.

As a starting point, you might try drawing several samples of 100,000 (or whatever number you can handle) and analyzing them.
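In Python terms (a sketch, with `analyze` standing in for whatever model you actually fit), the repeated-subsample idea is just:

```python
import random

N = 10_000_000        # full dataset size
sample_size = 100_000 # whatever your hardware can handle
n_samples = 5

random.seed(0)
for rep in range(n_samples):
    # Draw indices without replacement, then fit the model to that subset.
    idx = random.sample(range(N), sample_size)
    # subset = data[idx]; fit = analyze(subset); store `fit` for comparison
```

Comparing the fits across replicates also gives a rough, free check on the stability of the estimates.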

I've done lots of regressions on datasets of around 5 million observations using Stata on a desktop PC with 2 GB of memory, and it works great, as long as I don't try to have too many variables at once. 10 million might be pushing it, though, unless you have a 64-bit machine.

W-

As I said, if the model contains 100+ variables you may be in trouble with 10m observations, but I have analyzed many datasets with 1m-10m observations using Stata and I do not have a workstation, just a PC. I have also written many Stata programs to analyze datasets ranging from small to large and don't think "inefficient toy" characterizes it well at all. I don't think you know much about Stata's programming capabilities.

If you are trying to fit a multilevel model with hundreds of variables and 10m observations, you will need a workstation and a lot of time regardless of the software you use. Have you ever done that in SAS or anything else?

I am trying to do a multilevel logistic analysis using Stata, with no luck. At level 1 (N = 3,400), my outcome variable is S4, violence experienced during the past 12 months (0 = no, 1 = yes), and my predictor is the respondent's education, with three categories (1 = none, 2 = high school, 3 = more than high school).

These individuals are nested in my level-2 units: attitudes toward violence norms at the community level (calculated by aggregating the individual responses of the above sample within 423 communities). The communities are in turn nested in my level-3 units, with level of unemployment and provincial GDP measured for 28 provinces.

My questions are,

1- How do I structure my data in Stata? Do I have to have three different files, one for each level, the same way we do using HLM? If not, how do I set up my single file for the analysis?

2- Using interactive mode in Stata MP 10, how do I declare some of my independent variables, such as education, as categorical, so that I get odds ratios for high school and for more than high school relative to no education as the reference category?

I am trying to do a multilevel analysis of factors associated with child mortality, so I wanted to use a multilevel logistic regression model. How can I handle this and fit such a model?