Multilevel modeling questions

Karyn Heavner (Office of Program Evaluation and Research, AIDS Institute, New York State Department of Health) wrote to ask:

I am using data from a venue-based sample of men who have sex with men (MSM) for my dissertation (at SUNY Albany) and have been having a very difficult time determining what multivariate regression method is most appropriate for the analysis of my data. Lillian Lin at the CDC suggested that I contact you concerning this question.

My dependent variables are several HIV risk behaviors, all of which are very common (~60%) in the sample. The independent variables are venue and time of enrollment and a variety of sociodemographic variables. The unit of analysis is individual men. The sample consists of 704 MSM enrolled in 32 gay bars, 1 bathhouse, 1 community center, 1 café and 2 highway rest areas in a total of 4 cities in Upstate New York. Fewer than ten men were enrolled in many of the sites.

The majority of studies that analyzed data from similar studies used logistic regression. I believe that this is inappropriate for my data because the odds ratio obtained from logistic regression will overestimate the prevalence rate ratio. I decided to use Poisson regression to obtain a less biased odds ratio.

In addition, I believe that there is clustering of the data as men were randomly selected from individual venues not from an overall sampling frame of MSM. Due to the small number of MSM enrolled in many of the sites, I don’t think that a mixed model is appropriate. A faculty member at SUNY Albany suggested that I use GEE and bootstrap the residuals. I have not found support for this idea in the HIV/AIDS literature as the standard regression method used for this type of analysis is logistic regression. Other fields of study seem to provide some support for using GEE in this situation but I have not found any support in the literature for bootstrapping the residuals.

I was wondering if you would have some advice on whether GEE is truly the best method to analyze this data.

My reply:

Whether you do Poisson or logistic regression depends on whether you
have yes/no data or count data. In either case, you can fit a
hierarchical model that allows for variation among clusters. In my
opinion this is better than GEE. John Carlin and I and others have a paper
comparing hierarchical logistic regression to GEE; the paper is
here.

(As you can see, I’m consistent in my advice in always referring to my own work.)

Karyn replied:

I chose Poisson regression because it has recently been proposed for the
analysis of common outcomes with dichotomous data (see this article).

Is it really possible (and valid) to do a hierarchical model when 1/3 of
the 37 clusters have 10 or fewer observations (overall sample size ~700)?

My reply:

First, the McNutt et al. article is interesting. It doesn’t dismiss logistic regression but rather shoots down a particular formula for converting logistic regression coefficients to the probability scale. Iain Pardoe and I have done some more general work in this area but there is certainly a need for quick calculations of this sort.
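To make the concern about common outcomes concrete, here is a small arithmetic sketch of how far the odds ratio can sit from the prevalence ratio. The prevalences are made up for illustration and are not taken from the study or from the McNutt et al. article; the function names are likewise hypothetical.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def prevalence_ratio(p1, p0):
    """Ratio of prevalences between two groups."""
    return p1 / p0


def odds_ratio(p1, p0):
    """Ratio of odds between the same two groups."""
    return odds(p1) / odds(p0)


# Hypothetical prevalences, chosen only for illustration.
scenarios = {
    "common outcome": (0.60, 0.40),
    "rare outcome": (0.03, 0.02),
}

for label, (p1, p0) in scenarios.items():
    pr = prevalence_ratio(p1, p0)
    or_ = odds_ratio(p1, p0)
    print(f"{label}: prevalence ratio = {pr:.2f}, odds ratio = {or_:.2f}")
```

With a behavior near 60%, a prevalence ratio of 1.5 corresponds to an odds ratio of 2.25, whereas for the rare outcome the two are nearly equal. The logistic regression itself is not wrong here; the issue is only in reading the odds ratio as if it were a ratio of probabilities.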

Second, yes, you can definitely use a hierarchical model when there are only a few observations in each group. This is, in fact, the ideal setting for hierarchical modeling.
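A rough sketch of why small groups are not a problem: a hierarchical model partially pools each venue's estimate toward the overall mean, and it pools more when the venue has fewer observations. The code below uses the standard precision-weighted approximation on the probability scale rather than a full hierarchical logistic regression, and every number in it (the overall mean, the between-venue standard deviation, the venue counts) is an assumption picked for illustration.

```python
import math


def partial_pool(y_bar, n, grand_mean, sigma_y, sigma_alpha):
    """Precision-weighted average of a venue's raw proportion and the overall mean.

    sigma_y is the within-venue sd of a single observation;
    sigma_alpha is the between-venue sd of the true venue means.
    """
    w_data = n / sigma_y ** 2        # precision of the venue's own average
    w_pop = 1.0 / sigma_alpha ** 2   # precision of the population distribution
    return (w_data * y_bar + w_pop * grand_mean) / (w_data + w_pop)


# Assumed values, for illustration only.
grand_mean = 0.60                                    # overall proportion
sigma_y = math.sqrt(grand_mean * (1 - grand_mean))   # sd of a yes/no outcome
sigma_alpha = 0.10                                   # assumed between-venue sd

# (name, number of men enrolled, number reporting the behavior), all hypothetical.
venues = [("small venue", 5, 5), ("large venue", 100, 80)]

for name, n, k in venues:
    raw = k / n
    pooled = partial_pool(raw, n, grand_mean, sigma_y, sigma_alpha)
    print(f"{name}: n={n}, raw={raw:.2f}, partially pooled={pooled:.2f}")
```

The venue with five men is pulled most of the way back toward the overall mean, while the venue with a hundred men stays close to its raw proportion. In a real analysis the between-venue variation would be estimated from the data along with everything else, not fixed in advance as it is here.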

I’ll be teaching a short course on multilevel modeling for the NYC Department of Health, so if you’re interested and you have space, perhaps you could attend.