Spatial correlation models

Jason McDaniel writes with a question about using spatial lag regression models for studying voting patterns by ethnic group. I’ll give his question, then my (brief) reply.

Jason writes:

I am a PhD candidate in political science at USC . . . I entered grad school as a political theorist mostly uninterested in quantitative methods, but have done a complete 180 degree turn towards the use of quantitative methods.

I have a question that I would appreciate your help with.

I am wondering if it is possible to develop spatial lag regression models (Max Likelihood Estimator) (See Anselin and Cho 2002) using variables estimated via a geographically weighted regression (GWR) approach to ecological inference (Calvo and Escolar 2004)?

Let me be more specific about my methodological approach so far. I am studying spatial context effects on local voting behavior in Los Angeles. I started from a geographically located database (GIS file) of raw vote totals, registered voters, and various census demographics (aggregated at the census tract level).

From that point I used Calvo and Escolar’s GWR-Ei method to estimate racial group voter turnout, and second stage racial group vote share for each of two candidates in a mayoral election. I specifically used the GWR-Ei approach because I wanted to account for spatial autocorrelation (King’s Ei, of course, assumes no spatial autocorrelation), and so that I could further explore the resulting data using spatial econometric methods such as spatial lag regression (Anselin).

I proceeded to develop spatial lag regression models of the racial group percent vote for each candidate. My results have been very satisfying so far, and were well received by the spatial analysis panel at the midwest poli sci conference.

However, I have recently received some criticism that fundamentally questions my basic approach, and I am worried that my lack of statistical expertise has caused me to make a basic methodological error.

I understand that it is not very common to develop regression models using estimated variables. Do you have any advice about what I should do to account for the error associated with the GWR-Ei process? Or should I abandon my work so far because it is simply methodologically unacceptable?

If you think there is a fundamental flaw in my approach, I would like to ask you a follow-up question about multi-level modeling. That is, do you think it would be possible for me to develop a multi-level model that includes exit polling of LA mayoral races with my geographically located aggregate data? (Similar to Sampson, Morenoff, and Earls 1999) Being a complete novice to HLM, do you have any advice about where to start?

My response: When you say, “I understand that it is not very common to develop regression models using estimated variables,” I think you’re referring to this 2-stage approach in which you first fit a model (in this case, an ecological regression) and then use the estimates from this model as data in another regression. Formally, yes, one should include all of this in a single hierarchical model. In practice, however, we’ll often do this 2-stage approach to avoid the difficulty of setting up a full model (see the recent Political Analysis issue on multilevel modeling for several examples; here’s my discussion).
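To make the 2-stage idea concrete, here is a minimal sketch on simulated data. It is not Jason's GWR-Ei pipeline and not a full hierarchical model. Stage 1 fits a separate least-squares line within each unit and keeps the slope and its standard error; stage 2 regresses those estimated slopes on a unit-level predictor. The inverse-variance weighting in stage 2 is my assumption about one simple way to carry the stage 1 uncertainty forward, and it ignores the between-unit variance that a multilevel model would estimate. All names (`ols`, `two_stage`, `z`) are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares: return coefficients and their standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    return beta, se

def two_stage(groups, z):
    """groups: list of (x, y) arrays, one pair per unit; z: unit-level predictor."""
    slopes, ses = [], []
    # Stage 1: a separate regression within each unit.
    for x, y in groups:
        X = np.column_stack([np.ones_like(x), x])
        beta, se = ols(X, y)
        slopes.append(beta[1])
        ses.append(se[1])
    slopes, ses = np.array(slopes), np.array(ses)
    # Stage 2: weighted least squares of the estimated slopes on z,
    # with weights 1/se^2 (an assumption, see the text above).
    sw = 1.0 / ses
    Z = np.column_stack([np.ones_like(z), z])
    gamma, *_ = np.linalg.lstsq(Z * sw[:, None], slopes * sw, rcond=None)
    return slopes, ses, gamma

# Simulated example: the true unit slopes depend linearly on z.
rng = np.random.default_rng(0)
J, n = 50, 20
z = rng.uniform(0, 1, J)
true_slopes = 1.0 + 2.0 * z + rng.normal(0, 0.1, J)
groups = []
for j in range(J):
    x = rng.normal(0, 1, n)
    y = 0.5 + true_slopes[j] * x + rng.normal(0, 0.5, n)
    groups.append((x, y))

slopes, ses, gamma = two_stage(groups, z)
print("stage 2 intercept and slope:", gamma)
```

The point of keeping `ses` around is that the stage 2 fit can at least downweight the noisier stage 1 estimates rather than treating every estimate as exact data.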

With ecological regression, the bigger issue is worrying that the model is inappropriate. You can address some of these concerns through residual plots (see here) and by cross-validation, which shouldn’t be hard to do in a census-tract-level analysis where N is really large.
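As a rough illustration of the cross-validation check, here is a minimal K-fold sketch on simulated tract-level data. The variables (`share`, `turnout`) and the linear model are assumptions made up for the example, not the ecological regression itself. One caveat: the folds here are random, which ignores spatial autocorrelation; with real tracts, holding out spatially contiguous blocks may give a more honest error estimate.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Out-of-sample RMSE of a least-squares fit, estimated by K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_err = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only, then predict the held-out fold.
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        sq_err.append((y[test] - X[test] @ beta) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_err))))

# Simulated tracts: turnout depends linearly on a group's population share.
rng = np.random.default_rng(1)
N = 2000
share = rng.uniform(0, 1, N)
turnout = 0.3 + 0.4 * share + rng.normal(0, 0.05, N)

X_lin = np.column_stack([np.ones(N), share])   # intercept + share
X_const = np.ones((N, 1))                      # intercept-only baseline

rmse_lin = kfold_rmse(X_lin, turnout)
rmse_const = kfold_rmse(X_const, turnout)
print("linear model RMSE:", rmse_lin)
print("baseline RMSE:", rmse_const)
```

Comparing against the intercept-only baseline shows how much predictive accuracy the model actually adds on held-out tracts, which is the kind of evidence that speaks to whether the model is appropriate.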

And, yes, multilevel modeling would be good. Ansolabehere, I believe, has written about combining poll data with ecological data in such settings.

1 thought on “Spatial correlation models”

  1. Thanks for the link to your commentary in Political Analysis. I have been grappling with the 2-stage vs. MLM issue for a while and find a similar situation where 2-stage seems to have an advantage, especially in fitting. I was wondering if you'd agree.

    I evaluate energy efficiency programs. I commonly deal with panel data sets where I may have about 10,000 panels (houses) with about 10-20 observations (monthly utility meter readings) for each before and after a treatment. I also have a comparison group.

    A simple stage 1 regression of the energy usage of each home before/after treatment against one or two explanatory variables (outdoor temperature data) typically has a very good fit (r-squared > 0.9, often 0.95). I then create a linear prediction for the before and after states of each panel, which is an estimate of annual usage during a normal weather year (I also create an estimate of the standard error of this estimate).

    I then work with a dataset with one observation per panel where I explore models of the change in annualized energy usage and often have 5-25 potential explanatory variables (none of which are related to the truly exogenous stage 1 predictors).

    The within-panel variance is much smaller than the between-panel variance, and I don't think there would be much to gain by fitting this as an MLM. Very few of the stage 1 regressions have poor fits, and those that do are usually excluded from the second stage analysis (and examined as part of sample attrition).

    Do you think this approach sounds reasonable? Thanks in advance
