
Combining data from many sources

Mark Grote writes:

I’d like to request general feedback and references for a problem of combining disparate data sources in a regression model. We’d like to model log crop yield as a function of environmental predictors, but the observations come from many data sources and are peculiarly structured. Among the issues are:

1. Measurement precision in predictors and outcome varies widely with data sources. Some observations are in very coarse units of measurement, due to rounding or even observer guesswork.

2. There are obvious clusters of observations arising from studies in which crop yields were monitored over successive years in spatially proximate communities. Thus some variables may be constant within clusters; this is true even for log yield, probably due to rounding of similar yields.

3. Cluster size and intra-cluster association structure (temporal, spatial or both) vary widely across the dataset.

My [Grote's] intuition is that we can learn about central tendency even by fitting models that deal only superficially with sample structure (e.g., least-squares with robust estimates of standard errors). But I wonder if we could do better, while still keeping the analysis relatively simple. Although multi-level modeling might appeal, many of the clusters are singletons, which gives me [Grote] pause.

My reply:

1. It’s no problem doing multilevel modeling when many (or even most) of the clusters are singletons.

2. I don’t think robust standard errors will get you anywhere.

3. It sounds like you want a model with different error variances for different data points. That’s easy enough to do in Bugs/Jags or if programming by hand, possibly doable in Stata’s multilevel modeling functions, and not so easy to do in lmer in R without some additional programming.
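
To make points 1 and 3 concrete, here is a minimal sketch of the "programming by hand" route in Python. It is an illustration under strong assumptions, not the model one would actually fit to the crop data: it has an intercept only (no environmental predictors), and it treats the between-cluster standard deviation `tau` and the per-observation error standard deviations `sd` as known. The function name `partial_pool` and the toy numbers are made up for this example. Each observation carries its own error variance, and singleton clusters are handled by the same formula as large clusters; they simply get pulled harder toward the overall mean.

```python
from collections import defaultdict


def partial_pool(y, sd, cluster, tau):
    """Precision-weighted partial pooling for an intercept-only model.

    Assumed model (illustrative, not taken from the post):
        y[i] = mu + a[cluster[i]] + e[i]
        a[j] ~ Normal(0, tau**2)
        e[i] ~ Normal(0, sd[i]**2)   # a different error sd per data point
    tau and the sd values are treated as known here.
    """
    # Per-cluster sums of precisions and of precision-weighted outcomes.
    prec = defaultdict(float)
    wsum = defaultdict(float)
    for yi, si, j in zip(y, sd, cluster):
        prec[j] += 1.0 / si**2
        wsum[j] += yi / si**2

    # Precision-weighted cluster means and their sampling variances.
    ybar = {j: wsum[j] / prec[j] for j in prec}
    var = {j: 1.0 / prec[j] for j in prec}

    # Overall mean: marginally, ybar[j] ~ Normal(mu, tau**2 + var[j]).
    w = {j: 1.0 / (tau**2 + var[j]) for j in prec}
    mu = sum(w[j] * ybar[j] for j in prec) / sum(w.values())

    # Cluster levels: shrink each ybar[j] toward mu.
    # shrink[j] near 1 means little pooling; near 0 means heavy pooling.
    shrink = {j: tau**2 / (tau**2 + var[j]) for j in prec}
    level = {j: mu + shrink[j] * (ybar[j] - mu) for j in prec}
    return mu, level, shrink


# Toy data: cluster A has three precise observations,
# clusters B and C are singletons with coarse measurements.
y = [2.0, 2.2, 1.9, 3.0, 1.0]
sd = [0.1, 0.1, 0.1, 0.5, 0.5]
cluster = ["A", "A", "A", "B", "C"]

mu, level, shrink = partial_pool(y, sd, cluster, tau=0.5)
for j in sorted(level):
    print(j, round(level[j], 3), round(shrink[j], 3))
```

In a real analysis `tau` and the error variances would be estimated along with the regression coefficients (for example in Bugs/Jags), with the error variance modeled as a function of data source rather than supplied as a known number.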

One Comment

  1. Maarten Buis says:

    Regarding point 3: Depending on the software you use, it may be easier to trick a Structural Equation Modeling module into estimating the random effects model. This typically gives you great freedom to relax constraints such as a common error variance. The link below discusses how to do this in Stata:

    http://blog.stata.com/2011/09/28/multilevel-random-effects-in-xtmixed-and-sem-the-long-and-wide-of-it/