Fitting a multilevel model

Cui Yang writes:

I have a question about the use of BRT (Boosting regression tree). I am planning to write an article about the effects of soil fauna and understory fine roots on forest soil organic carbon. The experiment was conducted in a subtropical forest area in China. There were 16 blocks each with 5 coarse-mesh nylon bag wrapped and 5 fine-mesh nylon bag wrapped soil cores throughout four forests (forests A/B/C/D). Each forest had four different blocks. The coarse mesh (4 mm) bags allowed the access of both the entire soil fauna and understory plant fine roots. The meshes of fine bags were on the whole 0.038 mm, except that a row of 4 mm holes were distributed evenly at the lateral side and a circle of 4 mm holes at the circumferential bottom. Ideally, we supposed that this kind of fine bags could allow the access of most of soil fauna through 4 mm holes, while getting rid of the roots by the overall 0.038 mm meshes.

Originally, the dataset generated from this experiment should be 160 pieces of record constituted of soil total carbon, dissolved organic carbon, microbial biomass carbon plus soil fauna abundances and fine roots biomass. The soil fauna data was of coarse taxonomic resolution and mainly included six abundant fauna taxa, i.e., Acari, Collembola, Hymenoptera, Nematode, Enchytraeid. For fine roots data, parts of records in forests B (18 records) and C (17 records) were missing because of faulty operation. As a result, fine roots biomass had records at only 78% soil cores.

Because BRT can handle missing data in predictor variables by using surrogates, I decided to use BRT to determine the relative influence of predictor variables on the individual response for soil carbon, especially for the abundant fauna taxa and fine roots. So, can BRT be applicable to the abovementioned dataset, which only has 160 records and one of its main explanatory variable covers less than 80% records?

I have sent a similar mail to Professor *** who advised me to read your book “Data Analysis Using Regression and Multilevel/Hierarchical Models” for the resolution. Since my data are structured (samples within blocks within forests), a generalized linear mixed model will be suited. But I am worrying about that, while removing the records with missing data about fine roots, the experimental design would be an unbalanced one and at the expense of less useful data of other variables such as soil carbon and soil fauna taxa. Upon the unbalanced design, I will use Redundancy Analysis as a tool for multivariate ANOVA. Through RDA, I might select the several most influential fauna taxa probably with fine root. After this action, I would use Variation Partitioning to quantifying the variation explained by these factor. However, I am puzzled that, should the model used by RDA involve the effects of mesh size of nylon bag (fine mesh versus coarse mesh) and blocks factors or as well as forests? I am mainly using software R for statistical analyses. If the alternative of BRT is RDA, how can the model of RDA involve the data structure which seems to be a split-plot design? I feel that it is difficult to use functions such as adonis {vegan} in R to define the data structure of split-plot designs, especially under the condition of unbalanced design. What’s worse, my focus is the effects of soil fauna and fine root instead of mesh size/blocks factors/forests.

My reply:

You write, “BRT can handle missing data in predictor variables by using surrogates.” But any method for prediction can handle missing data by using surrogates, no? Or maybe I’m missing something here. In any case, here are some general comments:

– If your missingness is only in the outcome variable, then just fit a multilevel model, no problem.

– If some of your predictors have some missingness, I recommend imputing the missing values first using some multivariate imputation method. We have an R package “mi” to do this, but other options are available too.

– Unbalanced data don’t present any difficulty; indeed, multilevel models are well suited to unbalanced data. Unbalanced data is one of the main motivations for fitting multilevel models.

– Similarly, you’re fine with any mix of nested and non-nested factors. Split-plot is fine too, the multilevel model will handle it automatically, as discussed in my 2005 paper.

– Once it’s time for you to fit your multilevel model, I recommend you use Stan. In my book with Jennifer we use lmer, but now I use Stan. We just haven’t updated the book yet.

1 thought on “Fitting a multilevel model

  1. I have another environmental science dataset that I am working with that has similarities to this one, it is nested, unbalanced, and has missing data.

    I have already managed to impute missing data and fit some multi-level models in R using Amelia for imputation and glmmTMB for linear mixed modeling. Assuming my model specification is correct, the model notation is basically: { outcome ~ treatment_effect_of_interest + (starting_value | random_location) + offset(starting_value) }. The main scientific question I am trying to answer can be resolved by simply reporting the coefficients for the fixed effect(s) and their confidence intervals, and pretty much ignoring the other stuff –> what is the effect size of the treatment_effect_of_interest, i.e. how much does it change the outcome from the starting value?
    I am simplifying a bit here, but I have been able to figure out a pooled estimate of these effect size values using the mice package, and the magnitudes of the values not triggering my ‘BS’ meter (so far so good).

    My main problem now is that I’m not clear what measures of model fit are appropriate to report to support the idea that the pooled estimates and/or unpooled models are any good. There are many different indicators of model fit I could use to indicate the goodness/badness of the linear mixed models, generally summarize them, and to describe the random and fixed portions — which ones do you think I should report?
    Common ones that (I think) I can get mixed effect estimates for are:
    AIC, BIC,
    R^2 and conditional R^2
    Total variance, within group, between group
    ICC (Intraclass Correlation Coefficient)
    possibly also one of the following: wald statistic or likelihood ratio test or lagrange multiplier — I am not 100% sure how to interpret these though, or calculate them, for that matter, and sure that my advisors/committee don’t know how to interpret them either, so I am likely to get them wrong and not know any better.

    Secondly, do you think it’s okay to include only a single representative fit value using only one of the imputed datasets?
    –Excuse 1 is that most model fit indicators cannot be pooled after imputation.
    –Excuse 2 has to do with sheer volume:
    One chapter of my dissertation has ~ 10 model specs x 5 imputations (50 models, each with whatever fit parameters I decide to report, 10 pooled estimates)
    A second chapter has 18 different outcome analytes x 5 imputations x 4 model specs (so about 360 individual models, each one with its own estimators of model fit and about 72 pooled estimates)

    For a little more context:
    I am dealing with observations of a bunch of different chemicals in stormwater, and trying to see if specific treatment interventions change their concentration and/or mass present in the effluent. (It would have been nice if the data could have supported something as simple as a paired t-test haha) I have created some DAGs to figure out several model specifications to appropriately de-confound and test these treatment interventions.

    I read all the posts I could find on multi-level model fit parameters, and this one seemed the closest to my issue, even though I am not using a BRT. Any advice you have would be much appreciated, thank you. Caitlin

Leave a Reply

Your email address will not be published. Required fields are marked *