A linguist has a question about sampling when the goal is causal inference from observational data

Nate Delaney-Busch writes:

I’m a PhD student of cognitive neuroscience at Tufts, and a question came up recently with my colleagues about the difficulty of random sampling in cases of highly controlled stimulus sets, and I thought I would drop a line to see if you had any reading suggestions for us.

Let’s say I wanted to disentangle the effects of word length from word frequency on the speed at which people can discriminate words from pseudowords, controlling for nuisance factors (say, part of speech, age of acquisition, and orthographic neighborhood size – the number of other words in the language that differ from the given word by only one letter). My sample can be a couple hundred words from the English language.

What’s the best way to handle the nuisance factors without compromising random sampling? There are programs that can automatically find the most closely matched subsets of larger databases (if I bin frequency and word length into categories for a factorial experimental design), but what are the consequences of having experimental items essentially become a fixed factor? Would it be preferable to just take a random sample of the English language, then use a hierarchical regression to deal with the nuisance factors first? Are there measures I can use to quantify the extent to which chosen sampling rules (e.g. “nuisance factors must not significantly differ between conditions”) constrain random sampling? How would I know when my constraints start to really become a problem for later generalization?

Another way to ask the same question would be how to handle correlated variables of interest like word length and frequency during sampling. Would it be appropriate to find a sample in which word length and frequency are orthogonal (e.g. if I wrote a script to take a large number of random samples of words and use the one where the two variables of interest are the least correlated)? Or would it be preferable to just take a random sample and try to deal with the collinearity after the fact?
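
To make the script idea concrete, here is a minimal sketch of the “draw many candidate samples, keep the one where the two variables are least correlated” rule. The lexicon below is simulated stand-in data (made-up lengths and log frequencies), not a real word database:

    # A minimal sketch of the "draw many candidate samples, keep the least
    # collinear one" rule. The lexicon is simulated stand-in data.
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical lexicon: 50,000 words where longer words tend to be rarer.
    n_lexicon = 50_000
    length = rng.integers(3, 12, size=n_lexicon).astype(float)
    log_freq = 8.0 - 0.5 * length + rng.normal(0.0, 1.5, size=n_lexicon)

    def abs_corr(idx):
        return abs(np.corrcoef(length[idx], log_freq[idx])[0, 1])

    best_idx, best_r = None, np.inf
    for _ in range(2_000):                       # candidate samples to try
        idx = rng.choice(n_lexicon, size=200, replace=False)
        if abs_corr(idx) < best_r:
            best_idx, best_r = idx, abs_corr(idx)

    print("typical |r| in one random sample:",
          round(abs_corr(rng.choice(n_lexicon, size=200, replace=False)), 2))
    print("least-collinear candidate |r|:   ", round(best_r, 3))

The catch, of course, is that the sample this returns is no longer a simple random sample of the language, which is the tension described above.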

I replied:

I don’t have any great answers except to say that in this case I don’t know that it makes sense to think of word length or word frequency as a “treatment” in the statistical sense of the word. To see this, consider the potential-outcomes formulation (or, as Don Rubin would put it, “the RCM”). Suppose you want the treatment to be “increase word length by one letter.” How do you do this? You need to switch in a new word. But the effect will depend on which word you choose. I guess what I’m saying is, you can see how the speed of discrimination varies by word length and word frequency, and you might find a model that predicts well, in which case maybe the sample of words you use might not matter much. But if you don’t have a model with high predictive power, then I doubt there’s a unique right way to define your sample and your population; it will probably depend on what questions you are asking.

Delaney-Busch then followed up:

For clarification, this isn’t actually an experiment I was planning to run – I thought it would be a simple example that would help illustrate my general dilemma when it comes to psycholinguistics.

Your point on treatments is well-taken, though perhaps hard to avoid in research on language processing. It’s actually one of the reasons I’m concerned with the tension between potential collinearity and random sampling in cases where two or more variables correlate in a population. Theoretically, with a large random sample, I should be able to model the random effects of item in the same way I could model the random effects of subject in a between-subjects experiment. But I feel caught between a rock and a hard place when, on the one hand, a random sample of words would almost certainly be collinear in the variables of interest, but on the other hand, sampling rules (such as “generate a large number of potential samples and keep the one that is the least collinear”) undermine the ability to treat item as an actual random effect.

If you’d like, I would find it quite helpful to hear how you address this issue in the sampling of participants for your own research. Let’s say you were interested in teasing apart the effects of two correlated variables – education and median income – on some sort of political attitude. Would you prefer to sample randomly and just deal with the collinearity, or constrain your sample such that recruited participants had orthogonal education and median income factors? How much constraint would you accept on your sample before you start to worry about generalization (i.e. worry that you are simply measuring the fixed effect of different specific individuals), and is there any way to measure what effect your constraints have on your statistical inferences/tests?
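
To make the tradeoff in that last question concrete, here is a minimal simulation with made-up numbers: the same regression run on a sample where the two predictors are strongly correlated and on one where they are orthogonal. The coefficient values and sample size are purely illustrative:

    # Illustrative only: how correlation between two predictors inflates the
    # standard errors of their coefficients, relative to an orthogonal design.
    # True model: y = 0.3*education + 0.3*income + noise (made-up numbers).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500

    def slope_ses(rho):
        """OLS standard errors for the two slopes when corr(x1, x2) = rho."""
        cov = np.array([[1.0, rho], [rho, 1.0]])
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 0.3 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
        Xd = np.column_stack([np.ones(n), X])
        XtX_inv = np.linalg.inv(Xd.T @ Xd)
        beta = XtX_inv @ Xd.T @ y
        resid = y - Xd @ beta
        sigma2 = resid @ resid / (n - Xd.shape[1])
        return np.sqrt(np.diag(sigma2 * XtX_inv))[1:]

    print("slope SEs, corr = 0.9:", slope_ses(0.9))  # inflated
    print("slope SEs, corr = 0.0:", slope_ses(0.0))  # smaller

Orthogonalizing the sample buys precision on the individual coefficients, but only for whatever subset of people (or words) satisfies the constraint, which is exactly the generalization worry raised above.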

12 thoughts on “A linguist has a question about sampling when the goal is causal inference from observational data”

  1. There’s a lot of interesting (psycho)linguistics just in defining orthographic neighborhoods. We’ve learned a lot about it from spell checking and all the lexical access work (which I haven’t perused since the 1980s). From both literatures, it’s clear that earlier letters in a word are more important for discrimination, and that unstressed vowels, doubled consonants, and other sound-alikes are not very important. Shape is also important, based on what we know about reading: d and b would be confusable in a way that p wouldn’t be, even though they all have similar sounds.

    @Andrew: Sometimes you get a new, very closely related word by adding a letter, as in English pluralization (“statistic” and “statistics”), or by adding two letters (“strength” to “strengthen”), and so on.

  2. Nate:

    I would advise that you start by stating clearly and concisely your theory of how word length and word frequency affect discrimination, and of their relation to other potential confounders.

    To do this properly you need to use causal language. Talking about correlations will not do, since the same observed correlation patterns can emerge from different causal structures, and those causal structures may have different implications for your experimental design and estimation. Start by laying out your causal theory.

    I advise that you use DAGs because that way you can express all your causal theories quickly, concisely, and graphically. Indeed, you can prototype models much faster than using potential outcomes. I include some examples. Figure (a) shows your two correlated variables X and Z (common cause), and their effect on Y (I don’t have a clue about linguistics so I am keeping this abstract). U_i are other unknown causes. Variable I could be the intervention. Andrew’s concern, I think, is that any intervention I cannot change X without also changing some other background causes U_y, even if I is perfectly randomized. (A small simulation sketch at the end of this comment illustrates the setup.)

    This is an example of a lack of control. Other problems with experimental interventions are contamination (e.g. fertilizer leaching into control plots) and spillover (externalities: Y_i affects Y_j). In this case the design does not meet the exclusion restriction common in experiments. What you need is a theory of U_y, so that you can instantiate it and control for some of the concepts in it.

    One possibility — though Bob knows better on this — is that, assuming Figure (a), if intervention I is “add letters”, then you can subtract letters at the same time to keep X constant and try to estimate the effect of I on Y via U_y (changing words). Then in step 2 you can let go of X (think about it).

    Some people say that they have no theory, or that everything is confounded, etc., in which case jump to Figure (c). I think this is mostly intellectual laziness.

    Re sampling: Figure (d) shows how you might sample on Z to choose sample S and then use intervention I to change X. This might allow you to condition the effect of X on Z and then generalize if you have P(Z) for the population. There are many other possibilities.

    In conclusion: start with a theory. Be bold, make _causal_ assumptions, but make them transparent using a DAG. Then use that to inform your intervention and estimation strategy. Whatever you do, don’t start from the end by talking ambiguously about correlations.
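
    As a companion to the figures (which are not reproduced here), here is a minimal simulation of one reading of Figure (a): a common cause drives both X and Z, both affect Y, and U_y stands in for other unknown causes. All structural coefficients are made up for illustration:

        # A minimal simulation of one reading of Figure (a): a common cause u_c
        # drives both X and Z, both affect Y, and u_y stands in for other
        # unknown causes. All structural coefficients are made up.
        import numpy as np

        rng = np.random.default_rng(42)
        n = 100_000

        u_c = rng.normal(size=n)             # shared background cause of X and Z
        x = 1.0 * u_c + rng.normal(size=n)   # e.g. word length
        z = 1.0 * u_c + rng.normal(size=n)   # e.g. word frequency
        u_y = rng.normal(size=n)             # other unknown causes of Y
        y = 0.5 * x + 0.8 * z + u_y          # true direct effect of X on Y is 0.5

        # Regressing Y on X alone picks up the backdoor path through u_c and Z.
        naive = np.polyfit(x, y, 1)[0]

        # Adjusting for Z blocks that path and recovers roughly the direct effect.
        X = np.column_stack([np.ones(n), x, z])
        adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]

        print("naive slope:   ", round(naive, 2))     # about 0.9, biased
        print("adjusted slope:", round(adjusted, 2))  # about 0.5

    Under this assumed structure, adjusting for Z recovers the direct effect; under a different structure it might not, which is why the graph needs to be laid out first.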

    • +1.

      I like your use of DAGs for this case, but in many other places I might find DAGs cumbersome or limiting (for example, ODEs describing time series or something like that). In general, though, the point *start with a theory* cannot be emphasized enough. Even if that theory is just a lot of written words describing what you think is going on, which you then formalize after you’ve spewed them out and thought about them for a while.

      The approach in which we talk about correlations and orthogonality a lot is in my opinion one of the worst things about the linear algebra approach to statistics. There’s a lot more to modeling than orthogonal decompositions in abstract vector spaces. This is where Stan and BUGS and things really shine.

    • I see that given a finite number of variables, there are only finitely many directed acyclic graphs connecting them. Then, if you couple that with sampling distributions for each node in the graph, you get a joint distribution (and can easily derive conditional distributions for each node). And then you can use something like Stan or BUGS to fit each model.

      Then what? I’m afraid I know nothing at all about “potential outcomes” either, so explanations in those terms aren’t likely to help me understand.

      Also, for extra credit, what’s the role of latent variables as in item-response theory (IRT) models?

      • Bob:

        Not sure exactly what you are asking so I’ll say this:

        1. You can have an infinite number of variables and yet a simple graph, as nodes can represent sets of variables that stand in the same causal relation. E.g. X ← U → Y can be used to represent a million confounders for a gazillion causes X and effects Y.

        2. There are two modes of inference, I think. One is that we don’t know anything, only have some data, and want to learn the underlying graph structure using a structure-learning algorithm. I think this is what you have in mind. Another is that we start with a specific graph structure and test it against data. For better or worse, that is how most scientists proceed. We seldom include model uncertainty in our standard errors. Moreover, if you want to add uncertainty, typically you need not consider all possible graphs. If your assumption is X → Y, the alternative might be the class of models where they are confounded, as drawn in point 1 above. Then you can put priors on that and do sensitivity analysis.

        3. I am not familiar with IRT models, but latent variables pose no additional problems for expressing our model, only for estimation, at which point I would use the graph to inform a Bayesian model.

        Not sure if this answers your questions. My general point is not to take a combinatorial approach to structure learning or model selection. The number of graphs (or structural equation models, or potential outcome specifications, as these are all equivalent and so suffer from the same ailments) explodes in the number of nodes. I would rather rely on strong theory, Occam’s razor, and a research program (never a single-shot study). I would also collapse alternative models into equivalence classes as shown in point 2 above.
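
        To put a number on how quickly the space of graphs explodes, here is a small calculation using Robinson’s recurrence for the number of labeled DAGs on n nodes:

            # Number of labeled DAGs on n nodes, via Robinson's recurrence:
            # a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
            from math import comb

            def num_dags(n, cache={0: 1}):
                if n not in cache:
                    cache[n] = sum((-1) ** (k + 1) * comb(n, k)
                                   * 2 ** (k * (n - k)) * num_dags(n - k)
                                   for k in range(1, n + 1))
                return cache[n]

            for n in range(1, 8):
                print(n, num_dags(n))
            # 1, 3, 25, 543, 29281, 3781503, 1138779265: over a billion at 7 nodes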

        • Thanks. Since this seemed to be about causal inference, I was really interested in how you determine if X causes Y as opposed to just being correlated. It sounds like the answer is model comparison. But if X and Y are correlated without one causing the other, then it seems either X → Y or Y → X might be “better” models (e.g., under cross-validation) depending on the distribution assumptions. What I don’t see is how we can then draw any conclusions about causality.

          I understand the point about not taking modeling error into account. Most models don’t even take measurement error into account!

          The point about latent variables is just that they expand the space of potential models for comparison, not that the space needs any expansion; there are already uncountably many alternatives once you have to make distribution assumptions on the nodes.
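
          One concrete version of this worry: for linear-Gaussian models the factorizations p(x)p(y|x) and p(y)p(x|y) achieve exactly the same likelihood, so fit alone cannot distinguish X → Y from Y → X. A small illustrative check (my own sketch, made-up data):

              # For bivariate Gaussian data the factorizations p(x)p(y|x) and
              # p(y)p(x|y) reach the same joint log likelihood at their MLEs,
              # so fit alone cannot tell X -> Y from Y -> X. Sketch only.
              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(7)
              x = rng.normal(size=2_000)
              y = 0.8 * x + rng.normal(size=2_000)   # generated as X -> Y

              def loglik(a, b):
                  """log p(a) + log p(b | a), Gaussian MLEs via regression of b on a."""
                  slope, intercept = np.polyfit(a, b, 1)
                  resid = b - (slope * a + intercept)
                  ll_marginal = stats.norm.logpdf(a, a.mean(), a.std()).sum()
                  ll_conditional = stats.norm.logpdf(resid, 0.0, resid.std()).sum()
                  return ll_marginal + ll_conditional

              print("X -> Y factorization:", round(loglik(x, y), 2))
              print("Y -> X factorization:", round(loglik(y, x), 2))  # same value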

        • Bob:

          It depends on whether experiments are available or only observational studies. Say you posit X → Y. That has some implications, e.g. (i) intervening on X changes Y; (ii) intervening on Y does not change X. You can test these. If you only have observational data, you can look for instruments for X and Y, etc.

          In general, two (sets of) variables are correlated if (i) they have a cause in common (confounding); (ii) one causes the other (but not both directions, per acyclicity); or (iii) there is selection bias, whereby the sample is selected on some value of a collider on a path between X and Y.

          But at the end of the day causation is established by assumption. Even experiments involve lots of assumptions. But that is ok. E.g. we have enough info to think smoking causes cancer and not the other way around, and enough studies have been done to make it very unlikely that the correlation is driven by confounding or selection bias.

        • PS DAGs are deterministic. Probability comes in through ignorance of background conditions. So for a model of X causes Y we would add U_x as background causes of X and U_y for Y. We summarize these by P(U_i) etc. So X and Y get their distribution from U_i.

          This is important as counterfactuals are defined in terms of U_i. So if smoking (X) causes cancer (Y), you might still live to 102 without cancer if your individual U_y(i) is extreme, say. Further, individual heterogeneity depends on the functional form of Y = f(X, U_y). If U_y enters multiplicatively we have heterogeneity. The challenge then is to unpack U_y. (A tiny numerical sketch follows below.)

          PPS: Writing on an Android phone is a royal pain, @Google.
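
          A tiny numerical sketch of the additive-versus-multiplicative point, with made-up numbers:

              # Made-up numbers: additive vs multiplicative background cause U_y.
              # The individual effect of switching X from 0 to 1 is Y(1) - Y(0).
              import numpy as np

              rng = np.random.default_rng(3)
              u_y = rng.lognormal(mean=0.0, sigma=0.5, size=5)  # five "individuals"

              additive = (1 + u_y) - (0 + u_y)        # Y = X + U_y
              multiplicative = (1 * u_y) - (0 * u_y)  # Y = X * U_y

              print("additive effects:      ", additive)        # all exactly 1.0
              print("multiplicative effects:", multiplicative)  # vary by individual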

  3. Actually, having in mind what Andrew wrote, I’d probably first be interested in how correlated/collinear things actually are, and which ones in particular, even before setting up a more detailed causal theory. It shouldn’t be a problem to sample some words first and look at these things so that you get an idea. Maybe collinearity is not as bad as you thought it might be. Or, on the other hand, some effects may be genuinely confounded so that they cannot be told apart without making bold assumptions (in which case being honest about the limitations of what can be known from the data seems at least as legitimate as making such bold assumptions).

  4. I guess it depends on your resources, but why not do the same experiment or manipulation, with both types of designs:

    1. A set of items that are carefully selected to maximize experimental control, with less importance assigned to representativeness

    2. A set of items that are carefully selected to be representative of the set of words to which the theory should generalize, with less importance assigned to separating the nuisance factors

    Then you can compare the results. If they differ, we should try to understand why they differ depending on the items. This probably works better in a behavioral paradigm if it doesn’t take too long to run it. If it is an imaging experiment, then you are probably more restricted in what you can do.

    This reminds me of the old verb bias debate in the sentence parsing literature, but there are other examples. It is certainly true that experimental control is emphasized much more than generalizability in psycholinguistics experiments. Part of this, I think, is due to the emphasis on tests of pre-existing theory. Researchers want the most sensitive measure with the highest level of control that they can achieve, because it gives them the best chance to see results that will distinguish between existing theories. But this can have the side effect of limiting the generalizability of the results.

  5. These are very interesting comments to an interesting question!

    Getting back to the original question, I think it depends on what you want to do. With collinear predictors, the variance of the regression coefficient for each predictor will be high (because each individual predictor’s coefficient is only weakly identified). Of course, if the goal is just prediction, maybe that doesn’t matter (in machine learning one typically doesn’t care about collinearity, and it’s standard to stabilise the numerical estimates of the coefficients using regularisation or a prior; a small sketch at the end of this comment illustrates this).

    But it seems that you’re interested in figuring out which of the two predictors is the “real” one (e.g., in a causal sense). If they really are collinear, then of course you just don’t have the data to do this. Your idea of searching for a data set where the correlation doesn’t hold is an interesting one. I don’t have a good theoretical understanding here, but I’d be a bit hesitant to select a special set of words where frequency and length aren’t correlated, mainly because in “normal” words frequency and length are strongly correlated, and any words you find where that correlation doesn’t hold are likely to be unusual in other ways too (e.g., perhaps loan words).

    PS. We actually tried this for some syntactic predictors at the sentence level for the same reason — we searched Gigaword for sentences where two perplexity measures differed — and the sentences we extracted were rather bizarre (e.g., fragments).
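
    For what it’s worth, a minimal sketch of the stabilisation point above, with made-up numbers: ridge-style shrinkage trades a little bias for a large reduction in the variability of collinear coefficient estimates.

        # Illustrative sketch: OLS vs ridge coefficient variability when the two
        # predictors are strongly correlated (true model: y = 0.3*x1 + 0.3*x2 + e).
        import numpy as np

        rng = np.random.default_rng(11)
        n, reps, lam = 200, 500, 10.0
        cov = np.array([[1.0, 0.95], [0.95, 1.0]])

        ols_coefs, ridge_coefs = [], []
        for _ in range(reps):
            X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
            y = 0.3 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
            ols_coefs.append(np.linalg.solve(X.T @ X, X.T @ y))
            ridge_coefs.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

        print("OLS coefficient SDs:  ", np.std(ols_coefs, axis=0))    # large
        print("ridge coefficient SDs:", np.std(ridge_coefs, axis=0))  # smaller, some bias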
