
Thinking of doing a list experiment? Here’s a list of reasons why you should think again

Someone wrote in:

We are about to conduct a voting list experiment. We came across your comment recommending that each item be removed from the list. Would greatly appreciate it if you take a few minutes to spell out your recommendation in a little more detail. In particular: (a) Why are you “uneasy” about list experiments? What would strengthen your confidence in list experiments? (b) What do you mean by “each item be removed”? As you know, there are several non-sensitive items and one sensitive item in a list experiment. Do you mean that the non-sensitive items should be removed one-by-one for the control group, or are you suggesting a multiple-arm design in which each arm of the experiment has one non-sensitive item removed? What would be achieved by this design?

I replied: I’ve always been a bit skeptical about list experiments, partly because I worry that the absolute number of items on the list could itself affect the response. For example, someone might not want to check off 6 items out of 6 but would have no problem checking off 6 items out of 10: even if 4 items on that latter list were complete crap, their presence on the list might make the original 6 items look better by comparison. So this has made me think that a list experiment should really have some sort of active control. But the problem with the active control is that then any effects will be smaller. Then that made me think that one might be interested in interactions, that is, which groups of people would be triggered by different items on the list. But that’s another level of difficulty…

And then I remembered that I’ve never actually done such an experiment! So I thought I’d bring in some experts. Here’s what they said:

Macartan Humphreys:

I have had mixed experiences with list experiments.

Enumerators are sometimes confused by them, and so are subjects, and sometimes we have found enumerators implementing them badly, e.g., getting the subjects to count out loud as they go along reading the list, that kind of thing. Great enumerators shouldn’t have this problem, but some of ours have.

In one implementation that we thought went quite well we cleverly did two list experiments with the same sensitive item and different nonsensitive items, but got very different results. So that is not encouraging.

The length-of-list issue I think is not the biggest. You can keep lists constant length and include an item that you know the answer to (maybe because you ask it elsewhere, or because you are willing to bet on it). Tingley gives some references that discuss this kind of thing.

A bigger issue though is that list experiments don’t incentivize people to give you information that they don’t want you to have. E.g., if people do not want you to know that there was fraud, and if they understand the list experiment, you should not get evidence of fraud. The technique only seems relevant for cases in which people DO want you to know the answer but don’t want to be identifiable as the person that told you.

Lynn Vavreck:

Simon Jackman and I ran a number of list experiments in the 2008 Cooperative Campaign Analysis Project. Substantively, we were interested in Obama’s race, Hillary Clinton’s sex, and McCain’s age. We ran them in two different waves (March of 2008 and September of 2008).

Like the others, we got some strange results that prevented us from writing up the results. Ultimately, I think we both concluded that this was not a method we would use again in the future.

In the McCain list, people would freely say “his age” was a reason they were not voting for him. We got massive effects here. We didn’t get much at all on the Clinton list (“She’s a woman.”) And, on the Obama list, we got results in the OPPOSITE direction in the second wave! I will let you make of those patterns what you will — but, it seemed to us to echo what Macartan writes below — if it’s truly a sensitive item, people seem to figure out what is going on and won’t comply with the “treatment.”

If the survey time is easily available (i.e., running this is cheap), I think I still might try it. But if you are sacrificing other potentially interesting items, you should probably reconsider doing the list. Also, one more thing: If you are going to go back to these people in any capacity, you don’t want to do anything that will damage the rapport you have with the respondents. If they “figure out” what you’re up to in the list experiment, they may be less likely to give you honest answers to other questions down the line. As you develop the survey, you want to be careful not to foster the notion that surveys are “just out to trick people.” I’d put a premium on that just now if I were you.

Cyrus Samii:

I’ve had experiences similar to what Macartan and Lynn reported. I think Macartan’s last point about the incentives makes a lot of sense. If the respondent is not motivated in that way, then the validity of the experiment requires that the respondent can follow the instructions but is not so attentive as to avoid being tricked. That may not be a reasonable assumption.

There’s also the work that Jason Lyall and coauthors have done using both list experiments and endorsement experiments in Afghanistan. They seem to think that the techniques have been effective, so it may be useful to contact Jason to get some tips that would be specifically relevant to research in Afghanistan. It’s possible that the context really moderates the performance of these techniques.

Simon Jackman:

“List” experiments — aka “item-count” experiments — seem most prone to run into trouble when the “sensitive item” jumps off the page. This gives rise to the “top-coding” problem: if all J items are things I’ve done, including the sensitive item, then I’m going to respond “J” only if I’m ok revealing myself as someone who would respond “yes” to the sensitive item.

Then you’ve got to figure out how to have J items, including your sensitive item, such that J-1 might be the plausible upper bound on the item count. This can be surprisingly hard. Pre-testing would seem crucial, fielding your lists and refining them to avoid “top-coding”.

I still use the technique now and then (including a paper out now under R&R), but I’ve come to realize they can be expensive to do well, especially in novel domains of application, given the test cases you have to burn through to get the lists working well.

More generally, the item-count technique seems like a lot of work for an estimate of the population rate of the “sensitive” attitude or self-report of the sensitive behavior. Sure, modeling (a la Imai) can get you estimates of the correlates of the sensitive item, and stratification lets you estimate rates in sub-populations. But if the lists aren’t working well to begin with, then the validity of “post-design,” model-based swings at the problem has to be somewhat suspect.

One thing I’m glad Lynn and I did in our 2008 work was to put the whole “misreport/social-desirability” rationale to a test. For the context we were working in — Americans’ attitudes about Obama and McCain on a web survey — there were more than a few people willing to quite openly respond that they wouldn’t vote for Obama because he’s black, or wouldn’t vote for McCain because he’s too old. These provided useful lower bounds on what we ought to have got from the item-count approach. Again, note the way you’re blowing through sample to test & calibrate the lists.
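Jackman’s top-coding concern is easy to see in a small simulation. Everything below is hypothetical (the list length, prevalence, and item probabilities are made up for illustration): the item-count estimator is just the difference in mean counts between the treatment list (which includes the sensitive item) and the control list (which doesn’t), and respondents at the ceiling who refuse to report the maximum bias that difference downward.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000            # respondents per arm (hypothetical)
true_rate = 0.30      # hypothetical prevalence of the sensitive item

# Control arm sees J-1 = 4 non-sensitive items; treatment arm adds the sensitive one.
control = rng.binomial(4, 0.5, n)                   # count reported on the 4-item list
holds_sensitive = rng.random(n) < true_rate
treatment = rng.binomial(4, 0.5, n) + holds_sensitive

# Item-count (difference-in-means) estimator of the sensitive-item rate
est = treatment.mean() - control.mean()

# Top-coding: anyone whose honest answer is the maximum (5) would reveal
# themselves, so suppose they report 4 instead.
top_coded = np.where(treatment == 5, 4, treatment)
est_tc = top_coded.mean() - control.mean()          # biased toward zero
```

With these made-up numbers `est` lands near 0.30 while `est_tc` comes out noticeably lower, since every ceiling respondent subtracts one from the treatment mean.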

And Brendan Nyhan adds:

I suspect there’s a significant file drawer problem on list experiments. I have an unpublished one too! They have low power and are highly sensitive to design quirks and respondent compliance as others mentioned. Another problem we found is interpretive. They work best when the social desirability effect is unidirectional. In our case, however, we realized that there was a plausible case that some respondents were overreporting misperceptions as a form of partisan cheerleading and others were underreporting due to social desirability concerns, which could create offsetting effects.

That makes sense to me. Regular blog readers will know that I’m generally skeptical about claims of unidirectional effects.

And Alissa Stollwerk discusses some of her experiences here.

A short questionnaire regarding the subjective assessment of evidence

E. J. Wagenmakers writes:

Remember I briefly talked to you about the subjective assessment of evidence? Together with Richard Morey and myself, Annelies Bartlema created a short questionnaire that can be done online. There are five scenarios and it does not take more than 5 minutes to complete. So far we have collected responses from psychology faculty and psychology students, but we were also keen to get responses from a more statistically savvy crowd: the people who read your blog!

Try it out!

Ticket to Baaaaarf

A link from the comments here took me to the wonderfully named Barfblog and a report by Don Schaffner on some reporting.

First, the background: A university in England issued a press release saying that “Food picked up just a few seconds after being dropped is less likely to contain bacteria than if it is left for longer periods of time . . . The findings suggest there may be some scientific basis to the ‘5 second rule’ – the urban myth about it being fine to eat food that has only had contact with the floor for five seconds or less. Although people have long followed the 5 second rule, until now it was unclear whether it actually helped.” According to the press release, the study was “undertaken by final year Biology students” and led by a professor of microbiology.

The press release hit the big time, hitting NPR, Slate, Forbes, the Daily News, etc etc. Some typical headlines:

“5-second rule backed up by science” — Atlanta Journal Constitution

“Eating food off the floor may be OK, scientist says” — CNET

“Scientists confirm dad’s common sense: 5-second rule totally legit”

OK, that last one was from the Christian Science Monitor, a publication that I don’t think anyone will take very seriously when it comes to health issues.

Second, the take-home point from Schaffner:

If you don’t have any pathogens on your kitchen floor, it doesn’t matter how long food sits there. If you do have pathogens on your kitchen floor, you get more of them on wet food than dry food. But in my considered opinion, the five-second rule is nonsense. I’m a scientist, I’ll keep an open mind. I know what some people in my lab will be working on this summer. . . .

Third, the rant from Don Schaffner on barfblog:

I [Schaffner] can tell when something is a big news story.

First, I read about it in my news feed from one or more sources. Second, friends and family send it to me. By these two criteria, the recent news about the five second rule qualifies as a big news story. . . . And it’s a story, or a press release, not a study.

The press release is apparently based on a PowerPoint presentation. The study has not undergone any sort of peer review, as far as I know. Science by press release is something that really bugs me. It’s damned hard to do research. It’s even harder to get that research published in the peer-reviewed literature. And when reputable news outlets publish university press releases without even editing them, that does a disservice to everyone: the readers, the news outlet, and even the university researchers. . . .

A review of the slide set shows a number of problems with the study. The researchers present their data as per cent transfer. As my lab has shown repeatedly, through our own peer-reviewed research, when you study cross-contamination and present the results as percentage transfer, those data are not normally distributed. A logarithmic transformation appears to be suitable for converting percentage transfer data to a normal distribution. This is important because any statistics you do on the results generally assume the data to be normally distributed. If you don’t verify this assumption first, you may conclude things that aren’t true.
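Schaffner’s point about non-normal percentage-transfer data can be illustrated with simulated numbers (the lognormal shape and the parameters here are my assumption for the sketch, not his data): the raw percentages come out strongly right-skewed, while their logarithms are roughly symmetric, which is what standard normal-theory statistics assume.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical percent-transfer measurements with a lognormal shape
pct_transfer = rng.lognormal(mean=np.log(5), sigma=1.0, size=1000)
log_pct = np.log10(pct_transfer)   # the transformation Schaffner recommends

def sample_skewness(x):
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

raw_skew = sample_skewness(pct_transfer)   # large and positive: long right tail
log_skew = sample_skewness(log_pct)        # near zero: roughly symmetric
```

Running statistics that assume normality on `pct_transfer` directly would be dominated by the tail; on `log_pct` the assumption is at least plausible.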

The next problem with the study is that the authors appear to have only performed three replicates for most of the conditions studied. Again, as my own peer-reviewed research has shown, the nature of cross-contamination is such that the data are highly variable. In our experience you need 20 to 30 replicates to reasonably characterize the variability in logarithmically transformed percent-transfer data.

Our research has also shown that the most significant variable influencing cross-contamination appears to be moisture. This is not surprising. Bacteria need moisture to move from one location to another. When conditions are dry, it’s much less likely that a cell will be transferred.

Another problem that peer reviewers generally pick up is an awareness (or lack thereof) of the pre-existing literature. Research on the five-second rule is not new. I’m aware of at least three groups that have worked in this area. Although it’s not peer-reviewed, the television show MythBusters has considered this issue. Paul Dawson at Clemson has also done research on the five-second rule. Dawson’s research has been peer-reviewed and was published in the Journal of Applied Microbiology. Hans Blaschek and colleagues were, as far as I know, the first lab to ever study this.

When I first read this, I was like, Yeah, you go guy! If only all the journalists did it as well as Mary Beth Breckenridge of the Beacon Journal, in a news article headlined, “Study supports five-second rule, but should you? Probably not”:

A new study appears to validate what every 12-year-old knows: If you drop food on the floor, you have five seconds until it becomes contaminated. Biology students at Aston University in Birmingham, England, tested the time-honored five-second rule and claim to have found some truth to it. The faster you pick food up off the floor, they discovered, the less likely it is to contain bacteria. . . .

But don’t go picking fallen Fritos out of the rug just yet.
The study contradicts findings of earlier research at Clemson University, where scientists tested how fast Salmonella Typhimurium bacteria made their way from flooring surfaces to bologna and bread. It happened instantly, the researchers found.
What’s more, the British study apparently hasn’t been published yet in a scientific journal, noted Jeffrey T. LeJeune, a food safety expert at the Ohio Agricultural Research and Development Center in Wooster Township.
Since the data aren’t available to other researchers, he said, there’s no way to replicate the study or determine whether the results are legitimate. “I would be very skeptically cautious about the results, and even more about the interpretation,” he said. . . .

But then I got a bit worried. What exactly is the take-home message? It can’t just be, “don’t report a study that hasn’t been peer-reviewed,” since (a) even if a study is published in a peer-reviewed journal, it could be crap (recall all those papers published in Psychological Science), and (b) if a topic is sufficiently important, it could well be newsworthy even before the grind of the peer review process.

This particular study does seem shaky, though: a student project that is not backed up by shared data or a preprint. The press release seems a bit irresponsible: “Although people have long followed the 5 second rule, until now it was unclear whether it actually helped,” which implies that now all is clear. But journalists should know better than to trust a press release! Don’t they teach them that in day 1 of journalism school?? The reports typically do express some skepticism, for example the NPR report says, “The team hasn’t published the data yet. So the findings are still preliminary and need to be confirmed” and later on quotes a biologist stating an opposite position. Even so, though, it seems like all these news outlets are taking the press release a bit too uncritically.

Some of this is simple envy: I’d love for my research to be discussed on NPR and I’m sure Don Schaffner wouldn’t mind this sort of exposure either. But it does seem to me that this sort of science-reporting-by-press-release creates the worst sort of incentives for researchers. I don’t blame the university researcher for promoting his students’ project (his quote: “The findings of this study will bring some light relief to those who have been employing the five-second rule for years, despite a general consensus that it is purely a myth”) but I do blame the reporting system for hyping this sort of thing, which seems like the flip side of the notorious proclivity of media organizations for scare stories. (As Jonathan Schoenfeld and John Ioannidis found, it seems like just about everything has been said to cause cancer at one time or another.)

P.S. This all got my attention not because I care about the so-called five-second rule but because I was attracted by the name of the barfblog.

Stan Model of the Week: Hierarchical Modeling of Supernovas

The Stan Model of the Week showcases research using Stan to push the limits of applied statistics.  If you have a model that you would like to submit for a future post then send us an email.

Our inaugural post comes from Nathan Sanders, a graduate student finishing up his thesis on astrophysics at Harvard. Nathan writes,

“Core-collapse supernovae, the luminous explosions of massive stars, exhibit an expansive and meaningful diversity of behavior in their brightness evolution over time (their “light curves”). Our group discovers and monitors these events using the Pan-STARRS1 telescope in Hawaii, and we’ve collected a dataset of about 20,000 individual photometric observations of about 80 Type IIP supernovae, the class my work has focused on. While this dataset provides one of the best available tools to infer the explosion properties of these supernovae, due to the nature of extragalactic astronomy (observing from distances $\gtrsim$ 1 billion light years), these light curves typically have much lower signal-to-noise, poorer sampling, and less complete coverage than we would like.

My goal has been to develop a light curve model, with a physically interpretable parameterization, robust enough to fit the diversity of observed behavior and to extract the most information possible from every light curve in the sample, regardless of data quality or completeness.  Because light curve parameters of individual objects are often not identified by the data, we have adopted a hierarchical model structure.  The intention is to capitalize on partial pooling of information to simultaneously regularize the fits of individual light curves and constrain the population level properties of the light curve sample.  The highly non-linear character of the light curves motivates a full Bayes approach to explore the complex joint structure of the posterior.

Sampling from a ~$10^4$ dimensional, highly correlated joint posterior seemed intimidating to me, but I’m fortunate to have been empowered by having taken Andrew’s course at Harvard, by befriending expert practitioners in this field like Kaisey Mandel and Michael Betancourt, and by using Stan!  For me, perhaps the most attractive feature of Stan is its elegant probabilistic modeling language.  It has allowed us to rapidly develop and test a variety of functional forms for the light curve model and strategies for optimization and regularization of the hierarchical structure.  This would not be useful, of course, without Stan’s efficient implementation of NUTS, although the particular pathologies of our model’s posterior drove us to spend a great deal of time exploring divergence, tree depth saturation, numerical instability, and other problems encountered by the sampler.

Over the course of the project, I learned to pay increasingly close attention to the stepsize, n_treedepth and n_divergent NUTS parameters, and other diagnostic information provided by Stan in order to help debug sampling issues.  Encountering saturation of the treedepth and/or extremely small stepsizes often motivated simplifications of the hierarchical structure in order to reduce the curvature in the posterior.  Divergences during sampling led us to apply stronger prior information on key parameters (particularly those that are exponentiated in the light curve model) in order to avoid numerical overflow on samples drawn from the tails.  Posterior predictive checks have been a constant companion throughout, providing a natural means to visualize the model’s performance against the data to understand where failure modes have been introduced – be it through under- or over-constraining priors, inadequate flexibility in the light curve model form, or convergence failure between chains.”

By modeling the hierarchical structure of the supernova measurements Nathan was able to significantly improve the utilization of the data. For more, see the preprint.

Building and fitting this model proved to be a tremendous learning experience for both Nathan and myself.  We haven’t really seen Stan applied to such deep hierarchical models before, and our first naive implementations proved to be vulnerable to all kinds of pathologies.

A problem early on came in how to model hierarchical dependencies between constrained parameters.  As has become a common theme, the most successful computational strategy is to model the hierarchical dependencies on the unconstrained latent space and transform to the constrained space only when necessary.
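A minimal sketch of that strategy, with made-up numbers: instead of building a hierarchy directly on a positivity-constrained parameter, put the normal hierarchy on its logarithm and exponentiate only at the end, so every draw satisfies the constraint by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups = 1000

# Hierarchy lives on the unconstrained (log) scale:
mu_log = 0.5                                     # population-level location, log scale
log_theta = rng.normal(mu_log, 0.3, n_groups)    # group-level draws, unconstrained

# Transform to the constrained (positive) scale only at the end
theta = np.exp(log_theta)
```

The normal hierarchy never has to fight the positivity boundary, and `theta` is guaranteed positive without any rejection or clamping.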

The biggest issue we came across, however, was the development of a well-behaved hierarchical prior with so many layers.  With multiple layers the parameter variances increase exponentially, and the naive generalization of a one-layer prior induces huge variances on the top-level parameters.  This became especially pathological when those top-level parameters were constrained — the exponential function is very easy to overflow in floating point.  Ultimately we established the desired variance on the top-level parameters and worked backwards, scaling the deeper priors by the number of groups in the next layer to ensure the desired behavior.
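The variance blow-up described here can be checked numerically (the layer count and unit variances are illustrative, a simplified version of the actual scaling in the paper): each naive layer adds its own variance, so a deep hierarchy’s bottom-level parameters are far more diffuse than any single layer’s prior suggests, and exponentiating such a diffuse parameter overflows quickly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, layers = 100_000, 4

# Naive deep hierarchy: each child = parent + unit-variance noise
x = rng.normal(0.0, 1.0, n)            # top-level draws, unit variance
for _ in range(layers):
    x = x + rng.normal(0.0, 1.0, n)
total_var = x.var()                    # variances add: about 1 + layers = 5

# Working backwards: fix the desired total variance at 1 and shrink each layer
s = 1.0 / np.sqrt(layers + 1)
y = rng.normal(0.0, s, n)
for _ in range(layers):
    y = y + rng.normal(0.0, s, n)
scaled_var = y.var()                   # about 1, as intended
```

Five unit-variance layers already give the bottom level a standard deviation of sqrt(5); with more layers, exp() of a draw from the tail overflows a double.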

Another great feature of Stan is that the modeling language also serves as a convenient means of sharing models for reproducible science.  Nathan was able to include the full model as an appendix to his paper, which you can find on the arXiv.

Ticket to Baaaath


Ooooooh, I never ever thought I’d have a legitimate excuse to tell this story, and now I do! The story took place many years ago, but first I have to tell you what made me think of it:

Rasmus Bååth posted the following comment last month:

On airplane tickets a Swedish “å” is written as “aa” resulting in Rasmus Baaaath. Once I bought a ticket online and five minutes later a guy from Lufthansa calls me and asks if I misspelled my name…

OK, now here’s my story (which is not nearly as good). A long time ago (but when I was already an adult), I was in England for some reason, and I thought I’d take a day trip from London to Bath. So here I am on line, trying to think of what to say at the ticket counter. I remember that in England, they call Bath, Bahth. So, should I ask for “a ticket to Bahth”? I’m not sure, I’m afraid that it will sound silly, like I’m trying to fake an English accent. So, when I get to the front of the line, I say, hesitantly, “I’d like a ticket to Bath?” (with the American pronunciation). The ticket agent replies, slightly contemptuously: “Oh, you’d like a ticket to Baaaaaaath.” I pay for the ticket, take it, and slink away.

This is, like, my favorite story. Ok, not my favorite favorite story—that’s the time I saw this guy in Harvard Square and the back of his head looked just like Michael Keaton—but, still, it’s one of my best. Among linguistic-themed stories, it’s second only to the “I speak only English” story (see third paragraph here). Also, both of these are what might be called “reverse Feynman stories” in that they make me look like a fool.

On deck this week

Mon: Ticket to Baaaath

Tues: Ticket to Baaaaarf

Wed: Thinking of doing a list experiment? Here’s a list of reasons why you should think again

Thurs: An open site for researchers to post and share papers

Fri: Questions about “Too Good to Be True”

Sat: Sleazy sock puppet can’t stop spamming our discussion of compressed sensing and promoting the work of Xiteng Liu

Sun: White stripes and dead armadillos

Fooled by randomness

From 2006:

Nassim Taleb‘s publisher sent me a copy of “Fooled by randomness: the hidden role of chance in life and the markets” to review. It’s an important topic, and the book is written in a charming style—I’ll try to respond in kind, with some miscellaneous comments.

On the cover of the book is a blurb, “Named by Fortune one of the smartest books of all time.” But Taleb instructs us on page 161-162 to ignore book reviews because of selection bias (the mediocre reviews don’t make it to the book cover).

Books vs. articles

I prefer writing books to writing journal articles because books are written for the reader (and also, in the case of textbooks, for the teacher), whereas articles are written for referees. Taleb definitely seems to be writing to the reader, not the referee. There is risk in book-writing, since in some ways referees are the ideal audience of experts, but I enjoy the freedom in book-writing of being able to say what I really think.

Variation and randomness

Taleb’s general points—about variation, randomness, and selection bias—will be familiar to statisticians and also to readers of social scientists and biologists such as Niall Ferguson, A.J.P. Taylor, Stephen Jay Gould, and Bill James who have emphasized the roles of contingency and variation in creating the world we see.


On pages xliv-xlv, Taleb compares the “Utopian Vision, associated with Rousseau, Godwin, Condorcet, Thomas Paine, and conventional normative economists,” to the more realistic “Tragic Vision of humankind that believes in the existence of inherent limitations and flaws in the way we think and act,” associated with Karl Popper, Friedrich Hayek and Milton Friedman, Adam Smith, Herbert Simon, Amos Tversky, and others. He writes, “As an empiricist (actually a skeptical empiricist) I despise the moralizers beyond anything on this planet . . .”

Despise “beyond anything on this planet”?? Isn’t this a bit extreme? What about, for example, hit-and-run drivers? I despise them even more.


On page 39, Taleb quotes the maxim, “What is easy to conceive is clear to express / Words to say it would come effortlessly.” This reminds me of the duality in statistics between computation and model fit: better-fitting models tend to be easier to compute, and computational problems often signal modeling problems. (See here for my paper on this topic.)

Turing Test

On page 72, Taleb writes about the Turing test: “A computer can be said to be intelligent if it can (on average) fool a human into mistaking it for another human.” I don’t buy this. At the very least, the computer would have to fool me into thinking it’s another human. I don’t doubt that this can be done (maybe in another 5-20 years, I dunno). But I wouldn’t use the “average person” as a judge. Average people can be fooled all the time. If you think I can be fooled easily, don’t use me as a judge, either. Use some experts.

Evaluations based on luck

I’m looking at my notes. Something in Taleb’s book, but I’m not sure what, reminded me of a pitfall in the analysis of algorithms that forecast elections. People have written books about this, “The Keys to the White House,” etc. Anyway, the past 50 years have seen four Presidential elections that have been, essentially (from any forecasting standpoint), ties: 1960, 1968, 1976, 2000. Any forecasting method should get no credit for forecasting the winner in any of these elections, and no blame for getting it wrong. Also in the past 50 years, there have been four Presidential elections that were landslides: 1956, 1964, 1972, 1984. (Perhaps you could also throw 1996 in there; obviously the distinction is not precise.) Any forecasting method had better get these right, otherwise it’s not to be taken seriously at all. What is left are 1980, 1988, 1992, 1996, 2004: only 5 actual test cases in 50 years! You have a 1/32 chance of getting them all right by chance. This is not to say that forecasts are meaningless, just that a simple #correct is too crude a summary to be useful.
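The 1/32 figure is just binomial arithmetic: with five effectively coin-flip contests, a forecaster with no real skill still sweeps them with probability 0.5^5, and near-misses are even more common.

```python
from math import comb

p_sweep = 0.5 ** 5                                        # all 5 toss-ups called right
p_four_plus = sum(comb(5, k) * 0.5 ** 5 for k in (4, 5))  # at least 4 of 5 right

# p_sweep = 1/32, about 0.031; "4 or more correct" happens about 19% of
# the time by pure luck, which is why #correct is such a crude summary.
```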


I once talked with someone who wanted to write a book called Winners, interviewing a bunch of lottery winners. Actually Bruce Sacerdote and others have done statistical studies of lottery winners, using the lottery win as a randomly assigned treatment. But my response was to write a book called Losers, interviewing a bunch of randomly-selected lottery players, almost all of whom, of course, would be net losers.

Finance and hedging

When I was in college I interviewed for a summer job for an insurance company. The interviewer told me that his boss “basically invented hedging.” He also was getting really excited about a scheme for moving profits around between different companies so that none of the money got taxed. It gave me a sour feeling, but in retrospect maybe he was just testing me out to see what my reaction would be.

Forecasts, uncertainty, and motivations

Taleb describes the overconfidence of many “experts.” Some people have a motivation to display certainty. For example, auto mechanics always seemed to me to be 100% sure of their diagnosis (“It’s the electrical system”), then when they were wrong, it never would bother them a bit. Setting aside possible fraudulence, I think they have a motivation to be certain, because we’re unlikely to follow their advice if they qualify it. In the other direction, academics like me perhaps have a motivation to overstate uncertainty, to avoid the potential loss in reputation from saying something stupid. But in practice, we seem to understate our uncertainty most of the time.

Some experts aren’t experts at all. I was once called by a TV network (one of the benefits of living in New York?) to be interviewed about the lottery. I’m no expert—I referred them to Clotfelter and Cook. Other times, I’ve seen statisticians quoted in the paper on subjects they know nothing about. Once, several years ago, a colleague came into my office and asked me what “sampling probability proportional to size” was. It turned out he was doing some consulting for the U.S. government. I was teaching a sampling class at the time, so I could help him out. But it was a little scary that he had been hired as a sampling expert. (And, yes, I’ve seen horrible statistical consulting in the private sector as well.)


A thought-provoking and also fun book. The statistics of low-probability events has long interested me, and the stuff about the financial world was all new to me. The related work of Mandelbrot discusses some of these ideas from a more technical perspective. (I became aware of Mandelbrot’s work on finance through this review by Donald MacKenzie.)


Taleb is speaking this Friday at the Collective Dynamics Seminar.

Update (2014):

I thought Fooled by Randomness made Taleb into a big star, but then his followup effort, The Black Swan, really hit the big time. I reviewed The Black Swan here.

The Collective Dynamics Seminar unfortunately is no more; several years ago, Duncan Watts left Columbia to join Yahoo research (or, as I think he was contractually required to write, Yahoo! research). Now he and his colleagues (who are my collaborators too) work at Microsoft research, still in NYC.

Index or indicator variables

Someone who doesn’t want his name shared (for the perhaps reasonable reason that he’ll “one day not be confused, and would rather my confusion not live on online forever”) writes:

I’m exploring HLMs and Stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally clear grasp of the material, but wanted to be sure I haven’t gone astray.

The problem I’m working on involves a multi-nation survey of students, and I’m especially interested in understanding the effects of country, religion, and sex, and the interactions among those factors (using IRT to estimate individual-level ability, then estimating individual, school, and country effects).

Following the basic approach laid out in chapter 13 for such interactions between levels, I think I need to create a matrix of indicator variables for religion and sex. Elsewhere in the book, you recommend against indicator variables in favor of a single index variable.

Am I right in thinking that this is purely a matter of convenience, and that the matrix formulation of chapter 13 requires indicator variables, but that the matrix of indicators or the vector of indices yield otherwise identical results? I can’t see why they shouldn’t be the same, but my intuition is still developing around multi-level models.

I replied:

Yes, models can be formulated equivalently in terms of index or indicator variables. If a discrete variable can take on a bunch of different possible values (for example, 50 states), it makes sense to use a multilevel model rather than to include indicators as predictors with unmodeled coefficients. If the variable takes on only two or three values, you can still do a multilevel model but really it would be better at that point to use informative priors for any variance parameters. That’s a tactic we do not discuss in our book but which is easy to implement in Stan, and I’m hoping to do more of it in the future.
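The equivalence is easy to check numerically. Here’s a minimal sketch (not from the book; the data and group structure are made up) showing that least squares on a matrix of 0/1 indicator columns recovers exactly the same coefficients as indexing into group means directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 4                          # 200 observations in 4 groups
group = rng.integers(0, k, size=n)     # index formulation: one integer per row
y = 2.0 * group + rng.normal(0, 1, n)  # simulated outcome

# Indicator formulation: one 0/1 column per group
X = np.eye(k)[group]
beta_indicator, *_ = np.linalg.lstsq(X, y, rcond=None)

# Index formulation: each coefficient is just that group's mean
beta_index = np.array([y[group == j].mean() for j in range(k)])

# The two parameterizations give identical estimates
assert np.allclose(beta_indicator, beta_index)
```

The choice matters for bookkeeping (and, in a multilevel model, for how coefficients get grouped under a common prior), not for the fitted values.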

To which my correspondent wrote:

The main difference that occurs to me as I work through implementing this is that the matrix of indicator variables loses information about what the underlying variable was. So, for instance, if the matrix mixes an indicator for sex and n indicators for religion and m indicators for schools, we’d have Sigma_beta be an (m+n+1) × (m+n+1) matrix, when we really want a 3×3 matrix.

I could set up the basic structure of Sigma_beta, separately estimate the diagonal elements with a series of multilevel loops by sex, religion, and school, and eschew the matrix formulation in the individual model. So instead of y ~ N(X_i B_j[i], sigma^2_y), it would be (roughly, I’m doing this on my phone):


And the group-level formulation unchanged. Sigma_beta becomes a 3×3 matrix rather than an (m+n+1) × (m+n+1) matrix, which seems both more reasonable and more computationally tractable.

My reply:

Now I’m getting tangled in your notation. I’m not sure what Sigma_beta is.

One-tailed or two-tailed?


Someone writes:

Suppose I have two groups of people, A and B, which differ on some characteristic of interest to me; and for each person I measure a single real-valued quantity X. I have a theory that group A has a higher mean value of X than group B. I test this theory by using a t-test. Am I entitled to use a *one-tailed* t-test? Or should I use a *two-tailed* one (thereby giving a p-value that is twice as large)?

I know you will probably answer: Forget the t-test; you should use Bayesian methods instead.

But what is the standard frequentist answer to this question?

My reply:

The quick answer is that different people will do different things. I would say the two-tailed p-value is more standard, but some people will insist on the one-tailed version, and it’s hard to take a big stand on this one, given all the other problems with p-values in practice.
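The mechanical relationship between the two is simple: when the observed t statistic points in the hypothesized direction, the one-tailed p-value is exactly half the two-tailed one. A quick check with scipy (the numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements: the theory says group A has the higher mean
a = np.array([2.1, 1.9, 2.4, 2.0, 2.2])
b = np.array([1.0, 1.2, 0.8, 1.1, 0.9])

t_two, p_two = stats.ttest_ind(a, b)                         # two-tailed
t_one, p_one = stats.ttest_ind(a, b, alternative="greater")  # one-tailed

# Same t statistic; the one-tailed p is half the two-tailed p
# (because the observed difference is in the hypothesized direction)
assert np.isclose(t_one, t_two)
assert np.isclose(p_one, p_two / 2)
```

Had the observed difference come out in the wrong direction, the one-tailed p-value would instead be 1 − p_two/2, which is part of why the one-tailed choice has to be made before seeing the data.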

P.S. In the comments, Sameer Gauria summarizes a key point:

It’s inappropriate to view a low P value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis, rather than other, perhaps more scientifically plausible, alternatives.

This is so important. You can take lots and lots of examples (most notably, all those Psychological Science-type papers) with statistically significant p-values, and just say: Sure, the p-value is 0.03 or whatever. I agree that this is evidence against the null hypothesis, which in these settings typically has the following five aspects:
1. The relevant comparison or difference or effect in the population is exactly zero.
2. The sample is representative of the population.
3. The measurement in the data corresponds to the quantities of interest in the population.
4. The researchers looked at exactly one comparison.
5. The data coding and analysis would have been the same had the data been different.
But, as noted above, evidence against the null hypothesis is not, in general, strong evidence in favor of a specific alternative hypothesis, rather than other, perhaps more scientifically plausible, alternatives.

If you get to the point of asking, just do it. But some difficulties do arise . . .

Nelson Villoria writes:

I find the multilevel approach very useful for a problem I am dealing with, and I was wondering whether you could point me to some references about poolability tests for multilevel models. I am working with time series of cross sectional data and I want to test whether the data supports cross sectional and/or time pooling. In a standard panel data setting I do this with Chow tests and/or CUSUM. Are these ideas directly transferable to the multilevel setting?

My reply: I think you should do partial pooling. Once the question arises, just do it. Other models are just special cases. I don’t see the need for any test.

That said, if you do a group-level model, you need to consider including group-level averages of individual predictors (see here). And if the number of groups is small, there can be real gains from using an informative prior distribution on the hierarchical variance parameters. This is something that Jennifer and I do not discuss in our book, unfortunately.
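For intuition, partial pooling can be sketched outside of any fitting software. Assuming known variances (in practice they are estimated, e.g. in Stan), each group mean gets shrunk toward the grand mean in proportion to how little data it has; the function name and all numbers below are made up for illustration:

```python
import numpy as np

def partial_pool(ybar, n, sigma_y, mu, sigma_alpha):
    """Precision-weighted compromise between each group's raw mean (ybar)
    and the grand mean (mu): weight n/sigma_y^2 on the data,
    1/sigma_alpha^2 on the group-level model."""
    w_data = n / sigma_y**2
    w_prior = 1.0 / sigma_alpha**2
    return (w_data * ybar + w_prior * mu) / (w_data + w_prior)

ybar = np.array([5.5, 7.0, 3.0])   # raw group means
n = np.array([100, 5, 2])          # group sample sizes
est = partial_pool(ybar, n, sigma_y=2.0, mu=5.0, sigma_alpha=1.0)

# Every pooled estimate lies between its raw mean and the grand mean,
# and the small groups are pulled in much more than the big one.
assert np.all(np.abs(est - 5.0) <= np.abs(ybar - 5.0))
```

Complete pooling (every group gets mu) and no pooling (every group keeps its ybar) fall out as the limits sigma_alpha → 0 and sigma_alpha → ∞, which is the sense in which the other models are just special cases.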