“Statistics textbooks (including mine) are part of the problem, I think, in that we just set out ‘theta’ as a parameter to be estimated, without much reflection on the meaning of ‘theta’ in the real world.”

Carol Nickerson pointed me to a new article by Arie Kruglanski, Marina Chernikova, and Katarzyna Jasko, entitled, “Social psychology circa 2016: A field on steroids.”

I wrote:

1. I have no idea what is the meaning of the title of the article. Are they saying that they’re using performance-enhancing drugs?

2. I noticed this from the above article: “Consider the ‘power posing’ effects (Carney, Cuddy, & Yap, 2010; Carney, Cuddy, & Yap, 2015) or the ‘facial feedback’ effects (Strack, Martin, & Stepper, 1988), both of which recently came under criticism on grounds of non-replicability. We happen to believe that these effects could be quite real rather than made up, albeit detectable only under some narrowly circumscribed conditions. Our beliefs derive from what (we believe) is the core psychological mechanism mediating these phenomena.”

This seems naive to me. If we want to talk about what we “happen to believe” about “the core psychological mechanism,” I’ll register my belief that if you were to do an experiment on the “crouching cobra” position and explain how powerful people hold their energy in reserve, and if you had all the researcher degrees of freedom available to Carney, Cuddy, and Yap, I think you’d have no problem demonstrating that a crouched “cobra pose” is associated with powerfulness, while an open pose is associated with weakness.

One could also argue that the power pose results arose entirely from facial feedback. Were the experimenters controlling for their own facial expressions or the facial expressions of the people in the experiment? I think not.

Nickerson replied:

The effects *could* be real (albeit small). But this hasn’t been demonstrated in any credible way.

I didn’t understand the “on steroids” bit, either. That idiom usually means “in a stronger, more powerful, or exaggerated way.” I’d agree that the results of much social psychology research seem exaggerated (if not just plain wrong), but I don’t think that is what they meant.

And I followed up by saying:

Yes, the effects could be real. But they could just as well be real in the opposite direction (that the so-called power pose makes things worse). Realistically I think the effects will depend so strongly on context (including expectations) that I doubt that “the effect” or “the average effect” of power pose has any real meaning. Statistics textbooks (including mine) are part of the problem, I think, in that we just set out “theta” as a parameter to be estimated, without much reflection on the meaning of “theta” in the real world.

This discussion is relevant too.

Died in the Wool

Garrett M. writes:

I’m an analyst at an investment management firm. I read your blog daily to improve my understanding of statistics, as it’s central to the work I do.

I had two (hopefully straightforward) questions related to time series analysis that I was hoping I could get your thoughts on:

First, much of the work I do involves “backtesting” investment strategies, where I simulate the performance of an investment portfolio using historical data on returns. The primary summary statistics I generate from this sort of analysis are mean return (both arithmetic and geometric) and standard deviation (called “volatility” in my industry). Basically the idea is to select strategies that are likely to generate high returns given the amount of volatility they experience.

However, historical market data are very noisy, with stock portfolios generating an average monthly return of around 0.8% with a monthly standard deviation of around 4%. Even samples containing 300 months of data then have standard errors of about 0.2% (4%/sqrt(300)).

My first question is, suppose I have two time series. One has a mean return of 0.8% and the second has a mean return of 1.1%, both with a standard error of 0.4%. Assuming the future will look like the past, is it reasonable to expect the second series to have a higher future mean than the first out of sample, given that it has a mean 0.3% greater in the sample? The answer might be obvious to you, but I commonly see researchers make this sort of determination, when it appears to me that the data are too noisy to draw any sort of conclusion between series with means within at least two standard errors of each other (ignoring for now any additional issues with multiple comparisons).
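
For what it’s worth, the noise concern here can be quantified directly from the figures quoted in the question: the observed 0.3% gap is only about half a standard error of the difference between the two means. A quick back-of-the-envelope check:

```python
import math

# Numbers from the question above: mean returns of 0.8% and 1.1%,
# each with a standard error of 0.4 percentage points.
se_a = se_b = 0.4
diff = 1.1 - 0.8

# Standard error of the difference between two independent sample means.
se_diff = math.sqrt(se_a**2 + se_b**2)  # about 0.57 percentage points
z = diff / se_diff                      # about 0.53 standard errors

print(f"difference = {diff:.1f}pp, SE of difference = {se_diff:.2f}pp, z = {z:.2f}")
```

So even before getting to multiple comparisons, the observed gap is well within the range one would see if the two series had identical true means.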

My second question involves forecasting standard deviation. There are many models and products used by traders to determine the future volatility of a portfolio. The way I have tested these products has been to record the percentage of the time future returns (so out of sample) fall within one, two, or three standard deviations, as forecasted by the model. If future returns fall within those buckets around 68%/95%/99% of the time, I conclude that the model adequately predicts future volatility. Does this method make sense?
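
The coverage check Garrett describes is easy to sketch in simulation. Below is a minimal, self-contained version (all numbers invented); the volatility forecast is correct by construction, so the empirical coverage should land near the nominal 68%/95%/99.7% levels:

```python
import random

random.seed(1)

# Simulated monthly returns: normal with mean 0.8% and SD 4%, and a
# model that (correctly, by construction) forecasts sigma = 4%.
mu, sigma_true, sigma_forecast = 0.8, 4.0, 4.0
returns = [random.gauss(mu, sigma_true) for _ in range(10_000)]

# Count the fraction of out-of-sample returns falling within 1, 2, and 3
# forecast standard deviations of the forecast mean.
for k, nominal in [(1, 0.683), (2, 0.954), (3, 0.997)]:
    covered = sum(abs(r - mu) <= k * sigma_forecast for r in returns) / len(returns)
    print(f"within {k} SD: {covered:.3f} (nominal {nominal})")
```

A badly calibrated forecast would show up as coverage well away from the nominal rates; the catch, as noted in the reply below, is having enough out-of-sample months for the counts to be informative.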

My reply:

In answer to your first question, I think you need a model of the population of these time series. You can get different answers from different models. If your model is that each series is a linear trend plus noise, then you’d expect (but not be certain) that the second series will have a higher future return than the first. But there are other models where you’d expect the second series to have a lower future return. I’d want to set up a model allowing all sorts of curves and trends, then fit the model to past data to estimate a population distribution of those curves. But I expect that the way you’ll make real progress is to have predictors—I guess they’d be at the level of the individual stock, maybe varying over time—so that your answer will depend on the values of these predictors, not just the time series themselves.

In answer to your second question, yes, sure, you can check the calibration of your model using interval coverage. This should work fine if you have lots of calibration data. If your sample size becomes smaller, you might want to do something using quantiles, as described in this paper with Cook and Rubin, as this will make use of the calibration data more efficiently.

Multilevel modeling: What it can and cannot do

Today’s post reminded me of this article from 2005:

We illustrate the strengths and limitations of multilevel modeling through an example of the prediction of home radon levels in U.S. counties. . . .

Compared with the two classical estimates (no pooling and complete pooling), the inferences from the multilevel models are more reasonable. . . . Although the specific assumptions of model (1) could be questioned or improved, it would be difficult to argue against the use of multilevel modeling for the purpose of estimating radon levels within counties. . . . Perhaps the clearest advantage of multilevel models comes in prediction. In our example we can predict the radon levels for new houses in an existing county or a new county. . . . We can use cross-validation to formally demonstrate the benefits of multilevel modeling. . . . The multilevel model gives more accurate predictions than the no-pooling and complete-pooling regressions, especially when predicting group averages.

The most interesting part comes near the end of the three-page article:

We now consider our model as an observational study of the effect of basements on home radon levels. The study includes houses with and without basements throughout Minnesota. The proportion of homes with basements varies by county (see Fig. 1), but a regression model should address that lack of balance by estimating county and basement effects separately. . . . The new group-level coefficient γ2 is estimated at −.39 (with standard error .20), implying that, all other things being equal, counties with more basements tend to have lower baseline radon levels. For the radon problem, the county-level basement proportion is difficult to interpret directly as a predictor, and we consider it a proxy for underlying variables (e.g., the type of soil prevalent in the county).

This should serve as a warning:

In other settings, especially in social science, individual averages used as group-level predictors are often interpreted as “contextual effects.” For example, the presence of more basements in a county would somehow have a radon-lowering effect. This makes no sense here, but it serves as a warning that, with identical data of a social nature (e.g., consider substituting “income” for “radon level” and “ethnic minority” for “basement” in our study), it would be easy to leap to a misleading conclusion and find contextual effects where none necessarily exist. . . .

This is related to the problem in meta-analysis that between-study variation is typically observational even if individual studies are randomized experiments . . .

In summary:

One intriguing feature of multilevel models is their ability to separately estimate the predictive effects of an individual predictor and its group-level mean, which are sometimes interpreted as “direct” and “contextual” effects of the predictor. As we have illustrated in this article, these effects cannot necessarily be interpreted causally for observational data, even if these data are a random sample from the population of interest. Our analysis arose in a real research problem (Price et al. 1996) and is not a “trick” example. The houses in the study were sampled at random from Minnesota counties, and there were no problems of selection bias.

Read the whole thing.

Adding a predictor can increase the residual variance!

Chao Zhang writes:

When I want to know the contribution of a predictor in a multilevel model, I often calculate how much of the total variance is reduced in the random effects by the added predictor. For example, the between-group variance is 0.7 and residual variance is 0.9 in the null model, and by adding the predictor the residual variance is reduced to 0.7, then VPC = (0.7 + 0.9 – 0.7 – 0.7) / (0.7 + 0.9) = 0.125. Then I assume that the new predictor explained 12.5% more of the total variance than the null model. I guess this is sometimes done by some researchers when they need a measure of sort of an effect size.

However, now I have a case in which adding a new predictor (X) greatly increased the between-group variance. After some inspection, I realized that this was because although X correlates with Y positively overall, it correlates with Y negatively within each group, and X and Y vary in the same direction with respect to the grouping variable. Under this situation, the VPC as computed above becomes negative! I am puzzled by this: how could the total variance increase? And this seems to invalidate the above method, at least in some situations.

My reply: this phenomenon is discussed in Section 21.7 of my book with Jennifer Hill. The section is entitled, “Adding a predictor can increase the residual variance!”

It’s great when I can answer a question so easily!
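
A minimal simulation of the situation Zhang describes (all numbers invented): X predicts Y positively between groups but negatively within groups, and the estimated between-group variance goes up once X enters the model:

```python
import random

random.seed(2)

# X is positively related to Y across groups but has slope -1 within groups.
J, n = 30, 50
groups = []
for j in range(J):
    m = random.gauss(0, 2)                 # group-level location of X
    a = 1.5 * m + random.gauss(0, 0.5)     # group effect, increasing with m
    xs = [m + random.gauss(0, 1) for _ in range(n)]
    ys = [a - (x - m) + random.gauss(0, 1) for x in xs]  # slope -1 within group
    groups.append((xs, ys))

def var(v):
    mean = sum(v) / len(v)
    return sum((u - mean) ** 2 for u in v) / len(v)

xbar = [sum(xs) / n for xs, _ in groups]
ybar = [sum(ys) / n for _, ys in groups]   # "null model" group effects

# Pooled within-group slope of Y on X (negative here).
num = sum((x - xb) * (y - yb)
          for (xs, ys), xb, yb in zip(groups, xbar, ybar)
          for x, y in zip(xs, ys))
den = sum((x - xb) ** 2 for (xs, _), xb in zip(groups, xbar) for x in xs)
beta = num / den

# Group intercepts once X is in the model: alpha_j = ybar_j - beta * xbar_j.
alpha = [yb - beta * xb for yb, xb in zip(ybar, xbar)]

print("within-group slope:", round(beta, 2))
print("between-group variance, null model:", round(var(ybar), 2))
print("between-group variance, with X:    ", round(var(alpha), 2))
```

Because the within-group slope is negative while the group means of X and Y move together, the group intercepts have to absorb both pieces, and their variance increases rather than decreases.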

Recently in the sister blog

This research is 60 years in the making:

How “you” makes meaning

“You” is one of the most common words in the English language. Although it typically refers to the person addressed (“How are you?”), “you” is also used to make timeless statements about people in general (“You win some, you lose some.”). Here, we demonstrate that this ubiquitous but understudied linguistic device, known as “generic-you,” has important implications for how people derive meaning from experience. Across six experiments, we found that generic-you is used to express norms in both ordinary and emotional contexts and that producing generic-you when reflecting on negative experiences allows people to “normalize” their experience by extending it beyond the self. In this way, a simple linguistic device serves a powerful meaning-making function.

Applying human factors research to statistical graphics

John Rauser writes:

I’ve been a reader of yours (books, papers and the blog) for a long time, and it occurred to me today that I might be able to give something back to you.

I recently wrote a talk about human factors research applied to making statistical graphics. I mainly cover material from Cleveland, but also fold in ideas from Gestalt psychology. My main contribution is to apply these ideas to commonly seen real world graphics and draw out observations and recommendations.

I thought the talk might be decent fodder for your course on statistical graphics and communication, or perhaps for the blog. Either way, thanks so much for everything you’ve contributed to the community.

Trips to Cleveland, huh? I don’t have the patience to watch videos but I’ll post here and we can see what people think.

A stunned Dyson

Terry Martin writes:

I ran into this quote and thought you might enjoy it. It’s from p. 273 of Segre’s new biography of Fermi, The Pope of Physics:

When Dyson met with him in 1953, Fermi welcomed him politely, but he quickly put aside the graphs he was being shown indicating agreement between theory and experiment. His verdict, as Dyson remembered, was “There are two ways of doing calculations in theoretical physics. One way, and this is the way I prefer, is to have a clear physical picture of the process you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.” When a stunned Dyson tried to counter by emphasizing the agreement between experiment and the calculations, Fermi asked him how many free parameters he had used to obtain the fit. Smiling after being told “Four,” Fermi remarked, “I remember my old friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” There was little to add.

My reply: I’d have loved to have met Fermi or Ulam. Something about von Neumann really irritates me, though. That elephant quote just seems like bragging! For one thing, I can have a model with a lot more than five parameters and still struggle to fit my data.
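
The quip is about free parameters: a cubic has four coefficients, so four free parameters are enough to pass exactly through any four points, whatever they are. A toy demonstration via Lagrange interpolation (the “data” here are arbitrary made-up numbers):

```python
def interpolate(xs, ys, x):
    """Value at x of the unique degree-(n-1) polynomial through the n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, -1.0, 5.0, 0.5]   # any four points will do
for x, y in zip(xs, ys):
    assert abs(interpolate(xs, ys, x) - y) < 1e-9  # exact fit, four parameters
print("a four-parameter curve fits all four points exactly")
```

Which is also why “agreement with experiment” means little until you ask, as Fermi did, how many knobs were turned to get it.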

Stan Weekly Roundup, 21 July 2017

It was another productive week in Stan land. The big news is that

  • Jonathan Auerbach, Tim Jones, Susanna Makela, Swupnil Sahai, and Robin Winstanley won first place in a New York City competition for predicting elementary school enrollment. Jonathan told me, “I heard 192 entered, and there were 5 finalists….Of course, we used Stan (RStan specifically). … Thought it might be Stan news worthy.”

    I’d say that’s newsworthy. Jon also provided a link to the “challenge” page, a New York City government sponsored “call for innovations”: Enhancing School Zoning Efforts by Predicting Population Change.

    They took home a US$20K paycheck for their efforts!

    Stan’s seeing quite a lot of use these days among demographers and others looking to predict forward from time series data. Jonathan’s been very active using government data sets (see his StanCon 2017 presentation with Rob Trangucci, Twelve Cities: Does lowering speed limits save pedestrian lives?). Hopefully they’ll share their code—maybe they already have. I really like to see this combination of government data and careful statistical analysis.

In other news,

  • Andrew Gelman‘s been driving a lot of rethinking of our interfaces because he’s revising his and Jennifer Hill’s regression book (the revision will be two books). Specifically, he’s thinking a lot about workflow and how we can manage model expansion by going from fixed to modeled parameters. Right now, the process is onerous in that you have to move data variables into the parameters block and keep updating their priors. Andrew wants to be able to do this from the outside, but Michael Betancourt and I both expressed a lot of skepticism in terms of it breaking a lot of our fundamental abstractions (like a Stan program defining the density that’s fit!). More to come on this hot topic. Any ideas on how to manage developing a model would be appreciated. This goes back to the very first thing Matt Hoffman, Michael Malecki and I worked on with Andrew when he hired us before we’d conceived Stan. You’d think we’d have better advice on this after all this time. I’ve seen people do everything from use the C++ preprocessor to write elaborate program generation code in R.

  • Breck Baldwin has been working on governance and we’re converging on a workable model that we’ll share with everyone soon. The goal’s to make the governance clear and less of a smoke-filled room job by those of us who happen to go to lunch after the weekly meetings.

  • Jonah Gabry is taking on the ggplot2-ification of the new regression book and trying to roll everything into a combination of RStanArm and BayesPlot. No word yet if the rest of the tidyverse is to follow. Andrew said, “I’ll see what Jonah comes up with” or something to that effect.

  • Jonah has also been working on priors for multilevel regression and poststratification with Yajuan Si (former postdoc of Andrew’s, now at U. Wisconsin); the trick is doing something reasonable when you have lots of interactions.

  • Ben Goodrich has been working on the next release of RStanArm. It’s beginning to garner a lot of contributions. Remember that the point has been to convert a lot of common point estimation packages to Bayesian versions and supply them with familiar R interfaces. Ben’s particularly been working on survival analysis lately.

  • Our high school student intern (don’t know if I can mention names online—the rest of our developers are adults!) is working on applying the Cook-Gelman-Rubin metric to evaluating various models. We’re doing much more of this method and it needs a less verbose and more descriptive name!

  • Mitzi Morris submitted a pull request for the Stan repo to add line numbers to error messages involving compound declare-and-define statements.
  • A joint effort by Mitzi Morris, Dan Simpson, Imad Ali, and Miguel A Martinez Beneito has led to convergence of models and fits among BUGS, Stan, and INLA for intrinsic conditional autoregression (ICAR) models. Imad’s building the result in RStanArm and has figured out how to extend the loo (leave one out cross-validation) package to deal with spatial models. Look for it in a Stan package near you soon. Mitzi’s working on the case study, which has been updated in the example-models repo.

  • Charles Margossian knocked off a paper on the mixed ODE solver he’s been working on, with a longer paper promised that goes through all the code details. Not sure if that’s on arXiv or not. He’s also been training Bill Gillespie to code in C++, which is great news for the project since Charles has to contend with his first year of grad school next year (whereas Bill’s facing a pleasant retirement of tracking down memory leaks). Charles is also working on getting the algebraic and fixed state solvers integrated into Torsten before the fall.

  • Krzysztof Sakrejda has a new paper out motivating a custom density he wrote in Stan for tracking delays in disease incidence in a countrywide analysis for Thailand. I’m not sure where he put it.

  • Yuling Yao is revising the stacking paper (for a kind of improved model averaging). I believe this is going into the loo package (or is maybe already there). So much going on with Stan I can’t keep up!

  • Yuling also revised the simulated tempering paper, which is some custom code on top of Stan to fit models with limited multimodality. There was some discussion about realistic examples with limited multimodality and we hope to have a suite of them to go along with the paper.

  • Sebastian Weber, Bob Carpenter, Michael Betancourt, and Charles Margossian had a long and productive meeting to design the MPI interface. It’s very tricky trying to find an interface that’ll fit into the Stan language and let you ship off data once and reuse it. I think we’re almost there. The design discussion’s on Discourse.

  • Rob Trangucci is finishing the GP paper (after revising the manual chapter) with Dan Simpson, Aki Vehtari, and Michael Betancourt.
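
The Cook-Gelman-Rubin check mentioned above can be sketched on a hypothetical model whose posterior is known in closed form (not the intern’s actual setup): draw the parameter from its prior, simulate data, and verify that the posterior CDF evaluated at the true parameter is uniform over repeated simulations:

```python
import random
import statistics
from math import erf, sqrt

random.seed(3)

# Hypothetical conjugate example: theta ~ N(0, 1), y | theta ~ N(theta, 1),
# so the posterior is N(y/2, 1/2). If inference is correct, the posterior
# CDF at the true theta is Uniform(0, 1) across replications.
def normal_cdf(x, mu, sd):
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

quantiles = []
for _ in range(5000):
    theta = random.gauss(0, 1)      # draw "true" parameter from the prior
    y = random.gauss(theta, 1)      # simulate data given that parameter
    quantiles.append(normal_cdf(theta, y / 2, sqrt(0.5)))

# Uniform(0, 1) has mean 1/2 and variance 1/12, about 0.083.
print("mean:    ", round(statistics.mean(quantiles), 3))
print("variance:", round(statistics.pvariance(quantiles), 3))
```

A buggy sampler or a wrong posterior would show up as non-uniform quantiles, which is what makes this a useful end-to-end check of inference software.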

I’m sure I’m missing a lot of goings on, especially among people not at our weekly meetings. If you know of things that should be on this list, please let me know.

“Bayes factor”: where the term came from, and some references to why I generally hate it

Someone asked:

Do you know when this term was coined or by whom? Kass and Raftery’s use of the term as the title of their 1995 paper suggests that it was still novel then, but I have not noticed in the paper any information about where it started.

I replied:

According to Etz and Wagenmakers (2016), “The term ‘Bayes factor’ comes from Good, who attributes the introduction of the term to Turing, who simply called it the ‘factor.'”

They refer to Good (1988) and Fienberg (2006) for historical review.

I generally hate Bayes factors myself, for reasons discussed at a technical level in our Bayesian Data Analysis book (see chapter 7 of the third edition). Or, if you want to see my philosophical reasons, see the discussion around Figure 1 of this paper. Also this paper with Rubin from 1995.

How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?

Someone who wishes to remain anonymous writes in with a story:

Linking to a new paper by Jorge Luis García, James J. Heckman, and Anna L. Ziff, an economist Sue Dynarski makes this “joke” on facebook—or maybe it’s not a joke:

How does one adjust standard errors to account for the fact that N of papers on an experiment > N of participants in the experiment?

Clicking through, the paper uses data from the “Abecedarian” (ABC) childhood intervention program of the 1970s. Well, the related ABC & “CARE” experiments, pooled together. From Table 3 on page 7, the ABC experiment has 58 treatment and 56 control students, while CARE has 17 treatment and 23 control. If you type “abecedarian” into Google Scholar, sure enough, you get 9,160 results! OK, but maybe some of those just have citations or references to other papers on that project… If you restrict the search to papers with “abecedarian” in the title, you still get 180 papers. If you search for the word “abecedarian” on Google Scholar (not necessarily in the title) and restrict to papers by Jim Heckman, you get 86 results.

That’s not why I thought to email you though.

Go to pages 7-8 of this new paper where they explain why they merged the ABC and CARE studies:

CARE included an additional arm of treatment. Besides the services just described, those in the treatment group also received home visiting from birth to age 5. Home visiting consisted of biweekly visits focusing on parental problem-solving skills. There was, in addition, an experimental group that received only the home visiting component, but not center-based care.[fn 17] In light of previous analyses, we drop this last group from our analysis. The home visiting component had very weak estimated effects.[fn 18] These analyses justify merging the treatment groups of ABC and CARE, even though that of CARE received the additional home-visiting component.[fn 19] We henceforth analyze the samples so generated as coming from a single ABC/CARE program.

OK, they merged some interventions (garden of forking paths?) because they wanted more data. But, how do they know that home visits had weak effects? Let’s check their explanation in footnote 18:

18: Campbell et al. (2014) test and do not reject the hypothesis of no treatment effects for this additional component of CARE.

Yep. Jim Heckman and coauthors conclude that the effects are “very weak” because they ran some tests and couldn’t reject the null. If you go deep into the supplementary material of the cited paper, to tables S15(a) and S15(b), sure enough you find that these “did not reject the null” conclusions are drawn from interventions with 12-13 control and 11-14 treatment students (S15(a)) or 15-16 control and 18-20 treatment students (S15(b)). Those are pretty small sample sizes…

This jumped out at me and I thought you might be interested too.

My reply: This whole thing is unfortunate but it is consistent with the other writings of Heckman and his colleagues in this area: huge selection bias and zero acknowledgement of the problem. It makes me sad because Heckman’s fame came from models of selection bias, but he doesn’t see it when it’s right in front of his face. See here, for example.

The topic is difficult to write about for a few reasons.

First, Heckman is a renowned scholar and he is evidently careful about what he writes. We’re not talking about Brian Wansink or Satoshi Kanazawa here. Heckman works on important topics, his studies are not done on the cheap, and he’s eminently reasonable in his economic discussions. He’s just making a statistical error, over and over again. It’s a subtle error, though, that has taken us (the statistics profession) something like a decade to fully process. Making this mistake doesn’t make Heckman a bad guy, and that’s part of the problem: When you tell a quantitative researcher that they made a statistical error, you often get a defensive reaction, as if you accused them of being stupid, or cheating. But lots of smart, honest people have made this mistake. That’s one of the reasons we have formal statistical methods in the first place: people get lots of things wrong when relying on instinct. Probability and statistics are important, but they’re not quite natural to our ways of thinking.

Second, who wants to be the grinch who’s skeptical about early childhood intervention? Now, just to be clear, there’s lots of room to be skeptical about Heckman’s claims and still think that early childhood intervention is a good idea. For example, this paper by Garcia, Heckman, Leaf, and Prados reports a benefit/cost ratio of 7.3. So they could be overestimating their effect by a factor of 7 and still have a favorable ratio. The point is, if for whatever reason you support universal day care or whatever, you have a motivation not to worry too much about the details of a study that supports your position.

Again, I’m not saying that Heckman and his colleagues are doing this. I can only assume they’re reporting what, to them, are their best estimates. Unfortunately these methods are biased. But a lot of people with classical statistics and econometrics training don’t realize this: they think regression coefficients are unbiased estimates, but nobody ever told them that the biases can be huge when there is selection for statistical significance.
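
The size of this bias is easy to see in simulation (the true effect and standard error below are invented for illustration): condition on statistical significance, and the surviving estimates must exaggerate the effect, here by a factor of several:

```python
import random

random.seed(4)

# A true effect of 0.2 estimated with standard error 0.5, replicated many
# times; keep only the estimates that reach |estimate| > 1.96 * SE.
true_effect, se = 0.2, 0.5
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]
significant = [e for e in estimates if abs(e) > 1.96 * se]

exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect
print(f"share significant: {len(significant) / len(estimates):.2f}")
print(f"significant estimates average about {exaggeration:.1f}x the true effect")
```

Only a small fraction of replications reach significance, and those that do overstate the effect several-fold, which is the bias at issue even when no individual analysis is dishonest.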

And, remember, selection for statistical significance is not just about the “file drawer” and it’s not just about “p-hacking.” It’s about researcher degrees of freedom and forking paths that researchers themselves don’t always realize until they try to replicate their own studies. I don’t think Heckman and his colleagues have dozens of unpublished papers hiding in their file drawers, and I don’t think they’re running their data through dozens of specifications until they find statistical significance. So it’s not the file drawer and it’s not p-hacking as is often understood. But these researchers do have nearly unlimited degrees of freedom in their data coding and analysis, they do interpret “non-significant” differences as null and “significant” differences at face value, they have forking paths all over the place, and their estimates of magnitudes of effects are biased in the positive direction. It’s kinda funny but also kinda sad, that there’s so much concern for rigor in the design of these studies and in the statistical estimators used in the analysis, but lots of messiness in between, lots of motivation on the part of the researchers to find success after success after success, and lots of motivation for scholarly journals and the news media to publicize the results uncritically. These motivations are not universal—there’s clearly a role in the ecosystem for critics within academia, the news media, and in the policy community—but I think there are enough incentives for success within Heckman’s world to keep him and his colleagues from seeing what’s going wrong.

Again, it’s not easy—it took the field of social psychology about a decade to get a handle on the problem, and some are still struggling. So I’m not slamming Heckman and his colleagues. I think they can and will do better. It’s just interesting, when considering the mistakes that accomplished people make, to ask, How did this happen?

P.S. This is an important topic. It’s not ovulation-and-voting or air rage or himmicanes or anything silly like that: We’re talking about education policy that could affect millions of kids! And I’m not saying I have all the answers, or anything close to that. No, it’s the opposite: data are relevant to these questions, and I’m not close to the data. What’s needed is an integration of theory with observational and experimental data, and it’s great that academic economists such as Heckman have put so much time into studying these problems. I see my role as statistician as a helper. For better or worse, statistics is a big part of the story, and when people are making statistical errors, we should be correcting them. But these corrections are not the end of the story; they’re just necessary adjustments to keep research on the right track.

Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!

Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.

Class structure and example topics for the three days:

Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with hands-on exercises
Real-life Stan
Bayesian workflow

Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)

Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.

Make Your Plans for Stans (-s + Con)

This post is by Mike

A friendly reminder that registration is open for StanCon 2018, which will take place over three days, from Wednesday January 10, 2018 to Friday January 12, 2018, at the beautiful Asilomar Conference Grounds in Pacific Grove, California.

Detailed information about registration and accommodation at Asilomar, including fees and instructions, can be found on the event website.  Early registration ends on Friday November 10, 2017 and no registrations will be accepted after Wednesday December 20, 2017.

We have an awesome set of invited speakers this year, worth attending for that alone:

  • Susan Holmes (Department of Statistics, Stanford University)
  • Sean Taylor and Ben Letham (Facebook Core Data Science)
  • Manuel Rivas (Department of Biomedical Data Science, Stanford University)
  • Talia Weiss (Department of Physics, Massachusetts Institute of Technology)
  • Sophia Rabe-Hesketh and Daniel Furr (Educational Statistics and Biostatistics, University of California, Berkeley)

Contributed talks will proceed as last year, with each submission consisting of self-contained knitr or Jupyter notebooks that will be made publicly available after the conference.  Last year’s contributed talks were awesome and we can’t wait to see what users will submit this year.  For details on how to submit see the submission website.  The final deadline for submissions is Saturday September 16, 2017 5:00:00 AM GMT.

This year we are going to try to support as many student scholarships as we can — if you are a student who would love to come but may not have the funding then don’t hesitate to submit a short application!

Finally, we are still actively looking for sponsors!  If you are interested in supporting StanCon 2018, or know someone who might be, then please contact the organizing committee.

His concern is that the authors don’t control for the position of games within a season.

Chris Glynn wrote last year:

I read your blog post about middle brow literature and PPNAS the other day. Today, a friend forwarded me this article in The Atlantic that (in my opinion) is another example of what you’ve recently been talking about. The research in question is focused on Major League Baseball and the occurrence that a batter is hit by a pitch in retaliation for another player previously hit by a pitch in the same game. The research suggests that temperature is an important factor in predicting this retaliatory behavior. The original article by Larrick et al. in the journal Psychological Science is here.

My concern is that the authors don’t control for the position of games within a season. There are several reasons why the probability of retaliation may change as the season progresses, but a potentially important one is the changing relative importance of games as the season goes along. Games in the early part of the season (April, May) are important as teams try to build a winning record. Games late in the season are more important as teams compete for limited playoff spots. In these important games, retaliation is less likely because teams are more focused on winning than imposing baseball justice. The important games occur in relatively cool months. There exists a soft spot in the schedule during June, July, and August (hot months) where the games are less consequential. Perhaps what is driving the result is the schedule position (and relative importance) of the game. Regardless of the mechanism by which the schedule position impacts the probability of retaliation, the timing of a game within the season is correlated with temperature.

One quick analysis to get at the effect of temperature in games of similar importance would be to examine those that were played in the month of August. Some of those games will be played in dome stadiums which are climate controlled. Most games will be played in outdoor stadiums. I am curious to see if the temperature effect still exists after controlling for the relative importance of the game.

My reply: Psychological Science, published in 2011? That says it all. I’ll blog this, I guess during next year’s baseball season…

And here we are!
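For what it’s worth, Glynn’s proposed check could be sketched in R roughly as follows. This is purely illustrative — the data frame and every variable name here are made up, and the data are simulated so that schedule position, not temperature, drives retaliation; with real game logs you’d substitute actual fields:

```r
# Hypothetical sketch: does temperature still predict retaliation after
# controlling for where the game falls in the season? We simulate data in
# which the "soft spot" of the schedule (midseason) is the true driver,
# and temperature merely tracks it.
set.seed(1)
n <- 5000
game_of_season <- sample(1:162, n, replace=TRUE)
midseason <- sin(pi * game_of_season / 162)          # peaks in midseason
temperature <- 60 + 25 * midseason + rnorm(n, 0, 5)  # hot in midseason
retaliation <- rbinom(n, 1, plogis(-4 + 1.5 * midseason))
hbp <- data.frame(retaliation, temperature, game_of_season, midseason)

# Naive model: temperature looks predictive because it proxies for schedule
fit_naive <- glm(retaliation ~ temperature, family=binomial, data=hbp)

# Controlled model: add the schedule-position term
fit_ctrl <- glm(retaliation ~ temperature + midseason, family=binomial, data=hbp)
```

In the simulated data, the temperature coefficient shrinks once the schedule-position term is included — which is exactly the confound Glynn is worried about.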

Animating a spinner using ggplot2 and ImageMagick

It’s Sunday, and I [Bob] am just sitting on the couch peacefully ggplotting to illustrate basic sample spaces using spinners (a trick I’m borrowing from Jim Albert’s book Curve Ball). There’s an underlying continuous outcome (i.e., where the spinner lands) and a quantization into a number of regions to produce a discrete outcome (e.g., “success” and “failure”). I’m quite pleased with myself for being able to use polar coordinates to create the spinner and arrow. ggplot works surprisingly well in polar coordinates once you figure them out; almost everything people have said about them online is confused and the doc itself assumes you’re a bit more of a ggplotter and geometer than me.

I’m so pleased with it that I show the plot to Mitzi. She replies, “Why don’t you animate it?” I don’t immediately say, “What a waste of time,” then get back to what I’m doing. Instead, I boast, “It’ll be done when you get back from your run.” Luckily for me, she goes for long runs—I just barely had the prototype working as she got home. And then I had to polish it and turn it into a blog post. So here it is, for your wonder and amazement.

Here’s the R magic.

library(ggplot2)  # for ggplot, ggsave, arrow, etc.

draw_curve <- function(angle) {
  df <- data.frame(outcome = c("success", "failure"), prob = c(0.3, 0.7))
  plot <-
    ggplot(data=df, aes(x=factor(1), y=prob, fill=outcome)) +
    geom_bar(stat="identity", position="fill") +
    coord_polar(theta="y", start = 0, direction = 1) +
    scale_y_continuous(breaks = c(0.12, 0.7), labels=c("success", "failure")) +
    geom_segment(aes(y= angle/360, yend= angle/360, x = -1, xend = 1.4), arrow=arrow(type="closed"), size=1) +
    theme(axis.title = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid = element_blank(),
          panel.border = element_blank()) +
    theme(legend.position = "none") +
    geom_point(aes(x=-1, y = 0), color="#666666", size=5)
  plot
}

# Angular position for each frame: the step size shrinks by one degree
# per frame, so the spinner decelerates to a stop.
ds <- c()
pos <- 0
for (i in 1:66) {
  pos <- (pos + (67 - i)) %% 360
  ds[i] <- pos
}
ds <- c(rep(0, 10), ds)                # hold still at the start
ds <- c(ds, rep(ds[length(ds)], 10))   # hold the final position

# Write one png per frame, named so they sort alphabetically
for (i in 1:length(ds)) {
  ggsave(filename = paste("frame", ifelse(i < 10, "0", ""), i, ".png", sep=""),
         plot = draw_curve(ds[i]), device="png", width=4.5, height=4.5)
}

I probably should've combined theme functions. Ben would've been able to define ds in a one-liner and then map ggsave. I hope it's at least clear what my code does (just decrements the number of degrees moved each frame by one---no physics involved).
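For what it’s worth, here’s one way that one-liner might look — this cumulative-sum trick is my reconstruction, not Ben’s actual code:

```r
# The iterative update pos <- (pos + (67 - i)) %% 360 is just a running
# sum of the shrinking step sizes 66, 65, ..., 1, taken mod 360, so ds
# can be built without an explicit loop:
ds <- cumsum(66:1) %% 360
ds <- c(rep(0, 10), ds, rep(ds[length(ds)], 10))
```

The modulo can be applied after the cumulative sum because taking a remainder distributes over addition, so the result matches the loop exactly.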

After producing the frames in alphabetical order (all that ifelse and paste mumbo-jumbo), I went to the output directory and ran the results through ImageMagick (which I'd previously installed on my now ancient Macbook Pro) from the terminal, using

> convert *.png -delay 3 -loop 0 spin.gif

That took a minute or two. Each of the pngs is about 100KB, but the final output is only 2.5MB or so. Maybe I should've gone with less delay (I don't even know what the units are!) and fewer rotations and maybe a slower final slowing down (maybe study the physics). How do the folks at Pixar ever get anything done?

P.S. I can no longer get the animation package to work in R, though it used to work in the past. It just wraps up those calls to ImageMagick.

P.P.S. That salmon and teal color scheme is the default!

“The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll

How to design future studies of systemic exercise intolerance disease (chronic fatigue syndrome)?

Someone named Ramsey writes on behalf of a self-managed support community of 100+ systemic exercise intolerance disease (SEID) patients. He read my recent article on the topic and had a question regarding the following excerpt:

For conditions like S.E.I.D., then, the better approach may be to gather data from people suffering “in the wild,” combining the careful methodology of a study like PACE with the lived experience of thousands of people. Though most may be less eloquent than Rehmeyer, each may have his or her own potential path to recovery.

Ramsey asks:

From your perspective, are there particular design features to such an approach that one should prioritize, in order to maximize its usefulness to others?

Here’s the challenge.

The current standard model of evaluating medical research is the randomized clinical trial with 100 or so patients. This sort of trial is both too large and too small (see also here): too large in that there is so much variation in the population of patients, and different treatments will work (or not work, or even be counterproductive) for different people; too small in that the variation in such studies makes it hard to find reliable, reproducible results.

I think we need to move in two directions at once. From one direction, N=1 experiments: careful scientific evaluations of treatment options adapted to individual people. From the other direction, full population studies, tracking what really is happening outside the lab. The challenge there, as Ramsey notes, is that a lot of uncontrolled information is and will be available.

I’m sorry to say that I don’t have any good advice right now on how future studies should proceed. Speaking generally, I think it’s important to measure exactly what’s being done by the doctor and patient at all times, I think you should think carefully about outcome measures, and I think it’s a good idea to try multiple treatments on individual patients (that is, to perform within-person comparisons, also called crossover trials in this context). And, when considering observational studies (that is, comparisons based on existing treatments), gather whatever pre-treatment information that is predictive of individuals’ choice of treatment regimen to follow. For SEID in particular, it seems that the diversity of the condition is a key part of the story and so it would be good to find treatments that work with well-defined subgroups.

I hope others can participate in this discussion.

Should we continue not to trust the Turk? Another reminder of the importance of measurement

From 2013: Don’t trust the Turk

From 2017 (link from Kevin Lewis), from Jesse Chandler and Gabriele Paolacci:

The Internet has enabled recruitment of large samples with specific characteristics. However, when researchers rely on participant self-report to determine eligibility, data quality depends on participant honesty. Across four studies on Amazon Mechanical Turk, we show that a substantial number of participants misrepresent theoretically relevant characteristics . . .

For some purposes you can learn a lot from these online samples, but it depends on context. Measurement is important, and it is underrated in statistics.

The trouble is if you’re cruising along treating “p less than .05” as your criterion of success, then quality of measurement barely matters at all! Gather your data, grab your stars, get published, give your Ted talk, and sell your purported expertise to the world. Statistics textbooks have lots about how to analyze your data, a little bit on random sampling and randomized experimentation, and next to nothing on gathering data with high reliability and validity.

They want help designing a crowdsourcing data analysis project

Michael Feldman writes:

My collaborators and I are doing research in which we try to understand the reasons for the variability in data analysis (“the garden of forking paths”). Our goal is to understand why scientists make different decisions regarding their analyses and in doing so reach different results.

In a project called “Crowdsourcing data analysis: Gender, status, and science”, we have recruited a large group of independent analysts to test the same hypotheses on the same dataset using a platform we developed.

The platform is essentially RStudio running online with a few additions:

· We record all executed commands even if they are not in the final code

· We ask analysts to explain these commands by creating semantic blocks explaining the rationale and alternatives

· We allow analysts to create a graphical workflow of their analysis using these blocks and by restructuring them

You can find the more complete experiment description here. Also a short video tutorial of the platform.

Of course this experiment is not covering all considerations that might lead to variability (e.g. R users might differ from Python users), but we believe it is a step towards better understanding how defensible, yet subjective analytic choices may shape research results. The experiment is still running but we are likely to receive about 40-60 submissions of code, logs, comments, and explanations of decisions made. We are also collecting various information about analysts like their background, methods they usually use and the way they operationalized the hypotheses.

Our current plan is to analyze the data from this crowdsourced project using inductive coding by splitting participants into groups that reached similar results (effect size and direction). We then plan to identify factors that can explain various decisions as well as explain the similarities between participants.

We would love to receive any feedback and suggestions from readers of your blog regarding our planned approach to account for variability in results across different analysts.

If anyone has suggestions, feel free to respond in the comments.

Graphs as comparisons: A case study

Above is a pair of graphs from a 2015 paper by Alison Gopnik, Thomas Griffiths, and Christopher Lucas. It takes up half a page in the journal, Current Directions in Psychological Science. I think we can do better.

First, what’s wrong with the above graphs?

We could start with the details: As a reader, I have to go back and forth between the legend and the bars to keep the color scheme fresh in my mind. The negative space in the middle of each plot looks a bit like a white bar, which is not directly confusing but makes the display that much harder to read. And I had to do a double-take to interpret the infinitesimal blue bar of zero height on the left plot. Also, it’s not clear how to interpret the y-axes: 25 participants out of how many? And whassup with the y-axis on the second graph: the 15 got written as a 1, which makes me suspect that the graph was re-drawn from an original, which then leads to concern that other mistakes may have been introduced in the re-drawing process.

But that’s not the direction I want to go. There are problems with the visual display, but going and fixing them one by one would not resolve the larger problem. To get there, we need to think about goals.

A graph is a set of comparisons, and the two goals of a graph are:

1. To understand and communicate the size and directions of comparisons that you were already interested in, before you made the graph; and
2. To facilitate discovery of new patterns in data that go beyond what you expected to see.

Both these goals are important. It’s important to understand and communicate what we think we know, and it’s also important to put ourselves in a position where we can learn more.

The question, when making any graph, is: what comparisons does it make easy to see? After all, if you just wanted the damn numbers you could put them in a table.

Now let’s turn to the graph above. It makes it easy to compare the heights of two lines right next to each other—for example, I see that the dark blue lines are all higher than the light blue lines, except for the pair on the left . . . hmmmm, which are light blue and which are dark blue, again? I can’t really do much with the absolute levels of the lines because I don’t know what “25” means.

Look. I’m not trying to rag on the authors here. This sort of Excel-style bar graph is standard in so many presentations. I just think they could do better.

So, how to do better? Let’s start with the goals.

1. What are the key comparisons that the authors want to emphasize? From the caption, it seems that the most important comparisons are between children and adults. We want a graph that shows the following patterns:
(i) In the Combination scenario, children tended to choose multiple objects (the correct response, it seems) and adults tended to choose single objects (the wrong response).
(ii) In the Individual scenario, both children and adults tended to choose a single object (the correct response).
Actually, I think I’d re-order these, and first look at the Individual scenario which seems to be some sort of baseline, and then go to Combination which is displaying something new.

2. What might we want to learn from a graph of these data, beyond the comparisons listed just above? This one’s not clear so I’ll guess:
Who were those kids and adults who got the wrong answer in the Combination scenario? Did they have other problems? What about the adults who got the wrong answer in the Individual scenario, which was presumably easier? Did they also get the answer wrong in the other case? There must be some other things to learn from these data too—it’s hard to get people to participate in a psychology experiment, and once you have them there, you’ll want to give them as many tasks as you can. But from this figure alone, I’m not sure what these other questions would be.

OK, now time to make the new graph. Given that I don’t have the raw data, and I’m just trying to redo the figure above, I’ll focus on task 1: displaying the key comparisons clearly.

Hey—I just realized something! The two outcomes in this study are “Single object” and “Multiple object”—that’s all there is! And, looking carefully, I see that the numbers in each graph add up to a constant: it’s 25 children and, ummm, let me read this carefully . . . 28 adults!

This simplifies our task considerably, as now we have only 4 numbers to display instead of 8.

We can easily display four numbers with a line plot. The outcome is % who give the “Single object” response, and the two predictors are Child/Adult and Individual Principle / Combination Principle.

One of these predictors will go on the x-axis, one will correspond to the two lines, and the outcome will go on the y-axis.

In this case, which of our two predictors goes on the x-axis?

Sometimes the choice is easy: if one predictor is binary (or discrete with only a few categories) and the other is continuous, it’s simplest to put the continuous predictor as x, and use the discrete predictor to label the lines. In this case, though, both predictors are binary, so what to do?

I think we should use logical or time order, as that’s easy to follow. There are two options:
(1) Time order in age, thus Children, then Adults; or
(2) Logical order in the experiment, thus Individual Principle, then Combination Principle, as Individual is in a sense the control case and Combination is the new condition.

I tried it both ways and I think option 2 was clearer. So I’ll show you this graph and the corresponding R code. Then I’ll show you option 1 so you can compare.

Here’s the graph:

I think this is better than the bar graphs from the original article, for two reasons. First, we can see everything in one place: Like the title sez, “Children did better than adults,\nespecially in the combination condition.” Second, we can directly make both sorts of comparisons: we can compare children to adults, and we can also make the secondary comparison of seeing that both groups performed worse under the combination condition than the individual condition.

Here’s the data file I made, gopnik.txt:

Adult Combination N_single N_multiple
0 0 25 0
0 1 4 21
1 0 23 5
1 1 18 10

And here’s the R code:

gopnik <- read.table("gopnik.txt", header=TRUE)
N <- gopnik$N_single + gopnik$N_multiple
p_multiple <- gopnik$N_multiple / N
p_correct <- ifelse(gopnik$Combination==0, 1 - p_multiple, p_multiple)
colors <- c("red", "blue")
combination_labels <- c("Individual\ncondition", "Combination\ncondition")
adult_labels <- c("Children", "Adults")

pdf("gopnik_2.pdf", height=4, width=5)
par(mar=c(3,3,3,2), mgp=c(1.7, .5, 0), tck=-.01, bg="gray90")
plot(c(0,1), c(0,1), yaxs="i", xlab="", ylab="Percent who gave correct answer", xaxt="n", yaxt="n", type="n", main="Children did better than adults,\nespecially in the combination condition", cex.main=.9)
axis(1, c(0, 1), combination_labels, mgp=c(1.5,1.5,0))
axis(2, c(0,.5,1), c("0", "50%", "100%"))
for (i in 1:2){
  ok <- gopnik$Adult==(i-1)
  x <- gopnik$Combination[ok]
  y <- p_correct[ok]
  # Agresti-Coull-style se based on (y+2)/(n+4)
  se <- sqrt((N[ok]*y + 2)*(N[ok]*(1-y) + 2)/(N[ok] + 4)^3)
  lines(x, y, col=colors[i])
  points(x, y, col=colors[i], pch=20)
  for (j in 1:2){
    lines(rep(x[j], 2), y[j] + se[j]*c(-1,1), lwd=2, col=colors[i])
    lines(rep(x[j], 2), y[j] + se[j]*c(-2,2), lwd=.5, col=colors[i])
  }
  text(mean(x), mean(y) - .05, adult_labels[i], col=colors[i], cex=.9)
}
dev.off()

Yeah, yeah, I know the code is ugly. I'm pretty sure it could be done much more easily in ggplot2.

Also, just for fun I threw in +/- 1 and 2 standard error bars, using the Agresti-Coull formula based on (y+2)/(n+4) for binomial standard errors. Cos why not. The one thing this graph doesn't show is whether the adults who got it wrong on the individual condition were more likely to get it wrong in the combination condition, but that information wasn't in the original graph either.
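For reference, that standard error can be written as a little function — this is my restatement of the formula described in the post, not code taken from it:

```r
# Agresti-Coull-style standard error for a binomial proportion, based on
# the shifted estimate p_tilde = (y + 2) / (n + 4), where y is the number
# of successes and n the number of trials:
agresti_coull_se <- function(y, n) {
  p_tilde <- (y + 2) / (n + 4)
  sqrt(p_tilde * (1 - p_tilde) / (n + 4))
}

agresti_coull_se(25, 25)  # e.g., all 25 children correct in the individual condition
```

Note that the shift keeps the standard error strictly positive even when all responses fall in one category, which is why the error bars don’t collapse to zero for the children’s perfect score.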

On the whole, I'm satisfied that the replacement graph contains all the information in less space and is much clearer than the original.

Again, this is not a slam on the authors of the paper. They're not working within a tradition in which graphical display is important. I'm going through this example in order to provide a template for future researchers when summarizing their data.

And, just for comparison, here's the display going the other way:

(This looks so much like the earlier plot that it seems at first that we did something wrong. But, no, it just happened that way because we're only plotting four numbers, and it just happened that the two numbers whose positions changed had very similar values of 0.84 and 0.82.)

And here's the R code for this second graph:

pdf("gopnik_1.pdf", height=4, width=5)
par(mar=c(3,3,3,2), mgp=c(1.7, .5, 0), tck=-.01, bg="gray90")
plot(c(0,1), c(0,1), yaxs="i", xlab="", ylab="Percent who gave correct answer", xaxt="n", yaxt="n", type="n", main="Children did better than adults,\nespecially in the combination condition", cex.main=.9)
axis(1, c(0, 1), adult_labels)
axis(2, c(0,.5,1), c("0", "50%", "100%"))
for (i in 1:2){
  ok <- gopnik$Combination==(i-1)
  x <- gopnik$Adult[ok]
  y <- p_correct[ok]
  # Agresti-Coull-style se based on (y+2)/(n+4)
  se <- sqrt((N[ok]*y + 2)*(N[ok]*(1-y) + 2)/(N[ok] + 4)^3)
  lines(x, y, col=colors[i])
  points(x, y, col=colors[i], pch=20)
  for (j in 1:2){
    lines(rep(x[j], 2), y[j] + se[j]*c(-1,1), lwd=2, col=colors[i])
    lines(rep(x[j], 2), y[j] + se[j]*c(-2,2), lwd=.5, col=colors[i])
  }
  text(mean(x), mean(y) - .1, combination_labels[i], col=colors[i], cex=.9)
}
dev.off()

Hey—here are some tools in R and Stan for designing more effective clinical trials! How cool is that?

In statistical work, design and data analysis are often considered separately. Sometimes we do all sorts of modeling and planning in the design stage, only to analyze data using simple comparisons. Other times, we design our studies casually, even thoughtlessly, and then try to salvage what we can using elaborate data analyses.

It would be better to integrate design and analysis. My colleague Sebastian Weber works at Novartis (full disclosure: they have supported my research too), where they want to take some of the sophisticated multilevel modeling ideas that have been used in data analysis to combine information from different experiments, and apply these to the design of new trials.

Sebastian and his colleagues put together an R package wrapping some Stan functions so they can directly fit the hierarchical models they want to fit, using the prior information they have available, and evaluating their assumptions as they go.

Sebastian writes:

Novartis was so kind to grant permission to publish the RBesT (R Bayesian evidence synthesis Tools) R library on CRAN. It’s landed there two days ago. We [Sebastian Weber, Beat Neuenschwander, Heinz Schmidli, Baldur Magnusson, Yue Li, and Satrajit Roychoudhury] have invested a lot of effort into documenting (and testing) that thing properly. So if you follow our vignettes you get an in-depth tutorial into what, how and why we have crafted the library. The main goal is to reduce the sample size in our clinical trials. As such the library performs a meta-analytic-predictive (MAP) analysis using MCMC. Then that MAP prior is turned into a parametric representation, which we usually recommend to “robustify”. That means to add a non-informative mixture component which we put there to ensure that if things go wrong then we still get valid inferences. In fact, robustification is critical when we use this approach to extrapolate from adults to pediatrics. The reason to go parametric is that this makes it much easier to communicate that MAP prior. Moreover, we use conjugate representations such that the library performs operating characteristics with high-precision and high-speed (no more tables of type I error/power, but graphs!). So you see, RBesT does the job for you for the problem to forge a prior and then evaluate it before using it. This library is a huge help for our statisticians at Novartis to apply the robust MAP approach in clinical trials.

Here are the vignettes:

Getting started with RBesT (binary)

RBesT for a Normal Endpoint

Using RBesT to reproduce Schmidli et al. “Robust MAP Priors”

Customizing RBesT plots