Applying human factors research to statistical graphics

John Rauser writes:

I’ve been a reader of yours (books, papers and the blog) for a long time, and it occurred to me today that I might be able to give something back to you.

I recently wrote a talk (https://www.youtube.com/watch?v=fSgEeI2Xpdc) about human factors research applied to making statistical graphics. I mainly cover material from Cleveland, but also fold in ideas from Gestalt psychology. My main contribution is to apply these ideas to commonly seen real world graphics and draw out observations and recommendations.

I thought the talk might be decent fodder for your course on statistical graphics and communication, or perhaps for the blog. Either way, thanks so much for everything you’ve contributed to the community.

Trips to Cleveland, huh? I don’t have the patience to watch videos but I’ll post here and we can see what people think.

A stunned Dyson

Terry Martin writes:

I ran into this quote and thought you might enjoy it. It’s from p. 273 of Segre’s new biography of Fermi, The Pope of Physics:

When Dyson met with him in 1953, Fermi welcomed him politely, but he quickly put aside the graphs he was being shown indicating agreement between theory and experiment. His verdict, as Dyson remembered, was “There are two ways of doing calculations in theoretical physics. One way, and this is the way I prefer, is to have a clear physical picture of the process you are calculating. The other way is to have a precise and self-consistent mathematical formalism. You have neither.” When a stunned Dyson tried to counter by emphasizing the agreement between experiment and the calculations, Fermi asked him how many free parameters he had used to obtain the fit. Smiling after being told “Four,” Fermi remarked, “I remember my old friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” There was little to add.

My reply: I’d have loved to have met Fermi or Ulam. Something about von Neumann really irritates me, though. That elephant quote just seems like bragging! For one thing, I can have a model with a lot more than five parameters and still struggle to fit my data.

Stan Weekly Roundup, 21 July 2017

It was another productive week in Stan land. The big news is that

  • Jonathan Auerbach, Tim Jones, Susanna Makela, Swupnil Sahai, and Robin Winstanley won first place in a New York City competition for predicting elementary school enrollment. Jonathan told me, “I heard 192 entered, and there were 5 finalists….Of course, we used Stan (RStan specifically). … Thought it might be Stan news worthy.”

    I’d say that’s newsworthy. Jon also provided a link to the “challenge” page, a New York City government sponsored “call for innovations”: Enhancing School Zoning Efforts by Predicting Population Change.

    They took home a US$20K paycheck for their efforts!

    Stan’s seeing quite a lot of use these days among demographers and others looking to predict forward from time series data. Jonathan’s been very active using government data sets (see his StanCon 2017 presentation with Rob Trangucci, Twelve Cities: Does lowering speed limits save pedestrian lives?). Hopefully they’ll share their code—maybe they already have. I really like to see this combination of government data and careful statistical analysis.

In other news,

  • Andrew Gelman’s been driving a lot of rethinking of our interfaces because he’s revising his and Jennifer Hill’s regression book (the revision will be two books). Specifically, he’s thinking a lot about workflow and how we can manage model expansion by going from fixed to modeled parameters. Right now, the process is onerous in that you have to move data variables into the parameters block and keep updating their priors. Andrew wants to be able to do this from the outside, but Michael Betancourt and I both expressed a lot of skepticism in terms of it breaking a lot of our fundamental abstractions (like a Stan program defining the density that’s fit!). More to come on this hot topic. Any ideas on how to manage developing a model would be appreciated. This goes back to the very first thing Matt Hoffman, Michael Malecki and I worked on with Andrew when he hired us before we’d conceived Stan. You’d think we’d have better advice on this after all this time. I’ve seen people do everything from use the C++ preprocessor to write elaborate program generation code in R. (A minimal sketch of what this fixed-versus-modeled switch looks like in a Stan program appears after this list.)

  • Breck Baldwin has been working on governance and we’re converging on a workable model that we’ll share with everyone soon. The goal’s to make the governance clear and less of a smoke-filled room job by those of us who happen to go to lunch after the weekly meetings.

  • Jonah Gabry is taking on the ggplot2-ification of the new regression book and trying to roll everything into a combination of RStanArm and BayesPlot. No word yet if the rest of the tidyverse is to follow. Andrew said, “I’ll see what Jonah comes up with” or something to that effect.

  • Jonah has also been working on priors for multilevel regression and poststratification with Yajuan Si (former postdoc of Andrew’s, now at U. Wisconsin); the trick is doing something reasonable when you have lots of interactions.

  • Ben Goodrich has been working on the next release of RStanArm. It’s beginning to garner a lot of contributions. Remember that the point has been to convert a lot of common point estimation packages to Bayesian versions and supply them with familiar R interfaces. Ben’s particularly been working on survival analysis lately.

  • Our high school student intern (don’t know if I can mention names online—the rest of our developers are adults!) is working on applying the Cook-Gelman-Rubin metric to evaluating various models. We’re doing much more of this method and it needs a less verbose and more descriptive name!

  • Mitzi Morris submitted a pull request for the Stan repo to add line numbers to error messages involving compound declare-and-define statements.
  • A joint effort by Mitzi Morris, Dan Simpson, Imad Ali, and Miguel A Martinez Beneito has led to convergence of models and fits among BUGS, Stan, and INLA for intrinsic conditional autoregression (ICAR) models. Imad’s building the result in RStanArm and has figured out how to extend the loo (leave one out cross-validation) package to deal with spatial models. Look for it in a Stan package near you soon. Mitzi’s working on the case study, which has been updated in the example-models repo.

  • Charles Margossian knocked off a paper on the mixed ODE solver he’s been working on, with a longer paper promised that goes through all the code details. Not sure if that’s on arXiv or not. He’s also been training Bill Gillespie to code in C++, which is great news for the project since Charles has to contend with his first year of grad school next year (whereas Bill’s facing a pleasant retirement of tracking down memory leaks). Charles is also working on getting the algebraic and fixed state solvers integrated into Torsten before the fall.

  • Krzysztof Sakrejda has a new paper out motivating a custom density he wrote in Stan for tracking delays in disease incidence in a countrywide analysis for Thailand. I’m not sure where he put it.

  • Yuling Yao is revising the stacking paper (for a kind of improved model averaging). I believe this is going into the loo package (or is maybe already there). So much going on with Stan I can’t keep up!

  • Yuling also revised the simulated tempering paper, which is some custom code on top of Stan to fit models with limited multimodality. There was some discussion about realistic examples with limited multimodality and we hope to have a suite of them to go along with the paper.

  • Sebastian Weber, Bob Carpenter, Michael Betancourt, and Charles Margossian had a long and productive meeting to design the MPI interface. It’s very tricky trying to find an interface that’ll fit into the Stan language and let you ship off data once and reuse it. I think we’re almost there. The design discussion’s on Discourse.

  • Rob Trangucci is finishing the GP paper (after revising the manual chapter) with Dan Simpson, Aki Vehtari, and Michael Betancourt.
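
As promised above, here is a minimal sketch of the fixed-versus-modeled switch that makes the current workflow painful. Both toy programs are my own illustration (not any proposed Stan interface): the first treats sigma as fixed data, the second promotes it to a modeled parameter with a prior, which today means editing the Stan program by hand.

# Model 1: sigma supplied as data (fixed).
fixed_sigma_model <- "
data {
  int<lower=0> N;
  vector[N] y;
  real<lower=0> sigma;  // fixed: passed in as data
}
parameters {
  real mu;
}
model {
  y ~ normal(mu, sigma);
}
"

# Model 2: sigma moved to the parameters block and given a prior (modeled).
modeled_sigma_model <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;  // now modeled
}
model {
  sigma ~ normal(0, 2);
  y ~ normal(mu, sigma);
}
"
# e.g., rstan::stan(model_code = modeled_sigma_model, data = list(N = length(y), y = y))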

I’m sure I’m missing a lot of goings on, especially among people not at our weekly meetings. If you know of things that should be on this list, please let me know.

“Bayes factor”: where the term came from, and some references to why I generally hate it

Someone asked:

Do you know when this term was coined or by whom? Kass and Raftery’s use of the term as the title of their 1995 paper suggests that it was still novel then, but I have not noticed in the paper any information about where it started.

I replied:

According to Etz and Wagenmakers (2016), “The term ‘Bayes factor’ comes from Good, who attributes the introduction of the term to Turing, who simply called it the ‘factor.'”

They refer to Good (1988) and Fienberg (2006) for historical review.

I generally hate Bayes factors myself, for reasons discussed at a technical level in our Bayesian Data Analysis book (see chapter 7 of the third edition). Or, if you want to see my philosophical reasons, see the discussion around Figure 1 of this paper. Also this paper with Rubin from 1995.
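
To give one quick sense of the technical objection (my toy illustration; see BDA chapter 7 for the real discussion): for a point null against a diffuse alternative, the Bayes factor depends heavily on an essentially arbitrary prior scale.

# H0: theta = 0 versus H1: theta ~ normal(0, tau), with data summary ybar ~ normal(theta, se)
bf01 <- function(ybar, se, tau) {
  dnorm(ybar, 0, se) / dnorm(ybar, 0, sqrt(se^2 + tau^2))  # marginal likelihood ratio, H0 over H1
}
ybar <- 2; se <- 1                                 # data two standard errors away from zero
sapply(c(1, 10, 100, 1000), function(tau) bf01(ybar, se, tau))
# The "evidence" for the null grows without bound as the prior scale tau increases.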

How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?

Someone who wishes to remain anonymous writes in with a story:

Linking to a new paper by Jorge Luis García, James J. Heckman, and Anna L. Ziff, an economist Sue Dynarski makes this “joke” on facebook—or maybe it’s not a joke:

How does one adjust standard errors to account for the fact that N of papers on an experiment > N of participants in the experiment?

Clicking through, the paper uses data from the “Abecedarian” (ABC) childhood intervention program of the 1970s. Well, the related ABC & “CARE” experiments, pooled together. From Table 3 on page 7, the ABC experiment has 58 treatment and 56 control students, while CARE has 17 treatment and 23 control. If you type “abecedarian” into Google Scholar, sure enough, you get 9,160 results! OK, but maybe some of those just have citations or references to other papers on that project… If you restrict the search to papers with “abecedarian” in the title, you still get 180 papers. If you search for the word “abecedarian” on Google Scholar (not necessarily in the title) and restrict to papers by Jim Heckman, you get 86 results.

That’s not why I thought to email you though.

Go to pages 7-8 of this new paper where they explain why they merged the ABC and CARE studies:

CARE included an additional arm of treatment. Besides the services just described, those in the treatment group also received home visiting from birth to age 5. Home visiting consisted of biweekly visits focusing on parental problem-solving skills. There was, in addition, an experimental group that received only the home visiting component, but not center-based care.[fn 17] In light of previous analyses, we drop this last group from our analysis. The home visiting component had very weak estimated effects.[fn 18] These analyses justify merging the treatment groups of ABC and CARE, even though that of CARE received the additional home-visiting component.[fn 19] We henceforth analyze the samples so generated as coming from a single ABC/CARE program.

OK, they merged some interventions (garden of forking paths?) because they wanted more data. But, how do they know that home visits had weak effects? Let’s check their explanation in footnote 18:

18: Campbell et al. (2014) test and do not reject the hypothesis of no treatment effects for this additional component of CARE.

Yep. Jim Heckman and coauthors conclude that the effects are “very weak” because they ran some tests and couldn’t reject the null. If you go deep into the supplementary material of the cited paper, to tables S15(a) and S15(b), sure enough you find that these “did not reject the null” conclusions are drawn from interventions with 12-13 control and 11-14 treatment students (S15(a)) or 15-16 control and 18-20 treatment students (S15(b)). Those are pretty small sample sizes…

This jumped out at me and I thought you might be interested too.

My reply: This whole thing is unfortunate but it is consistent with the other writings of Heckman and his colleagues in this area: huge selection bias and zero acknowledgement of the problem. It makes me sad because Heckman’s fame came from models of selection bias, but he doesn’t see it when it’s right in front of his face. See here, for example.

The topic is difficult to write about for a few reasons.

First, Heckman is a renowned scholar and he is evidently careful about what he writes. We’re not talking about Brian Wansink or Satoshi Kanazawa here. Heckman works on important topics, his studies are not done on the cheap, and he’s eminently reasonable in his economic discussions. He’s just making a statistical error, over and over again. It’s a subtle error, though, that has taken us (the statistics profession) something like a decade to fully process. Making this mistake doesn’t make Heckman a bad guy, and that’s part of the problem: When you tell a quantitative researcher that they made a statistical error, you often get a defensive reaction, as if you accused them of being stupid, or cheating. But lots of smart, honest people have made this mistake. That’s one of the reasons we have formal statistical methods in the first place: people get lots of things wrong when relying on instinct. Probability and statistics are important, but they’re not quite natural to our ways of thinking.

Second, who wants to be the grinch who’s skeptical about early childhood intervention? Now, just to be clear, there’s lots of room to be skeptical about Heckman’s claims and still think that early childhood intervention is a good idea. For example, this paper by Garcia, Heckman, Leaf, and Prados reports a benefit/cost ratio of 7.3. So they could be overestimating their effect by a factor of 7 and still have a favorable ratio. The point is, if for whatever reason you support universal day care or whatever, you have a motivation not to worry too much about the details of a study that supports your position.

Again, I’m not saying that Heckman and his colleagues are doing this. I can only assume they’re reporting what, to them, are their best estimates. Unfortunately these methods are biased. But a lot of people with classical statistics and econometrics training don’t realize this: they think regression coefficients are unbiased estimates, but nobody ever told them that the biases can be huge when there is selection for statistical significance.
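
Here’s a quick simulation of how big that bias can be (my illustration, not from the original exchange): a true effect of 1 estimated with a standard error of 4, keeping only the “significant” results.

set.seed(2017)
true_effect <- 1
se <- 4                                        # a noisy study: se much larger than the true effect
estimates <- rnorm(1e5, true_effect, se)       # sampling distribution of the estimate
significant <- abs(estimates / se) > 1.96
mean(significant)                              # power is only around 6%
mean(estimates[significant & estimates > 0])   # significant positive estimates average roughly 9, not 1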

And, remember, selection for statistical significance is not just about the “file drawer” and it’s not just about “p-hacking.” It’s about researcher degrees of freedom and forking paths that researchers themselves don’t always realize until they try to replicate their own studies. I don’t think Heckman and his colleagues have dozens of unpublished papers hiding in their file drawers, and I don’t think they’re running their data through dozens of specifications until they find statistical significance. So it’s not the file drawer and it’s not p-hacking as is often understood. But these researchers do have nearly unlimited degrees of freedom in their data coding and analysis, they do interpret “non-significant” differences as null and “significant” differences at face value, they have forking paths all over the place, and their estimates of magnitudes of effects are biased in the positive direction. It’s kinda funny but also kinda sad, that there’s so much concern for rigor in the design of these studies and in the statistical estimators used in the analysis, but lots of messiness in between, lots of motivation on the part of the researchers to find success after success after success, and lots of motivation for scholarly journals and the news media to publicize the results uncritically. These motivations are not universal—there’s clearly a role in the ecosystem for critics within academia, the news media, and in the policy community—but I think there are enough incentives for success within Heckman’s world to keep him and his colleagues from seeing what’s going wrong.

Again, it’s not easy—it took the field of social psychology about a decade to get a handle on the problem, and some are still struggling. So I’m not slamming Heckman and his colleagues. I think they can and will do better. It’s just interesting, when considering the mistakes that accomplished people make, to ask, How did this happen?

P.S. This is an important topic. It’s not ovulation-and-voting or air rage or himmicanes or anything silly like that: We’re talking about education policy that could affect millions of kids! And I’m not saying I have all the answers, or anything close to that. No, it’s the opposite: data are relevant to these questions, and I’m not close to the data. What’s needed is an integration of theory with observational and experimental data, and it’s great that academic economists such as Heckman have put so much time into studying these problems. I see my role as statistician as a helper. For better or worse, statistics is a big part of the story, and when people are making statistical errors, we should be correcting them. But these corrections are not the end of the story; they’re just necessary adjustments to keep research on the right track.

Short course on Bayesian data analysis and Stan 23-25 Aug in NYC!

Jonah “ShinyStan” Gabry, Mike “Riemannian NUTS” Betancourt, and I will be giving a three-day short course next month in New York, following the model of our successful courses in 2015 and 2016.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.
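
If you want to verify the installation ahead of time, a generic smoke test along these lines (not part of the course materials) should compile and sample a toy model:

install.packages("rstan", dependencies = TRUE)               # skip if already installed
library(rstan)
example(stan_model, package = "rstan", run.dontrun = TRUE)   # compiles and fits a small example model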

Class structure and example topics for the three days:

Day 1: Foundations
Foundations of Bayesian inference
Foundations of Bayesian computation with Markov chain Monte Carlo
Intro to Stan with hands-on exercises
Real-life Stan
Bayesian workflow

Day 2: Linear and Generalized Linear Models
Foundations of Bayesian regression
Fitting GLMs in Stan (logistic regression, Poisson regression)
Diagnosing model misfit using graphical posterior predictive checks
Little data: How traditional statistical ideas remain relevant in a big data world
Generalizing from sample to population (surveys, Xbox example, etc)

Day 3: Hierarchical Models
Foundations of Bayesian hierarchical/multilevel models
Accurately fitting hierarchical models in Stan
Why we don’t (usually) have to worry about multiple comparisons
Hierarchical modeling and prior information

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Optimization
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.

Make Your Plans for Stans (-s + Con)

This post is by Mike

A friendly reminder that registration is open for StanCon 2018, which will take place over three days, from Wednesday January 10, 2018 to Friday January 12, 2018, at the beautiful Asilomar Conference Grounds in Pacific Grove, California.

Detailed information about registration and accommodation at Asilomar, including fees and instructions, can be found on the event website.  Early registration ends on Friday November 10, 2017 and no registrations will be accepted after Wednesday December 20, 2017.

We have an awesome set of invited speakers this year that is worth attendance alone,

  • Susan Holmes (Department of Statistics, Stanford University)
  • Sean Taylor and Ben Letham (Facebook Core Data Science)
  • Manuel Rivas (Department of Biomedical Data Science, Stanford University)
  • Talia Weiss (Department of Physics, Massachusetts Institute of Technology)
  • Sophia Rabe-Hesketh and Daniel Furr (Educational Statistics and Biostatistics, University of California, Berkeley)

Contributed talks will proceed as last year, with each submission consisting of self-contained knitr or Jupyter notebooks that will be made publicly available after the conference.  Last year’s contributed talks were awesome and we can’t wait to see what users will submit this year.  For details on how to submit see the submission website.  The final deadline for submissions is Saturday September 16, 2017 5:00:00 AM GMT.

This year we are going to try to support as many student scholarships  as we can — if you are a student who would love to come but may not have the funding then don’t hesitate to submit a short application!

Finally, we are still actively looking for sponsors!  If you are interested in supporting StanCon 2018, or know someone who might be, then please contact the organizing committee.

His concern is that the authors don’t control for the position of games within a season.

Chris Glynn wrote last year:

I read your blog post about middle brow literature and PPNAS the other day. Today, a friend forwarded me this article in The Atlantic that (in my opinion) is another example of what you’ve recently been talking about. The research in question is focused on Major League Baseball and the occurrence that a batter is hit by a pitch in retaliation for another player previously hit by a pitch in the same game. The research suggests that temperature is an important factor in predicting this retaliatory behavior. The original article by Larrick et al. in the journal Psychological Science is here.

My concern is that the authors don’t control for the position of games within a season. There are several reasons why the probability of retaliation may change as the season progresses, but a potentially important one is the changing relative importance of games as the season goes along. Games in the early part of the season (April, May) are important as teams try to build a winning record. Games late in the season are more important as teams compete for limited playoff spots. In these important games, retaliation is less likely because teams are more focused on winning than imposing baseball justice. The important games occur in relatively cool months. There exists a soft spot in the schedule during June, July, and August (hot months) where the games are less consequential. Perhaps what is driving the result is the schedule position (and relative importance) of the game. Regardless of the mechanism by which the schedule position impacts the probability of retaliation, the timing of a game within the season is correlated with temperature.

One quick analysis to get at the effect of temperature in games of similar importance would be to examine those that were played in the month of August. Some of those games will be played in dome stadiums which are climate controlled. Most games will be played in outdoor stadiums. I am curious to see if the temperature effect still exists after controlling for the relative importance of the game.

My reply: Psychological Science, published in 2011? That says it all. I’ll blog this, I guess during next year’s baseball season…

And here we are!

Animating a spinner using ggplot2 and ImageMagick

It’s Sunday, and I [Bob] am just sitting on the couch peacefully ggplotting to illustrate basic sample spaces using spinners (a trick I’m borrowing from Jim Albert’s book Curve Ball). There’s an underlying continuous outcome (i.e., where the spinner lands) and a quantization into a number of regions to produce a discrete outcome (e.g., “success” and “failure”). I’m quite pleased with myself for being able to use polar coordinates to create the spinner and arrow. ggplot works surprisingly well in polar coordinates once you figure them out; almost everything people have said about them online is confused and the doc itself assumes you’re a bit more of a ggplotter and geometer than me.

I’m so pleased with it that I show the plot to Mitzi. She replies, “Why don’t you animate it?” I don’t immediately say, “What a waste of time,” then get back to what I’m doing. Instead, I boast, “It’ll be done when you get back from your run.” Luckily for me, she goes for long runs—I just barely had the prototype working as she got home. And then I had to polish it and turn it into a blog post. So here it is, for your wonder and amazement.



Here’s the R magic.

library(ggplot2)
draw_curve <- function(angle) {
  df <- data.frame(outcome = c("success", "failure"), prob = c(0.3, 0.7)) 
  plot <-
    ggplot(data=df, aes(x=factor(1), y=prob, fill=outcome)) +
    geom_bar(stat="identity", position="fill") +
    coord_polar(theta="y", start = 0, direction = 1) +
    scale_y_continuous(breaks = c(0.12, 0.7), labels=c("success", "failure")) +
    geom_segment(aes(y= angle/360, yend= angle/360, x = -1, xend = 1.4), arrow=arrow(type="closed"), size=1) +
    theme(axis.title = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid = element_blank(),
          panel.border = element_blank()) +
    theme(legend.position = "none") +
    geom_point(aes(x=-1, y = 0), color="#666666", size=5)
  return(plot)
}
ds <- c()
pos <- 0
for (i in 1:66) {
  pos <- (pos + (67 - i)) %% 360
  ds[i] <- pos
}
ds <- c(rep(0, 10), ds)
ds <- c(ds, rep(ds[length(ds)], 10))

for (i in 1:length(ds)) {
  ggsave(filename = paste("frame", ifelse(i < 10, "0", ""), i, ".png", sep=""),
         plot = draw_curve(ds[i]), device="png", width=4.5, height=4.5)
}

I probably should've combined theme functions. Ben would've been able to define ds in a one-liner and then map ggsave. I hope it's at least clear what my code does (just decrements the number of degrees moved each frame by one---no physics involved).
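
For the curious, here's roughly what that compression might look like (a sketch of mine, assuming the same draw_curve() defined above):

ds <- c(rep(0, 10), cumsum(66:1) %% 360, rep(sum(66:1) %% 360, 10))  # same schedule as the loop above
invisible(mapply(function(angle, i)
  ggsave(filename = sprintf("frame%02d.png", i), plot = draw_curve(angle),
         device = "png", width = 4.5, height = 4.5),
  ds, seq_along(ds)))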

After producing the frames in alphabetical order (all that ifelse and paste mumbo-jumbo), I went to the output directory and ran the results through ImageMagick (which I'd previously installed on my now ancient Macbook Pro) from the terminal, using

> convert *.png -delay 3 -loop 0 spin.gif

That took a minute or two. Each of the pngs is about 100KB, but the final output is only 2.5MB or so. Maybe I should've gone with less delay (I don't even know what the units are!) and fewer rotations and maybe a more gradual final slowdown (maybe study the physics). How do the folks at Pixar ever get anything done?
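
(For what it's worth, ImageMagick's -delay is in ticks of 1/100 of a second by default. And if I remember the convert semantics right, settings like -delay apply to images read after them on the command line, so a variant along these lines may behave more predictably; treat it as a hedged suggestion rather than a tested recipe:

> convert -delay 3 -loop 0 frame*.png spin.gif

)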

P.S. I can no longer get the animation package to work in R, though it used to work in the past. It just wraps up those calls to ImageMagick.

P.P.S. That salmon and teal color scheme is the default!

“The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll

How to design future studies of systemic exercise intolerance disease (chronic fatigue syndrome)?

Someone named Ramsey writes on behalf of a self-managed support community of 100+ systemic exercise intolerance disease (SEID) patients. He read my recent article on the topic and had a question regarding the following excerpt:

For conditions like S.E.I.D., then, the better approach may be to gather data from people suffering “in the wild,” combining the careful methodology of a study like PACE with the lived experience of thousands of people. Though most may be less eloquent than Rehmeyer, each may have his or her own potential path to recovery.

Ramsey asks:

From your perspective, are there particular design features to such an approach that one should prioritize, in order to maximize its usefulness to others?

Here’s the challenge.

The current standard model of evaluating medical research is the randomized clinical trial with 100 or so patients. This sort of trial is both too large and too small (see also here): too large in that there is so much variation in the population of patients, and different treatments will work (or not work, or even be counterproductive) for different people; too small in that the variation in such studies makes it hard to find reliable, reproducible results.

I think we need to move in two directions at once. From one direction, N=1 experiments: careful scientific evaluations of treatment options adapted to individual people. From the other direction, full population studies, tracking what really is happening outside the lab. The challenge there, as Ramsey notes, is that a lot of uncontrolled information is and will be available.

I’m sorry to say that I don’t have any good advice right now on how future studies should proceed. Speaking generally, I think it’s important to measure exactly what’s being done by the doctor and patient at all times, I think you should think carefully about outcome measures, and I think it’s a good idea to try multiple treatments on individual patients (that is, to perform within-person comparisons, also called crossover trials in this context). And, when considering observational studies (that is, comparisons based on existing treatments), gather whatever pre-treatment information is predictive of individuals’ choice of treatment regimen to follow. For SEID in particular, it seems that the diversity of the condition is a key part of the story and so it would be good to find treatments that work for well-defined subgroups.
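
To make the within-person idea a bit more concrete, here is a minimal sketch of a crossover-style analysis using a multilevel model in rstanarm. Everything here (the simulated data, the variable names, the effect sizes) is hypothetical; a real SEID study would need far more care with measurement and design.

library(rstanarm)

# Hypothetical crossover data: each patient tries both treatments, in a random order.
set.seed(1)
n_patients <- 40
crossover_data <- data.frame(
  patient = rep(1:n_patients, each = 2),
  period = rep(1:2, n_patients),
  treatment = as.vector(replicate(n_patients, sample(c(0, 1)))))
patient_effect <- rnorm(n_patients, 0, 1)
crossover_data$outcome <- with(crossover_data,
  patient_effect[patient] + 0.5 * treatment + 0.2 * period + rnorm(2 * n_patients))

# Within-person comparison: an average treatment effect plus patient-level variation in it.
fit <- stan_glmer(outcome ~ treatment + period + (1 + treatment | patient),
                  data = crossover_data, family = gaussian(),
                  chains = 2, iter = 1000)
print(fit)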

I hope others can participate in this discussion.

Should we continue not to trust the Turk? Another reminder of the importance of measurement

From 2013: Don’t trust the Turk

From 2017 (link from Kevin Lewis), from Jesse Chandler and Gabriele Paolacci:

The Internet has enabled recruitment of large samples with specific characteristics. However, when researchers rely on participant self-report to determine eligibility, data quality depends on participant honesty. Across four studies on Amazon Mechanical Turk, we show that a substantial number of participants misrepresent theoretically relevant characteristics . . .

For some purposes you can learn a lot from these online samples, but it depends on context. Measurement is important, and it is underrated in statistics.

The trouble is if you’re cruising along treating “p less than .05” as your criterion of success, then quality of measurement barely matters at all! Gather your data, grab your stars, get published, give your Ted talk, and sell your purported expertise to the world. Statistics textbooks have lots about how to analyze your data, a little bit on random sampling and randomized experimentation, and next to nothing on gathering data with high reliability and validity.

They want help designing a crowdsourcing data analysis project

Michael Feldman writes:

My collaborators and I are doing research where we try to understand the reasons for the variability in data analysis (“the garden of forking paths”). Our goal is to understand the reasons why scientists make different decisions regarding their analyses and in doing so reach different results.

In a project called “Crowdsourcing data analysis: Gender, status, and science”, we have recruited a large group of independent analysts to test the same hypotheses on the same dataset using a platform we developed.

The platform is essentially RStudio running online with a few additions:

· We record all executed commands even if they are not in the final code

· We ask analysts to explain these commands by creating semantic blocks explaining the rationale and alternatives

· We allow analysts to create graphical workflow of their work using these blocks and by restructuring them

You can find the more complete experiment description here. Also a short video tutorial of the platform.

Of course this experiment is not covering all considerations that might lead to variability (e.g. R users might differ from Python users), but we believe it is a step towards better understanding how defensible, yet subjective analytic choices may shape research results. The experiment is still running but we are likely to receive about 40-60 submissions of code, logs, comments, and explanations of decisions made. We are also collecting various information about analysts like their background, methods they usually use and the way they operationalized the hypotheses.

Our current plan is to analyze the data from this crowdsourced project using inductive coding by splitting participants into groups that reached similar results (effect size and direction). We then plan to identify factors that can explain various decisions as well as explain the similarities between participants.

We would love to receive any feedback and suggestions from readers of your blog regarding our planned approach to account for variability in results across different analysts.

If anyone has suggestions, feel free to respond in the comments.

Graphs as comparisons: A case study

Above is a pair of graphs from a 2015 paper by Alison Gopnik, Thomas Griffiths, and Christopher Lucas. It takes up half a page in the journal, Current Directions in Psychological Science. I think we can do better.

First, what’s wrong with the above graphs?

We could start with the details: As a reader, I have to go back and forth between the legend and the bars to keep the color scheme fresh in my mind. The negative space in the middle of each plot looks a bit like a white bar, which is not directly confusing but makes the display that much harder to read. And I had to do a double-take to interpret the infinitesimal blue bar of zero height on the left plot. Also, it’s not clear how to interpret the y-axes: 25 participants out of how many? And whassup with the y-axis on the second graph: the 15 got written as a 1, which makes me suspect that the graph was re-drawn from an original, which then leads to concern that other mistakes may have been introduced in the re-drawing process.

But that’s not the direction I want to go. There are problems with the visual display, but going and fixing them one by one would not resolve the larger problem. To get there, we need to think about goals.

A graph is a set of comparisons, and the two goals of a graph are:

1. To understand and communicate the size and directions of comparisons that you were already interested in, before you made the graph; and
2. To facilitate discovery of new patterns in data that go beyond what you expected to see.

Both these goals are important. It’s important to understand and communicate what we think we know, and it’s also important to put ourselves in a position where we can learn more.

The question, when making any graph, is: what comparisons does it make easy to see? After all, if you just wanted the damn numbers you could put them in a table.

Now let’s turn to the graph above. It makes it easy to compare the heights of two lines right next to each other—for example, I see that the dark blue lines are all higher than the light blue lines, except for the pair on the left . . . hmmmm, which are light blue and which are dark blue, again? I can’t really do much with the absolute levels of the lines because I don’t know what “25” means.

Look. I’m not trying to rag on the authors here. This sort of Excel-style bar graph is standard in so many presentations. I just think they could do better.

So, how to do better? Let’s start with the goals.

1. What are the key comparisons that the authors want to emphasize? From the caption, it seems that the most important comparisons are between children and adults. We want a graph that shows the following patterns:
(i) In the Combination scenario, children tended to choose multiple objects (the correct response, it seems) and adults tended to choose single objects (the wrong response).
(ii) In the Individual scenario, both children and adults tended to choose a single object (the correct response).
Actually, I think I’d re-order these, and first look at the Individual scenario which seems to be some sort of baseline, and then go to Combination which is displaying something new.

2. What might we want to learn from a graph of these data, beyond the comparisons listed just above? This one’s not clear so I’ll guess:
Who were those kids and adults who got the wrong answer in the Combination scenario? Did they have other problems? What about the adults who got the wrong answer in the Individual scenario, which was presumably easier? Did they also get the answer wrong in the other case? There must be some other things to learn from these data too—it’s hard to get people to participate in a psychology experiment, and once you have them there, you’ll want to give them as many tasks as you can. But from this figure alone, I’m not sure what these other questions would be.

OK, now time to make the new graph. Given that I don’t have the raw data, and I’m just trying to redo the figure above, I’ll focus on task 1: displaying the key comparisons clearly.

Hey—I just realized something! The two outcomes in this study are “Single object” and “Multiple object”—that’s all there is! And, looking carefully, I see that the numbers in each graph add up to a constant: it’s 25 children and, ummm, let me read this carefully . . . 28 adults!

This simplifies our task considerably, as now we have only 4 numbers to display instead of 8.

We can easily display four numbers with a line plot. The outcome is % who give the “Single object” response, and the two predictors are Child/Adult and Individual Principle / Combination Principle.

One of these predictors will go on the x-axis, one will correspond to the two lines, and the outcome will go on the y-axis.

In this case, which of our two predictors goes on the x-axis?

Sometimes the choice is easy: if one predictor is binary (or discrete with only a few categories) and the other is continuous, it’s simplest to put the continuous predictor as x, and use the discrete predictor to label the lines. In this case, though, both predictors are binary, so what to do?

I think we should use logical or time order, as that’s easy to follow. There are two options:
(1) Time order in age, thus Children, then Adults; or
(2) Logical order in the experiment, thus Individual Principle, then Combination Principle, as Individual is in a sense the control case and Combination is the new condition.

I tried it both ways and I think option 2 was clearer. So I’ll show you this graph and the corresponding R code. Then I’ll show you option 1 so you can compare.

Here’s the graph:

I think this is better than the bar graphs from the original article, for two reasons. First, we can see everything in one place: Like the title sez, “Children did better than adults,\nespecially in the combination condition.” Second, we can directly make both sorts of comparisons: we can compare children to adults, and we can also make the secondary comparison of seeing that both groups performed worse under the combination condition than the individual condition.

Here’s the data file I made, gopnik.txt:

Adult Combination N_single N_multiple
0 0 25 0
0 1 4 21
1 0 23 5
1 1 18 10

And here’s the R code:

setwd("~/AndrewFiles/research/graphics")
gopnik <- read.table("gopnik.txt", header=TRUE)
N <- gopnik$N_single + gopnik$N_multiple
p_multiple <- gopnik$N_multiple / N
p_correct <- ifelse(gopnik$Combination==0, 1 - p_multiple, p_multiple)
colors <- c("red", "blue")
combination_labels <- c("Individual\ncondition", "Combination\ncondition")
adult_labels <- c("Children", "Adults")

pdf("gopnik_2.pdf", height=4, width=5)
par(mar=c(3,3,3,2), mgp=c(1.7, .5, 0), tck=-.01, bg="gray90")
plot(c(0,1), c(0,1), yaxs="i", xlab="", ylab="Percent who gave correct answer", xaxt="n", yaxt="n", type="n", main="Children did better than adults,\nespecially in the combination condition", cex.main=.9)
axis(1, c(0, 1), combination_labels, mgp=c(1.5,1.5,0))
axis(2, c(0,.5,1), c("0", "50%", "100%"))
for (i in 1:2){
  ok <- gopnik$Adult==(i-1)
  x <- gopnik$Combination[ok]
  y <- p_correct[ok]
  se <- sqrt((N[ok]*y + 2)*(N[ok]*(1-y) + 2)/(N[ok] + 4)^3)  # Agresti-Coull-style se based on (y+2)/(n+4)
  lines(x, y, col=colors[i])
  points(x, y, col=colors[i], pch=20)
  for (j in 1:2){
    lines(rep(x[j], 2), y[j] + se[j]*c(-1,1), lwd=2, col=colors[i])
    lines(rep(x[j], 2), y[j] + se[j]*c(-2,2), lwd=.5, col=colors[i])
  }
  text(mean(x), mean(y) - .05, adult_labels[i], col=colors[i], cex=.9)
}
dev.off()

Yeah, yeah, I know the code is ugly. I'm pretty sure it could be done much more easily in ggplot2.
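
For anyone who prefers that route, here is a rough ggplot2 version of the same figure (a sketch only; it drops the standard-error bars and the in-plot labels):

library(ggplot2)
gopnik <- read.table("gopnik.txt", header = TRUE)
gopnik$N <- gopnik$N_single + gopnik$N_multiple
gopnik$p_correct <- with(gopnik, ifelse(Combination == 0, N_single / N, N_multiple / N))
gopnik$group <- factor(gopnik$Adult, labels = c("Children", "Adults"))
gopnik$condition <- factor(gopnik$Combination,
                           labels = c("Individual\ncondition", "Combination\ncondition"))
ggplot(gopnik, aes(x = condition, y = p_correct, group = group, color = group)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
  labs(x = NULL, y = "Percent who gave correct answer",
       title = "Children did better than adults,\nespecially in the combination condition") +
  theme(legend.title = element_blank())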

Also, just for fun I threw in +/- 1 and 2 standard error bars, using the Agresti-Coull formula based on (y+2)/(n+4) for binomial standard errors. Cos why not. The one thing this graph doesn't show is whether the adults who got it wrong on the individual condition were more likely to get it wrong in the combination condition, but that information wasn't in the original graph either.
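
For reference, that standard error can be written as a small helper (a sketch of the same formula used in the code above, with y the number of correct responses):

ac_se <- function(y, n) {            # y = number of "successes", n = sample size
  p_hat <- (y + 2) / (n + 4)         # Agresti-Coull-style point estimate
  sqrt(p_hat * (1 - p_hat) / (n + 4))
}
ac_se(c(25, 21, 23, 10), c(25, 25, 28, 28))   # the four cells in gopnik.txt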

On the whole, I'm satisfied that the replacement graph contains all the information in less space and is much clearer than the original.

Again, this is not a slam on the authors of the paper. They're not working within a tradition in which graphical display is important. I'm going through this example in order to provide a template for future researchers when summarizing their data.

And, just for comparison, here's the display going the other way:

(This looks so much like the earlier plot that it seems at first that we did something wrong. But, no, it just happened that way because we're only plotting four numbers, and it just happened that the two numbers whose positions changed had very similar values of 0.84 and 0.82.)

And here's the R code for this second graph:

pdf("gopnik_1.pdf", height=4, width=5)
par(mar=c(3,3,3,2), mgp=c(1.7, .5, 0), tck=-.01, bg="gray90")
plot(c(0,1), c(0,1), yaxs="i", xlab="", ylab="Percent who gave correct answer", xaxt="n", yaxt="n", type="n", main="Children did better than adults,\nespecially in the combination condition", cex.main=.9)
axis(1, c(0, 1), adult_labels)
axis(2, c(0,.5,1), c("0", "50%", "100%"))
for (i in 1:2){
  ok <- gopnik$Combination==(i-1)
  x <- gopnik$Adult[ok]
  y <- p_correct[ok]
  se <- sqrt((N[ok]*y + 2)*(N[ok]*(1-y) + 2)/(N[ok] + 4)^3)  # Agresti-Coull-style se based on (y+2)/(n+4)
  lines(x, y, col=colors[i])
  points(x, y, col=colors[i], pch=20)
  for (j in 1:2){
    lines(rep(x[j], 2), y[j] + se[j]*c(-1,1), lwd=2, col=colors[i])
    lines(rep(x[j], 2), y[j] + se[j]*c(-2,2), lwd=.5, col=colors[i])
  }
  text(mean(x), mean(y) - .1, combination_labels[i], col=colors[i], cex=.9)
}
dev.off()

Hey—here are some tools in R and Stan for designing more effective clinical trials! How cool is that?

In statistical work, design and data analysis are often considered separately. Sometimes we do all sorts of modeling and planning in the design stage, only to analyze data using simple comparisons. Other times, we design our studies casually, even thoughtlessly, and then try to salvage what we can using elaborate data analyses.

It would be better to integrate design and analysis. My colleague Sebastian Weber works at Novartis (full disclosure: they have supported my research too), where they want to take some of the sophisticated multilevel modeling ideas that have been used in data analysis to combine information from different experiments, and apply these to the design of new trials.

Sebastian and his colleagues put together an R package wrapping some Stan functions so they can directly fit the hierarchical models they want to fit, using the prior information they have available, and evaluating their assumptions as they go.

Sebastian writes:

Novartis was so kind to grant permission to publish the RBesT (R Bayesian evidence synthesis Tools) R library on CRAN. It’s landed there two days ago. We [Sebastian Weber, Beat Neuenschwander, Heinz Schmidli, Baldur Magnusson, Yue Li, and Satrajit Roychoudhury] have invested a lot of effort into documenting (and testing) that thing properly. So if you follow our vignettes you get an in-depth tutorial into what, how and why we have crafted the library. The main goal is to reduce the sample size in our clinical trials. As such the library performs a meta-analytic-predictive (MAP) analysis using MCMC. Then that MAP prior is turned into a parametric representation, which we usually recommend to “robustify”. That means to add a non-informative mixture component which we put there to ensure that if things go wrong then we still get valid inferences. In fact, robustification is critical when we use this approach to extrapolate from adults to pediatrics. The reason to go parametric is that this makes it much easier to communicate that MAP prior. Moreover, we use conjugate representations such that the library performs operating characteristics with high-precision and high-speed (no more tables of type I error/power, but graphs!). So you see, RBesT does the job for you for the problem to forge a prior and then evaluate it before using it. This library is a huge help for our statisticians at Novartis to apply the robust MAP approach in clinical trials.
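
For readers who want to see what that workflow looks like in code, here is a rough sketch loosely following the RBesT vignettes. The function names (gMAP, automixfit, robustify) are from the package, but the data and the specific arguments are illustrative assumptions on my part, not a verified recipe; see the vignettes below for the real thing.

library(RBesT)

# Hypothetical historical control data (counts are made up):
hist_data <- data.frame(study = paste0("trial", 1:4),
                        n = c(50, 60, 45, 70),
                        r = c(10, 14, 9, 20))

# 1. Meta-analytic-predictive (MAP) analysis via MCMC:
map_mcmc <- gMAP(cbind(r, n - r) ~ 1 | study, data = hist_data, family = binomial)

# 2. Parametric mixture approximation of the MAP prior:
map_prior <- automixfit(map_mcmc)

# 3. "Robustify" by adding a weakly informative mixture component:
map_robust <- robustify(map_prior, weight = 0.2, mean = 0.5)

# 4. The robust MAP prior then feeds into the design evaluation (operating
#    characteristics) and the analysis of the new trial.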

Here are the vignettes:

Getting started with RBesT (binary)

RBesT for a Normal Endpoint

Using RBesT to reproduce Schmidli et al. “Robust MAP Priors”

Customizing RBesT plots

What is “overfitting,” exactly?

This came from Bob Carpenter on the Stan mailing list:

It’s not overfitting so much as model misspecification.

I really like this line. If your model is correct, “overfitting” is impossible. In its usual form, “overfitting” comes from using too weak of a prior distribution.

One might say that “weakness” of a prior distribution is not precisely defined. Then again, neither is “overfitting.” They’re the same thing.

P.S. In response to some discussion in comments: One way to define overfitting is when you have a complicated statistical procedure that gives worse predictions, on average, than a simpler procedure.

Or, since we’re all Bayesians here, we can rephrase: Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.

I’m assuming full Bayes here, not posterior modes or whatever.

Anyway, yes, overfitting can happen. And it happens when the larger model has too weak a prior. After all, the smaller model can be viewed as a version of the larger model, just with a very strong prior that restricts some parameters to be exactly zero.
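
Here's a quick simulation sketch of that point (my toy example, not from the mailing list thread): the same degree-8 polynomial regression fit with an effectively flat prior (ordinary least squares) and with a strong ridge-style prior shrinking the coefficients toward zero. On repeated simulations the flat-prior fit typically predicts worse out of sample, which is the "overfitting" in question.

set.seed(123)
n <- 30
x <- runif(n, -1, 1)
y <- 1 + 2 * x + rnorm(n, 0, 0.5)            # the truth is linear
x_new <- runif(1000, -1, 1)
y_new <- 1 + 2 * x_new + rnorm(1000, 0, 0.5)
X <- cbind(1, poly(x, 8, raw = TRUE))        # intercept + 8 polynomial terms
X_new <- cbind(1, poly(x_new, 8, raw = TRUE))

beta_flat <- solve(crossprod(X), crossprod(X, y))   # flat prior: least squares
lambda <- 1                                         # ridge penalty = strong normal prior on the non-intercept terms
penalty <- diag(c(0, rep(lambda, 8)))
beta_ridge <- solve(crossprod(X) + penalty, crossprod(X, y))

c(rmse_flat  = sqrt(mean((y_new - X_new %*% beta_flat)^2)),
  rmse_ridge = sqrt(mean((y_new - X_new %*% beta_ridge)^2)))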

Stan Weekly Roundup, 14 July 2017

Another week, another bunch of Stan updates.

  • Kevin Van Horn and Elea McDonnell Feit put together a tutorial on Stan [GitHub link] that covers linear regression, multinomial logistic regression, and hierarchical multinomial logistic regression.
  • Andrew has been working on writing up our “workflow”. That includes Chapter 1, Verse 1 of Bayesian Data Analysis: (1) specifying a joint density for all observed and unobserved quantities, (2) performing inference, and (3) model criticism and posterior predictive checks. It also includes fake data generation and program checking (the Cook-Gelman-Rubin procedure; eliminating divergent transitions in HMC) and comparison to other inference systems as a sanity check. It further includes the process of building up the model incrementally starting from a simple model. We’re trying to write this up in our case studies and make it the focus of the upcoming Stan book. (A minimal sketch of the Cook-Gelman-Rubin check appears after this list.)
  • Ben Goodrich has been working on RStanArm with lots of new estimators, specifically following from nlmer, for GLMs with unusual inverse functions. This led to some careful evaluation, uncovering some multimodal behavior.
  • Breck Baldwin has been pushing through governance discussions so we can start thinking about how to make decisions about the Stan project when not everyone agrees. I think we’re going to go more with a champion model than a veto model; stay tuned.
  • Mitzi Morris has been getting a high-school intern up to speed for doing some model comparisons and testing.
  • Mitzi Morris has implemented the Besag-York-Mollie likelihood with improved priors provided by Dan Simpson. You can check out the ongoing branch in the stan-dev/example-models repo.
  • Aki Vehtari has been working on improving Pareto smoothed importance sampling and refining effective sample size estimators.
  • Imad Ali has prototypes of the intrinsic conditional autoregressive models for RStanArm.
  • Charles Margossian is working on gradients of steady-state ODE solvers for Torsten and a mixed solver for forcing functions in ODEs; papers are in the works, including a paper selected to be highlighted at ACoP.
  • Jonah Gabry is working on a visualization paper with Andrew for submission and is gearing up for the Stan course later this summer. Debugging R packages.
  • Sebastian Weber has been working on the low-level architecture for MPI including a prototype linked from the Wiki. The holdup is in shipping out the data to the workers. Anyone know MPI and want to get involved?
  • Jon Zelner and Andrew Gelman have been looking at adding hierarchical structure to discrete-parameter models for phylogeny. These models are horribly intractable, so they’re trying to figure out what to do when you can’t marginalize and can’t sample (you can write these models in PyMC3 or BUGS, but you can’t explore the posterior). And when you can do some kind of pre-pruning (as is popular in natural language processing and speech recognition pipelines).
  • Matthew Kay has a GitHub package TidyBayes that aims to integrate data and sampler data munging in a TidyVerse style (wrapping the output of samplers like JAGS and Stan).
  • Quentin F. Gronau has a Bridgesampling package on CRAN, the short description of which is “Provides functions for estimating marginal likelihoods, Bayes factors, posterior model probabilities, and normalizing constants in general, via different versions of bridge sampling (Meng & Wong, 1996)”. I heard about it when Ben Goodrich recommended it on the Stan forum.
  • Juho Piironen and Aki Vehtari arXived their paper, Sparsity information and regularization in the horseshoe and other shrinkage priors. Stan code included, naturally.
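
As promised above, here is a minimal sketch of the Cook-Gelman-Rubin-style check, using a conjugate normal-normal model so it runs without Stan. This is my own toy illustration of the idea (draw the "truth" from the prior, simulate data, fit, and check that the posterior quantiles of the truth look uniform), not code from the Stan project.

set.seed(1)
n_rep <- 1000; n <- 20
prior_mean <- 0; prior_sd <- 1; sigma <- 2
q <- replicate(n_rep, {
  theta <- rnorm(1, prior_mean, prior_sd)            # draw the "true" parameter from the prior
  y <- rnorm(n, theta, sigma)                        # simulate fake data given that truth
  post_var <- 1 / (1 / prior_sd^2 + n / sigma^2)     # conjugate normal posterior
  post_mean <- post_var * (prior_mean / prior_sd^2 + sum(y) / sigma^2)
  pnorm(theta, post_mean, sqrt(post_var))            # posterior quantile of the truth
})
hist(q)   # should look uniform if the inference is calibrated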

Slaying Song

I came across this article by Joseph Bernstein, “Why Is A Top Harvard Law Professor Sharing Anti-Trump Conspiracy Theories?”:

On April 22, Tribe shared a story from a website called the Palmer Report — a site that has been criticized for spreading hyperbole and false claims — entitled “Report: Trump gave $10 million in Russian money to Jason Chaffetz when he leaked FBI letter,” a reference to the notorious pre-election letter sent by former FBI director James Comey to members of Congress that many have blamed for Hillary Clinton’s November loss.

The “report” the article points to is a since-deleted tweet by a Twitter user named LM Garner, who describes herself in her Twitter biography as “Just a VERY angry citizen on Twitter. Opinions are my own. Sometimes prone to crazy assertions. Not a fan of this nepotistic kleptocracy.” Garner, who has 257 followers, has tweeted more than 25,000 times from her protected account.

“I don’t know whether this is true,” Tribe’s tweet reads, “But key details have been corroborated and none, to my knowledge, have been refuted. If true, it’s huge.”

Reached by email, Tribe said that he was aware of the Palmer Report’s “generally liberal slant” and “that some people regard a number of its stories as unreliable.” Still, he added, “When I share any story on Twitter, typically with accompanying content of my own that says something like ‘If X is true, then Y,’ I do so because a particular story seems to be potentially interesting, not with the implication that I’ve independently checked its accuracy or that I vouch for everything it asserts.”

OK, then. But the “Palmer Report” thing rang a bell—didn’t someone send me something from there once? I did a quick search and found this Slate article, “Stop Saying the Election Was Rigged,” regarding “the rampant sharing of two postelection articles from Bill Palmer.”

Kinda sad to see a high-paid law professor fall for this sort of thing.

Still, though, whenever I see the name Laurence Tribe I will think of this letter. Bluntly put, indeed. If you’ll forgive my reference to bowling.

Classical statisticians as Unitarians

[cat picture]

Christian Robert, Judith Rousseau, and I wrote:

Several of the examples in [the book under review] represent solutions to problems that seem to us to be artificial or conventional tasks with no clear analogy to applied work.

“They are artificial and are expressed in terms of a survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address (. . . ) The question of interest is whether there has been a change in support between the surveys (…). We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change.”

Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population.

A defender of Aitkin (and of classical hypothesis testing) might respond at this point that, yes, everybody knows that changes are never exactly zero and that we should take a more “grown-up” view of the null hypothesis, not that the change is zero but that it is nearly zero. Unfortunately, the metaphorical interpretation of hypothesis tests has problems similar to the theological doctrines of the Unitarian church. [emphasis added] Once you have abandoned literal belief in the Bible, the question soon arises: why follow it at all? Similarly, once one recognizes the inappropriateness of the point null hypothesis, we think it makes more sense not to try to rehabilitate it or treat it as treasured metaphor but rather to attack our statistical problems directly, in this case by performing inference on the change in opinion in the population. . . .

All this is application-specific. Suppose public opinion was observed to really be flat, punctuated by occasional changes, as in the left graph in Figure 3. In that case, Aitkin’s question of “whether there has been a change” would be well-defined and appropriate, in that we could interpret the null hypothesis of no change as some minimal level of baseline variation.

Real public opinion, however, does not look like baseline noise plus jumps, but rather shows continuous movement on many time scales at once, as can be seen from the right graph in Figure 3, which shows actual presidential approval data. In this example, we do not see Aitkin’s question as at all reasonable. Any attempt to work with a null hypothesis of opinion stability will be inherently arbitrary. It would make much more sense to model opinion as a continuously-varying process.

The statistical problem here is not merely that the null hypothesis of zero change is nonsensical; it is that the null is in no sense a reasonable approximation to any interesting model. The sociological problem is that, from Savage (1954) onward, many Bayesians have felt the need to mimic the classical null-hypothesis testing framework, even where it makes no sense.

This quote came up in blog comments a few years ago; I love it so much I wanted to share it again.
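
To make the “attack our statistical problems directly” recommendation concrete for the survey example in the quote, here is a tiny sketch with hypothetical numbers (and treating the two waves as independent for simplicity; a paired analysis would model the within-person transitions):

before_yes <- 55; after_yes <- 60; n <- 100    # hypothetical counts out of 100 respondents
p1 <- before_yes / n; p2 <- after_yes / n
change <- p2 - p1
se_change <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
c(estimate = change, lower = change - 2 * se_change, upper = change + 2 * se_change)
# Inference on the size of the change, rather than a test of "no change at all."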

P.S. I also like this one, from that same review:

In a nearly century-long tradition in statistics, any probability model is sharply divided into “likelihood” (which is considered to be objective and, in textbook presentations, is often simply given as part of the mathematical specification of the problem) and “prior” (a dangerously subjective entity to which the statistical researcher is encouraged to pour all of his or her pent-up skepticism). This may be a tradition but it has no logical basis. If writers such as Aitkin wish to consider their likelihoods as objective and consider their priors as subjective, that is their privilege. But we would prefer them to restrain themselves when characterizing the models of others. It would be polite to either tentatively accept the objectivity of others’ models or, contrariwise, to gallantly affirm the subjectivity of one’s own choices.

3 things that will surprise you about model validation and calibration for state space models

Gurjinder Mohan writes:

I was wondering if you had any advice specific to state space models when attempting model validation and calibration. I was planning on conducting a graphical posterior predictive check.

I’d also recommend fake-data simulation. Beyond that, I’d need to know more about the example.
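
To give a flavor of what I mean by fake-data simulation for a state-space model, here's a minimal local-level example (a generic sketch, not tied to Gurjinder's application): simulate states and observations from known parameters, fit the same model, and see whether the known values come back.

set.seed(42)
T_len <- 200
sigma_state <- 0.3   # innovation sd (the known "truth" for the simulation)
sigma_obs <- 1.0     # measurement sd
state <- cumsum(rnorm(T_len, 0, sigma_state))   # local-level (random walk) state
y <- ts(state + rnorm(T_len, 0, sigma_obs))

fit <- StructTS(y, type = "level")   # maximum-likelihood fit of the same local-level model
fit$coef                             # estimated variances; compare to sigma_state^2 and sigma_obs^2
plot(y, col = "gray")
lines(state, lwd = 2)                           # true state
lines(as.numeric(tsSmooth(fit)), col = "red")   # smoothed state estimate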

I’m posting here because this seems like a topic that some commenters could help on (and could supply the 3 things promised by the above title).