
Measurement is part of design

The other day, in the context of a discussion of an article from 1972, I remarked that the great statistician William Cochran, when writing on observational studies, wrote almost nothing about causality, nor did he mention selection or meta-analysis.

It was interesting that these topics, which are central to any modern discussion of observational studies, were not considered important by a leader in the field, and this suggests that our thinking has changed since 1972.

Today I’d like to make a similar argument, this time regarding the topic of measurement. This time I’ll consider Donald Rubin’s 2008 article, “For objective causal inference, design trumps analysis.”

All of Rubin’s article is worth reading—it’s all about the ways in which we can structure the design of observational studies to make inferences more believable—and the general point is important and, I think, underrated.

When people do experiments, they think about design, but when they do observational studies, they think about identification strategies, which is related to design but is different in that it’s all about finding and analyzing data and checking assumptions, not so much about systematic data collection. So Rubin makes valuable points in his article.

But today I want to focus on something that Rubin doesn’t really mention in his article: measurement, which is a topic we’ve been talking a lot about here lately.

Rubin talks about randomization, or the approximate equivalent in observational studies (the “assignment mechanism”), and about sample size (“traditional power calculations,” as his article was written before Type S and Type M errors were well known), and about the information available to the decision makers, and about balance between treatment and control groups.

Rubin does briefly mention the importance of measurement, but only in the context of being able to match or adjust for pre-treatment differences between treatment and control groups.

That’s fine, but here I’m concerned with something even more basic: the validity and reliability of the measurements of outcomes and treatments (or, more generally, comparison groups). I’m assuming Rubin was taking validity for granted—assuming that the x and y variables being measured were the treatment and outcome of interest—and, in a sense, the reliability question is included in the question about sample size. In practice, though, studies are often using sloppy measurements (days of peak fertility, fat arms, beauty, etc.), and if the measurements are bad enough, the problems go beyond sample size, partly because in such studies the sample size would have to creep into the zillions for anything to be detectable, and partly because the biases in measurement can easily be larger than the effects being studied.

So, I’d just like to take Rubin’s excellent article and append a brief discussion of the importance of measurement.

P.S. I sent the above to Rubin, who replied:

In that article I was focusing on the design of observational studies, which I thought had been badly neglected by everyone in past years, including Cochran and me. Issues of good measurement, I think I did mention briefly (I’ll have to check—I do in my lectures, but maybe I skipped that point here), but having good measurements had been discussed by Cochran in his 1965 JRSS paper, so were an already emphasized point.

And I wanted to focus on the neglected point, not all relevant points for observational studies.

Stan is Turing complete

New papers on LOO/WAIC and Stan


Aki, Jonah, and I have released the much-discussed paper on LOO and WAIC in Stan: Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models.
We (that is, Aki) now recommend LOO rather than WAIC, especially now that we have an R function to quickly compute LOO using Pareto smoothed importance sampling. In either case, a key contribution of our paper is to show how LOO/WAIC can be computed in the regular workflow of model fitting.

We also compute the standard error of the difference between LOO (or WAIC) when comparing two models, and we demonstrate with the famous arsenic well-switching example.
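As a rough sketch of the standard-error calculation: if each model gives a vector of pointwise expected log predictive densities (one entry per data point), then the difference between the two models is the sum of the pointwise differences, and its standard error can be approximated by sqrt(n) times the standard deviation of those differences. The vectors below are simulated stand-ins, not output from any fitted model, and the helper name `compare_elpd` is hypothetical.

```r
# Hypothetical helper: compare two models from their pointwise elpd contributions.
compare_elpd <- function(elpd_a, elpd_b) {
  stopifnot(length(elpd_a) == length(elpd_b))
  d <- elpd_a - elpd_b              # pointwise differences (paired by data point)
  n <- length(d)
  c(elpd_diff = sum(d),             # estimated difference in total elpd
    se_diff   = sqrt(n * var(d)))   # standard error of that difference
}

# Simulated stand-ins for the pointwise values (assumption: not real model output)
set.seed(1)
n <- 500
elpd_a <- rnorm(n, mean = -1, sd = 0.5)
elpd_b <- elpd_a + rnorm(n, mean = -0.05, sd = 0.2)

result <- compare_elpd(elpd_a, elpd_b)
print(result)
```

The point of working with paired differences is that both models are evaluated on the same data points, so the differences are typically much less variable than the two totals considered separately.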

Also 2 new tutorial articles on Stan will be appearing:
in JEBS: Stan: A probabilistic programming language for Bayesian inference and optimization.
in JSS: Stan: A probabilistic programming language
The two articles have very similar titles but surprisingly little overlap. I guess it’s the difference between what Bob thinks is important to say, and what I think is important to say.


P.S. Jonah writes:

For anyone interested in the R package “loo” mentioned in the paper, please install from GitHub and not CRAN. There is a version on CRAN but it needs to be updated so please for now use the version here:

To get it running, you must first install the “devtools” package in R, then you can just install and load “loo” via:


Jonah will post an update when the new version is also on CRAN.

P.P.S. This P.P.S. is by Jonah. The latest version of the loo R package (0.1.2) is now up on CRAN and should be installable for most people by running


although depending on various things (your operating system, R version, CRAN mirror, what you ate for breakfast, etc.) you might need

install.packages("loo", type = "source")

to get the new version. For bug reports, installation trouble, suggestions, etc., please use our GitHub issues page. The Stan users google group is also a fine place to ask questions about using the package with your models.

Finally, while we do recommend Stan of course, the R package isn’t only for Stan models. If you can compute a pointwise log-likelihood matrix then you can use the package.
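To make “pointwise log-likelihood matrix” concrete, here is a minimal sketch in base R for a simple normal model. The layout assumed here is one row per posterior draw and one column per data point. The “posterior draws” are faked with random numbers rather than taken from a fitted model, and all variable names are made up for illustration.

```r
set.seed(2)
y <- rnorm(20, mean = 1, sd = 2)   # fake data

# Stand-ins for posterior draws of mu and sigma (assumption: not from a real fit)
S <- 1000
mu_draws    <- rnorm(S, mean = mean(y), sd = 2 / sqrt(length(y)))
sigma_draws <- abs(rnorm(S, mean = 2, sd = 0.2))

# log_lik[s, i] = log p(y_i | mu_s, sigma_s): one row per draw, one column per observation
log_lik <- sapply(y, function(y_i) dnorm(y_i, mu_draws, sigma_draws, log = TRUE))
print(dim(log_lik))
```

Any model for which you can fill in such a matrix, from Stan or from anywhere else, gives the package what it needs.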

Psych dept: “We are especially interested in candidates whose research program contributes to the development of new quantitative methods”

This is cool. The #1 psychology department in the world is looking for a quantitative researcher:

The Department of Psychology at the University of Michigan, Ann Arbor, invites applications for a tenure-track faculty position. The expected start date is September 1, 2016. The primary criterion for appointment is excellence in research and teaching. We are especially interested in candidates whose research program contributes to the development of new quantitative methods.

Although the specific research area is open, we are especially interested in applicants for whom quantitative theoretical modeling, which could include computational models, analytic models, statistical models or psychometric models, is an essential part of their research program. The successful candidate will participate in the teaching rotation of graduate-level statistics and methods. Quantitative psychologists from all areas of psychology and related disciplines are encouraged to apply. This is a university-year appointment.

Successful candidates must have a Ph.D. in a relevant discipline (e.g. Bio-statistics, Psychology) by the time the position starts, and a commitment to undergraduate and graduate teaching. New faculty hired at the Assistant Professor level will be expected to establish an independent research program. Please submit a letter of intent, curriculum vitae, a statement of current and future research plans, a statement of teaching philosophy and experience, and evidence of teaching excellence (if any).

Applicants should also request at least three letters of recommendation from referees. All materials should be uploaded by September 30, 2015 as a single PDF attachment to For inquiries about the positions please contact Richard Gonzalez.

The University of Michigan is an equal opportunity/affirmative action employer. Qualified women and minority candidates are encouraged to apply. The University is supportive of the needs of dual-career couples.

Prior information, not prior belief

The prior distribution p(theta) in a Bayesian analysis is often presented as a researcher’s beliefs about theta. I prefer to think of p(theta) as an expression of information about theta.

Consider this sort of question that a classically-trained statistician asked me the other day:

If two Bayesians are given the same data, they will come to two conclusions. What do you think about that? Does it bother you?

My response is that the statistician has nothing to do with it. I’d prefer to say that if two different analyses are done using different information, they will come to different conclusions. This different information can come in the prior distribution p(theta), it could come in the data model p(y|theta), it could come in the choice of how to set up the model and what data to include in the first place. I’ve listed these in roughly increasing order of importance.
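Here is a minimal sketch of that point, using a normal data model with known standard deviation and a normal prior, for which the posterior has a closed form. The data values and the two priors are made up for illustration, and `posterior_normal` is a hypothetical helper name. The data and the data model are identical in the two analyses; only the information going into p(theta) differs.

```r
# Conjugate normal-normal update: y_i ~ N(theta, sigma^2), theta ~ N(prior_mean, prior_sd^2)
posterior_normal <- function(y, sigma, prior_mean, prior_sd) {
  n <- length(y)
  prec <- 1 / prior_sd^2 + n / sigma^2     # posterior precision
  c(mean = (prior_mean / prior_sd^2 + sum(y) / sigma^2) / prec,
    sd   = sqrt(1 / prec))
}

y <- c(1.2, 0.4, 2.1, 1.7, 0.9)   # made-up data

# Same data, same data model, different prior information
post_weak   <- posterior_normal(y, sigma = 1, prior_mean = 0, prior_sd = 10)
post_strong <- posterior_normal(y, sigma = 1, prior_mean = 0, prior_sd = 0.5)
print(rbind(weak = post_weak, strong = post_strong))
```

With the weak prior the posterior mean sits close to the sample mean of 1.26; with the strong prior it is pulled to 0.7. Nothing about the analysts’ psychology enters: the two answers differ because the two analyses encode different information.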

Sure, we could refer to all statistical models as “beliefs”: we have a belief that certain measurements are statistically independent with a common mean, we have a belief that a response function is additive and linear, we have a belief that our measurements are unbiased, etc. Fine. But I don’t think this adds anything beyond just calling this a “model.” Indeed, referring to “belief” can be misleading. When I fit a regression model, I don’t typically believe in additivity or linearity at all, I’m just fitting a model, using available information and making assumptions, compromising the goal of including all available information because of the practical difficulties of fitting and understanding a huge model.

Same with the prior distribution. When putting together any part of a statistical model, we use some information without wanting to claim that this represents our beliefs about the world.

Awesomest media request of the year

(Sent to all the American Politics faculty at Columbia, including me)

RE: Donald Trump presidential candidacy


Firstly, apologies for the group email but I wasn’t sure who would be best prized to answer this query as we’ve not had much luck so far.

I am a Dubai-based reporter for **.
Donald Trump recently announced his intension to run for the US presidency in 2016.
He currently has a lot of high profile commercial and business deals in Dubai and is actively in talks for more in the wider region.

We have been trying to determine:
If a candidate succeeds in winning a nomination and goes on to win the election and reside in the White House do they have to give up their business interests as these would be seen as a conflict of interest? Can a US president serve in office and still have massive commercial business interests abroad?

Basically, would Trump have to relinquish these relationships if he was successfully elected? Are there are existing rules specifically governing this? Is there any previous case studies to go on?

Lastly, what are his chances of winning a nomination or being elected? So far, from what we have read it seems highly unlikely?


Executive Editor

Survey weighting and regression modeling

Yphtach Lelkes points us to a recent article on survey weighting by three economists, Gary Solon, Steven Haider, and Jeffrey Wooldridge, who write:

We start by distinguishing two purposes of estimation: to estimate population descriptive statistics and to estimate causal effects. In the former type of research, weighting is called for when it is needed to make the analysis sample representative of the target population. In the latter type, the weighting issue is more nuanced. We discuss three distinct potential motives for weighting when estimating causal effects: (1) to achieve precise estimates by correcting for heteroskedasticity, (2) to achieve consistent estimates by correcting for endogenous sampling, and (3) to identify average partial effects in the presence of unmodeled heterogeneity of effects.

This is indeed an important and difficult topic and I’m glad to see economists becoming aware of it. I do not quite agree with their focus (in practice, heteroskedasticity never seems like much of a big deal to me, nor do I care much about so-called consistency of estimates), but there are many ways to Rome, and the first step is to move beyond a naive view of weighting as some sort of magic solution.

Solon et al. pretty much only refer to literature within the field of economics, which is too bad because they miss this twenty-year-old paper by Chris Winship and Larry Radbill, “Sampling Weights and Regression Analysis,” from Sociological Methods and Research, which begins:

Most major population surveys used by social scientists are based on complex sampling designs where sampling units have different probabilities of being selected. Although sampling weights must generally be used to derive unbiased estimates of univariate population characteristics, the decision about their use in regression analysis is more complicated. Where sampling weights are solely a function of independent variables included in the model, unweighted OLS estimates are preferred because they are unbiased, consistent, and have smaller standard errors than weighted OLS estimates. Where sampling weights are a function of the dependent variable (and thus of the error term), we recommend first attempting to respecify the model so that they are solely a function of the independent variables. If this can be accomplished, then unweighted OLS is again preferred. . . .
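The claim about weights that depend only on included predictors can be checked with a small simulation. This is a sketch under strong assumptions that I am imposing for illustration: a correctly specified linear model with constant error variance, and selection probabilities that are a known function of x alone. All names and numbers are made up.

```r
set.seed(3)

# One simulated survey: selection depends only on x, which is in the regression.
one_rep <- function(N = 2000) {
  x <- rnorm(N)
  y <- 1 + 2 * x + rnorm(N)        # true slope is 2
  p <- plogis(-1.5 + 1.5 * x)      # inclusion probability, a function of x only
  keep <- rbinom(N, 1, p) == 1
  d <- data.frame(x = x[keep], y = y[keep], w = 1 / p[keep])
  c(unweighted = unname(coef(lm(y ~ x, data = d))["x"]),
    weighted   = unname(coef(lm(y ~ x, data = d, weights = w))["x"]))
}

sims <- replicate(300, one_rep())  # 2 x 300 matrix of slope estimates
slope_mean <- rowMeans(sims)
slope_sd   <- apply(sims, 1, sd)
print(rbind(mean = slope_mean, sd = slope_sd))
```

In this setup both estimators should center on the true slope, with the weighted one noticeably more variable. If the linear model were wrong, or if selection depended on y, the comparison could come out differently, which is the part of the problem that remains hard.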

This topic also has close connections with multilevel regression and poststratification, as discussed in my 2007 article, “Struggles with survey weighting and regression modeling,” which is (somewhat) famous for its opening:

Survey weighting is a mess. It is not always clear how to use weights in estimating anything more complicated than a simple mean or ratios, and standard errors are tricky even with simple weighted means.

See also our response to the discussions.

I was unaware of Winship and Radbill’s work when writing my paper, so I accept blame for insularity as well.

In any case, it’s good to see broader interest in this important unsolved problem.

Don’t do the Wilcoxon


The Wilcoxon test is a nonparametric rank-based test for comparing two groups. It’s a cool idea because, if data are continuous and there is no possibility of a tie, the reference distribution depends only on the sample size. There are no nuisance parameters, and the distribution can be tabulated. From a Bayesian point of view, however, this is no big deal, and I prefer to think of Wilcoxon as a procedure that throws away information (by reducing the data to ranks) to gain robustness.

Fine. But if you’re gonna do that, I’d recommend instead the following approach:

1. As in classical Wilcoxon, replace the data by their ranks: 1, 2, . . . N.

2. Translate these ranks into z-scores using the inverse-normal cdf applied to the values 1/(2*N), 3/(2*N), . . . (2*N - 1)/(2*N).

3. Fit a normal model.

In simple examples this should work just about the same as Wilcoxon as it is based on the same general principle, which is to discard the numerical information in the data and just keep the ranks. The advantage of this new approach is that, by using the normal distribution, it allows you to plug in all the standard methods that you’re familiar with: regression, analysis of variance, multilevel models, measurement-error models, and so on.
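The three steps can be sketched in a few lines of R. The data are simulated, skewed outcomes in two groups, made up for illustration, and the linear regression in step 3 is just one choice of “normal model.”

```r
set.seed(4)
y <- c(rexp(30, rate = 1), rexp(30, rate = 0.5))   # made-up skewed outcomes
group <- rep(c(0, 1), each = 30)

N <- length(y)
r <- rank(y)                         # step 1: replace the data by ranks 1, ..., N
z <- qnorm((2 * r - 1) / (2 * N))    # step 2: inverse-normal cdf of (2r - 1)/(2N)
fit <- lm(z ~ group)                 # step 3: fit a normal model to the z-scores

print(summary(fit)$coefficients)
print(wilcox.test(y ~ group)$p.value)   # classical Wilcoxon, for comparison
```

Once the z-scores exist, `lm` could be swapped for any other normal-theory model (more predictors, multilevel structure, and so on) without inventing a new reference distribution.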

The trouble with Wilcoxon is that it’s a bit of a dead end: if you want to do anything more complicated than a simple comparison of two groups, you have to come up with new procedures and work out new reference distributions. With the transform-to-normal approach you can do pretty much anything you want.

The question arises: if my simple recommended approach indeed dominates Wilcoxon, how is it that Wilcoxon remains popular? I think much has to do with computation: the inverse-normal transformation is now trivial, but in the old days it would’ve added a lot of work to what, after all, is intended to be rapid and approximate.

Take-home message

I am not saying that the rank-then-inverse-normal-transform strategy is always or even often a good idea. What I’m saying is that, if you were planning to do a rank transformation before analyzing your data, I recommend this z-score approach rather than the classical Wilcoxon method.

On deck this week

Mon: Don’t do the Wilcoxon

Tues: Survey weighting and regression modeling

Wed: Prior information, not prior belief

Thurs: Draw your own graph!

Fri: Measurement is part of design

Sat: Annals of Spam

Sun: “17 Baby Names You Didn’t Know Were Totally Made Up”

“Physical Models of Living Systems”

Phil Nelson writes:

I’d like to alert you that my new textbook, “Physical Models of Living Systems,” has just been published. Among other things, this book is my attempt to bring Bayesian inference to undergraduates in any science or engineering major, and the course I teach from it has been enthusiastically received.

The book is intended for intermediate-level undergraduates. The only prerequisite for the course is first-year physics or something similar. Advanced appendices to each chapter make the book useful also for PhD students. There is almost no overlap with my prior book Biological Physics.

Rather than attempting an encyclopedic survey of biophysics, my aim has been to develop skills and frameworks that are essential to the practice of almost any science or engineering field, in the context of some life-science case studies.

I have quantitative and qualitative data on Penn students’ assessment of the usefulness of the class in their later work. This appears in the Instructor section of the above Web site.

Many of my students come to the course with no computer background, so I have also written the short booklet Student’s Guide to Physical Modeling with Matlab (with Tom Dodson), which is available free via the above web site. A parallel book, Student’s Guide to Physical Modeling with Python (with Jesse M. Kinder) will also be available soon. These resources are not specifically about life science applications.

This sounds great. And let me again recommend the classic How Animals Work, by Knut Schmidt-Nielsen.

P.S. We last encountered Nelson a couple years ago when answering his question, “What are some situations in which the classical approach (or a naive implementation of it, based on cookbook recipes) gives worse results than a Bayesian approach, results that actually impeded the science?” The question was surprisingly easy to answer. You might also want to check out the comment section there, because some of the commenters had some misconceptions that I tried to clarify.

Inauthentic leadership? Development and validation of methods-based criticism

Thomas Basbøll writes:

I need some help with a critique of a paper that is part of the apparently growing retraction scandal in leadership studies. Here’s Retraction Watch.

The paper I want to look at is here: “Authentic Leadership: Development and Validation of a Theory-Based Measure” By F. O. Walumbwa, B. J. Avolio, W. L. Gardner, T. S. Wernsing, S. J. Peterson Journal of Management 2007.

I have a lot of issues with this paper that are right on the surface, and the one thing (again on the surface) that seems to justify its existence is the possibility that the quantitative stuff is, well, true. But that’s exactly what’s been questioned. And in pretty harsh terms. The critics are saying its defects are entirely too obvious to anyone who knows anything about structural equation modeling. The implication is that in leadership studies no one (not the authors, not the reviewers, not the editors, not the readers) actually understands SEM. It’s just a way of presenting their ideas about leadership as science: “research”, “evidence-based”, etc.

Hey, I can relate to that: I don’t understand structural equation modeling either! I keep meaning to sit down and figure it all out sometime. Meanwhile, it remains popular in much of social science, and my go-to way of understanding anything framed as a structural equation model is to think about it some other way.

For example, there was the recent discussion of the claim that subliminal smiley-faces have big effects on political attitudes. It turns out there was no strong evidence for such a claim, but there was some indirect argument based on structural equation models.

Anyway, Basbøll continues:

I’m way out of my depth on the technical issues, however. There’s some discussion of the statistical issues with the paper here.

There’s also a question about data fabrication, but I want to leave that on the side. I’m hoping there’s someone among your readers who might have some pretty quick way to see if the critics are right that the structural equation modeling they use is bogus.

The paper has been widely cited, and has won an award—for being the most cited paper in 2013.

The editors are not saying very much about the criticism.

Hey, I published a paper recently in Journal of Management. So maybe I could ask my editors there what they think of all this.

Basbøll continues:

In addition to doing a close reading of the argument (which is weird to me, like I say), I also want to track down all the people that have been citing it, to see whether the statistical analysis actually matters. I suspect it’s just taken as “reviewed therefore true”. If the critics are right, that would make the use of statistics here a great example of cargo-cult science, completely detached from reality.

You’ve talked about measurement recently, so I should say that I don’t think the thing they’re trying to measure can be measured, and is best talked about in other ways, but if their analysis itself is bad, however they got the data (whether by mis-measurement or by outright fabrication), then that point is somewhat moot.

What do all of you think? I’m prepared to agree with Basbøll on this because I agree with him on other things.

This sort of reasoning is sometimes called a “prior,” but I’d prefer to think of it as a model in which the quality of the article is an unknown predictive quantity and “Basbøll doesn’t like it” is the value of a predictor.

In any case, I have neither the energy nor the interest to actually read the damn article. But if any of you have any thoughts on it, feel free to chime in.

Economists betting on replication

Mark Patterson writes:

A bunch of folks are collaborating on a project to replicate 18 experimental studies published in prominent Econ journals (mostly American Economic Review, a few Quarterly Journal of Economics). This is already pretty exciting, but the really cool bit is they’re opening a market (with real money) to predict which studies will replicate. Unfortunately participation is restricted, but the market activity will be public. The market opens tomorrow, so it should be pretty exciting to watch.

There was some discussion about doing this with psychology papers, but the sense was that some people were so upset with the replication movement already that there would be backlash against the whole betting thing. I’m curious how the econ project goes.

Hey—guess what? There really is a hot hand!


No, it’s not April 1, and yup, I’m serious. Josh Miller came into my office yesterday and convinced me that the hot hand is real.

Here’s the background. Last year we posted a discussion on streakiness in basketball shooting. Miller has a new paper out, with Adam Sanjurjo, which begins:

We find a subtle but substantial bias in a standard measure of the conditional dependence of present outcomes on streaks of past outcomes in sequential data. The mechanism is driven by a form of selection bias, which leads to an underestimate of the true conditional probability of a given outcome when conditioning on prior outcomes of the same kind. The biased measure has been used prominently in the literature that investigates incorrect beliefs in sequential decision making — most notably the Gambler’s Fallacy and the Hot Hand Fallacy. Upon correcting for the bias, the conclusions of some prominent studies in the literature are reversed. The bias also provides a structural explanation of why the belief in the law of small numbers persists, as repeated experience with finite sequences can only reinforce these beliefs, on average.

What’s this bias they’re talking about?

Jack takes a coin from his pocket and decides that he will flip it 4 times in a row, writing down the outcome of each flip on a scrap of paper. After he is done flipping, he will look at the flips that immediately followed an outcome of heads, and compute the relative frequency of heads on those flips. Because the coin is fair, Jack of course expects this conditional relative frequency to be equal to the probability of flipping a heads: 0.5. Shockingly, Jack is wrong. If he were to sample 1 million fair coins and flip each coin 4 times, observing the conditional relative frequency for each coin, on average the relative frequency would be approximately 0.4.

Really?? Let’s try it in R:

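n_rep <- 1e6   # number of simulated players (sequences)
n <- 4         # flips per sequence
data <- array(sample(c(0,1), n_rep*n, replace=TRUE), c(n_rep,n))
prob <- rep(NA, n_rep)
for (i in 1:n_rep){
  heads1 <- data[i,1:(n-1)]==1   # heads on flips 1 to n-1
  heads2 <- data[i,2:n]==1       # heads on the flips that follow
  prob[i] <- sum(heads1 & heads2)/sum(heads1)   # 0/0 gives NaN
}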

OK, I've simulated, for each player, the conditional probability that he gets heads, given that he got heads on the previous flip.

What's the mean of these?

> print(mean(prob))
[1] NaN

Oh yeah, that's right: sometimes the first three flips are tails, so the probability is 0/0. So we'll toss these out. Then what do we get?

> print(mean(prob, na.rm=TRUE))
[1] 0.41

Hey! That's not 50%! Indeed, if you get this sort of data, it will look like people are anti-streaky (heads more likely to be followed by tails, and vice-versa), even though they're not.
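For short sequences you don't need to simulate at all: with 4 flips there are only 16 equally likely sequences, so the average can be computed exactly by enumeration. Here is a minimal sketch (the function name is made up) that computes the same proportion for every possible sequence and averages over the sequences where it is defined.

```r
# Exact average of the "heads after heads" proportion over all 2^n equally likely sequences
streak_proportion_mean <- function(n) {
  seqs <- as.matrix(expand.grid(rep(list(0:1), n)))   # every sequence of n flips
  prev <- seqs[, 1:(n - 1), drop = FALSE]             # flips 1 to n-1
  nxt  <- seqs[, 2:n, drop = FALSE]                   # the flips that follow
  prop <- rowSums(prev == 1 & nxt == 1) / rowSums(prev == 1)   # NaN when 0/0
  mean(prop, na.rm = TRUE)                            # drop the undefined cases
}

exact_4 <- streak_proportion_mean(4)
print(exact_4)   # 17/42, about 0.405
```

Of the 16 sequences, 2 have no heads in the first three flips and are dropped; the remaining 14 proportions sum to 17/3, giving 17/42. The same function can be called with longer sequences, although the number of sequences doubles with each extra flip.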

With sequences of length 10, the average streakiness statistic (that is, for each person you compute the proportion of times that he gets heads, conditional on him having just got heads on the previous flip, and then you average this across people) is .445. This is pretty far from .5, given that previous estimates of streak-shooting probability have been in the range of 2 percentage points.

And the bias is larger for comparisons such as the proportion of heads, conditional on following three straight heads, compared to the overall probability of heads, which is one measure of streakiness if "heads" is replaced by success of a basketball shot.

So here's the deal. The classic 1985 paper by Gilovich, Vallone, and Tversky and various followups used these frequency comparisons, and as a result they all systematically underestimated streakiness, reporting no hot hand when, actually, when the data are analyzed correctly, the evidence is there, as Miller and Sanjurjo report in the above-linked paper and also in another recent article which uses the example of the NBA three-point shooting contest.

Next step: fitting a model in Stan to estimate individual players' streakiness.

This is big news. Just to calibrate, here's what I wrote on the topic last year:

Consider the continuing controversy regarding the “hot hand” in basketball. Ever since the celebrated study of Gilovich, Vallone, and Tversky (1985) found no evidence of serial correlation in the successive shots of college and professional basketball players, people have been combing sports statistics to discover in what settings, if any, the hot hand might appear. Yaari (2012) points to some studies that have found time dependence in basketball, baseball, volleyball, and bowling, and this is sometimes presented as a debate: Does the hot hand exist or not?

A better framing is to start from the position that the effects are certainly not zero. Athletes are not machines, and anything that can affect their expectations (for example, success in previous tries) should affect their performance—one way or another. To put it another way, there is little debate that a “cold hand” can exist: It is no surprise that a player will be less successful if he or she is sick, or injured, or playing against excellent defense. Occasional periods of poor performance will manifest themselves as a small positive time correlation when data are aggregated.

However, the effects that have been seen are small, on the order of 2 percentage points (for example, the probability of a success in some sports task might be 45% if a player is “hot” and 43% otherwise). These small average differences exist amid a huge amount of variation, not just among players but also across different scenarios for a particular player. Sometimes if you succeed, you will stay relaxed and focused; other times you can succeed and get overconfident.

I don't think I said anything wrong there, exactly, but Miller and Sanjurjo's bias correction makes a difference. For example, they estimate the probability of success in the 3-point shooting contest as 6 percentage points higher after three straight successes. In comparison, the raw (biased) estimate is 4 percentage points. The difference between 4% and 6% isn't huge, but the overall impact of all this analysis is to show clear evidence for the hot hand. It will be interesting to see what Stan finds regarding the variation.

Evaluating the Millennium Villages Project

I’m a postdoc working with Andy and Jeff Sachs on the evaluation of the Millennium Villages Project, a ten-year economic development project operating in ten sub-Saharan African countries. Our evaluation protocol was recently accepted by The Lancet (full text here, and the accompanying technical paper here). We welcome your thoughts!

An Excel add-in for regression analysis

Bob Nau writes:

I know you are not particularly fond of Excel, but you might (I hope) be interested in a free Excel add-in for multivariate data analysis and linear regression that I am distributing here: I originally developed it for teaching an advanced MBA elective course on regression and time series analysis at Duke University, but it is intended for teaching data analysis at any level where students are familiar with Excel (and use it on PC’s), and it is also intended for serious applied work as a complement to other analytical software. It has been available to the public since May 2014, and a new version has just been released. If I do say so myself, its default regression output is more thoughtfully designed and includes higher quality graphics than what is provided by the best-known statistical programming languages or by commercial Excel add-ins such as Analyse-it, XLstat, or StatTools. It also has a number of unique features that are designed to facilitate data exploration and model testing and to support a disciplined and well-documented approach to analysis, with an emphasis on data visualization. My frustration with the stone-age graphics output of the leading regression software was the original motivation for its development, and I am now offering it for free as a public service. Please take it for a test drive and see for yourself. I’d welcome your feedback.

I don’t know Excel at all so I can’t take it for a test drive . . . but I bet that some of you can! Please share your thoughts. So many people use Excel that an improvement here could have a huge effect on good statistical practice. I don’t know if Reinhart and Rogoff read this blog but there must be some Excel users in the audience, right?

P.S. Nau wanted to share some further thoughts:

It may appear at first glance as though there is little that is new here: just another program that performs descriptive data analysis and plain old linear regression. The difference is in the details, and the details are many. Every design element in RegressIt has been chosen with a view toward helping the user to work efficiently and competently, to interactively share the results of the analysis with others, to enjoy the process, and to leave behind a clear trail of breadcrumbs. In this respect, RegressIt is a sort of “concept car” that illustrates features which would be nice to have in other analytical procedures besides regression if the software was designed from the ground up with the user in mind and did not carry a burden of backward compatibility with the way it looked a decade or two ago. Also, it tries to take advantage of things that Excel is good for while compensating for its lack of discipline.

The design choices are based on my own experience in 30+ years of teaching as well as playing around with data for my own purposes. When a student or colleague or someone on the other side of the internet wants to discuss the results of an analysis that he or she has performed, which might or might not be for a problem whose solution I already know, I want to be able, with a few mouse clicks, to replicate their analysis and drill deeper or perform variations on it, and compare new results side-by-side with old ones, while having an armchair conversation. I might also want to do this on the spur of the moment in front of a class without worrying about my typing.

When I am looking at one among many tables or charts, I often wonder: what model produced this, and what were the variables, what was the sample, when did the analysis take place, and by whom? What other models were tried before or afterward, and what was good or bad about this one? If a chart is just labeled “Residuals of Y” or “Residuals vs. Fitted Values”, that is not very helpful, particularly if it has been copied and pasted into a report where it takes on a life of its own. And when I look at the output of a model on the computer screen, I want to see as much of it at one time as possible. I want an efficient screen design (ideally one that would look good in an auditorium as well as on my desktop) and I want easy navigation within and across models. I would rather not scroll up and down through a linear log file that reminds me of line-printer days (which I do remember!) and makes it hard to distinguish the code from the results. I would like to see a presentation that by default is fairly complete in terms of including some well-chosen chart output that allows me to engage my visual cortex without saying “yuck”. And I want the same things if the original analyst is not a student or colleague but merely myself yesterday or last week or last year.

I hope you will give it a close look, kick the tires, and take it for a drive with some data of your own. And please read everything that is on the features and advice pages on the web site. Otherwise you may overlook some of what RegressIt is doing that is novel. And whatever you may think of it in the end, I would welcome your input on improvements or extensions that could be made. Is there any low-hanging fruit that could easily be added, or is there some deal-breaking omission that absolutely needs to be fixed? We can make changes in a hurry if we have to; there is no calendar of scheduled releases. We are two professors who work on this in our spare time. RegressIt’s feature set is limited at present, but our hope is that the features it does include will be useful in some circumstances to people who do most of their work in R or Stata as well as to people who do most of their work in Excel, and we plan to add more to it in the future. Thanks in advance for your input!

Short course on Bayesian data analysis and Stan 19-21 July in NYC!


Bob Carpenter, Daniel Lee, and I are giving a 3-day short course in two weeks.

Before class everyone should install R, RStudio and RStan on their computers. If problems occur please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.

Class structure and example topics for the three days:

Sunday, July 19: Introduction to Bayes and Stan
Intro to Bayes
Intro to Stan
The statistical crisis in science
Stan by example
Components of a Stan program
Little data: how traditional statistical ideas remain relevant in a big data world

Monday, July 20: Computation, Monte Carlo and Applied Modeling
Computation with Monte Carlo Methods
Debugging in Stan
Generalizing from sample to population
Multilevel regression and generalized linear models
Computation and Inference in Stan
Why we don’t (usually) have to worry about multiple comparisons

Tuesday, July 21: Advanced Stan and Big Data
Vectors, matrices, and transformations
Mixture models and complex data structures in Stan
Hierarchical modeling and prior information
Bayesian computation for big data
Advanced Stan programming
Open problems in Bayesian data analysis

Specific topics on Bayesian inference and computation include, but are not limited to:
Bayesian inference and prediction
Naive Bayes, supervised, and unsupervised classification
Overview of Monte Carlo methods
Convergence and effective sample size
Hamiltonian Monte Carlo and the no-U-turn sampler
Continuous and discrete-data regression models
Mixture models
Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:
Reproducible research
Probabilistic programming
Stan syntax and programming
Warmup, adaptation, and convergence
Identifiability and problematic posteriors
Handling missing data
Ragged and sparse data structures
Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

Discreteland and Continuousland

Roy Mendelssohn points me to this paper by Jianqing Fan, Qi-Man Shao, and Wen-Xin Zhou, “Are Discoveries Spurious? Distributions of Maximum Spurious Correlations and Their Applications.” I never know what to think about these things because I don’t work in a discrete world in which there are zero effects (see our earlier discussion of the “bet on sparsity principle”), but I thought I’d pass it along in case any of you are interested.
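For readers who want a feel for what a “maximum spurious correlation” is, here is a minimal sketch in Python. It is not the method of Fan, Shao, and Zhou; it only simulates the general phenomenon, assuming a pure-noise outcome and independent pure-noise predictors. The function names and the sample sizes are made up for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(n_obs, n_predictors, rng):
    """Largest |correlation| between a noise outcome and noise predictors.

    Every true correlation is exactly zero here (an assumption of this
    sketch), so whatever comes back is spurious by construction.
    """
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    largest = 0.0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        largest = max(largest, abs(pearson(x, y)))
    return largest


if __name__ == "__main__":
    rng = random.Random(2015)
    for n_predictors in (1, 10, 100, 1000):
        print(n_predictors, round(max_abs_correlation(50, n_predictors, rng), 2))
```

The more predictors you scan, the larger the biggest correlation tends to be, even though nothing is there. That is Discreteland's question; in Continuousland the true correlations would be small but not exactly zero.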

On deck this week

Mon: Discreteland and Continuousland

Tues: “There are many studies showing . . .”

Wed: An Excel add-in for regression analysis

Thurs: Unreplicable

Fri: Economists betting on replication

Sat: Inauthentic leadership? Development and validation of methods-based criticism

Sun: “Physical Models of Living Systems”

“Menstrual Cycle Phase Does Not Predict Political Conservatism”


Someone pointed me to this article by Isabel Scott and Nicholas Pound:

Recent authors have reported a relationship between women’s fertility status, as indexed by menstrual cycle phase, and conservatism in moral, social and political values. We conducted a survey to test for the existence of a relationship between menstrual cycle day and conservatism.

2213 women reporting regular menstrual cycles provided data about their political views. Of these women, 2208 provided information about their cycle date . . . We also recorded relationship status, which has been reported to interact with menstrual cycle phase in determining political preferences.

We found no evidence of a relationship between estimated cyclical fertility changes and conservatism, and no evidence of an interaction between relationship status and cyclical fertility in determining political attitudes. . . .

I have no problem with the authors’ substantive findings. And they get an extra bonus for not labeling day 6 as high conception risk:


Seeing this clearly sourced graph makes me annoyed one more time at those psychology researchers who refused to acknowledge that, in a paper all about peak fertility, they’d used the wrong dates for peak fertility. So, good on Scott and Pound for getting this one right.

There’s one thing that does bother me about their paper, though, and that’s how they characterize the relation of their study to earlier work such as the notorious paper by Durante et al.

Scott and Pound write:

Our results are therefore difficult to reconcile with those of Durante et al, particularly since we attempted the analyses using a range of approaches and exclusion criteria, including tests similar to those used by Durante et al, and our results were similar under all of them.

Huh? Why “difficult to reconcile”? The reconciliation seems obvious to me: There’s no evidence of anything going on here. Durante et al. had a small noisy dataset and went all garden-of-forking-paths on it. And they found a statistically significant comparison in one of their interactions. No news here.
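A rough simulation shows how easily a small noisy dataset can yield a statistically significant comparison somewhere. This is a simplified sketch, not a reconstruction of Durante et al.'s analysis: it assumes pure noise, a hypothetical set of independent two-group comparisons (think of non-overlapping subgroups), and a large-sample z test. Real forking paths involve correlated, data-dependent choices, so treat the numbers as illustrative only.

```python
import math
import random


def z_statistic(group_a, group_b):
    """Two-sample z statistic for a difference in means (large-sample approximation)."""
    na, nb = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / na, sum(group_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)


def share_with_a_finding(n_comparisons, n_per_group=40, n_studies=2000, seed=0):
    """Fraction of pure-noise studies where at least one comparison has |z| > 1.96.

    n_comparisons is the number of comparisons the researcher could have
    looked at (a hypothetical stand-in for the forking paths).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        for _ in range(n_comparisons):
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if abs(z_statistic(a, b)) > 1.96:
                hits += 1
                break  # one "finding" is enough to write the paper
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 3, 6):
        print(k, share_with_a_finding(k))
```

With a single prespecified comparison the rate stays near the nominal 5%; with a handful of available comparisons it climbs well above that, with no true effect anywhere.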

Scott and Pound continue:

Lack of statistical power does not seem a likely explanation for the discrepancy between our results and those reported in Durante et al, since even after the most restrictive exclusion criteria were applied, we retained a sample large enough to detect a moderate effect . . .

Again, I feel like I’m missing something. “Lack of statistical power” is exactly what was going on with Durante et al.; indeed, their example was the “Jerry West” of our “power = .06” graph:

[Image: the “power = .06” graph]
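To make “power = .06” concrete, here is a minimal sketch of a design calculation in the spirit of Type S (sign) and Type M (magnitude) errors. It assumes a normally distributed estimate with known standard error and a positive true effect. The inputs below (a true effect of 2 with a standard error of 8.1) are hypothetical numbers chosen so that power comes out close to .06; they are not taken from Durante et al.

```python
import random
from statistics import NormalDist


def design_analysis(true_effect, std_error, alpha=0.05, n_sims=100_000, seed=0):
    """Return (power, Type S error rate, exaggeration ratio).

    Assumes the estimate is normal with mean true_effect (taken to be
    positive) and standard deviation std_error.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_effect / std_error

    # Probability of a significant estimate with the right and the wrong sign.
    p_right_sign = 1 - std_normal.cdf(z_crit - shift)
    p_wrong_sign = std_normal.cdf(-z_crit - shift)
    power = p_right_sign + p_wrong_sign
    type_s = p_wrong_sign / power

    # Exaggeration ratio by simulation: average |estimate| among the
    # significant results, divided by the true effect.
    rng = random.Random(seed)
    threshold = z_crit * std_error
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > threshold:
            significant.append(abs(estimate))
    exaggeration = (sum(significant) / len(significant)) / true_effect
    return power, type_s, exaggeration


if __name__ == "__main__":
    power, type_s, exaggeration = design_analysis(true_effect=2.0, std_error=8.1)
    print(f"power = {power:.2f}, Type S = {type_s:.2f}, exaggeration = {exaggeration:.1f}")
```

Under these assumptions, roughly a quarter of the statistically significant estimates have the wrong sign, and any significant estimate must be several times larger than the true effect. A study like that can produce a publishable result without telling you much of anything.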

Scott and Pound continue:

One factor that may partially explain the discrepancy is our different approaches to measuring conservatism and how the relevant questions were framed. . . . However, these methodological differences seem unlikely to fully explain the discrepancy between our results . . . One further possibility is that differences in responses to our survey and the other surveys discussed here are attributable to variation in the samples surveyed. . . .

Sure, but aren’t you ignoring the elephant in the room? Why is there any discrepancy to explain? Why not at least raise the possibility that those earlier publications were just examples of the much-documented human ability to read patterns in noise?

I suspect that Scott and Pound have considered this explanation but felt it would be politic not to explicitly suggest it in their paper.

P.S. The above graph is a rare example of a double-y-axis plot that isn’t so bad. But the left axis should have a lower bound at 0: it’s not possible for conception risk to be negative!

July 4th

Lucky to have been born an American.