
Letters we never finished reading

I got a book in the mail attached to some publicity material that began:

Over the last several years, a different kind of science book has found a home on consumer bookshelves. Anchored by meticulous research and impeccable credentials, these books bring hard science to bear on the daily lives of the lay reader; their authors—including Malcolm Gladwell . . .

OK, then.

The book might be ok, though. I wouldn’t judge it on its publicity material.

Free workshop on Stan for pharmacometrics (Paris, 22 September 2016); preceded by (non-free) three day course on Stan for pharmacometrics

So much for one post a day…

Workshop: Stan for Pharmacometrics Day

If you are interested in a free day of Stan for pharmacometrics in Paris on 22 September 2016, see the registration page:

Julie Bertrand (statistical pharmacologist from Paris-Diderot and UCL) has finalized the program:

When Who What
09:00–09:30 Registration
09:30–10:00 Bob Carpenter Introduction to the Stan Language and Model Fitting Algorithms
10:00–10:30 Michael Betancourt Using Stan for Bayesian Inference in PK/PD Models
10:30–11:00 Bill Gillespie Prototype Stan Functions for Bayesian Pharmacometric Modeling
11:00–11:30 Coffee break
11:30–12:00 Sebastian Weber Bayesian popPK for Pediatrics – bridging from adults to pediatrics
12:00–12:30 Solene Desmee Using Stan for individual dynamic prediction of the risk of death in nonlinear joint models: application to PSA kinetics and survival in metastatic prostate cancer
12:30–13:30 Lunch
13:30–14:00 Marc Vandemeulebroecke A longitudinal Item Response Theory model to characterize cognition over time in elderly subjects
14:00–14:30 William Barcella Modeling correlated binary variables: an application to lower urinary tract symptoms
14:30–15:00 Marie-Karelle Riviere Evaluation of the Fisher information matrix without linearization in nonlinear mixed effects models for discrete and continuous outcomes
15:00–15:30 Coffee break
15:30–16:00 Dan Simpson TBD
16:00–16:30 Frederic Bois Bayesian hierarchical modeling in pharmacology and toxicology / what we need next
16:30–17:00 Everyone Discussion


Course: Bayesian Inference with Stan for Pharmacometrics

The three days preceding the workshop (19–21 September 2016), Michael Betancourt, Daniel Lee, and I will be teaching a course on Stan for Pharmacometrics. This, alas, is not free, but if you’re interested, registration details are here:

It’s going to be very hands-on and by the end you should be fitting hierarchical PK/PD models based on compartment differential equations.

P.S. As Andrew keeps pointing out, all proceeds (after overhead) go directly toward Stan development. It turns out to be very difficult to get funding to maintain software that people use, because most funding is directed at “novel” research rather than software development (and in research, software usually means prototypes, not solid code). These courses help immensely to supplement our grant funding and let us continue to maintain Stan and its interfaces.

A day in the life

I like to post approx one item per day on this blog, so when multiple things come up in the same day, I worry about the sustainability of all this. I suppose I could up the posting rate to 2 a day but I think that could be too much of a burden on the readers.

So in this post I’ll just tell you everything I’ve been thinking about today, Thurs 14 Apr 2016.

Actually I’ll start with yesterday, when I posted an update to our Prior Choice Recommendations wiki. There had been a question on the Stan mailing list about priors for cutpoints in ordered logistic regression and this reminded me of a few things I wanted to add, not just on ordered regression but in various places in the wiki. This wiki is great and I’ll devote a full post to it sometime.

Also yesterday I edited a post on this sister blog. Posting there is a service to the political science profession, and it’s good to reach Washington Post readers, a different audience than we have here. But it can also be exhausting, as I need to explain everything, whereas for you regular readers I can just speak directly.

This morning I taught my class on design and analysis of sample surveys. Today’s class was on Mister P, which led into a 20-minute discussion about the history and future of sample surveys. I don’t know much about the history of sample surveys. Why was there no Gallup Poll in 1890? How much random sampling was being done, anywhere, before 1930? I don’t know. After that, the class was all R/Stan demos and discussion. I had some difficulties. I took an old R script from last year’s class but it didn’t run: I’d deleted some of the data files (Census PUMS files I needed for the poststratification), so I needed to get them again.

After that I biked downtown to give a talk at Baruch College, where someone had asked me to speak. On the way down I heard this story, which the This American Life producers summarize as follows:

When Jonathan Goldstein was 11, his father gave him a book called Ultra-Psychonics: How to Work Miracles with the Limitless Power of Psycho-Atomic Energy. The book was like a grab bag of every occult, para-psychology, and self-help book popular at the time. It promised to teach you how to get rich, control other people’s minds, and levitate. Jonathan found the book in his apartment recently and decided to look into the magical claims the book made.

It turns out that the guy who wrote the book was just doing it to make money:

At the time, Schaumberger was living in New Jersey and making a decent wage as an editor at a publishing house that specialized in occult self help books with titles like “Secrets From Beyond The Pyramids” and “The Magic Of Chantomatics.” And he was astonished by the amount of money he saw writers making. . . .

Looking at it now, it seems obvious it was a lark. It almost reads like a parody of another famous science fiction slash self help book with a lot of pseudoscience jargon that, for legal reasons, I will only say rhymes with diuretics.

Take, for instance, the astral spur. You were supposed to use it at the race track to give your horse extra energy, and it involved standing on one foot and projecting a psychic laser at your horse’s hindquarters.

Then there’s the section on ultra vision influence. The road to domination is explained this way: one, sit in front of a mirror and practice staring fixedly into your own eyes. Two, practice the look on animals. Cats are the best. See if you can stare down a cat. Don’t be surprised if the cat seems to win the first few rounds. Three, practice the look on strangers on various forms of public transport. Stare steadily at someone sitting opposite you until you force them to turn their head away or look down. You have just mastered your first human subject.

I’m listening to this and I’m thinking . . . power pose! It’s just like power pose. It could be true, it kinda sounds right, it involves discipline and focus.

One difference is that power pose has a “p less than .05” attached to it. But, as we’ve seen over and over again, “p less than .05” doesn’t mean very much.

The other difference is that, presumably, the power pose researchers are sincere, whereas this guy was just gleefully making it all up. And yet . . . there’s this, from his daughter:

Well, he was very familiar with all these things. The “Egyptian Book of the Dead” was a big one, because there was always this thing of, well, maybe if they had followed the formulas correctly, maybe something . . . He may have wanted to believe. It may be that in his private thoughts, there were some things in there that he believed in.

I think there may be something going on here, the idea that, even if you make it up, if you will it, you can make it true. If you just try hard enough. I wonder if the power-pose researchers and the ovulation-and-clothing researchers and all the rest, I wonder if they have a bit of this attitude, that if they just really really try, it will all become true.

And then there was more. I’ve had my problems with This American Life from time to time, but this one was a great episode. It had this cool story of a woman who was caring for her mother with dementia, and she (the caregiver) and her husband learned about how to “get inside the world” of the mother so that everything worked much more smoothly. I’m thinking I should try this approach when talking with students!

OK, so I got to my talk. It went ok, I guess. I wasn’t really revved up for it. But by the time it was over I was feeling good. I think I’m a good speaker but one thing that continues to bug me is that I rarely elicit many questions. (Search this blog for Brad Paley for more on this.)

After my talk, on the way back, another excellent This American Life episode, including a goofy/chilling story of how the FBI was hassling some US Taliban activist and trying to get him to commit crimes so they could nail him for terrorism. Really creepy: they seemed to want to create crimes where none existed, just so they could take credit for catching another terrorist.

Got home and started typing this up.

What else relevant happened recently? On Monday I spoke at a conference on “Bayesian, Fiducial, and Frequentist Inference.” My title was “Taking Bayesian inference seriously,” and this was my abstract:

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors also can resolve some of the questions of replication and multiple comparisons that have recently shaken the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

I don’t know if my talk quite lived up to this, but I have been thinking a lot about prior distributions, as was indicated at the top of this post. On the train ride to and from the conference (it was in New Jersey) I talked with Deborah Mayo. I don’t really remember anything we said—that’s what happens when I don’t take notes—but Mayo assured me she’d remember the important parts.

I also had an idea for a new paper, to be titled, “Backfire: How methods that attempt to avoid bias can destroy the validity and reliability of inferences.” OK, I guess I need a snappier title, but I think it’s an important point. Part of this material was in my talk, “‘Unbiasedness’: You keep using that word. I do not think it means what you think it means,” which I gave last year at Princeton. (That was before Angus Deaton got mad at me; he was really nice during that visit and offered a lot of good comments, both during and after the talk.) But I have some new material too. I want to work in the bit about the homeopathic treatments that have been so popular in social psychology.

Oh, also I received emails today from 2 different journals asking me to referee submitted papers, someone emailed me his book manuscript the other day, asking for comments, and a few other people emailed me articles they’d written.

I’m not complaining, nor am I trying to “busy-brag.” I love getting interesting things to read, and if I feel too busy I can just delete these messages. My only point is that there’s a lot going on, which is why it can be a challenge to limit myself to one blog post per day.

Finally, let me emphasize that I’m not saying there’s anything special about me. Or, to put it another way, sure, I’m special, and so are each of you. You too can do a Nicholson Baker and dissect every moment of your lives. That’s what blogging’s all about. God is in every leaf etc.

Hey pollsters! Poststratify on party ID, or we’re all gonna have to do it for you.

Alan Abramowitz writes:

In five days, Clinton’s lead increased from 5 points to 12 points. And Democratic party ID margin increased from 3 points to 10 points.

No, I don’t think millions of voters switched to the Democratic party. I think Democrats were just more likely to respond in that second poll. And, remember, survey response rates are around 10%, whereas presidential election turnout is around 60%, so it makes sense that we’d see big swings in differential nonresponse to polls, swings that should not be expected to map into comparable swings in differential voting turnout.

We’ve been writing about this a lot recently. Remember this post, and this earlier graph from Abramowitz:


and this news article with David Rothschild, and this research article with Rothschild, Doug Rivers, and Sharad Goel, and this research article from 2001 with Cavan Reilly and Jonathan Katz? The cool kids know about this stuff.

I’m telling you this for free cos, hey, it’s part of my job as a university professor. (The job is divided into teaching, research, and service; this is service.) But I know that there are polling and news organizations that make money off this sort of thing. So, my advice to you: start poststratifying on party ID. It’ll give you a leg up on the competition.

That is, assuming your goal is to assess opinion and not just to manufacture news. If what you’re looking for is headlines, then by all means go with the raw poll numbers. They jump around like nobody’s business.

P.S. Two questions came up in discussion:

1. If this is such a good idea, why aren’t pollsters doing it already? Many answers here, including (a) some pollsters are doing it already, (b) other pollsters get benefit from headlines, and you get more headlines with noisy data, (c) survey sampling is a conservative field and many practitioners resist new ideas (just search this blog for “buggy whip” for more on that topic), and, most interestingly, (d) response rates keep going down, so differential nonresponse might be a bigger problem now than it used to be.

2. Suppose I want to poststratify on party ID? What numbers should I use? If you’re poststratifying on party ID, you don’t simply want to adjust to party registration data: party ID is a survey response, and party registration is something different. The simplest approach would be to take some smoothed estimate of the party ID distribution from many surveys: this won’t be perfect but it should be better than taking any particular poll, and much better than not poststratifying at all. To get more sophisticated, you could model the party ID distribution as a slowly varying time series as in our 2001 paper but I doubt that’s really necessary here.
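To make the adjustment concrete, here is a minimal sketch in Python. All the numbers are invented for illustration (the within-party support rates, the smoothed party-ID shares, and the one poll's Dem-heavy sample mix are hypothetical, not from any real survey):

```python
# Hypothetical example of poststratifying a poll estimate on party ID.
# Cell estimates: candidate support within each party-ID group, from the poll.
# Target shares: a smoothed party-ID distribution pooled across many surveys,
# NOT this one poll's (possibly nonresponse-distorted) sample composition.

support = {"Dem": 0.90, "Rep": 0.08, "Ind": 0.45}          # within-cell estimates
smoothed_share = {"Dem": 0.33, "Rep": 0.29, "Ind": 0.38}   # target distribution
raw_share = {"Dem": 0.40, "Rep": 0.26, "Ind": 0.34}        # this poll's Dem-heavy sample

raw = sum(support[g] * raw_share[g] for g in support)
poststratified = sum(support[g] * smoothed_share[g] for g in support)

print(f"raw estimate: {raw:.3f}")               # inflated by differential nonresponse
print(f"poststratified: {poststratified:.3f}")  # adjusted to the stable party-ID mix
```

The point of the sketch: the swing in the raw number comes entirely from the sample's party mix, so reweighting the same within-cell estimates to a stable party-ID distribution removes it.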

His varying slopes don’t seem to follow a normal distribution

Bruce Doré writes:

I have a question about multilevel modeling I’m hoping you can help with.

What should one do when random effects coefficients are clearly not normally distributed (i.e., coef(lmer(y~x+(x|id))) )? Is this a sign that the model should be changed? Or can you stick with this model and infer that the assumption of normally distributed coefficients is incorrect?

I’m seeing strongly leptokurtic random slopes in a context where I have substantive interest in the shape of this distribution. That is, it would be useful to know if there are more individuals with “extreme” and fewer with “moderate” slopes than you’d expect of a normal distribution.

My reply: You can fit a mixture model, or even better you can have a group-level predictor that breaks up your data appropriately. To put it another way: What are your groups? And which are the groups that have low slopes and which have high slopes? Or which have slopes near the middle of the distribution and which have extreme slopes? You could fit a mixture model where the variance varies, but I think you’d be better off with a model using group-level predictors. Also I recommend using Stan which is more flexible than lmer and gives you the full posterior distribution.

Doré then added:

My groups are different people reporting life satisfaction annually surrounding a stressful life event (divorce, bereavement, job loss). I take it that the kurtosis is a clue that there are unobserved person-level factors driving this slope variability? With my current data I don’t have any person-level predictors that could explain this variability, but certainly it would be good to try to find some.
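The intuition that unobserved person-level groupings can produce heavy tails is easy to check with a quick simulation (a Python sketch with invented numbers, not Doré's data): a population that mixes two latent groups with different slope variances yields a leptokurtic marginal distribution of slopes, even though each group is normal.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Two latent groups: most people have small slope variance, a minority large.
is_extreme = rng.random(n) < 0.1
slopes = np.where(is_extreme,
                  rng.normal(0.0, 3.0, n),   # "extreme responders"
                  rng.normal(0.0, 0.5, n))   # "moderate responders"

# Excess kurtosis: 0 for a normal distribution, positive for heavy tails.
z = (slopes - slopes.mean()) / slopes.std()
excess_kurtosis = np.mean(z**4) - 3
print(f"excess kurtosis: {excess_kurtosis:.2f}")  # well above 0
```

This is the sense in which the kurtosis is a clue: if a group-level predictor could separate the two latent populations, the within-group slope distributions would look much closer to normal.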

Postdoc in Finland with Aki

I’m looking for a postdoc to work with me at Aalto University, Finland.

The person hired will participate in research on Gaussian processes, functional constraints, big data, approximate Bayesian inference, model selection and assessment, deep learning, and survival analysis models (e.g., cardiovascular diseases and cancer). Methods will be implemented mostly in GPy and Stan. The research will be done in collaboration with Columbia University (Andrew and the Stan group), University of Sheffield, Imperial College London, Technical University of Denmark, The National Institute for Health and Welfare, University of Helsinki, and Helsinki University Central Hospital.

See more details here

Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments

At Bank Underground:

When studying the effects of interventions on individual behavior, the experimental research template is typically: Gather a bunch of people who are willing to participate in an experiment, randomly divide them into two groups, assign one treatment to group A and the other to group B, then measure the outcomes. If you want to increase precision, do a pre-test measurement on everyone and use that as a control variable in your regression. But in this post I argue for an alternative approach—study individual subjects using repeated measures of performance, with each one serving as their own control.

As long as your design is not constrained by ethics, cost, realism, or a high drop-out rate, the standard randomized experiment approach gives you clean identification. And, by ramping up your sample size N, you can get all the precision you might need to estimate treatment effects and test hypotheses. Hence, this sort of experiment is standard in psychology research and has been increasingly popular in political science and economics with lab and field experiments.

However, the clean simplicity of such designs has led researchers to neglect important issues of measurement . . .

I summarize:

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement. Real-world experiments are imperfect—they do have issues with ethics, cost, realism, and high drop-out, and the strategy of doing an experiment and then grabbing statistically-significant comparisons can leave a researcher with nothing but a pile of noisy, unreplicable findings.

Measurement is central to economics—it’s the link between theory and empirics—and it remains important, whether studies are experimental, observational, or some combination of the two.

I have no idea who reads that blog but it’s always good to try to reach new audiences.

Evil collaboration between Medtronic and FDA

Paul Alper points us to this news article by Jim Spencer and Joe Carlson that has this amazing bit:

Medtronic ran a retrospective study of 3,647 Infuse patients from 2006-2008 but shut it down without reporting more than 1,000 “adverse events” to the government within 30 days, as the law required.

Medtronic, which acknowledges it should have reported the information promptly, says employees misfiled it. The company eventually reported the adverse events to the FDA more than five years later.

Medtronic filed four individual death reports from the study in July 2013. Seven months later, the FDA posted a three-sentence summary of 1,039 other adverse events from the Infuse study, but deleted the number from public view, calling it a corporate trade secret.

Wow. I feel bad for that FDA employee who did this: it must be just horrible to have to work for the government when you have such exquisite sensitivity to corporate secrets. I sure hope that he or she gets a good job in some regulated industry after leaving government service.

Bayesian inference completely solves the multiple comparisons problem


I promised I wouldn’t do any new blogging until January but I’m here at this conference and someone asked me a question about the above slide from my talk.

The point of the story in that slide is that flat priors consistently give bad inferences. Or, to put it another way, the routine use of flat priors results in poor frequency properties in realistic settings where studies are noisy and effect sizes are small. (More here.)

Saying it that way, it’s obvious: Bayesian methods are calibrated if you average over the prior. If the distribution of effect sizes that you average over is not the same as the prior distribution that you’re using in the analysis, your Bayesian inferences will in general have problems.

But, simple as this statement is, the practical implications are huge, because it’s standard to use flat priors in Bayesian analysis (just see most of the examples in our books!) and it’s even more standard to take classical maximum likelihood or least squares inferences and interpret them Bayesianly, for example interpreting a 95% interval that excludes zero as strong evidence for the sign of the underlying parameter.

In our 2000 paper, “Type S error rates for classical and Bayesian single and multiple comparison procedures,” Francis Tuerlinckx and I framed this in terms of researchers making “claims with confidence.” In classical statistics, you make a claim with confidence on the sign of an effect if the 95% confidence interval excludes zero. In Bayesian statistics, one can make a comparable claim with confidence if the 95% posterior interval excludes zero. With a flat prior, these two are the same. But with a Bayesian prior, they are different. In particular, with normal data and a normal prior centered at 0, the Bayesian interval is always more likely to include zero, compared to the classical interval; hence we can say that Bayesian inference is more conservative, in being less likely to result in claims with confidence.

Here’s the relevant graph from that 2000 paper:


This plot shows the probability of making a claim with confidence, as a function of the variance ratio, based on the simple model:

True effect theta is simulated from normal(0, tau).
Data y are simulated from normal(theta, sigma).
Classical 95% interval is y +/- 2*sigma.
Bayesian 95% interval is theta.hat.bayes +/- 2*theta.se.bayes,
where theta.hat.bayes = y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
and theta.se.bayes = sqrt(1 / (1/sigma^2 + 1/tau^2))

What’s really cool here is what happens when tau/sigma is near 0, which we might call the “Psychological Science” or “PPNAS” domain. In that limit, the classical interval has a 5% chance of excluding 0. Of course, that’s what the 95% interval is all about: if there’s no effect, you have a 5% chance of seeing something.

But . . . look at the Bayesian procedure. There, the probability of a claim with confidence is essentially 0 when tau/sigma is low. This is right: in this setting, the data only very rarely supply enough information to determine the sign of any effect. But this can be counterintuitive if you have classical statistical training: we’re so used to hearing about 5% error rate that it can be surprising to realize that, if you’re doing things right, your rate of making claims with confidence can be much lower.
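Both curves in that plot have closed forms. Here is a quick Python check of the limiting behavior (my own sketch, not code from the paper; sigma is fixed at 1); the tau = 1 values agree with the simulation results reported further down (15.8% classical, 4.6% Bayesian):

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def claim_probs(tau, sigma=1.0):
    """P(95% interval excludes 0), classical and Bayesian, under the model
    theta ~ normal(0, tau), y ~ normal(theta, sigma)."""
    sd_y = sqrt(tau**2 + sigma**2)            # marginal sd of y
    # Classical: claim with confidence iff |y| > 2*sigma.
    p_classical = 2 * (1 - Phi(2 * sigma / sd_y))
    # Bayesian: theta.hat.bayes = y * shrink; claim iff |theta.hat.bayes| > 2*se,
    # i.e. iff |y| exceeds the threshold 2*se/shrink.
    shrink = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
    se = sqrt(1 / (1 / sigma**2 + 1 / tau**2))
    p_bayes = 2 * (1 - Phi((2 * se / shrink) / sd_y))
    return p_classical, p_bayes

# tau/sigma near 0: classical stays near 5%, Bayesian collapses toward 0.
print(claim_probs(tau=0.1))
# tau = sigma: both make more claims, the Bayesian procedure still fewer.
print(claim_probs(tau=1.0))
```

The key line is the Bayesian threshold 2*se/shrink: as tau shrinks, the posterior mean is pulled so hard toward zero that only absurdly large y would yield a claim, which is why that curve hugs zero on the left of the plot.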

We are assuming here that the prior distribution and the data model are correct—that is, we compute probabilities by averaging over the data-generating process in our model.

Multiple comparisons

OK, so what does this have to do with multiple comparisons? The usual worry is that if we are making a lot of claims with confidence, we can be way off if we don’t do some correction. And, indeed, with the classical approach, if tau/sigma is small, you’ll still be making claims with confidence 5% of the time, and a large proportion of these claims will be in the wrong direction (a “type S,” or sign, error) or much too large (a “type M,” or magnitude, error), compared to the underlying truth.

With Bayesian inference (and the correct prior), though, this problem disappears. Amazingly enough, you don’t have to correct Bayesian inferences for multiple comparisons.

I did a demonstration in R to show this, simulating a million comparisons and seeing what the Bayesian method does.

Here’s the R code:


library("arm")  # provides fround(), used for formatted rounding below

spidey <- function(sigma, tau, N) {
  cat("sigma = ", sigma, ", tau = ", tau, ", N = ", N, "\n", sep="")
  theta <- rnorm(N, 0, tau)
  y <- theta + rnorm(N, 0, sigma)
  signif_classical <- abs(y) > 2*sigma
  cat(sum(signif_classical), " (", fround(100*mean(signif_classical), 1), "%) of the 95% classical intervals exclude 0\n", sep="")
  cat("Mean absolute value of these classical estimates is", fround(mean(abs(y)[signif_classical]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_classical]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(y))[signif_classical]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
  theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
  theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
  signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
  cat(sum(signif_bayes), " (", fround(100*mean(signif_bayes), 1), "%) of the 95% posterior intervals exclude 0\n", sep="")
  cat("Mean absolute value of these Bayes estimates is", fround(mean(abs(theta_hat_bayes)[signif_bayes]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_bayes]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(theta_hat_bayes))[signif_bayes]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
}

sigma <- 1
tau <- .5
N <- 1e6
spidey(sigma, tau, N)

Here's the first half of the results:

sigma = 1, tau = 0.5, N = 1e+06
73774 (7.4%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.45 
Mean absolute value of the corresponding true parameters is 0.56 
13.9% of these are the wrong sign (Type S error)

So, when tau is half of sigma, the classical procedure yields claims with confidence 7% of the time. The estimates are huge (after all, they have to be at least two standard errors from 0), much higher than the underlying parameters. And 14% of these claims with confidence are in the wrong direction.

The next half of the output shows the results from the Bayesian intervals:

62 (0.0%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 0.95 
Mean absolute value of the corresponding true parameters is 0.97 
3.2% of these are the wrong sign (Type S error)

When tau is half of sigma, Bayesian claims with confidence are extremely rare. When there is a Bayesian claim with confidence, it will be large. That makes sense: the posterior standard error is sqrt(1/(1/1 + 1/.5^2)) = 0.45, so any posterior mean corresponding to a Bayesian claim with confidence here will have to be at least 0.9. The average for these million comparisons turns out to be 0.94.

So, hey, watch out for selection effects! But no, not at all. If we look at the underlying true effects corresponding to these claims with confidence, these have a mean of 0.97 (in this simulation; in other simulations of a million comparisons, we get means such as 0.89 or 1.06). And very few of these are in the wrong direction; indeed, with enough simulations you’ll find a type S error rate of a bit less than 2.5%, which is what you’d expect: these 95% posterior intervals exclude 0, so something less than 2.5% of each interval will be of the wrong sign.

So, the Bayesian procedure only very rarely makes a claim with confidence. But, when it does, it's typically picking up something real, large, and in the right direction.

We then re-ran with tau = 1, a world in which the standard deviation of true effects is equal to the standard error of the estimates:

sigma <- 1
tau <- 1
N <- 1e6
spidey(sigma, tau, N)

And here's what we get:

sigma = 1, tau = 1, N = 1e+06
157950 (15.8%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.64 
Mean absolute value of the corresponding true parameters is 1.34 
3.9% of these are the wrong sign (Type S error)
45634 (4.6%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 1.68 
Mean absolute value of the corresponding true parameters is 1.69 
1.0% of these are the wrong sign (Type S error)

The classical estimates remain too high, on average about twice as large as the true effect sizes; the Bayesian procedure is more conservative, making fewer claims with confidence and not overestimating effect sizes.

Bayes does better because it uses more information

We should not be surprised by these results. The Bayesian procedure uses more information and so it can better estimate effect sizes.

But this can seem like a problem: what if this prior information on theta isn’t available? I have two answers. First, in many cases, some prior information is available. Second, if you have a lot of comparisons, you can fit a multilevel model and estimate tau. Thus, what can seem like the worst multiple comparisons problems are not so bad.

One should also be able to obtain comparable results non-Bayesianly by setting a threshold so as to control the type S error rate. The key is to go beyond the false-positive, false-negative framework, to set the goals of estimating the sign and magnitudes of the thetas rather than to frame things in terms of the unrealistic and uninteresting theta=0 hypothesis.
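As a rough illustration of that idea (a Python sketch reusing the same simulation setup as above, tau = 0.5 and sigma = 1; the particular thresholds scanned are my arbitrary choices, not values from the post), raising the claim threshold above the default 2 standard errors drives down the type S error rate among the surviving claims:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, tau, n = 1.0, 0.5, 1_000_000
theta = rng.normal(0, tau, n)          # true effects
y = theta + rng.normal(0, sigma, n)    # noisy estimates

def type_s_rate(c):
    """Type S error rate among the claims made when |y| > c*sigma."""
    claims = np.abs(y) > c * sigma
    return float(np.mean(np.sign(theta[claims]) != np.sign(y[claims])))

# A stricter threshold means far fewer claims, but the surviving claims
# get the sign right much more often.
for c in (2.0, 3.0, 4.0):
    print(f"threshold {c} se: type S rate {type_s_rate(c):.3f}")
```

Tuning the threshold this way is a frequency calculation, not a Bayesian one, but it targets the quantity that matters (the sign of theta) rather than the point null theta = 0.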

P.S. Now I know why I swore off blogging! The analysis, the simulation, and the writing of this post took an hour and a half of my work time.

P.P.S. Sorry for the ugly code. Let this be a motivation for all of you to learn how to code better.

One more thing you don’t have to worry about

Baruch Eitam writes:

So I have been convinced of the futility of NHT for my scientific goals and of the futility of significance testing (in the sense of using p-values as a measure of the strength of evidence against the null). So convinced that I have been teaching this for the last 2 years. Yesterday I bumped into this paper [“To P or not to P: on the evidential nature of P-values and their place in scientific inference,” by Michael Lew] which I thought makes a very strong argument for the validity of using significance testing for the above purpose. Furthermore, by his 1:1 mapping of p-values to likelihood functions he kind of obliterates the difference between the Bayesian and frequentist perspectives. My questions are: 1. Is his argument sound? 2. What does this mean regarding the use of p-values as measures of strength of evidence?

I replied that it all seems a bit nuts to me. If you’re not going to use p-values for hypothesis testing (and I agree with the author that this is not a good idea), why bother with p-values at all? It seems weird to use p-values to summarize the likelihood; why not just use the likelihood and do Bayesian inference directly? Regarding that latter point, see this paper of mine on p-values.

Eitam followed up:

But aren’t you surprised that the p-values do summarize the likelihood?

I replied that I did not read the paper in detail, but for any given model and sample size, I guess it makes sense that any two measures of evidence can be mapped to each other.

On deck this week

Mon: One more thing you don’t have to worry about

Tues: Evil collaboration between Medtronic and FDA

Wed: His varying slopes don’t seem to follow a normal distribution

Thurs: A day in the life

Fri: Letters we never finished reading

Sat: Better to just not see the sausage get made

Sun: Oooh, it burns me up

Taking Bayesian Inference Seriously [my talk tomorrow at Harvard conference on Big Data]

Mon 22 Aug, 9:50am, at Harvard Science Center Hall A:

Taking Bayesian Inference Seriously

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors also can resolve some of the questions of replication and multiple comparisons that have recently shaken the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

Kaiser Fung on the ethics of data analysis

Kaiser gave a presentation and he’s sharing the slides with us here. It’s important stuff.

Michael Porter as new pincushion

Some great comments on this post about Ted talk visionary Michael Porter. Most rewarding was this from Howard Edwards:

New Zealand seems to score well on his index so perhaps I shouldn’t complain, but Michael Porter was well known in this part of the world 25 years ago when our government commissioned him to write a report titled “Upgrading New Zealand’s Competitive Advantage” (but known colloquially as the Porter Project.) Back then (perhaps not quite so much now) our government departments were in thrall of any overseas “expert” who could tell us what to do, and especially so if their philosophy happened to align with that of the government of the day.

Anyway this critique written at the time by one of our leading political economists suggests that his presentation and analysis skills weren’t the greatest back then either.

I followed the link and read the article by Brian Easton, which starts out like this:

Flavour of the moment is Upgrading New Zealand’s Competitive Advantage, the report of the so-called Porter Project. Its 178 pages (plus appendices) are riddled with badly labelled graphs; portentous diagrams which, on reflection, say nothing; chummy references to “our country”, when two of the three authors are Americans; and platitudes dressed up as ‘deep and meaningful’ sentiments.

Toward the end of the review, Easton sums up:

It would be easy enough to explain this away as the usual shallowness of a visiting guru passing through. But New Zealand’s Porter Project spent about $1.5 million (of taxpayers’ money) on a report which is largely a recycling of conventional wisdom and material published elsewhere. Even if there were more and deeper case studies, the return on the money expended would still be low.

But that’s just leading up to the killer blow:

Particularly galling is the book’s claim that we should improve the efficiency of government spending. The funding of this report would have been a good place to start. It must be a candidate for the lowest productivity research publication ever funded by government.

In all seriousness, I expect that Michael Porter is so used to getting paid big bucks that he hardly noticed where the $1.5 million went. (I guess that’s 1.5 million New Zealand dollars, so something like $750,000 U.S.) Wasteful government spending on other people, sure, that’s horrible, but when the wasteful government spending goes directly to you, that’s another story.

Things that sound good but aren’t quite right: Art and research edition

There are a lot of things you can say that sound very sensible but, upon reflection, are missing something.

For example consider this blog comment from Chris G:

Years ago I heard someone suggest these three questions for assessing a work of art:

1. What was the artist attempting to do?
2. Were they successful?
3. Was it worth doing?

I think those apply equally well to assessing research.

The idea of applying these same standards to research as to art, that was interesting. And the above 3 questions sounded good too—at first. But then I got to thinking about all sorts of art and science that didn’t fit the above rules. As I wrote:

There are many cases of successful art, and for that matter successful research, that were created by accident, where the artist or researcher was just mucking around, or maybe just trying to do something to pay the bills, and something great came out of it.

I’m not saying you’ll get much from completely random mucking around of the monkeys-at-a-typewriter variety. And in general I do believe in setting goals and working toward them. But artistic and research success often does seem to come in part by accident, or as a byproduct of some other goals.

An ethnographic study of the “open evidential culture” of research psychology

Claude Fischer points me to this paper by David Peterson, “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” which begins:

Science studies scholars have shown that the management of natural complexity in lab settings is accomplished through a mixture of technological standardization and tacit knowledge by lab workers. Yet these strategies are not available to researchers who study difficult research objects. Using 16 months of ethnographic data from three laboratories that conduct experiments on infants and toddlers, the author shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor. This research raises important questions regarding the value of restrictive evidential cultures in challenging research environments.

And it concludes:

Open evidential cultures may be defensible under certain conditions. When problems are pressing and progress needs to be made quickly, creativity may be prized over ascetic rigor. Certain areas of medical or environmental science may meet this criterion. Developmental psychology does not. However, it may meet a second criterion. When research findings are not tightly coupled with some piece of material or social technology—that is, when the “consumers” of such science do not significantly depend on the veracity of individual articles—then local culture can function as an internal mechanism for evaluation in the field. Similar to the way oncologists use a “web of trials” rather than relying on a single, authoritative study or how weather forecasters use multiple streams of evidence and personal experience to craft a prediction, knowledge in such fields may develop positively even in a literature that contains more false positives than would be expected by chance alone.

It’s an interesting article, because usually discussions of research practices are all about what is correct, what should be done or not done, what do the data really tell us, etc. But here we get an amusing anthropological take on things, treating scientists’ belief in their research findings with the same respect that we treat tribal religious beliefs. This paper is not normative, it’s descriptive. And description is important. As I often say, if we want to understand the world, it helps to know what’s actually happening out there!

I like the term “open evidential culture”: it’s descriptive without being either condescending, on one hand, or apologetic, on the other.

Stan Course up North (Anchorage, Alaska) 23–24 Aug 2016

Daniel Lee’s heading up to Anchorage, Alaska to teach a two-day Stan course at the Alaska chapter of the American Statistical Association (ASA) meeting in Anchorage. Here’s the rundown:

I hear Alaska’s beautiful in the summer—16-hour days in August and high temps of 17 degrees Celsius. Plus Stan!

More Upcoming Stan Events

All of the Stan-related events of which we are aware are listed on:

After Alaska, Daniel and Michael Betancourt will be joining me in Paris, France on 19–21 September to teach a three-day course on Pharmacometric Modeling using Stan. PK/PD in Stan is now a whole lot easier after Sebastian Weber integrated CVODES (pun intended) to solve stiff differential equations with control over tolerances and max steps per iteration.

The day after the course in Paris, on 22 September, we (with Julie Bertrand and France Mentre) are hosting a one-day Workshop on Pharmacometric Modeling with Stan.

Your Event Here

Let us know if you hear about other Stan-related events (meetups, courses, workshops) and we can post them on our events page and advertise them right here on the blog.

What’s gonna happen in November?

Nadia Hassan writes:

2016 may be strange with Trump. Do you have any thoughts on how people might go about modeling a strange election? When I asked you about predictability and updating election forecasts, you stated that models that rely on polls at different points should be designed to allow for surprises. You have touted the power of weakly informative priors. Could those be a good tool for this situation?

I received this message on 4 Apr and I’m typing this on 9 Apr but it’s 17 Aug in blog time. So you’re actually reading a response that’s 4 months old.

What is it that they say: History is journalism plus time? I guess political science is political journalism plus time.

Anyway . . . whenever people asked me about the primary elections, I’d point them to my 2011 NYT article, Why Are Primaries Hard to Predict? Here’s the key bit:

Presidential general election campaigns have several distinct features that distinguish them from most other elections:

1. Two major candidates;
2. The candidates clearly differ in their political ideologies and in their positions on economic issues;
3. The two sides have roughly equal financial and organizational resources;
4. The current election is the latest in a long series of similar contests (every four years);
5. A long campaign, giving candidates a long time to present their case and giving voters a long time to make up their minds.

OK, now to Hassan’s question. I don’t really have a good answer! I guess I’d take as a starting point the prediction from a Hibbs-like model predicting the election based on economic conditions during the past year, presidential popularity, and party balancing. Right now the economy seems to be going OK though not great, Obama is reasonably popular, and party balancing favors the Democrats because the Republicans control both houses of Congress. So I’m inclined to give the Democratic candidate (Hillary Clinton, I assume) the edge. But that’s just my guess; I haven’t run the numbers. There’s also evidence from various sources that more extreme candidates don’t do so well, so if Sanders is the nominee, I’d assume he’d get a couple percentage points less than Clinton would. Trump . . . it’s hard to say. He’s not ideologically extreme; on the other hand, he is so unpopular (even more so than Clinton) that it’s hard to know what to say. So I find this a difficult election to predict. And once August rolls around, it’s likely there will be some completely different factors that I haven’t even thought about! From a statistical point of view, I guess I’d just add an error term, which would increase my posterior uncertainty.
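To make that last point concrete, here’s a toy sketch (all numbers invented, not a real forecast) of how an extra “strange election” error term widens a forecast interval. The variances add, so the interval with the extra term is strictly wider:

```python
from statistics import NormalDist

def interval(point, sd, level=0.95):
    """Central forecast interval for a normal forecast distribution."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    return (point - z * sd, point + z * sd)

# Hypothetical fundamentals-based forecast of Democratic two-party vote share.
point, model_sd = 0.52, 0.02           # made-up numbers
extra_sd = 0.03                        # added "strange election" error term
total_sd = (model_sd**2 + extra_sd**2) ** 0.5  # variances add

lo1, hi1 = interval(point, model_sd)
lo2, hi2 = interval(point, total_sd)
print(f"without extra term: ({lo1:.3f}, {hi1:.3f})")
print(f"with extra term:    ({lo2:.3f}, {hi2:.3f})")
```

The point prediction is unchanged; only the stated uncertainty grows.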

It’s not so satisfying to say this, but I don’t have much to offer as an election forecast beyond what you could read in any newspaper. I’m guessing that statistical tools will be more relevant in modeling what will happen in individual states, relative to the national average. As Kari Lock and I wrote a few years ago, it can be helpful to decompose national trends and the positions of the states. So maybe by the time this post appears here, I’ll have more to say.

P.S. This seems like a natural for the sister blog but I’m afraid the Washington Post readers would get so annoyed at me for saying I can’t make a good forecast! So I’m posting it here.

How schools that obsess about standardized tests ruin them as measures of success


Mark Palko and I wrote this article comparing the Success Academy chain of charter schools to Soviet-era factories:

According to the tests that New York uses to evaluate schools, Success Academies ranks at the top of the state — the top 0.3 percent in math and the top 1.5 percent in English, according to the founder of the Success Academies, Eva Moskowitz. That rivals or exceeds the performance of public schools in districts where homes sell for millions of dollars.

But it took three years before any Success Academy students were accepted into New York City’s elite high school network — and not for lack of trying. After two years of zero-percent acceptance rates, the figure rose to 11 percent this year, still considerably short of the 19 percent citywide average.

News coverage of those figures emphasized that that acceptance rate was still higher than the average for students of color (the population Success Academy mostly serves). But from a statistical standpoint, we would expect extremely high scores on the state exam to go along with extremely high scores on the high school application exams. It’s not clear why race should be a factor when interpreting one and not the other.

The explanation for the discrepancy would appear to be that in high school admissions, everybody is trying hard, so the motivational tricks and obsessive focus on tests at Success Academy schools have less of an effect. Routine standardized tests are, by contrast, high stakes for schools but low stakes for students. Unless prodded by teachers and anxious administrators, the typical student may be indifferent about his or her performance. . . .

We summarize:

In general, competition is good, as are market forces and data-based incentives, but they aren’t magic. They require careful thought and oversight to prevent gaming and what statisticians call model decay. . . .

What went wrong with Success Academy is, paradoxically, what also seems to have gone right. Success Academy schools have excelled at selecting out students who will perform poorly on state tests and then preparing their remaining students to test well. But their students do not do so well on tests that matter to the students themselves.

Like those Soviet factories, Success Academy and other charter schools have been under pressure to perform on a particular measure, and are reminding us once again what Donald Campbell told us 40 years ago: Tampering with the speedometer won’t make the car go faster.

Calorie labeling reduces obesity? Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC than in some other places on the west coast and northeast that didn’t have calorie labeling

Ted Kyle writes:

I wonder if you might have some perspective to offer on this analysis by Partha Deb and Carmen Vargas regarding restaurant calorie counts.

[Thin columnist] Cass Sunstein says it proves “that calorie labels have had a large and beneficial effect on those who most need them.”

I wonder about the impact of using self-reported BMI as a primary input and also the effect of confounding variables. Someone also suggested that investigator degrees of freedom is an important consideration.

They’re using data from a large national survey (Behavioral Risk Factor Surveillance System) and comparing self-reported body mass index of people who lived in counties with calorie-labeling laws, compared to counties without such laws, and they come up with these (distorted) maps:

[Two maps comparing counties with and without calorie-labeling laws]

Here’s their key finding:

[Table of estimated BMI trends in labeling vs. no-labeling counties, under two model specifications]

The two columns correspond to two different models they used to adjust for demographic differences between the people in the two groups of counties. As you can see, average BMI seems to have increased faster in the no-calorie-labeling counties.

On the other hand, if you look at the map, it seems like they’re comparing {California, Seattle, Portland (Oregon), and NYC} to everyone else (with Massachusetts somewhere in the middle), and there are big differences between these places. So I don’t know how seriously we can attribute the differences between those trends to food labeling.
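Here’s a toy simulation (all numbers invented) of the identification problem. Suppose the true labeling effect is zero, but the labeled counties happen to have a slower underlying BMI trend for unrelated regional reasons. A simple comparison of group trends then recovers that regional difference and misreads it as a labeling effect:

```python
import random

random.seed(1)

def simulate_trends(n, slope, noise=0.1):
    """Yearly mean-BMI changes for n counties sharing an underlying slope."""
    return [slope + random.gauss(0, noise) for _ in range(n)]

# True labeling effect is zero by construction; the labeled (west coast / NYC)
# counties simply have a slower underlying BMI trend for other reasons.
labeled = simulate_trends(50, slope=0.10)     # BMI units per year
unlabeled = simulate_trends(200, slope=0.15)

mean = lambda xs: sum(xs) / len(xs)
apparent_effect = mean(labeled) - mean(unlabeled)
print(f"apparent 'labeling effect': {apparent_effect:.3f} BMI/yr")
# Typically negative, even though labeling did nothing in this simulation.
```

The comparison can only be interpreted causally if the two groups of counties would otherwise have had the same trends, which is exactly what’s in doubt here.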

Also, figure 5 of that paper, showing covariate balance, is just goofy. I recommend simple and more readable dotplots as in chapter 10 of ARM. Figure 4 is a bit mysterious too, I’m not quite clear on what is gained by the barplots on the top; aren’t they just displaying the means of the normal distributions on the bottom? And Figures 1 and 2, the maps, look weird: they’re using some bad projection, maybe making the rookie mistake of plotting latitude vs. longitude, not realizing that when you’re away from the equator one degree of latitude is not the same distance as one degree of longitude.

As to the Cass Sunstein article (“Calorie Counts Really Do Fight Obesity”), yeah, it seems a bit hypey. Key Sunstein quote: “All in all, it’s a terrific story.” Even aside from the causal identification issues discussed above, don’t forget that the difference between “significant” and “not significant” is not itself statistically significant.

Speaking quite generally, I agree with Sunstein when he writes:

A new policy might have modest effects on Americans as a whole, but big ones on large subpopulations. That might be exactly the point! It’s an important question to investigate.

But of course researchers—even economists—have been talking about varying treatment effects for a while. So to say we can draw this “large lesson” from this particular study . . . again, a bit of hype going on here. It’s fine for Sunstein if this particular paper has woken him up to the importance of interactions, but let’s not let his excitement about the general concept, and his eagerness to tell a “terrific story” and translate it into policy, distract us from the big problems of interpreting the claims made in this paper.

And, to return to the multiple comparisons issue, ultimately what’s important is not so much what the investigators did or might have done, but rather what the data say. I think the right approach would be some sort of hierarchical model that allows for effects in all groups, rather than a search for a definitive result in some group or another.
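A minimal sketch of what I mean by partial pooling, using empirical-Bayes-style normal shrinkage rather than a full hierarchical fit (all numbers hypothetical): each subgroup’s estimate is pulled toward the overall mean in proportion to its noise, so a wild estimate from a small, noisy group doesn’t dominate the story.

```python
from statistics import mean

def partial_pool(estimates, ses, tau):
    """Shrink per-group estimates toward the grand mean.

    tau is the assumed between-group sd; groups with large standard
    errors get pulled most strongly toward the overall mean."""
    mu = mean(estimates)
    pooled = []
    for est, se in zip(estimates, ses):
        w = tau**2 / (tau**2 + se**2)   # weight on the group's own estimate
        pooled.append(mu + w * (est - mu))
    return pooled

# Hypothetical subgroup effects of labeling on BMI trend; the first group
# has a dramatic estimate but also the largest standard error.
estimates = [-0.40, 0.05, 0.10, -0.05]
ses = [0.30, 0.10, 0.15, 0.10]
print(partial_pool(estimates, ses, tau=0.1))
```

The noisy −0.40 gets shrunk most of the way toward the grand mean, which is the point: the model reports effects for all groups at once instead of certifying one subgroup as the definitive finding.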

P.S. Kyle referred to the article by Deb and Vargas as a “NBER analysis” but that’s not quite right. NBER is just a consortium that publishes these papers. To call their paper an NBER analysis would be like calling this blog post “a WordPress analysis” because I happen to be using this particular software.