
Hey pollsters! Poststratify on party ID, or we’re all gonna have to do it for you.

Alan Abramowitz writes:

In five days, Clinton’s lead increased from 5 points to 12 points. And Democratic party ID margin increased from 3 points to 10 points.

No, I don’t think millions of voters switched to the Democratic party. I think Democrats were just more likely to respond in that second poll. And, remember, survey response rates are around 10%, whereas presidential election turnout is around 60%, so it makes sense that we’d see big swings in differential nonresponse to polls, swings that we would not expect to map onto comparable swings in differential voting turnout.

We’ve been writing about this a lot recently. Remember this post, and this earlier graph from Abramowitz:

[Earlier graph from Abramowitz]

and this news article with David Rothschild, and this research article with Rothschild, Doug Rivers, and Sharad Goel, and this research article from 2001 with Cavan Reilly and Jonathan Katz? The cool kids know about this stuff.

I’m telling you this for free cos, hey, it’s part of my job as a university professor. (The job is divided into teaching, research, and service; this is service.) But I know that there are polling and news organizations that make money off this sort of thing. So, my advice to you: start poststratifying on party ID. It’ll give you a leg up on the competition.

That is, assuming your goal is to assess opinion and not just to manufacture news. If what you’re looking for is headlines, then by all means go with the raw poll numbers. They jump around like nobody’s business.
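To make this concrete, here’s a minimal sketch in R of poststratifying a poll on party ID. The data frame poll, its variables, and the assumed population distribution of party ID are all hypothetical, just for illustration:

# Hypothetical poll data: one row per respondent, with party ID and vote intention.
# party_dist is an assumed, fixed population distribution of party ID (made up here).
party_dist <- c(Dem = 0.33, Rep = 0.30, Ind = 0.37)

# Estimate Clinton's share within each party-ID cell
cell_est <- tapply(poll$vote == "Clinton", poll$party, mean)

# Reweight the cell estimates by the fixed party-ID distribution, so that swings
# in who happens to respond don't show up as swings in estimated opinion
poststrat_est <- sum(cell_est[names(party_dist)] * party_dist)
poststrat_est

The party-ID mix in the weighted estimate stays fixed from poll to poll, so differential nonresponse by party no longer shows up as a swing in the topline.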

His varying slopes don’t seem to follow a normal distribution

Bruce Doré writes:

I have a question about multilevel modeling I’m hoping you can help with.

What should one do when random effects coefficients are clearly not normally distributed (i.e., coef(lmer(y~x+(x|id))) )? Is this a sign that the model should be changed? Or can you stick with this model and infer that the assumption of normally distributed coefficients is incorrect?

I’m seeing strongly leptokurtic random slopes in a context where I have substantive interest in the shape of this distribution. That is, it would be useful to know if there are more individuals with “extreme” and fewer with “moderate” slopes than you’d expect of a normal distribution.

My reply: You can fit a mixture model, or even better you can have a group-level predictor that breaks up your data appropriately. To put it another way: What are your groups? And which are the groups that have low slopes and which have high slopes? Or which have slopes near the middle of the distribution and which have extreme slopes? You could fit a mixture model where the variance varies, but I think you’d be better off with a model using group-level predictors. Also I recommend using Stan which is more flexible than lmer and gives you the full posterior distribution.

Doré then added:

My groups are different people reporting life satisfaction annually surrounding a stressful life event (divorce, bereavement, job loss). I take it that the kurtosis is a clue that there are unobserved person-level factors driving this slope variability? With my current data I don’t have any person-level predictors that could explain this variability, but certainly it would be good to try to find some.
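To be concrete, here’s a rough sketch in R of the kind of thing I have in mind; the data frame d, the variables, and the person-level predictor z are hypothetical stand-ins, not from Doré’s study:

library("lme4")

# d: hypothetical long-format data with one row per person-year:
#   satisfaction (outcome), year (time relative to the event), id (person),
#   and z (a hypothetical person-level predictor)
fit_base <- lmer(satisfaction ~ year + (year | id), data = d)

# Look at the distribution of the estimated person-level slopes
slopes <- coef(fit_base)$id[, "year"]
hist(slopes)

# A person-level predictor that interacts with time can soak up some of the
# slope variation that otherwise shows up as heavy tails
fit_z <- lmer(satisfaction ~ year * z + (year | id), data = d)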

Postdoc in Finland with Aki

I’m looking for a postdoc to work with me at Aalto University, Finland.

The person hired will participate in research on Gaussian processes, functional constraints, big data, approximate Bayesian inference, model selection and assessment, deep learning, and survival analysis models (e.g., for cardiovascular diseases and cancer). Methods will be implemented mostly in GPy and Stan. The research will be carried out in collaboration with Columbia University (Andrew and the Stan group), University of Sheffield, Imperial College London, Technical University of Denmark, the National Institute for Health and Welfare, University of Helsinki, and Helsinki University Central Hospital.

See more details here

Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments

At Bank Underground:

When studying the effects of interventions on individual behavior, the experimental research template is typically: Gather a bunch of people who are willing to participate in an experiment, randomly divide them into two groups, assign one treatment to group A and the other to group B, then measure the outcomes. If you want to increase precision, do a pre-test measurement on everyone and use that as a control variable in your regression. But in this post I argue for an alternative approach—study individual subjects using repeated measures of performance, with each one serving as their own control.

As long as your design is not constrained by ethics, cost, realism, or a high drop-out rate, the standard randomized experiment approach gives you clean identification. And, by ramping up your sample size N, you can get all the precision you might need to estimate treatment effects and test hypotheses. Hence, this sort of experiment is standard in psychology research and has been increasingly popular in political science and economics with lab and field experiments.

However, the clean simplicity of such designs has led researchers to neglect important issues of measurement . . .

I summarize:

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement. Real-world experiments are imperfect—they do have issues with ethics, cost, realism, and high drop-out, and the strategy of doing an experiment and then grabbing statistically-significant comparisons can leave a researcher with nothing but a pile of noisy, unreplicable findings.

Measurement is central to economics—it’s the link between theory and empirics—and it remains important, whether studies are experimental, observational, or some combination of the two.
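To make the precision point concrete, here’s a toy simulation in R; all of the numbers here are made up for illustration:

# Toy simulation: precision of a between-subject comparison vs. a within-person
# repeated-measures design on the same number of people (all values are assumptions)
set.seed(123)
n_sims <- 2000
N <- 50            # people
K <- 20            # measurements per person in the within-person design
effect <- 0.2      # true treatment effect
sd_person <- 1     # person-to-person variation
sd_meas <- 0.5     # measurement noise

est_between <- est_within <- numeric(n_sims)
for (s in 1:n_sims) {
  ability <- rnorm(N, 0, sd_person)
  treat <- rep(c(0, 1), length.out = N)
  # Between-subject: one noisy measurement per person, half the people treated
  y <- ability + effect * treat + rnorm(N, 0, sd_meas)
  est_between[s] <- mean(y[treat == 1]) - mean(y[treat == 0])
  # Within-person: K/2 measurements per person under each condition,
  # so each person serves as their own control
  y_ctrl  <- replicate(K / 2, ability + rnorm(N, 0, sd_meas))
  y_treat <- replicate(K / 2, ability + effect + rnorm(N, 0, sd_meas))
  est_within[s] <- mean(rowMeans(y_treat) - rowMeans(y_ctrl))
}
sd(est_between)  # sampling sd of the between-subject estimate
sd(est_within)   # much smaller: person-level variation cancels out within person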

I have no idea who reads that blog but it’s always good to try to reach new audiences.

Evil collaboration between Medtronic and FDA

Paul Alper points us to this news article by Jim Spencer and Joe Carlson that has this amazing bit:

Medtronic ran a retrospective study of 3,647 Infuse patients from 2006-2008 but shut it down without reporting more than 1,000 “adverse events” to the government within 30 days, as the law required.

Medtronic, which acknowledges it should have reported the information promptly, says employees misfiled it. The company eventually reported the adverse events to the FDA more than five years later.

Medtronic filed four individual death reports from the study in July 2013. Seven months later, the FDA posted a three-sentence summary of 1,039 other adverse events from the Infuse study, but deleted the number from public view, calling it a corporate trade secret.

Wow. I feel bad for that FDA employee who did this: it must be just horrible to have to work for the government when you have such exquisite sensitivity to corporate secrets. I sure hope that he or she gets a good job in some regulated industry after leaving government service.

Bayesian inference completely solves the multiple comparisons problem

[Slide from my talk]

I promised I wouldn’t do any new blogging until January but I’m here at this conference and someone asked me a question about the above slide from my talk.

The point of the story in that slide is that flat priors consistently give bad inferences. Or, to put it another way, the routine use of flat priors results in poor frequency properties in realistic settings where studies are noisy and effect sizes are small. (More here.)

Saying it that way, it’s obvious: Bayesian methods are calibrated if you average over the prior. If the distribution of effect sizes that you average over is not the same as the prior distribution that you’re using in the analysis, your Bayesian inferences in general will have problems.

But, simple as this statement is, the practical implications are huge, because it’s standard to use flat priors in Bayesian analysis (just see most of the examples in our books!) and it’s even more standard to take classical maximum likelihood or least squares inferences and interpret them Bayesianly, for example interpreting a 95% interval that excludes zero as strong evidence for the sign of the underlying parameter.

In our 2000 paper, “Type S error rates for classical and Bayesian single and multiple comparison procedures,” Francis Tuerlinckx and I framed this in terms of researchers making “claims with confidence.” In classical statistics, you make a claim with confidence on the sign of an effect if the 95% confidence interval excludes zero. In Bayesian statistics, one can make a comparable claim with confidence if the 95% posterior interval excludes zero. With a flat prior, these two are the same. But with a Bayesian prior, they are different. In particular, with normal data and a normal prior centered at 0, the Bayesian interval is always more likely to include zero, compared to the classical interval; hence we can say that Bayesian inference is more conservative, in being less likely to result in claims with confidence.

Here’s the relevant graph from that 2000 paper:

[Graph from the 2000 paper: probability of making a claim with confidence, as a function of tau/sigma]

This plot shows the probability of making a claim with confidence, as a function of the variance ratio, based on the simple model:

True effect theta is simulated from normal(0, tau).
Data y are simulated from normal(theta, sigma).
Classical 95% interval is y +/- 2*sigma
Bayesian 95% interval is theta.hat.bayes +/- 2*theta.se.bayes,
where theta.hat.bayes = y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
and theta.se.bayes = sqrt(1 / (1/sigma^2 + 1/tau^2))
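Under this model you can compute both curves in closed form; here’s a little R sketch (mine, not from the paper). Marginally, y ~ normal(0, sqrt(tau^2 + sigma^2)); the classical claim requires |y| > 2*sigma, and the Bayesian claim condition |theta.hat.bayes| > 2*theta.se.bayes works out to |y| > 2*sigma*sqrt(sigma^2 + tau^2)/tau:

# Probability that the 95% classical interval excludes 0: |y| > 2*sigma
p_claim_classical <- function(tau, sigma) 2 * pnorm(-2 * sigma / sqrt(tau^2 + sigma^2))

# Probability that the 95% posterior interval excludes 0, which simplifies to 2*Phi(-2*sigma/tau)
p_claim_bayes <- function(tau, sigma) 2 * pnorm(-2 * sigma / tau)

ratio <- seq(0.01, 5, length.out = 200)
plot(ratio, p_claim_classical(ratio, 1), type = "l", ylim = c(0, 1),
     xlab = "tau / sigma", ylab = "Pr(claim with confidence)")
lines(ratio, p_claim_bayes(ratio, 1), lty = 2)

For tau/sigma = 0.5 these formulas give about 7.4% for the classical procedure and about 0.006% for the Bayesian procedure, consistent with the simulation below.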

What’s really cool here is what happens when tau/sigma is near 0, which we might call the “Psychological Science” or “PPNAS” domain. In that limit, the classical interval has a 5% chance of excluding 0. Of course, that’s what the 95% interval is all about: if there’s no effect, you have a 5% chance of seeing something.

But . . . look at the Bayesian procedure. There, the probability of a claim with confidence is essentially 0 when tau/sigma is low. This is right: in this setting, the data only very rarely supply enough information to determine the sign of any effect. But this can be counterintuitive if you have classical statistical training: we’re so used to hearing about 5% error rate that it can be surprising to realize that, if you’re doing things right, your rate of making claims with confidence can be much lower.

We are assuming here that the prior distribution and the data model are correct—that is, we compute probabilities by averaging over the data-generating process in our model.

Multiple comparisons

OK, so what does this have to do with multiple comparisons? The usual worry is that if we are making a lot of claims with confidence, we can be way off if we don’t do some correction. And, indeed, with the classical approach, if tau/sigma is small, you’ll still be making claims with confidence 5% of the time, and a large proportion of these claims will be in the wrong direction (a “type S,” or sign, error) or much too large (a “type M,” or magnitude, error), compared to the underlying truth.

With Bayesian inference (and the correct prior), though, this problem disappears. Amazingly enough, you don’t have to correct Bayesian inferences for multiple comparisons.

I did a demonstration in R to show this, simulating a million comparisons and seeing what the Bayesian method does.

Here’s the R code:

setwd("~/AndrewFiles/research/multiplecomparisons")
library("arm")

spidey <- function(sigma, tau, N) {
  cat("sigma = ", sigma, ", tau = ", tau, ", N = ", N, "\n", sep="")
  theta <- rnorm(N, 0, tau)
  y <- theta + rnorm(N, 0, sigma)
  signif_classical <- abs(y) > 2*sigma
  cat(sum(signif_classical), " (", fround(100*mean(signif_classical), 1), "%) of the 95% classical intervals exclude 0\n", sep="")
  cat("Mean absolute value of these classical estimates is", fround(mean(abs(y)[signif_classical]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_classical]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(y))[signif_classical]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
  theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
  theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
  signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
  cat(sum(signif_bayes), " (", fround(100*mean(signif_bayes), 1), "%) of the 95% posterior intervals exclude 0\n", sep="")
  cat("Mean absolute value of these Bayes estimates is", fround(mean(abs(theta_hat_bayes)[signif_bayes]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_bayes]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(theta_hat_bayes))[signif_bayes]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
}

sigma <- 1
tau <- .5
N <- 1e6
spidey(sigma, tau, N)

Here's the first half of the results:

sigma = 1, tau = 0.5, N = 1e+06
73774 (7.4%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.45 
Mean absolute value of the corresponding true parameters is 0.56 
13.9% of these are the wrong sign (Type S error)

So, when tau is half of sigma, the classical procedure yields claims with confidence 7% of the time. The estimates are huge (after all, they have to be at least two standard errors from 0), much higher than the underlying parameters. And 14% of these claims with confidence are in the wrong direction.

The next half of the output shows the results from the Bayesian intervals:

62 (0.0%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 0.95 
Mean absolute value of the corresponding true parameters is 0.97 
3.2% of these are the wrong sign (Type S error)

When tau is half of sigma, Bayesian claims with confidence are extremely rare. When there is a Bayesian claim with confidence, it will be large---that makes sense; the posterior standard error is sqrt(1/(1/1 + 1/.5^2)) = 0.45, and so any posterior mean corresponding to a Bayesian claim with confidence here will have to be at least 0.9. The average for these million comparisons turns out to be 0.94.

So, hey, watch out for selection effects! But no, not at all. If we look at the underlying true effects corresponding to these claims with confidence, these have a mean of 0.97 (in this simulation; in other simulations of a million comparisons, we get means such as 0.89 or 1.06). And very few of these are in the wrong direction; indeed, with enough simulations you'll find a type S error rate of a bit less than 2.5%, which is what you'd expect: each of these 95% posterior intervals excludes 0, so less than 2.5% of the posterior probability can be on the wrong side of zero.

So, the Bayesian procedure only very rarely makes a claim with confidence. But, when it does, it's typically picking up something real, large, and in the right direction.

We then re-ran with tau = 1, a world in which the standard deviation of true effects is equal to the standard error of the estimates:

sigma <- 1
tau <- 1
N <- 1e6
spidey(sigma, tau, N)

And here's what we get:

sigma = 1, tau = 1, N = 1e+06
157950 (15.8%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.64 
Mean absolute value of the corresponding true parameters is 1.34 
3.9% of these are the wrong sign (Type S error)
45634 (4.6%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 1.68 
Mean absolute value of the corresponding true parameters is 1.69 
1.0% of these are the wrong sign (Type S error)

The classical estimates remain too high, on average about twice as large as the true effect sizes; the Bayesian procedure is more conservative, making fewer claims with confidence and not overestimating effect sizes.

Bayes does better because it uses more information

We should not be surprised by these results. The Bayesian procedure uses more information and so it can better estimate effect sizes.

But this can seem like a problem: what if this prior information on theta isn’t available? I have two answers. First, in many cases, some prior information is available. Second, if you have a lot of comparisons, you can fit a multilevel model and estimate tau. Thus, what can seem like the worst multiple comparisons problems are not so bad.

One should also be able to obtain comparable results non-Bayesianly by setting a threshold so as to control the type S error rate. The key is to go beyond the false-positive, false-negative framework, to set the goals of estimating the sign and magnitudes of the thetas rather than to frame things in terms of the unrealistic and uninteresting theta=0 hypothesis.
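Here’s a rough sketch of how such a threshold could be chosen by simulation; this is just an illustration of the idea, not a worked-out procedure, and it requires a guess at tau:

# Claim a sign only when |y| > k*sigma; estimate the type S error rate among the
# resulting claims by simulation, for a given guess of tau
type_s_rate <- function(k, tau, sigma, N = 1e6) {
  theta <- rnorm(N, 0, tau)
  y <- theta + rnorm(N, 0, sigma)
  claim <- abs(y) > k * sigma
  mean(sign(y)[claim] != sign(theta)[claim])
}

# Raising the threshold k drives down the type S error rate
sapply(c(2, 3, 4), type_s_rate, tau = 0.5, sigma = 1)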

P.S. Now I know why I swore off blogging! The analysis, the simulation, and the writing of this post took an hour and a half of my work time.

P.P.S. Sorry for the ugly code. Let this be a motivation for all of you to learn how to code better.

One more thing you don’t have to worry about

Baruch Eitam writes:

So I have been convinced by the futility of NHT for my scientific goals and by the futility of significance testing (in the sense of using p-values as a measure of the strength of evidence against the null). So convinced that I have been teaching this for the last 2 years. Yesterday I bumped into this paper [“To P or not to P: on the evidential nature of P-values and their place in scientific inference,” by Michael Lew] which I thought makes a very strong argument for the validity of using significance testing for the above purpose. Furthermore—by his 1:1 mapping of p-values to likelihood functions he kind of obliterates the difference between the Bayesian and frequentist perspectives. My questions are: 1. Is his argument sound? 2. What does this mean regarding the use of p-values as measures of strength of evidence?

I replied that it all seems a bit nuts to me. If you’re not going to use p-values for hypothesis testing (and I agree with the author that this is not a good idea), why bother with p-values at all? It seems weird to use p-values to summarize the likelihood; why not just use the likelihood and do Bayesian inference directly? Regarding that latter point, see this paper of mine on p-values.

Eitam followed up:

But aren’t you surprised that the p-values do summarize the likelihood?

I replied that I did not read the paper in detail, but for any given model and sample size, I guess it makes sense that any two measures of evidence can be mapped to each other.
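As a tiny illustration of that point (my example, not Lew’s): for a z-test with known standard error, the two-sided p-value and the z statistic determine each other, so either one pins down the normal likelihood centered at the estimate.

p_from_z <- function(z) 2 * pnorm(-abs(z))   # two-sided p-value from a z statistic
z_from_p <- function(p) qnorm(1 - p / 2)     # recover |z| from the p-value

z_from_p(p_from_z(1.7))  # returns 1.7: the mapping is one-to-one (up to sign)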

On deck this week

Mon: One more thing you don’t have to worry about

Tues: Evil collaboration between Medtronic and FDA

Wed: His varying slopes don’t seem to follow a normal distribution

Thurs: A day in the life

Fri: Letters we never finished reading

Sat: Better to just not see the sausage get made

Sun: Oooh, it burns me up

Taking Bayesian Inference Seriously [my talk tomorrow at Harvard conference on Big Data]

Mon 22 Aug, 9:50am, at Harvard Science Center Hall A:

Taking Bayesian Inference Seriously

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors can also resolve some of the questions of replication and multiple comparisons that have recently shaken the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

Kaiser Fung on the ethics of data analysis

Kaiser gave a presentation and he’s sharing the slides with us here. It’s important stuff.

Michael Porter as new pincushion

Some great comments on this post about Ted talk visionary Michael Porter. Most rewarding was this from Howard Edwards:

New Zealand seems to score well on his index so perhaps I shouldn’t complain, but Michael Porter was well known in this part of the world 25 years ago when our government commissioned him to write a report titled “Upgrading New Zealand’s Competitive Advantage” (but known colloquially as the Porter Project.) Back then (perhaps not quite so much now) our government departments were in thrall of any overseas “expert” who could tell us what to do, and especially so if their philosophy happened to align with that of the government of the day.

Anyway this critique written at the time by one of our leading political economists suggests that his presentation and analysis skills weren’t the greatest back then either.

I followed the link and read the article by Brian Easton, which starts out like this:

Flavour of the moment is Upgrading New Zealand’s Competitive Advantage, the report of the so-called Porter Project. Its 178 pages (plus appendices) are riddled with badly labelled graphs; portentous diagrams which, on reflection, say nothing; chummy references to “our country”, when two of the three authors are Americans; and platitudes dressed up as ‘deep and meaningful’ sentiments.

Toward the end of the review, Easton sums up:

It would be easy enough to explain this away as the usual shallowness of a visiting guru passing through. But New Zealand’s Porter Project spent about $1.5 million (of taxpayers’ money) on a report which is largely a recycling of conventional wisdom and material published elsewhere. Even if there were more and deeper case studies, the return on the money expended would still be low.

But that’s just leading up to the killer blow:

Particularly galling is the book’s claim that we should improve the efficiency of government spending. The funding of this report would have been a good place to start. It must be a candidate for the lowest productivity research publication ever funded by government.

In all seriousness, I expect that Michael Porter is so used to getting paid big bucks that he hardly noticed where the $1.5 million went. (I guess that’s 1.5 million New Zealand dollars, so something like $750,000 U.S.) Wasteful government spending on other people, sure, that’s horrible, but when the wasteful government spending goes directly to you, that’s another story.

Things that sound good but aren’t quite right: Art and research edition

There are a lot of things you can say that sound very sensible but, upon reflection, are missing something.

For example consider this blog comment from Chris G:

Years ago I heard someone suggest these three questions for assessing a work of art:

1. What was the artist attempting to do?
2. Were they successful?
3. Was it worth doing?

I think those apply equally well to assessing research.

The idea of applying these same standards to research as to art, that was interesting. And the above 3 questions sounded good too—at first. But then I got to thinking about all sorts of art and science that didn’t fit the above rules. As I wrote:

There are many cases of successful art, and for that matter successful research, that were created by accident, where the artist or researcher was just mucking around, or maybe just trying to do something to pay the bills, and something great came out of it.

I’m not saying you’ll get much from completely random mucking around of the monkeys-at-a-typewriter variety. And in general I do believe in setting goals and working toward them. But artistic and research success often does seem to come in part by accident, or as a byproduct of some other goals.

An ethnographic study of the “open evidential culture” of research psychology

Claude Fischer points me to this paper by David Peterson, “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” which begins:

Science studies scholars have shown that the management of natural complexity in lab settings is accomplished through a mixture of technological standardization and tacit knowledge by lab workers. Yet these strategies are not available to researchers who study difficult research objects. Using 16 months of ethnographic data from three laboratories that conduct experiments on infants and toddlers, the author shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor. This research raises important questions regarding the value of restrictive evidential cultures in challenging research environments.

And it concludes:

Open evidential cultures may be defensible under certain conditions. When problems are pressing and progress needs to be made quickly, creativity may be prized over ascetic rigor. Certain areas of medical or environmental science may meet this criterion. Developmental psychology does not. However, it may meet a second criterion. When research findings are not tightly coupled with some piece of material or social technology—that is, when the “consumers” of such science do not significantly depend on the veracity of individual articles—then local culture can function as an internal mechanism for evaluation in the field. Similar to the way oncologists use a “web of trials” rather than relying on a single, authoritative study or how weather forecasters use multiple streams of evidence and personal experience to craft a prediction, knowledge in such fields may develop positively even in a literature that contains more false positives than would be expected by chance alone.

It’s an interesting article, because usually discussions of research practices are all about what is correct, what should be done or not done, what do the data really tell us, etc. But here we get an amusing anthropological take on things, treating scientists’ belief in their research findings with the same respect that we treat tribal religious beliefs. This paper is not normative, it’s descriptive. And description is important. As I often say, if we want to understand the world, it helps to know what’s actually happening out there!

I like the term “open evidential culture”: it’s descriptive without being either condescending, on one hand, or apologetic, on the other.

Stan Course up North (Anchorage, Alaska) 23–24 Aug 2016

Daniel Lee’s heading up to Anchorage, Alaska to teach a two-day Stan course at the Alaska chapter of the American Statistical Association (ASA) meeting. Here’s the rundown:

I hear Alaska’s beautiful in the summer—16-hour days in August and high temps of 17 degrees Celsius. Plus Stan!

More Upcoming Stan Events

All of the Stan-related events of which we are aware are listed on our events page.

After Alaska, Daniel and Michael Betancourt will be joining me in Paris, France on 19–21 September to teach a three-day course on Pharmacometric Modeling using Stan. PK/PD in Stan is now a whole lot easier after Sebastian Weber integrated CVODES (pun intended) to solve stiff differential equations with control over tolerances and max steps per iteration.

The day after the course in Paris, on 22 September, we (with Julie Bertrand and France Mentre) are hosting a one-day Workshop on Pharmacometric Modeling with Stan.

Your Event Here

Let us know if you hear about other Stan-related events (meetups, courses, workshops) and we can post them on our events page and advertise them right here on the blog.

What’s gonna happen in November?

Nadia Hassan writes:

2016 may be strange with Trump. Do you have any thoughts on how people might go about modeling a strange election? When I asked you about predictability and updating election forecasts, you stated that models that rely on polls at different points should be designed to allow for surprises. You have touted the power of weakly informative priors. Could those be a good tool for this situation?

I received this message on 4 Apr and I’m typing this on 9 Apr but it’s 17 Aug in blog time. So you’re actually reading a response that’s 4 months old.

What is it that they say: History is journalism plus time? I guess political science is political journalism plus time.

Anyway . . . whenever people asked me about the primary elections, I’d point them to my 2011 NYT article, Why Are Primaries Hard to Predict? Here’s the key bit:

Presidential general election campaigns have several distinct features that distinguish them from most other elections:

1. Two major candidates;
2. The candidates clearly differ in their political ideologies and in their positions on economic issues;
3. The two sides have roughly equal financial and organizational resources;
4. The current election is the latest in a long series of similar contests (every four years);
5. A long campaign, giving candidates a long time to present their case and giving voters a long time to make up their minds.

OK, now to Hassan’s question. I don’t really have a good answer! I guess I’d take as a starting point the prediction from a Hibbs-like model predicting the election based on economic conditions during the past year, presidential popularity, and party balancing. Right now the economy seems to be going OK though not great, Obama is reasonably popular, and party balancing favors the Democrats because the Republicans control both houses of Congress. So I’m inclined to give the Democratic candidate (Hillary Clinton, I assume) the edge. But that’s just my guess, I haven’t run the numbers. There’s also evidence from various sources that more extreme candidates don’t do so well, so if Sanders is the nominee, I’d assume he’d get a couple percentage points less than Clinton would. Trump . . . it’s hard to say. He’s not ideologically extreme, on the other hand he is so unpopular (even more so than Clinton), it’s hard to know what to say. So I find this a difficult election to predict. And once August rolls around, it’s likely there will be some completely different factors that I haven’t even thought about! From a statistical point of view, I guess I’d just add an error term which would increase my posterior uncertainty.

It’s not so satisfying to say this, but I don’t have much to offer as an election forecast beyond what you could read in any newspaper. I’m guessing that statistical tools will be more relevant in modeling what will happen in individual states, relative to the national average. As Kari Lock and I wrote a few years ago, it can be helpful to decompose national trends and the positions of the states. So maybe by the time this post appears here, I’ll have more to say.

P.S. This seems like a natural for the sister blog but I’m afraid the Washington Post readers would get so annoyed at me for saying I can’t make a good forecast! So I’m posting it here.

How schools that obsess about standardized tests ruin them as measures of success


Mark Palko and I wrote this article comparing the Success Academy chain of charter schools to Soviet-era factories:

According to the tests that New York uses to evaluate schools, Success Academies ranks at the top of the state — the top 0.3 percent in math and the top 1.5 percent in English, according to the founder of the Success Academies, Eva Moskowitz. That rivals or exceeds the performance of public schools in districts where homes sell for millions of dollars.

But it took three years before any Success Academy students were accepted into New York City’s elite high school network — and not for lack of trying. After two years of zero-percent acceptance rates, the figure rose to 11 percent this year, still considerably short of the 19 percent citywide average.

News coverage of those figures emphasized that that acceptance rate was still higher than the average for students of color (the population Success Academy mostly serves). But from a statistical standpoint, we would expect extremely high scores on the state exam to go along with extremely high scores on the high school application exams. It’s not clear why race should be a factor when interpreting one and not the other.

The explanation for the discrepancy would appear to be that in high school admissions, everybody is trying hard, so the motivational tricks and obsessive focus on tests at Success Academy schools has less of an effect. Routine standardized tests are, by contrast, high stakes for schools but low stakes for students. Unless prodded by teachers and anxious administrators, the typical student may be indifferent about his or her performance. . . .

We summarize:

In general, competition is good, as are market forces and data-based incentives, but they aren’t magic. They require careful thought and oversight to prevent gaming and what statisticians call model decay. . . .

What went wrong with Success Academy is, paradoxically, what also seems to have gone right. Success Academy schools have excelled at selecting out students who will perform poorly on state tests and then preparing their remaining students to test well. But their students do not do so well on tests that matter to the students themselves.

Like those Soviet factories, Success Academy and other charter schools have been under pressure to perform on a particular measure, and are reminding us once again what Donald Campbell told us 40 years ago: Tampering with the speedometer won’t make the car go faster.

Calorie labeling reduces obesity Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC, compared to some other places on the west coast and in the northeast that didn’t have calorie labeling

Ted Kyle writes:

I wonder if you might have some perspective to offer on this analysis by Partha Deb and Carmen Vargas regarding restaurant calorie counts.

[Thin columnist] Cass Sunstein says it proves “that calorie labels have had a large and beneficial effect on those who most need them.”

I wonder about the impact of using self-reported BMI as a primary input and also the effect of confounding variables. Someone also suggested that investigator degrees of freedom is an important consideration.

They’re using data from a large national survey (Behavioral Risk Factor Surveillance System) and comparing self-reported body mass index of people who lived in counties with calorie-labeling laws, compared to counties without such laws, and they come up with these (distorted) maps:

[Two maps of counties with and without calorie-labeling laws]

Here’s their key finding:

[Key results table from the paper]

The two columns correspond to two different models they used to adjust for demographic differences between the people in the two groups of counties. As you can see, average BMI seems to have increased faster in the no-calorie-labeling counties.

On the other hand, if you look at the map, it seems like they’re comparing {California, Seattle, Portland (Oregon), and NYC} to everyone else (with Massachusetts somewhere in the middle), and there are big differences between these places. So I don’t know how seriously we can attribute the differences between those trends to food labeling.

Also, figure 5 of that paper, showing covariate balance, is just goofy. I recommend simple and more readable dotplots as in chapter 10 of ARM. Figure 4 is a bit mysterious too, I’m not quite clear on what is gained by the barplots on the top; aren’t they just displaying the means of the normal distributions on the bottom? And Figures 1 and 2, the maps, look weird: they’re using some bad projection, maybe making the rookie mistake of plotting latitude vs. longitude, not realizing that when you’re away from the equator one degree of latitude is not the same distance as one degree of longitude.

As to the Cass Sunstein article (“Calorie Counts Really Do Fight Obesity”), yeah, it seems a bit hypey. Key Sunstein quote: “All in all, it’s a terrific story.” Even aside from the causal identification issues discussed above, don’t forget that the difference between “significant” and “not significant” is not itself statistically significant.

Speaking quite generally, I agree with Sunstein when he writes:

A new policy might have modest effects on Americans as a whole, but big ones on large subpopulations. That might be exactly the point! It’s an important question to investigate.

But of course researchers—even economists—have been talking about varying treatment effects for a while. So to say we can draw this “large lesson” from this particular study . . . again, a bit of hype going on here. It’s fine for Sunstein if this particular paper has woken him up to the importance of interactions, but let’s not let his excitement about the general concept, and his eagerness to tell a “terrific story” and translate it into policy, distract us from the big problems of interpreting the claims made in this paper.

And, to return to the multiple comparisons issue, ultimately what’s important is not so much what the investigators did or might have done, but rather what the data say. I think the right approach would be some sort of hierarchical model that allows for effects in all groups, rather than a search for a definitive result in some group or another.
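As a sketch of what I mean (the data frame and variable names here are hypothetical, not the authors’ specification): a multilevel model that lets the labeling effect vary by demographic group, with the group-specific effects partially pooled toward a common mean.

library("lme4")

# bmi_data: hypothetical individual-level data with self-reported BMI, survey year,
# an indicator for living in a calorie-labeling county, a demographic group, and county
fit <- lmer(bmi ~ year * labeling + (1 + year:labeling | demo_group) + (1 | county),
            data = bmi_data)

The partial pooling is what addresses the multiple comparisons concern: each group’s effect is estimated, but shrunk toward the overall effect rather than cherry-picked when it happens to cross a significance threshold.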

P.S. Kyle referred to the article by Deb and Vargas as a “NBER analysis” but that’s not quite right. NBER is just a consortium that publishes these papers. To call their paper an NBER analysis would be like calling this blog post “a WordPress analysis” because I happen to be using this particular software.

The history of characterizing groups of people by their averages

Andrea Panizza writes:

I stumbled across this article on the End of Average.

I didn’t know about Todd Rose, thus I had a look at his Wikipedia entry:

Rose is a leading figure in the science of the individual, an interdisciplinary field that draws upon new scientific and mathematical findings that demonstrate that it is not possible to draw meaningful inferences about human beings using statistical averages.

Hmmm. I guess you would have something to say about that last sentence. To me, it sounds either trivial, if we interpret it in the sense illustrated by the US Air force later on in the same page, i.e., that individuals whose properties (weight, height, chest, etc.) are “close” to those of the Average Human Being are very rare, provided the number of properties is sufficiently high. Or plain wrong, if it’s a claim that statistics cannot be used to draw useful inferences on some specific population of individuals (American voters, middle-aged non-Hispanic white men, etc.). Either way, I think this would make a nice entry for your blog.

My reply: I’m not sure! On one hand, I love to be skeptical; on the other hand, since you’re telling me I won’t like it, I’m inclined to say I like it, just to be contrary!

OK, that was my meta-answer. Seriously, though . . . I haven’t looked at Rose’s book, but I kinda like his Atlantic article that you linked to, in that it has an interesting historical perspective. Of course we can draw meaningful inferences using statistical averages—any claim otherwise seems just silly. But if the historical work is valid, we can just go with that and ignore any big claims about the world. Historians can have idiosyncratic views about the present but still give us valuable insights about how we got to where we are today.

Tax Day: The Birthday Dog That Didn’t Bark

Following up on Valentine’s Day and April Fools, a journalist was asking about April 15: Are there fewer babies born on Tax Day than on neighboring days?

Let’s go to the data:

[Graph: number of births by day of the year, 1968-1988]

These are data from 1968-1988 so it would certainly be interesting to see new data, but here’s what we got:
– A lot fewer births on April 1st.
– Maybe something going on Apr 15, but not much; really, nothing much going on there at all.
– A lot fewer births on vacation holidays such as July 4th, Labor Day, etc.
– Extra births before xmas and between xmas and New Year’s, which makes sense: the baby has to come out sometime!
– Day-of-week effects were increasing over the years.
But, really, nothing going on with April 15th. April Fools is where it’s at.

I just don’t think tax day is such a big deal. It looms large in the folklore of comedy writers and editorial writers, but for regular people it’s just a pain in the ass and then it’s over, not like, “Hey, I don’t want my kid to have an April Fools birthday.”

On deck this week

Mon: The history of characterizing groups of people by their averages

Tues: Calorie labeling reduces obesity Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC, compared to some other places on the west coast and in the northeast that didn’t have calorie labeling

Wed: What’s gonna happen in November?

Thurs: An ethnographic study of the “open evidential culture” of research psychology

Fri: Things that sound good but aren’t quite right: Art and research edition

Sat: Michael Porter as new pincushion

Sun: Kaiser Fung on the ethics of data analysis