
Bigmilk strikes again


Paul Alper sends along this news article by Kevin Lomagino, Earle Holland, and Andrew Holtz on the dairy-related corruption in a University of Maryland research study on the benefits of chocolate milk (!).

The good news is that the university did not stand behind its ethically-challenged employee. Instead:

“I did not become aware of this study at all until after it had become a news story,” Patrick O’Shea, UMD’s Vice President and Chief Research Officer, said in a teleconference. He says he took a look at both the chocolate milk and concussions news release and an earlier one comparing the milk to sports recovery drinks. “My reaction was, ‘This just doesn’t seem right. I’m not sure what’s going on here, but this just doesn’t seem right.’”

Back when I was a student there, we called it UM. I wonder when they changed it to UMD?

Also this:

O’Shea said in a letter that the university would immediately take down the release from university websites, return some $200,000 in funds donated by dairy companies to the lab that conducted the study, and begin implementing some 15 recommendations that would bring the university’s procedures in line with accepted norms. . . .

Dr. Shim’s lab was the beneficiary of large donations from Allied Milk Foundation, which is associated with First Quarter Fresh, the company whose chocolate milk was being studied and favorably discussed in the UMD news release.

Also this from a review committee:

There are simply too many uncontrolled variables to produce meaningful scientific results.

Wow—I wonder what Harvard Business School would say about this, if this criterion were used to judge some of its most famous recent research?

And this:

The University of Maryland says it will never again issue a news release on a study that has not been peer reviewed.

That seems a bit much. I think peer review is overrated, and if a researcher has some great findings, sure, why not do the press release? The key is to have clear lines of responsibility. And I agree with the University of Maryland on this:

The report found that while the release was widely circulated prior to distribution, nobody knew for sure who had the final say over what it could claim. “There is no institutional protocol for approval of press releases and lines of authority are poorly defined,” according to the report. It found that Dr. Shim was given default authority over the news release text, and that he disregarded generally accepted standards as to when study results should be disseminated in news releases.

Now we often seem to have the worst of both worlds, with irresponsible researchers making extravagant and ill-founded claims and then egging on press agents to make even more extreme statements. Again, peer review has nothing to do with it. There is a problem with press releases that nobody is taking responsibility for.

One-day workshop on causal inference (NYC, Sat. 16 July)

James Savage is teaching a one-day workshop on causal inference this coming Saturday (16 July) in New York using rstanarm. Here’s a link to the details:

Here’s the course outline:

How do prices affect sales? What is the uplift from a marketing decision? By how much will studying for an MBA affect my earnings? How much might an increase in minimum wages affect employment levels?

These are examples of causal questions. Sadly, they are the sorts of questions that data scientists’ run-of-the-mill predictive models can be ill-equipped to answer.

In this one-day course, we will cover methods for answering these questions, using easy-to-use Bayesian data analysis tools. The topics include:

– Why do experiments work? Understanding the Rubin causal model

– Regularized GLMs; bad controls; souping-up linear models to capture nonlinearities

– Using panel data to control for some types of unobserved confounding information

– ITT, natural experiments, and instrumental variables

– If we have time, using machine learning models for causal inference.

All work will be done in R, using the new rstanarm package.

Lunch, coffee, snacks and materials will be provided. Attendees should bring a laptop with R, RStudio and rstanarm already installed. A limited number of scholarships are available. The course is in no way affiliated with Columbia.
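
To give a flavor of the tools involved (my own sketch, not the course materials), here is the kind of regularized GLM that rstanarm makes easy to fit; the data frame and variable names below are hypothetical, and the weakly informative priors are what do the regularizing:

  library(rstanarm)

  # Hypothetical example: how do prices and promotions relate to sales?
  # The normal() priors shrink the coefficients, i.e., a regularized GLM.
  fit <- stan_glm(
    sales ~ price + promo + price:promo,
    data = store_data,                 # hypothetical data frame
    family = gaussian(),
    prior = normal(0, 1),              # weakly informative prior on coefficients
    prior_intercept = normal(0, 5)
  )
  summary(fit)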

Replin’ ain’t easy: My very first preregistration


I’m doing my first preregistered replication. And it’s a lot of work!

We’ve been discussing this for a while—here’s something I published in 2013 in response to proposals by James Monogan and by Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt for preregistration in political science, and here’s a blog discussion (“Preregistration: what’s in it for you?”) from 2014.

Several months ago I decided I wanted to perform a preregistered replication of my 2013 AJPS paper with Yair on MRP. We found some interesting patterns of voting and turnout, but I was concerned that perhaps we were overinterpreting patterns from a single dataset. So we decided to re-fit our model to data from a different poll. That paper had analyzed the 2008 election using pre-election polls from Pew Research. The 2008 Annenberg pre-election poll was also available, so why not try that too?

Since we were going to do a replication anyway, why not preregister it? This wasn’t as easy as you might think. First step was getting our model to fit with the old data; this was not completely trivial given changes in software, and we needed to tweak the model in some places. Having checked that we could successfully duplicate our old study, we then re-fit our model to two surveys from 2004. We then set up everything to run on Annenberg 2008. At this point we paused, wrote everything up, and submitted to a journal. We wanted to time-stamp the analysis, and it seemed worthwhile to do this in a formal journal setting so that others could see all the steps in one place. The paper (that is, the preregistration plan) was rejected by the AJPS. They suggested we send it to Political Analysis, but they ended up rejecting it too. Then we sent it to Statistics, Politics, and Policy, which agreed to publish the full paper: preregistration plan plus analysis.

But, before doing the analysis, I wanted to time-stamp the preregistration plan. I put the paper up on my website, but that’s not really preregistration. So then I tried Arxiv. That took a while too—at first they were thrown off by the paper being incomplete (by necessity, as we wanted to first publish the article with the plan but without the replication results). But they finally posted it.

The Arxiv post is our official announcement of preregistration. Now that it’s up, we (Rayleigh, Yair, and I) can run the analysis and write it up!

What have we learned?

Even before performing the replication analysis on the 2008 Annenberg data, this preregistration exercise has taught me some things:

1. The old analysis was not in runnable condition. We and others are now in a position to fit the model to other data much more directly.

2. There do seem to be some problems with our model in how it fits the data. To see this, compare Figure 1 to Figure 2 of our new paper. Figure 1 shows our model fit to the 2008 Pew data (essentially a duplication of Figure 2 of our 2013 paper), and Figure 2 shows this same model fit to the 2004 Annenberg data.

So, two changes: Pew vs. Annenberg, and 2008 vs. 2004. And the fitted models look qualitatively different. The graphs take up a lot of space, so I’ll just show you the results for a few states.

We’re plotting the probability of supporting the Republican candidate for president (among the supporters of one of the two major parties; that is, we’re plotting the estimates of R/(R+D)) as a function of the respondent’s family income (divided into five categories). Within each state, we have two lines: the brown line shows estimated Republican support among white voters, and the black line shows estimated Republican support among all voters in the state. The y-axis goes from 0 to 100%.
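
For concreteness, here is a minimal sketch of the multilevel regression and poststratification step behind estimates like these. It is not the model from the paper, which is more elaborate (more demographic predictors plus state-level information); the variable names and poststratification table are hypothetical:

  library(rstanarm)
  library(dplyr)

  # poll: individual survey responses with a 0/1 Republican-vote outcome
  # and demographic/geographic predictors (all names hypothetical).
  fit <- stan_glmer(
    rep_vote ~ (1 | state) + (1 | income) + (1 | eth),
    family = binomial(link = "logit"),
    data = poll
  )

  # poststrat: one row per state x income x eth cell with census count N.
  pred <- posterior_linpred(fit, newdata = poststrat, transform = TRUE)
  poststrat$p_rep <- colMeans(pred)           # posterior mean support per cell

  # Estimated Republican support by state and income, weighting cells by N.
  state_income <- poststrat %>%
    group_by(state, income) %>%
    summarise(p_rep = weighted.mean(p_rep, N))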

From Figure 1 (the model fit to the 2008 Pew data): [figure omitted]

From Figure 2 (the same model fit to the 2004 Annenberg data): [figure omitted]

You see that? The fitted lines are smoother in Figure 2 than in Figure 1, and they seem to track the data points more closely. It appears as if this is coming from the raw data, which in Figure 2 seem closer to clean monotonic patterns.

My first thought was that this was something to do with sample size. OK, that was my third thought. My first thought was that it was a bug in the code, and my second thought was that there was some problem with the coding of the income variable. But I don’t think it was any of these things. Annenberg 2004 had a larger sample than Pew 2008, so we re-fit to two random subsets of those Annenberg 2004 data, and the resulting graphs (not shown in the paper) look similar to the Figure 2 shown above; they were still a lot smoother than Figure 1, which shows results from Pew 2008.

We discuss this at the end of Section 2 of our new paper and don’t come to any firm conclusions. We’ll see what turns up with the replication on Annenberg 2008.

Anyway, the point is:
– Replication is not so easy.
– We can learn even from setting up the replications.
– Published results (even from me!) are always only provisional and it makes sense to replicate on other data.

About that claim that police are less likely to shoot blacks than whites


Josh Miller writes:

Did you see this splashy NYT headline, “Surprising New Evidence Shows Bias in Police Use of Force but Not in Shootings”?

It actually looks like a cool study overall, with granular data, a ton of legwork, and a rich set of results that extend beyond the attention-grabbing headline that is getting bandied about (sometimes with ill intent). While I do not work on issues of race and crime, I doubt I am alone in thinking that this counterintuitive result is unlikely to be true. The result: whites are as likely as blacks to be shot at in encounters in which lethal force may have been justified? Further, in their taser data, blacks are actually less likely than whites to subsequently be shot by a firearm after being tasered! While it’s true that we are talking about odds ratios for small probabilities, dare I say that the ratios are implausible enough to cue us that something funny is going on? (Blacks are 28-35% less likely to be shot in the taser data; table 5, col. 2, PDF p. 54.) Further, are we to believe that, when an encounter escalates, the fears and other biases of officers suddenly melt away and they become race-neutral? This seems to be inconsistent with the findings in other disciplines when it comes to fear and other immediate emotional responses to race (think implicit association tests, fMRI imaging of the amygdala, etc.).

This is not to say we can’t cook up a plausible sounding story to support this result. For example, officers may let their guard down against white suspects, and then, whoops, too late! Now the gun is the only option.

But do we believe this? That depends on how close we are to the experimental ideal of taking equally dangerous suspects, and randomly assigning their race (and culture?), and then seeing if police end up shooting them.

Looking at the paper, it seems like we are far from that ideal. In fact, it appears likely that the white suspects in their sample were actually more dangerous than the black suspects, and therefore more likely to get shot at.

Potential For Bias:

How could this selection bias happen? Well, this headline result comes solely from the Houston data, and for that data, their definition of a “shoot or don’t shoot” situation (my words) is an arrest report that describes an encounter in which lethal force was likely justified. What are the criteria for lethal force to be likely justified? Among other things, for this data, it includes “resisting arrest, evading arrest, and interfering in arrest” (PDF pp. 16-17, actual pp. 14-15—they sample 5% of 16,000 qualifying reports). They also have a separate data set in which the criterion is that a taser was deployed (~5,000 incidents). Remember, just to emphasize, these are reports of encounters that don’t necessarily lead to an officer-involved shooting (OIS). Given the presence of exaggerated fears, cultural misunderstandings, and other more nefarious forms of bias, wouldn’t we expect an arrest report to over-apply these descriptors to blacks relative to whites? Wouldn’t we also expect the taser to be over-applied to blacks relative to whites? If so, then won’t this mechanically lower the incidence of shootings of blacks relative to whites in this sample? There are more blacks in the researcher-defined “shoot or don’t shoot” situation who just shouldn’t be there; they are not as dangerous as the whites, and lethal force was unlikely to be justified (and wasn’t applied in most cases).

Conclusion:

With this potential selection bias, yet no discussion of it (as far as I can tell), the headline conclusion doesn’t appear to be warranted. Maybe the authors can do a calculation and find that the degree of selection you would need to cause this result is itself implausible? Who knows. But I don’t see how it is justified to spread around this result without checking into this. (This takes nothing away, of course, from the other important results in the paper.)

Notes:

The analysis for this particular result is reported on PDF pp. 23-25, with the associated table 5 on PDF p. 54. Note that when adding controls, there appear to be power issues. There is a partial control for suspect danger, under “encounter characteristics,” which includes, e.g., whether the suspect attacked or drew a weapon—interestingly, blacks are 10% more likely to be shot with this control (not significant). The table indicates a control is also added for the taser data, but I don’t know how they could do that, because the taser data has no written narrative.
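
To make the selection-bias mechanism concrete, here is a toy simulation with entirely made-up numbers (my illustration, not Miller’s or the paper’s). Officers’ shooting decisions depend only on how dangerous the encounter is, but less-dangerous encounters with black suspects are more likely to be written up as lethal-force-justified; conditional on being in the sample, blacks then appear less likely to be shot:

  set.seed(123)
  n <- 1e5
  race   <- sample(c("black", "white"), n, replace = TRUE)
  danger <- runif(n)                     # latent dangerousness of the encounter

  # Hypothesized reporting bias: a lower threshold for writing up encounters
  # with black suspects as lethal-force-justified.
  included <- ifelse(race == "black", danger > 0.5, danger > 0.7)

  # Shooting decision depends only on dangerousness (no racial bias here).
  shot <- rbinom(n, 1, plogis(5 * danger - 4))

  # Conditional on inclusion, the shooting rate comes out lower for blacks.
  tapply(shot[included], race[included], mean)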

See here for more on the study from Rajiv Sethi.

And Justin Feldman pointed me to this criticism of his. Feldman summarizes:

Roland Fryer, an economics professor at Harvard University, recently published a working paper at NBER on the topic of racial bias in police use of force and police shootings. The paper gained substantial media attention – a write-up of it became the top viewed article on the New York Times website. The most notable part of the study was its finding that there was no evidence of racial bias in police shootings, which Fryer called “the most surprising result of [his] career”. In his analysis of shootings in Houston, Texas, black and Hispanic people were no more likely (and perhaps even less likely) to be shot relative to whites.

I’m not endorsing Feldman’s arguments but I do want to comment on “the most surprising result of my career” thing. We should all have the capacity for being surprised. Science would go nowhere if we did nothing but confirm our pre-existing beliefs. Buuuuut . . . I feel like I see this reasoning a lot in media presentations of social science: “I came into this study expecting X, and then I found not-X, and the fact that I was surprised is an additional reason to trust my result.” The argument isn’t quite stated that way, but I think it’s implicit, that the surprise factor represents some sort of additional evidence. In general I’m with Miller that when a finding is surprising, we should look at it carefully as this could be an indication that something is missing in the analysis.

P.S. Some people also pointed out this paper by Cody Ross from last year, “A Multi-Level Bayesian Analysis of Racial Bias in Police Shootings at the County-Level in the United States, 2011–2014,” which uses Stan! Ross’s paper begins:

A geographically-resolved, multi-level Bayesian model is used to analyze the data presented in the U.S. Police-Shooting Database (USPSD) in order to investigate the extent of racial bias in the shooting of American civilians by police officers in recent years. In contrast to previous work that relied on the FBI’s Supplemental Homicide Reports that were constructed from self-reported cases of police-involved homicide, this data set is less likely to be biased by police reporting practices. . . .

The results provide evidence of a significant bias in the killing of unarmed black Americans relative to unarmed white Americans, in that the probability of being {black, unarmed, and shot by police} is about 3.49 times the probability of being {white, unarmed, and shot by police} on average. Furthermore, the results of multi-level modeling show that there exists significant heterogeneity across counties in the extent of racial bias in police shootings, with some counties showing relative risk ratios of 20 to 1 or more. Finally, analysis of police shooting data as a function of county-level predictors suggests that racial bias in police shootings is most likely to emerge in police departments in larger metropolitan counties with low median incomes and a sizable portion of black residents, especially when there is high financial inequality in that county. . . .

I’m a bit concerned by maps of county-level estimates because of the problems that Phil and I discussed in our “All maps of parameter estimates are misleading” paper.

I don’t have the energy to look at this paper in detail, but in any case its existence is useful in that it suggests a natural research project of reconciling it with the findings of the other paper discussed at the top of this post. When two papers on the same topic come to such different conclusions, it should be possible to track down where in the data and model the differences are coming from.

P.P.S. Miller points me to this post by Uri Simonsohn that makes the same point (as Miller at the top of the above post).

In their reactions, Miller and Simonsohn do something very important, which is to operate simultaneously on the level of theory and data, not just saying why something could be a problem but also connecting this to specific numbers in the article under discussion.

Of polls and prediction markets: More on #BrexitFail

David “Xbox poll” Rothschild and I wrote an article for Slate on how political prediction markets can get things wrong. The short story is that in settings where direct information is not easily available (for example, in elections where polls are not viewed as trustworthy forecasts, whether because of problems in polling or anticipated volatility in attitudes), savvy observers will deduce predictive probabilities from the prices of prediction markets. This can keep prediction market prices artificially stable, as people are essentially updating them from the market prices themselves.

Long-term, or even medium-term, this should sort itself out: once market participants become aware of this bias (in part from reading our article), they should pretty much correct this problem. Realizing that prediction market prices are only provisional, noisy signals, bettors should start reacting more to the news. In essence, I think market participants are going through three steps:

1. Naive over-reaction to news, based on the belief that the latest poll, whatever it is, represents a good forecast of the election.

2. Naive under-reaction to news, based on the belief that the prediction market prices represent best information (“market fundamentalism”).

3. Moderate reaction to news, acknowledging that polls and prices both are noisy signals.

Before we decided to write that Slate article, I’d drafted a blog post which I think could be useful in that I went into more detail on why I don’t think we can simply take the market prices as correct.

One challenge here is that you can just about never prove that the markets were wrong, at least not just based on betting odds. After all, an event with 4-1 odds against should still occur 20% of the time. Recall that we were even getting people arguing that those Leicester City odds of 5,000-1 were correct, which really does seem like a bit of market fundamentalism.

OK, so here’s what I wrote the other day:

We recently talked about how the polls got it wrong in predicting Brexit. But, really, that’s not such a surprise: we all know that polls have lots of problems. And, in fact, the Yougov poll wasn’t so far off at all (see P.P.P.S. in above-linked post, also recognizing that I am an interested party in that Yougov supports some of our work on Stan).

Just as striking, and also much discussed, is that the prediction markets were off too. Indeed, the prediction markets were more off than the polls: even when polling was showing consistent support for Leave, the markets were holding on to Remain.

This is interesting because in previous elections I’ve argued that the prediction markets were chasing the polls. But here, as with Donald Trump’s candidacy in the primary election, the problem was the reverse: prediction markets were discounting the polls in a way which, retrospectively, looks like an error.

How to think about this? One could follow psychologist Dan Goldstein who, under the heading, “Prediction markets not as bad as they appear,” argued that prediction markets are approximately calibrated in the aggregate, and thus you can’t draw much of a conclusion from the fact that, in one particular case, the markets were giving 5-1 odds against an event (Brexit) that actually ended up happening. After all, there are lots of bets out there, and 1/6 of all 5:1 shots should come in.
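
As a quick arithmetic aside: odds of k:1 against correspond to an implied probability of 1/(k+1), so 4:1 against is 20% and 5:1 against is about 17%, and calibration-in-aggregate is just the claim that roughly that fraction of such bets should come in. A toy check in R, with simulated rather than real events:

  # Convert odds-against to implied probabilities.
  implied_prob <- function(odds_against) 1 / (odds_against + 1)
  implied_prob(c(4, 5))        # 0.20 and about 0.167

  # Calibration in aggregate: if 5:1 prices were exactly right, about 1/6
  # of many such events would happen (simulated, made-up events).
  set.seed(1)
  mean(rbinom(1000, 1, implied_prob(5)))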

And, indeed, if the only pieces of information available were: (a) the market odds against Brexit winning the vote were 5:1, and (b) Brexit won the vote; then, yes, I’d agree that nothing more could be said. But we actually do have more information.

Let’s start with this graph from Emile Servan-Schreiber, from a post linked to by Goldstein. The graph shows one particular prediction market for the week leading up to the vote:

[Graph omitted: prediction-market prices for Remain vs. Leave over the week before the vote]

It’s my impression that the odds offered by other markets looked similar. I’d really like to see the graph over the past several months, but I wasn’t quite sure where to find it, so we’ll go with the one-week time series.

One thing that strikes me is how stable these odds are. I’m wondering if one thing that went on was a feedback mechanism where the betting odds reify themselves.

It goes like this: the polls are all over the place, and we all know not to trust the polls, which have notoriously failed in various British elections. But we do watch the prediction markets, which all sorts of experts have assured us capture the wisdom of crowds.

So, serious people who care about the election watch the prediction markets. The markets say 5:1 against Leave. Then there’s other info, the latest poll, and so forth. How to think about this information? Informed people look to the markets. What do the markets say? 5:1. OK, then that’s the odds.

This is not an airtight argument or a closed loop. Of course, real information does intrude upon this picture. But my argument is that prediction markets can stay stable for too long.

In the past, traders followed the polls too closely and sent the prediction markets up and down. But now the opposite is happening. Traders are treating market odds as correct probabilities and not updating enough based on outside information. Belief in the correctness of prediction markets causes them to be too stable.

We saw this with the Trump nomination, and we saw it with Brexit. Initial odds are reasonable, based on whatever information people have. But then when new information comes in, it gets discounted. People are using the current prediction odds as an anchor.

Related to this point is this remark from David Rothschild:

I [Rothschild] am very intrigued by this interplay of polls, prediction markets, and financial markets. We generally accept polls as exogenous, and assume the markets are reacting to the polls and other information. But, with the growth of poll-based forecasting and more robust analytics on the polling, before release, there is the possibility that polls (or, at least, what is reported from polls) are influenced by the markets. Markets were assuming that there were two things at play: (1) social-desirability bias to over-report leaving (which we saw in Scotland in 2014), and (2) uncertain voters would break for stay (which seemed to happen in the polling in the last few days). And, while there was a lot of concern about the turnout of stay voters (due to stay voters being younger), the unfortunate assassination of Jo Cox seemed to have assuaged the markets (either by rousing the stay supporters to vote or tempering the leave supporters out of voting). Further, the financial markets were, seemingly, even more bullish than the prediction markets in the last few days and hours before the tallies were complete.

I know you guys think I have no filter, but . . .

. . . Someone sent me a juicy bit of news related to one of our frequent blog topics, and I shot back a witty response (or, at least, it seemed witty to me), but I decided not to post it here because I was concerned that people might take it as a personal attack (which it isn’t; I don’t even know the guy).

P.S. I wrote this post a few months ago and posted it for the next available slot, which is now. So you can pretty much forget about guessing what the news item was, as it’s not like it just happened or anything.

P.P.S. The post was going to be bumped again, to December! But this seemed a bit much so I’ll just post it now.

Some insider stuff on the Stan refactor

From the stan-dev list, Bob wrote [and has since added brms based on comments; the * packages are ones that aren’t developed or maintained by the stan-dev team, so we only know what we hear from their authors]:

The bigger picture is this, and you see the stan-dev/stan repo really spans three logical layers:

                      stan
          ----------------------------------
  math <- language <- algorithms <- services <- pystan
                                             <- rstan   <- rstanarm
                                                        <- rethinking (*)
                                                        <- brms (*)
                                             <- cmdstan <- statastan
                                                        <- matlabstan
                                                        <- stan.jl

What we are trying to do with the services refactor is make a clean services layer between the core interfaces (pystan, rstan, cmdstan) so that these don't have to know anything below the services layer. Ideally, there wouldn't be any calls from pystan, rstan, or cmdstan other than ones to the stan::services namespace. services, on the other hand, is almost certainly going to need to know about things below the algorithms level in language and math.

And Daniel followed up with:

This clarified a lot of things. I think this is what we should do:

  1. Split algorithms and services into their own repos. (Language too, but that's a given.)
  2. Each "route" to calling an algorithm should live in the "algorithms" repo. That is, algorithms should expose a simple function for calling it directly. It'll be a C++ API, but not one that the interfaces use directly.
  3. In "services," we'll have a config object with validation and only a handful of calls that pystan, rstan, cmdstan call. The config object needs to be simple and safe, but I think the pseudocode Bob and I created (which is looking really close to Michael's config object if it were safe) will suffice.

I don’t really know what they’re talking about, but I thought it might be interesting to those of you who don’t usually see software development from the inside.

Retro 1990s post


I have one more for you on the topic of jail time for fraud . . . Paul Alper points us to a news article entitled, “Michael Hubbard, Former Alabama Speaker, Sentenced to 4 Years in Prison.” From the headline this doesn’t seem like such a big deal, just run-of-the-mill corruption that we see all the time, but Alper’s eye was caught by this bit:

His power went almost unquestioned by members of both parties: Even after he was indicted, Mr. Hubbard received all but one vote in the Legislature for his re-election as speaker.

Mr. Hubbard’s problems are only a part of the turmoil in Montgomery these days. The governor, Robert Bentley, is being threatened with impeachment for matters surrounding an alleged affair with a chief adviser, and the State Supreme Court chief justice, Roy S. Moore, who is suspended, has been charged with violating judicial ethics in his orders to probate judges not to issue marriage licenses to same-sex couples.

Wow! The governor, the chief justice of the state supreme court, and all but one member of the legislature.

Back in the Clinton/Gingrich era, I came up with the proposal that every politician be sent to prison for a couple years before assuming office. That way the politician would already know how the other half lived; also, governing would be straightforward without the possibility of jail time hanging over the politician’s head. With the incarceration already in the past, the politician could focus on governing.

“Most notably, the vast majority of Americans support criminalizing data fraud, and many also believe the offense deserves a sentence of incarceration.”


Justin Pickett sends along this paper he wrote with Sean Roche:

Data fraud and selective reporting both present serious threats to the credibility of science. However, there remains considerable disagreement among scientists about how best to sanction data fraud, and about the ethicality of selective reporting.

OK, let’s move away from asking scientists. Let’s ask the general public:

The public is arguably the largest stakeholder in the reproducibility of science; research is primarily paid for with public funds, and flawed science threatens the public’s welfare. Members of the public are able to make rapid but meaningful judgments about the morality of different behaviors using moral intuitions.

Pickett and Roche did a couple surveys:

We conducted two studies—a survey experiment with a nationwide convenience sample (N = 821), and a follow-up survey with a representative sample of US adults (N = 964)—to explore public judgments about the morality of data fraud and selective reporting in science.

What did they find?

The public overwhelmingly judges both data fraud and selective reporting as morally wrong, and supports a range of serious sanctions for these behaviors. Most notably, the vast majority of Americans support criminalizing data fraud, and many also believe the offense deserves a sentence of incarceration.

We know from other surveys that people generally feel that, if there’s something they don’t like, it should be illegal. And are pretty willing to throw wrongdoers into prison. So, in that general sense, this isn’t so surprising. Still interesting to see it in this particular case.

As Evelyn Beatrice Hall never said, I disapprove of your questionable research practices, but I will defend to the death your right to publish their fruits in PPNAS and have them featured on NPR.

P.S. Just to be clear on this, I’m just reporting on an article that someone sent me. I don’t think people should be sent to prison for data fraud and selective reporting. Not unless they also commit real crimes that are serious.

P.P.S. Best comment comes from Shravan and AJG:

Ask the respondent what you think the consequences should be if
– you commit data fraud
– your coauthor commits data fraud
– your biggest rival commits data fraud
Then average these responses.

On deck this week

Mon: “Most notably, the vast majority of Americans support criminalizing data fraud, and many also believe the offense deserves a sentence of incarceration.”

Tues: Some insider stuff on the Stan refactor

Wed: I know you guys think I have no filter, but . . .

Thurs: Bigmilk strikes again

Fri: “Pointwise mutual information as test statistics”

Sat: Some U.S. demographic data at zipcode level conveniently in R

Sun: So little information to evaluate effects of dietary choices

Over at the sister blog, they’re overinterpreting forecasts

Matthew Atkinson and Darin DeWitt write, “Economic forecasts suggest the presidential race should be a toss-up. So why aren’t Republicans doing better?”

Their question arises from a juxtaposition of two apparently discordant facts:

1. “PredictWise gives the Republicans a 35 percent chance of winning the White House.”

2. A particular forecasting model (one of many many that are out there) predicts “The Democratic Party’s popular-vote margin is forecast to be only 0.1 percentage points. . . . a 51 percent probability that the Democratic Party wins the popular vote.”

Thus Atkinson and DeWitt conclude that “the Republican Party is underperforming this model’s prediction by 14 percentage points.” And they go on to explain why.

But I think they’re mistaken—not in their explanations, maybe, but in their implicit assumption that a difference between a 49% chance of winning from a forecast, and a 35% chance of winning from a prediction market, demands an explanation.

Why do I say this?

First, when you take one particular model as if it represents the forecast, you’re missing a lot of your uncertainty.

Second, you shouldn’t take the probability of a win as if it were an outcome in itself. The difference between a 65% chance of winning and a 51% chance of winning is not 14 percentage points in any real sense; it’s more like a difference of 1% or 2% of the vote. That is, the model predicts a 50/50 vote split, maybe the markets are predicting 52/48; that’s a 2-percentage-point difference, not 14 percentage points.
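
Here’s the back-of-the-envelope version of that translation, assuming (hypothetically) a roughly normal forecast of the two-party vote share with a standard deviation of about 3 percentage points:

  # Map a win probability back to the implied expected two-party vote share,
  # assuming a normal forecast with sd = 3 percentage points (a made-up but
  # plausible amount of forecast uncertainty).
  implied_vote_share <- function(p_win, sd = 3) 50 + sd * qnorm(p_win)

  implied_vote_share(0.51)   # about 50.1% of the two-party vote
  implied_vote_share(0.65)   # about 51.2%
  # So a 14-point gap in win probability is roughly a 1-point gap in vote share.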

It’s not that Atkinson and DeWitt are wrong to be looking at discrepancies between different forecasts; I just think they’re overinterpreting what is essentially 1 data point. Forecasts are valuable, but different information is never going to be completely aligned.

Causal and predictive inference in policy research

Todd Rogers pointed me to a paper by Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer that begins:

Empirical policy research often focuses on causal inference. Since policy choices seem to depend on understanding the counterfactual—what happens with and without a policy—this tight link of causality and policy seems natural. While this link holds in many cases, we argue that there are also many policy applications where causal inference is not central, or even necessary.

Kleinberg et al. start off with the example of weather forecasting, which indeed makes their point well: Even if we have no ability to alter the weather or even to understand what makes it do what it does, if we’re able to forecast the weather, this can still help us make better decisions. Indeed, we can feed a probabilistic forecast directly into decision analyses.

On the other hand, if you want to make accurate forecasts, you probably will want a causal model of the weather. Consider real-world weather forecasts, which use a lot of weather-related information and various causal models of atmospheric dynamics. So it’s not that causal identification is required to make weather decisions—but in practice we are using causal reasoning to get our descriptively accurate forecasts.

Beyond the point that we can make decisions based on non-causal forecasts, Kleinberg et al. also discuss some recent ideas from machine learning, which they apply to the problem of predicting the effects of hip and knee replacement surgery. I don’t really know enough to comment on this application but it seems reasonable enough. As with the weather example, you’ll want to use prior knowledge and causal reasoning to gather a good set of predictors and combine them well, if you want to make the best possible forecasts.

One thing I do object to in this paper, though, is the attribution to machine learning of ideas that have been known in statistics for a long time. For example, Kleinberg et al. write:

Standard empirical techniques are not optimized for prediction problems because they focus on unbiasedness. . . . Machine learning techniques were developed specifically to maximize prediction performance by providing an empirical way to make this bias-variance trade-off . . . A key insight of machine learning is that this price λ [a tuning parameter or hyperparameter of the model] can be chosen using the data itself. . . .

This is all fine . . . but it’s nothing new! It’s what we do in Bayesian inference every day. And it’s a fundamental characteristic of hierarchical Bayesian modeling that the hyperparameters (which govern how much partial pooling is done, or the relative weights assigned to different sorts of information, or the tradeoff between bias and variance, or whatever you want to call it) are inferred from the data.
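
To make the parallel explicit, here is a sketch of the same idea in its machine-learning and hierarchical-Bayes forms. This is my illustration, not anything from the Kleinberg et al. paper, and the data and variable names are hypothetical:

  # 1. Machine-learning style: choose the lasso penalty lambda by cross-validation.
  library(glmnet)
  cv_fit <- cv.glmnet(x = as.matrix(X), y = y)   # X: predictor matrix, y: outcome
  cv_fit$lambda.min                              # the data-chosen penalty

  # 2. Hierarchical Bayes style: the group-level standard deviation, which
  # governs how much partial pooling is done, is estimated from the data
  # along with everything else.
  library(rstanarm)
  fit <- stan_lmer(y ~ x + (1 | group), data = d)
  summary(fit)                # includes the estimated group-level sd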

It’s great for economists and other applied researchers to become aware of new techniques in data analysis. Also good for them to realize that certain ideas, such as the use of predictive models for decision making, have been around in statistics for a long time.

For example, here’s a decision analysis on home radon measurement and remediation that my colleagues and I published nearly twenty years ago. The ideas weren’t new then either, but I think we did a good job at integrating decision making with hierarchical modeling (that is, with tuning parameters chosen using the data). I link to my own work here not to claim priority but just to convey that these ideas are not new.

Again, nothing wrong with some economists writing a review article drawing on well known ideas from statistics and machine learning. I’m just trying to place this work in some context.

“Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]."


E. J. Wagenmakers points me to a delightful bit of silliness from PPNAS, “Hunger promotes acquisition of nonfood objects,” by Alison Jing Xu, Norbert Schwarz, and Robert Wyer. It has everything we’re used to seeing in this literature: small-N, between-subject designs, comparisons of significant to non-significant, and enough researcher degrees of freedom to buy Uri Simonsohn a lighthouse on the Uruguayan Riviera.

But this was my favorite part:

Participants in study 2 (n = 77) were recruited during lunch time (between 11:30 AM and 2:00 PM) either when they were entering a campus café or when they had eaten and were about to leave. . . . Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001].

Ya think?

But seriously, folks . . .

To me, the most interesting thing about this paper is that it’s so routine, nothing special at all. Published in PPNAS? Check. Edited by bigshot psychology professor (in this case, Richard Nisbett)? Check. Statistical significance with t=1.99? Check. What could possibly go wrong???

I happened to read the article because E. J. sent it to me, but it’s not particularly bad. It’s far better than the himmicanes and hurricanes paper (which had obvious problems with data selection and analysis), or the ovulation and clothing paper (data coding problems and implausible effect sizes), or the work of Marc Hauser (who wouldn’t let people see his data), or Daryl Bem’s ESP paper (really bad work, actually I think people didn’t even realize how bad it was because they were distracted by the whole ESP thing), or the beauty and sex ratio paper (sample size literally about a factor of 100 too low to learn anything useful from the data).

I guess I’d put this “hungry lunch” paper in roughly the same category as embodied cognition or power pose: it could be true, or the opposite could be true (hunger could well reduce the desire to acquire nonfood objects; remember that saying, “You can’t have your cake and eat it too”?). This particular study is too noisy and sloppy for anything much to be learned, but their hypotheses and conclusions are not ridiculous. I still wouldn’t call this good science—“not ridiculous” is a pretty low standard—but I’ve definitely seen worse.

And that’s the point. What we have here is regular, workaday, bread-and-butter pseudoscience. An imitation of the scientific discovery process that works on its own, week after week, month after month, in laboratories around the world, chasing noise around in circles and occasionally moving forward. And, don’t get me wrong, I’m not saying all this work is completely useless. As I’ve written on occasion, even noise can be useful in jogging our brains, getting us to think outside of our usual patterns. Remember, Philip K. Dick used the I Ching when he was writing! So I can well believe that researchers can garner useful insights out of mistaken analyses of noisy data.

What do I think should be done? I think researchers should publish everything: all their data, all their comparisons, without singling out whatever happens to have p less than .05. And I guess if you really want to do this sort of study, follow the “50 shades of gray” template and follow up each of your findings with a preregistered replication. In this case it would’ve been really easy.
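
As a footnote on what “too noisy” means in practice, here is a quick design-analysis sketch along the lines of the type S and type M error calculations I’ve discussed before. The true effect size and standard error below are made up; the point is just that with a plausibly small effect and this much noise, statistically significant estimates are rare, occasionally have the wrong sign, and are badly exaggerated on average:

  retro_design <- function(true_effect, se, alpha = 0.05, n_sims = 1e4) {
    z_crit <- qnorm(1 - alpha / 2)
    est    <- rnorm(n_sims, true_effect, se)      # hypothetical replications
    sig    <- abs(est) > z_crit * se              # "statistically significant"
    c(power        = mean(sig),
      type_s       = mean(sign(est[sig]) != sign(true_effect)),  # wrong sign
      exaggeration = mean(abs(est[sig])) / abs(true_effect))     # type M
  }
  retro_design(true_effect = 0.1, se = 0.25)      # made-up numbers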

Reproducible Research with Stan, R, knitr, Docker, and Git (with free GitLab hosting)

Jon Zelner recently developed a neat Docker packaging of Stan, R, and knitr for fully reproducible research. The first in his series of posts (with links to the next parts) is here:

* Reproducibility, part 1

The post on making changes online and auto-updating results using GitLab’s continuous integration service is here:

* GitLab continuous integration

It updates via pushes to a Git repository hosted by GitLab.

Jon says, “This is very much a work-in-progress, so any feedback would be greatly appreciated!” You can leave comments on the blog itself.
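
For a sense of what the R side of such a pipeline boils down to (a minimal sketch under my own assumptions, not Jon’s actual code): a scripted, seeded model fit that the continuous-integration job re-runs from scratch on every push, writing its output to files the build publishes.

  # analysis.R, or a chunk inside an .Rmd that knitr executes on each CI run.
  library(rstan)

  d <- readRDS("data/d.rds")             # hypothetical data shipped in the repo
  stan_data <- list(N = nrow(d), y = d$y)

  fit <- stan(
    file   = "model.stan",               # hypothetical Stan program in the repo
    data   = stan_data,
    chains = 4, iter = 2000,
    seed   = 20160716                    # fixed seed so reruns match
  )

  saveRDS(fit, "output/fit.rds")         # artifact for the CI job to publish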

Causal mediation


Judea Pearl points me to this discussion with Kosuke Imai at a conference on causal mediation. I continue to think that the most useful way to think about mediation is in terms of a joint or multivariate outcome, and I continue to think that if we want to understand mediation, we need to think about potential interventions or “instruments” in different places in a system. I think this is consistent with Pearl’s view although in different language. Recently I was talking with some colleagues about estimating effects of the city’s recent Vision Zero plan on reducing traffic deaths, and some of this thinking came up, that it makes sense to think about effects on crashes, injuries, serious injuries, and deaths. I also agree with Pearl (I think) that it’s generally important to have a substantive model of the process being studied. When I was a statistics student I was somehow given the impression that causal inference could, and even should, be done from a sort of black-box perspective. You have the treatment assignment, the outcomes, and you estimate the causal effect. But more and more it seems that this approach doesn’t work so well, that it really helps to understand at some level the causal mechanism.

Another way of putting it is that most effects are not large: they can’t be, there’s just not room in the world for zillions of large and consistent effects, it just wouldn’t be mathematically possible. So prior information is necessarily relevant in the design of a study. And, correspondingly, prior information will be useful, even crucial, in the analysis.

How does this relate to Pearl’s framework of causal inference? I’m not exactly sure, but I think when he’s using these graphs and estimating whether certain pathways are large and others are zero, that corresponds to a model of the world in which there are some outstanding large effects, and such a model can be appropriate in certain problem-situations where the user has prior knowledge, or is willing to make the prior assumption, that this is the case.

Anyway, perhaps the discussion of Imai and Pearl on these topics will interest you. Pearl writes, “Overall, the panel was illuminating, primarily due to the active participation of curious students. It gave me good reasons to believe that Political Science is destined to become a bastion of modern causal analysis.” That sounds good to me! My colleagues and I have been thinking about causal inference in political science for a long time, as in this 1990 paper. Political scientists didn’t talk much about causal inference at that time. Then a bunch of years later, political scientists started following economists in the over-use, or perhaps I should say, over-interpretation, of various trendy methods such as instrumental variables and regression discontinuity analysis. Don’t get me wrong—IV and RD are great, indeed Jennifer and I discuss both of them in our book—but there got to be a point where researchers would let the instrument or the discontinuity drive their work, rather than stepping back and thinking about their larger research aims. (We discuss one such example here.) A more encouraging trend in political science, with the work of Gerber and Green and others, is a seriousness about causal reasoning. One advantage of tying causal inference to field experiments, beyond all issues of identification, is that these experiments are expensive, which typically means that the people who conduct such an experiment have a sense that it might really work. Skin in the game. Prior information. Now I’m hoping that the field of political science is moving to a new maturity in thinking about causal inference, recognizing that we have various useful tools of design and analyses but not being blinded by them. I don’t agree with everything that Judea Pearl has written about causal inference, but one place I do agree with him is that causal reasoning is fundamental, and causal inference is too important to be restricted to clean settings with instruments, or discontinuities, or randomization. We need to go out and collect data and model the world.

“I would like to share some sad stories from economics related to these issues”


Per Pettersson-Lidbom from the Department of Economics at Stockholm University writes:

I have followed your discussions about replication, criticism, and the self-correcting process of science. I would like to share some sad stories from economics related to these issues. They are the stories of three papers published in highly respected journals: the study by Dahlberg, Edmark and Lundqvist (2012, henceforth DEL) published in the Journal of Political Economy, the study by Lundqvist, Dahlberg and Mork (2014, henceforth LDM) published in American Economic Journal: Economic Policy, and the study by Aidt and Shvets (2012, henceforth AS), also published in AEJ: Economic Policy. I decided to write comments on all 3 papers since I discovered that they all have serious flaws. Here are my stories (I will try to keep them as short as possible).

Starting with DEL’s analysis of whether there exists a causal relationship between ethnic diversity and preferences for redistribution, we (myself and Lena Nekby) discovered 3 significant problems with their statistical analysis: (i) an unreliable and potentially invalid measure of preferences for redistribution, (ii) an endogenously selected sample, and (iii) a mismeasurement of the instrumental variable (the refugee placement policy). We made DEL aware of some of these problems before they resubmitted their paper to JPE. However, they did not pay any attention to our critique. Thus, we decided to write a comment to JPE (we had to collect all the raw data ourselves since DEL refused to share their raw data). When we re-analyzed the data we found that correcting for any of these three problems reveals that there is no evidence of any relationship between ethnic diversity and preferences for redistribution. However, JPE desk-rejected (without sending it to referees) our paper twice (the first time by the same editor handling DEL and the second time by another editor when the original editor had stepped down). We then submitted our paper to 6 other respected economics journals, but it was always rejected (typically without being sent to referees). Nonetheless, most of the editors agreed with our critique but said that it was JPE’s responsibility to publish it. Eventually, Scandinavian Journal of Economics has recently decided to publish our paper.

The second example is from AS, which studies the effect of electoral incentives on the allocation of public services across U.S. legislative districts. I realized that they have 3 serious problems in their differences-in-differences design: (i) serial correlation in the errors, (ii) functional form issues, and (iii) omitted time-invariant factors at the district level, since the authors do not control for district fixed effects. When I reanalyze their data (posted on the journal’s website) I find that correcting for any of these three problems reveals that there is no evidence of any relationship. I submitted my comment to AEJ: Policy long before the paper was published but I was told by the editor that they do not publish comments. Instead, I was told to post a comment on their website. So that is what I did (see https://www.aeaweb.org/articles.php?doi=10.1257/pol.4.3.1)

The third example is from LDM, which uses a type of regression-discontinuity design (a kink design) to estimate causal effects of intergovernmental grants on local public employment. I discovered that their results depend on (i) extremely large bandwidths and (ii) mis-specified functional forms of the forcing variable, since they omit interactions in the second- and third-order polynomial specifications. I show that when correcting for any of these problems there is no regression kink that can be used for identification. I again wrote to the editor of AEJ: Policy (another editor this time) long before the paper was published, making them aware of this problem, but I was once more told that AEJ: Policy does not publish comments. Again, I was told to post my comment on their website and so I did (see https://www.aeaweb.org/articles.php?doi=10.1257/pol.6.1.167)

What bothers me most about my experience with replicating and checking the robustness of other people’s work is two things: (i) the reluctance of economics journals to publish comments on papers that are found to be completely and indisputably wrong (I don’t think posting a comment on a journal’s website is a satisfactory procedure. I am probably the only one stupid enough to do it!) and (ii) that researchers can get away with scientific fraud. The last point refers to my discovery that both DEL and LDM (the two papers have two authors in common) intentionally misreport their results. For example, in DEL they analyze at least 9 outcomes but only choose to report the 3 that confirm their hypothesis. Had they reported these other results, it would have been clear that there is no relationship between ethnic diversity and attitudes toward redistribution. DEL also make a number of sample restrictions, often unreported, which reduce the number of observations from 9,620 to 3,834, thereby creating a huge sample selection problem. Again, had they reported the results from the full sample it would have been very clear that there is no relationship. DEL also misreport the definition of their instrumental variable even though previous work has used exactly the same variable and the definition there has been correct. Had they reported the correct definition it would have been obvious that their instrument is actually a poor instrument since it does not measure what it is purported to measure. Turning to LDM, there are 4 estimates in their Table 2 (which shows the first-stage relationship) that have been left out intentionally. Had they reported these 4 estimates it would have been very clear that the first-stage relationship is not robust, since the sign of the estimate switches from positive (about 3) to negative (about -3). Moreover, had they reported smaller bandwidths (for example, a data-driven optimal RD bandwidth) it also would have been clear that there is no first-stage relationship, since for smaller bandwidths almost all the estimates are negative. Also, had they reported the correct polynomial functions it would have been very clear that the first-stage estimate is not robust.

So the bottom line of all this is that “the self-correcting process of science” does not work very well in economics. I wonder if you have any suggestions for how I should handle this type of problem, since you have had similar experiences.

I don’t have the energy to look into the above cases in detail.

But, stepping back and thinking about these issues more generally, I do think there’s an unfortunate “incumbency advantage” by which published papers with “p less than .05” are taken as true unless a large effort is amassed to take them down. Criticisms are often held to a much higher standard than was applied in the review of the original paper, and, as noted above, many journals don’t publish letters at all. Other problems include various forms of fraud (as alleged above) and a more general reluctance of authors even to admit honest mistakes (as in the defensive reaction of Case and Deaton to our relatively minor technical corrections to their death-rate-trends paper).

Hence, I’m sharing Pettersson’s stories, neither endorsing nor disputing their particulars but as an example of how criticisms in scholarly research just hang in the air, unresolved. Scientific journals are set up to promote discoveries, not to handle corrections.

In journals, it’s all about the wedding, never about the marriage.

Gremlins in the work of Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske

Remember that “gremlins” paper by environmental economist Richard Tol? The one that had almost as many errors as data points? The one where, each time a correction was issued, more problems would spring up? (I’d say “hydra-like” but I’d rather not mix my mythical-beast metaphors.)

Well, we’ve got another one. This time, nothing to do with the environment or economics; rather, it’s from some familiar names in social psychology.

Nick Brown tells the story:

For an assortment of reasons, I [Brown] found myself reading this article one day: This Old Stereotype: The Pervasiveness and Persistence of the Elderly Stereotype by Amy J. C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005). . . .

This paper was just riddled with errors. First off, its main claims were supported by t statistics of 5.03 and 11.14 . . . ummmmm, upon recalculation the values were actually 1.8 and 3.3. So one of the claims wasn’t even “statistically significant” (thus, under the rules, was unpublishable).

But that wasn’t the worst of it. It turns out that some of the numbers reported in that paper just couldn’t have been correct. It’s possible that the authors were doing some calculations wrong, for example by incorrectly rounding intermediate quantities. Rounding error doesn’t sound like such a big deal, but it can supply a useful set of “degrees of freedom” to allow researchers to get the results they want, out of data that aren’t readily cooperating.

Here’s how Brown puts it:

To summarise, either:
/a/ Both of the t statistics, both of the p values, and one of the dfs in the sentence about paired comparisons are wrong;
or
/b/ “only” the t statistics and p values in that sentence are wrong, and the means on which they are based are wrong.

And yet, the sentence about paired comparisons is pretty much the only evidence for the authors’ purported effect. Try removing that sentence from the Results section and see if you’re impressed by their findings, especially if you know that the means that went into the first ANOVA are possibly wrong too.
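
To make the rounding point concrete, here’s a minimal sketch in R of how rounding intermediate quantities before computing a two-sample t statistic can move a p-value across the .05 line. The helper function t_from_summaries and all the means, standard deviations, and sample sizes below are invented for this illustration; they are not numbers from the Cuddy, Norton, and Fiske paper.

# Hypothetical illustration: a Welch two-sample t test computed from
# summary statistics, once with full-precision intermediate values and
# once with the means and SDs rounded to one decimal place first.
# All numbers are made up for the sake of the example.

t_from_summaries <- function(m1, m2, s1, s2, n1, n2) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)          # standard error of the difference
  t  <- (m1 - m2) / se
  df <- (s1^2 / n1 + s2^2 / n2)^2 /
        ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))  # Welch-Satterthwaite df
  p  <- 2 * pt(-abs(t), df)                  # two-sided p-value
  round(c(t = t, df = df, p = p), 3)
}

# Full-precision summaries: t is about 1.7, p is about 0.09 -- not "significant"
t_from_summaries(m1 = 2.46, m2 = 2.14, s1 = 0.64, s2 = 0.66, n1 = 24, n2 = 24)

# Same summaries rounded to one decimal first: t is about 2.1, p is about 0.04
t_from_summaries(m1 = 2.5, m2 = 2.1, s1 = 0.6, s2 = 0.7, n1 = 24, n2 = 24)

That particular shift is, of course, a contrived configuration; the point is simply that when intermediate quantities are carried or reported at low precision, there’s room for the final test statistic to land on whichever side of .05 is more convenient.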

OK, everybody makes mistakes. These people are psychologists, not statisticians, so maybe we shouldn’t fault them for making some errors in calculation, working as they were in a pre-Markdown era.

The way that this falls into “gremlins” territory is how the mistakes fit together: The claims in this paper are part of an open-ended theory that can explain just about any result, any interaction in any direction. Publication’s all about finding something statistically significant and wrapping it in a story. So if it’s not one thing that’s significant, it’s something else.

And that’s why the authors’ claim that fixing the errors “does not change the conclusion of the paper” is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

As with Richard Tol’s notorious paper, the gremlins feed upon themselves, as each revelation of error reveals the rot beneath the original analysis, and when the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.

We’ve encountered all three of these authors before.

Amy Cuddy is a co-author and principal promoter of the so-called power pose, and she notoriously reacted to an unsuccessful outside replication of that study by going into deep denial. The power pose papers were based on “p less than .05” comparisons constructed from analyses with many forking paths, including various miscalculations which brought some p-values below that magic cutoff.

Michael Norton is a coauthor of that horrible air-rage paper that got so much press a few months ago, and even appeared on NPR. It was in a discussion thread on that air-rage paper that the problems of the Cuddy, Norton, and Fiske paper came out. Norton also is on record recommending that you buy bullfight tickets for that “dream vacation in Spain.” (When I mocked Norton and his coauthor for sending people to bullfights, a commenter mocked me right back by recommending “a ticket to a factory farm slaughterhouse” instead. I had to admit that this would be an even worse vacation destination!)

And, as an extra bonus, when I just googled Michael Norton, I came across this radio show in which Norton plugs “tech giant Peter Diamandis,” who’s famous in these parts for promulgating one of the worst graphs we’ve ever seen. These people are all connected. I keep expecting to come across Ed Wegman or Marc Hauser.

Finally, Susan Fiske seems to have been doing her very best to wreck the reputation of the prestigious Proceedings of the National Academy of Sciences (PPNAS) by publishing papers on himmicanes, power pose, and “People search for meaning when they approach a new decade in chronological age.” In googling Fiske, I was amused to come across this press release entitled, “Scientists Seen as Competent But Not Trusted by Americans.”

A whole fleet of gremlins

This is really bad. We have interlocking research teams making fundamental statistical errors over and over again, publishing bad work in well-respected journals, promoting bad work in the news media. Really the best thing you can say about this work is maybe it’s harmless because no relevant policymaker will take the claims about himmicanes seriously, no airline executive or transportation regulator would be foolish enough to believe the claims from those air rage regressions, and, hey, even if power pose doesn’t work, it’s not hurting anybody, right? On the other hand, those of us who really do care about social psychology are concerned about the resources and attention that are devoted to this sort of cargo-cult science. And, as a statistician, I feel disgust, at a purely aesthetic level, at these fundamental errors of inference. Wrapping it all up are the attitudes of certainty and defensiveness exhibited by the authors and editors of these papers, never wanting to admit that they could be wrong and continuing to promote and promote and promote their mistakes.

A whole fleet of gremlins, indeed. In some ways, Richard Tol is more impressive in that he can do it all on his own, whereas these psychology researchers work in teams. But the end result is the same. Error piled upon error piled upon error, piled upon a refusal to admit that their conclusions could be completely mistaken.

P.S. Look. I’m not saying these are bad people. I’m guessing that from their point of view, they’re doing science, they have good theories, their data support their theories, and “p less than .05” is just a silly rule they have to follow, a bit of paperwork that needs to be stamped on their findings to get them published. Sure, maybe they cut corners here or there, or make some mistakes, but those are all technicalities—at least, that’s how I’m guessing they’re thinking. For Cuddy, Norton, and Fiske to step back and think that maybe almost everything they’ve been doing for years is all a mistake . . . that’s a big jump to take. Indeed, they’ll probably never take it. All the incentives fall in the other direction. So that’s the real point of this post: the incentives. Forget about these three particular professionals, and consider the larger problem, which is that errors get published and promoted and hyped and Gladwell’d and Freakonomics’d and NPR’d, whereas when Nick Brown and his colleagues do the grubby work of checking the details, you barely hear about it. That bugs me, hence this post.

P.P.S. Putting this in perspective, this is about the mildest bit of scientific misconduct out there. No suppression of data on side effects from dangerous drugs, no million-dollar payoffs, no $228,364.83 in missing funds, no dangerous policy implications, no mistreatment of cancer patients, no monkeys harmed by any of these experiments. It’s just bad statistics and bad science, simple as that. Really the worst thing about it is the way in which respected institutions such as the Association for Psychological Science, National Academy of Sciences, and National Public Radio have been sucked into this mess.

“Positive Results Are Better for Your Career”

Brad Stiritz writes:

I thought you might enjoy reading the following Der Spiegel interview with Peter Wilmshurst. Talk about fighting the good fight! He took the path of greatest resistance, and he beat what I presume are pretty stiff odds.

Then the company representatives asked me to leave some of the patients out of the data analysis. Without these patients, the study result would have been positive.

I guess Serpico-esque stories like this are probably outlier stories, particularly when they have happier endings than what Frank Serpico experienced. Or for that matter, Boris Kolesnikov.

Wow—that’s pretty scary! I had cardiac catheterization once!

The Spiegel interview begins with a bang:

SPIEGEL: In your early years as a researcher, a pharmaceutical company offered you a bribe equivalent to two years of your salary: They wanted to prevent you from publishing negative study results. Were you disappointed that you weren’t worth more?

Peter Wilmshurst: (laughs) I was just a bit surprised to be offered any money, really. I was a very junior researcher and doctor, only 33 years old, so I didn’t know that sort of thing happened. I didn’t know that you could be offered money to conceal data.

SPIEGEL: How exactly did they offer it to you? They probably didn’t say: “Here’s a bribe for you.”

Wilmshurst: No, of course not! Initially we were talking about the results that I’d obtained: That the drug that I had been testing for them did not work and had dangerous side effects. Then the company representatives asked me to leave some of the patients out of the data analysis. Without these patients, the study result would have been positive. When I said I couldn’t do that, they asked me not to publish the data. And to compensate me for the work I had done in vain, they said, they would offer me this amount of money.

I recommend you read the whole thing.

P.S. Full disclosure: Some of my research is funded by Novartis.


Americans (used to) love world government


Sociologist David Weakliem writes:

It appears that an overwhelming majority of Americans who have an opinion on the subject think that Britain should remain in the European Union. But how many would support the United States joining an organization like the EU? My guess is very few. But back in 1946, the Gallup Poll asked “Do you think the United Nations organization should be strengthened to make it a world government with power to control the armed forces of all nations, including the United States?” 54% said yes, and only 24% no, with the rest undecided. The question was asked again in 1946 and 1947, with similar results. In 1951, the margin was smaller, at 49-36%. In 1953 and 1955, there were narrow margins against the idea. That was the last time the question, or anything like it, was asked. Of course, opposition probably would have increased if anyone had seriously tried to implement a plan like this, but for a while many Americans were willing to at least contemplate the idea.

Wow. 54%. Really? I did a Google search and indeed that’s what the poll said. Here’s George Gallup writing in the Pittsburgh Press on Christmas Eve, 1947:

[Screenshot: George Gallup’s syndicated column in the Pittsburgh Press, December 24, 1947]

Even more striking was the 49% support as late as 1951, at which point I assume any illusions about our Soviet allies had dissipated.

Weakliem does have a good point, though, when he writes that “opposition probably would have increased if anyone had seriously tried to implement a plan like this.” Supporting world government is one thing; supporting any particular version of it is another.

Anyway, this poll finding seems worth sharing amid all the Brexit discussion, also a good item for July 4th.