Point summary of posterior simulations?

Luke Miratrix writes:

In the applied stats class I'm teaching on hierarchical models I'm giving the students (a mix of graduate students, many from the education school, and undergrads) a taste of Stan. I have to give them some "standard" way to turn Stan output into a point estimate (though of course I'll also explain why credible intervals are better if they have the space to give them). But what I don't know is how to tell them to do this? What is the state of the art?

First, let’s presume we’re talking about a parameter which is more or less well-behaved: clearly identifiable, reasonably unimodal, at least a few steps along the asymptotic path towards symmetry, etc. In that case, the question is mostly just an academic one, because anything we do will come up with more-or-less the same answer. In cases like that, should we use median, mean, trimmed mean, or what?

Second, what happens when things start to break down? We might be close to the edge of parameter space, or there might be some bimodal situation where two (or more) parameters can combine in two (or more) different plausible ways, and the evidence from the data isn’t yet enough to conclusively rule one of those out. In those cases, plotting the distribution should clearly show that something is up, but what do we do about it in terms of giving a point estimate? The best answer might be “don’t”, but sometimes it can be hard to tell that to the ref… In this kind of case, the mean might be the most defensible compromise, though of course it should come with caveats.

Basically, I need to give them some simple answer that is least likely to raise eyebrows when they’re trying to publish. Bonus points if it’s also a good answer​!

My reply: If it’s one parameter at a time, I like the posterior median. Of course, in multiple dimensions the vector (median of parameter 1, median of parameter 2, . . ., median of parameter K) is not necessarily a good joint summary. So that’s a potential concern. No easy answers! There’s also some work on Bayesian point estimates. It turns out that the best prior, if you’re planning to compute a posterior mode, is not necessarily the prior you’d use for full Bayes:
http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf
http://www.stat.columbia.edu/~gelman/research/published/chung_cov_matrices.pdf

Bonus: these papers have education examples!
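For concreteness, here's a minimal sketch of the one-parameter-at-a-time summary I'm describing, in Python, with simulated draws standing in for whatever your Stan fit actually returns; the parameter names and numbers below are placeholders, not output from a real model.

```python
import numpy as np

# A minimal sketch of the "one parameter at a time" summary discussed above:
# posterior medians plus central intervals computed from simulation draws.
# The draws are simulated here as a stand-in for draws extracted from a Stan fit;
# the parameter names are hypothetical.
rng = np.random.default_rng(1)
draws = {
    "alpha": rng.normal(1.2, 0.3, size=4000),
    "sigma": np.abs(rng.normal(0.0, 1.0, size=4000)),
}

for name, d in draws.items():
    med = np.median(d)                   # point summary: the posterior median
    lo, hi = np.percentile(d, [25, 75])  # 50% central interval as a sanity check
    print(f"{name}: median = {med:.2f}, 50% interval = [{lo:.2f}, {hi:.2f}]")
```

The caveat above still applies: the vector of componentwise medians is not necessarily a sensible joint summary, so for anything multivariate you'd want to look at the joint draws directly.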

Stan talk in Seattle on Tuesday, May 17

I (Eric) will be giving a Stan talk at the Seattle useR Group next week. Daniel Lee and Ben Goodrich will be there as well. If you are in the Seattle area on Tuesday, please stop by and say hello. Thanks to Zach Stednick for organizing this meetup.

Leicester City and Donald Trump: How to think about predictions and longshot victories?

Leicester City was a 5000-to-1 shot to win the championship—and they did it.

Donald Trump wasn’t supposed to win the Republican nomination—last summer Nate gave him a 2% chance—and it looks like he will win.

For that matter, Nate only gave Bernie Sanders a 7% chance, and he came pretty close.

Soccer

There’s been a lot of discussion in the sports and political media about what happened here. Lots of sports stories just treated the Leicester win as a happy miracle, but more thoughtful reports questioned the naive interpretation of the 5000:1 odds as a probability statement. See, for example, this from Paul Campos:

Now hindsight is 20/20, but when a 5000 to 1 shot comes in that strongly suggests those odds were, ex ante, completely out of wack. . . . Leicester was the 14th-best team in the league last year in terms of points (and they were better than that in terms of goal differential, which is probably a better indicator of underlying quality). Anyway, the idea that it’s a 5000 to 1 shot for the 14th best team in one year to win the league in the next is obviously absurd on its face.

The 14th best team in the EPL is roughly equivalent to the 20th best team in MLB or the NBA or the NFL, in terms of distance from the top. Now obviously a whole bunch of things have to break right for a 75-87 team to have the best record in baseball the next year. It's quite unlikely — but quite unlikely as in 50-1 or maybe even 100-1.

That sounds about right to me. Odds or no odds, the Leicester story is inspiring.

Primary elections

There are a bunch of quotes out there from various pundits last year saying that Trump had zero chance of winning the Republican primary. To be fair to Nate, he said 2% not 0%, and there's a big difference between those two numbers.

But even when Nate was saying 2%, the betting markets were giving him, what, 10%? Something like that? In retrospect, 10% odds last fall seems reasonable enough: things did break Trump’s way. If Trump was given a 10% chance and he won, that doesn’t represent the failure of prediction markets.

I’d also say that, whatever Trump’s unique characteristics as a political candidate, his road to the nomination does not seem so different from that of other candidates. Yes, he’s running as an anti-Washington outsider who’s willing to say what other candidates won’t say, but that’s not such an unusual strategy.

My own forecasts

I’ve avoided making forecasts during this primary election campaign. Why? In some ways, my incentives are the opposite of political pundits’. Nate’s supposed to come up with a forecast—that’s his job—and he’s also expected to come up with some value added, above and beyond the betting markets. If Nate’s just following the betting markets, who needs Nate? Indeed, one might think that the bettors are listening to Nate’s recommendations when deciding how to bet. So Nate’s gotta make predictions, and he gets some credit for making distinctive predictions and being one step ahead of the crowd.

In contrast, if I make accurate predictions, ok, fine, I’m supposed to be able to do that. But if I make a prediction that’s way off, it’s bad for my reputation. The pluses of making a good forecast are outweighed by the minuses of screwing up.

Also, I haven’t been following the polls, the delegate race, the fundraising race, etc., very carefully. I don’t have extra information, and if I tried to beat the experts I’d probably just be guessing, doing little more than fooling myself (and some number of gullible blog readers).

Here’s a story for you. A few months ago I was cruising by the Dilbert blog and I came across some really over-the-top posts where Scott Adams was positively worshipping Donald Trump, calling him a “master persuader,” giving Trump the full Charlie Sheen treatment. Kinda creepy, really, and I was all set to write a post mocking Adams for being so sure that Trump knew what he was doing, it just showed how clueless Adams was, everybody knew Trump didn’t have a serious chance . . .

And then I remembered why primary elections are hard to predict. “Why Are Primaries Hard to Predict?”—that’s the title of my 2011 online NYT article. I guess I should’ve reposted it earlier this year. But now, after a season of Trump and Sanders, I guess the “primaries are hard to predict” lesson will be remembered for a while.

Anyway, yeh, primaries are hard to predict. So, sure, I didn’t think Trump had much of a chance—but what did I know? If primaries are hard to predict in general, they’re hard for me to predict, too.

Basically, I applied a bit of auto-peer-review to my own hypothetical blog post on Adams and Trump, and I rejected it! I didn’t run the post: I rightly did not criticize Adams for making what was, in retrospect, a perfectly fine prediction (even if I don’t buy Adams’s characterization of Trump as a “master persuader”).

The only thing I did say in any public capacity about the primary election was when an interviewer asked me if I thought Sanders could stand up against Donald Trump if they were to run against each other in a general election, and I replied:

I think the chance of a Sanders-Trump matchup is so low that we don’t have to think too hard about this one!

It almost happened! But it didn’t. Here I was taking advantage of the fact that the probability of two unlikely events is typically much smaller than the probability of either of them alone. OK, the two parties’ nominations are not statistically independent—it could be that Trump’s success fueled that of Sanders, and vice-versa—but, still, it’s much safer to assign a low probability to “A and B” than to A or B individually.

But, yeah, primaries are hard to predict. And Leicester was no 5000:1 shot, even prospectively.

P.S. Some interesting discussion in comments, including this exchange:

Anon:

A big difference between EPL and Tyson-Douglas is that there were only two potential winners of the boxing match. The bookies aren’t giving odds on SOME team with a (equivalent to) 75-87 record or worse winning the most games next year – but for one specific team. 5000-1 may be low, but 50-1 is absurdly high given there actually are quality differences between teams, and there are quite a few of them.

My response:

Yes, I agree. Douglas’s win was surprising because he was assumed to be completely outclassed by Tyson, and then this stunning thing happened. Leicester is a pro soccer team and nobody thought they were outclassed by the other teams—on “any given Sunday” anyone can win—but it was thought they were doomed by the law of large numbers. One way to think about Leicester’s odds in this context would be to say that, if they really are the 14th-best team, then maybe there are about 10 teams with roughly similar odds as theirs, and one could use historical data to get a sense of what’s the probability of any of the bottom 10 teams winning the championship. If the combined probability for this “field” is, say, 1/20, then that would suggest something like a 1/200 chance for Leicester. Again, just a quick calculation. Here I’m using the principles explained in our 1998 paper, “Estimating the probability of events that have never occurred.”

Also a comment from fraac! More on that later.

P.P.S. There still seems to be some confusion so let me also pass along this which I posted in comments:

N=1 does give us some information but not much. Beyond that I think it makes sense to look at “precursor data”: near misses and the like. For example if there’s not much data on the frequency of longshots winning the championship, we could get data on longshots placing in the top three, or the top five. There’s a continuity to the task of winning the championship, so it should be possible to extrapolate this probability from the probabilities of precursor events. Again, this is discussed in our 1998 paper.

The key to solving this problem—as with many other statistics problems—is to step back and look at more data.

Just by analogy: what’s the probability that a pro golfer sinks a 10-foot putt? We have some data (see this presentation and scroll thru to the slides on golf putting; see also my Teaching Statistics book with Nolan; the data come from Don Berry’s textbook from 1995, and there’s more on the model in this article from 2002) which shows a success rate of 67/200, ok, that’s a probability of 33.5%, which is a reasonable estimate. But we can do better by looking at data from 7-foot putts, 8-foot putts, 9-foot putts, and so on. The sparser the data, the more it will make sense to model. This idea is commonplace in statistics but it often seems to be forgotten when discussing rare events. It’s easy enough to break down a soccer championship into its component pieces, and so there should be no fundamental difficulty in assigning prospective probabilities. In short, you can get leverage by thinking of this championship as part of a larger set of possibilities rather than as a one-of-a-kind event.
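To make the pooling idea concrete, here's a rough sketch in Python. It fits a generic logistic curve of success probability against distance, which is not the geometry-based model in the 2002 article, and the counts at distances other than 10 feet are round placeholder numbers rather than the real data; only the 67/200 figure comes from the discussion above.

```python
import numpy as np

# Borrow strength across distances: instead of using only the 67/200 result at
# 10 feet, fit a smooth curve of success probability vs. distance and read off
# the 10-foot estimate. The counts below (except 67/200) are made-up placeholders.
dist  = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
tries = np.array([500, 400, 300, 250, 200, 150], dtype=float)
made  = np.array([450, 300, 180, 110,  67,  40], dtype=float)   # 67/200 is from the post

X = np.column_stack([np.ones_like(dist), dist])   # intercept + distance
beta = np.zeros(2)
for _ in range(25):                               # Newton-Raphson for binomial logit
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (made - tries * p)               # score
    hess = X.T @ (X * (tries * p * (1 - p))[:, None])   # observed information
    beta += np.linalg.solve(hess, grad)

p10 = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * 10.0)))
print(f"raw estimate at 10 ft: {67/200:.3f}; pooled logistic estimate: {p10:.3f}")
```

In real life you'd fit something closer to the actual putting model and carry the uncertainty along, but even this crude version shows how information from nearby distances stabilizes the estimate at any one distance.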

Happy talk, meet the Edlin factor

Mark Palko points us to this op-ed in which psychiatrist Richard Friedman writes:

There are also easy and powerful ways to enhance learning in young people. For example, there is intriguing evidence that the attitude that young people have about their own intelligence — and what their teachers believe — can have a big impact on how well they learn. Carol Dweck, a psychology professor at Stanford University, has shown that kids who think that their intelligence is malleable perform better and are more motivated to learn than those who believe that their intelligence is fixed and unchangeable.

In one experiment, Dr. Dweck and colleagues gave a group of low-achieving seventh graders a seminar on how the brain works and put the students at random into two groups. The experimental group was told that learning changes the brain and that students are in charge of this process. The control group received a lesson on memory, but was not instructed to think of intelligence as malleable.

At the end of eight weeks, students who had been encouraged to view their intelligence as changeable scored significantly better (85 percent) than controls (54 percent) on a test of the material they learned in the seminar.

I can believe that one group could do much better than the other group. But do I think there’s a 31 percentage point effect in the general population? No, I think the reported difference is an overestimate because of the statistical significance filter. How much should we scale it down? I’m not sure. Maybe an Edlin factor of 0.1 is appropriate here, so we’d estimate the effect as being 3 percentage points (with some standard error that will doubtless be large enough that we can’t make a confident claim that the average population effect is greater than 0)?

MAPKIA 2: Josh and Drew shred the CCP/APPC “Political Polarization Literacy” test!

Just like the original Jaws 2, this story features neither Richard Dreyfuss nor Steven Spielberg.

It all started when Dan Kahan sent me the following puzzle:

Match the responses of a large nationally representative sample to these policy items.

[Image: Kahan's list of policy items]

[Image: unlabeled curves showing support for each policy item]

I let this languish in my inbox for a while until Kahan taunted me by letting me know he’d posted the solution online.

I let it languish a bit longer and then one day Josh “Sanjurjo” Miller was in my office and we decided to play this latest game of MAPKIA. How hard could it be to match up the opinion items with the data?

Not so hard at all, we thought. We spent about 10 minutes and came up with our best guess. We attacked the problem crossword-style, starting with the easy items (sports betting, vaccinations, legalized prostitution) and then working our way through the others, by process of elimination.

But before I give you our guesses, and before I tell you the actual answers (courtesy of Kahan), give it a try yourself.

Take your time, no rush.

Keep going.

Scroll down when the proctor declares that the exam is over.

If you finish early, feel free to click on the random ass pictures that I’ve inserted to prevent you from inadvertently spotting the answers before completing the test!

Time’s up!

Okay, here’s what I’m going to do. First, I’m going to start by showing you our guesses. Second, I’m going to show you the “answer key,” which consists of the original figure (actually, Kahan changed around some of the colors, I have no idea why) with labels.

OK, here’s what Josh and I guessed:

[Image: our guesses matching items to curves]

And here’s what the survey said:

[Image: the answer key, with each curve labeled]

OK, as I said, it’s tricky to do the scoring because Kahan switched some of the colors on us, but if you go back to the top image and compare, you’ll see that we got 8 out of 10 correct! The only mistake we made was to switch #4 (“universal health care”) and #6 (“global warming carbon emission”). And, as you can see those two curves are pretty damn close.

8 out of 10! Pretty good, huh? Especially considering that we couldn’t get 9 out of 10 since we knew that each item lined up with exactly one curve. So we made only one mistake.

Indeed, if you go with Kahan’s absurdly complicated scoring scheme, we got a perfect score of 14.75!

Josh and I high-fived each other (literally) when we found out we’d done so well. But then I thought . . . hey, I should be able to do well on this sort of quiz. After all, partisan polarization of U.S. public opinion is supposed to be my area of expertise.

Still, I’ll take the win.

Big Belly Roti on Amsterdam Ave and 123 St

Josh “hot hand” Miller was in town and a bunch of us went to this new Caribbean place around the corner. It was good!

P.S. The other hot hand guy, Sanjurjo, is not in town.

What’s the motivation to do experiments on motivation?

Bill Harris writes:

Do you or your readers have any insights into the research that underlies Dan Pink’s work on motivation and Tom Wujec’s (or Peter Skillman’s) work on iterative development?  They make intuitive sense to me (but may be counterintuitive to others), but I don’t know much more about them.

Pink’s work is summarized in a TED talk (yeah, okay, cred problem already, eh?) on The puzzle of motivation (there’s a shorter RSA animation here).  The basic claim?  Once one gets past routine tasks and once one has enough financial reward to take money off the table, then extrinsic motivation worsens performance instead of improving it.  That’s largely consistent with Deming’s views, too.

Wujec’s and Skillman’s work is summarized in another TED talk, called Build a tower, build a team, and on The Marshmallow Challenge Web site.  If you listen to the end, you get the message that iteration trumps planning, that extrinsic motivation alone degrades performance seriously, and that knowledge about effective (iterative, collaborative) processes combined with financial rewards trumps even that knowledge by itself.  In a way, that’s consistent with Tukey’s EDA, with your emphasis on starting small and iterating (still not quite expressly consistent with your “throw everything into a model and test it all at once”), and with the iterative stance of action research and continual process improvement.

Both claim to be supported by studies.  There’s an interview with Pink in HBR called What Motivates Us?  but everything I find online says there is research but doesn’t provide it (I read the book, but it was when it was new, and I no longer have it easily available).  So far, the Wujec / Skillman position seems supported mostly by experiments they’ve run in workshops they’ve conducted.

They also seem to conflict in one key sense.  Pink’s thesis seems to be that financial rewards degrade performance, period, while the Wujec / Skillman thesis seems to be that there’s an interaction between knowledge and financial reward, as if financial rewards amplify the effect of appropriate knowledge.

We may all have our biases, but I’m curious about cutting through those to what useful investigation of the data tells us.  What does the research behind these say?  Is that research good, or does it fall prey to low power and that pesky overgrown garden you keep writing about?

My reply:

I have no idea. It all seems much more worthy than that power pose stuff. I guess I should talk with a serious researcher at Columbia or somewhere who works on organizational psychology. Or maybe someone who does research on sports. Usually the way I could get into such a topic would be to read some papers and check out the data and statistical analyses but here I wouldn’t know where to start.

The topic is interesting in itself and is of particular importance to me because nearly every day I’m thinking about what I can do to keep Stan team members happy and productive, and we do have some mix of structure we can impose and extrinsic rewards we can give out. The idea of iterative improvement makes sense to me, although to some extent that just pushes the question back one step to: How do you set up the iterative improvement process?

Finally, the relation between common sense, individual experiences, and statistical evidence is unclear here. Sure, statistical evidence should be relevant, but maybe the details of how to manage the team vary so much that there is no general recipe for success or even for relative success. Then it would be hard to study the topic using the usual treatment vs. control strategy.

A similar issue arises in education research which may be one reason we don’t practice what we preach, and even in silly things like power pose: if, as seems reasonable, different “poses” work for different people in different settings, then it may be close to hopeless to learn anything useful from a conventional experiment. One could however take one step back and do a controlled trial of meta-advice, for example instead of recommending a specific pose, the person in the experiment could just be asked to choose a pose, or to sit and focus for a few minutes. Then again, that won’t work for everyone either . . .

The Access to Justice Lab at Harvard Law School: Job Openings!

Jim Greiner writes:

The Access to Justice Lab is a startup effort, initially supported by the Laura and John Arnold Foundation with sufficient funds for three years, headed by Jim Greiner at Harvard Law School. The Lab will produce randomized control trials (“RCTs”) directly involving courts and lawyers, particularly in the areas of access to justice and court administration (including agency adjudication). It will also combat the legal profession’s current hostility to RCTs through short courses, publications, presentations, and other methods. The Lab is hiring a Research Director, a Research Associate (Field), and a half-time Research Associate (Data). Lab personnel will be trained to, and expected to, produce their own interventions and RCTs. After sufficient time, Lab personnel will be invited to create their own self-sustaining research agendas at other institutions (including legal academia). To view position descriptions and to apply, please use the links below. Contact Jim Greiner, jgreiner(at)law.harvard.edu, for further information.
Research Director Access to Justice Lab: http://bit.ly/1QWBF4T
Research Associate (Field) Access to Justice Lab: http://bit.ly/1WjzqR0
Research Associate (Data) Access to Justice Lab: http://bit.ly/23CeNPx

Cool! Just watch out for that Laurence Tribe guy running around in the hallways screaming about Obama burning the Constitution. Bluntly put, the guy needs a newly created DOJ position dealing with the rule of law. Pronto.

P.S. For my own views, positive and negative, on randomized clinical trials, see this 2011 article.

Bill James does model checking

Regular readers will know that Bill James was one of my inspirations for becoming a statistician.

I happened to be browsing through the Bill James Historical Baseball Abstract the other day and came across this passage on Glenn Hubbard, who he ranks as the 88th best second baseman of all time:

Total Baseball has Glenn Hubbard rated as a better player than Pete Rose, Brooks Robinson, Dale Murphy, Ken Boyer, or Sandy Koufax, a conclusion which is every bit as preposterous as it seems to be at first blush.

To a large extent, this rating is caused by the failure to adjust Hubbard’s fielding statistics for the ground-ball tendency of his pitching staff. Hubbard played second base for teams which had very high numbers of ground balls, as is reflected in their team assists totals. The Braves led the National League in team assists in 1985, 1986, and 1987, and were near the league lead in the other years that Hubbard was a regular. Total Baseball makes no adjustment for this, and thus concludes that Hubbard is reaching scores of baseballs every year that an average second baseman would not reach, hence that he has enormous value.

Posterior checking! This would fit in perfectly in chapter 6 of BDA.

This idea is so fundamental to statistics—to science—and yet so many theories of statistics and theories of science have no place for it.

The alternative to the Jamesian, model-checking approach—so close to the Jaynesian approach!—is exemplified by Pete Palmer’s Total Baseball book, mentioned in the above quote. Pete Palmer did a lot of great stuff, and Bill James is a fan of Palmer, but Palmer follows the all-too-common approach of just taking the results from his model and then . . . well, then, what can you do? You use what you’ve got.

What makes Bill James special is that he’s interested in getting it right, and he’s interested in seeing where things went wrong.

A chicken is an egg’s way of making another egg.

To make the analogy explicit: the “egg” is the model and data, and the chicken is the inferences from the model. The chicken is implicit in the egg, but it needs some growing. The inferences are implicit in the model and the data, but it takes some computing.

All the effort that went into Total Baseball was useful for sabermetrics, in part for the direct relevance of the results (a delicious “chicken”) and in part because Total Baseball included so much data and made so many inferences that people such as James could come in and see which of these statements made no sense—and what this revealed about the problems with Palmer’s model.

It’s like Lakatos said in Proofs and Refutations: once you have the external counterexample—an implication that doesn’t make sense—you go find the internal flaw—the assumption in the model that went wrong, often an assumption that was so implicit in the construction of your procedure that you didn’t even realize it was an assumption at all. (Remember the Speed Racer principle?) Or, conversely, if you first find an internal assumption that concerns you, you should follow the thread outward and figure out its external consequences: what does it imply that does not make sense?

P.S. James is doing a posterior check, not a prior check, because his criticism of the Total Baseball model comes from the absurdity of one of its inferences, conditional on the data.

On deck this week

Mon: Bill James does model checking

Tues: What’s the motivation to do experiments on motivation?

Wed: Happy talk, meet the Edlin factor

Thurs: FDA approval of generic drugs: The untold story

Fri: Acupuncture paradox update

Sat: Point summary of posterior simulations?

Sun: Peer review abuse flashback

Math on a plane!

Paul Alper pointed me to this news article about an economist who got BUSTED for doing algebra on the plane.

This dude was profiled by the lady sitting next to him who got suspicious of his incomprehensible formulas.

I feel that way about a lot of econ research too, so I can see where she was coming from!

All jokes aside, though, this made me realize how different plane travel is now from what it used to be.

OK, let me explain. The first time I ever rode on a plane was in 1978—it was a family trip to Disneyland, and I was super excited! I supply the date just to establish that I have only flown during the modern era of mass air travel, crowded planes, cramped seats, stale air, and all the rest. I’m not mourning for the so-called golden age of air travel, which I never actually experienced.

No, the change I’m talking about is in the interactions with the strangers sitting next to me. As I remember it, you’d typically have a bit of a conversation with your seat-mate, sometimes a pretty long one. Once I even got a woman’s phone number! (True story: I was flying to D.C. and she opened with, “Do you fly the shuttle often?”) But, forget about that, there’d usually be a bit of chitchat, where are you flying to?, etc. If the people nearby had kids, some peekaboo or whatever.

In recent years, though, not so much chat. It could just be me, that I’ve become too lazy to engage in conversation with strangers, but from my general impressions it looks to be more general, that people feel less of a need or desire to have these little conversations.

Anyway, back to economists on planes. In August 1990 I flew from Boston to San Francisco to start my new job at the University of California, and my new life as a non-student. I would’ve driven, but my car had been having problems with its starter motor and during the summer I was often having to push-start it. This was no big deal—the car was a ’76 Corolla which was so light I could even push it uphill if the incline was slight enough, and it was no problem to maneuver it to a flat space, get out and push it alongside the driver’s seat with the door open, then jump in and pop it into second gear—sometimes it would even stall at a traffic light and I could still hop out and get it going without delaying the cars behind me—but my dad convinced me, sensibly enough, that I’d only have to have one mishap on some hilly road somewhere and I could ruin my back forever, so I decided to sell the car. Of course you can’t really sell a car with an intermittent starter motor—if you tell the potential buyer of the problem, the sale’s off, and if you don’t tell them about it, you’re being unethical—so I got the car repaired. I’d tried to get this problem fixed earlier with no success, but for some reason this time it worked. So now my car was ok! But I’d already bought the plane ticket so I sold the car anyway (for $500, which was not much more than the cost of that repair job). The day before I sold it, I was stopped for having an expired registration! I explained the story to the cop, though, and he didn’t ticket me.

So . . . here I am on the plane, and I get to talking with my seat-mate. I told him I was moving to Berkeley to teach in the statistics department. It turned out that he taught economics at Berkeley. His name was Barry Eichengreen. We got to talking about research and I told him about our paper on estimating the incumbency advantage. The paper is here, and I didn’t realize it at the time but in retrospect I’d have to say it’s the most econ-friendly paper I’ve ever written. It has an unbiased estimate, it has estimators, it has proofs, it even has tables! Eichengreen gave me his card (I think; or maybe I just wrote down his name) and he asked me to send him a copy of the incumbency paper.

In retrospect, it’s funny that he didn’t ask me to just come by his office. So now I’m thinking we must have had this conversation on an earlier flight to San Francisco, maybe the flight I took that summer to go and find an apartment to rent.

In any case, one thing I remember clearly is that he asked me to send him a copy of the paper, and then he said something like: Really, I’d like to see the paper, I’m not just being polite. And I was like, sure, of course, I’ll send it to you, thanks for your interest.

I never bothered sending him the paper. I guess I could do so now, but I’m guessing he’s forgotten the whole incident. Eichengreen’s a well known economist but back then we were both nobodies; I just happen to remember the incident because (a) it was cool that someone wanted to read one of my papers, and (b) I was all keyed up to be flying out to my new job. It’s funny that I didn’t send him the paper; usually I’m pretty conscientious about this sort of thing.

Drive-by

Jona Sassenhagen writes:

Here is a paper ***, in case you, errrrr, have run out of other things to blog about …

I took a look and replied:

Wow—what a horrible paper. Really ignorant. Probably best for me to just ignore it!

Doing data science

Someone sent me this question:

As a social and political science expert, you analyze data related to everything from public health and clinical research to college football. Considering how adaptable analytics expertise is, what kinds of careers are available to one with this skillset? In which industries are data scientists and analysts particularly in demand? What about emerging opportunities, or sectors that you believe could use more data experts?

I replied:

Data science is useful all over. What you need to do is look at problems that need to be solved, identify places where existing solutions could be improved in some meaningful way, find or gather data that are relevant to your question, and perform some analysis integrating old and new data. Theory can be useful too in allowing you to get a sense of what can be learned from additional data.

The Puzzle of Paul Meehl: An intellectual history of research criticism in psychology

There’s nothing wrong with Meehl. He’s great. The puzzle of Paul Meehl is that everything we’re saying now, all this stuff about the problems with Psychological Science and PPNAS and Ted talks and all that, Paul Meehl was saying 50 years ago. And it was no secret. So how is it that all this was happening, in plain sight, and now here we are?

An intellectual history is needed.

I’ll start with this quote, provided by a commenter on our recent thread on the “power pose,” a business fad hyped by NPR, NYT, Ted, etc., based on some “p less than .05” results that were published in a psychology journal. Part of our discussion turned on the thrashing attempts of power-pose defenders to salvage their hypothesis in the context of a failed replication, by postulating an effect that works under some conditions but not others, an effect that shines brightly when studied by motivated researchers but slips away in replications.

And now here’s Meehl, from 1967:

It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.

Exactly! Meehl got it all down in 1967. And Meehl was respected, people knew about him. Meehl wasn’t in that classic 1982 book edited by Kahneman, Slovic, and Tversky, but he could’ve been.

But somehow, even though Meehl was saying this over and over again, we weren’t listening. We (that is, the fields of statistics and psychometrics) were working on the edges, worrying about relatively trivial issues such as the “file drawer effect” (1979 paper by Rosenthal cited 3500 times) and missing the big picture, the problem discussed by Meehl, that researchers are working within a system that can keep producing publishable findings even when the underlying effects are null.

It’s a little bit like the vacuum energy in quantum physics. Remember that? The idea that the null state is not zero, that even in a vacuum there is energy, there are particles appearing and disappearing? It’s kinda like that in statistical studies: there’s variation, there’s noise, and if you shake it up you will be able to find statistical significance. Meehl has a sociological model of how the vacuum energy and the statistical significance operator can sustain a theory indefinitely even when true effects are zero.

But nobody was listening. Or were listening but in one ear and out the other. Whatever. It took us nearly half a century to realize the importance of p-hacking and the garden of forking paths, to realize that these are not just ways to shoot down joke pseudo-research such as Bem’s ESP study (published in JPSP in 2011) and the notorious Bible Code paper (published in Statistical Science—how embarrassing!—in 1994), but that they are a key part of how the scientific process works. P-hacking and the garden of forking paths grease the wheels of normal science in psychology and medicine. Without these mechanisms (which extract statistical significance from the vacuum energy), the whole system would dry up, we’d have to start publishing everything where p is less than .25 or something.

So . . . whassup? What happened? Why did it take us nearly 50 years to absorb what Meehl was saying all along? This is what I want the intellectual history to help me understand.

P.S. Josh Miller points to these video lectures from Meehl’s Philosophical Psychology class.

I’m really getting tired of this sort of thing, and I don’t feel like scheduling it for September, so I’ll post it at 1 in the morning

Bummer! NPR bites on air rage study.

OK, here’s the story. A couple days ago, regarding the now-notorious PPNAS article, “Physical and situational inequality on airplanes predicts air rage,” I wrote:

NPR will love this paper. It directly targets their demographic of people who are rich enough to fly a lot but not rich enough to fly first class, and who think that inequality is the cause of the world’s ills.

The next day I did a media roundup and found no NPR mentions of air rage, hence I had to write:

I was unfair to NPR.

Commenter Sepp slammed me on it:

Why the dig at NPR? And why the implication that NPR listeners cannot distinguish good scientific articles from bad ones that agree with listeners’ values? On that note, why the implicit indictment of said values (i.e. the desire to reduce inequality, etc.)? I find these statements saddening and confusing.

I replied that I meant no condemnation, implicit or otherwise, of the desire to reduce inequality. But, sure, I shouldn’t blame NPR for a news story they didn’t even run!

But then commenter Diogo reported that the study was mentioned on Wait Wait, and commenter Adam reported:

Vindication! NPR Finally bit!

The (usually) esteemed Planet Money podcast uncritically retweeted the generally uncritical qz.com article today: https://twitter.com/planetmoney/status/728292705193824257

Feel free to direct criticism to planetmoney (at) npr.com

OK, just a tweet. But still.

The Wait Wait thing is no big deal—they could well have mentioned this study only for the purpose of mocking it.

But the other one . . . it’s so frustrating. I post and post and post and only a few thousand people read. Planet Money tweets and reaches 300,000 people. And NPR is so . . . official. Being on NPR, it’s like being on Gladwell or Ted, it’s the ultimate badge of seriousness.

As to qz.com . . . I’d not heard of them before (or, at least, not that I remember). Based on this linked article, I’m not impressed. But, who knows, they could have lots of good stuff too. I won’t judge an entire media channel based on N=1.

In all seriousness . . .

I have no problem with NPR. NPR is great. That’s why I’m bummed when it falls for junk science.

“Null hypothesis” = “A specific random number generator”

In an otherwise pointless comment thread the other day, Dan Lakeland contributed the following gem:

A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

I couldn’t care less about p-values but I really really like the identification of a null hypothesis with a random number generator. That’s exactly the point.

The only thing missing is to specify that “as extreme or more extreme” is defined in terms of a test statistic which itself needs to be defined for every possible outcome of the random number generator. For more on this last point, see section 1.2 of the forking paths paper:

The statistical framework of this paper is frequentist: we consider the statistical properties of hypothesis tests under hypothetical replications of the data. Consider the following testing procedures:

1. Simple classical test based on a unique test statistic, T , which when applied to the observed data yields T(y).

2. Classical test pre-chosen from a set of possible tests: thus, T(y;φ), with preregistered φ. For example, φ might correspond to choices of control variables in a regression, transformations, and data coding and excluding rules, as well as the decision of which main effect or interaction to focus on.

3. Researcher degrees of freedom without fishing: computing a single test based on the data, but in an environment where a different test would have been performed given different data; thus T(y;φ(y)), where the function φ(·) is observed in the observed case.

4. “Fishing”: computing T(y;φj) for j = 1,…,J: that is, performing J tests and then reporting the best result given the data, thus T(y; φbest(y)).

Our claim is that researchers are doing #3, but the confusion is that, when we say this, researchers think we’re accusing them of doing #4. To put it another way, researchers assert that they are not doing #4 and the implication is that they are doing #2. In the present paper we focus on possibility #3, arguing that, even without explicit fishing, a study can have a huge number of researcher degrees of freedom, following what de Groot (1956) refers to as “trying and selecting” of associations. . . .

It might seem unfair that we are criticizing published papers based on a claim about what they would have done had the data been different. But this is the (somewhat paradoxical) nature of frequentist reasoning: if you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data. . . .
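As a small aside, here is a sketch of the contrast between procedures #2 and #4 above, which is the easier case to simulate (#3 is the subtler situation the paper is actually about). Everything below is pure simulated noise, and the sample size, number of candidate tests, and threshold are arbitrary choices, not values from any real study.

```python
import numpy as np
from scipy import stats

# Contrast procedure #2 (one pre-chosen test) with procedure #4 ("fishing":
# run J tests, report the best) when the data really are pure noise.
rng = np.random.default_rng(0)
n, J, sims, alpha = 50, 10, 2000, 0.05

reject_prechosen, reject_fishing = 0, 0
for _ in range(sims):
    y = rng.normal(size=(n, J))            # J candidate outcomes, no true effects
    group = rng.integers(0, 2, size=n)     # random two-group assignment
    pvals = np.array([stats.ttest_ind(y[group == 0, j], y[group == 1, j]).pvalue
                      for j in range(J)])
    reject_prechosen += pvals[0] < alpha   # T(y; phi) with phi fixed in advance
    reject_fishing  += pvals.min() < alpha # T(y; phi_best(y))

print(f"false positive rate, pre-chosen test: {reject_prechosen / sims:.3f}")
print(f"false positive rate, best of {J} tests: {reject_fishing / sims:.3f}")
```

With ten independent outcomes and a 0.05 threshold, the pre-chosen test should reject near 5% of the time while the "best of J" procedure should come out near 1 - 0.95^10, roughly 40%, even though nothing is going on.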

Summary

These are the three counterintuitive aspects of the p-value, the three things that students and researchers often don’t understand:

– The null hypothesis is not a scientific model. Rather, it is, as Lakeland writes, “a specific random number generator.”

– The p-value is not the probability that the null hypothesis is true. Rather, it is the probability of seeing a test statistic as large or larger than was observed, conditional on the data coming from this specific random number generator.

– The p-value depends entirely on what would have been done under other possible datasets. It is not rude to speculate on what a researcher would have done had the data been different; actually, such specification is required in order to interpret the p-value, in the same way that the only way to answer the Monty Hall problem is to specify what Monty would have done under alternative scenarios.
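Here's the same idea in code, a minimal sketch with a made-up dataset and a deliberately simple choice of test statistic; the null hypothesis is literally a random number generator, and the p-value is just a tail proportion of its output.

```python
import numpy as np

# "Null hypothesis = a specific random number generator," made literal:
# simulate the RNG many times, and the p-value is the share of simulated test
# statistics at least as extreme as the observed one. The observed data and
# the choice of test statistic here are arbitrary illustrations.
rng = np.random.default_rng(42)

y_obs = np.array([0.9, 1.3, -0.2, 2.1, 0.4, 1.1, 0.7, 1.8])  # made-up data
t_obs = abs(y_obs.mean())          # test statistic, defined for any dataset

def null_rng(n, size):
    # The null hypothesis: n iid draws from Normal(0, 1).
    return rng.normal(0.0, 1.0, size=(size, n))

t_null = np.abs(null_rng(len(y_obs), 100_000).mean(axis=1))
p_value = np.mean(t_null >= t_obs)
print(f"observed statistic {t_obs:.2f}, simulation-based p-value {p_value:.4f}")
```

Note that the test statistic, abs(mean), is defined for every dataset the generator could produce, which is exactly the requirement mentioned above.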

A template for future news stories about scientific breakthroughs

Yesterday, in the context of a post about news media puffery of the latest three-headed monstrosity to come out of PPNAS, I promised you a solution. I wrote:

OK, fine, you might say. But what’s a reporter to do? They can’t always call Andrew Gelman at Columbia University for a quote, and they typically won’t have the technical background to evaluate these papers by themselves.

But I do have a suggestion, a template for how reporters can handle PPNAS studies in the future, a template that respects the possibility that these papers can have value.

I’ll share that template in my next post.

And here we are.

I’ll give you a set of templates for future news stories about scientific breakthroughs. You’d have Story #1, followed a year later by Story #2a or Story #2b or, if necessary, Story #2c.

Story #1:

Can a spray improve your memory?

A spray can improve your memory, claims a paper just published in the Proceedings of the National Academy of Sciences. Mary Smith and John Lee, psychologists at Miskatonic University, reported a 34% improvement in short-term memory performance for 91 students subject to a chlorine spray, as compared to 88 students given a placebo control. Participants in the studies were randomized to the two groups. Asked about these findings, Smith says, “We suspected the spray could be effective, but we were surprised at how well it worked, and we’re planning further studies to understand the mechanism.”

Sara Yossarian of Harvard Medical School, who has no affiliation with the study, was guarded in her assessment: “This could be real, but I’m not believing it until I see a preregistered replication.” Smith and Lee confirmed that the hypotheses and data analysis plan for their study were not preregistered.

The Proceedings of the National Academy of Sciences is a prestigious journal but in recent years has had its credibility questioned after its publication of several articles with seriously flawed statistical analyses on topics ranging from suicide rates to life expectancy to air rage to hurricane naming.

Do the memory-boosting effects of chlorine imply that you should be spending more time inhaling the spray at your local pool? Smith says, “I can’t say yes for sure. But I don’t recommend you cancel your gym membership.” We’ll wait for the replication.

A year later, Story #2a:

Yes, it seems that a spray can improve your memory.

A spray can improve your memory, claimed a much-publicized paper published last year in the Proceedings of the National Academy of Sciences. Mary Smith and John Lee, psychologists at Miskatonic University, reported a 34% improvement in short-term memory performance for 91 students subject to a chlorine spray, as compared to 88 students given a placebo control. Participants in the studies were randomized to the two groups. Asked about these findings, Smith said, “We suspected the spray could be effective, but we were surprised at how well it worked, and we’re planning further studies to understand the mechanism.”

The Proceedings of the National Academy of Sciences is a prestigious journal but in recent years has had its credibility questioned after its publication of several articles with seriously flawed statistical analyses on topics ranging from suicide rates to life expectancy to air rage to hurricane naming.

Since then, two laboratories have conducted independent replications of the study, and these were just published in the journal Psychology Research Replications. The replication studies in this issue were performed in the U.S. and France, and both showed the expected positive results.

Sara Yossarian of Harvard Medical School, who had no affiliation with the original study or the replications and who expressed skepticism about this research last year, says, “I’m impressed with the preregistered replications. I’m still not sure how effective this treatment will be in the real world—I’d like to see some evidence outside campus and hospital settings—but this is definitely worth further study.”

Do the memory-boosting effects of chlorine imply that you should be spending more time inhaling the spray at your local pool? Smith says, “I can’t say yes for sure. But I don’t recommend you cancel your gym membership.” At the very least, this is one more reason to do some daily laps.

Or, Story #2b:

Follow-up: No evidence that a spray can improve your memory.

A spray can improve your memory, claimed a much-publicized paper published last year in the Proceedings of the National Academy of Sciences. Mary Smith and John Lee, psychologists at Miskatonic University, reported a 34% improvement in short-term memory performance for 91 students subject to a chlorine spray, as compared to 88 students given a placebo control. Participants in the studies were randomized to the two groups. Asked about these findings, Smith said, “We suspected the spray could be effective, but we were surprised at how well it worked, and we’re planning further studies to understand the mechanism.”

The Proceedings of the National Academy of Sciences is a prestigious journal but in recent years has had its credibility questioned after its publication of several articles with seriously flawed statistical analyses on topics ranging from suicide rates to life expectancy to air rage to hurricane naming.

At the time of its publication, the conclusions of the chlorine-memory study could only be considered tentative because its hypotheses and data analysis plan had not been preregistered. Since then, two laboratories have conducted independent replications of the study, and these were just published in the journal Psychology Research Replications. The replication studies in this issue were performed in the U.S. and France, and they showed no pattern of success.

Asked about the failed replications, Smith says, “We thought we found something there, but apparently we were overconfident. We’re glad that other groups went to the trouble to replicate our study, and we plan to start over and think more carefully about how a memory spray should work.”

Sara Yossarian of Harvard Medical School, who had no affiliation with the original study or the replications, says, “The unsuccessful replications confirm my original skepticism. Yes, it’s possible that a spray could improve your memory, but it could also make your memory worse, or have no effect at all. This episode is a reminder not to overreact to preliminary research findings—even if they’re published in a prestigious journal.”

If chlorine really has no memory-boosting effects, does this imply that you should spend less time at your local pool? Smith says, “I don’t recommend you cancel your gym membership. But go for the exercise, not for the chlorine.”

Or, if necessary, Story #2c:

Follow-up: No evidence that a spray can improve your memory.

A spray can improve your memory, claimed a much-publicized paper published last year in the Proceedings of the National Academy of Sciences. Mary Smith and John Lee, psychologists at Miskatonic University, reported a 34% improvement in short-term memory performance for 91 students subject to a chlorine spray, as compared to 88 students given a placebo control. Participants in the studies were randomized to the two groups. Asked about these findings, Smith said, “We suspected the spray could be effective, but we were surprised at how well it worked, and we’re planning further studies to understand the mechanism.”

The Proceedings of the National Academy of Sciences is a prestigious journal but in recent years has had its credibility questioned after its publication of several articles with seriously flawed statistical analyses on topics ranging from suicide rates to life expectancy to air rage to hurricane naming.

At the time of its publication, the conclusions of the chlorine-memory study could only be considered tentative because its hypotheses and data analysis plan had not been preregistered. Since then, two laboratories have conducted independent replications of the study, and these were just published in the journal Psychology Research Replications. The replication studies in this issue were performed in the U.S. and France, and they showed no pattern of success.

Asked about the failed replications, Smith says, “We stand by our originally-published finding. These so-called replications differed in important ways from our original study, which is why they failed to find any effects.”

Sara Yossarian of Harvard Medical School, who had no affiliation with the original study or the replications, says, “The unsuccessful replications confirm my original skepticism. Yes, it’s possible that a spray could improve your memory, but it could also make your memory worse, or have no effect at all. This episode is a reminder not to overreact to preliminary research findings—even if they’re published in a prestigious journal.”

If chlorine really has no memory-boosting effects, does this imply that you should spend less time at your local pool? Yossarian says there’s no reason for you to cancel your gym membership. “But go for the exercise, not for the chlorine.”

OK, these aren’t the only possible templates. In the above stories I’ve emphasized preregistration, but (a) not every study can be replicated so easily or at all, and (b) when we criticize studies, it’s typically not for being non-preregistered but for specific statistical or conceptual flaws. To put it another way, the power-pose paper had many statistical weaknesses, to the extent that I would see no reason to believe it even in the absence of preregistration. The failed replication is just another nail in its coffin. The air-pollution-in-China study is a one-time thing and can never be repeated—but my criticism is that the statistical analysis is so weak that the conclusions can’t be trusted.

But these sorts of criticisms are hard to make in general, hence I think it’s safest to present a general skepticism in line with PPNAS’s demonstrated willingness to publish unsubstantiated claims garnished with statistically significant p-values.

If you’re writing about a study such as air rage or himmicanes or air pollution in China which can’t be easily replicated, then remove the bits about replication in the above templates and replace with statements such as the researchers’ hypotheses being open-ended and the data being unavailable and that you don’t really believe the claims until the data have been reanalyzed by outside groups.

For example, in Story #1:

Sara Yossarian of Harvard Medical School, who has no affiliation with the study, was guarded in her assessment: “This could be real, but I’m not believing it until I see a reanalysis of the raw data by an independent research group.” Smith and Lee confirmed that the hypotheses and data analysis plan for their study were not preregistered. Their data and detailed experimental protocol are not yet available online.

And then, in the concluding line to Story #1, instead of “We’ll wait for the replication,” you can have:

We’ll wait until other research teams have had a chance to analyze these data.

I think the above stories represent a first start at a reasonable template that journalists can use when reporting science news. Perhaps the actual journalists out there can refine these.

PPNAS: How does it happen? And happen? And happen? And happen?

In the comment thread to today’s post on journalists who take PPNAS papers at face value, Mark asked, in response to various flaws pointed out in one of these papers:

How can the authors (and the reviewers and the editor) not be aware of something so elementary?

My reply:

Regarding the authors, see here. Statistics is hard. Multiple regression is hard. Figuring out the appropriate denominator is hard. These errors aren’t so elementary.

Regarding the reviewers, see here. The problem with peer review is that the reviewers are peers of the authors and can easily be subject to the same biases and the same blind spots.

Regarding the editor: it doesn’t help that she has the exalted title of Member of the National Academy of Sciences. With a title like that, it’s natural that she thinks she knows what she’s doing. What could she possibly learn from a bunch of blog commenters, a bunch of people who are so ignorant that they don’t even believe in himmicanes, power pose, and ESP?

P.S. Let me clarify. I don’t expect or demand that PPNAS publish only wonderful papers, or only good papers, or only correct papers. Mistakes will happen. The publication process is not perfect. But I would like for them to feel bad about publishing bad papers. What really bothers me is that, when a bad paper is published, they just move on, there seems to be no accountability, no regret. They act as if the publication process retrospectively assigns validity to the work (except in very extreme circumstances of fraud, etc.). I’m bothered by the PPNAS publish-and-never-look-back attitude for the same reason I’m bothered that New York Times columnist David Brooks refuses to admit his errors.

Making mistakes is fine, it’s the cost of doing business. Not admitting your mistakes, that’s a problem.

Journalists are suckers for anything that looks like science. And selection bias makes it even worse. But I was unfair to NPR.

Journalists are suckers. Marks. Vics. Boobs. Rubes.

You get the picture.

Where are the classically street-trained reporters, the descendants of Ring Lardner and Joe Liebling, the hard-bitten journos who would laugh in the face of a press release?

Today, nowhere in evidence.

I’m speaking, of course, about the reaction in the press to the latest bit of “p less than .05” clickbait to appear in PPNAS. Here’s what I wrote yesterday regarding the article, “Physical and situational inequality on airplanes predicts air rage”:

NPR will love this paper. It directly targets their demographic of people who are rich enough to fly a lot but not rich enough to fly first class, and who think that inequality is the cause of the world’s ills.

This morning I was curious so I googled the name of the article’s first author and NPR. No hits on this study. But a lot from other news organizations:

[Screenshot: search results showing news coverage of the air-rage study]

Let’s go through and take a look:

Deborah Netburn in the L.A. Times presents the story completely uncritically. Zero concerns. From the authors’ lips to God’s ears.

Carina Storrs at CNN: 12 paragraphs of unskeptical parroting of the authors’ claims, followed by three paragraphs of very mild criticism (quoting psychologist Michael McCullough as saying that the study “is provocative, but it does not strike me as an open and shut case”), followed by two more paragraphs by the study’s author.

Gillian Mohney at ABC News: no skepticism at all, she buys into the whole study, hook, line, and sinker.

Bob Weber, CTV News: Again takes it at face value. A regression with p less than .05 in PPNAS is good enough for Bob Weber.

Unsigned, ABC Radio: A short five-paragraph story, the last paragraph of which is, “Although this study points to a link between air rage and first class cabins, it does not prove causation.”

Vanessa Lu, Toronto Star: Straight P.R., no chaser.

Peter Dockrill, Science Alert: A nearly entirely credulous story, marred only by a single paragraph buried in the middle of the story, quoting Michael McCullough from that CNN article.

And Sophie Ryan at the New Zealand Herald buys into the whole story. Again, if it’s published in PPNAS and it tells us something we want to hear, run with it.

LA Times, ABC News, Toronto Star, sure, fine, what can you expect? But the New Zealand Herald? I’m disappointed. You can do better. If NPR dodged this bullet, you can too.

Where were the savvy reporters?

Where were Felix Salmon, Ed Yong, Sharon Begley, Julie Rehmeyer, Susan Perry, etc., in all this? The quantitative and science reporters who know what they’re doing? They didn’t waste their time with this paper. They see the equivalent in Psychological Science each week, and they just tune it out.

You don’t see the most respected pop music critics reviewing the latest Nickelback show, right? OK, maybe at the Toronto Star. But nowhere else.

Where were Nate Silver’s 538 and the New York Times’s Upshot team? They didn’t waste their time with this. They like to analyze their own data. They know that data analysis is hard, and they don’t trust any numbers they haven’t crunched themselves.

We have a classic case of selection bias. The knowledgeable reporters don’t waste their time on this, leaving the suckers to write it up.

Comparison to himmicanes

Here’s a data point, though. This air rage study, like the power pose study, got nearly uniformly positive coverage, whereas the ovulation-and-clothing study and the himmicanes study were accompanied in their news reports with a bit of skepticism (not as much as was deserved, but some). Why?

I suspect a key factor is that the conclusions of this new paper told people what they want to hear: flying sucks, first-class passengers are assholes, social inequality is a bad thing, and it’s been proved by science!

Also, the ovulation-and-clothing and himmicanes studies had particularly obvious errors in their conceptualization and measurement, whereas the statistical flaws in the air rage study are more subtle and have to do with scaling of ratios and the interpretation of multiple regression coefficients.

A template for future news stories

OK, fine, you might say. But what’s a reporter to do? They can’t always call Andrew Gelman at Columbia University for a quote, and they typically won’t have the technical background to evaluate these papers by themselves.

But I do have a suggestion, a template for how reporters can handle PPNAS studies in the future, a template that respects the possibility that these papers can have value.

I’ll share that template in my next post.

P.S. BoingBoing fell for it too. Too bad. You can do better, BoingBoing!

P.P.S. Felix Salmon pointed out that the study was also promoted completely uncritically in Science magazine. Tabloids gonna tabloid.