
Graph too clever by half

Mike Carniello writes:

I wondered what you make of this.

I pay for the NYT online and tablet – but not paper, so I don’t know how they’re representing this content in two dimensions.

I’ve paged through the thing a couple of times and I’m not sure how useful it is. It seems like a series of figures in a one-column, many-row display might have worked as well (or, perhaps, two-column, differentiating oil producer groups).

I agree, I found it disconcerting that the axes start changing as I scroll down!

Publication bias occurs within as well as between projects

Kent Holsinger points to this post by Kevin Drum entitled, “Publication Bias Is Boring. You Should Care About It Anyway,” and writes:

I am an evolutionary biologist, not a psychologist, but this article describes a disturbing scenario concerning oxytocin research that seems plausible. It is also relevant to the reproducibility/publishing issues you have been discussing recently on your blog.

Drum writes:

You all know about publication bias, don’t you? Sure you do. It’s the tendency to publish research that has bold, affirmative results and ignore research that concludes there’s nothing going on. This can happen two ways. First, it can be the researchers themselves who do it. In some cases that’s fine: the data just doesn’t amount to anything, so there’s nothing to write up. In other cases, it’s less fine: the data contradicts previous results, so you decide not to write it up. . . .

This is just fine but I want to emphasize that publication bias is not just about the “file drawer effect,” it’s not just about positive findings being published and zero or negative findings remaining unpublished. It’s also that, within any project, there are so many different results that researchers can decide what to focus on.

So, yes, sometimes a research team will try an idea and it won’t work and they won’t bother writing it up. Just one more dry hole; but if only the successes are written up and published, we will get a misleading view of reality: we’re seeing a nonrandom sample of results. But it’s more than that. Any study contains within itself so many possibilities that often something can be published that appears to be consistent with some vague theory. Embodied cognition, anyone?

This “garden of forking paths” is important because it shows how publication bias can occur, even if every study is published and there’s nothing in the file drawer.
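To get a rough sense of scale, here is a minimal sketch of the cruder multiple-comparisons version of the problem. It assumes, unrealistically, that a single study offers k independent comparisons and that there is no true effect anywhere; the function name and the values of k are made up for illustration. Forking paths can bite even when only one analysis is actually run, because the choice of analysis depends on the data, so treat this as a simplified illustration rather than a description of the full problem.

```python
def prob_at_least_one_significant(k, alpha=0.05):
    """Chance that at least one of k independent null comparisons has p < alpha."""
    return 1.0 - (1.0 - alpha) ** k


# Hypothetical numbers of available comparisons within one study.
for k in (1, 5, 20):
    print(k, round(prob_at_least_one_significant(k), 3))
```

With 20 possible comparisons, a "significant" result is more likely than not, even though nothing is going on and nothing was left in the file drawer.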

Evaluating election forecasts

Nadia Hassan writes:

Nate Silver did a review of pre-election predictions from forecasting models in 2012. The overall results were not great, but many scholars noted that some models seemed to do quite well. You mentioned that you were interested in how top-notch models fare.

Nate agreed that some were better, but he raised the question of lucky vs. good with forecasters:
“Some people beat Vegas at roulette on any given evening. Some investors beat the stock market in any given month/quarter/year, and yet there is (relatively) little evidence of persistent stock-picking skill, etc, etc.”

The other thing is you did a paper with Wang on the limits of predictive accuracy. Many election models are linear regressions, but the point seems pertinent.

Election forecasting is seen by some as a valuable opportunity to test social science theories over time. It does seem like one can go wrong by just comparing pre-election forecasts to outcomes. How can one examine predictions sensibly considering these issues?

My reply: One way to increase N here is to look at state-by-state predictions. Here it makes sense to look at predictions for each state relative to the national average, rather than just looking at the raw prediction. To put it another way: suppose the state-level outcomes are y_1,…,y_50, and the national popular vote outcome is y_usa (a weighted average of the 50 y_j’s). Then you should evaluate the national prediction by comparing to y_usa, and you should evaluate state predictions of y_j – y_usa for each j. Otherwise you’re kinda double counting the national election and you’re not really evaluating different aspects of the prediction. You can also look at predictions of local elections, congressional elections, etc.

And always evaluate predictions on vote proportions, not just win/loss. That’s something I’ve been saying for a long long time (for example see this book review from 1993). To evaluate predictions based on win/loss is to just throw away information.
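Here is a minimal sketch of the evaluation described above, on vote proportions rather than win/loss. The function name, the three "states," and all the numbers are hypothetical, and I am assuming the national prediction is the weighted average of the state predictions, with weights equal to each state's share of the national vote.

```python
from math import sqrt


def evaluate_forecast(pred, actual, weights):
    """Split forecast error into a national part and a state-relative part.

    pred, actual: dicts of state -> vote proportion.
    weights: dict of state -> share of the national vote (sums to 1).
    """
    pred_usa = sum(weights[s] * pred[s] for s in pred)
    y_usa = sum(weights[s] * actual[s] for s in actual)
    national_error = pred_usa - y_usa
    # Error in predicting y_j - y_usa, one per state.
    relative_errors = {
        s: (pred[s] - pred_usa) - (actual[s] - y_usa) for s in pred
    }
    rmse = sqrt(sum(e ** 2 for e in relative_errors.values()) / len(relative_errors))
    return national_error, relative_errors, rmse


# Made-up example: every state prediction is too high by the same 2 points.
weights = {"A": 0.5, "B": 0.3, "C": 0.2}
pred = {"A": 0.55, "B": 0.48, "C": 0.40}
actual = {"A": 0.53, "B": 0.46, "C": 0.38}
national_error, relative_errors, rmse = evaluate_forecast(pred, actual, weights)
print(national_error, rmse)
```

In this example the forecast misses the national vote by 2 points but gets every state's position relative to the nation right, so the miss is counted once instead of three times.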

Birthdays and heat waves

I mentioned the birthdays example in a talk the other day, and Hal Varian pointed me to some research by David Lam and Jeffrey Miron, papers from the 1990s with titles like Seasonality of Births in Human Populations, The Effect of Temperature on Human Fertility, and Modeling Seasonality in Fecundability, Conceptions, and Births.

Aki and I have treated the birthdays problem as purely a problem in statistical modeling and computation and have not looked at all at work of demographers in this area. So it was good to learn of this work.

Hal also pointed me to a recent paper, Heat Waves at Conception and Later Life Outcomes by Joshua Wilde, Bénédicte Apouey, and Toni Jung, which I looked at and don’t believe at all.

Wilde et al. report that babies born 9 months after hot weather have better educational and health outcomes as adults, and they attribute this to a selection among fetuses, by which the higher temperature conditions make fetal development more difficult so that the weaker fetuses die and it is the stronger, healthier ones that survive. As is typically the case, I’m suspicious of this sort of bank-shot explanation.

Wilde et al. talk about the causal effect of temperature but I’m guessing it can all be explained by selection effects of parents, that different sorts of people get pregnant at different times of the year, with no causal effect of temperature at all. Yes they run some regressions controlling for family characteristics but I get the impression that the purpose of those regressions was just to confirm that their primary findings were OK: As sometimes happens in this sort of robustness analysis, they weren’t looking to find anything there, and then they successfully didn’t find anything. Not what I’d call convincing. The whole thing just seems like massive overreach to me. Also seems odd for them to talk about temperature “shocks”: It’s hardly a shock that it gets warm in the summer and cold in the winter.

I’m not saying that temperature at conception can’t have any effect on fetal health; I just don’t find the particular argument in this paper at all convincing. It’s the learning-through-regression paradigm out of control.

P.S. It’s April, and it just happens that the next available day on the blog is in August. What better time to post something on the effects of heat waves?

Who owns your code and text and who can use it legally? Copyright and licensing basics for open-source

I am not a lawyer (“IANAL” in web-speak); but even if I were, you should take this with a grain of salt (same way you take everything you hear from anyone). If you want the straight dope for U.S. law, see the U.S. government Copyright FAQ; it’s surprisingly clear for government legalese.

What is copyrighted?

Computer code and written material such as books, journals, and web pages, are subject to copyright law. Copyright is for the expression of an idea, not the idea itself. If you want to protect your ideas, you’ll need a patent (or to be good at keeping secrets).

Who owns copyrighted material?

In the U.S., copyright is automatically assigned to the author of any text or computer code. But if you want to sue someone for infringing your copyright, the government recommends registering the copyright. And most of the rest of the world respects U.S. copyright law.

Most employers require as part of their employment contract that copyright for works created by their employees be assigned to the employer. Although many people don’t know this, most universities require the assignment of copyright for code written by university research employees (including faculty and research scientists) to the university. Typically, universities allow the author to retain copyright for books, articles, tutorials, and other traditional written material. Web sites (especially with code) and syllabuses for courses are in a grey area.

The copyright holder may assign copyright to others. This is what authors do for non-open-access journals and books—they assign the copyright to the publisher. That means that even they may not be able to legally distribute copies of the work to other people; some journals allow crippled (non-official) versions of the works to be distributed. The National Institutes of Health require all research to be distributed openly, but they don’t require the official version to be so, so you can usually find two versions (pre-publication and official published version) of most work done under the auspices of the NIH.

What protections does copyright give you?

You can dictate who can use your work and for what. There are fair use exceptions, but I don’t understand the line between fair use and infringement (like other legal definitions, it’s all very fuzzy and subject to past and future court decisions).


For others to be able to use copyrighted text or code legally, the copyrighted material must be explicitly licensed for such use by the copyright holder. Just saying “common domain” or “this is trivial” isn’t enough. Just saying “do whatever you want with it” is in a grey area again, because it’s not a recognized license and presumably that “whatever you want” doesn’t involve claiming copyright ownership. The actual copyright holder needs to explicitly license the material.

There is a frightening degree of non-conformance among open-source contributors, largely, I suspect, due to misunderstandings of the author’s employment contract and copyright law.

Derived works

Most of the complication from software licensing comes from so-called derived works. For example, I download open-source package A, then extend it to produce open-source package B that includes open-source package A. That’s why most licenses explicitly state what happens in these cases. The reason we don’t like the GNU General Public Licenses (GPL) is that they restrict derived works with copyleft (forcing package B to adopt the same license, or at best one that’s compatible). That’s why I insisted on the BSD license for Stan: it’s maximally open in terms of what it allows others to do with the code, and it’s compatible with GPL. R’s licensed under the GPL, which means projects built on R, such as RStan, must also be released under the GPL plus whatever license the project is released under (we just went GPL for RStan).

Where does Stan stand?

Columbia owns the copyright for all code written by Columbia research staff (research faculty, postdocs, and research scientists). It’s less clear (from our reading of the faculty handbook) who owns works created by Ph.D. students and teaching faculty. For non-Columbia contributions, the author (or their assignee) retains copyright for their contribution. The advantage of this distributed copyright is that ownership isn’t concentrated with one company or person; the disadvantage is that we’ll never be able to contact everyone to change licenses, etc.

The good news is that Columbia’s Tech Ventures office (the controller of software copyrights at Columbia), has given the Stan project a signed waiver that allows us to release all past and future work on Stan under open source licenses. They maintain the copyright, though, under our employment contracts (at least for the research faculty and research scientists).

For other contributors, we now require them to explicitly state who owns the copyrighted contribution and to agree that the copyright holder gives permission to license the material under the relevant license (BSD for most of Stan, GPL or MIT for some of the interfaces).

The other good news is that most universities and companies are coming around and allowing their employees to contribute to open-source projects. The GNU General Public License (GPL) is often an exception for companies, because they are afraid of its copyleft properties.


The Stan project is trying to cover our asses from being sued in the future by a putative copyright holder, though we don’t like having to deal with all this crap (pun intended).

Luckily, most universities these days seem to be opening up to open source (no, that wasn’t intended to continue the metaphor of the previous paragraph).

But what about patents?

Don’t get me started on software patents. Or patent trolls. Like copyrights, patents protect the owner of intellectual property against its illegal use by others. Unlike copyright, which is about the realization of an idea (such as a way of writing a recipe for chocolate chip cookies), patents are more abstract and are about the right to realize ideas (such as making a chocolate chip cookie in any fashion). If you need to remember one thing about patent law, it’s that a patent lets you stop others from using your patented technology—it doesn’t let you use it (your patent B may depend on some other patent A).

Or trademarks?

Like patents, trademarks prevent other people from (legally) using your intellectual property without your permission, such as building a knockoff logo or brand. Trademarks can involve names, font choices, color schemes, etc. But they tend to be limited to areas, so we could register a trademark for Stan (which we’re considering doing), without running afoul of the down-under Stan.

There are also unregistered trademarks, but I don’t know all the subtleties about what rights registered trademarks grant you over the unregistered ones. Hopefully, we’ll never be writing that little R in a circle above the Stan name, Stan®; even if you do register a trademark, you don’t have to use that annoying mark—it’s just there to remind people that the item in question is trademarked.

Oooh, it burns me up

If any of you are members of the Marketing Research Association, could you please contact them and ask them to change their position on this issue:

[Screenshot, 11 January 2016]

I have a feeling they won’t mind if you call them at home. With an autodialer. “Pollsters now must hand-dial cellphones, at great expense,” indeed. It’s that expensive to pay people to push a few buttons, huh?

Those creepy lobbyists are so creepy. Yeah, yeah, I know they’re part of the political process, but I don’t have to like them or their puppets in Congress.

Better to just not see the sausage get made


Mike Carniello writes:

This article in the NYT leads to the full text, in which these statements are buried (no pun intended):

What is the probability that two given texts were written by the same author? This was achieved by posing an alternative null hypothesis H0 (“both texts were written by the same author”) and attempting to reject it by conducting a relevant experiment. If its outcome was unlikely (P ≤ 0.2), we rejected the H0 and concluded that the documents were written by two individuals. Alternatively, if the occurrence of H0 was probable (P > 0.2), we remained agnostic.

See the footnote to this table:

[Screenshot of the table, 16 April 2016]

Ahhh, so horrible. The larger research claims might be correct, I have no idea. But I hate to see such crude statistical ideas being used, it’s like using a pickaxe to dig for ancient pottery.

Letters we never finished reading

I got a book in the mail attached to some publicity material that began:

Over the last several years, a different kind of science book has found a home on consumer bookshelves. Anchored by meticulous research and impeccable credentials, these books bring hard science to bear on the daily lives of the lay reader; their authors—including Malcolm Gladwell . . .

OK, then.

The book might be ok, though. I wouldn’t judge it on its publicity material.

Free workshop on Stan for pharmacometrics (Paris, 22 September 2016); preceded by (non-free) three-day course on Stan for pharmacometrics

So much for one post a day…

Workshop: Stan for Pharmacometrics Day

If you are interested in a free day of Stan for pharmacometrics in Paris on 22 September 2016, see the registration page:

Julie Bertrand (statistical pharmacologist from Paris-Diderot and UCL) has finalized the program:

When | Who | What
09:00–09:30 | | Registration
09:30–10:00 | Bob Carpenter | Introduction to the Stan Language and Model Fitting Algorithms
10:00–10:30 | Michael Betancourt | Using Stan for Bayesian Inference in PK/PD Models
10:30–11:00 | Bill Gillespie | Prototype Stan Functions for Bayesian Pharmacometric Modeling
11:00–11:30 | | coffee break
11:30–12:00 | Sebastian Weber | Bayesian popPK for Pediatrics – bridging from adults to pediatrics
12:00–12:30 | Solene Desmee | Using Stan for individual dynamic prediction of the risk of death in nonlinear joint models: Application to PSA kinetics and survival in metastatic prostate cancer
12:30–13:30 | | lunch
13:30–14:00 | Marc Vandemeulebroecke | A longitudinal Item Response Theory model to characterize cognition over time in elderly subjects
14:00–14:30 | William Barcella | Modeling correlated binary variables: an application to lower urinary tract symptoms
14:30–15:00 | Marie-Karelle Riviere | Evaluation of the Fisher information matrix without linearization in nonlinear mixed effects models for discrete and continuous outcomes
15:00–15:30 | | coffee break
15:30–16:00 | Dan Simpson | TBD
16:00–16:30 | Frederic Bois | Bayesian hierarchical modeling in pharmacology and toxicology / about what we need next
16:30–17:00 | Everyone | Discussion


Course: Bayesian Inference with Stan for Pharmacometrics

The three days preceding the workshop (19–21 September 2016), Michael Betancourt, Daniel Lee, and I will be teaching a course on Stan for Pharmacometrics. This, alas, is not free, but if you’re interested, registration details are here:

It’s going to be very hands-on and by the end you should be fitting hierarchical PK/PD models based on compartment differential equations.

P.S. As Andrew keeps pointing out, all proceeds (after overhead) go directly toward Stan development. It turns out to be very difficult to get funding to maintain software that people use, because most funding is directed at “novel” research (research rather than software development, which means prototypes rather than solid code). These courses help immensely to supplement our grant funding and let us continue to maintain Stan and its interfaces.

A day in the life

I like to post approx one item per day on this blog, so when multiple things come up in the same day, I worry about the sustainability of all this. I suppose I could up the posting rate to 2 a day but I think that could be too much of a burden on the readers.

So in this post I’ll just tell you everything I’ve been thinking about today, Thurs 14 Apr 2016.

Actually I’ll start with yesterday, when I posted an update to our Prior Choice Recommendations wiki. There had been a question on the Stan mailing list about priors for cutpoints in ordered logistic regression and this reminded me of a few things I wanted to add, not just on ordered regression but in various places in the wiki. This wiki is great and I’ll devote a full post to it sometime.

Also yesterday I edited a post on this sister blog. Posting there is a service to the political science profession and it’s good to reach Washington Post readers, which is a different audience than what we have here. But it can also be exhausting as I need to explain everything, whereas for you regular readers I can just speak directly.

This morning I taught my class on design and analysis of sample surveys. Today’s class was on Mister P. Jitts led into a 20-minute discussion about the history and future of sample surveys. I don’t know much about the history of sample surveys. Why was there no Gallup Poll in 1890? How much random sampling was being done, anywhere, before 1930? I don’t know. After that, the class was all R/Stan demos and discussion. I had some difficulties. I took an old R script I had from last year’s class but it didn’t run. I’d deleted some of the data files (Census PUMS files I needed for the poststratification), so I needed to get them again.

After that I biked downtown to give a talk at Baruch College, where someone had asked me to speak. On the way down I heard this story, which the This American Life producers summarize as follows:

When Jonathan Goldstein was 11, his father gave him a book called Ultra-Psychonics: How to Work Miracles with the Limitless Power of Psycho-Atomic Energy. The book was like a grab bag of every occult, para-psychology, and self-help book popular at the time. It promised to teach you how to get rich, control other people’s minds, and levitate. Jonathan found the book in his apartment recently and decided to look into the magical claims the book made.

It turns out that the guy who wrote the book was just doing it to make money:

At the time, Schaumberger was living in New Jersey and making a decent wage as an editor at a publishing house that specialized in occult self help books with titles like “Secrets From Beyond The Pyramids” and “The Magic Of Chantomatics.” And he was astonished by the amount of money he saw writers making. . . .

Looking at it now, it seems obvious it was a lark. It almost reads like a parody of another famous science fiction slash self help book with a lot of pseudoscience jargon that, for legal reasons, I will only say rhymes with diuretics.

Take, for instance, the astral spur. You were supposed to use it at the race track to give your horse extra energy, and it involved standing on one foot and projecting a psychic laser at your horse’s hindquarters.

Then there’s the section on ultra vision influence. The road to domination is explained this way– one, sit in front of a mirror and practice staring fixedly into your own eyes. Two, practice the look on animals. Cats are the best. See if you can stare down a cat. Don’t be surprised if the cat seems to win the first few rounds. Three, practice the look on strangers on various forms of public transport. Stare steadily at someone sitting opposite you until you force them to turn their head away or look down. You have just mastered your first human subject.

I’m listening to this and I’m thinking . . . power pose! It’s just like power pose. It could be true, it kinda sounds right, it involves discipline and focus.

One difference is that power pose has a “p less than .05” attached to it. But, as we’ve seen over and over again, “p less than .05” doesn’t mean very much.

The other difference is that, presumably, the power pose researchers are sincere, whereas this guy was just gleefully making it all up. And yet . . . there’s this, from his daughter:

Well, he was very familiar with all these things. The “Egyptian Book of the Dead” was a big one, because there was always this thing of, well, maybe if they had followed the formulas correctly, maybe something . . . He may have wanted to believe. It may be that in his private thoughts, there were some things in there that he believed in.

I think there may be something going on here, the idea that, even if you make it up, if you will it, you can make it true. If you just try hard enough. I wonder if the power-pose researchers and the ovulation-and-clothing researchers and all the rest, I wonder if they have a bit of this attitude, that if they just really really try, it will all become true.

And then there was more. I’ve had my problems with This American Life from time to time, but this one was a great episode. It had this cool story of a woman who was caring for her mother with dementia, and she (the caregiver) and her husband learned about how to “get inside the world” of the mother so that everything worked much more smoothly. I’m thinking I should try this approach when talking with students!

OK, so I got to my talk. It went ok, I guess. I wasn’t really revved up for it. But by the time it was over I was feeling good. I think I’m a good speaker but one thing that continues to bug me is that I rarely elicit many questions. (Search this blog for Brad Paley for more on this.)

After my talk, on the way back, another excellent This American Life episode, including a goofy/chilling story of how the FBI was hassling some US Taliban activist and trying to get him to commit crimes so they could nail him for terrorism. Really creepy: they seemed to want to create crimes where none existed, just so they could take credit for catching another terrorist.

Got home and started typing this up.

What else relevant happened recently? On Monday I spoke at a conference on “Bayesian, Fiducial, and Frequentist Inference.” My title was “Taking Bayesian inference seriously,” and this was my abstract:

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors also can resolve some of the questions of replication and multiple comparisons that have recently shaken the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

I don’t know if my talk quite lived up to this, but I have been thinking a lot about prior distributions, as was indicated at the top of this post. On the train ride to and from the conference (it was in New Jersey) I talked with Deborah Mayo. I don’t really remember anything we said—that’s what happens when I don’t take notes—but Mayo assured me she’d remember the important parts.

I also had an idea for a new paper, to be titled, “Backfire: How methods that attempt to avoid bias can destroy the validity and reliability of inferences.” OK, I guess I need a snappier title, but I think it’s an important point. Part of this material was in my talk, “‘Unbiasedness’: You keep using that word. I do not think it means what you think it means,” which I gave last year at Princeton—that was before Angus Deaton got mad at me, he was really nice during that visit and offered a lot of good comments, both during and after the talk—but I have some new material too. I want to work in the bit about the homeopathic treatments that have been so popular in social psychology.

Oh, also I received emails today from 2 different journals asking me to referee submitted papers, someone emailed me his book manuscript the other day, asking for comments, and a few other people emailed me articles they’d written.

I’m not complaining, nor am I trying to “busy-brag.” I love getting interesting things to read, and if I feel too busy I can just delete these messages. My only point is that there’s a lot going on, which is why it can be a challenge to limit myself to one blog post per day.

Finally, let me emphasize that I’m not saying there’s anything special about me. Or, to put it another way, sure, I’m special, and so are each of you. You too can do a Nicholson Baker and dissect every moment of your lives. That’s what blogging’s all about. God is in every leaf etc.

Hey pollsters! Poststratify on party ID, or we’re all gonna have to do it for you.

Alan Abramowitz writes:

In five days, Clinton’s lead increased from 5 points to 12 points. And Democratic party ID margin increased from 3 points to 10 points.

No, I don’t think millions of voters switched to the Democratic party. I think Democrats were just more likely to respond in that second poll. And, remember, survey response rates are around 10%, whereas presidential election turnout is around 60%, so it makes sense that we’d see big swings in differential nonresponse to polls which will not be expected to map to comparable swings in differential voting turnout.

We’ve been writing about this a lot recently. Remember this post, and this earlier graph from Abramowitz:


and this news article with David Rothschild, and this research article with Rothschild, Doug Rivers, and Sharad Goel, and this research article from 2001 with Cavan Reilly and Jonathan Katz? The cool kids know about this stuff.

I’m telling you this for free cos, hey, it’s part of my job as a university professor. (The job is divided into teaching, research, and service; this is service.) But I know that there are polling and news organizations that make money off this sort of thing. So, my advice to you: start poststratifying on party ID. It’ll give you a leg up on the competition.

That is, assuming your goal is to assess opinion and not just to manufacture news. If what you’re looking for is headlines, then by all means go with the raw poll numbers. They jump around like nobody’s business.

P.S. Two questions came up in discussion:

1. If this is such a good idea, why aren’t pollsters doing it already? Many answers here, including (a) some pollsters are doing it already, (b) other pollsters get benefit from headlines, and you get more headlines with noisy data, (c) survey sampling is a conservative field and many practitioners resist new ideas (just search this blog for “buggy whip” for more on that topic), and, most interestingly, (d) response rates keep going down, so differential nonresponse might be a bigger problem now than it used to be.

2. Suppose I want to poststratify on party ID? What numbers should I use? If you’re poststratifying on party ID, you don’t simply want to adjust to party registration data: party ID is a survey response, and party registration is something different. The simplest approach would be to take some smoothed estimate of the party ID distribution from many surveys: this won’t be perfect but it should be better than taking any particular poll, and much better than not poststratifying at all. To get more sophisticated, you could model the party ID distribution as a slowly varying time series as in our 2001 paper but I doubt that’s really necessary here.
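To make the “simplest approach” concrete, here is a minimal sketch of plain poststratification on party ID (not the time-series model from the 2001 paper). Everything in it is hypothetical: the function name, the within-party support rates, the respondent counts, and the target party ID distribution, which stands in for a smoothed estimate from many surveys.

```python
def poststratify(cell_support, cell_counts, target_shares):
    """Return (raw, adjusted) estimates of candidate support.

    cell_support: party ID -> proportion supporting the candidate in that cell.
    cell_counts: party ID -> number of respondents in that cell.
    target_shares: party ID -> assumed population share (sums to 1).
    """
    n = sum(cell_counts.values())
    raw = sum(cell_support[g] * cell_counts[g] for g in cell_counts) / n
    adjusted = sum(target_shares[g] * cell_support[g] for g in target_shares)
    return raw, adjusted


# Made-up polls: same opinions within each party, different mix of respondents.
support = {"Dem": 0.90, "Ind": 0.45, "Rep": 0.08}
target = {"Dem": 0.34, "Ind": 0.36, "Rep": 0.30}
poll_1 = {"Dem": 330, "Ind": 370, "Rep": 300}
poll_2 = {"Dem": 380, "Ind": 340, "Rep": 280}

raw_1, adj_1 = poststratify(support, poll_1, target)
raw_2, adj_2 = poststratify(support, poll_2, target)
print(raw_1, raw_2, adj_1, adj_2)
```

The raw numbers move by about 3 points between the two polls purely because more Democrats responded to the second one; the adjusted estimate stays put.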

His varying slopes don’t seem to follow a normal distribution

Bruce Doré writes:

I have a question about multilevel modeling I’m hoping you can help with.

What should one do when random effects coefficients are clearly not normally distributed (i.e., coef(lmer(y~x+(x|id))) )? Is this a sign that the model should be changed? Or can you stick with this model and infer that the assumption of normally distributed coefficients is incorrect?

I’m seeing strongly leptokurtic random slopes in a context where I have substantive interest in the shape of this distribution. That is, it would be useful to know if there are more individuals with “extreme” and fewer with “moderate” slopes than you’d expect of a normal distribution.

My reply: You can fit a mixture model, or even better you can have a group-level predictor that breaks up your data appropriately. To put it another way: What are your groups? And which are the groups that have low slopes and which have high slopes? Or which have slopes near the middle of the distribution and which have extreme slopes? You could fit a mixture model where the variance varies, but I think you’d be better off with a model using group-level predictors. Also I recommend using Stan which is more flexible than lmer and gives you the full posterior distribution.

Doré then added:

My groups are different people reporting life satisfaction annually surrounding a stressful life event (divorce, bereavement, job loss). I take it that the kurtosis is a clue that there are unobserved person-level factors driving this slope variability? With my current data I don’t have any person-level predictors that could explain this variability, but certainly it would be good to try to find some.
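A back-of-the-envelope check on why varying variance produces the pattern Doré describes. This sketch assumes the slopes come from a mixture of mean-zero normals with different standard deviations; the function name, mixing weights, and sds are made up. It uses the standard normal moments, E[x^2] = s^2 and E[x^4] = 3 s^4, averaged over the mixture components.

```python
def scale_mixture_excess_kurtosis(weights, sds):
    """Excess kurtosis of a mixture of mean-zero normals.

    weights: mixing proportions (sum to 1); sds: component standard deviations.
    """
    second = sum(w * s ** 2 for w, s in zip(weights, sds))
    fourth = 3.0 * sum(w * s ** 4 for w, s in zip(weights, sds))
    return fourth / second ** 2 - 3.0


# One component: an ordinary normal distribution.
single = scale_mixture_excess_kurtosis([1.0], [1.0])
# Hypothetical mix: 80% of people with small slope variation, 20% with large.
mixed = scale_mixture_excess_kurtosis([0.8, 0.2], [0.5, 2.0])
print(single, mixed)
```

A single normal gives excess kurtosis of zero, while the two-group mixture is strongly leptokurtic, which is the sort of thing an unobserved person-level factor could produce. One caution: the slopes returned by coef() are partially pooled estimates, so their empirical distribution need not match the distribution of the underlying slopes.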

Postdoc in Finland with Aki

I’m looking for a postdoc to work with me at Aalto University, Finland.

The person hired will participate in research on Gaussian processes, functional constraints, big data, approximative Bayesian inference, model selection and assessment, deep learning, and survival analysis models (e.g. cardiovascular diseases and cancer). Methods will be implemented mostly in GPy and Stan. The research will be made in collaboration with Columbia University (Andrew and Stan group), University of Sheffield, Imperial College London, Technical University of Denmark, The National Institute for Health and Welfare, University of Helsinki, and Helsinki University Central Hospital.

See more details here

Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments

At Bank Underground:

When studying the effects of interventions on individual behavior, the experimental research template is typically: Gather a bunch of people who are willing to participate in an experiment, randomly divide them into two groups, assign one treatment to group A and the other to group B, then measure the outcomes. If you want to increase precision, do a pre-test measurement on everyone and use that as a control variable in your regression. But in this post I argue for an alternative approach—study individual subjects using repeated measures of performance, with each one serving as their own control.

As long as your design is not constrained by ethics, cost, realism, or a high drop-out rate, the standard randomized experiment approach gives you clean identification. And, by ramping up your sample size N, you can get all the precision you might need to estimate treatment effects and test hypotheses. Hence, this sort of experiment is standard in psychology research and has been increasingly popular in political science and economics with lab and field experiments.

However, the clean simplicity of such designs has led researchers to neglect important issues of measurement . . .

I summarize:

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement. Real-world experiments are imperfect—they do have issues with ethics, cost, realism, and high drop-out, and the strategy of doing an experiment and then grabbing statistically-significant comparisons can leave a researcher with nothing but a pile of noisy, unreplicable findings.

Measurement is central to economics—it’s the link between theory and empirics—and it remains important, whether studies are experimental, observational, or some combination of the two.
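
A rough simulation sketch of the precision argument follows. It assumes an idealized within-person design with no carryover or order effects, and the effect size and standard deviations are invented for illustration:

```r
set.seed(123)
n_people <- 100    # participants per simulated study
effect <- 0.2      # true treatment effect (assumed)
sd_person <- 1     # stable differences between people (assumed)
sd_noise <- 0.5    # measurement noise on each occasion (assumed)
n_sims <- 2000

one_study <- function() {
  person <- rnorm(n_people, 0, sd_person)

  # Between-subject design: half treated, one measurement each
  treated <- rep(c(0, 1), each = n_people / 2)
  y_between <- person + effect * treated + rnorm(n_people, 0, sd_noise)
  est_between <- mean(y_between[treated == 1]) - mean(y_between[treated == 0])

  # Within-subject design: everyone measured under both conditions
  y_control <- person + rnorm(n_people, 0, sd_noise)
  y_treat <- person + effect + rnorm(n_people, 0, sd_noise)
  est_within <- mean(y_treat - y_control)

  c(between = est_between, within = est_within)
}

sims <- replicate(n_sims, one_study())
se_between <- sd(sims["between", ])
se_within <- sd(sims["within", ])
print(round(c(se_between = se_between, se_within = se_within), 3))
```

The within-person design does use twice as many measurements, but most of its gain comes from differencing out the stable person-level variation. With real carryover or order effects the comparison would be less favorable.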

I have no idea who reads that blog but it’s always good to try to reach new audiences.

Evil collaboration between Medtronic and FDA

Paul Alper points us to this news article by Jim Spencer and Joe Carlson that has this amazing bit:

Medtronic ran a retrospective study of 3,647 Infuse patients from 2006-2008 but shut it down without reporting more than 1,000 “adverse events” to the government within 30 days, as the law required.

Medtronic, which acknowledges it should have reported the information promptly, says employees misfiled it. The company eventually reported the adverse events to the FDA more than five years later.

Medtronic filed four individual death reports from the study in July 2013. Seven months later, the FDA posted a three-sentence summary of 1,039 other adverse events from the Infuse study, but deleted the number from public view, calling it a corporate trade secret.

Wow. I feel bad for that FDA employee who did this: it must be just horrible to have to work for the government when you have such exquisite sensitivity to corporate secrets. I sure hope that he or she gets a good job in some regulated industry after leaving government service.

Bayesian inference completely solves the multiple comparisons problem


I promised I wouldn’t do any new blogging until January but I’m here at this conference and someone asked me a question about the above slide from my talk.

The point of the story in that slide is that flat priors consistently give bad inferences. Or, to put it another way, the routine use of flat priors results in poor frequency properties in realistic settings where studies are noisy and effect sizes are small. (More here.)

Saying it that way, it’s obvious: Bayesian methods are calibrated if you average over the prior. If the distribution of effect sizes that you average over is not the same as the prior distribution that you’re using in the analysis, your Bayesian inferences in general will have problems.

But, simple as this statement is, the practical implications are huge, because it’s standard to use flat priors in Bayesian analysis (just see most of the examples in our books!) and it’s even more standard to take classical maximum likelihood or least squares inferences and interpret them Bayesianly, for example interpreting a 95% interval that excludes zero as strong evidence for the sign of the underlying parameter.

In our 2000 paper, “Type S error rates for classical and Bayesian single and multiple comparison procedures,” Francis Tuerlinckx and I framed this in terms of researchers making “claims with confidence.” In classical statistics, you make a claim with confidence on the sign of an effect if the 95% confidence interval excludes zero. In Bayesian statistics, one can make a comparable claim with confidence if the 95% posterior interval excludes zero. With a flat prior, these two are the same. But with a Bayesian prior, they are different. In particular, with normal data and a normal prior centered at 0, the Bayesian interval is always more likely to include zero, compared to the classical interval; hence we can say that Bayesian inference is more conservative, in being less likely to result in claims with confidence.

Here’s the relevant graph from that 2000 paper:


This plot shows the probability of making a claim with confidence, as a function of the variance ratio, based on the simple model:

True effect theta is simulated from normal(0, tau).
Data y are simulated from normal(theta, sigma).
Classical 95% interval is y +/- 2*sigma
Bayesian 95% interval is theta.hat.bayes +/- 2*theta.se.bayes,
where theta.hat.bayes = y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
and theta.se.bayes = sqrt(1 / (1/sigma^2 + 1/tau^2))

What’s really cool here is what happens when tau/sigma is near 0, which we might call the “Psychological Science” or “PPNAS” domain. In that limit, the classical interval has a 5% chance of excluding 0. Of course, that’s what the 95% interval is all about: if there’s no effect, you have a 5% chance of seeing something.

But . . . look at the Bayesian procedure. There, the probability of a claim with confidence is essentially 0 when tau/sigma is low. This is right: in this setting, the data only very rarely supply enough information to determine the sign of any effect. But this can be counterintuitive if you have classical statistical training: we’re so used to hearing about 5% error rate that it can be surprising to realize that, if you’re doing things right, your rate of making claims with confidence can be much lower.

We are assuming here that the prior distribution and the data model are correct—that is, we compute probabilities by averaging over the data-generating process in our model.
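
Under this model the two curves can be worked out in closed form, which gives a quick check that doesn't depend on the plot. Marginally, y is normal(0, sqrt(sigma^2 + tau^2)). The classical procedure makes a claim when |y| > 2*sigma, and a little algebra on the formulas above shows that the Bayesian procedure makes a claim when |y| > 2*sigma*sqrt(1 + sigma^2/tau^2). Here's a sketch; the function names are mine, not from the paper:

```r
# Probability of a claim with confidence, averaging over the model
p_claim_classical <- function(sigma, tau) {
  2 * pnorm(-2 * sigma / sqrt(sigma^2 + tau^2))
}
p_claim_bayes <- function(sigma, tau) {
  # the threshold divided by the marginal sd of y simplifies to 2*sigma/tau
  2 * pnorm(-2 * sigma / tau)
}

sigma <- 1
ratio <- c(0.1, 0.5, 1, 2, 10)   # values of tau/sigma
claims <- cbind(ratio = ratio,
                classical = p_claim_classical(sigma, ratio * sigma),
                bayes = p_claim_bayes(sigma, ratio * sigma))
print(signif(claims, 3))
```

With tau/sigma = 0.5 this gives about 7.4% for the classical procedure and about 0.006% for the Bayesian one. As tau/sigma goes to 0 the classical probability approaches 2*pnorm(-2), or about 4.6%; that's the "5%" in the text, slightly lower because the intervals here use 2 rather than 1.96.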

Multiple comparisons

OK, so what does this have to do with multiple comparisons? The usual worry is that if we are making a lot of claims with confidence, we can be way off if we don’t do some correction. And, indeed, with the classical approach, if tau/sigma is small, you’ll still be making claims with confidence 5% of the time, and a large proportion of these claims will be in the wrong direction (a “type S,” or sign, error) or much too large (a “type M,” or magnitude, error), compared to the underlying truth.

With Bayesian inference (and the correct prior), though, this problem disappears. Amazingly enough, you don’t have to correct Bayesian inferences for multiple comparisons.

I did a demonstration in R to show this, simulating a million comparisons and seeing what the Bayesian method does.

Here’s the R code:


# fround() comes from the arm package; it is defined here so the script runs on its own
fround <- function(x, digits) format(round(x, digits), nsmall=digits)

spidey <- function(sigma, tau, N) {
  cat("sigma = ", sigma, ", tau = ", tau, ", N = ", N, "\n", sep="")
  # Simulate the true effects and their noisy estimates
  theta <- rnorm(N, 0, tau)
  y <- theta + rnorm(N, 0, sigma)
  # Classical claims with confidence: estimate is more than 2 standard errors from 0
  signif_classical <- abs(y) > 2*sigma
  cat(sum(signif_classical), " (", fround(100*mean(signif_classical), 1), "%) of the 95% classical intervals exclude 0\n", sep="")
  cat("Mean absolute value of these classical estimates is", fround(mean(abs(y)[signif_classical]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_classical]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(y))[signif_classical]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
  # Bayesian claims with confidence: posterior mean is more than 2 posterior sd's from 0
  theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
  theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
  signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
  cat(sum(signif_bayes), " (", fround(100*mean(signif_bayes), 1), "%) of the 95% posterior intervals exclude 0\n", sep="")
  cat("Mean absolute value of these Bayes estimates is", fround(mean(abs(theta_hat_bayes)[signif_bayes]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is", fround(mean(abs(theta)[signif_bayes]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(theta_hat_bayes))[signif_bayes]), 1), "% of these are the wrong sign (Type S error)\n", sep="")
}

sigma <- 1
tau <- .5
N <- 1e6
spidey(sigma, tau, N)

Here's the first half of the results:

sigma = 1, tau = 0.5, N = 1e+06
73774 (7.4%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.45 
Mean absolute value of the corresponding true parameters is 0.56 
13.9% of these are the wrong sign (Type S error)

So, when tau is half of sigma, the classical procedure yields claims with confidence 7% of the time. The estimates are huge (after all, they have to be at least two standard errors from 0), much higher than the underlying parameters. And 14% of these claims with confidence are in the wrong direction.

The second half of the output shows the results from the Bayesian intervals:

62 (0.0%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 0.95 
Mean absolute value of the corresponding true parameters is 0.97 
3.2% of these are the wrong sign (Type S error)

When tau is half of sigma, Bayesian claims with confidence are extremely rare. When there is a Bayesian claim with confidence, it will be large, which makes sense: the posterior standard error is sqrt(1/(1/1 + 1/.5^2)) = 0.45, and so any posterior mean corresponding to a Bayesian claim with confidence here will have to be at least 0.9. The average over the 62 such claims among these million comparisons turns out to be 0.95.

So, hey, watch out for selection effects! But no, not at all. If we look at the underlying true effects corresponding to these claims with confidence, these have a mean of 0.97 (in this simulation; in other simulations of a million comparisons, we get means such as 0.89 or 1.06). And very few of these are in the wrong direction; indeed, with enough simulations you'll find a type S error rate of a bit less than 2.5%, which is what you'd expect: each of these 95% posterior intervals excludes 0, so less than 2.5% of the posterior probability is on the wrong side of zero.

So, the Bayesian procedure only very rarely makes a claim with confidence. But, when it does, it's typically picking up something real, large, and in the right direction.

We then re-ran with tau = 1, a world in which the standard deviation of true effects is equal to the standard error of the estimates:

sigma <- 1
tau <- 1
N <- 1e6
spidey(sigma, tau, N)

And here's what we get:

sigma = 1, tau = 1, N = 1e+06
157950 (15.8%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.64 
Mean absolute value of the corresponding true parameters is 1.34 
3.9% of these are the wrong sign (Type S error)
45634 (4.6%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 1.68 
Mean absolute value of the corresponding true parameters is 1.69 
1.0% of these are the wrong sign (Type S error)

The classical estimates remain too high, on average about twice as large as the true effect sizes; the Bayesian procedure is more conservative, making fewer claims with confidence and not overestimating effect sizes.

Bayes does better because it uses more information

We should not be surprised by these results. The Bayesian procedure uses more information and so it can better estimate effect sizes.

But this can seem like a problem: what if this prior information on theta isn’t available? I have two answers. First, in many cases, some prior information is available. Second, if you have a lot of comparisons, you can fit a multilevel model and estimate tau. Thus, what can seem like the worst multiple comparisons problems are not so bad.
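
As a rough sketch of that second answer, here's a method-of-moments stand-in for a full multilevel model. It assumes sigma is known and the same for every comparison, and uses the fact that var(y) = sigma^2 + tau^2 under the model above; a real analysis would fit the multilevel model (in Stan, say) and carry along the uncertainty in tau:

```r
set.seed(2016)
sigma <- 1
tau_true <- 0.5    # used only to simulate; the analysis below does not see it
N <- 1e5

theta <- rnorm(N, 0, tau_true)
y <- theta + rnorm(N, 0, sigma)

# Method of moments: var(y) = sigma^2 + tau^2, so solve for tau
tau_hat <- sqrt(max(0, var(y) - sigma^2))

# Plug the estimate into the same formulas as before
theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau_hat^2)
theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau_hat^2))
signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes

cat("Estimated tau:", round(tau_hat, 3), "\n")
cat("Proportion of claims with confidence:", mean(signif_bayes), "\n")
```

With this many comparisons, tau is estimated precisely enough that the plug-in procedure behaves much like the one that knows tau.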

One should also be able to obtain comparable results non-Bayesianly by setting a threshold so as to control the type S error rate. The key is to go beyond the false-positive, false-negative framework, to set the goals of estimating the sign and magnitudes of the thetas rather than to frame things in terms of the unrealistic and uninteresting theta=0 hypothesis.
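
Here's one way such a threshold could be found, as a sketch. It assumes the same normal model as above with tau known (or estimated), which is my assumption and not a recipe from the paper. The type S error rate among claims with |y| > cutoff*sigma is computed by simple numerical integration on a grid, and then uniroot searches for the cutoff that hits a target rate:

```r
# P(sign error | |y| > cutoff*sigma) under the normal model.
# By symmetry it is enough to look at positive y.
type_s_rate <- function(cutoff, sigma, tau) {
  marg_sd <- sqrt(sigma^2 + tau^2)                 # marginal sd of y
  shrink <- tau^2 / (sigma^2 + tau^2)              # posterior mean is shrink*y
  post_sd <- sqrt(1 / (1/sigma^2 + 1/tau^2))       # posterior sd of theta
  y <- seq(cutoff*sigma, cutoff*sigma + 12*marg_sd, length.out = 20001)
  dens <- dnorm(y, 0, marg_sd)
  p_wrong <- pnorm(-shrink * y / post_sd)          # P(theta < 0 | y)
  sum(dens * p_wrong) / sum(dens)
}

# Cutoff (in units of sigma) giving a target type S error rate among claims
find_cutoff <- function(sigma, tau, target = 0.025) {
  uniroot(function(k) type_s_rate(k, sigma, tau) - target,
          interval = c(0.5, 8))$root
}

cat("Type S rate at the usual cutoff of 2, tau = sigma:",
    round(type_s_rate(2, 1, 1), 3), "\n")
cat("Cutoff for a 2.5% type S rate, tau = sigma:",
    round(find_cutoff(1, 1), 2), "\n")
```

When tau equals sigma, the cutoff comes out near 2.33, lower than the 2.83 implied by the Bayesian interval rule. That's because this controls the error rate on average across all claims, whereas the posterior interval keeps it under 2.5% for each claim individually.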

P.S. Now I know why I swore off blogging! The analysis, the simulation, and the writing of this post took an hour and a half of my work time.

P.P.S. Sorry for the ugly code. Let this be a motivation for all of you to learn how to code better.

One more thing you don’t have to worry about

Baruch Eitam writes:

So I have been convinced by the futility of NHT for my scientific goals and by the futility of significance testing (in the sense of using p-values as a measure of the strength of evidence against the null). So convinced that I have been teaching this for the last 2 years. Yesterday I bumped into this paper [“To P or not to P: on the evidential nature of P-values and their place in scientific inference,” by Michael Lew] which I thought makes a very strong argument for the validity of using significance testing for the above purpose. Furthermore—by his 1:1 mapping of p-values to likelihood functions he kind of obliterates the difference between the Bayesian and frequentist perspectives. My questions are: 1. Is his argument sound? 2. What does this mean regarding the use of p-values as measures of strength of evidence?

I replied that it all seems a bit nuts to me. If you’re not going to use p-values for hypothesis testing (and I agree with the author that this is not a good idea), why bother with p-values at all? It seems weird to use p-values to summarize the likelihood; why not just use the likelihood and do Bayesian inference directly? Regarding that latter point, see this paper of mine on p-values.

Eitam followed up:

But aren’t you surprised that the p-values do summarize the likelihood?

I replied that I did not read the paper in detail, but for any given model and sample size, I guess it makes sense that any two measures of evidence can be mapped to each other.

On deck this week

Mon: One more thing you don’t have to worry about

Tues: Evil collaboration between Medtronic and FDA

Wed: His varying slopes don’t seem to follow a normal distribution

Thurs: A day in the life

Fri: Letters we never finished reading

Sat: Better to just not see the sausage get made

Sun: Oooh, it burns me up

Taking Bayesian Inference Seriously [my talk tomorrow at Harvard conference on Big Data]

Mon 22 Aug, 9:50am, at Harvard Science Center Hall A:

Taking Bayesian Inference Seriously

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors also can resolve some of the questions of replication and multiple comparisons that have recently shaken the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

Kaiser Fung on the ethics of data analysis

Kaiser gave a presentation and he’s sharing the slides with us here. It’s important stuff.