
(It’s never a) Total Eclipse of the Prior

(This is not by Andrew)

This is a paper we (Gelman, Simpson, Betancourt) wrote by mistake.

The paper in question, recently arXiv’d, is called “The prior can generally only be understood in the context of the likelihood”.

How the sausage was made

Now, to be very clear (and because I’ve been told since I moved to North America that you are supposed to explicitly say these things rather than just work on the assumption that everyone understands that there’s no way we’d let something sub-standard be seen by the public) this paper turned out very well.  But it started with an email where Andrew said “I’ve been invited to submit a paper about priors to a special issue of a journal, are you both interested?”.

Why did we get this email?  Well Mike and I [Simpson], along with a few others, have been working with Andrew on a paper about weakly informative priors that has been stuck in the tall grass for a little while.  And when I say it’s been stuck in the tall grass, I mean that I [Simpson] got mesmerised by the complexity and ended up stuck. This paper has gotten us out of the grass. I’d use a saying of my people (“you’ve got to suffer through Henry Street to make it to People”) except this paper is not Henry Street.  This paper is good.  (This paper is also not People, so watch this space…)

So over a fairly long email thread, we worked out that we were interested and we carved out a narrative and committed to the idea that writing this paper shouldn’t be a trauma.  Afterwards, it turned out that individually we’d each understood the end of that conversation differently, leading in essence to three parallel universes that we were each playing in (eat your heart out Sliders).

Long story short, Andrew went on holidays and one day emailed us a draft of the short paper he had thought we were writing. I then took it and wrestled it into a draft of the short paper I thought we were writing.  Mike then took it and wrestled it into the draft of the short paper he thought we were writing. And so on and so forth.  At some point we converged on something that (mostly) unified our perspectives and all of a sudden, this “low stakes” paper turned into something that we all really wanted to say.

Connecting the prior and the likelihood

So what is this paper about? Well it’s 13 pages, you can read it.  But it covers a few big points:

1) If you believe that priors are not important to Bayesian analysis, we have a bridge we can sell you.  This is particularly true for complex models, where the structure of the posterior may lead to certain aspects of the prior never washing away with more data.

2) Just because you have a probability distribution doesn’t mean you have a prior. A prior connects with a likelihood to make a *generative model* for new data, and when we understand it in that context, weakly informative priors become natural.

3) This idea is understood by a lot of the classical literature on prior specification, such as reference priors. These methods typically use some sort of asymptotic argument to remove the effect of the specific realisation of the likelihood that is observed. The resulting prior then leans heavily on the assumption that this asymptotic argument is valid for the data actually being observed, which often does not hold. When the data are far from asymptopia, the resulting priors are too diffuse and can lead to nonsensical estimates.

Generative models are the key

4) The interpretation of the prior as a distribution that couples with the likelihood to build a generative model for new data is implicit in the definition of the marginal likelihood, which is just the density of this generative distribution evaluated at the observed data.  This makes it easy to understand why improper priors cannot be used for Bayes factors (ratios of marginal likelihoods): they do not produce generative models.  (In a paper that will come out later this week, we make a pretty good suggestion for a better use of the prior predictive.)

5) Understanding what a generative model means also makes it clear why any decision (model choice or model averaging) that involves the marginal likelihood leans very heavily on the prior that has been chosen.  If your data is y and the likelihood is p(y | theta), then the generative model makes new data as follows:

– Draw theta ~ p(theta) from the prior

– Draw y ~ p(y | theta).

So if, for example, p(theta) has very heavy tails (like a half-Cauchy prior on a standard deviation), then occasionally the data will be drawn with an extreme value of theta.

This means that the entire prior will be used for making these decisions, even if it corresponds to silly parts of the parameter space.  This is why we strongly advocate using posterior predictive distributions for model comparison (LOO scores) or model averaging (predictive stacking).
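To make that two-step generative recipe concrete, here is a minimal prior predictive simulation in Python. It is only a sketch with made-up model choices (a half-Cauchy(0, 1) prior on a standard deviation sigma, and normal data given sigma), not anything taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2017)
n_sims, n_obs = 10_000, 20

# Step 1: draw theta (here a standard deviation sigma) from a half-Cauchy(0, 1) prior.
sigma = np.abs(rng.standard_cauchy(n_sims))

# Step 2: draw replicated data y ~ normal(0, sigma) given each prior draw.
y_rep = rng.normal(loc=0.0, scale=sigma[:, None], size=(n_sims, n_obs))

# The heavy prior tail occasionally produces absurdly extreme "data".
print("median |y_rep|:", np.median(np.abs(y_rep)))
print("max |y_rep|:   ", np.abs(y_rep).max())
print("share of prior draws with sigma > 100:", np.mean(sigma > 100))
```

Those occasional extreme prior draws are exactly the silly parts of the parameter space that the marginal likelihood, and hence any Bayes factor, will happily integrate over.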

So when can you safely use diffuse priors? (Hint: not often)

6) Enjoying, as I do, giving slightly ludicrous talks, I recently gave one called “you can only be uninformative if you’re also being unambitious”.  This is an under-appreciated point about Bayesian models for complex data: the directions that we are “vague” in are the directions where we are assuming the data is so strong that that aspect of the model will be unambiguously resolved. This is a huge assumption and one that should be criticised.

So turn around bright eyes. You really need to think about your prior!

PS. If anyone is wondering where the first two sentences of the post went, they weren’t particularly important and I decided that they weren’t particularly well suited to this forum.

Gigo update (“electoral integrity project”)

Someone sent me this note:

I read your takedown of the EIP on Slate and then your original blog post and the P. Norris response. I wanted to offer a couple of points.

First, as you can see below, I was asked to be one of the ‘experts.’ I declined. I think we all can see the kind of bias introduced into the sample of experts when the sampling frame is a list of email addresses of elections scholars and participation is based on self selection.

Second, and probably more fundamentally, the data are generated by expert responses to a 12 minute survey (quick!). But the issues surveyed are pretty significant. Is media coverage adequate? Is the vote counted fairly? How am I supposed to assess this for an entire state (much less an entire country)? Even if I am paying close attention (and I’m not, since most of my time is spent, you know, trying to get things published…), do they really expect my range of attention to cover ** media markets and ** counties? It’s not like I have a range of sources across the state that I could call upon to give me information on these topics from areas far from ** – if I did, I’d be a journalist.

Finally, notice the date of the request – two weeks after Election Day. Even if I could gather all the necessary information to make my survey responses valid, I would have had to be primed to look for these things not only before Election Day, but also before the campaign even took place. Asking me to recall information after the fact when I wasn’t necessarily looking for it at the time is, well, bad.

I thought this effort sounded fishy when the request to participate landed in my inbox, and with your assessment and the Norris response I now have even less confidence in these data. I appreciate Norris’s effort to engage in a public dialogue on this and (in general) her efforts to get us to think about how to assess electoral integrity, but I more appreciate your efforts to point out the methodological issues and to keep journalists from running away with “findings” based on faulty social science.

Sincerely,
**

Prof. **
Dept. of Political Science
** University

Here’s the email that my correspondent received:

From: Electoral Integrity Project <**@**.harvard.edu>
Sent: Tuesday, November 22, 2016 10:41 AM
To: **
Subject: Harvard University seeks your expertise on electoral integrity in **

www.electoralintegrityproject.com

Dr. **
Political Science
** University

Dear Dr. **,

How do we know when elections meet international standards and principles – and when they fail?

Given your knowledge and expertise we are interested in learning your views about how the US presidential election on 8 November 2016 was conducted in **.

The survey usually takes around 12 minutes to complete. Your answers are anonymous and all replies will be treated with the strictest confidence. You can participate in the survey or decline to do so by clicking below.

I would like to participate. Your unique reference number is: *****
I would like to decline the opportunity to participate.

The aim of the survey is to gather comprehensive, impartial, and reliable information which can compare the quality of all national elections held worldwide. So far the research has monitored contests in over 100 countries. The study is conducted by an independent team of scholars based in Australia, Europe and the United States. The data is made widely available to the user community and released every year. You are welcome to contact us at **@**.harvard.edu for any further information.

You can read the relevant Participation Information Statement, which provides more information about this study. Completing the survey is an indication that you have read and understand the Participation Information Statement.

We realize that you have numerous demands on your time and we greatly appreciate your collaboration and help in this project.

Sincerely,
Professor Pippa Norris
(Harvard University and the University of Sydney)

—————————————————————————————–
The Electoral Integrity Project
Department of Government and International Relations,
University of Sydney
Sydney, NSW 2006
Australia

Professor Pippa Norris (Harvard University and the University of Sydney)
Professor Jørgen Elklit (Aarhus University)
Professor Andrew Reynolds (University of North Carolina, Chapel Hill)
Professor Jeffrey Karp (University of Exeter)
Project Manager: Dr. **
Survey Manager: Mr. **
Research assistant: Ms **

Email: **@**.harvard.edu
Website: www.electoralintegrityproject.com
—————————————————————————————–

It’s always good to know where your data came from. I think Norris has done a great service with the World Values Survey but I’m more skeptical about the Electoral Integrity Project.

P.S. Joseph Cummins writes, “my girlfriend now wishes to externalize her love for our neighborhood’s most boss street cat onto you.” I suspect this cat has strong feelings about electoral integrity.

Rosenbaum (1999): Choice as an Alternative to Control in Observational Studies

Winston Lin wrote in a blog comment earlier this year:

Paul Rosenbaum’s 1999 paper “Choice as an Alternative to Control in Observational Studies” is really thoughtful and well-written. The comments and rejoinder include an interesting exchange between Manski and Rosenbaum on external validity and the role of theories.

And here it is. Rosenbaum begins:

In a randomized experiment, the investigator creates a clear and relatively unambiguous comparison of treatment groups by exerting tight control over the assignment of treatments to experimental subjects, ensuring that comparable subjects receive alternative treatments. In an observational study, the investigator lacks control of treatment assignments and must seek a clear comparison in other ways. Care in the choice of circumstances in which the study is conducted can greatly influence the quality of the evidence about treatment effects. This is illustrated in detail using three observational studies that use choice effectively, one each from economics, clinical psychology and epidemiology. Other studies are discussed more briefly to illustrate specific points. The design choices include (i) the choice of research hypothesis, (ii) the choice of treated and control groups, (iii) the explicit use of competing theories, rather than merely null and alternative hypotheses, (iv) the use of internal replication in the form of multiple manipulations of a single dose of treatment, (v) the use of undelivered doses in control groups, (vi) design choices to minimize the need for stability analyses, (vii) the duration of treatment and (viii) the use of natural blocks.

Good stuff. Someone should translate all of Rosenbaum into Bayes at some point.

Iterative importance sampling

Aki points us to some papers:

Langevin Incremental Mixture Importance Sampling

Parallel Adaptive Importance Sampling

Iterative importance sampling algorithms for parameter estimation problems

The next one is not iterative, but it is interesting in another way:

Black-box Importance Sampling

Importance sampling is what you call it when you’d like to have draws of theta from some target distribution p(theta) (or, in a Bayesian context, we’d say p(theta|y)), but all you have are draws from some proposal distribution g(theta) that approximates p. You take the draws from g, and give each of them a weight proportional to the importance ratio r=p/g. And then you compute weighted averages; for any function h(theta), you estimate E_p(h) as Sum_theta r(theta)h(theta) / Sum_theta r(theta), summing over draws theta from g. We typically can only compute p up to an unknown multiplicative constant, and often we can only compute g up to an unknown multiplicative constant, but those constants drop out when computing the ratio.
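In code, the self-normalized estimator just described is only a few lines. The following numpy sketch uses toy choices for illustration (a target proportional to a normal(1, 2^2) density and a normal(0, 3^2) proposal); both densities only need to be known up to constants:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Unnormalized log target density p (here proportional to normal(1, 2^2)).
log_p = lambda theta: -0.5 * ((theta - 1.0) / 2.0) ** 2
g = stats.norm(loc=0.0, scale=3.0)           # proposal distribution

theta = rng.normal(0.0, 3.0, size=50_000)    # draws from the proposal g
log_r = log_p(theta) - g.logpdf(theta)       # log importance ratios (up to a constant)
w = np.exp(log_r - log_r.max())              # stabilize before exponentiating

h = theta ** 2                               # any function h(theta)
estimate = np.sum(w * h) / np.sum(w)         # self-normalized estimate of E_p[h]
print(estimate)                              # true value here is 2^2 + 1^2 = 5
```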

Importance sampling is an old idea, and statisticians used to think of it as “exact,” in some sense. And, back around 1990, when Gibbs and Metropolis sampling started to become popular in statistics, a lot of us had the idea that it would be a good idea to start a computation with the iterative Gibbs and Metrop algorithms, and then clean things up at the end with some exact importance sampling. But this idea was wrong.

Yes, importance sampling is simulation-consistent for most purposes, but, in general, if the importance ratios are unbounded (which will happen if there are parts of the target distribution with longer tails than the proposal distribution), then for any finite number of simulation draws, importance sampling will give you something between the proposal and target distributions. So it doesn’t make sense to think of importance sampling as more “exact” than Gibbs or Metropolis.
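A quick toy illustration of that point (a sketch, not taken from any of the papers above): take a Student-t target with a normal proposal, so the importance ratios are unbounded, and watch the self-normalized estimate of E_p(theta^2) crawl from something near the proposal’s value toward the target’s value as the number of draws grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
target, proposal = stats.t(df=3), stats.norm()   # target has heavier tails than the proposal

for n_draws in (10**2, 10**4, 10**6):
    theta = proposal.rvs(size=n_draws, random_state=rng)
    log_r = target.logpdf(theta) - proposal.logpdf(theta)   # unbounded importance ratios
    w = np.exp(log_r - log_r.max())
    est = np.sum(w * theta**2) / np.sum(w)
    # E(theta^2) is 1 under the proposal and 3 under the target; for any finite
    # number of draws the estimate tends to land somewhere in between.
    print(n_draws, round(float(est), 2))
```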

Indeed, importance sampling can be seen as an iterative approximation, starting with a proposal distribution and gradually approaching the target distribution (if certain conditions are satisfied) as the number of simulation draws increases. This is a point I emphasized in section 3 of my 1991 paper: that importance sampling, like Markov chain sampling, is an iterative simulation method. But, where Gibbs and Metropolis are adaptive—their proposal distributions depend on the most recently drawn theta—importance sampling is not.

Thanks to the above reasoning, importance sampling fell out of favor: for hard problems of high or even moderate dimensionality, importance sampling fails miserably; and for easy, low-dimensional problems, one can just as well use black-box MCMC (i.e., Stan).

Importance sampling is no longer the workhorse.

But importance sampling still has a role to play, in (at least) two ways. First, sometimes we want to work with perturbations of our distribution without having to re-fit, for example when doing leave-one-out cross-validation. Thus these two papers with Aki and Jonah:

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Pareto smoothed importance sampling.
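The leave-one-out case is a nice example of such a perturbation. Below is a deliberately crude sketch of plain importance-sampling LOO, with no Pareto smoothing (the smoothing being the point of the papers above): leaving out observation i corresponds to importance ratios proportional to 1 / p(y_i | theta), so the LOO predictive density reduces to a harmonic mean of the pointwise likelihood over posterior draws. The input array is an assumed stand-in for output from whatever fit you already have:

```python
import numpy as np
from scipy.special import logsumexp

def isloo_elpd(log_lik):
    """Raw (unsmoothed) importance-sampling LOO.

    log_lik: array of shape (n_draws, n_obs) holding log p(y_i | theta_s)
             evaluated at posterior draws theta_s from an existing fit.
    Returns the pointwise log LOO predictive densities.
    """
    n_draws = log_lik.shape[0]
    # With ratios r_s = 1 / p(y_i | theta_s), self-normalized IS gives
    # p(y_i | y_-i) ~= n_draws / sum_s 1 / p(y_i | theta_s),
    # a harmonic mean, computed on the log scale for stability.
    return np.log(n_draws) - logsumexp(-log_lik, axis=0)

# Fake log-likelihood values standing in for real posterior output:
fake_log_lik = np.random.default_rng(0).normal(-1.0, 0.5, size=(4000, 50))
print(isloo_elpd(fake_log_lik).sum())   # total (raw) elpd_loo estimate
```

With long-tailed ratios this raw version is exactly the kind of estimator the previous paragraphs warn about, which is why the Pareto smoothing matters in practice.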

The other place for importance sampling is following an approximate fit such as obtained using normal approximation, variational Bayes, or expectation propagation. This is not a gimme because in moderate or high dimensions, the approx is going to be far enough away that the importance ratios will be highly variable. Still, one would expect importance sampling, if done right, to give us something between the approx and the target distribution, so it should be a step forward.

Research still needs to catch up to practice in this area. In particular, I think the theoretical framework for importance sampling should more explicitly recognize that the goal is to get a good intermediate distribution, not to expect to get all the way there.

P.S. Lexi, pictured above, looks pretty important to me! She’s the sister of the reflective cat from this post and the picture comes from Maria Del Carmen Herrojo-Ruiz.

All cause and breast cancer specific mortality, by assignment to mammography or control

Paul Alper writes:

You might be interested in the robocall my wife received today from our Medicare Advantage organization (UCARE Minnesota). The robocall informed us that mammograms saved lives and was available free of charge as part of her health insurance. No mention of recent studies criticizing mammography regarding false positives, harms of biopsies, etc.

I bring this up to illustrate that statistics have failed to dent the mystique and eagerness of the mammography lobby’s incessant push to overtreat and overdiagnose. Below are two famous graphs from a 25-year Canadian study.

Wow. I’d like to see the link to the source of these graphs, along with the raw data. But assuming they’re correct . . . wow. I mean, sure, we can come up with all sorts of stories, and mammography has gotta be better now than it was 25 years ago. But still, no visible difference at all. . . . wow.

SCANDAL: Florida State University football players held to the same low standards as George Mason University statistics faculty

Paul Alper points us to this news report:

As the Florida State University football team was marching to a national title in the fall of 2013, the school was investigating allegations of academic favoritism involving a half-dozen of its leading players . . . The inquiry, previously unreported, stemmed from a complaint by a teaching assistant who said she felt pressured to give special breaks to athletes in online hospitality courses on coffee, tea and wine, where some handed in plagiarized work and disregarded assignments and quizzes. . . .

Hey, wait a minute . . . “online hospitality courses on coffee, tea and wine”? Huh? What is this, the Cornell University business school?

Check this out:

Copying from Wikipedia? This’ll get you tenure in the statistics department at a major university in Northern Virginia.

It’s a sad day when professional football players (excuse me, student-athletes) are held to the same low standards as tenured professors of statistics. Really, I’d hope the football players could do better. After all, they’ll have to get jobs in the real world soon, they can’t just coast on their reputations.

The whole thing is so sad: they play football, they make millions of dollars in revenue for their institutions, they don’t get paid, they get major injuries, and they don’t even get an education out of it.

P.S. Yeah, yeah, Columbia has Dr. Oz. I never said we were perfect.

Causal identification + observational study + multilevel model

Sam Portnow writes:

I am attempting to model the impact of tax benefits on children’s school readiness skills. Obviously, benefits themselves are biased, so I am trying to use the doubling of the maximum allowable additional child tax credit in 2003 to get an unbiased estimate of benefits. I was initially planning to attack this problem from an instrumental variables framework, but the measures of school readiness skills change during the course of the study, and I can’t equate them. My (temporary) solution is to use a multilevel model to extract the random effect of the increase of the benefit on actual benefits, and then plug that random effect into my equation looking at school readiness skills. The downside of this approach is that I can’t seem to find much research that suggests this is a plausible solution. Do you have an initial thoughts about this approach, or perhaps papers that you’ve seen that use a similar approach. I don’t know any one with expertise in this area, and want to make sure I’m not going down a rabbit hole.

My reply:

To start I recommend this post from ten years ago on how to think about instrumental variables. In your case, the idea is to estimate the effect of the doubling of the tax credit directly, and worry later about interpreting this as an effect of tax benefits more generally.

Now that you’re off the hook regarding instrumental variables, you can just think of this as a regular old observational study. You have your treatment group and your control group . . . ummmm, I don’t know anything about the child tax credit, maybe this was a policy change done just once, so all you have is a before-after comparison? In that case you gotta make a lot of assumptions. Fitting a multilevel model might be fine, this sort of thing makes sense if you have individual-level outcomes and individual-level predictors with a group-level treatment effect.

So really I think you can divide your problem into three parts:

1. Causal identification. If you were going to think of the doubling of the tax credit as your instrument, then just try to directly estimate the effect of that treatment. Or if you’re gonna be more observational about it and make use of existing variation, fine, do that. Just be clear on what you’re doing, and I don’t see that much will be gained by bringing in the machinery of instrumental variables.

2. The observational study. This is the usual story: treatment group, control group, use regression, or matching followed by regression, to control for pre-treatment predictors.

3. Multilevel modeling. This will come in naturally if you have group-level variation or group-level predictors.
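For point 3, here is a minimal sketch of the kind of model being described: an individual-level outcome and predictors, a treatment indicator that varies only at the group level, and a random intercept for the grouping. The variable names, the hypothetical data file, and the use of statsmodels are illustrative assumptions on my part, not anything from the original question:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed columns (one row per child):
#   readiness   individual-level outcome (school readiness score)
#   age, income individual-level predictors
#   post2003    treatment indicator, varying only at the cohort level
#   cohort      grouping variable for the random intercept
df = pd.read_csv("readiness.csv")   # hypothetical file name

# Random-intercept model: group-level treatment effect plus individual-level predictors.
model = smf.mixedlm("readiness ~ post2003 + age + income", data=df, groups=df["cohort"])
fit = model.fit()
print(fit.summary())
```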

mc-stan.org down again (and up again)

[update: back up again 20 minutes later. sorry for all the churn and sorry again it went down.]

My fault again. Really sorry about this.

I’m actually on a real vacation for the first time in two years and not checking my email regularly and not checking my junk email at all. This time, PairNic shut us off because they wanted me to follow up the domain renewal several days later with one of these multi-step dances to renew my email. I have no idea why this is so challenging.

I just verified that we transferred the domain, and we’re hoping it’ll be back up within the next two hours, as PairNIC implied in their web form.

We’re trying to figure out how to get this set up so there’s not a single point of failure such as one of our personal emails or phone numbers.

What to make of reported statistical analysis summaries: Hear no distinction, see no ensembles, speak of no non-random error.

Recently there has been a lot of fuss about the inappropriate interpretations and uses of p-values, significance tests, Bayes factors, confidence intervals, credible intervals and almost anything anyone has ever thought of. That is, a lot of fuss about desperately discerning what to make of reported statistical analysis summaries of individual studies, largely on their own, including a credible quantification of the uncertainties involved, immediately after a study has been completed (or soon after), by the very experimenters who were involved in carrying it out, perhaps along with consultants or collaborators with hopefully somewhat more statistical experience. So creators, perpetrators, evaluators, jurors and judges are all biased toward a hopeful sentence of many citations and continued career progression.

Three things that do not seem to be getting adequate emphasis in these discussions of what to make of reported statistical analysis summaries are: 1. failing to distinguish what something is versus what to make of it, 2. ignoring the ensemble of similar studies (completed, ongoing and future), and 3. neglecting important non-random errors. This does seem to be driven by academic culture and so it won’t be easy to change. As Nazi Reich Marshal Hermann Goring once (reportedly) quipped, “Whenever I hear the word culture, I want to reach for my pistol!”

What is meant by “what to make of” a reported statistical analysis summary, its upshot or how it should affect our future actions and thinking, as opposed to simply what it is? CS Peirce called this the pragmatic grade of clarity of a concept. To him it was the third grade, which needed to be preceded by two other grades: the ability to recognise instances of a concept and the ability to define it. For instance, with regard to p-values: the ability to recognise what is or is not a p-value, the ability to define a p-value, and the ability to know what to make of a p-value in a given study. It is the third that is primary and paramount to “enabling researchers to be less misled by the observations,” and thereby to discerning what to make, for instance, of a p-value. Importantly, it also always remains open ended.

A helpful quote from Peirce might be “. . . there are three grades of clearness in our apprehensions of the meanings of words. The first consists in the connexion of the word with familiar experience. . . . The second grade consists in the abstract definition, depending upon an analysis of just what it is that makes the word applicable. . . . The third grade of clearness consists in such a representation of the idea that fruitful reasoning can be made to turn upon it, and that it can be applied to the resolution of difficult practical problems.” (CP 3.457, 1897)

Now almost all the teaching in statistics is about the first two grades, and much (most) of the practice of statistics skips over the third with the usual: this is the p-value in your study, and don’t forget its actual definition (if you do, people will have the right to laugh at you). But all the fuss here is, or should be, about what should be made of this p-value, or other statistical analysis summary. How should it affect our future actions and thinking? Again, that will always remain open ended.

Additionally, ignoring the ensemble of similar studies makes that task unduly hazardous (except in emergency or ethical situations where multiple studies are not possible or cannot be waited for). So why are most statistical discussions framed with reference to a single solitary study, with the expectation that, if done adequately, one should be able to discern what to make of it and adequately quantify the uncertainties involved? Why, why, why? As Mosteller and Tukey put it in their chapter “Hunting Out the Real Uncertainty” of Data Analysis and Regression way back in 1977, you don’t even have access to the real uncertainty with just a single study.

Unfortunately, when many do consider the ensemble (i.e. do a meta-analysis), they almost exclusively obsess about combining studies to get more power, paying not much more than lip service to assessing the real uncertainty (e.g. doing a horribly underpowered test of heterogeneity, or thinking a random effect will adequately soak up all the real differences). Initially, the first or second sentence of the wiki entry on meta-analysis was roughly “meta-analysis has the capacity to contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies.” That progressively got moved further and further down and prefaced with “In addition to providing an estimate of the unknown common truth” (i.e. in addition to this amazing offer you will also receive…). Why in the world would you want an estimate of the unknown common truth without some credible assessment that things are common?

Though perhaps most critical of all, not considering important non-random error when discerning what to make of a p-value or other statistical analysis summary makes no sense. Perhaps the systematic error (e.g. confounding) is much larger than the random error. Maybe the random error is so small relative to the systematic error that it can be safely ignored (i.e. no need to even calculate a p-value)?

Earlier I admitted these oversights are culturally driven in academia, and reaching for one’s pistol is almost never a good idea. Academics really want (or even feel they absolutely need) to make something out of an individual study on its own (especially if it’s theirs). Systematic errors are just too hard to deal with adequately for most statisticians, and they usually require domain knowledge that statisticians and even study authors won’t have. Publicly discerning what to make of a p-value or other statistical analysis summary is simply too risky. It is open ended, and in some sense you will always fall short and others might laugh at you.

Too bad always being wrong (in some sense) seems so wrong.

The Groseclose endgame: Getting from here to there.

A few years ago, I wrote the following regarding political scientist Tim Groseclose’s book on media bias:

Groseclose’s big conclusion is that in the absence of media bias, the average American voter would be positioned at around 25 on a 0-100 scale, where 0 is a right-wing Republican and 100 is a left-wing Democrat. . . .

In Groseclose’s endgame, a balanced media might include some TV networks promoting the view that abortion should be illegal under all circumstances and subject to criminal penalties, whereas others might merely hold that Roe v. Wade is unconstitutional; some media outlets might support outright discrimination against gays whereas others might be neutral on civil unions but oppose gay marriage; and on general politics there might be some newspapers that endorse hard-right Republican candidates (0 on Groseclose’s 0-100 scale) whereas those on the left would endorse the positions of Senator Olympia Snowe. . . .

I find it plausible that a Berlusconi-style media environment could shift U.S. politics far to the right, but given the effort it would take to maintain such a system (in Italy, Berlusconi has the power of the government but still has continual struggles with the law), it’s hard for me to think of this as an equilibrium in the way that it is envisioned by Groseclose. This just seems like a counterfactual that would require resources far beyond what was spent to set up Fox News, the Weekly Standard, and other right-leaning media properties.

I wrote that in 2011. Since then, the media landscape has changed, and Fox News has moved from far-right to center-right. By which, I don’t mean that Fox has moved to the left, I mean that the institutions in the center and the center-left have become weaker (declining readership of newspapers and broadcast TV networks), while Breitbart News etc. have become the new hard right, and there’s pressure on what remains of the center. So things really are moving in the direction that Groseclose was saying. I don’t see the voters as moving all the way to 25 on his scale—after all, the Democrats and Republicans are pretty much split fifty-fifty among the voters—but the political distribution of the news media has been changing fast. It’s all some complicated interaction of people’s political attitudes, what they find entertaining enough to watch or click on, and what are the efforts that various rich and powerful organizations want to pay for.

What are best practices for observational studies?

Mark Samuel Tuttle writes:

Just returned from the annual meeting of the American Medical Informatics Association (AMIA); in attendance were many from Columbia.

One subtext of conversations I had with the powers that be in the field is the LACK of Best Practices for Observational Studies. They all agree that, however difficult they are, Observational Studies are the future of healthcare research.

I passed along your blog item, “Thinking more seriously about the design of exploratory studies: A manifesto,” to the new chair of NCVHS (the National Committee on Vital and Health Statistics).

I replied: Just to clarify: the observational/experimental divide is orthogonal to exploratory/confirmatory. There is a literature on the design of observational studies (see the work of Paul Rosenbaum) but it has a confirmatory focus. That’s no knock on Rosenbaum: almost all the literature on statistical design—including my own papers and book chapters on the topic—come at it from a confirmatory perspective.

Tuttle responded:

At a deeper level I understand all this – the math at least, and the distinctions to be made, but failed to acquire the language with which to describe it well to others, or with which to communicate with those for whom these are “religious” distinctions.

This is (yet) another challenge of inter-disciplinary work.

On a related note, many (healthcare) clinical trials – “experiments” in your lingo – never finish, mostly because of failures of accrual – they can’t get enough patients.

(Difficulty in accruing patients is not just about the ethical dilemma – denying some patients something that might be better; it’s also a predictor that the study may be irrelevant – because real patients are just more complicated, with, for example, co-morbidities that disqualify them from the trial.)

This is yet another reason many are embarrassed by the whole thing – the failure of experiments, lack of reproducibility, etc. Still, those extolling “observational” studies don’t always stand up on their hind legs when they should.

For more on observational/experimental, see this paper from several years ago, which begins:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

In the present article, I’ll address the following questions:

1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

2. Given point 1 above, why does almost all my research use observational data?

Robert Gelman, 1923-2017

Bob Gelman, beloved husband of Jane for 67 years, proud father of Alan, Nancy, Susan, and Andy, and adoring grandparent of Stephanie, Noah, Adam, Jamie, Ben, Zacky, Jakey, and Sophie, passed away peacefully on the morning of 27 Aug 2017 at the age of 94. A child of immigrants, Bob grew up playing stickball in the streets of Brooklyn, studied physics at City College and Columbia University, taught at Champlain College in Plattsburgh, and served his country during World War II and after, when he built machines to compute missile trajectories, and later in his work at the Environmental Protection Agency. Bob was a gentle, careful man who loved life, a fiercely liberal Democrat who delighted in puns and the English language, music, tennis, and, above all, his family.

“Mainstream medicine has its own share of unnecessary and unhelpful treatments”

I have a story and then a question.

The story

Susan Perry (link sent by Paul Alper) writes:

Earlier this week, I [Perry] highlighted two articles that exposed the dubious history, medical ineffectiveness and potential health dangers of popular alternative “therapies.”

Well, the same can be said of many mainstream conventional medical practices, as investigative reporter David Epstein points out in an article co-published last week by ProPublica and The Atlantic.

“When you visit a doctor, you probably assume the treatment you receive is backed by evidence from medical research,” writes Epstein. “Surely, the drug you’re prescribed or the surgery you’ll undergo wouldn’t be so common if it didn’t work, right?”

Wrong, as Epstein explains:

For all the truly wondrous developments of modern medicine — imaging technologies that enable precision surgery, routine organ transplants, care that transforms premature infants into perfectly healthy kids, and remarkable chemotherapy treatments, to name a few — it is distressingly ordinary for patients to get treatments that research has shown are ineffective or even dangerous. Sometimes doctors simply haven’t kept up with the science. Other times doctors know the state of play perfectly well but continue to deliver these treatments because it’s profitable — or even because they’re popular and patients demand them. Some procedures are implemented based on studies that did not prove whether they really worked in the first place. Others were initially supported by evidence but then were contradicted by better evidence, and yet these procedures have remained the standards of care for years, or decades.

Even if a drug you take was studied in thousands of people and shown truly to save lives, chances are it won’t do that for you. The good news is, it probably won’t harm you, either. Some of the most widely prescribed medications do little of anything meaningful, good or bad, for most people who take them.

Epstein describes the results of several recent reviews of common clinical practices that found such practices were often unnecessary, unhelpful and/or potentially harmful — “from the use of antibiotics to treat people with persistent Lyme disease symptoms (didn’t help) to the use of specialized sponges for preventing infections in patients having colorectal surgery (caused more infections).”

Many of these treatments were hailed as being “breakthroughs” when they were first approved, but were found in subsequent research to be inferior to the practices they replaced.

By then, though, the treatment had become so ubiquitous that doctors — and patients — were reluctant to accept the evidence that it didn’t work.

As Alper points out, this echoes ESP, power pose, and the collected works of Brian Wansink: publicized mind hacks that don’t seem to hold up under scrutiny.

The question

So, what to think about this? I’m not asking, “What can we do about all this published research, widely accepted by doctors and patients, which turns out to be largely wrong?”, nor am I asking, “What’s a good Edlin factor for clinical research literature?”

Those are good questions, but here I want to ask something different: Should we care? What’s the cost? These largely ineffectual medical treatments are, I assume, generally no worse than the alternatives. So what we’re doing as a society, and as individuals, is to throw away resources: costs in the development and marketing of drugs, doctors’ time and effort, medical researchers’ time and effort, patient time that could be better spent in conversation with the doctor or in some other way, various mountains of paperwork, and so on. Ultimately I think the costs have to be put on some dollar scale, otherwise it’s hard to know what to make of all this.

Another way to think of this is in terms of policy analysis. We currently have an implicit policy—some combination of laws, regulations, company policies, and individual behaviors which result in these treatments being performed but not those. So I’d be interested in Susan Perry’s suggested alternative, or the alternatives proposed by others. I’m not saying this as a challenge to Perry, nor is it any sort of requirement that she comes up with an alternative—her reporting already is doing a valuable service. What I’m saying is that it’s hard for me to know what to make of these stories—or even hard numbers on treatment effects—without going the next step and thinking about policies.

I know there are people working in this area, so I’m not at all trying to claim that I’m making some stunning deep point in the above post. I’m just trying to shake the tree a bit here, as I’d like to see the connection between individual studies and the big picture.

Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice

John Snow writes:

Just came across this paper [Human Decisions and Machine Predictions, by Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan] and I’m wondering if you’ve been following the debate/discussion around these criminal justice risk assessment tools.

I haven’t read it carefully or fully digested the details. On the surface, their general critique of the risk assessment tools seems reasonable but what caught my attention are the results of the simulation they report in the abstract:

Even accounting for these concerns, our results suggest potentially large welfare gains: a policy simulation shows crime can be reduced by up to 24.8% with no change in jailing rates, or jail populations can be reduced by 42.0% with no increase in crime rates.

Those numbers seem unrealistic in their size. I’d be curious to hear your take on this paper in the blog.

Ummm, I think that when they said “24.8%” and “42.0%,” they really meant 25% and 42%, as there’s no way they could possibly estimate such things to an accuracy of less than one percentage point. Actually there’s no way they could realistically estimate such things to an accuracy of 10 percentage points, but I won’t demand a further rounding to 20% and 40%.

In all seriousness, I do think it’s misleading for them to be presenting numbers such as “83.2%” and “36.2%” in their paper. The issue is not “sampling error”—they have a huge N—it’s that they’re using past data to make implicit inferences and recommendations for new cases in new places, and of course there’s going to be variation.

In any case, sure, I can only assume their numbers are unrealistic, as almost by definition they’re a best-case analysis, not because of overfitting but because they’re not foreseeing any . . . ummmm, any unforeseen problems. But they seem pretty clear on their assumptions: they explicitly label their numbers as coming from “a policy simulation” and they qualify that whole sentence with a “our results suggest.” I’m cool with that.

In our radon article, my colleagues and I wrote: “we estimate that if the recommended decision rule were applied to all houses in the United States, it would be possible to save the same number of lives as with the current official recommendations for about 40% less cost.” And that’s pretty similar. If we can claim a 40% cost savings under optimal policy, I don’t have a problem with these researchers claiming something similar. Yes, 40% is a lot, but if you have no constraints and can really make the (prospectively) optimal decisions, this could be the right number.

P.S. Parochially, I don’t see why the authors of this paper have to use the term “machine learning” for what I would call “statistical prediction.” For example, they contrast their regularized approach to logistic regression without seeming to recognize that logistic regression can itself be regularized: they write, “An important practical breakthrough with machine learning is that the data themselves can be used to decide the level of complexity to use,” but that’s not new: it’s a standard idea in hierarchical modeling and was already old news ten years ago when Jennifer and I wrote our book.
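For what it’s worth, letting “the data themselves decide the level of complexity” in a plain logistic regression is a one-liner in standard software. The sketch below uses scikit-learn’s cross-validated L2 penalty purely as an illustration; the data are simulated stand-ins, and a Bayesian version would instead put a prior on the coefficients (the same regularization idea):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                                   # simulated features (stand-in data)
y = (X @ rng.normal(size=20) + rng.normal(size=1000) > 0).astype(int)

# L2-penalized logistic regression; the penalty strength is chosen by
# cross-validation, i.e. the data decide how much regularization to apply.
clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000).fit(X, y)
print("chosen inverse penalty strength C:", clf.C_[0])
```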

On the other hand, it may well be that more people consider themselves users of “machine learning” than “statistical prediction,” so maybe I’m the one who should switch. As long as these researchers are using good methods, it’s not so important if we have similar methods under different names that could also solve these problems. They’re the ones who fit a model to this problem, and they deserve the credit for it.

No big deal either way as long as (a) we’re clear on what we’re assuming, what our algorithms are doing, and what data we’re using; and (b) we remember to adjust for bias and variance of measurements, nonrepresentative samples, selection bias, and all the other things we worry about when using data on a sample to draw inference about a population.

Chris Moore, Guy Molyneux, Etan Green, and David Daniels on Bayesian umpires

Kevin Lewis points us to a paper by Etan Green and David Daniels, who conclude that “decisions of [baseball] umpires reflect an accurate, probabilistic, and state-specific understanding of their rational expectations—as well as an ability to integrate those prior beliefs in a manner that approximates Bayes rule.”

This is similar to what was found in an earlier empirical article by Guy Molyneux, and this theoretical treatment of the idea by Chris Moore a few years earlier.

I don’t have anything to add here, except to suggest that these people can now all credit each others’ work in this area when going forward.

P.S. There seems to be some confusion. When I said that these people can all credit each others’ work, I didn’t mean to imply that there had been no references so far. In particular, Green and Daniels in their paper do cite Molyneux already.

Fake polls. Not new.

Mark Palko points me to this article by Harry Enten about a possibly nonexistent poll that was promoted by an organization or group or website called Delphi Analytica. Enten conjectures that the reported data were not fabricated, but that they don’t come from a serious poll either; rather, they appear to be some raw, undigested output from a Google poll.

This sort of thing is not new. Here’s an example I wrote about in 2008, a possibly nonexistent poll that was promoted by a consulting company called Prince & Associates and which got blurbed in the Wall Street Journal (yup, John Yoo’s newspaper). The data from that poll may be real, or maybe not, but no evidence was ever provided that the sample, if it existed, was representative of the claimed population in any way.

Eternal vigilance is the price of journalism.

P.S. Enten includes the following chart:

I disagree with this chart for two reasons. First, just cos a poll is real and it comes from a respected pollster, it doesn’t mean we should take it seriously. Remember Gallup in 2012? Second, what’s on the bottom right of the chart? Are there really any “respected pollsters” that do “fake polls”?

“From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotter’s Club on the grounds that Jonathan Coe was just making it all up.”

From last year:

One could make the argument that power pose is innocuous, maybe beneficial in that it is a way of encouraging people to take charge of their lives. And this may be so. Even if power pose itself is meaningless, the larger “power pose” story could be a plus. Of course, if power pose is just an inspirational story to empower people, it doesn’t have to be true, or replicable, or scientifically valid, or whatever. From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotter’s Club on the grounds that Jonathan Coe was just making it all up. I guess I’d prefer, if business school professors want to tell inspirational stories without any scientific basis, that they label them more clearly as parables, rather than dragging the scientific field of psychology into it.

Same story with pizzagate and all the rest: Let’s just go straight to the inspirational business book and the TV appearances. Cut out the middleman of the research studies, the experiments on college students or restaurant diners or whoever, the hormone measurements, the counts of partially-eaten carrots, the miscalculated t-scores, the conveniently-rounded p-values, the referee reports, the publication in PPNAS etc., the publicity, the failed replications, the post hoc explanations, the tone police on twitter, etc. Just start with the idea and jump to the book, the NPR interview, and the Ted talk. It’ll save us all a lot of trouble.

Stan Weekly Roundup, 25 August 2017

This week, the entire Columbia portion of the Stan team is out of the office and we didn’t have an in-person/online meeting this Thursday. Mitzi and I are on vacation, and everyone else is either teaching, TA-ing, or attending the Stan course. Luckily for this report, there’s been some great activity out of the meeting even if I don’t have a report of what everyone around Columbia has been up to. If a picture’s really worth a thousand words, this is the longest report yet.

  • Ari Hartikainen has produced some absolutely beautiful parallel coordinate plots of HMC divergences* for multiple parameters. The divergent transitions are shown in green and the lines connect a single draw. The top plot is unnormalized, whereas the bottom scales all parameters to a [0, 1] range.


    Ari's divergence plot

    You can follow the ongoing discussion on the forum thread. There are some further plots for larger models and some comparisons with the pairs plots that Michael Betancourt has been recommending for the same purpose (the problem with pairs is that it’s very very slow, at least in RStan, because it has to draw quadratically many plots).

  • Sebastian Weber has a complete working prototype of the MPI (multi-core parallelization) in place and has some beautiful results to report. The first graph is the speedup he achieved on a 20-core server (all in one box with shared memory):


    Sebastian's MPI speedup plot

    The second graph shows what happens when the problem size grows (those bottom numbers on the x-axis are the number of ODE systems being solved, whereas the top number remains the number of cores used).


    Sebastian's weak scaling plot

    As with Ari’s plots, you can follow the ongoing discussion on the forum thread. And if you know something about MPI, you can even help out. Sebastian’s been asking if anyone who knows MPI would like to check his work—he’s learning it as he goes (and doing a bang-up job of it, I might add!).

 

These lists are incomplete

After doing a handful of these reports, I’m sorry to say you’re only seeing a very biased selection of activity around Stan. For the full story, I’d encourage you to jump onto our forums or GitHub (warning: very high traffic, even if you focus).


 * Divergences in Stan arise when the Hamiltonian, which should be conserved across a trajectory, diverges—it’s basically a numerical simulation problem—if we could perfectly follow the Hamiltonian through complex geometries, there wouldn’t be any divergences. This is a great diagnostic mechanism to signal something’s going wrong and resulting estimates might be biased. It may seem to make HMC more fragile, but the problem is that Gibbs and Metropolis will fail silently in a lot of these situations (though BUGS will often help you out of numerical issues by crashing).

Nice interface, poor content

Jim Windle writes:

This might interest you if you haven’t seen it, and I don’t think you’ve blogged about it. I’ve only checked out a bit of the content but it seems a pretty good explanation of basic statistical concepts using some nice graphics.

My reply: Nice interface, but their 3 topics of Statistical Inference are Confidence Intervals, p-Values, and Hypothesis Testing.

Or, as I would put it: No, No, and No.

Maybe someone can work with these people, replacing the content but keeping the interface. Interface and content are both important—neither alone can do the job—so I hope someone will be able to get something useful out of the work that’s been put into the project.

Sucker MC’s keep falling for patterns in noise

Mike Spagat writes:

Apologies if forty people just sent this to you but maybe it’s obscure enough that I’m the first.

It’s a news article by Irina Ivanova entitled, “‘Very unattractive’ workers can out-earn pretty people, study finds.”

Spagat continues:

You may be able to recognize a pattern here:

Tiny, noisy sample

Surprise result

Journal bites (seems like an obscure journal although I’m not sure)

Press release

News story (although at least they get a comment from someone who knows what he’s doing).

My reply: Good that they devoted three paragraphs to the criticism. And good that the article appeared in an obscure journal rather than Science, Nature, or PPNAS. But it’s too bad that CBS Moneywatch ran the story in the first place. And what a horrible headline.