
Stan Weekly Roundup, 23 June 2017

Lots of activity this week, as usual.

* Lots of people got involved in pushing Stan 2.16 and interfaces out the door; Sean Talts got the math library, Stan library (that’s the language, inference algorithms, and interface infrastructure), and CmdStan out, while Allen Riddell got PyStan 2.16 out and Ben Goodrich and Jonah Gabry are tackling RStan 2.16

* Stan 2.16 is the last series of releases that will not require C++11; let the coding fun begin!

* Ari Hartikainen (of Aalto University) joined the Stan dev team—he’s working with Allen Riddell on PyStan, where judging from the pull request traffic, he put in a lot of work on the 2.16 release. Welcome!

* Imad Ali’s working on adding more cool features to RStanArm, including time series and spatial models; yesterday he and Mitzi were scheming to get intrinsic conditional autoregressive models in, and I heard all those time series names flying around (like ARIMA)

* Michael Betancourt rearranged the Stan web site with some input from me and Andrew; Michael added more descriptive text and Sean Talts managed to get the redirects in so all of our links aren’t broken; let us know what you think

* Markus Ojala of Smartly wrote a case study on their blog, Tutorial: How We Productized Bayesian Revenue Estimation with Stan

* Mitzi Morris got in the pull request for adding compound assignment and arithmetic; this adds statements such as n += 1.

* Lots of chatter about characterization tests and a pull request from Daniel Lee to update some of our existing performance tests

* Roger Grosse from U. Toronto visited to tell us about his 2016 NIPS paper with Siddharth Ancha and Daniel Roy on testing MCMC using bidirectional Monte Carlo sampling; we talked about how he modified Stan’s sampler to do annealed importance sampling

* GPU integration continues apace

* I got to listen in on Michael Betancourt and Maggie Lieu of the European Space Institute spend a couple days hashing out astrophysics models; Maggie would really like us to add integrals.

* Speaking of integration, Marco Inacio has updated his pull request; Michael’s worried there may be numerical instabilities, because trying to calculate arbitrary bounded integrals is not so easy in a lot of cases
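The worry is easy to demonstrate with a toy example (this is just a hypothetical fixed-grid rule in Python, not whatever quadrature the pull request actually uses): a perfectly well-behaved but sharply peaked integrand can be invisible to any grid coarser than the peak, so a naive quadrature confidently returns the wrong answer.

```python
import math

def trapezoid(f, a, b, n):
    """Fixed-grid trapezoidal rule on [a, b] with n intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# A sharply peaked integrand: a normal density with scale 1e-4, centered
# between grid points. Its integral over [0, 1] is essentially 1.
sigma = 1e-4
def spike(x):
    z = (x - 0.505) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

coarse = trapezoid(spike, 0.0, 1.0, 100)      # grid spacing 0.01 >> sigma: misses the peak entirely
fine = trapezoid(spike, 0.0, 1.0, 1_000_000)  # grid spacing 1e-6 << sigma: recovers the mass
# coarse is ~0, fine is ~1
```

Adaptive quadrature helps, but for arbitrary user-supplied integrands no fixed strategy is safe, which is presumably the source of Michael’s concern.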

* Andrew continues to lobby for being able to write priors directly into parameter declarations; for example, here’s what a hierarchical prior for beta might look like:

parameters {
  real mu ~ normal(0, 2);
  real sigma ~ student_t(4, 0, 2);
  vector[N] beta ~ normal(mu, sigma);
}

* I got the go-ahead on adding foreach loops; Mitzi Morris will probably be coding them. We’re talking about something like:

real ys[N];
for (y in ys)
  target += log_mix(lambda, normal_lpdf(y | mu[1], sigma[1]),
                            normal_lpdf(y | mu[2], sigma[2]));
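For readers who haven’t run into log_mix: it computes the log of a two-component mixture density entirely on the log scale, which avoids underflow when the component densities are tiny. Here’s a minimal Python sketch of what each iteration of the loop above adds to the target (normal_lpdf and log_mix are reimplemented here purely for illustration):

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_mix(theta, lp1, lp2):
    """log(theta * exp(lp1) + (1 - theta) * exp(lp2)), computed stably."""
    return log_sum_exp(math.log(theta) + lp1, math.log1p(-theta) + lp2)

def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# One observation's contribution to the target for a two-component mixture
lam = 0.3
lp = log_mix(lam, normal_lpdf(1.0, 0.0, 1.0), normal_lpdf(1.0, 5.0, 2.0))
```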

* Kalman filter case study by Jouni Helske was discussed on Discourse

* Rob Trangucci rewrote the Gaussian processes chapter of the Stan manual; I’m to blame for the first version, writing it as I was learning GPs. For some reason, it’s not up on the web page doc yet.

* This is a very ad hoc list. I’m sure I missed lots of good stuff, so feel free to either send updates to me directly for next week’s letter or add things to comments. This project’s now way too big for me to track all the activity!

Best correction ever: “Unfortunately, the correct values are impossible to establish, since the raw data could not be retrieved.”

Commenter Erik Arnesen points to this:

Several errors and omissions occurred in the reporting of research and data in our paper: “How Descriptive Food Names Bias Sensory Perceptions in Restaurants,” Food Quality and Preference (2005) . . .

The dog ate my data. Damn gremlins. I hate when that happens.

As the saying goes, “Each year we publish 20+ new ideas in academic journals, and we appear in media around the world.” In all seriousness, the problem is not that they publish their ideas, the problem is that they are “changing or omitting data or results such that the research is not accurately represented in the research record.” And of course it’s not just a problem with Mr. Pizzagate or Mr. Gremlins or Mr. Evilicious or Mr. Politically Incorrect Sex Ratios: it’s all sorts of researchers who (a) don’t report what they actually did, and (b) refuse to reconsider their flimsy hypotheses in light of new theory or evidence.

Question about the secret weapon

Micah Wright writes:

I first encountered your explanation of secret weapon plots while I was browsing your blog in grad school, and later in your 2007 book with Jennifer Hill. I found them immediately compelling and intuitive, but I have been met with a lot of confusion and some skepticism when I’ve tried to use them. I’m uncertain as to whether it’s me that’s confused, or whether my audience doesn’t get it. I should note that my formal statistical training is somewhat limited—while I was able to take a couple of stats courses during my masters, I’ve had to learn quite a bit on the side, which makes me skeptical as to whether or not I actually understand what I’m doing.

My main question is this: when using the secret weapon, does it make sense to subset the data across any arbitrary variable of interest, as long as you want to see if the effects of other variables vary across its range? My specific case concerns tree growth (ring widths). I’m interested to see how the effect of competition (crowding and other indices) on growth varies at different temperatures, and if these patterns change in different locations (there are two locations). To do this, I subset the growth data in two steps: first by location, then by each degree of temperature, which I rounded to the nearest integer. I then ran the same linear model on each subset. The model had growth as the response, and competition variables as predictors, which were standardized. I’ve attached the resulting figure [see above], which plots the change in effect for each predictor over the range of temperature.

My reply: I like these graphs! In future you might try a 6 x K grid, where K is the number of different things you’re plotting. That is, right now you’re wasting one of your directions because your 2 x 3 grid doesn’t mean anything. These plots are fine, but if you have more information for each of these predictors, you can consider plotting the existing information as six little graphs stacked vertically and then you’ll have room for additional columns. In addition, you should make the tick marks much smaller, put the labels closer to the axes, and reduce the number of axis labels, especially on the vertical axes. For example, (0.0, 0.3, 0.6, 0.9) can be replaced by labels at 0, 0.5, 1.

Regarding the larger issue of, what is the secret weapon, as always I see it as an approximation to a full model that bridges the different analyses. It’s a sort of nonparametric analysis. You should be able to get better estimates by using some modeling, but a lot of that smoothing can be done visually anyway, so the secret weapon gets you most of the way there, and in my view it’s much much better than the usual alternative of fitting a single model to all the data without letting all the coefficients vary.
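To make the recipe concrete, here is a hypothetical sketch in Python (simulated data, not Micah’s tree rings): subset the data by the variable of interest, fit the identical simple model in each subset, and collect the estimates to plot against the subsetting variable.

```python
import random

random.seed(1)

def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Simulated data in which the effect of "competition" on growth
# genuinely varies with temperature
data = []
for temp in range(10, 26):           # temperature bins, degrees
    true_slope = -1.0 + 0.04 * temp  # effect weakens as it warms
    for _ in range(50):
        comp = random.gauss(0, 1)
        growth = true_slope * comp + random.gauss(0, 0.5)
        data.append((temp, comp, growth))

# The "secret weapon": fit the same model separately in each subset,
# then plot the estimates side by side (here we just collect them)
estimates = {}
for temp in range(10, 26):
    sub = [(c, g) for t, c, g in data if t == temp]
    estimates[temp] = ols_slope([c for c, _ in sub], [g for _, g in sub])
```

Plotting estimates (with their standard errors) against temperature is exactly the kind of figure Micah describes; the pattern in the estimates is the visual, nonparametric version of an interaction term.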

“Developers Who Use Spaces Make More Money Than Those Who Use Tabs”

Rudy Malka writes:

I think you’ll enjoy this nice piece of pop regression by David Robinson: developers who use spaces make more money than those who use tabs. I’d like to know your opinion about it.

At the above link, Robinson discusses a survey that allows him to compare salaries of software developers who use tabs to those who use spaces. The key graph is above. Robinson found similar results after breaking down the data by country, job title, or computer language used, and it also showed up in a linear regression controlling in a simple way for a bunch of factors.

As Robinson put it in terms reminiscent of our Why Ask Why? paper:

This is certainly a surprising result, one that I didn’t expect to find when I started exploring the data. . . . I tried controlling for many other confounding factors within the survey data beyond those mentioned here, but it was difficult to make the effect shrink and basically impossible to make it disappear.

Speaking with the benefit of hindsight—that is, seeing Robinson’s results and assuming they are a correct representation of real survey data—it all makes sense to me. Tabs seem so amateurish, I much prefer spaces—2 spaces, not 4, please!!!—so from that perspective it makes sense to me that the kind of programmers who use tabs tend to be programmers with poor taste and thus, on average, of lower quality.

I just want to say one thing. Robinson writes, “Correlation is not causation, and we can never be sure that we’ve controlled for all the confounding factors present in a dataset.” But this isn’t quite the point. Or, to put it another way, I think he has the right instinct here but isn’t quite presenting the issue precisely. To see why, suppose the survey had only 2 questions: How much money do you make? and Do you use spaces or tabs? And suppose we had no other information on the respondents. And, for that matter, suppose there was no nonresponse and that we had a simple random sample of all programmers from some specified set of countries. In that case, we’d know for sure that there are no other confounding factors in the dataset, as the dataset is nothing but those two columns of numbers. But we’d still be able to come up with a zillion potential explanations.

To put it another way, the descriptive comparison is interesting in its own right, and we just should be careful about misusing causal language. Instead of saying, “using spaces instead of tabs leads to an 8.6% higher salary,” we could say, “comparing two otherwise similar programmers, the one who uses spaces has, on average, an 8.6% higher salary than the one who uses tabs.” That’s a bit of a mouthful—but such a mouthful is necessary to accurately describe the comparison that’s being made.

Time-sharing Experiments for the Social Sciences

Jamie Druckman writes:

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak® Panel (see http:/ for more information). In an effort to enable younger scholars to field larger-scale studies than what TESS normally conducts, we are pleased to announce a Special Competition for Young Investigators. While anyone can submit at any time through TESS’s regular proposal mechanism, this Special Competition is limited to graduate students and individuals who are no more than 3 years post-PhD. Winning projects will be allowed to be fielded at up to twice the budget of a regular TESS study. For more specifics on the special competition, see:  We will begin accepting proposals for the Special Competition on August 1, 2017, and the deadline is October 1, 2017.  Full details about the competition are available at   This page includes information about what is required of proposals and how to submit, and should be reviewed by anyone entering the competition.

After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.

Someone pointed me to this post by “Neuroskeptic”:

A new paper in the prestigious journal PNAS contains a rather glaring blooper. . . . right there in the abstract, which states that “three neuropeptides (β-endorphin, oxytocin, and dopamine) play particularly important roles” in human sociality. But dopamine is not a neuropeptide. Neither are serotonin or testosterone, but throughout the paper, Pearce et al. refer to dopamine, serotonin and testosterone as ‘neuropeptides’. That’s just wrong. A neuropeptide is a peptide active in the brain, and a peptide in turn is the term for a molecule composed of a short chain of amino acids. Neuropeptides include oxytocin, vasopressin, and endorphins – which do feature in the paper. But dopamine and serotonin aren’t peptides, they’re monoamines, and testosterone isn’t either, it’s a steroid. This isn’t a matter of opinion, it’s basic chemistry.

The error isn’t just an isolated typo: ‘neuropeptide’ occurs 27 times in the paper, while the correct terms for the non-peptides are never used.

Neuroskeptic speculates on how this error got in:

It’s a simple mistake; presumably whoever wrote the paper saw oxytocin and vasopressin referred to as “neuropeptides” and thought that the term was a generic one meaning “signalling molecule.” That kind of mistake could happen to anyone, so we shouldn’t be too harsh on the authors . . .

The authors of the paper work in a psychology department, so I guess they’re rusty on their organic chemistry.

Fair enough; I haven’t completed a chemistry class since 11th grade, and I didn’t know what a peptide is, either. Then again, I’m not writing articles on peptides for the National Academy of Sciences.

But how did this get through the review process? Let’s take a look at the published article:

Ahhhh, now I understand. The editor is Susan Fiske, notorious as the person who opened the gates of PPNAS for the articles on himmicanes, air rage, and ages ending in 9. I wonder who were the reviewers of this new paper. Nobody who knows what a peptide is, I guess. Or maybe they just read it very quickly, flipped through to the graphs and the conclusions, and didn’t read a lot of the words.

Did you catch that? Neuroskeptic refers to “the prestigious journal PNAS.” That’s PPNAS for short. This is fine, I guess. Maybe the science is ok. Based on a quick scan of the paper, I don’t think we should take a lot of the specific claims seriously, as they seem to be based on the difference between “significant” and “non-significant.”

In particular, I’m not quite sure what their support is for the statement from the abstract that “each neuropeptide is quite specific in its domain of influence.” They’re rejecting various null hypotheses, but I don’t know that this supports their substantive claims in the way that they’re saying.

I might be missing something here—I might be missing a lot—but in any case there seem to be some quality control problems at PPNAS. This should be no surprise: PPNAS is a huge journal, publishing over 3000 papers each year.

On their website they say, “PNAS publishes only the highest quality scientific research,” but this statement is simply false. I can’t really comment on this particular paper—it doesn’t seem like “the highest quality scientific research” to me, but, again, maybe I’m missing something big here. But I can assure you that the papers on himmicanes, air rage, and ages ending in 9 are not “the highest quality scientific research.” They’re not high quality research at all! What they are, is low-quality research that happens to be high-quality clickbait.

OK, let’s be fair. This is not a problem unique to PPNAS. The Lancet publishes crap papers, Psychological Science publishes crap papers, even JASA and APSR have their share of duds. Statistical Science, to its eternal shame, published that Bible Code paper in 1994. That’s fine, it’s how the system operates. Editors are only human.

But, really, do we have to make statements that we know are false? Platitudes are fine but let’s avoid intentional untruths.

So, instead of “PNAS publishes only the highest quality scientific research,” how about this: “PNAS aims to publish only the highest quality scientific research.” That’s fair, no?

P.S. Here’s a fun little graphics project: Redo Figure 1 as a lineplot. You’ll be able to show a lot more comparisons much more directly using lines rather than bars. The current grid of barplots is not the worst thing in the world—it’s much better than a table—but it could be much improved.

P.P.S. Just to be clear: (a) I don’t know anything about peptides so I’m offering no independent judgment of the paper in question; (b) whatever the quality is of this particular paper, does not affect my larger point that PPNAS publishes some really bad papers and so they should change their slogan to something more accurate.

P.P.P.S. The relevant Pubpeer page pointed to the following correction note that was posted on the PPNAS site after I wrote the above post but before it was posted:

The authors wish to note, “We used the term ‘neuropeptide’ in referring to the set of diverse neurochemicals that we examined in this study, some of which are not peptides; dopamine and serotonin are neurotransmitters and should be listed as such, and testosterone should be listed as a steroid. Our usage arose from our primary focus on the neuropeptides endorphin and oxytocin. Notwithstanding the biochemical differences between these neurochemicals, we note that these terminological issues have no implications for the significance of the findings reported in this paper.”

On deck through the rest of the year (and a few to begin 2018)

Here they are. I love seeing all the titles lined up in one place; it’s like a big beautiful poem about statistics:

  • After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.
  • “Developers Who Use Spaces Make More Money Than Those Who Use Tabs”
  • Question about the secret weapon
  • Incentives Matter (Congress and Wall Street edition)
  • Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.
  • Problems with the jargon “statistically significant” and “clinically significant”
  • Capitalist science: The solution to the replication crisis?
  • Bayesian, but not Bayesian enough
  • Let’s stop talking about published research findings being true or false
  • Plan 9 from PPNAS
  • No, I’m not blocking you or deleting your comments!
  • “Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.”
  • “The Null Hypothesis Screening Fallacy”?
  • What is a pull request?
  • Turks need money after expensive weddings
  • Statisticians and economists agree: We should learn from data by “generating and revising models, hypotheses, and data analyzed in response to surprising findings.”
  • My unpublished papers
  • Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories
  • Night Hawk
  • Why they aren’t behavioral economists: Three sociologists give their take on “mental accounting”
  • Further criticism of social scientists and journalists jumping to conclusions based on mortality trends
  • Daryl Bem and Arthur Conan Doyle
  • Classical statisticians as Unitarians
  • Slaying Song
  • What is “overfitting,” exactly?
  • Graphs as comparisons: A case study
  • Should we continue not to trust the Turk? Another reminder of the importance of measurement
  • “The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll
  • His concern is that the authors don’t control for the position of games within a season.
  • How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?
  • “Bayes factor”: where the term came from, and some references to why I generally hate it
  • A stunned Dyson
  • Applying human factors research to statistical graphics
  • Recently in the sister blog
  • Adding a predictor can increase the residual variance!
  • Died in the Wool
  • “Statistics textbooks (including mine) are part of the problem, I think, in that we just set out ‘theta’ as a parameter to be estimated, without much reflection on the meaning of ‘theta’ in the real world.”
  • An improved ending for The Martian
  • Delegate at Large
  • Iceland education gene trend kangaroo
  • Reproducing biological research is harder than you’d think
  • The fractal zealots
  • Giving feedback indirectly by invoking a hypothetical reviewer
  • It’s hard to know what to say about an observational comparison that doesn’t control for key differences between treatment and control groups, chili pepper edition
  • PPNAS again: If it hadn’t been for the jet lag, would Junior have banged out 756 HRs in his career?
  • Look. At. The. Data. (Hollywood action movies example)
  • “This finding did not reach statistical sig­nificance, but it indicates a 94.6% prob­ability that statins were responsible for the symptoms.”
  • Wolfram on Golomb
  • Irwin Shaw, John Updike, and Donald Trump
  • What explains my lack of openness toward this research claim? Maybe my cortex is just too damn thick and wrinkled
  • I love when I get these emails!
  • Consider seniority of authors when criticizing published work?
  • Does declawing cause harm?
  • Bird fight! (Kroodsma vs. Podos)
  • The Westlake Review
  • “Social Media and Fake News in the 2016 Election”
  • Also holding back progress are those who make mistakes and then label correct arguments as “nonsensical.”
  • Just google “Despite limited statistical power”
  • It is somewhat paradoxical that good stories tend to be anomalous, given that when it comes to statistical data, we generally want what is typical, not what is surprising. Our resolution of this paradox is . . .
  • “Babbage was out to show that not only was the system closed, with a small group controlling access to the purse strings and the same individuals being selected over and again for the few scientific honours or paid positions that existed, but also that one of the chief beneficiaries . . . was undeserving.”
  • Irish immigrants in the Civil War
  • Mixture models in Stan: you can use log_mix()
  • Don’t always give ’em what they want: Practicing scientists want certainty, but I don’t want to offer it to them!
  • Cumulative residual plots seem like they could be useful
  • Sucker MC’s keep falling for patterns in noise
  • Nice interface, poor content
  • “From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotter’s Club on the grounds that Jonathan Coe was just making it all up.”
  • Chris Moore, Guy Molyneux, Etan Green, and David Daniels on Bayesian umpires
  • Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice
  • “Mainstream medicine has its own share of unnecessary and unhelpful treatments”
  • What are best practices for observational studies?
  • The Groseclose endgame: Getting from here to there.
  • Causal identification + observational study + multilevel model
  • All cause and breast cancer specific mortality, by assignment to mammography or control
  • Iterative importance sampling
  • Rosenbaum (1999): Choice as an Alternative to Control in Observational Studies
  • Gigo update (“electoral integrity project”)
  • How to design and conduct a subgroup analysis?
  • Local data, centralized data analysis, and local decision making
  • Too much backscratching and happy talk: Junk science gets to share in the reputation of respected universities
  • Selection bias in the reporting of shaky research: An example
  • Self-study resources for Bayes and Stan?
  • Looking for the bottom line
  • “How conditioning on post-treatment variables can ruin your experiment and what to do about it”
  • Trial by combat, law school style
  • Causal inference using data from a non-representative sample
  • Type M errors studied in the wild
  • Type M errors in the wild—really the wild!
  • Where does the discussion go?
  • Maybe this paper is a parody, maybe it’s a semibluff
  • As if the 2010s never happened
  • Using black-box machine learning predictions as inputs to a Bayesian analysis
  • It’s not enough to be a good person and to be conscientious. You also need good measurement. Cargo-cult science done very conscientiously doesn’t become good science, it just falls apart from its own contradictions.
  • Air rage update
  • Getting the right uncertainties when fitting multilevel models
  • Chess records page
  • Weisburd’s paradox in criminology: it can be explained using type M errors
  • “Cheerleading with an agenda: how the press covers science”
  • Automated Inference on Criminality Using High-tech GIGO Analysis
  • Some ideas on using virtual reality for data visualization: I don’t really agree with the details here but it’s all worth discussing
  • Contribute to this pubpeer discussion!
  • For mortality rate junkies
  • The “fish MRI” of international relations studies.
  • “5 minutes? Really?”
  • 2 quick calls
  • Should we worry about rigged priors? A long discussion.
  • I’m not on twitter
  • I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief
  • “Why bioRxiv can’t be the Central Service”
  • Sudden Money
  • The house is stronger than the foundations
  • Please contribute to this list of the top 10 do’s and don’ts for doing better science
  • Partial pooling with informative priors on the hierarchical variance parameters: The next frontier in multilevel modeling
  • Does racquetball save lives?
  • When do we want evidence-based change? Not “after peer review”
  • “I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.”
  • “Bayesian evidence synthesis”
  • Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??”
  • Beyond forking paths: using multilevel modeling to figure out what can be learned from this survey experiment
  • From perpetual motion machines to embodied cognition: The boundaries of pseudoscience are being pushed back into the trivial.
  • Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference
  • “La critique est la vie de la science”: I kinda get annoyed when people set themselves up as the voice of reason but don’t ever get around to explaining what’s the unreasonable thing they dislike.
  • How to discuss your research findings without getting into “hypothesis testing”?
  • Does traffic congestion make men beat up their wives?
  • The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting
  • I think it’s great to have your work criticized by strangers online.
  • In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried.
  • If you want to know about basketball, who ya gonna trust, the Irene Blecker Rosenfeld Professor of Psychology at Cornell University and author of “The Wisest One in the Room: How You Can Benefit from Social Psychology’s Most Powerful Insights,” . . . or that poseur Phil Jackson??
  • Quick Money
  • An alternative to the superplot
  • Where the money from Wiley Interdisciplinary Reviews went . . .
  • Retract or correct, don’t delete or throw into the memory hole
  • Using Mister P to get population estimates from respondent driven sampling
  • “Americans Greatly Overestimate Percent Gay, Lesbian in U.S.”
  • “It all reads like a classic case of faulty reasoning where the reasoner confuses the desirability of an outcome with the likelihood of that outcome.”
  • Pseudoscience and the left/right whiplash
  • The time reversal heuristic (priming and voting edition)
  • The Night Riders
  • Why you can’t simply estimate the hot hand using regression
  • Stan to improve rice yields
  • When people proudly take ridiculous positions
  • “A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state,” and other notes on “Whither Science?” by Danko Antolovic
  • Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger.
  • What should this student do? His bosses want him to p-hack and they don’t even know it!
  • Fitting multilevel models when predictors and group effects correlate
  • I hate that “Iron Law” thing
  • High five: “Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking.”
  • “What is a sandpit?”
  • No no no no no on “The oldest human lived to 122. Why no person will likely break her record.”
  • Tips when conveying your research to policymakers and the news media
  • Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.
  • Spatial models for demographic trends?
  • A pivotal episode in the unfolding of the replication crisis
  • We start by talking reproducible research, then we drift to a discussion of voter turnout
  • Wine + Stan + Climate change = ?
  • Stan is a probabilistic programming language
  • Using output from a fitted machine learning algorithm as a predictor in a statistical model
  • Poisoning the well with a within-person design? What’s the risk?
  • “Dear Professor Gelman, I thought you would be interested in these awful graphs I found in the paper today.”
  • I know less about this topic than I do about Freud.
  • Driving a stake through that ages-ending-in-9 paper
  • What’s the point of a robustness check?
  • Oooh, I hate all talk of false positive, false negative, false discovery, etc.
  • Trouble Ahead
  • A new definition of the nerd?
  • Orphan drugs and forking paths: I’d prefer a multilevel model but to be honest I’ve never fit such a model for this sort of problem
  • Popular expert explains why communists can’t win chess championships!
  • The four missing books of Lawrence Otis Graham
  • “There was this prevalent, incestuous, backslapping research culture. The idea that their work should be criticized at all was anathema to them. Let alone that some punk should do it.”
  • Loss of confidence
  • “How to Assess Internet Cures Without Falling for Dangerous Pseudoscience”
  • Ed Jaynes outta control!
  • A reporter sent me a Jama paper and asked me what I thought . . .
  • Workflow, baby, workflow
  • Two steps forward, one step back
  • Yes, you can do statistical inference from nonrandom samples. Which is a good thing, considering that nonrandom samples are pretty much all we’ve got.
  • The Night Riders
  • The piranha problem in social psychology / behavioral economics: The “take a pill” model of science eats itself
  • Ready Money
  • Stranger than fiction
  • “The Billy Beane of murder”?
  • Red doc, blue doc, rich doc, rich doc
  • Working Class Postdoc
  • “We wanted to reanalyze the dataset of Nelson et al. However, when we asked them for the data, they said they would only share the data if we were willing to include them as coauthors.”
  • UNDER EMBARGO: the world’s most unexciting research finding
  • Setting up a prior distribution in an experimental analysis
  • Walk a Crooked Mile
  • It’s . . . spam-tastic!
  • The failure of null hypothesis significance testing when studying incremental changes, and what to do about it
  • Robust standard errors aren’t for me
  • Stupid-ass statisticians don’t know what a goddam confidence interval is
  • Forking paths plus lack of theory = No reason to believe any of this.
  • Turn your scatterplots into elegant apparel and accessories!
  • Your (Canadian) tax dollars at work

And a few to begin 2018:

  • The Ponzi threshold and the Armstrong principle
  • I’m with Errol: On flypaper, photography, science, and storytelling
  • Politically extreme yet vital to the nation
  • How does probabilistic computation differ in physics and statistics?
  • “Each computer run would last 1,000-2,000 hours, and, because we didn’t really trust a program that ran so long, we ran it twice, and it verified that the results matched. I’m not sure I ever was present when a run finished.”


We’ll also intersperse topical items as appropriate.

Not everyone’s aware of falsificationist Bayes

Stephen Martin writes:

Daniel Lakens recently blogged about philosophies of science and how they relate to statistical philosophies. I thought it may be of interest to you. In particular, this statement:

From a scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories. Belief can be used to decide which questions to examine, but it can not be used to determine the truth-likeness of a theory.

My response, TLDR:
1) frequentism and NP require more subjectivity than they’re given credit for (assumptions, belief in perfectly known sampling distributions, Beta [and thus type-2 error ‘control’] requires a subjective estimate of the alternative effect size)

2) Bayesianism isn’t inherently more subjective, it just acknowledges uncertainty given the data [still data-driven!]

3) Popper probably wouldn’t like the NHST ritual, given that we use p-values to support hypotheses, not to refute an accepted hypothesis [the nil-hypothesis of 0 is not an accepted hypothesis in most cases]

4) Refuting falsifiable hypotheses can be done in Bayes, which is largely what Popper cared about anyway

5) Even in an NP or LRT framework, people don’t generally care about EXACT statistical hypotheses, they care about substantive hypotheses, which map to a range of statistical/estimate hypotheses, and YET people don’t test the /range/, they test point values; Bayes can easily ‘test’ the hypothesized range.

My [Martin’s] full response is here.

I agree with everything that Martin writes above. And, for that matter, I agree with most of what Lakens wrote too. The starting point for all of this is my 2011 article, Induction and deduction in Bayesian data analysis. Also relevant are my 2013 article with Shalizi, Philosophy and the practice of Bayesian statistics and our response to the ensuing discussion, and my recent article with Hennig, Beyond subjective and objective in statistics.

Lakens covers the same Popper-Lakatos ground that we do, although he (Lakens) doesn’t appear to be aware of the falsificationist view of Bayesian data analysis, as expressed in chapter 6 of BDA and the articles listed above. Lakens is stuck in a traditionalist view of Bayesian inference as based on subjectivity and belief, rather than what I consider a more modern approach of conditionality, where Bayesian inference works out the implications of a statistical model or system of assumptions, the better to allow us to reveal problems that motivate improvements and occasional wholesale replacements of our models.
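To make this falsificationist workflow concrete: fit a model, simulate replicated data from the fitted model, and check whether the replications can reproduce salient features of the observed data. Here is a minimal posterior predictive check in Python, a toy example of my own (not from any of the papers above), with data deliberately generated from a heavier-tailed distribution than the fitted normal model assumes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed data secretly come from a heavy-tailed t distribution...
y = rng.standard_t(df=2, size=200)

# ...but we fit a normal model. With a flat prior on the mean and the
# scale plugged in, the posterior for the mean is approximately
# normal(y_bar, s / sqrt(n)).
n, y_bar, s = len(y), y.mean(), y.std(ddof=1)

# Posterior predictive check: simulate replicated datasets and compare
# a test statistic (here, max |y|) to the observed value.
T_obs = np.max(np.abs(y))
T_rep = []
for _ in range(1000):
    mu = rng.normal(y_bar, s / np.sqrt(n))   # draw from the posterior
    y_rep = rng.normal(mu, s, size=n)        # replicate the data
    T_rep.append(np.max(np.abs(y_rep)))

# Posterior predictive p-value: a value near 0 or 1 signals that the
# model fails to reproduce this feature of the data.
ppp = np.mean(np.array(T_rep) >= T_obs)
print(ppp)
```

Because the normal model cannot produce the extreme observations the t distribution generates, the replicated maxima fall short of the observed one and the check flags the model, which is exactly the "reveal problems that motivate improvements" step.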

Overall I’m glad Lakens wrote his post because he’s reminding people of important issues that are not handled well in traditional frequentist or subjective-Bayes approaches, and I’m glad that Martin filled in some of the gaps. The audience for all of this seems to be psychology researchers, so let me re-emphasize a point I’ve made many times, the distinction between statistical models and scientific models. A statistical model is necessarily specific, and we should avoid the all-too-common mistake of rejecting some uninteresting statistical model and taking this as evidence for a preferred scientific model. That way lies madness.

Breaking the dataset into little pieces and putting it back together again

Alex Konkel writes:

I was a little surprised that your blog post with the three smaller studies versus one larger study question received so many comments, and also that so many people seemed to come down on the side of three smaller studies. I understand that Stephen’s framing led to some confusion as well as practical concerns, but I thought the intent of the question was pretty straightforward.

At the risk of beating a dead horse, I wanted to try asking the question a different way: if you conducted a study (or your readers, if you want to put this on the blog), would you ever divide up the data into smaller chunks to see if a particular result appeared in each subset? Ignoring cases where you might want to examine qualitatively different groups, of course; would you ever try to make fundamentally homogeneous/equivalent subsets? Would you ever advise that someone else do so?

For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions. The blog comments, however, seem to come down on the side of this being a good practice. Are you (or your readers) going to start doing this?

My reply:

From a Bayesian standpoint, the result is the same, whether you consider all the data at once, or stir in the data one-third at a time. The problem would come if you make intermediate decisions that involve throwing away information, for example if you take parts of the data and just describe them as statistically significant or not.
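For a conjugate model this equivalence is easy to verify numerically. A sketch in Python for the normal-normal case with known unit variance (the 450-observation setup comes from Konkel's question; the prior here is my own arbitrary choice):

```python
import numpy as np

def update(m, v, y):
    """Conjugate update for y_i ~ normal(theta, 1), theta ~ normal(m, v)."""
    prec = 1 / v + len(y)                 # posterior precision
    mean = (m / v + y.sum()) / prec       # posterior mean
    return mean, 1 / prec

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=450)

# All the data at once:
m_all, v_all = update(0.0, 10.0**2, y)

# One third at a time, feeding each posterior back in as the next prior:
m, v = 0.0, 10.0**2
for chunk in np.split(y, 3):
    m, v = update(m, v, chunk)

print(np.isclose(m, m_all), np.isclose(v, v_all))  # True True
```

The posterior is identical either way, which is the point: chunking the data buys you nothing from a Bayesian standpoint unless you discard information between chunks.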

Don’t say “improper prior.” Say “non-generative model.”

[cat picture]

In Bayesian Data Analysis, we write, “In general, we call a prior density p(θ) proper if it does not depend on data and integrates to 1.” This was a step forward from the usual understanding, which is that a prior density is improper if it has an infinite integral.

But I’m not so thrilled with the term “proper” because it has different meanings for different people.

Then the other day I heard Dan Simpson and Mike Betancourt talking about “non-generative models,” and I thought, Yes! this is the perfect term! First, it’s unambiguous: a non-generative model is a model for which it is not possible to generate data. Second, it makes use of the existing term, “generative model,” hence no need to define a new concept of “proper prior.” Third, it’s a statement about the model as a whole, not just the prior.

I’ll explore the idea of a generative or non-generative model through some examples:

Classical iid model, y_i ~ normal(theta, 1), for i=1,…,n. This is not generative because there’s no rule for generating theta.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with uniform prior density, p(theta) proportional to 1 on the real line. This is not generative because you can’t draw theta from a uniform on the real line.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with data-based prior, theta ~ normal(y_bar, 10), where y_bar is the sample mean of y_1,…,y_n. This model is not generative because to generate theta, you need to know y, but you can’t generate y until you know theta.

In contrast, consider a Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with non-data-based prior, theta ~ normal(0, 10). This is generative: you draw theta from the prior, then draw y given theta.
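In code, “generative” just means you can run the model forward as a simulation. Here is a sketch of that last model in Python (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Forward-simulate the generative model:
    theta ~ normal(0, 10), then y_i ~ normal(theta, 1)."""
    theta = rng.normal(0.0, 10.0)
    y = rng.normal(theta, 1.0, size=n)
    return theta, y

theta, y = simulate(100)
print(abs(y.mean() - theta))  # small: the sample mean tracks theta

# The non-generative variants break at the first line of simulate():
# there is no way to draw theta uniformly from the whole real line, and
# the data-based prior normal(y_bar, 10) needs y before theta exists.
```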

Some subtleties do arise. For example, we’re implicitly conditioning on n. For the model to be fully generative, we’d need a prior distribution for n as well.

Similarly, for a regression model to be fully generative, you need a prior distribution on x.

Non-generative models have their uses; we should just recognize when we’re using them. I think the traditional classification of priors, labeling them as improper if they have an infinite integral, does not capture the key aspects of the problem.

P.S. Also relevant is this comment, regarding some discussion of models for the n:

As in many problems, I think we get some clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis. So if we have a regression with outcomes y, predictors x, and sample size n, we can think of this as one of a larger class of problems, in which case it can make sense to think of n and x as varying across problems.

The issue is not so much whether n is a “random variable” in any particular study (although I will say that, in real studies, n typically is not precisely defined ahead of time, what with difficulties of recruitment, nonresponse, dropout, etc.) but rather that n can vary across the reference class of problems for which a model will be fit.

Where’d the $2500 come from?

Brad Buchsbaum writes:

Sometimes I read the New York Times “Well” articles on science and health. It’s a mixed bag, sometimes it’s quite good and sometimes not. I came across this yesterday:

What’s the Value of Exercise? $2,500

For people still struggling to make time for exercise, a new study offers a strong incentive: You’ll save $2,500 a year.

The savings, a result of reduced medical costs, don’t require much effort to accrue — just 30 minutes of walking five days a week is enough.

The findings come from an analysis of 26,239 men and women, published today in the Journal of the American Heart Association. . . .

I [Buchsbaum] thought: I wonder where the number came from? So I tracked down the paper referred to in the article (which was unhelpfully not linked or properly named).

I was horrified to find that the $2500 figure appears to be nowhere in the paper (see table 2). Moreover, the closest number I could find ($1900) was based on a regression model without covarying age, sex, ethnicity, income, or anything else. Of course older people exercise less and spend more on healthcare!

I sent the following email (see below) to the NYTimes author, but she has not responded.

At any rate, I thought this example of very high-profile science-blogging to be particularly egregious, so I thought I’d bring it to your attention.

The research article is Economic Impact of Moderate-Vigorous Physical Activity Among Those With and Without Established Cardiovascular Disease: 2012 Medical Expenditure Panel Survey, by Javier Valero-Elizondo, Joseph Salami, Chukwuemeka Osondu, Oluseye Ogunmoroti, Alejandro Arrieta, Erica Spatz, Adnan Younus, Jamal Rana, Salim Virani, Ron Blankstein, Michael Blaha, Emir Veledar, and Khurram Nasir.

And here’s Buchsbaum’s letter to Gretchen Reynolds, the author of that news article:

I very much enjoy your health articles for the New York Times. Sometimes I try and find the paper and examine the data, just for my own benefit.

After perusing the paper, I was not quite sure where the $2500 figure came from. In table 2 (see attached paper), the unadjusted expenditures are reported over all subjects.

non-optimal PA: $5397, optimal PA: $3443 for a difference of $1900.

This is close to $2500 but your number is higher.

However, remember, this is an *unadjusted model*. It does not account for age, sex, family income, race/ethnicity, insurance type, geographical location or comorbidity.

In other words, it’s a virtually useless model.

Let’s look at Model 3, which does account for the above factors.

non-optimal PA: $4867, optimal PA: $4153 for a difference of $714

So $714 is closer to the mark.

BUT, this includes ALL subjects, including those with cardiovascular disease (CVD).

If you look at people without CVD then the estimates depend on the cardiovascular risk profile (CRF). If you have an average or optimal profile then the difference is around $430 or $493. If you have a “poor” profile, then the difference is around $1060 (although the 95% confidence intervals overlapped, meaning the effect was not reliable).

What is my conclusion?

I’m afraid the title of your article is misleading since it is larger (by $600) than the $1900 estimate based on the meaningless unadjusted model! Even if the title was “What’s the Value of Exercise? $700”, it would still be misleading, because it implicitly assumes a causal relationship between exercise and expenditure.

Remember also that the adjusted variables are only the measures the authors happened to record. There are dozens of potentially other mediating variables which are related to both physical exercise and health expenditures. Including these other adjusting factors might further reduce the estimates.

Best Regards,

It’s just a news article so some oversimplification is perhaps unavoidable. But I do wonder where the $2500 number came from. I’m guessing it’s from some press release but I don’t know.

Also, I’m surprised the reporter didn’t respond to the email. But maybe New York Times reporters get too many emails to respond to, or even read. I should also emphasize that I did not read that news article or the scientific paper in detail, so I’m not endorsing (or disagreeing with) Buchsbaum’s claim. Here I’m just interested in the general challenge of tracking down numbers like that $2500 that have no apparent source.

Stan Weekly Roundup, 16 June 2017

We’re going to be providing weekly updates for what’s going on behind the scenes with Stan. Of course, it’s not really behind the scenes, because the relevant discussions are at

  • stan-dev GitHub organization: this is the home of all of our source repos; design discussions are on the Stan Wiki

  • Stan Discourse Groups: this is the home of our user and developer lists (they’re all open); feel free to join the discussion—we try to be friendly and helpful in our responses, and there is a lot of statistical and computational expertise in the wings from our users, who are increasingly joining the discussion. By the way, thanks for that—it takes a huge load off us to get great answers from users to other user questions. We’re up to about 15 active discussion threads a day (active topics in the last 24 hours include AR(K) models, web site reorganization, ragged arrays, order statistic priors, new R packages built on top of Stan, docker images for Stan on AWS, and many more!)

OK, let’s get started with the weekly review, though this is a special summer double issue, just like the New Yorker.

Your news here: If you have any Stan news you’d like to share, please let me know at (we’ll probably get a more standardized way to do this in the future).

New web site: Michael Betancourt redesigned the Stan web site; hopefully this will be easier to use. We’re no longer trying to track the literature. If you want to see the Stan literature in progress, do a search for “Stan Development Team” or “” on Google Scholar; we can’t keep up! Do let us know either in an issue on GitHub for the web site or in the user group on Discourse if you have comments or suggestions.

New user and developer lists: We’ve shuttered our Google group and moved to Discourse for both our user and developer lists (they’re consolidated now in categories on one list). It’s easy to sign up with GitHub or Google IDs and much easier to search and use online.
See Stan Discourse Groups and, for the old discussions, Stan’s shuttered Google group for users and Stan’s shuttered Google group for developers. We’re not removing any of the old content, but we are prohibiting new posts.

GPU support: Rok Cesnovar and Steve Bronder have been getting GPU support working for linear algebra operations. They’re starting with Cholesky decomposition because it’s a bottleneck for Gaussian process (GP) models and because it has the pleasant property of being quadratic in data and cubic in computation.
See math pull request 529
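For readers wondering why Cholesky is the target: fitting a GP means repeatedly factoring an n × n covariance matrix, which carries O(n²) data but costs O(n³) flops, exactly the ratio that keeps a GPU busy. A minimal CPU sketch of the operation being offloaded (a NumPy stand-in of my own, not the actual GPU code):

```python
import numpy as np

# Build an n x n squared-exponential GP covariance matrix (O(n^2) data)
# and Cholesky-factor it (O(n^3) flops).
n = 200
x = np.linspace(0.0, 10.0, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)  # RBF kernel
K += 1e-8 * np.eye(n)                            # jitter for stability

L = np.linalg.cholesky(K)                        # the cubic bottleneck

# L is lower-triangular and L @ L.T reconstructs K.
print(np.allclose(L @ L.T, K))
```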

Distributed computing support: Sebastian Weber is leading the charge into distributed computing using the MPI framework (multi-core or multi-machine) by essentially coding up map-reduce for derivatives inside of Stan. Together with GPU support, distributed computing of derivatives will give us a TensorFlow-like flexibility to accelerate computations. Sebastian’s also looking into parallelizing the internals of the Boost and CVODES ordinary differential equation (ODE) solvers using OpenCL.
See math issue 101 and math issue 551.

Logging framework: Daniel Lee added a logging framework to Stan to allow finer-grained control of Stan’s informational and error output.

Operands and partials: Sean Talts finished the refactor of our underlying operands and partials data structure, which makes it much simpler to write custom derivative functions

See pull request 547

Autodiff testing framework: Bob Carpenter finished the first use case for a generalized autodiff tester to test all of our higher-order autodiff thoroughly
See math pull request 562

C++11: We’re all working toward the 2.16 release, which will be our last release before we open the gates of C++11 (and some of C++14). This is going to make our code a whole lot easier to write and maintain, and will open up awesome possibilities like having closures to define lambdas within the Stan language, as well as consolidating many of our uses of Boost into the standard library.

Append arrays: Ben Bales added signatures for append_array, to work like our appends for vectors and matrices.
See pull request 554 and pull request 550

ODE system size checks: Sebastian Weber pushed a bug fix that cleans up ODE system size checks to avoid seg faults at run time.
See pull request 559

RNG consistency in transformed data: A while ago we replaced the generated-quantities-only nature of _rng functions by allowing them in transformed data (so you can fit fake data generated wholly within Stan or represent posterior uncertainty of some other process, allowing “cut”-like models to be formulated as a two-stage process); Mitzi Morris just cleaned these up so we use the same RNG seed for all chains so that we can perform convergence monitoring; multiple replications would then be done by running the whole multi-chain process multiple times.
See Stan pull request 2313

NSF Grant: CI-SUSTAIN: Stan for the Long Run: We (Bob Carpenter, Andrew Gelman, Michael Betancourt) were just awarded an NSF grant for Stan sustainability. This was a follow-on from the first Compute Resource Initiative (CRI) grant we got after building the system. Yea! This adds roughly a year of funding for the team at Columbia University. Our goal is to put in governance processes for sustaining the project as well as shore up all of our unit tests and documentation.

Hiring: We hired two full-time Stan staff at Columbia: Sean Talts joins as a developer and Breck Baldwin as business manager for the project. Sean had already been working as a contractor for us, hence all the pull requests. (Pro tip: The best way to get a foot in the door for an open-source project is to submit a useful pull request.)

SPEED: Parallelizing Stan using the Message Passing Interface (MPI)

Sebastian Weber writes:

Bayesian inference has to overcome tough computational challenges and thanks to Stan we now have a scalable MCMC sampler available. For a Stan model running NUTS, the computational cost is dominated by gradient calculations of the model log-density as a function of the parameters. While NUTS is scalable to huge parameter spaces, this scalability becomes more of a theoretical one as the computational cost explodes. Models which involve ordinary differential equations (ODE) are such an example, where the runtimes can be of the order of days.

The obvious speedup when using Stan is to run multiple chains at the same time on different computer cores. However, this cannot reduce the total runtime per chain, which requires within-chain parallelization.

Hence, a viable approach is to parallelize the gradient calculation within a chain. As many Bayesian models involve hierarchical structure over groupings, we can often calculate contributions to the log-likelihood separately for each of these groups.

Therefore, the concept of an embarrassingly parallel program can be applied in this setting, i.e. one can calculate these independent work chunks on separate CPU cores and then collect the results.
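The shape of the computation is a map-reduce: ship each group's parameters out, get back that group's log-likelihood contribution and its gradient, and sum the pieces. Here is a serial Python sketch of the idea (a toy model of my own; Stan's actual implementation does the map step over MPI processes with autodiff gradients):

```python
import numpy as np

# Toy hierarchical setting: J groups, each contributing an independent
# normal log-likelihood term for a shared parameter theta.
rng = np.random.default_rng(0)
J = 8
groups = [rng.normal(1.5, 1.0, size=100) for _ in range(J)]

def group_loglik_and_grad(y, theta):
    """One group's log-likelihood contribution (up to a constant) and
    its gradient in theta, for y_i ~ normal(theta, 1)."""
    resid = y - theta
    return -0.5 * np.sum(resid**2), np.sum(resid)

theta = 1.0

# "Map" step: each call is independent, so in the MPI setting each
# group could be shipped to a different core; here we map serially.
results = [group_loglik_and_grad(y, theta) for y in groups]

# "Reduce" step: the root process just sums the pieces.
loglik = sum(ll for ll, _ in results)
grad = sum(g for _, g in results)

# Same answer as doing everything on one core:
y_all = np.concatenate(groups)
ll_all, g_all = group_loglik_and_grad(y_all, theta)
print(np.isclose(loglik, ll_all), np.isclose(grad, g_all))  # True True
```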

For reasons implied by Stan’s internals (the gradient calculation must not run in a threaded program) we are restricted in applicable techniques. One possibility is the Message Passing Interface (MPI), which spawns multiple CPU cores by firing off independent processes. A root process sends packets of work (sets of parameters) to the child nodes, which do the work and then send back the results (function return values and the gradients). A first toy example (3 ODEs, 7 parameters) shows dramatic speedups: a single-core runtime of 5.2 hours drops to just 17 minutes on a single 20-core machine (an 18x speedup). MPI also scales across machines, and when throwing 40 cores at the problem we are down to 10 minutes, which is “only” a 31x speedup (see the above plot).

Of course, the MPI approach works best on clusters with many CPU cores. Overall, this is fantastic news for big models, as it opens the door to scaling out large problems onto clusters, which are available nowadays in many research facilities.

The source code for this prototype is in our GitHub repository. This code should be regarded as working research code, and we are currently working on bringing this feature into the main Stan distribution.

Wow. This is a big deal. There are lots of problems where this method will be useful.

P.S. What’s with the weird y-axis labels on that graph? I think it would work better to just go 1, 2, 4, 8, 16, 32 on both axes. I like the wall-time markings on the line, though; that helped me follow what was going on.

Pizzagate gets even more ridiculous: “Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature . . . in the later study they again found the exact opposite, but did not comment on the discrepancy.”


Several months ago, Jordan Anaya, Tim van der Zee, and Nick Brown reported that they’d uncovered 150 errors in 4 papers published by Brian Wansink, a Cornell University business school professor who describes himself as a “world-renowned eating behavior expert for over 25 years.”

150 errors is pretty bad! I make mistakes myself and some of them get published, but one could easily go through an entire career publishing fewer than 150 mistakes. So many in just four papers is kind of amazing.

After the Anaya et al. paper came out, people dug into other papers of Wansink and his collaborators and found lots more errors.

Wansink later released a press release pointing to a website which he said contained data and code from the 4 published papers.

In that press release he described his lab as doing “great work,” which seems kinda weird to me, given that their published papers are of such low quality. Usually we would think that if a lab does great work, this would show up in its publications, but this did not seem to have happened in this case.

In particular, even if the papers in question had no data-reporting errors at all, we would have no reason to believe any of the scientific claims that were made therein, as these claims were based on p-values computed from comparisons selected from uncontrolled and abundant researcher degrees of freedom. These papers are exercises in noise mining, not “great work” at all, not even good work, not even acceptable work.

The new paper

As noted above, Wansink shared a document that he said contained the data from those studies. In a new paper, Anaya, van der Zee, and Brown analyzed this new dataset. They report some mistakes they (Anaya et al.) had made in their earlier paper, and many places where Wansink’s papers misreported his data and data collection protocols.

Some examples:

All four articles claim the study was conducted over a 2-week period, however the senior author’s blog post described the study as taking one month (Wansink, 2016), the senior author told Retraction Watch it was a two-month study (McCook, 2017b), a news article indicated the study was at least 3 weeks long (Lazarz, 2007), and the data release states the study took place from October 18 to December 8, 2007 (Wansink and Payne, 2007). Why the articles claimed the study only took two weeks when all the other reports indicate otherwise is a mystery.

Furthermore, articles 1, 2, and 4 all claim that the study took place in spring. For the Northern Hemisphere spring is defined as the months March, April, and May. However, the news report was dated November 18, 2007, and the data release states the study took place between October and December.

And this:

Article 1 states that the diners were asked to estimate how much they ate, while Article 3 states that the amount of pizza and salad eaten was unobtrusively observed, going so far as to say that appropriate subtractions were made for uneaten pizza and salad. Adding to the confusion Article 2 states:
“Unfortunately, given the field setting, we were not able to accurately measure consumption of non-pizza food items.”

In Article 3 the tables included data for salad consumed, so this statement was clearly inaccurate.

And this:

Perhaps the most important question is why did this study take place? In the blog post the senior author did mention having a “Plan A” (Wansink, 2016), and in a Retraction Watch interview revealed that the original hypothesis was that people would eat more pizza if they paid more (McCook, 2017a). The origin of this “hypothesis” is likely a previous study from this lab, at a different pizza buffet, with nearly identical study design (Just and Wansink, 2011). In that study they found diners who paid more ate significantly more pizza, but the released data set for the present study actually suggests the opposite, that diners who paid less ate more. So was the goal of this study to replicate their earlier findings? And if so, did they find it concerning that not only did they not replicate their earlier result, but found the exact opposite? Did they not think this was worth reporting?
Another similarity between the two pizza studies is the focus on taste of the pizza. Article 1 specifically states:

“Our reading of the literature leads us to hypothesize that one would rate pizza from an $8 pizza buffet as tasting better than the same pizza at a $4 buffet.”

Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature, because in that paper they found ratings for overall taste, taste of first slice, and taste of last slice to all be higher in the lower price group, albeit with different levels of significance (Just and Wansink, 2011). However, in the later study they again found the exact opposite, but did not comment on the discrepancy.

Anaya et al. summarize:

Of course, there is a parsimonious explanation for these contradictory results in two apparently similar studies, namely that one or both sets of results are the consequence of modeling noise. Given the poor quality of the released data from the more recent articles . . . it seems quite likely that this is the correct explanation for the second set of studies, at least.

And this:

No good theory, no good data, no good statistics, no problem. Again, see here for the full story.

Not the worst of it

And, remember, those 4 pizzagate papers are not the worst things Wansink has published. They’re only the first four articles that anyone bothered to examine carefully enough to see all the data problems.

There was this example dug up by Nick Brown:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data; they seem consistent with someone making up numbers, not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.
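The check itself is simple to reproduce on any table of reported statistics: strip the final digit of each printed value and tally. A sketch (with made-up numbers, not Wansink's actual values):

```python
from collections import Counter

def last_digit_counts(reported):
    """Tally the final digit of each statistic, passed as strings
    exactly as printed in the table (e.g. "2.31" -> "1")."""
    return Counter(s[-1] for s in reported if s[-1].isdigit())

# Hypothetical table values for illustration:
table = ["2.31", "4.07", "1.98", "3.12", "2.25", "0.44", "5.01", "3.33"]
counts = last_digit_counts(table)
print(counts)

# In genuinely measured data the trailing digits should be roughly
# uniform over 0-9; a shortage of 0s and 5s is the fingerprint of
# someone avoiding round-looking numbers.
```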

And this discovery by Tim van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.
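One of the simplest tools behind this kind of detective work is the GRIM test of Brown and Heathers: for integer-valued responses such as a Likert scale, a reported mean times n must round-trip to an achievable integer total. A sketch (the numbers below are hypothetical, not taken from the papers above):

```python
def grim_consistent(mean, n, decimals=2):
    """GRIM test: can a mean reported to `decimals` places arise
    from n integer-valued responses (e.g. a Likert scale)?"""
    total = round(mean * n)                    # nearest integer total
    achievable = round(total / n, decimals)    # mean that total implies
    return achievable == round(mean, decimals)

# A mean of 3.48 is possible with n = 25 (87/25 = 3.48)...
print(grim_consistent(3.48, 25))   # True
# ...but not with n = 17: no integer total of 17 responses gives 3.48.
print(grim_consistent(3.48, 17))   # False
```

Reconstructing the full response distribution, as van der Zee did, takes more constraints (standard deviations, scale endpoints), but the flavor is the same: treat the reported summaries as equations and check whether any whole-number data could satisfy them.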


Sığırcı, Ö, Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7,

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 whilst also having experienced repeated heavy combat. Almost no soldiers could have had any other age than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.

There’s lots more at the link.

From the NIH guidelines on research misconduct:

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

Ride a Crooked Mile

Joachim Krueger writes:

As many of us rely (in part) on p values when trying to make sense of the data, I am sending a link to a paper Patrick Heck and I published in Frontiers in Psychology. The goal of this work is not to fan the flames of the already overheated debate, but to provide some estimates of what p can and cannot do. Statistical inference will always require experience and good judgment regardless of which school of thought (Bayesian, frequentist, or other) we are leaning toward.

I have three reactions.

1. I don’t think there’s any “overheated debate” about the p-value; it’s a method that has big problems and is part of the larger problem that is null hypothesis significance testing (see my article, The problems with p-values are not just with p-values); also p-values are widely misunderstood (see also here).

From a Bayesian point of view, p-values are most cleanly interpreted in the context of uniform prior distributions—but the setting of uniform priors, where there’s nothing special about zero, is the scenario where p-values are generally irrelevant.

So I don’t have much use for p-values. They still get used in practice—a lot—so there’s room for lots more articles explaining them to users, but I’m kinda tired of the topic.

2. I disagree with Krueger’s statement that “statistical inference will always require experience and good judgment.” For better or worse, lots of statistical inference is done using default methods by people with poor judgment and little if any relevant experience. Too bad, maybe, but that’s how it is.

Does statistical inference require experience and good judgment? No more than driving a car requires experience and good judgment. All you need is gas in the tank and the key in the ignition and you’re ready to go. The roads have all been paved and anyone can drive on them.

3. In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a p-value of 0.20 corresponds to a z-score of 1.28, and a p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
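The arithmetic in that paragraph is easy to reproduce; here is a sketch using the Python standard library (two-sided p-values assumed throughout):

```python
from statistics import NormalDist

def p_to_z(p):
    """z-score corresponding to a two-sided p-value."""
    return NormalDist().inv_cdf(1 - p / 2)

z20 = p_to_z(0.20)   # the "useless" p-value
z01 = p_to_z(0.01)   # the "strong evidence" p-value
print(round(z20, 2), round(z01, 2))   # 1.28 2.58

# The difference between the z-scores, and that difference divided by
# the standard error of a difference of two independent z-scores, sqrt(2):
diff = z01 - z20
print(round(diff, 2), round(diff / 2**0.5, 2))   # 1.29 0.92
```

So the gap between "useless" and "strong evidence" is itself less than one standard error from zero, which is Gelman and Stern's point.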

More generally I think that all the positive aspects of the p-value they discuss in their paper would be even more positive if researchers were to use the z-score and not ever bother with the misleading transformation into the so-called p-value. I’d much rather see people reporting z-scores of 1.5 or 2 or 2.5 than reporting p-values of 0.13, 0.05, and 0.01.
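The arithmetic above is easy to check. Here is a quick sketch using only Python’s standard library (two-sided tests assumed):

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+

nd = NormalDist()

def z_from_p(p):
    """Convert a two-sided p-value to the corresponding absolute z-score."""
    return nd.inv_cdf(1 - p / 2)

z_weak = z_from_p(0.20)    # about 1.28, the "useless" result
z_strong = z_from_p(0.01)  # about 2.58, the "strong evidence" result

# The difference between two independent z-scores has standard error sqrt(2),
# so the apparently huge gap between p=0.20 and p=0.01 is itself under 1 SE.
diff = z_strong - z_weak
print(diff / math.sqrt(2))  # roughly 0.9, nowhere near significant
```

This is the whole point in three lines: comparing the z-scores directly makes the instability of p-value comparisons obvious.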

Kaiser Fung’s data analysis bootcamp

Kaiser Fung announces a new educational venture he’s created: a bootcamp (a 12-week, full-time, in-person program) built from short courses, with the goal of getting people their first job in an analytics role for a business unit (not engineering or software development, so he is not competing directly with MS Data Science programs or data science bootcamps). The curriculum is deliberately designed to be broad but not deep.

I asked Kaiser if he had anything else he wanted to share, and he wrote:

I think our major differentiation from other bootcamps out there includes:

a. There are lots of jobs in these other business units outside engineering and software development. Hiring managers in marketing, operations, servicing, etc. are looking for the ability to interpret and reason with data, and use data to solve business problems. Our broad-based curriculum caters to this need.

b. I don’t believe that coding is the end-all of data science. Coding schools teach people how to code but knowing what to code is more important. Therefore, our curriculum covers R, Python, and machine learning but also statistical reasoning, survey design, Excel, intro to marketing, intro to finance, etc.

c. We provide quality through small class size, in-person instruction and instructors who are industry practitioners. The average instructor has 10 years of industry experience, and is in a director or higher level position. These instructors know what hiring managers want since they are hiring managers themselves.

d. We are building a diverse class. We take social scientists and designers as well as STEM people. We just require some exposure to programming concepts and data analysis, and a good college degree.

Statistical Challenges of Survey Sampling and Big Data (my remote talk in Bologna this Thurs, 15 June, 4:15pm)

Statistical Challenges of Survey Sampling and Big Data

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Big Data need Big Model. Big Data are typically convenience samples, not random samples; observational comparisons, not controlled experiments; available data, not measurements designed for a particular study. As a result, it is necessary to adjust to extrapolate from sample to population, to match treatment to control group, and to generalize from observations to underlying constructs of interest. Big Data + Big Model = expensive computation, especially given that we do not know the best model ahead of time and thus must typically fit many models to understand what can be learned from any given dataset. We discuss Bayesian methods for constructing, fitting, checking, and improving such models.

It’ll be at the 5th Italian Conference on Survey Methodology, at the Department of Statistical Sciences of the University of Bologna. A low-carbon remote talk.

Criminology corner: Type M error might explain Weisburd’s Paradox

[silly cartoon found by googling *cat burglar*]

Torbjørn Skardhamar, Mikko Aaltonen, and I wrote this article to appear in the Journal of Quantitative Criminology:

Simple calculations seem to show that larger studies should have higher statistical power, but empirical meta-analyses of published work in criminology have found zero or weak correlations between sample size and estimated statistical power. This is “Weisburd’s paradox” and has been attributed by Weisburd, Petrosino, and Mason (1993) to a difficulty in maintaining quality control as studies get larger, and attributed by Nelson, Wooditch, and Dario (2014) to a negative correlation between sample sizes and the underlying sizes of the effects being measured. We argue against the necessity of both these explanations, instead suggesting that the apparent Weisburd paradox might be explainable as an artifact of systematic overestimation inherent in post-hoc power calculations, a bias that is large with small N. Speaking more generally, we recommend abandoning the use of statistical power as a measure of the strength of a study, because implicit in the definition of power is the bad idea of statistical significance as a research goal.

I’d never heard of Weisburd’s paradox before writing this article. What happened is that the journal editors contacted me suggesting the topic; I read some of the literature and wrote my article; then some other journal editors didn’t think it was clear enough, so we found a couple of criminologists to coauthor the paper and add some context, eventually producing the final version linked here. I hope it’s helpful to researchers in that field and more generally. I expect that similar patterns hold with published data in other social science fields and in medical research too.
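The overestimation mechanism in the abstract is easy to demonstrate in simulation. A minimal sketch with my own illustrative numbers (not taken from the paper): assume a true effect of 0.2 standard deviations, condition on statistical significance the way publication does, and compare post-hoc power (computed from the estimated effect) with true power.

```python
import math
import random
from statistics import NormalDist

nd = NormalDist()

def power(effect, n):
    """True power of a two-sided z-test at alpha=0.05, known SD = 1."""
    se = 1 / math.sqrt(n)
    z = effect / se
    return 1 - nd.cdf(1.96 - z) + nd.cdf(-1.96 - z)

def mean_posthoc_power(true_effect, n, sims=20000, seed=1):
    """Average post-hoc power across simulated studies that reached significance.

    Each study estimates the effect with noise; post-hoc power plugs the
    (inflated, significance-filtered) estimate back into the power formula.
    """
    rng = random.Random(seed)
    se = 1 / math.sqrt(n)
    estimates = []
    for _ in range(sims):
        obs = rng.gauss(true_effect, se)
        if abs(obs / se) > 1.96:  # selection on statistical significance
            estimates.append(power(abs(obs), n))
    return sum(estimates) / len(estimates)
```

With these made-up numbers, true power is about 0.17 at n = 25 versus about 0.98 at n = 400, a large difference; but significance-filtered post-hoc power at n = 25 comes out above 0.5, so the apparent power-versus-sample-size relationship is badly flattened, which is the artifact the paper points to.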

PhD student fellowship opportunity! in Belgium! to work with us! on the multiverse and other projects on improving the reproducibility of psychological research!!!

[image of Jip and Janneke dancing with a cat]

Wolf Vanpaemel and Francis Tuerlinckx write:

We at the Quantitative Psychology and Individual Differences group, KU Leuven, Belgium, are looking for a PhD candidate. The goal of the PhD research is to develop and apply novel methodologies to increase the reproducibility of psychological science. More information can be found on the job website or by contacting us. The deadline for application is Monday, June 26, 2017.

One of the themes a successful candidate may work on is the further development of the multiverse. I expect to be an active collaborator in this work.

So please apply to this one. We’d like to get the best possible person to be working on this exciting project.

Why I’m not participating in the Transparent Psi Project

I received the following email from psychology researcher Zoltan Kekecs:

I would like to ask you to participate in the establishment of the expert consensus design of a large-scale, fully transparent replication of Bem’s (2011) ‘Feeling the future’ Experiment 1. Our initiative is called the ‘Transparent Psi Project’. Our aim is to develop a consensus design that is mutually acceptable to both psi-proponent and mainstream researchers, containing clear criteria for credibility.

I replied:

Thanks for the invitation. I am not so interested in this project because I think that all the preregistration in the world won’t solve the problem of small effect sizes and poor measurements. It is my impression from Bem’s work and others that the field of psi is plagued by noisy measurements and poorly specified theories. Sure, preregistration etc. would stop many of the problems–in particular, there’s no way that Bem would’ve seen 9 out of 9 statistically significant p-values, or whatever that was. But I can’t in good conscience recommend the spending of effort in this way. I think any serious work in this area would have to go beyond the phenomenological approach and perform more direct measurements, as for example here. I’ve not actually read the paper linked there so this may be a bad example, but the point is that one could possibly study such things scientifically with a physical model of the process. To just keep taking Bem-style measurements, though, I think that’s hopeless: it’s the kangaroo problem. Better to preregister than not, but better still not to waste time on this or similarly hopeless problems (studying sex ratios in samples of size 3000, estimating correlations of monthly cycle on political attitudes using between-person comparisons, power pose, etc.). I recognize that some of these ideas, ESP included, had some legitimate a priori plausibility, but, at this point, a Bem-style experiment seems like a shot in the dark. And, of course, even with preregistration, there’s a 5% chance you’ll see something statistically significant just by chance, leading to further confusion. In summary, preregistration and consensus help with the incentives, but all the incentives in the world are no substitute for good measurements. (See the discussion of “in many cases we are loath to recommend pre-registered replication” here.)

Kekecs wrote back:

Thank you for your feedback. We fully realize the problem posed by small effect sizes. However, this problem in itself can be solved simply by throwing a larger sample at it. In fact, based on our simulations, we plan to collect 14,000–60,000 data points (700–3,000 participants) using Bayesian analysis and optional stopping, aiming to reach a Bayes factor threshold of 60 or 1/60. Our simulations show that using these parameters we only have a p = 0.0004 false positive chance, so it is highly unlikely that we would accidentally generate more confusion in the field just by conducting the replication. On the contrary, by doing our study we will effectively more than double the total amount of data accumulated so far by Bem’s and other studies using this paradigm, which should help with clarity in the field by introducing good-quality, credible data.

You might be right, though, that the measurement itself is faulty, and that we cannot expect precognition to work in an environmentally invalid situation like this. But in reality we don’t have any information on how precognition should work if it really does exist, so I am not sure what would be a better way of measuring it than seeing how effective people are at predicting future events.

Our main goal here is not really to see whether precognition exists or not. The ultimate aim of our efforts is to do a proof-of-concept study where we will see whether it is possible to come to a consensus on criteria of acceptability and credibility in a field this divided, and to come up with ways in which we can negate all possibilities of questionable research practices. This approach can then be transferred to other fields as well.

I then responded:

I still think it’s hopeless. The problem (which I’ll say using generic units as I’m not familiar with the ESP experiment) is: suppose you have a huge sample size and can detect an effect of 0.003 (on some scale) with standard error 0.001. Statistically significant, preregistered, the whole deal. Fine. But then you could very well see an effect of -0.002 with different people, in a different setting. And -0.003 somewhere else. And 0.001 somewhere else. Etc. You’re talking about effects that are indistinguishable given various sources of leakage in the experiment.

I support your general goal but I recommend you choose a more promising topic than ESP or power pose or various other topics that get talked about so much.

Kekecs replied:

We are already committed to follow through with this particular setting. But I agree with you that our approach can be easily transferred to the research of other effects and we fully intend to do that.

If you put it that way, your question is all about construct validity: whether we can detect the effect that we really want to detect, or whether there are other confounds that bias the measurement. In this particular experimental setting, which is simple as stone (basically people are guessing about the outcomes of future coin flips), the types of bias that we can expect are more related to questionable research practices (QRPs) than anything else. The only way other types of bias, such as personal differences in ability (sampling bias), participant expectancy, demand characteristics, etc., can have an effect is if there is truly an anomalous effect. For example, if we detected an effect of 0.003 with 0.001 SE only because we accidentally sampled people with high psi abilities, our conclusion that there is a psi effect would still be true (although our effect size estimate would be slightly off).

That is why in this project we are focusing mainly on negating all possibilities of QRPs and on full transparency. I am not sure what other types of leakage we could have in this particular experiment if we addressed all possible QRPs. Would you care to elaborate?

I responded:

Just in answer to that last question: I’m not sure what other types of leakage might exist—it’s my impression that Bem’s experiments had various problems, so I guess it depends how exact a replication you’re talking about. My real point, though, is if we think ESP exists at all, then an effect that’s +0.003 on Monday and -0.002 on Tuesday and +0.001 on Wednesday probably isn’t so interesting. This becomes clearer if we move the domain away from possible null phenomena such as ESP or homeopathy, to things like social priming, which presumably has some effect, but which varies so much by person and by context to be generally unpredictable and indistinguishable from noise. I don’t think ESP is such a good model for psychology research because it’s one of the few things people study that really could be zero.

And then Kekecs closed out the discussion:

In response, I find doing this work in the field of ESP interesting exactly because the effect could potentially be zero. Positive findings have an overwhelming dominance in both the psi literature and the social sciences literature in general. In the case of most other social science research, it is a theoretical possibility (but unrealistic) that researchers just get lucky all the time and always ask the right questions, which is why they are so effective in finding positive effects. Obviously this cannot be true for the entirety of the literature, but for each topic studied individually it can be quite probable that there is an effect, if ever so small, which blurs the picture about publication bias and other types of bias in the literature. However, it may be that there is no ESP effect at all. In that case, we would have a field where the effect of bias in research can be studied in its purest form.

From another perspective, precognition in particular is a perfect research topic exactly because these designs by their nature are very well protected from the usual threats to internal validity, at least in the positive direction. It is hard to see what could make a person perform better at predicting the outcome of a state-of-the-art random number generator if there is no psi effect. Bias can always be introduced by different questionable research practices (QRPs), but if we are able to design a study completely immune to QRPs, there is no real possibility of bias toward a type I error. Of course, if the effect really exists, all the usual threats to validity can have an influence (for example, it is possible that people can get “psi fatigue” if they perform a lot of trials, or that events and contextual features, or even expectancy, can have an effect on performance), but we cannot make a type I error in that case, because the effect exists; we can only make errors in estimating the size of the effect, or a type II error.

So understanding what is underlying the dominance of positive effects in ESP research is very important. If there is no effect, psi literature can serve as a case study for bias in its purest form, which can help us understand it in other research fields. On the other hand, if we find an effect when all QRPs are controlled for, we may need to really rethink our current paradigm.

I continue to think that the study of ESP is irrelevant for psychology, both for substantive reasons—there is no serious underlying theory or clear evidence for ESP, it’s all just hope and intuition—and for methodological reasons, in that zero is a real possibility. In contrast, even silly topics such as power pose and embodied cognition seem to me to have some relevance to psychology and also involve the real challenge that there are no zeroes. Standing in an unusual position for two minutes will have some effect on your thinking and behavior; the debate is what are the consistent effects, if any. That’s my take, anyway; but I wanted to share Kekecs’s view too, given all the effort he’s putting into this project.