
“Americans Greatly Overestimate Percent Gay, Lesbian in U.S.”

This sort of thing is not new but it’s still amusing. From a Gallup report by Frank Newport:

The American public estimates on average that 23% of Americans are gay or lesbian, little changed from Americans’ 25% estimate in 2011, and only slightly higher than separate 2002 estimates of the gay and lesbian population. These estimates are many times higher than the 3.8% of the adult population who identified themselves as lesbian, gay, bisexual or transgender in Gallup Daily tracking in the first four months of this year.

Trend: Just your best guess, what percent of Americans today would you say are gay or lesbian?

Newport provides some context:

Part of the explanation for the inaccurate estimates of the gay and lesbian population rests with Americans’ general unfamiliarity with numbers and demography. Previous research has shown that Americans estimate that a third of the U.S. population is black, and believe almost three in 10 are Hispanic, more than twice what the actual percentages were as measured by the census at the time of the research. Americans with the highest levels of education make the lowest estimates of the gay and lesbian population, underscoring the assumption that part of the reason for the overestimate is a lack of exposure to demographic data.

But there’s a lot of innumeracy even among educated Americans:

Still, the average estimate among those with postgraduate education is 15%, four times the actual rate.

Newport summarizes:

The estimates of gay and lesbian percentages have been relatively stable compared with those measured in 2011 and 2002, even though attitudes about gays and lesbians have changed dramatically over that time.

Math class is tough.

Using Mister P to get population estimates from respondent driven sampling

From one of our exams:

A researcher at Columbia University’s School of Social Work wanted to estimate the prevalence of drug abuse problems among American Indians (Native Americans) living in New York City. From the Census, it was estimated that about 30,000 Indians live in the city, and the researcher had a budget to interview 400. She did not have a list of Indians in the city, and she obtained her sample as follows.

She started with a list of 300 members of a local American Indian community organization, and took a random sample of 100 from this list. She interviewed these 100 persons and asked each of these to give her the names of other Indians in the city whom they knew. She asked each respondent to characterize him/herself and also the people on the list on a 0-10 scale, where 10 is “strongly Indian-identified,” 5 is “moderately Indian-identified,” and 0 is “not at all Indian identified.” Most of the original 100 people sampled characterized themselves near 10 on the scale, which makes sense because they all belong to an Indian community organization. The researcher then took a random sample of 100 people from the combined lists of all the people referred to by the first group, and repeated this process. She repeated the process twice more to obtain 400 people in her sample.

Describe how you would use the data from these 400 people to estimate (and get a standard error for your estimate of) the prevalence of drug abuse problems among American Indians living in New York City. You must account for the bias and dependence of the nonrandom sampling method.

There are different ways to attack this problem but my preferred solution is to use Mister P:

1. Fit a regression model to estimate p(y|X)—in this case, y represents some measure of drug abuse problem at the individual level, and X includes demographic predictors and also a measure of Indian identification (necessary because the survey design oversamples people who are strongly Indian identified) and a measure of gregariousness (necessary because the referral design oversamples people with more friends and acquaintances);

2. Estimate the distribution of X in the population (in this case, all American Indian adults living in New York City); and

3. Take the estimates from step 1, and average these over the distribution in step 2, to estimate the distribution of y over the entire population or any subpopulations of interest.
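As a toy illustration of the three steps, here is the poststratification arithmetic in miniature. The cell structure, probabilities, and counts below are all hypothetical placeholders, not estimates from any real survey:

```python
# Step 1 (stand-in): suppose a fitted regression gives us
# P(drug abuse problem | cell), where cells cross Indian-identification
# level with gregariousness. These probabilities are invented for illustration.
cell_prob = {
    ("high_id", "many_ties"): 0.12,
    ("high_id", "few_ties"):  0.10,
    ("low_id",  "many_ties"): 0.07,
    ("low_id",  "few_ties"):  0.05,
}

# Step 2 (stand-in): estimated population counts per cell among all
# American Indian adults in NYC -- again purely illustrative numbers
# that sum to the 30,000 figure from the Census.
cell_count = {
    ("high_id", "many_ties"): 4000,
    ("high_id", "few_ties"):  6000,
    ("low_id",  "many_ties"): 8000,
    ("low_id",  "few_ties"): 12000,
}

# Step 3: poststratify, i.e. take the population-weighted average
# of the cell-level estimates.
total = sum(cell_count.values())
prevalence = sum(cell_prob[c] * cell_count[c] for c in cell_prob) / total
print(round(prevalence, 4))  # → 0.0747
```

In a real analysis, step 1’s probabilities would come from the fitted multilevel regression and step 2’s counts from a population model that accounts for the sampling bias, with the uncertainty in both propagated to the final estimate (for example, by repeating this weighted average over posterior draws to get a standard error).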

The hard part here is step 2, as I’m not aware of many published examples of such things. You have to build a model, and in that model you must account for the sampling bias. It can be done, though; indeed I’d like to do some examples of this to make these ideas more accessible to survey practitioners.

There’s some literature on this survey design—it’s called “respondent driven sampling”—but I don’t think the recommended analysis strategies are very good. MRP should be better, but, again, I should be able to say this with more confidence and authority once I’ve actually done such an analysis for this sort of survey. Right now, I’m just a big talker.


Kevin Lewis points to a research article by Lawton Swan, John Chambers, Martin Heesacker, and Sondre Nero, “How should we measure Americans’ perceptions of socio-economic mobility,” which reports effects of question wording on surveys on an important topic in economics. They replicated two studies:

Each (independent) research team had prompted similar groups of respondents to estimate the percentage of Americans born into the bottom of the income distribution who improved their socio-economic standing by adulthood, yet the two teams reached ostensibly irreconcilable conclusions: that Americans tend to underestimate (Chambers et al.) and overestimate (Davidai & Gilovich) the true rate of upward social mobility in the U.S.

There are a few challenges here, and I think the biggest is that the questions being asked of survey respondents are so abstract. We’re talking about people who might not be able to name their own representative in Congress—hey, actually I might not be able to do that particular task myself!—and who are misinformed about basic demographics, and then asking tricky questions about the distribution of income of children of parents in different income groups.

Consider the survey question pictured above, from Chambers, Swan & Heesacker (2015). First off, the diagram could be misleading in that the ladder kinda makes it look like they’re talking about people in the middle of the “Bottom 3rd” category, but they’re asking about the average for this group. Also they ask, “what percentage of them do you think stayed in the bottom third of the income distribution (i.e., lower class), like their parents,” but doesn’t that presuppose that the parents stayed in this category? Lots of grad students are in the bottom third of U.S. income, and some of them have kids, but I assume that most of these parents, as well as their kids, will end up in the middle or even upper third once they graduate and get jobs. It’s also not clear, when they ask about “the time those children have grown up to be young adults, in their mid-20’s,” are they talking about income terciles compared to the entire U.S., or just people in their 20s? Also is it really true that the upper third of income is “upper class”? I thought that in the U.S. context you had to do better than the top third to be upper class. Upper class people make more than $100,000 a year, right? And that’s something like the 90th percentile.

I’m not trying to be picky here, and I’m not trying to diss Swan et al. who are raising some important questions about survey data that regularly get reported uncritically in the news media (see, for example, here and here). I just think this whole way of getting at people’s understanding may be close to hopeless, as the questions are so ill-defined and the truth so hard to know. Attitudes on inequality, social mobility, redistribution, taxation, etc., are important, and maybe there’s a more direct way to study people’s thoughts in this area.

Contour as a verb

Our love is like the border between Greece and Albania – The Mountain Goats

(In which I am uncharacteristically brief)

Andrew’s answer to a recent post reminded me of one of my favourite questions: how do you visualise uncertainty in spatial maps?  An interesting subspecies of this question relates to exactly how you can plot a contour map for a spatial estimate.  The obvious idea is to take a point estimate (like your mean or median spatial field) and draw a contour map on that.

But this is problematic because it does not take into account the uncertainty in your estimate.  A contour on a map indicates a line that separates two levels of a field, but if you do not know the value of the field exactly, you cannot separate it precisely.  Bolin and Lindgren have constructed a neat method for dealing with this problem by having an intermediate area where you don’t know which side of the chosen level you are on.  This replaces thin contour lines with thick contour bands that better reflect our uncertainty.
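As a crude illustration of the idea (a toy pointwise classification, not Bolin and Lindgren’s actual method, which controls the joint probability over the whole excursion set), one can label each grid point by how sure the posterior is about which side of the level it lies on:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior: 500 draws of a 1-D "field" on 50 grid points,
# a sine curve plus noise standing in for real posterior samples.
x = np.linspace(0, 1, 50)
draws = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal((500, 50))

level = 0.0
p_above = (draws > level).mean(axis=0)   # pointwise P(field > level)

# Definite regions vs. the uncertain "contour band".
above = p_above > 0.95
below = p_above < 0.05
band = ~(above | below)                  # can't separate the level here

print(above.sum(), below.sum(), band.sum())
```

The `band` cells are exactly the region where a thin contour line would overstate what we know: near the level crossings the posterior genuinely cannot say which side of the level the field is on.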

Interestingly, using these contour bands requires us to reflect on just how certain our estimates are when selecting the number of contours we wish to plot (or else there would be nothing left in the space other than bands).

There is a broader principle reflected in Bolin and Lindgren’s work: when you are visualising multiple aspects of an uncertain quantity, you need to allow for an indeterminate region.  This is the same idea that is reflected in Andrew’s “thirds” rule.

David and Finn also wrote a very nice R package that implements their method for computing contours (as well as for computing joint “excursion regions”, i.e. areas in space where the random field simultaneously exceeds a fixed level with a given probability).

“Quality control” (rather than “hypothesis testing” or “inference” or “discovery”) as a better metaphor for the statistical processes of science

I’ve been thinking for a while that the default ways in which statisticians think about science—and in which scientists think about statistics—are seriously flawed, sometimes even crippling scientific inquiry in some subfields, in the way that bad philosophy can do.

Here’s what I think are some of the default modes of thought:

Hypothesis testing, in which the purpose of data collection and analysis is to rule out a null hypothesis (typically, zero effect and zero systematic error) that nobody believes in the first place;

Inference, which can work in the context of some well-defined problems (for example, studying trends in public opinion or estimating parameters within an agreed-upon model in pharmacology), but which doesn’t capture the idea of learning from the unexpected;

Discovery, which sounds great but which runs aground when thinking about science as a routine process: can every subfield of science really be having thousands of “discoveries” a year? Even to ask this question seems to cheapen the idea of discovery.

A more appropriate framework, I think, is quality control, an old idea in statistics (dating at least to the 1920s; maybe Steve Stigler can trace the idea back further), but a framework that, for whatever reason, doesn’t appear much in academic statistical writing or in textbooks outside of the subfield of industrial statistics or quality engineering. (For example, I don’t know that quality control has come up even once in my own articles and books on statistical methods and applications.)

Why does quality control have such a small place at the statistical table? That’s a topic for another day. Right now I want to draw the connections between quality control and scientific inquiry.

Consider some thread or sub-subfield of science, for example the incumbency advantage (to take a political science example) or embodied cognition (to take a much-discussed example from psychology). Different research groups will publish papers in an area, and each paper is presented as some mix of hypothesis testing, inference, and discovery, with the mix among the three having to do with some combination of researchers’ tastes, journal publication policies, and conventions within the field.

The “replication crisis” (which has been severe with embodied cognition, not so much with incumbency advantage, in part because to replicate an election study you have to wait a few years until sufficient new data have accumulated) can be summarized as:

– Hypotheses that seemed soundly rejected in published papers cannot be rejected in new, preregistered and purportedly high-power studies;

– Inferences from different published papers appear to be inconsistent with each other, casting doubt on the entire enterprise;

– Seeming discoveries do not appear in new data, and different published discoveries can even contradict each other.

In a “quality control” framework, we’d think of different studies in a sub-subfield as having many sources of variation. One of the key principles of quality control is to avoid getting faked out by variation—to avoid naive rules such as reward the winner and discard the loser—and instead to analyze and then work to reduce uncontrollable variation.

Applying the ideas of quality control to threads of scientific research, the goal would be to get better measurement, and stronger links between measurement and theory—rather than to give prominence to surprising results and to chase noise. From a quality control perspective, our current system of scientific publication and publicity is perverse: it yields misleading claims, is inefficient, and it rewards sloppy work.

The “rewards sloppy work” thing is clear from a simple decision analysis. Suppose you do a study of some effect theta, and your study’s estimate will be centered around theta but with some variance. A good study will have low variance, of course. A bad study will have high variance. But what are the rewards? What gets published is not theta but the estimate. The higher the estimate (or, more generally, the more dramatic the finding), the higher the reward! Of course if you have a noisy study with high variance, your theta estimate can also be low or even negative—but you don’t need to publish these results, instead you can look in your data for something else. The result is an incentive to have noise.
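A small simulation makes the incentive concrete. The effect size, noise levels, and publication threshold here are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.1            # true effect
n_studies = 10_000

# A careful lab runs low-variance studies; a sloppy lab runs noisy ones.
careful = theta + 0.05 * rng.standard_normal(n_studies)
noisy   = theta + 0.50 * rng.standard_normal(n_studies)

# Only "dramatic" estimates get written up.
threshold = 0.5
pub_careful = careful[careful > threshold]
pub_noisy   = noisy[noisy > threshold]

print(len(pub_careful), len(pub_noisy))
print(pub_noisy.mean() if len(pub_noisy) else float("nan"))
```

Under a selection rule like this, the low-variance lab almost never produces a publishable estimate, while the high-variance lab produces thousands, and the ones it publishes average several times the true effect. That is the incentive to have noise.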

The above decision analysis is unrealistically crude—for one thing, your measurements can’t be obviously bad or your paper probably won’t get published, and you are required to present some token such as a p-value to demonstrate that your findings are stable. Unfortunately those tokens can be too cheap to be informative, so a lot of effort has to be taken to make research projects look scientific.

But all this is operating under the paradigms of hypothesis testing, inference, and discovery, which I’ve argued are not good models for the scientific process.

Move now to quality control, where each paper is part of a process, and the existence of too much variation is a sign of trouble. In a quality-control framework, we’re not looking for so-called failed or successful replications; we’re looking at a sequence of published results—or, better still, a sequence of data—in context.

I was discussing some of this with Ron Kenett and he sent me two papers on quality control:

Joseph M. Juran, a Perspective on Past Contributions and Future Impact, by A. Blanton Godfrey and Ron Kenett

The Quality Trilogy: A Universal Approach to Managing for Quality, by Joseph Juran

I’ve not read these papers in detail but I suspect that a better understanding of these ideas could help us in all sorts of areas of statistics.

Advice for science writers!

I spoke today at a meeting of science journalists, in a session organized by Betsy Mason, also featuring Kristin Sainani, Christie Aschwanden, and Tom Siegfried.

My talk was on statistical paradoxes of science and science journalism, and I mentioned the Ted Talk paradox, Who watches the watchmen, the Eureka bias, the “What does not kill my statistical significance makes it stronger” fallacy, the unbiasedness fallacy, selection bias in what gets reported, the Australia hypothesis, and how we can do better.

Sainani gave some examples illustrating that journalists with no particular statistical or subject-matter expertise should be able to see through some of the claims made in published papers, where scientists misinterpret their own data or go far beyond what was implied by their data. Aschwanden and Siegfried talked about the confusions surrounding p-values and recommended that reporters pretty much forget about those magic numbers and instead focus on the substantive claims being made in any study.

After the session there was time for a few questions, and one person stood up and said that he worked for a university, that he wanted to avoid writing up too many stories that were wrong, but that he was too busy to do statistical investigations on his own. What should he do?

Mason replied that he should contact the authors of the studies and push them to explain their results without jargon, answering questions as necessary to make the studies clear. She said that if an author refuses to answer such questions, or seems to be deflecting rather than addressing criticism, then this itself is a bad sign.

I expressed agreement with Mason and said that, in my experience, university researchers are willing and eager to talk with reporters and public relations specialists, and we’ll explain our research at interminable length to anyone who will listen.

So I recommended to the reporter that, when he sees a report of an interesting study, that he contact the authors and push them with hard questions: not just “Can you elaborate on the importance of this result?” but also “How might this result be criticized?”, “What’s the shakiest thing you’re claiming?”, “Who are the people who won’t be convinced by this paper?”, etc. Ask these questions in a polite way, not in any attempt to shoot the study down—your job, after all, is to promote this sort of work—but rather in the spirit of fuller understanding of the study.

My favorite definition of statistical significance

From my 2009 paper with Weakliem:

Throughout, we use the term statistically significant in the conventional way, to mean that an estimate is at least two standard errors away from some “null hypothesis” or prespecified value that would indicate no effect present. An estimate is statistically insignificant if the observed value could reasonably be explained by simple chance variation, much in the way that a sequence of 20 coin tosses might happen to come up 8 heads and 12 tails; we would say that this result is not statistically significantly different from chance. More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent—thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance. Standard error is a measure of the variation in an estimate and gets smaller as a sample size gets larger, converging on zero as the sample increases in size.

I like that. I like that we get right into statistical significance, we don’t waste any time with p-values, we give a clean coin-flipping example, and we directly tie it into standard error and sample size.
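The coin-toss numbers in that passage are easy to verify:

```python
import math

p_hat = 8 / 20                             # observed proportion of heads
se = math.sqrt(p_hat * (1 - p_hat) / 20)   # standard error of a proportion
z = (p_hat - 0.5) / se                     # distance from the null, in SEs

print(round(p_hat, 2), round(se, 2), round(abs(z), 2))  # → 0.4 0.11 0.91
```

This reproduces the 40 percent estimate, the 11 percent standard error, and the conclusion that the outcome is less than two standard errors from the null hypothesis of 50 percent.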

P.S. Some questions were raised in discussion, so just to clarify: I’m not saying the above (which was published in a magazine, not a technical journal) is a comprehensive or precise definition; I just think it gets the point across in a reasonable way for general audiences.

An alternative to the superplot

Kevin Brown writes:

I came across the lexicon link to your ‘super plots’ posting today. In it, you plot the association between individual income (X) and republican voting (Y) for 3 states: one assumed to be poor, one middle income, and one wealthy.

An alternative way of plotting this, what I call a ‘herd effects plot’ (based on the vaccine effects lit) would be to put mean state income on the X axis, categorize the individual income into 2 (low-high) income groups, and then plot. This would create a plot with 50 x 2 points (two for each state). It would likely show the convergence of the voting tendencies of the poor and wealthy within high income states.

Some of the advantages of this alternative way of plotting are that you could see the association across all 50 states without the graph appearing to be ‘busy’. Also aliens wouldn’t have to keep in mind that Mississippi is a poor state because that information would be explicitly described on the X axis. Could also plot the marginal association with state income as a 3rd line.

Here’s an example from a recent paper [Importation, Antibiotics, and Clostridium difficile Infection in Veteran Long-Term Care: A Multilevel Case–Control Study, by Kevin Brown, Makoto Jones, Nick Daneman, Frederick Adler, Vanessa Stevens, Kevin Nechodom, Matthew Goetz, Matthew Samore, and Jeanmarie Mayer, published in the Annals of Internal Medicine] looking at herd effects of antibiotic use and a hospital acquired infection that’s precipitated by antibiotic use. Each pair of X-aligned dots represents antibiotic users and non-users within one long-term care facility with a given overall level of antibiotic use.

My reply: Rather than compare high to low, I would compare upper third to lower third. By discarding the middle third you reduce noise; see this article for explanation.
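A quick way to see the point about discarding the middle third: simulate a linear relation and compare the noise-to-signal ratio of a top-third-vs-bottom-third comparison against a top-half-vs-bottom-half comparison (the setup and numbers here are illustrative, not from the linked article):

```python
import numpy as np

rng = np.random.default_rng(2)

def trend_estimate(x, y, q):
    """Difference in mean y between the top and bottom q-fraction of x."""
    lo, hi = np.quantile(x, [q, 1 - q])
    return y[x >= hi].mean() - y[x <= lo].mean()

n, sims = 300, 2000
slope = 1.0
ests_thirds, ests_halves = [], []
for _ in range(sims):
    x = rng.standard_normal(n)
    y = slope * x + rng.standard_normal(n)
    ests_thirds.append(trend_estimate(x, y, 1 / 3))
    ests_halves.append(trend_estimate(x, y, 1 / 2))

# Relative noise: sd of each estimator divided by its mean.
rel_thirds = np.std(ests_thirds) / np.mean(ests_thirds)
rel_halves = np.std(ests_halves) / np.mean(ests_halves)
print(round(rel_thirds, 3), round(rel_halves, 3))
```

In the simulation the thirds comparison has the smaller noise-to-signal ratio: observations near the middle of the distribution add little to the contrast between groups but do add noise to the group means.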

Science funding and political ideology

Mark Palko points to this news article by Jeffrey Mervis entitled, “Rand Paul takes a poke at U.S. peer-review panels”:

Paul made his case for the bill yesterday as chairperson of a Senate panel with oversight over federal spending. The hearing, titled “Broken Beakers: Federal Support for Research,” was a platform for Paul’s claim that there’s a lot of “silly research” the government has no business funding. . . .

Two of the witnesses—Brian Nosek of the University of Virginia in Charlottesville and Rebecca Cunningham of the University of Michigan in Ann Arbor—were generally supportive of the status quo, although Nosek emphasized the importance of replicating findings to maximize federal investments. The third witness, Terence Kealey of the Cato Institute in Washington, D.C., asserted that there’s no evidence that publicly funded research makes any contribution to economic development.

Palko places this in the context of headline-grabbing politicians such as William Proxmire (Democrat) and John McCain (Republican) egged on by crowd-pleasing journalists such as Maureen Dowd and Howard Kurtz:

Of course, we have no way of knowing how effective these programs are, but questions of effectiveness are notably absent from McCain/Dowd’s piece. Instead it functions solely on the level of mocking the stated purposes of the projects, which brings us to one of the most interesting and for me, damning, aspects of the list: the preponderance of agricultural research.

You could make a damned good case for agricultural research having had a bigger impact on the world and its economy over the past fifty years than research in any other field. That research continues to pay extraordinary dividends both in new production and in the control of pest and diseases. It also helps us address the substantial environmental issues that have come with industrial agriculture.

As I said before, this earmark coverage with an emphasis on agriculture is a recurring event. I remember Howard Kurtz getting all giggly over earmarks for research on dealing with waste from pig farms about ten years ago and I’ve lost count of the examples since then. . . .

But this new effort by Sen. Paul and others seems to be coming from a different direction.

Part of the story, I think, is that a lot of the research funding goes directly to scientists, who are disproportionately liberal Democrats, compared to the general population. So I could see how a conservative Republican could say: Hey, why are we giving money to these people.

As a scientist who does a lot of government-funded research that is put to use by business, I think liberals, moderates, and conservatives should all support government science funding without worrying about the private political views of its recipients. Yes, this view is in my interest, but it’s also what I feel: Science is a public good in so many ways. But the point is that, for some conservatives, there’s a real tradeoff here in that money is going for useful things but it’s going to people with political views they don’t like.

I guess one could draw an analogy to liberals’ perspectives on military funding. If you’re a liberal Democrat and you support military spending, you have to accept that a lot of this money is going to conservative Republicans.

I say all this not in any attempt to discredit the proposals of Sen. Paul or others—these ideas should be evaluated on their merits—but just to look at these debates from a somewhat different political perspective here. When I talk about scientists being disproportionately liberal Democrats, I’m not talking about postmodernists in the Department of Literary Theory; I’m talking about chemists, physicists, biologists, etc.

Stan Roundup, 27 October 2017

I missed two weeks and haven’t had time to create a dedicated blog for Stan yet, so we’re still here. This is only the update for this week. From now on, I’m going to try to concentrate on things that are done, not just in progress, so you can get a better feel for the pace of things getting done.

Not one, but two new devs!

This is my favorite news to post, hence the exclamation.

  • Matthijs Vákár from University of Oxford joined the dev team. Matthijs’s first major commit is a set of GLM functions for negative binomial with log link (2–6 times speedup), normal linear regression with identity link (4–5 times), Poisson with log link (factor of 7), and Bernoulli with logit link (9 times). Wow! And he didn’t just implement the straight-line case—this is a fully vectorized implementation as a density, so we’ll be able to use them this way:
    int y[N];                        // observations
    matrix[N, K] x;                  // "data" matrix
    vector[K] beta;                  // slope coefficients
    real alpha;                      // intercept coefficient
    y ~ bernoulli_logit_glm(x, beta, alpha);

    These stand in for what is now written as

    y ~ bernoulli_logit(x * beta + alpha);

    and before that was written

    y ~ bernoulli(inv_logit(x * beta + alpha)); 

    Matthijs also successfully defended his Ph.D. thesis—welcome to the union, Dr. Vákár.

  • Andrew Johnson from Curtin University also joined. In his first bold move, he literally refactored the entire math test suite to bring it up to cpplint standard. He’s also been patching doc and other issues.


  • Kentaro Matsura, author of Bayesian Statistical Modeling Using Stan and R (in Japanese) visited and we talked about what he’s been working on and how we’ll handle the syntax for tuples in Stan.

  • Shuonan Chen visited the Stan meeting, then met with Michael (and me a little bit) to talk bioinformatics—specifically about single-cell PCR data and modeling covariance due to pathways. She had a well-annotated copy of Kentaro’s book!

Other news

  • Bill Gillespie presented a Stan and Torsten tutorial at ACoP.

  • Charles Margossian had a poster at ACoP on mixed solving (analytic solutions with forcing functions); his StanCon submission on steady state solutions with the algebraic solver was accepted.

  • Krzysztof Sakrejda nailed down the last bit of the standalone function compilation, so we should be rid of regexp based C++ generation in RStan 2.17 (coming soon).

  • Ben Goodrich has been cleaning up a bunch of edge cases in the math lib (hard things like the Bessel functions) and also added a chol2inv() function that inverts the matrix corresponding to a Cholesky factor (naming from LAPACK under review—suggestions welcome).

  • Bob Carpenter and Mitzi Morris taught a one-day Stan class in Halifax at Dalhousie University. Lots of fun seeing Stan users show up! Mike Lawrence, of Stan tutorial YouTube fame, helped people with debugging and installs—nice to finally meet him in person.

  • Ben Bales got the metric initialization into CmdStan, so we’ll finally be able to restart (the metric used to be called the mass matrix—it’s just the inverse of a regularized estimate of global posterior covariance during warmup).

  • Michael Betancourt just returned from SACNAS (diversity in STEM conference attended by thousands).

  • Michael also revised his history of MCMC paper, which has been conditionally accepted for publication. Read it on arXiv first.

  • Aki Vehtari was awarded a two-year postdoc for a joint project working on Stan algorithms and models jointly supervised with Andrew Gelman; it’ll also be joint between Helsinki and New York. Sounds like fun!

  • Breck Baldwin and Sean Talts headed out to Austin for the NumFOCUS summit, where they spent two intensive days talking largely about project governance and sustainability.

  • Imad Ali is leaving Columbia to work for the NBA league office (he’ll still be in NYC) as a statistical analyst! That’s one way to get access to the data!

Quick Money

I happened to come across this Los Angeles Times article from last year:

Labor and business leaders declared victory Tuesday night over a bitterly contested ballot measure that would have imposed new restrictions on building apartment towers, shops and offices in Los Angeles.

As of midnight, returns showed Measure S going down to defeat by a 2-1 margin, with more than half of precincts reporting. . . .

More than $13 million was poured into the fight, funding billboards, television ads and an avalanche of campaign mailers.

OK, fine so far. Referenda are inherently unpredictable so both sides can throw money into a race without any clear sense of who’s gonna win.

But then this:

The Yes on S campaign raised more than $5 million — about 99% of which came from the nonprofit AIDS Healthcare Foundation — to promote the ballot measure.

Huh? The AIDS Healthcare Foundation is spending millions of dollars on a referendum on development in L.A.? That’s weird. Maybe somewhere else in the city there’s a Low Density Housing Foundation that just spent 5 million bucks on AIDS.

If you want to know about basketball, who ya gonna trust, a mountain of p-values . . . or that poseur Phil Jackson??

Someone points me with amusement to this published article from 2012:

Beliefs About the “Hot Hand” in Basketball Across the Adult Life Span

Alan Castel, Aimee Drolet Rossi, and Shannon McGillivray
University of California, Los Angeles

Many people believe in streaks. In basketball, belief in the “hot hand” occurs when people think a player is more likely to make a shot if they have made previous shots. However, research has shown that players’ successive shots are independent events. To determine how age would impact belief in the hot hand, we examined this effect across the adult life span. Older adults were more likely to believe in the hot hand, relative to younger and middle-aged adults, suggesting that older adults use heuristics and potentially adaptive processing based on highly accessible information to predict future events.

My correspondent writes: “This paper is funny, I didn’t realize how strongly the psych community bought the null hypothesis.”

To be fair, back in 2012, I didn’t think the hot hand was real either . . .

But what really makes it work is this quote from the first paragraph of the above-linked paper:

Anecdotally, many fans, and even coaches, profess belief in the hot hand. For example, Phil Jackson, one of the most successful coaches in the history of the National Basketball Association (NBA), once said of Kobe Bryant, in Game 5 of the 2010 NBA Finals: “He’s the kind of guy (where) you ride the hot hand, that’s for sure, we were waiting for him to do that . . . . He went out there and found a rhythm.”

I mean, if you want to know about basketball, who ya gonna trust, a mountain of p-values . . . or that poseur Phil Jackson?? Have you seen how bad the Knicks were last year? Zen master, my ass.

In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried.

Mark Tuttle writes:

If/when the spirit moves you, you should contrast the success of the open software movement with the challenge of published research.

In the former case, discovery of bugs, or of better ways of doing things, is almost always WELCOMED. In some cases, submitters of bug reports, patches, suggestions, etc. get “merit badges” or other public recognition. You could relate your experience with Stan here.

In contrast, as you observe often, a bug report or suggestion regarding a published paper is treated as a hostile interaction. This is one thing that happens to me during more open review processes, whether anonymous or not. The first time it happened I was surprised. Silly me, I expected to be thanked for contributing to the quality of the result (or so I thought).

Thus, a simple-to-state challenge is to make publication of research, especially research based on data, more like open software.

As you know, sometimes open software gets to be really good because so many eyes have reviewed it carefully.

I just posted something related yesterday, so this is as good a time as any to respond to Tuttle’s point, which I think is a good one. We’ve actually spent some time thinking about how to better reward people in the Stan community who help out with development and user advice.

Regarding resistance to “bug reports” in science, here’s what I wrote last year:

We learn from our mistakes, but only if we recognize that they are mistakes. Debugging is a collaborative process. If you approve some code and I find a bug in it, I’m not an adversary, I’m a collaborator. If you try to paint me as an “adversary” in order to avoid having to correct the bug, that’s your problem.

It’s a point worth making over and over. Getting back to the comparison with “bug reports,” I guess the issue is that developers want their software to work. Bugs are the enemy! In contrast, many researchers just want to believe (and have others believe) that their ideas are correct. For them, errors are not the enemy; rather, the enemy is any admission of defeat. Their response to criticism is to paper over any cracks in their argument and hope that nobody will notice or care. With software it’s harder to do that, because your system will keep breaking, or giving the wrong answer. (With hardware, I suppose it’s even more difficult to close your eyes to problems.)

This Friday at noon, join this online colloquium on replication and reproducibility, featuring experts in economics, statistics, and psychology!

Justin Esarey writes:

This Friday, October 27th at noon Eastern time, the International Methods Colloquium will host a roundtable discussion on the reproducibility crisis in social sciences and a recent proposal to impose a stricter threshold for statistical significance. The discussion is motivated by a paper, “Redefine statistical significance,” recently published in Nature Human Behavior (and available at
Our panelists are:
  1. Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of the paper in Nature Human Behavior as well as many other articles on inference and hypothesis testing in the social sciences.
  2. Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and an author or co-author on many articles on statistical inference in the social sciences, including the Open Science Collaboration’s recent Science publication “Estimating the reproducibility of psychological science” (available at
  3. Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance” as well as many other articles on statistical inference and replicability.
  4. Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance” who specializes in childhood and adolescent psychopathology.
  5. E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam, a co-author of the paper in Nature Human Behavior and author or co-author of many other articles concerning statistical inference in the social sciences, including a meta-analysis of the “power pose” effect (available at
To tune in to the presentation and participate in the discussion after the talk, click “Watch Now!” on the day of the talk. To register for the talk in advance, click here:
The IMC uses Zoom, which is free to use for listeners and works on PCs, Macs, and iOS and Android tablets and phones. You can be a part of the talk from anywhere around the world with access to the Internet. The presentation and Q&A will last for a total of one hour.
This sounds great.

I think it’s great to have your work criticized by strangers online.

Brian Resnick writes:

I’m hoping you could help me out with a story I’m looking into.

I’ve been reading about the debate over how past work should be criticized and in what forums. (I’m thinking of the Susan Fiske op-ed against using social media to “bully” authors of papers that are not replicating. But then, others say the social web needs to be an essential vehicle to issue course corrections in science.)

This is what I’m thinking: It can’t feel great to have your work criticized by strangers online. That can be true regardless of the intentions of the critics (who, as far as I can tell, are doing this because they too love science and want to see it thrive). And it can be true even if the critics are ultimately correct. (My pet theory is that this “crisis” is actually confirming a lot of psychological phenomena—namely, motivated reasoning.)

Anyway: I am interested in hearing some stories about dealing with replication failure during this “crisis.” (Or perhaps some stories about being criticized for being a critic.) How did these instances change the way you thought about yourself as a scientist? Could you really separate your intellectual reaction from your emotional one?

This isn’t about infighting and gossip: I think there’s an important story to be told about what it means to be a scientist in the age of the social internet. Or maybe the story is about how this period is changing (or reaffirming) your thoughts of what it means to be a scientist.

Let me know if you have any thoughts or stories you’d like to share on this topic!

Or perhaps you think I’m going about this the wrong way. That’s fine too.

My reply:

You write, “It can’t feel great to have your work criticized by strangers online.” Actually, I love getting my work criticized, by friends or by strangers, online or offline. When criticism gets personal, it can be painful, but it is by criticism that we learn, and the challenge is to pull out the useful content. I have benefited many many times from criticism.

Here’s an example from several years ago.

In March 2009 I posted some maps based on the Pew pre-election polls to estimate how Obama and McCain did among different income groups, for all voters and for non-Hispanic whites alone. The next day the blogger and political activist Kos posted some criticisms.

The criticisms were online, non-peer-reviewed, by a stranger, and actually kinda rude. So what, who cares! Not all of Kos’s criticisms were correct but some of them were right on the mark, and they motivated me to spend a couple of months with my colleague Yair Ghitza improving my model; the story is here.

Yair and I continued with the work and a few years later published a paper in the American Journal of Political Science. A few years after that, Yair and I, with Rayleigh Lei, published a followup in which we uncovered problems with our earlier published work.

So, yeah, I think criticism is great.

If people don’t want their work criticized by strangers, I recommend they not publish or post their work for strangers to see.

P.S. This post happens to be appearing shortly after a discussion on replicability and scientific criticism. Just a coincidence. I wrote the post several months ago (see here for the full list).

My 2 talks in Seattle this Wed and Thurs: “The Statistical Crisis in Science” and “Bayesian Workflow”

For the Data Science Seminar, Wed 25 Oct, 3:30pm in Physics and Astronomy Auditorium – A102:

The Statistical Crisis in Science

Top journals routinely publish ridiculous, scientifically implausible claims, justified based on “p < 0.05.” And this in turn calls into question all sorts of more plausible, but not necessarily true, claims that are supported by this same sort of evidence. To put it another way: we can all laugh at studies of ESP, or ovulation and voting, but what about MRI studies of political attitudes, or stereotype threat, or, for that matter, the latest potential cancer cure? If we can’t trust p-values, does experimental science involving human variation just have to start over? And what do we do in fields such as political science and economics, where preregistered replication can be difficult or impossible? Can Bayesian inference supply a solution? Maybe. These are not easy problems, but they’re important problems.

For the Department of Biostatistics, Thurs 26 Oct, 3:30pm in Room T-639, Health Sciences:

Bayesian Workflow

Bayesian inference is typically explained in terms of fitting a particular model to a particular dataset. But this sort of model fitting is only a small part of real-world data analysis. In this talk we consider several aspects of workflow that have not been well served by traditional Bayesian theory, including scaling of parameters, weakly informative priors, predictive model evaluation, variable selection, model averaging, checking of approximate algorithms, and frequency evaluations of Bayesian inferences. We discuss these ideas in various applications in social science and public health.

P.S. It appears I’ll have some time available on Wed morning so if anyone has anything they want to discuss, just stop by; I’ll be at the eScience Institute on the 6th floor of the Physics/Astronomy Tower.

The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting

The starting point is that we’ve seen a lot of talk about frivolous science, headline-bait such as the study that said that married women are more likely to vote for Mitt Romney when ovulating, or the study that said that girl-named hurricanes are more deadly than boy-named hurricanes, and at this point some of these studies are almost pre-debunked. Reporters are starting to realize that publication in Science or Nature or PNAS is not only no guarantee of correctness but also no guarantee that a study is even reasonable.

But what I want to say here is that even serious research is subject to exaggeration and distortion, partly through the public relations machine and partly because of basic statistics. The push to find and publicize so-called statistically significant results leads to overestimation of effect sizes (type M errors), and crude default statistical models lead to broad claims of general effects based on data obtained from poor measurements and nonrepresentative samples.
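The type M (magnitude) error idea can be illustrated with a quick simulation. The numbers below (true effect 2, standard error 8) are a hypothetical low-power design of the sort discussed in the Gelman and Carlin power calculations, not from any particular study; the point is just that conditioning on statistical significance inflates the estimates:

```python
import random

random.seed(2)
true_effect, se = 2.0, 8.0   # hypothetical low-power design: power is only about 6%
sig = []
for _ in range(200000):
    est = random.gauss(true_effect, se)     # one noisy study's estimate
    if abs(est) > 1.96 * se:                # it gets published as "significant"
        sig.append(abs(est))

exaggeration = (sum(sig) / len(sig)) / true_effect
print(round(exaggeration, 1))  # much larger than 1: significant estimates overstate the effect
```

Under these numbers the significant estimates are, on average, around nine times the true effect, which is the sense in which the push for significance manufactures exaggeration even with no p-hacking at all.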

One example we’ve discussed a lot is that claim of the effectiveness of early childhood intervention, based on a small-sample study from Jamaica. This study is not “junk science.” It’s a serious research project with real-world implications. But the results still got exaggerated. My point here is not to pick on those researchers. No, it’s the opposite: even top researchers exaggerate in this way so we should be concerned in general.

What to do here? I think we need to proceed on three tracks:
1. Think more carefully about data collection when designing these studies. Traditionally, design is all about sample size, not enough about measurement.
2. In the analysis, use Bayesian inference and multilevel modeling to partially pool estimated effect sizes, thus giving more stable and reasonable output.
3. When looking at the published literature, use some sort of Edlin factor to interpret the claims being made based on biased analyses.

The above remarks are general, but they were inspired by a discussion we had a few months ago about the design and analysis of psychology experiments, as I think there’s some misunderstanding in which people don’t see where assumptions are coming into various statistical analyses (see for example this comment).

The network of models and Bayesian workflow

This is important, it’s been something I’ve been thinking about for decades, it just came up in an email I wrote, and it’s refreshingly unrelated to recent topics of blog discussion, so I decided to just post it right now out of sequence (next slot on the queue is in May 2018).

Right now, standard operating procedure is to write a model, fit it, then alter the model and give it a new name, fit it, alter the model again, creating a new file, etc. We’re left with a mess of models in files with names like logit_1.stan, logit_2.stan, logit_2_interactions.stan, logit_3_time_trend.stan, logit_hier.stan, etc., that are connected to each other in some tangled way.

This could be thought of as a network of models, i.e., I’m picturing a graph in which each node is a statistical model (instantiated as a Stan program), and the edges connect models that are one step away from each other (for example, adding one more predictor, or allowing a coefficient to vary by group, or adding a nonlinear term, or changing a link function, or whatever).

So I’m thinking what we need—we’ve been needing this for awhile, actually, and I’ve been talking about it for awhile too, but talk is cheap, right?, anyway . . .—is a formalization of the network of models.

Some possible payoffs:
– Automatic, or at least machine-aided, model building.
– A potentially algorithmic tie-in of model expansion and predictive model checking.
– Some better way of keeping track of multiple models. Using file names is just horrible.
– Visualization of workflow, which should help us better understand what we’re doing and be able to convey it better to others.

I’m thinking it would make sense, at least at first, to set this up as an R or Python wrapper.
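For concreteness, here's one way the bookkeeping might look as a Python sketch. The file names echo the ones above; the change descriptions on the edges are hypothetical, and a real tool would of course need much more (diffs of the Stan programs, fit summaries attached to nodes, etc.):

```python
class ModelNetwork:
    """Minimal sketch of a network of models: nodes are model files,
    edges record the one-step change connecting a model to its parent."""

    def __init__(self):
        self.parents = {}   # model name -> (parent name or None, change description)

    def add(self, name, parent=None, change=None):
        self.parents[name] = (parent, change)

    def lineage(self, name):
        """Trace a model back to its root, returning (model, change) pairs
        in order from the root to the named model."""
        path = []
        while name is not None:
            parent, change = self.parents[name]
            path.append((name, change))
            name = parent
        return list(reversed(path))

net = ModelNetwork()
net.add("logit_1.stan")
net.add("logit_2.stan", parent="logit_1.stan", change="add one more predictor")
net.add("logit_hier.stan", parent="logit_2.stan", change="let intercept vary by group")

for model, change in net.lineage("logit_hier.stan"):
    print(model, "--", change or "(root)")
```

Even this toy version replaces the tangle of file names with an explicit graph, which is the prerequisite for the visualization and machine-aided model building mentioned above.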

Does traffic congestion make men beat up their wives?

Max Burton-Chellew writes:

I thought this paper and news story (links fixed) might be worthy of your blog? I’m no stats expert, far from it, but this paper raised some alarms for me. If the paper is fine then sorry for wasting your time, if it’s terrible then sorry for ruining your day!

Why alarms – well for the usual 1-2-3 of: p-hacking a ‘rich’ data set > inferring an unsubstantiated causal process for a complex human behaviour > giving big policy advice. Of course the authors tend to write this process in reverse.

I think the real richness is the multitude of psychological processes that are inferred for their full interpretation (traffic delays cause domestic violence but not other violence, and more for short commutes than long ones because long ones are less ‘surprised’ by delays etc).

This paragraph is perhaps most illuminating of the post-hoc interpretative approach used:

Next, we examine heterogeneity in the effect of traffic on crime by dividing zip codes along three dimensions: crime, income and distance to downtown. In each specification we subset zip codes in the sample by being either above or below the sample median for each variable. Crime and income are correlated, but there are zip codes that are high crime and high income. Table 4 shows that traffic increases domestic violence in predominantly high-crime and low-income zip codes. We also find that most of the effect appears to come from zip codes that are closer to downtown, which may arise for two reasons. First, households living closer to downtown are more likely to work downtown, and therefore we are assigning them the appropriate traffic conditions. Secondly, a traffic shock for a household with a very long commute may be a smaller proportion of their total commute and a traffic shock might be more expected.

My reply: It’s always good to hear from a zoologist! I’m not so good with animals myself. Also I agree with you on this paper, at least from a quick look. It’s not hard for people to sift through data to find patterns consistent with their stories of the world. Or, to put it another way, maybe traffic congestion does make (some) men beat up their wives, and maybe it makes other men less likely to do this—but this sort of data analysis won’t tell us much about it. As usual, I recommend studying this multiverse of possible interactions using a multilevel model, in which case I’m guessing the results won’t look so clean any more.

How to discuss your research findings without getting into “hypothesis testing”?

Zachary Horne writes:

I regularly read your blog and have recently started using Stan. One thing that you’ve brought up in the discussion of nhst [null hypothesis significance testing] is the idea that hypothesis testing itself is problematic. However, because I am an experimental psychologist, one thing I do (or I think I’m doing anyway) is conduct experiments with the aim of testing some hypothesis or another. Given that I am starting to use Stan and moving away from nhst, how would you recommend that experimentalists like myself discuss their findings since hypothesis testing itself may be problematic? In general, any guidance you have on this front would be very helpful.

My reply: In any particular case, I’d recommend building a model and estimating parameters within that model. For example, instead of trying to prove that the incumbency advantage was real, my colleagues and I estimated how it varied over time and across different congressional districts, and we estimated its consequences. The point is to draw direct links to questions outside the lab, or outside the data.

Maybe commenters have other suggestions?