The post Pastagate! appeared first on Statistical Modeling, Causal Inference, and Social Science.

In a news article, “Pasta Is Good For You, Say Scientists Funded By Big Pasta,” Stephanie Lee writes:

The headlines were a fettuccine fanatic’s dream. “Eating Pasta Linked to Weight Loss in New Study,” Newsweek reported this month, racking up more than 22,500 Facebook likes, shares, and comments. The happy news also went viral on the Independent, the New York Daily News, and Business Insider.

What those and many other stories failed to note, however, was that three of the scientists behind the study in question had financial conflicts as tangled as a bowl of spaghetti, including ties to the world’s largest pasta company, the Barilla Group. . . .

They should get together with Big Oregano.

**P.S.** Our work has many government and corporate sponsors. Make of this what you will.


The post Postdoc opportunity at AstraZeneca in Cambridge, England, in Bayesian Machine Learning using Stan! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Predicting drug toxicity with Bayesian machine learning models

We’re currently looking for talented scientists to join our innovative academic-style Postdoc. From our centre in Cambridge, UK, you’ll be in a global pharmaceutical environment, contributing to live projects right from the start. You’ll take part in a comprehensive training programme, including a focus on drug discovery and development, be given access to our existing Postdoctoral research, and be encouraged to pursue your own independent research. It’s a newly expanding programme spanning a range of therapeutic areas across a wide range of disciplines. . . .

You will be part of the Quantitative Biology group and develop comprehensive Bayesian machine learning models for predicting drug toxicity in liver, heart, and other organs. This includes predicting the mechanism as well as the probability of toxicity by incorporating scientific knowledge into the prediction problem, such as known causal relationships and known toxicity mechanisms. Bayesian models will be used to account for uncertainty in the inputs and propagate this uncertainty into the predictions. In addition, you will promote the use of Bayesian methods across safety pharmacology and biology more generally. You are also expected to present your findings at key conferences and in leading publications.

This project is in collaboration with Prof. Andrew Gelman at Columbia University, and Dr Stanley Lazic at AstraZeneca.


The post Psychometrics corner: They want to fit a multilevel model instead of running 37 separate correlation analyses appeared first on Statistical Modeling, Causal Inference, and Social Science.

One of my students has some data, and there is an issue with multiple comparisons. While trying to find out how to best deal with the issue, I came across your article with Martin Lindquist, “Correlations and Multiple Comparisons in Functional Imaging: A Statistical Perspective.” And while my student’s work does not involve functional imaging, I thought that your article may present a solution for our problem.

My student is interested in the relationship between vocabulary size and different vocabulary learning strategies (VLS). He has measured each participant’s approximate vocabulary size with a standardized test (scores between 0 and 10000) and asked each participant how frequently they use each of 37 VLS on a scale from 1 through 5. The 37 VLS fall into five different groups (cognitive, memory, social, etc.). He is interested in which VLS correlate with or predict vocabulary size. To see which VLS correlate with vocabulary size, we could run 37 separate correlation analyses, but then we run into the problem that we are doing multiple comparisons and the issue of false positives that goes along with that.

Do you think a multilevel Bayesian approach that uses partial pooling, as you suggest in your paper for functional imaging data, would be appropriate in our case? If so, would you be able to provide me with some more information as to how I can actually run such an analysis? I am working in R, and any information as to which packages and functions would be appropriate for the analysis would be really helpful. I came across the brms package for Advanced Bayesian Multilevel Modeling, but I have not worked with this particular package before and I am not sure if this is exactly what I need.

My reply:

I do think a multilevel Bayesian approach would make sense. I’ve never worked on this particular problem, so I am posting it here on the blog in the hope that someone might have a response. This seems like the exact sort of problem where we’d fit a multilevel model rather than running 37 separate analyses!
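To make the idea concrete, here is a minimal sketch of what partial pooling does to 37 noisy correlation estimates. This is Python rather than R, with made-up data, and the method-of-moments shrinkage below is a crude stand-in for the full multilevel model that brms or Stan would fit (in particular, it ignores the five strategy groups):

```python
import numpy as np

def partial_pool_correlations(r, n):
    """Shrink per-strategy correlations toward their common mean.

    r: array of raw correlations (one per strategy); n: sample size
    behind each correlation.  Works on Fisher's z scale, where each
    estimate is roughly normal with sampling variance 1 / (n - 3).
    """
    z = np.arctanh(r)                     # Fisher z-transform
    se2 = 1.0 / (n - 3)                   # sampling variance on z scale
    mu = z.mean()                         # grand mean across strategies
    # Between-strategy variance by simple method of moments:
    tau2 = max(z.var(ddof=1) - se2, 0.0)
    shrink = tau2 / (tau2 + se2)          # pooling factor in [0, 1]
    z_pooled = mu + shrink * (z - mu)     # partial pooling toward mu
    return np.tanh(z_pooled)              # back to the correlation scale

rng = np.random.default_rng(1)
raw = rng.uniform(-0.3, 0.5, size=37)     # fake raw correlations
pooled = partial_pool_correlations(raw, n=120)
```

Each estimate is pulled toward the overall mean by an amount that depends on how much of the spread in the raw estimates looks like noise, which is exactly what protects against the multiple-comparisons problem of 37 separate analyses.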


The post Trichotomous appeared first on Statistical Modeling, Causal Inference, and Social Science.

Regarding this paper, Frank Harrell writes:

One grammatical correction: Alvan Feinstein, the ‘father of clinical epidemiology’ at Yale, educated me about ‘trichotomy’: dichotomous = Greek dicho (two) + tomous (cut). Three = tri, so the proper word would be ‘tritomous’ instead of ‘trichotomous’.

Uh oh. I can’t bring myself to use the word “tritomous,” as it just sounds wrong. “Trichotomous” might be one of those words that are impossible to use correctly; see here.

**P.S.** The adorable cat above faces many more than three options.


The post “Statistics: Learning from stories” (my talk in Zurich on Tues 28 Aug) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Statistics: Learning from stories

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Here is a paradox: In statistics we aim for representative samples and balanced comparisons, but stories are interesting to the extent that they are surprising and atypical. The resolution of the paradox is that stories can be seen as a form of model checking: we learn from a good story when it refutes some idea we have about the world. We demonstrate with several examples of successes and failures of applied statistics.

Information on the conference is here.


The post You better check yo self before you wreck yo self appeared first on Statistical Modeling, Causal Inference, and Social Science.

This type of testing should be required in order to publish a new (or improved) algorithm that claims to compute a posterior distribution. It’s time to get serious about only publishing things that actually work!

**You Oughta Know**

Before I go into our method, let’s have a brief review of some things that are not sufficient to demonstrate that an algorithm for computing a posterior distribution actually works.

- Theoretical results that are anything less than demonstrably tight upper and lower bounds* that work in finite-sample situations.
- Comparison with a long run from another algorithm, unless that algorithm has stronger guarantees than “we ran it for a long time.” (Even when the long-running algorithm is guaranteed to work, there is nothing generalizable here. This can only ever show the algorithm works on a specific data set.)
- Recovery of parameters from simulated data. (This literally checks nothing.)
- Running the algorithm on real data. (Again, this checks literally nothing.)
- Running the algorithm and plotting traceplots, autocorrelations, etc.
- Computing the Gelman-Rubin R-hat statistic. (Even using multiple chains initialized at diverse points, this only checks whether the Markov chain has converged. It does not check that it has converged to the correct thing.)

I could go on and on and on.

The method that we are proposing does actually do a pretty good job at checking if an approximate posterior is similar to the correct one. It isn’t magic. It can’t guarantee that a method will work for any data set.

What it can do is make sure that for a given model specification, one-dimensional posterior quantities of interest will be correct on average. Here, “on average” means that we average over data simulated from the model. This means that rather than just check the algorithm once when it’s proposed, we need to check the algorithm every time it’s used for a new type of problem. This places algorithm checking within the context of Bayesian Workflow.

This isn’t as weird as it seems. One of the things that we always need to check is that we are actually running the correct model. Programming errors happen to everyone and this procedure will help catch them.

Moreover, if you’re doing something sufficiently difficult, it can happen that even something as stable as Stan will quietly fail to get the correct result. The Stan developers have put a lot of work into trying to avoid these quiet cases of failure (Betancourt’s idea to monitor divergences really helped here!), but there is no way to user-proof software. The Simulation-Based Calibration procedure that we outline in the paper (and below) is another safety check that we can use to help us be confident that our inference is actually working as expected.

(* I will also take asymptotic bounds and sensitive finite sample heuristics because I’m not that greedy. But if I can’t run my problem, check the heuristic, and then be confident that if someone died because of my inference, it would have nothing to do with the computation of the posterior, then it’s not enough.)

**Don’t call it a comeback, I’ve been here for years**

One of the weird things that I have noticed over the years is that it’s often necessary to re-visit good papers from the past so they reflect our new understanding of how statistics works. In this case, we re-visited an excellent idea Samantha Cook, Andrew, and Don Rubin proposed in 2006.

Cook, Gelman, and Rubin proposed a method for assessing output from software for computing posterior distributions by noting a simple fact:

If $\theta \sim p(\theta)$ and $y \sim p(y \mid \theta)$, then the posterior quantile $\Pr\{f(\tilde{\theta}) < f(\theta) \mid y\}$, where $\tilde{\theta} \sim p(\theta \mid y)$, is uniformly distributed (the randomness is in $\theta$ and $y$) for any continuous function $f$.

There’s a slight problem with this result. It’s not actually applicable for sample-based inference! It only holds if, at every point, all the distributions are continuous and all of the quantiles are computed exactly.

In particular, if you compute the quantile using a bag of samples drawn from an MCMC algorithm, this result will not hold.

This makes it hard to use the original method in practice. Actually, that’s understating the problem. This whole project happened because we wanted to run Cook, Gelman, and Rubin’s procedure to compare some Stan and BUGS models. And we just kept running into problems. Even when we ran it on models that we knew worked, we were getting bad results.

So we (Sean, Michael, Aki, Andrew, and I) went through and tried to re-imagine their method as something that is more broadly applicable.

**When in doubt, rank something**

The key difference between our paper and Cook, Gelman, and Rubin is that we have avoided their mathematical pitfalls by re-casting their main theoretical result to something a bit more robust. In particular, we base our method around the following result.

Let $\theta \sim p(\theta)$ and $y \sim p(y \mid \theta)$, and let $\theta_1, \ldots, \theta_L$ be independent draws from the posterior distribution $p(\theta \mid y)$. Then the rank of $f(\theta)$ in the bag of samples $\{f(\theta_1), \ldots, f(\theta_L)\}$ has a discrete uniform distribution on $\{0, 1, \ldots, L\}$.

This result is true for both discrete and continuous distributions. On the other hand, we now have freedom to choose $L$. As a rule, the larger $L$ is, the more sensitive this procedure will be. On the other hand, a larger $L$ will require more simulated data sets in order to be able to assess if the observed ranks deviate from a discrete-uniform distribution. In the paper, we chose $L = 100$ samples for each posterior.

**The hills have eyes**

But, more importantly, the hills have autocorrelation. If a posterior has been computed using an MCMC method, the bag of samples that are produced will likely have non-trivial autocorrelation. This autocorrelation will cause the rank histogram to deviate from uniformity in a specific way. In particular, it will lead to spikes in the histogram at zero and/or one.

To avoid this, we recommend thinning the sample to remove most of the autocorrelation. In our experiments, we found that thinning by effective sample size was sufficient to remove the artifacts, even though this is not theoretically guaranteed to remove the autocorrelation. We also considered using some more theoretically motivated methods, such as thinning based on Geyer’s initial positive sequences, but we found that these thinning rules were too conservative and this more aggressive thinning did not lead to better rank histograms than the simple effective sample size-based thinning.
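As a rough illustration of the kind of thinning meant here, consider the sketch below (Python, with an AR(1) chain standing in for MCMC output; the autocorrelation-based effective sample size estimator is deliberately crude, not the estimator Stan actually uses):

```python
import numpy as np

def crude_ess(x):
    """Crude effective sample size: n / (1 + 2 * sum of leading positive
    autocorrelations).  Stan's ESS estimator is more careful than this."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:]
    acf = acf / (np.arange(n, 0, -1) * np.var(x))
    s = 0.0
    for rho in acf[1:]:
        if rho < 0:           # stop at the first negative estimate
            break
        s += rho
    return n / (1 + 2 * s)

def thin_by_ess(chain):
    """Keep roughly one draw per effective sample."""
    step = max(1, int(np.ceil(len(chain) / crude_ess(chain))))
    return chain[::step]

# Strongly autocorrelated AR(1) chain as a stand-in for MCMC output
rng = np.random.default_rng(2)
chain = np.empty(5000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()
thinned = thin_by_ess(chain)
```

The retained draws are far fewer but close to independent, which is what keeps the spurious spikes out of the rank histogram.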

**Simulation-based calibration**

After putting all of this together, we get the simulation-based calibration (SBC) algorithm. The version below is for correlated samples. (There is a version in the paper for independent samples.)

The simple idea is that for each of the $N$ simulated data sets, you generate a bag of approximately independent samples from the approximate posterior. (You can do this however you want!) You then compute the rank of the true parameter (the one that was used to simulate that data set) within the bag of samples. So you need to simulate $N$ true parameters, each of which is used to generate one data set, which in turn is used to compute $L$ samples from its posterior.
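A toy version of the procedure, using a conjugate normal model where the exact posterior is known (so the rank histogram should come out uniform), might look like this Python sketch; a real workflow would replace step 3 with draws from Stan or whatever algorithm is being checked:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 100, 2000     # draws per posterior, number of simulated data sets
ranks = []
for _ in range(N):
    theta = rng.normal(0.0, 1.0)          # 1. draw a "true" parameter
    y = rng.normal(theta, 1.0)            # 2. simulate one data point
    # 3. draw from the posterior; here it is known exactly: N(y/2, 1/2)
    draws = rng.normal(y / 2.0, np.sqrt(0.5), size=L)
    # 4. rank of the true parameter within the bag of draws
    ranks.append(int(np.sum(draws < theta)))
ranks = np.asarray(ranks)
# A histogram of `ranks` should be flat over {0, 1, ..., L}.
```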

So. Validating code with SBC is obviously expensive. It requires a whole load of runs to make it work. The upside is that all of this runs in parallel on a cluster, so if your code is reliable, it is actually quite straightforward to run.

The place where we ran into some problems was trying to validate MCMC code that we knew didn’t work. In this case, the autocorrelation on the chain was so strong that it wasn’t reasonable to thin the chain to get 100 samples. This is an important point: if your method fails some basic checks, then it’s going to fail SBC. There’s no point wasting your time.

The main benefit of SBC is in validating MCMC methods that appear to work, or validating fast approximate algorithms like INLA (which works) or ADVI (which is a more mixed bag).

This method also has another interesting application: evaluating approximate models. For example, if you replace an intractable likelihood with a cheap approximation (such as a composite likelihood or a pseudolikelihood), SBC can give some idea of the errors that this approximation has pushed into the posterior. The procedure remains the same: simulate parameters from the prior, simulate data from the correct model, and then generate a bag of approximately uncorrelated samples from the corresponding posterior using the approximate model. While this procedure cannot assess the quality of the approximation in the presence of model error, it will still be quite informative.

**When You’re Smiling (The Whole World Smiles With You)**

One of the most useful parts of the SBC procedure is that it is inherently visual. This makes it fairly straightforward to work out how your algorithm is wrong. The one-dimensional rank histograms have four characteristic non-uniform shapes: “smiley”, “frowny”, “a step to the left”, “a jump to the right”, which are all interpretable.

- Histogram has a smile: The posteriors are narrower than they should be. (We see too many low and high ranks.)
- Histogram has a frown: The posteriors are wider than they should be. (We don’t see enough low and high ranks.)
- Histogram slopes from left to right: The posteriors are biased upwards. (The true value is too often in the lower ranks of the sample.)
- Histogram slopes from right to left: The posteriors are biased downwards. (The opposite.)
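For instance, here is a toy sketch (conjugate normal model in Python) of how the “smile” arises: artificially halving the posterior standard deviation makes the approximation overconfident, so far more of the ranks land in the extreme bins than the roughly 12% a uniform distribution over {0, …, 100} would put there:

```python
import numpy as np

rng = np.random.default_rng(3)
L, N = 100, 2000
ranks = []
for _ in range(N):
    theta = rng.normal(0.0, 1.0)
    y = rng.normal(theta, 1.0)
    # Correct posterior is N(y/2, sd = sqrt(1/2)); halving the sd gives
    # an over-confident approximation, so true values land in the tails.
    draws = rng.normal(y / 2.0, 0.5 * np.sqrt(0.5), size=L)
    ranks.append(int(np.sum(draws < theta)))
ranks = np.asarray(ranks)
extreme = np.mean((ranks <= 5) | (ranks >= 95))   # mass in the outer bins
```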

These histograms seem to be sensitive enough to indicate when an algorithm doesn’t work. In particular, we’ve observed that when the algorithm fails, these histograms are typically quite far from uniform. A key thing that we’ve had to assume, however, is that the bag of samples drawn from the computed posterior is approximately independent. Autocorrelation can cause spurious spikes at zero and/or one.

These interpretations are inspired by the literature on calibrating probabilistic forecasts. (Follow that link for a really detailed review and a lot of references). There are also some multivariate extensions to these ideas, although we have not examined them here.


The post Using partial pooling when preparing data for machine learning applications appeared first on Statistical Modeling, Causal Inference, and Social Science.

I reached out to John Mount/Nina Zumel over at Win Vector with a suggestion for their vtreat package, which automates many common challenges in preparing data for machine learning applications. The default behavior for impact coding high-cardinality variables had been a naive Bayes approach, which I found to be problematic due to its multi-modal output (assigning probabilities close to 0 and 1 for low-sample-size levels). This seemed like a natural fit for partial pooling, so I pointed them to your work/book and demonstrated its usefulness from my experience/applications. It’s now the basis of a custom-coding enhancement to their package. You can find their write-up here.

Cool. I hope their next step will be to implement it in Stan.

It’s also interesting to think of Bayesian or multilevel modeling being used as a preprocessing tool for machine learning, which is sort of the flipped-around version of an idea we posted the other day, on using black-box machine learning predictions as inputs to a Bayesian analysis. I like these ideas of combining different methods and getting the best of both worlds.
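To make the preprocessing idea concrete, here is a hedged sketch of partially pooled impact coding (Python, not the vtreat implementation; the pooling constant `k` below is an arbitrary illustrative pseudo-count, whereas a real multilevel model would estimate it from the between/within variance):

```python
import numpy as np

def impact_code(levels, y, k=20.0):
    """Partially pooled impact coding for a high-cardinality categorical.

    Each level's effect is its mean outcome minus the grand mean, shrunk
    by n / (n + k), where n is the level's count.  Levels with few
    observations are pulled strongly toward zero instead of getting the
    extreme values a raw per-level mean would produce.
    """
    grand = y.mean()
    code = {}
    for lev in np.unique(levels):
        yl = y[levels == lev]
        n = len(yl)
        code[lev] = (n / (n + k)) * (yl.mean() - grand)
    return code

rng = np.random.default_rng(4)
levels = rng.integers(0, 50, size=500).astype(str)   # 50-level variable
y = rng.normal(0.0, 1.0, size=500)
codes = impact_code(levels, y)
```

The coded values then replace the raw categorical column as a numeric feature for the downstream machine learning model.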


The post An Upbeat Mood May Boost Your Paper’s Publicity appeared first on Statistical Modeling, Causal Inference, and Social Science.

A new study suggests that older people who are in a good mood when they get the flu shot have a better immune response.

British researchers followed 138 people ages 65 to 85 who got the 2014-15 vaccine. Using well-validated tests in the weeks before and after their shots, the scientists recorded mood, stress, negative thoughts, sleep patterns, diet and other measures of psychological and physical health. . . .

Greater levels of positive mood were associated with higher blood levels of antibodies to H1N1, a potentially dangerous flu strain, at both four and 16 weeks post-vaccination. No other factors measured were associated with improved immune response.

Abundant researcher degrees of freedom? Check.

Speculative hypothesis? Check.

Obvious latent-variable explanation? Check.

Difference between significant and non-significant taken as significant? Check.

The article continues:

The authors acknowledge they were not able to control for all possible variables, and that their observational study does not prove cause and effect.

The senior author, Kavita Vedhara, professor of health psychology at the University of Nottingham, said that many things could affect vaccine effectiveness, but most are not under a person’s control — age, coexisting illness or vaccine history, for example.

“It’s not there aren’t other influences,” she said, “but it looks like how you’re feeling on the day you’re vaccinated may be among the more important.”

First off, the confident statement at the end seems to contradict the caveats two paragraphs earlier. Second, I question the implication that one’s mood is “under a person’s control.” How does that work, exactly?

Beyond all this are the usual statistical problems of noise. From the research article:

One hundred and thirty-eight community-dwelling older adults aged 65–85 were recruited through 4 primary care practices in Nottingham, UK. A priori sample size calculations based on observed effects of stress on vaccine response in elderly caregivers (Vedhara et al., 1999) indicated a sample of 121 would give 80% power at 5% significance to detect a similar small-to-medium sized effect (r = 0.25) in individual regression models.

This is the familiar “power = .06” disaster: take an overestimated effect size from a previous noisy study, then design a new study under these unrealistic assumptions. Bad news all around.
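To see the arithmetic, here is a stdlib-Python sketch using the usual Fisher-z approximation for a test of a correlation (the two-sided 5% critical value is hard-coded): the design has about 80% power under the assumed r = 0.25, but if the true effect is much smaller, say r = 0.05, the same design has power well under 10%.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_for_r(r_true, n):
    """Approximate power of a two-sided 5% test of a correlation,
    using the Fisher z-transform: atanh(r) has s.e. 1 / sqrt(n - 3)."""
    z = math.atanh(r_true) * math.sqrt(n - 3)
    zcrit = 1.959964                  # two-sided 5% critical value
    return norm_cdf(z - zcrit) + norm_cdf(-z - zcrit)

print(power_for_r(0.25, 121), power_for_r(0.05, 121))
```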

On the plus side, this is a study for which it would be easy enough to do a preregistered replication. I recommend the authors of the above-cited study start thinking up their alibis right now for the anticipated replication failure.

**P.S.** As usual, let me repeat that, yes, this effect *could* be real and replicable. And I’ll believe it once I see real evidence. Not before.

**P.P.S.** I learned about this paper on 25 Sep, right around when everyone’s getting their flu shots. But I posted it on a delay so it’s not appearing until mid-April.

Why delay my post on this timely topic?

Here’s why. If I keep quiet, this research might make people happy, which in turn will boost their flu shots’ effectiveness. But if I post, I’d be duty-bound to criticize this research as just another bit of noise-mining. This would make people sad, which in turn would decrease their flu shots’ effectiveness. Thus, by posting right away, I could be making people unhealthy, even maybe killing them! So, ethically speaking, I have no choice but to delay my post until April, when flu season is over, which also happens, coincidentally, to be the next spot in the blog queue.


The post loo 2.0 is loose appeared first on Statistical Modeling, Causal Inference, and Social Science.

We’re happy to announce the release of v2.0.0 of the **loo** R package for efficient approximate leave-one-out cross-validation (and more). For anyone unfamiliar with the package, the original motivation for its development is in our paper:

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Statistics and Computing. 27(5), 1413–1432. doi:10.1007/s11222-016-9696-4. (published version, arXiv preprint)

Version 2.0.0 is a major update (release notes) to the package that we’ve been working on for quite some time and in this post we’ll highlight some of the most important improvements. Soon I (Jonah) will follow up with a post about important new developments in our various other R packages.

**New interface, vignettes, and more helper functions to make the package easier to use**

Because of certain improvements to the algorithms and diagnostics (summarized below), the interfaces, i.e., the `loo()` and `psis()` functions and the objects they return, also needed some improvement. (Click on the function names in the previous sentence to see their new documentation pages.) Other related packages in the Stan R ecosystem (e.g., **rstanarm**, **brms**, **bayesplot**, **projpred**) have also been updated to integrate seamlessly with **loo** v2.0.0. (Apologies to anyone who happened to install the update during the short window between the **loo** release and when the compatible rstanarm/brms binaries became available on CRAN.)

Three vignettes now come with the **loo** package and are also available (and more nicely formatted) online at mc-stan.org/loo/articles:

- *Using the loo package (version >= 2.0.0)* (view)
- *Bayesian Stacking and Pseudo-BMA weights using the loo package* (view)
- *Writing Stan programs for use with the loo package* (view)

A vignette about K-fold cross-validation using new K-fold helper functions will be included in a subsequent update. Since the last release of **loo** we have also written a paper, Visualization in Bayesian workflow, that includes several visualizations based on computations from **loo**.

**Improvements to the PSIS algorithm, effective sample sizes and MC errors**

The approximate leave-one-out cross-validation performed by the **loo** package depends on Pareto smoothed importance sampling (PSIS). In **loo** v2.0.0, the PSIS algorithm (the `psis()` function) corresponds to the algorithm in the most recent update to our PSIS paper, including adapting the Pareto fit with respect to the effective sample size and using a weakly informative prior to reduce the variance for small effective sample sizes. (I believe we’ll be updating the paper again with some proofs from new coauthors.)

For users of the **loo** package for PSIS-LOO cross-validation and not just the PSIS algorithm for importance sampling, an even more important update is that the latest version of the same PSIS paper referenced above describes how to compute the effective sample size estimate and Monte Carlo error for the PSIS estimate of `elpd_loo` (expected log predictive density for new data). Thus, in addition to the Pareto k diagnostic (an indicator of convergence rate – see paper) already available in previous **loo** versions, we now also report an effective sample size that takes into account both the MCMC efficiency and the importance sampling efficiency. Here’s an example of what the diagnostic output table from **loo** v2.0.0 looks like (the particular intervals chosen for binning are explained in the papers and also the package documentation):

```
Pareto k diagnostic values:
                          Count  Pct.   Min. n_eff
 (-Inf, 0.5]  (good)        240  91.6%         205
  (0.5, 0.7]  (ok)            7   2.7%          48
    (0.7, 1]  (bad)           8   3.1%           7
    (1, Inf)  (very bad)      7   2.7%           1
```

We also compute and report the Monte Carlo SE of `elpd_loo` to give an estimate of the accuracy. If some k > 1 (which means the PSIS-LOO approximation is not reliable, as in the example above), NA will be reported for the Monte Carlo SE. We hope that showing the relationship between the k diagnostic, effective sample size, and MC SE of `elpd_loo` will make it easier to interpret the diagnostics than in previous versions of **loo** that only reported the k diagnostic. This particular example is taken from one of the new vignettes, which uses it as part of a comparison of unstable and stable PSIS-LOO behavior.
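As a rough illustration of what the k diagnostic measures, the sketch below uses a crude Hill-type tail estimate (Python; this is only a stand-in for the generalized Pareto fit that **loo** actually performs): heavier-tailed importance ratios give a larger k-hat, signaling an unreliable importance-sampling estimate.

```python
import numpy as np

def hill_khat(ratios, tail_frac=0.2):
    """Crude tail-shape estimate for importance ratios: the average log
    excess of the top draws over the tail threshold.  A rough stand-in
    for the generalized Pareto shape estimate used by PSIS."""
    w = np.sort(np.asarray(ratios, dtype=float))
    m = max(2, int(tail_frac * len(w)))
    tail = w[-m:]
    return float(np.mean(np.log(tail / tail[0])))

rng = np.random.default_rng(5)
light = rng.exponential(size=5000)         # light-tailed ratios
heavy = rng.pareto(1.5, size=5000) + 1.0   # Pareto tail, shape 1/1.5
```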

**Weights for model averaging: Bayesian stacking, pseudo-BMA and pseudo-BMA+**

Another major addition is the `loo_model_weights()` function, which, thanks to the contributions of Yuling Yao, can be used to compute weights for model averaging or selection. `loo_model_weights()` provides a user-friendly interface to the new `stacking_weights()` and `pseudobma_weights()` functions, which are implementations of the methods from Using stacking to average Bayesian predictive distributions (Yao et al., 2018). As shown in the paper, Bayesian stacking (the default for `loo_model_weights()`) provides better model averaging performance than “Akaike style” weights; however, the **loo** package does also include Pseudo-BMA weights (PSIS-LOO based “Akaike style” weights) and Pseudo-BMA+ weights, which are similar to Pseudo-BMA weights but use a so-called Bayesian bootstrap procedure to better account for the uncertainties. We recommend the Pseudo-BMA+ method instead of, for example, WAIC weights, although we prefer the stacking method to both. In addition to the Yao et al. paper, the new vignette about computing model weights demonstrates some of the motivation for our preference for stacking when appropriate.

**Give it a try**

You can install **loo** v2.0.0 from CRAN with `install.packages("loo")`. Additionally, reinstalling an interface that provides **loo** functionality (e.g., **rstanarm**, **brms**) will automatically update your **loo** installation. The **loo** website with online documentation is mc-stan.org/loo and you can report a bug or request a feature on GitHub.


The post Taking perspective on perspective taking appeared first on Statistical Modeling, Causal Inference, and Social Science.

I thought you might be interested in this paper with Gabor Kezdi of U Michigan and Peter Kardos of Bloomfield College, about an online intervention reducing anti-Roma prejudice and far-right voting in Hungary through a role-playing game.

The paper is similar to some existing social psychology studies on perspective taking but we made an effort to improve on the credibility of the analysis by (1) using a relatively large sample (2) registering and following a pre-analysis plan (3) using pre-treatment measures to explore differential attrition and (4) estimating long term effects of the treatment. It got desk-rejected from PNAS and Psych Science but was just accepted for publication in APSR.

I have not had a chance to read the paper carefully. But, just speaking generally, I agree with Simonovits that: (1) a large sample can’t hurt, (2) preregistration makes this sort of result much more believable, (3) using pre-treatment variables can be crucial in getting enough precision to estimate what you care about, and (4) richer outcome measures can help a lot.

Also, whassup. No graphs??


The post Generable: They’re building software for pharma, with Stan inside. appeared first on Statistical Modeling, Causal Inference, and Social Science.

We’ve just launched our new website.

Generable is where precision medicine meets statistical machine learning.

We are building a state-of-the-art platform to make individual, patient-level predictions for safety and efficacy of treatments. We’re able to do this by building Bayesian models with Stan. We currently have pilots with AstraZeneca, Sanofi, and University of Marseille. We’re particularly interested in small clinical trials, like in rare diseases or combination therapies. If anyone is interested, they can reach Daniel at daniel@generable.com.

I’ve been collaborating with Daniel for many years and I’m glad to hear that he and his colleagues are doing this work. It’s my impression that in many applied fields, pharmacometrics included, there’s a big need for systems that allow users to construct open-ended models, using prior information and hierarchical models to regularize inferences and thus allow the integration of multiple relevant data sources in making predictions. As Daniel implies in his note above, Bayesian tools are particularly relevant where data are sparse.


The post Fixing the reproducibility crisis: Openness, Increasing sample size, and Preregistration ARE NOT ENUF!!!! appeared first on Statistical Modeling, Causal Inference, and Social Science.

One of the most exciting things to happen during the years-long debate about the replicability of psychological research is the shift in focus from providing evidence that there is a problem to developing concrete plans for solving those problems. . . . I’m hopeful and optimistic that future investigations into the replicability of findings in our field will show improvement over time.

Of course, many of the solutions that have been proposed come with some cost: Increasing standards of evidence requires larger sample sizes; sharing data and materials requires extra effort on the part of the researcher; requiring replications shifts resources that could otherwise be used to make new discoveries. . . .

This is all fine, but, BUT, honesty and transparency are not enough! Even honesty, transparency, replication, and large sample size are not enough. You also need good measurement, and some sort of good theory. Otherwise you’re just moving around desk chairs on the . . . OK, you know where I’m heading here.

Don’t get me wrong. Sharing data and materials is a good idea in any case; replication of some sort is central to just about all of science, and larger sample sizes are fine too. But if you’re not studying a stable phenomenon that you’re measuring well, then forget about it: all those good steps of openness, replication, and sample size will just be expensive ways of learning that your research is no good.

I’ve been saying this for a while so I know this is getting repetitive. See, for example, this post from yesterday, or this journal article from a few months back.

But I feel like I need to keep on screaming about this issue, given that well-intentioned and thoughtful researchers still seem to be missing it. I really really really don’t want people going around thinking that, if they increase their sample size and keep open data and preregister, they’ll solve their replication problems. Eventually, sure, enough of this and they’ll be so demoralized that maybe they’ll be motivated to improve their measurements. But why wait? I recommend following the recommendations in section 3 of this paper right away.

The post “Bit by Bit: Social Research in the Digital Age” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I really like the division into Observing Behavior, Asking Questions, Running Experiments, and Mass Collaboration (I’d remove the word “Creating” from the title of that section). It seemed awkward for Ethics to be in its own section rather than being sprinkled throughout the book, but in any case it’s a huge plus to have any discussion of ethics at all. I’ve written a lot about ethics but very little of this has made its way into my textbooks so I appreciate that Matt did this.

Also I suggested three places where the book could be improved:

1. On page xiv, Matt writes, “I’m not going to be critical for the sake of being critical.” This seems like a straw man. Just about nobody is “critical for the sake of being critical.” For example, if I criticize junk science such as power pose, I do so because I’m concerned about waste of resources, about bad incentives (positive press and top jobs for junk science motivates students to aim for that sort of thing themselves), I’m concerned because the underlying topic is important and it’s being trivialized, I’m concerned because I’m interested in learning about human interactions, and pointing out mistakes is one way we learn, and criticism is also helpful in revealing underlying principles of research methods: when we learn how things can seem so right and go so wrong, that can help us move forward. Matt writes that he’s “going to be critical so that [he] can help you create better research.” But that’s the motivation of just about *every* critic. I have no problem with whatever balance Matt happens to choose between positive and negative examples; I just think he may be misunderstanding the reasons why people criticize mistakes in social research.

2. On pages 136 and 139, Matt refers to non-probability sampling. Actually, just about every real survey is a non-probability sample. For a probability sample, it is necessary that everyone in the population has a nonzero probability of being in the sample, and that these probabilities are known. Real polls have response rates under 10%, and there’s no way of knowing or even really defining what is the response probability for each person in the sample. Sometimes people say “probability sample” when they mean “random digit dialing (RDD) sample”, but an RDD sample is not actually a probability sample because of nonresponse.

3. In the ethics section, I’d like a discussion of the idea that it can be an ethics violation to do low-quality research; see for example here, here, and here. In particular, high-quality measurement (which Matt discusses elsewhere in his book) is crucial. A researcher can be a wonderful, well-intentioned person, follow all ethical rules, IRB and otherwise—but if he or she takes crappy measurements, then the results will be crap too. Couple that with standard statistical practices (p-values etc.) and the result is junk science. Which in my view is unethical. To do a study and *not* consider data quality, on the vague hope that something interesting will come out and you can publish it, is unethical in that it is an avoidable pollution of scientific discourse.

Anyway, I think it will make an excellent textbook. I mentioned 3 little things that I think could be improved, but I could list 300 things in it that I love. It’s a great contribution.

The post It’s all about Hurricane Andrew: Do patterns in post-disaster donations demonstrate egotism? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I took a quick look and didn’t notice anything clearly wrong with the paper, but there did seem to be some opportunities for forking paths, in that the paper seemed to be analyzing only a small selection of relevant data on the question they were asking.

I wrote that I’m open to the possibility that this is real, also open to the possibility that it’s not.

Windle replied:

That was my take as well. Human psychology is certainly strange enough that it’s possible, but human psychology is strange enough to allow seeing effects where there are none.

Well put.

The person I’d really want to ask about this one is Uri Simonsohn. He’s the one who wrote that paper several years ago carefully shooting down every claim from the dentists-named-Dennis article.

The post Tools for detecting junk science? Transparency is the key. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Exposure to nonionizing radiation used in wireless communication remains a contentious topic in the public mind—while the overwhelming scientific evidence to date suggests that microwave and radio frequencies used in modern communications are safe, public apprehension remains considerable. A recent article in Child Development has caused concern by alleging a causative connection between nonionizing radiation and a host of conditions, including autism and cancer. This commentary outlines why these claims are devoid of merit, and why they should not have been given a scientific veneer of legitimacy. The commentary also outlines some hallmarks of potentially dubious science, with the hope that authors, reviewers, and editors might be better able to avoid suspect scientific claims.

The article in question is, “Electromagnetic Fields, Pulsed Radiofrequency Radiation, and Epigenetics: How Wireless Technologies May Affect Childhood Development,” by Cindy Sage and Ernesto Burgio. I haven’t read the two articles in detail, but Grimes and Bishop’s critique seems reasonable to me; I have no reason to believe the claims of Sage and Burgio, and indeed the most interesting thing there is that this article, which has no psychology content, was published in the journal Child Development. Yes, the claims in that article, if true, would indeed be highly relevant to the topic of child development—but I’d expect an article such as this to appear in a journal such as Health Physics whose review pool is more qualified to evaluate it.

How did that happen? The Sage and Burgio article appeared in a “Special Section on Contemporary Mobile Technology and Child and Adolescent Development, edited by Zheng Yan and Lennart Hardell.” And if you google Lennart Hardell, you’ll see this:

Lennart Hardell (born 1944), is a Swedish oncologist and professor at Örebro University Hospital in Örebro, Sweden. He is known for his research into what he says are environmental cancer-causing agents, such as Agent Orange, and has said that cell phones increase the risk of brain tumors.

So now we know how the paper got published in Child Development.

Of more interest, perhaps, are the guidelines that Grimes and Bishop give for evaluating research claims:

I’m reminded of another article by Dorothy Bishop, this one written with Stephen Lewandowsky a couple years ago, giving red flags for research claims.

As I wrote back then, what’s important to me is not peer review (see recent discussion) but transparency. And several of the above questions (#3, #4, #7, and, to some extent, #8 and #9) are about transparency. So that could be a way forward.

Not that all transparent claims are correct—of course, you can do a crappy study, share all your data, and still come to an erroneous conclusion—but I think transparency is a good start, as lots of the problems with poor data collection and analysis can be hidden by lack of transparency. Just imagine how many tens of thousands of person-years of wasted effort could’ve been avoided if that pizzagate guy had shared all his data and code from the start.

The post Do Statistical Methods Have an Expiration Date? (my talk noon Mon 16 Apr at the University of Pennsylvania) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

There is a statistical crisis in the human sciences: many celebrated findings have failed to replicate, and careful analysis has revealed that many celebrated research projects were dead on arrival in the sense of never having sufficiently accurate data to answer the questions they were attempting to resolve. The statistical methods which revolutionized science in the 1930s-1950s no longer seem to work in the 21st century. How can this be? It turns out that when effects are small and highly variable, the classical approach of black-box inference from randomized experiments or observational studies no longer works as advertised. We discuss the conceptual barriers that have allowed researchers to avoid confronting these issues, which arise in psychology, policy research, public health, and other fields. To do better, we recommend three steps: (a) designing studies based on a perspective of realism rather than gambling or hope, (b) higher quality data collection, and (c) data analysis that combines multiple sources of information.

Some of the material in the talk appears in our recent papers, The failure of null hypothesis significance testing when studying incremental changes, and what to do about it and Some natural solutions to the p-value communication problem—and why they won’t work.

The talk is at 340 Huntsman Hall.

The post Failure of failure to replicate appeared first on Statistical Modeling, Causal Inference, and Social Science.

Too much here to digest probably, but the common theme is—what if people start saying their work “replicates” or “fails to replicate” when the studies in question are massively underpowered &/or have significantly different design (& sample) from target study?

1. Kahan, after discovering that authors claim my study “failed to replicate”:

On Thu, Aug 10, 2017 at 6:37 PM, Dan Kahan <dan.kahan@yale.edu> wrote:

Hi, Steve & Cristina.

So predictably, people are picking up on your line that “[you] failed to replicate Kahan et al.’s ‘motivated numeracy effect.’” As we have discussed, your study differed from ours in various critical respects, including N & variance of sample in numeracy & ideology. I think it is misleading to say one found no “replication” when study designs differ. All the guidelines on replication make this point.

2. Them–acknowledging this point

co-author 2

Hi Dan,

If we didn’t, we should have said “conceptual replication.”

I certainly agree we didn’t fail to replicate any specific study of yours. And we could have had a bigger N and more conservatives. That’s why we haven’t tried to publish the work in a journal, just a conference proceedings. But, as appealing as the hypothesis is, Cristina’s work does leave me with less faith in the general rule that more numerate people engage in more motivated reasoning using the contingency table task.

best, s

lead author:

Hi Dan,

I agree– we should have used a phrase other than “replication” in describing those parts of the results.

To add, I tried to make it clear in our poster presentation, as well as in our paper, that the effect of reasons we found was not predicated on the existence of the motivated numeracy effect. And I explicitly noted that this null result was likely attributable to the differences between the two studies– in fact, many people I talked to pointed out the difference in N and the differences in variance on their own.

Cristina

3. Kahan—trying to figure out how they can acknowledge somewhere that their studies *aren’t* commensurable w/ ours & it was a mistake to assert otherwise:

Hi, Steve & Cristina.

Thanks for reflecting on my email & responding so quickly. I am left, however, with the feeling that your willingness to acknowledge my points in private correspondence doesn’t address my objection to the fairness of what you have done in an open scholarly forum. You *have* “published” your paper in the proceedings collection. The abstract of your paper states expressly “we failed to replicate Kahan et al.’s ‘motivated numeracy effect.’” In the text you state that you “attempted to replicate” our study and “failed to find a significant effect of motivated numeracy.”

The perfectly foreseeable result is that readers are now treating your study as a “failed replication” attempt, notwithstanding your acknowledgement to me that such a conclusion “clearly,” “definitely” isn’t warranted. Expecting them to “figure this out” for themselves isn’t realistic given the limited time & attention span of casual readers, and the lure of the abstract.

I think the fair thing to do would be to remove the references to “failed replication” and to acknowledge *in the paper* that your design — because of the N *and* because of the lack of variance in ideology & numeracy in the study subjects — was not suited for testing the replicability of ours.

Anything short of this puts *me* in the position of bearing the burden of explaining away your expressly stated conclusion that our study results “didn’t replicate.” Because my advocacy would be discounted as self-serving, I would suffer an obvious rhetorical handicap. What’s more, I’d be forced to spend considerable time on this at the expense of other projects I am working on.

Avoiding such unfairness informs the protocols for replication that various scholars and groups of scholars have compiled and that guided the *Science* article. I’m sure you agree that this is good practice & hope you will accommodate me & my co-authors on this.

–Dan

4. Co-author tells me I should feel “honored” that they examined my work & otherwise shut up; also, “replication” carries no special meaning that justifies my focus on it…

Dear Dan,

I will speak for myself, not Cristina.

You seem to have misunderstood my email. I am not taking back our claim that we failed to replicate. What I said is that I admitted that we could have characterized it as a failure of a “conceptual replication.” This is still a type of replication. We were testing an hypothesis we derived from your paper, we used a similar experimental procedure though a wildly smaller N, which we tried to counterbalance by giving each subject more tasks to do. So we had more data than you per subject. We also only tested half your hypothesis in the sense that we didn’t have many conservatives. Nevertheless, we fully expected to see the same pattern of results that you found. But we didn’t; we found the opposite. We were surprised and disappointed but nevertheless decided to report the data in a public forum. I stand by our report even if you don’t like one of our verbs.

Even if we wanted to, we couldn’t deliver on your request. The proceedings have been published. It’s too late to change them. But the fact is that I wouldn’t want to change them anyway. Yes, we could have added the word “conceptual” in a couple of places. But that wouldn’t change the gist of the story. There are failures to replicate all the time. Ours is a minor study, reported in a minor venue. If people challenge you because of it, I’m sure you’re smart enough and have enough data to meet the challenge. I think you should consider it an honor that we took the time and made the effort to look at one boundary of your effect. If you feel strongly about it, then feel free to go out and explain why our data look the way they do. Simply saying our N was too small and our population too narrow explains nothing. We found some very systematic effects, not just random noise.

all the best, steve

Just to interrupt here, I agree with Dan that this seems wrong. Earlier, Steve had written, “I certainly agree we didn’t fail to replicate any specific study of yours,” and Cristina had written, “we should have used a phrase other than ‘replication’ in describing those parts of the results.” But now Steve is saying:

I stand by our report even if you don’t like one of our verbs.

I guess the verb here is “replicate”—but at the very least it’s not just Dan who doesn’t think that word is appropriate. It’s also Cristina, the first author of the paper!

The point here is not to do some sort of gotcha or to penalize Cristina in any way for being open in an email. Rather, it’s the opposite: the point is that Kahan is offering to help Steve and Cristina out by giving them a chance to fix a mistake they’d made—just as, earlier, Steve and Cristina were helping Dan out by doing conceptual replications of his work. It seems that those conceptual replications may have been too noisy to tell us much—but that’s fine too, we have to start somewhere.

OK, back to Dan:

5. So Dan writes attached paper w/ co-author: Futile gesture, no doubt.

Kahan concludes:

This is a case study in how replication can easily go off the rails. The same types of errors people make in non-replicated papers will now be used in replications.

The post The Millennium Villages Project: a retrospective, observational, endline evaluation appeared first on Statistical Modeling, Causal Inference, and Social Science.

The Millennium Villages Project (MVP) was a 10 year, multisector, rural development project, initiated in 2005, operating across ten sites in ten sub-Saharan African countries to achieve the Millennium Development Goals (MDGs). . . .

In this endline evaluation of the MVP, we retrospectively selected comparison villages that best matched the project villages on possible confounding variables. . . . we estimated project impacts as differences in outcomes between the project and comparison villages; target attainment as differences between project outcomes and prespecified targets; and on-site spending as expenditures reported by communities, donors, governments, and the project. . . .

Averaged across the ten project sites, we found that impact estimates for 30 of 40 outcomes were significant (95% uncertainty intervals [UIs] for these outcomes excluded zero) and favoured the project villages. In particular, substantial effects were seen in agriculture and health, in which some outcomes were roughly one SD better in the project villages than in the comparison villages. The project was estimated to have no significant impact on the consumption-based measures of poverty, but a significant favourable impact on an index of asset ownership. Impacts on nutrition and education outcomes were often inconclusive (95% UIs included zero). Averaging across outcomes within categories, the project had significant favourable impacts on agriculture, nutrition, education, child health, maternal health, HIV and malaria, and water and sanitation. A third of the targets were met in the project sites. . . .

It took us three years to do this retrospective evaluation, from designing sampling plans and gathering background data to designing the comparisons and performing the statistical analysis.

At the very beginning of the project, we made it clear that our goal was not to find “statistically significant” effects, and that we’d simply do our best and report what we found. Unfortunately, some of the results in the paper are summarized by statistical significance. You can’t fight City Hall. But we tried our best to minimize such statements.

In the design stage we did lots and lots of fake-data simulation to get a sense of what we might expect to see. We consciously tried to avoid the usual plan of gathering data, flying blind, and hoping for good results.

You can read the article for the full story. Also, published in the same issue of the journal:

– The perspective of Jeff Sachs, leader of the Millennium Village Project,

– An outside evaluation of our evaluation, from Eran Bendavid.

The post Fitting a hierarchical model without losing control appeared first on Statistical Modeling, Causal Inference, and Social Science.

I have been asked to run some regularized regressions on a small-N, high-p situation, which for the primary outcome has led to more realistic coefficient estimates and better performance on cv (yay!). Rstanarm made this process very easy for me so I am grateful for it.

I have now been asked to run a similar regression on a set of exploratory analyses where authors are predicting the results of 4 subscales of the same psychological test. Given the small sample and opportunity for type M and S errors I had originally thought of trying to specify a multivariate normal model, but then remembered your paper on why we don’t usually worry about multiple comparisons.

I am new to translating written notation of multilevel models into R code, but I’m wondering if I’m understanding your eight schools with multiple outcomes example properly. Would the specification in lmer just be:

y ~ 1 + (1 + B1 + B2 | outcome)

Where outcome is my factor of subscales, y is the standardized test outcome, and B1 and B2 are standardized slopes I want to allow to vary by subgroup? This seems to make sense to me in that it’s coding my belief that the slopes between subgroups are similar (and thus hopefully pulling extreme estimates closer to the overall mean), but it seems too easy, so I figure I must be doing something wrong. The results also end up leading to switching signs in the coefficients when compared against the no-pooling results. Not sure whether to be excited about potentially avoiding a type S error, or scared that I’ve stuffed up the whole analysis!

My reply:

That looks almost right to me as a starting point, but one thing it’s missing is the idea that the 4 subscales could be correlated. Perhaps people with higher scores on subscale 1 also tend to have higher scores on subscale 2, for example?

How best to model the correlation? It depends on what these subscales are doing. Most general is a 4×4 covariance matrix (which, incidentally, allows the variances to be different for the different subscales, something not allowed in your model above), but some sort of item response model could make sense if you think all the subscales are measuring related things.

In any case, I guess you could start with the model above but then I’d move to fitting a multivariate-outcome model in Stan.

Finally, regarding the larger question of making sure that your model is doing what you think it’s supposed to be doing: I very much recommend fake data simulation. Set up your model, do a forward simulation and create fake data, then fit the model to your fake data and check that the results make sense and are consistent with what you were assuming.
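That fake-data workflow is mechanical enough to sketch. Here is a minimal version in Python rather than the correspondent’s R (group sizes, means, and the group-level covariance are all invented for illustration): draw varying intercepts and slopes from a known covariance, forward-simulate the data, then confirm that per-group fits land near the assumed truth.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 4 subscales, 200 observations each
n_groups, n_obs = 4, 200

# Group-level intercepts and slopes drawn from a multivariate normal,
# so the subscales' parameters can be correlated (cf. the 4x4 covariance above)
mu = np.array([0.0, 0.5])            # mean intercept, mean slope
sigma = np.array([[0.3, 0.1],
                  [0.1, 0.2]])       # assumed group-level covariance
params = rng.multivariate_normal(mu, sigma, size=n_groups)

# Forward-simulate fake data from the assumed model
x = rng.normal(size=(n_groups, n_obs))
y = params[:, :1] + params[:, 1:] * x + rng.normal(scale=0.5, size=(n_groups, n_obs))

# Fit each group by least squares and compare against the known truth
est = np.array([np.polyfit(x[g], y[g], 1)[::-1] for g in range(n_groups)])
```

With a real analysis one would instead refit the same rstanarm/lmer specification to the fake data and check that the posterior recovers the assumed parameters, not just the per-group point estimates.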

The post Don’t define reproducibility based on p-values appeared first on Statistical Modeling, Causal Inference, and Social Science.

I just got asked to comment on this article [“Genotypic variability enhances the reproducibility of an ecological study,” by Alexandru Milcu et al.]—I have yet to have time to fully sort out their stats but the first thing that hit me about it was they seem to be suggesting that a way to increase reproducibility is to increase some aspect that leads to important variation in the experiment (like genotypic variation in plants, which we know is important). But that doesn’t seem to make sense!

My response:

Regarding the general issue, I had a conversation with Paul Rosenbaum once about choices in design of experiments, where one can decide to perform: (a) a focused experiment with very little variation on x, which should improve precision but harm generalizability; or (b) a broader experiment in which one purposely chooses a wide range of x, which should reduce precision in estimation but allow the thing being estimated to be more relevant for out-of-sample applications. That sounds related to what’s going on here.
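That tradeoff has a simple quantitative core: with fixed noise, the sampling variance of a least-squares slope scales as 1/Var(x), so restricting the range of x buys local precision at the cost of a noisier (and less generalizable) slope estimate. A toy check, with all numbers invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_sd(x_scale, n=50, reps=2000, noise=1.0, beta=1.0):
    """Monte Carlo sd of the OLS slope when x spans a narrow or wide range."""
    ests = []
    for _ in range(reps):
        x = rng.uniform(-x_scale, x_scale, size=n)
        y = beta * x + rng.normal(scale=noise, size=n)
        ests.append(np.polyfit(x, y, 1)[0])  # fitted slope
    return np.std(ests)

narrow = slope_sd(x_scale=0.5)  # focused design (a): little variation in x
wide = slope_sd(x_scale=2.0)    # broad design (b): wide range of x
```

Quadrupling the spread of x cuts the slope’s sampling sd by roughly a factor of four, which is the precision the focused design gives up in exchange for staying close to the conditions of interest.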

Regarding this particular paper, I am finding the details hard to follow, in part because they aren’t always so clear in distinguishing between data and parameters. For example, they write, “the net legume effect on mean total plant biomass varied among laboratories from 1.31 to 6.72 g dry weight (DW) per microcosm in growth chambers, suggesting that unmeasured laboratory-specific conditions outweighed effects of experimental standardization.” But I assume they are referring not to the effect but to the estimated effect, so that some of this variation could be explained as estimation error.
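To see how much of that spread estimation error alone could produce, here is a hypothetical simulation (the numbers are made up, not from the Milcu et al. paper): give every lab the identical true effect and look at how far apart their estimates land.

```python
import numpy as np

rng = np.random.default_rng(7)

# 14 labs all measure the SAME true effect; each estimate carries
# sampling error (a standard error of 1.5, chosen arbitrarily)
true_effect = 4.0
estimates = true_effect + rng.normal(scale=1.5, size=14)

spread = estimates.max() - estimates.min()
print(f"estimates span {spread:.2f} units despite zero lab-to-lab variation")
```

So a range of estimated effects across laboratories does not by itself demonstrate real lab-to-lab variation; the spread has to be judged against the standard errors of the individual estimates.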

I also find it frustrating to read a paper about replication in which decisions are made based on statistical significance; for example, see lines 174-184 of text, and, even more explicitly, on lines 187-188: “To answer the question of how many laboratories produced results that were statistically indistinguishable from one another (i.e. reproduced the same finding) . . .”

Also there are comparisons of significance and non-significance, for example this: “Introducing genotypic CSV increased reproducibility in growth chambers but not in glasshouses,” followed by post-hoc explanations: “This observation is in line with the hypothesis put forward by Richter et al. . . .”

This is not to say that the claims in this paper are wrong, just that I’m finding it difficult to make sense of this paper and understand exactly what they mean by reproducibility, which is never defined in the paper.

Lizzie replied:

Yes, the theme of the paper seems to be, “When all you care about is an asterisk above your bargraph in one paper, but no asterisks when you compare papers.” They also do define reproducibility: “Because we considered that statistically significant differences among the 14 laboratories would indicate a lack of reproducibility….”

I guess what we’re saying here is that reproducibility is important, but defining it based on p-values is a mistake, it’s kinda sending you around in circles.

The post How jet lag impairs major league statistical performance appeared first on Statistical Modeling, Causal Inference, and Social Science.

Last August you wrote about [1] a PNAS paper that looked at “jet lag” and a bunch of metrics across twenty MLB seasons. I’ve played around with incorporating their measure of jet lag into a model of run differentials [2], working from your posts about estimating team abilities in soccer [3-5]. I don’t think the model I came up with is particularly useful. Assuming that I didn’t make any stupid mistakes, the model doesn’t do a good job of estimating home field advantage, which makes me question all of its estimates, including the ones for the jet lag parameters.

But, anyway, perhaps others will be interested in the data set that I generated [6]. As far as I can tell, the authors of the original study didn’t release the lag values or the code they used to generate them. Based on my attempts to reproduce their summary tables [7], I think my set of lag values is pretty similar.

[1]: http://andrewgelman.com/2017/08/04/hadnt-jet-lag-junior-certainly-wouldve-banged-756-hrs-career/

[2]: https://kyleam.github.io/mlb-rundiff/

[3]: http://andrewgelman.com/2014/07/13/stan-analyzes-world-cup-data/

[4]: http://andrewgelman.com/2014/07/15/stan-world-cup-update/

[5]: http://andrewgelman.com/2017/05/17/using-stan-week-week-updating-estimated-soccer-team-abilites/

[6]: https://kyleam.github.io/mlb-rundiff/log-with-lags-cleaned.csv.gz

[7]: https://kyleam.github.io/mlb-rundiff/lag-calculation-checks.html
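The kind of model the correspondent describes can be sketched in miniature. This is a hypothetical toy (invented team abilities, home advantage, and lag effect; nothing from the linked data set), just to show that home advantage and a lag coefficient are in principle recoverable from run differentials by least squares:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: run_diff = home_adv + ability[home] - ability[away] + beta_lag * lag + noise
n_teams, n_games = 30, 5000
ability = rng.normal(scale=0.3, size=n_teams)
home_adv, beta_lag = 0.2, -0.1  # invented "true" values

home = rng.integers(n_teams, size=n_games)
away = (home + 1 + rng.integers(n_teams - 1, size=n_games)) % n_teams  # away != home
lag = rng.integers(0, 4, size=n_games)  # crude "time zones crossed" by the away team
run_diff = (home_adv + ability[home] - ability[away]
            + beta_lag * lag + rng.normal(scale=3.0, size=n_games))

# Recover home advantage and the lag coefficient by least squares; team
# abilities are identified only up to a constant, which lstsq handles
X = np.zeros((n_games, n_teams + 2))
X[np.arange(n_games), home] += 1.0
X[np.arange(n_games), away] -= 1.0
X[:, n_teams] = 1.0        # home-advantage intercept
X[:, n_teams + 1] = lag
coef, *_ = np.linalg.lstsq(X, run_diff, rcond=None)
est_home_adv, est_beta_lag = coef[n_teams], coef[n_teams + 1]
```

In practice one would instead fit this with partial pooling on the team abilities in Stan, as in the soccer posts the correspondent links above.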

The post A possible defense of cargo cult science? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’ve been a follower of your blog and your continual coverage of “cargo cult science”. Since this type of science tends to be more influential and common than the (idealized) non-“cargo cult” stuff, I’ve been trying to find ways of reassuring myself that this type of science isn’t a bad thing (because if it is a bad thing, then academia itself is entering a Dark Age that it’ll never recover from).

I suppose an alternative is to hope that “cargo cult science” diminishes in size and influence, but I’m not an optimist, so I’ll take the rationalization approach.

On your blog, you previously mentioned the placebo effect that this type of research can cause. If power pose helps people, then it’s a good thing, even if the underlying research is bunk. I’ve recently thought about another way by which junk science could be useful: cheap decision-making.

There’s an XKCD comic where Strategy A and Strategy B are considered, but the time spent finding the better strategy is way more than the time actually implementing either of the two strategies. It would make more sense if, say, we flipped a coin and blindly followed whatever that coin says. It doesn’t really matter in the grand scheme of things.

Could junk science be seen as a way of saving people time as well? Instead of being paralyzed on the most efficient way of encouraging voter turnout, just look up some junk science and follow whatever it recommends you to do.

Yeah, you could probably increase the effectiveness of your voter turnout operation if you instead followed best practices, but coming up with accurate results can be incredibly expensive. Plus you need to have a community of researchers also replicate the results as well, just to make sure that this is indeed the most efficient approach to increasing voter turnout. The costs keep piling up, while the benefits of picking the optimal strategy are fairly minimal.

The junk science’s recommendations are just cheaper and more scalable than doing things the “right way”…as long as you don’t accidentally pick a strategy that significantly reduces voter turnout.

But the odds of stumbling upon a bad strategy are probably low…and even if you do pick one that reduces turnout, that reduction may be slight and not really worth worrying about. The most important thing is not what decision to make, but that a decision is made at all, so that you can move onto the hard part of actually implementing the strategy and mobilizing voters. (And if you do realize the strategy is bad, you can always throw away that “cargo cult science” paper and find a new “cargo cult science” paper to follow.)

This argument in favor of “cargo cult science” starts to fall apart when you try to apply this to the medical field…but that’s where you use the placebo argument instead.

Is there a flaw in my argument that I’m missing? Has this argument been made before in the comment sections of your blog and then dismissed by others? Or am I missing the point of research (which is to find out facts about the world, and not just make decisions)? I’m honestly curious, and would like your feedback.

My correspondent ends with a paradox:

P.P.S.: I know I probably could have answered my question by looking for peer-reviewed studies on “cargo cult science”. However, I’m afraid that most of those studies may very well be examples of “cargo cult science”.

I have a couple thoughts on this. First, the discussion of coin-flip decisions reminds me of this recent paper by Steven Levitt, “Heads or Tails: The Impact of a Coin Toss on Major Life Decisions and Subsequent Happiness,” which presented evidence supporting the idea that in many settings people would be better off making decisions using coin flips.

Second, I’ve blogged a bit on various potential benefits from cargo cult science. Here are a few ideas:

– In “cargo cult science,” the researchers’ ideas are being tested in a useless, unscientific way. But maybe some of the ideas are good. That suggests a division of labor in which the people who promote the ideas are separated from the people who test the ideas and from the people who study and present the evidence.

– I’ve also put forth the argument that cargo cult science can be useful in shaking people up, in getting scientists and practitioners to think about alternative explanations. Lots of ideas might be valid in some way without being easy to measure and test, and if we decide only to pursue ideas with strong scientific evidence, we could be missing out. Ultimately I think the right way to resolve this issue is not through misinterpretation of data and subsequent hype, but through decision analysis: Just as pharmaceutical companies will pursue some low-probability leads because there is some probability they could make it big, so should science and practice allow for experimentation.

– That said, junk science has social costs. So, yes, it’s not a bad idea to come up with clever ways in which junk science can be a good thing, but on balance I’d prefer that people would stop doing it and stop enabling it.

The post Learn by experimenting! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Sidenote: I know some people say you’re not supposed to use loops in R, but I’ve never been totally sure why this is (a speed thing?). My first computer language was Java, so my inclination is to think in loops before using some of the other R functions that iterate for you. Maybe I should practice more with those.

There’s an answer to why loops are slow in R; I think it has something to do with memory allocation and the creation of new variables.

But the “why” is not really my point here.

Rather, my point is that you can directly check if it’s a speed thing. Just run a simulation.

For example:

```r
N <- 1e5
date()
a <- rep(1, N)
date()
for (n in 1:N) a[n] <- 1
date()
```

OK, let's try it out:

```r
> N <- 1e5
> date()
[1] "Fri Sep 15 09:15:18 2017"
> a <- rep(1, N)
> date()
[1] "Fri Sep 15 09:15:18 2017"
> for (n in 1:N) a[n] <- 1
> date()
[1] "Fri Sep 15 09:15:18 2017"
```

Damn! It ran too fast. Let's up the N to 10 million:

```r
> N <- 1e7
> date()
[1] "Fri Sep 15 09:16:03 2017"
> a <- rep(1, N)
> date()
[1] "Fri Sep 15 09:16:03 2017"
> for (n in 1:N) a[n] <- 1
> date()
[1] "Fri Sep 15 09:16:07 2017"
```

Yup. The loop runs slower than the unlooped version.

Is it just that rep(1, N) is just some super-fast function? We'll try another:

```r
N <- 1e7
date()
a <- rnorm(N)
date()
for (n in 1:N) a[n] <- rnorm(1)
date()
```

Here's what happens:

```r
> N <- 1e7
> date()
[1] "Fri Sep 15 09:17:29 2017"
> a <- rnorm(N)
> date()
[1] "Fri Sep 15 09:17:29 2017"
> for (n in 1:N) a[n] <- rnorm(1)
> date()
[1] "Fri Sep 15 09:17:50 2017"
```

Again, the loop runs slower than the unlooped version.

The above is nothing definitive. The point is that, yes, a little experimentation can clarify our questions. The same principle applies in more complicated settings, for example this article which relates the results of experiments that had a big effect in the practice of computational statistics.

It's good to get in the habit of experimenting as a routine part of your workflow.
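The same check-by-experiment habit carries over to other languages. Here's a hypothetical sketch in Python (not from the post), using time.perf_counter(), which has much finer resolution than R's date(), so there's no need to inflate N until the seconds tick over:

```python
import time

N = 10_000_000

# Fill a list with ones in a single bulk operation
t0 = time.perf_counter()
a = [1] * N
t1 = time.perf_counter()

# Fill a list with ones using an explicit element-by-element loop
b = [0] * N
t2 = time.perf_counter()
for n in range(N):
    b[n] = 1
t3 = time.perf_counter()

print(f"single call: {t1 - t0:.3f}s, loop: {t3 - t2:.3f}s")
```

On a typical machine the explicit loop takes an order of magnitude longer than the single call, matching the pattern in the R experiment above.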

The post The all-important distinction between truth and evidence appeared first on Statistical Modeling, Causal Inference, and Social Science.

The published paper was called, “The more you play, the more aggressive you become: A long-term experimental study of cumulative violent video game effects on hostile expectations and aggressive behavior,” but actually that title was false: There was no long-term study. As reported in the article, the study lasted three days, and in each day the measurements of outcomes may well have been conducted immediately after the experimental treatment. The exact details were not made clear in the article, but there’s no way this was a long-term study.

Yesterday I discussed how this mistake could’ve made it through peer review. Today I want to talk more generally about incentives.

**But first, let’s step back a moment . . .**

Before discussing any further, let’s just consider how ridiculous this all is. A paper was published by legitimate researchers (one of them is the Margaret Hall and Robert Randal Rinehart Chair of Mass Communication at Ohio State University) in a legitimate journal (Journal of Experimental Social Psychology, impact factor 2.2), it received over 100 citations and national press coverage—and the title is flat-out wrong. (3 days is not “long term.”)

And, the most amazing thing of all . . . nobody noticed! An experimental science paper that mischaracterizes its experiment *in the title*—that’s roughly equivalent to a math paper declaring 2+2=5.

You know that saying, “The scandal isn’t what’s illegal, the scandal is what’s legal”?

Something similar here. The scandal is not that somewhere, in some journal, some authors screwed up and mis-titled their paper and the reviewers didn’t notice. Mistakes happen. I’ve messed up in published work in lots of different ways. No, the scandal is that this huge error was sitting there, in plain view, for five years! And nobody noticed. Or, I should say, if anybody noticed, I never heard about it. I guess that’s part of the problem right there, that it’s not so easy to correct the published record.

**Incentives, incentives**

My second-favorite bit of the above-linked article:

Another limitation is that our experiment lasted only three days. We wish we could have conducted a longer experimental study, but that was not possible for practical and ethical reasons.

Fine. If you can only do a 3-day study, just change “long-term” to “3 day” or, maybe, “5 minute” in the title of your paper. How hard is it, really, to just say what you really did? I guess, as commenters keep saying, it’s the incentives. Label your paper as a 3-day study and you might not get the citations and influence that you’ll get by calling it “long term.”

From the webpage of one of the authors of this paper:

So, consider three alternatives in designing and writing up this study:

1. Do a truly long-term study, following a group of people for a few years. Hmmm, that’s lots of work, don’t wanna do that!

2. Do a 3-day study, each day redoing the intervention and testing immediately after. This is inexpensive and likely to get solid results—but it’s not very interesting, will be hard to get published in a good journal and hard to get publicity later on.

3. Do a 3-day study, each day redoing the intervention and testing immediately after. But then put “long-term” in the title and hope for the best! This gives you all the convenience of the easy option, but with the potential for the major citations and media exposure that would be appropriate for an actual long-term study.

The incentives favor option 3.

**You don’t have to be a “bad guy” . . .**

Apparently these are the rules of the game, at least in some areas of science: You do an experiment which somewhere gives you statistical significance, you get the paper accepted at a journal, and then you misrepresent what you’ve learned, in the title of the paper and in the publicity material. (See the end of this post for an example where, in a single sentence of the publicity materials, one of the authors made 3 different claims, none of which are supported by the data in the research article.)

This sort of behavior is, in my opinion, destructive to science. But it happens all the time, to the extent that I doubt the authors even realized what they are doing. After all, they may well be personally convinced that their research hypotheses are true, thus, in their view, they may be saying true statements.

The idea that it may be true but it’s not supported by the data—the distinction between *truth* and *evidence*—that seems to be difficult for a lot of people.

I strongly doubt these researchers are *trying* to misrepresent their evidence. Indeed, once you become aware of the misrepresentation, and once you become aware of the distinction between truth and evidence, it becomes difficult to grossly misrepresent the evidence, if for no other reason than that it seems so obvious and embarrassing.

So, it’s not that these researchers really think that 3 days, or 5 minutes, is “long-term.” They just feel they’ve made a general discovery, they have no reason *not* to believe they’ve identified a long-term effect, and they’re reporting the truth, as they see it.

The trouble is that lots of outsiders—journalists and the general public, policymakers, and other scientists—might naively take these unsupported claims at face value, and think that Hasan et al. really did conduct “a long-term experimental study of cumulative violent video game effects on hostile expectations and aggressive behavior.” Which they didn’t.

There are so many incentives for researchers to misrepresent their data, and at the same time we have to deal with ostriches who say things like, “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” So, yeah, there’s a reason that we keep screaming about all this.

I don’t care so much about this particular paper, which I’d never even heard of until someone sent me an anonymous email about it. But I do care about the larger issue, which is what’s happened to the scientific literature, when it’s considered OK, and unremarkable, to misrepresent your study in *the title of your paper*.

Like a harbor clotted with sunken vessels.

The post More bad news in the scientific literature: A 3-day study is called “long term,” and nobody even seems to notice the problem. Whassup with that?? appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Long-term = 3 days??**

The biggest problem I see with this paper is in the title: “A long-term experimental study.” What was “long term,” you might wonder? 5 years? 10 years? 20 years? Were violent video games even a “thing” 20 years ago?

Nope. By “long-term” here, the authors mean . . . 3 days.

In addition, the treatment is re-applied each day. So we’re talking about immediate, short-term effects.

I’ve heard of short-term thinking, but this is ridiculous! Especially given that the lag between the experimental manipulation and the outcome measure is, what, 5 minutes? The time lag isn’t stated in the published paper, so we just have to guess.

3 days, 5 minutes, whatever. Either way it’s not in any way “long term.” Unless you’re an amoeba.

Oddly enough, a correction notice has already been issued for this paper but this correction says nothing about the problem with the title; it’s all about experimental protocols and correlation measures.

According to Google Scholar, the paper’s been cited 100 times! It has a good title (also, following Zwaan’s Rule #12, it features a celebrity quote), and it’s published in a peer-reviewed journal. I guess that’s enough.

**What happened in peer review?**

Hey, wait! The paper was peer reviewed! How did the reviewers not catch the problem?

Two reasons:

1. You can keep submitting a paper to journal after journal until it gets accepted. Maybe this article was submitted initially to the Journal of Experimental Social Psychology and got published right away; maybe it was sent a few other places first, in which case reviewers at earlier journals might’ve caught these problems.

2. The problem with peer review is the peers, who often seem to have the same blind spots as the authors.

I’d love to know who were the peer reviewers who thought that 3 days is a long-term study.

Here is my favorite sentence of the paper. It comes near the end:

The present experiment is not without limitations.

Ya think?

More tomorrow on the systemic problems that let this happen.

**The error bars in Figure 1**

Finally, let me return to the fun little technical point that got us all started—assessing the error bars in Figure 1.

Here’s the graph, with point estimates +/- 1 standard error:

Here’s the question: Are these error bars too narrow? Should we be suspicious?

And here’s the answer:

The responses seem to be on a 0-7 scale; if they’re uniformly distributed you’d see a standard deviation of approximately 7*sqrt(1/12) = 2.0. The paper says N = 70; that’s 35 in each group so then you’d see a standard error of 2.0/sqrt(35) = 0.34 which is, hmmm, a bit bigger than we see in the figure. It’s hard to tell exactly. But, for example, if you look at Day 1 on the top graph those two entire error bars fit comfortably between 3 and 4. It looks like they come to approximately 0.6 combined, so that each individual s.e. is about 0.15.

So the error bars are about half as wide as you’d expect to see if responses were uniformly distributed between 0 and 7. But they’re probably not uniformly distributed! The outcome being studied is some sort of average of coded responses, so it’s completely plausible that the standard error is on the order of half what you’d get from a uniform distribution.
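These back-of-the-envelope numbers are easy to verify with a couple of lines (a sketch, just redoing the arithmetic above):

```python
import math

# SD of a uniform distribution over an interval of width 7 (the 0-7 response scale)
sd_uniform = 7 * math.sqrt(1 / 12)   # about 2.02

# Standard error of a group mean with 35 respondents per group (N = 70, two groups)
se = sd_uniform / math.sqrt(35)      # about 0.34

print(round(sd_uniform, 2), round(se, 2))
```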

Thus, the error bars look ok. The surprising aspect of the graph is that the differences between groups are so large. But I guess that’s what happens when you do this particular intervention and measure these outcomes immediately after (or so it seems; the paper doesn’t say exactly when the measurements were taken).

The post “The Internal and External Validity of the Regression Discontinuity Design: A Meta-Analysis of 15 Within-Study-Comparisons” appeared first on Statistical Modeling, Causal Inference, and Social Science.

The RD bias is below 0.01 standard deviations on average, indicating RD’s high internal validity. When the study‐specific estimates are shrunken to capitalize on the information the other studies provide, all the RD causal estimates fall within 0.07 standard deviations of their RCT counterparts, now indicating high external validity. With unshrunken estimates, the mean RD bias is still essentially zero, but the distribution of RD bias estimates is less tight, especially with smaller samples and when parametric RD analyses are used.

Chaplin et al. are making two points:

1. The regression discontinuity estimates performed well, and this good performance could be checked by comparing to the estimates from the randomized clinical trial in each case.

2. Bayesian multilevel modeling with partial pooling made things even better.

I think of this paper as being similar to the classic Dehejia and Wahba paper on matching for observational studies. Dehejia and Wahba found that matching worked well, if it was done well, and they provided practical guidelines.

Similarly, in this new paper, Chaplin et al. found that regression discontinuity analysis performed well, in a set of examples where regression discontinuity analysis made sense.

I would’ve liked to have seen a scatterplot with 15 points, one for each study.

**Causal magic?**

When pointing me to this paper, Bhalla expressed concern that “Economists think regression discontinuity can evade statistical limits and perform causal magic.”

I don’t know what “economists” think in general, but I agree with Bhalla that at times it seems that practitioners think of regression discontinuity and other identification strategies as way of extracting causal inferences, but without thinking seriously about the required assumptions.

I don’t think the Chaplin et al. paper promotes that kind of magical thinking, but I see how it could be naively misinterpreted as “Regression Discontinuity Works” (the title of Tabarrok’s post). I’d rephrase this as, “Regression discontinuity can work well” or “Regression discontinuity works well when used appropriately.”

Here’s another example. In the comment thread of Tabarrok’s post:

wiki April 2, 2018 at 3:07 pm: I’m a bit more skeptical of RD based on historical data. We can’t do time travel RCT and a lot depends on being able to identify all the possible confounds and correcting for selection bias. Lots of current work doesn’t even adjust for human capital, biology, personality, cultural predisposition, or genes and just waves its hands about this. But this is the persistence RD that is hot in the development literature.

Sam April 2, 2018 at 3:37 pm: The whole point of RD is that you don’t need to identify confounds.

The first commenter is, broadly, correct; the second commenter is too confident. Or, to put it another way, the second commenter is correct in the settings where all the assumptions hold, but regression discontinuity is often applied in settings where the assumptions are off, in which case RD is little more than a crude, thoughtless regression adjustment for observational data.

Just to be clear: the paper *does* represent good news for regression discontinuity analysis and hierarchical Bayesian modeling. I’m not trying to imply otherwise. I’m just clarifying that, like Dehejia and Wahba, Chaplin et al. are finding that their method works well in well-chosen settings. That’s not nothing; it’s good to know; it shouldn’t be taken to imply much in settings where the regression discontinuity assumptions or the fitted model don’t make sense. Of course. But I better say it here just so people don’t overinterpret.

**Other issues**

I don’t really agree with this statement by the first author, though, “RD is generally acknowledged as the most rigorous non-experimental method for obtaining internally valid impact estimates.” The rigor of any statistical inference depends on some set of assumptions. No method is inherently more rigorous than another; it all depends on where and how the method’s applied. To put it another way, I have little doubt that the regression discontinuity analyses in the above-linked paper are in settings where the assumptions are reasonable.

I also like when he refers to “less than 1,100 observations” as a “small sample size.” Remember that regression discontinuity analysis we looked at with N=27?

The post Justify my love appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’ve always found the real sign that Spring has almost sprung is when strangers start asking which priors you should use on the variance parameters in a multilevel model.

A surprising twist on this age-old formula is that this year the question was actually asked on Twitter! Because I have a millennial’s^{-1} deep understanding of social media, my answer was “The best way to set priors on the variance of a random effect^{0} is to stop being a Nazi TERF”. But thankfully I have access to a blogging platform, so I’m going to give a more reasoned, nuanced, and calm answer here.

(**Warning:** This post is long. There’s a summary at the end. There is also an answer that fits into a 280-character limit. It obviously links to the Stan prior choice wiki. You can also look at this great Stan case study from Michael Betancourt.)

In this post, I’m going to focus on weakly informative priors for the variance parameters in a multilevel model. The running example will be a very simple multilevel model for Poisson data, where the log-risk is modelled using a global intercept, a covariate, and one iid Gaussian random effect^{0}. Hence we only have one variance parameter in the model that we need to find a prior for.

Some extensions beyond this simple model are discussed further down, but setting priors for those problems are much harder and I don’t have good answers for all of them. But hopefully there will be enough information here to start to see where we’re (we = Andrew, Aki, Michael and I, but also Håvard Rue, Sigrunn Sørbye, Andrea Riebler, Geir-Arne Fuglstad, and others) going with this thread of research.

The first thing I’m going to do is rule out any attempt to construct “non-informative” (whatever that means) priors. In most situations (and particularly for multilevel models), I don’t think vague priors are a good idea.

The main reason is that I don’t think they live up to their oft-stated ability to “let the data speak for itself”. The best case scenario is that you’re working with a very regular model and you have oodles of data. In this case, a correctly-specified vague prior (ie a reference prior) will not get in the way of the data’s dream to stand centre stage and sing out “*I am a maximum likelihood estimator to high order*“. But if that’s not the tune that your data should be singing (for example, if you think regularization or partial pooling may be useful), then a vague prior will not help.

Neither do I think that “letting the data speak” is the role of the prior distribution. *The right way to ensure that your model is letting the data speak for itself is through a priori and a posteriori model checking and model criticism.* In particular, well-constructed cross-validation methods, data splitting, posterior predictive checks, and prior sensitivity analysis are *always* necessary. It’s not enough to just say that you used a vague prior and hope. Hope has no place in statistics.

Most of the priors that people think are uninformative turn out not to be. The classic example is the inverse-Gamma(ε, ε) prior for the variance of a random effect^{0}. Andrew wrote a paper about this 12 years ago. It’s been read (or at least cited) quite a lot. The paper is old enough that children who weren’t even born when this paper was written have had time to rebel and take up smoking. And yet, the other day I read yet another paper that referred to an inverse-Gamma(ε, ε) prior on the variance of a random effect^{0} as “weakly informative”. So let me say it one more time for the people at the back: *It’s time to consign inverse-Gamma(ε, ε) priors to the dustbin of history.*

But even people who did read Andrew’s paper^{0.25} can go wrong when they try to set vague priors. Here’s a “close to home” example using the default priors in INLA. INLA is an R package for doing fast, accurate, approximate Bayesian inference for latent Gaussian models, which is a small but useful class of statistical models that covers multilevel models, lots of spatial and spatiotemporal models, and some other things. The interface for the INLA package uses an extension to R’s formula environment (just like `rstanarm` and `brms`), and one of the things that this forces us to do is to specify default priors for the random effects^{0}. Here is a story of a default prior being bad.

This example is taken from the first practical that you can find here (should you want to reproduce it). The model, which will be our running example for the next little while, is a simple Poisson random effects^{0} model. It has 200 observations, and the random effect^{0} has 200 levels. For those of you familiar with the `lme4` syntax, the model can be fit as

`lme4::glmer(num_awards ~ math + (1|id), data = awards, family = "poisson")`

If you run this command, you get a standard deviation of 0.57 for the random effect^{0}.

INLA parameterises a normal distribution using its precision (ie the inverse of the variance). This is in line with JAGS and BUGS, but different to Stan (which is the younger piece of software and uses the standard deviation instead). The default prior on the precision is Gamma(1, 0.00005), which is essentially a very flat exponential prior.

The following table shows the 95% posterior credible intervals for the standard deviation across five different priors. The first is the INLA default; the second is recommended by Bernardinelli, Clayton, and Montomoli. Both of these priors are very vague. The third is an inverse-Gamma(ε, ε) prior with a not particularly small ε (here ε = 0.001). The fourth and fifth are examples of PC priors which are designed to be weakly informative. For this model, the PC prior is an exponential prior on the standard deviation σ with the rate parameter tuned so that *a priori* P(σ > U) = 0.1, with upper bounds U = 1 and U = 10 respectively.

| Prior | 2.5% | Median | 97.5% |
|---|---|---|---|
| INLA default | 0.00 | 0.01 | 0.05 |
| IG(0.5, 0.0005) | 0.02 | 0.07 | 0.59 |
| IG(0.001, 0.001) | 0.04 | 0.40 | 0.75 |
| PC(1, 0.1) | 0.04 | 0.42 | 0.74 |
| PC(10, 0.1) | 0.09 | 0.49 | 0.79 |

The takeaway message here is that the two vague priors on the precision (the default priors in INLA and the one suggested by Bernardinelli *et al.*) are terrible for this problem. The other three priors do fine and agree with the frequentist estimate^{0.5}.

The takeaway from this should not be that INLA is a terrible piece of software with terrible priors. They actually work fairly well quite a lot of the time. I chose INLA as an example because it’s software that I know well. More than that, the only reason that I know anything at all about prior distributions is that we were trying to work out better default priors for INLA.

The point that we came to is that the task of finding one universal default prior for these sorts of parameters is basically impossible. It’s even more impossible to do it for more complex parameters (like lag-1 autocorrelation parameters, or shape parameters, or over-dispersion parameters). What is probably possible is to build a fairly broadly applicable default *system* for building default priors.

The rest of this post is a stab at explaining some of the things that you need to think about in order to build this default system for specifying priors. To do that, we need to think carefully about what we need a prior to do, as well as the various ways that we can make the task of setting priors a little easier.

The first problem with everything I’ve talked about above is that, when you’re setting priors, it is a *terrible* idea to parameterize a normal distribution by its precision. No one has an intuitive feeling about the inverse of the variance of something, which makes it hard to sense-check a prior.

So why is it used? It simplifies all of the formulas when you’re dealing with hierarchical models. If the prior is x ~ N(μ0, 1/τ0) and the data model is y | x ~ N(x, 1/τ), then the posterior variance of x is the unwieldy (1/τ0 · 1/τ) / (1/τ0 + 1/τ), whereas if we parameterize by the precision, we just need to add the precisions of the prior and the data to get the posterior precision of x: τ0 + τ. So when people sat down to write complicated code to perform inference on these models, they made their lives easier and parameterized normal distributions using the precision.
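To see the convenience concretely, here's a small sketch (hypothetical numbers, not from the post) of a one-observation conjugate normal update, computed both ways:

```python
import math

# Prior: x ~ N(mu0, 1/tau0); one observation y | x ~ N(x, 1/tau)
mu0, tau0 = 0.0, 1.0   # prior mean and precision (hypothetical values)
y, tau = 2.0, 4.0      # observation and observation precision (hypothetical values)

# Precision parameterization: the posterior precision is just a sum
tau_post = tau0 + tau
mu_post = (tau0 * mu0 + tau * y) / tau_post  # precision-weighted mean

# Variance parameterization: the same quantity, via a clumsier formula
v0, v = 1.0 / tau0, 1.0 / tau
v_post = (v0 * v) / (v0 + v)

assert math.isclose(v_post, 1.0 / tau_post)  # identical answers either way
print(mu_post, tau_post)
```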

But just because something simplifies coding, it doesn’t mean that it’s a good idea to use it in general situations!

A *much better* way to parameterize the normal distribution is in terms of the *standard deviation. *The key benefit of the standard deviation is that it directly defines the *scale* of the random effect^{0}: we know that it’s highly likely that any realization of the random effect^{0} is within 3 standard deviations of the mean. This means that we can easily sense-check a prior on the standard deviation by asking “is this a sensible scale for the random effect^{0}?”.

For this model, the standard deviation is on the scale of the log-counts of a Poisson distribution, so we probably don’t want it to be too big. What does this mean? Well, the largest count a 32-bit signed integer will allow is 2,147,483,647. This is a fairly large number. If we want the *mean* of the Poisson to be most likely less than that (in the case where everything else in the log-risk is zero), then we want to ensure σ < log(2,147,483,647)/3 ≈ 7.2. If we want the *expected* count in each cell to be less than a million, we need to ensure σ < 4.6. If we want the *expected* count to be less than 1000, we need to ensure σ < 2.3. (I can literally go on all day.) Under the Gamma(0.001, 0.001) prior on the precision, the prior probabilities that these three constraints hold are, respectively, 0.01, 0.009, and 0.007. This means that rather than “speaking for itself” (whatever that means), your data is actually trying to claw back probability mass from some really dumb parts of the parameter space.

If you plot all of the distributions on the standard deviation scale (sample from them and do a histogram of their inverse square root), you see that the first 3 priors do not look particularly good. In particular, the first two are extremely concentrated around small values. This explains the extreme shrinkage of the standard deviation estimates. The third prior is extremely diffuse.
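That sanity check is easy to script. Here's a hedged sketch (standard library only, numbers mine): simulate precisions τ from the Gamma(0.001, 0.001) prior and estimate how much prior mass the implied standard deviation σ = τ^(-1/2) puts below 7.2, which is roughly log(2^31)/3:

```python
import random

random.seed(0)

shape, rate = 0.001, 0.001   # Gamma(0.001, 0.001) prior on the precision tau
draws = 200_000
bound = 7.2                  # sigma bound, roughly log(2**31) / 3

# sigma < bound is the same event as tau > bound**-2; counting on the tau
# scale sidesteps underflow (tiny tau draws can hit 0.0, breaking 1/sqrt(tau)).
threshold = bound ** -2
hits = sum(
    random.gammavariate(shape, 1 / rate) > threshold  # gammavariate takes shape, scale
    for _ in range(draws)
)

p_sensible = hits / draws
print(round(p_sensible, 3))  # about 0.01: ~99% of the prior mass is on absurdly large sigma
```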

So if the standard deviation is the correct parameter for setting a prior, what sort of prior should we set on it?

Well I have some good news for you: there are several popular alternatives and no clear winner. This is pretty common when you do research on priors. If you don’t screw up the prior completely, it’s a *higher-order effect*. This means that asymptotically, the effect of the prior on the posterior is smaller than the effect of sampling variation in the data.

But this didn’t happen in the above example, where there was a very clear effect. There are a few reasons for this. Firstly, the data only has 200 observations, so we may be so far from asymptopia that the prior is still quite important. The second reason is more interesting. There is more than one sensible asymptotic regime for this problem, and which one we should use depends on the details of the application.

The data, as it stands, has 200 records and the random effect^{0} has 200 different levels. There are essentially 3 different routes to asymptopia here. The first one is that we see an infinite number of repeated observations for these 200 levels. The second is a regime where we have n observations, the random effect^{0} has n levels, and we send n → ∞. And the third option is half way between the two, where the number of levels grows slower than the number of observations.

It is a possibly under-appreciated point that asymptotic assessments of statistical methods are conditional on the particular way you let things go to infinity. The “standard” asymptotic model of independent replication is quite often unrealistic and any priors constructed to take advantage of this asymptotic structure might have problems. This is particularly relevant to things like reference priors, where the asymptotic assumptions are strongly baked in to the prior specification. But it’s also relevant to any prior that has an asymptotic justification or any method for sense-checking priors that relies on asymptotic reasoning.

Well that got away from me, let’s try it again. What should a prior for the standard deviation of a Gaussian random effect^{0} look like?

The first thing is that it should peak at zero and go down as the standard deviation increases. Why? Because we need to ensure that our prior doesn’t prevent the model from easily finding the case where the random effect^{0} should not be in the model. The easiest way to ensure this is to have the prior decay away from zero. My experience is that this also prevents the model from over-fitting (in the sense of having too much posterior mass on very “exciting” configurations of the random effects^{0} that aren’t clearly present in the data). This idea that the prior should decrease monotonically away from a simple *base model* gives us some solid ground to start building our prior on.

The second thing that we want from a prior is *containment*. We want to build it so that it keeps our parameters in a sensible part of the model space. One way to justify this (and define “sensible”) is to think about the way Rubin characterized the posterior distribution in a very old essay (very old = older than me). He argued that you can conceptualize the posterior through the following procedure (here $y$ is the observed data):

- Simulate $\theta \sim p(\theta)$
- Simulate $y_{\text{sim}} \sim p(y \mid \theta)$
- If $y_{\text{sim}} = y$, then keep $\theta$, otherwise discard it.

The set of $\theta$ that are generated with this procedure are samples from the posterior distribution.
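Rubin’s procedure is easy to demonstrate on a toy discrete model (this conjugate Poisson–Gamma example is mine, not from the essay), because the rejection sampler can be checked against the exact posterior:

```python
import numpy as np

rng = np.random.default_rng(1)
y_obs = 3

# Toy model: theta ~ Gamma(2, rate 1) prior, y | theta ~ Poisson(theta).
# The exact posterior is Gamma(2 + y_obs, rate 2), with mean 5/2.
theta = rng.gamma(shape=2.0, scale=1.0, size=500_000)
y_sim = rng.poisson(theta)

# Rubin's accept/reject step: keep theta only when the simulated data
# exactly matches the observed data.
kept = theta[y_sim == y_obs]

print(len(kept), kept.mean())   # the mean of the kept draws should be near 2.5
```

The point of the characterization is visible here: any prior mass on values of $\theta$ that could never produce data like $y$ is pure waste that the accept/reject step has to fight against.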

This strongly suggests that the prior should be built to ensure that the following containment property holds: The prior predictive distribution should be slightly broader than a prior predictive distribution that only has mass on plausible data sets.

This requirement is slightly different to Andrew’s original definition^{1} of a weakly informative prior (which is a prior on a parameter that is slightly broader than the best informative prior that we could put on a parameter without first seeing the data). The difference is that while the original definition of a WIP requires us to have some sort of interpretation of the parameter that we are putting the prior on, the containment condition only requires us to understand how a parameter affects the prior data generating process. Containment also reflects our view that the prior can often only be understood in the context of the likelihood.

If this all sounds like it would lead to very very complicated maths, it probably will. But we can always check containment visually. We have a discussion of this in our (re-written the other month to read less like it was written under intense deadline pressure) paper on visualization and the Bayesian workflow. That paper was recently accepted as a discussion paper in the Journal of the Royal Statistical Society Series A (date of the discussion meeting TBC) which is quite nice.

An immediate consequence of requiring this sort of containment condition is that we need to consider all of the parameters in our model simultaneously in order to check that it is satisfied. This means that we can’t just specify priors one at a time when a model has a number of random effects^{0}, or when there are other types of model components.

For the sake of sanity, we can simplify the process by breaking up the parameters into groups that affect different parts of the data generating process. For instance, if you have several random effects^{0}, a good way to parameterize their variances is to put a prior on the standard deviation of the total variability and then a simplex-type prior to distribute the total standard deviation across the many components.

There are some complexities here (eg what do you do when the random effects^{0} have differing numbers of levels or differing dependence structures between levels?). Some suggested solutions are found towards the end of this paper and in this paper. A similar idea (except it’s the variances that are distributed rather than the standard deviations) is used for the default priors in rstanarm. The same idea pops up again for setting priors on mixture models.
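To make the decomposition concrete, here is a small sketch for three hypothetical random effects. It uses the variance-splitting variant (the rstanarm-style one mentioned above); the exponential prior on the total standard deviation and the flat Dirichlet are illustrative choices, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Prior on the *total* standard deviation, plus a simplex to split
# the total variance across three components.
total_sd = rng.exponential(scale=1.0, size=n)
weights = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=n)   # shape (n, 3)

component_var = (total_sd ** 2)[:, None] * weights
component_sd = np.sqrt(component_var)

# By construction, the component variances add back up to the total variance,
# so controlling one scale parameter controls the whole joint prior.
print(np.allclose(component_var.sum(axis=1), total_sd ** 2))
```

The appeal is that containment only has to be checked for a single interpretable quantity (the total variability), rather than separately for each component.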

So while that was a fun diversion, it still hasn’t given us a prior to use on the variance of a random effect^{0}. But we do now know what we’re looking for.

There are a number of priors that decay away from the base model and give us some form of containment. Here is an incomplete, but popular, list^{2} of simple containment priors:

- A half-Gaussian on the standard deviation
- An exponential on the standard deviation
- A half-t with 7 degrees of freedom on the standard deviation
- A half-t with 3 degrees of freedom on the standard deviation
- A half-Cauchy on the standard deviation.

These were listed from lightest to heaviest tail on purpose.

All of these priors require the specification of one parameter controlling the width of the 90% highest probability density interval^{2.5}. This parameter allows us to control containment. For example, if the random effect^{0} was part of a model for the log-relative risk of a disease, it may be sensible to set the parameter so that the standard deviation is less than 1 with 90% prior probability. This would correspond to the random effect^{0} being contained within [-3,3] and hence the random effect’s^{0} contribution to the relative risk being constrained to a multiplicative factor within [0.05, 20].
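Picking the width parameter is mechanical once you have chosen the 90% point. As a sketch (using scipy, with the target of 1 on the log-relative-risk scale from the example above), the scales that put 90% of each prior’s mass below 1 are:

```python
from scipy import stats

# Target: Pr(sigma < 1) = 0.9 under each containment prior.
target, q = 1.0, 0.9

dists = {
    "half-normal": stats.halfnorm,
    "exponential": stats.expon,
    "half-Cauchy": stats.halfcauchy,
}

# The quantile function is linear in the scale parameter, so the required
# scale is just the target divided by the unit-scale 90% quantile.
scales = {name: target / d.ppf(q) for name, d in dists.items()}

for name, s in scales.items():
    print(f"{name}: scale = {s:.3f}, "
          f"90% quantile = {dists[name].ppf(q, scale=s):.3f}")
```

Note that the heavier the tail, the smaller the scale has to be to hit the same 90% point, which is one concrete way the tail choice interacts with containment.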

This brings out an important aspect of containment priors: they are problem dependent. Although this example does not need a lot of expert information to set a sensible prior, it does need someone to understand how big a deviation from the baseline risk is unlikely for the particular scenario you are modelling. There isn’t a way around this. You either explicitly encode information about the problem in a prior, hide it in the structure of your asymptotic assumptions, or just throw your prior against a wall and hope that it sticks.

One way to standardize the procedure for setting priors is to demand that the model is correctly *scaled* in order to ensure that all of the parameters are on the unit scale. This can be done structurally (if your expected counts are well constructed, the relative risk will often be on unit scale). It can also be done by introducing some data-dependence into the prior, although this is a little more philosophically troublesome and you have to be diligent with your model assessment.

As for which of these priors is preferred, it really depends on context. If your model has a lot of random effects^{0}, or the likelihood is sensitive to extreme values of the random effect^{0}, you should opt for a lighter tail. On the other hand, a heavier tail goes some way towards softening the importance of correctly identifying the scale of the random effect^{0}.

To put some skin in the game, last time I talked to them about this^{3}, Andrew seemed more in favour of good scaling and a half-Normal(0,1) prior, I like an exponential prior with an explicitly specified 90% quantile for the standard deviation, and Aki seemed to like one of the Student-t distributions. Honestly, I’m not sure it really makes a difference and with the speed of modern inference engines, you can often just try all three and see which works best. A nice^{4} simulation study in this direction was done by Nadja Klein and Thomas Kneib.

None of us are particularly fond of the half-Cauchy as a prior on the standard deviation. This was one of the early stabs at a weakly informative prior for the standard deviation. It does some nice things, for example if you marginalize out a standard deviation with a half-Cauchy prior, you get a distribution on the random effect^{0} that has a very heavy tail. This is the basis for the good theoretical properties of the Horseshoe prior for sparsity, as well as the good mean-squared error properties of the posterior mean estimator for the normal means problem.

But this prior has fallen out of favour with us for a couple of reasons. Firstly, the extremely heavy tail makes for a computational disaster if you’re actually trying to do anything with this prior. Secondly, it turns out that some regularization does good things for sparsity estimators.

But for me, it’s the third problem that sinks the prior. The tail is so heavy that if you simulate from the model you frequently get extremely implausible data sets. This means that the half-Cauchy prior on the standard deviation probably doesn’t satisfy our requirement that a good prior satisfies the containment property.

Right. We’ve managed to select a good prior for the standard deviation for a Gaussian random effect^{0}. The next thing to think about is whether or not there are any generalizable lessons here. (Hint: There are.) So let us look at a very similar model that would be more computationally convenient to fit in Stan and see that, at least, all of the ideas above still work when we change the distribution of the random effect^{0}.

The role of the random effect^{0} in the example model is to account for over-dispersion in the count data (allowing the variance to be larger than the mean). An alternative model that does the same thing is to take the likelihood as negative-binomial rather than Poisson. The resulting model no longer has a random effect^{0} (it’s a pure GLM). If you’re fitting this model in Stan, this is probably a good thing as the dimension of the parameter space goes from 203 to 3, which will definitely make your model run faster!

To parameterize the negative binomial distribution, we introduce an over-dispersion parameter $\phi$ with the property that the mean of the negative binomial is $\mu$ and the variance is $\mu + \mu^2/\phi$.

We need to work out a sensible prior for the over-dispersion parameter. This is not a particularly well-explored topic in the Bayesian literature. And it’s not really obvious what a sensible prior will look like. The effect of $\phi$ on the distribution of the counts is intertwined with the effect of $\mu$.

One way through this problem is to note that setting a prior on $\phi$ is in a lot of ways quite similar to setting a prior on the standard deviation of a Gaussian random effect^{0}.

To see this, we note that we can write the negative binomial with mean $\mu$ and overdispersion parameter $\phi$ as

$y \mid \lambda \sim \text{Poisson}(\mu \lambda), \qquad \lambda \sim \text{Gamma}(\phi, \phi)$.

This is different to the previous model. The Gamma distribution for $\lambda$ has a heavier left tail and a lighter right tail than the log-normal distribution that was implied by the previous model. That being said, we can still apply all of our previous logic to this model.
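As a quick numerical check of this mixture representation (with illustrative values $\mu = 10$ and $\phi = 5$), simulating $\lambda$ from the Gamma and then $y$ from the Poisson should reproduce the negative binomial’s mean $\mu$ and variance $\mu + \mu^2/\phi$:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, phi, n = 10.0, 5.0, 500_000

# lambda ~ Gamma(phi, phi): mean 1, variance 1/phi
# (numpy's gamma takes a scale parameter, so scale = 1/phi)
lam = rng.gamma(shape=phi, scale=1.0 / phi, size=n)

# y | lambda ~ Poisson(mu * lambda)
y = rng.poisson(mu * lam)

# The marginal should have mean mu = 10 and variance mu + mu^2/phi = 30.
print(y.mean(), y.var())
```

The over-dispersion is visible immediately: the variance is three times the mean, entirely driven by the spread of the Gamma mixing distribution.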

The concept of the base model would be a spike at $\lambda = 1$, which gives a Poisson distribution. (The argument for this is that it is a good base model because every other achievable model with this structure is more interesting, in that the mean and the variance differ from each other.) The base model occurs when $\phi \to \infty$.

So we now need to work out how to ensure containment for this type of model. The first thing to do is to try and make sure we have a sensible parameterization so that we can use one of our simple containment priors.

The gamma distribution has a mean of 1 and a variance of $1/\phi$, so one option would be to completely follow our previous logic and use the standard deviation $1/\sqrt{\phi}$ as a sensible transformation. Because the Gamma distribution is highly skewed, it isn’t completely clear that the standard deviation is as good a parameterization as it is in the previous model. But it turns out we can justify it in a completely different way, which suggests it might be a decent choice.

A different method for finding a good parameterization for setting priors was explored in our PC priors paper. The idea is that we can parameterize our model according to its *deviation* from the base model. In both the paper and the rejoinder to the discussion, we give a pile of reasons why this is a fairly good idea.

In the context of this post, the thing that should be clear is that this method will not ensure containment directly. Instead, we are parameterizing the model by a deviation $d$ so that if you increase the value of $d$ by one unit, the model gets one unit more interesting (in the sense that the square root^{5} of the amount of information lost when you replace this model by the base model increases by one). The idea is that with this parameterization we will contain $d$ and hopefully therefore constrain $\phi$.

If we apply the PC prior re-parameterization to the Gaussian random effects^{0} model, we end up setting the prior on the standard deviation of the random effect^{0}, just as before. (This is a sense check!)

For the Gamma random effect^{0}, some tedious maths leads to the exact form of the distance (Warning: the full expression looks horrible. It will simplify soon.)

$d(\phi) = \sqrt{2\,\text{KLD}(\phi)}$,

where $\text{KLD}$ is the Kullback-Leibler divergence from the base model, an expression involving $\psi(\phi)$, the digamma function (the derivative of the log-gamma function).

To simplify the distance, let’s take a look at what it looks like for very small and very large $\phi$. When $\phi$ is near zero, some rooting round the internet shows that $d(\phi)$ behaves like a constant times $\phi^{-1/2}$. Similarly, when $\phi$ is large, we also get $d(\phi) \propto \phi^{-1/2}$. Moreover, if you plot the square of the distance against $1/\phi$, it looks a lot like a straight line (it isn’t quite, but it’s very close). This suggests that putting a containment prior on the parameter $1/\sqrt{\phi}$ might be a good idea.

So the end of this journey is that something like an appropriately scaled half-normal, exponential, or half-t distribution on $1/\sqrt{\phi}$ is a good candidate for a containment prior for the over-dispersion parameter of a negative binomial distribution. The truth of this statement can be evaluated using prior data simulations to check containment, and posterior sensitivity and predictive checks to check that the prior is appropriate for the problem at hand. Because no matter how good I think the mathematical argument is for a prior, it is still vital to actually verify the theory numerically for the particular problem you are trying to solve.

If I had to summarize this very long post, I’d probably say the following:

- Vague priors tend to be “accidentally informative” in random effects^{0} models (and other somewhat complicated statistical models).
- The first thing you need to think about when setting a prior is to find a parameterization that is at least a little bit interpretable. If you can’t give an “in words” interpretation of your prior, you probably haven’t put enough care into setting it.
- Base models are very useful to work out how to set priors. Think about the most boring thing that your model can do and expand from there.
- The idea of containment is hiding in a lot of the ways people write about priors. It’s a good thing and we should pay more attention to it.
- Containment tells us explicitly that we need to consider the *joint prior* on all parameters of the model, rather than just thinking about the priors on each parameter independently. Sometimes we can cheat if different sets of parameters change (almost) disjoint aspects of a model.
- Containment also suggests that the priors need to respect the scale of the effect a parameter has. Sometimes we can fix this through a linear scaling (as in regression coefficients or standard deviation parameters), but sometimes we need to be more creative (as in the over-dispersion parameter).
- Containment makes it hard to specify a *universal* default prior for a problem, but we can still specify a universal default *procedure* for setting priors for a problem.
- We can *always* check containment using prior simulations from the model.
- We can *always* assess the effect of the prior on the estimates through careful prior and posterior checks (this is really the story of our Visualization and Workflow paper).
- These considerations stretch far beyond the problem of specifying priors for variance components in multilevel models.

^{-1} For those of you who use “millennial” as a synonym for “young”, I promise you we are not. Young people were not alive when the Millennium turned. The oldest millennials are 35.

^{0} When I talk about random effects, I’m using the word in the sense of Hodges and Clayton, who have a lovely paper that looks at how that term should be modernized. In particular, they point out the way that in modern contexts, random effects can be interpreted as “formal devices to facilitate smoothing or shrinkage, interpreting those terms broadly”. Everything that I talk about here holds for the scale parameter of a Gaussian process or a spatial random effect. Andrew and Aki prefer to just use different terms. The specific example that I’m using falls into their definition of an “old-style” random effect, in that the distribution of the random effect is of interest, rather than the particular values of the realizations. I personally think that “random effect” is a sufficiently broad, accessible definition that bridges decades of theory with current practice, so I’m happy with it. But mileage obviously varies. (I have tagged the term every time it’s used in case someone gets half way through before they realize they’re not sure what I mean.)

^{0.25} Even if you use the suggested prior in Andrew’s paper (a half-Cauchy with scale parameter 25 on the standard deviation), bad things will happen. Consider a simple version of our problem, where $y_i \mid u_i \sim \text{Poisson}(\exp(u_i))$ and $u_i \sim N(0, \sigma^2)$. If we put the recommended half-Cauchy prior on $\sigma$, then the simulation of $y_i$ will return an `NA` in R around 20% of the time. This happens because it tries to draw an integer bigger than 2,147,483,647, which is the largest integer that `R` will let you use. This will lead to very very bad inference and some fun numerical problems in small-data regimes.

^{0.5} Posterior intervals that are consistent with a particular estimator with some good frequentist properties isn’t necessarily a sign that a prior is good. Really there should be some proper posterior checks here. I have not done them because this is a blog.

^{1} Now I didn’t go back and check Andrew’s paper, so this may well be my memory of my original interpretation of Andrew’s definition of a Weakly Informative Prior. There are turtles everywhere!

^{2} These options are taken from the Stan prior choice wiki, which has a lot of interesting things on it.

^{2.5} I can’t say it controls the standard deviation because the half-Cauchy doesn’t have a well-defined standard deviation. So I picked a 90% highest density interval as the thing to be controlled. Note that because we’ve specified priors that decay away from the base model, these are one-sided intervals. Obviously if you want to use something else, use something else.

^{3} Obviously I do not speak for them here and you really shouldn’t take my half-memory of a conversation at some point in the dusty, distant past as a reflection of their actual views. We do have a range of opinions on this matter, but not such a diverse set of opinions that we actually disagree with each other.

^{4} The exponential prior on the standard deviation (which is the PC prior for this model) did very well in these simulations, so obviously I very much like the results!

^{5} The square root is there for lots of good reasons, but mainly to make sure all of the scales come out right. For the maths-y, it’s strongly suggested by Pinsker’s inequality.

The post Justify my love appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Does adding women to corporate boards increase stock price? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I recently came across a study which I think is quite questionable, even ridiculous. This study is unfortunately quite old (2012), but its conclusions are so ludicrous that the study is perhaps still interesting.

The study claims that companies with women on their supervisory boards perform better than companies without a woman in such high positions (especially in western countries). So far so good, but of course correlation does not mean causation. There are many more female members of supervisory boards in the service industry than in the manufacturing industry, but manufacturing is a weakening branch of industry (especially in the west) and the service industry is gaining more and more. Japan and Korea have very few female supervisory board members and are currently (in 2012 also) in a bad economic state. All this would weaken the (already weak) numbers cited in support of “gender diversity,” but the authors of the study do not seem to have considered the obvious. One might also wonder how big the influence of the supervisory board on the profits of a company is. In many companies this council plays only a subordinate role or none at all.

All media outlets have propagated the study without criticism. Even the Australian government refers to it.

What is your opinion on the whole story?

The study in question is called “Gender diversity and corporate performance” and is published by Credit Suisse Research. I can’t figure out who wrote it. It says, “For more information contact Richard Kersley, Head of Global Research Product, Credit Suisse Investment Banking, and Michael O’Sullivan, Head of Portfolio Strategy & Thematic Research, Credit Suisse Private Banking,” but I have no idea if they’re the authors of the report.

I took a look and the paper does indeed make some causal claims:

What evidence is there to support the theory that stock-market performance is enhanced by having a greater number of women on the board? . . . Our key finding is that, in a like-for-like comparison, companies with at least one woman on the board would have outperformed in terms of share price performance, those with no women on the board over the course of the past six years.

Setting aside causal inference concerns for a moment, I see some forking paths:

However, there is a clear split between relative performance in the 2005–07 period and performance post-2008. In the middle of the decade when economic growth was relatively robust, there was little difference in share price performance between companies with or without women on the board. Almost all of the outperformance in our backtest was delivered post-2008, since the macro environment deteriorated and volatility increased. In other words, stocks with greater gender diversity on their boards generally look defensive: they tend to perform best when markets are falling, deliver higher average ROEs through the cycle, exhibit less volatility in earnings and typically have lower gearing ratios.

Other aspects of the report are purely descriptive and I have less problem with that; for example when they ask, “Is there any difference in the financial characteristics of companies with a greater number of women on the board?”

They do refer to dissenting views, and that’s good:

There is a significant body of literature on this issue; articles on the subject span several decades. Some suggest corporate performance benefits from greater gender diversity at board level, while others suggest not.

In the positive camp are the likes of McKinsey and Catalyst. Catalyst has shown that Fortune 500 companies with more women on their boards tend to be more profitable. McKinsey showed that companies with a higher proportion of women at board level typically exhibited a higher degree of organization, above-average operating margins and higher valuations.

Other studies, such as those conducted by Adams and Ferreira or Farrell and Hersch, have shown that there is no causal link between greater gender diversity and improved profitability and stock price performance. Instead, the appointment of more women to the board may be a signal that the company is already doing well, rather than being a sign of better things to come.

They follow up later in their report and seem to think highly of the reports by Adams and Ferreira and by Farrell and Hersch. So if you actually read the entire document, their claims don’t seem so strong.

Regarding the main causal claim, the report does address potential confounding:

Our headline result is that, over the past six years, companies with at least some female board representation outperformed those with no women on the board in terms of share price performance.

Getting to this result was not straightforward. There is a bias from the skew in female representation towards certain sectors (consumer-related), certain markets (Europe) and towards large-cap stocks. Take the sector issue by way of example. The consumer staples sector ranks higher than average in terms of female board representation, but arguably the considerable share price outperformance the sector has delivered over the past few years has little to do with board composition and much more to do with the very stable and defensive nature of its earnings in a world of considerable earnings uncertainty.

Hence, in calculating the returns generated by companies with (a) one or more women on the board compared with those with (b) no women on the board, we have made three adjustments:

1. We look at performance from a sector-neutral stance. In other words, we have allocated the same sector weights in the calculations of both (a) and (b) in order to mitigate the impact of overall sector performance;

2. We split the sample universe into two baskets: one containing companies with market capitalization greater than USD 10 billion and one containing companies with market capitalization less than USD 10 billion. Hence, in broad terms, we are aiming to compare women versus no women on the board of large caps and separately, women versus no women on the board of mid-to-small caps. In this way, we can partially mitigate the survivor bias of small cap stocks in the construction of our sample universe; and

3. We look at the returns generated (on a sector-neutral basis) within each region as well as at the aggregate global level.

This all sounds reasonable; that said, it’s not clear to me what analysis they actually did—it seems they used some sort of weighted averaging, which is limited as a technique for addressing differences in pre-treatment variables in causal inference. It’s tough when the problem is not formally set up causally: what’s the “treatment” or “instrument”? In particular, it’s not clear how to give a causal interpretation to a descriptive statement such as “the results demonstrate superior share price performance for the companies with one or more women on the board.”

So, overall, yes, I think Kasster’s criticisms are reasonable, and many of the conclusions of the report could be artifacts of the data. At the same time, the report itself is moderate in tone. The topic is difficult to study because the effect of adding more men or women to a corporate board has to depend on context, and stock price is a noisy outcome measure.

The post Does adding women to corporate boards increase stock price? appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post This one’s important: How to better analyze cancer drug trials using multilevel models. appeared first on Statistical Modeling, Causal Inference, and Social Science.

With the arrival of two revolutionary treatment strategies, immunotherapy and personalized medicine, cancer researchers have found new hope — and a problem that is perhaps unprecedented in medical research.

There are too many experimental cancer drugs in too many clinical trials, and not enough patients to test them on. . . . there are more than 1,000 immunotherapy trials underway, and the number keeps growing. “It’s hard to imagine we can support more than 1,000 studies,” said Dr. Daniel Chen, a vice president at Genentech, a biotechnology company. . . .

Take melanoma: There are more than 85,000 cases a year in the United States, according to Dr. Norman Sharpless, director of the Lineberger Comprehensive Cancer Center at the University of North Carolina, who was recently named director of the National Cancer Institute. . . . “We used to have trials not long ago that had 700 patients per arm,” Dr. Sharpless said, referring to the treatment groups in a study. “That’s almost undoable now.”

Today, “trials can be eight patients.”

This reminds me of my general view that conventional clinical trials are a bad model for research, that the idea of a definitive randomized study does not work so well in the modern world.

In the article “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it,” I argue that when effects are small, black-box randomized trials are simply not going to work. On one hand, you’ll need huge sample sizes to reliably detect small effects; on the other hand, the world is changing and once you get that huge sample size your target may have moved.

In that paper I discuss two examples from social and behavioral science, but I have every reason to believe the same issues arise in medical research. And, indeed, Kolata’s article emphasizes that many of the treatments being considered are very similar to each other, which suggests that the right approach is a coordinated set of trials analyzed using a multilevel model, rather than a set of independent trials analyzed separately, which leads to a play-the-winner rule followed by the winner’s curse in which the treatment that happens to perform best in a noisy environment ends up overrated.
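A toy simulation illustrates the play-the-winner/winner’s-curse point (all the numbers here are made up for illustration): with 20 similar treatments each estimated from a tiny noisy trial, the apparently best treatment’s raw estimate is badly inflated, while multilevel-style shrinkage toward the common mean removes most of that selection bias.

```python
import numpy as np

rng = np.random.default_rng(5)
n_sims, n_trials, tau, se = 2_000, 20, 1.0, 2.0

# True effects of 20 similar treatments: theta ~ N(0, tau^2).
# Each tiny trial returns a noisy estimate with standard error se.
theta = rng.normal(0.0, tau, size=(n_sims, n_trials))
est = theta + rng.normal(0.0, se, size=(n_sims, n_trials))

# Play-the-winner: pick the apparently best treatment in each simulation.
winner = est.argmax(axis=1)
rows = np.arange(n_sims)
curse = (est[rows, winner] - theta[rows, winner]).mean()

# Multilevel-style partial pooling (known variances, for simplicity):
# shrink each estimate toward the common mean by tau^2 / (tau^2 + se^2).
# Shrinkage preserves the ranking, so the same treatment is selected.
shrunk = est * tau**2 / (tau**2 + se**2)
curse_pp = (shrunk[rows, winner] - theta[rows, winner]).mean()

print(curse, curse_pp)   # raw winner is badly overrated; shrunk winner is not
```

The raw winner’s estimate overstates its true effect by several noise units on average, while the partially pooled estimate of the same treatment is roughly unbiased, which is the statistical case for analyzing the coordinated trials jointly.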

So there’s a convergence of practical and statistical issues. As usual, the problem with the conventional approach is not just p-values, it’s the whole null hypothesis significance testing attitude that just doesn’t fit with the problems under study and the decisions that need to be made.

**P.S.** Alper adds:

Possibly because it was covered by my insurance, possibly because ultrasound is noninvasive, and possibly because John McCain was found to have glioblastoma above his eye, earlier this week I biked to a local hospital to have a growth on my forehead looked at. While the technician went out to fetch the radiologist, I noticed that the text on the ultrasound screen said “left eye” when in fact it was over my right eye. As has been noted by many others, hospitals are dangerous places and a surprising number of surgeries are done on the wrong body part.

Damn!

**P.P.S.** Kolata’s article is excellent but I do have one complaint: every expert quoted in the article is a doctor. The topic is medical research, so, sure, doctors are experts. But these questions are not just medical: they involve statistics, they involve economics, they involve politics too. So I don’t think docs should be the *only* people interviewed.

The post This one’s important: How to better analyze cancer drug trials using multilevel models. appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post This April Fools post is dead serious appeared first on Statistical Modeling, Causal Inference, and Social Science.

But today I have something so ridiculous that it made sense to just post it straight up.

It came up a few days ago, when I was googling the name of a researcher who, with a colleague, had published two papers that were near exact duplicates, two years apart and in the very same journal. It turns out this researcher has had various data problems with his published work (see here from Retraction Watch and here from Malte Elson, a ridiculously drawn-out story of bad data) and, according to Elson, is “one of the most frequent users of the Competitive Reaction Time Task,” a true nest of forking paths (this last bit is relevant to understanding how this researcher, and others like him, manage to consistently find stunning, statistically-significant and publishable findings from their data).

But that’s all background. What happened was that I was googling this guy and came up with what may possibly be the most ridiculous scientific article I’ve ever seen.

The title is “Low glucose relates to greater aggression in married couples,” but things really get going in the abstract:

People are often the most aggressive against the people to whom they are closest—intimate partners. Intimate partner violence might be partly a result of poor self-control. Self-control of aggressive impulses requires energy, and much of this energy is provided by glucose derived from the food we eat. We measured glucose levels in 107 married couples over 21 days. To measure aggressive impulses, participants stuck 0–51 pins into a voodoo doll that represented their spouse each night, depending how angry they were with their spouse. To measure aggression, participants blasted their spouse with loud noise through headphones. Participants who had lower glucose levels stuck more pins into the voodoo doll and blasted their spouse with louder and longer noise blasts.

Sticking 0-51 pins into a voodoo doll, huh? I could see sticking 1 or 2 pins into the doll, but 51?! That’s a bit outta control, no? Is it a voodoo doll or a pincushion?

The paper carefully follows Rolf Zwaan’s 18 rules for writing a successful PNAS paper, even going to the trouble of leading off with a celebrity quote (#12 on Zwaan’s list).

I still can’t believe there were people who’d go to the trouble of sticking 51 pins into a voodoo doll. 51, that’s such a high number—where did it come from? What the heck, why not go all the way up to 100?

Also this bit:

To measure aggression, participants competed against their spouse on a 25-trial task in which the winner blasted the loser with loud noise through headphones.

Whaaa?

OK, here are some further details:

Participants were told that they would compete with their spouse to see who could press a button faster when a target square turned red on the computer, and that the winner on each trial could blast the loser with loud noise through headphones. The noise was a mixture of sounds that most people hate (e.g., fingernails scratching on a chalkboard, dentist drills, ambulance sirens). The noise levels ranged from level 1 (60 dB) to level 10 (105 dB; approximately the same level as a fire alarm). The winner could also determine the duration of the loser’s suffering by controlling the noise duration [from level 1 (0.5 s) to level 10 (5 s)].

Wow, that sounds like a fun game.

The “voodoo doll,” thing still seems like the weirdest part. But . . .

Previous research has shown that this procedure is a valid way to measure aggressive inclinations in couples (17).

OK, let’s look up the reference:

17. Dewall CN, et al. (2013) The voodoo doll task: Introducing and validating a novel method for studying aggressive inclinations. Aggress Behav 39(6):419–439.

“Aggress Behav,” indeed. I still can’t figure out how they came up with the number 51. This just seems like a lot of pins to me. What with the pins and the blasting of loud noise, it’s kind of amazing these people are still married!

I was talking about the “Low glucose relates to greater aggression in married couples” paper with someone I know who does social work research, and she assured me that it must be some sort of April Fool’s joke: the voodoo dolls, the story about the glucose, the trivialization of the serious problem of intimate partner violence. She assumed this was all a parody of silly psychology research.

So I checked some more and, no, the paper seems to be real. For example, here’s a press release dated April 14, 2014, from Ohio State University, which includes the following image:

So I think the study really happened! The press release also featured this quote from one of the authors of the study:

“It’s simple advice but it works: Before you have a difficult conversation with your spouse, make sure you’re not hungry.”

You probably don’t need me to tell you this, but . . . the paper had no data at all on conversations, let alone “difficult conversations,” nor was there any data on hunger, or any evidence that any intervention “works.”

So, par for the course: a one sentence claim that includes 3 different claims, none of which are supported by data.

The study was also featured uncritically by NPR. Of course. No preregistered replications that I’ve seen, but, hey, that’s not a problem in the field of ego depletion, right? Right?

Voodoo correlations, indeed.

**P.S.** One interesting question is why it is that various problems go together: In this case we have duplicate publications, disregard of the welfare of students, reluctance to share data, p-values obtained via forking paths, NPR-bait research published in PNAS, ridiculous measurements, the claim that one simple trick can change your life, and a set of specific claims that are not addressed in any way by the published research.

There perhaps are some logical reasons for this co-morbidity.

Let’s work backward. To get NPR-bait research published in PNAS, you need some combination of (a) originality and (b) major claims, along with (c) statistical significance or the equivalent. (We actually saw a PNAS paper recently that got by on a “p less than 0.10” result that went in the opposite direction as the preregistered hypothesis, but that’s unusual; I still can’t figure out how that one got through.)

So here’s the problem:

(a) Originality is tough. It’s hard to come up with original ideas, and the easiest way to do so is to go wacky (voodoo dolls)!

(b) If your ideas *are* original, they’re unlikely to work the first time, or even the second or third. Hence the need to massage the data, which selects for unethical behavior (hence the possible correlation with duplicate publication, disregard for the welfare of students, reluctance to share data, and general suppression of dissent).

(c) And the easiest way to get statistical significance is to keep shaking your data till something comes up, then cover your tracks with story time.

That pizzagate guy was just the most extreme example.

On the other hand, I don’t really know how much the above behaviors go together in general. I’ve never done anything like a systematic or representative survey of research misconduct, so these are all speculations. Also, I’m making no claim that any of the authors of the above-discussed paper have engaged in unethical behavior. I have no idea. They may just have all been in the wrong place at the wrong time. Nor am I saying that PNAS should not be publishing a paper on voodoo dolls. It’s their call: PNAS gets to publish the paper, Ohio State and NPR get to publicize it, and outsiders such as myself get to share our takes. Fair all around.

**P.P.S.** See here for more (reference from some comments below), where Florian Lange and Robert Kurzban write:

As researchers in the field of self-control, we read the recent publication by Bushman et al. (2014) with great interest. Using creative measures of aggressive tendencies, the authors examined the relationship between blood glucose levels and proxies for intimate partner violence. . . .

From their results, Bushman et al. (2014) concluded that glucose “influences aggressive tendencies and behaviors” (p. 3) within couples. They regarded their findings as implying that “interventions designed to provide individuals with metabolic energy might foster more harmonious couple interactions” (p. 3). While there is obvious appeal to the notion that glucose can increase self-control and thus prevent aggressive impulses from being expressed, this study does not provide evidence supporting this idea.

Exactly! Who knows? Their theory and proposed interventions might be correct, they might be wrong, they might be counterproductive, or, more generally, their recommendations might make sense in some settings and be counterproductive in others—but the published results do not provide good evidence.

Lange and Kurzban continue:

The work by Bushman et al. draws on the proposal that “self-control requires brain food in the form of glucose” (p. 3). However, the glucose model of self-control (Gailliot et al., 2007) suffers from both conceptual shortcomings and empirical falsification (Kurzban et al., 2013). Not only has the proposal that glucose fuels the part of the brain needed to exert self-control been shown to be inconsistent with what is known about brain metabolism (Kurzban, 2010), but the empirical evidence reported in support of the proposal has been demonstrated to be implausible from a statistical perspective (Schimmack, 2012). . . . This conclusion is further corroborated by replication studies that did not find the originally reported effect . . .

In view of these issues, self-control and blood glucose levels cannot simply be equated. As a consequence, when relating their outcome measure to blood sugar concentrations, Bushman et al. (2014) did not test, as they claim, “the effects of self-control on aggression” (p. 3). What they did test was the size of the relationship between daily fluctuations in blood glucose levels and a measure of aggressive impulse. Importantly, the authors did not record any self-control data and assuming that the number of pins stuck in a doll varies according to individuals’ ability to exert self-control is conceptually problematic. For the daily assessment of aggressive tendencies, participants were simply asked to indicate how angry they were with their partner. They were not required to inhibit or override their aggressive thoughts, emotions, or urges. Hence, the only conclusion licensed by the findings reported by Bushman et al. is that blood glucose relates to a single-item self-report measure of aggressive impulse, not to the ability to control these impulses.

We do not doubt that hungrier organisms are more aggressive. This accords with our everyday experience, the animal literature (e.g., Cook et al., 2000), and the Snickers ad campaign, “You’re Not You When You’re Hungry.” However, this observation does not imply that glucose reflects the fuel necessary to muster the willpower not to harm one’s partner.

For their second analysis, mean blood glucose levels across 3 weeks were related to aggressive behavior toward the partner. Analyzed in this way, glucose levels do not indicate the current state of a fluctuating self-control resource, but are rather a trait variable. This has important implications for the authors’ conclusions. The more aggressive participants on the laboratory task were not those who were ego-depleted or hungry in that particular moment. They had low blood sugar concentration in general, a trait that can be linked to aggression via numerous third variables. . . . Whereas the reported correlation might provide information about the biology of individual differences in aggression, it does not support the glucose model of self-control. . . .

The post This April Fools post is dead serious appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Judgment Under Uncertainty: Heuristics and Biases appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>So I wrote to the person who sent me the article:

Is this paper serious? Cos I don’t understand a single thing this guy is saying.

My correspondent sent back a link to the author’s webpage and the following assessment:

The author has a mustache, so I have my doubts.

Excellent use of heuristics.

The post Judgment Under Uncertainty: Heuristics and Biases appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Replication is a good idea, but this particular replication is a bit too exact! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>From:

Subject: Self-Plagarism in Current Opinion in Psychology

Date: March 9, 2018 at 4:06:25 PM EST

To: “gelman@stat.columbia.edu”

Hello,

You might be interested in the tremendous amount of overlap between two recent articles by Benjamin & Bushman (2016 & 2018) in Current Opinion in Psychology. The articles “The Weapons Effect” https://doi.org/10.1016/j.copsyc.2017.04.011 and “The Weapons Priming Effect” https://doi.org/10.1016/j.copsyc.2016.05.003 seem to simply be lightly-rewritten versions of the same piece. Should readers be made aware of the overlap?

I don’t know who this “aoju3n+8h52tq8nmy8ms” person is, but . . . this story is amazing!

To say there’s a “tremendous amount of overlap” between these two papers is an understatement.

To start with, here are the abstracts:

From 2016:

In many societies, weapons are plentiful and highly visible. This review examines recent trends in research on the weapons priming effect, which is the finding that the mere presence of weapons can prime people to behave aggressively. The General Aggression Model provides a theoretical framework to explain why the weapons priming effect occurs. This model postulates that exposure to weapons increases aggressive thoughts and hostile appraisals, thus explaining why weapons facilitate aggressive behavior. Data from meta-analytic reviews are consistent with the General Aggression Model. These findings have important practical as well as theoretical implications. They suggest that the link between weapons and aggression is very strong in semantic memory, and that merely seeing a weapon can make people more aggressive.

from 2018:

In some societies, weapons are plentiful and highly visible. This review examines recent trends in research on the weapons effect, which is the finding that the mere presence of weapons can prime people to behave aggressively. The General Aggression Model provides a theoretical framework to explain why the weapons effect occurs. This model postulates that exposure to weapons increases aggressive thoughts and hostile appraisals, thus explaining why weapons facilitate aggressive behavior. Data from meta-analytic reviews are consistent with the General Aggression Model. These findings have important practical as well as theoretical implications. They suggest that the link between weapons and aggression is very strong in semantic memory, and that merely seeing a weapon can make people more aggressive.

It keeps going from there.

Really, there are only three things missing from that second paper:

1. A left quotation mark (“, or, as we say in Latex, ``)

2. A right quotation mark (”, or, as we say in Latex, '')

3. The following phrase at the very beginning of the paper: “As Benjamin and Bushman (2016) wrote:”

At times I’ve felt some sympathy for authors who follow Arrow’s theorem and publish the same article multiple times: after all, it gives you a change to reach multiple audiences.

But in this case there’s really no excuse at all, as the two papers are published in *the very same journal*.

Here’s something funny:

Can you believe it? Dude was so clueless that he copied an entire article he’d written, then edited that article, never remembering that he had already published it two years ago.

Brad J. Bushman is Professor of Communication and Psychology, Margaret Hall and Robert Randal Rinehart Chair of Mass Communication at Ohio State University. He also appears to be affiliated with Vrije Universiteit in the Netherlands. Perhaps he holds the Diederik Stapel chair there?

A google search also revealed that Brad Bushman retracted a paper which caused one of his students to retroactively lose her Ph.D. from Ohio State. Bushman has published other papers that appear to have problems. In the meantime, though, he “received the Kurt Lewin Award from the Society for the Psychological Study of Social Issues for ‘outstanding contributions to the development and integration of psychological research and social action.'”

Bushman also reports:

I have published over 200 peer-reviewed journal articles.

Umm, better change that to “over 199,” as I don’t think “The weapons effect” and “The weapons priming effect” should count as two papers. If publishing two papers with the same content counts as two different articles, then I could easily up my publication count to 10,000 by just standing by the xerox machine.

**P.S.** I searched Ohio State University’s misconduct rules and found this, which is item 5 on a list of examples of academic misconduct:

Submitting substantially the same work to satisfy requirements for one course or academic requirement that has been submitted in satisfaction of requirements for another course or academic requirement without permission of the instructor of the course for which the work is being submitted or supervising authority for the academic requirement.

Apparently this is a problem if you’re a student, not so much if you’re the “Margaret Hall and Robert Randal Rinehart Chair of Mass Communication.”

Jeez. Bushman was editing the damn journal issue. If he and his collaborators really had nothing new to say, then fine, why not just reproduce the abstract from the earlier paper, with direct citation, and let some other people publish something in the journal? What’s the point of it all? Just to rack up your publication count from 199 to 200?

The whole thing is so pitiful, to go to the trouble of cheating and not even get anything for it. Really the worst of both worlds.

Say what you want about Lance Armstrong, at least he got to wear the yellow jersey for awhile. And Barry Bonds, he got the home run record. But Brad J. Bushman, all he got for his efforts was a duplicate paper in a journal that nobody reads. Was it really worth it, dude?

**P.P.S.** I just realized something. The guy’s job title is Chair of Mass Communication. Publishing the same article multiple times, that really is a form of “mass communication”!

The post Replication is a good idea, but this particular replication is a bit too exact! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Hey! Free money! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>On Dec 27, 2017, at 6:55 PM, **@gmail.com wrote:

My name is ** and I am a freelance writer hoping to contribute my writing to andrewgelman.com. I would be willing to compensate you for publishing.

For my posts, I require one related client link within the body of my article, as well as no “guest” or “Sponsored” tag on the post. I am willing to offer $50 per post published on the site as compensation for these things. Please let me know if you are interested and/or if you have any questions.

OK, here’s the cool part: For all you know, I’ve taken this guy (or bot) up on his offer. So from now on, when reading this blog, you’ll have to guess which of the posts you’re seeing are sponsored content. Given the conditions above, unfortunately I’m *not* allowed to label these particular posts.

It’s possible, right? Just about all my posts have links, and, hmmmm . . . $50 per post x 400 posts a year = $20,000. That’s real money! Also think of all the time this frees up for me, if someone else is writing all my posts for me. It’s really a win-win situation, and Google can translate all the Russian to English, no problem.

The post Hey! Free money! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Yet another IRB horror story appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Faced with submitting twenty-seven new pieces of paperwork to correct our twenty-seven infractions, Dr. W and I [Alexander] gave up. We shredded the patient data and the Secret Code Log. We told all the newbies they could give up and go home. We submitted the Project Closure Form to the woman in the corner office (who as far as I know still hasn’t completed her Pre-Study Training). We told the IRB that they had won, fair and square; we surrendered unconditionally.

They didn’t seem the least bit surprised. . . .

I feel like some scientists do amazingly crappy studies that couldn’t possibly prove anything, but get away with it because they have a well-funded team of clerks and secretaries who handle the paperwork for them. And that I, who was trying to do everything right, got ground down with so many pointless security-theater-style regulations that I’m never going to be able to do the research I would need to show they’re wrong. . . .

We’ve discussed IRB nightmares before; see here and here. And here's a discussion from Macartan Humphreys on how ethical concerns differ in health and social science research.

The post Yet another IRB horror story appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Mitzi’s talk on spatial models in Ann Arbor, Thursday 5 April 2018 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Abstract**

This case study shows how to efficiently encode and compute an intrinsic conditional autoregressive (ICAR) model in Stan. When data has a neighborhood structure, ICAR models provide spatial smoothing by averaging measurements of directly adjoining regions. The Besag, York, and Mollié (BYM) model is a Poisson generalized linear model (GLM) which includes both an ICAR component and an ordinary random-effects component for non-spatial heterogeneity. We compare two variants of the BYM model and fit two datasets taken from epidemiological studies of Scottish lip cancer (56 regions) and New York city pedestrian traffic deaths (700 regions).

It’s based on her Stan case study on ICAR models.
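The computational heart of the abstract is the ICAR component, which in Mitzi's case study is encoded in its pairwise-difference form: the (improper, unnormalized) log density is -0.5 Σ over neighboring pairs (i, j) of (φ_i − φ_j)², evaluated over an edge list rather than a full adjacency matrix. Here is a minimal NumPy sketch of that density; the toy map and variable names are illustrative, not taken from the case study:

```python
import numpy as np

def icar_lpdf(phi, node1, node2):
    """Unnormalized ICAR log density in pairwise-difference form:
    -0.5 * sum over edges (phi_i - phi_j)^2.
    node1/node2 are parallel arrays listing the edges of the region graph.
    (A sum-to-zero constraint on phi is needed for identifiability;
    it is omitted here for brevity.)"""
    phi = np.asarray(phi, dtype=float)
    diffs = phi[node1] - phi[node2]
    return -0.5 * np.dot(diffs, diffs)

# Toy map: 4 regions in a line, edges 0-1, 1-2, 2-3
node1 = np.array([0, 1, 2])
node2 = np.array([1, 2, 3])

smooth = icar_lpdf([0.0, 0.1, 0.2, 0.3], node1, node2)
rough = icar_lpdf([0.0, 1.0, -1.0, 1.0], node1, node2)
# The spatially smoother surface gets the higher (less negative) log density.
```

The edge-list encoding is what makes the 700-region New York dataset feasible: the cost scales with the number of edges, not with the square of the number of regions.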

**Registration**

The event is open to the public. Here are the Meetup registration details.

The post Mitzi’s talk on spatial models in Ann Arbor, Thursday 5 April 2018 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Combining Bayesian inferences from many fitted models appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m curious about your opinion on combining multi-model inference techniques with rstanarm:

On the one hand, screening all (theoretically meaningful) model specifications and fully reporting them seems to make a lot of sense to me — in line with the idea of transparent reporting, your idea of the multiverse analysis, or akin to Simonsohn’s Specification Curve Analysis (which seem to be quite closely related to each other anyway). So in the past, I’ve been using tools such as glmulti or regressionBF (from the BayesFactor package) to obtain model-averaged coefficients and more interestingly, the “importances” of the various predictors, which are determined across various model specifications (i.e., in the case of glmulti). However, these tools are only available for the “traditional” OLS and related estimation methods.

On the other hand, I’ve recently started to use rstanarm, which I now clearly prefer to the traditional estimation methods, not least because of the possibility to specify weakly informative priors and the resulting regularization.

As different model specifications would make sense from a theoretical point of view in some of my current projects, I’ve now wondered if it would be reasonable to write a wrapper that automatically implements different model specifications and runs them with rstanarm, which would permit an (automatic) comparison of coefficients across different model specifications (and would possibly also permit to extract a measure of “importance” for each of the predictors). Of course, this would need to be highly parallelizable to make it computationally feasible.

Do you have any thoughts on this, and / or do you plan to implement something related in rstanarm in the future?

My reply:

Yes, definitely use rstanarm! Or go straight to Stan (that would be rstan if you’re running it from R) and program more general models. If you want to fit several models and average their posterior distributions, I recommend stacking, as described in this recent paper. Also, sure, someone could write a wrapper to automatically fit a large number of models in rstanarm in parallel and then average over them—this would not be hard to do—but I think I’d prefer fitting a single model with all these predictors and interactions, using strong priors to regularize all the coefficients.
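The stacking idea in the reply can be sketched numerically: given each model's pointwise predictive density at each held-out data point (e.g., from leave-one-out cross-validation), stacking chooses simplex weights maximizing the combined log score Σ_i log Σ_k w_k p_ik. The sketch below uses a simple exponentiated-gradient ascent rather than the optimizer in the actual stacking paper; function and variable names are mine, not from any package:

```python
import numpy as np

def stacking_weights(lpd, iters=2000, lr=0.1):
    """lpd: (n_points, n_models) matrix of pointwise predictive
    densities (e.g., exponentiated LOO log predictive densities).
    Maximizes sum_i log(sum_k w_k * lpd[i, k]) over the simplex
    via exponentiated-gradient ascent."""
    n, K = lpd.shape
    w = np.full(K, 1.0 / K)  # start from equal weights
    for _ in range(iters):
        mix = lpd @ w                               # mixture density at each point
        grad = (lpd / mix[:, None]).sum(axis=0)     # gradient of the log score
        w *= np.exp(lr * grad / n)                  # multiplicative update
        w /= w.sum()                                # stay on the simplex
    return w

# Two models scored at 50 points; model 0 predicts better everywhere,
# so the stacking weight should concentrate on it.
lpd = np.tile([0.9, 0.1], (50, 1))
w = stacking_weights(lpd)
```

In practice one would get the pointwise log densities from `loo` output on each fitted rstanarm model rather than constructing them by hand.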

The post Combining Bayesian inferences from many fitted models appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Heuristics and Biases? Laplace was there, 200 years ago. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In an article entitled Laplace’s Theories of Cognitive Illusions, Heuristics, and Biases, Josh “hot hand” Miller and I write:

In his book from the early 1800s, Essai Philosophique sur les Probabilités, the mathematician Pierre-Simon de Laplace anticipated many ideas developed in the 1970s in cognitive psychology and behavioral economics, explaining human tendencies to deviate from norms of rationality in the presence of probability and uncertainty. A look at Laplace’s theories and reasoning is striking, both in how modern they seem and in how much progress he made without the benefit of systematic experimentation. We argue that this work points to these theories being more fundamental and less contingent on recent experimental findings than we might have thought.

We conclude:

Laplace’s approach to identifying behavior that departed from the enlightenment conception of rational decision making—an effort that occurred in parallel with his role as a major architect of this ideal, as it applied to inference and decision making under uncertainty—spurred him to search for the general principles of reasoning that underlay these departures. That many of his explanations happen to coincide with modern accounts, arrived at independently based on the same introspections that evidently guided Laplace, suggests that the heuristics and biases approach to judgement and decision making is a scientific contribution that will endure.

More generally, Laplace’s work as a proto-psychologist and applied statistician, which complemented his career as a mathematician and physicist, demonstrates the creative tension between normative and descriptive ideas of inference and decision making. . . .

Modern behavioral science research has taken us far beyond Laplace. While Laplace was an early advocate for the scientific method to be applied to psychological questions, he was limited in his inquiry by his reliance upon observational data. Modern research, through the use of innovative and carefully designed experimental demonstrations, has provided insights and further directions of study into how and why human behavior departs from the normative model of probability theory (Kahneman et al., 1982). Looking at decision making from a different direction, as Laplace’s faith in a clockwork universe that could be reduced to intelligible causes via the scientific method has been called into question with the discovery of quantum phenomena and emergent complexity, Laplace’s assumption that probability theory could serve as a domain-independent prescriptive model for human judgement has been upended by research demonstrating the relative efficacy of simple domain-specific decision rules and predictive models that respect cognitive limitations, tacit knowledge, multidimensionality of goals, and the need to adapt to complex and changing environments (Meehl, 1954; Gigerenzer and Brighton, 2009; Todd and Gigerenzer, 2000).

Nevertheless, Laplace’s attempts to understand the underlying mechanisms for people’s biases were highly original, insightful, in many ways were centuries ahead of their time, and in at least two instances produced novel conjectures that have not been tested to this day. We believe that modern-day social and behavioral scientists can benefit from revisiting Laplace’s thinking on illusions in the estimation of probabilities, and beyond.

**P.S.** Emma Gillingham sent in the above photo of Pepper, along with the following description:

The classic ‘in or out’ debate – open a door, the cat wants to stay in. Close it, the cat wants to go out. Repeat for the next few hours. Some kind of cognitive bias that the grass is always greener?!

The post Heuristics and Biases? Laplace was there, 200 years ago. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The problem with those studies that claim large and consistent effects from small and irrelevant inputs appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Dale Lehman writes:

You have often critiqued those headline grabbing studies such as how news about shark attacks influence voting behavior, how the time of month/color of clothing influences voting, etc. I am in total agreement with your criticisms of this “research.” Too many confounding variables, too small sample sizes, too many forking paths, poor incentives to grab headlines, etc. But one aspect of your critique I don’t understand. You have said (my paraphrasing here) that people’s political beliefs are not as superficial and fickle as these studies claim to show. I am wondering (I don’t know this research) if your prior on this is based on evidence or is it mood affiliation on your part? While I don’t find any of these studies (the ones you critique) convincing, my own prior is that people’s voting behavior is indeed fickle and superficial. The last presidential election is but a glaring example of this. Repeatedly, people seem to vote according to what seem like frivolous and easily manipulated perceptions. Is there a disconnect between your views on voters’ beliefs and your critique of research which seems to portray voters as easily manipulated? Or are you saying voters are not easily influenced but their beliefs may be based on superficial and irrational perceptions?

My reply:

First, let’s separately consider primary and general elections. Primary elections are hard to predict because the candidates have the same party affiliation and typically have similar positions, voters often don’t have much time to think about their choice, and there can be many candidates running. General elections are much more patterned.

I don’t think *most* vote choices in the general election are superficial or fickle. Most people vote their party ID, and we saw this in 2016 as well as 2014, 2012, 2010, etc.

I agree, however, that *some* people vote based on superficial or fickle reasons, and these choices can make a difference in a close election.

But the papers on ovulation and voting, shark attacks and voting, college football and voting, etc., *don’t* just say that voters, or some voters, are superficial and fickle. No, these papers claim that seemingly trivial or irrelevant factors have *large and consistent effects*, and that I don’t believe. I do believe that individual voters can be influenced by these silly things, but I don’t buy the claim that these effects are predictable in that way. The problem is interactions. For example, the effect on my vote of the local college football team losing could depend crucially on whether there’s been a shark attack lately, or on what’s up with my hormones on election day. Or the effect could be positive in an election with a female candidate and negative in an election with a male candidate. Or the effect could interact with parent’s socioeconomic status, or whether your child is a boy or a girl, or the latest campaign ad, etc.

**P.S.** Thanks to Diana Senechal for the above photo.

The post The problem with those studies that claim large and consistent effects from small and irrelevant inputs appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Bayesian inference for A/B testing: Lauren Kennedy and I speak at the NYC Women in Machine Learning and Data Science meetup tomorrow (Tues 27 Mar) 7pm appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Bayesian inference for A/B testing

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Lauren Kennedy, Columbia Population Research Center, Columbia University

Suppose we want to use empirical data to compare two or more decisions or treatment options. Classical statistical methods based on statistical significance and p-values break down in the context of incremental improvement: that is, when there is a stream of innovations, each only slightly better (or possibly slightly worse) than what came before. In contrast, a Bayesian approach is ideally suited to decision making under uncertainty. We discuss the implications for applied statistics and code up some of these models in R and Stan, based on a case study by Bob Carpenter.

The post Bayesian inference for A/B testing: Lauren Kennedy and I speak at the NYC Women in Machine Learning and Data Science meetup tomorrow (Tues 27 Mar) 7pm appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Spatial patterns in crime: Where’s he gonna strike next? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I am a criminologist and mostly do spatial analyses of crime patterns: where does crime occur and why in these neighborhoods / at these locations, and so on. Currently, I am thinking about offender decision-making behavior, specifically his ‘location choice’ of where to offend.

Hey, how about criminologists instead of looking to someone else to solve their problem, do something about maybe taking CPR classes?

But I digress.

Steenbeek continues:

You may be surprised that I contact you as you don’t work on this substantive subject, but the models used to analyze such behaviors are familiar to you as they are used in (voting) choice behavior: multinomial logit. More specifically, most criminological studies use McFadden’s conditional logit model. Just like a voter can choose from 4 political parties, or a traveler can choose between 5 modes of travel behavior (choice between car, bus, taxi, walk, or bike), a criminal can choose where to offend. Characteristics of the offender, characteristics of each choice alternative (e.g. deterrence level at each location), and offender-specific characteristics of each choice alternative (e.g. distance from each offender to each location) are then used to model the choice of crime location.

The main difference is that the offender chooses the *location* where crime is committed, and therefore the choice set is usually much larger, depending on the definition of ‘location’. Often, location refers to “neighborhood” (census tract). When studying offenders within one city, each offender is modeled to choose which neighborhood he commits crime in. This can easily be a choice set of 50-100 neighborhoods. But location can also refer to smaller spatial units of analysis such as street segments (the part of a street between two intersections), leading to a choice set of a few thousand (!) alternatives.

A disadvantage of the conditional logit model is the assumption of the independence of irrelevant alternatives. Especially in spatial analyses where nearby locations are very similar to each other, this violates the IIA assumption. Exactly *two* studies have used the ‘mixed logit’ model that does not suffer from IIA. The R package RSGHB (https://cran.r-project.org/package=RSGHB) can be used to estimate these models using a Hierarchical Bayesian framework.

(1) Stan is what I would prefer to work with. But can such mixed logit models be programmed relatively easily in Stan? Or would you suggest we keep using RSGHB? (RSGHB uses Metropolis-Hastings.)

A second question is with regard to the use of an informative prior / knowledge about each offender. One can only commit crimes in locations one has knowledge of. (Let’s assume for now that an offender only has knowledge of locations that he visits himself, and that the study area is limited to one city and the locations chosen are neighborhoods.) In the ideal data situation, we would know exactly how familiar each offender is with each neighborhood. Then one could use multinomial logits with varying choice sets. (An equivalent example from voting behavior would be that a voter in district A can choose from candidates of political parties A, B, C, but a voter in district B can only choose from candidates of political parties A and B: in that case voter[B] should not be modeled as if he can choose C, because he simply *cannot* choose C by definition.)

In practice however, we only have (at best) a “likely” familiarity of each offender with each neighborhood, predicted using other sources (such as smartphone travel data of the population). I cannot quite wrap my head around how to incorporate such offender-specific best guesses/proxy of the familiarity with each neighborhood into the model. I suppose I can simply add it as an offender-specific covariate to the model (but there is a lot of uncertainty in the prediction, so this variable will need to incorporate measurement error).

Theory suggests that if the offender is unfamiliar with a neighborhood, then the chance that he commits there is essentially 0. So perhaps I should include an interaction between the offender-specific location-familiarity variable and all other variables to capture this?

But actually, my feeling is that the offender-specific location-familiarity is some kind of “prior”, i.e. our individual-specific prior knowledge of an offender’s choice set. This prior would be on the Y’s, similar to a multinomial model with varying choice sets, but without removing some choice alternative completely (as we cannot be 100% sure that an offender is really totally unfamiliar with a neighborhood). I have no idea if that is a feasible approach, however.

(2) What do you recommend for the situation described above?

My reply:

1. Yes, you should definitely be able to fit those RSGHB models in Stan; this should be no problem at all. If difficulties arise, you can post questions to the Stan users group and you’re likely to get a polite and helpful answer (we’re a Ripley-free zone).

2. If you have information on how “likely familiar” an offender is with each area, I think you should just put this in as a continuous predictor. The model should work fine, and it shouldn’t really matter how many nonzero cells there happen to be.
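For concreteness, here is a minimal sketch of the conditional logit structure Steenbeek describes (in Python rather than Stan, with made-up coefficients and data, not anything fitted), along with the IIA property that motivates the move to mixed logit:

```python
import math

def conditional_logit_probs(utilities):
    """McFadden conditional logit: P(j) = exp(V_j) / sum_k exp(V_k)."""
    m = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical systematic utilities for one offender choosing among four
# neighborhoods: V_j = beta_dist * distance_j + beta_det * deterrence_j.
beta_dist, beta_det = -0.5, -1.0   # made-up coefficients
distance = [1.0, 2.0, 3.0, 0.5]
deterrence = [0.2, 0.1, 0.4, 0.9]
V = [beta_dist * d + beta_det * g for d, g in zip(distance, deterrence)]
probs = conditional_logit_probs(V)

# IIA in action: the ratio P(1)/P(2) is unchanged when neighborhood 4 is
# dropped from the choice set -- exactly the assumption that similar,
# nearby neighborhoods violate.
probs_sub = conditional_logit_probs(V[:3])
print(round(probs[0] / probs[1], 6) == round(probs_sub[0] / probs_sub[1], 6))
```

Roughly speaking, the same likelihood can be written in Stan with `categorical_logit` over the utility vector, and the mixed logit then adds offender-level random coefficients on top.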

**P.S.** I wonder if we could set up this sort of model for junk science: we’d have a grid with research topics in one direction, and journals in the other. The challenge would be to predict where this stuff would be published next.

The post Spatial patterns in crime: Where’s he gonna strike next? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Request for a cat picture appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**P.S.** Javier Benítez points us to this page of free stock photos of cats. Cool! Still, if anyone has anything particularly appropriate to the topic above, just let me know. Thanks again.

The post Request for a cat picture appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Debate over claims of importance of spending on Obamacare advertising appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Anderson expresses skepticism about this claim.

I’ll first summarize the claims of Shafer et al. and then get to Anderson’s criticism.

Shafer et al. write:

The Trump administration announced Thursday that it was cutting spending on advertising for the 2018 Marketplace open enrollment period from $100 million to $10 million. Empirical work can inform our expectations for its impact, assuming these cuts are implemented. . . .

Kentucky—an early success story under the ACA—sponsored a robust multimedia campaign to create awareness about its state-based marketplace, known as kynect, to educate its residents about the opportunity to gain coverage. However, after the 2015 gubernatorial election, the Bevin administration declined to renew the advertising contract for kynect and directed all pending advertisements to be canceled with approximately six weeks remaining in the 2016 open enrollment period. The reduction in advertising during open enrollment gives us precisely the rare leverage needed to assess the influence of advertising using real-world data.

We obtained advertising and Marketplace data in Kentucky to identify whether a dose-response relationship exists between weekly advertising volume and information-seeking behavior. . . .

Each additional kynect ad per week during open enrollment was associated with an additional 7,973 page views (P=.001), 390 visits (P=.003), and 388 unique visitors (P<.001) to the kynect web site per week. Based on the average number of ads per week during the first two open enrollment periods, our estimates imply that there would have been more than 450,000 fewer page views, 20,000 fewer visits, and 20,000 fewer unique visitors per week during open enrollment without the television campaign. . . .

But Anderson is concerned that the changes attributed to advertising are explainable more simply as artifacts arising from timing of the treatment variation:

I’ve seen this linked to from a few other sites so I thought you might want to comment.

The authors run a linear regression with the unique visitors, site visits, and calls as the dependent variables and the number of weekly ads by the state exchange (among other “controls”) as the independent variable of interest. Aside from the usual problems of hypothesis testing and forking paths, and beyond the concerns that the effect is probably nonlinear enough to make a linear regression inappropriate, the co-movements of the unique visitors and the number of weekly ads just don’t seem that convincing. Notice that at the very beginning of the study period in the graph there are 160,000 unique visitors (which is what we’re really concerned about). That number is cut in half after the first week (probably because there were several people just curious about the website), and this outlier may have a significant effect on the results. If you look at the variance in weekly advertising outside of the open enrollment periods, the number of unique visitors barely changes, and sometimes the big changes precede any increase in advertising. My sense is that the beginning and ending of open enrollment are the main drivers of visits and visitors, and those just happen to coincide with large changes in advertising.

The post Debate over claims of importance of spending on Obamacare advertising appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “The problem of infra-marginality in outcome tests for discrimination” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Outcome tests are a popular method for detecting bias in lending, hiring, and policing decisions. These tests operate by comparing the success rate of decisions across groups. For example, if loans made to minority applicants are observed to be repaid more often than loans made to whites, it suggests that only exceptionally qualified minorities are granted loans, indicating discrimination. Outcome tests, however, are known to suffer from the problem of infra-marginality: even absent discrimination, the repayment rates for minority and white loan recipients might differ if the two groups have different risk distributions. Thus, at least in theory, outcome tests can fail to accurately detect discrimination. We develop a new statistical test of discrimination—the threshold test—that mitigates the problem of infra-marginality by jointly estimating decision thresholds and risk distributions. Applying our test to a dataset of 4.5 million police stops in North Carolina, we find that the problem of infra-marginality is more than a theoretical possibility, and can cause the outcome test to yield misleading results in practice.

It’s an interesting combination of economics and statistics. Also, they do posterior predictive checks and use Stan! I only wish that on Figure 8 they’d’ve labeled the lines directly. Or at least put the codes of the legend in the same order as the lines in the graph. Figure 9, too. Also, I think Figure 7 would’ve worked better as a 2 x 4 grid of graphs. All those dots with different colors are just too hard to visually process.
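For intuition, the infra-marginality problem is easy to reproduce in simulation. Here’s a toy sketch (made-up Beta risk distributions, nothing from the paper’s data): both groups face the identical lending threshold, yet their observed repayment rates differ:

```python
import random

random.seed(0)

def simulate_success_rate(risks, threshold):
    """Grant a loan iff the repayment probability exceeds the threshold;
    return the observed repayment rate among loans granted."""
    granted = [p for p in risks if p > threshold]
    repaid = sum(1 for p in granted if random.random() < p)
    return repaid / len(granted)

# Hypothetical risk distributions: both groups are judged by the SAME
# threshold (i.e., no discrimination), but group B's distribution is shifted.
group_a = [random.betavariate(6, 2) for _ in range(20000)]
group_b = [random.betavariate(3, 2) for _ in range(20000)]
threshold = 0.5
rate_a = simulate_success_rate(group_a, threshold)
rate_b = simulate_success_rate(group_b, threshold)
print(round(rate_a, 2), round(rate_b, 2))  # different rates, same threshold
```

An outcome test comparing these two rates would flag discrimination even though, by construction, there is none — which is the gap the threshold test is designed to close.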

The post “The problem of infra-marginality in outcome tests for discrimination” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post An economist wrote in, asking why it would make sense to fit Bayesian hierarchical models instead of frequentist random effects. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>My reply:

Short answer is that anything Bayesian can be done non-Bayesianly: just take some summary of the posterior distribution, call it an “estimator,” and there you go. Non-Bayesian can be regularized, it can use prior information, etc. No reason that a non-Bayesian method has to use p-values. To put it another way, there’s Ms. Bayesian and there’s Ms. Bayesian’s evil twin, who lives in a mirror world and does everything that Ms. Bayesian does, but says it’s non-Bayesian. The evil twin doesn’t trust Bayesian methods, she’s a real skeptic, so she just copies Ms. Bayesian but talks about regularizers instead of priors, and predictive distributions instead of posteriors. It doesn’t really matter, except that the evil twin might have more difficulty justifying her estimation choices because she can’t refer to a generative model.

Now if people want to defend some *particular* “frequentist” procedure, that’s another story. The procedures out there tend to under-regularize; they get noisy estimates of group-level variance parameters (see here and here) and they lead to overestimates of magnitudes of effect sizes (see here).

The usual non-Bayesian procedures are designed to work well asymptotically (in the case of hierarchical models, this is the limit as the number of groups approaches infinity). But as noted Bayesian J. M. Keynes could’ve said, asymptotically we’re all dead.

The post An economist wrote in, asking why it would make sense to fit Bayesian hierarchical models instead of frequentist random effects. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Are self-driving cars 33 times more deadly than regular cars? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’ve been mulling the noise over Uber’s pedestrian death.

While there are fewer pedestrian deaths so far from autonomous cars than non-autonomous (one in a few thousand hours, versus 1 every 1.5 hours), there is also, of course, a big difference in rates per passenger-mile. The rate for autonomous cars is now 1 for 3 million passenger miles, while the rate for non-autonomous cars is 1 for every 100 million passenger miles. This raises the obvious question: If the rates are actually the same per passenger mile, what’s the likelihood we would have seen that first autonomous car pedestrian death in the first 3 million passenger-miles?

Initially I wanted to model this as a Poisson distribution, with outbreaks (accidents) randomly distributed through passenger-miles. Then I thought it should be a comparison of proportions. What is the best approach here?

I haven’t checked the above numbers so I’ll take Kedrosky’s word for them, for the purpose of this post.

My quick reply to the above question is that the default model would be exponential waiting time. So if the rate of the process is 1 for every 100 million passenger miles, then the probability of seeing the first death within the first 3 million passenger miles is 1 – exp(-0.03) = 0.03. So, yes, it could happen with some bad luck.
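The arithmetic, as a quick check:

```python
import math

# Exponential waiting time: if deaths arrive at a rate of 1 per 100 million
# passenger-miles, the chance that the first death falls within the first
# 3 million passenger-miles is 1 - exp(-3/100).
rate = 1 / 100e6      # deaths per passenger-mile
miles = 3e6
p_first = 1 - math.exp(-rate * miles)
print(round(p_first, 3))  # 0.03
```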

Really, though, I don’t think this approach is appropriate to this problem, as the probabilities are changing over time—maybe going up, maybe going down, I’m not really sure. I guess the point is that we could use the observed frequency of 1 per 3 million to get an estimated rate. But this one data point doesn’t tell us so much. In general I’d say we could get more statistical information using precursor events that are less rare—in this case, injuries as well as deaths—but then we could have concerns about reporting bias.

The post Are self-driving cars 33 times more deadly than regular cars? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Forking paths said to be a concern in evaluating stock-market trading strategies appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Again, the Chordia et al. paper is fine for what it is; I just think they’re making their life more difficult by using this indirect hypothesis-testing framework, testing hypotheses that can’t be true and oscillating between the two inappropriate extremes of theta=0 and theta being unconstrained. To me, life’s just too short to mess around like that.

The post Forking paths said to be a concern in evaluating stock-market trading strategies appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Lessons learned in Hell appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m halfway through my third year as a consultant, after 25 years at a government research lab, and I just had a miserable five weeks finishing a project. The end product was fine — actually really good — but the process was horrible and I put in much more time than I had anticipated. I definitely do not want to experience anything like that again, so I’ve been thinking about what went wrong and what I should do differently in the future. It occurred to me that other people might also learn from my mistakes, so here’s my story.

The job was a ‘program review’ for a large, heavily regulated company that has an incentive program to try to get customers to behave in certain ways that both the company and its regulators feel are desirable. (Sorry, I have to remain pretty vague about this). The company is required to have an independent contractor review the program annually, to quantify the effectiveness of it and to predict how the program would perform under a few hypothetical futures. The review had been performed by the same consulting company for several years in a row, and it looked like they had done a competent job, so my partner and I felt that we were at a substantial disadvantage bidding against them: the other company already had experience with the data and the process, they had computer code already written and personnel who knew how to run it, and they would presumably do most of the work with junior staff whereas my partner and I are both quite experienced and not willing to work for the low wages a junior person would be happy with. Still, we saw several places where we were sure we could improve on the previous work, and we had been told that the company likes to switch consultants every now and then, so we put in a bid, and we got the job.

An initial stage of statistical manipulation was needed in order to process the raw data into numbers that would feed into the statistical model that would spit out the outputs, and it’s in that initial stage that we made our improvements. For the model itself, which was in my bailiwick, I was going to use pretty much the same model that had been used in the past. The previous reports noted several potential problems with the model, and said they had checked and the effect of the problems was small. There was one thing about the model I really didn’t like: the use of two very highly correlated explanatory variables in a linear regression model. That’s OK if you are making predictions for regions of parameter space that are well sampled in your training data, but can be a big problem when extrapolating, which is what we needed to do. So I changed the variables in the model to use a less highly correlated set, checked that the model still fit about the same, and kept plugging along.

So everything was going fine…right up until the first major ‘deliverable’, about five weeks ago. All the client needed was a few hundred numbers in a table, representing the estimated performance of the program under those hypothetical futures I mentioned earlier. The numbers were due on a Monday, and I had them ready on Friday: I was looking forward to a weekend getaway with my wife and some friends, and didn’t want to have any nagging worry about not being able to submit the numbers on time. So, Friday I sent the numbers to our project manager at the company…and got back an email that afternoon that said: why are the numbers so low? Last year they predicted such-and-such, and you guys are 20% under!

My first thought was: uh oh, where did I screw up? Each of the final numbers is the sum of several other numbers, maybe I forgot to add one of them? Or maybe the problem is in the pre-processing pipeline, which would cause a problem with the inputs to the model? Or maybe…

I worked through all of Friday night, checking one thing at a time, but I didn’t find anything wrong. When my wife woke up Saturday morning I had to send her off with our friends on their getaway; I kept working. I worked all Saturday and still couldn’t find the problem; ditto Sunday. Sunday afternoon I finally roped my partner in, and we spent the day cleaning things up in my analysis code and checking as much as we could check. And we did find some things wrong! I had failed to adequately specify merge parameters in R’s ‘merge’ function, which was causing some numbers to end up in the wrong places; the client had sent us some corrections to their data but those hadn’t made it into our database; and a few other things. None of these turned out to make much difference, but the fact that we were finding little errors made it seem possible that there was one (or more!) big one out there. And it was frustratingly hard to check things or test things because I continue to use some sloppy programming practices that I swore a year ago I would improve upon. Finally Monday afternoon rolled around. We had checked every module and every bit of code, we could not find any problems at all anymore…and we were still generating numbers that were 20% under what previous years had reported. We had no choice but to submit our numbers just before the deadline, with a nagging worry that there was still a major problem. The numbers were due at 5 pm Monday, so we submitted them then, but we asked “obviously nobody is going to start using these tonight and there are still things we would like to check out; when is the time people actually start with these?” The answer: start of business Tuesday, that’s why they were due Monday. So we kept working Monday evening after submitting the numbers.

Through all of this, my assessment of the situation had changed. On Friday, when first told that our numbers were different by 20%, I would have said there was a 90% chance the problem was my fault. Saturday night, maybe I would have said 60%. Sunday night, 30%. Monday evening, maybe 10%. So I was 90% sure we were OK, but that still meant a 10% chance that we weren’t!

Then, around midnight Monday night, my partner and I were looking everything over once again. I was looking at the previous year’s report, and I said “I just don’t understand how they could possibly get the numbers in the second column of such-and-such a table, these completely disagree with us and they don’t even seem plausible.” And then I noticed a footnote to the table: “Numbers in the second column have been adjusted under the assumption that…” And suddenly it all fell into place. The assumption was wrong, indeed easily demonstrated to be wrong, and it was wrong enough that replacing it with the right answer made about a 20% difference. My partner and I went to bed. I got a full night’s sleep for the first time in days.

So was that three days the visit to Hell I alluded to? Oh no, that was just the start. First, we faced some pushback from the client. We were telling them (and ultimately the regulators) that the program was less effective than they thought. They were reluctant to believe it, although (to my relief) they were not inclined to shoot the messenger. Still, for the next couple of days I did little but try to document the evidence that the previous report had gotten it wrong. So all in all this little episode cost about a week. When you’ve only got five weeks to write the final report, and you spend one of them on something you didn’t anticipate, the pressure mounts. Still not Hell, though.

The real problem was: having realized that the previous work had involved this major misjudgment, I no longer trusted that work in other ways; specifically, I no longer took it for granted that the statistical model that had been used in the past was adequate. I finally did what I should have done months (!) earlier: rather than make a few plots and tabulate a few things to confirm that the model was behaving OK, I started looking for evidence that the model had significant problems. And sure enough: the model had significant problems. I spent several solid days coming up with a final model that I was willing to stand behind…which doesn’t sound so bad, but it significantly expanded the workload: now, instead of just saying “we used the same model they used last year, except for such-and-such a minor modification”, we needed to explain why we did what we did, quantify how much difference this made, and so on. Of course this meant that the numbers we had submitted earlier — a week and a half ago, at this point — needed to be changed, because we had this new model, so that was a slightly embarrassing issue with the client. And we needed more tables, more plots, a section comparing the models and explaining the differences that are due to the model (in addition to a section explaining what had changed about the program; these two now had to be untangled). All of this with the clock running. Having lost a week diagnosing the problem with the previous model and convincing the client that it was indeed a big problem, and then another week coming up with an improved model and figuring out the implications, we now had about three weeks to write the report. Five had already seemed pretty tight, to write what we expected to be a simpler report. Having 3/5 of the time to write a report that was 5/3 as complicated (or something)…well, we got it done, and I am pretty proud of it, but the five weeks from the start of this narrative to the moment we finished were as unpleasant a period of work as I’ve experienced in 25 years.

What are the lessons I am taking away from this?

The biggest lesson is: You know your model is not perfect, which means you know there are ways in which the answers it generates are wrong. You need to know what those ways are, and see whether the results are good enough for your purposes. If they aren’t, you need to modify the model. My initial mindset was more like “this is the model that was used in the past, let me check a few things and see if it makes sense”, and that’s completely wrong. My goal should have been to try to demonstrate that the model was inadequate, not to try to check whether it’s adequate.

The second lesson is: I really do need to code better. I did not find any really major errors in the work that I had done to generate the first set of numbers, but I could have; indeed, it seemed so plausible that I had made a major mistake that I assumed that must be the problem. Right now I am very sure that my analysis, and the code that implements it, does not have any mistakes of practical significance, but that was definitely not true at the time that we submitted our first results, and it should have been. This is partly a matter of simply allowing more time, but there’s a big component of better practices: create modules that can be tested independently, and then create tests for them, those are two of the big things I didn’t do initially.
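The practice that second lesson names can be as lightweight as this (a hypothetical module, just to show the shape): small single-purpose functions, each with a test that can run on its own:

```python
def weekly_totals(records):
    """Sum usage by week. Small enough to test in isolation, instead of
    being buried in the middle of one long analysis script."""
    totals = {}
    for week, usage in records:
        totals[week] = totals.get(week, 0) + usage
    return totals

def test_weekly_totals():
    assert weekly_totals([]) == {}
    assert weekly_totals([(1, 5), (1, 7), (2, 3)]) == {1: 12, 2: 3}

test_weekly_totals()  # cheap enough to rerun after every change
```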

Finally — and this is a minor one compared to the others — I should have been a lot less trusting of the previous work. In fact there were some major red flags. I already mentioned the use of two highly correlated variables in a regression, with no attempt to ensure that they could reasonably be jointly used to extrapolate beyond the range of the data, nor even any discussion of the issue. Also (perhaps related to that one) I had of course noticed that the previous analysts had not tabulated the regression coefficients of their model. I assumed that was just because the intended audience didn’t care about, or wouldn’t understand, such a table…but shouldn’t they have put it in anyway? (In fact, it is a requirement by the regulators, had anyone chosen to check). In essence, I had made the mistake of judging the book by its cover: I assumed that since the report *appeared* to have been done reasonably well, it actually had been.

I may not be able to eliminate these mistakes completely — especially the coding one — but even half-measures there are better than no measures at all.

Do as I say, don’t do as I did: (1) Always try to find all of the ways your model is significantly wrong, and to understand the magnitude of its deficiencies. (2) Implement the model in code that can be tested one module at a time. And (3) if someone tells you some specific model is good enough, don’t believe them until you’ve tried to prove them wrong…really this is just a corollary of #1.

Good luck out there!

The post Lessons learned in Hell appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Some of the data from the NRA conventions and firearm injuries study appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>You wrote about the NRA conventions and firearm injuries study here.

The lead author, Anupam Jena, kindly provided some of the underlying data and a snippet of the code they used to me. You can see it all here.

The data are here.

I [Kane] wrote up a brief analysis, R Markdown and html files are at Github.

The post Some of the data from the NRA conventions and firearm injuries study appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “It’s not just that the emperor has no clothes, it’s more like the emperor has been standing in the public square for fifteen years screaming, I’m naked! I’m naked! Look at me! And the scientific establishment is like, Wow, what a beautiful outfit.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I have that one in my collection of PDFs. I see I downloaded it on January 7, 2017, which was 3 days before our preprint went live. Probably I skimmed it and didn’t pay much further attention. I don’t know if my coauthors looked at it. Let’s give it five minutes worth of attention:

1. I notice right off the bat that the first numerical statement in the Method section contains a GRIM inconsistency:

“Data collection took place in 60 distinct FSR ranging from large chains (e.g., AppleBees®, Olive Garden®, Outback Steakhouse®, TGIF®) to small independent places (58.8%).”

58.8% is not possible. 35 out of 60 is 58.33%; 36 out of 60 is 60%.

2. The split of interactions by server gender (female 245, male 250) does not add up to the total of 497 interactions. The split by server BMI does. Maybe they couldn’t determine server gender in two cases. (However, one would expect far fewer servers than interactions. Maybe with the reported ethnic and gender percentage splits of the servers we can work out a plausible number of total servers that match those percentages when correctly rounded. Maybe.)

3. The denominator degrees of freedom for the F statistics in Table 1 are incorrect (N=497 implies df2=496 for the first two, 495 for the third; subtract 2 if the real N is in fact 495 rather than 497).

4. In Table 5, the total observations with low (337) and high (156) BMI servers do not match the numbers in Table 2 (low 215, high 280).
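Checks like the percentage test in point 1 are easy to automate. Here is a minimal Python sketch (the 60 and the 58.8 come from the quoted Method section; the helper name is my own):

```python
def possible_percentages(n, decimals=1):
    """All correctly rounded percentages achievable with an integer count out of n."""
    return {round(100 * k / n, decimals) for k in range(n + 1)}

achievable = possible_percentages(60)   # 60 full-service restaurants, as reported
print(58.8 in achievable)               # False: no integer count out of 60 rounds to 58.8%
print(sorted(p for p in achievable if 58 <= p <= 60))  # [58.3, 60.0]
```

The same idea generalizes: for any reported percentage and sample size, check whether some integer numerator yields that percentage when correctly rounded.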

There are errors right at the surface, and errors all the way through: the underlying scientific model (in which small, seemingly irrelevant manipulations are supposed to have large and consistent effects, a framework which is logically impossible because all these effects could interact with each other), the underlying statistical approach (sifting through data to find random statistically-significant differences which won’t replicate), the research program (in which a series of papers are published, each contradicting something that came before but presented as if they are part of a coherent whole), the details (data that could never have been, incoherent descriptions of data collection protocols, fishy numbers that could never have occurred with any data), all wrapped up in an air of certainty and marketed to the news media, TV audiences, corporations, the academic and scientific establishment, and the U.S. government.

What’s amazing here is not just that someone publishes low-quality research—that happens, journals are not perfect, and even when they make terrible mistakes they’re loath to admit it, as in the notorious case of that econ journal that refused to retract that “gremlins” paper which had nearly as many errors as data points—but that Wansink was, until recently, considered a leading figure in his field. Really kind of amazing. It’s not just that the emperor has no clothes, it’s more like the emperor has been standing in the public square for fifteen years screaming, I’m naked! I’m naked! Look at me! And the scientific establishment is like, Wow, what a beautiful outfit.

A lot of this has to be that Wansink and other social psychology and business-school researchers have been sending a message (that easy little “nudges” can have large and beneficial effects) that many powerful and influential people want to hear. And, until recently, this sort of feel-good message has had very little opposition. Science is not an adversarial field—it’s not like the U.S. legal system where active opposition is built into its processes—but when you have unscrupulous researchers on one side and no opposition on the other, bad things will happen.

**P.S.** I wrote this post in Sep 2017 and it is scheduled to appear in Mar 2018, by which time Wansink will probably be either president of Cornell University or the chair of the publications board of the Association for Psychological Science.

**P.P.S.** We’ve been warning Cornell about this one for a while.


The post The moral hazard of quantitative social science: Causal identification, statistical inference, and policy appeared first on Statistical Modeling, Causal Inference, and Social Science.

The United States is experiencing an epidemic of opioid abuse. In response, many states have increased access to Naloxone, a drug that can save lives when administered during an overdose. However, Naloxone access may unintentionally increase opioid abuse through two channels: (1) saving the lives of active drug users, who survive to continue abusing opioids, and (2) reducing the risk of death per use, thereby making riskier opioid use more appealing. . . . We exploit the staggered timing of Naloxone access laws to estimate the total effects of these laws. We find that broadening Naloxone access led to more opioid-related emergency room visits and more opioid-related theft, with no reduction in opioid-related mortality. . . . We also find suggestive evidence that broadening Naloxone access increased the use of fentanyl, a particularly potent opioid. . . .

I see three warning signs in the above abstract:

1. The bank-shot reasoning by which it’s argued that a lifesaving drug can actually make things worse. It could be, but I’m generally suspicious of arguments in which the second-order effect is more important than the first-order effect. This general issue has come up before.

2. The unintended-consequences thing, which often raises my hackles. In this case, “saving the lives of active drug users” is a plus, not a minus, right? And I assume it’s an anticipated and desired effect of the law. So it just seems wrong to call this “unintentional.”

3. Picking and choosing of results. For example, “more opioid-related emergency room visits and more opioid-related theft, with no reduction in opioid-related mortality,” but then, “We find the most detrimental effects in the Midwest, including a 14% increase in opioid-related mortality in that region.” If there’s no reduction in opioid-related mortality nationwide, but an increase in the Midwest, then there should be a decrease somewhere else, no?

I find it helpful when evaluating this sort of research to go back to the data. In this case the data are at the state-year level (although some of the state-level data seem to come from cities, for reasons that I don’t fully understand). The treatment is at the state-month level, when a state implements a law that broadens Naloxone access. This appears to have happened in 39 states between 2013 and 2015, so we have N=39 cases. So I guess what I want to see, for each outcome, is a bunch of time series plots showing the data in all 50 states.

We don’t quite get that but we do get some summaries, for example:

The weird curvy lines are clearly the result of overfitting some sort of non-regularized curves; see here for more discussion of this issue. More to the point, if you take away the lines and the gray bands, I don’t see any patterns at all! Figure 4 just looks like a general positive trend, and figure 8 doesn’t look like anything at all. The discontinuity in the Midwest is the big thing—this is the 14% increase mentioned in the abstract to the paper—but, just looking at the dots, I don’t see it.
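The danger of un-regularized curves around a cutoff is easy to demonstrate with fake data: fit flexible polynomials separately on each side of a cutoff to data with no true discontinuity, and the fitted curves will often imply a jump anyway. A hypothetical sketch (all numbers invented, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(7)

# Fake monthly data around a hypothetical law-change date at t = 0:
# a mild upward trend plus noise, and NO true discontinuity.
t = np.linspace(-24, 24, 49)
y = 0.05 * t + rng.normal(0.0, 1.0, t.size)

def jump_at_cutoff(t, y, degree):
    """Fit separate polynomials on each side of t = 0; return the implied jump."""
    left, right = t < 0, t >= 0
    f_left = np.polynomial.Polynomial.fit(t[left], y[left], degree)
    f_right = np.polynomial.Polynomial.fit(t[right], y[right], degree)
    return float(f_right(0.0) - f_left(0.0))

for degree in (1, 5, 9):
    print(degree, round(jump_at_cutoff(t, y, degree), 2))
```

Nothing about the data changes across rows of output; only the flexibility of the fitted curve does, which is why simple or regularized fits are preferred near a cutoff.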

I’m not saying the conclusions in the linked paper are wrong, but I don’t find the empirical results very compelling, especially given that they’re looking at changes over time, in a dataset where there may well be serious time trends.
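To see why time trends are so dangerous here, consider a simulation with made-up numbers (the 39 states and staggered adoption dates mirror the setup described above; everything else is invented). A naive post-minus-pre comparison recovers a large “effect” even when the true effect of the law is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_months = 39, 36              # hypothetical panel dimensions
adopt = rng.integers(12, 24, n_states)   # invented staggered adoption months

# Every state follows the same secular upward trend; the "law" has NO effect.
naive_effects = []
for s in range(n_states):
    t = np.arange(n_months)
    y = 0.1 * t + rng.normal(0.0, 0.5, n_months)
    naive_effects.append(y[t >= adopt[s]].mean() - y[t < adopt[s]].mean())

print(np.mean(naive_effects))  # clearly positive, despite a true effect of zero
```

This is exactly the pre-trends worry: any estimator that does not convincingly net out the underlying trend will attribute the trend to the law.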

On the particular issue of Naloxone, one of my correspondents passes along a reaction from an addiction specialist whose “priors are exceedingly skeptical of this finding (it implies addicts think carefully about Naloxone ‘insurance’ before overdosing, or something).” My correspondent also writes:

Another colleague, who is pre-tenure, requested that I anonymize the message below, which increases my dismay over the whole situation. Somehow both sides have distracted from the paper’s quality by shifting the discussion to the tenor of the discourse, which gives the paper’s analytics a pass.

There’s an Atlantic article on the episode.

Of course there was an overreaction by the harm reduction folks, but if you spend 5 minutes talking to non-researchers in that community, you’d realize how much they are up against and why these econ papers are so troubling.

My main problem remains that their diff-in-diff has all the hallmarks of problematic pre-trends and yet this very basic point has escaped the discussion somehow.

There is a problem that researchers often think that an “identification strategy” (whether it be randomization, or instrumental variables, or regression discontinuity, or difference in difference) gives them watertight inference. An extreme example is discussed here. An amusing example of econ-centrism comes from this quote in the Atlantic article:

“Public-health people believe things that are not randomized are correlative,” says Craig Garthwaite, a health economist at Northwestern University. “But [economists] have developed tools to make causal claims from nonrandomized data.”

It’s not really about economics: causal inference from observational data comes up all the time in other social sciences and also in public health research.

Olga Khazan, the author of the Atlantic article, points out that much of the discussion of the paper has occurred on twitter. I hate twitter; it’s a medium that seems so well suited for thoughtless sloganeering. From one side, you have people emptily saying, “Submit it for peer review and I’ll read what comes from it”—as if peer review is so great. On the other side, you get replies like “This paper uses causal inference, my dude”—not seeming to recognize that ultimately this is an observational analysis and the causal inference doesn’t come for free. I’m not saying blogs are perfect, and you don’t have to tell me about problems with the peer review process. But twitter can bring out the worst in people.

**P.S.** One more thing: I wish the data were available. It would be easy, right? Just some ASCII files with all the data, along with code for whatever models they fit and computations they performed. This comes up all the time, for almost every example we look at. It’s certainly not a problem specific to this particular paper; indeed, in my own work, too, our data are often not so easily accessible. It’s just a bad habit we all fall into, of not sharing our data. We—that is, social scientists in general, including me—should do a better job of this. If a topic is important enough that it merits media attention, if the work could perhaps affect policy, then the data should be available for all to see.

**P.P.S.** See also this news article by Alex Gertner that expresses skepticism regarding the above paper.

**P.P.P.S.** Richard Border writes:

After reading your post, I was overly curious how sensitive those discontinuous regression plots were and I extracted the data to check it out. Results are here in case you or your readers are interested.

**P.P.P.P.S.** One of the authors of the article under discussion has responded, but without details; see here.

