
Carol Nickerson investigates an unfounded claim of “17 replications”

Carol Nickerson sends along this report in which she carefully looks into the claim that the effect of power posing on feelings of power has replicated 17 times. Also relevant to the discussion is this post from a few months ago by Joe Simmons, Leif Nelson, and Uri Simonsohn.

I am writing about this because the claims of replication have been receiving wide publicity, and so, to the extent that these claims are important and worth publicizing, it’s also important to point out their errors. Everyone makes scientific mistakes—myself included—and the fact that some mistakes were made regarding claimed replications is not intended in any way to represent a personal criticism of anyone involved.

Pastagate!


In a news article, “Pasta Is Good For You, Say Scientists Funded By Big Pasta,” Stephanie Lee writes:

The headlines were a fettuccine fanatic’s dream. “Eating Pasta Linked to Weight Loss in New Study,” Newsweek reported this month, racking up more than 22,500 Facebook likes, shares, and comments. The happy news also went viral on the Independent, the New York Daily News, and Business Insider.

What those and many other stories failed to note, however, was that three of the scientists behind the study in question had financial conflicts as tangled as a bowl of spaghetti, including ties to the world’s largest pasta company, the Barilla Group. . . .

They should get together with Big Oregano.

P.S. Our work has many government and corporate sponsors. Make of this what you will.

Postdoc opportunity at AstraZeneca in Cambridge, England, in Bayesian Machine Learning using Stan!

Here it is:

Predicting drug toxicity with Bayesian machine learning models

We’re currently looking for talented scientists to join our innovative academic-style Postdoc. From our centre in Cambridge, UK you’ll be in a global pharmaceutical environment, contributing to live projects right from the start. You’ll take part in a comprehensive training programme, including a focus on drug discovery and development, given access to our existing Postdoctoral research, and encouraged to pursue your own independent research. It’s a newly expanding programme spanning a range of therapeutic areas across a wide range of disciplines. . . .

You will be part of the Quantitative Biology group and develop comprehensive Bayesian machine learning models for predicting drug toxicity in liver, heart, and other organs. This includes predicting the mechanism as well as the probability of toxicity by incorporating scientific knowledge into the prediction problem, such as known causal relationships and known toxicity mechanisms. Bayesian models will be used to account for uncertainty in the inputs and propagate this uncertainty into the predictions. In addition, you will promote the use of Bayesian methods across safety pharmacology and biology more generally. You are also expected to present your findings at key conferences and in leading publications.

This project is in collaboration with Prof. Andrew Gelman at Columbia University, and Dr Stanley Lazic at AstraZeneca.

Psychometrics corner: They want to fit a multilevel model instead of running 37 separate correlation analyses

Anouschka Foltz writes:

One of my students has some data, and there is an issue with multiple comparisons. While trying to find out how to best deal with the issue, I came across your article with Martin Lindquist, “Correlations and Multiple Comparisons in Functional Imaging: A Statistical Perspective.” And while my student’s work does not involve functional imaging, I thought that your article may present a solution for our problem.

My student is interested in the relationship between vocabulary size and different vocabulary learning strategies (VLS). He has measured each participant’s approximate vocabulary size with a standardized test (scores between 0 and 10000) and asked each participant how frequently they use each of 37 VLS on a scale from 1 through 5. The 37 VLS fall into five different groups (cognitive, memory, social etc.). He is interested in which VLS correlate with or predict vocabulary size. To see which VLS correlate with vocabulary size, we could run 37 separate correlation analyses, but then we run into the problem that we are doing multiple comparisons and the issue of false positives that goes along with that.

Do you think a multilevel Bayesian approach that uses partial pooling, as you suggest in your paper for functional imaging data, would be appropriate in our case? If so, would you be able to provide me with some more information as to how I can actually run such an analysis? I am working in R, and any information as to which packages and functions would be appropriate for the analysis would be really helpful. I came across the brms package for Advanced Bayesian Multilevel Modeling, but I have not worked with this particular package before and I am not sure if this is exactly what I need.

My reply:

I do think a multilevel Bayesian approach would make sense. I’ve never worked on this particular problem, so I am posting it here on the blog in the hope that someone might have a response. This seems like the exact sort of problem where we’d fit a multilevel model rather than running 37 separate analyses!
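
Just to give a sense of what this could look like in R, here’s a rough sketch of one generic approach: estimate the 37 standardized slopes separately, then partially pool them with a hierarchical model in brms. The data frame d_wide, the column names vocab and s1 through s37, and the grouping vector vls_groups are all hypothetical placeholders, not anything from the student’s actual data.

library(brms)

# Step 1: one standardized slope and standard error per strategy (d_wide has
# one row per participant, with columns vocab and s1...s37).
ests <- lapply(1:37, function(j) {
  fit <- lm(scale(vocab) ~ scale(d_wide[[paste0("s", j)]]), data = d_wide)
  data.frame(strategy = paste0("s", j),
             est      = coef(summary(fit))[2, "Estimate"],
             est_se   = coef(summary(fit))[2, "Std. Error"])
})
ests <- do.call(rbind, ests)
ests$group <- vls_groups   # hypothetical mapping of each strategy to its VLS group

# Step 2: partially pool the 37 estimates (eight-schools style); adding
# + (1 | group) would additionally pool strategies within the five VLS groups.
fit_pool <- brm(est | se(est_se) ~ 1 + (1 | strategy),
                data = ests, chains = 4, cores = 4)
coef(fit_pool)$strategy   # shrunken estimate for each strategy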

Trichotomous

Regarding this paper, Frank Harrell writes:

One grammatical correction: Alvan Feinstein, the ‘father of clinical epidemiology’ at Yale, educated me about ‘trichotomy’. dichotomous = Greek dicho (two) + tomous (cut). Three = tri so the proper word would be ‘tritomous’ instead of ‘trichotomous’.

Uh oh. I can’t bring myself to use the word “tritomous,” as it just sounds wrong. Trichotomous may just be one of those words that are impossible to use correctly; see here.

P.S. The adorable cat above faces many more than three options.

“Statistics: Learning from stories” (my talk in Zurich on Tues 28 Aug)

Statistics: Learning from stories

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Here is a paradox: In statistics we aim for representative samples and balanced comparisons, but stories are interesting to the extent that they are surprising and atypical. The resolution of the paradox is that stories can be seen as a form of model checking: we learn from a good story when it refutes some idea we have about the world. We demonstrate with several examples of successes and failures of applied statistics.

Information on the conference is here.

You better check yo self before you wreck yo self

We (Sean Talts, Michael Betancourt, me, Aki, and Andrew) just uploaded a paper (code available here) that outlines a framework for verifying that an algorithm for computing a posterior distribution has been implemented correctly. It is easy to use, straightforward to implement, and ready to be built into a Bayesian workflow.

This type of testing should be required in order to publish a new (or improved) algorithm that claims to compute a posterior distribution. It’s time to get serious about only publishing things that actually work!

You Oughta Know

Before I go into our method, let’s have a brief review of some things that are not sufficient to demonstrate that an algorithm for computing a posterior distribution actually works.

  • Theoretical results that are anything less than demonstrably tight upper and lower bounds* that work in finite-sample situations.
  • Comparison with a long run from another algorithm unless that algorithm has stronger guarantees than “we ran it for a long time”. (Even when the long-running algorithm is guaranteed to work, there is nothing generalizable here. This can only ever show the algorithm works on a specific data set.)
  • Recovery of parameters from simulated data (this literally checks nothing)
  • Running the algorithm on real data. (Again, this checks literally nothing.)
  • Running the algorithm and plotting traceplots, autocorrelation, etc etc etc
  • Computing the Gelman-Rubin R-hat statistic. (Even using multiple chains initialized at diverse points, this only checks whether the Markov chain has converged. It does not check that it has converged to the correct thing.)

I could go on and on and on.

The method that we are proposing does actually do a pretty good job at checking if an approximate posterior is similar to the correct one. It isn’t magic. It can’t guarantee that a method will work for any data set.

What it can do is make sure that for a given model specification, one dimensional posterior quantities of interest will be correct on average. Here, “on average” means that we average over data simulated from the model. This means that rather than just check the algorithm once when it’s proposed, we need to check the algorithm every time it’s used for a new type of problem. This places algorithm checking within the context of Bayesian Workflow.

This isn’t as weird as it seems. One of the things that we always need to check is that we are actually running the correct model. Programming errors happen to everyone and this procedure will help catch them.

Moreover, if you’re doing something sufficiently difficult, it can happen that even something as stable as Stan will quietly fail to get the correct result. The Stan developers have put a lot of work into trying to avoid these quiet cases of failure (Betancourt’s idea to monitor divergences really helped here!), but there is no way to user-proof software. The Simulation-Based Calibration procedure that we outline in the paper (and below) is another safety check that we can use to help us be confident that our inference is actually working as expected.

(* I will also take asymptotic bounds and sensitive finite sample heuristics because I’m not that greedy. But if I can’t run my problem, check the heuristic, and then be confident that if someone died because of my inference, it would have nothing to do with the computation of the posterior, then it’s not enough.)

Don’t call it a comeback, I’ve been here for years

One of the weird things that I have noticed over the years is that it’s often necessary to re-visit good papers from the past so that they reflect our new understanding of how statistics works. In this case, we re-visited an excellent idea that Samantha Cook, Andrew, and Don Rubin proposed in 2006.

Cook, Gelman, and Rubin proposed a method for assessing output from software for computing posterior distributions by noting a simple fact:

If \theta^* \sim p(\theta) and y^* \sim p(y \mid \theta^*), then the posterior quantile \Pr(h(\theta^*)<h(\theta)\mid y^*) is uniformly distributed  (the randomness is in y^*) for any continuous function h(\cdot).

There’s a slight problem with this result.  It’s not actually applicable for sample-based inference! It only holds if, at every point, all the distributions are continuous and all of the quantiles are computed exactly.

In particular, if you compute the quantile \Pr(h(\theta^*)<h(\theta)\mid y^*) using a bag of samples drawn from an MCMC algorithm, this result will not hold.

This makes it hard to use the original method in practice. And that might be understating the problem. This whole project happened because we wanted to run Cook, Gelman, and Rubin’s procedure to compare some Stan and BUGS models. And we just kept running into problems. Even when we ran it on models that we knew worked, we were getting bad results.

So we (Sean, Michael, Aki, Andrew, and I) went through and tried to re-imagine their method as something that is more broadly applicable.

When in doubt, rank something

The key difference between our paper and Cook, Gelman, and Rubin is that we have avoided their mathematical pitfalls by re-casting their main theoretical result to something a bit more robust. In particular, we base our method around the following result.

Let \theta^* \sim p(\theta) and y^* \sim p(y \mid \theta^*), and \theta_1,\ldots,\theta_L be independent draws from the posterior distribution p(\theta\mid y^*). Then the rank of h(\theta^*) in the bag of samples h(\theta_1),\ldots,h(\theta_L) has a discrete uniform distribution on [0,L].

This result is true for both discrete and continuous distributions. Moreover, we now have the freedom to choose L. As a rule, the larger L is, the more sensitive this procedure will be. On the other hand, a larger L will require more simulated data sets in order to be able to assess whether the observed ranks deviate from a discrete uniform distribution. In the paper, we chose L=100 samples for each posterior.

The hills have eyes

But, more importantly, the hills have autocorrelation. If a posterior has been computed using an MCMC method, the bag of samples that are produced will likely have non-trivial autocorrelation. This autocorrelation will cause the rank histogram to deviate from uniformity in a specific way. In particular, it will lead to spikes in the histogram at zero and/or one.

To avoid this, we recommend thinning the sample to remove most of the autocorrelation.  In our experiments, we found that thinning by effective sample size was sufficient to remove the artifacts, even though this is not theoretically guaranteed to remove the autocorrelation.  We also considered using some more theoretically motivated methods, such as thinning based on Geyer’s initial positive sequences, but we found that these thinning rules were too conservative and this more aggressive thinning did not lead to better rank histograms than the simple effective sample size-based thinning.
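
For what it’s worth, a bare-bones version of this kind of thinning might look something like the following in R, using coda’s effectiveSize (chain here is just a hypothetical vector of correlated draws for one quantity of interest):

library(coda)

ess   <- effectiveSize(chain)                     # estimated effective sample size
thin  <- max(1, floor(length(chain) / ess))       # keep roughly one draw per effective draw
draws <- chain[seq(1, length(chain), by = thin)]  # approximately independent bag of samples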

Simulation based calibration

After putting all of this together, we get the simulation based calibration (SBC) algorithm.  The below version is for correlated samples. (There is a version in the paper for independent samples).

The simple idea is that for each of the N simulated datasets, you generate a bag of L approximately independent samples from the approximate posterior. (You can do this however you want!) You then compute the rank of the true parameter (the one used to simulate that data set) within the bag of samples. So you need to compute N true parameters, each of which is used to simulate one data set, which in turn is used to compute L samples from its posterior.
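
To make the loop concrete, here’s a minimal sketch in R for a toy conjugate-normal model, where draw_posterior stands in for whatever algorithm is actually being validated (since this version uses the exact posterior, the rank histogram should come out close to uniform):

# Toy model: theta ~ N(0, 1), y_1,...,y_10 ~ N(theta, 1)
N     <- 1000   # number of simulated data sets
L     <- 100    # posterior draws kept per data set (thin first if they are correlated)
n_obs <- 10

draw_posterior <- function(y, L) {
  post_var  <- 1 / (1 + length(y))   # exact conjugate update, standing in for the
  post_mean <- post_var * sum(y)     # algorithm under test
  rnorm(L, post_mean, sqrt(post_var))
}

ranks <- replicate(N, {
  theta_star <- rnorm(1, 0, 1)               # draw the "true" parameter from the prior
  y_star     <- rnorm(n_obs, theta_star, 1)  # simulate a data set given theta_star
  draws      <- draw_posterior(y_star, L)    # compute the posterior with the algorithm
  sum(draws < theta_star)                    # rank of theta_star among the L draws
})

hist(ranks, breaks = seq(-0.5, L + 0.5, length.out = 21))  # should look uniform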

So. Validating code with SBC is obviously expensive. It requires a whole load of runs to make it work. The upside is that all of this runs in parallel on a cluster, so if your code is reliable, it is actually quite straightforward to run.

The place where we ran into some problems was trying to validate MCMC code that we knew didn’t work. In this case, the autocorrelation on the chain was so strong that it wasn’t reasonable to thin the chain to get 100 samples. This is an important point: if your method fails some basic checks, then it’s going to fail SBC. There’s no point wasting your time.

The main benefit of SBC is in validating MCMC methods that appear to work, or validating fast approximate algorithms like INLA (which works) or ADVI (which is a more mixed bag).

This method also has another interesting application: evaluating approximate models. For example, if you replace an intractable likelihood with a cheap approximation (such as a composite likelihood or a pseudolikelihood), SBC can give some idea of the errors that this approximation has pushed into the posterior. The procedure remains the same: simulate parameters from the prior, simulate data from the correct model, and then generate a bag of approximately uncorrelated samples from corresponding posterior using the approximate model. While this procedure cannot assess the quality of the approximation in the presence of model error, it will still be quite informative.

When You’re Smiling (The Whole World Smiles With You)

One of the most useful parts of the SBC procedure is that it is inherently visual. This makes it fairly straightforward to work out how your algorithm is wrong.  The one-dimensional rank histograms have four characteristic non-uniform shapes: “smiley”, “frowny”, “a step to the left”, “a jump to the right”, which are all interpretable.

  • Histogram has a smile: The posteriors are narrower than they should be. (We see too many low and high ranks)
  • Histogram has a frown: The posteriors are wider than they should be. (We don’t see enough low and high ranks)
  • Histogram slopes from left to right: The posteriors are biased upwards. (The true value is too often in the lower ranks of the sample)
  • Histogram slopes from right to left: The posteriors are biased downwards. (The opposite)

These histograms seem to be sensitive enough to indicate when an algorithm doesn’t work. In particular, we’ve observed that when the algorithm fails, these histograms are typically quite far from uniform. A key thing that we’ve had to assume, however, is that the bag of samples drawn from the computed posterior is approximately independent. Autocorrelation can cause spurious spikes at zero and/or one.

These interpretations are inspired by the literature on calibrating probabilistic forecasts. (Follow that link for a really detailed review and a lot of references).  There are also some multivariate extensions to these ideas, although we have not examined them here.

Using partial pooling when preparing data for machine learning applications

Geoffrey Simmons writes:

I reached out to John Mount/Nina Zumel over at Win Vector with a suggestion for their vtreat package, which automates many common challenges in preparing data for machine learning applications. The default behavior for impact coding high-cardinality variables had been a naive Bayes approach, which I found to be problematic due to its multi-modal output (assigning probabilities close to 0 and 1 for low-sample-size levels). This seemed like a natural fit for partial pooling, so I pointed them to your work/book and demonstrated its usefulness from my experience/applications. It’s now the basis of a custom-coding enhancement to their package. You can find their write-up here.

Cool. I hope their next step will be to implement it in Stan.

It’s also interesting to think of Bayesian or multilevel modeling being used as a preprocessing tool for machine learning, which is sort of the flipped-around version of an idea we posted the other day, on using black-box machine learning predictions as inputs to a Bayesian analysis. I like these ideas of combining different methods and getting the best of both worlds.
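
If you’re curious what partial-pooled impact coding looks like in miniature, here’s a rough lme4 sketch (stan_lmer from rstanarm would be the fully Bayesian drop-in). The data frames train and test, the outcome y, and the high-cardinality factor zip are hypothetical, not taken from the Win-Vector write-up:

library(lme4)

fit <- lmer(y ~ 1 + (1 | zip), data = train)

# The estimated level effects are shrunk toward zero: rare levels get pulled in
# strongly, well-populated levels keep most of their observed difference.
impact <- ranef(fit)$zip[, "(Intercept)"]
names(impact) <- rownames(ranef(fit)$zip)

train$zip_impact <- impact[as.character(train$zip)]
test$zip_impact  <- impact[as.character(test$zip)]
test$zip_impact[is.na(test$zip_impact)] <- 0   # unseen levels fall back to the grand mean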

An Upbeat Mood May Boost Your Paper’s Publicity

Gur Huberman points to this news article, An Upbeat Mood May Boost Your Flu Shot’s Effectiveness, which states:

A new study suggests that older people who are in a good mood when they get the shot have a better immune response.

British researchers followed 138 people ages 65 to 85 who got the 2014-15 vaccine. Using well-validated tests in the weeks before and after their shots, the scientists recorded mood, stress, negative thoughts, sleep patterns, diet and other measures of psychological and physical health. . . .

Greater levels of positive mood were associated with higher blood levels of antibodies to H1N1, a potentially dangerous flu strain, at both four and 16 weeks post-vaccination. No other factors measured were associated with improved immune response.

Abundant researcher degrees of freedom? Check.

Speculative hypothesis? Check.

Obvious latent-variable explanation? Check.

Difference between significant and non-significant taken as significant? Check.

The article continues:

The authors acknowledge they were not able to control for all possible variables, and that their observational study does not prove cause and effect.

The senior author, Kavita Vedhara, professor of health psychology at the University of Nottingham, said that many things could affect vaccine effectiveness, but most are not under a person’s control — age, coexisting illness or vaccine history, for example.

“It’s not that there aren’t other influences,” she said, “but it looks like how you’re feeling on the day you’re vaccinated may be among the more important.”

First off, the confident statement at the end seems to contradict the caveats two paragraphs earlier. Second, I question the implication that one’s mood is “under a person’s control.” How does that work, exactly?

Beyond all this are the usual statistical problems of noise. From the research article:

One hundred and thirty-eight community-dwelling older adults aged 65–85 were recruited through 4 primary care practices in Nottingham, UK. A priori sample size calculations based on observed effects of stress on vaccine response in elderly caregivers (Vedhara et al., 1999) indicated a sample of 121 would give 80% power at 5% significance to detect a similar small-to-medium sized effect (r = 0.25) in individual regression models.

This is the familiar “power = .06” disaster: take an overestimated effect size from a previous noisy study, then design a new study under these unrealistic assumptions. Bad news all around.
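
To see what this means in numbers, here’s a quick illustration with the pwr package; the r = 0.05 value is purely hypothetical, just to show how fast power collapses when the assumed effect size is optimistic:

library(pwr)

pwr.r.test(n = 121, r = 0.25, sig.level = 0.05)  # the design assumption: power is about 0.80
pwr.r.test(n = 121, r = 0.05, sig.level = 0.05)  # a small true effect: power is under 0.10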

On the plus side, this is a study that would be easy enough to redo as a preregistered replication. I recommend the authors of the above-cited study start thinking up their alibis right now for the anticipated replication failure.

P.S. As usual, let me repeat that, yes, this effect could be real and replicable. And I’ll believe it once I see real evidence. Not before.

P.P.S. I learned about this paper on 25 Sep, right around when everyone’s getting their flu shots. But I posted it on a delay so it’s not appearing until mid-April.

Why delay my post on this timely topic?

Here’s why. If I keep quiet, this research might make people happy, which in turn will boost their flu shots’ effectiveness. But if I post, I’d be duty-bound to criticize this research as just another bit of noise-mining. This would make people sad, which in turn would decrease their flu shots’ effectiveness. Thus, by posting right away, I could be making people unhealthy, maybe even killing them! So, ethically speaking, I have no choice but to delay my post until April, when flu season is over–which is also, coincidentally, the next open spot in the blog queue.

loo 2.0 is loose

This post is by Jonah and Aki.

We’re happy to announce the release of v2.0.0 of the loo R package for efficient approximate leave-one-out cross-validation (and more). For anyone unfamiliar with the package, the original motivation for its development is in our paper:

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5), 1413–1432. doi:10.1007/s11222-016-9696-4. (published version, arXiv preprint)

Version 2.0.0 is a major update (release notes) to the package that we’ve been working on for quite some time and in this post we’ll highlight some of the most important improvements. Soon I (Jonah) will follow up with a post about important new developments in our various other R packages.

New interface, vignettes, and more helper functions to make the package easier to use

Because of certain improvements to the algorithms and diagnostics (summarized below), the interfaces, i.e., the loo() and psis() functions and the objects they return, also needed some improvement. (Click on the function names in the previous sentence to see their new documentation pages.) Other related packages in the Stan R ecosystem (e.g., rstanarm, brms, bayesplot, projpred) have also been updated to integrate seamlessly with loo v2.0.0. (Apologies to anyone who happened to install the update during the short window between the loo release and when the compatible rstanarm/brms binaries became available on CRAN.)

Three vignettes now come with the loo package and are also available (and more nicely formatted) online at mc-stan.org/loo/articles:

  • Using the loo package (version >= 2.0.0) (view)
  • Bayesian Stacking and Pseudo-BMA weights using the loo package (view)
  • Writing Stan programs for use with the loo package (view)

A vignette about K-fold cross-validation using new K-fold helper functions will be included in a subsequent update. Since the last release of loo we have also written a paper, Visualization in Bayesian workflow, that includes several visualizations based on computations from loo.

Improvements to the PSIS algorithm, effective sample sizes and MC errors

The approximate leave-one-out cross-validation performed by the loo package depends on Pareto smoothed importance sampling (PSIS). In loo v2.0.0, the PSIS algorithm (psis() function) corresponds to the algorithm in the most recent update to our PSIS paper, including adapting the Pareto fit with respect to the effective sample size and using a weakly informative prior to reduce the variance for small effective sample sizes. (I believe we’ll be updating the paper again with some proofs from new coauthors.)

For users of the loo package for PSIS-LOO cross-validation and not just the PSIS algorithm for importance sampling, an even more important update is that the latest version of the same PSIS paper referenced above describes how to compute the effective sample size estimate and Monte Carlo error for the PSIS estimate of elpd_loo (expected log predictive density for new data). Thus, in addition to the Pareto k diagnostic (an indicator of convergence rate – see paper) already available in previous loo versions, we now also report an effective sample size that takes into account both the MCMC efficiency and the importance sampling efficiency. Here’s an example of what the diagnostic output table from loo v2.0.0 looks like (the particular intervals chosen for binning the k values are explained in the papers and also in the package documentation):

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     240   91.6%   205
 (0.5, 0.7]   (ok)         7    2.7%   48
   (0.7, 1]   (bad)        8    3.1%   7
   (1, Inf)   (very bad)   7    2.7%   1

We also compute and report the Monte Carlo SE of elpd_loo to give an estimate of the accuracy. If some k>1 (which means the PSIS-LOO approximation is not reliable, as in the example above), NA will be reported for the Monte Carlo SE. We hope that showing the relationship between the k diagnostic, the effective sample size, and the MCSE of elpd_loo will make it easier to interpret the diagnostics than in previous versions of loo that only reported the k diagnostic. This particular example is taken from one of the new vignettes, which uses it as part of a comparison of unstable and stable PSIS-LOO behavior.
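
As a hedged sketch of how you might poke at these diagnostics from R (loo1 is a placeholder for the object returned by a loo() call):

pareto_k_table(loo1)                  # the binned table shown above
pareto_k_ids(loo1, threshold = 0.7)   # indices of observations with k above 0.7
mcse_loo(loo1)                        # Monte Carlo SE of elpd_loo (NA if any k > 1)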

Weights for model averaging: Bayesian stacking, pseudo-BMA and pseudo-BMA+

Another major addition is the loo_model_weights() function, which, thanks to the contributions of Yuling Yao, can be used to compute weights for model averaging or selection. loo_model_weights() provides a user-friendly interface to the new stacking_weights() and pseudobma_weights(), which are implementations of the methods from Using stacking to average Bayesian predictive distributions (Yao et al., 2018). As shown in the paper, Bayesian stacking (the default for loo_model_weights()) provides better model averaging performance than “Akaike style” weights; however, the loo package does also include Pseudo-BMA weights (PSIS-LOO based “Akaike style” weights) and Pseudo-BMA+ weights, which are similar to Pseudo-BMA weights but use a so-called Bayesian bootstrap procedure to better account for the uncertainties. We recommend the Pseudo-BMA+ method instead of, for example, WAIC weights, although we prefer the stacking method to both. In addition to the Yao et al. paper, the new vignette about computing model weights demonstrates some of the motivation for our preference for stacking when appropriate.
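
For example, something along these lines computes stacking and Pseudo-BMA+ weights for two already-fitted models (fit1 and fit2 are placeholders for, say, rstanarm or brms fits):

library(loo)

loo1 <- loo(fit1)
loo2 <- loo(fit2)

loo_model_weights(list(loo1, loo2))                        # stacking (the default)
loo_model_weights(list(loo1, loo2), method = "pseudobma")  # Pseudo-BMA+ (Bayesian bootstrap on by default)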

Give it a try

You can install loo v2.0.0 from CRAN with install.packages("loo"). Additionally, reinstalling an interface that provides loo functionality (e.g., rstanarm, brms) will automatically update your loo installation. The loo website with online documentation is mc-stan.org/loo and you can report a bug or request a feature on GitHub.
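
Here’s a minimal usage sketch with an rstanarm example model (any object with a log-likelihood matrix or log_lik method works the same way):

library(rstanarm)
library(loo)

fit <- stan_glm(mpg ~ wt + am, data = mtcars)
loo_fit <- loo(fit)   # PSIS-LOO with the new diagnostics
print(loo_fit)        # elpd_loo, its Monte Carlo SE, and the Pareto k table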

Taking perspective on perspective taking

Gabor Simonovits writes:

I thought you might be interested in this paper with Gabor Kezdi of U Michigan and Peter Kardos of Bloomfield College, about an online intervention reducing anti-Roma prejudice and far-right voting in Hungary through a role-playing game.

The paper is similar to some existing social psychology studies on perspective taking but we made an effort to improve on the credibility of the analysis by (1) using a relatively large sample (2) registering and following a pre-analysis plan (3) using pre-treatment measures to explore differential attrition and (4) estimating long term effects of the treatment. It got desk-rejected from PNAS and Psych Science but was just accepted for publication in APSR.

I have not had a chance to read the paper carefully. But, just speaking generally, I agree with Simonovits that: (1) a large sample can’t hurt, (2) preregistration makes this sort of result much more believable, (3) using pre-treatment variables can be crucial in getting enough precision to estimate what you care about, and (4) richer outcome measures can help a lot.

Also, whassup. No graphs??

Generable: They’re building software for pharma, with Stan inside.

Daniel Lee writes:

We’ve just launched our new website.

Generable is where precision medicine meets statistical machine learning.

We are building a state-of-the-art platform to make individual, patient-level predictions for safety and efficacy of treatments. We’re able to do this by building Bayesian models with Stan. We currently have pilots with AstraZeneca, Sanofi, and University of Marseille. We’re particularly interested in small clinical trials, like in rare diseases or combination therapies. If anyone is interested, they can reach Daniel at daniel@generable.com

I’ve been collaborating with Daniel for many years and I’m glad to hear that he and his colleagues are doing this work. It’s my impression that in many applied fields, pharmacometrics included, there’s a big need for systems that allow users to construct open-ended models, using prior information and hierarchical models to regularize inferences and thus allow the integration of multiple relevant data sources in making predictions. As Daniel implies in his note above, Bayesian tools are particularly relevant where data are sparse.

Fixing the reproducibility crisis: Openness, Increasing sample size, and Preregistration ARE NOT ENUF!!!!

In a generally reasonable and thoughtful post, “Yes, Your Field Does Need to Worry About Replicability,” Rich Lucas writes:

One of the most exciting things to happen during the years-long debate about the replicability of psychological research is the shift in focus from providing evidence that there is a problem to developing concrete plans for solving those problems. . . . I’m hopeful and optimistic that future investigations into the replicability of findings in our field will show improvement over time.

Of course, many of the solutions that have been proposed come with some cost: Increasing standards of evidence requires larger sample sizes; sharing data and materials requires extra effort on the part of the researcher; requiring replications shifts resources that could otherwise be used to make new discoveries. . . .

This is all fine, but, BUT, honesty and transparency are not enough! Even honesty, transparency, replication, and large sample size are not enough. You also need good measurement, and some sort of good theory. Otherwise you’re just moving around desk chairs on the . . . OK, you know where I’m heading here.

Don’t get me wrong. Sharing data and materials is a good idea in any case; replication of some sort is central to just about all of science, and larger sample sizes are fine too. But if you’re not studying a stable phenomenon that you’re measuring well, then forget about it: all those good steps of openness, replication, and sample size will just be expensive ways of learning that your research is no good.

I’ve been saying this for a while so I know this is getting repetitive. See, for example, this post from yesterday, or this journal article from a few months back.

But I feel like I need to keep on screaming about this issue, given that well-intentioned and thoughtful researchers still seem to be missing it. I really really really don’t want people going around thinking that, if they increase their sample size and keep open data and preregister, they’ll solve their replication problems. Eventually, sure, enough of this and they’ll be so demoralized that maybe they’ll be motivated to improve their measurements. But why wait? I recommend following the recommendations in section 3 of this paper right away.

“Bit by Bit: Social Research in the Digital Age”

Our longtime collaborator Matt Salganik sent me a copy of his new textbook, “Bit by Bit: Social Research in the Digital Age.”

I really like the division into Observing Behavior, Asking Questions, Running Experiments, and Mass Collaboration (I’d remove the word “Creating” from the title of that section). It seemed awkward for Ethics to be in its own section rather than being sprinkled throughout the book, but in any case it’s a huge plus to have any discussion of ethics at all. I’ve written a lot about ethics but very little of this has made its way into my textbooks, so I appreciate that Matt did this.

Also I suggested three places where the book could be improved:

1. On page xiv, Matt writes, “I’m not going to be critical for the sake of being critical.” This seems like a straw man. Just about nobody is “critical for the sake of being critical.” For example, if I criticize junk science such as power pose, I do so because I’m concerned about waste of resources and about bad incentives (positive press and top jobs for junk science motivate students to aim for that sort of thing themselves); I’m concerned because the underlying topic is important and it’s being trivialized; I’m concerned because I’m interested in learning about human interactions, and pointing out mistakes is one way we learn; and criticism is also helpful in revealing underlying principles of research methods: when we learn how things can seem so right and go so wrong, that can help us move forward. Matt writes that he’s “going to be critical so that [he] can help you create better research.” But that’s the motivation of just about every critic. I have no problem with whatever balance Matt happens to choose between positive and negative examples; I just think he may be misunderstanding the reasons why people criticize mistakes in social research.

2. On pages 136 and 139, Matt refers to non-probability sampling. Actually, just about every real survey is a non-probability sample. For a probability sample, it is necessary that everyone in the population has a nonzero probability of being in the sample, and that these probabilities are known. Real polls have response rates under 10%, and there’s no way of knowing or even really defining what is the response probability for each person in the sample. Sometimes people say “probability sample” when they mean “random digit dialing (RDD) sample”, but an RDD sample is not actually a probability sample because of nonresponse.

3. In the ethics section, I’d like a discussion of the idea that it can be an ethics violation to do low-quality research; see for example here, here, and here. In particular, high-quality measurement (which Matt discusses elsewhere in his book) is crucial. A researcher can be a wonderful, well-intentioned person, follow all ethical rules, IRB and otherwise—but if he or she takes crappy measurements, then the results will be crap too. Couple that with standard statistical practices (p-values etc.) and the result is junk science. Which in my view is unethical. To do a study and not consider data quality, on the vague hope that something interesting will come out and you can publish it, is unethical in that it is an avoidable pollution of scientific discourse.

Anyway, I think it will make an excellent textbook. I mentioned 3 little things that I think could be improved, but I could list 300 things in it that I love. It’s a great contribution.

It’s all about Hurricane Andrew: Do patterns in post-disaster donations demonstrate egotism?

Jim Windle points to this post discussing a paper by Jesse Chandler, Tiffany M. Griffin, and Nicholas Sorensen, “In the ‘I’ of the Storm: Shared Initials Increase Disaster Donations.”

I took a quick look and didn’t notice anything clearly wrong with the paper, but there did seem to be some opportunities for forking paths, in that the paper seemed to be analyzing only a small selection of relevant data on the question they were asking.

I wrote that I’m open to the possibility that this is real, also open to the possibility that it’s not.

Windle replied:

That was my take as well. Human psychology is certainly strange enough that it’s possible, but human psychology is also strange enough to allow seeing effects where there are none.

Well put.

The person I’d really want to ask about this one is Uri Simonsohn. He’s the one who wrote that paper several years ago carefully shooting down every claim from the dentists-named-Dennis article.

Tools for detecting junk science? Transparency is the key.

In an article to appear in the journal Child Development, “Distinguishing polemic from commentary in science,” physicist David Grimes and psychologist Dorothy Bishop write:

Exposure to nonionizing radiation used in wireless communication remains a contentious topic in the public mind—while the overwhelming scientific evidence to date suggests that microwave and radio frequencies used in modern communications are safe, public apprehension remains considerable. A recent article in Child Development has caused concern by alleging a causative connection between nonionizing radiation and a host of conditions, including autism and cancer. This commentary outlines why these claims are devoid of merit, and why they should not have been given a scientific veneer of legitimacy. The commentary also outlines some hallmarks of potentially dubious science, with the hope that authors, reviewers, and editors might be better able to avoid suspect scientific claims.

The article in question is, “Electromagnetic Fields, Pulsed Radiofrequency Radiation, and Epigenetics: How Wireless Technologies May Affect Childhood Development,” by Cindy Sage and Ernesto Burgio. I haven’t read the two articles in detail, but Grimes and Bishop’s critique seems reasonable to me; I have no reason to believe the claims of Sage and Burgio, and indeed the most interesting thing there is that this article, which has no psychology content, was published in the journal Child Development. Yes, the claims in that article, if true, would indeed be highly relevant to the topic of child development—but I’d expect an article such as this to appear in a journal such as Health Physics whose review pool is more qualified to evaluate it.

How did that happen? The Sage and Burgio article appeared in a “Special Section on Contemporary Mobile Technology and Child and Adolescent Development, edited by Zheng Yan and Lennart Hardell.” And if you google Lennart Hardell, you’ll see this:

Lennart Hardell (born 1944), is a Swedish oncologist and professor at Örebro University Hospital in Örebro, Sweden. He is known for his research into what he says are environmental cancer-causing agents, such as Agent Orange, and has said that cell phones increase the risk of brain tumors.

So now we know how the paper got published in Child Development.

Of more interest, perhaps, are the guidelines that Grimes and Bishop give for evaluating research claims, a numbered list of questions in their article.

I’m reminded of another article by Dorothy Bishop, this one written with Stephen Lewandowsky a couple of years ago, giving red flags for research claims.

As I wrote back then, what’s important to me is not peer review (see recent discussion) but transparency. And several of the above questions (#3, #4, #7, and, to some extent, #8 and #9) are about transparency. So that could be a way forward.

Not that all transparent claims are correct—of course, you can do a crappy study, share all your data, and still come to an erroneous conclusion—but I think transparency is a good start, as lots of the problems with poor data collection and analysis can be hidden by lack of transparency. Just imagine how many tens of thousands of person-years of wasted effort could’ve been avoided if that pizzagate guy had shared all his data and code from the start.

Do Statistical Methods Have an Expiration Date? (my talk noon Mon 16 Apr at the University of Pennsylvania)

Do Statistical Methods Have an Expiration Date?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

There is a statistical crisis in the human sciences: many celebrated findings have failed to replicate, and careful analysis has revealed that many celebrated research projects were dead on arrival in the sense of never having sufficiently accurate data to answer the questions they were attempting to resolve. The statistical methods which revolutionized science in the 1930s-1950s no longer seem to work in the 21st century. How can this be? It turns out that when effects are small and highly variable, the classical approach of black-box inference from randomized experiments or observational studies no longer works as advertised. We discuss the conceptual barriers that have allowed researchers to avoid confronting these issues, which arise in psychology, policy research, public health, and other fields. To do better, we recommend three steps: (a) designing studies based on a perspective of realism rather than gambling or hope, (b) higher quality data collection, and (c) data analysis that combines multiple sources of information.

Some of material in the talk appears in our recent papers, The failure of null hypothesis significance testing when studying incremental changes, and what to do about it and Some natural solutions to the p-value communication problem—and why they won’t work.

The talk is at 340 Huntsman Hall.

Failure of failure to replicate

The Millennium Villages Project: a retrospective, observational, endline evaluation

Shira Mitchell et al. write (preprint version here if that link doesn’t work):

The Millennium Villages Project (MVP) was a 10 year, multisector, rural development project, initiated in 2005, operating across ten sites in ten sub-Saharan African countries to achieve the Millennium Development Goals (MDGs). . . .

In this endline evaluation of the MVP, we retrospectively selected comparison villages that best matched the project villages on possible confounding variables. . . . we estimated project impacts as differences in outcomes between the project and comparison villages; target attainment as differences between project outcomes and prespecified targets; and on-site spending as expenditures reported by communities, donors, governments, and the project. . . .

Averaged across the ten project sites, we found that impact estimates for 30 of 40 outcomes were significant (95% uncertainty intervals [UIs] for these outcomes excluded zero) and favoured the project villages. In particular, substantial effects were seen in agriculture and health, in which some outcomes were roughly one SD better in the project villages than in the comparison villages. The project was estimated to have no significant impact on the consumption-based measures of poverty, but a significant favourable impact on an index of asset ownership. Impacts on nutrition and education outcomes were often inconclusive (95% UIs included zero). Averaging across outcomes within categories, the project had significant favourable impacts on agriculture, nutrition, education, child health, maternal health, HIV and malaria, and water and sanitation. A third of the targets were met in the project sites. . . .

It took us three years to do this retrospective evaluation, from designing the sampling plans and gathering background data to designing the comparisons and performing the statistical analysis.

At the very beginning of the project, we made it clear that our goal was not to find “statistically significant” effects, and that we’d do our best and report what we found. Unfortunately, some of the results in the paper are summarized by statistical significance. You can’t fight City Hall. But we tried our best to minimize such statements.

In the design stage we did lots and lots of fake-data simulation to get a sense of what we might expect to see. We consciously tried to avoid the usual plan of gathering data, flying blind, and hoping for good results.

You can read the article for the full story. Also, published in the same issue of the journal:

  • The perspective of Jeff Sachs, leader of the Millennium Villages Project.
  • An outside evaluation of our evaluation, from Eran Bendavid.

Fitting a hierarchical model without losing control

Tim Disher writes:

I have been asked to run some regularized regressions on a small-N, high-p situation, which for the primary outcome has led to more realistic coefficient estimates and better performance on cross-validation (yay!). Rstanarm made this process very easy for me, so I am grateful for it.

I have now been asked to run a similar regression on a set of exploratory analyses where authors are predicting the results of 4 subscales of the same psychological test. Given the small sample and opportunity for type M and S errors I had originally thought of trying to specify a multivariate normal model, but then remembered your paper on why we don’t usually worry about multiple comparisons.

I am new to translating written notation of multilevel models into R code, but I’m wondering if I’m understanding your eight schools with multiple outcomes example properly. Would the specification in lmer just be:

y ~ 1 + (1 + B1 + B2 | outcome)

Where outcome is my factor of subscales, y is the standardized test outcome, and B1 and B2 are standardized slopes I want to allow to vary by subgroup? This seems to make sense to me in that it’s coding my belief that the slopes between subgroups are similar (and thus hopefully pulling extreme estimates closer to the overall mean), but it seems too easy, so I figure I must be doing something wrong. The results also end up leading to switching signs in the coefficients when compared against the no-pooling results. Not sure whether to be excited about potentially avoiding a type S error, or scared that I’ve stuffed up the whole analysis!

My reply:

That looks almost right to me as a starting point, but one thing it’s missing is the idea that the 4 subscales could be correlated. Perhaps people with higher scores on subscale 1 also tend to have higher scores on subscale 2, for example?

How best to model the correlation? It depends on what these subscales are doing. Most general is a 4×4 covariance matrix (which, incidentally, allows the variances to be different for the different subscales, something not allowed in your model above), but some sort of item response model could make sense if you think all the subscales are measuring related things.

In any case, I guess you could start with the model above but then I’d move to fitting a multivariate-outcome model in Stan.
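
As a rough sketch of the correlated-outcomes part (not the full model, and with hypothetical column names sub1 through sub4, B1, B2, and data frame dat), brms’s multivariate syntax will generate the Stan code for you; pooling the slopes across subscales, as in the lmer formula above, would still require the long-format varying-slope setup or a custom Stan model:

library(brms)

fit_mv <- brm(mvbind(sub1, sub2, sub3, sub4) ~ B1 + B2,
              data = dat, chains = 4, cores = 4)
# By default brms estimates the residual correlations among the four Gaussian
# outcomes, and each subscale gets its own residual standard deviation.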

Finally, regarding the larger question of making sure that your model is doing what you think it’s supposed to be doing: I very much recommend fake data simulation. Set up your model, do a forward simulation and create fake data, then fit the model to your fake data and check that the results make sense and are consistent with what you were assuming.
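
A bare-bones version of that check, with made-up numbers, might look like this: simulate data from known parameter values, refit, and see whether the fit recovers them.

library(rstanarm)

n <- 50
b0_true <- 1; b1_true <- 0.5; sigma_true <- 2
x <- rnorm(n)
y_fake <- rnorm(n, b0_true + b1_true * x, sigma_true)

fit_fake <- stan_glm(y_fake ~ x, data = data.frame(x, y_fake))
print(fit_fake)   # do the estimates and intervals look consistent with 1, 0.5, and 2?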