
“Tweeking”: The big problem is not where you think it is.

In her recent article about pizzagate, Stephanie Lee included this hilarious email from Brian Wansink, the self-styled “world-renowned eating behavior expert for over 25 years”:

OK, what grabs your attention is that last bit about “tweeking” the data to manipulate the p-value, where Wansink is proposing research misconduct (from NIH: “Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record”).

But I want to focus on a different bit:

. . . although the stickers increase apple selection by 71% . . .

This is the type M (magnitude) error problem—familiar now to us, but not so familiar a few years ago to Brian Wansink, James Heckman, and other prolific researchers.

In short: when your data are noisy, you’ll expect to get large point estimates. Also large standard errors, but as the email above illustrates, you can play games to get the standard errors down.

My point is: an unreasonably high point estimate (in this case, “stickers increase apple selection by 71%,” a claim that’s only slightly less ridiculous than the claim that women are three times more likely to wear red or pink during days 6-14 of their monthly cycle) is not a sign of a good experiment or a strong effect—it’s a sign that whatever you’re looking for is being overwhelmed by noise.

One problem, I fear, is the lesson sometimes taught by statisticians, that we should care about “practical significance” and not just “statistical significance.” Generations of researchers have been running around, petrified that they might overinterpret an estimate of 0.003 with standard error 0.001. But this distracts them from the meaninglessness of estimates that are HUGE and noisy.

I say “huge and noisy,” not “huge or noisy,” because “huge” and “noisy” go together. It’s the high noise level that allows you to get that huge estimate in the first place.
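To see how this works, here’s a little simulation sketch in R (the numbers are made up, not Wansink’s): a small true effect, a noisy measurement, and then selection on statistical significance.

set.seed(123)
n_sims <- 10000
true_effect <- 2    # hypothetical true effect
se <- 10            # hypothetical standard error of each study's estimate
estimate <- rnorm(n_sims, true_effect, se)
significant <- abs(estimate/se) > 1.96
mean(significant)                               # power: only about 5%
mean(abs(estimate[significant]))                # typical size of a "significant" estimate
mean(abs(estimate[significant]))/true_effect    # exaggeration (type M) ratio: roughly 10

The estimates that clear the significance threshold are an order of magnitude larger than the true effect, and a fair fraction of them have the wrong sign. This is essentially the type M and type S error calculation from my paper with Carlin, done by simulation.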

P.S. I wrote this post about 6 months ago and it happens to be appearing today. Wansink is again in the news. Just a coincidence. The important lesson here is GIGO. All the preregistration in the world won’t help you if your data are too noisy.

Multilevel data collection and analysis for weight training (with R code)

[image of cat lifting weights]

A graduate student who wishes to remain anonymous writes:

I was wondering if you could answer an elementary question which came to mind after reading your article with Carlin on retrospective power analysis.

Consider the field of exercise science, and in particular studies on people who lift weights. (I sometimes read these to inform my own exercise program.) Because muscle gain, fat loss, and (to a lesser extent) strength gain happen very slowly, and these studies usually last a few weeks to a few months at most, the effect sizes are all quite small. This is especially the case for the difference in means when comparing any two interventions that are not wholly idiotic.

For example, a recent meta-analysis compared non-periodized to periodized strength training programs and found an effect size of approximately 0.25 standard deviations in favor of periodized programs. I’ll use this as an example, but essentially all other non-tautological, practically relevant effects are around this size or less. (Check out the publication bias graph (Figure 3, page 2091), and try not to have a heart attack when you see someone reported an effect size of 5 standard deviations. More accurately, after some mixed effects model got applied, but still…)

In research on such programs, it is not uncommon to have around N=10 in both the control and experimental group. Sometimes you get N=20, maybe even as high as N=30 in a few studies. But that’s about the largest I’ve seen.

Using an online power calculator, I find you would need well over N=100 in each group to get acceptable power (say 0.8). This is problematic for the reasons outlined in your article.

It seems practically impossible to recruit this many subjects for something like a multi-month weight training study. Should I then conclude that doing statistically rigorous exercise science—on the kinds of non-tautological effects that people actually care about—is impossible?

(And then on top of this, there are concerns about noisy measurements, ecological validity, and so on. It seems that the rot that infects other high-noise, small-sample disciplines also infects exercise science, to an even greater degree.)
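Before getting to my reply: here’s a quick check of that power calculation, as a minimal sketch assuming a simple two-sample comparison with a standardized effect size of 0.25.

power.t.test(delta = 0.25, sd = 1, sig.level = 0.05, power = 0.8)
# n comes out to roughly 250 per group, far beyond the N = 10 to 30
# that is typical of these studies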

My reply:

You describe your question as “elementary” but it’s not so elementary at all! Like many seemingly simple statistics questions, it gets right to the heart of things, and it’s something we’re still working on in our research.

To start, I think the way to go is to have detailed within-person trajectories. Crude between-person designs just won’t cut it, for the reasons stated above. (I expect that any average effect sizes will be much less than 0.25 sd’s “when comparing any two interventions that are not wholly idiotic.”)

How to do it right? At the design stage, it would be best to try multiple treatments per person. And you’ll want multiple, precise measurements. This last point sounds kinda obvious, but we don’t always see it because researchers have been able to achieve apparent success through statistical significance with any data at all.

The next question is how to analyze the data. You’ll want to fit a multilevel model with a different trajectory for each person: most simply, a varying-intercept, varying-slope model.

What would help is to have an example that’s from a real problem and also simple enough to get across the basic point without distraction.

I don’t have that example right here so let’s try something with fake data.

Fake-data simulations and analysis of between- and within-person designs

The underlying model will be a default gradual improvement, with the treatment being more effective than the control. We’ll simulate fake data, then run the regression to estimate the treatment effect.

Here’s some R code:

## 1.  Set up

setwd("/Users/andrew/AndrewFiles/teaching/multilevel_course")
library("rstan")
library("rstanarm")
library("arm")
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

## 2.  Simulate a data structure with N_per_person measurements on each of J people

J <- 50  # number of people in the experiment
N_per_person <- 10 # number of measurements per person
person_id <- rep(1:J, rep(N_per_person, J))
index <- rep(1:N_per_person, J) 
time <- index - 1  # time of measurements, from 0 to 9
N <- length(person_id)
a <- rnorm(J, 0, 1)
b <- rnorm(J, 1, 1)
theta <- 1
sigma_y <- 1

## 3.  Simulate data from a between-person experiment

z <- sample(rep(c(0,1), J/2), J)
y_pred <- a[person_id] + b[person_id]*time + theta*z[person_id]*time
y <- rnorm(N, y_pred, sigma_y)
z_full <- z[person_id]
exposure <- z_full*time
data_1 <- data.frame(time, person_id, exposure, y)

## 4.  Simulate data from a within-person experiment:  for each person, do one treatment for the first half of the experiment and the other treatment for the second half.

z_first_half <- z
T_switch <- floor(0.5*max(time))
z_full <- ifelse(time <= T_switch, z_first_half[person_id], 1 - z_first_half[person_id])
for (j in 1:J){
  exposure[person_id==j] <- cumsum(z_full[person_id==j])
}
y_pred <- a[person_id] + b[person_id]*time + theta*exposure
y <- rnorm(N, y_pred, sigma_y)
data_2 <- data.frame(time, person_id, exposure, y)

## 5.  Graph the simulated data

pdf("within_design.pdf", height=7, width=10)
par(mfrow=c(2, 2))
par(mar=c(3,3,3,1), mgp=c(1.5, .5, 0), tck=-.01)

plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n", bty="l", main="Between-person design:\nControl group")
for (j in 1:J){
  ok <- data_1$person_id==j
  if (z[j] == 0){
    points(time[ok], data_1$y[ok], pch=20, cex=.5)
    lines(time[ok], data_1$y[ok], lwd=.5, col="blue")
  }
}
plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n", bty="l", main="Between-person design:\nTreatment group")
for (j in 1:J){
  ok <- data_1$person_id==j
  if (z[j] == 1){
    points(time[ok], data_1$y[ok], pch=20, cex=.5)
    lines(time[ok], data_1$y[ok], lwd=.5, col="red")
  }
}
plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n", bty="l", main="Within-person design:\nControl, then treatment")
for (j in 1:J){
  ok <- person_id==j
  if (z[j] == 0) {
    points(time[ok], data_2$y[ok], pch=20, cex=.5)
    lines(time[ok&time<=T_switch], data_2$y[ok&time<=T_switch], lwd=.5, col="blue")
    lines(time[ok&time>=T_switch], data_2$y[ok&time>=T_switch], lwd=.5, col="red")
  }
}
plot(range(time), range(data_1$y, data_2$y), xlab="time", ylab="y", type="n", bty="l", main="Within-person design:\nTreatment, then control")
for (j in 1:J){
  ok <- person_id==j
  if (z[j] == 1) {
    points(time[ok], data_2$y[ok], pch=20, cex=.5)
    lines(time[ok&time<=T_switch], data_2$y[ok&time<=T_switch], lwd=.5, col="red")
    lines(time[ok&time>=T_switch], data_2$y[ok&time>=T_switch], lwd=.5, col="blue")
  }
}
dev.off()

## 6. Fit models using rstanarm

fit_1 <- stan_glmer(y ~ (1 + time | person_id) + time + exposure, data=data_1)
fit_2 <- stan_glmer(y ~ (1 + time | person_id) + time + exposure, data=data_2)

print(fit_1)
print(fit_2)

And here are the simulated data from these 50 people with 10 measurements each.

I'm simulating two experiments. The top row shows simulated control and treatment data from a between-person design, in which 25 people get control and 25 get treatment, for the whole time period. The bottom row shows simulated data from a within-person design, in which 25 people get control for the first half of the experiment and treatment for the second half; and the other 25 people get treatment, then control. In all these graphs, the dots show simulated data and the lines are blue or red depending on whether control or treatment was happening:

In this case the slopes under the control condition have mean 1 and standard deviation 1 in the population, and the treatment effect is assumed to be a constant, adding 1 to the slope for everyone while it is happening.

Here's the result from the regression with the simulated between-person data:

 family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) 0.0    0.2   
time        1.2    0.2   
exposure    0.5    0.3   
sigma       1.0    0.0   

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 0.91         
           time        0.91     0.11
 Residual              1.01         
Num. levels: person_id 50

The true value is 1 but the point estimate is 0.5. That's just randomness; simulate new data and you might get an estimate of 1.1, or 0.8, or 1.4, or whatever. The relevant bit is that the standard error is 0.3.

Now the within-person design:

family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) -0.1    0.2  
time         1.0    0.1  
exposure     1.1    0.1  
sigma        1.1    0.0  

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 0.93         
           time        0.96     0.11
 Residual              1.12         
Num. levels: person_id 50 

With this crossover design, the standard error is now just 0.1.

A more realistic treatment effect

The above doesn't seem so bad: sure, 50 is a large sample size, but we're able to reliably estimate that treatment effect.

But, as discussed way above, a treatment effect of 1 seems way too high, given that the new treatment is being compared to some existing best practices.

Let's do it again but with a true effect of 0.1 (thus, changing "theta <- 1" in the above code to "theta <- 0.1"). Now here's what we get.

For the between-person data:

stan_glmer
 family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) 0.2    0.2   
time        1.0    0.2   
exposure    0.0    0.3   
sigma       1.0    0.0   

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 1.1          
           time        1.0      0.24
 Residual              1.0          
Num. levels: person_id 50

For the within-person data:

stan_glmer
 family:       gaussian [identity]
 formula:      y ~ (1 + time | person_id) + time + exposure
 observations: 500
------
            Median MAD_SD
(Intercept) 0.2    0.2   
time        1.0    0.1   
exposure    0.1    0.1   
sigma       1.0    0.0   

Error terms:
 Groups    Name        Std.Dev. Corr
 person_id (Intercept) 0.98         
           time        0.95     0.32
 Residual              1.00         
Num. levels: person_id 50

The key here: The se's for the coefficient of "exposure"---that is, the se's for the treatment effect---have not changed; they're still 0.3 for the between-person design and 0.1 for the within-person design.

What to do, then?

So, what to do, if the true effect size really is only 0.1? I think you have to study more granular output. Not just overall muscle gain, for example, but some measures of muscle gain that are particularly tied to your treatment.

To put it another way: if you think this new treatment might "work," then think carefully about how it might work, and get in there and measure that process.

Lessons learned from our fake-data simulation

Except in the simplest settings, setting up a fake-data simulation requires you to decide on a bunch of parameters. Graphing the fake data is in practice a necessity in order to understand the model you're simulating and to see where to improve it. For example, if you're not happy with the above graphs---they don't look like what your data really could look like---then, fine, change the parameters.

In very simple settings you can simply suppose that the effect size is 0.1 standard deviations and go from there. But once you get to nonlinearity, interactions, repeated measurements, multilevel structures, varying treatment effects, etc., you'll have to throw away that power calculator and dive right in with the simulations.
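And if you do want a power analysis, or better, a design analysis, for a setting like this, the fake-data simulation is itself the calculator. Here's a rough sketch of how that might look for the within-person design above (not something I ran for this post, and I've swapped in lmer from lme4 so that refitting the model a hundred times is fast):

library("lme4")
library("arm")

## Simulate one within-person (crossover) experiment with the same structure as above, fit the model, and return the estimate and standard error of the treatment effect

sim_and_fit <- function(J = 50, N_per_person = 10, theta = 0.1, sigma_y = 1){
  person_id <- rep(1:J, each = N_per_person)
  time <- rep(0:(N_per_person - 1), J)
  a <- rnorm(J, 0, 1)
  b <- rnorm(J, 1, 1)
  z_first_half <- sample(rep(c(0, 1), J/2), J)
  T_switch <- floor(0.5*max(time))
  z_full <- ifelse(time <= T_switch, z_first_half[person_id], 1 - z_first_half[person_id])
  exposure <- ave(z_full, person_id, FUN = cumsum)
  y <- rnorm(J*N_per_person, a[person_id] + b[person_id]*time + theta*exposure, sigma_y)
  fake <- data.frame(person_id, time, exposure, y)
  fit <- lmer(y ~ (1 + time | person_id) + time + exposure, data = fake)
  c(est = unname(fixef(fit)["exposure"]), se = unname(se.fixef(fit)["exposure"]))
}

## Repeat the simulation and summarize

sims <- replicate(100, sim_and_fit())
rowMeans(sims)     # average estimate and average standard error
sd(sims["est",])   # how much the estimate actually varies across replications

The average standard error is the number to compare across candidate designs (more people, more measurements per person, between vs. within), and the spread of the estimates tells you whether an effect of 0.1 is even detectable with the design you can afford.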

A psychology researcher uses Stan, multiverse, and open data exploration to explore human memory

Under the heading, “An example of Stan to the rescue, multiverse analysis, and psychologists trying to do well,” Greg Cox writes:

I’m currently a postdoc at Syracuse University studying how human memory works. I wanted to forward a paper of ours [“Information and Processes Underlying Semantic and Episodic Memory Across Tasks, Items, and Individuals,” by Gregory Cox, Amy Criss, William Aue, and Pernille Hemmer] that is soon to appear in the Journal of Experimental Psychology: General in which we use Stan to help us analyze correlations among a variety of experimental tasks commonly used to study human memory, thereby helping us understand not only what those tasks tell us about the humans doing them but also how humans use the stimulus materials (words, in this case) across these different tasks. My purpose in forwarding this, beyond the aforementioned shameless plug, is that I believe this project can serve as an example of the utility of several practices you and your blogmates have long advocated:
1) Stan
2) Multiverse analysis
3) Open data exploration

Cox elaborates:

1) Stan

In my own work, I have bounced around between various off-the-shelf samplers and needing to write my own for certain special-purpose models. But Stan made it so easy to implement big multivariate distributions and rather complex link functions (the Wiener first passage time distribution) and NUTS explores the posterior in such an efficient manner that I find it hard to imagine I will go back to other programs or custom samplers any time soon. We might have been able to do the work without Stan, but we would have had to either reduce the scope of our project or rest uneasy in the worry that our posterior samples were just a cloud of dust. Stan makes me feel that we can rest easy.

2) Multiverse analysis

Like anyone who analyzes data, particularly in an exploratory setting (see below), we tried a bunch of different things before settling on the stuff in the main text. I admit, we originally were not going to talk about all those other tries in the paper until the editor asked about them, but I’m very glad he (the editor) did! For context, our analyses involved selecting a certain number of principal components to focus on, so that we could reduce the complexity of the data to something we might have a chance of understanding (and communicating to a reader), but how was this choice made? While hardly gripping reading, Appendix D shows the results of different choices, and even the results that would occur if individual tasks were dropped out (as if we had not measured them). The result is that a reader has more information to make a reasoned judgment about our analyses, since they now have a clearer picture of why we thought the main one was most useful and also how much the conclusions might change as a function of different choices or different observations.

3) Open data exploration

The original motivation for this study, for which credit goes to Pernille Hemmer and Amy Criss, was to produce an openly available set of data across these various tasks to aid the development of memory theory, so really this whole project was “born free”. We have tried to adhere to this ideal as best we can, putting out our data and code on the OSF, doing so as soon as we submitted it for publication. We know that there is a lot more detail in these data than we have analyzed so far, but we are finite creatures. By making all of our data/methods available, we don’t have to do all the work ourselves, so this is a benefit to us–someone with a good idea can just do it and contribute immediately without going to the trouble of collecting another 450 participants. And obviously it is a benefit to those folks with good ideas since now they can test and refine those ideas much more easily. As we say in our discussion, “we do not expect this will be the final word with regard to this dataset or the tasks represented therein.”

But while I think our work is an example of a number of good research practices, I am sure that there are things we could do better . . . As a final remark, many psychologists/cognitive scientists like myself feel frustrated that the public face of our field is one that is pocked with a mixture of tabloid silliness, deliberate ignorance, and outright fraud, as discussed often on your blog. Thus I hope that our work, as imperfect as it is, can also serve as a reminder that there have been a lot of us—going back to the birth of scientific psychology with Helmholtz, Wundt, etc.—that have been diligently trying to understand how minds work through careful experiments, rigorous analyses, and well-developed theories. We may not be the loudest voices in the room, but we were always there and always will be.

Cox adds:

While I believe the public face of psychology is, indeed, unpleasant to look upon, I don’t blame critics for that fact, nor do I mean to impugn the vast majority of psychology researchers who, like most of us, are doing the best we can with what we have. I also don’t mean to say that the ONLY way to do good psychological science is the way we did it—different research questions require different kinds of experiments, analyses, and theories (to say nothing of the fact that we are not perfect ourselves). Rather, I want to emphasize the importance of tight links between experiment, analysis, and theory, and that the three research practices I discuss (Stan, multiverse, and open exploration) make it easier to forge those links and make real progress.

How to graph a function of 4 variables using a grid

This came up in response to a student’s question.

I wrote that, in general, you can plot a function y(x) on a simple graph. You can plot y(x,x2) by plotting y vs x and then having several lines showing different values of x2 (for example, x2=0, x2=0.5, x2=1, x2=1.5, x2=2, etc). You can plot y(x,x2,x3,x4) by making a two-dimensional grid of plots, where the rows show different values of x3 and the columns show different values of x4.

Then I thought I should illustrate with a graph:

It took me about an hour to make this in R (or maybe half an hour, as I was doing other things at the same time). The code is really ugly; see below. Among other things, I had difficulty with the expression() function in R. I expect it should be much easier and more effective to do this in ggplot2.
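In that spirit, here's a minimal sketch of the ggplot2 version of the idea, using a made-up function y(x, x2, x3, x4) rather than the one in my example: within each panel the lines index x2, the rows of the grid index x3, and the columns index x4.

library("ggplot2")

y_fun <- function(x, x2, x3, x4) plogis(x + 0.5*x2 - x3 + 0.3*x4*x)   # made-up function, just for illustration
grid <- expand.grid(x = seq(0, 2, 0.02),
                    x2 = c(0, 0.5, 1, 1.5, 2),
                    x3 = c(0, 1, 2),
                    x4 = c(0, 1, 2))
grid$y <- with(grid, y_fun(x, x2, x3, x4))
ggplot(grid, aes(x, y, color = factor(x2))) +
  geom_line() +
  facet_grid(x3 ~ x4, labeller = label_both) +
  labs(color = "x2")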

[Check out the comments, which include several excellent implementations of this idea in ggplot2. If this doesn’t induce me to switch to ggplot2, I don’t know what will! — ed.]

Anyway, below is my code, which I include not out of pride but out of honesty. I could clean it up a bit but I might as well show you what I did. In any case, the grid of graphs illustrates the general point of how to plot a function of many variables, a point which I don’t think is as well known as it should be.
Continue reading ‘How to graph a function of 4 variables using a grid’ »

Post-publication peer review: who’s qualified?

Gabriel Power writes:

I don’t recall that you addressed this point in your posts on post-publication peer review [for example, here and here — ed.]. Who would be allowed to post reviews of a paper? Anyone? Only researchers? Only experts?

Science is not a democracy. A study is not valid because a majority of people think it is. One correct assessment should trump ten incorrect assessments.

In other words, how do we qualify the tradeoff between “fewer, deeper reviews” and “many, shallower reviews”? When does the former work best? The latter?

I think everyone should be allowed to post reviews. But this does raise the question of what to do if trolls start to invade. I guess we can address this problem when we get that far. Right now, the problem seems to be not enough reviews, not too many.

A couple more papers on genetic diversity as an explanation for why Africa and remote Andean countries are so poor while Europe and North America are so wealthy

Back in 2013, I wrote a post regarding a controversial claim that high genetic diversity, or low genetic diversity, is bad for the economy:

Two economics professors, Quamrul Ashraf and Oded Galor, wrote a paper, “The Out of Africa Hypothesis, Human Genetic Diversity, and Comparative Economic Development,” that is scheduled to appear in the American Economic Review. As Peyton has indicated, the paper is pretty silly and I’m surprised it was accepted in such a top journal. Economists can be credulous but I’d expect better from them when considering economic development, which is one of their central topics. Ashraf and Galor have, however, been somewhat lucky in their enemies, in that they’ve been attacked by a bunch of anthropologists who have criticized them on political as well as scientific grounds. This gives the pair of economists the scientific and even moral high ground, in that they can feel that, unlike their antagonists, they are the true scholars, the ones pursuing truth wherever it leads them, letting the chips fall where they may.

The real issue for me is that the chips aren’t quite falling the way Ashraf and Galor think they are. . . . Everybody wants to be Jared Diamond, that’s the problem. . . .

And in 2016 I followed up with a post, “Why is Africa so poor while Europe and North America are so wealthy?”:

Any claim that economic outcomes can be explained by genes will be immediately controversial. It can be interpreted as a justification of the status quo, as if it is arguing that existing economic inequality among countries has a natural, genetic cause. See this paper by Guedes et al. for further discussion of this point.

When the paper by Ashraf and Galor came out, I criticized it from a statistical perspective, questioning what I considered its overreach in making counterfactual causal claims . . .

My criticisms were of a general sort. Recently, Shiping Tang sent me a paper criticizing Ashraf and Galor from a data-analysis perspective, arguing that their effect goes away after allowing for a “Eurasia” effect . . . I have not tried to evaluate the details of Tang’s re-analysis because I continue to think that Ashraf and Galor’s paper is essentially an analysis of three data points (sub-Saharan Africa, remote Andean countries and Eurasia). It offered little more than the already-known stylized fact that sub-Saharan African countries are very poor, Amerindian countries are somewhat poor, and countries with Eurasians and their descendants tend to have middle or high incomes. . . .

In the meantime (actually, before my 2016 post), various experts have written more on the topic.

There’s this 2015 paper, “Heterogeneity and Productivity,” by Ashraf, Galor, and Klemp, which begins:

This research explores the effects of within-group heterogeneity on group-level productivity within a novel geo-referenced dataset of observed genetic diversity across the globe. It establishes that observed genetic diversity of 230 worldwide ethnic groups, as well as predicted genetic diversity of 1,331 ethnic groups, has a hump-shaped effect on economic prosperity, reflecting the trade-off between the beneficial and the detrimental effects of diversity on productivity. Moreover, the study demonstrates that variations in within-ethnic-group genetic diversity across ethnic groups contribute to ethnic and thus regional variation in economic development within a country.

Also in 2015, Noah Rosenberg and Jonathan Kang published a paper, “Genetic Diversity and Societally Important Disparities,” which begins:

The magnitude of genetic diversity within human populations varies in a way that reflects the sequence of migrations by which people spread throughout the world. Beyond its use in human evolutionary genetics, worldwide variation in genetic diversity sometimes can interact with social processes to produce differences among populations in their relationship to modern societal problems. We review the consequences of genetic diversity differences in the settings of familial identification in forensic genetic testing, match probabilities in bone marrow transplantation, and representation in genome-wide association studies of disease. In each of these three cases, the contribution of genetic diversity to social differences follows from population-genetic principles. For a fourth setting that is not similarly grounded, we reanalyze with expanded genetic data a report that genetic diversity differences influence global patterns of human economic development, finding no support for the claim. The four examples describe a limit to the importance of genetic diversity for explaining societal differences while illustrating a distinction that certain biologically based scenarios do require consideration of genetic diversity for solving problems to which populations have been differentially predisposed by the unique history of human migrations.

So, the economists go with a biological explanation for economic disparities, but the biologists disagree. Rosenberg summarizes in an email:

Something that might be considered different in this critique of Ashraf & Galor (2013) is that the critique is accompanied by detailed descriptions of several topics where genetic diversity has a clear impact on socially consequential variables that differ across populations (e.g. transplantation match probabilities), and by a discussion of the distinction that separates Ashraf & Galor (2013) from those other topics. We would hope that the line of work that computes correlations between economic variables and features of population-genetic data will recognize the distinction between fundamentally nonbiological uses of the population-genetic variables and scenarios where the utility of those variables is based in biology.

I don’t have anything new to add beyond my 2013 post, really. Anyone who’s interested can go to these papers and read more.

The hot hand—in darts!

Roland Langrock writes:

Since on your blog you’ve regularly been discussing hot hand literature – which we closely followed – I’m writing to share with you a new working paper we wrote on a potential hot hand pattern in professional darts.

We use state-space models in which a continuous-valued latent “hotness” variable, modeled as an autoregressive process, is assumed to affect throwing success probabilities. We find strong but short-lived correlation in the latent (hotness) process, which we claim provides clear evidence for a hot hand phenomenon in darts. Right now, we’re implementing the model in Stan, since we want to incorporate random effects to account for potential player heterogeneity regarding the hot hand effect.

I replied:

Regarding hotness etc., we’ve considered two sorts of models:
1. continuously varying underlying ability (thus, the hot hand is a “correlational” phenomenon)
2. making a shot increases the probability of making the next shot (a “causal” model)
It would make sense to include both factors, of course, but either one of them is hard enough to estimate on its own, for reasons discussed on the blog: (a) the binary outcome is so noisy that you’d need lots of data, but if you have a long series, then hotness won’t last so long, and (b) the state variable can only be measured from noisy 0/1 data, so it’s hard to tell whether you’re hot or not just from data on recent successes.

With darts it should be much easier to study this because you have continuous data on how close the shot is to its intended target. For basketball we’ve talked about doing something similar using 3-d ball-tracking data, but that would take a lot of work!
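Just to illustrate the first kind of model in its simplest form, here's a little simulation sketch with made-up parameters (a latent AR(1) "hotness" process shifting a continuous accuracy outcome, as in darts; this is not Langrock's actual state-space model):

set.seed(1)
n_throws <- 500
rho <- 0.8     # persistence of hotness: streaks are short-lived if this is well below 1
tau <- 0.3     # sd of the innovations in the hotness process
sigma <- 1     # baseline scale of the distance from the intended target
hotness <- numeric(n_throws)
for (t in 2:n_throws){
  hotness[t] <- rho*hotness[t-1] + rnorm(1, 0, tau)
}
distance <- abs(rnorm(n_throws, 0, sigma*exp(-hotness)))   # throws land closer to the target when "hot"
acf(log(distance), lag.max = 20)   # the latent process shows up as autocorrelation in (log) accuracy

With a continuous outcome like this, the autocorrelation is easy to see; turn the same throws into 0/1 hits and misses and most of that information disappears, which is problem (a) above.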

Why, oh why, do so many people embrace the Pacific Garbage Cleanup nonsense? (I have a theory).

This post is by Phil, not Andrew.

Over the past couple of months I have seen quite a few people celebrating the long-awaited launch of a big device that will remove plastic garbage from the Pacific Ocean. I find this frustrating because this project makes no sense even if the device works as intended: at best it will turn out to be a good piece of technology that is deployed in a stupid location where it will cost a lot of money to maintain while removing much less plastic than it could otherwise.

Every now and then I hear similar enthusiasm being expressed for devices that will remove carbon dioxide from the atmosphere…just build these things all over the place and you can solve or at least reduce the problem of excessive atmospheric carbon dioxide. As with the ocean cleanup, if you have such a technology you’d be a fool to deploy it this way.

As many readers of this blog will have already recognized, if you are trying to remove plastic from the ocean, or to remove carbon dioxide from the atmosphere, the place to do it is where the concentration is highest.  If you have a device that can separate plastic from water, put it in or near the outflow of a river that is bringing the plastic into the ocean; don’t wait until the garbage-filled water has been diluted by a factor of many thousand. The device can only remove plastic from the water that it interacts with, so it can only process a certain volume of water per day; much better if that water has a high concentration of the plastic you are trying to remove. Similarly, if you have a technology that can separate carbon dioxide from other gases, you should put it in or near smokestacks from power plants and cement plants…that is, places where the concentration is much higher than you find if you wait for the gases to be fully mixed into the rest of the atmosphere.

The situation with the Pacific Garbage Patch cleanup is especially bad because, in addition to the (very large) inefficiency that is due to putting the devices in places with unnecessarily low concentrations of floating plastic, there are further inefficiencies associated with putting the devices far, far out at sea, where they are more costly to maintain than if they were closer to land.

All of the above seems pretty obvious, and I daresay it is pretty obvious to most readers of this blog. Why, then, are so many people excited about the idea of putting a bunch of devices way out in the Pacific, or sprinkling carbon dioxide removal devices all around the globe? I don’t know but I have a theory: I think people make a cognitive error, or perhaps experience a cognitive illusion, in which they don’t count the input stream as part of the system in a logical way. I had some interesting and somewhat perplexing conversations with friends who are enthusiastic about the carbon dioxide removal idea, and I think that’s a nice clean example of the cognitive error that I’m talking about so I’ll focus on that one.

I’ve discussed the carbon dioxide removal approach with several friends at various times, and said that if there is a good technology for separating carbon dioxide from other gases it should be used at major carbon dioxide sources, not in the general atmosphere. All of them express some variation of this sentiment: We need to stop emitting carbon dioxide, but we also need to remove carbon dioxide from the atmosphere because carbon dioxide concentrations are already too high. Putting the devices in places where they remove carbon dioxide from ‘the atmosphere’ seems like it is actually solving the problem, whereas decreasing the amount of carbon dioxide that is emitted is merely a way of stopping things from getting worse.

But of course, taking N tons of carbon dioxide out of the atmosphere and taking N tons of carbon dioxide out of a stream of combustion gases that is entering the atmosphere have the same effect on atmospheric carbon dioxide concentrations. And if, by putting your device at the smokestack, you can remove 3N tons in the same time and for the same cost, you are much better off putting the device in the smokestack. But somehow this doesn’t appeal to people because it doesn’t “reduce the carbon dioxide concentration in the atmosphere.” It’s almost like they would prefer to add 3N tons to the atmosphere and then remove N tons, than to add zero tons in the first place; after all, in the former case you are removing N tons of carbon dioxide from the atmosphere, and in the latter you aren’t removing anything! They love the idea of removing the pollutant; decreasing the rate at which it is added just doesn’t seem the same somehow, even though it is.

I think this error, or something close to it, clouds people’s thinking about ocean plastics too.

The situation is more complicated with ocean plastics: 97% of plastic that enters the ocean does not end up in the “Great Pacific Garbage Patch” so if you want to remove plastic specifically from the ‘patch’ then maybe you do want to put your device there. But I’ve talked to people about this and they seem to agree that they do want to decrease the amount of plastics in the oceans in general, not just in the middle of the Pacific Ocean. But still, they seem to think that a device that removes plastic from the Pacific Ocean is needed, and they’re much more excited about that than about preventing plastic from entering the Pacific Ocean. One of my friends said “we need to do both.” Well, no: if it’s more effective to just do one of them, then that’s what we should do.

The Ocean Cleanup project is probably going to collect many tons of garbage from the Pacific Ocean (at great expense) and I’m sure some people will declare it a success…and that’s a crying shame because they could do much, much better for a lot less money.

This post is by Phil, not Andrew.

What to do when your measured outcome doesn’t quite line up with what you’re interested in?

Matthew Poes writes:

I’m writing a research memo discussing the importance of precisely aligning the outcome measures to the intervention activities. I’m making the point that an evaluation of the outcomes for a given intervention may net null results for many reasons, one of which could simply be that you are looking in the wrong place. That the outcome you chose isn’t what the intervention actually influences. I’m thinking about this in a nuanced way (so not like you set sail for Hawaii and hit Australia, but more like you ended up off the coast of Nihue). An intervention may seek to improve parent child interaction, which is a broad domain with many potential sub-domains, but in looking at the specific activities, it may be that it only seeks to improve a particular type of parent child interaction. General measures may not pick this up, or they may, but may not be particularly sensitive to change in that kind of parent child interaction. So while there are many potential failed assumptions that lead to a null finding, certainly one could be a misaligned outcome measure.

My question is, is this a statistical concept, is there a statistical property to the specificity or alignment of the outcome to the intervention activities, or is this purely a conceptual issue with how experiments work. Related to that question, is it fair to say that poorly aligned outcome measures may also attenuate effect sizes even when positive effects are found? An example might be that your measure gets at 4 domains of parent child interaction: affection, responsiveness, encouragement, and teaching. Our interest may be to impact how parents interact with their child to improve a child’s social emotional development. Maybe we know that affection and encouragement are the major predictors of S.E. from this tool, but it’s not a direct measure. So this measure isn’t a perfect match to what we are trying to impact, with past research primarily showing that maternal sensitivity is what matters (but we aren’t measuring that in this fictitious study, its indirectly coming from related concepts). My thought is that in such a scenario, while we have superficially measured our outcome of interest (parent child interaction) we failed to drill down to what actually is taking place (and matters), maternal sensitivity. Instead we have a few items from a more general scale that may get at maternal sensitivity. As such, we may have attenuated effects because we aren’t measuring all the things that are improving from the intervention, as well as because only having a small number of items may not make for a very reliable or valid scale for our actual outcome of interest.

I don’t have a detailed response here, but I wanted to post this because it relates to our recent post on estimating interactions. Some people might have taken from that post the message that it’s too hard to estimate interactions so we shouldn’t bother. That’s not what I think, though. What I think is that we do need to estimate interactions in observational settings, because unmodeled interactions can pollute the estimates of effects that we care about. See this paper from 2003 for an example.

What, then, to do? I think we should be including lots of interactions, estimating them using strong priors that pool toward zero: this will give more reasonable estimates and will allow the included interactions to improve our estimates of main effects without adding noise.

The point of the recent post on estimated interactions having large uncertainties is that, when uncertainties are large and underlying parameters are small, that’s when Bayesian inference with informative priors can really help.
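Here’s a minimal sketch of what I mean in rstanarm (the data frame and the prior scales are hypothetical; the point is simply that the interaction coefficients get a much tighter prior than the main effects, pooling them toward zero):

library("rstanarm")

## Hypothetical data frame "dat" with outcome y, treatment z, and pre-treatment covariates x1 and x2.  The coefficients enter in the order z, x1, x2, z:x1, z:x2, so the last two scales below are the tight priors on the interactions.

fit <- stan_glm(y ~ z*(x1 + x2), data = dat,
                prior = normal(location = 0, scale = c(2.5, 2.5, 2.5, 0.5, 0.5)),
                prior_intercept = normal(0, 5))
print(fit)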

I asked Beth Tipton for her thoughts, and she wrote:

First, this is a “construct validity” issue, in the language of Shadish, Cook, and Campbell (2002). One method proposed to address this is the multitrait-multimethod matrix.

Second, this strikes me more generally as a measurement issue and one that area has probably addressed to some extent. I know from my experience working in RCTs that this is something psychologists are generally worried about and try to address.

Poes expressed further concerns:

That is the direction we initially approached this, just as a conceptual issue related to the validity of the measure itself. This is a different issue, but I personally feel that we do not validate tools adequately, and even the best feasible validations don’t tell us what we often suggest they do. The example I used, parent child interaction, is really a pretty abstract concept. Most tools take the “I’ll know it when I see it” standpoint. They use what we term global ratings, where an observer is asked to score on an arbitrary scale where a mom is at with their child in an interaction scenario for the overall interaction across a set of prescribed domains. These domains are often not so well defined themselves, and so this leaves a lot of room for measurement error and bias. In graduate school I developed a micro-coding schema for moment level behavior coding. In this case, discrete behaviors were precisely defined and coded and overall PCI scores were based on the aggregate within a domain (or overall). A confirmatory Factor Analysis was used to create weighted aggregate scores, rather than a direct addition of the items by frequency. This was more predictive of long term development of a child academically and social emotionally than were global scoring schema’s, but the tool is also a beast to use and very difficult to gain high inter-rater reliability. It could not be used by practitioners. Even this, however, was only validated against other similar tools or predictive validity. My concern has always been, our methods of validation often rely on face validity (it looks right) which can be very biased, concurrent validity (that it matches other similar tools) which can have the same biases and inaccuracies in their definition and aim as the new tool, and predictive validity, which actually is often weak when measured, and even when not, precise alignment is needed which may narrow the true validation of the tool from how it is conceptually presented. That is, the actual evidence for validation may narrow what you can say the tool measures, but people don’t present it that way, they would still present it as a universal PCI tool (even if you only have evidence to support that it accurately predicts social emotional development). There is also the possibility that this is all nonsense since you cannot measure PCI the concept directly to compare.

I think I am stuck on this idea of representing in mathematical terms the effect of these misguided measurements. Call it the impact of poor construct validity, where the tool does not measure what it is supposed to measure, and that then weakens an intervention evaluations ability to measure true impacts. When an intervention seeks to improve what is effectively a latent construct that can only be measured by a set of manifest variables organized themselves into latent sub-scales. You end up having a kind of Venn diagram. The intervention itself is only addressing the area where the sets in each scale overlap. In this case, I’m suggesting there are two sets of Venn diagrams (since again, the intervention is addressed something latent or abstract, it can’t be measured directly, but there should be a “true” set of manifest variables that get at the latent construct). You have the Venn for the set of manifest variables and respective sub-scales that would make up the reality of the intervention, and then you have your actual measure. Where the overlap of each set overlaps is the only portion of your actual measure that matches the effect of the intervention. You take a fraction of a fraction and depict it as a true and accurate representation of the outcome of your intervention, but in fact, it was always being measured abstractly, and further, in this case, I’m suggesting it gets mis-measured. Since the outcome is composed of sets of subscales which each represent a latent variable, I’m suggesting that you end up effectively only measuring portions of your intervention’s outcome, and it may even be fairly unreliable now since it is possible and even likely that the tool itself was not assessed for validity or reliability at that sub-scale level.

He then continued:

I generally take issue with people taking their scales apart for other reasons. Many of the outcome measures we use have not been validated at the item or sub-scale level. Who says that sub-scale is a valid outcome measure? In this case what I find are a number of problems. One is that the most scales ask multiple questions that are related. If a latent construct is not well represented within an individual by one question, another will capture it. We then expect that on average across respondents that one or two items captures that aspect of the construct for the individual. When you deconstruct a scale into its sub-scales that is maintained, but many researchers go farther and create their own un-tested short forms. In this case, you lose that. You also don’t know if the new short form is valid or not. The other issue is that with the small samples and discrete nature of the responses (such as with small number likert items) I don’t find sub-scales to be normally distributed anymore. While there are statistics that may be ok for such things (and are common in education), that is not true in my field, where t-tests are as advanced as many get.

I guess that makes me curious what you think of this? In my stats courses in graduate school (the stats department at Purdue), I was taught that violations of the assumption of normality is a mixed bag. You shouldn’t do it, but in most cases the standard analysis we use is robust enough against this violation to cause much bias. As such, many people I know simply ignore this. I guess I’m a rule follower because I’ve always checked it and I have found that it’s often a very poor approximation of a normal distribution. The most common I see are 3 item likerts where nobody chooses the middle point, so it’s a U shape. Everyone chose 1 or 3. Is this a concern or am I making a big thing out of nothing?

My quick responses:

1. We typically don’t really care about violations of the normality assumption (unless we’re interested in predictions for individual cases). See here for a quick summary of the most important assumps of linear regression.

Or this regarding assumps more generally.

2. Regarding multiple comparisons: Here, the key mistake is selection. If there are 100 reasonable analyses to do on your data, it’s wrong to select just one comparison and not adjust it for multiple comparisons. But it’s also wrong to select just one comparison and adjust it. The mistake is to select just one comparison, or just a few comparisons. The right thing to do [link fixed] is to keep all 100 comparisons and analyze them with a multilevel model.
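Here’s a sketch of what that looks like with fake data (100 comparisons, each estimated from its own small subsample; all the numbers are made up):

library("rstanarm")

## Fake data: 100 comparisons, mostly small true effects, each measured with noise

set.seed(42)
n_comparisons <- 100
n_per <- 20
true_effect <- rnorm(n_comparisons, 0, 0.2)
fake <- data.frame(comparison = factor(rep(1:n_comparisons, each = n_per)),
                   y = rnorm(n_comparisons*n_per, rep(true_effect, each = n_per), 1))

## Instead of selecting one comparison, adjusted or not, fit all 100 at once and let the multilevel model do the partial pooling

fit <- stan_lmer(y ~ 1 + (1 | comparison), data = fake)
head(coef(fit)$comparison)   # partially pooled estimates for each comparison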

Don’t get fooled by observational correlations

Gabriel Power writes:

Here’s something a little different: clever classrooms, according to which physical characteristics of classrooms cause greater learning. And the effects are large! Moving from the worst to the best design implies a gain of 67% of one year’s worth of learning!

Aside from the dubiously large effect size, it looks like the study is observational yet cause-and-effect are assumed, and there are many degrees of freedom.

I would say this is just the usual dog bites man stats mistakes, but governments might spend many millions on this…

OK, I took a look. They report a three-year study of 153 classrooms in 27 schools. They do multilevel modeling—hey, I like it already. And scatterplots! The scatterplots are kinda weird, though:

It’s a full color (or, I guess I should say, full colour) report, so why not use those colours in the graphs, instead of the confusing symbols? Also it’s not clear what happened to the urban and rural schools in the top graph, or why there are so many divisions on the x- and y-axes, or why they have all those distracting gray lines.

Throughout the report there are graphs, but it doesn’t seem that much statistical thought went into them. The graphs look like they were made by an artist, not in consultation with a data analyst. For example:

Plotting by school ID makes no sense, the symbols contain zero information, the graph is overwhelmed by all these horizontal gray lines, and, most importantly, I have no idea what’s being plotted! The title says they’re plotting the “impacts of classrooms” and the vertical axis says they’re plotting “input.” Inputs and impacts, those are much different!

I’m not trying to be a graphics scold here. Do whatever graphs you want! My point is that graphics—and statistical analysis more generally—are an opportunity, a chance to learn. Making a graph that’s pretty but conveys no useful information—that’s a waste of an opportunity.

I think there’s a big problem, which is that graphs are considered as a sort of ornament to the “real analysis” which is the regression. But that’s not right. The graphs are part of the real analysis.

In particular, nowhere are there graphs of the raw data, or of classroom-level averages. Too bad; that would help a lot.
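Even something as simple as this would be a start (a sketch with hypothetical variable names, since I don’t have their data):

## Hypothetical data frame "classrooms" with one row per classroom: a school id, a classroom design score, and the average learning gain

plot(classrooms$design_score, classrooms$avg_gain,
     xlab="Classroom design score", ylab="Average learning gain",
     pch=20, col=classrooms$school)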

Also, I can’t figure out what they did in their analysis. They have lots of graphs like this which make no sense:

“Colour” is not a number so how can it have a numerical correlation with overall progress? And what exactly is “overall progress”? And why are correlations drawn as bars: they can be negative, no? Indeed, it’s suspicious that all of the correlations are between 0 and 0.18 and not one of them is negative.

The big picture

Stepping back, here’s what we see. These researchers care about schools and they know a lot about schools. I’m inclined to trust their general judgment about what’s good for schools, whether or not they do a quantitative study. They have done a quantitative study and they’ve used it to inform their judgment. They don’t know a lot of statistics, but that’s fine: statistics is not their job, they’re doing their best here.

The problem is what these researchers are thinking that statistics, and quantitative methods, can do for them in this setting. Realistically, doing things like improving the lighting in classrooms will have small effects. That doesn’t mean you shouldn’t do it, it just means that it’s unrealistic to expect large consistent effects, and it’s a mistake to estimate effects from observational correlations. (And effect size, not just direction of effect, can make a difference when it comes to setting priorities and allocating scarce resources.)

Columbia Data Science Institute art contest

This is a great idea! Unfortunately, only students at Columbia can submit. I encourage other institutions to do such contests too. We did something similar at Columbia, maybe 10 or 15 years ago? It went well, we just didn’t have the energy to do it again every year, as we’d initially planned. So I’m very happy to see the Data Science Institute start it up again.

High-profile statistical errors occur in the physical sciences too, it’s not just a problem in social science.

In an email with subject line, “Article full of forking paths,” John Williams writes:

I thought you might be interested in this article by John Sabo et al., which was the cover article for the Dec. 8 issue of Science. The article is dumb in various ways, some of which are described in the technical comment on it that I have submitted, but it also exhibits multiple forking paths, a lack of theory, and abundant jargon. It is also very carelessly written and reviewed. For example, the study analyzed the Mekong River stage (level of the water with respect to a reference point), but refers more often to the discharge (volume per time past a reference point: the relationship between the two is non-linear). It is pretty amazing that something like this got published.

I shared this with a colleague who is knowledgeable about this general area of research, and my colleague wrote that he agreed with Williams’s criticisms.

Williams followed up with a new document listing forking paths in the Sabo analysis.

There’s more to the story but I’ll stop here. I just wanted to share all this because it’s good to be reminded that high-profile statistical errors occur in the physical sciences too, it’s not just a problem in social science.

Echo Chamber Incites Online Mob to Attack Math Profs

The story starts as follows:

There’s evidence for greater variability in the distribution of men, compared to women, in various domains. Two math professors, Theodore Hill and Sergei Tabachnikov, wrote an article exploring a mathematical model for the evolution of this difference in variation, and sent the article to the Mathematical Intelligencer, a magazine that welcomes “expository articles on all kinds of mathematics, and articles that portray the diversity of mathematical communities and mathematical thought, emergent mathematical communities around the world, new interdisciplinary trends, and relations between mathematics and other areas of culture.” (Tabachnikov is on the editorial board of this magazine, a fact I learned from this comment on our blog.) After some revision, the article was accepted for publication. But then there was pushback from activists within the academic community, and a few months later the editor of the magazine informed the authors that the article would not be published; also one of the authors, Tabachnikov, removed himself from the paper out of fear of reprisals from his university. (There also seems to have been a third author who backed out of the paper, but there’s less detail on that part.) There was an online furor, and a month later, the remaining author, Hill, was contacted by an editor of the online New York Journal of Mathematics and invited to submit the article there. (The NYJM, unlike the Intelligencer, seems to focus entirely or nearly entirely on pure math, at least that’s what I see here.) A month later, after a referee report and some revisions, the paper appeared online at NYJM. But then, three days later, the article was removed from that journal’s website and replaced by a completely different article on an unrelated topic. Currently the paper is available on Arxiv.

All the above happened in 2017. During the following months, Hill tried to find out what happened. He ended up with the impression that there had been a politically-motivated campaign against his paper, causing it to be yanked from the Intelligencer and the NYJM as the result of an intimidation campaign by academic activists. Then last week he wrote up the whole story as a blog entry or online article at the site Quillette. I first heard about the story when a couple people pointed me to Hill’s post.

The math paper in question

I was curious so I followed the link to the Arxiv paper, “An Evolutionary Theory for the Variability Hypothesis,” by Theodore Hill, dated 24 Aug 2018.

Hill’s article did not strike me as mathematically deep, nor did it seem politically objectionable in any way. The math was accessible and related to an interesting general issue, so I could see how it would be of interest to a general-interest mathematics magazine such as the Intelligencer. In some ways it reminded me of my paper, “Forming voting blocs and coalitions as a prisoner’s dilemma: a possible theoretical explanation for political instability,” which I published in an econ journal back in 2003: My article, like Hill’s, contained some mathematical results that were inspired by a real phenomenon of interest, and although the connection between the math and anything in the real world was tenuous, I (and, correspondingly, Hill) thought the mathematical results were interesting enough, and the motivating applied problem compelling enough, that our efforts were worth sharing with the world. I later followed up that paper with work of more applied relevance, and I’m supportive of the general idea of working out mathematical models, as long as we recognize their limitations, as Hill does in his paper (“the contribution here is also merely a general theory intended to open the discussion to further mathematical modeling and analysis”). So I could see why the Intelligencer might want to publish the paper. I could also see why they might not want to publish it, as the mathematical argument in the paper is pretty simple, and the paper is also loaded down with what seem to me to be irrelevant claims regarding biology and society. I didn’t find these claims political or offensive; they just seemed beside the point in a math paper. I think it would be enough to just raise the issue (more variability among men than women in various traits), give some references, and move on from there, rather than attempt a review of the biology literature on the topic. Anyway, the paper seemed innocuous to me: not so exciting, but with some mathematical content; on an interesting topic even if with some difficulties linking the math to the biology; reasonable enough to publish in the Intelligencer.

Trying to make sense of the story

With this in mind, there were a few aspects of Hill’s blog entry that didn’t completely make sense to me.

First, the research article did not seem politically objectionable to me. I could see how people with strong views on the topic of sex differences would find things to criticize in his paper, and he could well be missing some important points of the biology, and if you really tried to apply his model to data I don’t think it would work at all, so, sure, the paper’s not perfect. But as a math paper that touches on an interesting topic, it is what it is, and I was surprised there’d be a campaign to suppress it.

Here’s the version time-stamped 19 Mar 2017, which mentions this Summers quote: “It does appear that on many, many different human attributes – height, weight, propensity for criminality, overall IQ, mathematical ability, scientific ability – there is relatively clear evidence that whatever the difference in means – which can be debated – there is a difference in the standard deviation, and variability of a male and a female population.” [When writing the first version of the post, I hadn’t noticed the Summers quote in the Hill and Tabachnikov article, so I incorrectly wrote that we’re not actually getting to see the version of the paper that got all the controversy. — AG.] The other detail that I couldn’t quite follow was why the paper would’ve appeared in NYJM, which seems to only publish stuff like “A dyadic Gehring inequality in spaces of homogeneous type and applications” and “Off-diagonal sharp two-weight estimates for sparse operators.”

The thing I couldn’t quite figure out was why this paper bothered people so much. But, upon reflection, I think I have an idea. I’m a political scientist, and one thing that annoys the hell out of me is when people apply cute but wrong math arguments to real political questions. For example, there was that claim that the probability of casting a decisive vote is on the order of 10 to the −2,650th power, or misinformed but authoritative-sounding claims about the voting patterns of rich and poor. So I could see how someone who’s really studied sex differences could find Hill’s model to be annoying enough that they’d not want it spread around the world with the imprimatur of a serious journal. Sex differences isn’t my area of research so I was just considering Hill’s paper as expressing a simple and somewhat interesting mathematical model. If I saw the same kind of model applied to voting (for example, using the binomial probability model to compute the probability of a decisive vote), I’d be screaming. I wouldn’t want a journal to publish such a model, and if it were published, I’d want to run some article alongside explaining why the model doesn’t make sense. Not that the math is wrong but that it doesn’t apply here. I’m not prepared to make that judgment one way or another for the Hill and Tabachnikov paper, but my guess is that that’s where the critics are coming from. It’s not about suppressing a politically offensive idea; it would be more, from their perspective, about not spreading a mistaken idea. And it probably didn’t help that early versions of the paper said, “there has been no clear or compelling explanation offered for why there might be gender differences in variability,” which seems like a pretty strong claim regarding the biology literature. See this post and this from mathematician Timothy Gowers.

The big thing, though, was the paper getting accepted and then yanked—twice. Once from the Intelligencer and once from the NYJM. It’s hard to imagine a good reason for that.

The other thing that I noticed in Hill’s blog post was the mention of Lee Wilkinson, an influential statistician (among other things, the author of The Grammar of Graphics, which motivated R’s ggplot2 package) and a friend of mine, who was identified as the father of mathematician Amie Wilkinson, who in turn, according to Hill, “had successfully suppressed my variability hypothesis research and trampled on the principles of academic liberty.” Hill also wrote that Amie Wilkinson’s husband, Benson Farb, another math professor, “had written a furious email” to the NYJM editor demanding that Hill’s paper “be deleted at once.”

It was hard for me to put all the pieces together. I could see how the Intelligencer would want to publish the paper and I could see how others would think the paper not appropriate to publish (for scientific, not political, reasons). Both those views made sense to me, as they represented slightly different perspectives on the value of a work of applied mathematics. But I couldn’t see why there’d be a movement of the radical academic Left to suppress the paper—for one thing, there already have been lots and lots of papers published in reputable journals on sex differences in general and the greater-male-variability hypothesis in particular, so why would the Left focus such effort on a little math paper; and, for another, the article did not seem politically offensive.

The tempest

Hill’s post appeared on 7 Sep. In the following days, the story was featured in a blog post at Reason magazine by law professor David Bernstein (“A Mathematics Paper Two Math Journals Were Mau-Maued into Suppressing”), tweets by psychology professors Jordan Peterson (“Here’s the offending paper. Please read and distribute as widely as possible”) and Steven Pinker (“Again the academic left loses its mind: Ties equality to sameness, erodes credibility of academia, & vindicates right-wing paranoia”), and various other places on the web, including lots of material too horrible to quote (you can google and look for it yourself if you’re interested).

Many of these comment and twitter threads, including the one attached to Hill’s post on the Quillette site, featured personal attacks on Amie Wilkinson and Benson Farb. Some of the personal attacks are just horrible. I won’t quote them here—you can find these remarks yourself if you want—because they’re not directly relevant to what happened to Hill. After all, if Wilkinson and Farb really did behave badly and try to suppress an already-published paper (which is much different from the very routine actions of recommending that a submitted paper not be accepted for publication, or recommending that a published or in-press paper be accompanied by a rebuttal), then such suppressive actions would not be justified after the fact by others’ bad behavior toward them. The attacks on Wilkinson and Farb bothered me because of their virulence, and they give me some sense of the kind of people who comment at the site where Hill posted, that’s all. In the context of some quarters of the internet, the criticisms that Hill made are like waving the proverbial red flag in front of the bull.

On 11 Sep—4 days after Hill posted his story—Amie Wilkinson and Benson Farb posted their sides of the story. Wilkinson wrote: “I first saw the publicly-available paper of Hill and Tabachnikov on 9/6/17, listed to appear in The Mathematical Intelligencer. . . . I sent an email, on 9/7/17, to the Editor-in-Chief . . . In it, I criticized the scientific merits of the paper and the decision to accept it for publication, but I never made the suggestion that the decision to publish it be reversed. Instead, I suggested that the journal publish a response rebuttal article by experts in the field to accompany the article. One day later, on 9/8/17, the editor wrote to me that she had decided not to publish the paper. . . . I had no involvement in any editorial decisions concerning Hill’s revised version of this paper in The New York Journal of Mathematics.”

Farb wrote: “This statement is meant to set the record straight on the unfounded accusations of Ted Hill regarding his submission to the New York Journal of Mathematics (NYJM), where I was one of 24 editors serving under an editor-in-chief. Hill’s paper raised several red flags to me and other editors, giving concern not just about the quality of the paper, but also the question of whether it underwent the usual rigorous review process. . . . At the request of several editors, the editor-in-chief pulled the paper temporarily on 11/9/17 so that the entire editorial board could discuss these concerns. . . . The editor who handled the paper was asked to share these reports with the entire board. . . . The reports themselves were not from experts on the topic of the paper. They did not address our concerns about the substantive merit of the paper. . . . Further, the evidence that the paper had undergone rigorous scrutiny before being accepted was scant. In light of this, the board voted (by a 2-to-1 ratio) to rescind the paper.”

Working out what really happened

OK, now we can put all the pieces together. This is not a “he-said, she-said” or “Rashomon” situation in which different people present us with incompatible stories that we have no way to reconcile. One thing that’s interesting about these accounts is how consistent they are with each other.

Look back at Hill’s post and distinguish what he knows directly—what happened to him—and where he is speculating. What happened for sure is that his paper was accepted, then yanked, from two separate journals. The rest is second-hand. The academic Left, the attacks on the NYJM, the “husband-wife team who had successfully suppressed my variability hypothesis research and trampled on the principles of academic liberty,” the claim that his paper was judged based on “desirability or political utility”—Hill has no direct evidence for any of that.

Indeed, the facts of Hill’s story are consistent with the facts of Wilkinson’s and Farb’s story. Here’s what happened. Or, at least, the following is consistent with what was recounted by Hill, Wilkinson, and Farb:
– Hill’s paper was accepted by the Intelligencer and posted online. Wilkinson felt the paper was flawed and suggested the paper be published with a rebuttal. Instead, the journal editor un-published the paper.
– Hill’s paper was accepted by the NYJM in an unusual fashion by one of the journal’s 24 editors. After a review of the editorial process, the full editorial board un-published the paper.

Hill is annoyed, and justifiably so. For a journal editor to accept his paper and then reject it: that’s not cool. It’s happened to me a couple of times, and it pisses me off: you put in all the work of writing an article and going through the review process, the paper finally appears or is scheduled to appear, and then, Bam! the journal pulls the rug out from under you.

The problem is that Hill is focusing his annoyance on the wrong people. And there’s a clue in Hill’s own post, where he writes, “My quarrel, the vice-provost [of the University of Chicago] concluded, was with the editors-in-chief who had spiked my papers . . .” That’s right! The problem is with the editors who broke the rules, not with the reviewers who raised concerns.

Can the rabble-rousers call off the online mob?

To me, the most unfortunate part of the story is the amplification of Hill’s post throughout Twitter, Quillette, 4chan, etc., abetted by thought leaders on Twitter, leading to noxious hatred spewed at Amie Wilkinson. I don’t blame Jordan Peterson, Steven Pinker, or the editors of Quillette for the behavior of Twitter commenters, or even for the behavior of commenters at Quillette. But now that more of the story is out, it’s time for all these people to explain what happened to their followers, and to apologize.

Theodore Hill and Amie Wilkinson clearly differ on the value of the Hill and Tabachnikov paper, both as mathematics and regarding its relevance to biology and the study of human institutions. For example, Wilkinson wrote, “Invoking purely mathematical arguments to explain scientific phenomena without serious engagement with science and data is an offense against both mathematics and science.”

But they agree on the value of open communication.

Here’s Wilkinson: “I believe that discussion of scientific merits of research should never be stifled. This is consistent with my original suggestion to bring in outside experts to rebut the Hill-Tabachnikov paper.” The fact that a journal editor pulled Hill’s paper, after Wilkinson recommended otherwise, reflects poorly on the editor, not on the other people involved in this story. But that’s not the story Hill, or the online mob, wanted to hear.

N=1 survey tells me Cynthia Nixon will lose by a lot (no joke)

Yes, you can learn a lot from N=1, as long as you have some auxiliary information.

The other day I was talking with a friend who’s planning to vote for Andrew Cuomo in the primary. What about Cynthia Nixon? My friend wasn’t even considering voting for her. Now, my friend is, I think, in the left half of registered Democrats in the state. Not far left, but I’d guess left of the median. If Nixon didn’t even have a chance with this voter, there’s no way she can come close in the primary election. She’s gonna get slaughtered.

A survey with N=1! And not even a random sample. How could we possibly learn anything useful from that? We have a few things in our favor:

– Auxiliary information on the survey respondent. We have some sense of his left-right ideology, relative to the general primary electorate.

– An informative measure of the respondent’s attitude. He didn’t just answer a yes/no question about his vote intention; he told me that he wasn’t even considering voting for her.

– A model of opinions and voting: Uniform partisan swing. We assume that, from election to election, voters move only a small random amount on the left-right scale, relative to the other voters.

– Assumption of random sampling, conditional on auxiliary information: My friend is not a random sample of New York Democrats, but I’m implicitly considering him as representative of New York Democrats at his particular point in left-right ideology.

Substantive information + informative data + model + assumption. Put these together and you can learn a lot.
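To make the logic concrete, here is a minimal R sketch. The friend’s placement at the 40th percentile of the primary electorate and the 10% cap on his support are my illustrative assumptions, not data; the only structure is the monotonicity implied by uniform partisan swing.

```r
# Minimal sketch of the N=1 bound (illustrative numbers, not an actual calculation).
# Assumptions: the respondent sits near the 40th percentile of the primary electorate
# on a left-right scale; support for the more left-wing challenger is non-increasing
# in ideology; "wasn't even considering her" caps his support probability at 0.10.

friend_pct  <- 0.40   # assumed ideological position of the respondent (left of median)
p_at_friend <- 0.10   # generous cap on his probability of voting for Nixon

# Voters to his left support her at a rate of at most 1; voters to his right at a
# rate no higher than p_at_friend. That gives an upper bound on her vote share:
upper_bound <- friend_pct * 1 + (1 - friend_pct) * p_at_friend
upper_bound   # 0.46: even this crude bound keeps her under 50 percent
```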

I could be wrong, of course, and I haven’t tried to attach an uncertainty to my prediction. But this is what I’m going with, from my N=1 survey.

Discussion of effects of growth mindset: Let’s not demand unrealistic effect sizes.

Shreeharsh Kelkar writes:

As a regular reader of your blog, I wanted to ask you if you had taken a look at the recent debate about growth mindset [see earlier discussions here and here] that happened on theconversation.com.

Here’s the first salvo by Brooke McNamara, and then the response by Carol Dweck herself. The debate seems to come down to how to interpret “effect sizes.” It’s a little bit out of my zone of expertise (though, as a historian of science, I find the growth of growth mindset quite interesting) but I was curious what you thought.

I took a look, and I found both sides of the exchange to be interesting. It’s so refreshing to see a public discussion where there is robust disagreement but without defensiveness.

Here’s the key bit, from Dweck:

The effect size that Macnamara reports for growth mindset interventions is .19 for students at risk for low achievement – that is, for the students most in need of an academic boost. When you include students who are not at risk or are already high achievers, the effect size is .08 overall. [approximately 0.1 on a scale of grade-point averages which have a maximum of 4.0]

An effect of one-tenth of a grade point—large or small? For any given student, it’s small. Or maybe it’s an effect of 1 grade point for 10% of the students and no effect for the other 90%. We can’t really know from this sort of study. The point is, yes, it’s a small effect for any individual student, and of course it is: it’s hard to get good grades, and there’s no magic way to get there!
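Here’s a toy simulation of that point; the numbers are illustrative, not estimates from any study. Two very different distributions of individual effects give the same average, which is all a study like this can estimate.

```r
# Two data-generating stories with the same average effect of 0.1 grade points
# (illustrative numbers only).
set.seed(123)
n <- 1e5

# Story 1: every student gets a small boost of 0.1 grade points
effect_uniform <- rep(0.1, n)

# Story 2: 10% of students gain a full grade point, the rest gain nothing
effect_concentrated <- ifelse(runif(n) < 0.10, 1.0, 0)

mean(effect_uniform)        # 0.1
mean(effect_concentrated)   # approximately 0.1
# An average-treatment-effect estimate cannot distinguish these two stories.
```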

This is all a big step forward from the hype we’d previously been seeing, such as this claim of a 31 percentage point improvement from a small intervention. It seems that we can all agree that any average effects will be small. And that’s fine. Small effects can still be useful, and we shouldn’t put the burden on developers of new methods to demonstrate unrealistically huge effects. That’s the hype cycle, the Armstrong principle, which can push honest researchers toward exaggeration just to compete in the world of tabloid journals and media.

Here’s an example of how we need to be careful. In the above-linked comment, Macnamara writes:

Our findings suggest that at least some of the claims about growth mindsets . . . are not warranted. In fact, in more than two-thirds of the studies the effects were not statistically significantly different from zero, meaning most of the time, the interventions were ineffective.

I respect the general point that effects are not as large and consistent as have been claimed—but it’s incorrect to say that, just because an estimate was not statistically significantly different from zero, the intervention was ineffective.

Similarly, we have to watch out for statements like this, from Macnamara:

In our meta-analysis, we found a very small effect of growth mindset interventions on academic achievement – specifically, a standardized mean difference of 0.08. This is roughly equivalent to less than one-tenth of one grade point. To put the size of this effect in perspective, consider that the average effect size for a psychological educational intervention (such as teaching struggling readers how to identify main ideas and to create graphic organizers that reflect relations within a passage) is 0.57.

Do we really believe that “0.57”? Maybe not. Indeed, Dweck gives some reasons to suspect this number is inflated. From talking with people at Educational Testing Service years ago, I gathered the general impression that, to first approximation, the effect of an educational intervention is proportional to the amount of time the students spend on it. Given that, according to Dweck, growth mindset interventions only last an hour, I’m actually skeptical of the claim of 0.08. It’s a good thing to tamp down extreme claims made for growth mindset; maybe not such a good thing to compare it to possible overestimates of the effects of other interventions.

Against Arianism 2: Arianism Grande

“There’s the part you’ve braced yourself against, and then there’s the other part” – The Mountain Goats

My favourite genre of movie is Nicole Kidman in a questionable wig. (Part of the sub-genre founded by Sarah Paulson, who is the patron saint of obvious wigs.) And last night I was in the same room* as Nicole Kidman to watch a movie where I swear she’s wearing the same wig Dolly did in Steel Magnolias. All of which is to say that Boy Erased is absolutely brilliant and you should all rush out to see it when it’s released. (There are many good things about it that aren’t Nicole Kidman’s wig.) You will probably cry.

When I eventually got home, I couldn’t stop thinking of one small moment towards the end of the film where one character makes a peace overture to another character that is rejected as being insufficient after all that has happened. And it very quietly breaks the first character’s heart. (Tragically this is followed by the only bit of the movie that felt over-written and over-acted, but you can’t have it all.)

So in the spirit of blowing straight past nuance and into the point that I want to make, here is another post about priors. The point that I want to make here is that saying that there is no prior information is often incorrect. This is especially the case when you’re dealing with things on the log-scale.

Collapsing Stars

Last week, Jonah and I successfully defended our visualization paper against a horde of angry** men*** in Cardiff. As we were preparing our slides, I realized that we’d missed a great opportunity to properly communicate how bad the default priors that we used were.

This figure (from our paper) plots one realization of fake data generated from our prior against the true data. As you can see, the fake data simulated from the prior model is two orders of magnitude too large.

(While plotting against actual data is a convenient way of visualizing just how bad these priors are, we could see the same information using a histogram.)

To understand just how bad these priors are, let’s pull in some physical comparisons. PM2.5 (airborne particulate matter that is less than 2.5 microns in diameter) is measured in micrograms per cubic metre.  So let’s work out the density of some real-life things and see where they’d fall on the y-axis of that plot.

(And a quick note for people who are worried about taking the logarithm of a dimensioned quantity: imagine I’ve divided through by a baseline of 1 microgram per cubic metre. To do this properly, I probably should divide through by a counterfactual, which would more reasonably be around 5 micrograms per cubic metre, but let’s keep this simple.)

Some quick googling gives the following factlets:

  • The log of the density of concrete would be around 28.
  • The log of the density of a neutron star would be around 60.

Both of these numbers are too small to appear on the y-axis of the above graph. That means that every point in that prior simulation is orders of magnitude denser than the densest thing in the universe.
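Here’s the back-of-the-envelope version of those factlets in R, using ballpark densities that I’m assuming (roughly 2,400 kg/m³ for concrete and 10¹⁷ kg/m³ for a neutron star), not figures from the paper.

```r
# Rough check of the log-density factlets (ballpark densities, converted to
# micrograms per cubic metre before taking logs).
ug_per_kg <- 1e9

concrete_kg_m3     <- 2400    # ordinary concrete, roughly
neutron_star_kg_m3 <- 1e17    # order of magnitude for a neutron star

log(concrete_kg_m3 * ug_per_kg)       # about 28.5
log(neutron_star_kg_m3 * ug_per_kg)   # about 60
```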

It makes me wonder why I would want to have any prior mass at all on these types of values. Sometimes, such as in the example in the paper, it doesn’t really matter. The data informs the parameters sufficiently well that the silly priors don’t have much of an effect on the inference.

But this is not always the case. In the real air pollution model, of which our model was the simplest possible version, it’s possible that some of the parameters will be hard to learn from data. In this case, these unphysical priors will lead to unphysical inference. (In particular, the upper end of the posterior predictive interval will be unreasonable.)

The nice thing here is that we don’t really need to know anything about air pollution to make weakly informative priors that are tailored for this data. All we need to know is that if you try to breathe in concrete, you will die. This means that we just need to set priors that don’t generate values bigger than 28. I’ve talked about this idea of containment priors in these pages before, but a concrete example hopefully makes it more understandable. (That pun is unforgivable and I love it!)
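Here’s a minimal sketch of that kind of containment check. It’s a toy intercept-and-slope model with priors I’ve made up to stand in for “vague” and “weakly informative”; it is not the actual model or the priors from our paper.

```r
# Toy prior predictive check: what fraction of prior-simulated datasets contain
# "air" denser than concrete (log-PM2.5 above about 28)?
set.seed(456)
n_sims <- 1000
x <- rnorm(100)   # a standardized covariate (e.g., log of a satellite estimate)

sim_max_log_pm25 <- function(sd_intercept, sd_slope, sd_noise) {
  alpha <- rnorm(n_sims, 0, sd_intercept)
  beta  <- rnorm(n_sims, 0, sd_slope)
  sigma <- abs(rnorm(n_sims, 0, sd_noise))
  # one fake dataset per prior draw; keep the largest simulated log-PM2.5 value
  sapply(1:n_sims, function(s) max(alpha[s] + beta[s] * x + rnorm(length(x), 0, sigma[s])))
}

log_concrete <- log(2.4e12)   # about 28, the "denser than concrete" line

vague <- sim_max_log_pm25(100, 100, 100)   # accidentally-vague default-style priors
weak  <- sim_max_log_pm25(1, 1, 1)         # weakly informative priors

mean(vague > log_concrete)   # most prior datasets contain air denser than concrete
mean(weak  > log_concrete)   # essentially none do
```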

Update: The fabulous Aki made some very nice plots that show a histogram of the log-PM2.5 values for 1000 datasets simulated from the model with these bad, vague priors. Being Finnish, he also included Pallastunturi Fells, which is a place in Finland with extremely clean air. The graph speaks for itself.

One of the things that we should notice here is that the ambient PM2.5 concentration will never just be zero (it’ll always be >1), so the left tail is also unrealistic.

A similar plot using the weakly informative priors suggested in the paper looks much better.


The Recognition Scene

The nice thing about this example is that it is readily generalizable to a lot of cases where logarithms are used.

In the above case, we modelled log-PM2.5 for two reasons: because it was more normal, and, more importantly, because we care much more about the difference between 5 μg/m³ and 10 μg/m³ than we do about the difference between 5 μg/m³ and 6 μg/m³. (Or, to put it differently, we care more about multiplicative change than additive change.) This is quite common when you’re doing public health modelling.

Another place you see logarithms is when you’re modelling count data. Once again, there are natural limits to how big counts can be.

Firstly, it is always good practice to model relative-risk rather than risk. That is to say that if we observe y_i disease cases in a population where we would ordinarily expect to see E_i disease cases, then we should model the counts as something like

y_i\sim \text{Poisson}(E_i\lambda)

rather than

y_i\sim \text{Poisson}(\tilde{\lambda}).

The reason for doing this is that if the expected number of counts differs strongly between observations, the relative risk \lambda can be estimated more stably than the actual risk \tilde{\lambda}.  This is a manifestation of the good statistical practice of making sure that unknown parameters are on the unit scale.
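As a minimal illustration of the two parameterizations (with simulated data, not any real disease-mapping example), an intercept-only Poisson regression with log E_i as an offset estimates the relative risk directly:

```r
# Relative risk vs. actual risk, with simulated data. The expected counts E_i vary
# over several orders of magnitude; the relative risk lambda is a unit-scale parameter.
set.seed(789)
n <- 200
E <- exp(runif(n, 0, 8))    # expected counts ranging from about 1 to about 3000
lambda_true <- 1.3          # a 30% elevated relative risk
y <- rpois(n, E * lambda_true)

# y_i ~ Poisson(E_i * lambda): fit with log(E_i) as an offset
fit_rr <- glm(y ~ 1, family = poisson, offset = log(E))
exp(coef(fit_rr))           # estimate of lambda, close to 1.3

# y_i ~ Poisson(lambda_tilde): ignores E_i, so the target is not on unit scale
fit_raw <- glm(y ~ 1, family = poisson)
exp(coef(fit_raw))          # just the average count, dominated by the large-E_i areas
```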

This is relevant for setting priors on model parameters.  If we follow usual practice and model \log\lambda\sim N(\mu,\sigma), then being on unit scale puts some constraints on how big \sigma can be.

But that assumes that our transformation of our counts to unit scale was reasonable. We can do more than that!

Because we do inference using  computers, we are limited to numbers that computers can actually store. In particular, we know that if \log\lambda>45 then the expected number of counts is bigger than any integer that can be represented on an ordinary (64bit) computer.

This strongly suggests that if the covariates used in \mu are centred and the intercept is zero (which should be approximately true if the expected counts are correctly calibrated!), then we should avoid priors on \sigma with much mass on values larger than 15.

Once again, these are the most extreme priors for a generic Poisson model that don’t put too much weight on completely unreasonable values. Some problem-specific thought can usually make these priors even tighter. For instance, we know there are fewer than seven and a half billion people on earth, so anything that is counting people can safely use much tighter priors (in particular, we don’t want too much mass on sigma bigger than 7).
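The arithmetic behind those bounds is quick to check:

```r
# Where the numbers above come from.
log(2^63 - 1)   # about 43.7: log of the largest signed 64-bit integer, so
                # log(lambda) > 45 is beyond ordinary integer arithmetic
log(7.5e9)      # about 22.7: log of the world population

# Dividing by roughly 3 (so that a 3-sigma excursion stays below the ceiling)
# gives the scales mentioned above: sigma of about 15 in general, about 7 for
# counts of people.
```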

All of these numbers are assuming that E_i\approx 1 and need to be adjusted when the expected counts are very large or very small.

The exact same arguments can be used to put priors on the hyper-parameters in models of count data (such as models that use a negative-binomial likelihood). Similarly, priors for models of positive, continuous data can be set using the sorts of arguments used in the previous section.

Song For an Old Friend

This really is just an extension of the type of thing that Andrew has been saying for years about sensible priors for the logit-mean in multilevel logistic regression models.  In that case, you want the priors for the logit-mean to be contained within [-5,5].
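For reference, here’s what that containment means on the probability scale:

```r
# The [-5, 5] interval on the logit scale, mapped back to probabilities.
plogis(c(-5, 5))   # about 0.007 and 0.993
# A logit-mean inside [-5, 5] already covers essentially any plausible baseline
# probability, so prior mass far outside that range buys you nothing.
```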

The difference between this case and the types of models considered in this post is that the upper bound requires some more information than just the structure of the likelihood. In the first case, we needed information about the real world (concrete, neutron stars, etc.), while for count data we needed to know about the limits of computers.

But all of this information is easily accessible. All you have to do is remember that you’re probably not as ignorant as you think you are.

Just to conclude, I want to remind everyone once again that there is one true thing that we must always keep in mind when specifying priors. When you specify a prior, just like when you cut a wig, you need to be careful. That hair isn’t going to grow back.

Deserters

* It was a very large room. I had to zoom all the way in to get a photo of her. It did not come out well.

** They weren’t.

*** They were. For these sort of sessions, the RSS couldn’t really control who presented the papers (all men) but they could control the chair (a man) and the invited discussants (both men). But I’m sure they asked women and they all said no. </sarcasm> Later, two women contributed very good discussions from the floor, so it wasn’t a complete sausage-fest.

Narcolepsy Could Be ‘Sleeper Effect’ in Trump and Brexit Campaigns

Kevin Lewis sent along this example of what in social science is called the “ecological fallacy.” Below is a press release that I’ve changed in only a few places:

UNDER EMBARGO UNTIL MARCH 8, 2018 AT 10 AM EST

Media Contact:
Public and Media Relations Manager
Society for Personality and Social Psychology
press@spsp.org

Narcolepsy Could Be ‘Sleeper Effect’ in Trump and Brexit Campaigns

Regions where voters have more narcoleptic personality traits were more likely to vote for Donald Trump in the United States or for the Brexit campaign in the United Kingdom, revealing a new trend that could help explain the rise of fearmongering populist political campaigns across the world, according to new research published in the journal Social Psychological and Personality Science.

Researchers analyzed personality traits from online surveys of more than 3 million people in the United States and more than 417,000 people in the United Kingdom. Election data was compiled from public sources.

“Our study reveals how narcolepsy or sleep hardship is shaping the global political landscape,” said lead study author Martin Sandman, PhD, a psychologist and associate professor at Sommeil University of Technology in Australia. “One could also call this ‘irrational’ voting behavior because the surprising success of Trump and Brexit weren’t predicted by models that relied on a rational understanding of voters.”

Narcolepsy hasn’t previously been associated with voting behavior, suggesting that it could have been a “sleeper effect” with the potential to have a profound impact on the success of populist political campaigns across the globe, Sandman said.

The Trump and Brexit campaigns both promoted themes of fear and lost pride, which are related to narcoleptic personality traits that include persistent feelings of exhaustion, insomnia, collapse, or restfulness. Regions in the United States with greater support for Trump were very similar to areas in the United Kingdom that supported Brexit, including a higher percentage of white people and lower levels of college education, earnings and liberal attitudes. Former industrial areas that are now in economic decline also were more likely to support Trump or Brexit.

The Brexit vote in June 2016 by the United Kingdom to leave the European Union succeeded by a very narrow margin, with 51.9 percent of voters in favor. Trump’s presidential victory in 2016 also shocked many people, with him winning 30 states and the electoral vote tally even though 2.8 million more Americans voted for Hillary Clinton.

Trump’s crucial voter gains above the performance of 2012 Republican presidential candidate Mitt Romney occurred largely in areas with high levels of narcoleptic traits, including battleground states such as Pennsylvania, Wisconsin, and Ohio, which shifted from Democratic in 2012 to Republican in 2016. Trump’s populist campaign also was especially successful in former industrial centers that are now in economic decline, including the Rust Belt region across the Midwest and Great Lakes.

The researchers examined regions, not individuals [emphasis added], and were studying larger trends relating to psychological traits, not specific diagnoses of mental illness for any voters. The study also excluded Northern Ireland from the Brexit analysis because of the lack of available data.

The fears and worries of voters with narcoleptic personality traits should be taken seriously, and blankets and pillows should be provided during political campaigns to allay those fears, Sandman said. Education also could be a buffer against fearmongering populist political campaigns because regions with higher rates of college graduates had much lower levels of narcolepsy, he said.

I’ve changed a few words above, but the basic ideas shine through clearly.

In all seriousness, I think it’s (a) irresponsible for the Society for Personality and Social Psychology to promote this sort of hype, and (b) ludicrous for it to be “UNDER EMBARGO” as if it’s some big breakthrough.

If you want to make some maps of aggregate patterns of survey responses, go for it. Do some scatterplots and regressions too, why not? But the political interpretations, all based on those aggregate correlations—they’re ridiculous. The Society for Personality and Social Psychology can and should do better.
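To see how an individual-level interpretation of an aggregate correlation can go wrong, here’s a small simulation in R. The data are entirely made up and are not a reanalysis of the study above; the point is just that the within-region relationship between a trait and a vote can be negative while the region-level correlation is strongly positive.

```r
# Toy demonstration of the ecological fallacy with simulated data.
set.seed(2018)
n_regions <- 50
n_per     <- 500

region_shift <- rnorm(n_regions, 0, 2)   # regional factor that raises both trait and vote
dat <- do.call(rbind, lapply(1:n_regions, function(r) {
  trait <- region_shift[r] + rnorm(n_per)
  # within a region, the trait is slightly *negatively* related to the vote
  vote  <- rbinom(n_per, 1, plogis(region_shift[r] - 0.3 * trait))
  data.frame(region = r, trait = trait, vote = vote)
}))

# Individual-level association within regions: negative
coef(glm(vote ~ trait + factor(region), family = binomial, data = dat))["trait"]

# Region-level (aggregate) association: strongly positive
agg <- aggregate(cbind(trait, vote) ~ region, data = dat, FUN = mean)
cor(agg$trait, agg$vote)
```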

P.S. The above link was to a press release which no longer works, and I can’t find it on the Internet Archive either. So instead here’s a link to the original paper. It was the press release that I mildly altered in the above post.

P.P.S. As discussed in the comments, the original paper was not about narcolepsy; the term “sleeper effect” was just a metaphor. You can follow the links to see the original article and press release.

Mouse Among the Cats

Colleen Flaherty asks:

Do you get asked to peer review a lot? I’m guessing you do… This new very short paper says it’s not a crisis, though, since only the people who publish the most are getting asked to review a lot… The authors pose two solutions: either we need to “democratize” the system of peer review OR we start thinking about it as a credit system, where you should basically expect to review three papers for every one you publish. Where would you fall on this? Or do you think there is a crisis anyway?

The paper in question is called “Are Potential Peer Reviewers Overwhelmed Altruists or Free-Riders?,” by Paul Djupe, Amy Smith, and Anand Sokhey, who write: “where does peer review in the social sciences stand? Are academics overburdened altruists or peer-review free-riders? Our new Professional Activity in the Social Sciences data set suggests the answer is ‘Neither.’ Instead, most academics get few peer review requests and perform most of them. . . . the peer review crisis may be overblown.”

My reply to Flaherty was to point her to this post from a couple years ago, “An efficiency argument for post-publication review,” and continue with this mini-rant:

Peer review is wasteful in that every paper, no matter how crappy, gets reviewed multiple times (for example, consider a useless paper that gets 3 reviews and is rejected from journal A, then gets 3 reviews and is rejected from journal B, then gets 3 reviews and is accepted in journal C). But even the most important papers don’t get traditional peer review after publication. I think post-publication review makes more sense in that the reviewer resources are focused on the most important or talked-about papers.

The only argument I can see in favor of the current system of peer review is that it makes use of the unpaid labor of thousands of people–and if the structure were re-created from scratch, it might be hard to get most of these people to continue to work for free.

To put it another way: the Djupe et al. paper has some interesting data, and it’s fine for descriptive purposes, but it’s hard for me to even think of the existing pre-publication peer review system without being reminded of how wasteful it is, and how often it is used to shield and justify bad work.

P.S. I get hundreds of peer-review requests a year. I say No to most of these requests—I have to, or I’d have no time for anything else—but I say Yes enough that I still review a lot.

Researchers.one: A souped-up Arxiv with pre- and post-publication review

Harry Crane writes:

I’m writing to call your attention to a new peer review and publication platform, called RESEARCHERS.ONE, that I have recently launched with Ryan Martin. The platform can be found at https://www.researchers.one.

Given past discussions I’ve seen on your website, I think this new platform might interest you and your readers. We’d also be interested to hear your thoughts about the platform, and would encourage you to submit any of your work that you think could benefit from this new outlet.

Some further information is included below for your reference.

First, the platform is meant to be entirely open, to all researchers, in all fields. The platform aims to realize the benefits of peer review without suffering its drawbacks by (a) making all communications non-anonymous and (b) putting all publication decisions (including peer review) in the hands of the authors. There is no editorial board or accepting/rejecting of papers.

Among other benefits, we believe RESEARCHERS.ONE will enhance transparency and remove publication bias.

There are a number of other aspects to the platform, including pre-publication public peer review and a commenting feature to allow for post-publication discussion and peer review.

We’ve written about the details of the platform in our mission statement here https://www.researchers.one/article/2018-07-1

And we’ve addressed some basic questions in these videos: https://www.youtube.com/watch?v=ZFAerOjIMGM and https://www.youtube.com/watch?v=JD3kd7duAdQ

They describe their new system:

There are no editors, there are no accept-reject decisions or other barriers to publication, and there are no barriers to access. In removing these barriers, the researchers.one autonomous publishing model gives authors total control over the publication process from start to finish, which includes selecting the mode of peer review (public access or traditional), choosing the invited reviewers (if traditional peer review is chosen), determining whether and how to address reviewer comments, and deciding whether or not to publish their work. Once published, articles are publicly accessible on the site, with options for other users to provide non-anonymous commentary and post-publication peer review. Ultimately, the quality of published work must stand on its own, without the crutch of impact factors, journal prestige, ‘likes’, ‘thumbs up’, or the artificial stamp of approval signaled by the label ‘peer review’.

Here are some of my many, many posts on related topics:

Post-publication peer review: How it (sometimes) really works

When does peer review make no damn sense?

An efficiency argument for post-publication review

When do we want evidence-based change? Not “after peer review”

Crane and Martin’s system, a kind of super-Arxiv with pre- and post-publication review, seems like a great idea to me. The challenge will be getting people to go to the trouble of submitting their manuscripts to it. As it is, I can’t even get around to submitting things to Arxiv most of the time; it just seems like too much trouble. But if people get in the habit of submitting to Researchers.One, maybe it could catch on.

A cautionary tale

I remember, close to 20 years ago, an economist friend of mine was despairing of the inefficiencies of the traditional system of review, and he decided to do something about it: He created his own system of journals. They were all online (a relatively new thing at the time), with an innovative transactional system of reviewing (as I recall, every time you submitted an article you were implicitly agreeing to review three articles by others) and a multi-tier acceptance system, so that very few papers got rejected; instead they were just binned into four quality levels. And all the papers were open-access or something like that.

The system was pretty cool, but for some reason it didn’t catch on—I guess that, like many such systems, it relied a lot on the continuing volunteer efforts of its founder, and perhaps he just got tired of running an online publishing empire, and the whole thing kinda fell apart. The journals lost all their innovative aspects and became just one more set of social science publishing outlets. My friend ended up selling his group of journals to a traditional for-profit company, they were no longer free, etc. It was like the whole thing never happened.

A noble experiment, but not self-sustaining. Which was too bad, given that he’d put so much effort into building a self-sustaining structure.

Perhaps one lesson from my friend’s unfortunate experience is that it’s not enough to build a structure; you also need to build a community.

Another lesson is that maybe it can help to lean on some existing institution. This guy built up his whole online publishing company from scratch, which was kinda cool, but then when he no longer felt like running it, it dissolved. Maybe it would’ve been better to team up with an economics society, or with some university, governmental body, or public-interest organization.

Anyway, I wish Crane and Martin well in their endeavor. I’ll have to see if it makes sense for us to post our own manuscripts there.

What if a big study is done and nobody reports it?

Paul Alper writes:

Your blog often contains criticisms of articles which get too much publicity. Here is an instance of the obverse (inverse? reverse?) where a worthy publication dealing with a serious medical condition is virtually ignored. From Michael Joyce at the ever-reliable and informative Healthnewsreview.org:

Prostate cancer screening: massive study gets minimal coverage. Why?

The largest-ever randomized trial of using the prostate-specific antigen (PSA) test in asymptomatic men over the age of 50 has found — after about 10 years of follow-up — no significant difference in prostate cancer deaths among men who were screened with a single (“one-off”) PSA test, and those who weren’t screened. . . .

Two things caught our attention here.

First, that this “largest-ever” trial did not get large coverage in the mainstream press. In fact, none of the nearly two dozen US news outlets that we check every weekday wrote about it.

Second, it reminded us that even when coverage of screening is actually large, it often falls short in two very important ways. . . .

Here’s what the researchers reported:

– The cohort included 400,000 men without prostate symptoms, ages 50-69, enrolled at nearly 600 doctors’ offices across England

– 189,386 men had a one-off PSA test vs 219,439 men who had no screening

– After ~ 10 years: 4.3% of the screened group was diagnosed with prostate cancer vs 3.6% of the unscreened (control) group (authors attribute most of this difference to low-grade, non-aggressive cancers)

– Despite finding more cancer in the screened group, the authors found both groups had the same percentage of men dying from prostate cancer, and that percentage was very low: 0.29%

As of our publishing this article — two days after the British study was published — coverage of this “largest-ever” trial remains scant. Is it because it’s a European study? Unlikely — the results were featured prominently in one of the most prestigious US medical journals and promoted with an embargoed news release. Or, because it represents a so-called ‘negative’ or non-dramatic finding? (That is, no increase in prostate cancer deaths between the two groups found.) Who knows.

But it stands in stark contrast to the mega-coverage we’ve documented for many years on other prostate cancer screening studies that are typically much less rigorous — and which often trumpet an imbalanced, pro-screening message about prostate cancer. . . .

It’s an interesting issue of selection bias in what gets reported, what is considered to be news.