
How to set up a voting system for a Hall of Fame?

Micah Cohen writes:

Our company is establishing a Hall of Fame and I am on a committee to help set it up, which involves figuring out the voting system for inducting candidates. We have modeled it somewhat off of the voting for the Baseball Hall of Fame.

The details in short:
· Up to 40 candidates
· 600 voters
· Each elector has up to 10 votes
· A candidate has to have 75% of the votes to get inducted

Our current projected model:
· Up to 20 candidates
· About 100-120 voters (let’s say 100)
· Each elector has up to 3 votes
· A candidate has to have 75% of the votes to get inducted

The last 2 points are the variables in question that we need help with: How many votes should each elector get and what percentage should the candidate have to have to get inducted?

We don’t want to make it too easy and we don’t want to make it too hard. Our thought is to have 2-5 people inducted per year, but we want to avoid having no one inducted.

We will assume that each candidate has an equal chance of being voted in so it’s not weighted at all.

My initial thought was to increase the number of votes that each elector can have to the same ratio as the baseball system (10 votes for up to 40 candidates), so we could increase to up to 5 votes for the 20 candidates. Or should we keep it at 3 votes and decrease the percentage that the candidate would need? The other factor would be if there are fewer than 20 candidates, let’s say 15: how would this all change?

With all that being said, is there a way to find out what the right number of votes each elector should have and what the percentage should be?

Is there a way to visually see this in a graph where we could plug in the variables to see how it would change the probability of no one being elected or how many would be elected in a year?

My reply: I have no specific ideas here but I have four general suggestions:

1. To get a sense of what numbers would work for you, I recommend simulating fake data and trying out various ideas; see the sketch just after this list. Your results will only be as good as your fake data; still, I think this can give a lot of insight.

2. No need for a fixed rule, right? I’d recommend starting with a tough threshold for getting into the hall of fame, and then if you’re not getting enough people inducted each year, you can loosen your rules. This only works in one direction: if your rules are too loose, you can’t retroactively tighten them and exclude people you’ve already honored. But if you start out too tight, it shouldn’t be hard to loosen up.

3. This still leaves the question of how to set the thresholds for the very first year: if you set the bar too high at first, you could well have zero people inducted. One way you could handle this is to put the percentage threshold on a sliding scale so that if you have just 0 or 1 people passing, you lower the threshold until at least 2 people get in.

4. Finally, think about the motivations of voters, strategic voting, etc. The way the rules are set up can affect how people will vote.
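Here’s a minimal sketch of that fake-data simulation in R, using the equal-chances assumption from the letter (every elector picks candidates uniformly at random); all the parameter values are the letter’s numbers and can be swapped out:

# Simulate one election: each of 100 electors picks 3 distinct candidates
# at random; a candidate is inducted if at least 75% of electors vote for them.
simulate_election <- function(n_candidates = 20, n_voters = 100,
                              votes_per_elector = 3, threshold = 0.75) {
  votes <- integer(n_candidates)
  for (v in seq_len(n_voters)) {
    picks <- sample.int(n_candidates, votes_per_elector)
    votes[picks] <- votes[picks] + 1
  }
  sum(votes >= threshold * n_voters)  # number of candidates inducted
}

inducted <- replicate(10000, simulate_election())
mean(inducted == 0)  # probability that no one is inducted
table(inducted)      # distribution of the number inducted per year

One thing this makes clear right away: under the pure equal-chances model, each candidate’s expected vote share is only 3/20 = 15%, so nobody ever gets near 75% and the no-inductee probability is essentially 1. To get useful answers, you’d replace the uniform model with fake data in which some candidates are much more popular than others, then sweep votes_per_elector and threshold over a grid and plot the share of simulations with 0, 1, 2, . . . inductees, which also answers the graphing question.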

Hey—take this psychological science replication quiz!

Rob Wilbin writes:

I made this quiz where people try to guess ahead of time which results will replicate and which won’t, in order to give them a more nuanced understanding of replication issues in psych. Based on this week’s Nature replication paper.

It includes quotes and p-values from the original study if people want to use them, and we offer some broader lessons on what kinds of things replicate and which usually don’t.

You can try the quiz yourself. Also, I have some thoughts about the recent replication paper and its reception—I’m mostly happy with how the paper was covered in the news media, but I think there are a few issues that were missed. I’ll discuss that another time. For now, enjoy the quiz.

John Hattie’s “Visible Learning”: How much should we trust this influential review of education research?

Dan Kumprey, a math teacher at Lake Oswego High School, Oregon, writes:

Have you considered taking a look at the book Visible Learning by John Hattie? It seems to be permeating and informing reform in our K-12 schools nationwide. Districts are spending a lot of money sending their staffs to conferences by Solution Tree to train their schools to become PLCs (Professional Learning Communities), which also use an RTI (Response To Intervention) model. Their PowerPoint presentations prominently feature John Hattie’s work. Down the chain, then, if all of these school districts attending are like mine, their superintendents, assistant superintendents, principals, and vice principals are constantly quoting John Hattie’s work to support their initiatives, because they clearly see it as a powerful tool.

I am asking not as a proponent or opponent of Hattie’s work. I’m asking as a high school math teacher who found that there does not seem to have been much critical analysis of his work (except by Arne Kåre Topphol and Pierre-Jérôme Bergeron, as far as I can tell from a cursory search). This seems strange given its ubiquitous impact on educational leaders’ plans for district and school-wide changes that affect many students and teachers. An old college wrestling teammate of mine, now a statistician, encouraged me to ask you about this.

The reason educational leaders have latched onto this book so much, I believe, is Hattie’s synthesis of over 1,000 meta-analyses. This is, no doubt, a very appealing thing. I’m glad to see educational leaders using data to inform their decisions, but I’m not glad to see them treating it as an educational research bible, of sorts. I wonder about the statistical soundness (and hence value) of synthesizing so many studies of so many designs. I wonder about a book in which only two statistics are primarily used, one of them incorrectly. And, finally, I wonder about these things because this book is functioning as fuel for educational Professional Development conferences over multiple years in multiple states (i.e., it’s a significant component in a very profitable market) as well as the primary resource used by administrators in individual districts to effect change, often without teachers as change-agents. Regardless of these concerns, I also appreciate the conversations the book elicits, and am open to the notion that perhaps there are some sound statistical conclusions in the book, ignoring Hattie’s misuse of the CLE statistics. (Similarly, I should note, I like a lot about the RTI model that Solution Tree teaches/sells.) I’m sending you this email from a place of curiosity, not of cynicism.

My reply: I’ve not heard of this book by Hattie. I’m setting this down here as a placeholder, and if I have a chance to look at the Hattie book before the scheduled posting date, six months from now, I’ll give my impressions below. Otherwise, maybe some of you commenters know something about it?

“Identification of and correction for publication bias,” and another discussion of how forking paths is not the same thing as file drawer

Max Kasy and Isaiah Andrews sent along this paper, which begins:

Some empirical results are more likely to be published than others. Such selective publication leads to biased estimates and distorted inference. This paper proposes two approaches for identifying the conditional probability of publication as a function of a study’s results, the first based on systematic replication studies and the second based on meta-studies. For known conditional publication probabilities, we propose median-unbiased estimators and associated confidence sets that correct for selective publication. We apply our methods to recent large-scale replication studies in experimental economics and psychology, and to meta-studies of the effects of minimum wages and de-worming programs.

I sent them these comments:

1. This recent discussion might be relevant. My quick impression is that this sort of modeling (whether parametric or nonparametric; I can see the virtues of both approaches) could be useful in demonstrating the problems of selection in a literature, and in setting some lower bound on how bad the selection could be, but less valuable as a way of coming up with some sort of corrected estimate. In that way, it’s similar to our work on type M and type S errors, which are more of a warning than a solution. I think your view may be similar, in that you mostly talk about biases, without making strong claims about the ability to use your method to correct these biases.

2. In section 2.1 you have your model where “a journal receives a stream of studies” and then decides which one to publish. I’m not sure how important this is for your mathematical model, but it’s my impression that most of the selection occurs within papers: a researcher gets some data and then has flexibility to analyze it in different ways, to come up with statistically significant conclusions or not. Even preregistered studies are subject to a lot of flexibility in interpretation of data analysis; see for example here.

3. Section 2.1 of this paper may be of particular interest to you as it discusses selection bias and the overestimation of effect sizes in a study that was performed by some economists. It struck me as ironic that economists, who are so aware of selection bias in so many ways, have been naive in taking selected point estimates without recognizing their systematic problems. It seems that, for many economists, identification and unbiased estimation (unconditional on selection) serve as talismans which provide a sort of aura or blessing to an entire project, allowing the researchers to turn off their usual skepticism. Sad, really.

4. You’ll probably agree with this point too: I think it’s a terrible attitude to say that a study replicates “if both the original study and its replication find a statistically significant effect in the same direction.” Statistical significance is close to meaningless at the best of times, but it’s particularly ridiculous here, in that using such a criterion is just throwing away information and indeed compounding the dichotomization that causes so many problems in the first place.

5. I notice you cite the paper of Gilbert, King, Pettigrew, and Wilson (2016). I’d be careful about citing that paper, as I don’t think the authors of that paper knew what they were doing. I think that paper was at best sloppy and misinformed, and at worst a rabble-rousing, misleading bit of rhetoric, and I don’t recommend citing it. If you do want to refer to it, I really think you should point out that its arguments are what academics call “tendentious” and what Mark Twain called “stretchers,” and you could refer to this post by Brian Nosek and Elizabeth Gilbert explaining what was going on. As I’m sure you’re aware, the scientific literature gets clogged with bad papers, and I think we should avoid citing bad papers uncritically, even if they appear in what are considered top journals.

Kasy replied:

1) We wouldn’t want to take our corrections too literally, either, but we believe there is some value in saying “if this is how selection operates, this is how you would perform corrected frequentist inference (estimation & confidence sets)”
Of course our model is not going to capture all the distortions that are going on, and so the resulting numbers should not be taken as literal truth, necessarily.

2) We think of our publication probability function p(.) as a “reduced form” object which is intended to capture selectivity by both researchers and journals. Our discussion is framed for concreteness in terms of journal decisions, but for our purposes journal decisions are indistinguishable from researcher decisions.
In terms of assessing the resulting biases, I think it doesn’t really matter who performs the selection?

3) Agreed; there seems to be some distortion in the emphasis put by empirical economists.
Though we’d also think that the focus on “internal validity” has led to improvements in empirical practice relative to earlier times?

4) Agree completely. We attempt to clarify this point in Section 3.3 of our paper.
The key point to us seems to be that, depending on the underlying distribution of true effects across studies, pretty much any value for “Probability that the replication Z > 1.96, given that the original Z > 1.96” is possible (with a lower bound of 0.025), even in the absence of any selectivity.
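Kasy’s point 4 is easy to check with a quick simulation. Here’s a minimal sketch in R of a toy world with no selection at all: true effects (on the z-score scale) are drawn from some distribution, and the original and replication z-statistics are each the true effect plus independent standard normal noise.

# P(replication z > 1.96 | original z > 1.96) under zero selectivity;
# the answer depends entirely on the spread of true effects across studies.
replication_rate <- function(true_effect_sd, n = 1e6) {
  theta  <- rnorm(n, 0, true_effect_sd)  # true effects, z-score scale
  z_orig <- theta + rnorm(n)
  z_rep  <- theta + rnorm(n)
  mean(z_rep > 1.96 & z_orig > 1.96) / mean(z_orig > 1.96)
}

replication_rate(0)  # all true effects zero: 0.025, the lower bound
replication_rate(1)  # moderate spread of true effects: intermediate
replication_rate(5)  # mostly huge effects: close to 1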

Regarding Kasy’s response to point 2, “it doesn’t really matter who performs the selection?”, I’m not so sure.

Here’s my concern: in the “file drawer” paradigm, there’s some fixed number of results that are streaming through, and some get selected for publication. In the “forking paths” paradigm, there’s a virtually unlimited number of possible findings, so it’s not clear that it makes sense to speak of a latent distribution of results.

One other thing: in the “file drawer” paradigm, the different studies are independent. In the “forking paths” paradigm, the different possible results are all coming from the same data so they don’t represent so much additional information.

3 recent movies from the 50s and the 70s

I’ve been doing some flying, which gives me the opportunity to see various movies on that little seat-back screen. And some of these movies have been pretty good:

Logan Lucky. Pure 70s. Kinda like how Stravinsky did those remakes of Tchaikovsky etc. that were cleaner than the originals, Soderbergh in Logan Lucky, and earlier in The Limey, recreated that Seventies look and feel. The Limey had the visual style, the washed-out look of the L.A. scenes in all those old movies. Logan Lucky had the 70s-style populist thing going, Burt Reynolds, Caddyshack, the whole deal.

La La Land. I half-watched it—I guess I should say, I half-listened to it, on the overnight flight. I turned it on, plugged myself in, and put on the blindfold so I could sleep. A couple times I woke up in the middle of the night and restarted it. Between these three blind viewings, I pretty much heard the whole thing. On the return flight I actually watched the damn thing and then the plot all made sense. It was excellent, just beautiful. The actual tunes were forgettable, but maybe that was part of the design. Like Logan Lucky, this was a retro movie—in this case, from the Fifties—but better than the originals on which it was modeled.

Good Time. I’d never heard of this one. This was the most intense movie I’ve ever seen. Also pure 70s, but not like Logan Lucky, more like a cross between The French Connection and Dog Day Afternoon. Almost all the action takes place in Queens. Really intense—did I say that already?

StanCon Helsinki streaming live now (and tomorrow)

We’re streaming live right now!

Timezone is Eastern European Summer Time (EEST), UTC+3

Here’s a link to the full program [link fixed].

There have already been some great talks and they’ll all be posted with slides and runnable source code after the conference on the Stan web site.

Some clues that this study has big big problems

Paul Alper writes:

This article from the New York Daily News, reproduced in the Minneapolis Star Tribune, is so terrible in so many ways. Very sad commentary regarding all aspects of statistics education and journalism.

The news article, by Joe Dziemianowicz, is called “Study says drinking alcohol is key to living past 90,” with subheading, “When it comes to making it into your 90s, booze actually beats exercise, according to a long-term study,” and it continues:

The research, led by University of California neurologist Claudia Kawas, tracked 1,700 nonagenarians enrolled in the 90+ Study that began in 2003 to explore impacts of daily habits on longevity.

Researchers discovered that subjects who drank about two glasses of beer or wine a day were 18 percent less likely to experience a premature death, the Independent reports.

Meanwhile, participants who exercised 15 to 45 minutes a day cut the same risk by 11 percent. . . .

Other factors were found to boost longevity, including weight. Participants who were slightly overweight — but not obese — cut their odds of an early death by 3 percent. . . .

Subjects who kept busy with a daily hobby two hours a day were 21 percent less likely to die early, while those who drank two cups of coffee a day cut that risk by 10 percent.

At first, this seems like reasonable science reporting. But right away there are a couple of flags that raise suspicion, such as the oddly specific “15 to 45 minutes a day”—what about people who exercise more or less than that?—and the bit about “overweight — but not obese.” It’s harder than you might think to estimate nonlinear effects. In this case the implication is not just nonlinearity but nonmonotonicity, and I’m starting to worry that the researchers are fishing through the data looking for patterns. Data exploration is great, but you should realize that you’ll be dredging up a lot of noise along with your signal. As we’ve said before, correlation (in your data) does not even imply correlation (in the underlying population, or in future data).
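That last point is easy to demonstrate. Here’s a minimal sketch in R of how dredging pure noise turns up “findings” that vanish in new data; the sample size roughly matches the 1,700 people in the 90+ Study, and everything else is made up:

# An outcome and 20 candidate predictors, all pure noise.
set.seed(1)
n <- 1700
x <- matrix(rnorm(n * 20), n, 20)
y <- rnorm(n)

cors <- cor(x, y)
best <- which.max(abs(cors))
cors[best]  # the "winning" correlation: small, but "statistically
            # significant" at this sample size (its se is about 0.024)

# The same predictor against fresh data from the same null world:
y_new <- rnorm(n)
cor(x[, best], y_new)  # back to noise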

The claims produced by the 90+ Study can also be criticized on more specific grounds. Alper points to this news article by Michael Joyce, who writes:

[While] their survey [found] that drinking the equivalent of two beers or two glasses of wine per day was associated with 18% fewer deaths, it also found that daily exercise of around 15 to 45 minutes was only associated with 11% fewer premature deaths.

TechTimes opted to blend these two findings into a single whopper of a headline:

Drinking Alcohol Helps Better Than Exercise If You Want To Live Past 90 Years Old

Not only is this language unjustified in referring to a study that can only show association, not causation, but the survey did not directly compare alcohol and exercise. So the headline is very misleading. . . .

Other reported findings of the study included:

– being slightly overweight (not obese) was associated with 3% fewer early deaths

– being involved in a daily hobby two hours a day was associated with a 21% lower rate of premature deaths

– drinking two cups of coffee a day was associated with a 10% lower rate of early death

But these are observations and nothing more. Furthermore, they are based on self-reporting by the study subjects. That’s a notoriously unreliable way to get accurate information regarding people’s daily habits or behaviors.

Just after we published this piece we heard back from Dr. Michael Bierer, MD, MPH — one of our regular contributors — who we had reached out to for comment . . .:

Observational studies that demonstrate benefits to people engaged in a certain activity — in this case drinking — are difficult to do well. That’s because the behavior in question may co-vary with other features that predict health outcomes.

For example, those who abstain from alcohol completely may do so for a variety of reasons. In older adults, perhaps that reason is taking a medication that makes alcohol dangerous; such as anticoagulants, psychotropics, or aspirin. So not drinking might be a marker for other health conditions that themselves are associated — weakly or not-so-weakly — with negative outcomes. Or, abstaining may signal a history of problematic drinking and the advice to cut back. Likewise, there are many health conditions (like liver disease) that are reasons to abstain.

Conversely, moderate drinking might be a marker for more robust health. There is an established link between physical activity and drinking alcohol. People who take some alcohol may simply have more social contacts than those who abstain, and pro-social behaviors are linked to health.

P.S. I’d originally titled this post, “In Watergate, the saying was, ‘It’s not the crime, it’s the coverup.’ In science reporting, it’s not the results, it’s the hype.” But I changed the title to avoid the association with criminality. One thing I’ve said a lot is that, in science, honesty and transparency are not enough: you can be a scrupulous researcher, but if your noise overwhelms your signal, and you’re using statistical methods (such as selection on statistical significance) that emphasize and amplify noise, then you can end up with junk science. Which, when put through the hype machine, becomes hyped junk science. Gladwell bait. Freakonomics bait. NPR bait. PNAS bait.

So, again:

(1) If someone points out problems with your data and statistical procedures, don’t assume they’re saying you’re dishonest.

(2) If you are personally honest, just trying to get at the scientific truth, accept that concerns about “questionable research practices” might apply to you too.

Old school

Maciej Cegłowski writes:

About two years ago, the Lisp programmer and dot-com millionaire Paul Graham wrote an essay entitled Hackers and Painters, in which he argues that his approach to computer programming is better described by analogies to the visual arts than by the phrase “computer science”.

When this essay came out, I was working as a computer programmer, and since I had also spent a few years as a full-time oil painter, everybody who read the article and knew me sent along the hyperlink. I didn’t particularly enjoy the essay . . . but it didn’t seem like anything worth getting worked up about. Just another programmer writing about what made him tick. . . .

But the emailed links continued, and over the next two years Paul Graham steadily ramped up his output while moving definitively away from subjects he had expertise in (like Lisp) to topics like education, essay writing, history, and of course painting. Sometime last year I noticed he had started making bank from an actual print book of collected essays, titled (of course) “Hackers and Painters”. I felt it was time for me to step up.

So let me say it simply – hackers are nothing like painters.

Cegłowski continues:

It’s surprisingly hard to pin Paul Graham down on the nature of the special bond he thinks hobbyist programmers and painters share . . . The closest he comes to a clear thesis statement is at the beginning of “Hackers and Painters”:

[O]f all the different types of people I’ve known, hackers and painters are among the most alike. What hackers and painters have in common is that they’re both makers.

To which I’d add, what hackers and painters don’t have in common is everything else.

Ouch. Cegłowski continues:

The fatuousness of the parallel becomes obvious if you think for five seconds about what computer programmers and painters actually do.

– Computer programmers cause a machine to perform a sequence of transformations on electronically stored data.

– Painters apply colored goo to cloth using animal hairs tied to a stick.

It is true that both painters and programmers make things, just like a pastry chef makes a wedding cake, or a chicken makes an egg. But nothing about what they make, the purposes it serves, or how they go about doing it is in any way similar.

Start with purpose. With the exception of art software projects (which I don’t believe Graham has in mind here) all computer programs are designed to accomplish some kind of task. Even the most elegant of computer programs, in order to be considered a program, has to compile and run . . .

The only objective constraint a painter has is making sure the paint physically stays on the canvas . . .

Why does Graham bring up painting at all in his essay? Most obviously, because Graham likes to paint, and it’s natural for us to find connections between different things we like to do. But there’s more to it: as Cegłowski also discusses, painting has a certain street-cred (he talks about it in terms of what can “get you laid,” but I think it’s more general than that). So if someone says that what he does is kinda like painting, I do think that part of this is an attempt to share in the social status that art has.

Cegłowski’s post is from 2005, and it’s “early blogging” in so many ways, from the length and tone, to the references to old-school internet gurus such as Paul Graham and Eric Raymond, to the occasional lapses in judgment. (In this particular example, I get off Cegłowski’s train when he goes on about Gödel, Escher, Bach, a book that I positively hate, not so much for itself as for how overrated it was.)

Old-school blogging. Good stuff.

“To get started, I suggest coming up with a simple but reasonable model for missingness, then simulate fake complete data followed by a fake missingness pattern, and check that you can recover your missing-data model and your complete data model in that fake-data situation. You can then proceed from there. But if you can’t even do it with fake data, you’re sunk.”

Alex Konkel writes on a topic that never goes out of style:

I’m working on a data analysis plan and am hoping you might help clarify something you wrote regarding missing data. I’m somewhat familiar with multiple imputation and some of the available methods, and I’m also becoming more familiar with Bayesian modeling like in Stan. In my plan, I started writing that we could use multiple imputation or Bayesian modeling, but then I realized that they might be the same. I checked your book with Jennifer Hill, and the multiple imputation chapter says that if outcome data are missing (which is the only thing we’re worried about) and you use a Bayesian model, “it is trivial to use this model to, in effect, impute missing values at each iteration”. Should I read anything into that “in effect”? Or is the Bayesian model identical to imputation?

A second, trickier question: I expect that most of our missing data, if we have any, can be considered missing at random (e.g., the data collection software randomly fails). One of our measures, though, is the participant rating their workload at certain times during task performance. If the participant misses the prompt and fails to respond, it could be that they just missed the prompt and nothing unusual was happening, and so I would consider that data to be missing at random. But participants could also fail to respond because the task is keeping them very busy, in which case the data are not missing at random but in fact reflect a high workload (and likely higher than other times when they do respond). Do you have any suggestions on how to handle this situation?

My reply:

to paragraph 1: Yes, what we’re saying is that if you just exclude the missing cases and fit your model, this is equivalent to the usual statistical inference, assuming missing at random. And then if you want inference for those missing values, you can impute them conditional on the fitted model. If the outcome data are missing not at random, though, then more modeling must be done.

to paragraph 2: Here you’d want to model this. You’d have a model, something like logistic regression, where probability of missingness depends on workload. This could be fit in a Stan model. It could be that strong priors would be needed; you can run into trouble trying to fit such models from data alone, as such fits can depend uncomfortably on distributional assumptions.

Konkel continues:

For the Bayesian/imputation part: Is there a difference between what you describe, which I think would be what’s described in 11.1 of the Stan manual, and trying to combine the missing and non-missing data into a single vector, more like 11.3? We wouldn’t be interested in the value of the missing data, per se, so much as maximizing our number of observations in order to examine differences across experimental conditions.

For the workload part: I understand what you’re saying in principle, but I’m having a little trouble imagining the implementation. I would model the probability of missingness depending on workload, but the data that’s missing is the workload! We’re running an experiment, so we don’t have a lot of other predictors with which to model/stand in for workload. Essentially we just have the experimental condition, although maybe I could jury-rig more predictors out of the details of the experiment in the moment when the workload response is supposed to occur…

My reply:

Those sections of the Stan manual are all about how to fold missing data into larger data structures. We’re working on allowing missing data types in Stan so that this bookkeeping is done automatically—but until that’s done, yes, you’ll need to do this sort of trick to model missing data.

But in my point 1 above, if your missingness is only in the outcome variable in your regression, and you assume missing at random, then you can simply exclude the missing cases entirely from your analysis and then impute them later in the generated quantities block, which is also where you’ll combine the observed and imputed data.

For the workload model: Yes, you’re modeling the probability of missingness given something that’s missing in some cases. That’s why you’d need a model with informative priors. From a Bayesian standpoint, the model is fit all at once, so it can all be done.

To get started, I suggest coming up with a simple but reasonable model for missingness, then simulate fake complete data followed by a fake missingness pattern, and check that you can recover your missing-data model and your complete data model in that fake-data situation. You can then proceed from there. But if you can’t even do it with fake data, you’re sunk.
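Here’s a minimal sketch in R of just the fake-data half of that recipe, applied to Konkel’s workload problem. All the parameter values are invented for illustration: workload depends on experimental condition, and the probability of missing the prompt increases with workload itself, so the data are missing not at random.

# Fake complete data: workload ratings in two experimental conditions.
set.seed(123)
n <- 200
condition <- rep(0:1, each = n / 2)
workload  <- rnorm(n, mean = 4 + 2 * condition, sd = 1)

# Fake missingness pattern: the busier you are, the more likely you are
# to miss the prompt, so missingness depends on the missing value itself.
p_miss   <- plogis(-5 + 0.8 * workload)
observed <- ifelse(runif(n) < p_miss, NA, workload)

mean(is.na(observed))         # overall missingness rate
mean(workload)                # true mean of the complete data
mean(observed, na.rm = TRUE)  # complete-case mean: biased low

The recovery step would then be to fit a joint model for workload and for the missingness indicator (for instance, the logistic regression suggested above, with informative priors on its coefficients) and check that the fit gets back the parameters used to generate the fake data.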

Bayesian model comparison in ecology

Conor Goold writes:

I was reading this overview of mixed-effect modeling in ecology, and thought you or your blog readers may be interested in their last conclusion (page 35):

Other modelling approaches such as Bayesian inference are available, and allow much greater flexibility in choice of model structure, error structure and link function. However, the ability to compare among competing models is underdeveloped, and where these tools do exist, they are not yet accessible enough to non-experts to be useful.

This strikes me as quite odd. The paper discusses model selection using information criteria and model averaging in quite some detail, and it is confusing that the authors dismiss the Bayesian analogues (I presume they are aware of DIC, WAIC, LOO, etc. [see chapter 7 of BDA3 and this paper — ed.]) as being ‘too hard’ when parts of their article would probably also be too hard for non-experts.

In an area in which small sample sizes are common, I’d argue that effort to explain Bayesian estimation in hierarchical models would have been very worthwhile (e.g. estimation of variance components, more accurate estimation of predictor coefficients using informative priors/variable selection).

In general, I find the ‘Bayesian reasoning is too difficult for non-experts’ argument pretty tiring, especially when it’s thrown in at the end of a paper like this!

Along these lines, I used to get people telling me that I couldn’t use Bayesian methods for applied problems because people wouldn’t stand for it. Au contraire, I’ve used Bayesian methods in many different applied fields for a long time, ever since my first work in political science in the 1980s, and nobody’s ever objected to it. If you don’t want to use some statistical method (Bayesian or otherwise) cos you don’t like it, fine; give your justification and go from there. But don’t ever say not to use a method out of a purported concern that some third party will object. That’s so bogus. Stand behind your own choices.

In statistics, we talk about uncertainty without it being viewed as undesirable

Lauren Kennedy writes:

I’ve noticed that statistics (or at least applied statistics) has this nice ability to talk about uncertainty without it being viewed as undesirable. Stan Con had that atmosphere and I think it just makes everyone so much more willing to debug, discuss and generate new ideas.

Indeed, in statistics I’ve seen fierce disputes, but I don’t typically see people trying to dodge criticism or attack people who point out errors in their work. I’d like to think that one reason for this productive style of handling dissent is that uncertainty is baked into our way of thinking.

P.S. Lauren writes of Jazz, pictured above: “Her claim to fame is that she can open two of four door styles present in the houses she’s lived in.”

When anyone claims 80% power, I’m skeptical.

A policy analyst writes:

I saw you speak at ** on Bayesian methods. . . . I had been asked to consult on a large national evaluation of . . . [details removed to preserve anonymity] . . . and had suggested treading carefully around the use of Bayesian statistics in this study (basing it on all my expertise taken from a single graduate course taught ten years ago by a professor who didn’t even like Bayesian statistics). I left that meeting determined to learn more myself as well as to encourage Bayesian experts to help us bring this approach to my field of research. . . .

I’ve been asked to review an evaluation proposal and provide feedback on what could help improve the evaluation. The study team is evaluating the effectiveness of a training . . . The study team suggests up front that a great deal of variance exists among these case workers (who come from a number of different agencies, so are clustered under [two factors]). The problem with all of this is that the number is still small [less than 100 people] . . . To recap, we have a believed heterogenous population . . . whose variation is expected to cluster under a number of factors . . . and all will get this training at the same time . . . The study team has proposed a GEE model and their power analysis was based on multiple regression. Their claim on power is that [less than 100] individuals is sufficient to achieve 80% power using multiple regression with no covariates. That is as much information as they give me, so their assumptions about the sample are unclear to me.

I want to suggest they go back and actually run the power analysis on the study they are doing, based on the analysis they intend to run. I’m also suggesting they consider GLMM instead of GEE, since they seem to feel that the variance that can be explained by these clusters is meaningful and could inform future trainings. I recall learning that in this scenario of clustered variance, actual power is less than if you assume homogeneity across your population. In other words, estimating power based on a multiple regression and a standard proxy for the expected variance of your sample would far over-estimate how much power you have. The problem is I can’t remember if this is true, why this is true, what words to google to remind myself, or if I totally made this up and pulled it out of my butt. What I recall is that in my MLM courses they suggested that MLM would actually achieve 80% power with a smaller sample when the cluster variable explained significant variance.

I modeled all of this for another study where the baseline variance is already known and where a number of related past studies exist. What I found there was that if you just plug your N into G*Power and use its preset values for variance, the sample it suggests is needed to achieve 80% power is far lower than what you get if you put in accurate values for variance. If you move from G*Power to an R script that can explicitly model power for a GLMM and put in good estimates for all of this, you get a number somewhere in between the two. This is what I think they should do, but I don’t want to send them on a wild goose chase.

Here’s my reply:

1. When anyone claims 80% power, I’m skeptical. (See also here.) Power estimates are just about always set up in an optimistic way with the primary goal to get the study approved. I’d throw out any claim of 80% power. Instead start with the inputs: the assumptions of effect size and variation that were used to get that 80% power claim. I recommend you interrogate those assumptions carefully. In particular, guessed effect sizes are typically drawn from existing literature which overestimates effect sizes (Type M errors arising from selection on statistical significance), and these overestimates can be huge, perhaps no surprise given that, in designing their studies, researchers have every incentive to use high estimates for effectiveness of their interventions.

2. More generally, I’m not so interested in “power” as it is associated with the view that the goal of a study is to get statistical significance and then make claims with certainty. I think it’s better to accept ahead of time that, even after the study is done, residual uncertainty will be present. Don’t make statistical significance the goal: then there’s less pressure to cheat and get statistical significance, less overconfidence if the study happens to result in statistical significance, and less disappointment if the results end up not being statistically significant.

3. To continue with the theme: I don’t like talking about “power,” but I agree with the general point that it’s a bad idea to do “low-power studies” (or, as I’d say, studies where the standard error of the parameter estimate is as high as, or higher than, the true underlying effect size). The trouble is that a low-power study gives a noisy estimate, which, if it does happen to be statistically significant, will cause the treatment effect to be overestimated.

Sometimes people think a low-power study isn’t so bad, in a no-harm, no-foul sort of way: if a study has low power, it’ll probably fail anyway. But in some sense the big risk with a low-power study is that it apparently succeeds, thus leading people into a false sense of confidence about effect size. That’s why we sometimes talk about statistical significance as a “winner’s curse” or a “deal with the devil.”
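Here’s that winner’s curse in a quick simulation, with a made-up true effect of 0.2 measured with standard error 0.5:

# Winner's curse: the estimates that happen to reach significance
# wildly exaggerate the true effect of 0.2.
est    <- rnorm(1e6, mean = 0.2, sd = 0.5)  # sampling distribution of the estimate
signif <- abs(est) > 1.96 * 0.5             # "statistically significant"
mean(signif)                                # power: roughly 7%
mean(est[signif & est > 0])                 # about 1.2, six times the true effect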

4. To return to your technical question: Yes, in general a multilevel analysis (or, for that matter, a correctly done GEE) will give a lower power estimate than a corresponding observation-level analysis that does not account for clustering. There are some examples of power analysis for multilevel data structures in chapter 20 of my book with Jennifer Hill.

5. When in doubt, I recommend fake-data simulation. This can take some work—you need to simulate predictors as well as outcomes, and you need to make assumptions about variance components, correlations, etc.—but in my experience the effort to make those assumptions is well worth it in clarifying one’s thinking.
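For the clustering question specifically, here’s a minimal fake-data power sketch in R, with invented numbers: 10 agencies per arm, 5 workers per agency, a true effect of 0.3 sd, and a between-agency sd of 0.5. It compares the naive observation-level test, which ignores clustering, to a simple test on agency means, which respects it.

power_sim <- function(n_clusters = 10, n_per = 5, effect = 0.3,
                      sd_cluster = 0.5, n_sims = 2000) {
  naive <- clustered <- logical(n_sims)
  arm    <- rep(0:1, each = n_clusters * n_per)    # treatment assignment
  agency <- rep(1:(2 * n_clusters), each = n_per)  # cluster membership
  for (s in seq_len(n_sims)) {
    agency_effect <- rnorm(2 * n_clusters, 0, sd_cluster)
    y <- effect * arm + agency_effect[agency] + rnorm(length(arm))
    # Observation-level t-test, ignoring the clustering:
    naive[s] <- t.test(y ~ factor(arm))$p.value < 0.05
    # t-test on the agency means, respecting the clustering:
    m <- tapply(y, agency, mean)
    clustered[s] <- t.test(m ~ factor(rep(0:1, each = n_clusters)))$p.value < 0.05
  }
  c(naive = mean(naive), agency_means = mean(clustered))
}
power_sim()

The naive rejection rate comes out higher, but it’s the wrong answer: with a positive between-agency variance, the observation-level test doesn’t even have its nominal type 1 error rate, which is exactly the over-optimism my correspondent remembered from the MLM courses.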

Which elicited the following response:

You raise concerns about the broader funding system itself. The grants require that awardees demonstrate in their proposal, through a power analysis, that their analytic method and sample size are sufficient to detect the expected effects of their intervention. Unfortunately it is very common for researchers to present something along these lines:

– We are looking to improve parenting practices through intervention X.
– Past interventions have seen effect sizes with estimates that range from 0.1 to 0.6 (and in fact, the literature would show effect sizes ranging from -0.6 to 0.6, but those negative effect sizes are left out of the proposal).
– The researcher will conclude that their sample is sufficient to detect an effect size of 0.6 with 80% power and that this is reasonable.
– The study is good to go.

I would look at that and generally conclude the study is underpowered and whatever effects they do or don’t find are going to be questionable. The uncertainty would be too great.

I like your idea of doing simulations. While I can’t recommend that awardees do that (it would be considered too onerous), I can suggest that evaluators consider the option. I’ve done these for education studies. My experience is that if it is a well-studied field and the assumptions can be readily inferred, they are not that time-consuming to do.

I assume a problem with making these assumptions of effect size and variance would be the file drawer effect, the lack of published null or negative studies and the inadvertent or intentional ignoring of those that are published. [Actually, I think that forking paths—the ability of researchers to extract statistical significance from data that came from any individual study—is a much bigger deal than the file drawer. — ed.] I do think that some researchers take an erroneous view of these statistics: if no effect or a negative effect is found, it’s not meaningful; if a positive effect is found, it is meaningful and interpretable.

P.S. I have been highly privileged, first in receiving a world-class general education in college, then in receiving a world-class education in statistics in graduate school, then in having jobs for the past three decades in which I’ve been paid to explore the truth however I’ve seen fit, and during this period to have found hundreds of wonderful collaborators in both theory and application. As a result, I take very seriously the “service” aspect of my job, and I’m happy to give back to the community by sharing tips in this way.

Problems in a published article on food security in the Lower Mekong Basin

John Williams points us to this article, “Designing river flows to improve food security futures in the Lower Mekong Basin,” by John Sabo et al., featured in the journal Science. Williams writes:

The article exhibits multiple forking paths, a lack of theory, and abundant jargon. It is also very carelessly written and reviewed. For example, the study analyzed the Mekong River stage (level of the water with respect to a reference point), but refers more often to the discharge (volume per time past a reference point: the relationship between the two is non-linear). It is pretty amazing that something like this got published.

Williams’s fuller comments are here.

I haven’t read through this at all, but I ran it by a colleague who knows this stuff, and my colleague agreed with Williams’s critique, so I’ll share it here.

Too bad the journal Science doesn’t have a post-publication review portal, so we have to do things in this awkward way.

P.S. Commenters pointed out updates in the journal here and here.

Who spends how much, and on what?

Nathan Yau (link from Dan Hirschman) constructed the above excellent visualization of data from the Consumer Expenditure Survey. Lots of interesting things here. The one thing that surprises me is that people (or maybe it’s households) making more than $200,000 only spent an average of $160,000. I guess the difference is taxes, savings (but not retirement savings, as that’s included in the “Personal Insurance and Pensions” category), and charitable donations. (Here’s the definition of all the expenditure categories.) Still, I was surprised that the amount spent by upper-income people was so low. The average income of people who make more than $200,000 must be pretty high . . . Here’s something from Wikipedia: of the households making over $200,000 in 2014, average income is (2.61 * $220,000 + 3.28 * $400,000)/(2.61 + 3.28) = $320,000. So, according to the data, I guess that half their income is going to taxes, non-retirement savings, and charitable donations. Put that way, this sounds about right: depending on where you live and how much you make, taxes can take close to 50% already.
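A quick check of that back-of-the-envelope number, where 2.61 and 3.28 are the Wikipedia household counts in millions and $220,000 and $400,000 are the assumed average incomes within the two bands:

# Weighted average income of the over-$200,000 households:
households <- c(2.61, 3.28)         # millions: $200-250k band, $250k+ band
incomes    <- c(220000, 400000)     # assumed average income in each band
weighted.mean(incomes, households)  # about $320,000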

Yau reports that he “made these charts in R with a variation of this unit chart tutorial.” He gives links to these, but I’d really like for him to share his code! I’d also like to know exactly what table he used. Yau’s blogging for free so I’m not complaining at all; I think his graph and follow-up discussion are great as they are! and they’d be even better with links to data and code.

Data concerns when interpreting comparisons of gender equality between countries

A journalist pointed me to this research article, “Gender equality and sex differences in personality: evidence from a large, multi-national sample,” by Tim Kaiser (see also news report by Angela Lashbrook here), which states:

A large, multinational (N = 926,383) dataset was used to examine sex differences in Big Five facet scores for 70 countries. Difference scores were aggregated to a multivariate effect size (Mahalanobis’ D). . . .

Countries’ difference scores were related to an index of gender equality, revealing a positive weighted correlation of r = .39 . . .

Using multivariate effect sizes derived from latent scores with invariance constraints, the study of sex differences in personality becomes more robust and replicable. Sex differences in personality should not be interpreted as results of unequal treatment . . .

The journalist wrote, “This study found that as gender equality increases, so do gender differences. Have you seen evidence of this in research, including your own research?”

I replied as follows:

I have not worked in this area of gender equality myself, so I can’t say I’ve seen this pattern, or not seen it, in my own research. This particular study linked above has a lot of details; the main pattern appears in its figure 2. I have no sense of why this “Mahalanobis D” number is so much higher in Latvia than in Iran, but I could well imagine this could be an artifact of survey responses. In general, if the survey responses are noisier, you’d expect a measure such as this D to be closer to zero. If you look in the paper, you’ll see that D is based on various personality inventories, and I could well imagine that these responses would have different interpretations in Latvia, Iceland, and Finland than in Iran, Mexico, and Jamaica. In addition, it says in the paper that “the assessment procedures have selected for English language subjects with Internet access.” This would seem to destroy all their interpretations of the results when comparing countries. So I would be wary of taking these results too seriously. That said, there’s nothing wrong with speculation, as long as it is clearly labeled as such.

Finally, I’m skeptical about the following claim made in this paper: “The degree to which a society allows individuals to express biological gender differences can vary. If a society ensures that men and women have exactly the same access to all resources that this society has to offer, the biological factors could be expressed more strongly than in more repressive societies. A stronger sexual dimorphism should therefore be seen more as an expression of a successful gender policy.” I don’t see how the results in their paper—even if you ignore any potential data issues and assume the survey responses have the same meanings in each country—lead to this conclusion.

Here’s another line from the paper: “The results presented suggest that greater sexual dimorphism should not be interpreted as an indicator of a society that discriminates against a particular sex, but rather as an indicator of a successful gender equality policy.” I don’t understand this at all. Even taking the data at face value, they have two measures: D (the measure of sex differences in personality) and GGGI (the measure of lack of sex differences in outcomes in the country). D is weakly correlated with GGGI. Fine. But then a high value of D is a measure of a high value of D; it’s not an indicator of GGGI. For that matter, GGGI is not a measure of “policy” either.

Finally, I am concerned about some of the details in the paper, for example on page 7, Antarctica, Puerto Rico, and Andorra are listed as countries. Andorra, maybe although I can’t imagine we’d learn much from such an odd case. Puerto Rico is of course not a country, and Antarctica even less so.

Anyway, my quick reaction is that it’s a good step for these findings to be published but I think they are being way overinterpreted.

I sent the above comments to Laurie Rudman, a researcher who works in related areas, and she agreed that assessing and analyzing cross-cultural data can be difficult. Rudman pointed me to this paper, “Mind the level: problems with two recent nation-level analyses in psychology,” by Toon Kuppens and Thomas Pollet, that raises some of these issues (although not with the Kaiser paper discussed above).

P.S. The journalist asked for some clarifications, so I added these points:

1. What is the Mahalanobis D? When comparing two groups (for example, male and female survey respondents) on a single variable (for example, height), you can just take the difference and report, say, that on average men are one standard deviation taller than women, or whatever it is. When comparing two groups on multiple variables (for example, several different personality assessment survey responses), you can construct a “multivariate distance measure.” Mahalanobis D is one such measure. Bigger numbers correspond to larger average differences between men and women in whatever variables are measured. In the above-linked paper, my issue with the Mahalanobis D was not how it was defined, but rather with the data used to compute it. I’m concerned (a) that the responses to the survey questions will have different meanings in different countries, (b) that there will be nonresponse bias due to the overrepresentation of English-speaking internet users, and (c) that these issues will be correlated with key variables in the study.
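For the technically curious, here’s a minimal R illustration of the measure and of concern (a), using simulated data rather than anything from the paper. D is the Mahalanobis distance between the two group mean vectors, computed relative to the pooled within-group covariance:

# Mahalanobis D between two groups on several made-up "facet" scores.
set.seed(7)
p <- 5
men   <- matrix(rnorm(500 * p), 500, p)
women <- matrix(rnorm(500 * p, mean = 0.2), 500, p)  # small true shift

mahal_D <- function(a, b) {
  S <- (cov(a) + cov(b)) / 2           # pooled covariance (equal group sizes)
  d <- colMeans(a) - colMeans(b)
  sqrt(drop(t(d) %*% solve(S) %*% d))
}
mahal_D(men, women)  # around 0.45 here; bigger = larger multivariate difference

# Add measurement noise (e.g., items read differently across countries):
# the same true difference now yields a smaller D, pulled toward zero.
noisy <- function(m) m + matrix(rnorm(length(m), sd = 2), nrow(m), ncol(m))
mahal_D(noisy(men), noisy(women))  # around 0.2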

2. Regarding the point about biological differences: Part of my concern is that I don’t really know what is meant by “biological gender differences” in this context. My other concern is that the conclusions of the paper relate to things not measured in the paper. For example, the paper refers to discrimination, but there’s nothing in the data about discrimination. And the paper refers to a gender equality policy, but there’s nothing in the paper about gender equality policies. Just in general, I’m wary about conclusions that don’t directly connect to the data.

The scandal isn’t what’s retracted, the scandal is what’s not retracted.

Andrew Han at Retraction Watch reports on a paper, “Structural stigma and all-cause mortality in sexual minority populations,” published in 2014 by Mark Hatzenbuehler, Anna Bellatorre, Yeonjin Lee, Brian Finch, Peter Muennig, and Kevin Fiscella, that claimed:

Sexual minorities living in communities with high levels of anti-gay prejudice experienced a higher hazard of mortality than those living in low-prejudice communities (Hazard Ratio [HR] = 3.03, 95% Confidence Interval [CI] = 1.50, 6.13), controlling for individual and community-level covariates. This result translates into a shorter life expectancy of approximately 12 years (95% C.I.: 4–20 years) for sexual minorities living in high-prejudice communities.

Hatzenbuehler et al. attributed some of this to “an 18-year difference in average age of completed suicide between sexual minorities in the high-prejudice (age 37.5) and low-prejudice (age 55.7) communities,” but the whole thing still doesn’t seem to add up. Suicide is an unusual cause of death. To increase the instantaneous probability of dying (that is, the hazard rate) by a factor of 3, you need to drastically increase the rate of major diseases, and it’s hard to see how living in communities with high levels of anti-gay prejudice would do that, after controlling for individual and community-level covariates.

Part 1. The original paper was no good.

Certain aspects of the design of the study were reasonable: they used some questions on the General Social Survey (GSS) about attitudes toward sexual minorities, aggregated these by geography to identify areas which, based on these responses, corresponded to positive and negative attitudes, then for each area they computed the mortality rate of a subset of respondents in that area, which they were able to do using “General Social Survey/National Death Index (GSS-NDI) . . . a new, innovative prospective cohort dataset in which participants from 18 waves of the GSS are linked to mortality data by cause of death.” They report that “Of the 914 sexual minorities in our sample, 134 (14.66%) were dead by 2008.” (It’s poor practice to call this 14.66% rather than 15%—it would be kinda like saying that Steph Curry is 6 feet 2.133 inches tall—but this is not important for the paper, it’s only an indirect sign of concern as it indicates a level of innumeracy on the authors’ part to have let this slip into the paper.)

The big problem with this paper—what makes it dead on arrival even before any of the data are analyzed—is that you’d expect most of the deaths to come from heart disease, cancer, etc., and the most important factor predicting death rates will be the age of the respondents in the survey. Any correlation between age of respondent and their anti-gay prejudice predictor will drive the results. Yes, they control for age in their analysis, but throwing in such a predictor won’t solve the problem. The next things to worry about are sex and smoking status, but really the game’s already over. To look at aggregate mortality here is to try to find a needle in a haystack. And all this is exacerbated by the low sample size. It’s hard enough to understand mortality-rate comparisons with aggregate CDC data. How can you expect to learn anything from a sample of 900 people?

Once the analysis comes in, the problem becomes even clearer, as the headline result is a 3-fold increase in the hazard rate, which is about as plausible as that claim we discussed a few years ago that women were three times more likely to wear red or pink clothing during certain times of the month, or the claim we discussed a few years before that, that beautiful parents were 36 percent more likely to have girls.

Beyond all that—and in common with these two earlier studies—this new paper has forking paths, most obviously in that the key predictor is not the average anti-gay survey response by area, but rather an indicator for whether the area is in the top quartile.

To quickly summarize:
1. This study never had a chance of working.
2. Once you look at the estimate, it doesn’t make sense.
3. The data processing and analysis had forking paths.

Put this all together and you get: (a) there’s no reason to believe the claims in this paper, and (b) you can see how the researchers obtained statistical significance, which fooled them and the journal editors into thinking that their analysis told them something generalizable to the real world.

Part 2. An error in the data processing.

So far, so boring. A scientific journal publishes a paper that makes bold, unreasonable claims based on a statistical analysis of data that could never realistically have supported such an analysis. Hardly worth noting in Retraction Watch.

No, what made the story noteworthy was the next chapter, when Mark Regnerus published a paper, “Is structural stigma’s effect on the mortality of sexual minorities robust? A failure to replicate the results of a published study,” reporting:

Hatzenbuehler et al.’s (2014) study of structural stigma’s effect on mortality revealed an average of 12 years’ shorter life expectancy for sexual minorities who resided in communities thought to exhibit high levels of anti-gay prejudice . . . Attempts to replicate the study, in order to explore alternative hypotheses, repeatedly failed to generate the original study’s key finding on structural stigma.

Regnerus continues:

The effort to replicate the original study was successful in everything except the creation of the PSU-level structural stigma variable.

Regnerus attributes the problem to the authors’ missing-data imputation procedure, which was not clearly described in the original paper. It’s also a big source of forking paths:

Minimally, the findings of Hatzenbuehler et al. (2014) study of the effects of structural stigma seem to be very sensitive to subjective decisions about the imputation of missing data, decisions to which readers are not privy. Moreover, the structural stigma variable itself seems questionable, involving quite different types of measures, the loss of information (in repeated dichotomizing) and an arbitrary cut-off at a top-quartile level. Hence the original study’s claims that such stigma stably accounts for 12 years of diminished life span among sexual minorities seems unfounded, since it is entirely mitigated in multiple attempts to replicate the imputed stigma variable.

Regnerus also points out the sensitivity of any conclusions to confounding in the observational study. Above, I mention age, sex, and smoking status; Regnerus mentions ethnicity as a possible confounding variable. Again, controlling for such variables in a regression model is a start, but only a start; it is a mistake to think that if you throw a possible confounder into a regression, that you’ve resolved the problem.

A couple months later, a correction note by Hatzenbuehler et al. appeared in the journal:

Following the publication of Regnerus’s (2017) paper, we hired an independent research group . . . A coding error was discovered. Specifically, the data analyst mis-specified the time variable for the survival models, which incorrectly addressed the censoring for individuals who died. The time variable did not correctly adjust for the time since the interview to death due to a calculation error, which led to improper censoring of the exposure period. Once the error was corrected, there was no longer a significant association between structural stigma and mortality risk among the sample of 914 sexual minorities.

I can’t get too upset about this particular mistake, given that I did something like that myself a few years back!
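For readers who haven’t fit survival models, here’s a minimal sketch of the bookkeeping that apparently went wrong (hypothetical field names; the real data processing was more involved). The time at risk should run from the interview to death, or to the end of follow-up for those still alive:

```python
from datetime import date

def time_at_risk(interview: date, death: date | None, study_end: date):
    """Return (years at risk, died) for one respondent, with
    right-censoring at the end of follow-up."""
    end = death if death is not None else study_end
    died = death is not None and death <= study_end
    years = (min(end, study_end) - interview).days / 365.25
    return years, died

# Interviewed mid-1990, died early 1997, follow-up through 2008:
print(time_at_risk(date(1990, 6, 1), date(1997, 3, 15), date(2008, 12, 31)))
```

Start the clock at the wrong origin, or censor at the wrong endpoint, and every respondent’s exposure period is mis-measured, which is roughly the kind of slip the correction describes.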

Part 3. Multiple errors in the same paper.

But here’s something weird: The error reported by Hatzenbuehler et al. in their correction is completely unrelated to the errors reported by Regnerus. The paper had two completely different data problems, one involving imputation and the construction of the structural stigma variable, and another involving the measurement of survival time. And both of these errors are distinct from the confounding problem and the noise problem.

How did this all happen? It’s surprisingly common to see multiple errors in a single published paper. It’s rare for authors to reach the frequency of a Brian Wansink (150 errors in four published papers, and that was just the start of it) or a Richard Tol (approaching the Platonic ideal of more errors in a paper than there are data points), but multiple unrelated errors—that happens all the time.

Why does this happen? Here are a few reasons:

1. Satisficing. The goal of the analysis is to obtain statistical significance. Once that happens, there’s not much reason to check.

2. Over-reliance on the peer-review process. This happens in two ways. First, you might be less careful about checking your manuscript, knowing that three reviewers will be going through it anyway. Second—and this is the real problem—you might assume that reviewers will catch all the errors, and then when the paper is accepted by the journal you’d think that no errors remain.

3. A focus on results. The story of the paper is that prejudice is bad for your health. (See for example this Reuters story by Andrew Seaman, linked to from the Retraction Watch article.) If you buy the conclusion, the details recede.

There’s no reason to think that errors in scientific papers are rare, and there’s no reason to be surprised that a paper with one documented error will have others.

Let’s give credit where due. It was a mistake of the journal Social Science and Medicine to publish that original paper, but, on the plus side, they published Regnerus’s criticism. Much better to be open with criticism than to play defense.

I’m not so happy, though, with Hatzenbuehler et al.’s correction notice. They do cite Regnerus, but they don’t mention that the error he noticed is completely different from the error they report. Their correction notice gives the misleading impression that their paper had just this one error. That ain’t cool.

Part 4. The reaction.

We’ve already discussed the response of Hatzenbuehler et al., which is not ideal but I’ve seen worse. At least they admitted one of their mistakes.

In the Retraction Watch article, Han linked to some media coverage, including this news article by Maggie Gallagher in the National Review with the headline, “A widely reported study on longevity of homosexuals appears to have been faked.” I suppose that’s possible, but my guess is that the errors came from sloppiness, plus not bothering to check the results.

Han also reports:

Nathaniel Frank, a lawyer who runs the What We Know project, a catalog of research related to LGBT issues housed at Columbia Law School, told Retraction Watch:

Mark Regnerus destroyed his scholarly credibility when, as revealed in federal court proceedings, he allowed his ideological beliefs to drive his conclusions and sought to obscure the truth about how he conducted his own work. There’s an enormous body of research showing the harms of minority stress, and Regnerus is simply not a trustworthy critic of this research.

Hey. Don’t shoot the messenger. Hatzenbuehler et al. published a paper that had multiple errors, used flawed statistical techniques, and came up with an implausible conclusion. Frank clearly thinks this area of research is important. If so, he should be thanking Regnerus for pointing out problems with this fatally flawed paper and for motivating Hatzenbuehler et al. to perform the reanalysis that revealed another mistake. Instead of thanking Regnerus, Frank attacks him. What’s that all about? You can dislike a guy but that shouldn’t make you dismiss his scientific contribution. Yes, maybe sometimes it takes someone predisposed to be skeptical of a study to go to the trouble to find its problems. We should respect that.

Part 5. The big picture.

As the saying goes, the scandal isn’t what’s illegal, the scandal is what’s legal. Michael Kinsley said this in the context of political corruption. Here I’m not talking about corruption or even dishonesty, just scientific error—which includes problems in data coding, incorrect statistical analysis, and, more generally, grandiose claims that are not supported by the data.

This particular paper has been corrected, but lots and lots of other papers have not, simply because they have no obvious single mistake or “smoking gun.”

The scandal isn’t what’s retracted, the scandal is what’s not retracted. All sorts of papers whose claims aren’t supported by their data.

The Retraction Watch article reproduces the cover of the journal Social Science and Medicine:

Hmmm . . . what’s that number in the circle on the bottom right? “2.742,” it appears? I couldn’t manage to blow up this particular image but I found something very similar in high resolution on Wikipedia:

“Most Cited Social Science Journal” “Impact Factor 2.814.”

I have no particular problem with the journal, Social Science and Medicine. They just happened to have the good fortune that, after publishing a paper with major problems, they got the chance to correct it. Standard practice, unfortunately, is for errors in the published record not to be noticed, or, when they are, for the authors of the paper to respond with denials that there was ever a problem.

Against Arianism

“I need some love like I’ve never needed love before” – Geri, Mel C, Mel B, Victoria, Emma (noted Arianists) 

I spent most of today on a sequence of buses shuttling between cities in Ontario, so I’ve been thinking a lot about fourth-century heresies.

That’s an obvious lie. But I think we all know by now that I love to torture a metaphor. (Never forget Diamanda Galas. This is not a metaphor. Just solid advice.)

But why Arianism specifically? Well it’s an early Christian heresy that posited that Jesus was created by God and thus not the same as God. This view was emphatically rejected by the church leadership and in 325 the First Council of Nicaea wrote this rejection explicitly into the Nicene Creed, which these days reads 

I believe in one Lord Jesus Christ / … / Begotten, not made, consubstantial with the Father

(Yes to all the other Catholics who have lapsed or fallen: a few years back they decided “of one being with the Father” was too understandable. If you don’t know the difference between a lapsed and a fallen Catholic either it doesn’t matter to you, or you’re the former.)

This isn’t even my favourite early Christian heresy: if I ever find a solid reason to use Docetism (short version: Jesus was a hologram) as a metaphor, you better believe I will. (They didn’t need to construct a whole creed to rid themselves of that one.)

Whyyyyyyyyyyy? (extreme Annie Lennox voice)

The first sentence of this post was a lie. I actually spent those five hours thinking about Satan.

I get horrifically motion sick if I try to read anything on a bus, so I used today as an opportunity to catch up on the only podcast I subscribe to. And that podcast is a combination of an audiobook and an academic discussion of the meaning and context (past and current) of Milton’s Paradise Lost. It’s done by Anthony Oliveira, and as well as being an excellent performance of the epic poem, it includes discussion that’s so much fun you’ll be struggling to stop thinking about Satan. (Anthony has a PhD in this and, unlike people with stats PhDs, he has good communication skills. So you get expert knowledge in an accessible package!) It is well worth the $3 per month (at this point $3 gets 21 episodes, so it’s an absolute bargain).

The Perils of Pauline (Theology) 

If I have a point in all of this, I probably should’ve expressed it by now. But I do. I’m just bad at writing. And unsurprisingly it will be a point I’ve sort of made before. But one I made without such a grouse metaphor.

Like Gaul, Bayesian data analysis is usually divided into three parts: the likelihood, the prior, and … well, depending on the point that I’m going to make I’d either say the data or the computation. But as I don’t care about the third part at the moment, feel free to pick your favourite. (And let’s take it as a sign of personal growth [or Miltonic inspiration] that I’ve swerved into a pre-Christian metaphor even though there’s a perfectly obvious other option.) (Added later: John Ormerod suggested the third part should be the posterior, proving he has a better grip on my metaphors than I do.)

I’ve written before (with Andrew and Michael) about how the prior can (often) only be understood in the context of the likelihood [qualifying adjective added in review], but this is a more realistic metaphor. Because there is no compulsion to speak only of Jesus in the context of God. Instead, they are consubstantial; different but not interchangeable beings made of the same stuff.

But we’ve maybe hit the point of the post where my metaphor falls apart. Because the trinitarian view of god is often stated hierarchically as the father, son, and holy spirit, even if they are co-eternal and of the same substance. And our previous paper was also hierarchical: the prior was only to be understood in the light of the (superior) likelihood.

But reality is more nuanced. If I had to write that title again, I’d say this: The prior is consubstantial with the likelihood. (This is why post-publication revisions shouldn’t be encouraged)

That is not to say that they’re the same. The likelihood typically contains our hypothesized generative mechanism as well as information about how that mechanism was measured. On the other hand, the prior will encode our hypotheses about the constituent parts of both the generative and measurement processes. My point is that while it’s true that the prior should be considered in light of the likelihood, it is equally true that the likelihood should be considered in light of the prior.

Bayes from a homoousian viewpoint (rather than a homoiousian one).

So Data Science isn’t so much the ability to set a reasonable prior for a given problem (as suggested by a twitter soul with the excellent name “daniel”). Instead, it is the ability to use the same scientific knowledge (substance) to simultaneously build both the prior and the likelihood.
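Here’s a minimal numerical sketch of what I mean (a toy example of my own, not from any real analysis). Suppose we’re estimating a newborn’s true weight from a cheap scale: knowledge about newborns builds the prior, knowledge about the instrument builds the likelihood, and in the conjugate normal-normal posterior the two enter on exactly the same footing.

```python
import numpy as np

# Substantive knowledge about newborns -> the PRIOR:
# birth weights are around 3.5 kg, give or take about 0.5 kg.
prior_mean, prior_sd = 3.5, 0.5

# Substantive knowledge about the instrument -> the LIKELIHOOD:
# the scale is unbiased with measurement error of about 0.2 kg.
meas_sd = 0.2
measurements = np.array([3.91, 3.88])

# Conjugate normal-normal posterior: precisions add, and the mean is
# precision-weighted, so prior and likelihood are treated symmetrically.
prior_prec = 1 / prior_sd**2
data_prec = len(measurements) / meas_sd**2
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * measurements.mean()) / post_prec
print(f"posterior: {post_mean:.2f} kg +/- {post_prec ** -0.5:.2f} kg")
```

The 0.5 and the 0.2 come from the same kind of scientific reasoning about the same substance; neither ingredient is more “objective” than the other.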

Work that only considers one of these problems (and I’ve definitely written a bunch of those sorts of papers, including my favourite of my papers) is definitely useful but is incomplete. We must resist Arianism at all times!

The life of the world to come

I want to round out this post with some cultural things. It’s summer, so I’ve been actually enjoying myself, which is to say I’ve been consuming media like something that consumes a lot of media. So here’s some stuff (also – it’s nice to show that I know how to make a fold, just to point out that all these overbearingly long posts have been on purpose).


The competing narratives of scientific revolution

Back when we were reading Karl Popper’s Logic of Scientific Discovery and Thomas Kuhn’s Structure of Scientific Revolutions, who would’ve thought that we’d be living through a scientific revolution ourselves?

Scientific revolutions occur on all scales, but here let’s talk about some of the biggies:

1850-1950: Darwinian revolution in biology, changed how we think about human life and its place in the world.

1890-1930: Relativity and quantum revolutions in physics, changed how we think about the universe.

2000-2020: Replication revolution in experimental science, changed our understanding of how we learn about the world.

When it comes to technical difficulty and the sheer importance of the scientific theories being disputed, this recent replication revolution is far more trivial than the earlier revolutions in biology and physics. Still, within its narrow parameters, a revolution it is. And, to the extent that the replication revolution affects research in biology, medicine, and nutrition, its real-world implications do go a bit beyond the worlds of science and the news media. The replication revolution has also helped us understand statistics better, and so I think it potentially can have large indirect effects, not just for ESP, beauty and sex ratio, etc., but for all sorts of problems in science and engineering where statistical data collection and analysis are being used, from polling to genomics to risk analysis to education policy.

Revolutions can be wonderful and they can be necessary—just you try to build a transistor using 1880s-style physics, or to make progress in agriculture using the theories of Trofim Lysenko—but the memory of triumphant past revolutions can perhaps create problems in current research. Everybody wants to make a discovery, everybody wants to be a hero. The undeniable revolutionary successes of evolutionary biology have led to a series of hopeless attempted revolutions of the beauty-and-sex-ratio variety.

The problem is what Richard Feynman called cargo-cult science: researchers try to create new discoveries following the template of successes of the past, without recognizing the key roles of strong theory and careful measurement.

We shouldn’t take Kuhn’s writings as gospel, but one thing he wrote about that made sense to me is the idea of a paradigm or way of thinking.

Here I want to talk about something related: the storylines or narratives that run in parallel with the practice of science. These stories are told by journalists, or by scientists themselves; they appear in newspapers and movies and textbooks, and I think it is from these stories that many of our expectations arise about what science is supposed to be.

In this discussion, I’ll set aside, then, stories of science as savior, ensuring clean baths and healthy food for all; or science as Frankenstein, creating atomic bombs, deadly plagues, etc.; or other stories in between. Instead, I’ll focus on the process of science and not its effects on the larger world.

What, then, are the stories of the scientific process?

Narrative #1: Scientist as hero, discovering secrets of nature. The hero might be acting alone, or with a sidekick, or as part of a Mission-Impossible-style team; in any case, it’s all about the discovery. This was the narrative of Freakonomics, it’s the narrative of countless Gladwell articles, and it’s the narrative we were trying to promote in Red State Blue State. The goal of making discoveries is one of the big motivations of doing science in the first place, and the reporting of discovery is a big part of science writing.

But then some scientists push it too far. It’s no surprise that, if scientists are given nearly uniformly positive media coverage, they will start making bigger and bigger claims. It’s gonna keep happening until something stops it. There have been the occasional high-profile cases of scientific fraud, and these can shake public trust in science, but, paradoxically, examples of fraud can give “normal scientists” (to use the Kuhnian term) a false sense of security: Sure, Diederik Stapel was disgraced, but he faked his data. As long as you don’t fake your data (or if you’re not in the room where it happens). And I don’t think many scientists are actively faking it.

And then the revolution, which comes in three steps:

1. Failed replications. Researchers who are trying to replicate respected studies—sometimes even trying to replicate their own work—are stunned to find null results.

2. Questionable research practices. Once a finding comes into question, either from a failed replication or on theoretical grounds that the claimed effect seems implausible, you can go back to the original published paper, and often then a lot of problems appear in the measurement, data processing, and data analysis. These problems, if found, were always there, but the original authors and reviewers just didn’t think to look, or didn’t notice the problems because they didn’t know what to look for.

3. Theoretical and statistical analysis. Some unreplicated studies were interesting ideas that happened not to work out. For example, could intervention X really have had large and consistent effects on outcome Y? Maybe so. Before actually gathering the data, who knows? Hence it would be worth studying. Other times, an idea never really had a chance: it’s the kangaroo problem, where the measurements were too noisy to possibly detect the effect being studied. In that beauty-and-sex-ratio study, for example, we calculate that the sample size was about 1/100 of what would be needed to detect anything. This sort of design analysis is mathematically subtle—considering the distribution of the possible results of an experiment is tougher than simply analyzing a dataset once. It can be done by simulation, though, as in the sketch below.
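A hedged version of that simulation (the “small effect, big noise” numbers here are invented for illustration, not taken from the actual study): generate the estimates a noisy experiment could produce, then look at what the statistically significant ones would claim.

```python
import numpy as np

rng = np.random.default_rng(3)

true_effect = 0.3  # a plausibly small true effect (invented for illustration)
se = 3.0           # the standard error implied by a small, noisy study
n_sims = 100_000

# The estimates this experiment could plausibly yield.
estimates = rng.normal(true_effect, se, size=n_sims)
sig = np.abs(estimates) > 1.96 * se

power = sig.mean()
exaggeration = np.abs(estimates[sig]).mean() / true_effect  # Type M error
wrong_sign = (estimates[sig] < 0).mean()                    # Type S error

print(f"power: {power:.2f}")                       # about 0.05
print(f"exaggeration ratio: {exaggeration:.0f}x")  # roughly 23x
print(f"wrong sign among significant: {wrong_sign:.2f}")  # about 0.4
```

With numbers like these, a “significant” finding is essentially guaranteed to overstate the true effect by an order of magnitude, and it has a substantial chance of pointing the wrong way.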

Points 1, 2, and 3 reinforce each other. A failed replication is not always so convincing on its own—after all, in the human sciences, no replication is exact, and the question always arises: What about that original, successful study? Once we know about questionable research practices, we can understand how those original researchers could’ve reported a string of statistically significant p-values, even from chance alone. And then the theoretical analysis can give us a clue of what might be learned from future studies. Conversely, even if you have a theoretical analysis showing that a study is hopeless, along with clear evidence of forking paths and even more serious data problems, it’s still valuable to see the results of an external replication.

And that leads us to . . .

Narrative #2: Science is broken. The story here is that scientists are incentivized to publish, indeed to pile up publications in prestige journals, which in turn are incentivized to get citations and media exposure. Put this together and you get a steady flow of hype, with little motivation to do the careful work of serious research. This narrative is supported by high-profile cases of scientific fraud, but what really made it take off was the realization that top scientific journals were regularly publishing papers that did not replicate, and in many cases these papers had made claims that were pretty ridiculous—not necessarily a priori false, and big news if they were true, but silly on their face, and even harder to take seriously after the failed replications and revelations of questionable research practices.

The “science is broken” story has often been framed as scientists being unethical, but this can be misleading, and I’ve worked hard to separate the issue of poor scientific practice from ethical violations. A study could be dead on arrival, but if the researcher in question doesn’t understand the statistics, then I wouldn’t call the behavior unethical. One reason I prefer the term “forking paths” to “p-hacking” is that, to my ear, “hacking” implies intentionality.

At some point, ethical questions do arise, not so much with the original study as with later efforts to dodge criticism. After a certain point, ignorance is no excuse. But statistics is hard, and I think we should be able to severely criticize a study without that implying a criticism of the ethics of its authors.

Unfortunately, not everyone takes criticism well, and this has led some of the old guard to argue . . .

Narrative #3: Science is just fine. Hence we get claims such as “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%” and “Psychology is not in crisis, contrary to popular rumor. . . . All this is normal science . . . National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will improve, and researchers will remember their training and learn new standards.”

But this story didn’t fly. There were just too many examples of low-quality work getting the royal treatment from the scientific establishment. The sense that something was rotten had spread beyond academia into the culture at large. Even John Oliver got in a few licks.

Hence the attempt to promote . . .

Narrative #4: Attacking the revolutionaries. This tactic is not new—a few years ago, a Harvard psychology professor made some noise attacking the “replication police” as “shameless little bullies” and “second stringers”, and a Princeton psychology professor wrote about “methodological terrorism”—but from my perspective it ramped up more recently when a leading psychologist lied about me in print, and then when various quotes from this blog were taken out of context to misleadingly imply that critics of unreplicated work in the psychology literature were “rife with vitriol . . . vicious . . . attacks . . . threatening.”

I don’t see Narrative 4 having much success. After all, the controversial scientific claims still aren’t replicating, and more and more people—scientists, journalists, and even (I hope) policymakers—are starting to realize that “p less than 0.05” ain’t all that. You can shoot the messengers all you like; the message still isn’t going anywhere. Also, from a sociology-of-science perspective, shooting the messenger misses the point: I’m pretty sure that even if Paul Meehl, Deborah Mayo, John Ioannidis, Andrew Gelman, Uri Simonsohn, Anna Dreber, and various other well-known skeptics had never been born, a crisis would still have arisen regarding unreplicated and unreplicable research results that had been published and publicized in prestigious venues. I’d like to believe that our work, and that of others, has helped us better understand the replication crisis, and can help lead us out of it, but the revolution would be just as serious without us. Calling us “terrorists” isn’t helping any.

OK, so where do these four narratives stand now?

Narrative #1: Scientist as hero. This one’s still going strong. Malcolm Gladwell, Freakonomics, that Tesla guy who’s building a rocket to Mars—they’re all thriving. And, don’t get me wrong, I like the scientist-as-hero story. I’m no hero, but I do consider myself a seeker after truth, and I don’t think it’s all hype to say so. Just consider some analogies: Your typical firefighter is no hero but is still an everyday lifesaver. Your typical social worker is no hero but is still helping people improve their lives. Your typical farmer is no hero but is still helping to feed the world. Etc. I’m all for a positive take on science, and on scientists. And, for that matter, Gladwell and the Freakonomics team have done lots of things that I like.

Narrative #2: Science is broken. This one’s not going anywhere either. Recently we’ve had that Pizzagate professor from Cornell in the news, and he’s got so many papers full of errors that the drip-drip-drip on his work could go on forever. Meanwhile, some of the rhetoric has improved but the incentives for scientists and scholarly journals haven’t changed much, so we can expect a steady stream of weak, mockable papers in top journals, enough to continue feeding the junk-science storyline.

As long as there is science, there will be bad science. The problem, at least until recently, is that some of the bad science was getting a lot of promotion from respected scientific societies and from respected news outlets. The success of Narrative 2 may be changing that, which in turn will, I hope, lead to a decline in Narrative 2 itself. To put it in more specific terms, when a paper on “the Bible Code” appears in Statistical Science, an official journal of the Institute of Mathematical Statistics, then, yes, science—or, at least, one small corner of it—is broken. If such papers only appear in junk journals and don’t get serious media coverage, then that’s another story. After all, we wouldn’t say that science is broken just cos astrology exists.

Narratives #3 and 4: Science is just fine, and Attacking the revolutionaries. As noted above, I don’t see narrative 3 holding up. As various areas of science right themselves, they’ll be seen as fine, but I don’t think the earlier excesses will be forgotten. That’s part of the nature of a scientific revolution, that it’s not seen as a mere continuation of what came before. I’m guessing that scientists in the future will look in wonderment, imagining how it is that researchers could ever have thought that it made sense to treat science as a production line in which important discoveries were made by pulling statistically significant p-values out of the froth of noise.

As for the success of Narrative 4, who knows? The purveyors of Narrative 4 may well succeed in their short-term goal of portraying particular scientific disagreements in personal terms, but I can’t see this effort having the effect of restoring confidence in unreplicated experimental claims, or restoring the deference that used to be given to papers published in prestigious journals. To put it another way, consider that one of the slogans of the defenders of the status quo is “Call off the revolutionaries.” In the United States, being a rebel or a revolutionary is typically considered a good thing. If you’re calling the other side “revolutionaries,” you’ve probably already lost.

An alternative history

It all could’ve gone differently. Just as we can imagine alternative streams of history where the South did not fire on Fort Sumter, or where the British decided to let go of India in 1900, we can imagine a world in which the replication revolution in science was no revolution at all, but just a gradual reform: a world in which the beauty-and-sex-ratio researcher, after being informed of his statistical errors in drawing conclusions from what were essentially random numbers, had stepped back and recognized that this particular line of research was a dead end, that he had been, in essence, trying to send himself into orbit using a firecracker obtained from the local Wal-Mart; a world in which the ovulation-and-clothing researchers, after recognizing that their data were so noisy that their results could not be believed and after recognizing they had so many forking paths that their p-values were meaningless, had decided to revamp their research program, improve the quality of their measurements, and move to within-person comparisons; a world in which the celebrated primatologist, after hearing from his research associates that his data codings were questionable, had openly shared his videotapes and fully collaborated with his students and postdocs to consider more general theories of animal behavior; a world in which the ESP researcher, after seeing others point out that forking paths made his p-values uninterpretable and after seeing yet others fail to replicate his study, had recognized that his research had reached a dead end—no shame in that, we all reach dead ends, and the very best researchers can sometimes spend decades on a dead end; it happens; for that matter, what if Andrew Wiles had never reached the end of his particular tunnel and Fermat’s last theorem had remained standing, would we then say that Wiles had wasted his career? No, far from it; there’s honor in pursuing a research path to its end—; a world in which the now-notorious business school professor who studied eating behavior had admitted from the very beginning—it was at least six years ago now that he first heard from outsiders about the crippling problems with his published empirical work—that he had no control over the data reported in his papers, and had stopped trying to maintain that all his claims were valid, and had instead worked with colleagues to design careful experiments with clean data pipelines and transparent analyses; a world in which that controversial environmental economist had taken the criticism of his work to heart, instead of staying in debate mode had started over, and instead of continuing to exercise his talent for getting problematic papers published in good journals had decided to spend a couple years disentangling those climate-and-economics models he’d been treating as data points and really working out their implications; a world in which the dozens of researchers who had prominent replication failures or serious flaws in their published work had followed those leaders mentioned above and had used this adversity as an opportunity for reflection and improvement, as an aside thanking their replicators and critics for going to the trouble of taking their work seriously enough to find its problems; in which thousands of researchers whose research hadn’t been checked by others had gone ahead and checked their own work, not wanting to publish claims that would not replicate.
In this alternative world, there’s no replication crisis at all, just a gradual reassessment of past work, leading gently into a new paradigm of careful measurement and within-person comparison.

Why the revolution?

We tend to think of revolutions as inevitable. The old regime won’t budge, the newcomers want to install a new system, hence a revolution. Or, in scientific terms, we assume there’s no way to reconcile an old paradigm with a new one.

In the case of the replication crisis, the old paradigm is to gather any old data, find statistical significance in a series of experiments, and then publish and publicize the results. The experiments are important, the conclusions are important, but the actual gathering of data is pretty arbitrary. In the new paradigm, the connection of measurement to theory is much more important. On the other hand, the new paradigm is not entirely new, if we consider fields such as psychometrics.

As remarked above, I don’t think the revolution had to happen; I feel that we could’ve gone from point A to point B in a more peaceful way.

So, why the revolution? Why not just incremental corrections and adjustments? Research team X does a study that gets some attention, others follow up and apparently confirm it. But then, a few years later, team Y comes along with an attempted replication that fails. Later unsuccessful replications follow, along with retrospective close readings of the original papers that reveal forking paths and open-ended theories. So far, no problem. This is just “normal science,” right?

So here’s my guess as to what happened. The reform became a revolution as a result of the actions of the reactionaries.

Part of the difficulty was technical: statistics is hard, and when the first ideas of reform came out, it was easy for researchers to naively think that statistical significance trumped all objections. After all, if you write a paper with 9 different experiments, and each has a statistically significant p-value, then the probability of all that success, if really there were no effect, is (1/20)^9. That’s a tiny number which at first glance would seem impervious to technicalities of multiple comparisons. Actually, though, no: forking paths multiply as fast as p-values. But it took years of exposure to the ideas of Ed Vul, Hal Pashler, Greg Francis, Uri Simonsohn, and others to get this point across.
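That multiplication is easy to check by simulation. A sketch with invented analysis choices (the mechanism, not any particular paper): generate pure noise, allow a few defensible-looking decisions, and count how often at least one path comes up significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, n = 10_000, 100
hits = 0

for _ in range(n_sims):
    y = rng.normal(size=n)              # pure noise: no effect anywhere
    group = rng.integers(0, 2, size=n)  # an arbitrary subgrouping
    a, b = y[group == 0], y[group == 1]
    keep = np.abs(y) < 2                # an "outlier" rule we might have used
    pvals = [
        stats.ttest_ind(a, b).pvalue,                    # compare the groups
        stats.ttest_ind(y[keep & (group == 0)],
                        y[keep & (group == 1)]).pvalue,  # ...minus "outliers"
        stats.ttest_1samp(a, 0).pvalue,                  # one subgroup vs zero
        stats.ttest_1samp(b, 0).pvalue,                  # the other vs zero
    ]
    hits += min(pvals) < 0.05

print(f"at least one p < 0.05 in {hits / n_sims:.0%} of noise-only datasets")
```

Even these four highly correlated paths push the per-experiment “success” rate well above 5%; with the dozens of forks available in a real analysis, a string of nine significant results stops looking like a (1/20)^9 miracle.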

Another difficulty is attachment to particular scientific theories or hypotheses. One way I’ve been trying to help with this one is to separate the scientific models from the statistical models. Sometimes you gotta get quantitative. For example, centuries of analysis of sex ratios tell us that variations in the percentage of girl births are small. So theories along these lines will have to predict small effects. This doesn’t make the theories wrong, it just implies that we can’t discover them from a small survey, and it should also open us up to the possibility of correlations that are positive for some populations in some settings, and negative in others. Similarly in fMRI studies, or social psychology, or whatever: The theories can have validity even if they can’t be tested in sloppy experiments. This could be taken as a negative message—some studies are just dead on arrival—but it can also be taken positively: just cos a particular experiment or set of experiments are too noisy to be useful, it doesn’t mean your theory is wrong.
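To make the sex-ratio example concrete, here’s a back-of-the-envelope sample-size calculation (my illustrative numbers: a baseline of about 48.8% girls and a hypothesized half-percentage-point difference between groups, roughly the magnitude the literature would support):

```python
from scipy import stats

p = 0.488          # baseline Pr(girl birth)
d = 0.005          # hypothesized difference between groups
alpha, power = 0.05, 0.80

z_a = stats.norm.ppf(1 - alpha / 2)  # 1.96
z_b = stats.norm.ppf(power)          # 0.84
n_per_group = 2 * (z_a + z_b) ** 2 * p * (1 - p) / d ** 2
print(f"births needed per group: {n_per_group:,.0f}")  # about 157,000
```

A survey of a few thousand births has no chance here; that’s the kangaroo problem in one line.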

To stand by bad research just because you love your scientific theory: that’s a mistake. Almost always, the bad research is noisy, inconclusive research: careful reanalysis or failed replication does not prove the theory wrong, it just demonstrates that the original experiment did not prove the theory to be correct. So if you love your theory (for reasons other than its apparent success in a noisy experiment), then fine, go for it. Use the tools of science to study it for real.

The final reason for the revolution is cost: the cost of giving up the old approach to science. That’s something that was puzzling me for a while.

My thinking went like this:
– For the old guard, sure, it’s awkward for them to write off some of the work they’ve been doing for decades—but they still have their jobs and their general reputations as leaders in their fields. As noted above, there’s no embarrassment in pursuing a research dead end in good faith; it happens.
– For younger researchers, yes, it hurts to give up successes that are already in the bank, as it were, but they have future careers to consider, and so why not just take the hit, accept the sunk cost, and move on.

But then I realized that it’s not just a sunk cost; it’s also future costs. Think of it this way: If you’re a successful scientific researcher, you have a kind of formula or recipe, your own personal path to success. The path differs from scientist to scientist, but if you’re in academia, it involves publishing, ideally in top journals. In fields such as experimental biology and psychology, it typically involves designing and conducting experiments, obtaining statistically significant results, and tying them to theory. If you take this pathway away from a group of researchers—for example, by telling them that the studies that they’ve been doing, and that they’re experts in, are too noisy to be useful—then you’re not just wiping out their (reputational) savings, you’re also removing their path to future earnings. You’re not just taking away their golden eggs, you’re repossessing the goose they were counting on to lay more of them.

It’s still a bad idea for researchers to dodge criticism and to attack the critics who are trying so hard to help. But on some level, I understand it, given the cost both in having to write off past work and in losing the easy path to continuing future success.

Just remember that, for each of these people, there may well be three other young researchers who were doing careful, serious work but then didn’t get picked for a plum job or promotion because it was too hard to compete with other candidates who did sloppy but flashy work that got published in top journals. It goes both ways.

Summary (for now)

We are in the middle of a scientific revolution involving statistics and replication in many areas of science, moving from an old paradigm in which important discoveries are a regular, expected product of statistically significant p-values obtained from routine data collection and analysis, to a new paradigm of . . . weeelll, I’m not quite sure what the new paradigm is. I have some ideas related to quality control, and when it comes to the specifics of design, data collection, and analysis, I recommend careful measurement, within-person comparisons, and multilevel models. Compared to ten years ago, we have a much better sense of what can go wrong in a study, and a lot of good ideas of how to do better. What we’re still struggling with is the big picture, when we move away from the paradigm of routine discovery to a more continuous sense of scientific progress.

Let’s get hysterical

Following up on our discussion of hysteresis in the scientific community, Nick Brown points us to this article from 2014, “Excellence by Nonsense: The Competition for Publications in Modern Science,” by Mathias Binswanger, who writes:

To ensure the efficient use of scarce funds, the government forces universities and professors, together with their academic staff, to permanently take part in artificially staged competitions. . . . how did this development occur? Why did successful and independent universities forget about their noble purpose of increasing knowledge and instead degenerated into “publication factories” and “project mills” which are only interested in their rankings?

Here we should distinguish between natural and artificial competitions. For example, if students get to choose which universities to attend and staff get to choose where to work, then universities will need to compete for both students and staff. But competition for government research grants could be considered artificial, in that an alternative would be simply to distribute the same amount of public funds among universities according to some formula.

As Binswanger notes, what works for the top research universities might not make sense more generally:

How can you impress the research commissions responsible for the distribution of funds? This is mainly achieved by increasing measurable output such as publications, projects funded by third-party funds, and networks with other institutes and universities. In this way, “excellence” is demonstrated, in turn leading to easier access to further government research funds. Competitiveness has therefore become a priority for universities and their main goal is to perform as highly as possible in measurable indicators which play an important role in these artificially staged competitions.

One might say that there is no real alternative to this sort of competition—but fifty years ago, or maybe even thirty years ago, the above picture would not have reflected what was happening at many universities.

Binswanger continues:

Relevant publications are in professional journals, where submitted work is subjected to a “rigorous” and “objective” selection method: the so-called “peer-review process”. . . . However, among scientific journals strict hierarchies also exist which are supposed to represent the average “quality” of the accepted papers. In almost every scientific discipline there are a few awe-inspiring top-journals (A-journals), and then there are various groups of less highly respected journals (B- and C- journals), where it is easier to place an article, but where the publication does not have the same significance as an A-journal article. Publishing one’s work in an A-journal is therefore the most important and often also the only aim of modern scientists, thus allowing them to ascend to the “Champions’ League” of their discipline. Belonging to this illustrious club makes it easier to publish further articles in A-journals, to secure more research funds, to conduct even more expensive experiments, and, therefore, to become even more excellent. The “Taste for Science”, described by Merton (1973), which is based on intrinsic motivation and supposed to guide scientists was replaced by the extrinsically motivated “Taste for Publications.”

I’d like to stop here and issue a mild dissent. Yes, there is some extrinsic motivation to publish in top journals, a motivation which I don’t feel much right now but which was a big deal for my colleagues and me when we were younger. Even now, though, I’d like to publish in top journals, not so much for the league standings or even to help out my younger colleagues, but because I feel that papers in such journals are more likely to be read and to make a difference. But I don’t really know how true that is anymore; it may just be habit that I retain a weak preference for publishing in higher-ranked venues.

Binswanger continues:

At the end of the peer review process, the reviewers inform the editor in writing whether they plead for acceptance (very rare), revision, or rejection (most common) of the article submitted to the journal in question. Quite a few top journals pride themselves on high rejection rates, supposedly reflecting the high quality of these journals . . . For such journals the rejection rates amount to approximately 95%, which encourages the reviewers to reject manuscripts in almost all cases in order to defend this important “quality measure”. Solely manuscripts that find favor with their reviewers get published . . .

And thus:

The peer-review process is thus a kind of insider procedure . . . The already-established scientists of a discipline evaluate each other, especially newcomers, and decide what is worthy to be published. . . . Outside of the academic system, most people neither know what modern research is about, nor how to interpret the results and their potential importance to mankind. Although scientists often also do not know the latter, they are—in contrast to the layman—educated to conceal this lack of knowledge behind important sounding scientific jargon and formal models. In this way, even banalities and absurdities can be represented as A-journal worthy scientific excellence, a process laymen and politicians alike are not aware of. They are kept in the blissful belief that more competition in scientific publication leads to ever-increasing top performance and excellence.

Also this amusing bit:

Calculating published articles per capita, Switzerland becomes the world’s leading country . . . in no other country in the world are more research publications squeezed out of the average researcher than in Switzerland.

Are you listening, Bruno?

Binswanger lists a number of “modes of perverse behavior caused by the peer-review process,” most notably this one: “Form is more important than content.” I think about that all the time when I see papers backed up by “p less than 0.05.”

Nick Brown points us to this quote from Binswanger:

Cases of fraud such as the example of Jan Hendrik Schoen mainly affect the natural sciences, where the results of experiments are corrected or simply get invented. Social sciences often have gone already one step further. There, research is often of such a high degree of irrelevance that it does not matter anymore whether a result is faked or not. It does not matter one way or the other.

This reminds me of Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud. Just to clarify: I’m not saying that all, most, or even a large fraction of social science research is fraudulent, nor am I questioning the sincerity of most social science researchers. I’m just agreeing that in many cases the empirical evidence in published papers can be pretty much irrelevant, as we can see in the common retort of authors when problems are pointed out in their published work: “These mistakes and omissions do not change the general conclusion of the paper . . .”

That sort of attitude is consistent with the idea that publication, not research, has become the primary goal.

And this sounds familiar:

What scientists at universities and other research institutions are mostly doing are things such as writing applications for funding of research projects, looking for possible partners for a network and coordination of tasks, writing interim and final reports for existing projects, evaluating project proposals and articles written by other researchers, revising and resubmitting a rejected article, converting a previously published article into a research proposal so that it can be funded retrospectively, and so on.

I guess we could add “blogging” to that list of unproductive activities . . . Hey! I guess things could be worse. Imagine a world in which, in addition to everything else, productive researchers were expected to regularly blog their findings, participate in internet debates, answer questions posed by strangers in the comment sections of their blogs, etc.

In all seriousness, I’m glad that blogging is an option for researchers, but just an option, and not considered to be any sort of requirement. 15 years ago, when blogging was beginning to really catch on, one could’ve imagined an academic world in which blogging would’ve become expected behavior of junior and senior scholars alike, leading to a world of brown-nosing, back-stabbing, etc. I’m a little sad that blogging isn’t more popular—I hate twitter—but at least blogging hasn’t been sucked into the bureaucracy.

A few years ago, I wrote:

It’s easy to tell a story in which scientific journals are so civilized and online discussion is all about point scoring. But what I’ve seen here [in the case of a particular scientific dispute] is the opposite. The norms of peer-reviewed journals such as PNAS encourage presenting work with a facade of certainty. It is the online critics such as myself who have continued to display a spirit of openness.

At the time, I was thinking of these positive qualities as a product of the medium, with online expression allowing more direct and less mediated discussion without the gatekeeping role that has created so many problems with PNAS, Lancet, and various other high-prestige journals. But maybe it’s just that blogs are a backwater, relatively well behaved because they haven’t been sucked into the incentive system.

Indeed, there are some blogs out there (none on our blogroll, though) that do seem to be “political,” not in the sense of being about politics but in being exercises in strategic communication, and I hate that sort of thing. In the alternative universe in which blogging had become an expected part of academic production, I guess we’d be seeing noxious politicking on blogs all the time.

The fallacy of the excluded middle — statistical philosophy edition

I happened to come across this post from 2012 and noticed a point I’d like to share again. I was discussing an article by David Cox and Deborah Mayo, in which Cox wrote:

[Bayesians’] conceptual theories are trying to do two entirely different things. One is trying to extract information from the data, while the other, personalistic theory, is trying to indicate what you should believe, with regard to information from the data and other, prior, information treated equally seriously. These are two very different things.

I replied:

Yes, but Cox is missing something important! He defines two goals: (a) Extracting information from the data. (b) A “personalistic theory” of “what you should believe.” I’m talking about something in between, which is inference for the population. I think Laplace would understand what I’m talking about here. The sample is (typically) of no interest in itself, it’s just a means to learning about the population. But my inferences about the population aren’t “personalistic”—at least, no more than the dudes at CERN are personalistic when they’re trying to learn about particle theory from cyclotron experiments, and no more than the Census and the Bureau of Labor Statistics are personalistic when they’re trying to learn about the U.S. economy from sample data.

I feel like this fallacy-of-the-excluded-middle happens a lot, where people dismiss certain statistical approaches by defining the goals too restrictively. There’s a wide, wide world out there between the very narrow “extract information from the data” and the very vague “indicate what you should believe.” Within that gap falls most of the statistical work I’ve ever done or plan to do: