
“If you’re not using a proper, informative prior, you’re leaving money on the table.”

Well put, Rob Weiss.

This is not to say that one must always use an informative prior; oftentimes it can make sense to throw away some information for reasons of convenience. But it’s good to remember that, if you do use a noninformative prior, you’re doing less than you could.

Soil Scientists Seeking Super Model

I (Bob) spent last weekend at Biosphere 2, collaborating with soil carbon biogeochemists on a “super model.”


Model combination and expansion

The biogeochemists (three sciences in one!) have developed hundreds of competing models, and the goal of the workshop was to kick off some projects on putting some of them together into wholes that are greater than the sum of their parts. We’ll be doing some mixture (and perhaps change-point) modeling, which makes sense here because different biogeochemical processes are at work depending on system evolution and extrinsic conditions (some of which we have covariates for or can model with random effects), and we’re also going to do some of what Andrew likes to call “continuous model expansion.”

Others at the workshop also expressed interest in Bayesian model averaging as well as model comparison using Bayes factors, though I’d rather concentrate on mixture modeling and continuous model expansion, for reasons Andrew’s already discussed at length on the blog and in Bayesian Data Analysis (aka BDA3, aka “the red book”).

One of the three workshop organizers, Kiona Ogle, did a great job laying out the big picture during the opening dinner / lightning-talk session and then following it up by making sure we didn’t stray too far from our agenda. This is always a tricky balance with a bunch of world class scientists each with his or her own research agenda.

So far, so good

We got a surprising amount done over the weekend—it was really more hackathon than workshop, because there weren’t any formal talks.

GitHub repositories: Thanks to David LeBauer, another of the workshop organizers, we have a GitHub organization with repositories for our work so far. David and I were really into pitching version control, and in particular GitHub, for managing our collaborations. Hopefully we’ve converted some Dropbox users to version control.

Stan “Hello World”: The soil-metamodel/stan repo includes a Stan implementation of a soil incubation model with two pools and feedback, which I translated from Carlos Sierra’s SoilR, an R package implementing a vast variety of linear and nonlinear differential-equation-based soil-carbon models (the scope of which is explained in this paper).

Taking Michael Betancourt’s advice, I implemented a second version with lognormal noise and a proper measurement error model (see the repo), which fits much more cleanly (higher effective sample size, less residual noise, obeys scientific constraints on positivity).
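For readers curious what a two-pool model with feedback looks like, here is a minimal sketch in Python rather than Stan (the actual model lives in the soil-metamodel/stan repo; the parameter names and values below are illustrative assumptions, not the repo’s). It integrates the coupled pool equations with forward Euler and tracks cumulative CO2 efflux:

```python
# Illustrative two-pool soil-carbon model with feedback (not the repo's Stan code).
# Pool 1 decays at rate k1 and pool 2 at rate k2; a fraction a21 of the flux
# out of pool 1 is transferred to pool 2, a fraction a12 flows back from
# pool 2 to pool 1, and the remainder of each flux is respired as CO2.

def simulate_two_pool(c1, c2, k1, k2, a21, a12, dt, steps):
    co2 = 0.0
    for _ in range(steps):
        f1 = k1 * c1  # flux out of pool 1
        f2 = k2 * c2  # flux out of pool 2
        c1 += (-f1 + a12 * f2) * dt
        c2 += (a21 * f1 - f2) * dt
        co2 += ((1 - a21) * f1 + (1 - a12) * f2) * dt  # respired carbon
    return c1, c2, co2

# Example run with made-up rate constants, over 100 time units.
c1, c2, co2 = simulate_two_pool(100.0, 50.0, 0.1, 0.01, 0.3, 0.2, 0.1, 1000)
```

A handy sanity check on any such implementation: carbon is conserved, so the two pools plus cumulative CO2 should always sum to the initial stock.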

“Forward” and “Backward” Michaelis-Menten: Bonnie Waring, a post-doc, not only survived having a scorpion attached to her ankle during dinner one night, but is also leading one of the subgroups I’m involved with, on reimplementing and expanding these models in Stan. Apparently, Bonnie’s seen much worse (than little Arizona scorpions) working in Costa Rica in the lab of Jennifer Powers (the third workshop organizer), to which Bonnie’s returning to run some of the enzyme assays we need to complete the data.

I’m very excited about this particular model combination, which involves some state-of-the-art models taking into account biomass and enzyme behavior. There are two different forms of Michaelis-Menten dynamics under consideration, as they both make sense for different subsystems of the aggregate soil and organic matter biogeochemistry.
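To make the distinction concrete, here is a small Python sketch of the two rate forms as commonly parameterized in the soil literature (the function and parameter names are mine, not from the group’s repo). In the “forward” form the rate saturates in substrate; in the “reverse” form it saturates in the enzyme (or microbial biomass) pool:

```python
def forward_mm(v_max, substrate, enzyme, k_m):
    # Rate saturates as substrate becomes abundant (substrate-limited kinetics).
    return v_max * enzyme * substrate / (k_m + substrate)

def reverse_mm(v_max, substrate, enzyme, k_m):
    # Rate saturates as enzyme becomes abundant (enzyme-limited kinetics).
    return v_max * enzyme * substrate / (k_m + enzyme)
```

With abundant substrate the forward rate approaches v_max times the enzyme pool while the reverse rate keeps growing linearly in substrate, and vice versa when enzyme is abundant, which is why the two forms suit different subsystems.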

The repo for this project is soil-metamodel/back-forth-mm, the readme for which has references to some papers, including one first-authored by Steve Allison, another workshop participant, along with some colleagues: Soil-carbon response to warming dependent on microbial physiology (Nature Geoscience).

Global mapping: Steve’s actually involved with a separate group doing global mapping, using litter decomposition data. The GitHub repo is soil-metamodel/Litter-decomp-mapping.

They’ve got some stiff competition (ODE pun intended), given the recent fine-grained, animated global carbon map that NASA just put out.

Non-linear models: Kathe Todd-Brown, another post-doc, helped me (and everyone else) unpack and understand all of the models by breaking them down from narratives to differential equations. Kathe’s leading another subgroup looking at non-linear models, which I’m also involved with. I don’t see a public GitHub repo for that yet.

Science is awesome!

Right after Carlos, David, and I first arrived, we ran into a group of tourists, including some teenagers, who asked us, “Are you scientists?” We said, “Why yes, we are.” One of the teenagers replied, “That’s super awesome.” I happen to agree, but in nearly 30 years doing science, I can’t remember ever getting that reaction. So, if you’re a scientist and want to feel like a rock star, I’d highly recommend Biosphere 2.

It’s also a fun tour, what with the rain forest environment (i.e., a big greenhouse), and the 16 ton rubber-suspended “lung” for pressure equalization.

Retrospective clinical trials?

Kelvin Leshabari writes:

I am a young medical doctor in Africa who wondered if it is possible to have a retrospectively designed randomised clinical trial and yet have it be statistically valid.

This is because to the best of my knowledge, the assumptions underlying RCT methodology include that data is obtained in a prospective manner!

We are now about to design a new study in a clinical setting, and during a literature search we encountered a few datasets published with such a design in mind. This has raised some confusion among the trialists as to whether we should include the findings and account for them in our study or whether the said findings are mere products of confusion in a mathematical/statistical sense!

My reply: You can have retrospective studies with good statistical properties—if you know the variables that predict the choice of treatment, then you can model the outcome conditional on these variables, and you should be ok. I don’t think it makes sense to speak of a retrospective RCT but you can analyze retrospective data as if they were collected prospectively, if you can condition on enough relevant variables. This is the sort of thing that Paul Rosenbaum and others have written about under the rubric of observational studies.
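The point that retrospective data can be analyzed validly if you condition on the variables that drove treatment choice can be illustrated with a small simulation (entirely hypothetical numbers; this is a sketch of covariate stratification, not of any particular study). Treatment assignment depends on a known covariate x; the naive difference in means is badly confounded, while the stratified estimate recovers the true effect of 2:

```python
import random

random.seed(0)
TRUE_EFFECT = 2.0

# Simulate an observational dataset: sicker patients (x = 1)
# are far more likely to receive treatment.
data = []
for _ in range(10000):
    x = 1 if random.random() < 0.5 else 0
    t = 1 if random.random() < (0.8 if x == 1 else 0.2) else 0
    y = TRUE_EFFECT * t + 3.0 * x + random.gauss(0.0, 1.0)
    data.append((x, t, y))

def mean(values):
    return sum(values) / len(values)

# Naive estimate: ignore x entirely (confounded by treatment choice).
naive = (mean([y for x, t, y in data if t == 1])
         - mean([y for x, t, y in data if t == 0]))

# Adjusted estimate: difference in means within each stratum of x,
# averaged with weights proportional to stratum size.
adjusted, total = 0.0, len(data)
for xv in (0, 1):
    stratum = [(t, y) for x, t, y in data if x == xv]
    diff = (mean([y for t, y in stratum if t == 1])
            - mean([y for t, y in stratum if t == 0]))
    adjusted += diff * len(stratum) / total
```

Here the naive estimate lands near 3.8 (the true effect plus confounding from x) while the stratified estimate is close to 2. The catch, as noted above, is that this only works if you know and measure the variables that predicted treatment.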

Replication controversies

I don’t know what ATR is but I’m glad somebody is on the job of prohibiting replication catastrophe:

Screen Shot 2014-11-18 at 7.07.28 PM

Seriously, though, I’m on a list regarding a reproducibility project, and someone forwarded along this blog by psychology researcher Simone Schnall, whose attitudes we discussed several months ago in the context of some controversies about attempted replications of some of her work in social psychology.

I’ll return at the end to my remarks from July, but first I’d like to address Schnall’s recent blog, which I found unsettling. There are some technical issues that I can discuss:

1. Schnall writes: “Although it [a direct replication] can help establish whether a method is reliable, it cannot say much about the existence of a given phenomenon, especially when a repetition is only done once.” I think she misses the point that, if a replication reveals that a method is not reliable (I assume she’s using the word “reliability” in the sense that it’s used in psychological measurement, so that “not reliable” would imply high variance) then it can also reveal that an original study, which at first glance seemed to provide strong evidence in favor of a phenomenon, really doesn’t. The Nosek et al. “50 shades of gray” paper is an excellent example.

2. Her discussion of replication of the Stroop effect also seems to miss the point, or at least so it seems to me. To me, it makes sense to replicate effects that everyone believes, as a sort of “active control” on the whole replication process. Just as it also makes sense to do “passive controls” and try to replicate effects that nobody thinks can occur. Schnall writes that in the choice of topics to replicate, “it is irrelevant if an extensive literature has already confirmed the existence of a phenomenon.” But that doesn’t seem quite right. I assume that the extensive literature on Stroop is one reason it’s been chosen to be included in the study.

The problem, perhaps, is that she seems to see the goal of replication as a goal to shoot things down. From that standpoint, sure, it seems almost iconoclastic to try to replicate (and, by implication, shoot down) Stroop, a bit disrespectful of this line of research. But I don’t see any reason why replication should be taken in that way. Replication can, and should, be a way to confirm a finding. I have no doubt that Stroop will be replicated—I’ve tried the Stroop test myself (before knowing what it was about) and the effect was huge, and others confirm this experience. This is a large effect in the context of small variation. I guess that, with some great effort, it would be possible to design a low-power replication of Stroop (maybe use a monochrome image, embed it in a within-person design, and run it on Mechanical Turk with a tiny sample size?), but I’d think any reasonable replication couldn’t fail to succeed. Indeed, if Stroop weren’t replicated, this would imply a big problem with the replication process (or, at least with that particular experiment). But that’s the point, that’s one reason for doing this sort of active control. The extensive earlier literature is not irrelevant at all!

3. Also I think her statement, “To establish the absence of an effect is much more difficult than the presence of an effect,” misses the point. The argument is not that certain claimed effects are zero but rather that there is no strong evidence that they represent general aspects of human nature (as is typically claimed in the published articles). If an “elderly words” stimulus makes people walk more slowly one day in one lab, and more quickly another day in another lab, that could be interesting but it’s not the same as the original claim. And, in the meantime, critics are not claiming (or should not be claiming) an absence of any effect but rather they (we) are claiming to see no evidence of a consistent effect.

In her post, Schnall writes, “it is not about determining whether an effect is “real” and exists for all eternity; the evaluation instead answers a simply question: Does a conclusion follow from the evidence in a specific paper?”—so maybe we’re in agreement here. The point of criticism of all sorts (including analysis of replication) can be to address the question, “Does a conclusion follow from the evidence in a specific paper?” Lots of statistical research (as well as compelling examples such as that of Nosek et al.) has demonstrated that simple p-values are not always good summaries of evidence. So we should all be on the same side here: we all agree that effects vary, none of us is trying to demonstrate that an effect exists for all eternity, none of us is trying to establish the absence of an effect. It’s all about the size and consistency of effects, and critics (including me) argue that effects are typically a lot smaller and a lot less consistent than are claimed in papers published by researchers who are devoted to these topics. It’s not that people are “cheating” or “fishing for significance” or whatever, it’s just that there’s statistical evidence that the magnitude and stability of effects are overestimated.

4. Finally, here’s a statement of Schnall that really bothers me: “There is a long tradition in science to withhold judgment on findings until they have survived expert peer review.” Actually, that statement is fine with me. But I’m bothered by what I see as an implied converse, that, once a finding has survived expert peer review, it should be trusted. Ok, don’t get me wrong, Schnall doesn’t say that second part in this most recent post of hers, and if she agrees with me—that is, if she does not think that peer-reviewed publication implies that a study should be trusted—that’s great. But her earlier writings on this topic give me the sense that she believes that published studies, at least in certain fields of psychology, should get the benefit of the doubt: that, once they’ve been published in a peer-reviewed publication, they should stand on a plateau and require some special effort to be dislodged. So when Study 1 says one thing and pre-registered Study 2 says another, she seems to want to give the benefit of the doubt to Study 1. But I don’t see that.

Different fields, different perspectives

A lot of this discussion seems somehow “off” to me. Perhaps this is because I do a lot of work in political science. And almost every claim in political science is contested. That’s the nature of claims about politics. As a result, political scientists do not expect deference to published claims. We have disputes, sometimes studies fail to replicate, and that’s ok. Research psychology is perhaps different in that there’s traditionally been a “we’re all in this together” feeling, and I can see how Schnall and others can be distressed that this traditional collegiality has disappeared. From my perspective, the collegiality could be restored by the simple expedient of researchers such as Schnall recognizing that the patterns they saw in particular datasets might not generalize to larger populations of interest. But I can see how some scholars are so invested in their claims and in their research methods that they don’t want to take that step.

I’m not saying that political science is perfect, but I do think there are some differences in that poli sci has more of a norm of conflict whereas it’s my impression that research psychology has more of the norms of a lab science where repeated experiments are supposed to give identical results. And that’s one of the difficulties.

If scientist B fails to replicate the claims of scientist A who did a low-power study, my first reaction is: hey, no big deal, data are noisy, the patterns in the sample do not generally match the patterns in the population, certainly not if you condition on “p less than .05.” But a psychology researcher trained in this lab tradition might not be looking at sampling variability as an explanation—nowhere in Schnall’s blogs did I see this suggested as a possible source of the differences between original reports and replications—and, as a result, they can perceive a failure to replicate as an attack on the original study, and so it’s natural for them to attack the replication in turn. But once you become more attuned to sampling and measurement variation, failed replications are to be expected all the time; that’s what it means to do a low-power study.

4-year-old post on Arnold Zellner is oddly topical

I’m re-running this Arnold Zellner obituary because it is relevant to two recent blog discussions:

1. Differences between econometrics and statistics

2. Old-fashioned sexism (of the quaint, not the horrible, variety)

Stan hits bigtime


First Wikipedia, then the Times (featuring Yair Ghitza), now Slashdot (featuring Allen “PyStan” Riddell). Just get us on Gawker and we’ll have achieved total media saturation.

Next step, backlash. Has Stan jumped the shark? Etc. (We’d love to have a “jump the shark” MCMC algorithm but I don’t know if or when we’ll get there. I’m still struggling to get wedge sampling to work.)

In which I play amateur political scientist

Mark Palko writes:

I have a couple of what are probably poli sci 101 questions.

The first involves the unintended (?) consequences of plans to bring political power back to the common people. The two examples I have in mind are California’s ballot initiatives and parental trigger laws but I’m sure I’m missing some obvious ones. It seems like these attempts to bring power to ordinary people are generally taken over very quickly by those with money and power.

The obvious explanation is that proposals which, with the intention of making things more democratic, allow small groups to have more power are prone to being taken over by special interests. Is this a recognized principle? If so, does it have a name?

The second involves issues reversing partisan connotations, cases where certain positions go from being strongly identified with one end of the political spectrum to being strongly identified with the other. The example I have in mind is pacifism. As far as I can tell, being anti-war was basically a liberal position in 1915, a conservative one in 1940 and a liberal one in 1965. Are there other, better examples? Do these shifts create problems for researchers studying political affiliations?

My reply:

On the first item, yes, lots of people have written about the way that potential reforms can backfire. I’m in general skeptical of such skepticism; my attitude is that shaking up the political system is generally a good thing. And if the powers-that-be end up taking over various reform measures, well, that takes some effort on their part. I’m also suspicious of the reforms-don’t-work argument because it is so generic. Albert Hirschman discussed this in detail in his classic book.

On the second item, I’ve long been interested in issues whose correlation with partisanship is fluid. Indeed, I’ve been fascinated with this topic since first studying political science, thirty years ago. The work of scholars such as Bob Shapiro is relevant to these questions, but I still don’t feel I have the big picture here.

Guys, we need to talk. (Houston, we have a problem).

This post is by Phil Price. I’m posting it on Andrew’s blog without knowing exactly where he stands on this so it’s especially important for readers to note that this post is NOT BY ANDREW!

Last week a prominent scientist, representing his entire team of researchers, appeared in widely distributed television interviews wearing a shirt covered with drawings of scantily clad women with futuristic weapons (I believe the term of art is “space vixens.”) In that interview, he said about the comet that his team is studying, “she’s sexy, but she’s not easy.” Here’s a photo of the shirt (sorry about the fuzziness):

This is what “power = .06” looks like. Get used to it.

Screen Shot 2014-11-17 at 11.19.42 AM

I prepared the above image for this talk. The calculations come from the second column of page 6 of this article, and the psychology study that we’re referring to is discussed here.
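A quick way to see what power = .06 means in practice is to simulate it. The numbers below (a true effect of 2 with a standard error of 8.1) follow the example in the article linked above; the code itself is just an illustrative sketch. With power this low, statistically significant estimates are necessarily huge overestimates, and a noticeable fraction get the sign wrong:

```python
import random

random.seed(1)
TRUE_EFFECT, SE, SIMS = 2.0, 8.1, 200000

significant = []
for _ in range(SIMS):
    est = random.gauss(TRUE_EFFECT, SE)  # one study's estimate
    if abs(est) > 1.96 * SE:             # "significant at p < .05"
        significant.append(est)

power = len(significant) / SIMS
# Type M ("magnitude") error: how much significant estimates exaggerate.
exaggeration = sum(abs(e) for e in significant) / len(significant) / TRUE_EFFECT
# Type S ("sign") error: share of significant estimates with the wrong sign.
type_s = sum(1 for e in significant if e < 0) / len(significant)
```

With these settings the simulated power comes out around 6 percent, the significant estimates overstate the true effect roughly ninefold, and about a quarter of them point in the wrong direction.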

On deck this week

Mon: “Why continue to teach and use hypothesis testing?”

Tues: In which I play amateur political scientist

Wed: Retrospective clinical trials?

Thurs: “If you’re not using a proper, informative prior, you’re leaving money on the table.”

Fri: Hey, NYT: Former editor Bill Keller said that any editor who fails to confront a writer about an error because of the writer’s supposed status is failing to do their job. It’s not too late to correct the errors of Nicholas Kristof and David Brooks!

Sat: Blogs > Twitter

Sun: Princeton Abandons Grade Deflation Plan . . .