
“We are reluctant to engage in post hoc speculation about this unexpected result, but it does not clearly support our hypothesis”

Brendan Nyhan and Thomas Zeitzoff write:

The results do not provide clear support for the lack-of-control hypothesis. Self-reported feelings of low and high control are positively associated with conspiracy belief in observational data (model 1; p<.05 and p<.01, respectively). We are reluctant to engage in post hoc speculation about this unexpected result, but it does not clearly support our hypothesis. Moreover, our experimental treatment effect estimate for our low-control manipulation is null relative to both the high-control condition (the preregistered hypothesis test) as well as the baseline condition (a RQ) in both the combined (table 2) and individual item results (table B7). Finally, we find no evidence that the association with self-reported feelings of control in model 1 of table 2 or the effect of the control treatments in model 2 are moderated by anti-Western or anti-Jewish attitudes (results available on request). Our expectations are thus not supported.

It is good to see researchers openly express their uncertainty and be clear about the limitations of their data.

“Simulations are not scalable but theory is scalable”

Eren Metin Elçi writes:

I just watched this video on the value of theory in applied fields (like statistics), and it really resonated with my previous research experiences in statistical physics and on the interplay between randomised perfect sampling algorithms and Markov chain mixing, as well as with my current perspective on the status quo of deep learning. . . .

So essentially in this post I give more evidence for [the] statements “simulations are not scalable but theory is scalable” and “theory scales” from different disciplines. . . .

The theory of finite size scaling in statistical physics: I devoted quite a significant amount of my PhD and post-doc research to finite size scaling, where I applied and checked the theory of finite size scaling for critical phenomena. In a nutshell, the theory of finite size scaling allows us to study the behaviour and infer properties of physical systems in the thermodynamic limit (close to phase transitions) through simulating sequences of finite model systems. This is required, since our current computational methods are far from being, and probably will never be, able to simulate real physical systems. . . .

Here comes a question I have been thinking about for a while . . . is there a (universal) theory that can quantify how deep learning models behave on larger problem instances, based on results from sequences of smaller problem instances? As an example, how do we have to adapt a, say, convolutional neural network architecture and its hyperparameters to sequences of larger (unexplored) problem instances (e.g. increasing the resolution of colour fundus images for the diagnosis of diabetic retinopathy, see “Convolutional Neural Networks for Diabetic Retinopathy” [4]) in order to guarantee a fixed precision over the whole sequence of problem instances, without the need for ad-hoc and manual adjustments to the architecture and hyperparameters for each new problem instance? A very early approach of a finite size scaling analysis of neural networks (admittedly for a rather simple “architecture”) can be found here [5]. An analogue to this, which just crossed my mind, is the study of Markov chain mixing times . . .
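To give a concrete, toy sense of what finite-size-scaling extrapolation looks like in practice, here's a minimal sketch with made-up numbers (nothing from Elçi's actual research): simulate an observable measured on a sequence of finite system sizes, assume it approaches its infinite-size limit like a power law, and fit for that limit.

```r
# Toy finite-size-scaling extrapolation (illustrative only; made-up numbers).
# Assume an observable measured on finite systems of linear size L approaches
# its thermodynamic-limit value like A(L) = A_inf + c * L^(-w).

set.seed(123)
L_sizes <- c(8, 16, 32, 64, 128, 256)     # sequence of finite system sizes
A_inf <- 1.5; c0 <- 2.0; w <- 1.0         # "true" values used to fake the data
A_obs <- A_inf + c0 * L_sizes^(-w) + rnorm(length(L_sizes), sd = 0.005)

# Fit the scaling form A(L) = a + b * L^(-omega) and extrapolate to L -> infinity
fit <- nls(A_obs ~ a + b * L_sizes^(-omega),
           start = list(a = 1.4, b = 1.5, omega = 0.8))
coef(fit)["a"]   # estimate of the infinite-size (thermodynamic-limit) value
```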

It’s so wonderful to learn about these examples where my work is inspiring young researchers to look at problems in new ways!

My two talks in Austria next week, on two of your favorite topics!

Innsbruck, 7 Nov 2018:

The study of American politics as a window into understanding uncertainty in science

We begin by discussing recent American elections in the context of political polarization, and we consider similarities and differences with European politics. We then discuss statistical challenges in the measurement of public opinion: inference from opinion polls with declining response rates has much in common with challenges in big-data analytics. From here we move to the recent replication crisis in science, and we argue that Bayesian methods are well suited to resolve some of these problems, if researchers can move away from inappropriate demands for certainty. We illustrate with examples in many different fields of research, our own and others’.

Some background reading:

19 things we learned from the 2016 election (with Julia Azari). http://www.stat.columbia.edu/~gelman/research/published/what_learned_in_2016_5.pdf
The mythical swing voter (with Sharad Goel, Doug Rivers, and David Rothschild). http://www.stat.columbia.edu/~gelman/research/published/swingers.pdf
The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf
Honesty and transparency are not enough. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics14.pdf
The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. http://www.stat.columbia.edu/~gelman/research/published/bayes_management.pdf

Vienna, 9 Nov 2018:

Bayesian Workflow

Methods in statistics and data science are often framed as solutions to particular problems, in which a particular model or method is applied to a dataset. But good practice typically requires multiplicity, in two dimensions: fitting many different models to better understand a single dataset, and applying a method to a series of different but related problems. To understand and make appropriate inferences from real-world data analysis, we should account for the set of models we might fit, and for the set of problems to which we would apply a method. This is known as the reference set in frequentist statistics or the prior distribution in Bayesian statistics. We shall discuss recent research of ours that addresses these issues, involving the following statistical ideas: Type M errors, the multiverse, weakly informative priors, Bayesian stacking and cross-validation, simulation-based model checking, divide-and-conquer algorithms, and validation of approximate computations.

Some background reading:

Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors (with John Carlin). http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf
Increasing transparency through a multiverse analysis (with Sara Steegen, Francis Tuerlinckx, and Wolf Vanpaemel). http://www.stat.columbia.edu/~gelman/research/published/multiverse_published.pdf
Prior choice recommendations wiki (with Daniel Simpson and others). https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
Using stacking to average Bayesian predictive distributions (with Yuling Yao, Aki Vehtari, and Daniel Simpson). http://www.stat.columbia.edu/~gelman/research/published/stacking_paper_discussion_rejoinder.pdf
Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC (with Aki Vehtari and Jonah Gabry). http://www.stat.columbia.edu/~gelman/research/published/loo_stan.pdf
Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data (with Aki Vehtari, Tuomas Sivula, Pasi Jylanki, Dustin Tran, Swupnil Sahai, Paul Blomstedt, John Cunningham, David Schiminovich, and Christian Robert). http://www.stat.columbia.edu/~gelman/research/unpublished/ep_stan_revised.pdf
Yes, but did it work?: Evaluating variational inference (with Yuling Yao, Aki Vehtari, and Daniel Simpson). http://www.stat.columbia.edu/~gelman/research/published/Evaluating_Variational_Inference.pdf
Visualization in Bayesian workflow (with Jonah Gabry, Daniel Simpson, Aki Vehtari, and Michael Betancourt). http://www.stat.columbia.edu/~gelman/research/published/bayes-vis.pdf

I felt like in Vienna I should really speak on this paper, but I don’t think I’ll be talking to an audience of philosophers. I guess the place has changed a bit since 1934.

P.S. I was careful to arrange zero overlap between the two talks. Realistically, though, I don’t expect many people to go to both!

“What Happened Next Tuesday: A New Way To Understand Election Results”

Yair just published a long post explaining (a) how he and his colleagues use Mister P and the voter file to get fine-grained geographic and demographic estimates of voter turnout and vote preference, and (b) why this makes a difference.

The relevant research paper is here.

As Yair says in his above-linked post, he and others are now set up to report adjusted pre-election poll data on election night or shortly after, as a replacement for exit polls, which are so flawed.

Facial feedback: “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.”

Fritz Strack points us to this article, “When Both the Original Study and Its Failed Replication Are Correct: Feeling Observed Eliminates the Facial-Feedback Effect,” by Tom Noah, Yaacov Schul, and Ruth Mayo, who write:

According to the facial-feedback hypothesis, the facial activity associated with particular emotional expressions can influence people’s affective experiences. Recently, a replication attempt of this effect in 17 laboratories around the world failed to find any support for the effect. We hypothesize that the reason for the failure of replication is that the replication protocol deviated from that of the original experiment in a critical factor. In all of the replication studies, participants were alerted that they would be monitored by a video camera, whereas the participants in the original study were not monitored, observed, or recorded. . . . we replicated the facial-feedback experiment in 2 conditions: one with a video-camera and one without it. The results revealed a significant facial-feedback effect in the absence of a camera, which was eliminated in the camera’s presence. These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.

We’ve discussed the failed replications of facial feedback before, so it seemed worth following up with this new paper that provides an explanation for the failed replication that preserves the original effect.

Here are my thoughts.

1. The experiments in this new paper are preregistered. I haven’t looked at the preregistration plan, but even if not every step was followed exactly, preregistration does seem like a good step.

2. The main finding is that facial feedback worked in the no-camera condition but not in the camera condition:

3. As you can almost see in the graph, the difference between these results is not itself statistically significant—not at the conventional p=0.05 level for a two-sided test. The result has a p-value of 0.102, which the authors describe as “marginally significant in the expected direction . . . . p=.051, one-tailed . . .” Whatever. It is what it is.

4. The authors are playing a dangerous game when it comes to statistical power. From one direction, I’m concerned that the studies are way too noisy: it says that their sample size was chosen “based on an estimate of the effect size of Experiment 1 by Strack et al. (1988),” but for the usual reasons we can expect that to be a huge overestimate of effect size, hence the real study has nothing like 80% power. From the other direction, the authors use low power to explain away non-statistically-significant results (“Although the test . . . was greatly underpowered, the preregistered analysis concerning the interaction . . . was marginally significant . . .”).
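To see why basing the sample size on the published effect estimate is dangerous, here's a toy version of the calculation, with made-up numbers (these are not the actual effect sizes or sample sizes from Strack et al. or from the new paper): if you power the study at 80% for the published effect, and the true effect is much smaller, the real power is far lower.

```r
# Toy illustration: power when the assumed effect size is an overestimate.
# The numbers are hypothetical, not taken from the papers under discussion.

d_published <- 0.8    # standardized effect size taken from an original study
d_true      <- 0.3    # a (possibly) more realistic true effect

# Sample size per group chosen to give 80% power for the published effect:
n_per_group <- ceiling(power.t.test(delta = d_published, sd = 1,
                                    power = 0.80, sig.level = 0.05)$n)
n_per_group

# Actual power at that sample size if the true effect is smaller:
power.t.test(n = n_per_group, delta = d_true, sd = 1, sig.level = 0.05)$power
# The nominal "80% power" design can easily have true power well under 30%.
```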

5. I’m concerned that the study is too noisy, and I’d prefer a within-person experiment.

6. In their discussion section, the authors write:

Psychology is a cumulative science. As such, no single study can provide the ultimate, final word on any hypothesis or phenomenon. As researchers, we should strive to replicate and/or explicate, and any one study should be considered one step in a long path. In this spirit, let us discuss several possible ways to explain the role that the presence of a camera can have on the facial-feedback effect.

That’s all reasonable. I think the authors should also consider the hypothesis that what they’re seeing is more noise. Their theory could be correct, but another possibility is that they’re chasing another dead end. This sort of thing can happen when you stare really hard at noisy data.

7. The authors write, “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.” I have no idea, but if this is true, it would definitely be good to know.

8. The treatments are for people to hold a pen in their lips or their teeth in some specified ways. It’s not clear to me why any effects of these treatments (assuming the effects end up being reproducible) should be attributed to facial feedback rather than some other aspect of the treatment such as priming or implicit association. I’m not saying there isn’t facial feedback going on; I just have no idea. I agree with the authors that their results are consistent with the facial-feedback model.

P.S. Strack also points us to this further discussion by E. J. Wagenmakers and Quentin Gronau, which I largely find reasonable, but I disagree with their statement regarding “the urgent need to preregister one’s hypotheses carefully and comprehensively, and then religiously stick to the plan.” Preregistration is fine, and I agree with their statement that generating fake data is a good way to test it out (one can also preregister using alternative data sets, as here), but I hardly see it as “urgent.” It’s just one part of the picture.

Raghuveer Parthasarathy’s big idea for fixing science

Raghuveer Parthasarathy writes:

The U.S. National Science Foundation ran an interesting call for proposals recently called the “Idea Machine,” aiming to gather “Big Ideas” to shape the future of research. It was open not just to scientists, but to anyone interested in potentially identifying grand challenges and new directions.

He continues:

(i) There are non-obvious, or unpopular, ideas that are important. I’ll perhaps discuss this in a later post. (What might you come up with?)

(ii) There is a very big idea, perhaps bigger than all the others, that I’d bet isn’t one of the ~1000 other submissions: fixing science itself.

And then he presents his Big Idea: A Sustainable Scientific Enterprise:

The scientific enterprise has never been larger, or more precarious. Can we reshape publicly funded science, matching trainees to viable careers, fostering reproducibility, and encouraging risk?

Science has transformed civilization. This statement is so obviously true that it can come as a shock to learn of the gloomy view that many scientists have of the institutions, framework, and organizational structure of contemporary scientific research. Issues of reproducibility plague many fields, fueled in part by structural incentives for eye-catching but fragile results. . . . over 2 million scientific papers are published each year . . . representing both a steady increase in our understanding of the universe and a barrage of noise driven by pressures to generate output. All of these issues together limit the ability of scientists and of science to tackle important questions that humanity faces. A grand challenge for science, therefore, is to restructure the scientific enterprise to make it more sustainable, productive, and capable of driving innovation. . . .

Methods of scholarly communication that indicate progress in small communities can easily become simple tick-boxes of activity in large, impersonal systems. Continual training of students as new researchers, who then train more students, is very effective for exponentially expanding a small community, as was the goal in the U.S. after World War II, but is clearly incompatible with a sustainable, stable population. The present configuration is so well-suited to expansion, and so ill-suited to stability . . .

It is hard to overstate the impact of science on society: every mobile phone, DNA test, detection of a distant planet, material phase transition, airborne jetliner, radio-tracked wolf, and in-vitro fertilized baby is a testament to the stunning ability of our species to explore, understand, and engineer the natural world. There are many challenges that remain unsolved . . .

In some fields, a lot of what’s published is wrong. More commonly, much of what’s published is correct but minor and unimportant. . . . Of course, most people don’t want to do boring work; the issue is one of structures and incentives. [The paradox is that funding agencies always want proposals to aim high, and journals always want exciting papers, but they typically want a particular sort of ambition, a particular sort of exciting result—the new phenomenon, the cure for cancer, etc., which paradoxically is often associated with boring, mechanistic, simplistic models of the world. — ed.] . . .

Ultimately, the real test of scientific reforms is the progress we make on “big questions.” We will hopefully look back on the post-reform era as the one in which challenges related to health, energy and the environment, materials, and more were tackled with unprecedented success. . . . science thrives by self-criticism and skepticism, which should be applied to the institutions of science as well as its subject matter if we are to maximize our chances of successfully tackling the many complex challenges our society, and our planet, face.

“Radio-tracked wolf” . . . I like that!

In all seriousness, I like this Big Idea a lot. It’s very appropriate for NSF, and I think it should and does have a good chance of being a winner in this competition. I submitted a Big Idea too—Truly Data-Based Decision Making—and I like it, I think it’s great stuff, indeed it’s highly compatible with Parthasarathy’s. He’s tackling the social and institutional side of the problem, while my proposal is more technical. They go together. Whether or not either of our ideas is selected in this particular competition, I hope NSF takes Parthasarathy’s ideas seriously and moves toward implementing some of them.

You are welcome in comments to discuss non-obvious, or unpopular, ideas that are important. (Please do better than this list; thank you.)

“2010: What happened?” in light of 2018

Back in November 2010 I wrote an article that I still like, attempting to answer the question: “How could the voters have swung so much in two years? And, why didn’t Obama give Americans a better sense of his long-term economic plan in 2009, back when he still had a political mandate?”

My focus was on the economic slump at the time: how it happened, what were the Obama team’s strategies for boosting the economy, and in particular why the Democrats didn’t do more to prime the pump in 2009-2010, when they controlled the presidency and both houses of congress and had every motivation to get the economy moving again.

As I wrote elsewhere, I suspect that, back when Obama was elected in 2008 in the midst of an economic crisis, lots of people thought it was 1932 all over again, but it was really 1930:

Obama’s decisive victory echoed Roosevelt’s in 1932. But history doesn’t really repeat itself. . . With his latest plan of a spending freeze, Obama is being labeled by many liberals as the second coming of Herbert Hoover—another well-meaning technocrat who can’t put together a political coalition to do anything to stop the slide. Conservatives, too, may have switched from thinking of Obama as a scary realigning Roosevelt to viewing him as a Hoover from their own perspective—as a well-meaning fellow who took a stock market crash and made it worse through a series of ill-timed government interventions.

My take on all this in 2010 was that, when they came into office, the Obama team was expecting a recovery in any case (as in this notorious graph) and, if anything, were concerned about reheating the economy too quickly.

My perspective on this is a mix of liberal and conservative perspectives: liberal, or Keynesian, in that I’m accepting the idea that government spending can stimulate the economy and do useful things; conservative in that I’m accepting the idea that there’s some underlying business cycle or reality that governments will find it difficult to avoid. “I was astonished to see the recession in Baghdad, for I had an appointment with him tonight in Samarra.”

I have no deep understanding of macroeconomics, though, so you can think of my musings here as representing a political perspective on economic policy—a perspective that is relevant, given that I’m talking about the actions of politicians.

In any case, a big story of the 2010 election was a feeling that Obama and the Democrats were floundering on the economy, which added some force to the expected “party balancing” in which the out-party gains in congress in the off-year election.

That was then, this is now

Now on to 2018, where the big story is, and has been, the expected swing toward the Democrats (party balancing plus the unpopularity of the president), but where the second biggest story is that, yes, Trump and his party are unpopular, but not as unpopular as he was a couple months ago. And a big part of that story is the booming economy, and a big part of that story is the large and increasing budget deficit, which defies Keynesian and traditional conservative prescriptions (you’re supposed to run a surplus, not a deficit, in boom times).

From that perspective, I wonder if the Republicans’ current pro-cyclical fiscal policy, so different from traditional conservative recommendations, is consistent with a larger pattern in the last two years in which the Republican leadership feels that it’s living on borrowed time. The Democrats received more votes in the last presidential election and are expected to outpoll the Republicans in the upcoming congressional elections too, so the Republican leadership may well feel more pressure to get better economic performance now, both to keep themselves in power by keeping the balls in the air as long as possible, and because if they’re gonna lose power, they want to grab what they can when they can still do it.

In contrast, the Democratic leadership in 2008 expected to be in charge for a long time, so (a) they were in no hurry to implement policies that they could do at their leisure, and (b) they just didn’t want to screw things up and lose their permanent majority.

Different perspectives and expectations lead to different strategies.

Explainable ML versus Interpretable ML

First, I (Keith) want to share something I was taught in MBA school: all new (and old but still promoted) technologies exaggerate their benefits, are overly dismissive of difficulties, underestimate the true costs, and fail to anticipate how older (less promoted) technologies can adapt to offer similar or even better benefits, with fewer difficulties or lower costs.

I have recently become aware of work by Cynthia Rudin (Duke) arguing that upgraded versions of easy-to-interpret machine learning (ML) technologies (e.g., CART) can offer predictive performance similar to that of newer ML methods (e.g., deep neural nets), with the added benefit of interpretability. But I am also trying to keep in mind, or even anticipate, how newer ML can adapt to (re-)match this.

Never say never.

The abstract of “Learning Customized and Optimized Lists of Rules with Mathematical Programming,” by Cynthia Rudin and Seyda Ertekin, may suffice to give a good enough sense for this post.

We introduce a mathematical programming approach to building rule lists, which are a type of interpretable, nonlinear, and logical machine learning classifier involving IF-THEN rules. Unlike traditional decision tree algorithms like CART and C5.0, this method does not use greedy splitting and pruning. Instead, it aims to fully optimize a combination of accuracy and sparsity, obeying user-defined constraints. This method is useful for producing non-black-box predictive models, and has the benefit of a clear user-defined tradeoff between training accuracy and sparsity. The flexible framework of mathematical programming allows users to create customized models with a provable guarantee of optimality. 

For those with less background in ML, think of regression trees or decision trees (CART) on numerical steroids.

For those with more background in predictive modelling, this may be the quickest way to get a sense of what is at stake (and the challenges). Start at 17:00 and it’s done by 28:00, so about 10 minutes.

My 9 line summary notes of Rudin’s talk (link above): Please stop doing “Explainable” ML [for high-stakes decisions].

Explainable ML – using a black box and explaining it afterwards.
Interpretable ML – using a model that is not black box.

Advantages of interpretable ML are mainly for high-stakes decisions.

Accuracy/interpretability tradeoff is a myth – in particular for problems with good data representations – all ML methods perform about the same.

[This does leave many application areas where it is not a myth and Explainable or even un-explainable ML will have accuracy advantages.]

Explainable ML is flawed: there are two models, the black-box model and an understudy model that is explainable and predicts similarly but not identically (agreeing exactly only x% of the time). And sometimes the explanations do not make sense.
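To give a concrete sense of what an interpretable, non-black-box model looks like, here is a minimal sketch using plain CART via R's rpart package on a built-in toy dataset. To be clear, this is ordinary greedy CART, not Rudin and Ertekin's optimized rule lists; it is only meant to show the kind of model whose predictions can be read directly off the fitted object.

```r
# A small interpretable classifier: a CART decision tree on a built-in dataset.
# Ordinary greedy splitting (rpart), shown only to illustrate what a
# non-black-box model looks like; Rudin and Ertekin instead optimize rule
# lists exactly via mathematical programming.

library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.02))  # cp: complexity penalty;
                                                  # larger values give smaller trees

print(fit)   # the fitted tree is itself the "explanation": a short list of
             # IF-THEN splits on the original variables
table(predicted = predict(fit, type = "class"), actual = iris$Species)
```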

P.S. I added a bit more about the other side (the problematic obsession with transparency, and arguments for why “arguments by authority” [black boxes], although the worst kind of argument, are all that most people will accept and make use of) here.


“Evidence-Based Practice Is a Two-Way Street” (video of my speech at SREE)

Here’s the talk I gave a few months ago at the Society of Research on Educational Effectiveness. Enjoy.

What does it mean to talk about a “1 in 600 year drought”?

Patrick Atwater writes:

Curious as to your thoughts on a bit of a statistical and philosophical quandary. We often make statements like “this drought was a 1 in 400 year event” but what do we really mean when we say that?
In California for example there was an oft repeated line that the recent historic drought was a 1 in 600 year event though we only have a well instrumented hydrologic record of approximately 100 years. Beyond that it’s mostly tree ring studies.
I suppose this is a subset of the larger issue of quantifying rare but impactful events like hurricanes, floods, terrorist attacks, or other natural and man-made disasters. Generally it is hard to infer conclusively from a single observation (in this case the recent droughts in South Africa and California).
The general practice is to utilize the historical instrumented record to construct a probability distribution and see where the current observation lies on that. So what we’re really saying with the 1 in 400 years line then is that this is a 1 in 400 year event assuming the future will continue to look like the past hundred years.
But with climate change there’s good reason not to believe that! So I’m curious if you have any ideas about ways to incorporate priors about future hydrology utilizing climate modeling into these sorts of headline statements about the likelihood of the current drought. And then thinking ideally would communicate through transparent tools like what Bret Victor advocates for so the assumptions are clear ( http://worrydream.com/ClimateChange/ ).
For context we emailed a bit a few years back about applied data science questions [see here and here] and I’ve been leading a small data team automating the integration of urban customer water use data in California and contextual information. You can see the resulting dataset ( http://bit.ly/scuba_metadata ) and analytics ( https://demo.californiadatacollaborative.com/smc/ ) if you’re curious.

My reply:

Get Gerd Gigerenzer on the line.

You’ve got a great question here!

When it comes to communication of probability and uncertainty, I’m only aware of work that assumes that the probabilities are known, or at least that they are constant. Here you’re talking about the real-world scenario in which the probabilities are changing—indeed, the change in these probabilities is a key issue in any real-world use of these numbers.

The point of saying that we’re in a hundred-year drought is not to say: Hey, we happened to see something unexpected this year! No, the point is that the probabilities have changed; that a former “hundred-year drought” is now happening every 20 years or whatever.
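For concreteness, here's a toy version of the standard calculation, and of how a shift in the distribution changes the answer. The numbers are made up, and a normal distribution stands in for whatever drought index (annual precipitation, say) is actually used:

```r
# Toy sketch of a "1-in-N-year event" calculation. Made-up numbers; a normal
# distribution stands in for the actual drought index.

set.seed(1)
record <- rnorm(100, mean = 500, sd = 100)   # ~100 years of instrumented data
this_year <- 270                             # an unusually dry year

# Stationary assumption: fit a distribution to the record and read off the
# probability of a year at least this dry.
p_dry <- pnorm(this_year, mean = mean(record), sd = sd(record))
1 / p_dry          # implied return period, in years

# If the climate shifts so that the mean drops by one standard deviation,
# the same year is no longer nearly as rare:
p_dry_shifted <- pnorm(this_year, mean = mean(record) - 100, sd = sd(record))
1 / p_dry_shifted  # the "hundred-year" event now recurs every decade or two
```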

I have no answers here. I’m posting because it’s an important topic.

And, yes, I’m sure there are researchers in the judgment and decision making field who have worked on this. Please share relevant ideas and references in the comments. Thank you.

P.S. Hey, I just noticed the name thing: a hydrologist named Atwater! Cool.

MRP (or RPP) with non-census variables

It seems to be Mister P week here on the blog . . .

A question came in, someone was doing MRP on a political survey and wanted to adjust for political ideology, which is a variable that they can’t get poststratification data for.

Here’s what I recommended:

If a survey selects on a non-census variable such as political ideology, or if you simply wish to adjust for it because of potential nonresponse bias, my recommendation is to do MRP on all these variables.

It goes like this: suppose y is your outcome of interest, X are the census variables, and z is the additional variable, in this example it is ideology. The idea is to do MRP by fitting a multilevel regression model on y given (X, z), then poststratify based on the distribution of (X, z) in the population. The challenge is that you don’t have (X, z) in the population; you only have X. So what you do to create the poststratification distribution of (X, z) is: first, take the poststratification distribution of X (known from the census); second, estimate the population distribution of z given X (most simply by fitting a multilevel regression of z given X from your survey data, but you can also use auxiliary information if available).

Yu-Sung and I did this a few years ago in our analysis of public opinion for school vouchers, where one of our key poststratification variables was religion, which we really needed to include for our analysis but which is not on the census. To poststratify, we first modeled religion given demographics—we had several religious categories, and I think we fit a series of logistic regressions. We used these estimated conditional distributions to fill out the poststrat table and then went from there. We never wrote this up as a general method, though.
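In case it helps, here's a minimal sketch of that recipe with simulated data. The variable names are hypothetical, and it uses lme4 point estimates rather than full Bayes, so it's just the skeleton of the method:

```r
# Sketch of MRP with a non-census adjustment variable z (say, a binary
# "ideology" indicator). Simulated data and hypothetical variable names.

library(lme4)
set.seed(1)

# Fake survey: outcome y, census variables state and age_cat, non-census z
n <- 2000
survey <- data.frame(
  state   = factor(sample(1:8, n, replace = TRUE)),
  age_cat = factor(sample(1:4, n, replace = TRUE))
)
survey$z <- rbinom(n, 1, plogis(-0.5 + 0.2 * as.integer(survey$age_cat)))
survey$y <- rbinom(n, 1, plogis(-1 + 1.5 * survey$z +
                                0.1 * as.integer(survey$state)))

# Fake census poststratification table: one row per (state, age_cat) cell,
# with population count N
pstrat <- expand.grid(state = factor(1:8), age_cat = factor(1:4))
pstrat$N <- sample(1000:5000, nrow(pstrat), replace = TRUE)

# Step 1: multilevel regression of y on (X, z)
fit_y <- glmer(y ~ z + (1 | state) + (1 | age_cat),
               family = binomial, data = survey)

# Step 2: model z given X to estimate its distribution in each census cell
fit_z <- glmer(z ~ (1 | state) + (1 | age_cat),
               family = binomial, data = survey)
pstrat$p_z <- predict(fit_z, newdata = pstrat, type = "response")

# Step 3: expand the census table over z, weighting cells by Pr(z | X)
expanded <- rbind(transform(pstrat, z = 1, N_cell = N * p_z),
                  transform(pstrat, z = 0, N_cell = N * (1 - p_z)))

# Step 4: predict y in each (X, z) cell and poststratify
expanded$theta <- predict(fit_y, newdata = expanded, type = "response")
with(expanded, sum(N_cell * theta) / sum(N_cell))   # population estimate
```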

Debate about genetics and school performance

Jag Bhalla points us to this article, “Differences in exam performance between pupils attending selective and non-selective schools mirror the genetic differences between them,” by Emily Smith-Woolley, Jean-Baptiste Pingault, Saskia Selzam, Kaili Rimfeld, Eva Krapohl, Sophie von Stumm, Kathryn Asbury, Philip Dale, Toby Young, Rebecca Allen, Yulia Kovas, and Robert Plomin, along with this response by Eric Turkheimer.

Smith-Woolley et al. find an association of test scores with genetic variables that are also associated with socioeconomic status, and conclude that “genetic and exam differences between school types are primarily due to the heritable characteristics involved in pupil admission.” From the other direction, Turkheimer says, “if the authors think their data support the hypothesis that socioeconomic educational differences are simply the result of pre-existing genetic differences among the students assigned to different schools, that is their right. But . . . the data they report here do nothing to actually make the case in one direction or the other.”

It’s hard for me to evaluate this debate given my lack of background in genetics (Bhalla shares some thoughts here, but I can’t really evaluate these either), but I thought I’d share it with you.

Can we do better than using averaged measurements?

Angus Reynolds writes:

Recently a PhD student at my University came to me for some feedback on a paper he is writing about the state of research methods in the Fear Extinction field. Basically you give someone an electric shock repeatedly while they stare at neutral stimuli and then you see what happens when you start showing them the stimuli and don’t shock them anymore. Power will always be a concern here because of the ethical problems.

Most of his paper is commenting on the complete lack of consistency between and within labs in how they analyse data: plenty of garden-of-forking-paths issues, and concerns about Type 1, Type 2, and Type S and M errors.

One thing I’ve been pushing him to talk about more is improved measurement.

Currently fear is measured in part by taking skin conductance measurements continuously and then summarising an 8 second or so window between trials into averages, which are then split into blocks and ANOVA’d.

I’ve commented that they must be losing information if they are summarising a continuous (and potentially noisy) measurement over time to 1 value. It seems to me that the variability within that 8 second window would be very important as well. So why not just model the continuous data?

Given that the field could be at least two steps away from where it needs to be (immature data, immature methods), I’ve suggested that he just start by making graphs of the complete data that he would like to be able to model one day and not to really bother with p-value style analyses.

In terms of developing the skills necessary to move forward: would you even bother trying to create models of the fear extinction process using the simplified, averaged data that most researchers use or would it be better to get people accustomed to seeing the continuous data first and then developing more complex models for that later?

My reply:

I actually don’t think it’s so horrible to average the data in this way. Yes, it should be better to model the data directly, and, yes, there has to be some information being lost by the averaging, but empirical variation is itself very variable, so it’s not like you can expect to see lots of additional information by comparing groups based on their observed variances.
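Here's a quick fake-data illustration of that point: analyze the window averages, or fit a multilevel model to all the samples, and (at least for data simulated this simply) the estimated condition effect comes out about the same either way. This is a toy simulation, not real skin-conductance data:

```r
# Toy comparison: analyzing 8-second windows by their means vs. modeling all
# the samples with a multilevel model. Simulated data, made-up numbers.

library(lme4)
set.seed(42)

n_subj <- 30; n_trial <- 10; n_samp <- 16    # 16 samples per 8-second window
d <- expand.grid(subj  = factor(1:n_subj),
                 trial = 1:n_trial,
                 samp  = 1:n_samp)
d$cond   <- ifelse(d$trial <= n_trial / 2, 0, 1)   # e.g., early vs. late trials
d$window <- interaction(d$subj, d$trial)           # one level per window
subj_eff <- rnorm(n_subj, 0, 0.5)
win_eff  <- rnorm(nlevels(d$window), 0, 0.3)
d$y <- 1 + 0.3 * d$cond + subj_eff[d$subj] +
       win_eff[as.integer(d$window)] + rnorm(nrow(d), 0, 1)

# (a) Conventional approach: average each window, then model the means
means   <- aggregate(y ~ subj + trial + cond, data = d, FUN = mean)
fit_avg <- lmer(y ~ cond + (1 | subj), data = means)

# (b) Model every sample, with window-within-subject random effects
fit_raw <- lmer(y ~ cond + (1 | subj) + (1 | window), data = d)

summary(fit_avg)$coefficients["cond", ]
summary(fit_raw)$coefficients["cond", ]
# For data generated this way, the estimated condition effect (true value 0.3)
# and its standard error are essentially the same under both analyses.
```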

I agree 100% with your suggestion of graphing the complete data. Regarding measurement, I think the key is for it to be connected to theory where possible. Also from the above description it sounds like the research is using within-person comparisons, which I generally recommend.

The Axios Turing test and the heat death of the journalistic universe

I was wasting some time on the internet and came across some Palko bait from the website Axios: “Elon Musk says Boring Company’s first tunnel to open in December,” with an awesome quote from this linked post:

Tesla CEO Elon Musk has unveiled a video of his Boring Company’s underground tunnel that will soon offer Los Angeles commuters an alternative mode of transportation in an effort to escape the notoriously clogged highways of the city.

The only thing that could top this would be a reference to Netflix. . . . Hey, wait a minute! Let’s google *axios netflix*. Here’s what we find:

The bottom line: Most view Netflix’s massive spending as reckless, but Ball argues it isn’t if you consider how it’s driving an unprecedented growth that could eventually allow Netflix to surpass Facebook in engagement and Pay-TV in penetration. At that point, they will have the leverage to increase prices, bringing them closer to profitability and making the massive spend worthwhile.

Palko

See, for example, here:

At this point, I’d imagine most of our regular readers are getting fairly burned out on the Hyperloop. I more than sympathize. In terms of technology, infrastructure, transportation, and public policy, we’ve pretty much exhausted the topic.

The one area, however, where the Hyperloop remains somewhat relevant is as an example of certain dangerous journalistic trends, such as the tendency to ignore red flags, warnings that there’s something wrong with the story as presented, be it lies, faulty data, or simply a fundamental misunderstanding. . . . One of the major causes of this dysfunction is the willingness to downplay or even completely ignore massive warning signs. . . .

and here:

What does Netflix really want? To make sense of the company’s approach toward original content, it is useful to think in terms of long-term IP value vs the hype-genic, those programs that lend themselves to promotion by being awards friendly or newsworthy. . . . If Netflix really is playing the wildly ambitious, extremely long term game that forms the basis for the company’s standard narrative and justifies incredible amounts of money investors are pouring in, then this distribution makes no sense whatsoever. If, on the other hand, the company is simply trying to keep the stock pumped up until they can find a soft landing spot, it makes all the sense in the world.

Summary

Of course, Palko could be wrong in his Musk and Netflix skepticism. Palko’s been banging this drum for a while—he’s been skeptical about these guys since way before it was fashionable—but that doesn’t make him right. Who knows? I’m no expert on tunneling or TV.

What’s really relevant for our discussion here is that an uncredentialed blogger working on zero budget can outperform a 30 million dollar media juggernaut. Even if Palko’s getting it wrong, he’s putting some thought into it, giving us a better read and more interesting content than Axios’s warmed-over press releases.

On one hand, this is an inspiring case of the little guy offering better content than the well-funded corporation. From the other direction, it’s a sad story that Axios has this formula for getting attention by running press releases, while I suspect that most of the people who read Palko only do so because they happen to be bored one day and decide to click through our blogroll—and then are so bored that they click on the site labeled Observational Epidemiology (perhaps the only blog title that’s more dry than Statistical Modeling, Causal Inference, and Social Science).

An Axios Turing test

But there’s also this which I noticed from the wikipedia article on Axios:

The company earned more than $10 million in revenue in its first seven months, primarily with native advertising that appears in between stories.

“Native advertising” . . . what’s that? I’ll look that up:

Native advertising is a type of advertising, mostly online, that matches the form and function of the platform upon which it appears. In many cases, it manifests as either an article or video, produced by an advertiser with the specific intent to promote a product, while matching the form and style which would otherwise be seen in the work of the platform’s editorial staff.

Hey! This would explain the Musk and Netflix stories. Why is Axios running press releases that could damage its reputation as a news source? Because it’s “native advertising.” Sure, it might hurt their journalistic reputation, but by the same token it could help their reputation as an ad site, as it demonstrates their willingness to run press releases as if they’re actual news stories.

By running these press releases, Axios is either demonstrating its credulity or demonstrating its willingness to print what its sponsors want to be printed. Either of these traits could be considered a plus for an ad site.

And this brings us to the Native Content Turing Test. Once your news feed contains native advertising, you have the challenge of figuring out which stories are real and which are sponsored. The better the native advertising is, the harder it will be to tell the difference, leading to a Platonic ideal in which all the ads read like news stories and all the news stories read like press releases. A sort of journalistic equivalent of the heat death of the universe, in which every story contains exactly zero bits of new information.

P.S. The last paragraph above was a joke. Apparently native advertising must be labeled in some way. According to wikipedia:

The most common practices of these are recognizable by understated labels, such as “Advertisement”, “Ad”, “Promoted”, “Sponsored”, “Featured Partner”, or “Suggested Post” in subtitles, corners, or the bottoms of ads.

I found no such identifiers in the two Axios articles above, which suggests they are not advertising but rather are just business-as-usual news articles from an organization with $30 million to burn but no time to do much more than run press releases on these particular topics.

But, again, they win: they get the clicks. And I suppose that the bland ad-like content makes the site safe for native advertising: In an environment of touched-up press releases, an actual ad doesn’t stand out in any awkward way.

Baltimore-Washington

Taking Amtrak, going through Baltimore, reminded that Baltimore used to be the big city and Washington was the small town. My mother worked for the government in the 1970s-80s and had a friend, Pat Smith, who’d moved to DC during WW2 when they were filling up all the government agencies. Pat told my mom that, way back then, they’d go up to Baltimore to shop, because in Baltimore the stores had the newest dresses. I guess Woodies didn’t cut it.

Thurgood Marshall was from Baltimore, as I was reminded as we went by the stop for the Thurgood Marshall Airport (formerly BWI, formerly Friendship). Brett Kavanaugh was from the DC suburbs. Times have changed, what it takes to be a big city, what it takes to be a federal judge.

A study fails to replicate, but it continues to get referenced as if it had no problems. Communication channels are blocked.

In 2005, Michael Kosfeld, Markus Heinrichs, Paul Zak, Urs Fischbacher, and Ernst Fehr published a paper, “Oxytocin increases trust in humans.” According to Google, that paper has been cited 3389 times.

In 2015, Gideon Nave, Colin Camerer, and Michael McCullough published a paper, “Does Oxytocin Increase Trust in Humans? A Critical Review of Research,” where they reported:

Behavioral neuroscientists have shown that the neuropeptide oxytocin (OT) plays a key role in social attachment and affiliation in nonhuman mammals. Inspired by this initial research, many social scientists proceeded to examine the associations of OT with trust in humans over the past decade. . . . Unfortunately, the simplest promising finding associating intranasal OT with higher trust [that 2005 paper] has not replicated well. Moreover, the plasma OT evidence is flawed by how OT is measured in peripheral bodily fluids. Finally, in recent large-sample studies, researchers failed to find consistent associations of specific OT-related genetic polymorphisms and trust. We conclude that the cumulative evidence does not provide robust convergent evidence that human trust is reliably associated with OT (or caused by it). . . .

Nave et al. has been cited 101 times.

OK, fine. The paper’s only been out 3 years. Let’s look at recent citations, since 2017:

“Oxytocin increases trust in humans”: 377 citations
“Does Oxytocin Increase Trust in Humans? A Critical Review of Research”: 49 citations

OK, I’m not the world’s smoothest googler, so maybe I miscounted a bit. But the pattern is clear: New paper revises consensus, but, even now, old paper gets cited much more frequently.

Just to be clear, I’m not saying the old paper should be getting zero citations. It may well have made an important contribution for its time, and even if its results don’t stand up to replication, it could be worth citing for historical reasons. But, in that case, you’d want to also cite the 2015 article pointing out that the result did not replicate.

The pattern of citations suggests that, instead, the original finding is still hanging on, with lots of people not realizing the problem.

For example, in my Google scholar search of references since 2017, the first published article that came up was this paper, “Survival of the Friendliest: Homo sapiens Evolved via Selection for Prosociality,” in the Annual Review of Psychology. I searched for the reference and found this sentence:

This may explain increases in trust during cooperative games in subjects that have been given intranasal oxytocin (Kosfeld et al. 2005).

Complete acceptance of the claim, no reference to problems with the study.

My point here is not to pick on the author of this Annual Review paper—even when writing a review article, it can be hard to track down all the literature on every point you’re making—nor is it to criticize Kosfeld et al., who did what they did back in 2005. Not every study replicates; that’s just how things go. If it were easy, it wouldn’t be research. No, I just think it’s sad. There’s so much publication going on that these dead research results fill up the literature and seem to lead to endless confusion. Like a harbor clotted with sunken vessels.

Things can get much uglier when researchers whose studies don’t replicate refuse to admit it. But even if everyone is playing it straight, it can be hard to untangle the research literature. Mistakes have a life of their own.

Vote suppression in corrupt NY State

I’ll be out of town on election day so I thought I should get an absentee ballot. It’s not so easy:

OK, so I download the form and print it out. Now I have to mail it to the county board of elections. Where exactly is that? I find the little link . . . here it is:

Jeez. They don’t make it easy. Not quite vote suppression (yes, the above title was an exaggeration), but they’re not exactly encouraging us to vote, either.

To put it another way, here’s another webpage, also run by New York State:

It appears that the people who run our state government care more—a lot more—about getting its residents to blow their savings on mindless gambling than about expressing their views at the ballot box.

It’s almost as if the people in power like the system that got them in power, and don’t want a bunch of pesky voters out there rocking the boat.

P.S. It seems that Kansas is worse.

What to think about this new study which says that you should limit your alcohol to 5 drinks a week?

Someone who wishes to remain anonymous points us to a recent article in the Lancet, “Risk thresholds for alcohol consumption: combined analysis of individual-participant data for 599 912 current drinkers in 83 prospective studies,” by Angela Wood et al., that’s received a lot of press coverage; for example:

Terrifying New Study Breaks Down Exactly How Drinking Will Shorten Your Life

Extra glass of wine a day ‘will shorten your life by 30 minutes’

U.S. should lower alcohol recommendation because booze shortens our lives, study says

Here’s the key graph from the research paper:

According to one of the news articles, 100 grams of alcohol per week, which according to the graph above is approximately the maximum safe dose, is equivalent to “five standard 175ml glasses of wine or five pints [of beer] a week.”

The press coverage of this study was uncritical and included this summary from our friend David Spiegelhalter who described it as “massive and very impressive”:

The paper estimates a 40-year-old drinking four units a day above the guidelines [the equivalent of drinking three glasses of wine in a night] has roughly two years’ lower life expectancy, which is around a 20th of their remaining life. This works out at about an hour per day. So it’s as if each unit above guidelines is taking, on average, about 15 minutes of life, about the same as a cigarette.

And the statistics

On one hand, I’m always suspicious of headline-grabbing studies. On the other, I respect Spiegelhalter.

I took a look at the research article, and . . . it’s complicated. They need to do a lot with their data. Most obviously, they need to adjust for pre-treatment differences in the groups, differences of age, sex, smoking status, and other variables. They adjust for whether people are current smokers or not, but they don’t seem to adjust for the level of smoking or past smoking status; maybe these data are not available? The researchers also have a complicated series of decisions to make regarding missing data, inclusion of nonlinear terms and interactions in their models, and other statistical details.
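For readers who want the flavor of what "adjusting for pre-treatment differences" looks like in this kind of survival data, here is a minimal sketch on simulated data. This is just a generic covariate-adjusted proportional-hazards regression with made-up variables and numbers, not the specification actually used by Wood et al.:

```r
# Minimal sketch of a covariate-adjusted survival analysis on simulated data.
# Illustrates the general idea of adjusting for age, sex, and smoking; it is
# not the model used in the Lancet paper.

library(survival)
set.seed(7)
n <- 5000
sim <- data.frame(
  alcohol = rgamma(n, shape = 2, scale = 60),   # grams of alcohol per week
  age     = runif(n, 40, 75),
  sex     = rbinom(n, 1, 0.5),
  smoker  = rbinom(n, 1, 0.25)
)
# Simulated hazard depends on the covariates and (weakly) on alcohol:
rate <- exp(-9 + 0.09 * sim$age + 0.6 * sim$smoker + 0.001 * sim$alcohol)
t_event <- rexp(n, rate)
sim$time   <- pmin(t_event, 10)            # administrative censoring at 10 years
sim$status <- as.integer(t_event <= 10)    # 1 = event observed, 0 = censored

fit <- coxph(Surv(time, status) ~ alcohol + age + sex + smoker, data = sim)
summary(fit)   # exp(coef) for alcohol: adjusted hazard ratio per gram per week
```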

It was not clear to me exactly how they got the above graph, and how much the pattern could be distorted by systematic differences between the groups, not caused by alcohol consumption.

That said, you have to do something. And, from a casual look at the paper, the analyses seem serious and the results seem plausible.

The logical next step is for the data and analysis to be shared. Immediately. Put all the data on a spreadsheet on Github so that anyone can do their own analyses. Maybe the data are already publicly available and easily accessible? I don’t know. There are various questionable steps in the published analysis—that’s fine, no analysis is perfect!—and the topic is important enough that it’s time to let a thousand reanalyses bloom.

P.S. More here in this excellent post by Spiegelhalter.

“The dwarf galaxy NGC1052-DF2”

Paul Pudaite points to this post by Stacy McGaugh entitled, “The dwarf galaxy NGC1052-DF2.” Pudaite writes that it’s an interesting comment on consequences of excluding one outlier.

I can’t really follow what’s going on here but I thought I’d share it for the benefit of all the astronomers out there.
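That said, the general statistical point, that a single point can drive an estimate of spread when you only have a handful of measurements, is easy to illustrate with a toy example (nothing to do with the actual galaxy data):

```r
# Toy illustration: with only ten measurements, one outlier can dominate an
# estimate of dispersion. Made-up numbers, unrelated to NGC1052-DF2.

set.seed(3)
v <- c(rnorm(9, mean = 0, sd = 8), 40)   # nine typical values plus one outlier

sd(v)       # spread estimated with the outlier included
sd(v[-10])  # and with it excluded
# The estimate with the outlier is much larger, and anything downstream that
# depends on the squared dispersion moves even more.
```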

P.S. Apparently it is common for giant galaxies to have some dwarf satellite galaxies. Who knew?

Multilevel models with group-level predictors

Kari Lock Morgan writes:

I’m writing now though with a multilevel modeling question that has been nagging me for quite some time now. In your book with Jennifer Hill, you include a group-level predictor (for example, 12.15 on page 266), but then end up fitting this as an individual-level predictor with lmer. How can this be okay? It seems as if lmer can’t really still be fitting the model specified in 12.15? In particular, I’m worried about analyzing a cluster randomized experiment where the treatment is applied at the cluster level and outcomes are at the individual level – intuitively, of course it should matter that the treatment was applied at the cluster level, not the individual level, and therefore somehow this should enter into how the model is fit? However, I can’t seem to grasp how lmer would know this, unless it is implicitly looking at the covariates to see if they vary within groups or not, which I’m guessing it’s not? In your book you act as if fitting the model with county-level uranium as an individual predictor is the same as fitting it as a group-level predictor, which makes me think perhaps I am missing something obvious?

My reply: It indeed is annoying that lmer (and, for that matter, stan_lmer in its current incarnation) only allows individual-level predictors, so that any group-level predictors need to be expanded to the individual level (for example, u_full <- u[group]). But from the standpoint of fitting the statistical model, it doesn't matter. Regarding the question, how does the model "know" that, in this case, u_full is actually an expanded group-level predictor: The answer is that it "figures it out" based on the dependence between u_full and the error terms. It all works out.
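To make the u_full <- u[group] trick concrete, here's a minimal sketch with simulated data, in the spirit of the radon/uranium example but not the actual dataset from the book:

```r
# Sketch of expanding a group-level predictor to the individual level for lmer.
# Simulated data; "counties" and "uranium" are stand-ins, not the book's data.

library(lme4)
set.seed(2)

J <- 50                           # number of groups (e.g., counties)
n <- 1000                         # number of individuals
group <- sample(1:J, n, replace = TRUE)
u <- rnorm(J)                     # group-level predictor (e.g., log uranium)
x <- rnorm(n)                     # individual-level predictor

a <- 1 + 0.7 * u + rnorm(J, sd = 0.3)    # group intercepts depend on u
y <- a[group] + 0.5 * x + rnorm(n, sd = 1)

# lmer wants everything at the individual level, so expand u to u_full:
u_full <- u[group]
fit <- lmer(y ~ x + u_full + (1 | group))
summary(fit)
# The coefficient on u_full recovers the group-level slope (true value 0.7),
# even though u_full was entered as an "individual-level" column.
```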