
What makes a mathematical formula beautiful?


Hiro Minato pointed me to this paper (hyped here) by Semir Zeki, John Romaya, Dionigi Benincasa, and Michael Atiyah, “The experience of mathematical beauty and its neural correlates.” The authors report:

We used functional magnetic resonance imaging (fMRI) to image the activity in the brains of 15 mathematicians when they viewed mathematical formulae which they had individually rated as beautiful, indifferent or ugly. Results showed that the experience of mathematical beauty correlates parametrically with activity in the same part of the emotional brain, namely field A1 of the medial orbito-frontal cortex (mOFC), as the experience of beauty derived from other sources.

I wrote that I looked at the paper and I don’t believe it!

Minato replied:

I think what they did wasn’t good enough to answer or even approach the question (scientifically or otherwise). . . . Meanwhile, someone can probably study sociology or culture of mathematicians to understand why mathematicians want to describe some “good” mathematics beautiful, elegant, etc.

I agree. Mathematical beauty is a fascinating topic; I just don’t think they’re going to learn much via MRI scans. It just seems like too crude a tool, kinda like writing a bunch of formulas on paper, feeding these sheets of paper to lab rats, and then performing a chemical analysis of the poop. The connection between input and output is just too noisy and indirect.

This seems like a problem that could use the collaboration of mathematicians, psychologists, and historians or sociologists. And just think of how much sociologist time you could afford, using the money you saved from not running the MRI machine!

Thanks, eBay!


Our recent Stan short course went very well, and we wanted to thank Julia Neznanova and Paul Burt of eBay NYC for giving us the space where we held the class.

More evidence that even top researchers routinely misinterpret p-values

Blake McShane writes:

I wanted to write to you about something related to your ongoing posts on replication in psychology as well as your recent post on the ASA statement on p-values. In addition to the many problems you and others have documented with the p-value as a measure of evidence (both those computed “honestly” and those computed after fishing, the garden of forking paths, etc.), another problem seems to be that academic researchers across the biomedical and social sciences genuinely interpret them quite poorly.

In a forthcoming paper, my colleague David Gal and I survey top academics across a wide variety of fields including the editorial board of Psychological Science and authors of papers published in the New England Journal of Medicine, the American Economic Review, and other top journals. We show:
[1] Researchers interpret p-values dichotomously (i.e., focus only on whether p is below or above 0.05).
[2] They fixate on them even when they are irrelevant (e.g., when asked about descriptive statistics).
[3] These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data.
We also show they ignore the magnitudes of effect sizes.

In case you have any interest, I am attaching the paper. Unfortunately, our data is presented in tabular format due to an insistent AE; thus, I am also attaching our supplement which presents the data graphically.

And here’s their key finding:

[screenshot of the key result]

Bad news. I’m sure researchers would also misinterpret Bayesian inferences to the extent that they are being used in a null hypothesis significance testing framework. But I think p-values are particularly bad: either they’re being used to make (generally meaningless) claims about hypotheses being true or false, or they’re being used as an indirect estimation tool, in which case they involve that horrible nonlinear transformation that makes them so hard to interpret (the difference between significant and non-significant not being itself statistically significant, and all that). This is part of the bigger problem that these numbers are just about impossible to interpret in the (real-world) scenario in which the null hypothesis is not precisely true.
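To illustrate the “difference between significant and non-significant” point with made-up numbers (a minimal sketch; the estimates and standard errors here are my own illustration, not from McShane and Gal's paper):

```python
from math import erf, sqrt

def two_sided_p(est, se):
    """Two-sided p-value for a normal-theory z-test of est against zero."""
    z = abs(est / se)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Two estimates with the same standard error: one "significant," one not.
p1 = two_sided_p(25, 10)    # z = 2.5, p around 0.01
p2 = two_sided_p(10, 10)    # z = 1.0, p around 0.32

# But the difference between the two estimates is nowhere near significant:
p_diff = two_sided_p(25 - 10, sqrt(10**2 + 10**2))   # z around 1.06
print(p1, p2, p_diff)
```

The first estimate clears the 0.05 threshold and the second does not, yet the comparison between them is far from significant; a dichotomous reading of the two p-values hides that.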

Bayesian Inference with Stan for Pharmacometrics Class

Bob Carpenter, Daniel Lee, and Michael Betancourt will be teaching the 3-day class starting on 19 September in Paris. Following is the outline for the course:

Day 1

Introduction to Bayesian statistics

  • Likelihood / sampling distributions
  • Priors, Posteriors via Bayes’s rule
  • Posterior expectations and quantiles
  • Events as expectations of indicator functions

Introduction to Stan

  • Basic data types
  • Variable declarations
  • Constrained parameters and transforms to unconstrained
  • Program blocks and execution
  • Derived quantities
  • Built-in functions and operators
  • Statements: sampling, assignment, loops, conditionals, blocks
  • How to use Stan within R with RStan

Hands-on examples

Day 2

ODE and PK/PD Modeling

  • Parameters and data to ODEs
  • Non-stiff ODE solver
  • Stiff ODE solver
  • Control parameters and tolerances
  • Coupled ODE systems for sensitivities
  • Elimination half-lives

Inference with Markov chain Monte Carlo

  • Monte Carlo methods and plug-in inference
  • Markov chain Monte Carlo
  • Convergence diagnostics, R-hat, effective sample size
  • Effective sample size vs. number of iterations
  • Plug-in posterior expectations and quantiles
  • Event probability calculations

Hands-on examples

Day 3

Additional Topics in PK/PD Modeling

  • Bolus and infusion dosing
  • Lag time and absorption models
  • Linear versus Michaelis/Menten elimination
  • Hierarchical models for patient-level effects
  • Transit compartment models and time lags
  • Multi-compartment models and varying time scales
  • Joint PK/PD modeling: Bayes vs. “cut”
  • Meta-analysis
  • Formulating informative priors
  • Clinical trial simulations and power calculations

Stan programming techniques

  • Reproducible research practices
  • Probabilistic programming principles
  • Generated quantities for inference
  • Data simulation and model checking
  • Posterior predictive checks
  • Cross-validation and predictive calibration
  • Variable transforms for sampling efficiency
  • Multiple indexing and range slicing
  • Marginalizing discrete parameters
  • Handling missing data
  • Ragged and sparse data structures
  • Identifiability and problematic posteriors
  • Weakly informative priors

If you are in Europe in September, please come and join us. Thanks to Julie Bertrand and France Mentré from Université Paris Diderot for helping us organize the course.

You can register here.

Killer O


Taggert Brooks points to this excellent news article by George Johnson, who reports:

Epidemiologists have long been puzzled by a strange pattern in their data: People living at higher altitudes appear less likely to get lung cancer. . . . The higher you live, the thinner the air, so maybe oxygen is a cause of lung cancer. . . .

But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.

Back to the epidemiology. Researchers Kamen Simeonov and Daniel Himmelstein adjusted for a bunch of demographic and medical variables, and then:

After an examination of all these numbers for the residents of 260 counties in the Western United States, situated from sea level to nearly 11,400 feet, one pattern stood out: a correlation between the concentration of oxygen in the air and the incidence of lung cancer. For each 1,000-meter rise in elevation, there were 7.23 fewer lung cancer cases per 100,000 people.

“7.23” . . . that’s a bit overprecise, there’s no way you could know it to this level of accuracy. But I get the general idea.

As Brooks notes, this idea is not new. He links to a 1987 paper by Clarice Weinberg, Kenneth Brown, and David Hoel, who discussed “recent evidence implicating reactive forms of oxygen in carcinogenesis and atherosclerosis” and wrote that “reduced oxygen pressure of inspired air may be protective against certain causes of death.”

The idea has also hit the mass media. For example, from a 2012 article by Michael Corvinus in Cracked (yes, Cracked):

One of the disadvantages of living at higher altitudes is that there’s less oxygen in the air, which can suck for those with respiratory problems. One of the advantages of those places, however, is that … there’s less oxygen in the air. A lack of oxygen makes people’s bodies more efficient, which makes them live longer. . . . Dr. Benjamin Honigman at the University of Colorado School of Medicine theorized that the lower levels of oxygen force the body to become more efficient at distributing that oxygen, activating certain genes that enhance heart function and create new blood vessels for bringing blood to and from the heart, greatly lowering the chances of heart disease.

On deck this week

Mon: Killer O

Tues: More evidence that even top researchers routinely misinterpret p-values

Wed: What makes a mathematical formula beautiful?

Thurs: Fish cannot carry p-values

Fri: Does Benadryl make you senile? Challenges in research communication

Sat: What recommendations to give when a medical study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best)

Sun: Powerpose update

“Children seek historical traces of owned objects”

Recently in the sister blog:

An object’s mental representation includes not just visible attributes but also its nonvisible history. The present studies tested whether preschoolers seek subtle indicators of an object’s history, such as a mark acquired during its handling. Five studies with 169 children 3–5 years of age and 97 college students found that children (like adults) searched for concealed traces of object history, invisible traces of object history, and the absence of traces of object history, to successfully identify an owned object. Controls demonstrated that children (like adults) appropriately limit their search for hidden indicators when an owned object is visibly distinct. Altogether, these results demonstrate that concealed and invisible indicators of history are an important component of preschool children’s object concepts.

“The Dark Side of Power Posing”

Shravan points us to this post from Jay Van Bavel a couple years ago. It’s an interesting example because Van Bavel expresses skepticism about the “power pose” hype but he makes the same general mistake as Carney, Cuddy, Yap, and other researchers in this area in that he overreacts to every bit of noise that’s been p-hacked and published.

Here’s Van Bavel:

Some of the new studies used different analysis strategies than the original paper . . . but they did find that the effects of power posing were replicable, if troubling. People who assume high-power poses were more likely to steal money, cheat on a test and commit traffic violations in a driving simulation. In one study, they even took to the streets of New York City and found that automobiles with more expansive driver’s seats were more likely to be illegally parked. . . .

Dr. Brinol [sic] and his colleagues found that power posing increased self-confidence, but only among participants who already had positive self-thoughts. In contrast, power posing had exactly the opposite effect on people who had negative self-thoughts. . . .

In two studies, Joe Cesario and Melissa McDonald found that power poses only increased power when they were made in a context that indicated dominance. Whereas people who held a power pose while they imagined standing at an executive desk overlooking a worksite engaged in powerful behavior, those who held a power pose while they imagined being frisked by the police actually engaged in less powerful behavior. . . .

In a way I like all this because it shows how the capitalize-on-noise strategy which worked so well for the original power pose authors can also be used to dismantle the whole idea. So that’s cool. But from a scientific point of view, I think there’s so much noise here that any of these interactions could well go in the opposite direction. Not to mention all the unstudied interactions and all the interactions that happened not to be statistically significant in these particular small samples.

I’m not trying to slam Van Bavel here. The above-linked post was published in 2013, before we were all fully aware of how easy it was for researchers to get statistical significance from noise, even without having to try. Now we know better: just cos some correlation or interaction appears in a sample, we don’t have to think it represents anything in the larger population.

When do statistical rules affect drug approval?

Someone writes in:

I have MS and take a disease-modifying drug called Copaxone. Sandoz developed a generic version of Copaxone and filed for FDA approval. Teva, the manufacturer of Copaxone, filed a petition opposing that approval (surprise!). FDA rejected Teva’s petitions and approved the generic.

My insurance company encouraged me to switch to the generic. Specifically, they increased the copay for the non-generic from $50 to $950 per month. That got my attention. My neurologist recommended against switching to the generic.

Consequently, I decided to review the FDA decision to see if I could get any insight into the basis for my neurologist’s recommendation.

What appeared on first glance to be a telling criticism of the Teva submission was a reference by the FDA to “non-standard statistical criteria,” together with the FDA’s statement that reanalysis with standard practices found different results than those found by Teva. So I looked back at the Teva filing to identify the non-standard statistical criteria they used. If I found the right part of the Teva filing, they used R packages named ComBat and LIMMA, both empirical Bayes tools.

Now, it is possible that I have made a mistake and have not properly identified the statistical criteria that the FDA found wanting. I was unable to find any specific statement w.r.t. the “non-standard” statistics.

But, if empirical Bayes works better than older methods, then falling back to older methods would result in weaker inferences—and the rejection of the data from Teva.

It seems to me that this case raises interesting questions about the adoption and use of empirical Bayes. How should the FDA have treated the “non-standard statistical criteria”? More generally, is there a problem with getting regulatory agencies to accept Bayesian models? Maybe there is some issue here that would be appropriate for a masters student in public policy.

My correspondent included some relevant documentation:

The FDA docket files are available in docket FDA-2015-P-1050.

The text below is from the April 15, 2015 FDA Denial Letter to Teva (Citizen_Petition_Denial_Letter_From_CDER_to_Teva_Pharmaceuticals.pdf), at pp. 41-42:

Specifically, we concluded that the mouse splenocyte studies were poorly designed, contained a high level of residual batch bias, and used non-standard statistical criteria for assessing the presence of differentially expressed genes. When FDA reanalyzed the microarray data from one Teva study using industry standard practices and criteria, Copaxone and the comparator (Natco) product were found to have very similar effects on the efficacy-related pathways proposed for glatiramer acetate’s mechanism of action.

The image below is from the Teva Petition, July 2, 2014, at p. 60:


And he adds:

My interest in this topic arose only because of my MS treatment—I have had no contact with Teva, Sandoz, or the FDA. And I approve of the insurance company’s action—that is, I think that creating incentives to encourage consumers to switch to generic medicines is usually a good idea.

I have no knowledge of any of this stuff, but the interaction of statistics and policy seems generally relevant so I thought I would share this with all of you.

Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

The celebrated medical-research reformer has a new paper (sent to me by Keith O’Rourke; official published version here), where he writes:

As EBM [evidence-based medicine] became more influential, it was also hijacked to serve agendas different from what it originally aimed for. Influential randomized trials are largely done by and for the benefit of the industry. Meta-analyses and guidelines have become a factory, mostly also serving vested interests. National and federal research funds are funneled almost exclusively to research with little relevance to health outcomes. We have supported the growth of principal investigators who excel primarily as managers absorbing more money.

He continues:

Diagnosis and prognosis research and efforts to individualize treatment have fueled recurrent spurious promises. Risk factor epidemiology has excelled in salami-sliced data-dredged papers with gift authorship and has become adept to dictating policy from spurious evidence. Under market pressure, clinical medicine has been transformed to finance-based medicine. In many places, medicine and health care are wasting societal resources and becoming a threat to human well-being. Science denialism and quacks are also flourishing and leading more people astray in their life choices, including health.

And concludes:

EBM still remains an unmet goal, worthy to be attained.

Read the whole damn thing.

Going beyond confidence intervals

Anders Lamberg writes:

In an article by Tom Sigfried, Science News, July 3 2014, “Scientists’ grasp of confidence intervals doesn’t inspire confidence” you are cited: “Gelman himself makes the point most clearly, though, that a 95 percent probability that a confidence interval contains the mean refers to repeated sampling, not any one individual interval.”

I have some simple questions that I hope you can answer. I am not a statistician but a biologist, with only basic education in statistics. My company is working with surveillance of populations of salmon in Norwegian rivers and we have developed methods for counting all individuals in populations. We have moved from using estimates acquired from samples to actually counting all individuals in the populations. This is possible because the salmon migrate between the ocean and the rivers and often have to pass narrow parts of the rivers, where we use underwater video cameras to cover the whole cross section. In this way we “see” every individual and can categorize size, sex, etc. Another argument for counting all individuals is that our Atlantic salmon populations rarely exceed 3000 individuals (average of approx. 500), in contrast to Pacific salmon populations where numbers are more in the range of 100,000 to more than a million.

In Norway we also have a large salmon farming industry where salmon are held in net pens in the sea. The problem is that these fish, which have been artificially selected for over 10 generations, are a threat to the natural populations if they escape and breed with the wild salmon. There is a concern that the “natural gene pool” will be diluted. That was only background for my questions, although the nature of the statistical problem is general for all sampling.

Here is the statistical problem: In a breeding population of salmon in a river there may be escapees from the fish farms. It is important to know the proportion of farmed escapees. If it exceeds 5% in a given population, measures should be made to reduce the number of farmed salmon in that river. But how can we find the real proportion of farmed salmon in a river? The method used for over 30 years now is to sample approximately 60 salmon from each river and count how many wild and how many farmed salmon you got in that sample. The total population may be 3000 individuals in total.

Only one sample is taken. A point estimate is calculated, along with a confidence interval for that estimate. In one realistic example we may sample 60 salmon and find that 6 of them are farmed fish. That gives a point estimate of 10% farmed fish in the population of 3000 in that specific river. The 95% confidence interval will be from approximately 2% to 18%. Most commonly, only the point estimate is reported.

When I read your comment in the article cited in the start of this mail, I see that something must be wrong with this sampling procedure. Our confidence interval is linked to the sample and does not necessarily reflect the “real value” that we are interested in. As I see it now our point estimate acquired from only one sample does not give us much at all. We should have repeated the sampling procedure many times to get an estimate that is precise enough to say if we have passed the limit of 5% farmed fish in that population.

Can we use the one sample of 60 salmon in the example to say anything at all about the proportion of farmed salmon in that river? Can we use the point estimate 10%?

We have asked this question to the government, but they reply that it is more likely that the real value lies near the 10% point estimate, since the confidence interval has the shape of a normal distribution.

Is this correct?

As I see it the real value does not have to lie within the 95 % confidence interval at all. However, if we increase the sample size close to the population size, we will get a precise estimate. But, what happens when we use small samples and do not repeat?

My reply:

In this case, the confidence intervals seem reasonable enough (under the usual assumption that you are measuring a simple random sample). I suspect the real gains will come from combining estimates from different places and different times. A hierarchical model will allow you to do some smoothing.
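As a quick check on the numbers in the letter, the reported interval can be reproduced with the usual normal approximation (a sketch in Python rather than R; the 1.96 multiplier is the standard 95% normal quantile):

```python
from math import sqrt

n, k = 60, 6                     # fish sampled, farmed fish observed
p_hat = k / n                    # point estimate: 0.10
se = sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
# Prints roughly [2.4%, 17.6%], matching the "2% to 18%" in the letter.
print(f"estimate {p_hat:.0%}, 95% interval [{lo:.1%}, {hi:.1%}]")
```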

Here’s an example. Suppose you sample 60 salmon in the same place each year and the numbers of farmed fish you see are 7, 9, 7, 6, 5, 8, 7, 2, 8, 7, … These data are consistent with there being a constant proportion of 10% farmed fish (indeed, I created these particular numbers using rbinom(10,60,.1) in R). On the other hand, if the numbers you see are 8, 12, 9, 5, 3, 11, 8, 0, 11, 9, … then this is evidence for real fluctuations. And of course if you see a series such as 5, 0, 3, 8, 9, 11, 9, 12, …, this is evidence for a trend. So you’d want to go beyond confidence intervals to make use of all that information. There’s actually a lot of work done using Bayesian methods in fisheries which might be helpful here.
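A rough version of this comparison can be sketched in Python (my own illustration, not from the original post: the overdispersion ratio and the particular varying probabilities are assumptions I added to make the "real fluctuations" idea concrete):

```python
import random

random.seed(1)

def yearly_counts(probs, n=60):
    """Simulated farmed-fish counts: one sample of n fish per year."""
    return [sum(random.random() < p for _ in range(n)) for p in probs]

def dispersion_ratio(counts, n=60):
    """Observed variance of the counts divided by the binomial variance
    implied by the pooled rate; near 1 if the proportion is constant."""
    m = sum(counts) / len(counts)
    var = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
    return var / (n * (m / n) * (1 - m / n))

years = 200
constant = yearly_counts([0.10] * years)              # like rbinom(years, 60, .1)
varying = yearly_counts([0.05, 0.15] * (years // 2))  # real year-to-year swings

print(dispersion_ratio(constant))  # close to 1
print(dispersion_ratio(varying))   # well above 1
```

With only one sample per river you cannot make this comparison at all, which is why repeated sampling (or pooling across rivers in a hierarchical model) buys so much.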

Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

This article by Tanner Sorensen, Sven Hohenstein, and Shravan Vasishth might be of interest to some of you.

No, Google will not “sway the presidential election”

Grrr, this is annoying. A piece of exaggerated science reporting hit PPNAS and was promoted in Politico, then Kaiser Fung and I shot it down (“Could Google Rig the 2016 Election? Don’t Believe the Hype”) in our Daily Beast column last September.

Then it appeared again this week in a news article in the Christian Science Monitor.

I know Christian Scientists believe in a lot of goofy things but I didn’t know that they’d fall for silly psychology studies.

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim—so, in that sense, I can’t hope for much more. What I really think is that Rosen should’ve read what Kaiser and I wrote, realized our criticisms were valid, and then have not wasted time reporting on the silly claim based on a huge, unrealistic manipulation in a highly artificial setting. But that would’ve involved shelving a promising story idea, and what reporter wants to do that?

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim. So I can’t really get upset about the reporting: if the reporter is not an expert on politics, it can be hard for him to judge what to believe.

Nonetheless, even though it’s not really the reporter’s fault, the whole event saddens me, in that it illustrates how ridiculous hype pays off. The original researchers did a little study which has some value but then they hyped it well beyond any reasonable interpretation (as their results came from a huge, unrealistic manipulation in a highly artificial setting), resulting in a ridiculous claim that Google can sway the presidential election. The hypesters got rewarded for their hype with media coverage. Which of course motivates more hype in the future. It’s a moral hazard.

I talked about this general problem a couple years ago, under the heading, Selection bias in the reporting of shaky research. It goes like this. Someone does a silly study and hypes it up. Some reporters realize right away that it’s ridiculous, others ask around and learn that it makes no sense, and they don’t bother reporting on it. Other reporters don’t know any better—that’s just the way it is, nobody can be an expert on everything—and they report on it. Hence the selection bias: The skeptics don’t waste their time writing about a bogus or over-hyped study; the credulous do. The net result is that the hype continues.

P.S. I edited the above post (striking through some material and replacing with two new paragraphs) in response to comments.

Moving statistical theory from a “discovery” framework to a “measurement” framework

Avi Adler points to this post by Felix Schönbrodt on “What’s the probability that a significant p-value indicates a true effect?” I’m sympathetic to the goal of better understanding what’s in a p-value (see for example my paper with John Carlin on type M and type S errors) but I really don’t like the framing in terms of true and false effects, false positives and false negatives, etc. I work in social and environmental science. And in these fields it almost never makes sense to me to think about zero effects. Real-world effects vary, they can be difficult to measure, and statistical theory can be useful in quantifying available information—that I agree with. But I don’t get anything out of statements such as “Prob(effect is real | p-value is significant).”

This is not a particular dispute with Schönbrodt’s work; rather, it’s a more general problem I have with setting up the statistical inference problem in that way. I have a similar problem with “false discovery rate,” in that I don’t see inferences (“discoveries”) as being true or false. Just for example, does the notorious “power pose” paper represent a false discovery? In a way, sure, in that the researchers were way overstating their statistical evidence. But I think the true effect on power pose has to be highly variable, and I don’t see the benefit of trying to categorize it as true or false.

Another way to put it is that I prefer to think of statistics via a “measurement” paradigm rather than a “discovery” paradigm. Discoveries and anomalies do happen—that’s what model checking and exploratory data analysis are all about—but I don’t really get anything out of the whole true/false thing. Hence my preference for looking at type M and type S errors, which avoid having to worry about whether some effect is zero.
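For concreteness, here is a minimal Monte Carlo sketch of type S and type M errors, following the definitions in the Gelman and Carlin paper (the particular effect size and standard error are made-up numbers):

```python
import random

random.seed(0)

def type_s_m(true_effect, se, sims=100_000, z_crit=1.96):
    """Monte Carlo estimates of the type S rate (wrong sign) and the type M
    exaggeration factor among statistically significant estimates."""
    n_sig, sign_errors, exaggeration = 0, 0, 0.0
    for _ in range(sims):
        est = random.gauss(true_effect, se)
        if abs(est) > z_crit * se:              # reaches "significance"
            n_sig += 1
            sign_errors += est * true_effect < 0
            exaggeration += abs(est) / abs(true_effect)
    return sign_errors / n_sig, exaggeration / n_sig

# A small true effect measured noisily: se twice the size of the effect.
type_s, type_m = type_s_m(true_effect=1.0, se=2.0)
print(type_s, type_m)
```

In this regime a significant result has a sign-error rate near 10% and overstates the true effect several-fold, all without any effect being exactly zero.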

That all said, I know that many people like the true/false framework so you can feel free to follow the above link and see what Schönbrodt is doing.

On deck this week

Mon: Moving statistical theory from a “discovery” framework to a “measurement” framework

Tues: Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

Wed: Going beyond confidence intervals

Thurs: Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

Fri: What’s powdery and comes out of a metallic-green cardboard can?

Sat: “The Dark Side of Power Posing”

Sun: “Children seek historical traces of owned objects”

“Pointwise mutual information as test statistics”

Christian Bartels writes:

Most of us will probably agree that making good decisions under uncertainty based on limited data is highly important but remains challenging.

We have decision theory that provides a framework to reduce risks of decisions under uncertainty with typical frequentist test statistics being examples for controlling errors in absence of prior knowledge. This strong theoretical framework is mainly applicable to comparatively simple problems. For non-trivial models and/or if there is only limited data, it is often not clear how to use the decision theory framework.

In practice, careful iterative model building and checking seems to be the best what can be done – be it using Bayesian methods or applying “frequentist” approaches (here, in this particular context, “frequentist” seems often to be used as implying “based on minimization”).

As a hobby, I tried to expand the armory for decision making under uncertainty with complex models, focusing on trying to expand the reach of decision theoretic, frequentist methods. Perhaps at one point in the future, it will become possible to bridge the existing, good pragmatic approaches into the decision theoretical framework.

So far:

– I evaluated an efficient integration method for repeated evaluation of statistical integrals (e.g., p-values) for a set of hypotheses. Key to the method was the use of importance sampling. See here.

– I proposed pointwise mutual information as an efficient test statistic that is optimal under certain considerations. The commonly used alternative is the likelihood ratio test, which, in settings where asymptotics are not valid, is annoyingly inefficient since it requires repeated minimizations of randomly generated data.
Bartels, Christian (2015): Generic and consistent confidence and credible regions.
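As a rough illustration of the importance-sampling point above (my own sketch, not Bartels's actual method), a small tail probability of the kind that appears in p-value computations can be estimated cheaply by sampling from a distribution shifted into the tail:

```python
import random
from math import erf, exp, sqrt

random.seed(0)

def tail_prob_is(threshold, sims=50_000):
    """Importance-sampling estimate of P(X > threshold) for X ~ N(0, 1),
    drawing from N(threshold, 1) so the rare tail is sampled heavily."""
    total = 0.0
    for _ in range(sims):
        x = random.gauss(threshold, 1.0)
        if x > threshold:
            # weight = target density / proposal density (normalizers cancel)
            total += exp(-x**2 / 2) / exp(-(x - threshold)**2 / 2)
    return total / sims

exact = 1 - 0.5 * (1 + erf(4 / sqrt(2)))  # P(X > 4), about 3.2e-5
print(tail_prob_is(4.0), exact)
```

Naive simulation would need tens of millions of draws to see this event even a few hundred times; the shifted proposal gets a tight estimate from 50,000.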

More work is required, in particular:

– Dealing with nuisance parameters

– Including prior information.

Working on these aspects, I would appreciate feedback on what exists so far, in general, and on the proposal of using the pointwise mutual information as a test statistic, in particular.

I have nothing to add here. The topic is important so I thought this was worth sharing.

You can post social science papers on the new SocArxiv

I learned about it from this post by Elizabeth Popp Berman.

The temporary SocArxiv site is here. It is connected to the Open Science Framework, which we’ve heard a lot about in discussions of preregistration.

You can post your papers at SocArxiv right away following these easy steps:

Send an email to the following address(es) from the email account you would like used on the OSF:

For Preprints, email
The format of the email should be as follows:

  • Subject: preprint title
  • Message body: preprint abstract
  • Attachment: your preprint file (e.g., .docx, PDF, etc.)

It’s super-easy, actually much much easier than submitting to Arxiv. I assume that Arxiv has good reasons for its more elaborate submission process, but for now I found SocArxiv’s no-frills approach very pleasant.

I tried it out by sending a few papers, and it worked just fine. I’m already happy because I was able to upload my hilarious satire article with Jonathan Falk. (Here’s the relevant SocArxiv page.) When I tried to post that article on Arxiv last month, they rejected it as follows:

On Jun 16, 2016, at 12:17 PM, arXiv Moderation wrote:

Your submission has been removed. Our volunteer moderators determined that your article does not contain substantive research to merit inclusion within arXiv. Please note that our moderators are not referees and provide no reviews with such decisions. For in-depth reviews of your work you would have to seek feedback from another forum.

Please do not resubmit this paper without contacting arXiv moderation and obtaining a positive response. Resubmission of removed papers may result in the loss of your submission privileges.

For more information on our moderation policies see:

And the followup:

Dear Andrew Gelman,

Our moderators felt that a follow up should be made to point out arXiv only accepts articles that would be refereeable by a conventional publication venue. Submissions that contain inflammatory or fictitious content or that use highly dramatic and mis-representative titles/abstracts/introductions may be removed. Repeated submissions of inflammatory or highly dramatic content may result in the suspension of submission privileges.

This kind of annoyed me because the only reason my article with Falk would not be refereeable by a conventional publication venue is because of all our jokes. Had we played it straight and pretended we were doing real research, we could’ve had a good shot at Psych Science or PPNAS. So we were, in effect, penalized for our honesty in writing a satire rather than a hoax.

As my coauthor put it, the scary thing is how close our silly paper actually is to a publishable article, not how far.

Also, I can’t figure out how Arxiv’s rules were satisfied by this 2015 paper, “It’s a Trap: Emperor Palpatine’s Poison Pill,” which is more fictitious than ours, also includes silly footnotes, etc.

Anyway, I don’t begrudge Arxiv their gatekeeping. Arxiv is great great great, and I’m not at all complaining about their decision not to publish our funny article. Their site, their rules. Indeed, I wonder what will happen if someone decides to bomb SocArxiv with fake papers. At some point, a human will need to enter the loop, no?

For now, though, I think it’s great that there’s a place where everyone can post their social science papers.

Bigmilk strikes again


Paul Alper sends along this news article by Kevin Lomagino, Earle Holland, and Andrew Holtz on the dairy-related corruption in a University of Maryland research study on the benefits of chocolate milk (!).

The good news is that the university did not stand behind its ethically-challenged employee. Instead:

“I did not become aware of this study at all until after it had become a news story,” Patrick O’Shea, UMD’s Vice President and Chief Research Officer, said in a teleconference. He says he took a look at both the chocolate milk and concussions news release and an earlier one comparing the milk to sports recovery drinks. “My reaction was, ‘This just doesn’t seem right. I’m not sure what’s going on here, but this just doesn’t seem right.’”

Back when I was a student there, we called it UM. I wonder when they changed it to UMD?

Also this:

O’Shea said in a letter that the university would immediately take down the release from university websites, return some $200,000 in funds donated by dairy companies to the lab that conducted the study, and begin implementing some 15 recommendations that would bring the university’s procedures in line with accepted norms. . . .

Dr. Shim’s lab was the beneficiary of large donations from Allied Milk Foundation, which is associated with First Quarter Fresh, the company whose chocolate milk was being studied and favorably discussed in the UMD news release.

Also this from a review committee:

There are simply too many uncontrolled variables to produce meaningful scientific results.

Wow, I wonder what Harvard Business School would say about this, if this criterion were used to judge some of its most famous recent research.

And this:

The University of Maryland says it will never again issue a news release on a study that has not been peer reviewed.

That seems a bit much. I think peer review is overrated, and if a researcher has some great findings, sure, why not do the press release? The key is to have clear lines of responsibility. And I agree with the University of Maryland on this:

The report found that while the release was widely circulated prior to distribution, nobody knew for sure who had the final say over what it could claim. “There is no institutional protocol for approval of press releases and lines of authority are poorly defined,” according to the report. It found that Dr. Shim was given default authority over the news release text, and that he disregarded generally accepted standards as to when study results should be disseminated in news releases.

Now we often seem to have the worst of both worlds, with irresponsible researchers making extravagant and ill-founded claims and then egging on press agents to make even more extreme statements. Again, peer review has nothing to do with it. There is a problem with press releases that nobody is taking responsibility for.

One-day workshop on causal inference (NYC, Sat. 16 July)

James Savage is teaching a one-day workshop on causal inference this coming Saturday (16 July) in New York using rstanarm. Here's a link to the details:

Here’s the course outline:

How do prices affect sales? What is the uplift from a marketing decision? By how much will studying for an MBA affect my earnings? How much might an increase in minimum wages affect employment levels?

These are examples of causal questions. Sadly, they are the sorts of questions that data scientists’ run-of-the-mill predictive models can be ill-equipped to answer.

In this one-day course, we will cover methods for answering these questions, using easy-to-use Bayesian data analysis tools. The topics include:

– Why do experiments work? Understanding the Rubin causal model

– Regularized GLMs; bad controls; souping-up linear models to capture nonlinearities

– Using panel data to control for some types of unobserved confounding information

– ITT, natural experiments, and instrumental variables

– If we have time, using machine learning models for causal inference.

All work will be done in R, using the new rstanarm package.

Lunch, coffee, snacks and materials will be provided. Attendees should bring a laptop with R, RStudio and rstanarm already installed. A limited number of scholarships are available. The course is in no way affiliated with Columbia.
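As a taste of the first topic, here's a minimal simulation (my own toy example, not from the course materials) of why experiments work under the Rubin causal model: with potential outcomes fixed in advance, randomized assignment makes the difference in means unbiased for the average treatment effect, while self-selected assignment does not.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Potential outcomes: every unit has both y0 and y1;
# the true average treatment effect is 2 by construction.
ability = rng.normal(0, 1, n)               # unobserved confounder
y0 = 10 + 3 * ability + rng.normal(0, 1, n)
y1 = y0 + 2

# Confounded assignment: high-ability units select into treatment.
t_obs = (ability + rng.normal(0, 1, n) > 0).astype(int)
y_obs = np.where(t_obs == 1, y1, y0)
naive = y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean()

# Randomized assignment: treatment independent of potential outcomes.
t_rct = rng.integers(0, 2, n)
y_rct = np.where(t_rct == 1, y1, y0)
rct = y_rct[t_rct == 1].mean() - y_rct[t_rct == 0].mean()

print(f"naive (confounded): {naive:.2f}")  # biased well above 2
print(f"randomized:         {rct:.2f}")    # close to the true effect of 2
```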

Replin’ ain’t easy: My very first preregistration


I’m doing my first preregistered replication. And it’s a lot of work!

We’ve been discussing this for a while: here’s something I published in 2013 in response to proposals by James Monogan and by Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt for preregistration in political science, and here’s a blog discussion (“Preregistration: what’s in it for you?”) from 2014.

Several months ago I decided I wanted to perform a preregistered replication of my 2013 AJPS paper with Yair on MRP. We found some interesting patterns of voting and turnout, but I was concerned that perhaps we were overinterpreting patterns from a single dataset. So we decided to re-fit our model to data from a different poll. That paper had analyzed the 2008 election using pre-election polls from Pew Research. The 2008 Annenberg pre-election poll was also available, so why not try that too?

Since we were going to do a replication anyway, why not preregister it? This wasn’t as easy as you might think. The first step was getting our model to fit with the old data; this was not completely trivial given changes in software, and we needed to tweak the model in some places. Having checked that we could successfully duplicate our old study, we then re-fit our model to two surveys from 2004. We then set up everything to run on Annenberg 2008. At this point we paused, wrote everything up, and submitted to a journal. We wanted to time-stamp the analysis, and it seemed worthwhile to do this in a formal journal setting so that others could see all the steps in one place. The paper (that is, the preregistration plan) was rejected by the AJPS. They suggested we send it to Political Analysis, but they ended up rejecting it too. Then we sent it to Statistics, Politics, and Policy, which agreed to publish the full paper: preregistration plan plus analysis.

But, before doing the analysis, I wanted to time-stamp the preregistration plan. I put the paper up on my website, but that’s not really preregistration. So then I tried Arxiv. That took a while too: at first they were thrown off by the paper being incomplete (by necessity, as we want to first publish the article with the plan but without the replication results). But they finally posted it.

The Arxiv post is our official announcement of preregistration. Now that it’s up, we (Rayleigh, Yair, and I) can run the analysis and write it up!

What have we learned?

Even before performing the replication analysis on the 2008 Annenberg data, this preregistration exercise has taught me some things:

1. The old analysis was no longer in runnable condition. Having restored it, we and others are now in a position to fit the model to other data much more directly.

2. There do seem to be some problems with our model in how it fits the data. To see this, compare Figure 1 to Figure 2 of our new paper. Figure 1 shows our model fit to the 2008 Pew data (essentially a duplication of Figure 2 of our 2013 paper), and Figure 2 shows this same model fit to the 2004 Annenberg data.

So, two changes: Pew vs. Annenberg, and 2008 vs. 2004. And the fitted models look qualitatively different. The graphs take up a lot of space, so I’ll just show you the results for a few states.

We’re plotting the probability of supporting the Republican candidate for president (among the supporters of one of the two major parties; that is, we’re plotting the estimates of R/(R+D)) as a function of respondent’s family income (divided into five categories). Within each state, we have two lines: the brown line shows estimated Republican support among white voters, and the black line shows estimated Republican support among all voters in the state. The y-axis goes from 0 to 100%.

From Figure 1:


From Figure 2:


You see that? The fitted lines are smoother in Figure 2 than in Figure 1, and they seem tied more closely to the data points. This appears to come from the raw data, which in Figure 2 are closer to clean monotonic patterns.

My first thought was that this was something to do with sample size. OK, that was my third thought. My first thought was that it was a bug in the code, and my second thought was that there was some problem with the coding of the income variable. But I don’t think it was any of these things. Annenberg 2004 had a larger sample than Pew 2008, so we re-fit to two random subsets of the Annenberg 2004 data, and the resulting graphs (not shown in the paper) look similar to Figure 2 above; they were still a lot smoother than Figure 1, which shows the results from Pew 2008.
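The logic of that subsampling check can be sketched generically (a toy stand-in with per-group means, not our actual MRP code): draw subsets of the large survey matched in size to the small one, re-fit, and see whether the fits stay close to the full-data fit or start looking like the other survey.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the model fit: per-group means instead of MRP.
def fit(y, g, n_groups=5):
    return np.array([y[g == k].mean() for k in range(n_groups)])

n_big, n_small = 50_000, 5_000       # "Annenberg-sized" vs "Pew-sized"
g = rng.integers(0, 5, n_big)
y = rng.normal(0.1 * g, 1.0)

full = fit(y, g)
diffs = []
for _ in range(3):
    idx = rng.choice(n_big, size=n_small, replace=False)
    diffs.append(np.abs(fit(y[idx], g[idx]) - full).max())

# If the subset fits still resemble the full fit (rather than the
# other survey's), sample size alone does not explain the discrepancy.
print([round(d, 3) for d in diffs])
```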

We discuss this at the end of Section 2 of our new paper and don’t come to any firm conclusions. We’ll see what turns up with the replication on Annenberg 2008.

Anyway, the point is:
– Replication is not so easy.
– We can learn even from setting up the replications.
– Published results (even from me!) are always only provisional and it makes sense to replicate on other data.