
Ethics and the Replication Crisis and Science (my talk Tues 6pm)

I’ll be speaking on Ethics and the Replication Crisis and Science tomorrow (Tues 28 Feb) 6-7:30pm at room 411 Fayerweather Hall, Columbia University. I don’t plan to speak for 90 minutes; I assume there will be lots of time for discussion.

Here’s the abstract that I whipped up:

Busy scientists sometimes view ethics and philosophy as “touchy-feely” concerns that scientists worry about only after they are too old to do real research. In this talk I argue that, on the contrary, ethics and philosophy are practical tools that can make us more effective scientists. Many of the traditional discussions of statistical ethics are outdated, but we can move to a more modern understanding of ethics in statistics—and in science more generally—by looking more closely at the goals and practices of quantitative research. The current replication crisis in science motivates much of this discussion, but we will consider broader issues too.

Did Trump win because his name came first in key states? Maybe, but I’m doubtful.

The above headline (without the “Maybe, but I’m doubtful”) is from a BBC News article, which continues:

One of the world’s leading political scientists believes Donald Trump most likely won the US presidential election for a very simple reason, writes Hannah Sander – his name came first on the ballot in some critical swing states.

Jon Krosnick has spent 30 years studying how voters choose one candidate rather than another, and says that “at least two” US presidents won their elections because their names were listed first on the ballot, in states where the margin of victory was narrow. . . .

“There is a human tendency to lean towards the first name listed on the ballot,” says Krosnick, a politics professor at Stanford University. “And that has caused increases on average of about three percentage points for candidates, across lots of races and states and years.” . . .

When an election is very close the effect can be decisive, Krosnick says – and in some US states, such as Pennsylvania, Michigan and Wisconsin, the 2016 election was very close.

As is noted in the BBC article, Trump seems to have been listed first on the ballot in Michigan and Wisconsin.

What about the other close states? In Minnesota, it looks like Trump was first on the ballot, and he did almost come from behind to win that state.

Florida and Pennsylvania appear to list the candidate of the governor’s party first, which would put Trump first in Florida and Clinton first in Pennsylvania. New Hampshire I can’t quite tell, their rules are confusing. Nevada uses alphabetical order so I think this means Clinton went first. In Maine, I’m not sure but it looks like Clinton might have been listed first.

So, suppose ballot order gave Trump the win in Michigan, Wisconsin, and Florida. That’s 16 + 10 + 29 = 55 electoral votes. On the other side, maybe ballot order helped Clinton in Maine (at-large) and New Hampshire, that’s 2 + 4 = 6 electoral votes, for a net gain of 49 for Trump. Take away 49 of Trump’s electoral votes and he no longer has the victory (assuming all electoral voters voted as pledged; I guess that will be our next constitutional crisis, come 2020). We tend to think of all these little things as averaging out, but they don’t have to. The number of swing states is small.
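To make the arithmetic explicit, here is a quick R sketch. The state assignments are just the guesses above, and the 306 pledged Trump electors is the official 2016 count:

    # Electoral votes in states where ballot order may have mattered
    # (state assignments are the speculative guesses above, not confirmed)
    trump_helped   <- c(Michigan = 16, Wisconsin = 10, Florida = 29)
    clinton_helped <- c(Maine_at_large = 2, New_Hampshire = 4)

    net_swing <- sum(trump_helped) - sum(clinton_helped)   # 55 - 6 = 49
    trump_pledged  <- 306                                  # electors pledged to Trump in 2016
    trump_adjusted <- trump_pledged - net_swing            # 257, below the 270 needed to win
    c(net_swing = net_swing, trump_adjusted = trump_adjusted)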

So, yeah, maybe Krosnick is right on this one. It all comes down to Florida, I guess.

Could ballot order have been enough to cause a 1.2% swing? Maybe so, maybe not. The research is mixed. Analyzing data from California elections where a rotation of candidate orders was used across assembly districts, Jon Krosnick, Joanne Miller, and Michael Tichy (2004) found large effects, including in the 2000 presidential race. But in a different analysis of California elections, Daniel Ho and Kosuke Imai (2008) write that “in general elections, ballot order significantly impacts only minor party candidates, with no detectable effects on major party candidates.” Ho and Imai also point out that the analysis of Krosnick, Miller, and Tichy is purely observational. That said, we can learn a lot from observational data. Krosnick et al. analyzed data from the 80 assembly districts, but it doesn’t look like they controlled for previous election results in those districts, which would be the obvious thing to do in such an analysis (a sketch of what I mean appears below).

Amy King and Andrew Leigh (2009) analyze Australian elections and find that “being placed first on the ballot increases a candidate’s vote share by about 1 percentage point.” Marc Meredith and Yuval Salant (2013) find effects of 4-5 percentage points, but for city council and school board elections, so not so relevant for the presidential race. A Google Scholar search turns up lots and lots of papers on ballot-order effects, but mostly on local elections or primary elections, where we’d expect such effects to be larger. This 1990 paper by R. Darcy and Ian McAllister cites research going back to the early 1900s!
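To be concrete about the kind of adjustment I have in mind for the Krosnick, Miller, and Tichy analysis, here is a minimal sketch in R. The data frame districts and its columns are hypothetical, just to show the structure of a regression that controls for previous results:

    # Hypothetical district-level data: one row per assembly district.
    #   vote_share      : candidate's vote share in the current election
    #   prev_vote_share : same candidate's (or party's) share in the previous election
    #   listed_first    : 1 if the candidate appeared first on the ballot, 0 otherwise
    fit <- lm(vote_share ~ listed_first + prev_vote_share, data = districts)
    summary(fit)  # the coefficient on listed_first estimates the ballot-order effect,
                  # adjusting for how the district voted last time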

So, putting all the evidence together: what do I think? As I said above, it all comes down to Florida. In 2000, Florida was extremely close—best estimates have Gore winning by only about 30,000 votes (according to Mebane, the votes were lost “primarily due to defective election administration in the state”), and had ballot order been randomized he could well have won by even more, enough for the state to have counted in his favor in the electoral college.

In 2016, maybe, maybe not. Based on the literature I’ve seen, a 1% swing seems to be on the border of what might be a plausible ballot-order effect for the general election for president, maybe a bit on the high end given our current level of political polarization. So I think Krosnick is overstating the case, but it is just possible that the ballot order effects were large enough that, had the ballots been randomized, Clinton could’ve won Florida, Michigan, Wisconsin, and thus the electoral college.

I’m Niall Ferguson without the money

poof

Somehow I agreed or volunteered to give 6 talks on different topics to different audiences during a two-week period. Maybe I need to use Google calendar with some sort of spacing feature.

Giving talks is fun, and it’s a public service, but this is ridiculous.

Forecasting mean and sd of time series

Garrett M. writes:

I had two (hopefully straightforward) questions related to time series analysis that I was hoping I could get your thoughts on:

First, much of the work I do involves “backtesting” investment strategies, where I simulate the performance of an investment portfolio using historical data on returns. The primary summary statistics I generate from this sort of analysis are mean return (both arithmetic and geometric) and standard deviation (called “volatility” in my industry). Basically the idea is to select strategies that are likely to generate high returns given the amount of volatility they experience.

However, historical market data are very noisy, with stock portfolios generating an average monthly return of around 0.8% with a monthly standard deviation of around 4%. Even samples containing 300 months of data then have standard errors of about 0.2% (4%/sqrt(300)).

My first question is, suppose I have two time series. One has a mean return of 0.8% and the second has a mean return of 1.1%, both with a standard error of 0.4%. Assuming the future will look like the past, is it reasonable to expect the second series to have a higher future mean than the first out of sample, given that it has a mean 0.3% greater in the sample? The answer might be obvious to you, but I commonly see researchers make this sort of determination, when it appears to me that the data are too noisy to draw any sort of conclusion between series with means within at least two standard errors of each other (ignoring for now any additional issues with multiple comparisons).

My second question involves forecasting standard deviation. There are many models and products used by traders to determine the future volatility of a portfolio. The way I have tested these products has been to record the percentage of the time future returns (so out of sample) fall within one, two, or three standard deviations, as forecasted by the model. If future returns fall within those buckets around 68%/95%/99% of the time, I conclude that the model adequately predicts future volatility. Does this method make sense?

My reply:

Regarding your first question about the two time series, I’d recommend doing a multilevel model. I bet you have more than two of these series. Model a whole bunch at once, and then estimate the levels and trends of each series. Move away from a deterministic rule of which series will be higher, and just create forecasts that acknowledge uncertainty.
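To give a sense of what I mean, here is a minimal sketch in R using lme4 (the same formula works with stan_lmer in rstanarm for a fully Bayesian fit). The data frame returns_long and its column names are hypothetical:

    library(lme4)

    # returns_long: one row per series per month, with columns
    #   ret    : monthly return
    #   series : identifier for the strategy / time series
    #   month  : time index (centered, so intercepts are interpretable)
    fit <- lmer(ret ~ 1 + month + (1 + month | series), data = returns_long)

    # Partially pooled level and trend for each series:
    coef(fit)$series

    # For forecasts that acknowledge uncertainty, fit the same formula with
    # rstanarm::stan_lmer and use posterior_predict() to simulate future returns.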

Regarding your second question about standard deviation, your method might work but it also discards some information. For example, the number of cases greater than 3sd must be so low that your estimate of these tails will be noisy, so you have to be careful that you’re not in the position of those climatologists who are surprised when so-called hundred-year floods happen every 10 years. At a deeper level, it’s not clear to me that you should want to be looking at sd; perhaps there are summaries that map more closely to decisions of interest.
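To illustrate the tail issue, here is a sketch in R of the coverage check described above. The vectors r, mu_hat, and sigma_hat are hypothetical: out-of-sample returns and the model's forecast mean and sd for each period:

    # r         : out-of-sample monthly returns
    # mu_hat    : forecast mean for each month (could be a constant)
    # sigma_hat : forecast sd ("volatility") for each month
    z <- abs(r - mu_hat) / sigma_hat
    coverage <- c(within_1sd = mean(z < 1),
                  within_2sd = mean(z < 2),
                  within_3sd = mean(z < 3))
    coverage  # compare to the nominal 0.68 / 0.95 / 0.997 under normality

    # With 300 months, the expected number of 3-sd exceedances is about
    # 300 * (1 - 0.997), roughly one observation, so that last bucket is mostly noise.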

But I say these things all pretty generically as I don’t know anything about stock trading (except that I lost something like 40% of my life savings back in 2008, and that was a good thing for me).

Improv

I like this new thing of lecturing improv. I feel that it helps the audience stay focused, as they have to keep the structure of the talk in their heads while it’s happening. Also it enforces more logic in my own presentation, as I’m continually looping back to remind myself and the audience how each part fits into the general theme. It’s like a 40-minute-long story, with scene, plot, character development, a beginning, middle, and end.

Yes, sometimes it helps to show graphs or code as part of this, but I can pull that up as needed during a talk. It doesn’t need to be on “slides.”

My overall aim is for a Stewart Lee-type experience. OK, not exactly. For one thing, Lee isn’t doing improv; he practices and hones his act until he knows exactly what’s going where. But the standards are higher for stand-up entertainment than for an academic talk, so I don’t need to be so polished.

I’ve also been running my classroom lectures on the improv principle, riffing from homeworks, readings, and jitts and using students’ questions as the fuel to keep things moving along. That’s been going well too, I think, but I need to work more on the organization. When I give a colloquium or conference talk, I’m in control and can structure the time how I want and make sure everything fits within the larger story; but in class it seems to make sense to follow more closely the students’ particular needs, and then I’ll end up talking on things for which I hadn’t prepared, and it’s easy for me to get lost in the details of some examples and lose the main thread, thus reducing what the students get out of the class (I think).

The interesting thing is how long it’s taken me to get to this point. I’ve been giving talks at conferences for just about 30 years, and my style keeps changing. I’ve gone from acetate transparency sheets to handouts, back to transparencies, back to handouts, then to PowerPoint and PDF, then to the stage of removing as many words from the slides as possible, then removing even more words and using lots of pictures, now to this new stage of no slides at all. I like where I am now, but maybe in 5 years we’ll all be doing something completely different.

Exposure to Stan has changed my defaults: a non-haiku

Now when I look at my old R code, it looks really weird because there are no semicolons
Each line of code just looks incomplete
As if I were writing my sentences like this
Whassup with that, huh
Also can I please no longer do <-
I much prefer =
Please

Is Rigor Contagious? (my talk next Monday 4:15pm at Columbia)

Is Rigor Contagious?

Much of the theory and practice of statistics and econometrics is characterized by a toxic mixture of rigor and sloppiness. Methods are justified based on seemingly pure principles that can’t survive reality. Examples of these principles include random sampling, unbiased estimation, hypothesis testing, Bayesian inference, and causal identification. Examples of uncomfortable reality include nonresponse, varying effects, researcher degrees of freedom, actual prior information, and the desire for external validity. We discuss a series of scenarios where researchers naively think that rigor in one part of their design and analysis will assure rigor in their larger conclusions, and then we discuss possible hierarchical Bayesian solutions in which the load of rigor is more evenly balanced across the chain of scientific reasoning.

The talk (for the Sustainable Development seminar) will be Mon 27 Feb, 4:15-5:45, in room 801 International Affairs Building at Columbia.

Note to Deborah Mayo

I have a post coming on 2 Mar on preregistration that I think you’ll like. It unifies some ideas regarding statistical design and analysis, and in some ways it’s a follow-up to my Borscht Belt post.

He wants to know what book to read to learn statistics

Tim Gilmour writes:

I’m an early 40s guy in Los Angeles, and I’m sort of sending myself back to school, specifically in statistics — not taking classes, just working through things on my own. Though I haven’t really used math much since undergrad, a number of my personal interests (primarily epistemology) would be much better served by a good knowledge of statistics.

I was wondering if you could recommend a solid, undergrad level intro to statistics book? While I’ve seen tons of options on the net, I don’t really have the experiential basis to choose among them effectively.

My reply: Rather than reading an intro stat book, I suggest you read a book in some area of interest to you that uses statistics. For example, Bob Carpenter is always recommending Jim Albert’s book on baseball. But if you’re interested in epidemiology, then maybe best to read a book on that subject. Sander Greenland wrote an epidemiology textbook; I haven’t read it all the way through, but Sander knows what he’s talking about, so it could be a good place to start.

If you had to read one statistics book right now, I’d suggest my book with Jennifer Hill. It’s not quite an intro book but we pretty much start from scratch.

Readers might have other suggestions.

Eurostat microdata conference

Division of labor and a Pizzagate solution

I firmly believe that the general principles of social science can improve our understanding of the world.

Today I want to talk about two principles—division of labor from economics, and roles from sociology—and their relevance to the Pizzagate scandal involving Brian Wansink, the Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years” whose published papers have been revealed to have hundreds of errors.

It is natural to think of “division of labor” and “roles” as going together: different people have different skill sets and different opportunities so it makes sense that they play different roles; and, conversely, the job you do is in part a consequence of your role in society.

From another perspective, though, the two principles are in conflict, in that certain logical divisions of labor might not occur because people are too stuck in playing their roles. We’ll consider such a case here.

I was talking the other day with someone about the Pizzagate story, in particular the idea that the protagonist, Brian Wansink, is in a tough position:

1. From all reports, Wansink sounds like a nice guy who cares about improving public health and genuinely wants to do the right thing. He wants to do good research because research is a way to learn about the world and to ultimately help people to make better decisions. He also enjoys publicity, but there’s nothing wrong with that: by getting your ideas out there, you can help more people. Through hard work, Wansink has achieved a position of prominence at his university and in the world.

2. However, for the past several years people have been telling Wansink that his published papers are full of errors, indeed they are disasters, complete failures that claim to be empirical demonstrations but do not even accurately convey the data used in their construction, let alone provide good evidence for their substantive claims.

3. Now put the two above items together. How can Wansink respond? So far he’s tried to address 2 while preserving all of 1: he’s acknowledged that his papers have errors and said that he plans to overhaul his workflow, but at the same time he has not expressed any change in his beliefs about any of the conclusions of his research. This is a difficult position to stand by, especially going forward, as questions about the quality of this work keep coming. Whether or not Wansink personally believes his claims, I can’t see why anyone else should take them seriously.

What, then, can Wansink do? I thought about it and realized that, from the standpoint of division of labor, all is clear.

Wansink has some talents and is in some ways well-situated:
– He can come up with ideas for experiments that other people find interesting.
– He’s an energetic guy with a full Rolodex: he can get lots of projects going and he can inspire people to work on them.
– He’s working on a topic that affects a lot of people.
– He’s a master of publicity: he really cares about his claims and is willing to put in the effort to tell the world about them.

On the other hand, he has some weaknesses:
– He runs experiments without seeming to be aware of what data he’s collected.
– He doesn’t understand key statistical ideas.
– He publishes lots and lots of papers with clear errors.
– He seems to have difficulty mapping specific criticisms to any acceptance of flaws in his scientific claims.

Putting these together, I came up with a solution!
– Wansink should be the idea guy, he should talk with people and come up with ideas for experiments.
– Someone else, with a clearer understanding of statistics and variation, should design the data collection with an eye to minimizing bias and variance of measurements.
– Someone else should supervise the data collection.
– Someone else should analyze the data.
– Someone else should write the research papers, which should be openly exploratory and speculative.
– Wansink should be involved in the interpretation of the research results and in publicity afterward.

I made the above list in recognition that Wansink does have a lot to offer. The mistake is in thinking he needs to do all the steps.

But this is where “division of labor” comes into conflict with “roles.” Wansink’s been placed in the role of scientist, or “eating behavior expert,” and scientists are supposed to design their data collection, analyze their data, and write up their findings.

The problem here is not just that Wansink doesn’t know how to collect high-quality data, analyze them appropriately, or accurately write up the results—it’s that he can’t even be trusted to supervise these tasks.

But this shouldn’t be a problem. There are lots of things I don’t know how to do—I just don’t do them! I do lots of survey research but I’ve never done any survey interviewing. Maybe I should learn how to do survey interviews but I haven’t done so yet.

But the “rules” seem to be that the professor should do, or at least supervise, data collection, analysis, and writing of peer-reviewed papers. Wansink can’t do this. He would be better employed, I think, as part of a team where he can make his unique contributions. To make this step wouldn’t be easy: Wansink would have to give up a lot, in the sense of accepting limits on his expertise. So there are obstacles. But this seems like the logical endpoint.

P.S. Just to emphasize: This is not up to me. I’m not trying to tell Wansink or others what to do; I’m just offering my take on the situation.

Cloak and dagger

Elan B. writes:

I saw this JAMA Pediatrics article [by Julia Raifman, Ellen Moscoe, and S. Bryn Austin] getting a lot of press for claiming that LGBT suicide attempts went down 14% after gay marriage was legalized.
The heart of the study is comparing suicide attempt rates (in last 12 months) before and after exposure — gay marriage legalization in their state. For LGBT teens, this dropped from 28.5% to 24.5%.
In order to test whether this drop was just an ongoing trend in dropping LGBT suicide attempts, they do a placebo test by looking at whether rates dropped 2 years before legalization. In the text of the article, they simply state that there is no drop.
But then you open up the supplement and find that about half of the drop in rates — 2.2% — already came 2 years before legalization. However, since 0 is contained in the 95% confidence interval, it’s not significant! Robustness check passed.
In figure 1 of the article, they graph suicide attempts before legalization to show they’re flat, but even though they have the data for some of the states they don’t show LGBT rates.
Very suspicious to me, what do you think?

My reply: I wouldn’t quite say “suspicious.” I expect these researchers are doing their best; these are just hard problems. What they’ve found is an association which they want to present as causation, and they don’t fully recognize that limitation in their paper.

Here are the key figures:

And from here it’s pretty clear that the trends are noisy, so that little differences in the model can make big differences in the results, especially when you’re playing the statistical significance game. That’s fine—if the trends are noisy, they’re noisy, and your analysis needs to recognize this, and in any case it’s a good idea to explore such data.

I also share Elan’s concern about the whole “robustness check” approach to applied statistics, in which a central analysis is presented and then various alternatives are presented, with the goal being to show the same thing as the main finding (for perturbation-style robustness checks) or to show nothing (for placebo-style robustness checks).

One problem with this mode of operation is that robustness checks themselves have many researcher degrees of freedom, so it’s not clear what we can take from these. Just for example, if you do a perturbation-style robustness check and you find a result in the same direction but not statistically significant (or, as the saying goes, “not quite” statistically significant), you can call it a success because it’s in the right direction and, if anything, it makes you feel even better that the main analysis, which you chose, succeeded. But if you do a placebo-style robustness check and you find a result in the same direction but not statistically significant, you can just call it a zero and claim success in that way.

So I think there’s a problem in that there’s a pressure for researchers to seek, and claim, more certainty and rigor than is typically possible from social science data. If I’d written this paper, I think I would’ve started with various versions of the figures above, explored the data more, then moved to the regression line, but always going back to the connection between model, data, and substantive theories. But that’s not what I see here: in the paper at hand, there’s the more standard pattern of some theory and exploration motivating a model, then statistical significance is taken as tentative proof, to be shored up with robustness studies, then the result is taken as a stylized fact and it’s story time. There’s nothing particularly bad about this particular paper, indeed their general conclusions might well be correct (or not). They’re following the rules of social science research and it’s hard to blame them for that. I don’t see this paper as “junk science” in the way of the himmicanes, air rage, or ages-ending-in-9 papers (I guess that’s why it appeared in JAMA, which is maybe a bit more serious-minded than PPNAS or Lancet); rather, it’s a reasonable bit of data exploration that could be better. I’d say that a recognition that it is data exploration could be a first step to encouraging researchers to think more seriously about how best to explore such data. If they really do have direct data on suicide rates of gay people, that would seem like a good place to look, as Elan suggests.

Clay pigeon

Sam Harper writes:

Not that you are collecting these kinds of things, but I wanted to point to (yet) another benefit of the American Economic Association’s requirement of including replication datasets (unless there are confidentiality constraints) and code in order to publish in most of their journals—certainly for the top-tier ones like Am Econ Review: correcting coding mistakes!
  1. Lundstrom, Samuel. “The Impact of Family Income on Child Achievement: Evidence from the Earned Income Tax Credit: Comment.” The American Economic Review, vol. 107, no. 2, pp. 623-628, February 2017.
  2. Dahl, Gordon B., and Lance Lochner. “The Impact of Family Income on Child Achievement: Evidence from the Earned Income Tax Credit: Reply.” The American Economic Review, vol. 107, no. 2, pp. 629-631, February 2017.
The papers are no doubt gated (I attached them if you are interested), but I thought it was refreshing to see what I consider to be close to a model exchange between the original authors and the replicator: Replicator is able to reproduce nearly everything but finds a serious coding error, corrects it and generates new (and presumably improved) estimates, and original authors admit they made a coding error without making much of a fuss, plus they also generate revised estimates. Post-publication review doing what it should. The tone is also likely more civil because the effort to reproduce largely succeeded and the original authors did not have to eat crow or say that they made a mistake that substantively changed their interpretation (and economists’ obsession with statistical significance is still disappointing). Credit to Lundstrom for not trying to over-hype the change in the results.
As an epidemiologist I do feel embarrassed that the biomedical community is still so far behind other disciplines when it comes to taking reproducible science seriously—especially the “high impact” general medical journals. We should not have to take our cues from economists, though perhaps it helps that much of the work they do uses public data.
I haven’t looked into this one but I agree with the general point.

Looking for rigor in all the wrong places (my talk this Thursday in the Columbia economics department)

Looking for Rigor in All the Wrong Places

What do the following ideas and practices have in common: unbiased estimation, statistical significance, insistence on random sampling, and avoidance of prior information? All have been embraced as ways of enforcing rigor but all have backfired and led to sloppy analyses and erroneous inferences. We discuss these problems and some potential solutions in the context of problems in social science research, and we consider ways in which future statistical theory can be better aligned with practice.

The seminar will be held on Thursday, February 23rd, at the Economics Department, International Affairs Building (420 W. 118th Street), room 1101, from 2:30 to 4:00 pm.

I don’t have one particular paper, but here are a few things that people could read:

http://www.stat.columbia.edu/~gelman/research/published/rd_china_5.pdf
http://www.stat.columbia.edu/~gelman/research/unpublished/regression_discontinuity_16sep6.pdf
http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf

Unethical behavior vs. being a bad guy

I happened to come across this article and it reminded me of the general point that it’s possible to behave unethically without being a “bad guy.”

The story in question involves some scientists who did some experiments about thirty years ago on the biological effects of low-frequency magnetic fields. They published their results in a series of papers which I read when I was a student, and I found some places where I thought their analysis could be improved.

The topic seemed somewhat important—at the time, there was concern about cancer risks from exposure to power lines and other sources of low-frequency magnetic fields—so I sent a letter to the authors of the paper, pointing out two ways I thought their analysis could be improved, and requesting their raw data. I followed up the letter with a phone call.

Just for some context:

1. At no time did I think, or do I think, that they were doing anything unethical in their data collection or analysis. I just thought that they weren’t making full use of the data they had. Their unethical behavior, as I see it, came at the next stage, when they refused to share their data.

2. Those were simpler times. I assumed by default that published work was high quality, so when I saw what seemed like a flaw in the analysis, I wasn’t so sure—I was very open to the possibility that I’d missed something myself—and I didn’t see the problems in that paper as symptomatic of any larger issues.

3. I was not trying to “gotcha” these researchers. I thought they too would be interested in getting more information out of their data.

To continue with the story: When I called on the phone, the lead researcher on the project said he didn’t want to share the data: they were in lab notebooks and it would be effort to copy these, and his statistician had assured him that the analysis was just fine as is.

I think this was unethical behavior, given that: (a) at the time, this work was considered to have policy implications; (b) there was no good reason for the researcher to think that his statistician had particular expertise in this sort of analysis; (c) I’d offered some specific ways in which the data analysis could be improved so there was a justification for my request; (d) the work had been done at the Environmental Protection Agency, which is part of the U.S. government; (e) the dataset was pretty small so how hard could it be to photocopy some pages of lab notebooks and drop them in the mail; and, finally (f) the work was published in a scientific journal that was part of the public record.

A couple decades later, I wrote about the incident and the biologist and the statistician responded with defenses of their actions. I felt at the time of the original event, and after reading their letters, and I still feel, that these guys were trying to do their best, that they were acting according to what they perceived to be their professional standards, and that they were not trying to impede the progress of science and public health.

To put it another way, I did not, and do not, think of them as “bad guys.” Not that this is so important—there’s no reason why these two scientists should particularly care about my opinion of them, nor am I any kind of moral arbiter here. I’m just sharing my perspective to make the more general point that it is possible to behave unethically without being a bad person.

I do think the lack of data sharing was unethical—not as unethical as fabricating data (Lacour), or hiding data (Hauser) or brushing aside a barrage of legitimate criticism from multiple sources (Cuddy), or lots of other examples we’ve discussed over the years on this blog—but I do feel it is a real ethical lapse, for reasons (a)-(f) given above. But I don’t think of this as the product of “bad guys.”

My point is that it’s possible to go about your professional career, doing what you think is right, but still making some bad decisions: actions which were not just mistaken in retrospect, but which can be seen as ethical violations on some scale.

One way to view this is that everyone involved in research—including those of us who see ourselves as good guys—should be aware that we can make unethical decisions at work. “Unethical” labels the action, not the person, and ethics is a product of a situation as well as of the people involved.

Should the Problems with Polls Make Us Worry about the Quality of Health Surveys? (my talk at CDC tomorrow)

My talk at the CDC tomorrow, Tuesday, February 21, 2017, 12:00 noon, 2400 Century Center, Room 1015C:

Should the Problems with Polls Make Us Worry about the Quality of Health Surveys?

Response rates in public opinion polls have been steadily declining for more than half a century and are currently heading toward the 0% mark. We have learned much in recent years about the problems this is causing and how we can improve data collection and statistical analysis to get better estimates of opinion and opinion trends. In this talk, we review research in this area and then discuss the relevance of this work to similar problems in health surveys.

P.S. I gave the talk. There were no slides. OK, I did send along a subset of these, but I spent only about 5 minutes on them out of a 40-minute lecture, so the slides will give you close to zero sense of what I was talking about. I have further thoughts about the experience which I’ll save for a future post, but for now just let me say that if you weren’t at the talk, and you don’t know anyone who was there, then the slides won’t help.

Blind Spot

X pointed me to this news article reporting an increase in death rate among young adults in the United States:

According to a study published on January 26 in the scientific journal The Lancet, the mortality rate of young Americans aged 25 to 35 rose between 1999 and 2014, even though this rate has been falling steadily across the richest countries as a whole for forty years. . . . It is mainly young white women who are pushing the numbers up . . . Analysis of statistics collected from the National Center for Health Statistics shows that the mortality rate of white women aged 25 rose at an average annual rate of 3% over the fifteen years considered, and 2.3% for those in their thirties. For men of the same age, the annual growth in the mortality rate was 1.9%.

I ran this by Jonathan Auerbach to see what he thought. After all, it’s the Lancet, which seems to specialize in papers of high publicity and low content, so it’s not like I’m gonna believe anything in there without careful scrutiny.

As part of our project, Jonathan had already run age-adjusted estimates for different ethnic groups and each decade of age. These time series should be better than what was in the paper discussed in the above news article because, in addition to age adjusting, we also got separate estimated trends for each state, fitting some sort of hierarchical model in Stan.

Jonathan reported that we found a similar increase in death rates for women after adjustment. But there are comparable increases for men after breaking down by state.

Here are the estimated trends in age-adjusted death rates for non-Hispanic white women aged 25-34:

And here are the estimated trends for men:

In the graphs for the women, certain states with too few observations were removed. (It would be fine to estimate these trends from the raw data, but for simplicity we retrieved some aggregates from the CDC website, and it didn’t provide numbers in every state and every year.)

Anyway, the above graphs show what you can do with Stan. We’re not quite sure what to do with all these analyses: we don’t have stories to go with them so it’s not clear where they could be published. But at least we can blog them in response to headlines on mortality trends.
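I don't know exactly which model Jonathan fit, but the general flavor, separate trends for each state with partial pooling, can be sketched with rstanarm. Everything here (the data frame mort and its columns) is a made-up placeholder, not his actual analysis:

    library(rstanarm)

    # Hypothetical aggregated data: one row per state per year, with
    #   deaths, pop : deaths and population for the demographic group
    #   year_c      : year, centered
    #   state       : state abbreviation
    fit <- stan_glmer(deaths ~ year_c + offset(log(pop)) + (1 + year_c | state),
                      family = poisson(link = "log"),
                      data = mort)

    # Partially pooled state-level trends (slopes on the log scale):
    coef(fit)$state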

P.S. The Westlake titles keep on coming. It’s not just that they are so catchy—after all, that’s their point—but how apt they are, each time. And the amazing thing is, I’m using them in order. Those phrases work for just about anything. I’m just looking forward to a month or so on when I’ve worked my way down to the comedy titles lower down on the list.

Accessing the contents of a stanfit object

I was just needing this. Then, lo and behold, I found it on the web. It’s credited to Stan Development Team but I assume it was written by Ben and Jonah. Good to have this all in one place.
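For readers who land here without following the link, a quick sketch of the most commonly used accessors in rstan (fit is any stanfit object; the parameter name theta is just a placeholder):

    library(rstan)

    print(fit)                 # summary table of all parameters
    post <- extract(fit)       # list of posterior draws, one element per parameter
    draws <- as.matrix(fit)    # draws as an iterations-by-parameters matrix
    s <- summary(fit)$summary  # means, sd, quantiles, n_eff, Rhat
    arr <- extract(fit, pars = "theta", permuted = FALSE)  # keep chains separate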

ComSciCon: Science Communication Workshop for Graduate Students

“Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.”

Javier Benitez writes:

This is a paragraph from Kathryn Montgomery’s book, How Doctors Think:

If medicine were practiced as if it were a science, even a probabilistic science, my daughter’s breast cancer might never have been diagnosed in time. At 28, she was quite literally off the charts, far too young, an unlikely patient who might have eluded the attention of anyone reasoning “scientifically” from general principles to her improbable case. Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.

I [Benitez] am not sure I agree with her assessment. I have been doing some reading on the history and philosophy of science; there’s not much on the philosophy of medicine, and this is a tough question to answer, at least for me.

I would think that science, done right, should help, not hinder, the cause of cancer decision making. (Incidentally, the relevant science here would necessarily be probabilistic, so I wouldn’t speak of “even” a probabilistic science as if it were worth considering any deterministic science of cancer diagnosis.)

So how to think about the above quote? I have a few directions, in no particular order:

1. Good science should help, but bad science could hurt. It’s possible that there’s enough bad published work in the field of cancer diagnosis that a savvy doctor is better off ignoring a lot of it, performing his or her own meta-analysis, as it were, partially pooling the noisy and biased findings toward some more reasonable theory-based model.

2. I haven’t read the book where this quote comes from, but the natural question is, How did the doctor diagnose the cancer in that case? Presumably the information used by the doctor could be folded into a scientific diagnostic procedure.

3. There’s also the much-discussed cost-benefit angle. Early diagnosis can save lives, but it can also have costs in dollars and health when there is misdiagnosis.

To the extent that I have a synthesis of all these ideas, it’s through the familiar idea of anomalies. Science (that is, probability theory plus data plus models of data plus empirical review and feedback) is supposed to be the optimal way to make decisions under uncertainty. So if doctors have a better way of doing it, this suggests that the science they’re using is incomplete, and they should be able to do better.

The idea here is to think of the “science” of cancer diagnosis not as a static body of facts or even as a method of inquiry, but as a continuously-developing network of conjectures and models and data.

To put it another way, it can make sense to “ignore the requirements of science.” And when you make that decision, you should explain why you’re doing it—what information you have that moves you away from what would be the “science-based” decision.

Benitez adds some more background: