
Forecasting mean and sd of time series

Garrett M. writes:

I had two (hopefully straightforward) questions related to time series analysis that I was hoping I could get your thoughts on:

First, much of the work I do involves “backtesting” investment strategies, where I simulate the performance of an investment portfolio using historical data on returns. The primary summary statistics I generate from this sort of analysis are mean return (both arithmetic and geometric) and standard deviation (called “volatility” in my industry). Basically the idea is to select strategies that are likely to generate high returns given the amount of volatility they experience.

However, historical market data are very noisy, with stock portfolios generating an average monthly return of around 0.8% with a monthly standard deviation of around 4%. Even samples containing 300 months of data then have standard errors of about 0.2% (4%/sqrt(300)).

My first question is, suppose I have two time series. One has a mean return of 0.8% and the second has a mean return of 1.1%, both with a standard error of 0.4%. Assuming the future will look like the past, is it reasonable to expect the second series to have a higher future mean than the first out of sample, given that it has a mean 0.3% greater in the sample? The answer might be obvious to you, but I commonly see researchers make this sort of determination, when it appears to me that the data are too noisy to draw any sort of conclusion between series with means within at least two standard errors of each other (ignoring for now any additional issues with multiple comparisons).

My second question involves forecasting standard deviation. There are many models and products used by traders to determine the future volatility of a portfolio. The way I have tested these products has been to record the percentage of the time future returns (so out of sample) fall within one, two, or three standard deviations, as forecasted by the model. If future returns fall within those buckets around 68%/95%/99% of the time, I conclude that the model adequately predicts future volatility. Does this method make sense?

My reply:

Regarding your first question about the two time series, I’d recommend doing a multilevel model. I bet you have more than two of these series. Model a whole bunch at once, and then estimate the levels and trends of each series. Move away from a deterministic rule of which series will be higher, and just create forecasts that acknowledge uncertainty.
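
To put rough numbers on the first question, here is a minimal Python sketch. It assumes the two sample means are independent and approximately normal, which the letter does not say, and the helper names (diff_z, partial_pool) are made up for illustration. The shrinkage step is a crude method-of-moments stand-in for a real multilevel model, not a substitute for one.

```python
from math import sqrt
from statistics import NormalDist, mean


def diff_z(mean_a, se_a, mean_b, se_b):
    """z-score of (mean_b - mean_a), assuming independent estimates."""
    se_diff = sqrt(se_a ** 2 + se_b ** 2)
    return (mean_b - mean_a) / se_diff


def partial_pool(estimates, ses):
    """Shrink each estimate toward the grand mean (empirical-Bayes sketch)."""
    grand = mean(estimates)
    # Observed spread of the estimates around the grand mean
    total_var = sum((y - grand) ** 2 for y in estimates) / (len(estimates) - 1)
    # Between-series variance: observed spread minus sampling noise, floored at 0
    tau2 = max(total_var - mean([s ** 2 for s in ses]), 0.0)
    pooled = []
    for y, s in zip(estimates, ses):
        weight = tau2 / (tau2 + s ** 2)  # 0 = complete pooling, 1 = no pooling
        pooled.append(grand + weight * (y - grand))
    return pooled


# Numbers from the letter: monthly means of 0.8% and 1.1%, each with s.e. 0.4%
z = diff_z(0.8, 0.4, 1.1, 0.4)
# Under a flat prior, the chance that the second series truly has the higher mean
prob_second_higher = NormalDist().cdf(z)
pooled = partial_pool([0.8, 1.1], [0.4, 0.4])

print(f"z = {z:.2f}, P(second > first) = {prob_second_higher:.2f}")
print("partially pooled means:", [round(p, 2) for p in pooled])
```

With these inputs the gap is about half a standard error of the difference, so the second series comes out ahead only about 70% of the time under these assumptions, and the crude shrinkage pulls both estimates all the way back to their common mean. With many series in hand, a full multilevel model would do this pooling properly.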

Regarding your second question about standard deviation, your method might work but it also discards some information. For example, the number of cases greater than 3sd must be so low that your estimate of these tails will be noisy, so you have to be careful that you’re not in the position of those climatologists who are surprised when so-called hundred-year floods happen every 10 years. At a deeper level, it’s not clear to me that you should want to be looking at sd; perhaps there are summaries that map more closely to decisions of interest.
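
Here is a minimal Python sketch of the coverage check described in the question, along with a rough sense of how noisy the tail estimate is. It assumes normally distributed forecast errors and independent months, neither of which is guaranteed for market returns, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def coverage(returns, means, sds, k):
    """Fraction of returns within k forecast sds of the forecast mean."""
    hits = sum(abs(r - m) <= k * s for r, m, s in zip(returns, means, sds))
    return hits / len(returns)


def nominal_coverage(k):
    """Coverage of a +/- k sd interval under a normal distribution."""
    nd = NormalDist()
    return nd.cdf(k) - nd.cdf(-k)


def coverage_se(p, n):
    """Binomial standard error of an observed coverage fraction."""
    return sqrt(p * (1 - p) / n)


n_months = 300  # sample size mentioned in the letter
for k in (1, 2, 3):
    p = nominal_coverage(k)
    print(
        f"{k} sd: nominal coverage {p:.4f}, "
        f"expected misses {n_months * (1 - p):.1f}, "
        f"s.e. of observed coverage {coverage_se(p, n_months):.4f}"
    )
```

Under these assumptions, 300 months are expected to produce fewer than one return beyond 3 sd, and the standard error of the observed 3-sd coverage is about as large as the tail probability itself. That is the sense in which the estimate of the tails will be noisy.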

But I say these things all pretty generically as I don’t know anything about stock trading (except that I lost something like 40% of my life savings back in 2008, and that was a good thing for me).


I like this new thing of lecturing improv. I feel that it helps the audience stay focused, as they have to keep the structure of the talk in their heads while it’s happening. Also it enforces more logic in my own presentation, as I’m continually looping back to remind myself and the audience how each part fits into the general theme. It’s like a 40-minute-long story, with scene, plot, character development, a beginning, middle, and end.

Yes, sometimes it helps to show graphs or code as part of this, but I can pull that up as needed during a talk. It doesn’t need to be on “slides.”

My overall aim is for a Stewart Lee-type experience. OK, not exactly. For one thing, Lee isn’t doing improv; he practices and hones his act until he knows exactly what’s going where. But that’s a bit different because the standards are higher for stand-up entertainment than for an academic talk. So I don’t need to be so polished.

I’ve also been running my classroom lectures on the improv principle, riffing from homeworks, readings, and jitts and using students’ questions as the fuel to keep things moving along. That’s been going well too, I think, but I need to work more on the organization. When I give a colloquium or conference talk, I’m in control and can structure the time how I want and make sure everything fits within the larger story; but in class it seems to make sense to follow more closely the students’ particular needs, and then I’ll end up talking on things for which I hadn’t prepared, and it’s easy for me to get lost in the details of some examples and lose the main thread, thus reducing what the students get out of the class (I think).

The interesting thing is how long it’s taken me to get to this point. I’ve been giving talks in conferences for just about 30 years, and my style keeps changing. I’ve gone from acetate transparency sheets to handouts, back to transparencies, back to handouts, then to power point and pdf, then to the stage of removing as many words from the slides as possible, then removing even more words and using lots of pictures, now to this new stage of no slides at all. I like where I am now, but maybe in 5 years we’ll all be doing something completely different.

Exposure to Stan has changed my defaults: a non-haiku

Now when I look at my old R code, it looks really weird because there are no semicolons
Each line of code just looks incomplete
As if I were writing my sentences like this
Whassup with that, huh
Also can I please no longer do <-
I much prefer =

Is Rigor Contagious? (my talk next Monday 4:15pm at Columbia)

Is Rigor Contagious?

Much of the theory and practice of statistics and econometrics is characterized by a toxic mixture of rigor and sloppiness. Methods are justified based on seemingly pure principles that can’t survive reality. Examples of these principles include random sampling, unbiased estimation, hypothesis testing, Bayesian inference, and causal identification. Examples of uncomfortable reality include nonresponse, varying effects, researcher degrees of freedom, actual prior information, and the desire for external validity. We discuss a series of scenarios where researchers naively think that rigor in one part of their design and analysis will assure rigor on their larger conclusions, and then we discuss possible hierarchical Bayesian solutions in which the load of rigor is more evenly balanced across the chain of scientific reasoning.

The talk (for the Sustainable Development seminar) will be Mon 27 Feb, 4:15-5:45, in room 801 International Affairs Building at Columbia.

Note to Deborah Mayo

I have a post coming on 2 Mar on preregistration that I think you’ll like. It unifies some ideas regarding statistical design and analysis, and in some ways it’s a follow-up to my Borscht Belt post.

He wants to know what book to read to learn statistics

Tim Gilmour writes:

I’m an early 40s guy in Los Angeles, and I’m sort of sending myself back to school, specifically in statistics — not taking classes, just working through things on my own. Though I haven’t really used math much since undergrad, a number of my personal interests (primarily epistemology) would be much better served by a good knowledge of statistics.

I was wondering if you could recommend a solid, undergrad level intro to statistics book? While I’ve seen tons of options on the net, I don’t really have the experiential basis to choose among them effectively.

My reply: Rather than reading an intro stat book, I suggest you read a book in some area of interest to you that uses statistics. For example, Bob Carpenter is always recommending Jim Albert’s book on baseball. But if you’re interested in epidemiology, then maybe best to read a book on that subject. Sander Greenland wrote an epidemiology textbook; I haven’t read it all the way through, but Sander knows what he’s talking about, so it could be a good place to start.

If you had to read one statistics book right now, I’d suggest my book with Jennifer Hill. It’s not quite an intro book but we pretty much start from scratch.

Readers might have other suggestions.

Eurostat microdata conference

Division of labor and a Pizzagate solution

I firmly believe that the general principles of social science can improve our understanding of the world.

Today I want to talk about two principles—division of labor from economics, and roles from sociology—and their relevance to the Pizzagate scandal involving Brian Wansink, the Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years” whose published papers have been revealed to have hundreds of errors.

It is natural to think of “division of labor” and “roles” as going together: different people have different skill sets and different opportunities so it makes sense that they play different roles; and, conversely, the job you do is in part a consequence of your role in society.

From another perspective, though, the two principles are in conflict, in that certain logical divisions of labor might not occur because people are too stuck in playing their roles. We’ll consider such a case here.

I was talking the other day with someone about the Pizzagate story, in particular the idea that the protagonist, Brian Wansink, is in a tough position:

1. From all reports, Wansink sounds like a nice guy who cares about improving public health and genuinely wants to do the right thing. He wants to do good research because research is a way to learn about the world and to ultimately help people to make better decisions. He also enjoys publicity, but there’s nothing wrong with that: by getting your ideas out there, you can help more people. Through hard work, Wansink has achieved a position of prominence at his university and in the world.

2. However, for the past several years people have been telling Wansink that his published papers are full of errors, indeed they are disasters, complete failures that claim to be empirical demonstrations but do not even accurately convey the data used in their construction, let alone provide good evidence for their substantive claims.

3. Now put the two above items together. How can Wansink respond? So far he’s tried to address 2 while preserving all of 1: he’s acknowledged that his papers have errors and said that he plans to overhaul his workflow, but at the same time he has not expressed any change in his beliefs about any of the conclusions of his research. This is a difficult position to stand by, especially going forward as questions about the quality of this work keep coming. Whether or not Wansink personally believes his claims, I can’t see why anyone else should take them seriously.

What, then, can Wansink do? I thought about it and realized that, from the standpoint of division of labor, all is clear.

Wansink has some talents and is in some ways well-situated:
– He can come up with ideas for experiments that other people find interesting.
– He’s an energetic guy with a full Rolodex: he can get lots of projects going and he can inspire people to work on them.
– He’s working on a topic that affects a lot of people.
– He’s a master of publicity: he really cares about his claims and is willing to put in the effort to tell the world about them.

On the other hand, he has some weaknesses:
– He runs experiments without seeming to be aware of what data he’s collected.
– He doesn’t understand key statistical ideas.
– He publishes lots and lots of papers with clear errors.
– He seems to have difficulty mapping specific criticisms to any acceptance of flaws in his scientific claims.

Putting these together, I came up with a solution!
– Wansink should be the idea guy, he should talk with people and come up with ideas for experiments.
– Someone else, with a clearer understanding of statistics and variation, should design the data collection with an eye to minimizing bias and variance of measurements.
– Someone else should supervise the data collection.
– Someone else should analyze the data.
– Someone else should write the research papers, which should be openly exploratory and speculative.
– Wansink should be involved in the interpretation of the research results and in publicity afterward.

I made the above list in recognition that Wansink does have a lot to offer. The mistake is in thinking he needs to do all the steps.

But this is where “division of labor” comes into conflict with “roles.” Wansink’s been placed in the role of scientist, or “eating behavior expert,” and scientists are supposed to design their data collection, analyze their data, and write up their findings.

The problem here is not just that Wansink doesn’t know how to collect high-quality data, analyze them appropriately, or accurately write up the results—it’s that he can’t even be trusted to supervise these tasks.

But this shouldn’t be a problem. There are lots of things I don’t know how to do—I just don’t do them! I do lots of survey research but I’ve never done any survey interviewing. Maybe I should learn how to do survey interviews but I haven’t done so yet.

But the “rules” seem to be that the professor should do, or at least supervise, data collection, analysis, and writing of peer-reviewed papers. Wansink can’t do this. He would be better employed, I think, as part of a team where he can make his unique contributions. To make this step wouldn’t be easy: Wansink would have to give up a lot, in the sense of accepting limits on his expertise. So there are obstacles. But this seems like the logical endpoint.

P.S. Just to emphasize: This is not up to me. I’m not trying to tell Wansink or others what to do; I’m just offering my take on the situation.

Cloak and dagger

Elan B. writes:

I saw this JAMA Pediatrics article [by Julia Raifman, Ellen Moscoe, and S. Bryn Austin] getting a lot of press for claiming that LGBT suicide attempts went down 14% after gay marriage was legalized.
The heart of the study is comparing suicide attempt rates (in last 12 months) before and after exposure — gay marriage legalization in their state. For LGBT teens, this dropped from 28.5% to 24.5%.
In order to test whether this drop was just an ongoing trend in dropping LGBT suicide attempts, they do a placebo test by looking at whether rates dropped 2 years before legalization. In the text of the article, they simply state that there is no drop.
But then you open up the supplement and find that about half of the drop in rates — 2.2% — already came 2 years before legalization. However, since 0 is contained in the 95% confidence interval, it’s not significant! Robustness check passed.
In figure 1 of the article, they graph suicide attempts before legalization to show they’re flat, but even though they have the data for some of the states they don’t show LGBT rates.
Very suspicious to me, what do you think?

My reply: I wouldn’t quite say “suspicious.” I expect these researchers are doing their best; these are just hard problems. What they’ve found is an association which they want to present as causation, and they don’t fully recognize that limitation in their paper.

Here are the key figures:

And from here it’s pretty clear that the trends are noisy, so that little differences in the model can make big differences in the results, especially when you’re playing the statistical significance game. That’s fine—if the trends are noisy, they’re noisy, and your analysis needs to recognize this, and in any case it’s a good idea to explore such data.

I also share Elan’s concern about the whole “robustness check” approach to applied statistics, in which a central analysis is presented and then various alternatives are presented, with the goal being to show the same thing as the main finding (for perturbation-style robustness checks) or to show nothing (for placebo-style robustness checks).

One problem with this mode of operation is that robustness checks themselves have many researcher degrees of freedom, so it’s not clear what we can take from these. Just for example, if you do a perturbation-style robustness check and you find a result in the same direction but not statistically significant (or, as the saying goes, “not quite” statistically significant), you can call it a success because it’s in the right direction and, if anything, it makes you feel even better that the main analysis, which you chose, succeeded. But if you do a placebo-style robustness check and you find a result in the same direction but not statistically significant, you can just call it a zero and claim success in that way.

So I think there’s a problem in that there’s a pressure for researchers to seek, and claim, more certainty and rigor than is typically possible from social science data. If I’d written this paper, I think I would’ve started with various versions of the figures above, explored the data more, then moved to the regression line, but always going back to the connection between model, data, and substantive theories. But that’s not what I see here: in the paper at hand, there’s the more standard pattern of some theory and exploration motivating a model, then statistical significance is taken as tentative proof, to be shored up with robustness studies, then the result is taken as a stylized fact and it’s story time. There’s nothing particularly bad about this particular paper, indeed their general conclusions might well be correct (or not). They’re following the rules of social science research and it’s hard to blame them for that. I don’t see this paper as “junk science” in the way of the himmicanes, air rage, or ages-ending-in-9 papers (I guess that’s why it appeared in JAMA, which is maybe a bit more serious-minded than PPNAS or Lancet); rather, it’s a reasonable bit of data exploration that could be better. I’d say that a recognition that it is data exploration could be a first step to encouraging researchers to think more seriously about how best to explore such data. If they really do have direct data on suicide rates of gay people, that would seem like a good place to look, as Elan suggests.

Clay pigeon

Sam Harper writes:

Not that you are collecting these kinds of things, but I wanted to point to (yet) another benefit of the American Economic Association’s requirement of including replication datasets (unless there are confidentiality constraints) and code in order to publish in most of their journals—certainly for the top-tier ones like Am Econ Review: correcting coding mistakes!
  1. The Impact of Family Income on Child Achievement: Evidence from the Earned Income Tax Credit: Comment
    Lundstrom, Samuel
    The American Economic Review (ISSN: 0002-8282); Volume 107, No. 2, pp. 623-628(6); 2017-02-01T00:00:00
  2. The Impact of Family Income on Child Achievement: Evidence from the Earned Income Tax Credit: Reply
    Dahl, Gordon B.; Lochner, Lance
    The American Economic Review (ISSN: 0002-8282); Volume 107, No. 2, pp. 629-631(3); 2017-02-01T00:00:00
The papers are no doubt gated (I attached them if you are interested), but I thought it was refreshing to see what I consider to be close to a model exchange between the original authors and the replicator: Replicator is able to reproduce nearly everything but finds a serious coding error, corrects it and generates new (and presumably improved) estimates, and original authors admit they made a coding error without making much of a fuss, plus they also generate revised estimates. Post-publication review doing what it should. The tone is also likely more civil because the effort to reproduce largely succeeded and the original authors did not have to eat crow or say that they made a mistake that substantively changed their interpretation (and economists’ obsession with statistical significance is still disappointing). Credit to Lundstrom for not trying to over-hype the change in the results.
As an epidemiologist I do feel embarrassed that the biomedical community is still so far behind other disciplines when it comes to taking reproducible science seriously—especially the “high impact” general medical journals. We should not have to take our cues from economists, though perhaps it helps that much of the work they do uses public data.
I haven’t looked into this one but I agree with the general point.

Looking for rigor in all the wrong places (my talk this Thursday in the Columbia economics department)

Looking for Rigor in All the Wrong Places

What do the following ideas and practices have in common: unbiased estimation, statistical significance, insistence on random sampling, and avoidance of prior information? All have been embraced as ways of enforcing rigor but all have backfired and led to sloppy analyses and erroneous inferences. We discuss these problems and some potential solutions in the context of problems in social science research, and we consider ways in which future statistical theory can be better aligned with practice.

The seminar is held Thursday, February 23rd at the Economics Department, International Affairs Building (420 W. 118th Street) in room 1101, from 2:30 to 4:00 pm

I don’t have one particular paper, but here are a few things that people could read:

Unethical behavior vs. being a bad guy

I happened to come across this article and it reminded me of the general point that it’s possible to behave unethically without being a “bad guy.”

The story in question involves some scientists who did some experiments about thirty years ago on the biological effects of low-frequency magnetic fields. They published their results in a series of papers which I read when I was a student, and I found some places where I thought their analysis could be improved.

The topic seemed somewhat important—at the time, there was concern about cancer risks from exposure to power lines and other sources of low-frequency magnetic fields—so I sent a letter to the authors of the paper, pointing out two ways I thought their analysis could be improved, and requesting their raw data. I followed up the letter with a phone call.

Just for some context:

1. At no time did I think, or do I think, that they were doing anything unethical in their data collection or analysis. I just thought that they weren’t making full use of the data they had. Their unethical behavior, as I see it, came at the next stage, when they refused to share their data.

2. Those were simpler times. I assumed by default that published work was high quality, so when I saw what seemed like a flaw in the analysis, I wasn’t so sure—I was very open to the possibility that I’d missed something myself—and I didn’t see the problems in that paper as symptomatic of any larger issues.

3. I was not trying to “gotcha” these researchers. I thought they too would be interested in getting more information out of their data.

To continue with the story: When I called on the phone, the lead researcher on the project said he didn’t want to share the data: they were in lab notebooks and it would be effort to copy these, and his statistician had assured him that the analysis was just fine as is.

I think this was unethical behavior, given that: (a) at the time, this work was considered to have policy implications; (b) there was no good reason for the researcher to think that his statistician had particular expertise in this sort of analysis; (c) I’d offered some specific ways in which the data analysis could be improved so there was a justification for my request; (d) the work had been done at the Environmental Protection Agency, which is part of the U.S. government; (e) the dataset was pretty small so how hard could it be to photocopy some pages of lab notebooks and drop them in the mail; and, finally (f) the work was published in a scientific journal that was part of the public record.

A couple decades later, I wrote about the incident and the biologist and the statistician responded with defenses of their actions. I felt at the time of the original event, and after reading their letters, and I still feel, that these guys were trying to do their best, that they were acting according to what they perceived to be their professional standards, and that they were not trying to impede the progress of science and public health.

To put it another way, I did not, and do not, think of them as “bad guys.” Not that this is so important—there’s no reason why these two scientists should particularly care about my opinion of them, nor am I any kind of moral arbiter here. I’m just sharing my perspective to make the more general point that it is possible to behave unethically without being a bad person.

I do think the lack of data sharing was unethical—not as unethical as fabricating data (Lacour), or hiding data (Hauser) or brushing aside a barrage of legitimate criticism from multiple sources (Cuddy), or lots of other examples we’ve discussed over the years on this blog—but I do feel it is a real ethical lapse, for reasons (a)-(f) given above. But I don’t think of this as the product of “bad guys.”

My point is that it’s possible to go about your professional career, doing what you think is right, but still making some bad decisions: actions which were not just mistaken in retrospect, but which can be seen as ethical violations on some scale.

One way to view this is that everyone involved in research, including those of us who see ourselves as good guys, should be aware that we can make unethical decisions at work. “Unethical” labels the action, not the person, and ethics is a product of a situation as well as of the people involved.

Should the Problems with Polls Make Us Worry about the Quality of Health Surveys? (my talk at CDC tomorrow)

My talk at CDC, Tuesday, February 21, 2017, 12:00 noon, 2400 Century Center, Room 1015C:

Should the Problems with Polls Make Us Worry about the Quality of Health Surveys?

Response rates in public opinion polls have been steadily declining for more than half a century and are currently heading toward the 0% mark. We have learned much in recent years about the problems this is causing and how we can improve data collection and statistical analysis to get better estimates of opinion and opinion trends. In this talk, we review research in this area and then discuss the relevance of this work to similar problems in health surveys.

P.S. I gave the talk. There were no slides. OK, I did send along a subset of these, but I spent only about 5 minutes on them out of a 40-minute lecture, so the slides will give you close to zero sense of what I was talking about. I have further thoughts about the experience which I’ll save for a future post, but for now just let me say that if you weren’t at the talk, and you don’t know anyone who was there, then the slides won’t help.

Blind Spot

X pointed me to this news article reporting an increase in death rate among young adults in the United States:

According to a study published on 26 January by the scientific journal The Lancet, the death rate of young Americans aged 25 to 35 rose between 1999 and 2014, whereas this rate has fallen steadily across the richest countries over the past forty years. . . . It is mainly young white women who are pushing the figures up . . . Thus, the analysis of statistics collected from the National Center for Health Statistics shows that the death rate of 25-year-old white women rose by an average of 3% per year over the fifteen years considered, and by 2.3% per year for those in their thirties. For young men of the same age, the annual growth in the death rate is 1.9%.

I ran this by Jonathan Auerbach to see what he thought. After all, it’s the Lancet, which seems to specialize in papers of high publicity and low content, so it’s not like I’m gonna believe anything in there without careful scrutiny.

As part of our project, Jonathan had already run age-adjusted estimates for different ethnic groups every decade of age. These time series should be better than what was in the paper discussed in the above news article because, in addition to age adjusting, we also got separate estimated trends for each state, fitting some sort of hierarchical model in Stan.

Jonathan reported that we found a similar increase in death rates for women after adjustment. But there are comparable increases for men after breaking down by state.

Here are the estimated trends in age-adjusted death rates for non-Hispanic white women aged 25-34:

And here are the estimated trends for men:

In the graphs for the women, certain states with too few observations were removed. (It would be fine to estimate these trends from the raw data, but for simplicity we retrieved some aggregates from the CDC website, and it didn’t provide numbers in every state and every year.)

Anyway, the above graphs show what you can do with Stan. We’re not quite sure what to do with all these analyses: we don’t have stories to go with them so it’s not clear where they could be published. But at least we can blog them in response to headlines on mortality trends.

P.S. The Westlake titles keep on coming. It’s not just that they are so catchy—after all, that’s their point—but how apt they are, each time. And the amazing thing is, I’m using them in order. Those phrases work for just about anything. I’m just looking forward to a month or so on when I’ve worked my way down to the comedy titles lower down on the list.

Accessing the contents of a stanfit object

I was just needing this. Then, lo and behold, I found it on the web. It’s credited to Stan Development Team but I assume it was written by Ben and Jonah. Good to have this all in one place.

ComSciCon: Science Communication Workshop for Graduate Students

“Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.”

Javier Benitez writes:

This is a paragraph from Kathryn Montgomery’s book, How Doctors Think:

If medicine were practiced as if it were a science, even a probabilistic science, my daughter’s breast cancer might never have been diagnosed in time. At 28, she was quite literally off the charts, far too young, an unlikely patient who might have eluded the attention of anyone reasoning “scientifically” from general principles to her improbable case. Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.

I [Benitez] am not sure I agree with her assessment. I have been doing some reading on history and philosophy of science, there’s not much on philosophy of medicine, and this is a tough question to answer, at least for me.

I would think that science, done right, should help, not hinder, the cause of cancer decision making. (Incidentally, the relevant science here would necessarily be probabilistic, so I wouldn’t speak of “even” a probabilistic science as if it were worth considering any deterministic science of cancer diagnosis.)

So how to think about the above quote? I have a few directions, in no particular order:

1. Good science should help, but bad science could hurt. It’s possible that there’s enough bad published work in the field of cancer diagnosis that a savvy doctor is better off ignoring a lot of it, performing his or her own meta-analysis, as it were, partially pooling the noisy and biased findings toward some more reasonable theory-based model.

2. I haven’t read the book where this quote comes from, but the natural question is, How did the doctor diagnose the cancer in that case? Presumably the information used by the doctor could be folded into a scientific diagnostic procedure.

3. There’s also the much-discussed cost-benefit angle. Early diagnosis can save lives but it can also has costs in dollars and health when there is misdiagnosis.

To the extent that I have a synthesis of all these ideas, it’s through the familiar idea of anomalies. Science (that is, probability theory plus data plus models of data plus empirical review and feedback) is supposed to be the optimal way to make decisions under uncertainty. So if doctors have a better way of doing it, this suggests that the science they’re using is incomplete, and they should be able to do better.

The idea here is to think of the “science” of cancer diagnosis not as a static body of facts or even as a method of inquiry, but as a continuously-developing network of conjectures and models and data.

To put it another way, it can make sense to “ignore the requirements of science.” And when you make that decision, you should explain why you’re doing it—what information you have that moves you away from what would be the “science-based” decision.

Benitez adds some more background:

Pizzagate and Kahneman, two great flavors etc.

1. The pizzagate story (of Brian Wansink, the Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years”) keeps developing.

Last week someone forwarded me an email from the deputy dean of the Cornell business school regarding concerns about some of Wansink’s work. This person asked me to post the letter (which he assured me “was written with the full expectation that it would end up being shared”) but I wasn’t so interested in this institutional angle so I passed it along to Retraction Watch, along with links to Wansink’s contrite note and a new post by Jordan Anaya listing some newly-discovered errors in yet another paper by Wansink.

Since then, Retraction Watch ran an interview with Wansink, in which the world-renowned eating behavior expert continued with a mixture of contrition and evasion, along with insights into his workflow, for example this:

Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.

This is weird for two reasons. First, how do you say “we realized we asked . . .”? What’s to realize? If you asked the question that way, wouldn’t you already know this? Second, who eats 12 pieces of pizza? I guess they must be really small pieces!

Wansink also pulls one out of the Bargh/Baumeister/Cuddy playbook:

Across all sorts of studies, we’ve had really high replication of our findings by other groups and other studies. This is particularly true with field studies. One reason some of these findings are cited so much is because other researchers find the same types of results.

Ummm . . . I’ll believe it when I see the evidence. And not before.

In our struggle to understand Wansink’s mode of operation, I think we should start from the position that he’s not trying to cheat; rather, he just doesn’t know what he’s doing. Think of it this way: it’s possible that he doesn’t write the papers that get published, he doesn’t produce the tables with all the errors, he doesn’t analyze the data, maybe he doesn’t even collect the data. I have no idea who was out there passing out survey forms in the pizza restaurant—maybe some research assistants? He doesn’t design the survey forms—that’s how it is that he just realized that they asked that bizarre 0-to-12-pieces-of-pizza question. Also he’s completely out of the loop on statistics. When it comes to stats, this guy makes Satoshi Kanazawa look like Uri Simonsohn. That explains why his response to questions about p-hacking or harking was, “Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it.”

What Wansink has been doing for several years is organizing studies, making sure they get published, and doing massive publicity. For years and years and years, he’s been receiving almost nothing but positive feedback. (Yes, five years ago someone informed his lab of serious, embarrassing flaws in one of his papers, but apparently that inquiry was handled by one of his postdocs. So maybe the postdoc never informed Wansink of the problem, or maybe Wansink just thought this was a one-off in his lab, somebody else’s problem, and ignored it.)

When we look at things from the perspective of Wansink receiving nothing but acclaim for so many years and from so many sources (from students and postdocs in his lab, students in his classes, the administration of Cornell University, the U.S. government, news media around the world, etc., not to mention the continuing flow of accepted papers in peer-reviewed journals), the situation becomes more clear. It would be a big jump for him to accept that this is all a house of cards, that there’s no there there, etc.

Here’s an example of how this framing can help our understanding:

Someone emailed this question to me regarding that original “failed study” that got the whole ball rolling:

I’m still sort of surprised that they weren’t able to p-hack the original hypothesis, which was presumably some correlate with the price paid (either perceived quality, or amount eaten, or time spent eating, or # trips to the bathroom, or …).

My response:

I suspect the answer is that Wansink was not “p-hacking” or trying to game the system. My guess is that he’s legitimately using these studies to inform his thinking–that is, he forms many of his hypotheses and conclusions based on his data. So when he was expecting to see X, but he didn’t see X, he learned something! (Or thought he learned something; given the noise level in his experiments, it might be that his original hypothesis happened to be true, irony of ironies.) Sure, if he’d seen X at p=0.06, I expect he would’ve been able to find a way to get statistical significance, but when X didn’t show up at all, he saw it as a failed study. So, from Wansink’s point of view, the later work by the student really did have value in that they learned something new from their data.

I really don’t like the “p-hacking” frame because it “gamifies” the process in a way that I don’t think is always appropriate. I prefer the “forking paths” analogy: Wansink and his students went down one path that led nowhere, then they tried other paths.

2. People keep pointing me to a recent statement by Daniel Kahneman in a comment on a blog by Ulrich Schimmack, Moritz Heene, and Kamini Kesavan, who wrote that the “priming research” of Bargh and others that was featured in Kahneman’s book “is a train wreck” and should not be considered “as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.” Here’s Kahneman:

I accept the basic conclusions of this blog. To be clear, I do so (1) without expressing an opinion about the statistical techniques it employed and (2) without stating an opinion about the validity and replicability of the individual studies I cited.

What the blog gets absolutely right is that I placed too much faith in underpowered studies. As pointed out in the blog, and earlier by Andrew Gelman, there is a special irony in my mistake because the first paper that Amos Tversky and I published was about the belief in the “law of small numbers,” which allows researchers to trust the results of underpowered studies with unreasonably small samples. We also cited Overall (1969) for showing “that the prevalence of studies deficient in statistical power is not only wasteful but actually pernicious: it results in a large proportion of invalid rejections of the null hypothesis among published results.” Our article was written in 1969 and published in 1971, but I failed to internalize its message.

My position when I wrote “Thinking, Fast and Slow” was that if a large body of evidence published in reputable journals supports an initially implausible conclusion, then scientific norms require us to believe that conclusion. Implausibility is not sufficient to justify disbelief, and belief in well-supported scientific conclusions is not optional. This position still seems reasonable to me – it is why I think people should believe in climate change. But the argument only holds when all relevant results are published.

I knew, of course, that the results of priming studies were based on small samples, that the effect sizes were perhaps implausibly large, and that no single study was conclusive on its own. What impressed me was the unanimity and coherence of the results reported by many laboratories. I concluded that priming effects are easy for skilled experimenters to induce, and that they are robust. However, I now understand that my reasoning was flawed and that I should have known better. Unanimity of underpowered studies provides compelling evidence for the existence of a severe file-drawer problem (and/or p-hacking). The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for a broad hypotheses: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions. A case can therefore be made for priming on this indirect evidence. But I have changed my views about the size of behavioral priming effects – they cannot be as large and as robust as my chapter suggested.

I am still attached to every study that I cited, and have not unbelieved them, to use Daniel Gilbert’s phrase. I would be happy to see each of them replicated in a large sample. The lesson I have learned, however, is that authors who review a field should be wary of using memorable results of underpowered studies as evidence for their claims.
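To put rough numbers on the unanimity argument in the comment above, here is a minimal sketch. It assumes the studies are independent, that the effect is real, and that every study has the same power; the power levels and the count of ten studies are made-up values for illustration, not estimates from the priming literature.

```python
def prob_all_significant(power: float, n_studies: int) -> float:
    """Probability that every one of n independent studies reaches
    statistical significance, given that the effect is real and each
    study has the stated power (illustrative assumptions only)."""
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be between 0 and 1")
    if n_studies < 0:
        raise ValueError("n_studies must be non-negative")
    return power ** n_studies


# Hypothetical power levels; ten studies is an arbitrary choice.
for power in (0.3, 0.5, 0.8):
    p = prob_all_significant(power, 10)
    print(f"power={power:.1f}: P(all 10 significant) = {p:.4f}")
```

Under these assumptions, even at 80% power a clean sweep of ten significant results would be expected only about 11% of the time, and at 50% power about once in a thousand tries. A published record with no misses is then more plausibly explained by missing studies than by luck.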

Following up on Kahneman’s remarks, neuroscientist Jeff Bowers added:

There is another reason to be sceptical of many of the social priming studies. You [Kahneman] wrote:

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions.

However, there is an important constraint on subliminal priming that needs to be taken into account. That is, they are very short lived, on the order of seconds. So any claims that a masked prime affects behavior for an extended period of time seem at odds with these more basic findings. Perhaps social priming is more powerful than basic cognitive findings, but it does raise questions. Here is a link to an old paper showing that masked *repetition* priming is short-lived. Presumably semantic effects will be even more transient.

And psychologist Hal Pashler followed up:

One might ask if this is something about repetition priming, but associative semantic priming is also fleeting. In our JEP:G paper failing to replicate money priming we noted:

For example, Becker, Moscovitch, Behrmann, and Joordens (1997) found that lexical decision priming effects disappeared if the prime and target were separated by more than 15 seconds, and similar findings were reported by Meyer, Schvaneveldt, and Ruddy (1972). In brief, classic priming effects are small and transient even if the prime and measure are strongly associated (e.g., NURSE-DOCTOR), whereas money priming effects are [purportedly] large and relatively long-lasting even when the prime and measure are seemingly unrelated (e.g., a sentence related to money and the desire to be alone).

Kahneman’s statement is stunning because it seems so difficult for people to admit their mistakes, and in this case he’s not just saying he got the specifics wrong, he’s pointing to a systematic error in his ways of thinking.

You don’t have to be Thomas S. Kuhn to know that you can learn more from failure than success, and that a key way forward is to push push push to understand anomalies. Not to sweep them under the rug but to face them head-on.

3. Now return to Wansink. He’s in a tough situation. His career is based on publicity, and now he has bad publicity. And there’s no easy solution for him, as once he starts to recognize problems with his research methods, the whole edifice collapses. Similarly for Baumeister, Bargh, Cuddy, etc. The cost of admitting error is so high that they’ll go to great lengths to avoid facing the problems in their research.

It’s easier for Kahneman to admit his errors because, yes, this does suggest that some of the ideas behind “heuristics and biases” or “behavioral economics” have been overextended (yes, I’m looking at you, claims of voting and political attitudes being swayed by shark attacks, college football, and subliminal smiley faces), but his core work with Tversky is not threatened. Similarly, I can make no-excuses corrections of my paper that was wrong because of our data coding error, and my other paper with the false theorem.

P.S. Hey! I just realized that the above examples illustrate two of Clarke’s three laws.

Vine regression?

Jeremy Neufeld writes:

I’m an undergraduate student at the University of Maryland and I was recently referred to this paper (Vine Regression, by Roger Cooke, Harry Joe, and Bo Chang; also an accompanying summary blog post by the main author) as potentially useful in policy analysis. With the big claims it makes, I am not sure if it passes the sniff test. Do you know anything about vine regression? How would it avoid overfitting?

My reply: Hey, as a former University of Maryland student myself I’ll definitely respond! I looked at the paper, and it seems to be presenting a class of multivariate models, a method for fitting the models to data, and some summaries. The model itself appears to be a mixture of multivariate normals of different dimensions, fit to the covariance matrix of a rank transformation of the raw data—I think they’re ranking each variable on its marginal distribution but I’m not completely sure, and I’m not quite sure how they deal with discreteness in the data. Then somehow they’re transforming back to the original space of the data; maybe they do some interpolation to get continuous values, also I’m not quite sure what happens when they extrapolate to beyond the range of the original ranks.
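The rank-transformation step described above can be sketched roughly as follows. This is not the paper's fitting procedure, only an illustration of one common reading of it: rank each variable on its own marginal distribution, map the ranks to standard-normal scores, and then compute correlations on the transformed scale. The function names, the averaging of tied ranks, the rank/(n+1) scaling, and the toy data are all assumptions made for this example.

```python
from statistics import NormalDist, mean


def normal_scores(values):
    """Map each value to a standard-normal score based on its marginal rank.

    Tied values share the average of the ranks they span, which is one
    simple (assumed) way to handle discreteness in the data.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Dividing by n + 1 keeps the probabilities strictly inside (0, 1).
    return [NormalDist().inv_cdf(r / (n + 1)) for r in ranks]


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Made-up data: y is a monotone but nonlinear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [v ** 3 for v in x]

raw_r = correlation(x, y)
rank_r = correlation(normal_scores(x), normal_scores(y))
print(f"raw correlation:           {raw_r:.3f}")
print(f"normal-scores correlation: {rank_r:.3f}")
```

Because the transform depends only on ranks, any monotone relationship looks perfectly correlated on the transformed scale, while the raw correlation does not. The sketch also makes the open questions concrete: ties need some convention, and a new value outside the observed range has no rank to map back from.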

The interesting part of the model is the mixture of submodels of different dimensions. I’m generally suspicious of such approaches, as continuous smoothing is more to my taste. That said, the usual multivariate models we fit are so oversimplified, that I could well imagine that this mixture model could do well. So I’m supportive of the approach. I think maybe they could fit their model in Stan—if so, that would probably make the computation less of a hassle for them.

The one thing I really don’t understand at all in this paper is their treatment of causal inference. The model is entirely associational (that’s fine, I love descriptive data analysis!) and they’re fitting a multivariate model to some observational data. But then in section 3.1 of their paper they use explicit causal language: “the effect of breast feeding on IQ . . . If we change the BFW for an individual, how might that affect the individual’s IQ?” The funny thing is, right after that they again remind the reader that this is just descriptive statistics: “we integrate the scaled difference of two regression functions which differ only in that one has weeks more breast feeding than the other.” But then they snap right back to the causal language. So that part just baffles me. They have a complicated, flexible tool for data description but for some reason they then seem to make the tyro mistake of giving a causal interpretation to regression coefficients fit to observational data. That’s not really so important, though; I think you can ignore the causal statements and the method could still be useful. It seems worth trying out.

Krzysztof Sakrejda speaks in NYC on Bayesian hierarchical survival-type model for Dengue infection

Daniel writes:

Krzysztof Sakrejda is giving a cool talk next Tues 5:30-7pm downtown on a survival model for Dengue infection using Stan. If you’re interested, please register asap. Google is asking for the names for security by tomorrow morning.