Question about data mining bias in finance

Finance professor Ravi Sastry writes:

Let’s say we have N vectors of data, {y_1,y_2,…,y_N}. Each is used as the dependent variable in a series of otherwise identical OLS regressions, yielding t-statistics on some parameter of interest, theta: {t_1,t_2,…,t_N}.

The maximum t-stat is denoted t_n*, and the corresponding data are y_n*. These are reported publicly, as if N=1. The remaining N-1 data vectors and tests are not reported or disclosed in any way.

Given priors on theta and N, and only y_n*, how do we form a posterior on theta?

I would greatly appreciate any help at all, including proper terminology to describe this problem (which is endemic in academic finance) and pointers to relevant papers.

My reply:

I don’t know the relevant literature but I think you could do this easily enough in a Bayesian context by just treating all the unreported results as missing data (that is, as unknown quantities). You could do this in Stan—almost. (You’d have to assume N is known, but that might not be so horrible in practice.) Or maybe there are ways to do this using a theoretical analysis as well; this could give some insight. It seems like an unpleasant problem, though, if you’re not allowed to see most of the data.
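To make the "treat the unreported tests as missing data" idea concrete, here is a minimal sketch in Python rather than Stan. It makes the assumptions flagged above (N is known) plus a few more of my own: a common theta across the N regressions, independent tests, and a unit-variance normal approximation to the t-statistics, so that the reported maximum t* has density N·φ(t*−θ)·Φ(t*−θ)^(N−1). The function name and all the specific numbers are hypothetical, for illustration only:

```python
import numpy as np
from scipy.stats import norm

def max_tstat_posterior(t_star, N, theta_grid, prior):
    # Density of the maximum of N independent t-statistics, each
    # approximated as Normal(theta, 1):
    #   p(t* | theta) = N * phi(t* - theta) * Phi(t* - theta)^(N-1)
    z = t_star - theta_grid
    lik = N * norm.pdf(z) * norm.cdf(z) ** (N - 1)
    post = prior * lik
    return post / post.sum()  # normalize over the (evenly spaced) grid

theta_grid = np.linspace(-5.0, 5.0, 2001)
prior = norm.pdf(theta_grid, loc=0.0, scale=2.0)  # N(0, 2) prior on theta

# Reported t* = 2.5, analyzed naively (as if N = 1) vs. with 19 hidden tests
naive = max_tstat_posterior(2.5, N=1, theta_grid=theta_grid, prior=prior)
adjusted = max_tstat_posterior(2.5, N=20, theta_grid=theta_grid, prior=prior)

mean_naive = (theta_grid * naive).sum()        # ignores the selection
mean_adjusted = (theta_grid * adjusted).sum()  # accounts for the selection
```

The point of the sketch is the direction of the correction: conditioning on t* being the maximum of N tests pulls the posterior on theta back toward the prior, relative to the naive analysis that takes the reported result at face value.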

Times have changed (sportswriting edition)


The name Tom Boswell came up in a recent comment thread and I was moved to reread his 1987 article, “99 Reasons Why Baseball Is Better Than Football.”

The phrase “head injury” did not come up once. Boswell refers a few times to football’s dangerous nature (for example, “98. When a baseball player gets knocked out, he goes to the showers. When a football player gets knocked out, he goes to get X-rayed.”) but nothing about concussions, brain injuries, etc.

“The Statistical Crisis in Science”: My talk in the psychology department Monday 17 Nov at noon

Monday 17 Nov at 12:10pm in Schermerhorn room 200B, Columbia University:

Top journals in psychology routinely publish ridiculous, scientifically implausible claims, justified based on “p < 0.05.” And this in turn calls into question all sorts of more plausible, but not necessarily true, claims that are supported by this same sort of evidence. To put it another way: we can all laugh at studies of ESP, or ovulation and voting, but what about MRI studies of political attitudes, or embodied cognition, or stereotype threat, or, for that matter, the latest potential cancer cure? If we can’t trust p-values, does experimental science involving human variation just have to start over? And what do we do in fields such as political science and economics, where preregistered replication can be difficult or impossible? Can Bayesian inference supply a solution? Maybe. These are not easy problems, but they’re important problems.

Here are the slides (which might be hard to follow without hearing the talk) and here is some suggested reading:


Too Good to Be True

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time

Slightly technical:

Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors

The Connection Between Varying Treatment Effects and the Crisis of Unreplicable Research: A Bayesian Perspective

If you do an experiment with 700,000 participants, you’ll (a) have no problem with statistical significance, (b) get to call it “massive-scale,” (c) get a chance to publish it in a tabloid top journal. Cool!

David Hogg points me to this post by Thomas Lumley regarding a social experiment that was performed by randomly manipulating the content in the news feed of Facebook customers. The shiny bit about the experiment is that it involved 700,000 participants (or, as the research article, by Adam Kramer, Jamie Guillory, and Jeffrey Hancock, quaintly puts it, “689,003”), but, as Tal Yarkoni points out, that kind of sample size is just there to allow people to estimate tiny effects and also maybe to get the paper published in a top journal and get lots of publicity (there it is, “massive-scale” right in the title).

Before getting to Lumley’s post, which has to do with the ethics of the study, I want to echo the point made by Yarkoni:

In the experimental conditions, where negative or positive emotional posts are censored, users produce correspondingly more positive or negative emotional words in their own status updates. . . . [But] these effects, while highly statistically significant, are tiny. The largest effect size reported had a Cohen’s d of 0.02–meaning that eliminating a substantial proportion of emotional content from a user’s feed had the monumental effect of shifting that user’s own emotional word use by two hundredths of a standard deviation. In other words, the manipulation had a negligible real-world impact on users’ behavior. . . .

The attitude in much of science, of course, is that if you can conclusively demonstrate an effect, then its size doesn’t really matter. But I don’t agree with this. For one thing, if we happen to see an effect of +0.02 in one particular place at one particular time, it could well be -0.02 somewhere else. Don’t get me wrong—I’m not saying that this finding is empty, just that we have to be careful about out-of-sample generalization.
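To see why a sample of 689,003 guarantees "statistical significance" for even a negligible effect, here's a back-of-the-envelope calculation in Python. The equal split into two groups is my simplifying assumption, not a detail from the paper; the approximate standard error of Cohen's d for two equal groups of total size n is sqrt(4/n):

```python
import math

n = 689_003  # total participants, assumed split into two equal groups
d = 0.02     # largest reported effect size (Cohen's d)

se = math.sqrt(4 / n)  # approximate standard error of d
z = d / se             # z-statistic: far past any significance threshold
```

With these numbers z comes out above 8, i.e., a p-value that is essentially zero, for an effect of two hundredths of a standard deviation. That is Yarkoni's point: the huge n buys detectability, not importance.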

Now on to the ethics question. Lumley writes:

The problem is consent. There is a clear ethical principle that experiments on humans require consent, except in a few specific situations, and that the consent has to be specific and informed. . . . The need for consent is especially clear in cases where the research is expected to cause harm. In this example, the Facebook researchers expected in advance that their intervention would have real effects on people’s emotions; that it would do actual harm, even if the harm was (hopefully) minor and transient.

I pretty much disagree with this, for reasons that I’ll explain in a moment. Lumley continues:

The psychologist who edited the study for PNAS said

“I was concerned,” Fiske told The Atlantic, “until I queried the authors and they said their local institutional review board had approved it—and apparently on the grounds that Facebook apparently manipulates people’s News Feeds all the time.”

Fiske added that she didn’t want “the originality of the research” to be lost, but called the experiment “an open ethical question.”

To me [Lumley], the only open ethical question is whether people believed their agreement to the Facebook Terms of Service allowed this sort of thing. This could be settled empirically, by a suitably-designed survey. I’m betting the answer is “No.” Or, quite likely, “Hell, no!”.

Amusingly enough, this is the same Susan Fiske who was earlier quoted in support of the himmicanes study, but that doesn’t seem to be particularly relevant here.

I don’t feel strongly about the ethical issues here. On one hand, I’d be a bit annoyed if I found that my internet provider was messing with me just to get a flashy paper in a journal (for example, what if someone told me that some researcher was sending spam to the blog, wasting my time (yes, I delete these manually every day) in the hope of getting a paper published in a tabloid journal using a phrase such as “massively online experiment”). Indeed, a couple of years ago I was annoyed that some researchers sent me a time-wasting email ostensibly coming from a student who wanted to meet with me. My schedule is a mess and it doesn’t help me to get fake appointment requests. On the other hand, as Fiske notes, corporations manipulate what they send us all the time, and any manipulation can possibly affect our mood. It seems a bit ridiculous to say that a researcher needs special permission to do some small alteration of an internet feed, when advertisers and TV networks can broadcast all sorts of emotionally affecting images whenever they want. The other thing that’s bugging me is the whole IRB thing, the whole ridiculous idea that if you’re doing research you need to get permission for noninvasive things like asking someone a survey question.

So, do I consider this Facebook experiment unethical? No, but I could see how it could be considered thus, in which case you’d also have to consider all sorts of non-research experiments (the famous A/B testing that’s so popular now in industry) to be unethical as well. In all these cases, you have researchers, of one sort or another, experimenting on people to see their reactions. And I don’t see the goal of getting published in PNAS to be so much worse than the goal of making money by selling more ads. But, in any case, I don’t really see the point of involving institutional review boards for this sort of thing. I’m with Tal Yarkoni on this one; as he puts it:

It’s not clear what the notion that Facebook users’ experience is being “manipulated” really even means, because the Facebook news feed is, and has always been, a completely contrived environment. . . . Facebook—and virtually every other large company with a major web presence—is constantly conducting large controlled experiments on user behavior.

Again, I can respect if you take a Stallman-like position here (or, at least, what I imagine rms would say) and argue that all of these manipulations are unethical, that the code should be open and we should all be able to know, at least in principle, how our messages are being filtered. So I agree that there is an ethical issue here and I respect those who have a different take on it than I do—but I don’t see the advantage of involving institutional review boards here. All sorts of things are unethical but still legal, and I don’t see why doing something and publishing it in a scientific journal should be considered more unethical or held to a more stringent standard than doing the same thing and publishing it in an internal business report.

P.S. This minor, minor story got what seems to me like a hugely disproportionate amount of attention—I’m guessing it’s because lots of people feel vaguely threatened by the Big Brother nature of Google, Facebook, etc., and this is a story that gives people an excuse to grab onto these concerns—and so when posting on it I’m glad of our 1-to-2-month lag, which means that you’re seeing this post with fresh eyes, after you’ve almost forgotten what the furore was all about.

“Patchwriting” is a Wegmanesque abomination but maybe there’s something similar that could be helpful?

Reading Thomas Basbøll’s blog I came across a concept I’d not previously heard about, “patchwriting,” which is defined as “copying from a source text and deleting some words, altering grammatical structures, or plugging in one synonym for another.” (See here for further discussion.)

As Basbøll writes, this is simply a variant of plagiarism, indeed it’s an excellent description of what some of the craftier plagiarists actually do. I’m reminded of the statement of history professor Matthew Whitaker’s publisher that Whitaker couldn’t have plagiarized his recent book because he’d assured her that he’d run it thru two different plagiarism programs, or something like that.

As the saying goes, if you have to run your book through two plagiarism programs, you’re already in trouble.

OK, so I’m 100% with Basbøll that “patchwriting” is plagiarism. And, like Basbøll, I’m a bit disturbed if some people think that patchwriting is “virtually inevitable as writers learn to produce texts within a new discourse community.”

But I wonder if there’s something similar to patchwriting that could serve the same function but much more constructively. Check out this paragraph that Basbøll quotes from Pecorari, “Academic Writing and Plagiarism” (1999):

In a study of the course of the progress of second-language writer through a business course, [P. Currie] found that the student, Diana, worked diligently in the early weeks of the course to raise the level of her writing assignments, but was at real risk of not receiving the grade she needed to stay in her program. Eventually Diana hit upon the strategy of repeating words and phrases from her sources; in other words, she began to patchwrite. From then on her teacher’s feedback was more positive.

My take on this is that, if student Diana did this right, it could be a great way for her to transition to learning to write on her own.

Just for example, imagine how the above quoted paragraph could be “patched”:

Pecorari (1999) writes of “a study of the course of the progress of second-language writer through a business course.” In this class, Diana (the student in the class) worked hard (Pecorari uses “diligently”) to ~~raise the level of her writing assignments~~ write better. Diana was worried about not getting a good grade. Then Diana did patchwriting: she did “repeating words and phrases from her sources.” Her teacher liked that. Her teacher gave Diana positive feedback.

I’ve purposely written this in one take, in a somewhat awkward style to imitate how a student might do it. I also put in a strikethrough to illustrate another way that a student might paraphrase but in an honest way. And, even so, it’s not perfect; one might say it still teeters on the edge of plagiarism, even with the sourcing, because I (playing the role of the hypothetical student) am adding nothing—like Ed Wegman or Frank Fischer or Matthew Whitaker (but in a more honest way), I’m merely regurgitating.

A better approach could be a complete blockquote followed by a summary and a reaction. For example, suppose the (hypothetical) student paper went like this:

In her 1999 book, Pecorari writes:

In a study of the course of the progress of second-language writer through a business course, [P. Currie] found that the student, Diana, worked diligently in the early weeks of the course to raise the level of her writing assignments, but was at real risk of not receiving the grade she needed to stay in her program. Eventually Diana hit upon the strategy of repeating words and phrases from her sources; in other words, she began to patchwrite. From then on her teacher’s feedback was more positive.

In my own words: Diana repeated words and phrases and her teacher liked it. Diana got better grades.

My reaction: Pecorari thinks it worked. I like the idea too. I want to learn to write on my own. Will patchworking work for me?

That’s probably not a good imitation on my part of student writing. But I have two points here:

1. Patchwriting, even with full sourcing, is empty as a means of self-expression but could still provide some useful practice. Just as we can learn from looking up classic chess games and playing them out on the board, maybe novice students could learn by re-expressing source material.

2. Shuffling around the words of others seems like a bit of a dead end, so I think its limitations should be kept in mind.


Unlike Basbøll and (I assume) Pecorari, I have very limited experience as a writing teacher, so these are just some quick ideas I’m offering up. Conditional on these caveats, here are my thoughts:

- Pure patchwriting as in the definition at the top of this page seems like a terrible idea.

- But I could see it making sense to encourage “patching” (as in my example above) and block-quoting-and-explaining (as in my other example) as a way to learn.

- We may need to somewhat separate the goal of learning to put words and sentences together, and the goal of expressing oneself. A writer needs to learn both these skills.

Crowdsourcing Data Analysis 2: Gender, Status, and Science

Emily Robinson writes:

Brian Nosek, Eric Luis Uhlmann, Amy Sommer, Kaisa Snellman, David Robinson, Raphael Silberzahn, and I have just launched a second crowdsourcing data analysis project following the success of the first one. In the crowdsourcing analytics approach, multiple independent analysts are recruited to test the same hypothesis on the same data set in whatever manner they see as best. If everyone comes up with the same results, then scientists can speak with one voice. If not, the subjectivity and conditionality of results on analysis strategy is made transparent.

The first crowdsourcing analytics initiative examined whether soccer referees give more red cards to dark-skin-toned than light-skin-toned players (Silberzahn et al., in preparation; see the project page on the Open Science Framework). The outcome was striking: although 62% of teams obtained a significant effect in the expected direction, estimated effect sizes ranged from moderately large to practically nil.

For this second project, we have collected the scientific dialogue to analyze how gender and status affect verbal dominance and verbosity. This project adds several key new features to the first crowdsourcing project, in particular having analysts operationalize the key variables on their own and giving analysts the opportunity to propose and vote on their own hypotheses to be tested by the group.

The full project description is here. If you’re interested in being one of the crowdstormer analysts, you can register here. All analysts will receive an author credit on the final paper. We would love to have Bayesian analysts represented in the group. Also, please feel free to let others know about the opportunity; anyone with the relevant data analysis skills is welcome to take part.

Sounds like fun. And I’m pretty sure they won’t be following up in a few weeks with an announcement that this was all a hoax.

The history of MRP highlights some differences between political science and epidemiology

Responding to a comment from Thomas Lumley (who asked why MRP estimates often seem to appear without any standard errors), I wrote:

In political science, MRP always seems accompanied by uncertainty estimates. However, when lots of things are being displayed at once, it’s not always easy to show uncertainty, and in many cases I simply let variation stand in for uncertainty. Thus I’ll display colorful maps of U.S. states with the understanding that the variation between states and demographic groups gives some sense of uncertainty as well. This isn’t quite right, of course, and with dynamic graphics it would make sense to have some default uncertainty visualizations as well.

But one thing I have emphasized, ever since my first MRP paper with Tom Little in 1997, is that this work unifies the design-based and model-based approaches to survey inference, in that we use modeling and poststratification to adjust for variables that are relevant to design and nonresponse. We discuss this a bit in BDA (chapter 8 of the most recent edition) as well. So there’s certainly no reason not to display uncertainty (beyond the challenges of visualization).
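For readers who haven't seen the mechanics, here is a toy poststratification step in Python. Everything specific is made up for illustration: in real MRP the cell estimates would come from a multilevel model fit to the survey, and uncertainty would be propagated by simulating from the posterior rather than by the independence formula below (which partial pooling violates):

```python
import numpy as np

# Hypothetical poststratification cells (e.g., age x education groups):
cell_estimate = np.array([0.40, 0.55, 0.62, 0.48])  # model estimates per cell
cell_se = np.array([0.03, 0.04, 0.08, 0.05])        # their standard errors
pop_count = np.array([2_000, 1_500, 500, 1_000])    # census counts per cell

# Poststratified estimate: cell estimates weighted by population shares,
# so cells that are over- or under-sampled get their correct weight.
w = pop_count / pop_count.sum()
mrp_estimate = (w * cell_estimate).sum()

# A crude standard error, assuming independent cell estimates:
mrp_se = np.sqrt(((w * cell_se) ** 2).sum())
```

The second line of output is the point of Lumley's question: the uncertainty is right there in the machinery, so there is no obstacle in principle to reporting it alongside the point estimates.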

I’ve recently been told that things are different in epidemiology, that there’s a fairly long tradition in that field of researchers fitting Bayesian models to survey data and not being concerned about design at all! Perhaps that relates to the history of the field. Survey data, and survey adjustment, have been central to political science for close to a century, and we’ve been concerned all this time with non-representativeness. In contrast, epidemiologists are often aiming for causality and are more concerned about matching the treatment group to the control group than about matching the sample to the population. Ultimately there’s no good reason for this—even in an experimental context we should care about the population (and, statistically, this will make a difference if there are important treatment interactions)—but it makes sense that the two fields have different histories, to the extent that a Bayesian researcher in epidemiology might find it a revelation that Bayesian methods (via MRP) can adjust for survey bias, while this is commonplace to a political scientist, as it’s been done in that field for nearly 20 years.

I wonder if another part of the story is that Bugs really caught on in epi (which makes sense given who was developing it), and Bugs was set up in a traditionally-Bayesian way of data + model -> inference about parameters, without the additional step required in MRP of mapping back to the population.

Also, causal inference researchers have tended to be pretty cavalier about the sampling aspect of their data. Rubin, for example, talked a lot about random or nonrandom assignment of the treatment but not much about representativeness of the sample, and I think that attitude was typical for statisticians for many years—at least, when they weren’t working in survey research. In my own work in poli sci, I was always acutely aware that survey adjustment mattered (for example, see figure 1a here), and I didn’t want to be one of those Bayesians who parachute in from the outside and ignore the collective wisdom of the field. In retrospect, this caution has served me well, because recently when some sample-survey dinosaurs went around attacking model-based data-collection and adjustment, I was able to decisively shoot them down by pointing out that we’re all ultimately playing the same game.

I don’t go with the traditional “Valencia” attitude that the Bayesian approach is a competitor to classical statistics; rather, I see Bayes as an enhancement (which I’m pretty sure is your view too), and it’s an important selling point that we don’t discard the collective wisdom of a scientific field; rather, we effectively include that wisdom in our models.

Illegal Business Controls America

The other day I wrote:

After encountering the Chicago-cops example I was going to retitle this post, “The psych department’s just another crew” in homage to the line, “The police department’s just another crew” from the rap, “Who Protects Us From You.” But, just to check, I googled that KRS-One rap and it turns out it does not contain that line! It’s funny because it’s not a memory thing—when the album came out and I heard that rap, I registered that line, which I guess KRS-One never said.

I sent this to my friend Kenny, who’d introduced me to KRS-One many years ago, and he said:

Re KRS-One, I think that line* is from somewhere on the By All Means Necessary album— T’cha T’cha?? in society we have illegal and legal/we need both to make things equal… or some other track? I haven’t listened to it in years and it is all running together, but it is KRS-One.

*the police department is LIKE a crew
It does whatever they want to do…

So I did some more searching and indeed KRS-One did say it, in “Illegal Business”:

The police department
Is like a crew
It does whatever
They want to do

The idea is very close to that of Who Protects Us From You (“If I hit you I’ll be killed, But you hit me? I can sue”) and it has the same rhythm and rhyme, so I can see how I misplaced it in my memory. I’m glad to know I didn’t completely fabricate it.

Also good to know that Kenny not only corroborated but remembered the line exactly.

On deck this week

Mon: Illegal Business Controls America

Tues: The history of MRP highlights some differences between political science and epidemiology

Wed: “Patchwriting” is a Wegmanesque abomination but maybe there’s something similar that could be helpful?

Thurs: If you do an experiment with 700,000 participants, you’ll (a) have no problem with statistical significance, (b) get to call it “massive-scale,” (c) get a chance to publish it in a tabloid top journal. Cool!

Fri: Stethoscope as weapon of mass distraction

Sat: Times have changed (sportswriting edition)

Sun: Question about data mining bias in finance

“Differences Between Econometrics and Statistics” (my talk this Monday at the University of Pennsylvania econ dept)

Differences Between Econometrics and Statistics:  that’s the title of the talk I’ll be giving at the econometrics workshop at noon on Monday.


At ~~4pm~~ 4:30pm in the same place, I’ll be speaking on Stan.

And here are some things for people to read:

For “Differences between econometrics and statistics”:

For “Stan: A platform for Bayesian inference”:

Please interrupt and ask a lot of questions.

Wait, I forgot, this is an econ seminar, I don’t have to remind you to do that!

P.S. The noon talk was a bit of a mess; some interesting stuff but I wasn’t well-enough organized. The 4:30 talk went well, though! And in neither talk was I overwhelmed with interruptions. Apparently, econometricians aren’t like other economists in that way.