
Diederik Stapel in the news, again

Bikes . . . have “become the most common mode of transportation for criminals.”

OK, that’s just ethnic profiling of Dutch people. I think they’re just gonna put the whole country on lockdown.

How do data and experiments fit into a scientific research program?

I was talking with someone today about various “dead on arrival” research programs we’ve been discussing here for the past few years: I’m talking about topics such as beauty and sex ratios of children, or ovulation and voting, or ESP—all of which possibly represent real phenomena and could possibly be studied in a productive way, just not using the data collection and measurement strategies now in use. This is just my opinion, but it’s an opinion based on a mathematical analysis (see full story here) that compares standard errors with plausible population differences.
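To give a sense of what that comparison looks like, here is a minimal sketch of the calculation, with illustrative numbers of my own choosing (a plausible population difference of a few tenths of a percentage point against a standard error of a few percentage points), not the numbers from any particular study:

```python
# Sketch: compare a plausible population difference to the standard error of
# a typical survey comparison, and compute the power of the usual two-sided test.
# The numbers are illustrative, not taken from any specific paper.
from scipy.stats import norm

delta = 0.3   # plausible population difference, in percentage points
se = 3.0      # standard error of the estimated difference, in percentage points

z = delta / se
power = (1 - norm.cdf(1.96 - z)) + norm.cdf(-1.96 - z)
print(f"power = {power:.3f}")  # barely above the 5% you'd get from pure noise
```

With numbers like these, the study has essentially no chance of detecting the difference it is looking for, which is the sense in which such designs are dead on arrival.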

Anyway, my point here is not to get into another argument with Satoshi Kanazawa or Daryl Bem or whoever. They’re doing their research, I’m doing mine, and at this point I don’t think they’re planning to change their methods.

Instead, accept for a moment my premise that these research programs, as implemented, are dead ends. Accept my premise that these researchers are chasing noise, that they’re in the position of the “50 shades of gray” guys but without the self-awareness. They think they’re doing research and making discoveries but they’re just moving in circles.

OK, fine, but then the question arises: what is the role of data and experimental results in these research programs?

Here’s what I think. First, when it comes to the individual research articles, I think the data add nothing; indeed, the data can even be a minus if they lead other researchers to conclude that a certain pattern holds in the general population.

From this perspective, if these publications have value, it’s in spite of, not because of, their data. If the theory is valuable (and it could be), then it could (and, I think, should) stand alone. It would be good if the theory also came with quantitative predictions that were consistent with the rest of available scientific understanding, which would in turn motivate a clearer understanding of what can be learned from noisy data in such situations—but let’s set that aside, let’s accept that these people are working within their own research paradigm.

So what is that paradigm? By which I mean, not What is their paradigm of evolutionary psychology or paranormal perception or whatever, but What is their paradigm of how research proceeds? How will their careers end up, and how will these strands of research go forward?

I think (but certainly am not sure) that these scientists think of themselves as operating in Popperian fashion, coming up with scientific theories that imply testable predictions, then designing measurements and experiments to test their hypotheses, rejecting when “p less than .05” and moving forward. Or, to put it slightly more loosely, they believe they are establishing stylized facts, little islands of truth in our sea of ignorance, and jumping from island to island, building a pontoon bridge of knowledge . . . ummmm, you get the picture. The point is that, from their point of view, they’re doing classic science. I don’t think this is what’s happening, though, for reasons I discussed here a few months ago.

But, if these researchers are not following the Karl Popper playbook, what are they doing?

A harsh view, given all I’ve written above, is that they’re just playing in a sandbox with no connection to science or the real world.

But I don’t take this harsh view. I accept that theorizing is an important part of science, and I accept that the theorizing of Daryl Bem, or Sigmund Freud, or the himmicanes and hurricanes people, or the embodied cognition researchers, etc etc etc., is science, even if these researchers do not have a realistic sense of the sort of measurement accuracy it would take to test and evaluate these theories.

Now we’re getting somewhere. What I think is that anecdotes, or case studies, or even data so noisy as to be essentially random numbers, can be a helpful stimulus, in that they can motivate some theorizing.

Take, for example, that himmicanes and hurricanes study. The data analysis was a joke (no more so than a lot of other published data analyses, of course), and the authors of the paper made a big mistake to double down on their claims rather than accepting the helpful criticism from outside—but maybe there’s something to their idea that the name of a weather event affects how people react to it. It’s quite possible that, if there is such an effect, it goes in the opposite direction from what was claimed in that notorious article—but the point is that their statistical analyses may have jogged them into an interesting theory.

It’s the same way, I suppose, that Freud came up with and refined his theories of human nature, based on his contacts with individual patients. In this case, researchers are looking at individual datasets, but it’s the same general idea.

Anyway, here’s my point. To the extent that research of Bem, or Kanazawa, or the ovulation-and-voting people, or the himmicanes-and-hurricanes people, or whatever, has value, I think the value comes from the theories, not from the data and certainly not from whatever happens to show up as statistically significant in some power=.06 study. And, once we recognize that the value comes in the theories, it suggests that the role of the data is to throw up random numbers that will tickle the imagination of theorists. Even if they don’t realize that’s what they’re doing.

Sociologist Jeremy Freese came up with the term Columbian Inquiry to describe scientists’ search for confirmation of a vague research hypothesis: “Like brave sailors, researchers simply just point their ships at the horizon with a vague hypothesis that there’s eventually land, and perhaps they’ll have the rations and luck to get there, or perhaps not. Of course, after a long time at sea with no land in sight, sailors start to get desperate, but there’s nothing they can do. Researchers, on the other hand, have a lot more longitude—I mean, latitude—to terraform new land—I mean, publishable results—out of data . . .”

What I’ve attempted to do in the above post is, accepting that a lot of scientists do proceed via Columbian Inquiry, try to understand where this leads. What happens if you spend a 40-year scientific career using low-power studies to find support for, and modify, vague research hypotheses? What will happen is that you’ll move in a sort of directed random walk, finding one thing after another, one interaction after another (recall that we’ve looked at studies that find interactions with respect to relationship status, or weather, or parents’ socioeconomic status—but never in the same paper), but continuing to stay in the main current of your subfield. There will be a sense of progress, and maybe real progress (to the extent that the theories lead to useful insights that extend outside your subfield), even if the data aren’t quite playing the role that you think they are.

For example, Satoshi Kanazawa, despite what he might think, is not discovering anything about variation in the proportion of girl births. But, by spending years thinking of explanations for the patterns in his noisy data, he’s coming up with theory after theory, and this all fits into his big-picture understanding of human nature. Sure, he could do all this without ever seeing data at all—indeed, the data are, in reality, so noisy as to have no bearing on his theorizing—but the theories could still be valuable.

P.S. I’m making no grand claims for my own research. Much of my political science work falls in a slightly different tradition in which we attempt to identify and resolve “puzzles” or stylized facts that do not fit the current understanding. We do have some theories, I guess—Gary and I talked about “enlightened preferences” in our 1993 paper—but we’re a bit closer to the ground. Also we tend to study large effects with large datasets so I’m not so worried that we’re chasing noise.

Gigerenzer on logical rationality vs. ecological rationality

I sent my post about the political implication of behavioral economics, embodied cognition, etc., to Gerd Gigerenzer, who commented as follows:

The “half-empty” versus “half-full” explanation of the differences between Kahneman and us misses the essential point: the difference is about the nature of the glass of rationality, not the level of the water. For Kahneman, rationality is logical rationality, defined as some content-free law of logic or probability; for us, it is ecological rationality, loosely speaking, the match between a heuristic and its environment. For ecological rationality, taking into account contextual cues (the environment) is the very essence of rationality, for Kahneman it is a deviation from a logical norm and thus, a deviation from rationality. In Kahneman’s philosophy, simple heuristics could never predict better than rational models; in our research we have shown systematic less-is-more effects.

Gigerenzer pointed to his paper with Henry Brighton, “Homo Heuristicus: Why Biased Minds Make Better Inferences,” and then he continued:

Please also note that Kahneman and his followers accept rational choice theory as the norm for behavior, and so does almost all of behavioral economics. They put the blame on people, not the model.

This makes sense; in particular, the less-is-more idea seems like a good framing.
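For readers who haven’t run into a less-is-more effect, here is a toy simulation of the general phenomenon (my own sketch, not an example from Gigerenzer and Brighton’s paper): with only a handful of training cases, a unit-weight “tallying” heuristic often predicts new cases better than the full least-squares model.

```python
# Less-is-more, toy version: with tiny training samples, unit weights
# ("tallying") can out-predict OLS out of sample. Simulated data only.
import numpy as np

rng = np.random.default_rng(3)
p, n_train, n_test, n_sims = 5, 10, 1000, 500
wins = 0
for _ in range(n_sims):
    beta = np.abs(rng.normal(1, 0.3, p))               # all cues point the same way
    X_tr, X_te = rng.normal(size=(n_train, p)), rng.normal(size=(n_test, p))
    y_tr = X_tr @ beta + rng.normal(0, 2, n_train)
    y_te = X_te @ beta + rng.normal(0, 2, n_test)

    ols = np.linalg.lstsq(np.column_stack([np.ones(n_train), X_tr]), y_tr, rcond=None)[0]
    pred_ols = np.column_stack([np.ones(n_test), X_te]) @ ols
    pred_tally = X_te.sum(axis=1)                      # unit weights, nothing estimated

    wins += np.corrcoef(pred_tally, y_te)[0, 1] > np.corrcoef(pred_ols, y_te)[0, 1]

print("tallying beats OLS out of sample in", wins / n_sims, "of simulations")
```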

That said, I think some of the power of Kahneman and Tversky’s cognitive illusions, as with the visual illusions with which we are all familiar, is that there often is a shock of recognition when our intuitive, “heuristic” response is revealed, upon deeper reflection, to be incorrect.

To put it in Gigerenzer’s framework, our environment is constantly changing, and we spend much of our time in an environment that is much different than the savanna where our ancestors spent so many thousands of years.

From this perspective, rational choice is not an absolute baseline of correctness, but in many ways it works well in our modern society, which includes written records, liquid and storable money, and various other features to which this sort of rationality is well adapted.

Perhaps the most contextless email I’ve ever received

Date: February 3, 2015 at 12:55:59 PM EST
Subject: Sample Stats Question
From: ** <**@gmail.com>

Hello,
I hope all is well and trust that you are having a great day so far. I hate to bother you but I have a stats question that I need help with: How can you tell which group has the best readers when they have the following information: Group A - 130, 140, 170, 170, 190, 200, 215, 225, 240, 250
Group B - 188, 189, 193, 193, 193, 194, 194, 195, 195, 196
Group A - mean (193), median (195), mode (170)
Group B - mean (193), median (193.5), mode (193)
Why?
This is for my own personal use and understanding of this subject matter so anything you could say and redirect me would be greatly appreciated.
Any feedback that you could give me to help understand this better would be greatly appreciated.
Thanks,
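For what it’s worth, the summary statistics in the email do check out; here’s a quick sketch (mine, not the correspondent’s) that reproduces them and adds the standard deviation, which is where the two groups actually differ:

```python
# Quick check of the summary statistics quoted in the email above,
# plus the spread, which is what actually distinguishes the groups.
import statistics

group_a = [130, 140, 170, 170, 190, 200, 215, 225, 240, 250]
group_b = [188, 189, 193, 193, 193, 194, 194, 195, 195, 196]

for name, scores in [("A", group_a), ("B", group_b)]:
    print(name,
          "mean:", statistics.mean(scores),
          "median:", statistics.median(scores),
          "mode:", statistics.mode(scores),
          "sd:", round(statistics.stdev(scores), 1))
```

The means are identical; the groups differ in their spread, which is presumably the point of the exercise.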

Item-response and ideal point models

To continue from today’s class, here’s what we’ll be discussing next time:

- Estimating the direction and the magnitude of the discrimination parameters.

- How to tell when your data don’t fit the model.

- When does ideal-point modeling make a difference? Comparing ideal-point estimates to simple averages of survey responses (see the sketch after this list).
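On that last point, here is a minimal simulation sketch (my illustration, not class material): it generates binary responses from a two-parameter logistic item-response model and compares how well the plain average response and a discrimination-weighted average recover the underlying ideal points. The weighting uses the true discriminations as a crude stand-in for a fitted ideal-point model.

```python
# Simulate a 2PL item-response model and compare two summaries of the data.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_items = 500, 20

theta = rng.normal(0, 1, n_people)        # ideal points
a = rng.lognormal(0, 1, n_items)          # discrimination parameters (vary widely)
b = rng.normal(0, 1, n_items)             # item difficulties

# 2PL model: Pr(y_ij = 1) = logit^-1(a_j * (theta_i - b_j))
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
y = rng.binomial(1, p)

simple_avg = y.mean(axis=1)                  # ignores the item parameters
weighted = (y * a).sum(axis=1) / a.sum()     # oracle-weighted stand-in for an ideal-point fit

print("corr(theta, simple average):  ", np.corrcoef(theta, simple_avg)[0, 1])
print("corr(theta, weighted average):", np.corrcoef(theta, weighted)[0, 1])
```

When the discriminations are all similar, the two summaries track each other closely; when they vary a lot, the weighting starts to matter, which is one way to think about that third bullet.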

P.S. Unlike the previous post, this time I really am referring to the class we had this morning.

A message I just sent to my class

I wanted to add some context to what we talked about in class today. Part of the message I was sending was that there are some stupid things that get published and you should be careful about that: don’t necessarily believe something just cos it’s statistically significant and published in a top journal.

And, sure, that’s true, I’ve seen lots of examples of bad studies that get tons of publicity. But that shouldn’t really be the #1 point you get from my class.

This is how I want you to think about today’s class:

Consider 3 different ways in which you will be using sample surveys:

1. Conducting your own survey;

2. Performing your own analysis of existing survey data;

3. Reading and interpreting a study that was performed by others.

The key statistical message of today’s lecture was that if the underlying comparison of interest in the population (what I was calling the “effect size,” but that is somewhat misleading, as we could be talking about purely descriptive comparisons with no direct causal interpretation) is small, and if measurements are poor (high bias, high variance, or both), then it can be essentially impossible to learn anything statistical from your data.

The point of the examples I discussed is not so much that they’re dumb, but that they are settings where the underlying difference or effect in the population is small, and where measurements are noisy, or biased, or both.

What does this imply for your own work? Consider the 3 scenarios listed above:

1. If you’re conducting your own survey: Be aware of what your goal is, what you’re trying to estimate. And put lots of effort into getting valid and reliable measurements. If you’re estimating a difference which in truth is tiny, or if your measurements are crap, you’re drawing dead (as they say in poker).

2. If you’re performing your own analysis of existing survey data: Same thing. Consider what you’re estimating and how well it’s being measured. Don’t fall into the trap of thinking that something that’s statistically significant is likely to accurately represent a truth in the general population.

3. If you’re reading and interpreting a study that was performed by others: Same thing. Even if the claim does not seem foolish, think about the size of the underlying comparison or effect and how accurately it’s being estimated.

To put it another way, one thing I’m pushing against is the attitude that statistical significance is a “win.” From that perspective, it’s ok to do a noisy study of a small effect if the cost is low, because you might get lucky and get that “p less than .05.” But that is a bad attitude, because if you’re really studying a small effect with a noisy measurement, anything that happens to be statistically significant could well be in the wrong direction and is certain to be an overestimate. In the long run, finding something statistically significant in this way is not a win at all, it’s a loss in that it can waste your time and other researchers’ time.
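Here’s a small simulation sketch of that last point, with illustrative numbers of my own rather than any particular study: take a small true effect measured with a large standard error, keep only the estimates that reach p less than .05, and look at what survives.

```python
# Condition on statistical significance when the true effect is small and the
# measurement noisy, and see what the "winning" estimates look like.
# The effect size and standard error below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.3       # small true difference (arbitrary units)
se = 3.0                # standard error of each noisy study
estimates = rng.normal(true_effect, se, size=1_000_000)

significant = np.abs(estimates) > 1.96 * se     # the p < .05 "wins"
sig_est = estimates[significant]

print("share of studies reaching p < .05:  ", significant.mean())
print("share of those with the wrong sign: ", (sig_est < 0).mean())
print("average overestimation factor:      ", np.abs(sig_est).mean() / true_effect)
```

In settings like this, the significant results are a mix of wrong-signed estimates and large overestimates, which is the sense in which chasing statistical significance is a loss rather than a win.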

This is all some serious stuff to think about in a methods class, but it’s important to think a bit about the endgame.

P.S. (in case this is confusing anyone who was in class today): I wrote the above message a couple months ago. Most of the posts on this blog are on delay.

“For better or for worse, academics are fascinated by academic rankings . . .”

I was asked to comment on a forthcoming article, “Statistical Modeling of Citation Exchange Among Statistics Journals,” by Cristiano Varin, Manuela Cattelan, and David Firth.

Here’s what I wrote:

For better or for worse, academics are fascinated by academic rankings, perhaps because most of us reached our present positions through a series of tournaments, starting with course grades and standardized tests and moving through struggles for the limited resource of publication space in top journals, peer-reviewed grant funding, and finally, the unpredictable process of citation and reputation. As statisticians we are acutely aware of the failings of each step of the process, and we find ourselves torn between the desire to scrap the whole system, arXiv-style, or to reform it as suggested in the present paper. In this article, Varin, Cattelan, and Firth argue that quantitative assessment of scientific and scholarly publication is here to stay, so we might as well try to reduce the bias and variance of such assessments as much as possible.

As the above paragraph indicates, I have mixed feelings about this sort of effort and as a result I feel too paralyzed to offer any serious comments on the modeling. Instead I will offer some generic, but I hope still useful, graphics advice: Table 2 is essentially unreadable to me and is a (negative) demonstration of the principle that, just as we should not publish any sentences that we do not want to be read, we also should avoid publishing numbers that will not be of any use to a reader. Does anyone care, for example, that AoS has exactly 1663 citations? This sort of table cries out to be replaced by a graph (which it should be possible to construct taking up no more space than the original table; see Gelman, Pasarica, and Dodhia, 2002). Figure 1 violates a fundamental principle of graphics by wasting one of its axes: it follows what Wainer (2001) has called the Alabama-first ordering. Figure 2 has most of its words upside down, the result of an unfortunate choice to present a vertical display as horizontal, thus requiring me to rotate my computer 90 degrees to read it. Table 4 represents one of the more important outputs of the research being discussed, but it too is hard to read, requiring me to track different acronyms across the page. It would be so natural to display these results as a plot with one line per journal.

I will stop at this point and conclude by recognizing that these comments are trivial compared to the importance of the subject, but as noted above I was too torn by this topic to offer anything more.

And here are X’s reactions.
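As a footnote on the table-to-graph and Alabama-first points above, here is a minimal matplotlib sketch of the kind of one-line-per-journal dot plot suggested in my comment, with journals ordered by value rather than alphabetically. The journal names and citation counts are placeholders, not the paper’s data:

```python
# Dot plot with one line per journal, ordered by value rather than
# alphabetically (i.e., avoiding the "Alabama first" ordering).
# Journal names and counts are placeholders, not the paper's data.
import matplotlib.pyplot as plt

journals = ["Journal A", "Journal B", "Journal C", "Journal D", "Journal E"]
citations = [420, 1660, 950, 610, 1200]   # placeholder values

order = sorted(range(len(citations)), key=lambda i: citations[i])
labels = [journals[i] for i in order]
values = [citations[i] for i in order]

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(values, range(len(values)), "o")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("citations")
fig.tight_layout()
plt.show()
```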

Why do we communicate probability calculations so poorly, even when we know how to do it better?

Haynes Goddard writes:

I thought I would do some reading in psychology on why Bayesian probability seems so counterintuitive, which makes it difficult for many to learn and apply. Indeed, that is the finding of considerable research in psychology. It turns out that it is counterintuitive because of the way it is presented, following, no doubt, the way the textbooks are written. The theorem is usually expressed first with probabilities instead of frequencies, or “natural numbers” – counts in the binomial case.

The literature is considerable, starting at least with a seminal piece by David Eddy (1982). “Probabilistic reasoning in clinical medicine: problems and opportunities,” in Judgment under Uncertainty: Heuristics and Biases, eds D. Kahneman, P. Slovic and A. Tversky. Also much cited are Gigerenzer and Hoffrage (1995) “How to improve Bayesian reasoning without instruction: frequency formats” Psychol. Rev, and also Cosmides and Tooby, “Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty”, Cognition, 1996.

This literature has amply demonstrated that people actually can readily and accurately reason in Bayesian terms if the data are presented in frequency form, but have difficulty if the data are given as percentages or probabilities. Cosmides and Tooby argue that this is so for evolutionary reasons, and their argument seems compelling.

So taking a look at my several texts (not a random sample of course), including Andrew’s well written text, I wanted to know how many authors introduce the widely used Bayesian example of determining the posterior probability of breast cancer after a positive mammography in numerical frequency terms or counts first, then shifting to probabilities. None do, although some do provide an example in frequency terms later.

Assuming that my little convenience sample is somewhat representative, it raises the question of why the psychologists’ recommendations have not been adopted.

This is a missed opportunity, as the psychological findings indicate that the frequency approach makes Bayesian logic instantly clear, making it easier to comprehend the theorem in probability terms.

Since those little medical inference problems are very compelling, it would make the lives of a lot of students a lot easier and increase acceptance of the approach. One can only imagine how much sooner the sometimes acrimonious debates between frequentists and Bayesians would have diminished if not ended. So there is a clear lesson here for instructors and textbook writers.

Here is an uncommonly clear presentation of the breast cancer example: http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/. And there are numerous comments from beginning statistics students noting this clarity.

My response:

I agree, and in a recent introductory course I prepared, I did what you recommend and started right away with frequencies, Gigerenzer-style.

Why has it taken us so long to do this? I dunno, force of habit, I guess? I am actually pretty proud of chapter 1 of BDA (especially in the 3rd edition with its new spell-checking example, but even all the way back to the 1st edition in 1995) in that we treat probability as a quantity that can be measured empirically, and we avoid what I see as the flaw of seeking a single foundational justification for probability. Probability is a mathematical model with many different applications, including frequencies, prediction, betting, etc. There’s no reason to think of any one of these applications as uniquely fundamental.

But, yeah, I agree it would be better to start with the frequency calculations: instead of “1% probability,” talk about 10 cases out of 1000, etc.
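To make that concrete, here is a small sketch of the mammography example done both ways. The rates are the usual illustrative ones for this example (about 1% prevalence, 80% sensitivity, roughly a 10% false-positive rate), not numbers from any of the texts discussed above:

```python
# The same Bayes calculation two ways: natural frequencies vs. probabilities.
# The rates are the standard illustrative ones for this example.

# Natural-frequency version: imagine 1000 women.
n = 1000
with_cancer = 10                                    # 1% prevalence -> 10 of 1000
true_positives = 8                                  # 80% of those 10 test positive
false_positives = round(0.10 * (n - with_cancer))   # ~10% of the 990 healthy women
print("frequency version:  ", true_positives / (true_positives + false_positives))

# Probability version: Bayes' rule with the same rates.
prev, sens, fpr = 0.01, 0.80, 0.10
posterior = sens * prev / (sens * prev + fpr * (1 - prev))
print("probability version:", posterior)
```

Both give the same answer (a positive test moves the probability to somewhere under 10%), but the frequency version is the one that students tend to find immediately clear.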

P.S. It’s funny that Goddard cited a paper by Cosmides and Tooby, as they’re coauthors on that notorious fat-arms-and-political-attitudes paper, a recent gem in the garden-of-forking-paths, power=.06 genre. Nobody’s perfect, I guess. In particular, it’s certainly possible for people to do good research on the teaching and understanding of statistics, even while being confused about some key statistical principles themselves. And even the legendary Kahneman has been known, on occasion, to overstate the strength of statistical evidence.

“Another bad chart for you to criticize”

Perhaps in response to my lament, “People used to send me ugly graphs, now I get these things,” Stuart Buck sends me an email with the above comment and a link to this “Graphic of the day” produced by some uncredited designer at Thomson Reuters:

[Image: Thomson Reuters “Graphic of the day” showing nationalities of Nobel Prize winners in a circular, tentacle-like display]

From a statistical perspective, this graph is a disaster in that the circular presentation destroys the two-way structure (countries x topics) which has to be central to any understanding of these data. In addition, to the extent that you’d want to get something out of the graph, you’ll end up having to perform mental divisions of line widths.

At this point I’d usually say something like: On the plus side, this is a thought-provoking display (given its tentacle-like appearance, one might even call it “grabby”) that draws viewers’ attention to the subject matter. But I can’t really even say that, because the subject of the graph—nationalities of Nobel Prize winners—is one of the more overexposed topics out there, and really the last thing we need is one more display of these numbers. Probably the only thing we need less of is further analysis of the Titanic survivors data. (Sorry, Bruno: 5 papers on that is enough!)
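For what it’s worth, here is a minimal sketch of one display that keeps the two-way country-by-category structure (a simple heatmap; a grid of dot plots would work too). The counts are placeholders, not the actual Nobel data:

```python
# One way to keep the two-way (country x category) structure: a heatmap.
# Country names and counts are placeholders, not the actual Nobel data.
import numpy as np
import matplotlib.pyplot as plt

countries = ["Country A", "Country B", "Country C", "Country D"]
categories = ["Physics", "Chemistry", "Medicine", "Literature", "Peace", "Economics"]
counts = np.array([[9, 7, 8, 3, 2, 5],    # placeholder values
                   [4, 5, 6, 2, 1, 1],
                   [3, 2, 2, 4, 3, 0],
                   [1, 1, 2, 1, 2, 1]])

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(counts, cmap="Blues")
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(categories, rotation=45, ha="right")
ax.set_yticks(range(len(countries)))
ax.set_yticklabels(countries)
fig.colorbar(im, ax=ax, label="laureates")
fig.tight_layout()
plt.show()
```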

Another stylized fact bites the dust

According to economist Henry Farber (link from Dan Goldstein):

In a seminal paper, Camerer, Babcock, Loewenstein, and Thaler (1997) find that the wage elasticity of daily hours of work of New York City (NYC) taxi drivers is negative and conclude that their labor supply behavior is consistent with target earning (having reference dependent preferences). I replicate and extend the CBLT analysis using data from all trips taken in all taxi cabs in NYC for the five years from 2009 to 2013. The overall pattern in my data is clear: drivers tend to respond positively to unanticipated as well as anticipated increases in earnings opportunities. This is consistent with the neoclassical optimizing model of labor supply and does not support the reference dependent preferences model.

I explore heterogeneity across drivers in their labor supply elasticities and consider whether new drivers differ from more experienced drivers in their behavior. I find substantial heterogeneity across drivers in their elasticities, but the estimated elasticities are generally positive and only rarely substantially negative. I also find that new drivers with smaller elasticities are more likely to exit the industry while drivers who remain learn quickly to be better optimizers (have positive labor supply elasticities that grow with experience).

It’s good to get that one out of the way.
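For readers who want a picture of the estimand, here is a minimal sketch of a daily labor-supply elasticity regression, run on simulated driver-days rather than Farber’s data: the slope of log hours on log wage is the elasticity, and a positive slope is the neoclassical pattern described above.

```python
# Elasticity of daily hours with respect to the daily wage, estimated as the
# OLS slope in a log-log regression. Simulated driver-days, not Farber's data.
import numpy as np

rng = np.random.default_rng(2)
n_days = 5000
log_wage = rng.normal(np.log(30), 0.2, n_days)     # simulated hourly earnings rates
true_elasticity = 0.4                              # positive response, chosen arbitrarily
log_hours = 2.0 + true_elasticity * log_wage + rng.normal(0, 0.3, n_days)

X = np.column_stack([np.ones(n_days), log_wage])
beta = np.linalg.lstsq(X, log_hours, rcond=None)[0]
print("estimated elasticity:", beta[1])            # should be near 0.4
```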