Discussion with Steven Pinker on research that is attached to data that are so noisy as to be essentially uninformative

I pointed Steven Pinker to my post, How much time (if any) should we spend criticizing research that’s fraudulent, crappy, or just plain pointless?, and he responded:

Clearly it *is* important to call out publicized research whose conclusions are likely to be false. The only danger is that it’s so easy and fun to criticize, with all the perks of intellectual and moral superiority for so little cost, that there is a moral hazard to go overboard and become a professional slasher and snarker. (That’s a common phenomenon among literary critics, especially in the UK.) There’s also the risk of altering the incentive structure for innovative research, so that researchers stick to the safest kinds of paradigm-twiddling. I think these two considerations were what my late colleague Dan Wegner had in mind when he made the bumbler-pointer contrast — he himself was certainly a discerning critic of social science research. [Just to clarify: Wegner is the person who talked about bumblers and pointers but he was not the person who sent me the email characterizing these as “our only choices in life.”—AG.]

The other comment is that I don’t think that evolutionary psychology is a worse offender at noise-mining than social psychology in general. Quite the contrary, the requirement that a psychological mechanism enhance reproductive success in a pre-modern environment at least imposes a modicum of aprioricity on hypotheses, which is entirely lacking in non-evolutionary (and defiantly atheoretical) social psychology. The worry that you can spin scientifically respectable evolutionary hypotheses post hoc for any finding is, in my view, greatly exaggerated. The Griskevicius finding may be wrong, for all the usual reasons, but the hypothesis is well motivated by prior theory and research.

To which I replied:

I think there are 3 things going on:

1. The science. As Lakatos and other philosophers of science have emphasized, any real scientific theory will make all sorts of predictions. The mapping of theory to prediction is a messy and necessary part of science. So a theory can be valid even if it is difficult to test; indeed, part of the reason for testing a theory is often not to confirm or dispute the theory’s validity but to refine the theory.

2. Data collection. The studies by Griskevicius etc. have an extremely low ratio of signal to noise. Variability is high, measurements are crude, comparisons are performed between subjects, and this is all with a background of small effects that vary in sign and magnitude. As a result, the studies provide essentially zero information about the theory.

3. Multiple comparisons. The reason that multiple comparisons come in is to explain how it is that researchers such as Bem, Griskevicius, etc., manage to consistently find statistical significance (typically, many statistically significant comparisons in a single study) even though their noise level is so high. Multiple comparisons is the answer, and the point of our garden of forking paths paper is to explain how this problem can arise even for studies that are well motivated by substantive theory.
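
As a minimal simulation sketch of point 3 (every number here is invented; real forking paths need not involve explicitly running many tests, but the explicit multiple-comparisons version already shows the pattern):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group, n_outcomes = 10_000, 50, 5   # all invented design numbers

hits = 0
for _ in range(n_studies):
    # every outcome is pure noise: the true effect is exactly zero
    treatment = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    pvals = stats.ttest_ind(treatment, control, axis=1).pvalue
    hits += (pvals < 0.05).any()

print(f"share of null studies with at least one p < 0.05: {hits / n_studies:.0%}")
# roughly 1 - 0.95**5, i.e. about 23%, before any flexibility in coding rules,
# exclusions, subgroups, or interactions enters the picture
```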

In short, my claim is not that the theories of Griskevicius etc. are wrong (about that, I have no idea) and my central criticism of them is not data-mining and multiple comparisons. Rather, my problem is that the study design is such that the data provide essentially no information about the science. I’d have no problem with the theory being presented as such; my problem is with the incorrect (in this case) claims that the data add anything to the story.

Regarding the incentive structure, I fear that the current lack of incentives to criticize serves to offer an incentive for researchers to do small noisy studies which they can then sometimes publish in places such as Psychological Science. I would love if the incentives were to change so that researchers would put more effort into careful measurement and design!

34 thoughts on “Discussion with Steven Pinker on research that is attached to data that are so noisy as to be essentially uninformative”

  1. I understand the concern about noisy data, but I also used to work in a field (spectrum analysis and the like, when I was a practicing electrical engineer) where very low signal-to-noise ratios were the norm for detecting signals. For example, there was an old spectrum analyzer that had some 100–110 dB of spurious-free dynamic range, as I recall, and there was a phase noise measurement system with a noise floor of perhaps -150 dB or better. Or think of your simple AM radio receiver, if you still have one. I don’t have the figures at hand, but it’s still picking signals (music, talk) out of a fairly high-noise environment.

    That’s a far lower S/N than what you’ve been discussing, I think. Admittedly, those were state-of-the-art instruments, at least at the time, and these measurements had the advantage of more data than you have in these experiments and the ability to use narrow detection bandwidths.

    But is there a way to refine your statement about S/N, or is my lack of a priori fear of low S/N always misplaced?

    • I suspect that the signal-to-noise that Andrew and Steven are referring to does not translate to SNR that Bill is familiar with. I’m a statistician, but was trained by an electrical engineer / statistician who does signal processing. SNR in a time series, or specifically as seen on a spectrum analyzer, is a far cry from the “signal to noise” (colloquial) that might be referred to in a common statistical problem.

      In particular, the “signal” that is being sought in a common social science setting does _not_ have any of the nice properties that a signal might have in a physics or engineering problem. If I’m looking for a particular tone to tune my receiver to, it stands out against the background simply because it’s coherent, and the background isn’t. Applying a harmonic F-test to a sufficiently long time series can reveal signals at significance levels well beyond 1 – 1e-6 that would otherwise be almost impossible to detect. In social sciences, you essentially have isolated points, not temporally related near-continuous data (to the level of your discretization). Imagine taking 24 instantaneous power samples from a radio broadcast and attempting to reconstruct the music that’s currently playing. Even if the “music” were a very, very simple melody, you wouldn’t have any chance whatsoever of reconstructing the signal, because you just don’t have enough data.

      tl;dr SNR in EE/signal processing is not the same as SNR used colloquially in a statistics problem, especially one with extremely small sample sizes.
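
      As a toy illustration of that contrast (a sketch with invented numbers, not any real application): a weak but coherent tone is easy to pull out of a long noisy record with a plain periodogram, and hopeless to find in a couple dozen isolated samples.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      fs, f0, amp, noise_sd = 1000.0, 123.0, 0.1, 1.0   # tone power ~ -23 dB per sample

      def loudest_frequency(n):
          t = np.arange(n) / fs
          x = amp * np.sin(2 * np.pi * f0 * t) + rng.normal(0, noise_sd, n)
          spectrum = np.abs(np.fft.rfft(x)) ** 2        # periodogram
          freqs = np.fft.rfftfreq(n, d=1 / fs)
          return freqs[np.argmax(spectrum[1:]) + 1]     # skip the DC bin

      print(loudest_frequency(100_000))   # ~123.0 Hz: coherence plus a long record wins
      print(loudest_frequency(24))        # essentially a random bin: too little data
      ```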

      • In other words, when Electrical Engineers use the term SNR, they have a specific, quantitative definition in mind. But in a social science setting we embrace vagueness and use SNR only in a rough hand-waving sense?

        e.g. If one asked “What exactly is the SNR in those studies by Griskevicius?” there’s no good answer?

        • Rahul:

          I’m sure that the term “signal to noise ratio” has different meanings in different settings, but I’m referring to the scale of variation of the signal (basically, the size of the treatment effect) divided by the scale of variation of the noise (which, in some contexts, will refer to the unexplained sd of the data, and in other contexts will refer to the “standard error,” i.e. the uncertainty in the estimate, which has that 1/sqrt(n) term). In Bill’s case, the point is that n is huge. You can make out a radio signal in the presence of lots of noise because you’re observing it for thousands or millions of periods (“kilohertz,” “megahertz,” etc.). This is a lot different from a study with n=100 where the estimate is 2.1 standard errors from zero.

          So, no, in social science we don’t “embrace vagueness” so much as recognize that we have uncertainty. We avoid spurious precision. This is not hand-waving.

        • So when you say that “the studies by Griskevicius etc. have an extremely low SNR”, do you have a specific number in mind?

          When multiple social scientists are asked this same question will they give an approximately similar answer for that SNR number?

          cf. Note that Bill Harris, by habit, quoted decibel ranges for the SNR of the application he was describing.

        • Rahul:

          No, I don’t have a specific number because I don’t know the exact size of the signal. But I know the signal is small. For example, a possible (but on the high end) estimate of the effect size is 0.01. The outcome is binary, which corresponds to an sd of 1/2. This corresponds to a signal-to-noise ratio for one observation of order of magnitude 1/50, which would give you a signal-to-noise ratio of 1 if (a) you have a sample size of 2500, and (b) the effect size does not itself vary.

          See? That wasn’t too bad. Social scientists can multiply and divide too. We can even take square roots and, on occasion, do logarithms.
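
          Spelling that arithmetic out as a sketch (the 0.01 effect size and the binary outcome are the hypotheticals from the comment above, not estimates from any particular study):

          ```python
          effect = 0.01                      # hypothetical high-end effect size
          sd = 0.5                           # the sd of a binary outcome is at most 1/2

          snr_per_observation = effect / sd  # 0.02, i.e. roughly 1/50
          n_for_snr_of_1 = (sd / effect)**2  # solve effect = sd / sqrt(n) for n
          print(snr_per_observation, n_for_snr_of_1)  # 0.02  2500.0
          ```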

        • Thanks for your clarification. I was initially thinking in terms of R-squared (although I realize that isn’t a great measure, and pseudo-R-squared isn’t always that valid).

          But say we have a new area of study with few refined theories, and we end up with an analysis that finds a significant, but very small, effect for a variable we hadn’t realized mattered. Would the novelty excuse the low SNR, provided we caveat that further study and refinement are needed? Or would the novelty make the finding suspect, given the data issues?

      • On a related note, my work involves modelling infectious diseases in poor countries. I have always thought of weather forecasting as having a lot of characteristics to aspire to (well, maybe after Nate Silver’s book anyway). However, I now think there might be less to gain from the comparison than I originally thought, given the huge gulf in the volume and quality of information available to test/refine/calibrate models. While there are general lessons to learn, the tasks are very different.

        • Good points, all. Yes, classic spectrum analysis does benefit from very high n, and I know that social science research often has very low n, but I wonder if there are cases in the middle that are useful to consider in both (or multiple) fields. For example, I presume that sonar signal processing doesn’t deal in overly large n; I presume one vessel trying to detect another without giving up too much information in the process is trying to extract everything it can from a very small number (1 to few) of measurements (‘pings’).

          In those cases, does it offer any benefit to think of models that work better for various values of n or of S/N? To Nick’s question (and simplifying tremendously, as that’s not my field), I understand that it is hard to model infectious disease progression. Still, does fitting some variant of an SIR model to small data let you estimate the SIR parameters well enough to understand usefully what’s going on, even if the residual SD is pretty big due to stuff you don’t really care about? In a way, isn’t that what a spectrum analyzer or radio does? Both benefit to a degree from high n but to an even greater degree from narrow front-end bandwidths. You can listen to a small station at 1320 kHz because the front end of your radio largely excludes all the other stations over the AM band. The overall S/N is tiny (one small station out of many tens of stations, some huge), but at the point the analysis (the signal processing or detection) is done, much of the “noise” (signals from other stations, in this case) has been stripped away (or, better, strongly attenuated). That does require well-thought-out and sometimes very creative models, and it requires some thought to make sure you’re not throwing away stuff you want to see.

          Is Shannon information theory another way (besides power) to think about how much insight (information) you can pull out of (what little) data you have? For example, if you’re trying to estimate a parameter and view it as a one-shot experiment, then is power better? If you view the same problem as taking data until you get a sufficiently good estimate, does some mixture of information theory and Bayesian sequential analysis offer any benefit? From http://en.wikipedia.org/wiki/Bayesian_experimental_design and perhaps http://ilab.usc.edu/publications/doc/Baldi_Itti10nn.pdf (more for my reading list), it sounds as if others have looked in that general direction.
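
          One way to make the information-theory question concrete is a toy normal-normal calculation (a sketch with invented numbers, not tied to any study discussed here): the expected information gain about a parameter is the expected drop in posterior entropy, and with noisy measurements it grows only logarithmically once n is moderately large.

          ```python
          import numpy as np

          def expected_info_gain(n, prior_sd=1.0, noise_sd=10.0):
              # mutual information between data and parameter for a normal mean with
              # known noise sd and conjugate normal prior: 0.5 * log(prior_var / post_var)
              post_var = 1.0 / (1.0 / prior_sd**2 + n / noise_sd**2)
              return 0.5 * np.log(prior_sd**2 / post_var)

          for n in (10, 100, 1000, 10000):
              print(n, round(float(expected_info_gain(n)), 2))
          # 10 -> 0.05, 100 -> 0.35, 1000 -> 1.2, 10000 -> 2.31 nats; once n is large,
          # each further tenfold increase adds only about 0.5 * ln(10) ~ 1.15 nats
          ```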

        • Hi Bill, what you describe — fitting a variant of an SIR model to small data — is basically what is done in a lot of cases. I think there are two main downsides:

          (1) Lack of opportunities for model improvement. If we only have small data we miss opportunities to see where our current modelling approaches are inadequate. This is probably more important as our goal is out-of-sample prediction (i.e. we fit to historical events but we care about future events).

          (2) Accuracy matters — even if our estimates are unbiased on expectation, they will be a better aid to policy making if they are more accurate.
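
          For concreteness, a minimal sketch of the kind of exercise being described (a basic SIR model, simulated data, and a least-squares fit; illustrative only, not anyone's actual analysis):

          ```python
          import numpy as np
          from scipy.integrate import odeint
          from scipy.optimize import curve_fit

          rng = np.random.default_rng(2)
          N = 1000.0                 # hypothetical closed population
          t_obs = np.arange(15.0)    # just 15 weekly prevalence counts ("small data")

          def sir_infected(t, beta, gamma, i0=5.0):
              def deriv(y, t):
                  S, I, R = y
                  return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]
              return odeint(deriv, [N - i0, i0, 0.0], t)[:, 1]

          cases_obs = rng.poisson(sir_infected(t_obs, 0.9, 0.5))  # "true" beta=0.9, gamma=0.5

          (beta_hat, gamma_hat), cov = curve_fit(sir_infected, t_obs, cases_obs, p0=[1.0, 0.4])
          print("beta", beta_hat, "gamma", gamma_hat, "R0", beta_hat / gamma_hat)
          print("rough standard errors:", np.sqrt(np.diag(cov)))
          # point estimates are usually in the ballpark, but with so few points the
          # standard errors (hence the uncertainty in R0) stay wide -- Nick's point (2)
          ```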

        • Thanks for the link, K?. I’m finding it very interesting reading so far.

          Nick, thanks for your follow-up. Point 1 with small data seems at least somewhat related to the premises of the article K? recommended. BTW, I first encountered SIR models in John Sterman’s /Business Dynamics/, where he starts with disease models and quickly moves to similar models of the adoption of new products or new ideas.

          Point 2: I’ll certainly grant that accuracy is important if prediction is your goal. A fair amount of my modeling comes from the system dynamics perspective (see the Sterman book), where another valid purpose can be solving a problem or designing a policy. For a simple example, see http://onlinelibrary.wiley.com/doi/10.1002/npr.4040180306/abstract (paywall, unfortunately). I used a very simple ODE model to make sense of an expense management problem. While it couldn’t predict spending well at all, it could predict the dynamics of the problem (oscillations or business cycles) and make it easy to see why the problem existed. Furthermore, it was easy to change information flows in the model such that the problem went away. When we made the analogous changes in the real world, we eliminated better than 95% of the problem, which was deemed more than sufficient. In that case, the ability to make accurate predictions wouldn’t have helped; the manager saw the ODE feedback structure on about slide 3 of about 30 in my slide deck, understood it (he had been a practicing engineer well acquainted with feedback systems), and said, “Stop there, and go fix it.” When I noted I had more slides, he said I didn’t need them.

        • Something about Rahul’s, Nick’s, and K?’s comments made me think about my thought processes on that example, and maybe that’s useful for thinking about inference for policy decisions.

          The expense management data was a time series that, when graphed, showed some months of ~20% overspending followed by about the same of underspending. I created a causal ODE model that replicated the basic behavior (oscillations), but the simulation didn’t follow all the fine detail in the data.

          Instead of worrying about the discrepancies, I think I tacitly thought of the data as a periodic signal and the detail as possibly modelable as frequency harmonics. I could have worried about the causal structure that might have created the nonlinearities that caused those harmonics, but I instead looked at small modifications to the structure that would eliminate that mode of operation (that particular feedback-caused oscillation).

          If most or all of the fine detail were harmonics of the main signal, then eliminating the main signal would also eliminate the harmonics. That’s what happened in the simulation. In the real world, the new system cut only somewhat more than 95% of the oscillation. I no longer have the data, so I don’t know if the remaining spending variances were due to residual oscillation or not, but I suspect it may have been simple random noise.

          So I can see that accuracy is important to someone with a proposed policy to improve a problem by 20% or so. If you see a way to eliminate the problem, then accuracy may not be much of a concern, for the dynamics of the new situation may be little related to the old. This is in a time-series context, and I don’t have a quick analogous idea for a non-time-series problem.

          The model is described a bit better at https://web.archive.org/web/20080511171348/http://facilitatedsystems.com/expmgmnt.pdf.
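
          For readers who have not seen this style of model, here is a generic sketch of the mechanism (not the actual expense model; every constant is invented): spending decisions based on a delayed perception of spending oscillate, and shortening the reporting delay, i.e. changing the information flow, removes the oscillation rather than predicting it.

          ```python
          import numpy as np
          from scipy.integrate import odeint

          budget = 100.0       # target monthly spend (invented)
          adjust_time = 0.5    # months taken to correct a perceived gap (invented)

          def model(y, t, report_delay):
              spend, perceived = y
              d_spend = (budget - perceived) / adjust_time      # react to the perceived gap
              d_perceived = (spend - perceived) / report_delay  # ...but perception lags reality
              return [d_spend, d_perceived]

          t = np.linspace(0, 36, 361)                                   # three years, in months
          slow = odeint(model, [120.0, 100.0], t, args=(3.0,))[:, 0]    # quarterly-ish reporting
          fast = odeint(model, [120.0, 100.0], t, args=(0.25,))[:, 0]   # weekly-ish reporting
          print("slow reporting: spend still swings between",
                round(slow[60:].min(), 1), "and", round(slow[60:].max(), 1))
          print("fast reporting: spend settles near", round(fast[-1], 1))
          # the oscillation is a property of the feedback structure (acting on stale
          # information), so changing the information flow eliminates it
          ```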

  2. A comment about uncreative research: “safest kinds of paradigm-twiddling”

    Most of the time, the work that I see which invokes biased, confounded, nasty data is in fact the ‘safest kinds’ of research. Chasing ‘low-hanging fruit’ in pursuit of scientific ‘link bait’ (i.e. ‘rising temperatures will cause this animal to no longer live in its present habitat!’, ‘confronted by unpleasant questions, people get upset and don’t perform well!’, ‘dangerous, nasty bacteria (or their distant relatives) everywhere around your house!’, ‘this daily food causes CANCER!’) is often accompanied by poor study design, questionable data, poor statistical analysis, and every other variety of scientific malady.

    But, as you say, there is little reward for pointing these things out in study after study.

    Take for instance the latest Gallup poll (http://www.gallup.com/poll/168848/life-college-matters-life-college.aspx) that was used to conclude things about education and happiness. Rather than carefully selecting people who applied and were accepted but chose not to go to school X, going instead to school Y; or groups of former students from particular backgrounds who had such-and-so GPA but reported the teachers as unfriendly or distant vs warm and caring… the study called tens of thousands of essentially random people (both with landlines and cell phones, to control for that apparently serious confounder!) and then asked them a panel of multiple choice self-reporting questions; applied a basic model to the answers, and released the low p-value answers to the press. Hot off the presses! Great universities don’t make you happy! Save your money (because apparently the money will make you happy… oh, wait…)!

    In short, stringent critique won’t force scientists to only do ‘safe’ science. It would actually shift the balance towards better, more exciting science.

  3. Dear Andrew:

    Please post a link to your Slate review of Nicholas Wade’s “A Troublesome Inheritance.” The Slate comments section is pretty lowbrow, so it would be more fun to discuss it here.

    Steve

    • Good idea. We have the same problem with our posts on the sister blog, now that it’s moved to the Washington Post: the comment section becomes a place to vent rather than discuss.

    • Yes! I love the opening line: “The paradox of racism is that at any given moment, the racism of the day seems reasonable and very possibly true, but the racism of the past always seems so ridiculous.”

      • Until you actually read the much-denounced works of the past, like Arthur Jensen’s 1969 meta-analysis of race and IQ in the Harvard Education Review, and realize the reason you always hear so much about how horrible they were is because they were disturbingly plausible at the time and disturbingly accurate today.

  4. One thing that potentially adds to the noise, though it is perhaps subsumed in the discussion, is that many voters might not know what the candidates’ positions on social issues are. Bawn et al. (2012) reported that 54% of voters actually knew that Bush wanted tighter restrictions on abortion than Gore. Dr. Gelman has highlighted how stable individuals are in their responses to surveys over a campaign. Sides and Vavreck (2013) and Erikson and Wlezien (2014) have mentioned how true this was in 2012. In the former, 92% of voters did not change over the summer and 95% held the same preferences from December 2011 to election eve 2012. Erikson and Wlezien reported that in the ANES, 98% of voters held the same preference from after the conventions to election day. In the latter case, the authors mentioned that historically and in 2012, voters who switch tend to be less informed. That would complicate shifts in vote choice through the pathway that the authors hypothesize. Regressions have also provided mixed evidence on impact. Alan Abramowitz had a study, “It’s abortion, stupid.” But Larry Bartels found that working-class voters tended to weigh economic issues much more in voting, and Hillygus and Shields found foreign policy, the economy, and partisanship all tended to have more impact than social issues in multivariate logistic regression.

  5. If I could step back for a second to take in a larger view, much of the problem with social psychology in the 21st century is that it discovered that there was money to be made by becoming a branch of marketing research while still maintaining the pretensions of a science. (I suspect Malcolm Gladwell’s 2000 bestseller “The Tipping Point” was a, uh, tipping point in this evolution.)

    The selling point of social psychology is that it’s a Science and therefore, goes the unstated but implied assumption, any experimental result social psychologists come up with about how to manipulate college students is Science and therefore part of the Unchanging Laws of the Nature of the Universe.

    In contrast, the basic assumption of marketing researchers is: We can figure out for you what’s working right at the moment to manipulate consumers, but, hey, this isn’t the Law of Gravity so whatever works now will probably stop working soon as shoppers get bored by it. So, you’ll have to come back and hire us again next year to tell you what those crazy kids have gotten into next.

    • Let’s look at Gladwell’s Books to see what he thinks the market wants:

      The Tipping Point (2000): How to win at the game of viral fashion. Yes, very closely related to marketing research. But also, the science you can use to become a giant success.

      Blink (2005): How successful people make brilliant decisions by relying on their gut instincts. Um, businessman fantasy snake-oil.

      Outliers (2008): How almost anyone – maybe you or your kids too! – can be successful through extreme dedication. See above snake-oil.

      What the Dog Saw And Other Adventures (2009): A hodge-podge of his New Yorker articles, but a common theme is seemingly ordinary guys who are actually amazingly successful, and some dismissive writing on intelligence. You too can be a late-bloomer and succeed beyond your wildest dreams with some luck and enough guile and determination.

      David and Goliath (2013): The conventional wisdom is wrong about which formative experiences are privileges and which are disabilities when it comes to achieving success in adulthood; it’s in the fires of the crucible of challenge that heroes are forged. So maybe you and your kids can turn what you think are your failures into successes.

      David and Goliath received the worst, most critical reviews by far (and deserved them). John Gray titled his New Republic review, “Malcolm Gladwell Is America’s Best-Paid Fairy-Tale Writer”. Ouch.

      But if we’re looking for a theme in his TED-pop social psychology, we see a Scienced-up version of a Tony Robbins motivational seminar, where the ultimate confirmation bias of wishful thinking generates his target audience’s enthusiasm.

      There’s a huge market demand for someone who can tell you that you too, ordinary guy, can change your life and become incredibly successful, even without the talent, driven work-ethic, and advantages of most actually successful people. That you might be one of the lucky ones, and maybe make some of your own luck, because New Science!

      If it were about marketing, I think Gladwell would have stuck with questions of fashion and consumer psychology. But no, the train he’s conducting is all about success fantasies and, as Gray says, fairy tales.

      • I came across Gladwell’s 2001 “The Mosquito Killer” the other day (http://gladwell.com/the-mosquito-killer/), a biography of malaria/mosquito control giant Fred Soper, and was surprisingly impressed by it (I have many of the same issues with recent Gladwell work as others on this list). Some combination of (1) pre-shark-jumping Gladwell, (2) agreement with my prejudices (there’s a quite thoughtful discussion of the need for nuance in attacking complex problems), (3) lack of subject-area knowledge on my part?

        • All Gladwell needs to be a benefit to humanity is a research assistant who is better at coming up with reality checks than he is. Gladwell’s problem is that he doesn’t have a skeptical bone in his body, so he takes various academics’ press releases and gins them up into articles that are extraordinarily persuasive to 110 IQ frequent fliers. And some of his articles are even right.

  6. Andrew: “I would love if the incentives were to change so that researchers would put more effort into careful measurement and design!”

    +1

  7. Pingback: A week of links - Evolving Economics

  8. The “Garden of Forking Paths” paper is absolutely excellent. One interesting thing (to me) is that it applies far more widely than to the social sciences. I particularly like this passage, which is deep (I don’t consider it a criticism of frequentism; I consider it an elucidation of frequentism): “this is the (somewhat paradoxical) nature of frequentist reasoning: if you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data”

