Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

Alan Sokal writes:

We know perfectly well that our politicians (or at least some of them) lie to us; we take it for granted; we are inured to it. And that may be precisely the problem. Perhaps we have become so inured to political lies — so hard-headedly cynical — that we have lost our ability to become appropriately outraged. We have lost our ability to call a spade a spade, a lie a lie, a fraud a fraud. Instead we call it “spin”.

We have now travelled a long way from “science,” understood narrowly as physics, chemistry, biology and the like. But the whole point is that any such narrow definition of science is misguided. We live in a single real world; the administrative divisions used for convenience in our universities do not in fact correspond to any natural philosophical boundaries. It makes no sense to use one set of standards of evidence in physics, chemistry and biology, and then suddenly relax your standards when it comes to medicine, religion or politics. Lest this sound to you like a scientist’s imperialism, I want to stress that it is exactly the contrary. . . .

The bottom line is that science is not merely a bag of clever tricks that turn out to be useful in investigating some arcane questions about the inanimate and biological worlds. Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview, centered on the modest insistence that empirical claims must be substantiated by empirical evidence. [emphasis added]

Well put.

Sokal continues:

Conversely, the philosophical lessons learned from four centuries of work in the natural sciences can be of real value — if properly understood — in other domains of human life. Of course, I am not suggesting that historians or policy-makers should use exactly the same methods as physicists — that would be absurd. But neither do biologists use precisely the same methods as physicists; nor, for that matter, do biochemists use the same methods as ecologists, or solid-state physicists as elementary-particle physicists. The detailed methods of inquiry must of course be adapted to the subject matter at hand. What remains unchanged in all areas of life, however, is the underlying philosophy: namely, to constrain our theories as strongly as possible by empirical evidence, and to modify or reject those theories that fail to conform to the evidence. That is what I mean by the scientific worldview.

And then he discusses criticism:

The affirmative side of science, consisting of its well-verified claims about the physical and biological world, may be what first springs to mind when people think about “science”; but it is the critical and skeptical side of science that is the most profound, and the most intellectually subversive. The scientific worldview inevitably comes into conflict with all non-scientific modes of thought that make purportedly factual claims about the world.

He might also discuss certain pseudo-scientific modes of thought: methods that follow the outward forms of science but lack its critical element. I’m thinking in particular of what we’ve been calling “Psychological Science”-style work, in which a researcher manages to find a statistically significant p-value and uses this to make an affirmative claim about the world. This is not so much a “non-scientific mode of thought” as a scientific mode of thought that doesn’t work.

P.S. Here are the links to all three parts of Sokal’s essay:
What is science and why should we care? — Part I
What is science and why should we care? — Part II
What is science and why should we care? — Part III
As they say, read the whole thing.

38 thoughts on “Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview””

  1. Cause And Effect: The Revolutionary New Statistical Test That Can Tease Them Apart, 12/20/14

    https://medium.com/the-physics-arxiv-blog/cause-and-effect-the-revolutionary-new-statistical-test-that-can-tease-them-apart-ed84a988e

    and
    Distinguishing cause from effect using observational data: methods and benchmarks
    arxiv.org/abs/1412.3773

    Please let me know what you think about this research, and how long it might be before we can go beyond 2 variables?

    Dwight Hines

        • Rahul:

          No, I don’t think the paper is a dud; I think it’s cool. The point of my comment above is that I think its coolness should be separated from the hype. I don’t think such methods will ever be able to discover or identify causes in problems where causation is difficult to identify (for example, psychiatric symptoms, or classic social science problems such as policing and crime, or inflation and unemployment), but they could well be useful in a sort of data-mining sense of scanning through data looking for dependence patterns.

          One could draw an analogy to regression. Regression is a great tool for causal inference, but if you just throw regression at problems you can have trouble, especially when applied to difficult and contentious problems. Similarly for this method: it’s a particular sort of probability model, and there are settings where it makes sense and settings where it doesn’t.

        • When the paper is titled “Distinguishing cause from effect using observational data” and you say “I don’t think such methods will ever be able to discover or identify causes in problems where causation is difficult to identify,” then I don’t see how you could be saying it is not a dud.

          And it is a dud; the claim is ridiculous. Even the way the “simple” 2-variable relationships are described in the blog post leads me to believe they aren’t thinking about causation clearly. Altitude “causing” temperature as an example of causation? Come on.

        • Anon:

          Something can be hyped but still be cool and potentially valuable. Suppose someone were to invent multiple regression and call it, “A method for untangling causal effects from observational data.” Such a claim would be hype—but regression, when carefully interpreted, can indeed be used for this purpose! So I don’t want to say the paper is a dud, just cos their method can’t do everything they think it can.

      • As soon as I see “obviously” I’m uncomfortable. Why wouldn’t snow on the ground possibly influence temperature? Really I don’t know, but it doesn’t seem obvious to me.

    • How many real-world problems involve the question, “Does X cause Y or does Y cause X”? In the real world it’s usually more like (or even more complicated than), “X and Y affect each other” or “Z affects both X and Y” or “Z affects Y directly, but also indirectly through X, and there is also an effect of Y on Z,” …

      • Well, I don’t understand the details behind their approach, but is there reason to think it cannot be scaled beyond the simple “Does X cause Y or does Y cause X” question?

        • The higher you go in dimensions, the more ways your noise can appear to “move” your signal. High-dimensional systems are weird; for example, the hypervolume of a high-dimensional sphere sort of fizzles out into nothingness as the dimension increases. So if your plan is to show that noise in X causes wiggles in Y but not the other way around… ok, but if your plan is to find noise in 10 dimensions and figure out which of those 10 dimensions has causal influence on Y, while allowing any causal DAG over the 10 dimensions… good luck to you.

          Furthermore, from the high-level description of this method, it seems to rely on noise in the variables being “real” as opposed to “measurement” noise, so you’re going to need precision measurements of your variables, so that if X is observed to be a little bigger, it really WAS a little bigger and it’s not just your instrument. That is essentially impossible in many social science contexts (what’s a precision measurement of mood? of left-right political polarization, of fertility, of attitudes towards homosexuality, of preference for one side of a certain economic trade-off?).
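
          As a toy illustration of the “fizzles out” point above (just a numerical sketch, nothing from the paper under discussion): the volume of a unit ball in d dimensions peaks around d = 5 and then collapses toward zero.

          # Toy sketch (not from the paper under discussion): the volume of a
          # unit d-ball, V(d) = pi^(d/2) / Gamma(d/2 + 1), peaks near d = 5 and
          # then collapses toward zero as the dimension grows.
          import math

          def unit_ball_volume(d):
              return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

          for d in [1, 2, 3, 5, 10, 20, 50, 100]:
              print(f"d = {d:3d}   volume = {unit_ball_volume(d):.3e}")
          # By d = 100 the volume is around 2e-40, one concrete way of seeing
          # how strange high-dimensional geometry gets.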

        • > The higher you go in dimensions, the more ways your noise can appear to “move” your signal.

          Very true, but good cross-validation mitigates the risk of false positives, i.e., calling out a sensitivity which doesn’t actually exist.

          Related: G. Hughes, “On the Mean Accuracy of Statistical Pattern Recognizers,” IEEE Trans. on Information Theory, vol. 14, no. 1, pp. 55-63 (1968). Link = http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1054102&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1054102

    • After briefly reviewing the paper, I found they report the accuracy of the methods they test, but not the *inaccuracy*. 80% accurate and 20% inaccurate is a lot different from 80% accurate, 10% inaccurate, and 10% undetermined. If this method is to be useful for detecting unknown causes, the rate of false positives needs to be as low as possible.

      IOW, they only reported half of their results, and maybe not the most interesting half at that.

  2. With Sokal’s quotes, you have raised a number of interesting topics, starting with the deceptive rhetoric of politicians. Sokal’s perspective on political rhetoric, while emotionally appealing, comes across as naive. I would challenge Sokal and others who support these assertions to provide evidence of a time when political rhetoric was not strategically or tactically deceptive, and a time when a high proportion of the governed felt as if they were receiving the whole truth from their leaders.

    One can use his own words to support this idea: “The scientific worldview inevitably comes into conflict with all non-scientific modes of thought that make purportedly factual claims about the world.” But his argument is not limited to perspectives about the nature of reality; it appears to be conflated with his personal views of moral imperatives. It is easy for us to identify politicians one might believe are lying, particularly when they engage in party partisanship in which they claim something is “Good for America” when in fact we are convinced that it is only good for Koch Industries. When we begin to call these statements lies, we are ignoring ‘alternative models’ that would allow for these statements to be accurate. When journalists hold politicians accountable for promises they made during the campaign, they similarly are not allowing for the possibility that the campaigning politician may have had every intention of fulfilling the promise at the time, but found the daily exigencies of governance to be incompatible with fulfilling it.

    I agree wholeheartedly that claiming to understand important aspects of the world on the basis of single studies grounded in the statistical significance of modest effects is a rather perplexing norm within the scientific communities mentioned, as are the inferences drawn and claims made about the world based on these modest findings. More perplexing is an educational system grounded in memorizing and regurgitating these findings as an indicator of subject mastery.

  3. Inasmuch as you’ve been calling it ‘Psychological Science’, a better or alternative term might be ‘(woefully) incremental science’. If a researcher is doing incremental science, and only slightly pushing the envelope of discovery, then he or she falls prey to all of the problems discussed on this blog. P-hacking, the file drawer problem, the garden of forking paths — all of these issues disappear if there is a truly robust effect, uncovered in a manner that makes it actually hard to doubt. Say, at a sample size where Bayesian and frequentist methods converge, and p-values are less than 1e-30. At that level, any reasonable approach should reveal the effect to exist.

    Given the highly competitive nature of academia, some amount of incremental advancement is inherent in any researcher’s career. However, ‘incremental’ is a matter of degree too.

    I totally agree with Sokal — one way to reduce government waste would be to have people actually prove their claims. (Or: wouldn’t the most efficient government require rationality and decisions-under-pressure training for all officials and employees?)

  4. It doesn’t identify causality; it identifies ‘X happens before Y’. Both X and Y could have been caused by A, but with different time courses.

    Obviously it can’t do even that limited task if the time resolution is insufficient or nonexistent.

  5. “centered on the modest insistence that empirical claims must be substantiated by empirical evidence.” OK, but empirical claims are composed by humans and empirical evidence is interpreted by humans. There are no external criteria for how the two (empirical claim and empirical evidence) should be linked. Hermeneutics is the elephant in the room.

  6. Andrew: are you alleging that finding a statistically significant p-value, when it gives an approximately correct error probability in relation to the null hypothesis under test, is a method that does not warrant empirical claims about the modeled phenomena? or only that abuses and fraud can result in reporting p-values that are not approximately correct error probabilities in relation to the null hypothesis under test? And if the latter, why limit it to abuses in psychological science as opposed to a great many other fields that produce unreliable and biased results with or without formal statistics?

    • My take on Andrew’s point has been that the large set of plausible null hypotheses that can be chosen in an analysis renders any p value irrelevant without carefully constructed pre-registration (prior to obtaining the data) of the choice of hypothesis, and this is true even without “fraud” or “abuse”.

    • Mayo:

      As I’ve discussed elsewhere, I think p-values make much more sense when used to get a sense of problems with a model that a researcher wants to use, rather than when used to reject null hypothesis A as a way of (purportedly) demonstrating the researcher’s desired hypothesis B.

      That is, I like “goodness-of-fit testing” (in the sense of trying to find problems in a model that the researcher likes and is using) but I don’t like “null hypothesis significance testing” (in which the goal is the low p-value that will be used as evidence to believe a favored alternative hypothesis).

      In any case, I am not limiting my criticism to abuses in psychological science. I refer to “Psychological Science-style work” as a shorthand for a whole class of papers, not just in psychology but also in other fields, that are crisp, appealing, attention-grabbing, and wrong. Not necessarily “wrong” in their larger theories (although they certainly can be) but wrong in their claims of evidence.

      As I’ve also said before, if, by a stroke of magic, all p-values were removed from the world and replaced by Bayes factors (for example), I think that just about all these problems would remain. As long as you have “power = .06” designs and your goal is to reject the null hypothesis, I think you’ll have troubles.
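
      As a minimal sketch of that distinction (a made-up example with a simple normal model, not anything from a real study): simulate replicated datasets from the fitted model and ask whether they reproduce a feature of the data, rather than hunting for a small p-value against a null in order to promote a favored alternative.

      # Minimal sketch with made-up data: fit a normal model to skewed data,
      # then check whether datasets simulated from the fitted model reproduce
      # the observed maximum.
      import numpy as np

      rng = np.random.default_rng(0)
      y = rng.exponential(scale=1.0, size=100)      # the "data": skewed

      mu_hat, sigma_hat = y.mean(), y.std(ddof=1)   # fitted normal model

      reps = rng.normal(mu_hat, sigma_hat, size=(1000, y.size))
      T_obs, T_rep = y.max(), reps.max(axis=1)

      # If the observed maximum sits far in the tail of the replicated maxima,
      # that flags a specific problem with the model (its thin right tail),
      # which is a different activity from rejecting a null hypothesis in
      # order to claim support for a favored alternative.
      print("P(T_rep >= T_obs) =", np.mean(T_rep >= T_obs))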

      • Andrew: If the power were really low, then it would be very difficult to reject the null, whereas the problem you are on about concerns the ease of rejecting the null. So I’m confused about that point.

        There is no account of testing that warrants inferring a desired hypothesis or any substantive hypothesis based on a small p-value in rejecting a statistical hypothesis. First, one has to assure that the p-value is not spurious; next, as Fisher insists, one has to go beyond the isolated p-value to show a genuine, repeatable effect. Next, if one plans to infer a substantive claim C, one has to show that the real effect counts as a good test of C’s flaws.
        Finding a flaw with what you are calling goodness-of-fit testing is fine, but supposing that this warrants some other model M’ that fits better than M is open to the same problems as rejecting a null and jumping to a claim at a different level, UNLESS the ways that M’ can be wrong have thereby been probed and ruled out. Good fit is not enough, and being ready to reject and replace with M’ does not make the analysis genuinely critical.

        • Low powered studies can easily yield “significant” findings due to multiple-testing, the garden of forking paths, poorly chosen standard error estimators, repeated experiments or just plain randomness.

          One really interesting feature of low-powered studies is that, if you do find a significant effect, it almost HAS to be an overestimate of the effect size: in a low-powered study the true effect is smaller than roughly 2 standard errors, while any estimate that reaches significance must be larger than that. And if you have 10 outcomes, and 2 ways of looking at each, you are bound to find something significant, even if you only do one test per outcome.

          I also agree that, epistemologically, rejecting a statistical test in and of itself says nothing about the real world. Unfortunately, that kind of reasoning is incredibly common in many applied fields, and I think that is a place where you and Andrew tend to talk past each other.
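
          That over-estimation is easy to see in a small simulation (invented numbers, just a sketch): take a true effect that is tiny relative to its standard error and look only at the replications that come out “significant.”

          # Sketch with invented numbers: true effect 0.1, standard error 0.5,
          # i.e., roughly a power-.06 design. Condition on "significance" and
          # look at what the reported estimates would be.
          import numpy as np

          rng = np.random.default_rng(1)
          true_effect, se, n_sims = 0.1, 0.5, 100_000

          est = rng.normal(true_effect, se, n_sims)   # estimate in each replication
          signif = np.abs(est / se) > 1.96            # two-sided test at the 5% level

          print("share significant (power):", signif.mean())   # about 0.05-0.06
          print("mean |estimate| when significant:",
                np.abs(est[signif]).mean())           # more than ten times the true 0.1
          print("share of significant results with the wrong sign:",
                (est[signif] < 0).mean())             # roughly 0.3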

        • Yes, but it’s not the low power but the multiple testing. In any event, the important point the remark brings up is that the multiple-testing argument about illicit p-values IS an argument that rests on statistical significance (p-value) reasoning. So this criticism affirms rather than denies the legitimacy of p-value reasoning.

        • Mayo:

          What Jrc said.

          I’m not ever trying to deny the legitimacy of p-value reasoning. When people use p-values, they should be clear that they are making a statement about what they would be doing had the data been different. I prefer the term “garden of forking paths” rather than “multiple testing” to emphasize that the problem can occur even if only a single test was performed on the data at hand. In that sense, it is a “multiple potential testing” problem.

          The “power = .06” thing goes like this:

          – A researcher is studying a scientific theory that he or she believes to be true.

          – Unfortunately, this researcher performs a study where the effect or population comparison of interest is small and is studied with biased and noisy measurements and a small sample size. Hence power = .06.

          – However, that same researcher is under the impression that statistical significance is not so hard to find, if one’s underlying theory is true. This impression is false—it all depends on the size of the effect, the validity and reliability of the measurements, etc.—but unfortunately it seems widely held.

          – The researcher gathers data, does an analysis that seems reasonable in the context of his or her theory, and finds statistical significance. If the data collection and analysis were preregistered, such an event should be occurring only something like 6% of the time in these sorts of situations. Actually, though, it can easily happen 50% of the time or more.

          – Then I come in and shout, “power = .06!” in the crowded theater. The researcher’s natural response, given the usual treatment of “power” in statistical theory and applications, is to think that this is no big deal, that power is inherently a prospective concept (the chance of successfully finding statistical significance given some assumptions). In traditional thinking, once success has been achieved, power is irrelevant. To switch briefly to a basketball analogy, 2 points is 2 points, whether the shot was well-chosen or just lucky.

          – But that traditional thinking is wrong, for reasons described in my recent paper with John Carlin and summarized by this graph. Actually, statistically significant claims from a “power = .06” study are often in the wrong direction and are certainly extreme overestimates. Or, to put it another way, even (or, maybe, especially) if you have statistical significance, you still need to think about your power. That is, you still need to think about your underlying effect size and your measurement error.

          So it’s a bunch of things, but the low power and the garden of forking paths go together. If power were not so low, the garden of forking paths would not be such a big deal. Conversely, if there were no garden of forking paths, I don’t think people would be doing “power = .06” studies all the time. The only reason they can get away with doing such dead-on-arrival studies is because they have this way of consistently getting statistical significance.

          And, as I’ve written about a zillion times already, I’m pretty sure that similar problems would arise with Bayes factors or any other approach whose goal is to prove hypothesis B by rejecting hypothesis A.
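
          A rough simulation of the “6% can become 50%” arithmetic (invented numbers, a sketch rather than a model of any particular study): let the analyst pick whichever of several plausible comparisons looks best after seeing the data.

          # Sketch with invented numbers: 10 potential comparisons (outcomes,
          # subgroups, interactions), each with only a tiny true effect. A
          # preregistered single comparison reaches p < .05 about 5-6% of the
          # time; reporting the best of the 10 succeeds far more often.
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(2)
          n_sims, n_potential = 50_000, 10
          z = rng.normal(0.2, 1.0, size=(n_sims, n_potential))  # tiny standardized effects
          p = 2 * stats.norm.sf(np.abs(z))                      # two-sided p-values

          print("preregistered single comparison, share with p < .05:",
                (p[:, 0] < 0.05).mean())
          print("best of 10 potential comparisons, share with p < .05:",
                (p.min(axis=1) < 0.05).mean())                  # roughly 0.4-0.5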

        • Compounding the problem, lots of people (especially social scientists, especially psychologists) do a “retrospective power” calculation after the fact, and think that shows that they indeed had high power — but “retrospective power” doesn’t calculate what a good prospective power calculation would give; it’s just the p-value in disguise.
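
          The “p-value in disguise” point can be made concrete (a sketch, assuming a two-sided z-test at the 5% level): retrospective power plugs the observed estimate in as if it were the true effect, so it is a deterministic transformation of the observed p-value.

          # Sketch, assuming a two-sided z-test at alpha = .05: "retrospective
          # power" treats the observed z as the true standardized effect, so it
          # carries no information beyond the observed p-value itself.
          from scipy import stats

          def retrospective_power(p_observed, alpha=0.05):
              z_obs = stats.norm.isf(p_observed / 2)   # |z| implied by the p-value
              z_crit = stats.norm.isf(alpha / 2)       # about 1.96
              # chance of re-achieving significance if the true effect equaled z_obs
              return stats.norm.sf(z_crit - z_obs) + stats.norm.cdf(-z_crit - z_obs)

          for p in [0.20, 0.05, 0.01, 0.001]:
              print(f"observed p = {p:.3f} -> retrospective power = {retrospective_power(p):.2f}")
          # p = .05 maps to about .50 and p = .01 to about .73, regardless of
          # the study: the retrospective power is determined by the p-value alone.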

        • “it’s not the low power but the multiple testing” – disagree with this. Two issues are present in the complete absence of “cheating” through multiple testing:

          1. Andrew’s figure shows the problem with the result conditional on the finding being a “true positive”: effect overestimation and directionality error.

          2. What’s the chance of the finding being a false or true positive in the first place? This is determined by the underlying frequency of true vs. false hypotheses being tested, not by type I error control or multiple testing.
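
          A back-of-the-envelope version of point 2 (purely illustrative numbers): the chance that a “significant” finding reflects a true effect depends heavily on how often the hypotheses being tested are true in the first place, not just on the nominal 5% error rate.

          # Sketch with purely illustrative numbers: if only a small fraction of
          # tested hypotheses are true, most "significant" findings are false
          # positives even with alpha = .05 and no multiple testing at all.
          def prob_true_given_significant(share_true, power, alpha):
              true_pos = share_true * power
              false_pos = (1 - share_true) * alpha
              return true_pos / (true_pos + false_pos)

          for share_true in [0.5, 0.1, 0.01]:
              ppv = prob_true_given_significant(share_true, power=0.3, alpha=0.05)
              print(f"share of true hypotheses = {share_true:.2f} -> "
                    f"P(true effect | significant) = {ppv:.2f}")
          # With 10% true hypotheses and power 0.3: 0.03 / (0.03 + 0.045) = 0.40;
          # with 1% true hypotheses the figure drops to about 0.06.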

        • There’s “p-value reasoning” as Mayo defines it, and then there’s “p-value reasoning” as everyone else in the universe actually applies it.

          The point here is that even in the absence of formal multiple testing (i.e., having run a bunch of different tests through your stats software), the fact that there is a "garden of forking paths" wherein a large number of *possible* tests lie means that it's usually easy to find some test to run which is vaguely related to your substantive hypothesis and gives p < 0.05. Even if this is the only test you plug into your software, the fact that you chose it post-data makes the p value meaningless.

          Pre-registration prevents that flexibility in choice from influencing the frequency with which p < 0.05 will come out of your analysis.

          So, if you're going to rely on accurate "coverage" of a hypothesis test, it had better be pre-specified prior to collecting *any* data.

        • To make it clear, here’s how “everyone else in the universe” seems to apply p values:

          1) Come up with a reasonable scientific hypothesis, which is nevertheless somewhat vague (compared to say a compilable Stan model)

          2) Design some experiment which you think will be informative for (1); you have two choices here:
          a) do a power study: design an experiment which is sufficiently high-powered to reliably detect an effect of the size you estimate the real effect will be (but that sounds a little like a prior, doesn’t it?)
          b) just design a data collection procedure and get data until you “get tired” or “run out of available funding” or “think it seems big enough” (this is the most common in the work Andrew has been criticizing)

          3) Get your data, organize it into tables, etc., and look at what it looks like; now start to think about how you can analyze it in the context of (1)

          4) Especially in social sciences, design some scales on which to measure vague concepts (like redness of a shirt, threateningness of a stereotype, left-right political ideology, acceptance of authority, whatever)

          5) Based on some histograms of your raw data and recommendations you’ve read in some literature or textbook, choose some test, be it t-test or permutation testing, or bootstrapping or whatever. Run your test on your difference of interest… get p < 0.05 and publish it as "true fact about the world"

          *that* is what is going on in most NHST studies, and especially in the work Andrew is criticizing. By number of papers published, it is the dominant way in which NHST is actually used.

  7. Andrew:

    You criticize: “work in which a researcher manages to find a statistically significant p-value and uses this to make an affirmative claim about the world”

    But the bit about “statistically significant p-value” could be replaced by almost anything else; i.e., extending a specific observation from the lab or sample to a general claim about the world is always fraught with difficulty and pitfalls.

    It is unfair to single out p-values. There are tons of ways to derail external validity.

      • Really? Can you give me examples of other methods using ill-defined hypotheses and the accompanying flexible interpretations that prevent misinterpreting data?
        The low-powered study business is puzzling; it’s high-powered studies that make it easy to pick up on small effects.

        • I can’t quite parse your first sentence; are you asking for an alternative to NHST that prevents the problems of low-powered studies? A Bayesian model with realistic priors can help a lot, by shrinking estimates towards a valid prior guess. The result will be that your small study doesn’t show “significantly different from reference (or zero)” effects. But low-powered studies are always that… low powered. One of the advantages of Bayesian models in practice (at least in my practice) is that they are not “ill-defined hypotheses”; they are typically highly structured and application-specific. The Bayesian computing methods allow a theoretically straightforward inference method to be applied to many quite complicated and specific models.

          The low-powered thing is in part Bayesianism in disguise. In a prospective power study you need to estimate both the magnitude of the noise and the magnitude of the signal; these are essentially reduced information (point estimates) from an informative prior. Without that informative prior, as Martha has said above, retrospective power studies are just the p-value in disguise (since you’re using the observed magnitudes in your small sample, not a reasonable guess as to what the “real” magnitudes are).

          Much of Andrew’s complaint here can be re-thought as “my prior tells me there’s no way you can tease out any signal in this data, and any signal would have to be much smaller than your eventual estimate… so your p value is most likely the result of your data analysis choices, where you have plenty of flexibility, not of any scientific truth”

          It comes down to this: if you are studying something that has a small signal and a lot of noise, and you DO get a p < 0.05, then your estimate of the signal magnitude will be much, much bigger than reality, since it must have been a fair amount bigger than the noise to get your p value. Combine that with the opportunity, post-data, to choose which comparisons or which tests you will perform, and even if you do only one of them, you will be fooling yourself. Another way to say this is that the "eyeball" is performing a lot of hypothesis tests that you're rejecting before you ever settle on which one to plug into your stats software. But there's no formal way to quantify such a thing and do Bonferroni or whatever on the "eyeball" tests. The result is: without pre-registration, we can't be sure that your choice of test hasn't influenced the validity of your resulting p value.
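
          A minimal sketch of the “shrinking estimates towards a valid prior guess” idea from a few comments up (normal-normal conjugacy with invented numbers, not anyone’s actual model):

          # Sketch with invented numbers: a noisy estimate of 1.0 +/- 0.5 (z = 2,
          # nominally "significant") combined with an informative prior that true
          # effects in this area are small, 0 +/- 0.1.
          def normal_posterior(prior_mean, prior_sd, est, se):
              prior_prec, data_prec = 1 / prior_sd**2, 1 / se**2
              post_var = 1 / (prior_prec + data_prec)
              post_mean = post_var * (prior_prec * prior_mean + data_prec * est)
              return post_mean, post_var ** 0.5

          post_mean, post_sd = normal_posterior(prior_mean=0.0, prior_sd=0.1,
                                                est=1.0, se=0.5)
          print("raw estimate: 1.00 +/- 0.50  (nominally significant)")
          print(f"posterior:    {post_mean:.2f} +/- {post_sd:.2f}")
          # Roughly 0.04 +/- 0.10: the estimate is pulled almost all the way back
          # toward the prior, i.e., the model treats the apparent effect as mostly
          # noise given what is plausible a priori.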
