Applying statistics in science will likely remain unreasonably difficult in my lifetime: but I have no intention of changing careers.

This post is by Keith.

(Image from deviantart.com)

There are a couple of posts I have been struggling to put together. One is on what science is or should be (drawing on Charles Peirce). The other is on why a posterior is not a posterior is not a posterior: even if mathematically equivalent, they are (or could be) all very different (drawing on Susan Haack). However, Sander Greenland just sent me one of his new papers, “For and Against Methodologies,” which amongst other things brings out the motivations I have for trying to write those posts. It is well worth reading, and I’ll try here to briefly provide some motivation for reading the paper, primarily through excerpts.

Sander’s title pays homage to Against Method by Paul Feyerabend, and my post title pays homage to Sigmund Freud’s comment that a scientific psychology was not likely in his lifetime but that he had no intention of changing careers (from memory, unfortunately without a reference). By applying statistics in science, I mean enabling researchers to be less misled by the observations they encounter and also to have a good sense of how much they unavoidably may have been misled. Realistically, the observations they encounter will not always be randomized, nor even practically _as if randomized_, and researchers will almost always want to be targeting something about reality: descriptively when just one reality is being targeted, causally when two or more realities are being compared, and transportably when more than one reality is being targeted with something taken as being common. Now, the unreasonable difficulty arises from needing not only more than adequate knowledge of statistical methods and modelling, but also a rather advanced grasp of what science is or should be, as well as of what to make of the output of statistical modelling (e.g., posteriors) beyond the mathematical summaries of that modelling. Or so I think Sander argues.

One definition of scientific enquiry is just everyday enquiry with helps. Statistical methods and inference approaches are then seen as one of the helps, often a critically important one, but not a method of enquiry. My focus in this blog post will primarily be on Sander’s take on “the overconfidence that plague[s] statistical methodology,” specifically when “formal statistical inferences are treated as sound judgments about causality” even though they are instead just “an aid to inference,” being “a useful technology emerging from formal methodology, rather than a theory” of enquiry or inference. More generally, “Overconfident inferences … touted as unconditionally sound scientific inferences instead of the tentative suggestions that they are.” Though Sander focuses primarily on targeting something about reality causally, I believe the arguments are also important descriptively, even if somewhat less critical (there is always a price to be paid for getting reality wrong at some point in time).

He further clarifies his arguments with “Every methodology has its limits, so do not fall into the trap of believing that a given methodology will be necessary or appropriate for every application; conversely, do not reject out of hand any methodology because it is flawed or limited, for a methodology may perform adequately for some purposes despite its flaws. … [this] raises the problem of how to choose from among our ever-expanding methodologic toolkit, how to synthesize the methods and viewpoints we do choose, and how to get beyond automated methods and authoritative judgments in our final synthesis.” The latter is a much under-discussed topic in the literature.

“Much benefit can accrue from thinking a problem through within these models, as long as the formal logic is recognized as an allegory for a largely unknown reality. A tragedy of statistical theory is that it pretends as if mathematical solutions are not only sufficient but “optimal” for dealing with analysis problems when the claimed optimality is itself deduced from dubious assumptions. … in the end an exercise in hypothetical reasoning to aid our actual inference process, and should not be identified with the process itself … Analysis interpretation depends on contextual judgments about how reality is to be mapped onto the model, and how the formal analysis results are to be mapped back into reality.”

Here the unreasonably difficult part is brought out: “the potential for greater realism in these approaches [which, unfortunately, he argues demand a sophistication for proper use and review that is lacking amongst many (most?) practising statisticians] suggests they should be part of the skill set for everyone who will direct or carry out statistical analyses of nonexperimental data, to help them gauge the distance between the model they are using and more realistic models, and to help them critically evaluate the sensitivity and bias analyses they encounter.” It perhaps needs to be added here that almost all data are not quite fully experimental, as perfect experiments are just a threat – you can’t rule out someone doing one some day, but it would be naive to assume someone already has.

Now, part of the problem can be blamed on the content of many (most, all?) statistical programs: “degrees in statistics and medicine do not require substantial training in or understanding of scientific research or reasoning, but nonetheless serve as credentials licensing expressions of scientific certainty.” But one does wonder how, and by whom, this could be addressed. A possible example of this oversight is Karl Pearson’s Grammar of Science. It also underlines a serious challenge to be overcome for John Ioannidis’ idea that “Teams which publish scientific literature need a ‘licence to analyse’ and this licence should be kept active through continuing methodological education”: what training will be needed for understanding scientific research or reasoning, and who will provide it?

Now I am not quite sure what is meant by “I depart from the mainstream Bayesian revival in regarding simulation methods as insufficient for Bayesian education and sensible application,” but I certainly agree that “because the complexity of actual context prohibits anything approaching complete modeling, the models actually used are never entirely coherent with that context, and formal analyses can only serve as thought experiments within informal guidelines.”

OK, what are some ways to make things less unreasonably difficult?

 

 

 


69 thoughts on “Applying statistics in science will likely remain unreasonably difficult in my lifetime: but I have no intention of changing careers.”

  1. “OK, what are some ways to make things less unreasonably difficult?”
    What indeed?
    Maybe there is no way to do so without introducing often-fatal errors…
    Maybe science has exceeded the typical inference capacities of most scientists, just as computers are at last exceeding the skills of professional poker players (as well as Go players, although poker wins are more impressive since they involve a large random element in the shuffle and deal, and extensive missing data about the positions of opponents). More narrowly put: Based on the fact that a majority of medical articles I see make at least one serious error in interpreting their basic inferential statistics, I have come to think that statistical inference is simply too subtle for mass use. But I also doubt that AI is as yet nearing a solution to this problem.

    • > Based on the fact that a majority of medical articles I see make at least one serious error in interpreting their basic inferential statistics, I have come to think that statistical inference is simply too subtle for mass use.

      I think we should try to actually teach people to do it right before concluding that the whole lot are just irretrievable dolts. When you explicitly train generations of people to do certain things, and have no clear answers when they ask for clarification or justification, it’s not surprising that you see a long tail of bad effects of that training in practice.

      For the most part, at least in my cohort, we are not at all afraid of thinking about deep and tricky issues, we were and are just lied to that there aren’t any in our statistical procedures. And if a well-intentioned researcher trying to do due diligence tries to go figure out how to interpret a “basic inferential statistic,” you well know that he will mostly find those same serious errors in the large majority of resources (written by statisticians).

      • Very true. Let’s do a randomized controlled trial. We’ll set up stats courses in which people learn the basic math of Bayes in the first 3 weeks, and then do 10 weeks of labs developing models and analyzing data using Stan for computation. We’ll file FOIA or whatever is necessary to get ahold of real world drug RCT data. We’ll use public archives of government and NGO survey data (American Community Survey, Consumer Expenditure Survey, Current Population Survey, General Social Survey), we’ll suck down data from remote sensing satellites and monitoring of water flow through weirs and tagging of sharks and elephant seals… and etc etc etc. Each student will pick a set of 3 topics; students that pick similar topics get assigned together to do studies. Weekly presentations of “research progress” with feedback from the class.

        In the second semester we’ll have early lessons in HMC and diagnosing simulation issues, we’ll discuss predictive vs causal inference, and using the previous semester’s projects we’ll discuss how causality did or did not enter into the analysis. Students who did causal projects will get assigned descriptive projects and vice versa.

        We’ll run this course for 10 years at 10 universities, with students assigned randomly to take this course vs a standard stats course. As we go along, we’ll do a Bayesian analysis of the quality of research done by graduates of these courses in their later career… We’ll define a quality metric for statistical analyses in papers published, we’ll include real world costs associated (grant money wasted, replication money wasted, follow up studies that were wasted… money spent on drugs approved on the basis of poor decisions based on p values… etc etc). We’ll terminate the study based on a decision theoretic result of how much better or worse one or the other group was.

        ;-)

        I think about a Billion dollars would cover it… It’d be cheap compared to continuing what we’re doing, right Keith? :-)

        • > As we go along, we’ll do a Bayesian analysis of the quality of research done by graduates of these courses in their later career… We’ll define a quality metric for statistical analyses in papers published, we’ll include real world costs associated…

          That sounds complicated. I’d say just count # of papers, use some NLP to count mentions of p-values, and then check if the CIs overlap. :-)

          In seriousness, I know you were joking but that course does seem like an improvement.

        • 10 billion would be a steal for noticeably improving the profitability of science widely – a real steal.

          But your proposed RCT only manipulated the formal stuff (and skills in it) and left the informal to vary haphazardly and drive down the power?

        • Well I assume the “labs and research reports” would have a lot of informal discussion about what is a good model of stuff and what isn’t, I’d be happy to include lots of informal discussions… we could perhaps choose a corpus of general purpose discussion topics on science, modeling, history of failed theories, case studies like the Tamiflu one, etc etc and randomly assign them to students in each course… not a bad idea.

      • I think we should try to actually teach people to do it right before concluding that the whole lot are just irretrievable dolts.

        No one thinks people are dolts. What happens is people get tricked into wasting the prime of their life on research that they were trained to design, run, and analyze incorrectly. Then they are trapped and can’t admit their work was BS for career reasons (and also can’t call everyone else out on it). It is similar to a gang initiation ritual where you are peer pressured to commit some crime to prove you are part of the group and just as culpable as the rest.

        • Anon:

          Wow—that’s an excellent analogy. And indeed it makes me particularly sad to see young researchers trapped in that merry-go-round as it spins faster and faster. I really admire people such as Brian Nosek who’ve had the strength to jump off.

        • Perhaps also the folks that provided financial support so some can jump off without _giving up_ http://www.arnoldfoundation.org/

          Now their Garden of Forking Paths challenge is that amongst those who are in position to participate and appear to be most likely to contribute, a large? percentage will be careerists who have managed their research careers in the very ways that need to be changed. Many of them will see these funding opportunities as just another lucrative merry-go-round to further enhance their careers.

        • I agree with all of that. I was just saying that the current situation is not evidence that stats is just too hard for most people to get right. It’s possible that it is too hard, or rather too subtle. But that doesn’t follow from the pervasive and persistent errors in medical journals, given the background reasons you mention.

    • Hi Sander,

      It resonates with me when you write that you routinely encounter basic errors in medical papers. The same here with papers in psychology. Your point about using AI instead of humans to do inference is an interesting one. I recall David Kenny presenting his early work on “DataToText,” which was an early attempt at automating a simple type of analysis. Audience response was very negative – people claimed that it was dumbing down users, who would just click buttons and let the computer do everything. Have you seen the Automatic Statistician by Zoubin Ghahramani?

      https://www.automaticstatistician.com/static/assets/auto-report-affairs.pdf

      I think this was funded by Google a while back, but I am not sure if this is still actively developed.

      I actually agree that a system like this could do some good.

    • >Maybe science has exceeded the typical inference capacities of most scientists… Based on the fact that a majority of medical articles I see make at least one serious error in interpreting their basic inferential statistics, I have come to think that statistical inference is simply too subtle for mass use.

      Wasn’t this true decades ago too? And yet, despite all the errors, medicine makes progress. Research muddles along, somehow. We just don’t know how, because there is no formal understanding of how a field with so many errors in statistical inference could still get a lot right.

      When a paper is based on fundamental errors in inference, I don’t place any credence in its conclusions. I mean, I can’t even imagine how to place partial credence in its conclusions. Logically, one serious error in reasoning undermines the validity of the entire argument. And yet, decades ago, even the best statisticians were making errors that are basic today. Probably, then, the best statisticians today are making errors that will be basic decades from now. Progress is made somehow. But do any of you have a good idea how that happens?

      • Anon:

        I think science makes progress through a combination of theory and experiment. But in statistical classes and textbooks we neglect substantive theory, instead focusing on a black-box mode of inference, which can work just fine when testing a large effect with a clean experiment, but which won’t work so well without some strong theory that can facilitate the discovery of these interventions which will have large effects.

        • For Karl Popper, only theory is good or great science.

          In fact, Paul Feyerabend’s ideas are essentially consistent with Karl Popper’s.

      • And yet, despite all the errors, medicine makes progress.

        I wouldn’t be so confident. It is the method of assessing “success” or “progress” (NHST) that is flawed here.

        • But no one uses the number of null hypotheses rejected as a global measure of progress. The idea that medicine and biomedical knowledge have made no progress is clear rhetorical excess.

          Are there lots of problems? Yes. Does that mean literally zero progress? Obviously no.

        • In terms of my field, I would say the number and accuracy of the predictions I can make about biological behavior; the number of entities, processes, and relationships which we hadn’t even known existed and over which there is not really a reasonable dispute. If we are talking about decades, this includes pretty much everything about molecular biology. I am not a clinician, but I would guess they might say, for one, lowered mortality due to various diseases.

          There’s _a lot_ of noise, and it requires a lot of calibrating belief. But it’s not all noise.

        • I find it extremely difficult to get a biologist to attempt a prediction.

          Getting a biologist to make a quantitative prediction or a medical researcher to make any prediction is pretty much impossible.

        • > I find it extremely difficult to get a biologist to attempt a prediction.

          Ha, well that’s fair to an extent. On the other hand, sometimes biologists build models which give predictions or at least confirm sufficiency arguments. Also, the use of some experimental methods is making a prediction in a different sense. We couldn’t use them if we didn’t have some coherent understanding of the underlying mechanisms. (Yes, some methods are crap.)

          It’s not really a high bar of proof I have to clear here. The claim was that we have built zero, or close to zero, durable knowledge in biology and medicine in the past few decades, due to the widespread use of NHST. Even the most pessimistic estimates of the portion of false claims in the literature are not 100%.

          On a larger point, I think it’s very important that statisticians and biologists find ways to talk more productively to each other and to understand what the other ones are going on about. If we have to argue about whether the other side is doing _anything_ of value, I don’t see how a productive relationship gets off the ground.

          A few years ago I asked a bunch of people, in all seriousness, what “advanced statistics” could even mean. Wasn’t the field mostly exhausted by histograms and t-tests and chi-squared something-or-other? Did statisticians do anything interesting other than yell about how the real world did not conform to tests they have on hand? I hope it is clear that I’ve done a lot of work since to discover how inaccurate that view was.

          I’m not going to apologize for the many problems and pervasive bad practices in the field. But if you really think no progress has been made in biology, I submit that you find yourself in the same position I was in with respect to statistics a few years ago, and there’s a lot of room to understand more about what’s going on in biology.

        • The claim was that we have built zero, or close to zero, durable knowledge in biology and medicine in the past few decades, due to the widespread use of NHST. Even the most pessimistic estimates of the portion of false claims in the literature are not 100%.

          Due to all the misleading paths people have been led down by NHST, the sum knowledge generated is likely negative. I suspect huge misconceptions currently lie at the center of our understanding of problems like Alzheimer’s, stroke, and cancer, and most “progress” in these areas is illusory.

          But if you really think no progress has been made in biology, I submit that you find yourself in the same position I was in with respect to statistics a few years ago, and there’s a lot of room to understand more about what’s going on in biology.

          My background is biomed research rather than stats. Like you, I had to self-teach because the training was extremely poor. I was actually told by my (graduate school) stats teacher that he was worried about getting fired if he tried to teach proper statistics (as opposed to NHST) to medical students/researchers. I wouldn’t underestimate the scope of this problem.

        • Sure. And I also wouldn’t overestimate it.

          I tried really hard and have now abandoned the effort. I do wish you good luck. I do not mean this sarcastically.

    • Sander: Now having actually read your paper, I see you actually make the same points I was trying to get at in my first comment, and more eloquently and more insightfully. Thanks for the paper.

      It also, unexpectedly for me, sort of reforms Feyerabend in my eyes. My previous exposure was through the lens of his critics.

      +1 to Keith’s comment: “It is well worth reading”

      • Jason:

        I too was turned off Feyerabend as an undergrad by a newly recruited philosophy professor who was stuck with teaching the philosophy of science course and used a book by Feyerabend he had just read before the course. I dropped the course but held on to a confounded assessment of Feyerabend’s work.

    • Sorry for delay in commenting.

      First, I don’t actually believe you are giving up. Anything really undertaken with a scientific attitude reflects a desire to find out how things really are (or enable others to) and so you will likely have no choice but to persevere. More dramatically “Real communication is always a task of love: Truth is the goal of scientific inquiry and love is a distinctive feature of truth. In the words of Peirce: “The Law of Reason is the Law of Love”.” (from Jaime Nubiola Peirce quotes).

      Now, I would not see AI as being directly helpful, as the main challenge is what to make of analyses that could be done for various purposes – and, as far as I have seen, AI is really deficient in intentionality and purposefulness. But I really do think computation more generally has made, and will make, the learning and practice of statistics much, much easier. Just think of how much easier it is to simulate a modelling approach under a known truth to assess whether it is _working properly_ (a minimal sketch of this appears at the end of this comment). Or, as Felix points to, running k different analysis approaches and seeing the cross-validation results as a default. Or, as you likely know, Jasjeet S. Sekhon argued that causal inference was beyond assembly-line methods in Opiates for the Matches https://pdfs.semanticscholar.org/59ba/c42605545e0acc917d582971e4909c0660d8.pdf I found the arguments convincing, but then the folks at Sentinel do seem to have automated at least first-cut analyses http://www.populationmedicine.org/node/11229

      More generally scientific activities promise no (sure) short term rewards. Perhaps all we can do is clear some brush for future generations – e.g. maybe enable the students of today to figure out the pedagogy for their students or their students’ students’ students, etc. Or create barriers for future generations to figure out they are barriers and how to get over them :-(
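
      As a minimal sketch of the simulate-under-a-known-truth idea above (illustrative numbers only, plain numpy rather than any particular package): generate data from a model you fully control, fit the intended analysis, and check whether, for example, its interval estimates behave as advertised. The same pattern scales up to checking much more elaborate modelling approaches before trusting them on real data.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      # Known truth: a straight line plus noise
      true_intercept, true_slope, noise_sd = 1.0, 2.0, 0.5
      n_sims, n, covered = 1000, 30, 0

      for _ in range(n_sims):
          x = rng.uniform(0, 1, n)
          y = true_intercept + true_slope * x + rng.normal(0, noise_sd, n)

          # Fit ordinary least squares by hand
          X = np.column_stack([np.ones(n), x])
          XtX_inv = np.linalg.inv(X.T @ X)
          beta = XtX_inv @ X.T @ y
          resid = y - X @ beta
          se_slope = np.sqrt(resid @ resid / (n - 2) * XtX_inv[1, 1])

          # Does a rough 95% interval for the slope contain the known truth?
          covered += (beta[1] - 2 * se_slope) <= true_slope <= (beta[1] + 2 * se_slope)

      print("slope interval coverage:", covered / n_sims)  # should be near 0.95
      ```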

      • “More generally scientific activities promise no (sure) short term rewards. Perhaps all we can do is clear some brush for future generations – e.g. maybe enable the students of today to figure out the pedagogy for their students or their students’ students’ students, etc. Or create barriers for future generations to figure out they are barriers and how to get over them”

        +1

    • Speaking as an AI researcher, I don’t think AI is anywhere near solving this problem. Current AI methods work best when the problem is well-specified (as in chess, Go, and poker). But most of the challenge in doing statistics (and science) right is to understand the ways in which the problem/model/experiment as formulated might differ from reality. It is these threats to validity that we need to constantly be considering. In my experience, well-trained statisticians (esp. practitioners) are able to bring to bear a wide range of knowledge about the world to identify potential threats to validity and consider ways to address them. They may be far from perfect at this, but I think their training as professional skeptics makes them better at this than anyone else.

      Current AI techniques have no way whatsoever of doing this. Indeed, they suffer from exactly the same threats; from exactly the same gaps between the problem as formulated and the complex realities that lie beyond that formulation.

        • “…most of the challenge in doing statistics (and science) right is to understand the ways in which the problem/model/experiment as formulated might differ from reality.”

          “…well-trained statisticians (esp. practitioners) are able to bring to bear a wide range of knowledge about the world…”

          Perhaps ordinary-language terms like “understand” and “knowledge” do not get us anywhere in our understanding of behavior – and, hence, do not get us anywhere in designing an AI that can do some of the things we can do. And, like it or not, whatever else one can say about science, it is the behavior of scientists. It is no accident that successful AI takes some kind of neural net and changes it according to the consequences of the “output.” But when you pursue this line of attack, you (hopefully) eventually realize that explanatory fictions like “understanding” and “knowledge” are obstacles to the understanding of behavior and simulated behavior.

  2. I think it will always be unreasonably difficult. Consider the lost information from basically every experiment. Currently everything begins with the human brain’s ability to conceptualize a problem and create instruments to gather data on this problem. This works for strong signals, but for weaker signals, or more complex systems, there is often no way to gather sufficient information (by sufficient information I mean the vast vast vast amounts of information running through reality).

    This is particularly annoying because our brains are also incredible at conceptualizing the world at an incredibly high level of dimensional abstraction. Meaning that our ability to ask questions is vastly more powerful than our ability to answer them.

    So the question then becomes how can we determine if the questions we ask map to a statistical solution concept within our existing ability? And as far as I can tell we have no strict criteria to answer this question. Guys like Andrew Gelman develop it after a career of observing problems asked, observing answers, and evaluating outcomes (both realized and replicated). If you do this enough you develop a nice heuristic for when the question maps to our ability to answer.

    As a start, this is a boring answer, but teaching philosophy of science would help. My undergrad and grad program in social science never even used the phrase ‘philosophy of science.’ It was regressions and papers from day 1. This is complicated by the fact that applied philosci isn’t popular, and has little traction.

    The long-run answer is that next-generation data collection methods, as well as next-generation models which combine causal inference with the flexibility of learning models, will help us bridge the gap between questions the human mind can conceive of and questions we can answer. (Of course, the human mind also seems fatally flawed to think some nonsensical questions are meaningful, but that’s another problem…)

    • I agree with your sentiment about “It was regressions and papers from day…” That does a good job describing the problem with current pedagogy.

      But I don’t agree that more teaching of philosophy of science would help.

      What I think would help is more exposure to and insights from practitioners (like Andrew) that deal with real applied problems and wrestle with abstractions, causality, noise, confounders, spurious correlations, fishing etc. on a daily basis.

      I doubt that more philosophy of science is what we need.

      • Rahul, I think you are right that more wrestling with real problems and the choices experts weigh would be a big deal. In some sense, this is already a part of graduate-level courses—read a paper, pick apart the measurement and experimental design questions. It would be very useful to extend the practice to analysis questions. In fact, my impression is that biomedical programs tend to underweight the latter and stats programs underweight the former.

        I do think that _some_ “small-p” philosophy of science is in order. I don’t see how else you address the questions about what it is that statistical inference is doing. It wouldn’t have to be a whole course, or even a dedicated unit. Just being honest about how the choice of statistic/procedure/framework maps onto the sort of knowledge you want.

      • Perhaps I was too general. I consider what Gelman does, by and large, philosophy of science. Combine his modern applied PhiloSci work with the foundations of modern Causal Inference philosci, and also throw in some guys like Popper.

    • One of the main themes here is the balance or tradeoff between the formal and the informal. This to me is one of the most difficult and fascinating subjects.

      Recognising, emphasising and even embracing that such ‘dualities’ or ‘adjoint’ concepts are everywhere (see also objective vs subjective, general vs particular etc) is perhaps one small step to take. [Maybe Peirce knows how to get past such dualities to ‘thirdness’ ;-)]

      One teaching strategy I’ve adopted, not sure if it works, is to separate the formal and informal explicitly but make sure not to drop either. When we are defining and working within a formal or mathematical system we should stick to rules of the game. When thinking about the external world we should think differently and more informally.

      I do find that even ‘not very mathematical’ students often find this easier – sometimes their confusion is a result of their formal and informal reasoning clashing, not actually an inability to follow a formal argument as such.

      • That’s an interesting point. That dynamic comes up a lot when I try to explain partial pooling as the mathematical mechanism underlying so-called “random effects”. You absolutely have to be able to follow the formal reasoning, but then pull back and let it interact with your informal reasoning about the world, and then go back and look at it again. Reminds me of one time a math professor in college told me, “linear algebra is useful but you just can’t take it too seriously!”
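
        Concretely, the formal core of that mechanism, assuming the simplest normal-normal hierarchical model with known variances (a minimal sketch, not the only formulation), is that the partially pooled estimate for group $j$ is a precision-weighted compromise between that group's own mean and the overall mean:

        $$
        \hat\theta_j = \omega_j\,\bar y_j + (1-\omega_j)\,\mu,
        \qquad
        \omega_j = \frac{n_j/\sigma^2}{n_j/\sigma^2 + 1/\tau^2},
        $$

        where $\bar y_j$ is group $j$'s sample mean, $n_j$ its size, $\sigma$ the within-group and $\tau$ the between-group standard deviation, and $\mu$ the population mean; small or noisy groups get pulled harder toward $\mu$. The formal expression is exactly what then has to interact with the informal judgment about whether the groups are similar enough for that pull to make sense.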

      • I think this is a really good point. And I also thank you for engaging me on it in that post on my blog about Cox’s theorem. In the end, I still came to the conclusion that Bayes was the way I wanted to go with data analysis, but the point that huge quantities of heavy lifting come before you get to probabilities and that these should be acknowledged as something important and useful, and that they are a mix of formal and informal… Well I guess I’ve always acknowledged them, but being explicit about the boundaries is very helpful.

        Also, you made me finally get a real book on set theory and it turned out that I really liked it ;-)

        https://www.amazon.com/Book-Theory-Dover-Books-Mathematics-ebook/dp/B00LOZENB2/

        • Hi Daniel,

          Thanks and sure – I’ve found it helpful to engage in such exchanges even if neither of us ultimately change our opinions much!

          I happened to read the following old exchange recently which is somewhat similar to ours and also connects back to Sander’s comments on formal vs informal frameworks, eclecticism etc.

          Lindley:
          Barndorff-Nielsen says that the axioms [of Bayesian inference] are at variance with the fundamental requirements of inference, without explaining why. Perhaps he will enlighten us.

          Barndorff-Nielsen:
          The key issue is whether the uncertainties handled in statistical activity are all comparable, on a single scale, or whether these uncertainties fall in qualitatively distinct classes and are not, in general, comparable…There are, of course, several versions of the axiomatic basis of Bayesian statistics, but in each of these you are, more or less explicitly, led into accepting comparability…

          …is it then, in general, meaningful to speak of the uncertainty of the one event relative to the other, as if these uncertainties were of the same kind? I think not, and find it rather like trying to measure the warmth of my feelings for some person by means of a centigrade thermometer…

          …It is true, of course, that today there exists no coherent and fully comprehensive statistical theory which recognizes the existence of qualitatively different forms of uncertainty or knowledge, and that, accordingly, much statistical reasoning is to some degree eclectic, which involves a real risk of inconsistencies. Similar situations exist, however, in a multitude of areas where theory is confronted with reality, for instance in physics; and although such situations may be far from ideal they can well be manageable and highly fruitful. Even in pure mathematics it is quite feasible to make substantial progress without a consistent axiomatic basis; Frege’s set theory, shown later by Russell to be inconsistent, is a case in question…

          …The above views are very close to arguments propounded by J. M. Keynes, R. A. Fisher, and many others. (In particular, see Keynes (1921), Chapters 1-3.) It seemed nevertheless appropriate to state them on this occasion, especially since Professor Lindley, here and elsewhere, has written as if such arguments were of no substance and thus could be ignored…

        • Of course, I have no problem often putting on a Bayesian hat and playing along with the assumptions when it suits – I even find it useful in many cases!

          I do, however, run into legitimate cases where such subtle assumptions do seem to make an important practical difference.

          Interestingly, it was also Lindley who said “unidentifiability causes no real difficulties in the Bayesian approach” which I now think is a terrible – but not uncommon – opinion to hold, and not unrelated to the subtle issues of comparability discussed above.

          My point is not that Bayes is terrible, rather that it is not a panacea and that it is worthwhile to be more flexible and open to multiple formal frameworks with various strengths and weaknesses.

        • ojm:

          Thanks for the Barndorff-Nielsen quote, as I think it’s related to the “why a posterior is not a posterior is not a posterior, even if mathematically equivalent” issue.

      • ojm:

        I think Richard McElreath does this nicely, with really nice metaphors for the informal reasoning: golems, big/small worlds, the Wall of China, etc.

        I think his book was one of the few noticeable improvements in teaching statistics in the last 20/40 years, primarily from some balance of the informal and formal.

        • Hi Keith,
          This is probably true and I likely would have really appreciated his book when I was first learning stats.

          Having said that, I do have a real issue with what I see as the muddling of the formal and the informal by Jaynes and his followers. This goes for (mis-)interpretations of Cox’s theorem etc. From what I remember Richard’s book is very much in the Jaynesian spirit and endorses some of (what I see as) Jaynes’s misconceptions.

          The strength and weakness of the Jaynes/Bayes approach to me is that it fails to make subtle distinctions. Like trying to understand the world through elementary propositional logic – useful to a point, damaging beyond that.

  3. “…my post title pays homage to Sigmund Freud commenting that a scientific psychology was not likely in his life time…”

    GS: But a scientific psychology, in fact, arose during his lifetime and still exists, though it is in the minority wrt mainstream psychology which, IMO, is not science at all. I’m talking, of course, about the experimental analysis of behavior, now usually just called “behavior analysis” (BA). BA focuses on experimental, rather than statistical, control and uncovers reliable and general laws (speaking somewhat loosely) of behavior in individual subjects. It tends toward induction and reliance on affirming the consequent, though formal analyses and a more hypothetico-deductive approach have become more common; apropos since the hypothetico-deductive method doesn’t work well given a shortage of basic facts. Also, behavior analysts have always understood the importance of conceptual analyses but mainstream psychology is largely a conceptual cesspool:

    http://www.behavior.org/resources/88.pdf

    I am glad that I stumbled upon Gelman’s blog because it has been instructive hearing actual statisticians converse, though much of it is over my head. There is, however, a bit of hubris connected with the frequent references to philosophy and “the scientific method.” My early impression is that the view of science adopted here is partly simplistic as well as tied closely to between-groups methodology. Anyway, I am still lurking and forming an opinion so that’s enough ranting for now.

  4. Wow, excessive pride or self-confidence re: philosophy and “the scientific method.”

    Now, I am not sure if you are referring to me, Sander, or Pearson, but it might provide an opportunity to discuss what might be an elephant in the room.

    If by applying statistics in science I _really_ mean enabling researchers to be less misled by the observations they encounter and also to have a good sense of how much they unavoidably may have been misled, then I have no choice but to endlessly grapple with what “the scientific method” should be. Some find this endless grappling annoying, and Jason provided a nice example quote of this in another post (Richard McElreath has a good predictive model for comment behavior on this blog: https://www.youtube.com/embed/BrchZeyo7cg?start=1712&end=1742&autoplay=1 )

    Now, I find Peirce’s three grades of a concept helpful in thinking about this: the ability to recognise instances of it, the ability to define it, and the ability to get the upshot of it (how it should change our future thinking and actions). For instance, in statistics: the ability to recognise what is or is not a p-value, the ability to define a p-value, and the ability to know what to make of a p-value in a given study. It’s the third that is primary and paramount to “enabling researchers to be less misled by the observations.” But almost all the teaching in statistics is about the first two, and much (most?) of the practice of statistics skips over the third with the usual “this is the p-value in your study, and don’t forget its actual definition” or “this is the posterior from your study, and it answers all your questions!” That is simply not good enough.

    Now, in Richard’s field, perhaps the primary grade at issue is usually “is it this?” or “what is a better definition of these things?”
    (Note: in everything, all three grades are involved; some can be more important in different areas.)

    On the other hand, I have worked with philosophers who have had very successful careers in philosophy, and even some who have written on probability and statistical inference, and they have told me that they don’t understand much about applied statistics (e.g., “can you explain this?”, “please provide comments on my draft,” “your draft was too technical for me,” etc.). So it is not as if we can leave it to philosophers to deal with. (C.S. Peirce was a practising statistician, so he could have, but he never got funding to properly write up his work.)

    • +1. FWIW, while I understand parts of what Glen is getting at, I find the explicit grappling with the meaning of methods very refreshing. In fact, like Keith, I find that it is unavoidable if one cares about doing it well.

  5. What I meant was that whatever science is, it is so much broader than the narrow statistical issues discussed here. It is true that issues surrounding causation are raised by many of the discussions on this blog. There is, though, tacit acceptance of some answers to pertinent questions, especially WRT psychology… a field to which you make reference but of which there is no discussion here. I wasn’t intending to be complete in any kind of criticism… yet. One issue, though, would certainly involve fundamental views on variability – e.g., as intrinsic versus imposed (cf. Sidman, 1960). This is relevant not only to psychology but anywhere individual systems may be rigorously controlled experimentally such that between-subject experiments and even pooling data across subjects are counter-productive. There appear to be other philosophical questions for which the answers are assumed and little philosophizing has actually taken place. But, again, I am not ready to be complete about the blog as a whole WRT philosophical issues.

    • Glen, with respect, I’m surprised to see the impression that philosophical issues are assumed or not thought through. I think one of the distinguishing features of Andrew’s (and Keith’s) approach is how clearly the underlying philosophical motivations are exposed.

      You can find a list of articles here: http://www.stat.columbia.edu/~gelman/research/published/ . The titles mostly make clear when they take a philosophical bent and I think you will find that there is rather a lot of consideration given to the questions you raise. Then again, perhaps I misinterpret what your specific critique is.

  6. Some fundamental philosophical issues are raised and raised nicely. But my impression is that many issues are not touched. I have already complained about, and alluded to here, the fact that single-subject designs are almost never utilized. Andrew suggested that such methods (behind which there is a host of philosophical issues) were well known and linked to a paper of his on pharmacokinetics in which the model was applied to the data from individual subjects. Fine. But how often, when the issue is, say, widespread irreproducible “effects,” is “pre-registration” suggested, while there is never a mention of the fact that many kinds of clinical trials (for example) could be carried out with far fewer subjects (than random-assignment placebo-controlled studies) in a single-subject manner? For some reason, they (SSDs) are not even considered here even though, it could be argued, many problems could be immediately solved by using them. There also seems to be a reductionistic/mechanistic theme that runs through the topics as well. Not uncommon, to be sure, but a philosophical choice, as it were, nonetheless. Anyway, I’m still trying to see what philosophical issues get raised around here. I jumped in because of the mention of psychology. Keith’s point (presumably) was that managing many issues scientifically (meaning through enlightened use of statistics) was as distant as was “scientific psychology” (in Freud’s time). My point there, of course, was that a natural science of behavior (what psychology should be) was in existence near the end of Freud’s life, having originated in the early ‘30s. Like I said before, though, it is my impression that sometimes issues get raised here where there doesn’t seem to be an awareness of the philosophical choices or that a philosophical choice has been made. I will try to be more diligent in the future and jump in when those issues pop up.

    • “many kinds of clinical trials (for example) could be carried out with far fewer subjects (than random-assignment placebo-controlled studies) in a single-subject manner”

      “Many” needs some elaboration/qualification here.

      I can see how single subject designs could make sense when the “treatment” is for management of a medical condition (e.g., asthma), but they seem less appropriate for things like cure of a disease (e.g., pneumonia) — in this case, if the first treatment works, then trying the second treatment is inappropriate. Also, conceivably the first treatment might have had some negative effect that would make the second treatment less effective than if it were the first drug.

      • Martha: “many kinds of clinical trials (for example) could be carried out with far fewer subjects (than random-assignment placebo-controlled studies) in a single-subject manner”

        “Many” needs some elaboration/qualification here.

        I can see how single subject designs could make sense when the “treatment” is for management of a medical condition (e.g., asthma), but they seem less appropriate for things like cure of a disease (e.g., pneumonia) — in this case, if the first treatment works, then trying the second treatment is inappropriate.

        GS: Not sure which “second treatment” you’re referring to here…perhaps a reference to a reversal to the original untreated (or placebo) condition? That would be standard “ABA” design. It is true that a reversal is not only unnecessary given the cure, it is not even possible. In that case, one is left wondering if the treatment (B) was the relevant independent-variable. But there is a bit of irony in what you’re saying since you are implying that the cure may be easily detected even though the experiment might have been a SSD (i.e., the reasoning is still just affirming the consequent). So…why exactly would the design be necessarily inappropriate if it leads to identification of the cure? Again, it is granted that the inference concerning the putative independent-variable is not as strong given the absence of a reversal, and then maybe a return to the B condition. However, if the times of onset of treatment are different for each subject and cessation of symptoms occur in most subjects the inference is strengthened – at least some single extraneous variable (e.g., the researcher’s building begins to be remodeled after the experiment starts etc.) is unlikely to have caused the sick-to-well event in so many. So…an attack on a small N SSD study would probably be better where the IV may have cured only a few people. Would such a small N study be wasted? Doesn’t seem to have been – you now know the effect of the treatment is very likely not dramatically good. And, of course, you would be free to take into account whatever information is already available before the experiment – an apparent effect in 3 of 20 people would be worth pursuing for an illness considered 100% fatal. Now, you might say “Well, what if I wanted to detect a very small effect? Wouldn’t a standard large N between-group study be the way to go?” The answer to that aside, the point is that conducting the small-N study is not necessarily a disaster if a reasonably dramatic cure occurs (your example), even if the experiment is ended at that point and the reversal cannot take place. There’s a lot more that could probably be said here but I’ll leave it there for now.

        Martha: Also, conceivably the first treatment might have had some negative effect that would make the second treatment less effective than if it were the first drug.

        GS: Yes, of course, the ol’ history effects argument. First of all, nothing stops you from doing two experiments (i.e., make a group manipulation even though the separate effects are judged on individuals). So one group gets T1-T2 and the other T2-T1. And, yes, one would look at the data with an eye towards an order effect and anything one would say in that specific regard would be based on between-subject comparisons. But, again, that is like doing two experiments that are both SSDs. One looks across experiments in the same way one does WRT experiments from other labs in the field. In some circumstances, exposing each group to a separate treatment is not called for (given a certain philosophical view). Take drug dose, for example, when one is interested (as one should always be) in the whole dose-effect function. The question is, “a whole dose-effect function for what?” Individuals or groups? The drug effect is an effect on an individual and a dose-effect function is relevant to an individual. In that sense, one has to live with the potential for order effects because that’s how, so to speak, dose-effect functions exist in nature.

        Anyway…that was pretty long-winded for a pretty simple issue –

        1.) SSDs can be used when an effect is irreversible (although the inference concerning causation is much weaker than when the effect can be turned on and off at will).

        2.) Order/history-effects are part of the dialogue with nature. Hybrid experiments (with at least one between-group comparison) can be conducted where the effects are still judged in individual subjects.

        • Me: “[SSDs] seem less appropriate for things like cure of a disease (e.g., pneumonia) — in this case, if the first treatment works, then trying the second treatment is inappropriate.”

          GS: “But there is a bit of irony in what you’re saying since you are implying that the cure may be easily detected even though the experiment might have been a SSD (i.e., the reasoning is still just affirming the consequent).”

          In fact, I was not implying that the cure may be easily detected — but your objection pointed out to me that my original wording was poor. To try to express my intent more clearly: ” – in this case, if the patient recovers from the disease after the first treatment, the second treatment is inappropriate.”

          (I agree that hybrid experiments, combinations of experiments, sequential experiments, etc. are often appropriate.)

  7. “OK, what are some ways to make things less unreasonably difficult?”

    If large numbers of otherwise intelligent people are struggling to correctly apply these techniques, do we fault them or those who teach the techniques? I tend to think that this is as much a pedagogical problem as it is a methodological problem. Are educators providing clear, practical instructions for solving real-world measurement and analysis problems? Not in my limited experience but others may disagree.

  8. > Are educators providing clear, practical instructions for solving real-world measurement and analysis problems?
    The argument I was making, along with others, was that this was simply illusory at present and it is not at all clear how to get there.

    So no intention to suggest it was anyone’s fault – we can only start inquiry into this topic from where we find ourselves.

    More generally, if it really is science it is a struggle, period.

    • +10.
      Here’s why: Because real-world inference (“induction”) and decision depend so much on nondeductive judgments (which are in turn influenced by values and cognitive biases), there will always be plenty of room for disagreement that cannot be resolved by logic or its deductive extensions like Bayesian reasoning. Resolution by agreement only pushes the frontier of disagreement down the deductive line (now that new premises are taken as established); that’s scientific progress.

  9. “… if the patient recovers from the disease after the first treatment, the second treatment is inappropriate.”

    GS: Why would there be a “second treatment”? The standard, simple SSD is “A-B-A’.” A=initial stable state, B=stable state after variable change, and A’=re-exposure to the original conditions and (hopefully) return to the original stable state.

    • Let’s go back to my original point:

      You said that “many kinds of clinical trials (for example) could be carried out with far fewer subjects (than random-assignment placebo-controlled studies) in a single-subject manner”

      I said in response to this, ““Many” needs some elaboration/qualification here.”

      Your latest comment seems to be supporting this assertion — it seems to be saying that a single-subject-design study (at least, not of the A-B-A’ type you discuss) would not work for an RCT comparing two treatments. Or am I still misunderstanding something?

      Also, even in the case of studying a single treatment for pneumonia, I don’t see how the A-B-A’ design would fit: In this case, the initial state is “patient has pneumonia”, which is not necessarily a stable state; B would appear to be “stable state after treatment,” and A’ would appear to be something like “return to having pneumonia” — which sounds dubious from an ethical perspective.

  10. MS: You said that “many kinds of clinical trials (for example) could be carried out with far fewer subjects (than random-assignment placebo-controlled studies) in a single-subject manner”

    I said in response to this, ““Many” needs some elaboration/qualification here.”

    GS: Yes. I did not directly address this point. I have been, at least in posting here, generally careful to say that SSDs are not relevant to all questions. Comparing two treatments is one of them if the first cures the patient.

    MS: Your latest comment seems to be supporting this assertion — it seems to be saying that a single-subject-design study (at least, not of the A-B-A’ type you discuss) would not work for an RCT comparing two treatments. Or am I still misunderstanding something?

    GS: Nope. I just didn’t understand why you brought up the “second treatment” issue. It made me think that you didn’t know what an SSD was since there is not necessarily a “second treatment” involved.

    MS: Also, even in the case of studying a single treatment for pneumonia, I don’t see how the A-B-A’ design would fit: In this case, the initial state is “patient has pneumonia”, which is not necessarily a stable state; B would appear to be “stable state after treatment,” and A’ would appear to be something like “return to having pneumonia” — which sounds dubious from an ethical perspective.

    GS: Well, I don’t know about ethics for sure, but returning to the original state (sick) is not even possible – at least not by simply returning to the “A conditions” since the patient has been cured. But I pointed out that this is not necessarily a death knell for SSDs, though studying reversible effects is what SSDs are made for. One gains experimental (rather than statistical) control of the subject matter and turns effects on and off at will – it gets hard to say that the independent-variable is “not causal” (using “cause” somewhat loosely perhaps) when one is turning the phenomena on and off like one works a light from a switch. As to the ethics of withdrawing a treatment (probably later to be reinstated) that appears effective, I see your point. It does not, however, strike me as any less ethical than withholding treatment that is potentially effective (e.g., in a placebo-control group). In any event, perhaps I was careless this time, but I usually say something like “SSDs should be used where they are relevant.” If there is a question about what should be done about non-replicable “findings,” SSDs should be on that list – yet they are almost never mentioned – anywhere. Why is that? Especially when using SSDs involves direct experimental control of the subject matter?

    • The core concept behind SSD seems to be that you model the time-series of the individual patient and assess whether and how the intervention has altered this time series. The basic method would be to model the state of the individual as a dynamic process through time, whether some kind of markov process, or a differential equation or an algebraic equation for different states, or whatever.

      I TOTALLY AGREE with every aspect of that. I don’t think there is a real case where it’s inappropriate, even though it is obviously inappropriate to do some static set of procedures (i.e., try drug 1 and then try drug 2 regardless of what state the patient is in after drug 1). Even if the goal is just “cure patient of pneumonia” you’d want to start the drug at t=0 and track their respiratory measurements and their white blood cells and so forth through the next N days and try to determine how rapidly you see improvements as a function of dose etc.

      The fact is that partial pooling in a hierarchical model is very useful in such scenarios, where things like Markov transition probabilities or coefficients describing response to a particular event are assumed to have some similarity across people, in terms of a distribution over the sizes of effects (a minimal sketch of this idea appears at the end of this comment). This requires a shift in analysis compared to what is usually done.

      The fact that most medical studies are not designed in terms of a time-series approach is seriously problematic. I’ve been discussing just exactly this issue with my wife who tried out a 5 day partial-fasting treatment recently. The biological evidence that there is something good going on is out there (research out of Valter Longo’s lab at USC). But the analysis of these trials is based on “do this for a while and come back in several months to see how it worked”.

      The design should have been to have each person analyzed with daily blood draws for the first 5 day period, then an analysis of the duration of time to get the effect (reductions in various hormones and circulating cell counts) and then a randomized assignment to different durations for the second and third period… with at least one blood draw at the beginning and the end of the second and third periods, and then an analysis at 3, 6, 12 months of health outcomes.

      The analysis should have then been focused on estimating the goodness of different durations and choosing an optimal duration as a function of initial health/age/sex etc. so that people could get maximum (benefit – cost). As is, their RCT shows that 5 day partial fasts 3 months in a row probably provide some health benefits, but it gives essentially no information about whether, say, 2 day fasts 2 months in a row or a 3 day fast once per year or whatever would do as well.

      If we stop thinking like a black box (put in coin, press button, see what outcome was later) and start thinking about the evolution of health status through time, and the factors that cause the changes, we will be a lot better off.
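
      To make the partial-pooling point above concrete, here is a minimal numpy sketch (all numbers are made up, and the moment-based shrinkage is a crude stand-in for a full hierarchical fit, e.g. in Stan): each subject's intervention effect is estimated from their own pre/post time series and then shrunk toward the group mean, with more shrinkage when the per-subject estimates are noisy relative to the between-subject spread.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)

      # Hypothetical single-subject time series: 8 subjects measured daily; an
      # intervention starting on day 10 shifts each subject's level by a
      # subject-specific amount drawn from a population distribution.
      n_subj, n_days, t0 = 8, 30, 10
      pop_effect, pop_sd, noise_sd = 3.0, 1.0, 2.0
      true_effects = rng.normal(pop_effect, pop_sd, n_subj)

      # "No pooling" estimate per subject: post-intervention mean minus pre mean
      est = np.empty(n_subj)
      for j in range(n_subj):
          baseline = rng.normal(0.0, 1.0)
          level = baseline + np.where(np.arange(n_days) >= t0, true_effects[j], 0.0)
          y = level + rng.normal(0.0, noise_sd, n_days)
          est[j] = y[t0:].mean() - y[:t0].mean()

      # Partial pooling: shrink each estimate toward the group mean, weighting by
      # the sampling noise of each estimate vs. the between-subject spread
      # (a crude moment-based stand-in for fitting the hierarchical model).
      se2 = noise_sd**2 * (1.0 / t0 + 1.0 / (n_days - t0))  # sampling variance of each estimate
      tau2 = max(est.var(ddof=1) - se2, 0.0)                 # rough between-subject variance
      w = tau2 / (tau2 + se2)
      pooled = w * est + (1 - w) * est.mean()

      print("no pooling:     ", np.round(est, 2))
      print("partial pooling:", np.round(pooled, 2))
      ```

      In a real analysis the between-subject spread, the within-subject noise, and richer dynamics (trends, Markov-style state changes) would be estimated jointly, but the direction of the adjustment is the same.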
