A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry?

Following a recent post that mentioned “the Sherlock Holmes of food,” we got this blockbuster comment from Ujjval Vyas, which seemed worth its own post:

I work in an area where social psychology is considered the gold standard of research and thus the whole area is completely full of Wansink stuff (“people recover from surgery faster if they have a view of nature out the window”, “obesity and diabetes are caused by not enough access to nature for the poor”, biomimicry is a particularly egregious idea in this field). No one even knows how to really read any of the footnotes or cares, since it is all about confirmation bias and the primary professional organizations in the area directly encourage such lack of rigor. Obscure as it may sound, the whole area of “research” into architecture and design is full of this kind of thing. But the really odd part is that the field is made up of people who have no idea what a good study is or could be (architects, designers, interior designers, academic “researchers” at architecture schools or inside furniture manufacturers trying to sell more). They even now have groups that pursue “evidence-based healthcare design” which simply means that some study somewhere says what they need it to say. The field is at such a low level that it is not worth mentioning in many ways except that it is deeply embedded in a $1T industry for building and construction as well as codes and regulations based on this junk. Any idea of replication is simply beyond the kenning in this field because, as one of your other commenters put it, the publication is only a precursor to Ted talks and keynote addresses and sitting on officious committees to help change the world (while getting paid well). Sadly, as you and commenters have indicated, no one thinks they are doing anything wrong at all. I only add this comment to suggest that there are whole fields and sub-fields that suffer from the problems outlined here (much of this research would make Wansink look scrupulous).

Here’s the Wikipedia page on Evidence-based design, including this chilling bit:

As EBD is supported by research, many healthcare organizations are adopting its principles with the guidance of evidence-based designers. The Center for Health Design and InformeDesign (a not-for-profit clearinghouse for design and human-behaviour research) have developed the Pebble Project, a joint research effort by CHD and selected healthcare providers on the effect of building environments on patients and staff.

The Evidence Based Design Accreditation and Certification (EDAC) program was introduced in 2009 by The Center for Health Design to provide internationally recognized accreditation and promote the use of EBD in healthcare building projects, making EBD an accepted and credible approach to improving healthcare outcomes. The EDAC identifies those experienced in EBD and teaches about the research process: identifying, hypothesizing, implementing, gathering and reporting data associated with a healthcare project.

Later on the page is a list of 10 strategies (1. Start with problems. 2. Use an integrated multidisciplinary approach with consistent senior involvement, ensuring that everyone with problem-solving tools is included. etc.). Each of these steps seems reasonable, but put them together and they do read like a recipe for taking hunches, ambitious ideas, and possible scams and making them look like science. So I’m concerned. Maybe it would make sense to collaborate with someone in the field of architecture and design and try to do something about this.

P.S. It might seem kinda mean for me to pick on these qualitative types for trying their best to achieve something comparable to quantitative rigor. But . . . if there are really billions of dollars at stake, we shouldn’t sit idly by. Also, I feel like Wansink-style pseudoscience can be destructive of qualitative expertise. I’d rather see some solid qualitative work than bogus number crunching.

83 thoughts on “A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry?”

  1. The concerns expressed are indeed important. On the topic of “personalized medicine” I have been deeply troubled by what appears to be a hornet’s nest of future bad research. Some personalized evidence-based (the two prevalent buzzwords) medicine appears to be soundly based on science – I am thinking of the use of genetic information in improving diagnosis and treatment. However, much of the hype appears to be based on a flawed model of applying statistical models to individuals – an area that promises far more than it can deliver. Given that the value of models lies in producing average relationships (or, better yet, subgroup relationships), the idea that models will produce “personalized” results seems to me to be off-target – and in seriously costly (both financial and health-wise) ways.

    We all want evidence-based medicine: who wouldn’t? But the questions are what evidence, evaluated how, and on what basis? Wansink-style approaches can pose great dangers in this realm. I’d like to see serious efforts to avoid the worst analytical practices and to promote better ones. How do we do that? One option is to try to control who does the analysis and how they do it (e.g., what qualifications are required, requiring preregistration, requiring particular approaches such as multi-level modeling, reporting of all results, etc.). I believe many such practices would be valuable, but they are not going to prevent the proliferation of over-hyped under-powered studies. In fact, I’d venture to predict that we will see a continuation of the trend towards poorly designed, non-reproducible, and overly promoted studies purporting to be “evidence-based.” What we need is a better understanding of what evidence is and how to evaluate it. And it is needed on a much broader basis than just the practicing analysts. Unless the general public is better capable of understanding evidence, I fear that poor practices are likely to grow, not shrink.

    • The key question about these EBM studies is:
      “Evidence of what?”

      The (usual) answer is:
      “Evidence the groups were not sampled from identical distributions for some reason.”

      Everyone just wants to plug in their favorite thing instead of “for some reason” with little to no justification. But coming up with that justification is “theory”, and these are “practical” people who use “practical” tools like NHST (which does indeed result in “practical” results like publications and clients).
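
      For concreteness, a minimal simulation of that point (Python, made-up numbers, not from any study): a two-sample test can return a tiny p-value without telling you anything about why the groups differ.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)

        # Two groups whose outcomes differ only through an unmodeled nuisance
        # variable (appetite, say), not through the mechanism anyone cares about.
        appetite_a = rng.normal(1.0, 0.3, 200)
        appetite_b = rng.normal(1.5, 0.3, 200)
        outcome_a = 10 - 2 * appetite_a + rng.normal(0, 1, 200)
        outcome_b = 10 - 2 * appetite_b + rng.normal(0, 1, 200)

        t, p = stats.ttest_ind(outcome_a, outcome_b)
        print(f"t = {t:.2f}, p = {p:.2g}")
        # A tiny p-value here is exactly "evidence the groups were not sampled
        # from identical distributions for some reason" -- it does not say why.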

    • “Given the value of models to produce average relationships (or, better yet, subgroup relationships), the idea that models will produce “personalized” results seems to me to be off-target – and in seriously costly (both financial and health-wise) ways.”

      What’s really meant is that instead of deciding whether to prescribe you a medication based on its estimated average treatment effect in the population, the decision will be made based on its estimated average treatment effect among some subpopulation sharing many of your measured covariates. Sure, ‘personalized’ is a misnomer. But ‘*conditional* average treatment effect based medicine’ would still be an improvement over plain ‘average treatment effect based medicine’. I don’t know why you think it would be ‘costly’.

      The problem is that we can’t accurately estimate treatment effects conditional on many covariates from observational data because of confounding or from randomized trials because of sample size constraints.
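
      A rough sketch of the distinction, with simulated data and a made-up biomarker effect (Python): the “conditional” estimate is just an average effect within a stratum of measured covariates, and it gets noisier as the stratum shrinks.

        import numpy as np

        rng = np.random.default_rng(1)
        n = 2000
        biomarker = rng.integers(0, 2, n)      # a measured covariate
        treat = rng.integers(0, 2, n)          # randomized treatment
        # Assume (for illustration only) the drug helps only biomarker-positive patients.
        effect = np.where(biomarker == 1, 2.0, 0.0)
        outcome = 5 + effect * treat + rng.normal(0, 3, n)

        def diff_in_means(mask):
            return (outcome[mask & (treat == 1)].mean()
                    - outcome[mask & (treat == 0)].mean())

        ate = diff_in_means(np.ones(n, dtype=bool))
        cate_pos = diff_in_means(biomarker == 1)
        cate_neg = diff_in_means(biomarker == 0)
        print(f"ATE estimate: {ate:.2f}")
        print(f"CATE (biomarker+): {cate_pos:.2f}, CATE (biomarker-): {cate_neg:.2f}")
        # Each conditional estimate uses only part of the sample, so it is
        # noisier; slicing on many covariates at once makes this much worse.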

      • Yes, as the subgroups get more “personalized” the accuracy (reliability, or some other descriptor) necessarily decreases. So, the tradeoff between more targeted diagnosis and treatment and more accuracy becomes more pronounced. The “costs” I am referring to are the costs of providing treatments to individuals based on evidence that cannot really support those particular treatments. Yet the treatments have both monetary and health costs – as does the lack of providing such treatments. So, what are we to do when the costs of mistakes rise (either the mistake of providing or withholding treatment)? The room for charlatans increases in such an environment rather than decreasing.

        • I agree, costs currently come from attempts at being too personalized, not from being not personalized enough. I got the impression you were saying the reverse.

      • Stratified Medicine is another key concept (and sadly buzzword too!) here. What Z described in “what’s really meant…” would fit the description of stratified medicine. To put it simply, personalized medicine is too ambitious for health-care implementation in the next 5-10 years (made up time-scales), so we fall back to the “less” ambitious goal of implementing “stratified medicine” instead of the full on n=1 you are special personalized medicine.

    • Dale,

      “Some personalized evidence-based (the two prevalent buzzwords) medicine appears to be soundly based on science – I am thinking of the use of genetic information in improving diagnosis and treatment. However, much of the hype appears to be based on a flawed model of applying statistical models to individuals – an area that promises far more than it can deliver.”

      Unfortunately, the genetic part is also mostly hype. I have a paper on this very subject: https://www.researchgate.net/publication/261921048_The_Role_of_Genetic_Information_in_Personalized_Medicine

    • I find this whole field somewhat of a marketing scheme. There is indeed progress in genetically-informed medical decision making. Some of this is due to associations that we don’t understand and the best of it has led us to better understand the biomolecular mechanisms. However, modern medicine is already “personalized”. The TNM staging system in cancer has been around for decades and it is explicitly designed to predict outcomes based on a person’s particular tumor.

      The statistical issues underlying the use of terabytes of genetic data from thousands of people are quite different than those relevant to the small numbers (dozens or at most hundreds) of patients in clinical trials or in a single institution that inform many medical studies, particularly in cancer.

      To me, “evidence based personalized medicine” is just another set of buzzwords to get money or position. I agree that this area is ripe for bad statistics and bad medicine.

  2. “As EBD is supported by research, many healthcare organizations are adopting its principles with the guidance of evidence-based designers.”

    Just replace the acronym.

    “As KFC is supported by research, many healthcare organizations are adopting its principles with the guidance of Kentucky Fried Chickens.”

    “As IBM is supported by research, many healthcare organizations are adopting its principles with the guidance of International Business Machines.”

    • Cody – IBM is already there in a big way (e.g. exploratory research on mental health on campus).

      (They do employ a lot of scientists, but they do have those shareholders to take care of.)

  3. Yup – solid qualitative upstaged by bogus number crunching.

    The only solution is a more general education on methods of inquiry that are focused on avoiding being misled by observations and that continually evolve to be misled less and less over time. Using the label scientific for such inquiry is unnecessarily limiting as well as over-imbued with misconceptions.

    Or other ways to reach the public – https://www.youtube.com/watch?v=J5A5o9I7rnA&feature=youtu.be

    OK – that never worked for claimed cures and snake oil – so perhaps an FIA (Federal Investigation Agency) to fill the void of effective and widely communicated peer review.

    • Really like your point about solid qualitative upstaged by BS quantitative.

      There are lots of problems where qualitative is OK — interior design is one of them! I like buildings with light and plants. Hard to do anything quantitative to support that.

      The key is getting rid of the myth that any numerical claim (esp w/ stat significance) is always better than a qualitative one. I’m a quant person, but bad quantitative analysis is often worse than none.

        • If you can do it with pigeons, you can do it with people:

          https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1333524/pdf/jeabehav00103-0025.pdf

          Abstract:

          Four slightly hungry pigeons chose between pairs of grains in a Findley concurrent choice procedure. For Condition I, choice involved hemp versus buckwheat; for Condition II, wheat versus buckwheat; and for Condition III, hemp versus wheat. In all conditions, frequency of reinforcement was arranged according to concurrent variable-interval variable-interval schedules. On the assumption that subjects matched their behavior and time distributions to those of reinforcer value, the choice functions obtained in Conditions I and II were transformed to yield estimates of values of hemp and wheat relative to buckwheat. These, in turn, provided predictions about behavior and time allocation in Condition III. In general, the predicted outcomes were close to those actually obtained. The results evidence the effectiveness of matching-based hedonic scales in the prediction of choice between qualitatively different reinforcers.

          Key words: matching law, food preference, reinforcer quality, concurrent variable-interval schedules, pigeons

        • I haven’t checked everything in detail but this looks like some great research. Exactly what I’d like to see more of. The key is that they have some equations derived from the matching law, law of effect, etc to work with rather than checking for a difference between groups.
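
          For readers who haven’t seen it, the equation being referred to is the generalized matching law, log(B1/B2) = s·log(R1/R2) + log(b), relating the ratio of responses on two alternatives to the ratio of obtained reinforcers. A toy fit with made-up session data might look like this (Python):

            import numpy as np

            # Made-up session data: reinforcers obtained (R1, R2) and responses
            # emitted (B1, B2) on two concurrent VI schedules.
            R = np.array([[10, 40], [20, 30], [25, 25], [30, 20], [40, 10]])
            B = np.array([[120, 380], [210, 300], [260, 250], [330, 190], [420, 110]])

            x = np.log(R[:, 0] / R[:, 1])      # log reinforcer ratio
            y = np.log(B[:, 0] / B[:, 1])      # log response ratio

            # Fit log(B1/B2) = s * log(R1/R2) + log(b) by least squares.
            s, log_b = np.polyfit(x, y, 1)
            print(f"sensitivity s = {s:.2f}, bias b = {np.exp(log_b):.2f}")
            # s near 1 and b near 1 is "strict" matching; the hedonic-scaling paper
            # above uses this kind of relation to put different grains on one scale.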

  4. I increasingly think that things like this are an unfortunate side-effect of the dominance of science in modern thought. It’s not sufficient for fields to rely on observation and reflection — even things like anthropology, design, etc., need to be “rigorous.” Since quantitative rigor is very difficult, especially at the scale of small, short studies, we’ve ended up with a mess of bad “science.”

    As yet another example, research-based methods in higher education are all the rage. There are excellent studies on effective teaching methods, but there’s an awful lot of garbage — p-hacked, confirmation-biased, poorly designed work. And I write this as a strong proponent of things like active learning (and assessment of my own teaching), and the co-director of my university’s Science Literacy Program, which promotes research-based methods!

  5. Both in this blog post (and the subsequent comments) as well as in other posts, we are given clear examples where poor uses of or misuse of statistics (e.g. p-hacking, garden of forking paths, underpowered studies, etc.) are endemic in a wide array of fields such as social psychology or (as in this example), architecture.

    My question to all of you here would be: can you think of examples of fields of study (outside of physics, which has been used as an example by others) where (in general):

    1. Statistical methodologies are used appropriately (e.g. where p-hacking isn’t a serious problem)

    2. Reproducibility and replicability is (comparatively) not as serious a problem.

    3. Overall, the published results of such studies can be generally believed/trusted.

    • “the statistics profession has been spending decades selling people on the idea of statistics as a tool for extracting signal from noise, and our journals and textbooks are full of triumphant examples of learning through statistical significance; so *it’s not clear why we as a profession should be trusted going forward, at least not until we take some responsibility for the mess we’ve helped to create*” [emphasis mine]
      from Gelman and Carlin, http://www.stat.columbia.edu/~gelman/research/published/jasa_signif_2.pdf

    • Micro-econometrics and micro causal-inference econometrics of the past ten years have often been relatively good at not letting anything egregious through. Although as the field adapts to massive micro-data sets and new computational tools, there is a lot of scientific work to be done to remain vigilant with respect to these new tools and data paradigms.

    • “My question to all of you here would be: can you think of examples of fields of study (outside of physics, which has been used as an example by others) where (in general):

      1. Statistical methodologies are used appropriately (e.g. where p-hacking isn’t a serious problem)

      2. Reproducibility and replicability is (comparatively) not as serious a problem.

      3. Overall, the published results of such studies can be generally believed/trusted.”

      1.) Behavior analysis

      2.) Behavior analysis

      3.) Behavior analysis

        • I think he’s talking about “old school” behaviorist research (e.g., Skinner, Thurstone, Thorndike, Gulliksen, etc.).

        • No, Glen S is talking about “behavioural analysis” (even if we spell it differently). He may not agree with me, but I’d call it a specific area of psychology.

          If all you know about is the froth that often emerges from Social Psych studies you may not appreciate what areas psychologists study.

          It’s a bit like me. I have never taken an Economics course so much of Economics reads/sounds like woo. Things like the R & R fiasco do not make me more trustful.

          And as someone with a fairly solid grounding in behavioural analysis, albeit many years ago, I really don’t believe/trust some of the basic assumptions that micro-economics has made in something like utility theory.

        • “I thought behavioural analysis research did not need “stats’.”

          GS: Behavior analysis. The term “behavioral analysis” is used to describe any number of endeavors that bear no resemblance to behavior analysis. The original name (coined by Fred Skinner) was “the experimental analysis of behavior” but, I think, when the applied field began to explode they needed something to call themselves and “applied experimental analysts of behavior” didn’t cut it. Eventually all of us – basic science types included – became “behavior analysts” and the field, “behavior analysis.” As to “stats,” you are certainly right that it (BA) doesn’t use NHST, which has pretty much always been treated like a joke. And BA rarely pools data from individual subjects (though the paper I posted elsewhere in this thread on hedonic scaling showed data averaged across the four subjects). Behavior analysis is quantitative, though, as its goal is the delineation of reliable functions that relate past history and current environment to behavior. Sometimes the term “quantitative” is used to designate, in particular, mathematical hypothesis testing within behavior analysis, as in the Society for the Quantitative Analysis of Behavior. So…maybe it shouldn’t be said of behavior analysis that “Statistical methodologies are used appropriately,” but certainly “p-hacking isn’t a serious problem.” As to #2 and #3 above, both statements (“Reproducibility and replicability is…not [a] serious…problem” and “Overall, the published results of such studies can be generally believed/trusted”) certainly characterize behavior analysis.

        • Thanks for the reply Glen. It may help people appreciate “behavioural analysis” as a discipline.

          My comment re stats was only a joke about behavioural analysis for the behaviourists amongst us.

          I mean, “Could B.F. Skinner actually calculate an s.d.?” Don’t answer that! It might destroy my illusions.

          Is it worthwhile (for me as an amateur) looking into the Society for the Quantitative Analysis of Behavior?

        • SQAB (get it?…because basic researchers often use pigeons) has a meeting and I don’t know if they publish anything. Probably some of the stuff presented there winds up in the Journal of the Experimental Analysis of Behavior. I don’t know how you’d check it out short of going there. But if you’re interested in the “math stuff” that comes out of what remains of the basic science (hey…it ain’t about the BRAIN, you know, and it ain’t based on a computer metaphor that can be construed as being sort of about the brain, so who would be interested in a natural science of behavior?) you can poke around JEAB. In another post in this thread I posted a link to the archives from 1958-2012 (after that, the Society could no longer afford to publish it I’m assuming, so now Wiley does). Up until then (after the advent of the internet), all of the papers were freely available.

    • Pivotal drug approval trials for the US FDA. You get: Pre-registration of the trials including outcomes, usually at least two trials for approvals in major indications (with 2 times p<=0.05 on the pre-specified primary variable being the standard criterion for approval), for many of these trial design and size discussed with FDA before trial (i.e. usually adequately powered for relevant effect sizes, meaningful outcomes), pre-specified analysis plans that are discussed with the authorities prior to trial start, for the primary and main secondary outcomes strict type I error control, FDA inspection of adherence to procedures (such as not messing with your analysis plan in an undocumented way after finalizing it, appropriate data handling), any data issues resolved before treatment codes are unblinded and analysis programs written before treatment codes are unblinded. It may not be perfect, but it seems like a pretty decent try at avoiding approvals of ineffective drugs (or those with extremely marginal effects). And to some extent the pivotal trials are kind of replications of early stage trials, which one would hope to screen out the less promising drug candidates. (Conflicts of interest: I work in the pharmaceutical industry.)

      • What do you make of the claims made by the COMPare team that outcome switching is rampant and preregistered plans aren’t being sufficiently enforced? Their evidence is here: http://compare-trials.org/

        I’m guessing that the rules are stricter for FDA approval than for publishing in a medical journal. Still, when I saw these results I was concerned – the whole idea behind pre-registration is that it’s adhered to, right?

        • I was mostly talking about the FDA approval process (which on a whole I trust a lot) and I was a bit surprised by these compare-trials results (and you can only applaud them for making these cases more prominent). Not really great showcases for how pre-registration works… At least it’s identifiable (if with too much effort – you kind of wonder whether this wouldn’t be a perfect task for automatic detection…).

        • Bjorn,

          I have encountered quite a few other concerns with FDA trials. Examples:

          The Cochrane Collaboration study of Tamiflu.

          Other “partial publication bias” (in addition to outcome switching) in clinical trials (e.g., http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001566)

          The comment from clinical-trials-dot-gov director Deborah Zarin,

          “We are finding that in some cases, investigators cannot explain their trial, cannot explain their data. Many of them rely on the biostatistician, but some biostatisticians can’t explain the trial design. So there is a disturbing sense of some trials being done with no clear intellectual leader,”

          as quoted in Marshall E. (2011), Unseen world of clinical trials emerges from US database, Science 333:145.

        • Bjorn and Martha:

          When the FDA approval process goes as planned it’s pretty good – dramatically better than academically reviewed and published trials. It does require a fully competent team with thoughtful leadership and protection from political and bureaucratic pressure. Of course, that does not always happen.

          The Cochrane Collaboration study of Tamiflu was perhaps the first publicly available comparison of regulatory-reviewed versus academically reviewed _evidence_, which led me to suggest to a senior member of Cochrane that they shut it down until they had access to trial data that would allow them to upgrade to regulatory review. They responded that they had to make do with what they currently have, which I really can’t disagree with. There is just a lot of uncertainty that is undocumented and not adequately portrayed in Cochrane reviews.

          Additionally (beyond these problems) in Tamiflu, there are real problems generalizing from the patients in the RCTs during seasonal flu to at-risk patients during pandemic flu that Tamiflu is primarily stockpiled for. From second-hand accounts, there are some compelling observational studies for that _real_ question.

      • Pivotal drug approval trials for the US FDA…with 2 times p<=0.05 on the pre-specified primary variable being the standard criterion for approval

        Are you aware that this is the step many people take issue with? It is not possible to resolve the real problems by performing that step more perfectly.

        A quick example on a group of cancer patients:
        1) Group A gets chemotherapy X while Group B does not.
        2) Group A shows a doubling (both practically and statistically “significant”) in 5 year survival compared to Group B, and a similar reduction in tumor growth rates. Actually, pretend Omniscient Jones tells you this will happen in any cancer patient who receives chemotherapy X. I.e., there is no room for doubt about this.
        3) All the pre-registration, control, blinding, etc is perfect.

        Should we give people chemotherapy X?

        We still have no idea. Why? Chemotherapy X is supposed to poison cells selectively in S/M-phase, and maybe it does in vivo, but as a “side effect” chemotherapy X also suppresses appetite. The suppressed appetite leads to caloric restriction, which also slows the growth of cancers. There are always other possible explanations as well, but let’s stick with that one for now.

        Does it make sense to give cancer patients a $20k/month poison that also damages their healthy tissue, when the same cancer growth-inhibition could be achieved by (admittedly severe) caloric restriction which would actually save the patient money?

        • On the example: You could argue that at least you have pretty reasonably established causality (i.e. chemotherapy X causes longer survival, at least in a 5 year timeframe). But sure, you do not know for sure why, you do not know whether the same thing could be achieved in a different way (e.g. strict diet, which as you say would be cheaper and probably have fewer side effects) and so on. But to be fair, that was not the question that one attempts to answer with this type of trial.

          Regarding the two times p<=0.05 bit, insofar as it is not about the problem in the example: While I am fully aware what a mess a p<=0.05 = “proof” criterion is, at least with two trials achieving that, it is usually based on a likelihood from the two trials that translates into a considerable posterior probability that the treatment effect is non-negligible under a pretty skeptical prior. And of course you do want a pretty skeptical prior: just a bit more than 10% of the drugs that start out being tested in humans ever get approved, which you could interpret as suggesting that a lot of them do not do much (of course, there are other reasons why drug development projects get stopped). Similarly, when you look at the distribution of results you get when comparing new drugs versus existing standards of care, you get a distribution that has a lot of mass near no difference.
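
          A back-of-the-envelope version of that argument, with purely illustrative numbers (10% prior, 80% power per trial, Python):

            # Skeptical prior that a drug entering pivotal trials truly works,
            # two independent trials each "significant" at p <= 0.05.
            prior = 0.10          # assumed prior probability the drug works
            power = 0.80          # assumed power of each trial if it works
            alpha = 0.05          # per-trial false-positive rate if it doesn't

            p_data_given_works = power ** 2
            p_data_given_null = alpha ** 2
            posterior = (prior * p_data_given_works) / (
                prior * p_data_given_works + (1 - prior) * p_data_given_null)
            print(f"P(drug works | two significant trials) ~ {posterior:.3f}")
            # ~0.97 with these made-up numbers; the point is only that two
            # pre-specified significant trials move a skeptical prior a long way.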

        • But to be fair, that was not the question that one attempts to answer with this type of trial.

          Isn’t that the question people want an answer to? Should we give this treatment to patients? This seems to require knowing “Is the treatment working as expected?”

          And of course you do want a pretty skeptical prior, just a bit more than 10% of the drugs that start out being tested in humans ever get approved, which you could interpret suggesting that a lot of them do not do much (of course, there are other reasons for why drug development projects get stopped). Similarly, when you look at the distribution of results you get when comparing new drugs versus existing standards of care, you get a distribution that has a lot of mass near no difference.

          Isn’t this what you would expect to happen if the vast majority of preclinical trials are based on an incorrect understanding of what is going on? E.g., that rat got more rewards because he was hungrier, not because he had improved brain function. This mouse had smaller tumors because intra-tumor edema was reduced, not because the cancer cells divided slower or died.

    • I’d presume that areas which require more statistical expertise to perform the analyses in the first place would be less susceptible to these problems than fields where people with little expertise can do the analysis by clicking on “ANOVA” in their software. So for example, in ecology, hierarchical Bayesian modeling is common, and researchers get lots of statistical training. I don’t think a Wansink would be able to crank out junk analyses of deer migration patterns.
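
      For what it’s worth, the core idea those hierarchical models implement is partial pooling. A minimal numpy sketch of that idea, with fabricated deer-survey data and variances assumed known for simplicity (real ecological models are of course far richer):

        import numpy as np

        rng = np.random.default_rng(2)

        # Fabricated data: deer counts at 8 sites, unequal numbers of surveys.
        n_surveys = np.array([3, 5, 8, 12, 4, 20, 6, 10])
        true_site_means = rng.normal(50, 10, size=8)
        site_means = np.array([rng.normal(m, 15, k).mean()
                               for m, k in zip(true_site_means, n_surveys)])

        sigma_within = 15.0   # assumed known survey-to-survey sd
        tau = 10.0            # assumed known between-site sd
        grand_mean = site_means.mean()

        # Partial pooling: shrink each site mean toward the grand mean,
        # more strongly when that site has fewer surveys.
        se2 = sigma_within ** 2 / n_surveys
        weight = tau ** 2 / (tau ** 2 + se2)
        pooled = weight * site_means + (1 - weight) * grand_mean
        print(np.round(site_means, 1))
        print(np.round(pooled, 1))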

    • Phonetics, the study of speech sounds and speech perception. It’s guided by strong theories, so there are rarely too many researcher df. I know of no significant replicability scandals in the history of the field (there have been failed replications that have sparked arguments, but they’re few and far between and usually limited to very specific results). And we may or may not be using all the right stats, but mixed-effects regression has become standard and the really knowledgeable stats people in the field are using fancier things than that.

      • Wanted to clarify but don’t know how to edit. For NHST, mixed-effects models have become standard. But there’s also a lot of work that’s more exploratory and model-based, and nobody insists on p-values in that stuff. Also, there’s the question of why the field is this way. I think the short answer is that we’re a small field, nobody has ever heard of us, and there’s no particular pressure to get big grants or publish in allegedly prestigious general journals. Plus, hiring committees tend to be focused on research quality and having lots of clever little publications doesn’t really help you much.

      • “mixed-effects regression has become standard”

        Mixed effects models cover a lot of possibilities — so there is always the question of “is this particular mixed effects model appropriate for the question being studied?” Perhaps this is given adequate attention in your field, but I have seen plenty of papers in other fields that just say, “we used a mixed effect model,” without giving details of nor justification for the particular model used.
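
        To make the point concrete, here is a small simulated example of two models that would both get reported as “a mixed effects model” but encode different assumptions; it uses Python’s statsmodels rather than lme4, purely for illustration:

          import numpy as np
          import pandas as pd
          import statsmodels.formula.api as smf

          rng = np.random.default_rng(3)
          subjects = np.repeat(np.arange(30), 40)
          condition = np.tile(np.repeat([0, 1], 20), 30)
          # Simulate subject-varying intercepts AND subject-varying condition effects.
          u0 = rng.normal(0, 50, 30)[subjects]
          u1 = rng.normal(0, 30, 30)[subjects]
          rt = 500 + u0 + (40 + u1) * condition + rng.normal(0, 60, subjects.size)
          df = pd.DataFrame({"rt": rt, "subject": subjects, "condition": condition})

          # Model A: random intercepts only.
          m_a = smf.mixedlm("rt ~ condition", df, groups=df["subject"]).fit()
          # Model B: random intercepts and random condition slopes.
          m_b = smf.mixedlm("rt ~ condition", df, groups=df["subject"],
                            re_formula="~condition").fit()
          print(m_a.bse["condition"], m_b.bse["condition"])
          # The fixed-effect estimates are similar, but the standard errors differ,
          # which is exactly the kind of detail "we used a mixed effect model" hides.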

        • Good point. What I was trying to gesture at is that my field has struggled with random effects for its entire history, because most of our designs involve multiple subjects repeating multiple items in multiple conditions (sometimes with slightly different items in different conditions). But lme4 made a huge difference. Once it became easy to construct models with nested/crossed random effects in a reasonably understandable stats environment, just about everybody quickly recognized that our ‘language as fixed effect fallacy’ problem was over and we started modeling random effects. And because that package generally refuses to spit out theoretically questionable p-values, this has also had a salutary effect on the way that we think about differences and effect sizes.

        • I’m a fan of the blog and have been reading about Stan here for a while. I’m working up the courage to look into it, maybe with the help of colleagues who know math and programming better than me. I gather rstanarm puts some of the functionality of Stan into R? If that question is completely clueless, it is fully indicative of my background here.

        • rstanarm is indeed great. In lme4’s defense, though, have you tried confint(fitted_model) (“uncertainties for the estimated variance parameters”) lately?

        • It’s very likely that rstanarm will provide more opportunities for mistakes, and so will be less idiot-resistant (nothing’s idiot-proof). But it (and Stan more generally) offers more flexibility and more opportunities for fitting, reporting, and illustrating more informative models. In my opinion, anyway.

      • I am a lot less sanguine about the use of “mixed effect” (i.e., multilevel) models and researcher degrees of freedom issues in phonetics. They may well be less of a problem in phonetics than in many other fields, but they are still a problem. I see plenty of over-reliance on p-values, too (the fact that lme4 doesn’t give you p-values is neither here nor there, given that there are multiple options for getting them anyway).

        Now, I don’t want to sound too pessimistic, either. I have repeatedly found phoneticians (and people in neighboring fields) to be refreshingly open to unconventional and often rather complex statistical tools. For example, I have gotten very little pushback from phonetic-y reviewers about some complex multilevel Bayesian models I’ve developed (at least in some journals – I’m working really hard to resist the urge to name a mixed-effect-model-obsessed journal from which I got some remarkably ignorant reviews one time).

        I think phonetics is helped by the fact that it’s standard practice to collect a lot of data. This doesn’t make it easier to do good modeling, of course, but as you say in a comment below, experimental designs in phonetics are typically fairly complicated, and this has engendered a culture of being willing to engage with and try to do good modeling. And having a lot of data can at least help phoneticians avoid some of the problems with noisy estimates of quantities of interest.

        • That’s reasonable, and in fact I shouldn’t have touted mixed effects as a measure of our sophistication or whatever I was doing. I think the apparent robustness of results in phonetics has more to do with strong theory and your point about data collection. And also the fact that many of our effects (particularly acoustic ones) are just massive and robust, relative to the kinds of things that get studied in, e.g., social psych. My dissertation was a mess of forking paths, because I was learning stats as I analyzed the experiments. But I would bet my life the main acoustic effects replicate, because they were *huge*. In any case, I concede your point about the use and misuse of (g)lmers. But I still think most results in phonetics are quite robust.

        • Yes, that’s a good point, too. Lots of effects are big and obvious, and many effects that are small in an absolute sense are often very consistent, so there’s relatively little noise.

      • DL: Isn’t Wansink’s work behavior analysis?

        GS: No.

        DL: Cuddy?

        GS: No.

        DL: Perhaps GS needs to define what is meant by ‘behavior analysis.’

        GS: Here’s a definition that you currently need more:

        Definition of humility

        : freedom from pride or arrogance

        : the quality or state of being humble •accepted the honor with humility •The ordeal taught her humility.

        Cordially,

        Glen

        • Glen,

          Dale’s question sounded legitimate to me. Words and phrases can mean different things to different people and in different fields, so we often need to clarify our definitions of them, particularly in discussions like this, that involve people from different backgrounds. Please give your definition of behavior analysis; neither Dale nor I can read your mind.

        • MS: Please give your definition of behavior analysis

          GS: SOME CHARACTERISTICS OF BEHAVIOR ANALYSIS:

          A natural experimental science that has the behavior of organisms as its subject matter

          Uses single-subject designs

          Eschews theoretical entities that exist “…at some other level of analysis [than behavior]…measured, if at all, in some other dimensions” (B. F. Skinner).

          Embraces selectionism (https://naturaldrift.files.wordpress.com/2009/12/skinner-1981-selection-by-consequences.pdf)

          Is contextualistic (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1338844/pdf/jeabehav00033-0099.pdf)

          Inductive and data-driven (though an accumulation of facts and principles allows mathematical hypothesis testing; Skinner’s view was that the hypothetico-deductive model failed when there was a shortage of basic functional relations and was most applicable when the subject-matter was difficult to observe directly)

          Makes cumulative progress (sort of another way to say it is a natural science)

          Provides (or should provide) the phenomena to be explained by a reductionistic natural science (more of which we’d see if neuro”science” hadn’t been thoroughly corrupted by cognitive psychology – i.e., if more of neuroscience was a natural science)

          Has as its goal the prediction, control and interpretation of the behavior of individuals

          Stresses thorough conceptual analysis of its basic terms

          Details the functional relations that obtain between environment and behavior (BA studies directly the relation between manipulations of the past and current environments and the dependent-variable – behavior)

          And finally…is widely misunderstood and misrepresented

          For a few (ha ha) examples of behavior analytic work see: https://www.ncbi.nlm.nih.gov/pmc/journals/299/

        • Wansink uses standard between groups designs and uses NHST. No aspect of what he does appears to be behavior analytic. I’d go on, but you appear to just be a troll.

        • Ah, how differently different people can see things! I didn’t find Dale’s posts flippant or arrogant — to me they just sounded like he was confused and asking for clarification.

        • I can understand both sides. Glen was not understanding enough that others may be unfamiliar with his terminology, and was lax when he failed to provide examples in the original post. On the other hand, he got accused of not only supporting NHST, but advocating it as a great example of the scientific method. That is a very offensive accusation, and shouldn’t be tossed out there haphazardly.

      • Isn’t Wansink’s work behavior analysis?
        No, not even close. The term “behavior analysis” has a specific meaning in psychology. It’s a bit like saying a quantum physicist and an astrophysicist are the same creature. Hey, they both are physicists.

        • It’s more like saying that alchemy and chemistry are the same thing. Maybe not a perfect analogy since it could be argued that alchemy contributed to the birth of chemistry and thermodynamics. Mainstream psychology (of which Wansink is a practitioner) contributed nothing to behavior analysis and it (mainstream psychology) remains largely prescientific.

        • Oh okay, alchemy and chemistry is often a good comparison. I just did not think of it.

          I do think some mainstream psych work, even in (roughly speaking) the area of social psych, say in the areas of cognitive dissonance or authoritarianism, has some validity. Whether or not it is scientific is a good question.

          I’d argue that some areas of mainstream psychology are definitely prescientific and some others, say in some of the developmental or cognitive areas are either scientific or moving rapidly in that direction.

          A lot of social psych seems as weird as ever.

        • There’s nary a glimmer of hope IMO. Mainstream psychology (i.e., cognitive psychology), is still dualistic (as is most of neuro”science” which has bought representationalism hook, line and homunculus). All it has done is jettison the ontological commitments of Cartesian dualism. Instead of ontological dualism, you have epistemological dualism. Put simply, the brain is the new mind. Explanations of behavior proceed as they always have except instead of the mind driving the body around like a Cadillac, it is the brain. The epistemology is unchanged. But the problem was never ontology. As ol’ Fred said, “The problem with mental explanations is not that they are mental but, rather, that they are not explanations.” But I guess I’ll go to sleep now – something about the large meal I ate…Oh right! It has a “dormitive virtue.” That’s the explanation!

        • What did make you go to sleep Glen?

          Or alternatively what is the role of the brain in behaviour?

          I don’t *think* you’re a troll, but you seem incapable of appreciating the arguments of the fields you oppose.

          And before you get started, no I’m not a psychologist of any stripe.

        • This is a reply to, err, “NJW” whose post will probably appear either above or below mine. I’ll put it here because the nesting is not infinite – or bigger than about 5 or 6!

          NJW: What did make you go to sleep Glen?

          GS: Why do you ask? There are, no doubt, legitimate answers but finding them does not consist of naming the phenomenon, calling the name a thing and saying the thing is the cause. That is what psychology does…all the time. Often the names are ordinary-language terms. Q: Why does he go to church every Sunday? A: Because he believes in God. Q: How do you know he believes in God? A: He goes to church every Sunday. That is the pattern…really. Now…call ‘em on it, and they’ll say “That’s what physicists do.” Is it, though? It is well known that physicists often hypothesize about unobservables, but is the process really the same as the gratuitous creation of causes via circular reasoning that characterizes mainstream psychology?

          NJW: Or alternatively what is the role of the brain in behaviour?

          GS: I would say that physiology mediates behavior. In some cases, we have a complete understanding of this. For example, Kandel was almost completely able to show how changes in structures accounted for observations at the behavioral level. One set of observations involved stimulation of the gill siphon (or something like that) in Aplysia and the resulting withdrawal. First, he knew everything involved in this reflex. But, importantly, he was able to show what, reasonably precisely, was happening when the response diminished upon repeated elicitation. The reflex is, so to speak, an empirical fact at the behavioral level, as is the habituation. But there is a temporal gap in the explanation of the diminishing of the magnitude of the behavior. Let’s make sure we’re on the same page. The question is “Why does the stimulus now elicit a small magnitude response when, a week ago, it elicited a much larger magnitude response?” A: “The stimulus was repeatedly presented.” That is true. It is a true answer at the behavioral level (ignoring for a moment the whole brouhaha over the question of what “truth” is). But there is a temporal gap between the cause and the effect. This temporal gap is filled with a series of events that were initiated by the repeated stimulus presentations. The structure of the animal was changed and, as a result, the stimulus came to function differently. No temporal gap. Nice and tidy. Billiard-ball causation. Nothing wrong with it. The trick is obtaining such an explanation. Though this is incidental to the main issue, we are nowhere near such an explanation of, say, a rat’s behavior after exposure to some schedule of reinforcement, or what happens when you change the schedule. Despite decades of brain-hype, there is nothing like a complete explanation of such “simple” behavior in, say a rat. And to a very, very great extent, this is true even in less complex organisms than a mammal (mammals and birds are pretty complex). But I’m sure that neuroscientists will soon explain all the nuances of human behavior…just like they say. I hope they will be able to carry out all those experiments after breaking their arms trying to relentlessly pat themselves on their respective backs.

          NJW: I don’t *think* you’re a troll, but you seem incapable of appreciating the arguments of the fields you oppose.

          GS: I thought Andrew asked you to use only one pseudonym? Oh well. Anyway…I think you are wrong. It is mainstream psychologists and neurobiologists that don’t comprehend *my* arguments. I’m pretty sure that I understand “representationalism” pretty well, and how mentalism goes about its business. After all, even the “man in the street” is trained in mentalism. And they are used to “saving and retrieving photographs” so the storage and retrieval metaphor is a comfortable one. All behaviorists were once mentalists, but the reverse is not true.

          NJW: And before you get started, no I’m not a psychologist of any stripe.

          GS: Ok…your point being?

  6. I prefer to think of this as a cry for help. Buildings, to take an easy example, are generally designed with lots of errors. They not only leak and don’t heat/cool evenly but they often don’t flow people well, don’t join or isolate spaces well and don’t fit the spaces to actual work needs. The drowning grab at straws. Brian Wansink wouldn’t be so popular if people were capable, in meaningful numbers and to a meaningful degree, of controlling their appetites. We wouldn’t have simplistic causal assertions – like there’s an excess production of calories in the US so therefore we’re fat – if we understood the causes of obesity. (In that regard, watching a bit of the WWI documentary, notice how thin all the people are. Surely, many if not most of those people were getting enough to eat.)

    I was thinking about this in regards to the methods of economists and specifically why every single paper has to declare a new model. The relative citation indices help form distributions around those models which are at least popular … which says very little about correctness or appropriateness for a context. If you were to pick papers at random from some swathe of time, what are the odds one would be significant even for a bit of time?

    Qualitative work is necessary. You build coffins to a certain size because people tend to be that way and thus the coffin business now produces many large coffins. That’s qualitative. It’s a study: they got more orders for really big bodies and decided that would be a business niche they could spend money on at a profit. That’s the point, isn’t it? To get an improvement. I just saw a silly study that says looking at pictures of cold stuff increases self-discipline. I’d imagine it might, maybe for a short time until you got used to the pictures because lots of things have transient impact. And we adapt. And contexts change. And we build stuff that looks great because it’s new and it’s a change and the lines appeal to us and it satisfies certain needs we identify now for light or for privacy or for warmth or for cool, etc. and then in some number of years something else will look new and better and new may be better in some ways, worse in some ways but then that was true when your building was new too.

  7. It would be interesting if this had a placebo effect.

    Thinking your office was designed with some concern for mental health may increase your mental health just because.

  8. I’ve observed an interesting problem. As fields ‘touch’ poisoned fields, they end up getting poisoned too. Previously scrupulous (and boring) fields end up hyped and toxic…

  9. Andrew, thanks for opening this conversation about the use of evidence-based design in architecture. Architecture, as a profession, is similar to psychology as a profession in at least one major way: the public prestige has little to do with the actual substance of the profession. In the last decade, architecture firms (Perkins + Will is probably the most egregious example), academia, and “knowledge leaders” have begun to use the moniker of “evidence-based” to clothe the architectural field in pseudo-scientific rigor. It is commonly known that most architectural decision-making is based on subjective assertions of value and formal arbitrariness. There is no developed sense of research or any training in what would constitute adequate research methods or techniques (let alone even the most rudimentary familiarity with statistics). Ironically, this naivete of the architectural profession has meant that the field is easily fooled by “environmental psychologists” plying views that comport with useful biases. As just one example, Amy Cuddy was one of the major keynote speakers at the American Institute of Architects national conference last week.

    Unlike psychologists though, licensed architects have been given a monopoly by the state to practice such that, apart from a few exceptions, they must be engaged to provide a stamp before building permits will be issued. As such, they are at the beginning of the process for a vast portion of the more than $1T building and construction industry. Erroneous assumptions embedded in this industry create systemic burdens on the efficient use of societal resources. I provide only a few examples easily at hand to show some of the issues (my apologies for not knowing how to create hyperlinks):

    1. Design to stop obesity grew out of a relationship between Michelle Obama, the US Green Building Council (USGBC), the AIA and others who wanted to make a public health issue around buildings: https://www.fastcodesign.com/1663272/a-new-design-movement-that-can-help-us-beat-obesity; https://www.nytimes.com/2016/07/12/opinion/designing-an-active-healthier-city.html?_r=0 and numerous others. This included such gems as forcing people to go up stairs instead of elevators or having long distances to bathrooms to get people moving. One wonders what was to be done with people with disabilities or those interested in universal design. But this was also during the time when the idea of “nudging” people for their own good played well with the idea that intellectual betters needed to help people who didn’t have the personal control to avoid becoming obese in the first place.

    2. One bastion of EBD work and someone on the board of directors of the Center for Health Design is Roger Ulrich, who has written on the link between a view to “natural” landscapes as being therapeutically valuable compared to “non-natural” views for post-operative recovery time duration. See, https://www.healthdesign.org/about-us/meet-team/roger-s-ulrich-phd-edac and the 1984 article by him in Science entitled, “View through a window may influence recovery from surgery.” You have already noted the strange world of EDAC.

    3. Look at building rating systems that have bought into much of this and have as partners Mayo Clinic, Cleveland Clinic, Harvard, major real estate brokers, architecture firms, not to mention those paragons of good sense and virtue, Deepak Chopra and Leonardo DiCaprio: http://delos.com/people Note also that they are tied at the hip with the very powerful marketing machine, the USGBC. Or see the especially anti-science Healthy Building Network, https://healthybuilding.net/ All of these entities claim to do “research.”

    4. A whole literature exists about the ostensibly improved performance (on a variety of metrics, energy, GHG, worker productivity, worker health, even social justice) for LEED-rated buildings. The area with the most serious work, energy performance, shows that LEED buildings do not perform any better in source energy use or GHG emissions. This wouldn’t matter much if this claim of improved performance wasn’t part of the validation for all kinds of legislative and code changes that privilege such questionable rating systems but add useless costs to an increasing number of building assets in the private and public sphere. There are a few famous “definitive” papers that serve as examples of the quality of the work. A study by the New Buildings Institute (funded by USGBC) that validated claims of improved energy performance compared a mean of a non-LEED population to the mean of a LEED population. See, Turner and Frankel, “Energy Performance of LEED for New Construction Buildings–Final Report, 2008 (available at the New Buildings Institute). The second paper was touted as the first major definitive research study and served as the basis of much successful marketing and legislative activity, especially in CA and the GSA. It is Greg Kats’ “The Costs and Benefits of Green Buildings” in 2003, also funded by the USGBC. For those who want to see another example of how highly biased consulting research works and how it can lead to vast rewards financially, TED-type celebrity, and complete insulation from contrary work or skepticism, this is a perfect example. But it also shows how easily the right kind of motivated reasoning tied to sloppy and incoherent research can lead to success.

    There are many other examples but no need to go further. Most importantly, I wanted to stress as I did in the original post that this is an area of no serious opposition. Public and private owners of building assets do not have the incentive or time to care about this set of claims. As noted about Wansink, this work flies under the radar and makes hay of the social desirability bias that prevents any serious opposition. In addition, there are many ways in which these types of shoddy thinking and work can be hidden from those paying for the assets or taking the risks attendant on non-performance.

    One final thought. In architecture, the idea of evidence-based is not the same as science-based. No attempt is made to actually determine the quality of the evidence used in the claims of using evidence-based design. Rather it is used to supercharge confirmation bias by referencing poor or simply nonsensical studies to generate an aura of scientific rigor. Since most of the folks who are the audience for the discourse of evidence-based design activity have little or no way to question the authority of the licensed professional, it is hardly surprising that there is little opposition. Any evidence to the contrary is easily hidden from the audience.

    This is not to say that things can’t be done and certainly additional work is useful, but I am not sanguine about the path ahead given the current state of the field and the lack of opposition.

  10. “In architecture, the idea of evidence-based is not the same as science-based. No attempt is made to actually determine the quality of the evidence used in the claims of using evidence-based design. Rather it is used to supercharge confirmation bias by referencing poor or simply nonsensical studies to generate an aura of scientific rigor. ”

    This sounds very much like what very often happens in fields such as psychology and education. In these fields, there is a little more attention to requiring some nominal training in statistics — but, as many who comment on this blog are aware, very often the statistical training leaves out the science and just gives vague recipes to follow to try to get the magical (misunderstood) p-value.
