
How has my advice to psychology researchers changed since 2013?

Four years ago, in a post entitled, “How can statisticians help psychologists do their research better?”, I gave the following recommendations to researchers:

– Analyze all your data.

– Present all your comparisons.

– Make your data public.

And, for journal editors, I wrote, “if a paper is nothing special, you don’t have to publish it in your flagship journal.”

Changes since then?

The above advice is fine, but it’s missing something, a big something, regarding design and data collection. So let me add two more tips, arguably the most important pieces of advice of all:

– Put in the effort to take accurate measurements. Without accurate measurements (low bias, low variance, and a large enough sample size), you’re drawing dead. That’s what happened with the beauty-and-sex-ratio study, the ovulation-and-clothing study, the ovulation-and-voting study, the fat arms study, etc etc. All the analysis and shared data and preregistration in the world won’t save you if your data don’t closely address the questions you’re trying to answer.

– Do within-person comparisons where possible; that is, cross-over designs. Don’t worry about poisoning the well; that’s the least of your worries. Generally it’s much more important to get those direct comparisons.
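To make the efficiency argument concrete, here is a minimal simulation sketch. It is not from the original post. All numbers (true effect 0.2, between-person SD 1.0, measurement-noise SD 0.3, 50 people per group) are illustrative assumptions, and the within-person arm assumes no carryover between conditions.

```python
import random
import statistics


def simulate(n_people=50, effect=0.2, sd_person=1.0, sd_noise=0.3,
             n_sims=2000, seed=1):
    """Return the sampling SD of the estimated effect under each design.

    All parameter values are illustrative assumptions, not taken from any study.
    """
    rng = random.Random(seed)
    between, within = [], []
    for _ in range(n_sims):
        # Between-person design: two separate groups, one measurement each.
        control = [rng.gauss(0, sd_person) + rng.gauss(0, sd_noise)
                   for _ in range(n_people)]
        treated = [rng.gauss(0, sd_person) + effect + rng.gauss(0, sd_noise)
                   for _ in range(n_people)]
        between.append(statistics.fmean(treated) - statistics.fmean(control))

        # Within-person design: each person is measured under both conditions,
        # so the stable person-level term cancels in the difference.
        diffs = []
        for _ in range(n_people):
            person = rng.gauss(0, sd_person)
            y_control = person + rng.gauss(0, sd_noise)
            y_treated = person + effect + rng.gauss(0, sd_noise)
            diffs.append(y_treated - y_control)
        within.append(statistics.fmean(diffs))
    return statistics.stdev(between), statistics.stdev(within)


sd_between, sd_within = simulate()
print(f"SD of estimate, between-person: {sd_between:.3f}")
print(f"SD of estimate, within-person:  {sd_within:.3f}")
```

Under these assumptions the within-person estimate is several times less variable, even though it uses half as many people. The advantage shrinks as measurement noise grows relative to the stable differences between people, and it says nothing about bias from lasting treatment effects.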

101 Comments

  1. Z says:

    What’s the empirical basis for your within-person comparisons advice? Surely many interventions or outcomes of interest do ‘poison the well’. I’m not sure what population of interventions/outcomes justifies your ‘generally’.

    Maybe it should be changed to “If you have a small to moderate sample size, do within-person comparisons because between-person comparisons will be underpowered”?

    • Andrew says:

      Z:

      There are some settings where within-person comparisons make no sense. For example, if I want to compare the political attitudes of people whose parents are Democrats, to the attitudes of people whose parents are Republicans, there’s no relevant within-person comparison. The study has to be a comparison of two groups. But when a within-person comparison is possible, I think that’s the way to go.

      • Z says:

        I hadn’t heard the phrase ‘poison the well’ before, but I assume it refers to situations where interventions or outcomes have lasting effects. The benefits of within person comparisons are that they cut down on confounding by poor (or non-existent) randomization and improve efficiency. The drawback is that lasting effects of prior treatments and/or outcomes can induce bias.

        (1) If you have a large sample size and can randomize, why not take the unbiased route?

        (2) If sample size is small/moderate, what is your evidence that lasting effects tend to be weak enough to justify within person designs?

        (3) Why offer general advice on this instead of saying that the choice should depend on the likely strength of lasting effects and the sample size (and maybe other factors too)?

        • Andrew says:

          Z:

          The problem is that there’s no unbiased alternative. The whole “unbiasedness” thing is a trap, for a few reasons. First, people often aren’t measuring what they think or say they’re measuring. For example, there was that study of fat arms and voting, where what they said was “upper-body strength” was really arm circumference. There was that study of ovulation and voting where the days reported as peak fertility were . . . not actually the days of peak fertility. Lots and lots and lots of examples of this sort: if you’re not measuring what you want to measure, your estimates will be biased. Second, results are reported conditional on statistical significance: this induces type M errors, that is, biased estimates of effect size. For example, that estimate that early childhood intervention increases later earnings by 42%? Biased. If your measurements are of the wrong thing, or your statistical procedures are overwhelmed by selection, then biases will be huge. So, sure, in the hypothetical setting that you have excellent measurements and predictable variation among people, then a between-person study should work. But in psychology it’s my impression that unexplained variation between people is large, and measurements are often pretty bad: the trouble is, I think, that researchers think that if they get statistical significance, then their measurements were, retrospectively, just fine.

          And, to quickly answer your questions above: 1. Randomization doesn’t resolve the bias issue because of (a) treatment interactions (see here), and (b) selection bias in how the data are analyzed and reported. 2. I recommend cross-over designs, so you can have between-person comparisons if you really want. 3. I’ve written whole books of advice; here I wanted to give quick tips that I think should be helpful right away.
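The type M point in the comment above can be illustrated with a short simulation. This is only a sketch: the true effect (0.1) and standard error (0.5) are made-up values chosen to represent a noisy study, and the estimate is assumed to be normally distributed around the truth.

```python
import random
import statistics


def exaggeration_ratio(true_effect=0.1, se=0.5, n_sims=20000, seed=2):
    """Average |estimate| / true effect among 'significant' results.

    true_effect and se are hypothetical values for a noisy study.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)  # unbiased before selection
        if abs(estimate) > 1.96 * se:          # the significance filter
            significant.append(abs(estimate))
    ratio = statistics.fmean(significant) / true_effect
    share_significant = len(significant) / n_sims
    return ratio, share_significant


ratio, share_significant = exaggeration_ratio()
print(f"share of runs reaching significance: {share_significant:.3f}")
print(f"average exaggeration among those:    {ratio:.1f}x")
```

Each individual estimate is unbiased, but any estimate that clears the significance threshold must be at least 1.96 standard errors from zero, so in this setup the reported effects overstate the true effect by roughly an order of magnitude.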

          • Z says:

            “So, sure, in the hypothetical setting that you have excellent measurements and predictable variation among people, then a between-person study should work.”

            Ok, this is the scenario I was talking about. Your first piece of advice was to get accurate measurements, and I figured your second piece of advice assumed that people followed your first piece of advice.

            So are you saying that within person designs ameliorate problems caused by measurement error, garden of forking paths, etc above and beyond the efficiency gains in the absence of those problems? So the bias from ‘poisoning the well’ is in some sense double-offset? Can you point to a reference explaining how this works, or is it obvious? (Or if there is no reference but it’s clearly true, maybe it’s worth writing a paper about it.)

          • Z says:

            (I’m usually on board with pretty much everything you say, but there just seems to be less justification of this general within-person design advice than everything else you promote on the blog, or I’m missing some key intuition)

            • I think the key intuition you are missing is that you don’t just run a within-person design, you also create a *within person model* that is, you need to have a model of what your treatment does to the person, through time.

              If you have that, then there is no poisoning the well, you just need to fit the model. Yes in some people you do treatment A first then B, and you *expect* on the basis of your model that B first then A will be different in a certain way… and your model accounts for this fact. In fact, within person models + within person treatments + a pool of people treated = a combined within AND between person study. Unless you study just one person, you automatically have a between-person study as well.

              The sensitivity of this model to detect very real specific effects is way way higher than any between person design. Efficiency is much better when you have these serious models.
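A toy version of the idea in the comment above, where the order effect is written into the model instead of being treated as contamination. This is a sketch under strong assumptions that are not in the original comment: a two-period design with a treatment and a control condition, a constant carryover that appears only in the period after treatment, and no separate period effect. The parameter values (`tau`, `carryover`) are hypothetical.

```python
import random
import statistics


def crossover_estimates(n_per_sequence=2000, tau=0.5, carryover=0.3,
                        sd_person=1.0, sd_noise=0.3, seed=3):
    """Simulate a two-sequence cross-over and return three estimates.

    Assumed model (hypothetical): outcome = person + tau * treated
    + carryover * (previous period was treated) + noise, no period effect.
    """
    rng = random.Random(seed)
    diffs_tc, diffs_ct = [], []
    for _ in range(n_per_sequence):
        # Sequence treatment-then-control: the control period carries a
        # lingering effect of the earlier treatment.
        person = rng.gauss(0, sd_person)
        treated = person + tau + rng.gauss(0, sd_noise)
        control = person + carryover + rng.gauss(0, sd_noise)
        diffs_tc.append(treated - control)

        # Sequence control-then-treatment: nothing to carry over.
        person = rng.gauss(0, sd_person)
        control = person + rng.gauss(0, sd_noise)
        treated = person + tau + rng.gauss(0, sd_noise)
        diffs_ct.append(treated - control)

    # Ignoring order: pools both sequences, biased by about carryover / 2.
    naive = statistics.fmean(diffs_tc + diffs_ct)
    # Using the assumed model: the control-first sequence isolates tau,
    # and the gap between sequences estimates the carryover.
    tau_hat = statistics.fmean(diffs_ct)
    carryover_hat = tau_hat - statistics.fmean(diffs_tc)
    return naive, tau_hat, carryover_hat


naive, tau_hat, carryover_hat = crossover_estimates()
print(f"naive pooled difference: {naive:.3f}")
print(f"model-based tau:         {tau_hat:.3f}")
print(f"model-based carryover:   {carryover_hat:.3f}")
```

The recovery of `tau` here depends entirely on the assumed model. If a period effect were also present, this simple difference-based calculation could not separate it from carryover, which is one concrete form of the "can of worms" raised in the replies.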

              • Z says:

                Ok, but then you have the can of worms of modeling assumptions. Not saying it’s never the right thing to do, but it’s still not at all obvious to me

              • You ALWAYS have “modeling assumptions”. Making lots of explicit modeling assumptions is a good thing. then if your model doesn’t fit, you can reject that model of the world and move on to another, and if your model does fit, you can make accurate predictions and have real knowledge of how things work. If you make “no” modeling assumptions, then you inherently learn nothing about the world.

                not opening the “can of worms” is basically a cop out, you’re no longer doing science really, just measuring stuff. By itself, measuring stuff can be useful, but basically as a precursor to explanation. A major problem with all of the social sciences papers Andrew calls out here is that they have no *quantitatively specific* model of how things work. They work with qualitative models like “hormones will make you want to look pretty” and they move on to trying to see if they can detect a consequence of this in measurements, and then they do some crappy measurements that are close to a random number generator… and then when they get lucky they get major publicity, and when they don’t get lucky they just pull the handle over and over until they do get lucky.

                If you take something like “hormones will make you want to look pretty” you can turn this into a real model by doing things like measuring hormone concentrations through time (perhaps urine or saliva measurements), and taking photos of the same person each day, and then having third parties rate the photos as to how attractive the women are … and then look at time trends in rated attractiveness, and see if they have a consistent rise and fall that follows a pattern similar to the rise and fall of certain hormone levels, etc.

                The modeling assumptions are good, necessary even for real science.

              • Z says:

                I disagree that you necessarily learn nothing if you make very few modeling assumptions. For example, if you want to know whether drug A on average works better than drug B, you can answer that question with a randomized experiment and very few modeling assumptions. Same if you want to know whether it was worth the money to offer a new jobs training program.

                Or suppose you want to estimate the optimal dynamic treatment regime for a drug as a function of time varying covariates L from observational data. You can do that only modeling probability of exposure as a function of past covariates at each time point. Or, you can estimate separate regressions for each of the (possibly many) time varying covariates as a function of the past along with the outcome. Lots of moving parts that second way (which is what you propose), and lots of chances for small errors to propagate. Less modeling to estimate the same quantity of interest usually gives more stable and more reliable results.

              • Andrew says:

                Z:

                The trouble is, in psychology experiments it’s rare to have a clean “drug A” vs. “drug B” comparison. Treatments are often pretty hazy, and any effects will be highly situational, so that the whole concept of the “average treatment effect” doesn’t mean much by itself.

              • “For example, if you want to know whether drug A on average works better than drug B, you can answer that question with a randomized experiment and very few modeling assumptions.”

                Typically this involves very strong modeling assumptions regarding the meaning of “Better”

                If in fact you really do just want to know whether say Drug A vs Drug B results in higher percentage of resolved cases after 3 weeks of treatment, then you’re not doing *science* you’re doing *measurement* in the same way that if you want to know whether pencil A is longer than pencil B you put a ruler next to each one and measure the outcome.

                “Or suppose you want to estimate the optimal dynamic treatment regime for a drug as a function of time varying covariates L from observational data”

                Yes, suppose this, then you need

                1) a model describing what it means to be “good” so that you can find the “most good” = optimal

                2) a model for which covariates L to consider, and for how covariates L interact with treatment through time

                You ignore (1) entirely, under the assumption that (1) is given or obvious, which is usually dubious but common enough. You propose two forms for (2) and attribute one of them to me, but I have no preference for how you do (2) only that you do in fact do it in an explicit way, don’t just rely on some default canned thing, actually consider the problem of (2) and make a conscious decision about what you will do for (2).

              • Matt says:

                Daniel, when reading what you write on modelling I am often left with the impression that what you are after is predictive or purely statistical estimates, as opposed to “causal” or “structural” parameters. With causal inference, it’s not necessarily about fitting the data well – you can have a legitimate causal mechanism that only explains a small part of the variation. Just because you write down a model that fits the data well, that doesn’t mean we should interpret those model parameters as anything meaningful. You need to argue that the assumptions of your model are justified on theoretical grounds, and then you take it to the data and your parameter estimates can perhaps be interpreted the way you want. Obviously in practice there is a back and forth between the data and the model, but fit is not always that relevant. The set of (simple) econometric techniques laid out in Mostly Harmless, by Angrist and Pischke, would be what I would consider “model-free” causal inference. Of course there are still models here, but they can be fairly robust to misspecification, unlike more complex structural models. My issue with Bayesian stuff and structural models in general is that once you start specifying complex models with many (implicit and explicit) assumptions, it becomes very difficult to believe your estimates have any economic meaning. You have specified a complex model based on many assumptions, and you have chosen some parameters to fit the data well – all of this is good and fine, but why should I believe these estimates are causal or structural in any sense? I don’t know enough about Bayesian inference to understand how it meshes with causal inference, but my impression is that it is not well suited. We are after structural parameters – once you open the modelling “can of worms” the credibility of your estimates as anything causal or structural starts falling drastically, in my opinion.

              • Curious says:

                I find puzzling the assertion that experiments or group-level comparisons of averages require few modelling assumptions. I suppose this is defensible if one is simply talking about the distributional assumptions required for the math to calculate an average along with the standard deviation and standard errors that allow for a statistical comparison. That said, even simple models typically come with loads of unstated assumptions. In pharmacology the assumptions are about biological processes and whether they are reasonably static or dynamic as this will affect the covariates required. Inclusion of covariates leads to assumptions about the significant causes of the outcome under study and how important the missing variable problem is in the current study. In psychology, it is often an assumption of within person stability without any evidence for such. Or the assumption that the results of between person analyses can logically be applied at an individual level. Or the mistaken assumption that precursors that are necessary are also sufficient to draw strong inferences about the results of psychological assessments. Or the idea that the mapping of numbers to some aspect of psychological reality is both accurate and meaningful.

              • Andrew says:

                Matt:

                What you say is standard in statistics and econometrics and indeed is how causal inference is presented in Angrist and Pischke’s book and in our own books too. But I think it’s a framework that doesn’t work so well in psychology experiments, for reasons explained in some of my other comments on this thread, such as in my response to Z above.

              • Matt, I don’t know how you can get the idea that I’m more interested in predictive power than in causal inference. People who care a lot about predictive power do things like regression trees + cross validation, or other automated machine learning techniques. They teach you nothing about how things work, but they do a good job of selling things on Amazon etc.

                I’ve never read Angrist + Pischke so I can’t directly say much about their viewpoint, but I can say that the blurb about the book (http://press.princeton.edu/titles/8769.html) does nothing to dissuade me from the idea that this kind of thing is essentially Economists rediscovering Taylor’s theorem & perturbation theory.

              • Matt says:

                Daniel, a quote like this:

                “Making lots of explicit modeling assumptions is a good thing. then if your model doesn’t fit, you can reject that model of the world and move on to another, and if your model does fit, you can make accurate predictions and have real knowledge of how things work. If you make “no” modeling assumptions, then you inherently learn nothing about the world”

                makes me think that you don’t think about causal inference the way I do, anyways (perhaps that is a good thing..). Like what do you think about the potential outcomes framework? There are not many modelling assumptions going on there…and that to me is the foundation for thinking about causal inference. I’m not sure if the words “identifying variation” would have the same meaning for you as for me, but when you have a complex model, don’t you find it difficult to understand what variation is actually identifying your parameters? And then consequently whether or not this variation is what you actually want to estimate your parameters of interest? That is my (and many others) fundamental beef with complex modelling for causal inference – it’s simply not clear what is actually identifying your parameters in a lot of these instances.

              • Matt says:

                Also re the snide comment regarding Mostly Harmless – you really don’t know what you are talking about there. I find it funny that you are talking down to Angrist, who is probably one of the more influential applied econometricians in the last couple decades (does LATE ring any bells?).

              • Andrew says:

                Matt:

                The potential outcomes framework is just fine and, by itself, makes no assumptions at all. But when a psychologist does a comparison in some particular group of people and reports the difference as statistically significant and uses that to make broad claims about the world . . . then a bunch of very strong assumptions are being made, about quality of measurement, effect size, and stability of treatment effects.

              • Matt, you’re right, I said right there in the comment that I haven’t read the book and so I can’t comment on it, my comment was on the blurb description.

                Here’s the blurb:

                “The core methods in today’s econometric toolkit are linear regression for statistical control, instrumental variables methods for the analysis of natural experiments, and differences-in-differences methods that exploit policy changes. In the modern experimentalist paradigm, these techniques address clear causal questions such as: Do smaller classes increase learning? Should wife batterers be arrested? How much does education raise wages? Mostly Harmless Econometrics shows how the basic tools of applied econometrics allow the data to speak.”

                The fact is mathematically that as long as you have a continuous differentiable function of the inputs with a countable number of discontinuities, an assumption which is always made, then a linear approximation for that function around a given point is second order accurate, and if there is some curvature, then a 2nd order polynomial in the perturbation is 3rd order accurate… this is Taylor’s theorem. For cases where a discontinuity is present, you add a step function. And then you have a complete basis for *perturbation effects* to a system assumed to be describable by a functional input-output relationship with noise y = f(a,b,c,d…) + epsilon
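The orders of accuracy claimed in the comment above can be checked numerically. A small sketch, using `exp` around 0 purely as a convenient smooth test function (my choice, not the commenter's): halving the perturbation should cut the linear approximation's error by about 4 and the quadratic approximation's error by about 8.

```python
import math


def taylor_errors(h, x0=0.0):
    """Absolute error of 1st- and 2nd-order Taylor approximations of exp."""
    base = math.exp(x0)            # exp is its own derivative of every order
    exact = math.exp(x0 + h)
    linear = base * (1 + h)
    quadratic = base * (1 + h + h * h / 2)
    return abs(exact - linear), abs(exact - quadratic)


lin_big, quad_big = taylor_errors(0.1)
lin_small, quad_small = taylor_errors(0.05)

# Error ratios when the perturbation is halved.
linear_ratio = lin_big / lin_small        # close to 2**2 for an O(h^2) error
quadratic_ratio = quad_big / quad_small   # close to 2**3 for an O(h^3) error
print(f"linear error ratio:    {linear_ratio:.2f}")
print(f"quadratic error ratio: {quadratic_ratio:.2f}")
```

This only demonstrates local accuracy for small perturbations of a smooth function. It says nothing about how large a perturbation can be before the approximation breaks down, which is the issue raised in the paragraphs that follow.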

                So the reason Econometrics gets away with this is that for the most part they are considering perturbations to a complex system, and they are not typically considering the complex dynamic feedbacks that lead to the big picture. Consider instead of describing say the effect of free school lunches on reading performance by 8th grade, let’s instead describe the effect of the Bolshevik revolution on the spatial and time-course of 35 major industrial pollutants and consumption of 15 major natural resources in eastern europe. Linear regression is inappropriate.

                Now, consider some very simple systems studied by other fields, like the Lynx dataset:

                http://andrewgelman.com/2012/01/28/the-last-word-on-the-canadian-lynx-series/

                If you want to understand the Lynx dataset you are WAY better off directly describing the dynamics than anything else. Why? Because the dynamics are fundamental, there is feedback in the system and failing to acknowledge it is fatal to your analysis.

                So the idea that you can just “stick to linear regression with maybe a discontinuity design” is I think an extremely narrow view of things. It may very well be appropriate in almost all of what people are doing today in Micro-Econometrics, but that’s such a special case! Sure maybe it’s your whole interest in life, but it’s far from the whole life of data analysts across all disciplines. Suppose for example you’d like to understand the transport of pollution in the oceans from Fukushima Japan across the rest of the globe. Suggesting linear regression + regression discontinuity + unbiased estimates + clustered standard errors is like saying “I know, let’s use Addition!”

                The fact is, it *looks* like saying “linear regression + regression discontinuity” is “model free” to an Economist, but this is only because all the economists have already accepted the ENORMOUS modeling assumption that “everything is a perturbation to a direct continuous differentiable function of the inputs”

              • Matt says:

                Okay, thanks for that comment. I can only speak from an economics perspective, but surely you would agree there is a tradeoff when you add model complexity – namely that the number of assumptions that must hold to get the interpretation you would like grows. Regarding your Bolshevik revolution example, I just don’t see how you are going to have any estimates from that model that are structural or causal in any sense. That is why I said earlier that that type of modelling would seem to be purely “statistical” in nature – you could successfully fit a model to the data, but I’m not sure we should take it so seriously – fitting data well doesn’t tell us everything about whether or not you have the right mechanism. I’m a bit out of my depth here, as I’m not sure how things work when modelling the physical world, say, but I am so used to, in economics, the requirement of having an exogenous source of variation to identify any meaningful structural parameter. But any discussion of exogenous variation seems to be absent from your talk of modelling – not sure if this is making sense to you or not.

              • Matt,

                I think the questions you are asking are good and fundamental. For example, why does one need an exogenous source of variation in Econometrics? Does one need an exogenous source of variation to understand the weather? Is weather prediction not *causal*? It’s hard to get more causal than the Navier Stokes equations: forces cause fluid accelerations.

                So, what is the role of exogenous variation in Econometrics? The answer to that question is truly key to understanding what is going on, and I don’t know that I have the correct answer, but I think it comes down to identification.

                Let’s suppose that we take my perturbation theory version of econometrics as fundamental. Then you have

                y = f(a,b,c,d,e,…) + epsilon

                as your underlying fundamental model, with unknown f. We expand this in a first order Taylor series to

                y = f1 + (a-a1)*df/da + (b-b1)*df/db …. + some_step_functions + epsilon

                and if we have interactions, we include a selection of second order terms that have factors a^2 or a*b and so forth.

                Now, we’ve typically got quite a few variables, and the value of the variables are all correlated in the population… men smoke more often than women, education tends to come along with higher income, county of residence and income are correlated… etc.

                Fundamentally, deep down inside the problem, we don’t have a N^2 dimensional set of correlations (or more properly … dependencies). We have a small dimensional causal structure that implies the N^2 correlations between all the variables. For example perhaps men have a lot of testosterone and it tends to make them more aggressive, aggressiveness affects their performance in school in one way, and their performance in salary negotiations in another way, and their choice of career in yet another way, and their cardiovascular health, and their risk of criminal behavior, and their musical ability…. whatever. If we have 25 important variables for the causes of y, then we have 25*25 = 625 different interrelationships between them that are important for the 2nd order taylor series. But, we don’t have 625 underlying separate factors, we have maybe 9 or 12 or 15, with complex causal mechanistic interrelationships that we aren’t modeling.

                So, if you want to figure out the coefficient for b, which is df/db or in economics terms Beta and you don’t want to create an appropriate underlying causal mechanistic model for the 15 dimensional feedback loop that reasonably accurately describes the time-series of individual trajectories through life (say Myers-Briggs personality types, and several health and social experience variables etc) then to figure out what’s going on without all the confounding of the 625 different 2nd order relationships that you have no hope of estimating, you need to perturb b by itself.

                So, exogenous variation isn’t fundamental to causality, it’s fundamental to perturbation theory.
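A small sketch of the "perturb b by itself" argument. Everything here is a made-up toy setup: a single unobserved factor `u` drives both `b` and `y`, the true slope df/db is `beta = 1`, and the outcome is linear. When `b` is merely observed, the regression slope absorbs the effect of `u`. When `b` is set independently of `u`, the slope recovers `beta`.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate(n=20000, beta=1.0, gamma=2.0, seed=5):
    """Hypothetical model: y = beta * b + gamma * u + noise, u unobserved."""
    rng = random.Random(seed)
    b_obs, y_obs, b_exo, y_exo = [], [], [], []
    for _ in range(n):
        # Observational data: b is partly driven by the hidden factor u.
        u = rng.gauss(0, 1)
        b = u + rng.gauss(0, 1)
        b_obs.append(b)
        y_obs.append(beta * b + gamma * u + rng.gauss(0, 1))

        # Exogenous variation: b is set independently of u.
        u = rng.gauss(0, 1)
        b = rng.gauss(0, 2 ** 0.5)
        b_exo.append(b)
        y_exo.append(beta * b + gamma * u + rng.gauss(0, 1))
    return ols_slope(b_obs, y_obs), ols_slope(b_exo, y_exo)


slope_observational, slope_exogenous = simulate()
print(f"slope with confounded b: {slope_observational:.2f}")
print(f"slope with exogenous b:  {slope_exogenous:.2f}")
```

With these assumed values the confounded slope comes out near 2 while the exogenous one comes out near the true value of 1. The alternative described in the comment, modeling the hidden structure explicitly, would amount to bringing `u` into the model, which this sketch does not attempt.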

              • Matt says:

                Daniel, that is interesting to think about. I often struggle with the notion of what is “causal” or “structural” when thinking about things outside of my economics bubble, and sometimes even within my economics bubble (i.e. the parameters that govern an individual’s utility function – are these “causal”? Not really..maybe they are “structural”..) Perhaps one distinction between economics and physics is in physics the models are much much better approximations to reality. Sticking with your example, maybe the epsilon term is very small in physics models, while it’s always pretty large in economics models. Even if I were to model the 625 relationships, would I not still need some exogenous variation to get, say, df/db? If every variable in my model affects every other variable, then there is no hope for me to estimate the effect of one variable on the other. If I can enhance my model to the point where there basically is no epsilon, then maybe I don’t need exogenous variation..although I’m still not sure of this. I guess that would be a decent concrete question that I don’t know the answer to: if I have a model wherein all variables are truly endogenous, in the sense that they have a *causal* impact on every other variable in the model, then is it possible to recover any of these causal parameters? My intuition says no… but again that is coming from an economics perspective – I don’t know what goes on in physics or other hard disciplines.

              • Matt: I think you’re on the right track to think about these things. One thing to consider is whether the fundamental model

                y = f(a,b,c,d,…) + epsilon

                is in fact a good model. Interpreted causally, this model says that whenever you set a,b,c… to certain values you will necessarily observe y close to f()

                Next you should imagine that a,b,c,etc are all changing dynamically in time, and they pass through a1,b1,c1,d1… will you necessarily observe y near y1=f(a1,b1,..) regardless of how fast a,b,c etc are changing, or what their history is? Typically I think the answer is no, that y=f()+epsilon is an asymptotic approximation to some real-world dynamics. So long as the a,b,c etc don’t change too fast then you can pretend there’s a functional relationship.

                To give you a physical model, consider a piston in a Diesel engine. You could do PV=NkT and get away with it for the first half of the piston stroke, but then someone injects some fuel, and suddenly you’ve got combustion, and uneven temperature throughout the piston, and reactions that occur at finite rates, and blablabla, so that the thing only gets back to PV=NkT a millisecond before the piston hits the bottom of its travel, and the valve opens to vent out the combustion products… A model that works fine in a lab when you move the piston slowly is only a noisy approximation in the 4000 RPM diesel truck…

                But then, in this engine, maybe things happen slowly and the approximation works fine:

                http://www.autoblog.com/2011/07/22/worlds-largest-diesel-engine-makes-109-000-horsepower/
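The claim that y = f(a) + epsilon is an asymptotic approximation to slower underlying dynamics can be sketched with a toy system. The model below is an assumption made for illustration, not something from the comment: the output relaxes toward f(a) = 2a at a fixed rate while the input a(t) oscillates, and the equation is integrated with a plain Euler step.

```python
import math


def max_tracking_gap(omega, t_end, relax_time=1.0, gain=2.0, dt=0.01):
    """Largest |y - f(a)| over the second half of the run.

    Hypothetical dynamics: dy/dt = (f(a) - y) / relax_time,
    with f(a) = gain * a and input a(t) = sin(omega * t).
    """
    steps = int(t_end / dt)
    y = 0.0
    worst = 0.0
    for i in range(steps):
        target = gain * math.sin(omega * i * dt)  # the static prediction f(a)
        if i >= steps // 2:                       # skip the initial transient
            worst = max(worst, abs(y - target))
        y += dt * (target - y) / relax_time       # Euler step
    return worst


# Input changing slowly relative to the relaxation time: f(a) is a fine model.
slow_gap = max_tracking_gap(omega=0.01, t_end=1000.0)
# Input changing fast: the static relationship is badly wrong.
fast_gap = max_tracking_gap(omega=10.0, t_end=20.0)
print(f"max gap, slow input: {slow_gap:.3f}")
print(f"max gap, fast input: {fast_gap:.3f}")
```

Here the same system looks like a clean functional relationship when driven slowly and like a failed model when driven fast, which parallels the slow-piston versus 4000 RPM contrast.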

              • ojm says:

                The ‘everything is a model’ perspective is very appealing but what does it actually mean? Everything is a *generative* model? From parameter to data? The general claim seems too general and devoid of content, while the specific claim seems wrong/limited to me.

                I agree many non-modellers often need more modelling, but many modellers don’t understand that not everything is most usefully thought of as a ‘model’.

              • ojm: I think “everything is a model” just means that there’s no way to describe the world using numbers without making choices between different options. Sometimes people make choices consciously, other times they essentially let someone else make the choices for them by adopting some defaults. Other times they just aren’t even aware that there was an alternative viewpoint; they feel like there aren’t choices, perhaps out of ignorance, or out of a mistaken belief that some particular viewpoint has been proven to be exactly correct. That’s wrong in my opinion, always you have some choices to make. Adopting someone else’s ideas or choices is fine, but it is a choice.

              • Martha (Smith) says:

                Daniel said, “ojm: I think “everything is a model” just means that there’s no way to describe the world using numbers without making choices between different options. Sometimes people make choices consciously, other times they essentially let someone else make the choices for them by adopting some defaults. Other times they just aren’t even aware that there was an alternative viewpoint; they feel like there aren’t choices, perhaps out of ignorance, or out of a mistaken belief that some particular viewpoint has been proven to be exactly correct. That’s wrong in my opinion, always you have some choices to make. Adopting someone else’s ideas or choices is fine, but it is a choice.”

                Well put.

              • ojm says:

                I think this misses the point I tried to raise. At the level you describe this is close to an empty tautology ‘everything requires choices of some sort’. Usually, and it seems to be the case in much of your comments, it is taken to mean something more specific about assuming some specific data generating mechanism.

              • Z says:

                +1 to ojm

              • Glen M. Sizemore says:

                “…you don’t just run a within-person design, you also create a *within person model*”

                GS: How is this anything but a bald-faced assertion that “science is a matter of testing hypotheses”? That is, how is this not simply the assertion “If you are not testing hypotheses, you aren’t doing science.”

              • Andrew says:

                Glen:

                Fitting statistical models is not just about testing hypotheses. It’s also about estimating parameters and making predictions conditional on a model. We do this in pharmacology all the time.

              • Glen M. Sizemore says:

                Andrew: Fitting statistical models is not just about testing hypotheses. It’s also about estimating parameters[…]

                GS: So…you are saying that there is such a thing as what used to be called “curve fitting,” no? If so, my question is “What, then, makes it a *model*?” And what if I do a parametric analysis of the effect of some drug on, say, responding maintained under a VI schedule of reinforcement. I give several doses, plot response rate as a function of dose (i.e., a dose-effect function) and connect the dots with straight lines. Is that a model? Here, I don’t even necessarily have or want an algebraic expression that describes the data. My question, again, is “Is this a model?”

                Andrew: […]and making predictions conditional on a model.

                GS: This phrase, though apparently meaningful to folks here, is strange to me. Are you simply saying that you collect data (using Rat #4), describe it algebraically, and then compute values when someone says “What do you think 1.7 mg/kg will do to ol’ Rat #4’s response rate?” If so, again, I ask “what makes it a model?”

                Andrew: We do this in pharmacology all the time.

                GS: Yeah. Me too. It’s called a “dose-effect function.” Or perhaps I have misunderstood you? Now…I saw the pharmacokinetic thing you posted for my benefit. Now that was what I would call a model. Compartments emptying and filling at particular rates, write the dif. eq. and integrate. There, the model is “compartments emptying and filling.” That’s a model in the sense that I think Daniel means it, and me as well. And I am saying that that type of modeling (and it is also “hypothesis-testing”) is not a defining feature of science, though it is quite dramatically-efficacious where applied appropriately.

              • Andrew says:

                Glen:

                What you are doing is bordering on trolling but I will respond one more time, and only one time, as after this I don’t think there’s any benefit to this.

                1. You write, “Are you simply saying that you collect data (using Rat #4), describe it algebraically, and then compute values when someone says ‘What do you think 1.7 mg/kg will do to ol’ Rat #4’s response rate?’ If so, again, I ask ‘what makes it a model?'”. In response: (a) No, I have not modeled any rat responses so I can’t really say; (b) I use what are called, in statistics jargon, “generative models”: In a generative model, you supply input conditions and parameters, and then the model implies a probability distribution for observed data; one can simulate data conditional on the input conditions and parameters, or one can specify data and input conditions, and use Stan to do Bayesian inference on the parameters and then get probabilistic predictions for future data.

                Different fields use different terminologies. In Bayesian statistics, the above is what is called a model.

                2. You write that our compartment models in pharmacology are “hypothesis-testing.” Again, words are used differently in different fields. In statistics, hypothesis testing is a procedure in which you temporarily assume a model (that is, a class of probability distributions) is true, then you fit this model to data, you get probabilistic predictions from that fitted model, and you compare these predictions to observed data, “rejecting” the model if the discrepancy between predictions and observed data is large enough, under some measure. I do this sometimes. But most of the time in pharmacology I’m not doing that. I’m assuming a model, fitting the model to data, then using the fitted model to make predictions, then we use these predictions for decision making. When we compare these predictions to data, that’s hypothesis testing. But typically we’re not doing much hypothesis testing, we’re mostly doing inference under the model.

                Anyway, that’s it. Don’t let it ever be said that I didn’t try.
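To make the generative-model description in point 1 concrete, here is a minimal sketch. It is not Stan and not a model anyone in this thread has fit: the linear dose-response form, the function names `simulate`, `grid_posterior`, and `posterior_predictive_draws`, the known noise level `sigma`, the flat prior, and every number are assumptions made for illustration, and a simple grid approximation stands in for Stan's sampler.

```python
import math
import random


def simulate(doses, slope, sigma, rng):
    """Generative step: input conditions + parameters -> simulated data."""
    return [rng.gauss(slope * d, sigma) for d in doses]


def log_likelihood(slope, doses, ys, sigma):
    """Gaussian log-likelihood (up to a constant) for one candidate slope."""
    return sum(-0.5 * ((y - slope * d) / sigma) ** 2 for d, y in zip(doses, ys))


def grid_posterior(doses, ys, sigma, grid):
    """Posterior over the slope on a grid, assuming a flat prior."""
    logp = [log_likelihood(s, doses, ys, sigma) for s in grid]
    peak = max(logp)
    weights = [math.exp(lp - peak) for lp in logp]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_predictive_draws(new_dose, grid, post, sigma, rng, n=2000):
    """Probabilistic prediction for future data at a new input condition."""
    slopes = rng.choices(grid, weights=post, k=n)
    return [rng.gauss(s * new_dose, sigma) for s in slopes]


rng = random.Random(1)
doses = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # hypothetical input conditions
true_slope, sigma = 2.0, 0.5             # hypothetical parameters

# Direction 1: simulate data conditional on inputs and parameters.
ys = simulate(doses, true_slope, sigma, rng)

# Direction 2: given data and inputs, infer the parameter, then predict.
grid = [i / 100 for i in range(0, 401)]
post = grid_posterior(doses, ys, sigma, grid)
post_mean = sum(s * p for s, p in zip(grid, post))
draws = posterior_predictive_draws(1.7, grid, post, sigma, rng)
pred_mean = sum(draws) / len(draws)
```

The same three pieces (a simulator, a likelihood, a predictive distribution) are what a Stan program would encode; the grid is used here only because there is a single unknown parameter.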

              • Corey says:

                The only reasonable response to serious troll is to troll back for the lulz. It doesn’t help anything, but at least there’s lulz!

              • Andrew says:

                Corey:

                No, not on this blog. Save that crap for twitter!

              • Glen, I think Andrew did a pretty good job. To me, a model is any consciously chosen mathematical description of “how something works”. So in your rat example you could say something like:

                Response to a dose is a reasonably well defined continuous function (this is your general model)

                My doses are closely spaced, and so linear interpolation between them will produce a very tractable approximation that will always be good enough (this is an approximation to your general model)

                And then, yes, this linear interpolation is in fact a model.

                Let’s contrast it with some alternative models:

                Response to a dose is not a stable function in time, that is R(dose,t) varies with t, and so at a later time we can’t predict from the earlier observations.

                Response to a dose is a stable function of dose and remaining concentration in the bloodstream, so we need R(dose,conc)

                Response to a dose is strongly dependent not only on dose but also on environment (say exposure to light, or sound or stressors or whatever) so R(dose, env)

                What makes it a model is that it formalizes into numbers a qualitative description of a mechanism or a commonality or regular feature of the data.

                What makes it a *causal* model is that it doesn’t just summarize a feature of the data, but if you perturb the system so as to cause certain inputs to change, you will necessarily move the system’s measurable outputs to the vicinity of the predicted value.
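The "connect the dots" dose-effect function discussed above can be written down explicitly, which makes the modeling choices visible. This is only a sketch: the function name `interpolate_response` and the data are hypothetical, and refusing to extrapolate outside the observed doses is one possible choice among several.

```python
from bisect import bisect_right


def interpolate_response(doses, rates, query):
    """Piecewise-linear dose-effect model: straight lines between observations.

    The choice to interpolate linearly, and to refuse to extrapolate,
    is itself part of the model.
    """
    if not doses[0] <= query <= doses[-1]:
        raise ValueError("query dose lies outside the observed range")
    if query == doses[-1]:
        return rates[-1]
    i = bisect_right(doses, query) - 1
    x0, x1 = doses[i], doses[i + 1]
    y0, y1 = rates[i], rates[i + 1]
    return y0 + (y1 - y0) * (query - x0) / (x1 - x0)


# Hypothetical data (mg/kg, responses per minute), invented for illustration.
doses = [0.3, 1.0, 3.0, 10.0]
rates = [60.0, 55.0, 30.0, 5.0]
predicted = interpolate_response(doses, rates, 1.7)
```

Interpolating on log-dose instead of dose would give a different prediction at 1.7 mg/kg from the same four points, which is the sense in which even this simple summary involves a choice.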

              • Glen M. Sizemore says:

                Andrew: What you are doing is bordering on trolling…

                GS: Ahh…I guess you’re right about jargon…I’ll have to update my Andrew Dictionary…let’s see…

                Entry:

                Trolling – disagreeing with the Great Andrew Gelman.

                Got it.

              • Glen M. Sizemore says:

                Andrew: I’m assuming a model, fitting the model to data, then using the fitted model to make predictions,[…]

                GS: Ok…Step 1 and 2 of hypothesis testing by your own description…

                Andrew: […]then we use these predictions for decision making. When we compare these predictions to data, that’s hypothesis testing. But typically we’re not doing much hypothesis testing, we’re mostly doing inference under the model.

                GS: So…you make predictions by assuming a model and then, what, refuse to check if it is reasonably predictive? And what if you did check and it is wildly inaccurate? But it won’t be because the model HAS BEEN TESTED (as in hypothesis-testing) before and found to be reasonably accurate. No? Why did you assume the model that you did? Whim? If the answer is that the model has been “vetted” (an in-the-news term lately) via hypothesis testing, then you are tacitly testing the hypothesis every time you make predictions – unless you refuse to check on the accuracy. But, let me ask again, what if you did check your predictions against the data and found them to be wildly inaccurate?

            • Anoneuoid says:

              if you want to know whether drug A on average works better than drug B, you can answer that question with a randomized experiment and very few modeling assumptions.

              No one ever wants to know this. They want to know how well their measurements can be extrapolated to future/other circumstances.

              • Garnett says:

                This can be so astoundingly difficult to explain to colleagues.

              • Carlos Ungil says:

                The FDA does.

              • Carlos: I think the FDA really wants to know how well the drug works in the general population. They use how well it worked in the study as an estimate of this more general thing, but they do post-marketing monitoring precisely because they know that the extrapolation to a very broad population is not guaranteed.

              • Carlos Ungil says:

                Daniel, I think there was recently a discussion on the meaning of “wanting” (in particular things which are out of reach). The FDA asks for randomized experiments with few modeling assumptions so at least in some sense this is what they want.

              • Martha (Smith) says:

                Sure, the FDA may want to know “whether drug A on average works better than drug B”. But is this what they should want to know? It’s not of great use to an individual patient, unless the trial is done only on people very similar to the patient (which rarely seems to be the case). And the wise patient also wants to know, “how much better,” and “what are the tradeoffs with side effects (again, with “patients like me”)?”

              • Anoneuoid says:

                The FDA does.

                I said “No one”. The FDA mostly wants to approve approximately the same number of new treatments every year in a given class. This minimizes questions about being too lax or strict. At least from what I’ve heard.

            • Andrew says:

              Z:

              You write, “there just seems to be less justification of this general within-person design advice than everything else you promote on the blog.” I think you’re right about this: I think it’s solid advice to recommend within-person designs but I don’t have much direct experience of such designs, outside of pharmacology/toxicology. I’m planning to write a paper on Bayesian analysis of crossover designs which should help a bit. But, for now, not all the dots are well connected.

              • Jacob says:

                For political science applications, do you see panel survey data as an example of a within-person design (depending on how it’s analyzed, of course)?

              • Andrew says:

                Jacob:

                Yes.

              • Doug Davidson says:

                In the reproducibility project Science paper, if I remember correctly, the replication rate for within-subjects designs was higher than the between-subjects designs. If that is true, can we consider that evidence?

                However, I would like to make a counter-argument to this line of reasoning about within-subjects designs. In situations where the effect size is small, relative to the subject-to-subject variation, psychologists often prefer (at least in my corner of the field) within-subjects designs because they are more sensitive. For the purposes of detecting small effects, on the face of it these designs would seem to be more effective.
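The sensitivity claim can be checked with a small simulation. This is a sketch under strong assumptions that are not from the thread: additive person effects, independent Gaussian noise, no carryover between conditions (that is, no "poisoning the well"), and made-up numbers. The function name `simulate_se` is hypothetical.

```python
import random
import statistics


def simulate_se(n, effect, subject_sd, noise_sd, reps, rng):
    """Return the sampling SD of the effect estimate under each design.

    Both designs use 2 * n measurements: the between-person design puts n
    different people in each arm, the within-person design measures the
    same n people under both conditions.
    """
    between, within = [], []
    for _ in range(reps):
        # Between-person: different people in each arm.
        a = [rng.gauss(0, subject_sd) + rng.gauss(0, noise_sd) for _ in range(n)]
        b = [rng.gauss(0, subject_sd) + effect + rng.gauss(0, noise_sd)
             for _ in range(n)]
        between.append(statistics.fmean(b) - statistics.fmean(a))

        # Within-person: the person-level baseline cancels in the difference.
        diffs = []
        for _ in range(n):
            base = rng.gauss(0, subject_sd)
            treated = base + effect + rng.gauss(0, noise_sd)
            control = base + rng.gauss(0, noise_sd)
            diffs.append(treated - control)
        within.append(statistics.fmean(diffs))
    return statistics.stdev(between), statistics.stdev(within)


rng = random.Random(0)
se_between, se_within = simulate_se(
    n=20, effect=0.3, subject_sd=2.0, noise_sd=0.5, reps=500, rng=rng
)
```

Under these assumptions the between-person estimate has variance 2(subject_sd² + noise_sd²)/n while the within-person estimate has variance 2·noise_sd²/n, so the gain grows with the ratio of person-to-person variation to measurement noise. A carryover effect would add bias that this sketch does not model.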

                But isn’t there a problem when researchers make claims about the magnitude of these effects in more realistic settings than the laboratory? I am thinking here of situations where psychologists try to extrapolate lab findings to settings that are more applied, e.g., schools.

              • Glen M. Sizemore says:

                “But isn’t there a problem when researchers make claims about the magnitude of these effects in more realistic settings than the laboratory? I am thinking here of situations where psychologists try to extrapolate lab findings to settings that are more applied, e.g., schools.”

                GS: How is that an issue with “within-person” designs per se? Or, put another way, how is what you’re asking about not a general scientific issue?

            • Matt, regarding estimating a “dense” model where each of the variables is fundamental:

              Take something like the Lynx dataset, suppose there is an ODE: dP/dt = f(P,E(t),Theta) where P is the population vector (of lynx and hares), and E(t) are the variables that describe the environment, say food and weather and whatnot, and Theta are unknown parameters, like rate coefficients and things, and f is a hypothesized formula.

              Now you have observations of P,E with measurement error, through time. Can you infer Theta? The answer depends on the form of f, but in many cases yes, in the limit of a lot of data, you can infer Theta purely from observation without any experimental perturbation of the system. The nice thing about Bayesian methods is that if you can’t infer Theta, you’ll wind up knowing it because the posterior will fail to concentrate on a single value of Theta.

              Prior and Posterior predictive distributions can help you figure out what additional data or experiments would allow you to identify the remaining coefficients. For example, suppose fertility is a function of temperature. Perhaps you need to move to a slightly different geography where the range of temperatures is wider, and then you’ll be able to tease out the confounding between a base fertility rate, and the effect of temperature on fertility… or whatever. You can generate values from the prior and then run simulations and then fit your model to the simulated data, and see what kinds of variability are informative for inference.

              I think all of these things are way way more fundamental than the special cases implied by Angrist and Pischke’s blurb (but as I say, I haven’t read the actual book).
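A rough sketch of the "simulate, then fit to the simulated data" check described above. Everything specific here is an assumption for illustration: a Lotka-Volterra form for f with no environmental term E(t), a single unknown component of Theta (the predation rate), multiplicative log-normal measurement error, a flat prior on a grid, and invented parameter values. It uses simulated data, not the actual Lynx dataset.

```python
import math
import random


def lv_rhs(state, theta):
    """dP/dt = f(P, Theta) for a hare/lynx system (assumed functional form)."""
    hares, lynx = state
    growth, predation, conversion, death = theta
    return (growth * hares - predation * hares * lynx,
            conversion * hares * lynx - death * lynx)


def simulate_path(theta, start=(10.0, 5.0), dt=0.05, steps=400):
    """Integrate the ODE with a fixed-step fourth-order Runge-Kutta scheme."""
    state, path = start, [start]
    for _ in range(steps):
        k1 = lv_rhs(state, theta)
        k2 = lv_rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), theta)
        k3 = lv_rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), theta)
        k4 = lv_rhs(tuple(s + dt * k for s, k in zip(state, k3)), theta)
        state = tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                      for s, a, b, c, d in zip(state, k1, k2, k3, k4))
        path.append(state)
    return path


def log_likelihood(theta, observed, noise_sd):
    """Log-normal measurement-error likelihood, up to a constant."""
    total = 0.0
    for (h, l), (oh, ol) in zip(simulate_path(theta), observed):
        if h <= 0 or l <= 0:
            return float("-inf")
        total += -0.5 * ((math.log(oh) - math.log(h)) / noise_sd) ** 2
        total += -0.5 * ((math.log(ol) - math.log(l)) / noise_sd) ** 2
    return total


rng = random.Random(42)
true_theta = (1.0, 0.1, 0.075, 1.5)   # hypothetical rate coefficients
noise_sd = 0.1

# Observe the system passively, with measurement error, through time.
observed = [(h * math.exp(rng.gauss(0, noise_sd)),
             l * math.exp(rng.gauss(0, noise_sd)))
            for h, l in simulate_path(true_theta)]

# Grid posterior for the predation rate, other coefficients held fixed.
grid = [0.05 + 0.005 * i for i in range(21)]
logp = [log_likelihood((1.0, p, 0.075, 1.5), observed, noise_sd) for p in grid]
peak = max(logp)
weights = [math.exp(lp - peak) for lp in logp]
posterior = [w / sum(weights) for w in weights]
post_mean = sum(p * w for p, w in zip(grid, posterior))
```

If the posterior stayed spread across the grid instead of concentrating, that would be the signal that the observations alone do not identify the coefficient and that different data or an experimental perturbation is needed.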

          • Comments here are a bit messy but somewhere you will find me referring to a model for what it means for a drug to be better than another drug. This isn’t a data generating mechanism, just a conscious choice you should make about tradeoffs in outcomes. It’s a model because there is no obvious one true path.

            I really do think reiterating the conscious choice point is useful, many many people have a very deferential attitude towards analyzing data. They do what some textbook recommends or some software package has canned, even if their subject matter knowledge says that it makes no sense.

    • Glen M. Sizemore says:

      “What’s the empirical basis for your within-person comparisons advice?”

      GS: I’ll answer that question. Or rather, you can answer it yourself – just see the history of the natural science of behavior, once called “the experimental analysis of behavior,” and now usually referred to as “behavior analysis.” The conceptual rudiments and canonical experiments and procedures were already in existence when Fred Skinner published The Behavior of Organisms (1938). You could also see the entire history of the journal called, appropriately enough, The Journal of the Experimental Analysis of Behavior and, later, The Journal of Applied Behavior Analysis. You could also see Sidman’s Tactics of Scientific Research. No…no…don’t mention it…I’m always pleased to enlighten those who are unaware of, you know, 80 years of basic research that gave rise to the only generally-effective technology of behavior.

      Cordially,
      Glen

      • Z says:

        So you’re saying that you work in a field called ‘behavior analysis’. Got it, enlightening, thanks.

        • Glen M. Sizemore says:

          No…I’m telling you that the question you asked has been answered by a field now called “behavior analysis” that is 80 years old.

          • Corey says:

            Well, you were a bit snide and a bit superior. I heartily approve of that attitude — but only when Bayesians apply it to frequentism. It is surely an unhelpful way of engaging in all other cases.

            • Glen M. Sizemore says:

              C: Well, you were a bit snide and a bit superior.

              GS: Well, I’m not the one that asks rhetorical questions that are rhetorical because the asker is unaware of the natural science of behavior. True…it (the science) is not popular with most psychologists…but do you want to use what mainstream psychologists think as a yardstick of good science? But mainstream psychologists are largely against behaviorism, about which they generally know nothing. Anyway…

              C: I heartily approve of that attitude — but only when Bayesians apply it to frequentism.

              GS: Perhaps you should re-think your position.

              C: It is surely an unhelpful way of engaging in all other cases.

              GS: Especially around here where the arrogance and hubris is so thick you can cut it with a knife.

              • Corey says:

                GS: Perhaps you should re-think your position.

                As is well-known, Bayesians never re-think anything.

              • Glen M. Sizemore says:

                “As is well-known, Bayesians never re-think anything.”

                GS: Which perhaps explains the conflation of the hypothetico-deductive method with science in general that runs rampant around here…

              • Corey says:

                Hey don’t blame me for that! When I run rampant I conflate science in general with Bayesian epistemology.

          • Z says:

            Yeah, but you seem like the type of person who would claim this about whatever field you worked in, so the only concrete information I got from your post is that you work in a field called ‘Behavior Analysis’.

            • Glen M. Sizemore says:

              Wrong. I also told you that the questions you asked about what I am calling “single-subject designs” and Andrew was calling “within-person” can be answered by looking into behavior analysis. I also told you about the “First Book” – The Behavior of Organisms (1938). I also told you about The Journal of the Experimental Analysis of Behavior and The Journal of Applied Behavior Analysis. AND, most importantly, I told you about Sidman’s classic Tactics of Scientific Research (1960). Further, since the question you asked Andrew was about the efficacy of such designs, and I gave you citations, you could pretty much guess (with only a little thought) that behavior analysis uses SSDs. BTW, I never said that I worked in behavior analysis.

  2. Martha (Smith) says:

    I agree that the two additional tips are “arguably the most important pieces of advice.” Without good design and data collection, a study is likely to be dead on arrival.

  3. Aaron G says:

    Andrew,

    We have all seen examples where your above advice was not followed by various researchers (in psychology and in other fields), with the subsequent problems in reproducibility, among other problems.

    My question to you would be this: what is the response you have received from psychology and other social science researchers with respect to your advice? Have they ever given you reasons about why they have not followed your advice (e.g. making the data public)?

    • Andrew says:

      Aaron:

      The psychologists I speak with, seem to be supportive of my message, at least while they’re hearing it. But then they get other messages from other sources. They learn in their statistics classes about “statistical significance” and “power” and “Bayes factors”; they get the message that causal mediation analysis, or factor analysis, or some other set of buttons to push, is the sophisticated way to go; they learn from reading papers in Psychological Science that one should be able to conduct a series of 5 or 10 experiments and get consistent support for a hypothesis; they learn from Harvard press releases that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” and from Susan Fiske that “A replication failure is not a scientific problem; it is an opportunity to find limiting conditions and contextual effects.” Lots of different people sending quite different messages.

      • Martha (Smith) says:

        +1
        The price of good science is eternal vigilance.

      • As a psychologist myself, I can confirm that Andrew’s advice has not fallen completely on deaf ears. Change is slow, but it is coming. Psych Science (our flagship journal) will soon make open data a quasi-requirement, will have an exclusive section on registered replication reports (for studies that replicate an effect that was published in Psych Science). The new editor Stephen Lindsay is an amazing advocate for open science, and I think much of the field will follow. People like Richard Morey, and his PRO reviewer initiative (to make peer review more transparent and more rigourous), and Felix Schoenbrodt’s Openness Initiative are all exciting news.
        As disheartened as I was when my field (Psych) was (sometimes rightfully) dragged through the mud, I am incredibly hopeful for the future.

        • Garnett says:

          Here is an interesting caveat: Government employees may require an FOIA request before providing data.

        • Solomon Kurz says:

          Well posted, Felix. I’m just a year short of my doctorate in psychology and I find the changes underfoot in our domain riveting. Change is slow, but it seems like some of our institutions are moving in good directions. I’ll take it.

  4. Marcus says:

    I agree with all those points but I fear that all such advice will tend to fall on deaf ears until “doing good research” is better aligned with “getting published and getting promoted/famous”. That will require greater sophistication from journals or more real negative consequences for continuously and intentionally capitalizing on questionable research practices.

    • Solomon Kurz says:

      Perhaps, Marcus, but those are empirical questions. Recall how, in recent months, Andrew’s posted several times on how embarrassing it is that instructors of statistics don’t use sound experimental methods to evaluate themselves (e.g., randomizing pedagogical techniques by section or across years). Similarly, what we need are teams of social scientists applying our craft to study the effects of the various open science proposals. Then we can shift from rhetoric to systematic reviews. Studies like this are a great start: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002456. And, of course, those scientists would do well by using open science practices in their efforts to examine the effectiveness of institutional interventions.

      • Glen M. Sizemore says:

        “Andrew’s posted several times on how embarrassing it is that instructors of statistics don’t use sound experimental methods to evaluate themselves (e.g., randomizing pedagogical techniques by section or across years).”

        GS: And…that’s what you are calling “…sound experimental methods”?

        • Solomon Kurz says:

          Glen, if you’re objecting that my exemplars are between participants, your point is well taken. But I’ll take a between participant design over none at all. Happily, the Kidwell et al. paper I linked to was a single case study, of sorts.

        • Andrew says:

          Solomon, Glen:

          My first article on this topic was “Statisticians: When we teach, we don’t practice what we preach,” with Eric Loken. It was published in 2012. In that paper we emphasize the importance of pre-test and post-test, that is, multiple measurements per student. In this context I think it would be difficult to do both treatments for each student—they only have time to take a course once—but perhaps it would be possible to try different treatments in different segments of a course, for example. In any case it would be hard to make much progress here without good measurements.

          • Glen M. Sizemore says:

            A: My first article on this topic was “Statisticians: When we teach, we don’t practice what we preach,” with Eric Loken. It was published in 2012. In that paper we emphasize the importance of pre-test and post-test, that is, multiple measurements per student.

            GS: If one is going to worry about practicing what one preaches, then one should be careful about what one preaches. Now, the issue isn’t what “design” one should use to evaluate one’s teaching practices in the sense that you mean it. The “correct” general approach is already understood. And this understanding means that effectiveness as a teacher is judged by demonstrations of experimental control of the individual student’s behavior. This will entail measurements of behavior that are relatively (to say the least) frequent – ideally, say, every day during which behavior is continuously recorded (e.g., rates* of “correct response” etc.). And it is worth noting that the general approach of which I speak was not arrived at through hypothesis testing but, rather, emerged inductively from the laboratory study of non-humans.

            *I know! We’ll plot the frequencies of relevant behaviors on 6-cycle log paper! :

            http://psycnet.apa.org/journals/bar/6/4/207.html

            Note that “precision teaching” is really the sort of general approach I am talking about above – it simply recognizes the importance of frequency of response, control of variability through experimental control combined with the basic notion that behavior is a function of its consequences and comes to occur with different probabilities in different stimulus conditions and so forth.

            A: In this context I think it would be difficult to do both treatments for each student—they only have time to take a course once—but perhaps it would be possible to try different treatments in different segments of a course, for example.

            GS: Yes. But, in the context of what is already known (what I have been describing), “different treatments” are likely to be unplanned seat-of-the-pants alterations in one or more variables controlling an individual’s behavior (if it’s not, you’ll find out). But, of course, you can’t teach the same material to the same student twice, but good experimental control means that it is likely that frequencies of the relevant response classes will be relatively stable across time. And it may be that it is celeration that is steady. For example, to take a simple “animal situation,” a pigeon that has learned to learn a new 3-color X position sequence each day will have a steady-state of rate-of-acquisition of the sequence. Anyway…

            A: In any case it would be hard to make much progress here without good measurements.

            GS: Like I said before I am suspicious, perhaps unfairly, about what you mean by measurement. Is it anything more than assigning dimensional quantities to properties of things and events? Of course, that is crucial, and a fundamental aspect of natural science.

          • Martha (Smith) says:

            Andrew said: ” In that paper we emphasize the importance of pre-test and post-test, that is, multiple measurements per student. “

            This gets back to the important point in the original post, “Put in the effort to take accurate measurements. Without accurate measurements (low bias, low variance, and a large enough sample size), you’re drawing dead.”: Constructing a good test of learning is usually really difficult. This is an area where a lot of work is needed in studying educational methods — one researcher’s pre/post test may be very different from another researcher’s, both in what they are trying to measure and how well they succeed in measuring it.

            • Glen M. Sizemore says:

              “Without accurate measurements (low bias, low variance, and a large enough sample size)”

              GS: Just out of curiosity, are you of the opinion that, say, modern thermometers give accurate measurements because of “…large enough sample size”?

              • Martha (Smith) says:

                No, I am not of the opinion that “modern thermometers give accurate measurements because of large enough sample size.” In fact, I don’t see how “large enough sample size” would be a reason for claiming that any instrument “gives accurate measurements”.

                Just out of curiosity: Why did you ask this?

              • Martha (Smith) says:

                Ah, I think I see why you asked: I sloppily quoted too much from Andrew; I should have said, “Put in the effort to take accurate measurements. Without accurate measurements (low bias, low variance…), you’re drawing dead.”

                And I suspect that what Andrew should have said originally was, “Put in the effort to take accurate measurements. Without accurate measurements (low bias, low variance) and a large enough sample size, you’re drawing dead.”
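The distinction drawn here, that sample size is separate from the accuracy of the instrument, can be made concrete with the error of a sample mean. This is a sketch assuming a simple random sample, independent measurement errors, and a constant additive bias; the function name `rmse_of_mean` and all numbers are hypothetical.

```python
import math


def rmse_of_mean(bias, person_sd, measurement_sd, n):
    """Root-mean-square error of a sample mean of n noisy measurements.

    The variance term shrinks with n; the bias term does not.
    """
    variance_of_mean = (person_sd ** 2 + measurement_sd ** 2) / n
    return math.sqrt(bias ** 2 + variance_of_mean)


# Unbiased but noisy instrument: a larger sample helps.
noisy_small = rmse_of_mean(bias=0.0, person_sd=1.0, measurement_sd=2.0, n=25)
noisy_large = rmse_of_mean(bias=0.0, person_sd=1.0, measurement_sd=2.0, n=2500)

# Biased instrument: the error never drops below the bias, whatever n is.
biased_large = rmse_of_mean(bias=0.5, person_sd=1.0, measurement_sd=2.0, n=2500)
```

In this simplified setting, a larger sample compensates for measurement variance but not for bias, which is consistent with listing sample size alongside, rather than inside, the definition of an accurate measurement.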

      • Marcus says:

        My point was that open science proposals are unlikely to have much of an impact on the base rate with which people intentionally misreport their findings. In my specific field there are a number of recent papers looking at very specific types of errors that should probably be characterized as research fraud. These papers (including one I have in review) find that between 30%-50% of papers are affected by these errors but there are never any real negative consequences for the authors. The journals are largely uninterested in correcting the record and the offending (often serial) researchers get to keep their cushy endowed chairs. The message to new researchers is that this is the way that we play the research game with winners landing six figure salaries straight out of graduate school. I don’t think that any kind of preregistration or data sharing will fix this type of problem.

        • Solomon Kurz says:

          On that point, I concur with Carol.

          • Martha (Smith) says:

            “These papers (including one I have in review) …”

            Can you supply any links to preprints? Or references to any of these papers that have already been published?

            • Marcus says:

              My own paper has a conditional accept (pending minor changes) but I am hesitant to provide it until it has been accepted. Two articles that make a similar point are here:
              http://journals.sagepub.com/doi/abs/10.1177/0149206314527133
              http://journals.sagepub.com/doi/abs/10.1177/1094428116676345

              Other papers identifying similar problems are:
              http://onlinelibrary.wiley.com/doi/10.1111/peps.12111/abstract
              and of course this famous one: http://journals.sagepub.com/doi/abs/10.1177/0956797611430953

              An earlier paper of mine that somewhat accidentally discovered very substantial problems in some very high profile papers in my field is here:
              http://onlinelibrary.wiley.com/doi/10.1002/job.2008/abstract

              • Martha (Smith) says:

                Thanks for the links.

                Since this thread has gotten pretty lengthy with some sideturns, I’ll quote from your original post to help reinforce your original message:

                “I agree with all those points but I fear that all such advice will tend to fall on deaf ears until “doing good research” is better aligned with “getting published and getting promoted/famous”. That will require greater sophistication from journals or more real negative consequences for continuously and intentionally capitalizing on questionable research practices.”

                So this gets back to the question of “incentives,” which has been discussed many times — but the bottom line seems to be that there has been little if any progress in changing the incentives. In other words, it’s a tough problem. FWIW, http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001747 is a paper by Ioannidis with some ideas for changing practices. Quoting part of the abstract:

                “Selection of interventions to improve research practices requires rigorous examination and experimental testing whenever feasible.
                Optimal interventions need to understand and harness the motives of various stakeholders who operate in scientific research and who differ on the extent to which they are interested in promoting publishable, fundable, translatable, or profitable results.
                Modifications need to be made in the reward system for science, affecting the exchange rates for currencies (e.g., publications and grants) and purchased academic goods (e.g., promotion and other academic or administrative power) and introducing currencies that are better aligned with translatable and reproducible research.”

                It’s clearly a tough problem.

  5. Keith O'Rourke says:

    Interesting.

    The benefit cost of (between group?) randomized evidence is often not really worth it, but when it is, Fisher and Rubin(?) argued that design should lessen assumptions and/or make them more likely to be true (a far better motivation for design than efficiency). One wants a basis for sturdy decisions to be made. Within group randomized evidence requires more assumptions – the worry about poisoning the well – so avoid it here if you can.

    Alternatively, if randomized evidence is not readily achievable or ethical Rubin argued that one should try to get as close as one could to being in a similar situation as randomization would put you if you could randomize. Here there seems to be a warning against building complex causal modelling.

    Now, it also seems that psychology research has few current opportunities where benefit cost of (between group) randomized evidence is often not really worth it. I do think it is a mistake to go for randomized evidence or exclusively make use of it when its real benefit cost is poor.

  6. Doug Davidson says:

    “But isn’t there a problem when researchers make claims about the magnitude of these effects in more realistic settings than the laboratory? I am thinking here of situations where psychologists try to extrapolate lab findings to settings that are more applied, e.g., schools.”

    GS: How is that an issue with “within-person” designs per se? Or, put another way, how is what you’re asking about not a general scientific issue?

    I don’t think it is an issue with the within-subjects designs specifically, but rather the blanket recommendation that they should be preferred. If the issue you are studying requires greater sensitivity, then I can see why it is an advantage. If the issue is whether a given effect magnitude will have a practical impact given the person-to-person variability in a realistic setting, then it seems to me that the design should be optimized to get an accurate measure of the effect size(s) under realistic circumstances. If it is possible to do that with repeated measures, then sure.
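    To make the sensitivity point concrete, here is a minimal sketch (not from the thread) of the textbook standard errors for the two designs. It assumes purely additive person effects, no carry-over ("poisoning the well"), and a treatment effect that is the same for everyone; the function names and the numbers in the usage note are illustrative assumptions only.

```python
import math


def se_between(n_per_group: int, sd_person: float, sd_noise: float) -> float:
    """Standard error of a difference in means between two independent groups.

    Each observation carries both person-to-person variation and
    measurement noise, so both variances enter the standard error.
    """
    total_var = sd_person ** 2 + sd_noise ** 2
    return math.sqrt(2 * total_var / n_per_group)


def se_within(n_people: int, sd_person: float, sd_noise: float) -> float:
    """Standard error of the mean within-person difference (crossover design).

    Under the additive, no-carry-over assumption the person effect cancels
    in each individual's difference, leaving only measurement noise.
    sd_person is accepted for a parallel signature but does not enter.
    """
    return math.sqrt(2 * sd_noise ** 2 / n_people)


def sensitivity_ratio(sd_person: float, sd_noise: float) -> float:
    """How many times larger the between-group SE is, at equal n."""
    return math.sqrt((sd_person ** 2 + sd_noise ** 2) / sd_noise ** 2)
```

    For example, with hypothetical values sd_person = 3 and sd_noise = 1, the between-group standard error is sqrt(10), or roughly 3.2, times the within-person one at the same n, even though the between-group design uses twice as many people. If the effect itself varies from person to person, or carry-over is present, these formulas no longer apply, which is the trade-off under discussion.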

    • Glen M. Sizemore says:

      “But isn’t there a problem when researchers make claims about the magnitude of these effects in more realistic settings than the laboratory? I am thinking here of situations where psychologists try to extrapolate lab findings to settings that are more applied, e.g., schools.”

      GS: How is that an issue with “within-person” designs per se? Or, put another way, how is what you’re asking about not a general scientific issue?

      DD: I don’t think it is an issue with the within-subjects designs specifically, but rather the blanket recommendation that they should be preferred.

      GS: Not even *I* would say that between group (BG) designs were always crap. But where SSDs can be used, they should be used. And studies can be combinations of BG and SSD. This was the usual design for experiments I was involved in when I was part of The Center for the Neurobiological Investigation of Drug Abuse. The behavior of all subjects (rats) was maintained by, say, (response-dependent*) cocaine infusions, and was measured during daily experimental sessions until relevant measures were stable. Then, one group would get specific-area brain lesions, and the other group, sham lesions.

      *That is what “maintained” indicates.

      DD: If the issue you are studying requires greater sensitivity, then I can see why it is an advantage.

      GS: I would not see this as the primary advantage. Since we are talking psychology (and, really, many fields where behavior is the dependent-variable) the behavior of individuals IS the subject-matter. That would be near the top along with the multitude (relatively speaking) of “internal” replications in SSDs.

      DD: If the issue is whether a given effect magnitude will have a practical impact given the person-to-person variability in a realistic setting, then it seems to me that the design should be optimized to get an accurate measure of the effect size(s) under realistic circumstances. If it is possible to do that with repeated measures, then sure.

      GS: But you are implying that reliability as well as generality (!) can be seen in one BG study and they cannot. You are essentially saying, “If I have a bunch of subjects and I see an effect [by whatever manner this is ascertained], then it is likely to be reliable and general and an accurate depiction of effect magnitude.” But, perhaps, that is beside the point when you are conflating a dependent-variable relevant to the behavior of individuals with one that is not relevant to the behavior of individuals.

      • Doug Davidson says:

        Dear Glen,

        Hey, sorry if my comments implied some of these things – it wasn’t really my intention. The only point I wanted to make (and I can see I failed here), is that designs usually require trade-offs and compromises. In situations where repeated measures are difficult or expensive to obtain (but still possible), other factors may be more important to consider.

        • Glen M. Sizemore says:

          “In situations where repeated measures are difficult or expensive to obtain (but still possible), other factors may be more important to consider.”

          GS: Wouldn’t a paraphrase of the above be “If doing science correctly is hard or expensive, do science poorly”?

          • No, I don’t think that’s the case at all. A much better paraphrase is something like “do the best science you can within your budget, by allocating your resources in the way that gives you the most information not some formulaic way that some textbook demands”

            • Martha (Smith) says:

              +1 — but also be open about your choices and why you made them, and what resources are needed to do a better study.

            • Glen M. Sizemore says:

              Daniel: No, I don’t think that’s the case at all. A much better paraphrase is something like “do the best science you can within your budget, by allocating your resources in the way that gives you the most information not some formulaic way that some textbook demands”

              GS: Other than the remark about analyzing data in a formulaic fashion [Am I being accused of that…or is that a statement about standard NHST p-value BS?], your statement strikes me as a politician’s way of saying exactly what I said: “If doing science correctly is hard or expensive, do science poorly.” Or, rather, your statement (first part) translates as: “If doing science correctly is hard or expensive, do science less correctly.”

              • Glen, suppose you are studying seismic activity in southern California. You can’t afford to buy and install and maintain 10,000 continuous recording seismometers with power and cellular data connections etc. So you hire a very capable data analyst who invents a technique that gets you about 80% as much information using 300 of them placed in very specific locations and using precise microsecond accurate timing info… Is this doing science badly? I don’t think so at all.

              • Glen M. Sizemore says:

                Daniel: Glen, suppose you are studying seismic activity in southern California. You can’t afford to buy and install and maintain 10,000 continuous recording seismometers with power and cellular data connections etc. So you hire a very capable data analyst who invents a technique that gets you about 80% as much information using 300 of them placed in very specific locations and using precise microsecond accurate timing info… Is this doing science badly? I don’t think so at all.

                GS: Well, it is science done less well. No? But that sounds like nitpicking and, of course, I always understood your point. Your *theory* about how science should advance. In practice – and this whole issue did begin with the field of psychology (90% of which is crap) – SSDs will always be enormously time-consuming (relatively speaking, and we’re talking orders of magnitude here) and almost always more expensive (until you get to some of the massive clinical-trials monsters). When humans are the subjects (I forget that when y’all hear “psychology,” you think humans – I think of research that uses both humans and nonhumans), a new wrinkle is added – how do you get the subjects to come to your lab for, say, most of 230 days or whatever?

                And BTW, your example is almost an engineering problem rather than a scientific one, although the distinction can be fuzzy to the point of mist. You chose an example where a great deal is known about what to measure, and how to measure it – else how would you know you’re getting “80% of the information”? But for many sciences (or “sciences” in the case of much of psychology) your example is functionally irrelevant. Further, there is a purely conceptual issue involved here – nothing will change the fact that behavior is a phenomenon that pertains to individuals – there simply is no substitute for developing procedures by which you exert experimental control over the subject matter for extended periods during which behavior is measured continuously (at least during each of many experimental sessions).

                Anyway, I’ll cut it short since I can be somewhat long-winded. In closing, let me just say that, for psychology, particularly WRT avoiding SSDs, your theory is a recipe for disaster. Indeed, mainstream psychology is living proof.

  7. Peter Chapman says:

    Your best post ever. Bias and noise are the causes of problems with experiments in Psychology. Get this right and the replication crisis evaporates.

    • Keith O'Rourke says:

      Well, with no bias or noise at all you have math and no replication crisis because other qualified mathematicians can replicate claims easily and cheaply with pen and paper or maybe computational resources.

      Getting unavoidable bias and noise (both) _right_ is what we need a meta-statistics (thinking about how you ought to do statistics) for so that the statistical foundations crisis evaporates.
