Would you prefer three N=300 studies or one N=900 study?

Stephen Martin started off with a question:

I’ve been thinking about this thought experiment:


Imagine you’re given two papers.
Both papers explore the same topic and use the same methodology. Both were preregistered.
Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).
Paper B has a novel study with confirmed hypotheses (n=900).
*Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

I’m reasonably certain the answer is that both papers provide the same amount of evidence, by essentially the likelihood principle, and if anything, one should trust the estimates of paper B more (unless you meta-analyzed paper A, which should give you the same answer as paper B, more or less).

However, my intuition was correct that most people in this group would choose paper A (See https://www.facebook.com/groups/853552931365745/permalink/1343285629059137/ for poll results).

My reasoning is that if you are observing data from the same DGP, then where you cut the data off is arbitrary; why would flipping a coin 10x, 10x, 10x, 10x, 10x provide more evidence than flipping the coin 50x? The method in paper A essentially just collected 300, drew a line, collected 300, drew a line, then collected 300 more, and called them three studies; this has no more information in sum (in a Fisherian sense, the information would just add together) than if you didn’t arbitrarily cut the data into sections.

If you read the comments in that group (which consists predominantly of researchers from the NHST world), you see the fallacy that merely passing a threshold more times means you have more evidence. They use p*p*p to justify it (even though that doesn’t make sense, because one could partition the data into 10 n=90 sets and get ‘more evidence’ by this logic; in fact, you could have 90 p-values of ~.967262 whose product is .05). They use Fisher’s method to say the combined p-value could be low (~.006), even though if you actually pooled the data, the p-value would be even lower (~.0007). Some employ only Neyman-Pearson logic, under which this results in a Type 1 error probability of .05^3.
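
To put numbers on those claims, here is a quick check in Python with scipy, assuming each "confirmed hypothesis" corresponds to a two-sided p-value of exactly 0.05:

import numpy as np
from scipy import stats

print(0.967262 ** 90)                        # naive multiplication: ~0.05 from 90 weak p-values

p = np.array([0.05, 0.05, 0.05])             # three "just significant" studies
fisher = -2 * np.sum(np.log(p))              # Fisher's method statistic, chi-square on 2*3 df
print(stats.chi2.sf(fisher, df=2 * len(p)))  # ~0.006

z = stats.norm.isf(p / 2)                    # ~1.96 per study (two-sided)
z_pooled = z.sum() / np.sqrt(len(p))         # Stouffer pooling, what analyzing all 900 points at once gives
print(2 * stats.norm.sf(z_pooled))           # ~0.0007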

I replied:

What do you mean by “confirmed hypotheses,” and what do you mean by a replication being “successful”? And are you assuming that the data are identical in the two scenarios?

To which Martin answered:

I [Martin], in a sense, left it ambiguous because I suspected that knowing nothing else, people would pick paper A, even though asymptotically it should provide the same information as paper B.

I also left ‘confirmed hypothesis’ vague, because I didn’t want to say one must use one given framework. Basically, the hypotheses were supported by whatever method one uses to judge support (whether it be p-values, posteriors, bayes factors, whatever).

Successful replication as in, the hypotheses were supported again in the replication studies.

Finally, my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.

That said, if you are experimenter A gaining three n=300 samples, your data should asymptotically (that is, over infinite datasets) equal that of experimenter B gaining one n=900 sample, in the sense that the expected total information is equal, and the accumulated evidence should be equal. Therefore, even if any given two papers have different datasets, asymptotically they should provide equal information, and there’s not a good reason to prefer three smaller studies over one larger one.

Yet, knowing nothing else, people assumed paper A, I think, because three studies is more intuitively appealing than one large study, even if the two could be interchangeable had you divided the larger sample into three, or combined the smaller samples into 1.

From my perspective, Martin’s question can’t really be answered because I don’t know what’s in papers A and B, and I don’t know what is meant by a replication being “successful.” I think the answer depends a lot on these pieces of information, and I’m still not quite sure what Martin’s getting at here. But maybe some of you have thoughts on this one.

155 thoughts on “Would you prefer three N=300 studies or one N=900 study?”

  1. Once more proving that if you teach people muddled ideas in higher education classes that cost a lot in tuition, you can create a large army of muddled people who will muddle things up.

    Here’s what I think is actually true. If different groups all do different experiments, then the variety of apparatus/setups will help average out bias in the measurement error. In this context the several replications are an improvement on the alternative where a single group uses a single apparatus/setup to do a single experiment with a large number of replications.

    This is seen immediately in the different Bayesian models you’d use for the two scenarios (in Stan pseudocode)

    for the case with several experiments, each one with its own Bias[k]

    Bias[k] ~ bias_distrib()

    Data[k][i] ~ error_distrib(Bias[k])

    vs in the case of a single experiment with a single Bias:

    Bias ~ bias_distrib();
    Data ~ error_distrib(Bias);
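
    A minimal numerical sketch of those two data-generating setups, in Python rather than Stan, with made-up values for the effect, bias, and noise scales, just to show where the benefit of separate setups comes from:

    import numpy as np

    rng = np.random.default_rng(0)
    true_effect, bias_sd, noise_sd, n, k = 1.0, 0.3, 1.0, 300, 3

    # several experiments, each with its own Bias[k] ~ bias_distrib()
    bias = rng.normal(0.0, bias_sd, size=k)
    data_k = [rng.normal(true_effect + bias[j], noise_sd, size=n) for j in range(k)]

    # one experiment with a single Bias ~ bias_distrib()
    bias_1 = rng.normal(0.0, bias_sd)
    data_1 = rng.normal(true_effect + bias_1, noise_sd, size=k * n)

    # averaging the k per-experiment means lets the k biases partly cancel;
    # the single big experiment is stuck with whatever bias_1 happened to be
    print(np.mean([d.mean() for d in data_k]) - true_effect)
    print(data_1.mean() - true_effect)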

    • Note: I don’t blame the poor guys who took Stats 101 and diligently learned what their teachers were teaching them. Time to simply *end* the teaching of incorrect NHST-based ideas. The big one at work here is “p < 0.05 = TRUE”; the “threshold”-based logic is illogical.

      • Daniel Lakeland, this has little to do with ‘poor guys’ ‘diligently’ learning NHST. The question is not complete, but under at least some reasonable interpretations, the probability of misleading data is smaller in the 3-study setting, given that the probability of concluding there is a signal in the data, and not just noise, when there is none is (assuming an alpha of 0.05) 0.05*0.05*0.05 in paper A. That is not ‘evidence’ in the Bayesian sense, but if we more liberally interpret ‘evidence’ as ‘which paper would you bet your house on drawing a correct conclusion about the presence of a signal’, paper A is the correct answer here.

        The answer ‘A’ is perfectly defensible from a carefully thought out rationale following N-P statistics. You get the same level of Bayesian evidence in both studies. But there is no reason to limit yourself to one approach to inference, so if you in addition take error control into account, answer A is correct.

        • From the original question “my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.”

          So the assumption I’m operating on is that in paper A someone set up some experiment and collected 300 data points. Then they did a t test and got p < 0.05; then they said, hey, let’s run it some more, collected another 300 data points, and again got p < 0.05; and finally they repeated the whole thing a third time.

          This isn’t 3 separate replications under varying but similar conditions, it’s just someone drawing a line in the sand and saying “at this point in my data collection process I’m calling my first experiment done”

          So, no, you’re absolutely wrong regarding “which paper would you bet your house on drawing a correct conclusion about the presence of a signal” the two are completely symmetric unless you think “drawing a line after 300 samples” is informative as to whether there “is a signal”.

          And yes, it is “poor guys” “diligently” learning NHST that makes people think such a thing could be more informative. The intuition is completely broken. If you do the Bayesian analysis you get the right answer which is that the math is the same in the two cases because “drawing a line here in the sand” is meaningless, just like remote retroactive intercessory prayer: https://www.ncbi.nlm.nih.gov/pubmed/11751349

        • Hi Daniel, can you tell me what the probability is that you would conclude there is an effect, when there is no effect (so your Type 1 error rate)? Are they the same in both studies? Because I’m not very good at math, but I doubt it.

        • To answer that question requires more than is available in the post. There are two basic situations:

          1) It’s all about how well some mathematical computational random number generator works, and it’s known for a fact that the RNG inputs are a fixed exact number.

          2) We’re talking science, not pure math / computing.

          In case (2) there is exactly zero probability that the unknown parameter is exactly 0. And so I can conclude, without even looking at the data, that the parameter is not zero, not even to the first 50 thousand decimal places, much less to the first 400,000 trillion decimal places.

          If you’d like to reword your question to something like “incorrectly concluding that the parameter is outside of [-epsilon,epsilon]” for a specific epsilon you give me… we could have a mathematically meaningful conversation.

          Of course, the “probability” you speak of is actually “frequency under repeatedly performing the same exact experiment” and “concluding that there is an effect” (ie. p(parameter is zero) = 0 exactly) is not something a Bayesian would ever do. So the frequency with which that occurs is zero.

        • In p(parameter is zero) = 0, the p is a density, not a probability. Of course there’s zero probability that the parameter is exactly zero, but there’s not zero density for the parameter in arbitrarily small regions around zero. Sorry if that was confusing.

        • Hi Daniel, thanks for your answer. You are wrong about the null not being able to be 0 (it is, since there is no causal mechanism linking random assignment to conditions to the survival rates after taking a drug that does not cure a patient, so it logically must be true – even if you will observe some random variation around 0 in any non-infinite sample). But by all means, set the epsilon to anything you could reject in an equivalence test using all people alive in the world today with 99% power. OK? And let’s ignore what Bayesians would never do for now – it’s not really relevant to the question at hand.

        • It is exactly relevant to the question at hand. It is the entire content of this post:

          http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

          The Bayesian doesn’t look to drop the p(0) below some threshold and declare “there is an effect”; the Bayesian outputs p(x) and says “the probability of being within epsilon of 0 is p(0)*epsilon”.

          And since in real models actually used by people the p(0) never goes to exactly zero, the Bayesian has exactly 0 frequency of Type 1 error.
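
          A quick numerical illustration of that point, using a hypothetical normal posterior (made-up numbers; the factor of 2 just comes from the width of the interval (-eps, eps)):

          from scipy import stats

          post = stats.norm(loc=0.4, scale=0.2)       # some hypothetical posterior for the effect
          for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
              exact = post.cdf(eps) - post.cdf(-eps)  # Pr(|effect| < eps)
              approx = 2 * eps * post.pdf(0.0)        # density at zero times interval width
              print(eps, exact, approx)

          # both columns shrink toward zero with eps: the posterior never puts a lump of
          # probability on "the effect is exactly zero", so that hypothesis is never
          # accepted or rejected with certainty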

        • And please, stop hijacking words – I used probability correctly. NHST is the dominant paradigm in science – if you want to use it in a different manner, please feel free to define it more specifically (but feel free not to – I’ll know what you mean from the context).

        • Also, exactly zero frequency of type 2 error. Because the posterior never becomes a delta function around 0.

          This is really *the* point. Stop thinking of statistics as a “certainty factory”. Andrew is always going on about this.

        • Daniel, you are pointing to a post that uses a correct prior to show multiple comparisons are not a problem. Please, try to raise the bar a little. Multiple comparisons are a problem, real scientists need to make dichotomous decisions all the time (do I do a study, or not), and they will draw incorrect conclusions a certain % of the time. My question is: please quantify these probabilities in situations A and B without trying to evade the question.

        • How often scientists come to incorrect conclusions is a property of their method of making choices, not a property of Bayesian inference. I’m not evading the question, I’m pointing out that people who’ve been trained as NHST people don’t “get it”. You are perfectly embodying that.

          Look at “figure 2” in the post. The probability that a Bayesian will make a claim with confidence goes to zero on the left. This is not true for NHST which stays at 0.05. It also goes towards 1 on the right… but it never reaches 1.

          Bayesians don’t make Type 1 and Type 2 errors ever. Sure, they may take an inference and decide on its basis to choose to do something and be wrong, but this is a property of their willingness to take risk, not a property of their method of inference.

        • Let me be more mathematically explicit, perhaps I will convince you that I’m not just trolling you.

          In the logic of Bayesian analysis, the output of the computation is a p(x) some probability density on a parameter x. The Bayesian inference only ever declares “there is certainly an effect” when for every epsilon sufficiently small integrate(p(x),-epsilon,epsilon) = 0 exactly. Now this occurs only when the density p(0) = 0 and in all but some specially constructed handful of cases, this never happens. Therefore the Bayesian NEVER declares “there is an effect with certainty”. Therefore the Bayesian never makes type 1 errors.

          the same goes for type 2 errors because p(0) never becomes infinite so the Bayesian is never dogmatically certain “there is definitely zero effect”.

          Now, of course, in making decisions, Bayesians can be wrong. But as I say, it’s not a property of inference, it’s a property of decision making. Suppose there is a decision rule that takes the posterior D(p(x)) -> action…. then *how often* a given Bayesian takes an incorrect action is determined entirely by the form of D. D is a function that balances the cost of being wrong with the benefit of being right. It’s related to questions like how many dollars you’re going to spend… It isn’t a set thing. There does not exist a meaningful frequency because there does not exist a single D that everyone agrees on.

        • Hi Daniel, you are not trolling, I know. Your last answer is correct: It’s a function of the risk you are willing to take. You are right saying ‘it is not a set thing’ (it isn’t in NHST, obviously, since you set the alpha in a similar way). The point is it is much easier to control this in NHST. If you work in a field where there is a replication crisis, and people worry about a lot of effects not being true (where true is, you know, the effect is in practical terms close enough to call it zero), there are extremely strong arguments to use an approach to inference that easily controls the error rate. That’s also why organisations like the FDA don’t care too much about Bayesian stats – it does not give them an easy way to control what they care about. Now feel free to convince the FDA they should care more about quantifying the subjective personal belief of researchers, but you’d join a very long list of people who have failed.

          The biggest mistake is thinking I ‘don’t get it’. It’s this attitude that makes the uptake of Bayesian stats so very, very slow. People like you think I don’t get it, when, if anything, it’s the other way around.

        • Note that the FDA example is very relevant. The FDA requires 2 studies showing significant results. Not one study with twice the sample size. If you really care about that not being a good way to save science, we should not continue this discussion here. You have more important things to do – convince the FDA their ways are flawed. It might be the biggest contribution to science you could make!

        • Frequentist logic holds fine when the assumptions hold. The assumption in a frequentist analysis is: f(d) is the long run frequency of the data under repeatedly performing the experiment.

          Now, f(d) is an infinite dimensional object in a Hilbert space. It is, in essence, an infinite sequence of numbers.

          The conceit of frequentist statistics is to say that everything is known about this *infinite sequence* of numbers except what constant to multiply it by and what constant to add to it.

          All the NHST results are of the form IF (I know all but 2 of an infinite sequence of numbers that describes the physics/chemistry/biology/economics/ecology of the world) then (I will be wrong less than 5% of the time…)

          I reply by “But you don’t know that infinite sequence of numbers, and therefore… you got no guarantees of anything”

        • I don’t think that’s a very convincing reply to the FDA.

          Until you have convinced the FDA, I’m gonna take my advice from people like Stephen Senn, who have some more experience in these matters than you have, and whose real-life experience doing statistics has led to more nuanced and useful ideas (for me personally). But thanks for your thoughts anyway.

        • If it weren’t for the fact that _land’s name is a link (in my browser, blue text) and _ns’s name is plain text, I’d be having a pretty hard time following this conversation.

        • Which recapitulates my point that people stick with what they “diligently learned [because] their teachers were teaching them”

          Convincing the FDA has exactly nothing to do with truth, and exactly everything to do with how much money you can spend and who will spend money to counteract your arguments.

        • _ns, the problem the FDA faces mixes inference with decision-making in a very explicit way; the FDA also has the challenge of not letting profit-seekers game the system. Instituting a process that controls the prospective probability of letting a useless (or even harmful) drug on the market is a sensible policy, but it’s not science.

        • Are you saying Stephen Senn is in the pocket of the FDA? :) I doubt it. I think he, just like you, thinks what he is saying is the ‘truth’. Go figure.

        • Hi Corey, can you please define ‘Science’ – I have 24 philosophers of science I can quote who disagree with your definition, but I first need to know how you define science to be able to choose the right ones. I think you get my point.

        • Daniel Lakens:

          I just signed up for your Coursera Course with high hopes that it would help bridge the gap in my statistical training between NHST traditional statistical training and Bayesian Methods in which I still consider myself a novice.

          That said, I have to say I find your tendency to “appeal to authority” rather than disentangle the complexities of these arguments a bit disheartening. Daniel Lakeland makes some of the clearest arguments on this blog and is quite generous in his willingness to explicate the complexities for people like myself. I think it would be beneficial for us all if you were willing to focus on the details of the arguments rather than the assertions of authorities.

        • Daniel Lakens:

          I disagree with your statement, “NHST is the dominant paradigm in science.” Null hypothesis significance testing is the dominant paradigm in some sorts of statistical analyses in psychology, medicine, and other fields, but no way would I say it’s the dominant paradigm in science. To start with, I think that most of the important findings in science are not statistical at all, or not statistical in the “NHST” sort of way. (Yes, quantum mechanics is statistical but this has nothing to do with hypothesis testing.)

        • _ns, disagreements about what constitutes science can be left aside as long as we agree that the FDA, in facing the problem of setting a policy for decision-making under uncertainty, is going to act in ways which combine cost-benefit analyses with the question of which inferences are warranted, and that while the latter is a scientific concern, the former is not. (Christ I’m pompous.) The FDA’s reluctance to litigate priors in every study, and its resulting willingness to institute a policy that achieves a specific outcome in worst-case-prior expectation, isn’t about science.

        • Here’s Angus Deaton in The Atlantic; somewhere far down the page (search for FDA) he discusses how organizations like the FDA inevitably fall into the pocket of rent seekers, where things come to an equilibrium: the regulators aren’t odious enough to the drug companies for the companies to keep fighting.

          https://www.theatlantic.com/business/archive/2017/03/angus-deaton-qa/518880/

          The process is called Regulatory Capture. If Merck/Roche/whatever or whoever has figured out how to make a bundle on NHST-based decision making, they are certainly going to spend money to fight my efforts to re-organize the decision making around a different method, one potentially less financially successful for Merck/Roche/whatever. So what the FDA thinks is irrelevant, and it’s not because of Stephen Senn but because… it’s just not science-based decision making, it’s politics.

          Now, if you want to argue that my logic is flawed, you certainly can’t argue that probability densities aren’t points in an infinite dimensional Hilbert space, because this is a known mathematical fact. You could argue that for all practical purposes they can be approximated by a finite dimensional Hilbert space with some smallish number of dimensions, like say 100. And this is more or less Efron’s Bootstrap, just let the data itself be the coordinate in the Hilbert space. But the point is, before you can really claim anything about “guarantees” you need sufficient data to convince me that you know the proper f(d). And the frequency property of the guarantees is sensitive to the shape of the tails. So, to get stable guarantees you’re going to need to collect enough data that you can specify the tails very well. And by the time you do that, no one needs any statistics anyway (if you do your experiment on a million people it’s eventually obvious what the answer is anyway).

          So, deciding to use NHST because it gives you “error control” is a mistake, because it only gives you error control when you can specify those several hundred “shape” parameters exactly, and that only occurs when you have way more data than you really have.

          Furthermore, if we do a game where you generate random numbers uniformly between 0 and 1 and I win if x < 0.01, then we'll both agree I will lose 99% of the time. However if it costs me $0.10 to play and I win $10^9 if I'm successful… I will play this game all day even though I am sure I will almost always lose. By the end of the day I will be retired to my Yacht in the Caribbean with a high quality satellite data connection so we can continue this conversation.
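
          (Spelling out the arithmetic of that game: the expected value per play is roughly 0.01 × $10^9 − $0.10 ≈ $10^7, so a 99% frequency of losing, taken by itself, says nothing about whether the bet is worth taking.)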

        • Daniel, error control works fine in practice. It does not have to be perfect, and your arguments against error control to an infinite digit after the decimal do not interest me. All statistics is based on assumptions, and they should work well enough. NHST clearly works well enough (ignoring how it, like all approaches, is also misused).

          When I mention Stephen Senn, it’s not as an authority argument, but to point out that different people have different opinions. I’m actually using Bayesian inference myself here, in deciding who I think provides info that is more useful for me. The difference between using Bayesian stats or Frequentist stats hardly matters in practice, obviously, since we only take things seriously in science when there is so much data that priors no longer matter, and there are bigger problems than p-values (e.g., publication bias).

          The only reason I am responding here is that I actually think it is important to teach people the relative benefits of different approaches (as I teach in my MOOC) – especially when someone has no experience in medicine or psychology, but wants to make statements about which of the many different approaches to inference should be used.

        • Corey, as far as I know the FDA is not against using priors, as long as you also show how your approach to inferences controls error rates. Right?

        • _ns, yes, you can use a prior, say, as an ingredient when designing the decision boundary in an adaptive combined Phase II/III trial — but you will always be optimizing that boundary under the constraint that you achieve a target Type I error rate per one or more null models, and that makes the whole thing an exercise in Neyman-Pearson hypothesis testing. (My source for this assertion is the text Bayesian Adaptive Methods for Clinical Trials by Berry et al.) The particular posterior distribution you get at the end doesn’t influence the decision (see the sketch at the end of this comment).

          _land, the libertarian complaints about the FDA I encounter most often are of the regulation-impedes-innovation kind, not the regulatory-capture kind. Kind of a bank shot for both of these complaints to be true at once, but not impossible, I suppose.
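
          A toy version of the constraint described above, with hypothetical numbers throughout: use a posterior quantity as the test statistic, but tune its cutoff by simulation so the procedure hits a 5% Type 1 error rate under the null model.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          n, sigma, prior_sd, n_sim = 100, 1.0, 0.5, 20000

          def posterior_prob_positive(ybar):
              # normal-normal conjugate posterior for the effect, prior N(0, prior_sd^2)
              post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
              post_mean = post_var * (n / sigma**2) * ybar
              return stats.norm.sf(0.0, loc=post_mean, scale=np.sqrt(post_var))

          # simulate the statistic under the null (true effect = 0) and pick the cutoff
          null_ybar = rng.normal(0.0, sigma / np.sqrt(n), size=n_sim)
          cutoff = np.quantile(posterior_prob_positive(null_ybar), 0.95)
          print(cutoff)   # declare "effect > 0" only above this; alpha = 0.05 by construction

          Because the posterior probability here is a monotone function of the sample mean, the calibrated rule collapses to an ordinary z-test cutoff, which is the point: the prior shapes the statistic, but the error-rate constraint drives the decision.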

        • @Corey: re FDA. I don’t think it’s a bank-shot at all. Regulation impedes innovation by incentivizing rent seeking rather than innovation and that increases the cost of innovation and reduces the investment into innovation as well. Inside big pharma the right hand (marketing) wants to increase the barrier to entry of competition from other drugs, so you get things done to increase the difficulty of getting approvals. Meanwhile, the left hand (R&D) gets pissed off when they can’t get their new drug approved… and fewer projects get green lighted because of the cost going up. I actually have had that conversation with people in pharma (indirectly actually, my wife has had this conversation with some of her associates and then reported the results to me)

          Both the Epi-Pen debacle and the “Pharma Bro” Shkreli disaster are examples of how companies use regulatory capture to keep competitors out.

        • Daniel Lakens,

          The NHST error rates, I think, are not even as useful as a bayesian decision analysis.

          NHST error rates— T1: Some effect of non-interest is rejected when it shouldn’t have been. T2: Some effect of interest is not accepted when it should have been.

          Bayesian decisions: Integrate the utility over our posterior probability of effects, getting an expected utility given the data. If the utility is too low, then decide against it. If utility is high, then decide for it. It’s more potent than the mere NHST error rates. NHST error rates are about some dichotomous decision of whether there’s an effect of interest or not, whereas bayesian decisions (in this sense of expected utility) are based on making the most practical decision we can about whether to proceed or not.

          Even if you /did/ make NHST-like decisions in a Bayesian framework, they’re typically well calibrated, such that the 95% intervals do indeed include the ‘true’ parameter 95% of the time, and if one is rejecting some null using a test akin to NHST, it should occur with the expected frequency. Not always, and it depends on the prior, but then Bayesians want to make decisions based on the data at hand, not based on fictitious replicates that never occurred, so priors can aid in this decision. NHST error rates from a Bayesian analysis are typically side-effects of probabilistic modeling; NHST has the goal of achieving these dichotomous decision error rates, while Bayesian models just have these error rates as a side-effect. But more importantly, they don’t really matter much to a Bayesian, because a Bayesian can just apply a utility function over the posterior to come to a decision much more informative than ‘yes or no’: ‘the expected utility is X, but there is Y uncertainty about the utility’.
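
          A minimal sketch of that utility-over-posterior idea, with hypothetical posterior draws and a hypothetical utility function:

          import numpy as np

          rng = np.random.default_rng(2)
          effect_draws = rng.normal(0.15, 0.10, size=4000)   # pretend these are posterior draws of the effect

          def utility(effect, cost=1.0, value_per_unit=10.0):
              # hypothetical payoff of acting: benefit proportional to the effect, minus a fixed cost
              return value_per_unit * effect - cost

          u = utility(effect_draws)
          print(u.mean())         # expected utility of acting (vs. 0 for doing nothing)
          print(u.std())          # uncertainty about that utility
          print(np.mean(u < 0))   # chance that acting is worse than doing nothing

          The output is a decision-relevant summary ("expected utility X, with Y uncertainty") rather than a reject/don’t-reject verdict.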

        • Stephen, these are minor differences between Frequentist and Bayesian paradigms. Yes, Bayesian approaches give Y more easily, Frequentist X; they have their assumptions and limitations. I’m not disagreeing, but it’s also not something I care greatly about, due to the basically absent practical benefits. There are simply no situations in my research, or in most research I read, where these analysis choices matter substantially. I remember Jeff Rouder writing a blog post where he thought he had found a situation where it actually mattered, only to find out it wasn’t so. Again: if we can stop insulting people who use NHST, we can focus on things that matter (e.g., experimental design). You go on and try to improve science promoting Bayesian stats. I’ve come to the conclusion that most scholars I teach should spend the 10 hours a year they spend on improving their stats on more useful things.

      • I am unable to get a handle on a consensus around here concerning p-values. Most of you seem to be (but, as I said, I am unsure of that) arguing that NHST is misused (and studies are underpowered). That’s the standard spiel, a step up from “everything’s OK.” On the other hand, you have (or had – I guess I assumed he was dead) guys like Cohen with his “it doesn’t tell us what we want to know” position. I think I sort of asked this before but didn’t get what I thought was a clear answer. It seems to me that there is another position represented here, which is “p-values have their place, but not as a cut-off.” And I’m not talking about esoteric uses of p-values like Daniel’s “it’s useful for certain kinds of calibration” – do p-values have any place in a standard textbook t-test thingy like “Does drinking cranberry juice reduce the frequency of kidney stones?” or something like that? I mean any use…I already know that most around here don’t like the “magic cutoff” thing. Put briefly, it seems that the “it doesn’t tell us what we want to know” position pretty much puts an end to all the debate – NHST is not misused…it is useless.

        • ‘Put briefly, it seems that the “it doesn’t tell us what we want to know” pretty much puts an end to all the debate – NHST is not misused…it is useless.’

          When it comes to the un-nuanced version, yes, I think it is useless. The problem is that there are some cases where it isn’t useless. Calibrating instruments, or filtering “unusual data points” out of a stream of data, etc. But those aren’t the cases that you’re talking about. Still, you can’t make a blanket statement, because there are plenty of people using p values for legit filtering of data and they’ll legitimately complain that you don’t know what you’re talking about when you make such a blanket statement.

          Finally, p values based on Bayesian models with flat priors are Bayesian probabilities. So if you muddle together Bayesian models with flat priors and things like permutation tests, then you’re really going to wind up unable to say much.

    • Also, relevant to Dale Lehman’s point below, in the proposed setup of the question, BOTH A and B are in the second case; that is, there’s just one experiment being done with a single bias, and arbitrary lines in the sand between samples. The Bayesian analysis of A and B both look like my second model above (the single-Bias one), and they both produce the same estimates of the parameters that I left out; I should have written something like error_distrib(Bias, OtherParameters).

  2. My initial reaction was to choose A – I’m thinking of all the forking paths, etc. that make me more confident in 3 separate studies rather than 1. Of course, this was not what he had in mind – he meant for A and B to be identical except in the arbitrary delineation between studies. That point actually evaded me – I think this is a form of the psychological bias known as the “base rate fallacy.” I ignored relevant baseline information – essentially I ignored the sample sizes (not really ignored, but downplayed), focusing instead on the “separate studies” aspect.

    • Exactly. If the second and third were exact replications of the first study, then there is obviously no advantage or disadvantage to separating them. But a different author, different methods of collecting the data, and idiosyncratic choices in a cauldron of genuine inquiry give the three-study approach “convergent validity.” I’m taking “confirmed,” by the way, to mean that any estimated parameters in the three studies all plausibly come from the same DGP.

    • The two cases are not identical. It is nonsense to think that you can literally control every variable, though that is what you try. So…the first study could “show an effect” but for extraneous reasons…the reasons could still be around (was the lab tech the same etc.?) …but maybe not. However, if you get a failure to replicate in your own lab (that was the scenario, right? the other two studies are done later in the same lab?) you would have to be worried, but that would be good for science. Replications in your own lab are not as convincing as in other labs but, with all due respect, I think it is nonsense to think that three separate studies are the same as one with three times the subjects. It seems to me that it doesn’t recognize the myriad of things that can go wrong in science – after all…that is why replications of some sort are necessary.

      • +1

        That’s why I would strongly prefer option A. There is no entering the same river twice.
        Three separate attempts to set up a study the same way, and seeing the same effect, is good.

        In graphical model terms, if you vary the Markov blanket (because you will in three separate studies) and still see the effect, my posterior of it being a robust causal one goes up.

        • If there really is a reset… yes. But what if you set up a computer to observe an instrument and literally just had it automatically collect and dump 300 data points each day for 3 days into separate files?

        • “If there really is a reset… yes. But what if you set up a computer to observe an instrument and literally just had it automatically collect and dump 300 data points each day for 3 days into separate files?”

          GS: This seems analogous to breaking a long time series into parts and doing a Fourier Transform on the parts. If there is consistency in the frequencies across the parts, one is pretty certain that one is getting at some real processes. If one just does the transform on the whole time-series, it could be meaningless – after all, a Fourier Transform can be carried out on any time-series and you will get a result. I’m sticking my neck out on this…but what the heck…I’ve been wrong before…

  3. I feel that the mistake here is in the phrase “Successful replication as in, the hypotheses were supported again in the replication studies.” That just pushes the problem to the question of what “supported” means.

    See my blog post on a related topic here, where I argue that the authors of the nine-labs replication went too far when they claimed that they had a failure to replicate an earlier result that appeared in Nature Neuroscience. What they mean by failure to replicate is that they didn’t get p less than 0.05. But looking at the posterior distribution tells a different story to me:

    http://vasishth-statistics.blogspot.de/2017/04/a-comment-on-delong-et-al-2005-nine.html

    If the posterior distribution of the original study allows a range of possible outcomes, and the replication attempts fall within that range, that’s not necessarily a failure to replicate. As usual, the problem is that people are stuck in the p-value rut and are not able to pull themselves out of it.

    • Like I said, I left it purposefully ambiguous, because people gauge support for a given hypothesis in different ways. I may gauge it by a posterior distribution (ruling out a sign of an effect, maybe), others a p-value, others a bayes factor, others a CI or whatever. I meant ‘by one’s own standard of supporting a hypothesis… the hypotheses were supported across 3 N=300 studies’

    • Shravan:

      I would suggest the posterior/prior (relative belief or likelihood) rather than the posterior – as the prior has nothing directly to do with replication.

      Stephen:

      It is too ambiguous so any answer could be very insightful or completely naive.

  4. Paper A does have more information in some very basic sense. Even if the three populations are arbitrary subsets, three different means tell you something about how well behaved the data are. In reality, if there are three different experiments they were probably done (at minimum) at three different times, which helps with systematics.

    • Same DGP, same population, same procedures. When I say three direct replications, I mean three /direct/ replications. As in, nothing varied across the studies in the data collection or analytic procedure. The only thing that varies, given the information, is really time, but that varies regardless (time varies between the first and second observation too).

      If we had 900 samples of n=10, all somehow supporting the hypotheses, would you have more confidence than if you ran 1 n=9000 study?

      I say no, some say yes.

      I say no, because the n=9000 sample better approximates the population and has a more precise estimate with much less uncertainty about that estimate than any one of the given n=10 samples.

  5. I’m seeing a distinct lack of the words ‘heterogeneity’ & ‘meta-analysis’ in this discussion…

    With k=3, you have worries about publication bias, p-hacking to get results equivalent to the first one, etc, but you also get improvement on external validity and some hints at moderators: different studies run at presumably different times on different subjects in possibly different places by different experimenters. With the 1-study, you have better internal validity and better power to detect that study-specific effect but you lose any ability to estimate heterogeneity and external validity because, well, you only have k=1.

    Particularly with meta-analysis, my intuition is that you’re usually better off, for a fixed total n, with a decent number of medium-sized studies than with a ton of small studies (each afflicted by small-study effects) or 1 or 2 mega-studies (which estimate their local effect precisely but give the analyst no guidance about how well the treatment will work elsewhere, which is critical to decision-making). You can of course do a power analysis here to look at the tradeoffs.

      • You can get all those effects *within* a lab, too. Think about Schimmack’s ‘incredibility index’ where one team/lab/PI reports multiple experiments on the same topic and – mirabile dictu! – they all come out statistically-significant despite being underpowered. It’s publication bias/p-hacking, of course. So to some degree, I would regard 3 papers from 1 lab as less trustworthy than 3 papers from 3 labs (‘researcher allegiance effect’ in spades).

    • This is really the power of multilevel models, that you can take the k=1, n=900 study (or the 3 n=300) and systematically decompose it into many medium-sized studies of whatever meaningful groups are present.
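
      To make the heterogeneity point above concrete: with per-study estimates and standard errors you can at least attempt a random-effects meta-analysis, though with k=3 the between-study variance is barely identified. A sketch with made-up numbers, using the standard DerSimonian-Laird estimator:

      import numpy as np

      est = np.array([0.30, 0.18, 0.25])   # hypothetical per-study effect estimates
      se = np.array([0.08, 0.09, 0.08])    # and their standard errors

      w = 1 / se**2
      fixed = np.sum(w * est) / np.sum(w)   # fixed-effect (common-effect) pooled estimate
      Q = np.sum(w * (est - fixed)**2)      # Cochran's Q
      k = len(est)
      tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))  # DerSimonian-Laird tau^2
      w_re = 1 / (se**2 + tau2)
      random_effect = np.sum(w_re * est) / np.sum(w_re)

      print(fixed, random_effect, tau2)     # with k=3 the tau^2 estimate is extremely noisy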

  6. Paper A. A replication in hand is worth two in the bush. Of course, you could do one SSD (ABAB) study with N=5 and potentially get 5 within-subject replications and 4 between-subject replications (I guess…figuring one demonstration in one subject and 4 replications). That is why JEAB has never had much of a problem with failures to replicate. Well…no problem with direct replication – and “failures” of systematic replication are not failures at all. Assuming the original findings are reliable, a “failed” systematic replication is better called showing the limits on generality. Of course, extremely limited generality sometimes detracts from a finding’s importance.

    • But by this logic, I could collect N=900, arbitrarily split it into three subsets, analyze them, and you would say I have more evidence because I have three n=300 subsets showing evidence vs 1 n=900 subset showing evidence. You’re counting “number of times a hypothesis was supported”, even though the number of times could feasibly just be due to cutting a larger potential dataset into pieces and seeing how many times the subsets pass some acceptance filter. Imagine I had a population of K=10000, and I collect 900 N=10 samples vs one N=9000 sample. If I [by some serious luck, or due to some massive effect] supported the hypothesis across 900 N=10 samples, would you really trust that more than the single N=9000 sample, which is only 1000 short of representing the full population?

      And a failure to replicate doesn’t even mean there’s a generalizability concern, it may just be sampling flukes.

      • To the extent that a replication involves some operational reset — some variation, some new sample from what Daniel Lakeland calls Bias[k] ~ bias_distrib() in the first comment — Paper A provides strictly less information about the effect in question since estimating the biases eats up some degrees of freedom. But if all we know is that Paper A contained 3 copies of the experiment and they all succeeded in detecting the effect, then under reasonable interpretations of what those words mean in context, that is indeed more evidence in favour of the effect.

        I think a lot of people have an intuition that the division of data into replications is not arbitrary and does involve such a reset. To the extent that this is false, you’re right — the two papers have equivalent information content.

        • Well, getting several samples of the bias could help you learn the parameter more accurately. The single 900 sample study will help you get precision on the measurement uncertainty, but gives you no information on the bias. Once you drive the measurement uncertainty down, you get more bang for your data buck by driving the bias down using some kind of “reset” and hopefully having several samples of the bias that themselves average closer to zero than any single bias.

        • Mosteller and Tukey called that getting at the real uncertainty.

          Part of the problem here is that it is very rare that dealing with multiple studies (meta-analysis) is taught in any statistical courses – intro or advanced.

      • SM: But by this logic, I could collect N=900, arbitrarily split it into three subsets, analyze them, and you would say I have more evidence because I have three n=300 subsets showing evidence vs 1 n=900 subset showing evidence.

        GS: Well…the temporal “bins” might be arbitrary, but their sequence is not. In a different post I suggested that aspects of the relevant extra-experimental environment change all the time and “getting an effect” (i.e., when a researcher who, by intention, is skeptical about an effect before the study convinces him or herself that there is one after examining the data) might depend on an interaction of unknown variables with the independent variable. Like, you get some effect because the news, when you conducted the study, was filled with Some Moron being elected, or a news story about a great thinker (like me!) claiming that all governments tend to totalitarianism and will eventually be totalitarian, or some such. Now, later that all dies down and that happens to correspond, in time, to the researcher’s own attempt to reproduce the effect. Now, he or she is not convinced by these recent data. So…he or she runs a third study, and is, once again, convinced that the data reveal an effect. What might he or she conclude? Many might think – I would – that the effect is probably there but fragile. If you hit the wrong time in the midst of a changing extra-experimental environment, you don’t get the effect. And, BTW, even though I just thought about it, if random selection yields data consistent with “no effect,” that is a sign of the fragility of an effect. A lot of the effects I often considered were such that you could sample and run study after study ‘til you were blue in the face and you would never produce data that were evidence of no effect. Not everybody has that luxury, though. But perhaps it’s time for some areas of science to stop pursuing minuscule effects. These latter statements go to your statement: “And a failure to replicate doesn’t even mean there’s a generalizability concern, it may just be sampling flukes.” BTW, you do know that for most of that post I was talking about single-subject designs?

        SM: Imagine I had a population of K=10000, and I collect 900 N=10 samples vs one N=9000 sample. If I [by some serious luck, or due to some massive effect] supported the hypothesis across 900 N=10 samples, would you really trust that more than the single N=9000 sample, which is only 1000 short of representing the full population?

        GS: Beyond any shadow of a doubt.

  7. What was the power in situation A? A lot depends on that. If power was low, then I prefer B because I have a lower chance of making a Type M error (the A scenario might have just given results that are near the truth by accident, with low power they are unlikely to). If power was high, I prefer A because (a) I have reason to believe the estimates are close to the truth, and (b) they represent replication attempts (as someone else also mentioned). But without information about power (assuming we want to remain in the NHST world) there is no reason to choose one over another.

    • Replications aren’t magical though. They hold no more weight than the initial studies. Also, both “papers” are preregistered.

      In one scenario, you have n=300 for each of 3 studies. Total combined information is the same as one would find in one n=900 study, to the point that a fixed-effects meta-analysis would be nearly identical to the n=900 study.

      But you are right; if power is low, then getting 3 significant results across 3 underpowered studies is improbable, and the effects must be very inflated compared to the N=900 study. However, even if power is high, there is nothing I can see that the N=300 studies could tell you that N=900 wouldn’t tell you. Why is less data per analysis more convincing than more data per analysis?
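
      A quick simulation of that claim, assuming one common effect and identical procedures (i.e., the arbitrary-line-in-the-sand case, not three genuinely different setups):

      import numpy as np

      rng = np.random.default_rng(3)
      y = rng.normal(0.2, 1.0, size=900)   # one draw from the same hypothetical DGP

      # one n=900 analysis
      z_900 = y.mean() / (y.std(ddof=1) / np.sqrt(900))

      # the same data cut into three "studies" of n=300 and recombined by a
      # fixed-effects (inverse-variance) meta-analysis
      chunks = np.split(y, 3)
      est = np.array([c.mean() for c in chunks])
      se = np.array([c.std(ddof=1) / np.sqrt(300) for c in chunks])
      w = 1 / se**2
      meta_est = np.sum(w * est) / np.sum(w)
      meta_z = meta_est / np.sqrt(1 / np.sum(w))

      print(y.mean(), meta_est)   # essentially identical point estimates
      print(z_900, meta_z)        # essentially identical test statistics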

  8. It seems obvious to the point of trivial that the arbitrary decision of when to call a collection of data points “a study” is epistemically irrelevant. All other things being equal, those 900 data points provide the same evidence. If your statistical paradigm says otherwise, it is absurd and wrong.

    • Joachim:

      As I wrote above, I think the problem has no solution as stated, because I don’t know what’s supposed to be in papers A and B, and I don’t know what is meant by a replication being “successful.” But, speaking more generally, it’s not correct that all the evidence is in the data.

      Here’s a simple example. Suppose a basketball player takes some shots, which I code as 0 for miss or 1 for hit. You see eight data points, which are 1, 1, 0, 0, 1, 1, 1, 1.

      Scenario 1: The player went out to the court and took 8 shots.

      Scenario 2: The player went out and took 20 practice shots and then he took 8 more, which you recorded.

      Scenario 3: The player took a bunch of shots, most of which were misses, but you waited until you got a string of 8 with only 2 misses.

      Same data, different inferences.

      Or, for another story, two experiments:

      Scenario 1: An intervention is performed on 200 kids in a school district.

      Scenario 2: An intervention is performed on 100 kids in district A and 100 kids in district B.

      Again, you can have the same data, different inferences.

      Information is encoded in how the data were collected and in where the data came from. What’s relevant is not that there are 3 “studies” but rather how those studies were performed, where the data were collected, what other data were used in constructing the reported summaries, etc.
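
      Scenario 3 is easy to simulate, if that helps: a mediocre shooter (40% here, a made-up number) who keeps shooting will almost surely produce a reportable 6-of-8 streak within a few hundred shots.

      import numpy as np

      rng = np.random.default_rng(4)

      def shots_until_hot_streak(p_hit=0.4, window=8, max_misses=2, max_shots=100000):
          # shoot until the most recent `window` shots contain at most `max_misses` misses
          shots = []
          while len(shots) < max_shots:
              shots.append(rng.random() < p_hit)
              if len(shots) >= window and sum(shots[-window:]) >= window - max_misses:
                  return len(shots)
          return max_shots

      waits = [shots_until_hot_streak() for _ in range(1000)]
      print(np.mean(waits))   # the streak nearly always arrives, so the same 8 recorded shots
                              # carry very different evidence under this sampling plan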

      • I agree that the sampling plan can matter, and also that the problem has no unambiguous answer as stated; hence the “all other things being equal” qualification.

        If I gave you a data set with some vector X and asked a simple inferential question about X (mean greater than zero, something like that), and gave you a second data set that includes X but also Y, which indicates my arbitrary, correlated-with-nothing assignment of data points into studies, and you came up with some different amount of evidence, you’d have some explaining to do.

        Of course it’s possible and potentially interesting if Y adds information, so we can test robustness, perhaps to time or location, but in the absence of such information and ceteris paribus, “study” is as interesting as “star sign”, if not less so.

        • I guess there’s more nuance still. If I gave you that X vector, and a Z vector indicating which of three locations provided the data, and then later on said “oh by the way I’m calling each location’s data a separate study, here is vector Y = Z which indexes the study” then the study index should still be irrelevant even though it correlates with something interesting.

      • In the same spirit, I prefer the 3-300 studies to the 1-900 because I can imagine someone with a bunch of research projects of 900 data points each, only one of which came out nicely and got published, versus someone who had a bunch of research projects of 300 data points each, of which one came out nicely, and he (or someone else — the scenario is indeed vague, which creates much of its interest) repeats it two more times and it keeps working.

  9. I am with the folks who are saying “forking paths”. Let me tweak the question: Is it better to run a single study with 900 subjects and all the usual risks of HARKing and p-hacking, or to run one study of 300, and use what you’ve learned to preregister your analysis for the subsequent 2 studies (or one study of 600)? It’s a sincere question, since I’m thinking of doing this for a study I’m conducting this Spring.

  10. To make this even clearer, I meant 3 DIRECT replications in the most literal of terms, as in:
    Run 300 subjects. Stop. Label it study 1. Resume data collection, collect 300 more. Stop. Label it study 2. Collect 300 more, stop. Label it study 3. No difference in sampling, procedures, experimenters, materials, anything.

    That vs Run 900 subjects.

    If the three studies resulted in support for your hypothesis (however you gauge that, p-values, BFs, posteriors, I don’t care), would you believe them more than a single study with N=900?
    In other words, assuming the same DGP, same effects, same methods, same total information, would you be more convinced by the 3 study paper (N=300 each) or by the 1 study paper (N=900)?

    I say both provide equal evidence in totality, and if anything one should trust the estimates of N=900 more. Knowing nothing else though, I don’t see why one would choose paper A. Some in the facebook thread even said they would believe in a study with 900 samples of 10 more than they would 1 sample of 9000. I get why that is intuitively appealing, but I’m fairly certain that intuition is totally wrong, and if anything one should believe the N=9000 more (it better represents the population, assuming these 900 N=10 studies are all collected identically and such).

    • Worth pointing out that evidence is combined with outside information to direct decision-making. I might, for example, take the evidence more seriously because apparently someone smart decided to commit and then commit more resources to the study. Or, of course, less seriously.

    • “No difference in sampling, procedures, experimenters, materials, anything.”

      Ahh…I thought we were talking the real world here. Let’s say the tech who takes care of the animals goes on vacation during the middle third of the time and is replaced by a sub (or if you’re unlucky enough to have to use humans as subjects, maybe there’s some really bad news on during the middle third…or a cold snap etc. etc. etc. etc.). If the tech matters (and let’s say that he or she does…of course, you would not know this), will you reach the same conclusion in the two versions? In the 3-3-3 version, you might say, “I think there’s something there but the phenomenon may be somewhat fragile…two studies ‘yes’…one ‘no.’” And, indeed, you would be right. The effect is sufficiently fragile that a change in animal care technician made a difference. With the single study, you would not detect this fragility – unless you separately analyze the data in three segments based on the time that segment occurred. But that would really be the same as the 3-3-3 “study.”

      • I’m pretty sure Stephen isn’t talking about any particular real world application, but is making a theoretical point. Frequentist methods seem to imply that designating a collection of data points a “study”, no matter how arbitrarily, has epistemic consequences because you get three hits on alpha rather than 1. This is an absurdity I think almost everyone here recognises.

  11. Correct me if I’m wrong, but significant results based on a point estimate of X in the n=900 study would not necessarily yield significant results in each of the three arbitrarily partitioned sub-studies with n_i=300 (e.g., even if X_i=X in each sub-study, the larger SEs due to smaller n_i could yield “non-significant” results). Conversely, if three n_i=300 studies give you estimates of X1, X2, and X3 — all of which yield significant results — then you would be guaranteed to get significant results in the combined n=900 study… right?

    Thus, it seems to me that Paper A is more convincing (or at least “no less convincing”) than Paper B… unless of course Paper A was published in PPNAS.

    (Note: I started typing this before I saw Stephen’s post saying “same effects” and “same total information”)

    • Agree. If, say, based on a reasonable frequentist analysis* you have p<0.05 for n=900 or 3 times p<0.05 for n=300, then, knowing no more than that, the likelihood of the 3 studies together more strongly favors parameter values away from no effect than the 1 study does (if you think of this as censored data).

      * By reasonable frequentist analysis I mean some sort of sensible likelihood ratio type test based on normal data (and not, say, this being a time-to-event scenario with zero events and someone deciding to randomly reject the null hypothesis 5% of the time at random or some similar sort of thing).

      That's if in all cases these are all the studies conducted on the topic and they were ideally pre-registered etc., because my other concern would be a file drawer problem. Actually with 900 vs. 3 times 300, the studies are sufficiently similar in size that I do not see too much of a difference between the cases. I'd be a lot more suspicious with, say, 3 times 50 vs. 1 times 10,000 (besides all the problems with underpowered trials etc.).

    • I have the same idea as harryq, I think, but let me try saying it my way. If we have three studies with n=300, each barely significant at the 5% level, then that’s a lot better than having one study barely significant at the 5% level. If we combined the three studies we would have more “total information”. That’s not the same as a scenario where we have the same exact data but it’s randomly cut into three separate papers. Both are interesting scenarios to discuss, even if the first one has a simpler answer. (For the second scenario, we have to start thinking about who is doing the replicating and suchlike.)

      This, I think, is why people say they would like 90 studies of n=10 even better. It’s hard to get 5% significant results with n=10, so the effect must be extremely clear in each of those 90 studies!

      • I replied to this, but I don’t think it went through. It was similar to a post below, which I’ll copy again here with some modifications:

        Right, but it’s not realistic.

        If these are generated from the same DGP, then if 90 N=10 studies were all at p < .05, then the N=900 study would almost certainly have an extremely low p value; it would be incredibly rare to see 90 N=10 studies < .05 and an N=900 study of the exact same method and DGP have a p-value so close to p=.045.

        To the point that it's not a realistic scenario. If you had 90 N=10 studies with all p < .05, then the effect size must be massive for that to occur (you'd have to have incredibly high effect sizes to have such incredibly high power with N=10), in which case the N=900 study would most assuredly detect it with a p value of MUCH less than .05, and certainly much less than any p-value from any given N=10 study. It would be ludicrous to see so many small-N tests be significant, but then only see a barely-significant p=.05 from a sample 90x larger than any of the given tests had, assuming all of the observations came from the same DGP and population.
        Tbh, if you were sampling a DGP that yielded 90/90 significant results with N=10, it must be so astronomically large of an effect that N=900 would basically NEVER yield a p-value anywhere close to .05.
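
        A back-of-the-envelope version of that argument, assuming a one-sample z test with two-sided alpha = .05 and a common standardized effect d:

        import numpy as np
        from scipy import stats

        def power(d, n, alpha=0.05):
            # power of a two-sided one-sample z test at standardized effect size d
            z_crit = stats.norm.isf(alpha / 2)
            return stats.norm.sf(z_crit - d * np.sqrt(n)) + stats.norm.cdf(-z_crit - d * np.sqrt(n))

        # getting 90-for-90 significant results at n=10 realistically requires
        # per-study power around 0.999, which takes d in the neighborhood of 1.6
        print(power(1.6, 10))       # ~0.999
        print(0.999 ** 90)          # ~0.91 chance that all 90 studies come up significant

        # at that effect size, the expected z for a single n=900 study is enormous
        print(1.6 * np.sqrt(900))   # ~48, i.e. a p-value astronomically smaller than .05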

        • But this is precisely why most people chose paper A!

          You asked them to compare multiple small-size confirmatory studies (i.e. p<0.05, i.e. the effect size must be massive)
          with one single large-size confirmatory study (i.e. p<0.05, i.e. a barely-significant result would correspond to a weak effect).

          I guess they would give a different answer if the second option read
          "Paper B has a novel study with confirmed hypotheses (n=900) with p-value of MUCH less than 0.05, certainly much less than any p-value from Paper A"

        • It’s the same dgp. Same thing studied. The effects generating the data are the same. So no, it’s not correct to say the 3 smaller studies must be driven by a massive effect while the larger study could be driven by a trivial effect. They study the same thing, varying only how the sample is partitioned.

          If the effect is so astronomically large that you’d find it perfectly across several small samples, you will almost certainly find it in larger samples. By all probability, if you find in 3 smaller samples p < .05, then in a larger sample from the same process, you will find p <<< .05. And at least with the larger sample your estimate is more precise.

        • It’s not clear, to me at least, that the original question “which paper has the most evidence” was conditional on the data being identical.

          Please consider the following alternative question:

          “Paper A has a novel study with confirmed hypotheses (n=900).
          Paper B has a novel study with non-confirmed hypotheses (n=900).
          *Intuitively*, which paper would you think has the most evidence?”

          Do you find the following answer satisfactory?

          “They provide the same amount of evidence, by essentially the likelihood principle. It’s the same dgp. Same thing studied. The effects generating the data are the same.”

        • Another take on your original question:

          Imagine that there are two studies exploring the same topic and using the same methodology. Both were preregistered.
          Trial A is a sequence of small-sample experiments (n1=300, n2=300, n3=300).
          Trial B is a single large-sample experiment (n=900).

          1) You learn that paper B has been published (you know that p<0.05, nothing else). Does learning that paper A (p1<0.05, p2<0.05, p3<0.05) has been published as well provide additional evidence?

          2) You learn that paper A (p1<0.05, p2<0.05, p3<0.05) has been published. Does learning that paper B (p<0.05) has been published as well provide additional evidence?

          How would you rank the levels of evidence for these three scenarios?
          – paper A published, outcome of trial B unknown
          – paper B published, outcome of trial A unknown
          – both papers A and B published

        • The whole point is that we only know about the data-generating process through the studies. By knowing that each subset of 300 patients shows the same thing, we have more information than from knowing only what the overall combined 900 patients show.

          If you do not like the answer, then one way to change the question might be to consider 3 times p<=0.05 versus one time p<=0.00003125. In which case I feel we mostly get into debates as to whether possible methodological differences between the 3 studies are helpful (while if they are 100% identical in every way, then the question starts to become unrealistic, if we are honest…).

        • You can't multiply p-values together. You could have 90 p-values of about .967, and when multiplied together, they are .05 (see the quick check at the end of this comment).

          And I really don't think it's true that 3 n=300 studies have /more/ information.
          If you have 3 n=300 studies with p < .05, then an n=900 study collected exactly the same way from the same DGP and population will almost certainly have p <<< .05.

          If a DGP exists such that 3 N=300 samples yield p < .05, then sampling /more/ (N=900) from that DGP would assuredly yield even /more/ evidence than any one given study, and should asymptotically equal three n=300 samples in terms of total information.

          Take this: say you have 3 n=300 studies and you estimate the d statistic with 95% CIs. d1 = .56 [.26, .86], d2 = .35 [.05, .65], d3 = .62 [.32, .92]. p-values, respectively, are: .00001, .03, .000001
          Of course you can be fairly sure there's an effect across the three.
          Now you take n=900 of the same DGP: d = .45 [.4, .5], p = .000000000000001.

          You may say "the three studies provide more information because it detected the effect multiple times", and I would say "the three studies tell you nothing that the n=900 sample didn't already".
          With n=900, the estimate is d=.45 with 95% CI of [.4, .5], extremely tiny p-value. In Bayes-land, the credible interval may be similar, meaning we can be fairly sure the parameter is between .4 and .5. The fairly precise estimate of d=.45 [.4, .5] tells us exactly what we would find in smaller samples; with n=300, we'd have an NHST power of detecting d=.45 of about .999, so if we were to sample three n=300 studies, we'd have a .999^3 probability of getting 3 significant results.

          This is a precise example of what I mean. Three n=300 studies supporting a hypothesis shouldn't tell you /more/ than a single n=900 study; more data gives you better estimates, and even if you crossed some arbitrary threshold 3 times (vs 1 time with a larger sample), it doesn't mean you have more evidence, it means you have arbitrarily sliced evidence. Just like if you had 3 n=300 studies with p = .09, p=.01, p=.04, it doesn't mean your effect is inconsistent, it probably means your effect is small enough that n=300 isn't adequately powered for consistently /detecting/ the effect. With an n=900 sample, you may have adequate power to detect the effect and precisely estimate the effect; with that estimate (say, it's d=.25), you could say 'yeah, replicating this would require more than n=300 to have a high chance of replication'.
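
          Two of the numbers in this comment are easy to check in R (a sketch; the power lines assume a two-sided, two-sample t-test with sd = 1, and nothing here says whether n=300 means per group or in total):

          p90 <- rep(0.967262, 90)
          prod(p90)                                                             # ~0.05, but this product is not itself a p-value
          pchisq(-2 * sum(log(p90)), df = 2 * length(p90), lower.tail = FALSE)  # Fisher's method: ~1, i.e. no evidence at all
          pchisq(-2 * sum(log(rep(0.05, 3))), df = 6, lower.tail = FALSE)       # three p = .05 properly combined: ~0.006

          power.t.test(n = 300, delta = 0.45)$power   # ~0.9998 if n=300 means 300 per group
          power.t.test(n = 150, delta = 0.45)$power   # ~0.97 if n=300 means 300 in total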

        • Thanks for answering. This is just so you know I read it. Carlos made the reply I would have: why assume the DGP is the same? If you do, it's a different question, I agree.

  12. You can argue that the papers provide the same evidence if the data is the same, but that was not part of the original question.

    Paper A has a novel study with confirmed hypotheses (n=900).
    Paper B has a novel study with non-confirmed hypotheses (n=900).
    Q: which paper gives stronger support for the hypothesis?
    A: both have the same evidence, because I'm thinking of a case where the data is identical but the analysis assumptions lead to different conclusions.

  13. The question was:
    Imagine you’re given two papers.
    Both papers explore the same topic and use the same methodology. Both were preregistered.
    Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).
    Paper B has a novel study with confirmed hypotheses (n=900).
    *Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

    How about this question?
    Imagine you’re given two papers.
    Both papers explore the same topic and use the same methodology. Both were preregistered.
    Paper A has a novel study (n1=300) with confirmed hypotheses, followed by one successful direct replication (n2=300) and one unsuccessful replication (n3=300).

    Paper B has a novel study with confirmed hypotheses (n=900).
    *Intuitively*, which paper would you think has the most evidence?

    Assume further that, if you partition the data in Paper B into (300, 300, 300) it exactly matches the data in Paper A.

    As for me, my preference between the papers would probably depend on (1) the day of the week and (2) the number of spelling and grammatical errors in each paper. I think that I would prefer the N=900 study because bigger N is better N. But, I suspect that I would be wrong. The two papers do not provide the same total information. Paper A tells you something (not very much) regarding the variance of small samples that is not revealed in paper B. Imagine you had 10 studies with n=90 that supported the hypothesis. Or 900 with n=1 that supported the hypothesis. 900 studies with n=1 is more compelling than one study with n=900.

    Bob

    • I disagree. If you have N=900, you probably have a good estimate of the effect(s). Knowing that estimate, you can already tell how smaller samples will perform in detecting the effect. If you have N=900 and d = .32, you can already tell that smaller samples (n=100) may struggle to detect that effect due to sampling variability; really, it just comes down to power (see the rough calculation at the end of this comment). Bigger N -> better estimate -> better inference: you can already tell what will happen with smaller samples because you have a good estimate of the effect, and you can thus make better judgments about the probability that small-sample replications will in fact replicate that effect.

      There's nothing I can think of that smaller samples will tell you that a larger sample would not. Sure, you may get support, non-support, support, but that doesn't tell you anything about the 'variability of the effect'; it just says the effect is small enough that, at your sample size, the power is insufficient.

      Just like if you had 90 N=10 studies, and saw a mix of support and non-support. That doesn’t mean the effect is inconsistent, it means your samples are way too small to consistently detect that effect, and each sample is seriously underpowered (and thus your detected effects must necessarily be large overestimates). It says nothing about the variance of small samples that you couldn’t tell already from having a larger N estimate of the effect and determining the power of detecting such an effect with smaller N.
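
      Here is the rough power calculation (a sketch; it assumes a two-sided, two-sample t-test with sd = 1 and treats n as the per-group size, which isn't specified above):

      power.t.test(n = 50,  delta = 0.32)$power   # ~0.35: an N=100-total study will often miss d = .32
      power.t.test(n = 450, delta = 0.32)$power   # ~0.998: an N=900-total study will almost never miss it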

      • Well, my point was that if I were shown 10 N=90 studies that were at the p<0.05 level, I would probably be more convinced than by seeing one N=900 study with p = 0.045. But, it's late and I've had a long day. So, I am not going to do the proper analysis.

        Bob

        • Right, but it’s not realistic.

          If these are generated from the same DGP, then if 10 N=90 studies were all at p < .05, then the N=900 study would almost certainly have an extremely low p value; it would be incredibly rare to see 10 N=90 studies < .05 and an N=900 study of the exact same method and DGP have a p-value so close to p=.045.

          To the point that it's not a realistic scenario. If you had 10 N=90 studies with all p < .05, then the effect size must be massive for that to occur (you'd have to have incredibly high effect sizes to have such incredibly high power with N=90), in which case the N=900 study would most assuredly detect it with a p-value of MUCH less than .05, and certainly much less than any p-value from any given N=90 study.

        • Again, you are only learning about the DGP from the data. The 10 studies convince me more the hypothesized effect is really there in the DGP.

        • Why would more *studies* be more important than more *data*?

          If you have a population of K=10000, would you seriously trust 900 independent samples of N=10 over 1 sample of N=9000?

          Studies aren’t the important factor, it’s information.

  14. We ran a study in which we had basketball players shoot a basketball 300 times over 3 sessions separated by 6 months. We could have had each player shoot 900 shots in a single session, but the players would have gotten very tired.

    There was another bonus: we could sort players in the first session and see if their tendencies persisted across sessions.

    • Josh:

      Yours is the best comment in the thread because it uses a specific example. I had a hard time answering the original question because the difference between studies will depend on context.

      • @Andrew,

        I’m having a hard time seeing why you didn’t understand the original question as posed. Sure it wasn’t realistic, and in the real world any set of three studies would depend crucially on the details of those studies. But in the abstract, is there any meaningful difference between the studies attributable solely by how they are split up as

        • Submitted early by accident but I think it’s clear. Do the features he mentioned make a difference, all else being equal, or are the stated differences irrelevant?

      • Andrew: If your blog post were about "three N=100 studies or one N=300 study?" I might not have noticed the connection!

        Keith: nice example. It's funny, I can see reasons for loss of control due to scale issues in a single large study, but then I can also see reasons for loss of control with multiple small studies, because of variability in the UTOS. My instinct is that multiple small studies are better in general because of their robustness qualities, as well as the room they give for exploratory research: there are too many forking paths when you divide the data ex post.

        • Martha

          hmm.. still works for me, I guess google books limited you. For the link you provided, they discuss it a bit differently.

          I'll copy-paste below, from Maxwell, S. E., & Delaney, H. D. (2004). Designing Experiments and Analyzing Data: A Model Comparison Perspective. Lawrence Erlbaum Associates, Inc.

          —-

          Validity means essentially truth or correctness, a correspondence between a proposition describing how things work in the world and how they really work (see Russell, 1919b; Campbell, 1986, p. 73). Naturally, we never know with certainty if our interpretations are valid, but we try to proceed with the design and analysis of our research in such a way to make the case for our conclusions as plausible and compelling as possible.

          The propositions or interpretations that abound in the discussion and conclusion sections of behavioral science articles are about how things work in general. As Shadish et al. (2002) quip, “Most experiments are highly local but have general aspirations” (p. 18). Typical or modal experiments involve particular people manifesting the effects of particular treatments on particular measures at a particular time and place. Modal conclusions involve few, if any, of these particulars. Most pervasively, the people (or patients, children, rats, classes, or, most generally, units of analysis) are viewed as a sample from a larger population of interest. The conclusions are about the population. The venerable tradition of hypothesis testing is built on this foundational assumption: One unit of analysis differs from another. The variability among units, however, provides the yardstick for making the statistical judgment of whether a difference in group means is “real.”

          What writers such as Campbell have stressed is that not just the units or subjects, but also the other components of our experiments should be viewed as representative of larger domains, in somewhat the same way that a random sample of subjects is representative of a population. Specifically, Cronbach (1982) suggested that there are four building blocks to an experiment: units, treatments, observations or measures, and settings. We typically want to generalize along all four dimensions, to a larger domain of units, treatments, observations, and settings, or as Cronbach puts it, we study “utos” but want to draw conclusions about “UTOS.” For example, a specific multifaceted treatment program (t) for problem drinkers could have involved the same facets with different emphases (e.g., more or less time with the therapist) or different facets not represented initially (e.g., counseling for family members and close friends) and yet still be regarded as illustrating the theoretical class of treatments of interest, controlled drinking (T). (In Chapter 10, we discuss statistical procedures that assume the treatments in a study are merely representative of other treatments of that type that could have been used, but more often the problem of generalization is viewed as a logical or conceptual problem, instead of a statistical problem.)

          Turning now to the third component of experiments—namely the observations or measures—it is perhaps easier because of the familiarity of the concepts of “measurement error” and “validity of tests,” to think of the measures instead of the treatments used in experiments as fallible representatives of a domain. Anyone who has worked on a large-scale clinical research project has probably been impressed by the number of alternative measures available for assessing the various psychological traits or states of interest in that study. Finally, regarding the component of the setting in which experiments take place, our comments about the uniformity of nature underscore what every historian or traveler knows but that writers of discussion sections sometimes ignore: What is true about behavior for one time and place may not be universally true. In sum, an idea to remember as you read about the various types of validity is how they relate to the question of whether a component of a study— such as the units, treatments, measures, or setting—truly reflects the domain of theoretical interest.

  15. I had an experience submitting a paper (to a social science journal) with N>100,000, and being asked to submit the paper revised with another study. I thought that I should have found some excuse to split the sample into lots of little samples and get many studies from one study. For non-scientific reasons, there is an idea among lots of editors that papers absolutely must have lots of studies, so three flawed studies can get through better than one really solid one. I think it shouldn’t be this way: if one study can be totally solid, why do we need more?

      • In the context of the broader scientific enterprise, replication is important. In the case of one individual paper, one really good study should be enough. It cannot be the responsibility of each paper to provide replications of itself – that is the responsibility of subsequent papers. Replication is an endless process, but each paper is just one brick in the wall and we should be happy when it’s a good brick rather than demanding that it be several little bricks or worse bricks.

        • To some extent this depends on the field. In my wife's Biology lab, if a student comes to her with an interesting result, the first thing she does is throw it in the trash and tell them to go back and do it again and see if they get the same result (I'm exaggerating, but I do know biologists who have literally taken the gel out of an excited undergrad's hands and thrown it in the trash. The experience is shocking, but it makes for a great learning experience; young students are super-eager to get results.).

          But, when it comes to running a decade long social science policy program on economic development in the third world involving tens or hundreds of millions of dollars in foreign aid per year… I’m pretty sure the response “that’s nice, now go back and do it again” is the wrong one. Much better to do a lot of critical thinking about what are alternative explanations for the same data, because collecting the data is decades and billions of dollars of cost, and just repeating the experiment is a poor way to make progress.

        • "Much better to do a lot of critical thinking about what are alternative explanations for the same data,"

          You should be doing that anyway… If you add up all the “gels thrown in the trash” because people aren’t cleaning up their protocols (or even just making extra little notes about “watch out for this!”) it is probably more expensive than the decade long study.

          Not vouching for this source or anything, but one estimate gives $90 million spent 2004-2009 on electrophoresis equipment: http://www.phortech.com/08ephinstPRfin1.htm

          So if 1 in 10 gels are thrown out, we get ~$1.8 million wasted a year due to this. I bet it is closer to 1 in 4 gels are “thrown out” though. Also we need to add the lab tech salary and grad student stipends, a probable increase over the last 9 years, etc. Anyway, that’d be a nice study for someone to do.

        • While your points are valid, I think you may have missed my point somewhat.

          Consider the Bayesian decision theory for the undergrad doing gels. We could:

          A) Assume the result is “real” and start doing complicated analyses following up on the idea for a few months. Total cost maybe $50,000

          B) Throw out the gel and re-do it for a total cost of $20 and then make our decision about follow up when we have some information about whether the phenomenon is repeatable or just an accident.

          Pretty clearly B is a good risk management strategy in biology.

          Now, in the decades-long economic development problem, even the idea that it is in principle repeatable seems unlikely; the background conditions of the world change over a decade. It makes no sense to "re-do" the experiment in any sense.

        • I am suggesting:

          C) “do a lot of critical thinking about what are alternative explanations for the same data” and “RE-do the gel”. Then make a note in the protocol of what leads to the non-reproducible gel (if applicable).

          BTW, I was told by professors that it is standard practice to "throw out" gels that don't fit because "something probably got messed up" (e.g., 3 "worked" and one "didn't work"). So, I don't think this is limited to undergrads.

        • I think my comment still fits: if an undergrad’s biology gel is throw-away-able, then it is probably not a good “brick” in my analogy.

          I think sometimes it’s right to throw away studies or redo or add to them or do more of them. But sometimes these decisions are not made very well. For many editors, it seems like a mix of laziness and convention: there’s a convention that social science papers need more than one study per paper, so good one-study papers get rejected, or they can’t find anything wrong with a paper, so they make the lazy request “do another study”. With professors and graduate students, there can also be some laziness, with reasoning like “I don’t want to think seriously about this student’s work, so I’ll just make him go back and do more of the same”. I’m not accusing anyone in particular of that thought process, but I think it happens pretty often among professors who feel that they are very busy.

        • Yes, I agree with this take on it. To some extent anonymous review has become a rent-seeking opportunity. In Academia the quantity of academics has increased dramatically over the last 3 decades, opportunities for academics are scarce, so putting up barriers to entry becomes a good strategy. One barrier to entry is to raise the bar for the amount of work and time it takes to get each academic “chit”. Anonymous reviewers can do this by just asking for “more” as a matter of course.

          Note, I don’t think people think of this as something they’re actively doing, but strategies evolve to fill ecological niches. Snakes don’t think of themselves as taking advantage of the fast generation time and burrowing nature of small mammals either. They just do their thing. “Their Thing” for academics has in large part become competing for scarce resources (professorships and grants etc) during an artificial population bloom.

        • The analogy is imperfect. In the current environment it may be the “real scientists” who have to sneak some resources from all the enormous pompous power-brokers. With close to a million dollars a year in grants, Wansink seems more like the “big male”. The main point is, “playing an effective strategy” is becoming more important than actually doing the work.

        • Sounds like biology has cleaned up some NSFW terminology since I first learned about sneaker male fish some years ago.

          But the question of whether Wansink is a sneaker male or a “big male” implicitly assumes the sneaker male/big male model. It may be better to replace that model with the more general model that there is more than one route to evolutionary “success”.

        • Martha, yes the analogy is imperfect and the question isn’t whether Wansink etc are the big males or the sneaker males, it’s more about the fact that playing a strategy has become critically important for “success” and many effective strategies don’t involve actually doing any kind of science.

          I’m going to blog this in some more depth.

        • @Martha. Another point that is important is that Mimicry is a fairly widespread strategy in biology. There are snakes that look like rattlesnakes (and that shake their tails) or milk snakes (bright stripes) but have no venom themselves. There are butterflies that look like monarchs (which are toxic to birds) but are not themselves toxic…

          And there are people who do all the trappings of science but don’t do the hard work of actual science, and they collect a good paycheck and lots of prestige.

        • Daniel Lakeland said: “And there are people who do all the trappings of science but don’t do the hard work of actual science, and they collect a good paycheck and lots of prestige.”

          That says it well.

        • PI: In the context of the broader scientific enterprise, replication is important. In the case of one individual paper, one really good study should be enough.

          GS: Enough for what? Publication? I agree…but if you are talking about between-groups experiments (which is what everyone means here when they talk about "a study") it has to be held in suspicion, especially given the puny effects that are sometimes pursued. But why would you not want to replicate something in your own lab, if the contingencies surrounding research were more conducive to good science rather than to what they are currently concerned with? And, back to SSDs, there you have to have replication even if you have one subject (as long as the ethical concerns are not too great). But SSDs are not within most people's ken. It is worth mentioning, though, that not using SSDs where they are relevant could be considered bad science.

          PI: It cannot be the responsibility of each paper to provide replications of itself – that is the responsibility of subsequent papers.

          GS: Why is that necessarily true?

          PI: Replication is an endless process, but each paper is just one brick in the wall and we should be happy when it’s a good brick rather than demanding that it be several little bricks or worse bricks.

          GS: Well…each demonstration of a reliable effect is a brick in the wall…and that takes replication – it’s best when it comes from another lab, but not necessarily worthless otherwise.

        • Before responding, let me say I think we probably agree about most things and just have some quibbles on the edges.

          In response to your three comments in order:

          1. (In response to “enough for what?”) Enough to make a valuable contribution. You bring up puny effect sizes. But really puny effect sizes and small samples wouldn’t be a “really good study” like I said would be enough. You ask “why would you not want to replicate something in your own lab”. You definitely would want to replicate anything you had the resources to replicate. But an individual paper doesn’t need to contain every possible replication of every study. See my response #2 as well.

          2. (In response to “why is that necessarily true?”) Imagine doing Study 1. Then you want to replicate it. So you do Study 2. Maybe you are considering ending the paper there. But suppose you think that every paper needs to replicate itself. So you redo Study 1 and Study 2, and get Study 3 and Study 4. Then you think it’s time to submit to a journal. But you think that every paper needs to replicate itself. So you redo Studies 1-4 and get Studies 5-8. And you redo those to get 16 studies, so that it can replicate itself. And it never ends and you never submit the paper. Each paper is finite and has to have some loose ends, and some things that are not replicated as many times as someone might wish. If papers cannot be infinite, they have to leave some of the responsibility for replication to subsequent papers.

          3. I agree. Maybe our disagreement lies in what should constitute a “paper”. In graduate school, professors tried to tell me that a paper should definitively answer some stylized question, with finality. I think that papers, rather than providing definitive final answers, should only try to make some contribution to our knowledge and understanding. I think that every contribution is necessarily incomplete, no matter how many replications or how competent or brilliant the researcher. Which is why I don’t mind the idea of shorter or 1-study papers.

  16. Hi Andrew,
    happy to see that we agree this time. I discuss the fallacy of believing in multiple significant results as strong evidence in this article.
    I explicitly discuss that 10 studies with N = 20 do not provide more evidence than 1 study with N = 200. After all, the evidence of the 10 small studies will be inconsistent and has to be integrated with a meta-analysis, and a fixed-effect MA gives the same result as a single analysis with N = 200. The editor made me add that a random-effects meta-analysis would even lower the strength of evidence.
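
    A minimal numerical sketch of the fixed-effect point (not from the article; the three estimates are the made-up d values from the example upthread, and the common standard error of 0.15 is an assumption):

    ## Inverse-variance (fixed-effect) pooling of three equal-precision studies returns the plain average
    ## as the estimate and shrinks the standard error by sqrt(3), which is exactly what a single analysis
    ## of the combined data would give.
    est <- c(0.56, 0.35, 0.62)   # three n=300 effect estimates (made up)
    se  <- rep(0.15, 3)          # assumed common standard error
    w   <- 1 / se^2
    sum(w * est) / sum(w)        # pooled estimate: 0.51
    sqrt(1 / sum(w))             # pooled standard error: 0.15 / sqrt(3), about 0.087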

    http://www.utm.utoronto.ca/~w3psyuli/PReprints/IC.pdf

    Best, Uli

  17. What if in the 3 studies the means were +20, +20, and +100, significant at 10%, 10% and 1%? Then if we combine them, the big study is significant at better than 1% still. But knowing that the last 300 data points had a significantly different mean would make me worry that the data-generating process was *not* the same for all three studies.

    Indeed, using “information” in a loose sense, rather than as a term of art, there is more information in knowing the 3 individual studies’ means and variances than in just knowing the big study’s. Having the 900 individual data points would be even more information. If, for example, they were observed perfectly ordered by magnitude, I’d be very suspicious.
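
    Along those lines, here is one way to use the three studies' means and variances (a sketch with made-up numbers; the equal standard errors of 12 make the first two means significant at roughly the 10% level, as in the comment, though the third is then far below 1%):

    est <- c(20, 20, 100)
    se  <- rep(12, 3)
    w   <- 1 / se^2
    pooled    <- sum(w * est) / sum(w)                      # about 47
    pooled_se <- sqrt(1 / sum(w))                           # about 6.9
    2 * pnorm(abs(pooled / pooled_se), lower.tail = FALSE)  # combined analysis: p far below 1%
    Q <- sum(w * (est - pooled)^2)                          # Cochran's Q, about 30
    pchisq(Q, df = length(est) - 1, lower.tail = FALSE)     # tiny p: the three means look incompatible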

  18. Am I the only one to be amazed that in 80 comments the word “Likelihood” appears only two times? (Three now.) Surely the information in the data that is interesting is the information relevant to the parameter of interest in the statistical model chosen. That information is a likelihood function.

    • All the information is in the likelihood function assuming that the model is correct. As mentioned in other comments, the three-studies presentation might contain some extra information that raises doubts about the adequacy of the chosen model.

      • Carlos, you make a good point. But doesn’t the point apply equally to all other parametric statistical methods?

        Inspection of a likelihood function doesn’t preclude assessment of the competence and relevance of the statistical model. Indeed, it probably helps.

  19. I think what is most interesting in the question is that it leads us to realize that when we say “replication” we do not actually want 100% replication. We want to have enough variance in the conditions under which the study is run so we can see if the results hold up.

    The answer to a question like this depends on the details.

    • This, exactly. This is what makes me skeptical that the Nosek-style replication projects, that is, direct replications that get lots of splashy coverage, are the best way to go. The counter-argument is that without a direct replication people still believe the original result under the original conditions. But I think it's better to just tell people to stop doing that. Then we can have resources put into more interesting projects than trying for a 100% perfect repeat of the original study.

  20. When I see the term replication I tend to assume there are differences in who is doing the study, the participants, etc. One thing very appealing to me about hierarchical modeling is that the studies themselves are experimental units and, at least as I see it, our goal is modeling and explaining variability.

    I wonder if it might be interesting to have a senior scientist, or group of scientists, get a grant to put together an internet site with study ideas and conditions to run for a given problem. The PIs could use ideas from experimental design to come up with optimal design spaces to answer the question of interest. Then undergraduates, graduate students, and other researchers looking for a project could log on, find one to do, preregister it, then do it and enter their study results when they are done, with all the information on their "experimenter degrees of freedom" and the decisions they made. It is sort of like designing ahead of time the studies one would want to put into a meta-analysis and then crowdsourcing them.

  21. The question never specifies if the researchers conducting each experiment in Paper A were the same group. If it was different researchers, I’d trust Paper A far more, because there will be a range of methodological differences between all three research groups, increasing error and making the findings overall more conservative. On the other hand, Paper B could suffer from a single methodological issue (that wouldn’t have been replicated by other groups) that we could never pick up in their paper.

    Given that both provide the same information, the one you pick is based upon your ideas about people's ability to run an experiment the way they planned to. I would suspect that the methodologists are saying Paper B is better and the applied statisticians are picking A, because we know how terrible most experiments are.

  22. Someone somewhere above here said: “NHST is the dominant paradigm in science”. That seems a bit of a stretch to me – not that it is dominant, but that it is a paradigm. Wasn’t ever designed or developed from principles, more cobbled together from spare parts that were lying around.

  23. I don’t believe that the two cases are equivalent, even if the three N=300 cases were taken one right after the other, with no changes in the methodology.

    Consider taking the N=900 case, and dividing the data into the first, middle, and last 300 samples taken. Over the space of all possible measured results that lead to the N=900 case showing support for the hypothesis, some of those results will show support for the first and middle N=300 cases, but not the last. Similarly, some will not show support for the middle N=300 case, and others will not show support for the first N=300 case. The possible data which leads to all three N=300 results showing support is a subset of the possible data which will show support for the N=900 case.

    My “gut instinct” is that the case with the three N=300 subsets all showing support is a little stronger than the single combined N=900 result. I could be wrong on that, but we can’t simply handwave and say that the three N=300 results all showing support is equivalent to an N=900 combination also showing support.
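
    A small simulation makes that nesting concrete (a sketch; it assumes one-sided z-tests on normal data with unit variance, and the true effect of 0.15 SD is an arbitrary choice):

    set.seed(1)
    d <- 0.15                                            # assumed true effect, in SD units
    sims <- replicate(10000, {
      x     <- matrix(rnorm(900, mean = d), ncol = 3)    # one N=900 data set, viewed as three n=300 studies
      z_sub <- colMeans(x) * sqrt(300)                   # z statistic of each n=300 sub-study
      z_all <- mean(x) * sqrt(900)                       # z statistic of the combined N=900 analysis
      c(all_three = all(z_sub > 1.645), combined = z_all > 1.645)
    })
    mean(sims["combined", ])                             # the combined analysis "detects" the effect most of the time
    mean(sims["all_three", ])                            # all three sub-studies detect it less often
    sum(sims["all_three", ] & !sims["combined", ])       # 0: never three detections without the combined one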

  24. I'm changing my answer. (Given my familiarity with Mayo-style severity calculations I should have spotted this right away.) Paper A does indeed give more evidence of an effect. I'm granting the assumption that we're looking at a basically arbitrary division of data into three subsets and no other differences. I'll pretend we're in the large-sample limit where data likelihoods are approximately normal and that "hypothesis was confirmed" means an effect was detected. Under independence we can pretend that paper A reports 3 (very precise) N = 1 experiments and paper B reports an N = 3 experiment. We've got two likelihood functions (the prior is the same in both scenarios, presumably, and presumably flat where the likelihood is curved):

    log-likelihood_A = 3 * log[Pr(effect was detected using N = 1 | effect size)] = 3* log(integral of full-data likelihood over all N = 1 data sets with a detected effect)
    log-likelihood_B = log[Pr(effect was detected using N = 3 | effect size)] = log(integral of full-data likelihood over all N = 3 data sets with a detected effect)

    (Note that if all we know is that the effect was detected then only the prior can give us upper bounds on the effect size.)

    Let's say we're interested in the amount of posterior probability mass for the effect size above some value of interest, w.l.o.g. zero. This is tantamount to declaring the effect detected if the datum/sample mean is above some threshold. The threshold in paper A is sqrt(3) times larger than in paper B. Without loss of generality let's assume the precision in paper A is 1 and, with loss of generality but for the sake of concreteness, let's assume threshold_A is 2. Then the precision of the sample mean in paper B is 3 and the threshold is 2/sqrt(3) = 1.15. Let's do some likelihood calculations by hand; we'll check effect sizes 0, 2, and 4. The paper A likelihoods (for three observations, each above threshold_A, at the given effect size and precision) are 0.0, 0.125, and 0.93; the paper B likelihoods (for the sample mean above threshold_B at the given effect size and precision) are 0.02, 0.93, and 1.0. Paper A is constraining the effect to larger values.

    R code:
    ## log Pr(all three "N = 1" detections | effect): each observation must exceed threshold_A = 2, precision 1
    loglikelihood_A <- function(effect) 3 * pnorm(2, mean = effect, sd = 1, lower.tail = FALSE, log.p = TRUE)
    ## log Pr(the "N = 3" detection | effect): the sample mean must exceed threshold_B = 2/sqrt(3), precision 3
    loglikelihood_B <- function(effect) pnorm(2 / sqrt(3), mean = effect, sd = 1 / sqrt(3), lower.tail = FALSE, log.p = TRUE)
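
    For what it's worth, exponentiating these at effect sizes 0, 2, and 4 reproduces the hand calculations above:

    round(exp(sapply(c(0, 2, 4), loglikelihood_A)), 3)   # 0.000 0.125 0.933
    round(exp(sapply(c(0, 2, 4), loglikelihood_B)), 3)   # 0.023 0.928 1.000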

    • I think we need to model this as a communications channel. Has more information been collected about the phenomenon of interest if you take a single data set and separate it at arbitrary boundaries? No. But has more information been transmitted to the reader? Yes. This is unsurprising, after all in the 3 studies example you received 3 numbers through the channel, and in the single study example you received 1 number.

      • You didn’t get three numbers vs one number; you didn’t even get three pieces of censored data vs one piece of censored data — the thresholds aren’t known. They don’t have to be because the conclusion holds irrespective of the particular threshold values. It’s a good thing the OP only asked for a very low-res conclusion.

        • You got 3 binary digits in the 3 papers version, and 1 binary digit in the 1 paper version. But, conditional on them analyzing the same overall dataset, which I think is the real point here, there isn’t more information in the data, it’s just more information in the transmission to the reader.

        • So, with that interpretation it leads to Carlos' comment below: yes, technically you could have three 300-point datasets which, when aggregated, are stronger than one 900-point dataset, but in most cases the 900-point dataset will have a much, much smaller p-value than each of the 300-point datasets.

          I think the core point still holds: by checking 3 aspects of the dataset and transmitting 3 bits to the reader, you provide the reader with slightly more information than by checking 1 overall larger dataset and transmitting a single bit; but in either case, it's a lousy way to transmit info to the reader.

    • I think you are looking at A and B separately (two scenarios) but in the original question “you are given two papers” and it seems that the intent of the question “which paper has more evidence” is to compare the amount of evidence in paper A, conditional on papers A and B, with the amount of evidence in paper B, conditional on papers A and B (*). It’s not very clear what that means, and the vagueness in the notion of “amount of evidence” doesn’t help.

      (*) “If you have 3 n=300 studies with p < .05, then an n=900 study collected exactly the same way from the same DGP and population will almost certainly have p <<< .05."

      • That's a good point. I guess the upshot would be that the difference in evidence (howsoever measured) would be negligible but not strictly zero, as the N = 900 study technically allows for some possible data sets that are strictly excluded by the three N = 300 studies, but those possible data sets are very unlikely given the information about hypothesis confirmation in both papers.
