Here’s James Heckman in 2013:

Also holding back progress are those who claim that Perry and ABC are experiments with samples too small to accurately predict widespread impact and return on investment. This is a nonsensical argument. Their relatively small sample sizes actually speak for — not against — the strength of their findings. Dramatic differences between treatment and control-group outcomes are usually not found in small sample experiments, yet the differences in Perry and ABC are big and consistent in rigorous analyses of these data.

Wow. The “What does not kill my statistical significance makes it stronger” fallacy, right there in black and white. This one’s even better than the quote I used in my blog post. Heckman’s pretty much saying that if his results are statistically significant (and “consistent in rigorous analyses,” whatever that means) then they should be believed, and even more so if sample sizes are small (and of course the same argument would favor stronger belief if measurement error is large).

With the extra special bonus that he’s labeling contrary arguments as “nonsensical.”

I agree with Stuart Buck that Heckman is wrong here. Actually, the smaller sample sizes (and also the high variation in these studies) speak against, not for, the strength of the published claims.

Hey, we all make mistakes. Selection bias is a tricky thing, and it can confuse even some eminent econometricians. What’s important is that we can learn from them, as I hope Heckman and his collaborators have learned from the savvy critiques of Stuart Buck and others.

**P.S.** These ideas are not trivial, but they’re not super-technical either. You, blog reader, can follow the links and think things through and realize that James Heckman, Nobel prize winner, was wrong here.

What’s the difference between you and James Heckman, Nobel prize winner? It’s very simple. It’s not that you’re better at math than James Heckman, Nobel prize winner, or that you know more about early childhood education, or that you have a deeper understanding of statistics than he does. Maybe you do, maybe you don’t.

The difference—and it’s almost the only difference that matters here—is that you’re willing to consider the possibility that you might be wrong. And Heckman isn’t willing to consider that possibility. Or he hasn’t been. It’s never too late for him to change, though.

Well I guess the next logical step is to run n = 2 experiments. I mean, if small sample sizes increase the strength of evidence, might as well make them as small as possible right? If the experimental participant is way larger than the control participant, then the experiment must have an effect, by this logic.

Heck, run it again with another n = 2 study, and you can even appease all those replication nay-sayers by replicating your own work.

Huh. Just got a lot easier to get tenure.

These quotes from Heckman and Buck are not inconsistent with one another, despite Buck’s introductory element:

Heckman:

Dramatic differences between treatment and control-group outcomes are usually not found in small sample experiments.

Buck:

Contrary to what Heckman says, dramatic differences between treatment and control groups are MOST likely to show up in small samples.

Big differences in small samples can be both unusual, relative to small(er) differences in small samples, while also being more likely to occur in small samples than large samples. Heckman is making a statement only about small-sample studies, whereas Buck is making a statement about small-, medium-, and large-sample studies.

Of course, Heckman is also drawing the erroneous conclusion that large observed effects provide strong evidence by virtue of the fact that large differences are unusual with small samples.

Hauntingly, I ran into the Heckman argument, that large effects in small samples speak for the strength of the findings, when my clinical director asked me to attend a talk (in the 1980s) by a biostatistician recently recruited from Harvard. He did hint that I would find their argument amusing.

But I have encountered it more often than I hoped – and now Heckman in 2013!

I give some speculations below on why it happens and continues to happen and will continue to happen.

I find Peirce’s three grades of a concept to be helpful in thinking about this: 1. the ability to recognise instances, 2. the ability to define it and 3. the ability to get the upshot of it (how it should change our future thinking and actions). For instance, in statistics: the ability to recognise what is or is not a p_value, the ability to define a p_value and the ability to know what to make of a p_value in a given study. It’s the third that is primary and paramount to “enabling researchers to be less misled by the observations”. It’s the one Heckman got very wrong. It also always remains open ended.

Almost all the teaching in statistics is about the first two and much (most) of the practice of statistics skips over the third with the usual, this is the p_value in your study and do not forget its actual definition. But in any important research someone will or should try to get the upshot of it – and more often than desirable – get it badly wrong.

The statistical discipline needs to take some accountability for the habits of inference they _set up_ in various areas of scholarship by the methods/explanations they recommend. I think the way forward is to stop analyzing single studies trying to get the upshot of them (except in extreme emergencies) but rather simply report what you did, what you think happened and make all the data and protocol information available to others. Then when there are multiple studies some sensible upshot will not be so hazardous to try and discern. For instance, with many small studies, there will be a lot of variation in estimates and the largest one (rare but which would have gotten most of the attention) will easily be seen as an exaggerated estimate.

Keith said: “For instance, in statistics the ability to recognise what is or is not a p_value, the ability to define a p_value and the ability to know what to make of a p_value in a given study. …

Almost all the teaching in statistics is about the first two and much (most) of the practice of statistics skips over the third”

I would say something stronger than the last assertion; namely, that a lot of statistics teaching teaches the third incorrectly, by giving examples that conclude with statements like, “The p-value is < .05, so we reject the null hypothesis and conclude that the alternate hypothesis is true.”

OK, but Buck’s argument differs from yours, if I’m not mistaken. He’s arguing “effect of outliers is greater in smaller samples”, while you’re arguing Garden of forking paths. Also, is a Garden of forking paths as relevant here as it is for social psychology studies where anything you can think up can be an explanatory variable?

Jack:

There are lots and lots of forking paths in those education studies; see the linked post. It’s been said that more studies have been published on this ABC data than there were participants in the experiment! Also, the claim that “their relatively small sample sizes actually speak for — not against — the strength of their findings” is in error, even in the absence of forking paths. Forking paths explain how it is that researchers keep coming up with those p-values, but even with no forking paths at all, the math tells us that type M errors become large when sample sizes are small and data are noisy.
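The type M point can be checked with a quick simulation. This is only a rough sketch under assumptions that are mine, not from the Perry or ABC data: a hypothetical true effect of 0.1, a standard deviation of 1, a normally distributed estimate with standard error sigma/sqrt(n), and a two-sided 5% significance filter. The function name `exaggeration_ratio` is made up for illustration.

```python
import random
import statistics

def exaggeration_ratio(true_effect, sigma, n, sims=20000, seed=1):
    """Mean of |estimate| / true_effect among 'significant' estimates (|z| > 1.96)."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5  # standard error of the estimated mean
    significant = []
    for _ in range(sims):
        estimate = rng.gauss(true_effect, se)  # one simulated study
        if abs(estimate) > 1.96 * se:          # keep only the significant ones
            significant.append(abs(estimate))
    return statistics.mean(significant) / true_effect

# Hypothetical numbers: same true effect, same noise, different sample sizes.
small = exaggeration_ratio(true_effect=0.1, sigma=1.0, n=25)
large = exaggeration_ratio(true_effect=0.1, sigma=1.0, n=2500)
print(f"n=25:   significant estimates overstate the effect by about {small:.1f}x")
print(f"n=2500: significant estimates overstate the effect by about {large:.1f}x")
```

With n = 25 the standard error is 0.2, so any estimate that clears the significance bar is at least 0.39, nearly four times the assumed true effect. No forking paths are needed for that exaggeration, only the filter.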

The confusion here has existed in statistics for a long time, as pointed out by Royall (1997):

1. Berkson (1942): “the evidence provided by a small p correctly evaluated is broadly independent of the number in the sample”

2. Lindley and Scott (1984): “the interpretation to be placed on the phrase ‘significant at 5%’ depends on the sample size: it is more indicative of the falsity of the null hypothesis with a small sample than with a large one”

3. Peto et al (1976): “A given p value in a large trial is usually stronger evidence that the treatments really differ than the same p-value in a small trial of the same treatments would be.”

So the p-value in small samples means either the same thing, more, or less than in large samples. Take your pick.

None of those arguments are wrong, of course. They are each just nuanced about what they are conditional on.

But in addition, none of them are conditioning on only being viewed after publication, which is the issue Andrew is referring to.

> only being viewed after publication

Even before publication, if it’s being selectively focused on because the CI excludes zero.

It’s the selective focus on the subset of such intervals and the neglect of all others.

To make the difference clearer, think of an ideally conducted, reported and published paper that provided numerous CIs – the subset that exclude zero will be biased but the whole set won’t be. The bias in the subset will be larger with smaller sample sizes (all else equal).

I can’t understand the sense in which the Lindley and Scott one can be true (what am I supposed to condition on?). If the null hypothesis is true, surely the chance of ‘significant at 5%’ is 5%, independent of sample size. But if the null hypothesis is false, we are _less_ likely to see ‘significant at 5%’ with the lower-power (smaller sample size) test. So if I do see 5% significance, what possible logic could make ‘false’ relatively more likely than ‘true’ as the sample size decreases?

(Yes, obviously I’m missing something, and probably something obvious.)

Thanks.

“None of those arguments are wrong” Huh?

1. Berkson’s is contrary to the greater chance of a Type M error with small samples. (A nice example of this pointed out by Howard Wainer: Bill Gates’s promotion of small schools, because studies showed that small schools were more likely to have high average student achievement than large schools. But looking at the data, small schools are also more likely to have low average achievement than large schools.)
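The small-schools pattern is easy to reproduce with made-up data. A minimal sketch, assuming every student's score comes from the same N(0, 1) distribution regardless of school, with hypothetical school sizes of 20 and 500 students; nothing here comes from Wainer's actual data.

```python
import random

rng = random.Random(0)

def school_average(n_students):
    # Every student is drawn from the same N(0, 1) distribution,
    # so no school is truly better than any other.
    return sum(rng.gauss(0, 1) for _ in range(n_students)) / n_students

# 500 hypothetical small schools and 500 hypothetical large schools.
schools = [("small", school_average(20)) for _ in range(500)]
schools += [("large", school_average(500)) for _ in range(500)]
schools.sort(key=lambda s: s[1])  # sort by average score, worst first

bottom, top = schools[:25], schools[-25:]
small_in_top = sum(1 for size, _ in top if size == "small")
small_in_bottom = sum(1 for size, _ in bottom if size == "small")
print(f"small schools among the 25 best averages:  {small_in_top}")
print(f"small schools among the 25 worst averages: {small_in_bottom}")
```

Small schools crowd both ends of the ranking purely because their averages are noisier, which is the same mechanism that makes a large estimate from a small study weak evidence of a large effect.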

2. While Lindley’s is true, it needs to be applied with the understanding that large sample sizes can (and often do) result in statistically significant findings that are so tiny that they are of no practical significance.

3. Peto et al’s statement has the same problem as Lindley’s.

As you have stated, (2) and (3) are correct. Should be understood in context as you’ve said, but still correct.

And your issue with (1) is about type M errors, while the statement is about the evidence against the null/sign of the effect.

“And your issue with (1) is about type M errors, while the statement is about the evidence against the null/sign of the effect.”

As with (2) and (3), it’s practical significance that is important, not just the sign of the effect. But Type S errors (wrong sign) can have high frequency as well. (See http://www.stat.columbia.edu/~gelman/research/published/francis8.pdf)

These statements are not contradictory, they merely condition on different statements.

Gelman’s statements condition on having seen a p-value less than 0.05 and having extremely low power in the study. Berkson conditions on neither of these statements.

Well… if you assume the null hypothesis is false (a good idea in my opinion), then sample size needed to detect a deviation from the null hypothesis should decrease with the size of the deviation. This works for the dice example: no one thinks a die is perfectly fair but if you need to roll 100k times to really see the deviation then who cares? If it lands on six 90% of the first hundred rolls then there is a problem.

People have sent me this, as I’ve written a lot about it. Here I’ll just plunk down my analogy. People worry about small sample size and low power because they question the assumptions of a test: they suspect cherry-picking, fishing for significance and the like (especially if the observed difference is out of whack with the known population effect size). The sample size must be large enough for the assumptions to hold. However, IF the assumptions hold then attaining the same level of significance with the smaller sample size is indicative of a larger effect or discrepancy than with a larger one. To claim the reverse is to commit what I call:

Mountains out of Molehills Fallacy (large n problem): The fallacy of taking a (P-level) rejection of H0 with larger sample size (higher power) as indicative of a greater discrepancy from the null than with a smaller sample size.

Consider an analogy with two fire alarms: The first goes off with a sensor liable to pick up on burnt toast, the second is so insensitive, it doesn’t kick in until your house is fully ablaze. You’re in another state, but you get a signal when the alarm goes off. Which fire alarm indicates the greater extent of fire? Answer, the second, less sensitive one. When the sample size increases it alters what counts as a single sample. It is increasing the sensitivity of the fire alarm. But, if the test rings the alarm (i.e., rejects H0) even for tiny discrepancies from the null value, then the alarm is poor grounds for inferring larger discrepancies. Compare two cases of a just statistically significant result at level .025 (with known sigma, say 1): one with n = 10,000, and a second with n = 25.

Subtract 1.5 SE from the outcome which is at 2SE to get the lower .93 confidence bound:

(i) for n = 25, the .93 lower estimate is : μ > .5(1/5)= .1

(ii) for n = 10,000, the inferred .93 lower estimate is μ >.5(1/100)= .005.

That is why we get the Jeffreys-Lindley result: “the more powerful the test, the more a just significant result favors the null hypothesis” (Pratt 1961, p. 166).
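The arithmetic in (i) and (ii) can be reproduced in a few lines. A minimal sketch, assuming a null value of 0, known sigma = 1, and an observed mean sitting exactly 2 SE above the null; the helper name `lower_bound` is hypothetical.

```python
from statistics import NormalDist

def lower_bound(n, sigma=1.0, observed_z=2.0, subtract_z=1.5):
    """Lower confidence bound for mu: observed mean minus subtract_z standard errors."""
    se = sigma / n ** 0.5          # standard error of the mean
    observed = observed_z * se     # a just-significant outcome, 2 SE above a null of 0
    return observed - subtract_z * se

# One-sided confidence level that corresponds to subtracting 1.5 SE.
confidence = NormalDist().cdf(1.5)

for n in (25, 10_000):
    print(f"n={n}: mu > {lower_bound(n):.3f} at level {confidence:.2f}")
```

The bound scales with 1/sqrt(n), so the same z-score licenses a claim twenty times larger at n = 25 than at n = 10,000, provided the test's assumptions actually hold.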

Which test statistic or effect size would you trust more? The one with n = 25 or the one with n = 10,000?

The fallacy as stated illustrates, indirectly, why basing decisions on p values rather than observed test statistics or effect sizes is problematic (and makes me wonder how common the fallacy actually is). It’s not at all clear (to me) that equating these two cases with respect to “how significant” they are is particularly illuminating, and it’s obvious that if you observe equivalent test statistics or effect sizes with n = 25 and n = 10,000, the latter is more informative.

If the null is true, the probability of observing a “just significant” result is the same for both cases, by definition. But the probability of observing an erroneous, large discrepancy is much larger for n = 25 than for n = 10,000.
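To put a number on that last sentence, here is a small sketch. It assumes a true mean of 0, sigma = 1, a normal sampling distribution for the sample mean, and an arbitrary cutoff of 0.3 for what counts as a "large" discrepancy; the cutoff and the function name are my own choices for illustration.

```python
from statistics import NormalDist

def prob_large_estimate(n, threshold, sigma=1.0):
    """P(|sample mean| > threshold) when the true mean is 0."""
    se = sigma / n ** 0.5
    # Two-sided tail probability of a N(0, se) sampling distribution.
    return 2 * (1 - NormalDist(0, se).cdf(threshold))

for n in (25, 10_000):
    print(f"n={n}: P(|estimate| > 0.3 | null true) = {prob_large_estimate(n, 0.3):.4f}")
```

At n = 25 a spurious estimate beyond 0.3 turns up roughly 13% of the time; at n = 10,000 the same cutoff is 30 standard errors out and essentially never reached.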

“Which test statistic or effect size would you trust more? The one with n = 25 or the one with n = 10,000?”

You’ve just significantly changed the question. The question should read “Which true effect size do you think is bigger: an estimated effect of 10 with n = 25 or an estimated effect of 1 with n = 2500”?

I think it is a good idea to change the question.

Also, it would depend on what you are studying – if your phenomenon is intrinsically highly variable, say characterised by 100 parameters in principle, then I would be very skeptical of the n=25 case because again, it is ‘far from equilibrium’.

So again, to me I think you need to ask how well your sample represents the stochastic object you are studying.

(Mayo does briefly hint at this when she mentions the sample size needing to be large enough for model validation)

Furthermore, back to the analogy:

A sensitive fire alarm to me is one which could be set off by a single molecule.

An insensitive one would be one that measures many molecules to make sure.

While Mayo appears at first to agree, she then equates the sensitive alarm to the _large_ n case, while to me it corresponds to the _small_ n (single molecule measurement) case.

So Mayo associates a more sensitive fire alarm to a larger n. While it isn’t ‘wrong’ given her explanation and the notion of ‘significance’, I prefer to relate sensitivity to convergence and stability. Larger n means more stable results means less sensitive to fluctuations.

Huber on robustness says:

‘a…small minority should never be able to override the evidence of the majority of observations… [this idea] makes sense only with relatively large sample sizes, since otherwise the notion of a “small minority” would be meaningless’

I agree. If you want robustness/stability then you need larger sample sizes (though what is a large or small sample will depend on the empirical variability of the phenomenon studied)

Yes, I did, because, as I said, the original question doesn’t seem particularly interesting to me. Equating the two cases for “how significant” they are seems like a convenient way to sidestep the fact that n = 10,000 will be much more precise and much more informative than n = 25.

If both of these cases – as stated, equated for level of significance – were from studies of the same thing, the n = 25 case may well provide evidence of a larger effect than does the n = 10,000 case, but the former is much weaker evidence of anything at all than is the latter.

Just to elaborate a bit more, we never know what the true effect size is, so a question that starts with “which true effect size do you think…” doesn’t really relate directly to any real-world issues we might encounter.

Shouldn’t the analogy be more like comparing the results from a completely erratic fire alarm to those of a very predictable fire alarm?

To use a physics analogy, I’d call small n ‘far from equilibrium’ and large n ‘near equilibrium’ behaviour. The former has large fluctuations and is much harder to understand than the latter.

Similarly, I trust statistics more when the asymptotics kick in and we have a stable measurement of a phenomenon.

+1

Ugh, the problem with this discussion is that everyone’s arguments are conditioning on different things.

I believe Mayo’s point is this: suppose you have two tests, one with n1 = 10 and another with n2 = 10,000. You run your analysis and you get p = 0.01 for both tests. Which effect is likely to be a stronger effect size? Test 1, of course.

On the other hand, if you condition on seeing something in publication, knowing it came from a process in which p greater than 0.05 means it is truncated from your sample, which result do you trust more? Well, the test with n1 = 10 is going to be much more biased than n2 = 10,000, as for the same effect size, the n1 results will be more heavily truncated.

But I’m more proposing a different conceptual view than a different conditioning as such.

For example, you might ask me to compare n=2 from a normal distribution and n=100 from a normal distribution.

I’d say: forget the ‘from a normal distribution’.

Instead, ask at what point the empirical distribution is a good reflection of the distribution of interest.

If we assume we can characterise the property of interest as a functional of a distribution we want to know when T(Pn) is representative of T(P). One answer is: when Pn is close to P and T is continuous.

Two reasons for a large measured ‘effect size’ T(Pn):

1. T(P) is large in the population of interest

2. Pn is very far from P

In small samples I’ll take 2.

That’s a rational choice under the prior that T(P) is very highly concentrated around 0. If not, that’s a highly irrational choice.

To clarify:

I would remain agnostic in case 2.

It seems fairly ‘rational’ to me to say

‘if I’m studying a stochastically variable phenomenon and only have one measurement of it then I probably shouldn’t trust my results’

but whatevs, call it irrational if you want.

I don’t think we are in disagreement, but I’m just pointing out that your statement implies that you have strong prior that T(P) is small. If your prior was strong that T(P) was large, you wouldn’t need any data to think (1) was probably true, and if you saw a small amount of data that was consistent with (1) being true, you should be even more convinced that (1) is true than before you saw the data (although maybe only slightly).

If you tell me you have a strong prior that the himmicane effect should be very small, so you’re not convinced by the small dataset (for which you believe that Pn is an unreliable estimate of P given the relatively small n), I think everyone here will agree with you.

> your statement implies that you have strong prior that T(P) is small

My ‘prior’ is that I’m studying something which requires multiple measurements to characterise and relies on, for example, the empirical distribution being close to the distribution of interest.

It isn’t really a ‘prior’ in the usual Bayesian sense but sure, it’s something I ‘believe’! Or whatever.

I.e., I make no assumption about what T(P) is, I just say that I don’t think T(Pn) is a good approximation to it unless Pn is a good approximation to P.

ojm, I think a reader is right.

> ‘if I’m studying a stochastically variable phenomenon and only have one measurement of it then I probably shouldn’t trust my results’

Two reasons for a small measured ‘effect size’ T(Pn):

1) T(P) is small in the population of interest

2) Pn is very far from P (or you remain agnostic, whatever)

In small samples, would you also take 2 in this case? If your agnosticism applies only to the large T(Pn) case, where does the difference come from?

> In small samples, would you also take 2 in this case? If your agnosticism applies only to the large T(Pn) case, where does the difference come from?

Yes. Those two options were perhaps poorly phrased if taken too literally. The other statements I made were better.

The point is that Pn is very far from converging to P in small samples. Hence T(Pn) has no guarantee of representing T(P) even if T is continuous. The reliability of the estimate T(Pn) is inherited, via continuity, from the convergence of Pn to P. No convergence of Pn means no reliability to inherit.

So if I understand correctly your point is that it’s better to have more data than less data, but it’s unrelated to the measured effect size being large or small.

I suppose, though it depends on what exactly ‘effect size’ means.

I’m just saying that if we had a ‘full population’ of interest and a functional T then we would report our quantity of interest as T(P). This is directly referring to a population of interest and not a null vs alternative comparison but I suppose it could be a population of differences or some such.

The whole ‘what would you do if you had the whole population?’ question.

If Pn has empirically converged to near P and T is a continuous functional then by definition we have T(Pn) near T(P). If either Pn hasn’t converged to near P or T is not continuous then we have no such guarantee that our estimate is near our population quantity of interest. This is borderline a fiducial argument but mathematically straightforward.

My thoughts on this were prompted in part from people using the example of a single sample ‘from a normal distribution’. (Andrew called it his ‘new favourite example’ at some point).

Intuitively I would not trust any definite answer to this question, the above is my way of saying why I would refuse to follow the usual arguments.

If you really wanted to introduce a prior into the mix then I suppose you could do something like

Define the mixture data distribution

P* = (n/N)Pn + (1-n/N)Pp

where Pp is a ‘prior data’ distribution, N represents your population size and n your sample size (the asymptotics will apply for n/N near one, no need for infinite N but you do need an estimate of relative scales, i.e., sample size to ‘full’ population).

Your estimate for any sample size is simply

T(P*)

while we still have P*->Pn-> P for large n. For small n your prior data model dominates. But again, only as n gets near N would I trust what the data say.

(It would probably also be best to choose the weights to scale with the convergence of Pn to P so that you don’t artificially prevent or encourage your estimate to converge at the same rate as Pn to P…but then again we’re just back to convergence of Pn to P as the crucial element)
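For concreteness, here is what T(P*) looks like in the simplest case I can think of. This is only a sketch under an assumption that is mine: T is the mean, which is a linear functional, so T of the mixture is just the same weighted average of the sample mean and the mean of the prior data distribution. The function name `mixture_mean` and the numbers are hypothetical.

```python
from statistics import mean

def mixture_mean(sample, prior_mean, population_size):
    """T(P*) for T = mean, where P* = (n/N) Pn + (1 - n/N) Pp."""
    n = len(sample)
    weight = n / population_size  # n/N, the share of the population actually observed
    # The mean is linear, so the mean of the mixture is the mixture of the means.
    return weight * mean(sample) + (1 - weight) * prior_mean

sample = [2.0, 3.0, 4.0]  # made-up observations with sample mean 3
for N in (300, 30, 3):
    print(f"N={N}: T(P*) = {mixture_mean(sample, prior_mean=0.0, population_size=N):.2f}")
```

When n/N is tiny the estimate stays close to the prior mean however extreme the sample looks, and it only moves to the sample mean as n approaches N. For a nonlinear T you would have to evaluate the functional on the mixture itself rather than mix the two values.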

I understood her point, I just don’t find it particularly compelling. I pointed out that equating the p values for the two tests isn’t particularly illuminating or interesting, whereas equating the test statistics/effect sizes is.

Noah (and Mayo):

I agree her analysis makes sense when an alarm has gone off and you know whether it was a sensitive or insensitive alarm, but I do not think it is a good analogy for what we think Heckman did (primarily focus on an estimate more than two SE from zero).

In particular if one considers a probabilistic alarm generating model P(alarm| detectable ppm in air) as a function of detectable ppm in air, the sensitive alarm’s likelihood will be flat from zero to ppm in air that toast smoke creates and then steadily increase, whereas the insensitive alarm’s likelihood will be flat from zero to ppm in air that only a fully ablaze fire creates. The second has a near discontinuity that results in the likelihood ratio being essentially one for all fires less than fully ablaze.

A better household analogy would be a wireless thermometer placed in the front door hallway (where temperature changes when the door is opened and/or the sun occasionally shines directly on it) versus a wireless thermometer placed in the interior, where less temperature variation will occur. Now these wireless thermometers send out alarm reports only when the temperature is + or – 7 from the set temperature. In the summer, if you only have a thermometer placed in the front door hallway and you get a high temperature alarm, it would be silly to think you should take this more seriously than if you had one in the interior. Vice versa in the winter.

Keith: Just to be clear, I do not claim to know anything about Heckman’s particular examples, and given it’s economics, I can well imagine faulty assumptions. I had the impression the discussion was on the general point about p-values, given the assumptions were adequate, (according to those who wrote to me) and I was just responding to it. So there may be no disagreement.

> the discussion was on the general point about p-values

OK, so not worrying about your analogy being apt for what we think Heckman did, I am still worried in the fire alarm case about transitioning to Wald confidence bounds (estimate + or – 2 SE) as I explained below.

Admittedly, my discerning of likelihoods was very back of the envelope, but I think qualitatively correct: the likelihood shaped x% coverage confidence intervals would overlap for almost fully ablaze or more but not for less than fully ablaze (for some x% coverage). The insensitive alarm’s interval would have a lower bound of almost fully ablaze, while the sensitive alarm’s interval would have a lower bound of almost toast smoke.

I do think what’s going on in the fire alarm example is much clearer with these likelihood shaped x% coverage confidence intervals than with p_values. You might wish to work severity through with such intervals.

“Which effect is likely to be a stronger effect size? Test 1, of course.” I think it’s better to write it as: which observed effect is indicative of a large underlying population or parametric discrepancy (from null). Another useful way to see it (in the example I gave) is to look at the upper bound of a .95 confidence interval. The alpha-level observed difference with the large sample size is actually ruling out (statistically) parametric values that are well indicated with the smaller.

Your second paragraph sounds as if it’s pointing to grounds for questioning whether the p-values are legitimate or spurious. If they are spurious, then the claims about which indicates a larger underlying discrepancy can’t even be made in the first place.

“Your second paragraph sounds as if it’s pointing to grounds for questioning whether the p-values are legitimate or spurious. If they are spurious, then the claims about which indicates a larger underlying discrepancy can’t even be made in the first place.”

Well, it all depends on what types of effects are typically examined. And I believe this is Gelman’s argument. It’s very easy to talk about the operating characteristics of p-values on a single sample of data. But what happens when you’re trying to decide what to make of results in the literature? Making the assumption that everything with p greater than 0.05 doesn’t make it into the literature, we need to recognize the operating characteristics are totally different than if we had looked at an isolated example of data.

To make any sense of the reliability of results, we need to ask the following question: what is the distribution of effects commonly evaluated in the literature? If the literature routinely evaluates near-zero effects, then all the p less than 0.05’s we see in the literature are mostly meaningless; we supposedly see fewer reports than we had before the p less than 0.05 filter, but they still have a high probability of being spurious. However, if the literature is routinely examining large effects, then we can reasonably have faith in these results: they had a moderate probability of being large effects to start with AND made it through the p less than 0.05 filter.
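A back-of-the-envelope version of that argument, as a sketch only. It assumes a crude split of tested hypotheses into "real effect" and "exactly null" (an oversimplification), a single power value for all real effects, and a filter that lets through only p less than 0.05; the prior shares and powers below are invented for illustration.

```python
def share_real_among_significant(prior_real, power, alpha=0.05):
    """Fraction of p < alpha results that come from real effects, by Bayes' rule."""
    real_hits = prior_real * power          # real effects that reach significance
    false_hits = (1 - prior_real) * alpha   # null effects that reach significance
    return real_hits / (real_hits + false_hits)

# Hypothetical literatures: (share of tested effects that are real, power of a typical study).
for prior_real, power in [(0.1, 0.1), (0.1, 0.8), (0.5, 0.8)]:
    share = share_real_among_significant(prior_real, power)
    print(f"prior={prior_real}, power={power}: {share:.0%} of significant results are real")
```

With few real effects and low power, fewer than one in five published significant results reflects anything real, even though every one of them passed the same 5% test. Smaller samples mean lower power, so at a fixed p-value threshold the small-study literature fares worse on this measure.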

Mayo:

Agree about the upper bound but in your fire alarm analogy a confidence interval formed by + or – 2 SE (a Wald interval) would be completely inappropriate given the likelihoods I have discerned for your fire alarm case (above).

Your co-author David Cox strongly prefers likelihood ratio shaped confidence intervals (that have approximately the correct coverage). So an appropriate CI for the insensitive alarm would run from a lower bound just below fully ablaze to something above fully ablaze, or even be unbounded. So something very different is going on in the fire alarm case and I believe that makes it a less than appropriate analogy.

I left this a bit unfinished.

If one assumes that there are intense blazes so hot that the fire alarms will be destroyed before an alarm sounds then both sensitive and insensitive alarms’ likelihood ratio shaped confidence intervals will have the same upper bound. On the other hand, the lower bound on the insensitive alarm will be just below fully ablaze while the lower bound on the sensitive alarm will be just below toast smoke.

So the confidence interval will be narrower and hence more informative for the insensitive alarm. The opposite of what happens with small versus large sample studies in approximately Normal examples (where it is fine to use Wald shaped confidence intervals).

Alternatively, if one assumes both alarms will always sound no matter how intense the blaze then neither will have an upper bound and so both will have infinite width. But the insensitive alarm’s interval will be more concentrated and hence more informative.

“I think it’s better to write it as: which observed effect is indicative of a large underlying population or parametric discrepancy (from null). Another useful way to see it (in the example I gave) is to look at the upper bound of a .95 confidence interval. The alpha-level observed difference with the large sample size is actually ruling out (statistically) parametric values that are well indicated with the smaller.”

But isn’t it true that n = 25 provides less evidence of anything at all than does n = 10,000? Shouldn’t we trust the rejection of values greater than the upper CI of the n = 10,000 estimate more than we do the fact that some of these values are not rejected (i.e., are “well indicated”) with respect to the n = 25 estimate?

If I believe that very large effect sizes are rare and many are small to moderate, then I start to distrust huge effects from small studies (with the likelihood still pretty consistent with no effect) a lot more. Even more so if one suspects that people keep cranking out small studies about any hypothesis, no matter how implausible (if we are so charitable as to assume some degree of pre-specification of the hypothesis or even the analyses), while, given the price tag, large studies are more likely to be conducted about concepts that at least have some above-average up-front plausibility.

Andrew, could you provide the reference for this Heckman citation?

My wife briefly worked as a research assistant in Dr. Heckman’s group a few years ago. I suspect Andrew’s answer is correct… he hasn’t considered that he might be wrong.

People liked working with him, but it was understood that he was getting older, kind of grumpy, kind of demanding. I got the impression not a lot of people would contradict him (he is a Nobel prize winner, after all). I’m not saying he wouldn’t listen to people around him, or shoot people down or anything awful like that… I just got the impression that his collaborators and workers deeply respected him and he could, in a sense, do no wrong.

sounds like just about every other scientist. not many scientists ever report that they were wrong.

I find it perplexing that there seems to be disagreement on the one straightforward point about significance tests that critics and defenders of tests agree on. Maybe I can clarify. Critics have long berated tests because with large enough sample size, even trivial underlying discrepancies will probably produce significance. Some even say the test is only informing us of how large the sample was: Here’s Jay Kadane (2011, p. 438)*:

“with a large sample size virtually every null hypothesis is rejected, while with a small sample size, virtually no null hypothesis is rejected. And we generally have very accurate estimates of the sample size available without having to use significance testing at all.”

The tester agrees with this first group of critics: With large enough sample size, even trivial underlying discrepancies will probably produce significance. Her reply is either to adjust p-values down in dealing with high sample sizes (and there are formulas for this) or, as I prefer, to be clear that the discrepancy indicated by a just significant difference with larger sample size is less indicative of a discrepancy than with smaller. In other words, she says: don’t commit a mountains out of molehills fallacy.
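The claim that trivial discrepancies become "significant" with enough data can be checked directly from the power function of a two-sided z-test. This is a sketch under the usual known-sigma Normal assumptions; the discrepancy of 0.01 is an arbitrary illustrative value.

```python
from statistics import NormalDist

def power_two_sided(delta, sigma, n, alpha=0.05):
    """Probability that a two-sided z-test of H0: mu = 0 rejects when the true mean is delta."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = delta * n ** 0.5 / sigma  # true mean measured in standard-error units
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

# Hypothetical trivial discrepancy of 0.01 (in sigma units) at increasing sample sizes.
for n in (25, 100, 10_000, 1_000_000):
    print(n, round(power_two_sided(delta=0.01, sigma=1.0, n=n), 3))
```

With delta fixed, the rejection probability climbs from roughly alpha toward 1 as n grows, which is the behaviour both the critics and the tester describe.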

We are given that the assumptions hold approximately, else we can’t even be talking about principles for interpreting test results.

Now it seems there’s a second group of critics (perhaps on this blog) who hold that a p-value rejection with a higher sample size is actually stronger evidence of a given discrepancy with the null than with a smaller size. So Kadane (and all who point out the Jeffreys-Lindley result–a p-value with a large enough sample size is actually evidence for the null) are mistaken, according to this second group. Tests, confidence intervals and likelihood appraisals are also wrong.
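For readers who have not seen the Jeffreys-Lindley result, here is a small sketch of it. It assumes a point null mu = 0 against an alternative with mu drawn from a Normal(0, tau^2) prior; tau = 1 is an arbitrary choice made only for illustration. The z-score is held at 1.96 (two-sided p about .05) while n grows.

```python
from math import exp, sqrt

def bf01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: mu = 0 over H1: mu ~ N(0, tau^2), given z = M / (sigma / sqrt(n)).

    Under H0 the sample mean M is N(0, sigma^2/n); under H1 it is N(0, tau^2 + sigma^2/n).
    The ratio of those two densities at the observed M simplifies to the expression below.
    """
    s2 = sigma ** 2 / n
    shrink = tau ** 2 / (tau ** 2 + s2)
    return sqrt((tau ** 2 + s2) / s2) * exp(-0.5 * z ** 2 * shrink)

# Same "just significant" z-score at increasing sample sizes.
for n in (10, 100, 10_000, 1_000_000):
    print(n, round(bf01(z=1.96, n=n), 2))
```

Values above 1 favour the null, so under this (assumed) prior the same p-value moves from mild evidence against the null at small n to evidence for it at very large n.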

Consider observing a sample mean M for estimating (or testing) a normal mean with known sigma = 1. With n = 100, 1 SE = .1; n = 10,000, 1 SE = .01. We observe a 3SE difference in both cases. M = .3 in the first case, and .03 in the second. 95% confidence intervals are around

(.1, .5) n = 100 and

(.01, .05) n = 10,000.
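As a quick check of the arithmetic above, here is a minimal sketch that recomputes the two intervals, using the known sigma = 1 and a 1.96 SE half-width (the function name is just for illustration).

```python
from statistics import NormalDist

def wald_ci(mean_obs, sigma, n, level=0.95):
    """Normal-theory confidence interval for a mean with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sigma / n ** 0.5
    return mean_obs - z * se, mean_obs + z * se

ci_small = wald_ci(0.3, 1.0, 100)       # M = .3, n = 100
ci_large = wald_ci(0.03, 1.0, 10_000)   # M = .03, n = 10,000
print(ci_small, ci_large)
```

The results agree with the rounded intervals quoted: roughly (.10, .50) and (.010, .050).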

Consider the inference: mu > .1. The second group of critics appears to be saying that this inference, mu > .1, is better indicated with M = .03 (n = 10,000) than it is with M = .3 (n = 100). But it would be crazy to infer mu > .1 with observation M = .03 (n = 10,000)**, while it’s sensible to infer mu > .1 with M = .01.

That is why I think the only way for the position of the second group of critics to make sense is to see them as questioning the underlying assumptions needed to compute things like the SE, p-values and confidence coefficients. But then the question of which is the appropriate way to reason from test results can’t even get off the ground.

*Principles of Uncertainty.

**Such a rule would be wrong w probability ~1.

Oops. The next to last paragraph obviously should read: “But it would be crazy to infer mu > .1 with observation M = .03 (n = 10,000)**, while it’s sensible to infer mu > .1 with M = .3 (n = 100)”.

In situations where you see extremely small sample sizes (and hence noisy decisions if you do NHSTs badly) is it really likely that you’ll also see sample sizes at the “Kadane limit”?

That is, aren’t the two criticisms attacking different ends of the same problem (rather than criticising each other)?

> In situations where you see extremely small sample sizes (and hence noisy decisions if you do NHSTs badly) is it really likely that you’ll also see sample sizes at the “Kadane limit”?

> That is, aren’t the two criticisms attacking different ends of the same problem (rather than criticising each other)?

Agree (I think).

I’m worried about pretending you have a _large enough_ sample size when you are in fact studying a highly variable phenomenon, i.e. I’m worried about being in the

p >> n

regime (where p is number of parameters not a p-value).

Also about assuming a model form, e.g. Normal, when you only have, say, a single sample value available. In what real sense does this actually make sense? Where is this choice of model ‘costed’ in terms of sample size? If it isn’t, then model choice artificially imports free information.

It isn’t enough to me to just say ‘assuming the model assumptions are met’. You can’t tell with a sample size of 2, say! Model assumption validity (really, to me, convergence of Pn to P) should be taken into account in sample size discussions.

So, in sum, if you are studying a social science phenomenon and looking at many possible degrees of freedom, noisy measurements etc etc then I highly doubt we are really in the n > p regime where we can enjoy the fun of contorting significance levels back into interpretable ‘effect size’ estimates for different n values (all > p).

I was up most of the night writing a test, which is my excuse for various missing words and rapidly degenerating English skills…

> Consider the inference: mu > .1

The p-values calculated with the null hypothesis mu=0 are useful for inference about mu>0. M=0.3 (n=100) and M=0.03 (n=10000) are both 3-sigma events relative to mu=0 (one-sided p-value 0.001).

If we are interested in the inference mu>0.1 we should be using the null hypothesis mu=0.1. M=0.3 (n=100) and M=0.03 (n=10000) are in no way comparable for inference relative to mu=0.1. I think everyone on this blog will agree on that.
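A short sketch of that comparison, under the same known-sigma Normal setup as the example being discussed: the two observations give the same z-score against mu = 0 but very different ones against mu = 0.1.

```python
from statistics import NormalDist

def one_sided_test(mean_obs, mu0, sigma, n):
    """Return (z, p) where p = P(M >= mean_obs) under H0: mu = mu0."""
    z = (mean_obs - mu0) / (sigma / n ** 0.5)
    return z, 1 - NormalDist().cdf(z)

cases = {"M = .3, n = 100": (0.3, 100), "M = .03, n = 10,000": (0.03, 10_000)}
results = {}
for label, (m, n) in cases.items():
    for mu0 in (0.0, 0.1):
        results[(label, mu0)] = one_sided_test(m, mu0, 1.0, n)
        print(label, "| null mu =", mu0, "| z, p =", results[(label, mu0)])
```

Against mu = 0 both cases are 3 SE events; against mu = 0.1 the first is 2 SE above the null and the second is 7 SE below it.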

Mayo:

I think folks are saying many different things such as

1. Look at confidence intervals instead of p_values – why not?

2. If folks are making qualitative distinctions between confidence intervals that do or don’t cross zero and focusing just on those that do – they shouldn’t have – but to make sense of what they have done you need to condition on the selection.

3. Confidence intervals should (according to David Cox and many others) be based on likelihood ratio shape and not Wald based when likelihoods are not quadratic.

n?

I don’t see the contradictions you are pointing out coming there but rather only when forced to only use the p_values or use your severity assessment for the inference: mu > .1?

> Consider the inference: mu > .1. The second group of critics appears to be saying that this inference, mu > .1, is better indicated with M = .03 (n = 10,000) than it is with M = .3 (n = 100). But it would be crazy to infer mu > .1 with observation M = .03 (n = 10,000)**, while it’s sensible to infer mu > .1 with M = .3.

If we are trying to draw inferences about the true value of mu (and we all agree that it’s sensible to talk about “the true value of mu”), and we’ve observed M = .3 (n=100) and M = .03 (n=10,000), then we shouldn’t infer that mu > .1.

M = .3 (n=100) may well provide stronger evidence of mu > .1 than does M = 0.03 (n=10,000), but the latter provides stronger evidence that mu is close to 0.03 than the former provides evidence for any value of mu.

I’m pretty sure it’s intended that there is a mu1 with n1 = 100 samples from the relevant population and a mu2 with n2 = 10,000 samples from that relevant population.

This example is just silly… I wonder if you even understand that your result is a mere algebraic artifact.

This thread is a great example of why “machine learning” is replacing “statistics” for all practical purposes (ie things other than publishing exciting discoveries in academic papers, for which NHST is clearly great). You shouldn’t be checking for statistically significant differences to begin with, so it is just as relevant as a discussion about “how many angels dance on the head of a pin”, or “did the father begat the son or was the son begotten by the father”.

+1/2

Machine learning also has some issues with chasing noise but at least they generally agree that more data is better than less!

Definitely, it is amazingly simple to chase noise with machine learning techniques, leak info from training into test sets, etc. However, the standard practices there are to use as much info as possible (not remove arbitrarily deemed “non-significant” features for no reason) and assess success via predictive skill on fresh data. This makes a huge difference.

Agreed. I did a little side consulting recently. Guided by what worked best, I ended up taking a far more machine learning approach than I usually would. I was impressed but still thought about it a bit more, read some theory etc.

Now I’m pretty much converted. Luckily that world is far less dogmatic too so I didn’t have to ‘convert’ to a one size fits all approach (contra all hypothesis testing or all Bayes or whatever). There’s also a lot more interesting theory there than I realised. Weirdly, it made me appreciate the benefits of some parts of ‘frequentist’ theory much more, though more ’empirical estimation’ than ‘inferential error probabilities’.

Anoneuoid, ojm,

It’s been my impression that ML techniques can most comfortably be utilized in situations (similar to statistics) where there is an underlying theory of the data generating mechanism. In the other circumstances, good performance on the training, validation, and test sets is likely as not to be chasing noise (unless the dataset is truly close to a representative sample from some collective, which is usually far from the case if only due to variance of the data generating mechanism through time)

Would you agree or am I missing some theoretical or practical reason why ML can overcome overfitting without substantive theory?

> most comfortably be utilized in situations (similar to statistics) where there is an underlying theory of the data generating mechanism.

I wouldn’t really agree with this – I’d say it’s most suited to situations where you don’t have a good theory and hence are willing to replace theory with less well understood relationships.

> unless the dataset is truly close to a representative sample from some collective

This is probably more important.

In general, ML usually relies on having a well-defined _task_ and feedback mechanism for learning about this task. Practice (train) how you play (test) and all that.

Think about eg learning to recognise people in photos, control a robot to reach a goal, play video games etc. More like emergent understanding via evolutionary-style learning than a priori imposed rules. But sure, no feedback and it will drift off to who knows where.

To mix the various analogies above, there should be sufficient data to reach near ‘equilibrium’ with the environment. If the environment changes, hopefully it can adapt fast enough, yet in a stable way (eg an ‘online’ learning scenario). Bias-variance and all that, really.

Typically you hope there is some lower-dimensional representation of higher dimensional data, you just don’t know or impose what it is. Instead you let it ’emerge’ and, possibly, adapt if the data/scenario changes.

ojm: “I wouldn’t really agree with this – I’d say it’s most suited to situations where you don’t have a good theory and hence are willing to replace theory with less well understood relationships.”

This makes good sense. However, that doesn’t preclude information about the data generating mechanism from being beneficial in both cases (ML and statistical).

The issue I have with the typical discussion of the bias-variance trade-off is that it is usually framed in the context of the difference between performance on the training/test sets. My point is that you could get good predictions on both the training and the test set but be entirely inaccurate on future predictions if the original sample was not representative of subsequent data. And without a theory of the data generating mechanism, that’s a hard thing to know when you do the original analysis.
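A toy sketch of that worry, with everything in it assumed for illustration: the true relationship is y = x² plus noise, the model is a straight line, and the training and test sets both come from x in [0, 1] while the "future" data come from x in [2, 3].

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def mse(line, xs, ys):
    slope, intercept = line
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)

def draw(lo, hi, k=200):
    """Hypothetical data generator: y = x^2 plus a little Gaussian noise."""
    xs = [rng.uniform(lo, hi) for _ in range(k)]
    ys = [x ** 2 + rng.gauss(0, 0.05) for x in xs]
    return xs, ys

train, test, future = draw(0, 1), draw(0, 1), draw(2, 3)
line = fit_line(*train)
train_mse, test_mse, future_mse = mse(line, *train), mse(line, *test), mse(line, *future)
print(train_mse, test_mse, future_mse)
```

Train and test errors agree closely because both sets are drawn the same way; nothing in that agreement warns about the much larger error once the inputs move outside the sampled range.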

If I understand your analogies correctly, you are using examples where ML algorithms are updated as more data becomes available and therefore the models become more accurate through time despite a lack of theoretical understanding of the data generating process. In other words, as the collected data becomes less sparse across some unknown parameter space ML algorithms can more or less accurately tease out stable relationships. I agree with this but it again involves having a training set that is representative. Knowing when this is the case without a theory behind the data is pretty much impossible; which becomes problematic when you have to trust the predictions of the model.

Sure, though I’m increasingly skeptical that ‘the data generating mechanism’ is a meaningful phrase.

And prediction is hard, especially about the future and especially with bad data etc etc.

BTW I come from a theory-first mathematical modelling background so I like the complement of data-driven ML approaches that do one simple thing well and can be used alongside theory if you want.

I have to admit that I’m not sure I really trust ML folk who haven’t originally come from modelling backgrounds, even though those who do probably don’t need to explicitly use it anymore. Something about Wittgenstein’s ladder, I guess.

And what do you mean by ‘theory’ exactly? Someone like Vapnik has plenty of theory, it’s just generally not in the form of ‘literal model of thing’.

I think ‘literalist’ modelling is a trap too many fall into. ‘More realistic’ or ‘more literal representation’ doesn’t always mean better ‘model’.

Often quite abstract ideas can be very powerful. ML doesn’t necessarily have less theory it just has less literalist modelling. I find being willing to consider non-literal models surprisingly helpful when dealing with many complex problems.

…last follow up…

I think ML encourages operational or algorithmic or abstract thinking which is a different way of thinking than more ‘realist’ approaches.

While realist thinking can be useful I think it can also hinder people. Eg when learning mathematics you can worry about what something ‘really is’ (complex numbers, irrational numbers, infinity etc) or you can worry about what they ‘do’. The latter is usually easier to get to grips with than the former.

I think we are in agreement on the broad strokes. If you think the only way to make accurate predictions (or inferences) is to have a model that exactly replicates the phenomenon under study then you’re unlikely to be successful; abstractions are an essential component of mathematical modeling.

I come from an engineering background so when I say theory I don’t necessarily mean that to imply an understanding of exactly what happened to produce the data (I also don’t not mean that either); engineers are supposed to be practical people after all! What I usually settle for is some moderate understanding of the mechanism that produced the data so that I may furnish reasonable lower/upper bounds on the parameters, predictions, etc. so that I can either a) constrain the model or b) know when it’s getting a bit wonky or c) look at changes in the real life phenomenon for which I collect data and have some understanding if the changes will have an impact on the model predictions that I should be concerned about (prior to it blowing up in my face).

I think Keith’s links down below touch on this.

Yeah I think we broadly agree too…but that’s no fun :-)

Generally speaking I think I advocate more ML for modelling folks and more modelling for ML folks. I haven’t really reached a consistent opinion! But I do try to be open to new ways of doing things. I don’t like the ‘oh that’s just secretly Bayes’ or whatever attitude. Maybe it’s something else, ya know?

Speaking of engineering, I think machine learning fits pretty seamlessly into an engineering curriculum. Traditional stats including Bayes less so. The good thing is these students also get a large amount of modelling too. So they can work well together.

Related to present discussion and many themes of this blog, just saw this (haven’t read properly):

http://pilab.psy.utexas.edu/publications/Yarkoni_Westfall_PPS_in_press.pdf

‘Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning’

Ojm:

I read the paper you link to. I agree with them on the substance, but I disagree with the framing in which prediction is opposed to explanation. I think predictions work better in the context of good explanatory models, and that explanatory models are predictive.

I agree that they _can_ be complementary but I think there is also a good case that modelling for prediction and modelling explanation can be a tradeoff.

I’m often surprised when people justify their inferential parameter estimation methods on the basis of ‘well the ultimate goal is prediction’. If that really is the case then you can often bypass many steps of traditional parameter estimation.

So I tend to think both explanation and prediction are good but that they are not identical and can often conflict. Of course this gets into many philosophical issues what exactly these terms mean etc etc but I’ve found the heuristic helpful.

Allan: I agree with you more than ojm, but ojm has a point you may wish to more directly deal with.

“ML encourages operational or algorithmic or abstract thinking which is a different way of thinking than more ‘realist’ approaches” is, I think, true. The way I have seen that in my career is folks use/prefer favored algorithmic choices – absolute norm (Tibshirani), neural analogies (Hinton), biological analogies (whoever thought of dropout for deep learning), etc. They happen upon procedures that have very good properties (relative to other approaches) but with little sense of why, how to improve them and when they will fail.

Then folks discern how these same things could be seen as (or close to) what data generating models would set out – Laplace priors for Lasso (me in one of the first seminars Tibshirani gave on the Lasso), Gaussian processes for neural nets (Radford Neal), Bayesian priors for drop out (Yarin Gal 2016), etc.
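The Laplace-prior reading of the Lasso can be seen in the simplest possible case. The sketch below assumes a single observation y ~ Normal(theta, 1) and a Laplace prior on theta with rate `lam` (both names are mine, for illustration). The negative log posterior is then the squared error plus an L1 penalty, and its minimiser is the familiar soft-threshold rule; a brute-force grid search is used only to confirm the closed form.

```python
def neg_log_posterior(theta, y, lam):
    """-log posterior up to a constant: Normal(theta, 1) likelihood + Laplace(rate=lam) prior."""
    return 0.5 * (y - theta) ** 2 + lam * abs(theta)

def soft_threshold(y, lam):
    """Closed-form minimiser of the L1-penalised objective above."""
    sign = 1.0 if y > 0 else -1.0
    return sign * max(abs(y) - lam, 0.0)

def grid_map(y, lam, lo=-5.0, hi=5.0, steps=100_001):
    """Brute-force MAP estimate on an evenly spaced grid (spacing 1e-4 with the defaults)."""
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return min(grid, key=lambda t: neg_log_posterior(t, y, lam))

for y in (2.0, 0.3, -2.0):
    print(y, soft_threshold(y, 0.5), round(grid_map(y, 0.5), 4))
```

The match is between the Lasso estimate and the posterior mode only; the rest of the posterior under a Laplace prior is a separate matter.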

So we have a larger community zigzagging from happened-upon but not understood algorithms that work surprisingly well to then discerning Bayesian models that would imply similar algorithms that are better understood, can be tweaked to be less wrong, and come with some sense of when they will fail.

My favorite talk that describes doing that is by Mike West https://www.youtube.com/watch?v=oBYoPtEHzTE&list=PL3T2Ppt4bgDJBiGZlan-qNY6PsLOGXdAB&index=3

A shorter talk would be this one on the History of Bayesian Neural Networks by Zoubin Ghahramani https://www.youtube.com/watch?v=FD8l2vPU5FY (first 10 minutes should do).

Thanks Keith. I don’t have time at the moment to get into these but I will have a gander tonight.

Daniel (re 11:30 am post): Good points

Hi Keith,

See above for a comment on the ‘oh that’s just Bayes’ approach. Not that it’s not necessarily a useful exercise it’s just that I’m often interested in thinking through things from a different perspective. I think it’s that I don’t find Bayes to be the appropriate general framework so much anymore.

ojm, I’m still waiting for you to write up a demonstration where profile likelihood is clearly better than Bayes…

Haha, well likelihood has its flaws too. I wouldn’t say likelihood or Bayes are sufficient frameworks.

But I have an example with an ODE model and real data that I’ve been meaning to look at properly at some point. Will let you know when I get around to it.

I get the feeling that, by whatever criteria you are using, there are no sufficient, general frameworks :)

ojm: “I don’t like the ‘oh that’s just secretly Bayes’ or whatever attitude. Maybe it’s something else, ya know?”

OK, but then say what it is. Until you can do that, I don’t see what’s not to _like_ about it could be recast as this Bayesian model which provides more ways to view something?

On the whole, this discussion reminds me of the ongoing arguments between David Cox (models) and Brian Ripley (just prediction) when I was at Oxford. It was a year later at John Nelder’s festschrift that Ripley publicly admitted it was hard for him to grasp the additional value of having a model – but now he did. Maybe he was wrong.

Keith: I think the area of study makes a big difference. For example, one area that machine learning has done well in is product recommendations. What is the utility of knowing *why* people at time t prefer headphones X vs headphones Y, particularly when both products’ entire lifecycle is about 8 months? Developing N causal models for N recommendation problems whose lifetime of utility is 3-12 months is pretty clearly a waste of time.

Another area where machine learning has done well is things like visual recognition algorithms in autonomous vehicles. In the end, I’d rather have lower observed risk of accidents in a wide variety of test conditions than understand *why* an algorithm is better able to discern the difference between a stop sign and a yield sign.

Sure, if understanding some ML technique as a special case or approximation of some general Bayesian technique helps us design better techniques, then fine. But I think there is a place for pure predictive model free inference. It’s just that this case is almost entirely as *an engineering exercise* rather than as a *scientific inquiry method*.

When your goal is “optimize some well defined engineering objective function in any way that works” then… that’s your goal and it makes sense to carefully use whatever tool gets you that result. (and I say carefully because you’d better be testing in a *wide* variety of real-world conditions)

On the other hand, if your objective is to understand fundamental aspects of the causal structure of a problem, such as *why* certain types of mutations cause certain types of cancer, then ML doesn’t address that question directly, and if you’re going to use it, it can be useful to know that it’s really a fast computational approximation of some general Bayes model or the like and then try to extract the approximate underlying Bayesian inference.

Daniel: ” area of study makes a big difference”

Agree, I sometimes argue a randomized study would be hopeless if, for instance, the world changes faster than such studies could learn about it.

“special case or approximation of some general Bayesian technique helps us design better techniques”

That is the bigger point, it’s the ensemble of future predictors and predictions we want to be less wrong about.

“visual recognition algorithms in autonomous vehicles. “

But someone has to regulate those, approve as being safe enough, and doing that with a black box approach with unknowable fail behavior – that has some folks concerned.

> OK, but then say what it is

Machine learning?

> don’t see what’s not to _like_ about it could be recast as this Bayesian model which provides more ways to view something?

Because I suspect that this itself is just one limited way to view things, obscured by the idea that having a correspondence to some Bayesian model under some particular circumstances means it ‘really is’ Bayesian.

Maybe Bayes is a special case of machine learning, valid under certain special circumstances?

Speaking of the ‘safety’ issue, I don’t like Bayes’ over-reliance on simple parametric models and, even worse, density-based inference. This isn’t safe either*.

Procedures can of course be both mathematically and empirically studied. You don’t need to convert it to a Bayesian model to prove a theorem about an algorithm!

(*PS just switching to Gaussian processes, as seems to be a trend, doesn’t really suffice to address these concerns imo either.)

> It’s just that this case is almost entirely as *an engineering exercise* rather than as a *scientific inquiry method*.

Agreed. Side point – I once read an interesting philosophical book on the engineering method called ‘Discussion of the Method’ by Billy Van Koen. Makes an interesting case for its universality, which is somewhat relevant to the general philosophical discussion here.

Correction – Billy Vaughn Koen

ojm, what do you mean by “density-based inference”? Computation of expectations wrt posterior distribution? If so, that’s a rather misleading formulation…

How do you define Bayes’ theorem?

Also, expectations are not sufficient to uniquely determine posteriors (definitely not given finite samples) hence the need for something like ‘typical set’ sampling, which as far as I can tell is not a Bayesian concept as such.

Point being, as far as I am aware Bayesians define Bayes’ theorem via densities (in infinite dimensions, Radon-Nikodym derivatives).

They then say ‘oh I’m only interested in expectations over this density’. But either a) they assume ‘this density’ is something that actually exists, in which case ‘density-based inference’ seems reasonable, or b) they actually only want some sort of expectation and don’t care about different densities that give the same expectation. But since this doesn’t uniquely define what posterior they are actually interested in the problem appears to me to be ill-defined. Why not tackle the problem more directly without the Bayesian detour?

(Additional criteria like ‘typical set’ sampling are added without a Bayesian justification implicitly, in my view, to make the problem better posed).

ojm, Bayes’s theorem as usually stated is on probabilities of events; the density version of Bayes’s theorem can be derived from the event space version if the Radon-Nikodym derivative of the probability measure exists. So if there’s a problem to be found in the Bayesian approach, I feel that it cannot be merely that it operates on densities because it only does so if the Radon-Nikodym derivative exists. When the Radon-Nikodym derivative doesn’t exist (as is typically the case in Bayesian nonparametrics) it just means that the posterior probability measure can’t be defined via the density form of Bayes’s theorem, not that it can’t be defined at all.

ojm, clearly there is more at work here than either of us really are interested in dissecting at the moment. Whether formulated more precisely as probability measures (as Corey is saying) or summarized offhand as ‘density-based inference’ why is this especially problematic for you? It seems like you are saying that the Bayesian premise – use of probability to quantify uncertainty – is not generally valid. That seems like a rather extraordinary claim in general. But maybe I’m just not going to get it ;)

Corey:

I should say I’m talking about continuous and/or infinite-dimensional problems. I’m actually curious – I’ve only seen infinite-dimensional problems tackled via RN derivatives. Can you point me to a good reference tackling it in other ways?

Chris, re:

> the Bayesian premise – use of probability to quantify uncertainty – is not generally valid. That seems like a rather extraordinary claim in general.

It depends what ‘not generally valid’ means.

Probability can quantify _some_ types of uncertainty, sure, so I am certainly not saying it is invalid in all cases.

But to say that it quantifies _all_ types of uncertainty seems to me like an extraordinary claim, and one that has been hotly contested (successfully imo) by many people.

Yes I mean quantifying uncertainty in the parameters of statistical models. Contested, sure- e.g. Fisher ;) But it is clearly possible and successful in many applications.

Chris, this has probably reached the point of not being productive any more but…

> ‘quantifying uncertainty in parameters of statistical models’

OK, but even Andrew refuses to use probability to quantify the uncertainty _of the models themselves_. Given this is usually a massive source of ‘uncertainty’, this is a good example of a case where the role of probability is contested, even by Bayesians.

Furthermore, Neyman himself said, for his brand of Frequentist inference, that his aim was

“to construct a theory of mathematical statistics independent of the conception of likelihood…entirely based on the classical theory of probability”

So even what it means to ‘quantify uncertainty in parameters of statistical models’ using probability is somewhat ambiguous (or uncertain…).

BTW – my point about density-based inference was a mathematical one…I’m open to correction but I honestly have looked and haven’t yet seen a good Bayesian response, while I have seen a few people abandon Bayes on this basis…

Also, how do you feel about quantifying the uncertainty in a model parameter such as the mean of a Normal distribution on the basis of a single sample (as in Andrew’s ‘new favourite example’)?

I agree uncertainty among models brings us to some deep philosophical waters…another time. As to estimating mean of normal with one observation – I view that as a corner case for showing how sensitive inference is (and should be!) to prior/external information, when data are very limiting. I don’t think posterior distributions should be taken literally like some fundamentalists claim to read the Bible – rather, the inference is relative to the modeling assumptions…

ojm: I’m not sure what you mean by “typical set sampling”. All the sampling methods are MCMC, and it’s just a mathematical fact that basically all the probability mass is in the typical set and so that’s where MCMC spends its time.

Why do we only compute expectations? Because in say 2000 dimensions, even just one sample on each side of the median in each dimension is 2^2000 samples.

yet, we can get good expectations with 20 or 100 or in extreme cases maybe 1000 effective samples.
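A small sketch of why a hundred or so draws can be enough for expectations even in 2000 dimensions. It uses independent standard Normal draws as a stand-in for "effective samples" from a posterior; that substitution is my assumption, since real MCMC output is correlated.

```python
import random
from statistics import mean

rng = random.Random(0)
dim, draws = 2000, 100

# 100 independent draws from a 2000-dimensional standard Normal (true mean is 0 everywhere).
samples = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(draws)]

# Monte Carlo estimate of each coordinate's expectation, and its typical error.
coord_means = [mean(s[j] for s in samples) for j in range(dim)]
rms_error = (sum(m * m for m in coord_means) / dim) ** 0.5
print("RMS error of the estimated means:", rms_error)
print("1 / sqrt(draws):                 ", 1 / draws ** 0.5)
```

The per-coordinate error tracks 1/sqrt(draws) and does not depend on the dimension, whereas filling in the joint distribution itself would need a number of draws that grows exponentially with it.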

Daniel – I mean focusing on sampling the typical set rather than the full high probability set, which as we’ve discussed are not quite the same thing.

More generally, taking advantage of probabilistic convergence results with increasing sample size seems to me pretty frequentist, and/or ’empirical inference’ in the Vapnik etc style..

As far as I am aware none of the convergence results/concentration of measure/ whatever are very Bayesian in spirit. They introduce explicit sample size, asymptotics etc which seem to do most of the work and have nothing to do with logical or a priori probabilistic modelling (which ironically usually assume an _exact_ model of uncertainty)

ojm: it’s just focusing on sampling the distribution. In high dimensions samples aren’t in that little corner of the high probability set at the mode, because the volume there is too small, that’s all. No one’s actually keeping the sampler out of that region, it just won’t ever get there because of the math.
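This is easy to see numerically. The sketch below assumes a standard Normal target in `dim` dimensions, so the mode is at the origin, and records how far each independent draw lands from it.

```python
import random
from math import sqrt

rng = random.Random(0)

def radii(dim, draws=200):
    """Distance from the mode (the origin) for independent standard Normal draws."""
    return [sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(dim))) for _ in range(draws)]

for dim in (1, 10, 1000):
    r = radii(dim)
    print(dim, "min:", round(min(r), 2), "max:", round(max(r), 2), "sqrt(dim):", round(sqrt(dim), 2))

r_high = radii(1000)
```

In one dimension plenty of draws land near the mode; in 1000 dimensions every draw sits in a thin shell at a distance of about sqrt(1000), even though nothing in the sampler forbids the region near the origin.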

When it comes to relying on mathematical probability while doing MCMC sampling, this is because Bayesian probability as quantification of scientific uncertainty ends when you write down the distribution you want to use. In the same way that PDEs as a description of a scientific process involving momentum and energy and whatever ends after you write down the equations and the boundary conditions. After that it’s mechanical calculation that becomes the issue.

As far as mechanical calculation goes, in PDEs we have finite elements or finite volume or finite difference equations or whatever, and in Bayesian stats we have MCMC of some sort. Both are designed just as pure mechanical deduction… this equation and these boundary conditions imply this new state at this later time… or similarly, this density over the parameters implies this chain of samples converges to give the right expectations. Numerical analysis of difference equations isn’t scientific modeling of the behavior of physical objects, and MCMC sampling isn’t scientific analysis of the uncertainty in models and data… but both are techniques for calculation that apply to their individual purposes.

I think we need to make the distinction between focusing on large sample size of scientific data, vs focusing on large sample size of MCMC. Large sample size of MCMC is more or less like using a small step size for an ODE solver. It just makes the calculation more accurate, it doesn’t help you get closer to the true science if your model is wrong.
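To make the "small step size" analogy concrete, here is a minimal random-walk Metropolis sketch. Everything in it is an assumption for illustration: the target is a standard normal standing in for "the distribution you wrote down", and the proposal width and function name are made up. Longer chains pin down the model's own E[x^2] = 1 more accurately, and that is all they do.

```python
import math
import random


def metropolis_second_moment(n_steps, step=2.5, seed=1):
    """Estimate E[x^2] under a standard normal target with
    random-walk Metropolis and a uniform proposal."""
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)
        # Log of the target density ratio N(proposal) / N(x).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        total += x * x
    return total / n_steps


for n in (100, 10_000, 200_000):
    print(n, round(metropolis_second_moment(n), 3))
```

If the real-world quantity had, say, variance 4, no chain length would move the estimate toward 4. The chain only computes what the model implies.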

On the other hand, on focusing on large sample size for data collection, I think I agree with you. When you are in fact imposing a sufficient random structure on the problem (e.g. using a computer RNG to sub-sample a finite, well-defined population of things), then large sample size helps you. But when you’re working with non-randomly-selected samples you shouldn’t be focusing on CLT-type theorems, because the non-randomness of your sample violates their assumptions. Then you’re stuck using an appropriate Bayesian model that acknowledges this. For example, in polling Trump vs Hillary there should have been a model structure that acknowledged an unknown bias in the polling, and made some prior guesses about its possible size. Instead, relying on a CLT-type result, you get polls that decide that Hillary has a 95% chance of winning… If the results were truly from random samples with 100% compliance etc., it would have been true. But they weren’t.
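Here is a toy simulation of that failure mode. The numbers are invented for illustration (48% true support, supporters answering the pollster 60% of the time and everyone else 50% of the time), and the interval is the usual naive CLT one that ignores the response bias:

```python
import math
import random

TRUE_SUPPORT = 0.48          # hypothetical true share in the population
RESPOND_IF_SUPPORTER = 0.60  # hypothetical response rates
RESPOND_IF_OTHER = 0.50


def run_poll(n_responses, seed=0):
    """Collect n_responses answers under differential non-response and
    return (estimate, lower, upper) for a naive 95% interval."""
    rng = random.Random(seed)
    yes = 0
    got = 0
    while got < n_responses:
        supporter = rng.random() < TRUE_SUPPORT
        p_respond = RESPOND_IF_SUPPORTER if supporter else RESPOND_IF_OTHER
        if rng.random() < p_respond:
            got += 1
            yes += int(supporter)
    p_hat = yes / got
    half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / got)
    return p_hat, p_hat - half_width, p_hat + half_width


for n in (100, 10_000, 100_000):
    p_hat, lo, hi = run_poll(n)
    print(f"n={n:7d}  estimate={p_hat:.3f}  naive 95% CI=({lo:.3f}, {hi:.3f})")
```

With these made-up rates the responses converge on about 52.6% support, so the larger the poll, the more confidently the naive interval excludes the true 48%. A model with an explicit bias term would instead keep some of that uncertainty no matter how large n gets.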

Daniel, you are right of course about MCMC as a machine for Bayesian inference, rather than the inference itself. I think the meat of the matter here is that ojm just doesn’t like Bayes inference anymore (nor apparently likelihood), for some set of reasons that I don’t fully understand (but have concluded we are unlikely to resolve here :) )

OK one last quick comment. The gist (or a gist) of the argument is that I’m not convinced Bayesian inference is fully well-defined, particularly for continuous and/or infinite dimensional problems.

One way to make a possibly ill-defined problem better defined is to introduce a concrete procedure for computing what you mean.

Eg MCMC for actually sampling ‘the’ posterior. But multiple posteriors yield the same expectations, especially with finite computation power. Unfortunately these can also be arbitrarily far apart in the strong topology.

So what are we actually doing? What is ‘the’ posterior target in general if we only care about expectations? Do we care about strong or weak topologies etc?

So I’m not fully convinced the answer to ‘what are we doing’ is properly defined, though open to convincing. On the other hand I see alternative, more direct methods which relate directly to the actual boring mathematics underlying eg MCMC ‘technology’ and wonder why we don’t just consider those the fundamental concepts instead. Others seem to have done so.

>> OK, but then say what it is

> Machine learning?

So then https://en.wikipedia.org/wiki/Machine_learning ;-)

Maybe https://global.oup.com/ushe/product/discussion-of-the-method-9780195155990;jsessionid=68946F16BBF13953608DB8B898B78870?cc=&lang=en& when I have time!

> OK, but even Andrew refuses to use probability to quantify the uncertainty _of the models themselves_.

“Even Andrew”? I’m not sure how to interpret that, in particular after your remarks a few weeks ago about his attitude to Bayesian foundations.

> how do you feel about quantifying the uncertainty in a model parameter such as the mean of a Normal distribution on the basis of a single sample (as in Andrew’s ‘new favourite example’)?

How should we feel? Two samples are one single sample after another, etc. If I have a coherent method to update the uncertainty with each single sample, I don’t see where is the problem.
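As a sketch of what "coherent" means here, take the textbook conjugate case, which is my assumption and not something specified in the thread: a normal prior on the mean and a known noise variance. Updating one observation at a time then gives the same posterior as updating on all of them at once. The data values and function names below are hypothetical.

```python
def update_one(prior_mean, prior_var, x, noise_var):
    """Posterior (mean, variance) for a normal mean after one observation."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + x / noise_var)
    return post_mean, post_var


def update_batch(prior_mean, prior_var, xs, noise_var):
    """Posterior (mean, variance) after seeing all observations at once."""
    post_var = 1.0 / (1.0 / prior_var + len(xs) / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(xs) / noise_var)
    return post_mean, post_var


data = [1.3, -0.4, 2.1]      # made-up observations
mean, var = 0.0, 10.0        # made-up prior
for x in data:
    mean, var = update_one(mean, var, x, noise_var=1.0)

print("sequential:", mean, var)
print("batch:     ", update_batch(0.0, 10.0, data, noise_var=1.0))
```

After the first observation alone the posterior is perfectly well defined, just wide. Whether the assumed normal model deserves to be trusted at that point is the separate question being argued in this thread.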

Fair enough re ‘even Andrew’.

RE the problem and coherence. I think it brings out how unrealistic the concept of ‘a single sample known to be drawn from a particular distribution’ is in practice (and perhaps in principle).

Eg someone like Vapnik would argue, I assume, that we begin from given data not a given model.

There are two reasons to model a distribution as normal: 1) we know it’s a good description of the actual distribution, or 2) we don’t know much about the actual distribution but want to keep the model simple. The Gaussian is the maximum entropy distribution with the first two moments fixed, but of course nothing prevents you from adding higher-order terms (more parameters in the model). I’m not sure what the fundamental problem with a single sample is (apart from the low information content). We agree that statistics is easier (even unnecessary) if you have enough data.

I agree with Daniel about the typical set. That’s just a property of distributions when the number of dimensions grows to infinity. Nothing has to be Bayesian about it, in the same way that there is nothing Bayesian about MCMC or other numerical integration schemes. But if you have a probability distribution you can use MCMC to calculate expectations, and if you have a very-high-dimensional probability distribution you may see the concentration of measure (I still don’t know what makes it interesting, though).

I think these responses miss the point somewhat but as Chris mentions we are unlikely to resolve the issues here. Plus I just landed in Hawaii for a holiday :-)

People write a lot of very complicated stuff about this, but really it comes down to a misunderstanding about the size of the result and the certainty of the result.

It’s perfectly illustrated by comparing the UK Brexit vote, and a typical murder-trial jury result. Brexit was an experiment with over 33 million replicates, the result being 52%/48% split. The chance of this happening by accident (as it would, for example, if the voters had all tossed coins) is infinitesimally small, so the result is very strong indeed, from the point of view of statistical errors.

But think about a murder trial, where 12 jurors arrive at an 11/1 split of guilty/not-guilty.

Which result would you use to convict the defendant: 11/1, or 5,200,000/4,800,000? The result from the small jury is strong in the sense that the effect is large: (nearly) everyone agrees, and there is only one “outlier”. The result from the mega-jury of 10,000,000 jurors is very strong in the sense that it couldn’t happen by chance, but it’s very weak in the sense that the effect is small: regardless of how strongly any individual might hold their views, and the number of people who’ve been asked, it remains indubitably true that nearly as many people think the defendant is innocent as guilty. In this case we’d probably convict based on 11/1 even though this result is statistically more likely to happen by chance than 5,200,000/4,800,000.

Both the certainty of the result and the size of the result matter. We need to be sure the effect is big enough to matter, and that the measurement is reliable enough that we’re not chasing a random blip. Too small an experiment means that the measurement is unreliable. Too small an effect means it’s probably not worth pursuing.

Imagine a 9/3 jury split; this isn’t enough to convict. We wouldn’t convict based on 9/3 because we can’t be sure the result isn’t random. The jurors might conceivably have reached this result by tossing coins. Nor would we convict based on 5,200,000/4,800,000; although the jurors couldn’t conceivably reach this result by tossing coins, it still means that an awful lot of them don’t believe in guilt. Both tests fail to secure a conviction, but for quite different reasons.
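For anyone who wants the coin-tossing numbers behind the jury examples, here is a quick sketch. The function names are mine, the small juries use the exact binomial tail, and the mega-jury uses a normal-approximation z-score because the exact sum is unwieldy.

```python
import math


def coin_tail(n, k):
    """Exact probability of at least k 'guilty' votes out of n
    if every juror tossed a fair coin."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


def z_score(n, k):
    """Standard deviations by which k exceeds the fair-coin mean n/2."""
    return (k - n / 2) / math.sqrt(n / 4)


def margin(n, k):
    """Effect size: share voting guilty minus share voting not guilty."""
    return (2 * k - n) / n


for n, k in ((12, 11), (12, 9)):
    print(f"{k}/{n - k}: margin={margin(n, k):.2f}  "
          f"coin-toss tail={coin_tail(n, k):.4f}")

n, k = 10_000_000, 5_200_000
print(f"{k}/{n - k}: margin={margin(n, k):.2f}  z={z_score(n, k):.1f}")
```

The 11/1 jury has a margin of 0.83 with a coin-toss tail of about 0.003. The 9/3 jury has a margin of 0.50 with a tail of about 0.07. The mega-jury has a margin of only 0.04 but sits roughly 126 standard deviations from a fair coin. These are two different axes, which is the whole point.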

Discussing the relative strengths of “small experiments showing big differences” versus “big experiments showing small differences” is totally impossible (and very unhelpful) because the strengths are totally unrelated and incomparable. How can you say that a jury decision of 5,200,000/4,800,000 is stronger or weaker than 11/1? It’s as useless as saying an elephant is more than a leopard. An elephant is heavier, a leopard is faster.

(in response to https://andrewgelman.com/2017/08/16/also-holding-back-progress-make-mistakes-label-correct-arguments-nonsensical/#comment-552266)

ojm, my (possibly incorrect) understanding is that the infinite Gaussian mixture model doesn’t have a Radon-Nikodym derivative that’s useful for Bayesian inference. For example, I can’t spot a clear use of the density version of Bayes’s theorem in this paper. IIRC the papers I’ve read on it give algorithms for sampling from the posterior measure without defining a posterior density as such. A related example is this paper on the Mondrian process, in which a consistency property is used to give an algorithm for posterior sampling; I can’t spot a posterior density per se or a use of the density version of Bayes’s theorem.

Thanks, will take a look :-)