The house is stronger than the foundations

Posted on October 9, 2017 9:24 AM by Andrew

Oliver Maclaren writes:

Regarding the whole ‘double use of data’ issue with posterior predictive checks [see here and, for a longer discussion, here], I just wanted to note that David Cox describes the ‘Fisherian reduction’ as follows (I’ve summarised slightly; see p. 24 of ‘Principles of Statistical Inference’):

– Find the likelihood function
– Reduce to a sufficient statistic S of the same dimension as theta
– Estimate theta based on the sufficient statistic
– Use the conditional distribution of the data given S=s informally or formally to assess the adequacy of the formulation

Your conception of posterior predictive checks seems to me to be essentially the same
– Find a likelihood and prior
– Use Bayes to estimate the parameters.
– The data enters this estimation procedure only via the sufficient statistics (i.e. ‘the likelihood principle as applied within the model’)
– There is thus a ‘leftover’ part of the data y|S(y)
– You can use this to check the adequacy of the formulation
– Do this by conditioning on the sufficient statistics, i.e. using the posterior predictive distribution which was fit using the sufficient statistics

Formally, I think, the posterior predictive distribution is essentially p(y|S(y)) since Bayes only uses S(y) rather than the ‘full’ data.

Thus there is no ‘double use of data’ when checking the parts of the data corresponding to the ‘residual’ y|S(y).

On the other hand the aspects corresponding to the sufficient statistics are essentially ‘automatically fit’ by Bayes (to use your description in the PSA abstract).

You are probably aware of all of this, but it may help some conceptually.

I personally found it useful to make this connection in order to resolve (at least some parts of) the conflict between my intuitive understanding of why PPP are good and some of the formal objections raised.
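
The split between the ‘automatically fit’ sufficient statistic and the ‘residual’ can be sketched numerically. The example below is an illustration added here, not something from the email. It assumes a normal model with known unit variance and a flat prior on the mean, so that the sample mean is the sufficient statistic and the posterior is Normal(ybar, 1/n). The data are deliberately simulated with standard deviation 3, so the model is wrong in a way the mean cannot see. All function names are made up for this sketch.

```python
import random
import statistics


def posterior_predictive_pvalue(y, stat, n_rep=2000, seed=0):
    """Estimate P(stat(y_rep) >= stat(y) | y) under the assumed model
    y_i ~ Normal(theta, 1) with a flat prior on theta."""
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    observed = stat(y)
    exceed = 0
    for _ in range(n_rep):
        # Posterior for theta under the flat prior: Normal(ybar, 1/n).
        theta = rng.gauss(ybar, (1.0 / n) ** 0.5)
        # Replicated data set drawn from the assumed model.
        y_rep = [rng.gauss(theta, 1.0) for _ in range(n)]
        if stat(y_rep) >= observed:
            exceed += 1
    return exceed / n_rep


def sample_mean(y):
    # The sufficient statistic: the part of the data the fit already used.
    return statistics.fmean(y)


def sample_variance(y):
    # Depends only on the deviations y_i - ybar, i.e. on the 'residual'.
    return statistics.variance(y)


# Simulated data with sd = 3, so the unit-variance model is misspecified.
data_rng = random.Random(1)
y = [data_rng.gauss(5.0, 3.0) for _ in range(50)]

p_mean = posterior_predictive_pvalue(y, sample_mean)
p_var = posterior_predictive_pvalue(y, sample_variance)
print("check on the sufficient statistic (mean):", p_mean)
print("check on the residual (variance):", p_var)
```

Under these assumptions the check on the sample mean returns a p-value near 0.5 whatever the data look like, because the fit has already matched it. The check on the sample variance returns a p-value near 0 and flags the misfit.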

My reply: When I did my formalization of predictive checks in the 1990s, it was really for non-Bayesian purposes: I had seen problems where I wanted to test a model and summarize that test, but the p-value depended on unknown parameters, so it made sense to integrate them out. Since then, posterior predictive checks have become popular among Bayesians, but I’ve been disappointed that non-Bayesians have not been making use of this tool. The non-Bayesians seem obsessed with the uniform distribution of the p-value, a property that makes no sense to me.

The following papers might be relevant here:

Two simple examples for understanding posterior p-values whose distributions are far from uniform

Section 2.3 of A Bayesian formulation of exploratory data analysis and goodness-of-fit testing

Maclaren responded:

It seems to me that a relevant division of non-Bayesians is into something like

– Fisherians, e.g. David Cox and those who emphasise likelihood, conditioning, ‘information’ and ‘inference’. If they are interested in coverage it is usually conditional coverage with respect to the appropriate situation. Quite similar to your ideas on defining the appropriate ‘replications’ of interest.

– Neymanians, i.e. those with a more ‘pure’ Frequentist bent who emphasise optimality, decisions, coverage (often unconditional) etc.

I think the former are/would be much more sympathetic to your approach. For example, as noted I think Cox basically advocates the same thing in the simple case. Lindsey, Sprott etc also all emphasise the perspective of ‘information division’ which I think addresses at least some concerns with double use of data in simple cases.

With regard to having the ‘residual’ dependent on the parameters: presumably there is some intuitive notion here of a ‘weak’ or ‘local’ dependence on the fitted parameters (or something similar)? Or some kind of ‘inferential separation’? Perhaps an unusual model structure?

I’m trying to think of the ‘logic’ of information separation here.

For example, I can imagine a factorisation something like

P(Y|θ) = P(Y|S,α(λ))P(S|λ)

where

P(S|λ) gives the likelihood for fitting λ
P(Y|S,α(λ)) gives the residual for model checking, now depending on λ but via α(λ)

In this case θ = (λ,α(λ)) seems to provide the needed separation, but the two components are not (variation) independent.

So it still makes sense to use your best estimate of λ in model checking to make sure you use a relevant α (i.e. average over λ’s posterior).

Something like a curved exponential model might fit this case.

Just thinking out loud, really.
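
Written out, the idea might look like the following. This is only a sketch of one reading of the factorisation above: the prior p(λ) and the notation p(y^rep | s) for the reference distribution are assumptions added here, not part of the original note.

```latex
% Estimation uses only the S-part of the factorisation:
p(\lambda \mid s) \;\propto\; P(s \mid \lambda)\, p(\lambda)

% Checking uses the residual part, averaged over the posterior for \lambda
% because it still depends on \lambda through \alpha(\lambda):
p(y^{\mathrm{rep}} \mid s) \;=\; \int P\big(y^{\mathrm{rep}} \mid S = s,\ \alpha(\lambda)\big)\, p(\lambda \mid s)\, d\lambda
```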

Me again: Sure, but also there’s all the regularization and machine learning stuff. Take, for example, the Stanford school of statistics: Efron, Hastie, Tibshirani, Donoho, etc. They use what I (and they) would call modern methods, which I think of as Bayesian and they think of as regularized likelihood or whatever, but I think we all worship the same god even if we give it different names. When it comes to foundations, I’m pretty sure that the Stanford crew think in a so-called Neyman-Pearson framework with null hypotheses and error rates. There’s no doubt that they’ve had real success, both methodological and applied, with that false discovery rate approach, even though I still find it lacking: to me it’s based on a foundation of null hypotheses that is, in my opinion, worse than rickety.

In any case, I have mixed feelings about the relevance of posterior predictive p-values for these people. I would definitely like them to do some model checks, and I continue to feel that some posterior predictive distribution is the best way to get a reference set to use to compare observed data in a model check. But I think more and more that p-values are a dead end. I guess what I’d really like of non-Bayesian statisticians is for them to make their assumptions more explicit, to express their assumptions in the form of generative models, so that then these models can be checked and improved. Right now things are so indirect: the method is implicitly based on assumptions (or, to put it another way, the method will be most effective when averaging over data generating processes that are close to some ideal) but these assumptions are not stated clearly or always well understood, which I think makes it difficult to choose among methods or to improve them in light of data.

I’ve been thinking this a long time. I have a discussion of a paper of Donoho et al. from, ummm, 1990 or 1992, making some of the above points (in proto-fashion). But I don’t think I explained myself clearly enough: in their rejoinder, Donoho et al. saw that I was drawing a mathematical equivalence between their estimators and Bayesian priors, but I hadn’t been so clear on the positive virtues of making assumptions that can be rejected, with that rejection providing a direction for improvement.

Maclaren:

There’s a lot here I agree with of course.

And yes, the cultures of statistics, and quantitative modeling generally, are pretty variable and it can be difficult to bridge gaps in perspective.

Now, some more overly long comments from me.

As some context for the different cultures aspect, I’ve bounced around maths and engineering departments while working on biological problems, industrial problems, working with physicists, mathematicians, statisticians, engineers, computer scientists, biologists etc. It has of course been very rewarding but the biggest barriers are usually basic ‘philosophical’ or cultural differences in how people see and formulate the main questions and methods of addressing these. These are much more entrenched than you realise until you try to actually bridge these gaps.

I wouldn’t really describe myself as a Bayesian, Frequentist, Likelihoodist, Machine Learner etc, despite seeing plenty of value in each approach. The more I read on foundations the more I find myself – to my surprise, since I used to view them as old-fashioned – quite sympathetic with Fisher, Barnard, Cox, Sprott, Barndorff-Nielsen etc. In particular on the organisation, reduction, splitting, combining etc of ‘information’ and the geometric perspective on this.

Hence me trying to understand PPC from this point of view. I think the simple point that you can for example base estimation on part of the data and checking on another part, and in simple cases represent this as a factorisation, clears a few things up for me. It also explains some (retrospectively) obvious results I saw when using PPC eg the difference between checks based directly on fitted stats vs those based on ‘residual’ information.

But even then I have plenty of disagreements with the Fisherian school, and would like to see it extended to more complex problems. Bayes in the Jeffreys, Jaynes vein is of course similar to this ‘organisation of information’ perspective, but I find Jaynes in particular tends to often make overly strong claims while ignoring mathematical and philosophical subtleties. Classic physicist style of course!

(I started writing some notes on a reformulation of this sort of perspective in terms of category theory but I doubt I’ll ever finish them or anyone would read them if I did! Sander did, surprisingly, offer some encouragement on this – the DAG people are probably more open to general abstract nonsense in diagram form!).

RE: The Stanford school. Yes, they seem a somewhat strange mix of decision theory, optimisation and function approximation. (Though again, not an unfamiliar mix to me – I spend a fair amount of time around operations research people, and minored in it in undergrad.) Everything is rewritten as an optimal decision problem. And yes, the statistical aspect seems to come from NP origins.

The models are often implicit while primary focus is given to the ‘fitting procedure’. And to them, likelihood is mainly just an objective function to be maximised to get estimators to evaluate Frequentist style.

(Of course this connects to the two big misconceptions about likelihood analysis from both Bayes and Freq: one, that it’s just for getting Frequentist estimators, usually via maximisation; two, that it can’t handle nuisance parameters systematically.)

Bayes of course tends to blend modeling and inference. Both have pros and cons to me – I think there is benefit to separating a model from its analysis (think for example finding weak solutions to differential equations) while there is also benefit in seeing this in turn as a modified model (think for example rewriting a differential equation as an integral equation – again leads to weak solutions, but from a more model-based perspective).

Some people love to think in terms of models, some in terms of procedures. This is a difficult gap to bridge sometimes, particularly between eg stats/comp sci vs scientists. I think the idea of ‘measurement’ is important here. ‘Framework theories’ like quantum mechanics and thermodynamics provide a good guide to me, but of course there is no shortage of arguments over how to think about these subjects either!

In terms of p-values for model checking, I definitely prefer graphical checks. In terms of Frequentist parameter inference I differ from you I think in that I see value in seeing confidence intervals as inverted hypothesis tests. I prefer however to see them as something like an inverse image of the data in parameter space rather than as a measure of uncertainty or even as a measure of error rates.

Me: I think it’s great when people come up with effective methods. What irritates me is when people tie themselves into knots trying to solve problems that in my opinion aren’t real. For example there’s this huge literature on simulation from the distribution of a discrete contingency table with known margins. But I think that’s all a waste of time because the whole point is to compute a p-value with respect to a completely uninteresting model in which margins are fixed, which corresponds to a design that’s just about never used. For another example, Efron etc. wasted who knows how many journal pages and man-years of effort on the problem of bootstrap confidence intervals. But I think the whole confidence intervals thing is a waste of time. (I think uncertainty intervals are great; what I specifically don’t like are those inferences that are supposed to have specified coverage conditional on any value of the unknown parameters, and which are defined by inverting hypothesis tests.)

There’s an expression I sometimes use with this work, which is that the house is stronger than the foundations.

52 thoughts on “The house is stronger than the foundations”

Christian Hennig on October 9, 2017 at 11:46 am said:

I’m all for making assumptions more explicit but I think that the Bayesian approach, i.e., making up a prior, often does not help with this. Rather, in many applied situations, my impression is that this forces researchers to make up additional assumptions with little basis (there’s a stronger basis occasionally, then fine by me). I think this idea that frequentists don’t make assumptions transparent that are explicit in the Bayesian approach comes from the fact that many frequentist analyses can be expressed mathematically as Bayesian analyses with one specific prior, so the Bayesian would then say, see, you’re assuming this prior but you don’t tell.
But as the frequentist logic is different and the outcome is not a distribution over parameter values, it’s actually not the same as assuming a prior for computing a posterior.
(I accept that the mathematical equivalence of a frequentist analysis with a Bayesian analysis with a certain prior is informative and adds to the understanding of what goes on, but that’s not quite the same as “assuming the prior without making it explicit”.)

So I’m often fine with confidence intervals; if I don’t have background knowledge that translates smoothly into a prior, I’m quite happy not to use one, and I’d indeed think that a Bayesian who then makes one up assumes more, rather than only making an anyway existing assumption explicit.

- Daniel Lakeland on October 9, 2017 at 1:33 pm said:
  
  The correspondence of many confidence intervals to Bayesian models with flat priors is informative to me though. I know ahead of time that the object I’m trying to estimate is finite, and the flat prior logically states within the Bayesian framework that the thing is infinitely large. Using a point estimation / optimization procedure ignores this measure-theoretic concept and focuses just on the local density. The flat density becomes a way to “avoid altering the location of the maximum” and its behavior epsilon away from the maximum is irrelevant in that framework.
  
  But any attempt to use a Bayesian Decision Rule to make a decision can’t focus on just the local density at the peak. You could reject Bayesian Decision Rules, but then you have to explain how Wald’s theorem plays into your justification for rejection…
  
  In the end, I think my biggest concern is that the frequentist confidence interval and/or point estimation procedure feel like a misfired attempt to make good decisions based on data. It seemed intuitive, but it wasn’t fully thought through.
  
  - Christian Hennig on October 9, 2017 at 6:10 pm said:
    
    “the flat prior logically states within the Bayesian framework that the thing is infinitely large” – but when discussing confidence intervals we’re not in that framework. You just don’t pre-specify any size for it; and the CI will then usually sit nicely around what the data indicates, with no unruly drive to infinity whatsoever.
    
    “But any attempt to use a Bayesian Decision Rule to make a decision can’t focus on just the local density at the peak” – if I remember correctly, you always claim that specifying a prior is not complicated and whatever has high density at the true value is fine… what you say here may depend a lot on what your prior does elsewhere, at least as long as you don’t have large amounts of data.
    
    “You could reject Bayesian Decision Rules” – I don’t (I’m quite undogmatic and have been caught defending Bayes against frequentists). I’m fine with them as long as you need a decision and your prior and loss function are well justified. Except that this in my practice is a minority situation.
    
    - Chris Wilson on October 10, 2017 at 8:52 am said:
      
      Yea I’m inclined to agree on this one. The correspondence btw frequentist CI and Bayes with flat prior is useful, and the uniform prior on (-Inf, Inf) may well logically imply goofy estimates, as Daniel says. However, in my view the frequentist CIs seem a lot closer to doing MAP with flat priors, rather than real Bayesian inference (i.e. integrating over parameter space). This is why with N = small, the frequentist CI is often smaller, and centered more literally on the data, for instance, than integrating over a Bayesian model with uninformative priors.
    - Daniel Lakeland on October 10, 2017 at 9:19 am said:
      
      I don’t think the frequentist CI based on using a flat prior can ever be smaller. The real prior is always more concentrated.
      
      Centered more on the data, perhaps, depends on the situation. Whether that’s a good thing also depends on the situation, for example the noisiness of the data collection process and the amount of background info that informs the prior.
    - Chris Wilson on October 10, 2017 at 10:04 am said:
      
      Sure it can be. Estimate the mean and sd of this vector of data c(0.02, 0.025, 0.04, 0.031) using stan and flat priors. Compare to estimates from lm() or use Stan optimizers and use the Hessian. The key is integrating over parameter space versus working with MAP and curvature (which is much much closer, if not identical in many cases, to the frequentist estimate and CI). With N small, the differences are noticeable.
    - Christian Hennig on October 10, 2017 at 10:04 am said:
      
      If the prior puts much mass around where the data indicate that the truth is, a Bayesian uncertainty interval will be smaller than a frequentist CI. However, if the prior has the bulk of its mass elsewhere, it may be bigger.
      It all boils down to the observation that a prior is helpful if the information encoded in it is good and reliable, otherwise rather not.
    - Chris Wilson on October 10, 2017 at 10:09 am said:
      
      + 1
    - Chris Wilson on October 10, 2017 at 10:07 am said:
      
      e.g.
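      library(rstan)
      
      # Normal model for the data; no priors are stated, so mu and sigma get flat priors
      stan_simpleCI <- '
      data{
      int N;
      vector[N] dat;
      }
      parameters{
      real mu;
      real<lower=0> sigma;  // scale must be positive
      }
      
      model{
      dat ~ normal(mu,sigma);
      }
      '
      dat <- c(0.02, 0.025, 0.04, 0.031)
      stan_dat <- list(N = 4, dat = dat)
      
      # full Bayes: integrate over parameter space by MCMC
      # (the model is the string defined above, so pass it as model_code)
      demo <- stan(model_code = stan_simpleCI, data = stan_dat,
      chains = 3, iter = 500)
      
      #divergents are suppressible with adapt_delta and not important
      # to point I'm making
      
      # compare to the mode plus curvature
      demo_mod <- stan_model(model_code = stan_simpleCI)
      opt_demo <- optimizing(demo_mod, data = stan_dat, hessian = TRUE)
      print(opt_demo)
      # rstan computes the Hessian on the unconstrained scale (mu, log(sigma)),
      # so the first entry below is the curvature-based standard error of mu
      sqrt(diag(solve(-opt_demo$hessian)))
      
      # or
      summary(lm(dat ~ 1))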
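      
      For this four-point data set the three intervals can be worked out by hand, which may help separate the issues in this thread. The sketch below assumes normal data, 95% two-sided intervals for mu, and, for the Bayesian version, a flat prior on mu together with a flat prior on sigma over (0, ∞). The closed forms are standard results, not output from the code above.
      
      ```latex
      % Data: y = (0.02, 0.025, 0.04, 0.031), n = 4
      \bar{y} = 0.029, \qquad S = \sum_i (y_i - \bar{y})^2 = 2.22 \times 10^{-4}
      
      % (a) ML + curvature (normal approximation, \hat\sigma = \sqrt{S/n})
      \bar{y} \pm 1.960\,\sqrt{S/n}\,/\sqrt{n} = 0.029 \pm 0.0073
      
      % (b) lm(): classical t interval, s = \sqrt{S/(n-1)}, 3 degrees of freedom
      \bar{y} \pm 3.182\,\sqrt{S/(n-1)}\,/\sqrt{n} = 0.029 \pm 0.0137
      
      % (c) flat priors on \mu and \sigma > 0, integrating \sigma out
      p(\mu \mid y) \propto \int_0^\infty \sigma^{-n}
          \exp\!\left(-\frac{S + n(\mu - \bar{y})^2}{2\sigma^2}\right) d\sigma
        \propto \left[S + n(\mu - \bar{y})^2\right]^{-(n-1)/2}
      % i.e. Student t with n - 2 = 2 degrees of freedom and scale \sqrt{S/(n(n-2))}
      \bar{y} \pm 4.303\,\sqrt{S/(n(n-2))} = 0.029 \pm 0.0227
      ```
      
      Under these assumptions the classical t interval in (b) matches the posterior interval for the prior p(mu, sigma) ∝ 1/sigma, not for a prior that is flat in sigma, which is one reason a flat-prior Stan fit comes out wider than lm() here.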
    - Chris Wilson on October 10, 2017 at 10:10 am said:
      
      aha, sorry Daniel I think I misread your comment. The key provision there is your saying “the *real* prior”. Which is where Christian Hennig’s statement summarizes things nicely…
    - Chris Wilson on October 10, 2017 at 10:24 am said:
      
      FWIW, for this example it looks like even principled, order of magnitude type weakly informative priors, lead to larger CI than the ‘frequentist’ (or MAP plus curvature) CI. The mechanism is integration – if we are not integrating over parameter space, tail area (and hence logical implication) of prior does not matter…
      
      e.g.
      stan_model2 <- '
      data{
      int N;
      vector[N] dat;
      }
      parameters{
      real mu;
      real<lower=0> sigma;  // scale must be positive, so the prior below is half-normal
      }
      
      model{
      
      mu ~ normal(0,0.05);
      sigma ~ normal(0,0.05);
      dat ~ normal(mu,sigma);
      }
      '
    - Daniel Lakeland on October 10, 2017 at 11:41 am said:
      
      I haven’t thought this through carefully, but my intuition is that this can’t be true. We’re talking about the normalized product of two functions p(Data | Params) p(Params). Let’s let p(Params) = normal(Params, mu, sd) for any mu and sd we like…
      
      The “flat” prior corresponds to the limit as sd goes to infinity. Let [a,b] be a “flat prior based confidence interval”. If you make sd something less than infinity you increase the prior probability that the parameter is in [a,b] because as sd goes to infinity, the prior probability to be in any fixed interval goes to zero.
      
      If we increase the prior probability, and the likelihood would pick out this region under the flat prior, then we should increase the posterior probability as well. This means we can shorten the interval and keep the same posterior probability.
      
      Now, your point about MAP + curvature may mostly be an observation that for small data, the *approximation procedure of MAP + curvature* systematically underestimates the appropriate size of the confidence interval. That’s a different story ;-)
    - Chris Wilson on October 10, 2017 at 12:04 pm said:
      
      right, but my point is specifically the comparison of the Bayesian estimate (integrating over parameter space) to a hypothetical non-Bayesian approach (e.g. based on max likelihood, least squares, whatever) where there is no integration over parameter space. The closeness of MAP + curvature *as an approximation* to the Bayesian estimate is a Bayesian perspective/concern :) To be super clear, just because we can interpret a frequentist procedure as a Bayesian model with a flat prior does not mean that the inference will necessarily be either a) the same, or b) a larger CI (because of aforementioned integration or lack thereof).
    - Daniel Lakeland on October 10, 2017 at 12:40 pm said:
      
      Chris: “just because we can interpret a frequentist procedure as a Bayesian model with a flat prior does not mean that the inference will necessarily be either a) the same, or b) a larger CI (because of aforementioned integration or lack thereof).”
      
      No, I don’t think this is correct.
      
      Mathematically, the 95% CI that occurs when a Frequentist procedure based on likelihood theory + a flat prior is used to construct a confidence interval, *is* the 95% posterior interval that occurs when the Bayesian uses the Bayesian procedure and an improper flat prior. In other words, whatever you get from running Stan with a flat prior, is *what you should have gotten with the CI procedure*.
      
      There’s a mathematical isomorphism, the math that you get is the same in the two cases, the intervals you get are the same in the two cases.
      
      Now, *as an approximation* to the *correct CI* often a MAP + curvature calculation is done by software like “lm” etc. The calculation is based on an asymptotic theory, and perhaps this may systematically under-estimate the correct size of the interval when the dataset is small. The fact is, this is a *calculation error* and the resulting interval *is not a correct 95% CI*
      
      Taking the results of this MAP/ML + curvature *calculation error* as evidence that the CI can be smaller and hence “better” in some sense, is just saying “often the simple calculation that is done leads you to a systematic error in which your CI is wrong because it’s too small, and this is ‘good’”.
      
      Yes, I agree with you that often the point of a Frequentist interval is really just to pick out a number (the MAP/ML estimate). If Frequentists just don’t care about the size of their CI then fine… But when the model is mathematically equivalent to a Bayes CI, the intervals *need to be the same* that’s what “equivalent” means in this context.
    - Daniel Lakeland on October 10, 2017 at 12:54 pm said:
      
      Chris: to be clear, as I said, there may be a subtlety I’m missing here, and so I’m not stating this as dogmatically definitely true, it’s just that I would like to see the mathematical reason why a *correctly calculated* confidence procedure from a model that is equivalent to a Bayesian model with flat prior would ever be shorter than the Bayesian model with a diffuse but proper prior.
      
      Practically speaking, software may spit out a shorter interval, but we need to consider whether the shorter interval arises from using an asymptotic approximation that is inappropriate rather than actually having a correct CI.
    - Chris Wilson on October 10, 2017 at 3:16 pm said:
      
      I think we are moving into ‘dancing on pinheads’ territory :) Of course, if the models are truly equivalent, then they will yield the same results. What I’m saying is that *integration over parameter space* differentiates Bayesian and non-Bayesian estimation given the same model specification (as illustrated in code I provided above). My understanding from, e.g. BDA3, is that there is a legitimate frequentist derivation and interpretation of the max likelihood estimates that are often indeed equivalent to MAP + curvature (the Taylor Series approx of posterior mode is illustrated). From a Bayesian perspective, MAP+curvature is an approximation to what we really want (the integral), but maximum likelihood approaches for instance are NOT trying to integrate over parameter space. Therefore, I don’t think it is correct to say that those CIs are somehow invalid…and I am making no judgements here about which one is “better”. Conditional on not knowing the ‘true data generating distribution’, that is always debatable :)
    - Daniel Lakeland on October 10, 2017 at 4:38 pm said:
      
      Thinking about this a little more, asymptotically a ML estimate becomes point-estimate like in the same sense that a Bayesian posterior becomes delta-function like with enough data.
      
      The ML + curvature based estimate for a confidence interval can be seen as a first-order correction to the zeroth order estimate that the parameter value is just the ML value (a delta function).
      
      A *proper* confidence interval however while I agree it doesn’t integrate over the parameter space, it *is* supposed to guarantee coverage regardless of what the true value is. The typical confidence interval construction procedure in ML fitting relies on asymptotic normality of the estimator, and only gives proper coverage as N goes to infinity. So for example if you do your simulation study for N = 100 instead of N = 4 and for mildly-informative order-of-magnitude priors, what do you observe?
      
      So, for small data sets, the ML + Curvature estimate is probably systematically too small to be a proper CI, because to be a proper CI requires that you do some kind of mini-max (i.e., the minimum length interval having 95% coverage for the parameter value that is the worst-case).
      
      It’s not that frequentist theory here says we *should* use the ML + Curvature, it’s that it has no computationally tractable way to get a true coverage guarantee and so it relies on an approximation.
      
      At least, I think that’s what’s going on.
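      
      For normal data the coverage of the ML + curvature interval can be written down exactly, which gives a rough check on this. The calculation below assumes i.i.d. normal observations and a nominal 95% interval of the form ybar ± 1.96 sigma_hat / sqrt(n), with sigma_hat the maximum likelihood estimate. It is a sketch of this one case, not a general result.
      
      ```latex
      % The ML estimate is \hat\sigma = s\,\sqrt{(n-1)/n}, so the interval covers \mu iff
      \frac{|\bar{y} - \mu|}{s/\sqrt{n}} \;\le\; 1.96\,\sqrt{\frac{n-1}{n}},
      \qquad \text{where } \frac{\bar{y} - \mu}{s/\sqrt{n}} \sim t_{n-1}
      
      % n = 4: the threshold is 1.96 \sqrt{3/4} \approx 1.70, against t_{0.975,3} \approx 3.18
      \Pr\big(|t_3| \le 1.70\big) \approx 0.81
      
      % n = 100: the threshold is 1.96 \sqrt{0.99} \approx 1.95, against t_{0.975,99} \approx 1.98,
      % so the coverage is only slightly below 0.95
      ```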
    - ojm on October 10, 2017 at 4:51 pm said:
      
      I was originally just trying to make a simple observation about posterior predictive checks and ‘double use of data’ not kick off another Bayes vs Freq argument! (FWIW If anything, I see a bigger division between model first vs procedure first people).
      
      I’m glad that I found Mike Evans’ comments on the relevance of sufficient stats to the double use of data topic, so I don’t think I’m completely wrong. But would be nice to hear any other thoughts on the topic. Even Andrew seemed reluctant to take up the specific point and hence our big detour into ‘foundations’ territory.
      
      Does anyone out there have thoughts on posterior predictive checks, sufficient stats, and double use of data?
    - Chris Wilson on October 10, 2017 at 5:43 pm said:
      
      ojm, I don’t fully follow your distinction btw ‘sufficient statistics’ and ‘parameters of a model’. So, y ~ normal(mu,sigma) i.e. I model P(mu,sigma|y) prop to P(y|mu,sigma)*P(mu,sigma)
      I then do a posterior predictive check on yrep|y which is basically an integral over P(yrep|mu,sigma)*P(mu,sigma|y)*dmu*dsigma
      
      Here mu and sigma are parameters of my model. Whether or not there exist sufficient statistics for estimating them in a simple way depends on my likelihood and priors, no? For simple cases that amount to max likelihood, such statistics exist. But PPC pretty clearly needs to extend to complex cases. So I don’t understand how sufficient statistics really apply (big caveat that I am not a statistician). In particular, you say
      “The data enters this estimation procedure only via the sufficient statistics (i.e. ‘the likelihood principle as applied within the model’)”
      But I don’t see how this is generally true in Bayesian models, and thus for how PPC happens in practice, except in special cases where T(y) are available. What am I missing?
    - ojm on October 10, 2017 at 6:20 pm said:
      
      Check out Mike Evans’ papers below, perhaps?
      
      > ojm, I don’t fully follow your distinction btw ‘sufficient statistics’ and ‘parameters of a model’.
      
      A sufficient statistic is a statistic (function of the data), while a parameter is a…parameter? E.g. for a parameter defined via Fisher-constant functional it is the value of the statistic evaluated on the ‘full’ model rather than just the observed distribution.
      
      For a sufficient statistic T(y) and model p(y;theta) we have
      
      p(y;theta) = p(y|T(y))p(T(y);theta)
      
      by definition of sufficient statistic.
      
      In particular, the term p(y|T(y)) is independent of theta by definition.
      
      In Bayes the data only enters parameter estimation via the likelihood function which is _only defined up to a data-dependent constant_. This is the important point, and noted by both ‘Frequentists’ (e.g. David Cox) and ‘Bayesians’ (e.g. Mike Evans).
      
      Any purely data-dependent factor such as p(y|T(y)) drops out of estimating theta. So this is ‘unused’ info in this sense.
      
      Equivalently, you can take the set of likelihood ratios or normalised likelihood function as the minimal sufficient statistic. Again the data-dependent constant component factor drops out. E.g.
      
      [p(y|T(y))p(T(y);theta_1)]/[p(y|T(y))p(T(y);theta_2)] = p(T(y);theta_1)/p(T(y);theta_2)
      
      in which p(y|T(y)) drops out of the comparisons. You can also write this all out in terms of a Bayesian posterior – again, something like p(y|T(y)) drops out.
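      
      Writing the posterior out explicitly, with p(theta) standing for whatever prior is used (this notation is an assumption added here), the cancellation looks like this:
      
      ```latex
      p(\theta \mid y)
        = \frac{p(y \mid T(y))\, p(T(y); \theta)\, p(\theta)}
               {\int p(y \mid T(y))\, p(T(y); \theta')\, p(\theta')\, d\theta'}
        = \frac{p(T(y); \theta)\, p(\theta)}
               {\int p(T(y); \theta')\, p(\theta')\, d\theta'}
        = p(\theta \mid T(y))
      ```
      
      So the posterior given the full data equals the posterior given T(y) alone.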
      
      The intuition is that your choice of model determines your choice of minimal sufficient statistics or vice-versa. Knowing one gives you the other.
      
      This tells you which parts of the data you are paying attention to.
      
      However, you can also ask whether you are paying attention to the right things – what does the ‘residual’ y|T(y) look like? If this shows unusual patterns then you doubt you’ve used a good minimal sufficient statistic i.e. you doubt you’ve used a good model.
      
      An issue with _nontrivial_ minimal sufficient statistics is that they don’t generally exist outside of special families.
      
      I’d probably argue that one should proceed the other way – choose what statistics are of interest and are a compact summary (‘minimal sufficient’ from a data analysis point of veiw) and _then_, if you still want to do parametric estimation, choose a model for which these are the minimal sufficient stats.
      
      (This is also probably not far off the same procedure as something like MaxEnt, but minus the mysticism and, possibly, issues with working with entropy).
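
      To make the factorisation above concrete, here is a small worked case. The i.i.d. Bernoulli model and the choice T(y) = sum of the y_i are illustrative assumptions, not something taken from the discussion:

      ```latex
      % Assumed for illustration: y = (y_1, ..., y_n) i.i.d. Bernoulli(theta), T(y) = sum_i y_i
      \begin{align*}
      p(y;\theta) &= \theta^{T(y)} (1-\theta)^{n-T(y)}, \\
      p(T(y);\theta) &= \binom{n}{T(y)} \theta^{T(y)} (1-\theta)^{n-T(y)}, \\
      p(y \mid T(y)) &= \frac{p(y;\theta)}{p(T(y);\theta)} = \binom{n}{T(y)}^{-1}.
      \end{align*}
      ```

      In this case p(y|T(y)) is uniform over the orderings of the observed successes and failures and does not involve theta. A check based on y|T(y), for example the number of runs in the sequence, therefore probes the independence assumption without reusing the part of the data that estimated theta.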
    - ojm on October 10, 2017 6:24 PM at 6:24 pm said:
      
      Fisher-constant -> Fisher-consistent
      veiw -> view
      etc -> etc
    - Chris Wilson on October 10, 2017 6:33 PM at 6:33 pm said:
      
      ojm, your point about the non-existence of non-trivial minimal sufficient statistics is, I think, where I was going. I’ll check out the Evans paper. But isn’t this a huge handicap for modeling? Don’t we want a resolution to whole double use of data in PPC that isn’t so restrictive?
    - ojm on October 10, 2017 6:43 PM at 6:43 pm said:
      
      > Don’t we want a resolution to whole double use of data in PPC that isn’t so restrictive?
      
      Yes but this is a case where the answer seems clear. We can then look at how to get _approximations_ to this.
      
      Some quotes from the first Mike Evans paper I posted:
      
      > It is our claim that effectively (1) [OJM: basically, factoring the distribution as above] shows us how to proceed to avoid double use of the information and, as such, avoid double use of the data. Of course, as mentioned in the paper, it may be difficult, with complicated models to determine [p(y|T) etc] in meaningful ways. According, it seems reasonable to weaken this requirement in such contexts to having this hold asymptotically in some sense…
      
      also
      
      > In frequentist statistical theory, inference about parameters depends on the data only through the minimal sufficient statistic and, what is left over in the data (the residual), is available for model checking. Mixing these up would seem to correspond to an inappropriate statistical analysis. We believe this is equally applicable in Bayesian formulations….Of course, this restriction could be weakened ….only satisfy (1) in some asymptotic sense. The motivation for this would seem to arise from the complexity of some situations. Still, (1) can be implemented exactly with many models of considerable importance, so it isn’t just of theoretical relevance.
      
      He also discusses checking priors and checking hierarchical models.
    - Christian Hennig on October 11, 2017 9:05 AM at 9:05 am said:
      
      “In frequentist statistical theory, inference about parameters depends on the data only through the minimal sufficient statistic and, what is left over in the data (the residual), is available for model checking. Mixing these up would seem to correspond to an inappropriate statistical analysis.”
      Interesting… but it’s part of the model assumptions that the factorisation is possible in this way and one may want to check this, too, for which more than the “leftover” would be needed.
    - ojm on October 11, 2017 2:49 PM at 2:49 pm said:
      
      Christian,
      
      True but I _think_ part of the idea is that if y|T looks ‘suspect’ then the model and hence factorisation is suspect.
      
      But all this is with my Bayesian hat on – more generally I agree that it is probably best to separate this stuff from modelling assumptions even further.
    - Christian Hennig on October 11, 2017 9:30 PM at 9:30 pm said:
      
      ojm: If I remember it correctly, it’s not too difficult to construct examples in which the model is clearly wrong but y|T won’t show it.
    - ojm on October 11, 2017 11:14 PM at 11:14 pm said:
      
      I wouldn’t be surprised, but I would still be interested to see them!
      
      BTW I then wonder – In what sense is the model judged ‘wrong’?
      
      Eg is it that your statistics T clearly don’t capture everything you are ‘interested in’ (model independent judgement) but these don’t ‘show up’ in the conditional distribution? Or?
    - ojm on October 11, 2017 11:17 PM at 11:17 pm said:
      
      I should say, putting my Bayes-ish hat on that the setup should be
      
      a) T is minimal sufficient for the model family indexed by theta
      b) y|T is used to judge the model
      
      I’m interested in cases where b) fails while a) holds.
    - Christian Hennig on October 12, 2017 1:21 PM at 1:21 pm said:
      
      One example is linear regression with normality assumption and leverage outlier(s). Leverage outliers will usually have small residuals, so looking at residuals (i.e. y|T, with T the regression parameter estimators and maybe the variance estimator) will not successfully find them.
      One can probably argue that in this case the linearity plus normality assumption for residuals elsewhere technically will be violated, so as n goes to infinity residual analysis will tell you that the model is wrong (though it will not highlight the outlier(s) as the source). For moderate n, however, one can easily jot down datasets in which, looking at all the information, the leverage outlier is crystal clear but the residuals on their own won’t look suspicious at all.
      
      Note that I don’t have time to write anything down in detail right now so you’ve got to live with what I get from my memory on the fly. I made a number of stupid mistakes with integrals in my lecture today so you shouldn’t necessarily trust me. ;-)
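
      A minimal numerical sketch of the leverage point, assuming simple linear regression with one predictor and made-up data (none of the numbers come from the discussion). It relies on the identity e_i = (1 - h_ii) * (leave-one-out prediction error for point i), so a point with leverage h_ii close to 1 has a small residual even when it sits far from the line the other points suggest:

      ```python
      def ols_fit(x, y):
          """Return (intercept, slope) of the least-squares line through the points."""
          n = len(x)
          xbar, ybar = sum(x) / n, sum(y) / n
          sxx = sum((xi - xbar) ** 2 for xi in x)
          sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          slope = sxy / sxx
          return ybar - slope * xbar, slope


      def leverages(x):
          """Diagonal of the hat matrix for simple linear regression with intercept."""
          n = len(x)
          xbar = sum(x) / n
          sxx = sum((xi - xbar) ** 2 for xi in x)
          return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


      # Made-up data: nine points close to the line y = x, plus one point far out
      # in x whose y value is far below that line.
      x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
      y = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2, 10.0]

      intercept, slope = ols_fit(x, y)
      residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
      h = leverages(x)

      # Prediction error for the last point when the line is fitted without it.
      a0, b0 = ols_fit(x[:-1], y[:-1])
      loo_error = y[-1] - (a0 + b0 * x[-1])

      print(f"leverage of last point:       {h[-1]:.3f}")
      print(f"its residual in the full fit: {residuals[-1]:.2f}")
      print(f"its leave-one-out error:      {loo_error:.2f}")
      print(f"largest |residual| elsewhere: {max(abs(r) for r in residuals[:-1]):.2f}")
      ```

      In this made-up dataset the last point has by far the highest leverage and a residual smaller in magnitude than several of the others, while its leave-one-out error is large. The remaining residuals do show a trend here, in line with the remark that the assumptions are technically violated elsewhere.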
    - ojm on October 12, 2017 2:23 PM at 2:23 pm said:
      
      Thanks!
    - ojm on October 12, 2017 3:23 PM at 3:23 pm said:
      
      This is another reason I’ve moved more towards the idea of
      
      a) compute some summaries of interest eg a straight line summary
      
      b) think about how these vary under sensible ‘challenges’ eg dropping or resampling observations etc
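
      A rough sketch of this two-step idea, using only the Python standard library. The data, the number of resamples, and the seed are made up for illustration; the summary is the least-squares slope, and the 'challenges' are dropping one observation at a time and bootstrap resampling:

      ```python
      import random


      def slope(points):
          """Least-squares slope of y on x for a list of (x, y) pairs."""
          n = len(points)
          xbar = sum(x for x, _ in points) / n
          ybar = sum(y for _, y in points) / n
          sxx = sum((x - xbar) ** 2 for x, _ in points)
          sxy = sum((x - xbar) * (y - ybar) for x, y in points)
          return sxy / sxx


      # Made-up data roughly following y = 2x.
      data = [(1, 2.1), (2, 3.8), (3, 6.3), (4, 7.9),
              (5, 10.2), (6, 11.7), (7, 14.1), (8, 16.0)]

      # a) the summary of interest
      full_slope = slope(data)

      # b) challenge 1: drop each observation in turn
      drop_one = [slope(data[:i] + data[i + 1:]) for i in range(len(data))]

      # b) challenge 2: resample observations with replacement (bootstrap)
      rng = random.Random(0)
      boot = []
      for _ in range(200):
          sample = [rng.choice(data) for _ in data]
          if len({x for x, _ in sample}) > 1:  # skip resamples with a single x value
              boot.append(slope(sample))

      print(f"slope on all data: {full_slope:.3f}")
      print(f"drop-one range:    {min(drop_one):.3f} to {max(drop_one):.3f}")
      print(f"bootstrap range:   {min(boot):.3f} to {max(boot):.3f}")
      ```

      If the two ranges are narrow relative to the size of effect that matters, the straight-line summary is stable under these particular challenges; what counts as 'narrow' is a judgement the sketch does not make.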
    - ojm on October 12, 2017 4:24 PM at 4:24 pm said:
      
      I’m curious whether/how this sort of approach would work in a context where ‘evidence’ is central eg a legal trial, however.
      
      I suppose the same sort of idea applies – an ‘evidence’ statistic (or statistics) and stability of this evidence statistic to challenges- but I’m not too sure how one would provide guidelines on measuring ‘evidence’ in the first place.
      
      Example: murder weapon found in bedroom. Clearly ‘evidence’. Stability challenge: was found by cop with history of planting evidence.
      
      Less clear: genetic evidence matches to high but not certain probability. Clearly ‘evidence’. But only versus those who lack same genetic signature. None vs someone who also matches. Is this a question of evidence measure or of stability or both or?
    - ojm on October 12, 2017 4:31 PM at 4:31 pm said:
      
      I suppose both are similar: consider how often evidence would occur under competing challenge (‘hypothesis’ if you must). Eg how often evidence would occur when found by cop with history of planting and eg how often evidence would occur when crime committed by person with same genetic signature.
      
      End of thinking out loud on other people’s blogs…for now
- Martha (Smith) on October 9, 2017 3:18 PM at 3:18 pm said:
  
  If you “don’t have background knowledge that translates smoothly into a prior,” then a sensible thing to do is to try plausible priors and see how much the choice of prior affects results.
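
  As a hedged sketch of such a sensitivity check: the Beta-Binomial model, the data (7 successes in 20 trials), and the list of candidate priors below are all made up for illustration.

  ```python
  def beta_binomial_posterior(a, b, successes, trials):
      """Posterior Beta parameters for a binomial proportion under a Beta(a, b) prior."""
      return a + successes, b + trials - successes


  def beta_mean(a, b):
      """Mean of a Beta(a, b) distribution."""
      return a / (a + b)


  successes, trials = 7, 20  # made-up data

  # A few candidate priors, labelled for the report below.
  priors = {
      "uniform Beta(1, 1)": (1.0, 1.0),
      "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
      "weakly informative Beta(2, 2)": (2.0, 2.0),
      "stronger Beta(10, 10)": (10.0, 10.0),
  }

  posterior_means = {}
  for name, (a, b) in priors.items():
      post_a, post_b = beta_binomial_posterior(a, b, successes, trials)
      posterior_means[name] = beta_mean(post_a, post_b)
      print(f"{name:32s} posterior mean = {posterior_means[name]:.3f}")

  # How far apart the conclusions are across the candidate priors.
  spread = max(posterior_means.values()) - min(posterior_means.values())
  print(f"spread of posterior means across priors: {spread:.3f}")
  ```

  A small spread suggests the data dominate; a large one says the conclusion leans on the choice of prior.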
  
  - Christian Hennig on October 9, 2017 5:51 PM at 5:51 pm said:
    
    As long as you feel that you’ve got to have a prior; if you’re old fashioned like me and think you can do without most of the time, doing sensitivity analysis with various priors feels like trying to address a problem which you don’t need to have in the first place.
    
- Guido Biele on October 10, 2017 7:05 AM at 7:05 am said:
  
  “if I don’t have background knowledge that translates smoothly into a prior, I’m quite happy not to use one”
  
  I think it is hard to make or understand statements about usefulness of priors (or trade-offs when defining them) without specifying the statistical problem one is dealing with.
  
  If one has lots of data and/or very good measurements and one is estimating relatively simple models, I could also be happy not to use priors.
  However, when the data is weak (small samples and/or unreliable measurements) or when trying to estimate a complex model, I would be unhappy not to use priors.
  
  So maybe disagreements about the trade-offs involved in formulating priors stem in part from different implicit assumptions about the statistical problems to be solved?
  
  - Christian Hennig on October 10, 2017 10:00 AM at 10:00 am said:
    
    “I think it is hard to make or understand statements about usefulness of priors (or trade-offs when defining them) without specifying the statistical problem one is dealing with.”
    Fair enough, I agree that this depends on the problem.
    
    “However, when the data is weak (small samples and/or unreliable measurements) or when trying to estimate a complex model, I would be unhappy not to use priors.”
    …and this looks far too general a statement to me. A prior can help if it represents genuine reliable information that is not in the data. If it doesn’t, I don’t see what you get from it. Surely you only want the prior to do some work here if the work done by it is good and helpful.
    Sometimes I think it’s better to be honest and say that the available information is not strong enough to estimate your complex model at the required precision, and use graphs and non-probabilistic reasoning instead (or to use probability in a purely exploratory fashion) – keeping in mind also that if the data are too weak to estimate your model, chances are that they will also be too weak to check your model well.
    
    - Guido Biele on October 10, 2017 10:51 AM at 10:51 am said:
      
      I agree with “…and this looks far too general a statement to me”.
      
      Of course, priors should contain information that is not in the data, I took this as a given. In my (limited) experience, this is not very hard, because often weakly informative priors suffice.
      I also agree that when the data is too weak to estimate the model, e.g. when parameter estimates depend more on priors than on data, one should rather admit that and collect better data than employ priors.
      
      Lastly, when simple models are estimated based on weak data, priors can make sense because they can protect from falling for unreasonably large effect sizes (in the field I am working in).
      
      In the end, also given my failed attempt ;-), I am wondering if general statements about the usefulness of priors that reflect more than common sense are possible.
    - Martha (Smith) on October 10, 2017 6:53 PM at 6:53 pm said:
      
      “Common sense” is often very subjective — i.e., can vary from person to person.
ojm on October 9, 2017 12:09 PM at 12:09 pm said:

I forgot about this rambling exchange. Weird/a little embarrassing to see it here, especially given my views have changed a fair bit since (not on the main PPC thing tho), but I suppose some might be interested.

The somewhat frustrating thing for me is that I tried out as many different stats approaches as I could and still found them all fairly wanting. Tukey’s ‘constructive scientific anarchism’ still holds up, I suppose.

- ojm on October 9, 2017 12:37 PM at 12:37 pm said:
  
  Though the rest of my contribution is somewhat irrelevant nonsense, reading back I still think my comment on PPC and sufficient statistics was worth emphasising for some folk.
  
  That is, Bayes estimation only uses the sufficient stat, so PPC can be perfectly well justified from a traditional point of view when based on y|T(y), just as in e.g. David Cox’s ‘Fisherian reduction’.
  
  I got the feeling that Andrew didn’t care too much about emphasising this correspondence, despite mentioning on occasion that PPC should be based on aspects of the model that are not ‘fit automatically’. Perhaps because he had (rightly) moved on to more interesting things. But still.
  
  - Daniel Lakeland on October 9, 2017 1:13 PM at 1:13 pm said:
    
    This notion of “sufficient statistics” is a non-standard one though, isn’t it?
    
    From wiki: “In statistics, a statistic is sufficient with respect to a statistical model and its associated unknown parameter if ‘no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter’.”
    
    Sometimes in a Bayesian model things collapse to a small set of sufficient statistics, like the mean and sd of a normal. But other times that’s not true. I think that in the Cauchy case there is no sufficient statistic according to the standard definition of sufficient statistics (I’ve never looked too carefully into that, but I see it repeated in discussions, so I assume it’s a standard result). So, while I kind of like the idea that there’s information left on the table, I don’t think it’s information within the statistical model, it’s kind of information about whether the statistical model itself is good for your purposes, and that comes down to whether it predicts “well” after fitting. The “well” has something to do with a utility rather than an inference. The fact is that, as fallible humans, we can’t expect that we’ve specified the model exactly in the way we will want it specified until we’ve had a lot of experience with using it…
    
    - ojm on October 9, 2017 2:21 PM at 2:21 pm said:
      
      I think it’s consistent with the standard mathematical definition of sufficient statistic.
      
      But as you mentioned sufficient stats rarely exist for more complex problems. The guiding intuition is perhaps the same though.
      
      > while I kind of like the idea that there’s information left on the table, I don’t think it’s information within the statistical model, it’s kind of information about whether the statistical model itself is good for your purposes
      
      That’s precisely the point!
    - ojm on October 9, 2017 2:24 PM at 2:24 pm said:
      
      (See also David Cox’s description of the Fisherian Reduction)
    - ojm on October 9, 2017 4:24 PM at 4:24 pm said:
      
      Mike Evans had made a similar point
      
      http://www.utstat.toronto.edu/wordpress/wp-content/uploads/2011/09/Evans-Comment-on-Bayesian-Checking-of-the-Second-levels-of-Hierarchical-Models.pdf
    - ojm on October 9, 2017 4:49 PM at 4:49 pm said:
      
      Also:
      https://projecteuclid.org/download/pdf_1/euclid.ba/1340370946
    - ojm on October 9, 2017 7:48 PM at 7:48 pm said:
      
      Last comment for now – if this wasn’t clear then by ‘sufficient stat’ I really mean ‘non-trivial sufficient stat’ i.e. minimal sufficient.
Christian on October 9, 2017 2:43 PM at 2:43 pm said:

Would like to link back to the related post
http://statmodeling.stat.columbia.edu/2017/04/26/using-prior-knowledge-frequentist-tests/

It’s part of a quest to use Bayesian posteriors with informative priors as a route to what was quoted above as ‘pure’ frequentist statistics (emphasising optimality, decisions, coverage). The key is the realization that informative priors in Bayesian statistics carry the information that in ‘pure’ frequentist statistics is carried by an informative loss function.

Huw Llewelyn on October 9, 2017 5:10 PM at 5:10 pm said:

Uniform priors are a mathematical consequence (or corollary) of the random sampling model applied to statistical inference; they do not have to be assumed (see: https://blog.oup.com/2017/06/suspected-fake-results-in-science/). In effect, a subjective Bayesian prior is therefore a ‘1st generation’ subjective posterior probability created by combining the ‘base-rate’ uniform prior with a subjective, hypothetical likelihood distribution. This is combined with a likelihood distribution based on data to produce a 2nd generation subjective posterior probability distribution. This in turn can be combined with another likelihood distribution based on data to create a 3rd generation posterior probability distribution, and so on. This process is mathematically identical to performing a ‘meta-analysis’ on hypothetical data and other real data and then using a uniform ‘base-rate’ prior to create a posterior probability distribution for the combined data.

Adding hypothetical likelihood distributions is a very powerful way of doing sensitivity analyses to anticipate how an existing study might give different results or the result of an attempted replication in a different setting. However, it is wrong to regard the uniform base-rate prior probability distribution as simply one more hypothetical prior. On the contrary, the uniform base rate prior probability is a fundamental part of random sampling mathematical models. Frequentists should make use of this ‘objective Bayesian’ approach too.
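
The equivalence between generation-by-generation updating and a single pooled analysis under a uniform prior can be sketched in the conjugate Beta-Binomial case. The model and all counts below are illustrative assumptions, not taken from the linked post:

```python
def update(prior, successes, failures):
    """Conjugate Beta-Binomial update: Beta(a, b) -> Beta(a + s, b + f)."""
    a, b = prior
    return a + successes, b + failures


uniform = (1, 1)       # Beta(1, 1), the uniform 'base-rate' prior
hypothetical = (3, 7)  # made-up 'hypothetical data': 3 successes, 7 failures
study_1 = (12, 18)     # made-up observed data
study_2 = (5, 15)      # made-up observed data from a second study

# Sequential route: each posterior becomes the prior for the next batch.
first_generation = update(uniform, *hypothetical)
second_generation = update(first_generation, *study_1)
third_generation = update(second_generation, *study_2)

# Pooled route: add up all counts, then update the uniform prior once.
pooled_successes = hypothetical[0] + study_1[0] + study_2[0]
pooled_failures = hypothetical[1] + study_1[1] + study_2[1]
pooled = update(uniform, pooled_successes, pooled_failures)

print("sequential:", third_generation)
print("pooled:    ", pooled)
```

Both routes end at the same Beta parameters because the conjugate update only adds counts. The sketch shows the arithmetic of the equivalence in this one model; it takes no position on whether the uniform prior is forced by random sampling.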

- Andrew on October 9, 2017 5:23 PM at 5:23 pm said:
  
  Huw:
  
  I recommend abandoning the terms “subjective” and “objective” for reasons discussed in detail here by Christian Hennig, myself, and 53 others.
  
  - Huw Llewelyn on October 9, 2017 8:04 PM at 8:04 pm said:
    
    Andrew
    
    OK. I still see these terms being used commonly and I used them because of such common usage. I will avoid doing so in future! I thank you for pointing out the paper by you and Christian; I accept your points and fully appreciate the views in the paper and the discussion.
    
    I am still unsure about what term I should apply to a probability calculated using:
    
    (1) carefully observed, documented data alone with a mathematical model (which of course makes some unverifiable assumptions) and incorporating Bayes rule
    
    (2) hypothetical data (“made up” as Christian says) as well as carefully observed, documented data with the same model (based on the same unverifiable assumptions) and incorporating Bayes rule
    
    In his first reply to your blog, Christian uses the terms ‘frequentist’ (“when using Bayesian analyses with one specific prior”) and “Bayesian” (the latter when “making up” a prior). So in accordance with Christian’s usage, do you think my comment above should have used the term ‘Bayesian’ instead of ‘subjective’ for (2) and ‘frequentist’ instead of ‘objective’ for (1)? I must say that I feel uncomfortable about using the terms ‘frequentist’ and ‘Bayesian’ too! Perhaps my last sentence should have read “Frequentists should make use of this ‘frequentist Bayesian’ approach too.”
    
    - Christian Hennig on October 10, 2017 6:46 AM at 6:46 am said:
      
      My use of the terms here is a bit tongue-in-cheek of course. Somebody here needs to take the Mickey out of all the Bayesians from time to time.

Statistical Modeling, Causal Inference, and Social Science
