Sander Greenland is a leading epidemiologist and educator who’s strongly influenced my thinking on hierarchical models by pointing out that often the data do not supply much information for estimating the group-level variance, a problem that can be particularly severe when the number of groups is low. (And, in some sense, the number of groups is always low or always should be low, in that if there are many groups they can be subdivided into different categories.) As a result, Greenland has generally recommended that researchers use pre-determined values for hierarchical variance parameters rather than trying to estimate them from data. For many years I thought that attitude was eccentric—having been weaned on the examples of Lindley, Novick, Dempster, Rubin, etc. from the 1970s, I’d just assumed that the right way to go was hierarchical Bayes, estimating the hierarchical parameters from the data—and in my books and research articles I’ve done it that way, fitting hierarchical models with flat or (more recently) weakly informative hyperpriors. More and more, though, I’ve been seeing the point of strongly informative priors, in general and for group-level variance parameters in particular. I’ve come around to Greenland’s attitude that there is typically a lot of external (“prior”) information available on group-level variances and not so much local data information, and our analyses should reflect this. For too many years I’ve been a “cringing Bayesian,” trying to minimize the use of prior information in my analyses, but I’m thinking that this has been a mistake. I’m not all the way there yet—in particular, BDA3 remains full of flat or weak priors—but this is the direction I’m going.
Here’s an example of Greenland’s thinking on this, from a few years ago. I read it at the time, but I really wasn’t paying enough attention.
I’ve also been positively disposed to Sander Greenland (if not to Greenland itself), ever since 1997 when he published a very positive review of Bayesian Data Analysis for the American Journal of Epidemiology. This was at a time when I was feeling very paranoid after having my work attacked in an ignorant fashion by a pack of theoretical statisticians at my former workplace. It was a true relief to learn that there was a world outside that closed environment, where people were interested in fitting and learning from statistical models without ideological constraints.
So this is all just to say that I have a lot of respect for what Sander has to say, because (a) he has a track record of recommending something different than what I do and being right, and (b) he is not a methodological ideologue.
With all this as background, I’ll convey a recent email discussion I’ve had with Sander in which we disagree and I think he’s mistaken. I’ll give you the whole exchange and you can judge for yourself.
Controversy over posterior predictive checks
In classical goodness-of-fit testing, you simulate fake data from a model and compare to the data at hand. To the extent the data don’t fit the model, this indicates a problem.
From my perspective, goodness-of-fit testing (“model checking”) is different from null hypothesis significance testing (“hypothesis testing”) in that the goal of model checking is not to reject a model—we already know that essentially all our models are false—but rather to understand the ways in which our model does not fit the data. One way to put it is that, in hypothesis testing, the null hypothesis is typically something that the researcher does not like, and the goal is to reject it and thus prove something interesting. But in model checking, the model being checked is something that the researcher likes, the model’s falsity is taken for granted, and the goal is to find problems with it, with the ultimate goal of improving or replacing the model. My paper with Cosma Shalizi (see also here) presents this Lakatosian perspective in detail.
In practice, goodness-of-fit testing has the problem that the model is typically being compared to data that are used to fit the model itself. For linear models, a “degrees of freedom correction” is needed to adjust for this overfitting issue. For nonlinear models, or models with constraints, or models with informative prior distributions, it’s not so clear how to make this correction. This was a problem that concerned me back in the late 1980s when I was writing my Ph.D. thesis. The topic was image reconstruction; the model was linear but with positivity constraints (the intensity of an image is a nonnegative function of space), and lots of parameters were being fit, but not all the constraints were active for any particular fit. For this problem, I came up with the idea of averaging over the posterior distribution to get a comparison distribution for the chi-squared statistic, a computation that could be performed using posterior simulation. I talked about the idea with my thesis advisor, Don Rubin, and he pointed out that this could be viewed as a model-checking application of his idea of multiple imputation for summarizing uncertainty, and also as a posterior predictive check of the sort discussed in his classic 1984 paper. I kept thinking about these ideas, and Xiao-Li Meng, Hal Stern, and I ultimately wrote a paper on posterior predictive checking, a paper that got rejected by various journals and eventually appeared, with many discussions, in Statistica Sinica in 1996 (see here for the article, many discussions by others, and our rejoinder).
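To give a sense of the computation, here’s a toy sketch (a simple normal model, not the actual image-reconstruction problem from the thesis) of how the posterior-averaged chi-squared comparison works: for each posterior draw of the parameter, simulate a replicated dataset and compare the realized discrepancy of the replication to that of the observed data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (an illustrative assumption, not the thesis model):
# y_i ~ Normal(theta, 1) with a flat prior, so theta | y ~ Normal(ybar, 1/n).
n = 50
y = rng.normal(0.0, 1.0, size=n)  # data drawn from the model itself

n_sims = 2000
theta_post = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=n_sims)

def chi2(data, theta):
    # Chi-squared discrepancy; note that it depends on the parameter too.
    return np.sum((data - theta) ** 2)

exceed = 0
for theta_s in theta_post:
    y_rep = rng.normal(theta_s, 1.0, size=n)      # replicated dataset
    if chi2(y_rep, theta_s) > chi2(y, theta_s):   # compare realized discrepancies
        exceed += 1

ppp = exceed / n_sims  # posterior predictive p-value
print(ppp)
```

The key move is that the chi-squared comparison is made separately at each posterior draw and then averaged, so there is no need for an analytic degrees-of-freedom correction.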
Since that article has come out, Xiao-Li and I have had recurring discussions with various people about the so-called miscalibration of posterior predictive checks. Here’s an example from 2007, and here’s a blog discussion on the topic from a few years ago. Also this.
The quick summary is that some people (including Sander Greenland) are unhappy that posterior predictive p-values do not in general have uniform distributions when averaging over the prior distribution, whereas other people (including me) see this as a feature not a bug. I think it all comes back to the question of what these p-values are for: is the goal to reject with a specified probability or to explore aspects of model misfit?
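To see concretely what this nonuniformity looks like, here’s a small simulation of my own (a toy conjugate-normal setup with the sample mean as test statistic, which is an extreme case since the posterior essentially “uses up” the sample mean when estimating the parameter): draw many datasets from the prior and model, compute the posterior predictive p-value for each, and look at the spread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (my assumption for illustration): theta ~ N(0, 10^2);
# y_i | theta ~ N(theta, 1), i = 1..n.  Test statistic T(y) = ybar.
n, prior_sd = 20, 10.0
n_datasets, n_sims = 500, 400

pvals = np.empty(n_datasets)
for d in range(n_datasets):
    theta = rng.normal(0.0, prior_sd)        # draw from the prior...
    y = rng.normal(theta, 1.0, size=n)       # ...then data from the model
    post_prec = n + 1.0 / prior_sd**2        # conjugate normal posterior
    mu_post = y.sum() / post_prec
    sd_post = 1.0 / np.sqrt(post_prec)
    theta_s = rng.normal(mu_post, sd_post, size=n_sims)
    ybar_rep = rng.normal(theta_s, 1.0 / np.sqrt(n))  # replicated sample means
    pvals[d] = np.mean(ybar_rep > y.mean())  # Pr(T(y_rep) > T(y) | y)

# Averaged over the prior, these p-values are far from uniform:
# they pile up near 0.5 (a uniform distribution has variance 1/12, about 0.083).
print(round(pvals.mean(), 3), round(pvals.var(), 4))
```

The p-values concentrate near 0.5 rather than spreading uniformly; whether that concentration is a bug or a feature is exactly what the exchange below is about.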
OK, now here’s my conversation with Sander, which unfolded over email over a few days.
Hi, Sander. John Carlin told me that you had a problem with calibration of posterior predictive p-values. First I just wanted to emphasize that I’m not such a fan of p-values and prefer to do these checks graphically (we discuss this in chapter 6 of BDA). But second I don’t think there is any calibration problem at all: any posterior p-value can be interpreted directly as the probability that replicated data will be more extreme (in some dimension) than observed data. This paper might help clarify:
Thanks for sending this final version. I think you sent a preprint of it in March of last year, which seems to be the same.
I don’t know what to say except I still can’t follow your reasoning, and am left as unsatisfied as in the e-mail exchange among you, me and Jim Berger in March of last year.
We can set one minor issue aside: That of what to call a tail area for an observed statistic, versus a uniform statistic. For our debate I am content to call the first a P-value and the second a U-value, as in your sec. 3.
However (like Robins, Ventura and Van der Vaart, as well as Bayarri and Berger) I can’t make any sense of your argument for using a P-value that is not also a U-value (which unfortunately would be called either a UP value or a PU value). At least, I can’t make any frequentist sense of it. That is important because the aforementioned critics are making a frequentist argument, one which I would call Neyman-Pearsonian (NP).
Let me quote directly from your paper some passages that I find problematic:
“The nonuniformity of the posterior distribution has been attributed to a double use of the data (Bayarri and Castellanos, 2007), although this latter claim has been disputed based on the argument that the predictive p-value is a valid posterior probability whether or not its marginal distribution is uniform (Gelman, 2007).”
- this seems like a complete non sequitur. Bayarri and all the other critics of PPP are arguing about its poor (indeed, unacceptable) NP frequency behavior. Whether or not it is valid posterior probability is irrelevant. To understand their complaint you might pretend for a moment you are trapped in the mindset of (say) Finney in 1950 in which Bayesian probabilities are like alchemy and astrology, exiled to the realm of pseudoscience, and that only frequency behavior matters. That the PPP fails to be uniform is a mechanical consequence of using the data twice, once to construct the predictive distribution, then again to find the point on it at which to start the area. This is just a variation on the well-known fact that a classifier constructed from data will, when used to classify the same data, grossly overestimate the sensitivity and specificity of the classifier. Likewise the data gets used in the PPP to construct a predictor and then the predictor gets evaluated against the data used in its construction. Again, whether that is a valid posterior probability never enters your critics’ arguments one way or the other, so I have no idea why you invoke that fact.
The next sentence seems to continue the non sequitur, now for Robins et al.:
“From an applied direction, some researchers have proposed adjusted predictive checks that are calibrated to have asymptotic uniform null distributions (Robins, Vaart, and Ventura, 2000); others have argued that, in applied examples, posterior predictive checks are directly interpretable without the need for comparison to a reference uniform distribution (Gelman et al., 2003).”
- whether or not PPP are “directly interpretable” in some sense is again irrelevant to the issue at hand. The whole point is that, however they are interpreted, they should not be confused with a U-value such as (say) a chi-squared test of fit. And again the reason for this is that the entire justification for using a P-value in your critics’ world (which adopts the framework of NP and Wald, even for “objective Bayes” treatments) revolves around these data-frequency properties:
For any alpha, we require
Validity: rejection of the tested hypothesis no more than a fraction alpha of the time when that hypothesis is true,
Unbiasedness: the rejection rate is minimized when the tested hypothesis is true, and
Uniform power: rejection of the tested hypothesis at the maximum rate possible among valid unbiased tests, whatever the alternative is.
Setting aside technical issues about when exact UMPU tests exist, these properties can be jointly satisfied asymptotically under our standard regression models, and it turns out that the P-values that provide these tests (including the usual Wald, likelihood-ratio, and score test P-values) have to be asymptotically uniform under the test hypothesis, concentrating more and more toward 0 as one moves away from that hypothesis (and of course as the sample size increases, given the test hypothesis is false).
None of this theory invokes posterior probability, and tail area only arises as a consequence of seeking power, not as a core definition as in Fisher’s system.
Now in section 2 you lay out a series of circumscriptions that you seem to imply free you of power concerns. In fact you state you are not interested in power.
You also make statements that I can’t make sense of:
“in a classical setting we want to be assured that the data are coherent with predictive inferences given point or interval estimates.”
What is a “classical setting”? Some classify NP as classical (even though they are really latecomers by a century and a half relative to P-values and posterior probabilities).
If you reject power and NP, what are your criteria for judging a diagnostic? If you talk about repeated-sampling criteria you will be led right back to power, so by coherence you must be assigning a Bayesian meaning, in which case I think your first example exhibits the discrepancy between the formal way you are using “coherence” and what I would consider the practical, commonsensical meaning of coherence (not unlike the difference between formal and informal but commonsensical meanings of “significance”).
“We are working within a world in which the purpose of a p-value or diagnostic of fit is to reveal systematic differences between the model and some aspects of the data; if the model is false but its predictions fit the data, we do not want our test to reject.”
If the model is false in some identifiable way, that falsity will harm the accuracy of out-of-sample predictions and we should want the test to warn us about this.
Whether its predictions fit the analysis data depends entirely on our criterion for fit, so to claim PPP is indicative of fit seems to me circular. The question is, does it warn us as well as it should about discrepancies among the entire set of constraints we are imposing on the problem? (which, as Box took pains to describe, is the data and the data model and the prior).
You invoke the Jeffreys-Lindley paradox, a discrepancy between NP testing and a very particular Bayesian test, which (as Lindley noted) arises from placing a prior spike at the test value, something which (I thought) we agreed made no sense in practice.
Then in section 3 you say
“This property has led some to characterize posterior predictive checks as conservative or uncalibrated. We do not think such labeling is helpful; rather, we interpret p-values directly as probabilities.”
- this seems to me to dismiss the NP perspective of your critics with no justification at all. Labeling PPP as uncalibrated is very helpful: it says they should be ignored when doing an NP analysis. Your writing here also seems to mix up Fisherian and Bayesian P-values. Now I am as ecumenical as anyone, but I view these systems as distinct languages that have to be applied separately and their meaning translated into one another. To mix up Fisherian and Bayesian quantities strikes me as akin to speaking in sentences that are a mix of English and French words and grammar. Sometimes the words are the same or translate easily, but subtle differences can change the interpretation drastically. E.g., even Fisherian tests of fit (like chi-squared) are calibrated, and can also be viewed as 0-1 rescalings of the distance between the data and the data-model manifold in observation space, so why would a Fisherian want a PPP?
Finally, your interpretation of the first example makes no sense to me at all: With y=500 you have direct evidence that something is terribly wrong with at least one of your prior expectations or your current data. If I saw the prior P-value result with some (or any) context I would be immediately suspicious that my whole view of the background or the study generating the data is seriously flawed, e.g., previous or current data has been mis-recorded by orders of magnitude; and I would definitely not want to mix the data and the prior, however little influence the prior had – instead I’d want to report the practical incoherence or inconsistency between the prior and the data!
On the whole then, your paper seems to me to fall far short of justification for using PPP or for ignoring prior checks or Fisherian tests of fit, but perhaps you have better examples (preferably real) or arguments – last year you mentioned you were trying to do something with Jim Berger. Did anything result from that?
All the Best,
From here on, I’ll remove the hellos and goodbyes and other incidental materials. The conversation continues:
The short answer is that I treat Bayesian p-values as posterior probabilities, and I find them to be helpful in many settings. As discussed in that article of mine, I am not particularly interested in constructing tests that reject 5% of the time if the null hypothesis is true. In almost every problem I work on, I know the model is false. To me the point of goodness-of-fit testing is not to have the goal of rejection but rather to reveal aspects of misfit of model to data, which I operationalize as settings where the predicted data from the model differ consistently from the observed data.
I think all the problems you discuss below arise based on the framework in which the goal is to reject false hypotheses as often as possible while keeping a 5% chance of rejecting true hypotheses. But this is not a problem that interests me.
On another matter, I guess we could argue all night about the expression “using the data twice,” as ultimately this phrase has no mathematical definition. From a Bayesian perspective, the posterior predictive p-value is a conditional probability given the data, and as such it uses the data exactly once.
Finally, you write, “perhaps you have better examples (preferably real).” I have many real examples of posterior predictive checks. Chapter 6 of Bayesian Data Analysis has several, also I’ve written many many papers with real examples of posterior predictive checks from my research and that of others. One thing I don’t lack is real examples!
Perhaps we might agree that PPP is not admissible in the precise frequentist sense I described, and thus not admissible to a conjunctive (“and”) ecumenist, by which I mean someone who regards such admissibility as a necessary criterion for adoption of a method, along with Bayesian coherence of the method for priors derived from the context. This is how I would interpret Cox’s eclecticism, as he laid out in many places such as his comment on Lindley (2000) on p. 321-324 of the attached. In the present case I think his philosophy (to which mine is very close) says that if I have a Bayesian method like PPP that falls short of admissibility, the frequentist can offer me a recalibration (as did Efron and Morris for parametric EB intervals, and BB and RVV 2000 did for PPP). This recalibration makes sure I don’t overlook potentially fatal yet easily avoidable errors or model problems, like those I described for possible scenarios that would give rise to your example 1 (where the prior check screams that there is an identifiable mistake somewhere in our formulation, such as data miscoding or fraud, a fact I want to know about but which the posterior check pastes over), and which you may have overlooked in your real examples if you did not also do calibrated checks.
I note that Rod Little summarizes the superficially similar-sounding ecumenicism of “calibrated Bayes” in the attached Am Stat article. I think this article is superb and agree with almost all of it. But toward the end of p. 7 he offers in brief the same defense of PPP as you do, which I find objectionable in the sense I described below for your example 1. So we have evidence that there are at least two types of ecumenicism afoot: One strict or conjunctive that strives for both strong repeated-sampling calibration (as Cox and Hinkley laid out in Ch. 2 of their 1974 book) along with coherency with specified prior information; the other weak or disjunctive (“or”) ecumenicism outlined by Little but apparently attributable to Don (1984) requiring only one or the other criterion be met. (I think that were Jack Good still with us he could offer at least 46,000 more kinds of ecumenicism/eclecticism.)
Finally, I think PPP does indeed use the data twice in a very straightforward arithmetic way: On p. 2597 of your EJS article, you define the PPP as Pr(T(y_rep)>T(y)|y), although I think a close-paren is missing in your expression. The data y appear twice in this expression: once after the conditioning bar “|” and once before. This is not mere notation, as the data are used twice and in this order to compute the expression: once to construct the posterior measure Pr and then again to define the set it is applied to. Why deny this fact? Rod seems to accept it in his wording at the end of his p. 7, and it has no bearing that I can see on the core argument for PPP that you and he share. After all, some superb statistical methodologies count the data twice in essentially the same sense, such as empirical Bayes (something which Neyman himself praised as a brilliant innovation, albeit its developers paid close attention to NP calibration). In the end our divergence here comes down to a matter of philosophy of applied statistics (a very good topic to enrich, I think) and what one means by “calibrated Bayes”, not whether the data are counted twice. And in practical terms, the entire dispute might be resolved if one advocated always presenting and comparing the prior or recalibrated diagnostic alongside the PPP, so that one would be alerted to the problems caught by the prior or recalibrated checks, but missed by the PPP.
I wouldn’t be surprised if most of my readers (at least those who are informed on these issues) will agree with you. But I still think that your view is ultimately a holdover from the “null hypothesis significance testing” approach which I think is ill-founded.
Just to respond briefly: I don’t know what is meant by “not admissible in the precise frequentist sense [you] described.” I have never seen a definition of admissibility in this sense. So I really don’t know what you are saying here.
A posterior predictive p-value is a probability statement regarding replications, and, like any Bayesian statement, it is a correct (i.e., calibrated and fully informative) statement to the extent that the model is true. Of course the model isn’t true, so the probability has this funny interpretation as being a predictive probability, conditional on the model that we don’t believe—but, again, that’s true of every Bayesian probability statement (with some special-case exceptions). Nonetheless, many researchers (including you and me) find Bayesian inference to be useful sometimes.
Regarding my article in EJS, the point is that, in my first example, I have a ppp-value and also a “calibrated” ppp-value (in my terminology, a “u-value” using the same test statistic). And, as I explain in that paper, for the purpose of model evaluation, I prefer my p-value (whose distribution under the prior is highly concentrated near 0.5) to the u-value. Again, I’m doing predictive checks to evaluate model fit, not to maximize the probability of rejecting a false model while having a fixed probability of rejecting the model if it is true.
In my example, you say that the prior check is screaming at me—but my point is that, for the purposes of future inferences, the posterior model is just fine. Meng (cc-ed here), Stern (cc-ed here), and I discuss this in our 1996 paper: prior predictive replications are different than posterior predictive replications. It is possible for a model to work well for posterior predictive replications but not to work with prior predictive replications. Indeed, this happens all the time. A model can work well for predictions of new data from existing groups but not for new data from new groups. To see this, consider a hierarchical model with a super-weak prior on the group parameters and lots of data for each group. Lots of data on each group implies that the group-level parameters will be estimated fine for existing groups; the weak prior is no problem because there’s lots of data. But for new groups, the weak prior kills you, it’s not allowing you to learn about the distribution of the group-level parameters. Or, for another example, suppose the group-level parameters are bimodal but you fit them to a hierarchical normal model. Again, with lots of data within each group you will estimate the group-level parameters fine. But when making predictions for new groups, you’ll just keep drawing from this normal model and this will be wrong. So it makes sense that the predictive check will show a problem when predicting new groups (the prior predictive check) but not when predicting new observations from existing groups (the posterior predictive check).
If you want to say that “Pr(T(y_rep)>T(y)|y)” uses the data twice, then you could just as well say that the expression “y^2” uses the data twice because it multiplies y by itself!
Finally, I agree with your last suggestion which is that users can look at prior predictive checks as well as posterior predictive checks. That’s an excellent point. Indeed, as XL, Hal, and I wrote, prior predictive checks are a special case of posterior predictive checks in which the predictive replication involves re-drawing the parameters as well as the data. In my writing I’ve tended to say that people should look at replications of interest, but there’s no reason not to look at lots of different replications, if that will help.
I am opposed to blanket or mindless null hypothesis testing (NHT) as much as anyone and for two decades have been lecturing on “Overthrowing the Tyranny of the Null Hypothesis”. But that NHT is a grossly overused tool does not negate the fact that it sometimes makes sense to use it. I note that in a number of articles (e.g., Gelman and Weakliem, 2009) you have described the “significance” and “nonsignificance” of null tests quite prominently, so I had the impression that you have used NHT yourself.
In my view, testing the difference between the prior and the likelihood function is an example of useful null testing, not because we believe some null (or that our prior or data model is perfect) but for the Fisherian reason that we want to be alerted to outright statistical contradictions between the prior and the data model (where outright statistical contradiction could mean a 5 sigma difference, as in your example 1 and in detecting the Higgs boson).
Regarding “admissibility,” I was carelessly using “admissible” as a shorthand for the test being approximately uniformly most powerful among unbiased tests (UMPU). So the passage at issue could be rewritten: Perhaps we might agree that PPP is not UMPU, and thus not acceptable to a conjunctive (“and”) ecumenist, by which I mean someone who regards such UMPU as a necessary or at least desirable property for a test.
All probability statements (including so-called nonparametric P-values) are conditional on models, e.g., that somehow selection was randomized within observed covariate levels. Thus I don’t believe any of the models we use in observational health, medical and social sciences are more than rough guides at best. Using ‘model’ in the Boxian sense of prior+data model, I do not want to make inferences from a model (or a posterior computed from it) when the data indicate that at least one of the prior or data model are way off of reality.
To me, example 1 in your Electronic Journal of Statistics paper is demonstrating exactly the opposite of what you claim in your paper: That the PPP can miss important evidence of model problems which are easily seen with other diagnostics. So your basis for ‘preferring’ PPP over supplementing it continues to elude me.
Now if one rigidly insists that the data model is correct (which neither of us wants to do) and then has a situation like example 1 in which the prior and data model conflict statistically, one can claim as you do that the high PPP indicates that the prior (which must be bad) did not harm your inferences much. It seems to me however that this defense of PPP corresponds to nothing more than the old observation that, in finite-parameter problems, the data eventually swamp the prior (assuming the prior support includes the true value, as it must in your example). This defense never appealed to me, because why would one want to use inferences contaminated by a discredited prior?
When you say “the purposes of future inferences, the posterior model is just fine,” this is only if the data model is correct. If the divergence between the prior and the likelihood function arose from undetected data error or fraud, or from data-model mis-specification, I don’t think the posterior model will be just fine for all future inferences, and those possibilities are things to worry about in practice.
I am concerned to know if the analyst has clearly failed to predict the new groups with any kind of accuracy. It could call into question the analyst’s specification competence as well as the data integrity, so I could see why the prior check would be unpopular.
Regarding “using the data twice,” I think you missed my point completely, which is that y appears on both sides of the conditioning bar “|”. In other words, the way that y is used both to construct the sampling (data) measure and to define the sampling event being measured leads to the need for recalibration. This observation applies to EB inferences, which are respectable frequentist tools, as well as to PPP. It seems to me that you are treating this “double-counting” observation defensively as if it is some sort of insult of PPP, rather than the matter of fact that it is. It is just a math property that alerts a frequentist to the need for (and is addressed straightforwardly by) recalibration. Bayesian computations will analogously yield incoherently precise posteriors if the data are double counted in similar ways and this shortcut is not accounted for, e.g., if nuisance parameters are replaced by their MLEs and subsequently treated as fixed quantities in posterior computation, instead of being integrated out.
Finally, is it safe to say we agree for practical purposes that one should perform and report the prior vs. likelihood check, or a recalibrated PPP, even if one also wants a raw PPP? Does anybody who has read this far disagree?
In my paper with Weakliem, we show some null hypothesis significance testing results not because we think null hypothesis significance testing is a good idea in that example but rather to communicate with that aspect of the scientific community that uses these tests. In fact, in that paper we state that, whether or not the observed comparison were statistically significant, we would not believe the claim. Further discussion of this point is in my recent paper with John Carlin for Perspectives on Psychological Science:
Regarding UMPU: As I stated earlier, UMPU is based on the framework in which the goal is to reject false hypotheses as often as possible while keeping a 5% chance of rejecting true hypotheses. But this is not a problem that interests me. I just don’t care about UMPU nor do I think it is a useful general principle for model checking. I don’t see why wanting UMPU makes someone an ecumenist.
Finally, you write, “is it safe to say we agree for practical purposes that one should perform and report the prior vs. likelihood check, or a recalibrated PPP, even if one also wants a raw PPP?” No, I don’t agree with this. If someone wants to report posterior predictive checks for different replications, that’s fine, I agree that it’s a good idea but I wouldn’t call it a “should.” I think it’s best for people to realize that model checking depends on the purposes to which a model will be used. Part of this dependence-on-purposes comes in the choice of test variable (for example, if you test a model based on the skewness, you might miss a problem of the kurtosis), and this is well understood. Another part of this dependence-on-purposes comes in the choice of replication. In a hierarchical model with data y, local parameters alpha, and hyperparameters phi, one can consider three possible replications:
(i) same phi, same alpha, new y
(ii) same phi, new alpha, new y
(iii) new phi, new alpha, new y
You seem to want to privilege (iii), and that’s fine, it might make a lot of sense in your practice. But in other cases, for other people’s models, (i) or (ii) might make more sense. For example if someone has a flat prior on phi, they can’t do (iii) at all. But that doesn’t mean the model can’t be checked.
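In code, the three replications differ only in which quantities get redrawn. Here’s a schematic sketch for a simple normal hierarchical model (the `replicate` function and the hyperprior used in case (iii) are hypothetical, just to illustrate the structure):

```python
import numpy as np

rng = np.random.default_rng(0)

def replicate(kind, mu, tau, sigma, alpha, n_per_group):
    """One replicated dataset for y_jk ~ N(alpha_j, sigma),
    alpha_j ~ N(mu, tau), hyperparameters phi = (mu, tau, sigma)."""
    J = len(alpha)
    if kind == "i":        # (i) same phi, same alpha, new y
        a = alpha
    elif kind == "ii":     # (ii) same phi, new alpha, new y
        a = rng.normal(mu, tau, size=J)
    elif kind == "iii":    # (iii) new phi, new alpha, new y --
        # requires a proper hyperprior; here an assumed N(0, 5) sketch.
        mu = rng.normal(0.0, 5.0)
        tau = abs(rng.normal(0.0, 5.0))
        a = rng.normal(mu, tau, size=J)
    return rng.normal(a[:, None], sigma, size=(J, n_per_group))

# Simulated data, with crude point estimates standing in for posterior draws.
J, n = 50, 20
mu0, tau0, sigma0 = 0.0, 2.0, 1.0
alpha0 = rng.normal(mu0, tau0, size=J)
y = rng.normal(alpha0[:, None], sigma0, size=(J, n))
alpha_hat = y.mean(axis=1)

rep_i = replicate("i", mu0, tau0, sigma0, alpha_hat, n)
rep_ii = replicate("ii", mu0, tau0, sigma0, alpha_hat, n)

# Replication (i) preserves the observed group structure; (ii) does not:
# the group means of rep_i track the data, those of rep_ii are fresh draws.
corr_i = np.corrcoef(y.mean(axis=1), rep_i.mean(axis=1))[0, 1]
corr_ii = np.corrcoef(y.mean(axis=1), rep_ii.mean(axis=1))[0, 1]
print(round(corr_i, 2), round(corr_ii, 2))
```

A check built on (i) asks whether the model reproduces new data from these groups; (ii) and (iii) ask progressively stronger questions about new groups and new settings.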
Well, no convergence between us here, so I guess the controversy over PPP will remain unresolved once again (as with the issue of what constitutes ‘valid’ as opposed to ‘proper’ imputation). Maybe that makes it all the more worth blogging about. If there is one thing our new debate revealed to me, however, it is that I am much more frequentist than I initially thought! Hence I’m going to forward our exchange to Mayo, since with her I spent most of my time defending Bayesian perspectives.
Just a footnote re UMPU: As UMPU is just one of many possible formalizations of how to optimize a test, and unbiasedness is a particularly dubious criterion, I don’t want to push UMPU. But in frequentist evaluation of methods, I do believe it is important to deploy some kind of optimization criteria to minimize error or loss (whether in testing or estimation) and I was just using UMPU as a well-known example. The problem for me is that I still don’t know what frequentist error-minimization criteria are satisfied by PPP. Thus, for now, due to calibration objections, PPP will remain inadmissible (in the informal English sense) in my methodology, which as I said tries to satisfy both frequentist and Bayesian desiderata; I don’t want to ignore or dismiss objections from either frequentist or Bayesian perspectives, even though satisfying both completely may often be impractical if not impossible.
There are multiple frequentist desiderata; indeed, I have written that one strength of the frequentist approach is that the frequentist can use commonsensical and subject-matter concerns to choose the appropriate desiderata for any particular problem. I’m sure there are some problems for which it makes sense to have the goal of rejecting false hypotheses as often as possible while keeping a 5% chance of rejecting true hypotheses. It’s just not something that’s ever come up in the hundreds of problems I’ve worked on over the years. To me, null hypothesis significance testing is a tool that can be useful but I don’t see it as being based on underlying principles that make sense, at least not in the applications I’ve seen. The idea of model checking, though: that seems important to me. I am not interested in demonstrating that a model is false—that’s something I already know—but I am interested in understanding the ways in which a model does not fit the data.
And Greenland ends it with:
I too am most interested in understanding ways in which the model (which again I take as prior+data model) fits poorly, and not in demonstrating the model is false, which I already know (as you do). Thus it is interesting how we can agree in principle when that principle is phrased in broad and nebulous English, but diverge about devilish details.