## Prior information, not prior belief

From a couple years ago:

The prior distribution p(theta) in a Bayesian analysis is often presented as a researcher’s beliefs about theta. I prefer to think of p(theta) as an expression of information about theta.

Consider this sort of question that a classically-trained statistician asked me the other day:

If two Bayesians are given the same data, they will come to two conclusions. What do you think about that? Does it bother you?

My response is that the statistician has nothing to do with it. I’d prefer to say that if two different analyses are done using different information, they will come to different conclusions. This different information can come in the prior distribution p(theta), it could come in the data model p(y|theta), it could come in the choice of how to set up the model and what data to include in the first place. I’ve listed these in roughly increasing order of importance.

Sure, we could refer to all statistical models as “beliefs”: we have a belief that certain measurements are statistically independent with a common mean, we have a belief that a response function is additive and linear, we have a belief that our measurements are unbiased, etc. Fine. But I don’t think this adds anything beyond just calling this a “model.” Indeed, referring to “belief” can be misleading. When I fit a regression model, I don’t typically believe in additivity or linearity at all, I’m just fitting a model, using available information and making assumptions, compromising the goal of including all available information because of the practical difficulties of fitting and understanding a huge model.

Same with the prior distribution. When putting together any part of a statistical model, we use some information without wanting to claim that this represents our beliefs about the world.

1. Xi'an says:

Looks like you are preparing your intervention at the Bayesian Fiducial and Frequentist conference! I agree about the ambiguous meaning of “belief”, but the same applies to “information”, so in the end there is always an arbitrary choice in turning items of information into a single model. The resulting inference is conditional or relative to such choices, which is fine by me. Actually, to mimic the question of the classically-trained statistician, a Bayesian provided with the same data could well return several conclusions, relative to different modeling choices…

• Fernando says:

The preference for “information” vs “belief” largely turns on semantics. It would help if Andrew could provide clear definitions for each. What allows us to distinguish one from the other? And can this distinction be made objectively?

After all, is the statement “Obama is spying on my emails” information, belief, lies, delusion?

• I think this is a case like “Free software” and “significant result” where the goal is to move away from an ambiguous word.

Degree of Belief could mean anything from “degree to which something seems plausible” (credence or plausibility or dare I say it… probability, which is problematic because it has two meanings as well) to “degree to which I will commit acts of actual violent terrorism to further my cause” (pseudo-religious fervor)

“information” on the other hand rarely connotes religious fervor. The point then is that to the extent that you can rule out or partially rule out ahead of time some values for an unknown, you can encode this information into your analysis, and you don’t want to use a term that implies that somehow only heathens fail to agree with you, etc.

• Andrew says:

Fernando:

I think of belief as an internal mental state, whereas information is external to the individual. Sure, the boundaries are fuzzy—if I report my belief, my answer is a survey response which is now shareable information—but in general the idea is that information can be identified in some external sense. In applications, the point is, when specifying models, to explain what external information is being used to make these decisions.

• Exactly as I suspected and wrote in my comment below (wrote it at the same time as Andrew’s—no collusion).

• I like the conception of statistics as method of argument. If I want to argue that you should think that X is true using classical logic, I’ll need to say something like “IF A then X” which you will reply “Yes I agree” and then argue, “B and C and D are true” and if you agree then you’ll say “yes I agree” and then I argue “B,C,D imply A because REASON” and if you think carefully about this and agree, you’ll say “yes” and then I’ll say, “But you’ve agreed that if A then X and so therefore X” and you’ll say “yes fine I agree” but if not, you’ll go back and look at the structure of the argument and see if you went wrong somewhere, not thinking carefully enough about some portion.

Now of course, at each turn you might disagree as well, but the point is that we argue starting from some base of knowledge that we agree on, using models of the world, towards some conclusion.

Bayes does this exact thing, except with a real-valued positive number that assigns how much you should agree instead of TRUE vs FALSE for whether you do or don’t agree.

So, in this context, a Bayesian model should be seen as a philosophical/logical argument for a conclusion, and so you’re absolutely right, it is critical to “explain what external information is being used to make these decisions” that is, besides just formal mathematics, which are mostly uncontroversial (ie. p(A,B) = p(A|B) p(B) or the like), you need to explain the relationship between the mathematics and the real world. There is no way around that, and it should be something where there is the possibility of thoughtful controversy, and it’s why I hate “default” statistical methods. Default statistical methods short-circuit the thinking that is required to do a good job of logical argument.

Now, that isn’t to say we shouldn’t use default models such as logit or linear regression etc, but we should think to ourselves *why* is this an appropriate choice? For example, logit regressions are usually thought of as a linear regression on a transformed scale, but there’s nothing that says you need to do LINEAR regression on this scale. The logistic curve is just a link function that constrains the output of a function to the range [0,1]; ANY link function that does this would work FINE provided that you are willing to do nonlinear models on the input.

Y ~ error_model(logistic_curve(nonlinear_function(x)), error_parameter)

is a general purpose model capable of representing ANY continuous function that maps from x space to [0,1]

so, “I chose logit because it’s sufficient for any purposes” just like “I chose polynomial regression because it’s a complete basis for continuous functions on a finite interval” are *good* arguments, whereas “I chose linear-logistic regression because it’s what we always do in this field” is a *bad* argument.

Putting in that explanation is critical.
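The general form written out in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quadratic `nonlinear_function`, its coefficients, and the choice of a Bernoulli error model for binary Y are all hypothetical and are not taken from the comment (a Bernoulli model also has no separate error parameter).

```python
import math

def logistic_curve(z):
    """Map a real number into the interval (0, 1), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def nonlinear_function(x, coefs=(-1.0, 0.5, 2.0)):
    """Hypothetical nonlinear predictor: c0 + c1*x + c2*x**2 (an assumption)."""
    c0, c1, c2 = coefs
    return c0 + c1 * x + c2 * x * x

def predicted_probability(x):
    """logistic_curve(nonlinear_function(x)): always between 0 and 1."""
    return logistic_curve(nonlinear_function(x))

def bernoulli_log_likelihood(xs, ys):
    """Assumed error model: each y is Bernoulli with the predicted probability."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0):
        print(x, predicted_probability(x))
    print(bernoulli_log_likelihood([-1.0, 0.0, 1.0, 0.5], [1, 0, 1, 0]))
```

With these made-up coefficients the predicted probability dips at x = 0 and rises on either side, which a linear predictor on the logit scale could not do. The link keeps the output in range; the shape comes from the function inside it.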

• Martha (Smith) says:

+1

• Cliff AB says:

‘so, “I chose logit because it’s sufficient for any purposes” just like “I chose polynomial regression because it’s a complete basis for continuous functions on a finite interval” are *good* arguments, whereas “I chose linear-logistic regression because it’s what we always do in this field” is a *bad* argument.

Putting in that explanation is critical.’

While that seems like a nice idea in theory, it’s been my experience that this much more often than not becomes a really bad idea in practice.

As an applied statistician, I’ve had the opportunity to look over a lot of work from other statisticians. An extremely common trend that I’ve seen is that the models that were built “on the fly” especially for some experiment were very, very likely to be extremely wonky. All of these models were quite clever and fancy, but very frequently fell apart upon close inspection. And this is not too surprising; it’s really hard to know what data issues you will have before you see your data, unless this is an experiment you’ve seen several times.

Forget NHST, that’s going to give you terrible feedback from your data (although building wonky models, deciding they are not appropriate, building another wonky model until you get your desired results is also a terrible way to abuse the NHST framework that happens all the time). But when it comes time for peer review, the non-statisticians reviewing the paper have very little power to complain about the methods; they don’t (nor should they) know what the statistician is talking about!

I’m 100% supportive of careful decision making about models. But, based purely on the empirical evidence of observed statistical practice, I think meticulous improving of “best practices” in various data settings is much, much better than cooking up whatever idea just popped into your head for how to analyze the latest dataset you’ve seen. If you’re looking at a totally new type of dataset with totally unique properties, then go ahead, start the best-practices book for these situations. But also know that your first thought about how to model this data is almost certainly doing something very wrong!

• Cliff, can you elaborate on what “this” means in “this much more often than not becomes a really bad idea in practice. ” ??

The way you’ve written it it makes it seem like “this” refers to “creating an argument about why your model is a good choice” and I just can’t fathom any situation in which doing so is bad.

• Cliff AB says:

My point is not so much about “creating an argument about why your model is a good choice” (arguments about validity are good, but in my experience, not great; everyone thinks their model is great and everyone else’s is crap, so that’s a whole slippery slope I will avoid for now), but rather getting a dataset and coming up with a novel model to fit after some philosophical based (rather than empirically based on several example datasets) debate about what would be the best way to analyze it.

To be clear, I’m not directly disagreeing with you; just saying “everyone’s always done it this way!” and never giving the model another thought is not good, and that’s what I believe you are saying. But in my experience, that’s better than the results I’ve seen from the other end of the spectrum: “here’s this new model I just cooked up yesterday and I think it’s gotta be better because of X, Y and Z!”. And having reviewed too many other statistical reports, it’s my view that encouraging statisticians to buck established methods and argue their own models would make things worse, not better. Perhaps that’s field dependent.

I’ve seen a lot of people on this blog upset about how various agencies don’t accept the “fancier” analysis. My experience with reviewing other analyses is that we should be extra skeptical about the “fancy” analysis and only accept those statistical methods once they’ve been thoroughly tested; lots of things look great on paper and end up being disastrous in practice!

• Curious says:

And replication and prediction distinguish those analysts who understand what they are doing from those that do not.

• Curious says:

An unwillingness to innovate in the face of poor modeling practice is not a virtue.

• Curious says:

Cliff:

In my experience, it has been a stubborn resistance to acknowledging the limitations and weaknesses of “tried and true” methods in the face of strong evidence that they are not actually working. “Because the p-value is significant.”

• Curious says:

“Because the data speaks for itself, we don’t need to use a model.”

• Curious says:

“Because the best predictor of future behavior is past behavior.”

• Curious says:

The reality is that when methods are used that don’t actually control for the confounds in the data and those confounds change, guess what: the model will no longer be predictive.

• Cliff AB says:

Curious:

What you are describing, if done right, is what I would call ‘meticulous improving of “best practices”’; showing that time and time again, your new proposed model predicts better than the traditional models.

All that I am saying is that I’ve seen way too much of the opposite problem; crazy models that confuse everyone, including, as I have seen too often, the original statistician who came up with them.

• Curious says:

The point I am trying to make is that I believe discouraging innovative thinking is the wrong way to go and that a better path forward is both encouraging innovative thinking and substantial thinking about the theoretical assumptions of measurement, model structure, and distributions that are not normal.

• “crazy models that confuse everyone, including, as I have seen too often, the original statistician”

is not what I would call in the same category as:

“you need to explain the relationship between the mathematics and the real world. There is no way around that, and it should be something where there is the possibility of thoughtful controversy” (quoting from my post).

So I don’t think we disagree, I just think you’re pointing out that people do “innovation” without “explanation of the relationship between the mathematics and the real world” and that this “innovation” is also harmful. I’ll go along with that.

• Cliff AB says:

Curious + Daniel:

Right, I don’t think we are really in total disagreement: in your experiences, you’ve worked with agencies, etc, that are too slow to pick up on improvements in methodologies. I don’t have experience with such agencies, but it’s hard for me to imagine that they can be unnecessarily slow.

My experience is in a world without such agencies, just peer reviewers, who aren’t strict enough about being skeptical about new statistical methodologies. My only point is that this is not a Utopia; people are really good at coming up with philosophical reasons why their new statistical model is great, but my experience is that more often than not, the newest idea often opens up many more problems than it fixes. So while we should be open to improving practices, in my experience “creating an argument about why your model is a good choice” is not worth very much, and just about nil compared to empirical evidence.

• Cliff AB says:

that is, “it’s NOT hard for me to imagine that they can be unnecessarily slow”

• Curious says:

Cliff:

I agree that reviewers who do not understand the underlying assumptions of both the existing method and the implications of a proposed method will not be capable of making valid decisions about them. That said, I still do not agree that proposing new methods is not the way to go. I simply think that more and better critical thinking about the implications of those methods should be a part of the rationale.

A default of “no new methods” results in the current state of research methods in many areas of psychology. And to your point, the application of sophisticated methods to the same crude and noisy measures contributes to the same problem.

I still think the best way forward is to encourage innovation, but require sound rationale that addresses the known and likely problems of implementing such a method.

• Curious says:

Cliff:

It may also make sense to require both methods to be used initially until the new method can be demonstrated as providing utility not provided by existing methods.

• Fernando says:

Andrew:

If I understood you right, you are saying beliefs are private, whereas information is public. If so, frequentist analysis relies on beliefs (for it does not make priors explicit) whereas Bayesian analysis uses information (since priors are made explicit).

Ironic, because the critique by frequentists is often that they don’t like using beliefs in inference.

You are going to trip a lot of people over with this use of language, as in “Frequentists need to stop using subjective beliefs!”

• Andrew says:

Fernando:

I don’t find the terms “objective” and “subjective” to be particularly helpful when talking about statistical methods; see this article with Christian Hennig.

• Fernando says:

So maybe the right distinction is:

Information is intersubjectively available, beliefs are not.

I’m just trying to nail down the distinction you are making.

• Glen M. Sizemore says:

“I think of belief as an internal mental state…”

GS: Would a belief be a cause of behavior?

2. Keith O'Rourke says:

I prefer the term representation to information as that allows guessing and conjectures in the specifying of priors and data generating models (those guesses and conjectures are being informed by the information one has – but do go somewhere beyond it). https://andrewgelman.com/2017/04/19/representists-versus-propertyists-rabbitducks-good/#comment-475235

• george says:

Representation of what? (Please don’t say beliefs or information)

• Keith O'Rourke says:

george:

From the link I gave “You have a representation of empirical phenomenon [the reality you are trying to grasp not too wrongly] with unknown parameters, a random model for how the parameters’ values were set/determined and a random model for how observable[s] came about and were observed/recorded.”

3. Wayne says:

“If two Bayesians are given the same data, they will come to two conclusions.”

First, the “will” part bothers me. Will they actually come to substantially different conclusions? That’s a big assumption. They may come to the same conclusion, or their conclusions may differ so slightly that they can be accepted as the same answer. In any case, they will have made their data, model, and priors explicit so that they can be discussed and sensitivity analysis can be done to see to what degree any difference in conclusion is driven by their choices of priors. In fact, the act of making their priors explicit is a useful contribution to their field in and of itself.

Second, I think of similar questions like, “If two economists are given the same data, they will come to two conclusions. Does it bother you?” If that doesn’t result in your rejecting the entire field of economics, then why should it make you uncomfortable with a Bayesian approach?
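The sensitivity analysis mentioned above can be illustrated with a small conjugate example. This is a minimal sketch, assuming a Beta prior on a success probability and binomial data; the two priors ("flat" and "skeptical") and the counts are hypothetical, chosen only to show how one might check whether a difference in conclusions is driven by the prior.

```python
def beta_posterior(prior_a, prior_b, successes, failures):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Two analysts with different (hypothetical) prior information.
PRIORS = {"flat": (1.0, 1.0), "skeptical": (2.0, 8.0)}

def posterior_means(successes, failures):
    """Posterior mean under each prior for the same data."""
    return {
        name: beta_mean(*beta_posterior(a, b, successes, failures))
        for name, (a, b) in PRIORS.items()
    }

def posterior_mean_gap(successes, failures):
    """How far apart the two analysts end up, given the same data."""
    means = posterior_means(successes, failures)
    return abs(means["flat"] - means["skeptical"])

if __name__ == "__main__":
    for data in [(12, 8), (120, 80)]:
        print(data, posterior_means(*data), posterior_mean_gap(*data))
```

In this toy setup the gap between the two posterior means shrinks as the counts grow, which is one way of seeing that conclusions "may differ so slightly that they can be accepted as the same answer." With little data the prior matters more, and writing the priors down is what makes that visible.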

• NatashaRostova says:

It’s hard to imagine two Bayesians coming to two separate conclusions on whether the sun will rise tomorrow given observable data, even if their priors differ :)

4. Andrew says:

X, Keith, Wayne:

All your comments are reasonable. Let me clarify that what I’m saying is that the prior information represents external information, not that it is that information. Or, to put it another way, there is some imperfect mapping between the scientific or subject-matter information being encoded, and the prior distribution (and, for that matter, the data model) being used in the statistical analysis.

5. Speaking as someone who started out as a dyed-in-the-wool error-statistical frequentist, and has moved to appreciating and starting to use Bayesian approaches, here’s how I’ve been thinking about it:

When I set up a problem, I treat any given model as a (not that bright) thing with its own beliefs about the world. The model itself is a subjective Bayesian with incredibly rigid beliefs about the state of the world; not only in the priors, but in the likelihood, the data sampling process, etc. The model then updates its own beliefs with the data I give it. After that, my job is to interrogate the model to see if its view of the world leads to major errors: can the model generate accurate estimates when I give it a known data generating process? After fitting to real data, does the model make poor predictions about other facets of the data? After it’s passed enough tests that I’m willing to tentatively trust its results do I share the results. However, at no point in the process does the model represent my exact beliefs; I’m still looking for models that have good error-statistical properties.
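One of the checks described above, giving the model a known data generating process and seeing whether it recovers the truth, can be sketched as follows. The normal model with known sigma, the prior settings, and the number of replications are all assumptions made for illustration; the commenter does not say which models or checks were actually used.

```python
import math
import random

def normal_posterior(prior_mean, prior_sd, data, sigma):
    """Posterior for a normal mean with known sigma and a normal prior."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(data) / sigma ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    sample_mean = sum(data) / len(data)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, math.sqrt(post_var)

def interval_coverage(true_mean, n_obs=20, n_reps=500, sigma=1.0,
                      prior_mean=0.0, prior_sd=10.0, seed=0):
    """Fraction of simulated datasets whose 95% posterior interval contains the truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        data = [rng.gauss(true_mean, sigma) for _ in range(n_obs)]
        mean, sd = normal_posterior(prior_mean, prior_sd, data, sigma)
        if mean - 1.96 * sd <= true_mean <= mean + 1.96 * sd:
            hits += 1
    return hits / n_reps

if __name__ == "__main__":
    # A weakly informative prior versus a rigid prior centered far from the truth.
    print("wide prior :", interval_coverage(true_mean=1.0, prior_sd=10.0))
    print("rigid prior:", interval_coverage(true_mean=1.0, prior_sd=0.1))
```

The second call gives the model "incredibly rigid beliefs" that happen to be wrong, and the simulated coverage collapses. That is the kind of major error the interrogation is meant to expose before any results are shared.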

• Leon says:

+1

After that, my job is to interrogate the model to see if its view of the world leads to major errors: can the model generate accurate estimates when I give it a known data generating process? After fitting to real data, does the model make poor predictions about other facets of the data?

Doing this for a bunch of models specified by different (hyper)parameter values is Box-style “model criticism” — which is basically the logic of p-values, but “one level up” so to speak.

• Corey says:

Are you using “error-statistical” in the sense of Mayo’s error statistics?

• I am (assuming I’m not massively misinterpreting her ideas). I think effective statistical practice requires quantifying error rates of different methods, and trying to understand how frequently different types of error arise from a given method. Also, we can only trust results that have been tested and passed a wide range of tests for potential errors. Neither a low p-value nor a high posterior density is by itself sufficient to say a given result reliably reflects the aspect of nature we think it’s trying to measure.

• Keith O'Rourke says:

Eric: You might find Mike Evans’ relative belief approach of interest – here is X’s review from a couple years ago https://xianblog.wordpress.com/2015/07/22/measuring-statistical-evidence-using-relative-belief-book-review/ (with response by Mike and me in the comments).

6. Carlos Ungil says:

Doesn’t everybody think of the prior p(theta) as the expression of the information available to them about theta? I always understood belief as uncertain knowledge, not as faith. De Finetti’s definition of subjective probabilities (“the degree of belief in the occurrence of an event attributed by a given person at a given instant and with a given set of information”) seems applicable to the prior as you think of it, given that the mapping between the external information and the prior distribution is somewhat subjective: it depends on many assumptions made by the researcher.

7. Clark says:

When I’m analyzing the data from an experiment, I am principally interested in what that experiment, in isolation, is telling me. Not what that experiment is suggesting in the context of (and bias of) prior expectation. To me, this is fundamentally good science. To me, the role of a prior is more in a meta-analysis context, so an argument could be made to report both a frequentist (no or null prior) and Bayesian interpretation in a single paper (this is what we see, this is what we think this means in the context of prior knowledge), or to typically report just the frequentist results when reporting the results of a single study, followed by review papers amalgamating these in a Bayesian sort of meta-analysis to provide a sense of “what we think we know”.

Yes, I appreciate the issues of p-hacking and so-on, but I think these are as easily (perhaps more so) abused from a Bayesian perspective as from a frequentist approach.

• Clark says:

To clarify this a bit further — I would be comfortable with a prior which is informed by the data from the study being analyzed. For instance, a prior which takes into account data quality like extreme outliers and deviation from the assumed distribution, as well as for missing data or drop-out. If I understand it correctly, this sort of thing is done with penalized splines, and I think that’s great.

• Carlos Ungil says:

If you think the data is unreliable you should use the likelihood to account for that, not the prior.

• Yes, and I think this is the part that gets people with a frequentist background so confused. In frequentist statistics you’re supposed to use the long run frequency of the data as the likelihood. Of course, most frequentist analyses use normal distributions, and in many cases the analysis is looking at something like the sample mean and using the CLT to get the likelihood. This leads people to think that they *have* to use a certain default class of model to model the data and that the main thing a Bayesian analysis adds is the ability to fiddle with a prior to adjust your inference or something.

In a Bayesian analysis you’re free to use anything you think is reasonable as the likelihood for the data. Each individual point can have a different pdf if you have some information that tells you this makes sense. Or you could treat your whole dataset as a single sample from a joint pdf over N dimensional vectors (see Gaussian process regression etc).

• KKnight says:

“In a Bayesian analysis you’re free to use anything you think is reasonable as the likelihood for the data. each individual point can have a different pdf if you have some information that tells you this makes sense …”

Frequentists also do exactly this. The main difference is there’s no prior although some sort of regularization might be used. Horses for courses, as the Brits might say.

• But of course in a frequentist analysis, you’re not warranted in applying a distribution to a data-point unless you think, in the long run, that would be the frequency distribution for “similar” data points. You really should specify this reference class, collect a bunch of data, and do some frequentist tests over goodness of fit before applying this distribution.

So, the main difference isn’t “no prior” it’s “long run frequency under repetition” vs “plausibility of this particular value”.

Let me give you an example, suppose there is some audio transmission coming through an electronic circuit, and I measure the voltage, and I round-off to 4 bits… Now in a Bayesian analysis I can say “this particular value was X and my information about the difference between this value and the real voltage is therefore that the real voltage is between X - 1/32 and X + 1/32 for sure, so I will use uniform(-1/32,1/32) for the error of this *particular* measurement.”

Now, in a frequentist analysis, you can only assign uniform(-1/32,1/32) not to a particular measurement, but to an ensemble of “similar” measurements, that is, if you somehow believe that repeatedly measuring these errors you will get *uniform frequency of errors* (or that this is sufficiently good approximation).

In a frequentist analysis, the distribution to assign for a particular data point is an in-principle verifiable fact about how often things occur…

In a bayesian analysis, the distribution to assign for a particular data point is an approximate fact about the set of assumptions the model-builder is willing to entertain…
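The round-off example can be made concrete with a short sketch. It assumes the voltage is scaled to lie between 0 and 1, so that 4 bits give a step of 1/16 and a maximum rounding error of 1/32, and it assumes a flat prior over a grid of candidate voltages; the scaling, the grid, and the function names are all hypothetical.

```python
STEP = 1.0 / 16.0        # spacing of 4-bit levels on an assumed [0, 1] scale
HALF_STEP = STEP / 2.0   # 1/32: the largest possible rounding error

def quantize(voltage):
    """Round a voltage to the nearest 4-bit level."""
    return round(voltage / STEP) * STEP

def rounding_likelihood(reading, true_voltage):
    """Density of the reading under uniform(-1/32, 1/32) error around the truth."""
    return 1.0 / STEP if abs(reading - true_voltage) <= HALF_STEP else 0.0

def grid_posterior(reading, grid, prior):
    """Normalize prior * likelihood over a grid of candidate true voltages."""
    weights = [p * rounding_likelihood(reading, v) for v, p in zip(grid, prior)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    reading = quantize(0.40)
    grid = [i / 100.0 for i in range(101)]
    prior = [1.0 / len(grid)] * len(grid)   # flat prior, an assumption
    posterior = grid_posterior(reading, grid, prior)
    support = [v for v, p in zip(grid, posterior) if p > 0.0]
    print("reading:", reading)
    print("voltages still plausible:", support)
```

Nothing here claims that rounding errors would be uniformly distributed over repeated measurements. The uniform density only encodes what is known about this one reading: the true voltage must lie within 1/32 of it.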

• Christian Hennig says:

Frequentists assign distributions to single measurements all the time. Take a time series model. There’s only one observation at any time t, still a frequentist interpretation of the model for the full time series implies statements about limiting frequencies under repetition of the whole series including time point t.
These models are normally set up in such a way that some component of the model (innovations, error terms) are indeed modelled to be repeated over different time points, and this can be tested (though not “verified”, only falsified), but there is always, on top of this, a hypothetical repetition involved that cannot be realised and tested (for lack of the ability to replicate at an identical time point).

Obviously this can be used to criticise the frequentist setup as an illusion (although I personally see it as a feature rather than a bug that at least some part of it can be tested), anyway, it is done and applied all the time and has been found useful occasionally (yes you’ll say that whenever frequentism is useful it’s because it’s equivalent to an appropriate Bayesian analysis but I ignore this for the moment).

• I think we need to make a distinction between people who call themselves or think of themselves as Frequentists, and the actual philosophy of Frequentism. When I say “a frequentist does such and such” I mean “an idealized someone who is committed to the idea that probability distributions can only represent the frequency of occurrence of stuff in the world” not “joe who thinks of himself as frequentist”

Every time you’re calculating a likelihood and you haven’t verified that this likelihood is in fact a good approximation to the frequency of occurrence of the thing you’re modeling under some reasonably well specified conditions, you are doing a Bayesian calculation, whether you like it or not.

On the other hand, things like permutation tests, or Kolmogorov-Smirnov or Anderson-Darling or chi-squared goodness of fit, or the like are fundamentally frequentist, they represent tests of whether theoretical frequencies of occurrence do in fact match observed frequencies.

Just being unwilling to use non-iid likelihoods or being unwilling to use priors doesn’t in my book make your calculation Frequentist. It needs to be the case that actual frequencies of occurrence enter into your calculation.
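To see what it means for actual frequencies of occurrence to enter the calculation, here is a sketch of an exact two-sample permutation test, one of the procedures named above. The choice of test statistic (absolute difference in means) and the tiny example data are assumptions made for illustration.

```python
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

def exact_permutation_p_value(group_a, group_b):
    """Fraction of all relabelings whose mean difference is at least as
    large (in absolute value) as the observed one."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    pooled_sum = sum(pooled)
    as_extreme = 0
    n_splits = 0
    # Enumerate every way of choosing which observations are labeled "A".
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)
        if diff >= observed - 1e-12:   # small tolerance for float round-off
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits

if __name__ == "__main__":
    print(exact_permutation_p_value([1, 2, 3], [4, 5, 6]))
```

The p-value here is literally a relative frequency: a count of relabelings divided by the total number of relabelings. No distributional form is assumed for the data, which is why this kind of test sits comfortably under the frequency interpretation described in the comment.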

• Christian Hennig says:

You think that somebody who uses a non-i.i.d. time series model cannot be a frequentist? This looks like a pretty unusual definition of frequentism to me.
By the way, frequentists can do Bayesian calculations. Look at Bayes’s original paper; the very first prior distribution to appear in history has a fine frequentist interpretation in the framework of a repeatable process.

• Christian: unless there is a reasonably well defined class of observations that could in principle become an ensemble whose frequency is being modeled, there can be no sense of frequency. That’s just *what it means* to have a frequency.

Every time a frequentist calculation is done and claims “the frequency of event A is F(A)” there needs to be

1) A class of conditions in the world that would be labeled A containing many instances (rainy days in Los Angeles, Chickens born at industrial farms in Kentucky, Closing prices of stock X, returns on closing price of stock X in each of the 3 days after announcements of fraud… whatever)

2) An argument why through time during the period of interest, the frequency with which A occurs is stable, and is approximately equal to F(A)

3) Some data that is consistent with (2) and not utterly rejected by goodness of fit tests

If you’ve got that, then I’ll agree that the analysis is Frequentist. Otherwise, the analysis can only be interpreted as Bayesian due to the simple lack of any meaningful frequencies. And most likely not a very good Bayesian model, because there is some inherent confusion in a person claiming to do a frequentist analysis in which they don’t have all of (1),(2),(3)

• So, the stock price after fraud thing is a good example of a frequentist non-iid model. You look at a single stock, and each day and classify the days into “normal”, “day of announcement”, “first day after announcement”… etc… Then each time-series of an individual stock has non-iid returns, on “normal” days they’re all iid together, but on the day of announcement it’s not identically distributed with the others…

BUT, for it to be meaningful as a frequentist analysis, on the day of an announcement, taken ACROSS MULTIPLE STOCKS, each return must belong to an ensemble that has an IID frequency that you’re claiming for “day of announcement” otherwise there’s no sense of frequency! You aren’t allowed to just pick a return based on your knowledge, you had better look at day of announcement for several hundred stocks, and fit a distribution F to it, or you’re lying to yourself / confused about what you’re doing.

• Chris Wilson says:

I don’t have time for a philosophical discussion, but in practice, I think Daniel is mostly right about this. Most of what passes for “frequentist” analyses are really either poorly done or just ‘approximately done’ Bayes (i.e. max likelihood where likelihood is assigned based on convenience/convention/plausibility rather than, as Daniel says, any kind of validated frequency distribution).

• Christian Hennig says:

1) It doesn’t make sense to me to talk about “Bayesian” and “frequentist” analyses as if they were opposed to each other, an example is actually Bayes’s original paper, as stated before.

2) In every frequentist analysis there is *something* that is repeated and *something* that is rather thought of as a realisation of an in principle repeatable random process that actually is not repeated and is only an idealisation. If this was not so, a frequentist could indeed not use time series models (because a distribution for a fixed time point t is implied but no fixed time point can be repeated), and also a standard frequentist way of analysing data from processes that are thought of as repeatable is via i.i.d. models, but keep in mind that “i.i.d.” is defined in terms of probabilities, so if frequentists want to explain the probabilities involved in the i.i.d. assumption, they have to talk about repetition of *the whole sequence of observations that were observed*, which are in fact not repeated if an i.i.d. model is used for all the available observations (as is usually done).

So all frequentist modelling requires idealisation and it requires some assumptions about repeatability of something that is in fact not repeated. On the other hand, from the fact that such an idealisation was made, one cannot conclude that therefore the analysis is not frequentist.

• Christian Hennig says:

Just to make this clear: If I use the term “frequentist” I refer to an interpretation of probability. If I use the term “Bayesian” I refer to certain manners of computing probabilities, Bayes’ theorem, prior distributions, as introduced in Bayes’s original paper, which is quite ambiguous about the implied interpretation of probability. Bayesian computation is compatible with and has been used with various interpretations of probability, among others the frequentist one (sometimes referred to as “empirical Bayes”) and on the other hand epistemic ones as used in various fashions by e.g. de Finetti, Jeffreys or Jaynes. There is no such thing as a unique “Bayesian interpretation”.

• Chris Wilson says:

Christian, how would you classify the large body of work devoted to max likelihood methods for estimating/fitting models? I normally hear that pitched as “frequentist” – but clearly it is a way of doing computation and not primarily about interpretation of the probabilities involved. Speaking for myself, when I see max likelihood being used in all but the simplest models, it never really makes sense to me except as an approximate Bayesian computation. But I’m not claiming any special expertise here.

• Christian Hennig says:

Chris: Whether something is frequentist or not is about how the involved probabilities are interpreted. So I’d call such analysis “frequentist” if the results and assumptions are interpreted in such a way that probabilities refer to data generating processes that deliver the corresponding relative frequencies when replicated an infinite amount of times. This is usually the case when ML estimators are applied, although the word “usually” is probably misleading because in the majority of cases people may not really think about how probabilities are interpreted and therefore wouldn’t explicitly explain this, and there is little basis to call them frequentists or Bayesian or whatever.

This reads similar to some extent to what Daniel is writing but the difference here seems to be that Daniel is not willing to accept idealisation; I call an analysis “frequentist” if probabilities are *thought of* as frequentist, which is always an idealisation but sometimes a more and sometimes a less bold one (the latter in the cases to which Daniel alludes), whereas Daniel seems to demand that the reality really is like that, which it will never perfectly be, so there couldn’t be any “true frequentists” if Daniel’s view would be taken to the extreme.

• Keith O'Rourke says:

“Daniel is not willing to accept idealisation… Daniel seems to demand that the reality really is like that, which it will never perfectly be, so there couldn’t be any “true frequentists” if Daniel’s view would be taken to the extreme”

It does _seem_ like that to me too which I sometimes phrase as thinking all models have to be literally like what they represent – like the argument that a prior does not make any sense for a fixed but unknown parameter value.

• Chris Wilson says:

Thanks for clarifying Christian. That makes sense to me and is similar to things I’ve read from Andrew. I think part of the trouble is that “frequentist” is used casually to refer to a variety of methods, like use of least squares or max likelihood, rather than philosophically. It’s a good point that all models make idealizations that cannot be literally true, so we shouldn’t demand that hypothetical long-run frequencies be empirically demonstrable. To be clear, you are saying that I could write out a probabilistic model, and use full Bayes (with MCMC and prior specification) to estimate parameters, but my analysis could be “frequentist” if I interpret the probabilities involved as in-principle long-run frequencies rather than plausibility/belief? If so, presumably the key is that I have to have a frequency-based prior? If I do not have any reasonable basis for a frequency-based prior, but can use some scientific reasoning to elicit or develop a prior, then there’s no sense in which I could interpret probabilities from such a model frequentistically right?
What about a case where we need to ride probability theory pretty hard to arrive at any kind of likelihood (i.e. via conditional specification for complex scientific models), but then max likelihood is used to estimate (usually due to a misguided aversion/discomfort with specifying priors)? I have seen this kind of thing done, and it always strikes me as weird, like wanting to have the Bayesian omelet without breaking the Bayesian eggs.
Not trying to argue with you by the way – these are just questions that I’ve had for a while…

• Christian Hennig says:

“To be clear, you are saying that I could write out a probabilistic model, and use full Bayes (with MCMC and prior specification) to estimate parameters, but my analysis could be “frequentist” if I interpret the probabilities involved as in-principle long-run frequencies rather than plausibility/belief?”
Yes.

“If so, presumably the key is that I have to have a frequency-based prior?”
My handling of terms is actually even more liberal than that, for me it’s enough that you *think of it* as *potentially* frequency-based. However, if there is no real basis for any frequency to be ever observed, you stand on very shaky grounds indeed (when I label something frequentist I don’t imply that it has to be convincing).

“If I do not have any reasonable basis for a frequency-based prior, but can use some scientific reasoning to elicit or develop a prior, then there’s no sense in which I could interpret probabilities from such a model frequentistically right?”
In this setup, by setting up a prior with an epistemic interpretation, I’d say that you have committed yourself to a non-frequentist analysis, yes.
You could however still be frequentist if you said that your prior represents some kind of frequentist “sampling from different possible worlds”; only that people may not find this very convincing. It allows potentially interesting simulations, though.

“What about a case where we need to ride probability theory pretty hard to arrive at any kind of likelihood (i.e. via conditional specification for complex scientific models), but then max likelihood is used to estimate (usually due to a misguided aversion/discomfort with specifying priors)? I have seen this kind of thing done, and it always strikes me as weird, like wanting to have the Bayesian omelet without breaking the Bayesian eggs.”
As “frequentism” in my use refers to an interpretation, I’d ask the researchers in such cases what their interpretation is, and hope that they could come up with something that makes sense (which of course is not always the case).

In any case, I don’t like the kind of argument that tries to tell me that lots of things are “implicitly Bayesian” even if they were neither done nor presented as such. I don’t subscribe to the idea that we are or should be Bayesian all the time, that every useful frequentist analysis is only useful to the extent to which it is Bayesian, and that every useless frequentist analysis has failed because people haven’t applied proper Bayes. This seems dogmatic to me and has little to do with what deserves to be called Bayesian because it actually goes back to Bayes himself.

• Christian: I just don’t think it’s enough that you “think of” your analysis as frequentist. People “think of” things that are wrong all the time.

When you have 50 pallets of orange juice cartons and you use a random number generator to select 100 of the cartons… you are doing something where frequentist analysis is very clearly applicable. The experiment is repeatable, and the random number generator has proven mathematical properties.
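A rough sketch of what "repeatable" means in the carton example, in Python. The population (50 pallets of 200 cartons, with made-up fill volumes) and the function name are assumptions for illustration only; the one thing taken from the comment is the use of a random number generator to pick 100 cartons.

```python
import random
import statistics

def sample_mean_volume(cartons, n, rng):
    """Draw n cartons without replacement and return their mean volume."""
    return statistics.fmean(rng.sample(cartons, n))

# Hypothetical population: 50 pallets x 200 cartons, fill volume in litres.
population_rng = random.Random(0)
cartons = [population_rng.gauss(1.0, 0.01) for _ in range(50 * 200)]

# The experiment can be repeated as often as we like with fresh random draws.
sampling_rng = random.Random(1)
means = [sample_mean_volume(cartons, 100, sampling_rng) for _ in range(2000)]

true_mean = statistics.fmean(cartons)
spread = statistics.stdev(means)  # empirical standard error of the sample mean
```

Because the selection step is under the analyst's control, the histogram of `means` is something that can actually be produced, which is the sense in which the frequency interpretation is checkable here.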

When you take 1000 intervals of 10 seconds of seismic data using a random number generator, and you calculate the mean squared amplitude of the signal… again a frequentist analysis can tell you something. You could go back and get a new sample from the records over and over.

When you run a global climate circulation model for 100 years of simulation from a variety of initial conditions, and then look at the distribution of temperature at the 100 year point… you can “think of” this as frequentist all you want. The fact is that there will only be one future, and the initial conditions distribution you put in is entirely a result of lack of knowledge about the conditions today.

“Thinking about” the analysis as frequentist doesn’t make it the case that frequencies of outcomes from your simulation process correspond to the frequency with which something actually occurs in the world.

My assertion is that you simply can’t have any philosophical basis for calling a *scientific* analysis frequentist if it *doesn’t describe frequencies in the actual world we live in*

Of course, pure mathematics can tell you that in some theoretical *other* world where x held…. but this isn’t *science* it’s *pure mathematics* just like in some world with 4 spatial dimensions we could discuss how much surface 3D “area” there is in the epsilon vicinity of a 4D sphere of radius 1.

A Frequentist analysis doesn’t have to describe frequencies accurately to the Nth decimal place, if you say normal(0,1) and it turns out to be closer to student_t(0,1,dof=22) with an extra lump between 2 and 2.5 this is not the kind of thing I’m complaining about, though I admit the farther away you get from your normal(0,1) the more unhappy I will get about the results of the analysis.

Show me a repeatable experiment that can in principle with enough time and money be run that would give you a histogram, and show me how you have actually collected a little bit of this data to validate that whatever model you’re using doesn’t give a small p value for your little bit of data… and I’ll be ok with calling the analysis frequentist.

It’s just that in vast swaths of “frequentist” analysis, this isn’t even close to done. For example, a simple linear regression on Economic outcomes from 4 different studies on health economics in central Africa…. there’s no sense in which the “Frequency” of anything is being modeled. We’re talking 4 separate unique studies at particular times under particular global economic and trade conditions in particular villages. Plausibly many of the important factors underlying the causality are completely unrepeatable.

• I think I can provide some fairly hard checkable rules that would make an analysis both scientific (that is, about the real world) and Frequentist in my opinion:

1) You used a random number generator to collect data on a subset of some well defined set of objects at a particular time.

2) You collected data repeatedly at different times and showed that the histogram of the data was stable in time (many subsets of the data in a window in time all have similar histograms).

3) You have a single data set and an explicit scientific reason to believe that stable frequencies come out of this process such as that your measurement is an average over many individual events each of which is bounded, and hence a central limit theorem type thing applies to your measurement or the like. For example, calibrating electronic measurement instruments, or looking at the behavior of the total load curve for the electric grid on days with the same outdoor temperature… or similar.

For these 3 cases I’ll give a pass to calling an analysis both scientific, and Frequentist.
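Rule 2 can be made concrete with a small check. The following Python sketch is only one possible reading of "the histogram of the data was stable in time": it compares each time window's histogram with the pooled histogram using total variation distance. The window size, bin edges, simulated series, and function names are all assumptions chosen for illustration.

```python
import random

def histogram(values, edges):
    """Return the fraction of values falling in each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]

def max_tv_distance(series, window, edges):
    """Largest total-variation distance between a window's histogram and the pooled one."""
    pooled = histogram(series, edges)
    worst = 0.0
    for start in range(0, len(series) - window + 1, window):
        h = histogram(series[start:start + window], edges)
        tv = 0.5 * sum(abs(a - b) for a, b in zip(h, pooled))
        worst = max(worst, tv)
    return worst

rng = random.Random(0)
edges = [-5 + 0.5 * i for i in range(31)]                 # bins covering [-5, 10)
stable = [rng.gauss(0, 1) for _ in range(5000)]           # same distribution throughout
drifting = [rng.gauss(t / 1000, 1) for t in range(5000)]  # mean drifts from 0 to 5

stable_score = max_tv_distance(stable, 1000, edges)
drifting_score = max_tv_distance(drifting, 1000, edges)
```

A small score for the stable series and a large one for the drifting series is the kind of evidence rule 2 asks for; how small is small enough remains a judgment call.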

Of course, every null hypothesis test is a frequentist test, but it’s much less clear that it’s *scientific*. After all, when it’s not rejected it specifies only that a random number generator is a sufficiently good explanation, and when it’s rejected it says only that a random number generator isn’t a sufficient explanation. At no point in time does anyone ever think that a random number generator *did* generate the cholesterol measurements in your patients or whatever… so this is really a point about pure-math, that a certain type of pure mathematical process could or could not produce a similar set of measurements.

• Christian Hennig says:

“I just don’t think it’s enough that you “think of” your analysis as frequentist. People “think of” things that are wrong all the time.”
I didn’t mean that people are frequentist by thinking “I am frequentist”; I rather think that researchers do frequentist analyses if they use and interpret probabilities in a frequentist way, i.e., as referring to limiting relative frequencies in data generating processes (I’d actually include “long run propensities” as for example advocated by Gillies in this).

I don’t think that the distinction between frequentist and non-frequentist should be defined by some kind of assessment which kind of idealisation is still close enough to reality and which is too far away. Most random number generators are deterministic, and as I explained before, all i.i.d. models imply statements about frequencies that cannot be checked by histograms.

When confronted with a frequentist analysis (in my sense) in which people think of something as a “data generating process” that can in fact not be repeated, you can well say “this is based on fiction and not connected to reality, and therefore I think that it’s useless.” Fair enough by me. (I may not agree in all cases but in many.)
On the other hand I will stick to the idea that idealisation is required and that it is not reason enough to beat up a frequentist analysis that at some point it involves a model of something that cannot be repeated (such as a single time point in a time series analysis).

• Chris Wilson says:

Daniel, I think what he is saying is that those are all things you arguably might want to do to argue that you have an *accurate* frequentist analysis, but that it is not the essence of what makes an analysis frequentist. Long-term frequency is of course an idealization, more so in some cases than others; but then again, so is specifying our uncertainty using probability distributions. I mean, speaking personally, I don’t *think* in Normal(0,1) distributions or Gamma(2,1/a) or whatever- neither does any other biological entity :) What we can do, in many cases, is try and map our knowledge into those symbolic representations. As far as I can tell, we do this because it has proved to work in many cases- but that doesn’t mean it’s not an idealization too…

• So, now I think we need a distinction not between what is frequentist and what is bayesian, but instead what is science, and what is not.

Clearly, questions such as those made by pure mathematicians about the frequency properties of abstract sequences of numbers are *Frequentist* but, are they scientific?

And my answer is no, in the same way that a question about the abstract 3D “surface region” around a 4D “sphere” is not a scientific question, because we don’t live in a 4D world (let’s ignore application of pure math to things like encoding schemes for transmission of bits down a wire etc).

Now when it comes to Frequentist statistics applied to science, I think that when we have reasons to believe in the stable frequency properties of certain scientific measurements, we can do Frequentist analysis on models of this and get actual scientific predictions.

But, when we lack either fairly strong a-priori reasons, or a body of data that supports the stability of observed Frequencies, then when we go off and do Frequentist analysis, we are doing pure-math in the guise of “scientific” research. And I object to that on scientific grounds, not grounds that the math is incorrect. It’s a little like some chemist giving me an argument about what will occur based on Phlogiston. I’m sorry but Phlogiston isn’t a scientific theory, it is falsified by our knowledge of atoms and chemical energy and quantum mechanics.

Whereas a Bayesian model is automatically a pure logical calculation about the implications of a well defined model of the science, a Frequentist calculation is either about the frequency properties of abstract numbers when it’s actually pure math, or about predicting the future frequency properties of physical systems… which is pretty often scientifically unjustified.

Saying the frequency of X is F(X) implies a belief about the future distribution of measurements of *physics* or *economics* or *ecology* etc.

This is why I think Andrew says that Hypothesis testing tells you something when you fail to reject… it tells you that an abstract mathematical “frequency checker” robot wouldn’t have been surprised to hear that someone had switched out your data for the output of a certain random number generator.

In some sense, this tells you that your data is noisy and as good as an RNG, without needing to imply that your data really did come out of an RNG.

But when you do a linear regression y = a * x + err you are making a guess about what the actual distribution of the err will be in terms of *long run frequency* and this is a scientific assertion about the science. What is the mysterious “randomness force” that enforces a stable histogram? It’s Phlogiston.

• Corey says:

Christian, a couple of times you’ve brought up Thomas Bayes and his original paper. I think this ought to be avoided because, due to the unfortunate association of Bayes’s name with what used to be called “inverse probability” (thanks Fisher!), people are likely to fail to appreciate that the Reverend’s position on these questions is not clear from his writings.

• Christian Hennig says:

Corey: I agree with this: “the Reverend’s position on these questions is not clear from his writings”, but I don’t think that’s a reason to avoid bringing up the original paper. People should read it, don’t you think? Particularly people who think of themselves as “Bayesians”.

• Corey says:

In a counterfactual world where Fisher never called inverse probability “Bayesian”, would you think inverse probabilists would need to read an obscure 250-year-old paper written at a time when expectation was thought to be easier to grasp than probability? I think instead you’d be recommending Laplace’s Essai philosophique sur les probabilités as the foundational text — and I’d agree with you there. (I find it’s easier to agree with you when the ‘you’ in question is my counterfactual model of you. ;-) )

• Anoneuoid says:

When I’m analyzing the data from an experiment, I am principally interested in what that experiment — in isolation — is telling me.

I don’t think this is even possible. You always need to incorporate other information when interpreting results, that is why the “trick” of leaving out one or two key piece(s) of background info works so well. Can you give an example of what you are thinking?

• Clark says:

Consider a cell culture experiment, where one set of cultures is control and the other given some drug. Outcomes are multiple biomarkers. Results of this study are what they are — measures as a product of the described designed experiment. The Discussion is where these results would be discussed/interpreted in the context of what is understood from other studies and general prior knowledge, but the results stand on their own.

• I think you’re confused about the purpose of doing statistics in this context. The DATA stand on their own… they are definitely observed things. The purpose of the statistical analysis is to infer something that can’t be observed. In this context, even just specifying what it is you think needs inference uses background information, a model of what’s going on.

• Anoneuoid says:

Cell Culture experiment…Outcomes are multiple biomarkers.

Thanks, I feel like you may still be glossing over the important part here. Biomarkers of what, and what is actually measured, e.g. fluorescence?

I’m not sure if this will end up where Daniel Lakeland suggested, but it does seem like we may eventually be talking about the raw data rather than the results of any “analysis”.

• For example, you could estimate an average transcription rate of gene A among the cells. But why an average? Why not a 95 percentile? why not the actual transcription rate in the cell at coordinate x,y in the dish, why not the ratio of the transcription rate of A to the transcription rate of B, is it the average of this ratio across the cells? or is it the ratio of the averages across the cells? what is important to you other than just the facts of the observed data? As soon as you answer that question, you are talking about model dependence and background information.
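The "average of the ratio" versus "ratio of the averages" question is not just rhetorical, since the two summaries generally give different numbers. A tiny Python illustration with made-up per-cell transcription rates (the values and variable names are hypothetical):

```python
from statistics import fmean

# Hypothetical per-cell transcription rates for genes A and B (arbitrary units).
rate_a = [2.0, 4.0, 9.0]
rate_b = [1.0, 4.0, 3.0]

# Average over cells of the per-cell ratio A/B: (2 + 1 + 3) / 3
mean_of_ratios = fmean([a / b for a, b in zip(rate_a, rate_b)])

# Ratio of the two averages: 5 / (8/3)
ratio_of_means = fmean(rate_a) / fmean(rate_b)
```

Here the first summary is 2.0 and the second is 1.875, from exactly the same observed data. Choosing between them is already a modeling decision.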

• Garnett says:

I hear this perspective a lot, and I agree that it follows from misunderstanding the role of statistics in science. The particular sample/experiment is of no particular interest. It is simply a probe into the true effects of the drug onto whatever environment that you want to generalize your results over to: another identical experiment; different drugs; different strains of bacteria; people; etc.

• Jonathan (another one) says:

The problem with that view (IMO) is that the data in isolation is so background-poor. It’s like saying that my experiment shows a visual pattern of light with alternating patterns of colors arrayed vertically, and someone else says: “Have you never seen a zebra before?” What the experiment tells you in isolation usually isn’t very meaningful. The meaning comes from the application of background knowledge. Unfortunately (or not), that will inevitably lead to different people with sharply different background understandings coming to different conclusions except when the experiments are simply flatly incompatible with one understanding, the other, or both.

8. Cliff AB says:

Personally, I very much like this perspective, even if it’s a bit at odds with classic Bayesian philosophy.

The idea that I could work together with a group of experts in a field, and we could wrap up all their knowledge into a multivariate normal + Wishart distribution is somewhat insulting. But, to me, it is quite reasonable that they could wrap up something like a lower bound of information that can lead to improved estimation and inference.

9. This is super helpful for thinking about how to explain this material. The distinction between “belief” and “knowledge” is a tricky one. And indeed it’s semantics to try to define these terms. And pragmatics if we want to talk about how to use them and whether our definitions are useful. Philosophers sometimes define knowledge as justified true belief. But we don’t really want to talk about truth with a capital “T” here. We know we only have a model of the world, and not the world itself. Same as in physics—I don’t want to say Newtonian physics is true, just that it’s a useful model of the world for making predictions.

I think the distinction Andrew’s trying to get at is a traditional one. Belief is often defined as a property of cognitive agents, making it subject specific (aka subjective). Knowledge is often defined as a property of both cognitive agents (the ones who have the belief) and the objects in the world (aka objective). And both knowledge and belief can be ascribed singly (to individual agents) or collectively (to communities—of course, not all communities need to share beliefs).

Knowledge is almost always assumed to be consistent, whereas theories of belief usually allow for inconsistencies. For example, see the Wikipedia page on doxastic logic (logic of beliefs) for various conceptions of what belief might be. There is no specialized logic of knowledge—that’s just ordinary logic!

The point Andrew’s always trying to drive home is that a Bayesian model is prior plus likelihood, whereas a frequentist model is likelihood plus penalty (at least if you want to try to avoid overfitting with MLE). In both cases, the statistician chooses a likelihood function based on a combination of convenience and appropriateness based on their (belief | knowledge) about what is being modeled. In both cases, inferences only make sense relative to this researcher-chosen model. For both frequentists and Bayesians, prior knowledge or belief is involved in choosing this likelihood: do I pool or not pool these parameters or introduce random effects, do I include or not include these predictors, do I use interaction effects in my regression, do I use tobit or linear regression (t or normal errors), poisson or negative binomial distributions, do I build a time-series or spatial model, ad infinitum? In Bayesian statistics, you also get to encode your knowledge or belief in the form of a prior. If you’re being strictly frequentist, then you’ll be careful to not treat parameters as random variables, and encode this same sort of knowledge as a “penalty”.
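One standard case where the "penalty" and the prior carry the same information is the mean of normal data with known noise. The sketch below is an illustration under assumptions, not anything from the comment itself: the data, `sigma`, `tau`, and both function names are invented, and the prior is centred at zero so that it lines up with a plain squared penalty.

```python
def penalized_estimate(y, lam):
    """Minimise sum((y_i - mu)^2) + lam * mu^2 over mu (closed form)."""
    return sum(y) / (len(y) + lam)

def posterior_mode(y, sigma, tau):
    """Posterior mode for mu with y_i ~ normal(mu, sigma) and mu ~ normal(0, tau)."""
    prior_precision = 1.0 / tau**2
    data_precision = len(y) / sigma**2
    ybar = sum(y) / len(y)
    return data_precision * ybar / (data_precision + prior_precision)

y = [1.2, 0.7, 1.9, 1.4]   # made-up observations
sigma, tau = 1.0, 0.5      # assumed known noise sd and prior sd
lam = sigma**2 / tau**2    # penalty weight that matches this prior

mu_penalized = penalized_estimate(y, lam)
mu_map = posterior_mode(y, sigma, tau)
```

Both routes give 0.65 for these numbers. The arithmetic is identical; what differs is whether `mu` is treated as a random variable with a distribution or as a fixed quantity estimated under a penalty.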

Then, of course, there’s what to do with uncertainty after observing the data. Here’s where things again get philosophical, but that’s the topic for another post.

• Carlos Ungil says:

“Subjectivity is objective.” – Boris Grushenko

> theories of belief usually allow for inconsistencies.

I’m not sure if you’re including them in that remark, but the whole point of the subjective/personalistic probability theories of de Finetti and Savage is to ensure that the “degrees of belief” are coherent. For Jaynes [*], Cox’s probability theory offers something even better than coherence: logical consistency.

[*] I don’t know if he may be described as a subjective Bayesian, clearly he didn’t think so.

• Good point. I think like a lot of 20th century philosophers, they were aiming at a normative theory of how agents should formulate beliefs, not a descriptive theory of how humans actually formulate beliefs (under the conventional usage of “belief” in English).

So it all gets tangled up in the way we do science. A scientist gets a hunch and thinks some process might behave in such and such a way. You might even say at that point they believe (or at least hope) that’s how the process works. We don’t want them putting that hope into the prior. We want them to be putting evidence they have of how the process works. You can think of this as a kind of meta-analysis—it’s what we’ve learned from previous experiments. This all ties into Andrew’s default priors post. I’m moving toward thinking of priors as a way of loosely including the information from prior experiments without setting up a careful meta-analysis that jointly models the data from all experiments (and I think that’s where Andrew’s going with this, though I’d be interested in hearing his response).

• Keith O'Rourke says:

> aiming at a normative theory of how agents should formulate beliefs
Yup, how we ought to inquire not how various folks do inquire.

But how we ought to inquire is learned by induction (rather than by deduction as only earlier philosophers argue?) and this involves how folks did inquire and what successes/failures seemed to follow from that.

Any inquiry requires representations of the reality trying to be grasped and so I think that needs to be the starting point.

> We don’t want them putting that hope into the prior. We want them to be putting evidence they have of how the process works.
Nicely put – I think conjecture/abduction will still be an important part of evidence …

> including the information from prior experiments without setting up a careful meta-analysis that jointly models the data from all experiments
That’s where I started in 1985 with the desire to include the information from prior experiments in planning new studies requiring the exposition of methods to do meta-analysis http://annals.org/aim/article/702054/meta-analysis-clinical-research

Not sure what you mean by careful – given the way studies are conducted and (selectively) reported it is rarely if ever possible to do a good meta-analysis (outside regulatory agencies where the conduct and reporting is controlled and audited).

Maybe just do a quick multilevel model of published summaries, shift the likelihood towards the null and flatten it?

• Martha (Smith) says:

Bob: Well put.

10. Ben Goodrich says:

What has always struck me as odd about this interpretation is that I am not aware of an information theorist who advocates it (and Jaynes repudiates it). If you are correct, then it seems that you should be able to point to some entropy measure that is strictly reduced by specifying this or that prior distribution (or likelihood function).

For likelihoods in the exponential family and conjugate priors, the posterior distribution is in the same family as the prior, and I am pretty sure it can be shown that conditioning on the data always makes the differential entropy of the posterior distribution smaller than the differential entropy of the prior distribution. So, it would make sense to interpret the data as providing information about the parameters (at least in the conjugate case) and since the hyperparameters can be interpreted as pseudo-data, it kind of makes sense to interpret them in terms of information.

But as a counter-example, suppose you take the simple case of a normal prior, a normal likelihood with known precision, and get a normal posterior distribution for the unknown mean. It seems as if specifying the prior location should provide “information” about the unknown mean, but the entropy of the prior and posterior normal distribution only depends on the precision.
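The counter-example can be checked numerically. This Python sketch uses the usual conjugate update for a normal mean with known noise standard deviation. The data, the prior settings, and the function names are made up for illustration.

```python
import math

def normal_entropy(sd):
    """Differential entropy (in nats) of a normal distribution with standard deviation sd."""
    return 0.5 * math.log(2 * math.pi * math.e * sd**2)

def normal_posterior(y, sigma, prior_mean, prior_sd):
    """Posterior mean and sd for mu, given y_i ~ normal(mu, sigma), mu ~ normal(prior_mean, prior_sd)."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = len(y) / sigma**2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * (sum(y) / len(y))) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

y = [0.3, -0.1, 0.8]  # made-up data
mean_a, sd_a = normal_posterior(y, sigma=1.0, prior_mean=0.0, prior_sd=2.0)
mean_b, sd_b = normal_posterior(y, sigma=1.0, prior_mean=10.0, prior_sd=2.0)

entropy_a = normal_entropy(sd_a)
entropy_b = normal_entropy(sd_b)
prior_entropy = normal_entropy(2.0)
```

Moving the prior location from 0 to 10 shifts the posterior mean but leaves the posterior standard deviation, and therefore the entropy, unchanged, while both posteriors have lower entropy than the prior.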

I am also not sure in what sense the choice of a prior family (even a conjugate one) reduces entropy. Relative to what?

• “I am also not sure in what sense the choice of a prior family (even a conjugate one) reduces entropy. Relative to what?”

Relative to an implicit uniform(-N,N) prior where N is an enormous number that everyone in the world would agree is sufficiently large that the quantity of interest must be in the interval.

For example N = 10^308 is big enough for almost all applied work no matter what the subject as witnessed by the fact that people routinely use double precision floats for almost everything. In a few narrow corners of the world this might not suffice but in those corners people do things like quadruple precision floats whose largest value is around 10^4932. There’s always some N.
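The double precision figure is easy to confirm from Python's standard library. The quadruple precision figure is not checked here, since plain Python floats are doubles.

```python
import math
import sys

largest_double = sys.float_info.max   # largest finite double precision value
digits = math.log10(largest_double)   # a little over 308
overflowed = largest_double * 10      # multiplying past the limit yields infinity
```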

• also, people work with the logarithm, so that then they are representing numbers up to N=exp(10^308)

The Internal Set Theory construction of nonstandard analysis is explicitly built on this idea of “there’s always a number bigger than anything anyone will ever represent explicitly”

• Carlos Ungil says:

>suppose you take the simple case of a normal prior

I don’t follow your argument. Isn’t the entropy of the posterior reduced when you start with a normal prior, relative to the case where a flat (improper) prior is used? And the higher the precision of the prior, the lower the entropy of the posterior.
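For the normal mean with known noise standard deviation, this claim can be checked directly, treating the flat prior as the limiting case with zero prior precision. The sample size, `sigma`, and the prior standard deviations below are arbitrary illustrative choices, and the function name is hypothetical.

```python
import math

def posterior_entropy(n, sigma, prior_sd=None):
    """Entropy (nats) of the normal posterior for a mean; prior_sd=None means a flat prior."""
    prior_prec = 0.0 if prior_sd is None else 1.0 / prior_sd**2
    post_var = 1.0 / (prior_prec + n / sigma**2)
    return 0.5 * math.log(2 * math.pi * math.e * post_var)

flat = posterior_entropy(n=5, sigma=1.0)
vague = posterior_entropy(n=5, sigma=1.0, prior_sd=10.0)
tight = posterior_entropy(n=5, sigma=1.0, prior_sd=0.5)
```

The ordering comes out as `tight < vague < flat`: any proper normal prior lowers the posterior entropy relative to the flat case, and a more precise prior lowers it further.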

• Yes, but what seems to be the case is that altering the position of the prior doesn’t change the entropy.

However, that doesn’t mean there isn’t information in the position parameter. The total information put into the model is the information in the Stan code, plus the information in the dataset. If you gzip compress the Stan code you can get a sense of how much info is being put into the analysis in the form of the model.

• Carlos Ungil says:

Why would altering the position of the prior change the entropy?

• it shouldn’t but specifying the position of the prior should in some sense “provide some information”. The sense in which this is true is related to the gzip compressed length of the Stan file.

All of this is kind of related to Kolmogorov Complexity, which is uncomputable, but you can sort of approximate it by doing things like gzip compressing the code.
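A crude version of the gzip idea in Python. The Stan-like model text is a toy written for this sketch, not taken from any real analysis, and for files this small the gzip header and format overhead make up a noticeable share of the byte count, so the numbers are at best a rough proxy.

```python
import gzip

def compressed_length(text):
    """Bytes in the gzip-compressed UTF-8 encoding of text (a crude complexity proxy)."""
    return len(gzip.compress(text.encode("utf-8")))

# Toy Stan-like program; doubled braces are literal braces for str.format.
template = """
data {{ int<lower=0> N; vector[N] y; }}
parameters {{ real mu; }}
model {{
  mu ~ normal({loc}, 1);
  y ~ normal(mu, 1);
}}
"""

len_loc_0 = compressed_length(template.format(loc=0))
len_loc_42 = compressed_length(template.format(loc=42))

# Same program with the prior line removed entirely.
no_prior = template.replace("  mu ~ normal({loc}, 1);\n", "").format()
len_no_prior = compressed_length(no_prior)
```

On this reading, the two prior locations cost about the same number of bytes, and dropping the prior line costs fewer. That matches the suggestion that stating a location supplies some information without changing the entropy of the resulting distribution.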

• Carlos Ungil says:

I don’t understand where the perceived issue is. Specifying the position of the prior (e.g. mu=0) does “provide some information” and specifying an alternative position of the prior (e.g. mu=42) also provides a similar amount of information. Why is it surprising that both priors N(0,sigma) and N(42,sigma) result in the same entropy? What does the gzip compressed length of Stan files have to do with anything?

• I think Ben is equating information with reduced entropy, so if changing the location of the prior doesn’t change the entropy, then he can’t see how you’re able to have a measurable change in information. My point is that the total information is the sum of the information contained in the model code, (Stan file) and the information encoded in the data.

So, when Andrew says that he’s putting in prior information, it comes in the form of text in the Stan file, and it includes both the content of the prior and of the form of the likelihood. Then you join it with the dataset, and you get a posterior distribution over the parameters. The information in this posterior comes from two places, the text of the Stan file, and the text in the CSV file / other data source containing the data.

• Ben Goodrich says:

I would be surprised if Andrew were referring to the compressed length of the Stan file, but it is an interesting idea. Do two Stan files that are identical except for the prior theta ~ exponential(1) vs theta ~ gamma(1,1) have the same complexity?

• The Kolmogorov complexity of a program is the length of the *shortest* program that outputs the same thing… so the answer is yes at least at a theoretical level. In practice Kolmogorov complexity is uncomputable so …

• Martha (Smith) says:

Ben said: “What has always struck me as odd about this interpretation is that I am not aware of an information theorist who advocates it”

It sounds like you are confusing two meanings of “information”.

• Ben Goodrich says:

I recognize that Andrew was likely using “information” in a less rigorous sense than an information theorist would. What I don’t recognize is how you get a theory of inference, decision-making, etc. out of using “information” in this less rigorous sense.

If “information” is all that is required, why can’t someone just list a couple of things like “I have information that the coefficients are not too big” and “I have information that the errors are independent of the predictors” and combine those facts to reach a valid conclusion about the coefficients in a regression model?

You can apply the negative log transformation to both sides of Bayes Law and it remains a valid equation. In fact, we do that in HMC in order to draw from the posterior distribution efficiently. But no one uses that form for inference.
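Written out in the notation of the post, with p(y) the marginal likelihood (which does not depend on theta), the transformed equation is:

```latex
-\log p(\theta \mid y) = -\log p(y \mid \theta) - \log p(\theta) + \log p(y)
```

HMC works with the first two terms on the right-hand side as its potential energy; the last term is a constant in theta and can be dropped when sampling.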

• I have information that the coefficients are not too big is exactly the kind of thing that goes into a choice for a prior.

• Ben Goodrich says:

Yes, but my point is that you have to specify something more specific like beta ~ normal(0,10). And one can say that particular prior has some amount of information content. But I am not seeing how one can accept the reasoning that implies that information content and deny that a normal distribution with a mean of zero and a standard deviation of ten describes your beliefs about beta before seeing the data.

• I think your question is no longer about what constitutes information and more about what constitutes “belief”. Belief is a loaded word, like “beautiful” and “significant”, and I think the point of this post is that you can drop all the loaded aspects of “belief” and just talk about “what information was put into the analysis?” and that information arrives via two paths, the model (e.g. a Stan file) and the data.

Merriam-Webster offers:

Definition of belief

1
: a state or habit of mind in which trust or confidence is placed in some person or thing (“her belief in God”; “a belief in democracy”; “I bought the table in the belief that it was an antique”; “contrary to popular belief”)

2
: something that is accepted, considered to be true, or held as an opinion : something believed (“an individual’s religious or political beliefs”); especially : a tenet or body of tenets held by a group (“the beliefs of the Catholic Church”)

3
: conviction of the truth of some statement or the reality of some being or phenomenon especially when based on examination of evidence (“belief in the validity of scientific statements”)

So I think the sense in which you might discuss the “belief” in beta ~ normal(0,10) is at best definition 3. And really, even there, the sense of commitment that “conviction” implies… is not there in a general Bayesian analysis. In other words, the loadedness of “belief” is that it brings to mind (1) and (2).

• Paul Shearer says:

The idea that anyone “believes” in their normal(0,10) prior is ridiculous. Beliefs aren’t supposed to change for the sake of convenience. But tell a researcher that a new computational method lets him do more rigorous/bigger/better analyses, and that method happens to require a gamma prior… watch how quickly his “prior belief” shifts over.

At best, priors are computationally convenient approximate representations of something we may tentatively believe, given the information we have at hand. Rather than rattle that bloated sentence off every time we are asked about priors, I’d like to just say my prior represents information.

• Martha (Smith) says:

Paul: Good point in your first paragraph. But I think your second paragraph is misguided; that “bloated” sentence does need to be rattled off frequently — just saying “represents information” is too vague and is likely to promote a formulaic approach that becomes just a formality rather than promoting careful thinking.

• Glen M. Sizemore says:

*And, incidentally, I can clean up my own statement here, removing references to “knowledge” and “realizations,” and it might be important to do so eventually.

• Paul Shearer says:

> What I don’t recognize is how you get a theory of inference, decision-making, etc. out of using ‘information’ in this less rigorous sense.

The point is not to build a new theory, it is to better describe existing theory. “Prior beliefs” is misleading statistical jargon that doesn’t reflect actual practice.

Most of us don’t use Bayesian methods because we “believe” our priors. We use them because we have information that we want the inference to take into account. A prior is a reasonable representation of the information we have, chosen for convenience of use in a favored computational framework. It is not something we actually believe in. Outside of the cult of subjective Bayesianism, the idea that someone would “believe” that some obscure treatment effect is normally distributed with a mean of 1 and a variance of 5 is absurd on its face.

“Prior beliefs” are a polite fiction that enable us to get our work done and still sell ourselves as Bayesian uber-rationalists. But by now everyone has seen through the hype, and we should be more clear and humble about what we’re actually doing.

11. Carlos Ungil says:

Do you think that the total amount of information (including the Stan file and the data) is different when the position of the prior changes? Is the prior N(0,1) more or less informative than the prior N(42,1)?

• Carlos Ungil says:

(Obviously this is for Daniel, I inserted my reply at the wrong place.)

• In ASCII 0 takes 1 byte to write, and 42 takes 2 bytes, normal(3875,1) takes 4 bytes for the location… of course after gzipping they’d be less, but yes imagine that your implied prior is uniform on the range of IEEE double floats, and then you write uniform(338,750) in the Stan file, those two numbers take some minimum number of bits to transmit.

Entropy is always relative to some state of information. So if “everyone knows” that uniform(338,750) is the “right” prior to use, then it takes basically zero bits to transmit. Whereas if all you know is that someone is going to transmit you the lower and upper bounds as IEEE double floats… then it takes 16 bytes. But if you “know” that these numbers tend to be rounded off decimals to 3 decimal places… then you could put a discrete distribution over those 999 options… etc
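The byte counts above are easy to check. A quick sketch in Python, where the uniform-over-999-options figure assumes every option is equally likely:

```python
import math
import struct

# ASCII length of the location literal as it would appear in the Stan file.
for literal in ["0", "42", "3875"]:
    print(literal, len(literal.encode("ascii")), "byte(s)")

# Two bounds sent as raw IEEE 754 doubles: 8 bytes each.
bounds = struct.pack("<dd", 338.0, 750.0)
print(len(bounds), "bytes for two doubles")

# If the receiver already knows each bound is one of 999 equally likely options,
# each bound costs log2(999) bits instead of 64.
bits_for_999 = math.log2(999)
print(bits_for_999, "bits per bound")
```

The same two numbers cost 128 bits under one state of knowledge and just under 20 bits under the other.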

A person who knows that you are sending the output of the Mersenne Twister algorithm only needs the seed… whereas gzip which knows nothing about the mersenne twister… will need O(32N) bits to transmit N 32 bit numbers.
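Python’s random module happens to use the Mersenne Twister, so this can be sketched directly. The seed and N below are arbitrary choices, and mt_outputs is a hypothetical helper name:

```python
import random
import struct
import zlib

SEED = 12345  # arbitrary seed, chosen only for illustration
N = 1000

def mt_outputs(seed, n):
    """First n 32-bit outputs of Python's Mersenne Twister for a given seed."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(n)]

values = mt_outputs(SEED, N)

# Someone who knows only "N unsigned 32-bit integers" sees 4 * N raw bytes,
# and a general-purpose compressor finds essentially nothing to exploit.
raw = struct.pack("<%dI" % N, *values)
compressed = zlib.compress(raw, 9)
print("raw bytes:", len(raw), "compressed bytes:", len(compressed))

# Someone who knows the generator needs only the seed to rebuild every value.
regenerated = mt_outputs(SEED, N)
print("regenerated from seed:", regenerated == values)
```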

• This is all related to Kolmogorov Complexity. The quantity of information in a Stan file and a dataset combined is quantified by the number of bits of the smallest computer program that could output the Stan file and the dataset…

• Keith O'Rourke says:

One light if by sea, two lights if by land ;-)

• Right, it requires 1 bit to represent those two options under this model, but if you put up 3 lights…. that’s gonna take at least a couple hundred bits to explain.

• Carlos Ungil says:

Ok. So there is more “information” in the prior N(42,1) than in the prior N(0,1). But there is no difference in “information” between the priors N(42,1) and N(1,42). And I can increase the “information” in the prior without actually changing the distribution, by using longer variable names. Or adding comments. Or whitespace. (I’m joking, nobody knows for sure if the “information” would increase or decrease in all those scenarios because of gzipping!)

This “information” doesn’t seem particularly useful to measure how informative prior distributions are. Getting the same entropy for p(theta)~normal(0,1) and p(theta)~normal(42,1) seems a desirable property to me, both prior distributions seem equally informative. Measuring the “information” in going from p(theta)~normal(0,1) to p(theta)~normal(42,1) is a different question. And I’m sure there may be answers (Kullback–Leibler divergence perhaps?) other than gzipping Stan files.
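For two normal distributions the Kullback–Leibler divergence has a closed form, so the suggestion can be tried out directly. A sketch, with kl_normal as a hypothetical helper name and the second argument of each pair read as a standard deviation:

```python
import math

def kl_normal(mu_p, sd_p, mu_q, sd_q):
    """KL(p || q) in nats for p = normal(mu_p, sd_p) and q = normal(mu_q, sd_q)."""
    return (
        math.log(sd_q / sd_p)
        + (sd_p**2 + (mu_p - mu_q) ** 2) / (2.0 * sd_q**2)
        - 0.5
    )

# Same scale, different location: the entropies match but the divergence is large.
print(kl_normal(0, 1, 42, 1))

# A distribution has zero divergence from itself.
print(kl_normal(0, 1, 0, 1))

# normal(42, 1) versus normal(1, 42): same characters in the file, different distributions.
print(kl_normal(42, 1, 1, 42))
```

With equal standard deviations the formula reduces to the squared difference in means divided by twice the variance, which is 42^2 / 2 = 882 nats for the first case.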

• Gzipping is just a computational approximation to Kolmogorov Complexity. The Kolmogorov Complexity is the length of the *smallest* computer program that produces the desired output. So if adding comments doesn’t change the output, it doesn’t change the K-C, but of course it does change the gzip output, which just tells you that gzip isn’t that great of a computational approximation.
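A sketch of that gap, again using zlib as a stand-in for gzip. The two Stan snippets are toy examples, and stripping comments and whitespace is only one hypothetical normalization, nowhere near a real shortest-program search:

```python
import re
import zlib

def compressed_bytes(text):
    """Length in bytes of the DEFLATE-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def strip_comments_and_whitespace(stan_code):
    """Drop // line comments and collapse all runs of whitespace to single spaces."""
    no_comments = re.sub(r"//[^\n]*", "", stan_code)
    return " ".join(no_comments.split())

PLAIN = "parameters { real theta; }\nmodel { theta ~ normal(0, 1); }\n"
COMMENTED = (
    "// prior location taken from an earlier pilot study\n"
    "parameters { real theta; }\n"
    "model {\n"
    "    theta ~ normal(0, 1);\n"
    "}\n"
)

# Same model, different text: the compressed lengths differ.
print(compressed_bytes(PLAIN), compressed_bytes(COMMENTED))

# After normalization the two files are the same string.
print(strip_comments_and_whitespace(PLAIN) == strip_comments_and_whitespace(COMMENTED))
```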

I think the point that is valid though is that the Stan file contains information, and we can’t deny this. If you put no explicit prior, Stan imposes an implicit uniform(-largest_float,largest_float); adding in some alternative in the Stan code requires you to add information to the model, and that information is explicit and included. Adding in a comment about *why* you put the prior in some sense adds even more information (in the sense of communication of knowledge from the author to the eventual reader of the analysis) in that it should help to convince someone about the use of that prior…

• Carlos Ungil says:

I don’t deny that a Stan file contains information. Variable names will contain information. Comments will contain information (maybe wrong information, though). And maybe even the whitespace will contain (formatting) information. I just don’t find all this information particularly useful to measure how informative prior distributions are.

Anyway, you are free to think that a prior p(theta)~Normal(42,1) for some parameter does contain more information than a prior p(theta)~Normal(0,1).

If so then I think I’ve made my point.

• Carlos Ungil says:

Prior A normal(2.3087559104,1) does NOT contain more information than prior B normal(0,1).

Maybe you intended to write normal(2.3087559104,0.000000001) ?

If you think prior A contains more information than prior B, what happens if the transformation x’=x-2.3087559104 is applied?

Does prior A normal(0,1) still contain more information than prior B normal(-2.3087559104,1) or has the ranking reversed somehow?

• Carlos:

The two distributions don’t have different entropy. This is a statement I agree with.

But the two Stan files don’t have the same Kolmogorov Complexity. Specifying the number 0 is going to be easy in any coding system, the number zero is basically special, it’s in every finite bit length number system humans actually use. Humans tend to use numbers that have around 3 or so decimal digits, so if you created a special “short float” that had 10 bit mantissa and 4 bit exponent you could encode most numbers that people would write into a Stan file in 14 bits, numbers like 1.05e3 or 33 or 320e6 (about the population of the US).

On the other hand, 2.3087559104 is a “weird” number for a human to write down. It’s highly specific and requires 11 decimal digits to specify; in practice you’re going to have to use a full 64 bit double to specify it, since a 32 bit float holds only about 7 significant digits. So it contains more “information”.
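How many bits a literal like this needs can be checked by round-tripping it through the IEEE formats. A sketch using Python’s struct module:

```python
import struct

value = 2.3087559104

# Round-trip through a 32-bit float and through a 64-bit double.
as_float32 = struct.unpack("<f", struct.pack("<f", value))[0]
as_float64 = struct.unpack("<d", struct.pack("<d", value))[0]

print("float32:", repr(as_float32))
print("float64:", repr(as_float64))

# Rounded to the precision that matters next to a standard deviation of 1,
# the two are indistinguishable.
print(round(as_float32, 2), round(as_float64, 2))
```

The 32-bit round trip drifts after about seven significant digits, while the 64-bit one returns the literal unchanged; both agree on 2.31.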

Your point could be construed that a human would never bother to write down such a highly precise value for the mean, and then fail to specify an appropriately small standard deviation. I mean, with a standard deviation of 1 there’s really no point in writing down more than about 2.31, and I agree with you… but there’s no good way to define “information in the Stan file” that doesn’t measure 2.3087559104 as having more “information” than “2.3”

This may seem pointless, but suppose you are trying to infer information about the US population. If you go to the census clock https://www.census.gov/popclock/ you’ll get a number, which at this point is saying 324989460 people in the US. That would be a number you could write down in a Stan file related to the number of people in the US and it represents *information* that thousands of census employees have collected. It represents more information than “last year I remember it was around 320 million so it should be maybe 322 M this year”

12. Mike Lawrence says:

I appreciate the elucidation that both the Bayesian & Frequentist employ prior information, but wrt your title: is there a point distinguishing between “information” and “belief”? I don’t see one from an information theoretic perspective; indeed, I tend to introduce priors when teaching by starting out by plotting theta on the x-axis and label the y-axis “how surprised I’d be if I were to discover the true value was here”, work out a curve, then flip the y-axis and re-label it “prior credibility”.

• Martha (Smith) says:

Mike said: “but wrt your title: is there a point distinguishing between “information” and “belief”? I don’t see one from an information theoretic perspective;”

See my comment to Ben above; I think Andrew is talking about ordinary uses of the words “information” and “belief.”