
Divisibility in statistics: Where is it needed?

The basic equation of Bayesian inference is p(parameters|data) proportional to p(parameters)*p(data|parameters). And, for predictions, p(predictions|data) = integral_parameters p(predictions|parameters,data)*p(parameters|data).
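
As a concrete illustration of these two expressions, here is a minimal sketch using a grid approximation. The Bernoulli model, the flat prior, and the function names `grid_posterior` and `predictive_prob_success` are assumptions chosen for the example, not anything specific to the argument of this post.

```python
def grid_posterior(successes, failures, grid_size=1001):
    """Approximate p(theta | data) on an evenly spaced grid over [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 for _ in grid]  # flat prior (an assumption for this sketch)
    likelihood = [t ** successes * (1 - t) ** failures for t in grid]
    # p(parameters | data) is proportional to p(parameters) * p(data | parameters)
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


def predictive_prob_success(grid, posterior):
    """p(next = 1 | data): sum over theta of p(next = 1 | theta) * p(theta | data)."""
    return sum(t * w for t, w in zip(grid, posterior))


grid, post = grid_posterior(successes=7, failures=3)
p_next = predictive_prob_success(grid, post)
```

With a flat prior the exact posterior here is Beta(8, 4), so `p_next` should come out close to 8/12. Note that the data enter only through the likelihood as a whole; nothing in the calculation requires thinking about individual data points.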

In these expressions (and the corresponding simpler versions for maximum likelihood), “parameters” and “data” are unitary objects.

Yes, it can be helpful to think of the parameter object as being a list or vector of individual parameters; and yes, it can be helpful to think of the data object as being a list or vector of individual data points (in the simplest case, an “iid model”), but this is not necessary to do Bayesian inference. Similarly, we can check our models using posterior predictive comparisons of “predictions” with “data” without needing to think formally about partitions of either of these into individual data points.

There are, however, some settings where partitioning is required: statistical concepts that make no sense without a principle of divisibility of the data.

1. The first such setting is cross-validation. As Aki and I discuss in our papers here and here, concepts such as leave-one-out cross-validation and AIC are only defined in the context of divisions of the data. The “p(parameters|data)” framework is not rich enough to encompass cross-validation. We don’t need iid data or even independent data, but we do need a structure in which “the data” are divided into N “data points.”

Once some division of the data is required, it makes us realize that the particular division we use is a choice: not a choice of the probability model, but a choice of how the model will be interpreted and evaluated. In a hierarchical model, we can choose to cross-validate on individual data points or on groups, and these two options have different implications.
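
To make the difference between the two divisions concrete, here is a rough sketch. The toy data are hypothetical, and plain sample means stand in for a fitted hierarchical model (group mean for a known group, grand mean for a new group), so this shows only how the choice of partition changes what is being evaluated, not a real Bayesian cross-validation.

```python
from statistics import mean

# Hypothetical toy data: three groups with clearly different group-level means.
data = {
    "a": [1.0, 1.2, 0.8],
    "b": [5.0, 5.1, 4.9],
    "c": [9.0, 9.3, 8.7],
}


def loo_by_point(data):
    """Hold out one data point at a time; predict it from the rest of its group."""
    errors = []
    for ys in data.values():
        for i, y in enumerate(ys):
            rest = ys[:i] + ys[i + 1:]
            errors.append((y - mean(rest)) ** 2)
    return mean(errors)


def loo_by_group(data):
    """Hold out one whole group at a time; predict it from the other groups."""
    errors = []
    for group, ys in data.items():
        others = [y for g, vals in data.items() if g != group for y in vals]
        pred = mean(others)
        errors.extend((y - pred) ** 2 for y in ys)
    return mean(errors)


mse_point = loo_by_point(data)
mse_group = loo_by_group(data)
```

In this sketch the point-level error is small and the group-level error is large: the first asks how well we predict a new observation in an existing group, the second how well we predict an entirely new group. The data and the stand-in model are the same in both cases; only the division differs.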

2. Another setting is exchangeability. A model can be spoken of as exchangeable only with respect to some list or sequence of variables. It does not make sense to say p(theta) is exchangeable; that property applies to p(theta_1,…,theta_J).
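
For reference, the standard definition can be written as follows; the permutation symbol \pi is notation introduced here, not used elsewhere in the post.

```latex
p(\theta_1, \ldots, \theta_J) = p(\theta_{\pi(1)}, \ldots, \theta_{\pi(J)})
\quad \text{for every permutation } \pi \text{ of } \{1, \ldots, J\}.
```

The definition cannot even be stated without the indices 1,…,J, which is the sense in which exchangeability presupposes a division of theta into components.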

3. Divisibility is also necessary for asymptotics. For example, consider the statement that it’s hard to compute the Ising model. It’s “hard to compute” as N increases. The Ising model with N=10 is easy to compute! To make statements about scalability, we need some increasing N, some divisibility of the data, or divisibility of the model, or both.
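
A small sketch of what “easy at N=10” means: the normalizing constant can be computed by brute force over all 2^N spin configurations. The one-dimensional open chain with no external field is an assumption made here only because it has a known closed form to check against; the function names and default parameter values are likewise illustrative.

```python
from itertools import product
from math import cosh, exp


def ising_partition_function(n, coupling=1.0, beta=1.0):
    """Brute-force Z for a 1D open-chain Ising model: sums over 2**n states."""
    z = 0.0
    for spins in product((-1, 1), repeat=n):
        # Energy is minus the coupling times the sum of neighboring spin products.
        energy = -coupling * sum(spins[i] * spins[i + 1] for i in range(n - 1))
        z += exp(-beta * energy)
    return z


def closed_form(n, coupling=1.0, beta=1.0):
    """Known result for the open chain with zero field: 2 * (2 cosh(beta*J))**(n-1)."""
    return 2.0 * (2.0 * cosh(beta * coupling)) ** (n - 1)


z_brute = ising_partition_function(10)  # 1024 configurations: instant
z_exact = closed_form(10)
```

The enumeration doubles in cost with each added spin, so the same loop that is instant at N=10 is hopeless at N=100. That scaling statement only has content because the model is indexed by a number of units N.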

So what?

The point of this is that a Bayesian model is not just its posterior distribution. It’s also the sampling distribution (which is required for predictive checking, as discussed in chapter 6 of BDA) and also the division or partition of data.

10 Comments

  1. Sameera Daniels says:

    Shravan,

    ‘Useful’ has many many connotations

    http://www.thesaurus.com/browse/useful. Which one/s do you mean to convey?

    However, nearly everyone here on the blog has been pointing to practical & theoretical problems in specific contexts: based on their own vocational/practical experiences. They have been getting ‘their hands dirty’.

    If the objective is to redo, reinvent, or circumvent those problematic results, then it leaves open the sense in which you mean ‘useful’. Perhaps you can point to one or two examples of useful efforts.

    If you mean that it is necessary to grasp technical knowledge; that is a given. Nevertheless there are aspects to any of the problems raised that are within the grasp of those who may not have extensive technical knowledge. Obviously some fields are highly technical to begin with. It requires putting in the time to learn them. It’s that most entail logical reasoning, which is important to detect as well. Everyone has different competencies [subject matter & logical] in different degrees. There is no assurance that subject matter expertise is necessarily going to yield critically thinking effort. That is why several of the books recommended may be useful.

    This discussion reminds me of a theme that has cropped up now and then: the generalist vs. the subject matter expert. And, even more fundamentally, it goes back to how to improve judgment, qualitatively and quantitatively.

    Lastly we have been apprised of some of these problems when Sander debated Carlos here. I’ll have to find the link.

  2. ojm says:

    > statistical concepts that make no sense without a principle of divisibility of the data.

    Yes!

    A key difference between a single dataset as N=1 vs N = many is that in the latter case one can include/exclude individual observations. Resampling methods absolutely require the idea of divisibility of data, as does asymptotics.

    • Andrew says:

      Ojm:

      I’m glad that someone commented on this post! I think these sort of half-technical ideas are extremely important, but they often don’t seem to get much interest or attention.

      • ojm says:

        Well, I’m glad you posted this post!

        I’ve thought about this idea before, but because it is indeed ‘half-technical’ I haven’t often said it out loud and/or very clearly, even though it’s one of my favourite statistical ideas.

      • Keith O'Rourke says:

        Agree – it’s hard to get others to see the purposefulness of half-technical ideas. As one journal editor put it when reviewing http://andrewgelman.com/wp-content/uploads/2011/05/plot13.pdf, it “does not involve enough technical development to warrant publication in my journal”, and this just after he wrote that he thought the ideas were important and should be used more.

        This was my best pitch at it – “At the heart of any statistical analysis is replication – or the repeated observation of a phenomenon. Each replication can be considered a unit of analysis – and it is these very unit of analysis contributions that we wish to understand and clearly display [at least appreciate]. For a replication to be a true replication, there must not be complete dependence and for a replication to be strong there must be as much independence as is possible. In fact, often a unit of analysis is taken as that unit that gives complete independence under reasonable assumptions.”

      • Luis Usier says:

        Maybe you should have a “like” button for posts, so people can indicate they like it without the need for commenting

  3. Another way to think about this is that as Bayesian modelers, we have a choice of how to model something. Sometimes all the data is a single observation: a timeseries for example. Our goal is to explain this whole vector as one object. Other times, people take a time-series and claim it’s just a function + independent noise, now the noise is divisible, in fact completely divisible…

    Other times you’d like to imagine that “close-together” data is related, but “far apart” data is as-if independent. Say with spatial data for example, or the averages in different decades of some time-series.

    The point I’m trying to make is that we have a *choice* of how to treat divisibility, which we should make consciously. There’s no law of nature that says that two events far apart in time are independent, nor is there a law of nature that says we need to treat everything as related. Sure, it all *is* related, but we don’t need to *treat* it that way to make progress. When we make weather models for short-term hurricane forecasts, we don’t include butterfly flapping etc., even though it’s well known that such sensitivity is inherent in many dynamical systems.

  4. Carlos Ungil says:

    > The first such setting is cross-validation. (…) We don’t need iid data or even independent data,

    Doesn’t cross-validation assume exchangeability? If essentially different models are being fitted, how do you put them together?

    • Keith O'Rourke says:

      When I was taught cross-validation (by David F Andrews) we were advised to try to make it non-exchangeable as part of a fuller cross-validation analysis. That is, try to discover in what ways the data are not exchangeable, in hopes of getting a less wrong conditional exchangeability, or else of not taking the cross-validation as being very stable or meaningful.

      Recently a paper in toxicology did this nicely – the winning model won by being good at predicting little to no toxicity while missing most of the high toxicity – good on average but definitely not what anyone wants.
