Touch me, I want to feel your data.

(This is not a paper we wrote by mistake.)

(This is also not Andrew)

(This is also really a blog post about one aspect of the paper, which mostly focusses on issues around visualisation and how visualisation can improve workflow. So you should read it.)

Recently Australians have been living through a predictably ugly debate around marriage equality. But of course, every country that lives through this humiliating experience needs to add its own home-grown, jingoistic, and weirdly specific idiocy. This came in the guise of Senator Eric Abetz, who claimed that marriage equality would open the “floodgates” to people rushing to marry the Sydney Harbour Bridge.

Why is this relevant? Well, because there is frequent and vociferous opposition to procedures that “use the data twice”. This always surprises me, because when I do applied work [which has not yet involved a pre-designed RCT] I look at the data so often that I’ve basically married it. In fact, during my last consulting job, the data and I had seven children and settled down in a beautiful house in the country.

We saw a problem

So why is there so much dissonance between people who write about methodological statistics and my own experience doing applied statistics? (I am willing to consider the idea that I could be better at statistics. Because it is hard!)

A big part of the dissonance comes from two sources:

1) If Null Hypothesis Significance Testing is happening in any guise (see Bayes factors for model choice, for a Bayesian example), then using the data twice violates every aspect of the theoretical assumptions. This is one of the many reasons why NHST is difficult to perform correctly.

2) If we are trying to build a predictive model (a general task, remembering that a causal model is just a predictive model for a scenario we haven’t seen), then using the data twice can lead to overfitting and massive generalisation error.

Of these two problems, I care deeply about the second (as for the first: I care not a sausage).

So we wrote a paper about it

This is a very long way around to the point that we [Jonah, me, Aki, Michael, and Andrew]  wrote a nice paper called Visualization in Bayesian workflow.

The paper is about how visualisation [I’m not American. I spell it correctly.] is a critical part of a Bayesian’s applied workflow.

The critical idea that we’re playing with is that you build up a model piece-by-piece, in part by looking at the data and seeing what it can support.  In the paper we cover a bunch of ideas about how visualisation is important in this process (from building a model, to checking that your MCMC worked, to doing posterior checks, to doing model comparison).

This workflow inevitably uses the data more than once. But, as we wrote in the final paragraph:

Regardless of the concerns we have about using the data twice, the workflow that we have described in this paper (perhaps without the stringent prior and posterior predictive checks) is common in applied statistics. As academic statisticians, we have a duty to understand the consequences of this workflow and offer concrete suggestions to robustify practice. We firmly believe that theoretical statistics should be the theory of applied statistics, otherwise it’s just the tail trying to wag the dog.

A better use of prior predictive distributions

As we talked about in the last paper, understanding that a marginal likelihood uses the prior predictive distribution is key to understanding why it’s so fragile when used for model comparison. (Really, we should call marginal likelihoods “leave-everything-out cross validation”.)
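
In symbols, the marginal likelihood is p(y) = ∫ p(y | θ) p(θ) dθ, which is exactly the prior predictive density evaluated at the observed data. Every observation has to be “predicted” by the prior alone, which is why a vague or badly scaled prior can swamp the comparison.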

But that does not mean the prior predictive distribution is useless; it just means it’s not the right thing to look at when doing model comparison.  In Section 3, we suggest using the prior predictive distribution as an additional check to monitor how informative your priors are jointly.  In particular, we argue that the prior predictive distribution should be “weakly informative” in the same way that priors on individual parameters should be weakly informative.

That is, the data generated using the prior (and the fixed elements of the design, in this case the spatial locations and the covariate values) should cover the range of “reasonable” data values and a little beyond.  For this application, we know that the reasonable range is below about 200 micrograms per cubic metre, so actually Figure 4b) suggests we could probably tighten the priors even further (although they are probably fine).

How do you do this in practice?  Well, I write an R script that saves 100 or so realisations from the prior predictive distribution into a directory and then flip through them. You could also look at the distributions of relevant summary statistics.
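
For concreteness, here is a minimal sketch of that kind of script. The model, priors, and covariate are made up purely for illustration (a toy linear regression, not the model from the paper); swap in your own priors and the fixed parts of your design.

# Minimal prior predictive check: simulate fake data sets from the prior,
# save one plot per realisation, then flip through them.
# (Toy model and priors for illustration only.)
set.seed(1234)
n_sims <- 100
n_obs  <- 250
x <- runif(n_obs, 0, 10)             # fixed covariate values (part of the design)

dir.create("prior_pred_plots", showWarnings = FALSE)
for (s in seq_len(n_sims)) {
  # draw parameters from their priors
  beta0 <- rnorm(1, 0, 1)
  beta1 <- rnorm(1, 0, 1)
  sigma <- abs(rnorm(1, 0, 1))       # half-normal prior on the noise sd
  # generate a fake data set from those parameter values
  y_sim <- rnorm(n_obs, beta0 + beta1 * x, sigma)
  png(file.path("prior_pred_plots", sprintf("sim_%03d.png", s)))
  plot(x, y_sim, main = sprintf("Prior predictive draw %d", s))
  dev.off()
}

The same loop is also a convenient place to compute summary statistics (say, the maximum simulated value) if you would rather look at their distributions than at individual fake data sets.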

In the end, prior predictive checks and posterior predictive checks are both important in the statistical workflow.  But your prior predictive checks make the most sense before you do things with your data, and posterior predictive checks make the most sense after. This also lets us avoid computing a marginal likelihood, which we all should know by now is extremely difficult.

bayesplot

Almost all of the plots in the paper were done using Jonah’s package bayesplot or using ggplot2.  A few of the plots are actually not available in the current version of bayesplot: unsurprisingly, when you write a research paper about visualisation techniques, it turns out you need to implement new ones!

 

45 thoughts on “Touch me, I want to feel your data.”

  1. Like Dan said, most of the plots from the paper are already available in bayesplot, but the ones that aren’t in the current release of the package are on their way (https://github.com/stan-dev/bayesplot/pull/108). For anyone interested, when we get the next release of bayesplot out I will put together a vignette that shows how to fit the models in the paper with Stan and then make all of the plots with bayesplot.

      • Can’t wait to hear how long it takes the function to actually make the plot with that many parameters! It’s going to be one of the slower functions in bayesplot since there’s a lot more for ggplot2 to draw than there is for summary plots (e.g. intervals) or any plot that just shows one or two parameters.

        I’ll also be interested to know if it’s at all interpretable with that many parameters. My hunch is that you’d need to order the parameters along the horizontal axis in a meaningful way to have any hope of noticing anything useful. Of course “meaningful” is very context dependent, which is always annoying but also what makes things interesting.

        • After normalising all the data (e.g. to mean 0, sd 1), one can compare parameters based on the sd calculated for the divergent samples.

          Also it would be great if we could plot the model connections as a graph. Then we could plot only a subset of parameters (e.g. parameters that are connected to the top 3 parameters with the lowest sd in the divergent samples).

  2. A while back we had discussions about “using the data twice” and I came to the conclusion that it’s not how often you use the data, but how well you condition on having used the data. Carlos Ungil gets big props for helping me come to that understanding.

    For example, suppose someone sends you some data from an instrument that measures in an unknown set of units, but you’re told that it has measurement noise which is of order of magnitude of 10% of the typical measurement.

    s ~ exponential(1.0/0.1);

    data ~ normal(mu,mutyp*s);

    might be an ok model, if you just had some idea what the heck the units of the mu values were so you could choose a mutyp and use it to set a prior on mu. But, even if you know that this thing is say measuring people’s heights, because you don’t know what units it’s using (micro-meters, cm, lightyears, hundredths of an inch?) you can’t even approximately put a prior. Looking at the data, and then conditioning on that look, provides the answer:

    p(data | mu,sigma)p(mu | units)p(units | median(data))

    p(units | median(data)) strongly concentrates around one kind of units (let’s say the median value is 170: you’re now looking at cm, not inches or feet or furlongs or micrometers)

    Similarly, if you look at prior predictive values for your model and adjust your priors until you find that outputs are at least sort of the right order of magnitude as in figure 4 of your paper, are you over-fitting? To the extent that your prior still includes fake data well outside the ranges of real data, and to the extent that you have many more than 2 samples in your dataset, I’d say no. You could for example do quantile(data,c(.01,.99)) which pulls out essentially 2 “degrees of freedom” from your data and fit your priors to blanket that interval or the like (say a convex hull over a 2D point cloud, or 95% intervals in each dimension or whatever). If you have N much larger than 2 data points, you are using very little data information.
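
    A minimal sketch of the quantile version of this (invented data and numbers, just to make the point that only a couple of numbers from the data get reused):

    # pull out roughly 2 "degrees of freedom" from the data and blanket them with the prior
    data <- rexp(500, rate = 1/170)          # pretend measurements, unknown units
    q <- quantile(data, c(0.01, 0.99))
    mu_prior_mean <- mean(q)                 # centre of the blanket
    mu_prior_sd   <- as.numeric(diff(q))     # wide enough to stretch a little beyond [q1, q99]
    # then something like mu ~ normal(mu_prior_mean, mu_prior_sd) in the model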

    The big issue is that in fairly complicated models it can be hard to understand how the joint prior across many parameters affects the prior predictive distribution, and without that knowledge, you can’t reasonably set priors, because a prior is just a device to constrain the acceptable search region over possible prediction models. If you set priors that seem individually reasonable, but jointly they are somehow accidentally informative in an incorrect way about the data generating process, you will get much worse fits than if you adjust priors in the way you suggest (figure 4). So, which is worse? A wrong fit + no data “reuse”, or minimal data reuse that gives you the right answer? I know which one makes sense to me.

    • This is excellent.

      Do you have a link to the comment thread where you had that discussion with Carlos? I would be very interested in reading the dialogue.

        • There’s nothing embarrassing about learning from experience / discussions with others.

          “You don’t honestly think I believe today what I did last year do you?” – George Barnard (as quoted in An Accidental Statistician, G.E.B chapter 4)

      • I agree with this. To some extent, that’s what I was talking about when I was talking about how theory doesn’t always shadow use. Arguments about “using data [multiple times]” are like that. I think everyone accepts that there are bad ways to use data multiple times, but that doesn’t mean every multiple use is unsafe. So the question becomes “can we quantify this?” in order to identify danger points and robustify practice. (My spellcheck firmly refuses to believe robustify is a word. But I am unwilling to write “fortify”.)

        • I think one of the informal ways of justifying relooking at the data and changing a prior is that a prior in the form of a classical parametric distribution is only a greatly oversimplified version of our a priori knowledge.

          An example that I like to use is that of looking at mean FEV in teens that smoke compared with those that don’t. Before seeing the data, I think that FEV should be higher in non-smokers than smokers. So I put a normal distribution with a negative mean as the prior for the “smoking effect” (quotes imply the word effect is misleading). However, I see the data and the smoking effect is positive, with a p-value of 0.001.

          Is the logical thing to conclude that the smoking effect is somewhere in the middle? If I truly believe the normal prior, yes. On the other hand, this surprising result makes me rethink things for a bit. After some thought, I realize that smokers are more likely to be older than non-smokers, and age was not adjusted for in this study, other than “being a teenager”. Now I realize that my previous prior was silly, and most importantly, didn’t really represent my full prior knowledge before seeing the data.

          Daniel, I think your example falls into the same category. You have definitive prior information about the measure of interest. But you have a lot of trouble accurately describing it in a distribution, at least until you’ve seen a few samples.

          If we were saying that calculated Bayesian posteriors were how one should make decisions about publication, grants, etc., AND we allowed you to change your prior after seeing the data, this would unquestionably lead to horrible gaming of the system. On the other hand, if you are a researcher and want to do your very best to understand a system, updating a mistakenly misparameterized prior is definitively a good idea.

        • This is my thought as well. It’s not that looking at the data in this circumstance changes the prior; it calibrates the prior. This is just cool.

          Although the line between calibration and fishing could get blurred very quickly. And I agree with Dan that careful thought about the principles here is required to make sure we don’t do something silly without realizing it.

        • Let me emphasize in this discussion that the same concerns arise with the data model, which, like the prior, is chosen by the researcher, not given by God.

        • Right, so let’s get into an example there too. You start out by looking at some psychology issue… and you find that the average effect of your treatment on the full population was small… so you decide to look at whether there are effects within subpopulations that sort of cancel out at the full-population level, like say a breakdown by race and sex or age or something.

          Now, your data model has changed after looking at the data. Under what circumstances does this make sense?

          I think it makes sense when there are no other alternative models you’d think of fitting. If there are other models, you really need to incorporate them into a model choice problem: a mixture model p1(data | param1) p(param1) + p2(data | param2) p(param2), etc. (there’s a toy sketch of this at the end of this comment).

          If you try just one model at a time, you can find that several don’t fit that well, and then one fits sort of better, and then ignoring those that don’t fit well is not justified, whereas in the model choice problem where all models are allowed to participate, if you find one model that is strongly favored, then it is justified to say that this model is strongly favored, the evidence suggests this model over the others.

          Again I think this has the flavor of incorrect conditioning. Rejecting a model because it’s a poor fit and moving to the next model and the next model… if each “next model” is conditioned on the fact that you’ve rejected some previous model… then that can work, but just trying a new model without conditioning correctly on that information you had doesn’t work. The easiest and most effective way to condition correctly is to create a super-model that incorporates all the possibilities into one big model, whether that’s a discrete mixture or a continuous mixture, or whatever.
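
          A toy sketch of that super-model idea, with two invented normal sub-models and a grid approximation (nothing here is from the paper, it’s just to show the mechanics):

          # one big model: a discrete mixture over two candidate likelihoods
          # (invented data and sub-models; grid approximation for illustration only)
          set.seed(1)
          y <- rnorm(20, mean = 1, sd = 1)
          mu_grid <- seq(-10, 10, length.out = 4001)
          dmu <- diff(mu_grid)[1]
          prior_mu <- dnorm(mu_grid, 0, 2)              # shared prior on mu
          marg_lik <- function(sd_k) {                  # integrate mu out on the grid
            sum(sapply(mu_grid, function(m) prod(dnorm(y, m, sd_k))) * prior_mu) * dmu
          }
          marg <- c(model_1 = marg_lik(1), model_2 = marg_lik(3))
          post_model <- marg * c(0.5, 0.5) / sum(marg * c(0.5, 0.5))
          post_model    # posterior weight of each sub-model inside the one big model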

        • I have been thinking about this statement, “we feel that outmoded attitudes of objectivity and subjectivity have impeded the work of statisticians, by discouraging some good practices and by motivating some bad practices”, from http://www.stat.columbia.edu/~gelman/research/published/gelman_hennig_full_discussion.pdf, along with Phil Dawid’s comments on meta-statistics and Peirce’s late realization that meta-physics can’t be ignored as frivolous, because it leads to bad statistical practice. A further warning: given its direct repercussions, meta-physics is in desperate need of logic.

          So I am seeing it everywhere now – but I do think that if we want good statistical practice, it needs to be sorted out.

          For instance, I think (and have since Two Cheers for Bayes, https://www.ncbi.nlm.nih.gov/pubmed/8889349) that research communities need to learn how to process and critically peer review prior assumptions, so that they are not taken as just arbitrary opinions of individuals.

    • “…without that knowledge, you can’t reasonably set priors, because a prior is just a device to constrain the acceptable search region over possible prediction models.”
      Daniel, it sounds like you are moving away from the view of Bayesian inference as some kind of real-valued generalization of Boolean logic?

    • > because you don’t know what units it’s using (micro-meters, cm, lightyears, hundredths of an inch?) you can’t even approximately put a prior.

      Why not? If you really don’t have any idea of the units, scale invariance suggests a prior of the form 1/x.

      > (lets say the median value is 170 you’re now looking at cm not inches or feet or furlongs or micrometers)

      It could also be in atto-light-years or centi-yards… It’s not clear to me if in this example you’re trying to choose among a well defined set of “units” or if you’re really ignorant about the units. In the second case, I don’t think the approach that you propose is going to be any better than using the uninformative prior on mu from the start.

      • Carlos: yes, I was trying to imagine choosing between a reasonable finite set of units (feet, meters, cm, in, mm, yards for example)

        However, this example was motivated by a recent problem I had where people had measured the pixel area of images, but had no absolute scale in the image (i.e. nothing of a known size to measure within the image).

        In this problem, it actually didn’t matter because the only meaningful thing was the dimensionless ratio of the size of things for genotype A vs the average size of things for genotype B. Within a batch the scale was constant (no-one adjusted the magnification), and I know that measurement error is a certain fraction of the size of the whole… Dividing all the measurements in a batch by the median eliminates the scale problem and lets me put a reasonable prior on the adjusted measurements. Inferring the actual size of things in meters is an intermediate step that adds bias and inaccuracy which doesn’t actually exist in the problem, because the actual problem is scale invariant (dimensionless ratio of sizes).

        Dan mentions below that you should always use the units and convert between them etc. But this information just may not be known. If you know the units, sure, respect them. If you don’t, such as when you have images with no reference length in them, this doesn’t mean you can’t proceed, it just means you have additional uncertainty… Scale invariant priors are … not priors because they don’t normalize, and also usually don’t respect the degree of knowledge you do have. I don’t actually think people are measuring people’s heights in atto-lightyears.

        Anyway, the problem with coming up with lots of examples is that they’re never what you might call complete and well thought out, but they illustrate interesting issues in data analysis, so I still think it’s worthwhile thinking about some of the issues they bring up. In the case where you’re really completely uncertain about the units, there is no concentration of p(units | median(x)) around a single value so that case doesn’t behave as I was implying.

        • > Dividing all the measurements in a batch by the median eliminates the scale problem and lets me put a reasonable prior on the adjusted measurements. Inferring the actual size of things in meters is an intermediate step that adds bias and inaccuracy which doesn’t actually exist in the problem, because the actual problem is scale invariant (dimensionless ratio of sizes).

          I think that writing your reasonable prior in terms of a nuisance scale parameter is the proper way to handle the problem, at least in principle. The intermediate step of normalising using the median of the data is what might add bias and inaccuracy if not properly accounted for. Obviously if there are many samples your data-peeking will go unpunished, but if there are only a few data points it is easy to see that the procedure is not clean in general (maybe it doesn’t matter in your particular example, I don’t know).

          > Scale invariant priors are … not priors because they don’t normalize, and also usually don’t respect the degree of knowledge you do have. I don’t actually think people are measuring people’s heights in atto-lightyears.

          It was you who said that you couldn’t put a prior on mu! The scale invariant prior is appropriate if you have *no* knowledge. If you do have *some* knowledge it’s up to you to use it to construct a better prior.

        • “I think that writing your reasonable prior in terms of a nuisance scale parameter is the proper way to handle the problem, at least in principle.”

          This might be getting a bit far off topic and into specifics of a problem that isn’t generally of interest, but here goes with at least a little more background:

          suppose you have values data[i] measured in some strange units that depend on how someone adjusted the microscope zoom….

          then

          data / C are data values measured in some other strange units for any positive value of C.

          Suppose you have a procedure for choosing C, Choose(C), which guarantees that all the rescaled measurements are O(1); then, given your knowledge of the measurement procedure, suppose you can infer that the measurement errors are O(0.1) in these new units. Suppose that you are willing to model the statements O(1) and O(0.1) by some choice of distributions using the constants 1 and 0.1.

          Now suppose you know that C = median(data) is a procedure which does guarantee O(1) measurements.

          Then you can use your information by building a model that is conditional on the fact that you use C = median(data) as your choice of measurement unit.

          If you pretend that median(data) = the mean of your data, or equals some parameter Q, then you’re going wrong. But this isn’t the information you’re using.

          What we’re doing here is a kind of symmetry breaking, and is really similar to allowing some finite set of possible rescaling values, and then truncating the model to the one that is strongly picked out. For example suppose you accept the following:

          there exists an integer N between -100 and 100 such that
          C = 2^N makes my measurements O(1) and I am willing to place particular priors on mu, s given N.

          Now N is from a finite set and one particular N can be inferred with near 100% posterior mass after looking only at median(data). Although technically we should retain all those other N in our model, since they have probability ~ 0 it’s sufficient in practice to truncate the model to a single term. This is perhaps the best way to think about what is going on.

        • I can understand why this is convenient. I just say that you can do equally well, or better, introducing a latent scale parameter mu and having your model assume measurements with some distribution O(mu) with error O(0.1mu). If you have just three points, fixing the scaling parameter to be the median is likely to end up understating or overstating the magnitude of the error. Of course if you have enough data the result will be approximately the same.

        • In practice, when I did that, it was virtually impossible for Stan to sample the posterior. The likelihood is doing inference on the scale-free dimensionless ratio. The inference on the dimensionless ratio is of course independent of the scale parameter because it divides away. Almost the only information on the scale parameter is provided by the prior. If you use a non-normalizable prior like 1/s it’s not even possible to sample. If you use a normalizable prior that is super-wide you wind up with a near singular mass matrix as it needs to ship that value all over the place. The final inference becomes highly dependent on the prior and gave unreasonable inference because in order to get coverage over the whole range of possible scale values, you wind up giving prior weight to totally unreasonable scales thousands of times larger than reality.

          Thinking of the posterior inference in terms of something like floating point numbers, you are better off treating the exponent as a set of integers over which you’re doing inference, and the mantissa as having all of your “real” uncertainty. By the time you’re doing that, only one integer value of the exponent makes any sense, and your prior is then expressing the mantissa uncertainty. This is more or less what I did (actually I didn’t divide by median(x), I divided by exp(round(log(median(x))))).

          The mantissa uncertainty is real, but the exponent is highly concentrated on a single integer value.
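
          In code, the rescaling step is just something like this (invented measurements; the "exponent" here is a power of e rather than a power of 2, but the idea is the same):

          x <- rexp(50, rate = 1/4200)           # pixel areas in some unknown unit
          C <- exp(round(log(median(x))))        # the integer "exponent" part, picked out by the median
          x_scaled <- x / C                      # O(1) values; the "mantissa" carries the real uncertainty
          summary(x_scaled)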

    • > because you don’t know what units it’s using (micro-meters, cm, lightyears, hundredths of an inch?) you can’t even approximately put a prior

      I don’t agree with this even slightly. If you’re working in an area where there are units (not every statistical problem falls into this category, but that’s the world you’ve put us in), then your model should have units as well. In the event that the gathered data doesn’t match those, you convert one to the units of the other. This is not “looking at the data”. It’s just setting a prior that respects your physical reality (in this case, that a quantity has units).

      What was that thing they sent to Mars where they did the calculation in feet but programmed it in metres? Ignoring the units in a problem is the statistical version of that.

      Carlos suggests you can use scale-invariant priors, which is true but besides the point.

      • It’s not always the case that the units are known. How much area in physical mm^2 is encompassed in a particular region of a digital microscope image? It depends on lots of things that we don’t know.

        If you are doing inference on a dimensionless ratio of areas, the inference is on a dimensionless parameter that is independent of the actual area measurement (because the units appear in the numerator and denominator). This symmetry makes your inference problem needlessly hard if you insist on inferring the conversion factor to mm^2, because you’ll need some prior on this conversion factor, a quantity you don’t actually care about, and on which your inference of the ratio of areas does not depend.

      • Dan, Daniel:

        This all really deserves its own post—actually, I think I may have posted on the topic already at some point— . . . anyway, in my view the right way to set up this sort of problem is with a redundant multiplicative parameterization, in which the multiplier represents the appropriate scaling factor to put the problem on unit scale. Standardizing based on the data can be a reasonable hack—that’s how we set up our default priors in rstanarm—but ultimately I think of standardization as an approximation to a two-step approach in which there’s some unknown standardizing factor to be estimated.

        And, yes, we’re all in agreement (but the statistics profession isn’t!) that prior information about scaling can be very relevant. Indeed, in general it’s too much to ask of a statistical procedure to work for any possible scaling factor.

        • It’s a really interesting and unintuitive issue. I agree that standardizing relative to the data is a bit of a hack, but in general in many problems this hack seems to give better inference than priors that pretend you don’t have scale information. For example in the microscope problem. Suppose several people collect the data. Some of them have 8 megapixel cameras, some 12, some 10… They each measure areas in pixels. They take the measurements for treatment and control samples. The only parameters of interest are the ratio of means between treatment and control, and the measurement errors in the measurements as a fraction of the control mean value…

          Oh, and by the way, they don’t tell you what their camera resolution or their microscope mag settings were; you just know they’re constant across each batch… and that people tend to set their mag so that the structures fill a significant fraction of the field of view. It’s a fascinating, simple but deep exercise in understanding scale and measurement and human factors.

  3. Dan:

    This comment may be unnoticed among all the music references, but, if you’re interested, here are some links on that “using the data twice” thing.

    A post from a few years ago in which I wrote:

    I guess we could argue all night about the expression “using the data twice,” as ultimately this phrase has no mathematical definition. From a Bayesian perspective, the posterior predictive p-value is a conditional probability given the data, and as such it uses the data exactly once.

    An article from 10 years ago covered by this blog post.

    • I think I saw the first of these when it came out.

      I totally agree with you (except for the bit about P-values vs U-values, but mainly because I am attached to the convenience of knowing what distribution my computed quantities should follow so I can do my quick graphical checks). I like LOO-PITs for my PPC-ing because for continuous observations they are fairly uniform, and departures from uniformity are graphically interpretable in terms of calibration (over-/under-dispersed predictions, bias, etc.) or in terms of the data being insufficient to predict itself (which is also picked up by the k-hat values that Aki, you, and Jonah constructed). A bare-bones sketch of the LOO-PIT computation is at the end of this comment.

      But I guess the way we’re touching the data multiple times is potentially a much worse sin (as I said, I don’t give a fig for NHST-ing, but I am always worried about generalisation error or overfitting). My feeling is that the stuff we did (prior and posterior predictive checks) is enough to keep over-fitting under control, but I think it would be an interesting “theory of applied stats” question to see if we can find conditions under which this is not true, which might hint at what else is needed. (Just to put a pin in this, so we can talk about it in a couple of weeks. Viva workflow!)
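
      For anyone who hasn’t met them, here is that bare-bones sketch of what a LOO-PIT is. It uses plain self-normalised importance weights rather than the PSIS smoothing you’d use in practice, and it assumes you already have yrep and log_lik as draws-by-observations matrices from a fitted model:

      # bare-bones LOO-PIT: PIT_i = Pr(yrep_i <= y_i | data without observation i)
      # yrep: S x N matrix of posterior predictive draws; log_lik: S x N pointwise log-likelihoods
      loo_pit <- function(y, yrep, log_lik) {
        sapply(seq_along(y), function(i) {
          lw <- -log_lik[, i]                    # log of 1 / p(y_i | theta^s)
          w  <- exp(lw - max(lw))                # stabilised importance weights
          w  <- w / sum(w)
          sum(w * (yrep[, i] <= y[i]))
        })
      }
      # roughly uniform PITs suggest well-calibrated predictions:
      # hist(loo_pit(y, yrep, log_lik), breaks = 20)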

  4. Is fig 4 labeled wrong?

    “Panel (c) shows (a) and (b) in the same plot, with red points representing the realizations using vague priors and gray points using weakly informative priors.”

    Gray points are showing vague priors, right? In the text, page 5 and just below fig 4, you write: “it is quite a bit more plausible than the model with vague priors.”

    • No, in panel (c) the grey points are the same points as in panel (b) but *the y scale has changed*. The point is that the observed log(PM2.5) is around 1 to 5, panel (b) shows the prior predictive giving around 0 to 10, but panel (a) shows the prior predictive giving 200 to 800 (!!! on the log scale!).

      If you show both panels (a) and (b) on the same plot, the range 0 to 10 appears as basically nothing, all smushed into the line of dark points at the bottom, because the y scale goes from 0 to 800.

      The figure could be better constructed, unless you know to look carefully at the y axis labels it’s hard to understand what’s going on.

      • Yeah, what Daniel Lakeland said. The figure label isn’t wrong, but it could probably be more informative. When referring to the figure in the text we do say “note the y-axis limits and recall that the data are on the log scale”, but maybe we should also put that in the figure caption. I’m also happy to change that figure if anyone has a better idea for how to display this information.

        • The first thing you could do is keep the color coding consistent. Color panel (a) red and panel (b) blue (to help color blind people), and then in the final panel keep the colors red and blue; or, if you don’t like red and blue, choose some other colors, but keep them consistent.

          Then add a background grey oval representing good prediction, stretched out along a y=x line, to each plot. Make the (b) plot have symmetric x and y axes so the y=x line goes through the middle of the graph. Make the (a) plot have axes so that you can see where the y=x line and the grey oval of good prediction are…

          Hope some of that helps.

  5. “I look at the data so often that I’ve basically married it”.

    Thank you for this sentence, for this blog post and for your paper. ‘Using data twice’ does not mean anything, actually, but some people (frequentists and subjectivist Bayesians) tend to use this expression as a general warning. Of course, as applied statisticians we often need to look at the data more than once, and, maybe, update our beliefs (priors) over time. As Andrew said several times in his papers, the prior must be tested! I dealt with something similar in this arxiv paper:

    https://arxiv.org/abs/1708.00099
