Oh, and by the way, they don’t tell you what their camera resolution or their microscope mag settings were; you just know they’re constant across each batch… and that people tend to set their mag so that the structure fills a significant fraction of the field of view. It’s a fascinatingly simple but deep exercise in understanding scale, measurement, and human factors.

]]>Thinking of the posterior inference in terms of something like floating point numbers, you are better off treating the exponent as a set of integers over which you’re doing inference, and the mantissa as having all of your “real” uncertainty. By the time you’re doing that, only one integer value of the exponent makes any sense, and your prior is then expressing the mantissa uncertainty. This is more or less what I did (actually, I didn’t divide by median(x); I divided by exp(round(log(median(x))))).

The mantissa uncertainty is real, but the exponent is highly concentrated on a single integer value.
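
A minimal sketch of that rescaling, assuming the measurements are positive. The function name `exponent_scale` and the pixel-area values are made up for illustration; only the formula exp(round(log(median(x)))) comes from the comment above.

```python
import math
import statistics

def exponent_scale(values):
    """Return exp(round(log(median))): the median snapped to an integer power of e."""
    med = statistics.median(values)
    if med <= 0:
        raise ValueError("the median must be positive to take its log")
    return math.exp(round(math.log(med)))

# Hypothetical pixel areas from one batch (unknown physical units).
pixel_areas = [15200.0, 18100.0, 16750.0, 21000.0, 17300.0]

scale = exponent_scale(pixel_areas)          # the "exponent" part, fixed per batch
rescaled = [v / scale for v in pixel_areas]  # the "mantissa" part, of order 1

print(scale, rescaled)
```

By construction the rescaled median lands between exp(-0.5) and exp(0.5), so a prior written for O(1) quantities applies to `rescaled` whatever the original units were.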

]]>This all really deserves its own post—actually, I think I may have posted on the topic already at some point— . . . anyway, in my view the right way to set up this sort of problem is with a redundant multiplicative parameterization, in which the multiplier represents the appropriate scaling factor to put the problem on unit scale. Standardizing based on the data can be a reasonable hack—that’s how we set up our default priors in rstanarm—but ultimately I think of standardization as an approximation to a two-step approach in which there’s some unknown standardizing factor to be estimated.

And, yes, we’re all in agreement (but the statistics profession isn’t!) that prior information about scaling can be very relevant. Indeed, in general it’s too much to ask of a statistical procedure to work for any possible scaling factor.

]]>If you are doing inference on a dimensionless ratio of areas, the inference is on a dimensionless parameter that is independent of the actual area measurement (because the units appear in the numerator and denominator). This symmetry makes your inference problem needlessly hard if you insist on inferring the conversion factor to mm^2, because you’ll need some prior on this conversion factor, a quantity you don’t actually care about, and on which your inference of the ratio of areas does not depend.

]]>This might be getting a bit far off topic and into specifics of a problem that isn’t generally of interest, but here goes with at least a little more background:

suppose you have values data[i] measured in some strange units that depend on how someone adjusted the microscope zoom….

then

data / C are data values measured in some other strange units for any positive value of C.

Suppose you have a procedure for choosing C, Choose(C), which guarantees that all the rescaled measurements are O(1). Then, given your knowledge of the measurement procedure, suppose you can infer that the measurement errors are O(0.1) in these new units. Suppose that you are willing to model the statements O(1) and O(0.1) by some choice of distributions using the constants 1 and 0.1.

Now suppose you know that C = median(data) is a procedure which does guarantee O(1) measurements.

Then you can use your information by building a model that is conditional on the fact that you use C = median(data) as your choice of measurement unit.

If you pretend that median(data) = the mean of your data, or equals some parameter Q, then you’re going wrong. But this isn’t the information you’re using.

What we’re doing here is a kind of symmetry breaking, and is really similar to allowing some finite set of possible rescaling values, and then truncating the model to the one that is strongly picked out. For example suppose you accept the following:

there exists an integer N between -100 and 100 such that

C = 2^N makes my measurements O(1) and I am willing to place particular priors on mu, s given N.

Now N is from a finite set and one particular N can be inferred with near 100% posterior mass after looking only at median(data). Although technically we should retain all those other N in our model, since they have probability ~ 0 it’s sufficient in practice to truncate the model to a single term. This is perhaps the best way to think about what is going on.
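
Here is a rough sketch of that discrete inference over N, under assumptions that are mine rather than part of the argument above: a uniform prior over N in [-100, 100], and “O(1)” modelled as log2(median / 2^N) ~ Normal(0, width) with width = 0.15. How sharply the posterior concentrates depends on that width and on how close log2(median) sits to a half-integer.

```python
import math

def exponent_posterior(median_value, n_min=-100, n_max=100, width=0.15):
    """Posterior over integer N (uniform prior) for the scale C = 2**N.

    Assumed likelihood (illustrative): log2(median / 2**N) ~ Normal(0, width).
    """
    log2_median = math.log2(median_value)
    log_w = {
        n: -0.5 * ((log2_median - n) / width) ** 2
        for n in range(n_min, n_max + 1)
    }
    top = max(log_w.values())  # subtract the max before exponentiating
    unnorm = {n: math.exp(lw - top) for n, lw in log_w.items()}
    total = sum(unnorm.values())
    return {n: w / total for n, w in unnorm.items()}

posterior = exponent_posterior(170.0)
best_n = max(posterior, key=posterior.get)
print(best_n, posterior[best_n])
```

With a median of 170 nearly all of the mass lands on a single N, which is the sense in which truncating the model to one term loses almost nothing.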

]]>I think that writing your reasonable prior in terms of a nuisance scale parameter is the proper way to handle the problem, at least in principle. The intermediate step of normalising using the median of the data is what might add bias and inaccuracy if not properly accounted for. Obviously if there are many samples your data-peeking will go unpunished, but if there are only a few data points it is easy to see that the procedure is not clean in general (maybe it doesn’t matter in your particular example, I don’t know).

> Scale invariant priors are … not priors because they don’t normalize, and also usually don’t respect the degree of knowledge you do have. I don’t actually think people are measuring people’s heights in atto-lightyears.

It was you who said that you couldn’t put a prior on mu! The scale invariant prior is appropriate if you have *no* knowledge. If you do have *some* knowledge it’s up to you to use it to construct a better prior.

]]>However, this example was motivated by a recent problem I had where people had measured the pixel area of images, but had no absolute scale in the image (ie. nothing of a known size to measure within the image).

In this problem, it actually didn’t matter because the only meaningful thing was the dimensionless ratio of the size of things for genotype A vs the average size of things for genotype B. Within a batch the scale was constant (no-one adjusted the magnification), and I know that measurement error is a certain fraction of the size of the whole… Dividing all the measurements in a batch by the median eliminates the scale problem and lets me put a reasonable prior on the adjusted measurements. Inferring the actual size of things in meters is an intermediate step that adds bias and inaccuracy which doesn’t actually exist in the problem, because the actual problem is scale invariant (dimensionless ratio of sizes).
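
To make the scale invariance concrete, here is a small sketch with made-up numbers. The “true” sizes, the two magnification factors, and the helper names are all hypothetical; the point is only that dividing a batch by its own median gives the same values whatever the unknown pixel scale was.

```python
import math
import statistics

def normalize_batch(measurements):
    """Divide every measurement in a batch by the batch median."""
    med = statistics.median(measurements)
    return [m / med for m in measurements]

def genotype_ratio(sizes_a, sizes_b):
    """Mean size of genotype A over mean size of genotype B, after batch normalization."""
    normalized = normalize_batch(sizes_a + sizes_b)
    norm_a = normalized[:len(sizes_a)]
    norm_b = normalized[len(sizes_a):]
    return statistics.mean(norm_a) / statistics.mean(norm_b)

# Hypothetical underlying sizes in some unknown physical unit.
true_a = [1.2, 1.5, 1.1, 1.4]
true_b = [0.9, 1.0, 1.1, 0.8]

def observe(pixel_scale):
    """Pixel areas as recorded at a given (unknown to the analyst) magnification."""
    return [pixel_scale * v for v in true_a], [pixel_scale * v for v in true_b]

ratio_low_mag = genotype_ratio(*observe(350.0))
ratio_high_mag = genotype_ratio(*observe(12000.0))
print(ratio_low_mag, ratio_high_mag)
```

Both calls give the same ratio up to floating point rounding, so the magnification never has to be inferred to compare genotypes within a batch.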

Dan mentions below that you should always use the units and convert between them etc. But this information just may not be known. If you know the units, sure, respect them. If you don’t, such as when you have images with no reference length in them, this doesn’t mean you can’t proceed, it just means you have additional uncertainty… Scale invariant priors are … not priors because they don’t normalize, and also usually don’t respect the degree of knowledge you do have. I don’t actually think people are measuring people’s heights in atto-lightyears.

Anyway, the problem with coming up with lots of examples is that they’re never what you might call complete and well thought out, but they illustrate interesting issues in data analysis, so I still think it’s worthwhile thinking about some of the issues they bring up. In the case where you’re really completely uncertain about the units, there is no concentration of p(units | median(x)) around a single value so that case doesn’t behave as I was implying.

]]>I don’t agree with this even slightly. If you’re working in an area where there are units (not every statistical problem falls into this category, but that’s the world you’ve put us in), then your model should have units as well. In the event that the gathered data doesn’t match those, you convert one to the units of the other. This is not “looking at the data”. It’s just setting a prior that respects your physical reality (in this case, that a quantity has units).

What was that thing they sent to mars where they did the calculation in feet but programmed it in metres? Ignoring the units in a problem is the statistical version of that.

Carlos suggests you can use scale-invariant priors, which is true but beside the point.

]]>Why not? If you really don’t have any idea of the units, scale invariance suggests a prior of the form 1/x.
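
For reference, a short derivation of why 1/x is the scale-invariant form, and why it is improper. This is the standard change-of-variables argument, written out here rather than quoted from anyone in the thread.

```latex
% Scale-invariant prior on x > 0:  p_X(x) \propto 1/x.
% Change units by y = c x with c > 0:
p_Y(y) = p_X\!\left(\frac{y}{c}\right)\left|\frac{dx}{dy}\right|
       \propto \frac{c}{y}\cdot\frac{1}{c} = \frac{1}{y}.
% Equivalently, the density of u = \log x is flat:
p_U(u) = p_X(e^{u})\,e^{u} \propto 1.
% The prior is improper because the normalizing integral diverges:
\int_0^\infty \frac{dx}{x} = \infty.
```

The prior has the same form in every unit system, and it cannot be normalized over all of x > 0.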

> (lets say the median value is 170 you’re now looking at cm not inches or feet or furlongs or micrometers)

It could also be in atto-light-years or centi-yards… It’s not clear to me if in this example you’re trying to choose among a well defined set of “units” or if you’re really ignorant about the units. In the second case, I don’t think the approach that you propose is going to be any better than using the uninformative prior on mu from the start.

]]>Daniel, it sounds like you are moving away from the view of Bayesian inference as some kind of real-valued generalization of Boolean logic? ]]>

“You don’t honestly think I believe today what I did last year do you?” – George Barnard (as quoted in An Accidental Statistician, G.E.B chapter 4)

]]>Then add a background grey oval representing good prediction stretched out along a y=x line to each plot. Make the b plot have symmetric x and y axes so the y=x line goes through the middle of the graph. Make the a plot have axes so that you could see where the y=x line and grey oval with good prediction is…

Hope some of that helps.

]]>If you show both panels a and b on the same plot, the range 0-10 appears as basically nothing, all smushed into the line of dark points at the bottom, because the y scale goes from 0 to 800.

The figure could be better constructed; unless you know to look carefully at the y-axis labels, it’s hard to understand what’s going on.

]]>Now, your data model has changed after looking at the data. Under what circumstances does this make sense?

I think it makes sense when there are no other alternative models you’d think of fitting. If there are other models, you really need to incorporate them into a model choice problem, a mixture model p1(data|param1) p(param1) + p2(data|param2)p(param2) etc

If you try just one model at a time, you can find that several don’t fit that well, and then one fits sort of better, and then ignoring those that don’t fit well is not justified, whereas in the model choice problem where all models are allowed to participate, if you find one model that is strongly favored, then it is justified to say that this model is strongly favored, the evidence suggests this model over the others.

Again I think this has the flavor of incorrect conditioning. Rejecting a model because it’s a poor fit and moving to the next model and the next model… if each “next model” is conditioned on the fact that you’ve rejected some previous model… then that can work, but just trying a new model without conditioning correctly on that information you had doesn’t work. The easiest and most effective way to condition correctly is to create a super-model that incorporates all the possibilities into one big model, whether that’s a discrete mixture or a continuous mixture, or whatever.
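
As a toy version of the super-model idea, here is a sketch of a two-component discrete mixture for binomial data. The two Beta priors, the 50/50 model weights, and the data (18 successes in 20 trials) are invented for illustration; the marginal likelihood is the standard beta-binomial formula.

```python
import math

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(k, n, a, b):
    """log p(k | n, model) for a binomial likelihood with a Beta(a, b) prior."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(a + k, b + n - k) - log_beta(a, b)

def model_posterior(k, n, models, prior_weights):
    """Posterior probability of each model inside the mixture."""
    log_post = [
        math.log(w) + log_marginal(k, n, a, b)
        for (a, b), w in zip(models, prior_weights)
    ]
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Model 1: flat Beta(1, 1) prior. Model 2: Beta(20, 20), concentrated near 0.5.
models = [(1.0, 1.0), (20.0, 20.0)]
probs = model_posterior(k=18, n=20, models=models, prior_weights=[0.5, 0.5])
print(probs)
```

Because both models take part in the same calculation, a statement like “model 1 is favored” is conditioned on everything that was considered, rather than on a sequence of rejections that never entered the math.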

]]>Thank you for this sentence, for this blog post and for your paper. ‘Using data twice’ does not mean anything, actually, but some people (frequentists and subjectivist Bayesians) tend to use this expression as a general warning. Of course, as applied statisticians we often need to look at the data more than once, and, maybe, update our beliefs (priors) over time. As Andrew said several times in his papers, the prior must be tested! I dealt with something similar in this arxiv paper

]]>So seeing it everywhere now – but I do think if we want good statistical practice it needs to be sorted out.

For instance I think (and have since Two Cheers for Bayes https://www.ncbi.nlm.nih.gov/pubmed/8889349) research communities need to learn how to process and critically peer review prior assumptions so that they are not taken as just arbitrary opinions of individuals.

]]>Although the line between calibration and fishing could get blurred very quickly. And I agree with Dan that careful thought about the principles here is required to make sure we don’t do something silly without realizing it.

]]>Also it would be great if we could plot the model connections as a graph. Then we could plot only a subset of parameters (e.g. parameters that are connected to the top 3 parameters with the lowest sd (divergent samples)).

]]>“Panel (c) shows (a) and (b) in the same plot, with red points representing the realizations using vague priors and gray points using weakly informative priors.”

Gray points are showing vague priors, right? In the text, page 5 and just below fig 4, you write: “it is quite a bit more plausible than the model with vague priors.”

]]>An example that I like to use is that of looking at mean FEV in teens that smoke compared with those that don’t. Before seeing the data, I think that FEV should be higher in non-smokers than smokers. So I put a normal distribution with a negative mean as the prior for the “smoking effect” (quotes imply the word effect is misleading). However, I see the data and the smoking effect is positive, with a p-value of 0.001.

Is the logical thing to conclude that the smoking effect is somewhere in the middle? If I truly believe the normal prior, yes. On the other hand, this surprising result makes me rethink things for a bit. After some thought, I realize that smokers are more likely to be older than non-smokers, and age was not adjusted for in this study, other than “being a teenager”. Now I realize that my previous prior was silly, and most importantly, didn’t really represent my full prior knowledge before seeing the data.

Daniel, I think your example falls into the same category. You have definitive prior information about the measure of interest. But you have a lot of trouble accurately describing it in a distribution, at least until you’ve seen a few samples.

If we were saying that calculated Bayesian posteriors were how one should make decisions about publication, grants, etc., AND we allowed you to change your prior after seeing the data, this would unquestionably lead to horrible gaming of the system. On the other hand, if you are a researcher and want to do your very best to understand a system, updating a mistakenly misparameterized prior is definitively a good idea.

]]>I’ll also be interested to know if it’s at all interpretable with that many parameters. My hunch is that you’d need to order the parameters along the horizontal axis in a meaningful way to have any hope of noticing anything useful. Of course “meaningful” is very context dependent, which is always annoying but also what makes things interesting.

]]>Here’s my blog where I summarized the final ideas, it links to the comment thread where they got hashed out.

http://models.street-artists.org/2016/03/30/using-the-data-twice-vs-incorrect-conditioning/

]]>Do you have a link to the comment thread where you had that discussion with Carlos? I would be very interested in reading the dialogue.

]]>I totally agree with you (except for the bit about P-values vs U-values, but mainly because I am attached to the convenience of knowing what distribution my computed quantities should follow so I can do my quick graphical checks). I like LOO-PITs for my PPC-ing because for continuous observations they are fairly uniform and departures from uniform are graphically interpretable in terms of calibration (over-/under-dispersed predictions, bias, etc) or in terms of the data being insufficient to predict itself (which is also picked up by the k-hat values Aki, you and Jonah constructed).

But I guess the way we’re touching the data multiple times is potentially a much worse sin (as I said, I don’t give a fig for NHST-ing, but I am always worried about generalisation error or overfitting). My feeling is that the stuff we did (prior and posterior predictive checks) is enough to keep over-fitting under control, but I think it would be an interesting “theory of applied stats” question to see if we can find conditions under which this is not true, which might hint at what else is needed. (Just to put a pin in this, so we can talk about it in a couple of weeks. Viva workflow!)

]]>This comment may be unnoticed among all the music references, but, if you’re interested, here are some links on that “using the data twice” thing.

A post from a few years ago in which I wrote:

I guess we could argue all night about the expression “using the data twice,” as ultimately this phrase has no mathematical definition. From a Bayesian perspective, the posterior predictive p-value is a conditional probability given the data, and as such it uses the data exactly once.

An article from 10 years ago covered by this blog post.

]]>For example, suppose someone sends you some data from an instrument that measures in an unknown set of units, but you’re told that it has measurement noise which is of order of magnitude of 10% of the typical measurement.

s ~ exponential(1.0/0.1);

data ~ normal(mu,mutyp*s);

might be an ok model, if you just had some idea what the heck the units of the mu values were so you could choose a mutyp and use it to set a prior on mu. But, even if you know that this thing is say measuring people’s heights, because you don’t know what units it’s using (micro-meters, cm, lightyears, hundredths of an inch?) you can’t even approximately put a prior. Looking at the data, and then conditioning on that look, provides the answer:

p(data | mu,sigma)p(mu | units)p(units | median(data))

p(units | median(data)) strongly concentrates around one kind of units (let’s say the median value is 170: you’re now looking at cm, not inches or feet or furlongs or micrometers).
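
A rough sketch of that last factor, p(units | median(data)), for the height example. The candidate unit list, the uniform prior over units, and the Normal(170 cm, 20 cm) model for a typical adult height are my assumptions for illustration; the Jacobian factor converts the density from cm to the reported unit.

```python
import math

# Centimetres per one reported unit, for a hypothetical set of candidate units.
CM_PER_UNIT = {
    "micrometers": 1e-4,
    "mm": 0.1,
    "cm": 1.0,
    "inches": 2.54,
    "feet": 30.48,
    "furlongs": 20116.8,
}

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def units_posterior(median_value, mean_cm=170.0, sd_cm=20.0):
    """p(units | median) under a uniform prior over the candidate units.

    Assumes the median is plausible under at least one candidate unit,
    otherwise every weight underflows to zero.
    """
    weights = {
        unit: normal_pdf(median_value * k, mean_cm, sd_cm) * k  # k is the Jacobian
        for unit, k in CM_PER_UNIT.items()
    }
    total = sum(weights.values())
    return {unit: w / total for unit, w in weights.items()}

post_170 = units_posterior(170.0)
post_67 = units_posterior(67.0)
print(max(post_170, key=post_170.get), max(post_67, key=post_67.get))
```

A median of 170 puts essentially all of the mass on cm, while a median of 67 would pick out inches instead. That concentration is what justifies conditioning on a single choice of units.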

Similarly, if you look at prior predictive values for your model and adjust your priors until you find that outputs are at least sort of the right order of magnitude as in figure 4 of your paper, are you over-fitting? To the extent that your prior still includes fake data well outside the ranges of real data, and to the extent that you have many more than 2 samples in your dataset, I’d say no. You could for example do quantile(data,c(.01,.99)) which pulls out essentially 2 “degrees of freedom” from your data and fit your priors to blanket that interval or the like (say a convex hull over a 2D point cloud, or 95% intervals in each dimension or whatever). If you have N much larger than 2 data points, you are using very little data information.
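
The quantile check can be sketched as follows. The simulated data, the choice mutyp = 150, and the normal(150, 50) prior on mu are hypothetical; only the exponential(1.0/0.1) prior on s and the 1%/99% quantile idea come from the discussion above.

```python
import random
import statistics

def central_interval(values, lo_pct=1, hi_pct=99):
    """Return the (lo_pct, hi_pct) percentiles of values."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lo_pct - 1], cuts[hi_pct - 1]

def prior_blankets_data(prior_draws, data):
    """True if the prior predictive 1%-99% interval contains the data's 1%-99% interval."""
    d_lo, d_hi = central_interval(data)
    p_lo, p_hi = central_interval(prior_draws)
    return p_lo <= d_lo and d_hi <= p_hi

def prior_predictive_draw(rng, mutyp=150.0):
    """One fake observation: mu ~ normal(150, 50), s ~ exponential(rate 1/0.1)."""
    mu = rng.gauss(150.0, 50.0)
    s = rng.expovariate(1.0 / 0.1)  # rate 10, so mean 0.1
    return rng.gauss(mu, mutyp * s)

rng = random.Random(0)
data = [rng.gauss(170.0, 8.0) for _ in range(200)]           # stand-in for real data
draws = [prior_predictive_draw(rng) for _ in range(5000)]    # prior predictive sample
too_tight = [rng.gauss(170.0, 1.0) for _ in range(5000)]     # an overconfident prior

print(prior_blankets_data(draws, data), prior_blankets_data(too_tight, data))
```

Only two numbers are extracted from the data, so with a few hundred observations this check uses a small fraction of the information that the likelihood will later use.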

The big issue is that in fairly complicated models it can be hard to understand how the joint prior across many parameters affects the prior predictive distribution, and without that knowledge, you can’t reasonably set priors, because a prior is just a device to constrain the acceptable search region over possible prediction models. If you set priors that seem individually reasonable, but jointly they are somehow accidentally informative in an incorrect way about the data generating process, you will get much worse fits than if you adjust priors in the way you suggest (figure 4). So, which is worse? A wrong fit + no data “reuse” or minimal data reuse that gives you the right answer? I know which one makes sense to me.

]]>