Kaiser writes:

More on that work on age adjustment. I keep asking myself: where in the Stats curriculum do we teach students this stuff? A class session focused on that analysis teaches students so much more about statistical thinking than anything we have in the textbooks.

I’m not sure. This sort of analysis is definitely statistics, but it doesn’t fit into the usual model-estimate-s.e. pattern of intro stat. Nor does it fit into the usual paradigm of exploratory data analysis which focuses on scatterplots. If we were to broaden exploratory analysis to include time series plots, that would work—but still this would miss the point, in that the usual focus would then be on the techniques of how to make the graph, not on the inquiry. From a conceptual point of view, the analysis I did is not so different from regression. It’s that lumping and splitting thing. And then there’s the age adjustment which is model-based but not in the usual way of intro statistics classes.

There’s an appeal to starting a class with examples such as this, where sample sizes are huge so we can go straight to the deterministic patterns and not get distracted with standard errors, p-values, and so forth.

When I taught intro stat, I did use various examples like this, another being the log-log graph of metabolism of animals vs. body mass, where again the point was the general near-deterministic relationship, with the variation around the line being secondary and estimates/s.e./hyp-tests not coming in at all. Deb and I have a bunch of these examples in our Teaching Statistics book.

It’s not hard to cover such material in class, but there does seem to be a bit of a conceptual gap when trying to link it to the rest of statistics, not just at an introductory level but at other levels as well. In our new intro stat course, we’re trying to structure everything in terms of comparisons, and these sorts of examples fit in very well. But where the rubber meets the road is setting up specific skills students can learn so they can practice doing this sort of analysis on their own.

Kaiser follows up with a more specific question:

Also, on a separate topic, have you come across a visual display of confidence intervals that is on the scale of probabilities? It always bugs me that the scale of standard errors is essentially a log scale. Moving from 2 to 3 and from 3 to 4 are displayed as equal units but the probabilities have declined exponentially.

My reply: I like displaying 50% and 95% intervals together, as here:

**P.S.** Regarding the age-adjustment example that Kaiser mentioned: it just happens that I posted on it recently in the sister blog:

Mortality rates for middle-aged white men have been going down, not up.

But why let the facts get in the way of a good narrative?

It’s frustrating to keep seeing pundits writing about the (nonexistent) increasing mortality rate of middle-aged white men. It’s like Red State Blue State all over again. Just makes me want to scream.

The age adjustment seems like more of a problem of defining the right estimand, which is more of a logical operation than a mathematical one. Mathematically, both the raw and adjusted summaries of death rates are well-defined summaries of the data — the question is just what these summaries are supposed to represent in a larger argument. If you’re a pharmaceutical advertiser who is using an API that only allows online ad targeting for aggregate age groups (18-34, 35-44, 45-54, etc.), then the raw death curve would be the “right” summary, because you’d like your summary to pick up that this bin has a higher mortality rate, whether or not that mortality rate is the result of a changing age composition within the bin. On the other hand, if you’re a pundit making an argument that’s supposed to aggregate the personal narratives of a group of people (“the downfall of white middle-aged men”), then the age-adjusted curve is the right one to use.
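As a toy illustration of that estimand distinction (all rates and population counts here are invented, not real mortality data): a crude rate can rise from composition alone while a directly age-standardized rate stays flat.

```python
# Age-specific death rates, held constant over time (toy values)
rates = {45: 0.004, 50: 0.006}   # deaths per person-year

def crude_rate(pop):
    """Raw death rate for a population given as {age: count}."""
    deaths = sum(rates[age] * n for age, n in pop.items())
    return deaths / sum(pop.values())

# The 45-54 bin's internal age mix shifts older between the two years
pop_1999 = {45: 600, 50: 400}
pop_2013 = {45: 400, 50: 600}

print(crude_rate(pop_1999))  # 0.0048
print(crude_rate(pop_2013))  # 0.0052 -- "rising mortality" from composition alone

# Direct standardization: apply each year's rates to one fixed reference mix.
ref = {45: 500, 50: 500}
standardized = sum(rates[age] * n for age, n in ref.items()) / sum(ref.values())
# Since the age-specific rates are identical in both years, the standardized
# rate is the same (0.005) for both: the apparent increase disappears.
print(standardized)
```

The crude curve answers the ad-targeting question; the standardized one answers the narrative question. Same data, different estimands.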

I think a major problem in intro statistics education is that there isn’t enough focus on the rhetorical aspect of statistics — instead of teaching the skill of constructing a complete logical argument, attention is instead focused on a few mathematical links in the larger logical chain.

Isn’t this more like an actuarial / demographic / epidemiological adjustment? I mean, sure, it’s crucial to the problem at hand, but it’s a bit demanding to expect the Stat. courses to teach all this.

It’s like expecting a Stat. course to teach how to adjust a LHC detector signal to cancel spurious noise or something.

What do you think about cat’s eyes for the intervals? (As suggested by Cumming, e.g., http://www.psychologicalscience.org/index.php/publications/observer/2014/march-14/theres-life-beyond-05.html)

Rickard:

I followed the link and saw this: “Values in the interval are plausible for the true level of support, and values outside the CI are relatively implausible.” Seems like a classic case in which confidence intervals are being interpreted as Bayesian inference. In real life, a confidence interval can include lots of implausible values; indeed we often see cases where the only plausible values are outside the interval.

Regarding the display: Sure, the cat’s eyes are fine. A bit too elaborate for my taste, and a bit too Bayesian to be appropriate for classical non-Bayesian intervals, but I can see their appeal.

I agree that statements like “the confidence interval is a range of plausible values for the true mean” are not good — and can be misleading. But I still struggle when asked* how to describe a confidence interval for an audience that knows little or nothing about statistics or probability and won’t tolerate anything very complex. My current best guess is something like “an interval that gives some sense of the uncertainty in our estimate”.

Anyone have any better suggestions?

*e.g., by a school district employee charged with explaining standardized test results to teachers, parents, school board members, etc.

Martha:

This came up on the blog a while ago . . . as a general term I prefer “uncertainty interval” rather than “confidence interval” or “credibility interval” or whatever. One difficulty, though, is that “confidence interval,” like “unbiased,” has a technical definition which is not always so intuitive. Strictly speaking a 95% confidence interval is a procedure which gives intervals which, at least 95% of the time, contain the true parameter value. But it’s kinda horrible to give this definition, given that in practice confidence intervals are typically taken to be uncertainty intervals.

“Strictly speaking a 95% confidence interval is a procedure which gives intervals which, at least 95% of the time, contain the true parameter value” Wouldn’t we need to add to this “assuming the true parameter value is the estimated value”? Or,

A 95% confidence interval is a procedure that gives intervals which at least 95% of the time contain the estimated parameter value if that parameter is the true parameter value.

or…”if we model the distribution as if the estimated parameter value is the true value”

Okay I just simulated to check my understanding. Where’s the delete comment button?

A 95% confidence interval is an interval created by a procedure which gives intervals such that 95% of the time we run the procedure *ON THE RESULTS OF A RANDOM NUMBER GENERATOR* it will contain the correct parameter value that we put into the random number generator.

There is absolutely nothing that forces real scientific data to obey the 95% coverage guarantee except in the case where you’re using a random number generator to select subsets of a fixed set of objects at a fixed time (in that case you’re just mapping the output of the RNG through a 1 to 1 and onto function defined by the real world set of stuff).
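The RNG framing is easy to check by simulation. A minimal Python sketch (standard normal-theory interval; all numbers invented): when the data really do come from the stated RNG, the advertised coverage shows up.

```python
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 10.0, 2.0, 50, 2000
covered = 0
for _ in range(reps):
    # data actually generated by the RNG the procedure assumes
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    # the usual 95% normal-theory interval: xbar +/- 1.96 se
    covered += (xbar - 1.96 * se) <= mu <= (xbar + 1.96 * se)
coverage = covered / reps
print(coverage)  # close to 0.95, as the coverage guarantee promises
```

This is exactly Daniel’s point: the guarantee is a statement about the RNG, and the simulation confirms it only because the data-generating code and the assumed model are literally the same thing.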

“ON THE RESULTS OF A RANDOM NUMBER GENERATOR* it will contain the correct parameter value that we put into the random number generator”

Yes, by “model the distribution” I meant exactly what you say by “put into a random number generator”

Is there any difference between the *RESULTS OF A RANDOM NUMBER GENERATOR* and the usual formulation based on the *SAMPLING DISTRIBUTION*?

If I make a measurement and the distribution of the measurement error is known (say I’m measuring a position and I know the error can be described as N(0,1)), would you say I’m selecting from a fixed set of objects?

Carlos: yes absolutely!!!

Because most of the time people doing scientific experiments are NOT simply running a random number generator and selecting a different sample from the same finite set of well-defined things, there’s absolutely no frequency guarantee when it comes to repeating scientific experiments. But MOST PRACTITIONERS SEEM TO THINK THERE IS.

There was a perfect example of this based on repeated astronomy data posted in a comment by Bill Jefferys a month or so back, a whole bunch of confidence intervals from two different groups of people, the two sets had no overlap, and neither set contained the true value at all.

Is that because the procedure they used doesn’t work on random number generators? No, the procedure would have worked fine if they had been studying RNGs, the procedure didn’t work because some other basic assumptions of the model were violated.

If the distribution of the measurement error is known to be a stable physical thing, then no, you’re not selecting from a fixed set of objects; but so long as the objects you’re measuring are stable physical things whose measurements don’t change through time, the idea still applies. The thing is, this is a far more rare situation than is usual in statistics.

It’s more applicable to something like a freshman physics lab where the whole apparatus is precisely calibrated, the instruments measure pressure to 0.5%, and you’re placing calibrated weights on a piston to validate PV/N = kT: you get a straight line for PV/N compared to T, and any fool in Excel would get k to 3 significant figures because the errors are trivial.

But take that and add in something like temperature instability of the pressure gauge, and temperature instability in the diameter of the piston so that the seals leak more or less depending on the weather and whether the HVAC is running, and so on, so that the errors are more significant. Then there’s no reason that repeated sampling needs to work like an RNG and give you the coverage guarantees.

Here’s an amusing example to explain what I mean:

Each semester a physics lab tries to measure PV/N = kT and get a numerical value for k using linear regression. The temperature probe is a thermistor, and the pressure probe is an electronic transducer. Each semester the lab takes a lot of measurements and gets a small confidence interval for k, but each confidence interval excludes the others and there’s a distinct jitter from semester to semester. The coverage guarantees couldn’t possibly work out.

After a couple of decades the campus loses funding for its campus radio station for 4 semesters and lo and behold the k values are stable for 4 semesters…. It turns out that the sampling distribution for the data depended heavily on the kind of music bleeding over onto the probe wires and being fed into the A/D circuits, so that the confidence interval for k depended on whether the DJs in the 10AM to 12PM timeslot liked 80’s hair metal bands or music for yoga classes…

Is this so far fetched? It’s the same thing for the classical Agricultural experiments of Fisher. If you repeat them in successive years the “effect of” different fertilizers could depend on cross pollination from neighboring counties, the amount of cloud cover, whether a new pesticide was introduced in a farm upwind, whether the harvester from last year had rusty or clean blades (so that different amounts of iron were deposited) etc etc.

Repeatedly running a scientific experiment need not correspond to repeated sampling from a stable RNG.
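The semester-to-semester jitter story can be sketched as a simulation (toy numbers throughout): each “semester” carries its own systematic offset on top of small measurement noise, so every interval is tiny and precise-looking, but it sits on its own bias and almost never covers the true value.

```python
import random
import statistics

random.seed(4)
K_TRUE = 1.0  # the "true" constant the lab is estimating (toy value)

def semester_interval(bias_sd=0.1, noise_sd=0.01, n=200):
    # one systematic offset per semester (e.g. RF bleed on the probe wires)
    bias = random.gauss(0, bias_sd)
    data = [K_TRUE + bias + random.gauss(0, noise_sd) for _ in range(n)]
    xbar = statistics.mean(data)
    se = statistics.stdev(data) / n ** 0.5
    return xbar - 1.96 * se, xbar + 1.96 * se

intervals = [semester_interval() for _ in range(100)]
coverage = sum(lo <= K_TRUE <= hi for lo, hi in intervals) / len(intervals)
print(coverage)  # nowhere near 0.95: each tiny interval sits on its own bias
```

The within-semester sampling model is “correct,” but the thing that varies across replications (the bias) isn’t in the model, so the nominal guarantee says nothing about actual repeated experiments.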

Daniel, let me ask differently: do you agree with the following statement?

“A 95% confidence interval is an interval created by a procedure which gives intervals such that 95% of the time we run the procedure it will contain the correct parameter value *PROVIDED THAT THE MODEL USED TO CALCULATE THE SAMPLING DISTRIBUTION IS CORRECT*”.

Of course if the model assumptions are wrong anything goes. And if someone is not completely aware of that, I think pointing it out is much simpler than introducing into the picture a scary, all-uppercase random number generator.

(By the way, two 95% confidence intervals could behave properly from the frequentist point of view and fail to overlap as much as 10% of the time.)

Carlos, sure, but I think the point is the only time that the model is really CORRECT is when it’s the output of an RNG. So how incorrect can it be and still be useful? And how do we know when it’s correct or not? Those are complicated important questions that are very very much glossed over in practice among people not trained as statisticians, and short-cut in explanations of stats textbooks.

> the point is the only time that the model is really CORRECT is when it’s the output of an RNG.

I don’t think I’m getting any closer to getting your point. Let me know if this goes in the right direction:

I have a model, let’s call it MODEL, which depends on a parameter. Based on this model I can calculate a sampling distribution, MODEL_SAMP_DIST, and define confidence intervals for the parameter with some coverage properties. In reality, the actual “model”, ACTUAL, might be completely different and arbitrarily complex. A corresponding sampling distribution ACTUAL_SAMP_DIST can be defined. The confidence intervals will cover the true value of the parameter 95% of the time only when MODEL_SAMP_DIST~ACTUAL_SAMP_DIST.

Alternatively, we identify (by an unclear procedure) our model with a (particular) random number generator. We identify the “correct model” with a (potentially different) random number generator. Our model is the correct model if it is the output (whatever it means) of __a__ random number generator (not __any__ random number generator, but __the__ random number generator corresponding to the “correct model”).

I don’t see a meaningful difference, if this is indeed what you mean.

And if you want to clarify what it means for the model to be the output of a random number generator:

Imagine that my model is x~N(mu,1) but in reality x~N(mu,42). I take a single observation and create a confidence interval. Given that the actual sampling distribution is different from the one derived from my model, my confidence intervals will be wrong.

Imagine that my model was “correct” x~N(mu,42). Then the sampling distribution is correct and the confidence intervals are fine.

“The model is correct” means that “the model is the output of a random number generator.” Which one?

Wouldn’t the “wrong” model presented before also be the output of another random number generator?
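Carlos’s N(mu,1)-vs-N(mu,42) example is easy to run (reading the second parameter as the variance; all numbers invented). With a single observation and an interval built from the wrong model, the nominal 95% coverage collapses:

```python
import random

random.seed(2)
mu = 5.0                 # arbitrary true mean
true_sd = 42 ** 0.5      # data really come from N(mu, 42), i.e. variance 42
assumed_sd = 1.0         # but the model says N(mu, 1)
reps = 10000
covered = 0
for _ in range(reps):
    x = random.gauss(mu, true_sd)               # a single observation
    # 95% interval computed under the (wrong) assumed model
    lo, hi = x - 1.96 * assumed_sd, x + 1.96 * assumed_sd
    covered += lo <= mu <= hi
coverage = covered / reps
print(coverage)  # far below the nominal 0.95
```

Both the “right” and “wrong” models correspond to some RNG; the coverage guarantee attaches only to the RNG that actually generated the data.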

Carlos: starting a new thread at the bottom of the page…

@Andrew: “One difficulty, though, is that “confidence interval,” like “unbiased,” has a technical definition which is not always so intuitive.”

My question was what does someone who knows the technical definition and realizes it is not intuitive do to give an explanation which is not too misleading in a situation where there is not the time (nor the interest on the part of the receiving party) to explain the technical definition.

I agree that “uncertainty interval” is a better choice of terminology than “confidence interval,” but I would prefer something that says a little more about the uncertainty — e.g., “an interval that gives some sense of the uncertainty in our estimate” (as I mentioned above) or “an interval expressing some of the uncertainty inherent in estimation”.

“Margin of error,” as often used in describing results of polls, has some of this idea (although I would prefer “approximate margin of error”), but is not applicable for situations such as odds ratios, where the interval is not centered at the point estimate.

I would be interested in hearing other possibilities that people use in such situations (i.e., where the complete precise definition is not appropriate for the audience/situation).

Martha: interesting challenge. You could try calling it “a range of values under which the data wouldn’t be surprising” – perhaps substitute “extreme” for “surprising”.

George,

I prefer something that explicitly mentions uncertainty. I’m leaning toward the following:

“In estimating values using a sample, there are several sources of uncertainty. The confidence interval is often used to give a rough idea of the uncertainty coming from one of these sources (called “sampling variability”): the fact that different samples can give different estimates.”

And one might add (depending on the intended audience) a list of some other sources of uncertainty: measurement uncertainty, uncertainty of method of analysis (“forking paths uncertainty”), uncertainty of the assumptions used to calculate the confidence interval (“model uncertainty”).

This (an interval generated by a procedure …) is actually very close to what I told my intro students this semester. And yes, with very strong emphasis on the importance of random sampling. But I think it’s also important to add “assuming that certain assumptions are reasonable.” We did many simulations (using StatKey) and they did really well on a question about how often we’d expect an interval not to contain the true parameter value. Also, before confidence intervals we spent a lot of time learning that we don’t think a sample estimate is going to exactly equal the parameter.

Sharable online R code simulating the above points

http://www.r-fiddle.org/#/fiddle?id=H0qpP6kn

Interesting to hear your views on this. Always interesting to read your blog. Hope you will blog about Bayes factors sometime, as they are getting really popular in psychology and seem very practical for experimental psychologists.

Rickard:

I generally hate Bayes factors. See my 1995 article with Rubin or chapter 7 of BDA3.

We did something similar at the Bureau of Justice Statistics — http://www.bjs.gov/content/pub/pdf/dvctue.pdf

The age adjustment would be covered in a demography course, even though your specific example might not have been, since the demography textbooks often say that adjustment within 10-year intervals is not needed (though I think this is a misinterpretation of what that statement means). I think you are right, it is not really taught unless you have a data wrangling class, and even then the issue is not “do you teach age adjustment,” it is “do you teach students to really analyze what the data production mechanism is and think critically about it.” In my experience I learned all that kind of thing (also how to deal with changing census tract borders, etc.) from working on specific projects with experienced people and also reading a lot.

Replying to Carlos above: http://andrewgelman.com/2016/05/30/all-that-really-important-statistics-stuff-that-isnt-in-the-statistics-textbooks/#comment-275694

A confidence interval is based on repeated trials and uses frequency under repetition as the defining concept of probability. So, when you construct a confidence interval you do several things:

1) You specify a hypothetical family of distributions that the data comes from (or that some test statistic of the data comes from) and this distribution has some parameter p which is thought to be meaningful (like the mean or the median or whatever)

2) You then mathematically prove (or maybe simulate) what would happen if you sampled heavily from an RNG with the distribution in (1) and constructed a confidence interval CI(D_i) from different samples D_i using your CI procedure.

3) You then claim that the confidence interval generating procedure CI(D), when applied to data D which actually DOES come from the RNG specified in (1), will, 95% of the time the sample is taken and the procedure repeated, contain the parameter p used for the RNG.

Notice that at no point during the whole process do you make any scientific claims. The logical structure is:

IF data D_i comes out of RNG(p) then CI(D_i) contains p for 95% of the i values when i goes from 1…N for N very large.

Beyond the claim about the mathematical properties of the RNG from (1) the confidence procedure itself has no content. In particular, it has no scientific content about what will happen in future experiments.

I agree that the only claim is that (3) the confidence interval generating procedure CI(D) when applied to data D assuming that the model specified in (1) is true will 95% of the time the sample is taken and the procedure repeated contain the true parameter p. Or alternatively: IF data D_i is generated by model(p) then CI(D_i) contains p for 95% of the i values when i goes from 1…N for N very large. (I still see no reason to talk about RNGs being as specified instead of models being true, but I won’t say you’re wrong.)

“Model being true” just doesn’t have the CONCRETE reality of “data actually does come from a random number generator with the defined distribution”. And that is the only case where the model is REALLY TRUE. We have to communicate with scientists who are statistically naive but not AT ALL naive about the reality of the stuff they’re studying (hopefully). Bending over backwards to make them understand the assumptions of a frequentist analysis is important, in my opinion.

If you say “well every time you run your experiment it’s really just a way of asking the same random number generator to give you over and over a new sample” then they’re likely to object in the way that they really OUGHT to.

On the other hand, it also makes clear how sampling from a finite population with an RNG makes surveys work correctly. You have items you can assign integers to (“item 1, 2, 3, …, 4569044”), and then you ask an RNG to give you a uniform sample without replacement from the integers 1…4569044 and you measure those particular items in the world. If you were to ask the RNG again and again, you really would be sampling from an RNG with a distribution defined by the distribution of the real measurements among the 4.5 million items.

So, it IS applicable to something like taking detailed measurements on a subset of the stars in the sky to determine the amount of calcium in our galaxy, and NOT (necessarily) applicable to running several clinical trials of a drug regimen at two or three hospitals in the US and then claiming what the effect of your drug will be across the entire population of people sick with some illness.
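The finite-population case sketches like this (an invented population of 10,000 items; the RNG does the selecting, so the coverage logic applies to the actual list of items, with no distributional assumption about the items themselves):

```python
import random
import statistics

random.seed(3)
# A fixed, arbitrary finite population. The frequency guarantee will come
# from the RNG doing the selecting, not from any model of these values.
population = [random.lognormvariate(0, 1) for _ in range(10000)]
true_mean = statistics.mean(population)

n, reps, covered = 200, 1000, 0
for _ in range(reps):
    sample = random.sample(population, n)  # simple random sample, no replacement
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    covered += (xbar - 1.96 * se) <= true_mean <= (xbar + 1.96 * se)
coverage = covered / reps
print(coverage)  # near 0.95 (slightly off for skewed populations at this n)
```

Here the “experiment” really is re-querying the RNG, which is exactly the situation where the confidence guarantee has teeth.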

Also, I think “model” is a broad term that has a tendency to be ambiguous in these scenarios of discussions between scientists and data analysts / statisticians.

For example, a biologist may have a model in mind in which expression levels of gene X drive transcription of gene Y which drives hyper-oxidation of some enzyme which drives DNA damage which produces cancer…

Now if you say “if your model is true,” what may plop into their head is “if the biochemical pathway I’ve sketched out in my mind is what is really going on,” whereas what YOU the statistician mean by “if your model is true” is just “if every time you measure the oxidation of your enzyme the error in the measurement has a stable distribution D,” which is just a COMPLETELY different concept which isn’t even ABOUT THE SAME THING (an enzyme pathway vs. a measurement apparatus).