Geoffrey Sheean writes:
I am having trouble thinking Bayesianly about the so-called ‘normal’ or ‘reference’ values that I am supposed to use in some of the tests I perform. These values are obtained from purportedly healthy people. Setting aside concerns about ascertainment bias, non-parametric distributions, and the like, the values are usually obtained by setting the limits at ± 2SD from the mean. In some cases, supposedly because of a non-normal distribution, the third highest and lowest value observed in the healthy group sets the limits, on the assumption that no more than 2 results (out of 20 samples) are allowed to exceed these values: if there are 3 or more, then the test is assumed to be abnormal and the reference range is said to reflect the 90th percentile. The results are binary – normal, abnormal.
The relevance to the diseased state is this. People who are known unequivocally to have condition X show Y abnormalities in these tests. Therefore, when people suspected of having condition X (or something kind of like it) are found to have Y abnormalities (similar or milder), the test is said to support the diagnosis of condition X (or something kind of like it). The problem is that there is no true sensitivity and specificity information because of the lack of a gold standard for many diseases, and in part because condition X and abnormality Y are actually broad categories, rather than specific diseases or states. The findings observed in one situation are extrapolated to broadly similar situations. Without sensitivity or specificity, what can one do?
An example. People who are known to have suffered partial nerve damage, trauma, polio, etc. develop large electrical signals in muscles that are connected to the nerves. So, when we see similar large signals in other people, we conclude that they too have suffered nerve injury, even if it is not due to trauma, or polio, or any other of the known circumstances in which large signals were originally seen.
So, how to think about this statistically? If I find a result that is greater than 2SD away from the reference population mean, classical statistics suggests I should think that there is only a 5% chance of finding this result (or worse) in a normal person. From this, I am trained to conclude that this person is abnormal (95% probability). Similarly, if I find 3 abnormalities in a sample of 20 from a patient that exceed a certain value, I am to conclude that this person is abnormal (how probable??).
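The step from "a 5% chance of this result in a normal person" to "95% probability of abnormal" can be checked against Bayes' rule. The following is a minimal sketch, not a calibrated model: the prior probabilities and the 0.8 sensitivity are made-up illustrative numbers (the letter itself says such figures are usually unavailable), and `posterior_abnormal` is a hypothetical helper name.

```python
def posterior_abnormal(prior, sensitivity, false_positive_rate):
    """P(abnormal | result beyond the limit), by Bayes' rule."""
    true_pos = prior * sensitivity                  # abnormal and flagged
    false_pos = (1 - prior) * false_positive_rate   # normal but flagged
    return true_pos / (true_pos + false_pos)

# Assumed inputs for illustration only: 5% of normal people exceed the limit,
# and 80% of truly abnormal people do.
for prior in (0.01, 0.10, 0.50):
    print(prior, round(posterior_abnormal(prior, 0.8, 0.05), 3))
```

Under these assumed inputs the posterior probability changes a great deal with the prior, so the 5% tail area alone does not pin down a "95% probability" of abnormality.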
With the ± 2SD range, at least I can readily see just how far beyond the reference range a value is and take heart from the fact that the further out of this range the value is, the more ‘likely’ it is a true abnormality, reflecting true pathology, even though that is not what the standard statistical interpretation would say. Isn’t this way of thinking akin to interpreting a really, really low p value as stronger evidence than just a low p value? That is, confusing the probability of finding the evidence with the strength of the evidence? Trouble is, it really does feel like stronger evidence.
With the second type of reference range, there is no such guidance. For example, the upper limit of a value (3rd highest) for a test was set at 55 but the raw data show some normal subjects had values well over 100 and in one case, over 400, just not more than 2 over 55. So, what does it mean if I see 5 out of 20 values over 55 compared with just 3 out of 20? What if I choose to collect more than 20 samples, say 30 samples? Am I to adjust the number of abnormal values needed to conclude abnormality to 4 (the upper limit allows 2/20 = 10%, and 10% of 30 is 3)? Is a person with 3/20 values of 126, 89, and 279 more abnormal than someone with 3 values of 58, 63, and 74? With ± 2SD, at least I can gain some (false?) confidence from the severity of the abnormality.
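One way to put numbers on the "3 of 20" versus "5 of 20" versus "4 of 30" question is a binomial tail probability. This is only a sketch under strong assumptions that the letter does not establish: that each sample from a normal person independently exceeds the limit with probability 10%, the figure implied by the "2 out of 20" rule. The helper name `binom_tail` is hypothetical.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_exceed = 0.10  # assumed per-sample exceedance rate in a normal person

# Chance that a normal person is flagged under each counting rule.
for n, k in [(20, 3), (20, 5), (30, 4)]:
    print(f"P(at least {k} of {n} over the limit) = {binom_tail(n, k, p_exceed):.3f}")
```

Under these assumptions, "3 or more of 20" would occur in a normal person with probability about 0.32, so the counting rule by itself is weak evidence. The calculation also ignores how far above the limit each value falls, which is the information the letter is worried about losing.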
It gets worse. Often, we have no way of knowing if we are right, no way to calibrate. The reasons are a little complicated so I won’t explain.
I’d prefer to either use the continuous variable as is, or else create a transformed scale using some clinical understanding. For example, instead of transforming into a 0/1 variable (“normal/high”) or maybe -1/0/1 (“low/normal/high”), transform onto a 1-5 or 1-10 scale. The .05 level has nothing to do with anything here; the point is to use this variable as an effective predictor.
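As a rough sketch of what such a transformed scale could look like, the snippet below maps a raw value onto a 1-5 score. The cutpoints are invented for illustration (only the 55 comes from the letter), and in practice they would have to come from clinical understanding; `to_ordinal` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical cutpoints: 55 is the reference limit quoted in the letter,
# the others are placeholders that a clinician would need to choose.
CUTPOINTS = [55, 75, 100, 200]

def to_ordinal(value, cutpoints=CUTPOINTS):
    """Map a raw value to a score from 1 to len(cutpoints) + 1."""
    return bisect_right(cutpoints, value) + 1

# The two patients from the letter, each with three values over 55.
patient_a = [126, 89, 279]
patient_b = [58, 63, 74]
print([to_ordinal(v) for v in patient_a])
print([to_ordinal(v) for v in patient_b])
```

A binary rule treats both patients the same, since each has three values over 55. A graded score keeps them apart, and the raw continuous values keep more information still.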