## What is the normal range of values in a medical test?

Geoffrey Sheean writes:

I am having trouble thinking Bayesianly about the so-called ‘normal’ or ‘reference’ values that I am supposed to use in some of the tests I perform. These values are obtained from purportedly healthy people. Setting aside concerns about ascertainment bias, non-parametric distributions, and the like, the values are usually obtained by setting the limits at ± 2SD from the mean. In some cases, supposedly because of a non-normal distribution, the third highest and third lowest values observed in the healthy group set the limits, on the assumption that no more than 2 results (out of 20 samples) are allowed to exceed these values: if there are 3 or more, then the test is assumed to be abnormal and the reference range is said to reflect the 90th percentile. The results are binary – normal, abnormal.
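Both constructions are easy to see on a toy sample. A minimal sketch, using purely hypothetical "healthy" values, comparing the ± 2SD limits with the 3rd-highest/3rd-lowest rule:

```python
import statistics

# Hypothetical reference sample of 20 "healthy" values (illustrative only).
ref = [41, 44, 46, 47, 48, 49, 50, 50, 51, 52,
       52, 53, 54, 55, 56, 58, 60, 63, 72, 110]

# Parametric limits: mean +/- 2 SD (assumes approximate normality).
mean = statistics.mean(ref)
sd = statistics.stdev(ref)
lo, hi = mean - 2 * sd, mean + 2 * sd

# Non-parametric limits: 3rd lowest and 3rd highest observation, so at
# most 2 of the 20 (10%) reference values fall beyond each limit.
s = sorted(ref)
np_lo, np_hi = s[2], s[-3]

print(f"parametric: {lo:.1f} to {hi:.1f}")
print(f"non-parametric: {np_lo} to {np_hi}")
```

Note how the single extreme value (110) inflates the SD, stretching the parametric interval to roughly 26–85, while the rank-based limits stay at 46 and 63 — exactly the kind of divergence between the two constructions discussed below.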

The relevance to the diseased state is this. People who are known unequivocally to have condition X show Y abnormalities in these tests. Therefore, when people suspected of having condition X (or something kind of like it) are found to have Y abnormalities (similar or milder), the test is said to support the diagnosis of condition X (or something kind of like it). The problem is that there is no true sensitivity and specificity information because of the lack of a gold standard for many diseases, and in part because condition X and abnormality Y are actually broad categories, rather than specific diseases or states. The findings observed in one situation are extrapolated to broadly similar situations. Without sensitivity or specificity, what can one do?

An example. People who are known to have suffered partial nerve damage, trauma, polio, etc. develop large electrical signals in the muscles connected to the affected nerves. So, when we see similar large signals in other people, we conclude that they too have suffered nerve injury, even if it is not due to trauma, polio, or any of the other known circumstances in which large signals were originally seen.

So, how to think about this statistically? If I find a result that is greater than 2SD away from the reference population mean, classical statistics suggests I should think that there is only a 5% chance of finding this result (or worse) in a normal person. From this, I am trained to conclude that this person is abnormal (95% probability). Similarly, if I find 3 abnormalities in a sample of 20 from a patient that exceed a certain value, I am to conclude that this person is abnormal (how probable??).
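The "(how probable??)" has a straightforward binomial answer under one simplifying assumption: that each value independently exceeds the limit with the 10% rate built into the 2-of-20 rule. A sketch:

```python
from math import comb

# If each of 20 values independently has a 10% chance of exceeding the
# non-parametric limit (by construction, 2 of the 20 reference values do),
# how often would a healthy person show 3 or more "abnormal" values?
n, p, k = 20, 0.10, 3
prob = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
print(f"P(>= 3 of 20 beyond the limit | healthy) = {prob:.3f}")  # about 0.32
```

So, under these assumptions, roughly a third of healthy patients would meet the "3 or more" criterion — nothing like a 5% error rate, which is part of what makes the rule hard to interpret.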

With the ± 2SD range, at least I can readily see just how far beyond the reference range a value is and take heart from the fact that the further out of this range the value is, the more ‘likely’ it is a true abnormality, reflecting true pathology, even though that is not what the standard statistical interpretation would say. Isn’t this way of thinking akin to interpreting a really, really low p value as stronger evidence than just a low p value? That is, confusing the probability of finding the evidence with the strength of the evidence? Trouble is, it really does feel like stronger evidence.

With the second type of reference range, there is no such guidance. For example, the upper limit of a value (3rd highest) for a test was set at 55, but the raw data show some normal subjects had values well over 100 and in one case over 400 – just not more than 2 over 55. So, what does it mean if I see 5 out of 20 values over 55 compared with just 3 out of 20? What if I choose to collect more than 20 samples, say 30 samples? Am I to adjust the number of abnormal values needed to conclude abnormality to 4 (since the upper limit allows 2/20 = 10%, and 10% of 30 is 3)? Is a person with 3/20 values of 126, 89, and 279 more abnormal than someone with 3 values of 58, 63, and 74? With ± 2SD, at least I can gain some (false?) confidence from severity of the abnormality.
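The 30-sample question can be checked directly: under the same independence assumption as before (each value exceeding the limit 10% of the time), requiring 4 of 30 gives roughly the same healthy-person "abnormal" rate as requiring 3 of 20. A sketch:

```python
from math import comb

def tail(n, k, p=0.10):
    """P(at least k of n values exceed a limit that 10% of healthy values exceed),
    assuming the values are independent."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

t20 = tail(20, 3)
t30 = tail(30, 4)
print(f"n=20, >=3 abnormal: {t20:.3f}")
print(f"n=30, >=4 abnormal: {t30:.3f}")
```

The two rates come out comparable (about 0.32 and 0.35), so scaling the required count with the sample size roughly preserves the operating characteristic — though only under independence, which is questionable for repeated samples from one patient.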

It gets worse. Often, we have no way of knowing if we are right, no way to calibrate. The reasons are a little complicated so I won’t explain.

I’d prefer to either use the continuous variable as is, or else create a transformed scale using some clinical understanding. For example, instead of transforming into a 0/1 variable (“normal/high”) or maybe -1/0/1 (“low/normal/high”), transform onto a 1-5 or 1-10 scale. The .05 level has nothing to do with anything here; the point is to use this variable as an effective predictor.

1. Joseph says:

I have a great deal of concern with dichotomizing medical test variables (and I work with this type of data as a daily task). You lose a lot of information for many variables. For example, fasting blood glucose is predictive of events even within the normal range. Putting people into “normal”, “impaired fasting glucose” and “diabetic” is a huge reduction in information for very little benefit.

I was always very disturbed to see the 2SD definition used for bone density scans. It seemed like we were throwing out all sorts of interesting variation. Before I’d be interested in that sort of model, I’d be really interested to see if the outcomes predicted by the test had a shape that was well predicted by two levels. It seems rare that this works out.

2. It seems to me like calculating a z score for each of several diagnostic tests, summing their squares, and comparing the result to a chi-squared distribution wouldn’t be the worst way to do diagnostics. I mean, there’s probably a lot of better ways, but the z score method has to be better than 1 bit of information per test (normal vs abnormal).
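A minimal sketch of that idea: square and sum the z-scores and compare against a chi-squared distribution with one degree of freedom per test. The z-values are hypothetical, and the closed-form survival function below covers only even degrees of freedom, which is enough here.

```python
from math import exp

def chi2_sf(x, df):
    """Chi-squared survival function P(X > x) for even df,
    via the closed-form Poisson series (enough for this sketch)."""
    assert df % 2 == 0, "closed form shown only for even df"
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total

# Hypothetical z-scores from four diagnostic tests: only one of them
# crosses the usual 2 SD cutoff on its own.
z = [2.1, 1.8, 1.5, 1.9]
stat = sum(v * v for v in z)
p = chi2_sf(stat, df=len(z))
print(f"sum of squared z-scores = {stat:.2f}, p = {p:.4f}")
```

Only one of the four tests would be flagged individually, yet the combined statistic would occur in fewer than 1% of healthy patients (assuming independent, normally distributed z-scores) — the consistent mild elevation is itself informative.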

3. Johan Bjerner says:

Thanks for the discussion. I am working as a laboratory physician, and I would like to add a few comments.

In my opinion, the benefit of reference intervals is that they force laboratories to calibrate and report results consistently. In easy cases, we measure something that can be easily standardized, like the sodium concentration in human plasma. Here we have the possibility of calibrating individual laboratories against a “reference”, where the reference is simply the correct sodium concentration, and we might have managed without reference intervals. However, often we do not measure a simple concentration but a mixture of different entities. Vitamin B12 may be considered a mixture of different cobalamins. This mixture of cobalamins differs between patients, so there simply is no easy standardization. The concept of reference intervals means that the laboratory has to report the single measurement together with the 2.5th and 97.5th percentiles of Vitamin B12 for this specific measurement method. It is thus easier for the clinician to judge the patient’s findings against outcomes from clinical studies in which a different method of Vitamin B12 measurement has been used.

Second, reference intervals are bad therapeutic guidelines. Concentrations differ between different groups in the population. Vitamin B12 is found in meat, fish, and eggs. If you measure Vitamin B12 in certain groups, such as vegans or Hindus (who tend to be vegan or close to vegan), you will probably find lower concentrations than in the meat-eating population. If we insist on strictly representative reference intervals, an increase in the proportion of vegans in the population from 1% to 10% will have quite a large effect on the 2.5th percentile of the Vitamin B12 concentration. Further, there is no guarantee that concentrations within the reference interval are not harmful. Concentrations of Vitamin B12 close to the lower reference limit may harm neurodevelopment in small children, and the same may be said about having thyroid stimulating hormone close to the upper reference limit.

Third, reporting “z-scores” instead of the actual measurement is a bad idea, for several reasons. One reason is that the units actually matter. We want to know the renal function, not just the z-score of the renal function: clinicians adjust the dosage of medicines according to the renal function, something you cannot do with a z-score. Sometimes we want to do calculations with the results: the concentration of positive ions equals the concentration of negative ions, and a mole of haemoglobin can carry four moles of oxygen. You cannot make these calculations with z-scores either.

Fourth, we cannot easily calculate the probability of having three abnormal tests out of twenty. Tests, or biochemical measurements, are not independent entities, because they are sampled on the same individual. If an individual has a kidney function at the lower 10th percentile, that individual will have a higher probability of results outside the reference intervals for several tests that depend on kidney function, such as potassium, creatinine, carbamide and uric acid. What matters is thus not just the number of pathological tests, but the pattern of pathological tests.
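This dependence point can be illustrated with a small simulation. Assume, purely hypothetically, that ten tests share one latent factor (kidney function, say) with loading rho. Each test is still marginally standard normal, so each is individually "abnormal" 5% of the time, but abnormalities now cluster:

```python
import random

random.seed(1)

def simulate(rho, n_people=20000, n_tests=10, cutoff=1.96):
    """Fraction of healthy people with >= 2 of n_tests outside +/- cutoff,
    when the tests share a common latent factor with loading rho."""
    w = (1.0 - rho * rho) ** 0.5
    hits = 0
    for _ in range(n_people):
        shared = random.gauss(0.0, 1.0)
        count = sum(
            abs(rho * shared + w * random.gauss(0.0, 1.0)) > cutoff
            for _ in range(n_tests)
        )
        hits += count >= 2
    return hits / n_people

indep = simulate(rho=0.0)
corr = simulate(rho=0.7)
print(f"independent tests: {indep:.3f}")  # near the binomial answer, ~0.086
print(f"correlated tests:  {corr:.3f}")
```

With rho = 0 the "2 or more abnormal" rate matches the independent binomial calculation; with shared variation it is noticeably higher, so simply counting abnormal tests without modelling the dependence overstates the evidence.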

Fifth, I would argue that you should continue to think Bayesianly about biochemical tests. The outcome of a biochemical test is not a perfect concentration; it is rather an observation of a concentration, subject to biological (inter- and intra-patient) variation and analytical variation. The likelihood of a truly pathological concentration depends on the observed concentration. If the upper reference limit is 50, the likelihood of a true concentration over 50 is definitely higher in an individual measuring 400 than in one measuring 52.
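That last point can be made quantitative with a toy measurement-error model. Everything here is hypothetical — in particular the Gaussian error and its SD of 5 units — but it shows why 400 and 52 carry very different evidence against a limit of 50:

```python
from statistics import NormalDist

def prob_true_above(observed, limit=50.0, meas_sd=5.0):
    """P(true concentration > limit | observed value), treating the observation
    as the true value plus Gaussian measurement error, with a flat prior."""
    return 1.0 - NormalDist(mu=observed, sigma=meas_sd).cdf(limit)

print(f"observed 52:  {prob_true_above(52):.2f}")   # about 0.66
print(f"observed 400: {prob_true_above(400):.2f}")  # essentially 1.00
```

A value of 52 leaves a roughly one-in-three chance that the true concentration is actually below the limit, while 400 leaves essentially none.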

• Certainly reporting only a z-score is a bad idea, but reporting a concentration along with the population mean and standard deviation allows you to calculate a z-score, as well as the other things tied to the actual units that you mention. The point you make about non-independence is exactly why calculating the z-scores and summing their squares isn’t a terrible idea. If someone has renal problems, they will have several tests which differ from the population in a consistent way, and the chi-squared score for this person will be well outside the expected range. This basically tells you that they are consistently different from the population in a non-random way.

A person who has one or two out of say 5 or 10 tests in the “abnormal” range will most likely have a chi-squared score which is within normal ranges. If your alternative to the chi-squared score is to see “how many readings are abnormal”, you will get a lot more information from the sum of squared z-scores than from this kind of dichotomization.

As I said, there are undoubtedly better ways, such as a multivariate logistic regression, if you have a dataset where patients are classified into “well” and “sick”. If you have “well” and 3 or 4 different kinds of “sick”, there are discrete choice models and so forth. But typically I imagine the clinician doesn’t have access to a large dataset to compare the patient against. The z-score approach is essentially an “improper” linear model; see Dawes’ article from the 1970s on using them in diagnostics: http://www.uwe-mortensen.de/DawesRobustBeautyImproperLinModsDecisionMaking1979.pdf

• As for the “Bayesian” aspect here, choosing *which* tests to run is already an expression of the prior knowledge of the clinician as to how important the various tests are for the disease process.

I like the equating of p-value and posterior probability:

If I find a result that is greater than 2SD away from the reference population mean, classical statistics suggests I should think that there is only a 5% chance of finding this result (or worse) in a normal person. From this, I am trained to conclude that this person is abnormal (95% probability).

• Johan Bjerner says:

I do not think this conclusion is justified. “Normal” is quite easy to define, but what is “abnormal”? Is this a person having a disease? If so, and if we assume that all persons with the disease have an abnormal test, then the probability that a person with an abnormal test also has the disease is p/(p + f(1-p)) ≈ p/(p+f) for a rare disease, where p is the prevalence of the disease and f is the frequency of abnormal tests in the normal population. This may be far from a 95% probability of having the disease.
In fact, the value of tests often lies in “ruling out” diseases, and for “ruling in” diseases, additional tests or investigations are needed.
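Plugging in illustrative numbers shows how far this can sit from 95%. A sketch with an assumed 1% prevalence and a 5% abnormal-test rate among the healthy (and, as above, sensitivity taken as 1):

```python
p = 0.01   # assumed prevalence of the disease
f = 0.05   # fraction of the healthy population with an "abnormal" test

# P(disease | abnormal test), taking sensitivity = 1 as in the comment:
exact = p / (p + f * (1 - p))
approx = p / (p + f)   # the p/(p+f) approximation, good when p is small

print(f"exact:  {exact:.3f}")
print(f"approx: {approx:.3f}")
```

Both come out around 17%: even a test that flags only 5% of healthy people gives a modest posterior probability when the disease is rare.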

• Geoff Sheean says:

Excellent discussion. More than I could have hoped for. I agree with Johan here – I should have said that the test result is abnormal with 95% probability and, of course, this would be then applied to my pre-test probability of the disease reflected by this abnormality. Use of the Z score would help formalize where the result lies on the continuous spectrum.
The non-parametric 3rd highest/lowest cutoff is still bothersome for the reasons outlined. On John Cook’s excellent blog (http://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/) he posted an interesting article on the probability of missing an abnormality after a certain sample size (e.g. 15% after 20 samples). I would like to know, if I have found 2 in a sample of 20, the probability of missing the 3rd abnormality needed to conclude that the test overall is abnormal. I posted the question but I have not had a reply yet.
I try to discourage the residents from thinking in terms of ruling in or out in favour of shifting probability, just to keep the idea of test results adjusting subjective probability foremost in their minds.
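One way to frame the "probability of finding the needed 3rd abnormality" question is as a posterior-predictive calculation. This is only a sketch under an assumed uniform prior on the per-sample abnormal rate; the model choice is an assumption, not from any reference study:

```python
# Posterior after observing 2 abnormal values in 20 samples, starting from a
# uniform Beta(1, 1) prior on the per-sample abnormal rate: Beta(3, 19).
def p_at_least_one_more(m, a=3, b=19):
    """P(at least one abnormal value in m further samples),
    via the beta-binomial predictive: 1 - prod_{j<m} (b+j)/(a+b+j)."""
    p_none = 1.0
    for j in range(m):
        p_none *= (b + j) / (a + b + j)
    return 1.0 - p_none

print(f"10 more samples: {p_at_least_one_more(10):.3f}")
```

Under these assumptions, 10 further samples would turn up at least one more abnormality about 70% of the time — equivalently, there is about a 30% chance of still "missing" the 3rd one.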

5. Johan Bjerner says:

Again, thanks for an interesting discussion, and thanks for the suggestions of an “improper” linear model.

We’ll have to begin reporting z-scores from the laboratory. So now to the practical questions:

Thyroid stimulating hormone (TSH) is log-normally distributed. How should we report it: the actual measurement plus the quantile transformed to a z-score? Of course we can report the log-transformed value, but if we want to compare results from different laboratories, these results will differ in a linear (and not a log-normal) pattern.

Human chorionic gonadotropin (hCG) will have non-measurable concentrations in perhaps 60% of men. How should we report it?

As noted previously, I see the actual measurement as an observation (with measurement uncertainty) of the true concentration. This measurement uncertainty is usually a function of the concentration measured (e.g. proportional to the concentration). If we report z-scores we will have another “source” of uncertainty, which will be the local density of observations. (An example is free thyroxine. At the lower reference limit we have a high density, so the difference in measurement units between the 2nd and 10th percentiles in the population is small. At the higher reference limit we have a low density, so the difference in measurement units between the 90th and 98th percentiles is quite large. The uncertainty of the z-score corresponding to the 2.5th percentile must thus be considerably larger than that of the 97.5th percentile.) How do we report the uncertainty of the z-score?
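The density effect can be written down with the delta method: if z(x) = Φ⁻¹(F(x)), then sd(z) ≈ sd(x)·f(x)/φ(z), so a high population density f near a limit inflates the z-score uncertainty there. A sketch with a hypothetical log-normal analyte and a constant measurement SD (all numbers illustrative):

```python
from math import exp, log
from statistics import NormalDist

std = NormalDist()

# Hypothetical log-normal analyte (median 16, log-SD 0.15) measured with a
# constant measurement SD of 0.5 concentration units -- illustrative only.
mu, sigma = log(16.0), 0.15
meas_sd = 0.5

def z_uncertainty(pct):
    """Delta-method sd of the z-score at a given population percentile:
    sd(z) ~= sd(x) * f(x) / phi(z), with f the population density at x."""
    z = std.inv_cdf(pct)
    x = exp(mu + sigma * z)                             # population quantile
    fx = std.pdf((log(x) - mu) / sigma) / (x * sigma)   # log-normal density
    return meas_sd * fx / std.pdf(z)

print(f"sd(z) at the 2.5th percentile:  {z_uncertainty(0.025):.2f}")
print(f"sd(z) at the 97.5th percentile: {z_uncertainty(0.975):.2f}")
```

With these numbers, the z-score at the lower reference limit is nearly twice as uncertain as at the upper limit, matching the high-density argument above.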

• Geoff Sheean says:

The log of the raw TSH values should follow a normal distribution, so the asymmetry you mention should disappear, no?
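A sketch of what reporting on the log scale could look like, with invented log-scale parameters (median 1.5, log-SD 0.5, loosely in the range of common TSH reference intervals):

```python
from math import exp, log

# Invented log-normal reference distribution: median 1.5, log-SD 0.5.
log_mu, log_sigma = log(1.5), 0.5

def tsh_z(value):
    """z-score computed on the log scale, where the distribution is normal."""
    return (log(value) - log_mu) / log_sigma

# The raw-scale 2.5th/97.5th limits are asymmetric about the median of 1.5,
# but their z-scores are symmetric: -1.96 and +1.96.
lo = exp(log_mu - 1.96 * log_sigma)
hi = exp(log_mu + 1.96 * log_sigma)
print(f"reference interval: {lo:.2f} to {hi:.2f}")
print(f"z at limits: {tsh_z(lo):.2f}, {tsh_z(hi):.2f}")
```

The raw interval is skewed (the upper limit sits much farther from the median than the lower one), yet the log-scale z-scores are symmetric, which is exactly the asymmetry disappearing.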