The history of characterizing groups of people by their averages

Andrea Panizza writes:

I stumbled across this article on the End of Average.

I didn’t know about Todd Rose, thus I had a look at his Wikipedia entry:

Rose is a leading figure in the science of the individual, an interdisciplinary field that draws upon new scientific and mathematical findings that demonstrate that it is not possible to draw meaningful inferences about human beings using statistical averages.

Hmmm. I guess you would have something to say about that last sentence. To me, it sounds either trivial, if we interpret it in the sense illustrated by the US Air Force later on the same page, i.e., that individuals whose properties (weight, height, chest, etc.) are “close” to those of the Average Human Being are very rare, provided the number of properties is sufficiently high; or plain wrong, if it’s a claim that statistics cannot be used to draw useful inferences about some specific population of individuals (American voters, middle-aged non-Hispanic white men, etc.). Either way, I think this would make a nice entry for your blog.

My reply: I’m not sure! On one hand, I love to be skeptical; on the other hand, since you’re telling me I won’t like it, I’m inclined to say I like it, just to be contrary!

OK, that was my meta-answer. Seriously, though . . . I haven’t looked at Rose’s book, but I kinda like his Atlantic article that you linked to, in that it has an interesting historical perspective. Of course we can draw meaningful inferences using statistical averages—any claim otherwise seems just silly. But if the historical work is valid, we can just go with that and ignore any big claims about the world. Historians can have idiosyncratic views about the present but still give us valuable insights about how we got to where we are today.

50 thoughts on “The history of characterizing groups of people by their averages”

  1. “Of course we can draw meaningful inferences using statistical averages”

    Not always. It all depends on the statistical distribution or stochastic process from which the data are drawn. The mean is not defined for some distributions (power laws, etc.), and hence any estimate of such a nonexistent statistic is meaningless. Furthermore, the trend (i.e., the time-varying mean) of some stochastic processes moves so quickly that it is impossible to estimate.

        • Yet it still minimizes the squared errors in that sample. It is just one of those freaky samples I warn my undergraduates about. The fact that there are freaky samples doesn’t change the properties of the sampling distribution. More important is that people in a given coffee shop are unlikely to have been randomly selected. Pretty often they probably aren’t even independent, say if Melinda Gates or one of their children were there too. Remember?

        • “fact that there are freaky samples doesn’t change the properties of the sampling distribution”

          If samples look “freaky” to your sampling distribution, then your sampling distribution is probably wrong.

        • If you think sampling distributions don’t have tails then I guess you are right, but most of the ones I know about do have tails, and if you teach you know you will start with sample sizes of, e.g., 5 or even 2. So, no, I don’t think that statement is correct. Sometimes a random sample is just a random sample that is far from the middle of the sampling distribution, and that’s just life in the big city.

      • This isn’t about means vs medians; in fact, the cases in which ‘statistics’ don’t work are very different – and I could imagine why a historian sees it that way.

        Arnold Kling (Specialization and Trade) talks about two ways of looking at economics. The first puts everyone into a low-dimensional space with, effectively, continuous distributions for key variables which can be effectively optimized (like income, or years of educational attainment). In the other, everyone has very many (high-dimensional) traits, like ‘desire for an 18″ BMX bicycle’, which are more effectively modeled by logit (presence/absence) than by some carefully calibrated real values. In this case, purchasing decisions, productivity, happiness, etc., are about finding a deal or not; edges of moderate or large effect, in a constantly shifting network of possible choices.

        This is also the way many biological networks operate. The organism cares very little for the average supply of nitrogen and carbon, but a great deal about the supply of very specific molecular forms (or higher levels, even, like intact prey); more than ‘bioavailability’ in some generic sense, which is really about how many organisms can process a specific form. There are dozens of vitamin B-type molecules, each needs a different receptor, effectively. They can be substituted, but only to a limited degree and only by organisms which have the right pathways to do the substitutions.

        History ends up retracing the very specific interactions of complex people (and families, etc) in these spaces. Low dimensions, and you ask how much iron was available in the British Isles. High dimension, and for lack of a nail, the kingdom is lost. Stratification is all about stripping the meaning from the individual item and seeing it in a range of contexts to understand it – statistically. History studies the context, meaning, and narrative.

      • The Cauchy distribution (Student-t with one degree of freedom) doesn’t have a finite mean or a finite variance. That’s one reason we like to call the parameters “location” and “scale” rather than “mean” and “standard deviation” (or “variance”).

        So the central limit theorem doesn’t apply, and averages of the data are meaningless. It’s the poster child for “fat tails”. MCMC also won’t work (there’s no MCMC central limit theorem either) if the posterior is Cauchy distributed.
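
        A quick simulation sketch (standard normal vs. standard Cauchy; seed, sample sizes, and checkpoints are arbitrary) of why averaging Cauchy draws gets you nowhere: the normal running mean settles down, while the Cauchy running mean keeps getting yanked around by new extreme draws.

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        n = 100_000
        normal_draws = rng.normal(size=n)
        cauchy_draws = rng.standard_cauchy(size=n)

        def running_mean(x):
            # cumulative sum divided by the number of draws so far
            return np.cumsum(x) / np.arange(1, len(x) + 1)

        for label, draws in [("normal", normal_draws), ("cauchy", cauchy_draws)]:
            means = running_mean(draws)
            print(label, [round(means[k - 1], 3) for k in (100, 1_000, 10_000, 100_000)])
        ```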

        • Bob:

          It’s not true that MCMC won’t work if the posteriors are Cauchy distributed. Random walk Metropolis won’t work, maybe, but other MCMC algorithms can work. Also, R-hat won’t work, but one can switch to a rank-based or interval-based R-hat and get the same essential functionality.

        • Good point. What I was thinking is that we can’t calculate posterior expectations, which is one of the main motivations for doing MCMC. Given that we can make random draws from a Cauchy, there’s no prohibition against the sampling part. BUGS or JAGS might even be able to do it, but Euclidean Hamiltonian Monte Carlo (what Stan uses up to 2.11 at least) won’t work, because the change in log density is bounded by a chi-square variate in the number of dimensions (that’s how the random kinetic energy is generated).

        • Actually with NUTS it kind of gets close (integrating for long times can help compensate for a lot), but in general you want to be careful with heavy tails, as they are both hard to fit _and_ hard to diagnose when they don’t fit.

    • That’s not completely true. Actually, many (all I know of) power-law distributions found in practice have $\alpha > 1$: Zipf’s law (s > 1), fits of the Pareto distribution to the distribution of wealth in the world, the distribution of t-statistics (I’ve never seen a study where the sample size is 2!), etc. This implies that the mean exists, while the variance may not. Moreover, even with infinite variance the Law of Large Numbers may hold: https://en.wikipedia.org/wiki/Law_of_large_numbers. Of course, convergence of the sample mean to the mean of the distribution will be slower than in the finite-variance case, but it will still happen. This means that our estimate is anything but meaningless.
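
      A small sketch of that claim (assuming a classical Pareto with minimum value 1 and tail index alpha = 1.5, so the mean exists, alpha/(alpha-1) = 3, but the variance doesn’t; the numbers are illustrative, not fit to any real data): the sample mean still creeps toward 3, just slowly and noisily.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      alpha = 1.5                      # 1 < alpha < 2: finite mean, infinite variance
      true_mean = alpha / (alpha - 1)  # = 3 for a Pareto with minimum value 1

      for n in (10**3, 10**5, 10**7):
          draws = 1.0 + rng.pareto(alpha, size=n)   # shift Lomax draws to a classical Pareto
          print(f"n = {n:>8}: sample mean = {draws.mean():.3f} (true mean = {true_mean})")
      ```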

  2. I think the key here is that “an average is pointless without an accompanying measure of variability.” The US Air Force could build every seat so that an average-height person would be comfortable, which would make the seat uncomfortable for the overwhelming majority of pilots. However, making the seat adjustable to within a few standard deviations of that height would allow the overwhelming majority of pilots to sit comfortably.

    • And yet, the majority of seats out there are non-adjustable.

      i.e. If you must choose seating for public transport or lecture halls or airport terminals and it is non-adjustable, what metric would you rather use? Is it not the average?

      • This is an example of “0-1 loss,” so you’d pick the mode, which isn’t necessarily the average; with lumpy distributions the average can be very far from the mode. For example, suppose your population of heights looks like normal(a,1) * 0.52 + normal(b,1) * 0.48 with a < b (a model for, say, men and women in the population, with 4% more women than men).

        Then you’d pick height a, because it’s the mode, whereas the average would be close to (a+b)/2, and if a and b are more than 4 or 5 units apart then there might actually be pretty much NOBODY who is (a+b)/2 :-)
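
        A quick numerical check of that claim, with hypothetical values a = 160 and b = 166 (chosen only for illustration) and scipy assumed for the normal density: the mixture density at the midpoint is a tiny fraction of the density at either peak, so essentially nobody has the “average” height.

        ```python
        from scipy.stats import norm

        a, b = 160.0, 166.0
        mix = lambda x: 0.52 * norm.pdf(x, a, 1.0) + 0.48 * norm.pdf(x, b, 1.0)

        print("density at a:        ", round(mix(a), 4))
        print("density at (a+b)/2:  ", round(mix((a + b) / 2), 4))
        print("density at b:        ", round(mix(b), 4))
        ```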

        • What’s the right way to choose the height of a chair given a realistic model for distribution of person heights? Define a “badness” function that measures how bad an outcome is when you choose x and a given person is actually height y

          badness(x,y)

          then using your distribution for y, calculate

          average(badness(x,y), dy)

          which is now a function of x. And choose the x that minimizes this function (lowest expected cost). This is Wald’s theorem, which I harped on about a couple weeks back (it basically says that the set of solutions worth looking in is the set of Bayesian decision-theory solutions, which is what minimizing expected badness gives you).

          if badness looks like 1-exp(-(x-y)^2/scale)

          so that if you are more than about “scale” distance from y it’s equally bad no matter what, then when scale is smallish, you’ll choose something close to the mode, because you’ll maximize the number of people who are “ok” (badness close to 0).

          if badness looks like (x-y)^2 then you’ll choose the average.

          if badness looks asymmetric (i.e. it’s better to be too big than to be too small) then you’ll bias your choice towards the high end of the distribution…

          etc. Providing a realistic cost function is important, but not always critical. Sometimes just choosing something that has some good features is better than the realistic alternatives (i.e. let “Joe” decide he’s the boss, or “just pick the average”, but wait, no one is actually near the average! etc.)
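
          A minimal sketch of this recipe, reusing the two-group height mixture from the comment above (a = 160, b = 166, weights 0.52/0.48; all numbers made up for illustration). Expected badness is approximated by Monte Carlo over draws of y, and we simply pick the candidate x with the smallest average badness.

          ```python
          import numpy as np

          rng = np.random.default_rng(2)
          a, b, n = 160.0, 166.0, 200_000

          # draw heights y from the mixture
          is_a = rng.random(n) < 0.52
          y = np.where(is_a, rng.normal(a, 1.0, n), rng.normal(b, 1.0, n))

          candidates = np.linspace(155, 171, 321)   # candidate choices of x

          losses = {
              "1 - exp(-(x-y)^2/scale), scale = 1": lambda x: 1 - np.exp(-(x - y) ** 2),
              "(x-y)^2":                            lambda x: (x - y) ** 2,
              "asymmetric: 3x worse when x < y":    lambda x: np.where(x < y, 3.0, 1.0) * (x - y) ** 2,
          }

          for name, badness in losses.items():
              expected = [badness(x).mean() for x in candidates]
              best = candidates[int(np.argmin(expected))]
              print(f"{name:38s} -> choose x = {best:.2f}")
          ```

          Roughly: the narrow loss picks something near the mode a, the squared loss picks the mean (about 162.9 here), and the asymmetric loss biases the choice toward the high end, as described above.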

        • Right. My speculation was that in practical problems the optimal point allocation chosen on the basis of your “real” badness function typically won’t be very far from the mean.

          I could be wrong. I am looking for examples.

          i.e. If one must choose a point metric in a pinch, is the mean a bad choice?

        • Daniel Lakeland’s example is basic decision theory. The nice part about the Bayesian approach is that you can factor the problem into the decision theory and the posterior, and integrate the posterior uncertainty into the decision.

          It’s more of a challenge to find a place where the average is a good choice.

          Airline seats are a good example. A too-large seat is much less of a problem for a small person than a too-small seat is for a large person. But the larger a seat, the higher the cost, and airlines believe (know?) people on average (?) seem to be very cost averse. This is bad for me because I rarely pay for my own travel—someone else buys me a coach seat, and these days, my shoulders don’t fit across the seat and my knees don’t fit behind the seat in front of me. And I’m not even quite six feet tall or particularly big in the shoulders. When three guys my size are put in three side-by-side coach seats, we all have to sit at an angle to squeeze in. On my way to Paris this time on Air France, I was in a Comfort Plus seat, which was funny, because I got plenty of leg room, but the width of the seat wasn’t even enough to squeeze my arms in next to my hips. Just not enough room and really uncomfortable. What you’re seeing in American hospitals is much larger wheelchairs for the much larger population. You can put a small person in a large wheelchair, but not vice-versa, so the loss isn’t even continuous. At some point, you exclude people altogether from seats.

          Do I want the average movie made again and again? No, I want some variety and almost never want the average movie or TV show (it’d all be reality TV and CGI movies).

      • Typically fixtures are built around maximums and minimums, not averages. I.e. any foot controls on an airplane must be within reach of a minimum-height pilot, and the pilot should be able to sit without hitting the ceiling — the Air Force then enforces height limits for its pilots, but the planes are built around minimums and maximums. Balconies are reinforced to withstand a maximum number of maximum-weight people, and so on.

        I think the synopsis is inadequate to describe the idea, which is not that averages can’t describe populations, but that they can’t accurately describe, and shouldn’t be used to describe, individuals within the population. I.e. there are too many dimensions to consider in modern problems to draw reliable inferences about a sampled person; see Michael Betancourt’s comment for a better description.

    • I wish all reports of group averages (of the form “As are P-er than Bs”) were accompanied by measures of variation, and even better, by (good, representative) visualizations of the distribution. I wish they’d compare the inter-group variation with the intra-group variation. I wish this were drilled into scientists and journalists as a rule, for starters.

  3. There’s an important statistical effect called concentration of measure at play here — the more features you have, the less representative any single point will be of the population. For example, if you have more than a few features then no single individual in the population will look anything like the population average. Hence, as Brandon notes, you need to consider at least some measure of variability, and ideally you would have a model of the entire population itself. A quick simulation of this is sketched below.
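
    The sketch (a simulation with made-up numbers, not the actual Air Force data): simulate a population with k independent standardized traits and count how many individuals are within 0.3 standard deviations of the population average on every trait at once.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    population = 100_000

    for k in (1, 3, 5, 10):
        traits = rng.normal(size=(population, k))            # k standardized traits per person
        near_average = np.all(np.abs(traits) < 0.3, axis=1)  # "average-ish" on every trait
        print(f"{k:2d} traits: {near_average.mean():.3%} of people are near-average on all of them")
    ```

    With one trait, roughly a quarter of the population is “near average”; with ten traits, essentially no one is.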

    • If you do have a multi-feature widget, but the constraint is non-adjustability, what metric should one use to best suit the population?

      Is it typically *not* the average? Does it often depend on the particular problem’s definition of “best fit”?

      e.g. Asymmetric penalties: Far more uncomfy to be a big guy in a small chair than the other way around?

      • Such solutions are usually not so precise that they are optimal only for a very narrow segment of the population and terrible for the rest. But in any case the best decision will depend on the particular problem’s definition of “best fit”, i.e. the appropriate utility function.

        • Right. I am wondering how far, in practical problems, the chosen optimum is from the average (with whatever choice of utility function is deemed appropriate).

          In other words, how often is the mean a terrible choice in practice?

          Also, a constraint mandating non-adjustable, single-point solutions seems not at all uncommon in real-world problems. The choice often is only: which point?

        • Aren’t “multipoint” solutions as common as, or even more common than, single-point solutions in real-world problems? For example, if we expect the response to depend on some predictors, we wouldn’t predict the response from the sample mean – we would design an experiment, build a regression model, and then, for each vector of predictors, use the model to predict the response corresponding to that predictor vector, together with a measure of variability; a bare-bones sketch of this is below.
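
          The sketch (simulated data and made-up coefficients, nothing from a real design): fit a simple least-squares regression and report a prediction plus a rough ±2 standard-error band for a new predictor value, rather than quoting the overall sample mean of the response for everyone.

          ```python
          import numpy as np

          rng = np.random.default_rng(4)
          n = 200
          x = rng.uniform(150, 200, n)          # a single predictor, e.g. a body measurement
          y = 0.4 * x + rng.normal(0, 3, n)     # simulated response

          X = np.column_stack([np.ones(n), x])  # design matrix with intercept
          beta, *_ = np.linalg.lstsq(X, y, rcond=None)
          resid = y - X @ beta
          sigma2 = resid @ resid / (n - 2)      # residual variance
          XtX_inv = np.linalg.inv(X.T @ X)

          x_new = np.array([1.0, 185.0])        # predict for x = 185
          pred = x_new @ beta
          se = np.sqrt(sigma2 * (1 + x_new @ XtX_inv @ x_new))  # std error for a new observation

          print(f"overall sample mean of y: {y.mean():.1f}")
          print(f"prediction at x = 185:    {pred:.1f} +/- {2 * se:.1f}")
          ```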

      • @Rahul: “Far more uncomfy to be a big guy in a small chair than the other way around?”

        That depends. For example, if the chair has arms that are closer together than the width of the big guy, or if there is an obstruction too close to the seat to allow room for the big guy’s knees, then yes. But if the chair seat is so high that the small guy can’t sit with feet touching the ground or even a rung, while the chair back is several inches behind the small guy’s back when knees are at the front of the seat, and there are no pillows or substitute available — that’s pretty darned uncomfy for the small guy (well, unless the small guy is a young enough child or maybe a gymnast).

  4. I read Rose’s book and have mentioned it a number of times on this blog. It is quite thought-provoking, although I’m not sure it is as revolutionary as the publicity states. His classic example is the Air Force’s design of pilot seats. The problem with designing a seat for the “average” pilot is that the relevant dimensions of pilot size are many – so in the end there is no pilot who is average. They are all average in some dimensions and not in others – as a result, the only design that works is one that has a number of adjustments. The reason you don’t see this in commercial airline seats is that the price you pay for your ticket is not what the Air Force pays for its planes….

    The more controversial and thought provoking example is the purpose of The End of Average. Rose believes that our educational system is designed to compare everyone to averages. In his view there is no such thing as an “average” student. Hence, stating that one student is 2 standard deviations above the average is not useful – in fact, it is harmful (and I think he is relating it to his personal experience here, which is quite an interesting story). If learning is, in fact, multidimensional, and if the dimensions are not very correlated, then we are almost all above average in some dimensions and not in others. This more subtle view of each student does not fit well with most current admissions policies, assessment policies, rankings, etc.

    I admit to having a great deal of sympathy with this view. I am not sure it changes much of statistical practice – but I do think it at least requires a more careful and nuanced way of reporting and using statistical analysis.

  5. I also read Rose’s book, and it left an impression on me. Consider the implications of The End of Average for the study of nutrition: we have a lot of studies trying to find the ideal, average diet; meanwhile, we have a population with huge variability in response to diet, in which responses to different foods (the different variables) are often poorly correlated with each other. In prescribing an ideal average diet, we would be close to tailoring a suit for the average person – a perfect fit for almost no one.

    There are many critical fields, like nutrition, which prescribe behavior to a population without considering the implications of variability. Think about this experiment: we want to study the effect of wearing a well-tailored suit on the success of a job applicant, so we give many applicants that average suit I mentioned above and test the response. Rose’s contribution is to point out that (1) this is nuts, and (2) scientists do equivalent experiments in many different fields. They just can’t see the ill fit of the suit.

    I don’t think that this viewpoint changes *statistical* practice, but it sure as heck should change experimental practice. Statisticians have the opportunity to help experimentalists understand the limits of knowledge that can come from a randomized control experiment (for example) in the face of so much variability across so many axes. Much wasted effort could be saved, and we’d all benefit.

    • Yes. In medicine, it is supposed to be ‘precision’ or ‘personalized’ medicine.

      But experimentalists have been trained to minimize variation (tight “controls”), go for low p-values, and use as few people/animals as possible. Uncontrolled variation is scary and can seem like bias and confounding and all that, but really, overgeneralizing in the discussion section is the worse bias. If one can do a multidimensional parameter scan, wonderful, but otherwise it’s better to open up the variation and hope your effect size is big enough to make a dent.

      • Ben, I agree with what you’ve written. But I am starting to become skeptical of the entire process, including the way we are doing personalized medicine to some extent (it’s still early yet). There are *so many* hidden variables in medicine, and what Rose is teaching is that if you have enough of them, you are statistically doomed from the start.

        The research process today is to hope that you don’t have that many hidden variables and just go for it, but it’s pretty clear from the replicability crisis and FDA data that situations where we all behave in biologically similar ways are very much the exception. It’s uncomfortable to face this reality because we will have to rebuild our processes from the ground up. Imagine telling the establishment that randomized control trials are usually as good as throwing darts. This will not go over well.

        FWIW, I think there is an alternative, which is to turn to engineering instead of science. The discipline of “control theory” has a lot to say here. Adopting that approach would put a lot more emphasis on measurement than on studies. But my view is that it may be much more fruitful.

        • I would disagree with some of your conclusions. Actually, most humans respond in roughly the same way to major treatments. Rotten tooth? Pull it; and you’ll get the same result for most everyone, except for some immune compromised individuals who will have infections and die as a result of the oral surgery. Vitamin C for scurvy, all around.

          In fact, many such treatments are good for mammals, even birds and lizards.

          Then there are some obvious stratifications among humans – biological gender, age (neonate, geriatric, etc). These are readily observed.

          We can pull out some other variables – like your gut bacteria and ibuprofen metabolism. Or comorbidities that either impact how you do things (HIV) or indicate an underlying problem of a physiological nature.

          I agree that we should do much better. For example, diagnosing a fever these days is positively paleolithic. There is no reason we shouldn’t have a representative daily temperature cycle for each man, and one that shifts by day of the month for each woman. Then we can compare a point measure to an adjusted target. Same thing for blood pressure, expiratory force and volume, etc. We should have personalized baselines adjusted for at least time of day. This would actually repair many problems with the RCTs, as well.

          But this doesn’t amount to throwing out all of medicine; or admitting that humans are fundamentally more dissimilar than similar.

          Anyway, medicine is engineering, not science – in my mind. Medicine poses a simple practical question and seeks a complex answer.

        • @BenK re “most humans respond in roughly the same way to major treatments. Rotten tooth? Pull it; and you’ll get the same result for most everyone”

          Pulling a tooth is a major treatment??

          I’m also skeptical about “Medicine poses a simple practical question and seeks a complex answer”.

  6. How is Rose’s argument different from that of proponents of precision medicine/personalized medicine? Randomized trials report average outcomes in the form of median values, which are then used to decide (i.e., to infer) whether the treatments will be beneficial for future average patients. Proponents of precision medicine want to know whether treatments will work on a particular patient, i.e. they want to end the use of averages.

  7. This is all related to typical sets. As an exercise, take a unit multivariate normal. Most of the draws are nowhere near the mean (also the mode for this distribution), and the distance increases with dimension. I’m about to roll out my case study on this for Stan—just need to get some time to take Michael’s comments into account.

    Forgetting the bit about outliers, you can check out the first plot here:

    http://blogs.sas.com/content/iml/2012/03/23/the-curse-of-dimensionality.html

    Or better yet, recreate it yourself. This is one sense in which neither the mode nor the mean is a good summary of a distribution—it is usually a point that’s not even in the typical set or anywhere near the random draws. Of course, the random draws aren’t anywhere near each other, either, but the point of Bayesian inference is that you compute expectations by averaging over the posterior, which is, to within epsilon, the same as averaging over the typical set (parameterized by epsilon here). A sketch of the exercise is below.
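
    The sketch (unit multivariate normal; dimensions and seed arbitrary): look at how far random draws land from the mean/mode at the origin. The distances concentrate around sqrt(d), nowhere near zero, so in high dimensions the mode is not a typical draw.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    n = 10_000

    for d in (1, 10, 100, 1000):
        draws = rng.normal(size=(n, d))          # n draws from a unit normal in d dimensions
        r = np.linalg.norm(draws, axis=1)        # distance of each draw from the mode at 0
        print(f"d = {d:4d}: mean distance = {r.mean():6.2f}, sd = {r.std():.2f}, sqrt(d) = {np.sqrt(d):6.2f}")
    ```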

  8. It’s not just about considering variation as context for the average. It’s also about what the variation represents. Peter Molenaar, featured in Rose’s book, points out that even in psychometrics there is an appeal to substituting population variance for individual variation to define reliability. But between-person covariance structures are not necessarily the same as within-person covariance structures…an important point when modeling change, development, learning etc., and when interpreting regression coefficients.

    • Good question! I’ve pondered this myself and think they are quite related. And, I think the ecological fallacy is a vastly underrated (at least underused) concept. My (somewhat superficial) take is that the ecological fallacy says that you may find relationships that hold, on average, that do not hold (almost certainly not as strongly) on the individual level. It would seem that this is another view of the same phenomenon that Rose is describing.

  9. Bill Harris and Dale Lehman: Yes! The situation described by Todd Rose, and the ecological fallacy, and also Simpson’s Paradox, are all related. They are all examples of aggregating data one way but drawing conclusions as if the data had been aggregated in another way.

    Rose’s book mentions the work of Peter Molenaar. Molenaar has written one of the best papers in psychology that I’ve ever read.

    Molenaar, P. C. M. (2004). A manifesto on psychology as idiographic science: Bringing the person back into psychology, this time forever. Measurement: Interdisciplinary Research and Perspectives, 2(4), 201-218, with lots of commentaries and a discussion.
