
Easier-to-download graphs of age-adjusted mortality trends by sex, ethnicity, and age group

Jonathan Auerbach and I recently created graphs of smoothed age-adjusted mortality trends from 1999-2014 for:
– 50 states
– men and women
– non-hispanic whites, blacks, and hispanics
– age categories 0-1, 1-4, 5-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84.
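For readers new to the term, "age-adjusted" means directly standardized: each age group's death rate is weighted by a fixed reference population, so that comparisons across states and years aren't driven by differing age structures. Here's a minimal sketch with invented numbers (not the actual standard-population weights):

```python
import numpy as np

# Hypothetical deaths and populations for three age groups in one state-year:
deaths      = np.array([500, 1500, 4000])
population  = np.array([100_000, 80_000, 60_000])
std_weights = np.array([0.45, 0.35, 0.20])  # reference-population shares; must sum to 1

crude_rates  = deaths / population                # per-group death rates
age_adjusted = (crude_rates * std_weights).sum()  # weighted to the reference population
print(age_adjusted)
```

The same weights are applied to every state and year, which is what makes the smoothed trends in the graphs comparable.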

We posted about this on the blog and also wrote an article for Slate expressing our frustration with scientific papers and news reports that oversimplified the trends.

Anyway, when Jonathan and I put all our graphs into a file it was really really huge, and I fear this could’ve dissuaded people from downloading it.

Fortunately, Yair found a way to compress the pdf file. It’s now still pretty readable and only takes up 9 megabytes. So go download it and enjoy all the stunning plots. (See above for an example.) Thanks, Yair!

P.S. If you want the images in full resolution, the original document remains here

Crack Shot

Raghu Parthasarathy writes:

You might find this interesting, an article (and related essay) on the steadily declining percentage of NIH awards going to mid-career scientists and the steadily increasing percentage going to older researchers. The key figure is below. The part that may be of particular interest to you, since you’ve written about age-adjustment in demographic work: does an analysis like this have to account for the changing demographics of the US population (which wasn’t done), or is that irrelevant since there’s no necessary link between the age distribution of scientists and that of the society they’re drawn from? I have no idea, but I figured you might.

Most of the article is about the National Heart Lung and Blood Institute, one of the NIH institutes, but I would bet that its findings are quite general.

Jeez, what an ugly pair of graphs! Actually, I’ve seen a lot worse. These graphs are actually pretty functional. But uuuuuugly. And what’s with those R-squareds? Anyway, the news seems good to me—the crossover point seems to be happening just about when I turn 55. And I’d sure like some of that National Heart Lung and Blood Institute for Stan.

The Association for Psychological Pseudoscience presents . . .

Hey! The organization that publishes all those Psychological Science-style papers has scheduled their featured presentations for their next meeting.

Included are:

– That person who slaps the label “terrorists” on people who have the nerve to question their statistical errors.

– One of the people who claimed that women were 20 percentage points more likely to vote for Barack Obama, during a certain time of the month.

– One of the people who claimed that women are three times as likely to wear red, during a certain time of the month.

– The editor of the notorious PPNAS papers on himmicanes, air rage, and ages ending in 9.

– One of the people who claimed, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.”

– Yet another researcher who responded to a failed replication without even acknowledging the possibility that their original claims might have been in error.

– The person who claimed, “Barring intentional fraud, every finding is an accurate description of the sample on which it was run.”


The whole thing looks like a power play. The cargo-cult social psychologists have the power, and they’re going to use it. They’ll show everyone who’s boss. Nobody’s gonna use concerns such as failed replications, lack of face validity, and questionable research practices to push them around!

Talk about going all-in.

All they’re missing are speakers on ESP, beauty and sex ratio, the contagion of obesity, ego depletion, pizza research, 2.9013, and, ummm, whatever Mark Hauser and Diederik Stapel have been up to recently.

Here’s the official line:

The mission of the Association for Psychological Science is to promote, protect, and advance the interests of scientifically oriented psychology in research, application, teaching, and the improvement of human welfare.

I guess that works, as long as you use the following definitions:
“promote” = Ted talks, twitter, and NPR;
“protect” = deflect scientific criticism;
“advance the interests of scientifically oriented psychology” = enhance the power of a few well-situated academic psychologists and their friends.

It’s a guild, man, nuthin but an ivy-covered Chamber of Commerce. Which is fine—restraint of trade is as American as baseball, hot dogs, apple pie, and Chevrolet.

The only trouble is that I’m guessing that the Association for Psychological Science has thousands of members who have no interest in protecting the interests of this particular club. I said it before and I’ll say it again: Psychology is not just a club of academics, and “psychological science” is not just the name of their treehouse.

The Las Vegas Odds

Kevin Lewis suggests the above name for pro football’s newest team, after hearing that “The NFL is letting the Oakland Raiders move to Las Vegas, a move once nearly unthinkable due to its opposition to sports gambling.”

Is there anyone good at graphic design who’d like to design a logo? I’m not sure what images would work here. Maybe a couple of dice, some betting slips, . . . ?

2 Stan job postings at Columbia (links fixed)

1. Stan programmer. This is the “Stan programmers” position described here.

2. Stan project development. This is the Stan business developer/grants manager described here.

To apply, click on the first link for each position above (the site) and follow the instructions.

P.S. In the first version of this post I messed up the links. They’re fixed now.

Probability and Statistics in the Study of Voting and Public Opinion (my talk at the Columbia Applied Probability and Risk seminar, 30 Mar at 1pm)

Probability and Statistics in the Study of Voting and Public Opinion

Elections have both uncertainty and variation and hence represent a natural application of probability theory. In addition, opinion polling is a classic statistics problem and is featured in just about every course on the topic. But many common intuitions about probability, statistics, and voting are flawed. Some examples of widely-held but erroneous beliefs: votes should be modeled by the binomial distribution; sampling distributions and standard errors only make sense under random sampling; poll averaging is a simple problem in numerical analysis; survey sampling is a long-settled and boring area of statistics. In this talk, we discuss some challenging problems in probability and statistics that arise from the study of opinion and elections.
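On the first erroneous belief: voters are not independent coin flips with a common probability, and the variation in underlying support across precincts makes vote counts overdispersed relative to any binomial model. A quick simulation (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000           # voters per precinct
n_precincts = 5000

# If every voter were an independent draw with the same p,
# precinct vote counts would be Binomial(n, p):
p_fixed = 0.52
binom_counts = rng.binomial(n, p_fixed, size=n_precincts)

# In reality, underlying support varies across precincts, which makes
# the counts overdispersed relative to the binomial:
p_varying = rng.beta(52, 48, size=n_precincts)  # mean ~0.52, but spread out
mixed_counts = rng.binomial(n, p_varying)

print(binom_counts.var())  # close to n * p * (1 - p), about 250
print(mixed_counts.var())  # an order of magnitude larger
```

The binomial standard error drastically understates the real variation, which is one reason naive poll-based probability statements go wrong.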

I’ll be speaking at the Applied Probability and Risk Seminar, 1:10-2:10pm in 303 Mudd Hall, Columbia University.

P.S. I discussed work from these papers:

Dead Wire

Kevin Lewis pointed me to this quote from a forthcoming article:

Daniele Fanelli and colleagues examined more than 3,000 meta-analyses covering 22 scientific disciplines for multiple commonly discussed bias patterns. Studies reporting large effect sizes were associated with large standard errors and large numbers of citations to the study, and were more likely to be published in peer-reviewed journals than studies reporting small effect sizes. The strength of these associations varied widely across research fields, and, on average, the social sciences showed more evidence of bias than the biological and physical sciences. Large effect sizes were not associated with the first or last author’s publication rate, citation rate, average journal impact score, or gender, but were associated with having few study authors, early-career authors, and authors who had at least one paper retracted. According to the authors, the results suggest that small, highly cited studies published in peer-reviewed journals run an enhanced risk of reporting overestimated effect sizes.

My response:

This all seems reasonable to me. What’s interesting is not so much that they found these patterns but that they were considered notable.

It’s like regression to the mean. The math makes it inevitable, but it’s counterintuitive, so nobody believes it until they see direct evidence in some particular subject area. As a statistician I find this all a bit frustrating, but if this is what it takes to convince people, then, sure, let’s go for it. It certainly doesn’t hurt to pile on the empirical evidence.
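Regression to the mean really is inevitable whenever selection is based on a noisy measurement; a generic simulation (not tied to the Fanelli et al. data) makes the point:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_quality = rng.normal(0, 1, size=n)           # underlying quantity
study1 = true_quality + rng.normal(0, 1, size=n)  # noisy first measurement
study2 = true_quality + rng.normal(0, 1, size=n)  # independent replication

top = study1 > 2            # select the apparent extremes from study 1
print(study1[top].mean())   # large, by construction
print(study2[top].mean())   # much closer to the mean: pure regression effect
```

Nothing changed between the two studies except which one was used to do the selecting; the "decline" is baked into the math.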

Let’s accept the idea that treatment effects vary—not as something special but just as a matter of course

Tyler Cowen writes:

Does knowing the price lower your enjoyment of goods and services?

I [Cowen] don’t quite agree with this as stated, as the experience of enjoying a bargain can make it more pleasurable, or at least I have seen this for many people. Some in fact enjoy the bargain only, not the actual good or service. Nonetheless here is the abstract [of a recent article by Kelly Haws, Brent McFerran, and Joseph Redden]:

Prices are typically critical to consumption decisions, but can the presence of price impact enjoyment over the course of an experience? We examine the effect of price on consumers’ satisfaction over the course of consumption. We find that, compared to when no pricing information is available, the presence of prices accelerates satiation (i.e., enjoyment declines faster). . . .

I have no special thoughts on pricing and enjoyment, nor am I criticizing the paper by Haws et al. which I have not had the opportunity to read (see P.S. below).

The thing I did want to talk about was Cowen’s implicit assumption in his header that a treatment has a fixed effect. It’s clear that Cowen doesn’t believe this—the very first sentence of this post recognizes variation—so it’s not that he’s making this conceptual error. Rather, my problem here is that the whole discussion, by default, is taking place on the turf of constant effects.

The idea, I think, is that you first establish the effect and then you look for interactions. But if interactions are the entire story—as seems plausible here—that main-effect-first approach will be a disaster. Just as it was with power pose, priming, etc.

Framing questions in terms of “the effect” can be a hard habit to break.


I was thinking of just paying the $35.95 but, just from the very fact of knowing the price, my satiation increased and my enjoyment declined and I couldn’t bring myself to do it. In future, perhaps Elsevier can learn from its own research and try some hidden pricing: Click on this article and we’ll remove a random amount of money from your bank account! That sort of thing.

No-op: The case of Case and Deaton

In responding to some recent blog comments I noticed an overlap among our two most recent posts:

1. Mortality rate trends by age, ethnicity, sex, and state

2. When does research have active opposition?

When does research have active opposition?

A reporter was asking me the other day about the Brian Wansink “pizzagate” scandal. The whole thing is embarrassing for journalists and bloggers who’ve been reporting on this guy’s claims entirely uncritically for years. See here, for example. Or here and here. Or here, here, here, and here. Or here. Or here, here, here, . . .

The journalist on the phone was asking me some specific questions: What did I think of Wansink’s work (I think it’s incredibly sloppy, at best), Should Wansink release his raw data (I don’t really care), What could Wansink do at this point to restore his reputation (Nothing’s gonna work at this point), etc.

But then I thought of another question: How was Wansink able to get away with it for so long? Remember, he got called on his research malpractice a full 5 years ago; he followed up with some polite words and zero action, and his reputation wasn’t dented at all.

The problem, it seems to me, is that Wansink has had virtually no opposition all these years.

It goes like this. If you do work on economics, you’ll get opposition. Write a paper claiming the minimum wage helps people and you’ll get criticism on the right. Write a paper claiming the minimum wage hurts people and you’ll get criticism on the left. Some—maybe most—of this criticism may be empty, but the critics are motivated to use whatever high-quality arguments are at their disposal, so as to improve their case.

Similarly with any policy-related work. Do research on the dangers of cigarette smoking, or global warming, or anything else that threatens a major industry, and you’ll get attacked. This is not to say that these attacks are always (or never) correct, just that you’re not going to get your work accepted for free.

What about biomedical research? Lots of ambitious biologists are running around, all aiming for that elusive Nobel Prize. And, so I’ve heard, many of the guys who got the prize are pushing everyone in their labs to continue publishing purported breakthrough after breakthrough in Cell, Science, Nature, etc. . . . What this means is that, if you publish a breakthrough of your own, you can be sure that the sharks will be circling, and lots of top labs will be out there trying to shoot you down. It’s a competitive environment. You might be able to get a quick headline or two, but shaky lab results won’t be able to sustain a Wansink-like ten-year reign at the top of the charts.

Even food research will get opposition if it offends powerful interests. Claim to have evidence that sugar is bad for you, or milk is bad for you, and yes you might well get favorable media treatment, but the exposure will come with criticism. If you make this sort of inflammatory claim and your research is complete crap, then there’s a good chance someone will call you on it.

Wansink, though, his story is different. Yes, he’s occasionally poked at the powers that be, but his research papers address major policy debates only obliquely. There’s no particular reason for anyone to oppose a claim that men eat differently when with men than with women, or that buffet pricing affects or does not affect how much people eat, or whatever.

Wansink’s work flies under the radar. Or, to mix metaphors, he’s in the Goldilocks position, with topics that are not important for anyone to care about disputing, but interesting and quirky enough to appeal to the editors at the New York Times, NPR, Freakonomics, Marginal Revolution, etc.

It’s similar with embodied cognition, power pose, himmicanes, ages ending in 9, and other PPNAS-style Gladwell bait. Nobody has much motivation to question these claims, so they can stay afloat indefinitely, generating entire literatures in peer-reviewed journals, only to collapse years or decades later when someone pops the bubble via a preregistered non-replication or a fatal statistical criticism.

We hear a lot about the self-correcting nature of science, but—at least until recently—there seems to have been a lot of published science that’s completely wrong, but which nobody bothered to check. Or, when people did check, no one seemed to care.

A couple weeks ago we had a new example, a paper out of Harvard called, “Caught Red-Minded: Evidence-Induced Denial of Mental Transgressions.” My reaction when reading this paper was somewhere between: (1) Huh? As recently as 2016, the Journal of Experimental Psychology: General was still publishing this sort of slop? and (2) Hmmm, the authors are pretty well known, so the paper must have some hidden virtues. But now I’m realizing that, yes, the paper may well have hidden virtues—that’s what “hidden” means, that maybe these virtues are there but I don’t see them—but, yes, serious scholars really can release low-quality research, when there’s no feedback mechanism to let them know there are problems.

OK, there are some feedback mechanisms. There are journal referees, there are outside critics like me or Uri Simonsohn who dispute forking path p-value evidence on statistical grounds, and there are endeavors such as the replication project that have revealed systemic problems in social psychology. But referee reports are hidden (you can respond to them by just submitting to a new journal), and the problem with peer review is the peers; and the other feedbacks are relatively new, and some established figures in psychology and other fields have had trouble adjusting.

Everything’s changing—look at Pizzagate, power pose, etc., where the news media are starting to wise up, and pretty soon it’ll just be NPR, PPNAS, and Ted standing in a very tiny circle, tweeting these studies over and over again to each other—but as this is happening, I think it’s useful to look back and consider how it is that certain bubbles have been kept afloat for so many years, how it is that the U.S. government gave millions of dollars in research grants to a guy who seems to have trouble counting pizza slices.

Mortality rate trends by age, ethnicity, sex, and state (link fixed)

There continues to be a lot of discussion on the purported increase in mortality rates among middle-aged white people in America.

Actually, it’s an increase among women and not much change among men, but you don’t hear so much about this, as it contradicts the “struggling white men” story that we hear so much about in the news media.

A big fat pile of graphs

To move things along, Jonathan Auerbach and I prepared a massive document (zipped file here; still huge) with 60 pages of graphs, showing raw data and smoothed trends in age-adjusted mortality rate from 1999-2014 for:
– 50 states
– men and women
– non-hispanic whites, blacks, and hispanics
– age categories 0-1, 1-4, 5-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84.

It’s amazing how much you can learn by staring at these graphs.

For example, these trends are pretty much the same in all 50 states:

But look at these:

Flat in some states, sharp increases in others, and steady decreases in other states.

The patterns are even clearer here:

Starting point

To get back to the question that got everything started, here’s the story for non-Hispanic white men and women, aged 45-54:


Different things are happening in different regions—in particular, things have been getting worse for women in the south and midwest, whereas the death rate of men in this age group has been declining during the past few years—but overall there has been little change since 1999. In contrast, as Anne Case and Angus Deaton noticed a bit over a year ago, other countries and U.S. nonwhites have seen large declines in death rates, something like 20%.

Breaking down trends by education: it’s tricky

In a forthcoming paper, “Mortality and morbidity in the 21st century,” Case and Deaton report big differences in trends among whites with high and low levels of education: “mortality is rising for those without, and falling for those with, a college degree.”

But the comparison of death rates by education is tricky because average education levels have been increasing over time. There’s a paper from 2015 on this topic, “Measuring Recent Apparent Declines In Longevity: The Role Of Increasing Educational Attainment,” by John Bound, Arline Geronimus, Javier Rodriguez, and Timothy Waidmann, who write:

Independent researchers have reported an alarming decline in life expectancy after 1990 among US non-Hispanic whites with less than a high school education. However, US educational attainment rose dramatically during the twentieth century; thus, focusing on changes in mortality rates of those not completing high school means looking at a different, shrinking, and increasingly vulnerable segment of the population in each year.
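A numerical toy example of the Bound et al. point (all rates invented): the no-high-school group’s death rate can rise even if no individual’s risk changes, simply because the shrinking group becomes concentrated among its higher-risk members.

```python
# Individual risks never change between years:
low_risk, high_risk = 0.002, 0.008  # annual death rates, hypothetical

# Year 1: 40% of the cohort lacks the credential, an even mix of the two risk types.
rate_y1 = (0.20 * low_risk + 0.20 * high_risk) / 0.40   # group rate = 0.005

# Year 2: attainment rises; only the 20% high-risk subset still lacks it.
rate_y2 = (0.20 * high_risk) / 0.20                     # group rate = 0.008

print(rate_y1, rate_y2)  # the group's rate rises with zero change in anyone's risk
```

So a rising death rate within an education category is consistent with every individual's risk staying flat; the composition of the category did all the work.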

Breaking down trends by state

In my paper with Jonathan Auerbach, we found big differences in different regions of the country. We followed up and estimated mortality rates by state and by age group, and there are tons of interesting patterns. Again, our latest graph dump is here (zipped file here), and you can look through the graphs yourself to see what you see. Next step is to build some sort of open-ended tool to use Stan to do smoothing for arbitrary slices of these data. Also there are selection issues as people move between states, which is similar but not identical to selection issues regarding education.

Message to journalists

Case and Deaton found some interesting patterns. They got the ball rolling. Read their paper, talk with them, get their perspective. Then talk with other experts: demographers, actuaries, public health experts. Talk with John Bound, Arline Geronimus, Javier Rodriguez, and Timothy Waidmann, who specifically raised concerns about comparisons of time series broken down by education. Talk with Chris Schmid, author of the paper, “Increased mortality for white middle-aged Americans not fully explained by causes suggested.” You don’t need to talk with me—I’m just a number cruncher and claim no expertise on causes of death. But click on the link, wait 20 minutes for it to download and take a look at our smoothed mortality rate trends by state. There’s lots and lots there, much more than can be captured in any simple story.

This could be a big deal: the overuse of psychotropic medications for advanced Alzheimer’s patients

I received the following email, entitled “A research lead (potentially bigger than the opioid epidemic),” from someone who wishes to remain anonymous:

My research lead is related to the use of psychotropic medications in Alzheimer’s patients. I should note that strong cautions have already been issued with respect to the use of these medications in the elderly. As a practical matter, however, at present agitated and aggressive behaviors are considered a common symptom of advanced Alzheimer’s and therefore the use of these drugs as chemical restraints is common even by the most conservative physicians. Furthermore, as the medical profession is very diverse many physicians have not updated their practice and are far from cautious in prescribing these medications to the elderly. Indeed, chemical restraints are the norm in nursing home practice, so it is my belief that psychotropic medications are ubiquitous in nursing homes.

A family member is an active medical professional (a physician assistant who works as a front-line health care provider in a family practice office in rural America), so I have some insight into how a conservative medical practitioner will behave. I have observed two things: first, when a symptom is in the list of potential symptoms for a disease that has been diagnosed in the patient, there is a strong presumption that the symptom is caused by the disease. Second, when a symptom is a side effect of a medication (particularly one prescribed by another doctor), there is a presumption that the need for the medication outweighs the side effect.

A combination of close familiarity with my mother’s symptoms/behavior on and off a variety of drugs and my knowledge of the ubiquitous bank gaming of regulatory controls in financial markets leads me to wonder whether the drug companies aren’t playing the same kind of game. In particular the addition of “behavioral and psychological symptoms of dementia” (BPSD) to the criteria for the diagnosis of late-stage Alzheimer’s is as far as I can tell of very recent vintage. (I believe they were introduced with the DSM-5 in 2013.) BPSD are symptoms treated by psychotropic medications. These same medications are also commonly used to treat mild sleep and anxiety disorders in the general population.

The problem with all the psychotropic medications is that they are used to treat the same behaviors that they can also cause as side effects, including irritability, anxiety, “disinhibition” which maps into a willingness to hit out at or behave abusively to others, aggressiveness, and self-harm. In sufficiently high doses, however, the patient is heavily sedated and they are very effective chemical restraints.

My suspicion is that the introduction of BPSD into the definition of common symptoms of Alzheimer’s has developed as a result of the ubiquitous use of psychotropic medications in this population. That is, as far as I can tell the studies that have found BPSD to be common in Alzheimer’s are population-based studies that did not control for the use of medications that have as side effects BPS behaviors. Successfully bringing BPSD into the clinical definition of Alzheimer’s is hugely profitable for the drug companies that now have physicians’ biases – to attribute symptoms to the disease that has already been diagnosed, rather than to the drug that may cause it as a side effect – working on their side.

In short, what I would really like to see is a careful statistician’s review of the studies that find that BPSD are common symptoms of Alzheimer’s and an analysis of whether sufficient controls for the use of medicines that have as side effects the same symptoms have been implemented. (With psychotropic medications, the length of use is an important factor because they build up in the system. Thus long-term use has different effects from short-term use. Long-term prescriptions are not a good proxy for long-term use, since refill/renewal of shorter term prescriptions is common practice.)

I don’t know anything about this, but I ran it by an expert on eldercare who said that, yes, this is a big deal. So I’m posting this in the hope that someone will look into it more carefully. As you can see from the discussion above, there are statistical entry points to this one, and it touches on some interactions between causal inference and decision analysis.

Some natural solutions to the p-value communication problem—and why they won’t work

Blake McShane and David Gal recently wrote two articles (“Blinding us to the obvious? The effect of statistical training on the evaluation of evidence” and “Statistical significance and the dichotomization of evidence”) on the misunderstandings of p-values that are common even among supposed experts in statistics and applied social research.

The key misconception has nothing to do with tail-area probabilities or likelihoods or anything technical at all, but rather with the use of significance testing to finesse real uncertainty.

As John Carlin and I write in our discussion of McShane and Gal’s second paper (to appear in the Journal of the American Statistical Association):

Even authors of published articles in a top statistics journal are often confused about the meaning of p-values, especially by treating 0.05, or the range 0.05–0.15, as the location of a threshold. The underlying problem seems to be deterministic thinking. To put it another way, applied researchers and also statisticians are in the habit of demanding more certainty than their data can legitimately supply. The problem is not just that 0.05 is an arbitrary convention; rather, even a seemingly wide range of p-values such as 0.01–0.10 cannot serve to classify evidence in the desired way.
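To see why even a range like 0.01–0.10 can’t classify evidence, it helps to remember how variable p-values are across exact replications of the same experiment. A sketch (effect size and sample size invented for illustration), using a simple z-test:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 100             # observations per experiment
true_effect = 0.25  # a modest true effect, in sd units

pvals = []
for _ in range(1000):   # 1000 exact replications of the same experiment
    x = rng.normal(true_effect, 1, size=n)
    z = x.mean() / (x.std(ddof=1) / math.sqrt(n))
    pvals.append(math.erfc(abs(z) / math.sqrt(2)))  # two-sided p-value

pvals = np.array(pvals)
# Identical experiments, wildly different p-values:
print(np.quantile(pvals, [0.1, 0.5, 0.9]))
```

Under these settings the replications routinely span p < 0.001 to p > 0.1, so any fixed cutoff, or any narrow band of cutoffs, sorts identical experiments into different evidential bins.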

In our article, John and I discuss some natural solutions that won’t, on their own, work:

– Listen to the statisticians, or clarity in exposition

– Confidence intervals instead of hypothesis tests

– Bayesian interpretation of one-sided p-values

– Focusing on “practical significance” instead of “statistical significance”

– Bayes factors

You can read our article for the reasons why we think the above proposed solutions won’t work.

From our summary:

We recommend saying No to binary conclusions . . . resist giving clean answers when that is not warranted by the data. . . . It will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.

P.S. Along similar lines, Stephen Jenkins sends along the similarly-themed article, “‘Sing Me a Song with Social Significance’: The (Mis)Use of Statistical Significance Testing in European Sociological Research,” by Fabrizio Bernardi, Lela Chakhaia, and Liliya Leopold.

Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud (Pizzagate edition)

This recent Pizzagate post by Nick Brown reminds me of our discussion of Clarke’s Law last year.

P.S. I watched a couple more episodes of Game of Thrones on the plane the other day. It was pretty good! And so I continue to think that watching GoT is more valuable than writing error-ridden papers such as “Lower Buffet Prices Lead to Less Taste Satisfaction.”

Indeed, I think that if people spent more time watching Game of Thrones and less time chopping up their data and publishing their statistically significant noise in the Journal of Sensory Studies, PPNAS, etc., the world would be a better place.

So, if you’re working in a lab and your boss asks you to take a failed study and get it published four times as if it were a success, my advice to you: Spend more time on Facebook, Twitter, Game of Thrones, Starbucks, spinning class. Most of us will never remember what we read or posted on Twitter or Facebook yesterday. But if you publish four papers with 150 errors, people will remember that forever.

P.P.S. From Clifford Anderson-Bergman, here’s another person who should’ve spent less time in the lab and more time watching TV. Key quote: “In court documents, prosecutors noted that Kinion set out to win prestige and advance his career rather than enriching himself.”

Whassup, Pace investigators? You’re still hiding your data. C’mon dudes, loosen up. We’re getting chronic fatigue waiting for you already!

James Coyne writes:

For those of you who have not heard of the struggle for release of the data from the publicly funded PACE trial of adaptive pacing therapy, cognitive behaviour therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome, you can access my [Coyne’s] initial call for release of the portion of the data from the trial published in PLOS One.

Here is support from Stats News for release of the PACE trial data. And my half-year update on my request.

Despite the investigators’ having promised the data would be available as a condition of publishing in PLOS One, 18 months after making my request, I still have not received the data and the PACE investigators are continuing to falsely advertise to readers of their PLOS One article that they have complied with the data-sharing policy.

Here’s what the Pace team wrote:

We support the use of PACE trial data for additional, ethically approved research, with justified scientific objectives, and a pre-specified analysis plan. We prefer to collaborate directly with other researchers. On occasion, we may provide data without direct collaboration, if mutually agreed. . . .

Applicants should state the purpose of their request, their objectives; qualifications and suitability to do the study, data required, precise analytic plans, and plans for outputs. See published protocol for details of data collected.

Dayum. Them’s pretty strong conditions, considering that, as Coyne points out, the original Pace trial didn’t follow a pre-specified analysis plan itself.


Data will be provided with personal identifiers removed. Applicants must agree not to use the data to identify individual patients, unless this is a pre-specified purpose for record linkage.

Individual researchers who will see the data must sign an agreement to protect the confidentiality of the data and keep it secure.

This seems fair enough, but I don’t see that it has anything to do with pre-analysis plans, qualifications and suitability, or direct collaboration. The issue is that there are questions with the published analyses (see, for example, here), and to resolve these questions it would help to work with the original data.

Putting roadblocks in the way of data sharing, that seems a bit vexatious to me.

P.S. There’s nothing more fun than a story with good guys and bad guys, and it’s easy enough to put the Pace struggle into that framework. But, having talked with people on both sides, I feel like lots of people are trying their best here, and that some of these problems are caused by the outmoded statistical attitude that the role of a study is to prove that a treatment “works” or “does not work.” Variation is all. I don’t know how much can be learned by reanalysis of these particular data, but it does seem like it would be a good idea for the data to be shared as broadly as possible.

“Bias” and “variance” are two ways of looking at the same thing. (“Bias” is conditional, “variance” is unconditional.)

Someone asked me about the distinction between bias and noise and I sent him some links. Then I thought this might interest some of you too, so here it is:

Here’s a recent paper on election polling where we try to be explicit about what is bias and what is variance:

And here are some other things I’ve written on the topic:
The bias-variance tradeoff
Everyone’s trading bias for variance at some point, it’s just done at different places in the analyses
There’s No Such Thing As Unbiased Estimation. And It’s a Good Thing, Too.
Balancing bias and variance in the design of behavioral studies

Finally, here’s the sense in which variance and bias can’t quite be distinguished:
– An error term can be mathematically expressed as “variance” but if it only happens once or twice, it functions as “bias” in your experiment.
– Conversely, bias can vary. An experimental protocol could be positively biased one day and negatively biased another day or in another scenario.
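A toy simulation (my own illustration, not from the links above) makes the first point concrete. The "lab offset" here is hypothetical: it is random across labs, so it counts as variance unconditionally, but within any one lab's run it is fixed and behaves exactly like bias:

```r
# Hypothetical lab-offset example: the offset is drawn once per lab,
# so it acts as bias within a lab but as variance across labs.
set.seed(1)
true_value <- 5
lab_offset <- rnorm(1, 0, 2)          # drawn once: fixed for "our" lab
y <- true_value + lab_offset + rnorm(1000, 0, 1)
mean(y) - true_value                  # approximately lab_offset: looks like pure bias
# Across many labs, the offsets average out, i.e. they act as variance:
mean(rnorm(10000, 0, 2))              # near 0
```

Conditioning on the lab, the error is bias; averaging over labs, it is variance. Same term, two views.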

P.S. These two posts are also relevant:
How do you think about the values in a confidence interval?
(The question was “Are all values within the 95% CI equally likely (probable), or are the values at the “tails” of the 95% CI less likely than those in the middle of the CI closer to the point estimate?”
And my answer was: In general, No and It depends.)
Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests

“A blog post that can help an industry”

Tim Bock writes:

I understood how to address weights in statistical tests by reading Lu and Gelman (2003). Thanks.

You may be disappointed to know that this knowledge allowed me to write software, which has been used to compute many billions of p-values. When I read your posts and papers on forking paths, I always find myself in agreement. But, I can’t figure out how they apply to commercial survey research. Sure, occasionally commercial research involves modeling and hierarchical Bayes can work out, but nearly all commercial market research involves the production of large numbers of tables, with statistical tests being used to help researchers work out which numbers on which tables are worth thinking about. Obviously, lots of false positives can occur, and a researcher can try and protect themselves by, for example:

1. Stating prior beliefs relating to important hypotheses prior to looking at the data.

2. Skepticism/the smell test.

3. Trying to corroborate unexpected results using other data sources.

4. Looking for alternative explanations for interesting results (e.g., questionnaire wording effects, fieldwork errors).

5. Applying multiple comparison corrections (e.g., FDR).

In Gelman, Hill, and Yajima (2013), you wrote “the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective.” I really like the logic of the paper, and get how I can apply it if I am building a model. But, how can I get rid of frequentist p-values as a tool for sifting through thousands of tables to find the interesting ones?

Many a professor has looked at commercial research and criticized the process, suggesting that research should be theory led, and it is invalid to scan through a lot of tables. But, such a response misses the point of commercial research, which is an inductive process by-and-large. To say that one must have a hypothesis going in, is to miss the point of commercial research.

What is the righteous Bayesian solution to the problem? Hopefully this email can inspire a blog post that can help an industry.

My response:

First, change “thousands of tables” to “several pages of graphs.” See my MRP paper with Yair for examples of how to present many inferences in a structured way.

Second, change “p-values . . . sifting through” to “multilevel modeling.” The key idea is that a “table” has structure; structure represents information; and this information can and should be used to guide your inferences. Again, that paper with Yair has examples.

Third, model interactions without aiming for certainty. In this article I discuss the connection between varying treatment effects and the crisis of unreplicable research.
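To make the multilevel-modeling point concrete, here is a small sketch (my own toy example, not taken from the paper with Yair) of how partial pooling replaces cell-by-cell significance testing: the many noisy cell estimates in a "table" get shrunk toward the prior mean, and the shrunken estimates beat the raw ones on average:

```r
# Toy hierarchical shrinkage: J survey cells with true effects
# theta_j ~ normal(0, tau) and observed cell means y_j ~ normal(theta_j, sigma).
# With tau and sigma treated as known, the posterior mean shrinks each
# raw estimate toward the prior mean (0 here).
set.seed(2)
J <- 200; sigma <- 1; tau <- 0.5
theta <- rnorm(J, 0, tau)              # true cell effects
y <- rnorm(J, theta, sigma)            # one noisy estimate per cell
shrink <- tau^2 / (tau^2 + sigma^2)    # here 0.2: heavy pooling
theta_hat <- shrink * y
mean((theta_hat - theta)^2) < mean((y - theta)^2)  # partial pooling wins
```

In a real application tau and sigma would themselves be estimated from the table's structure, and the shrinkage would adapt: cells with lots of data get pooled less, sparse cells more.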

Fatal Lady

Eric Loken writes:

I guess they needed to add some drama to Hermine’s progress.

[background here]

P.S. The above post was pretty short. I guess I should give you some more material. So here’s this, that someone sent me:

You’ve written about problems with regression discontinuity a number of times.

This paper that just came in on the NBER Digest email looks like it has another very unconvincing regression discontinuity. I haven’t read the paper—I’m just looking at the picture. But if you asked me to pick out the discontinuity from the data, I think I’d have a lot of trouble…

Among other things, it looks like if you take their linear specification seriously, we should expect negative patents for firms with assets below 50 million pounds if eligible for SME (left), or for firms below 80 million pounds if ineligible (right). So, um, maybe we shouldn’t take those lines very seriously.

Also, the minimum x-value looks suspiciously as though it’s been chosen to make the left-side regression line go as high as possible. That wouldn’t be an issue if they had done some kind of LOESS regression, of course.

Dark Angel

Chris Kavanagh writes:

I know you are all too frequently coming across defensive, special pleading-laced responses to failed replications, so I thought I would just point out a recent and very admirable response from Will Gervais posted on his blog.

He not only commends the replicators but acknowledges that the original finding was likely a false positive.

Cool! He’s following in the footsteps of 50 shades of gray. Which in turn reminds me of the monochrome Stroop chart:

Ensemble Methods are Doomed to Fail in High Dimensions

Ensemble methods

By ensemble methods, I (Bob, not Andrew) mean approaches that scatter points in parameter space and then make moves by interpolating or extrapolating among subsets of them. Two prominent examples are Goodman and Weare’s affine-invariant ensemble (walker) samplers and differential evolution.

There are extensions and computer implementations of these algorithms. For example, the Python package emcee implements Goodman and Weare’s walker algorithm and is popular in astrophysics.

Typical sets in high dimensions

If you want to get the background on typical sets, I’d highly recommend Michael Betancourt’s video lectures on MCMC in general and HMC in particular; they both focus on typical sets and their relation to the integrals we use MCMC to calculate:

It was Michael who made a doughnut in the air, pointed at the middle of it and said, “It’s obvious ensemble methods won’t work.” This is just fleshing out the details with some code for the rest of us without such sharp geometric intuitions.

MacKay’s information theory book is another excellent source on typical sets. Don’t bother with the Wikipedia on this one.

Why ensemble methods fail: Executive summary

  1. We want to draw a sample from the typical set
  2. The typical set is a thin shell at a fixed radius from the mode in a multivariate normal
  3. Interpolating or extrapolating two points in this shell is unlikely to fall in this shell
  4. The only steps that get accepted will be near one of the starting points
  5. The samplers devolve to a random walk with poorly biased choice of direction

Several years ago, Peter Li built the Goodman and Weare walker methods for Stan (all they need is log density) on a branch for evaluation. They failed in practice exactly the way the theory says they will fail. Which is too bad, because the ensemble methods are very easy to implement and embarrassingly parallel.

Why ensemble methods fail: R simulation

OK, so let’s see why they fail in practice. I’m going to write some simple R code to do the job for us. Here’s an R function to generate a 100-dimensional standard isotropic normal variate (each element is generated normal(0, 1) independently):

normal_rng <- function(K) rnorm(K);

This function computes the log density of a draw:

normal_lpdf <- function(y) sum(dnorm(y, log=TRUE));

Next, generate two draws from a 100-dimensional version:

K <- 100;
y1 <- normal_rng(K);
y2 <- normal_rng(K);

and then interpolate by choosing a point between them:

lambda <- 0.5;
y12 <- lambda * y1 + (1 - lambda) * y2;

Now let's see what we get:

print(normal_lpdf(y1), digits=1);
print(normal_lpdf(y2), digits=1);
print(normal_lpdf(y12), digits=1);

[1] -153
[1] -142
[1] -123

Hmm. Why is the log density of the interpolated vector so much higher? Given that it's a multivariate normal, the answer is that it's closer to the mode. That should be a good thing, right? No, it's not. The typical set is defined as an area within "typical" density bounds. When I take a random draw from a 100-dimensional standard normal, I expect log densities that hover between -140 and -160 or so. That interpolated vector y12 with a log density of -123 isn't in the typical set!!! It's a bad draw, even though it's closer to the mode. Still confused? Watch Michael's videos above.

Ironically, there's a discussion in the Goodman and Weare paper of why they can use ensemble averages that also explains why their own sampler doesn't scale: the variance of averages is lower than the variance of individual draws, and we want to cover the actual posterior, not get closer to the mode.
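As a quick check on that -140 to -160 range (my own aside, not from the discussion above): the expected log density of a K-dimensional standard normal follows directly from the fact that the sum of squares is chi-squared with K degrees of freedom:

```r
# E[log p(y)] = -K/2 * log(2*pi) - 1/2 * E[sum(y^2)]
#             = -K/2 * log(2*pi) - K/2,
# since sum(y^2) ~ chi-squared with K degrees of freedom (mean K).
K <- 100
expected_lp <- -K/2 * log(2*pi) - K/2
print(expected_lp)  # about -141.9, right in the middle of the range above
```

So -123 is roughly 19 nats above where a typical draw sits, which for K = 100 is many standard deviations out.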

So let's put this in a little sharper perspective by simulating thousands of draws from a multivariate normal and thousands of draws interpolating between pairs of draws and plot them in two histograms. First, draw them and print a summary:

lp1 <- vector();
for (n in 1:1000) lp1[n] <- normal_lpdf(normal_rng(K));
summary(lp1);

lp2 <- vector();
for (n in 1:1000) lp2[n] <- normal_lpdf((normal_rng(K) + normal_rng(K))/2);
summary(lp2);

from which we get:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -177    -146    -141    -142    -136    -121 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -129    -119    -117    -117    -114    -108 

That's looking bad. It's even clearer with a faceted histogram:

library(ggplot2);
df <- data.frame(list(log_pdf = c(lp1, lp2),
                      case=c(rep("Y1", 1000), rep("(Y1 + Y2)/2", 1000))));
plot <- ggplot(df, aes(log_pdf)) +
        geom_histogram(color="grey") +
        facet_grid(case ~ .);
print(plot);

Here's the plot:

The bottom plot shows the distribution of log densities in independent draws from the standard normal (these are pure Monte Carlo draws). The top plot shows the distribution of the log density of the vector resulting from interpolating two independent draws from the same distribution. Obviously, the log densities of the averaged draws are much higher. In other words, they are atypical of draws from the target standard normal density.


Exercise: check out what happens as (1) the number of dimensions K varies, and (2) as lambda varies within or outside of [0, 1].

Hint: What you should see is that as lambda approaches 0 or 1, the draws get more and more typical, and more and more like random walk Metropolis with a small step size. As dimensionality increases, the typical set becomes more attenuated and the problem becomes worse (and vice-versa as it decreases).
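One way to start on the lambda half of the exercise (my sketch, with illustrative values, not from the original post) is to sweep lambda and watch the interpolated draws drift out of the typical set:

```r
# At lambda = 0 the "interpolation" is just a fresh draw (typical);
# as lambda moves toward 0.5, the combined draw gets too close to the mode.
normal_lpdf <- function(y) sum(dnorm(y, log=TRUE))
K <- 100
set.seed(1234)
for (lambda in c(0, 0.1, 0.25, 0.5)) {
  lp <- replicate(1000,
                  normal_lpdf(lambda * rnorm(K) + (1 - lambda) * rnorm(K)))
  cat("lambda =", lambda, " mean log density =", mean(lp), "\n")
}
```

You should see the mean log density climb from about -142 at lambda = 0 toward about -117 at lambda = 0.5, reproducing the histogram gap above; rerunning with larger K makes the gap wider.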

Does Hamiltonian Monte Carlo (HMC) have these problems?

Not so much. It scales much better with dimension. It'll slow down, but it won't break and devolve to a random walk like ensemble methods do.