## Stan Talk in NYC: Macroeconomic Forecasting using Analogy Weighting

This post is by Eric.

The next Stan meetup is coming up in February. It will be hosted by the New York Bayesian Data Analysis Meetup group and International Securities Exchange. The BDA group was formerly called Stan Users – NYC. We will still be focusing on Stan, but we would also like to open it up to a broader Bayesian community and hold more regular meetups.

P.S. What is Analogy Weighting you ask? I have no idea, but I am sure Jim Savage will tell us.

## Middle-aged white death trends update: It’s all about women in the south

Jonathan Auerbach and I wrote up some of the age-adjustment stuff we discussed on this blog a couple months ago. Here’s our article, a shorter version of which will appear as a letter in PPNAS.

Wow!! Remember that increasing death rate among middle-aged non-Hispanic whites? It’s all about women in the south (and, to a lesser extent, women in the midwest). Amazing what can be learned just by slicing data.

I don’t have any explanations for this. As I told a reporter the other day, I believe in the division of labor: I try to figure out what’s happening, and I’ll let other people explain why.

I’m sure you can come up with lots of stories on your own, though. When performing your reverse causal inference, remember that people move, and, as we’ve discussed before, the cohorts are changing. 45-54-year-olds in 1999 aren’t the same people as 45-54-year-olds in 2013. We adjust for changing age distributions (ya gotta do that) but we’re still talking about different cohorts.
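The age-adjustment point is easy to see with a toy calculation (all numbers made up, just for illustration): hold the age-specific death rates fixed, let the 45–54 bin skew older over time, and the crude rate rises anyway, while a directly standardized rate stays flat.

```python
# Toy illustration of aggregation bias (all numbers made up).
# Age-specific death rates are held constant over time, but the
# 45-54 bin skews older between the two years, so the crude
# (unadjusted) rate rises even though no one's risk has changed.
rates = {45: 0.003, 54: 0.006}        # deaths per person-year, by age

pop_1999 = {45: 1000, 54: 500}        # bin is younger-heavy in 1999
pop_2013 = {45: 500, 54: 1000}        # and older-heavy in 2013

def crude_rate(pop):
    deaths = sum(pop[a] * rates[a] for a in pop)
    return deaths / sum(pop.values())

def adjusted_rate(std_pop):
    # direct standardization: apply the age-specific rates to a
    # fixed reference age distribution instead of the actual one
    total = sum(std_pop.values())
    return sum(rates[a] * std_pop[a] / total for a in std_pop)

print(crude_rate(pop_1999))           # 0.004
print(crude_rate(pop_2013))           # 0.005 -- a spurious "increase"
print(adjusted_rate({45: 1, 54: 1}))  # 0.0045, the same in both years
```

Because the age-specific rates are constant here, the standardized rate is identical in both years; the entire apparent trend in the crude rate comes from the shifting age mix within the bin.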

Here’s how our paper begins:

In a recent article in PNAS, Case and Deaton show a figure illustrating “a marked increase in the all-cause mortality of middle-aged white non-Hispanic men and women in the United States between 1999 and 2013.” The authors state that their numbers “are not age-adjusted within the 10-y 45-54 age group.” They calculated the mortality rate each year by dividing the total number of deaths for the age group by the population of the age group.

We suspected an aggregation bias and examined whether much of the increase in aggregate mortality rates for this age group could be due to the changing composition of the 45–54 year old age group over the 1999 to 2013 time period. If this were the case, the change in the group mortality rate over time may not reflect a change in age-specific mortality rates. Adjusting for age confirmed this suspicion. Contrary to Case and Deaton’s figure, we find there is no longer a steady increase in mortality rates for this age group. Instead there is an increasing trend from 1999–2005 and a constant trend thereafter. Moreover, stratifying age-adjusted mortality rates by sex shows a marked increase only for women and not men, contrary to the article’s headline.

And here’s the age-adjustment story in pictures:

For some reason, the NYT ran a story on this the other day and didn’t age adjust, which was a mistake. Nor did they break down the data by region of the country. Too bad. Lots more people read the NYT than read this blog or even PPNAS.

## My namesake doesn’t seem to understand the principles of decision analysis

It says “Never miss another deadline.” But if you really could never miss your deadlines, you’d just set your deadlines earlier, no? It’s statics vs. dynamics all over again.

That said, this advice seems reasonable:

The author has also developed a foolproof method of structuring your writing, so that you make effective use of your time. It’s based on the easy-to-remember three-step formula: Pre-write, Free-write, Re-write. Pre-write refers to researching the necessary information. Free-write refers to getting the information onto the computer screen. Re-write refers to the essential task of editing the writing into clear readable text. This technique allows writers to become the editors of their own writing, thereby dramatically improving its quality.

I haven’t actually read or even seen this book, but maybe I should take a look, as it is important to me that my students learn how to write effectively. A bit odd to choose a book based on the author’s last name, but that’s serendipity for you.

## On deck this week

Mon: My namesake doesn’t seem to understand the principles of decision analysis

Tues: Middle-aged white death trends update: It’s all about women in the south

Wed: My talk Fri 1pm at the University of Chicago

Thurs: If you’re using Stata and you want to do Bayes, you should be using StataStan

Fri: One quick tip for building trust in missing-data imputations?

Sat: This graph is so ugly—and you’ll never guess where it appeared

Sun: 2 new reasons not to trust published p-values: You won’t believe what this rogue economist has to say.

P.S. If you just can’t wait till Tues to learn about the death trends, the paper is here.

And if you just can’t wait till Thurs to learn about why StataStan is the way to go for Bayes in Stata, that paper is here.

It’s funny to think that I know what’s up all the way through April (modulo topical insertions), but you don’t!

## Grizzly Adams is an object of the class Weekend at Bernies

It just came to me when I saw his obit.

## The devil really is in the details; or, You’ll be able to guess who I think are the good guys and who I think are the bad guys in this story, but I think it’s still worth telling because it provides some insight into how (some) scientists view statistics

I noticed this on Retraction Watch:

“Scientists clearly cannot rely on the traditional avenues for correcting problems in the literature.” PubPeer responds to an editorial slamming the site.

I’ve never actually read anything on PubPeer but I understand it’s a post-publication review site, and I like post-publication review.

So I’m heading into this one on the side of PubPeer, and let me deflate any suspense right here by telling you that, having followed the links and read the discussion, my position hasn’t changed.

So, no news and no expectation that this new story should change your beliefs, if you happen to be on the Evilicious side of this particular debate.

So, if I’m not trying to convince anybody, why am I writing this post? Actually, I’m usually not trying to convince anyone when I write; rather, I use writing as a way to explore my thoughts and to integrate the discordant information I see into coherent stories (with one sort of coherent story being of the form, “I don’t yet understand what’s going on, the evidence seems to be contradictory, and I can’t form a coherent story”).

In that sense, writing is a form of posterior predictive check, or perhaps I should just say posterior inference, a way of working out the implications of my implicit models of the world in the context of available data.

They say Code Never Lies and they’re right, but writing has its own logic that can be helpful to follow.

Hence, I blog.

Now back to the item at hand. The above link goes to a post on PubPeer that begins as follows:

In an editorial entitled “Vigilante science”, the editor-in-chief of Plant Physiology, Michael Blatt, makes the hyperbolic claim that anonymous post-publication peer review by the PubPeer community represents the most serious threat to the scientific process today.

We obviously disagree. We believe a greater problem, which PubPeer can help to address, is the flood of low-quality, overinterpreted and ultimately unreliable research being experienced in many scientific fields . . .

I then clicked to see what Michael Blatt had to say in the journal Plant Physiology.

Since its launch in October 2012, PubPeer has sought to facilitate community-wide, postpublication critique of scientific articles. The Web site has also attracted much controversy . . . .

PubPeer operates as a blog on which anyone can post comments, either to a published article or to comments posted by other participants, and authors may respond. It is a bit like an extended journal club; not a bad idea to promote communication among scientists, you might think, so why the controversy?

Why, indeed? Blatt explains:

The problems arising are twofold . . . First, most individuals posting on PubPeer—let’s use the euphemism commenters for now—take advantage of the anonymity afforded by the site in full knowledge that their posts will be available to the public at large.

I don’t understand why “commenters” is considered a euphemism. That’s the problem with entering a debate in the middle—sometimes you can’t figure out what people are talking about.

Anyway:

Second, the vast majority of comments that are posted focus on image data (gels, blots, and micrographs) that contribute to the development of scientific ideas but are not ideas in themselves. With few exceptions, commenters on PubPeer do no more than flag perceived faults and query the associated content.

But, wait, what’s wrong with commenting on image data? And “flagging perceived faults”—that’s really important, no? We should all be aware of faults in published papers.

Of course, I say this as someone who’s published a paper that was invalidated by a data error, so I personally would benefit from outsiders checking my work and letting me know when they see something fishy.

So what’s the problem, then? Blatt tells us:

My overriding concern with PubPeer is the lack of transparency that arises from concealing the identities of both commenters and moderators.

This is so wrong I hardly know where to start. No, actually, I do know where to start, which is to point out that articles are published based on anonymous peer review.

Who were the reviewers who made the mistake of recommending publication of those papers by Daryl Bem or Satoshi Kanazawa or those ovulation-and-voting people? We’ll never know. For the himmicanes and hurricanes people, we do know that Susan Fiske was the editor who recommended publication, and she can be rightly criticized for her poor judgment on this one (nothing personal, I make lots of poor judgments myself, feel free to call me out on them), but we don’t know who were the external referees who failed to set her straight. Or, to go back 20 years, we don’t know who were the statistical referees who made the foolish, foolish decision to recommend that Statistical Science publish that horrible Bible Code paper. I do know the journal’s editor at the time, but he was in a difficult position if he was faced with positive referee reports.

So, according to Blatt: Anonymous pre-publication review, good. Anonymous post-publication review, bad. Got it.

Indeed, Blatt is insistent on this point:

I accept that there is a case for anonymity as part of the peer-review process. However, the argument for anonymity in postpublication discussion fallaciously equates such discussion with prepublication peer review. . . . In short, anonymity makes sense when reviews are offered in confidence to be assessed and moderated by an editor, someone whose identity is known and who takes responsibility for the decision informed by the reviews. Obviously, this same situation does not apply postpublication, not when the commenters enter into a discussion anonymously and the moderators are also unknown.

Oh no god no no no no no. Here’s the difference between pre-publication reviews, as usually conducted, and post-publication reviews:

Pre-publication reviews are secret. Not just the author of the review, also the actual content. Only very rarely are pre-publication reviews published in any form. Post-publication reviews, by their very nature, are public.

As Stephen King says, it’s the tale, not he who tells it. Post-publication reviews don’t need to be signed; we actually have the damn review. Given the review, the identity of the reviewer supplies very little information.

The other difference is that pre-publication reviews tend to be much more negative than post-publication reviews. I find it laughable when Blatt writes that post-publication reviews are “one-sided,” “petty,” “missing . . . courtesy and common sense,” “negative and occasionally malicious,” and “about policing, not discussion.” All these descriptions apply even more for pre-publication reviews.

Why do I care?

At this point, you might be asking yourself why I post this at all. Neither you nor I have ever heard of the journal Plant Physiology before, and we’ll likely never hear of it again. So who cares that the editor of an obscure journal emits a last-gasp rant against PubPeer, a site which represents the future in the same way that editor-as-gatekeeper Michael Blatt represents the past?

Who indeed? I don’t care what the editor of Plant Physiology thinks about post-publication review. What I do care about is that we’re not there yet. Any dramatic claim with “p less than .05” that appears in Science or Nature or PPNAS or Psychological Science still has a shot of getting massive publicity. That himmicanes-and-hurricanes study was just last year. And this year we’ve seen a few more.

P.S. Incidentally, it seems that journals vary greatly in the power they afford to their editors. I can’t imagine the editor of Biometrics or the Journal of the American Statistical Association being able to publish this sort of opinion piece in the journal. I don’t know the general pattern here, but I have the vague impression that biomedical journals feature more editorializing, compared to journals in the physical and social sciences.

P.P.S. Two commenters pointed out small mistakes in this post, which I’ve fixed. Another point in favor of post-publication review!

Andrea Panizza writes:

I just read about psychologist Uri Simonsohn debunking research by colleagues Raphael Silberzahn & Eric Uhlmann on the positive effects of noble-sounding German surnames on people’s careers (!!!). Here the fact is mentioned.

I think that the interesting part (apart, of course, from the general weirdness of Silberzahn & Uhlmann’s research hypothesis) is that Silberzahn & Uhlmann gave Simonsohn full access to their data, and apparently he debunked their results thanks to a better analytical approach.

My reply: Yes, this is an admirable reaction. I had seen that paper when it came out, and what struck me was that, if there is such a correlation, there could be lots of reasons not involving a causal effect of the name. In any case, it’s good to see people willing to recognize their errors: “Despite our public statements in the media weeks earlier, we had to acknowledge that Simonsohn’s technique showing no effect was more accurate.”

More generally, this sort of joint work is great, even if it isn’t always possible. Stand-alone criticism is useful, and collaborative criticism such as this is good too.

In a way it’s a sad state of affairs that we have to congratulate a researcher for acting constructively in response to criticism, but that’s where we’re at. Forward motion, I hope.

## McElreath’s Statistical Rethinking: A Bayesian Course with Examples in R and Stan

We’re not even halfway through with January, but the new year’s already rung in a new book with lots of Stan content:

This one got a thumbs up from the Stan team members who’ve read it, and Rasmus Bååth has called it “a pedagogical masterpiece.”

The book’s web site has two sample chapters, video tutorials, and the code.

The book is based on McElreath’s R package rethinking, which is available from GitHub with a nice README on the landing page.

If the cover looks familiar, that’s because it’s in the same series as Gelman et al.’s Bayesian Data Analysis.

## Unz Ivy Stats Flashback

This news story reminded me of some threads from a few years ago about Ron Unz, the political activist who wrote a statistics-filled article claiming that Harvard and other Ivy League colleges discriminate against Asian-Americans and in favor of Jews in undergraduate admissions. It turned out that some of his numbers were off by factors of 2 or 4 or more, and then, amusingly, or horrifyingly, Unz offhandedly remarked that one of his more high-profile statistical claims (cited in a notorious New York Times column by David Brooks) had been based on “five minutes of cursory surname analysis.”

Unz really pulled the rug out from under Brooks on that one! The world-famous Ted Talk speaker and humility expert was too proud to retract his column, so Unz left him to twist slowly in the wind. (Hence the image above; if we’re gonna mix metaphors I might as well go whole hog.)

In all seriousness, I remain upset by the way that false claims can stick in the public discourse. Nothing new here, of course (insert Twain-attributed quote here) but particularly sad in this case in that I expect the numbers in question were not lies but rather mere mistakes.

At some point I have more to say about this case as an example of how certain intuitions can lead us astray—believe it or not, I see some connections between the reasoning of Unz and Brooks and that of authors of various unreplicated papers in social psychology—but for now, I’d just like to help out anyone who’s coming to this particular himmicane in the middle (for example, after reading the above-linked article in today’s paper) by pointing you to my previous post on the topic, from a couple years ago, which contains links to several earlier posts and some long discussions on the details of the case.

## MTA sucks

They had a sign on the wall promoting this Easy Pay express metrocard that would auto-refill and I was like, cool, so when I got to the office I looked it up, found the sign-up page, gave my information and chose the EasyPayXpress PayPerRide Plan, clicked on “lu et entendu” (read and understood) or whatever they call it, and promptly got this:

I clicked on the back button to see what I’d done wrong, and all my information was gone from the form. Damn bureaucratic #^%@#*&^! How come Amazon can do this right and the MTA can’t?

## rstanarm and more!

Ben Goodrich writes:

The rstanarm R package, which has been mentioned several times on stan-users, is now available in binary form on CRAN mirrors (unless you are using an old version of R and / or an old version of OSX). It is an R package that comes with a few precompiled Stan models — which are called by R wrapper functions that have the same syntax as popular model-fitting functions in R such as glm() — and some supporting R functions for working with posterior predictive distributions. The files in its demo/ subdirectory, which can be called via the demo() function, show how you can fit essentially all of the models in Gelman and Hill’s textbook

http://stat.columbia.edu/~gelman/arm/

and rstanarm already offers more functionality than (although not strictly a superset of) the arm R package.

The rstanarm package can be installed in the usual way with

install.packages("rstanarm")

which does not technically require the computer to have a C++ compiler if you are on Windows / Mac (unless you want to build it from source, which might provide a slight boost to the execution speed). The vignettes explain in detail how to use each of the model fitting functions in rstanarm. However, the vignettes on the CRAN website

https://cran.r-project.org/web/packages/rstanarm/index.html

do not currently show the generated images, so call browseVignettes("rstanarm"). The help("rstanarm-package") and help("priors") pages are also essential for understanding what rstanarm does and how it works. Briefly, there are several model-fitting functions:

• stan_lm() and stan_aov(), which just calls stan_lm(), use the same likelihood as lm() and aov() respectively but add regularizing priors on the coefficients
• stan_polr() uses the same likelihood as MASS::polr() and adds regularizing priors on the coefficients and, indirectly, on the cutpoints. The stan_polr() function can also handle binary outcomes and can do scobit likelihoods.
• stan_glm() and stan_glm.nb() use the same likelihood(s) as glm() and MASS::glm.nb() and respectively provide a few options for priors
• stan_lmer(), stan_glmer(), stan_glmer.nb() and stan_gamm4() use the same likelihoods as lme4::lmer(), lme4::glmer(), lme4::glmer.nb(), and gamm4::gamm4() respectively and basically call stan_glm() but add regularizing priors on the covariance matrices that comprise the blocks of the block-diagonal covariance matrix of the group-specific parameters. The stan_[g]lmer() functions accept all the same formulas as lme4::[g]lmer() — and indeed use lme4’s formula parser — and stan_gamm4() accepts all the same formulas as gamm4::gamm4(), which can / should include smooth additive terms such as splines

If the objective is merely to obtain and interpret results and one of the model-fitting functions in rstanarm is adequate for your needs, then you should almost always use it. The Stan programs in the rstanarm package are better tested, have incorporated a lot of tricks and reparameterizations to be numerically stable, and have more options than what most Stan users would implement on their own. Also, all the model-fitting functions in rstanarm are integrated with posterior_predict(), pp_check(), and loo(), which are somewhat tedious to implement on your own. Conversely, if you want to learn how to write Stan programs, there is no substitute for practice, but the Stan programs in rstanarm are not particularly well-suited for a beginner to learn from because of all their tricks / reparameterizations / options.

Feel free to file bugs and feature requests at

https://github.com/stan-dev/rstanarm/issues

If you would like to make a pull request to add a model-fitting function to rstanarm, there is a pretty well-established path in the code for how to do that but it is spread out over a bunch of different files. It is probably easier to contribute to rstanarm, but some developers may be interested in distributing their own CRAN packages that come with precompiled Stan programs that are focused on something besides applied regression modeling in the social sciences. The Makefile and cleanup scripts in the rstanarm package show how this can be accomplished (which took weeks to figure out), but it is easiest to get started by calling rstan::rstan_package_skeleton(), which sets up the package structure and copies some stuff from the rstanarm GitHub repository.

On behalf of Jonah who wrote half the code in rstanarm and the rest of the Stan Development Team who wrote the math library and estimation algorithms used by rstanarm, we hope rstanarm is useful to you.

Also, Leon Shernoff pointed us to this post by Wayne Folta, delightfully titled “R Users Will Now Inevitably Become Bayesians,” introducing two new R packages for fitting Stan models:  rstanarm and brms.  Here’s Folta:

There are several reasons why everyone isn’t using Bayesian methods for regression modeling. One reason is that Bayesian modeling requires more thought . . . A second reason is that MCMC sampling . . . can be slow compared to closed-form or MLE procedures. A third reason is that existing Bayesian solutions have either been highly-specialized (and thus inflexible), or have required knowing how to use a generalized tool like BUGS, JAGS, or Stan. This third reason has recently been shattered in the R world by not one but two packages: brms and rstanarm. Interestingly, both of these packages are elegant front ends to Stan, via rstan and shinystan. . . . You can install both packages from CRAN . . .

He illustrates with an example:

mm <- stan_glm(mpg ~ ., data=mtcars, prior=normal(0, 8))
mm  #===> Results
stan_glm(formula = mpg ~ ., data = mtcars, prior = normal(0, 8))

Estimates:
(Intercept) 11.7   19.1
cyl         -0.1    1.1
disp         0.0    0.0
hp           0.0    0.0
drat         0.8    1.7
wt          -3.7    2.0
qsec         0.8    0.8
vs           0.3    2.1
am           2.5    2.2
gear         0.7    1.5
carb        -0.2    0.9
sigma        2.7    0.4

Sample avg. posterior predictive
distribution of y (X = xbar):
mean_PPD 20.1    0.7


Note the more sparse output, which Gelman promotes. You can get more detail with summary(mm), and you can also use shinystan to look at most everything that a Bayesian regression can give you. We can look at the values and CIs of the coefficients with plot(mm), and we can compare posterior sample distributions with the actual distribution with: pp_check(mm, "dist", nreps=30):

This is all great.  I’m looking forward to never having to use lm, glm, etc. again.  I like being able to put in priors (or, if desired, no priors) as a matter of course, to switch between mle/penalized mle and full Bayes at will, to get simulation-based uncertainty intervals for any quantities of interest, and to be able to build out my models as needed.

## Pro-PACE, anti-PACE

Pro

Simon Wessely, a psychiatrist who has done research on chronic fatigue syndrome, pointed me to an overview of the PACE trial written by its organizers, Peter White, Trudie Chalder, and Michael Sharpe, and also to this post of his from November, coming to the defense of the much-maligned PACE study:

Nothing as complex as a multi-centre trial (there were six centres involved), that recruited 641 people, delivered thousands of hours of treatment, and managed to track nearly all of them a year later, can ever be without some faults. But this trial was a landmark in behavioural complex intervention studies. . . .

I have previously made it clear that I [Wessely] think that PACE was a good trial; I once described it as a thing of beauty. In this blog I will describe why I still think that . . . Here is a recent response to criticisms, few of them new.

He provides some background on his general views:

CFS is a genuine illness, can cause severe disability and distress, affects not just patients but their families and indeed wider society, as it predominantly affects working age adults, and its cause, or more likely causes, remains fundamentally unknown. I do not think that chronic fatigue syndrome is “all in the mind”, whatever that means, and nor do the PACE investigators. I do think that, as with most illnesses, of whatever nature, psychological and social factors can be important in understanding illness and helping patients recover. Like many of the PACE team, I have run a clinic for patients with chronic fatigue syndrome for many years. Like the PACE investigators, I have also in the past done research into the biological nature of the illness; research that has indicated some of the biological abnormalities that have been found repeatedly in CFS.

And now on to the trial itself:

The PACE trial randomly allocated 641 patients with chronic fatigue syndrome, recruited in six clinics across the UK . . . What were its main findings? These were simple:

That both cognitive behaviour therapy (CBT) and graded exercise therapy (GET) improved fatigue and physical function more than either adaptive pacing therapy (APT) or specialist medical care (SMC) a year after entering the trial.

All four treatments were equally safe.

These findings are consistent with previous trials (and there are also more trials in the pipeline), but PACE, because of its sheer size, has attracted the most publicity, both good and bad.

Wessely continues:

What makes a good trial and how does PACE measure up?

Far and away the most important is allocation concealment; the ability of investigators/patients to influence the randomisation process . . . No one has criticised allocation concealment in PACE, it was exemplary. . . .

Next comes power. . . . Predetermined sample size calculations showed it [PACE] had plenty of power to detect clinically significant differences. It was one of the largest behavioural or psychological medicine trials ever undertaken. No one has criticised its size.

The next thing that can jeopardise the integrity of a trial is major losses to follow up . . . The key end point in PACE was pre-defined as the one year follow up. 95% of patients provided follow up data at this stage. I am unaware of any large scale behavioural medicine trial that has exceeded this. Again, no one has questioned this . . .

Next comes treatment infidelity, which is where participants do not get the treatment they were allocated to. . . . At the end of the trial, two independent scrutineers, masked to treatment allocation, both rated over 90% of the randomly chosen 62 sessions they listened to as the allocated therapy. Only one session was thought by both scrutineers not to be the right therapy. Again, no criticism has been made on the basis of therapy infidelity.

Analytical bias. The analytical protocol was predetermined (before the analysis started) and published. Two statisticians were involved in the analysis, blind to treatment group until the analysis was completed and signed off. So again, the chances of bias being introduced at this stage are also negligible.

Post-hoc sub-group analysis (fishing for significant differences) . . . There were no post-hoc sub-group analyses in the main outcome paper. A couple of sub-group post-hoc analyses were done in follow-up publications, and clearly identified as such and appropriate cautions issued. None concerned the main outcomes. Again, no one has raised the issue of sub-group analyses.

Blinding. PACE was not blinded; the therapists and patients knew what treatments were being given, which would be hard to avoid. This has been raised by several critics, and of course is true. It could hardly be otherwise; therapists knew they were delivering APT, or CBT or whatever, and patients knew what they were receiving. This is not unique to PACE. . . . Did this matter? One way is to see whether there were differences in what patients thought of the treatment, to which they were allocated, before they started them. . . . And that did happen in the PACE trial itself. One therapy was rated beforehand by patients as being less likely to be helpful, but that treatment was CBT. In the event, CBT came out as one of the two treatments that did perform better. If it had been the other way round; that CBT had been favoured over the other three, then that would have been a problem. But as it is, CBT actually had a higher mountain to climb, not a smaller one, compared to the others.

He summarizes:

So far then, I would suggest that PACE has passed the main challenges to the integrity of a trial with flying colours. . . . For example, the two most recent systematic reviews in this field rated PACE as good quality, with a low risk of bias.

On this last point he gives two references:

Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. (2015) Exercise therapy for chronic fatigue syndrome. Cochrane Database of Systematic Reviews 2015, Issue 2. Art. No.: CD003200. DOI: 10.1002/14651858.CD003200.pub3.

Smith MB et al. (2015) Treatment of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome: A Systematic Review for a National Institutes of Health Pathways to Prevention Workshop. Ann Intern Med. 162: 841-850. doi: http://dx.doi.org/10.7326/M15-0114.

I have a question about the “analytical bias” thing mentioned above. Recall what Julie Rehmeyer wrote:

The study participants hadn’t significantly improved on any of the team’s chosen objective measures: They weren’t able to get back to work or get off welfare, they didn’t get more fit, and their ability to walk barely improved. Though the PACE researchers had chosen these measures at the start of the experiment, once they’d analyzed their data, they dismissed them as irrelevant or not objective after all.

This doesn’t sound like a predetermined analytical protocol, so I’m not sure what’s up with that. (Let me emphasize at this point that I’ve published hundreds of statistical analyses, maybe thousands, and have preregistered almost none of them. So I’m not saying that a predetermined analytical protocol is necessary or a good idea, just saying that there seems to be a question of whether this particular analysis was really chosen ahead of time.)

Here’s what Wessely says in his post:

The researchers changed the way they scored and analysed the primary outcomes from the original protocol. The actual outcome measures did not change, but it is true that the investigators changed the way that fatigue was scored from one method to another (both methods have been described before and both are regularly used by other researchers) in order to provide a better measure of change (one method gives a maximum score of 11, the other 33). How the two primary outcomes (fatigue and physical function) were analysed was also changed from using a more complex measure, which combined two ways to measure improvement, to a simple comparison of mean (average) scores. This is a better way to see which treatment works best, and made the main findings easier to understand and interpret. This was all done before the investigators were aware of outcomes and before the statisticians started the analysis of outcomes.
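For readers unfamiliar with the two scorings: the fatigue questionnaire used in PACE (the Chalder scale) has 11 items, each answered on a 4-point scale, and the two methods differ only in how each answer is mapped to a number. A minimal sketch (the item responses here are hypothetical):

```python
# The same 11 item responses, scored two ways. "Bimodal" scoring
# collapses each 4-point answer to 0/1 (maximum total 11); Likert
# scoring keeps the full 0-3 per item (maximum total 33).
responses = [0, 1, 2, 3, 2, 1, 0, 3, 2, 1, 2]  # hypothetical answers, 0-3 each

bimodal_score = sum(1 if r >= 2 else 0 for r in responses)  # out of 11
likert_score = sum(responses)                               # out of 33

print(bimodal_score, likert_score)  # 6 17
```

The Likert version preserves more gradations of the same underlying answers, which is why the investigators describe it as a better measure of change; the dispute is over when the switch was decided, not over what either scoring computes.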

There seems to be some dispute here: is it just that there was an average improvement but, when you look at each part of the total score, the differences are not statistically significant? In that case I would think it makes sense to average.

Wessely then puts it all into perspective:

Were the results maverick? Did PACE report the opposite to what has gone before or happened since? The answer is no. It is a part of a jigsaw (admittedly the biggest piece) but the picture it paints fits with the other pieces. I think that we can have confidence in the principal findings of PACE, which to repeat, are that two therapies (CBT and GET) are superior to adaptive pacing or standard medical treatment, when it comes to generating improvement in patients with chronic fatigue syndrome, and that all these approaches are safe. . . .

I [Wessely] think this trial is the best evidence we have so far that there are two treatments that can provide some hope for improvement for people with chronic fatigue syndrome. Furthermore the treatments are safe, so long as they are provided by trained appropriate therapists who are properly supervised and in a way that is appropriate to each patient. These treatments are not “exercise and positive thinking” as one newspaper unfortunately termed it; these are sophisticated, collaborative therapies between a patient and a professional.

But . . .

Having said that, there were a significant number of patients who did not improve with these treatments. Some patients deteriorated, but this seems to be the nature of the illness, rather than related to a particular treatment. . . .

PACE or no PACE, we need more research to provide treatments for those who do not respond to presently available treatments.

Anti

All of the above seemed reasonable to me, so then I followed the link to the open letter by Davis, Edwards, Jason, Levin, Racaniello, and Reingold criticizing PACE.

The key statistical concerns of Davis et al. were (1) a mismatch between the intake criteria and the outcome measures, so that it seems possible to have gotten worse during the period of the study but be recorded as improving, and (2) the changing of the outcome measures in the middle of the study.

Regarding point (1), Wessely points out that with a randomized trial any misclassifications should cancel across the study arms. Given that the original PACE article reported changes in continuous outcome measures, I think the definition of whether a patient is “in the normal range” should be a side issue. To put it another way: I think it makes sense to model the continuous data and then post-process the inferences to make statements about normal ranges, etc.
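The kind of post-processing I have in mind is simple. Given posterior simulations of a continuous outcome under each arm, any "in the normal range" summary is just a derived quantity computed from those simulations. Here is a toy sketch in Python; the normal draws stand in for a fitted model's posterior predictive simulations, and the means, scales, and cutoff are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for posterior predictive draws of a continuous outcome
# (e.g., a physical-function score) under two trial arms.
# All numbers here are invented, not PACE data.
draws_treatment = rng.normal(loc=62, scale=8, size=4000)
draws_control = rng.normal(loc=55, scale=8, size=4000)

normal_range_cutoff = 60  # hypothetical threshold, for illustration only

# "In the normal range" becomes a derived summary of the continuous
# model rather than a separate dichotomized analysis:
p_treat = (draws_treatment >= normal_range_cutoff).mean()
p_ctrl = (draws_control >= normal_range_cutoff).mean()
print(p_treat, p_ctrl)
```

The point is that the threshold enters only at the summary stage, so disputes about where to draw the "normal range" line don't contaminate the model fit itself.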

Point (2) seems to relate to the dispute above, in which Wessely said the change was done “before the investigators were aware of outcomes,” but which Davis et al. write “is of particular concern in an unblinded trial like PACE, in which outcome trends are often apparent long before outcome data are seen.” I’m not quite sure what to say here; ultimately I’m more concerned about which summary makes sense rather than about which was chosen ahead of time. It could make sense to fit a multilevel model if there is a concern about averaging. But, realistically, I’m guessing that the study is large enough to detect averages but not large enough to get much detail beyond that—at least not without using some qualitative information from the clinicians and patients.

Davis et al. also write:

The PACE investigators based their claims of treatment success solely on their subjective outcomes. In the Lancet paper, the results of a six-minute walking test—described in the protocol as “an objective measure of physical capacity”—did not support such claims, notwithstanding the minimal gains in one arm. In subsequent comments in another journal, the investigators dismissed the walking-test results as irrelevant, non-objective and fraught with limitations. All the other objective measures in PACE, presented in other journals, also failed. The results of one objective measure, the fitness step-test, were provided in a 2015 paper in The Lancet Psychiatry, but only in the form of a tiny graph. A request for the step-test data used to create the graph was rejected as “vexatious.”

I’m not quite sure what to think about this: perhaps there was a small but not statistically significant difference for each separate outcome, but a statistically significant difference for the average? If so, then I would think it would be ok to report success based on the average.

I also asked Wessely about the above quote, and he wrote: “There was a significant improvement in the walking test after graded exercise therapy, which was not matched by any other treatment arm, and this was reported in the primary paper (White et al, 2011) and certainly not regarded as irrelevant.” So I guess the next step is to find the subsequent comments in the other journal where the investigators dismissed the walking-test result as irrelevant.

And I disagree, of course, with the decision of the investigators not to share the step-test data. Whassup with that? This is one reason I prefer to have data posted online rather than sent by request; then anyone can get the data and there doesn’t have to be anything personal involved.

Davis et al. conclude:

We therefore urge The Lancet to seek an independent re-analysis of the individual-level PACE trial data, with appropriate sensitivity analyses, from highly respected reviewers with extensive expertise in statistics and study design. The reviewers should be from outside the U.K. and outside the domains of psychiatry and psychological medicine. They should also be completely independent of, and have no conflicts of interests involving, the PACE investigators and the funders of the trial.

This seems reasonable to me, and not in contradiction with the points that Wessely made. Indeed, when I asked Wessely what he thought of this, he replied that an independent review group in a different country had already re-analyzed some of the data and would be publishing something soon. So maybe we’re closer to convergence on this particular study than it seemed.

From the results of the study to the summary and the general recommendations

One thing I liked about Wessely’s post was his moderation in summarizing the study’s results and its implications. He reports that in the study his preferred treatment outperformed the alternative, but he recognizes that, for many (most?) people, none of these treatments do much. Wessely points out that this is not just his view; he quotes this from the original article by White et al.: “Our finding that studied treatments were only moderately effective also suggests research into more effective treatments is needed. The effectiveness of behavioural treatments does not imply that the condition is psychological in nature.”

Some questions that come to me are: Can we say that different treatments work for different people? Would we have some way of telling which treatment to try on which people? Are there some treatments that should be ruled out entirely? One of the concerns of the PACE critics is that the study is being used to deny social welfare payments to people with chronic physical illness.

And one of the criticisms of PACE coming from Davis et al. has to do with reporting of results:

In an accompanying Lancet commentary, colleagues of the PACE team defined participants who met these expansive “normal ranges” as having achieved a “strict criterion for recovery.” The PACE authors reviewed this commentary before publication.

This commentary seems to have been a mistake, in that in later correspondence the PACE authors wrote, “It is important to clarify that our paper did not report on recovery; we will address this in a future publication.” That was a few years ago; the future has happened; and I guess recovery was not so easy to assess. This happens a lot in research: early success, big plans, but then slow progress. Certainly not unique to this project.

From my perspective, when I wrote about the PACE study hurting the reputation of the Lancet, I was thinking not so much of the particular flaws of the original report, or even of that incomprehensible mediation analysis that was later published (after all, you can do an incomprehensible mediation analysis of anything; just because someone does a bad analysis, it doesn’t mean there’s no pony there), but rather the Lancet editor’s aggressive defense and the difficulty that outsiders seemed to have in getting the data. According to Wessely, though, the study organizers will be sharing the data, they just need to deal with confidentiality issues. So maybe part of it is the journal editor’s communication problems, a bit of unnecessary promotion and aggression on the part of Richard Horton.

To get back to the treatments: Again, it’s no surprise that CBT and exercise therapy can help people. The success of these therapies for some percentage of people does not at all contradict the idea that many others need a lot more, nor does it provide much support for the idea that “fear avoidance beliefs” are holding back people with chronic fatigue syndrome. So on the substance—setting aside the PACE trial itself—it seems to me that Wessely and the critics of that study are not so far apart.

## Cancer statistics: WTF?

This post is by Phil.

I know someone who was recently diagnosed with lung cancer and is trying to decide whether to get chemo or just let it run its course. What does she have to go on? A bunch of statistics that are barely useful. For example, it’s easy to find the average survival time for someone with her stage of this particular cancer. Let’s say that’s 12 months. Fine, but that’s an average over all ages, both sexes, and includes people who did and didn’t opt for chemo! Maybe the average time is 6 months if you eschew chemo and 18 months if you get it, and half the people do and half the people don’t? Or maybe 80% of victims get chemo, and those that don’t only average about 2 months.
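The point about aggregation can be made concrete with a little mixture arithmetic. The numbers below are the hypothetical ones from the paragraph above, not real survival data; this sketch just shows how the same published overall mean is consistent with very different subgroup stories:

```python
# Decompose an overall mean survival time into chemo / no-chemo
# subgroups via the mixture identity:
#   overall = p * mean_chemo + (1 - p) * mean_no_chemo
# All numbers are the illustrative ones from the text.

def overall_mean(p_chemo, mean_chemo, mean_no_chemo):
    """Overall mean survival as a mixture of the two subgroups."""
    return p_chemo * mean_chemo + (1 - p_chemo) * mean_no_chemo

def implied_chemo_mean(overall, p_chemo, mean_no_chemo):
    """Solve the mixture identity for the unknown chemo-group mean."""
    return (overall - (1 - p_chemo) * mean_no_chemo) / p_chemo

# Scenario 1: half get chemo (18 months), half don't (6 months).
print(overall_mean(0.5, 18, 6))        # 12.0

# Scenario 2: 80% get chemo and the rest average 2 months; what must
# the chemo group average to produce the same overall 12 months?
print(implied_chemo_mean(12, 0.8, 2))  # 14.5
```

Both scenarios reproduce the same headline "12 months," which is exactly why the unconditional average tells a patient so little.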

The only other stat that seems to be widely available is “5-year survival rate.” Let’s say that’s 6%…that’s not so good. But if the victim is 80 years old they’re not so likely to live another 5 years anyway! Plus, again, the information that would be useful is the information that might change your behavior. If you have a 1% chance of living two years if you eschew chemo, and a 20% chance if you get it, then maybe you’d choose to get it even if you know you won’t make it 5 years.

What people should care about is conditional probabilities, not summary statistics. You don’t care what the average survival time is, averaged over all patients, instead you want to know it for people like you: your stage of cancer, your sex, your age, your physical condition.

I’m kind of shocked, and very disappointed, that there’s no website somewhere that lets you put in your age, sex, cancer type and stage, and maybe a few other relevant details, and get the statistical distribution of survival times if you do or don’t get chemo (or surgery, or radiation, or whatever). This lack was understandable in 1996 or even 2006 but it’s 2016 for crying out loud!


## Stan 2.9 is Here!

We’re happy to announce that Stan 2.9.0 is fully available(1) for CmdStan, RStan, and PyStan — it should also work for Stan.jl (Julia), MatlabStan, and StataStan. As usual, you can find everything you need on the Stan website.

The main new features are:

• R/MATLAB-like slicing of matrices. There’s a new chapter up front in the user’s-guide part of the manual explaining how it all works (and more in the language reference on the nitty-gritty details). This means you can write foo[xs] where xs is an array of integers, or use explicit slicing, as with bar[1:3, 2] and baz[:3, xs] and so on.
• Variational inference is available on an experimental basis in RStan and PyStan, and the adaptation has been improved; we still don’t have a good handle on when variational inference will work and when it won’t, so we would strongly advise only using it for rough work and then verifying with MCMC.
• Better-behaved unit-vector transform; alas, this is broken already due to a dimensionality mismatch and you’ll have to wait for Stan 2.9.1 or Stan 2.10 before the unit_vector type will actually work (it never worked in the past, either—our bad in both the past and now for not having enough tests around it).
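For readers who know R or NumPy, Stan’s new multi-indexing behaves much like “fancy” indexing there. Here is a rough NumPy analogy (not Stan code; note that Stan indexes from 1 and includes the upper bound of a slice, while NumPy indexes from 0 and excludes it):

```python
import numpy as np

# Rough NumPy analogy for Stan's new multi-indexing.
# Stan's foo[xs], with xs an array of integers, selects those elements;
# Stan's bar[1:3, 2] corresponds roughly to NumPy's bar[0:3, 1].

foo = np.array([10, 20, 30, 40])
xs = np.array([0, 2])       # Stan would write the index array as {1, 3}
print(foo[xs])              # [10 30]

bar = np.arange(12).reshape(3, 4)
print(bar[0:3, 1])          # rows 1..3, column 2, in Stan's 1-based terms
```

The semantics carry over, but the off-by-one conventions do not, so translate slice bounds carefully when moving between the two.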

We also fixed some minor bugs and cleaned up quite a bit of the code and build process.

We also would like to welcome two new developers: Krzysztof Sakrejda and Aki Vehtari. Aki’s been instrumental in many of our design discussions and Krzysztof’s first major code contribution was sparse matrix multiplication, which leads to our next topic.

We have also released the first version of the RStanARM package. The short story on RStanARM is that it’s an MCMC and VB-based replacement for lm() and glm() from core R, and to some extent, lmer() and glmer() from lme4. I believe there’s also a new version of ShinyStan (2.1) available.

We also wrote up a paper on Stan’s reverse-mode automatic differentiation, the cornerstone of the Stan Math Library: Carpenter, Hoffman, Brubaker, Lee, Li, and Betancourt (2015), The Stan Math Library: Reverse-Mode Automatic Differentiation in C++, arXiv:1509.07164.

Sincerely,

The Stan Development Team

(1) Apologies to those of you who tried to download and install RStan as it was trickling through the CRAN process. The problem is that the managers of CRAN felt a single RStan package was too large (4MB or so) and forced us to import existing packages and break RStan down (BH for the Boost headers, RcppEigen for the Eigen headers, StanHeaders for the Stan header files, and RStan itself for RStan itself). Alas, they provide no foolproof way to synchronize releases. We can insist on a particular version, but R always tries to download the latest or just fails. In the future, we’ll be more proactive and let people know ahead of time when things are in an unsettled state on CRAN and how to install through GitHub. Thanks for your patience.

## Paxil: What went wrong?

Dale Lehman points us to this news article by Paul Basken on a study by Joanna Le Noury, John Nardo, David Healy, Jon Jureidini, Melissa Raven, Catalin Tufanaru, and Elia Abi-Jaoude that investigated what went wrong in the notorious study by Martin Keller et al. of the GlaxoSmithKline drug Paxil.

Lots of ethical issues here, but what’s interesting to me here is something about the data analysis in the original study. Here’s Basken:

[The biggest problem was] routine professional disagreements over how exactly to classify patient behaviors.

Patients who showed some form of suicidal behavior were not included in Dr. Keller’s final count, the analysis concluded, because of failures to transcribe all adverse events from one database to another and the use of “an idiosyncratic coding system.”

Such breakdowns are widely seen in clinical trials. The effect, “wittingly or unwittingly,” is to hide the adverse effects of medications being tested, said an author of the analysis, Jon N. Jureidini, a professor of psychiatry and pediatrics at the University of Adelaide, in Australia.

It’s called the garden of forking paths. If you get to choose your data-exclusion rule, you get to win the “p less than .05 game,” you get to publish your articles in top journals, and if you’re really lucky you get $.

Also this, which will resonate with regular readers of our blog:

Another editorial, by Peter Doshi, an associate editor of the journal, repeated emphatic criticisms of Glaxo, Dr. Keller and his co-authors (and their universities for failing to publicly rebuke them), and the journal that published their study back in 2001, the Journal of the American Academy of Child and Adolescent Psychiatry. Mr. Doshi also described turmoil within the academy, which recently elected one of Dr. Keller’s co-authors, Karen D. Wagner, a professor of psychiatry and behavioral sciences at the University of Texas Medical Branch at Galveston, to serve as its president, beginning in 2017.

Remember, Ed Wegman received the Founders Award from the American Statistical Association.

And this:

“It is often said that science self-corrects,” Mr. Doshi wrote. “But for those who have been calling for a retraction of the Keller paper for many years, the system has failed.”

Eternal vigilance is the price of liberty.

Full disclosure: I do regular consulting for Novartis.

So horrible it’s funny; so funny it’s horrible

Basken got this amazing, amazing quote:

Dr. Keller contacted The Chronicle on Wednesday to insist that the 2001 results faithfully represented the best effort of the authors at the time, and that any misrepresentation of his article to help sell Paxil was the responsibility of Glaxo.

“Nothing was ever pinned on any of us,” despite various trials and investigations, he said. “And when I say that, I’m not telling you we’re like the great escape artists, that we’re Houdinis and we did something wrong and we got away with the crime of the century. Don’t you think if there was really something wrong, some university or agency or something would have pinned something on us?”

Wow. Call me gobsmacked. Does anyone really talk like that? He sounds like the bad guy in a Columbo episode, somewhere after he stops pretending that he doesn’t know anything about the crime, and just about the time he turns to the detective and says how, even if he had done it, there’s no possible proof.

P.S. Here’s Doshi’s editorial. Worth reading. As Doshi writes, “It’s often argued that fairness in journalism requires getting ‘both sides’ of the story, but in the story of Study 329, the “other side” does not seem interested in talking.” Reminds me of Weggy. Much worse, of course, but the same principle of stonewalling.

## Street-Fighting Probability and Street-Fighting Stats: 2 One-Week Modules

In a comment to my previous post on the Street-Fighting Math course, Alex wrote:

Have you thought about incorporating this material into more conventional classes? I can see this being very good material for a “principles” section of a linear modeling or other applied statistics course. It could give students a sense for how to justify their model choices by insight into a problem rather than, say, an algorithmic search over possible specifications.

Good point. I’d still like to do the full course—for one thing, that would involve going through Sanjoy Mahajan’s two books, which would have a lot of value in itself—but if we want to be able to incorporate some of these concepts in existing probability and statistics courses, it would make sense to construct a couple of one-week modules.

I’m thinking one on probability and one on statistics.

We can discuss content in a moment but first let me consider structure. I’m thinking that a module would consist of an article (equivalent to a textbook chapter, something for students to read ahead of time that would give them some background, include general principles and worked examples, and point them forward), homework assignments, and a collection of in-class activities.

It’s funny—I’ve been thinking a lot about how to create a full intro stat class with all these components, but I’ve been hung up on the all-important question of what methods to teach. Maybe it would make sense for me to get started by putting together stand-alone one-week modules.

OK, now on to the content. I think that “street-fighting math” would fit into just about any topic in probability and statistics. Some of the material in my book with Deb Nolan, Teaching Statistics: A Bag of Tricks, could be adapted here as well.

Probability: Law of large numbers, central limit theorem, random walk, birthday problem (some or all of these are included in Mahajan’s books, which is fine, I’m happy to repurpose his material), lots more, I think.

Statistics: Log and log-log, approximation of unknown quantities using what Mahajan calls the “divide and conquer” method, propagation of uncertainty, sampling, regression to the mean, predictive modeling, evaluating predictive error (for example see section 1.2 of this paper), the replication crisis, and, again, lots more.
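As a taste of how “divide and conquer” and propagation of uncertainty fit together in such a module: break an unknown quantity into factors, attach a rough log-scale uncertainty to each, and combine the uncertainties in quadrature. Here is a sketch using the classic piano-tuner estimate; the point estimates and log-scale uncertainties are all made-up illustrative guesses:

```python
import math

# Divide-and-conquer (Fermi) estimate with uncertainty propagation.
# Each factor is (point estimate, sd on the log10 scale); the numbers
# are invented for illustration, not data.
factors = [
    (1e6,  0.1),   # people in the city
    (1e-2, 0.3),   # pianos per person
    (1.0,  0.3),   # tunings per piano per year
    (1e-3, 0.3),   # tuner-years per tuning (i.e., 1000 tunings/tuner/year)
]

# Point estimate: multiply the factors.
point = math.prod(f for f, _ in factors)

# Propagation: independent log-scale uncertainties add in quadrature.
log10_sd = math.sqrt(sum(sd**2 for _, sd in factors))

print(point)               # ~10 tuners
print(round(log10_sd, 2))  # ~0.53, i.e., a factor of ~3.4 either way
```

The payoff for a statistics class is the second number: the answer isn’t “10 tuners,” it’s “10 tuners, give or take a factor of three,” and the uncertainty is dominated by the sloppiest factors.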

Social science: I guess we’d want a separate social science module too. Lots of ideas including coalitions, voting, opinion, negotiation, networks, really a zillion possible topics here. I’d start with things I’ve directly worked on but would be happy to include examples from others. But we can’t call it Street-Fighting Political Science. That would give the wrong impression!

## New course: Street-Fighting Math

I want to teach a course next year based on two books by Sanjoy Mahajan: Street-Fighting Mathematics and The Art of Insight in Science and Engineering. You can think of the two books as baby versions of Weisskopf’s 1969 classic, Modern Physics from an Elementary Point of View. Another book in the same vein is Knut Schmidt-Nielsen’s How Animals Work, from 1972. And of course the recent What If, by Randall Munroe.

I’ve never taught such a course before. My plan would be to go through Mahajan’s two books and intersperse some material of my own on Street-Fighting Stats. Some of Mahajan’s principles, such as dimensional analysis, are, I’ve come to believe, particularly relevant in Bayesian statistics. (For some background on ways in which purely mathematical ideas come up in Bayesian modeling, see this paper from 1996 and this one from 2004.) The class would conclude with student projects in which they would apply these ideas to problems of their choosing.

I foresee a few challenges in teaching this material, beyond the usual difficulties involved in starting any new course from scratch.

First, this stuff is not part of the standard curriculum in statistics, or mathematics, or political science or economics, or even physics or engineering. It’s in no sense a “required course.” Thus, as with my class on statistical communication and graphics, I’ll have to attract those unusual students across the university who want to learn something that’s useful but not part of the standard sequence of classes.

Second, the math and physics levels of these two books are pretty high. Mahajan teaches at MIT so that’s not a problem, but just about anywhere else, we’d be hard pressed to find many students who will be comfortable with this level of quantitative thinking about the world. I’m not sure how best to handle this. For example, from section 4.6.2 of The Art of Insight:

From the total energy, we can estimate the range of the 747—the distance that it can fly on a full tank of fuel. The energy is $\sqrt{C}mgd$, so the range $d$ is

$d\sim\frac{E_{\rm fuel}}{\sqrt{C}mg}$

where $E_{\rm fuel}$ is the energy in the full tank of fuel. To estimate d, we need to estimate $E_{\rm fuel}$, the modified drag coefficient C, and maybe also the plane’s mass m.
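To give a feel for the level of the material: plugging rough numbers into that formula gives a range of the right order of magnitude. The inputs below are my own illustrative guesses, not Mahajan’s values:

```python
# Back-of-envelope range of a 747 from d ~ E_fuel / (sqrt(C) * m * g).
# All inputs are rough guesses for illustration, not Mahajan's numbers.
g = 9.8               # m/s^2
m = 3.5e5             # plane mass, kg
sqrt_C = 0.05         # roughly the inverse of the lift-to-drag ratio
fuel_mass = 1.6e5     # kg of jet fuel in a full tank
energy_density = 4e7  # J/kg for hydrocarbon fuel
efficiency = 0.25     # fraction of fuel energy converted to useful work

E_fuel = fuel_mass * energy_density * efficiency
d = E_fuel / (sqrt_C * m * g)
print(d / 1e3)        # ~9000 km, vs. an actual range around 14,000 km
```

Getting within a factor of two of the real answer from half a dozen guessed inputs is exactly the street-fighting payoff, but it does demand comfort with forces, energies, and units.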

And he goes on from there. I love this stuff, but for most of the students I see in my statistics classes, this wouldn’t be beyond them, exactly, but . . . it would require a lot of effort on their part to work through, and I’m not sure they’d be willing to put in the work.

So maybe I’d have to attract students with stronger math and physics backgrounds. The trouble is, I don’t know that these sorts of kids are the ones who’d be inclined to take my course.

Another way to go would be to set aside all but the simplest of the physics examples and center the course on statistics and social science, where a little bit of algebra will go a long way. We could start the class with some basic material on scaling for regression models, curvature in nonlinear prediction, and partitioning of variance, and go from there. This approach would more directly serve the students who take my classes, but it has the drawback that I’m no longer following Mahajan’s books, and that would be a loss because courses go so smoothly when they follow a textbook. So I’m not quite sure what to do.

P.S. Here’s a bit from the Art of Insight book that’s a bit more relevant to statistics students:

A random and a regular walk are analogous in having an invariant. For a regular walk, it is $x/t$: the speed. For a random walk, it is $x^2/t$: the diffusion constant.

I really like how he put that. I hadn’t thought of a random walk and a regular, deterministic walk as being two versions of the same thing, but that makes sense. And of course a random walk with drift is a bridge between the two concepts, with random and deterministic walks being special cases of zero drift and zero variance.
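A quick simulation makes the analogy tangible: for a regular walk $x/t$ is constant at every step, while for a random walk it is $x^2/t$ that stays put (in expectation). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 400       # number of steps
n = 10_000    # number of simulated random walkers

# Regular walk: x = v * t, so x/t is exactly the speed, at every t.
v = 1.5
x_regular = v * t
print(x_regular / t)              # 1.5

# Random walk of unit +/-1 steps: E[x^2] = t, so x^2/t hovers near 1
# (the diffusion constant, in these units).
steps = rng.choice([-1, 1], size=(n, t))
x_random = steps.sum(axis=1)
print((x_random**2).mean() / t)   # close to 1
```

The drift-plus-noise bridge mentioned above shows up here too: add a constant to each step and both invariants appear at once, $x/t$ converging to the drift and the variance of $x$ still growing like $t$.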

P.P.S. More thoughts here.

## On deck this week

Mon: New course: Street-Fighting Math

Tues: Paxil: What went wrong?

Wed: Pro-PACE, anti-PACE

Thurs: My namesake doesn’t seem to understand the principles of decision analysis

Fri: Risk aversion is a two-way street

Sat: A reanalysis of data from a Psychological Science paper

Sun: The devil really is in the details; or, You’ll be able to guess who I think are the good guys and who I think are the bad guys in this story, but I think it’s still worth telling because it provides some insight into how (some) scientists view statistics

## Stan in the tabloids!

I’ve never published anything in PPNAS except for letters, but now you could say I have an indirect full publication there, as Peter Smits informs us of this new paper that uses Stan!

Peter D. Smits. Expected time-invariant effects of biological traits on mammal species duration. PPNAS 2015, published ahead of print October 5, 2015, doi:10.1073/pnas.1510482112.

It’s paleontology all the way down, baby.

## Why are trolls so bothersome?

We don’t get a lot of trolls on this blog. When people try, I typically respond with some mixture of directness and firmness, and the trolls either give up or perhaps they recognize that I am answering questions in sincerity, which does not serve their trollish purposes.

But I’m pretty sure that my feeling is shared by many others, which is that trolls are disturbing, not just for their direct effects (they waste my time, they threaten to degrade the blog’s community) but also in themselves. As Dr. Anil Potti might put it, they’re pooping in my sandbox.

I thought of this yesterday (well, Ok, actually a few months ago) when encountering a sort of troll in real life.

I was going down the street and a clueless pedestrian walked right in front of me in the middle of the block. He wasn’t looking at all. This happens often enough, it’s no big deal, I certainly wasn’t angry or upset or even annoyed at this point, I just slowed down to let him pass. As he did so, he turned around, looked directly at me, and said, with disgust, “You’re going the wrong way. You idiot.”

I did a double take because I was actually not going the wrong way! Amsterdam Avenue in that area is a one-way street and I was going north, just like all the other traffic. So I said to him something like: “Hey, no, I’m going the right way. Look at all the cars!” He just ignored me and trudged away down the sidewalk. For some reason this really bothered me so I went back and pestered him: “Hey, you said I was going the wrong way down the street but I was going the right way!” He turned around and kinda snarled at me. It was weird, I’d thought he’d made an honest mistake and was going to reply with a sheepish “Sorry, guy,” but not at all. I suddenly thought of all the armed and crazy people in this country and decided to withdraw. But the whole episode bothered me, more than it should’ve, somehow. I’m still not quite sure why. Maybe it was the way he was so angry at me in a direct way. It seemed so personal. But of course it wasn’t personal: this guy didn’t know me, indeed he didn’t really seem to be listening to anything I was saying. So I can’t really figure out why I was so bothered, especially given that aggressive pedestrians are everywhere in this city. Maybe it was his Wegman-like refusal to admit an error?