Simon Wessely, a psychiatrist who has done research on chronic fatigue syndrome, pointed me to an overview of the PACE trial written by its organizers, Peter White, Trudie Chalder, and Michael Sharpe, and also to this post of his from November, coming to the defense of the much-maligned PACE study:
Nothing as complex as a multi-centre trial (there were six centres involved), that recruited 641 people, delivered thousands of hours of treatment, and managed to track nearly all of them a year later, can ever be without some faults. But this trial was a landmark in behavioural complex intervention studies. . . .
I have previously made it clear that I [Wessely] think that PACE was a good trial; I once described it as a thing of beauty. In this blog I will describe why I still think that . . . Here is a recent response to criticisms, few of them new.
He provides some background on his general views:
CFS is a genuine illness, can cause severe disability and distress, affects not just patients but their families and indeed wider society, as it predominantly affects working age adults, and its cause, or more likely causes, remains fundamentally unknown. I do not think that chronic fatigue syndrome is “all in the mind”, whatever that means, and nor do the PACE investigators. I do think that, as with most illnesses, of whatever nature, psychological and social factors can be important in understanding illness and helping patients recover. Like many of the PACE team, I have run a clinic for patients with chronic fatigue syndrome for many years. Like the PACE investigators, I have also in the past done research into the biological nature of the illness; research that has indicated some of the biological abnormalities that have been found repeatedly in CFS.
And now on to the trial itself:
The PACE trial randomly allocated 641 patients with chronic fatigue syndrome, recruited in six clinics across the UK . . . What were its main findings? These were simple:
That both cognitive behaviour therapy (CBT) and graded exercise therapy (GET) improved fatigue and physical function more than either adaptive pacing therapy (APT) or specialist medical care (SMC) a year after entering the trial.
All four treatments were equally safe.
These findings are consistent with previous trials (and there are also more trials in the pipeline), but PACE, because of its sheer size, has attracted the most publicity, both good and bad.
What makes a good trial and how does PACE measure up?
Far and away the most important is allocation concealment; the ability of investigators/patients to influence the randomisation process . . . No one has criticised allocation concealment in PACE, it was exemplary. . . .
Next comes power. . . . Predetermined sample size calculations showed it [PACE] had plenty of power to detect clinically significant differences. It was one of the largest behavioural or psychological medicine trials ever undertaken. No one has criticised its size.
The next thing that can jeopardise the integrity of a trial is major losses to follow up . . . The key end point in PACE was pre-defined as the one year follow up. 95% of patients provided follow up data at this stage. I am unaware of any large scale behavioural medicine trial that has exceeded this. Again, no one has questioned this . . .
Next comes treatment infidelity, which is where participants do not get the treatment they were allocated to. . . . At the end of the trial, two independent scrutineers, masked to treatment allocation, both rated over 90% of the randomly chosen 62 sessions they listened to as the allocated therapy. Only one session was thought by both scrutineers not to be the right therapy. Again, no criticism has been made on the basis of therapy infidelity.
Analytical bias. The analytical protocol was predetermined (before the analysis started) and published. Two statisticians were involved in the analysis, blind to treatment group until the analysis was completed and signed off. So again, the chances of bias being introduced at this stage are also negligible.
Post-hoc sub-group analysis (fishing for significant differences) . . . There were no post-hoc sub-group analyses in the main outcome paper. A couple of sub-group post-hoc analyses were done in follow up publications, and clearly identified as such and appropriate cautions issued. None concerned the main outcomes. Again, no one has raised the issue of sub-group analyses.
Blinding. PACE was not blinded; the therapists and patients knew what treatments were being given, which would be hard to avoid. This has been raised by several critics, and of course is true. It could hardly be otherwise; therapists knew they were delivering APT, or CBT or whatever, and patients knew what they were receiving. This is not unique to PACE. . . . Did this matter? One way is to see whether there were differences in what patients thought of the treatment, to which they were allocated, before they started them. . . . And that did happen in the PACE trial itself. One therapy was rated beforehand by patients as being less likely to be helpful, but that treatment was CBT. In the event, CBT came out as one of the two treatments that did perform better. If it had been the other way round; that CBT had been favoured over the other three, then that would have been a problem. But as it is, CBT actually had a higher mountain to climb, not a smaller one, compared to the others.
So far then, I would suggest that PACE has passed the main challenges to the integrity of a trial with flying colours. . . . For example, the two most recent systematic reviews in this field rated PACE as good quality, with a low risk of bias.
On this last point he gives two references:
Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. (2015) Exercise therapy for chronic fatigue syndrome. Cochrane Database of Systematic Reviews 2015, Issue 2. Art. No.: CD003200. DOI: 10.1002/14651858.CD003200.pub3.
Smith MB et al. (2015) Treatment of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome: A Systematic Review for a National Institutes of Health Pathways to Prevention Workshop. Ann Intern Med. 162: 841-850. doi: http://dx.doi.org/10.7326/M15-0114.
I have a question about the “analytical bias” thing mentioned above. Recall what Julie Rehmeyer wrote:
The study participants hadn’t significantly improved on any of the team’s chosen objective measures: They weren’t able to get back to work or get off welfare, they didn’t get more fit, and their ability to walk barely improved. Though the PACE researchers had chosen these measures at the start of the experiment, once they’d analyzed their data, they dismissed them as irrelevant or not objective after all.
This doesn’t sound like a predetermined analytical protocol, so I’m not sure what’s up with that. (Let me emphasize at this point that I’ve published hundreds of statistical analyses, maybe thousands, and have preregistered almost none of them. So I’m not saying that a predetermined analytical protocol is necessary or a good idea, just saying that there seems to be a question of whether this particular analysis was really chosen ahead of time.)
Here’s what Wessely says in his post:
The researchers changed the way they scored and analysed the primary outcomes from the original protocol. The actual outcome measures did not change, but it is true that the investigators changed the way that fatigue was scored from one method to another (both methods have been described before and both are regularly used by other researchers) in order to provide a better measure of change (one method gives a maximum score of 11, the other 33). How the two primary outcomes (fatigue and physical function) were analysed was also changed from using a more complex measure, which combined two ways to measure improvement, to a simple comparison of mean (average) scores. This is a better way to see which treatment works best, and made the main findings easier to understand and interpret. This was all done before the investigators were aware of outcomes and before the statisticians started the analysis of outcomes.
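To make the scoring change concrete: the fatigue measure here is the 11-item Chalder Fatigue Questionnaire, where each item has a 4-point response. The two scoring methods Wessely mentions differ only in how those responses are totaled. Here is a minimal sketch (the example responses are hypothetical, not trial data):

```python
# The Chalder Fatigue Questionnaire: 11 items, each answered on a 4-point
# response scale. The two scorings Wessely describes:
#   bimodal: each 0-3 response collapsed to 0/1, so the total runs 0-11
#   Likert:  each response scored 0,1,2,3 directly, so the total runs 0-33

def score_bimodal(responses):
    """Collapse each 0-3 item response to 0/1; maximum total = 11."""
    return sum(1 if r >= 2 else 0 for r in responses)

def score_likert(responses):
    """Sum the raw 0-3 item responses; maximum total = 33."""
    return sum(responses)

# One hypothetical patient's 11 item responses:
responses = [0, 1, 2, 3, 2, 1, 0, 2, 3, 1, 2]
print(score_bimodal(responses), score_likert(responses))  # prints: 6 17
```

The Likert total preserves gradations that the bimodal total throws away (a patient moving from 3 to 2 on every item shows no change at all under bimodal scoring), which is presumably what Wessely means by the Likert version providing "a better measure of change."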
There seems to be some dispute here: is it just that there was an average improvement but, when you look at each part of the total score, the difference is not statistically significant? If so, I would think it makes sense to average.
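The statistical point here can be shown with a small simulation (all numbers below are invented, not PACE data): averaging many noisy items shrinks the standard error, so a small per-item shift that no single item can reliably detect can still show up clearly in the total or mean score. The simulation makes items independent, which overstates the benefit of averaging relative to real, correlated questionnaire items.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 400, 11      # patients per arm, items per scale -- made-up numbers
effect = 0.12       # small per-item treatment effect, in SD units (assumed)

# Simulate item-level scores; items are independent here, which makes
# averaging look better than it would with correlated items.
control = rng.normal(0.0, 1.0, size=(n, k))
treated = rng.normal(effect, 1.0, size=(n, k))

def z_stat(a, b):
    """Two-sample z statistic for a difference in means."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Item-by-item tests: each one is underpowered for a 0.12-SD effect.
item_z = [z_stat(treated[:, j], control[:, j]) for j in range(k)]
n_sig = sum(abs(z) > 1.96 for z in item_z)

# Test on each patient's mean score: averaging k items shrinks the noise.
mean_z = z_stat(treated.mean(axis=1), control.mean(axis=1))

print(f"items individually significant: {n_sig} of {k}")
print(f"z statistic for the averaged score: {mean_z:.2f}")
```

So "no component is individually significant" and "the average shows a clear difference" are entirely compatible, which is why averaging can be the more sensible summary.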
Wessely then puts it all into perspective:
Were the results maverick? Did PACE report the opposite to what has gone before or happened since? The answer is no. It is a part of a jigsaw (admittedly the biggest piece) but the picture it paints fits with the other pieces. I think that we can have confidence in the principal findings of PACE, which to repeat, are that two therapies (CBT and GET) are superior to adaptive pacing or standard medical treatment, when it comes to generating improvement in patients with chronic fatigue syndrome, and that all these approaches are safe. . . .
I [Wessely] think this trial is the best evidence we have so far that there are two treatments that can provide some hope for improvement for people with chronic fatigue syndrome. Furthermore the treatments are safe, so long as they are provided by trained appropriate therapists who are properly supervised and in a way that is appropriate to each patient. These treatments are not “exercise and positive thinking” as one newspaper unfortunately termed it; these are sophisticated, collaborative therapies between a patient and a professional.
But . . .
Having said that, there were a significant number of patients who did not improve with these treatments. Some patients deteriorated, but this seems to be the nature of the illness, rather than related to a particular treatment. . . .
PACE or no PACE, we need more research to provide treatments for those who do not respond to presently available treatments.
All of the above seemed reasonable to me, so then I followed the link to the open letter by Davis, Edwards, Jason, Levin, Racaniello, and Reingold criticizing PACE.
The key statistical concerns of Davis et al. were (1) a mismatch between the intake criteria and the outcome measures, so that it seems possible to have gotten worse during the period of the study but be recorded as improving, and (2) the changing of the outcome measures in the middle of the study.
Regarding point (1), Wessely points out that with a randomized trial any misclassifications should cancel across the study arms. Given that the original PACE article reported changes in continuous outcome measures, I think the definition of whether a patient is “in the normal range” should be a side issue. To put it another way: I think it makes sense to model the continuous data and then post-process the inferences to make statements about normal ranges, etc.
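Here is a minimal sketch of that two-step approach, with entirely made-up numbers (the 0-100 outcome scale, the effect sizes, and the cutoff of 60 are assumptions for illustration, not PACE values): analyze the continuous scores first, then derive any "normal range" statement from the same data rather than substituting the dichotomized version for the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical continuous outcomes (e.g., a 0-100 physical-function scale);
# all numbers here are invented for illustration, not PACE data.
baseline = rng.normal(38, 15, size=200).clip(0, 100)
followup = (baseline + rng.normal(8, 12, size=200)).clip(0, 100)
normal_cutoff = 60   # an assumed "normal range" threshold

# Step 1: analyze the continuous change scores directly.
change = followup - baseline
se = change.std(ddof=1) / np.sqrt(len(change))
print(f"mean change: {change.mean():.1f} (SE {se:.1f})")

# Step 2: post-process the same continuous data into a threshold statement.
# The dichotomized summary is derived from, not substituted for, step 1.
frac_normal = (followup >= normal_cutoff).mean()
print(f"fraction in 'normal range' at follow-up: {frac_normal:.0%}")
```

The point of this ordering is that the continuous analysis carries the statistical weight; moving the cutoff changes only the post-processed summary, not the underlying inference.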
Point (2) seems to relate to the dispute above, in which Wessely said the change was done “before the investigators were aware of outcomes,” but which Davis et al. write “is of particular concern in an unblinded trial like PACE, in which outcome trends are often apparent long before outcome data are seen.” I’m not quite sure what to say here; ultimately I’m more concerned about which summary makes sense rather than about which was chosen ahead of time. It could make sense to fit a multilevel model if there is a concern about averaging. But, realistically, I’m guessing that the study is large enough to detect averages but not large enough to get much detail beyond that—at least not without using some qualitative information from the clinicians and patients.
Davis et al. also write:
The PACE investigators based their claims of treatment success solely on their subjective outcomes. In the Lancet paper, the results of a six-minute walking test—described in the protocol as “an objective measure of physical capacity”—did not support such claims, notwithstanding the minimal gains in one arm. In subsequent comments in another journal, the investigators dismissed the walking-test results as irrelevant, non-objective and fraught with limitations. All the other objective measures in PACE, presented in other journals, also failed. The results of one objective measure, the fitness step-test, were provided in a 2015 paper in The Lancet Psychiatry, but only in the form of a tiny graph. A request for the step-test data used to create the graph was rejected as “vexatious.”
I’m not quite sure what to think about this: perhaps there was a small but not statistically significant difference for each separate outcome, but a statistically significant difference for the average? If so, then I would think it would be ok to report success based on the average.
I also asked Wessely about the above quote, and he wrote: “There was a significant improvement in the walking test after graded exercise therapy, which was not matched by any other treatment arm, and this was reported in the primary paper (White et al, 2011) and certainly not regarded as irrelevant.” So I guess the next step is to find the subsequent comments in the other journal where the investigators dismissed the walking-test result as irrelevant.
And I disagree, of course, with the decision of the investigators not to share the step-test data. Whassup with that? This is one reason I prefer to have data posted online rather than sent by request: then anyone can get the data and there doesn’t have to be anything personal involved.
Davis et al. conclude:
We therefore urge The Lancet to seek an independent re-analysis of the individual-level PACE trial data, with appropriate sensitivity analyses, from highly respected reviewers with extensive expertise in statistics and study design. The reviewers should be from outside the U.K. and outside the domains of psychiatry and psychological medicine. They should also be completely independent of, and have no conflicts of interests involving, the PACE investigators and the funders of the trial.
This seems reasonable to me, and not in contradiction with the points that Wessely made. Indeed, when I asked Wessely what he thought of this, he replied that an independent review group in a different country had already re-analyzed some of the data and would be publishing something soon. So maybe we’re closer to convergence on this particular study than it seemed.
From the results of the study to the summary and the general recommendations
One thing I liked about Wessely’s post was his moderation in summarizing the study’s results and its implications. He reports that in the study his preferred treatment outperformed the alternative, but he recognizes that, for many (most?) people, none of these treatments do much. Wessely points out that this is not just his view; he quotes this from the original article by White et al.: “Our finding that studied treatments were only moderately effective also suggests research into more effective treatments is needed. The effectiveness of behavioural treatments does not imply that the condition is psychological in nature.”
Some questions that come to me are: Can we say that different treatments work for different people? Would we have some way of telling which treatment to try on which people? Are there some treatments that should be ruled out entirely? One of the concerns of the PACE critics is that the study is being used to deny social welfare payments to people with chronic physical illness.
And one of the criticisms of PACE coming from Davis et al. has to do with reporting of results:
In an accompanying Lancet commentary, colleagues of the PACE team defined participants who met these expansive “normal ranges” as having achieved a “strict criterion for recovery.” The PACE authors reviewed this commentary before publication.
This commentary seems to have been a mistake, in that, in later correspondence, the PACE authors wrote, “It is important to clarify that our paper did not report on recovery; we will address this in a future publication.” That was a few years ago; the future has happened; and I guess recovery was not so easy to assess. This happens a lot in research: early success, big plans, but then slow progress. Certainly not unique to this project.
From my perspective, when I wrote about the PACE study hurting the reputation of the Lancet, I was thinking not so much of the particular flaws of the original report, or even of that incomprehensible mediation analysis that was later published (after all, you can do an incomprehensible mediation analysis of anything; just because someone does a bad analysis, it doesn’t mean there’s no pony there), but rather the Lancet editor’s aggressive defense and the difficulty that outsiders seemed to have in getting the data. According to Wessely, though, the study organizers will be sharing the data, they just need to deal with confidentiality issues. So maybe part of it is the journal editor’s communication problems, a bit of unnecessary promotion and aggression on the part of Richard Horton.
To get back to the treatments: Again, it’s no surprise that CBT and exercise therapy can help people. The success of these therapies for some percentage of people does not at all contradict the idea that many others need a lot more, nor does it provide much support for the idea that “fear avoidance beliefs” are holding back people with chronic fatigue syndrome. So on the substance—setting aside the PACE trial itself—it seems to me that Wessely and the critics of that study are not so far apart.