Look. At. The. Data. (Hollywood action movies example)

Kaiser Fung shares an amusing story of how you can be misled by analyzing data that you haven’t fully digested. Kaiser writes, “It pains me to think how many people have analyzed this dataset, and used these keywords to build models.”

Stan Weekly Roundup, 3 August 2017

You’d almost think we were Europeans based on how much we’ve slowed down over the summer.

  • Imad Ali, Jonah Gabry, and Ben Goodrich finished the online pkgdown-style documentation for all the Stan Development Team supported R packages. They can be accessed via, e.g.,

    The Stan manual will also get converted as soon as we can get to it.

  • Ben Goodrich added nonlinear inverse link functions with Gaussian outcomes, following the “self-starting” functions used with lme4’s nlmer.

  • Rob Trangucci has added further GP doc to the manual and is working on adding multinomial logit to RStanArm.

  • Jonah Gabry released ShinyStan 2.4 and a new Bayesplot is on the way—they’re more flexible about ggplot2 theming, and also RStanArm releases to coordinate with the next Gelman and Hill edition bringing it up to date with modern R.

  • Breck Baldwin has been trying to wrangle governance issues.

  • Imad Ali is working on some basketball models and waiting for NBA data. Also supervising our high school student and working on the nonlinear models for RStanArm.

  • Aki Vehtari is continuing work on Pareto smoothed importance sampling with Jonah. StanCon Helsinki planning is underway; still waiting on a date.

  • Ben Bales rewrote append arrays and the initial RNG vectorization.

  • Bill Gillespie has been learning C++ and software development with Charles’s help. His first pending pull request adds a linear interpolation function like the one in BUGS.

  • Charles Margossian finished the Torsten 0.8.3 release (that’s Metrum’s pharmacometrics package wrapping RStan).

  • Charles also finished the pull request for the algebraic solver and it’s passed code review, so it should land in the math lib soon.

  • Charles is also writing some docs on how to start programming with Stan, based on what he’s been learning while teaching Bill to write C++.

  • Charles and Bill are also adapting the mixed solver for a PK/PD journal.

  • Michael Betancourt wrote a case study about QR decomposition that’s up on the web site. He’s since been at JSM talking about Stan, where there were lots of posters citing Stan. He gave away a lot of stickers through the Metrum booth.

  • Michael, Aki, and Rob Trangucci have been working on GPs and Michael has a case study in the works.

  • Michael also made a GP movie tweet that’s gotten a ton of positive feedback on Twitter (along with the spatial, methodology, and QR decomposition case studies).

  • Andrew Gelman wrote up a draft of an R-hat and ESS calculation paper with me and Michael.

  • Mitzi Morris launched the spatial model case studies with the fit for intrinsic conditional autoregression (ICAR) model, with some neat parameterizations by Dan Simpson. She’s also got the Cook, Gelman, and Rubin in the wings.

  • Mitzi is also adding a data specification for variables that will let us write functions that only apply to data.

  • Sean Talts and Daniel Lee have been hammering away at the C++ builds through all our repos to allow better conditional compilation of optional external libs like CVODES (for ODE solving), MPI (process parallelism), and OpenCL (GPUs).

  • Sean and Michael have also been finding anomalies in their Cook-Gelman-Rubin stats (as has Mitzi) when the number of replications is cranked up to the thousands.

PPNAS again: If it hadn’t been for the jet lag, would Junior have banged out 756 HRs in his career?

In an email with subject line, “Difference between “significant” and “not significant”: baseball edition?”, Greg Distelhorst writes:

I think it’s important to improve statistical practice in the social sciences. I also care about baseball.

In this PNAS article, Table 1 and the discussion of differences between east vs. west and home vs. away effects do not report tests for differences in the estimated effects. Back of the envelope, it looks to me like we can’t reject the null for many of these comparisons. It would have been easy and informative to add a column to each table testing for differences between the estimated coefs.

Overall, I think the evidence in Table 1 is more consistent with a small negative effect of jet lag (no matter whether home/away or east/west).

Or perhaps I am missing something. In any case, I know you collect examples of questionable statistical practice so I wanted to share this.
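
The back-of-the-envelope check described in this email can be sketched in a few lines of Python. The estimates and standard errors below are made up for illustration (they are not the values in the paper's Table 1), the helper names are hypothetical, and the sketch assumes the two estimates are independent, which may not hold for coefficients taken from the same regression.

```python
import math
from statistics import NormalDist


def z_test(estimate, std_error):
    """Return (z, two-sided p-value) for an estimate tested against zero."""
    z = estimate / std_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


def difference_test(est_a, se_a, est_b, se_b):
    """Test est_a - est_b against zero, assuming independent estimates."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return z_test(diff, se_diff)


# Hypothetical numbers, NOT taken from the PNAS paper.
z_a, p_a = z_test(-0.25, 0.10)   # clears the conventional 0.05 threshold
z_b, p_b = z_test(-0.10, 0.10)   # does not clear it
z_d, p_d = difference_test(-0.25, 0.10, -0.10, 0.10)

print(f"effect A: z={z_a:.2f}, p={p_a:.3f}")
print(f"effect B: z={z_b:.2f}, p={p_b:.3f}")
print(f"A minus B: z={z_d:.2f}, p={p_d:.3f}")
```

With these invented inputs, one effect is "significant" and the other is not, yet the test of their difference is nowhere near the threshold. That is the extra column the email asks for.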

My reply: I’m starting to get tired of incompetent statistical analyses appearing in PPNAS. The whole thing is an embarrassment. The National Academy of Sciences is supposed to stand for something, no?

Just to be clear: this is far from the worst PPNAS paper out there, and much of what the authors claim could well be true; jetlag could be a big deal. But many of the specifics seem like noise mining. And we shouldn’t be doing that. It would be better to wrap your comparisons into a multilevel model and partially pool, rather than just grab things that are at “p less than .05” in the raw data and then treat so-called non-significant comparisons as if they’re zero.
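
To give a rough sense of what partial pooling does, here is a crude plug-in sketch in Python. It is not a full multilevel model (that would be fit in something like Stan); it assumes each raw estimate is normal around its own true effect, the true effects are normal around a common mean, and the common mean and between-group variance are replaced by simple moment estimates. The function name and the input numbers are hypothetical.

```python
def partial_pool(estimates, std_errors):
    """Shrink noisy group estimates toward their common mean.

    Assumes y_j ~ Normal(theta_j, se_j^2) and theta_j ~ Normal(mu, tau^2),
    with mu and tau^2 plugged in from rough moment estimates.
    """
    k = len(estimates)
    weights = [1 / se**2 for se in std_errors]
    # Precision-weighted estimate of the common mean.
    mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Moment estimate of between-group variance, floored at zero.
    raw_var = sum((y - mu) ** 2 for y in estimates) / (k - 1)
    mean_se2 = sum(se**2 for se in std_errors) / k
    tau2 = max(raw_var - mean_se2, 0.0)
    pooled = []
    for y, se in zip(estimates, std_errors):
        # shrink = 0 means complete pooling, shrink = 1 means no pooling.
        shrink = tau2 / (tau2 + se**2)
        pooled.append(mu + shrink * (y - mu))
    return mu, tau2, pooled


# Hypothetical comparisons (say home/away crossed with east/west), made-up values.
raw = [-0.25, -0.10, -0.05, -0.20]
ses = [0.10, 0.10, 0.10, 0.10]
mu, tau2, pooled = partial_pool(raw, ses)
print(mu, tau2, pooled)
```

In this made-up case the spread of the raw estimates is no larger than their standard errors would produce by chance, so the between-group variance estimate is floored at zero and every comparison is pulled all the way to the common mean. With more spread in the inputs, the estimates would be shrunk only part of the way.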

Again, no big deal, nothing wrong with this sort of thing published—along with all the raw data—and then others can investigate further. But maybe not in a journal that says it “publishes only the highest quality scientific research.”

Look, I do lots of quick little analyses. Nothing wrong with that. Not everything we do is “the highest quality scientific research.” To me, the problem is not with the researchers doing a quick-and-dirty, skim-out-the-statistically-significant-comparisons job, but with PPNAS for declaring it “the highest quality scientific research.” What’s the point of that? Can’t they just say something like, “PPNAS publishes a mix of the highest quality of scientific research, some bad research that slips through the review process or is promoted by an editor who is intellectually invested in the subject, and some fun stuff which might be kinda ok and can get us some publicity”? What would be wrong with that? Or, if they don’t want to be so brutally honest, they could say nothing at all.

I teach at Columbia University, one of the country’s top educational institutions. We have great students who do great work. But do we say that we admit “only the highest quality students”? Do we say we hire “only the highest quality faculty”? Do we say we approve “only the highest quality doctoral theses”? Of course not. Some duds get through. Better, to my mind, to accept this, and to work on improving the process to reduce the dud rate, than to take the brittle position that everything’s just perfect.

On the specific example above, Distelhorst followed up:

There may also be some forking paths issues around the choice of what flight lengths are coded as potentially jet-lag-inducing.

Ironically I am a Seattle Mariners fan and they often have one of the worst flight schedules in MLB. I would like everyone to believe that their 15 year playoff drought could be blamed on jet lag…

If it hadn’t been for the jet lag, Junior certainly would’ve banged out 756 HRs in his career!

It’s hard to know what to say about an observational comparison that doesn’t control for key differences between treatment and control groups, chili pepper edition

Jonathan Falk points to this article and writes:

Thoughts? I would have liked to have seen the data matched on age, rather than simply using age in a Cox regression, since I suspect that’s what’s really going on here. The non-chili eaters were much older, and I suspect that the failure to interact age, or at least specify the age effect more finely, has a gigantic impact here, especially since the raw inclusion of age raised the hazard ratio dramatically. Having controlled for Blood, Sugar, and Sex, the residual must be Magik.

My reply: Yes, also they need to interact age x sex, and smoking is another elephant in the room. A good classroom example, I guess.

It’s no scandal that a weak analysis is published in Plos-One. Indeed, I think it’s just fine for such speculative research to be published, along with the raw data and code used in the analysis, so that others can follow it up.

Seemingly intuitive and low math intros to Bayes never seem to deliver as hoped: Why?

This post was prompted by recent nicely done videos by Rasmus Baath that provide an intuitive and low math introduction to Bayesian material. Now, I do not know that these have delivered less than he hoped for. Nor have I asked him. However, given similar material I and others have tried out in the past that did not deliver what was hoped for, I am anticipating that and speculating why here. I have real doubts about such material actually enabling others to meaningfully interpret Bayesian analyses, let alone implement them themselves. For instance, in a conversation last year with David Spiegelhalter, his take was that some material I had could easily be followed by many, but the concepts that material was trying to get across were very subtle and few would have the background to connect to them. On the other hand, maybe I am hoping to be convinced otherwise here.

For those too impatient to watch the three roughly 30 minute videos, I will quickly describe my material David commented on (which I think is fairly similar to Rasmus’). I am more familiar with it and doing that avoids any risk of misinterpreting anything Rasmus did. It is based on a speculative description of what Francis Galton did in 1885 which was discussed more thoroughly by Stephen Stigler. It also involves a continuous (like) example which I highly prefer starting with. I think continuity is of overriding importance so one should start with it unless they absolutely cannot.

Galton constructed a two stage quincunx (diagram for a 1990 patent application!) with the first stage representing his understanding of the variation of the targeted plant height in a randomly chosen plant seed of a given variety. The pellet haphazardly falls through the pins and lands at the bottom of the first level as the target height of the seed. His understanding I think is a better choice of wording than belief, information or even probability (which it can be taken to be given the haphazardness). Also it is much much better than prior! Continuing on from the first level, the pellet falls down a second set of pins landing at the very bottom as the height the plant actually grew to. This second level represents Galton’s understanding of how a seed of a given targeted height varies in the height it actually grows. Admittedly this physical representation is actually discrete but the level of discreteness can be lessened without preset limits (other than practicality).

Possibly, he would have assessed the ability of his machine to adequately represent his understanding by running it over and over again and comparing the set of plant heights represented on the bottom level with knowledge of the past, if not current, heights this variety of seed usually did grow to. He should have. Another way to put Galton’s work would be that of building (and checking) a two stage simulation to adequately emulate one’s understanding of targeted plant heights and actual plant heights that have been observed to grow. Having assessed his machine as adequate (by surviving a fake data simulation check) he might have then thought about how to learn about a particular given seed’s targeted height (possibly already growing or grown) given he would only get to see the actual height grown. The targeted height remains unknown and the actual height becomes known. It is clear that Galton decided that in trying to assess the targeted height from an actual height one should not look downward from a given targeted height but rather upward from the actual height grown.

Now by doing multiple drops of pellets, one at a time, and recording where the pellet was at the bottom of the first level if and only if it lands at a particular location on the bottom level matching an actual grown height, he would be doing a two stage simulation with rejection. This clearly provides an exact (smallish) sample from the posterior given the exact joint probability model (physically) specified/simulated by the quincunx. It is exactly the same as the conceptual way to understand Bayes suggested by Don Rubin in 1982. As such, it would have been an early fully Bayesian analysis, even if not actually perceived as such at the time (though Stigler argues that it likely was).
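
The two stage simulation with rejection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each pin deflects the pellet one step left or right with equal probability, both levels have ten rows of pins, and the observed height of 6 is arbitrary. The function names and parameters are hypothetical, not anything Galton or Stigler specified.

```python
import random


def drop(n_rows, rng):
    """Sideways displacement after n_rows of pins; each pin moves the pellet -1 or +1."""
    return sum(rng.choice((-1, 1)) for _ in range(n_rows))


def quincunx_posterior(observed, n_draws, rows_top=10, rows_bottom=10, seed=1):
    """Two stage simulation with rejection.

    Stage 1 (top level): the pellet lands at a targeted height.
    Stage 2 (bottom level): it falls further to the actual height grown.
    The stage 1 position is recorded only when the final position
    matches the observed height; every other drop is discarded.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        target = drop(rows_top, rng)
        actual = target + drop(rows_bottom, rng)
        if actual == observed:
            kept.append(target)
    return kept


sample = quincunx_posterior(observed=6, n_draws=20000)
posterior_mean = sum(sample) / len(sample)
print(len(sample), posterior_mean)
```

Most drops are thrown away, which is why this is awkward to carry out, but the kept targets are draws from the posterior for the targeted height given the observed height. Because the two levels here are identical, the kept targets should centre near half the observed height, which is the looking-upward pattern described above.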

This awkward-to-carry-out, but arguably less challenging, way to grasp Bayesian analysis can be worked up to address numerous concepts in statistics (both implementing calculations and motivating formulas) that are again less challenging to grasp (or so it’s hoped). This is what I perceive Rasmus, Richard McElreath and others are essentially doing. Authors do differ in their choices of which concepts to focus on. My initial recognition of these possibilities led to this overly exuberant but very poorly thought through post back in 2010 (some links are broken).

To more fully discuss this below (which may be of interest only to those very interested), I will extend the quincunx to multiple samples (n > 1) and multiple parameters, clarify the connection to approximate Bayesian computation (ABC) and point out something much more sensible when there is a formula for the second level of the quincunx (the evil likelihood function). The likelihood might provide a smoother transition to (MCMC) sampling from the typical set instead of the entirety of parameter space. I will also say some nice things about Rasmus’s videos and of course make a few criticisms.

Continue reading ‘Seemingly intuitive and low math intros to Bayes never seem to deliver as hoped: Why?’ »

Giving feedback indirectly by invoking a hypothetical reviewer

Ethan Bolker points us to this discussion on “How can I avoid being “the negative one” when giving feedback on statistics?”, which begins:

Results get sent around a group of biological collaborators for feedback. Comments come back from the senior members of the group about the implications of the results, possible extensions, etc. I look at the results and I tend not to be as good at the “big picture” stuff (I’m a relatively junior member of the team), but I’m reasonably good with statistics (and that’s my main role), so I look at the details.

Sometimes I think to myself “I don’t think those conclusions are remotely justified by the data”. How can I give honest feedback in a way that doesn’t come across as overly negative? I can suggest alternative, more valid approaches, but if those give the answer “it’s just noise”, I’m pouring cold water over the whole thing.

There are lots of thoughtful responses at the link. I agree with Bolker that the question and comments are interesting.

What’s my advice in such situations? If you’re involved in a project where you think something’s being done wrong, but you don’t have the authority to just say it should be done differently, I often recommend bringing in a hypothetical third party and saying something like this: “I think I see what you’re doing here, but what if a reviewer said XYZ? How would you respond to that?”

“Explaining recent mortality trends among younger and middle-aged White Americans”

Kevin Lewis sends along this paper by Ryan Masters, Andrea Tilstra, and Daniel Simon, who write:

Recent research has suggested that increases in mortality among middle-aged US Whites are being driven by suicides and poisonings from alcohol and drug use. Increases in these ‘despair’ deaths have been argued to reflect a cohort-based epidemic of pain and distress among middle-aged US Whites.

We examine trends in all-cause and cause-specific mortality rates among younger and middle-aged US White men and women between 1980 and 2014, using official US mortality data. . . .

Trends in middle-aged US White mortality vary considerably by cause and gender. The relative contribution to overall mortality rates from drug-related deaths has increased dramatically since the early 1990s, but the contributions from suicide and alcohol-related deaths have remained stable. Rising mortality from drug-related deaths exhibit strong period-based patterns. Declines in deaths from metabolic diseases have slowed for middle-aged White men and have stalled for middle-aged White women, and exhibit strong cohort-based patterns.

We find little empirical support for the pain- and distress-based explanations for rising mortality in the US White population. Instead, recent mortality increases among younger and middle-aged US White men and women have likely been shaped by the US opiate epidemic and an expanding obesogenic environment.

I don’t have anything to add on this topic right now—my own work with Auerbach (also see here) broke down mortality rates by age, sex, ethnicity, and state but not by cause of death.

I’m just sharing this new article in support of the point I’ve made several times, that when we’re looking at trends in disease and mortality, it makes sense to listen more to demographers, who should be the real experts on these topics.

The fractal zealots

Paul Alper points to this news report by Ian Sample, which goes:

Psychologists believe they can identify progressive changes in work of artists who went on to develop Alzheimer’s disease

The first subtle hints of cognitive decline may reveal themselves in an artist’s brush strokes many years before dementia is diagnosed, researchers believe. . . .

Forsythe found that paintings varied in their fractal dimensions over an artist’s career, but in the case of de Kooning and Brooks, the measure changed dramatically and fell sharply as the artists aged. “The information seems to be like a footprint that artists leave in their art,” Forsythe said. “They paint within a normal range, but when something is happening the brain, it starts to change quite radically.” . . .

The research provoked mixed reactions from other scientists. Richard Taylor, a physicist at the University of Oregon, described the work as a “magnificent demonstration of art and science coming together”. But Kate Brown, a physicist at Hamilton College in New York, was less enthusiastic and dismissed the research as “complete and utter nonsense”. . . .

“The whole premise of ‘fractal expressionism’ is completely false,” Brown said. “Since our work came out, claims of fractals in Pollock’s work have largely disappeared from peer-reviewed physics journals. But it seems that the fractal zealots have managed to exert some influence in psychology.”

Letter to the Editor of Perspectives on Psychological Science

[relevant cat picture]

tl;dr: Himmicane in a teacup.

Back in the day, the New Yorker magazine did not have a Letters to the Editors column, and so the great Spy magazine (the Gawker of its time) ran its own feature, Letters to the Editor of the New Yorker, where they posted the letters you otherwise would never see.

Here on this blog we can start a new feature, Letters to the Editor of Perspectives on Psychological Science, which will feature corrections that this journal refuses to print.

Here’s our first entry:

“In the article, ‘Going in Many Right Directions, All at Once,’ published in this journal, the author wrote, ‘some critics go beyond scientific argument and counterargument to imply that the entire field is inept and misguided (e.g., Gelman, 2014; Shimmack [sic], 2014).’ However, this article provided no evidence that either Gelman or Schimmack ever wrote anything implying that the entire field is inept and misguided, nor has there anywhere else been provided any evidence that Gelman or Schimmack have implied such a claim. The journal regrets publishing this statement which was given without any evidence or support.”

The background is that a few days ago I learned about this article through a blog comment. I followed the above references and found zero examples of either Schimmack (that’s the correct spelling of his name) or me implying that the entire field of psychology, or social psychology, is inept and misguided. Which makes sense because I’ve never believed, said, or implied such a thing.

What Schimmack and I have in common is that we both state that some published work in psychology is of low quality. The author of the above-mentioned article seems to be equating criticisms of some papers with statements about “the entire field.” That is a mistake on the author’s part: it is possible to criticize some published work while holding high respect for other work in that field.

I really wish I’d never been alerted to that article, but once I had, I couldn’t un-see it. I don’t usually write letters to journal editors, but when someone directly mischaracterizes my writings and those of others, this bothers me, and in this case I sent off an email which, after some exchanges, resulted in my proposing the above correction note. The editor of the journal and the chair of the publications board of the Association for Psychological Science responded to me refusing to run this or any other correction. They characterized the offending statement not as a “factual error” but rather as a “matter of opinion,” and they argued that a correction would be impossible because you can’t correct an opinion.

I wonder why a scientific journal would want to devote its space to unrefutable statements of opinion, but that’s another story.

It’s a good thing I can post the correction here. Not as good as putting it online with the offending journal article, but we’ll do what we can do. I have other problems with the article in question, but I won’t get into them right now, as my focus here is on this specific correction that the journal refused to run.

I’ll let you know if I hear anything else from the Association for Psychological Science. In my last communication from them, I was told I had the option “to write an editorial of your own [for the journal] and have that peer reviewed.” That’s fine, I guess (although I have zero trust in the journal’s peer review, given that they let through the whopper in that article), but in the meantime I think the journal editors should just correct the damn error themselves, as they’re the people who published that article. As it is, they’re doing that thing-that-is-morally-equivalent-to-lying: they’re intentionally perpetuating a misrepresentation of what Schimmack and I have written.

Giving me (or, I suppose, Schimmack, once they can figure out how to spell his name) the opportunity to submit an article to Perspectives on Psychological Science is fine—actually, I’ve published three articles in that journal already—but there’s no reason that should stop them from fixing their errors and running a correction immediately. To put it another way, if I don’t publish a new article with that correction in their journal, then the errors are ok? They’re the ones who published the false claim; really it should bother them more than me.

The whole thing is just bizarre, the attitude that they will only correct an error when they’re absolutely forced to do so.

It’s as if, for the Association for Psychological Science, the public perception of truth is more important than truth itself.

But here’s the news from the chair of the APS publications board:

[The journal editor] will not be publishing an Editor’s note on this topic and further requests for such an Editor’s note will not be productive.

I’ve omitted the names of the journal editor and board member here, not because they are secret—all this information is easy to find on the web—but because I assume they’re acting in an institutional capacity, so it doesn’t really matter who they are. I’ll take them at their word that they do not have the authority to correct errors that have appeared in their own publication. Too bad, that. If I were ever in such a position, I’d resign my position as journal editor, or chair of the publications board, immediately. Or I’d run the correction and let the publication board fire me.

Reproducing biological research is harder than you’d think

Mark Tuttle points us to this news article by Monya Baker and Elie Dolgin, which goes as follows:

Cancer reproducibility project releases first results

An open-science effort to replicate dozens of cancer-biology studies is off to a confusing start.

Purists will tell you that science is about what scientists don’t know, which is true but not much of a basis on which to develop new cancer drugs. Hence the importance of knowledge: how crucial this mutation or that cell-surface receptor really is to cancer growth. These are the findings that launch companies and clinical trials — provided, of course, that they have been published in research papers in peer-reviewed journals.

As we report in a News story this week, a systematic effort to check some of these findings by repeating an initial five published cancer studies has reported that none could be completely reproduced. The significance of this divergence — how the specific experiments were selected and what the results mean for the broader agenda of reproducibility in research — is already hotly contested.

Perhaps the most influential aspect of the exercise, called the Reproducibility Project: Cancer Biology, has nothing to do with those arguments. It lies beneath the surface, in the peer reviews of the project teams’ replication plans, which were published before the studies began. . . .

Again and again, the peer reviewers and the replicators clash. The reviewers are eager to produce the best experiment to test a publication’s conclusions; they want to correct deficiencies in the design of the original high-impact studies. The replicators do, on several occasions, agree to add an extra measurement, particularly of positive and negative controls that had originally been neglected. Often, however, they resist calls for more definitive studies. . . .

This is a frustrating, if understandable, response. It is easier to compare the results of highly similar experiments than to assess a conclusion. Thus, the replication efforts are not especially interested in, say, the big question of whether public gene-expression data can point to surprising uses for new drugs. They focus instead on narrower points — such as whether a specific finding that an ulcer drug stalls the growth of lung cancer in mice holds up (it did, more or less; see I. Kandela et al. eLife 6, e17044; 2017). Even so, the results are not definitive. . . .

One aspect that merits increased focus is how markedly the results of control experiments varied between the original work and replications. In one case, mice in the control group of an original study survived for nine weeks after being engrafted with tumours, whereas those in the replication survived for just one. . . .

More than 50 years ago, the philosopher Thomas Kuhn defined ‘normal science’ as the kind of work that faithfully supports or chisels away at current hypotheses. It is easy to dismiss this as workmanlike and uninteresting. But only by doing such normal science — and doing it well — can we recognize when revolutions are necessary.

That’s good! And the point about normal science and revolutions is important; Shalizi and I discuss it in our paper.

It’s not “lying” exactly . . . What do you call it when someone deliberately refuses to correct an untruth?

New York Times columnist Bret Stephens tells the story. First the background:

On Thursday I interviewed Central Intelligence Agency Director Mike Pompeo on a public stage . . . There was one sour moment. Midway through the interview, Pompeo abruptly slammed The New York Times for publishing the name last month of a senior covert C.I.A. officer, calling the disclosure “unconscionable.” The line was met with audience applause. I said, “You’re talking about Phil Agee,” and then repeated the name. . . . My startled rejoinder was not a reference to the covert C.I.A. officer unmasked by The Times, but rather a fumbled attempt to refer to the law governing such disclosures. Philip Agee, as Pompeo and everyone in the audience knew, was the infamous C.I.A. officer who went rogue in the 1970s, wrote a tell-all memoir, and publicly identified the names of scores of C.I.A. officers, front companies and foreign agents. His disclosures led Congress in 1982 to pass the Intelligence Identities Protection Act, a.k.a. the “Anti-Agee Act,” which made it a federal crime to reveal the names of covert agents. Agee died in Havana in 2008.

Then came the false statement directed at Stephens:

Dan Scavino, the White House director of social media, tweeted that I had [disclosed the name of a covert officer].

“CIA Dir Pompeo calls out @NYTimes for publishing name of an UNDERCOVER CIA agent,” he wrote on his official Twitter account, adding, “Just as disgraceful? @BretStephensNYT REPEATS name 2x’s!” He also posted a brief clip of the exchange — but muted my voice when I mentioned Agee.

Stephens elaborates:

This was nasty, manipulated and false, but it wasn’t necessarily a lie.

Why not necessarily a lie? Stephens explains:

If Scavino had never heard of Agee, didn’t know the name of the C.I.A. officer whose name was published by The Times and didn’t bother to fact check before tweeting, he might have inferred from my reply that I had indeed done what he alleged. That’s a plausible surmise about a White House where the line between idiocy and malice isn’t always clear.

“The line between idiocy and malice isn’t always clear” . . . that reminds me a bit of Clarke’s law, relating to the fine line between scientific incompetence and scientific fraud. At what point does the refusal to admit incompetence become a kind of fraud? I’m not sure.

But I digress. Let’s return to Stephens:

To give Scavino the benefit of the doubt, I asked the C.I.A. spokesman to set him straight. I also rebutted his claim on Twitter, emailed and left messages with him on his private number, and wrote the new communications director, Anthony Scaramucci, at his personal email address.

And what was the response from the government of the most powerful nation in the world?

No acknowledgment. No response. The tweet has not been deleted. The C.I.A. has not publicly corrected the record. The White House is knowingly allowing Scavino’s falsehood to stand. That’s called lying — which, as Pompeo might say, is “unconscionable.”

So here’s my issue. I don’t think Stephens is quite right that Scavino is “lying” when he refuses to admit his error. I agree that Scavino’s action is morally equivalent to lying, but it’s something different.

Here’s the definition of “lie” according to Merriam-Webster:

1 : to make an untrue statement with intent to deceive . . .

2 : to create a false or misleading impression . . .

Definition 2 suggests that you can lie “by accident,” as it were. I guess we’ll never know if Scavino’s original statement was a purposeful lie or just aggressive ignorance: “The line between idiocy and malice isn’t always clear” and all that.

Is there a term for Scavino’s follow-up behavior, where he deliberately avoids correcting an error that he is responsible for? I didn’t like it when David Brooks did it, but Scavino’s behavior is arguably even worse, in that he’s lying about a specific person rather than just refusing to correct false statistics. I’m wondering this because something similar recently happened to me (on a much much less important topic).

P.S. I’m kinda liking the term “misrepresent” rather than “lie” as it’s more general and seems to work ok in the after-the-fact sense.

For example:

– David Brooks misrepresented statistics on high school achievement; this misrepresentation was originally unintentional (I assume) but then he intentionally avoided correcting the false numbers.

– Satoshi Kanazawa misrepresented the information from a survey in order to draw a scientifically unwarranted conclusion about beauty and sex ratio; after his statistical errors were pointed out to him, he refused to alter his published views. This continuing misrepresentation was intentional but it could be explained by his continuing to not understand some of the subtle statistical principles involved.

– Susan Fiske misrepresented my writing when she wrote that Ulrich Schimmack and I implied that the entire field of psychology is inept and misguided; after this error was pointed out, the journal where it was published refused to correct it. Fiske’s original misrepresentation may have been a simple sloppy error (“the line between idiocy and malice isn’t always clear”) but the journal’s refusal to correct it was an intentional act of misrepresentation.

– Brian Wansink misrepresented his data and data-collection procedures in many published papers. It is not clear to what extent these misrepresentations arose from sloppiness, confusion, or intent to deceive. When these errors were pointed out, Wansink minimized them, thus misrepresenting the large problems they represented for his published work.

– Dan Scavino misrepresented Bret Stephens’s words in a particularly malicious way. It’s not clear if the initial misrepresentation was intentional, but Scavino intentionally perpetuated the misrepresentation by not correcting it.

– That character in Dear Evan Hansen (spoiler alert!) misrepresents his relationship with that other kid. The misrepresentation was a mistake at first but becomes intentional when Evan avoids opportunities to correct it.

In general, when people misrepresent evidence and their mistakes are pointed out to them, they can choose to acknowledge and correct the error, to do a minimal correction or quasi-denial that perpetuates the misrepresentation, to just duck the issue entirely, or to double down on the misrepresentation by going on the offensive. Any option but the first is an active misrepresentation. Even if the original error was an honest mistake, it’s misrepresentation to perpetuate it.

As the Evan Hansen example illustrates, correcting a misrepresentation can be difficult and awkward; it can hurt people’s feelings. There can be good reasons to misrepresent evidence, just as there can be good reasons to lie. Marc Hauser might well have felt that, in misrepresenting his monkey data, he was serving a higher truth. That’s fine: misrepresent if you feel that’s the right thing to do. But let’s be honest and admit that’s what people are doing. Dan Scavino is following in the footsteps of Winston Churchill who famously said that “in wartime truth is so precious that she should be attended by a bodyguard of lies.”

Iceland education gene trend kangaroo

Someone who works in genetics writes:

You may have seen the recent study in PNAS about genetic prediction of educational attainment in Iceland. The authors report in a very concerned fashion that every generation the attainment of education as predicted from genetics decreases by 0.1 standard deviations.

This sounds bad. But consider that the University of Iceland was founded in 1911, right at the beginning of the period 1910-1990 studied by the authors (!). So there is a many-thousand-percent increase in actual educational attainment at the same time as there is an ominous 0.005 SD/year decrease in ‘genetic’ educational attainment. Over this period educational attainment in the developed world seems to have exploded beyond a reasonable doubt, as is shown in the paper’s appendix also for Iceland. This genetic effect seems like a kangaroo feather to me.

My reply:

I’m not quite as skeptical as you—after all, it doesn’t seem unreasonable that different groups in the population are having children at different rates—but there are some things about the paper that I don’t completely follow:

1. The last sentence of the summary: “Another important observation is that the association between the score and fertility remains highly significant after adjusting for the educational attainment of the individuals.” This comes up again at the end of the paper: “It is also clear that education attained does not explain all of the effect. Hence, it seems that the effect is caused by a certain capacity to acquire education that is not always realized.” I guess they mean it’s not just that more educated people are having fewer (or later) children, it’s that this fertility differential is predicted by the genes. But I don’t see why that matters for their story.

2. The last sentence of the abstract: “Most importantly, because POLYEDU only captures a fraction of the overall underlying genetic component the latter could be declining at a rate that is two to three times faster.” This is what Eric Loken and I call the “backpack argument.” N is huge here so we don’t have to worry about statistical significance; still, I’m concerned about measurement error and there’s something about this “two to three times faster” claim that I find suspicious, even though it’s hard for me to pin down exactly what’s bothering me about it.

3. Saying P < 10^-100 is kinda silly. They say ∼0.010 standard units per decade, which is fine, but then it would make sense to perform a separate estimate for each decade and see how this varies.

4. I also don’t quite understand “the genetic propensity for educational attainment.” Doesn’t the definition of this propensity vary over time? In the U.S., having an XY chromosome used to be strongly predictive of educational attainment, but not so much anymore. Similarly for various ethnic groups. Iceland might be different because it’s a more homogeneous society, but I’d still think these conditions would vary over time.

This is not to say the research is all wrong—I actually know the first author of the paper, we were in grad school together and he’s a smart guy. It seems like the topic is worth studying as long as people recognize the uncertainty and variation here. Calling it “nonsense” seems a bit strong, no?

My correspondent responded:

Thinking it over, I agree that I was being too harsh, though I remain skeptical. Certainly the editor is highly respected, and the last author is also very prominent. I would tend to agree with you and the authors that different people have different fertilities that are associated with demographic factors like education (this appears to be a very old result). Furthermore, it would not be surprising if there was an indication of this in the genetic makeup of the population, but this could be true whether the trait was genetically determined or not.

I expand on my criticism at some length below- I apologize for going on a bit. First I’ll respond directly to your points:

1) I think you are right- how I would say it is that these variables (education, fertility, and POLYEDU) are all somewhat independently correlated with each other.

2) I think that this refers to the fact that most heritable variance of educational attainment (~90%) is not explained by POLYEDU. The argument seems to be that this unobserved genetic component is probably behaving the same as POLYEDU, see page 4 results. I doubt this, based on my view of complex trait genetics, but making this assumption allows them to make the estimate. That said, I couldn’t reconstruct where the 2-3 multiplier comes from. I wasn’t able to find anything on your backpack argument via google, but I would be curious to know more.

3) Agreed.

4) For me the weakest point (see below). All of these genetic effects are context dependent, and the context in which they occur is one in which the environment is overwhelmingly pushing the opposite direction relative to the (measured) genetics. For instance, imagine that educational attainment and fertility both stabilize in the 21st century. It is possible that the selective effect goes away even if the demographic pattern persists, because what POLYEDU measured was a demographic group who were transitioning from high-fertility to low-fertility as they attained more education. Moreover, all of this is based on gene-education associations that were only measured in the last generation, and they account for only 4% of the variance of the trait (of which 20-40% total is heritable).

I should admit that the source of my strong reaction to the paper, and particularly to the publicity, is that when this kind of finding about human genetics gets oversold, it immediately gets shared on white supremacist websites as incontrovertible proof of whatever (as has already happened in this case). This is something that I have a problem with in my chosen field (and why I don’t work on humans): there are consequences to overstating the role of genetics in human variation.

This paper seems to be playing into an “Idiocracy” narrative based on its data; this is certainly how the paper seems to be getting played in the press (“We’re evolving stupid: Icelandic study finds gradual decline in genes linked to education, IQ”). However, I don’t see how they could reliably measure that effect with this data. For instance, towards the end of the paper they translate their POLYEDU effect into IQ because education increases IQ, and therefore IQ has decreased by ~0.3 points overall based on POLYEDU. At the same time, educational attainment in the study population has increased by an average of 2-4 years over the same period, which, if you look at other studies, suggests a historical increase in IQ of 8-16 over the same period (one measured increase in the interval of the study said 13.8 points). So what they’re measuring is a countercyclical genetic trend in the face of a much larger environmental trend. It isn’t wrong to try to do this, but the presentation obscures the larger picture that genetics probably isn’t doing much here. A rather more parsimonious hypothesis is that the demographic shift caused some mostly meaningless fluctuations in standing genetic variation (which would be expected anyways in a population model). This seems more plausible to me, and I believe it could be modeled without too much difficulty.

The authors are also certainly correct that 0.01 SU/decade would be a significant trend in evolution, but this then also assumes that the conditions of the 20th century in which this change is taking place are held constant for evolution. Presumably this includes the increasing educational attainment that is the context for the decreasing genetic propensity for educational attainment. I think this is similar to your question (4).

To their credit, the last author more or less agrees with all this in an interview, and they talk about it in the end of the results section. But then, why describe something as meaningful on an evolutionary timescale unless you believe it actually will be? Their extrapolation just makes me uncomfortable, especially because they don’t account for the effects of demography. What is particularly worrisome is your question (1), which might mean that the score is actually somehow measuring signatures of declining fertility via the mediator variable of education. Personally, I mistrust polygenic scores in humans (animals are better, I think). They tend to rely on large numbers of features to explain rather small effects (4% of variance in this case). However, mine is a fringe position in this field.

Delegate at Large

Asher Meir points to this delightful garden of forking paths, which begins:

• Politicians on the right look more beautiful in Europe, the U.S. and Australia.
• As beautiful people earn more, they are more likely to oppose redistribution.
• Voters use beauty as a cue for conservatism in low-information elections.
• Politicians on the right benefit more from beauty in low-information elections.

I wrote: On the plus side, it did not appear in a political science journal! Economists and psychologists can be such suckers for the “voters are idiots” models of politics.

Meir replied:

Perhaps since I am no longer an academic these things don’t even raise my hackles anymore. I just enjoy the entertainment value.

This stuff still raises my hackles, partly because I’m in the information biz so I don’t like to see people mangle information, and partly because I feel that these sorts of claims of the voters-are-shallow variety do their small bit to degrade the prestige of democratic governance. I’m similarly bothered by the don’t-vote-it’s-a-waste-of-time thing, and the shark-attacks-and-subliminal-smiley-faces-thing and the fat-arms-and-redistribution thing etc etc etc.

Stan Weekly Roundup, 28 July 2017

Here’s the roundup for this past week.

  • Michael Betancourt added case studies for methodology in both Python and R, based on the work he did getting the ML meetup together:

  • Michael Betancourt, along with Mitzi Morris, Sean Talts, and Jonah Gabry taught the women in ML workshop at Viacom in NYC and there were 60 attendees working their way up from simple linear regression, through Poisson regression to GPs.

  • Ben Goodrich has been working on new R^2 analyses and priors, as well as the usual maintenance on RStan and RStanArm.

  • Aki Vehtari was at the summer school in Valencia teaching Stan.

  • Aki has also been kicking off planning for StanCon in Helsinki 2019. Can’t believe we’re planning that far ahead!

  • Sebastian Weber was in Helsinki giving a talk on Stan, but there weren’t many Bayesians there to get excited about Stan; he’s otherwise been working with Aki on variable selection.

  • Imad Ali is finishing up the spatial models in RStanArm and moving on to new classes of models (we all know his goal is to model basketball, which is a very spatially continuous game!).

  • Ben Bales has been working on generic append array functions and vectorizing random number generators. We learned his day job was teaching robotics with lego to mechanical engineering students!

  • Charles Margossian is finishing up the algebraic solvers (very involved autodiff issues there, as with the ODE solvers) and wrapping up a final release of Torsten before he moves to Columbia to start the Ph.D. program in stats. He’s also writing the mixed solver paper with feedback from Michael Betancourt and Bill Gillespie.

  • Mitzi Morris added runtime warning messages for problems arising in declarations, which inadvertently fixed another bug arising for declarations with sizes for which constraints couldn’t be satisfied (as in size zero simplexes).

  • Miguel Benito, along with Mitzi Morris and Dan Simpson, with input from Michael Betancourt and Andrew Gelman, now have spatial models with matching results across GeoBUGS, INLA, and Stan. They further worked on better priors for Stan so that it’s now competitive in fitting; turns out the negative effect of the sum-to-zero constraint on the spatial random effects had a greater negative effect on the geometry than a positive effect on identifiability.

  • Michael Andreae resubmitted papers with Ben Goodrich and Jonah Gabry and is working on some funding prospects.

  • Sean Talts (with help from Daniel Lee) has most of the C++11/C++14 dev ops in place so we’ll be able to start using all those cool toys.

  • Sean Talts and Michael Betancourt with some help from Mitzi Morris, have been doing large-scale Cook-Gelman-Rubin evaluations for simple and hierarchical models and finding some surprising results (being discussed on Discourse). My money’s on them getting to the bottom of what’s going on soon; Dan Simpson’s jumping in to help out on diagnostics, in the same thread on Discourse.
  • Aki Vehtari reports that Amazon UK (with Neil Lawrence and crew) are using Stan, so we expect to see some more GP activity at some point.

  • We spent a long time discussing how to solve the multiple translation unit problems. It looks at first glance like Eigen just inlines every function and that may also work for us (if a function is declared inline, it may be defined in multiple translation units).

  • Solène Desmée, along with France Mentré and others have been fitting time-to-event models in Stan and have a new open-access publication, Nonlinear joint models for individual dynamic prediction of risk of death using Hamiltonian Monte Carlo: application to metastatic prostate cancer. You may remember France as the host of last year’s PK/PD Stan conference in Paris.

An improved ending for The Martian

In this post from a couple years ago I discussed the unsatisfying end of The Martian. At the time, I wrote:

The ending is not terrible—at a technical level it’s somewhat satisfying (I’m not enough of a physicist to say more than that), but at the level of construction of a story arc, it didn’t really work for me.

Here’s what I think of the ending. The Martian is structured as a series of challenges: one at a time, there is a difficult or seemingly insurmountable problem that the character or characters solve, or try to solve, in some way. A lot of the fun comes when the solution of problem A leads to problem B later on. It’s an excellent metaphor for life (although not stated that way in the book; one of the strengths of The Martian is that the whole thing is played straight, so that the reader can draw the larger conclusions for him or herself).

OK, fine. So what I think is that Andy Weir, the author of The Martian, should’ve considered the ending of the book to be a challenge, not for his astronaut hero, but for himself: how to end the book in a satisfying way? It’s a challenge. A pure “win” for the hero would just feel too easy, but just leaving him on Mars or having him float off into space on his own, that’s just cheap pathos. And, given the structure of the book, an indeterminate ending would just be a cheat.

So how to do it? How to make an ending that works, on dramatic terms? I don’t know. I’m no novelist. All I do know is that, for me, the ending that Weir chose didn’t do the job. And I conjecture that had Weir framed it to himself as a problem to be solved with ingenuity, maybe he could’ve done it.

And, hey! I finally figured out how Weir could’ve done it. As I said, the challenge is to avoid the two easy outs of a pure win, in one direction, or a failure, on the other.

So here’s the solution:

Have the spaceman get rescued—by the way, it’s a sign of the weak characterization that, even though I read the book and saw the movie, I still can’t remember the main character’s name—but have that rescue require resources that would otherwise have been necessary for the space program, so that, as a result, future missions are canceled. The astronaut then for the rest of his life has to live with the fact that all his and his colleagues’ ingenuity did manage to save him, but at the cost of ending future manned exploration of space. Bittersweet.

“Statistics textbooks (including mine) are part of the problem, I think, in that we just set out ‘theta’ as a parameter to be estimated, without much reflection on the meaning of ‘theta’ in the real world.”

Carol Nickerson pointed me to a new article by Arie Kruglanski, Marina Chernikova, Katarzyna Jasko, entitled, “Social psychology circa 2016: A field on steroids.”

I wrote:

1. I have no idea what is the meaning of the title of the article. Are they saying that they’re using performance-enhancing drugs?

2. I noticed this from the above article: “Consider the ‘power posing’ effects (Carney, Cuddy, & Yap, 2010; Carney, Cuddy, & Yap, 2015) or the ‘facial feedback’ effects (Strack, Martin, & Stepper, 1988), both of which recently came under criticism on grounds of non-replicability. We happen to believe that these effects could be quite real rather than made up, albeit detectable only under some narrowly circumscribed conditions. Our beliefs derive from what (we believe) is the core psychological mechanism mediating these phenomena.”

This seems naive to me. If we want to talk about what we “happen to believe” about “the core psychological mechanism,” I’ll register my belief that if you were to do an experiment on the “crouching cobra” position and explain how powerful people hold their energy in reserve, and if you had all the researcher degrees of freedom available to Carney, Cuddy, and Yap, I think you’d have no problem demonstrating that a crouched “cobra pose” is associated with powerfulness, while an open pose is associated with weakness.

One could also argue that the power pose results arose entirely from facial feedback. Were the experimenters controlling for their own facial expressions or the facial expressions of the people in the experiment? I think not.

Nickerson replied:

The effects *could* be real (albeit small). But this hasn’t been demonstrated in any credible way.

I didn’t understand the “on steroids” bit, either. That idiom usually means “in a stronger, more powerful, or exaggerated way.” I’d agree that the results of much social psychology research seems exaggerated (if not just plain wrong), but I don’t think that is what they meant.

And I followed up by saying:

Yes, the effects could be real. But they could just as well be real in the opposite direction (that the so-called power pose makes things worse). Realistically I think the effects will depend so strongly on context (including expectations) that I doubt that “the effect” or “the average effect” of power pose has any real meaning. Statistics textbooks (including mine) are part of the problem, I think, in that we just set out “theta” as a parameter to be estimated, without much reflection on the meaning of “theta” in the real world.

This discussion is relevant too.

Died in the Wool

Garrett M. writes:

I’m an analyst at an investment management firm. I read your blog daily to improve my understanding of statistics, as it’s central to the work I do.

I had two (hopefully straightforward) questions related to time series analysis that I was hoping I could get your thoughts on:

First, much of the work I do involves “backtesting” investment strategies, where I simulate the performance of an investment portfolio using historical data on returns. The primary summary statistics I generate from this sort of analysis are mean return (both arithmetic and geometric) and standard deviation (called “volatility” in my industry). Basically the idea is to select strategies that are likely to generate high returns given the amount of volatility they experience.

However, historical market data are very noisy, with stock portfolios generating an average monthly return of around 0.8% with a monthly standard deviation of around 4%. Even samples containing 300 months of data then have standard errors of about 0.2% (4%/sqrt(300)).

My first question is, suppose I have two time series. One has a mean return of 0.8% and the second has a mean return of 1.1%, both with a standard error of 0.4%. Assuming the future will look like the past, is it reasonable to expect the second series to have a higher future mean than the first out of sample, given that it has a mean 0.3% greater in the sample? The answer might be obvious to you, but I commonly see researchers make this sort of determination, when it appears to me that the data are too noisy to draw any sort of conclusion between series with means within at least two standard errors of each other (ignoring for now any additional issues with multiple comparisons).

My second question involves forecasting standard deviation. There are many models and products used by traders to determine the future volatility of a portfolio. The way I have tested these products has been to record the percentage of the time future returns (so out of sample) fall within one, two, or three standard deviations, as forecasted by the model. If future returns fall within those buckets around 68%/95%/99% of the time, I conclude that the model adequately predicts future volatility. Does this method make sense?

My reply:

In answer to your first question, I think you need a model of the population of these time series. You can get different answers from different models. If your model is that each series is a linear trend plus noise, then you’d expect (but not be certain) that the second series will have a higher future return than the first. But there are other models where you’d expect the second series to have a lower future return. I’d want to set up a model allowing all sorts of curves and trends, then fit the model to past data to estimate a population distribution of those curves. But I expect that the way you’ll make real progress is to have predictors—I guess they’d be at the level of the individual stock, maybe varying over time—so that your answer will depend on the values of these predictors, not just the time series themselves.
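To get a sense of scale, here is a rough sketch of the arithmetic in the question. It is not the population model described above, just a back-of-the-envelope calculation that assumes the two series are independent and that a normal approximation with a flat prior is acceptable. Both assumptions are mine, not Garrett's, and real portfolios are typically correlated, which would change the standard error of the difference.

```python
import math

# Standard error of a mean monthly return, using the figures in the question.
monthly_sd = 4.0       # percent
n_months = 300
se_mean = monthly_sd / math.sqrt(n_months)

# Two series with the stated means and standard errors (percent per month).
mean_a, mean_b = 0.8, 1.1
se_a = se_b = 0.4

# ASSUMPTION: the two estimates are independent, so variances add.
se_diff = math.sqrt(se_a**2 + se_b**2)
z = (mean_b - mean_a) / se_diff

# ASSUMPTION: normal approximation with a flat prior on the difference.
prob_b_higher = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"SE of a 300-month mean: {se_mean:.2f}%")
print(f"SE of the difference:   {se_diff:.2f}%")
print(f"z for the difference:   {z:.2f}")
print(f"Pr(B's true mean > A's), roughly: {prob_b_higher:.2f}")
```

Under these assumptions the observed 0.3% gap is about half a standard error of the difference, so the data favor the second series only weakly. Adding predictors or a model for the population of series is what would move the answer.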

In answer to your second question, yes, sure, you can check the calibration of your model using interval coverage. This should work fine if you have lots of calibration data. If your sample size becomes smaller, you might want to do something using quantiles, as described in this paper with Cook and Rubin, as this will make use of the calibration data more efficiently.
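A minimal sketch of the interval-coverage check, using simulated returns rather than real market data. The function name `coverage`, the simulated return parameters, and the two hypothetical volatility forecasts are all made up for illustration. For normally distributed returns the benchmarks are roughly 68%, 95%, and 99.7%.

```python
import numpy as np

def coverage(returns, forecast_mean, forecast_sd, ks=(1, 2, 3)):
    """Fraction of returns within k forecast standard deviations of the forecast mean."""
    z = np.abs((returns - forecast_mean) / forecast_sd)
    return {k: float(np.mean(z <= k)) for k in ks}

rng = np.random.default_rng(1)

# Simulated out-of-sample monthly returns (hypothetical: mean 0.8%, sd 4%).
returns = rng.normal(0.008, 0.04, 5000)

# A forecast that matches the true volatility, and one that understates it by half.
good = coverage(returns, 0.008, 0.04)
bad = coverage(returns, 0.008, 0.02)

print("well-calibrated forecast:", good)
print("volatility understated:  ", bad)
```

With only a few dozen out-of-sample periods, the three-standard-deviation bucket will contain almost no information, which is one reason to look at the full set of quantiles when calibration data are scarce.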

Multilevel modeling: What it can and cannot do

Today’s post reminded me of this article from 2005:

We illustrate the strengths and limitations of multilevel modeling through an example of the prediction of home radon levels in U.S. counties. . . .

Compared with the two classical estimates (no pooling and complete pooling), the inferences from the multilevel models are more reasonable. . . . Although the specific assumptions of model (1) could be questioned or improved, it would be difficult to argue against the use of multilevel modeling for the purpose of estimating radon levels within counties. . . . Perhaps the clearest advantage of multilevel models comes in prediction. In our example we can predict the radon levels for new houses in an existing county or a new county. . . . We can use cross-validation to formally demonstrate the benefits of multilevel modeling. . . . The multilevel model gives more accurate predictions than the no-pooling and complete-pooling regressions, especially when predicting group averages.
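To make the three estimates concrete, here is a small simulation sketch. It uses made-up data, not the radon measurements, and it treats the group-level and data-level standard deviations as known, which a real multilevel fit would estimate. The partial-pooling estimate is the usual precision-weighted average of the group mean and the overall mean.

```python
import numpy as np

rng = np.random.default_rng(2)

n_groups = 1000
sigma_a, sigma_y = 0.3, 0.8            # ASSUMPTION: treated as known here
n_j = rng.integers(1, 11, n_groups)    # 1 to 10 observations per group

alpha = rng.normal(1.5, sigma_a, n_groups)                  # true group means
ybar_j = alpha + rng.normal(0.0, sigma_y / np.sqrt(n_j))    # observed group means

# Complete pooling: one overall mean for every group.
ybar_all = np.sum(n_j * ybar_j) / np.sum(n_j)

# Partial pooling: weight each group mean by its precision relative to the prior.
w = (n_j / sigma_y**2) / (n_j / sigma_y**2 + 1.0 / sigma_a**2)
partial = w * ybar_j + (1.0 - w) * ybar_all

def rmse(estimate):
    """Root mean squared error against the true group means."""
    return float(np.sqrt(np.mean((estimate - alpha) ** 2)))

rmse_no_pooling = rmse(ybar_j)
rmse_complete = rmse(np.full(n_groups, ybar_all))
rmse_partial = rmse(partial)

print(f"no pooling:       {rmse_no_pooling:.3f}")
print(f"complete pooling: {rmse_complete:.3f}")
print(f"partial pooling:  {rmse_partial:.3f}")
```

In this setup the small groups are pulled strongly toward the overall mean and the large groups hardly at all, which is where the gain over both classical estimates comes from.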

The most interesting part comes near the end of the three-page article:

We now consider our model as an observational study of the effect of basements on home radon levels. The study includes houses with and without basements throughout Minnesota. The proportion of homes with basements varies by county (see Fig. 1), but a regression model should address that lack of balance by estimating county and basement effects separately. . . . The new group-level coefficient γ2 is estimated at −.39 (with standard error .20), implying that, all other things being equal, counties with more basements tend to have lower baseline radon levels. For the radon problem, the county-level basement proportion is difficult to interpret directly as a predictor, and we consider it a proxy for underlying variables (e.g., the type of soil prevalent in the county).

This should serve as a warning:

In other settings, especially in social science, individual averages used as group-level predictors are often interpreted as “contextual effects.” For example, the presence of more basements in a county would somehow have a radon-lowering effect. This makes no sense here, but it serves as a warning that, with identical data of a social nature (e.g., consider substituting “income” for “radon level” and “ethnic minority” for “basement” in our study), it would be easy to leap to a misleading conclusion and find contextual effects where none necessarily exist. . . .

This is related to the problem in meta-analysis that between-study variation is typically observational even if individual studies are randomized experiments . . .

In summary:

One intriguing feature of multilevel models is their ability to separately estimate the predictive effects of an individual predictor and its group-level mean, which are sometimes interpreted as “direct” and “contextual” effects of the predictor. As we have illustrated in this article, these effects cannot necessarily be interpreted causally for observational data, even if these data are a random sample from the population of interest. Our analysis arose in a real research problem (Price et al. 1996) and is not a “trick” example. The houses in the study were sampled at random from Minnesota counties, and there were no problems of selection bias.

Read the whole thing.

Adding a predictor can increase the residual variance!

Chao Zhang writes:

When I want to know the contribution of a predictor in a multilevel model, I often calculate how much of the total variance is reduced in the random effects by the added predictor. For example, the between-group variance is 0.7 and residual variance is 0.9 in the null model, and by adding the predictor the residual variance is reduced to 0.7, then VPC = (0.7 + 0.9 – 0.7 – 0.7) / (0.7 + 0.9) = 0.125. Then I assume that the new predictor explained 12.5% more of the total variance than the null model. I guess this is sometimes done by some researchers when they need a measure of sort of an effect size.

However, now I have a case in which adding a new predictor (X) greatly increased the between-group variance. After some inspection, I realized that this was because although X correlates with Y positively overall, it correlates with Y negatively within each group, and X and Y vary in the same direction regarding the grouping variable. Under this situation, the VPC as computed above becomes negative! I am puzzled by this because how could the total variance increase? And this seems to invalidate the above method, at least in some situations.

My reply: this phenomenon is discussed in Section 21.7 of my book with Jennifer Hill. The section is entitled, “Adding a predictor can increase the residual variance!”

It’s great when I can answer a question so easily!
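For readers who want to see the pattern without the book at hand, here is a simulation sketch of the situation Chao describes. It is not the analysis from the book: it uses made-up data and a simple within-group (group-centered) slope rather than a fitted multilevel model, and the helper `vpc` just reproduces the arithmetic in the question.

```python
import numpy as np

def vpc(between0, resid0, between1, resid1):
    """Proportional reduction in total variance after adding a predictor."""
    total0 = between0 + resid0
    return (total0 - (between1 + resid1)) / total0

print(round(vpc(0.7, 0.9, 0.7, 0.7), 3))  # the example in the question

rng = np.random.default_rng(0)
n_groups, n_per = 30, 50
group_mean_x = rng.normal(0.0, 2.0, n_groups)
g = np.repeat(np.arange(n_groups), n_per)

# x varies around its group mean; y rises with the group's mean of x
# but falls with x within a group.
x = group_mean_x[g] + rng.normal(0.0, 1.0, g.size)
y = group_mean_x[g] - 0.5 * (x - group_mean_x[g]) + rng.normal(0.0, 1.0, g.size)

xbar = np.array([x[g == j].mean() for j in range(n_groups)])
ybar = np.array([y[g == j].mean() for j in range(n_groups)])

# Within-group slope from group-centered data.
xc, yc = x - xbar[g], y - ybar[g]
slope_within = (xc @ yc) / (xc @ xc)

# Group intercepts without and with the predictor in the model.
var_null = ybar.var(ddof=1)
var_with_x = (ybar - slope_within * xbar).var(ddof=1)

print(f"overall corr(x, y):               {np.corrcoef(x, y)[0, 1]:.2f}")
print(f"within-group slope:               {slope_within:.2f}")
print(f"between-group variance, null:     {var_null:.2f}")
print(f"between-group variance, with x:   {var_with_x:.2f}")
```

Once the negative within-group slope is applied to the group means of x, the intercepts have to spread out further to reproduce the group means of y, so the group-level variance goes up even though the predictor is doing real work.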

Recently in the sister blog

This research is 60 years in the making:

How “you” makes meaning

“You” is one of the most common words in the English language. Although it typically refers to the person addressed (“How are you?”), “you” is also used to make timeless statements about people in general (“You win some, you lose some.”). Here, we demonstrate that this ubiquitous but understudied linguistic device, known as “generic-you,” has important implications for how people derive meaning from experience. Across six experiments, we found that generic-you is used to express norms in both ordinary and emotional contexts and that producing generic-you when reflecting on negative experiences allows people to “normalize” their experience by extending it beyond the self. In this way, a simple linguistic device serves a powerful meaning-making function.