## The anthropic principle in statistics

The anthropic principle in physics states that we can derive certain properties of the world, or even the universe, based on the knowledge of our existence. The earth can’t be too hot or too cold, there needs to be oxygen and water, etc., which in turn implies certain things about our solar system, and so forth.

In statistics we can similarly apply an anthropic principle to say that the phenomena we can successfully study in statistics will not be too small or too large. Too small a signal and any analysis is hopeless (we’re in kangaroo territory); too large and statistical analysis is not needed at all, as you can just do what Dave Krantz calls the Inter-Ocular Traumatic Test.

You never see estimates that are 8 standard errors away from 0, right?

An application of the anthropic principle to experimental design

I actually used the anthropic principle in my 2000 article, Should we take measurements at an intermediate design point? (a paper that I love; but I just looked it up and it’s only been cited 3 times; that makes me so sad!), but without labeling the principle as such.

From a statistical standpoint, the anthropic principle is not magic; it won’t automatically apply. But I think it can guide our thinking when constructing data models and prior distributions. It’s a bit related to the so-called method of imaginary results, by which a model is understood based on the reasonableness of its prior predictive distribution. (Any assessment of “reasonableness” is, implicitly, an invocation of prior information not already included in the model.)

Using the anthropic principle to understand the replication crisis

Given that (a) researchers are rewarded for obtaining results that are 2 standard errors from zero, and (b) there’s not much extra benefit to being much more than 2 standard errors from zero, the anthropic principle implies that we’ll be studying effects just large enough to reach that level. (I’m setting aside forking paths and researcher degrees of freedom which allow researchers on ESP, beauty and sex ratios, etc., to attain statistical significance from what is essentially pure noise.) This implies high type M errors, that is, estimated effects that are overestimates (see Section 2.1 of this paper).
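Here’s a minimal simulation sketch of that argument. It is not taken from the paper referenced above; the function name `type_m_simulation`, the normal sampling model, and the choice of a true effect equal to one standard error are all assumptions made for illustration.

```python
import numpy as np

def type_m_simulation(true_effect, se, n_sims=100_000, seed=0):
    """Simulate estimates ~ Normal(true_effect, se) and keep only the
    'significant' ones, i.e. those more than 2 standard errors from zero."""
    rng = np.random.default_rng(seed)
    estimates = rng.normal(true_effect, se, size=n_sims)
    significant = np.abs(estimates) > 2 * se
    # Share of simulated studies that clear the 2-standard-error bar.
    power = significant.mean()
    # Type M (magnitude) error: how much the significant estimates
    # overstate the true effect, on average.
    exaggeration = np.abs(estimates[significant]).mean() / abs(true_effect)
    return power, exaggeration

# Hypothetical setting: the true effect is exactly one standard error.
power, exaggeration = type_m_simulation(true_effect=1.0, se=1.0)
print(f"power = {power:.2f}, exaggeration ratio = {exaggeration:.2f}")
```

Under these assumed settings, only about one study in six clears the threshold, and the ones that do overestimate the effect by a factor of roughly 2.5. The exact numbers depend entirely on the assumed ratio of true effect to standard error.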

A discussion with Ken Rice and Thomas Lumley

I was mentioning this idea to Ken Rice when I was in Seattle the other day, and he responded:

Re: the anthropic idea, that (I think) statisticians never see really large effects relative to standard errors, because problems involving them are too easy to require a statistician, this sounds a lot like the study of local alternatives, e.g., here and here.

This is old news, of course, but an interesting (and unpublished) variation on it might be of interest to you. When models are only locally wrong from what we assume, it’s possible we can’t ever reliably detect the model mis-specification based on the data, but yet that the mis-specification really does matter for inference. Thomas Lumley writes about this here and here.

At first this was highly non-intuitive to me and others; perhaps we were all too used to thinking in the “classic” mode where the estimation problem stays fixed regardless of the sample size, and this phenomenon doesn’t arise. I suspect Thomas’ work has implications for “workflow” ideas too – that for sufficiently-nuanced inferences we can’t be confident that the steps that led to them being considered are reliable, at least if they were based on the data alone. Some related impossibility results are reviewed here.

And Lumley points to this paper, Robustness of semiparametric efficiency in nearly-true models for two-phase samples.

## David Bellos’s book on translation

Seeing as linguistics is on the agenda, I thought I’d mention this excellent book I just finished, “Is That a Fish in Your Ear,” by David Bellos. Bellos is a translator and scholar of French literature, and in his book he covers all sorts of topics. Nothing deep, but, as a non-expert on the topic, I learned a lot. Over the years I’ve read a few things on translation, and this was my favorite.

## The statistical significance filter leads to overoptimistic expectations of replicability

Shravan Vasishth, Daniela Mertzen, Lena Jäger, et al. write:

Treating a result as publishable just because the p-value is less than 0.05 leads to overoptimistic expectations of replicability. These overoptimistic expectations arise due to Type M(agnitude) error: when underpowered studies yield significant results, effect size estimates are guaranteed to be exaggerated and noisy. These effects get published, leading to an overconfident belief in replicability. We demonstrate the adverse consequences of this statistical significance filter by conducting six direct replication attempts (168 participants in total) of published results from a recent paper. We show that the published claims are so noisy that even non-significant results are fully compatible with them. We also demonstrate the contrast between such small-sample studies and a larger-sample study (100 participants); the latter generally yields less noisy estimates but also a smaller effect size, which looks less compelling but is more realistic. We make several suggestions for improving best practices in psycholinguistics and related areas.

Shravan asks all of you for a favor:

Can we get some reactions from the sophisticated community that reads your blog? I still have a month to submit and wanted to get a feel for what the strongest objections can be.

## Garden of forking paths – poker analogy

[image of cats playing poker]

Someone who wishes to remain anonymous writes:

Just wanted to point out an analogy I noticed between the “garden of forking paths” concept as it relates to statistical significance testing and poker strategy (a game I’ve played as a hobby).

A big part of constructing a winning poker strategy nowadays is thinking about what one would have done had they been dealt a different hand. That is to say, to determine whether betting with 89hh in a certain situation is a “good play,” you should think about what you would have done with all the other hands you could have in this situation.

In contrast, many beginning players will instead only focus on their current hand and base their play on what they think the opponent will do.

This is a counterintuitive idea and it took the poker community a long time to think in a “garden of forking paths” way, even though Von Neumann used similar ideas to calculate the Nash equilibrium of simplified poker games a long time ago.

So I’m not too surprised that a lot of researchers seem to have difficulty grasping the garden of forking paths concept.

Excellent point. Statistics is hard, like knitting, basketball, and poker.

## Prior distributions and the Australia principle

There’s an idea in philosophy called the Australia principle—I don’t know the original of this theory but here’s an example that turned up in a google search—that posits that Australia doesn’t exist; instead, they just build the parts that are needed when you visit: a little mock-up of the airport, a cityscape with a model of the Sydney Opera House in the background, some kangaroos, a bunch of desert in case you go into the outback, etc. The idea is that it would be ridiculously inefficient to build an entire continent and that it makes much more sense for them to just construct a sort of stage set for the few places you’ll ever go.

And this is the principle underlying the article, The prior can often only be understood in the context of the likelihood, by Dan Simpson, Mike Betancourt, and myself. The idea is that, for any given problem, for places in parameter space where the likelihood is strong, relative to the questions you’re asking, you won’t need to worry much about the prior; something vague will do. And in places where the likelihood is weak, relative to the questions you’re asking, you’ll need to construct more of a prior to make up the difference.

This implies:
1. The prior can often only be understood in the context of the likelihood.
2. What prior is needed can depend on the question being asked.

To follow up on item 2, consider a survey of 3000 people, each of whom gives a binary survey response, and suppose this survey is a simple random sample of the general population. If this is a public opinion poll, N = 3000 is more than enough: the standard error of the sample proportion is something like 0.5/sqrt(3000) = 0.01; you can estimate a proportion to an accuracy of about 1 percentage point, which is fine for all practical purposes, especially considering that, realistically, nonsampling error will likely be more than that anyway. On the other hand, if the question on this survey of 3000 people is whether your baby is a boy or a girl, and if the goal is to compare sex ratios of beautiful and ugly parents, then N = 3000 is way way too small to tell you anything (see, for example, the discussion on page 645 here), and if you want any kind of reasonable posterior distribution for the difference in sex ratios you’ll need a strong prior. You need to supply the relevant scenery yourself, as it’s not coming from the likelihood.
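The arithmetic in that example can be checked with a short sketch. The helper names `se_proportion` and `se_difference` are made up here, and the even 1500/1500 split between the two groups of parents is an assumption for illustration; the text gives no split.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

def se_difference(p, n1, n2):
    """Standard error of a difference between two independent sample
    proportions, both assumed to be near p."""
    return math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)

n = 3000

# Public opinion poll: one proportion estimated from all 3000 respondents.
se_poll = se_proportion(0.5, n)

# Sex-ratio comparison: a difference between two subgroups.
# ASSUMPTION: an even split into two groups of 1500.
se_diff = se_difference(0.5, n // 2, n - n // 2)

print(f"SE of a single proportion: {100 * se_poll:.2f} percentage points")
print(f"SE of the difference:      {100 * se_diff:.2f} percentage points")
```

The same N gives a standard error of about 0.9 percentage points for the poll and about 1.8 percentage points for the difference. Whether that counts as precise depends on how large the quantity being estimated could plausibly be, which is the point of item 2.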

The same principle—that the prior you need depends on the other information you have and the question you’re asking—also applies to assumptions within the data model (which in turn determines the likelihood). But for simplicity here we’re following the usual convention and pretending that the likelihood is known exactly ahead of time so that all the modeling choices arise in the prior.

P.S. The funny thing is, Dan Simpson is from Australia himself. Just a coincidence, I’m sure.

## Regularized Prediction and Poststratification (the generalization of Mister P)

This came up in comments recently so I thought I’d clarify the point.

Mister P is MRP, multilevel regression and poststratification. The idea goes like this:

1. You want to adjust for differences between sample and population. Let y be your outcome of interest and X be your demographic and geographic variables you’d like to adjust for. Assume X is discrete so you can define a set of poststratification cells, j=1,…,J (for example, if you’re poststratifying on 4 age categories, 5 education categories, 4 ethnicity categories, and 50 states, then J=4*5*4*50, and the cells might go from 18-29-year-old no-high-school-education whites in Alabama, to over-65-year-old, post-graduate-education latinos in Wyoming). Each cell j has a population N_j from the census.

2. You fit a regression model y | X to data, to get a predicted average response for each person in the population, conditional on their demographic and geographic variables. You’re thus estimating theta_j, for j=1,…,J. The *regression* part of MRP comes in because you need to make these predictions.

3. Given point estimates of theta, you can estimate the population average as sum_j (N_j*theta_j) / sum_j (N_j). Or you can estimate various intermediate-level averages (for example, state-level results) using partial sums over the relevant subsets of the poststratification cells.

4. In the Bayesian version (for example, using Stan), you get a matrix of posterior simulations, with each row of the matrix representing one simulation draw of the vector theta; this then propagates to uncertainties in any poststrat averages.

5. The *multilevel* part of MRP comes because you want to adjust for lots of cells j in your poststrat, so you’ll need to estimate lots of parameters theta_j in your regression, and multilevel regression is one way to get stable estimates with good predictive accuracy.
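Steps 3 and 4 can be sketched in a few lines. Everything numeric below is invented for illustration: the six cells, the counts `N`, the `state` labels, and especially `theta_draws`, which is filled with random numbers standing in for the posterior simulations you would get from a fitted model (for example, from Stan).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy poststratification table (hypothetical): 2 states x 3 age groups.
J = 6
N = np.array([500, 300, 200, 400, 350, 250], dtype=float)  # cell populations N_j
state = np.array([0, 0, 0, 1, 1, 1])                       # state of each cell

# Stand-in for a matrix of posterior simulations: one row per draw,
# one column per cell. A real analysis would take these from the fitted model.
theta_draws = rng.beta(5, 5, size=(1000, J))

def poststratify(theta, N, mask=None):
    """Return sum_j(N_j * theta_j) / sum_j(N_j) over the selected cells.

    theta may be a length-J vector (point estimates) or a draws-by-J matrix,
    in which case one poststratified value is returned per draw."""
    if mask is None:
        mask = np.ones(N.shape, dtype=bool)
    w = N[mask]
    return theta[..., mask] @ w / w.sum()

pop_draws = poststratify(theta_draws, N)                 # population average
state0_draws = poststratify(theta_draws, N, state == 0)  # partial sum: state 0

print("population:", pop_draws.mean().round(3), "+/-", pop_draws.std().round(3))
print("state 0:   ", state0_draws.mean().round(3), "+/-", state0_draws.std().round(3))
```

Because the weighted average is applied to each row of the simulation matrix, the uncertainty in theta carries through to any poststratified summary without extra work.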

OK, fine. The point is: poststratification is key. It’s all about (a) adjusting for many ways in which your sample isn’t representative of the population, and (b) getting estimates for population subgroups of interest.

But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any *regularized prediction* method that gives reasonable and stable estimates while including a potentially large number of predictors.

Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.
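As one hedged illustration of swapping in a different regularized predictor, here is a sketch that uses ridge regression on cell indicators in place of a multilevel model and then poststratifies. The choice of ridge, the penalty `lam`, and all the data are assumptions for the example, not a recommendation from this post.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population and an unrepresentative sample.
J = 4
N = np.array([4000.0, 3000.0, 2000.0, 1000.0])  # census counts per cell
true_theta = np.array([0.3, 0.5, 0.6, 0.8])     # simulation truth
n_sample = np.array([10, 40, 100, 50])          # respondents per cell

cell = np.repeat(np.arange(J), n_sample)
y = rng.binomial(1, true_theta[cell]).astype(float)

# Regularized prediction: ridge regression on cell indicators, with the
# coefficients shrunk toward zero around the overall sample mean.
X = np.eye(J)[cell]   # one-hot design matrix
lam = 5.0             # ASSUMED penalty; in practice it would need tuning
ybar = y.mean()
beta = np.linalg.solve(X.T @ X + lam * np.eye(J), X.T @ (y - ybar))
theta_hat = ybar + beta

# Poststratification: weight cell predictions by population counts.
rpp_estimate = N @ theta_hat / N.sum()

print("raw sample mean:        ", round(ybar, 3))
print("poststratified estimate:", round(rpp_estimate, 3))
print("true population mean:   ", round(N @ true_theta / N.sum(), 3))
```

With indicator predictors, the ridge estimate for cell j works out to ybar + n_j/(n_j + lam) * (cell mean - ybar), so small cells are pulled more strongly toward the overall mean. That is the same qualitative behavior as partial pooling, reached through a different method.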

## Awesome data visualization tool for brain research

When I was visiting the University of Washington the other day, Ariel Rokem showed me this cool data visualization and exploration tool produced by Jason Yeatman, Adam Richie-Halford, Josh Smith, and himself. The above image gives a sense of the dashboard but the real thing is much more impressive because it’s interactive. You can rotate that brain image.

And here’s a research paper describing what they did.

## How to think about research, and research criticism, and research criticism criticism, and research criticism criticism criticism?

Some people pointed me to this article, “Issues with data and analyses: Errors, underlying themes, and potential solutions,” by Andrew Brown, Kathryn Kaiser, and David Allison. They discuss “why focusing on errors [in science] is important,” “underlying themes of errors and their contributing factors,” “the prevalence and consequences of errors,” and “how to improve conditions and quality,” and I like much of what they write. I also appreciate the efforts that Allison and his colleagues have made to point the spotlight on scientific errors in nutrition research, and I share his frustration when researchers refuse to admit errors in their published work; see for example here and here.

But there are a couple things in their paper that bother me.

First, they criticize Jordan Anaya, a prominent critic of Brian “pizzagate” Wansink, in what seems to be an unfair way. Brown, Kaiser, and Allison write:

The recent case of the criticisms inveighed against a prominent researcher’s work (82) offers some stark examples of individuals going beyond commenting on the work itself to criticizing the person in extreme terms (e.g., ref. 83).

Reference 83 is this:

Anaya J (2017) The Donald Trump of food research. Medium.com. Available at https://medium.com/@OmnesRes/the-donald-trump-of-food-research-49e2bc7daa41. Accessed September 21, 2017.

Sure, referring to Wansink as the “Donald Trump of food research” might be taken to be harsh. But if you read the post, I don’t think it’s accurate to say that Anaya is “criticizing the person in extreme terms.” First, I don’t think that analogizing someone to Trump is, in itself, extreme. Second, Anaya is talking facts. He indeed has good reasons for comparing Wansink to Trump. (“Actually, the parallels with Trump are striking. Just as Trump has the best words and huge ideas, Wansink has ‘cool data’ that is ‘tremendously proprietary’. Trump’s inauguration was the most viewed in history period, and Wansink doesn’t p-hack, he performs ‘deep data dives’. . . . Trump doesn’t let facts or data get in his way, and neither does Wansink. When Plan A fails, he just moves on to Plan B, Plan C…”).

You might call this hyperbole, and you might call it rude, but I don’t see it as “criticizing the person in extreme terms”; I think of it as criticizing Wansink’s public actions and public statements in negative terms.

Again, I don’t see Anaya’s statements as “going beyond commenting on the work”; rather, I see them as a vivid way of commenting on the work, and associated publicity statements issued by Wansink, very directly.

One reason that the brief statement in the article bothered me is that it’s easy to isolate someone like Anaya and say something like, We’re the reasonable, even-keeled critics, and dissociate us from bomb-throwers. But I don’t think that’s right at all. Anaya and his colleagues put in huge amounts of effort to reveal a long and consistent pattern of misrepresentation of data and research methods by a prominent researcher, someone who’d received millions of dollars of government grants, someone who received funding from major corporations, held a government post, appeared frequently on television, and was considered an Ivy League expert. And then he writes a post with a not-so-farfetched analogy to a politician, and all of a sudden this is considered a “stark example” of extreme criticism. I don’t see it. I think we need people such as Anaya who care enough to track down the truth.

Here’s some further commentary on the Brown, Kaiser, and Allison article, by Darren Dahly. What Dahly writes seems basically reasonable, except for the part that calls them “disingenuous.” I hate when people call other people disingenuous. Calling someone disingenuous is like calling them a liar. I think it would be better for Dahly to have just said he thinks their interpretation of Anaya’s blog post is wrong. Anyway, I agree with Dahly on most of what he writes. In particular I agree with him that the phrase “trial by blog” is ridiculous. A blog is a very open way of providing information and allowing comment. When Anaya or I or anyone else posted on Wansink, anyone—including Wansink!—was free to respond in comments. And, for that matter, when Wansink blogged, anyone was free to comment there too (until he took that blog down). In contrast, closed journals and the elite news media (the preferred mode of communication of many practitioners of junk science) rarely allow open discussion. “Trial by blog” is, very simply, a slogan that makes blogging sound bad, even though blogging is far more open-ended than the other forms of communications that are available to us.

In their article, Brown, Kaiser, and Allison write, “Postpublication discussion platforms such as PubPeer, PubMed Commons, and journal comment sections have led to useful conversations that deepen readers’ understanding of papers by bringing to the fore important disagreements in the field.” Sure—but this is highly misleading. Why not mention blogs in this list? Blogs have led to lots of useful conversations that deepen readers’ understanding of papers by bringing to the fore important disagreements in the field. And the best thing about blogs is that they are not part of institutional structures.

They also write, “Professional decorum and due process are minimum requirements for a functional peer review system.” But peer review does not always follow these rules; see for example this story.

In short, the good things that can be done in official journals such as PNAS can also be done in blogs; also, the bad things that can be done in blogs can also be done in journals. I think it’s good to have multiple channels of communication and I think it’s misleading to associate due process with official journals and to associate abuses with informal channels of communication such as blogs. In the case of Wansink, I’d say the official journals largely did a terrible job, as did Cornell University, whereas bloggers were pretty much uniformly open, fair, and accurate.

Finally, I think their statement, “Individuals engaging in ad hominem attacks in scientific discourse should be subject to censure,” would be made stronger if it were to directly refer to the ad hominem attacks made by Susan Fiske and others in the scientific establishment. I don’t think Jordan Anaya should be subject to censure just because he analogized Brian Wansink to Donald Trump in the context of a detailed and careful discussion of Wansink’s published work.

The larger problem, I think, is that the tone discussion is being used strategically by purveyors of bad science to maintain their power. (See Chris Chambers here and James Heathers here and here.) I’d say the whole thing is a big joke, except that I am being personally attacked in scientific journals, the popular press, and, apparently, presentations being given by public figures. I don’t think these people care about me personally: they’re just demonstrating their power, using me as an example to scare off others, and attacking me personally as a distraction from the ideas they are promoting and that they don’t want criticized. In short, these people whose research practices are being questioned are engaging in ad hominem attacks in scientific discourse, and I do think that’s a problem.

That all said, one thing I appreciate about Brown, Kaiser, and Allison is that they do engage the real problems in science, unlike the pure status quo defenders who go around calling people terrorists and saying ridiculous things such as that the replication rate is “statistically indistinguishable from 100%.” They wrote some things I agree with, and they wrote some things I disagree with, and we can have an open discussion, and that’s great. On the science, they’re open to the idea that published work can be wrong. They’re not staking their reputation on ovulation and voting, ESP, himmicanes, and the like.

Andrew Brown, Kathryn Kaiser, and David Allison say . . .

I told David Allison that I’d be posting something on his article, and he and his colleagues prepared a three-page response which is here.

Jordan Anaya says . . .

In addition, Jordan Anaya sent me an email elaborating on some of the things that bothered him about the Brown, Kaiser, and Allison article:

I’m not mad with Allison, he can say whatever he wants about me in whatever media he chooses, as long as it’s his opinion or accurate. I’m not completely oblivious, I knew the title of my blog post would upset people, but that’s kind of the point. I felt the people it would preferentially offend are the old boys’ club at Harvard, so it seemed worth it. Concurrently, I felt that over time the title of my post would age well, so anyone who was critical of the post initially would eventually look silly. The first whistle blower to claim a major scandal is always seen as a little crazy, so I don’t necessarily blame people who were initially critical of my post, but after seeing two decades worth of misconduct from Wansink it would be hard for me to take people seriously now if they think the title is inappropriate. Wansink is clearly a con artist, just like Donald Trump.

I only learned of Allison’s talk because a journalist contacted me. Of course I’m honored whenever my work is talked about at a conference, but if it is misrepresented to such an extent that a journalist has to contact me to get my side of the story that’s a problem. I sort of thought that Allison would regret his statements in the talk given how many additional problems we found with Wansink’s work, but to my surprise he then said the same thing in a publication. I mean, one time is a mistake, but twice is something else.

So I have three issues with Allison’s comments in his talk and paper. First, I don’t agree with his general argument about ad hominem attacks. Second, I don’t think he is being honest in his portrayal of our investigation. Third, I find the whole thing hypocritical.

Going to Wikipedia, there’s a pyramid where ad hominem is defined as “attacks the characteristics or authority of the writer without addressing the substance of the argument”. Ad hominem is not the same as name-calling.

Allison doesn’t specifically say my blog post is an example of an ad hominem attack, but it is in a paragraph with other examples of ad hominems. The title of my post could be seen as name-calling, but throughout the post I provide evidence for the title, so I’m not even sure if it’s name-calling. And besides, I’m not sure why being compared to the President of the United States, whom he likely voted for, would be seen as name-calling.

But let’s say Allison is right and my post is extremely inappropriate due to it being unsubstantiated criticism. I think it’s interesting to look at the opposite case, where someone gets unsubstantiated acclaim, a type of reverse ad hominem if you will. Wansink was celebrated as the “Sherlock Holmes of food”. I would classify this as reverse name-calling/ad hominem. If you are concerned about someone’s reputation being unfairly harmed by name-calling, surely you must be similarly concerned about someone gaining an unwarranted amount of fame and power by reverse name-calling.

This might sound silly, but here’s a very applicable example. Dan Gilbert and Steve Pinker both shared Sabeti’s Boston Globe essay on Twitter saying it was one of the best things they’ve ever read, an unsubstantiated reverse ad hominem. Why does this matter? Well the essay was filled with errors and attacked you. So by blindly throwing their support behind the essay (a reverse ad hominem), they are essentially blindly criticizing you.

So if we are going to be deeply concerned about (perceived) unwarranted criticism, then we need to be equally concerned about unwarranted praise since that can result in someone getting millions of dollars in funding and best-selling books based on pseudoscience.

Lastly, what is inappropriate is subjective, so it’s impossible to police. I agree with Chris Chambers here. Sure, if someone calls me an idiot I probably won’t throw a party, but if they then point out problems with my work I’d be happy. I’d rather that than someone say my work is great when it is actually filled with errors. Allison says the only things that matter are the data and methods and “everything else is a distraction”. I agree, so if someone happens to mention something else feel free to ignore it, whether it be positive or negative.

The next problem with his talk/paper is I’m not sure our investigation is being accurately presented. In Allison’s talk, (timestamp 36:30), he says he feels what’s happening to Wansink is a “trial by media” and a “character assassination”. Yes, I’ll admit we used our media contacts, but that’s only because we were unsatisfied with how Wansink was handling the situation. We felt we needed to turn up the pressure, and my blog post was part of that. I know turning to the media is not on Allison’s flowchart of how to deal with scientific misconduct, but if you look at our results I would say it is extremely effective and he may want to update his opinions.

He goes on to use the example of a student cheating on a test and having the professor call them out in front of the class. This analogy doesn’t work for various reasons. First, student exams are confidential, while Wansink’s work and errors and blog are in the public domain. Secondly, when a student does well on a test that is also confidential–the teacher doesn’t announce to the class you got an A. Conversely, Wansink has received uncritical positive media attention throughout his career, so I don’t see any issue with a little negative media attention. We’re back to the ad hominem/reverse ad hominem example. If you have no problems with reverse ad hominems and positive media attention I don’t see how you can be against ad hominems and negative media attention.

Third, this whole thing is filled with irony and hypocrisy. It’s funny to note that the whole thing was indeed started by a blog post, but it was Wansink’s blog post. And in that blog post he threw a postdoc under the bus and praised a grad student. The grad student was identified by name, and it was easy to figure out who the postdoc was (she didn’t want to comment when we contacted her). So if you’re going to be mad about a blog post how about you start there? Wansink not only ruined the Turkish student’s career, he provided information about the postdoc she probably wishes wasn’t made public. If you dislike my blog posts fine, but then you better really really hate Wansink’s post. At the end of his talk he reads a quote from my blog and says that’s not the type of culture he wants to foster. Given his lack of criticism of anything in Wansink’s blog, I guess he prefers a culture where senior academics complain about how their employees are lazy and won’t p-hack enough.

Allison is famous for bemoaning how hard it is to get work corrected/retracted, which is exactly what we faced with his good friend (and coauthor) Wansink. We just happened to then use nontraditional routes of criticism, which were extremely effective. You’d think if someone is writing an article about getting errors corrected, they might want to mention one of the most successful investigations. He might not agree with our methods, but I don’t see how he can ignore the results, and it seems wrong to not mention that this is a possible route which can work.

And as I’ve mentioned before, his article in PNAS is basically a blog post, so it’s funny for him to complain about blogs. And I can’t help but wonder whether he would have singled out my blog post if he wasn’t a friend and coauthor of Wansink. Presumably if he didn’t know Wansink he would have described the case in detail given its scale.

One more time

Again, I’m encouraged by the openness of all the people involved in this discussion. I get so frustrated with people who attack and then hide. Much better to get disagreements out in the open. And of course anyone can add their comments below.

## No, there is no epidemic of loneliness. (Or, Dog Bites Man: David Brooks runs another column based on fake stats)

Remember David Brooks? The NYT columnist, NPR darling, and former reporter who couldn’t correctly report the price of a meal at Red Lobster? The guy who got it wrong about where billionaires come from and who thought it was fun to use one of his columns to make fun of a urologist (ha ha! Get it?) who had the misfortune to have the New York Times announce his daughter’s wedding announcement to (“exactly the sort of son-in-law that pediatric urologists dream about” — yeah, nice one, Dave! You sound like a real man of the people here)? The dude who thinks that staying out of jail is a conservative value? Who got conned into mainstreaming some erroneous calculations on high school achievement and college admissions and then never corrected himself? Etc? Etc?

And who never backs down. (The closest to admission of error was a characterization of his Red Lobster statement as a joke (as one reporter put it, a “comedic riff”), which I guess could describe his entire career.)

Well, it turns out he published some more fake stats in his column.

I know, I know, it’s no surprise. Still perhaps worth talking about, as an example of our degraded news media environment.

The story comes from sociologist David Weakliem, via sociologist Jay Livingston and sociologist Claude Fischer. (It’s been a busy week here for sociology.)

Here’s Weakliem:

A couple of days ago, David Brooks had a column in which he wrote “In the 1980s, 20 percent of Americans said they were often lonely. Now it’s 40 percent.” . . . The source of the 40% figure seems to be a survey of people aged 45 and over sponsored by the AARP in 2010. However, the report of that survey didn’t say anything about changes in loneliness.

So then Weakliem looks up loneliness directly:

There is a question that has been asked in a number of surveys asking people if they had felt “very lonely or remote from other people” in the past few weeks. The percent saying they had:

Nov 1963 28%
June 1965 26%
Jan 1981 17%
May 1990 19%
Sept 2001 26%
Dec 2001 24%

That doesn’t look like any kind of trend. The numbers in 1981 and 1990 are lower, but they were in surveys taken by Gallup, and the others were by NORC, so that may be a factor. Unfortunately, the question hasn’t been asked since 2001.

Anything else?

I [Weakliem] searched Google Scholar for papers about trends in loneliness, and found one from 2014 entitled “Declining Loneliness Over Time: Evidence From American Colleges and High Schools”. It was based on surveys at various colleges and universities and on the Monitoring the Future Survey, a representative survey of high school students that has been conducted since the 1970s. It mentioned that other literature claimed that loneliness had increased, but I checked the sources they cited and they didn’t provide any evidence; they just said it had, or cited research that wasn’t really relevant.

Weakliem summarizes:

It’s remarkable that online editions of newspapers and magazines haven’t developed reasonable conventions about when to include links to a source. I checked five or six articles, all in well-regarded publications, which included the claim that levels of loneliness had doubled. Only one provided a link: that was to the AARP survey report, which didn’t support the claim.

Perhaps one reason they don’t link is because, with the link, you could check the claim. Brooks supplies no link, hence he can claim just about anything he wants. The non-linking thing also seems to be a general issue with journalism: everything has to be a “scoop,” so newspapers and magazines rarely point to earlier reporting on a topic. Newspaper A breaks a story, then when newspaper B follows up, it’s typical for them to never mention that the topic was first covered in newspaper A.

To return to the question of the supposed loneliness epidemic, here’s Claude Fischer, who’s been writing about these misconceptions for at least six years:

Yes, loneliness is a social problem, but no, there is no “epidemic of loneliness.” . . .

First, distinctions are needed. At least three different topics get conflated in the media these days: feeling lonely, being socially isolated, and using new social media. They are not the same things. It is well known in the research, for example, that socially isolated people are likelier to report feeling lonely than others do—but not much likelier. So, this post just addresses feelings of loneliness; other posts have addressed isolation and others the effects of the internet.

Weakliem found some long-term National Opinion Research Center data . . . I [Fischer] reported similar fragments of data about loneliness in my book . . . with the same conclusion: No sign of a trend.

But, wait! There’s more: A 2015 study found that in college samples and more importantly, among high school students in the several decades-long Monitoring the Future project, there was a slight decline in reports of loneliness from the 1980s to 2010. Flat would be a close enough summary.

It’s funny—Brooks is on record saying that technical knowledge is like the recipes in a cookbook and can be learned by rote.

The guy lacks the most basic technical knowledge—the ability to read a publication—or a menu!—and accurately report the numbers he sees. I guess he and his New York Times editors excuse him on this on the grounds that nobody ever gave him the cookbook.

Too bad Brooks can’t afford a research assistant who could google his more ridiculous claims. I guess the Times doesn’t pay him enough for that.

It’s funny that the newspaper can’t just run a correction note every time one of its columnists reports something demonstrably false.

When it comes to NYT corrections, my favorite remains this one:

An earlier version of this column misstated the location of a statue in Washington that depicts a rambunctious horse being reined in by a muscular man. The sculpture, Michael Lantz’s ‘Man Controlling Trade’ (1942), is outside the Federal Trade Commission, not the Department of Labor.

With important items like this to run, you can see how the newspaper would have no space to correct mangled statistics.

Who Cares?

Fischer summarizes:

A layperson might ask, What difference—besides diss’ing social scientists—does it make if these interesting articles about loneliness growing are off a bit? First, they are off a lot. But more important, they are a critical distraction. Chatter about feelings (of mainly affluent folks) distracts us from the many real crises of our time—say, widened inequality, children growing up in criminally and chemically dangerous neighborhoods, the dissolution of job security for middle Americans, drug addiction, housing shortages (where the jobs are), a medical system mess, hyper-partisanship, and so on. That’s what makes the loneliness scare not just annoying but also another drag on serious problem-solving.

I’d just say “dissing,” not “diss’ing,” but otherwise I completely agree. Fake social science crap in the NYT, NPR, Ted, etc., sucks away attention from real issues. In the case of Brooks, this distraction may be intentional: “loneliness” is a kind of soft problem, not directly addressed by taxation. More generally, though, I suspect that it’s simple ignorance. Working with numbers is hard. And that’s ok—not everyone has to be a sociologist or a statistician! And reporters make mistakes; that’s inevitable. But if you are a reporter and you do promulgate an error, you should be working extra hard to correct it. Your mess, you clean it up.

P.S. I suppose it would be better for my future media relations if I were to go easy on Brooks: after all, he writes for the Times (where I sometimes write), he has lots of powerful friends, etc. But . . . grrrrrr . . . it really annoys me when people garble the numbers. We’re not talking about David Sedaris here, who’s a humorist and is understood to be joking, exaggerating, and flat-out making things up in order to tell a story. Brooks is purporting to report on social science. When his numbers are completely wrong—as in this case and many others—it completely destroys his point. And it’s insulting to the many researchers who’ve bothered to study the topic more seriously. So, no, I’m not gonna go easy on the guy just to avoid burning bridges in the media.

Again: all of Brooks’s published errors may well have been honest mistakes. Not correcting any of these errors, though? That’s on him.

And the whole thing is so sad—it just makes me want to cry—in that it would be so easy for him to run corrections for his errors. But, like White House social media director Dan Scavino, he just doesn’t go in for that sort of thing.

## “Eureka bias”: When you think you made a discovery and then you don’t want to give it up, even if it turns out you interpreted your data wrong

This came in the email one day:

I am writing to you with my own (very) small story of error-checking a published finding. If you end up posting any of this, please remove my name!

A few years ago, a well-read business journal published an article by a senior-level employee at my company. One of the findings was incredibly counter-intuitive. I looked up one of the reference studies and found that a key measure was reverse-coded (e.g. a 5 meant “poor” and a 1 meant “excellent”). My immediate conclusion was that this reverse coding was not accounted for in the article. I called the author and suggested that with such a strange finding, they should check the data to make sure it was coded properly. The author went back to the research partner, and they claimed the data was correct.

Still thinking the finding was anomalous, I downloaded the data from the original reference study. I then plotted the key metric and showed that the incorrectly coded data matched what was published, but that the correctly coded data matched the intuition. I sent those charts to the senior person. The author and research partner double-checked and confirmed there was an error in their reporting. So far so good!

After confirming the error, the author called and asked me “What are you trying to accomplish here?”. I responded that I was only trying to protect this senior person (and the company), because if I found the error somebody else would find it later down the line. The author, however, was suspicious of why I took the time to investigate the data. I was puzzled, since it appeared it was the research partner who made the fundamental error and the author’s only fault was in not diving into a counter-intuitive result. In the end, the graph in question was redacted from the online version of article. And, as you by now would certainly expect, the author claimed “none of the conclusions were materially impacted by the change”.

Do you have a name for this phenomenon in your lexicon yet? Might I suggest “eureka bias”? Meaning, when somebody is well-intentioned and discovers something unique, that “eureka moment” assumes a supremely privileged status in the researcher’s mind, and they never want to abandon that position despite evidence to the contrary…

My reply: Hey, I reverse-coded a variable once! Unfortunately it destroyed my empirical finding and I felt the need to issue a correction:

In the paper “Should the Democrats move to the left on economic policy?” [Ann. Appl. Stat. 2 (2008) 536–549] by Andrew Gelman and Cexun Jeffrey Cai, because of a data coding error on one of the variables, all our analysis of social issues is incorrect. Thus, arguably, all of Section 3 is wrong until proven otherwise.

We thank Yang Yang Hu for discovering this error and demonstrating its importance.

“Eureka bias,” yes, that’s an interesting idea. I’ve written about this, I can’t remember where, and I think you’re right. Sometimes there’s confirmation bias, when someone does a study and, no surprise!, finds exactly what they were looking for, apparently all wrapped in a bow with statistical significance as long as you ignore the forking paths (as in that famous ESP paper from a few years back).

Other times, though, a researcher is surprised by the data and then takes that surprise as confirming evidence, with the implicit reasoning being: I wasn’t even looking to see this and it showed up anyway, so it must be real. At that point the researcher seems to become attached to the finding and doesn’t want to give it up, sometimes going to extreme lengths to defend it and even to attack and question the motives of anyone who points out problems with their data, as we see in your example above and we’ve seen many times before in various contexts.

So, yes, “Eureka bias” it is.
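
The check described in the letter is cheap to run. Here is a minimal sketch in Python with made-up data (the 1-to-5 scale, the variable names, and all the numbers are hypothetical and have nothing to do with the study in question): a rating stored with 1 meaning “excellent” and 5 meaning “poor” is correlated with an outcome as stored and again after flipping the scale.

```python
import random
import statistics

def reverse_code(x, low=1, high=5):
    """Flip a rating on a low..high scale (1 <-> 5, 2 <-> 4, 3 stays)."""
    return low + high - x

def correlation(xs, ys):
    """Pearson correlation, written out to avoid any dependencies."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
# Hypothetical data: true quality on a 1..5 scale where 5 is best,
# and an outcome that rises with quality.
quality = [rng.randint(1, 5) for _ in range(500)]
outcome = [2.0 * q + rng.gauss(0, 2) for q in quality]
# The survey file stores the rating reverse-coded: 1 = excellent, 5 = poor.
stored = [reverse_code(q) for q in quality]

r_as_stored = correlation(stored, outcome)
r_recoded = correlation([reverse_code(s) for s in stored], outcome)
print(f"as stored: {r_as_stored:+.2f}, after recoding: {r_recoded:+.2f}")
```

Flipping the scale flips the sign of the correlation and changes nothing else, so when a finding looks backward, plotting it both ways against the codebook is a quick way to see which version matches the published result.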

P.S. Check out the “please remove my name!” above. I hear this sort of thing from whistleblowers all the time, and it’s my impression that there’s a lot of bullying done against people who go to the trouble of uncovering inconvenient truths about purportedly successful research. Remember how Marc Hauser treated his research assistants? Remember how that psychologist applied the “terrorists” label to people who were pointing out errors in published research? There’s a good reason that Pubpeer allows anonymous comments. Not all scientists respond well to criticism; some will attack anyone who they see as devaluing their brand.

## Boston Stan meetup 12 June!

Shane Bussmann writes to announce the next Boston/Camberville Stan users meetup, Tuesday, June 12, 2018, 6:00 PM to 9:00 PM, at Insight Data Science Office, 280 Summer St., Boston:

To kick things off for our first meetup in 2018, I [Bussmann] will give a talk on rating teams in recreational ultimate frisbee leagues. In this talk, I show how a Bayesian framework offers a simple, clear path to rating teams that has a number of benefits relative to alternative, more heuristic-based approaches. Specifically, the Bayesian framework (1) transparently incorporates strength of schedule into the ratings; (2) allows the use of priors to account for the fact that teams self-select into one of three divisions (i.e., skill levels); (3) makes model validation straightforward; and (4) can lead to fun topics like quantitatively predicting the outcome of the end-of-season tournament. I will present a Stan model that implements this Bayesian framework and apply it to data from the local ultimate frisbee league run by the Boston Ultimate Disc Alliance. I use ScalaStan (an open-source Scala DSL for Stan) to build and run the model and Evilplot (an open-source data visualization library written in Scala) to make plots.

Special thanks to CiBO Technologies (http://www.cibotechnologies.com/) for sponsoring this meetup and to Insight Data Science (https://www.insightdatascience.com/) for hosting the event!

There are interesting features in the above abstract. Adjusting for strength of schedule is the bread and butter of probabilistic item-response models. But we also see adjustment for self-selection, and validation, two items that are important for statistical modeling and workflow but often get overlooked. Also possibly of interest is the use of Scala.
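
To see how a model of this general kind picks up strength of schedule, here is a minimal sketch in Python. It is not the model from the talk (which is written in Stan and not shown here); it assumes a simple setup in which each game’s point differential is normal with mean equal to the difference in team ratings, and each rating has a normal prior centered at zero. The team names, scores, `sigma`, and `tau` are all invented for illustration. With those assumptions the posterior mode can be found by repeatedly updating one team at a time.

```python
from collections import defaultdict

def rate_teams(games, sigma=5.0, tau=5.0, sweeps=200):
    """Posterior-mode ratings under the assumed model
    point_diff ~ Normal(a[team_a] - a[team_b], sigma), a[team] ~ Normal(0, tau).

    games: list of (team_a, team_b, point_diff), point_diff = score_a - score_b.
    """
    shrink = (sigma / tau) ** 2  # how strongly the prior pulls ratings toward 0
    schedule = defaultdict(list)  # team -> list of (opponent, signed point diff)
    for team_a, team_b, diff in games:
        schedule[team_a].append((team_b, diff))
        schedule[team_b].append((team_a, -diff))
    ratings = {team: 0.0 for team in schedule}
    for _ in range(sweeps):
        # Coordinate-wise update: each team's rating is its average
        # (margin + opponent rating), shrunk toward the prior mean of 0.
        for team, played in schedule.items():
            total = sum(diff + ratings[opp] for opp, diff in played)
            ratings[team] = total / (len(played) + shrink)
    return ratings

games = [
    ("Strong", "Weak", 10),
    ("Strong", "Weak", 8),
    ("X", "Strong", 3),  # X won its only game by 3, against a strong team
    ("Y", "Weak", 3),    # Y won its only game by 3, against a weak team
]
ratings = rate_teams(games)
for team, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{team:>6}: {rating:+.2f}")
```

X and Y have identical records and margins, but X comes out ahead because each margin is added to the opponent’s rating. Division-level priors, as in item (2) of the abstract, could plausibly enter as nonzero prior means, though that is a guess about the approach and not something the abstract spells out.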

Too bad Phil and Aki (not to mention David MacKay) can’t make this one.

## How to reduce Type M errors in exploratory research?

Miao Yu writes:

Recently, I found this piece [a news article by Janet Pelley, Sulfur dioxide pollution tied to degraded sperm quality, published in Chemical & Engineering News] and the original paper [Inverse Association between Ambient Sulfur Dioxide Exposure and Semen Quality in Wuhan, China, by Yuewei Liu, published in Environmental Science & Technology].

Air pollution research is hot, especially for China. However, I think they might use a tricky way to do those studies. Typically, they collected many samples and analyzed many pollutants in those samples. Then just find the connection between one contaminant and environmental factor or diseases by checking the correlation among all compounds-environmental factor/disease pairs like this study. I have to say such template has been used a lot in environmental studies. Just a game of permutation and combination between thousands exposure factors (now we could detect them at one single run) and thousands of public concerns.

Since such observational study is actually hard to be really randomized, I am uncomfortable about those results. It seems we could use thousands of assumptions between compounds and environmental factor/disease and published the one with “significant” differences and shout at public press. Of course we could control for age, gender, smoking, BMI, etc. However, it’s just hard to control unknown unknown and just blame the known parts. Furthermore, type M error are also behind those study.

Are there suggestions to avoid those kind of errors or studies?
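
To make the concern in this letter concrete, here is a minimal simulation sketch in Python. Everything in it is assumed for illustration: 2000 exposure-outcome pairs, true effects drawn from a normal distribution with standard deviation 0.5, and each effect estimated with standard error 1. None of these numbers come from the study in question. The code keeps only the estimates with |z| > 1.96 and compares them to the true values.

```python
import random
import statistics

rng = random.Random(1)
n_pairs, true_sd, se = 2000, 0.5, 1.0  # all assumed, for illustration only

true_effects = [rng.gauss(0, true_sd) for _ in range(n_pairs)]
estimates = [theta + rng.gauss(0, se) for theta in true_effects]

# Keep only the "statistically significant" pairs, as a screening study would.
selected = [(theta, est) for theta, est in zip(true_effects, estimates)
            if abs(est / se) > 1.96]

mean_abs_true = statistics.fmean(abs(theta) for theta, _ in selected)
mean_abs_est = statistics.fmean(abs(est) for _, est in selected)
wrong_sign = statistics.fmean(1.0 if theta * est < 0 else 0.0
                              for theta, est in selected)

print(f"selected {len(selected)} of {n_pairs} pairs")
print(f"mean |estimate| among selected:    {mean_abs_est:.2f}")
print(f"mean |true effect| among selected: {mean_abs_true:.2f}")
print(f"share of selected with wrong sign: {wrong_sign:.2f}")
```

In this setup every selected estimate is at least 1.96 in absolute value while the true effects are mostly well under 1, so the published subset overstates the effects by construction. How large the exaggeration is in any real study depends on the true effect sizes, which are unknown.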

My reply: To start with, I’m not going to address this particular study, which happens to cost $40 to download. The effects of air pollution are an important topic but I think that for most of you there will be more interest in the general issue of how to learn from open-ended, exploratory studies.

So. The easiest answer to the “what to do in general” question is to simply separate the exploration and inference: use the exploratory data, in concert with theory, to come up with some hypotheses and then test them in a new preregistered study.

But I don’t like that answer because we want some answers now. I’m not saying we want certainty now, or near-certainty, or statistical significance—but we’d like to give our best estimates from the data we have; we don’t want to be using estimates that are clearly biased.

So what should be done? Here are some suggestions:

1. Forget about statistical significance. Publish all your results and don’t select the results that exceeded some threshold for special treatment. If you’re looking at associations between many different predictors and many different outcomes, show the correlations or coefficients in a big table.

2. Partially pool the estimates toward zero. This can be done using informative priors or with multilevel modeling. You can’t get selection bias down to 0 (the type M error depends on the unknown parameter value) but you can at least reduce it.

3. Control for age, gender, smoking, BMI, etc. (which I assume was done in the above-linked study). Adjusting for these predictors will not fix all your problems but, again, it seems like it’s going in the right direction.

The point is that whether we think of our goal as getting the best estimates to make decisions right now, or if we’re just considering this as an exploratory analysis—either way we want to learn as much from the data as possible, and correct for biases as much as we can.

## Does “status threat” explain the 2016 presidential vote?

Steve Morgan writes:

The April 2018 article of Diana Mutz, Status Threat, Not Economic Hardship, Explains the 2016 Presidential Vote, was published in the Proceedings of the National Academy of Sciences and contradicts prior sociological research on the 2016 election. Mutz’s article received widespread media coverage because of the strength of its primary conclusion, declaimed in its title. The current article is a critical reanalysis of the models offered by Mutz, using the data files released along with her article.

The title of Morgan’s article is, “Fake News: Status Threat Does Not Explain the 2016 Presidential Vote.”

What happened? According to Morgan:

Material interests and status threat are deeply entangled in her cross-sectional data and, together, do not enable a definitive analysis of their relative importance. . . . Her panel-data model of votes, which she represents as a fixed-effect logit model, is, in fact, a generic pooled logit model.

And, the punch line:

In contrast, the sociological literature has offered more careful interpretations, and as such provides a more credible interpretation of the 2016 election.

Mutz, like me, is a political scientist. Morgan is a sociologist.

So, what we have here is:

1. A technical statistical dispute

2. . . . about the 2016 election

3. . . . in PNAS.

Lots to talk about. I’ll discuss each in turn.

But, before going on, let me just point out one thing I really like about this discussion, which is that, for once, it doesn’t involve scientific misconduct, and it doesn’t involve anyone digging in and not admitting ambiguity. Mutz made a strong claim in a published paper, and Morgan’s disputing it on technical grounds. That’s how it should go. I’m interested to see how Mutz replies. She may well argue that Morgan is correct, that the data are more ambiguous than implied in her article, but that she favors her particular explanation on other grounds. Or she might perform further analyses that strengthen her original claim. In any case, I hope she can share her (anonymized) raw data and analyses.

1. The technical statistical dispute

In his article, Morgan writes:

A first wave of sociological research on the 2016 presidential election has now been published, and a prominent theme of this research is the appeal of Trump’s campaign to white, working-class voters. Analyses of Obama-to-Trump voters, along with the spatial distribution of votes cast, are both consistent with the claim that white, working-class voters represented the crucial block of supporters who delivered the electoral college victory to Trump . . .

The overall conclusion of this first wave of sociological research would appear to be that we have much more work to do in order to understand why so many white voters supported Trump. And, although we may never be able to definitively decompose the sources of their support, three primary motives deserve further scrutiny: material interests, white nativism, and the appeal of the Trump persona. At the same time, it remains to be determined how much of the animus toward his competitor – Hillary Clinton – was crucial to his success . . .

From a statistical perspective, Morgan is saying that there’s collinearity between three dimensions of variation among white voters: (a) lack of education, (b) nativism, and (c) opposition to Clinton. He and Mutz both focus on (a) and (b), so I will too. The quick summary is that Mutz is saying it’s (b) not (a), while Morgan is saying that we can’t disentangle (a) from (b).

Mutz’s argument derives from an analysis of a survey where people were interviewed during the 2012 and 2016 campaigns. She writes:

It could be either, but this study shows that the degree to which you have been personally affected had absolutely no change between 2012 and 2016. It’s a very small percentage of people who feel they have been personally affected negatively. It’s not that people aren’t being hurt, but it wasn’t those people who were drawn to support Trump.
When you look at trade attitudes, they aren’t what you’d expect: It’s not whether they were in an industry where you were likely to be helped or hurt by trade. It’s also driven by racial attitudes and nationalistic attitudes—to what extent do you want to be an isolationist country? Trade is not an economic issue in terms of how the public thinks about it.

Morgan analyzes three of Mutz’s claims. Here’s Morgan:

Question 1: Did voters change their positions on trade and immigration between 2012 and 2016, and were they informed enough to recognize that Trump’s positions were much different than Romney’s, in comparison to Clinton’s and Obama’s?

Answer: Voters did change their positions on trade and immigration, but only by a small amount. They were also informed enough to recognize that the positions of Trump and Clinton were very different from each other on these issues, and also in comparison to the positions of Romney and Obama four years prior. On this question, Mutz’s results are well supported under reanalysis, and they are a unique and valuable addition to the literature . . .

Question 2: Can the relative appeal of Trump to white voters with lower levels of education be attributed to status threat rather than their material interests?

Answer: No. . . .

Question 3: Do repeated measures of voters’ attitudes and policy priorities, collected in October of 2012 and 2016, demonstrate that status threat is a sufficiently complete explanation of Trump’s 2016 victory?

Answer: No. . . .

The key points of dispute, clearly, are questions 2 and 3.

Question 2 is all about the analysis of Mutz’s cross-sectional dataset, a pre-election poll from 2016.
Mutz presents an analysis predicting Trump support (measured by relative feeling thermometer responses or vote intention), given various sets of predictors:

A: education (an indicator for “not having a college degree”)

B: five demographic items and party identification

C: four economic items (three of which are about the respondent’s personal economic situation and one of which is an attitude question on taxes and spending)

D: eight items on status threat (four of which represent personal beliefs and four of which are issue attitudes).

For each of feeling thermometer and vote intention, Mutz regresses on A+B, A+B+C, and A+B+C+D. What she finds is that the predictive power of the education indicator is high when regressing on A+B, remains high when regressing on A+B+C, but decreases mostly to zero when regressing on A+B+C+D. She concludes that the predictive power of education is not explained away by economics, but is largely explained away by status threat. Hence the conclusion that it is status threat, not economics, that explains the Trump shift among low-education whites.

Morgan challenges Mutz’s analysis in four ways.

First, he would prefer not to include party identification as a predictor in the regression because he considers it to be an intermediate outcome: there are some voters who will react to Clinton and Trump and change their affiliation. My response is: sure, this can happen, but my impression from looking at such data for many years is that party ID is pretty stable, even during an election campaign, and it predicts a lot. So I’m OK with Mutz’s decision to treat it as a sort of demographic variable. Morgan does his analyses both ways, with and without party ID, so I’ll just look at his results that include this variable.

Second, Morgan would prefer to just restrict the analysis to white respondents, rather than including ethnicity as a regression predictor. I agree with him on this one. There were important shifts in the nonwhite vote, but much of that is, I assume, due to Barack Obama not being on the ticket, and this is not what’s being discussed here. So, for this discussion, I think the analyses should be restricted to whites.

Morgan’s third point has to do with Mutz’s characterization of the regression predictors. He relabels Mutz’s economic indicators as “material interests.” That’s no big deal given that in her paper Mutz labeled them as measures of “being left behind with respect to personal financial wellbeing,” which sounds a lot like “material interests.” But Morgan also labels several of Mutz’s “status threat” variables as pertaining to “material interests” and “foreign policy.” When he does the series of regressions, Morgan finds that the material interest and foreign policy variables explain away almost all the predictive power of education (that is, the coefficient of “no college education” goes to nearly zero after controlling for these other predictors), and that the remaining status threat variables don’t reduce the education coefficient any further. Thus, repeating Mutz’s analysis but including the predictors in a different order leads to a completely different conclusion. I find Morgan’s batching to be as convincing as Mutz’s, and considering that Morgan is making a weaker statement (all he’s aiming to show is that the data are ambiguous regarding the material-interest-or-status-threat question), I think he’s done enough to make that case.

Morgan’s fourth argument is that, in any case, this sort of analysis—looking at how a regression coefficient changes when throwing in more predictors—can’t tell us much about causality without a bunch of strong assumptions that have not been justified here. I agree: I always find this sort of analysis hard to understand. In this case, Morgan doesn’t really have to push on this fourth point because his reanalysis suggests issues with Mutz’s regressions on their own terms.

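
A toy simulation shows how the order of entry can drive this kind of result. The sketch below, in Python, uses made-up data and is not a reanalysis of Mutz’s survey or of Morgan’s models. It assumes, purely for illustration, that two blocks of predictors (labeled `material` and `status` here) are both noisy measurements of a single underlying attitude that is related to education. The small `ols` helper is written out only to keep the example dependency-free.

```python
import random

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = [[1.0, *row] for row in zip(*columns)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(2)
n = 5000
educ = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]  # 1 = no college degree
latent = [e + rng.gauss(0, 1) for e in educ]          # one underlying attitude (assumed)
material = [v + rng.gauss(0, 0.3) for v in latent]    # block labeled "material interests"
status = [v + rng.gauss(0, 0.3) for v in latent]      # block labeled "status threat"
support = [v + rng.gauss(0, 1) for v in latent]       # outcome

# Index 1 is the education coefficient (index 0 is the intercept).
b_alone = ols([educ], support)[1]
b_material = ols([educ, material], support)[1]
b_status = ols([educ, status], support)[1]
b_both = ols([educ, material, status], support)[1]

print(f"education only:        {b_alone:.2f}")
print(f"+ material block:      {b_material:.2f}")
print(f"+ status block:        {b_status:.2f}")
print(f"+ both blocks:         {b_both:.2f}")
```

Whichever block is entered first takes nearly all the credit for “explaining away” education, and the second block adds little. In data like these, the sequence of regressions cannot say which label is the right one.
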
Question 3 is the panel-data analysis, about which Mutz writes:

Because the goal is understanding what changed from 2012 to 2016 to facilitate greater support for Trump in 2016 than Mitt Romney in 2012, I estimate the effects of time-varying independent variables to determine whether changes in the independent variables produce changes in candidate choice without needing to fully specify a model including all possible influences on candidate preference. Significant coefficients thus represent evidence that change in an independent variable corresponds to change in the dependent variable at the individual level.

Morgan doesn’t buy this claim, and neither do I. At least, not in general.

Also this:

To examine whether heightened issue salience accounts for changing preferences, I include in the models measures of respondents’ pre-Trump opinions on these measures interacted with a dichotomous wave variable. These independent variable by wave interactions should be significant to the extent that the salience of these issue opinions was increased by the 2016 campaign, so that they weighed more heavily in individual vote choice in 2016 than in 2012. For example, if those who shifted toward Trump in 2016 were people who already opposed trade in 2012 and Trump simply exploited those preexisting views for electoral advantage, this would be confirmed by a significant interaction between SDO and wave.

I dunno. This to me puts a big burden on the regression model that you’re fitting. I think I’d rather see these sorts of comparisons directly, than fit a big regression model and start looking at statistically significant coefficients.

And this:

To the extent that changes in issue salience are responsible for changing presidential preferences, the interaction coefficients should be significant; to the extent that changing public opinions and/or changing candidate positions also account for changing presidential vote preferences, the coefficients corresponding to these independent variables will be significant.

Again, I’m skeptical. The z-score, or p-value, or statistical significance of a coefficient is a noisy random variable. It’s just wrong to equate statistical significance with reality. It’s wrong in that the posited story can be wrong and you can still get statistical significance (through a combination of model misspecification and noise), and in that the posited story can be right and you can still not get statistical significance (again, model misspecification and noise). I know this is how lots of people do social science, but it’s not how statistics works.

That said, this does not mean that Mutz is wrong in her conclusions. Although it is incorrect to associate statistical significance with a causal explanation, it’s also the case that Mutz is an expert in political psychology, and I have every reason to believe that her actual data analysis will be reasonable. I have not tried to follow what she did in detail. I read her article and Morgan’s critique, and what I got out of this is that there are lots of ways to use regression to analyze opinion change in this panel survey, and different analyses lead to different conclusions, which again seems to support Morgan’s claim that we can’t at this time separate these different stories of the white voters who switched from Obama to Trump. Again, Mutz’s story could be correct but it’s not so clearly implied by the data.

2. The 2016 election

Mutz writes that “change in financial wellbeing had little impact on candidate preference. Instead, changing preferences were related to changes in the party’s positions on issues related to American global dominance and the rise of a majority–minority America,” but I find Morgan’s alternative analyses convincing, so for now I’d go with Morgan’s statement that available data don’t allow us to separate these different stories.

My own article (with Julia Azari) on the 2016 election is here, but we focus much more on geography than demography, so our paper isn’t particularly relevant to the discussion here.

All I’ll say on the substance here is that different voters have different motivations, and that individual voters can have multiple motivations on vote choice. As we all know, most voters went with their party ID in 2016 as in the past; hence it makes sense each year to explain what motivated the voters who went for the other party. Given the multiple reasons to vote in either direction (or, for that matter, to abstain from voting), I think we have to go beyond looking for single explanations for the behaviors of millions of voters.

3. PNAS

Morgan wrote, “Mutz’s article received widespread media coverage because of the strength of its primary conclusion.” I don’t think this was the whole story.

I think the key reason Mutz’s article received widespread media coverage was because it appeared in PNAS.

And here’s the paradox: PNAS seems to be considered by journalists to be a top scientific outlet—better than, say, the American Political Science Review or the American Journal of Sociology—even though, at least when it comes to social sciences, I think it’s edited a lot more capriciously than these subject-matter journals. (Recall air rage, himmicanes, and many others.) It’s not that everything published in PNAS is wrong—far from it!—but I wish the news media would pay a bit less attention to PNAS press releases and spend a bit more time on papers in social science journals.

One problem I have with PNAS, beyond all issues of what papers get accepted and publicized, is what might be called PNAS style. Actually, I don’t think it’s necessarily the style for physical and biological science papers in PNAS, maybe it’s just for social science. Anyway, the style is to make big claims. Even if, as an author, you don’t want to hype your claims, you pretty much have no choice if you want your social science paper to appear in Science, Nature, or PNAS. And, given that publication in these “tabloids” is likely to give your work a lot more influence, it makes sense to go for it. The institutional incentives favor exaggeration.

I hope that Morgan’s article, along with discussions by Mutz and others, is published in the APSR and gets as much, or more, media attention as the PNAS paper that motivated it.

## Aki’s favorite scientific books (so far)

A month ago I (Aki) started a series of tweets about “scientific books which have had big influence on me…”. They are partially in time order, but I can’t remember the exact order. I may have forgotten some, and some stretched the original idea, but I can recommend all of them.

I have collected all those book tweets below and fixed only some typos. These are my personal favorites, and there are certainly many great books I haven’t listed. Please share your own favorite books, with a short description of why you like them, in the comments.

I start to tweet about scientific books which have had big influence on me…

1. Bishop, Neural Networks for Pattern Recognition, 1995. The first book where I read about Bayes. I learned a lot about probabilities, inference, model complexity, GLMs, NNs, gradients, Hessian, chain rule, optimization, integration, etc. I used it a lot for many years. Looking again at the contents, it is still a great book although naturally some parts are a bit outdated.

2. Bishop (1995) referred to Neal, Bayesian Learning for Neural Networks, 1996, from which I learned about sampling in high dimensions, HMC, prior predictive analysis, evaluation of methods and models. Neal’s free FBM code made it easy to test everything in practice.

3. Jaynes, Probability Theory: The Logic of Science, 1996: I read this because it was freely available online. There is not much for practical work, but plenty of argumentation why using Bayesian inference makes sense, which I did find useful when I was just learning B. 15 years later I participated in a reading circle with mathematicians and statisticians going through the book in detail. The book was still interesting, but not that spectacular anymore. The discussion in the reading circle was worth it.

4. Gilks, Richardson & Spiegelhalter (eds), Markov Chain Monte Carlo in Practice (1996). Very useful introductions to different MCMC topics by Gilks, Richardson & Spiegelhalter Ch1, Roberts Ch3, Tierney Ch4, Gilks Ch5, Gilks & Roberts Ch6, Raftery & Lewis Ch7. And with special mentions to Gelman on monitoring convergence Ch8, Gelfand on importance-sampling leave-one-out cross-validation Ch9, and Gelman & Meng on posterior predictive checking Ch11. My copy is worn out from heavy use.

5. Gelman, Carlin, Stern, and Rubin, Bayesian Data Analysis, 1995. I just loved the writing style, and it had so many insights and plenty of useful material. During my doctoral studies I also did about 90% of the exercises as self-study. I considered using the first edition when I started teaching Bayesian data analysis, but I thought it was maybe too much for an introduction course, and it didn’t have model assessment and selection, which is important for me. This book (and its later editions) is the one I have re-read most, and when re-reading I keep finding things I didn’t remember being there (I guess I have a bad memory). I still use the last edition regularly, and I’ll get back to these later editions later.

6. Bernardo and Smith, Bayesian Theory, 1994. Great coverage (although not complete) of foundations and axioms of Bayesian theory with emphasis that actions and utilities are an inseparable part of the theory. They admit problems of theory in continuous space (which seem to not have a solution that would please everyone, even if it works in practice) and review general probability theory. They derive basic models from simple exchangeability and invariance assumptions. They review utility and discrepancy based model comparison and rejection with definitions of M-open, -complete, and -closed. This and Bernardo’s many papers had a strong influence on how I think about model assessment and selection (see, e.g. http://dx.doi.org/10.1214/12-SS102).

7. Box and Tiao, Bayesian Inference in Statistical Analysis, 1973. Wonderful book, if you want to see how difficult inference was before MCMC and prob. programming. Includes some useful models, and we used one of them as a prior in a neuromagnetic inverse problem http://becs.aalto.fi/en/research/bayes/brain/lpnorm.pdf

8. Jeffreys, Theory of Probability, 3rd ed, 1961. Another book with historical interest. The intro and estimation part are sensible. I was very surprised to learn that he wrote about all the problems of Bayes factor, which was not evident from the later literature on BF.

9. Jensen, An Introduction to Bayesian Networks, 1996. I’m travelling to Denmark, which reminded me about this nice book on Bayesian networks. It’s out of print, but Jensen & Nielsen, Bayesian Networks and Decision Graphs, 2007, seems to be a good substitute.

10. Dale, A History of Inverse Probability: From Thomas Bayes to Karl Pearson, 1991. Back to historically interesting books. Dale has done a lot of great research on the history of statistics. This one helps to understand the Bayesian-Frequentist conflict in the 20th century.
The conflict can be seen in, e.g., Lindley writing in 1968: “The approach throughout is Bayesian: there is no discussion of this point, I merely ask the non-Bayesian reader to examine the results and consider whether they provide sensible and practical answers”. McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked The Enigma Code, Hunted Down Russian Submarines, & Emerged Triumphant from Two Centuries of Controversy, 2011 is more recent and entertaining, but based also on much of Dale’s research.

11. Laplace, Philosophical Essay on Probabilities, 1825. English translation with notes by Dale, 1995. Excellent book. I enjoyed how Laplace justified the models and priors he used. Considering the clarity of the book, it’s strange how little these ideas were used before the 20th century.

12. Press & Tanur, The Subjectivity of Scientists and the Bayesian Approach, 2001. Many interesting and fun stories about progress of science by scientists being very subjective. Argues that the Bayesian approach at least tries to be more explicit on assumptions.

13. Spirer, Spirer & Jaffe, Misused Statistics, 1998. Examples of common misuses of statistics (deliberate or inadvertent) in graphs, methodology, data collection, interpretation, etc. Great and fun (or scary) way to teach common pitfalls and how to do things better.

14. Howson & Urbach, Scientific Reasoning: The Bayesian Approach, 2nd ed, 1999. Nice book on Bayesianism and philosophy of science: induction, confirmation, falsificationism, axioms, Popper, Lakatos, Kuhn, Cox, Good, and contrast to Fisherian & Neyman-Pearson significance tests. There are also 1st ed 1993 and 3rd ed 2005.

15. Gentle, Random Number Generation and Monte Carlo Methods, 1998, 2nd ed 2003. Great if you want to understand or implement: pseudo rng’s, checking quality, quasirandom, transformations from uniform, methods for specific distributions, permutations, dependent samples & sequences.

16. Sivia, Data Analysis. A Bayesian tutorial, 1996.
I started teaching a Bayesian analysis course in 2002 using this thin, very Jaynesian book, as it had many good things. Afterward I realized that it missed too much from the workflow that students need to do their own projects.

17. Gelman, Carlin, Stern, & Rubin, BDA2, 2003. This hit the spot. Improved model checking, new model comparison, more on MCMC, and new decision analysis made it at that time the best book for the whole workflow. I started using it in my teaching the same year it was published. Of course it still had some problems, like using DIC instead of cross-validation, effective sample size estimate without autocorrelation analysis, etc., but the additional material I needed to introduce in my course was minimal compared to what any other book would have required. My course included the chapters 1-11 and 22 (with varying emphasis), and I recommended for students to read other chapters.

18. MacKay, Information Theory, Inference, and Learning Algorithms, 2003. Super clear introduction to information theory and codes. Has also excellent chapters on probabilities, Monte Carlo, Laplace approximation, inference methods, Bayes, and ends up with neural nets and GPs. The book is missing the workflow part, but it has many great insights clearly explained. For example, in the Monte Carlo chapter, I love how MacKay tells when the algorithms fail and what happens in high dimensions. Before the 2003 version, I had been reading also drafts which had been available since 1997.

19. O’Hagan and Forster, Bayesian Inference, 2nd ed, vol 2B of Kendall’s Advanced Theory of Statistics, 2004. A great reference on all the important concepts in Bayesian inference. Fits well between BDA and Bayesian Theory, and one of my all-time favorite books on Bayes.
Covers, e.g., inference, utilities, decisions, value of information, estimation, likelihood principle, sufficiency, ancillarity, nuisance, non-identifiability, asymptotics, Lindley’s paradox, conflicting information, probability as a degree of belief, axiomatic formulation, … finite additivity, comparability of events, weak prior information, exchangeability, non-subjective theories, specifying probabilities, calibration, elicitation, model comparison (a bit outdated), model criticism, computation (MCMC part is a bit outdated), and some models…

20. Rasmussen and Williams, Gaussian Processes for Machine Learning, 2006. I was already familiar with GPs through many articles, but this became a much-used handbook and course book for us. The book is exceptional in that it also explains how to implement stable computation. It has a nice chapter on Laplace approximation and expectation propagation conditional on hyperparameters, but has only Type II MAP estimate for hyperparameters. It has an ML flavor overall, and I know statisticians who have difficulties following the story. The book was very useful when writing GPstuff. It’s also available free online.

21. Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, 2006. I was already familiar with the models and methods in the book, but I loved how it focused on how to think about models, modeling and inference, using many examples to illustrate concepts. The book starts from simple linear models and has the patience to progress slowly, not going too early into details of computation, and it works surprisingly well even if Bayesian inference comes only after 340 pages. Gaussian linear model, logistic regression, generalized linear models, simulation, model checking, causal inference, multilevel models, Bugs, Bayesian inference, sample size and power calculations, summarizing models, ANOVA, model comparison, missing data.
I recommended the book to my students after BDA2 and O’Hagan & Forster, as it seemed to be a good and quick read for someone who knows how to do the computation already, but I couldn’t see how I would use it in teaching as Bayesian inference comes late and it was based on BUGS! More recently re-reading the book, I still loved the good bits, but also was shocked to see how much it was encouraging to wander around in a garden of forking paths. AFAIK there is a new edition in progress which updates it to use more modern computation and model comparison.

22. Harville, Matrix Algebra From a Statistician’s Perspective, 1997. 600 pages of matrix algebra with focus on that part of matrix algebra commonly used in statistics. Great book for people implementing computational methods for GPs and multivariate linear models. Nowadays with the Matrix Cookbook online, I use it less often to check simpler matrix algebra tricks, but my students still find it useful as it goes deeper and has more derivations in many topics.

23. Gelman and Nolan, Teaching Statistics: A Bag of Tricks, 2002 (2nd ed 2017). A large number of examples, in-class activities, and projects to be used in teaching concepts in an intro stats course. I’ve used ideas from different parts and especially from the decision analysis part.

24. Abrams, Spiegelhalter & Myles, Bayesian Approaches to Clinical Trials and Health-Care Evaluation, 2004. This was a helpful book to learn basic statistical issues in clinical trials and health-care evaluation, and how to replace “classic” methods with Bayesian. Medical trials, sequential analysis, randomised controlled trials, ethics of randomization, sample-size assessment, subset and multi-center analysis, multiple endpoints and treatments, observational studies, meta-analysis, cost-effectiveness, policy-making, regulation, …

25. Ibrahim, Chen & Sinha, Bayesian Survival Analysis, 2001. The book goes quickly to the details of model and inference and thus is not an easy one.
There has been a lot of progress in models and inference afterwards, but it’s still a very valuable reference on survival analysis.

26. O’Hagan et al, Uncertain Judgments: Eliciting Experts’ Probabilities, 2006. A great book on the very important but too often ignored topic of eliciting prior information. A must read for anyone considering using (weakly) informative priors. The book reviews psychological research that shows, e.g., how the form of the questions affects the experts’ answers. The book also provides recommendations on how to do better elicitation and how to validate the results of elicitation. Uncertainty & the interpretation of probability, aleatory & epistemic, what is an expert?, elicitation process, the psychology of judgment under uncertainty, biases, anchoring, calibration, representations, debiasing, elicitation, evaluating elicitation, multiple experts, …

27. Bishop, Pattern Recognition and Machine Learning, 2006. It’s quite different from the 1995 book, although it covers mostly the same models. For me there was not much new to learn, but my students have used it a lot as a reference, and I also enjoyed the explanations of VI and EP. Based on the contents and the point of view, the name of the book could also be “Probabilistic Machine Learning”.

Due to the theme “influence on me”, it happened that all books I listed were published 2006 or earlier. After that I’ve seen great books, but those have had less influence on me. I may later make a longer list of more recent books I can recommend, but here are some as a bonus:

• McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, 2012. Entertaining book about the history of Bayes theory.

• Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, Bayesian Data Analysis, 3rd ed, 2013. Obviously a great update of the classic book.

• Särkkä, Bayesian Filtering and Smoothing, 2013.
A concise introduction to non-linear Kalman filtering and smoothing, particle filtering and smoothing, and to the related parameter estimation methods from the Bayesian point of view.

• McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan, 2015. Easier than BDA3 and well written. I don’t like how the model comparison is presented, but after reading this book, just check my related articles which were mostly published after this book.

• Goodfellow, Bengio & Courville, Deep Learning, 2016. This deep learning introduction has enough probabilistic view that I also can recommend it.

• Stan Development Team, Stan Modeling Language: User’s Guide and Reference Manual, 2017. It’s not just a Stan language manual, it’s also full of well written text about Bayesian inference and models. There is a plan to divide this in parts, and one part would make a great text book.

I’ve read more than these and the list was just the ones I enjoyed most. I think people, and also I, read fewer books now when it’s easier to find articles, case studies, and blog posts on the internet. Someday I’ll make a similar list for the top papers I’ve enjoyed.

## Could you say that again less clearly, please? A general-purpose data garbler for applications requiring confidentiality

Ariel Rokem pointed me to this Python program by Bill Howe, Julia Stoyanovich, Haoyue Ping, Bernease Herman, and Matt Gee that will take your data matrix and produce a new data matrix that has the same size, shape, and general statistical properties but with none of the same actual numbers.

The use case is when you want to give your data to someone to play around with some analyses, but the data themselves are proprietary or confidential.

This is different from the scenario such as with the Census where they create a synthetic dataset that adds a little bit of noise or takes out some observations to preserve confidentiality but otherwise is intended to give the right answers.
Here, there’s no aim to use the fake data to perform real applied analyses; it’s all about creating something similar that people can work with, for example to prototype their data analysis plans.

Howe et al. call their program DataSynthesizer but if it were up to me it would just be called Garbler.

Here’s the paper describing what the program actually does. The statistical procedures used in the garbling are nothing fancy, but that’s fine: just as well to keep it simple, given the simplicity of the goals.

The title of their paper is “Synthetic Data for Social Good,” but I don’t really understand that last part: it seems to me that Garbler, like other statistical tools, could be used for good or bad.

## “16 and Pregnant”

Ted Joyce writes:

In December 2015 the AER published an article, “Media Influences on Social Outcomes: The Impact of MTV’s 16 and Pregnant on Teen Childbearing,” by Melissa Kearney and Phil Levine [KL]. The NBER working paper of this article appeared in January of 2014. It received huge media attention as the authors claimed the show was responsible for 25% of the decline in teen childbearing from 2008-2010. . . .

Joyce and his colleagues [David Jaeger, Robert Kaestner] were skeptical:

So we buy the Nielsen data and we acquire the confidential birth certificate data and we have a go. Attached is the second version of that effort. . . . In brief, KL are trying to identify the effect of a nationally broadcast program on teen fertility amidst large secular trends and at the nadir of the Great Recession. There is no time variation in the show across units and their IV strategy lacks anything resembling an exogenous instrument. Even a DD interpretation fails the parallel trend assumption. We feel it was their use of Twitter data and Google searches that captured the imagination of the reviewers, but here too we show all their social media results collapse as soon as we use all THEIR data and not just a selected sample.
The paper by Jaeger, Joyce, and Kaestner is called, “Did Reality TV Really Cause a Decline in Teenage Childbearing? A Cautionary Tale of Evaluating Identifying Assumptions,” and here’s their key claim:

We find that controlling for differential time trends in birth rates by a market’s pre-treatment racial/ethnic composition or unemployment rate cause Kearney and Levine’s results to disappear, invalidating the parallel trends assumption necessary for a causal interpretation. Extending the pre-treatment period and estimating placebo tests, we find evidence of an “effect” long before 16 and Pregnant started broadcasting.

I have not had a chance to read any of these papers in detail so I’m just presenting the controversy here without any endorsement (or anti-endorsement) of Jaeger, Joyce, and Kaestner’s argument.

When these sorts of things come up on the blog, some readers feel a bit cheated, that I’m bringing up this live issue and not giving my own take on it. All I can say is that some takes are easier to come by than others, and I think there’s value in all sorts of posts. Sometimes I can do a careful investigation of my own or follow a debate closely enough to have an informed opinion, sometimes I can share a controversy and let the experts chew on it. Science is full of uncertainty and turmoil and it’s not so bad to sometimes present a scientific dispute without trying to resolve it myself. Also, this one’s about causal inference, which is one of our core blog topics.

## Slow to update

This post is a placeholder to remind Josh Miller and me to write our paper on slow updating in decision analysis, with the paradigmatic examples being pundits being slow to update their low probabilities of Leicester City and Donald Trump in 2016.

We have competing titles for this paper.
Josh wants to call it, “The past as a reference point: ex-ante thinking and the slowness to update,” but I prefer “The past is another country and it’s hard to get from there to here.”

Or, I guess, “The past is another country.” That by itself is a fine title.

## Evaluating Sigmund Freud: Should we compare him to biologists or economists?

This post is about how we should think about Freud, not about how we should think about biology or economics.

So. There’s this whole thing about Sigmund Freud being a bad scientist. Or maybe I should say a bad person and a terrible scientist. The “bad person” thing isn’t so relevant, but the “terrible scientist” thing is that, in a sort of perverse reversal of Popperian reasoning, he falsified his data to fit his theories.

This came up in a recent book by Frederick Crews (see, for example, this review by George Prochnik), but it’s not news to anyone with any awareness that Freud had major, possibly fatal flaws as an empirical scientist, at least by the standards to which we hold empirical science today, and even in the time of Darwin.

But . . . maybe we’re using the wrong standard of comparison. Maybe, instead of comparing Freud to biologists such as Darwin, or psychologists such as Piaget or Pavlov, we should be comparing him to economists such as Adam Smith or David Ricardo who, like Freud, set up broad theoretical frameworks that promised to explain vast otherwise seemingly disconnected aspects of life.

OK, so Freud altered the story of Dora to make a point he wanted to make. Well, what about Adam Smith and the pin factory? That’s a stylized story too, right? Smith was more honest because he didn’t pretend he was documenting a particular story, and honesty is central to science. My point here is not that Freud was an exemplary scientist, or an exemplary social scientist, but rather that his theories have a similar logical status to classic economic theories, and a similar appeal.
The problem is, one could say, that Freud is presenting social-science-style theorizing as if it’s biological science. Imagine if some economist of Freud’s era had said that there was a “utility organ” in the brain. That would be pretty silly.

So, from that perspective, perhaps Freud’s big problem was that he was making a sort of category error. Strip away the biological trappings and Freud is a social scientist of 1900 vintage. To criticize Freud for being unscientific would make no more sense than criticizing Adam Smith on these grounds: both are builders of frameworks.

This is not to shield Freud from criticism, just to say that he can and should be criticized on the specifics, and his framework can be evaluated on its utility, but without thinking that just cos it’s not biology-style science, that it’s necessarily useless.

Try thinking this another way, as a sort of family tree of scientific inquiry starting with Newton, Kepler, etc., who were systematizers, coming up with general laws of motion. From this tree, one branch leads to the social scientists who are, in a sense, the followers of the theoretical physicists, coming up with models of the world, the social equivalents of string theory. Another branch leads to the biologists who are testing their theories with data and being all Popperian. This split is not clean (biology has sweeping theories and some social science theories can be tested) but I think there’s something to this idea.

P.S. Also this (racism is a framework, not a theory) and this (Economics now = Freudian psychology in the 1950s).

## Should Berk Özler spend $2 million to test a “5 minute patience training”?

Berk Özler writes:

Background: You receive a fictional proposal from a major foundation to review. The proposal wants to look at the impact of 5 minute “patience” training on all kinds of behaviors. This is a poor country, so there are no admin data. They make the following points:

A. If successful, this is really zero cost to roll out—it’s just pushed through smart phones. Therefore, the cost of the program can be modelled as approximately zero. It falls in the “letters to people” kind of stuff.

B. However, they want to show the impact of this on a whole bunch of things. They can check take-up because they know who clicks through and goes through everything, and they expect it to be low.

C. Given take-up and expected impacts, they argue that a fairly small impact could have quite a large effect.

D. But here is the rub: For the experiment they will need $2 million for data collection. They need to survey XY,000 households and do very long surveys on everything you can think of 4 times. Having carefully read all the stuff on p-values, this is the power they need to detect a 1% increase in savings. The combination of a “letter to people” type experiment with (a) lack of admin data and (b) desire to show effects on all kinds of things essentially blows the budget. I am quite worried about this p-values and power stuff, since (a) on the one hand it’s good that we don’t give too much credence to small sample studies with large effects because of publication bias and (b) on the other hand, if this is going to be interpreted as “powering up” what are essentially stupid interventions, that’s not a great direction to go either. The problem is that without a theoretical framework, it’s harder to be ex ante sanguine about what is “stupid” (agnostic may be a better phrase, since it *might* turn up something). My instinct here, motivated by the Bayesian optimal sampling literature, would be to say that they should first try this with 100 households for $500 and see what the effects are AND publish the effects. There should be an optimal sequence of experiments that leads to scale-up as positive results arise. In short, publication bias implies a preference for small sample size experiments with big effects, which are probably false. But this should cause us to solve the publication bias problem, NOT create a further distortion by powering up stupidity. Ramsay second best is not going to work here. Unfortunately, the math even in the simple case with single sequential samples is a complete nightmare, but wondering if there is a simpler way to explain this.

I’ll respond to these questions in reverse order:

– To address the last sentence in the above quote: no, the math is not a nightmare here at all. There’s no “math” at all to worry about here: just gather the data, and then, in your analysis, include as predictors all variables that are used in the data collection. With a sequential design, just include time as a predictor in your model. This general issue actually came up in a recent discussion.

To do this and get reasonable results, you’ll want to do a reasonable analysis: don’t aim for or look at p-values, Bayes factors, or “statistical significance”: just fit a serious regression model, using relevant prior information to partially pool effect size estimates toward zero. And of course commit to publishing your results no matter what shows up.

– I’m not quite sure what is meant by “a 1% increase in savings”? This is a poor country; are these cash savings? Who’s doing the saving? Are you saying that the savers will save 1% more? Or that 1% more people will save? These questions are relevant to who you target the intervention to.

– I don’t have a good sense of where this $2 million cost is coming from. I guess $2 million is a bargain if the intervention really works. I don’t have a good sense of whether you think it will really work.

– One way to get a handle on the “how effective is the intervention?” question is to consider a large set of possible interventions. You have this “5 minute patience training,” whatever that is. But there must be a lot of other ideas out there that are similarly zero-cost and somehow vaguely backed by previous theory and experiment. Would it make sense to spend $2 million on each of these? This is not a rhetorical question: I’m really asking. If there are 10 possible interventions, would you do a set of studies costing $20 million? Or is the idea that any of these 10 interventions would work, but their effects would not be additive, so you just want to find one good one and it doesn’t matter which one it is?

– A related point is that interventions are compared to alternative courses of action. What are people currently doing? Maybe whatever they are currently doing is actually more effective than this 5 minute patience training?

Anyway, the good news is that you don’t need to worry about “the math.” I think all the difficulty here comes in thinking about the many possible interventions you might do.
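
To make the advice above concrete (fit a regression that includes the design variables, such as time, and then partially pool the effect estimate toward zero), here is a minimal sketch. Everything in it is an assumption made for illustration: the data are simulated, the number of waves, sample sizes, true effect, and the prior scale `prior_sd` are made up, and the pooling step is a simple normal prior centered at zero combined with a normal approximation to the least-squares estimate, not a full multilevel model.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))


def shrink_toward_zero(estimate, se, prior_sd):
    """Posterior mean and sd of an effect under a normal(0, prior_sd) prior,
    treating the estimate as normal(effect, se)."""
    weight = prior_sd**2 / (prior_sd**2 + se**2)
    return weight * estimate, np.sqrt(weight) * se


# Hypothetical sequential design: 4 survey waves of 100 households each.
rng = np.random.default_rng(2018)
n_waves, n_per_wave = 4, 100
wave = np.repeat(np.arange(n_waves), n_per_wave)
treated = rng.integers(0, 2, size=wave.size)

# Simulated outcome: a time trend, a small assumed treatment effect, and noise.
true_effect = 0.01
y = 1.0 + 0.05 * wave + true_effect * treated + rng.normal(0, 0.5, size=wave.size)

# Regression with an intercept, the treatment indicator, and time (wave).
X = np.column_stack([np.ones(wave.size), treated, wave])
beta, se = fit_ols(X, y)

# Partially pool the treatment coefficient toward zero (prior_sd is assumed).
post_mean, post_sd = shrink_toward_zero(beta[1], se[1], prior_sd=0.02)
print(f"raw estimate:    {beta[1]:.3f} (se {se[1]:.3f})")
print(f"pooled estimate: {post_mean:.3f} (sd {post_sd:.3f})")
```

When the standard error is large relative to the prior scale, the pooled estimate sits much closer to zero than the raw one, so a small, noisy pilot cannot by itself produce a large claimed effect.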