Also holding back progress are those who make mistakes and then label correct arguments as “nonsensical.”

Here’s James Heckman in 2013:

Also holding back progress are those who claim that Perry and ABC are experiments with samples too small to accurately predict widespread impact and return on investment. This is a nonsensical argument. Their relatively small sample sizes actually speak for — not against — the strength of their findings. Dramatic differences between treatment and control-group outcomes are usually not found in small sample experiments, yet the differences in Perry and ABC are big and consistent in rigorous analyses of these data.

Wow. The “What does not kill my statistical significance makes it stronger” fallacy, right there in black and white. This one’s even better than the quote I used in my blog post. Heckman’s pretty much saying that if his results are statistically significant (and “consistent in rigorous analyses,” whatever that means), they should be believed—and even more so if sample sizes are small (and of course the same argument holds in favor of stronger belief if measurement error is large).

With the extra special bonus that he’s labeling contrary arguments as “nonsensical.”

I agree with Stuart Buck that Heckman is wrong here. Actually, the smaller sample sizes (and also the high variation in these studies) speak against—not for—the strength of the published claims.
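To see why, here’s a minimal simulation sketch in Python (made-up numbers, not the Perry or Abecedarian data): with a small sample and a modest true effect, the estimates that happen to reach statistical significance are, on average, gross overestimates. This is the type M (magnitude) error problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up numbers for illustration: a modest true treatment effect, lots of
# person-to-person variation, and a small sample in each arm.
true_effect = 0.1      # in standard-deviation units
sigma = 1.0
n_per_arm = 30
n_sims = 100_000

se = sigma * np.sqrt(2 / n_per_arm)          # standard error of the difference in means
est = rng.normal(true_effect, se, n_sims)    # sampling distribution of the estimate

significant = np.abs(est) > 1.96 * se
print(f"share of replications reaching significance: {significant.mean():.2f}")
print(f"average |estimate| among the significant ones: "
      f"{np.abs(est[significant]).mean():.2f} (true effect: {true_effect})")
# The significant estimates average several times the true effect; the smaller
# the sample (the larger the standard error), the worse the exaggeration.
```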

Hey, we all make mistakes. Selection bias is a tricky thing, and it can confuse even some eminent econometricians. What’s important is that we can learn from them, as I hope Heckman and his collaborators have learned from the savvy critiques of Stuart Buck and others.

mc-stan.org down & single points of failure

[update: back up. whew. back to our regularly scheduled programming.]

[update: just talked to our registrar on the phone and they say it’ll probably take an hour or two for the DNS to catch up again, but then everything should be OK. I would highly recommend PairNIC—their support was awesome.]

mc-stan.org is down because I (Bob) forgot to renew the domain. I just renewed it, so at least we won’t lose it. Hopefully everything will start routing again—the content is still there to be served on GitHub.

5-year anniversary

It was registered 5 years ago, roughly concurrently with our first release in August 2012. Time sure flies.

It just lapsed today, but I was able to renew. We’ll get this under better management going forward with the rest of our assets.

Backup web site?

I don’t know if we can run mc-stan.github.io in the meantime—we should be able to, but I don’t know how to configure that or what the switching back and forth times are, so it may not be worth it.

Single points of failure

This is really terrible for us since almost everything we do is now run out of the web site (other than GitHub, so we can keep coding, but our users can’t get anything).

The Pandora Principle in statistics — and its malign converse, the ostrich

The Pandora Principle is that once you’ve considered a possible interaction or bias or confounder, you can’t un-think it. The malign converse is when people realize this and then design their studies to avoid putting themselves in a position where they have to consider some potentially important factor.

For example, suppose you’re considering some policy intervention that can be done in several different ways, or conducted in several different contexts. The recommended approach is, if possible, to try out different realistic versions of the treatments in various realistic scenarios; you can then estimate an average treatment effect and also do your best to estimate variation in the effect (recognizing the difficulties inherent in that famous 1/16 efficiency ratio). An alternative, which one might call the reverse-Pandora approach, is to do a large study with just a single precise version of the treatment. This can give a cleaner estimate of the effect in that particular scenario, but to extend it to the real world will require some modeling or assumption about how the effect might vary.

Going full ostrich here, one could simply carry over the estimated treatment effect from the simple experiment and not consider any variation at all. The idea would be that if you’d considered two or more flavors of treatment, you’d really have to consider the possibility of variation in effect, and propagate that into your decision making. But if you only consider one possibility, you could ostrich it and keep Pandora at bay. The ostrich approach might get you a publication and even some policy inference, but it’s bad science and, I think, bad policy.
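For readers who haven’t seen the 1/16 argument, here’s the arithmetic in a stylized sketch (Python; a balanced two-arm design split into two equal subgroups, not any particular study):

```python
import numpy as np

# Stylized design: n subjects split evenly into treatment and control, and
# also evenly into two subgroups; outcome noise has standard deviation sigma.
sigma, n = 1.0, 1000.0

se_main = np.sqrt(sigma**2 / (n / 2) + sigma**2 / (n / 2))   # = 2*sigma/sqrt(n)
se_inter = np.sqrt(4 * sigma**2 / (n / 4))                   # = 4*sigma/sqrt(n)
print(se_inter / se_main)   # 2.0: the interaction is estimated half as precisely

# If a plausible interaction is also only half the size of the main effect,
# its z-score is 1/4 as large, so you'd need 16 times the sample size to learn
# about it as precisely, relative to its size -- hence the 1/16 efficiency.
```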

That said, there’s no easy answer, as there will always be additional possible confounding factors that you will not have been able to explore. That is, among all the scary contents of Pandora’s box, one thing that flies out is another box, and really you should open that one too . . . that’s the Cantor principle, which we encounter in so many places in statistics.

tl;dr: You can’t put Pandora back in the box. But really she shouldn’t’ve been trapped in there in the first place.

“Social Media and Fake News in the 2016 Election”

Gur Huberman asks what I think about this paper, “Social Media and Fake News in the 2016 Election,” by Hunt Allcott and Matthew Gentzkow.

I haven’t looked at it in detail, but my quick thought is that they’re a bit too “mechanistic,” as the effect of fake news is not just the belief in each individual story but also the discrediting of the whole idea of independent sources of news.

Gur adds: “Plus they are estimating a weak effect on (probably) a small subset of the voters who tipped the election.”

The above is not intended as a “debunking,” just some concern about how these very specific findings can be overinterpreted.

The Westlake Review

I came across this site one day:

The Westlake Review is a blog dedicated to doing a detailed review and analysis of every novel Donald Westlake published under his own name, as well as under a variety of pseudonyms. These reviews will reveal major plot elements, though they will not be full synopses. People who have not read a book being reviewed here should bear that in mind before proceeding. Some articles will be more general in their focus, analyzing aspects of Westlake’s writing, and in some cases of authors he was influenced by, or has influenced in turn. There will also be reviews of film adaptations of his work.

It’s been going since 2014! Westlake wrote a lot of books, so I guess they can keep going for awhile.

My favorite Westlake is Killing Time, but I also like Memory. And then there’s The Axe. And Slayground’s pretty good too. And Ordo, even if it’s kind of a very long extended joke on the idea of murder.

Overall, I do think there’s a black hole at the center of Westlake’s writing: as I wrote a few years ago, he has great plots and settings and charming characters, but nothing I’ve ever read of his has the emotional punch of, say, Scott Smith’s A Simple Plan (to choose a book whose plot would fit well into the Westlake canon).

But, hey, nobody can do everything. And I’m glad the Westlake Review is still going strong.

I don’t like discrete models (hot hand in baseball edition)

Bill Jefferys points us to this article, “Baseball’s ‘Hot Hand’ Is Real,” in which Rob Arthur and Greg Matthews analyze a year of pitch-by-pitch data from Major League Baseball.

There are some good things in their analysis, and I think a lot can be learned from these data using what Arthur and Matthews did, so my overall impression is positive. But here I want to point to two aspects of their analysis that I don’t like, that I think they could do better.

First and most obviously, their presentation is incomplete. I don’t know exactly what they did or what model they fit. They say they fit a hidden Markov model but they don’t say how many states the model had. From context I think they were fitting 3 states—hot, cold, or normal—for each player, but I’m not sure. The problem is . . . they shared no code. It would be the simplest thing in the world for them to have shared their Stan code, or R code, or Python code, or whatever—but they didn’t. This doesn’t make Arthur and Matthews uniquely bad—I’ve published a few hundred papers not sharing my code too—it’s commonplace. But it’s still a flaw. It’s hard to understand or evaluate work when you can’t see the math or the code or the data.

Second, I don’t think the discrete model makes sense. I do believe that pitchers have days when they throw harder or less hard, and I’m sure a lot more can be learned from these data too, but I would not model this as discrete states. Rather I’d say that the max pitch speed varies continuously over time.
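To be concrete about the alternative I have in mind, here’s a toy generative sketch in Python (simulated numbers, not the MLB data, and not Arthur and Matthews’s model): a continuously drifting “hotness” versus a three-state jump process.

```python
import numpy as np

rng = np.random.default_rng(2)

# One game's worth of pitches, with made-up numbers.
n_pitches = 100

# Continuous alternative: the pitcher's underlying "hotness" drifts smoothly,
# here as a mean-reverting AR(1) process (in mph) around a baseline velocity.
baseline, phi, drift_sd, pitch_sd = 93.0, 0.95, 0.15, 0.8
hotness = np.zeros(n_pitches)
for t in range(1, n_pitches):
    hotness[t] = phi * hotness[t - 1] + rng.normal(0, drift_sd)
speed_continuous = baseline + hotness + rng.normal(0, pitch_sd, n_pitches)

# Discrete (HMM-style) alternative: the pitcher jumps among cold/normal/hot
# states, each with its own mean velocity offset.
state_means = np.array([-1.0, 0.0, 1.0])           # cold, normal, hot (mph)
stay = 0.95                                         # probability of staying put
P = np.full((3, 3), (1 - stay) / 2); np.fill_diagonal(P, stay)
states = np.zeros(n_pitches, dtype=int); states[0] = 1
for t in range(1, n_pitches):
    states[t] = rng.choice(3, p=P[states[t - 1]])
speed_discrete = baseline + state_means[states] + rng.normal(0, pitch_sd, n_pitches)

# Both series look superficially "streaky"; the question is whether the latent
# state really jumps or just drifts. A state-space model with a continuous
# latent state is the natural way to fit the second story.
```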

I can see how a discrete model could be easier to fit—and I can certainly see the virtue of a discrete model in a simulation study; indeed, I’ve used simulations of discrete models to understand the “hot hand fallacy fallacy” in basketball—but I think that any discrete model here should be held at arm’s length, as it were, and not taken too seriously.

In particular, I don’t see how we can get much out of statements such as “the typical pitcher goes through 57 streaks in a season, jumping between hot and cold every 24 pitches,” which seems extremely sensitive to how the model is parameterized.

Again, I say this not to slam Arthur and Matthews (hey, they linked to this blog! We’re all on the same side here!) but rather to point to a couple places where I think their analysis could be improved.

Also, let me emphasize that my comments above do not reflect any particular baseball knowledge on my part; I’d say the same thing if the analysis were done for football, or golf, or tennis, or any other sport.

Speaking generally, it should be much easier to study hotness using continuous measurements (such as pitch speed in baseball or ball speed and angle in basketball) than using discrete measurements (such as strikeouts in baseball or successful shots in basketball). With continuous data you just have so much more to work with. Remember the adage that the most important aspect of a statistical method is not what it does with the data but what data it uses.

Bird fight! (Kroodsma vs. Podos)

Donald Kroodsma writes:

Birdsong biologists interested in sexual selection and honest signalling have repeatedly reported confirmation, over more than a decade, of the biological significance of a scatterplot between trill rate and frequency bandwidth. This ‘performance hypothesis’ proposes that the closer a song plots to an upper bound on the graph, the more difficult the song is to sing, and the more difficult the song the higher quality the singer, so that song quality honestly reveals male quality. In reviewing the confirming literature, however, I can find no support for this performance hypothesis.

OK, that sounds jargony, so let me make it clear: when Kroodsma says he “can find no support for this performance hypothesis,” what he’s really saying is that a sub-literature in the animal behavior literature is in error. Rip it up and start over.

How did everyone get it wrong? Kroodsma continues:

I will argue here that the scatter in the graph for songbirds is better explained by social factors and song learning. When songbirds learn their songs from each other, multiple males in a neighbourhood will sing the same song type. The need to conform to the local dialect of song types guides a male to learn a typical example of each song type for that population, not to take a memorized song and diminish or exaggerate it in trill rate or frequency bandwidth to honestly demonstrate his relative prowess. . . . There is no consistent, reliable information in the song performance measures that can be used to evaluate a singing male.

Damn.

But the other side is not going down without a fight. Jeffrey Podos responds:

Kroodsma [in the above-linked article] has critiqued ‘the performance hypothesis’, which posits that two song attributes, trill rate and frequency bandwidth, provide reliable indicators of singer quality and are used as such in mate or rival assessment. . . .

I address these critiques in turn, offering the following counterpoints: (1) the reviewed literature actually reveals substantial plasticity in song learning, leaving room for birds to tailor songs to their own performance capacities; (2) reasonable scenarios, largely untested, remain to explain how songs of repertoire species could convey information about singer quality; and (3) the playback studies critiqued actually enable direct, reasonable inferences about the function of vocal performance variations, because they directly contrast birds’ responses to low- versus high-performance stimuli.

Where did the critics go wrong? Podos continues:

My analyses also reveal numerous shortcomings with Kroodsma’s arguments, including an inaccurate portrayal throughout of publications under review, logic that is thus rendered questionable and reliance on original data sets that are incomplete and thus inconclusive.

I have not read either paper because it just all seems so technical. I suppose with some effort I could untangle this one, but I don’t feel like putting in the effort right now.

Any ornithologists in the house?

Fun fact: Both authors in this discussion have the same academic affiliation: the Department of Biology, University of Massachusetts, Amherst. Podos is a professor there, and Kroodsma is a retired professor. Either way, the story is compelling: youngster does shoddy research and the retired prof blows the whistle, or cranky old man can’t handle new methods. In some general sense, I’ve been on both sides of this debate: sometimes I criticize what I see as flashy research with empty claims; other times I’m frustrated that traditionalists seem to find any excuse not to take a new method seriously.

P.S. A google search turned up this review from 2005 of a book on the science of birdsong. In the review, Bernard Lohr writes:

The contributors do not shy away from controversy. Donald Kroodsma, for example, issues a challenge to those who suggest large song repertoires are a consequence of sexual selection. Kroodsma remains unconvinced that existing direct experimental data demonstrate female choice for larger repertoires in a natural context. Although his criticism is general, he selects—as he did in an earlier critique of song playback designs—studies of other eminent birdsong biologists as specific examples. Because those researchers are more than capable of defending their conclusions and viewpoints, an interesting and vigorous debate is sure to ensue.

That was 12 years ago, and the issue doesn’t yet seem to have been resolved. Strike one against the story that science is self-correcting.

P.P.S. Also relevant is this article, Response to Kroodsma’s critique of banded wren song performance research, by S. L. Vehrencamp, S. R. de Kort, and A. E. Illes.

Torture talk: An uncontrolled experiment is still an experiment.

Paul Alper points us to this horrifying op-ed by M. Gregg Bloche about scientific study of data from U.S. military torture programs.

I’ll leave the torture stuff to the experts or this guy who you’ve probably heard of.

Instead, I have a technical point to make. In the op-ed, Bloche writes:

In a true experimental study, the C.I.A. would have had to test its interrogation strategy against one or more standard interrogation methods, using experimental and control groups of captives. There’s no evidence the agency did this.

They [the psychologists, Mitchell and Jessen] argued that interrogation strategies can’t be standardized and therefore can’t be compared, like medical treatments, in randomized, prospective fashion.

No one, though, is claiming that C.I.A. review efforts involved experimental and control groups and so were “experimentation” as science defines it.

This statement, that a true experiment requires a control group, is wrong. In a controlled experiment there needs to be a comparison, but an uncontrolled experiment is still a form of experiment. To put it another way, we use the term “controlled experiment” precisely because control is not a necessary part of an experiment. In many cases control is good practice, and control makes it easier to perform certain inferences, but you can do experimentation without a control group.

Does declawing cause harm?

Alex Chernavsky writes:

I discovered your blog through a mutual friend – the late Seth Roberts. I’m not a statistician. I’m a cat-loving IT guy who works for an animal shelter in Upstate New York.

I have a dataset that consists of 17-years’-worth of animal admissions data. When an owner surrenders an animal to us, we capture the main reason for the admission (as told to us by the owner). Some of the reasons are generic: owner moving to no-pets apartment, can’t afford the pet, no time, family members have allergies, etc. But some of the reasons are related to the specific characteristics of the animal: animal is aggressive, pees outside the litter box, hides and acts skittish, etc.

Much of the sheltering community has a long-standing belief that declawed cats are more likely to have behavioral problems, as compared to intact cats. I’d like to test this hypothesis by analyzing the dataset (which also contains the declawed status of all the cats admitted to us). The dataset contains around 100,000 cat admissions, and approximately 4% of those cats were declawed.

I’ve been reading your blog enough to know that you’re not fond of null hypothesis significance testing. What approach would you recommend in this situation? Can you point me in the right direction? I might be collaborating with a veterinary epidemiologist at Cornell Vet School, and possibly also with a data scientist from Booz Allen Hamilton.

My reply: I’d go with the standard approaches for causal inference from observational (that is, non-experimental) data, as discussed for example in chapters 9 and 10 of my book with Jennifer Hill. You’d compare the “treatment group” (declawed cats) to the “control group” (all others), controlling for pre-treatment variables (age, type, size of cat; characteristics of the family the cat is living with; whether the cat lives indoors or outdoors; etc.). There are selection bias concerns if the cats were declawed because they were scratching people too much.

The statement, “Declawing causes cats to be more likely to have behavioral problems,” is not the same as the statement, “Declawed cats are more likely to have behavioral problems, as compared to intact cats.” The first of these statements is implicitly a comparison of what would happen for an individual cat if he or she were declawed, while the second statement is a comparison between two different sets of cats, who might differ in all sorts of other ways.

So your analysis might be tentative. But the starting point would be to see how the comparison looks in your data.
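As a sketch of that starting point, here’s some Python with simulated data standing in for the shelter records (all the column names—declawed, behavior_problem, age, indoor—are hypothetical): first the raw comparison, then a regression adjusting for a couple of pre-treatment variables. This doesn’t resolve the selection-bias concern; it just shows the shape of the analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulated stand-in for the shelter data; all numbers are made up.
n = 5000
age = rng.uniform(0, 15, n)
indoor = rng.binomial(1, 0.7, n)
declawed = rng.binomial(1, 0.04 + 0.02 * indoor)      # declawing more common indoors
p_problem = 1 / (1 + np.exp(-(-2 + 0.05 * age + 0.5 * declawed - 0.3 * indoor)))
behavior_problem = rng.binomial(1, p_problem)
cats = pd.DataFrame(dict(age=age, indoor=indoor, declawed=declawed,
                         behavior_problem=behavior_problem))

# Starting point: the raw comparison of behavior-problem surrender rates.
print(cats.groupby("declawed")["behavior_problem"].mean())

# Next step: adjust for pre-treatment variables with a regression (here a
# logistic regression; in practice you'd include many more covariates and
# worry hard about why a given cat was declawed in the first place).
fit = smf.logit("behavior_problem ~ declawed + age + indoor", data=cats).fit()
print(fit.summary())
```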

P.S. Speedy cat image from Guy Srinivasan.

P.P.S. Chernavsky adds this link and says that if any readers out there are interested in collaborating on this project, he has all the data and is looking for someone to help analyze it properly.

Stan Weekly Roundup, 11 August 2017

This week, more Stan!

  • Charles Margossian is the rock star of the week, finishing off the algebraic solver math library fixture and getting it all plumbed through Stan and documented. Now you can solve nonlinear sets of equations and get derivatives via the implicit function theorem, all as part of defining your log density (see the sketch after this list for the underlying idea). There is a chapter in the revised manual.

  • The entire Stan Development Team, spearheaded by Ben Goodrich (who needs fixes for RStan), is about to roll out Stan 2.17 along with the interfaces. Look for that to begin trickling out next week. This will fix some further install and error-message reporting issues as well as include the algebraic solver. We are also working on moving things toward Stan 3 behind the scenes. We won’t just keep incrementing 2.x forever!

  • Ben Goodrich fixed the inlining declarations in C++ to allow multiple Stan models to be linked or built in a single translation unit. This will be part of the 2.17 release.

  • Sean Talts is still working on build issues after Travis changed some config and compilers changed everywhere disrupting our continuous integration.

  • Sean is working with Michael Betancourt on the Cook-Gelman-Rubin diagnostic, and they have gotten to the bottom of the quantization errors (using only 1000 posterior draws and splitting them into too many bins).

  • Imad Ali is looking ahead to spatio-temporal models as he wraps up the spatial models in RStanArm.

  • Yuling Yao and Aki Vehtari finished the stacking paper (like model averaging), adding references, etc.

  • Yuling has also made some updates to the loo package that are coming soon.

  • Andrew Gelman wrote papers (with me and Michael) about R-hat and effective sample size and a paper on how priors can only be understood in the context of the likelihood.

  • Jonah Gabry, Ben, and Imad have been working on the paper for priors based on $R^2$.

  • Andrew, Breck Baldwin and I are trying to put together some educational proposals for NSF to teach intro stats, and we’re writing more grants now that it’s grant season. All kinds of interesting things at NSF and NIH with spatial modeling.

  • Jonah Gabry continues the ggplot-ification of the new Gelman and Hill book.

  • Ben Goodrich has been working on multivariate effective sample size calculations.

  • Breck Baldwin has been working with Michael on courses before Stan (and elsewhere).

  • Jonah Gabry has been patching all the packagedown references for our doc and generally cleaning things up with cross-references and organization.

  • Mitzi Morris finished adding a data qualifier to function arguments to allow propagation of data-only-ness as required by our ODE functions and now the algebraic solver.

  • Dootika Vats was visiting Michael; she’s been working on multivariate effective sample sizes, which considers how bad things can be with linear combinations of parameters; scaling is quadratic, so there are practical issues. She has previously worked on efficient and robust effective sample size calculations which look promising for inclusion in Stan.

  • Bob Carpenter has been working on exposing properties of variables in a Stan program in the C++ model at runtime and statically.
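
A footnote on the algebraic-solver item above: the derivative trick is just the implicit function theorem. Here’s a sketch of the idea in Python (a toy scalar system, not Stan’s actual implementation):

```python
import numpy as np
from scipy.optimize import brentq

# Toy system f(y, theta) = 0 with a unique root in y for theta > 0.
def f(y, theta):
    return y**3 + theta * y - 2.0

def solve_y(theta):
    return brentq(lambda y: f(y, theta), -10.0, 10.0)

theta = 1.5
y = solve_y(theta)

# Implicit function theorem: dy/dtheta = -(df/dy)^(-1) * (df/dtheta),
# evaluated at the solution -- no need to differentiate through the solver.
df_dy = 3 * y**2 + theta
df_dtheta = y
dy_dtheta = -df_dtheta / df_dy

# Finite-difference check.
eps = 1e-6
fd = (solve_y(theta + eps) - solve_y(theta - eps)) / (2 * eps)
print(dy_dtheta, fd)   # should agree to several decimal places
```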

Consider seniority of authors when criticizing published work?

Carol Nickerson writes:

I’ve written my fair share of letters to the editor and commentaries over the years, most of them languishing in the file drawer. It used to be impossible to get them published. The situation has improved a bit, but not enough.

In any case, I never think about the sex of the author(s). Or the race, for that matter. I do consider the seniority of the authors. When the lead author is junior — a graduate student, a post doc, a young assistant professor — I much prefer to bring problems to his or her attention using private e-mail. It’s more collegial and constructive. I know that many younger people in psychology have not had good statistical training at their colleges and universities. I like to attach relevant references of which he or she may not be aware and so on. I’ve also done this with authors who are more senior. Interestingly, I have found that the junior authors are more likely to reply, more likely to initiate a corrigendum, etc. The senior authors are more likely to ignore my e-mail message.

The situation gets sticky when the lead author is a graduate student but the other author(s) are professors, and you know that said professors engage in a lot of questionable research practices or even research misconduct. Not too long ago, I discovered that an article published in Psychological Science re-used data from an article published earlier in another journal without so stating. In psychology, this is considered unethical. Of course, I had to alert the editor to this data re-use, but I worried a lot about the impact on the lead author (then a graduate student, now a post doc) and her career if the editor decided to retract the article. (He decided not to retract it, but to issue a corrigendum, which I think was the wrong decision.) Nick and I have had a lot of discussions about this dilemma. He thinks (a) that the student should know better and (b) that it is better for the student to have something like this happen earlier in his or her career, when recovery from such a setback is easier, than to have it happen later. I understand and have some sympathy with this view, but it still makes me feel terrible to go after a junior person.

I wonder if other people worry about this sort of thing. You could write a blog article about this sometime, Andrew, if you run out of other topics.

My reply: Let me break this up into three parts.

First, the question of contacting the author directly. I understand the appeal of this, but I usually don’t. Why not?

a. I don’t like conflict. It’s been my experience that when I do contact authors directly, they don’t like to admit error, and the conversation becomes awkward. When people send me their papers directly, I have no problem sending back praise and criticism. But unsolicited criticism—and even unsolicited questions that are open-ended enough that they might imply criticism—doesn’t always work so well.

b. There’s also a principle here, which is that published articles are . . . public. That’s what publishing means! So even if I can reach the authors directly, not everyone can, or will. I think there’s value in public comment on public papers. If people really don’t want their papers criticized in public, they shouldn’t go around publishing them and publicizing them.

Similarly, if you disagree with something I post here, please say so in a comment: that way others will see your reaction. It won’t just be me you’re talking to.

Second, the question of going easy on younger researchers. I don’t know about this. I published a false theorem when I was 28! I want that sort of thing corrected as soon as possible. One can also flip it around and ask, is it appropriate to be less nice to people, just cos they’re older?

I guess I’d like to think that it’s not mean to point out flaws in published work. Again, I’m happy for people to do this service for me, whether privately or publicly. I was grateful when someone sent me an airmail letter (!) with a counterexample to my false theorem, and I was grateful when someone wrote an angry blog post pointing out implausible estimates that I’d produced. In both cases, I would’ve been even happier if I’d done things right in the first place—but, conditional on the error, I appreciated the corrections.

I know that lots of you disagree with me on this issue, so by posting this I’m setting myself up for some criticism, but that’s fair. It’s good to get others’ perspectives.

Bigshot statistician keeps publishing papers with errors; is there anything we can do to get him to stop???

OK, here’s a paper with a true theorem but then some false corollaries.

First the theorem:

The above is actually ok. It’s all true.

But then a few pages later comes the false statement:

This is just wrong, for two reasons. First, the relevant reference distribution is discrete uniform, not continuous uniform, so the normal CDF thing is at best just an approximation. Second, with Markov chain simulation, the draws theta_i^(l) are dependent, so for any finite L, the distribution of q_i won’t even be discrete uniform.

The theorem is correct because it’s in the limit as L approaches infinity, but the later statement (which I guess if true would be a corollary, although it’s not labeled as such) is false.

The error wasn’t noticed in the original paper because the method happened to work out on the examples. But it’s still a false statement, even if it happened to be true in a few particular cases.

Who are these stupid-ass statisticians who keep inflicting their errors on us??? False theorems, garbled data analysis, going beyond scientific argument and counterargument to imply that the entire field is inept and misguided, methodological terrorism . . . where will it all stop? Something should be done.

P.S. For those of you who care about the actual problem of statistical validation of algorithms and software which this is all about: One way to fix the above error is by approximating to a discrete uniform distribution on a binned space, with the number of bins set to something like the effective sample size of the simulations. We’re on it. And thanks to Sean, Kailas, and others for revealing this problem.
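Here’s a sketch of the binned-rank idea in Python, using a toy conjugate model where exact, independent posterior draws are available (with real MCMC output you’d first thin to roughly the effective sample size; this is an illustration of the idea, not the exact procedure we’ll end up using):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy normal-normal model where exact, independent posterior draws are easy:
# theta ~ N(0, 1),  y | theta ~ N(theta, 1)  =>  theta | y ~ N(y/2, 1/2).
n_reps, L, B = 2000, 99, 20    # replications, posterior draws, bins

ranks = np.empty(n_reps, dtype=int)
for r in range(n_reps):
    theta = rng.normal(0, 1)
    y = rng.normal(theta, 1)
    draws = rng.normal(y / 2, np.sqrt(0.5), L)    # exact posterior draws
    ranks[r] = np.sum(draws < theta)               # rank in {0, ..., L}

# If the software is correct, the ranks are discrete uniform on {0, ..., L}.
# Bin them (with MCMC output, thin first so the draws are close to
# independent) and eyeball the counts or run a chi-square check.
counts, _ = np.histogram(ranks, bins=B, range=(-0.5, L + 0.5))
expected = n_reps / B
chi2 = np.sum((counts - expected)**2 / expected)
print(counts)
print(f"chi-square with {B - 1} df: {chi2:.1f}")
```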

I love when I get these emails!

On Jan 27, 2017, at 12:24 PM, ** <**.**@**.com> wrote:

Hi Andrew,

I hope you are well. I work for ** and we are looking to chat to someone who knows about Freud – I read that you used to be an expert in Freud? Is that correct?

Background here.

What explains my lack of openness toward this research claim? Maybe my cortex is just too damn thick and wrinkled

Diana Senechal writes:

Yesterday Cari Romm reported that researchers had found a relation between personality traits and cortex shape: “People who scored higher on openness tended to have thinner and smoother cortices, while those who scored high on neuroticism had cortices that were thicker and more wrinkled.”

I [Senechal] looked up the study itself (https://academic.oup.com/scan/article/doi/10.1093/scan/nsw175/2952683/Surface-based-morphometry-reveals-the) and read:

These findings demonstrate that anatomical variability in prefrontal cortices is linked to individual differences in the socio-cognitive dispositions described by the FFM.

At the end of the Statistical Methods section, the authors state:

To control for multiple comparisons in the SBM analysis, cluster correction was completed using Monte Carlo simulation (vertex-wise cluster forming threshold of P < 0.05) at a cluster-wise P (CWP) value of 0.05.

In the discussion, they return to this issue:

In terms of potential shortcomings, it can be surmised that a relatively large number of statistical tests was performed. This could have increased the probability of type I errors, although the use of a large sample size and state-of-art methods to correct for multiple comparisons should have mitigated against this problem.

Are they overly confident that they corrected for multiple comparisons?

My reply: To paraphrase Flava Flav, 0.05 is a joke. If there are multiple potential comparisons to be considered, I think the researchers should study all of them and analyze them using a multilevel model. That makes a lot more sense than picking just one comparison and trying to correct for things. Why should you care about just one comparison? It strains credulity to think that, when you’re talking about something as multifaceted as “scoring high on openness,” it would just show up in one particular dimension. I mean, sure, you never know: useful discoveries could come from this approach. But I’m skeptical, as the statistical method seems like such a poor match to the problem being studied.
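To illustrate what I mean by analyzing all the comparisons together, here’s a toy empirical-Bayes version of the multilevel idea in Python (made-up numbers, not a reanalysis of the brain data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy version of "many comparisons": J brain regions, small true effects,
# noisy per-region estimates.
J = 50
tau_true = 0.05
true_effects = rng.normal(0, tau_true, J)
se = np.full(J, 0.10)
estimates = rng.normal(true_effects, se)

# Classical approach: scan for |estimate| > 1.96*se and report the "winners".
print("significant regions:", np.sum(np.abs(estimates) > 1.96 * se))

# Multilevel (here, simple empirical-Bayes) approach: estimate the
# between-region sd and partially pool every estimate toward the grand mean.
mu_hat = np.average(estimates, weights=1 / se**2)
tau2_hat = max(np.var(estimates, ddof=1) - np.mean(se**2), 1e-6)
shrinkage = tau2_hat / (tau2_hat + se**2)
pooled = mu_hat + shrinkage * (estimates - mu_hat)

print("largest raw estimate:   ", np.round(np.max(np.abs(estimates)), 3))
print("largest pooled estimate:", np.round(np.max(np.abs(pooled)), 3))
# The partially pooled estimates are much less dramatic, which is the point:
# all the comparisons get analyzed together instead of one being cherry-picked.
```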

P.S. Senechal supplied the above picture. I have no idea how thick and wrinkly is the cortex of this cat.

The Supreme Court can’t do statistics. And, what’s worse, they don’t know what they don’t know.

Kevin Lewis points us to this article by Ryan Enos, Anthony Fowler, and Christopher Havasy, who write:

This article examines the negative effect fallacy, a flawed statistical argument first utilized by the Warren Court in Elkins v. United States. The Court argued that empirical evidence could not determine whether the exclusionary rule prevents future illegal searches and seizures because “it is never easy to prove a negative,” inappropriately conflating the philosophical and arithmetic definitions of the word negative. Subsequently, the Court has repeated this mistake in other domains, including free speech, voting rights, and campaign finance. The fallacy has also proliferated into the federal circuit and district court levels. Narrowly, our investigation aims to eradicate the use of the negative effect fallacy in federal courts. More broadly, we highlight several challenges and concerns with the increasing use of statistical reasoning in court decisions. As courts continue to evaluate statistical and empirical questions, we recommend that they evaluate the evidence on its own merit rather than relying on convenient arguments embedded in precedent.

Damn. I thought I’d never say this, but Earl Warren knows about as much about statistics as the editors of Perspectives on Psychological Science and the Proceedings of the National Academy of Sciences.

I’d hope for better from the U.S. Supreme Court. Seriously.

Lots of psychology professors, including editors at top journals, don’t know statistics but feel the need to pretend that they do. But nobody expects a judge to know statistics. So one would think this would enable these judges to feel free to contact the nation’s top statistical experts for advice as needed. No need to try to wing it, right?

As Mark Twain might have said, it’s not what you don’t know that kills you, it’s what you know for sure that ain’t true.

Irwin Shaw, John Updike, and Donald Trump

So. I read more by and about Irwin Shaw. I read Shaw’s end-of-career collection of short stories and his most successful novel, The Young Lions, and also the excellent biography by Michael Shnayerson. I also read Adam Begley’s recent biography of John Updike, which was also very good, and it made me sad that probably very few people actually read it. Back in the old days, a major biography of a major writer would’ve had a chance of attracting some readers.

John Updike was a master of the slice of life and also created one very memorable character in Rabbit. Irwin Shaw was known as a “storyteller” but I’m not quite sure what that means, as his stories didn’t have such memorable plots. Kinda like a composer whose music is engrossing but at the same time has no memorable tunes. The guy was no John Le Carre or Stephen King.

In his New York Times obituary, Herbert Mitgang wrote, “Stylistically, Mr. Shaw’s short stories were noted for their directness of language, the quick strokes with which he established his different characters, and a strong sense of plotting.” Well put. Quick strokes. His characters didn’t come to life, but their situations and predicaments did. In that way he had a lot in common with Updike.

One thing Shaw did have was a combination of emotional sympathy, real-world grit, and social observation. Some similarity here with John O’Hara, but O’Hara’s situations always seemed a bit more schematic to me, whereas Shaw’s characters seem to be in real situations (even if they’re not, ultimately, real characters).

Updike and Shaw had different career trajectories. Updike started at the top and stayed there. Shaw started at the top and worked his way down. OK, even at the end he was selling lots of copies, but his books weren’t getting much respect (and, at least according to his biographer, they had some strong moments but they weren’t great; I can’t bring myself to try to read these novels myself). On the other hand, I’ve tried to read a couple of Updike’s later novels and I wasn’t so impressed. From my perspective, Updike redeemed himself by writing a lot of excellent literary journalism. As they got older, both Updike and Shaw reduced their output of short stories, maintaining the high quality in both cases.

Speaking of John Updike, if he were around today I expect he’d’ve had something to say about those rural Pennsylvanians who voted for Donald Trump. Being a rural Pennsylvanian. And John O’Hara, as a Pennsylvanian, and Roman Catholic, and an all-around resentful person: he would’ve had something to say about Trump voters from all those groups. Then we could bring in Lorrie Moore to explain Hillary Clinton voters to us. Hey, here it is—ok, that didn’t work: Moore doesn’t like Clinton. Hmmm, lots of people don’t like Hillary Clinton, but she did get 51% of the two-party vote. We’ll have to find some expert to explain those voters to us.

What readings should be included in a seminar on the philosophy of statistics, the replication crisis, causation, etc.?

André Ariew writes:

I’m a philosopher of science at the University of Missouri. I’m interested in leading a seminar on a variety of current topics with philosophical value, including problems with significance tests, the replication crisis, causation, correlation, randomized trials, etc. I’m hoping that you can point me in a good direction for accessible readings for the syllabus. Can you? While the course is at the graduate level, I don’t assume that my students are expert in the philosophy of science and likely don’t know what a p-value is (that’s the trouble—need to get people to understand these things). When I teach a course on inductive reasoning I typically assign Ian Hacking’s An Introduction to Probability and Inductive Logic. I’m familiar with the book and he’s a great historian and philosopher of science.

He’d like to do more:

Anything you might suggest would be greatly appreciated. I’ve always thought that issues like these are much more important to the philosophy of science than much of what passes as the standard corpus.

My response:

I’d start with the classic and very readable 2011 article by Simmons, Nelson, and Simonsohn, False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.

And follow up with my (subjective) historical overview from 2016, What has happened down here is the winds have changed.

You’ll want to assign at least one paper by Paul Meehl; here’s a link to a 1985 paper, and here’s a pointer to a paper from 1967, along with the question, “What happened? Why did it take us nearly 50 years to catch up with what Meehl was saying all along? This is what I want the intellectual history to help me understand,” and 137 comments in the discussion thread.

And I’ll also recommend my own three articles on the philosophy of statistics:

The last of these is the shortest so it might be a good place to start—or the only one, since it would be overkill to ask people to read all three.

Regarding p-values etc., the following article could be helpful (sorry, it’s another one of mine!):

And, for causation, I recommend these two articles, both of which should be readable for students without technical backgrounds:

OK, that’ll get you started. Perhaps the commenters have further suggestions?

P.S. I’d love to lead a seminar on the philosophy of statistics, unfortunately I suspect that here at Columbia this would attract approximately 0 students. I do cover some of these issues in my class on Communicating Data and Statistics, though.

Wolfram on Golomb

I was checking out Stephen Wolfram’s blog and found this excellent obituary of Solomon Golomb, the mathematician who invented the maximum-length linear-feedback shift register sequence, characterized by Wolfram as “probably the single most-used mathematical algorithm idea in history.” But Golomb is probably more famous for inventing polyominoes.
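For the curious, here’s a tiny Python illustration of a maximum-length shift-register sequence (a 4-bit register with its feedback taps chosen so it cycles through all 15 nonzero states before repeating; an illustration of the idea, not any particular application):

```python
# A tiny maximum-length linear-feedback shift register: 4 bits, feedback bit
# formed by XORing two stages, shifted back in at the top. With these taps the
# register visits every one of the 2^4 - 1 = 15 nonzero states before repeating.
def lfsr_period(seed=0b0001, nbits=4):
    state = seed
    period = 0
    while True:
        bit = (state ^ (state >> 1)) & 1              # XOR of the two tap bits
        state = (state >> 1) | (bit << (nbits - 1))   # shift right, feed back on top
        period += 1
        if state == seed:
            return period

print(lfsr_period())   # 15: a maximal-length ("m-") sequence
```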

The whole thing’s a good read, and it even includes this cool nonperiodic tiling from Wolfram’s 2002 book:

There’s also some interesting stuff on cellular automata, themselves a fascinating topic. Wolfram should hire someone to prove some theorems about them!

P.S. Wolfram’s blog has lots of good stuff. In fact, I just added it to the blogroll! For example, here’s a long post from a few months ago on cellular automata and physics. It’s a funny thing, though: Wolfram seems to have an extreme aversion to talking about his collaborators. With Wolfram, it’s all through the day, I me mine, I me mine, I me mine. Don’t get me wrong, I like to talk about myself too. But science as I experience it is soooo collaborative, it’s hard for me to imagine being in Wolfram’s situation: he has all the resources in the world but he works all on his own. So lonely. On one hand, he has these interesting ideas that he wants to share with the world, with complete strangers on his blog. On the other hand, he doesn’t seem to be able to collaborate with people directly. In literature, this would not be surprising—we don’t demand or even expect that Matthew Klam, Francis Spufford, Alison Bechdel, etc., find collaborators—but in science it seems like a mistake to work alone. Then again, what do I know. Andrew Wiles didn’t seem to require a research team or even a research partner.

“This finding did not reach statistical significance, but it indicates a 94.6% probability that statins were responsible for the symptoms.”

Charles Jackson writes:

The attached item from JAMA, which I came across in my doctor’s waiting room, contains the statements:

Nineteen of 203 patients treated with statins and 10 of 217 patients treated with placebo met the study definition of myalgia (9.4% vs 4.6%. P = .054). This finding did not reach statistical significance, but it indicates a 94.6% probability that statins were responsible for the symptoms.

Disregarding the statistical issues involving NHST and posterior probability, I assume that the author means to say “ . . . that statins were responsible for the symptoms in about half of the treated patients.” I doubt if statins caused the myalgia in the untreated patients.

Yup. (The 94.6% is just 1 − .054: the familiar error of treating one minus the p-value as the probability that the treatment effect is real.) Statistics is hard, like basketball, or knitting. Even JAMA editors can get these things horribly wrong.

Look. At. The. Data. (Hollywood action movies example)

Kaiser Fung shares an amusing story of how you can be misled by analyzing data that you haven’t fully digested. Kaiser writes, “It pains me to think how many people have analyzed this dataset, and used these keywords to build models.”