
Understanding Chicago’s homicide spike; comparisons to other cities

Michael Masinter writes:

As a longtime blog reader sufficiently wise not to post beyond my academic discipline, I hope you might take a look at what seems to me to be a highly controversial attempt to use regression analysis to blame the ACLU for the recent rise in homicides in Chicago. A summary appears here with followup here.

I [Masinter] am skeptical, but loathe to make the common mistake of assuming that my law degree makes me an expert in all disciplines. I’d be curious to know what you think of the claim.

The research article is called “What Caused the 2016 Chicago Homicide Spike? An Empirical Examination of the ‘ACLU Effect’ and the Role of Stop and Frisks in Preventing Gun Violence,” and is by Paul Cassell and Richard Fowles.

I asked Masinter what were the reasons for his skepticism, and he replied:

Two reasons, one observational, one methodological, and a general sense of skepticism.

Other cities have restricted the use of stop and frisk practices without a corresponding increase in homicide rates.

Quite a few variables appear to be collinear. The authors claim that Bayesian Model Averaging controls for that, but I lack the expertise to assess their claim.

More generally, their claim of causation appears a bit like post hoc ergo propter hoc reasoning dressed up in quantitative analysis.

Perhaps I am wrong, but I have come to be skeptical of such large claims.

I asked Masinter if it would be ok to quote him, and he responded:

Yes, but please note my acknowledged lack of expertise. Too many of my law prof colleagues think our JD qualifies us as experts in all disciplines, a presumption magnified by our habit of advocacy.

I replied, “Hey, don’t they know that the only universal experts are M.D.’s and, sometimes, economists?”, and Masinter then pointed me to this wonderful article by Arthur Allen Leff from 1974:

With the publication of Richard A. Posner’s Economic Analysis of Law, that field of learning known as “Law and Economics” has reached a stage of extended explicitness that requires and permits extended and explicit comment. . . . I was more than half way through the book before it came to me: as a matter of literary genre (though most likely not as a matter of literary influence) the closest analogue to Economic Analysis of Law is the picaresque novel.

Think of the great ones, Tom Jones, for instance, or Huckleberry Finn, or Don Quixote. In each case the eponymous hero sets out into a world of complexity and brings to bear on successive segments of it the power of his own particular personal vision. The world presents itself as a series of problems; to each problem that vision acts as a form of solution; and the problem having been dispatched, our hero passes on to the next adventure. The particular interactions are essentially invariant because the central vision is single. No matter what comes up or comes by, Tom’s sensual vigor, Huck’s cynical innocence, or the Don’s aggressive romanticism is brought into play . . .

Richard Posner’s hero is also eponymous. He is Economic Analysis. In the book we watch him ride out into the world of law, encountering one after another almost all of the ambiguous villains of legal thought, from the fire-spewing choo-choo dragon to the multi-headed ogre who imprisons fair Efficiency in his castle keep for stupid and selfish reasons. . . .

One should not knock the genre. To hold the mind-set constant while the world is played in manageable chunks before its searching single light is a powerful analytic idea, the literary equivalent of dropping a hundred metals successively into the same acid to see what happens. The analytic move, just as a strategy, has its uses, no matter which mind-set is chosen, be it ethics or psychology or economics or even law. . . .

Leff quotes Posner:

Efficiency is a technical term: it means exploiting economic resources in such a way that human satisfaction as measured by aggregate consumer willingness to pay for goods and services is maximized. Value too is defined by willingness to pay.

And then Leff continues:

Given this initial position about the autonomy of people’s aims and their qualified rationality in reaching them, one is struck by the picture of American society presented by Posner. For it seems to be one which regulates its affairs in rather a bizarre fashion: it has created one grand system—the market, and those market-supportive aspects of law (notably “common,” judge-made law)—which is almost flawless in achieving human happiness; it has created another—the political process, and the rest of “the law” (roughly legislation and administration)—which is apparently almost wholly pernicious of those aims.

An anthropologist coming upon such a society would be fascinated. It would seem to him like one of those cultures which, existing in a country of plenty, having developed mechanisms to adjust all intracultural disputes in peace and harmony, lacking any important enemies, nevertheless comes up with some set of practices, a religion say, simultaneously so barbaric and all-pervasive as to poison almost every moment of what would otherwise appear to hold potential for the purest existential joy. If he were a bad anthropologist, he would cluck and depart. If he were a real anthropologist, I suspect he would instead stay and wonder what it was about the culture that he was missing. That is, he would ponder why they wanted that religion, what was in it for them, what it looked like and felt like to be inside the culture. And he would be especially careful not to stand outside and scoff if, like Posner, he too leaned to the proposition that “most people in most affairs of life are guided by what they conceive to be their self-interest and . . . choose means reasonably (not perfectly) designed to promote it.”

I’m reminded of our discussion from a few years ago regarding the two ways of thinking in economics.

Economists are often making one of two arguments:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-school teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

Despite going in opposite directions, Arguments 1 and 2 have a similarity, in that both are used to make the point that economists stand in a superior position from which they can evaluate the actions of others.

But Leff said it all back in 1974.

P.S. We seem to have mislaid our original topic, which is the study of the change in homicide rates in Chicago. I read the article by Cassell and Fowles and their analysis of Chicago seemed reasonable to me. At the same time, I agree with Masinter that the comparisons to other cities didn’t seem so compelling.

Limitations of “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”

“If you will believe in your heart and confess with your lips, surely you will be saved one day” (The Mountain Goats, paraphrasing Romans 10:9)

One of the weird things about working with people a lot is that it doesn’t always translate into multiple opportunities to see them talk. I’m pretty sure the only time I’ve seen Andrew talk was at a fancy lecture he gave at Columbia. He talked about many things that day, but the one that stuck with me (because I’d not heard it phrased that well before; note that this is my memory of the gist of what he was saying, so don’t hold him to this opinion!) was that the problem with p-values and null-hypothesis significance testing wasn’t so much that the procedure was bad. The problem is that people are taught to believe that there exists a procedure that can, given any set of data, produce a “yes/no” answer to a fairly difficult question. So the problem isn’t the specific decision rule that NHST produces, so much as the idea that a universally applicable decision rule exists at all. (And yes, I know the maths. But the problem with p-values was never the maths.)

This popped into my head again this week as Aki, Andrew, Yuling, and I were working on a discussion to Gronau and Wagenmakers’ (GW) paper “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”.

Our discussion is titled “Limitations of ‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection’” and it extends various points that Aki and I have made at various times on this blog.

To summarize our key points:

  1. It is a bad thing for GW to introduce LOO model selection in a way that doesn’t account for its randomness. In their very specialized examples this turns out not to matter because they choose such odd data that the LOO estimates have zero variance. But it is nevertheless bad practice.
  2. Stacking is a way to get model weights that is more in line with the LOO-predictive concept than GW’s ad hoc pseudo-BMA weights. Although stacking is also not consistent for nested models, in the cases considered in GW’s paper it consistently picks the correct model. In fact, the model weight for the true model in each of their cases is w_0 = 1, independent of the number of data points. (A short R sketch showing how stacking and pseudo-BMA weights can be computed with the loo package appears after this list.)
  3. By not recognizing this, GW missed an opportunity to discuss the limitations of the assumptions underlying LOO (namely that the observed data is representative of the future data, and each individual data point is conditionally exchangeable).  We spent some time laying these out and proposed some modifications to their experiments that would make these limitations clearer.
  4. Because LOO is formulated under much weaker assumptions than those used in this paper (in particular, LOO does not assume that the data are generated by one of the models under consideration, the so-called “M-Closed assumption”), it is a little odd that GW only assess its performance under this assumption. This assumption almost never holds. If you’ve ever used the famous George Box quote, you’ve explicitly stated that the M-Closed assumption does not hold!
  5. GW’s assertion that when two models can support identical predictions (such as in the case of nested models), the simpler model should be preferred is not a universal truth, but rather a specific choice that is being made. This can be enforced for LOO methods, but like all choices in statistical modelling, it shouldn’t be made automatically or by authority; it should instead be critically assessed in the context of the task being performed.
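
To make the stacking point concrete, here is a minimal R sketch of how stacking and pseudo-BMA weights can be computed with the loo package. This is not one of GW’s examples; it uses a generic pair of nested regressions fit with rstanarm, and it assumes both packages are installed.

library("rstanarm")
library("loo")

# Fit two nested models to the same data (a generic illustration, not GW's setup)
fit1 <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
fit2 <- stan_glm(mpg ~ wt + hp, data = mtcars, refresh = 0)

# Pointwise LOO estimates for each model
loo1 <- loo(fit1)
loo2 <- loo(fit2)

# Stacking weights: chosen to optimize the LOO predictive density of the model mixture
loo_model_weights(list(fit1 = loo1, fit2 = loo2), method = "stacking")

# Pseudo-BMA+ weights, for comparison
loo_model_weights(list(fit1 = loo1, fit2 = loo2), method = "pseudobma")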

All of this has made me think about the idea of doing model selection. Or, more specifically, it’s made me question whether or not we should try to find universal tools for solving this problem. Is model selection even possible? (Danielle Navarro from UNSW has a particularly excellent blog post outlining her experiences with various existing model selection methods that you all should read.)

So I guess my very nebulous view is that we can’t do model selection, but we can’t not do model selection, but we also can’t not not do model selection.

In the end we need to work out how to do model selection for specific circumstances and to think critically about our assumptions. LOO helps us do some of that work.

To close off, I’m going to reproduce the final section of our paper because what’s the point of having a blog post (or writing a discussion) if you can’t have a bit of fun.

Can you do open science with M-Closed tools?

One of the great joys of writing a discussion is that we can pose a very difficult question that we have no real intention of answering. The question that is well worth pondering is the extent to which our chosen statistical tools influence how scientific decisions are made. And it’s relevant in this context because a key difference between model selection tools based on LOO and tools based on marginal likelihoods is what happens when none of the models could reasonably have generated the data.

In this context, marginal likelihood-based model selection tools will, as the amount of data increases, choose the model that best represents the data, even if it doesn’t represent the data particularly well. LOO-based methods, on the other hand, are quite comfortable expressing that they can not determine a single model that should be selected. To put it more bluntly, marginal likelihood will always confidently select the wrong model, while LOO is able to express that no one model is correct.

We leave it for each individual statistician to work out how the shortcomings of marginal likelihood-based model selection balance with the shortcomings of cross-validation methods. There is no simple answer.

Stan on the web! (thanks to RStudio)

This is big news. Thanks to RStudio, you can now run Stan effortlessly on the web.

So you can get started on Stan without any investment in set-up time, no need to install C++ on your computer, etc.

As Ben Goodrich writes, “RStudio Cloud is particularly useful for Stan tutorials where a lot of time can otherwise be wasted getting C++ toolchains installed and configured on everyone’s laptops.”

To get started with a simple example, just click here and log in.

We’ve pre-loaded this particular RStudio session with a regression model and an R script to simulate fake data and run the model. In your online RStudio Cloud session (which will appear within your browser when you click the above link), just go to the lower-right window with Files, and click on simple_regression.stan and simple_regression.R. This will open up those files. Run simple_regression.R and it will simulate the data, run the Stan program, and produce a graph.
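
If you’re curious what such a workflow looks like before clicking through, here is a minimal sketch of a simulate-and-fit script in the same spirit. It is not the actual code in the RStudio Cloud session (the files there may differ); the model, data-generating values, and variable names are made up for illustration, and it embeds the Stan program as a string so the example is self-contained.

library("rstan")

stan_code <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  y ~ normal(a + b * x, sigma);
}
"

# Simulate fake data from a known regression line
set.seed(123)
N <- 100
x <- runif(N, 0, 10)
y <- 2 + 0.5 * x + rnorm(N, 0, 1)

# Fit the model (the first run compiles the Stan program, which takes a while)
fit <- stan(model_code = stan_code, data = list(N = N, x = x, y = y))
print(fit)

# Plot the data with the estimated regression line
post <- extract(fit)
plot(x, y)
abline(mean(post$a), mean(post$b))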

Now you can play around.

Create your own Stan program: just work in the upper-left window, click on File, New File, Stan File, then click on File, Save As, and give it a .stan extension. The RStudio editor already has highlighting and autocomplete for Stan files.

Or edit the existing Stan program (simple_regression.stan, sitting there in the lower right of your RStudio Cloud window), change it how you’d like, then edit the R script or create a new one. You can upload data to your session too.

When you run a new or newly edited Stan program, it will take some time to compile. But then the next time you run it, R will access the compiled version and it will run faster.

You can also save your session and get back to it later.

Some jargon for ya—but I mean every word of it!

This is a real game-changer as it significantly lowers the barriers to entry for people to start using Stan.

Limitations

Ultimately I recommend you set up Stan on your own computer so you have full control over your modeling, but RStudio Cloud is a wonderful way to get started.

Here’s what RStudio says:

Each project is allocated 1GB of RAM. Each account is allocated one private space, with up to 3 members and 5 projects. You can submit a request to the RStudio Cloud team for more capacity if you hit one of these space limits, and we will do our best to accommodate you. If you are using a Professional shinyapps.io account, you will not encounter these space limits.

In addition to the private space (where you can collaborate with a selected group of other users), every user also gets a personal workspace (titled “Your Workspace”), where there is virtually no limit to the number of projects you can create. Only you can work on projects in your personal workspace, but you can grant view & copy access to them to any other RStudio Cloud user.

This is just amazing. I’m not the most computer-savvy user, but I was able to get this working right away.

Ben adds:

It also comes with
* The other 500-ish examples in the examples/ subdirectory
* Most of the R packages that use Stan, including McElreath’s rethinking package from GitHub and all the ones under stan-dev, namely
– rstanarm (comes with compiled Stan programs for regression)
– brms (generates Stan programs for regression)
– projpred (for selecting a submodel of a GLM)
– bayesplot and shinystan (for visualizing posterior output)
– loo (for model comparison using expected log predictive density)
– rstantools (for creating your own R packages like rstanarm)
* Saving new / modified compiled Stan programs to the disk to use across sessions first requires the user to do rstan_options(auto_write = TRUE)

I’m so excited. You can now play with Stan yourself with no start-up effort. Or, if you’re already a Stan user, you can demonstrate it to your friends. Also, you can put your own favorite models in an RStudio Cloud environment (as Ben did for my simple regression model) and then share the link with other people, who can use your model on your data, upload their own data, alter your model, etc.

P.S. It seems that for now this is really only for playing around with very simple models, to give you a taste of Stan, and then you’ll have to install it on your computer to do more. See this post from Aki. That’s fine, as this demo is intended to be a show horse, not a work horse. I think there is also a way to pay RStudio for cycles on the cloud and then you can run bigger Stan models through the web interface. So that could be an option too, for example if you want to use Stan as a back-end for some computing that you’d like others to access remotely.

Why are functional programming languages so popular in the programming languages community?

Matthijs Vákár writes:

Re the popularity of functional programming and Church-style languages in the programming languages community: there is a strong sentiment in that community that functional programming provides important high-level primitives that make it easy to write correct programs. This is because functional code tends to be very short and easy to reason about because of referential transparency, making it quick to review and maintain. This becomes even more important in domains where correctness is non-trivial, like in the presence of probabilistic choice and conditioning (or concurrency).

The sense – I think – is that for the vast majority of your code, correctness and programmer-time are more important considerations than run-time efficiency. A typical workflow should consist of quickly writing an obviously correct but probably inefficient program, profiling it to locate the bottlenecks and finally – but only then – thinking hard about how to use computational effects like mutable state and concurrency to optimise those bottlenecks without sacrificing correctness.

There is also a sense that compilers will get smarter with time, which should result in pretty fast functional code, allowing the programmer to think more about what she wants to do rather than about the how. I think you could also see the lack of concern for performance of inference algorithms in this light. This community is primarily concerned with correctness over efficiency. I’m sure they will get to more efficient inference algorithms eventually. (Don’t forget that they are not statisticians by training so it may take some time for knowledge about inference algorithms to percolate into the PL community.)

Justified or not, there is a real conviction in the programming languages community that functional ideas will become more and more important in mainstream programming. People will point to the increase of functional features in Scala, Java, C# and even C++. Also, people note that OCaml and Haskell these days are getting to the point where they are really quite fast. Jane Street would not be using OCaml if it weren’t competitive. In fact, one of the reasons they use it – as I understand it – is that their code involves a lot of maths which functional languages lend themselves well to writing. Presumably, this is also part of the motivation of using functional programming for prob prog? In a way, functional programming feels closer to maths.

Another reason I think people care about language designs in the style of Church is that they make it easy to write certain programs that are hard in Stan. For instance, perhaps you care about some likelihood-free model, like a model where some probabilistic choices are made, then a complicated deterministic program is run and only then the results are observed (conditioned on). An example that we have here in Oxford is a simulation of the standard model of particle physics. This effectively involves pushing forward your priors through some complicated simulator and then observing. It would be difficult to write down a likelihood for this model. Relatedly, when I was talking to Erik Meijer, I got the sense that Facebook wants to integrate prob prog into existing large software systems. The resulting models would not necessarily have a smooth likelihood wrt the Lebesgue measure. I am told that in some cases inference in these models is quite possible with rejection sampling or importance sampling or some methods that you might consider savage. The models might not necessarily be very hard statistically, but they are hard computationally in the sense that they do involve running a non-trivial program. (Note though that this has less to do with functional programming than it does with combining probabilistic constructs with more general programming constructs!)

A different example that people often bring up is given by models where observations might be of variable length, like in language models. Of course, you could do it in Stan with padding, but some people don’t seem to like this.

Of course, there is the question whether all of this effort is wasted if eventually inference is completely intractable in basically all but the simplest models. What I would hope to eventually see – and many people in the PL community with me – is a high-level language in which you can do serious programming (like in OCaml) and have access to serious inference capabilities (like in Stan). Ideally, the compiler should decide/give hints as to which inference algorithms to try — use NUTS when you can, but otherwise back-off and try something else. And there should be plenty of diagnostics to figure out when to distrust your results.

Finally, though, something I have to note is that programming language people are the folks writing compilers. And compilers are basically the one thing that functional languages do best because of their good support for user-defined data structures like trees and recursing over such data structures using pattern matching. Obviously, therefore, programming language folks are going to favour functional languages. Similarly, because of their rich type systems and high-level abstractions like higher-order functions, polymorphism and abstract data types, functional languages serve as great hosts for DSLs like PPLs. They make it super easy for the implementer to write a PPL even if they are a single academic and do not have the team required to write a C++ project.

The question now is whether they also genuinely make things easier for the end user. I believe they ultimately have the potential to do so, provided that you have a good optimising compiler, especially if you are trying to write a complicated mathematical program.

I replied:

On one hand, people are setting up models in Church etc. that they may never be able to fit—it’s not just that Church etc are too slow to fit these models, it’s that they’re unidentified or just don’t make sense or have never really been thought out. (My problem here is not with the programming language called Church; I’m just here concerned about the difficulty with fitting any model correctly the first time, hence the need for flexible tools for model fitting and exploration.)

But, on the other hand, people are doing inference and making decisions all the time, using crude regressions or t-tests or whatever.

To me, it’s close to meaningless for someone to say they can write “an obviously correct but probably inefficient program” if the program is fitting a model that you can’t really fit!

Matthijs responded:

When I say “write an obviously correct but probably inefficient program”, I am talking about usual functional programming in a deterministic setting. I think many programming language (PL) folks do not fully appreciate what correctness for a probabilistic program means. I think they are happy when they can prove that their program asymptotically does the right thing. I do not think they quite realise yet that that gives no guarantees about what to think about your results when you run your program for a finite amount of time. It sounds silly, but I think at least part of the community is still to realise that. Personally, I expect that people will wake up eventually and will realise that correctness in a probabilistic setting is a much more hairy beast. In particular, I expect that the PL community will realise that run-time diagnostics and model checking are the things that they should have been doing all along. I am hopeful though that at some point enough knowledge of statistics will trickle through to them such that genuinely useful collaboration is possible. I strongly feel that there is mostly just a lot of confusion rather than wilful disregard.

The Golden Rule of Nudge

Nudge unto others as you would have them nudge unto you.

Do not recommend to apply incentives to others that you would not want for yourself.

Background

I was reading this article by William Davies about Britain’s Kafkaesque immigration policies.

The background, roughly, is this: Various English politicians promised that the net flow of immigrants to the U.K. would decrease. But the British government had little control over the flow of immigrants to the U.K., because most of them were coming from other E.U. countries. So they’d pretty much made a promise they couldn’t keep. The only way the government could even try to reach their targets was to convince people currently living in the U.K. to leave. And one way to do this was to make their lives more difficult. Apparently the government focused this effort on people living in Britain who’d originally come from former British colonies in the Caribbean, throwing paperwork at them and threatening their immigration status.

Davies explains:

The Windrush generation’s immigration status should never have been in question, and the cause of their predicament is recent: the 2014 Immigration Act, which contained the flagship policies of the then home secretary, Theresa May. Foremost among them was the plan to create a ‘hostile environment’, with the aim of making it harder for illegal immigrants to work and live in the UK . . .

It’s almost as if, on discovering that law alone was too blunt an instrument for deterring and excluding immigrants, May decided to weaponise paperwork instead. The ‘hostile environment’ strategy was never presented just as an effective way of identifying and deporting illegal immigrants: more important, it was intended as a way of destroying their ability to build normal lives. The hope, it seemed, was that people might decide that living in Britain wasn’t worth the hassle. . . .

The thread linking benefit sanctions and the ‘hostile environment’ is that both are policies designed to alter behaviour dressed up as audits. Neither is really concerned with accumulating information per se: the idea is to use a process of constant auditing to punish and deter . . .

And then he continues:

The coalition government was fond of the idea of ‘nudges’, interventions that seek to change behaviour by subtle manipulation of the way things look and feel, rather than through regulation. Nudgers celebrate the sunnier success stories, such as getting more people to recycle or to quit smoking, but it’s easy to see how the same mentality might be applied in a more menacing way.

Nudge for thee but not for me?

This all reminds me of a general phenomenon, that “incentives matter” always seems like a good motto for other people, but we rarely seem to want it for ourselves. For example, there’s lots of talk about how being worried about losing your job is a good motivator to work hard (or, conversely, that a secure job is a recipe for laziness), but we don’t want job insecurity for ourselves. (Sure, lots of people who aren’t tenured professors or secure government employees envy those of us with secure jobs, and maybe they think we don’t deserve them, but I don’t think these people are generally asking for less security in their own jobs.)

More generally, it’s my impression that people often think that “nudges” are a good way to get other people to behave in desirable ways, without these people wanting to be “nudged” themselves. For example, did Jeremy Bentham put himself in a Panopticon to ensure his own good behavior?

There are exceptions, though. I like to stick other people in my office so it’s harder for me to drift off and be less productive. And lots of smokers are supportive of policies that make it less convenient to smoke, as this can make it easier for them to quit and harder for them to relapse.

Summary

With all that in mind, I’d like to propose a Golden Rule of Nudge: Nudge unto others as you would have them nudge unto you.

Do not recommend to apply incentives to others that you would not want for yourself.

Perhaps you could try a big scatterplot with one dot per dataset?

Joe Nadeau writes:

We are studying variation in both means and variances in metabolic conditions. We have access to nearly 200 datasets that involve a range of metabolic traits and vary in sample size, mean effects, and variance. Some traits differ in mean but not variance, others in variance but not mean, still others in both, and of course some in neither. These studies are based on animal models where genetics and environmental conditions are well-controlled and where pretty rigorous study designs are possible. We plan to report the full range of results, but would like to rank results according to confidence (power?) for each dataset. Some are obviously more robust than others. Confidence limits, which will be reported, don’t seem quite right for ranking. We feel an obligation to share some sense of confidence, which should be based on sample size and the variance in the contrast groups (between genotypes, between diets, between treatments, …). We are of course aware of the trickiness in studies like these. Our question is based on getting the rest right, and wanting to share with readers and reviewers a sense of the ‘strengths, limitations’ across the datasets. Suggestions?

My reply: Perhaps you could try a big scatterplot with one dot per dataset? Also I would doubt that there are really traits that do not differ in mean or variance. Maybe it would help to look at the big picture rather than to be categorizing individual cases, which can be noisy. Similarly, rather than ranking the results, I think it would be better to just consider ways of displaying all of them.
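
As a rough illustration of that suggestion, here is a short ggplot2 sketch using a made-up summary data frame; the column names, simulated values, and choice of axes are all hypothetical, and the real plot would use whatever per-dataset summaries (mean differences, variance ratios, sample sizes) are most relevant.

library("ggplot2")

# Hypothetical per-dataset summaries: in practice, one row per real dataset
summaries <- data.frame(
  mean_diff = rnorm(200, 0, 1),           # estimated difference in means between contrast groups
  var_ratio = exp(rnorm(200, 0, 0.5)),    # estimated ratio of variances between contrast groups
  n = sample(20:200, 200, replace = TRUE) # sample size of each dataset
)

# One dot per dataset; point size conveys how much information each dataset carries
ggplot(summaries, aes(x = mean_diff, y = var_ratio, size = n)) +
  geom_point(alpha = 0.5) +
  scale_y_log10() +
  labs(x = "Estimated difference in means",
       y = "Estimated variance ratio (log scale)",
       size = "Sample size")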

Podcast interview on polling (mostly), also some Bayesian stuff

Hugo Bowne-Anderson interviewed me for a DataCamp podcast. Transcript is here.

Rising test scores . . . reported as stagnant test scores

Joseph Delaney points to a post by Kevin Drum pointing to a post by Bob Somerby pointing to a magazine article by Natalie Wexler that reported on the latest NAEP (National Assessment of Educational Progress) test results.

In an article entitled, “Why American Students Haven’t Gotten Better at Reading in 20 Years,” Wexler asks, “what’s the reason for the utter lack of progress in reading scores?”

The odd thing, though, is that reading scores have clearly gone up in the past twenty years, as Somerby points out in text and Drum shows in this graph:

Drum summarizes:

Asian: +15 points
White: +5 points
Hispanic: +10 points
Black: +5 points

. . . Using the usual rule of thumb that 10 points equals one grade level, black and white kids have improved half a grade level; Hispanic kids have improved a full grade level; and Asian kids have improved 1½ grade levels.

Why does this sort of thing get consistently misreported? Delaney writes:

My [Delaney’s] opinion: because there is a lot of money in education and it won’t be possible to “disrupt” education and redirect this money if the current system is doing well. . . .

It also moves the goalposts. If everything is falling apart then it isn’t such a crisis if the disrupted industry has teething issues once they strip cash out of it to pay for the heroes who are reinventing the system.

But if current educational systems are doing well, and slowly improving through incremental change, then it is a lot harder to argue that there is a crisis in education, isn’t it?

Could be. The other thing is that it can be hard to get unstuck from a conventional story. We discussed a similar example a few years ago: that time it was math test scores, which economist Roland Fryer stated had been “largely constant over the past thirty years,” even while he’d included a graph showing solid improvements.

Bayesian inference and religious belief

We’re speaking here not of Bayesianism as a religion but of the use of Bayesian inference to assess or validate the evidence regarding religious belief, in short, the probability that God !=0 or the probability that the Pope is Catholic or, as Tyler Cowen put it, the probability that Lutheranism is true.

As a statistician and social scientist I have a problem with these formulations in large part because the states in question do not seem at all clearly defined—for instance, Lutheranism is a set of practices as much as it is a set of doctrines, and what does it mean for a practice to be “true”; also the doctrines themselves are vague enough that it seems pretty much meaningless to declare them true or false.

Nonetheless, people have wrestled with these issues, and it may be that something can be learned from such explorations.

So the following email from political philosopher Kevin Vallier might be of interest, following up on my above-linked reply to Cowen. Here’s Vallier:

Philosophers of religion have been using Bayesian reasoning to determine the rationality of theistic belief in particular for many years. Richard Swinburne introduced Bayesian analysis in his many books defending the rationality of theistic and Christian belief. But many others, theist and atheist, use it as well. So there actually are folks out there who think about their religious commitments in Bayesian terms, just not very many of us. But I bet there are at least 1000 philosophers, theologians, and lay apologists who find thinking in Bayesian terms useful on religious matters and debates.

A nice, accessible use of Bayesian reasoning with regard to religious belief is Richard Swinburne’s The Existence of God, 2nd ed.

Wow—I had no idea there were 1000 philosophers in the world, period. Learn something new every day.

In all seriousness . . . As indicated in my post replying to Cowen, I’m interested in the topic not so much out of an interest in religion but rather because of the analogy to political and social affiliation. It seems to me that many affiliations are nominally about adherence to doctrine but actually are more about belonging. This idea is commonplace (for example, when people speak of political “tribes”), but complications start to arise when the doctrines have real-world implications, as in our current environment of political polarization.

Present each others’ posters

It seems that I’ll be judging a poster session next week. So this seems like a good time to repost this from 2009:

I was at a conference that had an excellent poster session. I realized the session would have been even better if the students with posters had been randomly assigned to stand next to and explain other students’ posters. Some of the benefits:

1. The process of reading a poster and learning about its material would be more fun if it was a collaborative effort with the presenter.

2. If you know that someone else will be presenting your poster, you’ll be motivated to make the poster more clear.

3. When presenting somebody else’s poster, you’ll learn the material. As the saying goes, the best way to learn a subject is to teach it.

4. The random assignment will lead to more interdisciplinary understanding and, ultimately, collaboration.

I think just about all poster sessions should be done this way.

My post elicited some comments to which I replied:

– David wrote that my idea “misses the potential benefit to the owner of the poster of getting critical responses to their work.” The solution: instead of complete randomization, randomize the poster presenters into pairs, then put pairs next to each other. Student A can explain poster B, student B can explain poster A, and spectators can give their suggestions to the poster preparers.

– Mike wrote that “one strong motivation for presenters is the opportunity to stand in front of you (and other members of the evaluation committee) and explain *their* work to you. Personally.” Sure, but I don’t think it’s bad if instead they’re explaining somebody else’s work. If I were a student, I think I’d enjoy explaining my fellow students’ work to an outsider. The ensuing conversation might even result in some useful new ideas.

– Lawrence suggested that “the logic of your post apply to conference papers, too.” Maybe so.

“Fudged statistics on the Iraq War death toll are still circulating today”

Mike Spagat shares this story entitled, “Fudged statistics on the Iraq War death toll are still circulating today,” which discusses problems with a paper published in a scientific journal in 2006, and errors that a reporter inadvertently included in a recent news article. Spagat writes:

The Lancet could argue that if [Washington Post reporter Philip] Bump had only read the follow-up letters it published, he never would have reprinted the discredited graph. But this argument is akin to saying that there is no need for warning labels on cigarettes because people can just read the scientific literature on smoking and consider themselves warned. But in practice, many people will just assume the graph is kosher because it sits on the Lancet website with no warning attached. . . .

That said, this particular chapter at least has a happy ending. I wrote to Bump and the Washington Post and they fixed the story, in the process demonstrating an admirable respect for evidence and a commitment to the truth. The Lancet would do well to follow their example.

The Lancet declined to comment on this piece.

And here’s Spagat’s summary of the whole process:

1. The Lancet publishes a false graph.
2. The problems of the graph are exposed, several even in letters to the Lancet.
3. The Lancet just leaves the graph up.
4. A Washington Post reporter stumbles onto the false graph, thinks it’s cool and reprints it.
5. I tell the reporter that he just published a false graph.
6. The reporter does a mea culpa and pulls the graph down.
7. I write up this sequence of events for The Conversation [see above link].
8. The Conversation sends it to the Lancet.
9. The Lancet declines to comment and leaves the false graph up.
10. The Conversation publishes the piece.
11. Someone else sees and believes in the graph?

Over the years I’ve had bad experiences trying to get both academic journals and newspapers/magazines to make corrections but the journals have been the more reluctant of the two. A low water mark was when an editor at Public Opinion Quarterly told me, after a long exchange, that POQ policy is that they don’t correct errors.

The Lancet, of course, publishes lots of good stuff too (including, for example, this article). So it’s too bad to see them duck this one. Or maybe there’s more to the story. Anyone can feel free to add information in the comments.

“Ivy League Football Saw Large Reduction in Concussions After New Kickoff Rules”

I noticed this article in the newspaper today:

A simple rule change in Ivy League football games has led to a significant drop in concussions, a study released this week found.

After the Ivy League changed its kickoff rules in 2016, adjusting the kickoff and touchback lines by just five yards, the rate of concussions per 1,000 kickoff plays fell to two from 11, according to the study, which was published Monday in the Journal of the American Medical Association. . . .

Under the new system, teams kicked off from the 40-yard line, instead of the 35, and touchbacks started from the 20-yard line, rather than the 25.

The result? A spike in the number of touchbacks — and “a dramatic reduction in the rate of concussions” . . .

The study looked at the rate of concussions over three seasons before the rule change (2013 to 2015) and two seasons after it (2016 to 2017). Researchers saw a larger reduction in concussions during kickoffs after the rule change than they did with other types of plays, like scrimmages and punts, which saw only a slight decline. . . .

I was curious so I followed the link to the research article, “Association Between the Experimental Kickoff Rule and Concussion Rates in Ivy League Football,” by Douglas Wiebe, Bernadette D’Alonzo, Robin Harris, et al.

From the Results section:

Kickoffs resulting in touchbacks increased from a mean of 17.9% annually before the rule change to 48.0% after. The mean annual concussion rate per 1000 plays during kickoff plays was 10.93 before the rule change and 2.04 after (difference, −8.88; 95% CI, −13.68 to −4.09). For other play types, the concussion rate was 2.56 before the rule change and 1.18 after (difference, −1.38; 95% CI, −3.68 to 0.92). The difference-in-differences analysis showed that 7.51 (95% CI, −12.88 to −2.14) fewer concussions occurred for every 1000 kickoff plays after vs before the rule change.

I took a look at the table and noticed some things.

First, the number of concussions was pretty high and the drop was obviously not statistical noise: 126 (that is, 42 per year) for the first three years, down to 33 (15.5 per year) for the last two years. With the exception of punts and FG/PATs, the number of cases was large enough that the drop was clear.

Second, I took a look at the confidence intervals. The confidence interval for “other play types combined” includes zero: see the bottom line of the table. Whassup with that?

I shared this example with my classes this morning and we also had some questions about the data:

– How is “concussion” defined? Could the classification be changing over time? I’d expect, what with increased concussion awareness, that concussions would be diagnosed more than before, which would make the declining trend in the data even more impressive. But I don’t really know.

– Why data only since 2013? Maybe that’s only how long they’ve been recording concussions.

– We’d like to see the data for each year. To calibrate the effect of a change over time, you want to see year-to-year variation, in this case a time series of the 5 years of data. Obviously the years of the concussions are available, and they might have even been used in the analysis. In the published article, it says, “Annual concussion counts were modeled by year and play type, with play counts as exposures, using maximum likelihood Poisson regression . . .” I’m not clear on what exactly was done here.

– We’d also like to see similar data from other conferences, not just the Ivy League, to see changes in games that did not follow these rules.

– Even simpler than all that, we’d like to see the raw data on which the above analysis was based. Releasing the raw data—that would be trivial. Indeed, the dataset may already be accessible—I just don’t know where to look for it. Ideally we’d move to a norm in which it was just expected that every publication came with its data and code attached (except when not possible for reasons of privacy, trade secrets, etc.). It just wouldn’t be a question.

The above requests are not meant to represent devastating criticisms of the research under discussion. It’s just hard to make informed decisions without the data.

Checking the calculations

Anyway, I was concerned about the last row of the above table so I thought I’d do my best to replicate the analysis in R.

First, I put the data from the table into a file, football_concussions.txt:

y1 n1 y2 n2
kickoff 26 2379 3 1467
scrimmage 92 34521 28 22467
punt 6 2496 2 1791
field_goal_pat 2 2090 0 1268

Then I read the data into R, added a new row for “Other plays” summing all the non-kickoff data, and computed the classical summaries. For each row, I computed the raw proportions and standard errors, the difference between the proportions, and the standard error of that difference. I also computed the difference in differences, comparing the change in the concussion rate for kickoffs to the change in concussion rate for non-kickoff plays, as this comparison was mentioned in the article’s results section. I multiplied all the estimated differences and standard errors by 1000 to get the data in rates per thousand.

Here’s the (ugly) R code:

concussions <- read.table("football_concussions.txt", header=TRUE)
# Add a fifth row combining all non-kickoff play types
concussions <- rbind(concussions, apply(concussions[2:4,], 2, sum))
rownames(concussions)[5] <- "other_plays"
# Raw proportions, differences (per 1000 plays), and the kickoff vs. other-plays diff-in-diff
compare <- function(data) {
  p1 <- data$y1/data$n1   # concussion rate before the rule change
  p2 <- data$y2/data$n2   # concussion rate after the rule change
  diff <- p2 - p1
  se_diff <- sqrt(p1*(1-p1)/data$n1 + p2*(1-p2)/data$n2)
  diff_in_diff <- diff[1] - diff[5]   # kickoffs minus all other plays
  se_diff_in_diff <- sqrt(se_diff[1]^2 + se_diff[5]^2)
  return(list(diffs=data.frame(data, diff=diff*1000, 
     diff_ci_lower=(diff - 2*se_diff)*1000, diff_ci_upper=(diff + 2*se_diff)*1000),
     diff_in_diff=diff_in_diff*1000, se_diff_in_diff=se_diff_in_diff*1000))
}
print(lapply(compare(concussions), round, 2))

And here's the result:

$diffs
                y1    n1 y2    n2  diff diff_ci_lower diff_ci_upper
kickoff         26  2379  3  1467 -8.88        -13.76         -4.01
scrimmage       92 34521 28 22467 -1.42         -2.15         -0.69
punt             6  2496  2  1791 -1.29         -3.80          1.23
field_goal_pat   2  2090  0  1268 -0.96         -2.31          0.40
other_plays    100 39107 30 25526 -1.38         -2.05         -0.71

$diff_in_diff
[1] -7.5

$se_diff_in_diff
[1] 2.46

The differences and 95% intervals computed this way are similar to, although not identical to, the results in the published table. But there are some differences that baffle me. Let's go through the estimates, line by line:

- Kickoffs. Estimated difference is identical; the 95% interval is correct to two significant digits. This unimportant discrepancy probably arises because I used a binomial model and the published analysis used a Poisson regression.

- Plays from scrimmage. Estimated difference is the same for both analyses; lower confidence bound is almost the same. But the upper bounds are different: I have -0.69; the published analysis is -0.07. I'm guessing they got -0.70 and they just made a typo when entering the result into the paper. Not as bad as the famous Excel error, but these slip-ups indicate a problem with workflow. A problem I often have in my workflow too, as in the short term it is often easier to copy numbers from one place to another than to write a full script.

- Punts. Estimated difference and confidence interval work out to within 2 significant digits.

- Field goals and extra points. Same point estimate, but confidence bounds are much different, with the published intervals much wider than mine. This makes sense: There's a zero count in these data, and I was using the simple sqrt(p_hat*(1-p_hat)) standard deviation formula which gives too low a value when p_hat=0. Their Poisson regression would be better.

- All non-kickoffs. Same point estimate but much different confidence intervals. I think they made a mistake here: for one thing, their confidence interval doesn't exclude zero, even though the raw numbers show very strong evidence of a difference (30 concussions after the rule change and 100 before; even if you proportionally scale down the 100 to 60 or so, that's still sooo much more than 30; it could not be the result of noise). I have no idea what happened here; perhaps this was some artifact of their regression model?

This last error is somewhat consequential, in that it could lead readers to believe that there's no strong evidence that the underlying rate of concussions was declining for non-kickoff plays.

- Finally, the diff-in-diff is three standard errors away from zero, implying that the relatively larger decline in concussion rate in kickoffs, compared to non-kickoffs, could not be simply explained by chance variation.

Difference in difference or ratio of ratios?

But is the simple difference the right thing to look at?

Consider what was going on. Before the rule change, concussion rates were much higher for kickoffs than for other plays. After the rule change, concussion rates declined for all plays.

As a student in my class pointed out, it really makes more sense to compare the rates of decline than the absolute declines.

Put it this way. If the probabilities of concussion are:
- p_K1: Probability of concussion for a kickoff play in 2013-2015
- p_K2: Probability of concussion for a kickoff play in 2016-2017
- p_N1: Probability of concussion for a non-kickoff play in 2013-2015
- p_N2: Probability of concussion for a non-kickoff play in 2016-2017.

Following the published paper, we estimated the difference in differences, (p_K2 - p_K1) - (p_N2 - p_N1).

But I think it makes more sense to think multiplicatively, to work with the ratio of ratios, (p_K2/p_K1) / (p_N2/p_N1).

Or, on the log scale, (log p_K2 - log p_K1) - (log p_N2 - log p_N1).

What's the estimate and standard error of this comparison?

The key step is that we can use relative errors. From the Poisson distribution, the relative sd of an estimated rate is 1/sqrt(y), so this is the approximate sd on the log scale. So, the estimated difference in differences of log probabilities is (log(3/1467) - log(26/2379)) - (log(30/25526) - log(100/39107)) = -0.90. That's a big number: exp(0.90) = 2.46, which means that the concussion rate fell more than twice as fast in kickoff plays as in non-kickoff plays.

But the standard error of this difference in difference of logs is sqrt(1/3 + 1/26 + 1/30 + 1/100) = 0.64. The observed d-in-d-of-logs is -0.90, which is less than two standard errors from zero, thus not conventionally statistically significant.
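
Here's that back-of-the-envelope calculation as a few lines of R, using the counts from the table above:

# Difference in differences of log concussion rates (kickoffs vs. other plays)
log_dd <- (log(3/1467) - log(26/2379)) - (log(30/25526) - log(100/39107))
# Approximate standard error from the Poisson relative-error rule, 1/sqrt(y)
se_log_dd <- sqrt(1/3 + 1/26 + 1/30 + 1/100)
round(c(estimate = log_dd, se = se_log_dd, ratio_of_ratios = exp(-log_dd)), 2)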

So, from these data alone, we cannot confidently conclude that the relative decline in concussion rates is more for kickoffs than for other plays. That estimate of 3/1467 is just too noisy.

We could also see this using a Poisson regression with four data points. We create the data frame using this (ugly) R code:

data <- data.frame(y = c(concussions$y1[c(1,5)], concussions$y2[c(1,5)]),
                   n = c(concussions$n1[c(1,5)], concussions$n2[c(1,5)]),
                   time = c(0, 0, 1, 1),
                   kickoff = c(1, 0, 1, 0))

Which produces this:

    y     n time kickoff
1  26  2379    0       1
2 100 39107    0       0
3   3  1467    1       1
4  30 25526    1       0

Then we load rstanarm:

library("rstanarm")
options(mc.cores = parallel::detectCores())

And run the regression:

fit <- stan_glm(y ~ time*kickoff, offset=log(n), family=poisson, data=data)
print(fit)

Which yields:

             Median MAD_SD
(Intercept)  -6.0    0.1  
time         -0.8    0.2  
kickoff       1.4    0.2  
time:kickoff -0.9    0.6

The coefficient for time is negative: concussion rates have gone down a lot. The coefficient for kickoff is positive: kickoffs have higher concussion rates. The coefficient for the interaction of time and kickoff is negative: concussion rates have been going down faster for kickoffs. But that last coefficient is less than two standard errors from zero. This is just confirming with our regression analysis what we already saw with our simple standard error calculations.

OK, but don't I think that the concussion rate really was going down faster, even proportionally, in kickoffs than in other plays? Yes, I do, for three reasons. In no particular order:
- The data do show a faster relative decline, albeit with uncertainty;
- The story makes sense given that the Ivy League directly changed the rules to make kickoffs safer;
- Also, there were many more touchbacks, and this directly reduces the risks.

That's fine.

Summary

- There seem to be two mistakes in the published paper: (1) some error in the analysis of the non-kickoff plays which led to an inappropriately wide confidence interval, and (2) a comparison of absolute rather than relative probabilities.

- It would be good to see the raw data for each year.

- The paper's main finding---a sharp drop in concussions, especially in kickoff plays---is supported both by the data and our substantive understanding.

- Concussion rates fell faster for kickoff plays, but they also dropped a lot in non-kickoff plays. I think the published article should've emphasized this finding more than it did, but perhaps the authors were hindered here by the error in their analysis that led to an inappropriately wide confidence interval for non-kickoff plays. The decline in concussions for non-kickoff plays is consistent with a general rise in concussion awareness.

P.S. I corrected a mistake in the above Poisson regression code, pointed out by commenter TJ (see here).

Strategic choice of where to vote in November

Darcy Kelley sends along a link to this site, Make My Vote Matter, in which you can enter two different addresses where you might vote, and it will tell you which (if any) of these addresses has elections that are predicted to be close. The site is aimed at students; according to the site, they can register to vote at either their home or school address.

“Six Signs of Scientism”: where I disagree with Haack

I came across this article, “Six Signs of Scientism,” by philosopher Susan Haack from 2009. I think I’m in general agreement with Haack’s views—science has made amazing progress over the centuries but “like all human enterprises, science is ineradicably fallible and imperfect. At best its progress is ragged, uneven, and unpredictable; moreover, much scientific work is unimaginative or banal, some is weak or careless, and some is outright corrupt . . .”—and I’ll go with her definition of “scientism” as “a kind of over-enthusiastic and uncritically deferential attitude towards science, an inability to see or an unwillingness to acknowledge its fallibility, its limitations, and its potential dangers.”

But I felt something was wrong with Haack’s list of the six signs of scientism, which she summarizes as:

1. Using the words “science,” “scientific,” “scientifically,” “scientist,” etc., honorifically, as generic terms of epistemic praise.

2. Adopting the manners, the trappings, the technical terminology, etc., of the sciences, irrespective of their real usefulness.

3. A preoccupation with demarcation, i.e., with drawing a sharp line between genuine science, the real thing, and “pseudo-scientific” imposters.

4. A corresponding preoccupation with identifying the “scientific method,” presumed to explain how the sciences have been so successful.

5. Looking to the sciences for answers to questions beyond their scope.

6. Denying or denigrating the legitimacy or the worth of other kinds of inquiry besides the scientific, or the value of human activities other than inquiry, such as poetry or art.

Yes, these six signs can be a problem. But, to me, these six signs are associated with a particular sort of scientism, which one might call active scientism, a Richard Dawkins-like all-out offensive against views that are considered anti-scientific.

I spend a lot of time thinking about something different: a passive scientism which is not concerned about turf (thus, not showing signs 1 and 3 above); not concerned about the scientific method, indeed often displaying little interest in the foundations of scientific logic and reasoning (thus, not showing sign 4 above); and not showing any imperialistic inclinations to bring the humanities into the scientific orbit (thus, not showing signs 5 or 6 above). Passive scientism does involve adopting the trappings and terminology of science in a thoughtless way, so there is a bit of sign 2 above, but that’s it.

A familiar example of passive scientism is “pizzagate”: the work, publication, and promotion of the studies conducted by the Food and Brand Lab at Cornell University. Other examples include papers published in the Proceedings of the National Academy of Sciences on himmicanes, air rage, and ages ending in 9.

In all these cases, topics were being studied that clearly can be studied by science. So the “scientism” here is not coming into the decision to pursue this research. Rather, the scientism arises from blind trust in various processes associated with science, such as randomized treatment assignment, statistical significance, and peer review.

In this particular manifestation of scientism—claims that bounce around between scientific journals, textbooks, and general media outlets such as NPR and TED talks—there is no preoccupation with identifying the scientific method or with demarcation, but rather the near-opposite: an all-too-calm acceptance of wacky claims that happen to be in proximity to various tokens of science.

So I think that when talking about scientism, we need to consider passive as well as active scientism. For every Dawkins, there is a Gladwell.

David Weakliem points out that both economic and cultural issues can be more or less “moralized.”

David Weakliem writes:

Thomas Edsall has a piece in which he cites a variety of work saying that Democratic and Republican voters are increasingly divided by values. He’s particularly concerned with “authoritarianism,” which is an interesting issue, but one I’ll save for another post. What I want to talk about here is the idea that the recent rise in political polarization is the result of a rise of “cultural and lifestyle politics” at the expense of economic issues. The reasoning is that it’s easier to compromise on economics, on which you can split the difference, than on cultural issues, which involve principles of right and wrong. The idea that culture has been displacing economics as the main axis of political conflict has been around for about fifty years—it was first proposed in response to the developments of the late 1960s and early 1970s. I think it has value (although with some qualifications which I discuss in this paper), but I don’t see how it can explain the rising polarization of the last decade or so. In that time, the single most divisive issue in American politics has probably been the Affordable Care Act. This is basically an economic policy, and a very complicated one involving a lot of technical issues—that is, exactly the kind of issue where it seems you could make deals, offering a concession here in return for getting something there. The second most divisive issue has probably been the combination of bailouts, tax changes, and stimulus spending that gave birth to the Tea Party: another complicated set of economic policies that seemed to offer lots of room for compromise. Meanwhile, some leading cultural issues have faded. For example, same-sex marriage is widely accepted—even people who aren’t enthusiastic about it have mostly given up the fight. Another example involves drugs: a consensus seems to be developing in favor of legalizing and regulating marijuana, and the rise in opioid abuse has been treated as a public health problem rather than producing a “moral panic.”

What I think these examples show is that both economic and cultural issues can be more or less “moralized.” There was a period in the middle of the 20th century when leading politicians of both left and right accepted the basic principles of the welfare state and government intervention to maintain high employment. But that consensus had not been around before then, and it isn’t around now. Now issues that were once part of what Seymour Martin Lipset called “the politics of collective bargaining” are part of the “culture wars.”

This is an excellent point. Weakliem should have a regular column in the New York Times.

“Moral cowardice requires choice and action.”

Commenter Chris G pointed out this quote from Ta-Nehisi Coates:

Moral cowardice requires choice and action. It demands that its adherents repeatedly look away, that they favor the fanciful over the plain, myth over history, the dream over the real.

Coates was writing about the defenders of the Confederate flag, and he points to this quotation from one of the founders of the Confederate government:

Our new government is founded upon exactly the opposite idea; its foundations are laid, its corner-stone rests, upon the great truth that the negro is not equal to the white man; that slavery, subordination to the superior race, is his natural and normal condition. This, our new government, is the first, in the history of the world, based upon this great physical, philosophical, and moral truth …

As Coates says, it takes some effort to look away from this history.

What is the default attitude when confronted with criticism?

Coates’s insight is interesting and unexpected in that one would typically think of cowardice or denial as the easy option, the default when confronted with evidence that your beliefs don’t make sense, or are contradicted by the evidence, or (in the case of racism and white supremacy) lead to reprehensible behavior.

Coates’s point is that, while cowardice or denial may well be the default option from a psychological perspective, it is not an easy position to hold from a logical perspective.

From an intellectual perspective, cowardice—or, more generally, intellectual dishonesty—requires choice and action. It takes effort, a continuing effort to defy logical gravity, as it were.

Cool postdoc position in Arizona on forestry forecasting using tree ring models!

Margaret Evans sends in this cool job ad:

Two-Year Post Doctoral Fellowship in Forest Ecological Forecasting, Data Assimilation

A post-doctoral fellowship is available in the Laboratory of Tree-Ring Research (University of Arizona) to work on an NSF Macrosystems Biology-funded project assimilating together tree-ring and forest inventory data to analyze patterns and drivers of forest productivity across the interior western U. S. The aim of the project is to generate ecological forecasts of future forest ecosystem functioning, especially carbon sequestration, in the face of rising temperatures and evaporative demand. The approach is to leverage an existing, continental-scale ecological observatory network (the permanent sample plot network of the U. S. Forest Service’s Forest Inventory and Analysis Program [FIA]) and assimilate into it a new data stream: annual-resolution time series of individual tree growth from ~6,000 increment cores collected in the same plot network.

The post-doc will be able to participate in all aspects of the project, with an emphasis on manipulating Forest Inventory and Analysis (FIA) census data, tree-ring data, and climate data, and scaling up an existing data assimilation workflow, with the opportunity to develop lines of research related to the themes of the lab based on their interests. The project will be co-supervised by Margaret Evans (Laboratory of Tree-Ring Research, University of Arizona), Justin DeRose and John Shaw (Interior West-FIA, Rocky Mountain Research Station) and statistical ecologists Andrew Finley (Michigan State University) and Mike Dietze (Boston University), along with the cyberinfrastructure support of NSF’s CyVerse.

Applicants should have a PhD in ecology, forestry, or related field with strong statistical and computing skills, or a PhD in mathematics, applied mathematics, statistics, or a related field, with experience or interest in plant or forest ecology. The successful candidate will have a background and/or strong interest in hierarchical Bayesian models, data assimilation, dynamic linear modeling, ecological forecasting, uncertainty quantification, spatial statistics, dendrochronology, and/or computer science (e.g., writing MCMC samplers). Experience working with large datasets or databases, strong writing skills and associated publications in peer-reviewed literature, communication skills, and mentoring and collaboration skills are also strongly valued.

The position is funded for two years, beginning as soon as December of 2018. Duties will be carried out at the Laboratory of Tree-Ring Research on the University of Arizona campus in Tucson, Arizona. The University of Arizona is a committed Equal Opportunity/Affirmative Action Institution. Women, minorities, veterans and individuals with disabilities are encouraged to apply. Situated an hour and a half from Mexico in the Sonoran desert and Sky Island region of southeastern Arizona, Tucson has an exceptionally low cost of living along with a wide range of opportunities for outdoor recreation and biological and cultural richness. One example is the recent designation of Tucson as a UNESCO World City of Gastronomy. Complete applications must include (1) a cover letter, (2) curriculum vita, and (3) names and contact information for three references, and should be submitted through the UACareers portal at https://uacareers.com/postings/32591. Applications will be reviewed until the position is filled.

Tree ring analysis! That’s challenging. Looks like a great project.

But . . . what’s with “writing MCMC samplers”? Can’t they just run Stan? I’m not joking here.

My talk tomorrow (Tues) 4pm in the Biomedical Informatics department (at 168th St)

The talk is 4-5pm in Room 200 on the 20th floor of the Presbyterian Hospital Building, Columbia University Medical Center.

I’m not sure what I’m gonna talk about. It’ll depend on what people are interested in discussing. Here are some possible topics:

– The failure of null hypothesis significance testing when studying incremental changes, and what to do about it;

– Cool stuff in Stan;

– Bayes for big data: Challenges and possible research directions;

– Multilevel models and N=1 studies: Bridging between big and little data;

– Bayesian workflow;

– Improving statistical graphics through statistical modeling, and vice versa;

– Informative prior distributions and how to be a frequentist;

– Teaching statistics.

Or whatever else comes up.

Bob Erikson on the 2018 Midterms

A couple months ago I wrote about party balancing in the midterm elections and pointed to the work of Joe Bafumi, Bob Erikson, and Chris Wlezien.

Erikson recently sent me this note on the upcoming midterm elections:

Donald Trump’s tumultuous presidency has sparked far more than the usual interest in the next midterm elections as a possible midcourse correction. Can the Democrats win back the House of Representatives and possibly even the Senate in 2018? This short essay presents some observations about midterm elections and congressional elections generally, followed by some considerations relevant toward understanding the upcoming 2018 midterm verdict. Most of my [Erikson’s] remarks would be commonplace among seasoned congressional election scholars. Please note, however, that I tout a theory of ideological balancing in elections that remains controversial in some quarters.

As soon as it became clear that Donald Trump would become the next president of the United States, it became almost certain that the Democrats would gain seats in the House (and possibly the Senate) in 2018. For the past 32 midterm elections going back to 1878, in all but three instances (1934, 1998, 2002) the out-party gained seats. In all but two instances (1926, 2002), the out-party gained in terms of vote percentage. So we have a sort of law of politics. Why? One factor is that when presidents have coattails, these coattails are withdrawn at midterm. But winning presidents often lack coattails. Another factor is that unpopular presidents (e.g., Trump) see their party receive the greatest punishment at midterm. But on average presidents are not uniquely unpopular at midterm. So there must be something more.

Further, it is not just that the presidential party’s vote and seat margins decline at midterm. The presidential party also suffers in terms of its level of support. Each major party performs worse at midterm when it holds the presidency than when it does not. If the goal of a congressional party is to do well at midterm, it should lose the presidency.

The mechanism that drives midterm loss is ideological balancing, a theory best elaborated by Alberto Alesina and Howard Rosenthal in the 1990s. The election of a president tilts the direction of national policy to the left or the right of the median voter, depending on the party of the winner. The way for the median voter (and voters generally) to make the ideological correction at midterm is to vote more members of the opposition party into office.

One challenge to balance theory as the explanation for near-universal midterm loss, however, is that presidential elections are often predictable in advance. Knowing the presidential winner, presidential-year voters could balance in advance by casting congressional ballots for the party they expected to lose the presidency. Certain winners, however, are likely to have coattails that also carry their ticket-mates into office via straight-ticket voting. In large and predictable presidential wins, coattails work but are offset by presidential-year balancing. At midterm, withdrawn coattails lead to midterm loss. In other words, coattails and balancing work in tandem to account for nearly universal midterm loss.

An interesting circumstance is when perceptions of the likely presidential winner are wrong. In 1948, everyone “knew” that Thomas Dewey would defeat President Harry Truman. But, as everyone knows, Truman won in a huge upset. What is less well known is that Truman’s Democrats gained a whopping 75 House seats in 1948, arguably because people were voting Democratic for Congress to block “President Dewey.” The 2016 election provided an analogue. Plausibly, many who thought Hillary Clinton would win voted Republican for Congress to block, thus accounting for the Democrats’ surprisingly feeble performance at the congressional level in 2016.

It is arguable that balancing also works in the reverse direction. In 2016, for instance, the fact that the Republicans seemed to have a firm hold on Congress may have led some to put their thumb on the scale for Clinton over Trump for the presidency. In 1994 and 2010, Republican congressional triumphs could have made it easier for Bill Clinton and Barack Obama to win reelection two years later. In retrospect, maybe Harry Truman was reelected in 1948 because his campaign against the “do-nothing” Republican Congress was surprisingly effective. (As side evidence, we can see a pattern involving gubernatorial and presidential elections. When a party wins a close gubernatorial election at midterm, it becomes more likely to lose the state in the presidential election two years later. Based on the statistical evidence—via regression discontinuity—that implies this regularity, one could actually project that if the Republicans had counterfactually lost all the governor races they won in 2010, Romney could have defeated Obama in 2012.)

So what about the 2018 midterms? What can we say beyond the casual evidence that there may be a Democratic wave coming this Fall? First, we should believe the generic ballot polls. Properly interpreted, generic ballot polls are more predictive of House midterm elections than presidential polls are of presidential election outcomes. Based on post-WWII history, the way to predict the November vote from polls in mid-spring is roughly to take the vote margin in live-interview polls and cut it in half and then add four points to the out-party. By this measure, the current expectation is that the Democrats should win about 54 percent of the vote. As a statistical prediction, forecasting the midterm vote from regressing the House vote on generic ballot margins plus party control has a smaller expected error (RMSE) than the comparable prediction of presidential elections based on early presidential polls. Generic polls predict, and early in the campaign (e.g. spring of the election year) one should also factor in party control. Survey respondents apparently do not consider party control much until the approach of Election Day.
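
As a quick gut-check on that rule of thumb, here is a toy calculation. This is only a sketch: the 8-point spring margin below is an assumed input chosen to illustrate the arithmetic, not a number taken from Erikson's note.

```python
# Rough translation of the rule of thumb above: halve the spring live-interview
# generic-ballot margin, then add four points to the out-party. The 8-point
# margin is an assumption for illustration, not an actual poll number.

def projected_out_party_share(spring_margin_pts: float) -> float:
    """Out-party's projected share of the national two-party House vote."""
    projected_margin = spring_margin_pts / 2 + 4   # halve, then add 4 points
    return 50 + projected_margin / 2               # convert margin to vote share

print(projected_out_party_share(8.0))  # -> 54.0, consistent with "about 54 percent"
```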

So the chances of the Democrats winning a sizeable majority of the vote for the House are quite good. The question is whether that is enough to win most seats. Because of gerrymandering (both natural and induced by Republican state legislatures), the terrain tilts in favor of the Republicans. For instance, in 2016, for districts where the two parties each ran a candidate for the House, the average Democratic vote was only 1.5 percentage points less than Hillary Clinton’s percent of the local two-party vote. In other words, the national vote was a virtual even split. Yet the Republicans won a decisive plurality of 47 seats.

A natural but wrong way to predict the 2018 seat division is to take the 2016 vote in all districts and project a uniform swing of the two-party vote. By this (wrong) uniform-swing method, a projected 54 percent of the vote would yield a pathetic 12-seat pickup. The Democrats would need an unrealistic 55.8 percent of the vote to win the required 23 new seats. The task of winning the House would be nearly impossible.
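
To make the (wrong) uniform-swing calculation concrete, here is a toy sketch: take each district's Democratic two-party share from the previous election, add the same national swing everywhere, and count the districts above 50 percent. The district shares below are invented for illustration; they are not actual 2016 results.

```python
# Toy illustration of the uniform-swing heuristic discussed above.
# District shares are invented; they are not actual 2016 House results.

def uniform_swing_seats(dem_shares, swing_pts):
    """Count districts the Democrats would win if every district's Democratic
    two-party share shifted by the same number of percentage points."""
    return sum(1 for share in dem_shares if share + swing_pts > 50)

previous_dem_shares = [38, 42, 44, 47, 48, 49, 52, 55, 61, 68]  # made-up districts
for swing in (0, 2, 4):
    print(swing, uniform_swing_seats(previous_dem_shares, swing))
# Swing 0 gives 4 seats, swing 2 gives 5, swing 4 gives 7: the projected gains
# depend entirely on how many districts sit just below 50 percent in the prior vote.
```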

We can see the error in uniform-swing forecasting by applying the uniform swing retrospectively to the two most recent surge elections. If in 2006 the Democrats had known in advance exactly how much their national vote would increase, the uniform-swing heuristic would have led them to predict a gain of only 16 seats rather than their actual yield of 30 seats! If in 2010 the Republicans had known in advance what their gain in the vote margin would be, their expectation would have been a gain of 48 seats, rather than their actual yield of 63 new Republican seats.

Where uniform swing goes awry is in its failure to account for the fact that party effort is strategic. The wave party (Democrats in 2006, Republicans in 2010) had put little effort in the previous election into those districts where its best shot would have been a close loss. Thus, the wave party started with few obvious pickups from close misses in the pre-wave election. Instead, the wave party wins new seats where the non-wave party seemed safe based on prior voting, but where a combination of surging partisan winds plus wave-party effort made them vulnerable. Similarly, for 2018, a combination of a Democratic surge and Democratic effort could make the difference in many seemingly safe Republican districts.

But what about gerrymandering, one might ask. The Democrats did regain the House in 2006 with about the same vote margin that is projected for 2018. They did so despite a gerrymander based on the 2000 Census that was only marginally less harmful than the gerrymander facing the Democrats of 2018, which is based on the 2010 Census. Further, when a wave is extreme, like a 100-year flood, a gerrymander can backfire by upending party control: gerrymanders are designed to protect against everything except, perhaps, that 100-year flood. The 2006 election provides a lesson. Democrats could have picked up 10 additional seats (beyond their total of 30) if the 2006 vote (not the 2004 vote shifted by a uniform swing) had been one percentage point more Democratic everywhere. That is, there were 10 seats the Republicans barely hung on to with less than 51.0 percent of the vote.

There is one major change that has overtaken recent congressional elections and that adds to uncertainty about predicting 2018. This is the growing nationalization of congressional elections; district partisanship now takes on increased importance at the expense of the candidates’ personal vote. (Partisan polarization, plus the fact that party control of Congress is in play, increasingly crowds out congressional attention to local issues.) This trend has several ramifications.

On the one hand, the more partisanship determines outcomes, the stronger the bias from the Republican gerrymander if the national vote is close. On the other hand, nationalized elections offer two temporary advantages to the Democrats. First, the incumbency advantage seems to have largely disappeared in very recent elections. With more Republican than Democratic incumbents, this decline strips Republican incumbents of the insulation that a strong incumbency advantage used to provide. Second, the “swing ratio,” the increment of seats won per increment of the national vote, should steepen as candidates lose control of their fates. Thus the 2018 election could see a sharper shift in Democratic seats per increment in Democratic votes than previously considered normal.

One final uncertainty about 2018 concerns who will vote, and especially voter turnout among the young. Prior to this century, age differences in partisan voting were relatively minimal, so it mattered little if the youth vote did not show up at midterm. In the 2010 and 2014 midterms, balancing and the missing (mainly Democratic) youth vote worked together to create the overwhelming Republican successes. The balancing argument works for the Democrats in 2018. But will the Democrats’ younger voters vote in sufficient numbers to make a difference?

References

Alberto Alesina and Howard Rosenthal. 1995. Partisan Politics, Divided Government and the Economy. Cambridge University Press.

Robert S. Erikson. 2016. Congressional Elections in Presidential Years: Presidential Coattails and Strategic Voting. Legislative Studies Quarterly 41: 551-572.

Robert S. Erikson, Olle Folke, and James Snyder. 2015. Is There a Gubernatorial Helping Hand? Journal of Politics 77: 491-504.

Joseph Bafumi, Robert S. Erikson, and Christopher Wlezien. 2010. Balancing, Generic Polls, and Midterm Congressional Elections. Journal of Politics 72: 705-719.

Very helpful. I’ll just add one clarification regarding factors that are not directly mentioned in the model.

What about campaigning and candidate quality? That’s gotta make a difference, and we’ve heard a lot about these factors in some recent primary elections.

First off, we’d expect individual campaigning and individual features of the candidates to be more important in the primary than in the general election: general elections have just two candidates and are more predictable and more determined by partisanship, in contrast to primaries, which can have multiple candidates from the same party with similar political ideologies; so in primaries it’s more important to differentiate yourself personally and to communicate your existence and your positions to the voters.

That said, campaigning and candidate quality can make a difference in the general election—just ask that Republican who lost the Senate race in Alabama last year.

So, how do these factors fit into Erikson’s story? They come into the model in two ways. Most obviously, there’s the error term. The predictions have a lot of uncertainty, both at the national level and for individual races, and some of that comes down to who’s running, what resources they have, and how they campaign. The other way that campaigning and candidate quality come in is that they are themselves affected by expectations: in a year when everyone expects a swing toward the Democrats—and we’ve been expecting this for the past year, if not more—more people are motivated to run on the Democratic side, and Republicans have less incentive to run. And of course it’s easier to run a strong campaign when public opinion is on your side. So campaigning and candidate quality are implicitly in the model, in that they partly explain how it is that the lead in the polls transfers into votes.

Beyond this there are further complications, as Erikson indicated with his comment about party effort being strategic.

To put it another way, the fact that outcomes are somewhat predictable based on data available a year before the election does not imply that candidates and campaigns don’t make a difference. What’s happening is that (a) in expectation, the difference that will be made by campaigns and candidates is implicitly included in the predictions, and (b) the model’s predictions do have some uncertainty, and that uncertainty includes variation in candidate and campaign effectiveness, as well as uncertainty about national voting trends.

What do you do when someone says, “The quote is, this is the exact quote”—and then misquotes you?

Ezra Klein, editor of the news/opinion website Vox, reports on a recent debate that sits in the center of the Venn diagram of science, journalism, and politics:

Sam Harris, host of the Waking Up podcast, and I [Klein] have been going back and forth over an interview Harris did with The Bell Curve author Charles Murray. In that interview, which first aired almost a year ago, the two argued that African Americans are, for a combination of genetic and environmental reasons, intrinsically and immutably less intelligent than white Americans, and Murray argued that the implications of this “forbidden knowledge” should shape social policy. Vox published a piece criticizing the conversation, Harris was offended by the piece and challenged me to a debate, and after a lot of back-and-forth, this is that debate. . . .

These hypotheses about biological racial difference are now, and have always been, used to advance clear political agendas — in Murray’s case, an end to programs meant to redress racial inequality, and in Harris’s case, a counterstrike against identitarian concerns he sees as a threat to his own career. Yes, identity politics are at play in this conversation, but that includes, as it always has, white identity politics. . . .

You can follow the link and read the discussion and follow more links and read more, etc.

I’ve never met any of the people involved in this discussion, but I’ve written for Vox and corresponded a couple times with Klein, and I’ve had a few interactions with Murray: some emails, also I reviewed one of his books and arranged for him to have the chance to reply to my review in the journal. The topics of genetics, intelligence, race, and identity politics didn’t really come up in any of these discussions.

A trivial thing but it really annoys me

There’s lots here to think about in the above conversation between Harris and Klein, but this is the part that jumped out at me:

Sam Harris:

One line [of Klein’s earlier Vox article] said while I have a PhD in neuroscience I appear to be totally ignorant of facts that are well known to everyone in the field of intelligence studies.

Ezra Klein:

I think you should quote the line. I don’t think that’s what the line said.

Sam Harris:

The quote is, this is the exact quote: “Sam Harris appeared to be ignorant of facts that were well known to everyone in the field of intelligence studies.” Now that’s since been quietly removed from the article, but it was there and it’s archived.

Klein follows up, not in the conversation but in his post:

[I went back and looked into this and, as far as I can tell, the original quote that Harris is referring to is this one: “Here, too briefly, are some facts to ponder — facts that Murray was not challenged to consider by Harris, who holds a PhD in neuroscience, although they are known to most experts in the field of intelligence.” Here is the first archived version of the piece if you want to compare it with the final.]

This really bugs me. Harris says, “The quote is, this is the exact quote”—and follows up with something that’s not the exact quote.

I mean, really, what’s the point of that? How do you deal with people who do this sort of thing? I guess it’s related to the idea we talked about the other day, the distinction between truth and evidence. Presumably, Harris feels that he’s been maligned, and he’s not so concerned about the details. So when he says “this is the exact quote,” what he means is: This is the essential truth.

Kinda like David Brooks, who publishes checkably false statements and then, when people call him on it, refuses to make corrections and instead expresses irritation. I assume that Brooks, too, believes that he is true in the essentials and considers the factual details as somewhat irrelevant.

As a statistician, I find this attitude sooooo frustrating.