
“A hard case for Mister P”

Kevin Van Horn sent me an email with the above title (ok, he wrote MRP, but it’s the same idea) and the following content:

I’m working on a problem that at first seemed like a clear case where multilevel modeling would be useful. As I’ve dug into it I’ve found that it doesn’t quite fit the usual pattern, because it seems to require a very difficult post-stratification.

Here is the problem. You have a question in a survey, and you need to estimate the proportion of positive responses to the question for a large number (100) of different subgroups of the total population at which the survey is aimed. The sample size for some of these subgroups can be rather small. If these were disjoint subgroups then this would be a standard multi-level modeling problem, but they are not disjoint: each subgroup is defined by one or two variables, but there are a total of over 30 variables used to define subgroups.

For example, if x[i], 1 <= i <= 30, are the variables used to define subgroups, subgroup i for i <= 30 might be defined as those individuals for which x[i] > 1, with the other subgroup definitions involving combinations of two or possibly three variables. Examples of these subgroup definitions include patterns such as
· x1 == 1 OR x2 == 1 OR x3 == 1
· (x1 == 1 OR x1 == 2) AND x3 < 4.

You could do a multilevel regression with post-stratification, but that post-stratification step looks very difficult. It seems that you would need to model the 30-dimensional joint distribution for the 30 variables describing subgroups.

Have you encountered this kind of problem before, or know of some relevant papers to read?

I replied:

In your example, I agree that it sounds like it would be difficult to compute things on 2^30 cells or however many groups you have in the population. Maybe some analytic approach would be possible? What are your 30 variables?

And then he responded:

The 30+ variables are a mixture of categorical and ordinal survey responses indicating things like the person’s role in their organization, decision-making influence, familiarity with various products and services, and recognition of various ad campaigns. So you might have subgroups such as “people who recognize any of our ad campaigns,” or “people who recognize ad campaign W,” or “people with purchasing influence for product space X,” or more tightly defined subgroups such as “people with job description Y who are familiar with product Z.”

Here’s some more context. I’m looking for ways of getting better information out of tracking studies. In marketing research a tracking study is a survey that is run repeatedly to track how awareness and opinions change over time, often in the context of one or more advertising campaigns that are running during the study period. These surveys contain audience definition questions, as well as questions about familiarity with products, awareness of particular ads, and attitudes towards various products.

It’s hard to get clients to really understand just how large sampling error can be, so there tends to be a lot of upset and hand wringing when they see an unexplained fluctuation from one month to the next. Thus, there’s significant value in finding ways to (legitimately) stabilize estimates.

Where things get interesting is when the client wants to push the envelope by
a) running surveys more often, but with a smaller sample size, so that the total number surveyed per month remains the same, or
b) tracking results for many different overlapping subgroups.

I’m seeing some good results for handling (a) by treating the responses in each subgroup over time as a time series and applying a simple state-space model with binomial error model; this is based on the assumption that the quantities being tracked don’t typically change radically from one week to the next. This kind of modeling is less useful in the early stages of the study, however, when you don’t yet have much information on the typical degree of variation from one time period to the next. Multilevel modeling for b) seems like a good candidate for the next improvement in estimation, and would help even in the early stages of the study, but as I mentioned, the post-stratification looks difficult.
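The state-space idea in (a) can be sketched with a simple grid-based filter. This is an illustration, not Van Horn's actual model. It assumes, hypothetically, that the true proportion does a Gaussian random walk on the probability scale with a per-wave standard deviation `step_sd`, discretized onto a grid inside (0, 1), and that each wave's count is binomial. The names `filter_proportions`, `step_sd`, and `grid_size` and the survey counts are all made up for the example.

```python
import math


def binomial_loglik(k, n, p):
    """Binomial log-likelihood of k positives out of n, without the constant term."""
    return k * math.log(p) + (n - k) * math.log1p(-p)


def filter_proportions(counts, sizes, step_sd=0.02, grid_size=201):
    """Filtered posterior mean of a tracked proportion after each survey wave.

    Assumed (hypothetical) model: the true proportion does a Gaussian random
    walk with standard deviation step_sd per wave, discretized onto a grid
    inside (0, 1), and each wave's count is Binomial(n, p).
    """
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]

    # Transition matrix: row i is P(next grid point | current grid point i),
    # renormalized so the walk cannot leave the grid.
    kernel = []
    for a in grid:
        row = [math.exp(-0.5 * ((b - a) / step_sd) ** 2) for b in grid]
        total = sum(row)
        kernel.append([r / total for r in row])

    weights = [1.0 / grid_size] * grid_size  # flat prior before the first wave
    means = []
    for t, (k, n) in enumerate(zip(counts, sizes)):
        if t > 0:
            # Predict: push the previous posterior through the random walk.
            weights = [
                sum(weights[i] * kernel[i][j] for i in range(grid_size))
                for j in range(grid_size)
            ]
        # Update: multiply by the binomial likelihood (shifted by its maximum
        # to avoid underflow) and renormalize.
        logliks = [binomial_loglik(k, n, p) for p in grid]
        top = max(logliks)
        weights = [w * math.exp(ll - top) for w, ll in zip(weights, logliks)]
        total = sum(weights)
        weights = [w / total for w in weights]
        means.append(sum(w * p for w, p in zip(weights, grid)))
    return means


# Four waves of 50 respondents each (made-up numbers).
counts = [12, 18, 9, 15]
sizes = [50, 50, 50, 50]
raw = [k / n for k, n in zip(counts, sizes)]
smoothed = filter_proportions(counts, sizes)
print([round(x, 3) for x in raw])
print([round(x, 3) for x in smoothed])
```

The filtered series moves much less than the raw wave-by-wave proportions. How much less depends entirely on `step_sd`, which is the quantity Van Horn notes is poorly known early in a study.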

Now here’s me again:

I see what you’re saying about the poststrat being difficult. In this case, one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells—for example, just use the proportion of respondents in the different categories in your sample—and then go from there. The point is that the poststrat would be giving you stability, even if it’s not matching quite to the population of interest.
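Using the sample itself as the poststratification frame makes each subgroup estimate an average of model predictions over the respondents in that subgroup. Here is a minimal sketch of that idea. The `predict` function is a hypothetical stand-in for a fitted multilevel model, and the four respondents are invented. The two predicates encode the subgroup patterns from Van Horn's first email.

```python
def poststratify_over_sample(respondents, predict, in_subgroup):
    """Average model predictions over the respondents who fall in a subgroup.

    The sample itself serves as the poststratification frame: every
    respondent's covariate profile gets equal weight.
    """
    preds = [predict(x) for x in respondents if in_subgroup(x)]
    if not preds:
        raise ValueError("no respondents fall in this subgroup")
    return sum(preds) / len(preds)


def predict(x):
    # Hypothetical stand-in for a fitted model's predicted probability
    # of a positive response.
    return 0.1 + 0.1 * x["x1"] + 0.2 * x["x2"]


respondents = [
    {"x1": 1, "x2": 0, "x3": 3},
    {"x1": 0, "x2": 1, "x3": 5},
    {"x1": 2, "x2": 0, "x3": 2},
    {"x1": 0, "x2": 0, "x3": 1},
]


def any_of_three(x):
    # x1 == 1 OR x2 == 1 OR x3 == 1
    return x["x1"] == 1 or x["x2"] == 1 or x["x3"] == 1


def x1_low_x3(x):
    # (x1 == 1 OR x1 == 2) AND x3 < 4
    return x["x1"] in (1, 2) and x["x3"] < 4


print(poststratify_over_sample(respondents, predict, any_of_three))
print(poststratify_over_sample(respondents, predict, x1_low_x3))
```

Because overlapping subgroups are just different predicates over the same frame, nothing here requires the subgroups to be disjoint. The stability comes from `predict` being partially pooled, not from the averaging step.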

And Van Horn came back with:

You write, “one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells.”

But there are millions of poststrat cells… Or are you thinking of doing some simple modeling of the distribution for the poststrat cells, e.g. treating the stratum-defining variables as independent?

That sounds like it could often be a workable approach.
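The independence shortcut Van Horn mentions can be sketched as Monte Carlo poststratification. The idea is to draw synthetic profiles from the product of the marginal distributions, then average model predictions over the draws that satisfy each subgroup definition. In this sketch the marginals, the `predict` function, and the helper names are hypothetical. The main caveat is that any correlation between variables (say, job description and purchasing influence) is ignored, so subgroups defined by correlated variables can be mis-weighted.

```python
import random


def simulate_frame(marginals, n_draws, seed=0):
    """Draw synthetic covariate profiles, treating every variable as independent.

    marginals maps a variable name to {value: population share}.
    """
    rng = random.Random(seed)
    frame = []
    for _ in range(n_draws):
        profile = {}
        for name, dist in marginals.items():
            values = list(dist)
            shares = [dist[v] for v in values]
            profile[name] = rng.choices(values, weights=shares)[0]
        frame.append(profile)
    return frame


def subgroup_estimate(frame, predict, in_subgroup):
    """Return (poststratified estimate, estimated population share) for a subgroup."""
    members = [x for x in frame if in_subgroup(x)]
    estimate = sum(predict(x) for x in members) / len(members)
    return estimate, len(members) / len(frame)


# Hypothetical marginals and a stand-in for a fitted model.
marginals = {"x1": {0: 0.5, 1: 0.5}, "x2": {0: 0.5, 1: 0.5}}


def predict(x):
    return 0.1 + 0.1 * x["x1"] + 0.2 * x["x2"]


frame = simulate_frame(marginals, n_draws=20_000)
est, share = subgroup_estimate(frame, predict, lambda x: x["x1"] == 1 or x["x2"] == 1)
print(round(est, 3), round(share, 3))
```

With 30 variables the frame is still just a list of draws, so the cost grows with the number of draws rather than with the number of cells.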

Just to stir the pot, though . . . One could argue that a good solution should have good asymptotic behavior, in the sense that, in the limit of a large subgroup sample size, the estimate for the proportion should tend to the empirical proportion. Certainly if one of the subgroups is large, in which case one would expect the empirical proportion to be a good estimate for that subgroup, and my multilevel-model-with-poststrat gives an estimate that differs significantly from the “obvious” answer, this is likely to raise questions about the validity of the approach. It seems to me that, to achieve this asymptotic behavior, I’d need to be able to model the distribution of poststrat cells at arbitrary levels of detail as the sample size increases. This line of thought has me looking into Bayesian nonparametric modeling.
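The asymptotic property Van Horn asks for is easy to see in the simplest possible setting: a single group whose proportion gets a beta prior standing in for what the other groups imply. This is a simplification, not a full multilevel model. The posterior mean is (y + a)/(n + a + b), and it approaches y/n as n grows. The prior mean of 0.5 and strength of 20 below are arbitrary. What this sketch leaves out is the poststratification step, which is where Van Horn expects the difficulty.

```python
def partially_pooled(successes, n, prior_mean=0.5, prior_strength=20.0):
    """Posterior mean under a Beta(prior_mean * s, (1 - prior_mean) * s) prior."""
    a = prior_mean * prior_strength
    b = (1.0 - prior_mean) * prior_strength
    return (successes + a) / (n + a + b)


for n in (10, 100, 1_000, 10_000):
    y = round(0.3 * n)  # empirical proportion held at 0.3
    print(n, round(partially_pooled(y, n), 4))
```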

Fun stuff.

Stroopy names

Baby Name Wizard is all over this one.

And this all makes me wonder: is there a psychology researcher somewhere with a dog named Stroopy? Probably so.

P.S. I just made the mistake of googling “Stroopy.” Don’t do it. I was referring to this.

Some quick disorganized tips on classroom teaching

Below are a bunch of little things I typically mention at some point when I’m teaching my class on how to teach. But my new approach is to minimize lecturing, and certainly not to waste students’ time by standing in front of a group of them, telling them things they could’ve read at their own pace.

Anyway, here I am preparing my course on statistical computing and graphics and thinking of points to mention during the week on classroom teaching. My old approach would be to organize these points in outline format and then “cover” them in class. Instead, though, I’ll stick them here and then I can assign this to students to read ahead of time, freeing up class time for actual discussion.

Working in pairs:

This is the biggie, and there are lots of reasons to do it. When students are working in pairs, they seem less likely to drift off, and with two students there is more of a chance that one of them is interested in the topic. Students learn from teaching each other, and they can work together toward solutions. It doesn’t always work for students to do homeworks in pairs or groups (I have a horrible suspicion that they’ll often just split up the task, with one student doing problem #1, another doing problem #2, and so forth), but having them work in pairs during class seems like a no-lose proposition.

The board:

Students don’t pay attention all the time nor do they have perfect memories; hence, use the blackboard as a storage device. For example, if you are doing a classroom activity (such as the candy weighing), outline the instructions on the board at the same time as you explain them to the class. For another example, when you’re putting lots of stuff on the board, organize it a bit: start at the top-left and continue across and down, and organize the board into columns with clear labels. In both cases, the idea is that if a student is lost, he or she can look up at the board and have a chance to see what’s up.

Another trick is to load up the board with relevant material before the beginning of class period, so that it’s all ready for you when you need it.

The projector:

It’s becoming standard to use beamer (powerpoint) slide presentations in classroom teaching as well as with research lectures. I think this is generally a good idea, and I have just a few suggestions:
- Minimize the number of words on the slides. If you know what you’re talking about, you can pretty much just jump from graph to graph.
- The trouble with this strategy is that, without seeing the words on the screen, it can be hard to remember what to say. This suggests that what we really need is a script (or, realistically, a set of notes) to go along with the slide show. Logistically this is a bit of a mess—it’s hard enough to keep a set of slides updated without having to keep the script aligned at the same time—and as a result I’ve tended to err on the side of keeping too many words on my slides (see here, for example). But maybe it’s time for me to bite the bullet and move to a slides-and-script format.

Another intriguing possibility is to go with the script and ditch the slides entirely. Indeed, you don’t even need a script; all you need are some notes or just an idea of what you want to be talking about. I discovered this gradually over the past few years when giving talks (see here for some examples). I got into the habit of giving a little introduction and riffing a bit before getting to the first slide. I started making these ad libs longer and longer, until at one point I gave a talk that started with 20 minutes of me talking off the cuff. It seemed to work well, and the next step was to give an entire talk with no slides at all. The audience was surprised at first but it went just fine. Most of the time I come prepared with a beamer file full of more slides than I’ll ever be able to use, but it’s reassuring to know that I don’t really need any of them.

Finally, assuming you do use slides in your classes, there’s the question of whether to make the slides available to the students. I’m always getting requests for the slides but I really don’t like it when students print them out. I fear that students are using the slides as a substitute for the textbook, and also that if the slides are available, students will think they don’t need to pay attention during class because they can always read the slides later.

It’s funny: Students are eager to sign up for a course to get that extra insight they’ll obtain from attending classes, beyond whatever they’d get by simply reading the textbook and going through the homework problems on their own. But once they’re in class, they have a tendency to drift off, and I need to pull all sorts of stunts to keep them focused.

The board and the projector, together:

Just cos your classroom has a projector, that don’t mean you should throw away your blackboard (or whiteboard, if you want to go that stinky route). Some examples:
- I think it works better to write out an equation or mathematical derivation in real time rather than to point at different segments of an already-displayed formula.
- It can help to mix things up a little. After a few minutes of staring at slides it can be refreshing to see some blackboard action.
- You can do some fun stuff by projecting onto the blackboard. For example, project x and y axes and some data onto the board, then have a pair of students come up and draw the regression line with chalk. Different students can draw their lines, then you click onto the next slide which projects the actual line.

Handouts:

Paper handouts can be a great way to increase the effective “working memory” for the class. Just remember not to overload a handout. Putting something on paper is not the same thing as having it be read. You should figure out ahead of time what you’re going to be using in class and then refer to it as it arises.

I like to give out roughly two-thirds as many handouts as there are people in the audience. This gives the handouts a certain scarcity value, and it also pushes students to discuss in pairs, since they’re already sharing the handouts. I found that when I’d give a handout to every person in the room, many people would just stick the handout in their notebook. The advantage of not possessing something is that you’re more motivated to consume it right away.

Live computer demonstrations:

These can go well. It perhaps goes without saying that you should try the demo at home first and work out the bugs, then prepare all the code as a script which you can execute on-screen, one paragraph of code at a time. Give out the script as a handout and then the students can follow along and make notes. And you should decide ahead of time how fast you want to go. It can be fine to do a demo fast to show how things work in real life, or it can be fine to go slowly and explain each line of code. But before you start you should have an idea of which of these you want to do.

Multiple screens:

When doing computing, I like to have four windows open at once: the R text editor, the R console, an R graphics window (actually nowadays I’ll usually do this as a refreshable pdf or png window rather than bothering with the within-R graphics window), and a text editor for whatever article or document I’m writing.

But it doesn’t work to display 4 windows on a projected screen: there’s just not enough resolution, and, even if resolution were not a problem, the people in the back of the room won’t be able to read it all. So I’m reluctantly forced to go back and forth between windows. That’s one reason it can help to have some of the material in handout form.

What I’d really like is multiple screens in the classroom so I can project different windows on to different screens and show all of them at once. But I never seem to be in rooms with that technology.


That’s “just in time teaching”; see here for details. I do this with all my classes now.

Peer instruction:

This is something where students work together in pairs on hard problems. It’s an idea from physics teaching that seems great to me but I’ve never succeeded in implementing true “peer instruction” in my classes. I have them work in pairs, yes, but the problems I give them don’t look quite like the “Concept Tests” that are used in the physics examples I’ve seen. The problem, perhaps, is that intro physics is just taught at a higher level than intro statistics. In my intro statistics classes, it’s hard enough to get the students to learn about the basics, without worrying about getting them into more advanced concepts. So when I have students work in pairs, it’s typically on more standard problems.


In addition to these pair or small-group activities, I like the idea of quick drills that I shoot out to the whole class and students do, individually, right away. I want them to be able to handle basic skills such as sqrt(p*(1-p)/n) or log(a*x^(2/3)) instantly.
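As a worked instance of those two drills (the values p = 0.5 and n = 100 are chosen only for illustration):

```latex
\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.5 \times 0.5}{100}} = \sqrt{0.0025} = 0.05,
\qquad
\log\left(a x^{2/3}\right) = \log a + \tfrac{2}{3}\log x .
```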

Getting their attention:

You want your students to stay awake and interested, to enter the classroom full of anticipation and to leave each class period with a brainful of ideas to discuss. Like a good movie, your class should be a springboard for lots of talk.

But you don’t want to get attention for the wrong things. An extreme example is the Columbia physics professor who likes to talk about his marathon-fit body and at one point felt the need to strip to his underwear in front of his class. This got everyone talking—but not about physics. At a more humble level, I sometimes worry that I’ll do goofy things in class to get a laugh, but then the students remember the goofiness and not the points I was trying to convey. Most statistics instructors probably go too far in the other direction, with a deadpan demeanor that puts the students to sleep.

It’s ok to be “a bit of a character” to the extent that this motivates the students to pay attention to you. But, again, I generally recommend that you structure the course so that you talk less and the students talk more.

Walking around the classroom:

Or wheeling around, if that’s your persuasion. Whatever. My point here is that you want your students to spend a lot of the class time working on problems in pairs. While they’re doing this, you (and your teaching assistants, if this is a large so-called lecture class with hundreds of students) should be walking around the room, seeing how the students are doing.

Teaching tips in general:

As I explained in my book with Deb Nolan, I’m not a naturally good teacher and I struggle to get students to participate in class. Over the decades I’ve collected lots of tricks because I need all the help I can get. If you’re a naturally good teacher or if your classes already work, then maybe you can do without these ideas.


It’s not clear how much time should be spent preparing the course ahead of time. I think it’s definitely a good idea to write the final exam and all the homeworks before the class begins (even though I don’t always do this!) because then it gives you a clearer sense of where you’re heading. Beyond that, it depends. I’m often a disorganized teacher and I think it helps me a lot to organize the entire class before the semester begins.

Other instructors are more naturally organized and can do just fine with a one-page syllabus that says which chapters are covered which weeks. These high-quality instructors can then just go into each class, quickly get a sense of where the students are stuck, and adapt the class accordingly. For them, too much preparation might well backfire.

My problem is that I’m not so good at individualized instruction; even in a small class, it’s hard for me to keep track of where each student is getting stuck, and what the students’ interests and strengths are. I’d like to do better on this, but for now I’ve given up on trying to adapt my courses for individuals. Instead I’ve thrown a lot of effort into detailed planning of my courses, with the hope that these teaching materials will be useful for other instructors.

Students won’t (in general) reach your level of understanding:

You don’t teach students facts or even techniques, you teach them the skills needed to solve problems (including the skills needed to find the solution on their own). And there’s no point in presenting things they’re not supposed to learn; for example, if a mathematical derivation is important, put it on the exam with positive probability. And if students aren’t gonna get it anyway (my stock example here is the sampling distribution of the sample mean), just don’t cover it. That’s much better, I think, than wasting everyone’s time and diluting everyone’s trust level with a fake-o in-class derivation.

The road to a B:

You want a plan by which a student can get by and attain partial mastery of the material. See discussion here.


What, if anything, did the students actually learn during the semester?

You still might want to evaluate what your students are actually learning, but we don’t usually do this. I don’t even do it, even though I talk about it. Creating a pre-test and post-test is work! And it requires some hard decisions. Whereas not testing at all is easy. And even when educators try to do such evaluations, they’re often sloppy, with threats to validity you could drive a truck through. At the very least, this is all worth thinking about.

Relevance of this advice to settings outside college classrooms:

Teaching of advanced material happens all over, not just in university coursework, and much of the above advice holds more generally. The details will change with the goals—if you’re giving a talk on your latest research, you won’t want the audience to be spending most of the hour working in pairs on small practice problems—but the general principles apply.

Anyway, it was pretty goofy that I used to teach a course on teaching and stand up and say all these things. It makes a lot more sense to write it here and reserve class time for more productive purposes.

One more thing

I can also add to this post between now and the beginning of class. So if you have any ideas, please share them in the comments.

On deck this week

Mon: Some quick disorganized tips on classroom teaching

Tues: Stroopy names

Wed: “A hard case for Mister P”

Thurs: The field is a fractal

Fri: Replication Wiki for economics

Sat, Sun: As Chris Hedges would say: Stop me if you’ve heard this one before

My courses this fall at Columbia

Stat 6103, Bayesian Data Analysis, TuTh 1-2:30:

We’ll be going through the book, section by section. Follow the link to see slides and lecture notes from when I taught this course a couple years ago. This course has a serious workload: each week we have three homework problems, one theoretical, one computational, and one applied.

Stat 6191, Statistical Communication and Graphics, TuTh 10-11:30:

This is an entirely new course that will be structured around student participation. I’m still working out the details but here’s the current plan of topics for the 13 weeks:

1. Introducing yourself and telling a story
2. Introducing the course
3. Presenting and improving graphs
4. Graphing data
5. Graphing models
6. Dynamic graphics
7. Programming
8. Writing
9. Giving a research presentation
10. Collaboration and consulting
11. Teaching a class
12-13. Student projects

Why am I teaching these courses?

The motivation for the Bayesian Data Analysis class is obvious. There’s a continuing demand for this course, and rightly so, as Bayesian methods are increasingly powerful for a wide range of applications. Now that our book is available, I see the BDA course as having three roles: (1) the lectures serve as a guide to the book, as we talk through each section and point out tricky points and further research; (2) the regular schedule of homework assignments gives students a lot of practice applying and thinking about Bayesian methods; and (3) students get feedback from the instructor, teaching assistant, and others in the class.

The idea of the communication and graphics class is that statistics is all about communication to oneself and to others. I used to teach a class on teaching statistics but then I realized that classroom teaching is just one of many communication tasks, along with writing, graphics, programming, and various forms of informal contact. I think it’s important for this class to not be conducted through lectures, or guest lectures, or whatever, but rather as much as possible via student participation.

“Psychohistory” and the hype paradox

Lee Wilkinson writes:

I thought you might be interested in this post.

I was asked about this by someone at Skytree and replied with this link to Tyler Vigen’s Spurious Correlations. What’s most interesting about Vigen’s site is not his video (he doesn’t go into the dangers of correlating time series, for example), but his examples themselves.

The GDELT project is a good case, I think, of how Big Data is wagging the analytic dog. The bigger the data, the more important the analysis. There seem to be at least a few at Google who have caught this disease.

The post Lee links to above is called “Towards Psychohistory: Uncovering the Patterns of World History with Google BigQuery” and is full of grand claims about using a database of news media stories “to search for the basic underlying patterns of global human society” and that “world history, at least the view of it we see through the news media, is highly cyclic and predictable.” Also some pretty graphs such as this one:


I responded to Lee:

Yes, I agree, the grand claims seem bogus to me. But it’s hard for me to judge because I’m not clear exactly what they’re plotting. Is it number of news articles each day including the word “protest” and the country name? Or all news articles featuring the country with a conflict theme?

In any case, perhaps the best analogy is to the excitement in previous eras regarding statistical regularity. A famous example is the number of suicides in a country each year. Typically these rates are stable, just showing some “statistical” variation. And this can seem a bit mysterious. Suicide is the ultimate individual decision yet the number from year to year shows a stunning regularity. Other examples would be the approximate normality of various distributions of measurements, and various appearances of Zipf’s law. In each case, the extreme claims regarding the distributions typically end up seeming pretty silly, but there is something there. In this case, the Google researchers are, as they say, learning something about statistical patterns of media coverage. And that’s fine. I wish they could do it without the hype—but perhaps the hype is the price that we must pay for the work to get done.
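A deliberately oversimplified model shows why this kind of regularity is less mysterious than it first seems. Suppose each of N people independently had the same small yearly probability p, an assumption real data certainly violate. The yearly count would then be binomial, with relative fluctuation sqrt((1 - p)/(Np)), which shrinks as the expected count grows. The population sizes and rate below are made up.

```python
import math


def relative_sd(population, rate):
    """Standard deviation divided by mean for a Binomial(population, rate) count."""
    mean = population * rate
    sd = math.sqrt(population * rate * (1.0 - rate))
    return sd / mean


# Made-up numbers: same per-person rate, two population sizes.
for population in (100_000, 10_000_000):
    print(population, round(relative_sd(population, 1e-4), 4))
```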

And Lee replied:

I’m not clear either regarding the dependent variable.

A few (sort of) random thoughts.

1) There’s little attention given to the number of considerable patterns in the second 60 day period. Not the number of *possible* patterns, because the dependent variable is presumably continuous or at least presents many possible values. I mean instead the number of patterns the researcher would have considered different from each other — a sort of JND measure of how they visually interpret the prediction. My guess is that there are not very many such patterns — in other words, they have a categorical prior over very few values. As evidence of this, they seem to be ignoring relatively small-scale variation in the first case and highlighting it in the second. Very subjective and post-hoc.

2) They appear to be willing to compare different series on different time scales in order to find similar patterns. This is reminiscent of dynamic time warping, which works OK for bird calls but is questionable for historical data. What are the limits of this flexibility in actual practice? One series covering only January and another covering the whole year that are deemed to be similar? I don’t see them explicitly ruling out such extreme comparisons.

3) Rather broadly, this appears to be similar to “charting” methods for picking stocks, which have been discredited for many years. Similar patterns don’t necessarily predict similar outcomes because context matters. Different exogenous variables can produce similar patterns for very different reasons. Put another way, one can find similar patterns in different time series that are based on fundamentally different processes, particularly on a small scale (60 days in this case?).

4) Searching that many correlations based on “p=.05” is arbitrary. I know they need a magic number to help filter, but why give it this appearance of legitimacy?

5) They say, “Whether these patterns capture the actual psychohistorical equations governing all of human life or, perhaps far more likely, a more precise mathematical definition of how journalism shapes our understanding of global events, they demonstrate the unprecedented power of the new generation of “big data” tools like Google BigQuery …” I have no idea what they mean here. Perhaps there is some dynamical system underlying these types of historical events, but until someone identifies plausible variables, I find the observation both breathless and uninteresting.

6) I’m all for BIG DATA. After all, I now work at a machine learning company. But statistics is about using methods that minimize the chance of our being fooled by randomness or bias. The methods used here, it seems to me, offer none of these protections.
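Lee's point 2 about comparing series on different time scales can be illustrated with a minimal dynamic time warping sketch. This is the classic dynamic-programming version with absolute-difference cost, not whatever matching the GDELT authors used. The optional `window` (a Sakoe-Chiba band) is one common way to limit how far the warping may stretch. The two toy series are invented: the same spike appears three steps apart. Unrestricted warping lines the spikes up at zero cost, while a narrow band does not.

```python
def dtw_distance(a, b, window=None):
    """Dynamic time warping distance with absolute-difference cost.

    window, if given, is a Sakoe-Chiba band: index i of a may only be matched
    to indices j of b with |i - j| <= window (widened to at least the length
    difference so that a path always exists).
    """
    n, m = len(a), len(b)
    w = max(n, m) if window is None else max(window, abs(n - m))
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


a = [0, 5, 0, 0, 0, 0]  # spike early
b = [0, 0, 0, 0, 5, 0]  # same spike, three steps later
print(dtw_distance(a, b))            # free warping lines the spikes up
print(dtw_distance(a, b, window=1))  # a narrow band does not
```

Without some such constraint, nearly any two series with similar shapes can be declared "similar," which is the flexibility Lee is worried about.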

I still have positive feelings about this Google effort because, even though the big claims have gotta be bogus, setting aside the hype, ya gotta start somewhere. On the other hand, one can be legitimately annoyed by the hype, in that, without the hype, we never would’ve heard about this in the first place.

Luck vs. skill in poker


The thread of our recent discussion of quantifying luck vs. skill in sports turned to poker, motivating the present post.

1. Can good poker players really “read” my cards and figure out what’s in my hand?

For a couple years in grad school a group of us had a regular Thursday-night poker game, nickel-dime-quarter with a maximum bet of $2, I believe it was. I did ok; it wasn’t hard to be a steady winner by just laying low most of the time and raising when I had a good hand. Since then I’ve played only very rarely (one time was a memorable experience with some journalists and a foul-mouthed old-school politico; I got out of that one a couple hundred dollars up but with no real desire to return), but I did have a friend who was really good. I played a couple of times with him and some others, and it was like the kind of thing you hear about: he seemed to be able to tell what cards I was holding. Don’t get me wrong here, I’m not saying that he was cheating or that it was uncanny or anything, and it’s not like he was taking my money every hand. As always in a limit game, the outcomes had a lot of randomness. But from time to time, in big hands, it really did seem like he was figuring me out. I didn’t think to ask him how he was doing it but I was impressed.

Upon recent reflection, though (many years later), it seems to me that I was slightly missing the point. The key is that my friend didn’t need to “read” me or know what I had; all he needed to do was make the right bets (or, to be more precise, make betting decisions that would perform well on average). He could well have made some educated guesses about my holdings based on my betting patterns (or even my “tells”) and used that updated probability distribution to make more effective betting decisions. The point is that, in many many settings, he doesn’t need to guess my cards; he just needs a reasonable probability distribution (which might be implicit). For example, in some particular situation in a particular hand, perhaps it would be wise for him to fold if the probability is more than 30% that a particular hole card of mine is an ace. With no information, he’d assess this event as having an (approximate) 2% probability. So do I have that ace? He just needs to judge whether the probability is greater or less than 30%, an assessment that he can do using lots of information available to him. But once he makes that call, if he does it right (as he will, often enough; that’s part of what it means to be a good poker player), it’ll seem to me like he was reading my hand.
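The fold-or-not logic above can be sketched with Bayes' rule in odds form. The 2% prior and the 30% threshold come from the example. The likelihood ratios (how much more likely the observed betting is if I hold the ace than if I don't) are invented for illustration, and a real player's assessment would be implicit rather than a single number.

```python
def posterior_probability(prior, likelihood_ratio):
    """Bayes' rule in odds form.

    likelihood_ratio = P(observed betting | ace) / P(observed betting | no ace).
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def should_fold(prior, likelihood_ratio, threshold=0.30):
    """Fold when the updated probability of the ace exceeds the threshold."""
    return posterior_probability(prior, likelihood_ratio) > threshold


# Prior and threshold from the example above; likelihood ratios are made up.
for lr in (1, 10, 25):
    p = posterior_probability(0.02, lr)
    print(lr, round(p, 3), should_fold(0.02, lr))
```

The decision only depends on which side of 30% the posterior falls, so even a rough likelihood ratio is often enough to make the right call.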

2. Some references on luck vs. skill in poker

Louis Raes pointed to three papers:

Ben van der Genugten and Peter Borm wrote quite a bit on Poker and the extent to which skill or luck is important. This work is mainly geared towards Dutch regulation but interesting nonetheless.


3. Rick Schoenberg’s definition

Rick Schoenberg sent along an excerpt from his book, Probability with Texas Holdem Applications. Rick writes:

Surprisingly, a lot of books on game theory do not define the words “luck” or “skill”, maybe because it is very hard to do so. . . . in poker I [Rick] define skill as equity gained during the betting rounds and luck as equity gained during the deal of the cards. I then go through a televised twenty-something hand battle between Howard Lederer and Dario Minieri, two players with about as opposite styles as you can get, and try to quantify how much of Lederer’s win was due to luck and how much was due to skill.

I’ll go through Rick’s material and intersperse some comments.
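Rick's definition amounts to bookkeeping: tag each change in a player's equity by when it happened, and sum the deal-phase changes as luck and the betting-round changes as skill. The sketch below assumes the equity values have already been computed; it does not reproduce how Rick computes them. The `EquityChange` record and the one-hand trajectory are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EquityChange:
    phase: str     # "deal" (cards come out) or "betting" (a betting round)
    before: float  # player's equity before this phase, in chips
    after: float   # player's equity after this phase, in chips


def split_luck_skill(changes):
    """Sum equity gains by phase: deal phases count as luck, betting as skill."""
    luck = skill = 0.0
    for c in changes:
        gain = c.after - c.before
        if c.phase == "deal":
            luck += gain
        elif c.phase == "betting":
            skill += gain
        else:
            raise ValueError(f"unknown phase: {c.phase!r}")
    return luck, skill


# Made-up equity trajectory for one player over one hand.
hand = [
    EquityChange("betting", 50.0, 60.0),
    EquityChange("deal", 60.0, 35.0),
    EquityChange("betting", 35.0, 40.0),
]
print(split_luck_skill(hand))
```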

Updike and O’Hara

I just read this review by Louis Menand of a biography of John Updike. Lots of interesting stuff here, with this, perhaps, being the saddest:

When Updike received the National Book Foundation Medal for Distinguished Contribution to American Letters, in 1998, two of [his second wife's] children were present, but his were not invited.

Menand’s article seemed insightful to me, but I was surprised not to see the name “John O’Hara” even once. Updike seems so clearly to be a follower of O’Hara, both in form (lots of New Yorker short stories and bestselling novels) and also in content (they wrote a lot about sex and a lot about social class). Here’s Menand:

Updike wanted to do with the world of mid-century middle-class American Wasps what Proust had done with Belle Époque Paris and Joyce had done with a single day in 1904 Dublin—and, for that matter, Jane Austen had done with the landed gentry in the Home Counties at the time of the Napoleonic Wars and James had done with idle Americans living abroad at the turn of the nineteenth century. He wanted to biopsy a minute sample of the social tissue and reproduce the results in the form of a permanent verbal artifact.

That sounds a lot like O’Hara, no? Also this:

Updike believed that people in that world sought happiness, and that, contrary to the representations of novelists like Cheever and Kerouac, they often found it. But he thought that the happiness was always edged with dread, because acquiring it often meant ignoring, hurting, and damaging other people.

And this:

Updike’s identification with Berks County and its un-cosmopolitan ways . . . was crucial to a deeply defended and fundamentally spurious conception of himself as an ordinary middle-American guy. He wanted to rescue serious fiction from what he saw as a doctrinaire rejection of middle-class life . . .

Sure, there were differences between the two authors, most notably that Updike was famous for having excelled at Harvard, whereas O’Hara was famous for resenting that he’d not gone to Yale. Also, O’Hara wrote lots of things in the old-fashioned story-with-a-twist style, whereas Updike’s plots were more straightforward, one might say more modernist in avoiding neat plotting. Overall, though, lots of similarities.

I’m not saying that Updike is a clone of O’Hara, but I was surprised that Menand didn’t mention him at all.

P.S. In searching on the web, I came across this article by Lorin Stein that quotes Fran Lebowitz as describing O’Hara as “underrated.” Which is funny to me because Fran Lebowitz is perhaps the most overrated writer I’ve ever heard of.

P.P.S. More interesting than all the above is this 1973 essay, “O’Hara, Cheever & Updike,” by Alfred Kazin.

How do you interpret standard errors from a regression fit to the entire population?

James Keirstead writes:

I’m working on some regressions for UK cities and have a question about how to interpret regression coefficients. . . .

In a typical regression, one would be working with data from a sample and so the standard errors on the coefficients can be interpreted as reflecting the uncertainty in the choice of sample. In my case, I’m working with every city in the UK so the error interpretation isn’t as clear. There are two sources of confusion:

1. Since I’m working with a full population, can I just ignore the coefficient errors or do they have an additional interpretation that might be relevant? I’ve seen some mention of finite populations but am not sure how this might apply in a classical regression.

2. The definition of a city is itself somewhat uncertain. In my study I’m looking at about six definitions, each one consisting of a full population of UK cities (though each definition has a different number of cities; it’s not just the attributes of each city that change but the population size itself). Would it be sensible to interpret regression coefficient errors as capturing this uncertainty, or would an alternative model formulation be more appropriate?

My reply: Hey, you’re in luck, I’ve already answered this one!

Understanding the hot hand, and the myth of the hot hand, and the myth of the myth of the hot hand, and the myth of the myth of the myth of the hot hand, all at the same time

Josh Miller writes:

I came across your paper in the Journal of Management on unreplicable research, and in it you illustrate a point about the null hypothesis via the hot hand literature.

I am writing you because I’d like to move your current prior (even if our work uses a classical approach). I am also curious to hear your thoughts about what my co-author and I have done.

We have some new experimental and empirical work showing that the hot hand phenomenon can be substantial in individual players. We think our measures are more tightly related to hot hand shooting (rather than cold hand shooting).

Also, we find clear evidence of hot hand shooting in Gilovich, Vallone & Tversky’s original data set.

Our new paper, “A Cold Shower for the Hot Hand Fallacy,” is on SSRN, here.

We have three comments on your discussion in Journal of Management:

1. The earlier reported small effect sizes come about for three main reasons: (1) pooling data across players means the guys who don’t get hot (or fall apart) attenuate average effects, so you don’t see the hot guys; (2) the measurement error story of D. Stone (who I see commented on your blog once); (3) not every streak is a hot streak, so the real, infrequent but persistent hot hands get diluted; you would need to measure something else in conjunction with shot outcomes to pick this up.

2. We overturn the basic findings of GVT: it is not a fallacy to believe that some players can get substantial hot hands. We have the proof of concept in 3 separate controlled shooting studies (GVT’s, an earlier one, and our own). We have more discussion on this in the paper.

3. Now, while it is no longer the case that believing in the hot hand is a fallacy, there remains a question which you pose as answered: to what extent do players, coaches, or fans overestimate hot hand effects based on shot outcomes alone? An important first point: this overestimation wasn’t the main point of GVT, because it is really hard to show that players and coaches are overestimating the impact via the decisions they make (stated beliefs would be a little silly, but these haven’t been asked cleanly). GVT did something cleaner: show no effect, and then you know any belief in the hot hand must be fallacious.

The question I have for you: do you think Bill Russell, thought to be the greatest team player of all time, was wrong when he said this (in his retirement letter to SI)?

People didn’t give us credit for being as good as we were last season. Personally, I think we won because we had the best team in the league. Some guys talked about all the stars on the other teams, and they quote statistics to show other teams were better. Let’s talk about statistics. The important statistics in basketball are supposed to be points scored, rebounds and assists. But nobody keeps statistics on other important things—the good fake you make that helps your teammate score; the bad pass you force the other team to make; the good long pass you make that sets up another pass that sets up another pass that leads to a score; the way you recognize when one of your teammates has a hot hand that night and you give up your own shot so he can take it. All of those things. Those were some of the things we excelled in that you won’t find in the statistics. There was only one statistic that was important to us—won and lost.

Because if you read the GVT 1985 and GT 1989 papers in Chance, that is the message you get.

Here’s the relevant passage from my recent article in the Journal of Management:

As an example, consider the continuing controversy regarding the “hot hand” in basketball. Ever since the celebrated study of Gilovich, Vallone, and Tversky (1985) found no evidence of serial correlation in the successive shots of college and professional basketball players, people have been combing sports statistics to discover in what settings, if any, the hot hand might appear. Yaari (2012) points to some studies that have found time dependence in basketball, baseball, volleyball, and bowling, and this is sometimes presented as a debate: Does the hot hand exist or not?

A better framing is to start from the position that the effects are certainly not zero. Athletes are not machines, and anything that can affect their expectations (for example, success in previous tries) should affect their performance—one way or another. To put it another way, there is little debate that a “cold hand” can exist: It is no surprise that a player will be less successful if he or she is sick, or injured, or playing against excellent defense. Occasional periods of poor performance will manifest themselves as a small positive time correlation when data are aggregated.
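The last point, that occasional cold stretches alone produce a small positive serial correlation in aggregated data, can be checked with a quick simulation. This is a minimal sketch under assumed, made-up parameters: a 50% shooter who occasionally slips into a 35% “off” state (entered with probability 1% per shot, left with probability 5% per shot). There is no hot state anywhere in the model, and nothing here is fit to real data.

```python
import random


def simulate_shots(n_shots, p_normal=0.50, p_off=0.35,
                   enter_off=0.01, leave_off=0.05, seed=1):
    """Simulate makes (1) and misses (0) for one player whose make probability
    drops to p_off during occasional 'off' stretches (illness, fatigue, tough
    defense). A made shot never raises the chance of making the next one."""
    rng = random.Random(seed)
    off = False
    shots = []
    for _ in range(n_shots):
        # Two-state Markov switching between 'normal' and 'off'.
        if off:
            if rng.random() < leave_off:
                off = False
        elif rng.random() < enter_off:
            off = True
        p = p_off if off else p_normal
        shots.append(1 if rng.random() < p else 0)
    return shots


def streak_difference(shots):
    """P(make | previous make) - P(make | previous miss), pooled over the sequence."""
    after_make = [b for a, b in zip(shots, shots[1:]) if a == 1]
    after_miss = [b for a, b in zip(shots, shots[1:]) if a == 0]
    return sum(after_make) / len(after_make) - sum(after_miss) / len(after_miss)


shots = simulate_shots(400_000)
print(f"P(make | make) - P(make | miss) = {streak_difference(shots):.4f}")
```

Under these assumptions the long-run gap is about 1.2 percentage points. For a stationary 0/1 sequence, this difference equals the lag-1 correlation, here 0.94 × 0.003125 / (0.475 × 0.525). So the aggregated data show a small positive time correlation even though the only thing varying is a cold state.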

However, the effects that have been seen are small, on the order of 2 percentage points (for example, the probability of a success in some sports task might be 45% if a player is “hot” and 43% otherwise). These small average differences exist amid a huge amount of variation, not just among players but also across different scenarios for a particular player. Sometimes if you succeed, you will stay relaxed and focused; other times you can succeed and get overconfident.

Whatever the latest results on particular sports, we cannot see anyone overturning the basic finding of Gilovich et al. (1985) that players and spectators alike will perceive the hot hand even when it does not exist and dramatically overestimate the magnitude and consistency of any hot-hand phenomenon that does exist. In short, this is yet another problem where much is lost by going down the standard route of null hypothesis testing. Better to start with the admission of variation in the effect and go from there.

And here is my response to Miller:

What is your estimated difference in probability of successful shot in pre-chosen hot and non-hot situations? I didn’t see this number in your paper, but my impression from earlier literature is that any effect is on the order of magnitude of 2 percentage points, which is not zero but is small compared to people’s subjective perceptions. My own experience, if this helps any, is that I do feel that I have a hot hand when I’m making a basketball shot, but that feeling of hotness is coming as a consequence of the pleasant but largely random event of my shot happening to fall into the hoop. To me, the hot hand fallacy is not such a surprise; it is consistent with the “illusion of control” (to use another psychology catchphrase).

The Bill Russell quote is interesting, but given the findings of the classic hot hand paper, it is not surprising to me that a player would view the hot hand as a major factor, whether or not it is indeed important. Players can believe in all sorts of conventional wisdom. Of course I agree with Russell’s statement that all that matters is wins and losses. I’d guess that points scored for and against is a pretty important statistic too. All the other statistics we see are just imperfect attempts to better understand point-scoring.

To which Miller replied:

On your first question, I’ll give you the quick measure, but it depends on the player. Let’s compare the hit rate after making 3+ shots in a row to the hit rate in any other shooting situation. For the player RC, because he was nearly significant in the first session, we followed up with him 6 months later to see if this predicted a hot hand out of sample, and it did: on average his boost was 8-9 percentage points across all sessions (see p. 25 for the difference). The hottest player in the JNI data set of 6 players, with 9 different shooting sessions (he had periods of elevated performance in all sessions), had a boost of around 13 percentage points (see p. 28 for the difference). In GVT’s data, 8 out of 26 shooters had a boost of 10 percentage points or more, and 4 had a boost of more than 20 percentage points (see p. 29 for a brief report). It’s clear some players have substantial boosts in performance, but yes, the average effect is modest, as in previous studies: around a 3-5 percentage point boost. I think the important point is not that the hot hand is some big average effect, but that some players have a tendency to be streaky.
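The quick measure Miller describes (hit rate after 3+ makes in a row, minus the hit rate in every other situation) can be written down in a few lines. This is a minimal sketch, assuming one session’s shots are recorded as a 0/1 sequence. The function name `hot_boost` and the toy session are hypothetical and are not from the RC, JNI, or GVT data. It illustrates the definition only and is not offered as an estimator with any claimed statistical properties.

```python
def hot_boost(shots, streak=3):
    """Hit rate on shots taken right after `streak` or more consecutive makes,
    minus the hit rate on all other shots. `shots` is a 0/1 sequence."""
    after_streak, other = [], []
    run = 0  # consecutive makes immediately before the current shot
    for shot in shots:
        if run >= streak:
            after_streak.append(shot)
        else:
            other.append(shot)
        run = run + 1 if shot == 1 else 0
    if not after_streak or not other:
        raise ValueError("need shots both after a streak and in other situations")
    return sum(after_streak) / len(after_streak) - sum(other) / len(other)


# A made-up session of 12 shots (1 = make, 0 = miss).
session = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
print(f"boost = {hot_boost(session):+.3f}")
```

The toy session happens to give a negative number. With a dozen shots that tells us nothing; comparisons like the ones quoted above rest on many shots per player.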

On Russell: players receive information beyond sequential shot outcome data. They have long experience playing with teammates, and they can get cues on a player’s underlying mental and physical state to use in conjunction with sequential outcome data, so in that sense outcome data may be more informative for them than it would be for a fan. Further, getting hot doesn’t always come from positive feedback of shot outcomes into a player’s ability; another mechanism is endogenous fluctuations in mental and physiological state, or exogenous input such as energy from fans, teammates, etc. In this case, for teammates and coaches, the cues on mental and physical state are more important than the shot outcome. Notice that if you take the original Cognitive Psychology paper and the two Chance papers, the message is against both mechanisms.

Now, just to clarify, in my personal view, the tendency for spectators to attach too much meaning to streaks is clearly there; we can see it any time we watch the 3-point contest (the videos are on YouTube). Any time a player hits three shots, he is “heating up.” This is the pattern of intuitive judgment that GVT identified; it is interesting psychologically, and it was predicted by previous lab experiments. Instead, we approach it from the perspective of whether there is strong evidence that players and coaches are wildly off. Our evidence suggests their belief can be justified, but we don’t demonstrate that it is in any particular game circumstance (no one has shown that it isn’t!). If you look at some recent interesting work from Matthew Goldman and Justin Rao, you’ll see that, on average, players do a surprisingly good job allocating their shots.

Good stuff. I’ll just say this: I’m terrible at basketball. But every time I take a shot, I have the conviction that if I really really focus, I’ll be able to get it in.