No, it’s not April 1, and yup, I’m serious. Josh Miller came into my office yesterday and convinced me that the hot hand is real.

Here’s the background. Last year we posted a discussion on streakiness in basketball shooting. Miller has a new paper out, with Adam Sanjurjo, which begins:

We find a subtle but substantial bias in a standard measure of the conditional dependence of present outcomes on streaks of past outcomes in sequential data. The mechanism is driven by a form of selection bias, which leads to an underestimate of the true conditional probability of a given outcome when conditioning on prior outcomes of the same kind. The biased measure has been used prominently in the literature that investigates incorrect beliefs in sequential decision making — most notably the Gambler’s Fallacy and the Hot Hand Fallacy. Upon correcting for the bias, the conclusions of some prominent studies in the literature are reversed. The bias also provides a structural explanation of why the belief in the law of small numbers persists, as repeated experience with finite sequences can only reinforce these beliefs, on average.

What’s this bias they’re talking about?

Jack takes a coin from his pocket and decides that he will flip it 4 times in a row, writing down the outcome of each flip on a scrap of paper. After he is done flipping, he will look at the flips that immediately followed an outcome of heads, and compute the relative frequency of heads on those flips. Because the coin is fair, Jack of course expects this conditional relative frequency to be equal to the probability of flipping a heads: 0.5. Shockingly, Jack is wrong. If he were to sample 1 million fair coins and flip each coin 4 times, observing the conditional relative frequency for each coin, on average the relative frequency would be approximately 0.4.

Really?? Let’s try it in R:

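    rep <- 1e6   # number of simulated players (coins)
    n <- 4       # flips per player
    data <- array(sample(c(0,1), rep*n, replace=TRUE), c(rep,n))
    prob <- rep(NA, rep)
    for (i in 1:rep){
      heads1 <- data[i,1:(n-1)]==1   # which of flips 1..n-1 came up heads
      heads2 <- data[i,2:n]==1       # whether the following flip came up heads
      prob[i] <- sum(heads1 & heads2)/sum(heads1)   # 0/0 gives NaN
    }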

OK, I've simulated, for each player, the conditional probability that he gets heads, given that he got heads on the previous flip.

What's the mean of these?

> print(mean(prob)) [1] NaN

Oh yeah, that's right: sometimes the first three flips are tails, so the probability is 0/0. So we'll toss these out. Then what do we get?

> print(mean(prob, na.rm=TRUE)) [1] 0.41

Hey! That's not 50%! Indeed, if you get this sort of data, it will look like people are anti-streaky (heads more likely to be followed by tails, and vice-versa), even though they're not.
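The simulated value can be checked exactly, because there are only 2^n equally likely sequences to enumerate. Here is a minimal sketch in Python rather than the post's R. The function names are illustrative and not from the paper, and the sketch assumes a fair coin and drops the 0/0 sequences just as `na.rm=TRUE` does.

```python
from fractions import Fraction
from itertools import product


def prop_heads_after_heads(seq):
    """Proportion of heads among the flips that immediately follow a head.

    Returns None when no flip in seq[:-1] is a head (the 0/0 case).
    """
    followers = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == 1]
    if not followers:
        return None
    return Fraction(sum(followers), len(followers))


def expected_proportion(n):
    """Average the per-sequence proportion over all 2**n fair-coin sequences,
    skipping the sequences where it is undefined."""
    props = [prop_heads_after_heads(s) for s in product((0, 1), repeat=n)]
    defined = [p for p in props if p is not None]
    return sum(defined, Fraction(0)) / len(defined)


print(expected_proportion(4), float(expected_proportion(4)))  # 17/42, about 0.405
print(float(expected_proportion(10)))                         # about 0.445
```

For 4 flips the exact average is 17/42, so the gap from 0.5 is a property of the estimator and not simulation noise.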

With sequences of length 10, the average streakiness statistic (that is, for each person you compute the ~~conditional probability~~ proportion that he gets heads, conditional on him having just got heads on the previous flip, and then you average this across people), is .445. This is pretty far from .5, given that previous estimates of streak-shooting probability have been in the range of 2 percentage points.

And the bias is larger for comparisons such as the ~~probability~~ proportion of heads, conditional on following three straight heads, compared to the overall probability of heads. Which is one measure of streakiness, if "heads" is replaced by success of a basketball shot.

So here's the deal. The classic 1985 paper by Gilovich, Vallone, and Tversky and various followups used these frequency comparisons, and as a result they all systematically underestimated streakiness, reporting no hot hand when, actually, when the data are analyzed correctly, the evidence is there, as Miller and Sanjurjo report in the above-linked paper and also in another recent article which uses the example of the NBA three-point shooting contest.

Next step: fitting a model in Stan to estimate individual players' streakiness.

This is big news. Just to calibrate, here's what I wrote on the topic last year:

Consider the continuing controversy regarding the “hot hand” in basketball. Ever since the celebrated study of Gilovich, Vallone, and Tversky (1985) found no evidence of serial correlation in the successive shots of college and professional basketball players, people have been combing sports statistics to discover in what settings, if any, the hot hand might appear. Yaari (2012) points to some studies that have found time dependence in basketball, baseball, volleyball, and bowling, and this is sometimes presented as a debate: Does the hot hand exist or not?

A better framing is to start from the position that the effects are certainly not zero. Athletes are not machines, and anything that can affect their expectations (for example, success in previous tries) should affect their performance—one way or another. To put it another way, there is little debate that a “cold hand” can exist: It is no surprise that a player will be less successful if he or she is sick, or injured, or playing against excellent defense. Occasional periods of poor performance will manifest themselves as a small positive time correlation when data are aggregated.

However, the effects that have been seen are small, on the order of 2 percentage points (for example, the probability of a success in some sports task might be 45% if a player is “hot” and 43% otherwise). These small average differences exist amid a huge amount of variation, not just among players but also across different scenarios for a particular player. Sometimes if you succeed, you will stay relaxed and focused; other times you can succeed and get overconfident.

I don't think I said anything *wrong* there, exactly, but Miller and Sanjurjo's bias correction makes a difference. For example, they estimate the probability of success in the 3-point shooting contest as 6 percentage points higher after three straight successes. In comparison, the raw (biased) estimate is 4 percentage points. The difference between 4% and 6% isn't huge, but the overall impact of all this analysis is to show clear evidence for the hot hand. It will be interesting to see what Stan finds regarding the variation.

So I admittedly didn’t read the papers yet, so maybe this question doesn’t really make sense.

But does this mean that the, e.g., probability of having a girl be born after a boy isn’t exactly 50/50 either?

Does it mean that having only children of the same sex is less likely than we thought?

Or should I have read the articles first and realized why this isn’t relevant?

Josh:

Pr(girl) = .485 (approximately), no matter what was the sex of the previous baby. But if you were to estimate the conditional probabilities in a certain way (averaging within families, then averaging across families, rather than weighting each birth equally in the average), then you could come up with a biased estimate. And this sort of thing can be tricky; recall the error-filled work of Satoshi Kanazawa.

Is there a simple / intuitive reason why one would average this way, rather than say, vectorizing the matrix in the above example first and averaging once?

Alex:

If you’re estimating a common probability such as Pr(girl), you’d average everybody. For the hot hand, the usual approach is to get a separate estimate for each player, hence the separate averaging.

So one might “explain” this bias with the two-dimensional extension of Jensen’s inequality for expectations?

I’ve seen the claim that sex distribution fits a beta binomial model better than a binomial model.

http://onlinelibrary.wiley.com/doi/10.1111/1467-9876.00103/abstract

If a beta binomial distribution can be simulated with urns by drawing a ball from an urn, and if a boy is drawn, replacing the ball with 2 or more ‘boy’ balls prior to drawing the next ball, that suggests a serial correlation > 0.5. The larger the family, the more skewed the sex ratio.

I think you can show this with a pretty simple directed acyclic graph: for any given sequence of kids (say, consider only families who have 3 kids), you want to know whether probability (boy|prior kid was a boy)=pr(boy), or approximately 0.5.

You obviously can only evaluate families for whom either kid 1 or kid 2 was a boy. So, the DAG includes 3 variables: sex of kid 1, sex of kid 2, both of which influence whether the family is included in your analysis. The DAG looks like:

kid1_boy–>include family in analysis <–kid2 boy

Among included families, the two causes of inclusion are associated (in this case, because it's an either/or inclusion rule, they are inversely associated – if kid2 is a girl, then you know kid1 must have been a boy or else the family wouldn't be included in your analysis).

Something that confused me when I first read this is how any association between the sex of kid3 is induced – but upon consideration (by which I mean an embarrassing amount of time obsessively thinking about the problem) actually I don't think there is any association with the last element of the sequence, because that element has no influence on whether the family (or cluster) is included in the analysis.

In short, seems like a nice illustration of collider bias.

Maria

I find this explanation very helpful, thanks! Any relation to Clark?

The probability of having a boy after a girl is 1/4. The probability of having a mixed pair is 1/2 (assuming the true prob of boy vs girl is .5 and not .48) The probability of having two boys, out of 3 children, in a row is 3/8. Note that the grouping changes the outcome. That seems to me what is happening here. I think a probability tree greatly helps in this case.

We would add two things on effect size:

1. These are not small average effects. With the bias correction (mean adjusted) the *average* effect size is 6 percentage points in the 3 pt data and 13 percentage points in the original GVT study (the bias was bigger in GVT because there were 100 shots). The difference between the median NBA shooter and the best NBA shooter is 10 percentage points.

2. This is an average effect, we should expect heterogeneity in effect size, not every player has a tendency to get hot. The average is unlikely to be representative of how how some players can get, because their effect is diluted with those who have weaker effects (and some players are anti-streaky, though not many). Take a look at the effect size 3 point contest for some players, there are many big effects, same in the original GVT study.

we are both looking forward to seeing what Stan says on the heterogeneity…

oops: The average is unlikely to be representative of how *hot* some players can get

Josh:

Agreed. 6 percentage points is a lot.

Andrew, related to one of your favorite topics: is this effect size perhaps too big?

There’s discussion of the 3 point study paper on a popular basketball stats forum: http://www.apbr.org/metrics/viewtopic.php?f=2&t=8942 . I think some of the questions there apply here: for example, what’s the mechanism for the hot hand? Why isn’t there nearly so much evidence for a cold hand? It’s difficult to think of any way for a hot hand to exist without a corresponding cold hand outside of selection effects.

There is evidence of the cold hand in some players, just not nearly as much. I would guess when you are in the zone, or in a flow state, you just repeat what you did before without thinking about it, and when you are missing too many, you adjust your shot, or you try to refocus. Craig Hodges is a 56% shooter in the 3 point contest. He once hit 19 in a row. I don’t recall his longest miss streak, but I bet there is no way he would even get close unless he had a hand injury or something like that.

But shouldn’t there be games/days where you just can’t refocus? Or the thinking about it throws off your motor memory and you do worse? The Hodges example also fits with my suggestion of selection biases: you don’t observe cold hand streaks because they would only occur when the player was injured (in which case he wouldn’t play much or shoot often) or the player would be removed from the competition (not advance in the 3 point contest or be pulled from the game). Any kind of selection bias like that wouldn’t affect your paper’s point about how to analyze the hot hand, but it would inflate hot hand estimates in any NBA-type data because players will predominantly only be allowed to shoot when they perform around their skill level or better.

Interesting point Alex. I’d imagine historically good performers are kept in the game, even when they are choking so to speak (think John Starks’ 2-18 performance in Game 7 of the 1994 finals). Even if a player is having a blip in performance, there are reasons to keep a player in a game – long term player confidence and team chemistry come to mind. Players hide injuries too.

That said, I would tend to agree that the truly cold hands are more likely to be selected out, especially for non-star players. Why would the coach have patience for these players? Further, a coach has every reason to try to nip these things early, and not wait for a streak of misses—a coach could suspect the player is hiding something (mental or physical), which I’d imagine could be picked up by looking at consistency in shooting mechanics, or some other cue. This is far outside my area of expertise though, you’d have to talk to a Sports Psychologist, or Sport Physiologist on this one (or an experienced coach?).

Now we have been talking about games, and game data is tough to handle: once you start controlling for everything you want to control for, you end up with more control variables than data points. For controlled shooting, we have looked at 4 reasonably sized data sets (from the classic study of GVT, a little-known pre-GVT set [Jagacinski et al.], our controlled shooting study, and 3 pt). The story is pretty similar, even though there is variation in how the shots are taken in each data set: stronger evidence for hit streaks over miss streaks. BUT, I would guess there is a selection going on in 3pt in particular: poor shooters don’t advance and thus give us very little data, so we are selecting the best shooters when we are looking at the 33 with 100+ shots in the table. If poor shooters are more likely to generate cold streaks (relative to their probability of success of course), then we won’t pick this up, e.g. Michael Jordan’s single 1990 round shows up only in our pooled analysis. Next time we get back to the data (other projects now), we could try to look at poor shooters separately rather than pooling, but I suspect there just isn’t enough data, because there are so few rounds from them.

I suspect the real difference between poor shooters and good shooters will show up in hit streak patterns rather than miss streak patterns. The best shooters can probably sustain the attention/focus needed to maintain a streak (again relative to their overall probability of success).

Alex: Because the 3PT contest features good shooters (mean FG% of 56%), there must be many more hot streaks in the data than cold streaks. So it’s possible the study is effectively under-powered when it comes to detecting a cold hand effect.

Is a cold streak a “series of misses” or is it a series of oscillations?

Or to put it another way, mechanistically I can see how it’s easy to be consistently bad (like say maybe a random person off the street) but a “cold hand” might mean that hitting a good shot makes you “choke” on the next shot, so a kind of oscillation effect, which would be more relevant to people who are on-average pretty good (hitting say 40%+ of shots) vs people who are pretty bad (hitting < 10% of shots).

Hi Daniel

anti-streaky/reversal/alternation are the terms usually used for what you refer to as oscillation (when a belief in it isn’t true, it is called the Gambler’s Fallacy), and streaky/momentum/repetition are the terms usually used for series of makes or misses (when a belief about series of makes isn’t true, it is called the Hot Hand Fallacy).

the choke effect (anti-streaky for hits), though we haven’t seriously analyzed it, looks to be Peja Stojakovic in the 3pt. data. streaky for misses, perhaps Kyrie Irving, but I’d like to see more data. Table 3 in our paper is best read as, there are 8 out of 33 players who have substantial and significant estimated hot hand effects, for some of these players this may be chance, but it is exceedingly unlikely (i.e. beyond all standards of statistical testing) that 8 of 33 players would have such significant effects (p<.001), so some of these 8 guys are definitely hot guys.

Hi Guy, the measures of hot streaks and cold streaks are always bench-marked to player performance, so a better player doesn’t have to have as long of a cold streak to say he is cold in our analysis.

I’ve had a cold hand on the basketball court since December 1986.

Seriously, sometimes what seems like a hot hand or a cold hand at the time is a historic improvement or decline in ability. For example, in the fall of 1974 I cunningly suckered a school friend who was from Pittsburgh into betting, even odds, that the Steelers, a historically mediocre franchise, would win their next five games. He didn’t realize that would require them to win the Super Bowl.

Of course, they did go on to win their next five games, including the Super Bowl, and then 3 more Super Bowls in the following 4 years.

I was playing the odds, but I got trampled on by history.

What I always wonder about in these papers, especially the original GVT one, is that they seem to assume that people’s shooting ability (or whatever) is unchanged throughout their career. Look at Serena Williams, is it just chance that she happens to have a Serena Slam over the last 12 months? Or is it that she really has taken her game up a level from where it was a couple of years ago?

Kind of like how aging Barry Bonds took up his game to another level from 1998 to 2001.

I am not disputing the results of the article, but I think saying that there is a hot hand is still wrong.

The original article was not published in a vacuum. Its point was that people are bad natural statisticians and that they tend to see patterns where none exist, because of the representativeness heuristic. If the bias is admittedly subtle, there is no way that the “hot hand phenomenon” as understood by basketball fans could possibly exist. Which was the point of the original article and remains true today. The fact that someone now found a really small relationship does not change that.

TLDR: no there isn’t

Hi Tiago

Thank you.

Your comments are important and should be addressed to all.

1. It was already known before this study that people do not have a good intuitive feeling of randomness. Many of these examples were collected by probabilists and statisticians over the years, just check out William Feller’s classic book. Psychologists sought to understand the mechanism behind this, and they systematically investigated the limits of man as an intuitive statistician from the 1950s through the 1980s. The big insight of GVT was that there was a field environment in which these mistakes could be illustrated, an environment in which there is actually a cost to expert decision makers for making these mistakes (substitutions, offenses adjusted to create open looks for hot players, defense adjustments). The research question was very interesting and novel, and should be judged as such. The big deal was that experts could be making big mistakes, and they were completely unmoved by the fact that statistical evidence indicated they were wrong. Fans have no reason to correct their beliefs if they are wrong, and it is not clear how strongly their beliefs are held. The Fan evidence was to be expected given prior work in the lab, but it was good to show. If the original study was about fans over-inferring from streaks, it would not have had the effect that it had.

2. The average effect size isn’t small, and it is consistent with the possibility of big effects from some players. Recall, for the original exercise to be valid you need no hot hand to be able to label beliefs a fallacy. If you don’t do that, it becomes a question of: is there a bias? This is a more difficult exercise because it requires a calibration of beliefs with the size of the hot hand effect, which is very difficult to measure. First you have to ask: is the belief in the hot hand a belief that pertains to the average shooter getting the hot hand, or does it pertain to certain exceptionally streaky players getting the hot hand? If it is the latter, then you have to figure out how streaky these exceptionally streaky players are. This is quite a difficult task, no? Gilovich and Tversky understood this issue well; there was in fact a criticism of their work in 1989 by Statisticians Larkey, Smith and Kadane, who claimed to find evidence of the hot hand in Vinnie “the Microwave” Johnson. A single counter-example would undo the fallacy view, and Gilovich and Tversky responded, and they found an error in the data coding of Larkey et al. which invalidated their results. The fallacy view was preserved, and even replicated in the 3pt contest in 2003. One cannot forget that the hot hand fallacy has been considered a massive and widespread cognitive illusion by many, i.e. the feeling that you are in the zone in basketball, or think you see someone that is, well that is an illusion. It has become intellectually unrespectable to believe in momentum in performance. This is simply wrong.

3. Now the question is: How big is the hot hand? Is it smaller than the beliefs that experts reveal when they actually make decisions? The effects in the data are consistent with the possibility of it being big for a few players and small for many (which is consistent with how players think about it), or it could be modest for most. This is an empirical question, but it certainly isn’t a fallacy, and it is not clear if there is a big bias for players. This is the other open question that is hard to answer, because probabilities are not what basketball players see, instead they see made shots in conjunction with other cues like consistency in shooting mechanics, body language, etc. and form a subjective impression; asking about probabilities is unnatural. It would be much better to get some choice-based measure to figure out the implied probability of success.

Thank you for your thorough comments. Truth is I was too quick to comment because I just thought: “this can’t possibly be the case”. Now I’m not sure anymore.

An easy way to build intuition on what causes the bias is to consider the case where the sequence length is 3. This makes it easy to enumerate all 8 cases.

The effect is due to undercounting the evidence in long streaks relative to shorter ones: by averaging over sequences rather than over individual tosses, a streak like 111 (two successes out of 2 attempts) is counted as a single streak with 100% success, contributing only the same amount of evidence as 011 (also 100% success, but over just a single attempt).
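The length-3 case is small enough to list in full. The following sketch (Python, with illustrative names, assuming a fair coin so that all 8 sequences are equally likely) prints each sequence with its successes and attempts after a head, then the unweighted mean across sequences.

```python
from fractions import Fraction
from itertools import product


def counts_after_head(seq):
    """Return (heads that follow a head, flips that follow a head)."""
    followers = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == 1]
    return sum(followers), len(followers)


rows = []
for seq in product((0, 1), repeat=3):
    hits, attempts = counts_after_head(seq)
    if attempts:  # 000 and 001 have no flip following a head, so they drop out
        rows.append((seq, hits, attempts, Fraction(hits, attempts)))

for seq, hits, attempts, prop in rows:
    print("".join(map(str, seq)), f"{hits}/{attempts}", prop)

# Each surviving sequence counts once, whatever its number of attempts.
per_sequence_mean = sum(r[3] for r in rows) / len(rows)
print(per_sequence_mean)  # 5/12
```

In the printed rows, 111 appears as 2/2 and 011 as 1/1. Both enter the average as a single 1, which is the undercounting described above.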

Some quick observations, and I admit a lot of the discussions have addressed some of these issues. I also haven’t read GVT in its entirety so forgive me for some naive observations.

As many people have pointed out, the R code is summing up the probabilities for each row, and then taking the average of those probabilities, this inherently has a negative bias which discounts the actual counts. Here, take for example this set of numbers. . .

0 0 1 0

1 0 1 1

In the algorithm, it would show in heads1 and heads2 the truth tables below.

heads1 <- false, false, true

heads2 <- false, true, false

For the first row a prob of 0.0, which means after a flip of 1, there were 0 times another head appeared.

In the second row, he would show a probability of 0.5 from the truth tables below.

heads1<- true, false, true

heads2<- false, true, true

This shows there was a single time after a head flip, there was another flip heads . . . so there is a 0.5 probability.

Now. . . when the code takes the mean(prob) there is a total prob of 25% for flipping a head

mean(prob, na.rm=TRUE) = 0.25 (the mean of 0 and .5 is .25)

That isn't correct though, because there were three times a head came up in the first three flips, and only once did a head occur on the next flip. This means the probability for a head flipped after a head flip is actually 33% not 25% in this data set. The code is normalizing each row, when it should be counting individually each occurrence. Because of this, there is a negative bias in the algorithm itself, and the error is not being accounted for. That is the reason for the approx 40% coming up, not because of hot hands.
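The two-row arithmetic can be reproduced directly. This is a small Python sketch with illustrative names: it computes the per-row proportions as the post's loop does, and then the pooled count this comment argues for.

```python
rows = [
    [0, 0, 1, 0],
    [1, 0, 1, 1],
]


def counts_after_head(row):
    """Return (heads following a head, flips following a head) for one row."""
    followers = [nxt for prev, nxt in zip(row, row[1:]) if prev == 1]
    return sum(followers), len(followers)


counts = [counts_after_head(r) for r in rows]

# One proportion per row, then an unweighted mean (what the post's R loop does).
row_props = [hits / attempts for hits, attempts in counts]
mean_of_rows = sum(row_props) / len(row_props)

# Pool every occurrence across rows before dividing (what this comment proposes).
pooled = sum(h for h, _ in counts) / sum(a for _, a in counts)

print(row_props, mean_of_rows, pooled)
```

The per-row mean is 0.25 and the pooled figure is 1/3, matching the numbers in the comment.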

If the code were instead changed to count the occurrences directly, rather than normalizing within each row, he would find that indeed it does happen 50% of the time, and thus no hot hands.

Yea that’s the whole conclusion right? When we group things and underweight the groups that have massive success our average success rate of the groups goes down. I don’t see what’s important about this.

Maybe I am looking at this the wrong way, but how does it prove hot hands exist by creating a negative bias in the analysis? They mention for example a three point contest, but GVT was not about sitting in a spot, and hitting the same shot over and over. It was about the random nature of shooting in a game environment where you have a high number of variables that change on a millisecond timeframe.

I don’t see anything that disagrees with GVT, but instead a transformation of the data through underweighting the successes. If this underweight isn’t done, everything aligns with GVT. Unless I am missing something.

Michael:

1. GVT is not just about shooting in a game environment; they have lots of data from non-game environments.

2. GVT supply an estimate of the hot hand. Their estimate is near zero so they conclude there’s no hot hand. But, actually, if there is no hot hand, you’d expect they’d get negative estimates for the statistical reason explained by Miller and Sanjurjo.

While Miller-Sanjurjo’s observation about the measured streakiness in finite samples is correct, the stated implications for the Gilovich study and similar hot hand studies are not. In fact, their proposed “bias adjustment” is more than twice as large as it should be. Here’s why:

To correct the bias in the Gilovich study, M-S use their model to estimate an expected difference in success rates (after a 3-hit streak vs a 3-miss streak) for each player. On average, they say this should be 8 points, which is what their model estimates for N=100, P=.5, and K=3. So for each shooter in the Gilovich study, M-S are asking “what is the mean expected differential for a player with this true FG%?”

However, that’s the wrong question. These samples are not random trials from a player with a known true FG%. Rather, they are fixed outcomes for players of unknown true talent — and that is an important difference. Let me use the simple example of 4-flip sequences that Andrew cited. The model says that when P=.50 we will see a mean .41 success rate after a head. But if we look at cases where exactly 2 heads come up, the success rate is actually just .33, not .41. And it is these 2H/2T examples that are analogous to the Gilovich subjects: both have a fixed distribution of outcomes.
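The .33 figure for 4-flip sequences with exactly 2 heads can be checked by enumeration. This is a Python sketch with illustrative names, assuming every ordering of a given number of heads is equally likely.

```python
from fractions import Fraction
from itertools import product


def prop_after_head(seq):
    """Proportion of heads among flips following a head, or None for 0/0."""
    followers = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == 1]
    return Fraction(sum(followers), len(followers)) if followers else None


def mean_prop_given_heads(n, n_heads):
    """Unweighted mean of the per-sequence proportion over length-n sequences
    with exactly n_heads heads, skipping sequences where it is undefined."""
    props = [prop_after_head(s) for s in product((0, 1), repeat=n) if sum(s) == n_heads]
    defined = [p for p in props if p is not None]
    return sum(defined) / len(defined)


for h in range(1, 5):
    print(h, mean_prop_given_heads(4, h))
```

With 2 heads in 4 flips the mean is 1/3, as stated. With 3 heads it is 2/3.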

The question at hand is whether or not the *sequences* of these known distributions were random. So the relevant null is a random sequencing of a specific distribution, which is what we would expect if shooters are not streaky. We can easily calculate the expected success rates under that assumption:

After a successful streak, P = (N*P - K)/(N-K);

After a failure streak, P = (N*P)/(N-K).

Let’s take “Male #1” in M-S’s Table 3 (using data from the Gilovich study) as an example. This player made 54 shots in 100 opportunities overall, and he shoots .06 higher after a hit streak (.50) vs. a miss streak (.44). However, the M-S model expected him to be .08 worse, so they say he is +14 points better than expected. But that’s not right: if K=3 and P=.5, then after three made shots our expectation is 47/97 or .485, and after 3 misses is .515, for an expected differential of -.03. When you run the numbers for every shooter in Table 3, the expected differential is -.03 overall, not M-S’s estimate of -.08. That means the “hot hand” result in this study is +.06, not +.13.
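The arithmetic in this comment follows from the two sampling-without-replacement formulas given just before it. The Python sketch below evaluates them for N=100, P=.5, K=3. The function name is illustrative, and the calculation is this commenter's proposed null, not the Miller and Sanjurjo adjustment.

```python
from fractions import Fraction


def expected_rates_without_replacement(n_shots, n_hits, k):
    """Expected success rate on the next shot after k straight hits and after
    k straight misses, treating the other shots as draws without replacement.

    n_hits plays the role of N*P in the comment's formulas.
    """
    after_hits = Fraction(n_hits - k, n_shots - k)
    after_misses = Fraction(n_hits, n_shots - k)
    return after_hits, after_misses


after_hits, after_misses = expected_rates_without_replacement(100, 50, 3)
print(float(after_hits), float(after_misses), float(after_hits - after_misses))
```

This gives 47/97 and 50/97, a differential of about -.03.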

So yes, the Gilovich study IS biased, because it ignored the impact of removing the streak shots when setting the expectation for the next shot. And they were thus wrong to assume the differential between the two rates should be zero. But the true bias is much smaller than M-S claim. Whether it’s large enough to reverse the conclusions of Gilovich or other prior studies, I don’t know.

Similarly, the true bias in the 3-point shot study is not .025, as M-S claim, but only .01. If we choose to accept their reported raw difference of .04, that means the “hot hand” in that study is .05 rather than .06.

Bottom line: the actual bias is fairly small when N>100, although certainly worth correcting for. It certainly explains why some studies show a negative hot hand, and perhaps it should revise Andrew’s “prior” to an effect of 3% or 4%, rather than 2%. But it’s not big enough to substantially change our estimate of true hot hand effects (if any).

Hi Guy

thanks again for your interest.

The bias is actually quite a bit more severe than the formula you provide, which happens to be a sampling-without replacement formula. Please combine theorem 7 and equation #3 for the explicit formula, which you will have to implement numerically. If you want to convince yourself quickly that the bias is more severe than the formula you provide, the minimal working example is n=5, k=2. Enumerate all 32 sequences, notice for each n1, the bias is worse than sampling-without replacement.

With regard to average effect, it is pretty big given how diluted things are. You are pooling heterogeneous responses, and you are using a very weak signal for “hotness”, plenty of people can hit three shots in a row when they are not in the zone.

It will be interesting to see what Stan says about the heterogeneity. What is the likely distribution? A few guys have a strong hot hand, and most have little? Or does everyone have a modest hot hand? The former seems more intuitive, but who knows.

Anyway, the correct way to do hypothesis testing in face of this bias is precisely the method we use, which is re-sampling based on observed outcomes.
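One common reading of "re-sampling based on observed outcomes" is a permutation test: keep a player's hits and misses fixed, shuffle their order many times, and see where the observed statistic falls among the shuffled ones. The Python sketch below works under that assumption. It is not the paper's code, and the function names, the one-sided comparison, and the default of 2000 shuffles are all illustrative choices.

```python
import random


def prop_after_streak(seq, k):
    """Proportion of hits among shots that immediately follow k straight hits,
    or None if no shot follows such a streak."""
    followers = [seq[i] for i in range(k, len(seq)) if all(seq[i - k:i])]
    return sum(followers) / len(followers) if followers else None


def permutation_p_value(seq, k, n_perm=2000, seed=0):
    """One-sided Monte Carlo p-value: how often does a random reordering of the
    same hits and misses give a statistic at least as large as the observed one?"""
    observed = prop_after_streak(seq, k)
    if observed is None:
        raise ValueError("no streak of length k is followed by a shot")
    rng = random.Random(seed)
    shuffled = list(seq)
    at_least, usable = 0, 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat = prop_after_streak(shuffled, k)
        if stat is None:
            continue  # statistic undefined for this reordering
        usable += 1
        at_least += stat >= observed
    return (at_least + 1) / (usable + 1)


streaky = [1] * 10 + [0] * 10  # ten hits, then ten misses
print(permutation_p_value(streaky, k=1))
```

Because the reference distribution is built from reorderings of the same outcomes, the finite-sample bias is present in both the observed and the shuffled statistics, so no separate correction term is needed.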

OK, I looked at the case of 5 flips and runs of 2. If you take each sequence and calculate the expectation based on sampling w/o replacement, and average these expectations weighting each sequence equally, the expectation is .416. Isn’t that the correct estimate of bias? So the conclusion appears to be the same as when we use the data in your table 1: sampling w/o replacement fully corrects for the bias. Why is there a need for an additional adjustment?

Dear Guy & others who are interested. Again the minimal working example of n=5, k=2 mentioned above should convince you if you work through it, below is an illustration for how to go through the exercise. Let N_1 be the number of 1s. The punch line is E[Rf(H|HH)]= .385, E[Rf(H|HH) | N_1= 3]= .2857<1/3, and E[Rf(H|HH) | N_1= 4]= .6333 <2/3. If you want to see that this is important quantitatively, please see the graphs in the draft. Here is the table.

    nhits  seq    RelFreq (Rf)
    0      00000
    1      00100
    1      00010
    1      10000
    1      01000
    1      00001
    2      00110  0
    2      11000  0
    2      10001
    2      01100  0
    2      00011
    2      10010
    2      01010
    2      00101
    2      10100
    2      01001
    3      11001  0
    3      00111  1
    3      01101  0
    3      11010  0
    3      10011
    3      10101
    3      11100  .5
    3      01011
    3      10110  0
    3      01110  .5
    4      11011  0
    4      01111  1
    4      10111  1
    4      11101  .5
    4      11110  .6666667
    5      11111  1
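The table and the three expectations quoted with it can be regenerated by brute force. This is a Python sketch with illustrative names, assuming a fair coin and treating sequences with no flip after two straight 1s as undefined (the blank entries in the table).

```python
from fractions import Fraction
from itertools import product


def rel_freq_after_streak(seq, k):
    """Rf: proportion of 1s among flips that follow k straight 1s, or None."""
    followers = [seq[i] for i in range(k, len(seq)) if all(seq[i - k:i])]
    return Fraction(sum(followers), len(followers)) if followers else None


def mean_rel_freq(n, k, n_hits=None):
    """Unweighted mean of Rf over length-n sequences where it is defined,
    optionally restricted to sequences with exactly n_hits ones."""
    values = []
    for seq in product((0, 1), repeat=n):
        if n_hits is not None and sum(seq) != n_hits:
            continue
        rf = rel_freq_after_streak(seq, k)
        if rf is not None:
            values.append(rf)
    return sum(values) / len(values)


print(float(mean_rel_freq(5, 2)))            # overall
print(float(mean_rel_freq(5, 2, n_hits=3)))  # given three 1s
print(float(mean_rel_freq(5, 2, n_hits=4)))  # given four 1s
```

The exact values are 37/96, 2/7 and 19/30, which round to the .385, .2857 and .6333 quoted above.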

As the paper notes, it works out to the expected 50% if you weight by the number of heads in the sequence, or equivalently put all the opportunities together before computing the relative probability, rather than computing unweighted relative probabilities for each sequence and then averaging them.
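The two ways of averaging can be put side by side. The Python sketch below (illustrative names, fair coin, all sequences equally likely) pools every flip-after-a-head across sequences, and separately averages the per-sequence proportions without weights.

```python
from fractions import Fraction
from itertools import product


def counts_after_head(seq):
    """Return (heads following a head, flips following a head)."""
    followers = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == 1]
    return sum(followers), len(followers)


def pooled_and_unweighted(n):
    """Return (pooled proportion, unweighted mean of per-sequence proportions)."""
    counts = [counts_after_head(s) for s in product((0, 1), repeat=n)]
    counts = [(h, a) for h, a in counts if a > 0]  # drop the 0/0 sequences
    pooled = Fraction(sum(h for h, _ in counts), sum(a for _, a in counts))
    unweighted = sum(Fraction(h, a) for h, a in counts) / len(counts)
    return pooled, unweighted


print(pooled_and_unweighted(4))  # pooled is exactly 1/2, unweighted is 17/42
```

Pooling is the same as weighting each sequence by its number of opportunities, which is why it recovers 1/2 for a fair coin.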

That bias makes perfect sense to me, as you’re removing a disproportionate number of heads from the sample. However, I never thought through how that affected the hot hand literature. I didn’t realize that the average was computed over each single session, and then the session averages were averaged together, rather than weighted properly.

Hi John-

thanks for your interest.

The re-weighting only works for coin flips, because it is safe to assume each sequence is from the same coin. For basketball it would be a problem: if you re-weighted each sequence, each of which is generated by a different player, then 3 hits in a row would signal that it is a better player, and thus more likely to make the next shot. So then you’d be tempted to impose a linear model and add fixed effects, which would return us back where we started.

Yes, I agree, that is clearly the problem. I can imagine a model using individual players’ career average, but as you say it leads to a number of proposed models and effects.

The coin example is a really good one for illustrating the problem, though. Without that intuitive example I think it’s difficult to persuade people. The R code is fine, but without the natural language explanation of weighting it’s hard to believe.

Thanks a bunch.

Always seemed to me that the hot hand literature needed a hidden Markov analysis, in which the hidden states are “hot” and “cold” and the emissions are hits and misses. This paper appears to implement a hidden Markov model for a similarly structured problem: http://www-stat.wharton.upenn.edu/~dsmall/clumpiness_paper.pdf

Would be interesting to see the HMM applied to the sports data. Maybe Miller et al have done this already?

Richard:

I don’t like this idea as I don’t see there being underlying discrete states. It all seems continuous to me.

how about a set of coupled ODEs that track growth and decay of “hotness” through time, with the successes being a kind of forcing function, and the ODE measuring a probability of success. Then you could capture the decay of hotness if you get interrupted in your flow state as well ;-)

Stan should be able to fit this model with its fancy new ODE solver.

Just so people have context to the significance of shooting %, every additional 1% increase in eFG% at the team level represents about 4 wins. That’s quite significant!

“To put it another way, there is little debate that a “cold hand” can exist:”

I’ve had a cold hand on the golf course since 1996.

I hope that this is presented only as a psych experiment/Sokal affair to troll people- people actually believing this would jar even my cynicism.

You can enumerate the 16 4-coin sequences and then enumerate the 24 flips that follow a heads, and 12 of those are heads.

“Because the coin is fair, Jack of course expects this conditional relative frequency to be equal to the probability of flipping a heads: 0.5. Shockingly, Jack is wrong. If he were to sample 1 million fair coins and flip each coin 4 times, observing the conditional relative frequency for each coin, on average the relative frequency would be approximately 0.4”

Conditional relative frequency is NOT the same as the average relative frequency of cases. HHHH has 3 relevant flips where HTTT has 1 relevant flip. You cannot weight these equally, yet this is what is done in table 1 of the paper. In other words, number of flips is the relevant denominator, not number of sequences.

I thought I was possibly misunderstanding the argument until this line in the paper:

“The fact that a coin is expected to exhibit an alternation “bias” in finite sequences, implies that coin flips can be successfully “predicted” in finite sequences at rates better than that of chance”

This is so blatant it makes me lean towards this all being a social experiment. So cheers if that is the case.

TLDR: IID events are somehow not IID.

Hi Sam

It depends on how the predictions are evaluated. If you are evaluated based on the *percent* of predictions that are correct, you can game this, as outlined in the paper. There is a better way to game this evaluation system, unrelated to alternation in finite sequences. It is fun to think about.

Wow!

E. Peters & collaborators do cool lab studies showing that there are certain quantitative information-processing biases that are peculiar to people high in Numeracy (for example).

This — the undetected flaw in the methodology of Gilovich et al., not to mention the resistance to the proof of the flaw understandably displayed in these comments — is an amazing real-world example of such a bias.

(I’m willing to bet $10^3 against, but if someone shows the paper is wrong, then it’s *still* an example.)

Biases associated with low Numeracy are myriad.

This event (the proof of a very clear but near-invisible mistake in probabilistic reasoning that for decades evaded detection by experts on mistakes people make in probabilistic reasoning) suggests that we can’t be super confident about any estimates of the frequency of real-world biases associated with high Numeracy…

erratum:

“(for example)” -> “(for example)”

Interesting hypothesis.

Benjamin Peirce was one of the early proponents of mathematics as necessary reasoning from assumptions to implications, using phrases such as “there are no errors in math”: careless mistakes that students might make, but no errors. Today people sometimes say things like “you can’t argue with a proof.”

CS Peirce (Benjamin’s son and forced lifelong student) eventually dismissed this, arguing that math was really just diagrammatic reasoning, that is, experiments performed on diagrams (more generally, abstract symbols) instead of empirical objects (e.g. chemicals, animals, etc.). Like all experiments, if the experiment described is in fact valid, others can re-run it and get the same result by doing the same manipulations on the same materials.

What was notable with math experiments though was ease of redoing the experiment and establishing if it is reproducible (and exactly reproducible). Anyone with the requisite math background and a complete report of the experiment could redo with just pen and paper (today maybe a computer). So because of this, we should be surprised that incorrect math survives very long – it quickly gets corrected or dismissed.

But of course not always, and it would be interesting to learn about features that predict a long delay in the detection of some real math errors. Wonder if anyone has looked at this?

Everyone’s focusing on the hot hand literature but the same measure is used in gambler’s fallacy work. Does that tend to show negative effects?

The result *explains* why we are prone to the “gamblers’ fallacy”: because if we observe a finite series of random, independent events (e.g., coin tosses), it *will* be the case that in our sample the recurrence of a particular outcome will be *lower* than the occurrence of that outcome in the first place.

But it is indeed a “fallacy” in reasoning to infer from that “experiment” that the outcome (getting “heads”) is less likely to recur than it is to happen in the first place.

Proof of the “gambler’s fallacy” involves showing that people erroneously treat independent events as non-independent when they predict future outcomes; “hot hand” research involves the mistake (by people way too Numerate to fall prey to the “gambler’s fallacy”) of not realizing that one *would* see the pattern consistent with the “gambler’s fallacy” if one examined the recurrence of independent events *retrospectively.*

Read the paper– it explains this in very clear & compelling terms!

Wow!

“if we observe a finite series of random, independent events (e.g., coin tosses), it *will* be the case that in our sample the recurrence of a particular outcome will be *lower* than the occurrence of that outcome in the first place.”

Not necessarily. If I watch Stephen Curry play basketball regularly, I will see him make 44% of his 3pt attempts. After he makes 2 consecutive 3pt baskets, I will see him succeed 44% of the time. After he misses 2 shots, I will again see him succeed 44% of the time. Now, it is true that if I calculate Curry’s average post-streak success rate in each game, and then average those averages, the result will understate his true success rate in these situations. But who does that? More importantly, who forms their intuition that way? Would a fan really give equal weight to a game in which Curry shot 1 for 3 and a game in which he shot 7 for 9? That is what the paper is arguing. Doesn’t it seem likely that fans (most of whom are not social scientists) respond to each opportunity roughly equally rather than averaging per-game averages?

Whoever can explain the conceptual point in intuitive terms should get a prize. I tried & failed. Actually, I think the paper is clear — so probably the prize should go to Miller & Sanjurjo, and I should let Miller continue to bear the burden of speaking for himself! But to try to clean up my own mess:

For sure, we will not observe that “the recurrence of a particular outcome will be *lower* than the occurrence of that outcome in the first place” *every* time we observe a finite series.

But if we sample subsamples of all the finite series we encounter, we will observe that, as shown by the paper’s interrogation of the sample space for observations collected in that way (see Tbl 1).

The answer, I think, to Guy’s point is that this *is* a plausible account of how we observe independent events occurring. We observe or collect data in “attention span” units. We then effectively sample all the sequences recorded during “attention span” units & observe that in fact the recurrence of an outcome immediately after it occurred was less than the probability it would occur in the first place.

Or so M&S surmise.

But I do think the paper’s exposure of the *mistake* in the method reflected in the “hot hand” literature– one that they point out is more likely to be made by people who “get” the Gambler’s Fallacy than those who don’t (see p. 2, n.2 for cool proposed experiment that could test this)– is more interesting than the (still very interesting) conjecture M&S offer about their proof being the “source” or “explanation for” the Gambler’s Fallacy.

our point about Gambler’s Fallacy was not applied to fan beliefs. The basic point is that the finite sample bias means that structurally you cannot unlearn these beliefs, however you may have arrived at them. It is ultimately an empirical question whether or not this hypothesis applies either to the emergence or the persistence of the gambler’s fallacy type beliefs. Perhaps there is already some evidence out there? Someone has recently made us aware of the work of Yaakov Kareev, which looks at finite sample bias when correlating two variables: http://theinvisiblegorilla.com/blog/2010/03/31/when-less-is-more-memory-limits-and-correlations/

You do advance the claim (and even present some original data to support it) that “finite sample bias” can “explain” prevalence of gambler’s bias.

BTW, why not call the bias you are describing the “‘hot hand fallacy’ fallacy”!

Also, I do think you should try to come up w/ more ways to summon the *right* intuition & silence the wrong one that keeps people (myself included) saying “‘this can’t possibly be the case.’ … Now I’m not so sure.”

What do you think of this?

Imagine I do 100 coin tosses and observe 50 “heads” and 50 “tails.” No problem so far, right?

If I now observe the recorded sequence and begin to count backwards from 50 every time I see a “heads,” I’ll always know how many “heads” remain in the sequence.

Necessarily, the number goes down by 1 every time I see a “heads” in the sequence.

And necessarily the number does not go down — it stays the same — every time I see a “tails” in the sequence.

From this we can deduce that the probability that the “next” flip in the sequence will be a “heads” is always lower if the “previous” flip was a “heads” than if it was a “tails.” …
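A small exact check of this counting argument, as a sketch only: instead of 100 tosses with 50 heads, it uses a toy case of 6 tosses with exactly 3 heads (the sizes are chosen just for this illustration so that every arrangement can be listed). Conditional on the total number of heads, every arrangement is equally likely, and pooling all adjacent pairs gives the chance of heads after heads versus heads after tails.

```r
# All ways to place 3 heads among 6 flips; each arrangement is equally likely
# once the total number of heads is fixed (toy sizes, chosen for illustration).
n <- 6
n_heads <- 3
arrangements <- combn(n, n_heads, function(pos) {
  x <- integer(n)
  x[pos] <- 1L
  x
})                                   # n x choose(n, n_heads) matrix, one column per arrangement

prev_flip <- arrangements[-n, , drop = FALSE]  # flips 1..(n-1)
next_flip <- arrangements[-1, , drop = FALSE]  # flips 2..n

# Pool every (previous, next) pair across all arrangements.
p_heads_after_heads <- mean(next_flip[prev_flip == 1])
p_heads_after_tails <- mean(next_flip[prev_flip == 0])
c(after_heads = p_heads_after_heads, after_tails = p_heads_after_tails)
```

With the total fixed, heads after heads has probability (3-1)/(6-1) = 0.4 and heads after tails has probability 3/(6-1) = 0.6; the same arithmetic with 100 tosses and 50 heads gives 49/99 versus 50/99.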

Your paper has pretty much distracted the hell out of me for going on 24 hrs…

Hi dmk, thanks! I like your title!

I was up until 3 in the morning with some friends from the Fed on this, it was Monty Hall all over again.

We are still looking for more refined intuition to communicate better. The Monty Hall has the 99 doors intuition, and the choose the best of 2 doors intuition. Something like that would be great.

We have two intuitions, but neither is immediate: (1) selection bias of assigning flips after we know the sequence, and how that relates to the initial few 1s in a run; (2) the fact that expected relative frequency is a clustered average, and thus sequences with more observations (more flips that immediately follow heads) are weighted the same as sequences with few observations.

This intuition feels appealing and maybe there is a tweak to make it work for all sequences, but it doesn’t work for every sequence; in fact, no intuition will work for every sequence, because for some sequences it is just true that you will flip heads after a heads relatively more frequently, e.g. 0011. It’s an expectation thing on the typical sequence.

The fact that it is an expectation thing is what makes it so difficult, and why neither of the intuitions above makes it “pop” (as in “well, of course it has to go in this direction”). Notice the second intuition just shows you there is a problem but does not tell you the direction. The first intuition shows you the direction, but again, it doesn’t pop, because while mechanically this is going on, it is an expectation thing.

ciao!

josh

Did you ask them how they missed the friggin’ housing bubble?…

Josh:

I’ve been blogging about this on my tumblr (you can find what I wrote on the subject at http://jadagul.tumblr.com/tagged/hot-hand ). For me at least, and for some of the people I’ve talked to, your (2) has been the most helpful intuition. (Assuming, of course, that I’ve understood it correctly!) In particular, you get the expected average if you weight the sequence by the number of observations in it before averaging the probability in the sequences; if you don’t weight it, you’re taking a biased measurement and get weird results.

I think the broad based intuition to communicate these findings may lie in the (2) thread. By increasing (n) from 2 to 20, the probabilities take on significantly more possible outcomes. For instance:

n=2 P(H|H) {0, 1}

n=3 P(H|H) {0, 0.5, 1}

n=4 P(H|H) {0, 0.5, 0.66, 1}

n=20 P(H|H), from sort(unique(prob)):

[1] 0.0000000 0.1000000 0.1111111 0.1250000 0.1428571 0.1666667 0.1818182
[8] 0.2000000 0.2222222 0.2500000 0.2727273 0.2857143 0.3000000 0.3333333
[15] 0.3636364 0.3750000 0.4000000 0.4166667 0.4285714 0.4444444 0.4545455
[22] 0.4615385 0.5000000 0.5384615 0.5454545 0.5555556 0.5714286 0.5833333
[29] 0.6000000 0.6153846 0.6250000 0.6363636 0.6428571 0.6666667 0.6923077
[36] 0.7000000 0.7142857 0.7272727 0.7333333 0.7500000 0.7692308 0.7777778
[43] 0.7857143 0.8000000 0.8125000 0.8181818 0.8235294 0.8333333 0.8461538
[50] 0.8571429 0.8666667 0.8750000 0.8823529 0.8888889 0.9000000 0.9090909
[57] 0.9166667 0.9230769 0.9285714 0.9333333 0.9375000 0.9411765 0.9444444
[64] 0.9473684 1.0000000

The small (n) is analogous to limiting the number of rectangles in integration when estimating f(x). More rectangles = better estimate. The frequency of these discrete probability bins as an artifact of (n) for P(H|H) also demonstrates a convergence towards 0.5 for larger (n), as well as filling in these increased surrounding bins while diminishing the 0’s and 1’s.

Where is the re-analysis of Gilovich, Vallone and Tversky that seems to be advertised?

This is a really cool point, but it doesn’t look to me like it applies in this case. The problem arises when averaging over small clusters. If GVT had computed an average for each _game_, and then averaged those averages, the bias described here should apply. By averaging over a player’s whole season, it should be negligible (or possibly non-existent, depending on what is assumed about the relationship between players).

A fun way to think about the original example (M+S also make this point): The probability of getting a head on any toss _cannot_ be biased (in the R code). So if we were to take the _weighted_ average across all the trials (weighting by the number of throws we focus on), we would have to get 1/2. Thus, the _unweighted_ average (which fails to give more weight to cases where more heads were thrown) must be lower than 1/2. The observed bias is entirely a function of the two-level averaging (and the implicit choice of weights that goes with it).

Dear Jonathan-

the re-analysis of the data from the original Gilovich, Vallone and Tversky study (among others) is here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479

Also the re-weighting trick only works for coins, because we can assume all coins have the same probability of success. This doesn’t work for basketball shots. You might think that you can place this in a regression context and add a fixed effect, but that would return us back to where we started.

Thanks for this explanation.

It’s not really a reweighting trick, in my opinion. I would say that the R script reweights the “trials” (flips that follow a head) by averaging first over “coins”. If we stuck with trials as a unit, we would not need to reweight at all.

Yes, I did think that I could solve the problem of differing probabilities with regression, and now I see why it might not work (although I still kind of want to try it). I still think there is no problem with GVT’s “Study 2” — since a player’s whole season should have a lot of trials, even for the longer streaks — but maybe you’re not saying that there was a bias problem with Study 2.

Just for the record: I now get that there is no easy way around this problem when sequences are short (or even not-that-short, as in the GVT experiments) and probabilities vary.

Dear Jonathan

we do have a way around this problem, re-sampling (via permutation), which matches exactly the core assumptions behind the null in the original study. It also allows us to pool data from all players. The method is described here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479

thanks!

Dear Jonathan:

I agree “trick” is the wrong word. Re-weighting illustrates the clustering issue that is going on. Re-weighting, or treating the trial as the unit of observation, makes sense when you have a strong theoretical reason that every coin is coming from the same process with the same parameters. Coins satisfy this criterion, basketball players do not. If you treat the trial as the unit with multiple basketball players, then you will have selection bias in favor of finding the hot hand. If you add a fixed effect, then it is equivalent to the problem we point out with the linear probability model at the end of the intro.

Study 2 of the original paper has a severe endogeneity problem, which was pointed at quite early, e.g. on the first page of Avinash Dixit and Barry Nalebuff’s Thinking Strategically book they explain clearly the problem of strategic adjustment (see it here: http://bit.ly/1eXxdI3 ). Scientifically speaking, this is why Study 3 is so important, because it does not suffer from these issues. If you can show that there is no evidence of hot hand shooting in Study 3, it is reasonable to infer it doesn’t exist. This is also why great hay has been made about the no-effect result in the 3 point study of Koehler & Conley (2003).

ciao!

josh

I think it would be interesting to look at time elapsed between shots as well. My guess is that the effect should be stronger when time between shots is lower.

It does appear that players try to “hit the iron while it’s hot,” so to speak. That’s what is said about Steph Curry. There might be a way we can look into this; we recorded that data too. Thanks Z!

Is the point that

E[1(h2 and h1)/1(h1)] != E[1(h2 and h1)]/E[1(h1)]

?

Or a rewrite of the code:

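# uses rep, n and data from the code in the post
bigrams <- rep(NA, rep)
unigrams <- rep(NA, rep)
for (i in 1:rep){
  heads1 <- data[i,1:(n-1)]==1          # heads among flips 1..(n-1)
  heads2 <- data[i,2:n]==1              # heads among flips 2..n
  bigrams[i] <- sum(heads1 & heads2)    # heads followed by heads
  unigrams[i] <- sum(heads1)            # heads that have a following flip
}
mean(bigrams)/mean(unigrams)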

“Jack takes a coin from his pocket and decides that he will flip it 4 times in a row, writing down the outcome of each flip on a scrap of paper. After he is done flipping, he will look at the flips that immediately followed an outcome of heads, and compute the relative frequency of heads on those flips.”

Please help, I do not understand this setup and cannot read R code. I understand this to be the correct process:

1. Take a slip of paper, looking at the first three flips for a “heads.”

2. For any “heads” flips in step one, note whether the next flip is a heads (success) or tails (failure)

3. Repeat for 1 million slips of paper

4. Divide the number of successes by the number of trials

And yet, going through that process I get a Success/Trial ratio of .5

What did I miss?

Okay, got it.

1. Take a slip of paper, looking at the first three flips for a “heads.”

2. For any “heads” flips in step one, note whether the next flip is a heads (success) or tails (failure)

3. If you found at least one “heads” in part 1., calculate the Success/Trial ratio for this slip of paper

4. Repeat for 1 million slips of paper, recording the Success/Trial ratio for each slip of paper

5. Find the mean of all the ratios recorded in step 4.
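The two step lists above can be run side by side. The following is a rough R sketch, not the code from the post: it uses 100,000 simulated slips rather than 1 million so that it runs quickly, and the seed and variable names are arbitrary choices made for this illustration.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
n_slips <- 1e5                    # fewer than the 1 million in the post, to run quickly
n <- 4
flips <- matrix(sample(0:1, n_slips * n, replace = TRUE), n_slips, n)  # 1 = heads

heads_first3 <- flips[, 1:(n - 1)] == 1   # a heads that has a next flip (a "trial")
heads_next   <- flips[, 2:n] == 1         # whether that next flip is heads
trials    <- rowSums(heads_first3)
successes <- rowSums(heads_first3 & heads_next)

# First procedure: pool all trials across slips, then divide once.
pooled_ratio <- sum(successes) / sum(trials)

# Second procedure: one ratio per slip (skipping slips with no trials), then average.
has_trial <- trials > 0
mean_of_ratios <- mean(successes[has_trial] / trials[has_trial])

c(pooled = pooled_ratio, mean_of_ratios = mean_of_ratios)
```

The pooled ratio should land near 0.5 and the mean of per-slip ratios near 0.40, which is the difference between the first list of steps and the second.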

I don’t understand how Professor Gelman’s R code is a simulation of “the conditional probability that he gets heads, given that he got heads on the previous flip.”

If I were to simulate that, I would count subsequences of HT vs HH. To calculate it, I would enumerate the population of 4 digit nonnegative binary numbers (since generating the sequences consists in sampling uniformly from this population), and enumerate length-2 subsequences of successive digits. 11 and 10 are equally frequent. Can someone explain the interpretation of “sum(heads1 & heads2)/sum(heads1)”?

Ok I understand that averaging conditional probability estimates across samples incorrectly weights the individual “trials,” although I don’t see whatever the little trick is that makes that thing the individual sample estimator.

Unless I have made a stupid mistake somewhere, I do not think it is safe to say that P(H|H) = 0.4.

In the below, the second column is the number of heads for which there is a subsequent flip

The third column is the number of such subsequent flips that is heads

Therefore, the sum of the third column divided by the sum of the second column is P(H|H)

HHHH  3  3
HHHT  3  2
HHTH  2  1
HTHH  2  1
THHH  2  2
HHTT  2  1
HTHT  2  0
THHT  2  1
HTTH  1  0
THTH  1  0
TTHH  2  1
TTTH  0  0
TTHT  1  0
THTT  1  0
HTTT  1  0
TTTT  0  0
-----
      25 12

P(H|H) = 12/25 which is as close to 0.5 as an odd number of trials can allow. If I just compute a conditional probability for each row and give each row equal weight, then I get the MS result that P(H|H) = 0.4. But is it correct to give equal weight to the first row in which there are three flips following a heads and the penultimate row in which there is only one flip following a heads? Should one flip get the same weight as three flips?

I think Guy is on to something when he asks, “Would a fan really give equal weight to a game in which Curry shot 1 for 3 and a game in which he shot 7 for 9? That is what the paper is arguing.”

The question is, what is the unit of observation. Is it the streak or is it the shot?

I encountered a similar problem when I analyzed the paper by Nelson and Simmons in which they claimed that baseball players with the initial “k” strike out more often because “k” is the scorer’s symbol for a strikeout.

Nelson and Simmons used the player rather than plate appearance as the unit of observation. According to the logic of Nelson and Simmons, when computing a team batting average, a player who gets one hit in one at bat should count as much toward the team batting average as a player who has 100 hits in 400 at bats. (Their paper was published in Psychological Science.) I don’t know of any fan who would compute a team batting average in this fashion.

Anyhow, I don’t think it is correct to assert baldly that P(H|H) =/= 0.5. I look forward to reading the paper by MS.

Right, the P(H|H) = 0.4 appears to be an artifact of averaging the runs of 4 flips. However, when you aggregate all of the flips, you arrive at the expected probability of 0.5.

So it is not true that the conditional probability is 0.4.

So (with my correction below) I calculated what you say, and get (2 figure accuracy) 0.405, which agrees with what Andrew’s program produces and confirms what you say is being done.

It’s not true that the conditional probability is ~0.4. That is a bogus number calculated in a bogus way.

I am sure Andrew knows this. The question is, if this is what “hot hand” researchers are doing, do they know what they are doing?

You miscounted. It’s 12/24 = .5 exactly.

Is this just a farce where a lot of people are talking smart?

“Has the whole world gone CRAZY?!” — Walter

Your table has the line “TTHH 2 1”

This isn’t right, because the fourth H should not be counted (it isn’t followed by anything, a rule that you followed on every other line in your table).

Should be “TTHH 1 1”

which makes the total number of heads 24, not 25, and the conditional probability exactly 0.5.
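The corrected table can be rebuilt mechanically. This is a small R sketch with made-up column names (`trials` for heads that have a following flip, `successes` for those followed by heads), intended only as a check on the counts discussed here.

```r
# All 16 sequences of 4 flips, coded 1 = heads, 0 = tails.
seqs <- as.matrix(expand.grid(rep(list(0:1), 4)))
label <- apply(seqs, 1, function(x) paste(ifelse(x == 1, "H", "T"), collapse = ""))

heads_with_next <- seqs[, 1:3] == 1          # heads in flips 1-3 (each has a following flip)
next_is_heads   <- seqs[, 2:4] == 1
tab <- data.frame(
  seq       = label,
  trials    = rowSums(heads_with_next),
  successes = rowSums(heads_with_next & next_is_heads)
)

tab[tab$seq == "TTHH", ]                     # the disputed row
colSums(tab[, c("trials", "successes")])     # column totals
```

The TTHH row comes out as 1 trial and 1 success, and the totals as 24 and 12.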

Andrew was not correct to describe the number he calculated as “the conditional probability that he gets heads, given that he got heads on the previous flip”. Your table (except for that error) is identical to what I produced independently, and that number is the actual conditional probability that he gets heads, given that he got heads on the previous flip. Not what Andrew calculated.

That is what was getting me. Andrew made that assertion, which is clearly wrong, and I couldn’t understand what his program was trying to calculate. Some of the other comments (and Zachary’s) help to explain this. But Andrew’s assertion mystified me.

Bill:

Yes, you’re right. I should’ve said “proportion,” not “probability.” I’ve fixed.

Andrew:

AHA! That makes all the difference!

I recreated the simulation noted above in @Risk for Excel. I looked at sequences of 1,000 H/T combinations, and the probability of a Head given 1 head on the prior event, 2 heads on the two prior events, 3 heads / 3, etc. I defined the “excess heads” as the difference between the percentage of heads observed after 1, 2, 3 (etc) heads and the total number of heads observed in the iteration, and ran 100,000 iterations.

My results were:

Streak  Excess    Mean #Obs
1       -0.11%    500.00
2       -0.16%    249.96
3       -0.36%    124.95
4       -0.76%    62.46
5       -1.61%    31.23
6       -3.56%    15.62
7       -7.49%    7.81
8       -12.34%   3.91
9       0.08%     1.95

I think the results for streaks of >6 or so stem from the small number of observations in each trial. I guess I am arguing in agreement with BD McCullough above – it doesn’t seem to me that there is any ‘hot hand’ in random results, as asserted in the original post. Of course, I could be missing the point entirely, and I haven’t read the MS paper.
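A rough R version of this kind of simulation, for anyone without @Risk. This is a sketch under assumptions, not a reconstruction of the spreadsheet: it reads “excess heads” as the proportion of heads after a streak of k heads minus the overall proportion of heads in the same sequence, it runs 2,000 iterations instead of 100,000, it stops at streaks of 6, and it drops iterations in which a streak never occurs. The function name `excess_after_streak` is invented for the example.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
n_flips <- 1000      # sequence length used in the comment
n_iter  <- 2000      # far fewer than the comment's 100,000 iterations, to run quickly
max_streak <- 6      # the comment reports streaks up to 9

# Proportion of heads on flips that follow `k` heads in a row, minus the
# overall proportion of heads in the same sequence ("excess heads").
# Returns NA when the sequence contains no such flip.
excess_after_streak <- function(x, k) {
  n <- length(x)
  streak <- rep(TRUE, n - k)
  for (j in seq_len(k)) streak <- streak & (x[j:(n - k + j - 1)] == 1)
  if (!any(streak)) return(NA_real_)
  mean(x[(k + 1):n][streak]) - mean(x)
}

excess <- replicate(n_iter, {
  x <- sample(0:1, n_flips, replace = TRUE)
  sapply(seq_len(max_streak), function(k) excess_after_streak(x, k))
})                                        # max_streak x n_iter matrix

mean_excess <- rowMeans(excess, na.rm = TRUE)
data.frame(streak = seq_len(max_streak), mean_excess_pct = round(100 * mean_excess, 2))
```

With this few iterations the shortest streaks are noisy, but the sign and rough size of the mean excess for streaks of 4 to 6 can be compared with the table above.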

Darn! – I *knew* the table formatting would be a problem. Sorry

So, is there a consensus that Miller et al are right and Tversky et al were wrong?

What do the “Hot Hand Fallacy” old guard say in response?

I think M+S have an interesting example of a bias that _can_ arise with certain kinds of averaging, but I don’t think they’ve made even a prima facie case that it _does_ arise in the GVT paper (and I don’t think that it does).

Jonathan:

See the last 2 columns of Table 3 of this paper (linked in the above post). They give p.hat(hit|3 hits) – p.hat(hit|3 misses):

GVT estimate .03

bias-adjusted estimate .13

So, yeah, it does seem to make a difference in that classic analysis!

Thanks. I completely missed that part of GVT. I’ve looked at GVT several times, and thought I knew what it said. But I apparently never got beyond “Study 2” — perhaps because I am a Sixers’ fan.

Both my comments were focused on GVT’s in-game field goal study, which I still think is not much affected by the M+S bias. As M+S point out, though, in-game data may be too complex to analyze convincingly for the hot hand effect.

It’s a real shame that these people don’t talk with athletes about how a hot hand really works. Or even how shooting mechanics work. Maybe they should talk with golfers. Try to tell PGA pros that their good rounds are just a fortunate collection of lucky shots and their bad rounds are an unfortunate collection of unlucky shots. They’ll laugh at you. And they should.

The reality of an athlete’s skill is that it varies. There are days when a pitcher has great control of his slider and days when he doesn’t. It isn’t luck like a flip of a coin. There are days when a shooter has his stroke in basketball and days when he doesn’t. Often, it is due to small variations in fundamentals. Players get in bad habits. The Atlanta Braves once revived a pitcher’s career because they noticed he’d changed his arm angle slightly — just enough to affect the ball movement. If some stats geek thinks that every pitch that pitcher threw was due to the same ‘ability’ (just subject to random variability), the stats geek has absolutely no clue what really happened.

Hitters really do go through stretches where they see the ball better. They go through stretches where they are tired and their bat speed drops off. Pitchers have days where their fastball has less pop (they really can’t get as many mph on it). Any analysis which assumes these days are all the same is worthless.

Basketball players play hurt. A dinged ankle affects the lift of a jump shot. It can completely change the shooting mechanics. To chalk that up to random variability is just stupid.

Have you ever read Fish, S. Dennis Martinez and the Uses of Theory, Yale Law Journal 96, 1773-1800 (1987)?

I’m pretty sure that asking Larry Bird about hot hands would be a lot like this… (Kareem Abdul-Jabbar might have something edifying to say, but he is an outlier.)

Here is an attempted intuition for the bias (in the i.i.d. coin version, with strings of 4 flips):

Properly considered, a trial is any flip following an H, in any string. If we counted all trials equally, we would get the correct answer of .5 for the conditional frequency. But we are instead weighting all strings of 4 flips equally. This means we are overweighting trials from strings with low denominator, i.e. few H’s in the first 3 positions, and underweighting trials from strings with many H’s. Sure enough, we get an average below .5.

In a single sentence: Strings favoring the hot hand discount themselves by creating a higher denominator.

I’m sure this echoes in part something written previously.

There is a major non sequitur in the introduction to the Miller/Sanjurjo paper. They state that because of the bias they correctly identify, “observing that the relative frequency is equal to the base rate is in fact evidence in favor of the hot hand.” This does not follow, and at least in natural examples this statement can be false, as follows:

For simplicity, suppose we know the long-run frequency of H’s in a binary sequence is .5. We have two hypotheses about the data-generating process:

A: i.i.d

B: First flip 50-50. Thereafter Markov with P(H|H)=P(T|T)=.6

And we observe HHT. The frequency of H’s following H’s is ½, exactly the base rate. The probability of this sequence is .125 under hypothesis A and .12 under hypothesis B. So our posterior favors A (i.i.d.) more than our prior, for any prior. The same is true if .6 is replaced by any probability unequal to .5.
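The two likelihoods are quick to recompute. This is a minimal R sketch; the function name `seq_prob` and the grid of alternative repeat probabilities are illustrative choices, not part of the original argument.

```r
# Probability of a given H/T sequence when the first flip is 50-50 and each
# later flip repeats the previous outcome with probability `p_repeat`.
# p_repeat = 0.5 is hypothesis A (i.i.d.); p_repeat = 0.6 is hypothesis B.
seq_prob <- function(flips, p_repeat) {
  repeats <- flips[-1] == flips[-length(flips)]
  0.5 * prod(ifelse(repeats, p_repeat, 1 - p_repeat))
}

obs <- c("H", "H", "T")
lik_A <- seq_prob(obs, 0.5)
lik_B <- seq_prob(obs, 0.6)
c(A = lik_A, B = lik_B, likelihood_ratio_A_over_B = lik_A / lik_B)

# The same comparison for a few other repeat probabilities (illustrative grid).
p_grid <- c(0.1, 0.3, 0.7, 0.9)
lik_other <- sapply(p_grid, function(p) seq_prob(obs, p))
lik_other
```

For HHT the likelihood under B is 0.5 * q * (1 - q) with q the repeat probability, which is below 0.125 whenever q differs from 0.5.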

Maybe the assumption of a known base rate is an issue here? And maybe this is covered later in the paper? In any case I was concerned to see this questionable statement in the introduction.

Dear Jonathan

Thanks, this is a nice question to think about.

We didn’t get into this in the intro because over-precision can make the text unreadable. Yes, the bias is calculated relative to the null that the player is a consistent shooter (shoots with a fixed probability of success), which is the reference distribution of the previous studies.

Now for the model you propose the bias is substantially *worse*. Consider p(H|H)-p(H|T) in column 3 as an estimate of the effect size in your model of the data generating process (which you assume to be 0.2). The estimate of the effect size is expected to be E[p(H|H)-p(H|T)]=-.2125, so the bias is -.4125, which is larger in magnitude than the bias of -.33 for this estimate when the data is generated by the consistent shooter model presented in the bottom row of column 3. To see how E[p(H|H)-p(H|T)] is calculated, note that in your model each sequence has probability C*0.5*(0.4)^a(0.6)^(3-a), in which a is equal to the number of alternations in the sequence, and C is equal to the normalizing constant that accounts for sequences in which p(H|H)-p(H|T) evaluates to missing.
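For anyone who wants to check the -.2125 figure, here is a rough sketch in Python. It assumes strings of 4 flips (as in Table 1) and uses the sequence probabilities described above; the function names are illustrative, not from the paper.

```python
from itertools import product

def rf_diff(seq):
    """rf(H|H) - rf(H|T) for one string, or None if either frequency is undefined."""
    after_h = [b for a, b in zip(seq, seq[1:]) if a == 1]
    after_t = [b for a, b in zip(seq, seq[1:]) if a == 0]
    if not after_h or not after_t:
        return None
    return sum(after_h) / len(after_h) - sum(after_t) / len(after_t)

def expected_diff(n, stay):
    """E[rf(H|H) - rf(H|T)] over the strings of length n where it is defined.
    `stay` is the probability that a flip repeats the previous outcome."""
    num = den = 0.0
    for seq in product((0, 1), repeat=n):
        d = rf_diff(seq)
        if d is None:
            continue
        alternations = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
        prob = 0.5 * (1 - stay) ** alternations * stay ** (n - 1 - alternations)
        num += prob * d
        den += prob  # 1/den plays the role of the normalizing constant C
    return num / den

print(expected_diff(4, 0.6))  # about -0.2125 (true effect size is +0.2)
print(expected_diff(4, 0.5))  # about -0.3333 (true effect size is 0)
```

So the estimate misses by about -.41 under the Markov model and by about -.33 under the consistent shooter model.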

In our paper from last year (“A Cold Shower for the Hot Hand Fallacy”) we explore the power of our statistical tests by considering plausible alternative models of hot hand shooting including a regime switching model (Hidden Markov), which is in spirit with what you propose, but perhaps more plausible. What you see in this case is that the underestimate of the hot hand is even more severe, but for a different reason than in your example. When looking at the relative frequency of H after HHH in the regime switching model, you are using HHH as a proxy for the probability of success in the hot state. This will necessarily be an underestimate of the hot hand because HHH will also occur when a player is in a non-hot state (i.e. you are pooling hot and not-hot shots together).

Joshua,

Thanks for the reply. I don’t question the bias you identified in the mean conditional frequency, and I think it’s a nice point showing a subtle and original example of selection bias. But I question the relevance of the bias for inference about whether the null model (iid) or a Markov/Hidden Markov model is correct. My example shows that even though under the null the expected conditional frequency is below p=.5, a string with conditional frequency .5 can still be evidence in favor of the null and against the hot hand.

I looked at Gilovich-Vallone-Tversky. If they relied on averaging conditional vs. unconditional frequency across players for their methods, your bias would certainly come into play. But they don’t. They look at each player individually, and typically summarize a data set by saying how many players had positive vs. negative serial correlation. In fact in footnote 3, p.304, they say it would be wrong to average across players (for reasons different from yours).

Now I realize that even looking at one player, the biased mean is there. But when GVT discuss how many players have positive vs. negative hot-hand effect, the question is not whether the *mean* is biased but whether the *median* is biased, an entirely different question. In fact the median is not systematically biased, so GVT are justified in thinking that in i.i.d. data, half the players will show hot-hand and half anti-hot-hand, though anti-hot-hand will be by bigger percentage margins, the cause of your bias. That is: Let D=(sampled conditional frequency)-(sampled unconditional frequency). You have shown that the mean of D is negative. But the median of D is not necessarily negative. (Because of integer constraints, the median can be slightly positive or slightly negative, somewhat arbitrarily.)

Best,

Jonathan

PS I was a little too quick to claim knowledge of the behavior of the median; it looks complicated based on a few examples. But the more important points are (1) D being higher than its mean, for example D=0, is not necessarily evidence for the hot hand, and (2) nothing in GVT’s methods looks to me like it’s invalidated by biased D.

Hi Jonathan

I wish there was a way to get a notification when a reply comes in.

For your first paragraph:

Please forgive me if I misread what you mean. In your model there was a hot hand, P(H|H)-P(H|T)=0.2>0, and for the estimate (let’s use rf to avoid confusion) the expected value is E[rf(H|H)-rf(H|T)]=-0.2125, so it is also biased in finite samples, and far worse. This means that if you were to assume there is some unknown effect size, and you were to estimate it, and you saw rf(H|H)-rf(H|T)=0, you would have an underestimate, and falsely conclude that there is no hot hand. The correct test is not to compare these two groups of flips as if they were independent groups; the correct test is to re-sample as we do in our other 2 papers (permutation test, details in this paper which corrects for the bias: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479).

Now our goal was not to say one model of the hot hand is right and another model is wrong. There are many ways to operationalize the hot hand using statistical measures and we don’t want to take a stand on what it is, so we have 3 measures that are based on hit streaks which we justify based on identification and power grounds in the other papers, and we reject the null for patterns in these measures, and importantly, we don’t reject the null for patterns in miss streak measures. This is exactly what you would predict if the hot hand was there.

For your second paragraph:

Looking at each player individually *is* the problem. With 100 shots and a 50 percent shooter you are expected to find a difference of -8 percentage points when you condition on 3+ hits. Also, while we report mean adjusted differences, our tests are all based on the exact distribution under the null. For example, in the paper we perform binomial tests based on the true median and the true .05 percentile (using resampling). In case you are curious, the median is -6 percentage points.

Hope that helps clear things up. It’s tough to strike the balance between readable exposition and full detail. The full detail is in the first two papers, but I see we need to add a little more detail to this paper.

thanks for kicking the tires!

ciao!

josh

Overall, I think the answer here is: so what?

Who cares about a sequence of four flips? In the original code, if you set n=2 then you get 0.50. So why is 4 flips a sufficient number to establish the randomness of a single sequence yet 2 isn’t? I reject the entire premise, aggregating conditionals across sequences to confirm/deny the existence of “hot hands” isn’t the way to go about doing it.

I find it interesting that neither of the cited papers mention the Wald–Wolfowitz runs test or K-S tests for the sequences.

If we call a sequence of flips a run, then we already have sufficient ways of testing whether or not the data is drawn from a random binomial.

To clarify: have you ever played basketball? Who calls a hot hand after 4 shots? A little domain expertise would be prudent here :p

I think you (Zachary) might be misunderstanding the significance of the 4 flips for the authors’ argument. Certainly, the authors recognize the point you & BD McCullough are making. You are actually treating Tbl 1 as if it were a randomly generated 64 sequence flip; it’s not, and the authors don’t analyze it as such.

the paper’s key point is that “in a finite sequence generated by repeated trials of a Bernoulli random variable the expected conditional relative frequency of successes, on those realizations that immediately follow a streak of successes, is strictly less than the fixed probability of success….”

If you want to disagree with them, then I think you need to show either (a) the authors are wrong about that; or (b) the authors are wrong to understand the analytic strategy of the classic “hot hand” studies as assuming, when they analyzed player performance over a particular interval, that if one examines a finite sequence of outcomes generated by a binary random process, the probability of the recurrence of a particular outcome following a specified string of such outcomes *is* in fact the *same* as the unconditional probability of that outcome within that set.

No one’s done either of those things so far in this discussion, at least as far as I can tell. Even Guy seems to have agreed there was exactly the defect M&S have identified in the original studies; Guy quarrels with how to specify the expected P(success|following specified string of successes) to determine whether the observed “streaks” do in fact differ from what one would expect to see by chance.

Hi Zachary

The goal isn’t to find any pattern that rejects the null of a player with a fixed probability of success, the goal is to find specific patterns that are consistent with hot hand shooting. This means that if the hot hand exists, the patterns associated with hot hand shooting (certain types of hit streaks) will lead you to reject the null, but other patterns, for example those that have to do with miss streaks, may not lead you to reject the null. This is what we find across all studies we have looked at. If you peek in the appendix you can see the connection with runs. For a complete discussion on these issues, please see this earlier paper of ours: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479

So, if, presuming, Miller et al are right and Tversky et al were wrong, how did nobody notice for all these years? And what does this say about how often Tversky et al were cited as proof of the superiority of the teller of the tale?

After considering Josh’s comments here, and reviewing the paper again, I am now convinced that the authors have identified a real bias. And as best I can determine – contra my comment above — they have measured its magnitude correctly. So kudos to Miller and Sanjurjo for this discovery.

At the same time, I think M&S could do a better job of communicating what they found and – especially – what they did to measure the bias. Let me try to provide a brief explanation for the bias that will hopefully prove intuitive for some, and explain what I think the authors actually did to measure the bias (which is not exactly what the paper says they did).

There are two different biases at work that need to be corrected. First is the bias created by the fact that the conditional flips cannot also appear as the next flip. Let’s call this the “conditional bias.” This can be corrected using sampling without replacement (SWOR): for example, if n=100 and frequency of heads is .50, our expectation after 3 consecutive heads would be 47/97 or .485 once we account for conditional bias.

However, what M&S have discovered is that the average expectancy after a streak is even less than the estimate provided by SWOR. For example, sequences with exactly 50 heads and 50 tails will have a conditional frequency that is less than .485 (approximately .46). This additional “M&S bias” of .025 results from the fact that the number of HHH streaks in a sequence is not independent of the success rate after HHH: more success after a streak also means more streaks. Thus, if every sequence is weighted equally, we will underweight the streaks followed by a hit and the observed mean conditional frequency will be less than estimated by SWOR. This extra bias is relatively small for larger sequences (for example, in their NBA three-point study, it amounts to only about 1% for an average player), but does have a larger impact when sequences are shorter and especially when computing the difference between “hot” and “cold” frequencies (as in Gilovich).
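The gap between the SWOR figure and the per-sequence average can be seen in a quick simulation. This is only a sketch in Python under my own assumptions (random orderings of exactly 50 hits in 100 shots, streak length 3, a fixed seed, 20,000 orderings), not the authors' procedure.

```python
import random

def freq_after_streak(seq, k):
    """Relative frequency of hits on shots that immediately follow k hits in a row,
    or None when the sequence has no such shot."""
    hits = trials = 0
    for i in range(k, len(seq)):
        if all(seq[i - k:i]):
            trials += 1
            hits += seq[i]
    return hits / trials if trials else None

def mean_freq_fixed_count(n, heads, k, reps, seed=0):
    """Average freq_after_streak over random orderings of a fixed number of hits,
    weighting every ordering equally."""
    rng = random.Random(seed)
    seq = [1] * heads + [0] * (n - heads)
    values = []
    for _ in range(reps):
        rng.shuffle(seq)
        f = freq_after_streak(seq, k)
        if f is not None:
            values.append(f)
    return sum(values) / len(values)

swor = (50 - 3) / (100 - 3)  # 47/97, about 0.485
sim = mean_freq_fixed_count(100, 50, 3, 20000)
print(round(swor, 3), round(sim, 3))
```

The simulated average should land below 47/97, in line with the roughly .46 quoted above; the exact digits depend on the seed and the number of orderings.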

One confusing aspect of the paper is the presentation of their measurement of this bias. They say that they “derive an explicit formula for the expected conditional relative frequency of successes for any probability of success p, any streak length k, and any sample size n.” The presentation and discussion of table 1 (coin flip sequences with P=.5) also seem consistent with a model that estimates bias based on a particular probability of success. However, such a model would *not* correctly measure the bias in hot hand studies. As I observed above, these samples are not random trials from a player with a known true FG%, but are fixed frequencies from players of unknown true FG%: for every sequence in these studies the observed frequency of heads is exactly equal to the inferred “P” (by definition). And the bias in sequences with actual frequency X is *not* the same as the mean bias in all sequences generated when P=X. For example, Andrew notes above that sequences of 4 coin tosses (P=.50) will have a mean .41 success rate after a head, but if we look only at those cases where exactly 2 heads come up — and thus we would infer P=.50 — the success rate is actually just .33, not .41. The difference is much smaller for longer sequences, but does not disappear.
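The .33 versus .41 comparison for 4 tosses can be confirmed by enumeration. A short Python sketch (helper names are mine):

```python
from fractions import Fraction
from itertools import product

def freq_after_head(seq):
    """Relative frequency of heads on flips that follow a head, or None if undefined."""
    followers = [b for a, b in zip(seq, seq[1:]) if a == 1]
    return Fraction(sum(followers), len(followers)) if followers else None

def mean_given_heads(n, heads):
    """Average freq_after_head over the length-n strings with exactly `heads` heads."""
    values = []
    for seq in product((0, 1), repeat=n):
        if sum(seq) != heads:
            continue
        f = freq_after_head(seq)
        if f is not None:
            values.append(f)
    return sum(values) / len(values)

for h in range(1, 5):
    print(h, mean_given_heads(4, h))  # 0, 1/3, 2/3, 1
```

With exactly 2 heads in 4 tosses the mean is 1/3, against 17/42 (about .405) when all 14 defined strings are averaged together.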

So, what M-S call “P” is really the observed frequency (“F”) of hits in a finite sequence. The paper would be clearer if M&S did not imply they were estimating bias for sequences from a known probability, which does not seem relevant to the hot hand studies. Indeed, I’m not sure the word “probability” belongs anywhere in the paper. But fortunately, M&S did not base their bias estimate on all outcomes for a given probability (it took me some time to figure this out), but rather correctly estimate the bias for fixed distributions. Their method thus provides reasonable estimates of the combined effect of the two types of bias.

(I suspect the paper would also generate less confusion if the authors removed statements — mainly in the first six pages — that seem to imply a discovery about the actual probability of alternating outcomes, such as “the result has implications for evaluation and compensation systems, and suggests successful gambling systems.” The suggestion that a financial firm might reward an analyst based on typically guessing right about the market more than 15 days out of 30 – while ignoring the fact that her losses in the months she fails this test are larger than her profits in the other months – is particularly implausible.)

By the way, it does seem to me that a simpler solution for coping with this bias is available than that proposed by M&S. Rather than using the biased average of sequences and then correcting for it, why not just remove the bias by taking an average for the full sample (weighting all streaks equally)? Then generate a null by calculating the SWOR estimate for each player, and weighting these by number of streaks. While high-FG% players will be overrepresented in the full sample mean, they are similarly overweighted in the null estimate. So I believe this will still provide a valid comparison.

Does the discovery of this bias mean vindication for a strong “hot hand” effect? I’m much less sure of that than Andrew appears to be. That depends in part on the quality of the earlier studies which M&S now claim to have ‘reversed,’ and I’m not familiar enough with those studies to offer an opinion. In terms of an actual hot hand effect under game conditions, this recent study found that four consecutive baskets elevate the expected FG% on the next shot by only 1%, a very weak effect: http://www.sloansportsconference.com/wp-content/uploads/2014/02/2014_SSAC_The-Hot-Hand-A-New-Approach.pdf. So I feel that a “prior” of a small (c. 2%) hot hand effect is still quite reasonable, and we will need a lot more evidence before revising that upward.

Dear Guy

Thank you for the comments on our exposition and your challenges on how to interpret the evidence with respect to the hot hand; they are helpful.

Dear all: The feedback in general in the comments section here has been great and very helpful, thank you for taking your time to look at our work.

I will address to everyone the three important points you bring up regarding: (1) the effect size, (2) the relevance of the bias as a function of the sample size, (3) the method of bias correction.

I. Effect Size

———————-

Should Bocskocsky, Ezekowitz & Stein (2014) be the study which informs our priors on how large the hot hand is likely to be? There are a few critical issues here, but before discussing these issues, I’d like to mention one thing. We now have different goal posts than those of the original studies. The original GVT study and the early challenges were about the question of whether or not sometimes some players get the hot hand; they were not about whether the average player is a streaky shooter or tends to get hot. The consensus came to be, after these few challenges, that the perception that some players get hot is a cognitive illusion. Much later Koehler and Conley (2003) looked at data from individual shooters in the NBA’s Three Point Contest and replicated GVT’s conclusions; this paper has often been cited as the study which should reinforce our belief that the hot hand is a fallacy. What can now be said, with the cleanest data set from the original study, is a reversal of the original conclusion: this perception is not a fallacy. Further, across all controlled shooting studies, which all differ, and also in the NBA’s Three-Point contest, we get the same reversal. The average effect size is substantial, and there is a great degree of heterogeneity in effect size, with some players having large effects. If we wanted to isolate the basketball shooting, and study it in a scientific way, this is the type of data we would want to look at. GVT understood that; that is why they included a controlled study. I think everyone would agree that we cannot conclude anything about the existence of the hot hand or its effect size from the game data in the original study (Study 2, Dixit and Nalebuff’s point stands: http://bit.ly/1eXxdI3).

Now, while these goal posts are different, they are valid, and we are under no obligation to be tied to the line of argument in the academic literature. We all want to know, how big is this hot hand effect in games? If we look at Bocskocsky et al. (2014) we see what appears to be a small effect size, but in looking to this study for information on effect size, we are asking too much from them. Their study, which analyzes the richest in-game data set ever, is not a study of effect size; rather it is a test of whether sometimes some players get the hot hand. If we are willing to accept some of the natural limitations that come with studying game data, we have to conclude that they find at least some evidence of players being streaky.

The reason why Bocskocsky et al. is not the study to look to if we want to get information on effect size is because their empirical strategy will vastly understate the hot hand effect for the following reasons:

1. Measurement error: We want to know a player’s shooting percentage in the “hot” state, but what is actually measured is a player’s shooting percentage in a *proxy* for the hot state. This is a problem with the approach of our papers as well; not every instance of a streak of three or more made shots is a “hot” streak, so when we measure a player’s field goal percentage after a streak of three or more made shots, we are pooling shots attempted in the hot state with shots attempted in a non-hot state. The degree of measurement error is substantial; in fact, under many plausible alternative models of the hot hand, it is more substantial than the bias we discuss (if you want a particularly clean example of this please look at the work of Dan Stone: http://bit.ly/1TJ0Qfo). Now there is every reason to believe that measurement error is far worse when you study game data. Take Bocskocsky et al.’s proxy for the hot hand, their complex heat index: they are looking at a player’s shooting percentage in his previous 4 shots relative to what is expected, and those previous 4 shots can be separated by tens of minutes. This is a very weak signal of a hot state; plenty of these shots will be from a normal state. In more controlled settings the shots are happening in shorter time intervals, so while hitting 3 or more shots in a row is not a perfect signal of being hot, it is a far stronger signal in these contexts. If you want to better proxy for the hot state, you will have to measure it in some other way. Perhaps teammates and coaches can pick up on the subtle cues in body language, facial expression or shooting mechanics that signal a hot hand? No one has done this research, and why would they if they didn’t believe the hot hand exists? (Note: if you want to get more of a sense of the measurement error issue, Jeremy Arkes also has done some interesting simulations. He found that if the hot hand is infrequent, your estimates (and power) will be diluted. I can only find a gated copy of the paper here: http://bit.ly/1TJ1KZn but if you email him, I am sure he would share it gratis.)

2. Omitted variable bias: while Bocskocsky et al. control for a lot, they do not control for many important features of the defense that make shots more difficult, features that you would expect to be more present when a player is in the hot state (due to strategic adjustment), thus reducing a player’s probability of making a shot relative to baseline. Just a few examples: (1) the quality and identity of the defender, (2) whether the defender placed a well-timed visual occlusion (Joan Vickers has illustrated the importance of the “quiet eye” in far aiming tasks: http://bit.ly/1gGEZXM ; also see the work of Oudejans, Oliveira and colleagues: http://bit.ly/1fdR99G), (3) whether the defender forced an off-balance shot.

3. Pooling heterogeneous responses: Not all players are great shooters. Not all players are streaky. Some players may even be anti-streaky, e.g. go off the rails after making a few in a row. Now mix all their shots together as done in Bocskocsky et al. Do you expect to see a big average effect? This is an important point. It is the fact that there is heterogeneity in response that gives players a reason to want to discriminate between the streaky and non-streaky guys, and to adjust play appropriately. We see that heterogeneity when we isolate the shooting from the strategic confounds present in games.

I hope we agree that if you want information on how big the hot hand effect size can be in individual players, game data is not where you want to look. If you want to make the best scientific inference you can make, based on the data you have available, where would you look? All controlled shooting designs and the NBA’s Three Point Contest involve the same physical act that is carried out in games. These designs all differ in how players take their shots. The story is pretty much the same across studies: substantial, but heterogeneous, effect sizes. Now scientific inference here is not restricted to isolating a behavior. We have to remember we are studying human beings, and we have additional scientific knowledge about human beings. Human performance in other domains has been found to be largely affected by variation in confidence (self-efficacy, Albert Bandura), attention and concentration (e.g. Daniel Kahneman), the flow state (Mihaly Csikszentmihalyi, think “the zone”), and even physiological state (e.g. Churchland http://bit.ly/1HZemTO ). There is no reason to believe these factors do not operate to the same degree when shooting a basketball.

Now, I have a question: in 1984 would anyone’s prior have been that either the hot hand doesn’t exist, or if it does, it is weak? There was no reason to have this prior. Why would it be anyone’s prior now?

I get that everyone *wants* to believe in the hot hand, and we should be suspicious of this kind of motivated reasoning because people are likely to only confirm their priors. On the other hand, there is a dual motivation, held among researchers, to be able to say that these experts don’t know what they are talking about, that with high-powered statistics and without any knowledge of basketball we can know more. This is sometimes true. But we should have some humility. We are looking at 0s and 1s, the player and the coach have a far richer information set (more than what SportVU picks up). We can’t pretend we are measuring everything. Jeremy Arkes made us aware of a beautiful quote from Bill Russell on this very issue, see it here: http://bit.ly/1IamgOH

II. Bias as a function of sample size.

—————–

You are correct that the bias is small when players take a lot of shots, which is true in our controlled shooting study of the Spanish semi-pros and true for *some* of the NBA Three Point contestants. But the bias matters crucially in the original study. When the data from the original study is analyzed correctly the original conclusions are not just invalidated, they are reversed! (here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2450479). This is an important point. There was never a foundation to conclude that the belief in the hot hand was a fallacy; the original data set actually had statistically significant evidence of the hot hand. The mathematical fact of the bias, the fact that it cannot be argued with, and that it matters in the original study, well, that opens the door to all these other (important!) subtleties that people have been mentioning for years. The issues of power, measurement error, and omitted variable bias all likely understate the hot hand effect to a much greater degree (especially in game data), but because these issues have required a model of the world to even discuss them, they have been thrown into the category of “debatable” and the fallacy view has lived on. But you can’t debate the math, and the math matters in the original study. Now if you want to have a justifiable opinion on the hot hand, you have to think about these other issues.

III. Bias correction

—————–

The bias correction illustrates the difference relative to the bias in the null Bernoulli model. This is a *conservative* correction, the bias is actually worse if a player actually has the hot hand, please see the comments of Jonathan Weinstein above, and the responses.

Thanks again

Thanks for the really fun read.

Here is a suggested heuristic to address the issue. Looking at equation (3), and approximating all the exponentials as 0, the measured frequency will be n/(n-1) * (p-1/n). So, rule of thumb: If you measure q, adjust to (1-1/n)*q + 1/n.
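As a quick sanity check on the algebra of this rule of thumb, here is a tiny Python sketch. Both functions simply transcribe the two expressions in the comment above; the names are mine.

```python
from fractions import Fraction

def approx_measured(p, n):
    """Approximate measured frequency n/(n-1) * (p - 1/n), exponentials dropped."""
    return Fraction(n, n - 1) * (p - Fraction(1, n))

def adjust(q, n):
    """Rule-of-thumb correction (1 - 1/n)*q + 1/n applied to a measured frequency q."""
    return (1 - Fraction(1, n)) * q + Fraction(1, n)

q = approx_measured(Fraction(1, 2), 100)
print(q, adjust(q, 100))  # 49/99 1/2
```

The adjustment is just the inverse of the approximation, so it inherits whatever error comes from setting the exponential terms to 0, which matters most for short sequences.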

Perl script – What am i doing wrong?

    my $rep = 1e6;
    my $n = 4;
    my @result;
    my $summary_result = 0;
    my $denom = 0;
    for (my $j=0; $j<$rep; $j++) {
        # simulate $n fair coin flips
        for (my $i=0; $i<$n; $i++) {
            if (rand() >= 0.5) {
                $result[$i] = 1;
            }
            else {
                $result[$i] = 0;
            }
        }
        # count heads that follow a head, pooled over all replications
        for (my $i=0; $i<$n; $i++) {
            if ($i != $n-1) {
                if ($result[$i] == 1) {
                    if ($result[$i+1] == 1) {
                        $summary_result++;
                    }
                    $denom++;
                }
            }
        }
    }
    print "Total count: ".$summary_result/($denom)."\n";

Hi Sid,

I think the point is that what you’ve coded is not the estimator that has been used in some previous studies of the hot hand.

Instead, the estimator that has been used calculates that ratio separately for each replication (player) and then averages those ratios.

And if I may try to express it in my own LaTeX, I think that the present authors point out that:

$\frac{\sum_{i=1}^m \sum_{j=1}^{n_i} X_{ij}}{\sum_{i=1}^m \sum_{j=1}^{n_i} Y_{ij}} > \frac{1}{m}\sum_{i=1}^m \frac{\sum_{j=1}^{n_i} X_{ij}}{\sum_{j=1}^{n_i} Y_{ij}}$


JD

Oops. I just copied and pasted the html generated at https://www.codecogs.com/latex/eqneditor.php. That didn’t work.

Perhaps the url will be better:

https://latex.codecogs.com/gif.latex?\frac{\sum_{i=1}^m&space;\sum_{j=1}^{n_j}&space;X_{ij}}{\sum_{i=1}^m&space;\sum_{j=1}^{n_j}&space;Y_{ij}}&space;>&space;\frac{1}{m}\sum_{i=1}^m&space;\frac{\sum_{j=1}^{n_j}&space;X_{ij}}{\sum_{j=1}^{n_j}&space;Y_{ij}}

In their July 6 article, Miller and Sanjurjo assert that a way to determine the probability of a heads following a heads in a fixed sequence, you may calculate the proportion of times a head is followed by a head for each possible sequence and then compute the average proportion, giving each sequence an equal weighting on the grounds that each possible sequence is equally likely to occur. I agree that each possible sequence is equally likely to occur. But I assert that it is illegitimate to weight each sequence equally because some sequences have more chances for a head to follow a second head than others. I am making the same argument as Sam Peterson, B.D. McCullough, Zachary David, and Bill Jefferys. The point of this comment is to frame our common point in a way that will persuade skeptics.

Let us assume, as Miller and Sanjurjo do, that we are considering the 14 possible sequences of four flips containing at least one head in the first three flips. A head is followed by another head in only one of the six sequences (see below) that contain only one head that could be followed by another, making the probability of a head being followed by another 1/6 for this set of six sequences.

TTHT Heads follows heads 0 times.

THTT Heads follows heads 0 times.

HTTT Heads follows heads 0 times.

TTHH Heads follows heads 1 time.

THTH Heads follows heads 0 times.

HTTH Heads follows heads 0 times.

A head is followed by another head six times in the six sequences (see below) that contain two heads that could be followed by another head, making the probability of a head being followed by another 6/12 = ½ for this set of six sequences.

THHT Heads follows heads 1 time.

HTHT Heads follows heads 0 times.

HHTT Heads follows heads 1 time.

THHH Heads follows heads 2 times.

HTHH Heads follows heads 1 time.

HHTH Heads follows heads 1 time.

A head is followed by another head five times in the two sequences (see below) that contain three heads that could be followed by another head, making the probability of a head being followed by another 5/6 for this set of two sequences.

HHHT Heads follows heads 2 times.

HHHH Heads follows heads 3 times.

An unweighted average of the 14 sequences = [(6 X 1/6) + (6 X ½) + (2 X 5/6)]/14 = [17/3]/14 = .405, which is what Miller and Sanjurjo report.

A weighted average of the 14 sequences = [(1)(6 X 1/6) + (2)(6 X ½) + (3)(2 X 5/6)]/[(1 X 6) + (2 X 6) + (3 X 2)] = [1 + 6 + 5]/[6 + 12 + 6] = 12/24 = .50.

Using an unweighted average instead of a weighted average is the pattern of reasoning underlying the statistical artifact known as Simpson’s paradox. And as is the case with Simpson’s paradox, it leads to faulty conclusions about how the world works.
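The grouping above can be reproduced mechanically. A minimal Python sketch (variable names are mine) that buckets the 14 sequences by the number of heads in the first three flips and then forms both averages:

```python
from fractions import Fraction
from itertools import product

groups = {}  # heads in the first three flips -> list of (hits, trials) per sequence
for seq in product("HT", repeat=4):
    trials = seq[:3].count("H")  # heads that could be followed by another flip
    if trials == 0:
        continue                 # TTTT and TTTH are excluded
    hits = sum(1 for a, b in zip(seq, seq[1:]) if a == "H" and b == "H")
    groups.setdefault(trials, []).append((hits, trials))

# pooled frequency within each group: 1/6, 1/2 and 5/6
pooled_by_group = {
    t: Fraction(sum(h for h, _ in rows), sum(n for _, n in rows))
    for t, rows in groups.items()
}

all_rows = [row for rows in groups.values() for row in rows]
unweighted = sum(Fraction(h, n) for h, n in all_rows) / len(all_rows)  # each sequence counts once
weighted = Fraction(sum(h for h, _ in all_rows), sum(n for _, n in all_rows))  # each flip counts once

print(pooled_by_group)
print(unweighted, float(unweighted))  # 17/42, about 0.405
print(weighted)                       # 1/2
```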

Dear Jeff

thanks for the comment, I didn’t realize there wasn’t clarity on this point.

We do not assert that: “a way to determine the probability of a heads following a heads in a fixed sequence, you may calculate the proportion of times a head is followed by a head for each possible sequence and then compute the average proportion, giving each sequence an equal weighting on the grounds that each possible sequence is equally likely to occur.”

In fact we say in the July 6th paper that it is a mistaken intuition to treat this computation as an unbiased estimator of the true probability. It is certainly consistent, but biased, as we demonstrate. This mistake is *the* problem in the original hot hand study, and several of the subsequent studies.

In the introduction to the July 6th paper we discuss the weighting issue that you here describe. Weighting flips equally does eliminate the bias, but can only be used if you *know* that all coins are the same. This may be reasonable for coins, but is not reasonable for basketball players. If you weight all shots equally you will have another form of “Simpson’s Paradox”, which is a bias now towards finding the hot hand (selecting a shot that immediately follows 3 hits in a row creates a bias towards selecting a better player, and a better player is more likely to hit the next shot). If you put this in a regression context and add fixed player effects (to control for better players), you end up back where you started, with the finite sample bias.
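To see the pooling problem in numbers, here is a hypothetical two-player illustration in Python. The hit probabilities .4 and .6 are made up, each player shoots with a fixed probability (no hot hand at all), and both take equally many shots.

```python
def pooled_rate_after_streak(probs, k):
    """Long-run hit rate on shots that follow k straight hits when players with
    fixed hit probabilities `probs` take equally many shots and all shots are pooled."""
    streak_shots = sum(p ** k for p in probs)       # proportional to shots taken after k hits
    streak_hits = sum(p ** (k + 1) for p in probs)  # proportional to hits among those shots
    return streak_hits / streak_shots

players = [0.4, 0.6]                   # hypothetical consistent shooters
base_rate = sum(players) / len(players)
after_three = pooled_rate_after_streak(players, 3)
print(base_rate, round(after_three, 3))  # 0.5 0.554
```

Neither player is streaky, yet the pooled rate after three hits exceeds the pooled base rate, because a shot that follows three hits is more likely to belong to the better shooter.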

Something seems wrong here. There are 24 instances of a head in one of the first three flips of Table 1 in the paper. Of those, exactly 12 are followed by a head and exactly 12 are followed by a tail. Why would one average the results of the p(H|H) columns, etc?

PS, there are 8 instances of “double heads” in the first three columns. As above, 50% of them are followed by a head and 50% are followed by a tail. Similarly, there are two instances of triple heads and they are again followed by either a head or a tail, each with 50% probability.

How is this different from https://en.wikipedia.org/wiki/Penney%27s_game?

Alex:

No, it’s different.

So I’m off to the Casino and have mortgaged the house for gambling money. If you don’t hear back from me I won big and have a hot hand going. If you hear back from me I lost the ranch.

Hey, looks like the hot hand work inspired a new discovery about prime numbers! http://www.wired.com/2016/03/mathematicians-discovered-prime-conspiracy/