
Most successful blog post ever

Last month, I posted this on the sister blog at the Washington Post:

Under the subject line, “My best friend from 1st grade wrote this article,” Joshua Vogelstein pointed me to an article in the journal Marketing Science . . . written by Brett Gordon and Wesley Hartmann . . .

Then came some poli sci bla bla bla (it was some research on the effectiveness of political advertising), then I continued:

That’s pretty impressive! I’ve completely lost touch with my best friend from 1st grade. We moved away when I was in 2nd grade and we actually stayed in touch for a while—I recall a sleepover when we were 10 or 11 and my family was back in town for some reason . . . Hmmm, his name was Mike Adlin, that’s a pretty rare name, I bet I can look it up . . .

And then the emails came in. From Brett Gordon:

Dear Andrew,

I recently got an email from Josh Vogelstein, my best friend from 1st grade. Not only was I happy to hear from Josh, whom I hadn’t heard from in over 25 years, but I was also delighted to learn about your Washington Post article on my political advertising paper . . .

And from Mike Adlin:

Great article with an interesting lede, but a correction may be in order because my recollection is that the sleepover when we were 10 or 11 took place in your house . . .

And . . . I told the story to my sister who told me, a few days later:

Thanks again for encouraging me to look up my best friend from childhood. I friended her on Facebook, and we’ve reconnected and are catching each other up on the past 40+(!) years.


“Gallup gives up the horse race: As pollsters confront unprecedented obstacles, the biggest name in the business backs away”

A couple people pointed me to this news item. I don’t have anything particular to say here, but it seemed worth noting. End of an era and all that.

P.S. A colleague commented: “They’re not going to poll one of those things where we can tell if you get it wrong. Not good.”

I replied: Gallup is not a public utility. It was my impression that those horse-race polls were a loss leader for Gallup, a way for them to get their name in the newspaper and help sell the paid poll questions that make money for them. But if the publicity is negative, I can see why they might not want to do this anymore!

Mindset interventions are a scalable treatment for academic underachievement — or not?

Someone points me to this post by Scott Alexander, criticizing the work of psychology researcher Carol Dweck. Alexander looks carefully at an article, “Mindset Interventions Are A Scalable Treatment For Academic Underachievement,” by David Paunesku, Gregory Walton, Carissa Romero, Eric Smith, David Yeager, and Carol Dweck, and he finds the following:

Among ordinary students, the effect on the growth mindset group was completely indistinguishable from zero, and in fact they did nonsignificantly worse than the control group. This was the most basic test they performed, and it should have been the headline of the study. The study should have been titled “Growth Mindset Intervention Totally Fails To Affect GPA In Any Way”.


As Alexander reports, the authors “went to subgroup analysis.” But we all know the problems there. Garden of forking paths, anyone? Again, let me emphasize that I don’t think that preregistration is the best solution to the garden of forking paths; rather, I recommend multilevel modeling, looking at all interactions that might be of interest, not just pulling out a few dramatic comparisons.

At this point I’m in a cloud of cognitive dissonance. On one hand, I met Dweck once and she seemed very reasonable. On the other hand, Alexander’s criticisms do seem reasonable, and it doesn’t help that the article in question was published in . . . yup, you guessed it, Psychological Science.

So really I don’t know what to think.

But what really amazed me were two things:

1. I’d never heard of this guy and his blog has about a zillion comments. There clearly are large corners of the internet that I didn’t know about.

2. It was also striking that 100% of the commenters thought the study in question was B.S. I have no idea, but Dweck is a respected researcher. I don’t think she’s in Daryl Bem or Ellen Langer territory.

The person who sent the original message replied to me:

There have definitely been complaints from some corners about Dweck’s work not replicating, but also lots of followers doing other mindset experiments in her tradition.

Re that, these are the two posts preceding the analysis of that study:

The blog posts moderately often about bad stats, and Vox and the Atlantic link to him occasionally. Some other random stats related posts:

Alcoholics anonymous:

More alcoholism treatment and the false positive psychology paper:

The claim that false rape accusations are less than 1/30th as common as being struck by lightning:



The “perceptions of required ability by field” study:


Hmmm, I guess I should look into this in more detail. Maybe I’ll talk with some of my psychology colleagues. In any case, I’m still impressed by Alexander getting hundreds of comments on that post—he must be doing something right to be getting this sort of attention and careful reading!

P.S. More here.

P.P.S. The person who sent the above message informs me that an author of the paper said that they have had another successful replication since, and will be preregistering their next one on Open Science Framework. If their effect is real and works in the preregistered many-school replications then it will generate a huge amount of social value by helping millions of kids in school.

Anti-cheating robots

Paul Alper writes:

Surely you would like to comment on the amazing escalation in the anti-cheating tech world. I predict it will be followed by some clever software which makes it appear that the student enrolled is actually the one taking the exam. Reminiscent of the height of the cold war of counter weapons and counter-counter weapons. You may believe that the students sitting in your class are actually there but how long will it be before someone comes up with software producing holograms of your students? Or of you?

Alper adds:

I also came across this amazing link concerning Big Data and profiling via “Stoplight.”

Inasmuch as you and most of your blog followers are Bayesian, note the chilling way priors exist for a student’s grade:

The profile shows a red light, a green light, or a yellow light based on things like have you attempted to take the class before, what’s your overall level of performance, and do you fit any of the demographic categories related to risk.

And, it appears that said priors may never get overridden by subsequent data:

These profiles tend to follow students around, even after folks change how they approach school.

What struck me is how they decide who gets monitored. The first link, a news article by Natasha Singer, describes a pretty invasive system installed at Rutgers University:

Once her exam started, Ms. Chao said, a red warning band appeared on the computer screen indicating that Proctortrack was monitoring her computer and recording video of her. To constantly remind her that she was being watched, the program also showed a live image of her in miniature on her screen. . . .

As universities and colleges around the country expand their online course offerings, many administrators are introducing new technologies to deter cheating. The oversight, administrators say, is crucial to demonstrating the legitimacy of an online degree to students and their prospective employers.

I think what they’re really saying is that they don’t want to pay instructors and teaching assistants. Indeed:

Ms. Chao [a student interviewed for the news article] said administrators had since offered to provide her with a live human proctor for a fee of $40 per exam.

So, yeah, the robot is replacing the human teacher. Seems like a problem, though, in that the teaching assistant doesn’t just verify students for exams. The T.A. is also supposed to get to know the student a bit and offer some individualized instruction.

I was also amused by the Rutgers connection, as I was reminded of Frank Fischer, an elderly professor of political science who was caught copying big blocks of text (with minor modifications) from others’ writings without attribution. This all happened several years ago, but Fischer is still listed as a professor on the Rutgers website. It seems a bit unfair that the students there are subject to Proctortrack and the faculty can just do whatever they want.

P.S. Thinking more about it, I’m not “amused” by the Rutgers connection, I’m actually angry that they’re surveilling students in this way while tolerating plagiarism by a professor.

PMXStan: an R package to facilitate Bayesian PKPD modeling with Stan


From Yuan Xiong, David A James, Fei He, and Wenping Wang at Novartis.

Full version of the poster here.

Cognitive skills rising and falling

David Hogg writes:

I thought this was either interesting or bunk—using online games to infer how various kinds of cognitive intelligence vary with age. I thought it might be interesting to you on a number of levels. For one: Are there really categories of intelligence and can these map onto online games? For another: How do you make conclusions about the population as a whole from the population that participates in online games? They find age effects, but I bet there are age effects in the participation rates…

Hogg is referring to a press release by Anne Trafton describing an article by Joshua Hartshorne and Laura Germine, “When Does Cognitive Functioning Peak? The Asynchronous Rise and Fall of Different Cognitive Abilities Across the Life Span.” From the press release:

Scientists have long known that our ability to think quickly and recall information, also known as fluid intelligence, peaks around age 20 and then begins a slow decline. However, more recent findings, including a new study from neuroscientists at MIT and Massachusetts General Hospital (MGH), suggest that the real picture is much more complex.

The study, which appears in the journal Psychological Science [uh oh — ed.], finds that different components of fluid intelligence peak at different ages, some as late as age 40. . . .

I’ve heard that blogging peaks at the age of 50, actually.

Seriously, though, their general approach seems reasonable. I’d like to see some raw data, though. Also, for some of the tasks, the idea of “peak performance” seems to miss the point. Consider this figure from the Hartshorne and Germine paper:

[Figure from Hartshorne and Germine: performance as a function of age for different cognitive tasks]

First, it’s hard for me to believe this is showing raw data: The lines in figures a and b look too smooth. Second, in a graph such as figure b, there’s no peak to find. This seems like a limitation of the statistical approach: Before seeing the data, it would seem to make sense to look for a peak, but not so much after.

I also wonder whether many of these curves could be usefully categorized as the sum of two curves: a gradually increasing curve representing “experience” and a sharply decreasing curve representing “performance.”
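To make that concrete, here is a toy version of the decomposition in R; the functional forms, weights, and time constants are all invented for illustration, not estimated from anything:

# Toy decomposition: observed skill = rising "experience" + falling "performance."
# All curves and constants below are made up for illustration.
age <- 10:90
experience  <- 1 - exp(-age / 15)        # accumulates quickly, then saturates
performance <- exp(-(age - 10) / 40)     # declines steadily from youth
skill <- 0.5 * experience + 0.5 * performance

plot(age, skill, type = "l", xlab = "Age", ylab = "Combined skill")
# With these constants the combined curve peaks in the late teens; changing
# the weights or time constants moves the peak later or removes it entirely,
# consistent with some tasks showing no interior peak at all.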

There’s also the difficulty of disentangling age, period, and cohort effects. The authors do discuss this in their paper but I don’t think they resolve it.

In summary, I find this paper to be interesting. A few more like this, and Psychological Science might get a good reputation! I think more could be done here by modeling the data a bit. Lots to look into here.

On deck this week

Mon: Cognitive skills rising and falling

Tues: Anti-cheating robots

Wed: Mindset interventions are a scalable treatment for academic underachievement — or not?

Thurs: Most successful blog post ever

Fri: Political advertising update

Sat: Doomed to fail: A pre-registration site for parapsychology

Sun: Mars Missions are a Scam

Also, don’t forget what’s on deck for the rest of the year.

Flamebait: “Mathiness” in economics and political science


Political scientist Brian Silver points me to this post by economist Paul Romer, who writes:

The style that I [Romer] am calling mathiness lets academic politics masquerade as science. Like mathematical theory, mathiness uses a mixture of words and symbols, but instead of making tight links, it leaves ample room for slippage between statements in natural versus formal language and between statements with theoretical as opposed to empirical content.

Also some thoughtful discussion by Leopoldo Fergusson, who writes:

In empirical work there are phenomena akin to mathiness, and similar risks. Mathiness stems from certain obsession, healthy to some extent, with formal economic analysis. Similarly, in empirical work many risks arise from a healthy concern about being more rigorous when analyzing data . . .

Economists (social scientists in general) obsessed with identifying the causal effect (yes, it is redundant and yet we love it) can fall into the trap of studying comparatively minor problems . . .

In his (otherwise great) article on writing advice for PhD students, John Cochrane asks: “What are the three most important things for empirical work?” His response: “Identification, Identification, Identification”.

Wrong. The most important thing, always, is that we tackle an interesting question. . . .

Regarding the general problem of “mathiness” serving as a deterrent to research communication in economics, this is an interesting point, especially in that in many ways political science has gone in the opposite direction. Back when I was getting my Ph.D., there were not many “political methodologists,” and there was a large overlap with the formal theorists. Game theory ruled, and the people who were considered the top methodologists were aping econometricians. But the field was young and malleable enough that things opened up: “formal theory” became a bit of a backwater (at least from my perspective) and statistical modeling and graphics became more popular. So, sure, math is cool, but it’s a rare work of political science that uses math to exclude dissenters.

Also I was amused by Romer’s earlier post, “Ed Prescott is No Robert Solow, No Gary Becker.” As far as I can tell, Gary Becker was no Gary Becker. As for Solow, I only saw him once, in a talk at MIT 30 years ago where he anti-impressed me by making an offhand swipe at how he would cut funding for Amtrak—I guess he thought all those highways were just free.

Silver replied with some background of his own:

I entered grad school in 1965 and started out as a “Russian area specialist.” That my dissertation was largely a quantitative study of ethnic assimilation using census data was very different from the norm that was established at the major Russian area centers at Harvard and Columbia as well as the significant ones elsewhere. When I applied for a Foreign Area Fellowship as well as a year of study abroad through IREX, the interview committee asked me about my thesis. I told them I was studying ethnic assimilation by minority nationalities in the Soviet Union. The immediate question: “WHICH nationality?” My answer: “all of them.” It shocked them that anybody could try to do that! (I got the fellowships.)

Only when the Soviet system began to fall apart did this subfield begin to draw a lot of young scholars into it who applied a wide array of methods, including quantitative, to study the post-Soviet transition. About a third of my research was essentially “demographic,” and not obviously political. Today there’s practically nobody in the US doing this work concerning the post-Soviet region. There is, however, something of a normal demographic science now within Russia — but one that treads very carefully and doesn’t deal with some of the issues that were among the foci of my research (language and ethnic identity change, bilingual education policy, etc.).

For the most part the “comparativists” at Wisconsin—the faculty—were qualitatively oriented. But back in 1965 we got to cut our data analysis teeth in the introductory mass political behavior course by analyzing the Almond-Verba data, which had just been released. So some of us comparativists learned to do quantitative data analysis—and multi-country research. Nobody taught formal theory there at the time. When I asked one of the comparative faculty why this was so, he quickly responded, “We don’t believe in it.” But they did believe in data analysis, and so some of us comparativists got decent training even in political science, and a few (e.g., Doug Hibbs, who was in my UW cohort) took econometrics from Goldberger.

Comparing Waic (or loo, or any other predictive error measure)

Ed Green writes:

I have fitted 5 models in Stan and computed WAIC and its standard error for each. The standard errors are all roughly the same (all between 209 and 213). If WAIC_1 is within one standard error (of WAIC_1) of WAIC_2, is it fair to say that WAIC is inconclusive?

My reply:

No, you want to compare directly; see section 5.2 of this paper by Aki, Jonah, and me.

For those of you who are too lazy to click over and read the paper, the idea is that Waic and loo are computed for each data point and then added up; thus when you are comparing two models, you want to compute the difference for each data point and only then compute the standard error. That is, the scenario is a paired comparison rather than a difference between two groups.

This can matter in computing the standard error because the pointwise components of predictive error can be highly correlated when comparing the two models, in which case the correct standard error will be much lower than the standard error that would be naively obtained by combining the standard error of the separate Waic or loo calculations for the two models.
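Here’s what that looks like in code—a minimal R sketch, assuming you’ve already extracted S x N pointwise log-likelihood matrices log_lik1 and log_lik2 from the two fitted models (e.g., with extract_log_lik() from the loo package):

# Paired comparison of two models via their pointwise WAIC contributions.
library(loo)

pw1 <- waic(log_lik1)$pointwise[, "elpd_waic"]   # N pointwise contributions
pw2 <- waic(log_lik2)$pointwise[, "elpd_waic"]

d <- pw2 - pw1                          # difference at each data point
elpd_diff <- sum(d)                     # total difference between the models
se_diff   <- sqrt(length(d)) * sd(d)    # paired standard error of the difference

# The naive unpaired SE, sqrt(se1^2 + se2^2), ignores the correlation between
# the pointwise terms and will typically be much too large.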

In our paper we give the example of fitting two models to the arsenic well-switching data (which you might recall from chapter 5 of ARM):

[Figure from the paper: pointwise comparison of the two models fit to the arsenic well-switching data]

There are certain points which neither model fits well (for example, people living in households that are high in arsenic and close to neighbors with safe wells but who still say they would not switch wells), and when comparing the fit of two models it’s important to do it pointwise, otherwise you’ll overstate your uncertainty in the difference.

And all this is reminding me that we’d like to add an Anova-like feature for comparing multiple models; in that paper we present methods of computing Waic or loo for one model, or comparing two models, so we should really also present the general comparison of multiple model fits.

Stan PK/PD Tutorial at the American Conference on Pharmacometrics, 8 Oct 2015

Bill Gillespie, of Metrum, is giving a tutorial next week at ACoP:

This is super cool for us, because Bill’s not one of our core developers and has created this tutorial without the core development team’s help. Having said that, we’ve learned a lot from Bill and colleagues on our mailing lists as we were designing ODE solvers for Stan (an ongoing issue—see below for future plans).

Bill’s tutorial is up against a 2-day Monolix tutorial and a 2-day tutorial on R by Devin Pastoor, who’s also been active on our mailing lists recently.

Why Stan for PK/PD?

In case you’re wondering why people would use Stan for this instead of something more specialized like Monolix or NONMEM, it’s because of the modeling flexibility provided by the Stan language and the effectiveness of NUTS for MCMC. So far, though, we’re in the hole in not having a stiff ODE solver in place. Or a good NONMEM-like event data language on top.

Maybe Bill will jump in with some other motivations.

What’s in Store for Stan’s ODE Solvers?

There’s been lots of behind-the-scenes activity on our ODE solvers—we’re really just getting warmed up.

The next minor release of Stan (2.9) should fix the freezing issue when parameters wander into regions of parameter space that lead to stiff ODEs. We’ve also really sped up the Jacobian calculations: Michael Betancourt realized we were doing a lot of redundant computation, and he and I put in a patch to fix it. We should also allow user-defined control of absolute and relative tolerances.

Next, hopefully by Stan 2.10, we’ll have a stiff solver and maybe a way for users to supply analytic coupled-system gradients and Jacobians. Stay tuned. These new designs are largely being guided by Sebastian Weber and Wenping Wang at Novartis. And of course, by Michael Betancourt working out all the math, and by Daniel, Michael, and me working out the code with Sebastian’s and Wenping’s input.

We also need to evaluate how well variational inference works for ODE problems. Our early trials are very promising. Then we could replace the max marginal likelihood approach of NONMEM with a very speedy variational inference mechanism allowing much more general models.

There’s more in the works, but the above are the top of our to-do list.

Solution to Stan Puzzle 1: Inferring Ability from Streaks

If you missed it the first time around, here’s a link to:

First, a hat-tip to Mike, who posted the correct answer as a comment. So as not to spoil the surprise for everyone else, Michael Betancourt (a different Mike) emailed me the answer right away (as he always does for math problems—Michael’s literally amazing).

Although I formulated it to myself as “How do I code this in Stan?”, it turns out there’s an analytic solution. Here’s how I worked through it (after about as many false starts as the others who posted on the list).

Michael Betancourt also analyzed the process qua process; maybe he’ll elaborate by editing this post below or in comments.


If the observed data are streaks y = (y_1, \ldots, y_N), with streak lengths y_n \geq 1, the underlying sequence of successes and failures must match the following regular expression

z = 0^{*} \, 1^{y_1} \, 0^{+} \, 1^{y_2}  \cdots 0^{+} \, 1^{y_N} \, 0,

where a^n is n repetitions of a, a^* is zero or more repetitions of a, and a^+ is one or more repetitions of a. Because sequence concatenation is associative, 0^+ = 0 \, 0^*, and y_n \geq 1, the above can be regrouped as

z = (0^* \, 1) \, 1^{y_1 - 1} \, 0 \, (0^* \, 1) \, 1^{y_2 - 1} \cdots (0^* \, 1) \, 1^{y_N - 1} \, 0

Given the way the streak data is generated, the probability of generating the subsequent misses after the first and then the first made shot, namely (0^* \, 1), is equivalent to the sum of the probabilities of generating 1 or generating 0\,1 or 0\,0\,1 or …, which conveniently reduces to 1, because it covers all the possibilities for observing zero or more failures, then a single success. Many people were getting at that intuition in the comments.

Thus the probability of z, marginalizing over the unobserved 0^* sequences, reduces to the probability of generating the 1^{y_n - 1} terms and the required inter-streak 0 terms. With a \theta chance of success, the likelihood reduces to

\displaystyle p(y \, | \, \theta) \propto \prod_{n=1}^N \left( \theta^{y_n - 1} \times (1 - \theta)\right)

\displaystyle \mbox{ } \ \ \ = \theta^{\sum_{n=1}^N (y_n - 1)} \times (1 - \theta)^N

\displaystyle \mbox{ } \ \ \ = \theta^{\mbox{\footnotesize sum}(y) - N} \times (1 - \theta)^N

\displaystyle \mbox{ } \ \ \ \propto \mbox{Binomial}(\mbox{sum}(y) - N \, | \, \mbox{sum}(y), \theta)

With a uniform prior on \theta, which is equivalent to a \mbox{Beta}(\theta \, | \, 1,1) prior, the beta-binomial conjugacy provides the following analytic solution for the posterior.

\displaystyle p(\theta \, | \, y) = \mbox{Beta}(\theta \, | \, \mbox{sum}(y) - N + 1, N + 1)

Loss of Information

How much information is “leaking” when we reduce the underlying sequence z to the streaks y? The streak-based posterior effectively counts \mbox{sum}(y) - N successes and N failures, whereas the full sequence z contains all \mbox{sum}(y) successes and at least N failures, and would lead to a \mbox{Beta}(\alpha,\beta) posterior with \alpha = \mbox{sum}(y) + 1 and \beta \geq N + 1.

Stan Code

So as not to disappoint those who wanted to see a Stan solution, here’s the MCMC version which lays it out as a model with parameters to estimate.

data {
  int<lower=0> N;                  // number of streaks
  int<lower=1> y[N];               // streak lengths
}
parameters {
  real<lower=0, upper=1> theta;    // chance of success
}
model {
  sum(y) - N ~ binomial(sum(y), theta);
}

This model can be integrated into larger models based on theta, but is not much use in and of itself.

In this case, the analytic solution lets you generate draws directly from the posterior in the generated quantities block using Monte Carlo (without the Markov chain bit), which is much more efficient than MCMC.

data {
  int<lower=0> N;                  // number of streaks
  int<lower=1> y[N];               // streak lengths
}
generated quantities {
  real<lower=0, upper=1> theta;
  theta <- beta_rng(sum(y) - N + 1, N + 1);   // direct draw from the posterior
}

But there's unlikely to be a need to do even straight-up Monte Carlo when you have an analytic posterior.

Syllabus for my course on Communicating Data and Statistics

Actually the course is called Statistical Communication and Graphics, but I was griping about how few students were taking the class, and someone suggested the title Communicating Data and Statistics as being a bit more appealing. So I’ll go with that for now.

I love love love this class and everything that’s come from it (including statistics diaries and ShinyStan).

Here’s the syllabus. It’s full of fun reading and great activities, in and outside of class. The only thing missing is the jitts, but I like to keep them as a surprise. So if you want to teach this class—and I think you should, indeed I think this course should be taught everywhere and it should be a standard part of the statistics and quantitative social science curriculum—you’ll just have to write your own jitts. Otherwise the course pretty much teaches itself. And remember, with your guest visitors, keep the conversations short and focused. Long rambling discussions are fun, and they’re easy on the instructor, but ultimately you want to spend lots of class time directly on feedback on student work.

Now for the next 90 seconds I’d like you to talk with your neighbor and come up with a question to ask me.

OK, start yapping!

Jason Chaffetz is the Garo Yepremian of the U.S. House of Representatives, and I don’t mean that in a good way.

Mike Spagat and Paul Alper point us to this truly immoral bit of graphical manipulation, courtesy of U.S. Representative Jason Chaffetz.

Here’s the evil graph:


Here’s the correction:


From the news article by Zachary Roth:

As part of a contentious back-and-forth in which Chaffetz repeatedly cut off [Planned Parenthood president Cecile] Richards, the congressman displayed a slide with a graph that looked like this [top graph above]. When Richards said she’d never seen it before, Chaffetz replied: “It comes straight from your annual reports.”

Moments later, Richards shot back: “My lawyers just informed me that the source of this information is Americans United for Life, an anti-abortion group. I would check your source.”

But the source wasn’t the only problem. A cursory look at the graph, which comes from an Americans United For Life report about Planned Parenthood centers released in June, makes it seem like in 2006, Planned Parenthood performed far more cancer screening and prevention services than abortions, but that by around 2010 it performed an equal number of both, and by 2013 it performed far more abortion services than anti-cancer services.

The issue is important because as part of their effort to defund Planned Parenthood, Republicans have portrayed it as primarily an abortion provider, while the group’s defenders have said it mostly performs other women’s health services, like cancer screenings.

But look at the actual numbers in the graph. They show that in 2006, Planned Parenthood performed 2,007,371 anti-cancer services and 289,750 abortions. By 2013, the gap had closed slightly, but the group still performed many more anti-cancer services than abortions, 935,573 to 327,000.
Why does it seem otherwise? Because the “graph” has no y axis, which allows its creators to simply plot the results wherever they choose in order to create a compelling visual effect. That’s how 327,000 is made to look like a much larger number than 935,573.

What the slide actually shows, of course, is that the number of abortions performed by Planned Parenthood rose very modestly between 2006 and 2013, while the number of anti-cancer services it performed did indeed fall by more than half. But Richards said some of the services, like pap smears, dropped in frequency because of changing medical standards about who should be screened and how often. Displaying that information on an actual graph would show the line for abortions rising very slightly over the 7-year time period, and the line for anti-cancer services dropping, but always remaining far above the line for abortions.

I looked up Jason Chaffetz on Wikipedia and found this:

[Chaffetz] was the starting placekicker on the BYU football team in 1988 and 1989. He still holds the BYU individual records for most extra points attempted in a game, most extra points made in a game, and most consecutive extra points made in a game.

A placekicker of all people should understand the principle of division of labor. If you want to make a graph, get an expert to do it. Don’t use a double y-axis and, while you’re at it, don’t tell untruths about where you got it from.

Just kick the damn ball, and leave the passing to the quarterback, OK?

P.S. The bottom graph above is much better than the top graph but it’s still not perfect. The axis labels are too tiny to be readable, and there are way too many numbers on the y-axis. Tick marks at 0, 500,000, 1 million, 1.5 million, etc., would do just fine. Also those heavy black and red lines on the left, bottom, and right of the plot are bad news.
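For what it’s worth, here’s a minimal R sketch of a plot along those lines, using only the 2006 and 2013 endpoints quoted in the news article (a real version would use all the annual figures):

# Both series on a single honest y-axis starting at zero, with a few round
# tick marks. Only the two endpoint years are given in the article.
years     <- c(2006, 2013)
cancer    <- c(2007371, 935573)   # cancer screening and prevention services
abortions <- c(289750, 327000)    # abortions

plot(years, cancer, type = "b", ylim = c(0, 2.1e6),
     xlab = "Year", ylab = "Services per year", yaxt = "n")
lines(years, abortions, type = "b", col = "red")
axis(2, at = c(0, 5e5, 1e6, 1.5e6, 2e6),
     labels = c("0", "500,000", "1 million", "1.5 million", "2 million"))
legend("topright", c("Cancer screening and prevention", "Abortions"),
       col = c("black", "red"), lty = 1)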

Hot hand explanation again


I guess people really do read the Wall Street Journal . . .

Edward Adelman sent me the above clipping and calculation and writes:

What am I missing? I do not see the 60%.

And Richard Rasiej sends me a longer note making the same point:

So here I am, teaching another statistics class, this time at Santa Monica College, and reading the Wall Street Journal before heading in to school. Not surprisingly, extremely intrigued by the article about the ‘Hot Hand’.

I know you were quoted in it, although it looks like some of the quote got cut off somehow.

Anyway, I was very confused by the piece and did a little pencil and paper work, and am not quite sure I buy it. Admittedly, I did not try to find the original paper or commentary, but based my doodling on the description in the article.

As I understand it, we look at the 14 out of 16 possible sequences of 4 coin tosses which do not begin TTT (in order to have at least one H).

Then, whenever there is an H in the subsequence of the first three tosses, we look at whether or not it is followed by a T.

So I wrote all 14 down, and for each one looked at how often a T follows an H.

Here’s what I found:

HHHH: 3 opportunities, 0 Ts
HHHT: 3 opportunities, 1 T
HHTH: 2 opportunities, 1 T
HHTT: 2 opportunities, 1 T
HTHH: 2 opportunities, 1 T
HTHT: 2 opportunities, 2 Ts
HTTH: 1 opportunity, 1 T
HTTT: 1 opportunity, 1 T
THHH: 2 opportunities, 0 Ts
THHT: 2 opportunities, 1 T
THTH: 1 opportunity, 1 T
THTT: 1 opportunity, 1 T
TTHH: 1 opportunity, 0 Ts
TTHT: 1 opportunity, 1 T

This seems to total up to 24 opportunities to see whether or not an H in the first three positions is followed by a T, and a total of 12 Ts, for 50%. So I don’t see where the 60% mentioned in the article comes from.

Also, note that the coin toss in the fourth position is irrelevant to this count, since the simulation is for only four tosses – so we never know what happens on the “fifth” toss.

Besides not understanding where the 60% comes from, how much of this (what is alleged in the article) is an artifact of the length of the sequence of tosses? Rather than restricting ourselves to sequences of length 4, should not the analysis look at sequences of all lengths? That is, start with tosses of length 2, 3, 4, 5, 6, etc., measure the frequency with which a T follows an H in the possible subsequences of lengths 1, 2, 3, 4, 5, etc., and then try to determine if the sequence of proportions converges?

My reply to both:

You get the non-50% number by first computing the percentage for each scenario, then averaging the 14 scenarios equally. If you weight by the number of opportunities you indeed get the correct answer of 50% here, but the point is that when the hot hand has traditionally been estimated, the estimation has been done by taking the empirical difference for each player, and then taking a simple (not weighted) average across players, hence the bias, as explained and explored in several recent papers by Josh Miller and Adam Sanjurjo.
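If you want to check the arithmetic yourself, here’s a quick R enumeration of all 16 length-4 sequences:

# For each sequence, compute the proportion of H's in the first three flips
# that are followed by a T, then average those proportions across sequences.
seqs <- expand.grid(rep(list(c("H", "T")), 4), stringsAsFactors = FALSE)

prop_T_after_H <- apply(seqs, 1, function(s) {
  idx <- which(s[1:3] == "H")         # opportunities: H in the first three flips
  if (length(idx) == 0) return(NA)    # TTTT and TTTH have no opportunities
  mean(s[idx + 1] == "T")
})

mean(prop_T_after_H, na.rm = TRUE)    # unweighted average over 14 sequences: ~0.595
# Weighting by the number of opportunities instead recovers exactly 0.50.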

More here.

P.S. Miller points out that, for real shooting data (as opposed to coin flips) there is no simple weighted averaging that would give you the correct hot-hand estimate, as such an average would not correct for differences between players. That’s why I think the ultimate way to go will be to fit a Bayesian analysis using Stan. We’ve done some steps toward this but our model is still in a simple and preliminary stage.

An unconvincing analysis claiming to debunk the health benefits of moderate drinking


Daniel Lakeland writes:

This study on alcohol consumption (by Craig Knott, Ngaire Coombs, Emmanuel Stamatakis, and Jane Biddulph) was written up in the BMJ editorials as “Alcohol’s Evaporating health benefits.”

They conveniently show their data in a table, so that they can avoid graphing a “J” shape that they constantly allude to being wrong… But their own models (see the links under table 3 and table 4) show that the hazard ratio relative to the “never drinker” category for males 50-64 years old declines and then goes up for the “heavy drinker” category, like… I dunno, kind of a J shape??? yes.. yes definitely like that.

Ok, so how about table 4 for women… aged 50-64 years:

Ok, decline with minimum at around 10-15 units /wk… increasing for the heavier drinkers….

Ok, model 2, model 2 is going to do it right??? well… kind of constant, but a definite low point at 10-15 units/wk….

Ok, looking across all the different models in all the different tables…. all the different age groups… yep… pretty much every group has lowest risk in the same range, around 10-15 units per week, or 1-3 units per day or whatever way you want to look at it… pretty much just exactly where the traditional J shape puts things… one or two drinks per day 3 to 5 days a week or something like that.

Nevertheless… the editorial claims: “if there is any beneficial dose-response relation, it is limited to women aged 65 or more — and even that association is at best modest and likely to be explained by selection bias.”

and “for a range of reasons, including confounding and selection bias in the papers generally cited, even low level alcohol consumption is unlikely to protect drinkers from cardiovascular disease”…

so… do a study… don’t like the results? Bury them in a table and just claim you found the opposite of what you found? Or if you don’t claim it yourself, maybe at least get your friend to write an editorial or something.

The editorial that Lakeland cites has an explicit political agenda. But setting this aside, I think the larger point is that the effects of any drug will depend on its context. My guess is that the authors of the research paper and the editorial aren’t concerned about moderate drinking. It’s more that they’re worried that the news about the health benefits of moderate drinking will be used as an encouragement for people to drink heavily.

This contextual effect can arise at the individual or societal level. A person might hear about alcohol being good for you and then lapse into alcoholism—at least, that’s the concern. Or, at the national level, the news about the benefits of moderate drinking will get in the way of public health efforts to combat problem drinking.

I can share my own experiences here. A few years ago I was talking with my cardiologist and he asked me about my alcohol consumption. I said I drank rarely, probably less than one glass of wine a week. He said I should drink a few times a week, that it would be good for me. Then when I was in France getting a health checkup for my employment, the doctor asked me how often I drank alcohol. I said I had a glass or two of wine a few times a week, cos it was recommended by my cardiologist. She told me not to drink so much, it was bad for my liver. I asked her about the benefits to my heart and she said, no, don’t believe that.

This is just n=2, of course, but both recommendations make a certain sense: my cardiologist’s, given that I rarely drank, and the French doctor’s, given that, for all she knew, maybe I was an alcoholic who was just trying to justify my addiction.

In any case, I agree with Lakeland that it’s better to report results clearly and graphically rather than contorting the data to support some particular claim.

How to use lasso etc. in political science?

Tom Swartz writes:

I am a graduate student at Oxford with a background in economics and on the side am teaching myself more statistics and machine learning. I’ve been following your blog for some time and recently came across this post on lasso. In particular, the more I read about the machine learning community, the more I realize how none of this work is incorporated into the majority of economics research.

I was wondering if you could give some advice on how to use techniques such as lasso, which retain a certain degree of interpretability, in a situation like economics or political science? Given that the goal is largely to describe, rather than just optimize an un-interpretable model, how would you use such techniques in a way that reduces variance and point estimate overestimation while at the same time interpreting the particular coefficients in a meaningful way?

My reply:

I don’t really buy the idea that lasso gives more interpretability; I think of it as a way to regularize inferences from regression. In most settings I actually find it difficult to directly interpret more than one coefficient in a regression model. Think of it this way: the coefficient of some predictor x represents a comparison of two items that differ in x while being identical in all other predictors of the model. Typically this only has a clear interpretation if x is the “last” predictor in the model, so that all the other predictors come “before” it.

Regularization is great, I just think the way to think of lasso is as a way of regularizing a regression model. The model is what’s important. What’s good about lasso and other regularizers is that they allow you to fit a regression model with lots of predictors. But the interpretability, or lack thereof, is a property of the regression model, not of the regularization.
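To make the “regularization, not interpretability” point concrete, here’s a minimal sketch using the glmnet package, on simulated data invented just for illustration:

# Lasso as a regularizer for a regression with many predictors.
library(glmnet)

set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)          # many predictors
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)    # only two of them actually matter

fit <- cv.glmnet(X, y, alpha = 1)        # alpha = 1 is the lasso penalty
coef(fit, s = "lambda.1se")              # regularized coefficient estimates

The shrinkage stabilizes the estimates; whether any single coefficient is interpretable remains a property of the underlying regression model.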

RStan 2.8.0 is on CRAN!

RStan 2.8.0 is available on CRAN!

Installation directions can be found on RStan’s Wiki.

And since I know a lot of people aren’t patient enough to read through installation instructions, the most important parts are:

  1. You (still) need a C++ toolchain.
    Mac: XCode. Make sure to open it once after download to accept the license.
    Windows: Rtools. Make sure the binaries are on your Windows path.
    Linux: If you don’t have a C++ toolchain in Linux, you should probably rethink your operating system.
  2. From within R:
    > install.packages("rstan", dependencies = TRUE)

    I don’t know why you need dependencies, but maybe the RStan gurus can explain.

  3. Restart R before using RStan. Please.
    This is another thing that I don’t understand, but it does solve a lot of problems.

As always, if you run into trouble, let us know on the stan-users mailing list.

Fitting models with discrete parameters in Stan

This book, “Bayesian Cognitive Modeling: A Practical Course,” by Michael Lee and E. J. Wagenmakers, has a bunch of examples of Stan models with discrete parameters—mixture models of various sorts—with Stan code written by Martin Smira! It’s a good complement to the Finite Mixtures chapter in the Stan manual.
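The basic trick in all of those models is to marginalize the discrete indicators out of the likelihood rather than sample them. Here’s the idea in R (a toy two-component normal mixture of my own, not an example from the book), computing the marginal log-likelihood stably with log-sum-exp:

# log p(y) where each y_i comes from component 1 with probability lambda:
# p(y_i) = lambda * Normal(y_i | mu1, sigma) + (1 - lambda) * Normal(y_i | mu2, sigma)
log_sum_exp <- function(a, b) {
  m <- pmax(a, b)
  m + log(exp(a - m) + exp(b - m))
}

mixture_loglik <- function(y, lambda, mu1, mu2, sigma) {
  sum(log_sum_exp(log(lambda)     + dnorm(y, mu1, sigma, log = TRUE),
                  log(1 - lambda) + dnorm(y, mu2, sigma, log = TRUE)))
}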

On deck through the rest of 2015

There’s something for everyone! I had a lot of fun just copying the titles to make this list, as I’d already forgotten about a lot of this stuff. Here are the scheduled posts, in order through 31 Dec:

Fitting models with discrete parameters in Stan

How to use lasso etc. in political science?

An unconvincing analysis claiming to debunk the health benefits of moderate drinking

Tamiflu conflict of interest

Alleged data manipulation in NIH-funded Alzheimer’s study

Flamebait: “Mathiness” in economics and political science

Cognitive skills rising and falling

Anti-cheating robots

Mindset interventions are a scalable treatment for academic underachievement — or not?

Most successful blog post ever

Political advertising update

Doomed to fail: A pre-registration site for parapsychology

Mars Missions are a Scam

Ta-Nehisi Coates, David Brooks, and the “street code” of journalism

What do you learn from p=.05? This example from Carl Morris will blow your mind

Here’s a theoretical research project for you

Hierarchical logistic regression is easy in Stan

In answer to James Coyne’s question, no, I can’t make sense of this diagram.

In that article, they forgot to mention that Ludmerer is one of the 5 doctors in America who has no opinion on whether cigarette smoking contributes to lung cancer in human beings.

“Null hypothesis” = “A specific random number generator”

My webinar with Brad Efron this Wednesday

How to build trust in missing-data imputations?

Evaluating models with predictive accuracy

Using Stan to map cancer screening!

Why you can’t always use predictive performance to choose among models

Top 5 movies about scientists

“Modern Physics from an Elementary Point of View”

Super-topical NBA post!!!

Characterizing the spatial structure of defensive skill in professional basketball

The original Hot Hand preprint!

Exaggeration of effects of fan distraction in NCAA basketball

What do I say when I don’t have much to say?

Cauchy priors for logistic regression coefficients

Where the fat people at?

“Priming Effects Replicate Just Fine, Thanks”

My job here is done

The tabloids strike again

Econometrics: Instrument locally, extrapolate globally

I wish Napoleon Bonaparte had never been born

DataMeetsViz workshop

How to parameterize hyperpriors in hierarchical models?

“Don’t get me started on ‘cut’”

Taleb’s Precautionary Principle: Should we be scared of GMOs?

Pass the popcorn

Who falls for the education reform hype?

Inference from an intervention with many outcomes, not using “statistical significance”

“Should Prison Sentences Be Based On Crimes That Haven’t Been Committed Yet?”

At this point, I’m primed to be skeptical about claims of social priming

He wants to teach himself some statistics

I like the Monkey Cage

What years of the economy influence the next presidential election?

Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.

Some people are so easy to contact and some people aren’t.

A statistical approach to quadrature

“The Bayesian Second Law of Thermodynamics”

Jökull Snæbjarnarson writes . . .

Bayesian inference for network links

0.05 is a joke

Statistics diaries and comparable assignments in other fields

“A pure Bayesian or pure non-Bayesian is not forever doomed to use out-of-date methods, but at any given time the purist will be missing some of the most effective current techniques.”

7 tips for work-life balance

A missed opportunity?

How to analyze hierarchical survey data with post-stratification?

My quick answer is that I would analyze all 10 outcomes using a multilevel model.

Rogue historian just can’t stop copying

Questions about data transplanted in kidney study

Party like it’s 2005

Cannabis/IQ follow-up: Same old story

Waic and cross-validation for survival models?

Hierarchical modeling when you have only 2 groups: I still think it’s a good idea, you just need an informative prior on the group-level variation

I definitely wouldn’t frame it as “To determine if the time series has a change-point or not.” The time series, whatever it is, has a change point at every time. The question might be, “Is a change point necessary to model these data?” That’s a question I could get behind.

Actually, I’d just do full Bayes

“Baby Boomer” as all-purpose insult

Defining conditional probability

In defense of endless arguments

Bayesian decision analysis for the drug-approval process

Mars 1, This American Life 0

LaCour and Green 1, This American Life 0

What is a Republican?

“Perhaps the most reasonable explanation is that no one watched the video or did the textbook reading . . .”

A Replication in Economics: Does “Genetic Distance” to the US Predict Development?

Death of a statistician

Rapid post-publication review

He’s skeptical about Neuroskeptic’s skepticism

R sucks

“Am I doing myself a disservice by being too idealistic in a corporate environment?”

Gresham’s Law of experimental methods

Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

There are 6 ways to get fired from Johnson & Johnson: (1) theft, (2) sexual harassment, (3) running an experiment without a control group, (4) keeping a gambling addict away from the casino, (5) chapter 11 bankruptcy proceedings, and (6) not covering up records of side effects of a drug you’re marketing to kids

“The lifecycle of scholarly articles across fields of economic research”

My presentation at the Electronic Conference on Teaching Statistics

When the numbers differ in the third decimal place

Definitely got nothing to do with chess IV

As usual, I’ll occasionally bump posts for more topical material. And my cobloggers are free to intersperse their posts whenever.

The Final Bug, or, Please please please please please work this time!

I’ve been banging my head against this problem, on and off, for a couple months now. It’s an EP-like algorithm that a collaborator and I came up with for integrating external aggregate data into a Bayesian analysis. My colleague tried a simpler version on an example and it worked fine, then I’ve been playing around with a multivariate version and . . . it kinda works. At one point it was working ok, and then in writing up the algorithm I noticed some places where it could be improved, and then I did the improvements, and it was failing. I was getting extreme importance ratios and degenerate covariance matrices. Then I realized my algorithm wasn’t quite right, I was using the wrong factor in my EP computation so that it would not converge to what I wanted. So I fixed that. Then more problems. Etc etc. I tried going back to the simple version of the algorithm but it ran really poorly in my example. At this point I don’t know what I’m doing, I start playing desperately with the algorithm, pulling factors in and out of the importance weights, changing the distribution from which I was drawing the initial approx, etc etc etc. Can’t get it to work. Even when it doesn’t crash, I’m getting simulation efficiencies approaching zero.

Then I look at the code one more time. Damn! I was passing the arguments in the wrong order to my R function. OK, that’s the bug. I run it one more time . . . No! I was confused, the order of my arguments was just fine. So I’m still in the thick of it. Ugh.

P.S. On reading the comments I see there’s some confusion here. The problem is not simply: “I want to do X, I wrote code to do X, but it’s not working.” The problem is I’m doing research, I have a sense that this algorithm should work, but there could be something I’m missing. Bugs in the code are interacting with my incomplete understanding of the method that my colleagues and I are developing.

Also, I wrote the above post a couple of months ago. Since then we’ve made progress and I hope to post the paper soon.