
Flamebait: “Mathiness” in economics and political science


Political scientist Brian Silver points me to this post by economist Paul Romer, who writes:

The style that I [Romer] am calling mathiness lets academic politics masquerade as science. Like mathematical theory, mathiness uses a mixture of words and symbols, but instead of making tight links, it leaves ample room for slippage between statements in natural versus formal language and between statements with theoretical as opposed to empirical content.

Also some thoughtful discussion by Leopoldo Fergusson, who writes:

In empirical work there are phenomena akin to mathiness, and similar risks. Mathiness stems from certain obsession, healthy to some extent, with formal economic analysis. Similarly, in empirical work many risks arise from a healthy concern about being more rigorous when analyzing data . . .

Economists (social scientists in general) obsessed with identifying the causal effect (yes, it is redundant and yet we love it) can fall into the trap of studying comparatively minor problems . . .

In his (otherwise great) article on writing advice for PhD students, John Cochrane asks: “What are the three most important things for empirical work?” His response: “Identification, Identification, Identification”.

Wrong. The most important thing, always, is that we tackle an interesting question. . . .

Regarding the general problem of “mathiness” serving as a deterrent to research communication in economics, this is an interesting point, especially in that in many ways political science has gone in the opposite direction. Back when I was getting my Ph.D., there were not many “political methodologists,” and there was a large overlap with the formal theorists. Game theory ruled, and the people who were considered the top methodologists were aping econometricians. But the field was young and malleable enough that things opened up: “formal theory” became a bit of a backwater (at least from my perspective) and statistical modeling and graphics became more popular. So, sure, math is cool, but it’s a rare work of political science that uses math to exclude dissenters.

Also I was amused by Romer’s earlier post, “Ed Prescott is No Robert Solow, No Gary Becker.” As far as I can tell, Gary Becker was no Gary Becker. As for Solow, I only saw him once, in a talk at MIT 30 years ago where he anti-impressed me by making an offhand swipe at how he would cut funding for Amtrak—I guess he thought all those highways were just free.

Silver replied with some background of his own:

I entered grad school in 1965 and started out as a “Russian area specialist.” That my dissertation was largely a quantitative study of ethnic assimilation using census data was very different from the norm that was established at the major Russian area centers at Harvard and Columbia as well as the significant ones elsewhere. When I applied for a Foreign Area Fellowship as well as a year of study abroad through IREX, the interview committee asked me about my thesis. I told them I was studying ethnic assimilation by minority nationalities in the Soviet Union. The immediate question: “WHICH nationality?” My answer: “all of them.” It shocked them that anybody could try to do that! (I got the fellowships.)

Only when the Soviet system began to fall apart did this subfield begin to draw a lot of young scholars into it who applied a wide array of methods, including quantitative, to study the post-Soviet transition. About a third of my research was essentially “demographic,” and not obviously political. Today there’s practically nobody in US doing this work concerning the post-Soviet region. There is, however, something of a normal demographic science now within Russia — but one that treads very carefully and doesn’t deal with some of the issues that were among the foci of my research (language and ethnic identity change, bilingual education policy, etc.).

For the most part the “comparativists” at Wisconsin—the faculty—were qualitatively oriented. But back in 1965 we got to cut our data analysis teeth in the introductory mass political behavior course by analyzing the Almond-Verba data, which had just been released. So some of us comparativists learned to do quantitative data analysis—and multi-country research. Nobody taught formal theory there at the time. When I asked one of the comparative faculty why this was so, he quickly responded, “We don’t believe in it.” But they did believe in data analysis, and so some of us comparativists got decent training even in political science, and a few (e.g., Doug Hibbs, who was in my UW cohort) took econometrics from Goldberger.

Comparing Waic (or loo, or any other predictive error measure)

Ed Green writes:

I have fitted 5 models in Stan and computed WAIC and its standard error for each. The standard errors are all roughly the same (all between 209 and 213). If WAIC_1 is within one standard error (of WAIC_1) of WAIC_2, is it fair to say that WAIC is inconclusive?

My reply:

No, you want to compare directly; see section 5.2 of this paper by Aki, Jonah, and me.

For those of you who are too lazy to click over and read the paper, the idea is that Waic and loo are computed for each data point and then added up; thus when you are comparing two models, you want to compute the difference for each data point and only then compute the standard error. That is, the scenario is a paired comparison rather than a difference between two groups.

This can matter in computing the standard error because the pointwise components of predictive error can be highly correlated when comparing the two models, in which case the correct standard error will be much lower than the standard error that would be naively obtained by combining the standard error of the separate Waic or loo calculations for the two models.
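To make the paired comparison concrete, here’s a minimal Python sketch using only the standard library. The pointwise values are simulated and purely hypothetical, and the names (total_and_se, elpd_1, elpd_2) are made up for this illustration rather than taken from Stan or any package. It works on the log predictive density scale rather than the deviance scale of Waic, which changes the numbers by a constant factor but not the logic.

```python
import math
import random
import statistics

def total_and_se(pointwise):
    """Sum of pointwise contributions and the standard error of that sum."""
    n = len(pointwise)
    return sum(pointwise), math.sqrt(n * statistics.variance(pointwise))

# Hypothetical pointwise log predictive densities for two models fit to the
# same n data points; the shared term makes them highly correlated.
rng = random.Random(1)
n = 500
shared = [rng.gauss(-1.0, 1.0) for _ in range(n)]
elpd_1 = [s + rng.gauss(0.0, 0.1) for s in shared]
elpd_2 = [s + 0.02 + rng.gauss(0.0, 0.1) for s in shared]

# Naive approach: combine the two separate standard errors as if the
# models had been evaluated on independent data.
_, se_1 = total_and_se(elpd_1)
_, se_2 = total_and_se(elpd_2)
naive_se = math.sqrt(se_1**2 + se_2**2)

# Paired approach: take the difference for each data point first,
# and only then compute the standard error.
diff = [b - a for a, b in zip(elpd_1, elpd_2)]
diff_total, paired_se = total_and_se(diff)

print(f"difference = {diff_total:.1f}")
print(f"naive se = {naive_se:.1f}, paired se = {paired_se:.1f}")
```

In this made-up setup the two models mostly struggle with the same points, so the paired standard error comes out far smaller than the naive one.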

In our paper we give the example of fitting two models to the arsenic well-switching data (which you might recall from chapter 5 of ARM):

[Figure: screenshot from the paper; image not reproduced here]

There are certain points which neither model fits well (for example, people living in households that are high in arsenic and close to neighbors with safe wells but who still say they would not switch wells), and when comparing the fit of two models it’s important to do it pointwise, otherwise you’ll overstate your uncertainty in the difference.

And all this is reminding me that we’d like to add an Anova-like feature for comparing multiple models; in that paper we present methods of computing Waic or loo for one model, or comparing two models, so we should really also present the general comparison of multiple model fits.

Stan PK/PD Tutorial at the American Conference on Pharmacometrics, 8 Oct 2015

Bill Gillespie, of Metrum, is giving a tutorial next week at ACoP:

This is super cool for us, because Bill’s not one of our core developers and has created this tutorial without the core development team’s help. Having said that, we’ve learned a lot from Bill and colleagues on our mailing lists as we were designing ODE solvers for Stan (an ongoing issue—see below for future plans).

Bill’s tutorial is up against a 2-day Monolix tutorial and a 2-day tutorial on R by Devin Pastoor, who’s also been active on our mailing lists recently.

Why Stan for PK/PD?

In case you’re wondering why people would use Stan for this instead of something more specialized like Monolix or NONMEM, it’s because of the modeling flexibility provided by the Stan language and the effectiveness of NUTS for MCMC. So far, though, we’re in the hole in not having a stiff ODE solver in place. Or a good NONMEM-like event data language on top.

Maybe Bill will jump in with some other motivations.

What’s in Store for Stan’s ODE Solvers?

There’s been lots of behind-the-scenes activity on our ODE solvers; we’re really just getting warmed up.

The next minor release of Stan (2.9) should stop the freezing issue when parameters wander into regions of parameter space that lead to stiff ODEs. And we’ve really sped up the Jacobian calculations when Michael Betancourt realized we were doing a lot of redundant calculation and he and I put a patch in to fix it. We should also allow user-defined control of absolute and relative tolerances.

Next, hopefully by Stan 2.10, we’ll have a stiff solver and maybe a way for users to supply analytic coupled-system gradients and Jacobians. Stay tuned. These new designs are largely being guided by Sebastian Weber and Wenping Wang at Novartis. And of course, by Michael Betancourt working out all the math and Daniel, Michael, and I working out the code with Sebastian’s and Wenping’s input.

We also need to evaluate how well variational inference works for ODE problems. Our early trials are very promising. Then we could replace the max marginal likelihood approach of NONMEM with a very speedy variational inference mechanism allowing much more general models.

There’s more in the works, but the above are the top of our to-do list.

Solution to Stan Puzzle 1: Inferring Ability from Streaks

If you missed it the first time around, here’s a link to:

First, a hat-tip to Mike, who posted the correct answer as a comment. So as not to spoil the surprise for everyone else, Michael Betancourt (different Mike), emailed me the answer right away (as he always does for math problems—Michael’s literally amazing).

Although I formulated it to myself as “How do I code this in Stan?”, it turns out there’s an analytic solution. Here’s how I worked through it (after about as many false starts as the others who posted on the list).

Michael Betancourt also analyzed the process qua process; maybe he’ll elaborate by editing this post below or in comments.


If the observed data are streaks y = (y_1, \ldots, y_N), with streak lengths y_n \geq 1, the underlying sequence of successes and failures must match the following regular expression

z = 0^{*} \, 1^{y_1} \, 0^{+} \, 1^{y_2}  \cdots 0^{+} \, 1^{y_N} \, 0,

where a^n is n repetitions of a, a^* is zero or more repetitions of a, and a^+ is one or more repetitions of a. Because sequence concatenation is associative, 0^+ = 0 \, 0^*, and y_n \geq 1, the above can be regrouped as

z = (0^* \, 1) \, 1^{y_1 - 1} \, 0 \, (0^* \, 1) \, 1^{y_2 - 1} \cdots (0^* \, 1) \, 1^{y_N - 1} \, 0

Given the way the streak data is generated, the probability of generating any further misses and then the first made shot of a streak, namely (0^* \, 1), is the sum of the probabilities of generating 1 or 0\,1 or 0\,0\,1 or …, which conveniently reduces to 1, because it covers all the possibilities for observing zero or more failures followed by a single success. Many people were getting at that intuition in the comments.
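The reduction to 1 is just a geometric series. As a sketch, writing \theta for the chance of success and k (an index introduced here only for illustration) for the number of misses before the made shot:

\displaystyle \sum_{k=0}^{\infty} (1 - \theta)^k \, \theta = \frac{\theta}{1 - (1 - \theta)} = 1 \ \ \mbox{ for } 0 < \theta \leq 1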

Thus the probability of z, marginalizing over the unobserved 0^* sequences, reduces to the probability of generating the 1^{y_n - 1} terms and the required inter-streak 0 terms. With a \theta chance of success, the likelihood reduces to

\displaystyle p(y \, | \, \theta) \propto \prod_{n=1}^N \left( \theta^{y_n - 1} \times (1 - \theta)\right)

\displaystyle \mbox{ } \ \ \ = \theta^{\sum_{n=1}^N (y_n - 1)} \times (1 - \theta)^N

\displaystyle \mbox{ } \ \ \ = \theta^{\mbox{\footnotesize sum}(y) - N} \times (1 - \theta)^N

\displaystyle \mbox{ } \ \ \ \propto \mbox{Binomial}(\mbox{sum}(y) - N \, | \, \mbox{sum}(y), \theta)

With a uniform prior on \theta, which is equivalent to a \mbox{Beta}(\theta \, | \, 1,1) prior, the beta-binomial conjugacy provides the following analytic solution for the posterior.

\displaystyle p(\theta \, | \, y) = \mbox{Beta}(\theta \, | \, \mbox{sum}(y) - N + 1, N + 1)
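As a quick sanity check on the formula, here’s a small Python sketch that computes the posterior parameters for some streak data and compares the closed-form posterior mean with plain Monte Carlo draws. The streak lengths are invented for illustration, and streak_posterior is a hypothetical helper name, not part of Stan.

```python
import random

def streak_posterior(y):
    """Beta(a, b) posterior parameters for theta under a uniform prior."""
    if any(length < 1 for length in y):
        raise ValueError("streak lengths must be at least 1")
    n = len(y)
    return sum(y) - n + 1, n + 1

y = [3, 1, 2, 5, 1]                  # hypothetical streak lengths
a, b = streak_posterior(y)           # a = sum(y) - N + 1, b = N + 1
posterior_mean = a / (a + b)         # closed-form mean of Beta(a, b)

# Independent draws straight from the posterior; no Markov chain needed.
rng = random.Random(0)
draws = [rng.betavariate(a, b) for _ in range(10000)]
monte_carlo_mean = sum(draws) / len(draws)

print(a, b, posterior_mean, monte_carlo_mean)
```

The Monte Carlo mean should land close to the closed-form value, which is the whole appeal of having the conjugate posterior in hand.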

Loss of Information

How much information is “leaking” when we reduce the underlying sequence z to the streaks y? When formulated as streaks, the Beta posterior is based on a total of \mbox{sum}(y) “observations” (\mbox{sum}(y) - N successes and N failures), whereas the full sequence z contains all \mbox{sum}(y) successes and at least N failures, and so would lead to a \mbox{Beta}(\alpha,\beta) posterior with \alpha = \mbox{sum}(y) + 1 and \beta \geq N + 1.

Stan Code

So as not to disappoint those who wanted to see a Stan solution, here’s the MCMC version which lays it out as a model with parameters to estimate.

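data {
  int<lower=0> N;                // number of streaks
  int<lower=1> y[N];             // streak lengths
}
parameters {
  real<lower=0, upper=1> theta;  // chance of success
}
model {
  // sum(y) - N successes out of the sum(y) shots that count
  sum(y) - N ~ binomial(sum(y), theta);
}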

This model can be integrated into larger models based on theta, but is not much use in and of itself.

In this case, the analytic solution lets you generate draws directly from the posterior in the generated quantities block using Monte Carlo (without the Markov chain bit), which is much more efficient than MCMC.

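data {
  int<lower=0> N;                // number of streaks
  int<lower=1> y[N];             // streak lengths
}
generated quantities {
  real<lower=0, upper=1> theta;
  // independent draw from the analytic posterior
  theta <- beta_rng(sum(y) - N + 1, N + 1);
}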

But there's unlikely to be a need to do even straight-up Monte Carlo when you have an analytic posterior.

Syllabus for my course on Communicating Data and Statistics

Actually the course is called Statistical Communication and Graphics, but I was griping about how few students were taking the class, and someone suggested the title Communicating Data and Statistics as being a bit more appealing. So I’ll go with that for now.

I love love love this class and everything that’s come from it (including statistics diaries and ShinyStan).

Here’s the syllabus. It’s full of fun reading and great activities, in and outside of class. The only things missing are the jitts, but I like to keep them as a surprise. So if you want to teach this class (and I think you should; indeed I think this course should be taught everywhere and it should be a standard part of the statistics and quantitative social science curriculum), you’ll just have to write your own jitts. Otherwise the course pretty much teaches itself. And remember, with your guest visitors, keep the conversations short and focused. Long rambling discussions are fun, and they’re easy on the instructor, but ultimately you want to spend lots of class time directly on feedback on student work.

Now for the next 90 seconds I’d like you to talk with your neighbor and come up with a question to ask me.

OK, start yapping!

Jason Chaffetz is the Garo Yepremian of the U.S. House of Representatives, and I don’t mean that in a good way.

Mike Spagat and Paul Alper point us to this truly immoral bit of graphical manipulation, courtesy of U.S. Representative Jason Chaffetz.

Here’s the evil graph:


Here’s the correction:


From the news article by Zachary Roth:

As part of a contentious back-and-forth in which Chaffetz repeatedly cut off [Planned Parenthood president Cecile] Richards, the congressman displayed a slide with a graph that looked like this [top graph above]. When Richards said she’d never seen it before, Chaffetz replied: “It comes straight from your annual reports.”

Moments later, Richards shot back: “My lawyers just informed me that the source of this information is Americans United for Life, an anti-abortion group. I would check your source.”

But the source wasn’t the only problem. A cursory look at the graph, which comes from an Americans United For Life report about Planned Parenthood centers released in June, makes it seem like in 2006, Planned Parenthood performed far more cancer screening and prevention services than abortions, but that by around 2010 it performed an equal number of both, and by 2013 it performed far more abortion services than anti-cancer services.

The issue is important because as part of their effort to defund Planned Parenthood, Republicans have portrayed it as primarily an abortion provider, while the group’s defenders have said it mostly performs other women’s health services, like cancer screenings.

But look at the actual numbers in the graph. They show that in 2006, Planned Parenthood performed 2,007,371 anti-cancer services and 289,750 abortions. By 2013, the gap had closed slightly, but the group still performed many more anti-cancer services than abortions, 935,573 to 327,000.
Why does it seem otherwise? Because the “graph” has no y axis, which allows its creators to simply plot the results wherever they choose in order to create a compelling visual effect. That’s how 327,000 is made to look like a much larger number than 935,573.

What the slide actually shows, of course, is that the number of abortions performed by Planned Parenthood rose very modestly between 2006 and 2013, while the number of anti-cancer services it performed did indeed fall by more than half. But Richards said some of the services, like pap smears, dropped in frequency because of changing medical standards about who should be screened and how often. Displaying that information on an actual graph would show the line for abortions rising very slightly over the 7-year time period, and the line for anti-cancer services dropping, but always remaining far above the line for abortions.

I looked up Jason Chaffetz on Wikipedia and found this:

[Chaffetz] was the starting placekicker on the BYU football team in 1988 and 1989. He still holds the BYU individual records for most extra points attempted in a game, most extra points made in a game, and most consecutive extra points made in a game.

A placekicker of all people should understand the principle of division of labor. If you want to make a graph, get an expert to do it. Don’t use a double y-axis and, while you’re at it, don’t tell untruths about where you got it from.

Just kick the damn ball, and leave the passing to the quarterback, OK?

P.S. The bottom graph above is much better than the top graph but it’s still not perfect. The axis labels are too tiny to be readable, and there are way too many numbers on the y-axis. Tick marks at 0, 500,000, 1 million, 1.5 million, etc., would do just fine. Also those heavy black and red lines on the left, bottom, and right of the plot are bad news.

Hot hand explanation again


I guess people really do read the Wall Street Journal . . .

Edward Adelman sent me the above clipping and calculation and writes:

What am I missing? I do not see the 60%.

And Richard Rasiej sends me a longer note making the same point:

So here I am, teaching another statistics class, this time at Santa Monica College, and reading the Wall Street Journal before heading in to school. Not surprisingly, extremely intrigued by the article about the ‘Hot Hand’.

I know you were quoted in it, although it looks like some of the quote got cut off somehow.

Anyway, I was very confused by the piece and did a little pencil and paper work, and am not quite sure I buy it. Admittedly, I did not try to find the original paper or commentary, but based my doodling on the description in the article.

As I understand it, we look at the 14 out of 16 possible sequences of 4 coin tosses which do not begin TTT (in order to have at least one H).

Then, whenever there is an H in the subsequence of the first three tosses, we look at whether or not it is followed by a T.

So I wrote all 14 down, and for each one looked at how often a T follows an H.

Here’s what I found:

HHHH: 3 opportunities, 0 Ts
HHHT: 3 opportunities, 1 T
HHTH: 2 opportunities, 1 T
HHTT: 2 opportunities, 1 T
HTHH: 2 opportunities, 1 T
HTHT: 2 opportunities, 2 Ts
HTTH: 1 opportunity, 1 T
HTTT: 1 opportunity, 1 T
THHH: 2 opportunities, 0 Ts
THHT: 2 opportunities, 1 T
THTH: 1 opportunity, 1 T
THTT: 1 opportunity, 1 T
TTHH: 1 opportunity, 0 Ts
TTHT: 1 opportunity, 1 T

This seems to total up to 24 opportunities to see whether or not an H in the first three positions is followed by a T, and a total of 12 Ts, for 50%. So I don’t see where the 60% mentioned in the article comes from.

Also, note that the coin toss in the fourth position is irrelevant to this count, since the simulation is for only four tosses – so we never know what happens on the “fifth” toss.

Besides not understanding where the 60% comes from, how much of this (what is alleged in the article) is an artifact of the length of the sequence of tosses? Rather than restricting ourselves to sequences of length 4, should not the analysis look at sequences of all lengths? That is, start with tosses of length 2, 3, 4, 5, 6, etc., measure the frequency with which a T follows an H in the possible subsequences of lengths 1, 2, 3, 4, 5, etc., and then try to determine if the sequence of proportions converges?

My reply to both:

You get the non-50% number by first computing the percentage for each scenario, then averaging the 14 scenarios equally, which gives roughly 60%. If you weight by the number of opportunities you indeed get the correct answer of 50% here, but the point is that when the hot hand has traditionally been estimated, the estimation has been done by taking the empirical difference for each player, and then taking a simple (not weighted) average across players, hence the bias, as explained and explored in several recent papers by Josh Miller and Adam Sanjurjo.
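Here’s a short Python sketch of the two ways of averaging, using the counting rule from Rasiej’s note (an H in any position but the last is an opportunity, and we record whether a T follows). The function names are invented for this illustration. It enumerates the sequences directly, so it also works for lengths other than 4.

```python
from fractions import Fraction
from itertools import product

def t_after_h(seq):
    """Return (opportunities, tails): H's before the last toss, and how many are followed by T."""
    opportunities = tails = 0
    for current, following in zip(seq, seq[1:]):
        if current == "H":
            opportunities += 1
            tails += following == "T"
    return opportunities, tails

def summarize(length):
    """Pooled and equally weighted proportions of T after H over all sequences."""
    counts = [t_after_h(seq) for seq in product("HT", repeat=length)]
    counts = [(opp, t) for opp, t in counts if opp > 0]  # need at least one H to follow
    pooled = Fraction(sum(t for _, t in counts), sum(opp for opp, _ in counts))
    unweighted = sum(Fraction(t, opp) for opp, t in counts) / len(counts)
    return len(counts), pooled, unweighted

n_sequences, pooled, unweighted = summarize(4)
print(n_sequences, pooled, unweighted, float(unweighted))
```

For four tosses, pooling all the opportunities gives exactly one half, while averaging the per-sequence proportions equally gives 25/42, a bit under 60%.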

More here.

P.S. Miller points out that, for real shooting data (as opposed to coin flips) there is no simple weighted averaging that would give you the correct hot-hand estimate, as such an average would not correct for differences between players. That’s why I think the ultimate way to go will be to fit a Bayesian analysis using Stan. We’ve done some steps toward this but our model is still in a simple and preliminary stage.

An unconvincing analysis claiming to debunk the health benefits of moderate drinking


Daniel Lakeland writes:

This study on alcohol consumption (by Craig Knott, Ngaire Coombs, Emmanuel Stamatakis, and Jane Biddulph) was written up in the BMJ editorials as “Alcohol’s Evaporating health benefits.”

They conveniently show their data in a table, so that they can avoid graphing a “J” shape that they constantly allude to being wrong… But their own models (see the links under table 3 and table 4) show that the hazard ratio relative to the “never drinker” category for males 50-64 years old declines and then goes up for the “heavy drinker” category, like… I dunno, kind of a J shape??? yes.. yes definitely like that.

Ok, so how about table 4 for women… aged 50-64 years:

Ok, decline with minimum at around 10-15 units /wk… increasing for the heavier drinkers….

Ok, model 2, model 2 is going to do it right??? well… kind of constant, but a definite low point at 10-15 units/wk….

Ok, looking across all the different models in all the different tables…. all the different age groups… yep… pretty much every group has lowest risk in the same range, around 10-15 units per week, or 1-3 units per day or whatever way you want to look at it… pretty much just exactly where the traditional J shape puts things… one or two drinks per day 3 to 5 days a week or something like that.

Nevertheless… the editorial claims: “if there is any beneficial dose-response relation, it is limited to women aged 65 or more — and even that association is at best modest and likely to be explained by selection bias.”

and “for a range of reasons, including confounding and selection bias in the papers generally cited, even low level alcohol consumption is unlikely to protect drinkers from cardiovascular disease”…

so… do a study… don’t like the results? Bury them in a table and just claim you found the opposite of what you found? Or if you don’t claim it yourself, maybe at least get your friend to write an editorial or something.

The editorial that Lakeland cites has an explicit political agenda. But setting this aside, I think the larger point is that the effects of any drug will depend on its context. My guess is that the authors of the research paper and the editorial aren’t concerned about moderate drinking. It’s more that they’re worried that the news about the health benefits of moderate drinking will be used as an encouragement for people to drink heavily.

This contextual effect can arise at the individual or societal level. A person might hear about alcohol being good for you and then lapse into alcoholism; at least, that’s the concern. Or, at the national level, the news about the benefits of moderate drinking will get in the way of public health efforts to combat problem drinking.

I can share my own experiences here. A few years ago I was talking with my cardiologist and he asked me about my alcohol consumption. I said I drank rarely, probably less than one glass of wine a week. He said I should drink a few times a week, that it would be good for me. Then when I was in France getting a health checkup for my employment, the doctor asked me how often I drank alcohol. I said I had a glass or two of wine a few times a week, cos it was recommended by my cardiologist. She told me not to drink so much, it was bad for my liver. I asked her about the benefits to my heart and she said, no, don’t believe that.

This is just n=2, of course, but perhaps both recommendations were reasonable: my cardiologist’s given that I rarely drank, and the French doctor’s given that, for all she knew, maybe I was an alcoholic and was just trying to justify my addiction.

In any case, I agree with Lakeland that it’s better to report results clearly and graphically rather than contorting the data to support some particular claim.

How to use lasso etc. in political science?

Tom Swartz writes:

I am a graduate student at Oxford with a background in economics and on the side am teaching myself more statistics and machine learning. I’ve been following your blog for some time and recently came across this post on lasso. In particular, the more I read about the machine learning community, the more I realize how none of this work is incorporated into the majority of economics research.

I was wondering if you could give some advice on how to use techniques such as lasso, which retain a certain degree of interpretability, in a situation like economics or political science? Given that the goal is largely to describe, rather than just optimize an un-interpretable model, how would you use such techniques in a way that reduces variance and point estimate overestimation while at the same time interpreting the particular coefficients in a meaningful way?

My reply:

I don’t really buy the idea that lasso gives more interpretability; I think of it as a way to regularize inferences from regression. In most settings I actually find it difficult to directly interpret more than one coefficient in a regression model. Think of it this way: the coefficient of some predictor x represents a comparison of two items that differ in x while being identical in all other predictors of the model. Typically this only has a clear interpretation if x is the “last” predictor in the model, so that all the other predictors come “before” it.

Regularization is great, I just think the way to think of lasso is as a way of regularizing a regression model. The model is what’s important. What’s good about lasso and other regularizers is that they allow you to fit a regression model with lots of predictors. But the interpretability, or lack thereof, is a property of the regression model, not of the regularization.

RStan 2.8.0 is on CRAN!

RStan 2.8.0 is available on CRAN!

Installation directions can be found on RStan’s Wiki.

And since I know a lot of people aren’t patient enough to read through installation instructions, the most important parts are:

  1. You (still) need a C++ toolchain.
    Mac: XCode. Make sure to open it once after download to accept the license.
    Windows: Rtools. Make sure the binaries are on your Windows path.
    Linux: If you don’t have a C++ toolchain in Linux, you should probably rethink your operating system.
  2. From within R:
    > install.packages("rstan", dependencies = TRUE)

    I don’t know why you need dependencies, but maybe the RStan gurus can explain.

  3. Restart R before using RStan. Please.
    This is another thing that I don’t understand, but it does solve a lot of problems.

As always, if you run into trouble, let us know on the stan-users mailing list.

Fitting models with discrete parameters in Stan

This book, “Bayesian Cognitive Modeling: A Practical Course,” by Michael Lee and E. J. Wagenmakers, has a bunch of examples of Stan models with discrete parameters—mixture models of various sorts—with Stan code written by Martin Smira! It’s a good complement to the Finite Mixtures chapter in the Stan manual.

On deck through the rest of 2015

There’s something for everyone! I had a lot of fun just copying the titles to make this list, as I’d already forgotten about a lot of this stuff. Here are the scheduled posts, in order through 31 Dec:

Fitting models with discrete parameters in Stan

How to use lasso etc. in political science?

An unconvincing analysis claiming to debunk the health benefits of moderate drinking

Tamiflu conflict of interest

Alleged data manipulation in NIH-funded Alzheimer’s study

Flamebait: “Mathiness” in economics and political science

Cognitive skills rising and falling

Anti-cheating robots

Mindset interventions are a scalable treatment for academic underachievement — or not?

Most successful blog post ever

Political advertising update

Doomed to fail: A pre-registration site for parapsychology

Mars Missions are a Scam

Ta-Nehisi Coates, David Brooks, and the “street code” of journalism

What do you learn from p=.05? This example from Carl Morris will blow your mind

Here’s a theoretical research project for you

Hierarchical logistic regression is easy in Stan

In answer to James Coyne’s question, no, I can’t make sense of this diagram.

In that article, they forgot to mention that Ludmerer is one of the 5 doctors in America who has no opinion on whether cigarette smoking contributes to lung cancer in human beings.

“Null hypothesis” = “A specific random number generator”

My webinar with Brad Efron this Wednesday

How to build trust in missing-data imputations?

Evaluating models with predictive accuracy

Using Stan to map cancer screening!

Why you can’t always use predictive performance to choose among models

Top 5 movies about scientists

“Modern Physics from an Elementary Point of View”

Super-topical NBA post!!!

Characterizing the spatial structure of defensive skill in professional basketball

The original Hot Hand preprint!

Exaggeration of effects of fan distraction in NCAA basketball

What do I say when I don’t have much to say?

Cauchy priors for logistic regression coefficients

Where the fat people at?

“Priming Effects Replicate Just Fine, Thanks”

My job here is done

The tabloids strike again

Econometrics: Instrument locally, extrapolate globally

I wish Napoleon Bonaparte had never been born

DataMeetsViz workshop

How to parameterize hyperpriors in hierarchical models?

“Don’t get me started on ‘cut’”

Taleb’s Precautionary Principle: Should we be scared of GMOs?

Pass the popcorn

Who falls for the education reform hype?

Inference from an intervention with many outcomes, not using “statistical significance”

“Should Prison Sentences Be Based On Crimes That Haven’t Been Committed Yet?”

At this point, I’m primed to be skeptical about claims of social priming

He wants to teach himself some statistics

I like the Monkey Cage

What years of the economy influence the next presidential election?

Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.

Some people are so easy to contact and some people aren’t.

A statistical approach to quadrature

“The Bayesian Second Law of Thermodynamics”

Jökull Snæbjarnarson writes . . .

Bayesian inference for network links

0.05 is a joke

Statistics diaries and comparable assignments in other fields

“A pure Bayesian or pure non-Bayesian is not forever doomed to use out-of-date methods, but at any given time the purist will be missing some of the most effective current techniques.”

7 tips for work-life balance

A missed opportunity?

How to analyze hierarchical survey data with post-stratification?

My quick answer is that I would analyze all 10 outcomes using a multilevel model.

Rogue historian just can’t stop copying

Questions about data transplanted in kidney study

Party like it’s 2005

Cannabis/IQ follow-up: Same old story

Waic and cross-validation for survival models?

Hierarchical modeling when you have only 2 groups: I still think it’s a good idea, you just need an informative prior on the group-level variation

I definitely wouldn’t frame it as “To determine if the time series has a change-point or not.” The time series, whatever it is, has a change point at every time. The question might be, “Is a change point necessary to model these data?” That’s a question I could get behind.

Actually, I’d just do full Bayes

“Baby Boomer” as all-purpose insult

Defining conditional probability

In defense of endless arguments

Bayesian decision analysis for the drug-approval process

Mars 1, This American Life 0

LaCour and Green 1, This American Life 0

What is a Republican?

“Perhaps the most reasonable explanation is that no one watched the video or did the textbook reading . . .”

A Replication in Economics: Does “Genetic Distance” to the US Predict Development?

Death of a statistician

Rapid post-publication review

He’s skeptical about Neuroskeptic’s skepticism

R sucks

“Am I doing myself a disservice by being too idealistic in a corporate environment?”

Gresham’s Law of experimental methods

Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

There are 6 ways to get fired from Johnson & Johnson: (1) theft, (2) sexual harassment, (3) running an experiment without a control group, (4) keeping a gambling addict away from the casino, (5) chapter 11 bankruptcy proceedings, and (6) not covering up records of side effects of a drug you’re marketing to kids

“The lifecycle of scholarly articles across fields of economic research”

My presentation at the Electronic Conference on Teaching Statistics

When the numbers differ in the third decimal place

Definitely got nothing to do with chess IV

As usual, I’ll occasionally bump posts for more topical material. And my cobloggers are free to intersperse their posts whenever.

The Final Bug, or, Please please please please please work this time!

I’ve been banging my head against this problem, on and off, for a couple months now. It’s an EP-like algorithm that a collaborator and I came up with for integrating external aggregate data into a Bayesian analysis. My colleague tried a simpler version on an example and it worked fine, then I’ve been playing around with a multivariate version and . . . it kinda works. At one point it was working ok, and then in writing up the algorithm I noticed some places where it could be improved, and then I did the improvements, and it was failing. I was getting extreme importance ratios and degenerate covariance matrices. Then I realized my algorithm wasn’t quite right, I was using the wrong factor in my EP computation so that it would not converge to what I wanted. So I fixed that. Then more problems. Etc etc. I tried going back to the simple version of the algorithm but it ran really poorly in my example. At this point I don’t know what I’m doing, I start playing desperately with the algorithm, pulling factors in and out of the importance weights, changing the distribution from which I was drawing the initial approx, etc etc etc. Can’t get it to work. Even when it doesn’t crash, I’m getting simulation efficiencies approaching zero.

Then I look at the code one more time. Damn! I was passing the arguments in the wrong order to my R function. OK, that’s the bug. I run it one more time . . . No! I was confused, the order of my arguments was just fine. So I’m still in the thick of it. Ugh.

P.S. On reading the comments I see there’s some confusion here. The problem is not simply: “I want to do X, I wrote code to do X, but it’s not working.” The problem is I’m doing research, I have a sense that this algorithm should work, but there could be something I’m missing. Bugs in the code are interacting with my incomplete understanding of the method that my colleagues and I are developing.

Also, I wrote the above post a couple of months ago. Since then we’ve made progress and I hope to post the paper soon.
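For readers wondering what "extreme importance ratios" and "simulation efficiencies approaching zero" look like numerically, here is a minimal sketch. It assumes efficiency is measured as effective sample size divided by the number of draws, using the common formula (sum of weights) squared over sum of squared weights. The post does not say which measure was actually used, and the function name and toy weights below are made up for illustration.

```python
import math
from typing import Sequence


def effective_sample_size(log_weights: Sequence[float]) -> float:
    """Effective sample size (sum w)^2 / sum(w^2), computed from log importance weights.

    Subtracting the maximum log weight before exponentiating avoids overflow
    when a few importance ratios are extreme.
    """
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)


# Hypothetical toy inputs, not taken from the post.
even = [0.0] * 1000               # all draws weighted equally
skewed = [0.0] * 999 + [50.0]     # one draw dominates everything else

print(effective_sample_size(even) / len(even))      # efficiency with equal weights
print(effective_sample_size(skewed) / len(skewed))  # efficiency when one ratio is extreme
```

When a single draw carries nearly all the weight, the effective sample size collapses toward 1 no matter how many draws were taken, which is one way an importance-sampling step can look like it is "running" while contributing almost nothing.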

Annals of Spam

OK, explain to me this email:

God day,

How are you? My name is **. I came across your contact email at the University of Cyprus, Department of Economics. I seek for a private Economics teacher for my Daughter. I would like to know if you would be available for job.

If you would be available, kindly let me know your policy with regard to the fee, cancellations, location and make-up lessons. Also,get back to me with your area of specialization.

The lessons can start by 16th of June.


It seems too weird to be another one of those stupid experiments. But I can’t see the money-making potential. Maybe if I respond, they come back to me with the pitch? And what’s with Cyprus? Nothing makes sense here.

God day to you too!

P.S. It turns out there is a logic to the scam, as explained in the link given by commenter Scott. This sounds like a great future career track for Xian “Alex” Zhao and Monica Biernat—that is, once their family emergencies are done.

Stan Puzzle #1: Inferring Ability from Streaks

Inspired by X’s blog’s Le Monde puzzle entries, I have a little Stan coding puzzle for everyone (though you can solve the probability part of the coding problem without actually knowing Stan). This almost (heavy emphasis on “almost” there) makes me wish I was writing exams.

Puzzle #1: Inferring Ability from Streaks

Suppose a player is shooting free throws, but rather than recording for each attempt whether it was successful, she or he instead reports the lengths of her or his streaks of consecutive successes. For the sake of this puzzle, assume the player makes a sequence of free throw attempts, z = (z_1, z_2, \ldots), assumed to be i.i.d. Bernoulli trials with chance of success \theta, until N streaks are recorded. The data recorded are only the lengths of the streaks, y = (y_1, \ldots, y_N).

Puzzle:   Write a Stan program to estimate p(\theta \, | \, y).

Example:   Suppose a player sets out to record 4 streaks and makes shots

z = (0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0).

This produces the observed data

N = 4

y = (3, 1, 2, 5).

Any number of initial misses (0 values) in z would lead to the same y. Also, the sequence z always ends with the first failure after the N-th streak.
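To make the recording rule concrete, here is a small sketch of how z turns into y. The function name `streak_lengths` is made up for illustration; it just collects maximal runs of 1s and stops once N of them have been closed off by a miss.

```python
from typing import List


def streak_lengths(z: List[int], n_streaks: int) -> List[int]:
    """Return the lengths of the first n_streaks runs of consecutive successes in z."""
    streaks: List[int] = []
    run = 0
    for shot in z:
        if shot == 1:
            run += 1                  # extend the current streak
        elif run > 0:
            streaks.append(run)       # a miss closes the current streak
            run = 0
            if len(streaks) == n_streaks:
                break                 # recording stops at this miss
        # a miss with no open streak is simply not recorded
    return streaks


z = [0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]
print(streak_lengths(z, 4))           # [3, 1, 2, 5]
print(streak_lengths([0, 0] + z, 4))  # extra initial misses give the same y
```

Note that the misses never show up in the output, which is exactly the information the puzzle asks you to marginalize over.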

Hint:   Feel free to assume a uniform prior p(\theta). The trick is working out the likelihood p(y \, | \, \theta,N), after which it is trivial to use Stan to compute p(\theta \, | \, y) via sampling.

Another Hint:   Marginalize the unobserved failures (0 values) out of z. This is non-trivial because we don’t know the length of z.

Extra Credit:   Is anything lost in observing y rather than z?

Answer: Solution to Stan Puzzle 1.
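Spoiler warning. Here is one way to work through the two hints; it is a sketch and not necessarily the route taken in the linked solution. The symbols k_0 and k are introduced here for illustration: k_0 ≥ 0 is the unobserved number of misses before the first streak, and k ≥ 1 is the unobserved number of misses between two consecutive streaks. Recording stops at the first miss after the N-th streak, which contributes a single factor of (1 - \theta). Summing the geometric series over the unobserved misses (valid for 0 < \theta < 1):

```latex
\begin{align*}
p(y \mid \theta, N)
  &= \Bigl[\sum_{k_0=0}^{\infty} (1-\theta)^{k_0}\Bigr]\,\theta^{y_1}
     \;\prod_{n=2}^{N} \Bigl[\sum_{k=1}^{\infty} (1-\theta)^{k}\Bigr]\,\theta^{y_n}
     \;(1-\theta) \\
  &= \frac{1}{\theta}\,\Bigl(\frac{1-\theta}{\theta}\Bigr)^{N-1} (1-\theta)\;
     \theta^{\sum_{n=1}^{N} y_n} \\
  &= \theta^{\sum_{n=1}^{N} y_n - N}\,(1-\theta)^{N}.
\end{align*}
```

With the uniform prior from the first hint, this gives a Beta(\sum_n y_n - N + 1, N + 1) posterior; for the example y = (3, 1, 2, 5) that is Beta(8, 5). In a Stan program the corresponding log density could be added as (sum(y) - N) * log(theta) + N * log1m(theta), with theta declared on [0, 1].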

Low-power pose

“The samples were collected in privacy, using passive drool procedures, and frozen immediately.”

Anna Dreber sends along a paper, “Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women,” which she published in Psychological Science with coauthors Eva Ranehill, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto Weber.

I can’t find a copy of the paper online but here’s the Open Science Framework page for the project, and here’s how the paper begins:

In a growing body of research, psychologists have studied how physical expression influences psychological processes . . . A recent strand of literature within this field has focused on how physical postures that express power and dominance (power poses) influence psychological and physiological processes, as well as decision making . . . Carney et al. found that power posing affected levels of hormones such as testosterone and cortisol, financial risk taking, and self-reported feelings of power in a sample of 42 participants . . . We conducted a conceptual replication study with a similar methodology as that employed by Carney et al. but using a substantially larger sample (N = 200) and a design in which the experimenter was blind to condition. . . .

And here’s what they find:

Consistent with the findings of Carney et al., our results showed a significant effect of power posing on self-reported feelings of power. However, we found no significant effect of power posing on hormonal levels or in any of the three behavioral tasks.

I just have a couple of statistical comments:

1. Ranehill et al. write, “Our statistical power to detect an effect of the magnitude reported by Carney et al. was more than 95%.” Sure, but a key principle of design calculation (my preferred term, because I think that conventional “power” is unduly focused on statistical significance) is to hypothesize effect sizes using external information, not to simply use a published point estimate that is highly vulnerable to noise and selection.

I’m not saying Ranehill et al. did anything wrong in their analysis here, it’s just that it should be no surprise that this purportedly high-power study did not replicate, as the assumed power is coming from a biased and noisy effect size estimate.

2. After the non-replication, they write, “It is possible that subtle differences between the experimental protocols in Carney et al. and those in our study, originally designed as an extension of Carney et al., led to the omission of factors crucial for power poses to influence hormonal levels and behavior.” Let me just emphasize that just about all effects of interest vary across people and across scenarios. But when someone does a noisy study that fails to replicate in a larger sample, I have no reason, in general, to take that first result seriously.
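To illustrate the first point, here is a rough sketch of how much a design calculation depends on the assumed effect size. Everything numerical here is an assumption for illustration: the effect sizes are hypothetical standardized differences (not the values reported by Carney et al. or Ranehill et al.), the N = 200 is assumed to be split evenly into two groups, and the function uses a simple normal approximation for a two-sided comparison of means.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison of means.

    effect_size is the true difference in means divided by the within-group
    standard deviation; the test statistic is treated as normal.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)   # expected z-score under the assumed effect
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


n_per_group = 100                 # assumed even split of N = 200
for d in (0.6, 0.3, 0.1):         # hypothetical effect sizes
    print(f"assumed effect {d:.1f}: power {two_sample_power(d, n_per_group):.2f}")
```

If the published estimate is inflated by noise and selection, plugging it in gives the top row; a more skeptical guess based on external information gives something like the lower rows, and the same study no longer looks "high power."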

By the way, in case you’re wondering, no, Anna Dreber is not some sort of a professional skeptic. The papers listed on her webpage include:

Apicella, Coren L., Anna Dreber & Johanna Möllerström. “Salivary testosterone change following monetary wins and losses predicts future financial risk-taking.” Psychoneuroendocrinology, 39: 58-64.

Rand, David G., Anna Dreber, Omar Haque, Rob Kane, Martin A. Nowak and Sarah Coakley. “Religious Motivations for Cooperation: An Experimental Investigation using Explicit Primes.” Religion, Brain and Behavior, 4(1): 31-48.

Dreber, Anna, Christer Gerdes and Patrik Gränsmark. “Beauty Queens and Battling Knights: Risk Taking and Attractiveness in Chess.” Journal of Economic Behavior and Organization, 90: 1-18.

Dreber, Anna, Christer Gerdes, Patrik Gränsmark and Anthony C. Little. “Facial Masculinity Predicts Risk and Time Preferences in High-Level Chess Players.” Applied Economics Letters, 20(16): 1477-1480.

I don’t know if this should make you more or less likely to believe her findings on power poses; my point is just that, unlike me, Dreber is an active researcher in that area.

I wonder if the Ted people will update their webpage? Probably not, eh? If they did, that would be news.

Amtrak is evil

Hmmmm, coverage for travel delay, that might not be so bad. This is Amtrak, after all. Let’s click through to the fine print:

They’ll cover me for a departure delay of six or more hours, huh?

Nice try in your attempt to scam me out of $8.50. It didn’t work this time, but, hey, why not waste everyone’s time figuring out which little box to click? I’m sure you did some A/B testing and found there was a large enough pool of suckers to make it all worth it for you.


Draw your own graph!

Bob writes:

You must have seen this. I like it. But not enough to spend time blogging about it.

I’ll try blogging it myself . . . OK, yeah, this interactive graph is great. It reminds me of “scatterplot charades” exercises we do in class from time to time. Somebody should write a program so that this can be done with any data. It’s awesome.

OK, that wasn’t so hard.


Screen Shot 2015-09-23 at 7.29.53 PM

Screen Shot 2015-09-23 at 7.30.14 PM

I don’t know if he has to say that this body type is actually better for a baseball player. Maybe it’s enough to just make the case that, Moneyball-style, players with this shape are underrated.

P.S. I still don’t see why James chose in his book to summarize players by games played, home runs, RBI, batting average, and . . . nothing else. I can see how he’d want to include the standard stats as a point of comparison, but how hard would it be to include OBP, SLG, and maybe a couple other numbers of the sort that ordinarily he’d prefer?

What was the worst statistical communication experience you’ve ever had?

In one of the jitts for our statistical communication class we asked, “What was the worst statistical communication experience you’ve ever had?” And here were the responses (which I’m sharing with permission from the students):

Not sure if this counts, but I used to work with a public health researcher who published a journal article impugning a major pharmaceutical company. The data on which she based her argument was incorrect! When this mistake came out, readers were upset, and the article was widely read and emailed because it was being criticized. It was ultimately one of the ten most-emailed articles that appeared in the journal that year, and she bragged about this distinction, not recognizing that it was actually a bad thing.

Me trying to present the findings of a study I did on a company’s website usage and their customers’s behavior. I had no idea on how to present the correlations I found and clearly display my causation hypotheses, or how to translate them into actionable insights.

I used an event on the news to explain bayes theorum and conditional probability to my friends. One of them stopped listening when I started to write mathematical symbols on paper.

Trying to explain to the New York City Council Speaker why a regression line is a “good fit” even though none of the data points actually fall on that line.

With the little experience that I do have, I would have to say interpreting a speck phone case advertisement was the worst statistical communication I have had. It was a simple venn diagram with three circles. The three categories were “people who workout”, “people who don’t”, and “people who would if it weren’t all hard and stuff”. In the middle where all the circles intersected there was the speck logo. First thought, these are all disjoint. However, reading more into it “people who would if it weren’t all hard and stuff” is basically another way of saying people who don’t workout. But “people who workout” is still disjoint from “people who don’t” because they have nothing in common. So there wouldn’t be anyone in the middle where all the circle intersected. To me the speck symbol in the center is then implying that the people in that section would have a speck case, but there is no one so no one has a speck case. This would just be poor advertising. Another person suggested that what everyone has in common is that they all have a speck case. So if you do or don’t workout, you still have a speck case. So everyone would then be in the middle and everyone would have a speck case. To this, another person said bluntly, you’re wrong. No where on the ad does it talk about these “people” having a speck case, so I thing the ad has flaws. This isn’t a very serious occasion, but this conversation occurred several times with the same group of people and we are split on what the ad is suppose to mean. If you want to the see the advertisement I can show you and then I can have your input on it!

During my internship in marketing company this summer, when I extracted bunch of data from SQL Server and copied part of the output to present to my supervisor, I didn’t explain what means by each column and didn’t give him which table I used. So he was very confused about my result and just told me the result he want it to be. However, from my side, the result he want is exactly what i represent to him. So I thought he gave me a harsh time. Then I went to his office, explaining what i did regarding the data I got. He got what I mean and make me think it is because I didn’t make my result understandable. It is the worst statistical communication.

Trying to explain what a density is in an interview for a tutoring job when I was fresh out of highschool. I totally knew what it was, but that didn’t seem to have any impact whatsoever on my ability to explain it.

I sat in many meetings at the UN where data was presented by chairs of a committee and no one in the room had a math background that could explain it clearly. The worst time was when we were looking at military expenditures over time. Every country documented theirs differently and there were about 5 languages being translated to English. I wasn’t able to speak up since I was just taking notes on the meeting but greatly looking forward to this class to help me learn how to communicate everything I understand about statistics!

Talking about my research during my internship last summer. I had an hour-long talk and my talk required substantial background that most of the audience did not have. Instead of simplifying my content, I decided to try to teach some of the background during the talk instead, but did not do that particularly effectively.

My worst statistical communication experience happened when I did the GARCH model to analyze the volatility of S&P prices during last fifteen years. Since I had to deal with the data first before I input the data in your model. I spent a lot of time standardizing the data and bridging different returns to make sure the comparison accurately. It was a huge project to complete the project.

I have not had very many, but most recently, a conversation with a friend who works in data science. She was working with data for the purpose of bringing attention to the lack of New York govt funding to poorer school districts. I criticized how she was analyzing the data, and she explained to me that whether or not the data represent the truth, her job is to take pieces of data to bring attention to subjects in need. I unfortunately saw how data can be used as a weapon.

Arguing with my suddenly vegan father about whether the “China Study” proves that vegan diets are the healthiest possible option.

What was your worst statistical communication experience?