
We got mooks


Columbia University’s Data Science Institute is releasing some mooks, and I’m part of it. I’ll first give the official announcement and then share some of my thoughts.

The official announcement:

The Data Science Institute at Columbia University is excited to announce the launch of its first online-education series, Data Science and Analytics in Context, on Dec. 14. Available through the edX platform, the three-course series will run through April, featuring lectures, engaging exercises and community discussion.

The first course, Statistical Thinking for Data Science and Analytics, teaches the statistical foundations for analyzing large datasets. You will learn how data scientists design the data collection process, gain insights from visualizing data, find supporting evidence for data-based decisions and construct models for predicting future trends.

The second course, Machine Learning for Data Science and Analytics, is an introduction to machine learning and algorithms. In this course, you will develop a basic understanding of machine learning principles and how to find practical solutions using predictive analytics. We will also examine why algorithms play an essential role in Big Data analysis.

The third course, Enabling Technologies for Data Science and Analytics, explores the major components of the Internet of Things, including data gathering sensors. You will develop an understanding of how software is able to analyze events, recognize faces, interpret sentiment on social media, and how this information is fed into the decision-making process.

Learn from leading data scientists at Columbia University with guidance provided by Columbia graduate assistants during each course. Watch the video trailer for the series online at ColumbiaX and enroll today!

Link for video –

Link to enroll –

My perspective:

The mooks were organized by a group at our new Data Science Institute, including Prof. Tian Zheng, a friend and colleague of mine in the statistics department. I prepared two lectures, one on Bayesian data analysis and one on exploratory data analysis and visualization. The content was not super-organized; I just used some material I had around, including some of my favorite recent stories such as the Xbox polls and the age-adjusted death rates. I’m not sure how well they went because I hate looking at videos of myself. I did see clips from some of the other lectures and they looked pretty good.

Last year I prepared an intro stat course for the School of International and Public Affairs. I taped twelve 40-minute lectures, each accompanied by an R session with Ben Goodrich. These taped lectures were super-smooth; I actually ended up writing scripts for all of them because I sounded too awkward when I simply spoke as if I were giving a usual talk. In contrast, these new mooks are more like classroom lectures; it’s a different feel entirely.

Anyway, I hope this goes well. Organizing a remote course on data science seems like a real challenge, and it seems like a reasonable starting point to get different people to give different lectures on their areas of expertise. I suppose much will depend on the homework assignments and the student feedback. I was happy to contribute my parts, small as they were.

You’ll never believe what this girl wrote in her diary (NSFW)

Arber Tasimi heard about our statistics diaries and decided to try it out in the psychology class he was teaching. The students liked his class but a couple of them pushed back against the diaries, describing the assignment as pointless or unhelpful in their learning.

This made me think that it may be that a diary is more useful for statistics than for psychology. I don’t know why that would be, but it’s a thought. It also could be that this sort of introspective assignment is par for the course in psychology but unusual in statistics.

Arber responded:

That’s an intriguing idea. Students in this class (Perspectives on Human Nature) may have seen the diary as just another “assignment” whereas students in your class may have treated it as an “activity.”

Your claim, if I understand it correctly, makes an interesting prediction, that students in, say, English or Comparative Literature classes would find the diary the least useful (and, perhaps, students in set theory or particle physics courses would find it the most useful).

Maybe so. From another perspective, statistics and psychology have the feature that they connect to many different aspects of everyday life. A particle-physics or set-theory diary might end up being a bit more inward-focusing, although that could be good in its own right.

Anyway, this is just speculation based on small and non-representative samples. . . .

“iPredict to close after Govt refuses anti-money laundering law exemption”

Richard Barker points us to an update on iPredict, the New Zealand political prediction market. From the news article by Hamish Rutherford:

The site, run by Victoria University of Wellington’s commercialisation arm, VicLink, issued a statement to its website and on Twitter on Thursday.

According to the iPredict statement, Associate Justice Minister Simon Bridges refused to grant it an exemption from the Anti-Money Laundering and Countering Financing of Terrorism Act, declaring that it was a “legitimate money laundering risk” because of the lack of customer due diligence. . . .

Geoff Todd, managing director of VicLink, said the website had been caught in a legal loophole which had caused problems globally.

“Prediction markets aren’t financial markets, and they’re not gambling, but the legislation is very binary. You’re either gambling or you’re a financial market.” . . .

The purpose of the website was to bring private information into the public domain, Todd said, meaning it was unreasonable to ask users to give their identities. A number of staff around Parliament, as well as MPs, are known to have used iPredict. . . .

Boston Stan meetup 1 Dec

Here’s the announcement:

Using Stan for variational inference, plus a couple lightning talks

Dustin Tran will give a talk on using Stan for variational inference, then we’ll have a couple lightning (5-minute-ish) talks on projects. David Sparks will talk, I will talk about some of my work, and we’re looking for 1-2 more volunteers. If you have a project or an idea for a project using Stan you’d like to talk about, let me know!

Thanks to RStudio for sponsorship of the meeting.


Gary McClelland agrees with me that dichotomizing continuous variables is a bad idea. He also thinks my suggestion of dividing a variable into 3 parts is a mistake.

In response to some of the discussion that inspired yesterday’s post, Gary McClelland writes:

I remain convinced that discretizing a continuous variable, especially for multiple regression, is the road to perdition.

Here I explain my concerns. First, I don’t buy the motivation that discretized analyses are easier to explain to lay citizens and the press. Second, I believe there is an error in your logic for computing the relative efficiency for splitting into three groups. Third, and most importantly, dichotomizing or trichotomizing two or more continuous variables in a multiple regression is an especially bad idea. In such cases, the loss of efficiency is irrelevant because the discrete predictor variables have a different correlation than the continuous variables. As a consequence, the parameter estimates from the discrete analysis are biased. I’ll explain all three issues in some detail.

1. I just don’t buy the motivating issue—that the essence of a regression analysis can’t be explained to lay people. Explaining a regression result in terms of differences between averages is fine with me, but that doesn’t require a dichotomized analysis. We assume there is some true population difference between the average case in the top group (whether that be the top half, top third, top 27%, top quarter) and the average case in the bottom group. Let’s call those two population means muH and muL (for high and low). Our goal is to estimate that population mean difference. We, as statisticians, have two (at least) ways to estimate that mean difference muH – muL.

a. We do the split and compute the corresponding averages ybarH and ybarL and our estimate of muH – muL is ybarH – ybarL.

b. We regress y on x, as originally measured, to obtain yhat = b0 + b1 x. Then we estimate muH – muL using (b0 + b1 xbarH) – (b0 + b1 xbarL) = b1(xbarH – xbarL).

Both are unbiased estimates of muH – muL and both can be described as “our data estimates the difference between the average person in the top group in the population and the average person in the bottom group in the population is …” The only difference between the two methods is that the variance of the estimate in (a) is greater than the variance of the estimate in (b). That implies that there will be many times when the estimate in (a) is either higher or lower than the estimate in (b). Hence, the two analyses will seldom agree on the magnitude of the raw effect. That gives researchers another degree of freedom to report the estimate that better fits their argument. We should use the more precise regression estimate (b) and explain it in terms of a mean difference between high and low groups. If we are communicating to a lay group we should give them our best estimate and that is b1(xbarH – xbarL). We don’t need to explain to them how we got our estimate of muH – muL unless they ask. And even then the explanation isn’t that difficult: “We compared our prediction for a person with an average score in the top group to our prediction for a person with an average score in the low group.”
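To make the two estimators concrete, here is a minimal R sketch (my illustration, not McClelland’s code; the simulated data and the effect size are made up):

# Simulated example: continuous predictor x, outcome y
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
high <- x > median(x)   # median split into high and low groups

# (a) difference between the group means of y
est_a <- mean(y[high]) - mean(y[!high])

# (b) regression estimate: b1 * (xbarH - xbarL)
b1 <- coef(lm(y ~ x))["x"]
est_b <- b1 * (mean(x[high]) - mean(x[!high]))

c(group_means = est_a, regression = unname(est_b))

Repeating this over many simulated datasets shows both estimates centered on the same target, with the group-means version (a) noisier than the regression version (b).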

2. Using extreme groups in a three-way split of a single continuous predictor: The error in your mathematical analysis, I believe, is the assumption that the residual variance remains constant and is therefore the same in Eq. 4 and Eq. 5 in Gelman & Park (2008). It is easy to disprove that assumption. The residual variance is V(e) = V(Y)(1-r^2). Discretizing changes the correlation between X and Y. Furthermore, restricting the values of Y to cases that have extreme values of X will necessarily increase V(Y). The exception is that when r = 0, V(Y) will be unchanged. Hence, your claims about the relative efficiency of extreme groups apply if and only if r = 0. In an attached Mathematica notebook (also included as a pdf if you don’t use Mathematica) and an attached R simulation file, I did a detailed analysis of the relative efficiency for different values of b in the model Y = b X + e. This graph summarizes my results:


The curves represent the relative efficiency (ratio of the variances of the estimates of the slopes) for, top to bottom, slopes of b = 0, 0.25, 0.5, 1, 2, and 3. Given the assumption that V(e) = 1 in the full data, these correspond to correlations, respectively, of r = 0, 0.24, 0.45, 0.71, 0.89, and 0.95. The top curve corresponds to your efficiency curve for the normal distribution in your Figure 3. And, as you claim, using an extreme group split (whether the keep fraction is 0.2, 0.25, 0.27, or 0.333) is superior to a median split at all degrees of relationship between X and Y. However, relative efficiency declines as the strength of the X,Y relationship increases. Note also that the optimal fraction to keep shifts lower as the strength of the relationship increases.
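For readers who want to check this kind of claim themselves, here is a rough R simulation in the same spirit (my sketch, not McClelland’s attached code; the sample size, number of replications, and slopes are arbitrary):

# Relative efficiency of the extreme-thirds comparison vs. the regression
# estimate, as a function of the true slope b in y = b*x + e
set.seed(1)
rel_eff <- function(b, n = 500, reps = 1000, keep = 1/3) {
  est_reg <- est_split <- numeric(reps)
  for (r in 1:reps) {
    x <- rnorm(n)
    y <- b * x + rnorm(n)
    lo <- x <= quantile(x, keep)
    hi <- x >= quantile(x, 1 - keep)
    b1 <- coef(lm(y ~ x))["x"]
    est_reg[r] <- b1 * (mean(x[hi]) - mean(x[lo]))  # uses all the data
    est_split[r] <- mean(y[hi]) - mean(y[lo])       # extreme groups only
  }
  var(est_reg) / var(est_split)
}
round(sapply(c(0, 0.5, 1, 2), rel_eff), 2)

The ratio should drop as b grows, consistent with the ordering of the curves described above.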

Are these discrepancies important? For me and my colleagues in the social sciences, I decided the differences were of interest to geeky statisticians like you and me but probably not of practical importance. Within the range of most social science correlations (abs(r) < 0.5), the differences in the efficiency curves are trivial. And if a social scientist felt compelled for reasons of explainability to discretize the analysis, then I certainly agree that doing an extreme-groups analysis is preferable to doing a median split. However, if a researcher studying a physical industrial process (where the correlation is likely very high and high precision is desired) were tempted to do an extreme-groups analysis because it would be easier to explain to upper management, I would strongly advise against it. The relative efficiency is likely to be extremely low. On the right axis I’ve indexed the factor by which the sample size would need to be increased to compensate for the loss of efficiency. The price to be paid is extremely high.

3. When two or more correlated variables are analyzed via multiple regression, discretizing a continuous variable is a particularly bad idea not only because of reduced efficiency, but more importantly because discretizing changes the correlational structure of the predictors, and that leads to bias in the parameter estimates. Most of the discussion in the set of median split papers in JCP concerned whether one could get away with splitting a single continuous variable that was to be analyzed in multiple regression with another continuous variable or as a covariate in an ANCOVA design. We thought both the considerable loss of power and the induced bias as a function of the predictor correlation were adequate reasons to reject such dichotomizations. I will be interested to see what your take is on that. However, I believe that doing an analysis with two discretized variables, whether by median splits or by “thirds,” is a terrible idea because of the bias it induces. For median splits of two predictors with a bivariate normal distribution with correlation rho = 0.5, I can show analytically that the correlation between the dichotomized predictors will be 0.33, resulting in a confounding of the estimated slopes. Specifically, the estimated slopes become b1 = (5 b1 + b2)/6 and b2 = (b1 + 5 b2)/6. That is not good science. In the case of trichotomizing the two predictors and then using the extreme four corners of the 3 x 3 cells, I can show analytically that the predictor correlation INCREASES from 0.5 to 0.7. You can see why the correlation is enhanced in the bivariate distribution with correlation rho = 0.5 in this contour plot:

Using only the four extreme cells makes the correlation appear stronger.

I haven’t yet worked through analytically the bias this will cause, but I have experimented with simulations and observed that there is an enhancement bias for the coefficients. If one coefficient is larger than the other, then the value of the larger coefficient is unbiased but the value of the smaller coefficient is increased (I’ve been working with all positive coefficients). For example, when predictors x and z are from a bivariate normal distribution with correlation 0.5 and the model is y = x + 2 z + e, then the cut analysis yields coefficient estimates of 1.21 and 2.05. The 21% enhancement of the smaller coefficient isn’t just bad science, it isn’t science at all. The source of the problem can be seen in the comparison of two tables. The first table shows the predicted means from the regression equation for the full model, applied to the actual cell means for x and z.

-3.7 1.2
-1.2 3.7

The following table shows the mean y values for each cell (equivalently, the model derived from the cut variables).

-4.0 1.0
-1.0 4.0

In other words, the cut analysis exaggerates the differences in the cell means. This arises because the cut analysis forces a false orthogonal design. This is confounding in the same sense that bad experimental designs confound effect estimates.
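Here is a rough R sketch of that two-predictor example (my reconstruction from the description above, not McClelland’s code; the exact coding of the cut variables is my guess):

library(MASS)  # for mvrnorm
set.seed(2)
n <- 200000
XZ <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
x <- XZ[, 1]; z <- XZ[, 2]
y <- x + 2 * z + rnorm(n)

# Trichotomize both predictors and keep only the four corner cells
tercile <- function(v) cut(v, quantile(v, c(0, 1/3, 2/3, 1)),
                           labels = FALSE, include.lowest = TRUE)
d <- data.frame(x, z, y, xc = tercile(x), zc = tercile(z))
d <- subset(d, xc != 2 & zc != 2)

# Predictor correlation among the kept cases is well above 0.5
cor(d$x, d$z)

# "Cut" analysis: replace each predictor by its low/high group mean,
# so the slopes are on the same scale as the continuous analysis
d$x_cut <- ave(d$x, d$xc)
d$z_cut <- ave(d$z, d$zc)
coef(lm(y ~ x_cut + z_cut, data = d))  # smaller slope inflated relative to 1
coef(lm(y ~ x + z))                    # full continuous data: about (1, 2)

The exact numbers will differ from run to run, but the pattern matches the discussion above: restricting to the corner cells inflates the apparent correlation between the predictors and distorts the smaller coefficient.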

A particularly disturbing example is the model y = x + 0*z + e, for which the coefficients from the cut analysis are 0.96 and 0.11, a spurious effect for z. This can be seen in the table of means for the four groups:

-1.3 -1.1
1.0 1.3

In fact, the columns should have been identical, as in

-1.22 -1.23
1.22 1.21

consistent with the null effect for z. This spurious effect is akin to the problem of spurious effects due to dichotomizing two variables identified by Maxwell & Delaney (1993).

In short, discretizing two continuous predictors has no place in the responsible data analyst’s toolbox. At the end of section 2.8 you describe doing the appropriate full analysis as an option. I strongly disagree this is optional—it is a requirement.

You present bivariate analyses across years both for a continuous analysis (Figure 5) and an extreme-groups analysis (Figure 6). If both ways of estimating the effects were unbiased and equally efficient, we would expect the rank order of a given effect across years to remain the same as well as the rank order of the three effects for a given year to remain constant. Neither seems to be the case. The differences are not large relative to the standard error so perhaps these differences are just due to the increased variability of the discretized estimates. However, if religious attendance and income are correlated and especially if the degree of this correlation changes over the years, then I suspect that some of the differences between Figures 5 and 6 are due to bias induced by using discretized correlated predictors. I think the logits of Figure 5 transformed back to probability differences would have been more appropriate and no more difficult to explain.

I also am attaching a 5th paper in the JCP sequence—our effort at a rebuttal of their rebuttal that we posted on SSRN.

For the quick version, here’s McClelland’s note, which begins:

Gelman & Park (2008) argue that splitting a single continuous predictor into extreme groups and omitting the middle category produces an unbiased estimate of the difference and, although less efficient than using the continuous predictor, is less destructive than the popular median split. In this note I show that although their basic argument is essentially true, they overstate the efficiency of the extreme splits. Also, their claims about optimal fractions for each distribution ignore a dependency of the optimal fraction on the magnitude of the correlation between X and Y.

In their Equations 4 and 5, Gelman & Park assume that the residual variance of Y is constant. It is easy to show that is not the case when discretizing a continuous variable, especially when using extreme groups. . . .

I don’t have time to look at this right now, but let me quickly say that I prefer to model the continuous data, and I consider the discretization to just be a convenience. I’ll have to look at McClelland’s notes more carefully to see what’s going on: is he right that we were overstating the efficiency of the comparison that uses the discretized variable? Stay tuned for updates.

P.S. I don’t want to make a big deal about this, but . . . this is the way to handle it when someone says you made a mistake in a published paper: you give them a fair hearing, you don’t dismiss their criticisms out of hand. And it’s more than that: if you have a reputation for listening to criticism, this motivates people to make such criticisms openly. Everybody wins.

Beyond the median split: Splitting a predictor into 3 parts

Carol Nickerson pointed me to a series of papers in the Journal of Consumer Psychology: first one by Dawn Iacobucci et al. arguing in favor of the “median split” (replacing a continuous variable by a 0/1 variable split at the median) “to facilitate analytic ease and communication clarity,” then a response by Gary McClelland et al. arguing that “there are no real benefits to median splits, and there are real costs” in the form of increased error, another reply by Derek Rucker et al. also arguing against the median split on the grounds of statistical efficiency, and finally another defense of the median split from Iacobucci et al.

My reply:

I think it does not generally make sense to break a variable into 2 parts. Breaking into 3 parts is better; see this paper from 2008. We recommend splitting a predictor into three parts and then coding the trichotomized variable as -1, 0, 1. This allows the ease of interpretation of the median split (you can just compare the upper to lower parts) but with a great increase in efficiency.

Key figure:


For pure efficiency it’s still best to keep the continuous information, of course, but if you have a goal of clarity in exposition and you are willing to dichotomize, I say: trichotomize instead.
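For concreteness, here is a small R sketch of the trichotomize-and-code idea (my illustration with simulated data, not code from the paper):

# Split a continuous predictor into thirds and code as -1, 0, 1
trichotomize <- function(x) {
  cuts <- quantile(x, c(1/3, 2/3))
  ifelse(x <= cuts[1], -1, ifelse(x >= cuts[2], 1, 0))
}

set.seed(1)
x <- rnorm(500)
y <- 0.8 * x + rnorm(500)
x3 <- trichotomize(x)

# With -1/0/1 coding and equal-sized thirds, twice the slope is
# (approximately) the difference between the upper and lower thirds
fit <- lm(y ~ x3)
2 * coef(fit)["x3"]
mean(y[x3 == 1]) - mean(y[x3 == -1])   # the same comparison, done directly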

We used this trick in Red State Blue State to compare incomes of rich and poor voters, and rich and poor states, without having to run regressions (which I felt would add an unnecessary level of complexity in exposition). Indeed, David Park and I performed the analysis that led to this paper as a side project during the writing of Red State Blue State.

I had the impression this idea of trichotomizing was well known in psychometrics (when we got our paper back from the journal reviewers, they pointed us to a paper by Cureton from 1957 and papers from Kelley in 1928 and 1939!) so I was surprised to see the above papers which were all about dichotomizing. In many applications, though, I’ve seen people dichotomize in ways that seem wasteful and inappropriate. So I guess it’s necessary for experts to continue to write papers explaining why dichotomizing is generally a bad idea.

I suspect that part of the motivation for dichotomizing is that people like to think deterministically. For example, instead of giving people a continuous depression score and recognizing that different people have different symptoms at different times, just set a hard threshold: then you can characterize people deterministically and not have to think so hard about uncertainty and variation.

As Howard Wainer might have said at some point or another: people will work really hard to avoid having to think.

The median split indeed: what a joke! Truly a crime to throw away data like that, unless you really have way more than you need.

I already know who will be president in 2016 but I’m not telling

Nadia Hassan writes:

One debate in political science right now concerns how the economy influences voters. Larry Bartels argues that Q14 and Q15 [the 14th and 15th quarters of the presidential term, i.e., election-year economic growth] impact election outcomes the most. Doug Hibbs argues that all 4 years matter, with later growth being more important. Chris Wlezien claims that the first two years don’t influence elections but the second two do.

After 2000, Larry Bartels and John Zaller used Bayesian model averaging (BMA) to assess why election models overestimated the Gore vote in their pre-election forecast. Erikson, Bafumi and Wilson (2002) had some beefs with their arguments and conclusions. But both groups agreed BMA is useful for understanding elections. Do you think BMA could help evaluate these arguments and the uncertainties surrounding them?

My reply:

There are interesting debates on how the economy influences voters and also how presidents influence the economy, and in both cases the question of timing comes up. With regard to the particular question above, I haven’t looked at the details myself, but my guess is that the data at hand would not be enough to decide among these various theories. Ultimately, “N” is not large, and you have to use outside criteria to decide what model to go with. I doubt that Bayesian model averaging will give you much here. But I guess it won’t hurt.

Top 9 questions to ask a statistician

If a study is worth a mention, it’s worth a link

Gur Huberman points to this op-ed entitled “Are Good Doctors Bad for Your Health?” and writes:

Can’t the NYT provide a link or an explicit reference to the JAMA Internal Medicine article underlying this OpEd? A reader could then access the original piece and judge its credibility for himself

I replied: Yes, very tacky of the author not to even mention the authors of the study, nor to give a title or a link. And poor practice of the NYT editors not to demand this. Also ironic that one of his policy recommendations is to “require that doctors provide patients with data about a procedure, including its rate of success, complications and the like, before every major intervention”—but he can’t be bothered even to provide a link to the study.

P.S. I’m not saying that every fact or even every obscure reference in an article needs to be referenced. Readers can always use Google and Wikipedia on their own. But when the entire basis of a column is a published study, it’s poor form not to link, and even poorer form not to say who did the study.

P.P.S. Sometime after I posted this, the NYT editors slipped in a link to the original study, which is by Anupam Jena, Vinay Prasad, Dana Goldman, and John Romley. See here for the original version of the op-ed which had no link. The newspaper version of course had no link, nor did it name any of the study’s authors.

Flatten your abs with this new statistical approach to quadrature


Philipp Hennig, Michael Osborne, and Mark Girolami write:

We deliver a call to arms for probabilistic numerical methods: algorithms for numerical tasks, including linear algebra, integration, optimization and solving differential equations, that return uncertainties in their calculations. . . . We describe how several seminal classic numerical methods can be interpreted naturally as probabilistic inference. We then show that the probabilistic view suggests new algorithms that can flexibly be adapted to suit application specifics, while delivering improved empirical performance. We provide concrete illustrations of the benefits of probabilistic numeric algorithms on real scientific problems from astrometry and astronomical imaging, while highlighting open problems with these new algorithms. Finally, we describe how probabilistic numerical methods provide a coherent framework for identifying the uncertainty in calculations performed with a combination of numerical algorithms (e.g. both numerical optimisers and differential equation solvers), potentially allowing the diagnosis (and control) of error sources in computations.

“A call to arms,” huh? That sounds very British indeed. In any case, this all seems very important, the idea of treating computing problems as statistical problems. Worth a look.

Benford lays down the Law


A few months ago I received in the mail a book called An Introduction to Benford’s Law by Arno Berger and Theodore Hill. I eagerly opened it but I lost interest once I realized it was essentially a pure math book. Not that there’s anything wrong with math, it just wasn’t what I wanted to read.

But, hey, the book got reviewed by Frank Benford himself (well, actually, Frank Benford the grandson of Frank Benford himself), and the review has some math too. Should be of interest to some of you.

P.S. On the applied end, Ethan Rouen sends along a recent paper, Financial statement errors: evidence from the distributional properties of financial statement numbers, in which Dan Amiram, Zahn Bozanic, and Rouen “use Benford’s Law to measure the amount of error contained in a firm’s financial statements and show how the law can be used to help predict fraud.”
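To give a flavor of the approach, here is a small R sketch (my illustration with made-up numbers, not the paper’s procedure or data) comparing observed first-digit frequencies with the Benford distribution P(d) = log10(1 + 1/d):

benford <- log10(1 + 1 / (1:9))

first_digit <- function(x) {
  x <- abs(x[x != 0])
  floor(x / 10 ^ floor(log10(x)))
}

# Made-up "financial statement" amounts: lognormal, which is close to Benford
set.seed(1)
amounts <- exp(rnorm(5000, mean = 8, sd = 2))
fd <- first_digit(amounts)
observed <- as.numeric(table(factor(fd, levels = 1:9))) / length(fd)

round(cbind(digit = 1:9, benford = benford, observed = observed), 3)
mean(abs(observed - benford))   # one simple summary of deviation from Benford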

Here’s an example:


Which somehow reminds me of this classic plot:


4 California faculty positions in Design-Based Statistical Inference in the Social Sciences

This is really cool. The announcement comes from Joe Cummins:

The University of California at Riverside is hiring 4 open rank positions in Design-Based Statistical Inference in the Social Sciences. I [Cummins] think this is a really exciting opportunity for researchers doing all kinds of applied social science statistical work, especially work that cuts across traditional disciplinary boundaries.

Relevant disciplines include, but are not limited to, Business, Economics, Education, Medicine (Epidemiology and Public Health), Political Science, Public Policy, Sociology, and Statistics/Biostatistics. We seek candidates who excel at developing, testing, and applying cutting-edge research designs and statistical methods for causal identification. Successful candidates might make theoretical and methodological contributions to causal inference, develop novel experimental designs, conduct Bayesian meta-analysis, program evaluation, applied econometrics, or political methodology, and will show an interest to work across traditional disciplines and ability to attract extramural funds. Review of the applications will begin January 8, 2016 and will continue until the position is filled. Senior application materials should be submitted to Junior application materials should be submitted to

UCR is embarking on a major new hiring initiative that will add 300 tenured and tenure-track faculty in 33 cross-disciplinary areas and invest in research infrastructure to support their work. This initiative will build critical mass in vital and emerging fields of scholarship, foster truly cross-disciplinary work and further diversify the faculty at one of America’s most diverse research universities.

Here’s the job posting.

Stan Puzzle 2: Distance Matrix Parameters

This puzzle comes in three parts. There are some hints at the end.

Part I: Constrained Parameter Definition

Define a Stan program with a transformed matrix parameter d that is constrained to be a K by K distance matrix. Recall that a distance matrix must satisfy the definition of a metric for all i, j:

* positivity :  d[i,j] >= 0

* self distance :  d[i,j] = 0 iff i = j

* symmetry :  d[i,j] = d[j,i]

* triangle inequality :  d[i,j] <= d[i,k] + d[k,j] for all k

Part II: Modeling Assumptions

Now suppose there are noisy measurements y[n] of the distance between points ii[n] and jj[n]. This corresponds to a Stan program with the following data block.

data {
  int K;  // num dims
  int N;  // num observations
  int ii[N];  // first point of observation n
  int jj[N];  // second point of observation n
  real y[N];  // distance measurements (could bound below by 0)
}

A likelihood that assumes independent normal noise on the measurements is defined as follows.

model {
  for (n in 1:N)
    y[n] ~ normal(d[ii[n], jj[n]], sigma);
}

This assumes a positive-constrained parameter sigma for the noise scale.

parameters {
  real<lower=0> sigma;  // measurement noise scale
}

Feel free to give it a prior taking into account the distance scales involved in the problem and the measurement process.

There are then (K choose 2) free parameters, which may be (a) left unconstrained, (b) constrained to be positive, or (c) transformed to a proper distance matrix satisfying the distance metric conditions. What is the effect on the posterior of these three modeling assumptions?

Part III: Complexity

What’s the complexity of evaluating the likelihood with and without the distance-matrix constraint on the parameters?




Hint 1 (I) :  There should be (K choose 2) [i.e., K * (K – 1) / 2] “raw” parameters (either unconstrained or constrained to be positive) which then need to be scaled.

Hint 2 (I) :  See the manual chapter “Transformation of Constrained Variables” and in particular the section on lower and upper bounded variables if you want hints on the Jacobians necessary.

Hint 3 (I) :  Order the transform so the Jacobian’s easy to compute.

Hint 4 (II) :  The only way I know how to do this is by simulation, but a bunch of the other Stan developers probably just know the answer.

Hint 5 (III) :  The answer should be an expression involving N and K, such as {\mathcal O}(f(N,K)) for some function f.

Hint 6 (III) :   Count the number of expressions that get evaluated in the transform and in the model block.


Tip o’ the iceberg to ya


Paul Alper writes:

The Washington Post ran this article by Fred Barbash with an interesting quotation:

“Every day, on average, a scientific paper is retracted because of misconduct,” Ivan Oransky and Adam Marcus, who run Retraction Watch, wrote in a New York Times op-ed in May.

But, can that possibly be true, just for misconduct alone and not including honest mistakes? Science researchers would then seem to belong in the same league as used car dealers.

My reply: I don’t give two poops about Dr. Anil Potti but, sure, with over a million scientific papers published a year, I’d think that we should be seeing tens of thousands of retractions a year. Just as a point of reference, I’ve published a few hundred research papers and retracted two of them (or I’ve run corrections which I consider the equivalent of retraction). I think I’m a pretty good scientist so I’m guessing that other people’s retraction rates should be higher than my approx 1/2 of 1%.

If tens of thousands of papers should be retracted each year, that comes to something like 100 a day. So if only 1 paper a day is retracted, my guess is that something like 99% of the papers that should be retracted, aren’t.

I like the Monkey Cage

The sister blog is a good place to reach a wider audience, and our co-bloggers and guests have interesting posts on important topics, but what I really like about our blog at the Washington Post is its seriousness and its political science perspective.

For better or worse, political science does not have a high profile in the news media or, I think, in the public policy world. We sometimes get quoted on some technical matters regarding campaigns, polls, and elections, but, other than that, political scientists’ inputs into the public discourse are sporadic. Sure, there was “bowling alone,” and “broken windows,” and “the clash of civilizations,” but I don’t get a sense that, when political issues come up, political scientists are the go-to sources for explanation and commentary.

That might be just fine. Maybe the contributions of political science are pretty much irrelevant to politics. It’s not my place to say that news coverage of these issues should change.

But one thing this all does mean is that when political scientists do enter the discussion, it can be in a weird, contextless way, not representative of the field of political science and not necessarily making much sense at all.

One problem, I suspect, is that because poli sci is not on journalists’ radar screen, they have no way of evaluating the input they get from political scientists. It’s as if they were reviewing literature in translation from some distant northern land.

I was thinking about this the other day after coming across a horrible op-ed in the New York Times written by someone named Gerard Alexander who is labeled as a political scientist.

I have nothing against Gerard Alexander either as a person or a political scientist. I did not happen to be familiar with his work but I see on Google that it’s been cited many times, and it might be excellent. He perhaps was having a bad day when writing that op-ed.

But the editors at the NYT had no way of evaluating this piece. It was by a political scientist, and . . . who knows, really?

I took the opportunity on the Monkey Cage to extract something useful (I hope) out of Alexander’s op-ed, as to me it was an interesting, if unwitting, example of political polarization.

And this brings me to the point of today’s post. In the Monkey Cage, political science is presented in context. I’m not saying every post on the Cage is perfect, far from it. Some recent Monkey Cage claims that I’ve publicly disagreed with include “Liberals smell better to other liberals than to conservatives,” “Here’s how a cartoon smiley face punched a big hole in democratic theory,” and “Could non-citizens decide the November election?” But that’s fine: research is messy, and we also have room to give second opinions. But there is a context, there’s a steady stream of political science, rather than what we see in the NYT op-ed page, which is random things that happen to pop in.

I’m hoping the Monkey Cage will have a positive impact, not just at the Washington Post, but also at other media outlets, as editors start to sense that political science is a field of research and scholarship and not just an excuse for people to polemicize.

First, second, and third order bias corrections (also, my ugly R code for the mortality-rate graphs!)


As an applied statistician, I don’t do a lot of heavy math. I did prove a true theorem once (with the help of some collaborators), but that was nearly twenty years ago. Most of the time I walk along pretty familiar paths, just hoping that other people will do the mathematical work necessary for me to fit my models (for example, taking care of all the intricacies of implementing differential equation models in Stan, or developing the mathematical tools necessary to derive algorithms to sample from difficult distributions).

Every once in a while, though, I’m reminded that a baseline level of mathematical expertise allows me (and others with similar training) to see problems from a distance and resolve them as necessary. This sort of mathematical skill can be nearly invisible while it is being applied, and even afterward it’s not always apparent what was being done.

Mathematical understanding can be used not just to solve a well-formulated problem; it also helps us decide what problems are worth solving in the first place.

I thought of this general point after some back-and-forth regarding a recently published article by Anne Case and Angus Deaton on trends in death rates. If you haven’t been following this story on the blog, you can read my recent Slate article for some background.

The study was first summarized as an increase in death rates for 45-54-year-old non-Hispanic white Americans (see, for example, Ross Douthat and Paul Krugman), but after “age adjustment”—that is, correcting for the change in age distribution, standardizing to a common distribution of ages—the pattern looks much different. We then learned more by looking at other ages and breaking up the data for men and women. The biggest part of the story is a comparison to mortality trends in other countries, but I won’t get into that now. Here I’ll be focusing on the U.S. data.

First-order correction

What I want to talk about is the value of a mathematical understanding of different sorts of bias correction, a kind of thinking that is known by many statisticians but is rarely part of the formal curriculum—we learn it “on the street,” as it were.

Let’s start with a first-order bias. Here’s a graph of #deaths among 45-54-year-old non-Hispanic whites in the U.S., based on data taken directly from the CDC website:


But that’s just raw number of deaths. The population is increasing too. Let’s take a look:


Hey—the population increased and then decreased in this age group! That’s the baby boomers entering and leaving the 45-54 category. Anyway, this population pattern tracks pretty closely to the #deaths pattern.

Looking at trends in number of deaths without adjusting for population is like looking at nominal price trends without adjusting for inflation. It’s a first-order bias, and (almost) everyone knows not to do it.

Second-order correction

So the natural step is to look at changes in mortality rate, #deaths/#people in this group:


But then we have to worry about another bias. As noted above, the baby boom generation was moving through, and so we’d expect the average age among 45-54-year-olds to be increasing, which all by itself would lead to an increase in mortality rate via the aging of the group.

Let’s check:


As expected, the 45-54-year olds are getting older. But what’s happening with 2001? Is that for real? Let’s just double-check by pulling off ages from another dataset:


Yup, it seems real. Just quickly, let’s consider 2001. 2001-55=1946, and the jumpiness of the lines at the start of the above graph is tracking corresponding jumps in the number of babies born each year during the 1940s.

OK, the next question is: How would the change in age distribution affect the death rate in the 45-54 category? In other words, what is the bias in the above raw mortality curve, due to age composition?

We can do a quick calculation by taking the death rate by single year of age in 1999 and using this along with each year’s age distribution to track the mortality rate in the 45-54 group, if there were no change in underlying death rates by age. Thus, all the changes in the graph below represent the statistical artifact of age composition:


Now let’s line up this curve with the changes in raw death rate:


About half the change can be attributed to aggregation bias.
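For anyone who wants to reproduce this kind of calculation, here is a sketch of the computation behind the artifact curve. The cdc data frame and its columns (year, age, deaths, pop for single years of age 45-54) are stand-ins I made up for illustration; the real script and data files are linked at the end of the post.

# Made-up stand-in for the CDC counts (for illustration only)
set.seed(1)
cdc <- expand.grid(age = 45:54, year = 1999:2013)
cdc$pop <- round(4e5 + 2e4 * (cdc$age - 45) + rnorm(nrow(cdc), 0, 1e4))
cdc$deaths <- rbinom(nrow(cdc), cdc$pop, 0.002 + 0.0002 * (cdc$age - 45))

# Death rates by single year of age in 1999 (rows are ordered by age within year)
rate_1999 <- with(subset(cdc, year == 1999), deaths / pop)

# Apply the fixed 1999 age-specific rates to each year's age distribution:
# any movement in this series is purely an age-composition artifact
artifact <- sapply(split(cdc, cdc$year),
                   function(d) sum(rate_1999 * d$pop) / sum(d$pop))
round(artifact * 1e5)  # deaths per 100,000 if age-specific rates had frozen at 1999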

We can sharpen this comparison by anchoring the expected-trend-in-death-rate-just-from-changing-age-composition graph at 2013, the end of the time series, instead of 1999. Here’s what we get:


And here it’s clear: since 2003, all the changes in raw death rate in this group can be explained by changes in age composition.

The much-heralded increase in death rates among middle-aged non-Hispanic white Americans happened entirely in the first part of the series.

In summary so far: this adjustment for changes in age composition is a second-order bias correction, less important than the first-order correction for raw population changes but large enough to qualitatively change the trend story.

Third-order correction

Now that we’ve identified the bias, we can correct by producing age-adjusted death rates: for each year in time, we take the death rates by year of age and average them, thus computing the death rate that would’ve been observed had the population distribution of 45-54-year-olds been completely flat each year.
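Continuing the same sketch (still using the made-up cdc data frame from the code above, not the real CDC files): the simplest age adjustment just averages the single-year-of-age rates with equal weights, and substituting the 1999 or 2013 age distribution as weights gives the alternative adjustments discussed below.

# Age-adjusted rate with uniform weights over ages 45-54
age_adjusted <- sapply(split(cdc, cdc$year), function(d) mean(d$deaths / d$pop))

# The same adjustment standardized to the 2013 age distribution instead
w_2013 <- with(subset(cdc, year == 2013), pop / sum(pop))
adj_2013 <- sapply(split(cdc, cdc$year),
                   function(d) sum(w_2013 * d$deaths / d$pop))

round(cbind(uniform = age_adjusted, standard_2013 = adj_2013) * 1e5)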

The age-adjusted numbers show an increasing death rate until 2003-2005 and then a steady state since then:


But this is only one way to perform the age adjustment. Should we be concerned, with Anne Case, that “there are a very large number of ways someone can age-adjust this cohort” and that each method comes “with its own implicit assumptions, and that each answers a different question”?

The answer is no, we need not be so concerned with exactly how the age adjustment is done in this case. I’ll show this empirically and then discuss more generally.

First the empirics. I performed three age adjustments to these data: first assuming a uniform distribution of ages 45-54, as shown above; second using the distribution of ages in 1999, which is skewed toward the younger end of the 45-54 group; and third using the 2013 age distribution, which is skewed older.

Here’s what we found:


The results don’t differ much, with no change in the qualitative trends and not much change in the numbers either.

It’s important to do some age adjustment, but it doesn’t matter so much exactly how you do the age adjustment. In math jargon, age-adjustment corrects a second-order bias, while the choice of age adjustment represents a third-order correction.

That’s why, when I did my analysis a week or so ago, I performed a simple age adjustment. Based on my statistical experience and general mathematical understanding, I had a sense that the choice of age adjustment was a third-order decision that really wouldn’t have any practical implications. So I didn’t even bother to check. I did it here just for the purpose of teaching this general concept, and also in response to Case’s implication that the whole age-adjustment thing was too assumption-laden to trust. Case was making the qualitative point that any adjustment requires assumptions; I’m making a quantitative analysis of how much these assumptions make a difference.

Lateral thinking

So far I’ve been focusing entirely on the headline trends in mortality among 45-54-year-old non-Hispanic whites. But there’s nothing stopping us from grabbing the data separately for men and women:


These separate age-adjusted trends tell a new and interesting story. All the bias correction in the world won’t get you there; you have to pull in new data.

To put it another way: Age adjustment was a necessary first step. But now that we’ve dealt with that, we can move forward and really start learning from the data.

We can also look at other ages and other groups; see here for some graphs.

Concerns about data quality

When I first heard about Case and Deaton’s paper, I didn’t think about age adjustment at all; I was alerted to the age aggregation bias by an anonymous commenter. More recently this commenter has raised skepticism regarding the ethnic categories in the CDC data. I haven’t checked this out at all but it seems worth looking into. Changes in categorization could affect observed trends.

Turtles all the way down.

Asking the question is the most important step

As I wrote the other day, the point of bias correction and data inspection is not “gotcha!” Rather, the point of correcting biases and questioning the data is that the original researchers are studying something interesting and important, and we want to help them do better.

And here’s the R script

I put my R code and all my data files here. You should be able to run the script and create all the graphs I’ve blogged.

Warning: the code is ugly. Don’t model your code after my practices! If any of you want to make a statistics lesson out of this episode, I recommend you clean the code. Meanwhile, perhaps the very ugliness of the script can give you a feeling of possibility, that even a clunky programmer like me can perform an effective graphical data analysis.

I have the feeling that Hadley could’ve done all of this analysis in about an hour using something like 20 lines of code.

There’s lots more that can be done; I’ve only looked at a small part of the available data. The numbers are public; feel free to do your own analyses.

Pareto smoothed importance sampling and infinite variance (2nd ed)

This post is by Aki

Last week Xi’an blogged about an arXiv paper by Chatterjee and Diaconis which considers the proper sample size in an importance sampling setting with infinite variance. I commented on Xi’an’s post, and the end result was my guest posting on Xi’an’s Og.

I made an additional figure below to summarise how the Pareto shape parameter and Pareto smoothed importance sampling are related to this. You can read more in our arXived paper with Andrew.

When MCMC got popular, the reasons why IS got less popular were 1) the problem of finding good proposal distributions and 2) the possibility of getting infinite variance for the weight distribution (which leads to infinite variance of the IS estimate). If the variance is finite, the central limit theorem holds and people have (mostly) assumed that everything is going to be fine, and if the variance is infinite people have assumed that all hope is lost.

Chen and Shao (2004) showed that the rate of convergence to normality is faster when more higher moments exist, so it’s useful to examine the existence of higher moments, too. Koopman, Shephard, and Creal (2009) proposed to make a sample based estimate of the existence of the moments using generalized Pareto distribution fitted to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of generalized Pareto distribution. Koopman, Shephard, and Creal (2009) focused on making a hypothesis test whether the variance is finite.

The following figure shows why it is useful to look at the continuous k value, instead of the discrete number of moments (and why Pareto smoothing is great). The proposal distribution was Exp(1) and the target distribution was Exp(θ) with varying values of θ. The figure shows results with basic IS (blue) and with Pareto smoothed IS (yellow). The vertical axis is the estimated mean divided by the true mean (i.e., values close to 1 are good). The violin plots present the distributions of the results from 1000 repeated simulations (with different random number seeds), each using 1000 draws from the proposal distribution. For each case the estimated Pareto shape value k is shown.

[Figure: IS vs. PSIS example]

The figure shows that when the variance is finite the errors are smaller, but the Pareto shape value k gives additional information about the distribution of errors in both finite and infinite variance cases (since we are using a finite number of samples, we don’t in practice observe infinite variance, but it is still a useful concept to describe asymptotic properties). The figure also shows that Pareto smoothing can reduce the error in all cases and that we can get useful estimates also when k≥1/2, although it can’t make miracles when the proposal distribution is too narrow compared to the target distribution. The PSIS estimate as described in our paper always has finite variance, but when k≥1/2 it will have some bias, which seems to be small for k<0.7.
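To make the simulation setup concrete, here is a minimal R sketch of the plain (unsmoothed) importance-sampling estimate for one value of θ; the smoothed version is what PSIS and the loo R package provide, and is not reproduced here.

# Proposal Exp(1), target Exp(theta); estimate the target mean, true value 1/theta
set.seed(1)
theta <- 0.3   # for theta < 0.5 the raw importance weights have infinite variance
S <- 1000
x <- rexp(S, rate = 1)                        # draws from the proposal
log_w <- dexp(x, rate = theta, log = TRUE) -
         dexp(x, rate = 1, log = TRUE)        # log importance ratios
w <- exp(log_w - max(log_w))                  # stabilize before normalizing
est <- sum(w * x) / sum(w)                    # self-normalized IS estimate
est * theta                                   # ratio of the estimate to the true mean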

Our PSIS-LOO paper with Andrew and Jonah shows one example of the benefit of PSIS and we’ll soon publish other examples where PSIS improves the computation.

“David Brooks And Jeremy Paxman To Judge The Golden Giraffes”


I don’t think I have much of a chance here, not because of the judging—I’d trust Brooks and Paxman to recognize good writing—but because the competition includes some heavy hitters, including Dan Davies with a meta-blog-post called The Verjus Manifesto, Sara Paretsky on The Detective As Speech, and Charles Pierce with . . . well, it almost doesn’t matter what, since anything by Charles Pierce is gonna be good. Hey—isn’t it cheating to include a professional writer on this list? It would be like Robin Williams coming on to the Gong Show to win a quick $516.32.

Other notable items on the list include Peter McCleery’s gimmicky-but-well-executed Thank You For Calling Mamet’s Appliance Centre and Evan Ratliff’s grabbily-titled-but-disappointing My Wife Found My Email In The Ashley Madison Database. (Spoiler: it was someone else who used his email address.)

Also a post by blog favorite Scott Adams. Not about Charlie Sheen, though.

Unfortunately it’s hard to quickly get a sense of all hundred articles cos you have to click through to 10 separate pages.

My own article on the list is a post from January on Psychology And Behavioural Economics, which begins stirringly:

I’ve been coming across these issues from several different directions lately, and I wanted to get the basic idea down without killing myself in the writing of it. So consider this a sketchy first draft.

Charles Pierce, it ain’t. Seriously, though, I appreciate being appreciated, and I appreciate that they’ll consider something that’s exploratory and bloggy, that’s not so polished. I like The Browser; they published two 5books interviews with me and one with my sister.

Top 10 vote getters get to be judged by David Brooks! You can vote here.

Just Filling in the Bubbles


Collin Hitt writes:

I study wrong answers, per your blog post today. My research focuses mostly on surveys of schoolchildren. I study the kids who appear to be just filling in the bubbles, who by accident actually reveal something of use for education researchers.

Here’s his most recent paper, “Just Filling in the Bubbles: Using Careless Answer Patterns on Surveys as a Proxy Measure of Noncognitive Skills,” which concludes:

Using two national longitudinal datasets, I [Hitt] show that careless answer patterns from adolescent respondents are negatively predictive of later educational attainment, independent of cognitive ability and other traditionally-measured noncognitive skills. I posit that careless answers, as I have quantified them, proxy as a behavioral measure of a negative noncognitive trait.

No graphs, though. Gotta work on the graphs. Need some graphs. graphs graphs graphs

Is 8+4 less than 3? 11% of respondents say Yes!

Shane Frederick shares some observations regarding junk survey responses:

Obviously, some people respond randomly. For open ended questions, it is pretty easy to determine the fraction who do so. In some research I did with online surveys, “asdf” was the most common and “your mama” was 9th. This fraction is small (maybe 1-2%). But the fraction of random responses is harder to identify (and is likely higher) for items with binary and multichotomous response options, since many respondents must realize their random responding can go undetected. Hence, you can’t use the random response rate from open ended questions to assess this. You can do other things to try to estimate it (like ask them “Is 8+4 less than 3?” YES NO). But two problems remain: the fraction saying YES is a blend of random and perverse responding and both of these things vary across items. Dramatically.

I put up a few questions on Google Consumer Surveys with large samples. Random + perverse response rates differ dramatically:

Do you have a fraternal twin? YES NO
4% Yes. *Pretty close to truth*

Do you have an identical twin? YES NO
8% Yes. *Pretty far from truth, but funnier to lie about?*

Is 8+4 less than 3? YES NO
11% Yes. *Profound innumeracy, confusion, or just fucking with me?*

Were you born on the planet Neptune? YES NO
17% Yes. *Perhaps using it metaphorically, as in “My friends say I’m a weird guy”?*

In a recently published paper I [Frederick] averred that you could just multiply the number of people who endorse something crazy by the number of response options to estimate the fraction of random responders. But this is obviously wrong.

So, basically, I’m not sure what to do. You could look at response latencies or something, but then you end up imposing some arbitrary thresholds which are unsatisfying, much like removing outliers without any good justification that the responses are not sincere.

My reply: These responses are hilarious. I believe there is some literature on this sort of thing but I’m not the expert on it. I’ve looked a bit into list experiments (you can search my blog; I have a post with a title like, A list of reasons not to trust list experiments), but there seems to be a lot of information on the actual responses. Maybe you could learn something by regressing these on demographics, and also by checking whether the same people who give wrong answers to some of these questions give wrong answers to others.