More on Software Validation

Andrew and I have both written here about our Software Validation paper with Don Rubin. The last thing to add on the topic is that my website now has newly updated software to implement our validation method (go down to Research Software and there are .zip and .tar versions of the R package). If the software you want to validate isn’t written in R, you can always write an R function that calls the other program. Feel free to contact me (cook@stat.columbia.edu) with questions, comments, etc.

Teaching Example

There was a fun little article in the New York Times a while back (unfortunately I can’t find it now and am missing some of the numbers, but the main idea still holds) about income differences across New York City’s five boroughs. Apparently the mean income in the Bronx is higher than in Brooklyn, even though Brooklyn has a smaller proportion of residents below the poverty line, a higher percentage of homeowners, and lower unemployment. Why is income higher in the Bronx, then? The reason, according to the article, is the New York Yankees: the Yankees’ salaries are so high that they make the whole borough look richer than it is.

(I’m not sure exactly how these income figures were calculated, since most of the Yankees probably don’t actually live in the Bronx, but let’s ignore that.) Obviously one should be comparing medians rather than means, which is where the teaching example comes in. I told my regression students this story last semester and someone asked about Queens, but I don’t think the Mets’ payroll even comes close to that of the Yankees (who, by the way, are a game behind the Red Sox).

Terrorism and Statistics

There was an interesting editorial in Sunday’s New York Times about the anxiety produced by terrorism and people’s general inability to deal rationally with said anxiety. All kinds of interesting stuff that I didn’t know or hadn’t thought about. Nassim Nicholas Taleb, a professor at UMass Amherst, writes that risk avoidance is governed mainly by emotion rather than reason, and our emotional systems tend to work in the short term: fight or flight; not fight, flight, or look at the evidence and make an informed decision based on the likely outcomes of various choices. Dr. Taleb points out that Osama bin Laden “continued killing Americans and Western Europeans in the aftermath of Sept. 11”: People flew less and drove more, and the risk of death in an automobile is higher than the risk in an airplane. If you’re afraid of an airplane hijacking, though, you’re probably not thinking that way. It would be interesting to do a causal analysis of the effect of the September 11 terrorist attacks on automobile deaths (maybe someone already has?).

3 Books

One of the more memorable questions I was asked when on the job market last year was “If you were stranded on a deserted island with only three statistics books, what would they be?”. (I’m not making this up.) If I were actually in that incredibly unlikely and bizarre situation, the best thing would probably be to just choose the three biggest books out there, in case I needed them for a fire or something. I’m pretty sure there’s no tenure clock on deserted islands. But I digress. What I said was:

1. Gelman, Carlin, Stern, and Rubin (Bayesian Data Analysis)
2. The Rice Book (Mathematical Statistics and Data Analysis, John Rice)
3. Muttered something about maybe a survey sampling book and quickly changed the subject.

If anyone ever asks me that again, I think I’ll change Number 3 to Cox’s The Planning of Experiments.

Pet Peeve

I was reading an article in the newspaper the other day (I think it was about Medicare fraud in New York state, but it doesn’t really matter) that presented some sort of result obtained from a “computer analysis.” A computer analysis? Regression analysis, even statistical or economic analysis, would give at least some vague notion of what was done, but the term computer analysis is about as uninformative as saying that the analysis was done inside an office building. It’s sort of like saying you analyzed data using Gibbs sampling, as opposed to saying what the model was that the Gibbs sampler was used to fit. Not untrue, but pretty uninformative.

Columbia Causal Inference Meeting

On June 20, we had a miniconference on causal inference at the Columbia University Statistics Department. The conference consisted of six talks and lots of discussion. One topic of discussion was the use of propensity scores in causal inference, specifically, discarding data based on propensity scores. Discarding data (e.g., discarding all control units whose propensity scores are outside the range of the propensity scores in the treated group) can reduce or eliminate extrapolation, a potential cause of bias if the treated and control groups have different distributions of background covariates. However, it’s sort of unappealing to throw out data, and can sometimes lead to treatment effect estimates for an ill-defined subset of the population. There was discussion on the extent to which modeling can be done using all available data without extrapolation. Other topics of discussion included bounds, intermediate outcomes, and treatment interactions. For more information, click here.

Baby Names

This is a really fun website.

You type in a name and it plots the popularity of the name since 1880. I of course first typed in my own name, and learned that it wasn’t very common (110th most popular) when I was born, but was very common (4th most popular) in the 1990’s. Which means that most of the Samanthas out there are much younger than I am. Does that mean people might expect me to be younger than I am because of my name? There are a lot of names that I associate with older people, but I can’t think of too many that I associate with young people. Maybe that’s just because I don’t know many kids, though.


Objective and Subjective Bayes

Turns out I’m less of an objective Bayesian than I thought I was. I’m objective, and I’m Bayesian, but not really an Objective Bayesian. Last week I was at the OBayes 5 (O for objective) meeting in Branson, MO. It turns out that most of the Objective Bayes research is much more theoretical than I am. I like working with data, and I just can’t deal with prior distributions that are three pages long, even if they do have certain properties of objectiveness.

I’m not really a subjective Bayesian either. Sometimes I am (if I have prior information I’m usually happy to use it), but there are some situations (clinical trials, for example) where it really is inappropriate or even unethical/illegal to use prior information. As far as vague prior distributions go, I’m a little bit sympathetic to the following argument: “You can’t really believe that you know the model that generated the data but at the same time have no idea about the values of any of the model parameters.” Sounds like a valid point. The problem is that I don’t really agree with either part of that last statement. Sure, I usually have at least some vague notion of plausible parameter values. BUT I also don’t really believe that whatever model I’m using is the exact model that generated the data. Using a vague prior distribution then seems kind of like hedging your bets.

And that’s one of the cool things about Bayesian statistics. If you have prior information you can use it, but if you don’t have it you can still often perform an objective analysis using Bayesian methods. I think an important potential research topic is how to create objective (vague, diffuse, whatever) prior distributions that are more easily implementable in complex problems. Click here for a summary of my own stab at this problem.


From George Box

I recently read George Box’s paper “Sampling and Bayes’ Inference in Scientific Modelling and Robustness” (JRSSA, 1980). It’s a discussion paper, and I really liked his rejoinder. It starts like this:

“To clear up some misunderstandings and to set my reply in context, let me first make clear what I regard as the proper role of a statistician. This is not as the analyst of a single set of data, nor even as the designer and analyser of a single experiment, but rather as a colleague working with an investigator throughout the whole course of iterative deductive-inductive investigation. As a general rule he should, I think, not settle for less. In some examples the statistician is a member of a research team. In others the statistician and the investigator are the same person but it is still of value to separate his dual functions. Also I have tended to set the scene in the physical sciences where designed experiments are possible. I would however argue that the scientific process is the same for, say, an investigation in economics or sociology where the investigator is led along a path, unpredictable a priori, but leading to (a) the study of a number of different sets of already existing data and/or (b) the devising of appropriate surveys.

The objective taking precedence over all others is that the scientific iteration converge as surely and as quickly as possible. In this endeavour the statistician has an alternating role as sponsor and critic of the evolving model. Deduction, based on the temporary pretense that the current model is true, is attractive because it involves the statistician in “exact” estimation calculations which he alone controls. By contrast induction, resting on the idea that the current model may not be true, is messy and the statistician is much less in control. His role is now to present analyses in such a form, both numerical and graphical, as will accurately portray the current situation to the investigator’s mind, and appropriately stimulate his colleague’s imagination, leading to the next step. Although this inductive jump is the only creative part of the cycle and hence is scientifically the most important, the statistician’s role in it may appear inexact and indirect.

If he finds these facts not to his liking, or if his training has left him unfamiliar with them, the statistician can construct an imaginary world consisting of only the clean deductive half of the scientific process. This has undoubted advantages. A model dubbed true remains so, all alternative models are known a priori, the likelihood principle and the principle of coherence reign supreme. It is possible to devise rigid “optimal” rules with known operating characteristics which aspire to elevate the statistician from a mere subjective artist to an objective automaton. But there are disadvantages. Deduction alone is sterile–by cutting the iterative process in two you kill it. What is left can have little to do with the never-ending process of model evolution which is the essence of Science.”

Sabermetricians vs. Gut-metricians

There’s a little debate going on in baseball right now about whether decisions should be made using statistics (a sabermetrician is a person who studies baseball statistics) or instincts. Two books are widely considered illustrative of the two sides of the debate. Moneyball, by Michael Lewis, is about the Oakland A’s and their general manager Billy Beane. Beane, working with the second-lowest payroll in baseball in 2002, set out to put together an affordable team of undervalued players, using a lot of scouting and statistics. Three Nights in August, by Buzz Bissinger, is about St. Louis Cardinals’ manager Tony La Russa, and is seen by some as a counter to Moneyball, with La Russa relying much more on guts when making decisions.

There are two problems with this dichotomy. One is that the distinction just isn’t that sharp: Billy Beane also makes some gut-based decisions, and Tony La Russa looks at statistics. The other is that, as a general manager, Billy Beane makes more long-term, far-reaching decisions and has a lot more time to make them. It makes sense to look at a lot of statistics before deciding which players to sign. Tony La Russa is a manager (i.e., a coach), which means he’s making lots of quick decisions. Go for the steal? Bunt? Fastball, curveball, slider? You can’t be running regressions for each game-time decision. Baseball may be a slow game, but it’s not that slow. Any good manager goes into a game having studied both teams’ stats, and this knowledge surely affects decisions made during the game, but the immediacy of coaching-type decisions, compared to managerial decisions, just doesn’t allow for the same type of analytical methods.

GO SOX!!


Another one from the news

There’s a really interesting article in Slate by Steven D. Levitt and Stephen J. Dubner (the authors of Freakonomics) about female births and hepatitis B. The disproportionate number of male births in some Asian countries has been attributed to causes such as selective abortion and infanticide. But, as explained in the paper “Hepatitis B and the Case of the Missing Women”, by Harvard economics graduate student Emily Oster, hepatitis B infection rates actually explain a lot of the discrepancy. Pregnant women who have hepatitis B are more likely to bear sons than daughters, and hepatitis B is more common in those parts of the world where the proportion of male births is so high. Pretty cool.

Again, though, the reason I’m writing about the article doesn’t have much to do with its subject matter. What struck me more than anything were the article’s opening sentences:

“What is economics, anyway? It’s not so much a subject matter as a sort of tool kit–one that, when set loose on a thicket of information, can determine the effect of any given factor.”

Whoa! That’s a really bold statement! I won’t fuss about whether what’s in the tool kit is actually economics or statistics–methods like regression can be shared by many different fields, and there’s enough SAS for everyone. What’s more troubling is the idea that most any question can be answered by throwing enough regression-type analyses at a large data set. Regression isn’t magic, and causal inference is hard; while most statisticians and economists know that (and I’m not accusing Steven Levitt of not knowing that), we could maybe do a better job of convincing the rest of the world.

In an interesting twist, the New York Times review of Freakonomics ends by praising the work of Amartya Sen, whose work Levitt and Dubner use to introduce the lopsided sex ratio in Asia.

What This Woman Wants: Covariate Information

The current most-emailed article on the New York Times website is “What Women Want,” by op-ed columnist John Tierney. He’s writing about a working paper, “Do Women Shy Away from Competition?”, by Muriel Niederle and Lise Vesterlund, economists at Stanford University and the University of Pittsburgh, respectively. They conducted an experiment in which men and women were first paid to add up numbers in their heads, earning fifty cents for each correct answer (referred to as the “piece-rate” task). The participants were eventually offered the choice to compete in a tournament where the person with the most correct answers after five minutes receives $2 per correct answer and everyone else receives nothing. One of the main points of the article was that, even at similar levels of confidence and ability, men were much more likely to enter the tournament than women; i.e., women are less willing than men to enter competition. The results of this study suggest another possible explanation for why there are so few women in top-paying jobs: Even in a world of equal abilities and no discrimination, family issues, social pressures, etc., women might be less likely to end up as tenured professors or CEOs because the jobs are so inherently competitive.

Gender inequality is a hot topic, but my issue with the paper is of a much more mundane statistical nature. For the results of the experiment to be meaningful, you’d want to start with men and women who, aside from the gender difference, are pretty similar; otherwise, something besides gender might explain the women’s lower participation in the tournament. The authors acknowledge this, and conclude that the men and women are comparable because they scored similarly on both the piece-rate and tournament tasks. What isn’t mentioned in the article is any comparison of other potentially relevant characteristics between the men and women in the study, age for example. If, say, men and women scored similarly on the addition tasks but the women were on average twenty years older than the men, you might wonder whether the difference in competitiveness is due to gender or to age. I don’t really think that’s what’s going on here, but I’d be even more convinced if the paper included a table with some background information on the men and women in the study. Actually, even just a sentence stating that the men and women in the study did not differ significantly on age or education (or…) would work. That’s all I want, just a sentence.

Thoughts on Teaching Regression

I recently finished my first semester of teaching. I was a TA in grad school, but this was my first time being “the professor.” I was teaching a regression course, and there are several things I’d like to do differently should I teach the same class again in the future. I just have to figure out how.

Teaching the first part of the course was the easiest: least squares, maximum likelihood, testing, lots of theory, and not a lot of data. Sounds dry and boring, I know, but I wouldn’t have become a statistician if I didn’t at least sort of like the theory and the mathematics underlying basic statistics. But more than that, the theoretical part of the regression course was the easiest to teach because it’s basically non-negotiable. Hypothesis tests and p-values might require a lot of explanation because the thinking is kind of counterintuitive, but the content is pretty cut-and-dried: If the data in a regression analysis meet the various assumptions (independence, normality, etc.), then the results on testing, confidence intervals, prediction, etc., follow. Done.

Then come the actual data: the interesting stuff, the examples, the reason we’re doing statistics in the first place (the reason I’m doing statistics, anyway). And it all gets a lot messier and harder to teach. Because real data don’t tend to fit all the assumptions that regression theory is based on. Residual plots aren’t always the patternless clouds of points shown as examples in textbooks. Normal probability plots of the residuals aren’t always 45-degree lines. Measurement error happens. Model selection is hard. I found myself saying things like “It’s an art as much as it is a science” in response to questions like “How much collinearity is too much?”. I think data analysis really is an art, and decisions about which variables to include or which transformations to make (etc., etc., etc.) don’t happen in a vacuum: You might make different decisions based on the research question you’re trying to answer, how much data you have, how much time you have, how complex a model you’re willing to use. And so on. All of which can be kind of overwhelming to students learning regression analysis for the first time.

So I think what I need to do is figure out how better to give general guidance without resorting to dogmatic rules (“If any predictors have correlation greater than .8 you must exclude some of them or use ridge regression!”), and I think a good starting point is probably to find better examples. I’ve had a hard time finding real data sets that illustrate important points while being neither contrived nor hopelessly complex, but they must be out there. I’d really appreciate any thoughts/advice.

A Very Delayed Lightbulb Over my Head

Daniel Scharfstein (http://commprojects.jhsph.edu/faculty/bio.cfm?F=Daniel&L=Scharfstein) recently gave a very good talk at the Columbia Biostatistics Department. He presented an application of causal inference using principal stratification. The example was similar to something I’ve heard Don Rubin and others speak about before, but I realized I’d been missing something important about this particular example.

The example in Dr. Scharfstein’s talk was estimating the effect of vision loss on depression in older adults. Briefly, vision was tested early in the study and patients were followed up for a few years. At the end of the study, patients were screened for depression, with the goal of determining whether the patients with vision loss were more depressed than patients without vision loss. The complication (one of them, anyway) is that many patients died before the end of the study. If vision loss affects death at all (and it seems suspect to assume that it doesn’t), then restricting the analysis to those patients who were alive at the end of the study can bias results and won’t produce estimates of causal effects that are valid in the Rubin sense. Alternatively, treating depression outcomes for patients who died as missing data isn’t quite right because these outcomes are fundamentally unobservable (a priori counterfactuals). The goal of the analysis then is to estimate the effect of vision loss on depression for the latent subclass (principal stratum) of patients for whom depression could be observed both with and without vision loss: the patients who would live whether or not they had vision loss.

As I said, I’ve seen presentations on this type of analysis before, and heard it said that this setting is very different from the standard economic instrumental variables setting. It wasn’t that I didn’t believe it was different; I just never gave a lot of thought to exactly how it was different. It wasn’t until I asked a dumb question in Dr. Scharfstein’s talk that I realized exactly what the difference is. Instrumental variables models rely on the exclusion restriction, the assumption that if the treatment (vision loss, in this case) doesn’t affect the “intermediate outcome” (death), it doesn’t affect the primary outcome (depression). Making that assumption here would defeat the whole purpose of the analysis, since the only subgroup of patients we’re interested in is the one for whom the treatment does not affect the intermediate outcome. So imposing the exclusion restriction would lead to the very unhelpful result of forcing our quantity of interest to be zero.

But I digress. Enter the belated lightbulb. Without the exclusion restriction, additional assumptions and covariates are usually necessary in order to obtain precise estimates of causal effects. One nice thing about the potential outcomes/principal stratification framework is that these assumptions can be made very clear and usually have scientific interpretation, so appropriate assumptions can be made based on subject-matter knowledge.

For the abstract of Professor Scharfstein’s talk, go to
http://cpmcnet.columbia.edu/dept/sph/biostat/seminar/spring2005.html, scroll down to January 27, and click on Abstract. See Zhang and Rubin (2003, Journal of Educational and Behavioral Statistics) for another example of this type of model.

Data-driven Vague Prior Distributions

I’m not one to go around having philosophical arguments about whether the parameters in statistical models are fixed constants or random variables. I tend to do Bayesian rather than frequentist analyses for practical reasons: It’s often much easier to fit complicated models using Bayesian methods than using frequentist methods. This was the case with a model I recently used as part of an analysis for a clinical trial. The details aren’t really important, but basically I was fitting a hierarchical, nonlinear regression model that would be used to impute missing blood measurements for people who dropped out of the trial. Because the analysis was for an FDA submission, it might have been preferable to do a frequentist analysis; however, this was one of those cases where fitting the model was much easier to do Bayesianly. The compromise was to fit a Bayesian model with a vague prior distribution.

Sounded easy enough, until I noticed that making small changes in the parameters of what I thought (read: hoped) was a vague prior distribution resulted in substantial changes in the resulting posterior distribution. When using proper prior distributions (which there are all kinds of good reasons to do), even if the prior variance is really large there’s a chance that the prior density is decreasing exponentially in a region of high likelihood, resulting in parameter estimates based more on the prior distribution than on the data. Our attempt to fix this potential problem (it’s not necessarily a problem if you really believe your prior distribution, but sometimes you don’t) is to perform preliminary analyses to estimate where the mass of the likelihood is. A vague prior distribution is then one that is centered near the likelihood with much larger spread.

We estimate the location and spread of the likelihood by capitalizing on the fact that the posterior mean and variance are a combination of the prior mean and variance and the “likelihood” mean and variance. Consider the model for multivariate normal data with known covariance matrix, and a multivariate normal prior distribution on the mean vector:

y_i | μ, Σ ~ N(μ, Σ),  i = 1, …, n
μ ~ N(μ_0, Δ_0).

The posterior distribution of μ (where n represents the number of observations) is:

μ | y, Σ ~ N(μ_n, Δ_n), where

μ_n = (Δ_0^{-1} + nΣ^{-1})^{-1} (Δ_0^{-1} μ_0 + nΣ^{-1} ȳ)

Δ_n^{-1} = Δ_0^{-1} + nΣ^{-1}.

Here ȳ (the sample mean) and Σ/n represent what I’m calling the likelihood mean and variance of μ. If we were unable to calculate them directly, we could recover them by solving the two equations above for ȳ and Σ/n, obtaining

Σ/n = (Δ_n^{-1} − Δ_0^{-1})^{-1}

ȳ = (Σ/n) Δ_0^{-1} (μ_n − μ_0) + μ_n.

A vague prior distribution for μ could then be something like N(ȳ, Σ) or N(ȳ, 20Σ/n). For more complicated models you could do the same thing. Let (μ_0, Δ_0) and (μ_n, Δ_n) represent the prior and posterior mean vector and covariance matrix of the model hyperparameters. First fit the model (with a multivariate normal prior distribution on the hyperparameters) for any convenient choice of (μ_0, Δ_0), then use the equations above to estimate the location and spread of the likelihood for these parameters. This approximation relies on approximate normality of the hyperparameters. In large samples this should be true; in smaller samples transformations of parameters can make the normal approximation more accurate.
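Here is a minimal R sketch of that moment-matching step, assuming the posterior mean vector and covariance matrix of the hyperparameters have already been estimated (say, from MCMC draws). The function and argument names are my own illustrative choices, not part of any released software.

# Recover the implied likelihood mean and covariance (y-bar and Sigma/n above)
# from the prior and posterior moments. Pass covariance arguments as matrices,
# even in one dimension.
likelihood_moments <- function(prior_mean, prior_cov, post_mean, post_cov) {
  prior_prec <- solve(prior_cov)                     # Delta_0^{-1}
  post_prec  <- solve(post_cov)                      # Delta_n^{-1}
  lik_cov    <- solve(post_prec - prior_prec)        # Sigma/n
  lik_mean   <- as.vector(lik_cov %*% prior_prec %*% (post_mean - prior_mean) + post_mean)
  list(mean = lik_mean, cov = lik_cov)               # y-bar and Sigma/n
}

# For example, with posterior draws of the hyperparameters in a matrix `draws`
# obtained under a convenient prior (mu0, Delta0):
#   lik <- likelihood_moments(mu0, Delta0, colMeans(draws), cov(draws))
#   # a vague prior might then be N(lik$mean, 20 * lik$cov)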

It’s also possible to check the accuracy of the likelihood approximation: Fit the model again using the estimated likelihood mean and variance as the prior mean and variance. If the likelihood approximation is good, the resulting posterior mean should approximately equal the prior mean, and the posterior variance should be about half the prior variance. If not, the process can be iterated: fit the model, estimate the likelihood mean and variance, use these as the prior mean and variance, fit the model again, and compare the prior and posterior means and variances. Repeat until the prior and posterior means are approximately equal and the posterior variance is about half the prior variance. From there, a vague prior can be obtained by setting the prior mean to the estimated likelihood mean and the prior variance to the estimated likelihood variance scaled by some large constant.
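A sketch of that iterative check, again only as an illustration: it assumes a user-supplied function fit_model(prior_mean, prior_cov) that fits the model under a multivariate normal prior on the (approximately normal) hyperparameters and returns a matrix of posterior draws. That function, and everything else named here, is hypothetical; likelihood_moments() is the helper sketched above.

find_vague_prior <- function(prior_mean, prior_cov, n_iter = 5, tol = 0.1) {
  lik <- NULL
  for (i in seq_len(n_iter)) {
    draws     <- fit_model(prior_mean, prior_cov)    # hypothetical model-fitting call
    post_mean <- colMeans(draws)
    post_cov  <- cov(draws)
    lik       <- likelihood_moments(prior_mean, prior_cov, post_mean, post_cov)

    # Stop once the prior and posterior means roughly agree and the posterior
    # variances are roughly half the prior variances.
    means_agree <- max(abs(post_mean - prior_mean) / sqrt(diag(prior_cov))) < tol
    vars_halved <- max(abs(diag(post_cov) / diag(prior_cov) - 0.5)) < tol
    if (means_agree && vars_halved) break

    # Otherwise use the estimated likelihood moments as the next prior.
    prior_mean <- lik$mean
    prior_cov  <- lik$cov
  }
  # Center the vague prior at the estimated likelihood mean and inflate the
  # estimated likelihood covariance by a large constant.
  list(mean = lik$mean, cov = 20 * lik$cov)
}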

I’ve tried this method in some simulations and it seems to work, in the sense that after iterating the above procedure a few times you do obtain an estimated likelihood mean and variance that, when used as the prior mean and variance, lead to a posterior distribution with the same mean and half the variance. With simple or well-understood models, there are surely better ways than this to come up with a vague prior distribution, but in complex models this method could be a helpful last resort.

Bayesian Software Validation

A wise statistician once told me that to succeed in statistics, one could either be really smart, or be Bayesian. He was joking (I think), but maybe an appropriate corollary to that sentiment is that to succeed in Bayesian statistics, one should either be really smart, or be a good programmer. There’s been an explosion in recent years in the number and type of algorithms Bayesian statisticians have available for fitting models (i.e., generating a sample from a posterior distribution), and it cycles: as computers get faster and more powerful, more complex model-fitting algorithms are developed, and we can then start thinking of more complicated models to fit, which may require even more advanced computational methods, which creates a demand for bigger, better, faster computers, and the process continues. As new computational methods are developed, there is rarely well-tested, publicly available software for implementing the algorithms, and so statisticians spend a fair amount of time doing computer programming. Not that there’s anything wrong with that, but I know I at least have been guilty (once or twice, a long time ago) of being a little bit lax about making sure my programs actually work. It runs without crashing, it gives reasonable-looking results, it must be doing the right thing, right? Not necessarily, and this is why standard debugging methods from computer science and software engineering aren’t always helpful for testing statistical software. The point is that we don’t know exactly what the software is supposed to output (if we knew what our parameter estimates were supposed to be, for example, we wouldn’t need to write the program in the first place), so if software has an error that doesn’t cause crashes or really crazy results, we might not notice.

So computing can be a problem. When fitting Bayesian models, however, it can also come to the rescue. (Like alcohol to Homer Simpson, computing power is both the cause of, and solution to, our problems. This particular problem, anyway.) The basic idea is that if we generate data from the model we want to fit and then analyze those data under the same model, we’ll know what the results should be (on average), and can therefore test that the software is written correctly. Consider a Bayesian model p(θ)p(y|θ), where p(y|θ) is the sampling distribution of the data and p(θ) is the prior distribution for the parameter vector θ. If you draw a “true” parameter value θ0 from p(θ), then draw data y from p(y|θ0), and then analyze the data (i.e., generate a sample from the posterior distribution, p(θ|y)) under this same model, θ0 and the posterior sample will both be drawn from the same distribution, p(θ|y), if the software works correctly. Testing that the software works then amounts to testing that θ0 looks like a draw from p(θ|y). There are various ways to do this. Our proposed method is based on the idea that if θ0 and the posterior sample are drawn from the same distribution, then the quantile of θ0 with respect to the posterior sample should follow a Uniform(0,1) distribution. Our method for testing software is as follows:

1. Generate θ0 from p(θ)
2. Generate y from p(y|θ0)
3. Generate a sample from p(θ|y) using the software to be tested.
4. Calculate the quantile of θ0 with respect to the posterior sample. (If θ is a vector, do this for each scalar component of θ.)

Steps 1-4 comprise one replication. Performing many replications gives a sample of quantiles that should be uniformly distributed if the software works. To test this, we recommend performing a z test (individually for each component of θ) on the following transformation of the quantiles: h(q) = (q − 0.5)^2. If q is uniformly distributed, h(q) has mean 1/12 and variance 1/180. Click here for a draft of our paper on software validation, which explains why we (we being me, Andrew Gelman, and Don Rubin) like this particular transformation of the quantiles, and also presents an omnibus test for all components of θ simultaneously. We also present examples and discuss design issues, the need for proper prior distributions, why you can’t really test software for implementing most frequentist methods this way, etc.
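To make the recipe concrete, here is a small R sketch that runs steps 1-4 for a toy model (normal data with known variance and a normal prior on the mean), using the exact conjugate posterior in place of the software being tested, and then applies the z test on h(q). The function name and settings are illustrative only, not the interface of our released package.

# One replication of steps 1-4 for the toy model.
validate_once <- function(n_data = 20, n_post = 1000,
                          prior_mean = 0, prior_sd = 2, data_sd = 1) {
  theta0 <- rnorm(1, prior_mean, prior_sd)       # step 1: draw the "true" theta
  y      <- rnorm(n_data, theta0, data_sd)       # step 2: draw data given theta0

  # Step 3: sample from p(theta | y). Here this is the exact conjugate
  # posterior, standing in for the software being tested.
  post_prec <- 1 / prior_sd^2 + n_data / data_sd^2
  post_mean <- (prior_mean / prior_sd^2 + sum(y) / data_sd^2) / post_prec
  theta_draws <- rnorm(n_post, post_mean, sqrt(1 / post_prec))

  mean(theta_draws < theta0)                     # step 4: quantile of theta0
}

# Many replications, then the z test on h(q) = (q - 0.5)^2, which has
# mean 1/12 and variance 1/180 when q is Uniform(0,1).
q <- replicate(1000, validate_once())
h <- (q - 0.5)^2
z <- (mean(h) - 1/12) / sqrt((1/180) / length(h))
2 * pnorm(-abs(z))   # two-sided p-value; very small values suggest a bug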

This may take a lot of computer time, but it doesn’t take much more programming time than that required to write the model-fitting program in the first place, and the payoff could be big.