
Solution to the problem on the distribution of p-values

See this recent post for background.

Here’s the question:

It is sometimes said that the p-value is uniformly distributed if the null hypothesis is true. Give two different reasons why this statement is not in general true. The problem is with real examples, not just toy examples, so your reasons should not involve degenerate situations such as zero sample size or infinite data values.

And here’s the answer:  Reason 1 is that if you have a discrete sample space, the p-value will have a discrete distribution.  Reason 2 is that if you have a composite null hypothesis, the p-value will, except in some special cases, depend on the value of the nuisance parameter.
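
To make reason 1 concrete, here is a minimal simulation sketch (my own illustration, not part of the exam solution), using scipy's binomtest. With a discrete outcome such as a binomial count, only finitely many p-values are attainable, so the null distribution of the p-value is a step function rather than uniform; the particular choices n = 20 and p = 0.5 below are arbitrary.

# Sketch of reason 1: with a discrete outcome, the p-value has a discrete
# (hence non-uniform) null distribution.  Requires numpy and scipy >= 1.7.
import numpy as np
from scipy.stats import binom, binomtest

n, p0 = 20, 0.5
k = np.arange(n + 1)                                   # every possible outcome
pvals = np.array([binomtest(ki, n, p0).pvalue for ki in k])
probs = binom.pmf(k, n, p0)                            # P(Y = k) under the null

print("attainable p-values:", np.unique(np.round(pvals, 4)))
for alpha in (0.01, 0.05, 0.10):
    print(f"P(p <= {alpha}) = {probs[pvals <= alpha].sum():.4f}")   # generally not equal to alpha

Reason 2 could be checked the same way with, say, a one-sided test of H0: mu <= 0: the distribution of the p-value under the null then depends on where the true mu sits within the null.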

All 4 of the students gave reason 1, but none of them gave reason 2.  And none of them gave any other good reason.

We’ll do the final question tomorrow.

Solution to the helicopter design problem

See yesterday’s post for background.

Here’s the question:

In the helicopter activity, pairs of students design paper “helicopters” and compete to create the copter that takes longest to reach the ground when dropped from a fixed height. The two parameters of the helicopter, a and b, correspond to the length of certain cuts in the paper, parameterized so that each of a and b must be more than 0 and less than 1. In the activity, students are allowed to make 20 test helicopters at design points (a,b) of their choosing. The students measure how long each copter takes to reach the ground, and then they are supposed to fit a simple regression (not hierarchical or even Bayesian) to model this outcome as a function of a and b. Based on this model, they choose the optimal a,b and then submit this to the class. Here is the question. Why is it inappropriate for that regression model to be linear?

And here’s the answer:  For a linear model the optimum is necessarily on the boundary.  But we already know the solution can’t be on the boundary (each of a and b must be more than 0 and less than 1).  You need a nonlinear model to get an internal optimum.
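
As a quick numerical check (my own sketch, not part of the exam solution): whatever coefficients a purely linear fit produces, the fitted flight time is maximized on the boundary of the (a, b) square (at a corner, in fact, unless a coefficient is exactly zero). The coefficients below are arbitrary.

# Sketch: a linear surface over the unit square is always maximized on the boundary,
# so a linear model can never recommend an interior design point.
import numpy as np

rng = np.random.default_rng(0)
vals = np.linspace(0.01, 0.99, 50)
grid = np.array([(a, b) for a in vals for b in vals])

for _ in range(5):
    b0, b1, b2 = rng.normal(size=3)                          # arbitrary fitted coefficients
    yhat = b0 + b1 * grid[:, 0] + b2 * grid[:, 1]
    print("argmax of linear surface:", grid[yhat.argmax()])  # always a corner/edge of the grid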

I was happy to see that all 4 of the students got this correct.

P.S. I see there was some confusion among the commenters on the definition of a linear model, perhaps because I labeled the design variables as “a” and “b” rather than “x1” and “x2.” The distinction between linear and nonlinear models is important in applied statistics. In this case, a linear regression model would be of the form y = beta_0 + beta_1*a + beta_2*b + error. A nonlinear model could have quadratic terms (for example) or it could look like y = (beta_0 + beta_1*a + beta_2*b + error_beta)/(gamma_0 + gamma_1*a + gamma_2*b + error_gamma), or it could be mathematically constructed to give the right answers at the boundaries. All sorts of nonlinear models are possible. We don’t really teach our students how to construct such nonlinear models, but writing this post is making me think it should be in the curriculum.
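
For what it's worth, here is a rough sketch of the simplest nonlinear option mentioned above, adding quadratic terms. The data are simulated (the "true" surface below is invented purely for illustration), and quadratic terms are only one of many possible nonlinear forms.

# Sketch: fit a quadratic-in-(a, b) regression to simulated drop times and find the
# interior design point that maximizes the fitted surface.
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.05, 0.95, 20)                          # 20 design points, as in the activity
b = rng.uniform(0.05, 0.95, 20)
y = 2 + 1.5*a + 1.0*b - 1.8*a**2 - 1.2*b**2 + rng.normal(0, 0.1, 20)   # fake drop times

X = np.column_stack([np.ones(20), a, b, a**2, b**2])     # design matrix with quadratic terms
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# If the fitted quadratic coefficients are negative, the surface has an interior
# maximum at a* = -beta_a / (2 beta_aa), b* = -beta_b / (2 beta_bb).
a_star = -beta[1] / (2 * beta[3])
b_star = -beta[2] / (2 * beta[4])
print("estimated optimal design point:", round(a_star, 2), round(b_star, 2))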

No, Michael Jordan didn’t say that!

The names are changed, but the song remains the same.

First verse. There’s an article by a journalist,

to which Andrew responded in blog form,

Second verse. There’s an article by a journalist,

to which Michael Jordan responded in blog form,

Whenever I (Bob, not Andrew) read a story in an area I know something about (slices of computer science, linguistics, and statistics), I’m almost always struck by the inaccuracies. The result is that I mistrust journalists writing about topics I don’t know anything about, such as foreign affairs, economics, or medicine.

Some questions from our Ph.D. statistics qualifying exam

In the in-class applied statistics qualifying exam, students had 4 hours to do 6 problems. Here were the 3 problems I submitted:

  1. In the helicopter activity, pairs of students design paper “helicopters” and compete to create the copter that takes longest to reach the ground when dropped from a fixed height. The two parameters of the helicopter, a and b, correspond to the length of certain cuts in the paper, parameterized so that each of a and b must be more than 0 and less than 1. In the activity, students are allowed to make 20 test helicopters at design points (a,b) of their choosing. The students measure how long each copter takes to reach the ground, and then they are supposed to fit a simple regression (not hierarchical or even Bayesian) to model this outcome as a function of a and b. Based on this model, they choose the optimal a,b and then submit this to the class. Here is the question. Why is it inappropriate for that regression model to be linear?
  2. You are designing an experiment where you are estimating a linear dose-response pattern with a dose x that can take on the values 1, 2, 3, and the response is continuous. Suppose that there is no systematic error and that the measurement variance is proportional to x. You have 100 people in your experiment. How should you allocate them among the x=1, 2, and 3 conditions to best estimate the dose-response slope?
  3. It is sometimes said that the p-value is uniformly distributed if the null hypothesis is true. Give two different reasons why this statement is not in general true. The problem is with real examples, not just toy examples, so your reasons should not involve degenerate situations such as zero sample size or infinite data values.

You can try to do these at home; also try to guess which of these problems were easy for the students and which were hard. (One of them was solved correctly by all 4 students who took the exam, while another turned out to be so difficult that none of the students got close to the right answer.) I’ll post solutions over the next three days.
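
If you want to experiment with problem 2 numerically before the solution post, here is a small Monte Carlo sketch (my own, not the official solution). It assumes a weighted least squares fit with weights 1/x, which is one natural estimator when the measurement variance is proportional to x; the true intercept and slope in the simulation are arbitrary, since they don't affect the sampling variability of the slope estimate.

# Sketch: compare the Monte Carlo sd of the estimated dose-response slope under a few
# candidate allocations of 100 subjects to doses x = 1, 2, 3, with error variance
# proportional to x.
import numpy as np

rng = np.random.default_rng(3)

def slope_sd(n1, n2, n3, sims=5000, alpha=1.0, beta=0.5, sigma2=1.0):
    x = np.repeat([1.0, 2.0, 3.0], [n1, n2, n3])
    w = 1.0 / x                                    # weighted least squares weights
    X = np.column_stack([np.ones_like(x), x])
    slopes = np.empty(sims)
    for s in range(sims):
        y = alpha + beta * x + rng.normal(0, np.sqrt(sigma2 * x))
        XtW = X.T * w                              # X'W with W = diag(w)
        slopes[s] = np.linalg.solve(XtW @ X, XtW @ y)[1]
    return slopes.std()

for alloc in [(34, 33, 33), (50, 0, 50), (40, 20, 40), (25, 50, 25)]:
    print(alloc, round(slope_sd(*alloc), 4))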

Stan 2.5, now with MATLAB, Julia, and ODEs


As usual, you can find everything on the Stan home page.

Drop us a line on the stan-users group if you have problems with installs or questions about Stan or coding particular models.

New Interfaces

We’d like to welcome two new interfaces: one for MATLAB and one for Julia.

The new interface home pages are linked from the Stan home page.

New Features

The biggest new feature is a differential equation solver (Runge-Kutta from Boost’s odeint with coupled sensitivities). We also added new cbind and rbind functions, is_nan and is_inf functions, a num_elements function, a mechanism to throw exceptions to reject samples with printed messages, and two new distributions, the Frechet and 2-parameter Pareto (both contributed by Alexey Stukalov).

Backward Compatibility

Stan 2.5 is fully backward compatible with earlier 2.x releases and will remain so until Stan 3 (which is not yet designed, much less scheduled).

Revised Manual

In addition to the ODE documentation, there is a new chapter on marginalizing discrete latent parameters with several example models, new sections on regression priors for coefficients and noise scale in ordinary, hierarchical, and multivariate settings, along with new chapters on all the algorithms used by Stan for MCMC sampling, optimization, and diagnosis, with configuration information and advice.

Preview of 2.6 and Beyond

Our plans for major features in the near future include stiff ODE solvers, a general MATLAB/R-style array/matrix/vector indexing and assignment syntax, and uncertainty estimates for penalized maximum likelihood estimates via Laplace approximations with second-order autodiff.

Release Notes

Here are the release notes.


v2.5.0 (20 October 2014)
======================================================================

New Features
----------------------------------------
* ordinary differential equation solver, implemented by coupling
  the user-specified system with its sensitivities (#771)
* add reject() statement for user-defined rejections/exceptions (#458)
* new num_elements() function that applies to all containers (#1026)
* added is_nan() and is_inf() functions (#592)
* nested reverse-mode autodiff, primarily for ode solver (#1031)
* added get_lp() function to remove any need for bare lp__  (#470)
* new functions cbind() and rbind() like those in R (#787)
* added modulus function in a way that is consistent with integer
  division across platforms (#577)
* exposed pareto_type_2_rng (#580)
* added Frechet distribution and multi_gp_cholesky distribution 
  (thanks to Alexey Stukalov for both)

Enhancements 
----------------------------------------
* removed Eigen code insertion for numeric traits and replaced
  with order-independent metaprogram (#1065)
* cleaned up error messages to provide clearer error context
  and more informative messages (#640)
* extensive tests for higher order autodiff in densities (#823)
* added context factory 
* deprecated lkj_cov density (#865)
* trying again with informational/rejection message (#223)
* more code moved from interfaces into Stan common libraries,
  including a var_context factory for configuration
* moved example models to own repo (stan-dev/example-models) and
  included as submodule for stan-dev/stan (#314)
* added per-iteration interrupt handler to BFGS optimizer (#768)
* worked around unused function warnings from gcc (#796)
* fixed error messages in vector to array conversion (#579, thanks
  Kevin S. Van Horn)
* fixed gp-fit.stan example to be as efficient as manual 
  version (#782)
* update to Eigen version 3.2.2 (#1087)

Builds
----------------------------------------
* pull out testing into Python script for developers to simplify
  makes
* libstan dependencies handled properly and regenerate
  dependencies, including working around bug in GNU 
  make 3.8.1 (#1058, #1061, #1062)

Bug Fixes
----------------------------------------
* deal with covariant return structure in functions (allows
  data-only variables to alternate with parameter version);  involved
  adding new traits metaprograms promote_scalar and
  promote_scalar_type (#849)
* fixed error message on check_nonzero_size (#1066)
* fix arg config printing after random seed generation (#1049)
* logical conjunction and disjunction operators short circuit (#593)
* clean up parser bug preventing variables starting with reserved
  names (#866)
* fma() function calls underlying platform fma (#667)
* remove upper bound on number of function arguments (#867)
* cleaned up code to remove compiler warnings (#1034)
* share likely() and unlikely() macros to avoid redundancy warnings (#1002)
* complete review of function library for NaN behavior and consistency
  of calls for double and autodiff values, with extensive
  documentation and extensive new unit tests; enhanced NaN testing
  in built-in test functions (several dozen issues in the #800 to
  #902 range)
* fixing Eigen assert bugs with NO_DEBUG in tests (#904)
* fix to makefile to allow builds in g++ 4.4 (thanks to Ewan Dunbar)
* fix precedence of exponentiation in language (#835)
* allow size zero inputs in data and initialization (#683)

Documentation
----------------------------------------
* new chapter on differential equation solver
* new sections on default priors for regression coefficients and
  scales, including hierarchical and multivariate based on
  full Cholesky parameterization
* new part on algorithms, with chapters on HMC/NUTS, optimization,
  and diagnostics
* new chapter on models with latent discrete parameters
* using latexmk through make for LaTeX compilation
* changed page numbers to be contiguous throughout so page
  numbers match PDF viewer page numbers
* all user-supplied corrections applied from next-manual issue
* section on identifiability with priors, including discussion of K-1
  parameterization of softmax and IRT
* new section on convergence monitoring
* extensive corrections from Andrew Gelman on regression models
  and notation
* added discussion of hurdle model in zero inflation section
* update built-in function doc to clarify several behaviors (#1025)

Sailing between the Scylla of hyping of sexy research and the Charybdis of reflexive skepticism

[image: Gillray cartoon of Britannia between Scylla and Charybdis]

Recently I had a disagreement with Larry Bartels which I think is worth sharing with you. Larry and I took opposite positions on the hot topic of science criticism.

To put things in a positive way, Larry was writing about some interesting recent research which I then constructively criticized.

To be more negative, Larry was hyping some sexy research and I was engaging in mindless criticism.

The balance between promotion and criticism is always worth discussing, but particularly so in this case because of three factors:

1. The research in question is on the borderline. The conclusions in question are not rock-solid—they depend on how you look at the data and are associated with p-values like 0.10 rather than 0.0001—but neither are they silly. Some of the findings definitely seem real, and the debate is more about how far to take it than whether there’s anything there at all. Nobody in the debate is claiming that the findings are empty; there’s only a dispute about their implications.

2. The topic—the effect of unperceived messages on political attitudes—is important.

3. And, finally, Larry and I generally respect each other, both as scholars and as critics. So, even though we might be talking past each other regarding the details of this particular debate, we each recognize that the other has something valuable to say, both regarding methods and public opinion.

What it’s all about

The background is here:

We had a discussion last month on the sister blog regarding the effects of subliminal messages on political attitudes. It started with a Larry Bartels post entitled “Here’s how a cartoon smiley face punched a big hole in democratic theory,” with the subtitle, “Fleeting exposure to ‘irrelevant stimuli’ powerfully shapes our assessments of policy arguments,” discussing the results of an experiment conducted a few years ago and recently published by Cengiz Erisen, Milton Lodge and Charles Taber. Larry wrote:

What were these powerful “irrelevant stimuli” that were outweighing the impact of subjects’ prior policy views? Before seeing each policy statement, each subject was subliminally exposed (for 39 milliseconds — well below the threshold of conscious awareness) to one of three images: a smiling cartoon face, a frowning cartoon face, or a neutral cartoon face. . . . the subliminal cartoon faces substantially altered their assessments of the policy statements . . .

I followed up with a post expressing some skepticism:

Unfortunately they don’t give the data or any clear summary of the data from experiment No. 2, so I can’t evaluate it. I respect Larry Bartels, and I see that he characterized the results as the “subliminal cartoon faces substantially altered their assessments of the policy statements — and the resulting negative and positive thoughts produced substantial changes in policy attitudes.” But based on the evidence given in the paper, I can’t evaluate this claim. I’m not saying it’s wrong. I’m just saying that I can’t express judgment on it, given the information provided.

Larry then followed up with a post saying that further information was in chapter 3 of Erisen’s Ph.D. dissertation and presented as evidence this path analysis:

[figure: path analysis]

along with this summary:

In this case, subliminal exposure to a smiley cartoon face reduced negative thoughts about illegal immigration, increased positive thoughts about illegal immigration, and (crucially for Gelman) substantially shifted policy attitudes.

And Erisen sent along a note with further explanation, the centerpiece of which was another path analysis.

Unfortunately I still wasn’t convinced. The trouble is, I just get confused whenever I see these path diagrams. What I really want to see is a direct comparison of the political attitudes with and without the intervention. No amount of path diagrams will convince me until I see the direct comparison.

However, I had not read all of the relevant chapter of Erisen’s dissertation in detail. I’d looked at the graphs (which had results of path analyses, and data summaries on positive and negative thoughts, but no direct data summaries of issue attitudes) and at some of the tables. It turns out, though, that there were some direct comparisons of issue attitudes in the text of the dissertation but not in the tables and figures.

I’ll get back to that in a bit, but first let me return to what I wrote at the time, in response to Erisen and Bartels:

I’m not saying that Erisen is wrong in his claims, just that the evidence he [and Larry] have shown me is too abstract to convince me. I realize that he knows a lot more about his experiment and his data than I do and I’m pretty sure that he is much more informed on this literature than I am, so I respect that he feels he can draw certain strong conclusions from his data. But, for me, I have to go with what information is available to me.

Why do these claims from path analysis confuse me? An example is given in a comment by David Harris, who reports that Erisen et al. “seem to acknowledge that the effect of their priming on people’s actual policy evaluations is nil” but that they then follow up with a convoluted explanation involving a series of interactions.

Convoluted can be OK—real life is convoluted—but I’d like to see some simple comparisons. If someone wants to claim that “Fleeting exposure to ‘irrelevant stimuli’ powerfully shapes our assessments of policy arguments,” I’d like to see if these fleeting exposures indeed have powerful effects. In an observational setting, such effects can be hard to “tease out,” as the saying goes. But in this case the researchers did a controlled experiment, and I’d like to see the direct comparison as a starting point.

Commenter Dean Eckles wrote:

The answer is that those effects are not significant at conventional levels in Exp 2. From ch. 3 (pages 89-91) of Cengiz Erisen’s dissertation (from https://dspace.sunyconnect.suny.edu/handle/1951/52338) we have:

Illegal Immigration: “In the first step of the mediation model a simple regression shows the effect of affective prime on the attitude (beta=.34; p [less than] .07). Although not hypothesized, this confirms the direct influence of the affective prime on the illegal immigration attitude.”

Energy Security: “As before, the first step of the mediation model ought to present the effect of the prime on one’s attitude. In this mediation model, however, the affective prime does not change energy security attitude directly (beta=-.10; p [greater than] .10). Yet, as discussed before, the first step of mediation analysis is not required to establish the model (Shrout & Bolger 2002; MacKinnon 2008).”

So (the cynic in me says), this pretty much covers it. The direct result was not statistically significant. When it went in the expected direction and was not statistically significant, it was taken as a confirmation of the hypothesis. When it went in the wrong direction and was not statistically significant, it was dismissed as not being required.

Back to the debate

OK, so here you have the story as I see it: Larry heard of an interesting study regarding subliminal messages, a study that made a lot of sense especially in light of the work of Larry and others regarding the ways in which voters can be swayed by information that logically should be irrelevant to voting decisions or policy positions (and, indeed, consistent with the work of Kahneman, Slovic, and Tversky regarding shortcuts and heuristics in decision making). The work seemed solid and was supported by several statistical analyses. And there does seem to be something there (in particular, Erisen shows strong evidence of the stimulus affecting the numbers of positive and negative thoughts expressed by the students in his experiment). But the evidence for the headline claim—that the subliminal smiley-faces affect political attitudes themselves, not just positive and negative expressions—is not so clear.

That’s my perspective. Now for Larry’s. As he saw it, my posts were sloppy: I reacted to the path analyses presented by him and Erisen and did not look carefully within Erisen’s Ph.D. thesis to find the direct comparisons. Here’s what Larry wrote:

Now it seems that one of your commenters has read (part of) the dissertation chapter and found two tests of the sort you claimed were lacking, one of which indicates a substantial effect (.34 on a six-point scale) and the other of which indicates no effect. If you or your commenter bothered to keep reading, you would find four more tests, two of which (involving different issues) indicate substantial effects (.40 and .51) and two of which indicate no effects. The three substantial effects (out of six) have reported p-values of <.07, <.08, and >.10. How likely is that set of results to occur by chance? Do you really want to argue that the appropriate way to assess this evidence is one .05 test at a time?

Hmmm, I’ll have to think about this one.

My quick response is as follows:
1. Sure, if we accept the general quality of the measurements in this study (no big systematic errors, etc.) then there’s very clear evidence of the subliminal stimuli having effects on positive and negative expressions, hence it’s completely reasonable to expect effects on other survey responses including issue attitudes.
2. That is, we’re not in “Bem” territory here. Conditional on the experiments being done competently, there are real effects here.
3. Given that the stimuli can affect issue attitudes, it’s reasonable to expect variation, to expect some positive and some negative effects, and for the effects to vary across people and across situations.
4. So if I wanted to study these effects, I’d be inclined to fit a multilevel model to allow for the variation and to better estimate average effects in the context of variation.
5. When it comes to specific effects, and to specific claims of large effects (recall the original claim that the stimulus “powerfully [emphasis added] shapes our assessments of policy arguments,” elsewhere “substantially altered,” elsewhere “significantly and consistently altered,” elsewhere “punched a big hole in democratic theory”), I’d like to see some strong evidence. And these “p less than .07” and “p greater than .10” things don’t look like strong evidence to me.
6. I agree that these results are consistent with some effect on issue attitudes but I don’t see the evidence for the large effects that have been claimed.
7. Finally, I respect the path analyses for what they are, and I’m not saying Erisen shouldn’t have done them, but I think it’s fair to say that these are the sorts of analyses that are used to understand large effects that exist; they don’t directly address the question of the effects of the stimulus on policy attitudes (which is how we could end up with explanation of large effects that cancel out).

As a Bayesian, I do accept Larry’s criticism that it was odd for me to claim that there was no evidence just because p was not less than 0.05. Even weak evidence should shift my priors a bit, no?

And I agree that weak evidence is not the same as zero evidence.

So let me clarify that, conditional on accepting the quality of Erisen’s experimental protocols (which I have no reason to question), I have no doubt that some effects are there. The question is about the size and the direction of the effects.

Summary

In some sense, the post-publication review process worked well: Larry promoted the original work on the sister blog, which gave it a wider audience. I read Larry’s post and offered my objection on the sister blog and here, and, in turn, Erisen and various commenters replied. And, eventually, after a couple of email exchanges, I finally got the point that Larry had been trying to explain to me: Erisen did have the direct comparisons I’d been asking for; they were just in the text of his dissertation and not in the tables and figures.

This post-publication discussion was slow and frustrating (especially for Larry, who was rightly annoyed that I kept saying that the information wasn’t available to me, when it was there in the dissertation all along), but I still think it moved forward in a better way than would’ve happened without the open exchange, if, for example, all we’d had were a series of static, published articles presenting one position or another.

But these questions are difficult and somewhat unstable because of the massive selection effects in play. This discussion had its frustrating aspects on both sides but things are typically much worse! Most studies in political science don’t get discussed on the Monkey Cage or on this blog, and what we see is typically bimodal: a mix of studies that we like and think are worth sharing, and studies that we dislike and think are worth taking the time to debunk.

But I don’t go around looking for studies to shoot down! What typically happens is they get hyped by somebody else (whether it be Freakonomics, or David Brooks, or whoever) and then I react.

In this case, Larry posted on a research finding that he thought was important and perhaps had not received enough attention. I was skeptical. After all the dust has settled, I remain skeptical about any effects of the subliminal message on political attitudes. I think Larry remains convinced, and maybe our disagreement ultimately comes down to priors, which makes sense given that the evidence from the data is weak.

Meanwhile, new studies get published, and get neglected, or hyped, or both. I offer no general solution to how to handle these—clearly, the standard system of scientific publishing has its limitations—here I just wanted to raise some of these issues in a context where I see no easy answers.

To put it another way, I think social science can—and should—do better than we usually do. For a notorious example, consider “Reinhart and Rogoff”: a high-profile paper published in a top journal with serious errors that were not corrected for several years after publication.

On one hand, the model of discourse described in my above post is not at all scalable—Larry Bartels and I are just 2 guys, after all, and we have finite time available for this sort of thing. On the other hand, consider the many thousands of researchers who spend so many hours refereeing papers for journals. Surely this effort could be channeled in a more useful way.

Try a spaghetti plot

[image: a plate of spaghetti carbonara]

Joe Simmons writes:

I asked MTurk NFL fans to consider an NFL game in which the favorite was expected to beat the underdog by 7 points in a full-length game. I elicited their beliefs about sample size in a few different ways (materials .pdf, data .xls).

Some were asked to give the probability that the better team would be winning, losing, or tied after 1, 2, 3, and 4 quarters. If you look at the average win probabilities, their judgments look smart.

But this graph is super misleading, because the fact that the average prediction is wise masks the fact that the average person is not. Of the 204 participants sampled, only 26% assigned the favorite a higher probability to win at 4 quarters than at 3 quarters than at 2 quarters than at 1 quarter. About 42% erroneously said, at least once, that the favorite’s chances of winning would be greater for a shorter game than for a longer game.

How good people are at this depends on how you ask the question, but no matter how you ask it they are not very good.

The explicit warning, “This Graph is Super Misleading,” is a great idea.

But don’t stop there! You can do better. The next step is to follow it up with a spaghetti plot showing people’s estimates. If you click through the links, you see there are about 200 respondents, and 200 is a lot to show in a spaghetti plot, but you could handle this by breaking up the people into a bunch of categories (for example, based on age, sex, and football knowledge), thus allowing a grid of smaller graphs, each of which wouldn’t have too many lines.

P.S. Jeff Leek points out that sometimes a spaghetti plot won’t work so well because there are too many lines to plot and all you get is a mess (sort of like the above plate-o-spag image, in fact). He suggests the so-called lasagna plot, which is a sort of heat map, and which seems to have some similarities to Solomon Hsiang’s “watercolor” uncertainty display.

A heat map could be a good idea but let me also remind everyone that there are some solutions to overplotting of the lines in a spaghetti plot, some ways to keep the spaghetti structure while losing some of the messiness. Here are some strategies, in increasing order of complexity:

1. Simply plot narrower lines. Graphics devices have improved, and thin lines can work well.

2. Just plot a random sample of the lines. If you have 100 patients in your study, just plot 20 lines, say.

3. Small multiples: for example, a 2×4 grid broken down by male/female and 4 age categories. Within each sub-plot you don’t have so many lines, so there’s less of a problem with overplotting.

4. Alpha-blending.
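
Here is a schematic of what strategies 1 through 4 might look like together (a sketch with simulated respondents rather than the actual MTurk data; the grouping variable is invented to stand in for age/sex/football-knowledge categories).

# Sketch: spaghetti plot with thin lines, a random subsample, alpha-blending, and
# small multiples.  Each line is one (simulated) respondent's guessed win probability
# after quarters 1-4.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n_people = 200
quarters = np.arange(1, 5)
probs = np.clip(0.55 + 0.08 * (quarters - 1) + rng.normal(0, 0.12, (n_people, 4)), 0, 1)
group = rng.integers(0, 4, n_people)                # stand-in for demographic categories

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for g, ax in enumerate(axes.ravel()):
    members = np.where(group == g)[0]
    shown = rng.choice(members, size=min(25, members.size), replace=False)   # subsample
    for i in shown:
        ax.plot(quarters, probs[i], color="steelblue", lw=0.6, alpha=0.4)    # thin + alpha
    ax.plot(quarters, probs[members].mean(axis=0), color="black", lw=2)      # group average
    ax.set_title(f"group {g + 1}")
    ax.set_xticks(quarters)
for ax in axes[1, :]:
    ax.set_xlabel("quarters played")
for ax in axes[:, 0]:
    ax.set_ylabel("P(favorite is winning)")
plt.tight_layout()
plt.show()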

Three ways to present a probability forecast, and I only like one of them

To the nearest 10%:

[image: National Weather Service forecast]

To the nearest 1%:

[image: Upshot forecast]

To the nearest 0.1%:

I think the National Weather Service knows what they’re doing on this one.

On deck this week

Mon: Three ways to present a probability forecast, and I only like one of them

Tues: Try a spaghetti plot

Wed: I ain’t got no watch and you keep asking me what time it is

Thurs: Some questions from our Ph.D. statistics qualifying exam

Fri: Solution to the helicopter design problem

Sat: Solution to the problem on the distribution of p-values

Sun: Solution to the sample-allocation problem

“Your Paper Makes SSRN Top Ten List”

I received the following email from the Social Science Research Network, which is a (legitimate) preprint server for research papers:

Dear Andrew Gelman:

Your paper, “WHY HIGH-ORDER POLYNOMIALS SHOULD NOT BE USED IN REGRESSION DISCONTINUITY DESIGNS”, was recently listed on SSRN’s Top Ten download list for: PSN: Econometrics, Polimetrics, & Statistics (Topic) and Political Methods: Quantitative Methods eJournal.

As of 02 September 2014, your paper has been downloaded 17 times. You may view the abstract and download statistics at: http://ssrn.com/abstract=2486395.

Top Ten Lists are updated on a daily basis. . . .

The paper (with Guido Imbens) is here.

What amused me, though, was how low the number was. 17 downloads isn’t so many. I guess it doesn’t take much to be in the top 10!