The post How does a Nobel-prize-winning economist become a victim of bog-standard selection bias? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Linking to a new paper by Jorge Luis García, James J. Heckman, and Anna L. Ziff, economist Sue Dynarski makes this “joke” on Facebook—or maybe it’s not a joke:

How does one adjust standard errors to account for the fact that N of papers on an experiment > N of participants in the experiment?

Clicking through, the paper uses data from the “Abecedarian” (ABC) childhood intervention program of the 1970s. Well, the related ABC & “CARE” experiments, pooled together. From Table 3 on page 7, the ABC experiment has 58 treatment and 56 control students, while CARE has 17 treatment and 23 control. If you type “abecedarian” into Google Scholar, sure enough, you get 9,160 results! OK, but maybe some of those just have citations or references to other papers on that project… If you restrict the search to papers with “abecedarian” in the title, you still get 180 papers. If you search for the word “abecedarian” on Google Scholar (not necessarily in the title) and restrict to papers by Jim Heckman, you get 86 results.

That’s not why I thought to email you though.

Go to pages 7-8 of this new paper where they explain why they merged the ABC and CARE studies:

CARE included an additional arm of treatment. Besides the services just described, those in the treatment group also received home visiting from birth to age 5. Home visiting consisted of biweekly visits focusing on parental problem-solving skills. There was, in addition, an experimental group that received only the home visiting component, but not center-based care.[fn 17] In light of previous analyses, we drop this last group from our analysis. The home visiting component had very weak estimated effects.[fn 18] These analyses justify merging the treatment groups of ABC and CARE, even though that of CARE received the additional home-visiting component.[fn 19] We henceforth analyze the samples so generated as coming from a single ABC/CARE program.

OK, they merged some interventions (garden of forking paths?) because they wanted more data. But, how do they know that home visits had weak effects? Let’s check their explanation in footnote 18:

18: Campbell et al. (2014) test and do not reject the hypothesis of no treatment effects for this additional component of CARE.

Yep. Jim Heckman and coauthors conclude that the effects are “very weak” because they ran some tests and couldn’t reject the null. If you go deep into the supplementary material of the cited paper, to tables S15(a) and S15(b), sure enough you find that these “did not reject the null” conclusions are drawn from interventions with 12-13 control and 11-14 treatment students (S15(a)) or 15-16 control and 18-20 treatment students (S15(b)). Those are pretty small sample sizes…
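For scale, here’s a quick power calculation in R, taking a half-standard-deviation effect as an illustrative benchmark (the numbers are mine, not from the paper):

```r
# Power of a two-sample t-test to detect a true effect of 0.5 s.d.
# at roughly the per-group sample sizes in tables S15(a) and S15(b)
power.t.test(n = 12, delta = 0.5, sd = 1, sig.level = 0.05)$power  # about 0.2
power.t.test(n = 18, delta = 0.5, sd = 1, sig.level = 0.05)$power  # about 0.3
```

With power like that, failing to reject the null tells you almost nothing about whether the home-visiting component did anything.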

This jumped out at me and I thought you might be interested too.

My reply: This whole thing is unfortunate but it is consistent with the other writings of Heckman and his colleagues in this area: huge selection bias and zero acknowledgement of the problem. It makes me sad because Heckman’s fame came from models of selection bias, but he doesn’t see it when it’s right in front of his face. See here, for example.

The topic is difficult to write about for a few reasons.

First, Heckman is a renowned scholar and he is evidently careful about what he writes. We’re not talking about Brian Wansink or Satoshi Kanazawa here. Heckman works on important topics, his studies are not done on the cheap, and he’s eminently reasonable in his economic discussions. He’s just making a statistical error, over and over again. It’s a subtle error, though, that has taken us (the statistics profession) something like a decade to fully process. Making this mistake doesn’t make Heckman a bad guy, and that’s part of the problem: When you tell a quantitative researcher that they made a statistical error, you often get a defensive reaction, as if you accused them of being stupid, or cheating. But lots of smart, honest people have made this mistake. That’s one of the reasons we have formal statistical methods in the first place: people get lots of things wrong when relying on instinct. Probability and statistics are important, but they’re not quite natural to our ways of thinking.

Second, who wants to be the grinch who’s skeptical about early childhood intervention? Now, just to be clear, there’s lots of room to be skeptical about Heckman’s claims and still think that early childhood intervention is a good idea. For example, this paper by Garcia, Heckman, Leaf, and Prados reports a benefit/cost ratio of 7.3. So they could be overestimating their effect by a factor of 7 and still have a favorable ratio. The point is, if for whatever reason you support universal day care or whatever, you have a motivation not to worry too much about the details of a study that supports your position.

Again, I’m not saying that Heckman and his colleagues are doing this. I can only assume they’re reporting what, to them, are their best estimates. Unfortunately these methods are biased. But a lot of people with classical statistics and econometrics training don’t realize this: they think regression coefficients are unbiased estimates, but nobody ever told them that the biases can be huge when there is selection for statistical significance.
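To see how big this bias can be, here’s a toy simulation in R (my own made-up numbers, purely for illustration): a true effect of 0.1 standard deviations, 50 people per group, and attention paid only to estimates that reach p < .05:

```r
set.seed(1)
true_effect <- 0.1
sims <- replicate(10000, {
  y1 <- rnorm(50, true_effect)  # treatment group
  y0 <- rnorm(50, 0)            # control group
  c(est = mean(y1) - mean(y0), p = t.test(y1, y0)$p.value)
})
mean(sims["est", ])                        # unconditionally: close to the true 0.1
mean(abs(sims["est", sims["p", ] < .05]))  # conditional on significance: several times larger
```

The estimator is unbiased on average, but the subset of estimates that happen to clear the significance threshold exaggerates the true effect by a large factor.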

And, remember, selection for statistical significance is *not* just about the “file drawer” and it’s *not* just about “p-hacking.” It’s about researcher degrees of freedom and forking paths that researchers themselves don’t always realize until they try to replicate their own studies. I don’t think Heckman and his colleagues have dozens of unpublished papers hiding in their file drawers, and I don’t think they’re running their data through dozens of specifications until they find statistical significance. So it’s not the file drawer and it’s not p-hacking as is often understood. But these researchers *do* have nearly unlimited degrees of freedom in their data coding and analysis, they *do* interpret “non-significant” differences as null and “significant” differences at face value, they have forking paths all over the place, and their estimates of magnitudes of effects are biased in the positive direction. It’s kinda funny but also kinda sad, that there’s so much concern for rigor in the design of these studies and in the statistical estimators used in the analysis, but lots of messiness in between, lots of motivation on the part of the researchers to find success after success after success, and lots of motivation for scholarly journals and the news media to publicize the results uncritically. These motivations are not universal—there’s clearly a role in the ecosystem for critics within academia, the news media, and in the policy community—but I think there are enough incentives for success within Heckman’s world to keep him and his colleagues from seeing what’s going wrong.

Again, it’s not easy—it took the field of social psychology about a decade to get a handle on the problem, and some are still struggling. So I’m not slamming Heckman and his colleagues. I think they can and will do better. It’s just interesting, when considering the mistakes that accomplished people make, to ask, How did this happen?


The post Short course on Bayesian data analysis and Stan 23-25 Aug in NYC! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Before class everyone should install R, RStudio and RStan on their computers. (If you already have these, please update to the latest version of R and the latest version of Stan.) If problems occur, please join the stan-users group and post any questions. It’s important that all participants get Stan running and bring their laptops to the course.

Class structure and example topics for the three days:

Day 1: Foundations

Foundations of Bayesian inference

Foundations of Bayesian computation with Markov chain Monte Carlo

Intro to Stan with hands-on exercises

Real-life Stan

Bayesian workflow

Day 2: Linear and Generalized Linear Models

Foundations of Bayesian regression

Fitting GLMs in Stan (logistic regression, Poisson regression)

Diagnosing model misfit using graphical posterior predictive checks

Little data: How traditional statistical ideas remain relevant in a big data world

Generalizing from sample to population (surveys, Xbox example, etc)

Day 3: Hierarchical Models

Foundations of Bayesian hierarchical/multilevel models

Accurately fitting hierarchical models in Stan

Why we don’t (usually) have to worry about multiple comparisons

Hierarchical modeling and prior information

Specific topics on Bayesian inference and computation include, but are not limited to:

Bayesian inference and prediction

Naive Bayes, supervised, and unsupervised classification

Overview of Monte Carlo methods

Convergence and effective sample size

Hamiltonian Monte Carlo and the no-U-turn sampler

Continuous and discrete-data regression models

Mixture models

Measurement-error and item-response models

Specific topics on Stan include, but are not limited to:

Reproducible research

Probabilistic programming

Stan syntax and programming

Optimization

Warmup, adaptation, and convergence

Identifiability and problematic posteriors

Handling missing data

Ragged and sparse data structures

Gaussian processes

Again, information on the course is here.

The course is organized by Lander Analytics.

The course is not cheap. Stan is open-source, and we organize these courses to raise money to support the programming required to keep Stan up to date. We hope and believe that the course is more than worth the money you pay for it, but we hope you’ll also feel good, knowing that this money is being used directly to support Stan R&D.


The post Make Your Plans for Stans (-s + Con) appeared first on Statistical Modeling, Causal Inference, and Social Science.

*This post is by Mike*

A friendly reminder that registration is open for StanCon 2018, which will take place over three days, from Wednesday January 10, 2018 to Friday January 12, 2018, at the beautiful Asilomar Conference Grounds in Pacific Grove, California.

Detailed information about registration and accommodation at Asilomar, including fees and instructions, can be found on the event website. Early registration ends on Friday November 10, 2017 and no registrations will be accepted after Wednesday December 20, 2017.

We have an awesome set of invited speakers this year that is worth attendance alone:

- Susan Holmes (Department of Statistics, Stanford University)
- Sean Taylor and Ben Letham (Facebook Core Data Science)
- Manuel Rivas (Department of Biomedical Data Science, Stanford University)
- Talia Weiss (Department of Physics, Massachusetts Institute of Technology)
- Sophia Rabe-Hesketh and Daniel Furr (Educational Statistics and Biostatistics, University of California, Berkeley)

Contributed talks will proceed as last year, with each submission consisting of self-contained knitr or Jupyter notebooks that will be made publicly available after the conference. Last year’s contributed talks were awesome and we can’t wait to see what users will submit this year. For details on how to submit see the submission website. The final deadline for submissions is Saturday September 16, 2017 5:00:00 AM GMT.

This year we are going to try to support as many student scholarships as we can — if you are a student who would love to come but may not have the funding then don’t hesitate to submit a short application!

Finally, we are still actively looking for sponsors! If you are interested in supporting StanCon 2018, or know someone who might be, then please contact the organizing committee.


The post His concern is that the authors don’t control for the position of games within a season. appeared first on Statistical Modeling, Causal Inference, and Social Science.

I read your blog post about middle brow literature and PPNAS the other day. Today, a friend forwarded me this article in The Atlantic that (in my opinion) is another example of what you’ve recently been talking about. The research in question is focused on Major League Baseball and instances in which a batter is hit by a pitch in retaliation for another player previously hit by a pitch in the same game. The research suggests that temperature is an important factor in predicting this retaliatory behavior. The original article by Larrick et al. in the journal Psychological Science is here.

My concern is that the authors don’t control for the position of games within a season. There are several reasons why the probability of retaliation may change as the season progresses, but a potentially important one is the changing relative importance of games as the season goes along. Games in the early part of the season (April, May) are important as teams try to build a winning record. Games late in the season are more important as teams compete for limited playoff spots. In these important games, retaliation is less likely because teams are more focused on winning than imposing baseball justice. The important games occur in relatively cool months. There exists a soft spot in the schedule during June, July, and August (hot months) where the games are less consequential. Perhaps what is driving the result is the schedule position (and relative importance) of the game. Regardless of the mechanism by which the schedule position impacts the probability of retaliation, the timing of a game within the season is correlated with temperature.

One quick analysis to get at the effect of temperature in games of similar importance would be to examine those that were played in the month of August. Some of those games will be played in dome stadiums which are climate controlled. Most games will be played in outdoor stadiums. I am curious to see if the temperature effect still exists after controlling for the relative importance of the game.
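To illustrate the worry with made-up numbers: suppose retaliation depends only on the schedule position of the game, while temperature just happens to track the schedule. A model with temperature alone would still show a temperature “effect” (everything below is simulated, not real MLB data):

```r
set.seed(1)
n_games <- 5000
month <- sample(4:9, n_games, replace = TRUE)
# Hypothetical temperatures (F) peaking mid-season
temperature <- c(58, 68, 78, 83, 80, 70)[month - 3] + rnorm(n_games, 0, 8)
soft_spot <- as.numeric(month %in% 6:8)  # less consequential games
# Retaliation depends only on schedule position, not temperature:
retaliation <- rbinom(n_games, 1, plogis(-4 + 0.8 * soft_spot))
# Temperature alone picks up a spurious positive coefficient:
coef(glm(retaliation ~ temperature, family = binomial))["temperature"]
# Controlling for schedule position should shrink it toward zero:
coef(glm(retaliation ~ temperature + factor(month), family = binomial))["temperature"]
```

If the temperature coefficient survives the month controls in the real data, that would be much more convincing.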

My reply: Psychological Science, published in 2011? That says it all. I’ll blog this, I guess during next year’s baseball season…

And here we are!


The post Animating a spinner using ggplot2 and ImageMagick appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’m so pleased with it that I show the plot to Mitzi. She replies, “Why don’t you animate it?” I don’t immediately say, “What a waste of time,” then get back to what I’m doing. Instead, I boast, “It’ll be done when you get back from your run.” Luckily for me, she goes for long runs—I just barely had the prototype working as she got home. And then I had to polish it and turn it into a blog post. So here it is, for your wonder and amazement.

Here’s the R magic.

library(ggplot2)

draw_curve <- function(angle) {
  df <- data.frame(outcome = c("success", "failure"), prob = c(0.3, 0.7))
  plot <- ggplot(data=df, aes(x=factor(1), y=prob, fill=outcome)) +
    geom_bar(stat="identity", position="fill") +
    coord_polar(theta="y", start = 0, direction = 1) +
    scale_y_continuous(breaks = c(0.12, 0.7), labels=c("success", "failure")) +
    geom_segment(aes(y= angle/360, yend= angle/360, x = -1, xend = 1.4),
                 arrow=arrow(type="closed"), size=1) +
    theme(axis.title = element_blank(), axis.ticks = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid = element_blank(), panel.border = element_blank()) +
    theme(legend.position = "none") +
    geom_point(aes(x=-1, y = 0), color="#666666", size=5)
  return(plot)
}

ds <- c()
pos <- 0
for (i in 1:66) {
  pos <- (pos + (67 - i)) %% 360
  ds[i] <- pos
}
ds <- c(rep(0, 10), ds)
ds <- c(ds, rep(ds[length(ds)], 10))

for (i in 1:length(ds)) {
  ggsave(filename = paste("frame", ifelse(i < 10, "0", ""), i, ".png", sep=""),
         plot = draw_curve(ds[i]), device="png", width=4.5, height=4.5)
}

I probably should've combined theme functions. Ben would've been able to define ds in a one-liner and then map ggsave. I hope it's at least clear what my code does (just decrements the number of degrees moved each frame by one---no physics involved).

After producing the frames in alphabetical order (all that ifelse and paste mumbo-jumbo), I went to the output directory and ran the results through ImageMagick (which I'd previously installed on my now ancient Macbook Pro) from the terminal, using

> convert *.png -delay 3 -loop 0 spin.gif

That took a minute or two. Each of the pngs is about 100KB, but the final output is only 2.5MB or so. Maybe I should've gone with less delay (I don't even know what the units are!) and fewer rotations and maybe a slower final slowing down (maybe study the physics). How do the folks at Pixar ever get anything done?

P.S. I can no longer get the animation package to work in R, though it used to work in the past. It just wraps up those calls to ImageMagick.

P.P.S. That salmon and teal color scheme is the default!


The post “The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll appeared first on Statistical Modeling, Causal Inference, and Social Science.

The “Will & Grace” Conjecture That Won’t Die

From sociologist David Weakliem:

Why does Trump try to implement the unpopular ideas he’s proposed, and not the popular ideas?


The post How to design future studies of systemic exercise intolerance disease (chronic fatigue syndrome)? appeared first on Statistical Modeling, Causal Inference, and Social Science.

For conditions like S.E.I.D., then, the better approach may be to gather data from people suffering “in the wild,” combining the careful methodology of a study like PACE with the lived experience of thousands of people. Though most may be less eloquent than Rehmeyer, each may have his or her own potential path to recovery.

Ramsey asks:

From your perspective, are there particular design features to such an approach that one should prioritize, in order to maximize its usefulness to others?

Here’s the challenge.

The current standard model of evaluating medical research is the randomized clinical trial with 100 or so patients. This sort of trial is both too large and too small (see also here): too large in that there is so much variation in the population of patients, and different treatments will work (or not work, or even be counterproductive) for different people; too small in that the variation in such studies makes it hard to find reliable, reproducible results.

I think we need to move in two directions at once. From one direction, N=1 experiments: careful scientific evaluations of treatment options adapted to individual people. From the other direction, full population studies, tracking what really is happening outside the lab. The challenge there, as Ramsey notes, is that a lot of uncontrolled information is and will be available.

I’m sorry to say that I *don’t* have any good advice right now on how future studies should proceed. Speaking generally, I think it’s important to measure exactly what’s being done by the doctor and patient at all times, I think you should think carefully about outcome measures, and I think it’s a good idea to try multiple treatments on individual patients (that is, to perform within-person comparisons, also called crossover trials in this context). And, when considering observational studies (that is, comparisons based on existing treatments), gather whatever pre-treatment information is predictive of individuals’ choice of treatment regimen. For SEID in particular, it seems that the diversity of the condition is a key part of the story, and so it would be good to find treatments that work for well-defined subgroups.
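Here’s a toy simulation of that last point about within-person comparisons, with made-up numbers: when patients differ a lot from each other, a crossover design can detect an effect that a between-person comparison misses.

```r
set.seed(1)
n <- 30
patient_level <- rnorm(n, 0, 2)   # large between-patient differences
effect <- 0.5                     # modest average treatment effect
y_control   <- patient_level + rnorm(n, 0, 0.5)
y_treatment <- patient_level + effect + rnorm(n, 0, 0.5)
# Between-person comparison: the effect is swamped by patient variation
t.test(y_treatment, y_control)$p.value   # typically not significant
# Within-person (crossover) comparison: patient-level variation cancels
t.test(y_treatment - y_control)$p.value  # typically clearly significant
```

The within-person differences have the patient-level noise subtracted out, which is exactly why crossover trials are attractive for a condition as heterogeneous as this one.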

I hope others can participate in this discussion.


The post Should we continue not to trust the Turk? Another reminder of the importance of measurement appeared first on Statistical Modeling, Causal Inference, and Social Science.

From 2017 (link from Kevin Lewis), from Jesse Chandler and Gabriele Paolacci:

The Internet has enabled recruitment of large samples with specific characteristics. However, when researchers rely on participant self-report to determine eligibility, data quality depends on participant honesty. Across four studies on Amazon Mechanical Turk, we show that a substantial number of participants misrepresent theoretically relevant characteristics . . .

For some purposes you can learn a lot from these online samples, but it depends on context. Measurement is important, and it is underrated in statistics.

The trouble is that if you’re cruising along treating “p less than .05” as your criterion of success, then quality of measurement barely matters at all! Gather your data, grab your stars, get published, give your Ted talk, and sell your purported expertise to the world. Statistics textbooks have lots about how to analyze your data, a little bit on random sampling and randomized experimentation, and next to nothing on gathering data with high reliability and validity.


The post They want help designing a crowdsourcing data analysis project appeared first on Statistical Modeling, Causal Inference, and Social Science.

My collaborators and I are doing research where we try to understand the reasons for the variability in data analysis (“the garden of forking paths”). Our goal is to understand why scientists make different decisions regarding their analyses and in doing so reach different results.

In a project called “Crowdsourcing data analysis: Gender, status, and science”, we have recruited a large group of independent analysts to test the same hypotheses on the same dataset using a platform we developed.

The platform is essentially RStudio running online with a few additions:

· We record all executed commands even if they are not in the final code

· We ask analysts to explain these commands by creating semantic blocks explaining the rationale and alternatives

· We allow analysts to create a graphical workflow of their work using these blocks and by restructuring them

You can find the more complete experiment description here, along with a short video tutorial of the platform.

Of course this experiment does not cover all considerations that might lead to variability (e.g. R users might differ from Python users), but we believe it is a step towards better understanding how defensible, yet subjective, analytic choices may shape research results. The experiment is still running but we are likely to receive about 40-60 submissions of code, logs, comments, and explanations of decisions made. We are also collecting various information about the analysts, such as their backgrounds, the methods they usually use, and the way they operationalized the hypotheses.

Our current plan is to analyze the data from this crowdsourced project using inductive coding by splitting participants into groups that reached similar results (effect size and direction). We then plan to identify factors that can explain various decisions as well as explain the similarities between participants.

We would love to receive any feedback and suggestions from readers of your blog regarding our planned approach to account for variability in results across different analysts.

If anyone has suggestions, feel free to respond in the comments.


The post Graphs as comparisons: A case study appeared first on Statistical Modeling, Causal Inference, and Social Science.

Above is a pair of graphs from a 2015 paper by Alison Gopnik, Thomas Griffiths, and Christopher Lucas. It takes up half a page in the journal, Current Directions in Psychological Science. I think we can do better.

First, what’s wrong with the above graphs?

We could start with the details: As a reader, I have to go back and forth between the legend and the bars to keep the color scheme fresh in my mind. The negative space in the middle of each plot looks a bit like a white bar, which is not directly confusing but makes the display that much harder to read. And I had to do a double-take to interpret the infinitesimal blue bar of zero height on the left plot. Also, it’s not clear how to interpret the y-axes: 25 participants out of how many? And whassup with the y-axis on the second graph: the 15 got written as a 1, which makes me suspect that the graph was re-drawn from an original, which then leads to concern that other mistakes may have been introduced in the re-drawing process.

But that’s not the direction I want to go. There *are* problems with the visual display, but going and fixing them one by one would not resolve the larger problem. To get there, we need to think about goals.

A graph is a set of comparisons, and the two goals of a graph are:

1. To understand and communicate the sizes and directions of comparisons that you were already interested in, before you made the graph; and

2. To facilitate discovery of new patterns in data that go beyond what you expected to see.

Both these goals are important. It’s important to understand and communicate what we think we know, and it’s also important to put ourselves in a position where we can learn more.

The question, when making any graph, is: what comparisons does it make easy to see? After all, if you just wanted the damn numbers you could put them in a table.

Now let’s turn to the graph above. It makes it easy to compare the heights of two lines right next to each other—for example, I see that the dark blue lines are all higher than the light blue lines, except for the pair on the left . . . hmmmm, which are light blue and which are dark blue, again? I can’t really do much with the absolute levels of the lines because I don’t know what “25” means.

Look. I’m not trying to rag on the authors here. This sort of Excel-style bar graph is standard in so many presentations. I just think they could do better.

So, how to do better? Let’s start with the goals.

1. What are the key comparisons that the authors want to emphasize? From the caption, it seems that the most important comparisons are between children and adults. We want a graph that shows the following patterns:

(i) In the Combination scenario, children tended to choose multiple objects (the correct response, it seems) and adults tended to choose single objects (the wrong response).

(ii) In the Individual scenario, both children and adults tended to choose a single object (the correct response).

Actually, I think I’d re-order these, and first look at the Individual scenario which seems to be some sort of baseline, and then go to Combination which is displaying something new.

2. What might we want to learn from a graph of these data, beyond the comparisons listed just above? This one’s not clear so I’ll guess:

Who were those kids and adults who got the wrong answer in the Combination scenario? Did they have other problems? What about the *adults* who got the wrong answer in the Individual scenario, which was presumably easier? Did they also get the answer wrong in the other case? There must be some other things to learn from these data too—it’s hard to get people to participate in a psychology experiment, and once you have them there, you’ll want to give them as many tasks as you can. But from this figure alone, I’m not sure what these other questions would be.

OK, now time to make the new graph. Given that I don’t have the raw data, and I’m just trying to redo the figure above, I’ll focus on task 1: displaying the key comparisons clearly.

Hey—I just realized something! The two outcomes in this study are “Single object” and “Multiple object”—that’s all there is! And, looking carefully, I see that the numbers in each graph add up to a constant: it’s 25 children and, ummm, let me read this carefully . . . 28 adults!

This simplifies our task considerably, as now we have only 4 numbers to display instead of 8.

We can easily display four numbers with a line plot. The outcome is % who give the “Single object” response, and the two predictors are Child/Adult and Individual Principle / Combination Principle.

One of these predictors will go on the x-axis, one will correspond to the two lines, and the outcome will go on the y-axis.

In this case, which of our two predictors goes on the x-axis?

Sometimes the choice is easy: if one predictor is binary (or discrete with only a few categories) and the other is continuous, it’s simplest to put the continuous predictor as x, and use the discrete predictor to label the lines. In this case, though, both predictors are binary, so what to do?

I think we should use logical or time order, as that’s easy to follow. There are two options:

(1) Time order in age, thus Children, then Adults; or

(2) Logical order in the experiment, thus Individual Principle, then Combination Principle, as Individual is in a sense the control case and Combination is the new condition.

I tried it both ways and I think option 2 was clearer. So I’ll show you this graph and the corresponding R code. Then I’ll show you option 1 so you can compare.

Here’s the graph:

I think this is better than the bar graphs from the original article, for two reasons. First, we can see everything in one place: Like the title sez, “Children did better than adults,\nespecially in the combination condition.” Second, we can directly make both sorts of comparisons: we can compare children to adults, and we can also make the secondary comparison of seeing that both groups performed worse under the combination condition than the individual condition.

Here’s the data file I made, gopnik.txt:

Adult Combination N_single N_multiple
0 0 25 0
0 1 4 21
1 0 23 5
1 1 18 10

And here’s the R code:

setwd("~/AndrewFiles/research/graphics")
gopnik <- read.table("gopnik.txt", header=TRUE)
N <- gopnik$N_single + gopnik$N_multiple
p_multiple <- gopnik$N_multiple / N
p_correct <- ifelse(gopnik$Combination==0, 1 - p_multiple, p_multiple)
colors <- c("red", "blue")
combination_labels <- c("Individual\ncondition", "Combination\ncondition")
adult_labels <- c("Children", "Adults")

pdf("gopnik_2.pdf", height=4, width=5)
par(mar=c(3,3,3,2), mgp=c(1.7, .5, 0), tck=-.01, bg="gray90")
plot(c(0,1), c(0,1), yaxs="i", xlab="", ylab="Percent who gave correct answer",
     xaxt="n", yaxt="n", type="n",
     main="Children did better than adults,\nespecially in the combination condition",
     cex.main=.9)
axis(1, c(0, 1), combination_labels, mgp=c(1.5,1.5,0))
axis(2, c(0,.5,1), c("0", "50%", "100%"))
for (i in 1:2){
  ok <- gopnik$Adult==(i-1)
  x <- gopnik$Combination[ok]
  y <- p_correct[ok]
  # Agresti-Coull standard errors for the subsetted groups
  se <- sqrt((N[ok]*y + 2)*(N[ok]*(1-y) + 2)/(N[ok] + 4)^3)
  lines(x, y, col=colors[i])
  points(x, y, col=colors[i], pch=20)
  for (j in 1:2){
    lines(rep(x[j], 2), y[j] + se[j]*c(-1,1), lwd=2, col=colors[i])
    lines(rep(x[j], 2), y[j] + se[j]*c(-2,2), lwd=.5, col=colors[i])
  }
  text(mean(x), mean(y) - .05, adult_labels[i], col=colors[i], cex=.9)
}
dev.off()

Yeah, yeah, I know the code is ugly. I'm pretty sure it could be done much more easily in ggplot2.
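For what it’s worth, here’s a rough sketch of what the same graph might look like in ggplot2. This is my own untested translation, not code from the original analysis, and it assumes a recent ggplot2 (3.4 or later, for the `linewidth` aesthetic):

```r
library(ggplot2)

gopnik <- read.table("gopnik.txt", header = TRUE)
N <- gopnik$N_single + gopnik$N_multiple
p_multiple <- gopnik$N_multiple / N
gopnik$p_correct <- ifelse(gopnik$Combination == 0, 1 - p_multiple, p_multiple)
# Agresti-Coull standard errors, as in the base-graphics version
gopnik$se <- sqrt((N * gopnik$p_correct + 2) * (N * (1 - gopnik$p_correct) + 2) / (N + 4)^3)

ggplot(gopnik, aes(Combination, p_correct,
                   color = factor(Adult, labels = c("Children", "Adults")))) +
  geom_linerange(aes(ymin = p_correct - 2*se, ymax = p_correct + 2*se), linewidth = 0.3) +
  geom_linerange(aes(ymin = p_correct - se, ymax = p_correct + se), linewidth = 1) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = c(0, 1),
                     labels = c("Individual\ncondition", "Combination\ncondition")) +
  scale_y_continuous(limits = c(0, 1), breaks = c(0, .5, 1),
                     labels = c("0", "50%", "100%")) +
  labs(x = NULL, y = "Percent who gave correct answer", color = NULL,
       title = "Children did better than adults,\nespecially in the combination condition")
```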

Also, just for fun I threw in +/- 1 and 2 standard error bars, using the Agresti-Coull formula based on (y+2)/(n+4) for binomial standard errors. Cos why not. The one thing this graph *doesn't* show is whether the adults who got it wrong on the individual condition were more likely to get it wrong in the combination condition, but that information wasn't in the original graph either.
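In case the formula is hard to read inside the plotting loop: the Agresti-Coull standard error is just the usual binomial standard error applied to a proportion shifted toward 1/2 by adding two successes and two failures. Pulled out as its own little function:

```r
# Agresti-Coull: p_tilde = (y + 2)/(n + 4), se = sqrt(p_tilde*(1 - p_tilde)/(n + 4)),
# which is algebraically the same as sqrt((y + 2)*(n - y + 2)/(n + 4)^3)
agresti_coull_se <- function(y, n) {
  p_tilde <- (y + 2) / (n + 4)
  sqrt(p_tilde * (1 - p_tilde) / (n + 4))
}

# For example, the adults in the individual condition: 23 correct out of 28
agresti_coull_se(23, 28)
```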

On the whole, I'm satisfied that the replacement graph contains all the information in less space and is much clearer than the original.

Again, this is not a slam on the authors of the paper. They're not working within a tradition in which graphical display is important. I'm going through this example in order to provide a template for future researchers when summarizing their data.

And, just for comparison, here's the display going the other way:

(This looks so much like the earlier plot that it seems at first that we did something wrong. But, no, it just happened that way because we're only plotting four numbers, and it just happened that the two numbers whose positions changed had very similar values of 0.84 and 0.82.)

And here's the R code for this second graph:

pdf("gopnik_1.pdf", height=4, width=5)
par(mar=c(3,3,3,2), mgp=c(1.7, .5, 0), tck=-.01, bg="gray90")
plot(c(0,1), c(0,1), yaxs="i", xlab="", ylab="Percent who gave correct answer",
  xaxt="n", yaxt="n", type="n",
  main="Children did better than adults,\nespecially in the combination condition",
  cex.main=.9)
axis(1, c(0, 1), adult_labels)
axis(2, c(0,.5,1), c("0", "50%", "100%"))
for (i in 1:2){
  ok <- gopnik$Combination==(i-1)
  x <- gopnik$Adult[ok]
  y <- p_correct[ok]
  se <- sqrt((N[ok]*y + 2)*(N[ok]*(1-y) + 2)/(N[ok] + 4)^3)  # Agresti-Coull
  lines(x, y, col=colors[i])
  points(x, y, col=colors[i], pch=20)
  for (j in 1:2){
    lines(rep(x[j], 2), y[j] + se[j]*c(-1,1), lwd=2, col=colors[i])
    lines(rep(x[j], 2), y[j] + se[j]*c(-2,2), lwd=.5, col=colors[i])
  }
  text(mean(x), mean(y) - .1, combination_labels[i], col=colors[i], cex=.9)
}
dev.off()

The post Graphs as comparisons: A case study appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Hey—here are some tools in R and Stan to designing more effective clinical trials! How cool is that? appeared first on Statistical Modeling, Causal Inference, and Social Science.

It would be better to integrate design and analysis. My colleague Sebastian Weber works at Novartis (full disclosure: they have supported my research too), where they want to take some of the sophisticated multilevel modeling ideas that have been used in data analysis to combine information from different experiments, and apply these to the design of new trials.

Sebastian and his colleagues put together an R package wrapping some Stan functions so they can directly fit the hierarchical models they want to fit, using the prior information they have available, and evaluating their assumptions as they go.

Sebastian writes:

Novartis was so kind to grant permission to publish the RBesT (R Bayesian evidence synthesis Tools) R library on CRAN. It’s landed there two days ago. We [Sebastian Weber, Beat Neuenschwander, Heinz Schmidli, Baldur Magnusson, Yue Li, and Satrajit Roychoudhury] have invested a lot of effort into documenting (and testing) that thing properly. So if you follow our vignettes you get an in-depth tutorial into what, how and why we have crafted the library. The main goal is to reduce the sample size in our clinical trials. As such the library performs a meta-analytic-predictive (MAP) analysis using MCMC. Then that MAP prior is turned into a parametric representation, which we usually recommend to “robustify”. That means to add a non-informative mixture component which we put there to ensure that if things go wrong then we still get valid inferences. In fact, robustification is critical when we use this approach to extrapolate from adults to pediatrics. The reason to go parametric is that this makes it much easier to communicate that MAP prior. Moreover, we use conjugate representations such that the library performs operating characteristics with high-precision and high-speed (no more tables of type I error/power, but graphs!). So you see, RBesT does the job for you for the problem to forge a prior and then evaluate it before using it. This library is a huge help for our statisticians at Novartis to apply the robust MAP approach in clinical trials.
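To give a flavor of what this looks like in practice, here is a minimal sketch of the binary-endpoint workflow from the RBesT vignettes—MAP analysis, parametric mixture fit, then robustification. The function names (`gMAP`, `automixfit`, `robustify`) are from the package; the historical data here are made up for illustration, and you should consult the vignettes for the real settings:

```r
library(RBesT)

# Hypothetical historical control data: r responders out of n per study
hist_data <- data.frame(study = paste0("trial", 1:4),
                        r = c(8, 12, 6, 10),
                        n = c(40, 50, 35, 45))

# 1. Meta-analytic-predictive (MAP) analysis via MCMC
map_mcmc <- gMAP(cbind(r, n - r) ~ 1 | study, data = hist_data,
                 family = binomial,
                 tau.dist = "HalfNormal", tau.prior = 0.5,
                 beta.prior = 2)

# 2. Parametric (mixture-of-betas) representation of the MAP prior
map_prior <- automixfit(map_mcmc)

# 3. "Robustify": add a weakly-informative mixture component so that
#    inference stays valid if the new trial conflicts with history
map_robust <- robustify(map_prior, weight = 0.2, mean = 0.5)
```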

Here are the vignettes:

– Getting started with RBesT (binary)

– Using RBesT to reproduce Schmidli et al. “Robust MAP Priors”


The post What is “overfitting,” exactly? appeared first on Statistical Modeling, Causal Inference, and Social Science.

It’s not overfitting so much as model misspecification.

I really like this line. If your model is correct, “overfitting” is impossible. In its usual form, “overfitting” comes from using too weak a prior distribution.

One might say that “weakness” of a prior distribution is not precisely defined. Then again, neither is “overfitting.” They’re the same thing.

**P.S.** In response to some discussion in comments: One way to define overfitting is when you have a complicated statistical procedure that gives worse predictions, on average, than a simpler procedure.

Or, since we’re all Bayesians here, we can rephrase: Overfitting is when you have a complicated model that gives worse predictions, on average, than a simpler model.

I’m assuming full Bayes here, not posterior modes or whatever.

Anyway, yes, overfitting can happen. And it happens when the larger model has too weak a prior. After all, the smaller model can be viewed as a version of the larger model, just with a very strong prior that restricts some parameters to be exactly zero.
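Here’s a toy demonstration of that last point (my own example, nothing deep): regress pure noise on 50 noise predictors, and the “complicated” flat-prior model predicts worse out of sample than the intercept-only model, which is just the big model with an infinitely strong prior that the slopes are zero:

```r
set.seed(123)
n <- 60; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                        # the truth: no signal at all
x_new <- matrix(rnorm(n * p), n, p)  # fresh data from the same process
y_new <- rnorm(n)

fit_big   <- lm(y ~ x)   # 50 slopes with a flat prior
fit_small <- lm(y ~ 1)   # all slopes fixed at exactly zero

rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))
rmse(cbind(1, x_new) %*% coef(fit_big), y_new)  # typically well above 1
rmse(rep(coef(fit_small), n), y_new)            # close to the true sd of 1
```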


The post Stan Weekly Roundup, 14 July 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

- Kevin Van Horn and Elea McDonnell Feit put together a tutorial on Stan [GitHub link] that covers linear regression, multinomial logistic regression, and hierarchical multinomial logistic regression.
- **Andrew** has been working on writing up our “workflow”. That includes Chapter 1, Verse 1 of *Bayesian Data Analysis*: (1) specifying a joint density for all observed and unobserved quantities, (2) performing inference, and (3) model criticism and posterior predictive checks. It also includes fake-data generation and program checking (the Cook-Gelman-Rubin procedure; eliminating divergent transitions in HMC) and comparison to other inference systems as a sanity check. It further includes the process of building up the model incrementally, starting from a simple model. We’re trying to write this up in our case studies and make it the focus of the upcoming Stan book.
- **Ben Goodrich** is working on RStanArm with lots of new estimators, specifically following from nlmer, for GLMs with unusual inverse functions. This led to some careful evaluation, uncovering some multimodal behavior.
- **Breck Baldwin** has been pushing through governance discussions so we can start thinking about how to make decisions about the Stan project when not everyone agrees. I think we’re going to go more with a champion model than a veto model; stay tuned.
- **Mitzi Morris** has been getting a high-school intern up to speed for doing some model comparisons and testing.
- **Mitzi Morris** has implemented the Besag-York-Mollie likelihood with improved priors provided by Dan Simpson. You can check out the ongoing branch in the stan-dev/example-models repo.
- **Aki Vehtari** has been working on improving Pareto smoothed importance sampling and refining effective sample size estimators.
- **Imad Ali** has prototypes of the intrinsic conditional autoregressive models for RStanArm.
- **Charles Margossian** is working on gradients of steady-state ODE solvers for Torsten and a mixed solver for forcing functions in ODEs; papers are in the works, including a paper selected to be highlighted at ACoP.
- **Jonah Gabry** is working on a visualization paper with Andrew for submission, is gearing up for the Stan course later this summer, and is debugging R packages.
- **Sebastian Weber** has been working on the low-level architecture for MPI, including a prototype linked from the Wiki. The holdup is in shipping out the data to the workers. Anyone know MPI and want to get involved?
- **Jon Zelner** and **Andrew Gelman** have been looking at adding hierarchical structure to discrete-parameter models for phylogeny. These models are horribly intractable, so they’re trying to figure out what to do when you can’t marginalize and can’t sample (you can write these models in PyMC3 or BUGS, but you can’t explore the posterior), and when you can do some kind of pre-pruning (as is popular in natural language processing and speech recognition pipelines).
- **Matthew Kay** has a GitHub package, TidyBayes, that aims to integrate data and sampler data munging in a TidyVerse style (wrapping the output of samplers like JAGS and Stan).
- **Quentin F. Gronau** has a Bridgesampling package on CRAN, the short description of which is “Provides functions for estimating marginal likelihoods, Bayes factors, posterior model probabilities, and normalizing constants in general, via different versions of bridge sampling (Meng & Wong, 1996)”. I heard about it when Ben Goodrich recommended it on the Stan forum.
- **Juho Piironen** and **Aki Vehtari** arXived their paper, Sparsity information and regularization in the horseshoe and other shrinkage priors. Stan code included, naturally.


The post Slaying Song appeared first on Statistical Modeling, Causal Inference, and Social Science.

On April 22, Tribe shared a story from a website called the Palmer Report — a site that has been criticized for spreading hyperbole and false claims — entitled “Report: Trump gave $10 million in Russian money to Jason Chaffetz when he leaked FBI letter,” a reference to the notorious pre-election letter sent by former FBI director James Comey to members of Congress that many have blamed for Hillary Clinton’s November loss.

The “report” the article points to is a since-deleted tweet by a Twitter user named LM Garner, who describes herself in her Twitter biography as “Just a VERY angry citizen on Twitter. Opinions are my own. Sometimes prone to crazy assertions. Not a fan of this nepotistic kleptocracy.” Garner, who has 257 followers, has tweeted more than 25,000 times from her protected account.

“I don’t know whether this is true,” Tribe’s tweet reads, “But key details have been corroborated and none, to my knowledge, have been refuted. If true, it’s huge.”

Reached by email, Tribe said that he was aware of the Palmer Report’s “generally liberal slant” and “that some people regard a number of its stories as unreliable.” Still, he added, “When I share any story on Twitter, typically with accompanying content of my own that says something like ‘If X is true, then Y,’ I do so because a particular story seems to be potentially interesting, not with the implication that I’ve independently checked its accuracy or that I vouch for everything it asserts.”

OK, then. But the “Palmer Report” thing rang a bell—didn’t someone send me something from there once? I did a quick search and found this Slate article, “Stop Saying the Election Was Rigged,” regarding “the rampant sharing of two postelection articles from Bill Palmer.”

Kinda sad to see a high-paid law professor fall for this sort of thing.

Still, though, whenever I see the name Laurence Tribe I will think of this letter. Bluntly put, indeed. If you’ll forgive my reference to bowling.


The post Classical statisticians as Unitarians appeared first on Statistical Modeling, Causal Inference, and Social Science.

Christian Robert, Judith Rousseau, and I wrote:

Several of the examples in [the book under review] represent solutions to problems that seem to us to be artificial or conventional tasks with no clear analogy to applied work.

“They are artificial and are expressed in terms of a survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address (. . . ) The question of interest is whether there has been a change in support between the surveys (…). We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change.”

Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population.

A defender of Aitkin (and of classical hypothesis testing) might respond at this point that,

yes, everybody knows that changes are never exactly zero and that we should take a more “grown-up” view of the null hypothesis, not that the change is zero but that it is nearly zero. Unfortunately, the metaphorical interpretation of hypothesis tests has problems similar to the theological doctrines of the Unitarian church. [emphasis added] Once you have abandoned literal belief in the Bible, the question soon arises: why follow it at all? Similarly, once one recognizes the inappropriateness of the point null hypothesis, we think it makes more sense not to try to rehabilitate it or treat it as a treasured metaphor but rather to attack our statistical problems directly, in this case by performing inference on the change in opinion in the population. . . .

All this is application-specific. Suppose public opinion was observed to really be flat, punctuated by occasional changes, as in the left graph in Figure 3. In that case, Aitkin’s question of “whether there has been a change” would be well-defined and appropriate, in that we could interpret the null hypothesis of no change as some minimal level of baseline variation.

Real public opinion, however, does not look like baseline noise plus jumps, but rather shows continuous movement on many time scales at once, as can be seen from the right graph in Figure 3, which shows actual presidential approval data. In this example, we do not see Aitkin’s question as at all reasonable. Any attempt to work with a null hypothesis of opinion stability will be inherently arbitrary. It would make much more sense to model opinion as a continuously-varying process.

The statistical problem here is not merely that the null hypothesis of zero change is nonsensical; it is that the null is in no sense a reasonable approximation to any interesting model. The sociological problem is that, from Savage (1954) onward, many Bayesians have felt the need to mimic the classical null-hypothesis testing framework, even where it makes no sense.

This quote came up in blog comments a few years ago; I love it so much I wanted to share it again.

**P.S.** I also like this one, from that same review:

In a nearly century-long tradition in statistics, any probability model is sharply divided into “likelihood” (which is considered to be objective and, in textbook presentations, is often simply given as part of the mathematical specification of the problem) and “prior” (a dangerously subjective entity to which the statistical researcher is encouraged to pour all of his or her pent-up skepticism). This may be a tradition but it has no logical basis. If writers such as Aitkin wish to consider their likelihoods as objective and consider their priors as subjective, that is their privilege. But we would prefer them to restrain themselves when characterizing the models of others. It would be polite to either tentatively accept the objectivity of others’ models or, contrariwise, to gallantly affirm the subjectivity of one’s own choices.


The post 3 things that will surprise you about model validation and calibration for state space models appeared first on Statistical Modeling, Causal Inference, and Social Science.

I was wondering if you had any advice specific to state space models when attempting model validation and calibration. I was planning on conducting a graphical posterior predictive check.

I’d also recommend fake-data simulation. Beyond that, I’d need to know more about the example.
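For concreteness, here’s the sort of fake-data simulation I have in mind, for a simple local-level model (a generic sketch, not specific to my correspondent’s application): simulate data with known parameters, fit the model, and check that the fit recovers the truth—ideally repeating over many fake datasets to check the coverage of interval estimates.

```r
set.seed(42)
n_t <- 500
sigma_state <- 0.3   # known truth: sd of the state innovations
sigma_obs   <- 1.0   # known truth: sd of the measurement noise

state <- cumsum(rnorm(n_t, 0, sigma_state))   # local-level (random walk) state
y <- state + rnorm(n_t, 0, sigma_obs)         # noisy observations

# Fit with base R's structural time series routine and compare the
# estimated sds to the known truth
fit <- StructTS(y, type = "level")
sqrt(fit$coef)   # should be near c(0.3, 1.0), up to sampling error
```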

I’m posting here because this seems like a topic that some commenters could help on (and could supply the 3 things promised by the above title).


The post Daryl Bem and Arthur Conan Doyle appeared first on Statistical Modeling, Causal Inference, and Social Science.

The only thing I don’t like about Engber’s article is its title, “Daryl Bem Proved ESP Is Real. Which means science is broken.” I understand that “Daryl Bem Proved ESP Is Real” is kind of a joke, but to me this is a bit too close to the original reporting on Bem, back in 2011, where people kept saying that Bem’s study was high quality, state-of-the-art psychology, etc. Actually, Bem’s study was crap. It’s every bit as bad as the famously bad papers on beauty and sex ratio, ovulation on voting, elderly-related words and slow walking, etc.

And “science” is not broken. Crappy science is broken. Good science is fine. If “science” is defined as bad articles published in PPNAS—himmicanes, air rage, ages ending in 9, etc.—then, sure, science is broken. But if science is defined as the real stuff, then, no, it’s not broken at all. Science could be improved, sure. And, to the extent that some top scientists operate on the goal of tabloid publication and Ted-talk fame, then, sure, the system of publication and promotion could be said to be broken. But to say “science is broken” . . . . I think that’s going too far.

Anyway, I agree with Engber on the substance and I admire his ability to present the perspectives of many players in this story. A grabby if potentially misleading title is not such a big deal.

**But what about that Bem paper?**

One of the people who pointed me to Engber’s article knows some of the people involved and assured me that the Journal of Personality and Social Psychology editor who handled Bem’s paper is, and was, no fool.

So how obvious were the problems in that original article?

Here, I’m speaking not of problems with Bem’s theoretical foundation or with his physics—I won’t go there—but rather with his experimental design and empirical analysis.

I do think that paper is terrible. Just to speak of the analysis, the evidence is entirely from p-values but these p-values are entirely meaningless because of forking paths. The number of potential interactions to be studied is nearly limitless, as we can see from the many many different main effects and interactions mentioned in the paper itself.
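The arithmetic here is easy to simulate. Forking paths is subtler than classical multiple comparisons—the other analyses need never actually be run—but the multiple-comparisons calculation gives the flavor of how cheap a single p < .05 becomes when there are many interactions that could have been tested:

```r
set.seed(1)
n_sims <- 10000
n_tests <- 20   # say, 20 possible interactions one could look at

# Under the null, p-values are uniform; record the smallest of the 20
min_p <- replicate(n_sims, min(runif(n_tests)))
mean(min_p < 0.05)   # roughly 1 - 0.95^20, i.e., about 0.64
```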

But then the question is, how could smart people miss these problems?

Here’s my answer: It’s all obvious in retrospect but wasn’t obvious at the time. Remember, Arthur Conan Doyle was fooled by amateurish photos of fairies. The JPSP editor was no fool either. Much depends on expectations.

Here are the fairy photos that fooled Doyle, along with others. The photos are obviously faked, and it was obvious at the time too. Doyle just really really wanted to believe in fairies. From everything I’ve heard about the publication of Bem’s article, I doubt that the journal editor really really wanted to believe in ESP. But I wouldn’t be surprised if this editor really really wanted to believe that an eminent psychology professor would not do really bad research.

**P.S.** I wrote the post a few months ago and it just happened to appear the day after a post of mine on why “Clinical trials are broken.” So we’ll need to discuss further.

**P.P.S.** Just to clarify the Bem issue, here are a few more quotes from Engber’s article:

Even with all that extra care, Bem would not have dared to send in such a controversial finding had he not been able to replicate the results in his lab, and replicate them again, and then replicate them five more times. His finished paper lists nine separate ministudies of ESP. Eight of those returned the same effect.

Bem’s paper has zero preregistered replications. What he has are “conceptual replications,” which are open-ended studies that can be freely interpreted as successes through the garden of forking paths.

Here’s Engber again:

But for most observers, at least the mainstream ones, the paper posed a very difficult dilemma. It was both methodologically sound and logically insane.

No, the paper is not methodologically sound. Its conclusions are based on p-values, which are statements regarding what the data summaries would look like, had the data come out differently, but Bem offers no evidence that, had the data come out differently, his analyses would’ve been the same. Indeed, the nine studies of his paper feature all sorts of different data analyses.

Engber gets to these criticisms later in his article. I just worry that people who just read the beginning will take the above quotes at face value.


The post Clinical trials are broken. Here’s why. appeared first on Statistical Modeling, Causal Inference, and Social Science.

I responded as follows:

At one point I had the thought of doing a big investigative project on this, formally interviewing a bunch of people on all sides of the issue, etc., but then I didn’t really have the energy to do so. When Rehmeyer’s book came out, I had the idea of reviewing it, and using that review as a springboard to talk about the larger issues, as I think this goes beyond Pace to the more general question about how to develop and evaluate therapies for poorly-understood medical conditions. The standard paradigm of statistically based science goes like this:

1. Come up with a cool idea.

2. Test it in a clinical trial.

The trouble is that part 1 is not well integrated with data—there’s not really a quantitative approach to developing potential treatment ideas.

And part 2 has the problem that classical statistical methods don’t work when studying small effects; see here.

With Pace there is an additional difficulty that systemic exertion intolerance disease, or chronic fatigue syndrome, is not so well defined, and so it seems perfectly plausible to me that graded exercise therapy and cognitive behavioral therapy could help some subset of people characterized as having chronic fatigue syndrome, even if useless to Julie Rehmeyer and others. But the larger problem is the disconnect between the development and evaluation of the treatments. To make progress, I think, we need to move beyond the idea that the idea and all the details of the treatment come from the outside, with the data and statistical analysis being only to evaluate (or, unfortunately, to “prove” or “demonstrate”) pre-existing ideas. One thing I like about working in pharmacometrics is that there’s more of a continuous path between data, science, and policy.


The post Further criticism of social scientists and journalists jumping to conclusions based on mortality trends appeared first on Statistical Modeling, Causal Inference, and Social Science.

So. We’ve been having some discussion regarding reports of the purported increase in mortality rates among middle-aged white people in America. The news media have mostly spun a simple narrative of struggling working-class whites, but there’s more to the story.

Some people have pointed me to some contributions from various sources:

In “The Death of the White Working Class Has Been Greatly Exaggerated,” journalist Malcolm Harris looks into some of the selection biases and problems with data presentation that have led people to misunderstand recent trends.

In “Why Are American Non-College Whites Killing Themselves?,” Hugh Whalen goes beyond the mortality trends (which is all that I’ve ever looked at for this problem) and considers social and economic causes. I haven’t looked into Whalen’s analysis in detail but from a quick glance it looks reasonable, and more sophisticated than the basic white-men-are-suffering story. (Yes, white men are suffering; so are a lot of other people. The point of these more careful analyses is not to dismiss the pain of white men but rather to understand the bigger picture.)

In “Deaths of Despair. An Analysis of the Case-Deaton Conference Paper on the Mortality Rates of Middle-Aged Whites,” Echidne (the pseudonym of a thoughtful social-science-and-politics blogger) again recognizes that these comparisons of recent mortality trends are important and newsworthy, while criticizing the over-simplified stories presented in prominent scholarly articles as well as in the press. Echidne also talks about the bait-and-switch, in which the data come from both sexes (indeed, as shown in the graphs below, the much-advertised increase in middle-aged white mortality is among women, not among men) but then the discussion is mostly about men.

I find a lot of this “working class” discussion to be gendered. The phrase “working class” seems to conjure up an image of a man working at a factory and not a woman cleaning bedpans, for example. I’m sure that middle-aged men have it tough in many ways, but I’ve been hearing the story of the suffering emasculated white male for over 40 years now—ever since the movies “Mean Streets,” “Rocky,” and “Saturday Night Fever”—and every time we hear it again, it’s presented as a new and transgressive idea. Again, I think it’s fine to talk about the troubles of non-upper-class white men, and also to talk about corresponding troubles of non-upper-class white women, and all sorts of other groups. We just have to be careful with the bait-and-switch. The statistics should be relevant to the group being discussed, and terms such as “working class” (or, worse, “blue collar“) have this gendered aspect that can distort the conversation.

**Background (for those who haven’t been following the story)**

Economists Anne Case and Angus Deaton started the general discussion with two papers, “Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century” and “Mortality and morbidity in the 21st century.”

These articles raised some interesting issues but had some technical problems, as I discussed in various ways here on the blog and with summary articles in Slate in 2015 and 2017.

From my paper with Jonathan Auerbach:

The whole discussion was kinda weird because everybody knows that when you’re analyzing mortality rates you should separate the sexes and you should age adjust. Reporting trends in mortality rates without age adjusting is like reporting trends in nominal prices without adjusting for the consumer price index.
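Here’s age adjustment in miniature, with made-up numbers: weight each age group’s death rate by a fixed standard population, so that a shift in the age composition of, say, a 45-54 bucket can’t masquerade as a change in mortality.

```r
rates_1999 <- c("45-49" = 3.0, "50-54" = 4.5)  # deaths per 1,000 (invented)
rates_2013 <- c("45-49" = 3.2, "50-54" = 4.8)

pop_1999 <- c(0.55, 0.45)  # share of the 45-54 group in each age bin
pop_2013 <- c(0.45, 0.55)  # the group got older over time
std_pop  <- c(0.50, 0.50)  # fixed standard weights

sum(rates_1999 * pop_1999)  # crude rate, 1999
sum(rates_2013 * pop_2013)  # crude rate, 2013: inflated by aging alone
sum(rates_1999 * std_pop)   # age-adjusted rate, 1999
sum(rates_2013 * std_pop)   # age-adjusted rate, 2013: a much smaller increase
```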

More recently there was some discussion regarding breakdowns by age, education, and ethnicity. I expressed some concern about selection bias from the education breakdown, also some concern about isolating particular ethnicities and particular age groups, and so Jonathan and I prepared a document showing trends for both sexes, lots of age groups, and four different ethnic categories.

**P.S.** Bottom line is that Case, Deaton, and many others have made valuable contributions in looking at these data in various ways and drawing people’s attention to demographic trends, both expected and unexpected. Let’s not let some relatively minor technical issues distract us from that.


The post You can read two versions of this review essay on systemic exertion intolerance disease (chronic fatigue syndrome) appeared first on Statistical Modeling, Causal Inference, and Social Science.

The New Yorker editors are pros, so I can only assume their version is better than mine. But here on this blog there are no space limitations—pixels are free—and so I thought I’d share the original, which I completed two months ago:

1. “Through the Shadowlands: A Science Writer’s Odyssey Into an Illness that Science Doesn’t Understand”

Julie Rehmeyer was a science journalist leading an active, outdoorsy life in New Mexico when, over the period of a few months, she lost most of her strength, endurance, and confidence, along with the ability to live a normal life. This all happened over fifteen years ago, and her new book, Through the Shadowlands, chronicles her struggles since then, her search for a cure or at least some amelioration of her condition, and the steps she went through to get to this point.

Spoiler: As of the book’s publication in 2017, Rehmeyer has a mild form of her condition, enabling her to go about her days but requiring an extraordinary level of care and caution. There are many twists and turns in the story, and the book is structured around the theories, relationships, and physical possessions that she needed to discard to get from there to here.

In a way, this is the story of your life and mine–if we get lucky enough to live it all the way through. We become weaker as we age, at times gradually and at times suddenly; you don’t know what you’ve got till it’s gone. But along the way we adapt ourselves to fit our diminished capabilities and we adapt our environment to fit us; we make new connections and struggle to make sense of the world, to clear the trail for others. Rehmeyer’s struggles with her health and well-being gave an urgency to the search for understanding of self and nature that we all must undertake in our lives.

Rehmeyer’s physical condition, which she shares with an estimated million other Americans, is called chronic fatigue syndrome, a term that unfortunately doesn’t explain much of anything: its definition from the Mayo Clinic, for example, is “a complicated disorder characterized by extreme fatigue that can’t be explained by any underlying medical condition.” The condition has also been called myalgic encephalomyelitis and, more recently, systemic exertion intolerance disease. This last phrase seems to me to be the most descriptive, so I will use it in the subsequent discussion. A particularly frustrating aspect of the disease is that the usual principles do not apply: the seemingly sensible plan of gradually ramping up exercise level can lead not to a gradual rebuilding of strength but to a collapse. The strategy that would be most natural for a physically active person recovering from an injury or an illness turns out to be counterproductive, and no treatment has been found to be generally effective.

Here is the outline of the book’s narrative: Following her physical breakdown, Rehmeyer quickly went through the usual steps of conventional and alternative medicine, but her condition continued to worsen. Somewhere along the line she moved to a different state, struggled to keep up with her work, and broke up with a partner who was not willing or able to be with her in sickness or in health. She eventually got in touch with an online community that recommended avoidance of even the smallest exposure to mold. Following this severe regimen, which required her to get rid of her house and much of its contents, she cobbled together a functional, if wary, life. More recently she trained herself to reduce her sensitivity to low concentrations of mold, but she remains subject to exposures that can suddenly weaken her. At the same time, however, Rehmeyer has struggled not only to understand her world but to change it, and along with this effort came new friends, a new partner, and an involvement with the scientific literature and controversy on exertion intolerance disease.

Rather than retelling each episode of this story, I will give some quotes that illustrate key episodes in the external events and in the development of the narrator’s ideas.

“Although I [Rehmeyer] certainly wasn’t a Christian Scientist as my mother was, her religion had fueled the belief in me that diligent attention to one’s internal feelings and attitudes was the best place to start in solving a problem, allowing one to act more powerfully in the world and sometimes opening the doors to change in ways you couldn’t predict.”

“Clearly, something had been going wrong inside my body, and medicine’s failure to understand what that something was didn’t change that fundamental reality. . . . But without an explanation, the illness felt a bit like Schrödinger’s cat, neither dead nor alive, neither physical nor psychological–and yet rich with possibility.”

“When my mother first died, I’d felt like the sole survivor of a catastrophe that had wiped out not just her, but my entire culture, as if I were an ancient Indian who was the only remaining speaker of my native language.”

“When I read the scientific literature, though, it felt like tissue paper to me, the findings fragile, with great gaps between them. There were thousands of studies out there, many of which identified abnormalities, but the abnormalities seemed barely related, and the studies were generally so small I didn’t trust them anyway.”

“I was struck by the key role a ‘good story’ had in my decision-making. . . . what made it a good story for me was that it cast me as adventurer rather than victim . . . I also noticed that when I contemplated this expedition to the desert, a sort of current ran through me, an energy that felt different from personal excitement.”

At this point, I couldn’t get the image from Breaking Bad out of my mind, that image of Walter White in that camper out in the desert.

“When Jones described the state’s effort to save the devils, I felt a pang of envy. Tasmania was spending many millions of dollars every year . . . If the CDC had responded as forcefully to the outbreak of ME/CFS in the 80’s, would I have been all alone in dealing with this damn illness? I wondered how much the Tasmanian devils were helped by the vulnerability of those potoroos and bandicoots and bettongs . . . maybe what we needed was a Looney Tunes character based on us . . . Taz might have few similarities to real devils, but he’d certainly created worldwide awareness of them.”

And then there was the mysterious swim cure, what Rehmeyer describes as “a reliable miracle.” Early on in her story, when it seemed that nothing could stop her life from falling apart, Rehmeyer found that when she was at her weakest, she could jump in the pool, will her body to swim, and her weakness would go away for the rest of the day—“but it only worked if I went in the middle of an episode, not beforehand.” Beyond the practical challenges that this entailed, it also frustrated Rehmeyer at an intellectual level, as she has never, even to this day, found an explanation for why the swim cure worked, and it remains a loose end in her story.

Rehmeyer’s progress was far from linear: there was a period for a year or so where her condition nearly disappeared, only to return more severely than before. Her rigorous strategy of avoiding all mold gave her some control over her symptoms, but then eventually she was able to handle greater exposures. She concludes by writing, “I’m doing very well, but I’m also not cured. . . . At times, I’m close to 100% well, but other times are more difficult. I’m able to live a full life. . . . I got lucky.” She adds: “A number of ME/CFS patients have tried mold avoidance, influenced by me, and their results have been mixed.”

While focused on Rehmeyer’s own experiences, the book touches in many places on the larger questions of the diagnosis and treatment of systemic exertion intolerance disease, and I’ll return to some of these issues in a moment.

But first, what makes this book so appealing and the underlying story so thought-provoking? I see two factors.

First, Rehmeyer is the hero of her own story: literally knocked off her feet by a still-mysterious condition, she needed to elicit from others the help that she required, working through theories of what was going on. Full of life, curiosity, and ingenuity, this is the book that Richard Feynman would’ve written, had he been afflicted with a disease that the scientific establishment couldn’t understand.

Second, systemic exertion intolerance disease is a medical mystery affecting a million or more people, with a corresponding scientific controversy. Can psychiatric treatment help sufferers of systemic exertion intolerance disease? If so, does this mean the condition is “all in the head”? To what extent can we trust conclusions published in medical journals? I have argued above that some of these controversies arise from thinking in black-and-white terms and not recognizing variation in the condition and its treatment, but in any case this has become a fascinating story of a struggle between patient groups and the medical establishment, and also within a scientific community that has often had difficulty balancing the goals of publication, certainty, and snappy conclusions with real-world uncertainty and variation.

2. Systemic exertion intolerance disease and the controversy over the Pace trial

As Rehmeyer so powerfully demonstrates in her book, there is no known cure for systemic exertion intolerance disease. Indeed, there is not even any generally effective therapy for ameliorating the condition. But there are a lot of theories floating around, and some of them have been incorporated into interventions that have been tested in clinical trials.

Most prominent of these studies is the Pace trial, an experiment conducted in 2005 on 641 patients in the United Kingdom who were randomly assigned to four treatments: adaptive pacing therapy, cognitive behavioral therapy, graded exercise therapy, or nothing beyond medical care. One reason for the influence of the study is that the U.K., through its National Health Service, must make decisions about what treatments to approve and fund, and controlled clinical trials are generally taken to offer the best evidence.

There has been much dispute about the results of the Pace trial, along with a struggle over data availability. Before getting to the controversy, I will summarize the findings as presented in the original paper on the study, published in the English medical journal The Lancet in 2011. This article, bylined by a team of 19 authors led by Peter White, reported results after 12 weeks, 24 weeks, and one year of followup, which were encouraging for two of the four treatments in the study:

– Cognitive behavior therapy (“This theory regards chronic fatigue syndrome as being reversible and that cognitive responses (fear of engaging in activity) and behavioural responses (avoidance of activity) are linked and interact with physiological processes to perpetuate fatigue.”), and

– Graded exercise therapy (“GET was done on the basis of deconditioning and exercise intolerance theories of chronic fatigue syndrome. These theories assume that the syndrome is perpetuated by reversible physiological changes of deconditioning and avoidance of activity. . . . The aim of treatment was to help the participant gradually return to appropriate physical activities, reverse the deconditioning, and thereby reduce fatigue and disability.”).

The results were less positive for the other two treatments that were considered:

– Adaptive pacing therapy (“APT was based on the envelope theory of chronic fatigue syndrome. This theory regards chronic fatigue syndrome as an organic disease process that is not reversible by changes in behaviour and which results in a reduced and finite amount (envelope) of available energy. . . . This adaptation was achieved by helping the participant to plan and pace activity to reduce or avoid fatigue, achieve prioritised activities and provide the best conditions for natural recovery.”), and

– Specialist medical care (“SMC was provided by doctors with specialist experience in chronic fatigue syndrome. . . . Treatment consisted of an explanation of chronic fatigue syndrome, generic advice, such as to avoid extremes of activity and rest, specific advice on self-help, according to the particular approach chosen by the participant (if receiving SMC alone), and symptomatic pharmacotherapy (especially for insomnia, pain, and mood).”) SMC was given to all participants in the study, hence it can be considered a control condition in this experiment.

The major outcomes presented were based on self-reported fatigue and physical function scores. Fatigue was based on a questionnaire which put people on a 0-33 scale with patients’ average initial scores of around 28. This score dropped during the year (that is, a reduction in fatigue) by an average of 7.4 and 7.6 points under CBT and GET, the two more effective treatments, but only 5.4 and 4.5 points under APT and SMC. Physical function was based on a questionnaire which put people on a 0-100 scale with patients’ average initial scores of around 38. This score increased during the year (an increase in physical function) by 19.2 and 21.0 under CBT and GET, but only 8.7 and 11.6 under APT and SMC. Improvements of 2 points in the fatigue score or 8 points in the physical function score were considered “clinically useful,” hence the authors reported that CBT and GET were effective, while APT was no more effective than the control (that is, APT plus SMC was no better than SMC alone).
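As an arithmetic check on those averages, here is a small sketch in Python, using only the group means reported above (it says nothing about individual-level variation, which, as discussed below, is not recoverable from the published summaries):

```python
# Mean one-year changes as reported in the 2011 Lancet paper (group averages only).
fatigue_drop = {"CBT": 7.4, "GET": 7.6, "APT": 5.4, "SMC": 4.5}      # 0-33 fatigue scale; bigger drop = less fatigue
function_gain = {"CBT": 19.2, "GET": 21.0, "APT": 8.7, "SMC": 11.6}  # 0-100 physical-function scale

def differences_vs_control():
    """Extra improvement of each therapy over specialist medical care alone."""
    return {tx: (round(fatigue_drop[tx] - fatigue_drop["SMC"], 1),
                 round(function_gain[tx] - function_gain["SMC"], 1))
            for tx in ("CBT", "GET", "APT")}

# The paper's "clinically useful" thresholds were 2 points (fatigue)
# and 8 points (physical function).
for tx, (d_fatigue, d_function) in differences_vs_control().items():
    print(f"{tx} vs SMC alone: fatigue {d_fatigue:+.1f}, physical function {d_function:+.1f}")
```

In particular, APT’s differences relative to SMC come out near zero (and negative on physical function), consistent with the authors’ conclusion that APT plus SMC was no better than SMC alone.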

On the other hand, as Julie Rehmeyer wrote in a news article a couple years ago, “The study participants hadn’t significantly improved on any of the team’s chosen objective measures: They weren’t able to get back to work or get off welfare, they didn’t get more fit, and their ability to walk barely improved. Though the PACE researchers had chosen these measures at the start of the experiment, once they’d analyzed their data, they dismissed them as irrelevant or not objective after all.” Rehmeyer was suspicious of the subjective reports, imagining herself as a participant: “I come in and I’m asked to rate my symptoms. Then, I’m repeatedly told over a year of treatment that I need to pay less attention to my symptoms. Then I’m asked to rate my symptoms again. Mightn’t I say they’re a bit better—even if I still feel terrible–in order to do what I’m told, please my therapist, and convince myself I haven’t wasted a year’s effort?”

The 2011 Pace paper and its followups were much criticized among activist groups and among the larger scientific community, leading to a stern editorial and several responses in the Journal of Health Psychology. As summarized by Carolyn Wilshire in that discussion, “Cognitive behavioural therapy and graded exercise therapy had modest, time-limited effects on self-report measures, but little effect on more objective measures such as fitness and employment status. Given that the trial was non-blinded, and the favoured treatments were promoted to participants as ‘highly effective’, these effects may reflect participant response bias.”

In 2015, a team of researchers led by biologist Ronald Davis, and including my Columbia biostatistics colleague Bruce Levin, wrote an open letter criticizing how the Pace team reported its results. Davis and his colleagues wrote: “In an accompanying Lancet commentary, colleagues of the PACE team defined participants who met these expansive ‘normal ranges’ as having achieved a ‘strict criterion for recovery.’ The PACE authors reviewed this commentary before publication.”

In later correspondence, the Pace authors wrote, “our paper did not report on recovery,” but they did use this term in other writings. For example, there’s a 2007 paper, “Is a full recovery possible after cognitive behavioural therapy for chronic fatigue syndrome?”, which reports: “Using the most comprehensive definition of recovery, 23% of the patients fully recovered.” And the aforementioned Lancet commentary, written by collaborators of the Pace team, stated, “the recovery rate of cognitive behaviour therapy and graded exercise therapy was about 30%.”

As a statistician, I don’t see much value in using hard thresholds, but if you are going to talk about recovery as a yes-or-no outcome, then it’s important to use consistent definitions. One criticism of the analysis of the Pace trial is that the research team changed their definition of recovery after collecting their data, weakening it so much that patients could get worse on two of four criteria (the central measures: fatigue and physical function) over the course of the trial and still be considered “recovered.”

One difficulty in understanding all these comparisons is that changes are happening to individuals, but the published article reports only averages. This is one of many reasons why outsiders have requested the patient-level data from the experiment, and this has created its own controversy: when the experimenters refuse to share the data, is it to protect patient confidentiality or simply to protect themselves from criticism?

I don’t have access to the raw data myself but I can make some inferences based on general statistical principles. The first is that there is always variation: some people get better, some people get worse, others improve on some measures and decline on others. In a setting such as systemic exertion intolerance disease where the treatments are so speculative and the condition itself is not clearly defined, we would expect any treatment to be more effective than control only on some subset of people.

Based on the treatment descriptions given above, it makes sense that each of the three options—CBT, GET, or APT—would help some sufferers of systemic exertion intolerance disease and not others. Beyond that, patients can get much better or much worse because of unforeseen other factors, as happened various times in Julie Rehmeyer’s story.

To get an idea of what we’re struggling with, here’s an unhelpful quote from a radio interview with Lancet editor Richard Horton: “adaptive pacing therapy essentially believes that chronic fatigue is an organic disease which is not reversible by changes in behaviour. Whereas cognitive behaviour therapy obviously believes that chronic fatigue is entirely reversible. And these two philosophies are kind of facing off against one another in the patient community and what these scientists were trying to do is to say, ‘Well, let’s see. Which one is right?'”

The problem with this attitude is that systemic exertion intolerance disease, or ME/CFS, is a diverse condition, or set of conditions lumped under a common diagnosis. It is completely reasonable to think that different therapies could work for different people, and that the condition has different sources for different people. Even for any particular person, the condition could have a mix of causes and be amenable to a mix of therapies. So the attitude that it’s one or the other can be a serious mistake even for a given patient, let alone when trying to characterize a broadly-diagnosed syndrome in the general population.

It should be no surprise that cognitive behavioral therapy and exercise therapy can help people. The success of these therapies for some percentage of people does not at all contradict the idea that many others need a lot more, nor does it provide much support for the idea that “fear avoidance beliefs” are holding back those people with systemic exertion intolerance disease who, like Rehmeyer, were in no condition to increase their exercise level.

As Simon Wessely, one of the supporters of the Pace trial and a pillar of the English medical establishment, writes, “there were a significant number of patients who did not improve with these treatments. Some patients deteriorated, but this seems to be the nature of the illness, rather than related to a particular treatment. . . . PACE or no PACE, we need more research to provide treatments for those who do not respond to presently available treatments.”

I was bummed that the New Yorker editor cut the line about “Breaking Bad” and other parts of my book review—but, yeah, yeah, I know, “kill your darlings.” On the plus side, the editor, Sharan Shetty, pushed me to end more clearly with a discussion of the larger implications for medical research. I told Shetty that, following academic rules, I’d like to add him as coauthor, but he said, no, that’s not how they do things in the world of general-interest publishing. So I’ll thank him here.

**P.S.** Gary Greenberg points out an error in my New Yorker article, which states, “Traditionally, the controlled trial has been considered the gold standard of medical evidence: you gather a bunch of patients and randomly assign some to a control group, which receives no treatment, and some to an experimental group, which does.”

That’s not quite right. “Control” can include a placebo or indeed any alternative treatment, so it was not correct to define “control” as “no treatment.”

The post You can read two versions of this review essay on systemic exertion intolerance disease (chronic fatigue syndrome) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Why they aren’t behavioral economists: Three sociologists give their take on “mental accounting” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Rather than retreat to disciplinary corners, let us begin by affirming our respect for the generative work undertaken across a variety of disciplines. We’re all talking money, so it is helpful to specify what’s similar and what’s different when we do. . . . In this post, we address our closest cousins: behavioral economics and cognitive psychology. . . .

Consider the case of a child’s “college fund.” Marketing professors Soman and Ahn recount the dilemma of an acquaintance of theirs, an economist, who was faced with the option of borrowing money at a high rate of interest to pay for a home renovation or using money he already had saved in his three-year-old son’s low-interest-rate education account. As a father, he simply could not go through with the more cost-effective option of “breaking into” his child’s education fund. Soman and Ahn use this story to frame how consequential the emotional content of a particular mental account can be. . . .

How does the sociological approach differ? . . . when managing these accounts, individuals are really managing their relationships with others. The account is thus relational as well as psychological as individuals engage in what we call relational work. In the anecdote of the college savings account, for instance, we find the parents reluctant to dip into money earmarked for their children’s education. Why? Because these funds represent and reinforce meaningful family ties: they include but transcend individual mental budgeting; the accounts are therefore as relational as they are mental. Suppose a mother gambles away money from the child’s “college fund.” This is not only a breach of cognitive compartments but involves a relationally damaging violation. Most notably, the misspending will hurt her relationship to her child. But the mother’s egregious act is likely to also undermine the relationship to her spouse and even to family members or friends who might sanction harshly the mother’s misuse of money. These interpersonal dynamics thereby help explain why a college fund functions so effectively as a salient relational earmark rather than only a cognitive category.

Ok, laugh all you want at the sociology jargon—“salient relational earmark,” etc. No big deal; all fields have jargon. That’s not my concern here. My real point is that Bandelj, Wherry, and Zelizer have a point here. The econ story of bank accounts is all about the liquidity of money. The psych story is all about individual behaviors, what do people do with their money and how do they think about it. The soc story is all about roles and interpersonal relations and institutions. “Behavioral economics,” which is a mashup of econ and psych, doesn’t tell the whole story.

The other thing—and this is important—is that the perspectives coming from these three academic disciplines are not competing; they’re complementary. It’s important that money in different bank accounts is liquid—or, to be more precise, it can be liquid for those people who choose to let it be so. It’s important that people often seem to behave as if there are walls between the accounts, restricting their transactions and “freezing” the money, as it were. And it’s important to understand the social context of these behaviors.

Analogously, in section 5.2 of our paper on rational-choice models of voting, Edlin, Kaplan, and I discuss how the rational model is complementary with a psychological understanding of voters. It’s my impression that Bandelj, Wherry, and Zelizer are in agreement with me on this general point, that patterns of human behavior can be usefully understood in different theoretical frameworks. There’s no “right” or “wrong” framework (although one can come to correct or incorrect conclusions within any framework); rather, each framework gives us a way of thinking about the behavior, and entry points into studying it further.

I talk more about frameworks, and how they differ from theories, here.

**P.S.** The above post appeared on the orgtheory blog, where I was amused to see that it was followed by a series of ads. Commerce! I guess the blog organizers were hoping that readers of the post would be motivated to spend something from their academic or research accounts.

**P.P.S.** Now here’s a really funny, or sad, story. I googled *mental accounting* to make sure I was getting the term right, and I ended up at a website called Investopedia (to which I give no link for reasons that will become clear in a moment).

It starts out with a clear enough definition, giving the pure econ perspective on the liquidity of money:

Mental accounting refers to the tendency for people to separate their money into separate accounts based on a variety of subjective criteria, like the source of the money and intent for each account. . . . Although many people use mental accounting, they may not realize how illogical this line of thinking really is. For example, people often have a special “money jar” or fund set aside for a vacation or a new home, while still carrying substantial credit card debt. . . . Simply put, it’s illogical (and detrimental) to have savings in a jar earning little to no interest while carrying credit-card debt accruing at 20% annually. . . . This seems simple enough, but why don’t people behave this way? The answer lies with the personal value that people place on particular assets. For instance, people may feel that money saved for a new house or their children’s college fund is too “important” to relinquish. . . . Logically speaking, money should be interchangeable, regardless of its origin. . . . The key point to consider for mental accounting is that money is fungible; regardless of its origins or intended use, all money is the same. . . .
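To put a number on the fungibility point, here’s a minimal sketch (the amounts are hypothetical; the 20% rate echoes the figure in the quote):

```python
# Hypothetical amounts: savings idling in a "vacation jar" at ~0% interest
# while a credit-card balance revolves at 20% APR.
jar_savings = 5_000.0
card_debt = 5_000.0
card_apr = 0.20

# Interest that could be avoided each year by using the jar money
# to pay down the card instead of earmarking it:
annual_cost_of_earmarking = min(jar_savings, card_debt) * card_apr
print(annual_cost_of_earmarking)
```

On these made-up numbers the mental earmark costs $1,000 a year, which is the econ-perspective point; the sociological point, per Bandelj, Wherry, and Zelizer, is that the earmark may be buying something relational that this arithmetic doesn’t capture.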

But then comes the twist:

Introducing: Become A Day Trader by Investopedia Academy

In less than a day, learn the tricks of the trade and the traps to avoid. In this self-paced, step-by-step course, you’ll learn to develop a money management strategy which mitigates your risk, capitalize on market movements while keeping your emotions in check, and the 6 most profitable trades for your arsenal. Pre-register for free >>

Whoa! First they’re offering innocuous (if slightly misleading) financial advice, then all of a sudden it’s turned into a scam!

But, hey, not a problem. On their website, they announce:

We are serious about maintaining a company culture in each office that is imaginative and fun. Investopedia’s offices are collaborative, highly engaged, and full of positive energy.

Promoting day trading—the ultimate negative-expected-value financial activity—that’s totally cool as long as you’re “full of positive energy,” right? Or maybe the point is that an “imaginative and fun” company culture compensates for what would otherwise be a pretty demoralizing job. They could share notes with the boys at Caesars on how to most effectively fleece the sheep.

The post Why they aren’t behavioral economists: Three sociologists give their take on “mental accounting” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Sparse regression using the “ponyshoe” (regularized horseshoe) model, from Juho Piironen and Aki Vehtari appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but has previously suffered from two problems. First, there has been no systematic way of specifying a prior for the global shrinkage hyperparameter based on the prior information about the degree of sparsity in the parameter vector. Second, the horseshoe prior has the undesired property that there is no possibility of specifying separately information about sparsity and the amount of regularization for the largest coefficients, which can be problematic with weakly identified parameters, such as the logistic regression coefficients in the case of data separation.

So, what are they going to do?

Funny you should ask:

This paper proposes solutions to both of these problems. We introduce a concept of effective number of nonzero parameters, show an intuitive way of formulating the prior for the global hyperparameter based on the sparsity assumptions, and argue that the previous default choices are dubious based on their tendency to favor solutions with more unshrunk parameters than we typically expect a priori. Moreover, we introduce a generalization to the horseshoe prior, called the regularized horseshoe, that allows us to specify a minimum level of regularization to the largest values. We show that the new prior can be considered as the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe resembles the spike-and-slab with an infinitely wide slab. Numerical experiments on synthetic and real world data illustrate the benefit of both of these theoretical advances.
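To make the “finite slab width” idea concrete, here is a small numerical sketch (plain Python, my own illustration rather than the authors’ code), using the regularized-horseshoe parameterization from the paper, lambda_tilde^2 = c^2 lambda^2 / (c^2 + tau^2 lambda^2), where c is the slab width and tau the global shrinkage scale:

```python
import math

def effective_scale(lam, tau=0.1, c=2.0):
    """Effective prior scale tau * lambda_tilde of a coefficient under the
    regularized horseshoe; tau and c values here are arbitrary for illustration."""
    lam_tilde_sq = c**2 * lam**2 / (c**2 + tau**2 * lam**2)
    return tau * math.sqrt(lam_tilde_sq)

# For small local scales the prior behaves like the original horseshoe
# (effective scale is approximately tau * lam) ...
print(effective_scale(1.0))
# ... but for huge local scales it is capped near the slab width c,
# whereas the original horseshoe's scale would grow without bound.
print(effective_scale(1e6))
```

This is the sense in which the regularized horseshoe acts as a continuous counterpart of spike-and-slab with a finite slab width: no coefficient escapes at least the N(0, c²) level of regularization.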

This is a big deal. It’s the modern way of variable selection for regressions with many predictors. And it’s in Stan! (and soon, we hope, in rstanarm)

The post Sparse regression using the “ponyshoe” (regularized horseshoe) model, from Juho Piironen and Aki Vehtari appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Night Hawk appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Not sure whether you saw the NYT story a couple of days ago about the declining prospects for democracy in rich countries (based on a recently published paper by Roberto Foa (University of Melbourne) and Yascha Mounk (Harvard)). This graph, showing differences in the fraction of individuals reporting that it is “essential” to live in a democracy (i.e., 10 out of 10 on an importance scale) across birth cohorts, was used (by the Times reporter) as evidence that “the percentage of people who say it is ‘essential’ to live in a democracy has plummeted.”

Of course there are many potential issues here (e.g., the arbitrary choice of 10/10 as a definition of “essential”), but I primarily object to the use of birth cohort differences from a cross-sectional survey (World Values Survey) as evidence of “plummeting” rates, which I think typically would mean changes over time rather than over birth cohorts. I used the same data to replicate this graph but also showed that there has been little change over time, and, moreover, (at least for the US) the declines in saying it is “essential” to live in a democracy were primarily coming from older birth cohorts. I wrote a little blog about it here, but if you graph *changes* across survey waves by birth cohort, I think it is hard to back up the story that the NYT was selling.
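The cohort-versus-time distinction can be shown with a toy example (the numbers below are invented for illustration, not the WVS estimates): if support varies across birth cohorts but is stable within each cohort over time, a single cross-section still shows a large young–old gap that can be misread as a decline.

```python
# Invented numbers: fraction saying democracy is "essential," by birth decade.
# By construction, nothing changes over time within any cohort.
essential_by_cohort = {1930: 0.72, 1950: 0.62, 1970: 0.50, 1990: 0.40}

wave_1995 = dict(essential_by_cohort)
wave_2010 = dict(essential_by_cohort)  # identical: no change across waves

# A cross-sectional look at the 2010 wave shows a big cohort gap,
# which a careless reading turns into "support has plummeted":
cohort_gap = round(wave_2010[1930] - wave_2010[1990], 2)
print(cohort_gap)

# But the within-cohort change across survey waves is exactly zero:
within_cohort_change = {c: wave_2010[c] - wave_1995[c] for c in essential_by_cohort}
print(within_cohort_change)
```

This is the point of graphing *changes* across survey waves by birth cohort: the cross-sectional gap and the over-time trend are different quantities.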

The post Night Hawk appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I am a Swedish math professor turned cultural evolutionist and psychologist (and a fan of your blog). I am currently working on a topic that might interest you (why public opinion moves on some issues but not on others), but that’s for another day.

Hey—I’m very interested in why public opinion moves on some issues but not on others. I look forward to hearing about this on that other day.

Anyway, Eriksson continues:

I want to alert you to this paper published yesterday in Frontiers in Psychology by Fritz Strack, who published the original finding on the effect on happiness of holding a pen in your mouth to force a smile. This very famous result recently failed to replicate, and Fritz Strack is angry that some people may interpret this as the original effect not being valid. Rather (he argues), people who run replications lack the expertise and motivation to run them as well as the ground-breaking researchers who publish first on the topic. In support of his view he even cites Susan Fiske:

In their introduction to the 2016 volume of the Annual Review of Psychology, Susan Fiske, Dan Schacter, and Shelley Taylor point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects. To allow non-replications to regain this constructive role, they must come with conclusions that enter and stimulate a critical debate. It is even better if replication studies are endowed with a hypothesis that relates to the state of the scientific discourse. To show that an effect occurs only under one but not under another condition is more informative than simply demonstrating non-effects (Stroebe and Strack, 2014). But this may require expertise and effort.

I have two problems with Strack’s argument.

First, he privileges his original, uncontrolled study over the later, controlled replications. From a scientific or statistical standpoint, this doesn’t make sense, for reasons I explain in my post on the time-reversal heuristic.

Second, he’s making the common mistake of considering only one phenomenon at once. Suppose Strack, Fiske, etc., are correct that we should take all those noisy published studies seriously, that we should forget about type M and type S errors and just consider statistically significant estimates as representing true effects. In that case, every study is a jungle. Sure, Strack did an experiment in which people’s faces were being held in smiling positions—but maybe the results were entirely driven by the power pose that the experimenters were using, or not, when doing their study. Maybe the experimenter was doing a power pose under one smiling condition and not under the other, and that determined everything. What if the results all came about because one of the experimenters was ovulating and was wearing red, which in turn decisively changed the attitudes of the participants in the study? What if everything happened from unconscious priming: perhaps there were happiness-related or sadness-related words in the instructions which put the participants in a mood? Or maybe everyone was happy because their local college football team had won last weekend—or sad because they lost? Perhaps they were busy preparing for a himmicane or upset about their ages ending in 9? You might say that such effects wouldn’t matter if the experiment was randomized, but that’s not correct in a world in which interactions can be as large as main effects, as in the famous papers on fat arms and political attitudes (a reported interaction with parents’ socioeconomic status), ovulation and clothing (interaction with outdoor temperature), ovulation and voting (interaction with relationship status), ESP (interactions with image content), and the collected works of Brian Wansink (interactions with just about everything).

In the full PPNAS world of Strack, Fiske, Bargh, etc., we’re being buffeted by huge effects every moment of the day, which makes any particular experiment essentially uninterpretable and destroys the plan to compare the results of a series of noisy studies as an “opportunity to find limiting conditions and contextual effects.”

So, sorry, but no.

From a political standpoint there could be value in a face-saving “out” which would allow Strack to preserve his belief in the evidentiary value of his originally published statistically significant comparison, even in light of later failed replications and a statistical understanding that very little information is contained in a noisy estimate, even if it happens to have “p less than .05” attached to it. But from a scientific and statistical point: No, you just have to start over.

Here, I’ll call it like I see it. Perhaps others who are more politically savvy can come up with a plan for some more diplomatic way to say it.

The post Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan Weekly Roundup, 7 July 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>
- **Ben Goodrich** and **Jonah Gabry** shipped RStan 2.16.2 (their numbering is a little beyond base Stan, which is at 2.16.0). This reintroduces error reporting that got lost in the 2.15 refactor, so please upgrade if you want to debug your Stan programs!
- **Joe Haupt** translated the JAGS examples in the second edition of John Kruschke’s book *Doing Bayesian Data Analysis* into Stan. Kruschke blogged it and Haupt has a GitHub page with the Stan programs. I still owe him some comments on the code.
- **Andrew Gelman** has been working on the second edition of his and Jennifer Hill’s regression book, which is being rewritten as two linked books and translated to Stan. He’s coordinating with **Jonah Gabry** and **Ben Goodrich** on the RStanArm replacements for lme4 and lm/glm in R.
- **Sean Talts** got in the pull request for enabling C++11/C++14 in Stan. This is huge for us developers, as we have a lot of pent-up demand for C++11 features on the back burner.
- **Michael Betancourt**, with feedback from the NumFOCUS advisory board for Stan, put together a web page of guidelines for using the Stan trademarks.
- **Gianluca Baio** released version 1.0.5 of survHE, a survival analysis package based on RStan (and INLA and ShinyStan). There’s also the GitHub repo that **Jacki Buros Novik** made available with a library of survival analysis models in Stan. Techniques from these packages will probably make their way into RStanArm eventually (Andrew’s putting in a survival analysis example in the new regression book).
- **Mitzi Morris** finished testing the Besag-York-Mollie model in Stan and it passes the Cook-Gelman-Rubin diagnostics. Given that GeoBUGS gets a different answer, we now think GeoBUGS is wrong, but those tests haven’t completed running yet (it’s much slower than Stan in terms of effective sample size per unit time if you want to get to convergence).
- **Imad Ali** has been working with Mitzi on getting the BYM model into RStanArm.
- **Jonah Gabry** taught a one-day Stan class in Padua (Italy) while on vacation. That’s how much we like talking about Stan.
- **Ben Goodrich** just gave a Stan talk at the Brussels useR conference group, following close on the heels of his Berlin meetup. You can find a lot of information about upcoming events at our events page.
- **Mitzi Morris** and **Michael Betancourt** will be teaching a one-day Stan course for the Women in Machine Learning meetup event in New York on 22 July 2017, hosted by Viacom. Dan Simpson’s comment on the blog post was priceless.
- **Martin Černý** improved the feature he wrote to implement a standalone function parser for Stan (to make it easier to expose functions in R and Python).
- **Aki Vehtari** arXived a new version of the horseshoe prior paper with a parameter to control regularization more tightly, especially for logistic regression. It has the added benefit of being more robust and removing divergent transitions in the Hamiltonian simulation. Look for that to land in RStanArm soon.
- **Charles Margossian** continues to make speed improvements on the Stan models for Torsten and is also working on getting the algebraic equation solver into Stan so we can do fixed points of diff eqs and other fun applications. If you follow the link to the pull request, you can also see my extensive review of the code. It’s not easy to put a big feature like this into Stan, but we provide lots of help.
- **Marco Inacio** got in a pull request for definite numerical integration. There are, needless to say, all sorts of subtle numerical issues swirling around integration. Marco is working from John Cook’s basic implementation of numerical integration, and John’s been nice enough to offer it under a BSD license so it would be compatible with Stan.
- **Rayleigh Lei** is working on vectorizing all the binary functions and has a branch with the testing framework. This is really hairy template programming, but probably a nice break after his first year of grad school at U. Michigan!
- **Allen Riddell** and **Ari Hartikainen** have been working hard on Windows compatibility for PyStan, which is no walk in the park. Windows has been the bane of our existence since starting this project, and if all the world’s applied statisticians switched to Unix (Linux or Mac OS X), we wouldn’t shed a tear.
- **Yajuan Si**, Andrew Gelman, Rob Trangucci, and Jonah Gabry have been working on a survey weighting module for RStanArm. Sounds like RStanArm’s quickly becoming the Swiss Army knife (Leatherman?) of Bayesian modeling.
- **Andrew Gelman** finished a paper on (issues with) NHST and is wondering about clinical effects that are small by design because they’re being compared to the state-of-the-art treatment as a baseline.
- My own work on mixed mode tests continues apace. The most recent pull request adds logical operators (and, or, not) to our autodiff library (it’s been in Stan—this is just rounding out the math lib operators directly) and removed 4000 lines of old code (replacing it with 1000 new lines, but that includes doc and three operators in both forward and reverse mode). I’m optimistic that this will eventually be done and we’ll have RHMC and autodiff Laplace approximations.
- **Ben Bales** submitted a pull request for appending arrays, which is under review and will be generalized to arbitrary Stan array types.
- **Ben Bales** also submitted a pull request for the initial vectorization of RNGs. This should make programs doing posterior predictive inference so much cleaner.
- I wrote a functional spec for standalone generated quantities. This would let us do posterior predictive inference after fitting the model. As you can see, even simple things like this take a lot of work. That spec is conservative on a task-by-task basis, but given the correlations among tasks, probably not so conservative in total.
- I also patched potentially undefined bools in Stan; who knew that C++ would initialize a bool in a class to values like 127? This followed on from Ben Goodrich filing the issue after some picky R compiler flagged some undefined behavior. Not a bug, but the code’s cleaner now.


]]>The post My unpublished papers appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>My undergraduate thesis, on the errors of Robert Axelrod’s application of game theory to trench warfare, went unpublished in any form for a long time. About 10 or 15 years after graduating, I rewrote the most interesting parts of the thesis in article form and submitted it to a political science journal; it was returned with lots of comments and I didn’t do anything more with it. Then a while later I turned it into a chapter in my edited book with Jeronimo Cortina, A Quantitative Tour of the Social Sciences, and then a few years after *that*, I was contacted by a journal that wanted to publish it, which I did, under the title Methodology as Ideology.

My next unpublished paper was from 1986; it came from my research on zone-melt-reconstruction of silicon; backstory here. I liked that one, and by then I’d read a bunch of research papers (and written a few), so I had a sense of how to get the results down on paper. I never finished the article, though, and I must have lost my copy of it.

My Ph.D. thesis, Topics in Image Reconstruction from Emission Tomography, from 1990, was never published. But lots of people asked me for copies of it; I must have sent out 100 or so. Parts of it went into various published articles, but most of it just served as an education for me.

Next one was from 1991: it was a Bayesian version of the iterative proportional fitting algorithm, using Gibbs sampling, or something like it, to draw from the posterior distribution in a model for contingency tables. This one even got some cites, I think. We never submitted it to a journal, though, because ultimately I wasn’t really happy with the model, which had no structure.

Hmmm, what else? There was a paper from 2003 that I’m a coauthor on, but which I think is just horrible so I removed it from my C.V. It was from a project where I served as statistical consultant, and I wasn’t happy with what was done, which was my fault as much as anyone else’s: had I insisted on something different/better, we probably could’ve done it, but I was just too lazy. That one’s not unpublished, but I’ve done my best to unpublish it, as it were.

After that, I have this list of 24 papers. Many of these, especially the more recent ones, don’t really count, as they’re submitted to journals and in some form will surely get published somewhere. Others on this list have already been published in abridged form but I’ve kept the original, longer versions on the website.

Here are the papers from that list which are unpublished in article form and will probably stay that way:

Fully Bayesian computing (with Jouni Kerman, from 2004). This is from Jouni’s thesis, and we introduce what is now called probabilistic programming. Now that everybody knows about probabilistic programming it’s not clear that there’d be any reason to publish this one.

Sampling for Bayesian computation with large datasets (with Zaiying Huang, from 2005). Our first divide-and-conquer algorithm. I never tried to publish this paper because the speed improvements from parallelization were so underwhelming in our example. It influenced our later work on expectation propagation, and we continue to cite this unpublished paper from 2005.

Moderation in the pursuit of moderation is no vice: the clear but limited advantages to being a moderate for Congressional elections (with Jonathan Katz, from 2007). I like this paper, and it’s mostly done, but we never bothered to get it into final shape. I used much of it in one of the chapters in Red State Blue State, which, in turn, has lots of research material that could’ve been made into articles, had we chosen to do so. (Back when Deb Nolan and I wrote our first edition of Teaching Statistics: A Bag of Tricks, I realized we had lots of publishable material so we quickly extracted about 10 articles from that book and published them in different places. But by the time I was writing Red State Blue State, six years later, I’d lost the motivation to churn out articles in that way. Not that I think there’s anything wrong with churning out articles: it’s a way to reach different audiences that might never otherwise see that material.)

One vote, many Mexicos: Income and vote choice in the 1994, 2000, and 2006 presidential elections (with Jeronimo Cortina and Naryana Lasala, from 2008). A spinoff of Red State Blue State, we used some of it in chapter 7 of that book. We submitted the paper to journals and revised it a few times; maybe it will appear at some point, I’m not sure.

Thoughts on new statistical procedures for age-period-cohort analyses (from 2008). This one was really annoying! The editor of the American Journal of Sociology invited me to write this as a comment on a paper to appear in their journal. I wrote this article, which I really like, and then the journal told me they didn’t want it. I never felt like taking the trouble to turn this into a stand-alone article. But it did help Yair and me set up our paper on the Great Society, Reagan’s revolution, and generations of presidential voting, which I’m sure will appear in a journal some day.

Visualizing distributions of covariance matrices (with Tomoki Tokuda, Ben Goodrich, Iven Van Mechelen, and Francis Tuerlinckx, from 2011, maybe?). I’m not actually sure why this never got published, as the paper is crisp and clean, with some good ideas. I guess it was nobody’s #1 priority, so once we got it rejected by a couple journals, we just let it slide.

Why ask why? Forward causal inference and reverse causal questions (with Guido Imbens, from 2013). I loooove this paper. I can’t remember if we ever submitted it anywhere. Guido and I talked about it with Avi Feller, and the consensus was that we’d need to do more literature review to get it acceptable to a journal. We made some plans but then never bothered to go through with it. Again, nobody’s first priority. I incorporated it into one of the causal inference chapters in my upcoming book with Jennifer, so maybe it will reach people in that way.

The problem with p-values is how they’re used (from 2013). Funny story about this one: the journal Ecology solicited it as a comment on a paper they were publishing. Then in the production process, I found out they wanted to charge me, I think it was $300. Huh? They asked me to write the article for them, I wrote it for free, and then they wanted to charge *me* $300?? It turned out this fee was in the fine print all along. I couldn’t believe it. So I said, just forget about it. Nothing I can do with the article, so I just kept it on the unpublished papers site.

Causal inference with small samples and incomplete baseline for the Millennium Villages Project (with Shira Mitchell, Rebecca Ross, Susanna Makela, Elizabeth Stuart, Avi Feller, and Alan Zaslavsky). This one has lots of good stuff; I have no idea if it was ever submitted to a journal in this form or if it was just cannibalized for other papers.

NO TRUMP!: A statistical exercise in priming (with Jonathan Falk, from 2016). An amusing parody, no chance of getting published. I guess I could’ve submitted it to arXiv on April 1, but that’s not a journal publication either. My earlier paper on zombies did get published, actually, in a sampler of modern writing for undergraduates! So all things are possible, I guess.

Attitudes toward amalgamating evidence in statistics (with Keith O’Rourke, from 2016). Someone invited me to write this for some journal, I forget which, and then Keith and I wrote this fine little piece, and the journal rejected it! How annoying. Their call, though. I don’t know what we’ll do with it.

I suppose my collaborators and I have many more unpublished articles to come. But I expect none of them will rival any of the articles in this list of the greatest works of statistics never published.


]]>The post The upcoming NBA hackathon: You’ll never guess the top 10 topics . . . appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We’re hosting our second annual NBA Hackathon this September. This year, there will be two tracks, basketball analytics and business analytics. Prizes include a trip to NBA All-Star 2018 in Los Angeles and a lunch with NBA Commissioner, Adam Silver. Any help spreading the word among your students and beyond is greatly appreciated.

Sounds like fun. Anticipated top 10 NBA hackathon topics:

10. The Knicks aren’t so bad; they won almost twice as many games as the legendary 1972 Miami Dolphins!

9. If Steph Curry juiced like Barry Bonds, would he be able to hit 50% from half-court?

8. The WNBA uses smaller basketballs. Shouldn’t they use smaller hoops too?

7. Remember that article exaggerating the effects of fan distraction in basketball? Was that the worst sports article ever published in the New York Times?

6. I went to an NBA game a couple of years ago and it was REALLY LOUD. Like, that jumbotron would never shut up. Could those business analytics guys convince them to chill out a bit? It’s a sporting event, not a circus, for chrissake.

5. The Tebow effect.

4, 3, 2. Ummmm, I’m running out of ideas here . . .

1. The myth of the myth of the myth of the myth of the myth of the myth of the myth of the myth of the myth of the myth of the hot hand.


]]>The post Statisticians and economists agree: We should learn from data by “generating and revising models, hypotheses, and data analyzed in response to surprising findings.” (That’s what Bayesian data analysis is all about.) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>All analysts approach data with preconceptions. The data never speak for themselves. Sometimes preconceptions are encoded in precise models. Sometimes they are just intuitions that analysts seek to confirm and solidify. A central question is how to revise these preconceptions in the light of new evidence.

Empirical analyses in economics have diverse goals—all valuable. . . . Common across all approaches is lack of formal guidelines for taking the next step and learning from surprising findings. There is no established practice for dealing with surprise . . .

This paper advocates a strategy for reacting to surprise. Economists should abduct. Abduction is the process of generating and revising models, hypotheses, and data analyzed in response to surprising findings. . . .

Regular readers of this blog or of our article and books will not be surprised that I am in complete agreement that we should react to surprise, generate and revise models and hypotheses, etc.

It’s just too bad that Heckman and Singer are unfamiliar with modern Bayesian statistics. For example, they write:

Do Bayesians Abduct?

Bayesian readers will likely respond that learning from data is an integral part of Bayesian reasoning. They are correct as long as they describe learning about events that are a priori thought to be possible as formalized in some prior, however arrived at.

More fundamentally, Bayesians have no way to cope with the totally unexpected (priors rule out “a surprising fact C is observed” if C is a complete surprise). Total surprise is the domain of abduction. . . .

I don’t think they really mean *total* surprise—all our reasoning is probabilistic. But, on the larger point, yes, learning from surprise is a core aspect of Bayesian data analysis. Indeed, it’s the third of the three steps listed on the very first page of our book, Bayesian Data Analysis. Here is how our book begins:

1.1 The three steps of Bayesian data analysis

This book is concerned with practical methods for making inferences from data using probability models for quantities we observe and for quantities about which we wish to learn. The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis.

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

1. Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.

2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.

3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.
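The three steps above can be sketched operationally in a few lines. Here is a minimal illustration (my own made-up conjugate Beta-Binomial example with hypothetical data, not anything from the book): set up the model, condition on the data, then check the fit by simulating from the posterior predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data for illustration: 14 successes in 20 binomial trials.
y, n = 14, 20

# Step 1: set up a full probability model.
# Here: y ~ Binomial(n, theta) with a Beta(1, 1) prior on theta.
a_prior, b_prior = 1.0, 1.0

# Step 2: condition on the observed data. For this conjugate model the
# posterior is Beta(a_prior + y, b_prior + n - y); we represent it by draws.
theta_draws = rng.beta(a_prior + y, b_prior + n - y, size=10_000)

# Step 3: evaluate the fit. Simulate replicated datasets from the posterior
# predictive distribution and compare a test statistic to the observed data.
y_rep = rng.binomial(n, theta_draws)
p_value = float(np.mean(y_rep >= y))
print(f"posterior mean of theta: {theta_draws.mean():.2f}")
print(f"posterior predictive p-value: {p_value:.2f}")
```

In this toy example the observed data are typical of the replications, so the posterior predictive p-value is unremarkable; a surprising dataset would push it toward 0 or 1, which is the signal to go back to step 1 and alter or expand the model.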

What Heckman and Singer call abduction is included in this step 3, and we talk a lot more about it in chapter 6 of the book.

Don’t get me wrong—I’m not saying this idea is original to me, or to me and my collaborators. I’m just disputing the claim that “Bayesians have no way to cope with the totally unexpected.” We do! We set up strong models, then when the unexpected happens, we realize we’ve learned something.

Here’s another relevant article:

Why ask why? Forward causal inference and reverse causal questions (with Guido Imbens)

And, for a non-quantitative take on the same idea:

When do stories work? Evidence and illustration in the social sciences.

That’s the paper where Thomas Basbøll and I argue that good stories are anomalous and immutable, which is another way of saying that we learn from surprises, from aspects of reality that don’t fit our existing models.

Also this one from 2003:

A Bayesian formulation of exploratory data analysis and goodness-of-fit testing.

Finally, here’s a paper where Cosma Shalizi and I connect statistical model checking and model improvement with Lakatosian ideas of testing and improvement of research programs.

Again, my citation of this work is not an attempt to claim priority, nor is it intended to diminish Heckman and Singer’s suggestions. I assume they’ll be happy to learn that an influential school of Bayesian statisticians and econometricians is in agreement with them on the value of generating and revising models, hypotheses, and data analyzed in response to surprising findings.

Indeed, I think Bayesian inference is particularly valuable in this area, both in allowing us to fit more complex, realistic models, and, when coupled with graphical visualization techniques, in providing methods for checking the fit of such models.

**P.S.** All the abduction in the world won’t save us from selection bias, and I still think that just about all published estimates of effect sizes are biased upward. Including the one discussed here.


]]>The post From Whoops to Sorry: Columbia University history prof relives 1968 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>According to the American Historical Association, Armstrong “reviewed his work and the underlying scholarship and identified a number of instances where the source citations were incorrect. Dr. Armstrong has corrected the citation errors.”

Going through the links in the Retraction Watch article, I found this story from last year, reporting a series of complaints by Balazs Szalontai which, if correct, imply that Armstrong did not just have “citation errors” but behaved unethically. (In his later note, Armstrong refers to the old, old excuse of “multiple transfers of notes, some made by my research assistants and others done by myself.”) Anyway, in that earlier post, Armstrong responded to Szalontai’s criticism with this:

I have, as far as I know, never offended him. I’ve known him for years, and appreciate the work he’s done. His book appears in my bibliography. I don’t understand why he would come after me this way.

As if the only reason one would want to criticize bad work is because you’ve been offended.

Flip it around, and you see that Armstrong was saying, essentially, that all he has to do is “never offend” people and include their books in his bibliography, and he should be immune from criticism.

I followed another link and found this detailed report by Szalontai, “Invalid Source Citations in Tyranny of the Weak: My Response to Professor Armstrong’s Explanation,” which includes lots of damning details, for example this sequence:

Armstrong had written:

Szalontai then took this apart, bit by bit:

Wow. There seem to be three things going on here: (1) Getting the meaning of the quoted passage entirely backward (a big-deal correction almost entirely occluded by Armstrong’s dry-as-dust correction notice); (2) How the error happened in the first place, given that the research assistant was a native speaker of Korean and the supervisor received a diploma in Korean language; and (3) That horrible way that supervisors like to blame their errors on “research assistants.”

Szalontai pushes further on that last point:

Also this, from that old Retraction Watch post:

Armstrong noted that none of the errors he has discovered so far undermine the main conclusions of his book:

Not that I agree with all of the criticisms. But to the extent I can find them to be justified, I am correcting them. And so far I find nothing that affects the core arguments of the book.

Jeez . . . if none of these fake citations affected the core arguments of the book, why include them in the first place? I guess cos he felt it would make his arguments seem stronger. I hope Armstrong will take the next step and apologize to Szalontai for dragging this out for so long.

Of course I’d be happy to hear any other side of the story, if there is one. Maybe Matt Whitaker has some relevant perspective to offer here. In the meantime, it’s sad to see this sort of thing happening at Columbia.

Let me conclude with one more item, this one from Armstrong’s blog, after he got caught with his hand in the citation cookie jar but before he decided to return the prize:

Since early this past fall, a group of people, including Dr. Balazs Szalontai, has circulated lists of problems with my book . . . Dr. Szalontai never communicated his concerns or criticisms directly to me prior to these various posts on different blogs. Why direct communication, a common professional courtesy and practice in academia, was not the preferred form of expression remains a mystery.

This is just ridiculous. Armstrong’s the one who ripped off Szalontai and listed false citations in his book. What ever happened to “direct communication, a common professional courtesy and practice in academia”? Did Armstrong engage in “direct communication, a common professional courtesy and practice in academia” with Szalontai before taking and garbling the material from his book?

Armstrong’s book is in the public record. It has errors, and those errors, too, should be in the public record. To the extent that the contents of his book matter at all, readers should have access to the correct information right away.


]]>The post Turks need money after expensive weddings appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>“We offered mTurk workers $0.50–$0.75 to complete the survey.”

Why would someone who spent $20k+ on their wedding be filling out a survey on mTurk? Maybe things didn’t turn out so well?

Josh continues:

I didn’t read the paper or the empirical section, just the abstract and I quickly looked at their data source and stopped.

I don’t think mTurk is always bad, just in this case the interaction could be a source of selection bias, and produce an effect mechanically.

I guess you gotta make up that $20K somehow. . . .


]]>The post Stan/NYC WiMLDS Workshop appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>If you’re in the New York City area and want to attend then you can register at the event page. We hope that you can make it!

P.S. Don’t forget that StanCon 2018 will take place January 10-12 next year, and those identifying as members of underrepresented communities can take advantage of discounted registration.


]]>The post What is a pull request? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A pull request (PR) is the minimal publishable unit of open-source development. It’s a proposed change to the code base that we can then review. If you want to see how the sausage is made, follow this link.

If you click on “files changed”, you’ll see what Sean is proposing doing with the code. Interspersed in there are 67 out-of-line comments and many more than that inline on the code. This is the PR that kicked off this discussion of how extreme we should be in reviewing (but you’ll also see this pull request touched almost 200 code files).

As soon as a pull request is made, it kicks off testing on multiple platforms that takes nearly a day to run to completion.


]]>The post Maternal death rate problems in North Carolina appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Somebody named Jerrod writes:

I thought you might find this article [“Black moms die in childbirth 3 times as often as white moms. Except in North Carolina,” by Julia Belluz] interesting, as it relates to some of your interests in health data and combines it with bad analysis and framing.

My beef with the article:

1) a 40 percent decrease in the black maternal mortality rate paired with an over-100-percent increase in the white maternal mortality rate is presented as a policy success.

2) the author wants you to think that white maternal mortality has stayed the same (with the first figure) and then elides the dramatic increase in North Carolina’s white maternal mortality rate by saying that it mirrors the recent increase in overall white mortality.

3) the two figures have different time scales.

4) Its thesis (“Ultimately, North Carolina is saving more lives by focusing on income, not race”) is not supported by the data.

For the following, I’ll assume that “no government action” is a policy response. I looked at the Vital Records for NC to get the number of live births for each race category and then multiplied the percentage of maternal deaths per live birth by the number of live births. The 1999 vital records only had data for “whites” and “minorities”. The numbers presented assume all minorities in NC in 1999 were black (which would weaken any of my conclusions). The maternal mortality rates are from the article.

Even the difference between 1999 deaths and 2013 deaths doesn’t support the conclusion that government policy saved lives (keep in mind that this is during a time of falling fertility). The last column just takes the difference between the 2013 hypothetical deaths in the second-to-last column and the actual deaths in 2013. While every death is important, the article also glosses over the fact that these are small magnitudes.
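Jerrod’s back-of-the-envelope method is easy to reproduce. Here is a sketch with purely illustrative numbers of my own (the actual NC rates and birth counts come from the vital records and the article, not from this code): deaths are just the rate per 100,000 live births times the number of live births, and the counterfactual column applies the old rate to the new births.

```python
# Hypothetical inputs for illustration only; not the actual North Carolina data.
# Maternal mortality rates are deaths per 100,000 live births.
rate_1999 = 20.0      # hypothetical 1999 rate for one group
rate_2013 = 12.0      # hypothetical 2013 rate for the same group
births_2013 = 60_000  # hypothetical live births in 2013

# Actual deaths implied by the observed 2013 rate:
deaths_2013 = rate_2013 / 100_000 * births_2013

# Counterfactual ("hypothetical") deaths: what 2013 would look like
# had the 1999 rate still applied to the 2013 births.
deaths_if_1999_rate = rate_1999 / 100_000 * births_2013

# The "lives saved" column: counterfactual minus actual deaths.
lives_saved = deaths_if_1999_rate - deaths_2013
print(deaths_2013, deaths_if_1999_rate, lives_saved)
```

Even with a sizable rate change, the absolute counts that come out of this arithmetic are small, which is Jerrod’s point about magnitudes.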

Don’t take this as a slam on Belluz—these things are challenging to report—it’s just good to get multiple perspectives on this sort of thing.


]]>The post “The Null Hypothesis Screening Fallacy”? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Rick Gerkin writes:

A few months ago you posted your list of blog posts in draft stage and I noticed that “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.” was still on that list. It was about some concerns I had about a paper in Science (http://science.sciencemag.org/content/343/6177/1370). After talking it through with them, the authors of that paper eventually added a correction to the article. I think the issues with that paper are a bit deeper (as I published elsewhere: https://elifesciences.org/content/4/e08127) but still it takes courage to acknowledge the merit of the concerns and write a correction.

Meanwhile, two of the principal investigators from that paper produced a new, exciting data set which was used for a Kaggle-like competition. I won that competition and became a co-first author on a *new* paper in Science (http://science.sciencemag.org/content/355/6327/820).

And this is great! I totally respect them as scientists and think their research is really cool. They made an important mistake in their paper and since the research question was something I care a lot about I had to call attention to it. But I always looked forward to moving on from that and working on the other paper with them, and it all worked out.

That is such a great attitude.

Gerkin continues:

Yet another lesson that most scientific disputes are pretty minor, and working together with the people you disagreed with can produce huge returns. The second paper would have been less interesting and important if we hadn’t been working on it together.

What a wonderful story!

Here’s the background. I received the following email from Gerkin a bit over a year ago:

About 3 months ago there was a paper in Science entitled “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli” (http://www.sciencemag.org/content/343/6177/1370). You may have heard about it through normal science channels, or NPR, or the news. The press release was everywhere. It was a big deal because the conclusion that humans can discriminate a trillion odors was unexpected, previous estimates having been in the ~10000 range. Our central concern is the analysis of the data.

The short version:

They use a hypothesis testing framework — not to reject a null hypothesis with type 1 error rate alpha — but to essentially convert raw data (fraction of subjects discriminating correctly) into a more favorable form (fraction of subjects discriminating significantly above chance), which is subsequently used to estimate an intermediate hypothetical variable, which, when plugged into another equation produces the final point estimate of “number of odors humans can discriminate”. However, small changes in the choice of alpha during this data conversion step (or equivalently small changes in the number of subjects, the number of trials, etc), by virtue of their highly non-linear impact on that point estimate, undermine any confidence in that estimate. I’m pretty sure this is a misuse of hypothesis testing. Does this have a name? Gelman’s fallacy?

I replied:

People do use hyp testing as a screen. When this is done, it should be evaluated as such. The p-values themselves are not so important, you just have to consider the screening as a data-based rule and evaluate its statistical properties. Personally, I do not like hyp-test-based screening rules: I think it makes more sense to consider screening as a goal and go from there. As you note, the p-value is a highly nonlinear transformation of the data, with the sharp nonlinearity occurring at a somewhat arbitrary place in the scale. So, in general, I think it can lead to inferences that throw away information. I did not go to the trouble of following your link and reading the original paper, but my usual view is that it would be better to just analyze the raw data (taking the proportions for each person as continuous data and going from there, or maybe fitting a logistic regression or some similar model to the individual responses).

Gerkin continued:

The long version:

1) Olfactory stimuli (basically vials of molecular mixtures) differed from each other according to the number of molecules they each had in common (e.g. 7 in common out of 10 total, i.e. 3 differences). All pairs of mixtures for which the stimuli in the pair had D differences were assigned to stimulus group D.

2) For each stimulus pair in a group D, the authors computed the fraction of subjects who could successfully discriminate that pair using smell.

3) For each group D, they then computed the fraction of pairs in D for which that fraction of subjects was “significantly above chance”. By design, chance success had p=1/3, so a pair was “significantly above chance” if the fraction of subjects discriminating it correctly exceeded that given by the binomial inverse CDF with x=(1-alpha/2), p=1/3, N=# of subjects. The choice of alpha (an analysis choice) and N (an experimental design choice) clearly drive the results so far. Let’s denote by F that fraction of pairs exceeding the threshold determined by the inverse CDF.

4) They did a linear regression of F vs D. They defined something called a “limen” (basically a fancy term for a discrimination threshold) and set it equal to the solution X of 0.5 = beta_0 + beta_1*X, where the betas are the regression coefficients.

5) They then plugged X into yet another equation with more parameters, and the result was their estimate of the number of discriminable olfactory stimuli.
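Gerkin’s sensitivity point in step 3 is easy to see numerically. Here is a small pure-Python sketch (my own illustrative reimplementation, not the paper’s or Gerkin’s code) of how the “significantly above chance” cutoff moves with alpha and the number of subjects:

```python
from math import comb

def binom_ppf(q, n, p):
    """Smallest k such that P(X <= k) >= q for X ~ Binomial(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        if cdf >= q:
            return k
    return n

# Chance success has p = 1/3; a pair counts as "significantly above chance"
# only if the number of subjects discriminating it exceeds this threshold.
for n_subjects in (20, 30):
    for alpha in (0.05, 0.01, 0.001):
        k = binom_ppf(1 - alpha / 2, n_subjects, 1 / 3)
        print(f"N={n_subjects}, alpha={alpha}: threshold {k}/{n_subjects}")
```

Because the resulting fraction F then passes through a regression (step 4) and a further nonlinear formula (step 5), even one-count shifts in this threshold can move the final estimate of the number of discriminable stimuli by orders of magnitude.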

My reply: I’ve seen a lot of this sort of thing, over the years. My impression is that people are often doing these convoluted steps, not so much out of a desire to cheat but rather because they have not ever stepped back and tried to consider their larger goals. Or perhaps they don’t have the training to set up a model from scratch.

Here’s Gerkin again:

I think it was one of those cases where an experimentalist talked to a mathematician, and the mathematician had some experience with a vaguely similar problem and suggested a corresponding framework that unfortunately didn’t really apply to the current problem. The kinds of stress tests one would apply to the resulting model to make sure it makes sense of the data never got applied.

And then he continued with his main thread:

If you followed this, you’ve already concluded that their method is unsound even before we get to step 4 and 5 (which I believe are unsound for unrelated reasons). I also generated figures showing that reasonable alternative choices of all of these variables yield estimates of the number of olfactory stimuli ranging from 10^3 to 10^80. I have Python code implementing this reanalysis and figures available at http://github.com/rgerkin/trillion. But what I am wondering most is, is there a name for what is wrong with that screening procedure? Is there some adage that can be rolled out, or work cited, to illustrate this to the author?

To which I replied:

I don’t have any name for this one, but perhaps one way to frame your point is that the term “discriminate” in this context is not precisely determined. Ultimately the question of whether two odors can be “discriminated” should have some testable definition: that is, not just a data-based procedure that produces an estimate, but some definition of what “discrimination” really means. My guess is that your response is strong enough, but it does seem that if someone estimates “X” as 10^9 or whatever, it would be good to have a definition of what X is.

Gerkin concludes with a plea:

The one thing I would really, really like is for the fallacy I described to have a name—even better if it could be listed on your lexicon page. Maybe “The Null Hypothesis Screening Fallacy” or something. Then I could just refer to that link instead of to some 10,000-word explanation of it, every time this comes up in biology (which is all the time).

**P.S.** Here’s my earlier post on smell statistics.


The post “Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Shravan Vasishth writes:

The German NSF (DFG) has recently published a position paper on replicability, which contains the following explosive statement (emphasis mine in the quote below).

The first part of their defence against replicability is reasonable: some experiments can never be repeated under the same conditions (e.g., volcanic eruptions etc). But if that is so, why do researchers use frequentist logic for their analyses? This is the one situation where one cannot even imagine repeating the experiment hypothetically (cause the volcano to erupt 10,000 times and calculate the mean emission or whatever and its standard error).

The second part of their defence (in boldface) gives a free pass to the social psychologists. Now one can always claim that the experiment is “difficult” to redo. That is exactly the Fiske defence.

DFG quote:

Scientific results can be replicable, but they need not be. Replicability is not a universal criterion for scientific knowledge. The expectation that all scientific findings must be replicable cannot be satisfied, if only because numerous research areas investigate unique events such as climate change, supernovas, volcanic eruptions or past events. Other research areas focus on the observation and analysis of contingent phenomena (e.g. in the earth system sciences or in astrophysics) or investigate phenomena that cannot be observed repeatedly for other reasons (e.g., ethical, financial or technical reasons).

Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.

Wow. I guess they’ll have to specify exactly which of these forms of research are too complex to replicate. And why, if something is too complex to replicate, we should care about such claims. As is often the case in such discussions, I feel that their meaning would be much clearer if they’d give some examples.


The post No, I’m not blocking you or deleting your comments! appeared first on Statistical Modeling, Causal Inference, and Social Science.

I am worried you may have blocked me from commenting on your blog (because a couple of comments I made aren’t there). . . . Or maybe I failed to post correctly or maybe you just didn’t think my comments were interesting enough. . . .

This comes up from time to time and I always explain that, no, I don’t delete comments.

I don’t block commenters. I flag spam comments as spam—this includes comments with actual content but that contain spam links, and it also includes comments with no links but with such meaningless content that they seem to be some sort of spam—and I delete duplicate comments, which happens I think when people don’t realize their comment was entered the first time. In nearly 15 years of blogging I think I’ve deleted fewer than 5 comments based on content, when people were extremely rude.

Legitimate comments also can get caught in the spam filter. When people email me as above, I search the blog’s spam comments file, and the comment in question is typically there, having been trapped by the filter. Other times the comment isn’t there, and I’m guessing it got eaten by the person’s browser before it ever got posted.

I appreciate all the effort that people put into their comments and definitely don’t want to be deleting them! Just as I blog for free so as to improve scientific discourse, so do you and others supply comments for free for that same reason, and I’m glad we have such free and interesting exchanges.


The post Stan Weekly Roundup, 30 June 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Stan**® and the logo were granted a U.S. Trademark Registration No. 5,222,891 and a U.S. Serial Number 87,237,369, respectively. Hard to feel special when there were millions of products ahead of you. Trademarked names are case insensitive and they required a black-and-white image, shown here.

**Peter Ellis**, a data analyst working for the New Zealand government, posted a nice case study, State-space modelling of the Australian 2007 federal election. His post is intended to “replicate Simon Jackman’s state space modelling [from his book and pscl package in R] with house effects of the 2007 Australian federal election.”

**Masaaki Horikoshi** provides Stan programs on GitHub for the models in Jacques J.F. Commandeur and Siem Jan Koopman’s book *Introduction to State Space Time Series Analysis*.

**Sebastian Weber** put out a first draft of the MPI specification for a map function for Stan. Mapping was introduced in Lisp with maplist(); Python uses map() and R uses sapply(). The map operation is also the first half of the parallel map-reduce pattern, which is how we’re implementing it. The reduction involves fiddling the operands, result, and gradients into the shared autodiff graph.
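For readers who haven’t met the pattern, here’s the generic idea in plain Python (nothing Stan-specific; the shard data are illustrative):

```python
from functools import reduce
from operator import add

# "map": apply the same function independently to each shard of data;
# this is the half that can be farmed out to parallel workers (e.g., via MPI)
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
partial_sums = list(map(sum, shards))

# "reduce": combine the per-shard results into a single value
total = reduce(add, partial_sums)
print(partial_sums, total)  # [3.0, 7.0, 11.0] 21.0
```

In Stan’s case the per-shard work is evaluating pieces of the log density, and the reduce step additionally has to stitch the per-shard gradients back into the shared autodiff graph.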

**Sophia Rabe-Hesketh, Daniel Furr, and Seung Yeon Lee**, of UC Berkeley, put together a page of Resources for Stan in educational modeling; we only have another partial year left on our IES grant with Sophia.

**Bill Gillespie** put together some introductory Stan lectures. Bill’s recently back from teaching Stan at the PAGE conference in Budapest.

**Mitzi Morris** got her pull request merged to add compound arithmetic and assignment to the language (she did the compound declare/define before that). That means we’ll be able to write `foo[i, j] += 1` instead of `foo[i, j] = foo[i, j] + 1` going forward. It works for all types where the binary operation and assignment are well typed.

**Sean Talts** has the first prototype of Andrew Gelman’s algorithm for max marginal modes—either posterior or likelihood. This’ll give us the same kind of maximum likelihood estimates as Doug Bates’s packages for generalized linear mixed effects models, lme4 in R and MixedModels.jl in Julia. It not only allows penalties or priors like Vince Dorie’s and Andrew’s R package blme, but it can be used for arbitrary parameter subsets in arbitrary Stan models. It shares some computational tricks for stochastic derivatives with Alp Kucukelbir’s autodiff variational inference (ADVI) algorithm.

**I** got the pull request merged for the forward-mode test framework. It’s cutting down drastically on code size and improving test coverage. Thanks to Rob Trangucci for writing the finite diff functionals and to Sean Talts and Daniel Lee for feedback on the first round of testing. This should mean that we’ll have higher-order autodiff exposed soon, which means RHMC and faster autodiffed Hessians.


The post Plan 9 from PPNAS appeared first on Statistical Modeling, Causal Inference, and Social Science.

Asher Meir points to this breathless news article and sends me a message, subject line “Fruit juice leads to 0.003 unit (!) increase in BMI”:

“the study results showed that one daily 6- to 8-ounce serving increment of 100% fruit juice was associated with a small .003 unit increase in body mass index over one year in children of all ages.”

No confidence intervals but obviously this finding is very worrisome. Children shouldn’t be gaining weight.

Meir continues:

Of course it’s not a coincidence that it’s weird. I send you a very unrepresentative sample of the stuff I read. I mostly don’t send you ordinary schlock but rather things that are really weird – like a “0.003 unit increase in BMI,” which is not only statistically insignificant but, even if it could be substantiated, would be of zero health consequence.

I really enjoy seeing things like this, they are so ridiculous they are like those cult movies that are so bad they’re good.

**P.S.** Yeah, yeah, I know that this particular piece of junk science didn’t appear in PPNAS. But until PPNAS apologizes for wasting the world’s time with air rage, himmicanes, ages ending in 9, etc., I think we have the moral right to continue to use them as shorthand for this sort of thing.


The post Again: Let’s stop talking about published research findings being true or false appeared first on Statistical Modeling, Causal Inference, and Social Science.

This is not good.

First, math (or even statistics) won’t solve the reproducibility problem. All the math in the world won’t save you if you gather noisy data and study ill-defined effects. Satoshi Kanazawa could be as brilliant as Carl Friedrich Gauss, squared, and it wouldn’t matter cos there’s no blood that can be squeezed from the stone of those sex-ratio studies. Similarly for the ovulation and voting paper, or the ovulating-women-are-three-times-as-likely-to-wear-red paper. Dead on arrival, all of ’em. Too much noise. Going for statistical significance won’t work in those “power = .06” studies, cos if you do get lucky and find statistical significance, it tells you just about nothing anyway. That’s why I *don’t* go around recommending that people do preregistered replications of these sorts of studies. Why bother? I’m not gonna tell people to waste their time.

The other problem is this: “More useful, Mr. Fournier said, would be a practice in which yes-or-no declarations would be replaced in journal articles by more specific estimates of how likely it is that a particular research observation did not just randomly occur: such as 1 in 20, or 1 in 100, or 1 in 1,000.” This is wrong for all the reasons discussed here.

So, no, there’s no “new theory on how researchers can solve the reproducibility crisis.” To the extent there is a new theory, it’s an old theory, which is that scientists should focus on scientific measurement (see here and here).

In some sense, the biggest problem with statistics in science is not that scientists don’t know statistics, but that they’re relying on statistics in the first place.

Just imagine if papers such as himmicanes, air rage, ages-ending-in-9, and other clickbait cargo-cult science had to stand on their own two feet, *without* relying on p-values—that is, statistics—to back up their claims. Then we wouldn’t be in this mess in the first place.

I’m not saying statistics are a bad idea. I do applied statistics for a living. But I think that if researchers want to solve the reproducibility crisis, they should be doing experiments that can successfully be reproduced—and that involves getting better measurements and better theories, not rearranging the data on the deck of the Titanic.

**P.S.** Just to be clear: I’m not criticizing Basken’s article, which brings up a bunch of issues that people are talking about. I’m just bothered by what I see as a naive attitude that some people have, that statisticians and statistical education will fix our scientific replication problems. As McShane and Gal have pointed out, lots of statisticians don’t understand some key principles in this area, and I worry that focus on statistics, preregistration, etc., will distract researchers from the real problem that crappy measurement is standard in some fields of research. Remember, honesty and transparency are not enough.


The post Let’s stop talking about published research findings being true or false appeared first on Statistical Modeling, Causal Inference, and Social Science.

When I heard about John Ioannidis’s paper, “Why Most Published Research Findings Are False,” I thought it was cool. Ioannidis was on the same side as me, and Uri Simonsohn, and Greg Francis, and Paul Meehl, in the replication debate: he felt that there was a lot of bad work out there, supported by meaningless p-values, and his paper was a demonstration of how this could come to pass, how it was that the seemingly-strong evidence of “p less than .05” wasn’t so strong at all.

I didn’t (and don’t) quite buy Ioannidis’s mathematical framing of the problem, in which published findings map to hypotheses that are “true” or “false.” I don’t buy it for two reasons: First, statistical claims are only loosely linked to scientific hypotheses. What, for example, is the hypothesis of Satoshi Kanazawa? Is it that sex ratios of babies are not identical among all groups? Or that we should believe in “evolutionary psychology”? Or that strong powerful men are more likely to have boys, in all circumstances? Some circumstances? Etc. Similarly with that ovulation-and-clothing paper: is the hypothesis that women are more likely to wear red clothing during their most fertile days? Or during days 6-14 (which are not the most fertile days of the cycle)? Or only on warm days? Etc. The second problem is that the null hypotheses being tested and rejected are typically point nulls—the model of zero difference, which is just about always false. So the alternative hypothesis is just about always true. But the alternative to the null is not what is being specified in the paper. And, as Bargh etc. have demonstrated, the hypothesis can keep shifting. So we go round and round.

Here’s my point. Whether you think the experiments and observational studies of Kanazawa, Bargh, etc., are worth doing, or whether you think they’re a waste of time: either way, I don’t think they’re making claims that can be said to be either “true” or “false.” And I feel the same way about medical studies of the “hormone therapy causes cancer” variety. It could be possible to coerce these claims into specific predictions about measurable quantities, but that’s not what these papers are doing.

I agree that there *are* true and false statements. For example, “the Stroop effect is real and it’s spectacular” is true. But when you move away from these super-clear examples, it’s tougher. Does power pose have real effects? Sure, everything you do will have some effect. But that’s not quite what Ioannidis was talking about, I guess.

Anyway, I’m still glad that Ioannidis wrote that paper, and I agree with his main point, even if I feel it was awkwardly expressed by being crammed into the true-positive, false-positive framework.

But it’s been 12 years now, and it’s time to move on. Back in 2013, I was not so pleased with Jager and Leek’s paper, “Empirical estimates suggest most published medical research is true.” Studying the statistical properties of published scientific claims, that’s great. Doing it in the true-or-false framework, not so much.

I can understand Jager and Leek’s frustration: Ioannidis used this framework to write a much celebrated paper; Jager and Leek do something similar—but with real data!—and get all this skepticism. But I do think we have to move on.

And I feel the same way about this new paper, “Too True to be Bad: When Sets of Studies With Significant and Nonsignificant Findings Are Probably True,” by Daniel Lakens and Alexander Etz, sent to me by Kevin Lewis. I suppose such analyses are helpful for people to build their understanding, but I think the whole true/false thing with social science hypotheses is just pointless. These people are working within an old-fashioned paradigm, and I wish they’d take the lead from my 2014 paper with Carlin on Type M and S errors. I suspect that I would agree with the recommendations of this paper (as, indeed, I agree with Ioannidis), but at this point I’ve just lost the patience for decoding this sort of argument and reframing it in terms of continuous and varying effects. That said, I expect this paper by Lakens and Etz, like the earlier papers by Ioannidis and Jager/Leek, could be useful, as I recognize that many people are still comfortable working within the outmoded framework of true and false hypotheses.


The post Bayesian, but not Bayesian enough appeared first on Statistical Modeling, Causal Inference, and Social Science.

This short New York Times article on a study published in BMJ might be of interest to you and your blog community, both in terms of how the media reports science and also the use of bayesian vs frequentist statistics in the study itself.

Here is the short summary from the news ticker thing on the NYTimes homepage:

Wow, that sounds really bad! Here is the full article:

https://www.nytimes.com/2017/05/09/well/live/pain-relievers-tied-to-immediate-heart-risks.html

It is extremely short, and basically just summarizes the abstract, adds that the absolute increase in risk is actually very small, and recommends talking to your doctor before taking NSAIDs. I guess my problem is that they have the scary headline (53%!), but then say the risk is actually small and you might or might not want to avoid NSAIDs. So is this important or not? The average reader probably has not thought much about relative versus absolute risk, so I wish they would have expanded on that.

In terms of bayesian vs frequentist, this study is bayesian (bayesian meta-analysis of individual patient data). Here is the link:

http://www.bmj.com/content/357/bmj.j1909

Despite being bayesian, the way the results are presented give me very frequentist/NHST vibes. For example, the NYTimes article gives the percent increase in risk of heart attack for the various NSAIDs, which are taken directly from the odds ratios in the abstract:

With use for one to seven days the probability of increased myocardial infarction risk (posterior probability of odds ratio >1.0) was 92% for celecoxib, 97% for ibuprofen, and 99% for diclofenac, naproxen, and rofecoxib. The corresponding odds ratios (95% credible intervals) were 1.24 (0.91 to 1.82) for celecoxib, 1.48 (1.00 to 2.26) for ibuprofen, 1.50 (1.06 to 2.04) for diclofenac, 1.53 (1.07 to 2.33) for naproxen, and 1.58 (1.07 to 2.17) for rofecoxib.

This reads to me like the bayesian equivalent of “statistically significant, p<0.05, lower 95% CI is greater than 1”! To be fair that is just the abstract, and the article itself provides much, much more information.

The following passage also caught my eye:

The bayesian approach is useful for decision making. Take, for example, the summary odds ratio of acute myocardial infarction of 2.65 (1.46 to 4.67) with rofecoxib >25 mg/day for 8-30 days versus non-use. With a frequentist confidence interval, which represents uncertainty through repetition of the experience, all odds ratios from 1.46 to 4.67 might seem equally likely. In contrast, the bayesian approach, although resulting in a numerically similar 95% credible interval, also allows us to calculate that there is an 83% probability that this odds ratio of acute myocardial infarction is greater than 2.00.

It seems like they’re using bayesian methods to generate alternative versions of the typical frequentist statistics that can actually be interpreted the way most people incorrectly interpret frequentist/NHST stats (p=0.01 meaning 99% probability that there is an effect, etc). If so that is great because it makes sense to use statistics that match how people will interpret them anyway, but I imagine it would also be subject to the same limitations and abuse that are common to NHST (I am not saying that about this particular study, just in general).

I agree. If you’re doing decision analysis, you can’t do much with statements such as, “there is an 83% probability that this odds ratio of acute myocardial infarction is greater than 2.00.” It’s better to just work with the risk parameter directly. A parameter being greater than 2.00 isn’t what kills you.
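The probability statements in the BMJ paper are just tail fractions of posterior draws. A toy reconstruction from the quoted rofecoxib numbers (a normal approximation on the log-odds-ratio scale fitted to the reported 2.65 (1.46 to 4.67) interval; this is my back-of-envelope sketch, not the paper’s actual model):

```python
import math
import random

random.seed(1)

# Approximate posterior for log(OR), reconstructed from the reported
# 95% interval 1.46 to 4.67 around the estimate 2.65
mu = math.log(2.65)
sigma = (math.log(4.67) - math.log(1.46)) / (2 * 1.96)

draws = [random.gauss(mu, sigma) for _ in range(100_000)]
p_or_gt_2 = sum(d > math.log(2.0) for d in draws) / len(draws)
print(round(p_or_gt_2, 2))  # close to the 83% quoted above
```

Once you have posterior draws, any such probability statement is a one-liner, which is why presenting tail probabilities is so tempting even when, as noted above, the risk itself is the quantity a decision analysis actually needs.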


The post Estimating Public Market Exposure of Private Capital Funds Using Bayesian Inference appeared first on Statistical Modeling, Causal Inference, and Social Science.

I don’t know anything about this work by Luis O’Shea and Vishv Jeet—that is, I know nothing of public market exposure or private capital firms, and I don’t know anything about the model they fit, the data they used, or what information they had available for constructing and checking their model.

But what I *do* know is that they fit their model in Stan.

Fitting models in Stan is just great, for the usual reasons of flexible modeling and fast computing, and also because Stan code can be shared, so we—the Stan user community and the larger research community—can learn from each other and move all our data analyses forward.


The post Capitalist science: The solution to the replication crisis? appeared first on Statistical Modeling, Causal Inference, and Social Science.

The solution to science’s replication crisis is a new ecosystem in which scientists sell what they learn from their research. In each pairwise transaction, the information seller makes (loses) money if he turns out to be correct (incorrect). Responsibility for the determination of correctness is delegated, with appropriate incentives, to the information purchaser. Each transaction is brokered by a central exchange, which holds money from the anonymous information buyer and anonymous information seller in escrow, and which enforces a set of incentives facilitating the transfer of useful, bluntly honest information from the seller to the buyer. This new ecosystem, capitalist science, directly addresses socialist science’s replication crisis by explicitly rewarding accuracy and penalizing inaccuracy.

The idea seems interesting to me, even though I don’t think it would quite work for my own research as my work tends to be interpretive and descriptive without many true/false claims. But it could perhaps work for others. Some effort is being made right now to set up prediction markets for scientific papers.

Knuteson replied:

Prediction markets have a few features that led me to make different design decisions. Two of note:

– Prices on prediction markets are public. The people I have spoken with in industry seem more willing to pay for information if the information they receive is not automatically made public.

– Prediction markets generally deal with true/false claims. People like being able to ask a broader set of questions.

A bit later, Knuteson wrote:

I read your post “Authority figures in psychology spread more happy talk, still don’t get the point . . .”

You may find this Physics World article interesting: Figuring out a handshake.

I fully agree with you that not all broken eggs can be made into omelets.

Also relevant is this paper where Eric Loken and I consider the idea of peer review as an attempted quality control system, and we discuss proposals such as prediction markets for improving scientific communication.


The post Bad Numbers: Media-savvy Ivy League prof publishes textbook with a corrupted dataset appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I might not have noticed this one, except that it happened to involve Congressional elections, and this is an area I know something about.

The story goes like this. I’m working to finish up Regression and Other Stories, going through the examples. There’s one where we fit a model to predict the 1988 elections for the U.S. House of Representatives, district by district, given the results from the previous election and incumbency status. We fit a linear regression, then used the fitted model to predict 1990, then compared to the actual election results from 1990. A clean example with just a bit of realism—the model doesn’t fit perfectly, there’s some missing data, there are some choices in how to set up the model.

This example was in Data Analysis Using Regression and Multilevel/Hierarchical Models—that’s the book that Regression and Other Stories is the updated version of the first half of—and for this new book I just want to redo the predictions using stan_glm() and posterior_predict(), which is simpler and more direct than the hacky way we were doing predictions before.

So, no problem. In the new book chapter I adapt the code, cleaning it in various places, then I open an R window and an emacs window for my R script and check that everything works ok. Ummm, first I gotta find the directory with the old code and data, I do that, everything seems to work all right. . . .

I look over what I wrote one more time. It’s kinda complicated: I’d imputed winners of uncontested elections at 75% of the two-party vote—that’s a reasonable choice, it’s based on some analysis we did many years ago of the votes in districts the election before or after they became uncontested—but then there was a tricky thing where I excluded some of these when fitting the regression and put them back in the imputation. In rewriting the example, it seemed simpler to just impute all those uncontested elections once and for all and then do the modeling and fitting on all the districts. Not perfect—and I can explain that in the text—but less of a distraction from the main point in this section, which is the use of simulation for nonlinear predictors, in this case the number of seats predicted to be won by each party in the next election.
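That simulation step can be sketched in a few lines. This is a Python toy, not the book's R code: the made-up forecasts below stand in for the posterior predictive draws that stan_glm() and posterior_predict() would produce.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fake posterior predictive draws of the Democratic vote share in each
# district (n_sims x n_districts); in the book these come from
# posterior_predict(), here they are noise around made-up forecasts
n_sims, n_districts = 4000, 411
forecast = rng.uniform(0.2, 0.8, size=n_districts)
draws = forecast + rng.normal(0, 0.07, size=(n_sims, n_districts))

# Seats won is a nonlinear (step) function of vote share, so summarize
# it by counting seats within each simulation draw, then across draws
seats = (draws > 0.5).sum(axis=1)
print(f"predicted Democratic seats: {seats.mean():.0f} +/- {seats.std():.0f}")
```

The point is that you push each simulation draw through the nonlinear seat count first and only then summarize, rather than plugging point predictions into the step function.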

Here’s what I had in the text: “Many of the elections were uncontested in 1988, so that y_i = 0 or 1 exactly; for simplicity, we exclude these from our analysis. . . . We also exclude any elections that were won by third parties. This leaves us with n = 343 congressional elections for the analysis.” So I went back to the R script and put the (suitably imputed) uncontested elections back in. This left me with 411 elections in the dataset, out of 435. The rest were NA’s. And I rewrote the paragraph to simply say: “We exclude any elections that were won by third parties in 1986 or 1988. This leaves us with $n=411$ congressional elections for the analysis.”

But . . . wait a minute! Were there really ~~34~~ 24 districts won by third parties in those years? That doesn’t sound right. I go to one of the relevant data files, “1986.asc,” and scan down until I find some of the districts in question:

The first column’s the state (we were using “ICPSR codes,” and states 44, 45, and 46 are Georgia, Louisiana, and Mississippi, respectively), the second is the congressional district, third is incumbency (+1 for Democrat running for reelection, -1 for Republican, 0 for an open seat), and the last two columns are the votes received by the Democratic and Republican candidates. If one of those last two columns is 0, that’s an uncontested election. If both are 0, I was calling it a third-party victory.
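In code, that coding plus the 75% imputation rule amounts to something like the following. This is a Python sketch with hypothetical vote counts (the actual analysis is in R):

```python
import numpy as np

def two_party_share(dem, rep, impute=0.75):
    """Democratic share of the two-party vote under the coding above:
    a 0 in one vote column marks an uncontested race (imputed at
    75%/25%), and 0 in both columns marks an excluded district."""
    dem = np.asarray(dem, dtype=float)
    rep = np.asarray(rep, dtype=float)
    share = np.full(dem.shape, np.nan)          # both columns 0: excluded
    contested = (dem > 0) & (rep > 0)
    share[contested] = dem[contested] / (dem + rep)[contested]
    share[(dem > 0) & (rep == 0)] = impute      # unopposed Democrat
    share[(dem == 0) & (rep > 0)] = 1 - impute  # unopposed Republican
    return share

# Hypothetical districts: contested, unopposed D, unopposed R, excluded
print(two_party_share([120_000, 90_000, 0, 0], [80_000, 0, 75_000, 0]))
# 0.6, 0.75, 0.25, nan
```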

But can this be right?

Here’s the relevant section from the codebook:

~~Nothing about what to do if both columns are 0.~~

Also this:

For those districts with both columns -9, it says the election didn’t take place, or there was a third party victory, or there was an at-large election.

Whassup? Let’s check Louisiana (state 45 in the above display). Google *Louisiana 1986 House of Representatives Elections* and it’s right there on Wikipedia. I have no idea who went to the trouble of entering all this information (or who went to the trouble of writing a computer program to enter all this information), but here it is:

So it looks like the data table I had was just incomplete. I have no idea how this happened, but it’s kinda embarrassing that I never noticed. What with all those uncontested elections, I didn’t really look carefully at the data with ~~zeroes~~ -9’s in both columns.

Also, the incumbency information isn’t all correct. Our file had LA-6 with a Republican incumbent running for reelection, but according to Wikipedia, the actual election was an open seat (but with the Republican running unopposed).

I’m not sure what’s the best way forward. Putting together a new dataset for all those decades of elections, that would be a lot of work. But maybe such a file now exists somewhere? The easiest solution would be to clean up the existing dataset just for the three elections I need for the example: 1986, 1988, 1990. On the other hand, if I’m going to do that anyway, maybe better to use some more recent data, such as 2006, 2008, 2010.

No big deal—it’s just one example in the book—but, still, it’s a mistake I should never have made.

This is all a good example of the benefits of a reproducible workflow. It was through my efforts to put together clean, reproducible code that I discovered the problem.

Also, errors in this dataset could have propagated into errors in these published articles:

[2008] Estimating incumbency advantage and its variation, as an example of a before/after study (with discussion). *Journal of the American Statistical Association* 103, 437–451. (Andrew Gelman and Zaiying Huang)

[1991] Systemic consequences of incumbency advantage in U.S. House elections. *American Journal of Political Science* 35, 110–138. (Gary King and Andrew Gelman)

[1990] Estimating incumbency advantage without bias. *American Journal of Political Science* 34, 1142–1164. (Andrew Gelman and Gary King)

I’m guessing that the main conclusions won’t change, as the total number of these excluded cases is small. Of course those papers were all written before the era of reproducible analyses, so it’s not like the data and code are all there for you to re-run.

The post Bad Numbers: Media-savvy Ivy League prof publishes textbook with a corrupted dataset appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Problems with the jargon “statistically significant” and “clinically significant” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>After listening to your EconTalk episode a few weeks ago, I have a question about interpreting treatment effect magnitudes, effect sizes, SDs, etc. I studied Econ/Math undergrad and worked at a social science research institution in health policy as a research assistant, so I have a good amount of background.

At the institution where I worked we started adopting the jargon “statistically significant” AND “clinically significant.” The latter describes the importance of the magnitude in the real world. However, my understanding of standard t-testing and p-values is that since the null hypothesis is treatment == 0, then if we can reject the null at p < .05, then this is only evidence that the treatment is > 0. Because the test was against 0, we cannot make any additional claims about the magnitude. If we wanted to make claims about the magnitude, then we would need to test against the null hypothesis of treatment effect == [whatever threshold we assess as clinically significant]. So, what do you think? Were we always over-interpreting the magnitude results or am I missing something here?

My reply:

Section 2.4 of this recent paper with John Carlin explains the problem with talking about “practical” (or “clinical”) significance.

More generally, that’s right, the hypothesis test is, at best, nothing more than the rejection of a null hypothesis that nobody should care about. In real life, treatment effects are not exactly zero. A treatment will help some people and hurt others; it will have some average benefit which will in turn depend on the population being studied and the settings where the treatment is being applied.

But, no, I disagree with your statement that, if we wanted to make claims about the magnitude, then we would need to test other hypotheses. The whole “hypothesis” thing just misses the point. There are no “hypotheses” here in the traditional statistical sense. The hypothesis is that some intervention helps more than it hurts, for some people in some settings. The way to go, I think, is to just model these treatment effects directly. Estimate the treatment effect and its variation, and go from there. Forget the hypotheses and p-values entirely.
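For instance, the estimate-and-uncertainty summary might look like this. A toy Python sketch with simulated data; the 2-standard-error interval is a rough convention, and a real analysis would model variation in the effect too:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical outcome data: a real but modest average treatment effect
treated = rng.normal(loc=2.0, scale=10.0, size=500)
control = rng.normal(loc=0.0, scale=10.0, size=500)

est = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size
             + control.var(ddof=1) / control.size)

# Report the estimate with its uncertainty, then judge the magnitude
# against a clinically meaningful size, not against a null of zero
lo, hi = est - 2 * se, est + 2 * se
print(f"estimated effect: {est:.2f} (95% interval {lo:.2f} to {hi:.2f})")
```

The deliverable is the estimate and interval, which a reader can compare directly to whatever magnitude counts as clinically meaningful; no hypothesis gets "rejected."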

The post Problems with the jargon “statistically significant” and “clinically significant” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The following email came in:

I’m in a PhD program (poli sci) with a heavy emphasis on methods. One thing that my statistics courses emphasize, but that doesn’t get much attention in my poli sci courses, is the problem of simultaneous inferences. This strikes me as a problem.

I am a bit unclear on exactly how this works, and it’s something that my stats professors have been sort of vague about. But I gather from your blog that this is a subject near and dear to your heart.

For purposes of clarification, I’ll work under the frequentist framework, since for better or for worse, that’s what almost all poli sci literature operates under.

But am I right that any time you want to claim that two things are significant *at the same time* you need to halve your alpha? Or use Scheffé or whatever multiplier you think is appropriate if you think Bonferroni is too conservative?

I’m thinking in particular of this paper [“When Does Negativity Demobilize? Tracing the Conditional Effect of Negative Campaigning on Voter Turnout,” by Yanna Krupnikov].

In particular the findings on page 803.

Setting aside the 25+ predictors, which smacks of p-hacking to me, to support her conclusions she needs it to simultaneously be true that (1) negative ads themselves don’t affect turnout, (2) negative ads for a disliked candidate don’t affect turnout; (3) negative ads against a preferred candidate don’t affect turnout; (4) late ads for a disliked candidate don’t affect turnout AND (5) negative ads for a liked candidate DO affect turnout. In other words, her conclusion is valid iff she finds a significant effect at #5.

This is what she finds, but it looks like it just *barely* crosses the .05 threshold (again, p-hacking concerns). But am I right that since she needs to make inferences about five tests here, her alpha should be .01 (or whatever if you use a different multiplier)? Also, that we don’t care about the number of predictors she uses (outside of p-hacking concerns) since we’re not really making inferences about them?

My reply:

First, just speaking generally: it’s fine to work in the frequentist framework, which to me implies that you’re trying to understand the properties of your statistical methods in the settings where they will be applied. I work in the frequentist framework too! The framework where I *don’t* want you working is the null hypothesis significance testing framework, in which you try to prove your point by rejecting straw-man nulls.

In particular, I have no use for statistical significance, or alpha-levels, or familywise error rates, or the .05 threshold, or anything like that. To me, these are all silly games, and we should just cut to the chase and estimate the descriptive and causal population quantities of interest. Again, I *am* interested in the frequentist properties of my estimates—I’d like to understand their bias and variance—but I don’t want to do it conditional on null hypotheses of zero effect, which are hypotheses of zero interest to me. That’s a game you just don’t need to play anymore.

When you *do* have multiple comparisons, I think the right way to go is to analyze all of them using a hierarchical model—*not* to pick one or two or three out of context and then try to adjust the p-values using a multiple comparisons correction. Jennifer Hill, Masanao Yajima, and I discuss this in our 2011 paper, Why we (usually) don’t have to worry about multiple comparisons.

To put it another way, the original sin is *selection*. The problem with p-hacked work is not that p-values are uncorrected for multiple comparison, it’s that some subset of comparisons is selected for further analysis, which is wasteful of information. It’s better to analyze all the comparisons of interest at once. This paper with Steegen et al. demonstrates how many different potential analyses can be present, even in a simple study.

OK, so that’s my general advice: look at all the data and fit a multilevel model allowing for varying baselines and varying effects.
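As a toy illustration of the partial pooling such a model produces, here is a crude normal-normal shrinkage estimator in Python; a real analysis would fit the hierarchical model properly (e.g., in Stan), but the qualitative behavior is the same:

```python
import numpy as np

def partial_pool(estimates, std_errors):
    """Shrink per-comparison estimates toward their common mean
    (normal-normal model with a method-of-moments estimate of the
    between-group variance); a crude stand-in for a full
    hierarchical fit, for illustration only."""
    y = np.asarray(estimates, dtype=float)
    s2 = np.asarray(std_errors, dtype=float) ** 2
    mu = np.average(y, weights=1 / s2)
    # between-group variance: total spread minus sampling noise
    tau2 = max(np.var(y, ddof=1) - s2.mean(), 0.0)
    weight = tau2 / (tau2 + s2)  # 0 = complete pooling, 1 = no pooling
    return mu + weight * (y - mu)

# Five noisy comparisons, analyzed together: the extreme estimates are
# pulled toward the overall mean instead of being cherry-picked
print(partial_pool([4.0, 1.0, -0.5, 2.5, 0.0], [1.5] * 5))
```

Because every comparison is analyzed at once and the extreme ones are shrunk, there is nothing left to "correct" for: selection never happens.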

What about the specifics?

I took a look at the linked paper. I like the title. “When Does Negativity Demobilize?” is much better than “Does Negativity Demobilize.” The title recognizes that (a) effects are never zero, and (b) effects vary. I can’t quite buy this last sentence of the abstract, though: “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate.” No way! There must be other cases when negativity can demobilize. That said, at this point the paper could still be fine: even if a paper is working within a flawed inferential framework, it could still be solid empirical work. After all, it’s completely standard to estimate constant treatment effects—we did this in our first paper on incumbency advantage and I still think most of our reported findings were basically correct.

Reading on . . . Krupnikov writes, “The first section explores the psychological determinants that underlie the power of negativity leading to the focal hypothesis of this research. The second section offers empirical tests of this hypothesis.” For the psychological model, she writes that first a person decides which candidate to support, then he or she decides whether to vote. That seems a bit of a simplification, as sometimes I know I’ll vote even before I decide whom to vote for. Haven’t you ever heard of people making their decision inside the voting booth? I’ve done that! Even beyond that, it doesn’t seem quite right to identify the choice as being made at a single precise time. Again, though, that’s ok: Krupnikov is presenting a *model*, and models are inherently simplifications. Models can still help us learn from the data.

OK, now on to the empirical part of the paper. I see what you mean: there are a lot of potential explanatory variables running around: overall negativity, late negativity, state competitiveness, etc etc. Anything could be interacted with anything. This is a common concern in social science, as there is an essentially unlimited number of factors that could influence the outcome of interest (turnout, in this case). On one hand, it’s a poopstorm when you throw all these variables into your model at once; on the other hand, if you exclude anything that might be important, it can be hard to interpret any comparisons in observational data. So this is something we’ll have to deal with: it won’t be enough to just say there are too many variables and then give up. And it certainly won’t be a good idea to trawl through hundreds of comparisons, looking for something that’s significant at the .001 level or whatever. That would make no sense at all. Think of what happens: you grab the comparison with a z-score of 4, setting aside all those silly comparisons with z-scores of 3, or 2, or 1, but this doesn’t make much sense, given that these z-scores are so bouncy: differences of less than 3 in z-scores are not themselves statistically significant.

To put it another way, “multiple comparisons” can be a valuable *criticism*, but multiple comparisons corrections are not so useful as a *method* of data analysis.

Getting back to the empirics . . . here I agree that there are problems. I don’t like this:

Estimating Model 1 shows that overall negativity has a null effect on turnout in the 2004 presidential election (Table 2, Model 1). While the coefficient on the overall negativity variable is negative, it does not reach conventional levels of statistical significance. These results are in line with Finkel and Geer (1998), as well as Lau and Pomper (2004), and show that increases in the negativity in a respondent’s media market over the entire duration of the campaign did not have any effect on his likelihood of turning out to vote in 2004.

Not statistically significant != zero.

Here’s more:

Going back to the conclusion from the abstract, “negativity can only demobilize when two conditions are met: (1) a person is exposed to negativity after selecting a preferred candidate and (2) the negativity is about this selected candidate,” I think Krupnikov is just wrong here in her application of her empirical results. She’s taking non-statistically-significant comparisons as zero, and she’s taking the difference between significant and non-significant as being significant. Don’t do that.
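The arithmetic behind that last point (the difference between "significant" and "not significant" is not itself statistically significant) is easy to check with hypothetical numbers:

```python
import math

# Two hypothetical estimates with the same standard error (1.0):
# the first is "significant" (z = 2.2), the second is not (z = 1.0)
est_a, est_b, se = 2.2, 1.0, 1.0

# but the z-score of their difference is nowhere near significance
z_diff = (est_a - est_b) / math.sqrt(se**2 + se**2)
print(f"z for the difference: {z_diff:.2f}")  # 0.85
```

So declaring effect A real and effect B zero, on the basis of these two z-scores, is a claim the data cannot support.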

Given that the goal here is causal inference, I think it would have been better to set this up more formally as an observational study comparing treatment and control groups.

I did not read the rest of the paper, nor am I attempting to offer any evaluation of the work. I was just focusing on the part addressed by your question. The bigger picture, I think, is that it can be valuable for a researcher to (a) summarize the patterns she sees in data, and (b) consider the implications of these patterns for understanding recent and future campaigns, while (c) recognizing residual uncertainty.

Remember Tukey’s quote: “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

The attitude I’m offering is not nihilistic: even if we have not reached anything close to certainty, we can still learn from data and have a clearer sense of the world after our analysis than before.

The post Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan® appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Update: Usage guidelines**

We basically just followed Apache’s lead.

**It’s official**

“Stan” is now a registered trademark. For those keeping score, it’s:

The Stan logo (see image below) is also official.

No idea why there are serial numbers for the image and registration numbers for the text. Ask the USPTO.

**How to refer to Stan**

Please just keep writing “Stan”. We’ll be using the little ® symbol in prominent branding, but you don’t have to.

**Thanks to NumFOCUS**

Thanks to Leah Silen and NumFOCUS for shepherding the application through the registration process. NumFOCUS is the official trademark holder.

**“Stan”, not “STAN”**

We use “Stan” rather than “STAN”, because “Stan” isn’t an acronym. Stan is named after Stanislaw Ulam.

The mark is rendered as “STAN” on the USPTO site. Do not be fooled! The patent office capitalizes everything because the registrations are case insensitive.

The image submitted for the trademark (shown above) is black and white. So far, we’ve always used color—on the web site, manual, t-shirts, stickers, etc.

The post Stan® appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Incentives Matter (Congress and Wall Street edition) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Thomas Ferguson sends along this paper. From the summary:

Social scientists have traditionally struggled to identify clear links between political spending and congressional voting, and many journalists have embraced their skepticism. A giant stumbling block has been the challenge of measuring the labyrinthine ways money flows from investors, firms, and industries to particular candidates. Ferguson, Jorgensen, and Chen directly tackle that classic problem in this paper. Constructing new data sets that capture much larger swaths of political spending, they show direct links between political contributions to individual members of Congress and key floor votes . . .

They show that prior studies have missed important streams of political money, and, more importantly, they show in detail how past studies have underestimated the flow of political money into Congress. The authors employ a data set that attempts to bring together all forms of campaign contributions from any source—contributions to candidate campaign committees, party committees, 527s or “independent expenditures,” SuperPACs, etc., and aggregate them by final sources in a unified, systematic way. To test the influence of money on financial regulation votes, they analyze the U.S. House of Representatives voting on measures to weaken the Dodd-Frank financial reform bill. Taking care to control as many factors as possible that could influence floor votes, they focus most of their attention on representatives who originally voted in favor of the bill and subsequently to dismantle key provisions of it. Because these are the same representatives, belonging to the same political party, in substantially the same districts, many factors normally advanced to explain vote shifts are ruled out from the start. . . .

The authors test five votes from 2013 to 2015, finding the link between campaign contributions from the financial sector and switching to a pro-bank vote to be direct and substantial. The results indicate that for every $100,000 that Democratic representatives received from finance, the odds they would break with their party’s majority support for the Dodd-Frank legislation increased by 13.9 percent. Democratic representatives who voted in favor of finance often received $200,000–$300,000 from that sector, which raised the odds of switching by 25–40 percent. The authors also test whether representatives who left the House at the end of 2014 behaved differently. They find that these individuals were much more likely to break with their party and side with the banks. . . .
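Taken at face value, the per-$100,000 odds figure compounds like this. A back-of-the-envelope sketch only: the paper's own model may handle larger totals differently, which would account for the somewhat lower 25–40 percent range quoted above.

```python
# If each $100,000 multiplies the odds of switching by 1.139 (a 13.9%
# increase), compounding gives the implied increase for larger totals
per_100k = 1.139
for amount in (200_000, 300_000):
    mult = per_100k ** (amount / 100_000)
    print(f"${amount:,}: odds multiplied by {mult:.2f} "
          f"({mult - 1:.0%} increase)")
```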

I had a quick question: how do you deal with the correlation/causation issue? The idea that Wall St is giving money to politicians who would already support them? That too is a big deal, of course, but it’s not quite the story Ferguson et al. are telling in the paper.

Ferguson responded:

We actually considered that at some length. That’s why we organized the main discussion on Wall Street and Dodd-Frank around looking at Democratic switchers — people who originally voted for passage (against Wall Street, that is), but then switched in one or more later votes to weaken. Nobody is in that particular regression who didn’t already vote against Wall Street once already, when it really counted.

I replied: Sure, but there’s still the correlation problem, in that one could argue that switchers are people whose latent preferences were closer to the middle, so they were just the ones who were more likely to shift following a change in the political weather.

Ferguson:

Conservatism is controlled for in the analysis, using a measure derived from that Congress. This isn’t going to the middle; it’s a tropism for money. The other obvious comment is that if they are really latent Wall Street lovers, they should be moving mostly in lockstep on the subsequent votes. If you look at our summary nos., you can see they weren’t. We could probably mine that point some more.

Short of administering the MMPPI for banks in advance, are you prepared to accept any empirical evidence? Voting against banks in the big one is pretty good, I think.

Me: I’m not sure, I’ll have to think about it. One answer, I think, is that if it’s just $ given to pre-existing supporters of Wall St., it’s still an issue, as the congressmembers are then getting asymmetrically rewarded (votes for Wall St get the reward, votes against don’t get the reward), and, as economists are always telling us, Incentives Matter.

Ferguson:

Remember those folks who turned on Swaps Push Out didn’t necessarily turn out for the banks on other votes. If it’s “weather” it’s a pretty strange weather.

The post Incentives Matter (Congress and Wall Street edition) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan Weekly Roundup, 23 June 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>* Lots of people got involved in pushing Stan 2.16 and interfaces out the door; Sean Talts got the math library, Stan library (that’s the language, inference algorithms, and interface infrastructure), and CmdStan out, while Allen Riddell got PyStan 2.16 out and Ben Goodrich and Jonah Gabry are tackling RStan 2.16

* Stan 2.16 is the last series of releases that will not require C++11; let the coding fun begin!

* Ari Hartikainen (of Aalto University) joined the Stan dev team—he’s working with Allen Riddell on PyStan, where judging from the pull request traffic, he put in a lot of work on the 2.16 release. Welcome!

* Imad Ali’s working on adding more cool features to RStanArm including time series and spatial models; yesterday he and Mitzi were scheming to get intrinsic conditional autoregressive models in and I heard all those time series names flying around (like ARIMA)

* Michael Betancourt rearranged the Stan web site with some input from me and Andrew; Michael added more descriptive text and Sean Talts managed to get the redirects in so all of our links aren’t broken; let us know what you think

* Markus Ojala of Smartly wrote a case study on their blog, Tutorial: How We Productized Bayesian Revenue Estimation with Stan

* Mitzi Morris got in the pull request for adding compound assignment and arithmetic; this adds statements such as `n += 1`.

* Lots of chatter about characterization tests and a pull request from Daniel Lee to update some of our existing performance tests

* Roger Grosse from U. Toronto visited to tell us about the 2016 NIPS paper he wrote with Siddharth Ancha and Daniel Roy on testing MCMC using bidirectional Monte Carlo sampling; we talked about how he modified Stan’s sampler to do annealed importance sampling

* GPU integration continues apace

* I got to listen in on Michael Betancourt and Maggie Lieu of the European Space Institute as they spent a couple of days hashing out astrophysics models; Maggie would really like us to add integrals.

* Speaking of integration, Marco Inacio has updated his pull request; Michael’s worried there may be numerical instabilities, because trying to calculate arbitrary bounded integrals is not so easy in a lot of cases

* Andrew continues to lobby for being able to write priors directly into parameter declarations; for example, here’s what a hierarchical prior for `beta` might look like:

    parameters {
      real mu ~ normal(0, 2);
      real<lower=0> sigma ~ student_t(4, 0, 2);
      vector[N] beta ~ normal(mu, sigma);
    }

* I got the go-ahead on adding foreach loops; Mitzi Morris will probably be coding them. We’re talking about:

    real ys[N];
    ...
    for (y in ys)
      target += log_mix(lambda,
                        normal_lpdf(y | mu[1], sigma[1]),
                        normal_lpdf(y | mu[2], sigma[2]));

* Kalman filter case study by Jouni Helske was discussed on Discourse

* Rob Trangucci rewrote the Gaussian processes chapter of the Stan manual; I’m to blame for the first version, writing it as I was learning GPs. For some reason, it’s not up on the web page doc yet.

* This is a very ad hoc list. I’m sure I missed lots of good stuff, so feel free to either send updates to me directly for next week’s letter or add things to comments. This project’s now way too big for me to track all the activity!

The post Stan Weekly Roundup, 23 June 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Best correction ever: “Unfortunately, the correct values are impossible to establish, since the raw data could not be retrieved.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Several errors and omissions occurred in the reporting of research and data in our paper: “How Descriptive Food Names Bias Sensory Perceptions in Restaurants,” Food Quality and Preference (2005) . . .

The dog ate my data. Damn gremlins. I hate when that happens.

As the saying goes, “Each year we publish 20+ new ideas in academic journals, and we appear in media around the world.” In all seriousness, the problem is not that they publish their ideas, the problem is that they are “changing or omitting data or results such that the research is not accurately represented in the research record.” And of course it’s not just a problem with Mr. Pizzagate or Mr. Gremlins or Mr. Evilicious or Mr. Politically Incorrect Sex Ratios: it’s all sorts of researchers who (a) don’t report what they actually did, and (b) refuse to reconsider their flimsy hypotheses in light of new theory or evidence.

The post Best correction ever: “Unfortunately, the correct values are impossible to establish, since the raw data could not be retrieved.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>