
Objects of the class “Verbal Behavior”

Steve Shulman-Laniel writes:

My nominee for a new “objects of the class” would be B. F. Skinner’s “Verbal Behavior” — i.e., the criticisms of the thing are more widely read than the thing itself.

Hmmm . . . I’ve heard of Skinner but not of “Verbal Behavior,” let alone its criticism. But the general idea sounds good.

What other objects are in that class?

It’s related to Objects of the class “Foghorn Leghorn” but slightly different in that here we’re talking criticism, not parody.

And here are other objects of the class “Objects of the class.”

New Stan case studies: NNGP and Lotka-Volterra

It’s only January and we already have two new case studies up on the Stan site.

Two new case studies

Lu Zhang of UCLA contributed a case study on nearest neighbor Gaussian processes.

Bob Carpenter (that’s me!) of Columbia Uni contributed one on Lotka-Volterra population dynamics.

Mitzi Morris of Columbia Uni has been updating her ICAR spatial models case study; the neighbor graph is now explained in pictures, the results have been extended from Brooklyn to all five boroughs of NYC, and the results are overlaid on Google Maps.

Tufte handout format!

I’m really excited about the tufte-handout style I used. Thanks to Sarah Heaps of Newcastle Uni for showing me her teaching materials for her Jumping Rivers RStan course. It inspired me to check out the Tufte format. I’m already using it for the MCMC and Stan book to back up the Coursera courses I’m preparing.

StanCon case studies coming soon

As soon as Jonah can process them, there’ll be a whole bunch of case studies in the proceedings from the 2018 StanCon in Asilomar.

Lots of case studies available

Here’s the complete list of official case studies.

Of course, lots of people do Stan case studies in papers, YouTube videos and blogs.

Your case study here

Meanwhile, if you have case studies to contribute, please let us know. It’s as easy as using knitr or Jupyter and sending us the HTML and a repo link.

Looking at all possible comparisons at once: It’s not “overfitting” if you put it in a multilevel model

Rémi Gau writes:

The human brain mapping conference is on these days and I heard via Twitter about this Overfitting toolbox for fMRI studies that helps explore the multiplicity of analytical pipelines in a more systematic fashion.

Reminded me a bit of your multiverse analysis: thought you might like the idea.

The link is to a conference poster by Joram Soch, Carsten Allefeld, and John-Dylan Haynes that begins:

A common problem in experimental science is if the analysis of a data set yields no significant result even though there is a strong prior belief that the effect exists. In this case, overfitting can help . . . We present The Overfitting Toolbox (TOT), a set of computational tools that allow to systematically exploit multiple model estimation, parallel statistical testing, varying statistical thresholds and other techniques that allow to increase the number of positive inferences.

I’m pretty sure it’s a parody: for one thing, the poster includes a skull-and-crossbones image; for another, it ends as follows:

Widespread use of The Overfitting Toolbox (TOT) will allow researchers to uncover literally unthinkable sorts of effects and lead to more spectacular findings and news coverage for the entire fMRI community.

That said, I actually think it can be a good idea to fit a model looking at all possible comparisons! My recommended next step, though, is not to look at p-values or multiple comparisons or false discovery rates or whatever, but rather to fit a hierarchical model for the distribution of all these possible effects. The point is that everything’s there, but most effects are small. In the areas where I’ve worked, this makes more sense than trying to pick out and focus on just a few comparisons.
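To give a feel for what that looks like, here is a toy sketch (mine, not tied to any particular fMRI analysis) of partial pooling with a simple normal-normal hierarchical model: many noisy comparisons are fit at once, and the apparently large ones get pulled back toward the overall mean.

import numpy as np

# Toy sketch: partially pool many estimated comparisons y_j ~ Normal(theta_j, se_j),
# with theta_j ~ Normal(mu, tau); mu and tau are fit crudely, just for illustration.
def partial_pool(y, se):
    mu = np.average(y, weights=1 / se**2)            # precision-weighted grand mean
    tau2 = max(np.var(y) - np.mean(se**2), 0.0)      # crude between-effect variance
    shrink = tau2 / (tau2 + se**2)                   # 0 = full pooling, 1 = no pooling
    return mu + shrink * (y - mu)

rng = np.random.default_rng(0)
J = 50
theta = rng.normal(0, 0.1, J)                        # most true effects are small
se = np.full(J, 0.3)                                 # but the measurements are noisy
y = rng.normal(theta, se)                            # raw comparisons
theta_hat = partial_pool(y, se)
print("largest raw comparison:", round(y.max(), 2))
print("same comparison after pooling:", round(theta_hat[y.argmax()], 2))

The raw maximum looks impressive; the partially pooled version of the same comparison is much closer to zero, which is the point: everything is there, but most effects are estimated to be small.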

Stacking and multiverse

It’s a coincidence that there is another multiverse posting today.

Recently Tim Disher asked a question on the Stan discussion forum: “Multiverse analysis – concatenating posteriors?”

Tim refers to a paper “Increasing Transparency Through a Multiverse Analysis” by Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. The abstract says

Empirical research inevitably includes constructing a data set by processing raw data into a form ready for statistical analysis. Data processing often involves choices among several reasonable options for excluding, transforming, and coding data. We suggest that instead of performing only one analysis, researchers could perform a multiverse analysis, which involves performing all analyses across the whole set of alternatively processed data sets corresponding to a large set of reasonable scenarios. Using an example focusing on the effect of fertility on religiosity and political attitudes, we show that analyzing a single data set can be misleading and propose a multiverse analysis as an alternative practice. A multiverse analysis offers an idea of how much the conclusions change because of arbitrary choices in data construction and gives pointers as to which choices are most consequential in the fragility of the result.

In that paper the focus is on looking at the possible results from the multiverse of forking paths, but Tim asked whether it would “make sense at all to combine the posteriors from a multiverse analysis in a similar way to how we would combine multiple datasets in multiple imputation”?

After I (Aki) thought about this, my answer is

  • in multiple imputation the different data sets are posterior draws from the missing data distribution and thus usually equally weighted
  • I think multiverse analysis is similar to the case of having a set of models with different variables, variable transformations, interactions and non-linearities, as in our Stacking paper (Yao, Vehtari, Simpson, Gelman), where we have different models for the arsenic well data (section 4.6). Then stacking would be a sensible way to combine *predictions* (as we may have different model parameters for differently processed data) with non-equal weights. Stacking is a good choice for model combination here as
    1. we don’t need to assign prior probabilities for different forking paths
    2. stacking favors paths which give good predictions
    3. it avoids the “prior dilution problem” if some processed datasets happen to be very similar to each other (see fig 2c in the Stacking paper). (A minimal sketch of computing stacking weights follows below.)
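Here is that sketch (mine, not the code from the Stacking paper): given an N × K matrix of pointwise log predictive densities (for example, leave-one-out log densities, one column per forking path), the stacking weights maximize the summed log score of the weighted mixture of predictive distributions. The function name and toy data are made up for illustration.

import numpy as np
from scipy.optimize import minimize

def stacking_weights(lpd):
    """Stacking weights from an (N, K) matrix of pointwise log predictive densities."""
    n, k = lpd.shape
    dens = np.exp(lpd - lpd.max(axis=1, keepdims=True))   # rescale each row for stability

    def neg_log_score(z):
        w = np.exp(z - z.max())
        w /= w.sum()                                       # softmax onto the simplex
        return -np.sum(np.log(dens @ w))                   # negative summed log score

    res = minimize(neg_log_score, np.zeros(k), method="Nelder-Mead")
    w = np.exp(res.x - res.x.max())
    return w / w.sum()

# toy example: three forking paths, the second of which predicts a bit better
rng = np.random.default_rng(1)
lpd = rng.normal(loc=[-1.2, -1.0, -1.2], scale=0.1, size=(200, 3))
print(stacking_weights(lpd))   # most of the weight should land on the second path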

The multiverse in action!

In a recent paper, “Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking,” Jelte Wicherts, Coosje Veldkamp, Hilde Augusteijn, Marjan Bakker, Robbie van Aert, and Marcel van Assen write:

The designing, collecting, analyzing, and reporting of psychological studies entail many choices that are often arbitrary. The opportunistic use of these so-called researcher degrees of freedom aimed at obtaining statistically significant results is problematic because it enhances the chances of false positive results and may inflate effect size estimates. In this review article, we present an extensive list of 34 degrees of freedom that researchers have in formulating hypotheses, and in designing, running, analyzing, and reporting of psychological research. The list can be used in research methods education, and as a checklist to assess the quality of preregistrations and to determine the potential for bias due to (arbitrary) choices in unregistered studies.

34 different degrees of freedom! That’s a real multiverse; it can get you to a p-value of 2^-34 (that is, about 0.00000000006) in no time. And it’s worse than that: Wicherts et al. write, “We created a list of 34 researcher DFs, but our list is in no way exhaustive for the many choices that need be made during the different phases of a psychological experiment.”

Preregistration is fine, but let me remind all readers that the most important steps in any study are valid and reliable measurements and, where possible, large and stable effect sizes. All the preregistration in the world won’t save you if your measurements are not serious or if you’re studying an effect that is tiny or highly variable. Remember the kangaroo problem.

As I wrote here, “I worry that the push toward various desirable procedural goals can make people neglect the fundamental scientific and statistical problems that, ultimately, have driven the replication crisis.”

Another bivariate multivariate dependence measure!

Joshua Vogelstein writes:

Since you’ve posted much on various independence test papers (e.g., Reshef et al., and then Simon & Tibshirani criticism, and then their back and forth), I thought perhaps you’d post this one as well.

Tibshirani pointed out that distance correlation (Dcorr) was recommended, we proved that our oracle multiscale generalized correlation (MGC, pronounced “Magic”) statistically dominates Dcorr, and empirically demonstrate that sample MGC nearly dominates.

The new paper, by Cencheng Shen, Carey Priebe, Mauro Maggioni, Qing Wang, and Joshua Vogelstein, is called “Discovering Relationships and their Structures Across Disparate Data Modalities.” I don’t have the energy to read this right now but I thought it might interest some of you. I’m glad that people continue to do research on these methods as they would seem to have many areas of application.

My quick comment to Vogelstein on this paper was to suggest changing “whether” to “how” in the first sentence of the abstract.

The Paper of My Enemy Has Been Retracted

The paper of my enemy has been retracted
And I am pleased.
From every media outlet it has been retracted
Like a van-load of p-values that has been seized
And sits in star-laden tables in a replication archive,
My enemy’s much-prized effort sits in tables
In the kind of journal where retraction occurs.
Great, square stacks of rejected articles and, between them, aisles
One passes down reflecting on life’s vanities,
Pausing to remember all that thoughtful publicity
Lavished to no avail upon one’s enemy’s article—
For behold, here is that study
Among these ranks and banks of duds,
These ponderous and seemingly irreducible cairns
Of complete stiffs.

The paper of my enemy has been retracted
And I rejoice.
It has gone with bowed head like a defeated legion
Beneath the yoke.
What avail him now his awards and prizes,
The praise expended upon his meticulous technique,
His cool new experiments?
Knocked into the middle of next week
His brainchild now consorts with the bad theories
The sinkers, clinkers, dogs and dregs,
The Evilicious Edsels of pseudoscience,
The bummers that no amount of hype could shift,
The unbudgeable turkeys.

Yea, his slim paper with its understated abstract
Bathes in the blare of the brightly promoted Air Rage Paper,
His unmistakably individual new voice
Shares the same scrapyard with a forlorn skyscraper
Of Ovulation and Voting,
His honesty, proclaimed by himself and believed by others,
His renowned abhorrence of all p-hacking and pretense,
Is there with the Collected Works of Marc Hauser—
One Hundred Years of Research Misconduct,
And (oh, this above all) his sensibility,
His sensibility and its hair-like filaments,
His delicate, quivering sensibility is now as one
With Edward Wegman’s Wikipedia cribs,
A volume graced by the descriptive rubric
‘The simplex method visits all 2^d vertices.’

Soon now a paper of mine could be retracted also,
Though not to the monumental extent
In which the chastisement of correction has been meted out
To the paper of my enemy,
Since in the case of my own study it will be due
To a miscoded variable, a confusion over data—
Nothing to do with merit.
But just supposing that such an event should hold
Some slight element of sadness, it will be offset
By the memory of this sweet moment.
Chill the champagne and polish the crystal goblets!
The paper of my enemy has been retracted
And I am glad.

(Adapted from the classic poem written by the brilliant Clive James.)

Bayes, statistics, and reproducibility: My talk at Rutgers 5pm on Mon 29 Jan 2018

In the weekly seminar on the Foundations of Probability in the Philosophy Department at Rutgers University, New Brunswick Campus, Miller Hall, 2nd floor seminar room:

Bayes, statistics, and reproducibility

The two central ideas in the foundations of statistics—Bayesian inference and frequentist evaluation—both are defined in terms of replications. For a Bayesian, the replication comes in the prior distribution, which represents possible parameter values under the set of problems to which a given model might be applied; for a frequentist, the replication comes in the reference set or sampling distribution of possible data that could be seen if the data collection process were repeated. Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data. We consider the implications for the replication crisis in science and discuss how scientists can do better, both in data collection and in learning from the data they have.

Here are some relevant papers:

Philosophy and the practice of Bayesian statistics (with Cosma Shalizi) and rejoinder to discussion

Beyond subjective and objective in statistics (with Christian Hennig)

The failure of null hypothesis significance testing when studying incremental changes, and what to do about it

When the appeal of an exaggerated claim is greater than a prestige journal

Adam Clarke writes:

Are you aiming to write a blog post soon on the recent PNAS article of ‘When the appeal of a dominant leader is greater than a prestige leader’? The connection it points out between economic uncertainty and preference for dominant leaders seems intuitive – perhaps a bit too intuitive. The “Edited by Susan T. Fiske” usually makes me suspicious, but I couldn’t find much wrong with the content. It’d be interesting to hear your blogged thoughts on the paper.

My reply: I suppose that as an exploratory analysis it’s fine. This sort of thing could never get published in a good political science journal—it’s more of a blog post, as it were—but it’s fine to get stuff like that out there. The key in communicating this sort of result is to place it in the speculative zone.

Let’s take a look. Here’s the abstract of the paper:

We examine why dominant/authoritarian leaders attract support despite the presence of other admired/respected candidates. Although evolutionary psychology supports both dominance and prestige as viable routes for attaining influential leadership positions, extant research lacks theoretical clarity explaining when and why dominant leaders are preferred. Across three large-scale studies we provide robust evidence showing how economic uncertainty affects individuals’ psychological feelings of lack of personal control, resulting in a greater preference for dominant leaders. This research offers important theoretical explanations for why, around the globe from the United States and Indian elections to the Brexit campaign, constituents continue to choose authoritarian leaders over other admired/respected leaders.

I have no problem with the first two sentences of this abstract. But then:

– “we provide robust evidence . . .” No.
– “how economic uncertainty affects . . .” No. Strike that causal language.
– “This research offers important theoretical explanations . . .” I doubt that. But “important” is a subjective term. I guess the authors think it’s important.

And, by the way, since they mentioned the recent U.S. elections, and setting aside the fact that Clinton won the popular vote, who were those “other admired/respected leaders” who were beat out by Trump? Jeb Bush? Ted Cruz? Hillary Clinton? Were those people really “admired/respected leaders”? I don’t know that the leading anti-Brexit campaigners were so admired or respected either.

My real concern here is not political. Rather, it’s a point of science communication: If you see an interesting pattern in observational data, it doesn’t seem to be enough to just say that. Instead you have to make inflated claims about “robust evidence” (i.e., a bunch of p-values less than 0.05 found within a garden of forking paths) and “important theoretical explanations.” Without those big claims, I can’t imagine such a paper appearing in PPNAS in the first place.

To put it another way, there’s a selection effect. If you find a cool data pattern and present it as such, you probably won’t get much attention. But if you wrap it in the garb of scientific near-certainty, there’s a chance you could hit the media jackpot. The incentives are all wrong.

NSF Cultivating Cultures for Ethical STEM

The National Science Foundation is funding this program:

NSF Cultivating Cultures for Ethical STEM (CCE STEM [science, technology, engineering, and mathematics])

Funding: The maximum amount for 5-year awards is $600,000 (including indirect costs) and the maximum amount for 3-year awards is $400,000 (including indirect costs). The average award is $275,000.

Deadline: Internal Notice of Intent due 02/07/18
Final Proposal due 04/17/18

Limit: One application per institution

Summary: Cultivating Cultures for Ethical STEM (CCE STEM) funds research projects that identify (1) factors that are effective in the formation of ethical STEM researchers and (2) approaches to developing those factors in all the fields of science and engineering that NSF supports. CCE STEM solicits proposals for research that explores what constitutes responsible conduct for research (RCR), and which cultural and institutional contexts promote ethical STEM research and practice and why.

Successful proposals typically have a comparative dimension, either between or within institutional settings that differ along these or among other factors, and they specify plans for developing interventions that promote the effectiveness of identified factors.

CCE STEM research projects will use basic research to produce knowledge about what constitutes or promotes responsible or irresponsible conduct of research, and how to best instill students with this knowledge. In some cases, projects will include the development of interventions to ensure responsible research conduct.

I don’t know what to think about this. On one hand, I think ethics in science is important. On the other hand, it’s not clear to me how to do $275,000 worth of research on this project. On the other hand—hmmm, I guess I should say, back on the first hand—I guess it should be possible to do some useful qualitative research. After all, I think a lot about ethics and I write a bit about the topic, but I haven’t really studied it systematically and I don’t really know how to. So it makes sense for someone to figure this out.

There’s also this:

What practices contribute to the establishment and maintenance of ethical cultures and how can these practices be transferred, extended to, and integrated into other research and learning settings?

I’m thinking maybe the Food and Brand Lab at Cornell University could apply for some of this funding. At this point, they must know a lot about what practices contribute to the establishment and maintenance of ethical cultures and how can these practices be transferred, extended to, and integrated into other research and learning settings. You could say they’re the true experts in the field, especially since that Evilicious guy has left academia.

Or maybe an ethics supergroup could be convened. (Here’s a related list, of which my favorite is the Traveling Wilburys. Sorry, Dan!)

In all seriousness, I really don’t know how to think about this sort of thing. I hope NSF gets some interesting proposals.

To better enable others to avoid being misled when trying to learn from observations, I promise not to be transparent, open, sincere nor honest?

I recently read a paper by Stephen John with the title “Epistemic trust and the ethics of science communication: against transparency, openness, sincerity and honesty”. On a superficial level, John’s paper can be re-stated as “honesty (transparency, openness and sincerity) is not always the best policy.” For instance, “publicising the inner workings of sausage factories does not necessarily promote sausage sales”. The paper got me thinking about how, as statistical consultants or collaborators, we need to discern how others think if we hope to enable them to actually do better science – at least when the science is challenging. I used to think about that a lot more early in my statistical career, sometimes referring to it as exploratory cognitive analysis.

Now I was fortunate at the time to be with a group where mentoring and enabling better research was job one. Most of those I worked with were clinical research fellows and their clinical mentors. They had fairly secure funding so they were not too worried about getting grants. Usually we would work together for a few years, and publications did seem to be mostly about sharing knowledge rather than building careers. For instance, if something went wrong in a study, a paper often was written along the lines of identifying what went wrong and how others could avoid the same problems. Perhaps in such a context, learning how they thought – so I could better engage with them – made more sense. This may be rare in academia. Additionally, for much of the research, available or well recognised statistical methods were lacking at the time. Or maybe I just dropped the ball after leaving?

Before I left that group, I did provide some training on exploratory cognitive analysis to my replacement. It involved debriefing with them after meetings with clinical researchers, discussing what the clinical researcher seemed to grasp, what they would likely accept as suggestions, what they did not seem to grasp and how to further engage with them. In some of the meetings, I arranged to purposely show up late or have to leave early so my replacement got a few solo runs at this. I was also reminded of this activity in a JSM presentation by Christine H. Fox, The Role of Analysis in Supporting Strategic Decisions. Afterwards I mentioned this to my replacement and they mentioned that it has worked well for them. However, there does not seem to be much written on the need to learn how others think when engaging in statistical consultation or collaboration. Or how to avoid providing “insultation” instead of consultation – as Xiao-Li Meng put it here. Now, is it likely that “not being transparent, open, sincere nor honest” will help here?

John’s assumptions and goals are interesting. “All I assume is that the world is a better place if non-experts’ [non-statisticians] beliefs are informed by true experts’ [statisticians] claims”. Additionally, accepting the goal of “ensur[ing] that non-experts’ [non-statisticians’] beliefs mirror experts’ [statisticians’] claims in many ways”, the ethics of science communication boils down to identifying what is permissible in doing this. This makes sense – if non-statisticians were to think like statisticians, that would enable better research – right? The challenge arises because of how most non-statisticians may learn from statisticians and the preponderance of a “folk philosophy of science [statistics]” among them. This allows “agnotology” (the deliberate creation and maintenance of ignorance) by careerist statisticians, or by those who get something by providing statistical “help” (e.g., to get co-authorship), to flourish. Take, for instance, the folk-philosophy-of-statistics belief that there are correct and incorrect methods in statistics. Given that, once you have involved a statistician or statistically able consultant/collaborator, you can be fully confident the methods used in your study are correct. The uncertainty has been fully laundered. An example of this I ran into often at one university was “After our discussion about analysing my observational study, I talked to Dr. X (a statistician with a more senior appointment in the university) they assured me that step-wise regression is a correct way to analyse my observational study. They also said your concerns about confounding just over complicate things and make for a lot of unnecessary delay and work. In fact, I was able to do the analysis on my own, we have already submitted the paper and it looks like it will be accepted.”

Below I discuss John’s model regarding how non-statisticians become informed by statisticians (competent, incompetent, and agnotological ones), try to identify current deficiencies in the statistical discipline’s ability to deliver if John’s model is applicable, and discern areas where I argue such a model is or is not too wrong to be helpful.


State-space modeling for poll aggregation . . . in Stan!

Peter Ellis writes:

As part of familiarising myself with the Stan probabilistic programming language, I replicate Simon Jackman’s state space modelling with house effects of the 2007 Australian federal election. . . .

It’s not quite the model that I’d use—indeed, Ellis writes, “I’m fairly new to Stan and I’m pretty sure my Stan programs that follow aren’t best practice, even though I am confident they work”—but it’s not so bad to see a newcomer work through the steps on a real-data example. This might inspire some of you to use Stan to fit Bayesian models on your own problems.

Again, the key advantage of Stan is flexibility in modeling: you can (and should) start with something simple, and then you can add refinements to make the model more realistic, at each step thinking about the meanings of the new parameters and about what prior information you have. It’s great if you have strong prior information to help fit the model, but it can also help to have weak prior information that regularizes—gives more stable estimates—by softly constraining the zone of reasonable values of the parameters. Go step by step, and before you know it, you’re fitting models that make much more sense than all the crude approximations you were using before.
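To show the general shape of such a model, here is a minimal sketch (mine, not Ellis’s or Jackman’s actual code) of a random-walk state-space model with additive house effects, written as a Stan program and run from PyStan; the toy poll data are made up purely for illustration.

import numpy as np
import pystan  # PyStan 2.x interface

model_code = """
data {
  int<lower=1> T;                        // number of days
  int<lower=1> N;                        // number of polls
  int<lower=1> H;                        // number of polling houses
  int<lower=1, upper=T> day[N];
  int<lower=1, upper=H> house[N];
  vector[N] y;                           // poll estimates of vote share
  vector<lower=0>[N] se;                 // poll standard errors
}
parameters {
  vector[T] mu;                          // latent daily vote share
  vector[H] delta;                       // house effects
  real<lower=0> sigma;                   // day-to-day innovation sd
}
model {
  sigma ~ normal(0, 0.01);               // weakly informative: small daily moves
  delta ~ normal(0, 0.05);               // house effects softly shrunk to zero
  mu[1] ~ normal(0.5, 0.2);
  mu[2:T] ~ normal(mu[1:(T - 1)], sigma);
  y ~ normal(mu[day] + delta[house], se);
}
"""

# toy data: 40 fake polls over 100 days from 3 polling houses
rng = np.random.default_rng(0)
poll_data = {
    "T": 100, "N": 40, "H": 3,
    "day": rng.integers(1, 101, 40).tolist(),
    "house": rng.integers(1, 4, 40).tolist(),
    "y": rng.normal(0.52, 0.02, 40).tolist(),
    "se": np.full(40, 0.015).tolist(),
}

sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data=poll_data, iter=2000, chains=4)
print(fit)

In a real analysis you would also want to pin things down further, for example by constraining the house effects to sum to zero and anchoring the latent series to an actual election result, but that is exactly the kind of refinement you can add one step at a time.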

Hey—here’s the title of my talk for this year’s New York R conference

Big Data Needs Big Model

Big Data are messy data, available data not random samples, observational data not experiments, available data not measurements of underlying constructs of interest. To make relevant inferences from big data, we need to extrapolate from sample to population, from control to treatment group, and from measurements to latent variables. All these steps require modeling.

At a more technical level, each of these steps should be more effective if we adjust for more factors, but models that adjust for lots of factors are hard to estimate, and we need (a) regularization to get more stable estimates (and in turn to allow us to adjust for more factors), and (b) modeling of latent variables. Bayesian inference in Stan can do these things. And our next step is Bayesian workflow, building up models from simple to complex and tracking how these work on data as we do it.

So somebody please give me 10 million dollars.

How smartly.io productized Bayesian revenue estimation with Stan

Markus Ojala writes:

Bayesian modeling is becoming mainstream in many application areas. Applying it needs still a lot of knowledge about distributions and modeling techniques but the recent development in probabilistic programming languages have made it much more tractable. Stan is a promising language that suits single analysis cases well. With the improvements in approximation methods, it can scale to production level if care is taken in defining and validating the model. The model described here is the basis for the model we are running in production with various additional improvements.

He begins with some background:

Online advertisers are moving to optimizing total revenue on ad spend instead of just pumping up the amount of conversions or clicks. Maximizing revenue is tricky as there is huge random variation in the revenue amounts brought in by individual users. If this isn’t taken into account, it’s easy to react to the wrong signals and waste money on less successful ad campaigns. Luckily, Bayesian inference allows us to make justified decisions on a granular level by modeling the variation in the observed data.

Probabilistic programming languages, like Stan, make Bayesian inference easy. . . .

Sure, we know all that. But then comes the new part, at least it’s new to me:

In this blog post, we describe our experiences in getting Stan running in production.

Ojala discusses the “Use case: Maximizing the revenue on ad spend” and provides lots of helpful detail—not just the Stan code itself, but background on how they set up the model, and intermediate steps such as the first try which didn’t work because the model was insufficiently constrained and they needed to add prior information. As Ojala puts it:

What’s nice about Stan is that our model definition turns almost line-by-line into the final code. However, getting this model to fit by Stan is hard, as we haven’t specified any limits for the variables, or given sensible priors.

His solution is multilevel modeling:

The issue with the first approach is that the ad set estimates would be based on just the data from the individual ad sets. In this case, one random large $1,000 purchase can affect the mean estimate of a single ad set radically if there are only tens of conversion events (which is a common case). As such large revenue events could have happened also in other ad sets, we can get better estimates by sharing information between the ad sets.

With multilevel modeling, we can implement a partial pooling approach to share information. . . .

It’s BDA come to life! (But for some reason this paper is labeled, “Tauman, Chapter 6.”)

He continues with model development and posterior predictive checking! I’m lovin it.

Also this excellent point:

After you get comfortable in writing models, Stan is an expressive language that takes away the need to write custom optimization or sampling code to fit your model. It allows you to modify the model and add complexity easily.

Now let’s get to the challenges:

The model fitting can easily crash if the learning diverges. In most cases that can be fixed by adding sensible limits and informative priors for the variables and possibly adding a custom initialization for the parameters. Also non-centered parametrization is needed for hierarchical models.

These are must-haves for running the model in production. You want the model to fit 100% of the cases and not just 90% which would be fine in interactive mode. However, finding the issues with the model is hard.

What to do?

The best is to start with really simple model and add stuff step-by-step. Also running the model against various data sets and producing posterior plots automatically helps in identifying the issues early.
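To make those suggestions concrete, here is a minimal sketch (mine, not Smartly.io’s actual production model) of a partially pooled revenue model of the general kind described above, with informative priors, constrained scales, and a non-centered parametrization for the ad-set effects:

model_code = """
data {
  int<lower=1> N;                        // conversion events
  int<lower=1> J;                        // ad sets
  int<lower=1, upper=J> ad_set[N];
  vector<lower=0>[N] revenue;
}
parameters {
  real mu;                               // population mean of log revenue
  real<lower=0> tau;                     // between-ad-set sd
  real<lower=0> sigma;                   // within-ad-set sd (log scale)
  vector[J] eta;                         // non-centered ad-set offsets
}
transformed parameters {
  vector[J] theta;                       // ad-set level means of log revenue
  theta = mu + tau * eta;
}
model {
  mu ~ normal(3, 2);                     // informative priors keep the sampler
  tau ~ normal(0, 1);                    //   away from divergences in production
  sigma ~ normal(0, 2);
  eta ~ normal(0, 1);                    // implies theta ~ normal(mu, tau)
  revenue ~ lognormal(theta[ad_set], sigma);
}
"""

With eta given a standard normal prior and theta defined as mu + tau * eta, the ad-set means are partially pooled toward mu, and this non-centered form is the reparametrization mentioned above for keeping the sampler away from divergences.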

And some details:

We are using the PyStan Python interface that wraps the compilation and calling of the code. To avoid recompiling the models always, we precompile them and pickle them . . .

For scheduling, we use Celery which is a nice distributed task queue. . . .

We are now running the revenue model for thousands of different accounts every night with varying amount of campaigns, ad sets and revenue observations. The longest run takes couple minutes. Most of our customers still use the conversion optimization but are transitioning to use the revenue optimization feature. Overall, about one million euros of advertising spend on daily level is managed with our Predictive Budget Allocation. In future, we see that Stan or some other probabilistic programming language plays a big role in the optimization features of Smartly.io.

That’s awesome. We used the BSD license for Stan so it could be free and open source and anyone could use it inside their software, however they like. This sort of thing is exactly what we were hoping to see.
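For readers curious about the precompile-and-pickle detail, under the PyStan 2 interface the pattern looks roughly like this (the filenames and the per-account data dict are placeholders, not Smartly.io’s actual setup):

import pickle
import pystan  # PyStan 2.x interface

# compile once, e.g. at deploy time, and cache the compiled model
sm = pystan.StanModel(file="revenue_model.stan")
with open("revenue_model.pkl", "wb") as f:
    pickle.dump(sm, f)

# in the nightly job: load the pickled model and sample, with no recompilation
with open("revenue_model.pkl", "rb") as f:
    sm = pickle.load(f)
fit = sm.sampling(data=account_data, iter=1000, chains=4)  # account_data: one account's data dict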

How to get a sense of Type M and type S errors in neonatology, where trials are often very small? Try fake-data simulation!

Tim Disher read my paper with John Carlin, “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors,” and followed up with a question:

I am a doctoral student conducting research within the field of neonatology, where trials are often very small, and I have long suspected that many intervention effects are potentially inflated.

I am curious as to whether you have any thoughts as to how the methods you describe could be applied within the context of a meta-analysis. My initial thought was to do one of:

1. Approach the issue in a similar way to how minimum information size has been adapted for meta-analysis e.g. assess the risk of type S and M errors on the overall effect estimate as if it came from a single large trial.

2. Calculate the type S and M errors for each trial individually and use a simulation approach where trials are drawn from a mean effect adjusted for inflation % chance of opposite sign.

I think this is the first time we’ve fielded a question from a nurse, so let’s go for it. My quick comment is that you should get the best understanding of what your statistical procedure is doing by simulating fake data. Start with a model of the world, a generative model with hyperparameters set to reasonable values; then simulate fake data; then apply whatever statistical procedure you think might be used, including any selection for statistical significance that might occur; then compare the estimates to the assumed true values. Then repeat that simulation a bunch of times; this should give you a sense of type M and type S errors and all sorts of things.
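Here’s a bare-bones version of that recipe (a sketch of the general idea, not the code from our paper): assume a plausible true effect and standard error for a small trial, simulate many replications, keep only the “significant” ones, and look at the sign and magnitude of what survives. The numbers below are made up for illustration.

import numpy as np

rng = np.random.default_rng(42)
true_effect = 0.1        # assumed true effect (set this from subject-matter knowledge)
se = 0.25                # typical standard error of a small neonatology trial
n_sims = 100_000

est = rng.normal(true_effect, se, n_sims)        # estimates across replications
significant = np.abs(est / se) > 1.96            # selection on p < 0.05

power = significant.mean()
type_s = np.mean(np.sign(est[significant]) != np.sign(true_effect))   # wrong sign
type_m = np.mean(np.abs(est[significant])) / abs(true_effect)         # exaggeration

print(f"power: {power:.2f}, Type S: {type_s:.2f}, Type M (exaggeration ratio): {type_m:.1f}")

With these made-up numbers the power is low, a nontrivial fraction of the significant estimates have the wrong sign, and the significant estimates exaggerate the true effect several-fold, which is exactly the pattern to worry about in small trials.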

The Trumpets of Lilliput

Gur Huberman pointed me to this paper by George Akerlof and Pascal Michaillat that gives an institutional model for the persistence of false belief. The article begins:

This paper develops a theory of promotion based on evaluations by the already promoted. The already promoted show some favoritism toward candidates for promotion with similar beliefs, just as beetles are more prone to eat the eggs of other species. With such egg-eating bias, false beliefs may not be eliminated by the promotion system. Our main application is to scientific revolutions: when tenured scientists show favoritism toward candidates for tenure with similar beliefs, science may not converge to the true paradigm. We extend the statistical concept of power to science: the power of the tenure test is the probability (absent any bias) of denying tenure to a scientist who adheres to the false paradigm, just as the power of any statistical test is the probability of rejecting a false null hypothesis. . . .

It was interesting to see a mathematical model for the persistence of errors, and I agree that there must be something to their general point that people are motivated to support work that confirms their beliefs and to discredit work that disconfirms their beliefs. We’ve seen a lot of this sort of analysis at the individual level (“motivated reasoning,” etc.) and it makes sense to think of this at an interpersonal or institutional level too.

There were, however, some specific aspects of their model that I found unconvincing, partly on statistical grounds and partly based on my understanding of how social science works within society:

1. Just as I don’t think it is helpful to describe statistical hypotheses as “true” or “false,” I don’t think it’s helpful to describe scientific paradigms as “true” or “false.” Also, I’m no biologist, but I’m skeptical of a statement such as, “With the beetles, the more biologically fit species does not always prevail.” What does it mean to say a species is “more biologically fit”? If they survive and reproduce, they’re fit, no? And if a species’ eggs get eaten before they’re hatched, that reduces the species’s fitness.

In the article, they modify “true” and “false” to “Better” and “Worse,” but I have pretty much the same problem here, which is that different paradigms serve different purposes, so I don’t see how it typically makes sense to speak of one paradigm as giving “a more correct description of the world,” except in some extreme cases. For example, a few years ago I reviewed a pop-science book that was written from a racist paradigm. Is that paradigm “more correct” or “less correct” than a non-racist paradigm? It depends on what questions are being asked, and what non-racist paradigm is being used as a comparison.

2. Also the whole academia-tenure framework seems misplaced, in that the most important scientific paradigms are rooted in diverse environments, not just academia. For example, the solid-state physics paradigm led to transistors (developed at Bell Labs, not academia) and then of course is dominant in industry. Even goofy theories such as literary postmodernism (is this “Better” or “Worse”? How could we ever tell?) exist as much in the news media as in academe; indeed, if we didn’t keep hearing about deconstructionism in the news media, we’d never have known of its existence. And of course recent trendy paradigms in social psychology (embodied cognition, etc.) are associated with self-help gurus, Gladwell books, etc., as much as with academia. I think that a big part of the success of that sort of work in academia is because of its success in the world of business consulting. The wheelings and dealings of tenure committees are, I suspect, the least of it.

Beyond all this—or perhaps explaining my above comments—is my irritation at people who use university professors as soft targets. Silly tenured professors ha ha. Bad science is a real problem but I think it’s ludicrous to attribute that to the tenure system. Suppose there was no such thing as academic tenure, then I have a feeling that social and biomedical science research would be even more fad-driven.

I sent the above comments to the authors, and Akerlof replied:

I think that your point of view and ours are surprisingly on the same track; in fact the paper answers Thomas Kuhn’s question: what makes science so successful. The point is rather subtle and is in the back pages: especially regarding the differences between promotions of scientists and promotion of surgeons who did radical mastectomies.

A lesson from the Charles Armstrong plagiarism scandal: Separation of the judicial and the executive functions


Charles Armstrong is a history professor at Columbia University who, so I’ve heard, has plagiarized and faked references for an award-winning book about Korean history. The violations of the rules of scholarship were so bad that the American Historical Association “reviewed the citation issue after being notified by a member of the concerns some have about the book” and, shortly after that, Armstrong relinquished the award. More background here.

To me, the most interesting part of the story is that Armstrong was essentially forced to give in, especially surprising given how aggressive his original response was, attacking the person whose work he’d stolen.

It’s hard to imagine that Columbia University could’ve made Armstrong return the prize, given that the university gave him a “President’s Global Innovation Fund Grant” many months after the plagiarism story had surfaced.

The reason why, I think, is that the American Historical Association had this independent committee.

And that gets us to the point raised in the title of this post.

Academic and corporate environments are characterized by an executive function with weak to zero legislative or judicial functions. That is, decisions are made based on consequences, with very few rules. Yes, we have lots of little rules and red tape, but no real rules telling the executives what to do.

Evaluating every decision based on consequences seems like it could be a good idea, but it leads to situations where wrongdoers are left in place, as in any given situation it seems like too much trouble to deal with the problem.

An analogy might be with the famous probability-matching problem. Suppose someone shuffles a deck of 100 cards, 70 red and 30 black, and then starts pulling out cards, one at a time, asking you to guess. You’ll maximize your expected number of correct answers by simply guessing Red, Red, Red, Red, Red, etc. In each case, that’s the right guess, but put it together and your guesses are not representative. Similarly, if for each scandal the university makes the locally optimal decision to do nothing, the result is that nothing is ever done.
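To spell out the arithmetic (ignoring the small corrections from drawing without replacement): always guessing Red gives an expected 0.7 × 100 = 70 correct guesses, while “matching” your guesses to the 70/30 frequencies gives only about 0.7 × 70 + 0.3 × 30 = 58.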

This analogy is not perfect: I’m not recommending that the university sanction 30% of its profs at random—for one thing, that could be me! But it demonstrates the point that a series of individually reasonable decisions can be unreasonable in aggregate.

Anyway, one advantage of a judicial branch—or, more generally, a fact-finding institution that is separate from enforcement and policymaking—is that its members can feel free to look for truth, damn the consequences, because that’s their role.

So, instead of the university weighing the negatives of having a barely-repentant plagiarist on faculty or having the embarrassment of sanctioning a tenured professor, there can be an independent committee of the American Historical Association just judging the evidence.

It’s a lot easier to judge the evidence if you don’t have direct responsibility for what will be done with the evidence. Or, to put it another way, it’s easier to be a judge if you don’t also have to play the roles of jury and executioner.

P.S. I see that Armstrong was recently quoted in Newsweek regarding Korea policy. Maybe they should’ve interviewed the dude he copied from instead. Why not go straight to the original, no?

The difference between me and you is that I’m not on fire

“Eat what you are while you’re falling apart and it opened a can of worms. The gun’s in my hand and I know it looks bad, but believe me I’m innocent.” – Mclusky

While the next episode of Madam Secretary buffers on terrible hotel internet, I (the other other white meat) thought I’d pop in to say a long, convoluted hello. I’m in New York this week visiting Andrew and the Stan crew (because it’s cold in Toronto and I somehow managed to put all my teaching on Mondays. I’m Garfield without the spray tan.).

So I’m in a hotel on the Upper West Side (or, like, maybe the upper upper west side. I’m in the 100s. Am I in Harlem yet? All I know is that I’m a block from my favourite bar [which, as a side note, Aki does not particularly care for] where I am currently not sitting and writing this because last night I was there reading a book about the rise of the surprisingly multicultural anti-immigration movement in Australia and, after asking what my book was about, some bloke started asking me for my genealogy and “how Australian I am” and really I thought that it was both a bit much and a serious misunderstanding of what someone who is reading book with headphones on was looking for in a social interaction.) going through the folder of emails I haven’t managed to answer in the last couple of weeks looking for something fun to pass the time.

And I found one. Ravi Shroff from the Department of Applied Statistics, Social Science and Humanities at NYU (side note: applied statistics gets a short shrift in a lot of academic stats departments around the world, which is criminal. So I will always love a department that leads with it in the title. I’ll also say that my impression when I wandered in there for a couple of hours at some point last year was that, on top of everything else, this was an uncommonly friendly group of people. Really, it’s my second favourite statistics department in North America, obviously after Toronto who agreed to throw a man into a volcano every year as part of my startup package after I got really into both that Tori Amos album from 1996 and cultural appropriation. Obviously I’m still processing the trauma of being 11 in 1996 and singularly unable to sacrifice any young men to the volcano goddess.) sent me an email a couple of weeks ago about constructing interpretable decision rules.

(Meta-structural diversion: I started writing this with the new year, new me idea that every blog post wasn’t going to devolve into, say, 500 words on how Medúlla is Björk’s Joanne, but that resolution clearly lasted for less time than my tenure as an Olympic torch relay runner. But if you’ve not learnt to skip the first section of my posts by now, clearly reinforcement learning isn’t for you.)

To hell with good intentions

Ravi sent me his paper Simple rules for complex decisions by Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel and Daniel Goldstein and it’s one of those deals where the title really does cover the content.

This is my absolute favourite type of statistics paper: it eschews the bright shiny lights of ultra-modern methodology in favour of the much harder road of taking a collection of standard tools and shaping them into something completely new.

Why do I prefer the latter? Well it’s related to the age old tension between “state-of-the-art” methods and “stuff-people-understand” methods. The latter are obviously preferred as they’re much easier to push into practice. This is in spite of the former being potentially hugely more effective. Practically, you have to balance “black box performance” with “interpretability”. Where you personally land on that Pareto frontier is between you and your volcano goddess.

This paper proposes a simple decision rule for binary classification problems and shows fairly convincingly that it can be almost as effective as much more complicated classifiers.

There ain’t no fool in Ferguson

The paper proposes a Select-Regress-and-Round method for constructing decision rules that works as follows:

  1. Select a small number k of features \mathbf{x} that will be used to build the classifier
  2. Regress: Use a logistic-lasso to estimate the classifier h(\mathbf{x}) = (\mathbf{x}^T\mathbf{\beta} \geq 0 \text{ ? } 1 \text{ : } 0).
  3. Round: Choose M possible levels of effect and build weights

w_j = \text{Round} \left( \frac{M \beta_j}{\max_i|\beta_i|}\right).

The new classifier (which chooses between options 1 and 0) selects 1 if

\sum_{j=1}^k w_j x_j > 0.

In the paper they use k=10 features and M = 3 levels.  To interpret this classifier, we can consider each level as a discrete measure of importance.  For example, when we have M=3 we have seven levels of importance from “very high negative effect”, through “no effect”, to “very high positive effect”. In particular

  • w_j=0: The jth feature has no effect
  • w_j= \pm 1: The jth feature has a low effect (positive or negative)
  • w_j = \pm 2: The jth feature has a medium effect (positive or negative)
  • w_j = \pm 3: The jth feature has a high effect (positive or negative).

A couple of key things here make this idea work. Firstly, the initial selection phase allows people to “sense check” the initial group of features while also forcing the decision rule to depend on only a small number of features, which greatly improves people’s ability to interpret the rule. The regression phase then works out which of those features are actually used (the number of active features can be less than k, since the lasso can zero out some coefficients). Finally, the rounding phase gives a qualitative weight to each feature.
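Here is a rough sketch of the recipe (my own illustration, not the authors’ code; in the paper the selection step is itself lasso-based, whereas for brevity I use a simple univariate screen here):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

def select_regress_round(X, y, k=10, M=3, C=0.1):
    # Select: keep k features (here via a simple univariate screen)
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    Xk = selector.transform(X)
    # Regress: logistic lasso on the selected features
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(Xk, y)
    beta = fit.coef_.ravel()
    # Round: integer weights in {-M, ..., M}
    scale = np.max(np.abs(beta)) or 1.0
    w = np.round(M * beta / scale)
    return selector, w

def predict(selector, w, X):
    # the decision rule: predict 1 if the weighted sum of the selected features is positive
    return (selector.transform(X) @ w > 0).astype(int)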

This is a transparent way of building a decision rule, as the effect of each feature used to make the decision is clearly specified.  But does it work?

She will only bring you happiness

The most surprising thing in this paper is that this very simple strategy for building a decision rule works fairly well. Probably unsurprisingly, complicated, uninterpretable decision rules constructed through random forests typically do work better than this simple decision rule. But the select-regress-round strategy doesn’t do too badly. It might be possible to improve the performance by tweaking the first two steps to allow for some low-order interactions. For binary features, this would allow for classifiers where neither X nor Y is a strong indicator of success, but their co-occurrence (XY) is.

Even without this tweak, the select-regress-round classifier performs about as well as logistic regression and logistic lasso models that use all possible features (see the figure in the paper), although it performs worse than the random forest. It also doesn’t appear that the rounding process has too much of an effect on the quality of the classifier.

This man will not hang

The substantive example in this paper has to do with whether or not a judge decides to grant bail, where the event you’re trying to predict is a failure to appear at trial. The results in this paper suggest that the select-regress-round rule leads to a consistently lower rate of failure compared to the “expert judgment” of the judges.  It also works, on this example, almost as well as a random forest classifier.

There’s some cool methodology stuff in here about how to actually build, train, and evaluate classification rules when, for any particular experimental unit (person getting or not getting bail in this case), you can only observe one of the potential outcomes. This paper uses some ideas from the causal analysis literature to work around that problem.

I guess the real question I have about this type of decision rule for this sort of example is around how these sorts of decision rules would be applied in practice. In particular, would judges be willing to use this type of system? The obvious advantage of implementing it in practice is that it is data driven and, therefore, the decisions are potentially less likely to fall prey to implicit and unconscious biases. The obvious downside is that I am personally more than the sum of my demographic features (or other measurable quantities) and this type of system would treat me like the average person who shares the k features with me.

We were measuring the speed of Stan incorrectly—it’s faster than we thought in some cases due to antithetical sampling

Aki points out that in cases of antithetical sampling, our effective sample size calculations were unduly truncated above at the number of iterations. It turns out the effective sample size can be greater than the number of iterations if the draws are anticorrelated. And all we really care about for speed is effective sample size per unit time.

NUTS can be antithetical

The desideratum for a sampler Andrew laid out to Matt was to maximize expected squared transition distance. Why? Because that’s going to maximize effective sample size. (I still hadn’t wrapped my head around this when Andrew was laying it out.) Matt figured out how to achieve this goal by building an algorithm that simulated the Hamiltonian forward and backward in time at random, doubling the time at each iteration, and then sampling from the path with a preference for the points visited in the final doubling. This tends to push iterations away from their previous values. In some cases, it can lead to anticorrelated chains.

Removing this preference for the second half of the chains drastically reduces NUTS’s effectiveness. Figuring out how to include it and satisfy detailed balance was one of the really nice contributions in the original NUTS paper (and implementation).

Have you ever seen 4000 as the estimated n_eff in a default Stan run? That’s probably because the true value is greater than 4000 and we truncated it.

The fix is in

What’s even cooler is that the fix is already in the pipeline, and it just happens to be Aki’s first C++ contribution; it’s up on GitHub.

Aki’s also done simulations, so the new version is actually better calibrated as far as MCMC standard error goes (posterior standard deviation divided by the square root of the effective sample size).

A simple example

Consider three Markov processes for drawing a binary sequence y[1], y[2], y[3], …, where each y[n] is in { 0, 1 }. Our target is a uniform stationary distribution, for which each sequence element is marginally uniformly distributed,

Pr[y[n] = 0] = 0.5     Pr[y[n] = 1] = 0.5  

Process 1: Independent. This Markov process draws each y[n] independently. Whether the previous state is 0 or 1, the next state has a 50-50 chance of being either 0 or 1.

Here are the transition probabilities:

Pr[0 | 1] = 0.5   Pr[1 | 1] = 0.5
Pr[0 | 0] = 0.5   Pr[1 | 0] = 0.5

More formally, these should be written in the form

Pr[y[n + 1] = 0 | y[n] = 1] = 0.5

For this Markov chain, the stationary distribution is uniform. That is, some number of steps after initialization, there’s a probability of 0.5 of being in state 0 and a probability of 0.5 of being in state 1. More formally, there exists an m such that for all n > m,

Pr[y[n] = 1] = 0.5

The process will have an effective sample size exactly equal to the number of iterations because each state in a chain is independent.

Process 2: Correlated. This one makes correlated draws and is more likely to emit sequences of the same symbol.

Pr[0 | 1] = 0.01   Pr[1 | 1] = 0.99
Pr[0 | 0] = 0.99   Pr[1 | 0] = 0.01

Nevertheless, the stationary distribution remains uniform. Chains drawn according to this process will be slow to mix in the sense that they will have long sequences of zeroes and long sequences of ones.

The effective sample size will be much smaller than the number of iterations when drawing chains from this process.

Process 3: Anticorrelated. The final process makes anticorrelated draws. It’s more likely to switch back and forth after every output, so that there will be very few repeating sequences of digits.

Pr[0 | 1] = 0.99   Pr[1 | 1] = 0.01
Pr[0 | 0] = 0.01   Pr[1 | 0] = 0.99

The stationary distribution is still uniform. Chains drawn according to this process will mix very quickly.

With an anticorrelated process, the effective sample size will be greater than the number of iterations.

Visualization

If I had more time, I’d simulate, draw some traceplots, and also show correlation plots at various lags and the rate at which the estimated mean converges. This example’s totally going in the Coursera course I’m doing on MCMC, so I’ll have to work out the visualizations soon.
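In the meantime, here’s a quick numerical sketch (not the Stan implementation) of the three processes: simulate each chain and estimate the effective sample size from the lag-1 autocorrelation, which for a two-state Markov chain captures essentially all of the dependence.

import numpy as np

def simulate_chain(p_flip, n, rng):
    """Binary chain that switches state with probability p_flip at each step."""
    y = np.empty(n, dtype=int)
    y[0] = rng.integers(2)
    flips = rng.random(n) < p_flip
    for i in range(1, n):
        y[i] = 1 - y[i - 1] if flips[i] else y[i - 1]
    return y

def ess_lag1(y):
    rho = np.corrcoef(y[:-1], y[1:])[0, 1]        # lag-1 autocorrelation
    return len(y) * (1 - rho) / (1 + rho)         # AR(1)-style effective sample size

rng = np.random.default_rng(0)
n = 100_000
for label, p_flip in [("independent", 0.5), ("correlated", 0.01), ("anticorrelated", 0.99)]:
    y = simulate_chain(p_flip, n, rng)
    print(f"{label:>15}: ESS roughly {ess_lag1(y):,.0f} from {n:,} draws")

The independent process gives an effective sample size close to the number of draws, the correlated process gives far fewer, and the anticorrelated process gives far more, which is exactly the case where the old calculation was truncating.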