My 2 classes this fall

Stat 6103, Bayesian Data Analysis

Modern Bayesian methods offer an amazing toolbox for solving science and engineering problems. We will go through the book Bayesian Data Analysis and do applied statistical modeling using Stan, using R (or Python or Julia if you prefer) to preprocess the data and postprocess the analysis. We will also discuss the relevant theory and get to open questions in model building, computing, evaluation, and expansion. The course is intended for students who want to do applied statistics and also those who are interested in working on statistics research problems.

Stat 8307, Statistical Communication and Graphics

Communication is central to your job as a quantitative researcher. Our goal in this course is for you to improve at all aspects of statistical communication, including writing, public speaking, teaching, informal conversation and collaboration, programming, and graphics. With weekly assignments and group projects, this course offers you a chance to get practice and feedback on a range of communication skills. All this is in the context of statistics; in particular we will discuss the challenges of visualizing uncertainty and variation, and the ways in which a deeper integration of these concepts into statistical practice could help resolve the current statistical crisis in science. Statistics research is not separate from communication; the two are intertwined, and this course is about you putting in the work to become a better writer, teacher, speaker, and statistics practitioner.

The communication and graphics course should be no problem; I’ll teach it pretty much how I taught it last year, with 2 meetings a week, diaries, jitts, homeworks, class discussions, projects, etc.

The Bayes class I’ll be doing in a new way. It’ll meet once a week, and my plan is for the first half of each class to be a discussion of material from the book and in the second half for students to work together using Stan, with me and the teaching assistant walking around helping. Also, the homeworks will be more Stan-centered. The idea is for the students to really learn applied Bayesian statistics, as well as to have a chance to grapple with important theoretical concepts and to be introduced to the research frontier. We’ll see how it goes. The key will be coming up with in-class and homework assignments that give students the chance to fit Bayesian models for interesting problems.

On deck this week

Mon: My 2 classes this fall

Tues: “Soylent 1.5” < black beans and yoghurt

Wed: 0.05 is a joke

Thurs: Data-analysis assignments for BDA class

Fri: Aahhhhh, young people!

Sat: Plaig! (non-Wegman edition)

Sun: We provide a service

Rockin the tabloids

Rick Gerkin points me to this opinion piece from a couple years ago by biologist Randy Schekman, titled “How journals like Nature, Cell and Science are damaging science” and subtitled “The incentives offered by top journals distort science, just as big bonuses distort banking.” Here’s Schekman:

The prevailing structures of personal reputation and career advancement [in biology] mean the biggest rewards often follow the flashiest work, not the best. . . .

We all know what distorting incentives have done to finance and banking. The incentives my colleagues face are not huge bonuses, but the professional rewards that accompany publication in prestigious journals – chiefly Nature, Cell and Science.

These luxury journals are supposed to be the epitome of quality, publishing only the best research. Because funding and appointment panels often use place of publication as a proxy for quality of science, appearing in these titles often leads to grants and professorships. But the big journals’ reputations are only partly warranted. . . .

These journals aggressively curate their brands, in ways more conducive to selling subscriptions than to stimulating the most important research. Like fashion designers who create limited-edition handbags or suits, they know scarcity stokes demand, so they artificially restrict the number of papers they accept. . . .

A paper can become highly cited because it is good science – or because it is eye-catching, provocative or wrong. Luxury-journal editors know this, so they accept papers that will make waves because they explore sexy subjects or make challenging claims. . . . It builds bubbles in fashionable fields where researchers can make the bold claims these journals want . . .

In extreme cases, the lure of the luxury journal can encourage the cutting of corners, and contribute to the escalating number of papers that are retracted as flawed or fraudulent. . . .

Sharif don’t like it.

Why couldn’t Breaking Bad find Mexican Mexicans?

Watching “Breaking Bad” . . . I’m told on good authority that many of the actors playing Mexicans are not actually Mexican; some of them can barely speak Spanish at all. Whassup with that? How hard is it to find a Mexican actor in LA or Albuquerque??

ShinyStan v2.0.0

For those of you not familiar with ShinyStan, it is a graphical user interface for exploring Stan models (and more generally MCMC output from any software). For context, here’s the post on this blog first introducing ShinyStan (formerly shinyStan) from earlier this year.


ShinyStan v2.0.0 released

ShinyStan v2.0.0 is now available on CRAN. This is a major update with a new look and a lot of new features. It also has a new(ish) name: ShinyStan is the app/GUI and shinystan the R package (both had formerly been shinyStan for some reason apparently not important enough for me to remember). Like earlier versions, this version has enhanced functionality for Stan models but is compatible with MCMC output from other software packages too.

You can install the new version from CRAN like any other package:

install.packages("shinystan")

If you prefer a version with a few minor typos fixed, you can install from GitHub using the devtools package:

devtools::install_github("stan-dev/shinystan", build_vignettes = TRUE)

(Note: after installing the new version and checking that it works, we recommend removing the old one by running remove.packages("shinyStan").)

If you install the package and want to try it out without having to first fit a model, you can launch the app using the preloaded demo model:

launch_shinystan_demo()

This update contains a lot of changes: new features, greater UI stability, and an entirely new look. Some release notes can be found on GitHub and there are also some instructions for getting started on the ShinyStan wiki page. Here are two highlights:

  • The new interactive diagnostic plots for Hamiltonian Monte Carlo. In particular, these are designed for models fit with Stan using NUTS (the No-U-Turn Sampler).

    [Diagnostics screenshots]

  • The deploy_shinystan function, which lets you easily deploy ShinyStan apps for your models to RStudio’s ShinyApps hosting service. Each of your apps (i.e. each of your models) will have a unique URL. To use this feature please also install the shinyapps package: devtools::install_github("rstudio/shinyapps").

The plan is to release a minor update with bug fixes and other minor tweaks in a month or so. So if you find anything we should fix or change (or if you have any other suggestions) we’d appreciate the feedback.

Harry S. Truman, Jesus H. Christ, Roy G. Biv

Are there any others?

Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential!

Shravan sends along this article by Douglas Peters and Stephen Ceci, who report:

We selected 12 already published research articles by investigators from prestigious and highly productive American psychology departments, one article from each of 12 highly regarded and widely read American psychology journals with high rejection rates (80%) and nonblind refereeing practices.

With fictitious names and institutions substituted for the original ones (e.g., Tri-Valley Center for Human Potential), the altered manuscripts were formally resubmitted to the journals that had originally refereed and published them 18 to 32 months earlier. Of the sample of 38 editors and reviewers, only three (8%) detected the resubmissions. This result allowed nine of the 12 articles to continue through the review process to receive an actual evaluation: eight of the nine were rejected. Sixteen of the 18 referees (89%) recommended against publication and the editors concurred. The grounds for rejection were in many cases described as “serious methodological flaws.”

Amusing. On the plus side, it could reflect a positive trend: crappy papers that were getting accepted two years ago would get rejected now.

Reprint of “Observational Studies” by William Cochran followed by comments by current researchers in observational studies

Dylan Small organized this discussion in the new journal, Observational Studies. Cochran’s 1972 article is followed by comments from:
Norman Breslow
Thomas Cook
David Cox & Nanny Wermuth
Stephen Fienberg
Joseph Gastwirth & Barry Graubard
Andrew Gelman
Ben Hansen & Adam Sales
Miguel Hernan
Jennifer Hill
Judea Pearl
Paul Rosenbaum
Donald Rubin
Herbert Smith
Mark van der Laan
Tyler VanderWeele
Stephen West.

My discussion is called “The State of the Art in Causal Inference: Some Changes Since 1972.”

Cochran’s article and all the discussions are downloadable in a convenient pdf here, at the journal’s website. Lots to chew on.

Wasting time reading old comment threads

I was linking to something and came across this hilarious thread, which culminated in this revelation by commenter Jrc:

True story: after reading this post, I started going to the Jamaican store around the corner. I was eating a lot of those things by the end. It’s probably good that we moved.

It’s hard to replicate (that is, duplicate) analyses in sociology

Cristobal Young points us to this post on replication packages; he writes, “we found that only 28% of sociologists would/could provide a replication package.”

I read the comments. The topic arouses a lot of passion. Some of the commenters are pretty rude! And, yes, I’m glad to see this post, given my own frustrating experience trying to re-analyze a sociology study. To me, one of the big problems is the idea that once a paper is published, it is considered to be Truth, by the authors, by promotion committees, by the ASR and the NYT. Take that away, and everything changes: all of a sudden there’s not so much incentive to hide your data.

Neither time nor stomach

Mark Palko writes:

Thought you might be interested in an EngageNY lesson plan for statistics. So far no (-2)x(-2) = -4 (based on a quick read), but still kind of weak. It bothers me that they keep talking about randomization but only for order of test; they assigned treatment A to the first ten of each batch.

Maybe I’m just in a picky mood.

I replied that I don’t like this bit at all: “Students use a randomization distribution to determine if there is a significant difference between two treatments.”

I don’t like randomization tests and I really really really don’t like the idea that the purpose of a study is “to determine if there is a significant difference between two treatments.”

Also it’s a bit weird that it’s in Algebra II. This doesn’t seem like algebra at all.

Palko added:

If you have the time (and the stomach), I’d recommend going through the entire “Topic D” section. You’ll find lots more to blog about.

I fear I have neither time nor stomach for this.

Fitting a multilevel model

Cui Yang writes:

I have a question about the use of BRT (Boosting regression tree). I am planning to write an article about the effects of soil fauna and understory fine roots on forest soil organic carbon. The experiment was conducted in a subtropical forest area in China. There were 16 blocks each with 5 coarse-mesh nylon bag wrapped and 5 fine-mesh nylon bag wrapped soil cores throughout four forests (forests A/B/C/D). Each forest had four different blocks. The coarse mesh (4 mm) bags allowed the access of both the entire soil fauna and understory plant fine roots. The meshes of fine bags were on the whole 0.038 mm, except that a row of 4 mm holes were distributed evenly at the lateral side and a circle of 4 mm holes at the circumferential bottom. Ideally, we supposed that this kind of fine bags could allow the access of most of soil fauna through 4 mm holes, while getting rid of the roots by the overall 0.038 mm meshes.

Originally, the dataset generated from this experiment should be 160 pieces of record constituted of soil total carbon, dissolved organic carbon, microbial biomass carbon plus soil fauna abundances and fine roots biomass. The soil fauna data was of coarse taxonomic resolution and mainly included six abundant fauna taxa, i.e., Acari, Collembola, Hymenoptera, Nematode, Enchytraeid. For fine roots data, parts of records in forests B (18 records) and C (17 records) were missing because of faulty operation. As a result, fine roots biomass had records at only 78% soil cores.

Because BRT can handle missing data in predictor variables by using surrogates, I decided to use BRT to determine the relative influence of predictor variables on the individual response for soil carbon, especially for the abundant fauna taxa and fine roots. So, can BRT be applicable to the abovementioned dataset, which only has 160 records and one of its main explanatory variable covers less than 80% records?

I have sent a similar mail to Professor *** who advised me to read your book “Data Analysis Using Regression and Multilevel/Hierarchical Models” for the resolution. Since my data are structured (samples within blocks within forests), a generalized linear mixed model will be suited. But I am worrying about that, while removing the records with missing data about fine roots, the experimental design would be an unbalanced one and at the expense of less useful data of other variables such as soil carbon and soil fauna taxa. Upon the unbalanced design, I will use Redundancy Analysis as a tool for multivariate ANOVA. Through RDA, I might select the several most influential fauna taxa probably with fine root. After this action, I would use Variation Partitioning to quantifying the variation explained by these factor. However, I am puzzled that, should the model used by RDA involve the effects of mesh size of nylon bag (fine mesh versus coarse mesh) and blocks factors or as well as forests? I am mainly using software R for statistical analyses. If the alternative of BRT is RDA, how can the model of RDA involve the data structure which seems to be a split-plot design? I feel that it is difficult to use functions such as adonis {vegan} in R to define the data structure of split-plot designs, especially under the condition of unbalanced design. What’s worse, my focus is the effects of soil fauna and fine root instead of mesh size/blocks factors/forests.

My reply:

You write, “BRT can handle missing data in predictor variables by using surrogates.” But any method for prediction can handle missing data by using surrogates, no? Or maybe I’m missing something here. In any case, here are some general comments:

– If your missingness is only in the outcome variable, then just fit a multilevel model, no problem.

– If some of your predictors have some missingness, I recommend imputing the missing values first using some multivariate imputation method. We have an R package “mi” to do this, but other options are available too.

– Unbalanced data don’t present any difficulty; indeed, multilevel models are well suited to unbalanced data. Unbalanced data is one of the main motivations for fitting multilevel models.

– Similarly, you’re fine with any mix of nested and non-nested factors. Split-plot is fine too, the multilevel model will handle it automatically, as discussed in my 2005 paper.

– Once it’s time for you to fit your multilevel model, I recommend you use Stan. In my book with Jennifer we use lmer, but now I use Stan. We just haven’t updated the book yet.
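The imputation advice above can be illustrated with a bare-bones sketch of the multiple-imputation idea: fill in each missing predictor value several times by drawing from a model fit to the complete cases, analyze each completed dataset, and pool the results. The data and model here are simulated for illustration; this is not what the mi package actually does.

```python
import random
random.seed(7)

# Simulated data: predictor x with some values missing, outcome y.
n = 400
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]
missing = set(range(0, n, 5))              # every 5th x is missing

def ols(xs, ys):
    # Slope/intercept of ys regressed on xs, plus residual sd.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys)) / \
        sum((a - mx) ** 2 for a in xs)
    a0 = my - b * mx
    resid = [c - (a0 + b * a) for a, c in zip(xs, ys)]
    s = (sum(r * r for r in resid) / len(resid)) ** 0.5
    return a0, b, s

# Imputation model: x given y, fit on the complete cases.
obs = [i for i in range(n) if i not in missing]
a0, b, s = ols([y[i] for i in obs], [x[i] for i in obs])

estimates = []
for _ in range(5):                         # M = 5 completed datasets
    xf = list(x)
    for i in missing:                      # draw, don't just plug in a mean
        xf[i] = a0 + b * y[i] + random.gauss(0, s)
    estimates.append(ols(xf, y)[1])        # slope of y on x, this dataset

pooled = sum(estimates) / len(estimates)   # pooled point estimate
print(round(pooled, 2))
```

Because each missing value is drawn (with noise) rather than plugged in as a single best guess, the spread across the M analyses also carries the extra uncertainty due to the missingness.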

Dan Kahan doesn’t trust the Turk

Dan Kahan writes:

I [Kahan] think serious journals should adopt policies announcing that they won’t accept studies that use M Turk samples for types of studies they are not suited for. . . . Here is my proposal:

Pending a journal’s adoption of a uniform policy on M Turk samples, the journal should should oblige authors who use M Turk samples to give a full account of why the authors believe it is appropriate to use M Turk workers to model the reasoning process of ordinary members of the U.S. public. The explanation should consist of a full accounting of the authors’ own assessment of why they are not themselves troubled by the objections that have been raised to the use of such samples; they shouldn’t be allowed to dodge the issue by boilerplate citations to studies that purport to “validate” such samples for all purposes, forever & ever. Such an account helps readers to adjust the weight that they afford study findings that use M Turk samples in two distinct ways: by flagging the relevant issues for their own critical attention; and by furnishing them with information about the depth and genuineness of the authors’ own commitment to reporting research findings worthy of being credited by people eager to figure out the truth about complex matters.

There are a variety of key points that authors should be obliged to address.

First, M Turk workers recruited to participate in “US resident only” studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of US society on the reasoning of members of the U.S. general public cannot validly be made on the basis of samples that contain a “substantial” proportion of individuals from other societies (Shapiro, Chandler and Muller 2013) Some scholars have recommended that researchers remove from their “US only” M Turk samples those subjects who have non-US IP addresses. However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US-IP addresses for use on “US worker” only projects (Chandler, Mueller & Paolacci 2014). Why do the authors not view this risk as one that makes using M Turk workers inappropriate in a study like this one?

Second, M Turk workers have demonstrated by their behavior that they are not representative of the sorts of individuals that studies of political information-processing are supposed to be modeling. Conservatives are grossly under-represented among M Turk workers who represent themselves as being from the U.S. (Richey 2012). One can easily “oversample” conservatives to generate adequate statistical power for analysis. But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that such a small minority of real U.S. conservatives are willing to do. It’s easy to imagine that the M Turk “US” conservatives lack sensibilities that ordinary US conservatives normally have—such as the sort of disgust sensibilities that are integral to their political outlooks (Haidt & Hersch 2001), and that would likely deter them from participating in a “work force” a major business activity of which is “tagging” the content of on-line porn. These unrepresentative “US” conservatives might well not react as strongly or dismissively toward partisan arguments on a variety of issues. Is this not a concern for the authors? It is for me, and I’m sure would be for many readers trying to assess what to make of a study like this.

Third, there are in fact studies that have investigated this question and concluded that M Turk workers do not behave the way that US general population or even US student samples do when participating in political information-processing experiments (Krupnikov & Levine 2014). Readers will care about this—and about whether the authors care.

Fourth, Amazon M Turk worker recruitment methods are not fixed and are neither warranted nor designed to be calibrated to generate samples suitable for scholarly research. No serious person who cares about getting at the truth would accept the idea that a particular study done at a particular time could “validate” M Turk, for the obvious reason that Amazon doesn’t publicly disclose its recruitment procedures, can change them and has on multiple occasions, and is completely oblivious to what researchers care about. A scholar who decides it’s “okay” to use M Turk anyway should tell readers why this does not trouble him or her.

Fifth, M Turk workers share information about studies and how to respond to them (Chandler, Mueller & Paolacci 2014). This makes them completely unsuitable for studies that use performance-based reasoning proficiency measures, which M Turk workers have been massively exposed to. But it also suggests that the M Turk workforce is simply not an appropriate place to recruit subjects from; they are evincing a propensity to behave in a manner that makes all of their responses highly suspect. Imagine you discovered that the firm you had retained to recruit your sample had a lounge in which subjects about to take the study could discuss it w/ those who just had completed it; would you use the sample, and would you keep coming back to that firm to supply you with study subjects in the future? If this does not bother the authors, they should say so; that’s information that many critical readers will find helpful in evaluating their work.

I [Kahan] feel pretty confident M Turk samples are not long for this world.

OK, so far so good. But I’d bet the other direction on whether M Turk samples (or something similar) are long for this world. Remember Gresham’s Law?

Kahan does also give a positive argument, that there is a better alternative:

Google Consumer Surveys now enables researchers to field a limited number of questions for between $1.10 & $3.50 per complete– a fraction of the cost charged by on-line firms that use valid & validated recruitment and stratification methods.

Google Consumer Surveys has proven its validity in the only way that a survey mode – random-digit dial, face-to-face, on-line – can: by predicting how individuals will actually evince their opinions or attitudes in real-world settings of consequence, such as elections. Moreover, if Google Surveys goes into the business of supplying high-quality scholarly samples, they will be obliged to be transparent about their sampling and stratification methods and to maintain them (or update them for the purposes of making them even more suited for research) over time. . . .

The problem right now w/ Google Consumer Surveys is that the number of questions is limited and so, as far as I can tell, is the complexity of the instrument that one is able to use to collect the data, making experiments infeasible.

But I predict that will change.

OK, maybe so. But it does seem to me that M Turk’s combination of low cost and low validity will make it an attractive option for many researchers.

Some background:

Don’t trust the Turk (also see discussion in comments, back from the days when the sister blog had a useful comments section)

Researchers are rushing to Amazon’s Mechanical Turk. Should they?

That latter post, by Kathleen Searles and John Barry Ryan, concludes that “platitudes such as ‘Don’t trust the Turk’ are nice, but, as is often the case in life, they are too simple to be followed.”

I actually think “Don’t trust the Turk” is a slogan, not a platitude, but I take their point, and indeed even though Searles and Ryan are broadly pro-Turk while Kahan is anti-Turk, these researchers all offer the common perspective that when evaluating a data source you need to consider the purpose for which it will be used.

P.S. Some good discussion in comments. “Don’t trust the Turk” doesn’t mean “Never use the Turk.” It means: Be aware of the Turk’s limitations. Don’t exhibit the sort of blind faith associated with the buggy-whip lobby and their purported “grounding in theory.”

On deck this week

Mon: Dan Kahan doesn’t trust the Turk

Tues: Neither time nor stomach

Wed: Reprint of “Observational Studies” by William Cochran followed by comments by current researchers in observational studies

Thurs: Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential!

Fri: Harry S. Truman, Jesus H. Christ, Roy G. Biv

Sat: Why couldn’t Breaking Bad find Mexican Mexicans?

Sun: Rockin the tabloids

Stan at JSM2015

In addition to Jiqiang’s talk on Stan at 11:25 AM on Wednesday, I’ll also be giving a talk about Hamiltonian Monte Carlo today at 3:20 PM. Stanimals in attendance can come find me to score a sweet Stan sticker.


And everyone should check out Andrew’s breakout performance in “A Stan is Born”.

Update: Turns out I missed even more Stan! There was a great session just this morning that, unfortunately, I was not able to post about earlier due to some logistical issues (i.e., my inadvertently leaving my laptop behind after my talk yesterday…). Seth will also be talking about his sweet Gaussian processes on Tuesday at 10:35 AM.

Monte Carlo and the Holy Grail


On 31 Dec 2010, someone wrote in:

A British Bayesian curiosity: Adrian Smith has just been knighted, and so becomes Sir Adrian. He can’t be the first Bayesian knight, as Harold Jeffreys was Sir Harold.

I replied by pointing to this discussion from 2008, and adding: Perhaps Spiegelhalter can be knighted next. Or maybe Ripley!

My correspondent replied the next day:

I doubt that Ripley will ever get an Honour. But Spiegelhalter did strike me as the most likely next Bayesian knight in 5-10 years. I would not want to put a number on the probability. Please don’t quote me in public on that, unless anonymously.

Now here it is, 4 1/2 years later, and the person informs me that David Spiegelhalter was indeed knighted in 2014!

All hail Lord Spiegelhalter! He will smite you if you communicate statistics poorly.

Classifying causes of death using “verbal autopsies”

Tyler McCormick sent along this paper, “Probabilistic Cause-of-death Assignment using Verbal Autopsies,” coauthored with Zehang Li, Clara Calvert, Amelia Crampin, Kathleen Kahn, and Samuel Clark:

In areas without complete-coverage civil registration and vital statistics systems there is uncertainty about even the most basic demographic indicators. In such areas the majority of deaths occur outside hospitals and are not recorded. Worldwide, fewer than one-third of deaths are assigned a cause, with the least information available from the most impoverished nations. In populations like this, verbal autopsy (VA) is a commonly used tool to assess cause of death and estimate cause-specific mortality rates and the distribution of deaths by cause. VA uses an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. This paper develops a new statistical tool known as InSilicoVA to classify cause of death using information acquired through VA. InSilicoVA shares uncertainty between cause of death assignments for specific individuals and the distribution of deaths by cause across the population. Using side-by-side comparisons with both observed and simulated data, we demonstrate that InSilicoVA has distinct advantages compared to currently available methods.

As I’ve been saying a lot recently, measurement is a central and underrated aspect of statistics, so I’m always happy to see serious research on measurement-error models. I hope this project is directly useful and also stimulates further work in this area.

The secret to making a successful conference presentation

JSM (the Joint Statistical Meetings) are coming up soon, and Jiqiang’s giving a talk on Stan. Here’s the advice I gave him:

in 20 minutes, something like this:

– What is Stan?
– Where does Stan work well?
– Current and future Stan research.

For JSM audience it could be good to spend some time on our exciting future research ideas. The goal is not to teach people Stan, it’s to get them excited about it.

You can also look in our JEBS paper for material, as we do some comparison of Stan with other Bayesian model-fitting options.

Regarding surveys, you can say that you personally are not currently working on survey data but the past and current development of Stan has been motivated by various applications of mine, including survey analysis, and we are currently being supported by the polling company YouGov.

Stan has the potential to revolutionize survey inference, as follows: More and more, surveys are not representative of the population. Problems with non-response, self-selection, etc. So we want to weight or adjust for as many variables as possible so as to match sample to population. But it’s well known that if you try to weight on a lot of variables, and their interactions, the weights will be super-noisy. Better and more stable to do MRP, i.e., hierarchical Bayes. Stan allows us to build big models fast, with lots of predictors and (as will be necessary) informative priors.
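The poststratification step of MRP can be made concrete with a toy calculation: combine per-cell estimates into a population estimate by weighting each cell by its share of the population. The cell labels, estimates, and counts below are made up; in MRP the cell estimates would come from a multilevel model fit in Stan.

```python
# Toy poststratification: weight each cell's estimate by its
# population count. All numbers are hypothetical.
cells = {
    # (age group, education): (estimated support, population count)
    ("18-29", "no college"): (0.40, 200),
    ("18-29", "college"):    (0.55, 100),
    ("30+",   "no college"): (0.35, 500),
    ("30+",   "college"):    (0.50, 200),
}

total = sum(n for _, n in cells.values())
estimate = sum(theta * n for theta, n in cells.values()) / total
print(round(estimate, 3))
```

The point of the “R” in MRP is that the multilevel model keeps the per-cell estimates stable even when you cut the sample into many small cells like these.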

The key piece of advice (the secret to giving a good talk) is above: the goal is not to teach people Stan, it’s to get them excited about it.

P.S. And if the computer for the presentation is linked to the sound system, he can start off with the Stan trailer.

When does Bayes do the job?

E. J. writes:

I’m writing a paper where I discuss one of the advantages of Bayesian inference, namely that it scales up to complex problems where maximum likelihood would simply be unfeasible or unattractive. I have an example where 2000 parameters are estimated in a nonlinear hierarchical model; MLE would not fare well in this case.

I recall that you have also stressed this issue, and I’d like to acknowledge that. Do you have pointers to a few of your papers where you explicitly mention this? Ideally I would just take a quotation.

I responded:

Bayes will do this but only with informative priors. With noninformative priors, the Bayes answer can sometimes be worse than maximum likelihood; see section 3 of this 1996 paper which I absolutely love.

Then there’s this paper about why, with hierarchical Bayes, we don’t need to worry about multiple comparisons.

Here’s a quote from that paper:

Researchers from nearly every social and physical science discipline have found themselves in the position of simultaneously evaluating many questions, testing many hypotheses, or comparing many point estimates. . . . we believe that the problem is not multiple testing but rather insufficient modeling of the relationship between the corresponding parameters of the model. Once we work within a Bayesian multilevel modeling framework and model these phenomena appropriately, we are actually able to get more reliable point estimates. A multilevel model shifts point estimates and their corresponding intervals toward each other (by a process often referred to as “shrinkage” or “partial pooling”), whereas classical procedures typically keep the point estimates stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p values corresponding to intervals of fixed width). In this way, multilevel estimates make comparisons appropriately more conservative, in the sense that intervals for comparisons are more likely to include zero. As a result we can say with confidence that those comparisons made with multilevel estimates are more likely to be valid. At the same time this “adjustment” does not sap our power to detect true differences as many traditional methods do.

That’s a bit long but maybe you can take what you need!

You also might enjoy this paper with Aleks on whether Bayes is radical, liberal, or conservative.
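The partial pooling described in the quoted passage can be sketched numerically. The group estimates, standard errors, and group-level sd below are made up, and mu and tau are held fixed for illustration; a real multilevel model would estimate them too.

```python
# Partial pooling: each shrunken estimate is a precision-weighted
# average of the group's own estimate and the grand mean.
y = [2.8, 0.5, -1.1, 1.9]      # raw group estimates (made up)
sigma = [1.0, 0.8, 1.2, 0.9]   # their standard errors (made up)
tau = 1.0                      # assumed group-level sd

mu = sum(y) / len(y)           # crude grand mean, fixed for illustration
shrunk = [
    (yj / sj**2 + mu / tau**2) / (1 / sj**2 + 1 / tau**2)
    for yj, sj in zip(y, sigma)
]
print([round(t, 2) for t in shrunk])
```

Note that each shrunken value lands between the raw estimate and the grand mean, with noisier groups (larger sigma) pulled in more strongly; that is exactly the behavior that makes the comparisons “appropriately more conservative” without widening the intervals.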

How Hamiltonian Monte Carlo works

Marco Inancio posted this one on the Stan users list:

(Statement 1) If the kinetic energy equation comes from a distribution $L$ which is not a symmetric distribution, then thanks to the “Conservation of the Hamiltonian” property we’ll still be able to accept the proposal with probability 1 if we are computing the Hamiltonian’s equations exactly.

(Statement 2) But if we are approximating the Hamilton’s equations by discretizing time (using leapfrog) then the acceptance probability [from $(q_0, p_0)$ to $(q_1, p_1)$] won’t be simply $\exp(U_{0}+K_{0}-U_{1}-K_{1})$

We would have to find $p*$ for which $(q_{0}, p_{0})$ will be the proposal values if we start from $(q_{1}, p*)$

And, in this case, the acceptance probability will be: $\exp(U_{0}+K_{0}-U_{1}-K_{1})L(p*)/L(p_0)$

Are both statements correct? If so, it seems that it’s pretty difficult to find the value of $p*$…

I replied that this is a job for Betancourt. And Betancourt responded:

Let’s review the basics of HMC. We have our target distribution, pi(q) = exp(-V(q)), and the conditional distribution of the momenta, pi(p|q) = exp(-T(q, p)). The HMC transition is then (i) sample the momenta, p ~ pi(p|q) (ii) evolve using Hamiltonian flow, (q, p) -> \phi_{t} (q, p) (iii) marginalize the momenta, (q, p) -> q. Each step in the operation preserves pi(q) and so the entire transition preserves pi(q) and yields a desired sample. T(q, p) can be almost anything and there is no Metropolis correction here.

In practice we can’t do (ii) exactly and we have to approximate the flow with a symplectic integrator. This introduces error and pi(q) is no longer preserved exactly. Typically we fix this by treating the approximate flow as a proposal and add a Metropolis correction, but this requires that the flow be reversible which it is not. To make the flow reversible we can do a few things — we can sample the integration time from any distribution symmetric around zero or, if the kinetic energy is symmetric with respect to p, we can add a momentum flip at the end of the flow, (q, p) -> (q, -p). Outside of NUTS people typically consider a fixed integration time which leaves only the second option, hence the importance of the symmetry of the kinetic energy.

What you’ve proposed is to sample from a distribution pi(p|q) = exp(-L(q, p)) that is not related to the kinetic energy. Immediately this will cause a problem because (i) will preserve only exp(-L) while (ii) preserves only exp(-T) and the combined transition no longer has a stationary distribution. So the exact HMC algorithm doesn’t work. We can still try to use this as a Metropolis-Hastings proposal (although a poorly performing one) if we can make it reversible. How we make it reversible depends on the choice of L(q, p), but in general it will not be an easy problem.

Choosing a kinetic energy is a somewhat subtle problem. Nominally there are no constraints on T(q, p) which is usually a really bad sign — the more the math constrains our options the less tuning we have to do and the more robust the algorithm will be. One way to introduce constraints is to introduce more structure, such as a metric. It turns out that a metric gives a canonical family of kinetic energies with nice properties, and every member of that family is symmetric because of the symmetry of the metric. So asymmetric kinetic energies are not only awkward to use they’re actually really hard to motivate in the first place.

For all the gory details see
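Betancourt’s three steps (sample the momenta, approximate the Hamiltonian flow with leapfrog, correct the integrator error with a Metropolis step) can be sketched for a one-dimensional standard-normal target with the usual Gaussian kinetic energy. This is a toy illustration of the algorithm’s structure, not Stan’s implementation, and the step size and trajectory length are arbitrary choices.

```python
import math
import random

random.seed(1)

# Toy HMC for pi(q) = exp(-V(q)) with V(q) = q^2/2 (standard normal)
# and Gaussian kinetic energy T(p) = p^2/2.
def grad_V(q):
    return q

def H(q, p):
    return 0.5 * q * q + 0.5 * p * p

def leapfrog(q, p, eps, steps):
    # Symplectic (leapfrog) approximation to the Hamiltonian flow.
    p -= 0.5 * eps * grad_V(q)
    for i in range(steps):
        q += eps * p
        if i < steps - 1:
            p -= eps * grad_V(q)
    p -= 0.5 * eps * grad_V(q)
    return q, p

q, draws = 0.0, []
for _ in range(5000):
    p = random.gauss(0.0, 1.0)               # (i) sample momenta
    q_new, p_new = leapfrog(q, p, 0.3, 10)   # (ii) approximate flow
    # (iii) Metropolis correction for the integrator error; the
    # momentum flip is implicit because T(p) = T(-p).
    if random.random() < math.exp(min(0.0, H(q, p) - H(q_new, p_new))):
        q = q_new
    draws.append(q)                          # marginalize: keep q, drop p

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))
```

If the flow were computed exactly, the Metropolis step would accept with probability 1; the correction is only paying for the leapfrog discretization error, which is why the symmetry of the kinetic energy (and hence the free momentum flip) matters.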