Amelia, it was just a false alarm

Nah, jet fuel can’t melt steel beams / I’ve watched enough conspiracy documentaries – Camp Cope

Some ideas persist long after the mounting evidence against them becomes overwhelming. Some of these are kooky but probably harmless (try as I might, I do not care about ESP etc), whereas others are deeply damaging (I’m looking at you, “vaccines cause autism”).

When these ideas have a scientific (be it social or physical) basis, there’s a pretty solid pattern: a study that over-interprets a small sample of data, paired with an explanation for the behaviour that people want to believe is true.

This is on my mind because today I ran into a nifty little study looking at whether or not student evaluation of teaching (SET) has any correlation with student learning outcomes.

As a person who’s taught at a number of universities for quite a while, I have some opinions about this.

I know that when I teach, my SET scores had better be excellent or else I will have some problems in my life. And so I put some effort into making my students like me (trust me, it’s a challenge) and perform a burlesque of hyper-competence lest I get that dreaded “doesn’t appear to know the material” comment. I give them detailed information about the structure of the exam. I don’t give them tasks that they will hate, even when I think they would benefit certain learning styles. I don’t expect them to have done the reading*.

Before anyone starts up on a “kids today are too coddled” rant, it is not the students who make me do this. I teach the way I do because ensuring my SET scores are excellent is a large part** of both my job and my job security. I adapt my teaching practice to the instrument used to measure it***.

I actually enjoy this challenge. I don’t think any of the things that I do to stabilize my SET scores are bad practice (otherwise I would do them differently), but let’s not mistake the motives.

(For the record, I also adapt my teaching practice to minimize the effects of plagiarism and academic dishonesty. This means that take-home assignments cannot be a large part of my assessment scheme. If I had to choose between being a dick to students who didn’t do the readings and being able to assess my courses with assignments, I’d choose the latter in a heartbeat.)

But SETs have some problems. Firstly, there’s increasingly strong evidence that women, people of colour, and people who speak English with the “wrong” accent**** receive systematically lower SET scores. So as an assessment instrument, SETs are horribly biased.

The one shining advantage of SETs, however, is that they are cheap and easy to measure. They are also centred on the student experience, and there have been a number of studies that suggest that SET scores are correlated with student results.

However, a recent paper from Bob Uttl, Carmela A. White, and Daniela Wong Gonzalez suggests that the correlation observed in these studies is most likely an artifact of the small sample sizes.

The paper is a very detailed meta-analysis (and meta-reanalysis of the previous positive results) of a number of studies on the correlation between SET scores and final grades. The experiments are based on large, multi-section courses where the sections are taught by multiple instructors. The SET score of the instructor is compared to the student outcomes (after adjusting for various differences between cohorts). Some of these courses are relatively small and hence the observed correlation will be highly variable.

The meta-analytic methods used in this paper are heavily based on p-values, but are also very careful to correctly account for the differing sample sizes across studies. The paper also points out that if you look at the original data from the studies, some of the single-study correlations are absurdly large. It’s always a good idea to look at your data!

So does this mean SETs are going to go away? I doubt it. Although they don’t measure the effectiveness of teaching, universities increasingly market themselves based on the student experience, which SETs do measure directly. And let us not forget that in the exact same way that I adapt my teaching to the metrics that are used to evaluate it, universities will adapt to the metrics used to evaluate them. Things like the National Student Survey in the UK and the forthcoming Teaching Excellence Framework (also in the UK) will strongly influence how universities expect their faculty to teach.

Footnotes:

*I actually experimented with assuming the students would do the reading once, when I taught a small grad course. Let’s just say the students vociferously disagreed with the requirement. I’ve taught very similar material since then without this requirement (and with some other changes) and it went much better. Obviously, very small cohorts and various other changes mean that I can’t definitively say it was the reading requirement that sank that course, but I can say that it’s significantly easier to teach students who don’t hate you.

** One of the things I really liked about teaching in Bath was that one of the other requirements was to make sure that the scatterplot of each student’s result in my class against the average of their marks in their other subjects that semester was clustered around the line y=x.

***I have unstable employment and a visa that is tied to my job. What do you expect?

**** There is no such thing as a wrong English accent. People are awful and students are people.

In my role as professional singer and ham

Pryor unhooks the deer’s skull from the wall above his still-curled-up companion. Examines it. Not a good specimen – the back half of the lower jaw’s missing, a gap that, with the open cranial cavity, makes room enough for Pryor’s head.

He puts it on. – Will Eaves, Murmur

So as we roll into the last dying embers of the flaming glory that is (North American) Pride Month (Europe, you’ve still got some fun ahead), I’ve been doing my best to make both good and questionable decisions.

The good decision was to read an absolutely fabulous book by the British author Will Eaves (who is on a run of absolutely stunning books), fairly recently released, called Murmur. Murmur is an odd beast: I guess you could say it’s a novel about Alan Turing, but that would be moderately inaccurate and probably unfair.

Firstly, because even from one of my favourite authors, I am basically allergic to people writing about maths and mathematicians. It’s always so stodgy and wooden, as if they are writing about a world they don’t understand but also can’t convincingly fake. (Pride month analogy: I mostly avoid straight people writing queer stories for the same reason.) And Turing is a particular disaster for cultural portrayals: he intersects with too many complicated things (World War 2, cryptography, computers, pre-1967 British homosexuality) for him to ever be anything but an avatar.

So this is not a book about Alan Turing being a tortured genius. It’s a book about a guy called Alec Pryor who just happens to share a bunch of biographical details with Turing (Bletchley, Cambridge, ex-fiancée, arrest and chemical castration, Jungian therapist). And it’s not a book about a sad, wronged gay genius who kills himself. It’s a story about him living, about him processing the changes in his internal life due to his punishment and his interactions with the outside world and his past, and about his musings on consciousness and computation.

All of which is to say Murmur is a complex, wonderfully written book that, over its hundred-and-seventy-something pages, sketches out a fully realized world that doesn’t make me want to hide under the sofa in despair. And it does that rare thing for people telling this story: it doesn’t flatten the story out by focusing on the punchline, focusing instead on the person and the life behind it.

(As an aside, I’d strongly recommend Hannah Gadsby’s Netflix special Nanette, which talks about the damage we do by truncating our stories to make other people happy. It’s the only cultural document so far of 2018 worth going out of your way to see.)

I would recommend you find yourself a copy. It’s published by a small press, so order it from them or, if you’re in London, pop into Gay’s the Word and get yourself a copy.

And now that you’ve made your way through the unexpected book review, let’s get to the point

The point of writing this post wasn’t to do a short review of a wonderful book (it was to annoy Aki, who is waiting for me to finish something). But Murmur is a book that spends some time (as you inevitably do when considering an ersatz Turing) considering the philosophical implications of artificial intelligence. (Really selling it there, Daniel.) And this parallels some discussion that I’ve been seeing around the traps about what we talk about when we talk about neural networks.

Also because the quote that I ripped from an absolutely wonderful run in the novel to unceremoniously shove at the top of this post made me think of how we use methods that we don’t fully understand.

The first paper I fell into (and incidentally reading papers on neural nets is my aforementioned questionable decision) has the direct title Polynomial Regression As an Alternative to Neural Nets, where Cheng, Khomtchouk, and Matloff argue that we might as well use polynomial regression as it’s easier to interpret than a NN and basically gives the same answer.

The main mathematical argument in the paper is that if you build a NN with a polynomial activation function, each layer gives a higher-order polynomial. They argue that the Stone-Weierstrass approximation theorem suggests that any activation function will lead to a NN that can be well approximated by a high-order polynomial.
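To see what that means concretely, here is a tiny numerical check of my own (not from the paper, and with arbitrary weights): a two-hidden-layer “network” with a quadratic activation is exactly a degree-4 polynomial of its input, so an ordinary degree-4 polynomial regression reproduces it to rounding error.

# Two hidden layers with a quadratic activation give an exact degree-4
# polynomial of the input, so degree-4 polynomial regression fits it exactly.
set.seed(1)
act <- function(z) z^2                      # polynomial activation
W1 <- rnorm(3); b1 <- rnorm(3)              # scalar input -> 3 hidden units
W2 <- matrix(rnorm(9), 3, 3); b2 <- rnorm(3)
W3 <- rnorm(3); b3 <- rnorm(1)

nn <- function(x) {                         # x: vector of scalar inputs
  h1 <- act(outer(W1, x) + b1)              # degree 2 in x
  h2 <- act(W2 %*% h1 + b2)                 # degree 4 in x
  as.numeric(crossprod(W3, h2) + b3)
}

x <- seq(-2, 2, length.out = 50)
y <- nn(x)
max(abs(resid(lm(y ~ poly(x, 4)))))         # essentially zero: an exact fit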

Now, as a general rule, anytime someone whips out Stone-Weierstrass I feel a little skeptical, because the bit of me that remembers my approximation theory remembers that the construction in this theorem converges very slowly. I’m also alarmed by the use of high-degree polynomial regression using the natural basis and no regularization. Both of these things are very bad ideas.
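(For a sense of scale on that second worry, here’s a quick illustration, mine rather than the paper’s: the condition number of a raw degree-15 polynomial design matrix versus an orthogonal polynomial basis on the same points.)

# The raw ("natural") monomial basis is numerically hopeless at high degree:
# the design matrix is catastrophically ill-conditioned, while an orthogonal
# polynomial basis of the same degree is perfectly well behaved.
set.seed(1)
x <- runif(200, 0, 10)
X_raw  <- outer(x, 0:15, `^`)      # columns 1, x, x^2, ..., x^15
X_orth <- cbind(1, poly(x, 15))    # orthogonal polynomials, same degree
kappa(X_raw)                       # astronomically large
kappa(X_orth)                      # around 1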

But the whole story, that neural networks can also be understood as being quite like polynomial regression and that this analogy can allow us to improve our NN techniques, is the sort of story that we need to tell to understand how to make these methods better in a principled way.

(Honestly, I’m not a giant fan of a lot of the paper: it does polynomial regression in exactly the way people have been telling applied researchers never to do it. But hey, there’s some interesting stuff in there.)

The other paper I read was a whole lot more interesting. Rather than trying to characterize a neural network as something else, it instead tries to argue that NNs should be used to process electronic health records. And it gets good results compared to standard methods, which is always a good sign. But that’s not what’s interesting.

The interesting thing comes via Twitter. Uri Shalit, who’s a prof at Technion, noticed something very interesting in the appendix: Table 1 there shows that regularized logistic regression performs almost exactly as well as the complicated deep net.

Once again, the Goliath of  AI is slain by the David of “really boring statistical methods”.

But again, there’s more to this than that. Firstly, the logistic regression required some light “feature engineering” (ie people who knew what they were doing had to do something to the covariates). In particular, they had to separate them into time bins to allow the model to do a good job of modelling time. The deep net didn’t need that.
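For concreteness, here’s a sketch of what that kind of light feature engineering looks like. Everything here is invented for illustration (the data, the bin edges, and the use of glmnet); it is emphatically not the paper’s code.

# Made-up example: turn a raw "days before admission" timestamp into coarse
# time-bin indicators and feed them (plus interactions) to a regularized
# logistic regression.
library(glmnet)

set.seed(3)
n <- 1000
days_before <- runif(n, 0, 365)                  # when the lab value was measured
lab_value   <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * lab_value * (days_before < 30)))

bins <- cut(days_before, breaks = c(0, 7, 30, 90, 365), include.lowest = TRUE)
X <- model.matrix(~ bins * lab_value)[, -1]      # bin indicators + interactions

fit <- cv.glmnet(X, y, family = "binomial")      # regularized logistic regression
coef(fit, s = "lambda.min")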

This particular case of feature engineering is trivial, but in a lot of cases it’s careful understanding of the structure of the problem that lets us use simple statistical techniques instead of something weird and complex. My favourite example of this is Bin Yu’s work where she (read: her and a lot of collaborators) basically reconstructed movies a person was watching from an fMRI scan! The way they did it was to understand how certain patterns excite certain pathways (basically using existing science) and to put those activation patterns in as covariates in a LASSO regression. So the simplest modern technique + feature engineering (aka science) gave fabulous results.

The argument for all of these complex AI and deep learning methods is that they allow us to be a little more sloppy with the science. And it seems to work quite well for images and movies, which are extremely structured. But electronic health records are not particularly structured and can have some really weird missingness problems, so it’s not clear that the same methods will have as much room to move. In this case a pretty boring regularized logistic regression does almost exactly as well, which suggests that the deep net is not able to really fly.

The path forward now is to start understanding these cases, working out how things work, when things work, and what techniques we can beg, borrow, and steal from other areas of stats and machine learning. Because deep learning is not a panacea, it’s just a boy standing in front of a girl asking her to love him.

I am the supercargo

In a form of sympathetic magic, many built life-size replicas of airplanes out of straw and cut new military-style landing strips out of the jungle, hoping to attract more airplanes. – Wikipedia

Twenty years ago, Geri Halliwell left the Spice Girls, so I’ve been thinking about Cargo Cults a lot.

As an analogy for what I’m gonna talk about, it’s … inapt, but only if you’ve looked up Cargo Cults. But I’m going with it because it’s pride week and Drag Race is about to start.

The thing is, it can be hard to identify if you’re a member of a Cargo Cult. The whole point is that from within the cult, everything seems good and sensible and natural. Or, to quote today’s titular song,

They say “our John Frum’s coming,
He’s bringing cargo…” and the rest
At least they don’t expect to be
Surviving their own deaths.

This has been on my mind on and off for a while now. Mostly from a discussion I had with someone in the distant-enough-to-not-actually-remember-who-I-was-talking-to past, where we were arguing about something (I’m gonna guess non-informative vs informative priors, but honestly I do not remember) and this person suggested that the thing I didn’t like was a good idea, at least in part, because Harold Jeffreys thought it was a good idea.

A technical book written in the 1930s being used as a coup de grâce to end a technical argument in 2018 screams cargo cult to me. But is that fair? (Extreme narrator voice: It is not fair.)

I guess this is one of the problems we need to deal with as a field: how do we maintain the best bits of our old knowledge (pre-computation, early computation, MCMC, and now) while dealing with the rapidly evolving nature of modern data and modern statistical questions?

So how do you avoid cult-like behaviour? Well, as a child of Nü-Metal*, I think there’s only one real answer:

Break stuff

I am a firm believer that before you use a method, you should know how to break it. Describing how to break something should be an essential part of describing a new piece of statistical methodology (or, for that matter, of resurrecting an existing one). At the risk of getting all Dune on you, he who can destroy a thing controls a thing. 

(We’re getting very masc4masc here. Who’d’ve thought that me with a hangover was so into Sci-Fi & Nü-Metal? Next thing you know I’ll be doing a straight-faced reading of Ender’s Game. Look for me at 2am explaining to a woman who’d really rather not be still talking to me that they’re just called “buggers” because they look like bugs.)

So let’s break something.

This isn’t meant to last, this is for right now

Specifically, let’s talk about breaking leave-one-out cross validation (LOO-CV) for computing the expected log-predictive density (elpd or sometimes LOO-elpd). Why? Well, partly because I also read that paper that Aki commented on a few weeks back, which made me think more about the dangers of accidentally starting a cargo cult. (In this analogy, the cargo is an R package and a bunch of papers.)

One of the fabulous things about this job is that there are two things you really can’t control: how people will use the tools you construct, and how long they will continue to take advice that turned out not to be the best (for serious, cool it with the Cauchy priors!).

So it’s really important to clearly communicate flaws in method both when it’s published and later on. This is, of course, in tension with the desire to actually get work published, so we do  what we can.

Now, Aki’s response was basically definitive, so I’m mostly not going to talk about the paper. I’m just going to talk about LOO.

One step closer to the edge

One of the oldest criticisms of using LOO for model selection is that it is not necessarily consistent when the model list contains the true data generating model (the infamous, but essentially useless** M-Closed*** setting). This contrasts with model selection using Bayes’ Factors, which are consistent in the useless asymptotic regime. (Very into Nü-Metal. Very judgemental.)

Being that judge-y without explaining the context is probably not good practice, so let’s actually look at the famous case where model selection will not be consistent: Nested models.

For a very simple example, let’s consider two potential models:

$latex \text{M1:}\; y_i \sim N(\mu, 1)$

$latex \text{M2:}\; y_i \sim N(\mu + \beta x_i, 1)$

The covariate $latex x_i$ can be anything, but for simplicity, let’s take it to be $latex x_i \sim N(0,1)$.

And to put us in an M-Closed setting, let’s assume the data that we are seeing is drawn  from the first model (M1) with $latex \mu=0$. In this situation, model selection based on the LOO-expected log predictive density will be inconsistent.

Spybreak!

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.  This is different to Bayes’ factors, which will always choose the simplest of the models that make the same predictions.

Let’s look at what happens asymptotically. (And now you see why I focussed on such simple models: I’m quite bad at maths.)

Because these models are regular and have finite-dimensional parameters, they both satisfy all of the conditions of the Bernstein-von Mises theorem (which I once wrote about in these pages during an epic panic attack), which means that we know in both cases that the posterior for the model parameters $latex \theta$ after observing $latex n$ data points will be $latex \theta_j^{(n)} = \theta_j^* + \mathcal{O}_p(n^{-1/2})$. Here:

  • $latex \theta_j^{(n)}$ is the random variable distributed according to the posterior for model j after $latex n$ observations,
  • $latex \theta_j^*$ is the true parameter from model j that would generate the data. In this case $latex \theta_1^*=0$ and $latex \theta_2^*=(0,0)^T$.
  • And $latex \mathcal{O}_p(n^{-1/2})$ is a random variable with (finite) standard deviation that goes to zero like $latex n^{-1/2}$ as $latex n$ increases.

Arguing loosely (again: quite bad at maths), the LOO-elpd criterion is trying to compute**** $latex E_{\theta_j^{(n)}}\left[\log(p(y\mid\theta_j^{(n)}))\right]$, which asymptotically looks like $latex \log(p(y\mid\theta_j^*))+O(n^{-1/2})$.

This means that, asymptotically, both of these models will give rise to the same posterior predictive distribution and hence LOO-elpd will not be able to tell them apart.
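If you’d rather see this numerically than asymptotically, here is a rough sketch using rstanarm and loo. It’s a slight variation on the setup above (the residual standard deviation is estimated rather than fixed at 1, and the priors are rstanarm’s defaults), but the punchline is the same: the elpd difference is typically tiny relative to its standard error.

# Simulate from M1, fit both nested models, and compare them with LOO.
library(rstanarm)
library(loo)

set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- rnorm(n, mean = 0, sd = 1)                    # data generated from M1

fit1 <- stan_glm(y ~ 1,     data = d, refresh = 0)   # M1
fit2 <- stan_glm(y ~ 1 + x, data = d, refresh = 0)   # M2

loo_compare(loo(fit1), loo(fit2))                    # elpd_diff is small relative to its SE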

Take a look around

LOO-elpd can’t tell them apart, but we sure can! The thing is, the argument of inconsistency in this case only really holds water if you never actually look at the parameter estimates. If you know that you have nested models (ie that one is a special case of the other), you should just look at the estimates to see if there’s any evidence for the more complex model. Or, if you want to do it more formally, consider the family of potential nested models as your M-Complete model class and use something like projpred to choose the simplest one.

All of which is to say that this inconsistency is mathematically a very real thing but should not cause practical problems unless you use model selection tools blindly and thoughtlessly.

For a bonus extra fact: this type of setup will also cause the stacking weights we (Yuling, Aki, Andrew, and I) proposed not to stabilize, because any convex combination will asymptotically give the same predictive distribution. So be careful if you’re trying to interpret model stacking weights as posterior model probabilities.

Have a cigar

But I said I was going to break things. And so far I’ve just propped up the method yet again.

The thing is, there is a much bigger problem with LOO-elpd. The problem is the assumption that leaving one observation out is enough to get a good approximation to the average value of the posterior log-predictive over a new data set.  This is all fine when the data is iid draws from some model.

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.  Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set. Or when your data has multilevel structure, where you really should leave out whole strata.

In all of these cases, cross validation can be a useful tool, but it’s k-fold cross validation that’s needed rather than LOO-CV. Moreover, if your data is weird, it can be hard to design a cross validation scheme that’s defensible. Worse still, while LOO is cheap (thanks to Aki and Jonah’s work on the loo package), k-fold CV requires re-fitting the model a lot of times, which can be extremely expensive.
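To give a flavour of what “leave out the whole future” looks like in code, here’s a deliberately crude sketch: toy AR(1) data, a simple lagged-outcome model in rstanarm, and a loop that refits on the data up to time t and scores the one-step-ahead observation. (There is smarter machinery for this; this is just to show the shape of the computation.)

# Leave-future-out CV by brute force: refit on y_1, ..., y_t and score y_{t+1}.
library(rstanarm)

set.seed(2)
n_t <- 120
z <- as.numeric(arima.sim(list(ar = 0.8), n = n_t))
d <- data.frame(y = z[-1], y_lag = z[-n_t], t = 2:n_t)

elpd_lfo <- 0
for (t0 in 110:(n_t - 1)) {                              # score the last ten points
  fit <- stan_glm(y ~ y_lag, data = subset(d, t <= t0), refresh = 0)
  ll  <- log_lik(fit, newdata = subset(d, t == t0 + 1))  # draws of the held-out log-density
  elpd_lfo <- elpd_lfo + log(mean(exp(ll)))              # log pointwise predictive density
}
elpd_lfo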

All of this is to say that if you want to avoid an accidental LOO cargo cult, you need to be very aware of the assumptions and limitations of the method and to use it wisely, rather than automatically. There is no such thing as an automatic statistician.

Notes:

* One of the most harrowing days of my childhood involved standing at the checkout of the Target in Buranda (a place that has not changed in 15 years, btw) and having to choose between buying the first Linkin Park album and the first Coldplay album. You’ll be pleased to know that I made the correct choice.

** When George Box said that “All models are wrong” he was saying that M-Closed is a useless assumption that is never fulfilled.

*** The three modelling scenarios (according to Bernardo and Smith):

  • M-closed means the true data generating model is one of the candidate models $latex M_k$, although which one is unknown to researchers.
  • M-complete refers to the situation where the true model exists (and we can specify the explicit form for it), but for some reason it is not in the list of candidate models.
  • M-open refers to the situation in which we know the true model is not in the set of candidate models and we cannot specify its explicit form (this is the most common one).

**** A later edit: I forgot the logarithms in the expected log-densities, because by the time I finished this a drag queen had started talking and I knew it was time to push publish and finish my drink.

Opportunity for Comment!

(This is Dan)

Last September, Jonah, Aki, Michael, Andrew and I wrote a paper on the role of visualization in the Bayesian workflow.  This paper is going to be published as a discussion paper in the Journal of the Royal Statistical Society Series A and the associated read paper meeting (where we present the paper and people present discussions) will take place at the RSS conference in Cardiff on 5 September, 2018.

The instructions for submitting a discussion on the paper (either in person or via email) are given here. The gist is in this paragraph:

Further shorter comments (up to five minutes or 400 words in writing) are then invited from the audience. If the two hours in total permit before the authors are given the opportunity to respond very briefly verbally (and at greater length in writing in the journal), the secretary for the meeting will read out any contributions that have been sent in from people who cannot be present. Contributions in writing (again, 400 words maximum) are welcome up to two weeks after the meeting.

If anyone has thoughts on that paper or what we could’ve done better or differently, now is your opportunity to say so in print!

There will be two other papers presented at the same meeting (Graphics for uncertainty by Adrian Bowman and Visualizing spatiotemporal models with virtual reality: from fully immersive environments to applications in stereoscopic view by Stefano Castruccio, Marc Genton, and Ying Sun), so it’s sure to be a barn-burner. You should read them and submit comments on them too if you’re so inclined!

One good and one bad response to statistics’ diversity problem

(This is Dan)

As conference season rolls into gear, I thought I’d write a short post contrasting some responses by statistical societies to the conversation that the community has been having about harassment of women and minorities at workshops and conferences.

ISI: Do what I say, not what I do

Let’s look first at the recent diversity statement released by the International Statistical Institute, more commonly known as the ISI. Feel free to read it in full; it’s not very long. But I’ll reproduce a key paragraph.

ISI is committed to providing a professional environment free from discrimination on the basis of sex, race, colour, religion, national origin, disability, age, sexual orientation, gender identity, and politics.

On the face of it, this looks fine. It’s a boilerplate paragraph that is in almost every code of conduct and diversity statement. But let’s actually take a look at how it’s currently implemented.

One of the major activities of the ISI is a biennial World Statistics Congress, which is one of the larger statistics conferences on the calendar. Last year, they held it in Morocco. Next year it will be held in Malaysia.

In Morocco, the penalty for same-sex sexual activity is up to three years in jail. In Malaysia, the penalty for same-sex sexual activity is up to twenty years in jail (as well as fines and corporal punishment).

That the ISI has put two consecutive major statistics meetings in countries where homosexual activity is illegal is not news to anyone. These aren’t secret meetings; they are very large and have been on the books for a few years.

But these meetings manifestly fail to live up to the new diversity statement. This reflects a lack of care and a lack of thought. The ISI World Statistics Conferences do not provide a professional environment free from discrimination.

By holding these meetings in these countries, the ISI is sending a strong message to the LGBT+ community that they do not value us as statisticians. They are explicitly forcing LGBT+ scholars to make a very difficult choice in the event that they get invited to speak in a session: do they basically pretend to be straight for a week, or do they give up this career opportunity?

The ISI has released a diversity statement that it does not live up to. It is a diversity statement that anyone organizing a session at the next ISI World Statistics Conference will not live up to (they are participating in organizing a hostile environment for LGBT+ scholars).

This is pathetic. It is rare to see a group release a diversity statement and actually make the situation worse.

That they did it at the beginning of Pride Month in North America is so bleak I actually find it funny.

ISBA: A serious, detailed, and actionable response

For a much better response to the problems facing minority communities in statistics, we can look at ISBA’s response to reports of sexual harassment at their conferences.

ISBA has taken these reports seriously and has released a detailed Code of Conduct that covers all future events as well as the responsibilities and expectations of members of the society.

This was a serious and careful response to a real problem, and it’s a credit to ISBA that they made this response in time for their forthcoming major meeting. It’s also a credit to them that they did not rush the process: this is the result of several months of hard work by a small team of people.

The level of detail that the code of conduct goes into in its complaints procedures, its investigation procedures, and the rights and responsibilities of all involved is very good. While it is impossible to undo the harm from not dealing with this problem earlier, this code of conduct is a good basis for making ISBA a safe place for statisticians from now and into the future.

 

PS (added later): There was some concern shown in the comments that the two countries I mentioned were Muslim-majority countries. There was no other ISI meeting I could see in the same time period that happened in a place with similar anti-LGBT legislation. (Although such places exist, and the laws are justified using a variety of religious and social positions.)

The last ISI WSC to be held in a Muslim-majority country before Morocco was 20 years previous, in Turkey, which, to my knowledge, did not have anti-LGBT laws on the books then and does not now.

It is always a mistake to attribute bad laws to faith groups. People who share a common religion are diverse and assigning them responsibility for a law would be like assigning every American blame for everything Trump (or Obama) wrote into law.

Zero-excluding priors are probably a bad idea for hierarchical variance parameters

(This is Dan, but in quick mode)

I was on the subway when I saw Andrew’s last post and it doesn’t strike me as a particularly great idea.

So let’s take a look at the suggestion for 8 schools using a centred parameterization.  This is not as comprehensive as doing a proper simulation study, but the results are sufficiently alarming to put the brakes on Andrew’s proposal. Also, I can do this quickly (with a little help from Jonah Gabry).

We know that this will lead to divergences under any priors that include a decent amount of mass at zero, so Andrew’s suggestion was to use a boundary avoiding prior.  In particular a Gamma(2,$latex \beta$) prior, which has mean $latex 2\beta^{-1}$ and variance $latex 2\beta^{-2}$.  So let’s make a table!

Here I’ve run RStan under the default settings, so it’s possible that some of the divergences are false alarms; however, you can see that you need to pull the prior quite far to the right in order to reduce them to double digits. A better workflow would then look at pairs plots to see where the divergences are (see Betancourt’s fabulous workflow tutorial), but I’m lazy. I’ve also included a column that warns if there’s a chain that isn’t sampling the energy distribution appropriately.

beta | Prior mean | Prior 0.01 quantile | Divergences (/4000) | Low BFMI Warning? | 2.5% for tau | 50% for tau | 97.5% for tau
10.00 | 0.20 | 0.01 | 751 | FALSE | 0.04 | 0.19 | 0.57
5.00 | 0.40 | 0.03 | 342 | FALSE | 0.11 | 0.34 | 1.05
2.00 | 1.00 | 0.07 | 168 | FALSE | 0.30 | 0.99 | 2.81
1.00 | 2.00 | 0.15 | 195 | FALSE | 0.68 | 1.76 | 5.23
0.50 | 4.00 | 0.30 | 412 | FALSE | 0.64 | 2.78 | 8.60
0.20 | 10.00 | 0.74 | 34 | FALSE | 1.51 | 5.32 | 14.71
0.10 | 20.00 | 1.49 | 40 | FALSE | 1.74 | 6.92 | 18.71
0.05 | 40.00 | 2.97 | 17 | FALSE | 2.21 | 8.45 | 23.41
0.01 | 200.00 | 14.86 | 13 | FALSE | 2.88 | 9.37 | 27.86
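(For reference, here’s a rough reconstruction of the kind of code behind each row of that table; the real code is in the git repo linked below. A centred-parameterization 8-schools model with a Gamma(2, b) prior on tau, run at RStan’s default settings.)

# One row of the table: centred 8 schools, Gamma(2, b) prior on tau.
library(rstan)

model_code <- "
data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
  real<lower=0> b;               // rate of the Gamma(2, b) prior
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta;               // centred parameterization
}
model {
  tau ~ gamma(2, b);             // boundary-avoiding prior
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}
"

schools <- list(J = 8,
                y = c(28, 8, -3, 7, -1, 1, 18, 12),
                sigma = c(15, 10, 16, 11, 9, 11, 10, 18),
                b = 1)

fit <- stan(model_code = model_code, data = schools, seed = 1)

# divergent transitions across the 4 default chains (4000 post-warmup draws)
sum(sapply(get_sampler_params(fit, inc_warmup = FALSE),
           function(ch) sum(ch[, "divergent__"])))
print(fit, pars = "tau", probs = c(0.025, 0.5, 0.975))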

You can see from this that you can’t reliably remove the divergences, and that there is strong prior dependence: the inference for tau changes substantially depending on how strongly the prior avoids the boundary.

But it’s actually worse than that. These priors heavily bias the group standard deviation! To see this, compare to the inference from two of the priors we normally use (which need to be fitted using a non-centred parameterization). If the prior for $latex \tau$ is a half-Cauchy with dispersion parameter 5, the (2.5%, 50%, 97.5%) posterior values for $latex \tau$ are (0.12, 2.68, 11.90). If instead it’s a half-normal with standard deviation 5, the right tail shrinks a little to (0.13, 2.69, 9.23). None of the boundary avoiding priors give estimates anything like this.

So please don’t use boundary avoiding priors on hierarchical variance parameters: it probably won’t work.

Git repo for the code is here.

Edit: At Andrew’s request, a stronger boundary avoiding prior. This one is an inverse gamma, which has a very, very light left tail (the density goes to zero like $latex e^{-1/\tau}$ as $latex \tau \rightarrow 0$) and a very heavy right tail (the density goes like $latex \tau^{-3}$ for large $latex \tau$). The table below is the same as the one above, but with this prior. For want of a better option, I kept the shape parameter at 2, although I welcome other options.

beta | Prior mean | Prior 0.01 quantile | Divergences (/4000) | Low BFMI Warning? | 2.5% for tau | 50% for tau | 97.5% for tau
0.50 | 0.50 | 0.07 | 95 | TRUE | 0.09 | 0.32 | 1.91
1.00 | 1.00 | 0.15 | 913 | TRUE | 0.11 | 0.44 | 4.60
2.00 | 2.00 | 0.30 | 221 | FALSE | 0.29 | 1.10 | 5.17
5.00 | 5.00 | 0.74 | 64 | FALSE | 0.99 | 2.60 | 8.61
10.00 | 10.00 | 1.48 | 57 | FALSE | 1.41 | 4.17 | 12.38
20.00 | 20.00 | 2.96 | 4 | FALSE | 2.94 | 6.92 | 16.44
30.00 | 30.00 | 4.44 | 0 | FALSE | 4.08 | 8.86 | 19.99
40.00 | 40.00 | 5.92 | 0 | FALSE | 5.03 | 10.31 | 22.13
50.00 | 50.00 | 7.40 | 0 | FALSE | 5.88 | 11.56 | 24.98
100.00 | 100.00 | 14.79 | 0 | FALSE | 9.93 | 17.77 | 34.84

You can see that you need to use strong boundary avoidance to kill off the divergences, and the estimates for $latex \tau$ still seem a bit off (if you compare the prior and posterior quantiles, you can see that the data really wants $latex \tau$ to be smaller).

You better check yo self before you wreck yo self

We (Sean Talts, Michael Betancourt, me, Aki, and Andrew) just uploaded a paper (code available here) that outlines a framework for verifying that an algorithm for computing a posterior distribution has been implemented correctly. It is easy to use, straightforward to implement, and ready to be adopted as part of a Bayesian workflow. [latest version of the paper is here. — AG]

This type of testing should be required in order to publish a new (or improved) algorithm that claims to compute a posterior distribution. It’s time to get serious about only publishing things that actually work!

You Oughta Know

Before I go into our method, let’s have a brief review of some things that are not sufficient to demonstrate that an algorithm for computing a posterior distribution actually works.

  • Theoretical results that are anything less than demonstrably tight upper and lower bounds* that work in finite-sample situations.
  • Comparison with a long run from another algorithm unless that algorithm has stronger guarantees than “we ran it for a long time”. (Even when the long-running algorithm is guaranteed to work, there is nothing generalizable here. This can only ever show the algorithm works on a specific data set.)
  • Recovery of parameters from simulated data (this literally checks nothing)
  • Running the algorithm on real data. (Again, this checks literally nothing.)
  • Running the algorithm and plotting traceplots, autocorrelation, etc etc etc
  • Computing the Gelman-Rubin R-hat statistic. (Even using multiple chains initialized at diverse points, this only checks if the Markov Chain has converged. It does not check that it’s converged to the correct thing)

I could go on and on and on.

The method that we are proposing does actually do a pretty good job at checking if an approximate posterior is similar to the correct one. It isn’t magic. It can’t guarantee that a method will work for any data set.

What it can do is make sure that for a given model specification, one dimensional posterior quantities of interest will be correct on average. Here, “on average” means that we average over data simulated from the model. This means that rather than just check the algorithm once when it’s proposed, we need to check the algorithm every time it’s used for a new type of problem. This places algorithm checking within the context of Bayesian Workflow.

This isn’t as weird as it seems. One of the things that we always need to check is that we are actually running the correct model. Programming errors happen to everyone and this procedure will help catch them.

Moreover, if you’re doing something sufficiently difficult, it can happen that even something as stable as Stan will quietly fail to get the correct result. The Stan developers have put a lot of work into trying to avoid these quiet cases of failure (Betancourt’s idea to monitor divergences really helped here!), but there is no way to user-proof software. The Simulation-Based Calibration procedure that we outline in the paper (and below) is another safety check that we can use to help us be confident that our inference is actually working as expected.

(* I will also take asymptotic bounds and sensitive finite-sample heuristics because I’m not that greedy. But if I can’t run my problem, check the heuristic, and then be confident that if someone died because of my inference, it would have nothing to do with the computation of the posterior, then it’s not enough.)

Don’t call it a comeback, I’ve been here for years

One of the weird things that I have noticed over the years is that it’s often necessary to re-visit good papers from the past so they reflect our new understanding of how statistics works. In this case, we re-visited an excellent idea that Samantha Cook, Andrew, and Don Rubin proposed in 2006.

Cook, Gelman, and Rubin proposed a method for assessing output from software for computing posterior distributions by noting a simple fact:

If $latex \theta^* \sim p(\theta)$ and $latex y^* \sim p(y \mid \theta^*)$, then the posterior quantile $latex \Pr(h(\theta^*)<h(\theta)\mid y^*)$ is uniformly distributed (the randomness is in $latex \theta^*$ and $latex y^*$) for any continuous function $latex h(\cdot)$.

There’s a slight problem with this result.  It’s not actually applicable for sample-based inference! It only holds if, at every point, all the distributions are continuous and all of the quantiles are computed exactly.

In particular, if you compute the quantile $latex \Pr(h(\theta^*)<h(\theta)\mid y^*)$ using a bag of samples drawn from an MCMC algorithm, this result will not hold.

This makes it hard to use the original method in practice. And that might be understating the problem. This whole project happened because we wanted to run Cook, Gelman and Rubin’s procedure to compare some Stan and BUGS models. And we just kept running into problems. Even when we ran it on models that we knew worked, we were getting bad results.

So we (Sean, Michael, Aki, Andrew, and I) went through and tried to re-imagine their method as something that is more broadly applicable.

When in doubt, rank something

The key difference between our paper and Cook, Gelman, and Rubin is that we have avoided their mathematical pitfalls by re-casting their main theoretical result to something a bit more robust. In particular, we base our method around the following result.

Let $latex \theta^* \sim p(\theta)$ and $latex y^* \sim p(y \mid \theta^*)$, and $latex \theta_1,\ldots,\theta_L$ be independent draws from the posterior distribution $latex p(\theta\mid y^*)$. Then the rank of $latex h(\theta^*)$ in the bag of samples $latex h(\theta_1),\ldots,h(\theta_L)$ has a discrete uniform distribution on $latex [0,L]$.

This result is true for both discrete and continuous distributions, and we now have the freedom to choose $latex L$. As a rule, the larger $latex L$, the more sensitive this procedure will be. On the other hand, a larger $latex L$ will require more simulated data sets in order to be able to assess if the observed ranks deviate from a discrete-uniform distribution. In the paper, we chose $latex L=100$ samples for each posterior.

The hills have eyes

But, more importantly, the hills have autocorrelation. If a posterior has been computed using an MCMC method, the bag of samples that are produced will likely have non-trivial autocorrelation. This autocorrelation will cause the rank histogram to deviate from uniformity in a specific way. In particular, it will lead to spikes in the histogram at zero and/or one.

To avoid this, we recommend thinning the sample to remove most of the autocorrelation.  In our experiments, we found that thinning by effective sample size was sufficient to remove the artifacts, even though this is not theoretically guaranteed to remove the autocorrelation.  We also considered using some more theoretically motivated methods, such as thinning based on Geyer’s initial positive sequences, but we found that these thinning rules were too conservative and this more aggressive thinning did not lead to better rank histograms than the simple effective sample size-based thinning.

Simulation based calibration

After putting all of this together, we get the simulation based calibration (SBC) algorithm. The version described below is for correlated samples. (There is a version in the paper for independent samples.)

The simple idea is that for each of the $latex N$ simulated datasets, you generate a bag of $latex L$ approximately independent samples from the approximate posterior. (You can do this however you want!) You then compute the rank of the true parameter (that was used in the simulation of the data set) within the bag of samples. So you need to compute $latex N$ true parameters, each of which is used to generate one data set, which is in turn used to compute $latex L$ samples from its posterior.
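Here’s a minimal sketch of the idea in a setting where we can cheat and draw exact (hence independent) posterior samples, so no thinning is needed: a conjugate normal-normal model. The rank histogram should look (noisily) uniform; if you deliberately break the posterior (say, halve its standard deviation), it won’t.

# SBC for y ~ N(theta, 1) with theta ~ N(0, 1), using exact posterior draws.
N <- 1000   # simulated data sets
L <- 100    # posterior draws per data set
n <- 10     # observations per data set

ranks <- replicate(N, {
  theta_true <- rnorm(1)                     # 1. draw the "true" parameter from the prior
  y <- rnorm(n, theta_true, 1)               # 2. draw a data set given that parameter
  post_mean <- sum(y) / (n + 1)              # conjugate posterior: N(post_mean, post_sd^2)
  post_sd   <- sqrt(1 / (n + 1))
  draws <- rnorm(L, post_mean, post_sd)      # 3. L independent posterior draws
  sum(draws < theta_true)                    # 4. rank of the true value among the draws
})

hist(ranks, breaks = seq(-0.5, L + 0.5, 1))  # should be roughly uniform on 0, ..., L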

So. Validating code with SBC is obviously expensive. It requires a whole load of runs to make it work. The up side is that all of this runs in parallel on a cluster, so if your code is reliable, it is actually quite straightforward to run.

The place where we ran into some problems was trying to validate MCMC code that we knew didn’t work. In this case, the autocorrelation on the chain was so strong that it wasn’t reasonable to thin the chain to get 100 samples. This is an important point: if your method fails some basic checks, then it’s going to fail SBC. There’s no point wasting your time.

The main benefit of SBC is in validating MCMC methods that appear to work, or validating fast approximate algorithms like INLA (which works) or ADVI (which is a more mixed bag).

This method also has another interesting application: evaluating approximate models. For example, if you replace an intractable likelihood with a cheap approximation (such as a composite likelihood or a pseudolikelihood), SBC can give some idea of the errors that this approximation has pushed into the posterior. The procedure remains the same: simulate parameters from the prior, simulate data from the correct model, and then generate a bag of approximately uncorrelated samples from the corresponding posterior using the approximate model. While this procedure cannot assess the quality of the approximation in the presence of model error, it will still be quite informative.

When You’re Smiling (The Whole World Smiles With You)

One of the most useful parts of the SBC procedure is that it is inherently visual. This makes it fairly straightforward to work out how your algorithm is wrong.  The one-dimensional rank histograms have four characteristic non-uniform shapes: “smiley”, “frowny”, “a step to the left”, “a jump to the right”, which are all interpretable.

  • Histogram has a smile: The posteriors are narrower than they should be. (We see too many low and high ranks)
  • Histogram has a frown: The posteriors are wider than they should be. (We don’t see enough low and high ranks)
  • Histogram slopes from left to right: The posteriors are biased upwards. (The true value is too often in the lower ranks of the sample)
  • Histogram slopes from right to left: The posteriors are biased downwards. (The opposite)

These histograms seem to be sensitive enough to indicate when an algorithm doesn’t work. In particular, we’ve observed that when the algorithm fails, these histograms are typically quite far from uniform. A key thing that we’ve had to assume, however, is that the bag of samples drawn from the computed posterior is approximately independent. Autocorrelation can cause spurious spikes at zero and/or one.

These interpretations are inspired by the literature on calibrating probabilistic forecasts. (Follow that link for a really detailed review and a lot of references).  There are also some multivariate extensions to these ideas, although we have not examined them here.

Justify my love

When everyone starts walking around the chilly streets of Toronto looking like they’re cosplaying the last 5 minutes of Call Me By Your Name, you know that Spring is in the air. Let’s celebrate the end of winter by pulling out our Liz Phair records, our slightly less-warm coats, and our hunger for long reads about some fundamentals.

I’ve always found the real sign that Spring has almost sprung is when strangers start asking which priors you should use on the variance parameters in a multilevel model.

A surprising twist on this age-old formula is that this year the question was actually asked on Twitter! Because I have a millennial’s-1 deep understanding of social media, my answer was “The best way to set priors on the variance of a random effect0 is to stop being a Nazi TERF”. But thankfully I have access to a blogging platform, so I’m going to give a more reasoned, nuanced, and calm answer here.

(Warning: This post is long. There’s a summary at the end. There is also an answer that fits into a 280 character limit. It obviously links to the Stan prior choice wiki. You can also look at this great Stan case study from Michael Betancourt.)

The book of love

In this post, I’m going to focus on weakly informative priors for the variance parameters in a multilevel model. The running example will be a very simple multilevel Poisson model where the log-risk is modelled using a global intercept, a covariate, and one iid Gaussian random effect0. Hence we only have one variance parameter in the model that we need to find a prior for.

Some extensions beyond this simple model are discussed further down, but setting priors for those problems is much harder and I don’t have good answers for all of them. But hopefully there will be enough information here to start to see where we’re (we = Andrew, Aki, Michael and I, but also Håvard Rue, Sigrunn Sørbye, Andrea Riebler, Geir-Arne Fuglstad, and others) going with this thread of research.

Fallin’ out of love

The first thing I’m going to do is rule out any attempt to construct “non-informative” (whatever that means) priors, because in most situations (and particularly for multilevel models) I don’t think vague priors are a good idea.

The main reason is that I don’t think they live up to their oft-stated ability to “let the data speak for itself”. The best case scenario is that you’re working with a very regular model and you have oodles of data. In this case, a correctly-specified vague prior (ie a reference prior) will not get in the way of the data’s dream to stand centre stage and sing out “I am a maximum likelihood estimator to high order“. But if that’s not the tune that your data should be singing (for example, if you think regularization or partial pooling may be useful), then a vague prior will not help.

Neither do I think that “letting the data speak” is the role of the prior distribution. The right way to ensure that your model is letting the data speak for itself is through a priori and a posteriori model checking and model criticism. In particular, well-constructed cross-validation methods, data splitting, posterior predictive checks, and prior sensitivity analysis are always necessary. It’s not enough to just say that you used a vague prior and hope. Hope has no place in statistics.  

Most of the priors that people think are uninformative turn out not to be. The classic example is the $latex \text{Inverse-Gamma}(\epsilon,\epsilon)$ prior for the variance of a random effect0. Andrew wrote a paper about this 12 years ago. It’s been read (or at least cited) quite a lot. The paper is old enough that children who weren’t even born when this paper was written have had time to rebel and take up smoking. And yet, the other day I read yet another paper that referred to a  $latex \text{Inverse-Gamma}(0.001,0.001)$ prior on the variance of a random effect0 as “weakly informative”. So let me say it one more time for the people at the back: It’s time to consign $latex \text{Inverse-Gamma}(\epsilon,\epsilon)$ priors to the dustbin of history.

Love connection

But even people who did read Andrew’s paper0.25 can go wrong when they try to set vague priors. Here’s a “close to home” example using the default priors in INLA.  INLA is an R package for doing fast, accurate, approximate Bayesian inference for latent Gaussian models, which is a small but useful class of statistical models that covers multilevel models, lots of spatial and spatiotemporal models, and some other things. The interface for the INLA package uses an extension to R’s formula environment (just like rstanarm and brms), and one of the things that this forces us to do is to specify default priors for the random effects0. Here is a story of a default prior being bad.

This example is taken from the first practical that you can find here (should you want to reproduce it). The model, which will be our running example for the next little while, is a simple Poisson random effects0 model. It has 200 observations, and the random effect0 has 200 levels. For those of you familiar with the lme4 syntax, the model can be fit as

lme4::glmer(num_awards ~ math + (1|id), data = awards, family = "poisson")

If you run this command, you get a standard deviation of 0.57 for the random effect0.

INLA parameterises a normal distribution using its precision (ie the inverse of the variance). This is in line with JAGS and BUGS, but different to Stan (which is the younger piece of software and uses the standard deviation instead). The default prior on the precision is $latex \text{Gamma}(1,5\cdot 10^{-5})$, which is essentially a very flat exponential prior.

The following table shows the 95% posterior credible intervals for the standard deviation across five different priors. The first is the INLA default; the second is recommended by Bernardinelli, Clayton, and Montomoli. Both of these priors are very vague. The third is an $latex \text{Inverse-Gamma}(\epsilon,\epsilon)$ prior with a not particularly small $latex \epsilon = 0.001$. The fourth and fifth are examples of PC priors, which are designed to be weakly informative. For this model, the PC prior is an exponential prior on the standard deviation with the rate parameter tuned so that a priori $latex \Pr(\sigma >U) < 0.1$ (for an exponential prior, $latex \Pr(\sigma>U)=e^{-\lambda U}$, so the rate is $latex \lambda = \log(10)/U$), with upper bounds $latex U=1$ and $latex U=10$ respectively.

Prior | 2.5% | Median | 97.5%
INLA default | 0.00 | 0.01 | 0.05
IG(0.5, 0.0005) | 0.02 | 0.07 | 0.59
IG(0.001, 0.001) | 0.04 | 0.40 | 0.75
PC(1, 0.1) | 0.04 | 0.42 | 0.74
PC(10, 0.1) | 0.09 | 0.49 | 0.79

The takeaway message here is that the two vague priors on the precision (the default prior in INLA and the one suggested by Bernardinelli et al.) are terrible for this problem. The other three priors do fine and agree with the frequentist estimate0.5.

The takeaway from this should not be that INLA is a terrible piece of software with terrible priors. They actually work fairly well quite a lot of the time. I chose INLA as an example because it’s software that I know well. More than that, the only reason that I know anything at all about prior distributions is that we were trying to work out better default priors for INLA.

The point that we came to is that the task of finding one universal default prior for these sorts of parameters is basically impossible. It’s even more impossible to do it for more complex parameters (like lag-1 autocorrelation parameters, or shape parameters, or over-dispersion parameters). What is probably possible is to build a fairly broadly applicable default system for building default priors.

The rest of this post is a stab at explaining some of the things that you need to think about in order to build this default system for specifying priors. To do that, we need to think carefully about what we need a prior to do, as well as the various ways that we can make the task of setting priors a little easier.

To think that I once loved you

The first problem with everything I’ve talked about above is that, when you’re setting priors, it is a terrible idea to parameterize a normal distribution by its precision. No one has an intuitive feeling about the inverse of the variance of something, which makes it hard to sense-check a prior.

So why is it used? It simplifies all of the formulas when you’re dealing with hierarchical models.  If

$latex y\mid x \sim N(x, \Sigma_y)$

$latex x \sim N(0, \Sigma_x)$,

then the variance of $latex x \mid y$ is $latex (\Sigma_x^{-1} + \Sigma_y^{-1})^{-1}$, whereas if we parameterize by the precision, we just need to add the precisions of $latex x$ and $latex y$ to get the precision of $latex x\mid y$. So when people sat down to write complicated code to perform inference on these models, they made their lives easier and parameterized normal distributions using the precision.

But just because something simplifies coding, it doesn’t mean that it’s a good idea to use it in general situations!

A much better way to parameterize the normal distribution is in terms of the standard deviation. The key benefit of the standard deviation is that it directly defines the scale of the random effect0: we know that it’s highly likely that any realization of the random effect0 is within 3 standard deviations of the mean. This means that we can easily sense-check a prior on the standard deviation by asking “is this a sensible scale for the random effect0?”.

For this model, the standard deviation is on the scale of the log-counts of a Poisson distribution, so we probably don’t want it to be too big. What does this mean? Well, the largest count observation that a computer is likely to allow is 2,147,483,647. This is a fairly large number. If we want the mean of the Poisson to be most likely less than that (in the case where everything else in the log-risk is zero), then we want to ensure $latex \sigma < 7.2$. If we want to ensure that the expected count in each cell is less than a million, we need to ensure $latex \sigma < 4.6$. If we want to ensure that the expected count is less than 1000, we need to ensure $latex \sigma < 2.3$. (I can literally go on all day.) Under the Gamma(0.001,0.001) prior on the precision, the probabilities of these three events happening are, respectively, 0.01, 0.009, and 0.007. This means that rather than “speaking for itself” (whatever that means), your data is actually trying to claw back probability mass from some really dumb parts of the parameter space.
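(Those probabilities take one line to check, using $latex \Pr(\sigma < c) = \Pr(\tau > 1/c^2)$ for the precision $latex \tau$.)

# Tail probabilities of the Gamma(0.001, 0.001) prior on the precision,
# translated to the standard deviation scale.
sigma_bounds <- c(7.2, 4.6, 2.3)
pgamma(1 / sigma_bounds^2, shape = 0.001, rate = 0.001, lower.tail = FALSE)
# all around one percent: almost all the prior mass is on absurd scales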

If you plot all of the distributions on the standard deviation scale (sample from them and do a histogram of their inverse square root), you see that the first 3 priors do not look particularly good. In particular, the first two are extremely concentrated around small values. This explains the extreme shrinkage of the standard deviation estimates. The third prior is extremely diffuse.
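(If you want to do that yourself, here’s the recipe, with my reading of the three precision priors in the table above; quantiles stand in for the histograms.)

# The first three priors from the table, sampled on the precision scale and
# transformed to standard deviations. Some Gamma(0.001, 0.001) draws underflow
# to zero precision, i.e. infinite sd, which is itself part of the problem.
draws <- 1e5
sd_inla_default <- 1 / sqrt(rgamma(draws, shape = 1,     rate = 5e-5))
sd_ig_half      <- 1 / sqrt(rgamma(draws, shape = 0.5,   rate = 5e-4))
sd_ig_epsilon   <- 1 / sqrt(rgamma(draws, shape = 0.001, rate = 0.001))

signif(rbind(INLA_default = quantile(sd_inla_default, c(0.05, 0.5, 0.95)),
             IG_half      = quantile(sd_ig_half,      c(0.05, 0.5, 0.95)),
             IG_epsilon   = quantile(sd_ig_epsilon,   c(0.05, 0.5, 0.95))), 2)
# the first two are squashed onto tiny standard deviations; the third is absurdly diffuse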

F.E.E.L.I.N.G.C.A.L.L.E.D.L.O.V.E.

So if the standard deviation is the correct parameter for setting a prior, what sort of prior should we set on it?

Well I have some good news for you: there are several popular alternatives and no clear winner. This is pretty common when you do research on priors. If you don’t screw up the prior completely, it’s a higher-order effect. This means that asymptotically, the effect of the prior on the posterior is smaller than the effect of sampling variation in the data.

But this didn’t happen in the above example, where there was a very clear effect. There are a few reasons for this. Firstly, the data only has 200 observations, so we may be so far from asymptopia that the prior is still quite important. The second reason is more interesting. There is more than one sensible asymptotic regime for this problem, and which one we should use depends on the details of the application.

The data, as it stands, has 200 records and the random effect0 has 200 different levels.  There are essentially 3 different routes to asymptopia here. The first one is that we see an infinite number of repeated observations for these 200 levels. The second is a regime where we have $latex n$ observations,  the random effect0 has $latex n$ levels, and we send $latex n\rightarrow \infty$. And the third option is half way between the two where the number of levels grows slower than the number of observations.

It is a possibly under-appreciated point that asymptotic assessments of statistical methods are conditional on the particular way you let things go to infinity. The “standard” asymptotic model of independent replication is quite often unrealistic and any priors constructed to take advantage of this asymptotic structure might have problems. This is particularly relevant to things like reference priors, where the asymptotic assumptions are strongly baked in to the prior specification. But it’s also relevant to any prior that has an asymptotic justification or any method for sense-checking priors that relies on asymptotic reasoning.

This girl’s in love with you

Well that got away from me, let’s try it again. What should a prior for the standard deviation of a Gaussian random effect0 look like?

The first thing is that it should peak at zero and go down as the standard deviation increases. Why? Because we need to ensure that our prior doesn’t prevent the model from easily finding the case where the random effect0 should not be in the model. The easiest way to ensure this is to have the prior decay away from zero. My experience is that this also prevents the model from over-fitting (in the sense of having too much posterior mass on very “exciting” configurations of the random effects0 that aren’t clearly present in the data). This idea that the prior should decrease monotonically away from a simple base model gives us some solid ground to start building our prior on.

The second thing that we want from a prior is containment. We want to build it so that it keeps our parameters in a sensible part of the model space. One way to justify this (and define “sensible”) is to think about the way Rubin characterized the posterior distribution in a very old essay (very old = older than me). He argued that you can conceptualize the posterior through the following procedure (here $latex y_\text{obs}$ is the observed data):

  • Simulate $latex \theta_j \sim p(\theta)$
  • Simulate $latex y_j \sim p(y \mid \theta)$
  • If $latex y_j = y_\text{obs}$, then keep $latex \theta_j$, otherwise discard it.

The set of $latex \theta_j$ that are generated with this procedure are samples from the posterior distribution.
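
This procedure is completely impractical for real data, but for a tiny discrete example you can actually run it. A minimal sketch (the exponential prior, Poisson likelihood, and observed value are all invented purely for illustration):

    # Rubin's characterization of the posterior, by brute force.
    # Model: theta ~ Exponential(1), y | theta ~ Poisson(theta), observed y = 3.
    set.seed(1)
    y_obs   <- 3
    theta_s <- rexp(1e6, rate = 1)
    y_s     <- rpois(1e6, lambda = theta_s)
    kept    <- theta_s[y_s == y_obs]
    # the kept draws are samples from p(theta | y = 3), which here is Gamma(4, 2)
    c(mean(kept), 4 / 2)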

This strongly suggests that the prior should be built to ensure that the following containment property holds: The prior predictive distribution $latex p(y) = \int p(y \mid \theta)p(\theta)\,d\theta$ should be slightly broader than a prior predictive distribution that only has mass on plausible data sets.

This requirement is slightly different to Andrew’s original definition1 of a weakly informative prior (which is a prior on a parameter that is slightly broader than the best informative prior that we could put on a parameter without first seeing the data). The difference is that while the original definition of a WIP requires us to have some sort of interpretation of the parameter that we are putting the prior on, the containment condition only requires us to understand how a parameter affects the prior data generating process. Containment also reflects our view that “the prior can often only be understood in the context of the likelihood”.

If this all sounds like it would lead to very very complicated maths, it probably will. But we can always check containment visually. We have a discussion of this in our (re-written the other month to read less like it was written under intense deadline pressure) paper on visualization and the Bayesian workflow. That paper was recently accepted as a discussion paper in the Journal of the Royal Statistical Society Series A (date of the discussion meeting TBC) which is quite nice.
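
To make “check containment visually” a bit more concrete, here is a rough prior predictive simulation for the simple model from footnote 0.25, $latex y\sim \text{Po}(e^x)$, $latex x\sim N(0,s^2)$. The comparison between a half-Cauchy(0, 25) prior on $latex s$ and an exponential prior scaled so that $latex P(s<1)=0.9$ is my own choice for illustration:

    set.seed(1971)
    n_sim <- 1e4
    prior_predictive <- function(sigma) {
      x <- rnorm(length(sigma), 0, sigma)
      rpois(length(sigma), lambda = exp(x))   # returns NA (with a warning) when exp(x) is huge
    }
    y_cauchy <- prior_predictive(abs(rcauchy(n_sim, 0, 25)))
    y_exp    <- prior_predictive(rexp(n_sim, rate = -log(0.1)))
    mean(is.na(y_cauchy))   # a sizeable fraction of "impossible" prior data sets
    mean(is.na(y_exp))      # essentially none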

Standard Bitter Love Song #1

An immediate consequence of requiring this sort of containment condition is that we need to consider all of the parameters in our model simultaneously in order to check that it is satisfied. This means that we can’t just specify priors one at a time when a model has a number of random effects0, or when there are other types of model component.

For the sake of sanity, we can simplify the process by breaking up the parameters into groups that affect different parts of the data generating process. For instance, if you have several random effects0, a good way to parameterize their variances is to put a prior on the standard deviation of the total variability and then a simplex-type prior to distribute the total standard deviation across the many components.

There are some complexities here (eg what do you do when the random effects0 have differing numbers of levels or differing dependence structures between levels?). Some suggested solutions are found towards the end of this paper and in this paper. A similar idea (except it’s the variances that are distributed rather than the standard deviations) is used for the default priors in rstanarm. The same idea pops up again for setting priors on mixture models.

I am leaving you because I don’t love you

So while that was a fun diversion, it still hasn’t given us a prior to use on the variance of a random effect0. But we do now know what we’re looking for.

There are a number of priors that decay away from the base model and give us some form of containment. Here is an incomplete, but popular, list2 of simple containment priors:

  • A half-Gaussian on the standard deviation
  • An exponential on the standard deviation
  • A half-t with 7 degrees of freedom on the standard deviation
  • A half-t with 3 degrees of freedom on the standard deviation
  • A half-Cauchy on the standard deviation.

These were listed from lightest to heaviest tail on purpose.

All of these priors require the specification of one parameter controlling the width of the 90% highest probability density interval2.5. This parameter allows us to control containment.  For example, if the random effect0 was part of a model for the log-relative risk of a disease, it may be sensible to set the parameter so that the standard deviation is less than 1 with 90% prior probability. This would correspond to the random effect0 being contained within [-3,3] and hence the random effect’s0 contribution to the relative risk being constrained to a multiplicative factor within [0.05, 20].
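
Finding the scale that delivers a particular containment statement is a couple of lines of R. A sketch for the “standard deviation less than 1 with 90% prior probability” example above (swap the 1 for whatever bound makes sense in your problem):

    rate_exp      <- -log(1 - 0.9) / 1     # exponential rate  ~ 2.30
    sd_halfnorm   <- 1 / qnorm(0.95)       # half-normal scale ~ 0.61
    scale_hcauchy <- 1 / qcauchy(0.95)     # half-Cauchy scale ~ 0.16
    # sanity checks: each of these should return 0.9
    pexp(1, rate = rate_exp)
    2 * pnorm(1, sd = sd_halfnorm) - 1
    2 * pcauchy(1, scale = scale_hcauchy) - 1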

This brings out an important aspect of containment priors: they are problem dependent. Although this example does not need a lot of expert information to set a sensible prior, it does need someone to understand how big a deviation from the baseline risk is unlikely for the particular scenario you are modelling. There isn’t a way around this. You either explicitly encode information about the problem in a prior, hide it in the structure of your asymptotic assumptions, or just throw your prior against a wall and hope that it sticks.

One way to standardize the procedure for setting priors is to demand that the model is correctly scaled in order to ensure that all of the parameters are on the unit scale. This can be done structurally (if your expected counts are well constructed, the relative risk will often be on unit scale). It can also be done by introducing some data-dependence into the prior, although this is a little more philosophically troublesome and you have to be diligent with your model assessment.

As for which of these priors is preferred, it really depends on context. If your model has a lot of random effects0, or the likelihood is sensitive to extreme values of the random effect0, you should opt for a lighter tail. On the other hand, a heavier tail goes some way towards softening the importance of correctly identifying the scale of the random effect0.

To put some skin in the game, last time I talked to them about this3 Andrew seemed more in favour of good scaling and a half-Normal(0,1) prior, I like an exponential prior with an explicitly specified 90% quantile for the standard deviation, and Aki seemed to like one of the Student-t distributions. Honestly, I’m not sure it really makes a difference and with the speed of modern inference engines, you can often just try all three and see which works best. A nice4 simulation study in this direction was done by Nadja Klein and Thomas Kneib.

I love you but you’re dead

None of us are particularly fond of the half-Cauchy as a prior on the standard deviation. This was one of the early stabs at a weakly informative prior for the standard deviation. It does some nice things, for example if you marginalize out a standard deviation with a half-Cauchy prior, you get a distribution on the random effect0 that has a very heavy tail. This is the basis for the good theoretical properties of the Horseshoe prior for sparsity, as well as the good mean-squared error properties of the posterior mean estimator for the normal means problem.

But this prior has fallen out of favour with us for a couple of reasons. Firstly, the extremely heavy tail makes for a computational disaster if you’re actually trying to do anything with this prior. Secondly, it turns out that some regularization does good things for sparsity estimators.

But for me, it’s the third problem that sinks the prior. The tail is so heavy that if you simulate from the model you frequently get extremely implausible data sets. This means that the half-Cauchy prior on the standard deviation probably doesn’t satisfy our requirement that a good prior satisfies the containment property.

Can’t help lovin’ that man

Right. We’ve managed to select a good prior for the standard deviation for a Gaussian random effect0. The next thing to think about is whether or not there are any generalizable lessons here. (Hint: There are.) So let us look at a very similar model that would be more computationally convenient to fit in Stan and see that, at least, all of the ideas above still work when we change the distribution of the random effect0.

The role of the random effect0 in the example model is to account for over-dispersion in the count data (allowing the variance to be larger than the mean). An alternative model that does the same thing is to take the likelihood as negative-binomial rather than Poisson. The resulting model no longer has a random effect0 (it’s a pure GLM). If you’re fitting this model in Stan, this is probably a good thing as the dimension of the parameter space goes from 203 to 3, which will definitely make your model run faster!

To parameterize the negative binomial distribution, we introduce an over-dispersion parameter $latex \phi$ with the property that the mean of the negative binomial is $latex \text{E}(y)=\mu$ and the variance is $latex \text{Var}(y)=\mu(1+\phi\mu)$.

We need to work out a sensible prior for the over-dispersion parameter. This is not a particularly well-explored topic in the Bayesian literature. And it’s not really obvious what a sensible prior will look like. The effect of $latex \phi$ on the distribution of $latex y$ is intertwined with the effect of $latex \mu$.

One way through this problem is to note that setting a prior on $latex \phi$ is in a lot of ways quite similar to setting a prior on the standard deviation of a Gaussian random effect0.

To see this, we note that we can write the negative binomial with mean $latex \mu$ and overdispersion parameter $latex \phi$ as

$latex y \mid z \sim \text{Poisson}(z\mu)$

$latex z \sim \text{Gamma}(\phi^{-1},\phi^{-1})$.

This is different to the previous model. The Gamma distribution for $latex z$ has a heavier left tail and a lighter right tail than the log-normal distribution that was implied by the previous model. That being said, we can still apply all of our previous logic to this model.
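
If you don’t believe the mixture representation (or the variance formula), a thirty-second simulation will check it. The particular values of $latex \mu$ and $latex \phi$ below are arbitrary; note that R’s rnbinom() uses size = $latex 1/\phi$:

    set.seed(8)
    n   <- 1e6
    mu  <- 5
    phi <- 0.7
    z     <- rgamma(n, shape = 1 / phi, rate = 1 / phi)
    y_mix <- rpois(n, lambda = z * mu)
    y_nb  <- rnbinom(n, size = 1 / phi, mu = mu)
    c(mean(y_mix), var(y_mix))   # approx 5 and 22.5
    c(mean(y_nb),  var(y_nb))    # approx 5 and 22.5
    mu * (1 + phi * mu)          # 22.5 = mu * (1 + phi * mu)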

The concept of the base model here would be a spike at $latex z=1$, which recovers a plain Poisson distribution. (The argument is that this is a good base model because every other model achievable with this structure is more interesting, in the sense that the mean and the variance differ from each other.) The base model occurs when $latex \phi = 0$.

So we now need to work out how to ensure containment for this type of model. The first thing to do is to try and make sure we have a sensible parameterization so that we can use one of our simple containment priors.

The gamma distribution has a mean of 1 and a variance of $latex \phi$, so one option would be to completely follow our previous logic and use $latex \sqrt{\phi}$ as a sensible transformation. Because the Gamma distribution is highly skewed, it isn’t completely clear that the standard deviation is as good a parameterization as it is in the previous model. But it turns out we can justify it in a completely different way, which suggests it might be a decent choice.

A different method for finding a good parameterization for setting priors was explored in our PC priors paper. The idea is that we can parameterize our model according to its deviation from the base model. In both the paper and the rejoinder to the discussion, we give a pile of reasons why this is a fairly good idea.

In the context of this post, the thing that should be clear is that this method will not ensure containment directly. Instead, we are parameterizing the model by a deviation $latex d(\phi)$ so that if you increase the value of $latex d(\phi)$ by one unit, the model gets one unit more interesting (in the sense that the square root5 of the amount of information lost when you replace this model by the base model increases by one). The idea is that with this parameterization we will contain $latex z$ and hopefully therefore constrain $latex y$.

If we apply the PC prior re-parameterization to the Gaussian random effects0 model, we end up setting the prior on the standard deviation of the random effect0, just as before. (This is a sense check!)

For the Gamma random effect0, some tedious maths leads to the exact form of the distance (Warning: this looks horrible. It will simplify soon.)

$latex d(\phi)=\sqrt{\lim_{\phi_0\rightarrow 0}\phi_0^{-1}KL\left[\text{Gamma}(\phi^{-1},\phi^{-1})\,||\,\text{Gamma}(\phi_0^{-1},\phi_0^{-1})\right]}=\sqrt{\log(\phi^{-1})-\psi(\phi^{-1})}$,

where $latex KL[\cdot \,||\, \cdot]$ is the Kullback-Leibler divergence, and $latex \psi(\cdot)$ is the digamma function (the derivative of the log-gamma function).

To simplify the distance, let’s take a look at what it looks like for very small and very large $latex \phi$.  When $latex \phi$ is near zero, some rooting around the internet shows that $latex d(\phi)=\mathcal{O}(\sqrt{\phi})$. Similarly, when $latex \phi$ is large, we also get $latex d(\phi)=\mathcal{O}(\sqrt{\phi})$. Moreover, if you plot the square of the distance, it looks a lot like a straight line (it isn’t quite, but it’s very close). This suggests that putting a containment prior on the parameter $latex d(\phi)\approx\sqrt{\phi}$ might be a good idea.
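
This is easy to eyeball yourself: the $latex \psi(\cdot)$ in the distance is just R’s digamma(), so you can plot the squared distance against $latex \phi$ and see how close to a straight line it is. A quick sketch:

    d <- function(phi) sqrt(log(1 / phi) - digamma(1 / phi))
    phi <- seq(0.01, 10, length.out = 500)
    plot(phi, d(phi)^2, type = "l", xlab = "phi", ylab = "d(phi)^2")
    # near zero, d(phi)^2 behaves like phi / 2; for large phi it behaves like phi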

So the end of this journey is that something like an appropriately scaled half-normal, exponential, or half-t distribution on $latex \sqrt{\phi}$ is a good candidate for a containment prior for the over-dispersion parameter of a negative binomial distribution. The truth of this statement can be evaluated using prior data simulations to check containment, and posterior sensitivity and predictive checks to check that the prior is appropriate for the problem at hand. Because no matter how good I think the mathematical argument is for a prior, it is still vital to actually verify the theory numerically for the particular problem you are trying to solve.

To bring you my love

If I had to summarize this very long post, I’d probably say the following:

  • Vague priors tend to be “accidentally informative” in random effects0 models (and other somewhat complicated statistical models)
  • The first thing you need to think about when setting a prior is to find a parameterization that is at least a little bit interpretable. If you can’t give an “in words” interpretation of your prior, you probably haven’t put enough care into setting it.
  • Base models are very useful to work out how to set priors. Think about what the most boring thing that your model can do and expand it from there.
  • The idea of containment is hiding in a lot of the ways people write about priors. It’s a good thing and we should pay more attention to it.
  • Containment tells us explicitly that we need to consider the joint prior on all parameters of the model, rather than just thinking about the priors on each parameter independently. Sometimes we can cheat if different sets of parameters change (almost) disjoint aspects of a model.
  • Containment also suggests that the priors need to respect the scale of the effect a parameter has. Sometimes we can fix this through a linear scaling (as in regression coefficients or standard deviation parameters), but sometimes we need to be more creative (as in the over-dispersion parameter).
  • Containment makes it hard to specify a universal default prior for a problem, but we can still specify a universal default procedure for setting priors for a problem.
  • We can always check containment using prior simulations from the model.
  • We can always assess the effect of the prior on the estimates through careful prior and posterior checks (this is really the story of our Visualization and Workflow paper).
  • These considerations stretch far beyond the problem of specifying priors for variance components in multilevel models.
Footnotes:

-1 For those of you who use “millennial” as a synonym for “young”, I promise you we are not. Young people were not alive when the Millennium turned. The oldest millennials are 35.

0 When I talk about random effects, I’m using the word in the sense of Hodges and Clayton, who have a lovely paper that looks at how that term should be modernized. In particular, they point out the way that in modern contexts, random effects can be interpreted as “formal devices to facilitate smoothing or shrinkage, interpreting those terms broadly”. Everything that I talk about here holds for the scale parameter of a Gaussian process or a spatial random effect. Andrew and Aki prefer to just use different terms. The specific example that I’m using falls into their definition of an “old-style” random effect, in that the distribution of the random effect is of interest rather than the particular value of the realization. I personally think that “random effect” is a sufficiently broad, accessible definition that bridges decades of theory with current practice, so I’m happy with it. But mileage obviously varies. (I have tagged the term every time it’s used in case someone gets half way through before they realize they’re not sure what I mean.)

0.25 Even if you use the suggested prior in Andrew’s paper (a half-Cauchy with scale parameter 25 on the standard deviation), bad things will happen. Consider a simple version of our problem, where $latex y\sim Po(e^x)$, $latex x\sim N(0,s^2)$. If we put the recommended half-Cauchy prior on $latex s$, then the simulation of $latex y$ will return an NA in R around 20% of the time. This happens because it tries to draw an integer bigger than $latex 2^{31}$, which is the largest integer that R will let you use. This will lead to very very bad inference and some fun numerical problems in small-data regimes.

0.5 Posterior intervals that are consistent with a particular estimator with some good frequentist properties isn’t necessarily a sign that a prior is good. Really there should be some proper posterior checks here. I have not done them because this is a blog.

1 Now I didn’t go back and check Andrew’s paper, so this may well be my memory of my original interpretation of Andrew’s definition of a Weakly Informative Prior. There are turtles everywhere!

2 These options are taken from the Stan prior choice wiki, which has a lot of interesting things on it.

2.5 I can’t say it controls the standard deviation because the half-Cauchy doesn’t have a well-defined standard deviation. So I picked a 90% highest density interval as the thing to be controlled. Note that because we’ve specified priors that decay away from the base model, these are one-sided intervals. Obviously if you want to use something else, use something else.

3 Obviously I do not speak for them here and you really shouldn’t take my half-memory of a conversation at some point in the dusty, distant past as a reflection of their actual views. But we do have a range of opinions on this matter, just not such a diverse set of opinions that we actually disagree with each other.

4 The exponential prior on the standard deviation (which is the PC prior for this model) did very well in these simulations, so obviously I very much like the results!

5 The square root is there for lots of good reasons, but mainly to make sure all of the scales come out right. For the maths-y, it’s strongly suggested by Pinsker’s inequality.

What is not but could be if

And if I can remain there I will say – Baby Dee

Obviously this is a blog that loves the tabloids. But as we all know, the best stories are the ones that confirm your own prior beliefs (because those must be true).  So I’m focussing on this article in Science that talks about how STEM undergraduate programmes in the US lose gay and bisexual students.  This leaky pipeline narrative (that diversity is smaller the further you go in a field because minorities drop out earlier) is pretty common when you talk about diversity in STEM. But this article says that there are now numbers! So let’s have a look…

And when you’re up there in the cold, hopin’ that your knot will hold and swingin’ in the snow…

From the article:

The new study looked at a 2015 survey of 4162 college seniors at 78 U.S. institutions, roughly 8% of whom identified as LGBQ (the study focused on sexual identity and did not consider transgender status). All of the students had declared an intention to major in STEM 4 years earlier. Overall, 71% of heterosexual students and 64% of LGBQ students stayed in STEM. But looking at men and women separately uncovered more complexity. After controlling for things like high school grades and participation in undergraduate research, the study revealed that heterosexual men were 17% more likely to stay in STEM than their LGBQ male counterparts. The reverse was true for women: LGBQ women were 18% more likely than heterosexual women to stay in STEM.

Ok. There’s a lot going on here. First things first, let’s say a big hello to Simpson’s paradox! Although LGBQ people have a lower attainment rate in STEM, it’s driven by men going down and women going up. I think the thing that we can read straight off this is that there are “base rate” problems happening all over the place. (Note that the effect is similar across the two groups and in opposite directions, yet the combined total is fairly strongly aligned with the male effect.) We are also talking about a drop out of around 120 of the 333 LGBQ students in the survey. So the estimate will be noisy.

I’m less worried about forking paths–I don’t think it’s unreasonable to expect the experience to differ across gender. Why? Well there is a well known problem with gender diversity in STEM.  Given that gay women are potentially affected by two different leaky pipelines, it sort of makes sense that the interaction between gender and LGBQ status would be important.

The actual article does better–it’s all done with multilevel logistic regression, which seems like an appropriate tool. There are p-values everywhere, but that’s just life. I struggled to work out from the paper exactly what the model was (sometimes my eyes just glaze over…), but it seems to have been done fairly well.

As with anything however (see also Gayface), the study is only as generalizable as the data set. The survey seems fairly large, but I’d worry about non-response. And, if I’m honest with you, me at 18 would’ve filled out that survey as straight, so there are also some problems there.

My father’s affection for his crowbar collection was Freudian to say the least

So a very shallow read of the paper makes it seem like the stats is good enough. But what if it’s not? Does that really matter?

This is one of those effects that’s anecdotally expected to be true. But more importantly, a lot of the proposed fixes are the types of low-cost interventions that don’t really need to work very well to be “value for money”.

For instance, it’s suggested that STEM departments work to make LGBT+ visibility more prominent (have visible, active inclusion policies). They suggest that people teaching pay attention to diversity in their teaching material.

The common suggestion for the last point is to pay special attention to work by women and under-represented groups in your teaching. This is never a bad thing, but if you’re teaching something very old (like the central limit theorem or differentiation), there’s only so much you can do. The thing that we all have a lot more control over is our examples and exercises. It is a no-cost activity to replace, for example, “Bob and Alice” with “Barbra and Alice” or “Bob and Alex”.

This type of low-impact diversity work signals to students that they are in a welcoming environment. Sometimes this is enough.

A similar example (but further up the pipeline) is that when you’re interviewing PhD students, postdocs, researchers, or faculty, don’t ask the men if they have a wife. Swapping to a gender neutral catch-all (partner) is super-easy. Moreover, it doesn’t force a person who is not in an opposite gender relationship to throw themselves a little pride parade (or, worse, to let the assumption fly because they’re uncertain if the mini-pride parade is a good idea in this context). Partner is a gender-neutral term. They is a gender-neutral pronoun. They’re not hard to use.

These environmental changes are important. In the end, if you value science you need to value diversity. Losing women, racial and ethnic minorities, LGBT+ people, disabled people, and other minorities really means that you are making your talent pool more shallow. A deeper pool leads to better science and creating a welcoming, positive environment is a serious step towards deepening the pool.

In defence of half-arsed activism

Making a welcoming environment doesn’t fix STEM’s diversity problem. There is a lot more work to be done. Moreover, the ideas in the paragraph above may do very little to improve the problem. They are also fairly quiet solutions–no one knows you’re doing these things on purpose. That is, they are half-arsed activism.

The thing is, as much as it’s lovely to have someone loudly on my side when I need it, I mostly just want to feel welcome where I am. So this type of work is actually really important. No one will ever give you a medal, but that doesn’t make it less appreciated.

The other thing to remember is that sometimes half-arsed activism is all that’s left to you. If you’re a student, or a TA, or a colleague, you can’t singlehandedly change your work environment. More than that, if a well-intentioned-but-loud intervention isn’t carefully thought through it may well make things worse. (For example, a proposal at a previous workplace to ensure that all female students (about 400 of them) have a female faculty mentor (about 7 of them) would’ve put a completely infeasible burden on the female faculty members.)

So don’t discount low-key, low-cost, potentially high-value interventions. They may not make things perfect, but they can make things better and maybe even “good enough”.

Anybody want a drink before the war?

Your lallies look like darts, and you’ve got nanti carts, but I love your bona eke – Lee Sutton (A near miss)

I’ve been thinking about gayface again. I guess this is for a bunch of reasons, but one of the lesser ones is that this breathless article by JD Schramm popped up in the Washington Post the other day. This is not just because it starts with a story I relate to very deeply about accidentally going on a number of dates with someone. (In my case, it was a woman, I was 19, and we bonded over musical theatre. Apparently this wasn’t a giveaway. Long story short, she tried to kiss me, I ran away.)

There is no other Troy for me to burn

The main gist of Schramm’s article is that the whole world is going to end and all the gays are going to be rounded up into concentration camps or some such things due to the implications of Wang and Kosinski’s gayface article.

I think we can safely not worry about that.

Success has made a failure of our home

For those of you who need a refresher, the Wang and Kosinski paper had some problems. They basically scraped a bunch of data from an undisclosed dating site that caters to men and women of all persuasions and fed it to a deep neural network face recognition program to find facial features that were predictive of being gay or predictive of being straight.  They then did a sparse logistic regression to build a classifier.

There were some problems.

  1. The website didn’t provide sexual orientation information, but only a “looking for” feature. Activity and identity overlap but are not actually the same thing, so we’re already off to an awkward start.

To go beyond that, we need to understand what these machine learning algorithms can actually do. The key thing is that they do not extrapolate well. They can find deep, non-intuitive links between elements of a sample (which is part of why they can be so successful for certain tasks), but they can’t imagine unobserved data.

For example, if we were trying to classify four legged creatures and we fed the algorithm photos of horses and donkeys, you’d expect it to generalize well to photos of mules, but less well to photos of kangaroos.

To some extent, this is what we talk about when we talk about “generalization error”. If a classifier does a great job on the data it was trained on (and holdout sets thereof), but a terrible job on a new dataset, one explanation would be that the training data is in some material way different from the new data set. This would turn classification on the new data set into an extrapolation task, which is an area where these types of algorithms struggle.

  2. The training data set is a terrible representative of the population. The testing set is even worse.

There are other problems with the paper. My favourite is that they find that facial brightness is positively correlated with the probability of being gay, and they posit that a possible reason for this is that an overabundance of testosterone darkens skin. Essentially, they argue that straight people are a bit dull because they’ve got too much testosterone.

As much as I enjoy the idea that they’ve proposed some sort of faggy celestial navigation (you’re never lost if there’s a gay on the horizon to light your way to safety), it’s not that likely. More likely, gay men use more filters in their dating profile shots and we really should sound the correlation is not causation klaxon.

How I be me (and you be you)?

But the howling, doom-laden tone of the Schramm piece did make me think about whether building the sort of AI he’s warning against would even be possible.

Really, we’re talking about passive gaydar here, where people pick up on if you’re gay based solely on information that isn’t actively broadcast. Active gaydar is a very different beast–this requires a person to actively signal their orientation. Active gaydar is known to confuse whales and cause them to beach themselves, so please avoid using it near large bodies of water.

To train an AI system to be a passive gaydar, you would need to feed it with data that covered the broad spectrum of presentation of homosexuality. This is hard. It differs from place to place, among cultures, races, socioeconomic groups. More than this, it’s not stable in time. (A whole lot more straight people know drag terminology now than a decade ago, even if they do occasionally say “cast shade” like it’s a D&D spell.)

On top of this, the LGB population is only a small fraction of the whole population. This means that even a classifier that very accurately identifies known gay people or known straight people will be quite inaccurate when applied to the whole population. This is the problem with conditional probabilities!
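
If you want to see just how brutal the base rate is, here is the standard Bayes’ rule arithmetic. The prevalence and accuracy numbers below are invented purely for illustration; the point is the shape of the answer, not the specific values:

    # Base-rate arithmetic: 5% prevalence, classifier correct 90% of the time on both groups
    prev <- 0.05; sens <- 0.90; spec <- 0.90
    (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
    # about 0.32: roughly two-thirds of the people the classifier flags are straight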

I think we need to be comfortable with the idea that among all of the other reasons we shouldn’t try to build an AI gaydar, we probably just can’t. Building an AI gaydar would be at least as hard as building a self-driving car. Probably much harder.

Eid ma clack shaw zupoven del ba.

When I say “I love you”, you look accordingly skeptical – Frida Hyvönen

A few years back, Bill Callahan wrote a song about the night he dreamt the perfect song. In a fever, he woke and wrote it down before going back to sleep. The next morning, as he struggled to read his handwriting, he saw that he’d written the nonsense that forms the title of this post.

Variational inference is a lot like that song; dreams of the perfect are ruined in the harsh glow of the morning after.

(For more unnaturally tortured metaphors see my twitter. I think we can all agree setting one up was a fairly bad choice for me.)

But how can we tell if variational inference has written the perfect song or, indeed, if it has laid an egg? Unfortunately, there doesn’t seem to be a lot of literature to guide us. We (Yuling, Aki, Me, and Andrew) have a new paper to give you a bit more of an idea.

Palimpsest

The guiding principle of variational inference is that if it’s impossible to work with the true posterior $latex p(\theta \mid y)$, then near enough is surely good enough. (It seldom is.)

In particular, we try to find the  member $latex q^*(\theta)$ of some tractable set of distributions $latex \mathcal{Q}$ (commonly the family of multivariate Gaussian distributions with diagonal covariance matrices) that minimizes the Kullback-Leibler divergence

$latex q^*(\theta) = \arg \min_{q \in \mathcal{Q}} KL\left[\,q(\cdot) \mid p(\cdot \mid y)\,\right].$

The Kullback-Leibler divergence in this direction (it’s asymmetric, so the order of arguments is important) can be interpreted as the amount of information lost if we replace the approximate posterior $latex q(\theta)$ with the true posterior $latex p(\theta \mid y)$. Now, if this seems like the wrong way around to you [that we should instead worry about what happens if we replace the target posterior $latex p(\theta \mid y)$ with the approximation $latex q(\theta)$], you would be very very correct.  That Kullback-Leibler divergence is backwards.

What does this mean? Well it means that we won’t penalize approximate distributions that are much less complex than the true one as heavily as we should. How does this translate into real life? It means that usually we will end up with approximations $latex q^*(\theta)$ that are narrower than the true posterior. Usually this manifests as distributions with lighter tails.

(Quiet side note: Why are those last few sentences so wishy-washy? Well it turns out that minimizing a Kullback-Leibler divergence in the wrong direction can do all kinds of things to the resulting minimizer and it’s hard to really pin down what will happen.  But it’s almost everyone’s experience that the variational posterior $latex q(\theta)$ is almost always narrower than the true posterior. So the previous paragraph is usually true.)
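
One place where you can see the narrowing exactly is when both the target and the approximation are Gaussian, because the KL divergence is available in closed form. A small sketch (this is not ADVI, just a direct optimization of the mean-field KL objective for a made-up correlated 2-d “posterior”):

    # Mean-field Gaussian q fit to a correlated 2-d Gaussian "posterior" p
    # by minimizing the closed-form KL(q || p). The correlation 0.9 is made up.
    rho   <- 0.9
    Sigma <- matrix(c(1, rho, rho, 1), 2, 2)   # true posterior covariance
    Q     <- solve(Sigma)                      # its precision matrix
    kl_q_p <- function(log_s) {                # q = N(0, diag(s^2)), s = exp(log_s)
      s2 <- exp(2 * log_s)
      0.5 * (sum(diag(Q) * s2) - 2 + log(det(Sigma)) - sum(2 * log_s))
    }
    fit <- optim(c(0, 0), kl_q_p)
    exp(2 * fit$par)   # fitted variances: about 1 - rho^2 = 0.19 each
    diag(Sigma)        # true marginal variances are 1, so q is far too narrow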

So variational inference is mostly set up to fail. Really, we should be surprised it works at all.

Cold discovery

There are really two things we need to check when we’re doing variational inference. The first is that the optimization procedure that we have used to compute $latex q^*(\theta)$ has actually converged to a (local) minimum.  Naively, this seems fairly straightforward. After all, we don’t think of maximum likelihood estimation as being hard computationally, so we should be able to solve this optimization problem easily. But it turns out that if we want our variational inference to be scalable in the sense that we can apply it to big problems, we need to be more clever. For example Automatic Differentiation Variational Inference (ADVI) uses a fairly sophisticated stochastic optimization method to find $latex q^*(\theta)$.

So first we have to make sure the method actually converges.  I don’t really want to talk about this, but it’s probably worth saying that it’s not trivial and stochastic methods like ADVI will occasionally terminate too soon. This leads to terrible approximations to the true posterior. It’s also well worth saying that if the true posterior is multimodal, there’s no guarantee that the minimum that is found will be a (nearly) global one.  (And if the approximating family $latex \mathcal{Q}$ only contains unimodal distributions, we will have some problems!) There are perhaps some ways out of this (Yuling has many good ideas), but the key thing is that if you want to actually know if there is a potential problem, it’s important to run multiple optimizations beginning at a diverse set of initial values.

Anyway, let’s pretend that this isn’t a problem so that we can get onto the main point.

The second thing that we need to check is that the approximate posterior $latex q^*(\theta)$ is an ok approximation to the true posterior $latex p(\theta \mid y )$. This is a much less standard task and we haven’t found a good method for addressing it in the literature. So we came up with two ideas.

Left only with love

Our first idea was based on Aki, Andrew, and Jonah’s Pareto-Smoothed Importance Sampling (PSIS). The crux of our idea is that if $latex q(\theta)$ is a good approximation to the true posterior, it can be used as an importance sampling proposal to compute expectations with respect to $latex p(\theta \mid y)$. So before we can talk about that method, we need to remember what PSIS does.

The idea is that we can approximate any posterior expectation $latex \int h(\theta)p(\theta \mid y)\,d \theta$ using a self-normalized importance sampling estimator. We do this by drawing $latex S$ samples $latex \{\theta_s\}_{s=1}^S$ from the proposal distribution $latex q(\theta)$ and computing the estimate

$latex \int h(\theta)p(\theta \mid y)\,d \theta \approx \frac{\sum_{s=1}^S h(\theta_s) r_s}{\sum_{s=1}^S r_s}.$

Here we define the importance weights as

$latex r_s = \frac{p(\theta_s,y)}{q(\theta_s)}.$

We can get away with using the joint distribution instead of the posterior in the numerator because $latex p(\theta \mid y) \propto p(\theta,y)$ and we re-normalise the estimator. This self-normalized importance sampling estimator is consistent with bias that goes asymptotically like $latex \mathcal{O}(S^{-1})$. (The bias comes from the self-normalization step.  Ordinary importance sampling is unbiased.)
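
In code, the self-normalized estimator defined above is only a few lines. A bare-bones sketch for a toy problem where the answer is known (the target, proposal, and test function are my choices; note that PSIS would additionally smooth the largest weights before forming the ratio):

    # Target p = N(0, 1), proposal q = N(0, 2^2), estimating E_p[theta^2] = 1
    set.seed(12)
    S       <- 1e5
    theta_s <- rnorm(S, 0, 2)                       # draws from q
    log_r   <- dnorm(theta_s, 0, 1, log = TRUE) -   # log target (unnormalized is fine)
               dnorm(theta_s, 0, 2, log = TRUE)     # log proposal
    r_s     <- exp(log_r - max(log_r))              # constants cancel after self-normalization
    sum(theta_s^2 * r_s) / sum(r_s)                 # close to the true value 1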

The only problem is that if the distribution of $latex r_s$ has too heavy a tail, the self-normalized importance sampling estimator will have infinite variance. This is not a good thing. Basically, it means that the error in the posterior expectation could be any size.

The problem is that if the distribution of $latex r_s$ has a heavy tail, the importance sampling estimator will be almost entirely driven by a small number of samples $latex \theta_s$ with very large $latex r_s$ values. But there is a trick to get around this: somehow tamp down the extreme values of $latex r_s$.

With PSIS, Aki, Andrew, and Jonah propose a nifty solution. They argue that you can model the tails of the distribution of the importance ratio with a generalized Pareto distribution 

$latex p(r|\mu, \sigma, k)=\begin{cases} \frac{1}{\sigma} \left( 1+k\left( \frac{r-\mu}{\sigma} \right) \right) ^{-\frac{1}{k}-1}, & k \neq 0. \\ \frac{1}{\sigma} \exp\left( \frac{r-\mu}{\sigma} \right), & k = 0.\\\end{cases} .$

This is a very sensible thing to do: the generalized Pareto is the go-to distribution that you use when you want to model the distribution of  all samples from an iid population that are above a certain (high) value. The PSIS approximation argues that you should take the $latex M$ largest $latex r_s$ (where $latex M$ is chosen carefully) and fit a generalized Pareto distribution to them. You then replace those $latex M$ largest observed importance weights with the corresponding expected order statistics from the fitted generalized Pareto.

There are some more (critical) details in the PSIS paper but the intuition is that we are replacing the “noisy” sample importance weights with their model-based estimates.  This reduces the variance of the resulting self-normalized importance sampling estimator and reduces the bias compared to other options.

It turns out that the key parameter in the generalized Pareto distribution is the shape parameter $latex k$. The interpretation of this parameter is that if the generalized Pareto distribution has shape parameter $latex k$, then the distribution of the sampling weights has $latex \lfloor k^{-1} \rfloor$ moments.

This is particularly relevant in this context as the condition for the importance sampling estimator to have finite variance (and be asymptotically normal) is that the sampling weights have (slightly more than) two moments. This translates to $latex k<1/2$.

VERY technical side note: What I want to say is that the self-normalized importance sampling estimator is asymptotically normal. This was nominally proved in Theorem 2 of Geweke’s 1983 paper. The proof there looks wrong. Basically, he applies a standard central limit theorem to get the result, which seems to assume the terms in the sum are iid. The only problem is that the summands

$latex \frac{h(\theta_s)r_s}{\sum_{s=1}^Sr_s}$

are not independent. So it looks a lot like Geweke should’ve used a central limit theorem for weakly-mixing triangular arrays instead. He did not. What he actually did was quite clever. He noticed that the bivariate random variables $latex (h(\theta_s)r_s, r_s)^T$ are independent and satisfy a central limit theorem with mean $latex (A,w)^T$. From there, a second-order Taylor expansion of the function $latex f(A,w) = A/w$ shows that the sequence

$latex f \left( S^{-1} \sum_{s=1}^S h(\theta_s) r_s, S^{-1} \sum_{s=1}^S r_s\right)$

is also asymptotically normal as long as zero or infinity are never in a neighbourhood of $latex S^{-1}\sum_{s=1}^S r_s $.

End VERY technical side note!

The news actually gets even better! The smaller $latex k$ is, the faster the importance sampling estimate will converge. Even better than that, the PSIS estimator seems to be useful even if $latex k$ is slightly bigger than 0.5. The recommendation in the PSIS paper is that if $latex \hat k <0.7$, the PSIS estimator is reliable.

But what is $latex \hat k$? It’s the sample estimate of the shape parameter $latex k$. Once again, some really nice things happen when you use this estimator. For example, even if we know from the structure of the problem that $latex k<0.5$, if $latex \hat k >0.7$ (which can happen), then importance sampling will perform poorly. The value of $latex \hat k$ is strongly linked to the finite sample behaviour of the PSIS (and other importance sampling) estimators.

The intuition for why the estimated shape parameter is more useful than the population shape parameter is that it tells you when the sample of $latex r_s$ that you’ve drawn could have come from a heavy tailed distribution. If this is the case, there isn’t enough information in your sample yet to push you into the asymptotic regime and pre-asymptotic behaviour will dominate (usually leading to worse than expected behaviour).

Footprints

Ok, so what does all this have to do with variational inference? Well it turns out that if we draw samples from our variational posterior and use them to compute the importance weights, then we have another interpretation for the shape parameter $latex k$:

$latex k = \inf \left\{ k’ > 0 : D_{1/k’}\left(p(\theta \mid y) \,||\, q(\theta)\right) < \infty \right\},$

where $latex D_\alpha (p, q)$ is the Rényi divergence of order $latex \alpha$.  In particular, if $latex k >1$, then the Kullback-Leibler divergence in the more natural direction $latex KL(p(\theta \mid y) \,||\, q(\theta)) = \infty $ even if $latex q(\theta)$ minimizes the KL-divergence in the other direction! Once again, we have found that the estimate $latex \hat k$ gives an excellent indication of the performance of the variational posterior.

So why is checking if $latex \hat k <0.7 $ a good heuristic to evaluate the quality of the variational posterior? There are a few reasons. Firstly, because the variational posterior minimizes the KL-divergence in the direction that penalizes approximations with heavier tails than the posterior much harder than approximations with lighter tails, it is very difficult to get a good $latex \hat k$ value by simply “fattening out” the approximation.  Secondly, empirical evidence suggests that the smaller the value of $latex \hat k$, the closer the variational posterior is to the true posterior. Finally, if $latex \hat k <0.7$ we can automatically improve any expectation computed against the variational posterior using PSIS.  This makes this tool both a diagnostic and a correction for the variational posterior that does not rely too heavily on asymptotic arguments. The value of $latex \hat k$ has also proven useful for selecting the best parameterization of the model for the variational approximation (or equivalently, between different approximation families).

There are some downsides to this heuristic. Firstly, it really does check that the whole variational posterior is like the true posterior. This is a quite stringent requirement that variational inference methods often do not pass. In particular, as the number of dimensions increases, we’ve found that unless the approximating family is particularly well-chosen for the problem, the variational approximation will eventually become bad enough that $latex \hat k$ will exceed the threshold. Secondly, this diagnostic only considers the full posterior and cannot be modified to work on lower-dimensional subsets of the parameter space. This means that if the model has some “less important” parameters, we still require their posteriors to be very well captured by the variational approximation.

Let me see the colts

The thing about variational inference is that it’s actually often quite bad at estimating a posterior. On the other hand, the centre of the variational posterior is much more frequently a good approximation to the centre of the true posterior. This means that we can get good point estimates from variational inference even if the full posterior isn’t very good. So we need a diagnostic to reflect this.

Into the fray steps an old paper of Andrew’s (with Samantha Cook and Don Rubin) on verifying statistical software. We (mainly Stan developer Sean) have been playing with various ways of extending and refining this method for the last little while and we’re almost done on a big paper about it. (Let me tell you: god may be present in the sweeping gesture, but the devil is definitely in the details.) Thankfully for this work, we don’t need any of the new detailed work we’ve been doing. We can just use the original results as they are (with just a little bit of a twist).

The resulting heuristic, which we call Variational Simulation-Based Calibration (VSBC), complements the PSIS diagnostic by assessing the average performance of the implied variational approximation to univariate posterior marginals. One of the things that this method can do particularly well is indicate if the centre of the variational posterior will be, on average, biased.  If it’s not biased, we can apply clever second-order corrections (like the one proposed by Ryan Giordano, Tamara Broderick, and Michael Jordan).

I keep saying “on average”, so what do I mean by that? Basically, VSBC looks at how well the variational posterior is calibrated by computing the distribution of $latex p_j = P( \theta < \tilde{\theta}_j \mid y_j)$, where $latex y_j$ is simulated from the model with parameter $latex \tilde{\theta}_j$ that is itself drawn from the prior distribution. If the variational inference method is calibrated, then Cook et al. showed that the histogram of $latex p_j$ should be uniform.

This observation can be generalized using insight from the forecast validation community: if the histogram of $latex p_j$ is asymmetric, then the variational posterior will be (on average over data drawn from the model) biased. In the paper, we have a specific result, which shows that this insight is exactly correct if the true posterior is symmetric, and approximately true if it’s fairly symmetric.
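
To see what the calibration histogram is supposed to look like, here is a minimal sketch in a conjugate normal model where the “approximate” posterior is actually the exact posterior, so the histogram should come out flat. (In real VSBC you would replace the exact posterior below with your variational approximation; the model is my own toy example.)

    # theta_j ~ N(0, 1), y_j | theta_j ~ N(theta_j, 1), exact posterior N(y_j/2, 1/2)
    set.seed(3)
    J       <- 5000
    theta_j <- rnorm(J, 0, 1)                # draws from the prior
    y_j     <- rnorm(J, theta_j, 1)          # data simulated from the model
    p_j     <- pnorm(theta_j, mean = y_j / 2, sd = sqrt(1 / 2))
    hist(p_j, breaks = 20)                   # flat here; a biased approximation skews it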

There’s also the small problem that if the model is badly mis-specified, then it may fit the observed data much worse or better than the average of data drawn from the model. Again, this contrasts with the PSIS diagnostic that only assesses the fit for the particular data set you’ve observed.

In light of this, we recommend interpreting both of our heuristics the same way: conservatively. If either heuristic fails, then we can say the variational posterior is poorly behaved in one of two specific ways. If either or both heuristics pass, then we can have some confidence that the variational posterior will be a good approximation to the true distribution (especially after a PSIS or second-order correction), but this is still not guaranteed.

Faith/Void

To close this post out symmetrically (because symmetry indicates a lack of bias), let’s go back to a different Bill Callahan song to remind us that even if it’s not the perfect song, you can construct something beautiful by leveraging formal structure:

If
If you
If you could
If you could only
If you could only stop
If you could only stop your
If you could only stop your heart
If you could only stop your heart beat
If you could only stop your heart beat for
If you could only stop your heart beat for one heart
If you could only stop your heart beat for one heart beat

The difference between me and you is that I’m not on fire

“Eat what you are while you’re falling apart and it opened a can of worms. The gun’s in my hand and I know it looks bad, but believe me I’m innocent.” – Mclusky

While the next episode of Madam Secretary buffers on terrible hotel internet, I (the other other white meat) thought I’d pop in to say a long, convoluted hello. I’m in New York this week visiting Andrew and the Stan crew (because it’s cold in Toronto and I somehow managed to put all my teaching on Mondays. I’m Garfield without the spray tan.).

So I’m in a hotel on the Upper West Side (or, like, maybe the upper upper west side. I’m in the 100s. Am I in Harlem yet? All I know is that I’m a block from my favourite bar [which, as a side note, Aki does not particularly care for] where I am currently not sitting and writing this because last night I was there reading a book about the rise of the surprisingly multicultural anti-immigration movement in Australia and, after asking what my book was about, some bloke started asking me for my genealogy and “how Australian I am” and really I thought that it was both a bit much and a serious misunderstanding of what someone who is reading book with headphones on was looking for in a social interaction.) going through the folder of emails I haven’t managed to answer in the last couple of weeks looking for something fun to pass the time.

And I found one. Ravi Shroff from the Department of Applied Statistics, Social Science and Humanities at NYU (side note: applied statistics gets a short shrift in a lot of academic stats departments around the world, which is criminal. So I will always love a department that leads with it in the title. I’ll also say that my impression when I wandered in there for a couple of hours at some point last year was that, on top of everything else, this was an uncommonly friendly group of people. Really, it’s my second favourite statistics department in North America, obviously after Toronto who agreed to throw a man into a volcano every year as part of my startup package after I got really into both that Tori Amos album from 1996 and cultural appropriation. Obviously I’m still processing the trauma of being 11 in 1996 and singularly unable to sacrifice any young men to the volcano goddess.) sent me an email a couple of weeks ago about constructing interpretable decision rules.

(Meta-structural diversion: I started writing this with the new year, new me idea that every blog post wasn’t going to devolve into, say, 500 words on how Medúlla is Björk’s Joanne, but that resolution clearly lasted for less time than my tenure as an Olympic torch relay runner. But if you’ve not learnt to skip the first section of my posts by now, clearly reinforcement learning isn’t for you.)

To hell with good intentions

Ravi sent me his paper Simple rules for complex decisions by Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel and Daniel Goldstein and it’s one of those deals where the title really does cover the content.

This is my absolute favourite type of statistics paper: it eschews the bright shiny lights of ultra-modern methodology in favour of the much harder road of taking a collection of standard tools and shaping them into something completely new.

Why do I prefer the latter? Well it’s related to the age old tension between “state-of-the-art” methods and “stuff-people-understand” methods. The latter are obviously preferred as they’re much easier to push into practice. This is in spite of the former being potentially hugely more effective. Practically, you have to balance “black box performance” with “interpretability”. Where you personally land on that Pareto frontier is between you and your volcano goddess.

This paper proposes a simple decision rule for binary classification problems and shows fairly convincingly that it can be almost as effective as much more complicated classifiers.

There ain’t no fool in Ferguson

The paper proposes a Select-Regress-and-Round method for constructing decision rules that works as follows:

  1. Select: Pick a small number $latex k$ of features $latex \mathbf{x}$ that will be used to build the classifier
  2. Regress: Use a logistic-lasso to estimate the classifier $latex h(\mathbf{x}) = (\mathbf{x}^T\mathbf{\beta} \geq 0 \text{ ? } 1 \text{ : } 0)$.
  3. Round: Choose $latex M$ possible levels of effect and build weights

$latex w_j = \text{Round} \left( \frac{M \beta_j}{\max_i|\beta_i|}\right)$.

The new classifier (which chooses between options 1 and 0) selects 1 if

$latex \sum_{j=1}^k w_j x_j > 0$.

In the paper they use $latex k=10$ features and $latex M = 3$ levels.  To interpret this classifier, we can consider each level as a discrete measure of importance.  For example, when we have $latex M=3$ we have seven levels of importance from “very high negative effect”, through “no effect”, to “very high positive effect”. In particular

  • $latex w_j=0$: The $latex j$th feature has no effect
  • $latex w_j= \pm 1$: The $latex j$th feature has a low effect (positive or negative)
  • $latex w_j = \pm 2$: The $latex j$th feature has a medium effect (positive or negative)
  • $latex w_j = \pm 3$: The $latex j$th feature has a high effect (positive or negative).

A couple of key things here make this idea work.  Firstly, the initial selection phase allows people to “sense check” the initial group of features while also forcing the decision rule to only depend on a small number of features, which greatly improves people’s ability to interpret the rule.  The regression phase then works out which of those features are actually used (the number of active features can be less than $latex k$). Finally, the last phase gives a qualitative weight to each feature.
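
To make the recipe concrete, here is a sketch on simulated data, using the glmnet package for the logistic lasso (my choice of tool, not necessarily theirs). The data, the values of $latex k$ and $latex M$, and the crude “take the first $latex k$ columns” selection step are all invented for illustration; this is not the authors’ code.

    library(glmnet)
    set.seed(99)
    n <- 2000; p <- 20; k <- 10; M <- 3
    X <- matrix(rnorm(n * p), n, p)
    beta_true <- c(rnorm(5), rep(0, p - 5))
    y <- rbinom(n, 1, plogis(X %*% beta_true))

    # Select (crudely) and Regress
    X_k   <- X[, 1:k]
    cvfit <- cv.glmnet(X_k, y, family = "binomial", alpha = 1)
    beta_hat <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept

    # Round to integer weights in -M..M
    w <- round(M * beta_hat / max(abs(beta_hat)))
    w

    # The resulting rule: predict 1 when the weighted sum of the features is positive
    rule <- function(x_new) as.integer(x_new %*% w > 0)
    mean(rule(X_k) == y)   # crude in-sample accuracy check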

This is a transparent way of building a decision rule, as the effect of each feature used to make the decision is clearly specified.  But does it work?

She will only bring you happiness

The most surprising thing in this paper is that this very simple strategy for building a decision rule works fairly well. Probably unsurprisingly, complicated, uninterpretable decision rules constructed through random forests typically do work better than this simple decision rule.  But the select-regress-round strategy doesn’t do too badly.  It might be possible to improve the performance by tweaking the first two steps to allow for some low-order interactions. For binary features, this would allow for classifiers where neither X nor Y on its own is a strong indicator of success, but their co-occurrence (XY) is.

Even without this tweak, the select-regress-round classifier performs about as well as logistic regression and logistic lasso models that use all possible features (see the figure in the paper), although it performs worse than the random forest.  It also doesn’t appear that the rounding process has too much of an effect on the quality of the classifier.

This man will not hang

The substantive example in this paper has to do with whether or not a judge decides to grant bail, where the event you’re trying to predict is a failure to appear at trial. The results in this paper suggest that the select-regress-round rule leads to a consistently lower rate of failure compared to the “expert judgment” of the judges.  It also works, on this example, almost as well as a random forest classifier.

There's some cool methodology stuff in here about how to actually build, train, and evaluate classification rules when, for any particular experimental unit (a person getting or not getting bail in this case), you can only observe one of the potential outcomes.  This paper uses some ideas from the causal analysis literature to work around that problem.

I guess the real question I have about this sort of example is how these decision rules would be applied in practice.  In particular, would judges be willing to use this type of system?  The obvious advantage of implementing it in practice is that it is data driven and, therefore, the decisions are potentially less likely to fall prey to implicit and unconscious biases. The obvious downside is that I am personally more than the sum of my demographic features (or other measurable quantities) and this type of system would treat me like the average person who shares the $latex k$ features with me.

Now, Andy did you hear about this one?

We drank a toast to innocence, we drank a toast to now. We tried to reach beyond the emptiness but neither one knew how. – Kiki and Herb

Well I hope you all ended your 2017 with a bang.  Mine went out on a long-haul flight crying so hard at a French AIDS drama that the flight attendant delivering my meal had to ask if I was ok. (Gay culture is keeping a running ranking of French AIDS dramas, so I can tell you that BPM was my second favourite.)

And I hope you spent your New Year's Day well. Mine went on jet lag and watching I, Tonya in the cinema. (Gay culture has a lot to do with Tonya Harding, especially after Sufjan Stevens chased his songs in Call Me By Your Name with the same song about Tonya Harding in two different keys.)

But enough frivolity and joy, we’re here to talk about statistics, which is anathema to both those things.

Andy Kaufman in the wrestling match

So what am I talking about today (other than thumbing my nose at a certain primary blogger who for some reason didn’t think I liked REM)?

Well if any of you are friends with academic statisticians, you know that we don’t let an argument go until it’s died, and been resurrected again and again, each time with more wisdom and social abilities (like Janet in The Good Place). So I guess a part of this is another aspect of a very boring argument: what is the role for statistics in “data science”.

Now, it has been pointed out to me by a very nice friend that academic statisticians have precisely no say in what is and isn’t data science. Industry has taken that choice out of our equivocating hands. But if I cared about what industry had to say, I would probably earn more money.

So what’s my view? Well it’s kinda boring. I think Data Science is like Mathematics: an encompassing field with a number of traditional sub-disciplines. There is a lot of interesting work to do within each individual sub-field, but increasingly there is bleeding-edge work to be done on the boundaries between fields.  This means the boundaries between traditions and, inevitably, the boundaries between training methods. It also means that the Highlander-Style quest for dominance among the various sub-disciplines of Data Science is a seriously counter-productive waste of time.

So far, so boring.

So why am I talking about this?  Because I want to seem mildly reasonable before I introduce our pantomime villain for the post (yes, it’s still Christmas until the Epiphany, so it’s still panto season. If you’re American or otherwise don’t know what panto season is, it’s waaaaaay too hard to explain, but it involves men in elaborate dresses and pays the rent for a lot of drag queens in the UK).

So who's my villain? That would be none other than Yann LeCun.  Why? Because he's both the Director of AI Research at Facebook and a Professor at NYU, so he's doing fine. (As a side note, I'd naively assume those were two fairly full-on full-time jobs, so all power to that weirdly American thing where people apparently do both.)

LeCun also has a slightly bonkers sense of humour, evidenced here by him saying he’s prepared to throw probability theory under a bus. (A sentiment I, as someone who’s also less good at probability theory than other people, can jokingly get behind.) Also a sentiment some people seriously got behind, because apparently if you stand for nothing you’ll fall for anything.

Hey Andy are you goofing on Elvis, hey baby, are we losing touch?

But what is the specific launching pad for my first unnecessarily wordy blog post of 2018? Well it’s a debate that he participated in at NIPS2017 on whether interpretability is necessary for machine learning.  He was on the negative side. The debate was not, to my knowledge, recorded, but some record of it lives on on a blue tick twitter account.

Ok, so the bit I screenshot’d is kinda dumb. There’s really no way to water that down. It’s wrong. Actively and passively. It’s provably false. It’s knowably false from some sort of 101 in statistical design. It’s damaging and it’s incorrect.

So, does he know better? Honestly I don’t care. (But, like, probably because you usually don’t get to hold two hard jobs at the same time without someone noticing you’re half-arsing at least one of them unless you’re pretty damn good at what you do and if you’re pretty damn good at what you do you know how clinical trials work because you survived undergraduate.) Sometimes it’s fun to be annoyed by things, but that was like 3 weeks ago and a lot has been packed into that time. So forget it.

And instead look at the whole thread of the "best and brightest" in the category of "people who are invited to speak at NIPS". Obviously this group includes no women, because 2017 can sink into the ******* ocean. Could we really not have invited a few women? I'm pretty sure there are a decent number who could've contributed knowledgeably and interestingly to this ALL MALE PANEL (FOR SHAME, NIPS. FOR SHAME).

Serious side note time: Enough. For the love of all gods past, future, and present, can every single serious scientific body please dis-endorse any meeting/workshop/conference that has all-male panels, all-male invited sessions, or all-male plenary sessions. It is not the 1950s. Women are not rare in our field. If you can't think of a good one, put out a wide call and someone will probably find one. And no, it's not the case that the 4 voices were the unique ones that had the only orthogonal set of views on this topic. I'm also not calling for tokenism. I reckon we could re-run this panel with no men and have a vital and interesting debate. Men are not magic. Not even these ones.

[And yes, this is two in a row. I am sorry. I do not want to talk about this, but it’s important that it doesn’t pass without mention. But the next post will be equality free.]

Other serious side note: So last night when I wrote this post I had a sanctimonious thing asking why men agree to be in all-male sessions. Obviously I checked this morning and it turns out I’m in one at ISBA, so that’s fun. And yes, there are definitely women who can contribute to a session (ironically) called The diversity of objective approaches at least as well as I can (albeit probably not talking about the same stuff).

Will I do it? Probably. I like my job and speaking in sessions at major conferences is one of the things I need to do to keep it. Do I think ISBA (and other conferences) should reject out of hand every all male session?  No, because there are focussed but important subfields that just don’t have many women in them. Which is unfortunate, but we need to live in the world as it is. Do I think ISBA (and other conferences) should require a strong justification for a lack of diversity in proposed sessions? ABSOLUTELY.

Final serious side note: Why do I care about this? I care because people like Yann LeCun or Andrew Gelman or Kerrie Mengersen do not spring fully formed from the mind of Zeus. While they are extremely talented (and would be no matter what they did), they also benefitted from mentoring, guidance, opportunities, and most importantly a platform at all points during their career. We are very bad at doing that for people in general, which means that a lot of talent slips through the cracks. Looking at gender diversity is an easy place to start to fix this problem – the fact that women make up about half of the population but only a tiny sliver of possible speaking slots at major conferences is an easy thing to see.

If we give space to women, ethnic and racial minorities, people from less prestigious institutions, people with non-traditional backgrounds etc, (and not just the same handful that everyone else invites to speak at their things) we will ensure that we don’t miss the next generation of game changers.

Anyway, long rant over. Let’s move onto content.

So first and foremost let’s talk about machine learning (where LeCun makes his bread). It’s a fairly uncontroversial view to say that when the signal-to-noise ratio is high, modern machine learning methods trounce classical statistical methods when it comes to prediction. (I’ve talked a little about the chicken-armageddon regime before.) The role of statistics in this case is really to boost the signal-to-noise ratio through the understanding of things like experimental design. [The Twitter summary of] LeCun’s argument suggests that this skill may not be being properly utilized in industrial data science practice.  This is the seat at the table we should be fighting hard for.

One easy thing statisticians could contribute would be built around Andrew's (and collaborators') ideas of Mister-P and their obvious extensions to non-multilevel smoothing methods. Another would be a paper I love by Chen, Wakefield and Lumley, which is basically the reverse of Mister-P.

Broadly speaking, the Twitter thread summarizing the debate is worth reading. It's necessarily a summary of what was said and does not get across tone or, possibly, context. But it does raise a pile of interesting questions.

I fall on the side that if you can’t interpret your model you have no chance of sense-checking your prediction. But it’s important to recognize that this view is based on my own interests and experiences (as is LeCun saying that he’d prefer good prediction to good interpretability). In the applied areas I’m interested in, new data comes either at a scientific or human cost (or is basically impossible to get), so this is hugely important. In other areas where you can get essentially infinite data for free (I’m looking at fun problems like solving Go), there is really no reason for interpretability.

What's missing from the summary is the idea that Science is a vital part of data science and that contextually appropriate loss functions are vitally important when making decisions about which modelling strategy (or model) will be most useful for the problem at hand. (Or that the philosophy, methodology, and prior specification should only be considered in the context of the likelihood.)

We need to stop sacrificing women on the altar of deeply mediocre men (ISBA edition)

(This is not Andrew. I would ask you not to speculate in the comments who S is, this is not a great venue for that.)

Kristian Lum just published an essay about her experiences being sexually assaulted at statistics conferences.  You should read the whole thing because it's important, but here's a sample paragraph.

I debated saying something about him at the time, but who would have cared? It was a story passed down among female graduate students in my circles that when one woman graduate student was groped at a party by a professor and reported it to a senior female professor, she was told that if she wanted to stay in the field, she’d just have to get used to it. On many occasions, I have been smacked on the butt at conferences. No one ever seemed to think it was a problem. I knew it would be even more difficult to get people to find S’s behavior problematic since he is employed by a large tech company and his participation in academic conferences, I have heard, often comes with sponsorship money.

I have a friend who had essentially the same experience with the same man.

Although it's heartening to see that some senior people in ISBA cut his name from the nominations in the recent elections, there is no formal mechanism for the society to respond to this type of allegation or this type of behaviour.

Why is that important?  I’m just on my way back from O’Bayes, which is the cliquiest of the Bayesian conferences (all of which are pretty cliquey). You can set your watch by who gives the tutorials and who’s on the scientific committee.  They’ve all known each other forever and are friends. To report this type of behaviour in an environment where the people you should tell are likely friends with the assaulter is an extreme act of bravery. ISBA (and all of its sections) need to work to help people come forward and support them formally as well as informally.

And keep the creeps out.

 

(Quick clarification: In the second last paragraph I really do not want to suggest that the senior people in ISBA would not act on reports about their friend. And, as the article said, they did in one case. But I know that because I know these people reasonably well, which is not a luxury most have. ISBA needs to do more to signal its willingness and openness to dealing with this problem.)

 

More important edit:

Kerrie Mengersen, ISBA president and all-round wonderful person has just weighed in officially.  I look forward to seeing what ISBA does and I look forward to never seeing “S” in person again.

Update: ISBA’s Taskforce is now live.  The announcement is here. If you have any suggestions or concerns, there is an email address at the bottom of the announcement.

Always crashing in the same car

"Hey, remember me?  I've been busy working like crazy" – Fever Ray

I’m at the Banff International Research Station (BIRS) for the week, which is basically a Canadian version of Disneyland where during coffee breaks a Canadian woman with a rake politely walks around telling elk to “shoo”.

The topic of this week’s workshop isn’t elk-shooing, but (more interestingly) spatial statistics.  But the final talk of the workshop spurred some non-spatial thoughts.

Lance Waller was tasked with drawing together the various threads from the last few days, but he actually did something better. He spent his 45 minutes talking about the structures and workflows that make up applied spatial statistics. Watch the talk. It’s interesting.

Rage of Travers

The one thing I strongly disagreed with in Lance's talk was the way he talked about priors in Bayesian analysis and penalties in frequentist analysis as coming from the same place.

Now Lance is a world-beating epidemiologist and biostatistician and, as such, works in a field where some people are vehemently against specifying priors. In this context, it makes sense to emphasize the similarities between the two approaches, especially given that a log-posterior density and a penalized log-likelihood look exactly the same. Let a thousand blossoms bloom, as far as I am concerned. But I ain't spendin' any time on it, because in the meantime, every three months, a person is torn to pieces by a crocodile in North Queensland.

My fear is always that if we aren’t careful when emphasizing that two things are very similar, people will think that they’re the same.  And a penalty is not a prior.

Why not? Well they focus on different things. A penalty is concerned only with what the maximum looks like.  For example, the Lasso can be interpreted as saying

bring me the value of $latex \beta$ that maximizes the likelihood while ensuring that $latex \sum_j |\beta_j| $ is less than some fixed (small) number.

On the other hand, a prior worries about what the plausible spread of the parameters is.  The (terrible) Bayesian Lasso prior can be interpreted as saying

I think the plausible values of $latex \beta_j$ lie between $latex \pm 3.7\lambda^{-1}$, where $latex \lambda$ is the rate parameter of the double exponential distribution. I also think that 5% of the time, the value of $latex \beta_j$ lies between $latex \pm 0.03\lambda^{-1}$.

To put it differently: a penalty is a statement about what a maximum would look like, while a prior is a statement about where the parameter is likely to be (or, conversely, where it is likely not to be).
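To make the contrast concrete, here is the same double-exponential ingredient written both ways (the notation is mine, not Lance's):

$latex \hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\max_{\boldsymbol{\beta}} \left\{ \log p(\mathbf{y} \mid \boldsymbol{\beta}) - \lambda \sum_j |\beta_j| \right\}, \qquad p(\boldsymbol{\beta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}) \prod_j \frac{\lambda}{2} e^{-\lambda |\beta_j|}$.

The two objects share the same mode, but the penalized estimate is only a statement about that mode, while the posterior is a whole distribution whose spread is exactly the sort of plausibility statement described above.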

This is yet another way to see that the Bayesian Lasso will struggle to support both small and large $latex \beta_j$.

Firewood and candles

This pushes us again to one of the biggest differences between Bayesian and classical methods: they differ in when you can begin simulating new data. With classical methods, you can only predict new data after you've already seen data from the process, while with Bayesian methods you can predict before you see any data.

Obviously, you should be able to make better predictions after you've seen data, but the key idea is that you can use the prior predictive distribution to do useful things!

Antarctica starts here

At some point after his talk, I was talking to Lance and he mentioned a colleague of his who is strongly against adding prior information. This colleague was of the view that you should elicit more data rather than eliciting a prior.

But really, that’s a key use of the prior predictive distribution.  Often you can elicit information from scientists about what the data will (or will not) look like, and the prior predictive distribution is useful for seeing how well you have incorporated this information into your model.  Or, to put it differently, when you build generative models, you can generate data and you should ask people what the data you can generate should look like.
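As a concrete (and entirely made up) version of that workflow, here's a minimal prior predictive simulation for a simple regression; the prior scales below are placeholders that you would adjust until the simulated data looks like something the scientist would recognise:

  import numpy as np

  rng = np.random.default_rng(1234)
  x = np.linspace(0, 10, 50)             # the design is known before any outcome is observed

  n_sims = 1000
  sims = np.empty((n_sims, x.size))
  for s in range(n_sims):
      alpha = rng.normal(0.0, 5.0)       # placeholder prior for the intercept
      beta = rng.normal(0.0, 1.0)        # placeholder prior for the slope
      sigma = abs(rng.normal(0.0, 2.0))  # half-normal placeholder for the noise scale
      sims[s] = rng.normal(alpha + beta * x, sigma)

  # Pointwise bands to show the scientist: "does data like this look plausible to you?"
  print(np.quantile(sims, [0.05, 0.5, 0.95], axis=0))

If the simulated data sets are wildly outside what the scientist considers possible, you revise the priors (or the model) before a single real observation is touched.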

 

Asymptotically we are all dead (Thoughts about the Bernstein-von Mises theorem before and after a Diamanda Galás concert)

They say I did something bad, then why's it feel so good – Taylor Swift

It’s a Sunday afternoon and I’m trying to work myself up to the sort of emotional fortitude where I can survive the Diamanda Galás concert that I was super excited about a few months ago, but now, as I stare down the barrel of a Greek woman vocalizing at me for 2 hours somewhere in East Toronto, I am starting to feel the fear.

Rather than anticipating the horror of being an emotional wreck on public transportation at about 10pm tonight, I’m thinking about Bayesian asymptotics. (Things that will make me cry on public transport: Baby Dee, Le Gateau Chocolate, Sinead O’Connor, The Mountain Goats. Things that will not make me cry on any type of transportation: Bayesian Asymptotics.)

So why am I thinking about Bayesian asymptotics? Well because somebody pointed out a thread on the Twitter (which is now apparently a place where people have long technical discussions about statistics, rather than a place where we can learn Bette Midler’s views on Jean Genet or reminisce about that time Cher was really into horses) that says a very bad thing:

The Bernstein-von Mises theorem kills any criticism against non-informative priors (for the models commonly used). Priors only matter if one wishes to combine one’s confirmation bias with small studies. Time to move to more interesting stuff(predictive inference)

I’ve written in other places about how Bayesian models do well at prediction (and Andrew and Aki have written even more on it), so I’m leaving the last sentence alone. Similarly the criticisms in the second last sentence are mainly rendered irrelevant if we focus on weakly informative priors. So let’s talk about the first sentence.

Look what you made me do

The Bernstein-von Mises theorem, like Right Said Fred, is both a wonder and a horror that refuses to stay confined to a bygone era. So what is it?

The Bernstein-von Mises theorem (or BvM when I’m feeling lazy) says the following:

Under some conditions, a posterior distribution converges as you get more and more data to a multivariate normal distribution centred at the maximum likelihood estimator with covariance matrix given by $latex n^{-1} I(\theta_0)^{-1}$, where $latex \theta_0$ is the true population parameter (Edit: here $latex I(\theta_0)$ is the Fisher information matrix at the true population parameter value).

A shorter version of this is that (under some conditions) a posterior distribution looks asymptotically like the sampling distribution of a maximum likelihood estimator.
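For the notationally inclined, one standard way of writing the conclusion (glossing over the regularity conditions for now) is convergence in total variation:

$latex \left\| \Pi\left( \cdot \mid y_1, \ldots, y_n \right) - N\left( \hat{\theta}_n, n^{-1} I(\theta_0)^{-1} \right) \right\|_{TV} \rightarrow 0$

in probability under the true data-generating distribution, where $latex \hat{\theta}_n$ is the maximum likelihood estimator.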

Or we can do the wikipedia version (which lacks both clarity and precision. A rare feat.):

[T]he posterior distribution for unknown quantities in any problem is effectively independent of the prior distribution (assuming it obeys Cromwell’s rule) once the amount of information supplied by a sample of data is large enough.

Like a lot of theorems that are imprecisely stated, BvM is both almost always true and absolutely never true.  So in order to do anything useful, we need to actually think about the assumptions. They are written, in great and loving detail, in Section 2.25 of  these lecture notes from Richard Nickl. I have neither the time nor energy to write these all out but here are some important assumptions:

  1. The maximum likelihood estimator is consistent.
  2. The model has a fixed, finite number of parameters.
  3. The true parameter value lies on the interior of the parameter space (ie if you’re estimating a standard deviation, the true value can’t be zero).
  4. The prior density must be non-zero in a neighbourhood of $latex \theta_0$.
  5. The log-likelihood needs to be smooth (two derivatives at the true value and some other stuff).

The first condition rules out any problem where you'd want to use a penalized maximum likelihood estimator. (Edit: Well this was awkwardly stated. You need the MLE to be unbiased [Edit: Consistent! Not unbiased. Thanks A Reader] and there to be a uniformly consistent estimator, so I'm skeptical these things hold in the situation where you would use penalized MLE.) The third one makes estimating variance components difficult. The fifth condition may not be satisfied after you integrate out nuisance parameters, as this can lead to spikes in the likelihood.
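To see the third condition bite, here's a tiny deterministic toy (mine, not from the lecture notes), where the "extra variance" parameter is truly zero, so the truth sits exactly on the boundary of the parameter space:

  import numpy as np

  # Model: y_i ~ N(0, 1 + tau^2) with a flat prior on tau >= 0.
  # To keep the illustration deterministic, plug in summary statistics that match
  # the truth tau = 0 exactly: n observations with sum(y_i^2) = n.
  n = 5000
  S = float(n)

  tau = np.linspace(0.0, 0.2, 400)
  var = 1.0 + tau**2
  loglik = -0.5 * (n * np.log(var) + S / var)
  post = np.exp(loglik - loglik.max())
  post /= post.sum() * (tau[1] - tau[0])
  # post is monotone decreasing from tau = 0: half a bump squashed against the
  # boundary, nothing like the interior Gaussian that BvM promises.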

I guess the key thing to remember is that this "thing that's always true" is, like everything else in statistics, a highly technical statement that can be wrong when the conditions under which it is true are not satisfied.

Call it what you want

You do it ’cause you can. Because when I was younger, I couldn’t sustain those phrases as long as I could. Now I can just go on and on. If you keep your stamina and you learn how to sing right, you should get better rather than worse. – Diamanda Galás in Rolling Stone 

Andrew has pointed out many times that the problem with scientists misapplying statistics is not that they haven’t been listening, it’s that they have listened too well. It is not hard to find statisticians (Bayesian or not) who will espouse a similar sentiment to the first sentence of that ill-begotten tweet. And that’s a problem.

When someone says to me "Bernstein-von Mises implies that the prior only has a higher-order effect on the posterior", I know what they mean (or, I guess, what they should mean). I know that they're talking about a regular model, a lot of information, and a true parameter that isn't on the boundary of the parameter space. I know that declaring something a higher-order effect is a loaded statement because the "pre-asymptotic" regime can be large for complex models.

Or, to put it differently, when someone says that, I know they are not saying that priors aren't important in Bayesian inference. But it can be hard to know if they know this. And to be fair, if you make a super-simple model that can be used in the type of situation where you could read the entrails of a recently gutted chicken and still get an efficient, asymptotically normal estimator, then the prior is not a big deal unless you get it startlingly wrong.

No matter how much applied scientists may want to just keep on gutting those chickens, there aren’t that many chicken gutting problems around. (Let the record state that a “chicken gutting” problem is one where you only have a couple of parameters to control your system, and you have accurate, random, iid samples from your population of interest. NHSTs are pretty good at gutting chickens.) And the moment that data gets “big” (or gets called that to get the attention of a granting agency), all the chickens have left the building in some sort of “chicken rapture” leaving behind only tiny pairs of chicken shoes.

Big reputation, big reputation. Ooh, you and me would be a big conversation.

I guess what I’m describing is a communication problem. We spend a lot of time communicating the chicken gutting case to our students, applied researchers, applied scientists, and the public rather than properly preparing them for the poultry armageddon that is data analysis in 2017. They have nothing but a tiny knife as they attempt to wrestle truth from the menagerie of rhinoceroses, pumas, and semi-mythical megafauna that are all that remain in this chicken-free wasteland we call a home.

(I acknowledge that this metaphor has gotten away from me.)

The mathematical end of statistics is a highly technical discipline. That’s not so unusual–lots of disciplines are highly technical. What is unusual about statistics as a discipline is that the highly technical parts of the field mix with the deeply applied parts. Without either of these ingredients, statistics wouldn’t be an effective discipline.  The problem is, as it always is, that people at different ends of the discipline often aren’t very good at talking to each other.

Many people who work as excellent statisticians do not have a mathematics background and do not follow the nuances of the technical language. And many people who work as excellent statisticians do not do any applied work and do not understand that the nuances of their work are lost on the broader audience.

My background is a bit weird. I wasn't trained as a statistician, so a lot of the probabilistic arguments in the theoretical stats literature feel unnatural to me. So I know that when people like Judith Rousseau or Natalia Bochkina or Ismael Castillo or Aad van der Vaart or any of the slew of people who understand Bayesian asymptotics more deeply than I can ever hope to speak or write, I need to pay a lot of attention to the specifics.  I will never understand their work on the first pass, and may never understand it deeply no matter how much effort I put in.

The only reason that I now know more than nothing about Bayesian asymptotics is that I hit a point where I no longer had the luxury to not know. So now I know enough to at least know what I don’t know.

Replication is not just Taylor Swift’s new album

The main thing that I want to make clear about the Bernstein-von Mises theorem is that it is hard to apply it in practice. This is for the exact same reason that the asymptotic arguments behind NHSTs rarely apply in practice.

Just because you have a lot of data, doesn’t mean you have independent replicates of the same experiment.

In particular, issues around selection, experimental design, forking paths, etc all are relevant to applying statistical asymptotics. Asymptotic statements are about what would happen if you gather more data and analyze it, and therefore they are statements about the entire procedure of doing inference on a new data set. So you can’t just declare that BvM holds. The existence of a Bernstein-von Mises theorem for your analysis is a statement about how you have conducted your entire analysis.

Let's start with the obvious candidate for breaking BvM: big data. Large, modern data sets are typically observational (that is, they are not designed and the mechanism for including the data may be correlated with the inferential aim of the study). For observational data, it is unlikely that the posterior mean (for example) will be a consistent estimator of the population parameter, which precludes a BvM theorem from applying.

Lesson: Consistency is a necessary condition for a BvM to hold, and it is unlikely to hold for undesigned data.

Now onto the next victim: concept drift.  Let’s imagine that we can somehow guarantee that the sampling mechanism we are using to collect our big data set will give us a sample that is representative of the population as a whole.  Now we have to deal with the fact that it takes a lot of time to collect a lot of data. Over this time, the process you’re measuring can change.  Unless your model is able to specifically model the mechanism for this change, you are unlikely to be in a situation where BvM holds.

Lesson: Big data sets are not always instantaneous snapshots of a static process, and this can kill off BvM.

For all the final girls: Big data sets are often built by merging many smaller data sets. One particular example springs to mind here: global health data. This is gathered from individual countries, each of which has its own data gathering protocols. To some extent, you can get around this by carefully including design information in your inference, but if you’ve come to the data after it has been collated, you may not know enough to do this. Once again, this can lead to biased inferences for which the Bernstein-von Mises theorem will not hold.

Lesson: The assumptions of the Bernstein-von Mises theorem are fragile and it’s very easy for a dataset or analysis to violate them. It is very hard to tell, without outside information, that this has not happened.

'Cause I know that it's delicate

(Written after seeing Diamanda Galas, who was incredible, or possibly unbelievable, and definitely unreal. Sitting with a thousand or so people sitting in total silence listening to the world end is a hell of a way to wrap up a weekend.)

Much like in life, in statistics things typically only ever get worse when they get more complicated.  In the land of the Bernstein-von Mises theorem, this manifests in the guise of a dependence on the complexity of  the model.  Typically, if there are $latex p$ parameters in a model and we observe $latex n$ independent data points (and all the assumptions of the BvM are satisfied), then the distance from the posterior to a Normal distribution is $latex \mathcal{O}\left(\sqrt{pn^{-1}}\right)$.  That is, it takes longer to converge to a normal distribution when you have more parameters to estimate.

Do you have the fear yet? As readers of this blog, you might be seeing the problem already. With multilevel models, you will frequently have at least as many parameters as you have observations.  Of course, the number of effective parameters is usually much much smaller due to partial pooling. Exactly how much smaller depends on how much pooling takes place, which depends on the data set that is observed.  So you can see the problem.

Once again, it's those pesky assumptions (like the Scooby gang, they're not going away no matter how much latex you wear).  In particular, the fundamental assumption is that you have replicates, which, in a multilevel model with as many parameters as data, essentially means that you can pool more and more as you observe more and more categories. Or that you keep the number of categories fixed and you see more and more data in each category (and eventually see an infinite amount of data in each category).

All this means that the speed at which you hit the asymptotic regime (ie how much data you need before you can just pretend your posterior is Gaussian) will be a complicated function of your data. If you are using a multilevel model and the data does not support very much pooling, then you will reach infinity very very slowly.

This is why we can’t have nice things

Rolling Stone: Taylor Swift?

Diamanda Galás: [Gagging noises]

The ultimate death knell for simple statements about the Bernstein-von Mises theorem is the case where the model has an infinite dimensional parameter (aka a non-parametric effect).  For example, if one of your parameters is an unknown function.

A common example relevant in a health context would be if you’re fitting survival data using a Cox Proportional Hazards model, where the baseline hazard function is typically modelled by a non-parametric effect.  In this case, you don’t actually care about the baseline hazard (it’s a nuisance parameter), but you still have to model it because you’re being Bayesian. In the literature, this type of model is called a “semi-parametric model” as you care about the parametric part, but you still have to account for a non-parametric term.

To summarize a very long story that's not been completely mapped out yet, BvM does not hold in general for models with an infinite dimensional parameter. But it does hold in some specific cases. And maybe these cases are common, although honestly it's hard for me to tell. This is because working out exactly when BvM holds for these sorts of models involves poring over some really tough theory papers, which typically only give explicit results for toy problems where some difficult calculations are possible.

And then there are Bayesian models of sparsity, trans-dimensional models (eg models where the number of parameters isn’t fixed) etc etc etc.

But I’ll be cleaning up bottles with you on New Year’s Day

So to summarise, a thing that a person said on twitter wasn’t very well thought out. Bernstein-von Mises fails in a whole variety of ways for a whole variety of interesting, difficult models that are extremely relevant for applied data analysis.

But I feel like I’ve been a bit down on asymptotic theory in this post. And just like Diamanda Galás ended her show with a spirited rendition of Johnny Paycheck’s (Pardon Me) I’ve Got Someone to Kill, I want to end this on a lighter note.

Almost all of my papers (or at least the ones that have any theory in them at all) have quite a lot of asymptotic results, asymptotic reasoning, and applications of other people’s asymptotic theory. So I strongly believe that asymptotics are a vital part of our toolbox as statisticians. Why? Because non-asymptotic theory is just too bloody hard.

In the end, we only have two tools at hand to understand and criticize modern statistical models: computation (a lot of which also relies on asymptotic theory) and asymptotic theory. We are trying to build a rocket ship with an earthmover and a rusty spoon, so we need to use them very well. We need to make sure we communicate our two tools better to the end-users of statistical methods. Because most people do not have the time, the training, or the interest to go back to the source material and understand everything themselves.

Teeth are the only bones that show

"I lived in the country where the dead wood aches, in a house made of stone and a thousand mistakes" – The Drones

Sometimes it’s cold and grey and Canadian outside and the procrastination hits hard. Sometimes, in those dark moments, one is tempted to fire up the social media and see what’s happening in other places where it’s probably also cold and grey, but possibly not Canadian. As if change were possible.

And sometimes you see something on the social media that annoys you.

In this case, it was a Pink News article titled “Men with muscles and money are more attractive to gay men, new study finds”.

Now, I curate my social media fairly heavily: I did not get into this liberal bubble by accident and I have no interest in an uncurated life. So I didn’t see this from some sort of Pink News feed, but rather from some stranger saying (and I paraphrase) “this is crap”.

A quick sidebar

Really? A study shows men with muscles and money are more attractive to gay men? Really?

For a community of people who are nominally dedicated to performing masculinity in all its varied, rainbow forms, the gay media is pretty much exclusively devoted to good looking, often straight, rich, often white men who are so dehydrated they have fabulous abs.

To paraphrase Ms Justin Elizabeth Sayer, “I know you find it hard to believe, with all the things you could be as a gay person, boring would still be an option. But it is.”.

We don’t need new research for this, we could just look at the Pink News website. (Please don’t. It’s not worth the pain.)

A journey in three parts: Pink News

The best thing that I can say about the Pink News article is that it didn't regurgitate the press release.  It instead used it as a base around which to improvise a gay spin. Pink News: the jazz of news.

It’s immediately clear from this article that the findings are based around the content of only one (creepy) website called TubeCrush, where you can upload photos taken of unsuspecting men on public transport. So a really classy establishment. (I mean, there are not enough words to get into just how ugly this is as a concept, so let’s just take that as read.)

The idea was that because the photos on this site focus on ripped white men who often have visual signifiers of wealth, gay men are attracted to ripped, rich, white men.

Of course, this excited me immediately. The researchers are making broad-ranging conclusions from a single, niche website that fetishizes a particular lack of consent.

Why is this exciting? Well I’m interested in the way that the quantitative training that scientists (both of the social and antisocial bent) have received doesn’t appear to have made the point that an inference can only ever be as sharp as the data used to make it.  Problems come when you, to paraphrase the Mother Superior in Sister Act 2: Back in the Habit, try to make a silk purse out of a sow’s ear.

That terrible gayface study from a few months ago was a great example of how quantitative social science can fall apart due to biases in the data collection, and this looked like a great example of how qualitative social science can do the same thing.

A journey in three parts: The press release

Coventry University, for reasons best known only to themselves, decided that the revelation that pervy website TubeCrush had a tendency to feature ripped, rich white men was worth a press release.  Here it is.

The key sentence, which Pink News ran with, was

The researchers say their study of the entries on the site quickly revealed that both straight women’s and gay men’s desired particular types of men.

It goes on to talk about how, despite London being multicultural, most of the photos were of white men. Most of the men were good looking and muscled. Most of the commentary on TubeCrush mentioned the muscles, as well as the expensive suits, watches, and phones the men were wearing/using.

Why is all this important? Well the press release tells us:

The academics said that public transport – in this case the Tube – has now become the space where gender politics is decided.

How fascinating. Not in schools, or universities. Not in families or friendship circles. Not at work or in social spaces. On. The. Tube.

The press release ends with a few quotes from the lead researcher Adrienne Evans, which is a useful thing for journalists who need to pretend they’ve talked with the authors. I’m just going to cherry pick the last one, but you should get the point (or follow the link above):

“It’s a problem as because although it appears as though we have moved forward, our desires are still mostly about money and strength.”

That is a heavy extrapolation of the available data.

A journey in three parts: The paper

Finally, I decided to read the paper. If you’re going to do a deep dive methodological criticism instead of doing the pile of things that you are actually supposed to be spending the afternoon on, then you should probably read the actual paper.

So here is the paper, entitled “He’s a total TubeCrush”: post-feminist sensibility as intimate publics by Adrienne Evans and Sarah Riley, published in Feminist Media Studies.

Now I have no expert knowledge of feminist studies (media or otherwise).  I’ve done the basic readings, but I am essentially ignorant. But I’m going to attempt to parse the title anyway.

An intimate public appears to be the idea that you can leverage a private, shared community identity into a market. The example in the paper is “chick lit”, which takes shared, performed aspects of femininity and basically markets products towards them. Arguably another example would be Pink News and the gay media industry. The key feature is that “Intimate publics do not just orient desire toward traditional gender roles, but evoke a nostalgic stance toward them, so that one element of the intimate public is feelings of nostalgia.”. So an intimate public in this context will always throw back to some supposed halcyon days of peak masculinity and celebrate this by marketing it, as TubeCrush does. So far so good.

The post-feminist sensibility is basically the idea that post-feminist media simultaneously sell the idea of equality while also suggesting that you can achieve that equality through consumption.  (I mean, they use a lot more words than that, so I'm probably missing nuance, but that seems to be the thrust.)

So the main argument (but less than half the paper) is devoted to the idea that TubeCrush is an example of post-feminist intimate publics. (a post-feminist intimate publics? The grammar of academic feminism is confusing)

All well and good. (And probably quite interesting if you’re into that sort of thing).

Also, as far as I’m concerned, completely kosher.  They are studying an object (TubeCrush), they are categorizing it, and they are unpicking the consequences of that categorization. That is essentially what academics do. We write paragraphs like:

TubeCrush makes masculinity a bodily property in much the same way as femininity is within post-feminist sensibility (Gill 2007). The aesthetic idealization of strength in the posts can be tied both to the heightened visibility of masculinity more generally, and to its location within an “attraction to-” culture. The intersection of heterosexual women’s and gay men’s desire arguably heightens the emphasis on strength as a key component of post-feminist masculinity, where gay male culture holds up heterosexual male strength as part of its own visual landscape (Duane Duncan 2007, 2010). In a culture that only very recently was not visible at all, effeminacy is disparaged and what is celebrated are “visible public identities that [have] more in common with traditional images of masculinity” (Duncan 2007, 334). In this way, representations of masculinity on TubeCrush demonstrate the maintenance of hegemonic masculinity, tied into notions of strength and phallic power.

Ah but there it is. They extrapolated from limited available data (a deep read of TubeCrush) to population dynamics.

On the maintenance of hegemonic data practices in social sciences

(Sorry)

And this is the crux. Moving from this to the press release to that awful Pink News article is a straight ride into hell on a sleigh made of good intentions.

Statements like "TubeCrush is a reestablishment of traditional gender roles within the context of post-feminism" are perfectly reasonable outcomes from this study. They take the data and summarize it in a way that sheds light on the underlying process.  But to move the conclusions beyond the seedy TubeCrush universe is problematic.

And the researchers are aware of this (a smarter blogger could probably make links here with the post-feminist sensibility in that they also generalize with their data while admitting that you need to buy the generalization through more data collection. But the analogy doesn’t completely hold up.)

While there was no rigid research design prior to funding, we believe this extended engagement with a website (whose materials only go back as far as 2011) has provided an in-depth understanding of the patterns and content of TubeCrush.

The first challenge when generalizing these results off the TubeCrush platform is that the study is not designed. I would term this a “pilot study”. I guess you can write press releases about pilot studies, but you probably shouldn’t write press releases that don’t mention the limitations of the data.

My major criticism of this article is that it treats TubeCrush, the photographs and the comments as an entity that just exists outside of any context. This is a website that is put together by a person (or a team of people) who select photographs and caption photographs. So these photographs and comments will always be a weak instrument for understanding society.  They are not irrelevant: the website is apparently quite popular. But the data has been filtered through a very limited sensibility. It reflects both the ideals of attractiveness and the sense of humour (or “humour”, I’ve not actually visited the site because I don’t need that in my life) of the curator(s).

All of this makes TubeCrush a weak tool for understanding society, and all inferences built using only this data will necessarily be weak.

And I guess this is my broad point. Quantitative thinking  around the data gathering, the design of the experiment, and the limitations of the measurements can inform the reliability of a qualitative study.

Is this paper bad? No. Was the article? Yes. Was the press release? Definitely.

But what sort of cold-hearted person doesn’t love a paper with the sentence

The wordplay on the financial language of the double-dip recession to again signify performing oral sex on financially secure (if not wealthy) masculinities demonstrates the juxtapolitics at the heart of TubeCrush.

Why won’t you cheat with me?

But I got some ground rules I’ve found to be sound rules and you’re not the one I’m exempting. Nonetheless, I confess it’s tempting. – Jenny Toomey sings Franklin Bruno

(The maths has died. There’s an edited version here, which is probably slightly better)

It turns out that I did something a little controversial in last week’s post. As these things always go, it wasn’t the thing I was expecting to get push back from, but rather what I thought was a fairly innocuous scaling of the prior. One commenter (and a few other people on other communication channels) pointed out that the dependence of the prior on the design didn’t seem kosher.  Of course, we (Andrew, Mike and I) wrote a paper that was sort of about this a few months ago, but it’s one of those really interesting topics that we can probably all deal with thinking more about.

So in this post, I’m going to go into a couple of situations where it makes sense to scale the prior based on fixed information about the experiment. (The emerging theme for these posts is “things I think are interesting and useful but are probably not publishable” interspersed with “weird digressions into musical theatre / the personal mythology of Patti LuPone”.)

If you haven’t clicked yet, this particular post is going to be drier than Eve Arden in Mildred Pierce.  If you’d rather be entertained, I’d recommend Tempting: Jenny Toomey sings the songs of Franklin Bruno. (Franklin Bruno is today’s stand in for Patti, because I’m still sad that War Paint closed. I only got to see it twice.)

(Toomey was one of the most exciting American indie musicians in the 90s both through her bands [Tsunami was the notable one, but there were others] and her work with Simple Machines, the label she co-founded. These days she’s working in musician advocacy and hasn’t released an album since the early 2000s. Bruno’s current band is called The Human Hearts. He has had a long solo career and was also in an excellent powerpop band called Nothing Painted Blue, who had an album called The Monte Carlo Method. And, now that I live in Toronto, I should say that that album has a fabulous cover of Mark Szabo’s I Should Be With You. To be honest, the only reason I work with Andrew and the Stan crew is that I figure if I’m in New York often enough I’ll eventually coincide with a Human Hearts concert.)

Sparsity

Why won’t you cheat with me? You and I both know you’ve done it before. – Jenny Toomey sings Franklin Bruno

The first object of our affliction is priors that promote sparsity in high-dimensional models.  There has been a lot of work on this topic, but the cheater's guide is basically this:

While spike-and-slab models can exactly represent sparsity and have excellent theoretical properties, they are basically useless from a computational point of view. So we use scale-mixture of normal priors (also known as local-global priors) to achieve approximate sparsity, and then use some sort of decision rule to take our approximately sparse signal and make it exactly sparse.

What is a scale-mixture of normals?  Well it has this general form

$latex \beta_j \sim N(0, \tau^2 \psi^2_j)$

where $latex \tau$ is a global standard deviation parameter, controlling how large the $latex \beta_j$ parameters are in general, while the local standard deviation parameters $latex \psi_j$ control how big the parameter is allowed to be locally.  The priors for $latex \tau$ and the $latex \psi_j$ are typically set to be independent.  A lot of theoretical work just treats $latex \tau$ as fixed (or as otherwise less important than the local parameters), but this is wrong.

Interpretation note: This is a crappy parameterisation.  A better one would constrain the $latex \psi_j$ to lie on a simplex.  This would then give us the interpretation that $latex \tau$ is the overall standard deviation if the covariates are properly scaled to be $latex \mathcal{O}(1)$, and the local parameters control how the individual parameters contribute to this variability. The standard parameterisation leads to some confounding between the scales of the local and global parameters, which can lead to both interpretational and computational problems.  Interestingly, Bhattacharya et al. showed that in some specific cases you can go from a model where the local parameters are constrained to the simplex to the unconstrained case (although they parameterised with the variance rather than the standard deviation).

Pedant's corner: Andrew likes to define mathematical statisticians as those who use x for their data rather than y. I prefer to characterise them as those who think it's a good idea to put a prior on the variance (an unelicitable quantity) rather than the standard deviation (which is easy to have opinions about). Please, people, just stop doing this. You're not helping yourselves!

Actually, maybe that last point isn't for Pedant's Corner after all.  Because if you parameterise by the standard deviation it's pretty easy to work out what the marginal prior on $latex \beta_j$ (with $latex \tau$ fixed) is.  This is quite useful because, with the notable exception of the "Bayesian" "Lasso" which-does-not-work-but-will-never-die-because-it-was-inexplicably-published-in-the-leading-stats-journal-by-prominent-statisticians-and-has-the-word-Lasso-in-the-title-even-though-a-back-of-the-envelope-calculation-or-I-don't-know-a-fairly-straightforward-simulation-by-the-reviewers-should-have-nixed-it (to use its married name), we can't compute the marginal prior for most scale-mixtures of normals.  The following result, which was killed by reviewers at some point during the PC prior paper's long review process but lives forever in the V1 paper on arXiv, tells you everything you need to know.  It's a picture because frankly I've had a glass of wine and I'm not bloody typing it all again.

For those of you who don’t want to click through, it basically says the following:

  • If the density of the prior on the standard deviation is finite at zero, then the implied prior on $latex \beta_j$ has a logarithmic spike at zero.
  • If the density of the prior on the standard deviation has a polynomial tail, then the implied prior on $latex \beta_j$ has the same polynomial tail.

(Not in the result, but computed at the time: if the prior on the standard deviation is exponential, the prior on $latex \beta_j$ still has Gaussian-ish tails.  I couldn’t work out what happened in the hinterland between exponential tails and polynomial tails, but I suspect at some point the tail on the standard deviation does eventually get heavy enough to be seen in the marginal, but I can’t tell you when.)
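If you would rather simulate than prove, here's a quick Monte Carlo sketch (mine, not from any of the papers) comparing a half-Cauchy prior on the standard deviation (polynomial tail) with an exponential one (light tail); the tail quantiles of the implied marginals for $latex \beta_j$ tell the story:

  import numpy as np

  rng = np.random.default_rng(42)
  n = 1_000_000

  # Half-Cauchy prior on the standard deviation: polynomial tail
  beta_hc = rng.normal(0.0, 1.0, n) * np.abs(rng.standard_cauchy(n))
  # Exponential prior on the standard deviation: light tail
  beta_exp = rng.normal(0.0, 1.0, n) * rng.exponential(1.0, n)

  for q in (0.99, 0.999, 0.9999):
      print(q, np.quantile(np.abs(beta_hc), q), np.quantile(np.abs(beta_exp), q))

The half-Cauchy column explodes as you push further into the tail (the marginal inherits the polynomial tail of the prior on the standard deviation), while the exponential column grows only slowly; a histogram of either marginal near zero will also show the pile-up described in the first bullet, since both priors have finite density at zero.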

With this sort of information, you can compute the equivalent of the bounds that I did on the Laplace prior for the general case (or, actually, for the case that will have at least a little bit of a chance, which is the monotonically decreasing priors on the standard deviation).

Anyway, that's a long way around to say that you get similar things for all computationally useful models of sparsity.  Why?  Well basically it's because these models are a dirty hack. They don't allow us to represent exactly sparse signals, so we need to deal with that somehow.  The somehow is through some sort of decision process that can tell a zero from a non-zero.  Unfortunately, this decision is going to depend on the precision of the measurement process, which strongly indicates that it will need to know about things like $latex n$, $latex p$ and $latex \mathbf{X}$. One way to represent this is through an interaction with the prior.

You’d look better if your shadow didn’t follow you around, but it looks as though you’re tethered to the ground, just like every pound of flesh I’ve ever found. – Franklin Bruno in a sourceless light.

For a very simple decision process (the deterministic threshold process described in the previous post), you can work out exactly how the threshold needs to interact with the prior.  In particular, we can see that if we’re trying to detect a true signal that is exactly zero (no components are active), then we know that $latex \| \mathbf{X} \boldsymbol{\beta} \| = 0$.  This is not possible for these scale-mixture models, but we can require that in this case all of the components are at most $latex \epsilon$, in which case

$latex \| \mathbf{X}\boldsymbol{\beta} \| \leq \epsilon \| \mathbf{X} \|$,

which suggests we want $latex \epsilon \ll \| \mathbf{X} \|^{-1}$. The calculation in the previous post shows that if we want this sort of almost zero signal to have any mass at all under the prior, we need to scale $latex \lambda$ using information about $latex \mathbf{X}$.

Of course, this is a very very simple decision process.  I have absolutely no idea how to repeat these arguments for actually good decision processes, like the predictive loss minimization favoured by Aki. But I’d still expect that we’d need to make sure there was a priori enough mass in the areas of the parameter space where the decision process is firmly one way or another (as well as mass in the indeterminate region).  I doubt that the Bayesian Lasso would magically start to work under these more complex losses.

Models specified through their full conditionals

Why won’t you cheat with me? You and I both know that he’s done the same. – Jenny Toomey sings Franklin Bruno

So we can view the design dependence of sparsity priors as preparation for the forthcoming decision process. (Those of you who just mentally broke into Prepare Ye The Way Of The Lord from Godspell, please come to the front of the class. You are my people.)  Now let’s talk about a case where this isn’t true.

To do this, we need to cast our minds back to a time when people really did have the original cast recording of Godspell on their mind.  In particular, we need to think about Julian Besag (who I’m sure was really into musicals about Jesus. I have no information to the contrary, so I’m just going to assume it’s true.) who wrote a series of important papers, one in 1974 and one in 1975 (and several before and after, but I can’t be arsed linking to them all.  We all have google.) about specifying models through conditional independence relations.

These models have a special place in time series modelling (where we all know about discrete-time Markovian processes) and in spatial statistics. In particular, generalisations of Besag’s (Gaussian) conditional autoregressive (CAR) models are widely used in spatial epidemiology.

Mathematically, Gaussian CAR models (and more generally Gaussian Markov models on graphs) are defined through their precision matrix (that is, the inverse of the covariance matrix) as

$latex \mathbf{x} \sim N(\mathbf{0}, \tau^{-1}\mathbf{Q}^{-1})$.

For simple models, such as the popular CAR model, we assume $latex \mathbf{Q}$ is fixed, known, and sparse (i.e. it has a lot of zeros) and we typically interpret $latex \tau$ to be the inverse of the variance of $latex \mathbf{x}$.

This interpretation of $latex \tau$ could not be more wrong.

Why? Well, let’s look at the marginal

$latex x_j \sim N\left(0, \tau^{-1}[\mathbf{Q}^{-1}]_{jj}\right)$.

To interpret $latex \tau$ as the inverse variance, we need the diagonal elements of $latex \mathbf{Q}^{-1}$ to all be around 1. This is never the case.
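If you don’t believe me, here is a tiny numerical check (a sketch, not anyone’s canonical model: I’m building a proper CAR-type precision from the first order random walk structure matrix, with $latex \tau = 1$ and my own arbitrary choice of 0.1 on the diagonal):

import numpy as np
n = 100
# RW1 structure matrix on a 1-d lattice (tridiagonal, with boundary corrections)
R = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
R[0, 0] = R[-1, -1] = 1.0
Q = 0.1*np.eye(n) + R                    # a proper precision matrix, tau = 1 for simplicity
marg_var = np.diag(np.linalg.inv(Q))     # marginal variances of each x_j
print(marg_var.min(), marg_var.max())    # nowhere near 1, so tau is not the inverse marginal variance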

A simple, mathematically tractable example is the first order random walk on a one-dimensional lattice, which can be written in terms of the increment process as

$latex x_{j+1} - x_j \sim N(0, \tau^{-1})$.

Conditioned on a particular starting point, this process looks a lot like a discrete version of Brownian motion as you move the lattice points closer together. This is a useful model for rough non-linear random effects, such as the baseline hazard rate in a Cox proportional hazard model. A long and detailed (and quite general) discussion of these models can be found in Rue and Held’s book.

I am bringing this case up because you can actually work out the size of the diagonal of $latex \mathbf{Q}^{-1}$. Sørbye and Rue talk about this in detail, but for this model maybe the easiest way to understand it is this: imagine we had a fixed lattice with $latex n$ points and we’d carefully worked out a sensible prior for $latex \tau$.  Now imagine that we’ve gotten some new data and, instead of only $latex n$ points in the lattice, we get information at a finer scale, so now the same interval is covered by $latex nk$ equally spaced nodes.  We model this with the new first order random walk prior

$latex x'_{j+1} - x'_j \sim N(0,[\tau']^{-1})$.

It turns out that we can relate the inverse variances of these two increment processes as $latex \tau' = k \tau$.
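A back-of-the-envelope way to see where the factor of $latex k$ comes from (assuming the refined lattice is meant to describe the same underlying process): each original increment is the sum of $latex k$ of the new, independent increments, so

$latex x_{j+1} - x_j = \sum_{i=1}^{k} \left( x'_{jk+i} - x'_{jk+i-1} \right) \sim N\left(0, k[\tau']^{-1}\right)$,

and matching this to the original increment variance $latex \tau^{-1}$ gives $latex \tau' = k \tau$.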

This strongly suggests that we should not use the same prior for $latex \tau$ as we should for $latex \tau’$, but that the prior should actually know about how many nodes there are on the lattice. Concrete suggestions are in the Sørbye and Rue paper linked above.

Not to coin a phrase, but play it as it lays – Franklin Bruno in Nothing Painted Blue

This type of design dependence is a general problem for multivariate Gaussian models specified through their precision (so-called Gaussian Markov random fields).  The critical thing here is that, unlike the sparsity case, the design dependence does not come from some type of decision process.  It comes from the gap between the parameterisation (in terms of $latex \tau$ and $latex \mathbf{Q}$) and the elicitable quantity (the scale of the random effect).

Gaussian process models

And it’s not like we’re tearing down a house of more than gingerbread.  It’s not like we’re calling down the wrath of heaven on our heads. –  Jenny Toomey sings Franklin Bruno

So the design dependence doesn’t necessarily come in preparation for some kind of decision, it can also be because we have constructed (and therefore parameterised) our process in an inconvenient way.  Let’s see if we can knock out another one before my bottle of wine dies.

Gaussian processes, the least exciting tool in the machine learner’s toolbox, are another example where your priors need to be design dependent.  It will probably surprise you not a single sausage that in this case the need for design dependence comes from a completely different place.

For simplicity let’s consider a Gaussian process $latex f(t)$ in one dimension with isotropic covariance function

$latex c(s,t) =\sigma^2 (\kappa|s-t|)^\nu K_\nu(\kappa|s-t|)$.

This is the commonly encountered Whittle-Matérn family of covariance functions.  The distinguished members are the exponential covariance function when $latex \nu = 0.5$ and the squared exponential function

$latex c(s,t)= \sigma^2\exp\left(-\kappa |s-t|^2 \right)$,

which is the limit as $latex \nu \rightarrow \infty$.

One of the inconvenient features of Matérn models in 1-3 dimensions is that it is impossible to consistently recover all of the parameters by simply observing more and more of the random effect on a fixed interval.  You need to see new replicates in order to properly pin these down.

So one might expect that this non-identifiability would be the source of some problems.

One would be wrong.

The squared exponential covariance function does not have this pathology, but it’s still very very hard to fit. Why? Well the problem is that you can interpret $latex \kappa$ as an inverse-range parameter.  Roughly, the interpretation is that if

$latex |s - t | > \frac{ \sqrt{ 8 \nu } }{\kappa}$

then the value of $latex f(s)$ is approximately independent of the value of $latex f(t)$.  This means that a fixed data set provides no information about $latex \kappa$ in large parts of the parameter space.  In particular if $latex \kappa^{-1}$ is bigger than the range of the measurement locations, then the data has no information about the parameter.  Similarly, if $latex \kappa^{-1}$ is smaller than the smallest distance between two data points (or for irregular data, this should be something like “smaller than some low quantile of the set of distances between points”), then the data will have nothing to say about the parameter.
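As a quick numerical check of that rule of thumb (a sketch; I’m including the usual $latex 2^{1-\nu}/\Gamma(\nu)$ normalising constant so that the correlation tends to one as the distance goes to zero):

import numpy as np
from scipy.special import gamma, kv
def matern_corr(d, kappa, nu):
    # Whittle-Matern correlation, with the constant chosen so the limit at distance zero is one
    x = kappa*np.asarray(d, dtype=float)
    return 2**(1 - nu)/gamma(nu) * x**nu * kv(nu, x)
for nu in (0.5, 1.0, 2.0):
    r = np.sqrt(8*nu)    # the claimed range is sqrt(8*nu)/kappa; take kappa = 1
    print(nu, matern_corr(r, kappa=1.0, nu=nu))   # around 0.14 each time: "approximately independent"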

Of these two scenarios, it turns out that the inference is much less sensitive to the prior on small values of $latex \kappa$ (ie ranges longer than the data) than it is to the prior on large values of $latex \kappa$ (ie ranges shorter than the data).

Currently, we have two recommendations: one based around PC priors and one based around inverse gamma priors (the second link is to the Stan manual). But both of these require you to specify the design-dependent quantity of a “minimum length scale we expect this data set to be informative about”.
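As a sketch of what the second recommendation looks like in practice (the 1% tail targets and the numbers for the smallest spacing and the extent of the data are my inventions, not anything from the linked recommendations): tune an inverse gamma prior on the length scale $latex \ell = 1/\kappa$ so that only a sliver of prior mass sits below the smallest distance between points or above the extent of the data.

import numpy as np
from scipy import optimize, stats
l_min, l_max = 0.1, 10.0                      # assumed: smallest spacing and extent of the data
def tail_mass(log_params):
    alpha, beta = np.exp(log_params)          # work on the log scale to keep both parameters positive
    prior = stats.invgamma(alpha, scale=beta)
    return [prior.cdf(l_min) - 0.01,          # want about 1% prior mass below l_min
            prior.sf(l_max) - 0.01]           # and about 1% prior mass above l_max
alpha, beta = np.exp(optimize.fsolve(tail_mass, x0=np.log([2.0, 1.0])))
print(alpha, beta)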

Betancourt has some lovely Stan case studies on this that I assume will migrate to the mc-stan.org/ website eventually.

I’m a disaster, you’re a disaster, we’re a disaster area. – Franklin Bruno in The Human Hearts (featuring alto extraordinaire and cabaret god Ms Molly Pope)

So in this final example we hit our ultimate goal: a case where design dependent priors are needed not because of a hacky decision process, or an awkward specification, but due to the limits of the data.  Priors that do not recognise the limitations of the design of the experiment will lead to poorly behaved posteriors.  Here it manifests as the Gaussian process severely over-fitting the data.

This is the ultimate expression of the point that we tried to make in the Entropy paper: The prior can often only be understood in the context of the likelihood.

Principles can only get you so far

I’m making scenes, you’re constructing dioramas – Franklin Bruno in Nothing Painted Blue

Just to round this off, I guess I should mention that the strong likelihood principle really does suggest that certain details of the design are not relevant to a fully Bayesian analysis. In particular, if the design only pops up in the normalising constant of the likelihood, it should not be relevant to a Bayesian. This seems at odds with everything I’ve said so far.

But it’s not.

In each of these cases, the design was only invoked in order to deal with some external information. For sparsity, design was needed to properly infer a sparse signal and came in through the structure of the decision process.

For Gaussian processes, the same thing happened: the implicit decision criterion was that we wanted to make good predictions. The design told us which parts of the parameter space obstructed this goal, and a well specified prior removed the problem.

For the CAR models, the external information was that the elicitable quantity was the marginal standard deviation, which was a complicated function of the design and the standard parameter.

There are also any number of cases in real practice where the decision at hand is stochastically dependent on the data gathering mechanism. This is why things like MRP exist.

I guess this is the tl;dr version of this post (because apparently I’m too wordy for some people. I suggest they read other things. Of course suggesting this in the final paragraph of such a wordy post is very me.): Design matters even if you’re Bayesian. Especially if you want to do something with your posterior that’s more exciting than just sitting on it.

The king must die

And then there was Yodeling Elaine, the Queen of the Air. She had a dollar sign medallion about as big as a dinner plate around her neck and a tiny bubble of spittle around her nostril and a little rusty tear, for she had lassoed and lost another tipsy sailor – Tom Waits

(2021 edit: The equations on this post have died, so I’ve posted a lightly edited version on my distill blog. Enjoy.)

It turns out I turned thirty two and became unbearable. Some of you may feel, with an increasing sense of temporal dissonance, that I was already unbearable. (Fair point) Others will wonder how I can look so good at my age. (Answer: Black Metal) None of that matters to me because all I want to do is talk about the evils of marketing like the 90s were a vaguely good idea. (Narrator: “They were not. The concept of authenticity is just another way for the dominant culture to suppress more interesting ones.”)

The thing is, I worry that the real problem in academic statistics in 2017 is not a reproducibility crisis, so much as that so many of our methods just don’t work. And to be honest, I don’t really know what to do about that, other than suggest that we tighten our standards and insist that people proposing new methods, models, and algorithms work harder to sketch out the boundaries of their creations. (What a suggestion. Really. Concrete proposals for concrete change. But it’s a blog. If ever there was a medium to be half-arsed in it’s this one. It’s like twitter for people who aren’t pithy.)

Tell me if it’s true that your skin is dipped in honey

So what is the object of my impotent ire today. Well nothing less storied than the Bayesian Lasso.

It should be the least controversial thing in this, the year of our lord two thousand and seventeen, to point out that this method bears no practical resemblance to the Lasso. Or, in the words of Law and Order: SVU, “The [Bayesian Lasso] is fictional and does not depict any actual person or event”.

Who do you think you are?

The Bayesian Lasso is a good example of what’s commonly known as the Lupita Nyong’o fallacy, which goes something like this: Lupita Nyong’o had a breakout role in Twelve Years a Slave, she also had a heavily disguised role in one of the Star Wars films (the specific Star Wars film is not important. I haven’t seen it and I don’t care). Hence Twelve Years a Slave exists in the extended Star Wars universe.

The key point is that the (classical) Lasso plays a small part within the Bayesian Lasso (it’s the MAP estimate) in the same way that Lupita Nyong’o played a small role in that Star Wars film. But just as the presence of Ms Nyong’o does not turn Star Wars into Twelve Years a Slave, the fact that the classical Lasso can be recovered as the MAP estimate of the Bayesian Lasso does not make the Bayesian Lasso useful.

And yet people still ask if they can be fit in Stan. In that case, Andrew answered the question that was asked, which is typically the best way to deal with software enquiries. (It’s usually a fool’s game to try to guess why people are asking particular questions. It probably wouldn’t be hard for someone to catalogue the number of times I’ve not followed my advice on this, but in life as in statistics, consistency is really only a concern if everything else is going well.) But I am brave and was not asked for my opinion, so I’m going to talk about why the Bayesian Lasso doesn’t work.

Hiding all away

So why would anyone not know that the Bayesian Lasso doesn’t work?  Well, I don’t really know. But I will point out that all of the results that I’ve seen in this direction (not that I’ve been looking hard) have been published in prestigious but obtuse places like the Annals of Statistics, the journal we publish in when we either don’t want people without a graduate degree in mathematical statistics to understand us or when we want to get tenure.

By contrast, the original paper is very readable and published in JASA, where we put papers when we are ok with people who do not have a graduate degree in mathematical statistics being able to read them, or when we want to get tenure.

To be fair to Park and Casella, they never really say that the Bayesian Lasso should be used for sparsity. Except for one sentence in the introduction where they say the median gives approximately sparse estimators, and the title, which links it to the most prominent and popular method for estimating a sparse signal. Marketing, eh. (See, I’m Canadian now).

The devil has designed my death and is waiting to be sure

So what is the Bayesian Lasso? (And why did I spend 600 words harping on about something before defining it? The answer will shock you. Actually the answer will not shock you: it’s because it’s kinda hard to do equations on this thing.)

For data observed with Gaussian error, the Bayesian Lasso takes the form

$latex \mathbf{y} \mid \boldsymbol{\beta} \sim N( \mathbf{X} \boldsymbol{\beta}, \boldsymbol{\Sigma})$

where, instead of putting a Normal prior on $latex \boldsymbol{\beta}$ we put independent Laplace priors

$latex p(\beta_i) = \frac{\lambda}{2} \exp(-\lambda | \beta_i|).$

Here the tuning parameter $latex \lambda = c(n,p,s_0,\mathbf{X})\tilde{\lambda}$ where $latex p$ is the number of covariates, $latex n$ is the number of observations, $latex s_0$ is the number of “true” non-zero elements of $latex \boldsymbol{\beta}$, $latex \boldsymbol{\Sigma}$ is known, and $latex \tilde{\lambda}$ is an unknown scaling parameter that should be $latex \mathcal{O}(1)$.
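For concreteness, here is the unnormalised log-posterior this corresponds to, as a small sketch with $latex \boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$ (my simplification) and everything else as above. Maximising it over $latex \boldsymbol{\beta}$ gives you back the classical Lasso, which is the Lupita Nyong’o connection from earlier.

import numpy as np
def bayesian_lasso_log_post(beta, y, X, lam, sigma=1.0):
    # Gaussian likelihood with known noise scale, plus independent Laplace(lam) priors on beta
    resid = y - X @ beta
    return -0.5*np.sum(resid**2)/sigma**2 - lam*np.sum(np.abs(beta))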

Important Side note: This isn’t the exact same model as Park and Casella used as they didn’t use the transformation

$latex \lambda = c(n,p,s_0,\mathbf{X}) \tilde{\lambda}$

but rather just dealt with $latex \lambda$ as the parameter. From a Bayesian viewpoint, it’s a much better idea to put a prior on $latex \tilde{\lambda}$ than on $latex \lambda$ directly. Why? Because a prior on $latex \lambda$ needs to depend on $latex n$, $latex p$, $latex s_0$, and $latex \mathbf{X}$, and hence needs to be changed for each problem, while a prior on $latex \tilde{\lambda}$ can be used for many problems.  One possible option is $latex c(n,p,s_0,\mathbf{X}) = 2\|\mathbf{X}\|\sqrt{\log p }$, which is a rate optimal parameter for the (non-Bayesian) Lasso. Later, we’ll do a back-of-the-envelope calculation that suggests we probably don’t need the square root around the logarithmic term.
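If it helps to see a number, here is what that scaling constant looks like for a random design (a sketch; I’m reading $latex \|\mathbf{X}\|$ as the spectral norm, which is one common choice):

import numpy as np
rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
c = 2*np.linalg.norm(X, 2)*np.sqrt(np.log(p))   # c(n, p, s0, X) = 2 ||X|| sqrt(log p)
print(c)                                        # lambda = c * lambda_tilde, with lambda_tilde order one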

Edit: I’ve had some questions about this scaling, so I’m going to try to explain it a little better.  The idea here is that the Bayesian Lasso uses the i.i.d. Laplace priors with scaling parameter $latex \lambda$ on $latex \beta_j$ to express the substantive belief that the signal is approximately sparse.  The reason for scaling the prior is that not every value of $latex \lambda$ is consistent with this belief.  For example, $latex \lambda = 1 $ will not give an approximately sparse signal.  While we could just use a prior for $latex \lambda$ that has a very heavy right tail (something like an inverse gamma), this is at odds with a good practice principle of making sure all of the parameters in your models are properly scaled to make them order 1.  Why do we do this? Because it makes it much much easier to set sensible priors.

Other important side note: Some of you may have noticed that the scaling $latex c(n,p,s_0,\mathbf{X})$ can depend on the unknown sparsity $latex s_0$.  This seems like cheating. People who do asymptotic theory call this sort of value for $latex \lambda$ an oracle value, mainly because people studying Bayesian asymptotics are really really into databasing software. The idea is that this is the value of $latex \lambda$ that gives the model the best chance of working. When maths-ing, you work out the properties of the posterior with the oracle value of $latex \lambda$ and then you use some sort of smoothness argument to show that the actual method that is being used to select (or average over) the parameter gives almost the same answer.

Only once in Sheboygan. Only once.

So what’s wrong with the Bayesian Lasso? Well the short version is that the Laplace prior doesn’t have enough mass near zero relative to the mass in the tails to allow for a posterior that has a lot of entries that are almost zero and some entries that are emphatically not zero.  Because the Bayesian Lasso prior does not have a spike at zero, none of the entries will be a priori  exactly zero, so we need some sort of rule to separate the “zero” entries from the “non-zero” entries.  The way that we’re going to do this is to choose a cutoff $latex \epsilon$ where we assume that if $latex |\beta_j| <\epsilon$, then $latex \beta_j =0$.

So how do we know that the Lasso prior doesn’t put enough mass in important parts of the parameter space? Well there are two ways. The first is to read the literature: I learnt it during the exciting process of writing a paper that the reviewers insisted should have an extended section about sparsity (although this was at best tangential to the rest of the paper), so I suddenly needed to know about Bayesian models of sparsity. So I read those Annals of Stats papers. (That’s why I know I should be scaling $latex \lambda$!).

What are the key references? Well all the knowledge that you seek is here and here.

But a much easier way to work out that the Bayesian Lasso is bad is to do some simple maths. Because the $latex \beta_j$ are a priori independent, we get a prior on the effective sparsity $latex s_\epsilon$

$latex s_\epsilon \sim \text{Bin}(p, \Pr(|\beta_j| > \epsilon)).$

For the Bayesian Lasso, that probability can be computed as

$latex \Pr ( | \beta_j | > \epsilon ) = e^{- \lambda \epsilon}$.
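You can sanity check both of these formulas with a couple of lines of simulation (a sketch with arbitrary values of $latex p$, $latex \lambda$, and $latex \epsilon$):

import numpy as np
rng = np.random.default_rng(1)
p, lam, eps = 1000, 50.0, 0.01
beta = rng.laplace(scale=1/lam, size=(2000, p))   # 2000 prior draws of beta, iid Laplace(lam) components
s_eps = (np.abs(beta) > eps).sum(axis=1)          # effective sparsity of each draw
print(s_eps.mean(), p*np.exp(-lam*eps))           # close to the Binomial mean p*exp(-lam*eps)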


Ideally, the distribution of this effective sparsity would be centred on the true sparsity.  That is, we’d like to choose $latex \lambda$ so that

$latex \mathbb{E}(s_\epsilon)= p e^{- \lambda \epsilon}= s_0 $.

A quick re-arrangement suggests that

$latex \lambda = \epsilon^{-1} \log(p) - \epsilon^{-1} \log(s_0)$.

Now, we are interested in signals with $latex s_0 = o(p)$, i.e. where only a very small number of the $latex \beta_j $ are non-zero.  This suggests we can safely ignore the second term.  Some other theory  suggests that we need to take $latex \epsilon$ to depend on $latex p$ in such a way that it goes to zero faster than $latex p^{-1}$ as $latex p$ gets large. (Edit: Oops – this paper used $latex \mathbf{X} = \mathbf{I}$ and $latex p = n$.  For the general case, their reasoning leads to $latex \epsilon = o(\| \mathbf{X} \|^{-1})$.)

This means that we need to take $latex \lambda = \mathcal{O}(p \log(p))$ in order to ensure that we have our prior centred on sparse vectors (in the sense that the prior mean for the number of non-zero components is always much less than $latex p$).

Show some emotion

So for the Bayesian Lasso, a sensible parameter is $latex \lambda = p\log p$, under which a draw from the prior will usually have a large number of components less than the threshold $latex \epsilon$ and a small number that are larger.  But this is still not any good.  To see this, let’s consider the prior probability of seeing a $latex \beta_j$ larger than one:

$latex \Pr ( | \beta_j | > 1) = e^{-\lambda} = p^{-p}$.

This is the problem with the Bayesian Lasso: in order to have a lot of zeros in the signal, you are also forcing the non-zero elements to be very small. Even for very small values of $latex p$ this probability is infinitesimally small, as the quick computation below shows.
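Here is how fast that probability dies (nothing fancy, just plugging in $latex \lambda = p \log p$):

import numpy as np
p = np.array([5.0, 10.0, 50.0, 100.0])
lam = p*np.log(p)
print(np.exp(-lam))   # P(|beta_j| > 1) = p**(-p): about 3e-4 at p = 5, 1e-10 at p = 10, 1e-85 at p = 50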

Basically, the Bayesian Lasso can’t give enough mass to both small and large signals simultaneously.  Other Bayesian models (such as the horseshoe and the Finnish horseshoe) can support both simultaneously and this type of calculation can show that (although it’s harder. See Theorem 6 here).

Side note: The scaling that I derived in the previous section is a little different to the standard Lasso scaling of $latex \lambda = \mathcal{O} (p \sqrt{\log p})$, but the same result holds: for large $latex p$ the probability of seeing a large signal is vanishingly small.

Maybe I was mean, but I really don’t think so

Now, obviously this is not what we see with real data when we fit the Bayesian Lasso in Stan with an unknown $latex \lambda$.  What happens is the model tries to strike a balance between the big signals and the small signals, shrinking the former too much and letting the latter be too far from zero.  You will see this if you try to fit the Bayesian Lasso in Stan.

Update: I’ve put some extra notes in the above text to make it hopefully a little clearer in one case and a little more correct in the other. Please accept this version of the title track by way of an apology.

(Also this stunning version, done wearing even more makeup)

Contour as a verb

Our love is like the border between Greece and Albania – The Mountain Goats

(In which I am uncharacteristically brief)

Andrew’s answer to a recent post reminded me of one of my favourite questions: how do you visualise uncertainty in spatial maps?  An interesting subspecies of this question relates to exactly how you can plot a contour map for a spatial estimate.  The obvious idea is to take a point estimate (like your mean or median spatial field) and draw a contour map on that.

But this is problematic because it does not take into account the uncertainty in your estimate.  A contour on a map indicates a line that separates two levels of a field, but if you do not know the value of the field exactly, you cannot separate it precisely.  Bolin and Lindgren have constructed a neat method for dealing with this problem by having an intermediate area where you don’t know which side of the chosen level you are on.  This replaces thin contour lines with thick contour bands that better reflect our uncertainty.

Interestingly, using these contour bands requires us to reflect on just how certain our estimates are when selecting the number of contours we wish to plot (or else there would be nothing left in the space other than bands).

There is a broader principle reflected in Bolin and Lindgren’s work: when you are visualising multiple aspects of an uncertain quantity, you need to allow for an indeterminate region.  This is the same idea that is reflected in Andrew’s “thirds” rule.
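To give a flavour of the idea (this is a crude marginal version, not Bolin and Lindgren’s actual method, which controls the error of the whole contour band jointly; the 95% cutoff and the fake posterior are mine):

import numpy as np
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 200)                          # locations along a 1-d transect
samples = x + 0.3*rng.standard_normal((4000, x.size))    # fake posterior draws of a field with mean x
level = 0.0
p_above = (samples > level).mean(axis=0)                 # pointwise P(field > level)
band = np.where(p_above > 0.95, "above",
                np.where(p_above < 0.05, "below", "uncertain"))
print((band == "uncertain").sum(), "of", band.size, "locations land in the contour band")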

David and Finn also wrote a very nice R package that implements their method for computing contours (as well as for computing joint “excursion regions”, i.e. areas in space where the random field simultaneously exceeds a fixed level with a given probability).