Call for papers: Probabilistic Programming Languages, Semantics, and Systems (PPS 2018)

I’m on the program committee and they say they’re looking to broaden their horizons this year to include systems like Stan. The workshop is part of POPL, the big programming language theory conference. Here’s the official link.

The submissions are two-page extended abstracts and the deadline is 17 October 2017; the workshop itself is in Los Angeles on 9 January 2018.

You can also see the program from last year. I would’ve liked to have seen Gordon Plotkin’s talk, but there aren’t even abstracts online; I see that despite the hype surrounding comp sci, he’s still modestly titling his contributions “Towards …”.

The workshop is the day before StanCon starts in Monterey (up the coast). That’s cutting it too close for me, so I won’t be at the workshop. I do hope to see you at StanCon, though.


Using black-box machine learning predictions as inputs to a Bayesian analysis

Following up on this discussion [Designing an animal-like brain: black-box “deep learning algorithms” to solve problems, with an (approximately) Bayesian “consciousness” or “executive functioning organ” that attempts to make sense of all these inferences], Mike Betancourt writes:

I’m not sure AI (or machine learning) + Bayesian wrapper would address the points raised in the paper. In particular, one of the big concepts that they’re pushing is that the brain builds generative/causal models of the world (they do a lot based on simple physics models) and then uses those models to make predictions outside the scope of the data that it has previously seen. True out-of-sample performance is still a big problem in AI (they’re trying to make the training data big enough to make “out of sample” an irrelevant concept, but that’ll never really happen) and these kinds of latent generative/causal models would go a long way to improving that. Adding a Bayesian wrapper could identify limitations of an AI, but I don’t see how it could move towards this kind of latent generative/causal construction.

If you wanted to incorporate these AI algorithms into a Bayesian framework then I think it’s much more effective to treat the algorithms as further steps in data reduction. For example, train some neural net, treat the outputs of the net as the actual measurement, and then add the trained neural net to your likelihood. This is my advice when people want to/have to use machine learning algorithms but also want to quantify systematic uncertainties.

My response: yes, I give that advice too, and I’ve used this method in consulting problems. Recently we had a pleasant example in which we started by using the output from the so-called machine learning as a predictor, then we fit a parametric model to the machine-learning fit, and now we’re transitioning toward modeling the raw data. Some interesting general lessons here, I think. In particular, machine-learning-type methods tend to be crap at extrapolation and can have weird flat behavior near the edge of the data. So in this case when we went to the parametric model, we excluded some of the machine-learning predictions in the bad zone as they were messing us up.
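The data-reduction advice above can be illustrated with a toy sketch. Everything here is invented for illustration (the “black box” is a stand-in for a trained net, and a real application would calibrate the error model for the reduced measurement by simulation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw data from some measurement process with unknown location theta.
true_theta = 3.0
raw_data = rng.normal(true_theta, 1.0, size=200)

# Stand-in for a trained black-box model: any fixed data-reduction map.
def black_box(x):
    return x.mean()

z = black_box(raw_data)  # treat the reduced output as the "measurement"

# Likelihood for the reduced measurement: z ~ Normal(theta, sigma_z),
# with sigma_z calibrated for the black box (here, known analytically).
sigma_z = 1.0 / np.sqrt(len(raw_data))

# Conjugate normal update with a weak Normal(0, 10) prior on theta.
prior_mu, prior_sd = 0.0, 10.0
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / sigma_z**2)
post_mu = post_var * (prior_mu / prior_sd**2 + z / sigma_z**2)
print(post_mu)  # close to true_theta
```

The point is structural: the black box becomes one more step in the measurement process, and its output gets an explicit error model in the likelihood, so its systematic uncertainty can be quantified and propagated.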

Betancourt adds:

It could also be spun into a neurological narrative: our sensory organs and lower brain functions operate as AI, reducing raw inputs into more abstract/informative features, from which the brain can then go all Bayesian and build the generative/causal models advocated in the paper.

p less than 0.00000000000000000000000000000000 . . . now that’s what I call evidence!

I read more carefully the news article linked to in the previous post, which describes a forking-pathed nightmare of a psychology study, the sort of thing that was routine practice back in 2010 or so but which we’ve mostly learned to at least try to avoid.

Anyway, one thing I learned is that there’s something called “terror management theory.” Not as important as embodied cognition, I guess, but it seems to be a thing: according to the news article, it’s appeared in “more than 500 studies conducted over the past 25 years.”

I assume that each of these separate studies had p less than 0.05, otherwise they wouldn’t’ve been published, and I doubt they’re counting unpublished studies.

So that would make the combined p-value less than 0.05^500.

Ummm, what’s that in decimal numbers?

> 500*log10(0.05)
[1] -650.515
> 10^(-0.515)
[1] 0.3054921

OK, so the combined result is p less than 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000031.
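For what it’s worth, the same arithmetic can be checked in Python. `0.05**500` underflows ordinary floating point straight to zero, so you work on the log scale, just as the R snippet above does:

```python
import math

# Multiply 500 p-values of 0.05 on the log10 scale.
log10_p = 500 * math.log10(0.05)       # about -650.515
exponent = math.floor(log10_p)         # -651
mantissa = 10 ** (log10_p - exponent)  # about 3.05
print(f"p < {mantissa:.2f}e{exponent}")
```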

I guess terror management theory must be real, then.

The news article, from a Pulitzer Prize-winning reporter (!), concludes:

Score one for the scientists.

That’s old-school science writing for ya.

As I wrote in my previous post, I feel bad for everyone involved in this one. Understanding of researcher degrees of freedom and selection bias has only gradually percolated through psychology research, and it stands to reason that there are still lots of people, young and old, left behind, still doing old-style noise-mining, tea-leaf-reading research. I can only assume these researchers are doing their best, as is the journalist reporting these results, with none of them realizing that they’re doing little more than shuffling random numbers.

The whole thing is funny, but it’s also sad; still, I hope we’re moving forward. The modern journalists are getting clued in, and I expect the traditional science journalists will follow. There remains the problem of selection bias, that the credulous reporters write up these stories while the skeptics don’t bother. But I’m hoping that, one by one, reporters will figure out what’s going on.

After all, nobody wants to be the last one on the sinking ship. I guess you wouldn’t be completely alone, as you’d be accompanied by the editor of Perspectives on Psychological Science and the chair of the Association for Psychological Science publications board. But who really wants to hang out with them all day?

Stan Course in Newcastle, United Kingdom!

(this post is by Betancourt)

The growth of Stan has afforded the core team many opportunities to give courses, to both industrial and academic audiences and at venues across the world. Regrettably we’re not always able to keep up with demand for new courses, especially outside of the United States, due to our already busy schedules. Fortunately, however, some of our colleagues are picking up the slack!

In particular, Jumping Rivers is hosting a two-day introductory RStan course at the University of Newcastle in the United Kingdom from Thursday, December 7th to Friday, December 8th. The instructor is my good friend Sarah Heaps, who is not only an excellent statistician and avid Stan user but also attended one of the first RStan courses I gave!

If you are on the other side of the Atlantic and interested in learning RStan then I highly recommend attending (and checking out Newcastle’s surprisingly great Chinatown during lunch breaks).

And if you are interested in organizing a Stan course with any members of the core team then don’t hesitate to contact me to see if we might be able to arrange something.

As if the 2010s never happened

E. J. writes:

I’m sure I’m not the first to send you this beauty.

Actually, E. J., you’re the only one who sent me this!

It’s a news article, “Can the fear of death instantly make you a better athlete?”, reporting on a psychology experiment:

For the first study, 31 male undergraduates who liked basketball and valued being good at it were recruited for what they were told was a personality and sports study. The subjects were asked to play two games of one-on-one basketball against a person they thought was another subject but who was actually one of the researchers.

In between the two games, the participants were asked to fill out a questionnaire. Half of the subjects were randomly assigned questions that probed them to think about a neutral topic (playing basketball); the other half were prompted to think about their mortality with questions such as, “Please briefly describe the thoughts and emotions that the thought of your own death arouses in you” . . .

That’s right, priming! What could be more retro than that?

The news article continues:

The researchers hypothesized that according to terror management theory, those who answered the mortality questions should show an improvement in their second game. When the results of the experiment, which was videotaped, were analyzed, the researchers found out the subjects’ responses exceeded their expectations: The performance in the second game for those who had received a memento mori increased 40 percent, while the other group’s performance was unchanged.

They quoted one of the researchers as saying, “What we were surprised at was the magnitude of the effect, the size in which we saw the increases from baseline.”

I have a feeling that nobody told them about type M errors.

There’s more at the link, if you’re interested.

I feel bad for everyone involved in this one. Understanding of researcher degrees of freedom and selection bias has only gradually percolated through psychology research, and it stands to reason that there are still lots of people, young and old, left behind, still doing old-style noise-mining, tea-leaf-reading research. I can only assume these researchers are doing their best, as is the journalist reporting these results, with none of them realizing that they’re doing little more than shuffling random numbers.

One recommendation that’s sometimes given in these settings is to do preregistered replication. I don’t always like to advise this because, realistically, I expect that the replication won’t work. But preregistration can help to convince. I refer you to the famous 50 shades of gray study.

Maybe this paper is a parody, maybe it’s a semibluff

Peter DeScioli writes:

I was wondering if you saw this paper about people reading Harry Potter and then disliking Trump, attached. It seems to fit the shark attack genre.

In this case, the issue seems to be judging causation from multiple regression with observational data, assuming that control variables are enough to narrow down to causality (or that it’s up to a critic to find the confounds). It speaks to a bigger issue about how researchers interpret multiple regression in causal terms.

Any thoughts on this, or obvious/good references critiquing causal interpretations of multiple regression? (like to assign to my PhD students)

My reply: Hi, yes, I saw this paper months ago. I suspected it was a parody but someone told me that it was actually supposed to be serious. I still think it’s a kind of half-parody, it’s what social scientists might call a “fun” result, and it’s published in a non-serious journal, so I doubt the author takes it completely seriously. Kinda like this: you find an interesting pattern in data, it’s probably no big deal, but who knows, so get it out there and people can make of it what they will.

Twenty years ago, social scientists could do this and it would be no problem; nowadays, with all this stuff on shark attacks, college football, power pose, contagion of obesity, etc., it seems that people have more difficulty putting such speculations into perspective: any damn data pattern they see, they want to insist it’s a big deal, from data analysis to publication to TED talk and NPR. In some sense this Harry Potter paper is a throwback, and it would probably be best to interpret it the way we’d have taken it twenty or thirty years ago.

It’s impossible for me to tell whether the author, Diana Mutz, is writing this paper as a parody. Intonation is notoriously difficult to convey in typed speech. It’s a funny thing: if the paper’s not a parody and I say it is, then I’m kinda being insulting. But if the paper is a parody and I take it seriously, then I’m not getting the joke. So there’s no safe interpretation here! (I could ask Mutz directly but that’s not much of a general solution; I’d rather think of a published article and its implications as standing on their own and not requiring typically unobtainable “meta-data” on authorial intentions.)

DeScioli responded:

It does have some whimsical passages so maybe it is half-parody.

And I continued:

Yeah, there’s this genre of research which is not entirely serious but not entirely a joke, kinda what in poker we’d call a semibluff. Back in the good old days before Gladwell, PPNAS, NPR hype, etc., it was reasonable for researchers to try out some of these ideas; they were long shots but had some appeal as part of the mix of science. For a while, though, it was seeming like this sort of open-ended-speculation-backed-by-statistically-significant-p-values had become most of social science, and this has reduced all of our patience for this sort of thing. Which is kinda too bad. Another example is the observation that several recent presidents were left-handed. It seems like it should be possible to point to such data patterns, and even run some statistics on them, without making large claims.

DeScioli followed up:

Seems it could still be as fun and interesting to look for these types of correlations without claiming causality. I was just surprised to see the paper double down on the causal interpretation with the argument that the analysis controlled for everything they could think of. (My assumption is that observational data has countless confounding correlations that no one could think of.) I don’t think this paper is worse in over-interpreting than many others I’ve seen. It’s just easier to notice because of the whimsical topic.

What’s the lesson for avoiding this for a more serious-sounding theory? I typically restrict causal judgments to experimental manipulations. But maybe that is too restrictive? The only other thing I can think of is if a researcher knew so much about their subject that they could boil down the possible causes to a handful. Then maybe multiple regression with controls could help sort between them. If so, the issue with Harry Potter is that it’s one of millions of similar cultural influences that are all hopelessly tangled and so can’t be untangled with observational methods.

Where does the discussion go?

Jorge Cimentada writes:

In this article, Yascha Mounk is saying that political scientists have failed to predict unexpected political changes such as the Trump nomination and the sudden growth of populism in Europe, because, he argues, of the way we’re testing hypotheses.

By that he means the quantitative aspect behind science discovery. He goes on to talk about the historical change from qualitative to modern quantitative analysis, which hinders the capacity of scholars to study ‘less common’ or infrequent situations, such as the ones outlined above.

I [Cimentada] am pretty sure there’s some truth behind that, but still, I think that the capacity to predict is not entirely based on the frequency of things. Another thing he fails to distinguish is that specific questions require specific designs. Depending on your research question, you might need to use qualitative over quantitative approaches.

If you have some time, I’d like to hear your stand on this. I thought this might be something which could fit in one of your blog entries which is why I contacted you.

I leave you with one paragraph that summarizes his main point pretty well:

It is easier to amass high-quality data, and therefore to make “rigorous” causal claims, about the economy than about culture; in part for that reason, the social sciences now favor economic over cultural modes of explanation. Similarly, it is easier to amass high-quality data, and to test causal hypotheses, about frequent events that are easy to count and categorize, like votes in Congress, than about rare and intractable events, like political revolutions; in part for that reason, the social sciences now tend to focus more on the business-as-usual of the recent past than on the great turning points of more distant times.

My reply: these are good questions that are worth considering. It’s kinda funny that they appeared in an opinion article in the Chronicle of Higher Education rather than in a political science journal, but I guess that journals are not so important for communication anymore. Nowadays journals are all about academic promotion and tenure. When people want to have scholarly discussions, they turn to newspapers, blogs, etc.

Extended StanCon 2018 Deadline!

(this post is by Betancourt)

We received an ensemble of exciting submissions for StanCon 2018, but some of our colleagues requested a little bit of extra time to put the finishing touches on their submissions. Being the generous organizers that we are, we have decided to extend the submission deadline for everyone by two weeks.

Contributed submissions will be accepted until September 29, 2017, 5:00:00 AM GMT (that’s midnight on the east coast of the States, for those who aren’t fans of the meridian time). We will do our best to review and send decisions out before the early registration deadline, but the sooner you submit, the more likely you are to hear back before then. For more details on the submission requirements and how to submit, see the Submissions page.

Early registration ends on Friday, November 10, 2017, after which registration costs increase significantly. Registration for StanCon 2018 is in two parts: an initial information form followed by payment and accommodation reservation at the Asilomar website.

Type M errors in the wild—really the wild!

Jeremy Fox points me to this article, “Underappreciated problems of low replication in ecological field studies,” by Nathan Lemoine, Ava Hoffman, Andrew Felton, Lauren Baur, Francis Chaves, Jesse Gray, Qiang Yu, and Melinda Smith, who write:

The cost and difficulty of manipulative field studies makes low statistical power a pervasive issue throughout most ecological subdisciplines. . . . In this article, we address a relatively unknown problem with low power: underpowered studies must overestimate small effect sizes in order to achieve statistical significance. First, we describe how low replication coupled with weak effect sizes leads to Type M errors, or exaggerated effect sizes. We then conduct a meta-analysis to determine the average statistical power and Type M error rate for manipulative field experiments that address important questions related to global change; global warming, biodiversity loss, and drought. Finally, we provide recommendations for avoiding Type M errors and constraining estimates of effect size from underpowered studies.

As with the articles discussed in the previous post, I haven’t read this article in detail, but of course I’m supportive of the general point, and I have every reason to believe that type M errors are a big problem in a field such as ecology where measurement is difficult and variation is high.
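The mechanism behind the authors’ point is easy to see in a little simulation (the numbers here are invented for illustration): with a small true effect and a large standard error, only wildly high estimates clear the significance bar, so the estimates that get reported systematically exaggerate the effect.

```python
import numpy as np

rng = np.random.default_rng(42)
true_effect, se, n_sims = 0.2, 0.5, 100_000

# Sampling distribution of the effect estimate in an underpowered study.
est = rng.normal(true_effect, se, n_sims)

# Keep only the "publishable" results: two-sided p < 0.05.
significant = np.abs(est) > 1.96 * se

power = significant.mean()
type_m = np.abs(est[significant]).mean() / true_effect
print(f"power ~ {power:.2f}; significant estimates exaggerate ~{type_m:.1f}x")
```

With these made-up numbers, only a few percent of studies reach significance, and those that do overestimate the true effect several-fold, which is exactly the Type M problem.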

P.S. Steven Johnson sent in the above picture of a cat who is not in the wild, but would like to be.

Type M errors studied in the wild

Brendan Nyhan points to this article, “Very large treatment effects in randomised trials as an empirical marker to indicate whether subsequent trials are necessary: meta-epidemiological assessment,” by Myura Nagendran, Tiago Pereira, Grace Kiew, Douglas Altman, Mahiben Maruthappu, John Ioannidis, and Peter McCulloch.

From the abstract:

Objective To examine whether a very large effect (VLE; defined as a relative risk of ≤0.2 or ≥5) in a randomised trial could be an empirical marker that subsequent trials are unnecessary. . . .

Data sources Cochrane Database of Systematic Reviews (2010, issue 7) with data on subsequent large trials updated to 2015, issue 12. . . .

Conclusions . . . Caution should be taken when interpreting small studies with very large treatment effects.

I’ve not read the paper and so can’t evaluate these claims but they are in general consistent with our understanding of type M and type S errors. So, just speaking generally, I think it’s good to see this sort of study.

Along similar lines, Jonathan Falk pointed me to this paper, “On the Reproducibility of Psychological Science,” by Val Johnson, Richard Payne, Tianying Wang, Alex Asher, and Soutrik Mandal. I think their model (in which effects are exactly zero or else are spaced away from zero) is goofy and I don’t like the whole false-positive, false-negative thing or the idea that hypothesis tests in psychology experiments correspond to “scientific discoveries,” but I’m guessing they’re basically correct in their substantive conclusions, as this seems similar to what I’ve heard from other sources.

New Zealand election polling

Llewelyn Richards-Ward writes:

Here is a forecaster apparently using a simulated (?Bayesian) approach and smoothing over a bunch of poll results in an attempt to guess the end result. I looked but couldn’t find his methodology but he is at University of Auckland, if you want to track him down…

As a brief background, we have a very centrist government position and this has been so for many years since the neo-liberalist earthquake of the 80s hit us all. Really, the two main parties, Labour and National, are a metre away from each other compared to the gulf between similar parties overseas. We have an MMP system, with 120 seats. Usually there is a coalition government, which has given us very stable growth, positive social outcomes and reasonable taxation. This election is one where the swings in polls and voter options have been startling, for a small place like ours. First, the Green Party co-leader admitted to welfare fraud, actually told us all about it, hoping to garner sympathy for their social issues. She was a goner after public outcry. We all hate cheaters. Then the leader of a stabilising minor party decided to quit, probably as he was polling lower than usual. That party is probably now in that place called oblivion as, without an electorate seat or 5% polling, no ticket to parliament is handed out. Then the Labour party, lately rather staid and uninspiring, killed off yet another leader and a new, female, young, face appeared. This has been dubbed “jacinda-mania”. Lots of energy, she has a PR background, and appeals to the young chic voters (most of whom never actually vote). Labour, after leaping from the low teens to 43% (or something, in some polls) as a result of a fresh face (same policies and other candidates), now is slipping back again. Today, after stupidly promising a tax working group (which we all know means with people who will eat towards their preferred outcomes), they have U-U-turned and now say they will come up with ideas and put it to the electorate next election. The lesson again is don’t ask voters to write blank cheques and don’t threaten middle-NZ with property taxes whilst large corporates are paying very little.
NZ, given mostly we are doing well, is not a place easily swayed into change when status-quo seems to be working. Who knows what tomorrow brings — I personally will be voting early to reduce the tension of it all.

My take is that all the pollsters are off-centre because of poll variability. It seems that the above issues have induced changes of preference in the public, rather than simply variability being about noise, so-called.

I don’t know enough about New Zealand to comment on this one. I did read an interesting book about New Zealand politics back when I visited the country in, ummm, 1992 I think it was. But I haven’t really thought about the country since, except briefly when doing our research project on representation and government spending in subnational units. New Zealand is (or was) one of the few countries in the world with a unitary government in which no power was devolved to states or provinces.

Anyway, regarding Richards-Ward’s final paragraph above, I would expect that (a) much of the apparent fluctuation in the polls is really explained by fluctuations in differential nonresponse, but (b) you’ll see some instability in a multi-candidate election that you wouldn’t see in a (nearly) pure two-party system such as we have in the United States.

P.S. I feel like I should make some sort of Gandalf joke here but I’m just not really up for it.

American Democracy and its Critics

I just happened to come across this article of mine from 2014: it’s a review published in the American Journal of Sociology of the book “American Democracy,” by Andrew Perrin.

My review begins:

Actually-existing democracy tends to have support in the middle of the political spectrum but is criticized on the two wings.

I like Perrin’s book, and I like my review too, so I’m sharing it with you here.

P.S. Turnabout is fair play; here’s Perrin’s review from a few years back of Red State Blue State and several other books.

Causal inference using data from a non-representative sample

Dan Gibbons writes:

I have been looking at using synthetic control estimates for estimating the effects of healthcare policies, particularly because for say county-level data the nontreated comparison units one would use in say a difference-in-differences estimator or quantile DID estimator (if one didn’t want to use the mean) are not especially clear. However, given that most of the data surrounding alcohol or drug use is from surveys, this would involve modifying these procedures to use either weighting or Bayesian hierarchical models to deal with the survey design. The current approach is just to assume that means are THE aggregate population-level mean, and ignore the sample error. I don’t like this particularly much, especially because with the synthetic control approach specifically the constructed `artificial’ region may be constructed out of the wrong donor units if the sample errors of the donor units are ignored.

My main subsequent worry here is that using a model-based weighting approach followed by the synthetic control estimator would basically be too much complexity to estimate from surveys in practice, in that sample sizes needed would be inflated and the estimates could become deeply unreliable (my fellow econometricians tend to rely on asymptotic results here, but I don’t trust them much in practice).

I am wondering from your perspective, whether it is best with cases that are more complicated than simple regression to either ignore the survey design, use the given sample survey weights or use a model-based approach? I know there is no “answer” to this question, I just worry that the current approach in economics which largely involves shrugging is not acceptable and thought you might be able to give some pointers.

My reply:

1. I generally think it’s best to include in your regression any relevant pre-treatment variables that are predictive of inclusion in the survey, and also to consider interactions of the treatment effect with these predictors. This points toward an MRP (multilevel regression and poststratification) approach, in which you use multilevel modeling to get stable estimates of your coefficients, and you poststratify to get average treatment effects of interest.

I’d like a good worked example of this sort of analysis in the causal inference context. Once we have something specific to point to, it should be easier for people to follow along and apply this method to their problems.
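The poststratification half of the story, at least, is simple enough to sketch with invented cells and numbers; the hard part, the multilevel model that produces the per-cell estimates, is omitted here:

```python
# Model-based treatment-effect estimate for each demographic cell
# (in practice these would come from a multilevel model fit to the sample).
cell_effect = {"young_urban": 0.10, "young_rural": 0.05,
               "old_urban": 0.02, "old_rural": -0.01}

# Population counts for the same cells (e.g., from census data).
pop_count = {"young_urban": 300, "young_rural": 200,
             "old_urban": 350, "old_rural": 150}

# Poststratify: population-weighted average of the cell-level estimates.
total = sum(pop_count.values())
avg_effect = sum(cell_effect[c] * pop_count[c] for c in cell_effect) / total
print(avg_effect)  # ~0.0455
```

The weighting by population counts, rather than by sample composition, is what corrects for the non-representative sample.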

2. Regarding “shrugging is not acceptable”: one way to demonstrate or to check this is through a fake-data simulation.
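A fake-data simulation along those lines might look like this. Everything here is invented for illustration: one binary covariate drives both the inclusion probability and the treatment effect, which is enough to show that ignoring the survey design badly biases the estimate while design-based weights recover it:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# One covariate that affects both sampling and the treatment effect.
group = rng.binomial(1, 0.5, N)
true_effect = np.where(group == 1, 2.0, 0.0)  # population ATE = 1.0

# The survey design over-samples group 0.
p_include = np.where(group == 1, 0.2, 0.8)
sampled = rng.random(N) < p_include

# Observed effect estimates for the sampled units, with noise.
y = true_effect[sampled] + rng.normal(0.0, 1.0, sampled.sum())

naive = y.mean()  # ignores the design; biased toward group 0's effect
weighted = np.average(y, weights=1.0 / p_include[sampled])  # inverse-prob.
print(f"naive ~ {naive:.2f}, weighted ~ {weighted:.2f}, truth = 1.00")
```

Here the naive estimate lands around 0.4 while the design-weighted one recovers the population value of 1.0, which is the kind of demonstration that makes “shrugging” hard to defend.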

Job openings at online polling company!

Kyle Dropp of online polling firm Morning Consult says they are hiring a bunch of mid-level data scientists and software engineers at all levels:

About Morning Consult:

We are interviewing about 10,000 adults every day in the U.S. and ~20 countries, we have worked with 150+ Fortune 500 companies and industry associations and we are doing weekly polling with the New York Times and POLITICO. We are using the data to build a range of SaaS marketing tools and intelligence platforms that are relevant to CMOs, heads of government affairs, chief comms officers, traders, journalists and researchers.


Title: Data Scientists

Job Description:

Morning Consult is re-defining how public opinion data is measured – from collection, to storage, to visualization. We’re looking to add talented, ambitious technical talent to our data science team to help continue pushing boundaries.

This role requires getting your hands dirty with all aspects of the data collection process, such as designing survey questionnaires, administering survey experiments, and improving sampling/weighting methods. Data scientists will need to use statistical techniques such as regression analysis and machine learning to develop and improve statistical models.

Candidates must be comfortable in ambiguity and willing to learn; entrepreneurial builders who proactively seek out ways to help their teammates and improve processes will thrive.

Key Responsibilities:

· Take ownership over all aspects of the data collection process such as programming survey questionnaires and performing analyses of custom data sets
· Continuously generate ideas and build tools to improve data collection methods and quality
· Utilize statistical techniques such as regression analysis and machine learning to develop and improve statistical models
· Learn new programming languages as necessary
· Team player that is willing to have fun and work hard in a fast-paced startup environment
· Continuous self-improvement; willing to accept feedback.

Desired Skills:

· 1-3 years of experience in highly analytical role at a fast-paced company (or equivalent)
· A bachelor’s or master’s degree in economics, statistics, computer science, math, physics, or engineering (or equivalent)
· Expertise with R statistical software (i.e., data visualization and processing)
· Knowledge of descriptive/inferential statistics and econometrics
· Experience with online survey design, online survey panels and online weighting methodologies
· Familiarity with the following a plus: Python, Amazon Web Services, Qualtrics, hypothesis testing, A/B testing.

Cool. I like the bit about “getting your hands dirty with all aspects of the data collection process.”

I hope they’re using Mister P (otherwise they can have problems generalizing from sample to population) and Stan (so they can fit flexible models and include relevant prior information).

P.S. The cat above is lying in wait to interview you.

Trial by combat, law school style

This story is hilarious. A 78-year-old law professor was told he can no longer teach a certain required course; this jeopardizes his current arrangement, under which he is paid full time but teaches only one semester a year, so he’s suing his employer . . . Columbia Law School.

The beautiful part of this story is how logical it all is. Dude wins the case? That means he’s such a stone-cold lawyer that he can sue Columbia University and win, thus he clearly is qualified to teach a required course. But if he loses the case, he’s a fool’s fool who foolishly thought he ever had a chance to win this lawsuit, and thus is clearly unqualified to teach at a top law school.

It’s trial by combat, but this time it makes sense.

The only part that doesn’t work is that he seems to have hired a law firm. Really he should be representing himself. That would make the story better.

P.S. I don’t know any of the people involved in this one, and the reactions here are just mine; I’m not speaking in any official Columbia University capacity.

It seemed to me that most destruction was being done by those who could not choose between the two

Amateurs, dilettantes, hacks, cowboys, clones — Nick Cave

[Note from Dan 11Sept: I wanted to leave some clear air after the StanCon reminder, so I scheduled this post for tomorrow. Which means you get two posts (one from me, one from Andrew) on this in two days. That’s probably more than the gay face study deserves.]

I mostly stay away from the marketing aspect of machine learning / data science / artificial intelligence. I’ve learnt a lot from the methodology and theory aspects, but the marketing face makes me feel a little ill.  (Not unlike the marketing side of statistics. It’s a method, not a morality.)

But occasionally stories of the id and idiocy of that world wander past my face and I can’t ignore it any more. Actually, it’s happened twice this week, which—in a week of two different orientations, two different next-door frat parties until two different two A.M.s, and two grant applications that aren’t as finished as I want them to be—seems like two times too many to me.

The first one can be dispatched with quickly.  This idiocy came from IBM who are discovering that deep learning will not cure cancer.  Given the way that governments, health systems, private companies, and  research funding bodies are pouring money into the field, this type of snake oil pseudo-science is really quite alarming. Spectacularly well-funded hubris is still hubris and sucks money and oxygen from things that help more than the researcher’s CV (or bottom line).

In the end, the “marketing crisis” in data science is coming from the same place as the “replication crisis” in statistics.  The idea that you can reduce statistics and data analysis (or analytics) down to some small set of rules that can be automatically applied is more than silly, it’s dangerous. So just as the stats community has begun the job of clawing back the opportunities we’ve lost to p=0.05, it’s time for data scientists to control their tail.

Untouchable face

The second article occurred, as Marya Dmitriyevna (who is old-school) in Natasha, Pierre, and the Great Comet of 1812 shrieks after Anatole (who is hot) tries to elope with Natasha (who is young) , in my house. So I’m going to spend most of this post on that. [It was pointed out to me that the first sentence is long and convoluted, even for my terrible writing. What can I say? It’s a complicated Russian novel, everyone’s got nine different names.]

(Side note: The only reason I go to New York is Broadway and cabaret. It has nothing to do with statistics and the unrepeatable set of people in one place. [please don’t tell Andrew. He thinks I like statistics.])

It turns out that deep neural networks join the illustrious ranks of “men driving past in cars”, “angry men in bars”, “men on the internet”,  and “that one woman at a numerical analysis conference in far north Queensland in 2006” in feeling obliged to tell me that I’m gay.

Now obviously I rolled my eyes so hard at this that I almost pulled a muscle.  I marvelled at the type of mind who would decide “is ‘gay face’ real” would be a good research question. I also really couldn’t understand why you’d bother training a neural network. That’s what instagram is for. (Again, amateurs, dilettantes, hacks, cowboys, clones.)

But of course, I’d only read the headline.  I didn’t really feel the urge to read more deeply.  I sent a rolled eyes emoji to the friend who thought I’d want to know about the article and went back about my life.

Once, twice, three times a lady

But last night, as the frat party next door tore through 2am (an unpleasant side-effect of faculty housing, it seems), it popped up again on my facebook feed. This time it had grown an attachment to that IBM story. I was pretty tired and thought “Right. I can blog on this”.

(Because that’s what Andrew was surely hoping for, three thousand words on whether “gay face” is a thing. In my defence, I did say “are you sure?” more than once and reminded him what happened that one time X let me post on his blog.)

(Side note: unlike odd, rambling posts about papers we’d just written, posts about bad statistics published in psychology journals are right in Andrew’s wheelhouse. So why aren’t I waiting until mid 2018 for a post on this that he may or may not have written to appear? [DS 11Sept: Ha!] There’s not much that I can be confident that I know a lot more about than Andrew does, but I am very confident I know a lot more about “gay face”.)

So I read the Guardian article, which (wonder of wonders!) actually linked to the original paper. Attached to the original paper was an authors’ note, and an authors’ response to the inevitable press release from GLAAD calling them irresponsible.

Shall we say that there’s some drift from the paper to the press reports. I’d be interested to find out if there was at any point a press release marketing the research.

The main part that has gone AWOL is a rather long, dystopian discussion that the authors have about how they felt morally obliged to publish this research because people could do this with “off the shelf” tools to find gay people in countries where it’s illegal. (We will unpack that later.)

But the most interesting part is the tension between footnote 10 of the paper (“The results reported in this paper were shared, in advance, with several leading international LGBTQ organizations”) and the GLAAD/HRC press release that says

 Stanford University and the researchers hosted a call with GLAAD and HRC several months ago in which we raised these myriad concerns and warned against overinflating the results or the significance of them. There was no follow-up after the concerns were shared and none of these flaws have been addressed.

With one look

If you strip away all of the coverage, the paper itself does some things right. It has a detailed discussion of the limitations of the data and the method. (More later)  It argues that, because facial features can’t be taught, these findings provide evidence towards the prenatal hormone theory of sexual orientation (ie that we’re “born this way”).

(Side note: I’ve never liked the “born this way” narrative. I think it’s limiting.  Also, getting this way took quite a lot of work. Baby gays think they can love Patti and Bernadette at the same time. They have to learn that you need to love them with different parts of your brain or else they fight. My view is more “We’re recruiting. Apply within”.)

So do I think that this work should be summarily dismissed? Well, I have questions.

(Actually, I have some serious doubts, but I live in Canada now so I’m trying to be polite. Is it working?)

Behind the red door

Male facial image brightness correlates 0.19 with the probability of being gay, as estimated by the DNN-based classifier. While the brightness of the facial image might be driven by many factors, previous research found that testosterone stimulates melanocyte structure and function leading to a darker skin. (Footnote 6)

Again, it’s called instagram.

In the Authors’ notes attached to the paper, the authors recognise that “[gay men] take better pictures”, in the process acknowledging that they themselves have gay friends [honestly, there are too many links for that] who also possess this power. (Once more for those in the back: they’re called filters. Straight people could use them if they want. Let no photo go unmolested.)

(Side note: In 1989 Liza Minelli released a fabulous collaboration with the Petshop Boys that was apparently called Results because that’s what Janet Street-Porter used to call one of her outfits. I do not claim to understand straight men [nor am I looking to], but I have seen my female friends’ tinder. Gentlemen: you could stand to be a little more Liza.)

Back on top

But enough about methodology, let’s talk about data. Probably the biggest criticism that I can make of this paper is that they do not identify the source of their data. (Actually, in the interview with The Economist that originally announced the study, Kosinski says that this is intentional to “discourage copycats”.)

Obviously this is bad science.

(Side note: I cannot imagine the dating site in question would like its users to know that it allows its data to be scraped and analysed. This is what happens when you don’t read the “terms of service”, which I imagine the authors (and the ethics committee at Stanford) read very closely.)

[I’m just assuming that, as a study that used identifiable information about people who could not consent to being studied and deals with a sensitive subject, this would’ve gone through an ethics process.]

This failure to disclose the origin of the data means we cannot contextualise it within the Balkanisation of gay desire. Gay men (at least, I will not speak for lesbians and I have nothing to say about bisexuals except “I thank you for your service”) will tell you what they want (what they really really want). This perverted and wonderful version of “The Secret” has manifested in multiple dating platforms that cater to narrow subgroups of the gay community.

Withholding information about the dating platform prevents people from independently scrutinising how representative the sample is likely to be. This is bad science.

(Side note: How was this not picked up by peer review? I’d guess it was reviewed by straight people. Socially active gay men should be able to pick a hole in the data in three seconds flat.  That’s the key point about diversity in STEM. It’s not about ticking boxes or meeting quotas. It’s that you don’t know what a minority can add to your work if they’re not in the room to add it.)

If you look at figure 4, the composite “straight” man is a trucker, while the composite “gay” man is a twunk. This strongly suggests that this is not my personal favourite type of gay dating site: the kind that caters to those of us who look like truckers. It also suggests that the training sample is not representative of the population the authors are generalising to. This is bad science.

(Terminology note: “Twunk” is the past form of “twink”, which is gay slang for a young (18 to early-twenty-something), skinny, hairless gay man.)

The reality of gay dating websites is that you tailor your photographs to the target audience. My facebook profile picture (we will get to that later) is different to my grindr profile picture, which is different to my scruff profile picture. In the spirit of Liza, these are chosen for results. In these photos, I run a big part of the gauntlet between those two “composite ideals” in Figure 4. (Not all the way because, never having been a twink, I never twunk.)

(Side note: The last interaction I had on Scruff was a two hour conversation about Patti LuPone. This is an “off-label” usage that is not FDA approved.)

So probably my biggest problem with this study is that the training sample is likely unrepresentative of the population at large. This means that any inferences drawn from a model trained on this sample will be completely unable to answer questions about whether gay face is real in Caucasian Americans. By withholding critical information about the data, the authors make it impossible to assess the extent of the problem.

One way to assess this error would be to take the classifier trained on their secret data and use it to, for example, classify face pics from a site like Scruff. There is a problem with this (as mentioned in the GLAAD/HRC press release) in that activity and identity are not interchangeable. So some men who have sex with men (MSM, itself a somewhat controversial designation) will identify as neither gay nor bisexual, and this is not necessarily information that would be available to the researcher. Nevertheless, it is probably safe to assume that people with a publicly visible face picture on a dating app mainly used by MSM are not straight.

If the classifier worked on this sort of data, then there is at least a chance that the findings of the study will replicate. But, those of you who read the paper or the Authors’ notes howl, the authors did test the classifier on a validation sample gathered from Facebook.

At this point, let us pause to look at stills from Derek Jarman films

Pictures of you

First, we used the Facebook Audience Insights platform to identify 50 Facebook Pages most popular among gay men, including Pages such as: “I love being Gay”, “Manhunt”, “Gay and Fabulous”, and “Gay Times Magazine”. Second, we used the “interested in” field of users’ Facebook profiles, which reveals the gender of the people that a given user is interested in. Males that indicated an interest in other males, and that liked at least two out of the predominantly gay Facebook Pages, were labeled as gay.

I beseech you, in the Bowels of Christ, think it possible that your validation sample may be biased.

(I mean, really. That’s one hell of an inclusion criterion.)

Rebel Girl / Nancy Boy

So what of the other GLAAD/HRC problems? They are mainly interesting to show the difference between the priorities of an advocacy organisation and statistical priorities. For example, the study only considered caucasians, which the press release criticises. The paper points out that there was not enough data to include people of colour. Without presuming to speak for LGB (the study didn’t consider trans people, so I’ve dropped the T+ letters [they also didn’t actively consider bisexuals, but LG sells washing machines]) people of colour, I can’t imagine that they’re disappointed to be left out of this specific narrative. That being said, the paper suggests that these results will generalise to other races. I am very skeptical of this claim.

Those composite faces also suggest that fat gay men don’t exist. Again, I am very happy to be excluded from this narrative.

What about the lesbians? Neural networks apparently struggle with lesbians. Again, it could be an issue of sampling bias. It could also be that the Mechanical Turk gender verification stage (4 of 6 people needed to agree with the person’s stated gender for them to be included) is adding additional bias. The most reliable way to verify a person’s gender is to ask them. I am uncertain why the researchers deviated from this principle. Certain sub-cultures in the LGB community identify as lesbian or gay but have either an androgynous style or a purposely femme or butch style. Systematically excluding these groups (I’m thinking particularly of butch lesbians and femme gays) will bias the results.

Breaking the law

So what of the dystopian picture the authors paint of governments using this type of procedure to find and eliminate homosexuals? Meh.

Just as I’m opposed to marketing in machine learning, I’m opposed to “story time” in otherwise serious research.  This claim makes the paper really exciting to read—you get to marvel at their moral dilemma. But the reality is much more boring. This is literally why universities (and I’m including Stanford in that category just to be nice) have ethics committees. I assume this study went through the internal ethics approval procedure, so the moralising is mostly pantomime.

The other reason I’m not enormously concerned about this is that I am skeptical of the idea that there is enough high-quality training data to train a neural network in countries where homosexuality is illegal (which are not majority-Caucasian countries). Now I may well be very wrong about this, but I think we can all agree that it would be much harder than in the Caucasian case. LGBT+ people in these countries have serious problems, but neural networks are not one of them.

Researchers claiming that gay men, and to a slightly lesser extent lesbians, are structurally different from straight people is a problem. (I think we all know how phrenology was used to perpetuate racist myths.)  The authors acknowledge this during their tortured descriptions of their moral struggles over whether to publish the article. Given that, I would’ve expected the claims to be better validated. I just don’t believe that either their data set or their  validation set gives a valid representation of gay people.

All science should be good science, but controversial science should be unimpeachable. This study is not.

Live and let die (Geri Halliwell version)

Ok. That’s more than enough of my twaddle. I guess the key point of this is that marketing and storytelling, as well as being bad for science, get in the way of effectively disseminating information.

Deep learning / neural networks / AI are all perfectly effective tools that can help people solve problems. But just as it wasn’t a generic convolutional neural network that won games of Go, these tools only work when deployed by extremely skilled people. The idea that IBM could diagnose cancer with deep learning is just silly. IBM knows bugger all about cancer.

In real studies, selection of the data makes the difference between a useful step forward in knowledge and “junk science”.  I think there are more than enough problems with the “gay face” study (yes I will insist on calling it that) to be immensely skeptical. Further work may indicate that the authors’ preliminary results were correct, but I wouldn’t bet the farm on it. (I wouldn’t even bet a farm I did not own on it.) (In fact, if this study replicates I’ll buy someone a drink. [That someone will likely be me.])



“How conditioning on post-treatment variables can ruin your experiment and what to do about it”

Brendan Nyhan writes:

Thought this might be of interest – new paper with Jacob Montgomery and Michelle Torres, How conditioning on post-treatment variables can ruin your experiment and what to do about it.

The post-treatment bias from dropout on Turk you just posted about is actually in my opinion a less severe problem than inadvertent experimenter-induced bias due to conditioning on post-treatment variables in determining the sample (attention/manipulation checks, etc.) and controlling for them/using them as moderators. We show how common these practices are in top journal articles, demonstrate the problem analytically, and reanalyze some published studies. Here’s the table on the extent of the problem:

Post-treatment bias is not new but it’s an important area where practice hasn’t improved as rapidly as in other areas.

I wish they’d round their numbers to the nearest percentage point.
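The mechanism is easy to demonstrate with a toy simulation. This is my own sketch, not from the Montgomery, Nyhan, and Torres paper: the treatment is randomized, so the simple difference is unbiased, but “controlling for” a post-treatment variable that shares an unobserved cause with the outcome biases the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Randomized treatment with a true effect of 1.0 on the outcome.
t = rng.integers(0, 2, n).astype(float)
# Unobserved trait u affects both the post-treatment variable and the outcome.
u = rng.normal(size=n)
# Post-treatment variable (e.g., an attention check) moved by both t and u.
m = t + u + rng.normal(size=n)
# Outcome depends on treatment and u, not on m directly.
y = t + u + rng.normal(size=n)

def ols_coef(cols, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols_coef([t], y)[1]        # unbiased: close to the true 1.0
adjusted = ols_coef([t, m], y)[1]  # conditioning on m opens a path through u
print(round(naive, 2), round(adjusted, 2))
```

With these (made-up) parameter values the adjusted estimate is badly attenuated even though the experiment itself is perfectly randomized, which is the paper’s point: the damage comes from the analyst, not the design.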

God, goons, and gays: 3 quick takes

Next open blog spots are in April but all these are topical so I thought I’d throw them down right now for ya.

1. Alex Durante writes:

I noticed that this study on how Trump supporters respond to racial cues is getting some media play, notably over at Vox. I was wondering if you have any thoughts on it. At first glance, it seems to me that its results are being way overhyped. Thanks for your time.

Here’s a table showing one of their analyses:

My reaction to this sort of thing is: (a) I won’t believe this particular claim until I see the preregistered replication. Too many forking paths. And (b) of course it’s true that “Supporters and Opponents of Donald Trump Respond Differently to Racial Cues” (that’s the title of the paper). How could that not be true, given that Trump and Clinton represent different political parties with way different positions on racial issues? So I don’t really know what’s gained by this sort of study that attempts to scientifically demonstrate a general claim that we already know, by making a very specific claim that I have no reason to think will replicate. Unfortunately, a lot of social science seems to work this way.

Just to clarify: I think the topic is important and I’m not opposed to this sort of experimental study. Indeed, it may well be that interesting things can be learned from the data from this experiment, and I hope the authors make their raw data available immediately. I’m just having trouble seeing what to do with these specific findings. Again, if the only point is that “Supporters and Opponents of Donald Trump Respond Differently to Racial Cues,” we didn’t need this sort of study in the first place. So the interest has to be in the details, and that’s where I’m having problems with the motivation and the analysis.

2. A couple people pointed me to this paper from 2006 by John “Mary Rosh” Lott, “Evidence of Voter Fraud and the Impact that Regulations to Reduce Fraud Have on Voter Participation Rates,” which is newsworthy because Lott has some connection to this voter commission that’s been in the news. Lott’s empirical analysis is essentially worthless because he’s trying to estimate causal effects from a small dataset by performing unregularized least squares with a zillion predictors. It’s the same problem as this notorious paper (not by Lott) on gun control that appeared in the Lancet last year. I think that if you were to take Lott’s dataset you could with little effort obtain just about any conclusion you wanted by just fiddling around with which variables go into the regression.

3. Andrew Jeon-Lee points us to this post by Philip Cohen regarding a recent paper by Yilun Wang and Michal Kosinski that uses a machine learning algorithm and reports, “Given a single facial image, a classifier could correctly distinguish between gay and heterosexual men in 81% of cases, and in 71% of cases for women. Human judges achieved much lower accuracy: 61% for men and 54% for women.”

Hey, whassup with that? I can get 97% accuracy by just guessing Straight for everybody.

Oh, it must depend on the population they’re studying! Let’s read the paper . . . they got data on 37,000 men and 39,000 women, approx 50/50 gay/straight. So I guess my classification rule won’t work.

More to the point, I’m guessing that the classification rule that will work will depend a lot on what population you’re using.
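The base-rate point can be made concrete in a few lines (the 3% figure below is just an illustrative guess at a population base rate, not a number from the paper):

```python
def accuracy_always_straight(base_rate_gay):
    """Accuracy of the trivial classifier that labels everyone straight."""
    return 1.0 - base_rate_gay

# In a general-population sample the trivial rule looks great:
print(accuracy_always_straight(0.03))  # roughly 0.97
# On the paper's roughly 50/50 dataset it's a coin flip:
print(accuracy_always_straight(0.50))  # roughly 0.5
```

So a headline accuracy number is uninterpretable without knowing the class balance of the evaluation set, and performance on a balanced scrape says little about performance in the wild.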

I had some deja vu on this one because last year there was a similar online discussion regarding a paper by Xiaolin Wu and Xi Zhang demonstrating an algorithmic classification of faces of people labeled as “criminals” and “noncriminals” (which I think makes even less sense than labeling everybody as straight or gay, but that’s another story). I could’ve sworn I blogged something on that paper but it didn’t show up in any search so I guess I didn’t bother (or maybe I did write something and it’s somewhere in the damn queue).

Anyway, I had the same problem with that paper from last year as I have with this recent effort: it’s fine as a classification exercise, and it can be interesting to see what happens to show up in the data (lesbians wear baseball caps!), but the interpretation is way over the top. It’s no surprise at all that two groups of people selected from different populations will differ from each other. That will be the case if you compare a group of people from a database of criminals to a group of people from a different database, or if you compare a group of people from a gay dating website to a group of people from a straight dating website. And if you have samples from two different populations and a large number of cases, then you should be able to train an algorithm to distinguish them at some level of accuracy. Actually doing this is impressive (not necessarily an impressive job by these researchers, but it’s an impressive job by whoever wrote the algorithms that these people ran). It’s an interesting exercise, and the fact that the algorithms outperform unaided humans, that’s interesting too. But then there’s this kind of thing: “The phenomenon is, clearly, troubling to those who hold privacy dear—especially if the technology is used by authoritarian regimes where even a suggestion of homosexuality or criminal intent may be viewed harshly.” That’s just silly, as it completely misses the point that the success of these algorithms entirely depends on the data used to train them.

Also Cohen in his post picks out this quote from the article in question:

[The results] provide strong support for the PHT [prenatal hormone theory], which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.

Huh? That’s just nuts. I agree with Cohen that it would be better to say that the results are “not inconsistent” with the theory, just as they’re not inconsistent with other theories such as the idea that gay people are vampires (or, to be less heteronormative, the idea that straight people lack the vampirical gene).

Also some goofy stuff about the fact that gay men in this particular sample are less likely to have beards.

In all seriousness, I think the best next step here, for anyone who wants to continue research in this area, is to do a set of “placebo control” studies, as they say in econ, each time using the same computer program to classify people chosen from two different samples, for example college graduates and non-college graduates, or English people and French people, or driver’s license photos in state X and driver’s license photos in state Y, or students from college A and students from college B, or baseball players and football players, or people on straight dating site U and people on straight dating site V, or whatever. Do enough of these different groups and you might get some idea of what’s going on.

The StanCon Cometh

(In a stunning deviation from the norm, this post is not by Andrew or Dan, but Betancourt!)

Some important dates for StanCon2018 are rapidly approaching!

Contributed submissions are due September 16, 2017 5:00:00 AM GMT. That’s less than 6 days away!  We want to make sure we can review submissions early enough to get responses back to submitters in time for early registration.  For more details on the submission requirements and how to submit see the Submissions page.

Speaking of which, early registration ends Friday November 10, 2017 after which the registration cost significantly increases. That’s in just about two months!

Finally, just because I still can’t believe that we have such an amazing ensemble of invited speakers let me remind everyone that attendees will get to see talks from

  • Susan Holmes (Department of Statistics, Stanford University)
  • Frank Harrell (School of Medicine and Department of Biostatistics, Vanderbilt University)
  • Sophia Rabe-Hesketh (Educational Statistics and Biostatistics, University of California, Berkeley)
  • Sean Taylor and Ben Letham (Facebook Core Data Science)
  • Manuel Rivas (Department of Biomedical Data Science, Stanford University)
  • Talia Weiss (Department of Physics, Massachusetts Institute of Technology)

Looking for the bottom line

I recommend this discussion of how to summarize posterior distributions. I don’t recommend summarizing by the posterior probability that the new treatment is better than the old treatment, as that is not a bottom-line statement!
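As a hypothetical sketch of why (simulated posterior draws, names my own): the probability that the new treatment beats the old discards the magnitude of the difference, which interval-type summaries retain.

```python
import numpy as np

rng = np.random.default_rng(1)
# Pretend these are posterior draws of the effect (new minus old).
draws = rng.normal(loc=0.1, scale=1.0, size=4000)

p_better = (draws > 0).mean()  # P(new > old): throws away magnitude
summary = {
    "mean": draws.mean(),
    "sd": draws.std(),
    "2.5%": np.quantile(draws, 0.025),
    "97.5%": np.quantile(draws, 0.975),
}
# p_better can sit comfortably above 0.5 while the interval shows the
# effect could easily be negligible or negative -- not a bottom line.
print(round(p_better, 2), {k: round(v, 2) for k, v in summary.items()})
```

A decision-relevant summary would go further still, weighting the posterior by the actual costs and benefits of each action rather than by sign alone.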