Visualizing your fitted Stan model using ShinyStan without interfering with your RStudio session

ShinyStan is great, but I don’t always use it because when you call it from R, it freezes up your R session until you close the ShinyStan window.

But it turns out that it doesn’t have to be that way. Imad explains:

You can open up a new session via the RStudio menu bar (Session >> New Session), which should have the same working directory as wherever you were prior to running launch_shinystan(). This will let you work on whatever you need to work on while simultaneously running ShinyStan (albeit via two RStudio sessions).

OK, good to know.

San Francisco housing debate: A yimby responds

Phil Price recently wrote two much-argued-about posts here and here on the yimby (“yes in my backyard”) movement in San Francisco. One of the people disagreeing with him is Sonja Trauss, who writes:

Phil makes a pretty basic mistake of reasoning in his post, namely, that the high income residents of the proposed new housing move to SF if and only if the new housing is built.

The reality is that 84% of the residents of new buildings already live in SF (so they’re already spending money in SF etc.) That’s from the SF Controller’s office. Not to mention, no one moves only because they see that a new building is finished. They move for work, or family.

I’m one of the founders of the YIMBY movement in the Bay, so it’s my job to write refutations of essays like Phil’s.

Here’s the response from Trauss:

Last week Phil wrote a post [] and a follow up []. He says he’ll “post again when I can figure out whether or not I really am completely wrong.”

The question is, wrong about what?

A few years ago I had a roommate who brought home a giant rabbit to live with us and the cat, Marmalade. We put the two creatures together wondering what would happen. The rabbit was as big as the cat, would the cat nonetheless chase the rabbit? Answer: No. Marmalade was disgusted by the rabbit. The cat crept around the rabbit at a safe distance, horrified but unable to look away from what was apparently the ugliest cat Marmalade had ever seen. The rabbit, on the other hand, did not recognize Marmalade as a living creature at all. The rabbit made no distinction between Marmalade, a throw pillow, or a balled up sweatshirt, which alarmed Marmalade highly as long as they lived together.

Marmalade was missing critical information that he would have needed to understand what he was seeing. There was nothing wrong with the rabbit at all, it was a normal giant rabbit. Marmalade thought there was something inexplicable about the situation because he started with the wrong assumption.

Likewise, Phil’s question about YIMBYs was sincere. After meandering through a mental model of how population & rents interact, he got to his point,

“Given all of the above, … Why are these people promoting policies that are so bad for them?”

The answer to “why are people promoting policies that are bad for them?” is always the same thing – the asker has wrong or missing information about the constituency’s (1) material conditions, or (2) values or goals, or else he doesn’t understand the policy proposal or how it works.

Another clue that a question is ill-formed is if it leads an otherwise intelligent person to a self-contradictory or incoherent answer:

“…So this is my new theory: the YIMBY and BARF people know that building more market-rate housing in San Francisco will make median rents go up, and that this will be bad for them, but they want to do it anyway because it’s a thumb in the eye of the “already-haves”, those smug people who already have a place they like and are trying to slam the door behind them.”

So the answer to Phil’s question lies with him, not with us – Which of the assumptions Phil made were wrong?

Basically all of them.

The first bad assumption was about the goal of the YIMBY policy program. Phil got the idea somewhere that our policy goal has a singular focus on median rents, in San Francisco. It’s never been true that our focus was only on San Francisco. One of the interesting things about our movement is that we encourage people to be politically active across city and county boundaries. When I started SFBARF in 2014, I lived in West Oakland. My activities were targeted at San Francisco, and I started the club specifically to protect low rents in Oakland.

It’s also not true that we would measure our success by only or primarily looking at the median rent. Ours is an anti-displacement, pro-short commute & pro-integration effort, which means the kinds of metrics we are interested in are commute times, population within X commute distance of job centers, rate of evictions, vacancy rate, measures of economic & racial integration in Bay Area cities, number of people leaving the Bay Area per year, in addition to public opinion data like do people feel like housing is hard to find? Do people like their current housing situation? If we do look at statistics about rent, median is a lot less interesting than standard distribution.

Prices are a side-effect, a symptom, a signal of an underlying situation. In California generally and the Bay in particular the underlying housing situation is one of real shortage. What is the unmet demand for housing in the Bay? Do 10 million people want to live here? 20 million? Who knows, we can only be sure at this point that it’s more than the 7 million that do live here.

Because the way we distribute housing currently is (mostly) via the price mechanism, the way most people experience their displacement is by being priced out. But distributing housing stock by some other method wouldn’t solve the displacement problem.

Suppose the total demand for housing in the Bay is 20 million people. Currently we have housing for about 7 million people. If we distributed the limited housing we have by lottery, 13 million people would experience displacement as losing the lottery. If we distributed it via political favoritism, people’s experience of displacement would be finding out their application for housing wasn’t granted. Either way it doesn’t matter. If 20 million people want housing, and you only have housing for 7 million of them, then 13 million people are out of luck, no matter how you distribute it.

[If you want to get a really intense feeling for why prices are a distraction, I recommend learning to prove both of the welfare theorems. [] Spend some hours imagining the tangent lines or planes attaching themselves to the convex set, then imagine the convex set ballooning out to meet the prices. Then read Red Plenty.]

Phil believes in induced demand, that feeding the need for housing will only create more need for housing. I don’t think this is true but I also don’t think it matters. The reason I don’t think it’s true is that if regional economies worked like that, Detroit, St. Louis, Baltimore, Philadelphia would not exist in their current state. How can induced demand folks explain either temporary regional slowdowns, like the dot com busts, or the permanent obsolescence of the rust belt? The reason I don’t think it matters is that continuous growth in the Bay Area doesn’t sound like a bad thing to me. The Bay would be better, more livable, a better value with 20 million or more people in it.

The reason the YIMBY movement exists, in part, is that the previous generation of rent advocates were singularly obsessed with prices, like Phil is. They thought, ‘if only we could fix prices, our problem would go away,’ so they focused on price fixing policies like rent control & were indifferent or actively hostile to shortage ending practices like building housing. In fact, hostility to building new housing, and insistence that prices alone should be the focus of activism are philosophies that fit nicely together. The result, after almost 40 years, is that prices are worse than ever, but population level hasn’t changed very much.

So the first answer to “What’s wrong with Phil?” is that Phil thinks prices are the primary focus of our activism, when in fact allocations are what we are interested in.

The second and third problems with Phil are assumptions he made in thinking through his toy model of the SF housing market. This chain of reasoning purports to show that building more housing in SF increases median rent in SF (but decreases it in other parts of the Bay). This is the reasoning that led Phil to think YIMBYs are pursuing policies with outcomes that are counter to our own interests. Of course, as explained above, Phil fully identified “our interests” with “SF median rent” which was already wrong.

The biggest problem with Phil’s model is “…for better or worse I am assuming, essentially axiomatically, that building expensive units draws in additional high-income renters and buyers [to San Francisco] …” who would not have lived there otherwise. The second problem is he imagines that all of their disposable income would be new to the City. Whether these assumptions are actually true or not has a huge impact on the conclusion Phil reaches.

It happens all the time that people make axiomatic assumptions about things they think they can’t know, in order to eventually reason to the outcome that they want. This is a perfectly fine activity, but reasoning from a guess about a state of the world, to the outcome one already believes isn’t social science. It’s rationalization.

Whether N new housing units actually results in N new high income people spending new disposable income in San Francisco is an interesting research question. I would love for Phil to pursue it. I would recommend talking to demographers, economic planners and econometricians.

If pursuing the data seems too labor intensive, then Phil can try to assume the opposite of his axiom about the necessary demographics of population growth, and see if he can still reason to his desired conclusion. If he can, that result should be published at Berkeley Daily Planet, which is a blog for anti-housing ideology.

Alternately, Phil could abandon this particular line of investigation altogether because it’s not relevant to the initial question. We’ve already established that prices aren’t the fundamental focus of YIMBY activism, and Marmalade wasn’t looking at a grotesque cat, but an ordinary rabbit.

I hope you feel your mystery is solved Phil. Next time you have a question you should ask your local YIMBYs at

In my own neighborhood in NY, I’m a yimby: a few months ago someone put up a flyer in my building warning us that some developer was planning to build a high-rise nearby, and we should all go to some community board meeting and protest. My reaction was annoyance—I almost wanted to go to the community board meeting just to say that I have no problems with someone building another high-rise in my neighborhood. That said, much depends on the details: new housing is good, but I do sometimes read about sweetheart tax deals where the developer builds and then the taxpayer is stuck with huge bills for infrastructure. In any particular example, the details have to matter, and this isn’t a debate I’m planning to jump into.

P.S. The cat picture at the top of this post comes from Steven Johnson. It is not the cat discussed by Trauss. But I’m guessing it’s the same color!

The Other Side of the Night

Don Green points us to this quantitative/qualitative meta-analysis he did with Betsy Levy Paluck and Seth Green. The paper begins:

This paper evaluates the state of contact hypothesis research from a policy perspective. Building on Pettigrew and Tropp’s (2006) influential meta-analysis, we assemble all intergroup contact studies that feature random assignment and delayed outcome measures, of which there are 27 in total. . . . We find the evidence . . . consistent with Pettigrew and Tropp’s (2006) conclusion that contact “typically reduces prejudice.” At the same time, our meta-analysis suggests that contact’s effects vary, with interventions directed at ethnic or racial prejudice generating substantially weaker effects. Moreover, our inventory of relevant studies reveals important gaps, most notably the absence of studies addressing adults’ racial or ethnic prejudice, an important limitation for both theory and policy. We also call attention to the lack of research that systematically investigates the scope conditions suggested by Allport (1954) under which contact is most influential.

I like that they don’t just give their conclusions; they also talk about the limitations of their data.

The contact hypothesis is a big deal, and I know that Don Green and his collaborators have been thinking a lot about meta-analysis in recent years, so I’m glad to see this research being done.

I’ve not yet read the article, but I did notice that it doesn’t have a lot of graphs—and those that it does have, are close to impossible to read. This just seems to be a preprint, though, so maybe someone can help them visualize their data and their findings, and some real graphs can be added to the paper before it is published in its final version. With all this work on data gathering and data analysis, it would be a pity for the results to not be explored. There should be lots more stuff to be uncovered from some graphical displays.

P.S. The second author of the paper, Seth Green, is a Columbia political science graduate and is currently working at a startup called “Code Ocean.” Googling led to this webpage which says nothing about Columbia. Apparently Code Ocean is “a cloud-based computational reproducibility platform” with a mission “to make the world’s scientific code more reusable, executable and reproducible.”

It also says on that webpage that Code Ocean handles code in Python, R, and eight other programming languages. If so, I think they should handle Stan code too! In all seriousness, I think a bit of Stan would help this project, in that it would push the researchers toward modeling the effects of interest, and away from time-wasters such as this: “We begin our quantitative analysis by assessing cross-study heterogeneity using Cochran’s Q. The test decisively rejects the null hypothesis of homogeneity of effects across studies (Q(26) = 173.563, p < .001; I^2 = 0.85). We therefore reject the fixed-effects meta-analysis model in favor of a random-effects meta-analysis model, where the variance of the normal random component is estimated using method of moments. The resulting estimate is 0.39, with a 95% confidence interval ranging from 0.234 to 0.555.” Or this: “The results presented in Table 3 suggest that a one-unit increase in standard error is associated with a 2.07 unit increase in effect size, although the pattern is of borderline significance (two-tailed p-value = .049).” I know they’re just doing their best, and I like the general flow of this analysis; it would just be easier in Stan to get to the substance.
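For readers who want to see what the quoted passage is computing, here is a minimal C++ sketch of Cochran's Q, I^2, and a method-of-moments between-study variance. It assumes the usual DerSimonian-Laird form of the method-of-moments estimator, which the paper may or may not use exactly; the struct and function names are made up for illustration and this is not the authors' code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of a random-effects meta-analysis (hypothetical names).
struct MetaSummary {
  double q;       // Cochran's Q statistic
  double i2;      // I^2 = max(0, (Q - df) / Q)
  double tau2;    // DerSimonian-Laird between-study variance
  double pooled;  // random-effects pooled estimate
  double se;      // standard error of the pooled estimate
};

// I^2 from Q and its degrees of freedom (number of studies minus one).
double i_squared(double q, double df) {
  return q > df ? (q - df) / q : 0.0;
}

// y[i] is the effect estimate from study i, se[i] its standard error.
MetaSummary random_effects(const std::vector<double>& y,
                           const std::vector<double>& se) {
  assert(y.size() == se.size() && y.size() > 1);
  const std::size_t k = y.size();

  // Fixed-effect (inverse-variance) weights and pooled mean.
  double sw = 0.0, swy = 0.0, sw2 = 0.0;
  for (std::size_t i = 0; i < k; ++i) {
    const double w = 1.0 / (se[i] * se[i]);
    sw += w;
    swy += w * y[i];
    sw2 += w * w;
  }
  const double fixed = swy / sw;

  // Cochran's Q: weighted squared deviations from the fixed-effect mean.
  double q = 0.0;
  for (std::size_t i = 0; i < k; ++i) {
    const double w = 1.0 / (se[i] * se[i]);
    q += w * (y[i] - fixed) * (y[i] - fixed);
  }
  const double df = static_cast<double>(k) - 1.0;

  // Method-of-moments estimate of tau^2, truncated at zero.
  const double c = sw - sw2 / sw;
  const double tau2 = std::max(0.0, (q - df) / c);

  // Random-effects weights add tau^2 to each study's sampling variance.
  double swr = 0.0, swry = 0.0;
  for (std::size_t i = 0; i < k; ++i) {
    const double w = 1.0 / (se[i] * se[i] + tau2);
    swr += w;
    swry += w * y[i];
  }
  return {q, i_squared(q, df), tau2, swry / swr, std::sqrt(1.0 / swr)};
}
```

As a check on the quoted numbers, i_squared(173.563, 26.0) is (173.563 - 26) / 173.563, or about 0.85, which matches the I^2 reported in the passage.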

Traditionally, classical analyses based on statistical significance have been considered to be the safe option when analyzing data in the social sciences. But as we continue to work in areas where data are sparse and noisy, and effects are highly variable, it will make sense to just start with Bayes, to say goodbye forever to “p less than .05,” and to use Stan as your first rather than last resort for managing uncertainty and variation.

P.P.S. Cat picture demonstrating contact hypothesis from Diana Senechal.

Take two on Laura Arnold’s TEDx talk.

This post is by Keith.

In this post I try to be more concise and direct about what I found of value in Laura Arnold’s TEDx talk that I recently blogged about here.

Primarily it was the disclosure from someone who could afford to buy good evidence (and experts to assess it) that they did not think good enough evidence was actually available right now. Yes there is some real gold among the research outputs – but it just cannot be reliably distinguished from fool’s gold at present – so don’t buy now or buy at your own risk. Furthermore, believing high quality evidence was necessary in any reasonable attempt to make things better, there was an argument/decision that evidence quality needed to be rectified first. And at the end of their talk they did not blame reporters or careerist researchers but instead claimed it was _our_ fault and that _we_ needed to be prepared to contribute to it being resolved.

That it is up to the masses to fix the current faulty evidence generating machines very much rings true to me. I think that was an important turning point for the AllTrials initiative: “It immediately occurred to Tracey [Director, Sense about Science] that there had to be thousands of people who took part in clinical trials who felt ‘I want my participation to count for something—I want the results that you who ran the trial got because I took part to be used.’” Here, maybe: “I want people who might want to help me to actually have good evidence on how to do that.”

I also have a more nuanced or speculative take.

PCI Statistics: A preprint review peer community in statistics

X informs me of a new effort, “Peer community in . . .”, which describes itself as “a free recommendation process of published and unpublished scientific papers.” So far this exists in only one field, Evolutionary Biology. But this looks like a great idea and I expect it will soon exist in statistics, political science, and many other areas.

Here’s how X describes it:

Authors of a preprint or of a published paper request a recommendation from the forum. If someone from the board finds the paper of interest, this person initiates a quick refereeing process with one or two referees and returns a review to the authors, with possible requests for modification, and if the lead reviewer is happy with the new version, the link to the paper and the reviews are published on PCI Evol Biol, which thus gives a stamp of validation to the contents in the paper. The paper can then be submitted for publication in any journal.

X thinks we should start small and have a PCI Computational Statistics. Seems like a good idea to me.

I suggested this idea for the Journal of the American Statistical Association, but they’re super-conservative, I don’t think they’re ever gonna do it. So I’m with X on this: I’d like to do PCI Computational Statistics (or PCI Stan?) unless anyone can explain to me why this is not a good idea.

It would also be good to have a sense of where all this is going. If PCI stays around, will there be a proliferation of PCI communities? Or will things shake down to just a few? Something should be done, that’s for sure: Arxiv is too unstructured, NBER is a restrictive club, traditional journals are too slow and waste most of their effort on the worst papers, twitter’s a joke, and blogs don’t scale.

So, again, any good reasons not to do this?

This company wants to hire people who can program in R or Python and do statistical modeling in Stan

Doug Puett writes:

I am a 2012 QMSS [Columbia University Quantitative Methods in Social Sciences] grad who is currently trying to build a Data Science/Quantitative UX team, and was hoping for some advice. I am finding myself having a hard time finding people who are really interested in understanding people and who are especially excited about doing so in a quantitative way, along the lines of this.

I more often see candidates who are more interested in machine learning and prediction for its own sake rather than seeking to gain human empathy and a depth of understanding. Somewhat related to this is experience with Stan, which, while not necessary for doing good analysis, is such a wonderful tool for creating powerful models and doing so in an iterative, exploratory way. I have found that the amount of complexity that can be built into a Stan model is a very good way to capture pretty subtle but important aspects of our domain, and that a willingness to invest in the tool is a really good signal for a certain way of thinking about data and data problems.

So I was wondering, do you know of a source of candidates that would align to this way of working with data, and if so, how do I best reach them? I know that QMSS was designed to cater to these needs, but even from there it has been hard to source candidates who aren’t more interested in machine learning (or else don’t have the technical experience necessary). It wouldn’t be quite fair to advertise a Stan developer position, but I do intend for this sort of modeling to be a critical component of our work (after, and along with, simpler methods of exploratory analysis, of course).

A brief description of what I’m looking for is this:

– Excited about working with and supporting qualitative research (this is both our research domain and our internal partners in doing so)
– Excited about understanding human systems quantitatively
– Emphasis on exploratory data analysis
– Stan experience
– Very strong programming experience (Python or R)

Our company:

Hey—QMSS is the program that my colleagues and I created, nearly 20 years ago. It’s great to see it still going strong.

How is a politician different from a 4-year-old?

A few days ago I shared my reactions to an op-ed by developmental psychologist Alison Gopnik.

Gopnik replied:

As a regular reader of your blog, I thought you and your readers might be interested in a response to your very fair comments. In the original draft I had an extra few paragraphs (below) that speak to that point but that I left out for space and aesthetic unity reasons. One way to put it is this, the distribution of traits and behaviors across development suggests that there are differences in the underlying psychological causal structure of children and adults. The claim of Douthat and Brooks etc. is that Trump’s distinctively awful features reflect the fact that he shares more of that causal structure with children than the average adult does. My claim is that they reflect the fact that he shares LESS of the typical causal structure with children than the average adult, instead he is an outlier on dimensions that typically characterize adults and not children (such as the desire for power and exploitation over knowledge and exploration). In particular, I chose the examples in the piece because they exemplify two rather distinctive underlying dimensions of preschool children – their epistemic motivation – that is the fact that they are so interested in and motivated by learning, and their initial identification with other people – just the opposite of egocentrism. Those are still features of most adults at least some of the time, but they become less important than other motivations and attitudes, and they certainly aren’t characteristic of Trump. But even in this time of nerd exaltation this might be too wonky for a Sunday op-ed!

Now, all this is not to say that a four-year-old would make a good chief executive. The genuine differences between children and adults reflect a kind of evolutionary division of labor, and being President is certainly a grown-up job.

Children are designed to explore. They focus on learning about the world, especially learning about other people, and imagining alternative ways the world could be. In fact, from an evolutionary perspective, the whole point of childhood is to provide human beings with a protected learning period. Young children don’t have to worry about acting to get the resources they need, or about power or status – their caregivers deal with all that. They can just learn instead.

Adults are designed to exploit, they take what they’ve learned as children and put it to use, for good or ill, to accomplish adult goals. Trump does lack some of the adult skills that allow grown ups to act effectively, like impulse control and long term planning. But those deficiencies just make him incompetent, not dangerous. The real dangers of Trump’s character are uniquely adult ones.

As we grow up, our focus and experience necessarily narrow, and we inevitably develop a more egotistical concentration on our own needs, goals and plans. That’s a natural consequence of the grown-up responsibility to act and achieve, including the responsibility to care for the next generation of hungry, exploratory children. We all trudge along on “the hedonic treadmill”, chasing new rewards as soon as we’ve attained the old ones. Still, most adults, even most Presidents, and certainly the best Presidents, manage to retain some of their child-like traits – curiosity, openness to experience, intuitive sensitivity to others.

But Trump has spent a lifetime single-mindedly pursuing the uniquely adult goals of power, status and domination, financially, sexually, and politically, and has done so with remarkable success. Time and again, from Shakespearean kings like Macbeth and Richard III to the 20th century’s strongmen, grown-ups with that temperament and history become trapped in the narrow, never-ending, unslakeable thirst for adulation and reward. They lose the child’s desire to learn and so gradually lose touch with the greater reality outside themselves – sometimes to the point of delusion and megalomania. We’d all be better off if Trump were more, not less, like a four-year-old.

Gopnik also sends along this paper with Thomas Griffiths and Christopher Lucas, which begins:

We describe a surprising developmental pattern we found in studies involving three different kinds of problems and age ranges. Younger learners are better than older ones at learning unusual abstract causal principles from evidence. We explore two factors that might contribute to this counterintuitive result. The first is that as our knowledge grows, we become less open to new ideas. The second is that younger minds and brains are intrinsically more flexible and exploratory, although they are also less efficient as a result.

This is interesting, thinking of differences between children and adults through the lens of the exploration-exploitation tradeoff in statistics (also, for convoluted reasons, called the “bandit” problem). The idea is that in the early stages of an experiment, you should explore, making decisions that are probably suboptimal for the purpose of gathering data; but then, later on, once you know more, you can exploit your information and move rapidly toward an (approximately) optimal solution. There are lots of examples of this tradeoff, but I hadn’t thought of it in the context of children and adults.
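To make the tradeoff concrete, here is a minimal C++ sketch of one very simple bandit strategy, usually called explore-then-commit: try every option a fixed number of times, then stick with whichever looked best. It is only one of many possible strategies, and the arm success probabilities, function name, and struct are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Outcome of one run of the strategy (hypothetical names).
struct BanditResult {
  std::size_t committed_arm;  // arm chosen for the exploitation phase
  std::vector<int> pulls;     // number of times each arm was pulled
  double total_reward;        // total reward over all rounds
};

// Explore-then-commit on arms that pay 1 with the given probability, else 0.
BanditResult explore_then_commit(const std::vector<double>& success_prob,
                                 int explore_per_arm, int total_rounds,
                                 unsigned seed) {
  assert(!success_prob.empty());
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  const std::size_t k = success_prob.size();
  std::vector<int> pulls(k, 0);
  std::vector<double> wins(k, 0.0);
  double total = 0.0;

  // Pull one arm, observe a 0/1 reward, and record it.
  auto pull = [&](std::size_t arm) {
    const double reward = unif(rng) < success_prob[arm] ? 1.0 : 0.0;
    pulls[arm] += 1;
    wins[arm] += reward;
    total += reward;
  };

  // Exploration: pull every arm equally often, knowing most pulls are
  // suboptimal, in order to gather data.
  int used = 0;
  for (int n = 0; n < explore_per_arm; ++n) {
    for (std::size_t a = 0; a < k; ++a) {
      pull(a);
      ++used;
    }
  }

  // Every arm has the same number of pulls, so comparing win counts is the
  // same as comparing empirical success rates.
  std::size_t best = 0;
  for (std::size_t a = 1; a < k; ++a) {
    if (wins[a] > wins[best]) best = a;
  }

  // Exploitation: spend all remaining rounds on the apparent best arm.
  for (int t = used; t < total_rounds; ++t) {
    pull(best);
  }
  return {best, pulls, total};
}
```

Calling explore_then_commit({0.2, 0.5, 0.8}, 200, 5000, 42u) spends 600 rounds exploring and the remaining 4400 exploiting. A longer exploration phase makes a wrong commitment less likely but wastes more rounds on bad arms, which is the tradeoff in miniature.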

Gopnik’s response is a useful answer to the questions in my earlier post. She wasn’t just making the (trivial) point that Trump is different from the average four-year-old, or even that he was different from most four-year-olds. Rather, she has a conceptual scale in which the distribution of American four-year-olds is in one place, and the distribution of American adults is in another place, and she’s saying that, in certain dimensions, Trump is particularly far from the average four-year-old, that he’s too much of an optimizer and not enough of an explorer. Or, more generally, that he’s trapped in the “narrow, never-ending, unslakeable thirst for adulation and reward” that is characteristic of some adults but not, apparently, of most four-year-olds. Yes, four-year-olds throw tantrums but I guess that’s a bit different.

As I wrote in my original post, I’m not so clear this all applies to Trump. Sure, the unslakeable thirst etc., that could be: I don’t know the guy personally, but it’s possible. The exploration-exploitation tradeoff, though: that’s one where Trump seems much more of an explorer, at least compared to the average elderly politician. Trump tries all sorts of things—he’s notorious for saying things he doesn’t believe (or, perhaps, talking himself into believing things that he should know are not true), just to get people’s reactions. That’s exploration, right? He’s also an improviser, as we saw during the campaign and during the new administration too. Indeed, one common criticism of Trump is that he just says and does things without working through the consequences. This is not necessarily a good thing but I don’t know how it fits into the story that Trump lacks curiosity etc.

In short, I remain skeptical of the application of these ideas to Donald Trump (or, for that matter, to Hillary Clinton, or Paul Ryan, or other political figures). But the psychology stuff, that’s great. I like the idea of comparing how kids and adults think, by stepping back and looking not just at how we perform specific tasks, but at our goals as learners. This seems to me a big deal, a way of linking Piagetian stages of development to problems of adult decision making.

P.S. Carsten Allefeld provided the above picture of a cat basking in adulation.

Design top down, Code bottom up

Top-down design means designing from the client application programmer interface (API) down to the code. The API lays out a precise functional specification, which says what the code will do, not how it will do it.

Coding bottom up means coding the lowest-level foundations first, testing them, then continuing to build. Sometimes this requires dropping a thin mock framework top down or at least thinking about it a bit.

This top-down, then bottom-up approach solves two very important problems at once. First, designing top down lets you answer the what-you-want-to-build problem cleanly from the perspective of the client (the user of the code). Second, coding bottom up lays down a solid foundation on which to build. But more importantly, it helps you break big scary problems down into approachable pieces.
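Here is a small C++ sketch of that workflow on a toy problem. The client-facing declaration and its doc come first (what the code does, not how), then the lowest-level helper is written and tested, and only then is the top-level function built on it. The function names are made up for illustration and this is not code from the Stan math library.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Step 1 (design top down): write the client-facing spec first.
/**
 * Return log(sum(exp(x))) without overflowing for large elements of x.
 * Returns negative infinity if x is empty.
 */
double log_sum_exp(const std::vector<double>& x);

// Step 2 (code bottom up): write the lowest-level piece and test it.
/** Return the largest element of x, or negative infinity if x is empty. */
double max_value(const std::vector<double>& x) {
  double m = -std::numeric_limits<double>::infinity();
  for (double v : x) {
    if (v > m) m = v;
  }
  return m;
}

void test_max_value() {
  assert(max_value({1.0, 3.0, 2.0}) == 3.0);
  assert(std::isinf(max_value({})) && max_value({}) < 0.0);
}

// Step 3: build the API function on the tested foundation.
double log_sum_exp(const std::vector<double>& x) {
  const double m = max_value(x);
  if (std::isinf(m)) return m;  // empty input, or an infinite maximum
  double s = 0.0;
  for (double v : x) {
    s += std::exp(v - m);  // every term is at most 1, so no overflow
  }
  return m + std::log(s);
}

void test_log_sum_exp() {
  assert(std::fabs(log_sum_exp({0.0, 0.0}) - std::log(2.0)) < 1e-12);
  // A naive exp(1000) would overflow; subtracting the max avoids that.
  assert(std::fabs(log_sum_exp({1000.0, 1000.0}) - (1000.0 + std::log(2.0)))
         < 1e-9);
}
```

When test_log_sum_exp fails, test_max_value tells you whether the foundation or the new layer is to blame.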

Documentation and testing go hand in hand

At every stage, it is important to decompose the problem into bits that can be easily described with doc and easily tested. Otherwise, there’s going to be no way for anyone (including you in the future) to understand why the code is organized the way it is. It’s usually worth paying a small price in efficiency to achieve modularity, but usually it’s a win-win situation where writing the code in meaningful units makes everything much easier to design and test and makes the final design more efficient.

Programmers often complain that testing and doc add time to a project. I’ve found careful application of both can be huge time savers. It’s so much easier to debug a low level function on its own terms if you trust its foundations than it is to tackle a huge end-to-end system that is misbehaving. Especially if that big system is a plate of spaghetti, hairball, or describable by some other analogy of tangling.

Why this came up—testing Stan’s autodiff in C++11

I’m kicking off my next big Stan project, which is to write a test framework for our math library’s differentiable functions. They’re tested very thoroughly for reverse mode, but not so much beyond first-order derivatives. This is huge and scary and now there’s all of C++11 lying at my feet for the taking.

I spent a few weeks working on other things while panicking and thinking the problem was just too hard because I couldn’t grasp it all at once.

Then I bit the bullet and asked what I’d like to see from the user’s perspective. That was enough to let me write the client-facing technical spec in a simple GitHub issue on stan-dev/math: Issue 557: fvar testing framework. I came up with the rough outline of a feasible high-level API that would let the user express the problem in their own terms; here’s what it looks like for double values:

std::function<double(double,double)> f = ... some binding ...;
auto t = tester(f);

t.good(1, 1, 2);
t.good(2, 3, 5);

t.throw(NaN, 1.0, std::domain_error);

I’ve already got a simple implementation hard coded for scalars on a branch.

I don’t know exactly what all that is going to look like. I’ve barely dipped my toe into C++11, though I can say I agree with Stroustrup that it looks like an entirely new (and better!) language.

Of course, as soon as I tried to code a simple (non-polymorphic) instance, I realized I actually need a much more general binding for the function and really need to pass a static templated function, which means passing a class type for a type that implements the function application concept. Otherwise, though, everything will look pretty much like this, right down to the factory method. And I should be able to make it fly for arbitrary argument and return types with enough typelist magic.
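For concreteness, here is a minimal C++11 sketch of a tester along the lines of the API outline above, hard coded for functions of two doubles. It is a guess at the shape, not the implementation on the branch. The sketch assumes that good(x1, x2, expected) means "f(x1, x2) should return expected". Because throw is a reserved word in C++ and a type cannot be passed as an ordinary argument, it uses a hypothetical member template named throws instead. The class name, the failure counter, and the checked_add example function are also made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>
#include <stdexcept>
#include <utility>

// Hypothetical tester for functions of two doubles.
class Tester {
 public:
  explicit Tester(std::function<double(double, double)> f)
      : f_(std::move(f)), failures_(0) {}

  // Expect f(x1, x2) to return expected (within tol) without throwing.
  void good(double x1, double x2, double expected, double tol = 1e-8) {
    assert(tol >= 0.0);
    try {
      if (std::fabs(f_(x1, x2) - expected) <= tol) return;
    } catch (...) {
      // An unexpected exception counts as a failure below.
    }
    ++failures_;
  }

  // Expect f(x1, x2) to throw an exception of type E.
  template <typename E>
  void throws(double x1, double x2) {
    try {
      f_(x1, x2);
    } catch (const E&) {
      return;  // the expected exception type
    } catch (...) {
      // The wrong exception type counts as a failure below.
    }
    ++failures_;
  }

  int failures() const { return failures_; }

 private:
  std::function<double(double, double)> f_;
  int failures_;
};

// Factory function so the call site can read auto t = tester(f).
Tester tester(std::function<double(double, double)> f) {
  return Tester(std::move(f));
}

// Example function under test: addition that rejects NaN arguments.
double checked_add(double a, double b) {
  if (std::isnan(a) || std::isnan(b)) {
    throw std::domain_error("arguments must not be NaN");
  }
  return a + b;
}

// Mirrors the calls in the outline; returns the number of failed checks.
int run_examples() {
  auto t = tester(checked_add);
  t.good(1, 1, 2);
  t.good(2, 3, 5);
  t.throws<std::domain_error>(std::numeric_limits<double>::quiet_NaN(), 1.0);
  return t.failures();
}
```

Counting failures instead of aborting on the first one is a design choice in this sketch; a real framework would more likely report through whatever unit-testing library the project already uses.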

The takeaway message on design, as the entire industry has learned for most scales of projects, is that it’s really not worth going overboard. You just need something committed enough to see where you’re going without having to question that at every turn (which is a huge time sink).

So now I have to start building bottom up, where I’ll write a general functional test framework that takes in a differentiable functor and runs it through all five variations of autodiff (none, reverse, directional derivatives, second-order directional derivatives, Hessians, and tensors of third derivatives). Then, in a month or three, we’ll be ready to release Riemannian HMC in an interface (which will also require some distribution/config issues in all the interfaces).

That’s the rest of my weekend accounted for, but the magic is that it’ll just be clean coding. Not much thinking required. I can lay this code down like a bricklayer (programmers love building metaphors).

Even fictional physicists have this problem

The other reason I wrote this post today is that I just read a striking passage on page 79 of the sci-fi thriller Dark Matter by Blake Crouch. Our protagonist is trying to get his bearings and get out of his current bind and decides to fall back on his physics skills (no spoilers):

Experimental physics—hell, all of science—is about solving problems. However, you can’t solve them all at once. There’s always a larger, overarching question—the big target. But if you obsess on the sheer enormity of it, you lose focus.

The key is to start small. Focus on solving problems you can answer. Build some dry ground to stand on. And after you’ve put in the work, and if you’re lucky, the mystery of the overarching question becomes knowable. Like stepping slowly back from a photomontage to witness the ultimate image revealing itself.

I have to separate myself from the fear, the paranoia, the terror, and simply attack this problem as if I were in a lab—one small question at a time.

Exactly! Build some dry ground to stand on. If you try to build the entire program while in a rowboat, you’ll not only be coding alone, you’ll be miserable doing it.

Also in nonfiction

When I first read Hunt and Thomas’s Pragmatic Programmer, a book that changed my whole attitude about programming, it dawned on me that the whole thing was about controlling fear. Writing software other people are going to use is a scary business. We all know how hard it is to get even a few lines of code right, much less hundreds of thousands of lines.

And don’t forget real life

It helped to have a great practical introduction to all this at SpeechWorks. We had great marketing people helping with the functional specs, an awesome engineering team who could handle the technical specs and costing, and managers with enough judgement to get an amazing amount of coordinated work out of 20 programmers without exhausting us. Despite having watched John Nguyen do it, I’m still not sure how he pulled it off. Everyone who worked there misses the company.

How to think scientifically about scientists’ proposals for fixing science

I wrote this article for a sociology journal:

Science is in crisis. Any doubt about this status has surely been dispelled by the loud assurances to the contrary by various authority figures who are deeply invested in the current system and have written things such as, “Psychology is not in crisis, contrary to popular rumor . . . Crisis or no crisis, the field develops consensus about the most valuable insights . . . National panels will convene and caution scientists, reviewers, and editors to uphold standards.” (Fiske, Schacter, and Taylor, 2016). When leaders go to that much trouble to insist there is no problem, it’s only natural for outsiders to worry . . .

When I say that the replication crisis is also an opportunity, this is more than a fortune-cookie cliche; it is also a recognition that when a group of people make a series of bad decisions, this motivates a search for what went wrong in their decision-making process. . . .

A full discussion of the crisis in science would include three parts:
1. Evidence that science is indeed in crisis . . .
2. A discussion of what has gone wrong . . .
3. Proposed solutions . . .
I and others have written enough on topics 1 and 2, and since this article has been solicited for a collection on Fixing Science, I’ll restrict my attention to topic 3: what to do about the problem?

Then comes the fulcrum:

My focus here will not be on the suggestions themselves but rather on what are our reasons for thinking these proposed innovations might be good ideas. The unfortunate paradox is that the very aspects of “junk science” that we so properly criticize—the reliance on indirect, highly variable measurements from nonrepresentative samples, open-ended data analysis, followed up by grandiose conclusions and emphatic policy recommendations drawn from questionable data—all seem to occur when we suggest our own improvements to the system. . . . I will now discuss various suggested solutions to the replication crisis, and the difficulty of using scientific evidence to guess at their effects.

And then the discussion of various suggested reforms:

The first set of reforms are educational . . . These may well be excellent ideas, but what evidence is there that they will “fix science” in any way. Given the widespread misunderstandings of statistical and research methods, even among statisticians, what makes us so sure that more classroom or training hours will make a difference?

The second set of reforms involve statistical methods: I am a loud proponent of some of these ideas, but, again, I come bearing no statistical evidence that they will improve scientific practice. My colleagues and I have given many examples of modern statistical methods solving problems and resolving confusions that arose from null hypothesis significance testing, and our many good stories in this vein represent . . . what’s the plural of “anecdote,” again?

A related set of ideas involve revised research practices, including open data and code, preregistration, and, more generally, a clearer integration of workflow into scientific practice. I favor all these ideas, to different extents, and have been trying to do more of them myself. But, again, I don’t see the data demonstrating their effectiveness. If the community of science critics (of which I consider myself a member) were to hold the “open data” movement to the same standards that we demand for research such as “power pose,” we would have no choice but to label all these ideas as speculative.

The third set of proposed reforms are institutional and involve altering the existing incentives that favor shoddy science and that raise the relative costs to doing good, careful work. . . . these proposals sound good to me. But, again, no evidence.

I conclude:

The foregoing review is intended to be thought provoking, but not nihilistic. One of the most important statistical lessons from the recent replication crisis is that certainty or even near-certainty is harder to come by than most of us had imagined. We need to make some decisions in any case, and as the saying goes, deciding to do nothing is itself a decision. Just as an anxious job-interview candidate might well decide to chill out with some deep breaths, full-body stretches, and a power pose, those of us within the scientific community have to make use of whatever ideas are nearby, in order to make the micro-decisions that, in the aggregate, drive much of the directions of science. And, when considering larger ideas, proposals for educational requirements or recommendations for new default statistical or research methods or reorganizations of the publishing system, we need to recognize that our decisions will necessarily rely much more on logic and theory than on direct empirical evidence. This suggests in turn that our reasoning be transparent and openly connected to the goals and theories that motivate and guide our attempts toward fixing science.

An obvious fact about constrained systems.


This post is not by Andrew. This post is by Phil.

This post is prompted by Andrew’s recent post about the book “Everything is obvious once you know the answer,” together with a recent discussion I’ve been involved in. I’m going to say something obvious.

True story: earlier this year I was walking around in my backyard and I noticed a big hump in the ground next to a tree. “This hump wasn’t here before”, I thought. I looked up and saw that the tree, which had always been tilting slightly, was now tilting a lot more than slightly. It was now tilting very substantially, straight north towards our neighbor’s house! The hump in the ground was the roots on the other side of the tree being pulled up from the ground.

It was a Sunday but I immediately called our tree guy and left a number on his emergency line. (Did you know tree guys have emergency lines? They do. They’re like plumbers: a significant fraction of their calls are urgent). Then I called our neighbors.

The tree guy came out and said something I already knew: this tree is well on its way to falling down. He immediately had his crew come out with heavy ropes to anchor the tree to the trunks of two other trees as a temporary measure. See the diagram above (north is diagonally to the left). A week or so later, his crew came out and cut down the tree piece by piece. They started from the top ;)

Continue reading ‘An obvious fact about constrained systems.’ »

Some natural solutions to the p-value communication problem—and why they won’t work.

John Carlin and I write:

It is well known that even experienced scientists routinely misinterpret p-values in all sorts of ways, including confusion of statistical and practical significance, treating non-rejection as acceptance of the null hypothesis, and interpreting the p-value as some sort of replication probability or as the posterior probability that the null hypothesis is true.

A common conceptual error is that researchers take the rejection of a straw-man null as evidence in favor of their preferred alternative. A standard mode of operation goes like this: p < 0.05 is taken as strong evidence against the null hypothesis, p > 0.15 is taken as evidence in favor of the null, and p near 0.10 is taken either as weak evidence for an effect or as evidence of a weak effect.

Unfortunately, none of those inferences is generally appropriate: a low p-value is not necessarily strong evidence against the null, a high p-value does not necessarily favor the null (the strength and even the direction of the evidence depends on the alternative hypotheses), and p-values are in general not measures of the size of any underlying effect. But these errors persist, reflecting (a) inherent difficulties in the mathematics and logic of p-values, and (b) the desire of researchers to draw strong conclusions from their data.

Continued evidence of these and other misconceptions and their dire consequences for science . . . motivated the American Statistical Association to release a Statement on Statistical Significance and p-values in an attempt to highlight the magnitude and importance of problems with current standard practice . . .

At this point it would be natural for statisticians to think that this is a problem of education and communication. If we could just add a few more paragraphs to the relevant sections of our textbooks, and persuade applied practitioners to consult more with statisticians, then all would be well, or so goes this logic.

Nope. It won’t be so easy.

We consider some natural solutions to the p-value communication problem that won’t, on their own, work:

Listen to the statisticians, or clarity in exposition

. . . it’s not that we’re teaching the right thing poorly; unfortunately, we’ve been teaching the wrong thing all too well. . . . The statistics profession has been spending decades selling people on the idea of statistics as a tool for extracting signal from noise, and our journals and textbooks are full of triumphant examples of learning through statistical significance; so it’s not clear why we as a profession should be trusted going forward, at least not until we take some responsibility for the mess we’ve helped to create.

Confidence intervals instead of hypothesis tests

A standard use of a confidence interval is to check whether it excludes zero. In this case it’s a hypothesis test under another name. Another use is to consider the interval as a statement about uncertainty in a parameter estimate. But this can give nonsensical answers, not just in weird trick problems but for real applications. . . . So, although confidence intervals contain some information beyond that in p-values, they do not resolve the larger problems that arise from attempting to get near-certainty out of noisy estimates.

Bayesian interpretation of one-sided p-values

. . . The problem comes with the uniform prior distribution. We tend to be most concerned with overinterpretation of statistical significance in problems where underlying effects are small and variation is high . . . We do not consider it reasonable in general to interpret a z-statistic of 1.96 as implying a 97.5% chance that the corresponding estimate is in the right direction.
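To see how much the uniform prior matters, here is a small worked sketch. It assumes a normal estimate y with standard error sigma and a normal(0, tau) prior on the effect; that prior is an illustrative assumption of this sketch, not something taken from the article. Under those assumptions the posterior probability that the effect is positive is Phi((y/sigma) * tau / sqrt(tau^2 + sigma^2)), which reduces to the one-sided p-value reading, Phi(y/sigma), only as tau grows without bound.

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
inline double std_normal_cdf(double z) {
  return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Posterior Pr(theta > 0) given estimate y with standard error sigma,
// under an ASSUMED prior theta ~ normal(0, tau). The prior is illustrative.
inline double prob_positive(double y, double sigma, double tau) {
  // Posterior mean / posterior sd simplifies to (y / sigma) * shrink.
  double shrink = tau / std::sqrt(tau * tau + sigma * sigma);
  return std_normal_cdf((y / sigma) * shrink);
}

// Flat-prior limit (tau -> infinity): the one-sided p-value interpretation.
inline double prob_positive_flat(double y, double sigma) {
  return std_normal_cdf(y / sigma);
}
```

With y/sigma = 1.96, the flat-prior version gives 0.975. With a prior scale of half the standard error (tau = 0.5, sigma = 1), `prob_positive` gives about 0.81, which is a very different statement about the sign of the effect.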

Focusing on “practical significance” instead of “statistical significance”

. . . in a huge study, comparisons can be statistically significant without having any practical importance. Or, as we would prefer to put it, effects can vary: a +0.3 for one group in one scenario might become −0.2 for a different group in a different situation. Tiny effects are not only possibly trivial, they can also be unstable, so that for future purposes an estimate of 0.3±0.1 might not even be so likely to remain positive. . . . That said, the distinction between practical and statistical significance does not resolve the difficulties with p-values. The problem is not so much with large samples and tiny but precisely-measured effects but rather with the opposite: large effect-size estimates that are hopelessly contaminated with noise. . . . This problem is central to the recent replication crisis in science . . . but is not at all touched by concerns of practical significance.

Bayes factors

Another direction for reform is to preserve the idea of hypothesis testing but to abandon tail-area probabilities (p-values) and instead summarize inference by the posterior probabilities of the null and alternative models . . . The difficulty of this approach is that the marginal likelihoods of the separate models (and thus the Bayes factor and the corresponding posterior probabilities) depend crucially on aspects of the prior distribution that are typically assigned in a completely arbitrary manner by users. . . . Beyond this technical criticism . . . the use of Bayes factors for hypothesis testing is also subject to many of the problems of p-values when used for that same purpose . . .

What to do instead? We give some suggestions:

Our own preferred replacement for hypothesis testing and p-values is model expansion and Bayesian inference, addressing concerns of multiple comparisons using hierarchical modeling . . . or through non-Bayesian regularization techniques such as lasso . . . The general idea is to use Bayesian or regularized inference as a replacement of hypothesis tests but . . . through estimation of continuous parameters rather than by trying to assess the probability of a point null hypothesis. And . . . informative priors can be crucial in getting this to work.

It’s not all about the Bayes:

Indeed, in many contexts it is the prior information rather than the Bayesian machinery that is the most important. Non-Bayesian methods can also incorporate prior information in the form of postulated effect sizes in post-data design calculations . . . In short, we’d prefer to avoid hypothesis testing entirely and just perform inference using larger, more informative models.

But, we continue:

To stop there, though, would be to deny one of the central goals of statistical science. . . . there is a demand for hypothesis testing. We can shout till our throats are sore that rejection of the null should not imply the acceptance of the alternative, but acceptance of the alternative is what many people want to hear. . . . we think the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation . . . we recommend saying No to binary conclusions in our collaboration and consulting projects: resist giving clean answers when that is not warranted by the data. Instead, do the work to present statistical conclusions with uncertainty rather than as dichotomies. Also, remember that most effects can’t be zero (at least in social science and public health), and that an “effect” is usually a mean in a population (or something similar such as a regression coefficient)—a fact that seems to be lost from consciousness when researchers slip into binary statements about there being “an effect” or “no effect” as if they are writing about constants of nature. Again, it will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.

This article will be published in the Journal of the American Statistical Association, as a comment on the article, “Statistical significance and the dichotomization of evidence,” by Blakeley McShane and David Gal.

P.S. Above cat picture is from Diana Senechal. If anyone wants to send me non-copyrighted cat pictures that would be appropriate for posting, feel free to do so.


I think there’s something wrong with this op-ed by developmental psychologist Alison Gopnik, “4-year-olds don’t act like Trump,” which begins,

The analogy is pervasive among his critics: Donald Trump is like a child. . . . But the analogy is profoundly wrong, and it’s unfair to children. The scientific developmental research of the past 30 years shows that Mr. Trump is utterly unlike a 4-year-old.

Gopnik continues with a list of positive attributes, each one of which, she asserts, is held by four-year-olds but not by the president:
Continue reading ‘#NotAll4YearOlds’ »

Hotel room aliases of the statisticians

Barry Petchesky writes:

Below you’ll find a room list found before Game 1 at the Four Seasons in Houston (right across from the arena), where the Thunder were staying for their first-round series against the Rockets. We didn’t run it then because we didn’t want Rockets fans pulling the fire alarm or making late-night calls to the rooms . . .

This is just great, and it makes me think we need the same thing at statistics conferences:

LAPLACE, P . . . Christian Robert
EINSTEIN, A . . . Brad Efron
CICCONE, M . . . Grace Wahba
SPRINGSTEEN, B . . . Brad Carlin
NICKS, S . . . Jennifer Hill
THATCHER, M . . . Deb Nolan
KEILLOR, G . . . Jim Berger
BARRIS, C . . . Rob Tibshirani
SOPRANO, T . . . that would be Don Rubin.

OK, you get the idea.

I’ll grab ULAM, S, if it hasn’t been taken already. Otherwise please assign me to BUNNY, B. And JORDAN, M can just use his own name; nobody would guess his real identity!

A continuous hinge function for statistical modeling

This comes up sometimes in my applied work: I want a continuous “hinge function,” something like the red curve above, connecting two straight lines in a smooth way.

Why not include the sharp corner (in this case, the function y=-0.5*x if x<0 or y=0.2*x if x>0)? Two reasons. First, computation: Hamiltonian Monte Carlo can trip on discontinuities. Second, I want a smooth curve anyway, as I’d expect it to better describe reality. Indeed, the linear parts of the curve are themselves typically only approximations.

So, when I’m putting this together, I don’t want to take two lines and then stitch them together with some sort of quadratic or cubic, creating a piecewise function with three parts. I just want one simple formula that asymptotes to the lines, as in the above picture.

As I said, this problem comes up on occasion, and each time I struggle to remember what’s a good curve that looks like this. Then after I do it, I forget what I did by the next time the problem comes up. And, amazingly enough, googling *hinge function* gives no convenient solution.

So this time I decided to derive a hinge function from first principles, so we’ll have it forever.

Let’s start with the simplest example, a function y=0 for negative x, and y=x for positive x:

What’s a good curve for this? We can start with the sharp-cornered hinge. Its derivative is just a step function:

Ha! Now it’s clear what to do. We set the derivative to the inverse logit (logistic) function, dy/dx = exp(x)/(1 + exp(x)):

And now we just integrate this to get the original function, the continuous hinge. Integrating exp(x)/(1 + exp(x)) dx is trivial; you just define u = exp(x), then it’s the integral of 1/(1+u) du, which is log(1+u), hence y = log (1 + exp(x)).

And that’s what I plotted in the second graph above, the one labeled, “Continuous hinge function: simplest example.”
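As a quick sanity check on the integration, here is a small sketch that compares a finite-difference slope of log(1 + exp(x)) with the inverse logit. The function names and the step size h are assumptions made for this illustration.

```cpp
#include <cmath>

// Simplest continuous hinge: y = log(1 + exp(x)).
inline double simple_hinge(double x) {
  return std::log(1.0 + std::exp(x));
}

// The derivative we started from: exp(x) / (1 + exp(x)).
inline double inv_logit(double x) {
  return std::exp(x) / (1.0 + std::exp(x));
}

// Central finite difference of simple_hinge at x; h is an assumed step size.
inline double numeric_slope(double x, double h = 1e-5) {
  return (simple_hinge(x + h) - simple_hinge(x - h)) / (2.0 * h);
}
```

The numerical slope should agree with `inv_logit` at any moderate x, and the hinge itself should be close to 0 far to the left and close to x far to the right.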

More generally, we might want a continuous version of a hinge with corner at (x0, a), with slope b0 for x < x0 and slope b1 for x > x0. And there’s one more parameter which is the distance scale of the inverse-logit, in the dimensions of x. Label that distance scale as delta.

Our desired continuous hinge function then has the derivative,

dy/dx = b0 + (b1 - b0) * exp((x-x0)/delta)/(1 + exp((x-x0)/delta)).

Integrating this over x, and setting it so the corner point is at (x0, a), yields the smooth curve,

y = a + b0*(x - x0) + (b1 - b0) * delta * log (1 + exp((x-x0)/delta))
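For anyone who wants the intermediate steps, the integration can be written out as follows, assuming delta > 0 and using the substitution u = (x - x0)/delta (the symbols u and C are introduced here for the derivation only):

```latex
\begin{aligned}
u &= \frac{x - x_0}{\delta}, \qquad
\frac{dy}{dx} = b_0 + (b_1 - b_0)\,\frac{e^{u}}{1 + e^{u}}, \\
y &= \int \frac{dy}{dx}\,dx
   = b_0 x + (b_1 - b_0)\,\delta\,\log\!\left(1 + e^{u}\right) + C,
   \qquad C = a - b_0 x_0, \\
x \to -\infty:&\quad \log\!\left(1 + e^{u}\right) \to 0
   \;\Rightarrow\; y \to a + b_0 (x - x_0), \\
x \to +\infty:&\quad \log\!\left(1 + e^{u}\right) = u + \log\!\left(1 + e^{-u}\right) \approx u
   \;\Rightarrow\; y \to a + b_1 (x - x_0).
\end{aligned}
```

So the "corner point" (x0, a) is where the two asymptotes cross. The smooth curve itself passes slightly off it, at y = a + (b1 - b0) * delta * log(2) when x = x0.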

The top curve above shows the continuous curve with the values x0=2, a=1, b0=0.1, b1=0.5, delta=1.

Playing around with delta keeps the asymptotes the same but compresses or spreads the curving part. For example, here’s what you get by setting delta=3:

In this example, delta=1 (displayed in the very first graph in this post) looks like a better choice if we really want something that looks like a hinge; on the other hand, there are settings where something smoother is desired; and all depends on the scale of x. The above graphs would look much different if plotted from -100 to 100, for example. Anyway, the point here is that we can set delta, and now we understand how it works.

Here’s the function in Stan (from Bob):

real logistic_hinge(real x, real x0, real a, real b0, real b1, real delta) {
  real xdiff = x - x0;
  return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta);
}

And here it is in R:

logistic_hinge <- function(x, x0, a, b0, b1, delta) {
  return(a + b0 * (x - x0) + (b1 - b0) * delta * log(1 + exp((x - x0) / delta)))
}

This is not actually the best way to compute things, as the exponential can easily send you into overflow, especially if you set delta to a small number, as you might very well do in order to approximate the sharp corner. Indeed, in the graphs above, I drew the dotted lines using hinge() with delta=0.1, as this was less effort than writing a separate if/else function for the sharp hinge, and these graphs are at a resolution where setting delta=0.1 is close enough to setting it to 0.

What value of delta should be used in a real application? It depends on the context. You should resist the inclination to set delta to a tiny value such as 0.001 or even 0.1. Think of the curving connector piece not just as a computational compromise but as a part of your model, in that real-world functions typically do not have sharp corners.

Also, my R function could be cleaned up---rewritten in a more stable and computationally efficient manner. With the version that I wrote, make sure that the numbers being exponentiated aren't too extreme. If they are far from 0, do some rescaling in the computation to avoid instabilities. The hinge function Bob wrote in Stan is better, I think, in that it uses log1p_exp to avoid some of the worst instability problems.
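To make the overflow issue concrete, here is a sketch in C++ of one common way to stabilize the computation. The helper `log1p_exp` below is a hand-written stand-in that uses the identity log(1 + exp(u)) = max(u, 0) + log1p(exp(-|u|)); it is not Stan's implementation, and `naive_logistic_hinge` is included only for comparison.

```cpp
#include <algorithm>
#include <cmath>

// Hand-written stand-in for a stable log(1 + exp(u)).
// The argument of exp is never positive, so it cannot overflow.
inline double log1p_exp(double u) {
  return std::max(u, 0.0) + std::log1p(std::exp(-std::fabs(u)));
}

// Continuous hinge with asymptotes meeting at (x0, a), slopes b0 and b1,
// and smoothing scale delta > 0.
inline double logistic_hinge(double x, double x0, double a,
                             double b0, double b1, double delta) {
  double xdiff = x - x0;
  return a + b0 * xdiff + (b1 - b0) * delta * log1p_exp(xdiff / delta);
}

// Direct translation of the formula; overflows when (x - x0) / delta is large.
inline double naive_logistic_hinge(double x, double x0, double a,
                                   double b0, double b1, double delta) {
  double xdiff = x - x0;
  return a + b0 * xdiff + (b1 - b0) * delta * std::log(1.0 + std::exp(xdiff / delta));
}
```

With x0=2, a=1, b0=0.1, b1=0.5 and a small delta such as 0.001, evaluating at x=10 sends the naive version to infinity, while the stabilized version returns a value on the upper asymptote.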
Continue reading ‘A continuous hinge function for statistical modeling’ »

My review of Duncan Watts’s book, “Everything is Obvious (once you know the answer)”

We had some recent discussion of this book in the comments and so I thought I’d point you to my review from a few years ago. Lots to chew on in the book, and in the review.

Causal inference using Bayesian additive regression trees: some questions and answers

[cat picture]

Rachael Meager writes:

We’re working on a policy analysis project. Last year we spoke about individual treatment effects, which is the direction we want to go in. At the time you suggested BART [Bayesian additive regression trees; these are not averages of tree models as are usually set up; rather, the key is that many little nonlinear tree models are being summed; in that sense, Bart is more like a nonparametric discrete version of a spline model. —AG].

But there are 2 drawbacks of using BART for this project. (1) BART predicts the outcome not the individual treatment effect – although those are obviously related and there has been some discussion of this in the econ literature. (2) It will be hard for us to back out the covariate combinations / interactions that predict the outcomes / treatment effects strongly. We can back out the important individual predictors using the frequency of appearance in the branches, but BART (and Random Forests) don’t have the easy interpretation that Trees give.

Obviously it should be possible to fit Bayesian Trees if one can fit BART. So my questions to you are:

1. Is it kosher to fit BART and also fit a Tree separately? Is there a better way?

2. Our data has a hierarchical structure (villages, implementers, countries) and it looks like trees/BART don’t have any way to declare that structure. Do you know of a way to incorporate it? Any advice/cautions here?

My reply:

– I don’t understand this statement: “BART predicts the outcome not the individual treatment effect.” Bart does predict the outcome, but the individual treatment effect is just the outcome with treatment=1, minus the outcome with treatment=0. So you get this directly. At least, that’s what I took as the message of Jennifer Hill’s 2011 paper. So I don’t see why anything new needs to be invoked here.

– Your second point is that a complicated fitted model is hard to understand: “It will be hard for us to back out the covariate combinations / interactions that predict the outcomes / treatment effects strongly.” I think you should do this using average predictive comparisons as in my paper with Pardoe. In that paper, we work with linear regressions and glms, but the exact same principle would work with Bart, I think. This might be of general interest so maybe it’s worth writing a paper on it.

– I would strongly not recommend “backing out the important individual predictors using the frequency of appearance in the branches.” The whole point of Bart, as I understand it, is that it is a continuous predictive model; it’s just using trees as a way to construct the nonparametric fit. In that way, Bart is like a spline: The particular functional form is a means to an end, just as in splines where what we care about is the final fitted curve, not the particular pieces used to put it together.

– I disagree that trees have an easy interpretation. I mean, sure, they seem easy to interpret, but in general they make no sense, so the apparent easy interpretation is just misleading.

– Jennifer and I have been talking about adding hierarchical structure to Bart. She might have already done it, in fact! Jennifer’s been involved in the development of a new R package that does Bart much faster and, I think, more generally, than the previously existing implementation.

In short, I suspect you can do everything you need to do with Bart already. But the multilevel modeling, there I’m not sure. One approach would be to switch to a nonparametric Bayesian model using Gaussian processes. This could be a good solution but it probably does not make so much sense here, given your existing investment in Bart. Instead I suggest an intermediate approach where you fit the model in Bart and then you fit a hierarchical linear model to the residuals to suck up some multilevel structure there.

GP, like Bart, can be autotuned. To some extent this is still a research project, but we’ve been making a lot of progress on this recently. So I don’t think this tuning issue is an inherent problem with GP’s; rather, it’s more of a problem with our current state of knowledge, but I think it’s a problem that we’re resolving.

When Jennifer says she doesn’t trust the estimate of the individual treatment effect, I think she’s saying that (a) such an estimate will have a large standard error, and (b) it will be highly model dependent. Inference for an average treatment effect can be stable, even if inferences for individual treatment effects are not.
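The "predict under treatment=1, predict under treatment=0, subtract" idea can be sketched generically. In the sketch below, `outcome_model` is a hypothetical stand-in for whatever fitted model produces predictions (Bart or anything else); nothing here calls a real Bart package, and the function names are made up for illustration.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a fitted outcome model: maps a covariate vector
// and a 0/1 treatment indicator to a predicted outcome.
using outcome_model = std::function<double(const std::vector<double>&, int)>;

// Individual treatment effect for each unit:
// prediction with treatment=1 minus prediction with treatment=0.
inline std::vector<double> individual_effects(
    const outcome_model& predict,
    const std::vector<std::vector<double>>& covariates) {
  std::vector<double> effects;
  effects.reserve(covariates.size());
  for (const auto& x : covariates)
    effects.push_back(predict(x, 1) - predict(x, 0));
  return effects;
}

// Average treatment effect: the mean of the individual effects.
// Assumes effects is non-empty.
inline double average_effect(const std::vector<double>& effects) {
  return std::accumulate(effects.begin(), effects.end(), 0.0) / effects.size();
}
```

With a Bayesian fit, one would presumably repeat this for each posterior draw of the model, which gives an uncertainty interval for each individual effect and for the average. That is where the contrast shows up: the individual intervals can be wide and model dependent even when the interval for the average is tight.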

I really don’t like the idea of counting the number of times a variable is in a tree, as a measure of importance. There are lots of problems here, most obviously that counting doesn’t give any sense of the magnitude of the prediction. More fundamentally, all variables go into a prediction, and the fact that a variable is included in one tree and not another . . . that isn’t really relevant. Again, it would be like trying to understand a spline by looking at individual components; the only purpose of the individual components is to combine to make that total prediction.

Why do trees make no sense? It depends on context. In social science, there are occasional hard bounds (for example, attitudes on health care in the U.S. could change pretty sharply around age 65) but in general we don’t expect to see such things. It makes sense for the underlying relationships to be more smooth and continuous, except in special cases where there happen to be real-life discontinuities (and in those cases we’d probably include the discontinuity directly in our model, for example by including an “age greater than 65” indicator). Again, Bart uses trees under the hood in the same way that splines use basis functions: as a tool for producing a smooth prediction surface.

P.S. More from Jennifer here.

NIMBYs and economic theories: Sorry / Not Sorry

This post is not by Andrew. This post is by Phil.

A few days ago I posted What’s the deal with the YIMBYs?  In the rest of this post, I assume you have read that one. I plan to post a follow-up in a month or two when I have had time to learn more, but there are a couple of things I can say right now.

I. Sorry

  1. I apologize unreservedly to YIMBY supporters who know, or think they know, that building more housing in San Francisco will decrease rents there or at least will greatly reduce the rate at which they rise. I characterized the entire YIMBY movement as being at least partly motivated by a desire to stick a thumb in the eye of the smug slam-the-door-now-that-I’m-inside NIMBY crowd, rather than by a genuine belief that loosening land use restrictions in SF will decrease rents there. This was simply wrong of me.
  2. There might be something else I need to apologize for too, I’m not sure. (I don’t remember whether or not I said this in the comments; I’ve taken a quick skim through and didn’t see where I said it, but maybe I did). I have seen a few articles that touch on Bay Area housing, and the YIMBY movement, in which people who characterize themselves as YIMBYs say things like “we aren’t talking about turning San Francisco into Manhattan, we are talking about building some more housing to take the pressure off so rents come down.” [That is not a quote, it’s a paraphrase.] I think that in the current economic environment you would need to build an enormous amount of housing in SF to get the price to come down, so I’ve felt that people who say things like that are being disingenuous. It’s possible that I characterized the entire movement incorrectly, based on those examples. If I did, then I apologize to people like Sonja Trauss, who is a leader in the YIMBY movement: Sonja is absolutely up-front about wanting to build however much a free- or nearly-free market would allow, even if this does indeed lead to the Manhattanization of San Francisco. Trauss is not being disingenuous at all, and there are others in the movement who are also straightforward about the fact that their desired policies might completely transform the city.
  3.  I’m also sorry that I wrote authoritatively rather than speculatively. What I should have said is “Here’s what I think is happening, and I’d welcome comments” rather than, in essence, “Here’s what is happening, and I welcome comments.”

II. Not Sorry.
I proposed a model for San Francisco housing prices that makes sense to me. Quite a few people posted that my economic model is nonsense and I’m an arrogant fool, or worse, for having proposed it…that I should be embarrassed… etc.  But not only do I not think I’m arrogant, I think some of the people who accused me of being arrogant are themselves arrogant. (As for whether I’m a fool, that’s possible but I am not convinced).

Continue reading ‘NIMBYs and economic theories: Sorry / Not Sorry’ »

Using Stan for week-by-week updating of estimated soccer team abilities

Milad Kharratzadeh shares this analysis of the English Premier League during last year’s famous season. He fit a Bayesian model using Stan, and the R markdown file is here.

The analysis has three interesting features:

1. Team ability is allowed to continuously vary throughout the season; thus, once the season is over, you can see an estimate of which teams were improving or declining.

2. But that’s not what is shown in the plot above. Rather, the plot above shows estimated team abilities after the model was fit to prior information plus week 1 data alone; prior information plus data from weeks 1 and 2; prior information plus data from weeks 1, 2, and 3; etc. For example, look at the plot for surprise victor Leicester City: after a few games, the team is already estimated to be in the middle of the pack; then throughout the season the team’s estimated ability gradually increases. This does not necessarily mean that we think the team improved during the season; it’s a graph showing the accumulation of information.

3. The graphs show parameter estimates and raw data on the same scale, which I find helpful for understanding the information that goes into the fit.

From a modeling standpoint, item 1 above is the most important: with Bayesian inference and Stan, we can estimate continuously varying parameters even though we only get one game per week from each team.

From the standpoint of interpreting the results, I like item 2 in that it addresses the question of when those notorious 5000-1 odds should’ve been lowered. Based on the above graph, it looks like the answer is: Right away, after the very first game. It’s not that the first game of the season necessarily offers any useful clue to the future, but it does inform a bit about the possibilities. The point is not that Leicester City’s early performance signaled that they were a top team; it’s that the data did not rule out that possibility at 5000:1 odds.
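To put those odds on a probability scale, here is a small arithmetic sketch. It assumes the 5000-1 figure is read as plain fractional odds against, with no bookmaker margin removed; the function name is made up for illustration.

```python
from fractions import Fraction


def implied_probability(odds_against: int) -> Fraction:
    """Probability implied by fractional odds of `odds_against`-to-1 against.

    Assumption: the odds are taken at face value (no bookmaker margin).
    """
    return Fraction(1, odds_against + 1)


p = implied_probability(5000)
print(p, float(p))  # 1/5001, roughly 0.0002
```

So a posterior that leaves even a fraction of a percent of probability on a title is already far from what those odds imply.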

Also, Figure 2 is kinda counterintuitive, in that you might think that if you already have a time-varying parameter, you could just plot its estimate as a function of time and you’d be done. But, no, if you want to see how inferences develop during the season, you need to re-fit the model after each new week of data (or use some sort of particle-filtering, importance-sampling approach, but in this case it’s easier to use the hammer and just re-fit the entire model each time in Stan).
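The re-fitting loop can be sketched as follows. This is only an illustration, not the code from the markdown file: the dictionary keys mirror the data block of the Stan program, `fit` is a hypothetical callback standing in for whatever Stan interface you use, and the rule that a game counts for week w when both teams' week numbers are at most w is an assumption.

```python
from typing import Callable, Dict, List, Sequence


def games_through_week(home_week: Sequence[int],
                       away_week: Sequence[int],
                       week: int) -> List[int]:
    """Indices of games in which both teams are at or before `week`."""
    return [g for g, (hw, aw) in enumerate(zip(home_week, away_week))
            if max(hw, aw) <= week]


def refit_each_week(data: Dict[str, object],
                    fit: Callable[[Dict[str, object]], object]) -> List[object]:
    """Call `fit` once per week on the data available up to that week."""
    per_game = ["home_week", "away_week", "home_team", "away_team", "score_diff"]
    nweeks = max(max(data["home_week"]), max(data["away_week"]))
    fits = []
    for w in range(1, nweeks + 1):
        keep = games_through_week(data["home_week"], data["away_week"], w)
        # Per-game arrays are cut down to the games seen so far.
        subset = {k: [data[k][g] for g in keep] for k in per_game}
        # Scalars and prior information are passed through; nweeks shrinks to w.
        subset.update(nteams=data["nteams"], ngames=len(keep),
                      nweeks=w, prev_perf=data["prev_perf"])
        fits.append(fit(subset))
    return fits
```

Each call starts from scratch, so the cost grows with the number of weeks; that is the price of using the hammer instead of a sequential updating scheme.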

P.S. The R code is in the markdown file. Here’s the Stan program:

data {
  int nteams; // number of teams (20)
  int ngames; // number of games 
  int nweeks; // number of weeks 
  int home_week[ngames]; // week number for the home team
  int away_week[ngames]; // week number for the away team
  int home_team[ngames]; // home team ID (1, ..., 20)
  int away_team[ngames]; // away team ID (1, ..., 20)
  vector[ngames] score_diff;    // home_goals - away_goals
  row_vector[nteams] prev_perf; // a score between -1 and +1
}
parameters {
  // lower bounds keep the scale and degrees-of-freedom parameters positive
  real b_home; // the effect of hosting the game in mean of score_diff dist.
  real b_prev;                         // regression coefficient of prev_perf
  real<lower=0> sigma_a0;              // teams ability variation 
  real<lower=0> tau_a;                 // hyper-param for game-to-game variation
  real<lower=0> nu;                    // t-dist degree of freedom
  real<lower=0> sigma_y;               // score_diff variation
  row_vector<lower=0>[nteams] sigma_a_raw; // game-to-game variation
  matrix[nweeks,nteams] eta_a;         // random component
}
transformed parameters {
  matrix[nweeks, nteams] a;                        // team abilities
  row_vector[nteams] sigma_a; // game-to-game variation
  a[1] = b_prev * prev_perf + sigma_a0 * eta_a[1]; // initial abilities (at week 1)
  sigma_a = tau_a * sigma_a_raw;
  for (w in 2:nweeks) {
    a[w] = a[w-1] + sigma_a .* eta_a[w];           // evolution of abilities
  }
}
model {
  vector[ngames] a_diff;
  // Priors
  nu ~ gamma(2,0.1);     
  b_prev ~ normal(0,1);
  sigma_a0 ~ normal(0,1);
  sigma_y ~ normal(0,5);
  b_home ~ normal(0,1);
  sigma_a_raw ~ normal(0,1);
  tau_a ~ cauchy(0,1);
  to_vector(eta_a) ~ normal(0,1);
  // Likelihood
  for (g in 1:ngames) {
     a_diff[g] = a[home_week[g],home_team[g]] - a[away_week[g],away_team[g]];
  }
  score_diff ~ student_t(nu, a_diff + b_home, sigma_y);
}
generated quantities {
  vector[ngames] score_diff_rep;
  for (g in 1:ngames)
    score_diff_rep[g] = student_t_rng(nu, a[home_week[g],home_team[g]] - 
      a[away_week[g],away_team[g]]+b_home, sigma_y);
}

As explained in the markdown file, the generated quantities are there for posterior predictive checks.
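For readers who prefer equations, the transformed parameters and likelihood in the Stan program above amount to the following. The index names (w for week, j for team, g for game) and the shorthand h(g), v(g) for the home and visiting team are introduced here for illustration only; everything else is read directly off the code, and priors are omitted.

```latex
\begin{align*}
a_{1,j} &= b_{\mathrm{prev}} \,\mathrm{prev\_perf}_j + \sigma_{a0}\,\eta_{1,j}, \\
a_{w,j} &= a_{w-1,j} + \sigma_{a,j}\,\eta_{w,j}, \qquad w = 2,\dots,\mathrm{nweeks}, \\
\sigma_{a,j} &= \tau_a\,\sigma^{\mathrm{raw}}_{a,j}, \qquad \eta_{w,j} \sim \mathrm{normal}(0,1), \\
\mathrm{score\_diff}_g &\sim \mathrm{student\_t}\!\left(\nu,\;
  a_{\mathrm{home\_week}_g,\,h(g)} - a_{\mathrm{away\_week}_g,\,v(g)} + b_{\mathrm{home}},\;
  \sigma_y\right).
\end{align*}
```

Written this way, each team's ability is a random walk with its own step size, expressed through standard-normal innovations rather than by putting a distribution directly on a.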

I really think Stan is a key part of reproducible science in that the models are so clear and portable.

Higher credence for the masses: From a Ted talk?

The Four Most Dangerous Words? A New Study Shows | Laura Arnold | TEDxPennsylvaniaAvenue

I brought this link forward in some comments but wanted to promote it to a post as I think it’s important and I know many folks just do not read comments.
Continue reading ‘Higher credence for the masses: From a Ted talk?’ »

Taking Data Journalism Seriously

This is a bit of a followup to our recent review of “Everybody Lies.”

While writing the review I searched the blog for mentions of Seth Stephens-Davidowitz, and I came across this post from last year, concerning a claim made by author J. D. Vance that “the middle part of America is more religious than the South.” This was a claim that stunned me, given that I’d seen some of the statistics on the topic, and it turned out that Vance had been mistaken, that he’d used some unadjusted numbers which were not directly comparable when looking at different regions of the country. It was an interesting statistical example, also interesting in that claims made in data journalism, just like claims made in academic research, can get all sorts of uncritical publicity. People just trust the numbers, which makes sense in that it takes some combination of effort, subject-matter knowledge, and technical expertise to dig deeper and figure out what’s really going on.

How should we think about data journalism, an endeavor which might be characterized as “informal social science”?

Data journalism is a thing, it’s out there, and maybe it needs to be evaluated by the same standards we use to evaluate published scholarly research. For example, this exercise in noise mining (a study on college basketball that appeared in the New York Times) is as bad as this Psychological Science paper on sports team performance. And then there’s data journalism done by academic researchers on holiday, as it were; wacky things like this. When I do data journalism I think it’s of the same high quality as my published work (except that it’s more likely to have some mistakes because it gets posted right away and hasn’t had the benefit of reviews), but I get the impression that other academics have different standards for newspaper articles and blog posts than for scholarly articles. One thing I like about Stephens-Davidowitz’s book is that it mixes results from different sources without privileging PPNAS or whatever.

Anyway, I don’t currently have any big picture regarding data journalism. I just think it’s important; it’s different from the sorts of social science research done in academia, business, and government; and we should be taking it seriously.

P.S. According to Wikipedia, J. D. Vance (author of the mistaken quote above about religiosity) is an “author and venture capitalist,” which connects us to another theme, that of silly statistics from clueless rich guys, of which my favorite remains this credulity-straining graph of “percentage of slaves or serfs in the world” from rich person Peter Diamandis. Wealthy people have no monopoly on foolishness, of course. But when a rich guy does believe passionately in some error, he might well have the connections to promulgate it widely. Henry Ford and Ron Unz come to mind.