The “power pose” of the 6th century B.C.

From Selected Topics in the History of Mathematics by Aaron Strauss (1973):

Today Pythagoras is known predominantly as a mathematician. However, in his own day and age (which was also the day and age of Buddha, Lao-Tsa, and Confucious), he was looked upon as the personification of the highest divine wisdom by his followers to whom he preached the immortality of the soul. The whole lot of them were often ridiculed by ancient Greek society as being superstitious, filthy vegetarians. . . .

Pythagorean number theory was closely related to number mysticism. Odd numbers were male while even numbers were female. (Shakespeare: “there is divinity in odd numbers”). Individual integers had their own unique properties:

   1    generator and number of reason
   2    1st female number – number of opinion
   3    1st male number – number of harmony, being composed of unity and diversity
   4    number of justice or retribution (square of accounts)
   5    number of marriage, being composed of the first male and female numbers
   6    number of creation
   7    signified the 7 planets and 7 days in a week
   10    holiest number of all composed of 1+2+3+4 which determine a point, a line, and space respectively (later it was discovered that 10 is the smallest integer n for which there exist as many primes and nonprimes between 1 and n)
   17    the most despised and horrible of all numbers

The rectangles with dimensions 4 x 4 and 3 x 6 are interesting in that the former has area and perimeter equal to 16 and the latter has area and perimeter equal to 18. Possibly 17’s horror was kept under control by being surrounded by 16 and 18.

The whole book (actually comb-bound lecture notes) is great. It’s too bad Strauss died so young. I pulled it off the shelf to check my memory following this blog discussion. Indeed I’d been confused. I’d remembered 4 being the number of justice and 17 being the evil number, so I just assumed that the Pythagoreans viewed even numbers as male and odd numbers as female.

Just imagine what these ancient Greeks would’ve been able to do, had they been given the modern tools of statistical significance. I can see it now:

Pythagoras et al. (-520). Are numbers gendered? Journal of Experimental Psychology: General, Vol. -2390, pp. 31-36.

Stan on the beach


This came in the email one day:

We have used the great software Stan to estimate bycatch levels of common dolphins (Delphinus delphis) in the Bay of Biscay from stranding data. We found that official estimates are too low by a full order of magnitude. We conducted both prior and likelihood sensitivity analyses: the former contrasted several priors for estimating a covariance matrix and the latter contrasted results from a Negative Binomial and a Discrete Weibull likelihood. The article is available at:

Unfortunately (and I know that this is not truly an excuse), given the journal scope and space constraints most of the modelling with Stan is actually described in two appendices. Data and R scripts are available on github (

Thanking you again for the amazing Stan,

Yours sincerely,

Matthieu Authier

Peltier, H., Authier, M., Deaville, R., Dabin, W., Jepson, P.D., van Canneyt, O., Daniel, P., and Ridoux, V. (2016). Small Cetacean Bycatch as Estimated from Stranding Schemes: the Common Dolphin Case in the Northeast Atlantic. Environmental Science & Policy, 63: 7–18, doi:10.1016/j.envsci.2016.05.004

That’s what it’s all about.

When doing causal inference, define your treatment decision and then consider the consequences that flow from it

Danielle Fumia writes:

I am a researcher at the Washington State Institute for Public Policy, and I work on research estimating the effect of college attendance on earnings. Many studies that examine the effect of attending college on earnings control for college degree receipt and work experience. These models seem to violate the practice you discuss in your data analysis book of not including intermediate outcomes in the regression of the treatment (attending college) on y (earnings). However, I’m not sure if this situation is different because only college attendees can obtain a degree, but in any case, this restriction wouldn’t be true for work experience. I fear I am missing an important idea because most studies control for one or both seemingly intermediate outcomes, and I cannot seem to find an explanation after much research.

My reply:

For causal inference, I recommend that, instead of starting by thinking of the outcome, you start by thinking of the intervention. As always, the question with the intervention is “Compared to what?” You want to compare people who attended college to equivalent people who did not attend college. At that point you can think about all the outcomes that flow from this choice. In your regressions, you can control for things that came before the choice, but not after. So you can control for work experience before the forking of paths (college or no college) but not for work experience after that choice. Also, suppose your choice is at age 21: either a student attends college by age 21 or he or she does not. Then that’s the fork. If a non-attender later goes to college at age 25, he or she would still be in the “no college attendance at 21” path.

I’m not saying you have to define college attendance in that way, I’m just saying you have to define it in some way. If you don’t define the treatment, your analysis will (implicitly) define it for you.

Here’s more you can read on causal inference and regression, from my book with Jennifer. We go into these and other points in more detail.

“99.60% for women and 99.58% for men, P < 0.05.”

Gur Huberman pointed me to this paper by Tamar Kricheli-Katz and Tali Regev, “How many cents on the dollar? Women and men in product markets.” It appeared in something called ScienceAdvances, which seems to be some extension of the Science brand, i.e., it’s in the tabloids!

I’ll leave the critical analysis of this paper to the readers. Just one hint: Their information on bids and prices comes from an observational study and an experiment. The observational study comes from real transactions and N is essentially infinity so the problem is not with p-values or statistical significance, but the garden of forking paths still comes into play, as there is still the selection of which among many possible comparisons to present, and the selection of which among many possible regressions to run. Also lots of concerns about causal identification, given that they’re drawing conclusions about different numbers of bids and different average prices for products sold by men and women, but they also report that men and women are using different selling strategies. The experiment is N=116 people on Mechanical Turk so there we have the usual concerns about interpretation of small nonrepresentative samples.

The paper has many (inadvertently) funny lines; my favorite is this one:

[Image: screenshot of the line from the paper]

I do not, however, believe this sort of research is useless. To the extent that you’re interested in studying behavior in online auctions—and this is a real industry, it’s worth some study—it seems like a very sensible plan to gather a lot of data and look at differences between different groups. No need to just compare men and women, you could also compare sellers by age, by educational background, by goals in being on eBay, and so forth. It’s all good. And, for that matter, it seems reasonable to start by highlighting differences between the sexes—that might get some attention, and there’s nothing wrong with wanting a bit of attention for your research. It should be possible to present the relevant comparisons in the data in some sort of large grid rather than following the playbook of picking out statistically significant comparisons.

P.S. Some online hype here.

The difference between “significant” and “not significant” is not itself statistically significant: Education edition

In a news article entitled “Why smart kids shouldn’t use laptops in class,” Jeff Guo writes:

For the past 15 years, educators have debated, exhaustively, the perils of laptops in the lecture hall. . . . Now there is an answer, thanks to a big, new experiment from economists at West Point, who randomly banned computers from some sections of a popular economics course this past year at the military academy. One-third of the sections could use laptops or tablets to take notes during lecture; one-third could use tablets, but only to look at class materials; and one-third were prohibited from using any technology.

Unsurprisingly, the students who were allowed to use laptops — and 80 percent of them did — scored worse on the final exam. What’s interesting is that the smartest students seemed to be harmed the most.

Uh oh . . . a report that an effect is in one group but not another. That raises red flags. Let’s read on some more:

Among students with high ACT scores, those in the laptop-friendly sections performed significantly worse than their counterparts in the no-technology sections. In contrast, there wasn’t much of a difference between students with low ACT scores — those who were allowed to use laptops did just as well as those who couldn’t. (The same pattern held true when researchers looked at students with high and low GPAs.)

OK, now let’s go to the tape. Here’s the article, “The Impact of Computer Usage on Academic Performance: Evidence from a Randomized Trial at the United States Military Academy,” by Susan Payne Carter, Kyle Greenberg, and Michael Walker, and here’s the relevant table:

[Image: the relevant table from the article]

No scatterplot of data, unfortunately, but you can see the pattern: the result is statistically significant in the top third but not in the bottom third.

But now let’s do the comparison directly: the difference is (-0.10) – (-0.25) = 0.15, and the standard error of the difference is sqrt(0.12^2 + 0.10^2) = 0.16. Not statistically significant! There’s no statistical evidence of any interaction here.
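The arithmetic above is easy to check. Here is a minimal sketch, assuming the two subgroup estimates are independent (the function name is made up for illustration, and which standard error goes with which estimate does not matter for the result):

```python
import math

def diff_and_se(est_a, se_a, est_b, se_b):
    """Difference est_a - est_b and its standard error.

    Assumes the two estimates are independent, so their variances add.
    """
    diff = est_a - est_b
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return diff, se_diff

# Estimates and standard errors as quoted in the text above.
diff, se_diff = diff_and_se(-0.10, 0.12, -0.25, 0.10)
z = diff / se_diff  # how many standard errors the difference is from zero

print(f"difference = {diff:.2f}, s.e. = {se_diff:.2f}, z = {z:.2f}")
```

The difference comes out at less than one standard error from zero, well short of the conventional threshold of about two.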

Now back to the news article:

These results are a bit strange. We might have expected the smartest students to have used their laptops prudently. Instead, they became technology’s biggest victims. Perhaps hubris played a role. The smarter students may have overestimated their ability to multitask. Or the top students might have had the most to gain by paying attention in class.

Nonononononono. There’s nothing to explain here. It’s not at all strange that there is variation in a small effect, and they happen to find statistical significance in some subgroups but not others.

As the saying goes, The difference between “significant” and “not significant” is not itself statistically significant. (See here for a recent example that came up on the blog.)

The research article also had this finding:

These results are nearly identical for classrooms that permit laptops and tablets without restriction as they are for classrooms that only permit modified-tablet usage. This result is particularly surprising considering that nearly 80 percent of students in the first treatment group used a laptop or tablet at some point during the semester while only 40 percent of students in the second treatment group ever used a tablet.

Again, I think there’s some overinterpretation going on here. With small effects, small samples, and high variation, you can find subgroups where the results look similar. That doesn’t mean the true difference is zero, or even nearly zero—you still have these standard errors to worry about. When the s.e. is 0.07, and you have two estimates, one of which is -0.17 and one of which is -0.18 . . . the estimates being so nearly identical is just luck.
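To put a rough number on "just luck," here is a sketch under two loud assumptions that are mine, not the paper's: the two estimates are independent and normally distributed, and the two true effects are exactly equal. (The treatment groups share a control group, so independence is only an approximation.) The function name is hypothetical.

```python
import math

def prob_estimates_within(gap, se_a, se_b):
    """P(|A - B| <= gap) for independent normal estimates A and B
    that share the same true value (both are assumptions)."""
    se_diff = math.sqrt(se_a**2 + se_b**2)
    # For D ~ Normal(0, se_diff^2), P(|D| <= gap) = erf(gap / (se_diff * sqrt(2))).
    return math.erf(gap / (se_diff * math.sqrt(2)))

se_diff = math.sqrt(0.07**2 + 0.07**2)
p_close = prob_estimates_within(0.01, 0.07, 0.07)

print(f"s.e. of the difference: {se_diff:.3f}")
print(f"chance the two estimates land within 0.01 of each other: {p_close:.2f}")
```

Under these assumptions the standard error of the difference is about 0.1, ten times the observed gap of 0.01, and even identical true effects would produce estimates this close less than one time in ten.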

Just to be clear, I’m not trying to “shoot down” this research article nor am I trying to “debunk” the news report. I think it’s great for people to do this sort of study, and to report on it. It’s because I care about the topic that I’m particularly bothered when they start overinterpreting the data and drawing strong conclusions from noise.

Annals of really pitiful spammers

Here it is:

On May 18, 2016, at 8:38 AM, ** <**@**.org> wrote:

Dr. Gelman,

I hope all is well. I looked at your paper on [COMPANY] and would be very interested in talking about having a short followup or a review article about this published in the next issue of the Medical Research Archives. It would be interesting to see a paper with new data since this was published, or any additional followup work you have done. If you could also tell me more about your current projects that would be helpful. The Medical Research Archives is an online and print peer-reviewed journal. The deadlines are flexible. I am happy to asnwer any questions. Please respond at your earliest convenience.

Best Regards,

Medical Research Archives
** ** Avenue
** CA ***** USA


Ummm, I guess it makes sense, if these people actually knew what they were doing, they’d either (a) have some legitimate job, or (b) be running a more lucrative grift.

But if they really really have their heart set on scamming academic researchers, I recommend they join up with Wolfram Research. Go with the market leaders, that’s what I say.

Here’s something I know nothing about

Paul Campos writes:

Does it seem at all plausible that, as per the CDC, rates of smoking among people with GED certificates are double those among high school dropouts and high school graduates?

My reply: It does seem a bit odd, but I don’t know who gets GEDs. There could be correlations with age and region of the country. It’s hard to know what to do with this sort of demographically-unadjusted number.

Albedo-boy is back!

New story here. Background here and here.

“Lots of hype around pea milk, with little actual scrutiny”

Paul Alper writes:

Had no idea that “Pea Milk” existed, let alone controversial. Learn something new every day.

Indeed, I’d never heard of it either. I guess “milk” is now a generic word for any white sugary drink? Sort of like “tea” is a generic word for any drink made from a powder steeped in boiling water.

Splitsville for Thiel and Kasparov?

The tech zillionaire and the chess champion were always a bit of an odd couple, and I’ve felt for a while that it was just as well that they never finished that book they were talking about.

But given that each of them has taken a second career in political activism, I can’t imagine that they’re close friends after this:

Kasparov, July 2015:

As for Trump, the less said the better. He said he could be Putin’s pal if elected, showing he has no clue about democracy or Putin — and probably not anything else. It’s worth pointing out that Trump and Bernie Sanders are both getting substantial attention from the media and voters despite having zero chance of winning a primary . . .

Kasparov, April 2016:

Trump sells the myth of American success instead of the real thing. . . . It’s tempting to rally behind him-but we should resist. Because the New York values Trump represents are the very worst kind. . . . He may have business experience, but unless the United States plans on going bankrupt, it’s experience we don’t need.

May 2016:

Thiel, co-founder of PayPal and Palantir and a director at Facebook, is now a Trump delegate in San Francisco, according to a Monday filing.

Kasparov I can understand. He hates Putin, Trump likes Putin, it’s that simple.

Thiel I can understand too. He hasn’t been in the news for a while and probably misses the headlines. And, unlike his onetime coauthor, he can’t get publicity whenever he wants just by playing a blitz tournament.

On deck this week

Mon: Splitsville for Thiel and Kasparov?

Tues: Here’s something I know nothing about

Wed: The “power pose” of the 6th century B.C.

Thurs: “99.60% for women and 99.58% for men, P < 0.05.”

Fri: Stan on the beach

Sat: Michael Lacour vs John Bargh and Amy Cuddy

Sun: Should he major in political science and minor in statistics or the other way around?

Now that’s what I call a power pose!


John writes:

See below for your humour file or blogging on a quiet day. . . . Perhaps you could start a competition for the wackiest real-life mangling of statistical concepts (restricted to a genuine academic setting?).

On 15 Feb 2016, at 5:25 PM, [****] wrote:

Pick of the bunch from tomorrow’s pile of applications at the [XYZ] Human Ethics Sub-Committee:

“. . . If the difference between the two scores is not as significant as predicted, a power calculation will be performed to determine if this is due to sampling . . .”

I love this. By which I mean, I hate this. Again, though, I have to pin much of the blame on the statistical profession, which has sold statistics to researchers as a way to distill 95% pure certainty out of randomness.

Statistics made a success out of the embodied-cognition guy, it made a hero out of the power-pose lady, and maybe it can do the same for you!

P.S. If you want something more to read today, there’s this that I posted on Retraction Watch. Some of the comments to that post were a bit . . . off. I think that lots of readers of that site have such strong views on retractions and corrections that they weren’t even reading what I wrote, they just teed off on what they thought I might be saying. It can be a challenge to write for this sort of audience.

“Stop the Polling Insanity”


Norman Ornstein and Alan Abramowitz warn against over-interpreting poll fluctuations:

In this highly charged election, it’s no surprise that the news media see every poll like an addict sees a new fix. That is especially true of polls that show large and unexpected changes. Those polls get intense coverage and analysis, adding to their presumed validity.

The problem is that the polls that make the news are also the ones most likely to be wrong.

Well put. Don’t chase the goddamn noise. We discussed this point on the sister blog the other day.

But this new op-ed by Ornstein and Abramowitz goes further by picking apart problems in recent outlier polls:

Take the Reuters/Ipsos survey. It showed huge shifts during a time when there were no major events. There is a robust scholarship, using sophisticated panel surveys, that demonstrates remarkable stability in voter preferences, especially in times of intense partisan preferences and tribal political identities. The chances that the shifts seen in these polls are real and not artifacts of sample design and polling flaws? Close to zero.

What about the neck-and-neck race described in the NBC/Survey Monkey poll? A deeper dig shows that 28 percent of Latinos in this survey support Mr. Trump. If the candidate were a conventional Republican like Mitt Romney or George W. Bush, that wouldn’t raise eyebrows. But most other surveys have shown Mr. Trump eking out 10 to 12 percent among Latino voters.

There’s only one place where I disagree with Ornstein and Abramowitz. They write:

Part of the problem stems from the polling process itself. Getting reliable samples of voters is increasingly expensive and difficult, particularly as Americans go all-cellular. Response rates have plummeted to 9 percent or less. . . . With low response rates and other issues, pollsters try to massage their data to reflect the population as a whole, weighting their samples by age, race and sex.

So far so good, but then they say:

But that makes polling far more of an art than a science, and some surveys build in distortions, having too many Democrats or Republicans, or too many or too few minorities. If polling these days is an art, there are a lot of mediocre or bad artists.

What’s my problem with that paragraph? First off, I don’t like the “more of an art than a science” framing. Science is an art! Second, and most relevant in this context, the adjustment of sample to the population is a scientific process! Suppose a chemist is calculating energy release in an experiment and has to subtract off the energy emitted by a heat source. That’s science—it’s taking raw data and adjusting it to estimate what you want to learn. And that’s what we do when we do survey adjustment (for example, here). Yes, you can do this adjustment badly or with bias, just as you can introduce sloppy or biased adjustments in a chemistry experiment. But it’s still science.
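To make "adjusting the sample to the population" concrete, here is a toy sketch of the simplest version, poststratification on a single variable. Everything in it is made up for illustration (the cell labels, the responses, the population shares, the function name); real survey adjustment uses many more cells and usually a model.

```python
def poststratify(responses, population_share):
    """Weighted estimate: sum over cells of (population share) x (cell mean).

    responses: list of (cell, y) pairs from the sample.
    population_share: dict mapping cell -> share of the population (sums to 1).
    Raises KeyError if a population cell has no respondents.
    """
    totals, counts = {}, {}
    for cell, y in responses:
        totals[cell] = totals.get(cell, 0.0) + y
        counts[cell] = counts.get(cell, 0) + 1
    return sum(share * totals[cell] / counts[cell]
               for cell, share in population_share.items())

# Hypothetical sample: older respondents are overrepresented (8 of 10).
responses = [("older", 1)] * 6 + [("older", 0)] * 2 + [("younger", 0)] * 2
# Hypothetical population: the two groups are actually the same size.
population_share = {"older": 0.5, "younger": 0.5}

raw = sum(y for _, y in responses) / len(responses)
adjusted = poststratify(responses, population_share)
print(f"raw sample mean = {raw:.3f}, adjusted estimate = {adjusted:.3f}")
```

The raw mean reflects who happened to answer; the adjusted estimate reflects the stated population shares. The adjustment can be done badly (wrong shares, too few respondents in a cell), but the logic is the same as subtracting off the heat source.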

Anyway, I agree with the main points in their op-ed.

P.S. For more on polling biases in the 2016 campaign, see this thoughtful news article by Nate Cohn, “Is Traditional Polling Underselling Donald Trump’s True Strength?”

P.P.S. I would’ve posted this all on the sister blog where it’s a natural fit, but I couldn’t muster the energy to add paragraphs of background material. One pleasant thing about blogging here is that I can take it as your responsibility to figure out what I’ve written, not my responsibility to make it accessible to you. Indeed, I suspect that part of the fun of reading this blog for many people is that I don’t write down to you. I write at my own level and give you the chance to join in.

I’m planning to write a few books, though, so I’ll have to shift gears at some point. Damn. I’ve become so comfortable with this style. Good for me to get out of my comfort zone, I know. But still.

Nick and Nate and Mark on Leicester and Trump

Just following up on our post the other day on retrospective evaluations of probabilistic predictions:

For more on Leicester City, see Nick Goff on Why did bookmakers lose on Leicester? and What price SHOULD Leicester have been? (forwarded to me by commenter Iggy).

For more on Trump, see Nate Silver on How I Acted Like A Pundit And Screwed Up On Donald Trump and Mark Palko making a lot of these points, back in September, 2015, on how journalists were getting things wrong and not updating their models given the evidence in front of them.

Which is closely related to “Why did bookmakers lose on Leicester?”, because it seems that their big losses came not from those 5000-1 preseason odds but from bets during the early part of the season, when Leicester was winning and winning but the bookies were lowering the odds only gradually, not fully accounting for the information provided by these games.

I’ll also use this post as an occasion to repeat my idea that a good general framework for predicting events that only occur once is to embed them in a larger system, which can be done in two ways: (a) by considering the event in question as one of many in a reference set (other sports championships, other national elections), and (b) considering precursor events (won-lost record, score differentials, vote differentials). Either way requires assumptions (“models”) but that’s what you gotta do.

Will transparency damage science?

Jonathan Sterne sent me this opinion piece by Stephan Lewandowsky and Dorothy Bishop, two psychology researchers who express concern that the movement for science and data transparency has been abused.

It would be easy for me to dismiss them and take a hard-line pro-transparency position—and I do take a hard-line pro-transparency position—but, as they remind us, there is a long history of the political process being used to disparage scientific research that might get in the way of people making money. Most notoriously, think of the cigarette industry for many years attacking research on smoking, or more recently the NFL and concussions.

Rather than saying we want transparency but we don’t want interference with research, I’d rather say we want transparency and we don’t want interference with research.

Here are some quotes from their article that I like:

Good researchers do not turn away when confronted by alternative views.

The progress of research demands transparency.

We strongly support open data, and scientists should not regard all requests for data as harassment.

The status of data availability should be enshrined in the publication record along with details about what information has been withheld and why.

Blogs and social media enable rapid correction of science by scientists.

Scientific publications should be seen as ‘living documents’, with corrigenda an accepted — if unwelcome — part of scientific progress.

Freedom-of-information (FOI) requests have revealed conflicts of interest, including undisclosed funding of scientists by corporate interests such as pharmaceutical companies and utilities.

Researchers should scrupulously disclose all sources of funding.

Journals and institutions can also publish threats of litigation, and use sunlight as a disinfectant.

Issues such as reproducibility and conflicts of interest have legitimately attracted much scrutiny and have stimulated corrective action. As a result, the field is being invigorated by initiatives such as study pre-registration and open data.

Here are some quotes I don’t like:

Research is already moving towards study ‘pre-registration’ (researchers publishing their intended method and analysis plans before starting) as a way to avoid bias, and the same strictures should apply to critics during reanalysis.

All who participate in post-publication review should identify themselves.

I disagree with the first statement. Pre-registration is fine for what it is, but I certainly don’t think it should be a requirement or stricture.

And I disagree with the second statement. If a criticism is relevant, who cares if it’s made anonymously? For example, there seem to be several erroneous calculations of test statistics and p-values in the work of social psychologist Amy Cuddy. Once someone points these out, they can be assessed independently. On the other hand, it could make sense for pre-publication review to be identified. The problem is that pre-publication reviews are secret. So if someone makes a pre-publication criticism or endorsement of a paper, it can’t be checked. It wouldn’t be bad at all for such reviewers to have to stand by their statements.

And here’s Lewandowsky and Bishop’s summary chart:

[Image: Lewandowsky and Bishop’s summary chart]

This mostly seems reasonable although I might make some small changes. For example, I don’t like the idea that it’s a red flag if Dr A promotes work that has not been peer-reviewed (a concern that is listed twice in the above table). If you do important work, by all means promote it right away, don’t wait on the often-slow peer-review process! We do our research because it’s important, and it seems completely reasonable to share it with the world.

What’s important to me is not peer review (see recent discussion) but transparency. If you have a preprint on Arxiv with all the details of your experiment, that’s great: promote your heart out and let any interested parties read the article. On the other hand, if you’re not gonna tell people what you actually did, I don’t see why we should be reading your press releases. That was my problem with that recent gay gene hype: I don’t care that this silly n=47 paper wasn’t peer reviewed; I care that there was no document anywhere saying what the researchers actually did (nor was there a release of the data).

What the researcher said regarding a critical journalist was:

I would have appreciated the chance to explain the analytical procedure in much more detail than was possible during my 10-minute talk but he didn’t give me the option.

There was time for a press release with an interview and a publicity photo, but no time for writing up the method or posting the data. That’s a flag that it’s safe to wait a bit before writing about the study.

The “peer-reviewed journal” thing is just a red herring. I completely disagree with the idea that we should raise a red flag when a researcher promotes work before publication. I for one do not want to wait on reviewers to share my work.

Also, one more thing. The above table includes this “red flag”: “Are the critics levelling personal attacks? Are criticisms from anonymous sources or ‘sock puppets’?” That’s fine, but then let’s also raise a red flag when researchers engage in this same behavior. Remember John Lott? His survey may have been real, but his use of sock puppets to defend his own work does give him a credibility problem.

In any case, I think Lewandowsky and Bishop do a service by laying out these issues.

Full disclosure: Some of my research is funded by the pharmaceutical company Novartis.

Bias against women in academia


I’m not the best one to write about this: to the extent that there’s bias in favor of men, I’ve been a beneficiary. Also I’m not familiar with the research on the topic. I know there are some statistical difficulties in setting up these causal questions, comparable to the difficulties arising in using “hedonic regression” to estimate the so-called risk premium or value of a life. (See this post from 2004 for my discussion of these challenges, and what I called the “inevitable inconsistency” of these sorts of estimates.)

All challenges aside, there are disparities, for whatever reasons, between men and women in the workforce, and it’s a topic worth studying, especially given how the roles of men and women have changed in recent decades. I’m sure that bias against women comes in many ways, not just salary. It is possible for two people to have the same salary but to have unequal working conditions.

I have no reason to think that academia is worse than other sectors when it comes to how women are treated. But academia has an (imperfect) tradition of openness, so it makes sense to study things like inequality within academia, where maybe the topic will be more clear to study.

Tian Zheng pointed me to this page by Virginia Valian, author of the 1998 book, “Why So Slow? The Advancement of Women,” about women in certain professional careers. The page has transcripts from some interviews from 2006, so perhaps some readers can point us to research in this area since then? Also there’s the related topic of bias against ethnic minorities.

Birthday analysis—Friday the 13th update, and some model checking

Carl Bialik and Andrew Flowers at fivethirtyeight (Nate Silver’s site) ran a story following up on our birthdays example—that time series decomposition of births by day, which is on the cover of the third edition of Bayesian Data Analysis using data from 1968-1988, and which Aki then redid using a new dataset from 2000-2014.

Friday the 13th: The Final Chapter

Bialik and Flowers start with Friday the 13th. For each “13th” (Monday the 13th, Tuesday the 13th, etc.), they took the average number of babies born on this day over the years in the dataset, minus the average number of babies born one week before and one week after (that is, comparing to the 6th and 20th of the same month). Here’s what they found:


The drop on the 13th is biggest on Fridays, of course. It’s smaller on the weekends, which makes sense. Weekend births are already suppressed by selection so there is not much room for there to be more selection because of being on the 13th.
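The comparison described above (each 13th against the 6th and 20th of the same month, split by day of week) can be sketched in a few lines. This is my reading of the description, not Bialik and Flowers's actual code; the function name is hypothetical and the data below are fake, built only to exercise the calculation.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def thirteenth_effect(births):
    """Average of (births on the 13th) minus (mean of births on the 6th
    and 20th of the same month), grouped by the weekday of the 13th.

    births: dict mapping datetime.date -> number of births that day.
    """
    diffs = defaultdict(list)
    for day, count in births.items():
        if day.day != 13:
            continue
        before = day - timedelta(days=7)  # the 6th, same weekday
        after = day + timedelta(days=7)   # the 20th, same weekday
        if before in births and after in births:
            baseline = (births[before] + births[after]) / 2
            diffs[WEEKDAYS[day.weekday()]].append(count - baseline)
    return {wd: sum(v) / len(v) for wd, v in diffs.items()}

# Fake data: 100 births every day of 2015, except 90 on each Friday the 13th.
births = {}
d = date(2015, 1, 1)
while d.year == 2015:
    is_friday_13th = d.day == 13 and d.weekday() == 4  # Monday is 0
    births[d] = 90 if is_friday_13th else 100
    d += timedelta(days=1)

effect = thirteenth_effect(births)
print(effect)
```

Comparing to one week before and one week after keeps the day of the week fixed, so the day-of-week pattern cancels out of each difference.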

All 366 days

Here’s the fivethirtyeight art department’s redrawing of a graph Aki made for the 2000-2014 data:


You can see dips on the 13th of each month, also a slight bump on Valentine’s, and drops on several other special days, including February 29th, Halloween, and September 11th, along with a bunch of major holidays.

Here’s the full decomposition:

[Image: the full decomposition, four graphs]

If you look carefully on the bottom graph you’ll again see the little dips on the 13th of each month, but the blow-up above is better. The reason I’m showing all four graphs here is so you can see the decomposition with all the pieces of the model.

Comparing to the results from the earlier time period (just look at your copy of BDA3!), we see the new pattern on September 11th, also in general the day-of-week effects and special-day effects are much higher. This makes sense given that the rate of scheduled births has been going up. Indeed if the birth rate on Christmas is only 50% of that on other days, that suggests that something like 50% of all births in the U.S. are scheduled. Wow. Times have changed.

Checking and improving the model

There are some funny things about the fitted model. Look either at the graph above with the day-of-year patterns (the very second plot of this post) or, equivalently, the bottom of the four plots just above.

The first thing I noticed, after seeing the big dips on New Year’s, July 4th, and Christmas, were the blurs around Labor Day, Memorial Day, and Thanksgiving. These are floating days in the U.S. calendar, changing from year to year. (For example, Thanksgiving is the fourth Thursday in November.) At first I thought this was a problem, that the model was trying to fit these floating holidays with individual date effects, but it turns out that, no, Aki did it right: he looked up the floating holidays and added a term for each of them, in addition to the 366 individual date effects.

The second thing is that the day-of-week effects clearly go up over time, and we know from our separate analyses of the 1968-1988 and 2000-2014 data that the individual-date and special-day effects increase over time too. So really we’d want these to increase for each year in the model. Things are starting to get hard to fit if we let these parameters vary from year to year, but a simple linear trend would seem reasonable, I’d think.

The other problem with the model we fit can be seen, for example, around Christmas and New Year’s. There’s a lot more minus than plus. Similarly on April Fool’s and Halloween: the number of extra births on the days before and after these unlucky days is not enough to make up for the drop on the days themselves.

But this can’t be right. Once a baby’s in there, it’s gotta come out some time! The problem is that our individual-date and special-day effects are modeled as delta functions, so they won’t cancel out unless they just happen to do so. I think it would make more sense for us to explicitly fit the model with little ringing impulse-response functions that sum to 0. For example, something like this: (-1, -2, 6, -2, -1), on the theory that most of the missing babies on day X will appear between days X-2 and X+2. It’s an assumption, and it’s not perfect, but I think it makes more sense than our current (0, 0, 1, 0, 0).
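To make the difference concrete, here is a small Python sketch that lays each kernel down around a single special day and adds up the result. The kernels are the two from the paragraph above; the function name, the 11-day window, and the amplitude of -1 are hypothetical choices for illustration, not part of the fitted model.

```python
RINGING = [-1, -2, 6, -2, -1]  # kernel suggested in the text; sums to 0
DELTA = [0, 0, 1, 0, 0]        # the current delta-function effect

def date_effect(n_days, center, amplitude, kernel):
    """Return a length-n_days series with amplitude * kernel centered on
    day `center`. Weights that fall outside the window are dropped, so
    the zero-sum property only holds away from the edges."""
    effect = [0.0] * n_days
    half = len(kernel) // 2
    for offset, weight in zip(range(-half, half + 1), kernel):
        j = center + offset
        if 0 <= j < n_days:
            effect[j] += amplitude * weight
    return effect

# A negative amplitude gives a dip on day X with bumps on either side.
ringing = date_effect(11, 5, -1.0, RINGING)
spike = date_effect(11, 5, -1.0, DELTA)
print(sum(ringing), sum(spike))
```

The ringing version nets out to zero over the five days, so the missing babies reappear nearby; the delta version just loses them. In a real fit the day-of-year axis would wrap around from December 31 to January 1 rather than truncate.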

You can also see this problem arising in the last half of September, where there seem to be “too many babies” for two entire weeks! Obviously what’s happening here is that the individual-date effects, unconstrained to sum locally to zero, are sucking up some of the variation that should be going into the time-of-year term. And, indeed, if you look at the estimated time-of-year series—the third in the set of four graphs above—you’ll see that the curves are too smooth. The fitted Gaussian process for the time-of-year series got estimated with a timescale hyperparameter that was too large, thus the estimated curve was too smooth, and the individual-date effects took up the slack. Put zero-summing impulse-response functions in there, and I think the time-of-year curve will be forced to do better.

This is a great example of posterior model checking, and a reminder that fitting the model is never the end of the story. As I said a few years ago, just as a chicken is an egg’s way of making another egg, Bayesian inference is just a theory’s way of uncovering problems that can lead to a better theory.

Data issues

Results from U.S. CDC data 1994-2003 look similar to the results from U.S. SSA data 2000-2014, but for some reason in the overlapping dates in years 2000-2003 there are on average 2% more births per day in the SSA data:


It’s easy enough to adjust for this in the model by just including an indicator for the dataset.
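On the log scale, a dataset indicator is just a constant offset, and a 2% gap corresponds to a coefficient of about log(1.02), or roughly 0.02. Here is a rough Python sketch that estimates that offset directly from the overlapping days. This is a stand-in for illustration: in the actual model the indicator would be estimated jointly with everything else, and the function name and the dict inputs here are hypothetical.

```python
import math

def log_offset(ssa, cdc):
    """Mean log-ratio of SSA to CDC daily counts over the days present
    in both datasets. Inputs are hypothetical dicts: day -> count."""
    common = sorted(set(ssa) & set(cdc))
    if not common:
        raise ValueError("the two datasets share no days")
    return sum(math.log(ssa[d] / cdc[d]) for d in common) / len(common)

# Made-up counts with a built-in 2% gap on the two overlapping days.
ssa_toy = {1: 102.0, 2: 204.0}
cdc_toy = {1: 100.0, 2: 200.0, 3: 150.0}
offset = log_offset(ssa_toy, cdc_toy)
print(offset, math.exp(offset))
```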

The model

We (that is, Aki) fit a hierarchical Gaussian process to the data, with several additive terms (on the log scale, I assume; at least, it should all be on the log scale). Details are in the appropriate chapter of BDA3.

Code or it didn’t happen

I’d love to give you the Stan code for this model—but it’s too big to fit in Stan. It’s not that the number of data points is too big, it’s that the Gaussian process has this big-ass covariance matrix which itself depends on hyperparameters. So lots of computing. We have plans to add some features to Stan to speed up some matrix operations so this sort of model can be fit more efficiently. Also there should be some ways to strip down the model to reduce its dimensionality.

But for now, the model was fit in GPstuff. And here’s the GPstuff code for various versions of the model (maybe not quite the one fit above).
Continue reading ‘Birthday analysis—Friday the 13th update, and some model checking’ »

Is fraac Scott Adams?

tl;dr: If you value your time, don’t read this post.
Continue reading ‘Is fraac Scott Adams?’ »

Beautiful Graphs for Baseball Strike-Count Performance

This post is by Bob. I have no idea what Andrew will make of these graphs; I’ve been hoping to gather enough comments from him to code up a ggplot theme. Shravan, you can move along, there’s nothing here but baseball.

Jim Albert created some great graphs for strike-count performance in a series of two blog posts. Here’s the first post:

* Jim Albert. Graphing pitch count effects (part 1)

Albert plots the pooled estimate of expected runs arising at various strike counts for all plate appearances for the 2011 season.

Using the x axis for count progression and the y axis for outcome yields a really nice visualization of strike count effects. I might have used arrows to really stress the state-space transitions. I might also have tried to label how often each of these transitions happened (including self-transitions at 0-2, 1-2, and 2-2 counts) and the total number of times each state came up for a batter. I don’t think we really need the double coding (color and vertical position), and indeed the coloring’s gone in the second plot.

I like the red line at the average effect for an at-bat (corresponding to a 0-0 count), but I would’ve preferred the actual expected runs on the y axis rather than something standardized to zero. The average plate appearance is worth more than 0 runs.

In the second post, Albert goes on to plot estimates by batter and pitcher without any pooling:

* Jim Albert. Graphing pitch count effects (part 2)

The following are plots for the 2015 Cy Young award winning pitchers:

He also included a couple of great batters from the same year:

With multiple graphs, it’d be nice to have the same y axis range and the same ratio of x axis to y axis size (I can never remember how to do this in ggplot). And it’d be nice to set it up so that the little bubbles don’t get truncated.

It dawned on me that the transitions are not Markovian. If a pitcher intentionally walks a batter from a 0-0 count, then the transitions from 0-0 to 1-0 to 2-0 to 3-0 are correlated. So I’d bet there are more such straight-through transitions for many batters than would be expected from Markovian transitions. We could test how well the Markovian approximation works through simulation, as Albert does for many other topics in his book Curve Ball.
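Here’s a minimal sketch of what such a simulation check could look like in Python. Everything numeric here is made up: I’m assuming a single ball probability and a single strike probability for every count, with the remainder being a ball in play, and I’m ignoring foul balls at two strikes. A real check would estimate count-specific transition probabilities from the data.

```python
import random

STRAIGHT_WALK = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]

def simulate_pa(p_ball, p_strike, rng):
    """Simulate one plate appearance where each pitch depends only on
    the current count. Returns the list of counts visited."""
    balls = strikes = 0
    path = [(0, 0)]
    while balls < 4 and strikes < 3:
        u = rng.random()
        if u < p_ball:
            balls += 1
        elif u < p_ball + p_strike:
            strikes += 1
        else:
            break  # ball put in play; the plate appearance ends
        path.append((balls, strikes))
    return path

def straight_walk_rate(p_ball, p_strike, n, seed=0):
    """Fraction of simulated plate appearances that are four straight balls."""
    rng = random.Random(seed)
    hits = sum(simulate_pa(p_ball, p_strike, rng) == STRAIGHT_WALK
               for _ in range(n))
    return hits / n

# Hypothetical probabilities; under this model the rate should be near 0.4**4.
print(straight_walk_rate(0.4, 0.4, 200_000, seed=1))
```

The comparison would then be between this simulated rate and the observed rate of four-pitch walks. If the observed rate is clearly higher, intentional walks (or something like them) are breaking the Markovian approximation.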

Of course, the natural next step is to build a hierarchical model to partially pool the ball-strike count effects. An even more ambitious goal would be to model a particular batter vs. pitcher matchup.

I highly recommend clicking through to the original posts if you like baseball; there are many more players illustrated and much more in-depth baseball analysis.

On deck this week


Birthdays, baseball, zombies, luxury . . . and fraac!