Ticket to Baaaaarf

A link from the comments here took me to the wonderfully named Barfblog and a report by Don Schaffner on some reporting.

First, the background: A university in England issued a press release saying that “Food picked up just a few seconds after being dropped is less likely to contain bacteria than if it is left for longer periods of time . . . The findings suggest there may be some scientific basis to the ‘5 second rule’ – the urban myth about it being fine to eat food that has only had contact with the floor for five seconds or less. Although people have long followed the 5 second rule, until now it was unclear whether it actually helped.” According to the press release, the study was “undertaken by final year Biology students” and led by a professor of microbiology.

The press release hit the big time, getting picked up by NPR, Slate, Forbes, the Daily News, etc. Some typical headlines:

“5-second rule backed up by science” — Atlanta Journal Constitution

“Eating food off the floor may be OK, scientist says” — CNET

“Scientists confirm dad’s common sense: 5-second rule totally legit”

OK, that last one was from the Christian Science Monitor, a publication that I don’t think anyone will take very seriously when it comes to health issues.

Second, the take-home point from Schaffner:

If you don’t have any pathogens on your kitchen floor, it doesn’t matter how long food sits there. If you do have pathogens on your kitchen floor, you get more of them on wet food than dry food. But in my considered opinion, the five-second rule is nonsense. I’m a scientist, I’ll keep an open mind. I know what some people in my lab will be working on this summer. . . .

Third, the rant from Don Schaffner on barfblog:

I [Schaffner] can tell when something is a big news story.

First, I read about it in my news feed from one or more sources. Second, friends and family send it to me. By these two criteria, the recent news about the five second rule qualifies as a big news story. . . . And it’s a story, or a press release, not a study.

The press release is apparently based on a PowerPoint presentation. The study has not undergone any sort of peer review, as far as I know. Science by press release is something that really bugs me. It’s damned hard to do research. It’s even harder to get that research published in the peer-reviewed literature. And when reputable news outlets publish university press releases without even editing them, that does a disservice to everyone; the readers, the news outlet, and even the university researchers. . . .

A review of the slide set shows a number of problems with the study. The researchers present their data as per cent transfer. As my lab has shown repeatedly, through our own peer-reviewed research, when you study cross-contamination and present the results as percentage transfer, those data are not normally distributed. A logarithmic transformation appears to be suitable for converting percentage transfer data to a normal distribution. This is important because any statistics you do on the results generally assume the data to be normally distributed. If you don’t verify this assumption first, you may conclude things that aren’t true.

The next problem with the study is that the authors appear to have only performed three replicates for most of the conditions studied. Again, as my own peer-reviewed research has shown, the nature of cross-contamination is such that the data are highly variable. In our experience you need 20 to 30 replicates to reasonably truly characterize the variability in logarithmically transformed percent transfer data.

Our research has also shown that the most significant variable influencing cross-contamination appears to be moisture. This is not surprising. Bacteria need moisture to move from one location to another. When conditions are dry, it’s much less likely that a cell will be transferred.

Another problem that peer-reviewers generally pick up, is an awareness (or lack thereof) of knowledge of the pre-existing literature. Research on the five-second rule is not new. I’m aware of at least three groups that have worked in this area. Although it’s not peer-reviewed, the television show MythBusters has considered this issue. Paul Dawson at Clemson has also done research on the five-second rule. Dawson’s research has been peer-reviewed and was published in the Journal of Applied Microbiology. Hans Blaschek and colleagues were, as far as I know, the first lab to ever study this.

When I first read this, I was like, Yeah, you go guy! If only all the journalists did it as well as Mary Beth Breckenridge of the Beacon Journal, in a news article headlined, “Study supports five-second rule, but should you? Probably not”:

A new study appears to validate what every 12-year-old knows: If you drop food on the floor, you have five seconds until it becomes contaminated. Biology students at Aston University in Birmingham, England, tested the time-honored five-second rule and claim to have found some truth to it. The faster you pick food up off the floor, they discovered, the less likely it is to contain bacteria. . . .

But don’t go picking fallen Fritos out of the rug just yet.
The study contradicts findings of earlier research at Clemson University, where scientists tested how fast Salmonella Typhimurium bacteria made their way from flooring surfaces to bologna and bread. It happened instantly, the researchers found.
What’s more, the British study apparently hasn’t been published yet in a scientific journal, noted Jeffrey T. LeJeune, a food safety expert at the Ohio Agricultural Research and Development Center in Wooster Township.
Since the data aren’t available to other researchers, he said, there’s no way to replicate the study or determine whether the results are legitimate. “I would be very skeptically cautious about the results, and even more about the interpretation,” he said. . . .

But then I got a bit worried. What exactly is the take-home message? It can’t just be, “don’t report a study that hasn’t been peer-reviewed,” since (a) even if a study is published in a peer-reviewed journal, it could be crap (recall all those papers published in Psychological Science), and (b) if a topic is sufficiently important, it could well be newsworthy even before the grind of the peer review process.

This particular study does seem shaky, though: a student project that is not backed up by shared data or a preprint. The press release seems a bit irresponsible: “Although people have long followed the 5 second rule, until now it was unclear whether it actually helped,” which implies that now all is clear. But journalists should know better than to trust a press release! Don’t they teach them that on day 1 of journalism school?? The reports typically do express some skepticism, for example the NPR report says, “The team hasn’t published the data yet. So the findings are still preliminary and need to be confirmed” and later on quotes a biologist stating an opposite position. Even so, though, it seems like all these news outlets are taking the press release a bit too uncritically.

Some of this is simple envy: I’d love for my research to be discussed on NPR and I’m sure Don Schaffner wouldn’t mind this sort of exposure either. But it does seem to me that this sort of science-reporting-by-press-release creates the worst sort of incentives for researchers. I don’t blame the university researcher for promoting his students’ project (his quote: “The findings of this study will bring some light relief to those who have been employing the five-second rule for years, despite a general consensus that it is purely a myth”) but I do blame the reporting system for hyping this sort of thing, which seems like the flip side of the notorious proclivity of media organizations for scare stories. (As Jonathan Schoenfeld and John Ioannidis found, it seems like just about everything has been said to cause cancer at one time or another.)

P.S. This all got my attention not because I care about the so-called five-second rule but because I was attracted by the name of the barfblog.

Stan Model of the Week: Hierarchical Modeling of Supernovas

The Stan Model of the Week showcases research using Stan to push the limits of applied statistics.  If you have a model that you would like to submit for a future post then send us an email.

Our inaugural post comes from Nathan Sanders, a graduate student finishing up his thesis on astrophysics at Harvard. Nathan writes,

“Core-collapse supernovae, the luminous explosions of massive stars, exhibit an expansive and meaningful diversity of behavior in their brightness evolution over time (their “light curves”). Our group discovers and monitors these events using the Pan-STARRS1 telescope in Hawaii, and we’ve collected a dataset of about 20,000 individual photometric observations of about 80 Type IIP supernovae, the class my work has focused on. While this dataset provides one of the best available tools to infer the explosion properties of these supernovae, due to the nature of extragalactic astronomy (observing from distances $\gtrsim$ 1 billion light years), these light curves typically have much lower signal-to-noise, poorer sampling, and less complete coverage than we would like.

My goal has been to develop a light curve model, with a physically interpretable parameterization, robust enough to fit the diversity of observed behavior and to extract the most information possible from every light curve in the sample, regardless of data quality or completeness.  Because light curve parameters of individual objects are often not identified by the data, we have adopted a hierarchical model structure.  The intention is to capitalize on partial pooling of information to simultaneously regularize the fits of individual light curves and constrain the population level properties of the light curve sample.  The highly non-linear character of the light curves motivates a full Bayes approach to explore the complex joint structure of the posterior.

Sampling from a ~$10^4$ dimensional, highly correlated joint posterior seemed intimidating to me, but I’m fortunate to have been empowered by having taken Andrew’s course at Harvard, by befriending expert practitioners in this field like Kaisey Mandel and Michael Betancourt, and by using Stan!  For me, perhaps the most attractive feature of Stan is its elegant probabilistic modeling language.  It has allowed us to rapidly develop and test a variety of functional forms for the light curve model and strategies for optimization and regularization of the hierarchical structure.  This would not be useful, of course, without Stan’s efficient implementation of NUTS, although the particular pathologies of our model’s posterior drove us to spend a great deal of time exploring divergence, tree depth saturation, numerical instability, and other problems encountered by the sampler.

Over the course of the project, I learned to pay increasingly close attention to the stepsize, n_treedepth and n_divergent NUTS parameters, and other diagnostic information provided by Stan in order to help debug sampling issues.  Encountering saturation of the treedepth and/or extremely small stepsizes often motivated simplifications of the hierarchical structure in order to reduce the curvature in the posterior.  Divergences during sampling led us to apply stronger prior information on key parameters (particularly those that are exponentiated in the light curve model) in order to avoid numerical overflow on samples drawn from the tails.  Posterior predictive checks have been a constant companion throughout, providing a natural means to visualize the model’s performance against the data to understand where failure modes have been introduced – be it through under- or over-constraining priors, inadequate flexibility in the light curve model form, or convergence failure between chains.”

By modeling the hierarchical structure of the supernova measurements Nathan was able to significantly improve the utilization of the data. For more, see the preprint.

Building and fitting this model proved to be a tremendous learning experience for both Nathan and myself.  We haven’t really seen Stan applied to such deep hierarchical models before, and our first naive implementations proved to be vulnerable to all kinds of pathologies.

A problem early on came in how to model hierarchical dependencies between constrained parameters.  As has become a common theme, the most successful computational strategy is to model the hierarchical dependencies on the unconstrained latent space and transform to the constrained space only when necessary.
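A minimal sketch of that strategy in Python, not taken from Nathan's Stan model: it assumes a single positive-constrained parameter per group, places the normal hierarchical prior on its logarithm, and exponentiates only at the end. The names `draw_group_params`, `mu_log`, and `tau_log` are hypothetical and used only for illustration.

```python
import math
import random

def draw_group_params(n_groups, mu_log, tau_log, rng):
    """Draw positive group-level parameters from a hierarchical prior.

    The hierarchy (normal with mean mu_log and sd tau_log) lives on the
    unconstrained log scale; the constrained values are obtained only
    at the end, by exponentiating.
    """
    latent = [rng.gauss(mu_log, tau_log) for _ in range(n_groups)]
    constrained = [math.exp(z) for z in latent]
    return latent, constrained

rng = random.Random(1)  # fixed seed so the draws are repeatable
latent, constrained = draw_group_params(5, mu_log=0.0, tau_log=0.5, rng=rng)
```

Because the dependence between groups is expressed entirely through `latent`, the positivity constraint never has to be enforced inside the hierarchy itself; it comes for free from the final transformation.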

The biggest issue we came across, however, was the development of a well-behaved hierarchical prior with so many layers.  With multiple layers the parameter variances increase exponentially, and the naive generalization of a one-layer prior induces huge variances on the top-level parameters.  This became especially pathological when those top-level parameters are constrained, since the exponential function is very easy to overflow in floating point.  Ultimately we established the desired variance on the top-level parameters and worked backwards, scaling the deeper priors by the number of groups in the next layer to ensure the desired behavior.

Another great feature of Stan is that the modeling language also serves as a convenient means of sharing models for reproducible science.  Nathan was able to include the full model as an appendix to his paper, which you can find on the arXiv.

Ticket to Baaaath


Ooooooh, I never ever thought I’d have a legitimate excuse to tell this story, and now I do! The story took place many years ago, but first I have to tell you what made me think of it:

Rasmus Bååth posted the following comment last month:

On airplane tickets a Swedish “å” is written as “aa” resulting in Rasmus Baaaath. Once I bought a ticket online and five minutes later a guy from Lufthansa calls me and asks if I misspelled my name…

OK, now here’s my story (which is not nearly as good). A long time ago (but when I was already an adult), I was in England for some reason, and I thought I’d take a day trip from London to Bath. So here I am on line, trying to think of what to say at the ticket counter. I remember that in England, they call Bath, Bahth. So, should I ask for “a ticket to Bahth”? I’m not sure, I’m afraid that it will sound silly, like I’m trying to fake an English accent. So, when I get to the front of the line, I say, hesitantly, “I’d like a ticket to Bath?” (with the American pronunciation). The ticket agent replies, slightly contemptuously: “Oh, you’d like a ticket to Baaaaaaath.” I pay for the ticket, take it, and slink away.

This is, like, my favorite story. Ok, not my favorite favorite story—that’s the time I saw this guy in Harvard Square and the back of his head looked just like Michael Keaton—but, still, it’s one of my best. Among linguistic-themed stories, it’s second only to the “I speak only English” story (see third paragraph here). Also, both of these are what might be called “reverse Feynman stories” in that they make me look like a fool.

On deck this week

Mon: Ticket to Baaaath

Tues: Ticket to Baaaaarf

Wed: Thinking of doing a list experiment? Here’s a list of reasons why you should think again

Thurs: An open site for researchers to post and share papers

Fri: Questions about “Too Good to Be True”

Sat: Sleazy sock puppet can’t stop spamming our discussion of compressed sensing and promoting the work of Xiteng Liu

Sun: White stripes and dead armadillos

Fooled by randomness

From 2006:

Nassim Taleb’s publisher sent me a copy of “Fooled by randomness: the hidden role of chance in life and the markets” to review. It’s an important topic, and the book is written in a charming style; I’ll try to respond in kind, with some miscellaneous comments.

On the cover of the book is a blurb, “Named by Fortune one of the smartest books of all time.” But Taleb instructs us on pages 161-162 to ignore book reviews because of selection bias (the mediocre reviews don’t make it to the book cover).

Books vs. articles

I prefer writing books to writing journal articles because books are written for the reader (and also, in the case of textbooks, for the teacher), whereas articles are written for referees. Taleb definitely seems to be writing to the reader, not the referee. There is risk in book-writing, since in some ways referees are the ideal audience of experts, but I enjoy the freedom in book-writing of being able to say what I really think.

Variation and randomness

Taleb’s general points about variation, randomness, and selection bias will be familiar to statisticians and also to readers of social scientists and biologists such as Niall Ferguson, A.J.P. Taylor, Stephen J. Gould, and Bill James, who have emphasized the roles of contingency and variation in creating the world we see.


On pages xliv-xlv, Taleb compares the “Utopian Vision, associated with Rousseau, Godwin, Condorcet, Thomas Paine, and conventional normative economists,” to the more realistic “Tragic Vision of humankind that believes in the existence of inherent limitations and flaws in the way we think and act,” associated with Karl Popper, Friedrich Hayek and Milton Friedman, Adam Smith, Herbert Simon, Amos Tversky, and others. He writes, “As an empiricist (actually a skeptical empiricist) I despise the moralizers beyond anything on this planet . . .”

Despise “beyond anything on this planet”?? Isn’t this a bit extreme? What about, for example, hit-and-run drivers? I despise them even more.


On page 39, Taleb quotes the maxim, “What is easy to conceive is clear to express / Words to say it would come effortlessly.” This reminds me of the duality in statistics between computation and model fit: better-fitting models tend to be easier to compute, and computational problems often signal modeling problems. (See here for my paper on this topic.)

Turing Test

On page 72, Taleb writes about the Turing test: “A computer can be said to be intelligent if it can (on average) fool a human into mistaking it for another human.” I don’t buy this. At the very least, the computer would have to fool me into thinking it’s another human. I don’t doubt that this can be done (maybe another 5-20 years, I dunno). But I wouldn’t use the “average person” as a judge. Average people can be fooled all the time. If you think I can be fooled easily, don’t use me as a judge, either. Use some experts.

Evaluations based on luck

I’m looking at my notes. Something in Taleb’s book, but I’m not sure what, reminded me of a pitfall in the analysis of algorithms that forecast elections. People have written books about this, “The Keys to the White House,” etc. Anyway, the past 50 years have seen four Presidential elections that have been, essentially (from any forecasting standpoint), ties: 1960, 1968, 1976, 2000. Any forecasting method should get no credit for forecasting the winner in any of these elections, and no blame for getting it wrong. Also in the past 50 years, there have been four Presidential elections that were landslides: 1956, 1964, 1972, 1984. (Perhaps you could also throw 1996 in there; obviously the distinction is not precise.) Any forecasting method better get these right, otherwise it’s not to be taken seriously at all. What is left are 1980, 1988, 1992, 1996, 2004: only 5 actual test cases in 50 years! You have a 1/32 chance of getting them all right by chance. This is not to say that forecasts are meaningless, just that a simple #correct is too crude a summary to be useful.
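The 1/32 figure is just (1/2)^5. As a small check, here is a sketch that treats each of the five test elections as an independent fair coin flip (an assumption made only for this illustration) and computes the chance that a pure guesser gets at least k of them right. The function name `prob_at_least` is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k, n, p=Fraction(1, 2)):
    """Probability of at least k correct calls in n independent guesses,
    each correct with probability p (binomial upper tail)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

all_five = prob_at_least(5, 5)      # Fraction(1, 32)
four_or_more = prob_at_least(4, 5)  # Fraction(3, 16)
```

Under these assumptions a guesser gets at least four of the five right almost one time in five, which is another way of seeing why a raw count of correct calls says so little.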


I once talked with someone who wanted to write a book called Winners, interviewing a bunch of lottery winners. Actually Bruce Sacerdote and others have done statistical studies of lottery winners, using the lottery win as a randomly assigned treatment. But my response was to write a book called Losers, interviewing a bunch of randomly-selected lottery players, almost all of whom, of course, would be net losers.

Finance and hedging

When I was in college I interviewed for a summer job for an insurance company. The interviewer told me that his boss “basically invented hedging.” He also was getting really excited about a scheme for moving profits around between different companies so that none of the money got taxed. It gave me a sour feeling, but in retrospect maybe he was just testing me out to see what my reaction would be.

Forecasts, uncertainty, and motivations

Taleb describes the overconfidence of many “experts.” Some people have a motivation to display certainty. For example, auto mechanics always seemed to me to be 100% sure of their diagnosis (“It’s the electrical system”), then when they were wrong, it never would bother them a bit. Setting aside possible fraudulence, I think they have a motivation to be certain, because we’re unlikely to follow their advice if they qualify it. In the other direction, academics like me perhaps have a motivation to overstate uncertainty, to avoid the potential loss in reputation from saying something stupid. But in practice, people seem to understate our uncertainty most of the time.

Some experts aren’t experts at all. I was once called by a TV network (one of the benefits of living in New York?) to be interviewed about the lottery. I’m no expert; I referred them to Clotfelter and Cook. Other times, I’ve seen statisticians quoted in the paper on subjects they know nothing about. Once, several years ago, a colleague came into my office and asked me what “sampling probability proportional to size” was. It turned out he was doing some consulting for the U.S. government. I was teaching a sampling class at the time, so I could help him out. But it was a little scary that he had been hired as a sampling expert. (And, yes, I’ve seen horrible statistical consulting in the private sector as well.)


A thought-provoking and also fun book. The statistics of low-probability events has long interested me, and the stuff about the financial world was all new to me. The related work of Mandelbrot discusses some of these ideas from a more technical perspective. (I became aware of Mandelbrot’s work on finance through this review by Donald MacKenzie.)


Taleb is speaking this Friday at the Collective Dynamics Seminar.

Update (2014):

I thought Fooled by Randomness made Taleb into a big star, but then his followup effort, The Black Swan, really hit the big time. I reviewed The Black Swan here.

The Collective Dynamics Seminar unfortunately is no more; several years ago, Duncan Watts left Columbia to join Yahoo research (or, as I think he was contractually required to write, Yahoo! research). Now he and his colleagues (who are my collaborators too) work at Microsoft research, still in NYC.

Index or indicator variables

Someone who doesn’t want his name shared (for the perhaps reasonable reason that he’ll “one day not be confused, and would rather my confusion not live on online forever”) writes:

I’m exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally clear grasp on the material, but wanted to be sure I haven’t gone astray.

The problem I’m working on involves a multi-nation survey of students, and I’m especially interested in understanding the effects of country, religion, and sex, and the interactions among those factors (using IRT to estimate individual-level ability, then estimating individual, school, and country effects).

Following the basic approach laid out in chapter 13 for such interactions between levels, I think I need to create a matrix of indicator variables for religion and sex. Elsewhere in the book, you recommend against indicator variables in favor of a single index variable.

Am I right in thinking that this is purely a matter of convenience, and that the matrix formulation of chapter 13 requires indicator variables, but that the matrix of indicators or the vector of indices yield otherwise identical results? I can’t see why they shouldn’t be the same, but my intuition is still developing around multi-level models.

I replied:

Yes, models can be formulated equivalently in terms of index or indicator variables. If a discrete variable can take on a bunch of different possible values (for example, 50 states), it makes sense to use a multilevel model rather than to include indicators as predictors with unmodeled coefficients. If the variable takes on only two or three values, you can still do a multilevel model but really it would be better at that point to use informative priors for any variance parameters. That’s a tactic we do not discuss in our book but which is easy to implement in Stan, and I’m hoping to do more of it in the future.
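A minimal sketch of that equivalence, with made-up numbers: three groups, six observations, and one coefficient per group. The names `group`, `alpha`, and `X` are illustrative only. The index formulation looks up each observation's coefficient; the indicator formulation multiplies a 0/1 matrix by the coefficient vector. Both give the same linear predictor.

```python
# Hypothetical data: 3 groups, 6 observations.
group = [0, 2, 1, 2, 0, 1]   # index variable: one integer per observation
alpha = [1.5, -0.3, 0.8]     # one coefficient per group

# Index formulation: look up each observation's coefficient directly.
pred_index = [alpha[g] for g in group]

# Indicator formulation: build an n x J matrix of 0/1 columns, then multiply.
n_groups = len(alpha)
X = [[1 if g == j else 0 for j in range(n_groups)] for g in group]
pred_indicator = [sum(x_ij * a_j for x_ij, a_j in zip(row, alpha)) for row in X]
```

The index version stores one integer per observation rather than a row of J indicators, which is the convenience argument; the fitted values are the same either way.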

To which my correspondent wrote:

The main difference that occurs to me as I work through implementing this is that the matrix of indicator variables loses information about what the underlying variable was. So, for instance, if the matrix mixes an indicator for sex and n indicators for religion and m indicators for schools, we’d have Sigma_beta be an m+n+1 x m+n+1 matrix, when we really want a 3×3 matrix.

I could set up the basic structure of Sigma_beta, separately estimate the diagonal elements with a series of multilevel loops by sex, religion, and school, and eschew the matrix formulation in the individual model. So instead of y~N(X_iB_j[i],sigma^2_y) it would be (roughly, I’m doing this on my phone):


And the group-level formulation unchanged. Sigma_beta becomes a 3×3 matrix rather than an m+n+1 matrix, which seems both more reasonable and more computationally tractable.

My reply:

Now I’m getting tangled in your notation. I’m not sure what Sigma_beta is.

One-tailed or two-tailed?


Someone writes:

Suppose I have two groups of people, A and B, which differ on some characteristic of interest to me; and for each person I measure a single real-valued quantity X. I have a theory that group A has a higher mean value of X than group B. I test this theory by using a t-test. Am I entitled to use a *one-tailed* t-test? Or should I use a *two-tailed* one (thereby giving a p-value that is twice as large)?

I know you will probably answer: Forget the t-test; you should use Bayesian methods instead.

But what is the standard frequentist answer to this question?

My reply:

The quick answer here is that different people will do different things here. I would say the 2-tailed p-value is more standard but some people will insist on the one-tailed version, and it’s hard to make a big stand on this one, given all the other problems with p-values in practice.
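To see the factor of two concretely, here is a sketch with made-up measurements. To stay within the Python standard library it uses a normal approximation to the reference distribution in place of Student's t, which is a substitution and only reasonable for largish samples; with eight observations per group one would use the t distribution in practice. The function name `approx_pvalues` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def approx_pvalues(a, b):
    """Return (statistic, one-tailed p, two-tailed p) for H1: mean(a) > mean(b).

    Unequal-variance standard error, with a normal approximation
    standing in for the t distribution.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p_one = 1 - NormalDist().cdf(z)             # upper tail only
    p_two = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_one, p_two

# Made-up measurements, purely for illustration.
group_a = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7, 6.4, 6.0]
group_b = [5.0, 5.6, 5.2, 6.1, 5.4, 5.3, 5.9, 5.5]
z, p_one, p_two = approx_pvalues(group_a, group_b)
```

When the observed difference goes in the hypothesized direction, as it does here, the two-tailed value is exactly twice the one-tailed value. When it goes the other way, the one-tailed p-value is above 0.5 and the doubling relationship no longer holds.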

P.S. In the comments, Sameer Gauria summarizes a key point:

It’s inappropriate to view a low P value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis, rather than other, perhaps more scientifically plausible, alternatives.

This is so important. You can take lots and lots of examples (most notably, all those Psychological Science-type papers) with statistically significant p-values, and just say: Sure, the p-value is 0.03 or whatever. I agree that this is evidence against the null hypothesis, which in these settings typically has the following five aspects:
1. The relevant comparison or difference or effect in the population is exactly zero.
2. The sample is representative of the population.
3. The measurement in the data corresponds to the quantities of interest in the population.
4. The researchers looked at exactly one comparison.
5. The data coding and analysis would have been the same had the data been different.
But, as noted above, evidence against the null hypothesis is not, in general, strong evidence in favor of a specific alternative hypothesis, rather than other, perhaps more scientifically plausible, alternatives.

If you get to the point of asking, just do it. But some difficulties do arise . . .

Nelson Villoria writes:

I find the multilevel approach very useful for a problem I am dealing with, and I was wondering whether you could point me to some references about poolability tests for multilevel models. I am working with time series of cross sectional data and I want to test whether the data supports cross sectional and/or time pooling. In a standard panel data setting I do this with Chow tests and/or CUSUM. Are these ideas directly transferable to the multilevel setting?

My reply: I think you should do partial pooling. Once the question arises, just do it. Other models are just special cases. I don’t see the need for any test.

That said, if you do a group-level model, you need to consider including group-level averages of individual predictors (see here). And if the number of groups is small, there can be real gains from using an informative prior distribution on the hierarchical variance parameters. This is something that Jennifer and I do not discuss in our book, unfortunately.
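The sense in which other models are special cases can be sketched with the usual precision-weighted estimate of a group mean in a normal multilevel model, treating the variance parameters as known (a simplification for illustration; in a real fit they are estimated). The function name `partial_pool` and the numbers are hypothetical.

```python
def partial_pool(ybar_j, n_j, sigma_y, mu, tau):
    """Precision-weighted estimate of a group mean.

    ybar_j: sample mean in group j; n_j: group sample size;
    sigma_y: within-group sd; mu, tau: mean and sd of the
    group-level distribution.
    """
    w_data = n_j / sigma_y**2   # precision of the group's own data
    w_prior = 1 / tau**2        # precision of the group-level distribution
    return (w_data * ybar_j + w_prior * mu) / (w_data + w_prior)

# Very large tau: essentially no pooling, the estimate is the group's own mean.
no_pooling = partial_pool(ybar_j=10.0, n_j=5, sigma_y=2.0, mu=4.0, tau=1e6)
# Very small tau: essentially complete pooling, the estimate is the overall mean.
complete_pooling = partial_pool(ybar_j=10.0, n_j=5, sigma_y=2.0, mu=4.0, tau=1e-6)
# Moderate tau: a compromise between the two.
in_between = partial_pool(ybar_j=10.0, n_j=5, sigma_y=2.0, mu=4.0, tau=1.0)
```

A poolability test amounts to choosing between the two extremes; fitting the multilevel model lets the data determine where between them the estimate should fall.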

Looking for Bayesian expertise in India, for the purpose of analysis of sarcoma trials

Prakash Nayak writes:

I work as a musculoskeletal oncologist (surgeon) in Mumbai, India and am keen on sarcoma research.

Sarcomas are rare disorders, and conventional frequentist analysis falls short of providing meaningful results for clinical application.

I am thus keen on applying Bayesian analysis to a lot of trials performed with small numbers in this field.

I need advice from you for a good starting point for someone uninitiated in Bayesian analysis. What to read, what courses to take and is there a way I could collaborate with any local/international statisticians dealing with these methods.

I have attached a recent publication [Optimal timing of pulmonary metastasectomy – is a delayed operation beneficial or counterproductive?, by M. Kruger, J. D. Schmitto, B. Wiegmann, T. K. Rajab, and A. Haverich] which is one amongst others I understand would benefit from some Bayesian analyses.

I have no idea who in India works in this area so I’m just putting this one out there in the hope that someone will be able to make the connection.

When you believe in things that you don’t understand


This would make Karl Popper cry. And, at the very end:

The present results indicate that under certain, theoretically predictable circumstances, female ovulation—long assumed to be hidden—is in fact associated with a distinct, objectively observable behavioral display.

This statement is correct—if you interpret the word “predictable” to mean “predictable after looking at your data.”

P.S. I’d like to say that April 15 is a good day for this posting because your tax dollars went toward supporting this research. But actually it was supported by the Social Sciences Research Council of Canada, and I assume they do their taxes on their own schedule.

P.P.S. In preemptive response to people who think I’m being mean by picking on these researchers, let me just say: Nobody forced them to publish these articles. If you put your ideas out there, you have to be ready for criticism.