The post “Deeper into democracy: the legitimacy of challenging Brexit’s majoritarian mandate” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Also I came across this short post by Rowson on “virtue signaling” which made some reasonable points.

**P.S.** I’d heard of Rowson from reading his book Seven Deadly Chess Sins, which is great (if a bit over my head, chess-wise), and which came out when he was only 23!

The post Zoologist slams statistical significance appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post The graphs tell the story. Now I want to fit this story into a bigger framework so it all makes sense again. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Pretty stunning. I mean, really stunning. Why are we just hearing about this now, given that the pattern is a decade old?

And what’s this: “Data for the U.S. ends in 2007”? Huh?

Also, it’s surprising how high the rates were for Japan, Italy, and Germany in the 1970s. Whassup with that?

The whole thing is tough to understand; I just don’t know how to think about it.

One lesson from all of this, I think, is that our public space (newspapers, TV, etc.) doesn’t have enough experts on demography and public health; thus these sorts of basic statistics come as a surprise to us. Consider: compared to most people, *I’m* an expert on demography and public health, but these graphs came as a complete surprise to me.

In all seriousness, I think our public discussion space needs **fewer doctors and fewer economists**, and **more demographers and public health experts**. I have no problem with doctors and economists, considered individually as experts; there just seems to be an imbalance in aggregate exposure, comparing these professions to others with relevant expertise in the same questions.

The post Let’s face it, I know nothing about spies. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Multiple federal agencies investigated claims that former Indiana basketball coach Bobby Knight groped and verbally sexually harassed several female employees when he gave a speech at the National Geospatial-Intelligence Agency in July 2015, according to newly released documents. . . . he had slapped a “senior woman … on [her] butt” and “fondled a woman’s breast.”

And my first thought is, hey, this guy is what, 75 years old, and he’s groping *spies*?? Isn’t he afraid one of them will give him a karate chop to the head?

I guess I’ve watched too many episodes of The Americans. Real spies, it seems, just let old guys grope them. In real life, it seems that politeness is part of the spy code of conduct. Who knew? I’d’ve thought one of these ladies would’ve decked him.

The post Low power and the replication crisis: What have we learned since 2004 (or 1984, or 1964)? appeared first on Statistical Modeling, Causal Inference, and Social Science.

In this article, Maxwell covers a lot of the material later discussed in the paper Power Failure by Button et al. (2013), and in the 2014 paper on Type M and Type S errors by John Carlin and myself. Maxwell also points out that these alarms were raised repeatedly by earlier writers such as Cohen, Meehl, and Rozeboom, from the 1960s onward.

In this post, I’ll first pull out some quotes from that 2004 paper that presage many of the issues of non-replication that we are still wrestling with today. Then I’ll discuss what’s been happening since 2004: what’s new in our thinking in the past fifteen years.

I’ll argue that, yes, everyone should’ve been listening to Cohen, Meehl, Rozeboom, Maxwell, etc., all along; and also that we have been making some progress, and that we have some new ideas that might help us move forward.

**Part 1: They said it all before**

Here’s a key quote from Maxwell (2004):

When power is low for any specific hypothesis but high for the collection of tests, researchers will usually be able to obtain statistically significant results, but which specific effects are statistically significant will tend to vary greatly from one sample to another, producing a pattern of apparent contradictions in the published literature.

I like this quote, as it goes beyond the usual framing in terms of “false positives” etc., to address the larger goals of a scientific research program.

Maxwell continues:

A researcher adopting such a strategy [focusing on statistically significant patterns in observed data] may have a reasonable probability of discovering apparent justification for recentering his or her article around a new finding. Unfortunately, however, this recentering may simply reflect sampling error . . . this strategy will inevitably produce positively biased estimates of effect sizes, accompanied by apparent 95% confidence intervals whose lower limit may fail to contain the value of the true population parameter 10% to 20% of the time.

He also slams deterministic reasoning:

The presence or absence of asterisks [indicating p-value thresholds] tends to convey an air of finality that an effect exists or does not exist . . .

And he mentions the “decline effect”:

Even a literal replication in a situation such as this would be expected to reveal smaller effect sizes than those originally reported. . . . the magnitude of effect sizes found in attempts to replicate can be much smaller than those originally reported, especially when the original research is based on small samples. . . . these smaller effect sizes might not even appear in the literature because attempts to replicate may result in nonsignificant results.

Classical multiple comparisons corrections won’t save you:

Some traditionalists might suggest that part of the problem . . . reflects capitalization on chance that could be reduced or even eliminated by requiring a statistically significant multivariate test. Figure 3 shows the result of adding this requirement. Although fewer studies will meet this additional criterion, the smaller subset of studies that would now presumably appear in the literature are even more biased . . .

This was a point raised a few years later by Vul et al. in their classic voodoo correlations paper.

Maxwell points out that meta-analysis of published summaries won’t solve the problem:

Including underpowered studies in meta-analyses leads to biased estimates of effect size whenever accessibility of studies depends at least in part on the presence of statistically significant results.

And this:

Unless psychologists begin to incorporate methods for increasing the power of their studies, the published literature is likely to contain a mixture of apparent results buzzing with confusion.

And the incentives:

Not only do underpowered studies lead to a confusing literature but they also create a literature that contains biased estimates of effect sizes. Furthermore . . . researchers may have felt little pressure to increase the power of their studies, because by testing multiple hypotheses, they often assured themselves of a reasonable probability of achieving a goal of obtaining at least one statistically significant result.
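That last incentive is easy to quantify. As a back-of-the-envelope illustration (assuming, unrealistically, independent tests of true null hypotheses at the conventional alpha = 0.05), the chance of getting at least one statistically significant result grows quickly with the number of hypotheses tested:

```python
# Probability of at least one "significant" result among m independent
# tests of true nulls at alpha = 0.05 (an idealized illustration).
alpha = 0.05
for m in [1, 5, 10, 20]:
    print(m, round(1 - (1 - alpha) ** m, 3))
```

With ten hypotheses you already have about a 40% chance of at least one asterisk; with twenty, about 64%.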

And he makes a point that I echoed many years later, regarding the importance of measurement and the naivety of researchers who think that the answer to all problems is to crank up the sample size:

Fortunately, an assumption that the only way to increase power is to increase sample size is almost always wrong. Psychologists are encouraged to familiarize themselves with additional methods for increasing power.

**Part 2: (Some of) what’s new**

So, Maxwell covered most of the ground in 2004. Here are a few things that I would add, from my standpoint nearly fifteen years later:

1. I think the concept of “statistical power” itself is a problem in that it implicitly treats the attainment of statistical significance as a goal. As Button et al. and others have discussed, low-power studies have a winner’s curse aspect, in that if you do a “power = 0.06” study and get lucky and find a statistically significant result, your estimate will be horribly exaggerated and may well be in the wrong direction.

To put it another way, I fear that a typical well-intentioned researcher will want to avoid low-power studies—and, indeed, it’s trivial to talk yourself into thinking your study has high power, by just performing the power analysis using an overestimated effect size from the published literature—but will also think that a low-power study is essentially a roll of the dice. The implicit attitude is that in a study with, say, 10% power, you have a 10% chance of winning. But in such cases, a win is really a loss.
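A quick simulation makes the winner’s curse concrete. Take a hypothetical study with true effect 2 and standard error 8.1 (power of about 0.06), and look only at the “lucky” replications that reach statistical significance:

```python
import random

random.seed(1)
true_effect, se = 2.0, 8.1   # hypothetical low-power study, power ~ 0.06
z_crit = 1.96
n_sims = 200_000

sig = []
for _ in range(n_sims):
    est = random.gauss(true_effect, se)   # one simulated estimate
    if abs(est) > z_crit * se:            # "statistically significant"
        sig.append(est)

power = len(sig) / n_sims
exaggeration = sum(abs(e) for e in sig) / len(sig) / true_effect  # Type M
type_s = sum(e < 0 for e in sig) / len(sig)                       # Type S

print(round(power, 3), round(exaggeration, 1), round(type_s, 2))
```

Conditional on significance, the estimate overstates the true effect by roughly a factor of nine, and has about a one-in-four chance of pointing in the wrong direction: a win is really a loss.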

2. Variation in effects and context dependence. It’s not about identifying whether an effect is “true” or a “false positive.” Rather, let’s accept that in the human sciences there are no true zeroes, and relevant questions include the magnitude of effects, and how and where they vary. What I’m saying is: less “discovery,” more exploration and measurement.

3. Forking paths. If I were to rewrite Maxwell’s article today, I’d emphasize that the concern is not just multiple comparisons that have been performed, but also multiple potential comparisons. A researcher can walk through his or her data and only perform one or two analyses, but these analyses will be contingent on data, so that had the data been different, they would’ve been summarized differently. This allows the probability of finding statistical significance to approach 1, given just about any data (see, most notoriously, this story). In addition, I would emphasize that “researcher degrees of freedom” (in the words of Simmons, Nelson, and Simonsohn, 2011) arise not just in the choice of which of multiple coefficients to test in a regression, but also in which variables and interactions to include in the model, how to code data, and which data to exclude (see my above-linked paper with Loken for several examples).

4. Related to point 2 above is that some effects are really really small. We all know about ESP, but there are also other tiny effects being studied. An extreme example is the literature on sex ratios. At one point in his article, referring to a proposal that psychologists gather data on a sample of a million people, Maxwell writes, “Thankfully, samples this large are unnecessary even to detect minuscule effect sizes.” Actually, if you’re studying variation in the human sex ratio, that’s about the size of sample you’d actually need! For the calculation, see pages 645-646 of this paper.
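As a rough check on the sex-ratio claim, here is the standard normal-approximation sample-size formula for comparing two proportions (the specific effect sizes below are my illustrative assumptions, not Maxwell’s):

```python
import math

def n_per_group(p, diff, z_alpha=1.959964, z_beta=0.841621):
    """Approximate per-group sample size for 80% power at two-sided
    alpha = 0.05, comparing two proportions near p that differ by `diff`
    (normal approximation)."""
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / diff ** 2

# Pr(girl) is near 0.49; a realistic between-group difference is tiny.
print(math.ceil(n_per_group(0.49, 0.005)))  # half a percentage point
print(math.ceil(n_per_group(0.49, 0.001)))  # a tenth of a percentage point
```

Detecting a half-percentage-point difference already requires well over a hundred thousand births per group; a tenth of that difference pushes you into the millions. So yes, for this problem a million-person sample is about what you need.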

5. Flexible theories: The “goal of obtaining at least one statistically significant result” is only relevant because theories are so flexible that just about any comparison can be taken to be consistent with theory. Remember sociologist Jeremy Freese’s characterization of some hypotheses as “more vampirical than empirical—unable to be killed by mere evidence.”

6. Maxwell writes, “it would seem advisable to require that a priori power calculations be performed and reported routinely in empirical research.” Fine, but we can also do design analysis (our preferred replacement term for “power calculations”) *after* the data have come in and the analysis has been published. The purpose of a design calculation is not just to decide whether to do a study or to choose a sample size. It’s also to aid in interpretation of published results.
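Here’s a minimal sketch of such an after-the-fact design analysis, computing power and the Type S error rate in closed form for a postulated true effect A and standard error s (the exaggeration ratio would additionally need the truncated-normal mean, or simulation):

```python
from statistics import NormalDist

def design_analysis(A, s, alpha=0.05):
    """Power and Type S error rate of an estimate with standard error s
    when the true effect is A -- a sketch of the retrodesign idea."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    p_hi = 1 - nd.cdf(z - A / s)   # significant with the correct sign
    p_lo = nd.cdf(-z - A / s)      # significant with the wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power          # Pr(wrong sign | significant)
    return power, type_s

power, type_s = design_analysis(2.0, 8.1)
print(round(power, 3), round(type_s, 2))  # about 0.057 and 0.24
```

The inputs are the published standard error and an externally justified guess at the true effect size, so this can be run on a study long after it appears.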

7. Measurement.

The post Bob likes the big audience appeared first on Statistical Modeling, Causal Inference, and Social Science.

I like the big audience for three reasons related to computer science principles.

The first benefit is the same reason it’s scary. The big audience is likely to find flaws. And point them out. In public! The trick is to get over feeling bad about it and realize that it’s a super powerful debugging tool for ideas. Owning up to being wrong in public is also very liberating. Turns out people don’t hold it against you at all (well, maybe they would if you were persistently and unapologetically wrong). It also provides a great teaching opportunity—if a postdoc is confused about something in their speciality, chances are that a lot of others are confused, too.

In programming, the principle is that you want routines to fail early. You want to inspect user input and, if there’s a fatal problem with it that can be detected, fail right away and let the user know what the error is. Don’t fire up the algorithm and report some deeply nested error in a Fortran matrix algorithm. Something not being shot down on the blog is like passing that validation. It gives you confidence going forward.
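A minimal illustration of the fail-early principle (the function name and checks are hypothetical, just for illustration):

```python
def fit_model(data):
    """Validate input up front and report a clear error, instead of
    letting a bad input fail deep inside the numerical routine."""
    if not data:
        raise ValueError("fit_model: 'data' is empty")
    bad = [i for i, x in enumerate(data) if not isinstance(x, (int, float))]
    if bad:
        raise ValueError(f"fit_model: non-numeric entries at positions {bad}")
    # Only now hand off to the (hypothetical) numerical routine.
    return sum(data) / len(data)
```

The caller immediately learns *what* was wrong with the input, rather than getting a cryptic traceback from three libraries down.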

The second benefit is the same as in any writing, only the stakes are higher with the big audience. When you write for someone else, you’re much more self critical. The very act of writing can uncover problems or holes in your reasoning. I’ve started several blog posts and papers and realized at some point as I fleshed out an argument that I was missing a fundamental point.

There’s a principle in computer science called the rubber ducky.

One of the best ways to debug is to have a colleague sit down and let you explain your bug to them. Very often halfway through the explanation you find your own bug and the colleague never even understands the problem. The blog audience is your rubber ducky.

The term itself is a misnomer in that it only really works if the rubber ducky could understand what you’re saying. They don’t need to actually understand it, just be capable of understanding it. Like the no-free-lunch principle, there are no free pair programmers.

The third huge benefit is that other people have complementary skills and knowledge. They point out connections and provide hints that can prove invaluable. We found out about automatic differentiation through a comment on the blog to a post where I was speculating about how we could calculate gradients of log densities in C++.

I guess there’s a computer science principle there, too—modularity. You can bring in whole modules of knowledge, like we did with autodiff.

I agree. It’s all about costs and benefits. The cost of an error is low if discovered early. You want to stress test, not to hide your errors and wait for them to be discovered later.

The post Of rabbits and cannons appeared first on Statistical Modeling, Causal Inference, and Social Science.

I was reminded of this question recently when I happened to come across this exchange in the comments section from a couple years ago, in the context of finding patterns in the frequencies of births on different days:

**Rahul:** Yes, inverting a million element matrix for this sort of problem does have the feel of killing mice with cannon.

**Andrew:** In many areas of research, you start with the cannon. Once the mouse is dead and you can look at it carefully from all angles, you can design an effective mousetrap. Red State Blue State went the same way: we found the big pattern only after fitting a multilevel model, but once we knew what we were looking for, it was possible to see it in the raw data.

The post The curse of dimensionality and finite vs. asymptotic convergence results appeared first on Statistical Modeling, Causal Inference, and Social Science.

In importance sampling, one working solution in low dimensions is to use a mixture of two proposals. One component tries to match the mode, and the other ensures that the tails of the mixture go down more slowly than the tails of the target, so that the importance ratios are bounded. In the following I look only at the behavior with a single component which has a thicker tail than the target, so that the importance ratios are bounded (but I have run the corresponding experiment with a mixture, too).

The target distribution is a multivariate normal with zero mean and unit covariance matrix. In the first case the proposal distribution is also normal, but with scale 1.1 in each dimension. The proposal’s scale is just slightly larger than the target’s, and we are often lucky if we can guess the scale of the proposal with 10% accuracy. I take 100,000 draws from the proposal distribution.

The following figure shows what happens as the number of dimensions goes from 1 to 1024.

The upper subplot shows the estimated effective sample size. By D=512, the importance-weighted 100,000 draws have only a few practically non-zero weights. The middle subplot shows the convergence rate compared to independent sampling, i.e., how fast the variance goes down. By D=1024 the convergence rate has dropped dramatically, and getting any improvement in accuracy requires more and more draws. The bottom subplot shows the Pareto khat diagnostic (see the paper for details). The dashed line is k=0.5, the limit for the variance being finite, and the dotted line is our suggested limit for practically useful performance when using PSIS. But how can khat be larger than 0.5 when the weights are bounded? The central limit theorem has not failed here; we just have not yet reached the asymptotic regime where the CLT kicks in!
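This experiment is easy to reproduce. Here is a small sketch (with 5,000 draws rather than 100,000, to keep it light): the effective sample size of the importance weights collapses as the dimension grows, even though the ratios are bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D_max, scale = 5_000, 1024, 1.1    # draws, max dimension, proposal sd

x = rng.normal(0.0, scale, size=(S, D_max))     # draws from the proposal
# Per-dimension log importance ratio: log N(x; 0, 1) - log N(x; 0, 1.1)
c = -0.5 * x**2 + 0.5 * (x / scale)**2 + np.log(scale)
logw = np.cumsum(c, axis=1)                     # log weights for D = 1..1024

def ess(lw):
    """Effective sample size (sum w)^2 / sum w^2, stably in log space."""
    w = np.exp(lw - lw.max())
    return w.sum()**2 / (w**2).sum()

for D in [1, 16, 128, 512, 1024]:
    print(D, round(ess(logw[:, D - 1]), 1))
```

At D=1 essentially all draws contribute; by D=1024 the effective sample size is down to a handful of draws, matching the figure.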

The next plot shows in more detail what happens with D=1024.

Since humans are lousy at looking at 1024-dimensional plots, the top subplot shows the one-dimensional marginal densities of the target (blue) and the proposal (red) as a function of the distance from the origin, r = sqrt(sum_{d=1}^D x_d^2). The proposal density has a scale only 1.1 times that of the target, but most of the draws from the proposal are far from the typical set of the target! The vertical dashed line shows the 1e-6 quantile of the proposal; i.e., when we take 100,000 draws, 90% of the time we don’t get any draws below it. The middle subplot shows the importance ratio function: the highest value is at 0, but that value is larger than 2*10^42. That’s a big number. The bottom subplot rescales the y-axis so that we can see the importance ratios near that 1e-6 quantile. Check the y-axis: it still runs from 0 to 1e6. So if we are lucky we may get a draw below the dashed line, but that draw is then likely to get all the weight. The importance ratio function is practically this steep everywhere we could hope to get draws within the age of the universe. The 1e-80 quantile is at 21.5 (1e80 is the estimated number of atoms in the visible universe), and it is still far away from the region where the boundedness of the importance ratio function starts to matter.

I have more similar plots with thick-tailed Student-t proposals, mixtures of proposals, etc., but I’ll spare you the extra plots. As long as there is some difference between the target and the proposal, taking the number of dimensions high enough breaks IS and PSIS (though PSIS gives a slight improvement in performance and, more importantly, can diagnose the problem and improve the Monte Carlo estimate).

In addition to recognizing that many methods which work in low dimensions can break in high dimensions, we need to focus more on finite-sample performance. As seen here, it doesn’t help that the CLT holds if we can never reach the asymptotic regime (just as the Metropolis algorithm in high dimensions may require close to infinite time to produce useful results). The Pareto diagnostic has been shown empirically to provide very good finite-sample convergence-rate estimates, which also match some theoretical bounds.

**PS.** There has been a lot of discussion in the comments about the typical set vs. the high-probability set. In the end Carlos Ungil wrote

Your blog post seems to work equally well if you change

“most of the draws from the proposal are away from the typical set of the target!”

to

“most of the draws from the proposal are away from the high probability region of the target!”

I disagree with this, and to show evidence I add here a couple more plots I didn’t include before (I hope that this blog post will not end up with as many plots as the PSIS paper).

If the target is multivariate normal, we get bounded weights by using a Student-t distribution with a finite number of degrees of freedom nu, even if the scale of the Student-t is smaller than the scale of the target distribution. In the next example the target is the same as above, and the proposal distribution is multivariate Student-t with nu=7 degrees of freedom and variance 1.

The following figure shows what happens as the number of dimensions goes from 1 to 1024.

The upper subplot shows the estimated effective sample size. By D=64, the importance-weighted 100,000 draws have only a few practically non-zero weights. The middle subplot shows the convergence rate compared to independent sampling, i.e., how fast the variance goes down. By D=256 the convergence rate has dropped dramatically, and getting any improvement in accuracy requires more and more draws. The bottom subplot shows the Pareto khat diagnostic, which predicts the finite-sample convergence rate well (even though asymptotically, with bounded ratios, k<1/2).

The next plot shows in more detail what happens with D=512.

The top subplot shows the one-dimensional marginal densities of the target (blue) and the proposal (red) as a function of the distance from the origin, r = sqrt(sum_{d=1}^D x_d^2). The proposal density has the same variance but a thicker tail. Most of the draws from the proposal are away from the typical set of the target, in this case towards higher densities than the density in the typical set. The middle subplot shows the importance ratio function: the highest value is close to 47.5, after which the ratio function starts to decrease again. The highest value of the ratio function is larger than 10^158. The region with the highest values is far away from the typical set of the proposal distribution. The bottom subplot rescales the y-axis so that we can see the importance ratios near the proposal and target distributions. The importance ratio goes from very small values up to 10^30 within a very narrow range, so the largest draw from the proposal is likely to get all the weight. It’s unlikely that in practical time we would get enough draws for the asymptotic benefits of bounded ratios to kick in.

**PPS.** The purpose of this post was to illustrate that bounded ratios and asymptotic convergence results are not enough for practically useful IS performance. But there are also special cases where, due to special structure of the posterior, we can get practically good performance with IS, and especially with PSIS, even in high-dimensional cases (the PSIS paper has a 400-dimensional example with khat<0.7).

The post “Write No Matter What” . . . about what? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Her advice might well be reasonable—it’s hard for me to judge; as someone who blogs a few hundred times a year, I’m not really part of Jensen’s target audience. She offers “a variety of techniques to help . . . reduce writing anxiety; secure writing time, space and energy; recognize and overcome writing myths; and maintain writing momentum.” She recommends “spending at least 15 minutes a day in contact with your writing project . . . writing groups, focusing on accountability (not content critiques), are great ways to maintain weekly writing time commitments.”

Writing is non-algorithmic, and I’ve pushed hard against advice-givers who don’t seem to get that. So, based on this quick interview, my impression is that Jensen’s on the right track.

I’d just like to add one thing: If you want to write, it helps to have something to write *about*. Even when I have something I really want to say, writing can be hard. I can only imagine how hard it would be if I was just trying to write, to produce, without something I felt it was important to share with the world.

So, when writing, imagine your audience, and ask yourself why they should care. Tell ’em what they don’t know.

Also, when you’re writing, be aware of your audience’s expectations. You can satisfy their expectations or confound their expectations, but it’s good to have a sense of what you’re doing.

And here’s some specific advice about academic writing, from a few years ago.

**P.S.** In that same post, Cowen also links to a bizarre book review by Edward Luttwak who, among other things, refers to “George Pataki of New York, whose own executive experience as the State governor ranged from the supervision of the New York City subways to the discretionary command of considerable army, air force and naval national guard forces.” The New York Air National Guard, huh? I hate to see the Times Literary Supplement fall for this sort of pontificating. I guess that there will always be a market for authoritative-sounding pundits. But Tyler Cowen should know better. Maybe it was the New York thing that faked him out. If Luttwak had been singing the strategic praises of the New Jersey Air National Guard, that might’ve set off Cowen’s B.S. meter.

The post How to think about the risks from low doses of radon appeared first on Statistical Modeling, Causal Inference, and Social Science.

First I wrote:

Low dose risk is inherently difficult to estimate using epidemiological studies. I’ve seen no evidence that risk is not linear at low dose, and there is evidence that areas with high radon levels have elevated levels of lung cancer. When it comes to resource allocation, we recommend that measurement and remediation be done in areas of high average radon levels but not at a national level; see here and here and, for a more technical treatment, here.

Regarding the question of “If the concerns about the linear no-threshold model for radiation risk are based on valid science, why don’t public health agencies like the EPA take them seriously?” I have no idea what goes on within the EPA, but when it comes to radon remediation, the effects of low dose exposure aren’t so relevant to the decision: if your radon level is low (as it is in most homes in the U.S.) you don’t need to do anything anyway; if your radon level is high, you’ll want to remediate; if you don’t know your radon level but it has a good chance of being high, you should get an accurate measurement and then make your decision.

For homes with high radon levels, radon is a “dangerous, proven harm,” and we recommend remediation. For homes with low radon levels, it might or might not be worth your money to remediate; that’s an individual decision based on your view of the risk and how much you can afford the $2000 or whatever to remediate.
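The advice above amounts to a simple decision rule, which can be written down as a toy sketch (the 4 pCi/L threshold is the EPA’s well-known action level; the wording and the treatment of the “unknown” case are just my illustrative assumptions, not official guidance):

```python
def radon_advice(level_pci_l=None, likely_high=False, action_level=4.0):
    """Toy sketch of the decision logic above; not official guidance."""
    if level_pci_l is None:
        # Unknown level: measure only if there's a good chance it's high.
        return "measure" if likely_high else "no action needed"
    if level_pci_l >= action_level:
        return "remediate"
    # Below the action level it's an individual cost-benefit decision.
    return "individual choice: weigh risk against remediation cost"

print(radon_advice(20.0))
print(radon_advice(None, likely_high=True))
```

The point of writing it out is that low-dose risk barely enters: the rule is driven by the measured (or likely) level, not by the shape of the dose-response curve near zero.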

Then Phil followed up:

The idea of hormesis [the theory that low doses of radiation can be beneficial to your health] is not quackery. Nor is LNT [the linear no-threshold model of radiation risk].

I will elaborate.

The theory behind LNT isn’t just ‘we have to assume something’, nor ‘everything is linear to first order’. The idea is that twice as much radiation means twice as many cells with damaged DNA, and if each cell with damaged DNA has a certain chance of initiating a cancer, then ceteris paribus you have LNT. That’s not crazy.

The theory behind hormesis is that your body has mechanisms for dealing with cancerous cells, and that perhaps these mechanisms become more active or more effective when there is more damage. That’s not crazy either.

Perhaps exposure to a little bit of radiation isn’t bad for you at all. Perhaps it’s even good for you. Perhaps it’s just barely bad for you, but then when you’re exposed to more, you overwhelm the repair/rejection mechanisms and at some point just a little bit more adds a great deal of risk. This goes for smoking, too: maybe smoking 1/4 cigarette per day would be good for you. For radiation there are various physiological models, and there are enough adjustable parameters to get just about any behavior out of the models I have seen.
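To see how a repair term can produce almost any low-dose behavior while agreeing with LNT at higher doses, here is a toy parameterization (the functional form and all parameter values are purely illustrative, not fitted physiological models):

```python
import math

def risk_lnt(d, b=0.01):
    """Linear no-threshold: excess risk proportional to dose."""
    return b * d

def risk_hormetic(d, b=0.01, c=0.02, d0=1.0):
    """Toy hormetic curve: a repair term dominates at low dose, then is
    overwhelmed, so the curve rejoins LNT at high dose."""
    return b * d - c * d * math.exp(-d / d0)

for d in [0.25, 0.5, 2.0, 10.0]:
    print(d, round(risk_lnt(d), 4), round(risk_hormetic(d), 4))
```

With these made-up parameters the “hormetic” excess risk is negative (protective) at low doses but indistinguishable from LNT at high doses, which is exactly why high-dose data can’t settle the low-dose question.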

Of course what is needed is actual data. Data can be in vitro or in vivo; population-wide or case-control; etc.

There’s fairly persuasive evidence that the dose-response relationship is significantly nonlinear at low doses for “low linear-energy-transfer radiation,” aka low-LET radiation, such as x-rays. I don’t know whether the EPA still uses an LNT model for low-LET radiation.

But for high-LET radiation, including the alpha radiation emitted by radon and most of its decay products of concern, I don’t know much about the dose-response relationship at low doses, and I’m very skeptical of anyone who says they do know. There are some pretty basic reasons to expect low-LET and high-LET radiation to have very different effects. Perhaps I need to explain just a bit. For a given amount of energy that is deposited in tissue, low-LET radiation causes a small disruption to a lot of cells, whereas high-LET radiation delivers a huge wallop to relatively few cells.

An obvious thing to do is to look at people who have been exposed to high levels of radon and its decay products. As you probably know, it is really radon’s decay products that are dangerous, not radon itself. When we talk about radon risk, we really mean the risk from radon and its decay products.

At high concentrations, such as those found in uranium mines, it is very clear that radiation is dangerous, and that the more you are exposed to the higher your risk of cancer. I don’t think anyone would argue against the assertion that an indoor radon concentration of, say, 20 pCi/L leads to a substantially increased risk of lung cancer. And there are houses with living area concentrations that high, although not many.

A complication is that the radon risk for smokers seems to be much higher than for non-smokers. That is, a smoker exposed to 20 pCi/L for ten hours per day for several years is at much higher risk than a non-smoker with the same level of exposure.

But what about 10, 4, 2, or 1 pCi/L? No one really knows.

One thing people have done (notably Bernard Cohen, who you’ve probably come across) is to look at the average lung cancer rate by county, as a function of the average indoor radon concentration by county. If you do that, you find that low-radon counties actually have lower lung cancer rates than high-radon counties. But: a disproportionate fraction of low-radon counties are in the South, and that’s also where smoking rates are highest. It’s hard to completely control for the effect of smoking in that kind of study, but you can do things like look within individual states or regions (for instance, look at the relationship between average county radon concentrations and average lung cancer rates in just the northeast) and you still find a slight effect of higher radon being associated with lower lung cancer rates. If taken at face value, this would suggest that a living-area concentration of 1 pCi/L or maybe even 2 pCi/L would be better than 0. But few counties have annual-average living-area radon concentration over about 2 pCi/L, and of course any individual county has a range of radon levels. Plus people move around, both within and between counties, so you don’t know the lifetime exposure of anyone. Putting it all together, even if there aren’t important confounding variables or other issues, these studies would suggest a protective effect at low radon levels but they don’t tell you anything about the risk at 10 pCi/L or 4 pCi/L.
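The confounding problem here is easy to demonstrate with fake data: if county smoking rates are negatively correlated with radon, a naive ecological regression can show a “protective” association even when the true radon effect is harmful. A sketch (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500                                       # hypothetical counties
radon = rng.uniform(0.5, 2.5, n)              # mean indoor radon, pCi/L
# Confounder: in this toy setup, high-smoking counties tend to be low-radon.
smoking = 0.35 - 0.08 * radon + rng.normal(0, 0.01, n)
# Hypothetical lung-cancer rate: driven mostly by smoking, weakly by radon,
# with a TRUE radon coefficient of +1.0 (harmful).
cancer = 10 + 200 * smoking + 1.0 * radon + rng.normal(0, 0.5, n)

# Naive ecological regression: cancer rate on radon only.
b_naive = np.polyfit(radon, cancer, 1)[0]
# Regression adjusting for the smoking rate.
X = np.column_stack([np.ones(n), radon, smoking])
b_adj = np.linalg.lstsq(X, cancer, rcond=None)[0][1]

print(round(b_naive, 1), round(b_adj, 1))
```

The naive slope comes out strongly negative (radon looks protective) while the adjusted coefficient recovers the harmful effect, which is why the Cohen-style county comparisons are so hard to interpret.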

There’s another class of studies, case-control studies, in which people with lung cancer are compared statistically to those without. In this country, arguably the best of these looked at women in Iowa. (You may have come across this work, led by Bill Field.) Iowa has a lot of farm women who don’t smoke and who have lived in just a few houses for their whole lives. Some of these women contracted lung cancer. The study made radon measurements in these houses, and in the houses of women of similar demographics who didn’t get lung cancer. It found increased risk at 4 pCi/L (even for nonsmokers, as I recall), and the results are certainly inconsistent with a protective effect at 4 pCi/L. As I recall — you should check — they also found a positive estimated risk at 2 pCi/L that is consistent with LNT but also statistically consistent with zero effect.

So, putting it all together, what do we have? I, at least, am convinced that increased exposure leads to increased risk for concentrations above 4 pCi/L. There’s some shaky empirical evidence for a weak protective effect at 2 pCi/L compared to 0 pCi/L. In between, it’s hard to say. All of the evidence below about 8 or 10 pCi/L is pretty shaky, due to low expected risk, methodological problems with the studies, etc.

My informed belief is this: just as I wouldn’t suggest smoking a little bit of tobacco every day in the hope of a hormetic effect, I wouldn’t recommend a bit of exposure to high-LET radiation every day. It’s not that it couldn’t possibly be protective, but I wouldn’t bet on it. And I’m pretty sure the EPA’s recommended ‘action level’ of 4 pCi/L is indeed risky compared to lower concentrations, especially for smokers. As a nonsmoker I wouldn’t necessarily remediate if my home were at 4 pCi/L, but I would at least consider it.

For low-LET radiation, I think the scientific evidence weighs against LNT. If public health agencies don’t take LNT seriously for this type of radiation, it’s possible that they will acknowledge this.

For high-LET radiation, such as alpha particles from radon decay products, there’s more a priori reason to believe LNT would be a good model, and less empirical evidence suggesting that it is a bad model. It might be hard for the agencies to explicitly disavow LNT in these circumstances. At the same time, there’s not compelling evidence in favor of LNT even for this type of radiation, and life is a lot simpler if you don’t take LNT ‘seriously’.

“Service” is one of my duties as a professor—the three parts of this job are teaching, research, and service—and, I guess, in general, those of us who’ve had the benefit of higher education have some sort of duty to share our knowledge when possible. So I have no problem answering reporters’ questions. But reporters have space constraints: you can send a reporter a long email or talk on the phone for an hour, and you’ll be lucky if one sentence of your hard-earned wisdom makes its way into the news article. So much effort all gone! It’s good to be able to post here and reach some more people.

The post How to think about the risks from low doses of radon appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Research project in London and Chicago to develop and fit hierarchical models for development economics in Stan! appeared first on Statistical Modeling, Causal Inference, and Social Science.

We are seeking a Research Assistant skilled in R programming and the production of R packages.

The successful applicant will have experience creating R packages accessible on github or CRAN, and ideally will have experience working with Rstan. The main goal of this position is to create R packages to allow users to run Rstan models for Bayesian hierarchical evidence aggregation, including models on CDFs, sets of quantiles, and tailored parametric aggregation models as in Meager (2016) (https://bitbucket.org/rmeager/aggregating-distributional-treatment-effects/src). Ideally the resulting package will keep RStan “under the hood” with minimal user input, as in rstanarm. Further work on this project is likely to involve programming functions to allow users to run models to predict individual heterogeneity in treatment effects conditional on covariates in a hierarchical setting, potentially using ensemble methods such as Bayesian additive regression trees. Part of this work may involve developing and testing new statistical methodology. We aim to create comprehensively-tested packages that alert users when the underlying routines may have performed poorly. The application of our project is situated in development economics with a focus on understanding the potentially heterogeneous effects of the BRAC Graduation program to alleviate poverty.

The ideal candidate will have the right to work in the UK, and be able to make weekly trips to London to meet with the research team. However, a remote arrangement may be possible for the right candidate, and for those without the right to work in the UK, hiring can be done through Northwestern University. The start date is flexible, but we aim to hire by the middle of March 2018.

The ideal candidate would commit 20-30 hours a week on a contracting basis, although the exact commitment is flexible. Pay rate is negotiable and commensurate with academic research positions and the candidate’s experience. Formally, the position is on a casual basis, but our working arrangement is also flexible with the option to work a number of hours corresponding to full-time, part time or casual. A 6-12 month commitment is ideal, with the option to extend pending satisfactory performance and funding availability, but a shorter commitment can be negotiated. Applications will be evaluated beginning immediately until the position is filled.

Please send your applications via email, attaching your CV and the links to any relevant packages or repositories, with the subject line “R programmer job” to rachael.meager@gmail.com and cc sstephen@poverty-action.org and ifalomir@poverty-action.org.


The post Use multilevel modeling to correct for the “winner’s curse” arising from selection of successful experimental results appeared first on Statistical Modeling, Causal Inference, and Social Science.

I came across this blog by Milan Shen recently and thought you might find it interesting.

A couple of things jumped out at me. It seemed like the so-called ‘Winner’s Curse’ is just another way of describing the statistical significance filter. It also doesn’t look like their correction method is very effective. I’d be very curious to hear your take on this work, especially this idea of a ‘Winner’s Curse’. I suspect the airbnb team could benefit from reading some of your work when it comes to dealing with these problems!

My reply: Yes, indeed I’ve used the term “winner’s curse” in this context. Others have used the term too.

Here’s a paper discussing the bias.

I think the right thing to do is fit a multilevel model.
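To make the point concrete, here’s a minimal sketch of the idea (all numbers invented, and a closed-form empirical-Bayes shrinkage estimator for the normal-normal model stands in for a fully Bayesian multilevel fit): estimates selected for statistical significance overstate their true effects, and partial pooling toward the grand mean removes most of that bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n, se = 10_000, 1.0  # many experiments, each estimate with known standard error

# Invented world: true effects are mostly small.
tau = 0.5
theta = rng.normal(0.0, tau, n)         # true effect of each experiment
y = theta + rng.normal(0.0, se, n)      # noisy unbiased estimates

# The statistical significance filter: report only the z > 1.96 "winners".
winners = y > 1.96 * se

# Partial pooling via empirical Bayes for the normal-normal model:
# estimate the population of effects from ALL experiments, then shrink
# every estimate toward the grand mean.
mu_hat = y.mean()
tau2_hat = max(y.var() - se**2, 1e-8)       # method-of-moments estimate of tau^2
shrink = tau2_hat / (tau2_hat + se**2)
theta_hat = mu_hat + shrink * (y - mu_hat)  # posterior means

bias_raw = (y[winners] - theta[winners]).mean()
bias_pooled = (theta_hat[winners] - theta[winners]).mean()
print(f"winners' raw estimates overstate true effects by {bias_raw:.2f}")
print(f"partially pooled estimates overstate by          {bias_pooled:.2f}")
```

The key design point is that the shrinkage factor is estimated from all the experiments, not just the winners, so the selection step no longer biases the reported effects.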


The post The Lab Where It Happens appeared first on Statistical Modeling, Causal Inference, and Social Science.

*“The conclusions of both studies and the conclusions of the paper remain strong after correcting for these errors.”*

— Brian Wansink, David R. Just, Collin R. Payne, Matthew Z. Klinger, Preventive Medicine (2018).

This reminds me of a song . . .

Ah, Mister Editor

Mister Prof, sir

Did’ya hear the news about good old Professor Stapel

No

You know Lacour Street

Yeah

They renamed it after him, the Stapel legacy is secure

Sure

And all he had to do was lie

That’s a lot less work

We oughta give it a try

Ha

Now how’re you gonna get your experiment through

I guess I’m gonna fin’ly have to listen to you

Really

Measure less, claim more

Ha

Do whatever it takes to get my manuscript on the floor

Now, Reviewers 1 and 2 are merciless

Well, hate the data, love the finding

Food and Brand

I’m sorry Prof, I’ve gotta go

But

Decisions are happening over dinner

Two professors and an immigrant walk into a restaurant

Diametric’ly opposed, foes

They emerge with a compromise, having opened doors that were

Previously closed

Bros

The immigrant emerges with unprecedented citation power

A system he can shape however he wants

The professors emerge in the university

And here’s the pièce de résistance

No one else was in

The lab where it happened

The lab where it happened

The lab where it happened

No one else was in

The lab where it happened

The lab where it happened

The lab where it happened

No one really knows how the game is played

The art of the trade

How the sausage gets made

We just assume that it happens

But no one else is in

The lab where it happens

Prof claims

He was in Washington offices one day

In distress ‘n disarray

The Uni claims

His students said

I’ve nowhere else to turn

And basic’ly begged me to join the fray

Student claims

I approached the P.I. and said

I know you have the data, but I’ll tell you what to say

Professor claims

Well, I arranged the meeting

I arranged the menu, the venue, the seating

But

No one else was in

The lab where it happened

The lab where it happened

The lab where it happened

No one else was in

The lab where it happened

The lab where it happened

The lab where it happened

No one really knows how the

Journals get to yes

The pieces that are sacrificed in

Ev’ry game of chess

We just assume that it happens

But no one else is in

The room where it happens

Meanwhile

Scientists are grappling with the fact that not ev’ry issue can be settled by committee

Meanwhile

Journal is fighting over where to put the retraction

It isn’t pretty

Then pizza-man approaches with a dinner and invite

And postdoc responds with well-trained insight

Maybe we can solve one problem with another and win a victory for the researchers, in other words

Oh ho

A quid pro quo

I suppose

Wouldn’t you like to work a little closer to home

Actually, I would

Well, I propose the lunchroom

And you’ll provide him his grants

Well, we’ll see how it goes

Let’s go

No

One else was in

The lab where it happened

The lab where it happened

The lab where it happened

No one else was in

The lab where it happened

The lab where it happened

The lab where it happened

My data

In data we trust

But we’ll never really know what got discussed

Click-boom then it happened

And no one else was in the room where it happened

Professor of nutrition

What did they say to you to get you to sell your theory down the river

Professor of nutrition

Did the editor know about the dinner

Was there citation index pressure to deliver

All the coauthors

Or did you know, even then, it doesn’t matter

Who ate the carrots

‘Cause we’ll have the journals

We’re in the same spot

You got more than you gave

And I wanted what I got

When you got skin in the game, you stay in the game

But you don’t get a win unless you play in the game

Oh, you get love for it, you get hate for it

You get nothing if you

Wait for it, wait for it, wait

God help and forgive me

I wanna build

Something that’s gonna

Outlive me

What do you want, Prof

What do you want, Prof

If you stand for nothing

Prof, then what do you fall for

I

Wanna be in

The lab where it happens

The lab where it happens

I

Wanna be in

The lab where it happens

The lab where it happens

I

Wanna be

In the lab where it happens

I

I wanna be in the lab

Oh

Oh

I wanna be in

The lab where it happens

The lab where it happens

The lab where it happens

I wanna be in the lab

Where it happens

The lab where it happens

The lab where it happens

The art of the compromise

Hold your nose and close your eyes

We want our leaders to save the day

But we don’t get a say in what they trade away

We dream of a brand new start

But we dream in the dark for the most part

Dark as a scab where it happens

I’ve got to be in

The lab (where it happens)

I’ve got to be (the lab where it happens)

I’ve got to be (the lab where it happens)

Oh, I’ve got to be in

The lab where it happens

I’ve got to be, I’ve gotta be, I’ve gotta be

In the lab

Click-bab

(Apologies to Lin-Manuel Miranda. Any resemblance to persons living or dead is entirely coincidental.)

**P.S.** Yes, these stories are funny—the missing carrots and all the rest—but they’re also just so sad, to think that this is what our scientific establishment has come to. I take no joy from these events. We laugh because, after a while, we get tired of screaming.

I just wish Veronica Geng were still around to write about these hilarious/horrible stories. I just can’t do them justice.


The post One data pattern, many interpretations appeared first on Statistical Modeling, Causal Inference, and Social Science.

They come to a rather absurd conclusion, in my opinion, which is that neuroticism is protective if, and only if, you say you are in bad health, overlooking the possibility that neuroticism instead makes you pessimistic when describing your health.

Here’s the abstract of the article, by Catharine Gale, Iva Cukic, G. David Batty, Andrew McIntosh, Alexander Weiss, and Ian Deary:

We examined the association between neuroticism and mortality in a sample of 321,456 people from UK Biobank and explored the influence of self-rated health on this relationship. After adjustment for age and sex, a 1-SD increment in neuroticism was associated with a 6% increase in all-cause mortality (hazard ratio = 1.06, 95% confidence interval = [1.03, 1.09]). After adjustment for other covariates, and, in particular, self-rated health, higher neuroticism was associated with an 8% reduction in all-cause mortality (hazard ratio = 0.92, 95% confidence interval = [0.89, 0.95]), as well as with reductions in mortality from cancer, cardiovascular disease, and respiratory disease, but not external causes. Further analyses revealed that higher neuroticism was associated with lower mortality only in those people with fair or poor self-rated health, and that higher scores on a facet of neuroticism related to worry and vulnerability were associated with lower mortality. Research into associations between personality facets and mortality may elucidate mechanisms underlying neuroticism’s covert protection against death.

The abstract is admirably modest in its claims; still, Pittelli’s criticism seems reasonable to me. I’m generally suspicious of reading too much into this sort of interaction in observational data. The trouble is that there are so many possible theories floating around, so many ways of explaining a pattern in data. I think it’s a good thing that the Gale et al. paper was published: they found a pattern in data and others can work to understand it.
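Pittelli’s alternative story is easy to check in a toy simulation (everything below is invented for illustration; it is not the Gale et al. model or data). Here neuroticism is given no effect on mortality at all, only a pessimism effect on self-rated health; yet conditioning on self-rated health makes neuroticism look protective, just as in the paper’s adjusted analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Invented data-generating process (Pittelli's story, not Gale et al.'s
# model): neuroticism has NO effect on mortality; it only makes people
# rate the same underlying health more poorly.
neuro = rng.normal(0, 1, n)                            # neuroticism score
health = rng.normal(0, 1, n)                           # true latent health
rated = health - 0.7 * neuro + rng.normal(0, 0.5, n)   # self-rated health
p_death = 1 / (1 + np.exp(3 + 1.5 * health))           # worse health, higher risk
died = rng.random(n) < p_death

hi, lo = neuro > 0, neuro <= 0

# Crude association: essentially zero, since neuroticism is
# independent of true health here.
crude_diff = died[hi].mean() - died[lo].mean()

# "Adjusted" association: among people who REPORT poor health, the
# neurotic ones actually tend to be healthier, so they die less.
poor = rated < -1
adj_diff = died[hi & poor].mean() - died[lo & poor].mean()

print(f"crude difference in death rates (high vs. low neuroticism): {crude_diff:+.3f}")
print(f"difference within poor self-rated health:                   {adj_diff:+.3f}")
```

Self-rated health here acts as a misleading adjustment variable: within the stratum of people reporting poor health, high neuroticism is evidence that true health is better than the report suggests.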


The post Testing Seth Roberts’ appetite theory appeared first on Statistical Modeling, Causal Inference, and Social Science.

My organization is running a group test of Seth Roberts’ old theory about appetite.

We are running something like a “web trial” as discussed in your Chance article with Seth. And in fact our design was very inspired by your conversation… For one, we are using a control group which takes light olive oil *with* meals as you mentioned. We are also testing the mechanism of hunger rather than the outcome of weight loss. This is partly for pragmatic reasons about the variability of the measures, but it’s also an attempt to address the concern you raised that the mechanism is the 2 hour flavorless window itself. Not eating for two hours probably predicts weight loss but it wouldn’t seem to predict less hunger!

Here’s how to sign up for their experiment. I told Tupper that I found the documentation at that webpage to be confusing, so they also prepared this short document summarizing their plan.

I know nothing about these people but I like the idea of testing Seth’s diet, so I’m sharing this with you. (And I’m posting it now rather than setting it at the end of the queue so they can get their experimental data sooner rather than later.) Feel free to post your questions/criticisms/objections/thoughts in comments.


The post 3 quick tricks to get into the data science/analytics field appeared first on Statistical Modeling, Causal Inference, and Social Science.

Do you have advice on getting into the data science/analytics field? I just graduated with a B.S. in environmental science and a statistics minor and am currently interning at a university. I enjoy working with datasets from sports to transportation and doing historical analysis and predictive modeling.

My quick advice is to avoid interning at a university as I think you’ll learn more by working in the so-called real world. If you work at a company and are doing analytics, try to do your best work, don’t just be satisfied with getting the job done, and if you’re lucky you’ll interact with enough people that you’ll find the ones who you like, who you can work with. You can also go to local tech meetups to stay exposed to new ideas.

But maybe my advice is terrible, I have no idea, so others should feel free to share your advice and experience in the comments.


The post Hysteresis corner: “These mistakes and omissions do not change the general conclusion of the paper . . .” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Seriously, though, this is just an extreme example of a general phenomenon, which we might call *scientific hysteresis* or the *research incumbency advantage*:

When you’re submitting a paper to a journal, it can be really hard to get it accepted, and any possible flaw in your reasoning detected by a reviewer is enough to stop publication. But when a result has already been published, it can be really hard to get it overturned. All of a sudden, the burden is on the reviewer, not just to point out a gaping hole in the study but to demonstrate precisely where that hole led to an erroneous conclusion. Even when it turns out that a paper has several different mistakes (including, in the above example, mislabeling preschoolers as elementary school students, an error that entirely changes the intervention being studied), the author is allowed to claim, “These mistakes and omissions do not change the general conclusion of the paper.” It’s the research incumbency effect.

As I wrote in the context of a different paper, where t-statistics of 1.8 and 3.3 were reported as 5.03 and 11.14 and the authors wrote that this “does not change the conclusion of the paper”:

This is both ridiculous and all too true. It’s ridiculous because one of the key claims is entirely based on a statistically significant p-value that is no longer there. But the claim is true because the real “conclusion of the paper” doesn’t depend on any of its details—all that matters is that there’s something, somewhere, that has p less than .05, because that’s enough to make publishable, promotable claims about “the pervasiveness and persistence of the elderly stereotype” or whatever else they want to publish that day.

When the authors protest that none of the errors really matter, it makes you realize that, in these projects, the data hardly matter at all.

In some sense, maybe that’s fine. If these are the rules that the medical and psychology literatures want to play by, that’s their choice. It could be that the theories that these researchers come up with are so valuable that it doesn’t really matter if they get the details wrong: the data are in some sense just an illustration of their larger points. Perhaps an idea such as “Attractive names sustain increased vegetable intake in schools” is so valuable—such a game-changer—that it should not be held up just because the data in some particular study don’t quite support the claims that were made. Or perhaps the claims in that paper are so robust that they hold up even despite many different errors.

OK, fine, let’s accept that. Let’s accept that, ultimately, what matters is that a paper has a grabby idea that could change people’s lives, a cool theory that could very well be true. Along with a grab bag of data and some p-values. I don’t really see why the data are even necessary, but whatever. Maybe some readers have so little imagination that they can’t process an idea such as “Attractive names sustain increased vegetable intake in schools” without a bit of data, of some sort, to make the point.

Again, OK, fine, let’s go with that. But in that case, I think these journals should accept just about every paper sent to them. That is, they should become arXiv.

Cos if multiple fatal errors in a paper aren’t enough to sink it in post-publication review, why should they be enough to sink it in pre-publication review?

Consider the following hypothetical **Scenario 1:**

Author A sends paper to journal B, whose editor C sends it to referee D.

D: Hey, this paper has dozens of errors. The numbers don’t add up, and the descriptions don’t match the data. There’s no way this experiment could’ve been done as described.

C: OK, we’ll reject the paper. Sorry for sending this pile o’ poop to you in the first place!

And now the alternative, **Scenario 2:**

Author A sends paper to journal B, whose editor C accepts it. Later, the paper is read by outsider D.

D: Hey, this paper has dozens of errors. The numbers don’t add up, and the descriptions don’t match the data. There’s no way this experiment could’ve been done as described.

C: We sent your comments to the author who said that the main conclusions of the paper are unaffected.

D: #^&*$#@

[many months later, if ever]

C: The author published a correction, saying that the main conclusions of the paper are unaffected.

Does that really make sense? If the journal editors are going to behave that way in Scenario 2, why bother with Scenario 1 at all?


The post “Heating Up in NBA Free Throw Shooting” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I demonstrate that repetition heats players up, while interruption cools players down in NBA free throw shooting. My analysis also suggests that fatigue and stress come into play. If, as seems likely, all four of these effects have comparable impact on field goal shooting, they would justify strategic choices throughout a basketball game that take into account the hot hand. More generally my analysis motivates approaching causal investigation of the variation in the quality of all types of human performance by seeking to operationalize and measure these effects. Viewing the hot hand as a dynamic, causal process motivates an alternative application of the concept of the hot hand: instead of trying to detect which player happens to be hot at the moment, promote that which heats up you and your allies.

Pudaite says his paper is related to this post (and also, of course, this).


The post Return of the Klam appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s the story behind Klam’s strangely interrupted career, as reported by Taffy Brodesser-Akner: “Matthew Klam’s New Book Is Only 17 Years Overdue”:

In 1993, Matthew Klam was sitting in his room at the Fine Arts Work Center in Provincetown, where he was a fellow, when he received a call from Daniel Menaker at The New Yorker saying that they were interested in buying his short story “Sam the Cat.” . . . an outrageous success — it sparkled with human observation that is so true it makes you cringe — so he wrote a short-story collection, also called Sam the Cat, for Random House. [That volume, which came out in 2000, really is great. The title story is excellent but it’s not even my favorite one in the book.] Klam won an O. Henry Award, a Whiting Award, an NEA grant, a PEN Robert W. Bingham Prize, and a Guggenheim fellowship. . . .

But Klam was not so happy with what he was producing:

He felt like a fraud. All those awards were for things he was going to do in the future, and he didn’t know if he had anything left to say. . . . In 2007, he realized he didn’t have another short story in him . . . [In the following years,] Klam sat in his unfinished basement, temperature 55 degrees, even with a space heater, and asked himself how to make something again. He threw away what his wife and friends and editors suspect was an entire book’s worth of material. . . .

What happened to Matthew Klam, Matthew Klam explains, wasn’t as simple as creative paralysis. He’d gotten a tenure-track teaching position at Johns Hopkins in 2010, in the creative-writing department. It was a welcome respite from the spotlight . . . Teaching allowed him to write and to feel like no one was really watching. . . . each day he returned home from Baltimore and sat in his basement and waited for his novel to become apparent to him.

And then . . .

Finally, his hand was forced. As part of his tenure-track review, he had to show what he was working on to his department-assigned mentor and the chair of the department. Klam showed her the first 100 pages of Who Is Rich?. He was worried his voice was no longer special. He was worried it was the same old thing.

Get this:

The department supervisor found the pages “sloppily written” and “glib and cynical” and said that if he didn’t abandon the effort, she thought he would lose his job.

Hey! Not to make this blog all about me or anything, but that’s just about exactly what happened to me back in 1994 or so when the statistics department chair stopped me in the hallway and told me that he’d heard I’d been writing this book on Bayesian statistics and that if I was serious about tenure, I should work on something else. I responded that I’d rather write the book and not have tenure at Berkeley than have tenure at Berkeley and not write the book. And I got my wish!

So did Klam:

Suddenly, forcefully, he was sure that this wasn’t a piece of shit. This was his book. He told his boss he was going to keep working on the book . . . He sold Who Is Rich? before it was complete, based on 120 pages, in 2013. In 2016, he was denied tenure.

Ha! It looks like the Johns Hopkins University English Department dodged a bullet on that one.

Still, it seems to have taken Klam another four years to write the next 170 or so pages of the book. That’s slow work. That’s fine—given the finished product, it was all worth it—it just surprises me: Klam’s writing seems so fluid, I would’ve thought he could spin out 10 pages a day, no problem.

**P.S.** One thing I couldn’t quite figure out from this article is what Klam does for a living, how he paid the bills all those years. I hope he writes more books and that they come out more frequently than once every 17 years. But it’s tough to make a career writing books; you pretty much have to do it just cos you want to. I don’t think they sell many copies anymore.

**P.P.S.** There was one other thing that bothered me about that article, which was when we’re told that the academic department supervisor didn’t like Klam’s new book: “‘They like Updike,’ Klam explains of his department’s reaction and its conventional tastes. ‘They like Alice Munro. They love Virginia Woolf.'” OK, Klam’s not so similar to Munro and Woolf, but he’s a pretty good match for Updike: middle-aged, middle-class white guy writing about adultery among middle-aged white guys in the Northeast. Can’t get much more Updikey than that. I’m not saying Johns Hopkins was right to let Klam go, not at all, I just don’t think it seems quite on target to say they did it because of their “conventional tastes.”

**P.P.P.S.** Also this. Brodesser-Akner writes:

He wanted a regular writing life and everything that went with it . . . Just a regular literary career — you know, tenure, a full professorship, a novel every few years.

That just made me want to cry. I mean, a tenured professorship is great, I like my job. But that’s “a regular literary career” now? That’s really sad, compared to the days when a regular literary career meant that you could support yourself on your writing alone.


The post I’m skeptical of the claims made in this paper appeared first on Statistical Modeling, Causal Inference, and Social Science.

One of my correspondents amusingly asked if we should see if the purported effect in this paper interacts with a certain notorious effect that has been claimed, and disputed, in this same subfield. I responded by listing three additional possible explanations from the discredited literature, thus suggesting a possible five-way interaction.

I followed up with, “I assume I can quote you if I blog this? (I’m not sure I will . . . do I really need to make N more enemies???)”

My correspondent replied: “*I* definitely don’t need N more enemies. I’m an untenured postdoc on a fixed contract. If you blog about it, could you just say the paper was sent your way?”

So that’s what I’m doing!

**P.S.** Although I find the paper in question a bit silly, I have no objection whatsoever to it being put out there. Speculative theory is fine, speculative data analysis is fine too. Indeed, one of the problems with the current system of scientific publication is that speculation generally isn’t enough: you have to gussy up your results with p-values and strong causal claims, or else you can have difficulty getting them published anywhere.


The post “No System is Perfect: Understanding How Registration-Based Editorial Processes Affect Reproducibility and Investment in Research Quality” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s what they found:

[N]o system is perfect. Registration is a useful complement to the traditional editorial process, but is unlikely to be an adequate replacement. By encouraging authors to shift from follow-up investment to up-front investment, REP encourages careful planning and ambitious data gathering, and reduces the questionable practices and selection biases that undermine the reproducibility of results. But the reduction in follow-up investment leaves papers in a less-refined state than the traditional process, leaving useful work undone. Because accounting is a small field, with journals that typically publish a small number of long articles, subsequent authors may have no clear opportunity to make a publishable contribution by filling in the gaps.

With experience, we expect that authors and editors will learn which editorial process is better suited to which types of studies, and learn how to draw inferences differently from papers produced by these very different systems. We also see many ways that both editorial processes could be improved by moving closer toward each other. REP would be improved by encouraging more outside input before proposals are accepted, and more extensive revisions after planned analyses are conducted, especially those relying on forms of discretion that our community sees as most helpful and least harmful. TEP would be improved by demanding more complete and accurate descriptions of procedures (as Journal of Accounting Research has been implementing for several years (updated JAR [2018]), not only those that help subsequent authors follow those procedures, but also those that help readers interpret p-values in light of the alternatives that authors considered and rejected in calculating them. REP and TEP would complement one another more successfully if journals would be more open to publishing short articles under TEP that fill in the gaps left by articles published under REP.

They also share some anecdotes:

“I was serving as a reviewer for a paper at a top journal, and the original manuscript submitted by the authors had found conflicting results relating to the theory they had proposed–in other words, some of the results were consistent with expectations derived from the theory while others were contrary. The other reviewer suggested that the authors consider a different theory that was, frankly, a better fit for the situation and that explained the pattern of results very well–far better than the theory proposed by the authors. The question immediately arose as to whether it would be ethical and proper for the authors to rewrite the manuscript with the new theory in place of the old. This was a difficult situation because it was clear the authors had chosen a theory that didn’t fit the situation very well, and had they been aware (or had thought of) the alternate theory suggested by the other reviewer, they would have been well advised on an a priori basis to select it instead of the one they went with, but I had concerns about a wholesale replacement of a theory after data had been collected to test a different theory. On the other hand, the instrument used in collecting the data actually constituted a reasonably adequate way to test the alternate theory, except, of course that it wasn’t specifically designed to differentiate between the two. I don’t recall exactly how the situation was resolved as it was a number of years ago, but my recollection is that the paper was published after some additional data was collected that pointed to the alternate theory.”

-Respondent 84, Full Professor, Laboratory Experiments

“As an author, I have received feedback from an editor at a Top 3 journal that the economic significance of the results in the paper seemed a little too large to be fully explained by the hypotheses. My co-authors and I were informed by the editor of an additional theoretical reason why the effects sizes could be that large and we were encouraged by the editor to incorporate that additional discussion into the underlying theory in the paper. My co-authors and I agreed that the theory and arguments provided by the editor seemed reasonable. As a result of incorporating this suggestion, we believe the paper is more informative to readers.”

-Respondent 280, Assistant Professor, Archival

“As a doctoral student, I ran a 2x2x2 design on one of my studies. The 2×2 of primary interest worked well in one level of the last variable but not at all in the other level. I was advised by my dissertation committee not to report the results for participants in the one level of that factor that “didn’t work” because that level of the factor was not theoretically very important and the results would be easier to explain and essentially more informative. As a result, I ended up reporting only the 2×2 of primary interest with participants from the level of the third variable where the 2×2 held up. To this day, I still feel a little uncomfortable about that decision, although I understood the rationale and thought it made sense.”

-Respondent 85, Full Professor, Laboratory Experiments

This all seems relevant to discussions of preregistration, post-publication review, etc.

The post “No System is Perfect: Understanding How Registration-Based Editorial Processes Affect Reproducibility and Investment in Research Quality” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post What’s Wrong with “Evidence-Based Medicine” and How Can We Do Better? (My talk at the University of Michigan Friday 2pm) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>What’s Wrong with “Evidence-Based Medicine” and How Can We Do Better?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

“Evidence-based medicine” sounds like a good idea, but it can run into problems when the evidence is shaky. We discuss several different reasons that even clean randomized clinical trials can fail to replicate, and we discuss directions for improvement in design and data collection, statistical analysis, and the practices of the scientific community. See this paper: http://www.stat.columbia.edu/~gelman/research/published/incrementalism_3.pdf and this one: http://www.stat.columbia.edu/~gelman/research/unpublished/Stents_submitted.pdf

The post What’s Wrong with “Evidence-Based Medicine” and How Can We Do Better? (My talk at the University of Michigan Friday 2pm) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post 354 possible control groups; what to do? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m a PhD student in economics at Stockholm University and a frequent reader of your blog. I have for a long time followed your quest in trying to bring attention to p-hacking and multiple comparison problems in research. I’m now myself faced with the aforementioned problem and want to at the very least try to avoid picking (or being subject to the critique of having picked) control group which merely gives me fancy results. The setting is the following,

I run a difference-in-difference (DD) model between occupations, where people working in occupation X are treated at year T. There are 354 other types of occupations, and for at least 20-30 of them I could make up a “credible” story about why they would be a natural control group. One could of course run the DD estimation on the treated group vs. the entire labor market, but claiming causality between the reform and the outcome hinges not only on the parallel-trend assumption but also on group-specific shocks being absent. Hence one might want to find a control group that would be subjected to the same type of shocks as the treated occupation X, so one might be better off picking specific occupations from the rest of the 354 categories. Some of these might have parallel trends, some others might not, but wouldn’t it be p-hacking to choose groups like this, based on parallel trends? The reader has no guarantee that I as a researcher haven’t picked control groups that give me the results that will get me published.

So in summary: When one has 1 treated group and 354 potential control groups, how does one go about choosing among these?

My response: rather than picking one analysis (either ahead of time or after seeing the data), I suggest you do all 354 analyses and put them together using a hierarchical model as discussed in this paper. Really, this is not doing 354 analyses, it’s doing one analysis that includes all these comparisons.
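To illustrate what the pooled analysis buys you, here is a toy simulation (my numbers are invented, and this is a sketch of simple normal partial pooling, not the full model in the linked paper): 354 noisy comparisons, shrunk toward their common mean with a method-of-moments estimate of the between-group variance.

```python
import random
import statistics

random.seed(1)

# Sketch (invented numbers): 354 noisy treated-vs-control comparisons,
# partially pooled toward their common mean, with the between-group
# variance estimated by method of moments.
J, se = 354, 1.0
true_effects = [random.gauss(0.0, 0.5) for _ in range(J)]
estimates = [random.gauss(t, se) for t in true_effects]

mu = statistics.fmean(estimates)
tau2 = max(statistics.variance(estimates) - se**2, 0.0)  # between-group variance
pool = tau2 / (tau2 + se**2)                             # shrinkage factor
pooled = [mu + pool * (e - mu) for e in estimates]

mse_raw = statistics.fmean([(e - t) ** 2 for e, t in zip(estimates, true_effects)])
mse_pooled = statistics.fmean([(p - t) ** 2 for p, t in zip(pooled, true_effects)])
print(mse_raw, mse_pooled)  # partial pooling should cut the mean squared error
```

The point of the sketch is that the hierarchical model uses all 354 comparisons at once, and the shrinkage it applies is learned from the data rather than chosen by the researcher.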

The post 354 possible control groups; what to do? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Eid ma clack shaw zupoven del ba. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A few years back, Bill Callahan wrote a song about the night he dreamt the perfect song. In a fever, he woke and wrote it down before going back to sleep. The next morning, as he struggled to read his handwriting, he saw that he’d written the nonsense that forms the title of this post.

Variational inference is a lot like that song; dreams of the perfect are ruined in the harsh glow of the morning after.

(For more unnaturally tortured metaphors see my twitter. I think we can all agree setting one up was a fairly bad choice for me.)

But how can we tell if variational inference has written the perfect song or, indeed, if it has laid an egg? Unfortunately, there doesn’t seem to be a lot of literature to guide us. We (Yuling, Aki, me, and Andrew) have a new paper to give you a bit more of an idea.

The guiding principle of variational inference is that if it’s impossible to work with the true posterior p(θ | y), then near enough is surely good enough. (It seldom is.)

In particular, we try to find the member q*(θ) of some tractable set Q of distributions (commonly the family of multivariate Gaussian distributions with diagonal covariance matrices) that minimizes the *Kullback-Leibler* divergence

q*(θ) = argmin_{q in Q} KL(q(θ) || p(θ | y)).

The Kullback-Leibler divergence in this direction (it’s asymmetric, so the order of arguments is important) can be interpreted as the amount of information lost if we replace the approximate posterior q(θ) with the true posterior p(θ | y). Now, if this seems like the wrong way around to you [that we should instead worry about what happens if we replace the target posterior p(θ | y) with the approximation q(θ)], you would be very *very* correct. That Kullback-Leibler divergence is *backwards.*

*What does this mean?* Well it means that we won’t penalize approximate distributions that are much less complex than the true one as heavily as we should. *How does this translate into real life?* It means that usually we will end up with approximations that are narrower than the true posterior. Usually this manifests as distributions with lighter tails.

(*Quiet side note:* Why are those last few sentences so wishy-washy? Well it turns out that minimizing a Kullback-Leibler divergence in the wrong direction can do all kinds of things to the resulting minimizer and it’s hard to really pin down what will happen. But it’s almost everyone’s experience that the variational posterior is almost always narrower than the true posterior. So the previous paragraph is *usually* true.)
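To see the mode-seeking behaviour concretely, here is a toy experiment (my own illustration, not from the paper): the target is a two-component Gaussian mixture, the approximating family is a single Gaussian, and we minimize each direction of the KL divergence by brute force on a grid.

```python
import math

# Toy illustration: target p is a two-component Gaussian mixture; the
# approximating family is a single Gaussian phi(. ; m, s).
def phi(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def p(x):
    return 0.5 * phi(x, -3.0, 1.0) + 0.5 * phi(x, 3.0, 1.0)

dx = 0.02
xs = [i * dx for i in range(-750, 751)]

def kl(f, g):
    # grid approximation of KL(f || g); infinite if g underflows where f has mass
    total = 0.0
    for x in xs:
        fx = f(x)
        if fx < 1e-12:
            continue
        gx = g(x)
        if gx < 1e-300:
            return float("inf")
        total += fx * math.log(fx / gx) * dx
    return total

grid = [(float(m), 0.5 * s) for m in range(-6, 7) for s in range(1, 11)]
# KL(q || p): the direction variational inference minimizes (mode-seeking)
best_rev = min(grid, key=lambda ms: kl(lambda x: phi(x, ms[0], ms[1]), p))
# KL(p || q): the "natural" direction (mass-covering)
best_fwd = min(grid, key=lambda ms: kl(p, lambda x: phi(x, ms[0], ms[1])))
print("reverse KL picks (m, s) =", best_rev)  # hugs one mode, s near 1
print("forward KL picks (m, s) =", best_fwd)  # covers both modes, larger s
```

The reverse (variational) direction parks the Gaussian on a single mode with a narrow scale, while the forward direction spreads it across both modes: the narrow, lighter-tailed answer is exactly the pathology described above.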

So variational inference is mostly set up to fail. Really, we should be surprised it works at all.

There are really two things we need to check when we’re doing variational inference. The first is that the optimization procedure that we have used to compute q*(θ) has actually converged to a (local) minimum. Naively, this seems fairly straightforward. After all, we don’t think of maximum likelihood estimation as being hard computationally, so we should be able to solve this optimization problem easily. But it turns out that if we want our variational inference to be *scalable* in the sense that we can apply it to big problems, we need to be more clever. For example, Automatic Differentiation Variational Inference (ADVI) uses a fairly sophisticated stochastic optimization method to find q*(θ).

So first we have to make sure the method actually converges. I don’t really want to talk about this, but it’s probably worth saying that it’s not trivial and stochastic methods like ADVI will occasionally terminate too soon. This leads to terrible approximations to the true posterior. It’s also well worth saying that if the true posterior is multimodal, there’s no guarantee that the minimum that is found will be a (nearly) global one. (And if the approximating family only contains unimodal distributions, we will have some problems!) There are perhaps some ways out of this (Yuling has many good ideas), but the key thing is that if you want to actually know if there is a potential problem, it’s important to run multiple optimizations beginning at a diverse set of initial values.

Anyway, let’s pretend that this isn’t a problem so that we can get onto the main point.

The second thing that we need to check is that the approximate posterior q*(θ) is an ok approximation to the true posterior p(θ | y). This is a much less standard task and we haven’t found a good method for addressing it in the literature. So we came up with two ideas.

Our first idea was based on Aki, Andrew, and Jonah’s *Pareto-Smoothed Importance Sampling* (PSIS). The crux of our idea is that if q*(θ) is a good approximation to the true posterior, it can be used as an importance sampling proposal for computing expectations with respect to p(θ | y). So before we can talk about that method, we need to remember what PSIS does.

The idea is that we can approximate any posterior expectation E[h(θ) | y] using a self-normalized importance sampling estimator. We do this by drawing samples θ_1, …, θ_S from the proposal distribution q*(θ) and computing the estimate

E[h(θ) | y] ≈ (Σ_{s=1}^S w_s h(θ_s)) / (Σ_{s=1}^S w_s).

Here we define the *importance weights* as

w_s = p(θ_s, y) / q*(θ_s).

We can get away with using the joint distribution p(θ, y) in the numerator instead of the posterior because p(θ | y) ∝ p(θ, y) and we re-normalize the estimator anyway. This self-normalized importance sampling estimator is consistent, with bias that goes asymptotically like O(1/S). (The bias comes from the self-normalization step. Ordinary importance sampling is unbiased.)
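Here is a minimal pure-Python sketch of that estimator for a toy problem (the target and proposal distributions are invented for illustration): the target plays the role of p(θ | y), the proposal plays the role of the variational approximation.

```python
import math
import random

random.seed(1)

def norm_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Toy target N(1, 1) stands in for p(theta | y); proposal q = N(0, 2) stands
# in for the approximation we can actually sample from. (Invented numbers.)
S = 200_000
draws = [random.gauss(0.0, 2.0) for _ in range(S)]
weights = [norm_pdf(x, 1.0, 1.0) / norm_pdf(x, 0.0, 2.0) for x in draws]

# self-normalized importance sampling estimate of E[theta | y]
est = sum(w * x for w, x in zip(weights, draws)) / sum(weights)
print(est)  # should land close to the true target mean, 1.0
```

Because of the self-normalization, the target density in the numerator only needs to be known up to a constant, which is exactly why the unnormalized joint p(θ, y) suffices.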

The only problem is that if the distribution of the weights w_s has too heavy a tail, the self-normalized importance sampling estimator will have infinite variance. This is not a good thing. Basically, it means that the error in the posterior expectation could be any size.

The problem is that if the distribution of w_s has a heavy tail, the importance sampling estimator will be almost entirely driven by a small number of samples with very large values. But there is a trick to get around this: somehow tamp down the extreme values of w_s.
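Here is a quick simulation of that failure mode (a sketch with made-up shape parameters): when the weights follow a generalized Pareto distribution with shape 0.7 (infinite variance), the single largest of 10,000 weights carries a big share of the total; with shape 0.3 it doesn’t.

```python
import random

random.seed(1)

def gpd_sample(k, sigma=1.0):
    # inverse-CDF draw from a generalized Pareto distribution with shape k > 0
    u = random.random()
    return sigma * ((1.0 - u) ** (-k) - 1.0) / k

def max_weight_share(k, n=10_000):
    # fraction of the total weight carried by the single largest draw
    w = [gpd_sample(k) for _ in range(n)]
    return max(w) / sum(w)

# average over 20 replications; shapes 0.3 and 0.7 are made up for the demo
share_light = sum(max_weight_share(0.3) for _ in range(20)) / 20
share_heavy = sum(max_weight_share(0.7) for _ in range(20)) / 20
print(share_light, share_heavy)  # the k = 0.7 case is dominated by a few draws
```

This is the "driven by a small number of samples" problem in miniature, and it is exactly the tail behaviour that PSIS tries to tame.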

With PSIS, Aki, Andrew, and Jonah propose a nifty solution. They argue that you can model the tails of the distribution of the importance ratio with a *generalized Pareto distribution*.

This is a very sensible thing to do: the generalized Pareto is the go-to distribution that you use when you want to model the distribution of all samples from an iid population that are above a certain (high) value. The PSIS approximation argues that you should take the M largest weights (where M is chosen carefully) and fit a generalized Pareto distribution to them. You then replace those M largest observed importance weights with the corresponding expected order statistics from the fitted generalized Pareto.

There are some more (critical) details in the PSIS paper but the intuition is that we are replacing the “noisy” sample importance weights with their model-based estimates. This reduces the variance of the resulting self-normalized importance sampling estimator and reduces the bias compared to other options.

It turns out that the key parameter in the generalized Pareto distribution is the shape parameter k. The interpretation of this parameter is that if the generalized Pareto distribution has shape parameter k, then the distribution of the sampling weights has at most 1/k finite moments.

This is particularly relevant in this context as the condition for the importance sampling estimator to have finite variance (and be asymptotically normal) is that the sampling weights have (slightly more than) two moments. This translates to k < 1/2.

*VERY technical side note:* What I want to say is that the self-normalized importance sampling estimator is asymptotically normal. This was nominally proved in Theorem 2 of Geweke’s 1983 paper. The proof there looks wrong. Basically, he applies a standard central limit theorem to get the result, which seems to assume the terms in the sum are iid. The only problem is that the summands

w_s h(θ_s) / Σ_{r=1}^S w_r

are not independent. So it looks a lot like Geweke should’ve used a central limit theorem for weakly-mixing triangular arrays instead. He did not. What he actually did was quite clever. He noticed that the bivariate random variables (w_s h(θ_s), w_s) are independent and satisfy a central limit theorem with mean (E_q[w h], E_q[w]). From there you apply a second-order Taylor expansion of the function f(a, b) = a/b to show that the sequence

(Σ_{s=1}^S w_s h(θ_s)) / (Σ_{s=1}^S w_s)

is also asymptotically normal as long as zero or infinity are never in a neighbourhood of E_q[w].

*End VERY technical side note!*

The news actually gets even better! The smaller k is, the faster the importance sampling estimate will converge. Even better than that, the PSIS estimator seems to be useful even if k is slightly bigger than 0.5. The recommendation in the PSIS paper is that if k̂ < 0.7, the PSIS estimator is reliable.

But what is k̂? It’s the sample estimate of the shape parameter k. Once again, some really nice things happen when you use this estimator. For example, even if we know from the structure of the problem that k < 0.5, if k̂ > 0.7 (which can happen), then importance sampling will perform poorly. The value of k̂ is strongly linked to the *finite sample* behaviour of the PSIS (and other importance sampling) estimators.

The intuition for why the estimated shape parameter is more useful than the population shape parameter is that it tells you when the sample of weights that you’ve drawn *could have come from a heavy-tailed distribution.* If this is the case, there isn’t enough information in your sample yet to push you into the asymptotic regime and pre-asymptotic behaviour will dominate (usually leading to worse than expected behaviour).

Ok, so what does all this have to do with variational inference? Well it turns out that if we draw samples from our variational posterior q*(θ) and use them to compute the importance weights, then we have another interpretation for the shape parameter k:

k = inf { k′ > 0 : D_{1/k′}(p || q*) < ∞ },

where D_α(p || q*) is the Rényi divergence of order α. In particular, if k < 1, then the Kullback-Leibler divergence KL(p || q*) is finite in the more natural direction even though q*(θ) minimizes the KL-divergence in the other direction! Once again, we have found that the estimate k̂ gives an excellent indication of the performance of the variational posterior.

So why is checking that k̂ < 0.7 a good heuristic for evaluating the quality of the variational posterior? There are a few reasons. Firstly, because the variational posterior minimizes the KL-divergence in the direction that penalizes approximations with heavier tails than the posterior much harder than approximations with lighter tails, it is very difficult to get a good k̂ value by simply “fattening out” the approximation. Secondly, empirical evidence suggests that the smaller the value of k̂, the closer the variational posterior is to the true posterior. Finally, if k̂ < 0.7 we can automatically improve *any* expectation computed against the variational posterior using PSIS. This makes this tool both a diagnostic and a correction for the variational posterior that does not rely too heavily on asymptotic arguments. The value of k̂ has also proven useful for selecting the best parameterization of the model for the variational approximation (or, equivalently, for choosing between different approximating families).

There are some downsides to this heuristic. Firstly, it really does check that the whole variational posterior is like the true posterior. This is a quite stringent requirement that variational inference methods often do not pass. In particular, as the number of dimensions increases, we’ve found that unless the approximating family is particularly well-chosen for the problem, the variational approximation will eventually become bad enough that k̂ exceeds the threshold. Secondly, this diagnostic only considers the full posterior and cannot be modified to work on lower-dimensional subsets of the parameter space. This means that if the model has some “less important” parameters, we still require that their posterior be very well captured by the variational approximation.

The thing about variational inference is that it’s actually often quite bad at estimating a posterior. On the other hand, the centre of the variational posterior is much more frequently a good approximation to the centre of the true posterior. This means that we can get good point estimates from variational inference even if the full posterior isn’t very good. So we need a diagnostic to reflect this.

Into the fray steps an old paper of Andrew’s (with Samantha Cook and Don Rubin) on verifying statistical software. We (mainly Stan developer Sean) have been playing with various ways of extending and refining this method for the last little while and we’re almost done on a big paper about it. (Let me tell you: god may be present in the sweeping gesture, but the devil is definitely in the details.) Thankfully for this work, we don’t need any of the new detailed work we’ve been doing. We can just use the original results as they are (with just a little bit of a twist).

The resulting heuristic, which we call *Variational Simulation-Based Calibration* (VSBC), complements the PSIS diagnostic by assessing the average performance of the implied variational approximation to *univariate* posterior marginals. One of the things that this method can do particularly well is indicate if the centre of the variational posterior will be, on average, biased. If it’s not biased, we can apply clever second-order corrections (like the one proposed by Ryan Giordano, Tamara Broderick, and Michael Jordan).

I keep saying “on average”, so what do I mean by that? Basically, VSBC looks at how well the variational posterior is calibrated by computing the distribution of the calibration probabilities Pr(θ ≤ θ₀ | y), where the data y are simulated from the model with a parameter θ₀ that is itself drawn from the prior distribution. If the variational inference method is *calibrated*, then Cook *et al.* showed that the histogram of these probabilities should be uniform.

This observation can be generalized using insight from the forecast validation community: *if the histogram of these calibration probabilities is asymmetric, then the variational posterior will be (on average over data drawn from the model) biased.* In the paper, we have a specific result, which shows that this insight is exactly correct if the true posterior is symmetric, and approximately true if it’s fairly symmetric.
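To make the idea concrete, here is a toy sketch (my own, not from the paper): a conjugate normal model where the exact posterior is available in closed form, so we can play the role of the “variational” approximation once without bias and once with an artificial mean shift. The asymmetry of the calibration probabilities flags the shift.

```python
import math
import random

random.seed(1)

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def vsbc_probs(bias, reps=5000):
    # Conjugate toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1), so the
    # exact posterior is N(y/2, 1/2). The "variational" approximation shifts
    # the posterior mean by `bias` (bias = 0 recovers the exact posterior).
    ps = []
    for _ in range(reps):
        theta0 = random.gauss(0.0, 1.0)        # draw parameter from the prior
        y = random.gauss(theta0, 1.0)          # simulate data from the model
        m, s = y / 2.0 + bias, math.sqrt(0.5)  # approximate posterior
        ps.append(Phi((theta0 - m) / s))       # Pr_approx(theta <= theta0)
    return ps

mean_exact = sum(vsbc_probs(0.0)) / 5000   # symmetric histogram, mean near 0.5
mean_biased = sum(vsbc_probs(0.5)) / 5000  # asymmetric, mean well below 0.5
print(mean_exact, mean_biased)
```

With no bias the calibration probabilities are uniform, so their histogram is symmetric around 0.5; the shifted approximation piles them up on one side, which is precisely the asymmetry VSBC looks for.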

There’s also the small problem that if the model is badly mis-specified, then it may fit the observed data much worse or better than the average of data drawn from the model. Again, this contrasts with the PSIS diagnostic that only assesses the fit for the particular data set you’ve observed.

In light of this, we recommend interpreting both of our heuristics the same way: conservatively. If either heuristic fails, then we can say the variational posterior is poorly behaved in one of two specific ways. If either or both heuristics pass, then we can have some confidence that the variational posterior will be a good approximation to the true distribution (especially after a PSIS or second-order correction), but this is still not guaranteed.

To close this post out symmetrically (because symmetry indicates a lack of bias), let’s go back to a different Bill Callahan song to remind us that even if it’s not the perfect song, you can construct something beautiful by leveraging formal structure:

*If*

*If you*

*If you could*

*If you could only*

*If you could only stop*

*If you could only stop your*

*If you could only stop your heart*

*If you could only stop your heart beat*

*If you could only stop your heart beat for*

*If you could only stop your heart beat for one heart*

*If you could only stop your heart beat for one heart beat*

The post Eid ma clack shaw zupoven del ba. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Methodological terrorism. For reals. (How to deal with “what we don’t know” in missing-data imputation.) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The linked article begins:

Although the empirical and analytical study of terrorism has grown dramatically in the past decade and a half to incorporate more sophisticated statistical and econometric methods, data validity is still an open, first-order question. Specifically, methods for treating missing data often rely on strong, untestable, and often implicit assumptions about the nature of the missing values.

Later, they write:

If researchers choose to impute data, then they must be clear about the benefits and drawbacks of using an imputation technique.

Yes, definitely. One funny thing about missing-data imputation is that the methods are so mysterious and are so obviously subject to uncheckable assumptions that there’s a tendency for researchers to just throw up their hands and give up, and either go for crude data-simplification strategies such as throwing away all cases where anything is missing, or just imputing without any attempt to check the resulting inferences.

My preference is to impute and then check assumptions, as here. That said, in practice this can be a bit of work so in a lot of my own applied work I kinda close my eyes to the problem too. I should do better.

The post Methodological terrorism. For reals. (How to deal with “what we don’t know” in missing-data imputation.) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post N=1 experiments and multilevel models appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Design and Implementation of N-of-1 Trials: A User’s Guide, edited by Richard Kravitz and Naihua Duan for the Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services (2014).

Single-patient (n-of-1) trials: a pragmatic clinical decision methodology for patient-centered comparative effectiveness research, by Naihua Duan, Richard Kravitz, and Chris Schmid, for the Journal of Clinical Epidemiology (2013).

And a particular example:

The PREEMPT study – evaluating smartphone-assisted n-of-1 trials in patients with chronic pain: study protocol for a randomized controlled trial by Colin Barr et al., for Trials (2015), which begins:

Chronic pain is prevalent, costly, and clinically vexatious. Clinicians typically use a trial-and-error approach to treatment selection. Repeated crossover trials in a single patient (n-of-1 trials) may provide greater therapeutic precision. N-of-1 trials are the most direct way to estimate individual treatment effects and are useful in comparing the effectiveness and toxicity of different analgesic regimens.

This can also be framed as the problem of hierarchical modeling when the number of groups is 1 or 2, and this issue comes up, that once you go beyond N=1, you’re suddenly allowing more variation. One way to handle this is to include this between-person variance component even for an N=1 study. It’s just necessary to specify the between-person variance a priori—but that’s better than just setting it to 0. Similarly, once we have N=2 we can fit a hierarchical model but we’ll need strong prior info on the between-person variance parameter.
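In the normal-normal case the N=1 version can be written down explicitly. Here is a sketch (with invented numbers) of that shrinkage estimator, where the between-person sd tau has to be supplied a priori:

```python
# Sketch (invented numbers) of partial pooling with N=1: the posterior mean
# for the person's effect is a precision-weighted average of the observed
# effect and the population mean, with the between-person sd tau fixed a priori.
def shrink(y_hat, se, mu, tau):
    w = (1 / se**2) / (1 / se**2 + 1 / tau**2)  # weight on the observed effect
    post_mean = w * y_hat + (1 - w) * mu
    post_sd = (1 / se**2 + 1 / tau**2) ** -0.5
    return post_mean, post_sd

# observed effect 2.0 (se 1.0), population mean 0, prior between-person sd 1.0
print(shrink(2.0, 1.0, 0.0, 1.0))  # -> (1.0, 0.707...)
```

Setting tau by hand is the price of having only one person; as tau grows the estimate approaches the raw observed effect, and as tau shrinks it is pulled toward the population mean.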

This relates to some recent work of ours in pharmacology—in this case, the problem is not N=1 patient, but N=1 study, and it also connects to a couple discussions we’ve had on this blog regarding the use of multilevel models to extrapolate to new scenarios; see here and here and here from 2012. We used to think of multilevel models as requiring 3 or more groups, but that’s not so at all; it’s just that when you have fewer groups, there’s more to be gained by including prior information on group-level variance parameters.

The post N=1 experiments and multilevel models appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post p=0.24: “Modest improvements” if you want to believe it, a null finding if you don’t. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Press Release: “A large-scale effort to reduce childhood obesity in two low-income Massachusetts communities resulted in some modest improvements among schoolchildren over a relatively short period of time…”

Study: “Overall, we did not observe a significant decrease in the percent of students with obesity from baseline to post intervention in either community in comparison with controls…”

Allison continues:

In the paper, the body of the text states:

Overall, we did not observe a significant decrease in the percent of students with obesity from baseline to post intervention in either community in comparison with controls (Community 1: −0.77% change per year, 95% confidence interval [CI] = −2.06 to 0.52, P = 0.240; Community 2: −0.17% change per year, 95% CI = −1.45 to 1.11, P = 0.795).

Yet, the abstract concludes “This multisector intervention was associated with a modest reduction in obesity prevalence among seventh-graders in one community compared to controls . . .”

The publicity also seems to exaggerate the findings, stating, “A large-scale effort to reduce childhood obesity in two low-income Massachusetts communities resulted in some modest improvements among schoolchildren over a relatively short period of time, suggesting that such a comprehensive approach holds promise for the future, according to a new study from Harvard T.H. Chan School of Public Health.”

I have mixed feelings about this one.

On one hand, we shouldn’t be using “p = 0.05” as a cutoff. Just cos a 95% conf interval excludes 0, it doesn’t mean a pattern in data reproduces in the general population; and just cos an interval *includes* 0, it doesn’t mean that nothing’s going on. So, with that in mind, sure, there’s nothing wrong with saying that the intervention “was associated with a modest reduction,” as long as you make clear that there’s uncertainty here, and the data are also consistent with a zero effect or even a modest increase in obesity.

On the other hand, there is a problem here with degrees of freedom available for researchers and publicists. The 95% interval was [-2.1, 0.5], and this was reported as “a modest reduction” that was “holding promise for the future.” Suppose the 95% confidence interval had gone the other way and had been [-0.5, 2.1]. Would they have reported it as “a modest increase in obesity . . . holding danger for the future”? I doubt it. Rather, I expect they would’ve reported this outcome as a null (the p-value is 0.24, after all!) and gone searching for the positive results.
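As a quick sanity check on the reported numbers, the p-value can be recovered from the confidence interval under a normal approximation:

```python
import math

# Reported: effect -0.77 per year, 95% CI (-2.06, 0.52), p = 0.240.
# Assuming a normal approximation, back out the standard error and p-value.
est, lo, hi = -0.77, -2.06, 0.52
se = (hi - lo) / (2 * 1.96)  # CI half-width divided by 1.96
z = est / se
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(p, 2))  # -> 0.24, matching the paper
```

So the interval, estimate, and p-value are internally consistent; the issue is purely in how they were spun.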

So there is a problem here, not so much with the reporting of this claim in isolation but with the larger way in which a study produces a big-ass pile of numbers which can then be mined to tell whatever story you want.

**P.S.** Just as a side note: above, I used the awkward but careful phrase “a pattern in data reproduces in the general population” rather than the convenient but vague formulation “the effect is real.”

**P.P.S.** I sent this to John Carlin, who replied:

That’s interesting and an area that I’ve had a bit to do with – basically there are zillions of attempts at interventions like this and none of them seem to work, so my prior distribution would be deflating this effect even more. The other point that occurred to me is that the discussion seems to have focussed entirely on the “time-confounded” before-after effect in the intervention group rather than the randomised(??) comparison with the control group – which looks even weaker.

John wanted to emphasize, though, that he’s not looked at the paper. So his comment represents a general impression, not a specific comment on what was done in this particular research project.

The post p=0.24: “Modest improvements” if you want to believe it, a null finding if you don’t. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Andrew vs. the Multi-Armed Bandit appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>It didn’t take long to implement some policies and a simulation framework in R (case study coming after I do some more plots and simulate some contextual bandit probability matching policies). When I responded with positive results on the Bernoulli bandit case with probability matching (aka “Thompson sampling”), Andrew replied with a concern about the nomenclature, which he said I could summarize publicly.

First, each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Indeed. The Wikipedia says it’s sometimes called the *K*-bandit problem instead, but I haven’t seen any evidence of that.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.

Yes. Nevertheless, people play slot machines. I suppose an economist would say the financial payout has low expected value, but high variance, so play may be rational for a risk-happy player who enjoys playing.

Third, I find the “bandit” terminology obscure and overly cute. It’s an analogy removed at two levels from reality: the optimization problem is not really like playing slot machines, and slot machines are not actually bandits. It’s basically a math joke, and I’m not a big fan of math jokes.

Seriously? The first two comments I understand, but the third is perplexing when considering the source is the same statistician and blogger who brought us “Red State, Blue State, Rich State, Poor State”, “Mister P”, “Cantor’s Corner”, and a seemingly endless series of cat pictures, click-bait titles, and posts named after novels.

I might not be so sensitive about this had I not pleaded in vain not to name our software “Stan” because it was too cute, too obscure, and the worst choice for SEO since “BUGS”.

**P.S.** Here’s the conjugate model for Bernoulli bandits in Stan followed by the probability matching policy implemented in RStan.

data {
  int K;     // num arms
  int N;     // num trials
  int z[N];  // arm on trial n
  int y[N];  // reward on trial n
}
transformed data {
  int successes[K] = rep_array(0, K);
  int trials[K] = rep_array(0, K);
  for (n in 1:N) {
    trials[z[n]] += 1;
    successes[z[n]] += y[n];
  }
}
generated quantities {
  simplex[K] is_best;
  vector[K] theta;
  for (k in 1:K)
    theta[k] = beta_rng(1 + successes[k], 1 + trials[k] - successes[k]);
  {
    real best_prob = max(theta);
    for (k in 1:K)
      is_best[k] = (theta[k] >= best_prob);
    is_best /= sum(is_best);  // uniform for ties
  }
}

The generated quantities compute the indicator function needed to compute the event probability that a given bandit is the best. That’s then used as the distribution to sample which bandit’s arm to pull next in the policy.

model_conjugate <- stan_model("bernoulli-bandits-conjugate.stan")

fit_bernoulli_bandit <- function(y, z, K) {
  data <- list(K = K, N = length(y), y = y, z = z)
  sampling(model_conjugate, algorithm = "Fixed_param", data = data,
           warmup = 0, chains = 1, iter = 1000, refresh = 0)
}

expectation <- function(fit, param) {
  posterior_summary <- summary(fit, pars = param, probs = c())
  posterior_summary$summary[ , "mean"]
}

thompson_sampling_policy <- function(y, z, K) {
  posterior <- fit_bernoulli_bandit(y, z, K)
  p_best <- expectation(posterior, "is_best")
  sample(K, 1, replace = TRUE, p_best)
}

1000 iterations is overkill here, but since everything's conjugate it's very fast. I can get away with a single chain because it's pure Monte Carlo (not MCMC)---that's the beauty of conjugate models. It's often possible to exploit conjugacy or collect sufficient statistics to produce more efficient implementations of a given posterior.
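Since the Beta posterior is available in closed form, the same probability-matching step can be sketched outside Stan entirely. Here's a minimal pure-Python illustration of Thompson sampling on a Bernoulli bandit (an illustrative sketch, not the post's code; the arm probabilities are made up):

```python
import random

def thompson_step(successes, trials):
    """Pick the next arm: draw one theta per arm from its
    Beta(1 + successes, 1 + failures) posterior, play the argmax."""
    draws = [random.betavariate(1 + s, 1 + (n - s))
             for s, n in zip(successes, trials)]
    return max(range(len(draws)), key=draws.__getitem__)

# simulate a 3-armed Bernoulli bandit with hypothetical true rates
true_theta = [0.2, 0.5, 0.7]
successes = [0, 0, 0]
trials = [0, 0, 0]
random.seed(1)
for _ in range(2000):
    k = thompson_step(successes, trials)
    reward = random.random() < true_theta[k]
    trials[k] += 1
    successes[k] += reward
```

After a couple thousand plays the policy concentrates almost all of its pulls on the best arm, which is the whole point of probability matching: arms are played in proportion to the posterior probability that they are best.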

The post Andrew vs. the Multi-Armed Bandit appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Snappy Titles: Deterministic claims increase the probability of getting a paper published in a psychology journal appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I wanted to pass along something I found to be of interest today as a proponent of pre-registration.

Here is a recent article from Social Psychological and Personality Science. I was interested by the pre-registered study. Here is the pre-registration for Study 1.

The pre-registration lists eight (directional) hypotheses. Yet, in Study 1 of the paper, only 2-3 of these hypotheses are reported. Footnote 1 states that the study included some “exploratory” measures that were inconsistent and so not reported in the main text.

In one sense, this is pre-registration working well in that I can now identify these inconsistencies. In another sense, I think it’s a misuse of pre-registration when literally the first hypothesis listed (and three of the first four hypotheses) is now labeled as exploratory in the published version. It’s the first time I’ve seen this and I wonder how common it will become in the future, knowing that many people don’t look closely at pre-registrations.

My reply: I have no idea, but I was struck by how the paper in question follows the social-psychology-article template, with features such as a double-barreled title with catchy phrase followed by a deterministic claim (“Poisoned Praise: Discounted Praise Backfires and Undermines Subordinate Impressions in the Minds of the Powerful”) and a lead-off celebrity quote (“Flattery and knavery are blood relations. — Abraham Lincoln”). This does not mean the results are wrong—I guess if you’re interested in that question, you’ll have to wait for the preregistered replication—it’s just interesting to see the research pattern being essentially unchanged. Just speaking generally, without any knowledge of this particular topic, I’m skeptical about the possibility of learning about complex interactions (for example, this tongue-twister from the abstract: “high-power people’s tendency to discount feedback only produced negative partner perceptions when positive feedback, but not neutral feedback, was discounted”) from this sort of noisy between-person experiment.

I agree with my correspondent that preregistration is a good first step, but ultimately I think it will be necessary to use more efficient experimental designs and a stronger integration of data collection with theory, if the goal is to make progress in understanding of cognition and behavior.

**P.S.** My correspondent asked for anonymity, and that is a sign of something wrong in academia, that it can be difficult for people to criticize published papers. I don’t think this is a problem unique to psychology. In a better world, the person who sent me the above criticism would not be afraid to publish it openly and directly.

The post Snappy Titles: Deterministic claims increase the probability of getting a paper published in a psychology journal appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Here’s a post with a Super Bowl theme. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In this retrospective cohort study of 3812 NFL players who debuted between 1982 and 1992, there was no statistically significant difference in the risk of long-term all-cause mortality among career NFL players compared with NFL replacement players who participated in the NFL during a 3-game league-wide player strike in 1987 (adjusted hazard ratio, 1.38; 95% CI, 0.95-1.99). Career participation in NFL football, compared with participation as an NFL replacement player, was not associated with a statistically significant difference in the risk of all-cause mortality.

They report:

The study sample consisted of 2933 career NFL players and 879 NFL replacement players . . . A total of 144 deaths occurred among career NFL players . . . and 37 deaths occurred among NFL replacement players. . . . The mortality risk by study end line was 4.9% for career NFL players and 4.2% for NFL replacement players, and the unadjusted absolute risk difference was 0.7% . . . Adjusting for birth year, BMI, height, and position group, the absolute risk difference was 1.0%.

They also report that “Four career NFL players were excluded from the sample because they died while actively employed on an NFL roster.” Including them would bump up the mortality rate from 4.9% to 5.0%.

The real story, though, is that the sample is not large enough to reliably detect the difference between a 4% mortality rate and a 5% mortality rate.
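A back-of-the-envelope calculation, using only the numbers reported above, makes the point. Under a normal approximation for the difference of two proportions (a rough sketch that ignores the survival-model adjustments in the paper), the power to detect the observed difference is small:

```python
from math import sqrt, erf

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# numbers from the paper: career NFL players vs. replacement players
n1, p1 = 2933, 0.049
n2, p2 = 879, 0.042

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# approximate power of a two-sided z-test at alpha = 0.05
# to detect the observed 0.7-percentage-point difference;
# comes out to roughly 14%
power = norm_cdf((p1 - p2) / se - 1.96)
```

With power in the neighborhood of 15%, a null result is exactly what you'd expect whether or not the underlying difference is real.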

From the standpoint of statistical communication, the challenge is how to report such a finding.

I think the authors did a pretty good job of not overselling their results. To start with, their title is noncommittal. They do report, “not associated with a statistically significant difference,” which I don’t think is so helpful. But, flip it around: what if the numbers had been very slightly different and the 95% interval had excluded a null effect? Then we wouldn’t want them blaring that they’d found a clear effect.

And the paper includes a strong section on Limitations:

This study has several limitations. First, the residual differences between career NFL players and NFL replacement players that may bias the results could not be addressed. Additional research that collects data on medical comorbidities, lifestyle factors, and earnings across the 2 groups could further address this bias. A complementary approach might focus on groups of NFL career and replacement players who are more similar in their duration of NFL exposure. For example, comparing career NFL players with shorter tenures vs NFL replacement players may avoid healthy worker bias from players who are able to persist in the league because of better unobserved health. . . .

Second, the estimates were based on a small number of events: 181 deaths, of which only 37 occurred among NFL replacement players. Consequently, the present analysis could be underpowered to detect meaningful associations. Additional analyses with longer-term follow-up may be informative.

Third, the data were drawn from online databases. While these have been used in other analyses, it is possible that some information was misreported . . .

That’s how you do it: report what you found and be clear about what you did.

The corresponding author works at the University of Pennsylvania. Eagles fan?

The post Here’s a post with a Super Bowl theme. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post What to teach in a statistics course for journalists? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I am a science journalist for Swiss public television and have previously regularly covered the “crisis in science” on Swiss public radio, including things like p-hacking, relative risks, confidence intervals, reproducibility etc.

I have been giving courses in basic statistics and how to read scientific studies for Swiss journalists without science backgrounds. As such, I am increasingly wondering what to teach (knowing they will not dive into Bayesian statistics). How should a non-science journalist handle p values? Should he at all? What about 95 percent confidence intervals? Should relative risks be reported when absolute numbers aren’t available? What about small studies? Should they even report about single studies? And how about meta-analysis?

I wondered if you have some basic advice for journalists that you could share with me? Or are you aware of good existing checklists?

To start with, here’s my article from 2013, “Science Journalism and the Art of Expressing Uncertainty.”

Also relevant is this post on Taking Data Journalism Seriously. And this on what to avoid.

There are some teaching materials out there, which you can find by googling *statistics for journalists* or similar terms, but there is a problem that once you try to get in deeper, there’s little agreement in how to proceed.

I suppose that the starting point is understanding the statistical terms that often arise in science: sample and population, treatment and control, randomized experiment, observational study, regression, probability.

Should you teach p-values and 95% intervals? I’m not sure. OK, yeah, you have to say something about these, as they appear in so many published papers. And then you have to explain how these terms are defined. And then you have to explain the problems with these ideas. I think there’s no way around it.

Bayesian methods? Sure, you have to say something about these, because, again, they’re used in published papers. You don’t have to “dive” into Bayesian methods but you can and should explain the idea of predictive probabilities such as arise in election forecasts.
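The predictive-probability idea is easy to demonstrate: a forecast produces simulation draws of the outcome, and the win probability is just the share of draws in which the candidate wins. A minimal sketch (the forecast numbers here are hypothetical, and a real forecast would not be a simple normal distribution):

```python
import random

# hypothetical forecast: candidate's two-party vote share,
# summarized by 10,000 simulation draws
random.seed(0)
draws = [random.gauss(0.52, 0.02) for _ in range(10_000)]

# predictive probability of winning = share of draws above 50%
p_win = sum(d > 0.5 for d in draws) / len(draws)
```

So a candidate forecast at 52% with a 2-point uncertainty wins in roughly 84% of simulations — a distinction (share of the vote vs. probability of winning) that journalists routinely need to explain.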

For the big picture, I recommend this bit from the first page of Regression and Other Stories:

The post What to teach in a statistics course for journalists? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “revision-female-named-hurricanes-are-most-likely-not-deadlier-than-male-hurricanes” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Three years ago, a scientific study claimed that storms named Debby are predisposed to kill more people than storms named Don. The study alleged that people don’t take female-named storms as seriously. Numerous analyses have since found that this conclusion has little merit.

Good to see the record being corrected. But there’s more. Samenow continues:

When the study came out, I [Samenow] reported on it, and it generated tremendous interest. Because of the Internet and the tendency for interesting articles to continue circulating years later, the 2014 story is still being read by thousands of readers — even though its key results have largely been debunked. . . . just a day after the study was published by the Proceedings of the National Academy of Sciences, critics began poking holes in it. . . . In subsequent months, several study rebuttals were submitted to and published by the same journal that had published it originally . . . In addition to the formal rebuttals published by the journal, Gary Smith, a professor of economics at Pomona College, wrote a critique on his blog, in which he described “several compelling reasons” to be skeptical about the study results. . . . The blog critique was subsequently expanded in an article published in the journal Weather and Climate Extremes in June 2016.

Samenow concludes:

The publication of this study and the hype it generated (and continues to generate) show the potential pitfalls of coming to unqualified conclusions from a single study. It also reveals the importance of the peer review process, which encourages critical responses to new findings so that work that cannot withstand scrutiny does not endure.

Let me just emphasize that what really worked here was *post-publication* peer-review, an informal process using social media as well as traditional journals.

Good on Samenow for making this correction. It’s excellent to see a news editor taking responsibility for past stories. In contrast, Susan Fiske, the PPNAS editor who accepted that “himmicanes” paper and several others of comparable quality, has never acknowledged that there might be any problems with this work. Fiske lives in a traditional academic world in which all peer review occurs before publication, and where published results are protected by a sort of “thin blue line” of credentialed scientists, a world in which it’s considered ok for a published paper, no matter how ridiculous, to get extravagant publicity, but where any negative remarks, no matter how well sourced, are supposed to be whispered.

We all make mistakes, and I don’t hold it against Fiske that she published that flawed paper. Any of us can be fooled by crap research that is wrapped up in a fancy package. I know I’ve been fooled on occasion. What’s important is to learn from our mistakes.

The post “revision-female-named-hurricanes-are-most-likely-not-deadlier-than-male-hurricanes” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post When to add a feature to Stan? The recurring issue of the compound declare-distribute statement appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Compound declare-define statements**

Mitzi added declare-define statements a while back, so you can now write:

transformed parameters {
  real sigma = tau^-0.5;
}

It’s generally considered good style to put the declaration of a variable as close to where it is used as possible. It saves a lot of scanning back and forth to line things up. Stan’s forcing users to declare all the variables at the top violates this principle and compound declare-define clears a lot of that up (though not all—there’s a related proposal to let statements be mingled into the declarations at the top of a block like the transformed data and transformed parameters blocks).

**Compound parameter declare-distribute statements?**

Allowing compound declare-distribute statements in the parameters block would support writing linear regression as follows.

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha ~ normal(0, 10);
  real beta ~ normal(0, 5);
  real<lower = 0> sigma ~ normal(0, 2);
}
model {
  y ~ normal(alpha + beta * x, sigma);
}

The current way of writing this would have the same data block, but the parameters block would only have declarations—uses get moved to the model block.

parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 5);
  sigma ~ normal(0, 2);
  y ~ normal(alpha + beta * x, sigma);
}

Compared to the current approach, the proposed compound declare-distribute statements pull the first three uses of the variables, which is to assign them priors, up into the parameters block where they are declared.

**Who’s on which side on this?**

I’m trying to remain neutral as to who has which opinion as to whether compound declare-distribute statements would be a good thing. (The implementation’s no big deal by the way, so that’s not even an issue in the discussion.)

**Even more ambitious proposal**

An even more ambitious proposal (still easy to implement) would be to open up the parameters block to general statements so that anything at all in the model block could be moved up to the parameters block, though the intent would have it just be priors. The reason this matters is that not every prior in Stan can be easily coded in a single line with a distribution.

Allowing statements in the parameters block makes the transformed parameters and model blocks redundant. So we’ll just have one block that I’ll call “foo” to avoid bike-shedding on the name.

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
}
foo {
  real alpha ~ normal(0, 10);
  real beta ~ normal(0, 5);
  real<lower = 0> sigma ~ normal(0, 2);
  y ~ normal(alpha + beta * x, sigma);
}

It’d even be possible to just drop the block name and braces, but we’d still need to declare the generated quantities block.

**Compound data declare-distribute statements**

So far, we’ve only been talking about declare-distribute for parameters. But what about data? If what’s good for the goose is good for the gander, we can allow compound declare-define on data variables. But there’s a hitch—we can’t bring the sampling statement for the likelihood up into the data block, because we haven’t declared parameters yet.

The easy workaround here is to put the data declarations right on the variables instead of using blocks. Now we can move the data declarations down and put a compound declare-distribute statement on the modeled data y.

real alpha ~ normal(0, 10);
real beta ~ normal(0, 5);
real<lower = 0> sigma ~ normal(0, 2);
data int<lower = 0> N;
data vector[N] x;
data vector[N] y ~ normal(alpha + beta * x, sigma);

This is pretty close to Maria Gorinova’s proposal for SlicStan and something that’s very much like Cibo Technologies’ StataStan. It looks more like PyMC3 or Edward or even BUGS than it looks like existing Stan, but will still support loops, (recursive) functions, conditionals, etc.

**Now what?**

Our options seem to be the following:

- hold steady
- allow compound declare-distribute for parameters
- further allow other statements in the parameters block
- full ahead toward SlicStan

**Stan’s basic design**

To frame the discussion, I think it’ll help to let you know my thinking when I originally designed the Stan language. I was very much working backwards from the C++ object I knew I wanted. Matt Hoffman and I already had HMC working with autodiff on models defined in C++ at that point (just before or after that point is when Daniel Lee joined the coding effort).

I thought of the C++ class definition as the definition of the model—typically in terms of a joint log density (I really regret calling the log probability density function `log_prob`—it should’ve been `lpdf`).

The constructor of the model class took the data as an argument. So when you constructed an instance of the class, you instantiated the model with data. That conditions the model on the data, so an instantiated model defines the log density of the posterior.

The data block declaration was essentially a way to code the arguments to the constructor of that class. It’s like the function argument declarations for a function in a language like C or Java. (It works much more indirectly in practice in Stan through generic I/O reader interface callbacks; the declarations specify how to generate the code to read the relevant variables from a reader and define them as member variables in the model class.)

The parameter block declaration was similarly just a way to code the arguments to the log density function. (This also works indirectly in practice through readers by taking unconstrained parameters, peeling off the next one in the declaration by size, adding any Jacobian adjustments for change of variables to the log density, and defining a local variable on the constrained scale to be used by the model—that’s also how we can generalize Maria Gorinova’s proposal for functions in SlicStan that required compile-time unfolding.)

The model block was just the implementation of the body of the log density function.

That’s it for the basics and it’s still how I like to explain the Stan language to people—it’s just a way of writing down a log density function.

Later, I added the transformed data block as a way to save intermediate values at construction time. That is, member variables in the class that would be computed based on the data. The transformed parameters block is the same deal, only for parameters. So they define transforms as local variables for the model block to use (and get printed).

**Next up…**

Breaking the target log density into the sum of a log likelihood and log prior (or penalty)…

The post When to add a feature to Stan? The recurring issue of the compound declare-distribute statement appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The Anti-Bayesian Moment and Its Passing appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Over the years we have often felt frustration, both at smug Bayesians—in particular, those who object to checking of the fit of model to data, because all Bayesian models are held to be subjective and thus unquestioned (an odd combination indeed, but that is the subject of another article)—and angry anti-Bayesians who, as we wrote in our article, strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

The present article arose from our memory of a particularly intemperate anti-Bayesian statement that appeared in Feller’s beautiful and classic book on probability theory. We felt that it was worth exploring the very extremeness of Feller’s words, along with similar anti-Bayesian remarks by others, to better understand the background underlying controversies that still exist regarding the foundations of statistics. . . .

Here’s the key bit:

The second reason we suspect for Feller’s rabidly anti-Bayesian stance is the postwar success of classical Neyman-Pearson ideas. Many leading mathematicians and statisticians had worked on military problems during World War II, using available statistical tools to solve real problems in real time. Serious applied work motivates the development of new methods and also builds a sense of confidence in the existing methods that have led to such success. After some formalization and mathematical development of the immediate postwar period, it was natural to feel that, with a bit more research, the hypothesis testing framework could be adapted to solve any statistical problem. In contrast, Thornton Fry could express his skepticism about Bayesian methods but could not so easily dismiss the entire enterprise, given that there was no comprehensive existing alternative.

If 1950 was the anti-Bayesian moment, it was due to the successes of the Neyman–Pearson–Wald approach, which was still young and growing, with its limitations not yet understood. In the context of this comparison, there was no need for researchers in the mainstream of statistics to become aware of the scattered successes of Bayesian inference. . . .

As Deborah Mayo notes, the anti-Bayesian moment, if it ever existed, has long passed. Influential non-Bayesian statisticians such as Cox and Efron are hardly anti-Bayesian, instead demonstrating both by their words and their practice a willingness to use full probability models as well as frequency evaluations in their methods, and purely Bayesian approaches have achieved footholds in fields as diverse as linguistics, marketing, political science, and toxicology.

If there ever was a “pure Bayesian moment,” that too has passed with the advent of “big data” that for computational reasons can only be analyzed using approximate methods. We have gone from an era in which prior distributions cannot be trusted to an era in which full probability models serve in many cases as motivations for the development of data-analysis algorithms.

The post The Anti-Bayesian Moment and Its Passing appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Geoff Norman: Is science a special kind of storytelling? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The nature of science was summarized beautifully by a Stanford professor of science education, Mary Budd Rowe, who said that:

Science is a special kind of story-telling with no right or wrong answers. Just better and better stories.

Benítez writes that he doesn’t buy this.

Neither do I, but I read the rest of Norman’s article and I really liked it.

Here’s a great bit:

I [Norman] turn to a wonderful paper by Cook et al. (2008) that described three fundamentally different kinds of research question:

1. Description

“I have a new (curriculum, questionnaire, simulation, OSCE method, course) and here’s how I developed it”

That’s not even poor research. It’s not research at all.

2. Justification

“I have a new (curriculum, module, course, software) and it works. Students really like it” OR “students self-reported knowledge was higher” OR “students did better on the final exam than a control group” OR even “students had lower mortality rates after the course”

OK, it’s research. But is it science? After all, what do we know about how the instruction actually works? Do we have to take the whole thing on board lock, stock and barrel, to get the effects? What’s the active ingredient? In short, WHY is it better? And that brings us to

3. Clarification

“I have a new (curriculum, module, course, software). It contains a number of potentially active ingredients including careful sequencing of concepts, imbedding of concepts in a problem, interleaved practice, and distributed practice. I have conducted a program of research where these factors have been systematically investigated and the effectiveness of each was demonstrated”.

That’s more like it. We’re not asking if it works, we’re asking why it works. And the results truly add to our knowledge about effective strategies in education. So one essential characteristic is that the findings are not limited to the particular gizmo under scrutiny in the study. The study adds to our general understanding of the nature of teaching and learning.

I don’t want to get caught up in a debate on what’s “science” or what’s “research” or whatever; the key point is that science and statistics are not, and should not be, just about “what works” but rather “How does it work?” This relates to our earlier post about the problem with the usual formulation of clinical trials as purely evaluative, which leaves you in the lurch if someone doesn’t happen to provide you with a new and super-effective treatment to try out.

To put it another way, the “take a pill” or “black box” approach to statistical evaluation would work ok if we were regularly testing wonder-pills. But in the real world, effects are typically highly variable, and we won’t get far without looking into the damn box.

Norman also writes:

The most critical aspect of a theory is that, instead of addressing a simple “Does it work?,” to which the answer is “Yup” or “Nope”, it permits a critical examination of the effects and interactions of any number of variables that may potentially influence a phenomenon. So, one question leads to another question, and before you know it, we have a research program. Programmatic strategies inevitably lead to much more sophisticated understanding. And along the way, if it’s really going well, each study leads to further insights that in turn lead to the next study question. And each answer leads to further insight and explanation.

One interesting thing here is that this Lakatosian description might seem at first glance to be a good description, not just of healthy scientific research, but also of degenerate research programmes associated with elderly-words-and-slow-walking, or ovulation and clothing, or beauty and sex ratio, or power pose, or various other problematic research agendas that we’ve criticized in this space during the past decade. All these subfields, which have turned out to be noise-spinning dead ends, feature a series of studies and publications, with each new result leading to new questions. But these studies are uncontrolled. Part of the problem there is a lack of substantive theories—no, vague pointing to just-so evolutionary stories is not enough. But another problem is statistical, the p-values and all that.

Now let’s put all these ideas together:

– If you want to engage in a scientifically successful research programme, your theories should be “thick” and full of interactions, not mere estimates of average treatment effects. That’s fine: all the areas we’ve been discussing, including the theories we don’t respect, are complex. Daryl Bem’s ESP hypotheses, for example: they were full of interactions.

– But now the next step is to model those interactions, to consider them together. If, for example, you decide that outdoor temperature is an important variable (as in that ovulation-and-clothing paper), you go back and include it in analysis of earlier as well as later studies. And if you’re part of a literature that includes other factors such as age, marital status, political orientation, etc., then, again, you include all these too.

– Including all these factors and then analyzing in a reasonable way (I’d prefer a multilevel model but, if you’re careful, even some classical multiple comparisons approach could do the trick) would reveal not much other than noise in those problematic research areas. In contrast, in a field where real progress is being made, a full analysis should reveal persistent patterns. For example, political scientists keep studying polarization in different ways. I think it’s real.

The point of my above discussion is to elaborate on Norman’s article by emphasizing the interlocking roles of substantive theory, data collection, and statistical analysis. I think that in Norman’s discussion, he’s kinda taking for granted that the statistical analysis will respect the substantive theory, but we see real problems when this doesn’t happen, in papers that consider isolated hypotheses without considering the interactions pointed to even within their own literatures.

**P.S.** Regarding the title of this post, here’s what Basbøll and I wrote a few years ago about science and stories. Although here we were talking not about story*telling* but about scientists’ use of and understanding of stories.

The post Geoff Norman: Is science a special kind of storytelling? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Education, maternity leave, and breastfeeding appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In today’s column, How We Are Ruining America, David Brooks writes that “Upper-middle-class moms have the means and the maternity leaves to breast-feed their babies at much higher rates than high school-educated moms, and for much longer periods.”

He’s correct about college grads being more likely to have access to maternity leave, but since this (and the cost of child care) translates to college grads being more likely to work and less likely to drop out of the labor force, I think he’s probably incorrect about this translating into ability to breastfeed. I used Census (ACS 2011-2015) data to look at typical hours worked per week in the previous year as well as current employment status among the mothers of infants. They both show the same thing: Education among the mothers of infants is positively correlated with number of hours typically worked per week in the previous year, as well as with currently being employed.

There is a leave gap, but it’s small in absolute terms. If you look at the sum of ‘not working’ + ’employed but on leave’ in the employment status graph (the ’employed but on leave’ group would be part of ’employed but not at work’) and assume that’s the group that’s not being prevented from breastfeeding due to their jobs, it’s the biggest among the less-educated group and the smallest among the most-educated group.
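
That tabulation is easy to sketch. Here’s a toy version in pandas — the data and column names below are made up for illustration (the real analysis used ACS variables and survey weights):

```python
import pandas as pd

# Hypothetical ACS-style extract; values and column names are illustrative,
# not the actual ACS variable names, and survey weights are omitted.
df = pd.DataFrame({
    "education":  ["hs", "hs", "hs", "ba", "ba", "ba", "grad", "grad", "grad"],
    "emp_status": ["not working", "not working", "employed, at work",
                   "employed, not at work", "employed, at work", "employed, at work",
                   "employed, at work", "employed, at work", "employed, at work"],
})

# Share of mothers in each employment status, within education group.
share = pd.crosstab(df["education"], df["emp_status"], normalize="index")

# The group arguably free to breastfeed: not working, plus employed but on
# leave (proxied here by "employed, not at work").
free = share[["not working", "employed, not at work"]].sum(axis=1)
print(free)
```

With the real data, the claim above corresponds to `free` being largest for the least-educated group and smallest for the most-educated group.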

Below are my graphs and here is my workbook, and they’re also here along with the data. It’s possible that ‘hours worked’ is reflecting hours worked pre-baby – it depends on how old the baby is/how the respondent chose to answer. But it’s telling the same story as the employment status graph, and it’s tough to reconcile with Brooks’ story.

I’ve not looked at the details here but wanted to share with you.

The post Postdoc opening on subgroup analysis and risk-benefit analysis at Merck pharmaceuticals research lab appeared first on Statistical Modeling, Causal Inference, and Social Science.

We are looking for a strong postdoctoral fellow for a very interesting cutting edge project.

The project requires expertise in statistical modeling and machine learning.

Here is the official job ad. We are looking for candidates that are strong both analytically and computationally (excellent coding skills).

In the project, we are interested in subgroup analysis for benefit-risk; optimal treatment allocation methods would also be of interest.

I don’t post every job ad that gets sent to me, but this one sounds particularly interesting, based on the description above.

**P.S.** Full disclosure: I consult for pharmaceutical companies and have been paid by Merck.

The post My suggested project for the MIT Better Science Ideathon: assessing the reasonableness of missing-data imputations. appeared first on Statistical Modeling, Causal Inference, and Social Science.

We are 3 months away from the MIT Better Science Ideathon on April 23.

We would like to request your help with mentoring a team or two during the ideathon. During the ideathon, teams discuss a specific issue (e.g., lack of focus on reproducibility across the majority of journals) or problem that arose from a well-intentioned initiative (e.g., proliferation of open-access journals), brainstorm on how to address the issue/problem, and design a strategy. In this regard, we are soliciting ideas that the teams can choose from for the ideathon.

I’ll be participating in this event, and here’s my suggested project: assessing the reasonableness of missing-data imputations. Here are 3 things to read:

http://www.stat.columbia.edu/~gelman/research/published/paper77notblind.pdf

http://www.stat.columbia.edu/~gelman/research/published/mipaper.pdf

A Python program for multivariate missing-data imputation that works on large datasets!?

This last link is particularly relevant as it points to some Python code that the students can run to get started.

Of course, anyone can work on this project; no need to go to the Ideathon to do it. I guess the plan is for the Ideathon to motivate groups of people to focus.
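
To give a flavor of what “assessing reasonableness” might look like, here is a minimal simulated sketch (not taken from any of the linked papers): one quick diagnostic is to compare the spread of imputed values to the spread of observed values, which immediately flags mean imputation as implausible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: two correlated variables, with about 30% of y missing at random.
n = 1000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3
y_obs = np.where(miss, np.nan, y)

# Two imputation schemes: crude mean imputation vs. regression imputation.
y_mean_imp = np.where(miss, np.nanmean(y_obs), y_obs)
coef = np.polyfit(x[~miss], y_obs[~miss], 1)            # fit on complete cases
y_reg_imp = np.where(miss, np.polyval(coef, x), y_obs)

def spread_ratio(y_imp):
    """Std. dev. of imputed values relative to observed values."""
    return np.std(y_imp[miss]) / np.std(y_obs[~miss])

print(spread_ratio(y_mean_imp))  # near 0: mean imputation collapses variation
print(spread_ratio(y_reg_imp))   # closer to 1, though still understated
```

Even the regression imputations under-disperse (no residual noise is added), which is exactly the sort of thing such a check is meant to reveal.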

The post Objects of the class “Verbal Behavior” appeared first on Statistical Modeling, Causal Inference, and Social Science.

My nominee for a new “objects of the class” would be B. F. Skinner’s “Verbal Behavior” — i.e., the criticisms of the thing are more widely read than the thing itself.

Hmmm . . . I’ve heard of Skinner but not of “Verbal Behavior,” let alone its criticism. But the general idea sounds good.

What other objects are in that class?

It’s related to Objects of the class “Foghorn Leghorn” but slightly different in that here we’re talking criticism, not parody.

And here are other objects of the class “Objects of the class.”

The post New Stan case studies: NNGP and Lotka-Volterra appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Two new case studies**

Lu Zhang of UCLA contributed a case study on nearest neighbor Gaussian processes.

Bob Carpenter (that’s me!) of Columbia Uni contributed one on Lotka-Volterra population dynamics.

Mitzi Morris of Columbia Uni has been updating her ICAR spatial models case study; the neighbor graph is now explained in pictures, the results have been extended from Brooklyn to all five boroughs of NYC, and the results overlaid on Google maps.

**Tufte handout format!**

I’m really excited about the tufte-handout style I used. Thanks to Sarah Heaps of Newcastle Uni for showing me her teaching materials for her Jumping Rivers RStan course. It inspired me to check out the Tufte format. I’m already using it for the MCMC and Stan book to back up the Coursera courses I’m preparing.

**StanCon case studies coming soon**

As soon as Jonah can process them, there’ll be a whole bunch of case studies in the proceedings from the 2018 StanCon in Asilomar.

**Lots of case studies available**

Here’s the complete list of official case studies.

Of course, lots of people do Stan case studies in papers, YouTube videos and blogs.

**Your case study here**

Meanwhile, if you have case studies to contribute, please let us know. It’s as easy as using knitr or Jupyter and sending us the HTML and a repo link.

The post Looking at all possible comparisons at once: It’s not “overfitting” if you put it in a multilevel model appeared first on Statistical Modeling, Causal Inference, and Social Science.

The human brain mapping conference is on these days, and I heard via Twitter about this Overfitting toolbox for fMRI studies that helps explore the multiplicity of analytical pipelines in a more systematic fashion.

Reminded me a bit of your multiverse analysis: thought you might like the idea.

The link is to a conference poster by Joram Soch, Carsten Allefeld, and John-Dylan Haynes that begins:

A common problem in experimental science is if the analysis of a data set yields no significant result even though there is a strong prior belief that the effect exists. In this case, overfitting can help . . . We present The Overfitting Toolbox (TOT), a set of computational tools that allow to systematically exploit multiple model estimation, parallel statistical testing, varying statistical thresholds and other techniques that allow to increase the number of positive inferences.

I’m pretty sure it’s a parody: for one thing, the poster includes a skull-and-crossbones image; for another, it ends as follows:

Widespread use of The Overfitting Toolbox (TOT) will allow researchers to uncover literally unthinkable sorts of effects and lead to more spectacular findings and news coverage for the entire fMRI community.

That said, I actually think it can be a good idea to fit a model looking at all possible comparisons! My recommended next step, though, is not to look at p-values or multiple comparisons or false discovery rates or whatever, but rather to fit a hierarchical model for the distribution of all these possible effects. The point is that everything’s there, but most effects are small. In the areas where I’ve worked, this makes more sense than trying to pick out and focus on just a few comparisons.
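
As a toy illustration of that last point — everything simulated, and using a simple empirical-Bayes normal-normal model as a stand-in for a full hierarchical fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated setting: J comparisons, mostly-small true effects, noisy estimates.
J = 1000
tau, sigma = 0.1, 0.5
theta = rng.normal(0, tau, J)          # true effects
y = theta + rng.normal(0, sigma, J)    # estimates with standard error sigma

# Hierarchical model y_j ~ N(theta_j, sigma^2), theta_j ~ N(0, tau^2),
# with tau^2 estimated from the data (empirical Bayes).
tau2_hat = max(np.var(y) - sigma**2, 0.0)
shrink = tau2_hat / (tau2_hat + sigma**2)
theta_post = shrink * y                # partial pooling toward zero

print(np.mean((y - theta) ** 2))           # error of the raw estimates
print(np.mean((theta_post - theta) ** 2))  # partial pooling does far better
```

The pooled estimates don’t “discover” a few big effects; they acknowledge that everything’s there but most effects are small.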

The post Stacking and multiverse appeared first on Statistical Modeling, Causal Inference, and Social Science.

Recently Tim Disher asked a question in the Stan discussion forum: “Multiverse analysis – concatenating posteriors?”

Tim refers to a paper “Increasing Transparency Through a Multiverse Analysis” by Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. The abstract says

Empirical research inevitably includes constructing a data set by processing raw data into a form ready for statistical analysis. Data processing often involves choices among several reasonable options for excluding, transforming, and coding data. We suggest that instead of performing only one analysis, researchers could perform a multiverse analysis, which involves performing all analyses across the whole set of alternatively processed data sets corresponding to a large set of reasonable scenarios. Using an example focusing on the effect of fertility on religiosity and political attitudes, we show that analyzing a single data set can be misleading and propose a multiverse analysis as an alternative practice. A multiverse analysis offers an idea of how much the conclusions change because of arbitrary choices in data construction and gives pointers as to which choices are most consequential in the fragility of the result.

In that paper the focus is on looking at the possible results from the multiverse of forking paths, but Tim asked whether it would *“make sense at all to combine the posteriors from a multiverse analysis in a similar way to how we would combine multiple datasets in multiple imputation”*?

After I (Aki) thought about this, my answer is

- in multiple imputation, the different data sets are posterior draws from the missing-data distribution and thus are usually equally weighted
- I think multiverse analysis is similar to the case of having a set of models with different variables, variable transformations, interactions, and non-linearities, as in our Stacking paper (Yao, Vehtari, Simpson, Gelman), where we have different models for the arsenic well data (section 4.6). Then stacking would be a sensible way to combine *predictions* (as we may have different model parameters for differently processed data) with non-equal weights. Stacking is a good choice for model combination here as
  - we don’t need to assign prior probabilities to different forking paths
  - stacking favors paths which give good predictions
  - it avoids the “prior dilution problem” if some processed datasets happen to be very similar to each other (see fig 2c in the Stacking paper)
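
To make the stacking step concrete, here is a minimal sketch with two hypothetical forking paths and made-up held-out predictive densities (the Stacking paper solves the general convex-optimization problem; with two models a grid search over the one-dimensional simplex suffices):

```python
import numpy as np

# Made-up held-out predictive densities p(y_i | path k) for 4 held-out points.
# Path 1 predicts points 1-3 well; path 2 is much better on point 4.
p = np.array([
    [0.8, 0.2],
    [0.8, 0.2],
    [0.8, 0.2],
    [0.1, 0.9],
])

def stacking_score(w):
    """Summed log score of the w-weighted mixture of predictive densities."""
    return np.sum(np.log(p @ w))

grid = np.linspace(0.0, 1.0, 1001)
best = max(grid, key=lambda w: stacking_score(np.array([w, 1.0 - w])))
print(best)  # most of the weight goes to path 1, but not all of it
```

The weights reward out-of-sample predictive performance rather than prior plausibility, which is exactly what the points above ask for.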

The post The multiverse in action! appeared first on Statistical Modeling, Causal Inference, and Social Science.

The designing, collecting, analyzing, and reporting of psychological studies entail many choices that are often arbitrary. The opportunistic use of these so-called researcher degrees of freedom aimed at obtaining statistically significant results is problematic because it enhances the chances of false positive results and may inflate effect size estimates. In this review article, we present an extensive list of 34 degrees of freedom that researchers have in formulating hypotheses, and in designing, running, analyzing, and reporting of psychological research. The list can be used in research methods education, and as a checklist to assess the quality of preregistrations and to determine the potential for bias due to (arbitrary) choices in unregistered studies.

34 different degrees of freedom! That’s a real multiverse; it can get you to a p-value of 2^-34 (that is, about 0.00000000006) in no time. And it’s worse than that: Wicherts et al. write, “We created a list of 34 researcher DFs, but our list is in no way exhaustive for the many choices that need be made during the different phases of a psychological experiment.”
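
The 2^-34 figure is about chasing a single path; the complementary worry is false positives. A back-of-the-envelope calculation (treating the 34 choices, unrealistically, as 34 independent tests of a true null at alpha = 0.05):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, k = 0.05, 34

# Chance that at least one of k independent null tests comes out "significant".
analytic = 1 - (1 - alpha) ** k
print(analytic)  # a bit over 0.8

# The same quantity by simulation: uniform null p-values, check for any hit.
p = rng.uniform(size=(100_000, k))
print((p < alpha).any(axis=1).mean())
```

And real forking paths are worse than this sketch suggests, since the choices compound multiplicatively rather than just additively.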

Preregistration is fine, but let me remind all readers that the most important steps in any study are valid and reliable measurements and, where possible, large and stable effect sizes. All the preregistration in the world won’t save you if your measurements are not serious or if you’re studying an effect that is tiny or highly variable. Remember the kangaroo problem.

As I wrote here, “I worry that the push toward various desirable *procedural* goals can make people neglect the fundamental *scientific* and *statistical* problems that, ultimately, have driven the replication crisis.”

The post Another <del>bivariate</del> multivariate dependence measure! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Since you’ve posted much on various independence test papers (e.g., Reshef et al., and then Simon & Tibshirani criticism, and then their back and forth), I thought perhaps you’d post this one as well.

Tibshirani pointed out that distance correlation (Dcorr) was recommended; we proved that our oracle multiscale generalized correlation (MGC, pronounced “Magic”) statistically dominates Dcorr, and empirically demonstrated that sample MGC nearly dominates.

The new paper, by Cencheng Shen, Carey Priebe, Mauro Maggioni, Qing Wang, and Joshua Vogelstein, is called “Discovering Relationships and their Structures Across Disparate Data Modalities.” I don’t have the energy to read this right now but I thought it might interest some of you. I’m glad that people continue to do research on these methods as they would seem to have many areas of application.
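
For readers who want the flavor of the baseline method being dominated, here is a short sketch of sample distance correlation — the standard Székely–Rizzo V-statistic, not the MGC statistic from the paper:

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation between two 1-D samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(3)
x = rng.normal(size=500)
print(dcor(x, x ** 2))                  # clearly positive: nonlinear dependence
print(dcor(x, rng.normal(size=500)))    # small for independent noise
```

Unlike Pearson correlation (which is near zero for x vs. x²), population distance correlation is zero only under independence — which is why it makes a natural benchmark for these newer methods.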

My quick comment to Vogelstein on this paper was to suggest changing “whether” to “how” in the first sentence of the abstract.

The post The Paper of My Enemy Has Been Retracted appeared first on Statistical Modeling, Causal Inference, and Social Science.

And I am pleased.

From every media outlet it has been retracted

Like a van-load of p-values that has been seized

And sits in star-laden tables in a replication archive,

My enemy’s much-prized effort sits in tables

In the kind of journal where retraction occurs.

Great, square stacks of rejected articles and, between them, aisles

One passes down reflecting on life’s vanities,

Pausing to remember all that thoughtful publicity

Lavished to no avail upon one’s enemy’s article—

For behold, here is that study

Among these ranks and banks of duds,

These ponderous and seemingly irreducible cairns

Of complete stiffs.

The paper of my enemy has been retracted

And I rejoice.

It has gone with bowed head like a defeated legion

Beneath the yoke.

What avail him now his awards and prizes,

The praise expended upon his meticulous technique,

His cool new experiments?

Knocked into the middle of next week

His brainchild now consorts with the bad theories

The sinkers, clinkers, dogs and dregs,

The Evilicious Edsels of pseudoscience,

The bummers that no amount of hype could shift,

The unbudgeable turkeys.

Yea, his slim paper with its understated abstract

Bathes in the blare of the brightly promoted Air Rage Paper,

His unmistakably individual new voice

Shares the same scrapyard with a forlorn skyscraper

Of Ovulation and Voting,

His honesty, proclaimed by himself and believed by others,

His renowned abhorrence of all p-hacking and pretense,

Is there with the Collected Works of Marc Hauser—

One Hundred Years of Research Misconduct,

And (oh, this above all) his sensibility,

His sensibility and its hair-like filaments,

His delicate, quivering sensibility is now as one

With Edward Wegman’s Wikipedia cribs,

A volume graced by the descriptive rubric

‘The simplex method visits all 2^d vertices.’

Soon now a paper of mine could be retracted also,

Though not to the monumental extent

In which the chastisement of correction has been meted out

To the paper of my enemy,

Since in the case of my own study it will be due

To a miscoded variable, a confusion over data—

Nothing to do with merit.

But just supposing that such an event should hold

Some slight element of sadness, it will be offset

By the memory of this sweet moment.

Chill the champagne and polish the crystal goblets!

The paper of my enemy has been retracted

And I am glad.

(Adapted from “The Book of My Enemy Has Been Remaindered,” the classic poem by the brilliant Clive James.)

The post Bayes, statistics, and reproducibility: My talk at Rutgers 5pm on Mon 29 Jan 2018 appeared first on Statistical Modeling, Causal Inference, and Social Science.

Bayes, statistics, and reproducibility

The two central ideas in the foundations of statistics—Bayesian inference and frequentist evaluation—both are defined in terms of replications. For a Bayesian, the replication comes in the prior distribution, which represents possible parameter values under the set of problems to which a given model might be applied; for a frequentist, the replication comes in the reference set or sampling distribution of possible data that could be seen if the data collection process were repeated. Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data. We consider the implications for the replication crisis in science and discuss how scientists can do better, both in data collection and in learning from the data they have.

Here are some relevant papers:

Philosophy and the practice of Bayesian statistics (with Cosma Shalizi) and rejoinder to discussion

Beyond subjective and objective in statistics (with Christian Hennig)

The post When the appeal of an exaggerated claim is greater than a prestige journal appeared first on Statistical Modeling, Causal Inference, and Social Science.

Are you aiming to write a blog post soon on the recent PNAS article of ‘When the appeal of a dominant leader is greater than a prestige leader’? The connection it points out between economic uncertainty and preference for dominant leaders seems intuitive – perhaps a bit too intuitive. The “Edited by Susan T. Fiske” usually makes me suspicious, but I couldn’t find much wrong with the content. It’d be interesting to hear your blogged thoughts on the paper.

My reply: I suppose that as an exploratory analysis it’s fine. This sort of thing could never get published in a good political science journal—it’s more of a blog post, as it were—but it’s fine to get stuff like that out there. The key in communicating this sort of result is to place it in the speculative zone.

Let’s take a look. Here’s the abstract of the paper:

We examine why dominant/authoritarian leaders attract support despite the presence of other admired/respected candidates. Although evolutionary psychology supports both dominance and prestige as viable routes for attaining influential leadership positions, extant research lacks theoretical clarity explaining when and why dominant leaders are preferred. Across three large-scale studies we provide robust evidence showing how economic uncertainty affects individuals’ psychological feelings of lack of personal control, resulting in a greater preference for dominant leaders. This research offers important theoretical explanations for why, around the globe from the United States and Indian elections to the Brexit campaign, constituents continue to choose authoritarian leaders over other admired/respected leaders.

I have no problem with the first two sentences of this abstract. But then:

– “we provide robust evidence . . .” No.

– “how economic uncertainty affects . . .” No. Strike that causal language.

– “This research offers important theoretical explanations . . .” I doubt that. But “important” is a subjective term. I guess the authors think it’s important.

And, by the way, since they mentioned the recent U.S. elections, and setting aside the fact that Clinton won the popular vote, who were those “other admired/respected leaders” who were beat out by Trump? Jeb Bush? Ted Cruz? Hillary Clinton? Were those people really “admired/respected leaders”? I don’t know that the leading anti-Brexit campaigners were so admired or respected either.

My real concern here is not political. Rather, it’s a point of science communication: If you see an interesting pattern in observational data, it doesn’t seem to be enough to just say that. Instead you have to make inflated claims about “robust evidence” (i.e., a bunch of p-values less than 0.05 found within a garden of forking paths) and “important theoretical explanations.” Without those big claims, I can’t imagine such a paper appearing in PPNAS in the first place.

To put it another way, there’s a selection effect. If you find a cool data pattern and present it as such, you probably won’t get much attention. But if you wrap it in the garb of scientific near-certainty, there’s a chance you could hit the media jackpot. The incentives are all wrong.

The post NSF Cultivating Cultures for Ethical STEM appeared first on Statistical Modeling, Causal Inference, and Social Science.

NSF Cultivating Cultures for Ethical STEM (CCE STEM [science, technology, engineering, and mathematics])

Funding: The maximum amount for 5-year awards is $600,000 (including indirect costs) and the maximum amount for 3-year awards is $400,000 (including indirect costs). The average award is $275,000.

Deadline: Internal Notice of Intent due 02/07/18

Final Proposal due 04/17/18

Limit: One application per institution

Summary: Cultivating Cultures for Ethical STEM (CCE STEM) funds research projects that identify (1) factors that are effective in the formation of ethical STEM researchers and (2) approaches to developing those factors in all the fields of science and engineering that NSF supports. CCE STEM solicits proposals for research that explores what constitutes responsible conduct for research (RCR), and which cultural and institutional contexts promote ethical STEM research and practice and why.

Successful proposals typically have a comparative dimension, either between or within institutional settings that differ along these or among other factors, and they specify plans for developing interventions that promote the effectiveness of identified factors.

CCE STEM research projects will use basic research to produce knowledge about what constitutes or promotes responsible or irresponsible conduct of research, and how to best instill students with this knowledge. In some cases, projects will include the development of interventions to ensure responsible research conduct.

I don’t know what to think about this. On one hand, I think ethics in science is important. On the other hand, it’s not clear to me how to do $275,000 worth of research on this project. On the other hand—hmmm, I guess I should say, back on the first hand—I guess it should be possible to do some useful qualitative research. After all, I think a lot about ethics and I write a bit about the topic, but I haven’t really studied it systematically and I don’t really know how to. So it makes sense for someone to figure this out.

There’s also this:

What practices contribute to the establishment and maintenance of ethical cultures and how can these practices be transferred, extended to, and integrated into other research and learning settings?

I’m thinking maybe the Food and Brand Lab at Cornell University could apply for some of this funding. At this point, they must know a lot about what practices contribute to the establishment and maintenance of ethical cultures and how these practices can be transferred, extended to, and integrated into other research and learning settings. You could say they’re the true experts in the field, especially since that Evilicious guy has left academia.

Or maybe an ethics supergroup could be convened. (Here’s a related list, of which my favorite is the Traveling Wilburys. Sorry, Dan!)

In all seriousness, I really don’t know how to think about this sort of thing. I hope NSF gets some interesting proposals.

The post To better enable others to avoid being misled when trying to learn from observations, I promise not be transparent, open, sincere nor honest? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Now I was fortunate at the time to be with a group where mentoring and enabling better research was job one. Most of those I worked with were clinical research fellows and their clinical mentors. They had fairly secure funding so they were not too worried about getting grants. Usually we would work together for a few years and publications did seem to be mostly about sharing knowledge rather than building careers. For instance, if something went wrong in a study, a paper often was written along the lines of identifying what went wrong and how others could avoid the same problems. Perhaps in such a context, learning how they thought – so I could better engage with them – made more sense. This maybe is rare in academia. Additionally, for much of the research, available or well recognised statistical methods were lacking at the time. Or maybe I just dropped the ball after leaving?

Before I left that group, I did provide some training on exploratory cognitive analysis to my replacement. It involved debriefing with them after meetings with clinical researchers, discussing what the clinical researcher seemed to grasp, what they would likely accept as suggestions, what they did not seem to grasp, and how to further engage with them. In some of the meetings, I arranged to purposely show up late or to leave early so my replacement got a few solo runs at this. I was also reminded of this activity in a JSM presentation by Christine H. Fox, The Role of Analysis in Supporting Strategic Decisions. Afterwards I mentioned this to my replacement, and they said it has worked well for them. However, there does not seem to be much written on the need to learn how others think when engaging in statistical consultation or collaboration. Or how to avoid providing “insultation” instead of consultation – as Xiao-Li Meng put it here. Now, is it likely that “not being transparent, open, sincere nor honest” will help here?

John’s assumptions and goals are interesting. “All I assume is that the world is a better place if non-experts’ [non-statisticians] beliefs are informed by true experts’ [statisticians] claims”. Additionally accepting the goal of “ensure[ing] that non-experts’ [non-statisticians’] beliefs mirror experts’ [statisticians’] claims in many ways” – the ethics of science communication boils down to identifying what is permissible in doing this. This makes sense – if non-statisticians were to think like statisticians, that would enable better research – right? The challenge arises because of how most non-statisticians may learn from statisticians and the preponderance of a “folk philosophy of science [statistics]” among them. This allows “agnotology” (the deliberate creation and maintenance of ignorance) by careerist statisticians, or by those who get something from providing statistical “help” (e.g., co-authorship), to flourish. Take, for instance, the folk philosophy of statistics’ belief that there are correct and incorrect methods in statistics. Given that belief, once you have involved a statistician or statistically able consultant/collaborator, you can be fully confident the methods used in your study *are* correct. The uncertainty has been fully laundered. An example of this I ran into often at one university was: “After our discussion about analysing my observational study, I talked to Dr. X (a statistician with a more senior appointment in the university), and they assured me that step-wise regression is a correct way to analyse my observational study. They also said your concerns about confounding just over-complicate things and make for a lot of unnecessary delay and work. In fact, I was able to do the analysis on my own, we have already submitted the paper and it looks like it will be accepted.”

Below I discuss John’s model regarding how non-statisticians become informed by statisticians (both competent ones, incompetent ones and agnotological ones), try to identify current deficiencies in the statistical discipline’s ability to deliver if John’s model is applicable and discern areas where I argue such a model is or is not too wrong to be helpful.

John offers a two-step model of how non-experts learn from experts. It involves a sociological premise: experts’ institutional structures are such that claims meet scientific “epistemic standards” for proper acceptance; and an epistemological premise: if some claim meets scientific epistemic standards for proper acceptance, then I should accept that claim as well. John’s requisite expectations here being “The institutional structures of epistemic groups typically aim to ensure that its members assert and accept claims only when those claims do, in fact, meet relevant standards. … To think that the institutional structures of an epistemic group are working well, then, is to think that when members of the community assert or agree on some claim, the best explanation for them agreeing on a claim with this specific empirical content is that the claim meets relevant epistemic standards”. Or perhaps more pointedly, “trust in individual scientists [statisticians] involves an assumption that institutions ensure that some “social type” – the accredited scientist [statistician] – is trustworthy”. In this model, what non-experts are taken to learn is not any content but rather the apparent consensus of experts, along with an assurance that certain groups of experts’ “epistemic standards” should be taken as convincing. A nice counterexample John gives involves astrological claims that do meet astrological epistemic standards but which few of us would find convincing.

In regard to statistics, I think both premises are questionable. For instance, the methods statisticians choose or recommend are largely idiosyncratic. On the other hand, when there is a wide statistical consensus, such as the foolishness of gambling, many or most people ignore it. The first premise – that statisticians’ institutional structures are such that claims meet statistical “epistemic standards” – is certainly rather weak to non-existent. There are now some accreditation processes, but they are fairly new and unproven? (Personally I have been wary of them.) Generally accepted standards for analysing studies like yours, such as “Just do these two t.tests and call me if the headache from the journal reviewers still persists after re-submission”, don’t seem promising – at least for challenging research. So there simply isn’t any assessment that a given statistician is competent (will be aware of a consensus) or obligated to adhere to any such consensus. That is, there is currently no widely accepted “system of conventions that normally enable individuals to recognize valid science [statistics] despite their inability to understand it”, to borrow a phrase from Dan Kahan.

The ASA statement on p-values could be seen as a first step towards strengthening the first premise. It is not at all clear that the first premise can be adequately strengthened to meet John's required expectations for it to work; time may tell. I don't yet see assurances that "the best explanation for them agreeing on a claim with this specific empirical content is that the claim meets relevant epistemic standards" – rather, just a symposium with invited speakers and journal review and publication of invited and open submissions. What could possibly go wrong? Least publishable units exaggerating distinctions and individual contributions? There are perhaps worse possibilities: "statistical methods [could] be subject to regulatory approval".

Now the exploratory cognitive analysis discussed earlier in this post is primarily focused on discovering whether a researcher is ready/capable, along with our being able to discern how to get across to them some amount of the actual content of statistical reasoning. On the other side of the interaction, for the statistical reasoning to be sensible, we have to be ready/capable, along with the researcher being able to discern how to get across to us some amount of the actual content of domain reasoning. That of course always needs to be checked: did I get it, did they get it, and finally did we get it "together"? That is, exploratory cognitive analysis is all about getting it "together". This is very different from John's two-step process of simply discerning whether there is a statistical consensus on an approach and whether such a consensus should convince one.

When statistical consulting or collaborating involves open-ended research, it is a process rather than a set of findings that researchers need to be informed of *and* involved in. John's two-step model seems inadequate here. The researchers need to grasp the essential content of the statistical reasoning and how it applies, rather than just being informed that there is a consensus that these statistical techniques are appropriate for their research and that most accredited statisticians would/should recommend them. It definitely involves getting partially into the design and choice of the workings of the sausage [science] factories – science being a process that forces and accelerates getting less wrong about things, one that occasionally comes to a rest but never ends.

The two-step model seems more appropriate for static claims made after a research consensus has been reached and the researchers want to convey those claims – in statistics, perhaps applications that are fairly straightforward. Statistical defaults. Here, with strengthened first and second premises in statistics, exploratory cognitive analysis could easily be skipped. Not all applications of statistics are "rocket science", and there simply are not enough competent statisticians for even a small percentage of applications – nor do the economics support it. For many applications, as Hadley Wickham argues, safe statistics would arguably be better than statistical abstinence.

The post To better enable others to avoid being misled when trying to learn from observations, I promise not be transparent, open, sincere nor honest? appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post State-space modeling for poll aggregation . . . in Stan! appeared first on Statistical Modeling, Causal Inference, and Social Science.

As part of familiarising myself with the Stan probabilistic programming language, I replicate Simon Jackman’s state space modelling with house effects of the 2007 Australian federal election. . . .

It’s not quite the model that I’d use—indeed, Ellis writes,

“I’m fairly new to Stan and I’m pretty sure my Stan programs that follow aren’t best practice, even though I am confident they work”—but it’s not so bad to see a newcomer work through the steps on a real-data example. This might inspire some of you to use Stan to fit Bayesian models on your own problems.

Again, the key advantage of Stan is flexibility in modeling: you can (and should) start with something simple, and then you can add refinements to make the model more realistic, at each step thinking about the meanings of the new parameters and about what prior information you have. It’s great if you have strong prior information to help fit the model, but it can also help to have weak prior information that regularizes—gives more stable estimates—by softly constraining the zone of reasonable values of the parameters. Go step by step, and before you know it, you’re fitting models that make much more sense than all the crude approximations you were using before.
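To make the underlying idea concrete outside of Stan, here is a minimal sketch in plain Python of the simplest version of such a poll-aggregation model: latent voting intention follows a random walk, and each poll is a noisy observation of it plus a fixed pollster-specific "house effect". The function name, the example polls, and the house-effect and variance values are all illustrative assumptions, not Jackman's or Ellis's actual setup; in the Stan version the house effects and the random-walk scale would be parameters with priors, estimated rather than fixed.

```python
# A minimal sketch (plain Python, not Stan) of poll aggregation:
# latent voting intention xi_t follows a random walk, and each poll is a
# noisy read of xi_t plus a fixed "house effect" for its pollster.
# All numeric settings below are illustrative assumptions.

def kalman_poll_filter(polls, house_effects, state_sd=0.25, poll_sd=1.5):
    """polls: list of (day, house, value in percent), in time order.
    Returns the filtered mean of latent support after each poll."""
    mean, var = 50.0, 100.0  # diffuse prior on initial support (percent)
    estimates = []
    last_day = 0
    for day, house, value in polls:
        var += state_sd ** 2 * (day - last_day)    # random-walk drift
        last_day = day
        y = value - house_effects.get(house, 0.0)  # remove house bias
        k = var / (var + poll_sd ** 2)             # Kalman gain
        mean += k * (y - mean)                     # update toward the poll
        var *= (1 - k)
        estimates.append((day, mean))
    return estimates

polls = [(1, "A", 52.0), (3, "B", 49.0), (4, "A", 51.5)]
est = kalman_poll_filter(polls, {"A": 1.0, "B": -0.5})
```

This scalar filter is exactly the "start with something simple" step; moving it into Stan is what then lets you treat the house effects and scales as unknowns with priors and refine from there.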
