Comments on: Bayesian inference completely solves the multiple comparisons problem

By: Frank Harrell

Frank Harrell — Wed, 23 Aug 2023 11:45:58 +0000

I’m having trouble following the simulation that Andrew did. I see code for one instance but not code for how the 1,000,000 repetitions were done.

By: Dilsher Dhillon

Dilsher Dhillon — Fri, 19 Jul 2019 15:47:35 +0000

In reply to Andrew. Thank you, Andrew. That gives me a lot to think/read about. Hopefully I'm able to grasp this concept enough to convince my department to do the same.

By: Andrew

Andrew — Thu, 18 Jul 2019 20:28:12 +0000

In reply to Justin Smith. Justin: Here's what I think about Occam's Razor.

By: Andrew

Andrew — Thu, 18 Jul 2019 20:05:34 +0000

In reply to Dilsher Dhillon. Dilsher: To start, I'd take a look at what we do here. I'd start by fitting the model using stan_glmer in rstanarm in R (or glmer in lme4 in R, if you prefer) with varying intercepts for all batches of main effects and interactions of interest. No F-tests, no pairwise comparisons; just estimate everything.
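One way to write down the model suggested here for the 6 oils × 6 seals × 2 blocks design from the question. This is only a sketch: the symbols, and the choice to give the two blocks a varying intercept as well, are assumptions made for illustration and are not stated in the thread.

```latex
\begin{aligned}
y_{ijk} &= \mu + \alpha_i + \beta_j + \gamma_{ij} + \delta_k + \epsilon_{ijk},
  \qquad i, j = 1,\dots,6,\quad k = 1, 2 \\
\alpha_i &\sim \mathrm{N}(0, \sigma_{\text{oil}}^2), \quad
\beta_j \sim \mathrm{N}(0, \sigma_{\text{seal}}^2), \quad
\gamma_{ij} \sim \mathrm{N}(0, \sigma_{\text{oil:seal}}^2) \\
\delta_k &\sim \mathrm{N}(0, \sigma_{\text{block}}^2), \quad
\epsilon_{ijk} \sim \mathrm{N}(0, \sigma_y^2)
\end{aligned}
```

In lme4/rstanarm formula syntax this would correspond to something like `y ~ (1 | oil) + (1 | seal) + (1 | oil:seal) + (1 | block)`, where `y`, `oil`, `seal` and `block` are hypothetical column names. Each batch's scale is estimated from the data, so the 36 interaction estimates are partially pooled toward zero rather than compared pairwise. With only two blocks the block-level scale is weakly identified, so a fixed block effect is a reasonable alternative.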

By: Dilsher Dhillon

Dilsher Dhillon — Thu, 18 Jul 2019 15:41:49 +0000

Great intuitive post. I also read your paper “Why we (usually) don’t have to worry about multiple comparisons.” I’m an applied statistician with MS-level classical training in statistics. I’m trying to steer away from frequentist methods towards Bayesian inference, but there are situations where I’m quite lost.

I work in an industry where 2-factorial ANOVA experiments are common. For example, we’re testing 6 oils with 6 seals for a total of 36 ‘treatments’ in 2 blocks (so 2 replications, one in each block). In a typical frequentist scenario, we’d fit a linear model with fixed effects for Oil, Seal and Block plus an Oil:Seal interaction (as we believe oils behave differently at different seals). The typical procedure in a frequentist scenario is an F-test followed by Tukey pairwise comparisons.

I’m having a hard time building the blocks of this model in a Bayesian context. How do I prevent the multiple comparisons problem here?
1. According to the post, as long as I have somewhat informative priors, we should avoid that problem. This is in itself a daunting task: neither I nor the subject matter expert can say what we believe about each interaction coefficient. From the paper I referenced, the solution is then to build a multilevel model.
2. Do we then treat both factors, oil and seal, as random effects? If we’re interested in making statements only about oils, do we then treat oil as a random effect and seal and block as fixed effects? I’m finding it hard to build an intuition for this model and to justify the choice of priors and/or random/fixed effects.

Hoping this gets some attention and discussion!

By: Justin Smith

Justin Smith — Sat, 21 Jul 2018 01:28:48 +0000

What does Occam’s Razor say about this? Is it more likely to get just the likelihood correct, or the likelihood and the (one of infinitely many possible) prior correct? How about for multidimensional problems?

Justin

By: Andrew

Andrew — Mon, 09 Jan 2017 20:33:45 +0000

In reply to Sean. Sean: No, R (and Stan) parameterize the normal in terms of location and scale.
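A quick numerical check of the location/scale convention, sketched in Python rather than R (an assumption made for illustration; `random.gauss(mu, sigma)` takes a standard deviation, just as `rnorm(n, mean, sd)` does). The value `tau = 0.5` is arbitrary.

```python
import random
import statistics

random.seed(1)
tau = 0.5       # scale (standard deviation), as in theta <- rnorm(N, 0, tau)
N = 200_000     # illustrative number of draws

# random.gauss takes (mean, standard deviation), like R's rnorm(n, mean, sd)
theta = [random.gauss(0.0, tau) for _ in range(N)]

sample_sd = statistics.pstdev(theta)
sample_var = statistics.pvariance(theta)
print(f"sd  = {sample_sd:.3f}  (tau   = {tau})")
print(f"var = {sample_var:.3f}  (tau^2 = {tau ** 2})")
```

The sample standard deviation lands near tau and the sample variance near tau^2, which is what you expect when the third argument is a scale rather than a variance.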

By: Sean

Sean — Mon, 09 Jan 2017 19:56:39 +0000

theta <- rnorm(N, 0, tau)

Shouldn't tau be tau^2?

By: e f d p

e f d p — Mon, 29 Aug 2016 22:03:06 +0000

In reply to Daniel Lakeland. I remember it being really common to threshold on effect size (usually an effect size of at least 1 in log2 space, so 2x/0.5x) in addition to FDR, at least back in the microarray days.
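The two-stage filter described above might look roughly like this. It is a sketch only: the gene names, p-values and fold changes are made up, and Benjamini-Hochberg is assumed as the FDR procedure since the comment does not say which one was used.

```python
import math

def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the BH procedure rejects at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k (1-based) with p_(k) <= k * q / m
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Hypothetical (gene, p-value, fold change) triples, made up for illustration
genes = [
    ("g1", 0.0001, 4.0),
    ("g2", 0.0040, 1.2),
    ("g3", 0.0300, 0.4),
    ("g4", 0.2000, 3.0),
    ("g5", 0.0100, 0.5),
]
passes_fdr = benjamini_hochberg([p for _, p, _ in genes], q=0.05)
hits = [
    name
    for (name, _, fc), ok in zip(genes, passes_fdr)
    if ok and abs(math.log2(fc)) >= 1.0   # at least 2x up or 0.5x down
]
print(hits)
```

Here `g2` passes the FDR step but is dropped for having a small effect, while `g4` has a large fold change but never clears the FDR step.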

By: Mat

Mat — Sun, 28 Aug 2016 14:56:51 +0000

In reply to Slutsky.

Yes, really good paper. There’s a phrase in the paper that sums up the difficulty for me: “the specifications are neither statistically independent nor part of a single model”.

Perhaps in principle we can write a bigger model that all the specifications fit into, but in practice it quickly becomes impossible. Even multilevel models, to handle different subgroups of the data, can be computationally prohibitive if we already have a highly parameterised model. So it’s good to have people thinking about how to perform inference in the presence of methodological ambiguity.

By: Daniel Lakeland

Daniel Lakeland — Fri, 26 Aug 2016 21:51:33 +0000

In reply to Daniel Lakeland.

@Rahul, it’s not clear to me if Bob considers LISP within the scope of what he’s talking about when he says “I honestly don’t know what the functional programming languages are good for”. I think he might be referring more to things like Haskell where there really *are no* side effects (ok, technically what there is is a state that you pass in and then a new state comes out).

Common LISP is *not* a pure functional language. It’s got all kinds of looping and iterative constructs, setf to set the value of just about anything, lots of printing and formatting stuff, explicit file-IO mechanisms. But, it’s cognitively very different from other languages. One big reason is the macro system where the language itself is used to write new language constructs.

LISP’s biggest problem seems to be that it just requires more than a High School education and a “learn X in 24 hours” level of knowledge in order to hire someone to do even basic work on a LISP program. For example, NASA hired some great programmers who came up with some kind of fancy autonomous robot planning and execution software which was fantabulously effective at what it was supposed to do, won some award, and then was canned by JPL. One big issue is no-one cheap could work on it. Same thing happened to Yahoo Store (programmed in LISP, bought out by Yahoo, then re-built over a decade in C++ by which time Yahoo had pretty much crashed and burned but just didn’t know it yet)

http://www.flownet.com/gat/jpl-lisp.html

I really want Julia to take-off, but I won’t be switching over until a very extensive version of ggplot2 is available ;-) and that gets right at your point about the extensive user community.

By: Rahul

Rahul — Fri, 26 Aug 2016 14:58:23 +0000

In reply to Daniel Lakeland. @Bob Great points. I've found the size, activity and expertise of the user community is a HUGE factor in how useful a language or package turns out.

>“but I’ve never run across a practical problem where they made a lot of sense.”

Lisp was used in AutoCAD and Emacs. Dunno if it meets the approval of functional purity.

By: Bob Carpenter

Bob Carpenter — Fri, 26 Aug 2016 13:00:20 +0000

In reply to Daniel Lakeland.

My point exactly. The functional programming wonks tend to extoll the purity of functional programming, but then wind up using languages with side effects like Common Lisp. Same way people extoll the virtues of Bayes (consistency, taking into account posterior uncertainty in inference), then fit models using variational inference and perform inference with the point estimates. I’m not saying that’s a bad way to go in either case if you want to get real work done (and as an aside, I think the focus on getting practical work done rather than scolding everyone for doing things wrong is why machine learning is eating statistics’ lunch). The pure solutions are often too burdensome.

I also think it comes down to “horses for courses” as the British say. There are better matrix and math libraries in FORTRAN than in Common Lisp. And better internet libraries for things like unicode and sockets and threading in Java. And C++ has both everything and the kitchen sink in the language *and* in the libraries. It’s enough to make a programming language theorist like me cry.

I honestly don’t know what the functional programming languages are good for. I love them dearly as theoretical constructs (I used to do programming language theory and a lot of typed and untyped lambda calculus), but I’ve never run across a practical problem where they made a lot of sense. Part of that’s just the lack of large communities of programmers in fields I’m interested in.

What I miss the most from the functional programming languages is lambda abstraction. It’s a complete hack in C++ with bind, functors, function pointers, etc. and also a hack in Java, too (like functors in C++, but maybe they’re anonymous). You get some of their benefit with continuations in languages like Python and R and even in C++ you can code things in a continuation-passing style (sort of built into my brain after all that Prolog tail recursion).

Guaranteed, though, no matter what language you choose for a project, a chorus of naysayers will tell you that you should’ve chosen another one. My experience at Julia Con was everyone telling me we should’ve coded Stan in Julia!

By: Slutsky

Slutsky — Fri, 26 Aug 2016 08:22:34 +0000

In reply to Mat.

I think this paper by Simonsohn et al. may be helpful:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2694998

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 19:23:30 +0000

In reply to Daniel Lakeland.

Though, my point wasn’t really that Bayes was “more pure” just that when you don’t have Bayes available, a lot of what people do looks like “reinventing” it, just like when you don’t have hash tables in your language, a lot of what people do is write an ad-hoc informally specified hash table library. The history of Common Lisp, Perl, Python, R, Ruby, Lua etc is that we get closer to just having all that stuff built-in. That’s also the history of BUGS, JAGS, Stan

Having done a bunch of Prolog programming back 15 years ago, I honestly think that Stan feels like Prolog at a certain level. Prolog is a method for specifying depth-first-searches through discrete spaces, Stan is a method for specifying Hamiltonian searches through continuous parameter spaces. If you don’t acknowledge the search mechanism you can be doomed in either one.

A Prolog program basically says “find me values of variables that make certain statements true” and Stan basically says “find me values of variables that make certain statements plausible”

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 18:39:22 +0000

In reply to Daniel Lakeland.

I’ve watched my wife working with these canned packages… click here to select from normalization method Foo, Bar, Baz, Quux each one linked to a particular paper advocating for one vs another. Then run an enormous least-squares based ANOVA of 20,000 dimensions, you can choose from taking the logarithm of the data first, or doing some other nonlinear transform, or just doing the raw data, you get 20,000 p values, you can use 1 of 15 different FDR correction procedures, get a principal component analysis, sort by post-FDR p value, sub-select by gene ontology group…

Does the user have ANY idea what any of it means? When I ask my wife she gets really frustrated, because she’s basically just doing what someone told her to do. In the end after making 10 or so ad-hoc choices are you discovering anything, or just finding data to justify your preconceived notions, or maybe just arriving at a puzzle-solution to get past the grant/publication gatekeepers?

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 18:28:15 +0000

In reply to Daniel Lakeland.

An effective counter to the “is or is not a difference” framing is to ask them “do you care if adding hormone x produces 1 extra transcript per decade?” Part of what’s going on is that there’s so much noise in Biology that if you find an effect, the effect is usually big enough that it’s practically significant (2x or 5x the transcript rate, for example). Of course, as Andrew’s regularly pointed out, type M errors are very common in p-value driven “discovery”.

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 18:25:27 +0000

In reply to Bob Carpenter.

There is SO much effort put into these various packages on bioconductor and so forth, a bunch of smart people who seem to be led down the garden path here. Commercial and open-source projects for lots of really complex canned analyses with little web-based dashboards for point-and-click implementation of the garden of forking paths.

Typical bench Biologists really don’t know much more than a t-test and a chi-squared in my experience. There are more specialist people who call themselves Bioinformaticists, and of course there are Biostatisticians. Neither of those groups would typically have the skills to, say, do sterile tissue culture or debug PCR primer problems, but their level of statistical sophistication is much higher. So it’s a matter of specialization. Bioinformatics/statistics seems to be predominantly about fancy methods for adjusting p-values though. The whole framework is about “there is an effect” vs “there isn’t an effect”. I’ve even had biologists *explicitly* tell me “I don’t really care whether it’s big or small, just whether adding hormone X produces a difference” or the like.

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 18:06:38 +0000

In reply to Bob Carpenter.

I don’t think Common Lisp was ever supposed to be “pure” it was just supposed to be designed to handle a really wide variety of stuff at the language level. Whereas, in the FORTRAN and C world all that stuff is handled at the library level and therefore re-implemented by each project in a buggy way.

But, fair enough. I’m being perhaps too glib.

By: Bob Carpenter

Bob Carpenter — Thu, 25 Aug 2016 17:56:34 +0000

In reply to Daniel Lakeland. Yet another generic that's nowhere near universally true. Ironically, all the "pure" functional programming languages wound up adding imperative features (for example, Common Lisp) due to the cost of random access in pure functional programs. So you get this motivation for "pure" functional programming, then you go and write quicksort anyway. That's just how programming goes. You often get the same thing in stats---lots of Bayesian handwaving, then a bunch of point estimates. When I was at Carnegie Mellon and Edinburgh, I used to work on logic programming (and some functional programming). It used to drive me crazy when people referred to Prolog as a logical programming language. It was basically depth-first search with backtracking (unless you used cut, which was necessary for efficient non-tail recursion) and if you weren't aware of this and didn't code to this, your Prolog programs were doomed. (O'Keefe's Craft of Prolog remains a great programming book, as does Norvig's book on Lisp.)

By: Bob Carpenter

Bob Carpenter — Thu, 25 Aug 2016 17:53:05 +0000

In reply to Daniel Lakeland. This is the point I was trying to make in my comment about ranking and followup PCR experiments. It would be even better to include the costs of followup rolled in the decision. But it usually amounts to the same thing because the followup budgets tend to be pretty fixed (it's not like they see the results and say, "hey, let's followup on 100 genes because they all seem like good bets"---they just don't have the money, postdocs, or time). And that's a great use of generics in the statement about "Biologists"! There are lots of biologists who know quite a bit more about stats than that, and many more biostatisticians. The causes for this also involve supervisor and editor pressure to report things in terms of p-values. Mitzi used to work in this area and the editors for a paper she was working on for the ModEncode project for Science insisted that they provide p-values for their exploratory data analysis (which involved a clustering model nobody believed was a good model of anything, but useful for exploratory data analysis).

By: Rahul

Rahul — Thu, 25 Aug 2016 17:36:07 +0000

In reply to Rahul.

@Anoneuoid

Sure. If you make no. of papers purely the metric, sure you get truckloads of crap.

But no one’s contesting that. I hope.

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 17:20:02 +0000

In reply to Rahul. Anoneuoid: I didn't want to harp on that too much. It is a significant issue though, possibly the single biggest issue in modern academic science.

By: Anoneuoid

Anoneuoid — Thu, 25 Aug 2016 14:10:38 +0000

In reply to Rahul.

>”final metric of performance”

What if the final metric of performance is how many papers you produce, and due to confusion, the limiting factor to publishing papers has become how many “statistically significant” results you can generate?

I think this is what is going on, the way of assessing progress used by many researchers these days is itself fatally flawed. That is exactly why NHST is such a destructive force, it allows you to *think* you are learning something when you are just producing massive amounts of garbage. I am by far not the first to come to this conclusion. It was pretty much Lakatos’ position in the 1970s…:

“one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and thereby a semblance of ‘scientific progress’ where, in fact, there is nothing but an increase in pseudo-intellectual garbage…this intellectual pollution which may destroy our cultural environment even earlier than industrial and traffic pollution destroys our physical environment.”

Lakatos, I. (1978). Falsification and the methodology of scientific research programmes. In J. Worrall & G. Currie (Eds.), The methodology of scientific research programmes: Imre Lakatos’ philosophical papers (Vol. 1). Cambridge, England: Cambridge University Press. http://strangebeautiful.com/other-texts/lakatos-meth-sci-research-phil-papers-1.pdf (pages 88-89)

By: Mat

Mat — Thu, 25 Aug 2016 07:27:04 +0000

In reply to Daniel Lakeland. Thanks Daniel, I'll take a look at that link.

By: Rahul

Rahul — Thu, 25 Aug 2016 04:50:14 +0000

In reply to Rahul.

@Daniel

You could be right. I don’t know.

My point is that I see a lot more arguments claiming superiority of an approach based on the merits of its logical structure and what information it uses than arguments based on the actual performance on outcomes that matter.

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 04:46:15 +0000

In reply to Rahul. There seems to be a lot of effort to do all sorts of sophisticated stuff within the FDR framework. It's hardly the same old flawed boring stuff; it's like this towering edifice of papers and software packages all designed to solve the wrong problem automatically. At least that's my impression.

By: Rahul

Rahul — Thu, 25 Aug 2016 03:32:32 +0000

In reply to Daniel Lakeland.

In general, I often see people critique an existing method on grounds such as: (a) it makes ad hoc choices, (b) it involves dichotomization of what is truly a continuous outcome, (c) it does not use some info that is available & “ought” to be relevant to the problem, or (d) an alternative approach can be deemed “more logical”.

My point is the challenger method must show that the improvement (if any) in the final metric of performance you get by removing shortcomings is worth the additional effort.

True, that many areas use simplistic, ad hoc predictive methods that are leaving information unused on the table or have alternatives with richer structure or logical foundations.

But unless the alternative can demonstrate that all its ingenuity can actually translate into better performance, or in fact, performance so much better that it is worth the additional modelling effort & the cognitive switching cost, I think practitioners are entirely justified in sticking to their same, old, flawed boring methods.

By: Daniel Lakeland

Daniel Lakeland — Thu, 25 Aug 2016 03:11:43 +0000

In reply to Rahul.

It’s not so much the “screening” as the “followup” that’s the costly bit. Screening comes in where you collect all this data that you do the FDR junk on… The FDR junk then gives you a list. How many items on that list should you follow-up? Which ones? All of them? Presumably there is both prior information, as well as “effect sizes” which should be taken into account to determine which are the most promising. None of that occurs in FDR, whose purpose is really just a way of limiting a YES/NO hypothesis testing procedure to produce a more limited number of YES.

In practice, sometimes this step is done where Biologists typically look at these lists and then just select stuff they think looks good (ie. stuff involved in pathways they can imagine being “real”)

It’s making the core concept driving your whole process into a dichotomization that bothers me. There’s a lot more information there.

By: Rahul

Rahul — Thu, 25 Aug 2016 02:45:14 +0000

In reply to Daniel Lakeland. But if your cost-per-screening-a-gene is roughly constant won't the results be roughly the same as what you got from a naive FDR based approach? Do you have evidence that a nuanced approach gives better results?

By: Daniel Lakeland

Daniel Lakeland — Wed, 24 Aug 2016 19:38:58 +0000

In reply to Mat.

No question that very large models that are mixtures over multiple plausible explanations and etc etc are computationally challenging. Thinking about your situation and trying to come up with one model that incorporates many possibilities while remaining tractable can help.

If you’re talking about time-series regression with discontinuity, you might find my recent post on placing priors on functions interesting:

http://models.street-artists.org/2016/08/23/on-incorporating-assertions-in-bayesian-models/

You could, for example, compute some summary statistic of the function behavior in the vicinity of the change-point, and assert something in the model about its plausibility. For example, abs(f(somewhatafter)-f(atchange)) ~ gamma(a,b) to assert that you think there should be on average a nonzero slope up or down in this region, where you choose a,b so as to constrain the slopes you’re entertaining.
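A rough sketch of what an assertion like that contributes to the log posterior. The curve `f`, its parameters, and the values `a = 2`, `b = 4` are hypothetical choices for illustration, and gamma(a, b) is taken in the shape/rate parameterization that Stan uses.

```python
import math

def gamma_logpdf(x, shape, rate):
    """Log density of a gamma(shape, rate) distribution at x."""
    if x <= 0:
        return -math.inf
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)

def f(t, level, jump, decay, t_change=0.0):
    """Hypothetical outcome curve: flat before the change-point, then an
    exponential approach to the new level `level + jump`."""
    if t < t_change:
        return level
    return level + jump * (1.0 - math.exp(-decay * (t - t_change)))

def assertion_logdensity(params, a=2.0, b=4.0, t_change=0.0, dt=1.0):
    """Log-density term for |f(somewhat after) - f(at change)| ~ gamma(a, b)."""
    diff = abs(f(t_change + dt, *params) - f(t_change, *params))
    return gamma_logpdf(diff, a, b)

# params = (level, jump, decay)
no_change = assertion_logdensity((1.0, 0.0, 1.0))
moderate = assertion_logdensity((1.0, 0.5, 1.0))
huge = assertion_logdensity((1.0, 10.0, 1.0))
print(no_change, moderate, huge)
```

With these values the term rules out exactly zero change, favours moderate changes, and penalizes very large ones. In a full model this term would be added to the log prior and log likelihood for the same parameters.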

You could do a similar thing for the endpoints of the time interval. Calculate a mean value pre-policy, and calculate a mean value in the asymptotic far-post-policy time period, and assert something about the plausibility of the change in asymptotic behavior.

Often this kind of very flexible model can be a substitute for a discrete set of plausible simple models. I mention this, because I recently tried to generalize the state-space vote-share model that was posted a couple weeks back using a Gaussian process, and found it computationally infeasible for 400+ days of data. Looking for a way to specify time-series function behaviors other than Gaussian processes led me to the suggested idea.

By: Mat

Mat — Wed, 24 Aug 2016 18:10:09 +0000

In reply to Daniel Lakeland. I agree in principle that we should approach these analytical decisions where possible as an opportunity for continuous model expansion. It very quickly gets computationally infeasible though. It all depends what model of Bayesian machinery you have though, I guess.

By: Mat

Mat — Wed, 24 Aug 2016 17:12:18 +0000

In reply to Anoneuoid.

No, I just was trying to find a shorthand way to say that after looking at the results we changed our belief to “there doesn’t appear to be anything going on here, or at least our data isn’t good enough to estimate it (and we have pretty good data)”.
I’m using hypothesis tests and p-values to go with the flow, because it’s not my project.

By: Daniel Lakeland

Daniel Lakeland — Wed, 24 Aug 2016 16:49:43 +0000

In reply to Bob Carpenter.

I think what’s really going on in these Biology experiments is an attempt to minimize the cost of finding something useful. A lab wants to discover a gene they can target for drugs to prolong remission of cancer, find out which pathways are involved in recruiting cells for bone repair, detect a cancer before it’s very advanced, discover how an environmental toxin modifies kidney cells… whatever. They’ve got a large number of genes that might be involved in the process. They have a small pile of money to do research. They need to focus their money somewhere. What subset of “every gene in the genome” should they dump their money on?

Unfortunately for Biologists if they understand anything about statistics it’s what you learn from “Statistics for Biologists 101” which is more or less a t-test and a chi-squared test, so their concept is “there are true targets and there are false targets, and I want to minimize the number of false targets while keeping most of the true targets so that I waste as little money as possible” so they describe this to a classically trained statistician and wind up with FDR procedures.

I think this makes them feel like they’re scientists because they’re “discovering” things, as opposed to engineers who “minimize costs”. But, the truth is, *at this stage in the process* they’re looking to trade off costs vs number of useful results. If they formulated their problem in that way, they’d wind up with a more consistent and logical framework.
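The cost-versus-useful-results framing can be made concrete with a toy decision rule. This is a sketch under loud assumptions: the probabilities, payoffs, costs and budget are invented, and the greedy rule (rank by expected net value per unit cost) is one simple heuristic, not something proposed in the thread.

```python
def choose_followups(candidates, budget):
    """Greedy selection: follow up candidates in order of expected net value
    per unit cost, skipping any whose expected net value is not positive."""
    scored = []
    for name, p_useful, payoff, cost in candidates:
        net = p_useful * payoff - cost
        if net > 0:
            scored.append((net / cost, name, cost))
    scored.sort(reverse=True)
    chosen, spent = [], 0.0
    for _, name, cost in scored:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

# Hypothetical (gene, P(useful), payoff if useful, follow-up cost)
candidates = [
    ("geneA", 0.60, 100.0, 10.0),
    ("geneB", 0.05, 100.0, 10.0),
    ("geneC", 0.30, 100.0, 10.0),
    ("geneD", 0.90, 20.0, 10.0),
    ("geneE", 0.20, 400.0, 25.0),
]
chosen, spent = choose_followups(candidates, budget=45.0)
print(chosen, spent)
```

Note that `geneE` is selected despite a low probability of being useful, because its payoff is large, while `geneD` is nearly certain but not worth much. A yes/no list from an FDR procedure cannot express that trade-off.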

By: Daniel Lakeland

Daniel Lakeland — Wed, 24 Aug 2016 16:37:43 +0000

In reply to Mat.

“Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” — Greenspun’s Tenth Rule of Programming

Which is to say, aren’t you basically reinventing an ad-hoc version of a Bayesian analysis? The core concept of a Bayesian analysis is basically this: “Here’s a bunch of different possibilities I’m willing to entertain, and some data, which of these possibilities should I entertain after I’ve seen the data?”

The ABC version of how Bayesian analysis works is “pick values from the prior distribution, compute the consequences of the model, and weight these consequences by the likelihood of the error you see between the computed consequences and data” which is to say that ABC looks a lot like “try this out and see how it works, try that out and see how it works, try this other thing out and see how it works…”
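The recipe in the previous paragraph can be sketched in a few lines. As written it amounts to importance sampling from the prior; classical ABC rejection sampling would instead simulate fake data and keep draws that land within a tolerance of the real data. The data, the normal(0, 5) prior and the known error scale are all assumptions made for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical data: 20 noisy measurements of an unknown mean (true value 2.0)
true_mu, sigma = 2.0, 1.0
data = [random.gauss(true_mu, sigma) for _ in range(20)]
ybar = sum(data) / len(data)

def log_weight(mu):
    """Log-likelihood of the errors between the model's consequence (mu)
    and each data point, under normal(0, sigma) measurement error."""
    return sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data)

# 1. Pick values from the prior: mu ~ normal(0, 5)
draws = [random.gauss(0.0, 5.0) for _ in range(50_000)]

# 2-3. Compute consequences and weight them by the likelihood of the errors
logw = [log_weight(mu) for mu in draws]
shift = max(logw)                       # subtract the max for numerical stability
w = [math.exp(lw - shift) for lw in logw]

post_mean = sum(wi * mu for wi, mu in zip(w, draws)) / sum(w)
print(f"sample mean = {ybar:.3f}, weighted posterior mean = {post_mean:.3f}")
```

For this conjugate toy problem the exact posterior mean is `20 * ybar / (20 + 1/25)`, so the weighted average can be checked against it. Most prior draws get negligible weight, which is why this brute-force approach scales badly and samplers like Stan's are used instead.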

So, you have some kind of time-series of health outcomes, and a policy change-point, you have various ideas about how to describe the data itself (reasonable bandwidths to get summary statistics or whatever), and various ideas about how the summary statistics should change post-policy-change (broken-stick linear models, exponential decay to a new stable state, an initial confusion causing worse outcomes followed by a decay to a new stable state where things are better, nothing happens at all post change…)

Take your existing set of analyses as a kind of model-search, incorporate all the model-search ideas into one Bayesian model, put broad priors on it all, run the Bayesian machinery, and get a posterior distribution over what seems to be true.

By: Daniel Lakeland

Daniel Lakeland — Wed, 24 Aug 2016 16:23:19 +0000

In reply to Daniel Lakeland.

Left out the link: http://models.street-artists.org/2016/08/16/some-comments-on-coxs-theorem/

By: Daniel Lakeland

Daniel Lakeland — Wed, 24 Aug 2016 16:21:34 +0000

In reply to Rahul.

The question to ask is “what values of the parameter am I willing to entertain?” Any prior that places the set of values that you’re willing to entertain in the high probability region is usually good enough. So for Bill’s example of the speed of light

normal(3e8,3e7) m/s is probably good (actual value technically defined to be 2.9979246e8)

but exponential(1/3e8) is just fine for lots of purposes. Imagine that you have a measurement system capable of measuring to 10% accuracy. After 1 measurement your posterior is going to be +- 0.3e8 so the fact that the exponential prior includes values of 3 m/s out to 9e8 m/s that would still be considered within a high probability region is irrelevant.
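A grid approximation makes the comparison concrete. This is a sketch: the single measurement of 3.1e8 m/s is invented, and the 10% accuracy is treated as a normal error with standard deviation 0.3e8.

```python
import math

def normal_logpdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def exponential_logpdf(x, rate):
    return math.log(rate) - rate * x if x >= 0 else -math.inf

def grid_posterior(log_prior, measurement, meas_sd, grid):
    """Posterior mean and sd on a grid, given one normal measurement."""
    logp = [log_prior(c) + normal_logpdf(measurement, c, meas_sd) for c in grid]
    top = max(logp)
    w = [math.exp(lp - top) for lp in logp]
    total = sum(w)
    mean = sum(wi * c for wi, c in zip(w, grid)) / total
    var = sum(wi * (c - mean) ** 2 for wi, c in zip(w, grid)) / total
    return mean, math.sqrt(var)

grid = [i * 1e6 for i in range(1, 1201)]   # 1e6 .. 1.2e9 m/s
measurement, meas_sd = 3.1e8, 0.3e8        # hypothetical single 10% measurement

informative = grid_posterior(
    lambda c: normal_logpdf(c, 3e8, 3e7), measurement, meas_sd, grid)
broad = grid_posterior(
    lambda c: exponential_logpdf(c, 1 / 3e8), measurement, meas_sd, grid)

print(f"normal(3e8,3e7) prior:   mean {informative[0]:.3e}, sd {informative[1]:.2e}")
print(f"exponential(1/3e8) prior: mean {broad[0]:.3e}, sd {broad[1]:.2e}")
```

Under these assumptions the two posteriors are close: the exponential prior shifts the posterior mean only slightly below the measurement and leaves a posterior sd of about 0.3e8, so the prior's long tails are swamped by one measurement.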

The typical problem with experts in a field without high precision information is that their priors are TOO NARROW, and so if you ask several experts you’ll get a bunch of tight intervals that don’t even overlap. The solution is to use a prior that encompasses everything they all say.

What’s the a-priori frequency of failures of space-shuttle launches? The NASA managers said 1e-7, the rocket booster engineers said 0.1, the solution? beta(.5,5) or something thereabouts, or maybe uniform(0,.5) or exponential(1/.05) truncated to 1, or normal(0.05,.1) truncated to [0,1]… anything that includes all the experts’ favorite regions of the parameter space is reasonable.

My recent set of posts and commentary about Cox’s theorem make it clear, the “correctness” of a Bayesian probability model is a question of whether the state of information it’s conditional on is well summarized by the choice of distributions, not whether there’s an objective fact about the world that you’re trying to match.

By: Anoneuoid

Anoneuoid — Wed, 24 Aug 2016 15:39:28 +0000

In reply to Mat.

>”we found no strong evidence of anything (statistically insignificant and widely varying results)… Inevitably a couple of effects have significant p-values”

This is strange to me. This comment seems to say that statistical significance means finding strong evidence (it doesn’t) but also realizes that statistical significance is just something inevitable that happens by looking at some data.

By: Mat

Mat — Wed, 24 Aug 2016 15:35:26 +0000

In reply to Andrew.

Thanks for the link. That “multiverse” idea really taps into the problem, and it would be great to see a blog on this. The thing about the “garden of forking paths” and multiple comparisons is that in some contexts applied researchers get the impression that being thorough is good practice – such as, for example, robustness and sensitivity analyses after finding an effect. At other times the message seems to be, “make one choice at each decision point, before seeing the data, or you’ll be guilty of p-hacking”.

My personal preferred practice when faced with what you call “whimsical” decisions is to code up a loop and try every combination. But then, having obtained all this data, there doesn’t appear to be any widely recognised statistical way to analyse and report it. Model averaging??

By: Andrew

Andrew — Wed, 24 Aug 2016 15:14:03 +0000

In reply to Mat. Mat: Long answer, I'd fit a multilevel model. Short answer, I'd take what you did already and call it a multiverse analysis as in this paper (which I guess I should blog sometime). If you want a single p-value, I recommend taking the mean of the p-values from all the different analyses you did.
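The short answer can be sketched as a loop over analysis choices followed by an average. Everything here is illustrative: the data are simulated with no true effect, the "analysis" is a plain difference in means on either side of the cutoff rather than a real regression discontinuity fit, and the bandwidths and subgroups are made-up choices.

```python
import itertools
import math
import random
import statistics

random.seed(2)

# Hypothetical data, simulated with NO true jump at the cutoff x = 0:
# (running variable x, subgroup label, outcome y)
rows = [
    (random.uniform(-1, 1), random.choice("AB"), random.gauss(0, 1))
    for _ in range(2000)
]

def jump_and_p(below, above):
    """Difference in means across the cutoff and its two-sided
    normal-approximation p-value."""
    diff = statistics.fmean(above) - statistics.fmean(below)
    se = math.sqrt(
        statistics.variance(above) / len(above)
        + statistics.variance(below) / len(below)
    )
    return diff, math.erfc(abs(diff / se) / math.sqrt(2))

# The "whimsical" choices: every combination gets analysed
bandwidths = [0.25, 0.5, 1.0]
samples = [("A",), ("B",), ("A", "B")]

results = []
for bw, groups in itertools.product(bandwidths, samples):
    below = [y for x, g, y in rows if -bw <= x < 0 and g in groups]
    above = [y for x, g, y in rows if 0 <= x <= bw and g in groups]
    diff, p = jump_and_p(below, above)
    results.append((bw, groups, diff, p))

mean_p = statistics.fmean([p for _, _, _, p in results])
print(f"{len(results)} specifications, mean p-value = {mean_p:.2f}")
```

Reporting the whole `results` table alongside the mean keeps every specification visible instead of highlighting the few that happen to cross a significance threshold.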

By: Mat

Mat — Wed, 24 Aug 2016 15:08:47 +0000

I was wondering what you would do about multiple testing in a “garden of forking paths” scenario where you have several analyses, but they are not on different independent datasets? I’m currently working as a statistical consultant on a behavioural economics project where it was hypothesised there would be an effect of some government policy intervention on a range of health outcomes. We did a regression discontinuity design and we found no strong evidence of anything (statistically insignificant and widely varying results), so we weren’t going to even bother thinking about multiple comparisons.

But my clients want to write this up as a negative result, and so they should. So, with the intention of showing we have been thorough, and not with any intention of p-hacking, and because we really didn’t have any strong beliefs on various analytical decisions we had made, we redid the analysis with a range of “garden of forking paths” style tweaks to the sample selection, the bandwidths, covariates etc. Now we have a couple of hundred estimates on the same dataset, and they are very noisy, with different signs etc. Inevitably a couple of effects have significant p-values.

What would be the proper way to approach this? It doesn’t seem to fit into a hierarchical framework.

By: Martha (Smith)

Martha (Smith) — Wed, 24 Aug 2016 14:52:05 +0000

In reply to Jacob Egner. Thanks for noticing and fixing the typo.

By: Bill Harris

Bill Harris — Wed, 24 Aug 2016 14:42:14 +0000

In reply to Daniel Lakeland.

Oops: leaving out “tau” wasn’t nearly as embarrassing as changing units in mid-phrase. I knew that. Sorry.

Daniel: thanks for the link to your experiment that’s nicely matched to just this question.

By: Gavin Kelly

Gavin Kelly — Wed, 24 Aug 2016 11:45:12 +0000

In reply to numeric.

This is what I was trying to query in my previous comment. The R package I’m using does something similar to Andrew’s approach, shrinking the effect size towards zero, but then does an additional round of FDR, and I worry that this might introduce some form of double jeopardy. So, inspired by this post, I intend to run a few simulations.
We (I’m a maths guy, pretending to be a statistician, advising bioinformaticians how to help scientists) tend to use FDRs in Andrew’s screening sense, where we threshold on a fixed FDR, and then rank on (if anything) effect size. But more commonly the hits are then taken as the starting point for the scientist building a narrative, and performing more focused experiments.
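The screening workflow described above (threshold on a fixed FDR, then rank the survivors by effect size) can be sketched roughly as follows. This assumes the Benjamini-Hochberg procedure as the FDR step, which the comment does not specify; the function names, gene labels, and numbers are all hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the BH procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def screen_and_rank(names, p_values, effects, fdr=0.05):
    """Keep the BH hits, then sort them by absolute effect size."""
    keep = benjamini_hochberg(p_values, fdr)
    hits = [(n, e) for n, e, k in zip(names, effects, keep) if k]
    return sorted(hits, key=lambda h: abs(h[1]), reverse=True)


# Hypothetical screen of five genes.
names = ["g1", "g2", "g3", "g4", "g5"]
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
effects = [0.5, -2.0, 1.2, 0.3, 0.1]
hits = screen_and_rank(names, p_values, effects)
print(hits)
```

Note that g3 has a larger effect than g1 but never reaches the ranking step, because the FDR threshold is applied first. If the effects had already been shrunk towards zero, that ordering of steps is where any double penalty would show up in a simulation.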

By: Shravan

Shravan — Wed, 24 Aug 2016 10:50:52 +0000

In reply to Shravan.

“Half the linguists at Ohio State boycotted my job talk there”

I got my PhD from OSU, and we actually overlapped in 1997 when you visited for your talk. We had dinner together (with Carl Pollard). I must say I am surprised that even OSU had that reaction to you and Bird and Sproat. But it looks like the so-called non-linguists won in the end there, because the current profs in computational linguistics at OSU would by the 1997 standards also not be considered linguists.

About the NSF rejections, it seems that reviewers routinely use the review process to control the direction that the field is going in (the same happens in Europe). The program officer/funding agency often also has entrenched interests and/or has political concerns in mind when they make their decisions. Science is more about politics and control; the science happens in spite of this whole expensive machinery for funding and jobs.

It is probably for the best that you ended up where you are and not linguistics.

By: Bob Carpenter

Bob Carpenter — Wed, 24 Aug 2016 10:21:50 +0000

In reply to numeric. I know this is the standard practice, but it's not what people actually do in biology. Nobody cares about false-discovery rates per se, they only care about ranking the genes (or isoforms or SNPs or whatever) either in phenotype/disease studies or differential expression. Why? Because you get tens of thousands of potential targets, but only have time to run a few followup PCR experiments. So what do they do? They rank by p-values or probability of differential expression. I believe it would be fruitful for them to think in terms of ranking. And how do I know they don't really care about false-discovery rates? Because they never seem to worry about calibration.

By: Bob Carpenter

Bob Carpenter — Wed, 24 Aug 2016 10:12:05 +0000

In reply to Shravan.

It really depended where you were at and who you were talking to. Yes, I’d get papers and grant proposals back with reviews that said only “this isn’t linguistics” (one NSF review actually said my work was “too European”—I went to Edinburgh for grad school). Half the linguists at Ohio State boycotted my job talk there, saying I wasn’t a linguist (Steve Bird and Richard Sproat, also on the shortlist, got the same treatment—others who brought us in wanted to hire a computational linguist who’d worked on morpho-phonology). NSF wouldn’t even review one of Carl Pollard’s and my grant proposals, insisting HPSG wasn’t linguistics. The joint program between CMU and Pitt broke up when they wouldn’t let one of Carl’s and my students do a qual paper on HPSG (this time they told us what kind of non-linguistics it was — engineering). But after all that MIT Press published my semantics book. And I got tenure (though in a philosophy department, not a linguistics department—I wouldn’t have stood a chance in linguistics).

All of the academic fields I’ve been exposed to (linguistics, cognitive psych, computer science, and statistics) have a certainty bias along with a sweep-the-dirt-under-the-rug bias so that people don’t hedge claims or list shortcomings for fear of papers being rejected.

By: Rahul

Rahul — Wed, 24 Aug 2016 07:50:07 +0000

In reply to Keith O’Rourke.

@Keith

I know what you mean. I think the problem here is that most math curricula were designed in a pre-computer algebra age and are relics of a time not so long ago when, if you couldn’t do an integral or a PDE you couldn’t just fire up Wolfram Alpha or Mathematica or Maxima and ask it to do it for you. Best you could do was run over to the library’s Reference Section and pore over a thick dusty Handbook of integrals or tabulated PDE solutions and boundary conditions.

But you are right, a lot of math tools being taught to undergrads are archaic. The Math Departments have been too slow to change.

By: Rahul

Rahul — Wed, 24 Aug 2016 07:45:28 +0000

In reply to Rahul.

@Shravan

Sorry, maybe I wasn’t clear about my point: I didn’t mean that the Math class would make them literate about p-values, or t-tests or the intricacies of the pitfalls etc.

My simple point is that for @digithead it will be really hard to get these concepts across to a student body that is, as you put it, “elected out of being numerate”.

Ergo, if you make them take Calculus or similar foundational courses then an “Advanced Quant Methods” course will start making a lot more sense. It makes “digithead”’s life easier and the students learn more too.

Of course, there may be some students who are just not capable or motivated to deal with Calculus-101 but then I wonder whether we should really be asking them to work on Soc Sci problems that actually need “Advanced Quant Methods”.

By: Shravan

Shravan — Wed, 24 Aug 2016 06:43:47 +0000

In reply to Rahul. This is not how I use priors, at least. I always do a sensitivity analysis with different prior specifications to check whether my posterior changes as a result of the prior. After three years of doing this, I have realized that in some of my own datasets (i.e., when I have lots of data) it simply doesn't matter: I can even use the uniform priors that Andrew now rejects and get the same result. In cases where data is sparse, the situation is completely different of course and one has to proceed carefully and systematically.
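A toy version of this kind of sensitivity check, assuming a normal likelihood with known standard deviation and a conjugate normal prior on the mean (a simplification chosen so the posterior has a closed form; it is not the model used for the datasets mentioned above). The three priors and the summary numbers are made up.

```python
def normal_posterior(prior_mean, prior_sd, ybar, sigma, n):
    """Posterior mean and sd for a normal mean with known sigma."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = n / sigma ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    return post_mean, post_var ** 0.5


# Hypothetical prior specifications: (prior mean, prior sd).
priors = [(0.0, 1.0), (0.0, 10.0), (5.0, 1.0)]


def posterior_mean_spread(ybar, sigma, n):
    """Range of posterior means across the prior specifications."""
    means = [normal_posterior(m, s, ybar, sigma, n)[0] for m, s in priors]
    return max(means) - min(means)


# Same sample mean and sigma, but very different amounts of data.
sparse_spread = posterior_mean_spread(ybar=2.0, sigma=2.0, n=4)
rich_spread = posterior_mean_spread(ybar=2.0, sigma=2.0, n=4000)
print(sparse_spread, rich_spread)
```

With 4 observations the posterior mean moves by 2.5 units depending on the prior; with 4000 it moves by less than 0.01, which matches the pattern described above where the prior stops mattering once there is enough data.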