Kevin Lewis points us to this paper by Tarun Chordia, Amit Goyal, and Alessio Saretto. I have no disagreement with the substance, but I don’t like their statistical framework with that “false discoveries” thing, as I don’t think there are any true zeros. I believe that most possible trading strategies have very little effect, but I doubt the effect is exactly zero; hence I disagree with the premise that there are “type 1 errors.” For the same reason, I don’t like the Bayes factors in the paper. Their whole approach using statistical significance seems awkward to me, and I think that in the future they’d be able to learn more using multilevel models and forgetting thresholds entirely.

Again, the Chordia et al. paper is fine for what it is; I just think they’re making their life more difficult by using this indirect hypothesis-testing framework, testing hypotheses that can’t be true and oscillating between the two inappropriate extremes of theta=0 and theta being unconstrained. To me, life’s just too short to mess around like that.

I have also seen some adopt a Bayesian framework with non-informative priors, gleefully pointing out that the Bayesian intervals are almost exactly the same as frequentist intervals and that the posterior probabilities or Bayes factors lead to the same conclusions.

But they also seem convinced that all the problems of forking paths, the inappropriate extremes of theta=0 and theta being unconstrained, inappropriate thresholds, etc., which they (now?) associate with frequentist methods, have been sidestepped or resolved, even though the switch to Bayesian methods had no perceptible impact.

Then, when challenged, they respond with vague (or sometimes actual) references to experts explicitly or implicitly supporting their view.

Actually I need to get a set of good references for such folks…

This has been an issue in finance for many years. For instance, Andrew Lo has a paper from 1994 on data snooping bias:

https://www.cfapubs.org/doi/pdf/10.2469/cp.v1994.n9.8

There are relatively few asset pricing papers that use multilevel models. One that stands out (and that I found very interesting) is below:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2451582

I’ve fit some of these models in Stan. One of the issues is that there is a significant amount of correlation between stocks, so it helps to have some kind of factor structure. CAPM is popular in the literature (regress stock returns against market returns). This is the approach taken in the above paper.
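For concreteness, here’s a minimal numpy sketch of the two ideas in that paragraph; it is not the Stan model from the paper, the data are simulated, and all the parameter values are made up. It runs a per-stock CAPM-style regression against a market factor and then applies empirical-Bayes shrinkage to the alphas as a cheap stand-in for the partial pooling a multilevel Stan model would do jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_periods = 50, 500

# Simulated one-factor world: each stock's excess return loads on the
# market, plus a small nonzero alpha (no true zeros, just small effects).
market = rng.normal(0.005, 0.04, n_periods)
betas_true = rng.normal(1.0, 0.3, n_stocks)
alphas_true = rng.normal(0.0, 0.001, n_stocks)
returns = alphas_true + np.outer(market, betas_true) \
          + rng.normal(0.0, 0.05, (n_periods, n_stocks))

# Stage 1: per-stock OLS on the market factor (the CAPM regression).
X = np.column_stack([np.ones(n_periods), market])
coef, *_ = np.linalg.lstsq(X, returns, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1]

# Stage 2: empirical-Bayes shrinkage of the raw alphas toward their
# grand mean, weighted by each alpha's sampling variance -- a crude
# approximation to the partial pooling a multilevel model does jointly.
resid = returns - X @ coef
xbar, sxx = market.mean(), ((market - market.mean()) ** 2).sum()
se2 = resid.var(axis=0, ddof=2) * (1.0 / n_periods + xbar**2 / sxx)
tau2 = max(alpha_hat.var() - se2.mean(), 0.0)   # between-stock variance
shrink = tau2 / (tau2 + se2)                    # in [0, 1)
alpha_pooled = alpha_hat.mean() + shrink * (alpha_hat - alpha_hat.mean())
```

The pooled alphas are pulled toward the grand mean, so noisy raw estimates are damped rather than being declared “discoveries” or “non-discoveries” at a threshold.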

The more typical strategy in the asset pricing literature is to make a portfolio that is long some percent and short some percent of the stocks based on a characteristic, and then test the properties of this portfolio. You would then regress the returns against the returns of this portfolio (and the market and other known factors). It’s a bit of a sloppy procedure, IMO: you get the returns of a portfolio that uses asset returns as inputs, and then you regress the asset returns on the portfolio and get a beta relative to it. I’ve tried doing the same thing in Stan, though without the beta part.
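A hedged sketch of that long-short construction, with simulated data and made-up choices (a hypothetical characteristic, 20% cutoffs); the point is to show the circularity in code, not any real strategy:

```python
import numpy as np

rng = np.random.default_rng(1)
n_stocks, n_periods = 100, 250

characteristic = rng.normal(size=n_stocks)          # e.g., book-to-market
returns = rng.normal(0.0, 0.03, (n_periods, n_stocks))  # pure noise here

# Long the top 20% of stocks by the characteristic, short the bottom 20%,
# equal-weighted within each leg.
order = np.argsort(characteristic)
k = n_stocks // 5
long_leg, short_leg = order[-k:], order[:k]
factor = returns[:, long_leg].mean(axis=1) - returns[:, short_leg].mean(axis=1)

# The "sloppy" step: regress each stock's returns on a factor that was
# itself built from those same returns.
X = np.column_stack([np.ones(n_periods), factor])
coef, *_ = np.linalg.lstsq(X, returns, rcond=None)
betas = coef[1]
```

Even with pure-noise returns, the long-leg stocks mechanically load positively on the factor and the short-leg stocks negatively, which is exactly the circularity complained about above.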

Or maybe it isn’t “garden of forked paths” at all, which is about as provable as “ghosts made my analysis wrong.” Perhaps data mining works on stable patterns but fails when patterns aren’t stable. The same “forked path” methods applied to sunspot data would find a pattern, and the pattern would be reproducible and predictive.

The real conundrum is: what is it about current statistical practice and philosophy that fools the staterrati wholesale into thinking they have stable patterns when they rarely do?

Anon:

You don’t understand. Forking paths are the default. See here. The analogy is this: preregistration is like randomized assignment is like random sampling. Suppose someone does a survey on a haphazard sample. Then the default assumption is that there is sampling bias. Suppose someone does an experiment in which people choose their own treatments. Then the default assumption is that there is selection bias. Similarly, suppose someone does a data analysis without preregistration. Then the default assumption is that there are forking paths.

To put it another way, “forking paths” is the broad space of all possible analyses plans. “No forking paths” is the zero-measure submanifold that arises under preregistration or other rigid rules of data processing and analysis.

The analogy to ghosts is incorrect because ghosts don’t exist.

And yet if someone preregisters their stock market analysis, they’re still very unlikely to make money reproducibly, while if forks, hacks, and data fishing are applied to sunspot data, they’ll still find the ~11-year cycle, which is reproducible and predictive.

Here’s your definition of “garden of forked paths”:

Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values.

The exact same analysis and data can be good or bad depending on the existence or not of some ephemeral, ghostly even, “potentials.” How would you prove this?

You can’t prove it by looking at preregistered studies. If every study were preregistered and people’s careers depended on the outcome, they may very well only preregister studies that their background and intuition tell them are likely to be a success (sunspot data, for example). That doesn’t prove your “garden of forked paths”; it just proves people’s prior info can be pretty good sometimes. Indeed, preregistered studies can be better or worse for a whole host of reasons, none of which have anything to do with those “potentials.”

Anon:

As I’ve discussed (although perhaps not in that particular paper with Loken, I don’t remember), these problems become much more serious when measurements are noisy, and when effects are small and highly variable. Various statistical methods which might have worked well in an environment of large, persistent patterns and accurate measurements, won’t work so well in a noisy, highly variable environment.

Also as I’ve said many times, preregistration is just one thing. Preregistration can resolve the problem of forking paths, but it doesn’t, by itself, do anything about effect sizes, variability, or measurement. One thing that has frustrated me, for example, with the Wansink story is that it’s been presented as a problem with “p-hacking,” and I think the big problems there are weak theory and poor measurement.

Or to put it another way.

Some highly successful analyses can only be described as extreme examples of hacking, fishing, forking, whatever: both the discovery of the double helix structure of DNA (they kept guessing structures until they hit upon one that reproduced the scattering images) and Kepler’s discovery of elliptical orbits (he tried something like 40 other hypotheses before guessing right).

Some highly unsuccessful analyses can only be described as extreme examples of hacking, fishing, forking, whatever. Trying to predict the stock market by data mining past price sequences is a good example.

The only thing that distinguishes these two cases is that one group was analyzing phenomena with stable patterns and the other wasn’t.

If, due to your claim that all analysis not preregistered is suspect because of “potential” choices which were never made, preregistration becomes the norm, people will start doing studies on things their intuition tells them will have stable patterns (or perhaps they’ll do experiments ahead of time to verify the needed stability).

That might be a good thing or a bad thing, but either way, the better results are due to them analyzing stable phenomena and not due to the supposed lack of “potential” analyses which were never performed.

What you’re really doing is slapping the “garden of forked paths” label on any result your intuition tells you isn’t likely to be stable (and it’s conveniently forgotten when your intuition tells you otherwise). It’s an all-purpose explanation which nobody can prove wrong, like WWII aircraft mechanics claiming the engine malfunctioned due to “gremlins.”

Anon:

You write that I “claim that all analysis not preregistered is suspect due to ‘potential’ choices which were never made.” I don’t claim that. I’m making a statement about p-values, which are specifically a statement about the probability distribution of what would’ve been done had the data been different! The assumptions of p-values are very clear. As in all of statistics, the calculation might be close to accurate even when the assumptions are violated. But the assumptions are there.

Again, consider the analogy to random sampling and randomized experimentation. Certain phenomena are so stable that you can estimate them using haphazard samples and observational studies with self-selected treatments. But then you have to be more concerned about what might go wrong. If I see a survey finding with a weird result, and it turns out that the sampling procedure is a mess, then, yes, that’s a concern.

Finally, I’ve written this a few million times too . . . I don’t recommend that analyses generally be preregistered. I’ve worked on hundreds of applied projects and have almost never preregistered. I learn a lot from data exploration, and I recommend data exploration in my textbooks and practice it in my research. But the way to go is not data exploration summarized with a statistically significant p-value. It’s much better to present and analyze all the results.

Andrew,

It is a bit difficult to explain. So bear with me. Looking at the big picture, here is what I claim has happened.

(1) When you intuitively have a Frequentist understanding of probabilities, then even before starting you’ve implicitly assumed a serious amount of stability in nature. Just the idea that the data is a realization of a “data generation mechanism,” or is a realization of a frequency distribution that gets more stable as n approaches infinity, assumes an enormous amount of stability (von Mises was admirably upfront about this compared to most staunch Frequentists). Even if you use non-stationary models, you’re just pushing the stability assumption up a level. So before the modeling gets started, or you’ve even looked at one bit of evidence, you’ve in effect made strong stability assumptions.

(2) When people do their analysis they may check the model assumptions. They may show for example that the data is what’s expected from the distribution chosen or that errors are “normally distributed” and so on.

(3) Here’s where the problem creeps in. That model checking assumes the needed stability; in the vast majority of cases it is in no way evidence for the needed stability. So they think they’ve checked the model when in fact they haven’t checked the one key assumption that determines success in most cases (the stability of the phenomenon).

(4) Since people think this does verify everything they need, they then proceed confidently, as if that stability will be seen in real life. After all, their philosophy and understanding of probabilities, not to mention their stat class, tell them pretty strongly they’ve verified the model.

(5) When they compare their models to reality, though, they’re wrong most of the time. This puts them in a bit of a quandary: their “verified” models don’t work. They may, for example, have calculated a bunch of p-values, got low values, and believed this implies they should be ‘rarely in error,’ to quote Fisher, but it turns out they’re constantly in error.

(6) At this point they should just admit their understanding of probabilities is wrong. Instead, though, they double down on the Frequency interpretation. If those p-values should rarely be in error, but they’re in error all the time, then they must not be “real” p-values somehow. If only we were to calculate the “real” p-values, we would get close to the truth, or so they think.

(7) So for them the way to fix the situation is to enlarge or change the space of possibilities over which p-values should be calculated. The first such enlargement occurred with “p-hacking”: the idea was that p-values should be calculated based on all the different analyses that were done. But that didn’t prove expansive enough to explain away all the failures, so the “garden of forked paths” enlarges this further to include analyses which might have been done but weren’t. Hopefully that will be enough to explain away the failures, but if it isn’t, I have every confidence you can invent another similar idea to do so. You can always find a way to save p-values this way.

The net result is that “garden of forked paths” is a catch-all “gremlins did it” type of explanation for explaining away bad studies, with the convenient property that no one can prove you wrong.

I think the false discovery framework being used is a consequence of the idea that in general, markets are efficient, unless there is an ‘anomaly’. I’m not entirely sure where that idea came from, though. Also relevant is the fact that discovering a small market inefficiency can give you great rewards, if you invest with leverage. In contrast, a medical treatment with a tiny positive effect is practically useless.

I have some positive things to say about theta = 0 and this seems like a good venue to test them out.

I’ll start with Box’s idea that all models are wrong and some are useful. I think that theta = 0 is a useful model for something being unimportant. However, theta != 0 is a poor model for something being important. So the zeroness of theta is semi-useful.

If we want to find out whether theta is positive, we can contrast the data we got with what we would find for negative theta. The hardest negative thetas to separate from positive ones are -epsilon. In the limit epsilon -> 0 we get theta = 0 as the test case. Similarly, if we don’t want to claim something is important when it is really small, theta = 0 is a good way to describe small. A procedure that finds unwanted things even when theta = 0 is likely to find all the more when theta is small but unimportant. If theta = 0 fits the data, then the true size could be negligible for any practical definition of negligible that might apply. [It might still be large, hence the need for interval estimates.]

If we are able to infer that theta != 0, either by a p-value or a Bayes factor, then we are not done; we are just starting. In large samples, rejecting theta = 0 is not even a speed bump, much less a mountain. The true theta might vary from setting to setting and even have a different sign depending on other factors. What is true in our sample from one setting might not transfer to other settings. Also, barely significant theta values can be substantially overestimated and can even have the wrong sign. [See Andrew’s power = 0.06 post (nit-pick: actually closer to 0.055).] Another problem with barely significant findings is that there are usually calibration errors. Confidence intervals commonly approach their nominal coverage levels from below as n -> infinity. Credible intervals might not have a good objective calibration. Then there are issues of multiplicity, p-hacking, forking paths, and incentives. Finally, if theta is not zero, then the discussion has to switch to the sign and/or magnitude of theta and its consequences or uses.
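The overestimation and sign-error points are easy to check by simulation. This is a hedged sketch, assuming a normal estimate whose true effect is about 0.3 standard errors (a value that gives power near 0.06; the exact figure in Andrew’s post may differ slightly): conditioning on statistical significance inflates the magnitude and flips the sign a nontrivial fraction of the time.

```python
import numpy as np

theta, se = 0.3, 1.0    # true effect of 0.3 standard errors: power near 0.06
rng = np.random.default_rng(2)
est = rng.normal(theta, se, 1_000_000)   # a million replicated estimates

sig = np.abs(est) > 1.96 * se            # "statistically significant" ones
power = sig.mean()                       # fraction reaching significance
exaggeration = np.abs(est[sig]).mean() / theta   # magnitude inflation
sign_error = (est[sig] < 0).mean()               # wrong-sign fraction

print(power, exaggeration, sign_error)
```

With these assumed numbers, significant estimates are several times larger in magnitude than the true effect (they must exceed 1.96 se, versus a true 0.3 se), and a meaningful share of them come out negative even though theta is positive.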

My first exposure to theta = 0 was from George Barnard. He taught that the true theta would essentially never be 0 but rejecting it means you now know the sign of theta. The power = 0.06 story means that Barnard’s interpretation has to be reappraised. If the edge of the confidence interval is close to 0, in units of its own width, then we could easily have the wrong sign.

Art:

I think it depends on the application. In some areas of genetics or certain medical treatments, things either work or they don’t. In the social and behavioral science problems that I’ve seen, everything has an effect, and these effects vary a lot by person and context, hence testing theta=0 is pretty much irrelevant.

I think I agree with Art (I wouldn’t have always). If a model has a population theta in it somewhere, then we can safely say: it’s factually wrong. Arguing that effects won’t exactly cancel out, so theta can’t be zero, is true but a bit beside the point – the model was wrong before (basically, that’s what models do), so what is added by a convincing argument that “if every other aspect of the model was true (it surely isn’t), theta = 0 is false in the real world”?

But as a not-true model, theta = 0 is generally simpler – it simply drops an interaction or whatever. It has a shorter description length. Then it’s legitimate to ask: for my purposes, do I have evidence that this simpler model is practically worse than another (also wrong, but more complex) model? I don’t want to estimate the best theta, since I no more believe that there is a “right” theta than I do that theta could be zero. I want to know if I am clearly being harmed by using a simpler wrong model.

Zero is focal in a way that ‘best estimate’ is not – it is a descriptively simpler (false, but maybe still useful) model.