Question about data mining bias in finance

Finance professor Ravi Sastry writes:

Let’s say we have N vectors of data, {y_1,y_2,…,y_N}. Each is used as the dependent variable in a series of otherwise identical OLS regressions, yielding t-statistics on some parameter of interest, theta: {t_1,t_2,…,t_N}.

The maximum t-stat is denoted t_n*, and the corresponding data are y_n*. These are reported publicly, as if N=1. The remaining N-1 data vectors and tests are not reported or disclosed in any way.

Given priors on theta and N, and only y_n*, how do we form a posterior on theta?

I would greatly appreciate any help at all, including proper terminology to describe this problem (which is endemic in academic finance) and pointers to relevant papers.

My reply:

I don’t know the relevant literature but I think you could do this easily enough in a Bayesian context by just treating all the unreported results as missing data (that is, as unknown quantities). You could do this in Stan—almost. (You’d have to assume N is known, but that might not be so horrible in practice.) Or maybe there are ways to do this using a theoretical analysis as well; this could give some insight. It seems like an unpleasant problem, though, if you’re not allowed to see most of the data.
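
As a rough illustration of the theoretical route, here is a minimal sketch under strong simplifying assumptions that are not in the original question: the N t-statistics are independent, each is Normal(mu, 1) with a common mean mu = theta / se, the prior on mu is Normal(0, prior_sd^2), and N is known. Under those assumptions the density of the reported maximum is N * phi(t - mu) * Phi(t - mu)^(N-1), and a grid approximation gives the posterior. The function names and the example numbers are made up for illustration.

```python
import math

def norm_log_pdf(z):
    # Log density of a standard normal at z.
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_t_log_lik(t_max, mu, n_tests):
    # Log density of the largest of n_tests independent Normal(mu, 1) draws:
    # log N + log phi(t - mu) + (N - 1) * log Phi(t - mu).
    # ASSUMPTION: independent tests that all share the same mean mu.
    z = t_max - mu
    cdf = max(norm_cdf(z), 1e-300)  # guard against log(0) far in the tail
    return math.log(n_tests) + norm_log_pdf(z) + (n_tests - 1) * math.log(cdf)

def posterior_on_grid(t_max, n_tests, prior_sd=1.0, lo=-6.0, hi=6.0, steps=1201):
    # Grid approximation to the posterior of mu under a Normal(0, prior_sd^2) prior.
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    log_post = [
        max_t_log_lik(t_max, mu, n_tests) - 0.5 * (mu / prior_sd) ** 2
        for mu in grid
    ]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return grid, [w / total for w in weights]

def posterior_mean(grid, weights):
    # Posterior mean of mu from normalized grid weights.
    return sum(g * w for g, w in zip(grid, weights))

if __name__ == "__main__":
    t_reported = 2.5  # hypothetical reported t-statistic
    for n in (1, 5, 20, 100):
        grid, weights = posterior_on_grid(t_reported, n)
        print(n, round(posterior_mean(grid, weights), 3))
```

With N = 1 this reduces to the usual normal-normal update; as N grows, the same reported t-statistic is weaker evidence for a large mu, so the posterior mean moves down. Unknown N, or tests with different underlying effects, would need a richer model than this sketch.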

7 thoughts on “Question about data mining bias in finance”

  1. My two cents is that the basic design of the model(s) is flawed. Treating the N vectors as N independent regressions seems unhelpful to me. I would approach the N vectors as a pooled time series where the cross-sections are the N independent vectors, the dependent variable (the “data,” whatever it is) is logged or log-centered (to account for differences in scale), and a discrete, ANOVA-like factor numbered from 1 to N identifies the cross-sections. This is a more flexible design in which additional predictors that account for other sources of variation (whatever they are) can easily be added to the model. In addition, contrasts between the cross-sections are straightforward to estimate.

  2. Probably missing from the problem description is the fact that the y_i are likely to be correlated. In this case a frequentist would rely on a Hotelling T^2 analysis of some sort, assuming that N is very large but that the y_i are drawn approximately from a low-rank vector space. This kind of analysis requires you to estimate (or ‘SWAG’) the rank of this vector space. I give an example in Section 3.1 of the vignette to SharpeR, http://cran.r-project.org/web/packages/SharpeR/vignettes/SharpeR.pdf .

  3. Another citation that might be useful.

    Copas, J. B. (2013). A likelihood-based sensitivity analysis for publication bias in meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(1), 47–66. doi:10.1111/j.1467-9876.2012.01049.x
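
The point in comment 2 about correlated y_i can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the thread: the true effect is zero, and the N t-statistics follow a one-factor model in which every pair has the same correlation rho. The function name and parameter values are made up for the example.

```python
import math
import random

def mean_max_t(n_tests, rho, n_sims=2000, seed=0):
    # Average of the largest of n_tests null t-statistics across simulations.
    # ASSUMPTION: t_i = sqrt(rho) * common + sqrt(1 - rho) * noise_i, so each
    # t_i is standard normal and every pair has correlation rho.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        common = rng.gauss(0.0, 1.0)
        best = max(
            math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_tests)
        )
        total += best
    return total / n_sims

if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.8):
        print(rho, round(mean_max_t(50, rho), 2))
```

In this setup, the higher the correlation, the less the maximum is inflated by selection, which is one way to see why the effective number of independent tests matters more than the raw count N.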
