Question about data mining bias in finance

Posted on November 16, 2014 9:02 AM by Andrew

Finance professor Ravi Sastry writes:

Let’s say we have N vectors of data, {y_1,y_2,…,y_N}. Each is used as the dependent variable in a series of otherwise identical OLS regressions, yielding t-statistics on some parameter of interest, theta: {t_1,t_2,…,t_N}.

The maximum t-stat is denoted t_n*, and the corresponding data are y_n*. These are reported publicly, as if N=1. The remaining N-1 data vectors and tests are not reported or disclosed in any way.

Given priors on theta and N, and only y_n*, how do we form a posterior on theta?

I would greatly appreciate any help at all, including proper terminology to describe this problem (which is endemic in academic finance) and pointers to relevant papers.
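
To make the selection effect in this question concrete, here is a minimal sketch. It assumes, purely for illustration, that the N t-statistics are independent and approximately standard normal when theta = 0; neither assumption comes from the question, and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def prob_max_exceeds(n, cutoff=1.96):
    """P(largest of n independent standard-normal t-stats > cutoff) when theta = 0."""
    return 1.0 - NormalDist().cdf(cutoff) ** n


def simulate_max_exceeds(n, cutoff=1.96, reps=5000, seed=0):
    """Monte Carlo check: share of replications whose reported maximum exceeds cutoff."""
    rng = random.Random(seed)
    hits = sum(
        max(rng.gauss(0.0, 1.0) for _ in range(n)) > cutoff
        for _ in range(reps)
    )
    return hits / reps


# Compare the closed form with simulation for a few illustrative values of N.
for n in (1, 5, 20, 100):
    print(n, round(prob_max_exceeds(n), 3), round(simulate_max_exceeds(n), 3))
```

Under these assumptions the chance that the reported maximum clears 1.96 is 1 - 0.975^N, which is roughly 40% at N = 20 even though theta = 0.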

My reply:

I don’t know the relevant literature but I think you could do this easily enough in a Bayesian context by just treating all the unreported results as missing data (that is, as unknown quantities). You could do this in Stan—almost. (You’d have to assume N is known, but that might not be so horrible in practice.) Or maybe there are ways to do this using a theoretical analysis as well; this could give some insight. It seems like an unpleasant problem, though, if you’re not allowed to see most of the data.
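
The missing-data idea can be sketched outside Stan as a small Gibbs sampler. This is only an illustration under strong assumptions that are not in the post: N is known, every t-statistic is an independent Normal(delta, 1) draw (a large-sample stand-in for the t distribution, with delta the standardized theta), and delta has a Normal(0, 1) prior. All names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_below(delta, upper, rng):
    """Draw from Normal(delta, 1) truncated to (-inf, upper) via the inverse CDF."""
    p_upper = STD_NORMAL.cdf(upper - delta)
    u = max(rng.random() * p_upper, 1e-300)  # keep u strictly inside (0, 1)
    return delta + STD_NORMAL.inv_cdf(u)


def gibbs_posterior_draws(t_max, n, iters=4000, burn=500, seed=1):
    """Posterior draws of delta given only the largest of n t-statistics."""
    rng = random.Random(seed)
    delta = 0.0
    draws = []
    for it in range(iters):
        # Step 1: impute the n - 1 unreported t-stats; each must lie below t_max.
        hidden = [draw_below(delta, t_max, rng) for _ in range(n - 1)]
        # Step 2: conjugate update of delta from the N(0, 1) prior and all n values.
        total = t_max + sum(hidden)
        delta = rng.gauss(total / (n + 1), (1.0 / (n + 1)) ** 0.5)
        if it >= burn:
            draws.append(delta)
    return draws


naive = gibbs_posterior_draws(2.5, n=1)    # analysis that takes N = 1 at face value
mined = gibbs_posterior_draws(2.5, n=20)   # same reported t-stat, 19 hidden tests
print(sum(naive) / len(naive), sum(mined) / len(mined))
```

In this toy setup the imputed values all sit below the reported maximum, so they pull the posterior for delta down relative to the N = 1 analysis. Treating N as unknown would require a further step that updates N, which is the part that does not fit easily into Stan.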

7 thoughts on “Question about data mining bias in finance”

Corey on November 16, 2014 at 9:59 am said:

The joint distribution of order statistics is known, so that’s the complete-data likelihood right there. One just has to marginalize out N (making the posterior a discrete sum) and y_(1) through y_(N-1) (and the degrees of freedom of the non-central-t distribution if unknown).

- Corey on November 16, 2014 at 10:09 am said:
  
  Actually, the marginalization of the non-max order statistics is easy, so N is the only challenge.
  
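
The order-statistic route can be sketched on a grid. If the N t-statistics are independent with density f and CDF F given theta, the maximum has density N f(t) F(t)^(N-1), and an unknown N is handled by averaging this over a prior on N. The sketch below makes assumptions that are not in the comments: a Normal(delta, 1) approximation in place of the non-central t, a Normal(0, 1) prior on the standardized effect delta, and a uniform prior on N over 1 to 50. All names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def max_density(t_max, delta, n):
    """Density at t_max of the largest of n iid Normal(delta, 1) draws."""
    z = t_max - delta
    return n * STD_NORMAL.pdf(z) * STD_NORMAL.cdf(z) ** (n - 1)


def grid_posterior(t_max, deltas, prior_delta, prior_n):
    """Normalized posterior weights on the delta grid, with N summed out.

    prior_delta: prior weights aligned with deltas (need not be normalized)
    prior_n: dict mapping each candidate N to its prior probability
    """
    weights = []
    for d, p_d in zip(deltas, prior_delta):
        lik = sum(p_n * max_density(t_max, d, n) for n, p_n in prior_n.items())
        weights.append(p_d * lik)
    total = sum(weights)
    return [w / total for w in weights]


def grid_mean(deltas, post):
    return sum(d * p for d, p in zip(deltas, post))


deltas = [-4.0 + 0.05 * i for i in range(201)]      # grid from -4 to 6
prior_delta = [STD_NORMAL.pdf(d) for d in deltas]   # assumed N(0, 1) prior
uniform_n = {n: 1.0 / 50 for n in range(1, 51)}     # assumed prior on N

naive_post = grid_posterior(2.5, deltas, prior_delta, {1: 1.0})
mined_post = grid_posterior(2.5, deltas, prior_delta, uniform_n)
print(grid_mean(deltas, naive_post), grid_mean(deltas, mined_post))
```

Because F(t)^(N-1) falls as delta rises, any prior mass on N > 1 shifts the posterior toward smaller effects than the face-value N = 1 analysis. The sketch ignores correlation among the y_i, which would change the distribution of the maximum.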
Frank on November 16, 2014 at 10:45 am said:

This paper may be of interest:

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2249314

Thomas on November 16, 2014 at 12:27 pm said:

My two cents is that the basic design of the model(s) is flawed. Treating the N vectors as N independent regressions seems unhelpful to me. I would approach the N vectors as a pooled time series where the cross-sections are the N independent vectors, the dependent variable (the “data,” whatever it is) is logged or log-centered (to account for differences in scale), and a discrete, ANOVA-like factor numbered from 1 to N explains the vector cross-sections. This is a more flexible design in which additional predictors that account for other sources of variation (whatever they are) can easily be added to the model. In addition, contrasts between the cross-sections are straightforward to estimate.

george on November 17, 2014 at 2:34 am said:

Seems a lot like the file-drawer problem … “unpleasant” is a gentle way to describe the difficulty.

Steven on November 17, 2014 at 1:56 pm said:

Probably missing from the problem description is the fact that the y_i are likely to be correlated. In this case a frequentist would rely on a Hotelling T² analysis of some sort, assuming that N is very large but that the y_i are drawn approximately from a low-rank vector space. This kind of analysis requires you to estimate (or ‘SWAG’) the rank of this vector space. I give an example in Section 3.1 of the vignette to SharpeR, http://cran.r-project.org/web/packages/SharpeR/vignettes/SharpeR.pdf.

Jonas on November 17, 2014 at 9:21 pm said:

Another citation that might be useful.

Copas, J. B. (2013). A likelihood-based sensitivity analysis for publication bias in meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(1), 47–66. doi:10.1111/j.1467-9876.2012.01049.x


Statistical Modeling, Causal Inference, and Social Science
