Finance professor Ravi Sastry writes:
Let’s say we have N vectors of data, {y_1,y_2,…,y_N}. Each is used as the dependent variable in a series of otherwise identical OLS regressions, yielding t-statistics on some parameter of interest, theta: {t_1,t_2,…,t_N}.
The maximum t-stat is denoted t_n*, and the corresponding data are y_n*. These are reported publicly, as if N=1. The remaining N-1 data vectors and tests are not reported or disclosed in any way.
Given priors on theta and N, and only y_n*, how do we form a posterior on theta?
I would greatly appreciate any help at all, including proper terminology to describe this problem (which is endemic in academic finance) and pointers to relevant papers.
My reply:
I don’t know the relevant literature but I think you could do this easily enough in a Bayesian context by just treating all the unreported results as missing data (that is, as unknown quantities). You could do this in Stan—almost. (You’d have to assume N is known, but that might not be so horrible in practice.) Or maybe there are ways to do this using a theoretical analysis as well; this could give some insight. It seems like an unpleasant problem, though, if you’re not allowed to see most of the data.
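To get a feel for what the unreported results do to the reported one, here is a minimal forward-simulation sketch in Python. It rests on assumptions that are not in the question: the N statistics are independent, each is approximately Normal(delta, 1) (a large-sample stand-in for the t distribution), and delta, the standardized effect, is shared across all N regressions. The function names are hypothetical.

```python
import random

def simulate_reported_max(delta, n_tests, rng):
    """Draw n_tests independent statistics and return only the largest,
    mimicking a researcher who reports just the best result.
    Assumption: each statistic is Normal(delta, 1)."""
    return max(rng.gauss(delta, 1.0) for _ in range(n_tests))

def mean_reported_max(delta, n_tests, n_sims=20000, seed=0):
    """Monte Carlo average of the reported (maximum) statistic."""
    rng = random.Random(seed)
    total = sum(simulate_reported_max(delta, n_tests, rng) for _ in range(n_sims))
    return total / n_sims

# With no true effect (delta = 0), the reported statistic still grows with N.
for n_tests in (1, 5, 20):
    print(n_tests, round(mean_reported_max(0.0, n_tests), 2))
```

Under these assumptions the reported statistic is biased upward even when the true effect is zero, and the bias grows with N, which is why a prior on N (or a known N) is needed to undo it.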
The joint distribution of order statistics is known, so that’s the complete-data likelihood right there. One just has to marginalize out N (which turns the posterior into a discrete sum over N) and the unreported order statistics t_(1) through t_(N-1) (plus the degrees of freedom of the noncentral t distribution, if unknown).
Actually, marginalizing over the non-maximum order statistics is easy: for N independent draws with density f and CDF F, the maximum has density N F(t)^(N-1) f(t). So N is the only challenge.
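The marginalization described here can be sketched numerically. The following is a hedged illustration, not a definitive implementation: it assumes independent statistics, replaces the noncentral t with a Normal(delta, 1) approximation (so there are no degrees of freedom to marginalize), uses the density N F(t)^(N-1) f(t) for the maximum of N draws, and sums over a hypothetical discrete prior on N. The prior on delta, the prior on N, and the observed value 2.5 are all made up for the example.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_stat_likelihood(t_max, delta, n_prior):
    """Density of t_max as the largest of N independent Normal(delta, 1)
    statistics, with N summed out under n_prior (dict: N -> probability)."""
    z = t_max - delta
    f, F = normal_pdf(z), normal_cdf(z)
    return sum(p * n * F ** (n - 1) * f for n, p in n_prior.items())

def posterior_on_grid(t_max, grid, delta_prior, n_prior):
    """Normalized posterior weights for delta on a grid."""
    weights = [delta_prior(d) * max_stat_likelihood(t_max, d, n_prior) for d in grid]
    total = sum(weights)
    return [w / total for w in weights]

def grid_mean(grid, weights):
    return sum(d * w for d, w in zip(grid, weights))

# Hypothetical inputs, chosen only for illustration.
grid = [i / 100.0 for i in range(-400, 601)]        # delta from -4.00 to 6.00
delta_prior = lambda d: normal_pdf(d / 2.0)         # unnormalized Normal(0, 2^2)
n_is_one = {1: 1.0}                                 # naive reading: N = 1
n_uncertain = {1: 0.25, 5: 0.25, 20: 0.25, 100: 0.25}

post_naive = posterior_on_grid(2.5, grid, delta_prior, n_is_one)
post_selected = posterior_on_grid(2.5, grid, delta_prior, n_uncertain)

mean_naive = grid_mean(grid, post_naive)        # close to 2.0 (conjugate normal result)
mean_selected = grid_mean(grid, post_selected)  # pulled toward zero by the selection term
print(round(mean_naive, 3), round(mean_selected, 3))
```

Because F(t_max - delta)^(N-1) decreases in delta whenever N > 1, accounting for selection shifts the posterior toward smaller effects than the naive N = 1 analysis.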
This paper may be of interest:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2249314
My two cents is that the basic design of the model(s) is flawed. Treating the N vectors as N independent regressions seems unhelpful to me. I would instead treat them as a pooled time series in which the N independent vectors are the cross-sections, the dependent variable (the “data,” whatever it is) is logged or log-centered to account for differences in scale, and a discrete, ANOVA-like factor numbered from 1 to N distinguishes the cross-sections. This is a more flexible design: additional predictors that account for other sources of variation (whatever they are) can easily be added to the model, and contrasts between the cross-sections are straightforward to estimate.
Seems a lot like the file-drawer problem … “unpleasant” is a gentle way to describe the difficulty.
What is probably missing from the problem description is that the y_i are likely to be correlated. In that case a frequentist would rely on some form of Hotelling T^2 analysis, assuming that N is very large but that the y_i are drawn approximately from a low-rank vector space. This kind of analysis requires you to estimate (or ‘SWAG’) the rank of that space. I give an example in Section 3.1 of the vignette to SharpeR, http://cran.r-project.org/web/packages/SharpeR/vignettes/SharpeR.pdf .
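As a rough illustration of why correlation matters, the sketch below simulates the reported maximum when the N statistics share a single common factor. This is not the Hotelling-type analysis from the vignette; it is a hypothetical one-factor, equicorrelated normal model with no true effect, and the function names and the value rho = 0.8 are assumptions made for the example.

```python
import math
import random

def reported_max_correlated(n_tests, rho, rng):
    """Largest of n_tests standard-normal statistics with pairwise
    correlation rho, built from one shared factor plus independent noise."""
    common = rng.gauss(0.0, 1.0)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    return max(a * common + b * rng.gauss(0.0, 1.0) for _ in range(n_tests))

def mean_reported_max_correlated(n_tests, rho, n_sims=20000, seed=1):
    """Monte Carlo average of the reported (maximum) statistic."""
    rng = random.Random(seed)
    total = sum(reported_max_correlated(n_tests, rho, rng) for _ in range(n_sims))
    return total / n_sims

# Same N, different correlation: the selection effect shrinks as rho grows.
print(round(mean_reported_max_correlated(20, 0.0), 2))
print(round(mean_reported_max_correlated(20, 0.8), 2))
```

In this toy model, 20 highly correlated tests behave like far fewer independent ones, so a correction that assumes independence would overstate the selection effect.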
Another citation that might be useful:
Copas, J. B. (2013). A likelihood-based sensitivity analysis for publication bias in meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(1), 47–66. doi:10.1111/j.1467-9876.2012.01049.x