We can break up any statistical problem into three steps:
1. Design and data collection.
2. Data analysis.
3. Decision making.
It’s well known that step 1 typically requires some thought of steps 2 and 3: It is only when you have a sense of what you will do with your data, that you can make decisions about where, when, and how accurately to take your measurements. In a survey, the plans for future data analysis influence which background variables to measure in the sample, whether to stratify or cluster; in an experiment, what pre-treatment measurements to take, whether to use blocking or multilevel treatment assignment; and so on.
The relevance for step 3 to step 2 is perhaps not so well understood. It came up in a recent thread following a comment by Nick Menzies. In many statistics textbooks (including my own), the steps of data analysis and decision making are kept separate: we first discuss how to analyze the data, with the general goal being the production of some (probabilistic) inferences that can be piped into any decision analysis.
But your decision plans may very well influence your analysis. Here are two ways this can happen:
– Precision. If you know ahead of time you only need to estimate a parameter to within an uncertainty of 0.1 (on some scale), say, and you have a simple analysis method that will give you this precision, you can just go simple and stop. This sort of thing occurs all the time.
– Relevance. If you know that a particular variable is relevant to your decision making, you should not sweep it aside, even if it is not statistically significant (or, to put it Bayesianly, even if you cannot express much certainty in the sign of its coefficient). For example, the problem that motivated our meta-analysis of effects of survey incentives was a decision of whether to give incentives to respondents in a survey we were conducting, the dollar value of any such incentive, and whether to give the incentive before or after the survey interview. It was important to keep all these variables in the model, even if their coefficients were not statistically significant, because the whole purpose of our study was to estimate these parameters. This is not to say that on should use simple least squares: another impact of the anticipated decision analysis is to suggest parts of the analysis where regularization and prior information will be particularly crucial.
Conversely, a variable that is not relevant to decisions could be excluded from the analysis (possibly for reasons of cost, convenience, or stability), in which case you’d interpret inferences as implicitly averaging over some distribution of that variable.