Tim Bock writes:

I understood how to address weights in statistical tests by reading Lu and Gelman (2003). Thanks.

You may be disappointed to know that this knowledge allowed me to write software, which has been used to compute many billions of p-values. When I read your posts and papers on forking paths, I always find myself in agreement. But I can’t figure out how they apply to commercial survey research. Sure, occasionally commercial research involves modeling, and hierarchical Bayes can work out, but nearly all commercial market research involves the production of large numbers of tables, with statistical tests being used to help researchers work out which numbers on which tables are worth thinking about. Obviously, lots of false positives can occur, and a researcher can try to protect themselves by, for example:

1. Stating prior beliefs relating to important hypotheses prior to looking at the data.

2. Skepticism/the smell test.

3. Trying to corroborate unexpected results using other data sources.

4. Looking for alternative explanations for interesting results (e.g., questionnaire wording effects, fieldwork errors).

5. Applying multiple comparison corrections (e.g., FDR).
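The FDR correction in point 5 is easy to automate. Here is a minimal sketch of the Benjamini-Hochberg step-up procedure (the function name and toy p-values are illustrative, not from the original post):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return a boolean mask of
    'discoveries' while controlling the false discovery rate at level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # i-th smallest p-value is compared against q * i / m
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
        reject[order[: k + 1]] = True      # reject everything up to that rank
    return reject

# Toy example: a few small p-values among mostly unremarkable ones.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.34, 0.58, 0.79]
print(benjamini_hochberg(pvals, q=0.05))   # only the two smallest survive
```

Note that the step-up rule rejects all p-values up to the largest rank that clears its threshold, even if some intermediate ones do not.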

In Gelman, Hill, and Yajima (2013), you wrote “the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective.” I really like the logic of the paper, and get how I can apply it if I am building a model. But, how can I get rid of frequentist p-values as a tool for sifting through thousands of tables to find the interesting ones?

Many a professor has looked at commercial research and criticized the process, suggesting that research should be theory-led and that it is invalid to scan through a lot of tables. But such a response misses the point of commercial research, which is by and large an inductive process. To insist that one must have a hypothesis going in is to miss the point of commercial research.

What is the righteous Bayesian solution to the problem? Hopefully this email can inspire a blog post that can help an industry.

My response:

First, change “thousands of tables” to “several pages of graphs.” See my MRP paper with Yair for examples of how to present many inferences in a structured way.

Second, change “p-values . . . sifting through” to “multilevel modeling.” The key idea is that a “table” has structure; structure represents information; and this information can and should be used to guide your inferences. Again, that paper with Yair has examples.
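As a rough illustration of the partial-pooling idea (this is a generic empirical-Bayes sketch, not the method of the MRP paper), the following shrinks noisy per-cell table means toward a grand mean, with small cells shrinking most. The function, data, and the method-of-moments variance estimate are all illustrative assumptions:

```python
import numpy as np

def partial_pool(means, ns, sigma):
    """Shrink per-cell sample means toward the precision-weighted grand mean.
    Normal-normal model: each true cell mean ~ N(mu, tau^2), observed with
    sampling variance sigma^2 / n."""
    means = np.asarray(means, dtype=float)
    ns = np.asarray(ns, dtype=float)
    se2 = sigma**2 / ns                            # sampling variance per cell
    grand = np.average(means, weights=1 / se2)     # precision-weighted grand mean
    # crude method-of-moments estimate of between-cell variance tau^2
    tau2 = max(np.average((means - grand) ** 2, weights=1 / se2) - se2.mean(), 0.0)
    shrink = tau2 / (tau2 + se2)                   # pooling factor, per cell
    return grand + shrink * (means - grand)

# Hypothetical table: satisfaction means for four segments with unequal n.
pooled = partial_pool([6.1, 4.2, 5.5, 7.9], ns=[200, 12, 150, 8], sigma=2.0)
```

The extreme-looking cells with n = 12 and n = 8 get pulled hard toward the center, while the large cells barely move: the structure of the table does the multiple-comparisons work for you.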

Third, model interactions without aiming for certainty. In this article I discuss the connection between varying treatment effects and the crisis of unreplicable research.

Fourth, re-read Cleveland’s books, i.e., “Visualizing Data” and “The Elements of Graphing Data”.

I want to share my research experience with sentiment data from surveys. In Greenwood and Shleifer (2014), on expectations of returns and expected returns, the authors found that direct survey measures of expected stock returns are negatively correlated with model-based expected returns, which they take as a fundamental difference between financial models and the practical world. I’m not convinced yet, since the sentiment surveys might suffer from these same issues. However, I do believe such data can convey expectations in a behavioral way. As the sources of sentiment surveys multiply, how to translate and design the sentiment measures becomes increasingly important. Even a simple question such as “Do you think the stock market will be up or down in the next six months?” has several levels of interpretation, from the time horizon to the comparison of the two targets. Asking survey questions in the right way to capture the goal of the research is as important as the methodology of analysis. This is just my current understanding (I will continue my research in this area).

Thanks for taking the time to answer. It took me a couple of days to work through the references you provided! I know more, which is great. However, the quantity of inferences in the papers you list is trivial compared to the quantity that I was trying to ask about.

In a moderate-to-large study, the researcher may have 1,000 potential outcome variables (it can be in the tens-of-thousands). If I follow the approach that you describe in your papers, I think I end up needing 1,000 models (or some heroic assumptions and fewer models) and a few visualizations for each. Building 1,000 multilevel models and hand-crafting beautiful visualizations of the type you show in your MRP paper cannot be done in the blink of an eye, even with Stan. And, it is impossible if trying to do the job well (e.g., informative priors). And, even if it could be done, wouldn’t I end up with errors due to perceptual mistakes made when reviewing the thousands of visualizations? And, wouldn’t the analyses be ignoring the Bayesian equivalent of familywise type 1 error between the 1,000 models? And lastly, the third of the papers you present says “Present all your comparisons”, which seems to be arguing against the whole purpose of the commercial researcher (i.e., which is to synthesize a lot of data on behalf of a client).

Tim:

I don’t think the term “heroic assumptions” is particularly helpful.

Any causal inferences you make in market research will require heroic assumptions. You have to make lots of assumptions. There is no alternative.

Regarding the 1000 models: I’d suggest building them one at a time, starting with important problems where it’s worth it to you to get closer to the right answer. I’m guessing that once you’ve done a few, you can apply these same models to the other 997 or so problems; then you can see how they work and go from there.

Is it necessary that the work be done “in the blink of an eye”? In that case it sounds like you’re looking at 1000 outcomes every half-second; that’s a few zillion a year. That can’t be right!

Regarding the visualizations: the purpose of the visualizations is not to directly come up with estimates or decisions; it’s to reveal problems with your models. If you think the models are fine, you could do the whole thing without any visualizations. And, indeed, I don’t think you’ll want to be looking at 10,000 graphs. It’s good to be able to look at a graph, though, when questions come up.

Regarding “familywise type 1 error”: I have zero interest in this concept and I don’t think you should waste any time on it either. Familywise type 1 error is all about studying the statistical properties of your method under the hypothesis that all effects in all settings are zero. I’ve never worked in an area where this could be true. If you think that all your treatments might have zero effects in all situations, then you have bigger problems than statistics!

Tim, re: “the quantity of inferences in the papers you list is trivial compared to the quantity that I was trying to ask about.”

That came up when I was working with this group: Leatt, P., O’Rourke, K., Fried, B., and Deber, R. (1992). Regulatory intensity, hospital size and the formalization of medical staff organization in hospitals.

I had an initial careful analysis which suggested about 3 _findings_, and when I presented this to them they responded that their study was basically a replication of a US study in the Canadian context; the Americans had 13 findings, so it would be lame for us to only have 3!

But this was an opportunity, as I could ask which of their findings they thought should replicate in the Canadian context, and they answered basically all of them. But as essentially none of them were observed to be even similar, those 13 findings lost their truth status, and it was easy to get agreement that perhaps 3 findings would actually be better.

Now, I do understand the challenge in commercial research: it is hard to convince folks that less is more. But in any research (assuming the purpose is to find out how things really are), it almost always is.

p.s. That paper was commercial research as I was charging by the hour.