Thinking more seriously about the design of exploratory studies: A manifesto

Posted on November 17, 2016 9:33 AM by Andrew

In the middle of a long comment thread on a silly Psychological Science paper, Ed Hagen wrote:

Exploratory studies need to become a “thing.” Right now, they play almost no formal role in social science, yet they are essential to good social science. That means we need to put as much effort in developing standards, procedures, and techniques for exploratory studies as we have for confirmatory studies. And we need academic norms that reward good exploratory studies so there is less incentive to disguise them as confirmatory.

Yesyesyesyesyesyesyesyesyesyesyes.

The problem goes like this:

1. Exploratory work gets no respect. Do an exploratory study and you’ll have a difficult time getting it published.

2. So, people don’t want to do exploratory studies, and when someone does do an exploratory study, he or she is motivated to cloak it in confirmatory language. (Our hypothesis was Z, we did test Y, etc.)

3. If you tell someone you will interpret their study as being exploratory, they may well be insulted, as if you’re saying their study is only exploration and not real science.

4. Then there’s the converse: it’s hard to criticize an exploratory study. It’s just exploratory, right? Anything goes!

And here’s what I think:

Exploration is important. In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s a good idea for data to be collected with exploration in mind.

But exploration, like anything else, can be done well or it can be done poorly (or anywhere in between). To describe a study as “exploratory” does not get it off the hook for problems of measurement, conceptualization, etc.

For example, Ed Hagen in that thread mentioned that horrible ovulation and clothing paper, and its even more horrible followup where the authors pulled the outdoor temperature variable out of a hat to explain away an otherwise embarrassing non-replication (which shouldn’t’ve been embarrassing at all given the low low power and many researcher degrees of freedom of the original study which had gotten them on the wrong track in the first place). As I wrote in response to Hagen, I love exploratory studies, but gathering crappy one-shot data on a hundred people and looking for the first thing that can explain your results . . . that’s low-quality exploratory research.

From “EDA” to “Design of exploratory studies”

With the phrase “Exploratory Data Analysis,” the statistician John Tukey named and gave initial shape to a whole new way of thinking formally about statistics. Tukey of course did not invent data exploration, but naming the field gave a boost to thinking about it formally (in the same way that, to a much lesser extent, our decades of writing about posterior predictive checks has given a sense of structure and legitimacy to Bayesian model checking). And that’s all fine. EDA is great, and I’ve written about connections between EDA and Bayesian modeling; see here and here.

But today I want to talk about something different, which is the idea of design of an exploratory study.

Suppose you know ahead of time that your theories are a bit vague and omnidirectional, that all sorts of interesting things might turn up that you will want to try to understand, and you want to move beyond the outmoded Psych Sci / PPNAS / Plos-One model of chasing p-values in a series of confirmatory studies.

You’ve thought it through and you want to do it right. You know it’s time for exploration first and confirmation later, if at all. So you want to design an exploratory study.

What principles do you have? What guidelines? If you look up “design” in statistics or methods textbooks, you’ll find a lot of power calculations, maybe something on bias and variance, and perhaps some advice on causal identification. All these topics are relevant to data exploration and hypothesis generation, but not directly so, as the output of the analysis is not an estimate or hypothesis test.

So I think we—the statistics profession—should be offering guidelines on the design of exploratory studies.

An analogy here is observational studies. Way back when, causal inference was considered to come from experiments. Observational studies were second best, and statistics textbooks didn’t give any advice on the design of observational studies. You were supposed to just take your observational data, feel bad that they didn’t come from experiments, and go from there. But then Cochran, and Rosenbaum, and Angrist and Pischke, wrote textbooks on observational studies, including advice on how to design them. We’re gonna be doing observational studies, so let’s do a good job at them, which includes thinking about how to plan them.

Same thing with exploratory studies. Data-based exploration and hypothesis generation are central to science. Statisticians should be involved in the design as well as the analysis of these studies.

So what advice should we give? What principles do we have for the design of exploratory studies?

Let’s try to start from scratch, rather than taking existing principles such as power, bias, and variance that derive from confirmatory statistics.

– Measurement. I think this has to be the #1 principle. Validity and reliability: that is, you’re measuring what you think you’re measuring, and you’re measuring it precisely. Related: within-subject designs or, to put it more generally, structured measurements. If you’re interested in studying people’s behavior, measure it over and over, ask people to keep diaries, etc. If you’re interested in improving education, measure lots of outcomes, try to figure out what people are actually learning. And so forth.

– Open-endedness. Measuring lots of different things. This goes naturally with exploration.

– Connections between quantitative and qualitative data. You can learn from those open-ended survey responses—but only if you look at them.

– Where possible, collect or construct continuous measurements. I’m thinking of this partly because graphical data analysis is an important part of just about any exploratory study. And it’s hard to graph data that are entirely discrete.

I think much more can be said here. It would be great to have some generally useful advice for the design of exploratory studies.

64 thoughts on “Thinking more seriously about the design of exploratory studies: A manifesto”

Gabriel Reynés on November 17, 2016 10:16 AM at 10:16 am said:

The sad part is that, at least in medical research, a big amount of exploratory analisis are sold as real confirmatory studies. And I am sure that part of the replication crisis is due to this.

Many times I encountered the following workflow:
*have N patients -> record many variables -> make some t-test (or variants) -> choose all variables with p>0.05 and a few more to pretend to be a real study-> make some hypothesis to justify the selection of these variables -> publish

Reply ↓
- Rahul on November 17, 2016 12:47 PM at 12:47 pm said:
  
  First suggestion: Make every exploratory study announce in bold right on top that it is indeed an exploratory study.
  
  Reply ↓
  - Ivan on November 21, 2016 1:13 PM at 1:13 pm said:
    
    I have a better suggestion: don’t take confirmatory mumbo-jumbo too seriously and just assume every study is exploratory until proven otherwise.
    
    Reply ↓
    - Diana Senechal on November 21, 2016 7:27 PM at 7:27 pm said:
      
      But when researchers present a shaky study as confirmatory, they may exclude data and analyses that complicate the findings. The study itself becomes constricted.
      
      Clarity is key. Mumbo-jumbo isn’t just a local glitch; it spreads out into the whole. If the language of a study can’t be taken seriously, the study itself has problems.
Bill Harris on November 17, 2016 10:31 AM at 10:31 am said:

– Be open to the idea that the data /may/ exhibit serial correlation and may be better understood if modeled as having come from a set of ODEs (depending upon the data, of course).

Reply ↓
Bill Harris on November 17, 2016 10:40 AM at 10:40 am said:

– From Greg Kruger, an industrial statistician I once worked with: “Don’t expend more than ~10% of your (time, money, data) budget on the first pass at EDA. You’ll learn so much in that pass that you’ll want and need the resources to take the subsequent steps.” He may have gotten that from someone else, but I don’t remember to whom he may have given attribution.

Reply ↓
LRS on November 17, 2016 11:10 AM at 11:10 am said:

As a graduate student who has been in several gardens and unwittingly chosen paths at many forks, I’m terribly afraid of collecting tons of outcomes, even in the exploratory work I am doing now. How do you recommend we mitigate overinterpretation of analyses with many outcomes while still generating interesting hypotheses? How do we best tie our hands when rummaging around in the dark and?

Reply ↓
- Andrew on November 17, 2016 3:54 PM at 3:54 pm said:
  
  LRS:
  
  I do not at all recommend that we “tie our hands.” I think we should look at everything we can, and structure that with multilevel models.
  
  Reply ↓
- psyoskeptic on November 17, 2016 9:06 PM at 9:06 pm said:
  
  The garden of forking paths criticism is not a criticism of exploratory research. It’s a criticism of exploratory research passed off as confirmatory. So if you recognize you’re doing exploratory work, and honestly report what you’ve done, then there’s little for you to be concerned about. If you are concerned that causes some of your more important findings to appear less impactful important then redo the research as genuine confirmatory work.
  
  Reply ↓
  - anon on November 18, 2016 4:23 AM at 4:23 am said:
    
    >So if you recognize you’re doing exploratory work, and honestly report what you’ve done, then there’s little for you to be concerned about
    
    Except career advancement, of course.
    
    Reply ↓
  - Keith O'Rourke on November 18, 2016 10:25 AM at 10:25 am said:
    
    > then there’s little for you to be concerned about
    But you want it to be as profitable (for getting at an underlying reality) as it can be.
    
    Reply ↓
  - Andrew on November 18, 2016 10:29 AM at 10:29 am said:
    
    Psyoskeptic:
    
    Honest reporting is important, even necessary. But design is important too. Dishonest reporting can mislead, but honest reporting is not enough to save a poor design. For example, consider that beauty and sex-ratio study we discussed several years ago. It doesn’t matter how honest and careful the researcher is here: the data are just too noisy to learn anything, from an exploratory or a confirmatory perspective.
    
    Reply ↓
    - Martha (Smith) on November 18, 2016 12:02 PM at 12:02 pm said:
      
      +1
Rajiv Ramarajan on November 17, 2016 11:14 AM at 11:14 am said:

A guideline to check for the number of dimensions, depth along the dimensions, sub-findings, or other measures would help to evaluate the strength of an exploratory study. I see parallels with the Design Thinking process. The ‘divergent’ phase of exploring new ideas, while synthesizing the problem space, can feel uncomfortable if focus is on the end result.

Reply ↓
Anonymous on November 17, 2016 11:15 AM at 11:15 am said:

Coming up with possible improvements of exploratory work is probably a good idea, however i am not convinced about also *publishing* this kind of work.

I come from a psychological research background and, just thinking about psychology, i find it hard to comprehend why it might be a good idea to publish even more exploratory work. I really just don’t get why. This comes from assuming:

1) you can find just about anything doing exploratory work (Simmons, Nelson, Simonsohn, 2011)
2) replications (confirmations) of (presumably mostly exploratory) work are rare (Makel, 2012)
3) people will cite and build up on just about anything (see all the “failed” RRR’s and the citation numbers of the original work they replicated) as long as there are no consequences for their chances of “success” (made possible by allowing publishing of their new exploratory work)

Assuming these 3 things make any sense, i just wonder 2 things:

What will be the evidential value of a literature when 1, 2, and 3 are all factors at work?

What will be a more fruitful basis in order to build a more cumulative, and “correct” literature, one that is based on only published confimatory findings or one that is based on only published exploratory findings? (If it’s the former why bother adding exploratory work into the mix?)

Reply ↓
- Daniel Lakeland on November 17, 2016 2:59 PM at 2:59 pm said:
  
  I don’t think this is about publishing “more” exploratory work. It’s just about making it ok to continue to publish the kind of stuff that people already publish but now under a classification that better represents what it actually is.
  
  Reply ↓
- Andrew on November 17, 2016 4:00 PM at 4:00 pm said:
  
  Anon:
  
  One reason I favor the publication of exploratory work is that just about all the applied research I’ve ever done is exploratory, and I think a lot of it has been worth sharing!
  
  Reply ↓
  - Anonymous on November 17, 2016 7:03 PM at 7:03 pm said:
    
    Thank you for the reply. I am trying to understand this all, so please forgive me if i sound rude and/or stupid.
    
    “One reason I favor the publication of exploratory work is that just about all the applied research I’ve ever done is exploratory, and I think a lot of it has been worth sharing!”
    
    I don’t think that’s a valid and/or useful scientific reason in and of itself. The question then arises, why has it been worth publishing from a scientific perspective?
    
    I don’t know if this applies to your kind of work (i am reasoning from an experimental psychological research perspective as i am the most familiar with that type of work), but as a result of your reply i thought about the following questions:
    
    1) What is the percentage of your published exploratory work that has been directly replicated?
    
    2) What is the percentage of these direct replications that found similar results (“confirmed” the exploratory findings)?
    
    3) What is the the average number of citations received for your exploratory findings that have been confirmed via direct replications and for those that have not?
    
    Given the answers to 1, 2 and 3 would you say that that represents the optimal way for science to build a cumulative and “correct” literature?
    
    Reply ↓
    - Andrew on November 17, 2016 7:11 PM at 7:11 pm said:
      
      Anon:
      
      I almost always work with observational data and just about none of my applied work has been replicated in any direct sense. The only example I can think of is my recent paper with Rayleigh and Yair.
      So my answer to your questions are:
      
      (1) 1 out of, umm, I dunno, 100 or 200? Call it 0.5% or 1%.
      
      (2) 1/1 = 100%. Our one replication found similar results. But not identical results, actually, so one could call it a partial confirmation. I’ll count it as 1/2, or 50%.
      
      (3) The new paper has not yet been released so it has no citations. What’s the average number of citations of my published applied papers? I dunno, 20?
      
      Finally, in answer to your last question: I have no idea what is the optimal way for science to do anything; indeed I question the very idea that there is some optimal way. But if there is an optimal way, I’m sure that it doesn’t describe what I or anyone else is doing.
      
      P.S. I think much of my applied work has been useful. You can take a look at some of my published papers to get a sense of what I’m talking about.
    - Anonymous on November 17, 2016 8:00 PM at 8:00 pm said:
      
      Thank you very much for your reply!
      
      (I really appreciate your posts, your willingness to engage with (anonymous) commenters, and your openness)
- psyoskeptic on November 17, 2016 9:14 PM at 9:14 pm said:
  
  I also come from psychological research and I see obviously one exploratory study after another in the literature and the review process and they’re passed off as confirmatory. Everyone wants confirmatory but the blunt truth is that a lot of them are lying about it.
  
  Just yesterday I was going over a paper with 14 p-values. There was a single p-value close to 0.05. It was reported as under but it was actually over. Every other p-value was exactly right. The author lied. There were also several unreported possible obvious tests that made perfect sense to do. They were not reported in the study and it was framed like the ones reported were all that was done. I didn’t believe it for a second.
  
  Coming up with guidelines for exploratory research is not a plan to have more exploratory research. It’s a plan to properly perform and report it and make it so that authors don’t feel like they have to lie and cover over what they’ve done. I remember as a young student the first time someone told me that science was like a sausage factor and you don’t want to see how it’s made. Thinking like that is a serious problem unfortunately pervasive in all of social science.
  
  Reply ↓
  - Martha (Smith) on November 18, 2016 12:49 AM at 12:49 am said:
    
    “I remember as a young student the first time someone told me that science was like a sausage factor and you don’t want to see how it’s made.”
    
    Ooh, wow! Things are even worse than I thought. Sad, but thanks for letting me know.
    
    Reply ↓
Chris Chambers on November 17, 2016 11:31 AM at 11:31 am said:

Thanks for the post – this is an important discussion to have.

We’re developing an “Exploratory Reports” format at the journal Cortex to complement our existing Registered Reports format. The idea is to create a type of publication that celebrates pure exploration, rather than forcing researchers to fit exploratory science into an ill-fitting confirmatory framework — shoehorning Kuhn into Popper.

Making this a meaningful format of article is challenging and we would welcome input.

One of the biggest challenges, I think, is this: do inferential statistics play any meaningful role in pure exploration, or is it a case of simply observing (descriptively) what might be happening in data set and then proposing hypotheses going forward? Does a p value add anything of value when there is no hypothesis? Does a Bayes factor tell us anything when there is no *prospective* prior?

Reply ↓
Diana Senechal on November 17, 2016 12:34 PM at 12:34 pm said:

This is great. If developed and instituted, it could bring big improvements to research in the psychological science (and possibly to so-called “action research” in education and other fields).

Identifying what you will measure–and making sure this is what you’re in fact measuring–would prevent a lot of junk. In addition, I advocate working out a clear and well-considered rationale for measuring this particular thing. One frustrating aspect of psychological (and education) research is that the researcher often don’t take time to work out their initial premises and terms.

Take, for instance, research regarding the Big Five theory of personality (about which I am generally skeptical). One of the “Big Five” traits is “openness to experience”–but researchers seem to take this to mean “openness to an ever-changing array of experiences.” That is, someone who reads five different books would be considered more “open to experience” than someone who reads the same book five times–even if the latter person finds new things with each rereading. As I see it, someone who rereads may be at least as open to experience as the person always reaching for a new book–but the quality manifests itself differently.

That’s just one example of a situation where researchers could use clear, well-tuned definitions that have gone through some questioning.

Reply ↓
- Martha (Smith) on November 18, 2016 12:59 AM at 12:59 am said:
  
  “Identifying what you will measure–and making sure this is what you’re in fact measuring–would prevent a lot of junk. In addition, I advocate working out a clear and well-considered rationale for measuring this particular thing. One frustrating aspect of psychological (and education) research is that the researcher often don’t take time to work out their initial premises and terms.”
  
  +1
  
  Even something as prosaic as grading calculus exams can bring up measurement disagreements. Once I taught the required “teaching” course for math TA’s. In preparation for a discussion of grading, I copied a bunch of student solutions to a couple of problems on a calculus final the previous semester. Then I had the students in the TA teaching course grade each solution on a scale of 1 to 10. Lots of disagreements! But the exercise led to worthwhile discussion.
  
  Reply ↓
- Martha (Smith) on November 18, 2016 11:50 AM at 11:50 am said:
  
  Diana continued the above quote with
  
  “Take, for instance, research regarding the Big Five theory of personality (about which I am generally skeptical). One of the “Big Five” traits is “openness to experience”–but researchers seem to take this to mean “openness to an ever-changing array of experiences.”
  
  That’s just one example of a situation where researchers could use clear, well-tuned definitions that have gone through some questioning.”
  
  If I’m not mistaken, the label “openness to experience” was an ex-post-facto label that then became taken as a definition. To elaborate, my understanding is that the process was as follows:
  
  The initial research started not by making “definitions” of personality traits, but by making lists of questions that the researchers thought might capture the varied aspects of personality. They administered the resulting questionnaire to a group of people and ran a factor analysis on the resulting data. This spit out a list of “factors” (linear combinations of the question responses) that “accounted for” (relatively) high proportions of the total variance in responses. They decided that the first five (ie, the five with the highest amount of total variance accounted for) gave a pretty good list of the major “traits” of personality. They then gave names to these traits.
  
  In other words, they did not start with definitions of traits; this was exploratory research that gave them candidates for traits. The real definition of the traits was “whatever this linear combination measures”. However, the labels they attached to these factors became “reified” — that is, taken to be The Real Thing Measured, even thought the labels were fuzzy terms subject to varying interpretations.
  
  (Unfortunately, there is a common human tendency to fall into the trap of “when you give it a name, you feel you understand it.” This is often an obstacle to learning and understanding — e.g., “statistical significance” is so often understood as something other than what it really is.)
  
  Reply ↓
  - Diana Senechal on November 18, 2016 12:00 PM at 12:00 pm said:
    
    Thank you, Martha, for this background. In light of this, maybe it isn’t desirable (or possible) to define one’s working terms at the outset of exploratory research. Maybe the real point is to be clear about what one has and hasn’t defined, and to acknowledge ambiguity and uncertainty for what they are. The error here lay in “reifying” the labels assigned to the traits. If the research stayed clear about the imprecision and tentativeness of the labels, it might avoid many traps.
    
    Reply ↓
    - Martha (Smith) on November 18, 2016 12:13 PM at 12:13 pm said:
      
      “If the research stayed clear about the imprecision and tentativeness of the labels, it might avoid many traps.”
      
      Yes, but that’s a big “if”. It requires a lot of going against human nature: “The Game of Telephone” and “TTWWADI” effects (see http://www.ma.utexas.edu/blogs/mks/2014/06/22/beyond-the-buzz-on-replications-part-i-overview-of-additional-issues-choice-of-measure-the-game-of-telephone-and-twwadi/) seem to be strong human tendencies. Countering them takes a lot of continued effort.
    - Diana Senechal on November 18, 2016 1:18 PM at 1:18 pm said:
      
      Thank you for the link. Your blog raises another question: How many people bother to look up the sources cited in a research study? Does anyone look them all up? I have found that some especially problematic studies are saturated with citations; the Sokal Hoax (http://www.physics.nyu.edu/faculty/sokal/transgress_v2/transgress_v2_singlefile.html) demonstrates how misleading this can get.
      
      When a study references too many others, I wonder how many of them the researchers actually read carefully. I’m sure some do this because they think it’s good form (bringing up every bit of research they can find), but it makes the study difficult to wade through and probably conceals many misinterpretations.
      
      I would rather see fewer citations and better reasoning.
    - Martha (Smith) on November 18, 2016 2:49 PM at 2:49 pm said:
      
      “I would rather see fewer citations and better reasoning.”
      
      +1
Diana Senechal on November 17, 2016 4:42 PM at 4:42 pm said:

Here’s an example of a study that would be improved (perhaps) if made exploratory rather than confirmatory. Then again, it may be beyond hope.

The authors call their paper (published in Neurocase) the first “clear unambiguous proof for the veracity and true perceptual nature” of calendar synesthesia. They actually say this.

https://www.newscientist.com/article/2113063-synaesthetes-who-see-calendar-hint-how-our-brains-handle-time/

This strikes me as woefully iffy (from the description in New Scientist):

“Next they asked ML and eight non-synaesthetes to name the months of the year backwards, skipping one or two months each time – a task most people find challenging. They figured that ML should be able to complete the task quicker than the others as she could read it from her calendar. Indeed, ML was much quicker at the task: when reciting every three months backwards, she took 1.88 seconds per month compared with 4.48 seconds in non-synaesthetes.”

Who needs 4.48 seconds per month to recite every three months backwards? I can do it in under 2 seconds per month, without a vivid calendar vision. The actual paper published in Neurocase states:

“In control subjects, the average RT for reciting all of the months backward (n = 8) was 1.46 s/month. For skipping 1 or 2 months – the average was 2.54 and 4.48 s/month respectively. For ML, the average RT for the same 3 tasks were (A) 0.58 s/month, (B) 1.63 s/month, and (C) 1.88 s/month (see legends in Figure 2).”

Things get even sillier when they come to subject EA (this quotation also from the actual paper):

“Indeed, on a previous occasion, we had informally tested a synesthete EA, who might have qualified as a higher calendar synesthete. Her calendar form was shaped like a hula-hoop (the most common manifestation of calendar forms) in the transverse plane in front of her chest. Unlike ML, though, when EA turned her head rightward or leftward, the calendar remained stuck to the body, suggesting that it was being computed in body-centered, rather than head (and eye) centered coordinates. The variation across calendar synesthetes, in this regard, reminds us that even in neurotypical brains there are probably multiple parallel representations of body in space that can be independently accessed depending on immediate task demands.”

Oh dear. I wish this were a hoax.

There may well be people who picture calendars more vividly than others (whether as hula hoops, V shapes, beehives, or silly strings)–but they aren’t necessarily “calendar synaesthetes,” nor does their perception necessarily have anything to do with time. They might similarly be able to picture sequences of numbers in other contexts.

These things may be worth sorting out and distinguishing–but without pressure to “prove” anything on the basis of a month-reciting test with one subject and eight controls.

(The article in New Scientist says there were two subjects–but actually the second subject, HP, did not complete the full experiment: “We then studied the second subject – HP – but for practical reasons – were only able to conduct a subset of the experiments that we had performed on ML.” The subject EA was from a previous experiment.)

Reply ↓
- Diana Senechal on November 17, 2016 7:08 PM at 7:08 pm said:
  
  P.S. As I reread this study, I think it just might be a hoax. If it is, I laugh; if it isn’t, I still laugh, but sadly.
  
  Reply ↓
Stan Lazic on November 17, 2016 6:22 PM at 6:22 pm said:

An(other) excellent post!

One of the best references on the topic that I’ve come across is:

Sheiner LB (1997). Learning versus confirming in clinical drug development. Clin Pharmacol Ther 61(3): 275–291. [PMID: 9084453]

The ideas are relevant to all areas of research, not just drug development.

Reply ↓
Eric on November 17, 2016 6:22 PM at 6:22 pm said:

I recommend John Gerring’s “Mere Description” BJPS article, ungated version here:

http://people.bu.edu/jgerring/documents/MereDescription.pdf

See, e.g., fn. 14.

Reply ↓
Brian D. Silver on November 17, 2016 8:33 PM at 8:33 pm said:

Andrew, it’s worthwhile to bring to the attention of younger scholars the elegant piece by Leslie Kish in ASR 1959, “Some Statistical Problems in Research Design.” His warning about “hunting with a shotgun,” and other false steps is well worth reviewing and understanding especially in exploratory research.

Reply ↓
- Andrew on November 17, 2016 8:43 PM at 8:43 pm said:
  
  Brian:
  
  Kish’s paper is worth reading. Unfortunately, statisticians back in 1959 didn’t understand the general applicability of multilevel models. So Kish talks about the problems with multiple comparisons and then recommends the methods of Tukey, Duncan, Scheffe, and others, which I think turned out to a big dead end. Those researchers correctly had the sense that the way to solve multiple comparisons problems was to look at all the comparisons at once, but they didn’t know about multilevel modeling and had little idea of what to do beyond rearranging the p-values as the boat was sinking.
  
  Kish also has a great quote from Frank Yates. I love Yates.
  
  Reply ↓
  - Shravan on November 18, 2016 1:19 AM at 1:19 am said:
    
    Andrew, I think that there are situations where multiple comparisons become a problem even when one does hierarchical (multilevel) modeling. See e.g.
    
    https://arxiv.org/abs/1504.06896
    
    Reply ↓
    - Ian on November 18, 2016 5:06 AM at 5:06 am said:
      
      Shravan, my understanding of that paper is that they don’t test the type of multi-level model that Andrew is presumably suggesting, i.e. one model of all the data across measures (or regions of analysis).
    - Simon Gates on November 21, 2016 10:43 AM at 10:43 am said:
      
      I feel a bit statistically challenged compared to most people here… can you (or anyone) point me to a good published example of the sort of multilevel modelling that Andrew is meaning?
      
      Even better if there is code!
psyoskeptic on November 17, 2016 9:20 PM at 9:20 pm said:

I have one suggestion, a bit of a focus on determining if a sample is representative or not. I see people with obviously non-representative samples and there’s just no guideline for really thinking about them except perhaps, correct it with a strong prior.

Or perhaps, probably more importantly, discussing your sample with that as a potential property. It’s not something really talked about and it’s the fundamental underlying hypothesis to all statistics based on samples. I just looked at a couple of stats textbooks and none of them even mentioned representativeness. I think that within exploratory research this becomes more important than even in confirmatory research.

Reply ↓
- Andrew on November 17, 2016 10:13 PM at 10:13 pm said:
  
  Psyoskeptic:
  
  I’ve discussed this point in presentations; I’m not sure if it’s made its way into any of my published papers. The point is kind of subtle and it goes like this: If you’re estimating a constant effect and you have a randomized experiment, there’s no need for your sample to be representative. But if your effect has strong interactions with pre-treatment variables, then it’s important that your sample be representative with respect to those variables, otherwise your inferences can be completely wrong. And just about all the social psychology studies we’ve been discussing over the past few years involve interactions: effects that appear in some circumstances but not others (as has been claimed for power pose and in lots of other cases where study B didn’t replicate study A and then the people who did study A claimed that the effect only was supposed to appear if you did the experiment juuust right, i.e. they were claiming big interactions with pre-treatment conditions). So, yeah, representativeness is important.
  
  The subtle point is that lots of methodologists were trained that, if you have a randomly assigned treatment, that you didn’t need a representative sample. But their training did not specify that this holds only in the absence of interactions.
  
  The only place I can find where I discussed this in writing is in this post from a few years ago on the Freshman Fallacy. Good discussion thread there too.
  
  Reply ↓
  - jrc on November 18, 2016 2:13 PM at 2:13 pm said:
    
    Andrew – this problem is also related to questions about the use of survey weights when you have a sample that is only “representative” when properly weighted. If you think there is a stable, constant treatment effect, you don’t need weights at all, but if there are heterogeneities in that treatment effect you won’t get the average treatment effect for the population right without weights (should there exist one and should you be interested in it). Of course, you may also not need to get the average right if you can get all the sub-group treatment effects right, and then you could re-weight ex-post… but I digress. The point is that if you think of any particular non-representative sample as being able to be made representative with appropriate probability weights, then it is basically the same problem.*
    
    Here is a nice discussion of the weighting aspect, and I suspect it speaks broadly to representativeness and treatment effect estimation more broadly: “What are we weighting for?” by Solon, Haider and Wooldridge… who are obviously economists, since the title has a pun in it.
    
    http://www.nber.org/papers/w18859.pdf
    
    *Note: 30 white undergraduates as a sample will never let you back out the population average treatment effect in this context, so you need some “diversity” of the population in your sample, but it doesn’t at all have to be representative in the simple-random-sampling sense.
    
    Reply ↓
    - Andrew on November 18, 2016 2:19 PM at 2:19 pm said:
      
      Jrc:
      
      This is related to the point that Jennifer and I make in our book, discussing causal inference, where we distinguish between two sources of concern:
      1. Imbalance between treated and control groups.
      2. Lack of complete overlap between treated and control groups.
      
      Imbalance can be corrected using weighting, regression, etc. When you have lack of complete overlap, you need to restrict your region of inference or else be explicit how your inferences depend on modeling assumptions and extrapolation.
- AnonAnon on November 18, 2016 1:40 PM at 1:40 pm said:
  
  As a social psychologist, I’ve made this point before to colleagues: Randomization only protects against the main effects of pre-treatment covariates and any interaction between the manipulation and pre-treatment covariate is an omitted variable. And I added, with emphasis, that this shouldn’t be hard to understand if you’ve ever read or subscribed to the Person by Situation Interactionism (individual difference by experimentally manipulated situation) framework because randomization didn’t prevent or block the effect of the individual difference on the dependent variable. So randomization isn’t a panacea even in experimental settings and shouldn’t be a substitute for critically thinking and measuring pre-treatment covariates (before treatment).
  
  But as Andrew notes this isn’t an easy point to get across, in part because of how we (in the royal social psychologist sense) talk about randomization.
  
  This old thread* is being wound up in newer frameworks involving Evolutionary Psychology and Cultural Psychology** as the new Person by Situation Interactionism.
  
  *Some Old Threads:
  Mischel, W. (1977). The interaction of person and situation. Personality at the crossroads: Current issues in interactional psychology, 333, 352.
  Diener, E., Larsen, R. J., & Emmons, R. A. (1984). Person× Situation interactions: Choice of situations and congruence response models. Journal of personality and social psychology, 47(3), 580.
  
  **Some New Yarns:
  Tooby, J., & Cosmides, L. (1995). The psychological foundations of culture. The adapted mind: Evolutionary psychology and the generation of culture, 19-136.
  Varnum, M. E., & Grossmann, I. (2016). Pathogen prevalence is associated with cultural changes in gender equality. Nature Human Behaviour, 1, 0003.
  
  Reply ↓
  - Andrew on November 18, 2016 2:16 PM at 2:16 pm said:
    
    Anon:
    
    Ironically (or not), Tooby and Cosmides (see your second-to-last reference) are among the coauthors of the fat-arms-and-voting study that we’ve discussed occasionally. That was another one of those studies reporting big interactions based on data from some convenience samples. Here’s a bit from that paper:
    
    We showed that upper-body strength in modern adult men influences their willingness to bargain in their own self-interest over income and wealth redistribution. These effects were replicated across cultures and, as expected, found only among males. The effects in the Danish sample were especially informative because it was a large and representative national sample.
    
    Lots to chew on here, including the fact that they praise their representative sample of Danes as being “especially informative” without seeming to realize the problems with treating their other samples of college students as representing “modern adult men.” Also that quote talks about bargaining but I didn’t see anything in the data about bargaining, nor were the causal claims supported by the analysis.
    
    As I wrote at the time: They believe their story (of course they do, that’s why they’re putting in the effort to study it), and so it’s natural for them, when reflecting on problems of measurement, causal identification, and representativeness of the sample, to see these as minor nuisances rather than as fundamental challenges to their interpretation of their data.
    
    Studying nearby college students can be a good way to gather data cheaply and to generate hypotheses that can be tested via more serious study. But then recognize this is your goal, and be serious about measurement issues rather than finding a “p less than .05” and hanging on for dear life.
    
    Reply ↓
    - AnonAnon on November 18, 2016 2:43 PM at 2:43 pm said:
      
      Andrew:
      
      Oh I’m a long time choir member and not quite first time commenter. I’ve been exposed to and some extent trained in the Evolutionary and Cultural strands of social psychology. And while I’m amenable to general propositions like human behavior and subsequently human culture has been shaped by evolutionary processes, I’ve been maddened by things like the bicep and ovulation studies. And more importantly the unwillingness of some social psychologists to productively accept methodological criticism.
      
      This is probably because I was also fortunate enough to have some pretty substantial quantitative training. Which is the context for me being critical of peers ignoring the fact that, as you pointed out, the composition of the sample matters and no randomization is not a cure all. The fact that some were what you might consider strong interactionists made it all the more maddening when they would use randomization as a shield for their claims.
    - Chris on November 21, 2016 12:51 PM at 12:51 pm said:
      
      Once more with feeling, Tooby and Cosmides are not social psychologists.From their joint website:
      
      Tooby’s A.B. was in experimental psychology and his Ph.D. in biological anthropology; Cosmides’ A.B.was in biology and her Ph.D. in cognitive psychology.
      
      If every cognitive psycholgist, economist, biz school prof, or anthropologist doing mediocre work is labeled a “social psychologist,” then it’s no wonder people’s view of the field is poor.
      
      Please be precise, as I am collateral damage.
    - AnonAnon on November 21, 2016 2:59 PM at 2:59 pm said:
      
      Chris,
      
      Evolutionary and cultural psychologists like Buss, Kenrick, or Varnnum are also social psychologists, at least I believe they are treated that way at the departmental level. The point of highlighting Tooby and Cosmides is that their work has strongly influenced some notable evolutionary psychologists and also importantly there are departments where the social area embraces evolution and culture as the new person by situation interactionism.
      
      If your point is just that not all social psychologists subscribe to the evolutionary or cultural frameworks, then I agree entirely that it is wrong to generalize from some to all social psychologists.
Bill Harris on November 17, 2016 11:10 PM at 11:10 pm said:

Another proposed guideline:

– Save the majority of the data for CDA.

Figure 1 (Statistical power of a replication study) of Button et al. 2013 Power failure: why small sample size undermines the reliability of neuroscience (http://www.nature.com/nrn/journal/v14/n5/full/nrn3475.html) notes that the seemingly common advice to hold out 20-30% of the data for CDA gives a fairly low-powered test that may stand a poor chance of being truly confirmatory. That seems to be similar in recommendation to the advice to use 10% of the resources in the initial testing, albeit for different reasons.

Reply ↓
Matthew Higgs on November 18, 2016 5:29 AM at 5:29 am said:

This is great. Andrew, what are the next steps we can take to make “Exploratory Study Design” a thing? I’m all in.

Reply ↓
Clark on November 18, 2016 9:26 AM at 9:26 am said:

It has been my experience (in clinical research) that many investigators (usually MDs) don’t give sufficient thought to controlling bias or issues of confounding variables, particularly when doing observational/retrospective studies. They have a tendency to gather their data and hand it to a statistician (or more likely, a med student or Fellow) for a simple analysis, and a lot of unreplicable stuff gets published. For instance, in pediatric research it is important to control for the age of the patient, and in burns research it is important to control for the area burned. Sometimes they’ll use t-tests and chi-square tests and simply look at p-values to decide that the groups are balanced, ignoring effect sizes and variability, which doesn’t work very well with small sample sizes. Another problematic study is to compare outcomes before and after a date where some policy changed, without considering things like evolving standard of care and patient characteristics over time. One benefit of developing a long-term relationship with clinical researchers is to learn where the skeletons are buried, and be able to discuss these points and obtain more complete data for improved modeling. And of course, sometimes it is impossible to get the answers they want from available data due to confounding among the variables — I’ve sometimes found it helpful to make charts of the directional relationships among different variables to demonstrate that we don’t have independence where we need it for good modeling.

Reply ↓
- Keith O'Rourke on November 18, 2016 10:47 AM at 10:47 am said:
  
  Yes, this is sad.
  
  In the last hospital research institute I worked this is how it went.
  
  MD researcher – I want to learn about X from this observational study I have as there’s no RCTs and unlikely will be.
  
  Me: OK but we really need to give sufficient thought to controlling bias or issues of confounding variables and that is a lot of hard work and always very uncertain.
  
  MD researcher (maybe after putting up with me and doing some of that hard work) – but in the stats course taught by the senior biostatistician in our hospital research institute they said to do pairwise regression, throw away any variables with p_values greater than 10% and then to step-wise regression to get the best model. I talked to some other MDs is this is what they did and it took then less time than I have spent so far.
  
  Of course then then did that and got the senior biostatistician to add their name to the paper.
  
  Expect to hear the same story for at least another 20 years.
  
  Reply ↓
  - Clark on November 18, 2016 11:03 AM at 11:03 am said:
    
    Optimistically (I’m that sort of person), I’d like to think that we are in the midst of a broader movement within the research community to improve the quality of research and reproducibility of publications, and that this thread is one step along that path. I figure this will trickle down from statisticians to laboratory investigators and eventually to MDs, and over the next decade or two we’ll see some real progress. Part of this is having these issues discussed frankly within introductory statistics courses (how about all statistics courses?). Another part is in the course of our daily interactions with investigators. Of course, one thing I find particularly frustrating is that some of the worst offenders are also the most proliferative publishers, which sets a bad example for those who try to do the right thing.
    
    Perhaps what we need is a post-publication rating system for published work, which might translate into career incentives for better quality research as opposed to sheer quantity.
    
    Reply ↓
    - Martha (Smith) on November 18, 2016 11:59 AM at 11:59 am said:
      
      “Trickle down” theories are usually overly optimistic. I suspect that making the changes trickle down will take a lot of deliberate effort.
      
      The post-publication rating system sounds good, but could be highjacked by pressure to dumb it down.
jrc on November 18, 2016 2:31 PM at 2:31 pm said:

I’ll add the economists’ contribution as a part of the manifesto: “Understanding and Discussing Identifying Variation”.

What I mean by this is: know, describe and critically discuss what “changes in the world” your model is using to estimate the effects you are interested in, and what comparisons between what (kinds of) people you are making. This is a broader definition, with less focus on endogeneity, than the definition most economists have in mind, but I do believe economists have moved this discussion forward quite a bit.

Here is an example: suppose I want to know the effects of veteran status on future earnings. I could compare people who served and didn’t serve, but if I were thinking clearly about this comparison, I would immediately say “different people choose to serve than those who don’t serve, so this comparison captures both the effect of serving in the military and the bundled effect of being the kind of person who serves in the military.” So instead maybe I compare people who were subjected to the Vietnam draft lottery and did or did not get “lucky” (whatever direction you think of “lucky” there). Maybe this also isn’t perfect (it is now an estimate among people who chose not to simply enlist on their own or who didn’t have rich, powerful Moms and Dads that could buy them out of the draft), but now, by stating clearly the comparisons we are making and the source of identifying variation the model latches on to (the thing in the world that changes – which is that your draft number got drawn or it didn’t) we can have a much clearer understanding of what our estimates are telling us about the world.

Sure, often discussions of identifying variation get sidetracked with technical nonsense claims about correlation of treatment with error term, and that isn’t always helpful. But that isn’t economics’ real contribution to this discussion. The useful part is a careful thinking about and discussion of the relationship between the world as it is, the particular changes in the world you are harnessing to estimate the effect of interest, and the model that latches on to those changes in the world (the identifying variation) to estimate the effect of interest.

Last point: even though this concept has been developed in the “quasi-experimental” world of causal effect estimating, I think it is an important idea for purely observational or exploratory analysis too. Just because you don’t have clean, tight causal interpretations of the specific comparisons you are making in your analysis, a discussion of various comparisons or sources of variation can help illuminate more clearly what your estimates are telling you about the world.

Reply ↓
Keith O'Rourke on November 18, 2016 2:54 PM at 2:54 pm said:

> Data-based exploration and hypothesis generation are central to science.
I would argue to also include experience/expertise exploration as well e.g. http://bouchat.github.io/Bouchat-IMCpaper.pdf or http://www.methods-colloquium.com/single-post/2016/10/21/Sarah-Bouchat-Engaging-Experts-Dealing-with-Divergent-Elicited-Priors-in-Political-Science

> Statisticians should be involved in the design as well as the analysis of these studies.
OK, so both design and analysis is on the table.

> What principles do we have for the design of exploratory studies?
Perhaps maximize the probability of generating productive surprise and confusion?
(the productive clause underlines the need for #1 principle. Validity and reliability)

For the analysis of exploratory studies maybe utilize something like coaching circles or active learning?
(e.g. http://www2.cipd.co.uk/nr/rdonlyres/749aa8fe-8119-4bf4-b9a1-8b29b2e91378/0/cherylbrookresearchpaperb.pdf or https://www.youtube.com/watch?v=ZW9LCQ-3U2w)

The simple idea here is to have 4 or 5 engaged questioners that will jointly ask you questions only (or mostly) about what you are trying to get out of the exploratory analysis. The engagement often comes by agreeing to rotate when these others are doing an exploratory analysis.

I would be interested in hearing if anyone has done something like this for their statistical analyses?
(Somehow I missed this topic when I was in MBA school but with n of 1 it seems like it could be very helpful.)

Reply ↓
Jim Yocom on November 18, 2016 3:56 PM at 3:56 pm said:

I’ve been interested in the design of exploratory experimental studies in the social sciences for some time. (Even have a lapsed R&R on it lying around somewhere, but I let it die of old age.) I think much can be learned on the trade offs by deliberate and in-depth comparisons with experimental design in medical research and industrial applications. Industrial applications (and, if I understand the rumors correctly, a lot of biological research) run zillions of “screening” experiments, typically employing fractional replicates. Medical research often uses “sequential” experiments–running tiny clusters of participants and testing significance after each cluster. Common complaints in social sciences about the horrors of Type-I or Type-II errors, power, not knowing what to do with interaction effects–seem limited or even bizarre in the context of how other disciplines approach the use of experiments for exploratory efforts. In particular, fears of main-effect x interaction confounding seem largely at odds with how other disciplines routinely discover new effects. Plenty of scientists use screening and confounding (and sometimes sequential testing) to efficiently evaluate LOTS and LOTS of potential factors precisely because they are capable of neatly “unconfounding” them with orderly, subsequent studies. Quantitative researchers should, I think, keep an open mind and really truly honestly ask whether they would not benefit from a broader array of experimental techniques than the constipated models presented in social-scientist friendly books. (Some social scientists, alas, don’t even know that experiments CAN be used to discover new stuff–only formally adjudicate competing explanations.)

Reply ↓
LMM on November 19, 2016 10:50 AM at 10:50 am said:

I think you’re making a false distinction between hypothesis-testing and exploratory studies. Nobody just goes out and starts measuring things, like behavior, in general. What would that even mean? You are measuring specific things, and for a specific reason, to answer a question. I think the distinction may be that in the case of what we might label exploratory studies, our hypothesis may be rather vague, and RELATIVELY open-ended. You might be looking for correlation as a preliminary step to looking for causation. But you have to have some question, and the question constrains where you will look for an answer. E.g. if you ask what cause cancer you’ll measure different than if you ask what causes something else. Exploratory studies should probably be done and shared informally (which doesn’t mean they shouldn’t be funded or encouraged).

Reply ↓
Mayo on November 20, 2016 6:15 PM at 6:15 pm said:

There’s a lot of confusion about pejorative and non-pejorative “exploratory” studies. The issue is closely connected to the thorny problem of when “double counting” is unreliable. In testing the assumptions of a statistical model M, for example, the “same”, data to be used in testing a claim within M can be used to reliably test an assumption about M, by remodeling the data to ask a different question, e.g., using residuals. Anyway, because of the confusion about when explorations are sanctioned, I point out on my current blog that the explorations Clark Glymour is defending (in causal search) are not the ones Ioannidis is criticizing. https://errorstatistics.com/2016/11/19/glymour-at-the-psa-exploratory-research-is-more-reliable-than-confirmatory-research/
I’ll add a link to Gelman’s post on my blog: As it happens, he was at the session with Glymour when this arose, but the points at issue didn’t come up in the brief discussion at the conference.

Reply ↓
- Keith O'Rourke on November 21, 2016 8:19 AM at 8:19 am said:
  
  Glymour’s analysis presumably is for non-randomized studies (focused to provide adequately powered epidemiological exploratory studies in Ioannidis’ chart).
  
  Did something very similar in this study http://cid.oxfordjournals.org/content/28/4/800.full.pdf by doing all possible adjustments for covariates to see if the effect size remained positive and worthwhile*. (When I first suggested such an analysis, I was asked to withdraw from the paper which originally submitted a single best analysis – based on step-wise regression – but the journal reviewers _stepped_ very hard on such an analysis.)
  
  Also Glymour’s ideas seem like a mix of Fisher’s suggestion for non-randomized studies to make your theory complex and the blessing of dimensionality http://statmodeling.stat.columbia.edu/2004/10/27/the_blessing_of/
  
  * I was not that convinced the treatment was established as effective but rather that 1. a randomized trial to do this was simply not doable (someone later tried and had to shut down the trail due to lack of enrollment) 2. side effects were well understood and #. it a rare disease and so would not have huge cost issues. So I no, I did not find my analysis that convincing but rather supportive.
  
  Reply ↓
Jefrey Lijffijt on November 21, 2016 3:38 AM at 3:38 am said:

You are posing some very interesting questions that are essential to the problems we have been working on in the past few years; a topic we call Exploratory Data Mining (EDM). We distinguish this from EDA mainly because of the much richer toolbox of methods that we now have for computer-aided hypotheses generation. These methods mostly stem from Data Mining.

One thing that we have been considering, which I think is absolutely vital to real use, is how a data analyst fits into this exploration. Almost surely, knowledge discovery works best if you combine the expertise of humans, which have a vast body of common sense and domain knowledge, with a computer that can crunch lots of numbers.

In short, I think you should consider how human operators (as Friedman and Tukey called them in [1]) that use a data exploration system, fit into Exploratory Studies?

[1] Friedman, J.H., Tukey, J.W.: A projection pursuit algorithm for exploratory data analysis. IEEE Tr. Comp. 100(23), 881–890 (1974)

Reply ↓
Robert Grant on November 21, 2016 6:13 PM at 6:13 pm said:

Coming late to this post. I get involved in a lot of studies that are exploratory and sometimes involve hypothesis tests too. I think it is all in how you present it, in the same way that overt multiple testing is OK if you are honest about it and do some kind of adjustment, and don’t draw strong conclusions. But these papers are generally not understood as being any different to completely confirmatory ones, so we need to build awareness of principled exploration to avoid misunderstanding. And yes, they get rejected over and over (surely not because my writing!) when an does spot that difference. Another widespread problem is the use of pre-existing data, collected for another purpose, because that undermines the external validity of your first goal, Measurement.

Reply ↓
Christian Hennig on November 29, 2016 11:51 AM at 11:51 am said:

Even later to this post…
My first principle is that I am skeptical against that whatever I “find” in exploratory data analysis; I always ask me a) could this just be a meaningless instance of statistical variation and b) could it be an artifact (of measurement, selection, bad design)?
Quite a number of findings are, and it’s possible to find many of them in the process (still this can be quite useful because the same artifacts could also plague a supposedly confirmatory study). Obviously sometimes it won’t work and a sense in which something is either noise or an artifact slips through, for which reason it’s still important that this is exploratory and needs some independent confirmation.

One thing that is helpful is to give oneself a good idea how many “degrees of freedom” one has, i.e., how big the toolbox is that either has been tried out on the data or could potentially have been tried out conditionally on some earlier analyses going another way.

All of this is very shaky and informal but still I think it’s helpful.

By the way one of my favourite papers:
“Statistical inference for exploratory data analysis and model diagnostics” (Buja, Cook, Hofmann, Lawrence, Lee, Swayne, Wickham, 2009; Philosophical Transactions of The Royal Society, A) available on Andreas Buja’s homepage.

Reply ↓
Neil Lawrence on January 13, 2017 7:23 AM at 7:23 am said:

Hi Andrew, Matthew Higgs just pointed me to your post on this. Triggered by meetings on data analysis I’ve been wondering about “data readiness levels”. Just got round to posting on it here: http://inverseprobability.com/2017/01/12/data-readiness-levels

I think exploratory data analysis is a key component (recently got a copy of Tukey’s book on this … but not yet read!).

Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Thinking more seriously about the design of exploratory studies: A manifesto

64 thoughts on “Thinking more seriously about the design of exploratory studies: A manifesto”

Leave a Reply Cancel reply