David Shor writes:

Suppose you’re conducting an experiment on the effectiveness of a pain medication, but in the post-treatment survey you measure a large number of indicators of well-being (sleep quality, self-reported pain, ability to get tasks done, anxiety levels, etc.).

After the experiment, the results are insignificant (or the posterior effect size isn’t large, or whatever) on 9 of the 10 measures, but significant on the tenth. You then declare the experiment a success and publish.

I think most people would find that scenario unseemly, but what is the “right” thing to do in that situation? Traditional techniques to avoid “p-hacking” (Regularization, multiple comparison corrections, etc) don’t really apply because the observations across measures aren’t independent of each other.

I can think of some approaches: Trying to regularize in a multi-task learning environment or maybe doing some sort of factor analysis and only doing inference on the first factor. But I’m curious if you’ve thought of the problem before.

The Bull paper cites a study by Greenland which uses a hierarchical model. The Bull and Wu papers most clearly differentiate the kinds of hypotheses you want to test if you are taking a more classical (NHST) approach. The Hummel GlobalAncova tests a hypothesis that may not be what you want.

Bull, S.B. (1998). Regression models for multiple outcomes in large epidemiologic studies. Statist. Med. 17, 2179–2197.

Dallow, N.S., Leonov, S.L., and Roger, J.H. (2008). Practical usage of O’Brien’s OLS and GLS statistics in clinical trials. Pharmaceutical Statistics 7, 53–68.

Logan, B.R., and Tamhane, A.C. (2004). On O’Brien’s OLS and GLS Tests for Multiple Endpoints. Lecture Notes-Monograph Series 47, 76–88.

O’Brien, P.C. (1984). Procedures for Comparing Samples with Multiple Endpoints. Biometrics 40, 1079–1087.

Sammel, M., Lin, X., and Ryan, L. (1999). Multivariate linear mixed models for multiple outcomes. Statist. Med. 18, 2479–2492.

Hummel, M., Meister, R., and Mansmann, U. (2008). GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics 24, 78–85.

Wu, D., Lim, E., Vaillant, F., Asselin-Labat, M.-L., Visvader, J.E., and Smyth, G.K. (2010). ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics 26, 2176–2182.

I should have added that Wu et al. use a hierarchical model to estimate the variances for the t-statistic.

I am not an expert on multilevel models and I am curious about how this would be done. What would the levels be? How would the 10 outcomes be nested? I thought multilevel models were for nested data, like students within classes within schools within districts, but I don’t see any obvious nesting here. Can anyone enlighten me?

Same question here!

I think one of the problems with training in classical ANOVA-type modeling and nested ANOVA is that it requires pretty extensive detraining to become comfortable with the rich world of hierarchical modeling. Andrew’s regression book is the bible, but here are two good introductions:

Greenland, S. (2000). Principles of multilevel modelling. International Journal of Epidemiology 29, 158–167.

Ji, H., and Liu, X.S. (2010). Analyzing ’omics data using hierarchical models. Nat Biotech 28, 337–340.

The outcome is a 10 element vector. It could be modeled as coming from a Multivariate Normal with the same deterministic equation form for the means vector, but with intercept and treatment effect varying for each outcome. Put a common prior for the intercepts and a common prior for the treatment parameter (maybe coming from the same multivariate distribution), estimate the full covariance matrix of the errors, and voilà! A multilevel model.

My Stan model for this, based on the SUR model in the manual and some similar models I have used in multi-outcome trials:

data {
  int<lower=1> N;                   // number of observations
  int<lower=1> J;                   // number of outcomes
  vector[J] y[N];                   // outcome vector for each observation
  real treat[N];                    // treatment indicator
}
parameters {
  vector[J] alpha;                  // intercepts
  vector[J] beta;                   // treatment effects
  cholesky_factor_corr[J] E_Omega;  // Cholesky factor of error corr matrix
  vector<lower=0>[J] Tau;           // error scales (sds)
  vector[2] muPars;                 // mean of parameter distribution
  cholesky_factor_corr[2] P_Omega;  // Cholesky factor of corr matrix for prior on parameters
  vector<lower=0>[2] TauPars;       // scales for prior on parameters
}
model {
  vector[J] mus[N];
  vector[2] pars[J];
  matrix[J, J] ErrorSigma;
  matrix[2, 2] ParsSigma;

  for (n in 1:N)
    mus[n] <- alpha + beta * treat[n];
  for (j in 1:J) {
    pars[j, 1] <- alpha[j];
    pars[j, 2] <- beta[j];
  }

  // Cholesky factors of the two covariance matrices
  ErrorSigma <- diag_pre_multiply(Tau, E_Omega);
  ParsSigma <- diag_pre_multiply(TauPars, P_Omega);

  E_Omega ~ lkj_corr_cholesky(2);
  P_Omega ~ lkj_corr_cholesky(2);
  Tau ~ cauchy(0, 2);
  TauPars ~ cauchy(0, 2);
  muPars ~ normal(0, 5);
  pars ~ multi_normal_cholesky(muPars, ParsSigma);
  y ~ multi_normal_cholesky(mus, ErrorSigma);
}

The common prior on the treatment effect for each outcome will partially pool the estimates, and the estimated mean of this prior can be used to make inferences about the overall effect of whatever is being tested. All caveats about data quality still apply, though.

Think of outcomes as being nested within patients.

Imagine your data have columns | outcome_score | outcome_measure_type | individual | treatment |, so you have 10 outcome_scores for each individual, one row per outcome_measure_type (your data are *tidy*).

Now what if you pooled all the data together and ran your OLS/polr/logit or whatever on outcome_score ~ treatment? The issue is that your errors aren’t iid: individuals get measured multiple times (across different outcome_measure_types), and each outcome_measure_type gets measured multiple times (across different individuals).

Next you realize that you’ve violated an assumption of your model, and decide to split the data, grouping by outcome_measure_type. On each subset of the data, you run your OLS/polr/logit. But now you’re saying that when estimating how much the pill affects pain, you can learn _nothing_ from how it affects anxiety. You’re also inviting all sorts of garden-of-forking-paths problems.

One approach (hinted at by Sean below) would be to let each outcome type *and* individual have their own treatment effects, but in such a way that the treatment effect on one outcome can provide information to the other outcomes. Something along the lines of (for individual i, outcome j, and a link function f()):

outcome_score(i, j) = f(bi(1,i) + bj(1,j) + (bi(2,i) + bj(2,j))* treatment(i, j) + residual(i, j))

with bi ~ multi_normal(mu_bi, S1); bj ~ multi_normal(mu_bj, S2)

Note that this isn’t going to be well identified without prior information. Setting mu_bi to 0 should help.
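The pooling mechanism here can be sketched outside Stan with a toy empirical-Bayes calculation. Everything in this sketch is hypothetical: the effect sizes, the assumed-known standard error, and the moment-based shrinkage weight, which stands in for what a full hierarchical fit would estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
J = 10                        # number of outcomes
true_mu, true_tau = 0.3, 0.1  # common effect and between-outcome sd (made up)
se = 0.15                     # assumed sampling sd of each raw estimate

theta = rng.normal(true_mu, true_tau, J)  # per-outcome true effects
est = rng.normal(theta, se)               # noisy per-outcome estimates

# Method-of-moments partial pooling: shrink each raw estimate toward the
# grand mean by the estimated ratio of between-outcome to total variance.
tau2_hat = max(est.var(ddof=1) - se**2, 0.0)
w = tau2_hat / (tau2_hat + se**2)
pooled = w * est + (1 - w) * est.mean()
```

Each pooled estimate sits between its raw value and the grand mean, which is the same thing the common prior on the treatment effects does inside the Stan model, with the amount of shrinkage estimated from the data.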

I put together a quick demo here:

https://gist.github.com/khakieconomics/9dd785c241a1ee0b6f32

Bob also has an excellent case study here, that talks about pooling effects.

http://mc-stan.org/documentation/case-studies/pool-binary-trials.html

Thanks a lot to everyone who responded to my comment with all of these helpful explanations!

Was the study preregistered?

Ideally, before even doing the first measurement, I’d hope the researchers have gotten expert opinions about whether all ten outcomes matter equally and, if not, how to do the weighting. Also, considering the cost-benefit analysis, it would be nice if the researcher had pre-declared what threshold of improvement would be break-even to proceed further.

For a typical clinical trial, pre-registration is ideal. But regardless, it seems to me that the best thing to do is … another study, focused on that 10th issue, and the people who suffer from it most. Then you can label the drug for the relief it offers. In short, the study was hypothesis generating, not conclusive.

Ben:

Sure, it’s fine to do a new study. But . . . (1) before doing that new study, it would be good to have an estimate of the effect size, (2) such an estimate would be useful in designing a new study, and (3) if you do want to do follow-up research, it’s not at all clear you should do exactly one study, and even if you do just one follow-up study, it’s not clear that you should do it on that tenth measure, just because it happens to have been statistically significant in that one comparison.

Hi,

assuming we have limited resources to do a follow-up study, our best bet would still be the effect that was significant, or am I mistaken?

regards

Alex:

If you know you want to do exactly one follow-up study and all costs and benefits for the 10 outcomes are equal, then, sure, the p-value will be something like a one-to-one function of the measured outcome, and so you might as well pick the largest one. But in general there’s no reason to do 1 follow-up study rather than 0 or 2 or 3, and in general there will be information distinguishing the outcomes (obviously you’ll know something about them or you wouldn’t have chosen to measure them at all), so, no, I think it’s a bad idea to make such a decision based on this noisy number, just because it happens to be around. Again, the problem is that observed p-values are super noisy, and it just about can’t be the right decision to make future choices based on p-value thresholds.

I don’t think this has to do with the number of experiments I can do; let us assume I can do 3 for argument’s sake. I originally investigated 10 effects and now I can run the follow-ups. So, how should I pick the candidates using the information I gained in the first round of experiments?

thanks and regards

Alex:

If your number of future experiments is picked based on prior reasons, you’re already doing better, because now you’re not using statistical significance as a rule to decide how many things to follow up. At this point, if you want to choose your portfolio of 3 experiments, you should at the very least have some sense of costs and benefits. You can use the results from the first round to help make your decision—but you should remember how noisy those p-values are (recall that 9 of them weren’t even statistically significant). Weak data implies that prior info is more important here, so I’m guessing it’s a big mistake to make your decisions based on data alone.

Assuming you are staying in the NHST paradigm, another assumption behind pursuing further (hopefully experimental) research on the one significant response, based on the p-value alone, is that all the tests had equal power, which is certainly unlikely.

Aren’t the 10 outcomes actually measuring one latent variable?

In that case something like Item Response Theory might also be applicable. My understanding is that the Rasch model used in IRT shows where each individual falls on the latent-variable scale, and the Item Characteristic Curve shows the “difficulty” of each question in terms of the latent-variable state. The term difficulty is used because IRT is often applied to tests where we want to know how difficult each question is as we measure the latent variable (intelligence or skill) for each respondent.
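For concreteness, the Rasch model’s item response function is just a logistic curve in the difference between a person’s latent-trait level and an item’s difficulty. A minimal sketch (all values hypothetical):

```python
import numpy as np

def rasch_prob(ability, difficulty):
    """Rasch model: probability that a respondent at the given latent-trait
    level endorses (or passes) an item of the given difficulty."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5;
# easier items (lower difficulty) are endorsed more often.
p_equal = rasch_prob(0.0, 0.0)    # -> 0.5
p_easy  = rasch_prob(0.0, -2.0)   # well above 0.5
p_hard  = rasch_prob(0.0,  2.0)   # well below 0.5
```

Fitting abilities and difficulties jointly from the 10 binary or ordinal outcomes would then give one latent “well-being” score per person, instead of 10 separate tests.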

Couldn’t a Seemingly Unrelated Regression (SUR) model be used in this case?

Normally, though, what happens at the planning stage of a drug-efficacy study with multiple outcomes of equal importance is that you specify and justify a method you will use to control the FWER, typically something Bonferroni-based like the Holm or Hochberg procedures. The OP is incorrect when he says that multiple-comparison methods must assume independence. In real life, if you conducted a clinical trial but failed to prespecify what the endpoint is and how you were going to analyse the data, your main concern shouldn’t be about publishing papers, it should be updating your résumé and looking on monster.com, because you are going to be fired once your boss finds out.

Ok, so you start a new job and are trying to salvage something from the data from a flawed phase III study that your recently fired predecessor ran, and you want to write a paper claiming the p-value was significant… If the p-value was <0.005, I guess I would maybe try to argue that if the dumbass who ran the trial had prespecified any reasonable analysis, even the most conservative possible multiplicity adjustment, the difference would have been significant.

But in any case, no matter the p-value, you should still publish your results, and explain that the subscale with the observed difference wasn't the primary endpoint so as to not mislead. It is still useful science to prove that a drug probably doesn't work well. If you are in the USA it is (probably) a legal obligation to publish the results of post-phase I clinical trials.

Mikey:

Given the information above, I don’t think it’s correct to say that they’ve proved that the drug probably doesn’t work well.

True, definitely not enough info to say it is proven not to work well. But I was also assuming that, as a clinical study testing efficacy, it would have been adequately sized to estimate treatment effects with high precision, and that the outcomes recorded were chosen because they were all relevant to judging efficacy in this indication. So I considered that if 9 of 10 of these outcomes didn’t reach p<0.05 despite the high power, it was reasonable to infer that the true treatment effect, for the original indication being tested, is not much greater than 0.

Mikey:

Sure, I agree with the general principle that disappointing results should be published too. Going further and looking for something that might work is great, but one should also report all the things that didn’t work!

Mikey,

Some multiple testing procedures hold under arbitrary dependence structures among the hypotheses. The Bonferroni correction and the Holm step-down procedure both control the FWER under arbitrary dependence. For the FDR, the Benjamini–Yekutieli procedure rejects the k smallest p-values, where k is the largest index such that p_(k) <= k * alpha / (m * (1 + 1/2 + … + 1/m)); the harmonic-sum factor, roughly log(m), is what protects you against arbitrary dependence among the hypotheses.

See here: http://onlinelibrary.wiley.com/doi/10.1002/sim.6082/abstract for a nice review article.
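Both kinds of correction are easy to implement directly. Here is a minimal sketch of the Holm (FWER) and Benjamini–Yekutieli (FDR) procedures, applied to the post’s scenario of nine unremarkable p-values and one nominally significant one (the p-values themselves are hypothetical):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down procedure: controls FWER under arbitrary dependence."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):   # compare i-th smallest to alpha/(m-i)
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                        # stop at the first failure
    return reject

def benjamini_yekutieli(pvals, alpha=0.05):
    """BY step-up procedure: controls FDR under arbitrary dependence by
    deflating alpha by the harmonic sum 1 + 1/2 + ... + 1/m (~ log m)."""
    p = np.asarray(pvals)
    m = len(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p[idx] <= rank * alpha / (m * c_m):
            k_max = rank                 # largest rank passing its threshold
    reject[order[:k_max]] = True
    return reject

# Nine unremarkable p-values and one at 0.03 (all hypothetical).
p = [0.03, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]
holm_rej = holm(p)                # no rejections: 0.03 > 0.05/10
by_rej = benjamini_yekutieli(p)   # no rejections either
```

Under either correction the lone p = 0.03 does not survive, which is exactly why declaring the experiment a success on that tenth measure is unseemly.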

Why are you giving this drug? It’s to improve some kind of concept of quality of people’s lives. So, treat this as a decision theory problem with a latent measure of “quality of life”. Each of these 10 outcomes somehow contributes to your ability to understand this unobserved “quality”. Build a model for how your outcomes relate to this quality. Then use the information you have from all 10 outcomes to estimate this quality for each person.

What drug has the purpose of “improving quality of life”? I mean, incidentally, sure, but usually drugs do something like “cure infections” or “reduce headache pain”.

Sorry, that was snarky, but my real point is that just because you have the particular 10 outcomes that you measured doesn’t mean those measures all have anything in common, or should all be included, or should in any way be expected to have any model-able common effect (I mean, a pain med that reduces anxiety – great! but not what I’m looking for in a pain med). In general, my mental model reduces to each outcome having its own biological relationship to the medication, so I’d end up including a full set of interactions between treatment and outcome anyway, and then you aren’t even buying any additional information by estimating everything at once.

Same critique with the Bonferroni type corrections – so what if all of them are/not significant? They aren’t all equally important in a pain medication. We shouldn’t give them equal weight in the effectiveness evaluation.

I’m much more open to just showing all of the T/C comparisons by outcome. But for a true medical trial of a drug that is supposed to do a particular thing, I would think that there should have been from the start a specific primary outcome defined in a specific way (or just a few). In the case of a few closely related measures (say – headache pain on a subjective scale and hours of headache pain), I’d be much less worried about thinking of this as a latent factor in people (even if I don’t really like that whole framework in general).

Jrc:

Forget about exchangeability or thinking the 10 measures have anything in common. Just do basic Bayes. You have 10 outcomes, you need a prior on each. Base the prior on your understanding of the literature. If the drug is being compared to existing best practices, effects are likely to be small. You could set priors to normal with mean 0 and sd 0.1 on some reasonable scale, for example. Maybe you want the priors to be correlated, because if the drug is effective it might be effective in different ways. Maybe you have more information too. The default “8-schools” hierarchical model is just that: a default. It’s an exchangeable model not because there is a belief that the 10 outcomes are similar but because in a default model you can’t do much more.
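To see how much work a normal(0, 0.1) prior does on its own, here is the conjugate normal-normal update for a single outcome (the raw estimate and standard error are hypothetical, on some standardized scale):

```python
# Conjugate normal-normal update: prior N(0, 0.1^2), one raw estimate with
# standard error 0.2 (both numbers hypothetical, on a standardized scale).
prior_sd, se = 0.1, 0.2
est = 0.45  # a "2+ sigma" raw estimate

# Posterior precision is the sum of prior and data precisions;
# posterior mean is the precision-weighted estimate.
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
post_mean = post_var * (est / se**2)

# Shrinkage factor is prior_var / (prior_var + se^2) = 0.01/0.05 = 0.2,
# so post_mean = 0.45 * 0.2 = 0.09: most of the "significant" estimate
# is attributed to noise.
```

No hierarchy or exchangeability is needed for this to happen; the informative prior alone discounts the lone significant comparison.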

OK – so am I just making the same mistake as others above who are thinking “multi-level model” and assuming that there must be some underlying structure that generates correlations between the multiple units in the model (be they units like schools or units like outcomes)? Said another way: you would be happy letting treatment have uncorrelated effects across all the outcomes, but still think they should be estimated simultaneously in a single Bayesian model?

I guess I don’t see the gains from the multilevel model if you let the effect of treatment be totally uncorrelated across outcomes (and let any covariates have separate effects across outcomes). Shouldn’t it eventually reduce to just doing a Bayesian analysis on each of the outcomes separately (or running 10 different least-squares or logistic regressions, if you had totally uninformative priors)? Where do the efficiency gains come from – just from the priors? Or is there some pooling of information you want that I don’t see?

jrc: I think the only reason to take a drug is that you have a problem that makes your quality of life lower. “cure an infection” is a *mechanism* not a purpose in and of itself. To see how that works, notice that you are actually “infected” with all kinds of beneficial bacteria. You wouldn’t want to “cure” those, because they actually enhance your quality of life rather than detract from it.

But, I do get your point, and I think it’s actually the same point that I have. Presumably you measure 10 things, they don’t all have a similar effect on the thing you’re trying to achieve (let’s say reducing the bad effects of pain). So you need a model for how each of the 10 contributes to “reducing the bad effects of pain”. Some of them may not contribute at all… so they would basically not enter into the model.

Also, although drugs can be “intended” to do X, it’s often the case that after you give them you find out that they’re better at Y… so you change what they’re “intended” to do. Cyclobenzaprine, for example, was “designed” as an antidepressant, but it turns out it’s a great non-addictive muscle relaxant for people with injuries; it makes people sleepy, though, and as a result is a lousy antidepressant.

Another way to put this is: just because you measured 10 things, doesn’t mean your outcome is 10 dimensional, you need to collapse this measure down to something meaningful. It could be 1 dimensional, or 2 dimensional (main effect, and a side effect?), but it’s probably not really 10 dimensional.

But how? How does one convert 10 measures into one? Is there an objective way? Or do we just ask different experts for their weighing preferences and live with the huge non-consensus I expect we would get?

Creating models is an art. We can’t do science without it. It’s a mistake to think that it’s all about “weighting,” though (i.e., a linear combo of outcomes).

There is undoubtedly some significant uncertainty in the model. Fortunately we have a comprehensive language for expressing uncertainty. I hope the experiment had a control group. We can compare experimental to control under the full range of uncertain goodness models. Maybe we get some clear comparisons even if the form of the model has a wide range.

I think I agree with you entirely in the general sense.

It’s the specifics that puzzle me. How does one convert these 10 measurements into any one objective function? Can you suggest ways outside of just asking doctors/patients for their weighing opinions?

Inevitably this will require insight specific to the area of study. But, I think there’s no way to get away from either asking, or experimenting on, people’s preferences. Some things I could think of for pain meds specifically:

Things we want pain meds to do:

reduce the product of intensity and duration, or more specifically the integral of intensity over time (∫ intensity dt).

increase our ability to function at productive mental tasks, instead of being distracted by pain: accuracy and speed on moderate difficulty logic or math tasks could be a proxy for how distracting the pain is. They may have measured some other proxy.

increase our ability to function at physical tasks, similar to above.

NOT increase our ability to function in the short term at physical tasks if doing so is causing damage (ie. the fact that you can’t feel your ligaments tearing is no reason to go lift massive weights and tear your ligaments)

Things we don’t want:

addictiveness

physical damage, or increased risk of damage to organs (heart, kidney, liver, peripheral nervous system etc)

numbness, tingling etc

many more I’m sure….

it makes sense to sit down and think about what these things they’ve measured mean, how they affect people, whether they interact (ie. intensity and duration clearly interact through an area under a curve), whether there’s a particular purpose that this drug excels at, which should be considered (ie. suppose your drug doesn’t really make the pain stop, but it makes it very easy to ignore and do lots of important tasks. Perhaps there is a good use for such a drug in an emergency situation, like for soldiers or rescue workers or something)…

you can’t just stamp out models of the world via an ANOVA factory or whatever. This is one of the biggest problems with typical “frequentist procedures” associated with “null hypotheses”, they’re just dodging the real work of modeling.

In this case, reducing 10 measures down to 1 thing will let you then use information gained from the 10 measurements on each person to get a hopefully more informed sense of how well the drug is working at the task we really care about it doing. If we don’t know how to weigh different aspects in terms of tradeoffs, then we can at least be explicit about that, by providing priors over weighing factors that are broad and independent. Then, we can still compare experiment to control under each sample from the posterior, we may find that it doesn’t matter much how we weigh things, we always get some effect. Or it may be that we only get a positive effect under certain regimes… at least we’ll find out something more than just “p < 0.05 for effect 3" or whatever

To be more explicit about that, we don't really need a "scale factor" for the goodness function in each case, we care about mean(g(experiment))/mean(g(control))

or, if we have matched pairs, even better mean(g(experiment)/g(matched-control))

where g is our "goodness" function, which has some uncertainty in weighing factors and in parameters related to measurement error, and in perhaps which of the various forms we might prefer, which we can model using uncertainty as bayesian probability (it's just not possible to model this via frequentist probability, since there's no sense in which the "frequency with which g1 is the right model" can be defined)
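A toy version of this: draw many plausible weighting vectors from a broad prior and propagate each one through the experiment/control comparison. Everything here is hypothetical (the simulated data, the Dirichlet prior on weights, and the linear form of g), and a difference of means stands in for the ratio just to keep the sketch numerically stable when g(control) is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for trial data: 100 people per arm, 3 outcomes
# (higher = better); the treatment truly helps the first two a little.
experiment = rng.normal([0.3, 0.1, 0.0], 1.0, size=(100, 3))
control = rng.normal([0.0, 0.0, 0.0], 1.0, size=(100, 3))

def g(outcomes, w):
    """A simple 'goodness' score: a weighted combination of the outcomes."""
    return outcomes @ w

# Broad prior over the unknown weighting factors, propagated through
# the comparison: one experiment-minus-control difference per weight draw.
diffs = np.array([
    g(experiment, w).mean() - g(control, w).mean()
    for w in rng.dirichlet(np.ones(3), size=1000)
])
frac_positive = (diffs > 0).mean()  # how often the treatment looks better
```

If frac_positive is near 1 across the whole prior on weights, the comparison doesn’t hinge on how we weigh the outcomes; if it varies, we have learned exactly which weighting judgments the conclusion depends on.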

Also, you might have, say, an allergy medicine that is very effective at reducing the symptoms of allergies but makes you uselessly sleepy… so you don’t take it, because overall it’s not improving your quality of life; better to be awake with mild allergies. Until maybe you’re having anaphylaxis, and then you do take it, because it’s overall improving the quality (and quantity) of life.

Maybe calling it “quality of life” invokes something too metaphysical or whatever. The point is, generally you need to evaluate whether on some scale things are “better” or “worse”. and this implies an *ordering* and only 1 dimensional fields are ordered, so you have to collapse down to a 1 dimensional measure, whatever you call it.

I think we are in more agreement than it seems (a usual thing for us). But my point here would be to run one analysis on “symptom of allergies” which tells you how effective the drug is at doing the thing you want it to do; and a separate analysis on “side-effect” that tells you how effective the drug is at not doing other (undesirable) things.

I just don’t get the value of estimating these things simultaneously, unless you are trying to use information from the same person across outcomes. I think that either you have to believe in some underlying, unobserved “factor” that, for individuals, specifically captures how they respond to treatment and this response is common across many outcomes; or you don’t get any efficiency gains. And since in general I tend not to believe in that first bit…

“trying to use information from the same person across outcomes”:

Yes, absolutely! The point is to use information from a given person’s outcomes to estimate a net “goodness of the effect”

Having thought about this a little, let me clarify:

There are 2 consumers of this information. 1) the people doing the drug development and approval, and 2) consumers who might take the drug.

Under the current regime (1) needs some information about whether the drug is “safe and effective” enough. This can be based on some kind of “average” goodness of outcome across “typical” consumers.

Once a drug is on the market, what consumers need is information about the individual effects of clinical interest, so they can balance under their own “utility” function, which is something that changes constantly (i.e., I care a lot more about anti-allergy effectiveness when having anaphylaxis than when I’ve got mild itchy watery eyes, where I care more about not having things like sleepiness).

So, I agree with you that we want to know the individual effectiveness of each major clinically relevant axis, but I still think to make a decision you have to aggregate it down to a kind of “average” approximation of a single-dimensional potential clinical consumer group’s preferences and desires for different outcome types.

Agreed in principle. But I’m still not convinced that borrowing information within a person across outcomes is a good idea. I might be on the (possibly rightfully) losing side of this argument, but if the within-and-between-person results are really that much different from the between-person results in an RCT, it must mean something else is going on (which I guess is Andrew’s point – that thing going on being noise swamping signal).

That said, I’m used to dealing with observational datasets with thousands of observations or experimental ones with many hundreds (or thousands), and so the problems I’m used to thinking about are somewhat different. If you have n=50, maybe you do need to use within-person variation… you just need to do it very carefully.

Of course, the obvious answer would be to do all of the analyses and show all of them (or a fair set of them, with the others in some online appendix somewhere). But that is a suggestion for the journal world, not the FDA world.

One issue with these posts to Gelman’s blog is that we rarely have very good *specific* information about the application, but advice and techniques vary a LOT with the specifics.

So, for example, suppose there are just two dimensions, and they were “average degree of pain across 3 measurements over the first hour” and the second dimension was “time until pain score dropped to “. CLEARLY both of these are potentially noisy when it comes to measurement, and they obviously interact. You can use the two of them to model some latent “pain as a function of time” and hence an area under the curve of that function for example. Doing so will help you in your analysis pretty obviously.

But if there were two dimensions and they were “time until pain drops to SMALL VALUE” and “time standing on one leg with eyes closed” (to measure whether the drug causes dizziness or something) then the two are pretty orthogonal.

To the extent that some of the 10 outcomes measure different aspects of a similar thing (ie. pain and its bad effects) then combining them will generally reduce noise. To the extent that they measure really different things (say, pain levels, liver function, dizziness, urinary frequency, dry mouth…) then combining them is probably harder and will tell you less about effectiveness.

With small expensive studies, it’s important to extract information, and using multiple outcomes per person each of which is designed to measure some aspect of effectiveness is a great way to get better measurements. But, there’s no silver bullet. You need specific information about the project and background information about the topic to do a good job (for example, we know that liver failure is worse than dry mouth).

>”Suppose you’re conducting an experiment on the effectiveness of a pain medication, but in the post-treatment survey you measure a large number of indicators of well-being (sleep quality, self-reported pain, ability to get tasks done, anxiety levels, etc.).”

This sounds like another one of those studies designed to ignore everything that has been learned previously (or perhaps so little has been learned from tens of billions of dollars and decades spent studying painkillers?).

Humans are not static systems. Does sleep quality, etc cycle daily, weekly, monthly, seasonally? What kind of intra-individual day-to-day variation do we expect? What does the curve look like if you plot vs age? What do dose-response curves look like? What models have been proposed to explain these curves?

How are you even going to come up with a model to explain quantitatively why a painkiller had the effect size it did? This is impossible with a single “effect size” number, you need to get some sort of curve…

>”After the experiment, the results are insignificant (or the posterior effect size isn’t large, or whatever) on 9 of the 10 measures, but significant on the tenth. You then declare the experiment a success and publish.”

Ah, of course. The study is designed around NHST rather than to learn useful things about the world. If previous studies were also designed in this way, it would make sense that we continue to have a rudimentary understanding no matter how much time/money is spent on the problem.

Surely this question requires some scientific context before an answer can be made. See my commentary on the ASA statement on p-values for an exposition of appropriate analytical differences between exploratory and planned experiments (or, equivalently, between preliminary and definitive studies). [Document 13 on this page: http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108]

The common use of inadequately described questions like this makes it hard for users of statistics to know what they should do, because people tend to expect and give unconditional answers.

Do you really think it’s a good idea to call a study “definitive”?

Is “Confirmatory” better?

No matter what you call it, I’d sure like a clear way to tell apart a fishing expedition / data mining / hypothesis generation / EDA from a clear, specific, pre-defined hypothesis someone set out to test.