We should provide more help and guidance along these lines. As a start, I’ve made more of a feature of John’s examples:

http://damjan.vukcevic.net/2015/09/22/eliminating-significant/

Related to this: I just saw the new editorial from Jeff Leek and Roger Peng – it seems like a helpful little tool.

]]>> Not trying to give John a hard time, these things do need to be restated over and over again, but there has been so little real impact

I really can’t see why that initial statement by me led you to think I missed “it hasn’t sunk very far into actual practice”.

I know I am often hard to understand, but “but there has been so little real impact” would seem obvious.

]]>My apologies, I presumed you were stating a preference based on some experience (not “I’d like to see what methods are available”).

Without such experience, it might be difficult to get what I am getting at, but you do seem to be confusing (observed) statistics with (unobserved) parameters.

]]>Wow — that’s not what I said, nor was it what I was thinking. I was thinking of what information is useful to the buyer.

I realize that in some cases a loss function may be useful, but in most cases where I’ve seen them used, there isn’t a good justification (and often no justification whatever) of why that particular loss function was used.

]]>How much more useful one answer is than another in any given situation (i.e. utility) is *exactly* what a loss function describes.

For some tasks (e.g. working out insurance premiums per household, in a certain area in a certain time period) considering your utility will lead to use of the mean, or something like it. For others (e.g. as you point out, estimating how much a buyer will pay, where each extra dollar of discrepancy counts the same) you’ll end up somewhere near the median – even starting with the same data.

The usual sensitivity argument (mean sensitive to outliers, median not) isn’t very convincing; there are entire families of location estimates that are not sensitive, why not use one of the other ones? The loss-based approach can be a lot more work, but with more satisfactory answers.
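The link between loss function and summary described above can be checked numerically. The following is only a minimal sketch: the price list and the search grid are made-up illustrations, not data from this discussion. It searches for the single number that minimises total squared error and the one that minimises total absolute error over the same data.

```python
import statistics


def best_constant(data, loss, grid):
    """Return the candidate in `grid` that minimises the total loss over `data`."""
    return min(grid, key=lambda c: sum(loss(x - c) for x in data))


# Hypothetical right-skewed "sale prices" (in thousands); purely illustrative.
prices = [200, 220, 250, 270, 300, 320, 350, 400, 900, 2500]

# Candidate summaries: every whole number from 0 to 3000.
grid = list(range(0, 3001))

# Squared-error loss: each extra unit of discrepancy counts more than the last.
squared = best_constant(prices, lambda e: e * e, grid)

# Absolute-error loss: each extra unit of discrepancy counts the same.
absolute = best_constant(prices, abs, grid)

print("squared-error minimiser: ", squared, "| mean:  ", statistics.mean(prices))
print("absolute-error minimiser:", absolute, "| median:", statistics.median(prices))
```

With an even number of observations, every value between the two middle observations minimises the absolute loss equally well, so the search may return an endpoint of that interval rather than the conventional midpoint median.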

]]>The question of mean vs median might come up by considering a loss function, but it can also come up as “Which is more useful for the problem at hand?”

A prosaic example: Real estate summaries usually give median prices for house sales (in a certain area in a certain time period) because that is what is relevant: mean prices are usually affected strongly by a small number of especially expensive houses, but most buyers are interested in houses nearer to the median.

]]>I wasn’t claiming that there are currently meta-analysis techniques that focus on medians and quartiles — I don’t know if there are or not. The phrase “I’d like to see ..” (which you omitted from your quote) and the “would make more sense to me” seem to me to show that I was not making such a claim, but more a claim that such methods need to be developed if they are not already, and used if and when they are available, when appropriate.

To answer “in what sense would it be better”: For problems where the median and quartiles are the important parameters, … well, it seems logical that a meta-analysis on the important parameters would be better than meta-analysis on a parameter that is of lesser relevance.

]]>So again, it’s good that you’re doing this – but I am wondering if there is something else other than, as Tukey put it, “mere technical statistical knowledge”, that is more central.

> the sampling distribution from the study groups is usually approximately Normally distributed, making the reported means and standard deviations approximately sufficient _and_ good summaries for meta-analysis input

This did not paint the picture I was trying to get across. Think of having the raw data from the different studies – this would allow you to get the likelihood for any distributional assumption, parameters and parameterisation you wish. Conveniently factor that likelihood into groups within studies – the factors will almost always be approximately log quadratic. The study group means and standard deviations, as usually worked with (say, in inverse variance weighted methods), provide an exactly log quadratic likelihood that is almost always a good approximation to the raw data one. The problem with other summaries is that, as far as I know, there is no analytical correction to get an approximation to the unavailable raw data likelihood (Martha may know of one), and we all know that for most distributional assumptions summaries like median and range are not sufficient, so there must be a loss of information.
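To make the “exactly log quadratic” point concrete, here is a minimal sketch of fixed-effect inverse variance weighting. The three study estimates and standard errors are invented for illustration, and the function names are hypothetical. Each study contributes a quadratic term to the log likelihood, and the inverse-variance weighted average is the value that maximises their sum.

```python
import math


def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * m for w, m in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


def log_quadratic_likelihood(theta, estimates, std_errors):
    """Sum of per-study Normal log likelihood terms, dropping constants."""
    return sum(-0.5 * ((m - theta) / se) ** 2
               for m, se in zip(estimates, std_errors))


# Hypothetical within-study mean differences and their standard errors.
estimates = [1.2, 0.8, 1.5]
std_errors = [0.4, 0.3, 0.6]

pooled, pooled_se = inverse_variance_pool(estimates, std_errors)
print("pooled estimate:", round(pooled, 3), "standard error:", round(pooled_se, 3))

# The pooled estimate sits at the peak of the combined log likelihood.
at_peak = log_quadratic_likelihood(pooled, estimates, std_errors)
nearby = log_quadratic_likelihood(pooled + 0.05, estimates, std_errors)
print("log likelihood at pooled value:", round(at_peak, 4), "nearby:", round(nearby, 4))
```

This sketch assumes a common effect across studies; it does not cover random-effects models or the corrections needed when only medians and ranges are reported.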

> the Cochrane Collaboration’s “legislation”

Standard groupthink that Collaborations almost can’t avoid: someone with charisma gets a leading role and they can’t help discouraging other views (for instance, I have been banned from their Statistical Methods Group for posting overly provocative emails).

> I don’t understand how a focus on computation (via Stan!) will help with the broader issues related to inference.

Indirectly and gradually it should – it will enable more people more flexibility in how they do statistics.

Thanks for taking the time to reply to the comments.

]]>If I seek to minimize MSE, then I probably want to estimate a Mean. And if then my estimator is a little less efficient than the estimator of the Median, because of skewness or kurtosis in the population distribution, then so be it.

But yes, people should be *aware* of what they might be giving up in efficiency when they choose one loss function over another.

]]>> are focusing on appears to be “the way things are currently done.”

When you are trying to make sense of published studies done by others (meta-analysis) you don’t have much choice.

> Why is there so much focus on the mean

The mean is just one of many possible summaries (statistics) – it does not constrain the modelling of (any) parameter given the likelihood function it defines.

> (including meta-analyses) focusing on medians and quartiles rather than on means and standard deviations. That would make more sense to me.

Do you have a reference on how such a meta-analysis could be done (i.e. carried out by a statistician) and in what sense it would be better? (I am taking you as making a claim here that you are willing to defend.)

Thanks.

]]>This is exactly right! Or at least fits exactly with my experience.

]]>Simon

]]>Thanks, I think we agree on most things!

Re my “classic example”, yes, total or mean cost is important when considering cost-effectiveness and related population health issues. Here’s a nice statistical review. See also Thomas Lumley’s review of the broader issue of over-emphasis on normality assumptions (and the related tendency to flee from modelling means): “The importance of the normality assumption in large public health data sets.” Annual Review of Public Health (2002) 23: 151-169. (The title should have been “The non-importance…”!)

Re “Many students seem to come away from their basic statistics education thinking that means can’t be estimated reliably if the data are skewed”, perhaps a better way of putting it is that students seem to come away with the message that medians are good because they have better statistical properties – but in fact it’s the reverse: means are generally better estimated (because of the CLT)! Of course I agree that medians may be more relevant in some problems but I guess I’m not too keen on the idea of a “default parameter-of-interest”…

]]>Thanks for letting Andrew post your notes, and for your comments. I think talks like yours are important, and I have made a note to look at your notes for possible revisions to my analogous talks next time I give them.

Some comments on “But if we are trying to do scientific inference we need to consider what is the most relevant parameter (or parameters) for our substantive question, and in some cases this may be the mean (even if the distribution is skewed). A classic example in health is when costs (or some other proxy for total population burden) are the target. Many students seem to come away from their basic statistics education thinking that means can’t be estimated reliably if the data are skewed”:

“But if we are trying to do scientific inference we need to consider what is the most relevant parameter (or parameters) for our substantive question”

I agree wholeheartedly.

“and in some cases this may be the mean (even if the distribution is skewed). A classic example in health is when costs (or some other proxy for total population burden) are the target.”

My first reaction was to ask for explanation here, but on second reading, I think I see that “proxy for total population burden” is the key. Is this correct?

“Many students seem to come away from their basic statistics education thinking that means can’t be estimated reliably if the data are skewed”

This is something I hadn’t observed, so thanks for pointing it out; I can see how Keith’s comments were addressing this. My point, however, was/is that the mean is often not the appropriate parameter when dealing with a skewed distribution. In these cases, whether or not the mean can be estimated is moot, since in such cases it is the median that needs to be estimated.

Also: Bearing in mind that symmetric distributions have median = mean, it seems that the default parameter-of-interest in most cases ought to be the median.

Also a note on confidence intervals: I think that frequentist confidence intervals have the advantage over p-values that they (at least, if used properly) keep uncertainty up front (rather than the certainty suggested by the language often used with p-values). But they only take into account sampling variability, so that modeling uncertainty gets ignored, and estimates appear more precise than is usually warranted. And, of course, like p-values, they are often misunderstood and/or misused.

]]>– Andrew’s point about a “different sort of [audience]” (for my talk) is important. By the way, they weren’t primarily “clinical” people – I guess a fair minority had some sort of clinical background but most were researchers in public health with a range of (semi-)quantitative backgrounds in epidemiology and social science. I completely accept the point made by several commenters that there is nothing new in what I had to say. I wasn’t claiming to say anything new but just to bring these issues to the attention of yet another group who have either never seen the arguments before or have sublimated them (head-in-the-sand style). I believe it really is the case that a vast majority of people doing empirical research have only very vague notions about p-values and stat-signif – it is just what you do to find out if you’ve “got something” (and therefore are likely to be able to publish it). In the rarefied atmosphere of Andrew’s blog people can say this is not our problem and it was all pointed out years ago, but I think that’s ignoring the real world of what’s going on out there.

– So what can we do about it? I don’t really know – as perhaps is painfully obvious from the talk! But we could certainly start by actively discouraging the routine use of jargon such as “statistically significant” – that would I think go a long way to preventing the worst misunderstandings and misuse. So there would no longer be a conventional way of “filtering” research results… Well, given the chaos and misunderstanding that I believe flows throughout much of scientific publication from this mostly arbitrary filtering I don’t think that would be such a bad thing. Researchers would be forced to make more active (and inevitably nuanced) arguments for their claims rather than simply sticking them out there with lots of asterisks attached.

– Are confidence intervals better than accept/reject based on p-values? First, I should say that I only think about confidence intervals in their approximate Bayesian version – i.e. as approximate intervals of posterior uncertainty when there is a large amount of data relative to plausible prior assumptions. Andrew has been pointing out recently that this probably makes sense (i.e. the “non-informative” prior) less often than standard practice (of using CIs based on standard likelihood approximations) assumes. I definitely don’t think frequentist CIs are any more useful than accept/reject, as in the single-parameter case they seem just to describe a couple of accept/reject boundaries (the two ends of the interval) instead of one. But if CIs have an at least approximately valid Bayesian meaning as a “plausible range for the unknown parameter” then I do think that’s useful – it pushes the attention on to what it is we are trying to estimate or make inferences about (e.g. mean difference in blood pressure levels between two drug regimes) and away from false dichotomies relating to (usually) implausible exact null hypotheses.

– I think I sit somewhere between Keith’s and Martha’s positions on means vs medians. A lot of intro statistics teaching deprecates the mean because it doesn’t provide a good descriptive summary of a skewed distribution. But if we are trying to do scientific inference we need to consider what is the most relevant parameter (or parameters) for our substantive question, and in some cases this may be the mean (even if the distribution is skewed). A classic example in health is when costs (or some other proxy for total population burden) are the target. Many students seem to come away from their basic statistics education thinking that means can’t be estimated reliably if the data are skewed, but good inference for the mean is very possible even with quite small samples and it may in fact be what you need in order to answer the real question of interest. (Which is not to say that the Cochrane Collaboration’s “legislation” that the mean should always be the target is sensible. I often think it ironic that the field of evidence-based medicine seems to be dominated by conventions and rules rather than active critical thinking!)

– Someone said that I was confusing “fundamental criticisms with poor practice” – perhaps so, but for the intended audience I don’t think that mattered too much. It would not have been possible to give a detailed “fundamental critique” without losing most of the audience, and the message I wanted to get across was simply that there is actually a serious problem here!

– Finally, I don’t understand how a focus on computation (via Stan!) will help with the broader issues related to inference. The problem of “significant-itis” is that researchers want to draw binary conclusions from noisy data – how will avoiding the difficulties of inductive inference and focusing on computation help with this?

]]>There seem to be two issues going on here: The one you are focusing on appears to be “the way things are currently done.” My focus is more on “the way things ought to be done.”

Ideally, data should be available, not just summaries. I agree that if someone is indeed interested in the mean and standard deviation, but only median and quartiles are reported, then that is not user friendly for that user. But if someone is interested in the median and quartile, and only means and standard deviations are given, then that is also not user friendly.

The deeper issue is: Why is there so much focus on the mean, and so little discussion of when that is and is not appropriate?

The point you raise that sampling distributions of means are usually approximately normal seems to miss the point of whether or not means are the best things to look at to begin with.

I approach medical studies from a consumer perspective. From that perspective, I don’t see information about means and standard deviations as adequately informative. I believe that information about medians and quartiles is more informative for making medical decisions. I’d like to see medical studies (including meta-analyses) focusing on medians and quartiles rather than on means and standard deviations. That would make more sense to me.

]]>The difference is the user interface.

Roman numerals and Arabic numerals describe the exact same numbers, yet arithmetic is a lot easier in one of them.

]]>1. “Scientific Inference Using Stan” (not the same as data analysis)

2. This format: https://probmods.org/generative-models.html

What do you want to filter? Most of the problem is researchers asking meaningless questions like “Does drug A activate protein P?”, then they publish if they observe “activation” and do not publish if they do not.

Instead it should be “What effects does drug A have on protein P? Under what conditions is this observable and not observable?” If the answer is “we could not measure any effect at any timepoint or under any conditions”, that is useful information. Once a valid science project is funded there should rarely be reason to not publish the results.

I think there is also a widespread misbelief that data = evidence; this is not true. Data becomes evidence via interpretation (considering the methods that generated the data, various theories, etc). The generally poor quality of methods sections found in the medical literature, a problem that increases with the prestige of the journal, tells us that researchers do not think the interpretation step is important. Instead they just pick one explanation for the data and run with it.

]]>From your notes: “When we focus on the mean of a variable, we’re usually trying to focus on what happens ‘on average,’ or perhaps ‘typically’.”

Maybe that can be the focus, but usually not in, for instance, reporting the results of randomised clinical trials. Here it is standard to report group summaries separately, as this allows later meta-analysis of all published studies. This gave rise to http://www.cochrane.org/ and the mainstream of evidence based medicine.

Now if authors follow your (well-meaning) advice “Automatically giving just mean and standard deviation for a variable, without considering whether the variable might have a skewed distribution. For a skewed distribution, it’s usually better to use the first and third quartiles, as well as the medians, since these will give some sense of the asymmetry of the distribution.” then, at least as of the last time I communicated with the head of the Cochrane statistical methods group, their study will be excluded, or if there is dichotomised data summarised just that will be used, or guesstimates of the unreported means and standard deviations based on what is reported will be made and the study included only in a sensitivity model.

The problem arises because there is seldom access to the raw data, and the sampling distribution from the study groups is usually approximately Normally distributed, making the reported means and standard deviations approximately sufficient _and_ good summaries for meta-analysis input – even for very skewed distributions. On the other hand, methods based on other reported summaries are technically hard to work with and often contain less information.

All the gory details are available in my PhD thesis https://andrewgelman.com/wp-content/uploads/2010/06/ThesisReprint.pdf which I would not have been funded to do if more authors just reported group means and standard deviations. Fortunately for me, they were taught otherwise ;-)

I do think a large part of the problem is the disconnect between those teaching statistics and what the teaching leads to in actual practice. Why would one suspect that reporting summaries other than mean and standard deviation would cause problems?

And when I taught an intro Stats course at Duke, I actually went through the above exercise: starting with studies with really skewed data, I pointed out that the sampling distributions of the group means were approximately Normal, the differences in the group means within studies were even more Normal, and the meta-analysis sampling distribution of the mean within-study difference was even more Normal. Some did understand; whether that had an impact on them after the final exam – who knows.
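A rough version of that classroom exercise can be run as a simulation. This is only a sketch under assumed settings: lognormal raw data, groups of 50, a fixed random seed and sample skewness as a crude measure of non-Normality. None of these choices come from the original course. It covers the first two steps (group means, then within-study differences in group means) and leaves out the final meta-analysis step.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(2015)  # fixed seed so the run is repeatable
GROUP_SIZE = 50            # assumed number of patients per study group


def group_mean():
    """Mean of one simulated study group drawn from a very skewed distribution."""
    return statistics.fmean([rng.lognormvariate(0.0, 1.0) for _ in range(GROUP_SIZE)])


# Step 0: raw, heavily right-skewed observations.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]

# Step 1: sampling distribution of a group mean.
means = [group_mean() for _ in range(4000)]

# Step 2: sampling distribution of the difference between two group means.
diffs = [group_mean() - group_mean() for _ in range(4000)]

print("skewness of raw data:       ", round(skewness(raw), 2))
print("skewness of group means:    ", round(skewness(means), 2))
print("skewness of mean differences:", round(skewness(diffs), 2))
```

The skewness should shrink at each step, which is the pattern the exercise was meant to show; how quickly it shrinks depends on the group size and on how heavy the tail of the raw distribution is.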

]]>One reason we’re working so much on the cutting edge is that, believe it or not, cutting-edge technology is, it seems, necessary for automatically fitting simple models of just about any generality. And doing this requires a lot of research effort, both in statistics and computing.

That said, we are working on user-friendliness, for example with Stanbox, stan_glm, stan_mle, etc. And we’re planning a “Data analysis using Stan” book that is strongly focused on what to do and how to do it.

]]>Right now I see the emphasis on being super cutting edge, which is fine. But there is a huge elephant in the room that is getting less attention: current scientific practice is, to a large extent, broke. How will Stan or some other probabilistic programming language help address that? And how will that mission affect their development? Can we have tools like [Scratch](https://scratch.mit.edu/) to teach scientific inference?

]]>It sounds like you’re talking about Stan, so, yeah, I’m down with that plan too!

]]>Similarly, the evidence is in that the dominant mode of teaching (frequentist statistics) also has failed to improve understanding.

So the way I see it, the problem has nothing to do with frequentism vs Bayesianism and everything to do with semantics, teaching, and tools.

IMHO what is needed is some thinking out of the box. Throw the standard stats curriculum (frequentist or Bayesian) in the bin, and start anew. My 2 cents is for a combination of probabilistic programming, simulation, and graphical models.

]]>My impression is that the Bayesian interpretation of the inference problem conflates decision and filtering. Decision is about the state of belief, how strongly one believes X is true rather than Y or Z. This is formally the same as the problem of practical decision, what are the net benefits of taking action X rather than action Y or Z in a given context of uncertainty. Decision in this sense is ultimately modular: all the factors pertaining to it (prior knowledge, new evidence, costs and benefits) are perceptible to the decider and can be taken into account in the context of an individual decision.

Filtering arises from the structure of science as a collective enterprise with a finely articulated division of labor across participants and over time. Minimization of Type I error is paramount in this context, since the progress of the enterprise depends on the reliability of each individual component which, once it passes the filter, is available for use throughout the system. Science accepts a high degree of conservative bias as the cost of systemic and long-term progressivity. It is the only consistently progressive human enterprise.

The filtering imperative is used to justify NHST, but of course this is not correct. The likelihood of false null rejection depends on many features of the data and the methods used to collect and analyze them, and p-values perform filtering poorly, with devastating real world consequences. Moreover, rejecting an arbitrarily chosen null does not constitute adequate filtering for specific alternative hypotheses. My own field of economics suffers mightily from this confusion.

Nevertheless, isn’t it possible to agree with the criticisms of NHST but accept the goal of highly conservative filtering in science? The trick here would be to differentiate decision-making from filtering, recognizing the validity of each in its own sphere. I admit this might be difficult in practice without a bright line to identify sufficiently filtered claims. p-values do this poorly at present, but shouldn’t there be some criterion or set of criteria, enforced by professional norms, that takes its place? In an ideal world action and belief would reflect the balance of factors at a particular time and place, while science would be more curmudgeonly, resting on the distinction between what we rationally believe or act on based on the evidence currently available to us and what we really, really know.

An example of this distinction in practice is the synthesis work of the IPCC, where assessments of the weight of evidence on particular claims relating to climate change are given rough percentages. The reports are telling us that the claims are not yet decisively filtered—researchers working to extend knowledge cannot yet take them as reliably true—but there is enough evidentiary weight for them to be incorporated into rational action on the climate front. IPCC has been struggling publicly with the right way to communicate this complex message.

]]>Just before reading it, I had come to the tentative hypothesis that the underlying problem is more a lack of understanding the process of scientific inquiry rather than the statistical quantification of uncertainties involved (no matter how tentative that quantitation is understood). So not an unbiased reading and I’ll add some comments (from Peirce) on what the process of scientific inquiry might be.

For my “not new” claim:

Page 22 “Absence of evidence is not the same as evidence of absence”

My director (http://en.wikipedia.org/wiki/Allan_S._Detsky) made a point of embarrassing every new clinical researcher with this phrase since 1986 and pointing them to discussions of it in clinical journals.

Page 14 “E.g. can calculate frequency of “accepting” & “rejecting” true and false null hypotheses”

In about 1987 I tried writing an expository paper for a clinical research audience based on these sorts of calculations, and was dissuaded by references to others who had already done that.

Page 9 “Results provide almost equal support to the null hypothesis and the proposed alternative”

Rubin and Rosenthal’s http://en.wikipedia.org/wiki/Counternull – which I thought was earlier than 1994 – OK 20+ years.

Not trying to give John a hard time, these things do need to be restated over and over again, but there has been so little real impact – it does not make sense to me that it’s the root cause (apart from the strong publish or perish motivation to not understand).

Page 23 “Change style of presentation, away from “findings” to incremental evidence”

I would argue evidence is _not_ incremental (like temperature) but qualitative: it can drastically weaken or strengthen given new perspectives or observations (e.g. ones that suggest a radical change of model).

Page 23 “Embrace more Bayesian inference”

Not objecting to that as an hypothesis/conjecture – but you don’t present any evidence nor do I expect you to have strong evidence.

(Peirce) comments on inquiry:

“a state of satisfaction, is all that Truth, or the aim of inquiry, consists in”

“if Truth consists in satisfaction, it cannot be any actual satisfaction, but must be the satisfaction which would ultimately be found if the inquiry were pushed to its ultimate and indefeasible issue”

These comments point me back to Ken Rice’s 2010 paper “In Fisher’s interpretation of statistical testing, a test is seen as a ‘screening’ procedure; one either reports some scientific findings, or alternatively gives no firm conclusions.” https://andrewgelman.com/2014/04/29/ken-rice-presents-unifying-approach-statistical-inference-hypothesis-testing/

So I [Keith] would put it – I [scientist] am satisfied enough for now with my making sense of this data/experiment to call it to my colleagues’ attention. But expecting them and future work to remove that satisfaction and move us further.

]]>I wholeheartedly agree with your comments on the difficulty of doing adequate justice to power (and lots of other things) in an introductory statistics course. In fact, my experience is that most students don’t understand the concepts of sampling distribution, p-value, and confidence interval from just a first course (see e.g., http://www.ma.utexas.edu/blogs/mks/2015/02/03/another-mixed-bag-gigerenzers-mindless-statistics/, which was in response to one of Andrew’s posts and linked in the comments there).

I think I might be coming from a place that is intermediate between where you are and where Andrew is, so possibly some of the things I have written might help you in getting a better picture of the continuum spanning from misunderstandings/misinterpretations through fundamental problems (that aren’t always fatal, but mean caution and/or alternatives are warranted), to fundamental “attacks”.

In particular, these might be of interest:

1. http://www.ma.utexas.edu/users/mks/CommonMistakes2014/commonmistakeshome2014.html (Has notes from a “continuing education” course I’ve been teaching the past few years, with target audience including consumers of research using statistics, teachers of statistics, researchers using statistics, and people who believe they need more than an introductory course to understand what they didn’t understand in the intro course)

2. Blog posts from June 22, 2014 through August 26, 2014 at http://www.ma.utexas.edu/blogs/mks/

]]>Not including the type of prior information easily handled in a Bayesian prior is an additional problem on top of this one.

But even both of those are small beans in comparison to the problem that the CI’s were constructed under the assumption that the system being studied produces data with stable frequency histograms. This assumption is required for Frequentists to even mention “probability distributions”. The statistical model verification steps people use to “verify” these models don’t check this assumption the vast majority of the time, and usually it’s not even accidentally true. This guarantees most conclusions will not be reproducible all by itself.

So to sum up, the fact that we’re still talking about CI’s in 2015 is testament to the gut wrenching awful stupidity of the human race. But hey, morons need a job too, and statistics is as good a place as any for them.

]]>Compared to hyp tests and p-values, conf intervals do offer some advantages in that they put the focus on the uncertainty rather than on quasi-deterministic claims such as “p less than .05.”

But classical confidence intervals (and noninformative Bayesian intervals) still have the problem of ignoring prior information.

Recall my example of the estimated odds ratio of 3 with a 95% CI of [1.1, 8.2]. In just about any real setting in biostatistics, the “1.1” is much more plausible than the “8.2,” and indeed all sorts of values such as 0.95 and 1.05 are also consistent with such data. I discuss these issues in my 2012 paper in Epidemiology.
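One way to see the effect of prior information on this example is a Normal approximation on the log odds ratio scale. The sketch below is not the method from the paper mentioned above. The prior (centred on no effect, with standard deviation 0.5 on the log scale) is an assumption chosen purely for illustration, as is the use of 1.96 to back out the standard error from the interval.

```python
import math

# Reported estimate and 95% interval from the example.
or_estimate, ci_low, ci_high = 3.0, 1.1, 8.2

# Work on the log scale, where the interval is roughly symmetric.
log_or = math.log(or_estimate)
std_error = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)

# ASSUMED prior on the log odds ratio: Normal(0, 0.5^2). Illustrative only.
prior_mean, prior_sd = 0.0, 0.5

# Normal-Normal update: precisions add, means are precision-weighted.
data_precision = 1.0 / std_error ** 2
prior_precision = 1.0 / prior_sd ** 2
post_precision = data_precision + prior_precision
post_mean = (data_precision * log_or + prior_precision * prior_mean) / post_precision
post_sd = math.sqrt(1.0 / post_precision)

post_or = math.exp(post_mean)
post_interval = (math.exp(post_mean - 1.96 * post_sd),
                 math.exp(post_mean + 1.96 * post_sd))

print("implied standard error of log OR:", round(std_error, 3))
print("posterior odds ratio:", round(post_or, 2))
print("posterior 95% interval:", tuple(round(v, 2) for v in post_interval))
```

Under this assumed prior the estimate is pulled toward 1 and the upper end of the interval drops well below 8.2. A different prior would give different numbers, so the output should be read as a demonstration of the direction of the adjustment, not as a reanalysis.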

]]>But I think the Bayesian and “garden of forked paths” criticisms are more fundamental (as evidenced by the repeated claims that confidence intervals have the same problems as hypothesis tests). I think it would be wise to distinguish between misuse/misunderstanding and these more fundamental critiques. And, for the latter, I think more work needs to be done concerning practical advice about when a properly interpreted confidence interval still may be substantially incorrect (and even what that means – probably more along the lines of S and M errors). I didn’t think Carlin’s presentation clarified the difference between poor practice and more fundamental errors at all.

]]>What I think keeps getting confused (on this blog as well as elsewhere) are the clear examples where NHST is misused or misinterpreted and criticisms that say the entire effort is misguided and/or wrong. I think we need to be clearer on exactly what is wrong with the procedure if implemented and interpreted correctly. I think there is a Bayesian criticism and a “garden” criticism that are more fundamental attacks than reacting to the ease of misinterpreting what NHST tells you. But confusing these fundamental criticisms with poor practice only makes the issues muddier. Carlin’s presentation, while interesting and valuable, does not seem to clear this confusion up at all.

]]>