This post is by Phil Price.

I’ve been preparing a review of a new statistics textbook aimed at students and practitioners in the “physical sciences,” as distinct from the social sciences and also distinct from people who intend to take more statistics courses. I figured that since it’s been years since I looked at an intro stats textbook, I should look at a few others and see how they differ from this one, so in addition to the book I’m reviewing I’ve looked at some other textbooks aimed at similar audiences: Milton and Arnold; Hines, Montgomery, Goldsman, and Borror; and a few others. I also looked at the table of contents of several more. There is a lot of overlap in the coverage of these books — they all have discussions of common discrete and continuous distributions, joint distributions, descriptive statistics, parameter estimation, hypothesis testing, linear regression, ANOVA, factorial experimental design, and a few other topics.

I can see how, from a statistician’s point of view, the standard arrangement of topics makes perfect sense; indeed, given that adding anything else to the list necessarily means taking something away (there’s only so much time in an academic year, after all), perhaps people think no other set of topics is even possible. But here’s the thing. In my 20 years as a practicing data analyst/scientist, I have been involved in one way or another with a wide variety of projects and topics, sometimes as a full participant, sometimes just as a stats kibitzer. A partial list of topics I’ve worked on includes the geographical and statistical distribution of indoor radon in the U.S.; computed tomography of air pollutant concentrations; the airborne transport of biological agents; statistical and spatial distributions of ventilation and ventilation practices in homes and in large commercial buildings; effectiveness of kitchen exhaust hoods; time series models to predict electricity use in large buildings; statistical and causal relationships between vehicle characteristics and fatality rates; performance of very low-cost cookstoves in the developing world; and several other topics, but I’m getting tired of listing them. It’s a pretty big range of topics and a large number of researchers I’ve worked with, so I think I’m qualified to express this opinion: the standard curriculum covered in the books leaves out some of the very most important topics that my colleagues (and I) tend to struggle with or that would be useful to us, and includes several that are effectively useless or may even be harmful if people apply them without full understanding.


To go ahead and shoot the largest fish in the barrel, in most of these books there is far too much discussion of hypothesis tests and far too little discussion of what people ought to do instead of hypothesis testing. This is a case where a little knowledge can be a dangerous thing. First, no matter what caveats are put in the books, many people will incorrectly interpret “we cannot reject the hypothesis that A=B” as meaning “we can safely assume that A=B”; after all, if that’s not the point of the test then what IS the point of the test? Second (and this might actually be the more important point), people who know of the existence of hypothesis tests often assume that that’s what they want, which prevents them from pondering what they really _do_ want. To give one example out of many in my own experience: I have worked with a group that is trying to provide small cookstoves to desperately poor people, mostly in Africa, to decrease the amount of wood they need to gather in order to cook their food. The group had cooked a standard meal several times, using the type of wood and the type of pan available to the people of north Sudan, and using cookstoves of different designs, and they wanted to see which cookstove required the least wood on average. They approached me with the request that I help them do a hypothesis test to see whether all of the stoves are equivalent. This is an example of a place where a hypothesis test is not what you want: the stoves couldn’t possibly perform *exactly* equally, so all the test will tell you is whether you have enough statistical power to convincingly demonstrate the difference. What these researchers needed to do was to think about what information they really need in order to choose a stove: how big a difference in performance is of practical importance; how robust are their results (for example, they had done all of their tests using the same type of wood, dried to the same amount, but in actual use the wood type and moisture level would vary widely, so are they sure the best stove with their test wood would be the best stove in reality); and a whole bunch of other questions. In any case there is usually little point in testing a hypothesis (in this case: that all of the stoves are exactly equal) that you know to be false.

It’s true that in the physical sciences you do occasionally find cases in which a null hypothesis really could be true (the electron and the positron really could have *exactly* the same mass, for example), but in most real-world cases, and literally every case I have actually encountered, the null hypothesis is known at the outset to be false, so a test of hypothetical equality of two quantities is merely a test of statistical power (often without being recognized as such by the people performing the test).

These textbooks (and courses) should eliminate the chapter on hypothesis testing, replacing it with a one-page description of what hypothesis testing is and why it’s less useful than it seems. I’ll admit that hypothesis tests have their place and that it is a pity if students only get a one-page discussion of them, but something has to give.

Once the hypothesis test chapter is gone, it can be replaced by something useful. One thing that is needed (but usually missing) is a chapter on exploratory data analysis, especially including graphics. Graphics are important. In many of the research areas I have listed above (and others not listed), my introduction to the project occurs when a grad student, or sometimes a senior researcher, comes to me with questions about data they have been looking at for weeks. The first thing I ask for is appropriate plots: histograms of x, y, and z; plots of y vs x; etc. The details depend on the problem, but I always want to start by looking at the data. Amazingly often, the student has either never plotted the raw data or, at best, has used default plotting procedures (often from Microsoft Excel) to make just a few plots that are inadequate and fail to reveal important features. Often they have simply calculated some summary statistics and not plotted anything at all. (I keep a folder of plots of “Anscombe’s Quartet”, and I give one to anyone who comes to my office with calculations of summary statistics but without plots.)
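To make the Anscombe point concrete, here is a minimal sketch in Python (standard library only; the choice of language is mine, not something used in the books under review). It computes the usual summary statistics for the four Anscombe datasets and shows that they nearly coincide, even though plots of the four look nothing alike. The helper name `pearson_r` is just for illustration.

```python
import math
import statistics

# Anscombe's (1973) four datasets; sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# One (mean of x, mean of y, correlation) triple per dataset.
summaries = {
    name: (statistics.fmean(xs), statistics.fmean(ys), pearson_r(xs, ys))
    for name, (xs, ys) in ANSCOMBE.items()
}

for name, (mean_x, mean_y, r) in summaries.items():
    print(f"set {name:>3}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, r = {r:.3f}")
```

The printed rows are almost interchangeable, which is exactly why the summary table alone is not enough: only a scatterplot of each set reveals the curve, the outlier, and the single high-leverage point.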

Another thing missing from the books I’ve looked at is a useful discussion of some common pitfalls of real-world experiments. Many, perhaps most, experimental datasets I’ve encountered have some really undesirable properties, such as the potential for large selection bias, or the inability to distinguish the effect of nuisance covariates from the variables of interest, or simply sources of uncertainty that are far larger than expected. I’ll again illustrate with a real example: an experiment to manipulate ventilation rates and to quantify the effect of ventilation rate on the performance of workers in a “call center.” These workers answer phones to do routine tasks such as scheduling doctor appointments. There are hundreds of workers in the building, and the experiment involved varying the amount of outdoor air they got, ranging from the typical amount to several times that much. A statistics professor had designed a pattern of low, medium, and high ventilation, with the rate varying daily during some periods and weekly during other periods. Average time spent to process a phone call was the main performance metric of interest. It all seemed pretty clean on paper, but in practice it was a mess. Partway through the study, the business introduced a new computer system, which led to an immediate drop in performance that was much larger than the effect of interest could possibly be, with performance gradually improving again as employees learned the system. Additionally, several times large numbers of new workers were added, and again there was a short-term effect on productivity that was large compared to the effect of interest (and the available data recorded only the average call processing times, not the times for individual workers). There were some other problems too. In the end, there wasn’t enough statistical power to see effects of a size that could reasonably have occurred.

This example fits into a broader pattern that is almost a general rule: real data are rarely as clean as the textbook examples. In fact, many of the challenges that are routinely faced by a data analyst are related to coping with inadequacies of data. The nice, clean examples given in most textbooks are very much the exception, not the rule.

Yet another topic that has come up frequently in my experience is numerical simulation, either to directly determine an answer of interest or to confirm an analytical result. An example is error propagation: I have an output that is a complicated function of several inputs, and the values of the inputs are subject to uncertainty. What is the uncertainty in the output? The easiest way to answer this is often to sample from the input distributions and generate the resulting output value; repeat as needed to get a statistical distribution of outputs. Importantly, this approach can be used with any statistical distribution of input parameters (including empirical distributions), not just standard, friendly distributions. On the whole I’d probably prefer that researchers understand the use of this method but don’t know the analytical results for, say, the normal distribution, rather than the other way around. But the other way around is the only thing that’s taught in these books. (Of course it would be better to know both, but again, there’s only so much time available.) Oh, and speaking of errors, a standard weapon in the arsenal is cross-validation, but most of these books don’t seem to cover it, and many don’t even mention it in passing.
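The sampling approach to error propagation can be sketched in a few lines of Python (standard library only). Everything specific here is an assumption made up for illustration: the output is a hypothetical air exchange rate (flow divided by volume), the flow measurements are invented numbers treated as an empirical distribution, and the volume is given a lognormal distribution with an arbitrary 10% spread.

```python
import math
import random


def propagate(func, samplers, n_draws=20_000, seed=0):
    """Monte Carlo error propagation: sample every input, evaluate func, repeat."""
    rng = random.Random(seed)
    return [func(*(draw(rng) for draw in samplers)) for _ in range(n_draws)]


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q between 0 and 100)."""
    index = min(len(sorted_values) - 1, int(q / 100 * len(sorted_values)))
    return sorted_values[index]


# Hypothetical inputs, invented for illustration only.
measured_flows = [410.0, 455.0, 390.0, 520.0, 470.0, 430.0]  # m^3 per hour


def draw_flow(rng):
    # Empirical distribution: resample the observed values directly.
    return rng.choice(measured_flows)


def draw_volume(rng):
    # Lognormal with median 250 m^3 and roughly 10% spread (an assumption).
    return rng.lognormvariate(math.log(250.0), 0.10)


def air_changes_per_hour(flow, volume):
    return flow / volume


outputs = sorted(propagate(air_changes_per_hour, [draw_flow, draw_volume]))
low, median, high = (percentile(outputs, q) for q in (2.5, 50, 97.5))
print(f"median {median:.2f} per hour, central 95% range {low:.2f} to {high:.2f}")
```

Nothing in `propagate` cares whether an input is normal, lognormal, or just a bag of measurements, which is the main attraction of the method; swapping in a different output function or a different input distribution means changing one small function.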

Overall, I don’t much like the statistics book that I’m reviewing, but I can’t say it’s any worse than the typical stats book aimed at physical scientists and engineers who do not plan to take further statistics courses. But I don’t claim to have anything like an exhaustive knowledge of the state of intro stats textbooks. Are there any books out there that cover exploratory data analysis (including exploratory graphics), and dealing with common problems of real-world data, and other things that I think should be in these books? If not, someone should write one.

I’ll say it here too, ’cause people always forget: this post is by Phil Price.

One drawback with removing the hypothesis test chapter: tests are about concentrating on some _comparison of interest_, and when teaching graphics that is just the concept you need! One should decide on what is the comparison of interest, and then design the graphic to answer that. That is very close to deciding on an optimal test statistic! So maybe hypothesis tests and graphics should go in the same chapter?

kjetil, this is a great comment. In applied work, one is always comparing things (even in physics, where the comparison might be to background levels).

The major comparisons that are done in intro stat are via hypothesis tests, which assume you already know what the appropriate comparison is and the right units to express it (either in terms of graphics or a selection of the key values for a table).

The book that really got me into modern statistics is Ben Bolker’s Ecological Data and Models in R (http://www.math.mcmaster.ca/~bolker/emdbook/), and I’d say it does a good job fulfilling what you’re looking for (as well as having a free draft version available online!). The drawback for teaching from it is that it assumes at least a little familiarity with statistics already (at a first-year intro stats level), so you can’t really hand it to someone and expect them to learn stats from the ground up using it. Further, as evidenced by the title, it’s very heavily aimed at ecologists and organismal biologists, especially for the examples used. You might find it interesting to look at anyway, as one approach that seems to be working for teaching more modern methods.

Not a textbook, but “Data Analysis with Open Source Tools” by Philipp Janert takes much the same attitude towards the subject as you call for. Perhaps useful as a companion volume?

Thanks for telling me about this, it looks really useful.

I heartily agree with your idea of ditching hypothesis testing. Not only are point null hypotheses almost always false, even if they are possibly true (such as your example of positron/electron mass being equal), in practice an actual experiment to test such a hypothesis is virtually guaranteed to have flaws that would render it useless at some point (i.e., at some point you’ll collect enough data that you’ll be just observing the experimental flaws and “reject”, even if the null is actually true).

I entered astronomy graduate school fifty years ago this year (how time flies!). We learned such statistics as we were thought to need from other astronomers, not from statisticians, and I then taught astronomy grad students such statistics as were deemed to be needed by astronomers for many years. Point null hypothesis testing was not something that I was taught or that I taught later. I have seen such testing occasionally in the astronomical literature, but not very often.

Chp 1 Theories

Chp 2 Concepts and Measures

Chp 3 Measurement

Chp 4-8 Analysis

Chp 9-10 Decision

Chp 11 Lab practices (registration, lab book, ethics, documentation, sharing)

I would teach high school students using a Raspberry Pi, a tray of seedlings, some coffee grounds and – shock horror – a randomization test of the null of no effect using R.

PS In keeping with the minimalist Raspberry Pi spirit I would try to substitute “Chp” for “pgs” or “par” and write it on a microfiche….

Absolutely agree – less is more for students. One book I have long admired for its construction is Michael Greenacre’s “Correspondence Analysis in Practice”. Every chapter is exactly 8 pages long and fairly self-contained.

Perhaps Neyman-Pearson hypothesis tests should be relegated to a page that explains why they are usually irrelevant to scientific experimentation, but significance tests are estimation, and that’s useful. Both Fisher and Jaynes agree on that:

“It may be added that in the theory of estimation we consider a continuum of hypotheses each eligible as null hypothesis …” (Fisher, 1955)

“the distinction between significance testing and estimation is artificial and of doubtful value in statistics” (Jaynes, 1980)

I’m amazed that you, Gelman, wrote this post. You wrote several books, including one about how to teach statistics, and yet you say that you should be writing a different book than the ones you wrote! Amazing.

Ok, just kidding! I know the post is by Phil Price, not Gelman. But I couldn’t keep myself from laughing when I saw you writing again that the post was by Phil Price. And yet you’re totally right. People forget that the post is not by Gelman, and I was always amazed by this fact in other posts by you.

Anyway, nice post. I hope people who write these books read this post. It would be a real advance.

Thanks, Manoel!

This post, times 10.

I am teaching an introductory MA level statistics course in the spring and am trying to think of ways to enhance coverage of issues like exploratory analysis, visualization, organizing data structures, and evaluating effect sizes rather than a full course of classical hypothesis testing (think people that never, ever took a statistics course in undergrad). Because I teach in a program with a large number of students looking to be professionals in business-related fields, they need to know not just WHAT, but HOW MUCH. In other words, the statistics come from a business standpoint.

But this is the only course they take. If they simply learn H0 vs. H1 with a t-test and ANOVA and that’s it, it is practically useless to them. I do think using hypothesis testing to drive home the importance of probability distributions can be helpful, especially with visually effective simulations (i.e., one can visualize the probability distribution being sampled and built to look like the theoretical one in this case).

I am all ears on teaching the HOW MUCH question early on. I have thought about starting regression much earlier than in the classical statistics course, but have questioned whether this would be feasible without talking for a good while about the little p-values that come out of the regression output in the classical way. My worry is that they take the course, then take the coefficient point estimates as god-like numbers to give to management.

I think a course like you describe would necessarily have to be for those that won’t take another statistics course. Otherwise, the students won’t have the base of knowledge that many later courses use, whether you agree with their use or not (regression coefficient p-values, permutation tests, etc.). In that case, it would probably call for a full overhaul of an applied statistics program or significant alteration to the standard methodology track in other programs that require it.

We want to test whether coffee grounds used as fertilizer increase plant size at 4 weeks by [.x standard deviations, or by x%, or by x mm.] relative to some baseline measure.

Pair seedlings. Randomly treat one in each pair. At 4 weeks record a 1 if treated seedling passes size criteria, 0 otherwise. The end line sequence of 0s and 1s can motivate a randomization test of the null hypothesis that the effect size is x.

Sure, you could do regression, or Bayes, or whatever, and sure, they’ll give you more information, point estimates, probability intervals, continuity, etc.. But they are not as simple and require a lot more pre-existing knowledge, whereas the above can be understood by a teenager with a poor attention span.
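The paired design described above can be sketched in a few lines of Python (standard library only). This is one reading of the proposal, not a definitive implementation: it assumes “passes size criteria” means the treated seedling beats its paired control by more than x, so that under the null of an effect of exactly x, the coin flip that chose which seedling to treat makes each pair’s outcome equally likely to be 0 or 1. The outcome sequence is invented.

```python
import math
import random


def randomization_p_value(outcomes, n_rerandomizations=20_000, seed=0):
    """One-sided p-value: how often re-flipping the treatment coins does as well."""
    rng = random.Random(seed)
    observed = sum(outcomes)
    n_pairs = len(outcomes)
    at_least_as_extreme = 0
    for _ in range(n_rerandomizations):
        # Under the null each pair scores 1 with probability 1/2.
        simulated = sum(rng.randint(0, 1) for _ in range(n_pairs))
        if simulated >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_rerandomizations


def exact_p_value(outcomes):
    """Same quantity from the binomial distribution, as a cross-check."""
    n_pairs, observed = len(outcomes), sum(outcomes)
    tail = sum(math.comb(n_pairs, k) for k in range(observed, n_pairs + 1))
    return tail / 2 ** n_pairs


# Hypothetical end-line record for 12 pairs (1 = treated seedling passed).
outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]

p_simulated = randomization_p_value(outcomes)
p_exact = exact_p_value(outcomes)
print(f"simulated p = {p_simulated:.3f}, exact p = {p_exact:.3f}")
```

The simulation is the part a class can act out with real coins; the exact binomial tail is only there to show that the computer is doing nothing mysterious.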

And busy executives and policy makers are like teenagers. In my experience policy makers cannot interpret fan charts (https://en.wikipedia.org/wiki/Fan_chart_%28time_series%29). They focus on mean and bounds, and typically only on the lower or higher bound (for GDP and inflation respectively): two pieces of info max. I suppose busy executives also boil everything down to a binary decision. Don’t throw the baby out with the bathwater.

So in writing your book imagine you are writing it for a cellular automaton with limited cognitive capabilities and a short attention span.

“But they are not as simple and require a lot more pre-existing knowledge…”

I think you hit my concern with going to something like regression quickly on the nose, and not covering things that are of significant value (or, throwing the baby out with the bathwater as you say). I think that was my point and, of course, my difficulty.

I think your example is interesting, but I don’t know that it would be as clear relative to the familiarity a regression coefficient has to non-math students (they all know rise-over-run from middle school). I do like the idea of simulations as I said above because probability distributions are a very abstract concept for some despite their tractability, but I’m curious how these sorts of strategies have caught on for others at the very very introductory level.

I guess the issue is the constraints, and providing useful skills/tools from a practitioner standpoint. Assume there is only 1 semester of statistics and one year in the program and assume that students need to be reminded how to calculate a mean.

“I don’t know that it would be as clear relative to the familiarity a regression coefficient”

Indeed, many students of applied stats courses come out believing that science IS running a regression. The whole course sequence seems designed to lead to the Nirvana of multivariate regression.

Instead, a good course on statistics and the scientific method would teach that regression is but one component of scientific inquiry.

Simulation is a great idea, but one must be careful. Probability distributions are abstract because the concept of a sample space, repeated draws, etc., is abstract. To make it accessible, this idea of a “chance setup” must be concretised, and that must be done in real life before it makes sense to simulate it on a computer. To people who have never done some (whatever!) experiments, got data, plotted them, repeated it, seen that means vary from time to time, etc., doing simulations on the computer will probably only be some abracadabra.

Keith, this is my worry. Perhaps first doing it in some familiar way, and then going on to a computer simulation as an example (not for them to run) to visualize what happens.

Fernando,

No doubt about regression being but one component/tool! But my point is which tools are the most useful in the context of a non-scientist (at least in the academic sense). Certainly my goal would be to encourage exploration more than anything else, and advise them of limitations in their own tools that they’ll need to defer to an expert for or learn more about. With the time constraint, though, the question is how to maximize utility of the class.

As teachers we always want to encourage more exploration and understanding of “what else can be done.” This is my fear: I initially left my undergraduate stats class thinking a simplistic two-way ANOVA was that Nirvana. I feel as though knowledge of regression opens doors to broader curiosity (an empirical question, of course).

Jumping in here, my current misunderstanding is that you need to provide a representation for uncertainty that the audience can grasp, won’t discount/reject as “can’t be worthwhile/important/necessary for people like me” and that they can comfortably manipulate/investigate/experiment/play with.

I am pretty sure Monte-Carlo simulation does not do this, having tried it out on hundreds of students; as mentioned here, it presupposes a grasp of abstract distributions. This does seem to work for many, including teenagers: http://phaneron0.files.wordpress.com/2012/11/animation3.pdf (based on a select sample of colleagues, friends, and family). But to go anywhere with it (play with it) they will need good computing skills and be able to think abstractly.

Disclaimer: It is part of working through earlier ideas posted here http://andrewgelman.com/2010/06/course_proposal/

Agree, a good course would motivate everything by running a simple experiment in class.

PS I have secret fantasies of forking a blog. Meaning that one thread becomes so active it advances more rapidly than the blog to which it belongs, taking a life all of its own.

But I think for that you need plenty of controversial statements and a polemicist, hence the monicker. And now the statement:

Exploratory data analysis is like evolution: the work of the Devil.

K? O’Rourke,

Yes, I think this is my worry with bringing simulation beyond “showing” it up front.

For example, in my first graduate statistics course, one of the most enlightening things I saw in lecture was an animation of a histogram being built and an animation of random samples producing confidence intervals (which were then plotted against a vertical line at 0 up on the screen). I did not get much into being able to DO this myself from a programming standpoint until working with permutation tests in a graduate non-parametrics course.

I would not expect students to do these (which was my concern with Fernando’s initial very interesting suggestion). But the visual can help to say “okay, I see why the histogram is built that way” or “okay, I see most of the confidence intervals include 0 when sampled from a N(0,1) distribution.” So really, I would be doing the manipulation from some data that we develop in class through whatever activity (an experiment, or early on could even be physically drawing numbers out of a hat and inputting that into a data frame or vector, then me doing the work on the back end). As a class, we could play with it rather than at the individual level. I think this is what you are talking about with the link you include in your post.
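For anyone wanting to run the “back end” of that confidence-interval demonstration, here is a minimal sketch in Python (standard library only). The sample size of 30, the 2,000 repetitions, and the hard-coded t quantile for 29 degrees of freedom are all my own assumptions; the plotting step is left out, and only the share of intervals covering 0 is reported.

```python
import math
import random
import statistics

# 97.5th percentile of Student's t with 29 degrees of freedom (n = 30).
T_CRIT_29_DF = 2.045


def simulate_intervals(n_intervals=2000, sample_size=30, t_crit=T_CRIT_29_DF, seed=0):
    """Draw repeated N(0,1) samples and return a 95% t-interval for each mean."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(n_intervals):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        mean = statistics.fmean(sample)
        half_width = t_crit * statistics.stdev(sample) / math.sqrt(sample_size)
        intervals.append((mean - half_width, mean + half_width))
    return intervals


intervals = simulate_intervals()
# Fraction of intervals that contain the true mean of 0.
coverage = sum(lower <= 0.0 <= upper for lower, upper in intervals) / len(intervals)
print(f"{coverage:.1%} of the simulated intervals include 0")
```

If `sample_size` is changed, `t_crit` should be changed to match; drawing each interval as a horizontal segment against a vertical line at 0 reproduces the classroom picture.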

But you do answer my question with whether the Monte Carlo for each student helps: not so much.

I would look for the course to provide some shelf-ware that the minimally statistically aware can use as a reference: a guide that steers the newbie towards the appropriate technique(s) and also provides the WATCH OUT FOR THIS warnings that help avoid newbie mistakes.

This oldie but goodie lived in my library for many years.

http://www.psc.isr.umich.edu/dis/infoserv/isrpub/pdf/GuideSelectingStatisticalTechniques_OCR.pdf

*”This flowchart is insane! It’s an absolute gem. 28 pages of decision tree guiding the reader to the correct statistical technique. 28! And then several pages of cites and a glossary. It needs to be updated–they note a few “new” techniques in an appendix, some of which have become standard fare since 1981. It’s amazing, though. I ended up using to it construct my own simplified (3 pages) decision tree (which I’m quite proud of!)–if a professor had done this for me as a master’s student, I would have trudged through the Ph.D. at least a little more easily.

So, if you’re interested, the full cite, with additional kudos to Edwidge and a standing ovation for these authors:

Andrews, F. M., Klem, L., Davidson, T. N., O’Malley, P. M., and Rodgers, W. L. (1981). A Guide for Selecting Statistical Techniques for Analyzing Social Science Data (2nd edition). Ann Arbor: University of Michigan, Institute for Social Research, Survey Research Center.”

*found the comments above at http://chronicle.com/forums/index.php?topic=30469.0;wap2

Tukey’s “Exploratory Data Analysis” and Mosteller and Tukey’s “Data Analysis and Regression” (especially the chapter “Hunting Out the Real Uncertainty”) are worth considering as supplements, even if they’re too dated to serve as main textbooks.

Actually, this points to a “simple” solution to the original book-writing problem.

1) Simulate Tukey.

2) Write the modern equivalent of his EDA book.

In answer to the question:

‘“we cannot reject the hypothesis that A=B” as meaning “we can safely assume that A=B”; after all, if that’s not the point of the test then what IS the point of the test?’

Perhaps this is a difference between a physical scientist and an economist, but it seems to me that the point of the test is not to conclude that A and B are precisely equal. The point of the test is to see if the difference we observe between A and B is sufficiently large that random variation is unlikely to be the reason for the difference. We do this so that our conclusions aren’t driven by sampling error.

I think hypothesis testing is very important to teach. It is a critical part of the scientific method. I also don’t understand the value of doing regression analysis without hypothesis testing. It seems that we are often testing for the existence of relationships between x variables and the y. What’s the point of teaching regression analysis, if the student can’t ultimately decide which of the variables are related to y and which ones aren’t?

JK, what you say about understanding the point of hypothesis tests isn’t so much the difference between a physical scientist and an economist as it is the difference between someone who understands hypothesis tests and someone who doesn’t.

I almost included this in my post: When someone proposes a hypothesis test, I usually say “but you don’t even need to test that hypothesis, you know in advance that [the stoves, the cars, the exhaust hoods, the electricity consumption at two different times,...] aren’t equal.” Often, the person says “I don’t know that; they could be equal,” and my stock answer is “I don’t even think they could agree to the first 10 decimal places, much less be exactly equal!” And they say something like “Oh, sure, they can’t be _exactly_ equal, but they can be close enough to equal that I don’t care about the difference,” and then I say “how close is that?”, and we’re off to the races. The point that I’m trying to get at is that “how close is close enough” is a substantive question, not a statistical one. Just because my current dataset doesn’t reject the null hypothesis at a particular p-value doesn’t mean I’m safe to assume the two numbers are equal, and just because I can reject it doesn’t mean the difference is big enough to matter to me.

I do not agree that hypothesis testing is an important part of the scientific method in general, although there may be situations in which it is important. (I have never encountered such a situation in my own work). And that’s my main point about hypothesis tests: sure, you can find some disciplines in which hypothesis testing is important, but most scientists can (and probably should) go their whole career without ever testing a null hypothesis, whereas every scientist should use exploratory graphics. So I think it’s a shame that these books cover the former but not the latter.

I also disagree with your implication that people should, or can, use hypothesis tests to decide “which of the variables are related to y and which ones aren’t.” If a researcher has gone to the trouble to collect data on several x variables then you can bet they are all related to y. In any case I’ve ever encountered I don’t think any of the regression coefficients (expressed in reasonable units for the problem) could even be zero to 10 decimal places, if known exactly; they certainly aren’t exactly zero. People can, should, and indeed must ask themselves what explanatory variables should be included, but they shouldn’t make this decision based on hypothesis tests.

In short, I think I disagree with you on all points!

Phil:

Perhaps this paragraph of yours needs to be in the (hypothetical) book:

1a: “Compound x reduces the risk of cancer (p-value = 0.001)”

1b: “Compound x reduces the risk of cancer from a baseline of 1% to 0.5% (95% CI: 0.2,0.4)”

2: “Statement 1b is way more informative than 1a”

3: “Therefore you should never ever ever use hypothesis tests, and if you do you are an ignoramus”.

I agree 100% with statement 2 and disagree in equal measure with statement 3.

Hypothesis tests, especially sharp nulls in experimental contexts, are useful pedagogically, and in many routine business and policy settings.

As for the statement:

“The point that I’m trying to get at is that “how close is close enough” is a substantive question, not a statistical one. Just because my current dataset doesn’t reject the null hypothesis at a particular p-value doesn’t mean I’m safe to assume the two numbers are equal, and just because I can reject it doesn’t mean the difference is big enough to matter to me.”

Implicitly I read the above as evidence of an unknown loss function and hence a misapplication of a hypothesis test.

I don’t know where this “ignoramus” thing comes from. I said in my post “I’ll admit that hypothesis tests have their place and that it is a pity if students only get a one-page discussion of them, but something has to give.”

In your example, I can see why someone would want to know 1b. I don’t know what anyone would do with 1a in the absence of 1b. I suppose 1a could be useful in defending against a fraudulent advertising claim or something?

1a could mean that the cancer risk is reduced by an estimated 0.01% with very low uncertainty, or by an estimated 30% with much higher uncertainty. Either could give p=0.001, but surely these would have hugely different implications as far as determining whether it’s worth giving the drug (or producing it), how much it’s worth paying for it, etc.

Maybe I suffer from a poverty of imagination. What would I do with the knowledge of the p-value, that I wouldn’t do at least as well, or perhaps much better, with simply an estimate and a confidence interval?

I feel confident that I’ve misunderstood something. Say an efficient cookstove needs to reduce the chance of respiratory disease by at least 10% on average to be worth buying. Would a classic null hypothesis of “\beta <= .1" not be an appropriate way of thinking about this problem or is this not the kind of hypothesis testing being talked about? Or alternatively, do you think this would be a bad question to ask?

The “hypophobes” would have you plot the data, estimate the proportion, assume a probability model and provide a CI.

All very nice, but you still have to make a binary decision. Even if the world is continuous.

It’s like quality management. You accept <=x% defective light bulbs in a shipment. Most people would take a random sample and do a hypothesis test. It's simpler and contractually a whole lot more transparent.
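The light-bulb case could be sketched as an exact one-sided binomial test. This is only an illustration: the 2% limit, the sample size, and the defect count are made-up numbers, and `binom_upper_tail` is a hypothetical helper, not part of any contract standard.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical contract terms and inspection result.
max_defect_rate = 0.02   # accept shipments with at most 2% defective bulbs
sample_size = 200        # bulbs drawn at random from the shipment
defects_found = 9        # defective bulbs observed in the sample

# Null hypothesis: true defect rate <= max_defect_rate.
# The p-value is evaluated at the boundary, where it is largest.
p_value = binom_upper_tail(defects_found, sample_size, max_defect_rate)
reject_shipment = p_value < 0.05

print(f"observed defect rate = {defects_found / sample_size:.3f}")
print(f"p-value = {p_value:.4f}, reject shipment: {reject_shipment}")
```

Printing the observed defect rate next to the p-value keeps the estimate visible alongside the prespecified decision rule.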

Yes, you have to make a binary decision, but you almost never want to make this on the basis of a p-value! Andrew and I wrote a paper about this.

A (large estimated effect and a medium uncertainty) give the same p-value as a (small estimated effect and a very small uncertainty), but you will often make different decisions in these cases, or at least you should.

How about hypothesis tests with inequality constraints? E.g. testing the null hypothesis that stove A (the one with higher observed efficiency in the available data) is less efficient than stove B? That’s not a null that can be ruled out a priori.

I’m not sure why it is necessary to insult me to respond to my comment. My point is that when you gather data on the two stoves, for instance, and you are trying to determine which is the better one, the fact that one burns less wood in your experiment could be due to sampling error and not be because the one that burned less is truly better. As Eli mentioned above, if there needs to be a set improvement or difference level, you can incorporate that into your test.

As for regression, I certainly don’t think that the fact that I went to the bother of gathering data (x’s) necessarily means that they are related to a y of interest to me. Here’s a simple example: I am interested in variation in sales at my many stores across the country. The managers choose how much to spend on advertising in the newspaper, on the radio and on TV. I would like to know if my advertising dollars are well spent and if those modes of advertising are even useful at generating sales. I might use sales as my dependent variable and spending on the various types of advertising as x variables in order to determine whether those modes of advertising are effective. I hope that the methods of advertising are useful, but I would certainly want to know if they are or not.

I can’t really speak for Phil, but there’s some ambiguity in his statement such that I read his comment as meaning that although YOU know what the hypothesis test is for, in general the audience for these kinds of textbooks doesn’t understand them even after taking the class.

Daniel:

I can’t speak for Phil either (although I guess I’m better qualified than most to speak for him!), but I think what he’d say—what I’d say—is that it’s very rare that a hypothesis test is an appropriate answer to a decision question. I’m not saying that hyp tests and p-values are useless—I’ve written papers on that myself—but maybe they’re not important enough to devote two weeks of a thirteen-week introductory statistics course.

Yes! In this matter you can speak for me.

This is supposed to be a response to Daniel, one level up, but it keeps getting put here as a response to Andrew instead:

Yeah, that’s exactly what I meant: people who take these classes (or read these books, or have simply heard the terms bandied about) think that a hypothesis test will tell them whether they can safely treat two numbers as being equal, whereas people who know better know better.

Also I think Phil’s answer to the “relatedness” question is that all your variables are related; some of them may have tiny effects which are “close enough to zero for you,” but setting the “close enough” threshold is a substantive question, not purely one of statistical significance. In particular, if you could somehow gather 3 orders of magnitude more data you might find that your coefficient on newspaper advertising is suddenly 0.0102 +- 0.0001 and therefore highly significant. However, a 1% increase in sales from newspapers has to be compared to, say, the 122% increase from TV and the 33% increase from radio, and is therefore essentially worthless to the company (i.e., close enough to zero).
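A rough sketch of how significance tracks the amount of data rather than the size of the effect. The coefficient and starting standard error below are hypothetical (they are not the numbers quoted above), and the sketch assumes the standard error shrinks like 1/sqrt(n) under a normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def z_and_p(estimate, std_error):
    """z-statistic and two-sided p-value for the null of zero effect."""
    z = estimate / std_error
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

coef = 0.01     # hypothetical: a 1% sales lift from newspaper advertising
se_now = 0.02   # hypothetical standard error with the data currently in hand

for factor in (1, 10, 100, 1000):
    se = se_now / sqrt(factor)   # assumed 1/sqrt(n) scaling of the standard error
    z, p = z_and_p(coef, se)
    print(f"{factor:>5}x data: coef = {coef:.4f} +/- {se:.5f}, "
          f"z = {z:.2f}, p = {p:.3g}")
```

The estimated lift is the same in every row; only the verdict of the test changes, so whether 1% matters still has to be judged against the other advertising channels.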

If you knew what you were testing, why, and with what consequences you might have done a power study beforehand.

With infinite data everything is significant, sure, but by the same token I don’t want an infinite number of CIs in my final report.

Which CIs do you drop out from the report, why, and with what practical consequences?

If you can answer the above you might be in a position to motivate hypothesis tests, if not avoid them and do exploratory analysis.

JK, I wasn’t trying to insult you and I didn’t think I had (unless you mean saying that I disagree with you is an insult, and I don’t think you mean that). Perhaps you’re reacting to my first sentence in my response? If so you’re interpreting it exactly backwards! I meant that some people (including your hypothetical physicist) incorrectly assume that a hypothesis test answers the question they want answered, whereas others (including YOU) have a better understanding of what is being tested. You’re the smart guy in that sentence, not the dumb guy.

I still disagree with your other points but I was not trying to be insulting.

As for your advertising example, there is no way that advertising has no effect whatsoever. I admit it’s at least mathematically possible that not a single extra person spent an extra penny at any of your stores, so a mathematical zero is a theoretical possibility, but advertising does work and if you are running radio and TV ads there is going to be a nonzero effect. I think you should try to estimate the size of the effect, not try to prove that it’s nonzero, which it is.

In fact, this is another good example of where focusing on the hypothesis test, though perhaps a harmless exercise if done in conjunction with other work, can detract from deeper thinking about what you should be doing instead of hypothesis tests.

A very similar issue comes up repeatedly in a research area I’m involved with, which is quantifying how much changes in building operations or equipment affect the energy used by the building. There is some sentiment that “there is no point making a change if we can’t conclusively see the effect in the utility bill,” and people really do take that idea seriously: they won’t spend $100 to save $500, if the $500 savings can’t be demonstrated by a low p-value. I can understand the general sentiment, and there are circumstances in which I agree with it: how do I know I’m not wasting my money? But suppose I spend $100 and get an estimated savings of $500 +/- $500; I should be happy, especially if I have prior information (like engineering calculations) that also suggests that I’ve saved something. I don’t think it makes sense to test the null hypothesis in a case like this; much better to generate my estimate of the savings and its uncertainty.
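Here is the $100-versus-$500 example worked through as a small sketch. It assumes that "+/- $500" means one standard error, that the uncertainty is roughly normal, and that the estimate can be read as a probability distribution for the true savings (a flat-prior reading); all of these are assumptions added for illustration.

```python
from statistics import NormalDist

cost = 100.0
# Assumption: "+/- $500" is one standard error around the $500 estimate.
savings = NormalDist(mu=500.0, sigma=500.0)

# What the hypothesis test reports: can we rule out zero savings?
z = savings.mean / savings.stdev
p_value_vs_zero = 2 * (1 - NormalDist().cdf(abs(z)))

# What the decision needs: how likely is it that the change paid for itself,
# and what is the expected net gain?
prob_savings_exceed_cost = 1 - savings.cdf(cost)
expected_net = savings.mean - cost

print(f"p-value against zero savings: {p_value_vs_zero:.2f}")
print(f"P(savings > cost): {prob_savings_exceed_cost:.2f}")
print(f"expected net gain: ${expected_net:.0f}")
```

Under these assumptions the test is nowhere near "significant" (p is about 0.32), yet the chance that the savings exceed the cost is about 79% and the expected net gain is $400.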

@ Phil

I would never make a decision solely on a p value and I don’t care about p values.

But I think tests of a sharp null are useful pedagogically; they can structure a research design and help plan and power an analysis, and you can write contracts on them because they prespecify decision criteria.

Take a look at the SAS program JMP and its accompanying books. The whole focus of JMP is exploratory in nature.

One thing you left out in your discussion is a discussion of where data comes from. I think anyone who works with data must first address the question of its integrity. In the private sector, I’ve worked with clients who handed me “junk” data and wanted me to draw population inferences from it. The foundation of statistical inference sits on a set of assumptions concerning how the sample data was drawn.

Mark

Mark, yes, data issues are big too. I tried to touch on that with my ventilation example.

I’ll take a look at JMP.

I was very pleasantly surprised by a JMP demo I saw this week (presented well by Dick DeVeaux). Their “Graph Builder” seemed to make it really easy to play with graphs — drag variables to the x and y axis, drag them up top to facet by another variable, easily show or hide the raw data (scatterplot) and overlaid summaries (regression or loess lines)… It seems like a great tool for an intro course, especially if EDA and graphics are going to be part of it.

http://www.jmp.com/capabilities/Graph_Builder.shtml

If you don’t have SAS/JMP, similar programs in this vein (all descended from Leland Wilkinson’s Graphboard, I think) are SPSS Visualization Designer, if you have SPSS:

http://www-142.ibm.com/software/products/us/en/spss-viz-designer/

or Tableau if you have neither:

http://www.tableausoftware.com/

or Jeroen Ooms’s ggplot2 online demo if you’re comfortable with R:

http://www.stat.ucla.edu/~jeroen/ggplot2/

Personally I love R for my own work, but it took me forever to get the hang of graphics there and many Stat101 students may not have the patience.

Eli,

Classical hypothesis tests, as taught in these books, can’t test a hypothesis that beta <= 0.1; they can only test a point value, like beta = 0.1. That might not be a big deal in this case: I can do a one-tailed test of the hypothesis that beta = 0.1, no problem.

But (as always) I don't think a hypothesis test is the right way to think about this, even if you test the inequality that you're interested in. Suppose I can't reject the hypothesis that beta <= 0.1 at the 5% level. Does that mean I can safely conclude that the stove doesn't perform well enough to be worth buying? (No.) Or does it simply mean that I don't have enough data to determine if the stove would reduce respiratory disease by more than 10% or merely by 6%?

We really want to make these go/no-go decisions based on the performance of the stove, not based on the amount of data we happen to have collected about the stove, but the hypothesis test tangles these two things together. Tell me that "given the available data, we estimate a reduction in respiratory disease of 7% +/- 4%", or "7% +/- 2%", that's a lot more useful than "we can reject the hypothesis that there's a 10% benefit at the 10% level but not at the 5% level."

Please also tell me that in order to get to statement:

“given the available data, we estimate a reduction in respiratory disease of 7% +/- 4%”, or “7% +/- 2%”

you assumed Normality, a linear model, you controlled for covariates, and you decided which covariates to include by running 30 specifications of which you are reporting the one you like.

One beauty of designing a study _as if_ the analytical goal is a simple sharp null test is to put greater weight on the study design rather than the analysis.

Many students feel a scientific study means gathering data, running many regressions until an interesting pattern is found, and then telling a story around it with perfectly kosher statements like the one above.

Phil, just to clarify … classical hypothesis tests can test a null hypothesis that beta <= 0.1 (or any other value) – typically, the probability of rejecting the null hypothesis when beta <= 0.1 at a given level is maximized at beta = 0.1 so this parameter value is used in probability calculations. Elementary texts do tend to focus on the point null, but that's probably to avoid confusion in the unwashed.

I do generally agree with your other points though.
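The point about the composite null can be checked numerically. This sketch assumes a normally distributed estimate with known standard error; the function names and the 0.02 standard error are hypothetical choices for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def one_sided_p(estimate, std_error, bound=0.1):
    """p-value for H0: beta <= bound against H1: beta > bound,
    evaluated at the boundary value beta = bound."""
    return 1 - STD_NORMAL.cdf((estimate - bound) / std_error)

def rejection_prob(true_beta, std_error, bound=0.1, alpha=0.05):
    """Probability that the level-alpha test rejects when beta = true_beta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - (true_beta - bound) / std_error)

std_error = 0.02  # hypothetical standard error of the estimate
for beta in (0.00, 0.05, 0.08, 0.10):
    print(f"true beta = {beta:.2f}: "
          f"P(reject H0) = {rejection_prob(beta, std_error):.4f}")
```

Within the null region the rejection probability rises with beta and reaches the nominal level only at beta = 0.1, which is why calculations at the boundary cover the whole composite null.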

Anonymous, I don’t necessarily object to designing a study as if the analytical goal is a “simple sharp null test”…not sure exactly what that means.

I also agree that pretty much any statistical estimate or test involves a bunch of assumptions, whether those assumptions are stated or not. That’s true of hypothesis tests of course, but no less true of other quantities. However, some models and some parameter estimates are more sensitive to errors in those assumptions than other models and estimates are. Actually this might be another thing that is given inadequate attention in these books, but of course paying more attention to this issue would mean paying less attention to something else.

I think we fundamentally agree. But to give you an example, take Wooldridge’s Introductory Econometrics and compare it to Rosenbaum’s Observational Studies.

If I had to take one book to a desert island I’d take the latter. The reason is that it is much more focused on design.

I have not read Tukey, I’m not even a statistician, but I suppose doing exploratory data analysis without a computer is much less dangerous than doing it with a computer where you can run 100 regressions in a loop in under 2 seconds.

Perhaps students in the course should do exercises in a mainframe with punch cards and wait 2 days for results.
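For a sense of how cheap the "100 regressions in a loop" problem is to reproduce, here is a small simulation sketch. Everything in it is pure noise by construction; the sample size, the seed, and the approximate critical value are illustrative choices.

```python
import random
from math import sqrt

def slope_t_stat(x, y):
    """t-statistic for the slope in a simple regression of y on x,
    computed from the sample correlation r as r * sqrt((n - 2) / (1 - r^2))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)
    return r * sqrt((n - 2) / (1 - r * r))

rng = random.Random(0)
n_obs, n_specs = 50, 100
T_CRIT = 2.01  # approximate two-sided 5% critical value for 48 degrees of freedom

# The outcome is noise, and so is every candidate predictor.
y = [rng.gauss(0, 1) for _ in range(n_obs)]
false_positives = 0
for _ in range(n_specs):
    x = [rng.gauss(0, 1) for _ in range(n_obs)]
    if abs(slope_t_stat(x, y)) > T_CRIT:
        false_positives += 1

print(f"{false_positives} of {n_specs} unrelated predictors look 'significant'")
```

With a 5% threshold, around five of the hundred specifications would be expected to clear the bar by chance alone, so reporting only the one you like says little.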

Having only read the first 16 pages I recommend “Statistics and the Scientific Method” by Diggle and Chetwynd

Hey that looks like a pretty nice book based on some clicking around inside via Amazon.

Well, I like the Table of Contents:

1. Introduction

2. Overview

3. Uncertainty

4. Exploratory data analysis

5. Experimental design

6. Simple comparative experiments

7. Statistical modelling

8. Survival analysis

9. Time series analysis

10. Spatial statistics

I’ll look into it more. Thanks!

Phil:

Perhaps if the cost, risk of confusion and distraction of tests could be reduced?

Hadley Wickham seems to have done much of that here http://courses.had.co.nz/12-effective-vis/slides/inference.pdf

Less cost, as it is not very technical and the legal analogy should be widely grasped; less risk of confusion, as the null hypothesis is made true in fake graphs that are unlikely to be mistaken for real things that can’t be exactly the same; and less of a distraction, as it can also serve to introduce exploratory graphics.

But whether the benefit of covering tests is worth the costs – should be up to the author.

I liked it but there is one minor problem: a sole researcher is often both the prosecutor and judge, there is no defense, and the jury only gets to see one side of the story.

Well, I was once told of this conversation between Picasso and the irate husband of a woman he had painted.

Husband: That painting does not look like my wife at all!

Picasso: Really, what does she look like?

Husband: I have a picture of her here in my wallet – look!

Picasso: My, she is very tiny!

Picasso faced completely different incentives.

The average researcher would have started by asking:

How do you want me to paint your wife?

Descriptive inference, causal inference and prediction are different types of knowledge requiring different methods of knowledge production.

John Snow only needed one chart because the data were so willing to confess.

One book that I keep coming back to as a practitioner is Phillip Good’s Resampling Methods. I find the focus on using re-sampling as the way to build confidence intervals and other measures invaluable. This is especially true when faced with messy data where parametric tests may be misleading.

It’s not an intro book, but maybe replace some other content with a chapter on re-sampling to benefit the students who can be persuaded empirically but not theoretically.

The book also contains many examples in different programming languages.
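As a rough illustration of the resampling idea (a generic percentile bootstrap, not an example taken from Good's book), here is a sketch in Python. The data values and the `bootstrap_ci` helper are hypothetical.

```python
import random
from statistics import median

def bootstrap_ci(data, stat, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample the data with replacement,
    recompute the statistic each time, and read off the tail quantiles."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical messy measurements: mostly near 1, with one large outlier.
data = [1.2, 0.8, 1.5, 0.9, 1.1, 7.4, 1.3, 1.0, 0.7, 1.6, 1.4, 2.2]

low, high = bootstrap_ci(data, median)
print(f"sample median = {median(data):.2f}")
print(f"95% bootstrap interval for the median: ({low:.2f}, {high:.2f})")
```

No distributional formula is needed for the median here, which is part of the appeal for students who are persuaded by seeing the resampled values rather than by a derivation.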

[...] One approach is to forget the t tests, F tests, etc. and instead frame problems as quantitative comparisons, predictions, and causal inferences (which are a form of prediction of potential outcomes). You get the conf intervals, s.e.’s, etc from a random sampling model that you recognize is an approximation. This all loops back to Phil’s recent discussion. [...]

[...] Phil Price, Write This Book: [...]

[...] What should a statistics textbook include (worth noting this post isn’t from AG himself, but still indicative of why i like his posts). [...]