Joachim Krueger writes:

As many of us rely (in part) on p values when trying to make sense of the data, I am sending a link to a paper Patrick Heck and I published in Frontiers in Psychology. The goal of this work is not to fan the flames of the already overheated debate, but to provide some estimates of what p can and cannot do. Statistical inference will always require experience and good judgment regardless of which school of thought (Bayesian, frequentist, or other) we lean toward.

I have three reactions.

**1.** I don’t think there’s any “overheated debate” about the p-value; it’s a method that has big problems and is part of the larger problem that is null hypothesis significance testing (see my article, The problems with p-values are not just with p-values); also p-values are widely misunderstood (see also here).

From a Bayesian point of view, p-values are most cleanly interpreted in the context of uniform prior distributions—but the setting of uniform priors, where there’s nothing special about zero, is the scenario where p-values are generally irrelevant.

So I don’t have much use for p-values. They still get used in practice—a lot—so there’s room for lots more articles explaining them to users, but I’m kinda tired of the topic.

**2.** I disagree with Krueger’s statement that “statistical inference will always require experience and good judgment.” For better or worse, lots of statistical inference is done using default methods by people with poor judgment and little if any relevant experience. Too bad, maybe, but that’s how it is.

Does statistical inference require experience and good judgment? No more than driving a car requires experience and good judgment. All you need is gas in the tank and the key in the ignition and you’re ready to go. The roads have all been paved and anyone can drive on them.

**3.** In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a p-value of 0.20 corresponds to a z-score of 1.28, and a p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
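The arithmetic here is easy to verify. A minimal sketch in Python (standard library only; two-sided p-values under a standard normal):

```python
from statistics import NormalDist

def z_from_p(p):
    """Convert a two-sided p-value into the corresponding |z| score."""
    return NormalDist().inv_cdf(1 - p / 2)

z_weak = z_from_p(0.20)     # ~1.28, usually read as "no evidence"
z_strong = z_from_p(0.01)   # ~2.58, usually read as "strong evidence"

# The difference of two independent unit-standard-error estimates
# has standard error sqrt(2), so the z-score of the difference is:
diff_z = (z_strong - z_weak) / 2 ** 0.5
print(f"{z_weak:.2f}  {z_strong:.2f}  difference z = {diff_z:.2f}")
```

The gap between a "useless" p = 0.20 and a "strong" p = 0.01 is itself less than one standard error from zero.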

More generally I think that all the positive aspects of the p-value they discuss in their paper would be even more positive if researchers were to use the z-score and not ever bother with the misleading transformation into the so-called p-value. I’d much rather see people reporting z-scores of 1.5 or 2 or 2.5 than reporting p-values of 0.13, 0.05, and 0.01.

> The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant

It’s true that the difference between a significant and a not-significant value *may not* be significant itself. In the same way, with an unmarked yardstick you can distinguish a string longer than one yard from one shorter than one yard, but the difference in length between them may be less than one yard.

It becomes a problem in the following situation: I have treatments A and B. I compare both to a control treatment. Treatment A is (statistically) significantly better than control, but treatment B is not. Then I claim that treatment A is “better” than B. However, I never compared the effect of A to B. If I had, I would have found that A and B were not significantly different from one another.

Your example misses the point that these effects (i.e., the lengths of the two strings) are random samples, and we are unsure whether the strings we have are representative of the “true” string length. A better intuition would be that you have randomly drawn two strings from two different populations of strings which vary in their lengths (different manufacturers, I suppose?). A string from manufacturer A is longer than your yardstick, and one from manufacturer B is shorter. If the length of the yardstick is the “significance threshold,” then string A is statistically significant, but not string B. That said, we cannot conclude that the population of strings from A is longer than that from B, because the difference between our two strings is not larger than our yardstick.
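A toy numerical version of this situation (the effect sizes of 2.5 and 1.5 standard errors are made up for illustration):

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical treatment estimates, in units of their (common) standard error:
z_A, z_B = 2.5, 1.5

p_A = two_sided_p(z_A)                       # A vs. control
p_B = two_sided_p(z_B)                       # B vs. control
p_AB = two_sided_p((z_A - z_B) / 2 ** 0.5)   # the direct A vs. B comparison

print(f"A vs control: p = {p_A:.3f}")   # significant
print(f"B vs control: p = {p_B:.3f}")   # not significant
print(f"A vs B:       p = {p_AB:.3f}")  # nowhere near significant
```

Treatment A clears the 0.05 threshold and B does not, yet the direct comparison of A to B gives no evidence of any difference between them.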

In general, relations lack transitivity. I could also give examples involving distributions and probabilities.

The top 100 MLB players make substantially more than the top NFL or NBA players. NBA players also make more than NFL players, but the difference is not so large. And MLB players make more than NBA players, but that difference is smaller too.

Germany is much more likely to win the World Cup next year than Portugal. But Spain is not a clear favorite compared to Portugal, and Germany is not a clear favorite compared to Spain.

I know that’s not very interesting. But if this is a problem, we end up with a Zeno-esque rejection of the possibility of claiming that two objects are different only beyond some threshold, because for any two objects that are barely different there will be an intermediate object which is not different enough from either of them.

Context is important here. If this is a confirmatory study, where an a priori specified comparison between treatments A and B is of interest, then certainly the approach described above is inadequate. Here one must conduct a proper pairwise comparison between treatments A and B, taking into account the joint distribution of their effects.

However, suppose it’s an exploratory study, examining potentially dozens of treatments. Then in the Discussion section one wants to postulate evidence of treatment differences to be explored for future research. Here, examining the marginal effects of various treatments in this manner in an attempt to elucidate treatment differences of interest may be acceptable.

As we are talking about p-values again: Intake of dairy foods and risk of Parkinson disease, doi: http://dx.doi.org/10.1212/WNL.0000000000004057

In your Bayesian Data Analysis you do a lot of work with “Bayesian p-values” via posterior predictive checks. It seems that these p-values could also be abused in the way that frequentist p-values often are… do you agree?

I am sure fools can come up with abuses, but fundamentally I don’t think posterior predictive checks are the same thing, no.

A posterior predictive check is basically a check to see whether the model generates data that is sufficiently like real data to capture what is actually going on in the world. A small p-value means you should reject your model and go back to the drawing board. This actually makes sense, in that it’s the purpose of p-values to tell you when something is weird about a model.
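As a sketch of the idea, here is a minimal posterior predictive check for a toy normal model (flat prior on the mean, known sd = 1; the data and the test statistic are made up for illustration):

```python
import random
import statistics

random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(50)]   # "observed" data (simulated here)

# Toy model: y_i ~ Normal(mu, 1) with a flat prior on mu,
# so the posterior is mu | y ~ Normal(ybar, 1/sqrt(n)).
n, ybar = len(y), statistics.fmean(y)

def T(data):
    """Test statistic: the most extreme observation (tail behavior)."""
    return max(abs(v) for v in data)

t_obs = T(y)
draws, exceed = 1000, 0
for _ in range(draws):
    mu = random.gauss(ybar, 1 / n ** 0.5)         # posterior draw
    y_rep = [random.gauss(mu, 1.0) for _ in y]    # replicated dataset
    exceed += T(y_rep) >= t_obs

ppp = exceed / draws
print(f"posterior predictive p-value: {ppp:.2f}")
# Values near 0 or 1 indicate the model fails to reproduce this feature of the data.
```

The point of the check is the comparison between real and replicated data under the fitted model, not a dichotomous accept/reject decision about a strawman null.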

In NHST you posit a meaningless model that no one believes (null of exactly zero effect) and then when you reject it you reject it in favor of some preferred model on spec. Note that at no time does the preferred model get actually evaluated in any way. It’s a shell game.

So, yes, everything can be abused, but it helps rein in abuse if you’re not playing a shell game to begin with.

Daniel,

Sorry to harp on this point again, but what if you aren’t concerned with fitting the data well? What if you are aware that you are focusing on a mechanism that only plays a small role in the DGP (data-generating process), and as a result you don’t expect to match real data well with your model? More generally, what criteria are people using to “reject” models? Is it mostly just fit?

What do you think about this:

http://andrewgelman.com/2017/06/11/im-not-participating-transparent-psi-project/#comment-507115

Basically, I’m struggling to see the context in the social sciences where NHST (done well) is not the most reasonable option. A null of zero may not be of interest, but what other null are we supposed to look at? My point is that the theory in the social sciences is not good enough, and likely won’t be for a long time, to generate substantive testable predictions (e.g., “the effect of X on Y is 0.5”).

If you aren’t concerned with fitting the data well, then don’t do posterior predictive checks. Bayes isn’t in general about fitting the frequency distribution well; it’s about finding parameters whose predictions put your data in the high-probability region of the predictive distribution, not necessarily fill that region up.

“what other null are we supposed to look at?”: you shouldn’t be looking at a null at all. Hypothesize a real model and fit it.

As for the idea that theory in social science isn’t good enough and won’t be for a long time… that’s especially true if everyone just gives up on the idea of creating models and just goes around looking for not-null effects.

I basically totally disagree with the idea that we can’t make predictions in the social sciences. I believe that the real problem in the social sciences, especially economics, is a focus on frequentist statistics and a misunderstanding of what it means to fit a Bayesian model. Your comments suggest you’re half on the right track. You’re absolutely right that you don’t need to fit the frequency distribution of your data well… but when would you do that? When you have some specific theory that you are interested in understanding. In which case, code up that theory and fit it with Stan.

Your idea that fitting means “the effect of x on y is 0.5” shows confusion about what a Bayesian fit does. You don’t posit a number and then try to show that it’s true. You posit a mechanism and try to find what range of numbers makes that mechanism predict things like your observed data. If the range is large, so be it.

Yeah, people do that in economics; it would be put under the umbrella of “structural” empirical work. I don’t see how it’s “Bayesian” to posit a mechanism and fit that model to the data; you can do that in a frequentist framework with GMM or MLE as well. But my point again is that there are instances where you can’t just look at the fit of your model to see if it is the “correct” one. If you posit some mechanism, and it fits the data well, why should I believe you have the correct model?

A classic example would be fitting a model of labor demand to wage/employment data. You’ve estimated your labor demand model, and it fits pretty well, but why should I think that your parameters are labor demand parameters? They could equally well be labor supply parameters. At that point you need to argue about the nature of the variation in employment you are using, or something like that. This is my fundamental beef, and it never seems to be addressed on this blog, because causal inference doesn’t seem to be the focus that often.

The example you gave only works because it’s basically an experimental setting; you know that the only thing you changed was the controller. These things don’t happen in economics, especially in macro. My only point is that you need to understand the variation used to identify parameters; fit doesn’t tell you about that. In your example, yes, it’s really clear that what is causing the change in the outcome variables is this controller alteration. In my labor demand example, you need to argue why you have a labor supply shifter to estimate your demand parameters, and not the other way around.

In the second half of your example… I agree with this, and a lot of people do this in economics, as I’ve said above. Except that the model is not posited beforehand, and my earlier point is that at this stage it simply can’t really be done in the social sciences. But the drawback to this modelling approach is that the parameters may not be what you say they are. That is, conditional on your model being true, your parameters have the interpretation you’ve given them; but how do we know how close to the truth the model is?

Matt: it’s Bayesian if you use probability distributions that aren’t frequency distributions. So for example, if you have a “structural” model where you say y = f(x) + e and in your model e has a special gamma distribution you derived based on some assumptions, and you go and you fit this model, and you find empirically that the e values *are not* gamma distributed… and you consider this to be not a problem because the gamma distribution was a description of your knowledge of how well your model ought to predict rather than a description of what the frequency distribution of the errors *will be* when you collect data… Then you’re doing Bayesian modeling. And I think this is relevant to what you were discussing about not necessarily fitting the data exactly.

Why should you believe I have the correct model? You shouldn’t! You should try a bunch of other things too, and see which ones are consistently valuable for understanding the questions of interest.

Good models unify a lot of phenomena and make predictions which are borne out in future data in new circumstances outside the realm of what was observed before. That’s one of the most powerful ways to check models.

I’m working on an economic issue right now where I believe that certain people will make decisions of a certain kind based on tradeoffs they experience, and some of the relevant variables are things like how much they spend on housing, how much their family makes in income, what the minimum wage is, how much education they have, and how many children they have… I am building a model that incorporates all of those variables. To say that it can’t be done seems silly to me because I’m doing it right now. It’s hard, yes, because there are a bunch of factors and zillions of survey data points to consider, but it’s fundamentally technical challenges, not conceptual ones.

Of course I’m not saying model building in economics can’t be done; look at any current empirical economics paper and it is likely to have an economic model. My point is that the theory still isn’t very good; while you may write a model that seems OK for SF, it probably won’t be any good for New York. While the physical laws that govern the world are the same no matter where you are or what time period you are in, we can’t say the same when modeling human behavior. What you end up with, it seems, is a set of models in economics that, a lot of the time, are only useful in the specific context within which they were created. This was my point: it’s hard to rigorously test models when those models are constantly changing.

But if there is commonality of phenomena across times and places, then things like per-location models with partial pooling are still very useful. I agree with you that we’re not going to find some universal demand function for peanuts in the way that we can find a universal potential energy function for the coulomb force. But, that just means you acknowledge variation and work with it: variation in time, space, culture, technology… if you give up the idea that you’re looking for universal laws, it’s still possible to find lots of regularity in human behavior that can be explained by a common “structure” with variation in quantities.
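As a rough sketch of what per-location partial pooling looks like, here is the classic normal shrinkage estimator on made-up data (the variance components `tau2` and `sigma2` are assumed known for simplicity; in a real analysis they would be estimated jointly):

```python
import statistics

# Toy per-location demand observations (made-up numbers)
data = {
    "SF":  [3.1, 2.8, 3.4, 3.0],
    "NY":  [4.2, 4.5, 3.9],
    "ATL": [2.0],
}

grand = statistics.fmean(v for obs in data.values() for v in obs)
tau2, sigma2 = 0.5, 0.3   # assumed between- and within-location variances

pooled = {}
for loc, obs in data.items():
    n, ybar = len(obs), statistics.fmean(obs)
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)   # precision-weighted shrinkage
    pooled[loc] = w * ybar + (1 - w) * grand
    print(f"{loc}: raw mean {ybar:.2f} -> partially pooled {pooled[loc]:.2f}")
```

Locations with little data are pulled toward the grand mean, while data-rich locations mostly keep their own estimates: common structure, with variation in quantities.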

Let me give you an example that might convince you that this problem isn’t just limited to social sciences, and that nevertheless physical sciences plod along and make progress precisely because they don’t give up.

Suppose you’re working on engineering a new engine controller for ocean-going cargo ships. Your goal is to eke out a marginal increase in efficiency. So you do some kind of guesswork or simulation or something and come up with a way to change the timing and the pressure of the fuel injection process on these enormous diesel engines, and then you upload your code to the engine controller computer and tell your captain to go sail from Shanghai to the Port of LA with another load of goods. Now, every time they make the trip there are different cargoes loaded differently, different winds, different storms, different waves, different shipping traffic to avoid, different pilots at the harbor, etc. So you have no interest in building a complicated model for all of that. You’re just trying to determine whether the mechanism you are interested in, involving alteration of combustion in the engine, reduces consumption relative to the counterfactual: what would have happened if you had left the controller alone.

I still think you make the best progress by looking, say, at the throttle control signal and the fuel consumption data and so forth, and building a model that predicts what the mechanism you programmed into the computer would do to your fuel consumption based on the rules you put into the computer, conditional on knowing certain unknowns… and then trying to guess ranges of values that are plausible for the unknowns, then collecting the dataset of actual consumption data and finding out which values of those unknowns cause the predicted consumption to look more like your measured consumption data, and which less. And that’s the essence of real Bayesian modeling.

People told me biomed was too complex/hard a field to come up with real quantitative models. Then I found one from the 1930s that worked really well after a bit of modification. It turned out the same people who said it was too hard did not care anyway, and just wanted to know if there was a significant difference between groups.

I really don’t take arguments like that seriously anymore. It is just laziness, misspent time/energy, etc. Similar to how cancer is now “many diseases,” so it is unreasonable to expect a cure any time soon… How convenient for them.

Yeah, I’m not suggesting we just compare groups. I think at some point I’ve seen you write something like “you have a model, it generates an exact prediction (i.e., the speed of light is X), and then you go test that empirically,” and that this is an argument for why NHST is terrible; the bar for success is so low that with enough data we will always find differences between groups. But show me some examples in social science where this is remotely feasible. My only point is that the theory is still underdeveloped, and that makes things harder from an empirical standpoint.

Most of the time it is not really about an exact fit. For example, no one ever really incorporates all the asteroids, etc into solar system simulations, so they expect to get the wrong results. Other theories make predictions that are off by 10^120…:

https://en.wikipedia.org/wiki/Cosmological_constant#Predictions

Matt, I’m going to challenge you to simply try something like it. Here’s a sample problem:

Households spend money on food; food varies in price around the country, and people’s cultural preferences vary for the type of food they want to consume. Build a model for the amount of money spent on chicken, beef, pork, seafood, root vegetables, and green vegetables as a function of which state the family lives in, the composition of the family (details of the age and sex of each person in the family), transportation distance from the nearest source of each component, the density of the city they live in, and the household’s income after taxes.

The consumer expenditure survey microdata would be useful:

https://www.bls.gov/cex/

Are you telling me that this seems completely beyond the realm of the possible?

Sure, it’s a challenge; and sure, to do it right would take a lot of computing and such; and sure, 20 years from now it won’t still be valid. But I know this could be done today with enough work.

And, furthermore, I think there are plenty of good causal components in this model. Transportation costs affect the price of a food item; this will affect demand; and to the extent that those transportation costs have been in place for long periods, this will affect cultural components of what people prefer (lobster in Maine, crawdads in Baton Rouge, salmon in Alaska, beef in Montana, whatever): the connection is causal. How about quantity of food? Calorie requirements are an understood phenomenon for different ages, sexes, and activity levels. Activity levels vary with outside temperature, humidity, and local economic conditions…

I’m with Anoneuoid, we’ve got billions spent on the Census bureau. The USDA does a lot of work on surveying food consumption patterns… it’s all there for the picking. The truth is closer to the “people who said it was too hard did not care anyway” than “it’s not possible to do a good job of this stuff”

Who’s going to get an Econ Nobel out of a Bayesian spatio-temporal model of demand for food items?

Robert:

Yes, in the most recent edition of the book I have toned down the p-values and have placed more emphasis on the graphical use of these checks for exploratory data analysis.

This is probably the reason for the popularity of p-values. p-values zoom to teeny tiny numbers very quickly while z-scores plod along getting only modestly larger. A p-value of 0.001 sounds GREAT, while a z-score of 3.0 sounds like exactly what it is: a bit higher than 2.0.

Which is why Boos and Stefanski advocate -log10(p), or, to bring things back full circle, a return to *, **, and ***, i.e., no more than about three or four magnitudes of p-values at all. http://www.tandfonline.com/doi/abs/10.1198/tas.2011.10129
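The zooming effect is easy to see numerically; a quick sketch (standard library, two-sided p-values):

```python
import math
from statistics import NormalDist

nd = NormalDist()
for z in [1.0, 2.0, 3.0, 4.0, 5.0]:
    p = 2 * (1 - nd.cdf(z))   # two-sided p-value for a z-score
    print(f"z = {z:.0f}   p = {p:.1e}   -log10(p) = {-math.log10(p):.1f}")
```

As z plods from 1 to 5, p drops by about six orders of magnitude, which is why -log10(p) puts the comparison back on a roughly linear footing.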

This zooming behavior is one of the problem areas, because it only works right when (1) the tails of the distribution are very close to properly Gaussian, and (2) there is virtually no bias. Otherwise, the z-values are probably still reasonably informative, but very small p-values are very misleading.

> A z-score of 3.0 sounds like exactly what it is: a bit higher than 2.0

How about a z-squared of 9 vs. 4? This doesn’t avoid the problem of random variability, but at least in simple settings it does get the discussion onto a scale that might make some intuitive sense: the more significant result is like having the same signal as the weaker one, but a sample size that’s 9/4 (i.e., 225%) as large.
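A small sketch of the sample-size intuition (with made-up numbers): for a fixed true effect, z grows like sqrt(n), so raising a z-score by a factor of 3/2 corresponds to multiplying the sample size by (3/2)² = 2.25:

```python
# For a fixed true effect d measured with standard error sigma / sqrt(n),
# the z-score is z = d * sqrt(n) / sigma, so z scales with sqrt(n).
d, sigma = 0.5, 1.0

n_small = 64
z_small = d * n_small ** 0.5 / sigma   # z = 4.0

n_big = n_small * (3 / 2) ** 2         # 2.25x the sample size
z_big = d * n_big ** 0.5 / sigma       # z = 6.0, i.e. 3/2 the z-score

print(z_small, z_big, (z_big / z_small) ** 2)   # ratio of z-squared recovers 2.25
```

So comparing z² values is, in this idealized setting, comparing effective sample sizes at a common signal strength.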

George:

Your statement about “signal” only applies with reference to an effect of exactly zero, but the p-value is typically applied in settings where the effect is believed to not be zero.

Agree about “signal.” But measuring relative to zero is not the same as believing in zero as the truth (with non-zero probability), so I’m unclear on your second point.

In response to the 3 critical points raised by Professor Gelman:

[1] P-values and NHST evoke some strong emotions and corresponding language. That’s why we referred to the debate as ‘overheated.’ We need to step back and see how p-values perform relative to other metrics. One charge perennially lobbed at p-values is that they are associated with (or ‘enable’) the practice of dichotomous betting with regard to the truth or falsity of the statistical hypothesis. Dichotomous decisions are deplorable and unnecessary in some contexts, such as theory development and refinement, but required in others, as when an action must be either taken or withheld. A distaste for categorical decisions may need to be reined in when such (dis)taste is also expressed categorically. Hence we framed our paper as an investigation of the pros and cons of p-values, and we tried to give them a quantified, graded expression.

[2] We strongly believe that “statistical inference will always require experience and good judgment.” To say that much statistical activity is now automated and staked on default settings and assumptions is to say that we are moving into another era of ritualized practice. This is a scary prospect. The assumption that ritualization is good is a chimera, as it suggests that the problems of induction can be solved by brute mathematical force. Experience and life will always refute this assumption. We may grant that statistical work can be automated and ritualized; this does not mean that it is good. The analogy with driving a car is poor. Driving a car requires the acquisition of a complex skill set and the intervention of conscious judgment and decision-making in moments of danger. Much like statistical work [so the analogy works after all, if not as intended by Professor Gelman].

[3] Noting that a difference between two p-values is not significant is to invite self-contradiction. Why reject the work of significance testing by performing a significance test? To prefer a z-score (or any other metric with a point on the x-axis) while rejecting p-values also lacks logical force because, as Professor Gelman notes, z-scores entail (“correspond to”) their p-values. Why reject one metric if it is perfectly monotonically related to a preferred one? Such a choice reduces to a preference-without-reason.

“The assumption that ritualization is good is a chimera, as it suggests that the problems of induction can be solved by brute mathematical force.”

But no one’s making that assumption. AG in particular wrote, “For better or worse… Too bad, maybe, but that’s how it is.” Also, you seem to have missed the point of the driving-a-car analogy, which is that bad drivers do actually exist.

Wow, it seems like things sailed totally over your head

1) P-values don’t just “enable” dichotomous betting, they encourage it in an utterly irresponsible way that completely ignores consequences. The proper way to make dichotomous choices is Bayesian decision theory, using real-world consequences in the utility. Wald’s theorem ensures this is true even if you don’t “believe” in priors.

2) Gelman is lamenting the fact that statistical inference can be done without experience or judgment. It isn’t required, in exactly the same way that it isn’t required to actually be a good driver in order to drive around on the roads. Yes, it helps, but there are plenty of bad drivers and plenty of bad analysts.

3) Z-scores are preferred because they are dimensionless expressions of physically meaningful quantities, whereas p-values are arbitrarily transformed in ways that destroy their connection to the ratio of two measurements of physical quantities. You could take the z-score as the seed to a uniform RNG and then take the 1000th output of said RNG, and it would have a perfect 1-1 correspondence, but it would have no meaning.

P-values don’t require dichotomous betting. They originate from Fisher’s significance testing approach, which was quite different from the decision-oriented hypothesis testing of Neyman and Pearson. NHST is an incoherent amalgamation of both.

P-values are not an arbitrary transformation of z-scores; they are a physically meaningful transformation of z-scores. Z-scores also enable and encourage the same kind of dichotomous choices, by the way (e.g., when the z-score is over three, do whatever corrective action is required).

P-values are defined in general, unlike z-scores. Although the transformation can be made in the other direction, and it is not completely unusual to express p-values not based on z-scores in terms of sigmas.

Carlos:

I’m not sure what you mean by “physically meaningful,” but I think p-values are misleading in that they are with reference to a null hypothesis of exactly zero effect, which in general is not of interest.

I agree with you that it’s a mistake to use z-scores or p-values as a decision threshold. To the extent these numbers are used as data description, I’d much rather see people reporting z-scores of 1.5 or 2 or 2.5 than reporting p-values of 0.13, 0.05, and 0.01.

I was paraphrasing Daniel, actually I’m not sure why the z-score would be a physically meaningful quantity either. I’m glad to take back the “physically” qualifier.

P-values are in reference to a null hypothesis of exactly zero effect in the same way that z-scores are in reference to a null hypothesis of exactly zero effect. Z-scores are not model-free; if they allow you to discard information and reduce pairs of the form [ “estimate”, “standard error” ] to their ratio, it is only because of some implicit assumptions.

[ 0.2 , 0.1 ] and [ 20 , 10 ] correspond to the same z-score 2 only with reference to a null hypothesis of exactly zero effect.

In my own work (which isn’t done for publication) I tend to rely on comparing the estimated coefficient to the estimated standard error of that estimate, i.e., the z-score. But I doubt that I would use the z-scores if I didn’t also know that under standard conditions a z-score of 2 or more can happen about 5 percent of the time and a z-score of 4 or more would be rare. I doubt that z-score users would use them if they didn’t have some sort of concept of this relationship to p-values. I don’t have any problem considering the p-values in this sense “physically meaningful.”

The only reason I would prefer z-scores over p-values is that this seems to be what a lot of physicists do (or at least the ones I’ve known), and this leads to statements like “my certainty is 5 sigmas,” which I find amusing in that it out-clutters p-values.

Re physically meaningful: consider something like the height z-score used in child development, the difference between the height of a child and the average height in a large dataset of a healthy population, divided by the sd of that healthy population. This tells you how far from typical the child is, on a scale determined by natural population variability in a healthy population.

It’s dimensionless, and it measures what you want to know: whether the child’s growth is “abnormal” such as due to nutritional deficit or a hormone issue or whatever.

You can create a physically meaningful p value out of this, but only by taking the percentile in the large healthy population dataset, that is: by having the *correct* sampling distribution of the actual population with all its skewness and bumps and mixtures of different races, and different healthy diets across cultures etc.

In your example a p-value based on the empirical distribution would be clearly more informative than a z-score. (By the way, do you also need a *correct* average and a *correct* sd to calculate a z-score, or can you make do with estimates?) I don’t think doctors are going to substitute z-scores for percentiles when discussing child development with parents.

And what would be the interpretation of the z-score if you cannot assume normality due to all this skewness and bumps and mixtures? Does it really measure whether the child’s growth is “abnormal”? How so? Could you give two examples of z-scores corresponding to (A) “abnormal” growth and (B) “non-abnormal” growth?

The z-score is more informative for modeling purposes; it measures a length in dimensionless form, and so, if you want to model, say, growth stunting as a function of nutritional variables, race, and so forth, you should model the change in z-score, not the change in p-value.

The p-value based on the empirical distribution, which I did just say *was* physically meaningful, is helpful for determining how abnormal a score is; you’re absolutely right. But a typical kind of p-value, actually calculated based on, say, the maximum-likelihood normal, could easily be very misleading, as the whole thing is sensitive to the behavior in the tails. That’s why having the empirical distribution is so important for this kind of purpose. If the empirical distribution has a very light left tail and a long right tail, you might be telling people their child is fine when in fact their child’s height is extremely abnormally small, just because your normal-distribution fit doesn’t fit the left tail very well and needs to have a long tail on the right, so it also has a long tail on the left, etc.
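A small simulation of this point (the skewed “healthy population” here is a made-up lognormal, and the cutoff value is arbitrary): the normal-theory tail probability computed from a z-score can disagree substantially with the empirical percentile.

```python
import random
import statistics

random.seed(0)
# A skewed "healthy population": light left tail, long right tail (lognormal)
pop = [random.lognormvariate(0.0, 0.5) for _ in range(100_000)]
mu, sd = statistics.fmean(pop), statistics.pstdev(pop)

x = 0.45                      # an observed value in the short left tail
z = (x - mu) / sd             # z-score from the fitted mean and sd

normal_p = statistics.NormalDist().cdf(z)           # left-tail prob. if normal
empirical_p = sum(v <= x for v in pop) / len(pop)   # actual percentile

print(f"z = {z:.2f}   normal-based p = {normal_p:.3f}   empirical p = {empirical_p:.3f}")
```

With these particular numbers the normal approximation roughly doubles the left-tail probability relative to the empirical percentile, i.e., it makes the value look considerably less unusual than it actually is in this population.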

Perhaps my biggest point is this: p-values have specific purposes, as measures of empirically justified “abnormality,” and when used in those ways they are fine and useful. But those purposes only vaguely resemble the way they are actually used in practice, and in practice dimensionless ratios of physical quantities, whether they be z-scores, or just the ratio of two means, or whatever, are often more useful, as they are directly *linearly* related to a measurement of a “physical” quantity, and it’s this physical quantity which is really of most interest to the researcher.

For example, the statement “cell division rate was significantly decreased in the experimental group (p = 0.0022)” gives you virtually no sense of what is going on; it tells you only that some distributional assumption was violated.

“To say that much statistical activity is now automated and staked on default settings and assumptions is to say that we are moving into another era of ritualized practice. This is a scary prospect. The assumption that ritualization is good is a chimera, as it suggests that the problems of induction can be solved by brute mathematical force.”

This may be true in experimental behavioral science, but it is dismissive of much research in expert systems / machine learning / statistical process control, etc., where the aim is to ritualize the algorithms and analysis sufficiently that the analyst becomes pretty much irrelevant — the analyst then moving on to different areas of process improvement.

In many of these areas, dichotomous decisions are pretty much the order of the day: Do we investigate the machine because it seems out of spec? Does our advertising pay out, so that we continue this strategy? Should we show banner ad A or banner ad B? With the result of medical test X, should we order further tests or not?

In most of these contexts, the problem isn’t that a dichotomous decision shouldn’t be made; it’s that p-values usually aren’t the best way to make it.
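A toy sketch of the alternative (all numbers hypothetical): make the dichotomous call by comparing the expected losses of the two actions directly, rather than thresholding a p-value that ignores how we value the outcomes.

```python
# Toy decision: investigate a machine flagged as possibly out of spec.
# All numbers are hypothetical. Instead of asking "is p < 0.05?", we
# compare the expected losses of the two available actions.

p_fault = 0.15              # assumed posterior probability the machine is faulty
cost_investigate = 500      # cost of stopping the line to check (always paid)
cost_missed_fault = 10_000  # cost of running on with an undetected fault

loss_investigate = cost_investigate
loss_do_nothing = p_fault * cost_missed_fault  # expected loss of inaction

decision = "investigate" if loss_investigate < loss_do_nothing else "keep running"
print(decision)  # expected loss 500 vs 1500 -> "investigate"
```

Note that the decision flips on the costs, not on any significance threshold: with a cheap fault and an expensive inspection, the same `p_fault` would justify doing nothing.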

“statistical inference will always require experience and good judgment.”

That’s true for the brilliant people who set up the original rules (e.g. those who develop the original expert system), but not true down the road once the system is set up. In a given problem domain, the aim is often to simplify the decision process sufficiently so that less highly trained / cheaper entities can take over the process — a totally automated system often being faster, cheaper, and more consistent.

In terms of the “driving a car” analogy: in 1910 cars broke down a lot, and a driver needed mechanical expertise. Now you don’t need more than a license, a key, and an AAA card. I hope by the time I’m in my 80s self-driving cars will have taken over. By then, I’m likely to trust an expert set of algorithms more than my own failing physical skills.

This comment deserves a post (or even a blog) unto itself. I am inclined to agree as a prediction, but I find this future deeply disturbing. Even if it is true that algorithms are superior to many human decision makers, I’m not sure that is a world I want to live in. What are we assuming all those inferior humans will be doing? It sounds like an image out of Wallye – obese humans with no function existing as consumer-bots. And, since I spend my mental efforts teaching – trying to improve people’s experience and judgment – do I have a future? I guess I’d better work on my golf game.

Dale, this dystopian future is a big problem in my opinion. As automation increases in effectiveness the marginal productivity of an hour of typical human work goes to zero. With the assumption that people need to work and earn wages so built into the fabric of our society, and social constructs so slow to change, the future could easily not look like Wall-E but rather like something out of Blade Runner or the great depression, people awash in misery in the midst of plenty.

Suppose for the moment that we plopped down 1 warehouse sized star trek replicator + transporter per 10000 people in the US. How many people would starve to death when they lost their jobs and had no money to buy time on the replicator? Law and contracts and things simply don’t adjust to that sort of thing in less than decades. People have 30 year mortgages, but literally they could have a mansion assembled in a few seconds and beamed into place at essentially no materials or labor cost. It’s a useful extreme case to highlight the social difficulties of rapid technological change that invalidates core stable assumptions.

Thanks for correcting my movie title reference. As I think about your statement about rapid change and human adaptability, a different problem emerges – one that I think is missing from most of the machine learning literature. Suppose we have a recommender system that suggests people we might want to connect with, or books we might want to purchase, etc. Suppose further that the algorithm works “well,” perhaps even better than any human could. How do we measure how well the algorithm or the human actually performs? Is it by whether we friend the person that was recommended or purchase the book that was suggested?

I would think those are natural metrics to use, and these appear to be the kinds of things used to evaluate the performance of algorithms. But shouldn’t they be compared to something else? And shouldn’t we be comparing not just the purchase but the satisfaction from the purchase? An algorithm may recommend a book which I then purchase. I may even like the book. But I might have liked a different book better. How will the algorithm ever know? Will it even care?

I think the problem may not be that humans adapt slowly but that humans adapt too quickly! We may quickly learn to adopt algorithmic recommendations without ever asking these questions of “compared to what?”

Well, I think you’re talking about a fundamental difference between a buyer and a seller. The buyer “wins” when you buy their stuff, regardless of whether you later regret having done so. This is especially true when repeat business is minimal. The seller on the other hand wins when they buy stuff they actually like.

So, if the system is working for the seller: more sales is the metric

if the system is working for the buyer: higher satisfaction ratings or something like that is the metric.

One fundamental problem at the moment is that in much of the commerce going on, the consumer of the service is neither a buyer nor a seller: they’re the product! Think Facebook: they collect a herd of people, and then sell third parties the opportunity to put their ad on the pages the herd sees. Yahoo was doing the same thing. In many ways Google does the same thing: Google Voice? It’s just a way to collect audio snippets for training speech recognition. Gmail? It’s just a way to get you to view their site all day long so you will see their ads. Android phones: Google doesn’t get money from you; they get money from Samsung and from developers who sell their apps through the Play store.

I think this is fundamentally flawed from the perspective of making progress towards broad human happiness. The happiness of the consumer masses is only fairly tangentially related to the monetary feedback mechanism that is supposed to be helping to direct resources.

Still, it’s a logical consequence of wage stagnation and the decline in human economic productivity: sell things to the people who have money. And the “people” who have money are corporations with enormous offshore corporate tax shelters.

Daniel: “The buyer “wins” when you buy their stuff, regardless of whether you later regret having done so. This is especially true when repeat business is minimal. The seller on the other hand wins when they buy stuff they actually like. “

Is this really what you meant?

Martha: yes I think so. I mean, it’s not a good situation to have a disconnect between the interests of buyers and the interests of sellers, but it is the situation you wind up with often at the moment.

Ah, sorry. I see what I did:

the *seller* wins when the buyer buys their stuff regardless of whether the buyer later regrets it because piece of junk etc.

the *buyer* wins when they buy stuff they actually like.

I see that somehow I switched the roles in the text, while in my brain it read totally fine.

;-)

I think you are getting the economics a bit messed up – and I think it is largely a red herring in this case. People buy things because they think it is worth what they pay – generally more than they pay. Aside from mistakes (we all make them), I think we should accept that premise (if you don’t, fine, but you have to throw out virtually all of economics then). Let’s accept that sellers’ metric is sales. The issue still is: how do we judge the “success” of a recommendation algorithm? Even if sales increase, that is not evidence that the algorithm works better than anything else (unless we are actually testing it against something else). For measuring its success for buyers it is even more complicated, but still involves the same issue: compared to what? My point is that many machine learning algorithms are not being tested against anything else, or just against what sales were before using the algorithm. I think this makes the claim that “algorithms work better than people” tenuous at best. I know there is some literature comparing algorithms with human decision making (e.g., in medical diagnoses), but that is the exception.

So, when we eliminate all this human work and replace it with algorithms, how do we know the automated system is better? I was responding to zbicyclist’s statement above. And, I am not even denying its truth. What I am saying is that humans are actually quite adaptable – we are likely to accept that the automated system is better when we have no real evidence of that.

Dale: you’ve said “except for mistakes” but *mistakes* is what I was talking about! Asymmetry of information is rampant, and people buy some wifi router that has terrible problems with intermittent overheating and dropping connections, and they spend a lot of time griping about their internet… the seller got the money and moved on to the next model, the buyer hates what they got, but it’s too late… and they don’t even know what the problem is, a metric based on sales makes it look like everyone’s winning… but they’re not. Comcast sells you some BS high speed internet service but they prioritize traffic to speed-test sites so it looks like you get high speeds, but on everyday connections you get crappy connection and lost packets and large latencies… and you don’t even know it.

As technologies role in society increases, this stuff is rampant: car transmissions, iphones, internet service providers, pharmaceuticals (Tamiflu anyone?) etc etc etc. People buy things thinking they will get X and in fact they get something else. Sales metrics don’t mean the same thing as satisfaction metrics would.

Your issue is also very relevant. But I do think companies like Amazon are already making changes to their recommender system, then selecting a random sample of their customers, dividing it randomly in half, and trying the new recommender on one half with the other half as control. So it’s not hard to see that a recommender improves sales… but does it improve the quality of life of the customer? Or does it just extract more money by confusing them into thinking they’re going to get some good stuff when in fact, down the line, when the limitations that were known to the seller but not the buyer become evident, it winds up being much more mediocre?

The existence of Consumer Reports suggests that my concern has been around for a long time, but I suspect with very high turnover rates of technological change, something like CR becomes less useful in combatting this problem, while sales metrics become easy to optimize.

Also, consider that in this “market for lemons” the asymmetry of information makes people unwilling to pay more for supposedly higher quality, because they’ve been burned many times. So we wind up with a race to the bottom, a lot of marginal technology pumped out in rapid succession, high sales, and huge amounts of wasted resources on half-broken stuff. Recommender systems based on sales metrics do nothing to improve this aspect of quality, and in fact you can easily measure price sensitivity, so recommender systems will likely push you more quickly into low price / low quality, where the higher profit is. Furthermore, a bunch of effort is then spent on marketing things where quality is ephemeral and hard to analyze: think Monster Cables and audiophile pre-amps and so forth.

It’s a one-sided bet in some sense. If everyone is producing lemons, consumers just get used to drinking unsweetened lemonade, which returns us to your point which is that perhaps we just get used to it…

Maybe this isn’t a new phenomenon, but I think automating the process of rent-seeking via asymmetric information is not exactly a recipe for societal health.

Volkswagen Diesels!

“Do we investigate the machine” or “does our advertising pay out” seem like very poor illustrations of dichotomous decisions. Maybe every decision reduces to “this, not that,” but choosing based on the highest estimated total return is not the problem.

K&H ask:

Why reject one metric if it is perfectly log-linearly correlated with a preferred one?

Well, does human behavior lead to different outcomes when using such measures? In many circumstances, it is easier to calculate average gasoline consumption per mile if fuel consumption is expressed in gal/mile rather than miles/gal.
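A quick worked example of the gal/mile point: over two equal-distance legs, averaging miles/gal directly gives the wrong trip average, while averaging gal/mile (and inverting at the end) gives the right one.

```python
# Two legs of equal distance driven at different fuel efficiencies.
miles = [100, 100]
mpg = [20, 50]
gallons = [m / e for m, e in zip(miles, mpg)]  # 5.0 and 2.0 gallons

naive_avg_mpg = sum(mpg) / len(mpg)           # 35.0 -- wrong
true_mpg = sum(miles) / sum(gallons)          # 200 / 7 ~= 28.6

# gal/mile averages correctly across equal distances
avg_gal_per_mile = sum(gallons) / sum(miles)  # 0.035
print(naive_avg_mpg, true_mpg, 1 / avg_gal_per_mile)
```

The naive average overstates the trip efficiency because miles/gal is the reciprocal of the quantity (fuel per unit distance) that actually accumulates over the trip.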

More relevant to my life experience: Which is heavier per square inch: 40 lb bond paper or 60 lb offset paper? The metric version of this question is: Which is heavier, 150.5 g/m^2 paper or 90.3 g/m^2 paper?

Bob

+10 !

Thanks. And I did not even point out that the basis weight is the weight of a ream (500 sheets) of the paper. However, if you weigh 500 sheets of 8.5×11 40-pound bond paper, it will weigh about 10 pounds.

Bob

Because the size of one (uncut) sheet is 17×22, isn’t it? A ream (40 pounds) can then be cut in four to produce 2000 8.5×11 sheets.

Just to be clear, 17×22 is the size of a sheet of bond paper (other paper types come in larger dimensions).

Yes. IIRC, 17×22 is known as the “basic size” of bond paper and 40 pounds is the basis weight. So, in comparing paper weights you need to look at the basis weight divided by the area of the basic size to get a comparable unit. The metric approach skips that step.

Bob
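The “basis weight divided by the area of the basic size” step can be sketched as a small conversion function. The basic sizes below are the conventional ones (17×22 in for bond, 25×38 in for offset/text stock); with these assumptions the offset figure comes out near 89 g/m^2, in the ballpark of the number quoted above.

```python
# Convert a US ream basis weight (lb per 500 sheets of the grade's
# "basic size") to grams per square meter, which needs no such lookup.

GRAMS_PER_POUND = 453.59237
SQM_PER_SQIN = 0.00064516

def gsm(basis_weight_lb, basic_size_in):
    """Grams per square meter from basis weight and basic size (inches)."""
    w, h = basic_size_in
    ream_area_sqm = 500 * w * h * SQM_PER_SQIN
    return basis_weight_lb * GRAMS_PER_POUND / ream_area_sqm

print(round(gsm(40, (17, 22)), 1))  # 40 lb bond   -> ~150.4 g/m^2
print(round(gsm(60, (25, 38)), 1))  # 60 lb offset -> ~88.8 g/m^2
```

The grade-dependent basic size is exactly the hidden lookup table that makes pound ratings incomparable across paper types; the metric unit bakes the division in.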

With regard to “as when an action must be either taken or withheld” (first point): how can you make a decision without considering information on the consequences of the decision? By considering heuristics (p-values, Bayes factors, or whatever) that don’t use information on how we value the outcomes of the decision, we are essentially arguing over the best color of crayon to put out a fire.

Refers to… http://andrewgelman.com/2017/06/14/ride-crooked-mile/#comment-507588

“By considering heuristics (p-values, Bayes factors, or whatever) that don’t use information on how we value the outcomes of the decision, we are essentially arguing over the best color of crayon to put out a fire.”

+1. And I’d include estimates and intervals in the “or whatever” list.

It seems like the authors are placing a lot of responsibility on improving education:

“Another, more serious, criticism is that researchers deliberately or unwittingly engage in practices resulting in depressed p-values (Simmons et al., 2011; Masicampo and Lalande, 2012; Head et al., 2015; Perezgonzalez, 2015b; Kunert, 2016; Kruschke and Lidell, 2017). For our purposes, it is essential to note that both these criticisms are matters of education and professional ethics, which need to be confronted on their own terms. We will therefore concentrate on criticism directed at the intrinsic properties of p.”

To me, these “matters of education and professional ethics” are the principal issue, and the intrinsic properties of p are secondary. Yes, if we could get everyone to pre-register all analyses in which p-values will be reported, then we could move on to considering the intrinsic properties of p. But let’s assume that doesn’t happen. Then the problem is that p-values in practice achieve the exact opposite of what they are intended to do: provide protection against mistaking noise for signal.

For researchers who are getting their p-values in the manner that our intro stat textbook examples assume (in which pre-registration is implied by the fact that there is only one analysis to be performed), the simulated correlations between P(H|D) and P(D|H) presented in this paper can be useful. But I don’t think that very many of the p-values we see are brought into this world in such a manner. Under the realistic setting in which researchers have flexibility, the results of the simulations in this paper must be overly optimistic.
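A toy simulation of that flexibility point (not the paper’s simulation; all settings invented): under a true null, a researcher who tries k analyses and reports the best one sees “significant” p-values far more often than the nominal 5%.

```python
import random
import statistics

random.seed(0)

nd = statistics.NormalDist()

def one_study(k, n=20):
    """Smallest two-sided p across k independent looks at pure-noise data."""
    ps = []
    for _ in range(k):
        xs = [random.gauss(0, 1) for _ in range(n)]
        # normal approximation to the one-sample test statistic
        z = statistics.mean(xs) / (statistics.stdev(xs) / n ** 0.5)
        ps.append(2 * (1 - nd.cdf(abs(z))))
    return min(ps)

for k in (1, 5, 10):
    rate = sum(one_study(k) < 0.05 for _ in range(2000)) / 2000
    print(f"k={k:2d} analyses: reported p < .05 in {rate:.1%} of null studies")
```

The k = 1 row is the textbook setting the simulated P(H|D)/P(D|H) correlations implicitly assume; the other rows are closer to practice.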

I’d be interested in hearing more about the effective use of z-scores for making real-world and pragmatic decisions. This seems a much more positive and useful focus than being stuck in the mire of an apparently “over-heated debate” (I always thought that meant ‘punch-up’!). The above comment rang a bell in the sense that, in clinical and forensic psychology settings, z-scores are “core stats.” The big example is of course in IQ testing, where scores are transformed z-scores. But z-scores also are used in less obvious areas, such as in risk assessment for forensic populations, where deviation from 0, or change from an initial to a later z-score, is used as collateral evidence of change. However, the often unspoken questions when using these measures are a) whether they have meaning in the first place when comparing a person’s response to a population (the answer comes from careful clinical assessment of other information in the context of the score; convergence of evidence) and b) how one decides upon the meaning of a change in z-score over time/intervention or whatever.
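A minimal sketch of the usual IQ-style transformation mentioned above (norming numbers here are invented): a raw score is standardized against a reference sample, rescaled to mean 100 and SD 15, and the same z also gives a percentile under a normality assumption.

```python
from statistics import NormalDist

# Hypothetical norming figures for some test (assumed, for illustration)
norm_mean, norm_sd = 52.0, 8.0
raw = 64.0  # one examinee's raw score

z = (raw - norm_mean) / norm_sd         # 1.5
iq = 100 + 15 * z                       # 122.5 on the IQ scale
percentile = NormalDist().cdf(z) * 100  # ~93rd percentile, if z ~ normal

print(f"z = {z:.2f}, IQ-scale = {iq:.1f}, percentile = {percentile:.1f}")
```

The percentile step is where the earlier caveats bite: it is only as trustworthy as the normality of the reference distribution, which is the same tail-behavior issue raised at the top of this thread.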