“Using numbers to replace judgment”

Julian Marewski and Lutz Bornmann write:

In science and beyond, numbers are omnipresent when it comes to justifying different kinds of judgments. Which scientific author, hiring committee-member, or advisory board panelist has not been confronted with page-long “publication manuals”, “assessment reports”, “evaluation guidelines”, calling for p-values, citation rates, h-indices, or other statistics in order to motivate judgments about the “quality” of findings, applicants, or institutions? Yet, many of those relying on and calling for statistics do not even seem to understand what information those numbers can actually convey, and what not. Focusing on the uninformed usage of bibliometrics as worrysome outgrowth of the increasing quantification of science and society, we place the abuse of numbers into larger historical contexts and trends. These are characterized by a technology-driven bureaucratization of science, obsessions with control and accountability, and mistrust in human intuitive judgment. The ongoing digital revolution increases those trends. We call for bringing sanity back into scientific judgment exercises.

I agree. Vaguely along the same lines is our recent paper on the fallacy of decontextualized measurement.

This happens a lot: the things that people do specifically to make their work feel more scientific actually pull them away from scientific inquiry.

Another way to put it is that subjective judgment is unavoidable. When Blake McShane and the rest of us were writing our paper on abandoning statistical significance, one potential criticism we had to address was: What’s the alternative? If researchers, journal editors, policymakers, etc., don’t have “statistical significance” to rely on when making their decisions, what can they do? Our response was that decision makers already are using their qualitative judgment to make decisions. PNAS, for example, doesn’t publish every submission that is sent to them with “p less than .05”; no, they still reject most of them on other grounds (perhaps because their claims aren’t dramatic enough). Journals may use statistical significance as a screener, but they still have to make hard decisions based on qualitative judgment. We, and Marewski and Bornmann, are saying that such judgment is necessary, and it can be counterproductive to add a pseudo-objective overlay on top of it.

18 thoughts on ““Using numbers to replace judgment””

  1. Andrew,
    Yes, in political science and international relations research, experts rely on their ‘qualitative’ judgments. I believe that Serge Lang addressed the problem of quantifying, or assigning probabilities to, hypotheses in political science. We seem to fall into dichotomous reasoning, often engaging in false dichotomies. At least this has been my consistent observation. So then how are we to evaluate the qualitative empirically? I have discerned ‘base rate’ neglect in many inferences offered as definitive insights, not to mention violations of basic logic. The other thing, though, is that very good measurement tools can be co-opted as marketing tools. I’m just thinking off the cuff. That’s how I start off.

  2. I agree with much of their diagnosis of the problems with overemphasis on numbers. However, there is much in this paper that disturbs me. Here are a few things:

    1. They suggest the use of heuristics such as the need to have experts in the field review scientific proposals and work. Sounds fine in theory, but we’ve seen that these “experts” have their own systematic biases. A better heuristic might be to have generalists judge the quality of work – if it can’t be explained to a layperson, then perhaps it is not very good.

    2. In any case, the justification for heuristics (which I generally support) requires that they be based on sufficient evidence. I want doctors and pilots to use heuristics – it is the only way they can do their jobs effectively. But we require years of training for these heuristics to work well. If we are talking generally about decision making under uncertainty, then I’m not sure how many people have sufficient experience to be confident about their heuristics. Many business people have “gut feelings” about the right course of action – based upon a very limited set of experiences where they have made such decisions. Their “heuristics” may simply reflect their “biases” or cognitive limitations.

    3. One of the examples the authors give is to “just imagine what would happen if everybody followed the simple rule of thumb to only relay on numbers…and quantitative models… they are able to fully program from scratch themselves, that is, without any help of off-the-shelf software (e.g. SPSS) and without making use of their pre-programmed functions? Likely there would be all five: more reading, more thinking, more informed discussions about contents, less uninformed abuse of numbers, and more experts in bibliometrics and statistics.” Aside from the fact that this would put me out of business, it sounds to me like a wild shot across the bow. They might as well have said, just imagine if we follow the simple rule of thumb that we only rely on numbers if we collected them ourselves. How this creates a productive symbiosis between humans and computers is beyond me. I’m quite content to let machines do things that are tedious for me to do myself – it is the judgement which I am unwilling to give to the machine.

    4. They appear to fail to understand the sociological forces that underlie the problems they cite. The reliance on “objective” numbers (e.g., citations to judge quality, graduation rates to judge colleges, etc.) has much to do with the competition for power of one group over another. The misuse of numbers that results will not be cured by heuristics – the struggle for power is the source of much of the problem.

    5. It is hard to argue with many of their recommendations – more expertise, more training, more toleration of errors and ambiguity, etc. – indeed I support all of these. But I feel like their channeling these into a call for more “heuristics” is off the point. I would advocate that the real need is for humans to retain the ability – and the accountability – to make judgements, whether it is via heuristics, bias, or “objective” methodologies. Human judgement is imperfect – at times, horribly so. One reaction is to take many judgements away from humans and give them to machines (e.g. algorithmic sentencing of criminals, algorithmic diagnosing of disease, etc.). There is considerable evidence (and it is growing) of the superiority of machine judgement over human judgement. But it is all so, dare I say it, inhuman. (The science fiction writers have already anticipated where this is headed.)

    6. A recent interview in the Chronicle of Higher Education with Jill Lepore (https://www.chronicle.com/article/The-Academy-Is-Largely/245080?cid=wcontentgrid_hp_6) contains the following fascinating idea: that facts are the realm of the humanities, numbers the realm of the social sciences, and data the realm of the natural sciences. It is not that one is better than the other, but that these are different ways of knowing. It is important that we have all three. I think this is related to this paper. Here is an example: a soccer team is stuck in a cave in Thailand. The humanist, faced with this fact, might devote the resources (money, time, and a life) to save them. The social scientist (at least the economist) can show how the costs outweigh the benefits of doing so – or, at least, that the resources can be spent elsewhere to save more lives. The data analyst might have an algorithm that includes the probability of success relative to the probabilities of saving lives through other avenues and reach an “unbiased” conclusion about what to do. Are any of these three approaches the “right” one? Don’t we need all three? I’m not sure heuristics have much to do with this – humans do use heuristics, but a computer could also be programmed to use them. What I am looking for is who will make the decision, what will they base it on, and who will be accountable for it?

    You may think I have gotten off track – I almost think so. But here is the paper:

    “It looks as if parts of academia and society suffer from collective anxiety, with various forms of number-crunching having developed into fear-reducing “mental drugs” to cope with the unreducible and unbearable uncertainties surrounding scientific judgment problems. Be it when it comes to evaluating theories, findings, or data, or when it comes to judging people, papers, and institutions, often there simply is no clear answer. But there are ignorant reviewers, hiring committee members, peers, lawyers, politicians and other respect-inspiring authorities who insist on obtaining a clear, “objective”, justifiable answer.”

    “We agree with the view that numbers aid science and society to progress. Yet, we also think that one should, perhaps, once in a while, walk the staircase of scientific and societal developments backward, in quest of what seems to have gotten lost on the way up: good human judgment.”

    I don’t disagree with any of this – indeed, it is well put. But I’m not sure they have really diagnosed the problem or provided a way forward. Relying on heuristics seems somewhat ill-suited to the task to me.

    I’m sorry to have occupied so much space here, but these are things I’ve been thinking about a lot lately. If nothing else, I may either save you the time of reading the paper or inspire you to do so.

    • Thanks, Dale, I find your thoughts very valuable!

      This trend reminded me immediately of the constricting ideology of political correctness and identity politics. It seems we as a society can no longer tolerate the unsanitized richness of human thought and behavior and instead grasp for clear-cut criteria like gender and race and an ever narrower set of virtues to determine what to think and do. It’s a desperate need for guidance and orientation that parallels the increasing intolerance of uncertainty in science. It’s also a refusal to take responsibility. What has driven us there?

  3. Very nice commentary, Dale. When you [and others] use the term ‘generalist’, I push back against it as a contrast with, say, a specialist, or expert vs. non-expert. The other popular comparison is Isaiah Berlin’s foxes vs. hedgehogs.

    RATHER, I view the potential for science as a matter of fluid and crystallized intelligence. Some just have more of one or both intelligences. As a result, some people will just be simply better diagnosticians, whether they are credentialed in one or more fields or not. That is what reading biographies of scientists indicates to me. And I believe that there are some individuals who have more wisdom as a consequence of their learning disposition & temperament. It’s a matter of whether they have influence.

    Moreover, the very categorization of educational fields and associated terminology may be an obstacle epistemically. I believe Gigerenzer has elaborated on this point cogently.

  4. Strongly agree!

    This becomes even more obvious when approaching these analyses from a statistical-decision-theory standpoint. You need to justify your sample size by considering what the value of the information will be – and that requires a “subjective” prior distribution, as well as a “subjective” value of the decisions and outcomes. This can be helpful: when looking for a basis for the value of outcomes, you need to explicitly consider very subjective factors like the value of a life, the value of validating a research hypothesis, or the value of justifying further investigation. (A rough numerical sketch of this kind of calculation follows after this comment.)

    These quantities are implicit in current research, but rarely considered clearly. Thinking about the subjective values implied can and should also make decision-making clearer – What type of change in which clinical endpoint would justify the costs of this study? How much evidence would actually be enough to change anyone’s mind? Would this observational study design actually do anything to convince people to follow up with an RCT? Would this observational study be academically useful enough to justify bothering? Etc.

    When Nature says the “claims aren’t dramatic enough,” this is something that should have been considered before trying to do the study – and by failing to note the implicit subjectivity, and think through it, researchers waste time and effort doing low-value and too-often meaningless work. And when their work could be meaningful, they don’t appreciate it, or don’t do the things that would let it be useful, because they never thought about what it leads to or what decisions it supports.
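To make the value-of-information framing in the comment above concrete, here is a minimal Monte Carlo sketch. The prior, the outcome-noise level, the dollar values, and the adopt/don’t-adopt decision rule are all hypothetical placeholders, not anything taken from the paper or the comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "subjective" inputs of the kind the comment says are unavoidable:
prior_mean, prior_sd = 0.0, 1.0      # prior on the true treatment effect
sigma = 4.0                          # known outcome noise per observation
value_per_unit_effect = 10_000.0     # value of adopting, per unit of true effect
cost_per_subject = 50.0              # marginal cost of one more observation

def expected_net_gain(n, n_sims=20_000):
    """Monte Carlo estimate of E[payoff of deciding after n observations] minus data cost."""
    theta = rng.normal(prior_mean, prior_sd, n_sims)      # true effects drawn from the prior
    ybar = rng.normal(theta, sigma / np.sqrt(n))          # sample mean for each simulated study
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)   # normal-normal conjugate update
    post_mean = post_var * (prior_mean / prior_sd**2 + n * ybar / sigma**2)
    adopt = post_mean > 0                                 # decision rule: adopt if posterior mean > 0
    payoff = np.where(adopt, value_per_unit_effect * theta, 0.0)
    return payoff.mean() - cost_per_subject * n

# Larger n buys better decisions but costs more; the balance depends on the subjective inputs.
for n in (10, 50, 200, 1000):
    print(n, round(expected_net_gain(n), 1))
```

The “best” sample size here depends entirely on the subjective inputs, which is the comment’s point: the subjectivity is present whether or not it is made explicit.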

  5. A bit more empiricism would not have hurt – it’s not like there isn’t ample research on “good human judgment” in hiring processes. Those scare quotes are well-deserved.
    At the end of the paper, the authors pretty much walk back their argument and simply argue for different metrics and smart use of such metrics.
    That, I can get behind. I don’t know if the rant about numbers was necessary. Present metrics are poor measures; let’s improve them.
    If anyone is interested in a few more concrete suggestions and questions to ask of new metrics, I wrote this http://www.the100.ci/2018/06/26/the-tyranny-of-metrics-in-science-evaluation/

  6. I’ve got somewhat mixed feelings about Marewski and Bornmann’s article. I agree with many of their points (e.g., “many of those relying on and calling for statistics do not even seem to understand what information those numbers can actually convey, and what not.”) But I winced at their listing “mistrust in human intuitive judgment” as part of the reason for the problem.

    So I looked up the paper, which somewhat (but not entirely) lessened my concern. In particular, items 3 and 4 on p. 5 say,

    “3. The advance of numbers as substitute for judgment is propelled by rather old ideals, including those of “rational, analytic decision making”, dating back, at least, to the Enlightenment. The past century has seen a new twist of those ideals, with much work in the decision sciences documenting how human judgment deviates from alleged gold standards for rationality, this way trying to establish intuition’s flawed nature (e.g., Kahnemann, Slovic, & Tversky, 1982). While the accent on human irrationality has further fueled the advance of “objectifiers” – including measurement and numbers – it has also brought harm, namely by likely contributing to the blind – and often even mindless – routine use of seemingly “objective” indicators.”

    “4. We point towards antidotes against the harmful side effects of the increasing quantification of science evaluation: while the mindless use of indicators for seemingly “objective” evaluation is nowadays prevalent in virtually all branches of science, within the decision sciences themselves, a novel view of how humans ought to make good judgments under uncertainty has gained impetus (Gigerenzer & Gaissmaier, 2011; Gigerenzer, Todd, & ABC Research Group, 1999). In line with that view, we point out how the mindless use of numbers can be overcome if science evaluators and consumers of science evaluations learn to rely on a repertoire of simple decision strategies, called heuristics. Being statistically literate aids all three: (i) to know when to use which heuristic, (ii) to understand when good human judgment ought to be trusted even when numbers speak against that judgment, and (iii) to realize why good human judgment and intuition is, at the end of the day, what matters, and that even when there is no number attached to it.”

    This does sound better than what is in the abstract – especially the phrase “good human judgment,” although I wish that “good” had been emphasized here as well as just included. Also, there is a danger that teaching “heuristics” can be reduced to a formulaic approach. (And I somewhat question calling the use of heuristics in decision science “novel” – since the use of heuristics in problem solving goes back at least to Polya’s work.)

    • The sections you pulled out make me think their work is just a re-phrasing of the Gigerenzer/Kahneman positions. I’ve never fully understood the relationship between those two bodies of work. They seem to be at odds with each other (professional squabbles?) but I’ve always gotten a lot out of both of them and thought they complement each other. The problem I have with heuristics is the same as the reaction I had to Gladwell’s Blink. Heuristics are critical – but they only work under certain circumstances. When those circumstances do not apply (limited experience; situations that don’t match our evolutionary traits), then heuristics seem to become an excuse for sloppy analysis. Why analyze if my heuristic works better? On the other hand, heuristics derived from analysis, experience, and practice can be beautiful.

      • The coolest thing I have learned — think I have learned?? — from Gigerenzer is that in real-world decision contexts, “tallying” works better than regression. Just take the top 3 or 5 or 10 or whatever factors your research suggests should matter to your decision, and weight them equally; otherwise you will tend to over-fit.

        • This advice seems strange – and wrong – to me. I can think of plenty of analyses I have done where the top 3 or 5 or whatever factors should not be weighted equally. Some things are just more important to decisions than other things, and that is a major reason why analysis matters. Equal weighting sounds like a blueprint for a form of what Kahneman calls the base rate fallacy – where the base rate of a rare event is overlooked and people concentrate on things like the sensitivity of a test (not an exact metaphor, but the similarity is that equal weighting is an invitation to think things matter equally) – hmm, makes me wonder if there is something more to the Gigerenzer/Kahneman controversy after all.

          Anyway, I think this is a good example of my issue with heuristics generally. Overfitting is a real problem and Gigerenzer’s tallying may well protect against it. But at what cost? What are the other options? This is a heuristic that might work well – but only under particular circumstances, and I don’t think it is good general advice.

        • It may have been related to the Dawes paper, I don’t recall. I do recall the moniker “tallying” for the strategy. I overstated the case slightly — it’s certainly not a dominant strategy in all situations — but the Gigerenzer paper cited several real-world examples where tallying worked better, with of course lower search costs, than a model with the same factors, weighted, would have. I dimly recall that the tallying heuristic worked because the underlying data was so noisy that the weights were nigh impossible to get “right.”
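Following up the tallying-vs-regression exchange above, here is a minimal simulation sketch. The number of predictors, the (deliberately unequal) true weights, the noise level, and the tiny training sample are all made-up assumptions; which strategy wins shifts as you vary them, which is essentially what the disagreement above is about.

```python
import numpy as np

rng = np.random.default_rng(1)
n_predictors, n_train, n_test, n_reps = 5, 20, 1000, 500
true_beta = np.array([1.0, 0.8, 0.6, 0.4, 0.2])   # unequal true importances, all positive
noise_sd = 3.0                                    # very noisy outcome

def out_of_sample_r():
    """One replication: fit on a small training set, evaluate both strategies on fresh data."""
    Xtr = rng.normal(size=(n_train, n_predictors))
    Xte = rng.normal(size=(n_test, n_predictors))
    ytr = Xtr @ true_beta + rng.normal(scale=noise_sd, size=n_train)
    yte = Xte @ true_beta + rng.normal(scale=noise_sd, size=n_test)
    beta_hat = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]    # estimated regression weights
    r_regression = np.corrcoef(Xte @ beta_hat, yte)[0, 1]  # prediction with fitted weights
    r_tallying = np.corrcoef(Xte.sum(axis=1), yte)[0, 1]   # equal (unit) weights: "tallying"
    return r_regression, r_tallying

results = np.array([out_of_sample_r() for _ in range(n_reps)])
print("mean out-of-sample r, regression:", results[:, 0].mean().round(3))
print("mean out-of-sample r, tallying:  ", results[:, 1].mean().round(3))
```

With settings like these, the regression weights are estimated from only 20 noisy observations, so equal weighting gives up little and can come out ahead; with a larger training set or less noise, the fitted weights pull ahead.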

  7. Have been a huge fan of your work, and blog, for years – posting now because I had this post sent to me by four separate people, all of whom said “Hey, isn’t this kind of what you wrote a book about?” And yes, it sort of is. In Navigation by Judgment (Amazon: https://amzn.to/2A5CRUK) I argue that foreign aid agencies fail precisely when (often for legitimacy-seeking reasons, i.e. demonstrating what they’ve done to politicians) they seek to guide their efforts by numbers rather than judgment, and I bring both quant and qual data to bear to attempt to illustrate that that’s the case.

    A recent (ungated) Stanford Social Innovation Review piece which gives an overview of the book’s argument is here, if it’s of interest to anyone who reads this post: https://ssir.org/articles/entry/the_power_of_letting_go In my view, this problem is likely to get worse – the reductive seduction of numbers will creep in all the faster as we increasingly turn to seemingly (falsely, in my view) ‘objective’ algorithms, machine learning, etc. My view is that data is very frequently an excellent input into decision processes and the exercise of judgment, but very infrequently a superior substitute for human judgment.

    • When there is a factual answer to a question, humans will never be anywhere near as good as computers, but when there are value issues, computers will not replace humans. However, humans are terrible at value transparency; we play political games far too well for our own good in a world where we could be applying values to hugely powerful data-based factual prediction rules. Except that to do that requires programming values for use by computers, and making explicit what humans have been keeping secret for millennia: their real motivations.

      • It is probably sometimes the case that humans have been keeping their real motivations secret, but I suspect that often it is a matter of not really understanding what they are doing — which I often include under the umbrella phrase “clueless that they are clueless”.

  8. I’m not sure that the numbers/judgment dichotomy is the best way to look at this issue.

    Perhaps it is more subtle to think of the scope of the data used in the analysis. After all, isn’t judgment another term for “other data collected over a lifetime of observation”? This data is a disorganized mass collected into an often-conflicting set of often poorly-articulated heuristics. It can’t be easily summarized, so we don’t think of it as data, but that is what it is at heart.

    A “narrow” or “numbers” analysis looks only at the data collected in a particular study and tries to reduce the data to a few summary statistics such as p-values. A “broad” or “judgment-inclusive” analysis incorporates all the other data collected and processed over a lifetime.

    This seems very Bayesian. “Judgment” is our priors.
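Here is a tiny numerical sketch of that “judgment is our priors” framing. The estimate, standard error, and skeptical prior below are made-up numbers, used only to show how the same data can look much less decisive once background judgment enters as a prior.

```python
import numpy as np

# "Narrow" analysis: the study alone reports an estimate of 2.5 with standard error 1.0,
# comfortably "statistically significant" on its own.
est, se = 2.5, 1.0

# "Broad" analysis: background judgment, encoded here as a skeptical prior centered at zero.
prior_mean, prior_sd = 0.0, 1.0

# Normal-normal conjugate update: a precision-weighted average of prior and data.
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + est / se**2)
post_sd = np.sqrt(post_var)

print(f"posterior mean {post_mean:.2f}, sd {post_sd:.2f}")   # about 1.25 +/- 0.71
```

Nothing about the study changed between the two readings; only the amount of outside information brought to bear on it did, which is what the “broad” analysis asks us to acknowledge.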
