A Bayesian approach for peer-review panels? and a speculation about Bruno Frey

Daniel Sgroi and Andrew Oswald write:

Many governments wish to assess the quality of their universities. A prominent example is the UK’s new Research Excellence Framework (REF) 2014. In the REF, peer-review panels will be provided with information on publications and citations. This paper suggests a way in which panels could choose the weights to attach to these two indicators. The analysis draws in an intuitive way on the concept of Bayesian updating (where citations gradually reveal information about the initially imperfectly-observed importance of the research). Our study should not be interpreted as the argument that only mechanistic measures ought to be used in a REF.

I agree that, if you’re going to choose a weighted average, it makes sense to think about where the weights are coming from. Some aspects of Sgroi and Oswald’s proposal remind me of the old idea of evaluating journal articles by expected total number of citations. The idea is that you’d use four pieces of information about an article: its field of study, the impact factor of the journal where it was published, its total number of citations so far, and the number of years since the paper appeared. If the paper is old enough, you can just take its total number of citations and map that onto some asymptoting curve to get a predicted total. If the paper has just come out, the journal’s citation index provides some information. The point here is not that citations are everything, but rather that, to the extent that you care about impact, it makes sense to count citations in a context that adjusts for how long the paper has been sitting out there.
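Here is a minimal sketch of that sort of calculation, just to make the bookkeeping concrete. The saturating curve, the field half-life, and the journal-based prior are all placeholder assumptions of mine, not anything taken from Sgroi and Oswald or from the citation-prediction literature:

```python
# Toy "expected total citations" calculation. The functional form and all
# constants are illustrative assumptions, not estimates from real data.

def predicted_total_citations(cites_so_far, years_out, journal_impact_factor,
                              field_half_life=5.0, prior_multiplier=10.0,
                              prior_weight_years=2.0):
    """Blend a journal-based prior with the citation trajectory observed so far."""
    # Assumed saturation curve: a paper of age t has accumulated a fraction
    # 1 - 0.5 ** (t / field_half_life) of its eventual lifetime citations.
    frac_observed = 1.0 - 0.5 ** (years_out / field_half_life)

    # Prior estimate from the journal alone (a pure placeholder).
    prior_estimate = prior_multiplier * journal_impact_factor

    if frac_observed <= 0.0:  # brand-new paper: only the prior is available
        return prior_estimate

    # Estimate implied by the citations accumulated so far.
    trajectory_estimate = cites_so_far / frac_observed

    # Trust the observed trajectory more as the paper ages.
    w = years_out / (years_out + prior_weight_years)
    return w * trajectory_estimate + (1.0 - w) * prior_estimate

# A three-year-old paper with 30 citations in a journal with impact factor 4:
print(round(predicted_total_citations(30, 3, 4.0), 1))
```

For an old enough paper the weight on the journal prior shrinks toward zero, which matches the idea above that at that point you can just read the predicted total off the curve.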

I haven’t read Sgroi and Oswald’s paper in detail, but I will comment on its general approach. Here’s how they put it:

Our later proposal boils down to a rather intuitive idea. It is that of using citations gradually to update an initial estimate (the Prior) of a journal article’s quality to form instead a considered, more informed estimate (the Posterior) of its quality.

So far, so good. They continue:

The intention of the REF is to seek to determine whether a paper is “work that makes an outstanding contribution to setting agendas for work in the field or has the potential to do so” which would merit a “four star” rating or of some lesser impact (advancing or contributing to a field). The simplest possible way to model this intention is to think in terms of a simple binary partitioning of the state space. Essentially, either a paper submitted to the REF is considered to be making an outstanding contribution or not. On that basis we can specify the state space to be ω ∈ Ω = {a, b} where “a” is taken to mean “making an outstanding contribution to setting agendas” and “b” is taken to mean “not doing so”. Following the more general theoretical literature on testing and evaluation, such as Gill and Sgroi (2008, 2012), a Bayesian model in this context is simply one that produces a posterior probability, p_i, that a given submitted paper indexed by i is of type a rather than type b.
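Mechanically, that sort of binary updating could look like the sketch below. The Poisson citation model and the rate values are placeholders of mine for illustration; the paper does not commit to this particular likelihood.

```python
# Toy version of a binary-type posterior update: P(type a | citation count).
# The Poisson likelihood and the rates are illustrative assumptions only.
from math import exp, factorial

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

def posterior_type_a(prior_a, citations, rate_a=20.0, rate_b=5.0):
    """Posterior probability that a paper is type a, under assumed citation rates."""
    like_a = poisson_pmf(citations, rate_a)
    like_b = poisson_pmf(citations, rate_b)
    return prior_a * like_a / (prior_a * like_a + (1 - prior_a) * like_b)

# Starting from a 10% prior that a paper is "outstanding," 15 citations
# push the posterior well above one half:
print(round(posterior_type_a(prior_a=0.1, citations=15), 3))
```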

I respect that they are trying to match the goals of the assessment exercise, but I still don’t like this sort of discrete model. To the extent that citations are measuring quality or impact, I think that the underlying quantity (as well as its measures) has to be continuous. It doesn’t make sense to me to characterize papers as type a or type b. If, at the end of the day, you want a binary measure, I’d define it based on the underlying continuous quantity.

So I am sympathetic to the general ideas of this paper, but the particular approach they use doesn’t seem quite right to me. Or, perhaps I should say, the particular approach they use doesn’t seem quite right to me, but I am sympathetic to the general ideas of this paper.

P.S. I was reading through the paper and then something jumped out at me:

Rightly or wrongly — see criticisms such as in Williams (1998), Osterloh and Frey (2009) and Frey and Osterloh (2011) — the United Kingdom has been a leader in the world in formal ways to measure the research performance of universities.

“Frey” . . . I think I’ve heard that name before! It would take a lot of chutzpah for Frey to criticize formal ways of measuring research performance. On the other hand, he knows from the inside how fragile these measures can be.

I was curious so I googled one of the articles and found this, by Bruno Frey and Margit Osterloh, which begins as follows:

Research rankings based on publications and citations today dominate governance of academia. Yet they have unintended side effects on individual scholars and academic institutions and can be counterproductive. They induce a substitution of the “taste for science” by a “taste for publication”.

I think he meant to say, “a taste for replication.” In any case, the above paragraph is consistent with my theory that, in his self-plagiarism, Frey felt he was playing a distasteful game that he had to play in order to keep up with everyone else. Just as, surely, some bike racers are doping with regret, only because they have to. Or just as some manufacturers might prefer to pollute less but feel they cannot afford to, in an environment where anti-pollution rules are rarely enforced. From this perspective, Frey was being consistent, not hypocritical, in criticizing a count-the-publications system under which he personally benefited.

24 thoughts on “A Bayesian approach for peer-review panels? and a speculation about Bruno Frey”

  1. As an aside, one (perhaps small) flaw that citation-based systems suffer from is that all cites are considered positive cites.

    Say someone does something egregiously bad, and everyone points to him saying, here’s how not to do it. It still doesn’t matter: he’ll earn a great citation metric.

    In some sense it systemically and subtly incentivizes reckless sensationalism: make sure people notice, one way or another; whether the attention is praise or criticism doesn’t matter.

    • You can use natural language processing to estimate whether a citation is positive or negative. People do this in the analysis of social networks, for instance. I can’t remember the citation, but I remember seeing this done for political blogging. Blogs would cite sources they hated and then rant about them, and you didn’t want to make the cited sources part of the “friends” network for the blog.
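      A toy version of the idea, just to show its shape; the word lists below are made up, and real systems use trained language models rather than keyword matching:

```python
# Crude keyword-based polarity score for the sentence around a citation.
# Purely illustrative; actual work in this area uses trained NLP classifiers.
POSITIVE = {"seminal", "elegant", "convincing", "confirms", "extends"}
NEGATIVE = {"flawed", "fails", "refutes", "contradicts", "questionable"}

def citation_polarity(context_sentence):
    words = set(context_sentence.lower().replace(".", " ").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(citation_polarity("This analysis is deeply flawed and fails to replicate (Smith 2010)."))
```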

  2. I think this underscores the always serious challenge of deciding what is _the data_ that one needs to condition on.

    Working out what data one would want to condition on, if one could get it, would be the larger part of answering that challenge, but even with what is at hand, it’s often not at all clear.

  3. Can someone elaborate on reasons why self-plagiarism should be considered fundamentally wrong or unethical?

    How does repeating what I myself said earlier hurt anyone, and why is it a problem only when I do it without advertising that I said it earlier? If anything, I blame stupid rating metrics.

    • Rahul:

      You could ask Frey directly, as he described his self-plagiarism as “deplorable” (see link above). If nothing else, it’s a waste of a lot of people’s time.

        • But it seems to me that this is more or less him continuing to “play the game”. I consider self-plagiarism basically a stupid creation of a system that likes to make rules for playing the game and uses citation metrics to dole out economic benefits such as tenure, salary, grants, etc. The only thing “wrong” with it is that it makes it appear from these metrics that the author has higher output than he really does.

        • The only people whose time he was “wasting” were those willing to read his paper twice hoping it had some additional, different content. I suspect most people could figure out before finishing the abstract that they’d read it before in another context, so … yeah, basically it’s just games. For those who only saw it in one context, he was basically “doing a service” by publicizing it more widely. I realize “it’s the rules,” but I don’t think this rule is designed around anything fundamental to ethics; it’s a hack to get around people gaming the broken system of tenure/grants/etc.

        • Absolutely. I think this is why books don’t “count” — they’re supposed to be rehashes of ideas that have already had their LPUs metered by journals.

          The fashion in EE (at least in speech recognition a decade ago when I hung out there) was to publish the same paper every year with updated results.

          Why look at papers as static products? I think the right solution to this problem is to let authors update papers.

          I suppose one benefit is that static papers allow stable cross-referencing. But who ever references papers by section, much less by page?

        • Dan:

          As far as scholarly rule-breaking goes, I don’t think self-plagiarism is the worst thing out there. But I do think it is an ethical violation. The problem is not with Frey (or whoever) repeating material, the problem is with him obscuring the source. If it’s really just ok to repeat material from other papers, he should cite those other papers clearly and explain that he’s doing it. That’s what I do here, for example. Why is Frey repeating material but not saying where it came from? Because, I think, he’s trying to present it as new and original research, and that’s how the journal is then implicitly representing it to its readers. If the journal wants to run reprint articles from other journals, that’s fine with me—but then they should be represented as such.

        • Obscuring the source intentionally perhaps raises more ethical issues than simply not citing himself would. But overall I think the real issue is economic benefits such as grants being handed out by citation-type metrics, and I think the existence of such a broken system is an ethical violation far larger than this one. We take tax money from individuals and then hand it out in this incredibly broken way. It needs serious fixing, with the aim of maximizing the benefit *to the public* who pay for it all. The beginning of such a conversation is to figure out a variety of ways to measure public benefit and to decide what timescales are appropriate for such measurements (i.e., discount rates for public-benefit calculations, etc.). Any such fixes need to acknowledge that we DON’T KNOW what the best projects to pursue are, and that expert opinion on such things is not that informative.

          Fixing the “gaming the system” problem at the right level is what’s needed, not bolting on more technical plagiarism rules to the game.

        • In particular, I like systems which use randomized assignment within a pool of grants that meet some minimal bar for competency. Perhaps with a mild tilt due to expert opinion. Let’s waste a lot less time on grant reviews in which some supposedly “objective” expert score is the only thing allowed to both qualify the grant for fundability, AND ration the grant money.

          In a system in which the noise in the 3rd digit of the average score over 5 or so categories determines whether you get grants or not, having a couple more papers in the citation metric can bias that noise a little higher and significantly affect whether you hit the hurdle. In a system in which you qualify as “fundable” if you have sufficient basic education, your grant is on a topic that is considered part of the agency’s mission, and you have preliminary data or other indicators to show that the research can be accomplished… and then all the fundable grants are put in a pool and the appropriate number chosen at random, clearly there is no incentive to publish crap papers. I’m not saying this is the only way to go, but we need to start thinking along these lines, designing the game to achieve the goal of *public benefit* with public money.

        • I think of the randomized approach as somewhat nihilistic.

          What’s really needed is better expert panels, more outsider inputs (to prevent gaming by cliques), and in general more leeway for an expert’s judgement and lesser adherence to easily-game-able metrics e.g. less emphasis on sheer number of publications or impact factors.

        • “clearly there is no incentive to publish crap papers”

          OTOH, more incentive to submit crap proposals and perhaps multiple smaller ones.

        • Rahul: I’m replying here because of reply-depth limitations.

          Yes, your point is well taken: applying with smaller, lower-quality grants, or gaming the system more generally, is still an issue. We need to create rules that encourage all the types of behaviors we want (high-quality grant applications in lots of likely-to-be-useful directions, and very little crap). However, note that applying with a lot of small, low-quality grants is actually the strategy currently advocated by several senior scientists I’ve talked to anyway. The current system is largely random, since it relies on the roundoff errors in the ratings of a random selection of “experts,” and at NIH at least you can’t re-apply more than once, so if you do a “high quality” pre-study and are killed by two rounds of random shuffle, you’ve basically killed your career, having used up all your startup funds on a great idea that can’t be resubmitted due to the rules.

          In my own view we most likely need all of: expert review, randomization, and limits on the rate at which grants can be submitted. The purely random approach is somewhat nihilistic; it says “experts add no value.” I disagree with that: experts add SOME value. But it’s also wrong to think that experts always know which areas of research are fruitful and which are not, or that cliques and the like don’t enter into decision making. We’re smart people here, with knowledge of probability and statistics; I’m sure we can come up with a decent way to combine expert judgment and random choice. And if we can do it, so could other smart people who actually could affect policy.

          For example, we could get ratings from one expert panel that is blind to the authors of the grant and one expert panel that is not. We could then make the probability of selection proportional to 1 + (EB + EN)/2, where EB and EN are the blind and non-blind ratings rescaled to a 0-1 scale (where 0 is the lowest-scoring grant in the panel and 1 is the highest); a toy version of this lottery is sketched at the end of this comment. In this scenario half of the information is expert and half random. Then we could allow grants to be resubmitted as many times as you like, but limit the application rate for any given lab to 3 applications per year. So you want to put in your best ideas, because you can’t just put in “a lot” of crappy grants and hope for a few lucky breaks.

          My biggest point is that doing research is a lot like trying to find optima of complex high dimensional nondifferentiable landscapes. Adding randomness can get you searching in fruitful directions where a purely deterministic approach sinks you into a local optimum that traps you in place with research cliques keeping you there.
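          Here is that toy lottery, just to make the mechanism concrete. Each fundable grant gets selection weight 1 + (EB + EN)/2; the scores, grant names, and number of awards below are made up for illustration.

```python
# Weighted lottery over fundable grants: weight = 1 + (EB + EN)/2, where
# EB and EN are blind and non-blind panel scores rescaled to [0, 1].
# All numbers are invented; this is only a sketch of the mechanism.
import random

def rescale(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.5 for s in scores]

def run_lottery(grant_ids, blind_scores, nonblind_scores, n_awards, seed=1):
    eb, en = rescale(blind_scores), rescale(nonblind_scores)
    pool = {g: 1 + (b + n) / 2 for g, b, n in zip(grant_ids, eb, en)}
    rng = random.Random(seed)
    awarded = []
    for _ in range(min(n_awards, len(pool))):
        ids, weights = zip(*pool.items())
        pick = rng.choices(ids, weights=weights, k=1)[0]  # weighted draw
        awarded.append(pick)
        del pool[pick]                                    # without replacement
    return awarded

print(run_lottery(["g1", "g2", "g3", "g4"], [3, 7, 5, 9], [4, 6, 8, 5], n_awards=2))
```

          Since the weights run from 1 to 2, half of each grant’s chance is flat and half comes from the panel scores, so the experts matter but cannot fully ration the money.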

  4. I can recall the guts of this discussion being had for at least 30 years in the context of tenure. Subjective judgments of quality are hard to make, especially for people outside your field. Different fields have completely different expectations on factors like paper length, number of publications, and authorship policy.

    For example, economists are always annoyed by how the liberal authorship rules of biomedicine lead to longer CVs. But adopting an economics-style view of authorship would poison biomedicine’s current collaborative structure. On the other hand, it is grossly unfair to compare an econ CV to a biomedical CV and ask the economist, “Where are your papers?”

    Finally, since the outcome of these processes can be legally actionable, there is a great wish for an objective metric that can be used to explain why candidate X was promoted and candidate Y was not. The problem with any objective measure is that it can be gamed. The higher the stakes you attach to the measure, the more likely this is to happen. We see the same thing with high-stakes tests of students that affect teacher retention and pay: if you make the costs of a bad test score high enough, people will adapt to doing their best to maximize that score. Even if all of the steps they take are ethical, they may well be non-optimal.

    • The occasions where an economist’s CV gets pitted against a biomedical CV must be quite limited. So to some extent, every field is free to have its own conventions.

      I don’t agree with your “legally actionable” argument. Decisions about industrial jobs are legally actionable too, but has industry gravitated toward some such easy metric?

      • Maybe not, but you will notice that legal action has made a lot of industry players refuse to give detailed reference letters. Instead they confirm salaries and dates of employment (my old industry job had that rule, for example). They also used standardized tests to pre-screen candidates, at least partially to make the process more “objective”.

        My view into the tenure process was at a small university, so there was a lot of cross-comparison. But that might not be a common experience.

        • It’s been about a decade, so practices may have changed, but this is what HR insisted at the time as to why the test scores had to be given and could never be challenged. Of course, the problem with anecdotes is that this policy might have been a fairly large outlier, especially given the lax regulatory environment that finance is subjected to.

      • On the other hand:

        “What’s really needed is better expert panels, more outsider inputs (to prevent gaming by cliques), and in general more leeway for an expert’s judgement and lesser adherence to easily-game-able metrics e.g. less emphasis on sheer number of publications or impact factors.”

        With this we are in complete agreement. You can still make a process fair and transparent but have it involve complex thinking. Simple metrics are easy to defend, but they lose a lot of information.
