Groundhog day in August?

A colleague writes:

Due to my similar interest in plagiarism, I went to The Human Cultural and Social Landscape session, which was modestly attended (20 or so people). [The recipient of the American Statistical Association’s Founders Award in 2002] gave the first talk in the session instead of Yasmin Said; it was a sociology talk with no numbers — and no attribution to where these ideas (on Afghanistan culture) came from. Would it really have hurt to give the source? I’m on board with plain laziness for this one.

I think he may have mentioned a number of his collaborators at the beginning, and all he talked about were cultural customs and backgrounds, no science to speak of.

It’s kind of amazing to me that he actually showed up at JSM, but of course if he had any shame, he wouldn’t have repeatedly copied without proper attribution in the first place. It’s not even like Doris Kearns Goodwin, who reportedly produced a well-written book out of it!

n = 2

People in Chicago are nice. The conductor on the train came by and I asked if I could buy a ticket right there. He said yes, $2.50. While I was getting the money he asked if the ticket machine at the station had been broken. I said, I don’t know, I saw the train and ran up the stairs to catch it. He said, that’s not what you’re supposed to say. So I said, that’s right, the machine was broken.

It’s just like on that radio show where Peter Sagal hems and haws to clue the contestant in that his guess is wrong so he can try again.

Google Refine

Tools worth knowing about:

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly 49760×951 is bigger than I’d really like to handle in the browser with JavaScript, but it would work on a subset). [I might write this example up later.]

Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; and that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)!
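
As a rough illustration of the kind of cleanup Refine automates, here is a minimal sketch in Python/pandas (not Refine itself; the column names and values are made up) that trims stray whitespace, normalizes lettercase, strips thousands separators, and then counts the categories that remain:

    import pandas as pd

    # Hypothetical messy data of the sort Refine is built to clean up.
    df = pd.DataFrame({
        "country": [" United States", "united states", "United States ", "Canada", "CANADA"],
        "income":  ["50,000", "50000", "1.2e5", "65000", "65,000"],
    })

    # Leading/trailing spaces and mixed lettercase create false duplicates.
    df["country_clean"] = df["country"].str.strip().str.title()

    # Strip thousands separators before converting to numbers.
    df["income_clean"] = pd.to_numeric(df["income"].str.replace(",", "", regex=False))

    # Refine-style facet: count how many rows fall into each cleaned-up category.
    print(df["country_clean"].value_counts())
    print(df[["country", "country_clean", "income", "income_clean"]])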

Refine 2.0 adds some data-collection tools for scraping and parsing web data. I have not had a chance to play with any of this kind of advanced scripting with it yet. I also have not had occasion to use Freebase which seems sort of similar (in that it is mostly open data with web APIs) to infochimps (for more on this, see the infochimps R package by Drew Conway).

Another day, another stats postdoc

This post is from Phil Price. I work in the Environmental Energy Technologies Division at Lawrence Berkeley National Laboratory, and I am looking for a postdoc who knows substantially more than I do about time-series modeling; in practice this probably means someone whose dissertation work involved that sort of thing. The work involves developing models to predict and/or forecast the time-dependent energy use in buildings, given historical data and some covariates such as outdoor temperature. Simple regression approaches (e.g. using time-of-week indicator variables, plus outdoor temperature) work fine for a lot of things, but we still have a variety of problems. To give one example, sometimes building behavior changes — due to retrofits, or a change in occupant behavior — so that a single model won’t fit well over a long time period. We want to recognize these changes automatically. We have many other issues besides: heteroskedasticity, need for good uncertainty estimates, ability to partially pool information from different buildings, and so on. Some knowledge of engineering, physics, or related fields would be a plus, but really I just need someone who knows about ARIMA and ARCH and all that jazz and is willing to learn the rest. If you’re interested, apply through the LBNL website.
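
As a rough sketch of the simple baseline Phil describes (mine, not LBNL’s code; the data are simulated and the variable names invented), here is a regression of hourly building load on time-of-week indicator variables plus outdoor temperature, fit by ordinary least squares:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 24 * 7 * 8                        # eight weeks of hypothetical hourly data
    hour_of_week = np.arange(n) % 168     # one bin per hour of the week
    temp = 15 + 10 * np.sin(2 * np.pi * np.arange(n) / 24) + rng.normal(0, 2, n)
    occupied = (hour_of_week % 24 >= 8) & (hour_of_week % 24 < 18) & (hour_of_week < 120)
    load = 100 + 2.5 * temp + 20 * occupied + rng.normal(0, 5, n)

    # Design matrix: intercept, outdoor temperature, and time-of-week indicators
    # (drop one indicator to avoid collinearity with the intercept).
    tow = pd.get_dummies(hour_of_week).to_numpy(dtype=float)
    X = np.column_stack([np.ones(n), temp, tow[:, 1:]])
    beta, *_ = np.linalg.lstsq(X, load, rcond=None)

    fitted = X @ beta
    print("temperature coefficient:", beta[1])
    print("residual sd:", (load - fitted).std())

The interesting problems start where this baseline breaks down: change points from retrofits or new occupant behavior, heteroskedasticity, honest uncertainty estimates, and partial pooling across buildings.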

Reproducibility in Practice

In light of the recent article about drug-target research and replication (Andrew blogged it here) and l’affaire Potti, I have mentioned the “Forensic Bioinformatics” paper (Baggerly & Coombes 2009) to several colleagues in passing this week. I have concluded that it has not gotten the attention it deserves, though it has been discussed on this blog before too.

[Figure 1 from Baggerly & Coombes 2009]

Type M errors in the lab

Jeff points us to this news article by Asher Mullard:

Bayer halts nearly two-thirds of its target-validation projects because in-house experimental findings fail to match up with published literature claims, finds a first-of-a-kind analysis on data irreproducibility.

An unspoken industry rule alleges that at least 50% of published studies from academic laboratories cannot be repeated in an industrial setting, wrote venture capitalist Bruce Booth in a recent blog post. A first-of-a-kind analysis of Bayer’s internal efforts to validate ‘new drug target’ claims now not only supports this view but suggests that 50% may be an underestimate; the company’s in-house experimental data do not match literature claims in 65% of target-validation projects, leading to project discontinuation. . . .

Khusru Asadullah, Head of Target Discovery at Bayer, and his colleagues looked back at 67 target-validation projects, covering the majority of Bayer’s work in oncology, women’s health and cardiovascular medicine over the past 4 years. Of these, results from internal experiments matched up with the published findings in only 14 projects, but were highly inconsistent in 43 (in a further 10 projects, claims were rated as mostly reproducible, partially reproducible or not applicable . . .

High-impact journals did not seem to publish more robust claims, and, surprisingly, the confirmation of any given finding by another academic group did not improve data reliability. “We didn’t see that a target is more likely to be validated if it was reported in ten publications or in two publications,” says Asadullah. . . .

There’s also this amusing bit:

The analysis is limited by a small sample size, and cannot itself be checked because of company confidentiality concerns . . .

More here and here.

Duke postdoctoral fellowships in nonparametric Bayes & high-dimensional data

I hate to announce this one because it’s directly competing with us, but it actually looks pretty good! If I were getting my Ph.D. right now, I’d definitely apply . . .

David Dunson announces:

There will be several postdoctoral fellowships available at Duke to work with me [Dunson] & others on research related to foundations of nonparametric Bayes in high-dimensional settings, with a particular focus on showing theoretical properties and developing new models and computational approaches in machine learning applications & genomics.

Send applications to Ellen Currin, Department of Electrical and Computer Engineering, [email protected]

My wikipedia edit

The other day someone mentioned my complaint about the Wikipedia article on “Bayesian inference” (see footnote 1 of this article) and he said I should fix the Wikipedia entry myself.

And so I did. I didn’t have the energy to rewrite the whole article–in particular, all of its examples involve discrete parameters, whereas the Bayesian problems I work on generally have continuous parameters, and its “mathematical foundations” section focuses on “independent identically distributed observations x” rather than data y which can have different distributions. It’s just a wacky, unbalanced article. But I altered the first few paragraphs to get rid of the stuff about the posterior probability that a model is true.

I much prefer the Scholarpedia article on Bayesian statistics by David Spiegelhalter and Kenneth Rice, but I couldn’t bring myself to simply delete the Wikipedia article and replace it with the Scholarpedia content.

Just to be clear: I’m not at all trying to disparage the efforts of the Wikipedians. It’s only through putting stuff out there that it can be edited and improved.

The importance of style in academic writing

In my comments on academic cheating, I briefly discussed the question of how some of these papers could’ve been published in the first place, given that they tend to be of low quality. (It’s rare that people plagiarize the good stuff, and, when they do—for example when a senior scholar takes credit for a junior researcher’s contributions without giving proper credit—there’s not always a paper trail, and there can be legitimate differences of opinion about the relative contributions of the participants.)

Anyway, to get back to the cases at hand: how did these rulebreakers get published in the first place? The question here is not how they got away with cheating but why top journals were publishing mediocre research at all.

Some thoughts on academic cheating, inspired by Frey, Wegman, Fischer, Hauser, Stapel

As regular readers of this blog are aware, I am fascinated by academic and scientific cheating and the excuses people give for it.

Bruno Frey and colleagues published a single article (with only minor variants) in five different major journals, and these articles did not cite each other. And there have been several other cases of his self-plagiarism (see this review from Olaf Storbeck). I do not mind the general practice of repeating oneself for different audiences—in the social sciences, we call this Arrow’s Theorem—but in this case Frey seems to have gone a bit too far. Blogger Economic Logic has looked into this and concluded that this sort of self-plagiarism is standard practice in “the context of the German(-speaking) academic environment,” and what sets Frey apart is not his self-plagiarism or even his brazenness but rather his practice of doing it in high-visibility journals. Economic Logic writes that “[Frey’s] contribution is pedagogical, he found a good and interesting way to explain something already present in the body of knowledge.” Textbook writers copy and rearrange their own and others’ examples all the time; it’s only when you aim for serious academic journals that it’s a problem.

One question that the econ blogger did not address is: why did all these top research journals publish a paper with no serious research content? Setting aside the self-plagiarism, everyone knows that publication in top econ journals is extremely competitive. Why would five different journals be interested in a fairly routine analysis of a small public dataset that has been analyzed many times before?

I don’t have a great answer to that one, except that the example may have seemed offbeat enough to be worthy of publication just for fun (and, unfortunately, none of the journal editors happened to know that they were publishing a variant of a standard example in introductory statistics books).

Ed Wegman is a prominent statistician (he’s received the Founders Award for service to the profession from the American Statistical Association) who has plagiarized in several articles and even a report for the U.S. Congress! And, as is often the case, the copy is typically worse than the original, sometimes introducing errors, other times simply rephrasing in a way that reveals a serious lack of understanding of the original material. There are various theories of what drove Wegman to steal, but I’ll go with my generic explanation: laziness, plus the desire to simulate expertise or creativity where there is none.

The Frey and Wegman stories came out in their full glory a few months ago. I don’t know if Frey is giving public talks. But I was amazed to see, in the program of the Joint Statistical Meetings this past August, that Wegman was involved in two sessions! The first session (“The Human Cultural and Social Landscape”) was organized and chaired by Wegman and featured three speakers, all from Wegman’s department, including Yasmin Said, his coauthor on the paper that was retracted for plagiarism. In his other session, Wegman spoke on computational algorithms for order-restricted inference. The talk is described as a review so plagiarism isn’t so much of an issue, I guess. Still, I wonder if he actually showed up to these sessions.

Frank Fischer is the political scientist who copied big blocks of text from others’ writings without attribution (also, like Frey and Wegman, about 70 years old at the time of being caught), in what looks from a distance to be another lazy attempt to simulate expertise without actually doing the work of digesting the stolen material. I asked a friend about this case the other day, and he said that to the best of his knowledge Fischer has not admitted doing anything wrong. Unlike Frey (who’s a bigshot in European academia) or Wegman (whose work is politically controversial), Fischer is enough of a nobody that he can apparently survive being called out for plagiarism with his career otherwise unaffected.

Marc Hauser is the recently retired (at the age of 51) Harvard psychologist who is working on a book, “Evilicious: Explaining Our Evolved Taste for Being Bad,” and who also reportedly dabbled in a bit of unethical behavior himself involving questionable interpretation of research data. He was turned in by some of his research assistants, who didn’t like that he was being evasive and not letting others replicate his measurements.

I asked E. J. Wagenmakers what he thought about the Hauser case and he replied with an interesting explanation that is based on process rather than personality:

One of the problems is that the field of social psychology has become very competitive, and high-impact publications are only possible for results that are really surprising. Unfortunately, most surprising hypotheses are wrong. That is, unless you test them against data you’ve created yourself. There is a slippery slope here though; although very few researchers will go as far as to make up their own data, many will “torture the data until they confess”, and forget to mention that the results were obtained by torture….

This is a combination of the usual “competitive pressure” story with a more statistical argument about systematic overestimation arising from the statistical-significance filter.

Diederik Stapel is the subject of the most recent high-profile case of academic fraud. Wagenmakers writes:

He published about 100 articles, and in high-ranking journals too (Science being one of them). Turns out he was simply making up his data. He was caught because his grad students discovered that part of the data he gave them contained evidence of a copy-paste job. The extent to which all his work is contaminated (including that of his many PhD students, whom he often “gave” experimental results) is as yet unknown. Tilburg University has basically fired him.

Diederik Stapel was not just a productive researcher, but he also made appearances on Dutch TV shows. The scandal is all over the Dutch news. Oh, one of the courses he taught was on something like “Ethical behavior in research”, and one of his papers is about how power corrupts. It doesn’t get much more ironic than this. I should stress that the extent of the fraud is still unclear.

I’ve never done any research fraud myself but I have to say I can see the appeal. The other day I was typing data from survey forms into a file for analysis, and I noticed that the data from some of the research participants didn’t go the way they were “supposed” to. I discussed it with my collaborator, who had a good explanation for each person based on what had happened in their lives recently. I could feel the real temptation to cheat and adjust the numbers to what I’d guess they should’ve been, absent the shocks that were irrelevant to the study at hand. We didn’t cheat, of course, but it would’ve been so easy. There’s no way anyone would’ve checked, and it would’ve made the results much more convincing in a way that seems appropriate in the larger context of the research.

I can see how a scientist such as Hauser or Stapel could justify this sort of behavior in the name of scientific truth. Similarly, Wegman and Fischer probably felt that, in some deep sense, they really were experts in the fields they were plagiarizing. Sure, they hadn’t fully absorbed the literature, but they might have felt they were enough of experts that they could always understand it if necessary. As for Frey, my guess based on his many writings on academic publication ethics is that he feels that everybody does it, so he needs to play the game too.

Fourteen magic words: an update

In the discussion of the fourteen magic words that can increase voter turnout by over 10 percentage points, questions were raised about the methods used to estimate the experimental effects. I sent these on to Chris Bryan, the author of the study, and he gave the following response:

We’re happy to address the questions that have come up. It’s always noteworthy when a precise psychological manipulation like this one generates a large effect on a meaningful outcome. Such findings illustrate the power of the underlying psychological process.

I’ve provided the contingency tables for the two turnout experiments below. As indicated in the paper, the data are analyzed using logistic regressions. The change in chi-squared statistic represents the significance of the noun vs. verb condition variable in predicting turnout; that is, the change in the model’s significance when the condition variable is added. This is a standard way to analyze dichotomous outcomes.

Four outliers were excluded in Experiment 2. These were identified following a procedure described by Tabachnick and Fidell (5th Ed., pp. 443, 446-7) using standardized residuals; the z = 2.58 criterion was based on G. David Garson’s online publication “Statnotes: Topics in Multivariate analysis.” We think excluding these values is the appropriate way to analyze these data; if they are retained, the difference between conditions reduces to 9.4 percentage points (still a considerable difference), and the P-value increases to just under 0.15. Regardless, the larger point is that the effect replicated in the second, larger and more representative study (Experiment 3) where, incidentally, no outliers were excluded.

We agree that it is important to test the effects of this manipulation with larger samples; doing so would address the applied implications of the study–can this technique be used to increase turnout on a population scale and by how much? Nonetheless, as compared to typical psychology studies, the sample sizes are ample and, as the effects are consistently statistically significant, they clearly demonstrate the important psychological process we were interested in: that subtle linguistic cues that evoke the self can motivate socially-desirable behavior like voting.

I agree that the timing of the exercise–completed the day before and the morning of Election Day–was likely important, although the degree to which the effect decays over time is an important topic for future research. It’s also relevant that we manipulated the phrasing of 10 survey questions, not just one. So, while the difference between conditions was subtle, participants were exposed to it multiple times.

I hope this response is helpful. I’d be happy to address any further questions. In addition, if anyone is interested in collaborating on a larger-scale implementation of this experiment, I [Chris Bryan] would be excited to talk about that.

Contingency tables

Experiment 2:
Noun: 42 voted, 2 did not
Verb: 36 voted, 8 did not

Experiment 3:
Noun: 98 voted, 11 did not
Verb: 83 voted, 22 did not
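
For anyone who wants to check the numbers, here is a minimal sketch (mine, not the authors’; the paper reports logistic regressions rather than these particular tests) that computes the turnout rates and two standard tests of the difference from the tables above:

    import numpy as np
    from scipy.stats import chi2_contingency, fisher_exact

    tables = {
        "Experiment 2": np.array([[42, 2], [36, 8]]),    # rows: noun, verb; cols: voted, did not
        "Experiment 3": np.array([[98, 11], [83, 22]]),
    }

    for name, t in tables.items():
        turnout = t[:, 0] / t.sum(axis=1)
        diff = turnout[0] - turnout[1]
        chi2, p_chi2, _, _ = chi2_contingency(t)    # 2x2 table, Yates continuity correction by default
        _, p_fisher = fisher_exact(t)
        print(f"{name}: noun {turnout[0]:.1%}, verb {turnout[1]:.1%}, "
              f"difference {diff:.1%}, chi-squared p = {p_chi2:.3f}, Fisher p = {p_fisher:.3f}")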

Interesting and important for psychology (the idea that ideas of essentialism affect important real-world decisions) and for political science (the idea of political participation being connected to one’s view of oneself).

For my taste, the statistical analysis is way too focused on p-values and hypothesis testing—I’m not particularly interested in testing hypotheses of zero effect, as I think everything has some effect, the real question being how large it is—but, what can I say, that’s how they do things in psychology research (psychometrics aside). I’m guessing that the 10 percentage points is an overestimate of the effect. Also, I don’t quite understand the bit about outliers: if the outcomes are simply yes or no, what does it mean to be an outlier?

In any case, I think it’s great to have such discussions out in the open. This is the way to move forward.

The statistical significance filter

I’ve talked about this a bit but it’s never had its own blog entry (until now).

Statistically significant findings tend to overestimate the magnitude of effects. This holds in general (because E(|x|) > |E(x)|, so the magnitude of an unbiased estimate tends to overstate the magnitude of the underlying effect), but even more so if you restrict to statistically significant results.

Here’s an example. Suppose a true effect theta is unbiasedly estimated by y ~ N(theta, 1). Further suppose that we will only consider statistically significant results, that is, cases in which |y| > 2.

The estimate “|y| conditional on |y| > 2” is clearly an overestimate of |theta|. First off, if |theta| < 2, the estimate |y| conditional on statistical significance is not only too high in expectation, it’s always too high. This is a problem, given that |theta| in reality is probably less than 2. (The low-hanging fruit have already been picked, remember?)

But even if |theta|>2, the estimate |y| conditional on statistical significance will still be too high in expectation.
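
A quick simulation (a sketch of the argument above, not taken from any paper) makes this concrete: draw y ~ N(theta, 1), keep only the “significant” draws with |y| > 2, and compare the average |y| among those draws to |theta|:

    import numpy as np

    rng = np.random.default_rng(1)

    def filtered_estimate(theta, n_sims=1_000_000):
        y = rng.normal(theta, 1, n_sims)
        sig = np.abs(y) > 2                  # the "statistically significant" draws
        return np.abs(y).mean(), np.abs(y[sig]).mean(), sig.mean()

    for theta in [0.5, 1.0, 2.0, 3.0]:
        all_mean, sig_mean, power = filtered_estimate(theta)
        print(f"theta = {theta}: E|y| = {all_mean:.2f}, "
              f"E[|y| | significant] = {sig_mean:.2f}, Pr(significant) = {power:.2f}")

For small |theta| the conditional estimate is far larger than the true effect; even for |theta| > 2 it is still too high on average.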

For a discussion of the statistical significance filter in the context of a dramatic example, see this article or the first part of this presentation.

I call it the statistical significance filter because when you select only the statistically significant results, your “type M” (magnitude) errors become worse.

And classical multiple comparisons procedures—which select at an even higher threshold—make the type M problem worse still (even if these corrections solve other problems). This is one of the troubles with using multiple comparisons to attempt to adjust for spurious correlations in neuroscience. Whatever happens to exceed the threshold is almost certainly an overestimate. This might not be a concern in some problems (for example, in identifying candidate genes in a gene-association study), but it arises in any analysis in which the magnitude of the effect is important, including just about anything in social or environmental science.

[This is part of a series of posts analyzing the properties of statistical procedures as they are actually done rather than as they might be described in theory. Earlier I wrote about the problems of inverting a family of hypothesis tests to get a confidence interval and how this falls apart given the way that empty intervals are treated in practice. Here I consider the statistical properties of an estimate conditional on it being statistically significant, in contrast to the usual unconditional analysis.]

My homework success

A friend writes to me:

You will be amused to know that students in our Bayesian Inference paper at 4th year found solutions to exercises from your book on-line. The amazing thing was that some of them were dumb enough to copy out solutions verbatim. However, I thought you might like to know you have done well in this class!

I’m happy to hear this. I worked hard on those solutions!

The difference between significant and not significant…

E. J. Wagenmakers writes:

You may be interested in a recent article [by Nieuwenhuis, Forstmann, and Wagenmakers] showing how often researchers draw conclusions by comparing p-values. As you and Hal Stern have pointed out, this is potentially misleading because the difference between significant and not significant is not necessarily significant.

We were really surprised to see how often researchers in the neurosciences make this mistake. In the paper we speculate a little bit on the cause of the error.

From their paper:

In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure.
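
A minimal numerical sketch of the fallacy (my numbers, not from the paper): suppose effect A is estimated as 25 with standard error 10 (z = 2.5, significant) and effect B as 10 with standard error 10 (z = 1.0, not significant). The difference is 15 with standard error sqrt(10^2 + 10^2), about 14, which is nowhere near significant:

    import math
    from scipy.stats import norm

    # Two hypothetical estimates with standard errors (not data from the paper).
    est_a, se_a = 25.0, 10.0    # "significant": z = 2.5
    est_b, se_b = 10.0, 10.0    # "not significant": z = 1.0

    for label, est, se in [("A", est_a, se_a), ("B", est_b, se_b)]:
        z = est / se
        print(f"effect {label}: z = {z:.2f}, two-sided p = {2 * norm.sf(abs(z)):.3f}")

    # The correct comparison: test the difference directly.
    diff = est_a - est_b
    se_diff = math.sqrt(se_a**2 + se_b**2)    # assumes independent estimates
    z_diff = diff / se_diff
    print(f"difference: {diff:.1f} (se {se_diff:.1f}), z = {z_diff:.2f}, "
          f"two-sided p = {2 * norm.sf(abs(z_diff)):.3f}")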

I assume this has been an issue for close to a century; it’s interesting that it’s been noticed more in the past few years. I wonder what’s going on.

P.S. E. J. writes, “I know of no references that precede your work with Hal Stern.” I wonder, though. The idea is so important that I’d be surprised if Fisher, Yates, Neyman, Box, Tukey, etc., didn’t ever discuss it.

How to solve the Post Office’s problems?

Felix Salmon offers some suggestions. (For some background, see this news article by Steven Greenhouse.)

I have no management expertise but about fifteen years ago I did some work on a project for the Postal Service, and I remember noticing some structural problems back then:

Everyone would always get annoyed about the way the price of a first class stamp would go up in awkward increments, from 29 cents to 32 cents to 33 cents to 34 cents etc. Why couldn’t they just jump to the next round number (for example, 35 cents) and keep it there for a few years? The answer, I was told, was that the Postal Service was trapped by a bunch of rules. They were required to price everything exactly at cost. If they charged too much for first class mail, then UPS and Fed-Ex would sue and say the PO was illegally cross-subsidizing their bulk mail. If they charged too little, then the publishers and junk mailers would sue. Maybe I’m getting the details wrong here but that was the basic idea. There was actually a system of postal courts (it probably still exists) to adjudicate these fights.

Basically, the post office is always broke because it’s legally required to be broke. It’s not like other utilities which are regulated in a gentle way to allow them to make profits.

Looking at this from a political direction, things must somehow be set up so that the Postal Service’s customers have more clout than the Postal Service itself. I don’t really have a sense of why this would happen for mail more than for gas, electricity, water, etc.