‘Researcher Degrees of Freedom’

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

[I]t is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
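As a back-of-the-envelope check of that last claim (assuming, for simplicity, k independent tests of true null hypotheses, each at the 5% level):

```r
# P(at least one "significant" result among k independent tests of true nulls)
k <- 1:10
round(1 - 0.95^k, 3)
# climbs from 0.05 at k = 1 to about 0.40 at k = 10
```

Real analytic alternatives are correlated rather than independent, so the inflation is smaller than this, but the direction is the same: more researcher degrees of freedom, more false positives.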

Another excellent link via Yalda Afshar. Other choice quotes: “Everything reported here actually happened” and “Author order is alphabetical, controlling for father’s age (reverse-coded)”.

I [Malecki] would rank author guidelines №s 5 & 6 higher in the order.

Google Refine

Tools worth knowing about:

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly 49760×951 is bigger than I’d really like to deal with in the browser with js… but on a subset yes). [I might write this example up later.]

Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal-indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)!
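For small cases you can catch some of the same problems in R itself. A minimal sketch (the example values here are made up):

```r
x <- c(" Apple", "apple ", "APPLE", "3,14", "3.14")

# leading/trailing spaces plus mixed lettercase create false duplicates;
# trimming and lowercasing collapses them to one value
cleaned <- tolower(trimws(x[1:3]))
unique(cleaned)  # a single value: "apple"

# mixed decimal indicators in one column: normalize comma to period first
as.numeric(sub(",", ".", x[4:5], fixed = TRUE))  # both parse to 3.14
```

Refine does this interactively (with clustering to suggest the false duplicates for you), which is the point for non-programmers.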

Refine 2.0 adds some data-collection tools for scraping and parsing web data. I have not had a chance to play with any of this kind of advanced scripting with it yet. I also have not had occasion to use Freebase which seems sort of similar (in that it is mostly open data with web APIs) to infochimps (for more on this, see the infochimps R package by Drew Conway).

Reproducibility in Practice

In light of the recent article about drug-target research and replication (Andrew blogged it here) and l’affaire Potti, I have mentioned the “Forensic Bioinformatics” paper (Baggerly & Coombes 2009) to several colleagues in passing this week. I have concluded that it has not gotten the attention it deserves, though it has been discussed on this blog before too.

Figure 1 from Baggerly & Coombes 2009

Continue reading

Parallel JAGS RNGs

As a matter of convention, we usually run 3 or 4 chains in JAGS. By default, this gives rise to chains that draw samples from 3 or 4 distinct pseudorandom number generators. I didn’t go and check whether it does things 111,222,333 or 123,123,123, but in any event the “parallel chains” in JAGS are samples drawn from distinct RNGs computed on a single processor core.

But we all have multiple cores now, or we’re computing on a cluster or in the cloud! So the behavior we’d like from rjags is to run each JAGS chain under the foreach package, with each chain using a parallel-safe RNG. The default behavior with n.chain=1 is that every parallel instance uses .RNG.name[1], the Wichmann-Hill RNG, so the workers all draw from the same generator.

JAGS 2.2.0 includes a new lecuyer module (along with the glm module, which everyone should probably always use, and which has no undocumented tricks that I know of). But lecuyer is completely undocumented! I tried .RNG.name="lecuyer::Lecuyer", .RNG.name="lecuyer::lecuyer", and .RNG.name="lecuyer::LEcuyer", all to no avail. It ought to be .RNG.name="lecuyer::Lecuyer" to be consistent with the other .RNG.name values! I dug around in the source to find where it checks the name from the inits, and discovered that in fact it is

.RNG.name="lecuyer::RngStream"

So here’s how I set up 4 chains now:
Continue reading
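A rough sketch of what such a setup might look like (assuming the foreach and doParallel packages; the model file, data object, and monitored node here are hypothetical placeholders):

```r
library(rjags)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)

# one L'Ecuyer stream per chain; distinct .RNG.seed values keep the
# parallel chains from reproducing each other (seeds here are placeholders)
inits <- lapply(1:4, function(i)
  list(.RNG.name = "lecuyer::RngStream", .RNG.seed = i))

samples <- foreach(i = 1:4, .packages = "rjags") %dopar% {
  load.module("lecuyer")               # must be loaded in each worker
  m <- jags.model("model.bug", data = my_data,
                  inits = inits[[i]], n.chains = 1)
  coda.samples(m, variable.names = "beta", n.iter = 1000)
}
```

Each worker runs a one-chain model with its own RngStream, and the resulting list of mcmc.list objects can be combined afterward for the usual convergence diagnostics.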

RStudio – new cross-platform IDE for R

The new R environment RStudio looks really great, especially for users new to R. In teaching, these are often people new to programming anything, much less statistical models. The R GUIs were different on each platform, with (sometimes modal) windows appearing and disappearing and no unified design. RStudio fixes that and has already found a happy home on my desktop.

Initial impressions

I’ve been using it for the past couple of days. For me, it replaces the niche that R.app held: looking at help, quickly doing something I don’t want to pollute a project workspace with; sometimes data munging, merging, and transforming; and prototyping plots. RStudio is better than R.app at all of these things. For actual development and papers, though, I remain wedded to emacs+ess (good old C-x M-c M-Butterfly).

Favorite features in no particular order

  • plots seamlessly made in new graphics devices. This is huge: instead of one active plot window named something like quartz(1), the RStudio plot pane holds a whole stack of them, and you can click back through previous plots that would be overwritten and ‘lost’ in R.app.
  • help viewer. Honestly I use this more than anything else in R.app and the RStudio one is prettier (mostly by being not set in Times), and you can easily get contextual help from the source doc or console pane (hit tab for completions, then F1 on what you want).
  • workspace viewer with types and dimensions of objects. Another reason I sometimes used R.app instead of emacs. This one doesn’t seem much different from the R.app one, but its integration into the environment is better than the floaty thing that R.app does.
  • ‘Import Dataset’ menu item and button in the workspace pane. For new R users, the answer to “How do I get data into this thing?” has always been “Use one of the textbook package’s included datasets until you learn to read.csv()”. This is a much better answer.
  • obviously, the cross-platform nature of RStudio took the greatest engineering effort. The coolest “platform” is actually the server: RStudio will run on one, and you access it using a modern browser (i.e., no IE). (“While RStudio is compatible with Internet Explorer, other browsers provide an improved user experience through faster JavaScript performance and more consistent handling of browser events.” more).

It would be nice if…

  • indents worked like emacs. I think my code looks nice largely because of emacs+ess. The default indent of two spaces is nice (see the Google style guide), but the way emacs lines up newlines by default is pretty helpful in avoiding silly typing errors (omitted commas, unclosed parentheses).
  • you could edit data.frames, which I’ll guess they are working on. It must be hard, since the R.app one and the X one that comes up in emacs are so abysmal (the R.app one is the least bad). RStudio currently says “Editing of matrix and data.frame objects is not currently supported in RStudio.” :-(

Overall, really great stuff!