Archive of posts filed under the Miscellaneous Statistics category.

Interaction-based feature selection and classification for high-dimensional biological data

Ilya Esteban writes: In your blog, your advice for performing regression in the presence of large numbers of correlated features has been to use composite scores and hierarchical modeling. Unfortunately, many problems don’t provide an obvious and unambiguous way of grouping features together (e.g. gene expression data). Are there any techniques that you would recommend [...]

Don’t let your standard errors drive your research agenda

Alexis Le Nestour writes: How do you test for no effect? I attended a seminar where the person assumed that a non-significant difference between groups implied an absence of effect. In that case, the researcher needed to show that two groups were similar before being hit by a shock conditional on some observable variables. [...]

I don’t believe the paper, “Empirical estimates suggest most published medical research is true.” That is, most published medical research may well be true, but I’m not at all convinced by the analysis being used to support this claim.

David Austin pointed me to this article by Leah Jager and Jeffrey Leek. The title is funny but the article is serious: The accuracy of published medical research is critical both for scientists, physicians and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by [...]

Wanted: 365 stories of statistics

The American Statistical Association has a blog called the Statistics Forum that I edit but haven’t been doing much with. Originally I thought we’d get a bunch of bloggers and have a topic each week or each month and get discussions from lots of perspectives. But it was hard to get people to keep contributing, [...]

How do you think about the values in a confidence interval?

Philip Jones writes: As an interested reader of your blog, I wondered if you might consider a blog entry sometime on the following question I posed on CrossValidated (StackExchange). I originally posed the question based on my uncertainty about 95% CIs: “Are all values within the 95% CI equally likely (probable), or are the values [...]

Preregistration of Studies and Mock Reports

The traditional system of scientific and scholarly publishing is breaking down in two different directions. On one hand, we are moving away from relying on a small set of journals as gatekeepers: the number of papers and research projects is increasing, the number of publication outlets is increasing, and important manuscripts are being posted on [...]

Understanding regression models and regression coefficients

David Hoaglin writes: After seeing it cited, I just read your paper in Technometrics. The home radon levels provide an interesting and instructive example. I [Hoaglin] have a different take on the difficulty of interpreting the estimated coefficient of the county-level basement proportion (gamma-sub-2) on page 434. An important part of the difficulty involves “other [...]

New book by Stef van Buuren on missing-data imputation looks really good!

Ben points us to a new book, Flexible Imputation of Missing Data. It’s excellent and I highly recommend it. Definitely worth the $89.95. Van Buuren’s book is great even if you don’t end up using the algorithm described in the book (I actually like their approach but I do think there are some limitations with [...]

What do people do wrong? WSJ columnist is looking for examples!

Carl Bialik of the Wall Street Journal writes: I’m working on a column this week about numerical/statistical tips and resolutions for writers and people in other fields in the new year. 2013 is the International Year of Statistics, so I’d like to offer some ways to better grapple with statistics in the year ahead. Here’s [...]

Peter Bartlett on model complexity and sample size

Zach Shahn saw this and writes: I just heard a talk by Peter Bartlett about model selection in “unlimited” data situations that essentially addresses this curve. He talks about the problem of model selection given a computational budget (rather than given a sample size). You can either use your computational budget to get more data [...]

Postdoc positions at Microsoft Research – NYC

Sharad Goel sends this in:

Statistics in a world where nothing is random

Rama Ganesan writes: I think I am having an existential crisis. I used to work with animals (rats, mice, gerbils etc.) Then I started to work in marketing research where we did have some kind of random sampling procedure. So up until a few years ago, I was sort of okay. Now I am teaching [...]

The p-value is not . . .

From a recent email exchange: I agree that you should never compare p-values directly. The p-value is a strange nonlinear transformation of data that is only interpretable under the null hypothesis. Once you abandon the null (as we do when we observe something with a very low p-value), the p-value itself becomes irrelevant. To put [...]

Write This Book

This post is by Phil Price. I’ve been preparing a review of a new statistics textbook aimed at students and practitioners in the “physical sciences,” as distinct from the social sciences and also distinct from people who intend to take more statistics courses. I figured that since it’s been years since I looked at an intro [...]

What is expected of a consultant

Robin Hanson writes on paid expert consulting (of the sort that I sometimes do, and that is common among economists and statisticians). Hanson agrees with Keith Yost, who says: Fellow consultants and associates . . . [said] fifty percent of the job is nodding your head at whatever’s being said, thirty percent of it is just [...]

I need a title for my book on ethics and statistics!!

“Ethics and Statistics” is descriptive but boring. It sounds like the textbook for a course which, unfortunately, nobody will take. “Lies, Damn Lies, and Statistics” is too unoriginal. “How to Lie, Cheat, and Steal With Statistics” is kind of ok, maybe? “Statistical Dilemmas”: maybe a bit too boring as well. “Knaves and Frauds of Statistics, [...]

Thinking like a statistician (continuously) rather than like a civilian (discretely)

John Cook writes: When I hear someone say “personalized medicine” I want to ask “as opposed to what?” All medicine is personalized. If you are in an emergency room with a broken leg and the person next to you is lapsing into a diabetic coma, the two of you will be treated differently. The aim [...]

Choose your default, or your default will choose you (election forecasting edition)

Statistics is the science of defaults. One of the differences between statistics and other branches of engineering is that we have a special love for default procedures, perhaps because so many statistical problems are routine (or, at least, people would like them to be). We have standard estimates for all sorts of models, books of [...]

‘Researcher Degrees of Freedom’

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant [I]t is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis. The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should [...]

Two postdoc opportunities to work with our research group!! (apply by 15 Nov 2012)

(1) Hop the Q-Train! That is, the Columbia/NYU Quantitative Training Program, funded by the Institute of Education Sciences to create a cohort of postdoctoral scholars both to develop the new statistical methods required to meet future research challenges and to effectively train and consult with other education researchers. You’ll be working with Jennifer Hill, Marc [...]

Model complexity as a function of sample size

As we get more data, we can fit more model. But at some point we become so overwhelmed by data that, for computational reasons, we can barely do anything at all. Thus, the curve above could be thought of as the product of two curves: a steadily increasing curve showing the statistical ability to fit [...]

A statistical model for underdispersion

We have lots of models for overdispersed count data but we rarely see underdispersed data. But now I know what example I’ll be giving when this next comes up in class. From a book review by Theo Tait: A number of shark species go in for oophagy, or uterine cannibalism. Sand tiger foetuses ‘eat each [...]

100!

Behavioral and Brain Sciences

Another reason why you can get good inferences from a bad model

John Cook considers how people justify probability distribution assumptions: Sometimes distribution assumptions are not justified. Sometimes distributions can be derived from fundamental principles [or] . . . on theoretical grounds. For example, large samples and the central limit theorem together may justify assuming that something is normally distributed. Often the choice of distribution is somewhat [...]

Little Data: How traditional statistical ideas remain relevant in a big-data world

See if you can interpolate the talk from the slides. The background is: I was invited to speak in this seminar on “big data.” I said I didn’t know anything about big data, I worked on little data. They said that was ok. Actually it was probably a crowd-pleasing move to tell these people that [...]