
The last word on the Canadian lynx series

The “Canadian lynx data” is one of the famous examples used in time series analysis. And the usual models that are fit to these data don’t work well. Cavan Reilly and Angelique Zeringue write:

Reilly and Zeringue then present their analysis. Their simple little predator-prey model with a weakly informative prior way outperforms the standard big-ass autoregression models. Check this out:

Or, to put it into numbers, when they fit their model to the first 80 years and predict to the next 34, their root mean square out-of-sample error is 1480 (see scale of data above). In contrast, the standard model fit to these data (the SETAR model of Tong, 1990) has more than twice as many parameters but gets a worse-performing root mean square error of 1600, even when that model is fit to the entire dataset. (If you fit the SETAR or any similar autoregressive model to the first 80 years and use it to predict the next 34, the predictions are a disaster—the predicted values quickly go toward the mean and can’t even attempt to track the curve.)
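To make the holdout comparison concrete, here is a minimal Python sketch of the procedure: fit on the first 80 points, forecast the next 34, and compute the root mean square error. It is an illustration under stated assumptions, not a reproduction of either analysis. The series is synthetic (a sine wave, not the lynx data), and a plain AR(1) model stands in for the much richer SETAR model; the function names are made up for this example. What it does show is the behavior described in the parenthetical: iterated autoregressive forecasts decay toward the training mean and stop tracking the cycle.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def fit_ar1(series):
    """Least-squares AR(1) fit: y[t] - mean = phi * (y[t-1] - mean) + noise."""
    mean = sum(series) / len(series)
    centered = [y - mean for y in series]
    num = sum(centered[t] * centered[t - 1] for t in range(1, len(centered)))
    den = sum(c * c for c in centered[:-1])
    return mean, num / den


def forecast_ar1(last_value, mean, phi, steps):
    """Iterated multi-step forecast; each step multiplies the deviation by phi."""
    forecasts = []
    deviation = last_value - mean
    for _ in range(steps):
        deviation *= phi
        forecasts.append(mean + deviation)
    return forecasts


# Synthetic cyclic series (NOT the lynx data): 114 points, 10-step cycle.
series = [1500 + 1000 * math.sin(2 * math.pi * t / 10) for t in range(114)]
train, test = series[:80], series[80:]

mean, phi = fit_ar1(train)
pred = forecast_ar1(train[-1], mean, phi, len(test))
holdout_rmse = rmse(test, pred)

print(f"phi = {phi:.3f}, holdout RMSE = {holdout_rmse:.1f}")
print(f"first forecast deviates from the mean by {abs(pred[0] - mean):.1f}")
print(f"last forecast deviates from the mean by {abs(pred[-1] - mean):.1f}")
```

Because the fitted coefficient is below 1 in absolute value, the forecast deviation shrinks geometrically, so the holdout error ends up close to the raw variability of the series itself.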

As Reilly and Zeringue note, the above graph shows potential room for improvement in the model, but even as is, it shows the huge benefits that can be obtained by attempting to model the underlying process rather than simply fitting the data using a conventional family of models.

(It’s funny for me to emphasize this point, given how often I use conventional models such as linear and logistic regression.)

Educational monoculture

John Cook writes that he’d like to hear more people talk about “educational monoculture.” I don’t actually know John Cook but I enjoy reading his blog, so I feel like the least I can do is to honor his request.

I have to admit that I have a bit of a monocultural temperament myself. I have strong feelings about the right and wrong way to do things, and I don’t have much patience for what seems to me to be the wrong way. As a result, I’ve often disparaged or ignored important statistical developments because some small aspect of the new idea didn’t fit with my thinking. (On the plus side, I think I’ve disparaged or ignored lots more bad ideas that deserve oblivion.)

I’ve always been suspicious of the hedgehog/fox distinction because my impression is that just about everybody likes to think of him or herself as a fox. Being a hedgehog is like being “ideological”; most of us like to think of ourselves as pragmatic foxes. And in any case I think most statisticians are foxes.

One of the many positive outcomes of my mugging at Berkeley was a commitment to pluralism (for example, see here).

Beyond this, I move away from my natural monocultural instincts by teaching classes that include material I wouldn’t otherwise cover, by listening carefully to people I respect who do things in a different way than I do, and by thinking hard about why certain methods or attitudes that seem silly to me still remain popular.

Finally, my approach as a political scientist and public opinion researcher is to understand the views of others. I think I have a pretty good grip on why it can make sense for people to vote for Gingrich or Romney or Obama or Santorum or whatever, and I’m interested in understanding political ideologies as they manifest themselves in different areas (even in statistics, where political views range from Dennis Lindley to Jacob Wolfowitz).

“Moving beyond monoculture” doesn’t mean that I abandon my skepticism but it means that I should at least try to understand other approaches to looking at the world.

P.S. I thought the above discussion would be more useful than yet another argument about the extent to which modern education is such a scam etc.

Suggested resolution of the Bem paradox

There has been an increasing discussion about the proliferation of flawed research in psychology and medicine, with some landmark events being John Ioannidis’s article, “Why most published research findings are false” (according to Google Scholar, cited 973 times since its appearance in 2005), the scandals of Marc Hauser and Diederik Stapel, two leading psychology professors who resigned after disclosures of scientific misconduct, and Daryl Bem’s dubious recent paper on ESP, published to much fanfare in the Journal of Personality and Social Psychology, one of the top journals in the field.

Alongside all this are the plagiarism scandals, which are uninteresting from a scientific context but are relevant in that, in many cases, neither the institutions housing the plagiarists nor the editors and publishers of the plagiarized material seem to care. Perhaps these universities and publishers are more worried about bad publicity (and maybe lawsuits, given that many of the plagiarism cases involve law professors) than they are about scholarly misconduct.

Before going on, perhaps it’s worth briefly reviewing who is hurt by the publication of flawed research. It’s not a victimless crime. Here are some of the malign consequences:

- Wasted time and resources spent by researchers trying to replicate non-findings and chasing down dead ends.

- Fake science news bumping real science news off the front page.

- When the errors and scandals come to light, a decline in the prestige of higher-quality scientific work.

- Slower progress of science, delaying deeper understanding of psychology, medicine, and other topics that we deem important enough to deserve large public research efforts.

This is a hard problem!

There’s a general sense that the system is broken with no obvious remedies. I’m most interested in presumably sincere and honest scientific efforts that are misunderstood and misrepresented into more than they really are (the breakthrough-of-the-week mentality criticized by Ioannidis and exemplified by Bem). As noted above, the cases of outright fraud have little scientific interest but I brought them up to indicate that, even in extreme cases, the groups whose reputations seem at risk from the unethical behavior often seem more inclined to bury the evidence than to stop the madness.

If universities, publishers, and editors are inclined to look away when confronted with out-and-out fraud and plagiarism, we can hardly be surprised if they’re not aggressive against merely dubious research claims.

In the last section of this post, I briefly discuss several examples of dubious research that I’ve encountered, just to give a sense of the difficulties that can arise in evaluating such reports.

What to do (statistics)?

My generic solution to the statistics problems involved in estimating small effects is to replace multiple comparisons by multilevel modeling, that is, to estimate configurations rather than single effects or coefficients. This tactic won’t solve every problem but it’s my overarching conceptual framework. There’s lots of room for research on how to do better in particular problem settings.
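As a rough illustration of what estimating a configuration of effects can look like, here is a minimal Python sketch of partial pooling. It assumes a simple normal-normal model and estimates the between-group variance by the method of moments, which is a shortcut chosen for brevity and not a full multilevel fit. The function name and the five input estimates are hypothetical.

```python
def partial_pool(estimates, std_errors):
    """Shrink noisy group estimates toward their precision-weighted mean.

    Normal-normal model with a method-of-moments estimate of the
    between-group variance tau^2 (an illustrative shortcut, not a
    full Bayesian multilevel fit).
    """
    n = len(estimates)
    weights = [1.0 / se ** 2 for se in std_errors]
    grand_mean = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)

    # Observed spread minus the average sampling variance, floored at zero.
    spread = sum((y - grand_mean) ** 2 for y in estimates) / (n - 1)
    tau2 = max(0.0, spread - sum(se ** 2 for se in std_errors) / n)

    pooled = []
    for y, se in zip(estimates, std_errors):
        # shrink = 1 means complete pooling, shrink = 0 means no pooling.
        shrink = se ** 2 / (se ** 2 + tau2)
        pooled.append(shrink * grand_mean + (1 - shrink) * y)
    return grand_mean, tau2, pooled


# Hypothetical raw effect estimates and standard errors for five comparisons.
raw = [2.0, -1.0, 0.5, 6.0, -3.0]
ses = [2.0, 2.0, 2.0, 2.0, 2.0]
grand_mean, tau2, pooled = partial_pool(raw, ses)

for y, p in zip(raw, pooled):
    print(f"raw {y:+.2f} -> partially pooled {p:+.2f}")
```

The point of the sketch is that every estimate is pulled toward the common mean, and the pull is stronger when the standard errors are large relative to the variation between groups. The extreme estimates that a multiple-comparisons correction would worry about are the ones that get moderated the most.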

What to do (scientific publishing)?

I have clearer ideas of resolutions (at least in the short term) of the Bem paradox; in short, what to do with dubious but potentially interesting findings.

So far there seem to be two suggestions out there: Either publish such claims in top journals (as for example Bem’s in JPSP, or the contagion-of-obesity paper in NEJM), or the journals should reject them (perhaps from some combination of more careful review of methodology, higher standards than classical 5% significance, and Bayesian skepticism).

The problem with the publish-in-top-journals strategy is that it ensures publicity for some mistakes and it creates incentives for researchers to stretch their statistics to get a prestigious publication.

The problem with the reject-’em-all-and-let-the-Arxiv-sort-’em-out strategy is that it’s perhaps too rigorous. So many papers have potential methodological flaws. Recall that the Bem paper was published, which means in some sense that its reviewers thought the paper’s flaws were no worse than what usually gets published in JPSP. Long-term, sure, we’d like to improve methodological rigor, but in the meantime a key problem with Bem’s paper was not just its methodological flaws, it was also the implausibility of the claimed results.

So here’s my proposed solution. Instead of publishing speculative results in top journals such as JPSP, Science, Nature, etc., publish them in lower-ranked venues. For example, Bem could publish his experiments in some specialized journal of psychological measurement. If the work appears to be solid (as judged by the usual corps of referees), then publish it, get it out there. I’m not saying to send the paper to a trash journal; if it’s good stuff it can go in a good journal, the sort where peer review really means something. (I assume there’s also a journal of parapsychology but that’s probably just for true believers; it’s fair enough that Bem etc would like to publish somewhere that outsiders would respect.)

Under this system, JPSP could feel free to reject the Bem paper on the grounds that it’s too speculative to get the journal’s implicit endorsement. This is not suppression or censorship or anything like it, it’s just a recommendation that the paper be sent to a more specialized journal where there will be a chance for criticism and replication. At some point, if the findings are tested and replicated and seem to hold up, then it could be time for a publication in JPSP, Science, or Nature.

From the other side, this should be acceptable to the Bems and Fowlers who like to work on the edge. You still get your ideas out there in a respectable publication (and you still might even get a bit of publicity), and then you, the skeptics, and the rest of the scientific community can go at it in public.

There have also been proposals for more interactive publications of individual articles, with bloglike opportunities for discussion and replies. That’s fine too, but I think the only way to make real progress here is to accept that no individual article will tell the whole story, especially if the article is a report of new research. If the Bem finding is real, this can be demonstrated in a series of papers in some specialized journal.

Chris Schmid on Evidence Based Medicine

Chris Schmid is a statistician at New England Medical Center who is an expert on evidence-based medicine. I invited him to present an introductory overview lecture on the topic at last year’s Joint Statistical Meetings, and here are his slides. All 123 of them. I don’t know how he expected to go through all of these in an hour. You could teach a semester-long course based on this material.

Good stuff, I recommend you all read it.

Difficulties in publishing non-replications of implausible findings

Eric Tassone points me to this news article by Christopher Shea on the challenges of debunking ESP. Shea writes:

Earlier this year, a major psychology journal published a paper suggesting that there was some evidence for “pre-cognition,” a form of ESP. Stuart Ritchie, a doctoral student at the University of Edinburgh, is part of a team that tried, but failed, to replicate those results. Here, he tells the Chronicle of Higher Education’s Tom Bartlett about the difficulties he’s had getting the results published.

Several journals told the team they wouldn’t publish a study that did no more than disprove a previous study. . . . An editor at another journal said he’d “only accept our paper if we ran a fourth experiment where we got a believer [in ESP] to run all the participants, to control for . . . experimenter effects.”

My reaction is, this isn’t as easy a question as it might seem. At first, one’s reaction might share Ritchie’s frustration that a shoddy paper by Bem got published while Ritchie’s careful replication got dinged. But, as I wrote when the issue came up on the sister blog:

Setting aside the whole “psychic powers” thing, it makes sense to me not to run the new experiment. After all, it’s hardly news that ESP doesn’t work. If “ESP doesn’t work” were publishable, you could fill up a journal many times over with such findings. And what would be the point of that? Better to start a new journal with some catchy title such as Replications of Well-Known Findings. In the physics division, you could have articles demonstrating that objects fall down, not up. In the chemistry division, you could publish demonstrations that H2 + O2 yields H2O plus energy. The biology section could have a paper demonstrating that cats and dogs can’t produce offspring. And so on.

So I don’t know the answer here. On one hand, we can hardly require or even expect that journals fill their pages with dog-bites-man nonreplications. (And, even in a computerized era where there are no page limits, there are still constraints on the time of editors and reviewers.) On the other hand, this leads to an asymmetry where crap gets on the front page and the refutation doesn’t even get published on page B16.

Fight! (also a bit of reminiscence at the end)

Martin Lindquist and Michael Sobel published a fun little article in Neuroimage on models and assumptions for causal inference with intermediate outcomes. As their subtitle indicates (“A response to the comments on our comment”), this is a topic of some controversy. Lindquist and Sobel write:

Our original comment (Lindquist and Sobel, 2011) made explicit the types of assumptions neuroimaging researchers are making when directed graphical models (DGMs), which include certain types of structural equation models (SEMs), are used to estimate causal effects. When these assumptions, which many researchers are not aware of, are not met, parameters of these models should not be interpreted as effects. . . . [Judea] Pearl does not disagree with anything we stated. However, he takes exception to our use of potential outcomes notation, which is the standard notation used in the statistical literature on causal inference, and his comment is devoted to promoting his alternative conventions. [Clark] Glymour’s comment is based on three claims that he inappropriately attributes to us. Glymour is also more optimistic than us about the potential of using directed graphical models (DGMs) to discover causal relations in neuroimaging research . . .

Lindquist and Sobel’s arguments make sense to me, except on one point. They consider a causal setting z -> x -> y, where z is the treatment variable, x is the intermediate outcome, and y is the ultimate outcome, and much of their discussion centers on estimating the causal effect of x on y. I have two difficulties with their perspective:

1. If x is an observed variable that is not directly manipulated, I don’t know if it makes sense to talk about the effect of x on y, unconditional on the intervention that was used to change x. In their example, I’d talk about “the effect of x on y, if x is changed through z.” Different z’s can induce different effects of x on y.

2. Lindquist and Sobel talk about the effect of z on x. If z=0 or 1, they write x(z), so that the causal effect of z on x is x(1) – x(0) (or, more generally, x(1) compared to x(0), but we lose nothing by considering simple differences here). So far, so good.

But I get stuck at the next step, where they define the effect of x on y. If x can equal 0 or 1, they write y(z,x), so that the causal effect of x on y, conditional on z, is y(z,1) – y(z,0). At least, I think that’s what they’re saying.

The trouble is, I don’t see how the two parts of this model fit together. For any given item in the experiment, I think they’re following the rule that x(z) has a particular (although maybe unknown) value. But then I don’t see what it means to look at y(z,1) – y(z,0). For any particular value of z, it seems to me that only one of these two terms is possible. (For example, if x(z)=1, then y(z,1) is defined but y(z,0) seems meaningless.)
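The puzzle can be written out as a small Python sketch. This is only a restatement of the difficulty described above, under the assumption that z and x are both binary and that each unit has a fixed x(z); it is not Lindquist and Sobel’s definition or a resolution of the question. The function name and the example unit are hypothetical.

```python
from itertools import product


def realizable_cells(x_of_z):
    """Given a unit's x(z) for z in {0, 1}, mark which y(z, x) cells
    could be observed by setting z alone."""
    return {(z, x): (x_of_z[z] == x) for z, x in product((0, 1), repeat=2)}


# Hypothetical unit whose intermediate outcome responds to treatment:
# x(0) = 0 and x(1) = 1.
x_of_z = {0: 0, 1: 1}
cells = realizable_cells(x_of_z)

# Effect of z on x for this unit: x(1) - x(0).
effect_z_on_x = x_of_z[1] - x_of_z[0]
print(f"x(1) - x(0) = {effect_z_on_x}")

for (z, x), reachable in sorted(cells.items()):
    status = ("reachable by setting z" if reachable
              else "needs a separate intervention on x")
    print(f"y(z={z}, x={x}): {status}")
```

For any unit, exactly one cell per value of z is reachable through z alone, so the difference y(z,1) – y(z,0) always involves one cell that would require changing x by some other route.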

I’m not saying that this framework is wrong, just that I don’t understand it.

That said, Lindquist and Sobel’s criticisms of Pearl and Glymour seem sound to me.

Advice on do-it-yourself stats education?

Dustin Palmer writes:

I am a recent graduate looking for a bit of advice. While I took intro classes on math and statistics in my undergraduate degree as a political science major, I find myself university-less and seeking to develop my statistics toolkit.

I work for an NGO in the international development field. I think that a solid statistics foundation would offer me not only more career opportunities, but more importantly, a deeper and more nuanced understanding of the processes and problems that interest me. I’m talking about field experiments and practical quantitative and qualitative data analysis.

I have plenty of free time, ambition, and enthusiasm to improve this part of my toolbox, but I lack an attachment to an institution and much in the way of financial resources. How would you go about making a concentrated effort at acquiring an understanding of the field and its actual application in something like R or Stata, which I admit to never having used?

Perhaps I am simply asking about web resources or best texts, but any broad advice would be much appreciated too.

My gut recommendation is to start with a problem you care about and figure out what you need to get a reasonable solution, then go to the next problem, and so forth. For books, you could start with The Statistical Sleuth and my book with Jennifer. If you want to learn R, just try to make some pretty and useful graphs, that will motivate you to be able to do more.

Any other suggestions?

Lessons learned from a recent R package submission

R has zillions of packages, and people are submitting new ones each day. The volunteers who keep R going are doing an incredibly useful service to the profession, and they’re busy.

A colleague sends in some suggestions based on a recent experience with a package update:

1. Always use the R dev version to write a package, not the current stable release. The R people use the R dev version to check your package anyway. If you don’t use the R dev version, there is a chance that your package won’t pass the check. In my own experience, every time R has a major change, it tends to have new standards and find new errors in your package with these new standards. So it is better to use the dev version to find the potential errors in advance.

2. After submission, write an email to claim it. I used to submit the package to CRAN without writing an email. This was standard operating procedure, but it has changed. Writing an email to claim the submission is now a requirement. There is a good reason. The R team is afraid that the package was not submitted by a legitimate developer. So there is a security issue involved here. Write an email to remind them that you submitted a package, not a virus.

3. The R people are busy. The number of R packages submitted to CRAN is growing exponentially. So the R people’s workloads are heavy. We should understand their situation and try to work with them to solve the package issues when problems come up.

The first two lessons are the most important. If you have done the first two, I believe you won’t need the third one.

I’ve never actually written an R package myself—my last experience with this sort of thing was several years ago, using dyn.load and dyn.load2 in S—but I’ve used many R packages and I’ve contributed to several widely-used R packages. So I really appreciate the effort put in by the central R people, and I’m posting this note as a way to make their lives easier and also help the people who are writing and updating R packages.

A counterfeit data graphic

Kaiser Fung discusses. It’s a good sign when statistical graphics are so popular that people feel the need to fake them!

Judea Pearl on why he is “only a half-Bayesian”

In an article published in 2001, Pearl wrote:

I [Pearl] turned Bayesian in 1971, as soon as I began reading Savage’s monograph The Foundations of Statistical Inference [Savage, 1962]. The arguments were unassailable: (i) It is plain silly to ignore what we know, (ii) It is natural and useful to cast what we know in the language of probabilities, and (iii) If our subjective probabilities are erroneous, their impact will get washed out in due time, as the number of observations increases.

Thirty years later, I [Pearl] am still a devout Bayesian in the sense of (i), but I now doubt the wisdom of (ii) and I know that, in general, (iii) is false.

He elaborates:

The bulk of human knowledge is organized around causal, not probabilistic relationships, and the grammar of probability calculus is insufficient for capturing those relationships. Specifically, the building blocks of our scientific and everyday knowledge are elementary facts such as “mud does not cause rain” and “symptoms do not cause disease” and those facts, strangely enough, cannot be expressed in the vocabulary of probability calculus. It is for this reason that I consider myself only a half-Bayesian.

Interesting. The Neyman-Rubin framework of potential outcomes does allow for causal reasoning within a probabilistic structure, but indeed it does not allow for statements such as “mud does not cause rain.” In the potential outcomes notation, one could define a random variable y=1 for rain or 0 for no rain, and define y^1 to be the outcome under treatment and y^2 to be the outcome under control. But it would not make sense for “mud” to be a treatment: in the potential-outcomes framework, a treatment is something that you do, not something such as “mud” that you observe.

I’m not saying here that Pearl’s framework is a good or bad idea; my point here is that I’m agreeing that he indeed seems to be asking questions that cannot be addressed by probability models.

Some of my earlier discussions with Pearl are here.

Stan: A (Bayesian) Directed Graphical Model Compiler

Here’s Bob’s talk from the NYC machine learning meetup. And here’s Stan himself:

Bugs Bunny, the governor of Massachusetts, the Dow 36,000 guy, presidential qualifications, and Peggy Noonan

Elsewhere:

1. They asked me to write about my “favorite election- or campaign-related movie, novel, or TV show” (Salon)

2. The shopping period is over; the time for buying has begun (NYT)

3. If anybody’s gonna be criticizing my tax plan, I want it to be this guy (Monkey Cage)

4. The 4 key qualifications to be a great president; unfortunately George W. Bush satisfies all four, and Ronald Reagan doesn’t match any of them (Monkey Cage)

5. The politics of eyeliner (Monkey Cage)

Prior beliefs about locations of decision boundaries

Sharon Begley: Worse than Stephen Jay Gould?

Commenter Tggp links to a criticism of science journalist Sharon Begley by science journalist Matthew Hutson. I learned of this dispute after reporting that Begley had received the American Statistical Association’s Excellence in Statistical Reporting Award, a completely undeserved honor, if Hutson is to be believed.

The two journalists have somewhat similar profiles: Begley was science editor at Newsweek (she’s now at Reuters) and author of “Train Your Mind, Change Your Brain: How a New Science Reveals Our Extraordinary Potential to Transform Ourselves,” and Hutson was news editor at Psychology Today and wrote the similarly self-helpy-titled, “The 7 Laws of Magical Thinking: How Irrational Beliefs Keep Us Happy, Healthy, and Sane.”

Hutson writes:

Beautiful Line Charts

I stumbled across a chart that’s in my opinion the best way to express a comparison of quantities through time:

It compares the new PC companies, such as Apple, to traditional PC companies like IBM and Compaq, but on the same scale. If you’d like to see how iPads and other novelties compare, see here. I’ve tried to use the same type of visualization in my old work on legal data visualization.


Bob on Stan

Thurs 19 Jan 7pm at the NYC Machine Learning meetup.

Stan’s entirely publicly funded and open-source and it has no secrets. Ask us about it and we’ll tell you everything you might want to know.

P.S. And here’s the talk.

The Fixie Bike Index

Where are the fixed-gear bike riders?

Rohin Dhar explains:

Big corporations are more popular than you might realize

Robin Hanson writes, “people tend to like cities more than firms . . . people tend to dislike bigger firms more than small ones, cities tend to be bigger than firms, and the biggest cities tend to be the most celebrated.”

Hanson goes on to consider explanations (involving “the joy of sometimes dominating,” etc.) but ends up describing himself as confused.

One reason he might be confused is that, at least in the data I’ve seen, people don’t dislike big corporations—at least not when the corporation is named.

Consider some survey data from 2007:

OK, sure, now that there’s a recession on, Citibank probably doesn’t have 78% approval anymore. Still, these companies are pretty damn popular. You might think lots of Americans think Starbucks is stuck-up? Nope. 79% approval. Pfizer? 77%. I have no idea why Target is so much more popular than Walmart, but in any case all these numbers (with the exception of oil-spillin’ Exxon and war-profitin’ Halliburton) are stratospheric.

I don’t know if there are approval ratings of cities, but I doubt that NY, Chicago, LA, etc. (not to mention Washington, D.C.!) would be up there in the 70-80-90% range.

Of course, liking or disliking a city is much different than liking or disliking a corporation. I don’t really know what such a comparison would mean. But based on the polls, corporations appear to be quite popular, so I don’t think Hanson needs to be so confused as to why they’re not!

P.S. I don’t knock Hanson for not realizing this; before I’d seen the poll data shown above, I would never have guessed that Citibank, Microsoft, Starbucks, etc. were so popular.

How to map geographically-detailed survey responses?

David Sparks writes:

I am experimenting with the mapping/visualization of survey response data, with a particular focus on using transparency to convey uncertainty. See some examples here.

Do you think the examples are successful at communicating both local values of the variable of interest, as well as the lack of information in certain places? Also, do you have any general advice for choosing an approach to spatially smoothing the data in a way that preserves local features, but prevents individual respondents from standing out? I have experimented a lot with smoothing in these maps, and the cost of preventing the Midwest and West from looking “spotty” is the oversmoothing of the Northeast.

My quick impression is that the graphs are more pretty than they are informative. But “pretty” is not such a bad thing! The conveying-information part is more difficult: to me, the graphs seem to be displaying a somewhat confusing mix of opinion level and population density. Consider, for example, the bright red color in Dallas. There must be areas in the countryside that are also heavily Republican but Dallas stands out because there are a lot of people there. In some ways this makes sense—that’s where the voters are—but to me it makes the map a bit confusing. I’m also bothered by the blurriness of the entire northeast—I assume this is happening because all the cities are in each others’ penumbras. That’s just one problem though; really, what’s bugging me more is the overlay of intensity with density.

I think I’d prefer something simpler such as putting a colored circle in each county with the size of the circle proportional to population. But, as I said above, “pretty” is important too.

“Groundbreaking or Definitive? Journals Need to Pick One”

Sanjay Srivastava writes:

As long as a journal pursues a strategy of publishing “wow” studies, it will inevitably contain more unreplicable findings and unsupportable conclusions than equally rigorous but more “boring” journals. Groundbreaking will always be higher-risk. And definitive will be the territory of journals that publish meta-analyses and reviews. . . .

Most conclusions, even those in peer-reviewed papers in rigorous journals, should be regarded as tentative at best; but press releases and other public communication rarely convey that. . . .

His message to all of us:

Our standard response to a paper in Science, Nature, or Psychological Science should be “wow, that’ll be really interesting if it replicates.” And in our teaching and our engagement with the press and public, we need to make clear why that is the most enthusiastic response we can justify.

Excellence in Statistical Reporting Award

The American Statistical Association is seeking nominations for its annual Excellence in Statistical Reporting Award. The award was created in 2004 to encourage and recognize members of the communications media who have best displayed an informed interest in the science of statistics and its role in public life. The award can be given for a single statistical article or for a body of work.

Former winners of the award include: Felix Salmon, financial blogger, 2010; Sharon Begley, Newsweek, 2009; Mark Buchanan, New York Times, 2008; John Berry, Bloomberg News, 2005; and Gina Kolata, New York Times, 2004.

If anyone has any suggestions for the 2012 award, feel free to post in the comments or email me.

Fun fight over the Grover search algorithm

Joshua Vogelstein points me to this blog entry by Robert Tucci, diplomatically titled “Unethical or Really Dumb (or both) Scientists from University of Adelaide ‘Rediscover’ My Version of Grover’s Algorithm”:

R-squared for multilevel models

A model rejection letter

Howard Wainer sends in this rejection letter from Sir David Brewster of The Edinburgh Journal of Science to Charles Babbage:

It is no inconsiderable degree of reluctance that I decline the offer of any Paper from you. I think, however, you will upon reconsideration of the subject be of the opinion that I have no other alternative. The subjects you propose for a series of Mathematical and Metaphysical Essays are so profound, that there is perhaps not a single subscriber to our Journal who could follow them.

Nowadays, he could just submit to Wiley Interdisciplinary Reviews . . .

Infographic on the economy

Gabriel Bergin writes:

Just thought I’d share an infographic you might enjoy. I [Bergin] quite like what they did with the colored ranges of previous curves in the two middle graphs:

I like it. Would it be possible to put the two long time series on the same scale? As it is, one starts in 1948 and the other starts in 1980. The only thing about the display that I really don’t like is those balls on the top indicating the duration of recessions. It looks weird to me to display a time duration in the form of the area of a ball.