Matt Selove writes:
Continue reading ‘Weak identification provides partial information’ »
David Shor sends along a job announcement for Civis Analytics, which he describes as “basically Obama’s Analytics team reconstituted as a company”:
Data Scientists are responsible for providing the fundamental data science that powers our work – including predictive analytics, data mining, experimental design and ad-hoc statistical analysis. As a Data Scientist, you will join our Chicago-based data science team, working closely and collaboratively with analysts and engineers to identify, quantify and solve big, meaningful problems. Data Scientists will have the opportunity to dive deeply into big problems and work in a variety of areas. Civis Analytics has opportunities for applicants who are seasoned professionals, brilliant newcomers, and anywhere in between.
· Master’s degree in statistics, machine learning, computer science with heavy quant focus, a related subject, or a Bachelor’s degree and significant work experience
· An ability and eagerness to constantly learn and teach others
· Experience using applied statistics or machine learning in a professional or other intensive problem-solving environment with large, complex datasets
· Expertise in R, STATA, or other statistical packages
· Experience identifying and adapting to imperfect data
· PhD in statistics, computer science, or a related subject
· Expert ability to identify and adapt to imperfect data
· Significant work experience in statistical modeling, machine learning,
· Comfort with programming
I didn’t see anything about “Bayesian” but I suppose that’s implicit . . .
P.S. If a reconstituted Romney Analytics team is hiring, let me know and I’ll post that ad too.
The other day, a friend told me that when he saw me blogging on Noam Chomsky, he was surprised not to see any mention of disgraced primatologist Marc Hauser. I was like, whaaaaaa? I had no idea these two had any connection. In fact, though, they wrote papers together.
This made me wonder what Chomsky thought of Hauser’s data scandal. I googled *marc hauser noam chomsky* and the first item that came up was this, from July 2011, reported by Tom Bartlett:
I [Bartlett] asked Chomsky for his comment on the Hauser resignation and he e-mailed the following:
Mark Hauser is a fine scientist with an outstanding record of accomplishment. His resignation is a serious loss for Harvard, and given the nature of the attack on him, for science generally.
Chomsky is a mentor of Hauser so I can’t fault Chomsky for defending the guy. But why couldn’t he have stuck with something more general, something like, “I respect and admire Mark Hauser and am not aware of any improprieties in his work.” Or maybe something like, “It is possible that—well, he has published quite a lot in various areas. It’s possible that some of the papers went to press without sufficient rethinking, but I don’t know of any cases.” That’s actually what Chomsky did say, a year earlier.
So what happened, that Chomsky changed his tune and got so aggressive? “A serious loss for Harvard” etc.? My guess (without any evidence, but, hey, I’m free to guess) is that, as we discussed previously, Chomsky seems to be surrounded mostly by admirers and haters. The admirers give no useful feedback, and the haters are so clearly against him that he can ignore them. Basically, he lives in a world in which everything is a battle, so it’s hard for him to do nuance. Or, to put it more carefully, he can do nuance, but if he thinks there’s a war going on, he goes into war mode.
Leading theoretical statistician Larry Wasserman in 2008:
Some of the greatest contributions of statistics to science involve adding additional randomness and leveraging that randomness. Examples are randomized experiments, permutation tests, cross-validation and data-splitting. These are unabashedly frequentist ideas and, while one can strain to fit them into a Bayesian framework, they don’t really have a place in Bayesian inference. The fact that Bayesian methods do not naturally accommodate such a powerful set of statistical ideas seems like a serious deficiency.
To which I responded in the second-to-last paragraph of page 8 here.
Larry Wasserman in 2013:
Some people say that there is no role for randomization in Bayesian inference. In other words, the randomization mechanism plays no role in Bayes’ theorem. But this is not really true. Without randomization, we can indeed derive a posterior for theta but it is highly sensitive to the prior. This is just a restatement of the non-identifiability of theta. With randomization, the posterior is much less sensitive to the prior. And I think most practical Bayesians would consider it valuable to increase the robustness of the posterior.
Exactly! I completely agree with 2013 Larry (and it’s what we say in our Bayesian book, following the ideas of Rubin and others).
I’m happy to see this development. Much of my recent work has involved Bayesian analysis of sample surveys. And, indeed, our models typically assume simple random sampling within poststratification cells. Such models are never correct (even if the survey is conducted using a probability sampling design, nonresponse will not be random), but they’re a useful starting point that we try to approximate in many of our designs. In other settings, we simply don’t have random sampling or random assignment, and then, indeed, our inferences can be more sensitive to our assumptions. The only place I’d disagree with Larry is where he writes “sensitive to the prior”; I’d say “sensitive to the model,” because the data model comes into play too, not just the prior distribution (that is, the model for the parameters).
P.S. Beyond appreciating Larry’s recognition of this particular issue, I find his larger point interesting, that we add noise in different ways to achieve robustness or computational efficiency.
Maybe Niall Ferguson will show up, given his interest in the history of mid-twentieth-century gay English heroes.
Alberto Cairo tells a fascinating story about John Snow, H. W. Acland, and the Mythmaking Problem:
Every human community—nations, ethnic and cultural groups, professional guilds—inevitably raises a few of its members to the status of heroes and weaves myths around them. . . . The visual display of information is no stranger to heroes and myth. In fact, being a set of disciplines with a relatively small amount of practitioners and researchers, it has generated a staggering number of heroes, perhaps as a morale-enhancing mechanism. Most of us have heard of the wonders of William Playfair’s Commercial and Political Atlas, Florence Nightingale’s coxcomb charts, Charles Joseph Minard’s Napoleon’s march diagram, and Henry Beck’s 1933 redesign of the London Underground map. . . .
Cairo’s goal, I think, is not to disparage these great pioneers of graphics but rather to put their work in perspective, recognizing the work of their excellent contemporaries.
I would like to echo Cairo’s message for a slightly different reason: I want to resist the idea that it is desirable to send a message via a single visualization. One of my big problems with graphs such as those of Nightingale and Minard is how they’re celebrated as one-stop marvels. Instead of people trying to create the next Napoleon-map, I’d prefer they graph their data and conclusions directly, using meat-and-potatoes methods such as dotplots, lineplots, and small multiples. Make the grabby viz as well—capture the excitement of the data in a graph, that’s a great thing to do—but consider that as an advertisement or intro to the data, not a substitute for direct display of the information.
By making a new graph that grabbed attention and engaged her audience, Florence Nightingale deserves to be a hero. But if her plots deter people from graphing and looking at the clear and direct time series, that’s a problem—not with Nightingale, but with the myth of the heroic visualization.
Phil Plait writes:
Earth May Have Been Hit by a Cosmic Blast 1200 Years Ago . . . this is nothing to panic about. If it happened at all, it was a long time ago, and unlikely to happen again for hundreds of thousands of years.
This left me confused. If it really did happen 1200 years ago, basic statistics would suggest it would occur approximately once every 1200 years or so (within half an order of magnitude). So where does “hundreds of thousands of years” come from?
I emailed astronomer David Hogg to see if I was missing something here, and he replied:
Yeah, if we think this hit us 1200 years ago, we should imagine that this happens every few thousand years at least. Now that said, if there are *other* reasons for thinking it is exceedingly rare, then that would be a strong a priori argument against believing in the result. So you should either believe that it didn’t happen 1200 years ago, or else you should believe it will happen again in the next few thousand.
So, from a Bayesian standpoint, our prior guess is that this event has very low frequency, thus conditional on the assumed data (1200 years since the most recent event), the estimated frequency should be something a bit less than 1 per 1200 years. But it’s hard to see how we’d get 1 per 300,000 years. That would require a prior that’s so strong that it would be contradicted by the data. (Or, again, perhaps the data are being misinterpreted, but the above analysis is conditional on that interpretation.)
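To put rough numbers on this, here is a quick conjugate-model sketch in R. The Poisson-events-with-Gamma-prior setup and the particular prior strengths are my own illustration (nothing from Hogg or from the paper); the point is just that a prior centered on one event per 300,000 years has to carry the weight of hundreds of thousands of event-free years, and under such a prior the observed event is itself a big surprise.

```r
## Gamma-Poisson sketch of the rate argument above.  The "1 event in
## 1200 years" datum and the 1-per-300,000-years prior guess come from
## the post; the prior shapes (a, b) are made up for illustration.
rate_summary <- function(a, b, events = 1, years = 1200) {
  post_mean  <- (a + events) / (b + years)     # conjugate Gamma posterior mean
  prior_pred <- 1 - (b / (b + years))^a        # prior prob. of at least one event in `years`
  c(prior_years_per_event = b / a,
    post_years_per_event  = 1 / post_mean,
    pr_at_least_one_event = prior_pred)
}

# A weak prior leaning toward rarity: the posterior lands at roughly one
# event per two thousand years, a frequency a bit below 1 per 1200 years.
rate_summary(a = 0.5, b = 2000)

# A prior pinned at 1 per 300,000 years acts like 300,000 event-free
# years of pseudo-data; under it, seeing any event at all in a 1200-year
# window has prior probability of only about 0.4%, so the data and the
# prior are in conflict.
rate_summary(a = 1, b = 300000)
```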
David supplied an update:
Here is the paper and the authors of the paper say the rate is one per 375,000 yr to 3,750,000 yr. (You will enjoy the precision they use in their numbers, given their uncertainties.) Then they say that this is consistent with one event in the last 3000 yr within 2.6 sigma. (They figure the census for such events is complete back to 3000 years, and there is one event found in that census.)
I guess it all depends on how strongly you believe your prior and how strongly you believe the data.
P.S. I wrote this post in January and then put it on the queue, confident that there was no rush. After all, the probability of a major gamma ray burst in any given year is so small!
Tyler Cowen links:
“I’m writing this book because we need to think about the future for more than just 140 characters or 15 minutes at a time if we want to make real long-term progress,” Mr. Thiel said in a statement. “‘Zero to One’ is about learning from Silicon Valley how to solve hard problems and build great things that have never existed before.”
I wonder how Thiel’s previous book turned out?
Jonathan Robinson writes:
I’m a survey researcher who mostly does political work, but I also have a strong interest in economics. I have a question about this graph you commonly see in the economics literature. It is of a concept called the Beveridge Curve [recently in the newspaper here]. It is one of the more interesting concepts in labor economics, relating the vacancy rate in jobs to the unemployment rate. A good primer is here.
However, despite being one of the more interesting concepts in economics, the way it is displayed visually is nothing short of atrocious:
These graphs are nothing short of unreadable and pretty much the standard (Brad Delong has linked to this graph above and it can appear like this in publication as well). I’ve only really seen one representation of the curve that is more clear than this and it is at this link:
Do you have any ideas for making these graphs more readable? I like the second Cleveland Fed graph, but I hardly think it’s ideal, as it ignores close to 50% of the data.
I don’t actually think these graphs are so bad—I assume that they’re readable to the specialists who matter—so I have no particular suggestions. But I’ll throw this out there for the rest of you.
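That said, for anyone who wants something concrete to play with, here is a minimal sketch (with made-up numbers, and not a recommendation) of the plainest meat-and-potatoes alternative: show the two rates as aligned time series rather than as a connected scatterplot.

```r
## A sketch of one alternative display of Beveridge-curve data: the two
## rates as aligned time series instead of a connected scatterplot.
## The series below are made up for illustration.
set.seed(42)
months       <- seq(as.Date("2001-01-01"), as.Date("2013-12-01"), by = "month")
shift        <- as.numeric(months - as.Date("2008-09-01")) / 200
unemployment <- 5.0 + 4.0 * plogis(shift) + rnorm(length(months), 0, 0.15)
vacancies    <- 3.5 - 1.5 * plogis(shift) + rnorm(length(months), 0, 0.10)

par(mfrow = c(2, 1), mar = c(2, 4, 1, 1), bty = "l")
plot(months, unemployment, type = "l", xlab = "", ylab = "Unemployment rate (%)")
plot(months, vacancies,    type = "l", xlab = "", ylab = "Vacancy rate (%)")
```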
Aurelian Muntean writes:
What drew my attention was the causal language used by one of the authors of the article when interviewed in the NPR story, and also by the conclusions of the article itself, which claims large effects.
Although the total sample (self-selecting pregnant women) seems very large (85,176), the subsamples (270, of which 114 were in the sub-subsample showing a statistically significant association) used to support the analysis seem to be too small. Or not?
The different sources of information do seem to be in some conflict:
- The JAMA article reports an autism rate of 1 per 1000 for children of mothers taking folic acid and 2 per 1000 for children of mothers not taking folic acid. (They also report the adjusted odds ratio as 0.6 rather than 0.5, indicating that the two groups differ a bit in some background variables.)
- The NPR article has this quote: “‘That’s a huge effect,’ says Ian Lipkin, one of the study’s authors . . . ‘when you start talking about autism, a disorder that has an incidence of 1 percent or higher, that really does bring it to home,’ Lipkin says. ‘That is a substantial risk.’” How do you get from 1 or 2 per 1000 to “1 percent or higher”?
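Just to spell out the arithmetic behind the rates in the first item above:

```r
## Crude odds ratio implied by the reported rates (1 per 1000 with folic
## acid, 2 per 1000 without).  The adjusted ratio of 0.6 reported in the
## paper differs because it adjusts for background variables.
p_folate    <- 1 / 1000
p_no_folate <- 2 / 1000
(p_folate / (1 - p_folate)) / (p_no_folate / (1 - p_no_folate))   # approximately 0.5
```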
Another issue that arises is multiple comparisons: they studied at least three subpopulations (“114 with autistic disorder, 56 with Asperger syndrome, and 100 with PDD-NOS”) and at least two predictors (“Similar analyses for prenatal fish oil supplements showed no such association with autistic disorder”). So it seems like we’re seeing the most statistically significant of at least 6 comparisons.
In general I recommend addressing multiple comparisons problems by using hierarchical models, but it’s not clear to me exactly what I would do in this case. It would be good to have a general method to recommend for this sort of problem. I think it would involve regularization and informative priors.
Finally, there’s causality—do the folate and non-folate parents differ, on average, in important ways not controlled for in the analysis? I have no idea. I will say, though, that we followed the folate recommendation ourselves.
Ole Rogeberg writes:
Recently read your blogpost on Pinker’s views regarding red and blue states. This might help you see where he’s coming from: The “conflict of visions” thing that Pinker repeats likely refers to Thomas Sowell’s work in the books “A Conflict of Visions” and “The Vision of the Anointed.” The “Conflict of Visions” book is on his top-5 favorite-books list, and in a Q&A interview he explains it as follows:
Q: What is the Tragic Vision vs. the Utopian Vision?
A: They are the different visions of human nature that underlie left-wing and right-wing ideologies. The distinction comes from the economist Thomas Sowell in his wonderful book “A Conflict of Visions.” According to the Tragic Vision, humans are inherently limited in virtue, wisdom, and knowledge, and social arrangements must acknowledge those limits. According to the Utopian vision, these limits are “products” of our social arrangements, and we should strive to overcome them in a better society of the future. Out of this distinction come many right-left contrasts that would otherwise have no common denominator. Rightists tend to like tradition (because human nature does not change), small government (because no leader is wise enough to plan society), a strong police and military (because people will always be tempted by crime and conquest), and free markets (because they convert individual selfishness into collective wealth). Leftists believe that these positions are defeatist and cynical, because if we change parenting, education, the media, and social expectations, people could become wiser, nicer, and more peaceable and generous.
As with Pinker’s writing on red and blue states, I think Pinker is lacking some historical perspective here. I do not think it’s at all correct to say that “rightists like small government.” I think it’s more accurate to say that rightists like large government when it’s controlled by the right, and leftists like large government when it’s controlled by the left. And how do you classify, for example, right-wing clerical governments? Do they have the tragic vision (because they are conservative) or the utopian vision (because they want a single church to be in control)?
And then there is the conjunction between “small government” and “a strong police and military.” Guatemala for many years had small government (for one thing, the people who ran the country did not want to pay taxes) and a strong police and military. Put these together and you get a demonstration of “the tragic vision”: the strong police and military killed hundreds of thousands of Guatemalans. More generally, what does it mean to have a strong police and military with a small government? Who then is in charge of all these armed men?
Pinker’s characterization of leftism seems to miss something too. Try taking his statement, “if we change parenting, education, the media, and social expectations, people could become wiser, nicer, and more peaceable and generous” and flipping it around: “if we change parenting, education, the media, and social expectations, people could become more foolish, nasty, warlike, and selfish.” After all, if people can be changed in one direction, why not in the other direction?
Also, I agree with Pinker that lots of issues get grouped into left and right, but I don’t see tragic vs. utopian as so central. For example, lots of conservatives want to restrict birth control and abortion. I don’t see this fitting into a tragic vision of human nature. Perhaps it is a utopian vision (all potential babies deserve to be born) or a conservative vision (the roles of men and women should remain traditional).
As with many such binary divides, I think this classification tells us more about the person doing the classifying than about the reality that is being classified.
P.S. Just to clarify a few things:
Continue reading ‘I don’t think we get much out of framing politics as the Tragic Vision vs. the Utopian Vision’ »
This is just a local Columbia thing, so I’m posting Sunday night when nobody will read it . . .
Samantha Cooney reports in the Spectator (Columbia’s student newspaper):
Frontiers of Science may be in for an overhaul.
After a year reviewing the course, the Educational Policy and Planning Committee has issued a report detailing its findings and outlining potential ways to make the oft-maligned course more effective. The EPPC’s report, a copy of which was obtained by Spectator, suggests eliminating the lecture portion of the course in favor of small seminars with a standardized curriculum, mirroring other courses in the Core Curriculum.
This seems reasonable to me. It sounds like the seminar portion of the class has been much more successful than the lectures. Once the lectures are removed entirely, perhaps it will allow the students to focus on learning during the seminar periods.
Also, I appreciate that Cooney did a good job quoting me. As I wrote last month, I respect that the organizers of the course did a pre-test, post-test evaluation, but I’m exhausted by all the hype and happy talk around that evaluation in particular and the course more generally. I’m not the world’s greatest teacher so I have a lot of sympathy for organizers of courses that don’t go quite as planned, but I wish they’d be a bit more forthright in admitting their errors—especially for a class that is required of most of the undergraduates.
Avi sent along this old paper from Bryk and Raudenbush, who write:
The presence of heterogeneity of variance across groups indicates that the standard statistical model for treatment effects no longer applies. Specifically, the assumption that treatments add a constant to each subject’s development fails. An alternative model is required to represent how treatment effects are distributed across individuals. We develop in this article a simple statistical model to demonstrate the link between heterogeneity of variance and random treatment effects. Next, we illustrate with results from two previously published studies how a failure to recognize the substantive importance of heterogeneity of variance obscured significant results present in these data. The article concludes with a review and synthesis of techniques for modeling variances. Although these methods have been well established in the statistical literature, they are not widely known by social and behavioral scientists.
This is really important. I’ll have to think about whether we can do more on this, following the lead of my article on treatment effects in before-after data, which made a very similar point but in a less systematic way. Connections between interactions, varying treatment effects, and multilevel models. A really big deal if we can put it together in the right way.
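To see the basic connection in fake data (a minimal sketch, not anything from the Bryk and Raudenbush paper): if treatment effects vary across people and assignment is randomized, the treated group's variance exceeds the control group's by the variance of the individual effects, so unequal variances are exactly the footprint of varying treatment effects.

```r
## Minimal fake-data illustration of the link between varying treatment
## effects and heterogeneity of variance.  All numbers are made up.
set.seed(123)
n      <- 2000
y0     <- rnorm(n, mean = 50, sd = 10)     # potential outcome under control
effect <- rnorm(n, mean = 5, sd = 8)       # person-level treatment effects
z      <- rbinom(n, 1, 0.5)                # randomized assignment
y      <- ifelse(z == 1, y0 + effect, y0)  # observed outcome

# The mean comparison recovers the average effect; the variance
# comparison is what reveals that the effect varies across people.
mean(y[z == 1]) - mean(y[z == 0])          # roughly 5
var(y[z == 1]) - var(y[z == 0])            # roughly 8^2 = 64
```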
Pearl reports that his Journal of Causal Inference has just posted its first issue, which contains a mix of theoretical and applied papers. Pearl writes that they welcome submissions on all aspects of causal inference.
Torbjørn Skardhamar writes:
Continue reading ‘Using trends in R-squared to measure progress in criminology??’ »
Corey Yanofsky writes:
In your work, you’ve robustificated logistic regression by having the logit function saturate at, e.g., 0.01 and 0.99, instead of 0 and 1. Do you have any thoughts on a sensible setting for the saturation values? My intuition suggests that it has something to do with proportion of outliers expected in the data (assuming a reasonable model fit).
It would be desirable to have them fit in the model, but my intuition is that integrability of the posterior distribution might become an issue.
My reply: it should be no problem to put these saturation values in the model; I bet it would work fine in Stan if you give them uniform(0, 0.1) priors or something like that. Or you could just fit the robit model.
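For concreteness, here is roughly what that saturated model could look like in Stan, called from R via rstan. This is a sketch based on the description above, with the uniform(0, 0.1) priors implied by the declared bounds; the data names are placeholders, and I haven't run it.

```r
## Sketch of the robustified ("saturated") logistic regression described
## above: Pr(y = 1) = alpha0 + (1 - alpha0 - alpha1) * inv_logit(X * beta),
## with uniform(0, 0.1) priors on the saturation parameters implied by
## their declared bounds.  N, K, X, y are placeholders.
library(rstan)

saturated_logit_code <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  vector[K] beta;
  real<lower=0, upper=0.1> alpha0;   // floor of the success probability
  real<lower=0, upper=0.1> alpha1;   // one minus the ceiling
}
model {
  beta ~ normal(0, 2.5);
  y ~ bernoulli(alpha0 + (1 - alpha0 - alpha1) * inv_logit(X * beta));
}
"

# fit <- stan(model_code = saturated_logit_code,
#             data = list(N = nrow(X), K = ncol(X), X = X, y = y))
```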
And this reminds me . . . I’ve been told that when Stan’s on its optimization setting, it fits generalized linear models just about as fast as regular glm or bayesglm in R. This suggests to me that we should have some precompiled regression models in Stan, then we could run all those regressions that way, and we could feel free to use whatever priors we want.
Sometimes I get books in the mail with titles like, Statistics for Everyone, or Whassup with American Politics?, and I don’t know what to say about them because I’m clearly not their target audience: from my work, I already pretty much know everything that’s going to be in those books.
Yesterday, though, what came in the mail but this book by psychologists Elizabeth Dunn and Michael Norton that’s written just for people like me: rich comfortable Americans with lots of spending money who want to use that money to be even happier than they already are. I have some mixed feelings about this goal (don’t people like me have enough happiness as it is???) and I think that Dunn and Norton do too, in that, although they give occasional hints as to their affluent audience (referring, for example, to luxury cars, houses with swimming pools in exclusive suburbs, massages at the Four Seasons Hotel, and bullfight tickets (yuck!) for a “dream vacation in Spain”), they don’t seem to offer any explicit discussion of who they are aiming their advice at.
I certainly don’t think it’s wrong for Dunn and Norton to try to improve the happiness of America’s economic elite—after all, I teach at Harvard and Columbia!—I just think it’s a central, if unstated, aspect of the book. And, to be fair, many of their suggestions would work with lower-income people as well.
The book was sent to me, perhaps, because it overlaps with two of my research interests: happiness studies, and the statistical interpretation of psychology research.
Many of the studies have that “Psychological Science” look to them, for example this one by Kathleen Vohs, Nicole Mead, and Miranda Goode that reported a huge effect on participants as “a function of whether they had been exposed earlier to a fish screensaver, a blank screen, or a money screensaver”:
Unfortunately, when I looked at the linked paper I could not find any information on data, experimental conditions, who was in the study, or even the sample sizes. The effect appears so huge that I don’t even know how to think about it.
Overall I found Dunn and Norton’s advice to be reasonable, and I appreciated their effort to link their recommendations to experimental data.
At some points, though, their general theoretical framework seemed to contradict their specific recommendations. For example, one of their general points is that one should “buy experiences rather than material goods,” but then to illustrate that point, they give an example of Harvard dorm rooms. Apparently, some dorms at that Ivy-leaved haven are much more desired than others, but retrospective evaluations found that the students who were randomly assigned to the more sought-after dorms were no happier, on average, than those assigned to the less-popular dorms. B-b-b-ut . . . living in a dorm room for a year or two is an experience, not a material good! So this example seems to go against their argument. Here is one very salient experience that doesn’t matter as much as people think.
To be fair, another one of the authors’ points is that everyday experiences (such as living in a nice dorm room) are less important than special experiences (like that bullfight—ugh, I don’t want to keep thinking about that one…). So the Harvard dorm story could work for them, it’s just in the wrong chapter. It was just funny to see an example right at the beginning of chapter 1 that contradicted the chapter’s message.
There’s something about the book’s tone that seems off to me—they talk about everyday happiness and they talk about national policy, but they never seem to get around to the tough problems that can make people really sad. It seems odd for a book on happiness and spending not to discuss anything about how to spend money to alleviate depression: should depressed people spend their money on drugs, on therapy, should they change their jobs, etc.? There must be a lot of research by psychologists on this topic.
But, within the narrow bounds of the book’s topics, I appreciate the focus on research. My favorite parts are when the authors describe the experiments that they and their collaborators did to answer this or that question about everyday happiness.
Edward Wyatt reports:
Now the Obama administration is cracking down on what many call patent trolls, shell companies that exist merely for the purpose of asserting that they should be paid . . . “The United States patent system is vital for our economic growth, job creation, and technological advance,” [Senator] Leahy said in a statement. “Unfortunately, misuse of low-quality patents through patent trolling has tarnished the system’s image.”
There is some opposition:
But some big software companies, including Microsoft, expressed dismay at some of the proposals, saying they could themselves stifle innovation.
Microsoft . . . patent trolls . . . hmmm, where have we heard this connection before?
There is also some support for the bill:
“These guys are terrorists,” said John Boswell, chief legal officer for SAS, a business software and services company, at a panel discussion on Tuesday. SAS was cited in the White House report as an example of a company that has spent millions to defend itself against what it believes are frivolous lawsuits.
Austin Kelly writes:
While reading your postings [or here] on the subject of testing your model by running fake data I was reminded of the fact that I got one of these kinds of tests actually published in a GAO report back in the day. Reading your posts on Unz and political vs. economic discourse made me think of that work again. I thought I’d actually drop you a line on the subject.
Back in 2003 GAO was asked to look at Farmer Mac, including a look at the Farm Credit Agency’s regulation of Farmer Mac. As the resident mortgage econometrician back then I was asked to look at FCA’s risk-based capital stress test for Farmer Mac. The work was pretty easy. I found a lot of oddities, but the biggest one was that they were using a discrete-choice setup (loan goes bad or doesn’t) instead of a hazard model (loan goes bad this period or survives to the next). Not necessarily a problem – lots of mortgage models run that way. But you have to be really careful with your independent variables. FCA’s academic consultants weren’t. They defined as an “independent” variable the largest drop in farmland prices from mortgage origination to now, or to the date the mortgage went bad, whichever came first.
I always get a little suspicious when the event you are trying to predict gets incorporated as part of the definition of the variable that’s supposed to explain the event. As a student of Jim Heckman’s I recalled being taught in a labor econometrics class back in the 1980s that you really couldn’t do that. [We discuss related issues in section 9.7 of ARM (link to chapter here), under the heading, "Do not control for post-treatment variables."] I searched through Heckman’s old reading list, JSTOR, etc. but couldn’t come up with a proof of why that doesn’t work. Best I could do was Yamaguchi’s book on event history analysis that gave a verbal example of this kind of technique failing, but no proof.
So I decided that the easiest thing to do was to simulate data, with loan failure in any discrete time period being determined by SAS’s Ranuni function with no reference to farmland price change, farmland price change taken from historical data, and the independent variable calculated as it was in the model (change in farmland price from origination to current or fail, whichever comes first), and run the regressions. Even though the true effect was zero by construction, I got significant and negative coefficients over 90% of the time. That was the “proof” that got into the appendix of the GAO report. Oddly enough, about the same time that I did this someone in Michigan’s B-School was doing the hard slog of writing out the likelihood functions and formally proving that FCA’s setup was inconsistent (I don’t have access to JSTOR anymore so don’t have an easy way of finding key facts, like his name or the citation). Generating some fake data was a lot easier, and apparently more persuasive to my non-quant colleagues. The report is here.
Reading your post on academic vs political, my first thought was that just about every time I’ve engaged an academic in a “political” sphere they’ve adopted “political” discourse. I remember another case where academics had estimated the impacts of Economic Development Administration grants on county level employment, without controlling for the size of the county – ignoring the fact that a county with 1,000,000 workers was more likely to get a grant in the first place than a county with 10,000 workers. Their coefficient implied that every ten thousand dollars in EDA spending created a permanent job! Their main response to criticism was to tell us about how many PhDs they had.
Over my career I could point to many cases where the response from an academic was political discourse. But I can also think of many cases where it was academic. It’s just that the political responses are the ones that stick in the craw and are most easily remembered.
Regarding academic vs political discourse, I agree completely that academics often seem to care more about short-term winning than about getting things right. My point in that blog post was not that academics are better or more honorable than politicians, but rather that the rules are different. We would like an academic to engage in open discourse and not use the truth as negotiation chits, and when they behave in political ways we are unhappy. In contrast, a politician is supposed to negotiate. If a politician makes concessions without getting anything in return, we respect him less.
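Getting back to the fake-data simulation: the check Kelly describes is easy to reconstruct in outline. Here is a minimal sketch in R rather than SAS, with a simulated price index standing in for the historical farmland series. Failure is pure noise by construction, but because failing early mechanically shortens the window over which the largest drop is computed, the coefficient on that variable should come out negative and "significant" in most replications, echoing what Kelly found.

```r
## Sketch of the fake-data check described above.  Loan failure is pure
## noise; the "largest price drop from origination until now or until
## failure" covariate is computed as in the flawed model.  The price
## index is simulated here, standing in for the historical data.
set.seed(1)

one_sim <- function(n_loans = 2000, horizon = 40, fail_prob = 0.03) {
  prices <- cumprod(c(1, exp(rnorm(200, 0, 0.03))))      # fake price index
  origin <- sample(1:(length(prices) - horizon), n_loans, replace = TRUE)

  fail <- numeric(n_loans)
  drop <- numeric(n_loans)
  for (i in 1:n_loans) {
    life    <- which(runif(horizon) < fail_prob)[1]      # random failure time (NA = survived)
    fail[i] <- !is.na(life)
    end     <- origin[i] + ifelse(is.na(life), horizon, life)
    path    <- prices[origin[i]:end]
    drop[i] <- max(1 - path / path[1])                   # largest drop observed so far
  }
  coef(summary(glm(fail ~ drop, family = binomial)))["drop", ]
}

sims <- t(replicate(100, one_sim()))
mean(sims[, "Pr(>|z|)"] < 0.05)   # share of replications with a spurious "significant" effect
```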
Much of statistical practice is an effort to reduce or deny variation and uncertainty. The reduction is done through standardization, replication, and other practices of experimental design, with the idea being to isolate and stabilize the quantity being estimated and then average over many cases. Even so, however, uncertainty persists, and statistical hypothesis testing is in many ways an endeavor to deny this, by reporting binary accept/reject decisions.
Classical statistical methods produce binary statements, but there is no reason to assume that the world works that way. Expressions such as Type 1 error, Type 2 error, false positive, and so on, are based on a model in which the world is divided into real and non-real effects. To put it another way, I understand the general scientific distinction of real vs. non-real effects but I do not think this maps well into the mathematical distinction of θ=0 vs. θ≠0. Yes, there are some unambiguously true effects and some that are arguably zero, but I would guess that the challenge in most current research in psychology is not that effects are zero but that they vary from person to person and in different contexts.
But if we do not want to characterize science as the search for true positives, how should we statistically model the process of scientific publication and discovery? An empirical approach is to identify scientific truth with replicability; hence, the goal of an experimental or observational scientist is to discover effects that replicate in future studies.
The replicability standard seems to be reasonable. Unfortunately, as Francis (in press) and Simmons, Nelson, and Simonsohn (2011) have pointed out, researchers in psychology (and, presumably, in other fields as well) seem to have no problem replicating and getting statistical significance, over and over again, even in the absence of any real effects of the size claimed by the researchers.
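A small simulation shows how little it takes (this is a generic sketch, not the design of any of the papers just cited): with no true effects anywhere, testing two correlated outcomes, peeking once at an interim sample, and reporting whichever comparison looks best is enough to push the false-positive rate well above the nominal 5 percent.

```r
## Minimal "researcher degrees of freedom" simulation: no true effect,
## two correlated outcomes, one interim peek, keep the best p-value.
set.seed(2013)
p_min <- function(n = 40) {
  group <- rep(0:1, length.out = 2 * n)                  # randomized assignment
  y1 <- rnorm(2 * n)                                     # outcome 1: no true effect
  y2 <- 0.7 * y1 + rnorm(2 * n, 0, sqrt(1 - 0.7^2))      # outcome 2: correlated, no effect
  ps <- c(t.test(y1 ~ factor(group))$p.value,            # outcome 1, full sample
          t.test(y2 ~ factor(group))$p.value,            # outcome 2, full sample
          t.test(y1[1:n] ~ factor(group[1:n]))$p.value)  # outcome 1, interim peek
  min(ps)                                                # report the best-looking comparison
}
mean(replicate(5000, p_min()) < 0.05)   # well above the nominal 0.05
```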
. . .
As a student many years ago, I heard about opportunistic stopping rules, the file drawer problem, and other reasons why nominal p-values do not actually represent the true probability that observed data are more extreme than what would be expected by chance. My impression was that these problems represented a minor adjustment and not a major reappraisal of the scientific process. After all, given what we know about scientists’ desire to communicate their efforts, it was hard to imagine that there were file drawers bulging with unpublished results.
More recently, though, there has been a growing sense that psychology, biomedicine, and other fields are being overwhelmed with errors (consider, for example, the generally positive reaction to the paper of Ioannidis, 2005). In two recent series of papers, Gregory Francis and Uri Simonsohn and collaborators have demonstrated too-good-to-be-true patterns of p-values in published papers, indicating that these results should not be taken at face value.
. . .
Although I do not know how useful Francis’s particular method is, overall I am supportive of his work as it draws attention to a serious problem in published research.
Finally, this is not the main point of the present discussion but I think that my anti-hypothesis-testing stance is stronger than that of Francis (in press). I disagree with the following statement from that article:
For both confirmatory and exploratory research, a hypothesis test is appropriate if the outcome drives a specific course of action. Hypothesis tests provide a way to make a decision based on data, and such decisions are useful for choosing an action. If a doctor has to determine whether to treat a patient with drugs or surgery, a hypothesis test might provide useful information to guide the action. Likewise, if an interface designer has to decide whether to replace a blue notification light with a green notification light in a cockpit, a hypothesis test can provide guidance on whether an observed difference in reaction time is different from chance and thereby influence the designer’s choice.
I have no expertise on drugs, surgery, or human factors design and so cannot address these particular examples—but, speaking in general terms, I think Francis is getting things backward here. When making a decision, I think it is necessary to consider effect sizes (not merely the possible existence of a nonzero effect) as well as costs. Here I speak not of the cost of hypothetical false positives or false negatives but of the direct costs and benefits of the decision. An observed difference can be relevant to a decision whether or not that difference is statistically significant.
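A toy calculation, with invented numbers, of the kind of reasoning I have in mind:

```r
## Toy decision calculation; every number here is invented.  The point
## is that the decision depends on the estimated effect size, its
## uncertainty, and the costs, not on whether the difference clears a
## significance threshold.
set.seed(7)
effect_draws     <- rnorm(10000, mean = 0.02, sd = 0.015)  # simulated posterior for the reaction-time gain (seconds)
value_per_second <- 1000   # invented value of a 1-second-faster average reaction
switch_cost      <- 5      # invented one-time cost of switching the light

mean(effect_draws > 0)                                 # about 0.91: not "significant" by the usual standard
value_per_second * mean(effect_draws) - switch_cost    # expected net benefit is positive: switch anyway
```

In this made-up example the observed difference would not reach conventional significance, yet the expected net benefit of switching is positive, so the decision-relevant summaries are the effect size and the costs.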
Louis Mittel writes:
The premise of the column this guy is starting is interesting: Noah Davis interviews a smart person and then interviews the smartest person that smart person knows and so on.
It reminded me of you mentioning survey design strategy of asking people about other people, like “How many people do you know named Stuart?” or “How many people do you know that have had an abortion?”
Ignoring the interview aspect of what this guy is doing, I think there are some cool questions about the distribution/path behavior of smartest-person-I-know chains (say, seeded at random). Do they loop? If so, how long do they run before looping, and how large are the loops? What parts of the population do they explore? Do you know of anything that’s been done on something like this?
My reply: Interesting question. It could be asked of any referral chain, for example asking a sequence of people, “Who’s the tallest person you know?” or “Who’s the best piano player you know?” or “Who’s the weirdest person you know?” or whatever. But let’s stick with the “who’s the smartest” chain.
In answer to Louis’s first question: yes, such a chain would have to loop, as there’s only a finite number of people. Some of the loops might be pretty short. For example, if you ask Stephen Hawking for the smartest person he knows, and then ask that next person, you’ll probably loop back to . . . Stephen Hawking. The distribution of lengths of the loops, that I have no idea.
I’m trying to think how one could measure the distribution of this sort of referral network.
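One cheap way to build intuition is simulation. Here is a sketch with a completely arbitrary model of who knows whom (each person knows 50 random others and names the smartest of them, never themselves; see the P.P.S. below), so the particular numbers mean nothing, but it spits out the tail-length and loop-size distributions directly.

```r
## Simulation sketch of "smartest person I know" chains under an
## arbitrary acquaintance model: each person knows k random others and
## points to the smartest of them (never themselves).
set.seed(99)
n <- 5000; k <- 50
smartness <- rnorm(n)
points_to <- sapply(1:n, function(i) {
  friends <- sample((1:n)[-i], k)           # k random acquaintances, excluding self
  friends[which.max(smartness[friends])]    # the smartest person they know
})

follow <- function(start) {
  seen <- integer(0); i <- start
  while (!(i %in% seen)) { seen <- c(seen, i); i <- points_to[i] }
  cycle_start <- match(i, seen)
  c(steps_before_loop = cycle_start - 1,
    loop_length       = length(seen) - cycle_start + 1)
}

chains <- t(sapply(sample(1:n, 500), follow))
summary(chains[, "steps_before_loop"])   # how long chains run before entering a loop
table(chains[, "loop_length"])           # distribution of loop sizes (small relative to n)
```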
P.S. All in all, the guy in the above-linked interview seemed reasonable, but I was struck by this one bit, where he writes of one of his early business experiences:
Our head of IT at the time was adamant that we should start an Internet Service Provider because it was hard to get onto the Internet if you weren’t at college, and ISPs were growing something like 1,000 percent a month. He tried to convince me to invest $10,000 to start an ISP in Cambridge. . . . I thought it was the stupidest thing ever. That was the end of my Internet foray. If I had listened to him, I would have been like Zuckerberg or something. I completely missed the boat back then.
Everything’s relative. The guy is rich, successful, can do anything he wants. But he thinks he missed the boat.
P.P.S. One thing that came up in comments is, can people refer to themselves? I assume not, otherwise all chains would eventually dead-end at Stephen Hawking, Scott Adams, and that albedo guy.
Mark Palko asks what I think of this article by Francisco Louca, who writes about “‘hybridization’, a synthesis between Fisherian and Neyman-Pearsonian precepts, defined as a number of practical proceedings for statistical testing and inference that were developed notwithstanding the original authors, as an eventual convergence between what they considered to be radically irreconcilable.”
To me, the statistical ideas in this paper are too old-fashioned. The issue is not that the Neyman-Pearson and Fisher approaches are “irreconcilable” but rather that neither does the job in the sort of hard problems that face statistical science today. I’m thinking of technically difficult models such as hierarchical Gaussian processes and also challenges that arise with small sample size and multiple testing. Neyman, Pearson, and Fisher all were brilliant, and they all developed statistical methods that remain useful today, but I think their foundations are out of date. Yes, we currently use many of Fisher’s, Neyman’s, and Pearson’s ideas, but I don’t think either of their philosophies, or any convex mixture of the two, will really work anymore, as general frameworks for inference. Ioannidis, Bem, Simonsohn, Kanazawa, etc. Not to mention hierarchical models.
One example we give to illustrate Benford’s law is the first digits of addresses. Javier Marquez Pena had a survey and, just for laffs, he looked at the distribution of first digits:
Cool—it really works!
P.S. The y-axis shouldn’t go below zero, and I’d much prefer an L-type graphics box (par(bty="l")) rather than the square, but those are familiar problems with R defaults.
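For anyone who wants to redo the comparison (and the graph), here is a minimal sketch with made-up street numbers standing in for the survey data; the plotting lines use the y-axis and L-shaped-box settings just mentioned.

```r
## Benford's-law comparison with made-up address numbers (the real check
## would use the first digits from the survey responses).
set.seed(1)
addresses   <- round(exp(runif(5000, log(1), log(20000))))  # fake street numbers, roughly scale-invariant
first_digit <- as.numeric(substr(as.character(addresses), 1, 1))
observed    <- table(factor(first_digit, levels = 1:9)) / length(first_digit)
benford     <- log10(1 + 1 / (1:9))                         # Benford's-law frequencies

# Plot with the fixes mentioned above: y-axis starting at zero and an
# L-shaped graphics box instead of the default full box.
par(bty = "l")
plot(1:9, observed, type = "h", ylim = c(0, max(benford, observed)),
     xlab = "First digit of address", ylab = "Proportion")
points(1:9, benford, pch = 19)    # Benford's-law predictions
```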