Someone pointed me to this amusing/horrifying story of a clueless oldster.

Some people are horrified by what the old guy said, other people are horrified by how he was treated. He was clueless in his views about women in science, or he was cluelessly naive about gotcha journalism. I haven’t been following the details and so I’ll express no judgment one way or another.

My comment on the episode is that I find the whole “nobel laureate” thing a bit tacky in general. People get the prize and they get attention for all sorts of stupid things, people do all sorts of things in order to try to get it, and, beyond all that, research shows that not getting the Nobel Prize reduces your expected lifespan by two years. Bad news all around.

P.S. Regarding this case in particular, Basbøll points to this long post from Louise Mensch. Again, I don’t want to get involved in the details here, but I am again reminded how much I prefer blogs to twitter. On the positive side, I prefer a blog exchange to a twitter debate. And, on the negative side, I’d rather see a blogwar than a twitter mob.

The principal of a popular elementary school in Harlem acknowledged that she forged answers on students’ state English exams in April because the students had not finished the tests . . . As a result of the cheating, the city invalidated several dozen English test results for the school’s third grade.

The school is a new public school—it opened in 2011—that is run jointly by the New York City Department of Education and Columbia University Teachers College.

So far, it just seems like an unfortunate error. According to the news article, “Nancy Streim, associate vice president for school and community partnerships at Teachers College, said Ms. Worrell-Breeden had created a ‘culture of academic excellence’” at the previous school where she was principal. Maybe Worrell-Breeden just cared too much and, under too much pressure to succeed, cracked and helped the students cheat.

But then I kept reading:

In 2009 and 2010, while Ms. Worrell-Breeden was at P.S. 18, she was the subject of two investigations by the special commissioner of investigation. The first found that she had participated in exercise classes while she was collecting what is known as “per session” pay, or overtime, to supervise an after-school program. The inquiry also found that she had failed to offer the overtime opportunity to others in the school, as required, before claiming it for herself.

The second investigation found that she had inappropriately requested and obtained notarized statements from two employees at the school in which she asked them to lie and say that she had offered them the overtime opportunity.

After those findings, we learn, “She moved to P.S. 30, another school in the Bronx, where she was principal briefly before being chosen by Teachers College to run its new school.”

So, let’s get this straight: She was found to be a liar, a cheat, and a thief, and then, with that all known, she was hired to two jobs as school principal??

The news article quotes Nancy Streim of Teachers College as saying, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.”

On balance, huh? Whatever else you can say about Worrell-Breeden, she seems to have had the talent of conning powerful people. Or maybe just one or two powerful people in the Department of Education who had the power to get her these jobs.

This is really bad. Is it so hard to find a school principal that you have no choice but to hire someone who lies, cheats, and steals?

It just seems weird to me. I accept that all of us have character flaws, but this is ridiculous. Principal is a supervisory position. What kind of toxic environment will you have in a school where the principal is in the habit of forging documents and instructing employees to lie? How could this possibly be considered a good idea?

Here’s the blurb on the relevant Teachers College official:

Nancy Streim joined Teachers College in August 2007 in the newly created position of Associate Vice President for School and Community Partnership. . . . Dr. Streim comes to Teachers College after nineteen years at the University of Pennsylvania’s Graduate School of Education where she most recently served as Associate Dean for Educational Practice. . . . She recently completed a year long project for the Bill and Melinda Gates Foundation in which she documented principles underlying successful university-assisted public schools across the U.S. She has served as principal investigator for five major grant-funded projects that address the teaching and learning of math and science in elementary and middle grades.

It’s not clear to me whether Streim actually thought Worrell-Breeden was the best person for the job. Reading between the lines, maybe what happened is that Worrell-Breeden was plugged into the power structure at the Department of Education and someone at the D.O.E. lined up the job for her.

In a talk I found online, Streim says something about “patient negotiations” with school officials. Maybe a few years ago someone in power told her: Yes, we’ll give you a community school to run, but you have to take Worrell-Breeden as principal. I don’t know, but it’s possible.

I guess I’d prefer to think that Teachers College made a dirty but necessary deal. That’s more palatable to me than the idea that the people at the Department of Education and Teachers College thought it was a good idea to hire a liar/cheat/thief as a school principal.

Or maybe I’m missing the point? Perhaps integrity is not so important. The world is full of people with integrity but no competence, and we wouldn’t want that either.

Someone pointed me to this discussion by Lior Pachter of a controversial claim in biology.

The statistics

The statistical content has to do with a biology paper by M. Kellis, B. W. Birren, and E.S. Lander from 2004 that contains the following passage:

Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.

Here’s where the 95% came from. In Pachter’s words:

The authors identified 457 duplicated gene pairs that arose by whole genome duplication (for a total of 914 genes) in yeast. Of the 457 pairs 76 showed accelerated (protein) evolution in S. cerevisiae. The term “accelerated” was defined to relate to amino acid substitution rates in S. cerevisiae, which were required to be 50% faster than those in another yeast species, K. waltii. Of the 76 genes, only four pairs were accelerated in both paralogs. Therefore 72 gene pairs showed acceleration in only one paralog (72/76 = 95%).

In his post on the topic, Pachter asks for a p-value for this 72/76 result which the authors of the paper in question had called “surprising.”

My first thought on the matter was that no p-value is needed because 72 out of 76 is such an extreme proportion. I guess I’d been implicitly comparing to a null hypothesis of 50%. Or, to put it another way, if you have 76 pairs, among which 80 genes were accelerated (I think I did this right and that I’m not butchering the technical jargon: I got 80 by taking 72 pairs with one accelerated paralog plus 4 pairs with two accelerated paralogs each), it would be extremely, extremely unlikely to see only four pairs with acceleration in both.

But, then, as I read on, I realized this isn’t an appropriate comparison. Indeed, the clue is above, where Pachter notes that there were 457 pairs in total, thus in a null model you’re working with a probability of 80/(2*457) = 0.087, and when the probability is 0.087, it’s not so unlikely that you’d only see 4 pairs out of 457 with two accelerated paralogs. (Just to get the order of magnitude, 0.087^2 = 0.0077, and 0.0077*457 = 3.5, so 4 pairs is pretty much what you’d expect.)
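Here is that order-of-magnitude calculation as a few lines of Python. It is only a sketch, assuming the simplest possible null model in which each of the 914 genes is accelerated independently with the same probability; the variable names are mine, and nothing here is the model used by Kellis et al. or by Pachter.

```python
from math import comb

n_pairs = 457                  # duplicated gene pairs, from Pachter's summary
n_accel_genes = 72 + 2 * 4     # 72 pairs with one accelerated paralog, 4 with two

# Assumed null model: each gene is accelerated independently with probability p.
p = n_accel_genes / (2 * n_pairs)

# Expected numbers of pairs with both paralogs, and with exactly one, accelerated.
expected_both = n_pairs * p ** 2
expected_one = n_pairs * 2 * p * (1 - p)

# Among pairs showing any acceleration, the share expected to show it in only one paralog.
share_one_only = expected_one / (expected_one + expected_both)

def binom_pmf(k, n, q):
    """Probability of exactly k successes in n independent trials with success probability q."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)

# Under the same null, the number of "both accelerated" pairs is Binomial(n_pairs, p^2).
prob_at_most_4 = sum(binom_pmf(k, n_pairs, p ** 2) for k in range(5))

print(f"p = {p:.4f}")
print(f"expected pairs with both accelerated = {expected_both:.2f}")
print(f"expected share with only one accelerated = {share_one_only:.3f}")
print(f"P(4 or fewer pairs with both accelerated) = {prob_at_most_4:.2f}")
```

Under this assumed null, the expected share of accelerated pairs with only one accelerated paralog is itself about 95%, which is another way of seeing why the 72 out of 76 is not striking once the denominator is 457.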

So it sounds like Kellis et al. got excited by this 72 out of 76 number, without being clear on the denominator. I don’t know enough about biology to comment on the implications of this calculation on the larger questions being asked.

Pachter frames his criticisms around p-values, a perspective I find a bit irrelevant, but I agree with his larger point that, where possible, probability models should be stated explicitly.

The link between the scientific theory and statistical theory is often a weak point in quantitative research. In this case, the science has something to do with genes and evolution, and the statistical model is what allowed Kellis et al. to consider 72 out of 76 to be “striking” and “surprising.” It is all too common for a researcher to reject a null hypothesis that is not clearly formed, in order to then make a positive claim of support for some preferred theory. But a lot of steps are missing in such an argument.

The culture

The cultural issue is summarized in this comment by Michael Eisen:

The more this conversation goes on the more it disturbs me [Eisen]. Lior raised an important point regarding the analyses contained in an influential paper from the early days of genome sequencing. A detailed, thorough and occasionally amusing discussion ensued, the long and the short of which to any intelligent reader should be that a major conclusion of the paper under discussion was simply wrong. This is, of course, how science should proceed (even if it rarely does). People make mistakes, others point them out, we all learn something in the process, and science advances.

However, I find the responses from Manolis and Eric to be entirely lacking. Instead of really engaging with the comments people have made, they have been almost entirely defensive. Why not just say “Hey look, we were wrong. In dealing with this complicated and new dataset we did an analysis that, while perhaps technically excusable under some kind of ‘model comparison defense’ was, in hindsight, wrong and led us to make and highlight a point that subsequent data and insights have shown to be wrong. We should have known better at the time, but we’ve learned from our mistake and will do better in the future. Thanks for helping us to be better scientists.”

Sadly, what we’ve gotten instead is a series of defenses of an analysis that Manolis and Eric – who is no fool – surely know by this point was simply wrong.

In an update, Pachter amplifies upon this point:

One of the comments made in response to my post that I’d like to respond to first was by an author of KBL [Kellis, Birren, and Lander; in this case the comment was made by Kellis] who dismissed the entire premise of my challenge writing “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become, as Ferguson and Heene noted, a vast graveyard of undead theories?

Indeed. One of the things I’ve been fighting against recently (for example, in my article, It’s too hard to publish criticisms and obtain data for replication, or in this discussion of some controversial comments about replication coming from a cancer biologist) is the idea that, once something is published, it should be taken as truth. This attitude, of raising a high bar to post-publication criticism, is sometimes framed in terms of fairness. But, as I like to say, what’s so special about publication in a journal? Should there be a high barrier to criticisms of claims made in Arxiv preprints? What about scrawled, unpublished lab notes??? Publication can be a good way of spreading the word about a new claim or finding, but I don’t don’t don’t don’t don’t like the norm in which something that is published should not be criticized.

To put it another way: Yes, ha ha ha, let’s spend our time on guitar practice rather than exhuming 11-year-old published articles. Fine—I’ll accept that, as long as you also accept that we should not be citing 11-year-old articles.

If a paper is worth citing, it’s worth criticizing its flaws. Conversely, if you don’t think the flaws in your 11-year-old article are worth careful examination, maybe there could be some way you could withdraw your paper from the published journal? Not a “retraction,” exactly, maybe just an Expression of Irrelevance? A statement by the authors that the paper in question is no longer worth examining as it does not relate to any current research concerns, nor are its claims of historical interest. Something like that. Keep the paper in the public record but make it clear that the authors no longer stand behind its claims.

P.S. Elsewhere Pachter characterizes a different work of Kellis as “dishonest and fraudulent.” Strong words, considering Kellis is a tenured professor at MIT who has received many awards. As an outsider to all this, I’m wondering: Is it possible that Kellis is dishonest, fraudulent, and also a top researcher? Kinda like how Linda is a bank teller who is also a feminist? Maybe Kellis is an excellent experimentalist but with an unfortunate habit of making overly broad claims from his data? Maybe someone can help me out on this.

Did anyone else notice that this DC multiple-murder case seems just like a Pelecanos story?

Check out the latest headline, “D.C. Mansion Murder Suspect Is Innocent Because He Hates Pizza, Lawyer Says”:

Robin Flicker, a lawyer who has represented suspect Wint in the past but has not been officially hired as his defense attorney, says police are zeroing in on Wint because his DNA was found on pizza at the crime scene. The only problem, Flicker said is that Wint doesn’t like pizza.

“He doesn’t eat pizza,” Flicker told ABC News. “If he were hungry, he wouldn’t order pizza.”

When I saw the DC setting, the local businessman, the manhunt, and the horror/comic story of a pizza-ordering killer, I thought about Pelecanos immediately. And then I noticed that the victim’s family was Greek. Can’t get more Pelecanos than that.

I googled *pizza murder dc pelecanos* but I didn’t see any hits at all. I can’t figure that one out: surely someone would interview him for his thoughts on this one?

Mon: Ripped from the pages of a George Pelecanos novel

Tues: “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

Wed: What do I say when I don’t have much to say?

Thurs: “Women Respond to Nobel Laureate’s ‘Trouble With Girls’”

Fri: This sentence by Thomas Mallon would make Barry N. Malzberg spin in his grave, except that he’s still alive so it would just make him spin in his retirement

Sat: If you leave your datasets sitting out on the counter, they get moldy

Last week I ran into a younger colleague who said he had a conference deadline that week and could we get together next week, maybe? So I contacted him on the weekend and asked if he was free. He responded:

This week quickly got booked after last week’s NIPS deadline.

So we’re meeting in another week. That’s busy for you: after one week off the grid, he had a week’s worth of pent-up meetings! I thought I was busy, but it’s nothing like that.

And this made me formulate my idea of the 3 Stages of Busy. It goes like this:

Stage 1 (early career): Not busy, at least not with external commitments. You can do what you want.

Stage 2 (mid career; my friend described above): Busy, overwhelmed with obligations.

Stage 3 (late career; me): So busy that it’s pointless to schedule anything, so you can do what you want (including writing blogs two months in advance!).

There’s this study done by the Pew Research Center and Smithsonian Magazine . . . they called up one thousand and one Americans. I do not understand why it is a thousand and one rather than just a thousand. Maybe a thousand and one just seemed sexier or something. . . .

I think I know the answer to this one! The survey may well have aimed for 1000 people, but you can’t know ahead of time exactly how many people will respond. They call people, leave messages, call back, call back again, etc. The exact number of people who end up in the survey is a random variable.
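A toy simulation makes the point. The numbers here are made up for illustration (10,000 phone numbers dialed, a 10% chance that any given number yields a completed interview); neither comes from the Pew/Smithsonian survey, and real survey operations are more complicated than independent coin flips.

```python
import random

def completed_interviews(n_dialed, response_rate, rng):
    """Count completed interviews when each dialed number responds independently."""
    return sum(rng.random() < response_rate for _ in range(n_dialed))

rng = random.Random(0)   # fixed seed so the run is reproducible

# Hypothetical design: dial 10,000 numbers, hoping for about 1000 completes.
counts = [completed_interviews(10_000, 0.10, rng) for _ in range(20)]
print(counts)
```

Each simulated survey aims for 1000 respondents, but the realized sample size bounces around that target from run to run.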

I’ve spent some time examining the work done by Richard Tol which was used in the latest IPCC report. I was troubled enough by his work I even submitted a formal complaint with the IPCC nearly two months ago (I’ve not heard back from them thus far). It expressed some of the same concerns you expressed in a post last year.

The reason I wanted to contact you is I recently realized most people looking at Tol’s work are unaware of a rather important point. I wrote a post to explain it which I’d invite you to read, but I’ll give a quick summary to possibly save you some time.

As you know, Richard Tol claimed moderate global warming will be beneficial based upon a data set he created. However, errors in his data set (some of which are still uncorrected) call his results into question. Primarily, once several errors are corrected, it turns out the only result which shows any non-trivial benefit from global warming is Tol’s own 2002 paper.

That is obviously troubling, but there is a point which makes this even worse. As it happens, Tol’s 2002 paper did not include just one result. It actually included three different results. A table for it shows those results are +2.3%, +0.2% and -2.7%.

The 2002 paper does nothing to suggest any one of those results is the “right” one, nor does any of Tol’s later work. That means Tol used the +2.3% value from his 2002 paper while ignoring the +0.2% and -2.7% values, without any stated explanation.

It might be true the +2.3% value is the “best” estimate from the 2002 paper, but even if so, one needs to provide an explanation as to why it should be favored over the other two estimates. Tol didn’t do so. Instead, he simply pretended the other two estimates did not exist. That is inexcusable.

I’m not sure how interested you are in Tol’s work, but I thought you might be interested to know things are even worse than you thought.

This is horrible and also kind of hilarious. We start with a published paper by Tol claiming strong evidence for a benefit from moderate global warming. Then it turns out he had some data errors; fixing the errors led to a weakening of his conclusions. Then more errors came out, and it turned out that there was only one point in his entire dataset supporting his claims—and that point came from his own previously published study. And then . . . even that one point isn’t representative of that paper.

You pull and pull on the thread, and the entire garment falls apart. There’s nothing left.

At no point did Tol apologize or thank the people who pointed out his errors; instead he lashed out, over and over again. Irresponsible indeed.

Math library with autodiff now available in its own repo!!!!!

^{(1)} Just doing install.packages("rstan") isn’t going to work because of dependencies; please go to the RStan getting started page for instructions on how to install from CRAN. It’s much faster than building from source and you no longer need a machine with a lot of RAM to install.

^{(2)} Coming soon to an interface near you.

Full Release Notes

v2.7.0 (9 July 2015)
======================================================================
New Team Members
--------------------------------------------------
* Alp Kucukelbir, who brings you variational inference
* Robert L. Grant, who brings you the StataStan interface
Major New Feature
--------------------------------------------------
* Black-box variational inference, mean field and full
rank (#1505)
New Features
--------------------------------------------------
* Line numbers reported for runtime errors (#1195)
* Wiener first passage time density (#765) (thanks to
Michael Schvartsman)
* Partial initialization (#1069)
* NegBinomial2 RNG (#1471) and PoissonLog RNG (#1458) and extended
range for Dirichlet RNG (#1474) and fixed Poisson RNG for older
Mac compilers (#1472)
* Error messages now use operator notation (#1401)
* More specific error messages for illegal assignments (#1100)
* More specific error messages for illegal sampling statement
signatures (#1425)
* Extended range on ibeta derivatives with wide impact on CDFs (#1426)
* Display initialization error messages (#1403)
* Works with Intel compilers and GCC 4.4 (#1506, #1514, #1519)
Bug Fixes
--------------------------------------------------
* Allow functions ending in _lp to call functions ending in _lp (#1500)
* Update warnings to catch uses of illegal sampling functions like
CDFs and updated declared signatures (#1152)
* Disallow constraints on local variables (#1295)
* Allow min() and max() in variable declaration bounds and remove
unnecessary use of math.h and top-level :: namespace (#1436)
* Updated exponential lower bound check (#1179)
* Extended sum to work with zero size arrays (#1443)
* Positive definiteness checks fixed (were > 1e-8, now > 0) (#1386)
Code Reorganization and Back End Upgrades
--------------------------------------------------
* New static constants (#469, #765)
* Added major/minor/patch versions as properties (#1383)
* Pulled all math-like functionality into stan::math namespace
* Pulled the Stan Math Library out into its own repository (#1520)
* Included in Stan C++ repository as submodule
* Removed final usage of std::cout and std::cerr (#699) and
updated tests for null streams (#1239)
* Removed over 1000 CppLint warnings
* Remove model write CSV methods (#445)
* Reduced generality of operators in fvar (#1198)
* Removed folder-level includes due to order issues (part of Math
reorg) and include math.hpp include (#1438)
* Updated to Boost 1.58 (#1457)
* Travis continuous integration for Linux (#607)
* Add grad() method to math::var for autodiff to encapsulate math::vari
* Added finite diff functionals for testing (#1271)
* More configurable distribution unit tests (#1268)
* Clean up directory-level includes (#1511)
* Removed all lint from new math lib and add cpplint to build lib
(#1412)
* Split out derivative functionals (#1389)
Manual and Documentation
--------------------------------------------------
* New Logo in Manual; remove old logos (#1023)
* Corrected all known bug reports and typos; details in
issues #1420, #1508, #1496
* Thanks to Sunil Nandihalli, Andy Choi, Sebastian Weber,
Heraa Hu, @jonathan-g (GitHub handle), M. B. Joseph, Damjan
Vukcevic, @tosh1ki (GitHub handle), Juan S. Casallas
* Fix some parsing issues for index (#1498)
* Added chapter on variational inference
* Added strangely unrelated regressions and multivariate probit
examples
* Discussion from Ben Goodrich about reject() and sampling
* Start to reorganize code with fast examples first, then
explanations
* Added CONTRIBUTING.md file (#1408)

I heeded your call to construct a Stan model of the height of Kit “Snow” Harrington. The response on Gawker has been poor, unfortunately, but here it is, anyway.

Yeah, I think the people at Gawker have bigger things to worry about this week. . . .

From this analysis it is unclear how tall Kit is, there is much uncertainty in the posterior distribution, but according to the analysis (which might be quite off) there’s a 50% probability he’s between 1.71 and 1.75 m tall. It is stated in the article that he is NOT 5’8” (173 cm), but according to this analysis it’s not an unreasonable height, as the mean of the posterior is 173 cm.

His Stan model is at the link. (I tried to copy it here but there was some html crap.)

As D.M.C. would say, bad meaning bad not bad meaning good.

Deborah Mayo points to this terrible, terrible definition of statistical significance from the Agency for Healthcare Research and Quality:

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

The definition is wrong, as is the example. I mean, really wrong. So wrong that it’s perversely impressive how many errors they managed to pack into two brief paragraphs:

1. I don’t even know what it means to say “whether the results of a study are likely to be true.” The results are the results, right? You could try to give them some slack and assume they meant, “whether the results of a study represent a true pattern in the general population” or something like that—but, even so, it’s not clear what is meant by “true.”

2. Even if you could somehow get some definition of “likely to be true,” that is not what statistical significance is about. It’s just not.

3. “Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance.” Ummm, this is close, if you replace “an effect” with “a difference at least as large as what was observed” and if you append “conditional on there being a zero underlying effect.” Of course in real life there are very few zero underlying effects (I hope the Agency for Healthcare Research and Quality mostly studies treatments with positive effects!), hence the irrelevance of statistical significance to relevant questions in this field.

4. “The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).” No no no no no. As has been often said, the p-value is a measure of sample size. And, even conditional on sample size, and conditional on measurement error and variation between people, the probability that the results are true (whatever exactly that means) depends strongly on what is being studied, what Tversky and Kahneman called the base rate.

5. As Mayo points out, it’s sloppy to use “likely” to talk about probability.

6. “Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).” Ummmm, yes, I guess that’s correct. Lots of ignorant researchers believe this. I suppose that, without this belief, Psychological Science would have difficulty filling its pages, and Science, Nature, and PPNAS would have no social science papers to publish and they’d have to go back to their traditional plan of publishing papers in the biological and physical sciences.

7. “The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.” Hahahahahaha. Funny. What’s really amusing is that they hyperlink “probability” so we can learn more technical stuff from them. OK, I’ll bite, I’ll follow the link:

Probability

Definition: The likelihood (or chance) that an event will occur. In a clinical research study, it is the number of times a condition or event occurs in a study group divided by the number of people being studied.

Example: For example, a group of adult men who had chest pain when they walked had diagnostic tests to find the cause of the pain. Eighty-five percent were found to have a type of heart disease known as coronary artery disease. The probability of coronary artery disease in men who have chest pain with walking is 85 percent.

Fuuuuuuuuuuuuuuuck. No no no no no. First, of course “likelihood” has a technical use which is not the same as what they say. Second, “the number of times a condition or event occurs in a study group divided by the number of people being studied” is a frequency, not a probability.

It’s refreshing to see these sorts of errors out in the open, though. If someone writing a tutorial makes these huge, huge errors, you can see how everyday researchers make these mistakes too.

For example:

A pair of researchers find that, for a certain group of women they are studying, three times as many are wearing red or pink shirts during days 6-14 of their monthly cycle (which the researchers, in their youthful ignorance, were led to believe were the most fertile days of the month). Therefore, the probability (see above definition) of wearing red or pink is three times more likely during these days. And the result is statistically significant (see above definition), so the results are probably true. That pretty much covers it.

All snark aside, I’d never really had a sense of the reasoning by which people get to these sorts of ridiculous claims based on such shaky data. But now I see it. It’s the two steps: (a) the observed frequency is the probability, (b) if p less than .05 then the result is probably real. Plus, the intellectual incentive of having your pet theory confirmed, and the professional incentive of getting published in the tabloids. But underlying all this are the wrong definitions of “probability” and “statistical significance.”
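To see why step (b) fails, here is a rough sketch of the base-rate point from item 4 above. It assumes a cartoon world in which every studied effect is either real or exactly zero (which, as noted above, is not how real life works), with hypothetical values of 80% power and a 0.05 threshold; the function name and all the numbers are mine, chosen only for illustration.

```python
def prob_effect_is_real(base_rate, power=0.8, alpha=0.05):
    """P(effect is real | p < alpha) in a simplified real-versus-null setup.

    base_rate: assumed share of studied effects that are real
    power:     assumed chance a real effect yields p < alpha
    alpha:     chance a null effect yields p < alpha
    """
    true_positives = power * base_rate
    false_positives = alpha * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

for base_rate in (0.5, 0.1, 0.01):
    share = prob_effect_is_real(base_rate)
    print(f"base rate {base_rate:.2f}: P(real | significant) = {share:.2f}")
```

Even in this oversimplified setup, the same “p less than .05” corresponds to very different chances that the effect is real, depending on what is being studied.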

Who wrote these definitions in this U.S. government document, I wonder? I went all over the webpage and couldn’t find any list of authors. This relates to a recurring point made by Basbøll and myself: it’s hard to know what to do with a piece of writing if you don’t know where it came from. Basbøll and I wrote about this in the context of plagiarism (a statistical analogy would be the statement that it can be hard to effectively use a statistical method if the person who wrote it up doesn’t understand it himself), but really the point is more general. If this article on statistical significance had an author of record, we could examine the author’s qualifications, possibly contact him or her, see other things written by the same author, etc. Without this, we’re stuck.

Wikipedia articles typically don’t have named authors, but the authors do have online handles and they thus take responsibility for their words. Also Wikipedia requires sources. There are no sources given for these two paragraphs on statistical significance which are so full of errors.

What, then?

The question then arises: how should statistical significance be defined in one paragraph for the layperson? I think the solution is, if you’re not gonna be rigorous, don’t fake it.

Here’s my try.

Statistical Significance

Definition: A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result as strong as observed or greater, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.

I think that’s better than their definition. Of course, I’m an experienced author of statistics textbooks so I should be able to correctly and concisely define p-values and statistical significance. But . . . the government could’ve asked me to do this for them! I’d’ve done it. It only took me 10 minutes! Would I write the whole glossary for them? Maybe not. But at least they’d have a correct definition of statistical significance.
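To make the definition concrete, here is a minimal sketch of a one-sided p-value for a made-up example: counting successes in n trials, with a null hypothesis that the success probability is 0.5. The function name and the data (60% successes at three different sample sizes) are invented for illustration.

```python
from math import comb

def p_value_at_least(k_observed, n, null_prob=0.5):
    """P(X >= k_observed) for X ~ Binomial(n, null_prob).

    This is the probability, under the null hypothesis, of a result
    as strong as observed or greater (one-sided).
    """
    return sum(
        comb(n, k) * null_prob ** k * (1 - null_prob) ** (n - k)
        for k in range(k_observed, n + 1)
    )

# Same observed proportion (60%) at three hypothetical sample sizes.
for k, n in [(12, 20), (60, 100), (300, 500)]:
    print(f"{k}/{n}: p = {p_value_at_least(k, n):.2g}")
```

The observed proportion is identical in all three cases, but the p-value drops sharply as n grows, which is the sense in which the p-value is a measure of sample size.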

I guess they can go back now and change it.

Just to be clear, I’m not trying to slag on whoever prepared this document. I’m sure they did the best they could; they just didn’t know any better. It would be as if someone asked me to write a glossary about medicine. The fault lies with whoever commissioned the glossary and didn’t run it by an expert to check. Or maybe they could’ve just omitted the glossary entirely, as these topics are covered in standard textbooks.
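To make the p-value in my definition concrete, here’s a minimal sketch in Python. The setup is made up for illustration (it’s not from the government glossary or any real study): a coin is flipped 20 times and comes up heads 15 times, and the null hypothesis is that the coin is fair. The function name and the numbers are my own assumptions.

```python
from math import comb


def one_sided_binomial_p_value(n_flips: int, n_heads: int, p_null: float = 0.5) -> float:
    """Probability of seeing n_heads or more in n_flips, if the null is true.

    This is "a result as strong as observed or greater, under the null
    hypothesis," for a one-sided test of a hypothetical coin.
    """
    return sum(
        comb(n_flips, k) * p_null**k * (1 - p_null) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )


# Hypothetical data: 15 heads in 20 flips of a supposedly fair coin.
p_value = one_sided_binomial_p_value(20, 15)
print(f"p-value = {p_value:.4f}")  # prints "p-value = 0.0207"
```

Here the p-value is below the conventional 0.05 threshold, so the result would be declared statistically significant. All that says is that data this lopsided are not very consistent with a fair coin under this one measure; it says nothing about the probability that the coin is fair.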

P.S. And whassup with that ugly, ugly logo? It’s the U.S. government. We’re the greatest country on earth. Sure, our health-care system is famously crappy, but can’t we come up with a better logo than this? Christ.

P.P.S. Following Paul Alper’s suggestion, I made my definition more general by removing the phrase, “that the true underlying effect is zero.”

P.P.P.S. The bigger picture, though, is that I don’t think people should be making decisions based on statistical significance in any case. In my ideal world, we’d be defining statistical significance just as a legacy project, so that students can understand outdated reports that might be of historical interest. If you’re gonna define statistical significance, you should do it right, but really I think all this stuff is generally misguided.

Daniel, Andrew, and I are on our second day of teaching, and like many places, Memorial Sloan-Kettering has all their classrooms set up with a whiteboard placed directly behind a projection screen. This gives us a sliver of space to write on without pulling the screen up and down.

If you have any say in setting up your seminar rooms, please don’t put your board behind your screen. I almost always want to use them both at the same time.

I also just got back from a DARPA workshop at the Embassy Suites in Portland, and there the problem was a podium in between two tiny screens, neither of which was easily visible from the back of the big ballroom. Nobody knows where to point when there are two boards. One big screen is way better.

At my summer school course in Sydney earlier this year, they had a neat setup where there were two screens, but one could be used with an overhead projection of a small desktop, so I could just write on paper and send it up to the second screen. And the screens were big enough that all 200+ students could see both. Yet another great feature of Australia.

I followed a link at Steve Hsu’s blog and came to this discussion of Feynman’s cognitive style. Hsu writes that “it was often easier for [Feynman] to invent his own solution than to read through someone else’s lengthy paper” and he follows up with a story in which “Feynman did not understand the conventional formulation of QED even after Dyson’s paper proving the equivalence of the Feynman and Schwinger methods.” Apparently Feynman was eventually able to find an error in this article but only after an immense effort. In Hsu’s telling (which I have no reason to doubt), Feynman avoided reading papers by others in part out of a desire to derive everything from first principles, but also because of his own strengths and limitations, his “cognitive profile.”

This is all fine and it makes sense to me. Indeed, I recognize Feynman’s attitude myself: it can often take a lot of work to follow someone else’s paper if it has lots of technical material, and I typically prefer to read a paper shallowly, get the gist, and then focus on a mix of specific details (trying to understand one example) and big picture, without necessarily following all the arguments. This seems to be Feynman’s attitude too.

The place where I part from Hsu is in this judgment of his:

Feynman’s cognitive profile was probably a bit lopsided — he was stronger mathematically than verbally. . . . it was often easier for him to invent his own solution than to read through someone else’s lengthy paper.

I have a couple of problems with this. First, Feynman was obviously very strong verbally, given that he wrote a couple of classic books. Sure, he dictated these books, he didn’t actually write them (at least that’s my understanding of how the books were put together), but still, you need good verbal skills to put things the way he did. By comparison, consider Murray Gell-Mann, who prided himself on his cultured literacy but couldn’t write well for general audiences.

Anyway, sure, Feynman’s math skills were much better developed than his verbal skills. But compared to other top physicists (which is the relevant measure here)? That’s not so clear.

I’ll go with Hsu’s position that Feynman was better than others at coming up with original ideas while not being so willing to put in the effort to understand what others had written. But I’m guessing that this latter disinclination doesn’t have much to do with “verbal skills.”

Here’s where I think Hsu has fallen victim to the tyranny of measurement—that is, to the fallacy of treating concepts as more important if they are more accessible to measurement.

“Much stronger mathematically than verbally”—where does that come from?

College admissions tests are divided into math and verbal sections, so there’s that. But it’s a fallacy to divide cognitive abilities into these two parts, especially in a particular domain such as theoretical physics which requires very particular skills.

Let me put it another way. My math skills are much lower than Feynman’s and my verbal skills are comparable. I think we can all agree that my “imbalance”—the difference (however measured) between my math and verbal skills—is much lower than Feynman’s. Nonetheless, I too do my best to avoid reading highly technical work by others. Like Feynman (but of course at a much lower level), I prefer to come up with my own ideas rather than work to figure out what others are doing. And I typically evaluate others’ work using my personal basket of examples. Which can irritate the Judea Pearls of the world, as I just don’t always have the patience to figure out exactly why something that doesn’t work, doesn’t work. Like Feynman in that story, I can do it, but it takes work. Sometimes that work is worth it; for example, I’ve spent a lot of time trying to understand exactly what assumptions implicitly support regression discontinuity analysis, so that I could get a better sense of what happened in the notorious regression discontinuity FAIL pollution in China analysis, where the researchers in question seemingly followed all the rules but still went wrong.

Anyway, that’s a tangent. My real point is that we should be able to talk about different cognitive styles and abilities without the tyranny of measurement straitjacketing us into simple categories that happen to line up with college admissions tests. In many settings I imagine these dimensions are psychometrically relevant but I’m skeptical about applying them to theoretical physics.

Here’s some summer reading for you. The schedule may change because of the insertion of topical material, but this is the basic plan:

Richard Feynman and the tyranny of measurement

A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program

Ta-Nehisi Coates, David Brooks, and the “street code” of journalism

Flamebait: “Mathiness” in economics and political science

45 years ago in the sister blog

Ira Glass asks. We answer.

The 3 Stages of Busy

Ripped from the pages of a George Pelecanos novel

“We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

What do I say when I don’t have much to say?

“Women Respond to Nobel Laureate’s ‘Trouble With Girls’”

This sentence by Thomas Mallon would make Barry N. Malzberg spin in his grave, except that he’s still alive so it would just make him spin in his retirement

If you leave your datasets sitting out on the counter, they get moldy

Spam!

The plagiarist next door strikes back: Different standards of plagiarism in different communities

How to parameterize hyperpriors in hierarchical models?

How Hamiltonian Monte Carlo works

When does Bayes do the job?

Here’s a theoretical research project for you

Classifying causes of death using “verbal autopsies”

All hail Lord Spiegelhalter!

Dan Kahan doesn’t trust the Turk

Neither time nor stomach

He wants to teach himself some statistics

Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential!

Harry S. Truman, Jesus H. Christ, Roy G. Biv

Why couldn’t Breaking Bad find Mexican Mexicans?

Rockin the tabloids

A statistical approach to quadrature

Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.

0.05 is a joke

Jökull Snæbjarnarson writes . . .

Aahhhhh, young people!

Plaig! (non-Wegman edition)

We provide a service

“The belief was so strong that it trumped the evidence before them.”

“Can you change your Bayesian prior?”

How to analyze hierarchical survey data with post-stratification?

A political sociological course on statistics for high school students

Questions about data transplanted in kidney study

Performing design calculations (type M and type S errors) on a routine basis?

“Another bad chart for you to criticize”

Constructing an informative prior using meta-analysis

Stan attribution

Cannabis/IQ follow-up: Same old story

Defining conditional probability

In defense of endless arguments

Emails I never finished reading

BREAKING . . . Sepp Blatter accepted $2M payoff from Dennis Hastert

Comments on Imbens and Rubin causal inference book

“Dow 36,000” guy offers an opinion on Tom Brady’s balls. The rest of us are supposed to listen?

Irwin Shaw: “I might mistrust intellectuals, but I’d mistrust nonintellectuals even more.”

Death of a statistician

Being polite vs. saying what we really think

Why is this double-y-axis graph not so bad?

“There are many studies showing . . .”

Even though it’s published in a top psychology journal, she still doesn’t believe it

He’s skeptical about Neuroskeptic’s skepticism

Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

Medical decision making under uncertainty

Unreplicable

“The frequentist case against the significance test”

Erdos bio for kids

Have weak data. But need to make decision. What to do?

“I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.”

Optimistic or pessimistic priors

Draw your own graph!

Low-power pose

Annals of Spam

The Final Bug, or, Please please please please please work this time!

Want to drive the baby-naming public up the wall? Tell them you’re naming your daughter Renesmee. Author Stephenie Meyer invented the name for the half-vampire child in her wildly popular Twilight series. In the story it’s simply an homage to the child’s two grandmothers, Renee and Esmé. To the traditional-minded, though, Renesmee has become a symbol of everything wrong with modern baby naming: It’s not a “real name.” The author just made it up, then parents followed in imitation of pop culture.

All undeniably true, yet that history itself is surprisingly traditional. . . .

The commenters express some disagreement regarding Coraline but it seems that the others on the list really were just made up. And a commenter also adds the names Stella and Norma among the made-up list. And “People who are not Shakespeare give us names like Nevaeh and Quvenzhane.”

P.S. Wattenberg adds:

Note for sticklers: Each of the writers below is credited with using the name inventively—as a coinage rather than a recycling of a familiar name—and with introducing the name to the broader culture. Scattered previous examples of usage may exist, since name creativity isn’t limited to writers.

Really, no snark here. She’s got some excellent tracks on the new Nina Simone tribute album. The best part’s the sample from the classic Nina song. But that’s often the case. They wouldn’t sample something if it was no good.

P.S. Let me clarify: I prefer Lauryn’s version to Nina’s original. The best parts of Lauryn’s are the Nina samples, but I think in its entirety the new version works better, at least to my modern ears.

I received the following email with subject line, “Andrew, just finished ‘Foreign language skills …’”:

Andrew,

Just finished http://andrewgelman.com/2010/12/24/foreign_languag/

This leads to the silliness of considering foreign language skills as a purely positional good or as a method for selecting students, while forgetting the direct benefits of being able to communicate in various ways with different cultures.
– Found this interesting..

Since you covered a language-related topic, I thought you might be interested in our new infographic where we put the new Google translate iOS app to the test. We compared the app against our best human translators and found the results quite surprising.

Would you like me to send you the link?

Thanks,
Ashley Harris
Outreach Coordinator

“Ashley Harris,” indeed.

What I’m wondering is, can’t all these bots just communicate with each other and leave us humans out of the loop? Or maybe I should be afraid of this happening?

The other day, in the context of a discussion of an article from 1972, I remarked that the great statistician William Cochran, when writing on observational studies, wrote almost nothing about causality, nor did he mention selection or meta-analysis.

It was interesting that these topics, which are central to any modern discussion of observational studies, were not considered important by a leader in the field, and this suggests that our thinking has changed since 1972.

Today I’d like to make a similar argument, this time regarding the topic of measurement. This time I’ll consider Donald Rubin’s 2008 article, “For objective causal inference, design trumps analysis.”

All of Rubin’s article is worth reading—it’s all about the ways in which we can structure the design of observational studies to make inferences more believable—and the general point is important and, I think, underrated.

When people do experiments, they think about design, but when they do observational studies, they think about identification strategies, which is related to design but is different in that it’s all about finding and analyzing data and checking assumptions, not so much about systematic data collection. So Rubin makes valuable points in his article.

But today I want to focus on something that Rubin doesn’t really mention in his article: measurement, which is a topic we’ve been talking a lot about here lately.

Rubin talks about randomization, or the approximate equivalent in observational studies (the “assignment mechanism”), and about sample size (“traditional power calculations,” as his article was written before Type S and Type M errors were well known), and about the information available to the decision makers, and about balance between treatment and control groups.

Rubin does briefly mention the importance of measurement, but only in the context of being able to match or adjust for pre-treatment differences between treatment and control groups.

That’s fine, but here I’m concerned with something even more basic: the validity and reliability of the measurements of outcomes and treatments (or, more generally, comparison groups). I’m assuming Rubin was taking validity for granted—assuming that the x and y variables being measured were the treatment and outcome of interest—and, in a sense, the reliability question is included in the question about sample size. In practice, though, studies are often using sloppy measurements (days of peak fertility, fat arms, beauty, etc.), and if the measurements are bad enough, the problems go beyond sample size, partly because in such studies the sample size would have to creep into the zillions for anything to be detectable, and partly because the biases in measurement can easily be larger than the effects being studied.
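Here’s a rough sketch of the sample-size half of that argument, using the textbook normal-approximation formula for comparing two group means: n per group ≈ 2(z₁ + z₂)²(σ² + τ²)/δ², where δ is the true effect, σ is the standard deviation of the true outcome, and τ is the standard deviation of the measurement noise. The function name and all the numbers are hypothetical, chosen only for illustration, and the calculation assumes unbiased measurement. It does nothing about measurement bias, which no sample size will fix.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect: float, sd_true: float, sd_noise: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a two-sided, two-sample comparison.

    Assumes independent, unbiased measurement noise that adds to the
    outcome variance: total variance = sd_true**2 + sd_noise**2.
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    total_var = sd_true**2 + sd_noise**2
    return ceil(2 * z_total**2 * total_var / effect**2)


# Hypothetical case 1: a large effect measured cleanly.
print(n_per_group(effect=0.5, sd_true=1.0, sd_noise=0.0))   # 63

# Hypothetical case 2: a small effect measured with lots of noise.
print(n_per_group(effect=0.05, sd_true=1.0, sd_noise=3.0))  # roughly 63,000
```

Under these made-up numbers, shrinking the effect by a factor of 10 and adding noisy measurement multiplies the required sample size by about a thousand. And that’s the optimistic scenario, in which the measurement is merely noisy and not biased.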

So, I’d just like to take Rubin’s excellent article and append a brief discussion of the importance of measurement.

P.S. I sent the above to Rubin, who replied:

In that article I was focusing on the design of observational studies, which I thought had been badly neglected by everyone in past years, including Cochran and me. Issues of good measurement, I think I did mention briefly (I’ll have to check—I do in my lectures, but maybe I skipped that point here), but having good measurements had been discussed by Cochran in his 1965 JRSS paper, so were an already emphasized point.

And I wanted to focus on the neglected point, not all relevant points for observational studies.