
More gremlins: “Instead, he simply pretended the other two estimates did not exist. That is inexcusable.”

[Image: 1977 AMC Gremlin X, Hershey 2012]

Brandon Shollenberger writes:

I’ve spent some time examining the work done by Richard Tol which was used in the latest IPCC report.  I was troubled enough by his work I even submitted a formal complaint with the IPCC nearly two months ago (I’ve not heard back from them thus far).  It expressed some of the same concerns you expressed in a post last year.

The reason I wanted to contact you is I recently realized most people looking at Tol’s work are unaware of a rather important point.  I wrote a post to explain it which I’d invite you to read, but I’ll give a quick summary to possibly save you some time.

As you know, Richard Tol claimed moderate global warming will be beneficial based upon a data set he created.  However, errors in his data set (some of which are still uncorrected) call his results into question.  Primarily, once several errors are corrected, it turns out the only result which shows any non-trivial benefit from global warming is Tol’s own 2002 paper.

That is obviously troubling, but there is a point which makes this even worse.  As it happens, Tol’s 2002 paper did not include just one result.  It actually included three different results.  A table for it shows those results are +2.3%, +0.2% and -2.7%.

The 2002 paper does nothing to suggest any one of those results is the “right” one, nor does any of Tol’s later work.  That means Tol used the +2.3% value from his 2002 paper while ignoring the +0.2% and -2.7% values, without any stated explanation.

It might be true the +2.3% value is the “best” estimate from the 2002 paper, but even if so, one needs to provide an explanation as to why it should be favored over the other two estimates.  Tol didn’t do so.  Instead, he simply pretended the other two estimates did not exist.  That is inexcusable.

I’m not sure how interested you are in Tol’s work, but I thought you might be interested to know things are even worse than you thought.

This is horrible and also kind of hilarious. We start with a published paper by Tol claiming strong evidence for a benefit from moderate global warming. Then it turns out he had some data errors; fixing the errors led to a weakening of his conclusions. Then more errors came out, and it turned out that there was only one point in his entire dataset supporting his claims—and that point came from his own previously published study. And then . . . even that one point isn’t representative of that paper.

You pull and pull on the thread, and the entire garment falls apart. There’s nothing left.

At no point did Tol apologize or thank the people who pointed out his errors; instead he lashed out, over and over again. Irresponsible indeed.

Stan 2.7 (CRAN, variational inference, and much much more)

[Image: the new Stan logo]

Stan 2.7 is now available for all interfaces. As usual, everything you need can be found starting from the Stan home page:

Highlights

  • RStan is on CRAN!(1)
  • Variational Inference in CmdStan!!(2)
  • Two new Stan developers!!! 
  • A whole new logo!!!! 
  • Math library with autodiff now available in its own repo!!!!! 

(1) Just doing install.packages("rstan") isn’t going to work because of dependencies; please go to the RStan getting started page for instructions on how to install from CRAN. It’s much faster than building from source and you no longer need a machine with a lot of RAM to install.

(2) Coming soon to an interface near you.

Full Release Notes

v2.7.0 (9 July 2015)
======================================================================

New Team Members
--------------------------------------------------
* Alp Kucukelbir, who brings you variational inference
* Robert L. Grant, who brings you the StataStan interface

Major New Feature
--------------------------------------------------
* Black-box variational inference, mean field and full
  rank (#1505)

New Features
--------------------------------------------------
* Line numbers reported for runtime errors (#1195)
* Wiener first passage time density (#765) (thanks to
  Michael Schvartsman)
* Partial initialization (#1069)
* NegBinomial2 RNG (#1471) and PoissonLog RNG (#1458) and extended
  range for Dirichlet RNG (#1474) and fixed Poisson RNG for older
  Mac compilers (#1472)
* Error messages now use operator notation (#1401)
* More specific error messages for illegal assignments (#1100)
* More specific error messages for illegal sampling statement 
  signatures (#1425)
* Extended range on ibeta derivatives with wide impact on CDFs (#1426)
* Display initialization error messages (#1403)
* Works with Intel compilers and GCC 4.4 (#1506, #1514, #1519)

Bug Fixes
--------------------------------------------------
* Allow functions ending in _lp to call functions ending in _lp (#1500)
* Update warnings to catch uses of illegal sampling functions like
  CDFs and updated declared signatures (#1152)
* Disallow constraints on local variables (#1295)
* Allow min() and max() in variable declaration bounds and remove
  unnecessary use of math.h and top-level :: namespace (#1436)
* Updated exponential lower bound check (#1179)
* Extended sum to work with zero size arrays (#1443)
* Positive definiteness checks fixed (were > 1e-8, now > 0) (#1386)

Code Reorganization and Back End Upgrades
--------------------------------------------------
* New static constants (#469, #765)
* Added major/minor/patch versions as properties (#1383)
* Pulled all math-like functionality into stan::math namespace
* Pulled the Stan Math Library out into its own repository (#1520)
* Included in Stan C++ repository as submodule
* Removed final usage of std::cout and std::cerr (#699) and
  updated tests for null streams (#1239)
* Removed over 1000 CppLint warnings
* Remove model write CSV methods (#445)
* Reduced generality of operators in fvar (#1198)
* Removed folder-level includes due to order issues (part of Math
  reorg) and include math.hpp include (#1438)
* Updated to Boost 1.58 (#1457)
* Travis continuous integration for Linux (#607)
* Add grad() method to math::var for autodiff to encapsulate math::vari
* Added finite diff functionals for testing (#1271)
* More configurable distribution unit tests (#1268)
* Clean up directory-level includes (#1511)
* Removed all lint from new math lib and add cpplint to build lib
  (#1412)
* Split out derivative functionals (#1389)


Manual and Documentation
--------------------------------------------------
* New Logo in Manual; remove old logos (#1023)
* Corrected all known bug reports and typos; details in 
  issues #1420, #1508, #1496
* Thanks to Sunil Nandihalli, Andy Choi, Sebastian Weber,
  Heraa Hu, @jonathan-g (GitHub handle), M. B. Joseph, Damjan
  Vukcevic, @tosh1ki (GitHub handle), Juan S. Casallas
* Fix some parsing issues for index (#1498)
* Added chapter on variational inference
* Added seemingly unrelated regressions and multivariate probit
  examples 
* Discussion from Ben Goodrich about reject() and sampling
* Start to reorganize code with fast examples first, then
  explanations
* Added CONTRIBUTING.md file (#1408)

BREAKING . . . Kit Harington’s height


Rasmus “ticket to” Bååth writes:

I heeded your call to construct a Stan model of the height of Kit “Snow” Harrington. The response on Gawker has been poor, unfortunately, but here it is, anyway.

Yeah, I think the people at Gawker have bigger things to worry about this week. . . .

Here’s Rasmus’s inference for Kit’s height:

[Figure: posterior distribution of Kit Harington’s height]

And here’s his summary:

From this analysis it is unclear how tall Kit is, there is much uncertainty in the posterior distribution, but according to the analysis (which might be quite off) there’s a 50% probability he’s between 1.71 and 1.75 m tall. It is stated in the article that he is NOT 5’8” (173 cm), but according to this analysis it’s not an unreasonable height, as the mean of the posterior is 173 cm.

His Stan model is at the link. (I tried to copy it here but there was some html crap.)

A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program

As D.M.C. would say, bad meaning bad not bad meaning good.

Deborah Mayo points to this terrible, terrible definition of statistical significance from the Agency for Healthcare Research and Quality:

Statistical Significance

Definition: A mathematical technique to measure whether the results of a study are likely to be true. Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance. Statistical significance is usually expressed as a P-value. The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true). Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).

Example: For example, results from a research study indicated that people who had dementia with agitation had a slightly lower rate of blood pressure problems when they took Drug A compared to when they took Drug B. In the study analysis, these results were not considered to be statistically significant because p=0.2. The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.

The definition is wrong, as is the example. I mean, really wrong. So wrong that it’s perversely impressive how many errors they managed to pack into two brief paragraphs:

1. I don’t even know what it means to say “whether the results of a study are likely to be true.” The results are the results, right? You could try to give them some slack and assume they meant, “whether the results of a study represent a true pattern in the general population” or something like that—but, even so, it’s not clear what is meant by “true.”

2. Even if you could somehow get some definition of “likely to be true,” that is not what statistical significance is about. It’s just not.

3. “Statistical significance is calculated as the probability that an effect observed in a research study is occurring because of chance.” Ummm, this is close, if you replace “an effect” with “a difference at least as large as what was observed” and if you append “conditional on there being a zero underlying effect.” Of course in real life there are very few zero underlying effects (I hope the Agency for Healthcare Research and Quality mostly studies treatments with positive effects!), hence the irrelevance of statistical significance to relevant questions in this field.

4. “The smaller the P-value, the less likely it is that the results are due to chance (and more likely that the results are true).” No no no no no. As has been often said, the p-value is a measure of sample size. And, even conditional on sample size, and conditional on measurement error and variation between people, the probability that the results are true (whatever exactly that means) depends strongly on what is being studied, what Tversky and Kahneman called the base rate.

5. As Mayo points out, it’s sloppy to use “likely” to talk about probability.

6. “Researchers generally believe the results are probably true if the statistical significance is a P-value less than 0.05 (p<.05).” Ummmm, yes, I guess that’s correct. Lots of ignorant researchers believe this. I suppose that, without this belief, Psychological Science would have difficulty filling its pages, and Science, Nature, and PPNAS would have no social science papers to publish and they’d have to go back to their traditional plan of publishing papers in the biological and physical sciences.

7. “The probability that the results were due to chance was high enough to conclude that the two drugs probably did not differ in causing blood pressure problems.” Hahahahahaha. Funny. What’s really amusing is that they hyperlink “probability” so we can learn more technical stuff from them.

OK, I’ll bite, I’ll follow the link:

Probability

Definition: The likelihood (or chance) that an event will occur. In a clinical research study, it is the number of times a condition or event occurs in a study group divided by the number of people being studied.

Example: For example, a group of adult men who had chest pain when they walked had diagnostic tests to find the cause of the pain. Eighty-five percent were found to have a type of heart disease known as coronary artery disease. The probability of coronary artery disease in men who have chest pain with walking is 85 percent.

Fuuuuuuuuuuuuuuuck. No no no no no. First, of course “likelihood” has a technical use which is not the same as what they say. Second, “the number of times a condition or event occurs in a study group divided by the number of people being studied” is a frequency, not a probability.

It’s refreshing to see these sorts of errors out in the open, though. If someone writing a tutorial makes these huge, huge errors, you can see how everyday researchers make these mistakes too.

For example:

A pair of researchers find that, for a certain group of women they are studying, three times as many are wearing red or pink shirts during days 6-14 of their monthly cycle (which the researchers, in their youthful ignorance, were led to believe were the most fertile days of the month). Therefore, the probability (see above definition) of wearing red or pink is three times higher during these days. And the result is statistically significant (see above definition), so the results are probably true. That pretty much covers it.

All snark aside, I’d never really had a sense of the reasoning by which people get to these sorts of ridiculous claims based on such shaky data. But now I see it. It’s the two steps: (a) the observed frequency is the probability, (b) if p less than .05 then the result is probably real. Plus, the intellectual incentive of having your pet theory confirmed, and the professional incentive of getting published in the tabloids. But underlying all this are the wrong definitions of “probability” and “statistical significance.”

Who wrote these definitions in this U.S. government document, I wonder? I went all over the webpage and couldn’t find any list of authors. This relates to a recurring point made by Basbøll and myself: it’s hard to know what to do with a piece of writing if you don’t know where it came from. Basbøll and I wrote about this in the context of plagiarism (a statistical analogy would be the statement that it can be hard to effectively use a statistical method if the person who wrote it up doesn’t understand it himself), but really the point is more general. If this article on statistical significance had an author of record, we could examine the author’s qualifications, possibly contact him or her, see other things written by the same author, etc. Without this, we’re stuck.

Wikipedia articles typically don’t have named authors, but the authors do have online handles and they thus take responsibility for their words. Also Wikipedia requires sources. There are no sources given for these two paragraphs on statistical significance which are so full of errors.

What, then?

The question then arises: how should statistical significance be defined in one paragraph for the layperson? I think the solution is, if you’re not gonna be rigorous, don’t fake it.

Here’s my try.

Statistical Significance

Definition: A mathematical technique to measure the strength of evidence from a single study. Statistical significance is conventionally declared when the p-value is less than 0.05. The p-value is the probability of seeing a result at least as strong as the one observed, under the null hypothesis (which is commonly the hypothesis that there is no effect). Thus, the smaller the p-value, the less consistent are the data with the null hypothesis under this measure.
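To see the definition in action, here’s a hypothetical two-group comparison in R (the estimate and standard error are invented for illustration), using the usual normal approximation:

# hypothetical summary of a two-group comparison (made-up numbers)
est <- 1.2   # estimated difference between groups
se  <- 0.7   # standard error of that difference

# two-sided p-value under the null hypothesis of zero difference
z <- est / se
p_value <- 2 * pnorm(-abs(z))
p_value   # roughly 0.09, so not statistically significant at the 0.05 level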

I think that’s better than their definition. Of course, I’m an experienced author of statistics textbooks so I should be able to correctly and concisely define p-values and statistical significance. But . . . the government could’ve asked me to do this for them! I’d’ve done it. It only took me 10 minutes! Would I write the whole glossary for them? Maybe not. But at least they’d have a correct definition of statistical significance.

I guess they can go back now and change it.

Just to be clear, I’m not trying to slag on whoever prepared this document. I’m sure they did the best they could; they just didn’t know any better. It would be as if someone asked me to write a glossary about medicine. The flaw lies with whoever commissioned the glossary and didn’t run it by an expert to check. Or maybe they could’ve just omitted the glossary entirely, as these topics are covered in standard textbooks.

[Image: the Effective Health Care Program logo]

P.S. And whassup with that ugly, ugly logo? It’s the U.S. government. We’re the greatest country on earth. Sure, our health-care system is famously crappy, but can’t we come up with a better logo than this? Christ.

P.P.S. Following Paul Alper’s suggestion, I made my definition more general by removing the phrase, “that the true underlying effect is zero.”

P.P.P.S. The bigger picture, though, is that I don’t think people should be making decisions based on statistical significance in any case. In my ideal world, we’d be defining statistical significance just as a legacy project, so that students can understand outdated reports that might be of historical interest. If you’re gonna define statistical significance, you should do it right, but really I think all this stuff is generally misguided.

Don’t put your whiteboard behind your projection screen

Daniel, Andrew, and I are on our second day of teaching, and like many places, Memorial Sloan-Kettering has all their classrooms set up with a whiteboard placed directly behind a projection screen. This gives us a sliver of space to write on without pulling the screen up and down.

If you have any say in setting up your seminar rooms, don’t put your board behind your screen, please — I almost always want to use them both at the same time.

I also just got back from a DARPA workshop at the Embassy Suites in Portland, and there the problem was a podium in between two tiny screens, neither of which was easily visible from the back of the big ballroom. Nobody knows where to point when there are two screens. One big screen is way better.

At my summer school course in Sydney earlier this year, they had a neat setup where there were two screens, but one could be used with an overhead projection of a small desktop, so I could just write on paper and send it up to the second screen. And the screens were big enough that all 200+ students could see both. Yet another great feature of Australia.

Richard Feynman and the tyranny of measurement

I followed a link at Steve Hsu’s blog and came to this discussion of Feynman’s cognitive style. Hsu writes that “it was often easier for [Feynman] to invent his own solution than to read through someone else’s lengthy paper” and he follows up with a story in which “Feynman did not understand the conventional formulation of QED even after Dyson’s paper proving the equivalence of the Feynman and Schwinger methods.” Apparently Feynman was eventually able to find an error in this article but only after an immense effort. In Hsu’s telling (which I have no reason to doubt), Feynman avoided reading papers by others in part out of a desire to derive everything from first principles, but also because of his own strengths and limitations, his “cognitive profile.”

This is all fine and it makes sense to me. Indeed, I recognize Feynman’s attitude myself: it can often take a lot of work to follow someone else’s paper if it has lots of technical material, and I typically prefer to read a paper shallowly, get the gist, and then focus on a mix of specific details (trying to understand one example) and big picture, without necessarily following all the arguments. This seems to be Feynman’s attitude too.

The place where I part from Hsu is in this judgment of his:

Feynman’s cognitive profile was probably a bit lopsided — he was stronger mathematically than verbally. . . . it was often easier for him to invent his own solution than to read through someone else’s lengthy paper.

I have a couple of problems with this. First, Feynman was obviously very strong verbally, given that he wrote a couple of classic books. Sure, he dictated these books, he didn’t actually write them (at least that’s my understanding of how the books were put together), but still, you need good verbal skills to put things the way he did. By comparison, consider Murray Gell-Mann, who prided himself on his cultured literacy but couldn’t write well for general audiences.

Anyway, sure, Feynman’s math skills were much better developed than his verbal skills. But compared to other top physicists (which is the relevant measure here)? That’s not so clear.

I’ll go with Hsu’s position that Feynman was better than others at coming up with original ideas while not being so willing to put in the effort to understand what others had written. But I’m guessing that this latter disinclination doesn’t have much to do with “verbal skills.”

Here’s where I think Hsu has fallen victim to the tyranny of measurement—that is, to the fallacy of treating concepts as more important if they are more accessible to measurement.

“Much stronger mathematically than verbally”—where does that come from?

College admissions tests are divided into math and verbal sections, so there’s that. But it’s a fallacy to divide cognitive abilities into these two parts, especially in a particular domain such as theoretical physics which requires very particular skills.

Let me put it another way. My math skills are much lower than Feynman’s and my verbal skills are comparable. I think we can all agree that my “imbalance”—the difference (however measured) between my math and verbal skills—is much lower than Feynman’s. Nonetheless, I too do my best to avoid reading highly technical work by others. Like Feynman (but of course at a much lower level), I prefer to come up with my own ideas rather than work to figure out what others are doing. And I typically evaluate others’ work using my personal basket of examples. Which can irritate the Judea Pearls of the world, as I just don’t always have the patience to figure out exactly why something that doesn’t work, doesn’t work. Like Feynman in that story, I can do it, but it takes work. Sometimes that work is worth it; for example, I’ve spent a lot of time trying to understand exactly what assumptions implicitly support regression discontinuity analysis, so that I could get a better sense of what happened in the notorious regression discontinuity FAIL pollution in China analysis, where the researchers in question seemingly followed all the rules but still went wrong.

Anyway, that’s a tangent. My real point is that we should be able to talk about different cognitive styles and abilities without the tyranny of measurement straitjacketing us into simple categories that happen to line up with college admissions tests. In many settings I imagine these dimensions are psychometrically relevant but I’m skeptical about applying them to theoretical physics.

On deck this week

Mon: Richard Feynman and the tyranny of measurement

Tues: A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program

Wed: Ta-Nehisi Coates, David Brooks, and the “street code” of journalism

Thurs: Flamebait: “Mathiness” in economics and political science

Fri: 45 years ago in the sister blog

Sat: Ira Glass asks. We answer.

Sun: The 3 Stages of Busy

On deck for the rest of the summer and beginning of fall

Here’s some summer reading for you. The schedule may change because of the insertion of topical material, but this is the basic plan:

Richard Feynman and the tyranny of measurement

A bad definition of statistical significance from the U.S. Department of Health and Human Services, Effective Health Care Program

Ta-Nehisi Coates, David Brooks, and the “street code” of journalism

Flamebait: “Mathiness” in economics and political science

45 years ago in the sister blog

Ira Glass asks. We answer.

The 3 Stages of Busy

Ripped from the pages of a George Pelecanos novel

“We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)”

What do I say when I don’t have much to say?

“Women Respond to Nobel Laureate’s ‘Trouble With Girls’”

This sentence by Thomas Mallon would make Barry N. Malzberg spin in his grave, except that he’s still alive so it would just make him spin in his retirement

If you leave your datasets sitting out on the counter, they get moldy

Spam!

The plagiarist next door strikes back: Different standards of plagiarism in different communities

How to parameterize hyperpriors in hierarchical models?

How Hamiltonian Monte Carlo works

When does Bayes do the job?

Here’s a theoretical research project for you

Classifying causes of death using “verbal autopsies”

All hail Lord Spiegelhalter!

Dan Kahan doesn’t trust the Turk

Neither time nor stomach

He wants to teach himself some statistics

Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential!

Harry S. Truman, Jesus H. Christ, Roy G. Biv

Why couldn’t Breaking Bad find Mexican Mexicans?

Rockin the tabloids

A statistical approach to quadrature

Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.

0.05 is a joke

Jökull Snæbjarnarson writes . . .

Aahhhhh, young people!

Plaig! (non-Wegman edition)

We provide a service

“The belief was so strong that it trumped the evidence before them.”

“Can you change your Bayesian prior?”

How to analyze hierarchical survey data with post-stratification?

A political sociological course on statistics for high school students

Questions about data transplanted in kidney study

Performing design calculations (type M and type S errors) on a routine basis?

“Another bad chart for you to criticize”

Constructing an informative prior using meta-analysis

Stan attribution

Cannabis/IQ follow-up: Same old story

Defining conditional probability

In defense of endless arguments

Emails I never finished reading

BREAKING . . . Sepp Blatter accepted $2M payoff from Dennis Hastert

Comments on Imbens and Rubin causal inference book

“Dow 36,000” guy offers an opinion on Tom Brady’s balls. The rest of us are supposed to listen?

Irwin Shaw: “I might mistrust intellectuals, but I’d mistrust nonintellectuals even more.”

Death of a statistician

Being polite vs. saying what we really think

Why is this double-y-axis graph not so bad?

“There are many studies showing . . .”

Even though it’s published in a top psychology journal, she still doesn’t believe it

He’s skeptical about Neuroskeptic’s skepticism

Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

Medical decision making under uncertainty

Unreplicable

“The frequentist case against the significance test”

Erdos bio for kids

Have weak data. But need to make decision. What to do?

“I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.”

Optimistic or pessimistic priors

Draw your own graph!

Low-power pose

Annals of Spam

The Final Bug, or, Please please please please please work this time!

Enjoy.

“17 Baby Names You Didn’t Know Were Totally Made Up”

From Laura Wattenberg:

Want to drive the baby-naming public up the wall? Tell them you’re naming your daughter Renesmee. Author Stephenie Meyer invented the name for the half-vampire child in her wildly popular Twilight series. In the story it’s simply an homage to the child’s two grandmothers, Renee and Esmé. To the traditional-minded, though, Renesmee has become a symbol of everything wrong with modern baby naming: It’s not a “real name.” The author just made it up, then parents followed in imitation of pop culture.

All undeniably true, yet that history itself is surprisingly traditional. . . .

And here are the 17 classic, yet made-up, names:

Wendy
Cedric
Miranda
Vanessa
Coraline
Evangeline
Amanda
Gloria
Dorian
Clarinda
Cora
Pamela
Fiona
Jessica
Lucinda
Ronia
Imogen

The commenters express some disagreement regarding Coraline but it seems that the others on the list really were just made up. And a commenter also adds the names Stella and Norma among the made-up list. And “People who are not Shakespeare give us names like Nevaeh and Quvenzhane.”

P.S. Wattenberg adds:

Note for sticklers: Each of the writers below is credited with using the name inventively—as a coinage rather than a recycling of a familiar name—and with introducing the name to the broader culture. Scattered previous examples of usage may exist, since name creativity isn’t limited to writers.

Lauryn’s back!

Really, no snark here. She’s got some excellent tracks on the new Nina Simone tribute album. The best part’s the sample from the classic Nina song. But that’s often the case. They wouldn’t sample something if it was no good.

P.S. Let me clarify: I prefer Lauryn’s version to Nina’s original. The best parts of Lauryn’s are the Nina samples, but I think in its entirety the new version works better, at least to my modern ears.

Annals of Spam

I received the following email with subject line, “Andrew, just finished ‘Foreign language skills …'”:

Andrew,

Just finished http://andrewgelman.com/2010/12/24/foreign_languag/

This leads to the silliness of considering foreign language skills as a purely positional good or as a method for selecting students, while forgetting the direct benefits of being able to communicate in various ways with different cultures.
– Found this interesting..

Since you covered a language-related topic, I thought you might be interested in our new infographic where we put the new Google translate iOS app to the test. We compared the app against our best human translators and found the results quite surprising.

Would you like me to send you the link?

Thanks,
Ashley Harris
Outreach Coordinator

“Ashley Harris,” indeed.

What I’m wondering is, can’t all these bots just communicate with each other and leave us humans out of the loop? Or maybe I should be afraid of this happening?

Measurement is part of design

The other day, in the context of a discussion of an article from 1972, I remarked that the great statistician William Cochran, when writing on observational studies, wrote almost nothing about causality, nor did he mention selection or meta-analysis.

It was interesting that these topics, which are central to any modern discussion of observational studies, were not considered important by a leader in the field, and this suggests that our thinking has changed since 1972.

Today I’d like to make a similar argument, this time regarding the topic of measurement. This time I’ll consider Donald Rubin’s 2008 article, “For objective causal inference, design trumps analysis.”

All of Rubin’s article is worth reading—it’s all about the ways in which we can structure the design of observational studies to make inferences more believable—and the general point is important and, I think, underrated.

When people do experiments, they think about design, but when they do observational studies, they think about identification strategies, which is related to design but is different in that it’s all about finding and analyzing data and checking assumptions, not so much about systematic data collection. So Rubin makes valuable points in his article.

But today I want to focus on something that Rubin doesn’t really mention in his article: measurement, which is a topic we’ve been talking a lot about here lately.

Rubin talks about randomization, or the approximate equivalent in observational studies (the “assignment mechanism”), and about sample size (“traditional power calculations,” as his article was written before Type S and Type M errors were well known), and about the information available to the decision makers, and about balance between treatment and control groups.

Rubin does briefly mention the importance of measurement, but only in the context of being able to match or adjust for pre-treatment differences between treatment and control groups.

That’s fine, but here I’m concerned with something even more basic: the validity and reliability of the measurements of outcomes and treatments (or, more generally, comparison groups). I’m assuming Rubin was taking validity for granted—assuming that the x and y variables being measured were the treatment and outcome of interest—and, in a sense, the reliability question is included in the question about sample size. In practice, though, studies are often using sloppy measurements (days of peak fertility, fat arms, beauty, etc.), and if the measurements are bad enough, the problems go beyond sample size, partly because in such studies the sample size would have to creep into the zillions for anything to be detectable, and partly because the biases in measurement can easily be larger than the effects being studied.
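As a toy illustration of that last point (made-up numbers, not from any of the studies mentioned): even a modest amount of measurement error in a predictor attenuates the estimated effect, so a small true effect can mostly disappear.

# toy simulation: small true effect, badly measured predictor
set.seed(2015)
n <- 1000
x_true <- rnorm(n)
y <- 0.1 * x_true + rnorm(n)        # small true effect of 0.1
x_obs  <- x_true + rnorm(n, sd = 2) # noisy measurement of x

coef(lm(y ~ x_true))["x_true"]      # roughly 0.1
coef(lm(y ~ x_obs))["x_obs"]        # attenuated toward zero; about 0.02 in expectation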

So, I’d just like to take Rubin’s excellent article and append a brief discussion of the importance of measurement.

P.S. I sent the above to Rubin, who replied:

In that article I was focusing on the design of observational studies, which I thought had been badly neglected by everyone in past years, including Cochran and me. Issues of good measurement, I think I did mention briefly (I’ll have to check—I do in my lectures, but maybe I skipped that point here), but having good measurements had been discussed by Cochran in his 1965 JRSS paper, so were an already emphasized point.

And I wanted to focus on the neglected point, not all relevant points for observational studies.

Stan is Turing complete

New papers on LOO/WAIC and Stan


Aki, Jonah, and I have released the much-discussed paper on LOO and WAIC in Stan: Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models.
We (that is, Aki) now recommend LOO rather than WAIC, especially now that we have an R function to quickly compute LOO using Pareto smoothed importance sampling. In either case, a key contribution of our paper is to show how LOO/WAIC can be computed in the regular workflow of model fitting.

We also compute the standard error of the difference between LOO (or WAIC) when comparing two models, and we demonstrate with the famous arsenic well-switching example.

Also 2 new tutorial articles on Stan will be appearing:
in JEBS: Stan: A probabilistic programming language for Bayesian inference and optimization.
in JSS: Stan: A probabilistic programming language
The two articles have very similar titles but surprisingly little overlap. I guess it’s the difference between what Bob thinks is important to say, and what I think is important to say.

Enjoy!

P.S. Jonah writes:

For anyone interested in the the R package “loo” mentioned in the paper, please install from GitHub and not CRAN. There is a version on CRAN but it needs to be updated so please for now use the version here:

https://github.com/jgabry/loo

To get it running, you must first install the “devtools” package in R, then you can just install and load “loo” via:

library("devtools")
install_github("jgabry/loo")
library("loo")

Jonah will post an update when the new version is also on CRAN.

P.P.S. This P.P.S. is by Jonah. The latest version of the loo R package (0.1.2) is now up on CRAN and should be installable for most people by running

install.packages("loo")

although depending on various things (your operating system, R version, CRAN mirror, what you ate for breakfast, etc.) you might need

install.packages("loo", type = "source")

to get the new version. For bug reports, installation trouble, suggestions, etc., please use our GitHub issues page. The Stan users google group is also a fine place to ask questions about using the package with your models.

Finally, while we do recommend Stan of course, the R package isn’t only for Stan models. If you can compute a pointwise log-likelihood matrix then you can use the package.
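For what it’s worth, here’s a rough sketch of the intended workflow with two Stan fits; I’m going by the function names described in the paper and package documentation (extract_log_lik, loo, compare), and fit1 and fit2 are placeholders for stanfit objects whose generated quantities block defines a pointwise log-likelihood named log_lik:

library("loo")

# extract the pointwise log-likelihood matrices from each fitted model
log_lik1 <- extract_log_lik(fit1)
log_lik2 <- extract_log_lik(fit2)

# PSIS-LOO estimates of expected log predictive density for each model
loo1 <- loo(log_lik1)
loo2 <- loo(log_lik2)

# difference in predictive accuracy between the models, with standard error
compare(loo1, loo2)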

Psych dept: “We are especially interested in candidates whose research program contributes to the development of new quantitative methods”

This is cool. The #1 psychology department in the world is looking for a quantitative researcher:

The Department of Psychology at the University of Michigan, Ann Arbor, invites applications for a tenure-track faculty position. The expected start date is September 1, 2016. The primary criterion for appointment is excellence in research and teaching. We are especially interested in candidates whose research program contributes to the development of new quantitative methods.

Although the specific research area is open, we are especially interested in applicants for whom quantitative theoretical modeling, which could include computational models, analytic models, statistical models or psychometric models, is an essential part of their research program. The successful candidate will participate in the teaching rotation of graduate-level statistics and methods. Quantitative psychologists from all areas of psychology and related disciplines are encouraged to apply. This is a university-year appointment.

Successful candidates must have a Ph.D. in a relevant discipline (e.g. Bio-statistics, Psychology) by the time the position starts, and a commitment to undergraduate and graduate teaching. New faculty hired at the Assistant Professor level will be expected to establish an independent research program. Please submit a letter of intent, curriculum vitae, a statement of current and future research plans, a statement of teaching philosophy and experience, and evidence of teaching excellence (if any).

Applicants should also request at least three letters of recommendation from referees. All materials should be uploaded by September 30, 2015 as a single PDF attachment to https://psychology-lsa.applicantstack.com/x/apply/a2s9hqlu3cgv. For inquiries about the positions please contact Richard Gonzalez. gonzo@umich.edu.

The University of Michigan is an equal opportunity/affirmative action employer. Qualified women and minority candidates are encouraged to apply. The University is supportive of the needs of dual-career couples.

Prior information, not prior belief

The prior distribution p(theta) in a Bayesian analysis is often presented as a researcher’s beliefs about theta. I prefer to think of p(theta) as an expression of information about theta.

Consider this sort of question that a classically-trained statistician asked me the other day:

If two Bayesians are given the same data, they will come to two conclusions. What do you think about that? Does it bother you?

My response is that the statistician has nothing to do with it. I’d prefer to say that if two different analyses are done using different information, they will come to different conclusions. This different information can come in the prior distribution p(theta), it could come in the data model p(y|theta), it could come in the choice of how to set up the model and what data to include in the first place. I’ve listed these in roughly increasing order of importance.
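To make that concrete, here’s a toy normal-normal example in R (all numbers invented): the same data combined with two different amounts of prior information gives two different posteriors, and nothing here depends on anyone’s inner beliefs.

# toy example: y ~ normal(theta, sigma) with a normal prior on theta
y     <- 2.0   # observed data (made-up)
sigma <- 1.0   # known data standard deviation (made-up)

posterior <- function(prior_mean, prior_sd) {
  # standard conjugate-normal update: precision-weighted average
  post_var  <- 1 / (1 / prior_sd^2 + 1 / sigma^2)
  post_mean <- post_var * (prior_mean / prior_sd^2 + y / sigma^2)
  c(mean = post_mean, sd = sqrt(post_var))
}

posterior(0, 10)    # weak prior information: posterior mean near 2
posterior(0, 0.5)   # strong prior information: posterior mean pulled toward 0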

Sure, we could refer to all statistical models as “beliefs”: we have a belief that certain measurements are statistically independent with a common mean, we have a belief that a response function is additive and linear, we have a belief that our measurements are unbiased, etc. Fine. But I don’t think this adds anything beyond just calling this a “model.” Indeed, referring to “belief” can be misleading. When I fit a regression model, I don’t typically believe in additivity or linearity at all, I’m just fitting a model, using available information and making assumptions, compromising the goal of including all available information because of the practical difficulties of fitting and understanding a huge model.

Same with the prior distribution. When putting together any part of a statistical model, we use some information without wanting to claim that this represents our beliefs about the world.

Awesomest media request of the year

(Sent to all the American Politics faculty at Columbia, including me)

RE: Donald Trump presidential candidacy

Hi,

Firstly, apologies for the group email but I wasn’t sure who would be best prized to answer this query as we’ve not had much luck so far.

I am a Dubai-based reporter for **.
Donald Trump recently announced his intension to run for the US presidency in 2016.
He currently has a lot of high profile commercial and business deals in Dubai and is actively in talks for more in the wider region.

We have been trying to determine:
If a candidate succeeds in winning a nomination and goes on to win the election and reside in the White House do they have to give up their business interests as these would be seen as a conflict of interest? Can a US president serve in office and still have massive commercial business interests abroad?

Basically, would Trump have to relinquish these relationships if he was successfully elected? Are there are existing rules specifically governing this? Is there any previous case studies to go on?

Lastly, what are his chances of winning a nomination or being elected? So far, from what we have read it seems highly unlikely?

Regards,

***
Executive Editor
***

Survey weighting and regression modeling

Yphtach Lelkes points us to a recent article on survey weighting by three economists, Gary Solon, Steven Haider, and Jeffrey Wooldridge, who write:

We start by distinguishing two purposes of estimation: to estimate population descriptive statistics and to estimate causal effects. In the former type of research, weighting is called for when it is needed to make the analysis sample representative of the target population. In the latter type, the weighting issue is more nuanced. We discuss three distinct potential motives for weighting when estimating causal effects: (1) to achieve precise estimates by correcting for heteroskedasticity, (2) to achieve consistent estimates by correcting for endogenous sampling, and (3) to identify average partial effects in the presence of unmodeled heterogeneity of effects.

This is indeed an important and difficult topic and I’m glad to see economists becoming aware of it. I do not quite agree with their focus—in practice, heteroskedasticity never seems like much of a big deal to me, nor do I care much about so-called consistency of estimates—but there are many ways to Rome, and the first step is to move beyond a naive view of weighting as some sort of magic solution.

Solon et al. pretty much only refer to literature within the field of economics, which is too bad because they miss this twenty-year-old paper by Chris Winship and Larry Radbill, “Sampling Weights and Regression Analysis,” from Sociological Methods and Research, which begins:

Most major population surveys used by social scientists are based on complex sampling designs where sampling units have different probabilities of being selected. Although sampling weights must generally be used to derive unbiased estimates of univariate population characteristics, the decision about their use in regression analysis is more complicated. Where sampling weights are solely a function of independent variables included in the model, unweighted OLS estimates are preferred because they are unbiased, consistent, and have smaller standard errors than weighted OLS estimates. Where sampling weights are a function of the dependent variable (and thus of the error term), we recommend first attempting to respecify the model so that they are solely a function of the independent variables. If this can be accomplished, then unweighted OLS is again preferred. . . .
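As a quick illustration of their first point (this is a made-up simulation, not anything from their paper): when the weights are a function only of a predictor that is already in the model, weighted and unweighted OLS recover essentially the same coefficient.

# made-up simulation: sampling weights depend only on x, which is in the model
set.seed(123)
n <- 5000
x <- rbinom(n, 1, 0.5)
y <- 1 + 2 * x + rnorm(n)        # true slope is 2
w <- ifelse(x == 1, 4, 1)        # weights are a function of x alone

coef(lm(y ~ x))                  # unweighted OLS: slope near 2
coef(lm(y ~ x, weights = w))     # weighted fit: slope also near 2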

This topic also has close connections with multilevel regression and poststratification, as discussed in my 2007 article, “Struggles with survey weighting and regression modeling,” which is (somewhat) famous for its opening:

Survey weighting is a mess. It is not always clear how to use weights in estimating anything more complicated than a simple mean or ratios, and standard errors are tricky even with simple weighted means.

See also our response to the discussions.

I was unaware of Winship and Radbill’s work when writing my paper, so I accept blame for insularity as well.

In any case, it’s good to see broader interest in this important unsolved problem.

Don’t do the Wilcoxon


The Wilcoxon test is a nonparametric rank-based test for comparing two groups. It’s a cool idea because, if data are continuous and there is no possibility of a tie, the reference distribution depends only on the sample size. There are no nuisance parameters, and the distribution can be tabulated. From a Bayesian point of view, however, this is no big deal, and I prefer to think of Wilcoxon as a procedure that throws away information (by reducing the data to ranks) to gain robustness.

Fine. But if you’re gonna do that, I’d recommend instead the following approach:

1. As in classical Wilcoxon, replace the data by their ranks: 1, 2, . . . N.

2. Translate these ranks into z-scores using the inverse-normal cdf applied to the values 1/(2*N), 3/(2*N), . . . (2*N – 1)/(2*N).

3. Fit a normal model.
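Here’s a minimal R sketch of those three steps for a two-group comparison (the numbers are made up, just to show the mechanics):

# toy data for two groups (made-up numbers)
y <- c(3.1, 4.5, 2.2, 5.0, 4.1, 6.3, 5.8, 7.2)
group <- c(0, 0, 0, 0, 1, 1, 1, 1)
N <- length(y)

# step 1: replace the data by their ranks 1, ..., N
r <- rank(y)

# step 2: translate the ranks into z-scores using the inverse-normal cdf
# applied to 1/(2N), 3/(2N), ..., (2N - 1)/(2N), i.e., (2r - 1)/(2N)
z <- qnorm((2 * r - 1) / (2 * N))

# step 3: fit a normal model; here a simple regression on the group
# indicator, but it could be any regression or multilevel model
summary(lm(z ~ group))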

In simple examples this should work just about the same as Wilcoxon as it is based on the same general principle, which is to discard the numerical information in the data and just keep the ranks. The advantage of this new approach is that, by using the normal distribution, it allows you to plug in all the standard methods that you’re familiar with: regression, analysis of variance, multilevel models, measurement-error models, and so on.

The trouble with Wilcoxon is that it’s a bit of a dead end: if you want to do anything more complicated than a simple comparison of two groups, you have to come up with new procedures and work out new reference distributions. With the transform-to-normal approach you can do pretty much anything you want.

The question arises: if my simple recommended approach indeed dominates Wilcoxon, how is it that Wilcoxon remains popular? I think much has to do with computation: the inverse-normal transformation is now trivial, but in the old days it would’ve added a lot of work to what, after all, is intended to be rapid and approximate.

Take-home message

I am not saying that the rank-then-inverse-normal-transform strategy is always or even often a good idea. What I’m saying is that, if you were planning to do a rank transformation before analyzing your data, I recommend this z-score approach rather than the classical Wilcoxon method.

On deck this week

Mon: Don’t do the Wilcoxon

Tues: Survey weighting and regression modeling

Wed: Prior information, not prior belief

Thurs: Draw your own graph!

Fri: Measurement is part of design

Sat: Annals of Spam

Sun: “17 Baby Names You Didn’t Know Were Totally Made Up”