Tenure track faculty opening at the Center for the Promotion of Research Involving Innovative Statistical Methodology, with Jennifer Hill, Marc Scott, and other world-class researchers. It looks like a great opportunity.
Mark Palko is irritated by the Times’s refusal to retract a recounting of a hoax regarding Dickens and Dostoevsky. All I can say is, the Times refuses to retract mistakes of fact that are far more current than that! See here for two examples that particularly annoyed me, to the extent that I contacted various people at the Times but ran into refusals to retract.
I guess a daily newspaper publishes so much material that they can’t be expected to run a retraction every time they publish something false, even when such things are brought to their attention.
Speaking of corrections, I wonder if later editions of the Samuelson economics textbook discussed their notorious graph predicting Soviet economic performance. The easiest thing would be just to remove the graph, but I think it would be a better economics lesson to discuss the error!
Similarly, I think the NYT would do well to run an article on their Dickens-Dostoevsky mistake, along with a column by Arthur Brooks on how he messed up with the happiness statistics, and a column by David Brooks on how he messed up with the statistics on Jewish achievement. Instead of one more column on the usual topic, why not something explaining what went wrong? (And, yes, I do sometimes write about my mistakes.)
After reading Rachel and Cathy’s book, I wrote that “Statistics is the least important part of data science . . . I think it would be fair to consider statistics as a subset of data science. . . . it’s not the most important part of data science, or even close.”
But then I received “Data Science for Business,” by Foster Provost and Tom Fawcett, in the mail. I might not have opened the book at all (as I’m hardly in the target audience) but for seeing a blurb by Chris Volinsky, a statistician whom I respect a lot.
So I flipped through the book and it indeed looked pretty good. It moves slowly but that’s appropriate for an intro book. But what surprised me, given the book’s title and our recent discussion on the nature of data science, was that the book was 100% statistics!
A colleague writes:
Personally my Kasparov number is two: I beat ** in a regular tournament game, and ** beat Kasparov!
That’s pretty impressive, especially given that I didn’t know this guy played chess at all! Anyway, this got me thinking, what’s my Kasparov number? OK, that’s easy. I beat Magnus Carlsen the other day when he was passing through town on vacation, Carlsen beat Anand, . . .
OK, just kidding. What is my Kasparov number, though? Note that the definition, unlike that of the Erdos or Bacon numbers, is asymmetric: it has to be that I had a victory over person 1, and person 1 had a victory over person 2, etc., and ultimately person N-1 had a victory over Kasparov. The games don’t have to be in time order, they just all have to be victories. And we’ll further require that the games all be played after childhood and before senility (i.e., it doesn’t count if I happened to play someone who happens to be a cousin of some grandmaster whom he beat when they were both 7 years old). I’d also require that these be official tournament games, but I’ve never played in a tournament so we’ll relax that rule.
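For what it's worth, this definition amounts to a shortest-path problem on a directed graph of who beat whom. Here is a minimal sketch in Python, assuming we had such a record of victories (we don't). The function name `kasparov_number`, the `wins` dictionary, and the player labels are all made up for illustration; `player_x` stands in for the anonymized opponent in my colleague's example.

```python
from collections import deque


def kasparov_number(wins, start, target="Kasparov"):
    """Length of the shortest chain of victories leading from start to target.

    wins maps each player to the players they have beaten (hypothetical data).
    Returns None if no chain of victories exists.
    """
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    # Breadth-first search, following edges only from winner to loser.
    while queue:
        player, steps = queue.popleft()
        for loser in wins.get(player, ()):
            if loser == target:
                return steps + 1
            if loser not in seen:
                seen.add(loser)
                queue.append((loser, steps + 1))
    return None


# Hypothetical record matching my colleague's claim:
# he beat player_x, and player_x beat Kasparov.
wins = {
    "colleague": ["player_x"],
    "player_x": ["Kasparov"],
}

print(kasparov_number(wins, "colleague"))                    # 2
print(kasparov_number(wins, "Kasparov", target="colleague"))  # None
```

The second call shows the asymmetry: the edges run only from winner to loser, so my colleague having a Kasparov number of 2 tells us nothing about a chain of victories running the other way.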
So what is N? How can I do this? I beat my dad, who on occasion beat his dad, who was really good. Grandpa Moses probably beat some serious tournament players from time to time (he played at coffeehouses etc., not tournaments, but I think some competitive players would show up), and one of the people he beat must have had a win against one of the top U.S. players in that era (1920s-1930s I guess is when Grandpa was strongest). To get from one of those guys to Kasparov . . . how many steps would it take? I don’t know, but that one must not be too hard to answer. Just take the oldest guy who Kasparov ever lost to, then the oldest American player that guy lost to, . . . then maybe 2 more steps after that? So then my Kasparov number would be 9 (dad, grandpa, the (hypothetical) tournament player my grandpa beat, the top guy he beat, then the next 2 guys in the chain, then the American player who beat the old guy who beat Kasparov). So that’s my guess.
What else could it be? Phil is better than me but I’ve beaten him on occasion. I think Phil played in a tournament once or twice, and he might have won a game against someone who played in tournaments more frequently, . . .
In any case, I have a feeling that my Kasparov number is much much lower than my Muhammad Ali number. The last time I got into a fight was in 8th grade, but I wouldn’t call it a win, it was more of a draw. If I were to count it though, then I’m pretty sure this guy got into a lot of fights after that, but it would take a long long chain of slugging to get to Michael Spinks or whatever. If my Ali number (or even my Chuck Wepner number) is less than infinity, it must be in the hundreds.
P.S. Phil reports that the best opponent he ever beat in a tournament had a rating of something like 1800, so maybe we could get to Kasparov that way. Someone who was 1800 when he played Phil might ultimately have reached 1900 and beat someone who at some point reached 2100, etc etc etc. I don’t really have a good sense of whether the Phil path or the Grandpa path would get me faster to that elusive Kasparov victory. But I’m pretty sure these are the only options I have.
In response to some big new push for testing schoolchildren, Mark Palko writes:
The announcement of a new curriculum is invariably followed by a hearty round of self-congratulations and talk of “keeping standards high” as if adding a slide to a PowerPoint automatically made students better informed. It doesn’t work that way. Adding a topic to the list simply means that students will be exposed to it, not that they will understand or master or retain it.
Well put. In my own teaching, I often am tempted to believe that just putting a topic in a homework problem is enough to ensure that students will learn it. But it doesn’t work that way. Even if they manage to somehow struggle through and solve the problem (and many don’t, or they rely on their friends’ solutions), they won’t learn much if they don’t see the connection with everything else they know.
I’m reminded of the time, several years ago, that I learned that photocopying an article and filing it was not the same as reading it!
Objects of the class “Foghorn Leghorn”: parodies that are more famous than the original. (“It would be as if everybody were familiar with Duchamp’s Mona-Lisa-with-a-moustache while never having heard of Leonardo’s version.”)
Objects of the class “Whoopi Goldberg”: actors who are undeniably talented but are almost always in bad movies, or at least movies that aren’t worthy of their talent. (The opposite: William Holden.)
Objects of the class “Weekend at Bernie’s”: low-quality movie, nobody’s actually seen it, but everybody knows what it’s about. (Other examples: Heathers and Zelig.)
I love these. We need some more.
The answer is no, as explained in this classic article by Warren Browner and Thomas Newman from 1987. If I were to rewrite this article today, I would frame things slightly differently—referring to Type S and Type M errors rather than speaking of “the probability that the research hypothesis is true”—but overall they make good points, and I like their analogy to medical diagnostic testing.
This came up already but I’m afraid the point got lost in the middle of our long discussion of Rachel and Cathy’s book. So I’ll say it again:
There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .
The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.
To put it another way: you can do tech without statistics but you can’t do it without coding and databases.
This came up because I was at a meeting the other day (more comments on that in a later post) where people were discussing how statistics fits into data science. Statistics is important—don’t get me wrong—statistics helps us correct biases from nonrandom samples (and helps us reduce the bias at the sampling stage), statistics helps us estimate causal effects from observational data (and helps us collect data so that causal inference can be performed more directly), statistics helps us regularize so that we’re not overwhelmed by noise (that’s one of my favorite topics!), statistics helps us fit models, statistics helps us visualize data and models and patterns. Statistics can do all sorts of things. I love statistics! But it’s not the most important part of data science, or even close.