The statistics software signal

Tyler Cowen links to a post by Sean Taylor, who writes the following about users of R:

You are willing to invest in learning something difficult. You do not care about aesthetics, only availability of packages and getting results quickly.

To me, R is easy and SAS is difficult. I once worked with some students who were running SAS, and the output was unreadable! Pages and pages of numbers that made no sense. When it comes to ease or difficulty of use, I think it depends on what you’re used to! And I really don’t understand the bit about aesthetics. One reason I use R is to make pretty graphs. That said, if I’d never learned R, I’d just be making pretty graphs in Fortran or whatever. My guess is that, the way I program, R is actually hindering rather than helping my ability to make attractive graphs. Half the time I’m scrambling around, writing custom code to get around R’s defaults.
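
To give a flavor of what I mean, here is a made-up sketch of that kind of custom code: suppressing R’s default axes, then rebuilding the pieces by hand.

    x <- rnorm(100)
    y <- x + rnorm(100)
    plot(x, y, pch = 20, axes = FALSE, xlab = "", ylab = "")  # turn the defaults off...
    axis(1, lwd = 0.5)              # ...then redraw each axis the way I want it
    axis(2, lwd = 0.5, las = 1)     # las = 1 turns the y tick labels horizontal
    mtext("x", side = 1, line = 2)
    mtext("y", side = 2, line = 2)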

24 thoughts on “The statistics software signal”

  1. I couldn’t agree more. SAS is so difficult for me. When I was a freshman we were taught R, so I use R, simple as that. I’ve taken more classes on SAS than on R, but SAS and Chinese are still the same thing to me. One more reason I use R: it is free. Even my mom’s computer has R installed on it.

  2. When Sean says that R users don’t care about aesthetics, I believe he’s saying they don’t care about the aesthetics of the programming language, not the aesthetics of the output.

    • He associates Mathematica with being an aesthete, so I’m guessing he’s referring to the user interface rather than either the aesthetics of the programming language or the graphs. Anyway, we may be putting more thought into this than Sean did. The Matlab one makes me laugh – “You definitely know what you’re doing and you care about performance”??

  3. Agreed that SAS is the more difficult language. Usually I’ll have an idea of what I want to do and then think, “OK, now how do I do this in SAS’s cookie-cutter, straitjacketing PROCs… and why are for loops called data steps?”

    And agreed that aesthetics referred to code.

  4. R is a remarkably nice language. It has a lot of modern facilities (closures and support for functional programming, generic functions with multimethods (S4), object-oriented constructs, etc.). The problem is that most of these features are easier to understand if you know a language like Common Lisp; otherwise they seem a bit obscure. For example, R even has macros, but to someone with no experience of Lisp-like macros they look like a dark corner of the language.
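
    To give a feel for a couple of those facilities, here is a minimal sketch (the names are made up):

    make_counter <- function() {          # a closure: the returned function
      count <- 0                          # keeps `count` alive between calls
      function() {
        count <<- count + 1               # <<- assigns in the enclosing environment
        count
      }
    }
    tick <- make_counter()
    tick()   # 1
    tick()   # 2

    setGeneric("combine", function(x, y) standardGeneric("combine"))
    setMethod("combine", signature("numeric", "character"),  # an S4 multimethod:
              function(x, y) paste(y, x))                    # dispatch on both arguments
    combine(1, "item")   # "item 1"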

    I would even go as far as recommending R as one’s first programming language: it is interactive, you can get results quickly, and R’s underlying semantics is quite modern.

    • Lisp is brilliant for two reasons: 1) you can get in and prototype things very quickly, and 2) in the long run you can create your own task-specific language and program in that. Unfortunately, Lisp got a bad rap for speed back in the day, and it looks strange. Most of its concepts have leaked into other languages, but things like macros are only shadows of what Lisp offers, and it’s how everything works together that makes Lisp magic.

      If I were to take an existing language and make it into the R of the Future, I’d either start with Lisp (maybe even something like Clojure) or Scala. Unfortunately, the stats-in-Lisp thing has already failed to catch on (Xlisp-stat).

      I don’t believe general-purpose languages like Python will ever be as easy to use as R from the perspective of someone sitting down to do some statistics. (Obviously, someone who does a lot of programming and wants to include statistics into that will get along just fine in Python, et al.)

  5. Taylor says that if you are a Mathematica user then, “You are an aesthete who believes everything Stephen Wolfram says.” An immensely stupid and insulting comment. I have used R (mostly S-Plus), Matlab, and Mathematica for over 15 years. Wolfram Research upgraded Mathematica’s statistics capability starting with version 8, and again with the recently released version 9. Mathematica has a very steep learning curve, actually (in my opinion) steeper than R’s or Matlab’s. However, it’s worth it. Having symbolic computing capabilities is extremely useful. Try it for generating functions.

    A lot of people don’t seem to like Stephen Wolfram, and they don’t hesitate to fling gratuitous insults. Perhaps they are jealous of his success.

  6. I suppose it depends on where you started. I started with BAL, Fortran, PL/I and JCL, so SAS makes perfect sense. To Dave’s point, if I don’t like the current PROCs (just preprogrammed steps in a JCL stream) I code it up in a DATA step or in IML. Since I work in a production environment rather than a boutique or academic environment, I also have the comfort of knowing that code will work years from now, and won’t blow out as frequently because of memory issues or version shifts. Our production stuff needs to be stable, and reproducible over time as well as somewhat scalable.

    If I want “state-of-the-art” for one-offs or to experiment with an algorithm or analysis, I use R. (It was S, but lately it’s R.) R is great for that. Now that SAS is starting to support R as an external call, I’m experimenting with mixing them for production jobs. There is still that stability concern, though.

  7. I would also disagree with those who say R is an ugly language. It (S, actually) was way ahead of its time, and it does a remarkable job to this day. The only real complaints anyone has about it are: 1) it uses ‘<-’ instead of ‘=’, 2) some of its transmogrification magic gives unexpected results, and 3) it’s not as fast as, say, Matlab. Matlab: now that’s an ugly language. Does it even have non-kludged named parameters yet?
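
    The classic instance of that transmogrification magic, for anyone who hasn’t been bitten by it yet:

    x <- factor(c("10", "20", "30"))
    as.numeric(x)                  # 1 2 3 -- the internal level codes, not the labels
    as.numeric(as.character(x))    # 10 20 30 -- what was almost certainly wanted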

    The thing I can't get over with SAS is that it's so obviously written for punch cards. It's like someone took some custom Fortran code, combined it with parts of JCL (Job Control Language, used to control batch jobs back in the punch-card days), and called it SAS. And you can clearly see the seams where 1960s-era scripting ability, graphics, etc., were bolted on.

    • R’s lexical scoping is a pain (what I would give for use strict). The inconsistent behavior of errors and error reporting is also unfortunate. Automatic typing occasionally bites me. Trying to pass by reference is also painful. I haven’t used the non-base OO methods, but I doubt they are as natural as in other languages.
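
      For instance, the usual workaround for the lack of pass by reference is to smuggle your data inside an environment, which is exactly the kind of thing I mean (a small sketch):

      f <- function(x) { x[1] <- 0; x }    # arguments are passed by value,
      v <- c(1, 2)
      f(v)                                 # so this returns a modified copy
      v                                    # and v itself is unchanged: 1 2
      e <- new.env()                       # environments, by contrast, have
      e$v <- c(1, 2)                       # reference semantics
      g <- function(env) env$v[1] <- 0
      g(e)
      e$v                                  # modified "in place": 0 2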

      As a statistical package, the hodge-podge construction means that the interfaces are very inconsistent. Compare that to Stata, where you can virtually guess how the arguments will work.

      • Lexical scoping is widely used (Pascal, C and C-likes, Haskell, etc.). Pass by reference isn’t really compatible with functional languages (S is LISP-like), but I’m a fan of LISP, so it’s natural to me to avoid such things. (Though I can see the benefits for huge data, which is a weakness of R’s.)

        Stata seems very SAS-like, with a command-oriented language core supplemented by a macro language. (I’ve not used it, just looked at guides.) I’m not sure that it’s truly comparable to R (or Matlab, Python, etc).

        • I’ll give you two examples of R’s scoping behaving differently (and IMO badly) from C.

          > for( i in seq(2)) {}
          > i
          [1] 2
          > i <- 'v'
          > tempfunc <- function(x) for (j in x) print(i)
          > tempfunc(c('a', 'v', 't'))
          [1] "v"
          [1] "v"
          [1] "v"
          These are mistakes that one can avoid by being careful, but I’ve seen lots of people make them, and it’s nice when the system offers you an option like perl’s use strict to check for unintended consequences. When a variable is not locally defined, there is no easy way to make R error or warn instead of going looking for that variable elsewhere up the tree. A careful dance of environments and the search path can do it, but it’s a mess.
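
          One version of that dance, for the record: reparent the function so that name lookup cannot reach the global environment.

          f <- function(x) for (j in x) print(i)
          environment(f) <- baseenv()   # free variables now resolve only in base
          f(c('a', 'v', 't'))           # Error: object 'i' not found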

        • At least in the case of `for`, it’s designed to work this way. The documentation says, “When the loop terminates, var remains as a variable containing its latest value.” That may not be the way other languages work, and I would not have expected it to work this way, but this isn’t a scoping issue.

          To be honest, I’d have expected `i` not to be set or modified after the loop, and instead that the `for` expression would have returned 2 (the final value of its loop variable).

          I think your second example originally got mangled by the blog software swallowing your assignment operator as an attempted HTML tag. That’s definitely a pain on many sites when using R.

          Another common stumbling block is the `ifelse` function, which may not return what you’d guess.
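
          For example:

          ifelse(TRUE, c(1, 2, 3), 0)   # returns 1, not c(1, 2, 3): the result
                                        # takes its shape from the test, not the branches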

  8. Till Julia grows up and/or Python becomes nifty, and as long as SAS remains expensive and closed, R remains the leader of the statistical pack, me hearties.

  9. As a commenter elsewhere has said, one of R’s main benefits is that “R is both a stat pack command language and an application programming language, with a single (?) syntax.”

    If you need both features, you’re not going to use point-and-click software like SPSS. SAS has both, but it requires learning a hodgepodge of completely separate syntaxes (DATA steps, PROC steps, IML, ODS…); Python’s packages are still getting up to speed; MATLAB costs $$$… So R is often the least bad option.
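
    A small sketch of what that dual nature looks like in practice (using R's built-in mtcars data):

    fit <- lm(mpg ~ wt, data = mtcars)               # "stat pack" mode: one canned command
    summary(fit)
    rmse <- function(m) sqrt(mean(residuals(m)^2))   # "programming language" mode,
    rmse(fit)                                        # in the same syntax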

  10. I find Sean Taylor’s comments on SAS odd. I use it for political science work. I’ve never used a macro, only syntax.

  11. Pingback: Software is as software does « Statistical Modeling, Causal Inference, and Social Science

  12. Mr. Taylor claims that software choice is “correlated with bad quality science” but does not provide data or an analysis (in any software). Mr. Taylor’s remarks are insulting to many researchers who choose commercial software. Given his age and experience, I assume he hasn’t met many high-quality researchers who use SAS and SPSS. His remarks are naive. You cannot judge the quality of research by a software tool any more than you can judge quality by someone’s gender, skin color, or the girth of their belt. Research is done by researchers; software facilitates their analysis, but good and bad research is independent of the software choice.

    As to his comments regarding SPSS researchers and their “fear of writing code,” I suggest that he look up the CV of Leland Wilkinson, Fellow of the ASA and the AAAS and originator of the grammar-of-graphics ideas that were implemented in ggplot. I also invite him and others to read some of the SAS-oriented blogs (such as my own) that have examples of how to “hand-code statistical methods” in the event that the built-in analyses are not sufficient.

  13. Again, many of these arguments are about specific examples in the never-ending language wars.
    For context, try “Languages, Levels, Libraries, and Longevity” in ACM Queue.

    S moved to UNIX on a VAX-11/780 around 1979: a 5MHz machine with (eventually) up to 8MB of memory. Today’s smartphones are typically 1GHz with rather more memory than that, and even laptops are 2-3GHz with multiple cores. While one cannot just compare clock rates, we are talking about something like 1000X faster performance in a laptop, and more throughput.

    FORTRAN & COBOL are ugly, but still around.
    Humans still have appendices.

    Everybody could do better if allowed to start from scratch, but software doesn’t work that way… :-)
    [Except for a famous example in The Mythical Man-Month, where they lost the code and had to start over.]
